January 9, 2009

* Context-free grammars
* Chomsky Normal Form
* Push-Down Automata
* Pumping Lemma for CFLs

CONTEXT-FREE LANGUAGES

Context-free grammars (CFGs) are a more powerful method for describing languages than regular grammars: every regular language is context-free, but not conversely. CFGs naturally capture fragments of natural languages and of programming languages.

A CFG is a 4-tuple (V, Sigma, R, S), where
1. V is a finite set of variables
2. Sigma is a finite set of terminals, disjoint from V
3. R is a finite set of rules of the form v -> w, where v is a variable and w is a string of variables and terminals
4. S in V is the start symbol

There are several forms for writing down context-free grammars. One canonical form is the Chomsky Normal Form.

CHOMSKY NORMAL FORM

A CFG is in Chomsky Normal Form (CNF) if every rule is of the form
    A -> BC
    A -> a
where a is any terminal and A, B, and C are any variables, except that B and C cannot be the start symbol S. In addition, the grammar may have the rule
    S -> eps

To convert an arbitrary CFG to CNF:
(a) add a new start symbol;
(b) eliminate all rules of the form A -> eps, generating new rules corresponding to those where A occurs in the right-hand side (one copy of the rule for each way of deleting occurrences of A);
(c) eliminate all rules of the form A -> B, adding the rule A -> u for every rule B -> u;
(d) convert the remaining rules into the proper form, replacing terminals in long right-hand sides by new variables and splitting right-hand sides longer than two into chains of two-variable rules.

PUSHDOWN AUTOMATA (PDA)

Regular languages are accepted by finite state automata. How about CFLs? They are accepted by pushdown automata, which are equivalent in power to CFGs.

A pushdown automaton is a nondeterministic finite state automaton plus a stack, which it can read and write in LIFO order.

Examples of CFLs:
-- Palindromes over {a,b}
-- {a^ib^jc^k | i,j,k >= 0 and (i = j or i = k)}

A CFG for Palindromes.
    S -> eps
    S -> a
    S -> b
    S -> aSa
    S -> bSb

CNF for Palindromes
Here P generates the nonempty palindromes and stands in for S on right-hand sides, since the start symbol may not appear there.
    S -> eps
    S -> a
    S -> b
    S -> AA
    S -> BB
    S -> AX
    S -> BY
    P -> a
    P -> b
    P -> AA
    P -> BB
    P -> AX
    P -> BY
    A -> a
    B -> b
    X -> PA
    Y -> PB

PDA for Palindromes
Guess whether the string has odd or even length, and guess the mid-point of the string. Push what is read in the first half onto the stack; if the length is odd, skip the middle symbol. Then compare each symbol read in the second half with the symbol popped from the stack. Accept if every comparison succeeds and the stack is empty exactly when the input ends.
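
The guessing strategy of the palindrome PDA can be made concrete by a program that tries every guess in turn. The following is a minimal Python sketch, not a formal PDA: the function name pda_accepts_palindrome and the use of a Python list as the stack are assumptions made for illustration.

```python
def pda_accepts_palindrome(s: str) -> bool:
    """Simulate the guess-the-midpoint PDA by trying every possible guess."""
    n = len(s)
    for mid in range(n + 1):          # guess: number of symbols pushed (first half)
        for skip in (0, 1):           # guess: 0 = even length, 1 = odd (skip middle symbol)
            if mid + skip > n:
                continue
            stack = list(s[:mid])     # push phase: first half goes onto the stack
            pos = mid + skip          # start of the second half
            ok = True
            while pos < n:            # pop phase: compare second half with the stack
                if not stack or stack.pop() != s[pos]:
                    ok = False
                    break
                pos += 1
            if ok and not stack:      # input consumed and stack empty: accept
                return True
    return False


if __name__ == "__main__":
    for w in ["", "a", "ab", "aba", "abba", "abab"]:
        print(repr(w), pda_accepts_palindrome(w))
```

A real PDA makes a single nondeterministic guess; the two nested loops here replace that guess with an exhaustive search, which is why the sketch is deterministic but takes quadratic time.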

EQUIVALENCE OF PDAs AND CFGs

Lemma: For every CFG G, there is a PDA that accepts L(G).

Proof: Consider any CFG G. G specifies how to generate the strings of L(G) via applications of its rules. We will do the same with the PDA, generating the string on the stack. A priori, we do not know which rules are used to generate a given string. Here is where nondeterminism helps, and is essential: the PDA guesses which rule to apply.

The PDA starts by putting the start symbol S on the stack. Then the first rule is applied: the RHS of the rule is pushed onto the stack, with its leftmost symbol on top. All terminals on the stack above the first variable are compared with the next symbols of the input and popped. If any comparison fails, we terminate in a nonaccepting state. Otherwise, either a variable appears at the top of the stack, or the stack is empty. If the stack is empty, we terminate, accepting exactly when the whole input has been consumed. If a variable is on top, we replace it by the RHS of a rule chosen nondeterministically, and we repeat.

It is easy to see that G and the PDA have the same language.
End of Proof

Lemma: For every PDA M, there is a CFG G such that L(G) = L(M).

Proof: The PDA M is specified by its states and transitions. The grammar G is specified by its rules. We will have the rules emulate the states and transitions. One challenge is that the PDA can both push to and pop from the stack, while rules in a CFG can only add symbols to the string that is being generated.

For each pair of states p and q, we have a variable A_{pq} that generates all strings that take the PDA from state p with empty stack to state q with empty stack.

Suppose that, starting in state p with empty stack, the PDA consumes a, moves to state r, and pushes b onto the stack, and that there is a transition of the PDA that moves from state s to state q while popping b from the stack and consuming c (a and c may each be eps). Then we add the following rule.
    A_{pq} -> aA_{rs}c

If the last symbol popped in the sequence of transitions from p to q was not the b pushed by the first transition, then the stack must have become empty at an earlier point, when that b was popped. Let the state at this point be r; then we have
    A_{pq} -> A_{pr}A_{rq}
In fact, this rule can always be added, for every triple of states p, r, and q, and it captures all scenarios of the second kind discussed above. Finally, for each state p we add the rule A_{pp} -> eps, since the empty string takes the PDA from p to p with empty stack.

We can stipulate wlog that the given PDA has a single accepting state q_f, always ends with an empty stack, and in each transition either pushes one symbol or pops one symbol. Then the start symbol is A_{q_0q_f}.
End of Proof

PUMPING LEMMA FOR CFLs

If A is a context-free language, then there is a number p such that every s in A of length at least p can be written as s = uvxyz satisfying the conditions
1. for each i >= 0, uv^ixy^iz is in A,
2. |vy| > 0, and
3. |vxy| <= p.

Proof: This is easy to see from the parse tree of a derivation. Consider the parse tree of a string w generated by a CFG. We may assume that the CFG is in Chomsky Normal Form, in which case each internal node has at most 2 children. If the grammar has |V| variables and w has length at least p = 2^{|V|}, the tree is deep enough that some path from the root to a leaf contains more than |V| variables, so by the pigeonhole principle some variable appears twice on that path.

Write w = uvxyz, where x is the string generated below the lower occurrence and vxy is the string generated below the upper occurrence. If we replace the subtree rooted at the lower occurrence by a copy of the subtree rooted at the upper occurrence, we generate the new string uvvxyyz; repeating the replacement gives uv^ixy^iz for every i >= 2. We could also replace the subtree at the upper occurrence by the subtree at the lower occurrence to get the string uxz. Choosing the repeated variable among the lowest |V|+1 variables on the path gives |vxy| <= p, and since a CNF grammar has no eps rules other than S -> eps, v and y cannot both be empty.
End of Proof

Examples of non-CFLs:
-- {a^ib^jc^k | 0 <= i <= j <= k}
-- {ww : w in {0,1}^*}

Deterministic PDAs (DPDAs) do not have the same power as nondeterministic PDAs. They are important from a programming languages point of view, since many parsers implement deterministic PDAs. The languages accepted by DPDAs lie strictly between the regular languages and the CFLs.
-- {wcw^R} is accepted by a DPDA but is not regular
-- {ww^R} is a CFL but is not accepted by any DPDA

History: CFGs were proposed for natural languages by Chomsky (1956). For programming languages, they were used by Backus (1959) for Fortran and by Naur et al. (1960) for Algol.
CFGs are essential for the implementation of compilers, and also for describing the structure of documents (e.g., DTDs in XML). PDAs were defined by Oettinger (1961) and Schutzenberger (1963). The equivalence with CFGs is due to Chomsky (1961) and Evey (1963). LR(k) grammars, defined by Knuth (1965), are equivalent to DPDAs and form the basis for YACC.
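
Returning to the first lemma of the equivalence section, the stack-based simulation of a grammar by a PDA can be sketched in code. The following Python sketch replaces the nondeterministic choice of rule by a search over configurations (input position, stack contents). Everything named here is an assumption made for illustration: the function cfg_pda_accepts, the dictionary encoding of rules, the restriction to single-character symbols, and the cap on stack length that keeps the search finite. The cap is adequate for the palindrome grammar but could reject valid strings for grammars whose derivations need taller stacks.

```python
from typing import Dict, List, Set, Tuple

# Palindrome grammar over {a, b}; the empty string "" stands for eps.
PALINDROME_RULES: Dict[str, List[str]] = {
    "S": ["", "a", "b", "aSa", "bSb"],
}


def cfg_pda_accepts(rules: Dict[str, List[str]], start: str, s: str) -> bool:
    """Search all runs of the PDA that expands variables on its stack."""
    variables = set(rules)
    limit = 2 * len(s) + 2            # assumed stack cap so the search terminates
    # A configuration is (input position, stack); the stack top is stack[0].
    todo: List[Tuple[int, str]] = [(0, start)]
    seen: Set[Tuple[int, str]] = set(todo)
    while todo:
        pos, stack = todo.pop()
        if not stack:                 # empty stack: accept iff input is consumed
            if pos == len(s):
                return True
            continue
        top, rest = stack[0], stack[1:]
        if top in variables:          # replace the variable by some right-hand side
            successors = [(pos, rhs + rest) for rhs in rules[top]]
        elif pos < len(s) and s[pos] == top:
            successors = [(pos + 1, rest)]   # terminal matches the input: pop it
        else:
            successors = []           # mismatch: this run dies
        for cfg in successors:
            terminals = sum(1 for ch in cfg[1] if ch not in variables)
            # Prune runs that can no longer match the remaining input.
            if terminals > len(s) - cfg[0] or len(cfg[1]) > limit:
                continue
            if cfg not in seen:
                seen.add(cfg)
                todo.append(cfg)
    return False


if __name__ == "__main__":
    for w in ["", "abba", "aba", "ab"]:
        print(repr(w), cfg_pda_accepts(PALINDROME_RULES, "S", w))
```

The set of visited configurations plays no role in the formal construction; it is only there so that the deterministic search does not revisit the same configuration.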