January 6, 2009

* Administrivia
* Introduction and a bit of history
* Regular languages
  - Definition and Examples
  - Equivalence of DFA and NFA
  - Closure properties
  - Pumping Lemma
  - Myhill-Nerode Theorem

===================

REGULAR LANGUAGES

Finite-state automaton: 5-tuple (Q, Sigma, d, q_0, F)
  Q = set of states
  Sigma = alphabet
  d = transition function : Q x Sigma -> Q
  q_0 = initial state
  F = set of accepting states

An FSA M accepts a string w if its computation on w ends in an accepting state. The set of all strings accepted by M is the language accepted (or recognized) by M. A language is regular if it is recognized by some finite-state automaton.

Examples of regular languages:
-- Set of strings that contain a fixed finite pattern, e.g., contain 00110 as a substring
-- Set of strings with three consecutive 0s
-- Set of all strings such that the 10th symbol from the right end is 1
-- {a^i b^j c^k : i, j, k >= 0}

Applications: various control systems (automated voice systems, elevators, automated doorways, thermostats, dishwashers), lexical analyzers, text editors.

NONDETERMINISM

Nondeterministic finite-state automata (NFAs) are identical to the FSAs above (deterministic FSAs, or DFAs) except that d is a relation (as opposed to a function) and that transitions can be taken on the empty string eps:

  d: Q x (Sigma U {eps}) -> P(Q)

Theorem: NFAs are equivalent to DFAs; the languages they accept are precisely the regular languages.

Proof: Since every DFA is an NFA, the languages accepted by NFAs form a superset of those accepted by DFAs. We now show that given an NFA N, there exists a DFA D that accepts L(N). The DFA essentially simulates the NFA, keeping track of all possible states that the run could be in.

Let the NFA be N = (Q, Sigma, d, q_0, F). The DFA is D = (Q', Sigma, d', q'_0, F'). Q' equals 2^Q, that is, the set of all subsets of Q.
  d'(S, a) = {q: there exists p in S and a path p ---> q in N consisting of an a-transition followed by a sequence of eps-transitions}
  q'_0 = {q: there is a path q_0 ---> q in N on eps}
  F' = {S in Q': S contains a state of F}

We now show that D and N accept the same language. Consider a string w = a_1...a_n accepted by N, and suppose we have the accepting path

  q_0 ---> q_1 ---> q_2 .... ---> q_{n+1}

where q_0 ---> q_1 is a (possibly empty) sequence of eps-transitions, q_i ---> q_{i+1} for 1 <= i <= n is a path consisting of an a_i-transition followed by a sequence of eps-transitions, and q_{n+1} is in F. We obtain that q_1 is in q'_0, and q_2 is in d'(q'_0, a_1) = q'_1. In general, q_{i+1} is in d'(q'_{i-1}, a_i) = q'_i. Finally, q_{n+1} is in q'_n, which is therefore an accepting state of D. Thus w is accepted by D.

Next we show that if string w takes DFA D to a state q', then for every q in q', w takes NFA N to q; that is, there is a path q_0 ---> q in N labeled w. The proof is by induction on the length of w. For the base case, w is eps. In this case, q' = q'_0, which is precisely the set of states reachable from q_0 in N on eps. For the induction step, suppose the claim is true for w, and let w take D to q'. We now argue for wa. By construction, every q in d'(q', a) is reached from some p in q' by an a-transition followed by eps-transitions. By the induction hypothesis there is a path q_0 ---> p labeled w, and extending it gives a path q_0 ---> q labeled wa. In particular, if D accepts w, then w takes D to a set containing a state of F, so N accepts w.

End of Proof

CLOSURE PROPERTIES

Theorem: Regular languages are closed under union, concatenation, and star operations.

Proof: Let L_1 and L_2 be two regular languages. Consider a DFA D_1 for L_1 and a DFA D_2 for L_2. We make a new NFA that accepts L_1 U L_2. It has a new start state S_0. The remaining states are the states of D_1 and those of D_2. The accepting states are the accepting states of D_1 and D_2. The start state S_0 has eps-transitions to the start states of D_1 and D_2.

Similar constructions work for concatenation and star. For concatenation, add eps-transitions from the accepting states of D_1 to the start state of D_2 and keep only the accepting states of D_2. For star, add a new accepting start state with an eps-transition to the old start state, and eps-transitions from the accepting states back to the old start state.

End of Proof

Regular expressions -- start from an alphabet, then generate sets of strings using concatenation, union, and star. They describe exactly the regular languages.

NONREGULAR LANGUAGES:

It is quite clear that many languages are not regular. Consider the set of all palindromes.
Since these can be arbitrarily long, it is difficult to keep track of them using a finite-state automaton, which has only a fixed, finite number of states.

Pumping lemma: If A is a regular language, there is a number p such that if s in A is of length >= p, then s can be written as s = xyz, satisfying the following:
  1. for each i >= 0, x y^i z is in A,
  2. |y| > 0, and
  3. |xy| <= p.

Examples of nonregular languages:
-- {ww : w in {0,1}^*}
-- Palindromes
-- {0^n 1^n : n >= 0}
-- {a^i b^j c^k : i = j or i = k}

Not all nonregular languages can be proved nonregular directly via the pumping lemma. Consider:

  {a^i b^j c^k : i, j, k >= 0 and if i = 1 then j = k}

It satisfies the conclusion of the pumping lemma, yet it is not regular.

MYHILL-NERODE THEOREM:

Let L be a language over alphabet Sigma. Say that x and y are indistinguishable by L if for every z in Sigma^*, either both xz and yz are in L or both xz and yz are not in L. Consider the relation x =L y if x and y are indistinguishable by L. Clearly =L is an equivalence relation.

The index of an equivalence relation is the number of equivalence classes. The index of =L is thus the maximum number of strings that are pairwise distinguishable by L.

Myhill-Nerode Theorem: L is regular iff the index of =L is finite.

Proof: Suppose L is regular, and consider a DFA M for L with k states. Write d(q_0, x) for the state that M reaches from q_0 after reading the string x. Suppose, for contradiction, that the index of =L is greater than k. Then there exist k+1 strings that are pairwise distinguishable by L. By the pigeonhole principle, two of them, say x and y, satisfy d(q_0, x) = d(q_0, y). This implies that d(q_0, xz) = d(q_0, yz) for every z in Sigma^*, implying that x and y are indistinguishable by L, a contradiction. Hence the index of =L is at most k.

Conversely, suppose the index of =L is finite. We build a DFA for L. Let the set of states be the set of equivalence classes of =L. For string x, let [x] denote its equivalence class. Define

  d([x], a) = [xa]

We argue that the above definition is consistent. That is, if [x] = [y], then [xa] = [ya]. To show the latter, we need to show that for all z, either xaz and yaz are both in L or both not in L. This is true since x =L y (apply the definition of indistinguishability to the suffix az).

Let q_0 = [eps] and F = {[x] : x is in L}. F is well defined, since taking z = eps shows that two indistinguishable strings are either both in L or both not in L. By induction on the length of w, this DFA reaches state [w] on input w, so it accepts w iff w is in L.
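To illustrate how the theorem is used to show nonregularity, here is a minimal Python sketch that searches for a suffix z distinguishing two strings with respect to {0^n 1^n : n >= 0}. The names in_lang and distinguishing_suffix, and the bound max_len, are assumptions made for this illustration and are not part of the notes. The search is bounded, so a result of None only means that no short distinguishing suffix exists; it does not prove x =L y.

```python
from itertools import product

def in_lang(s):
    """Membership test for {0^n 1^n : n >= 0}."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * n + "1" * n

def distinguishing_suffix(x, y, member, alphabet="01", max_len=6):
    """Return a shortest z with |z| <= max_len such that exactly one of
    xz, yz is in the language, or None if no such z is found."""
    for k in range(max_len + 1):
        for letters in product(alphabet, repeat=k):
            z = "".join(letters)
            if member(x + z) != member(y + z):
                return z
    return None

# The prefixes 0, 00, 000, 0000 turn out to be pairwise distinguishable.
prefixes = ["0" * i for i in range(1, 5)]
witnesses = {
    (x, y): distinguishing_suffix(x, y, in_lang)
    for i, x in enumerate(prefixes)
    for y in prefixes[i + 1:]
}
```

For 0^i and 0^j with i < j, the suffix found is 1^i, since 0^i 1^i is in the language and 0^j 1^i is not. The same argument works for every pair i != j, so the strings 0, 00, 000, ... are pairwise distinguishable, the index of =L is infinite, and the language is not regular.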
End of Proof

History:

Finite-state automata introduced to study sequential switching circuits -- Huffman (1954), Mealy (1955), and Moore (1956). Also defined earlier in the context of neural nets -- McCulloch and Pitts (1943).

NFA and subset construction (equivalence with DFA) due to Rabin-Scott (1959).

Regular expressions and equivalence to finite-state automata due to Kleene (1956).

Myhill-Nerode theorem in 1957-58.

Usage in PL and OS introduced by Thompson in 1968.
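The subset construction from the NFA/DFA equivalence proof can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the NFA transition relation d is represented as a dict from (state, symbol) to a set of states, the empty string stands for eps, and the names eps_closure, subset_construction, and dfa_accepts are hypothetical. Only the subsets reachable from q'_0 are built, rather than all of 2^Q. The example NFA is a scaled-down version of the "10th symbol from the right end is 1" language, using the 3rd symbol instead.

```python
EPS = ""  # assumption of this sketch: the empty string labels eps-transitions

def eps_closure(states, delta):
    """All states reachable from `states` using only eps-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        p = stack.pop()
        for q in delta.get((p, EPS), ()):
            if q not in closure:
                closure.add(q)
                stack.append(q)
    return frozenset(closure)

def subset_construction(sigma, delta, q0, accepting):
    """Build the reachable part of the DFA D from the NFA N."""
    start = eps_closure({q0}, delta)          # q'_0
    dfa_delta, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for a in sigma:
            moved = set()
            for p in S:                       # a-transition ...
                moved |= set(delta.get((p, a), ()))
            T = eps_closure(moved, delta)     # ... followed by eps-transitions
            dfa_delta[(S, a)] = T             # d'(S, a) = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    # F': the subsets that contain at least one accepting state of N
    dfa_accepting = {S for S in seen if S & set(accepting)}
    return seen, dfa_delta, start, dfa_accepting

def dfa_accepts(dfa_delta, start, dfa_accepting, w):
    """Run the DFA on w and report whether it ends in an accepting state."""
    S = start
    for a in w:
        S = dfa_delta[(S, a)]
    return S in dfa_accepting

# NFA for: the 3rd symbol from the right end is 1 (states 0..3, accepting {3}).
SIGMA = "01"
DELTA = {
    (0, "0"): {0}, (0, "1"): {0, 1},
    (1, "0"): {2}, (1, "1"): {2},
    (2, "0"): {3}, (2, "1"): {3},
}
states, dfa_delta, start, dfa_accepting = subset_construction(SIGMA, DELTA, 0, {3})
```

On this 4-state NFA the reachable DFA has 8 states, one for each possibility of the last three symbols read. The same pattern applied to the 10th symbol from the right gives an 11-state NFA and a DFA with 2^10 reachable states.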