The Chomsky Hierarchy and Automata in Computer Science

 


This article placed third in the inaugural Fuller Research Prize competition 2021

HAMISH STARLING

Even the least technical among us are familiar with programming languages in a loose sense: purposefully invented syntaxes constructed from keywords, symbols and identifiers used to tell a computer what to do. These confections power our modern world. From the operating system on which you are reading this article to the aeroplane which just passed overhead, most things are now controlled by code. So to fully comprehend the scope, characteristics and limitations of computers, it was realised in the 1950s that understanding the computational structures behind language was critical. In this piece I’ll discuss the Chomsky Hierarchy, a mathematical classification of languages into 4 types - regular, context-free, context-sensitive and recursively enumerable - explaining what each means. We’ll also discuss why this concept is relevant in the real world and how it links to “Automata”.

Languages, Grammars and Production Rules

But first: what do we mean by a “language” and a “grammar”? Noam Chomsky, the American linguist and philosopher, created the formal definition that a language is a set of text strings made up of characters from some alphabet. A grammar is a list of rules (“production rules”) defining how to build all of the strings in the language from the available characters. For example if the language L contains {“cat”,”cot”}, we have the symbols {c,a,t,o,w}. We could say that the language L is defined as:

  • vowel: {a,o}
  • L: ‘c’+vowel+’t’
With two production rules we define the vowel as a set containing a and o and then we say that a word is in the language L if it is the symbol c then a vowel then the symbol t. Note that the symbols could have been words and the language a list of sentences instead of letters and words. The critical thing to understand about Chomsky's "languages" is that sentences don't need to make sense in terms of their meaning as long as they are “syntactically” correct.

Regular Languages

With that out of the way, let’s classify some languages! Chomsky’s most restrictive, and least expressive, language is a “regular language”, generated by a type 3 (regular) grammar. A regular grammar is one in which production rules are defined in terms of only one symbol which is a member of the basic alphabet (these are called terminal symbols) and one which is not (“nonterminal”/derived symbols), with all nonterminals appearing on the same end of the rule. So for example the language L:

  • consonant: {c,v,d}
  • L : {consonant,’a’+L,’e’+L}
This defines all strings which have c,v or d at the end with any combination of a and e before, for example aeac or aav. In this grammar c,v,d,a,e are “terminal symbols” (not defined in terms of any others).  Consonant and L are non-terminal symbols as they are defined in terms of other symbols. This language uses recursion (self-referencing) to build long strings because a word in L could be any word already in L with an a or e on the front. Why is this grammar regular? Notice that all of the rules to generate a word in L contain at most one occurrence of a nonterminal symbol (L or consonant) and the nonterminal symbols only appear at the end of the rules. 

So what use do regular grammars have? The most well-known example is “Regular Expressions” (regex) - an example of a regular grammar supported by most “Find/Replace” tools. You specify the format of a piece of text you want to locate in your file instead of a specific text string. Say you wanted to find all mobile numbers in a document. These have a 0 and then ten more digits. You can’t just search for the number itself as you don’t know the number you are looking for and you want to match all phone numbers present. A regex for this would be: 0[0-9]{10} i.e 0 then the digits 0-9 ten times. The regex processor understands this grammar and can find all matching strings for you. 

Context-Free Grammars

Enclosing Regular Languages we have Context-Free Grammars (CFGs).  All Regular Grammars are CFGs but not the other way round. A CFG is one in which all production rules are defined as any sequence of terminals and nonterminals. Now we can have multiple of each in a sequence and they can be any position. A useful example of this is in bracket matching, for example the CFG:

  •  L:{LL,(L),[L],(),[]} 
This describes all strings of parentheses containing only matched brackets. We can use this to check mathematical expressions or computer programs are right.

Another real-world example of the importance of the Chomsky Hierarchy involves the distinction between CFGs and regular languages. The regular expressions we talked about in the previous paragraph can’t be used to match arbitrary HTML (webpage) code because HTML is a context-free language whereas regular expressions are a regular grammar. They are provably not “powerful” enough to express HTML patterns.

Context-Sensitive Grammars

After this, we have context-sensitive grammars (CSGs) in which whether or not a string is valid depends on what surrounds it. We can start to write definitions in which the left hand side contains terminal characters. Example: 

  • ‘c’+vowel+’t’: cat
  • ‘h’+vowel+’t’: hot
  • consonant: {c,t,h}
  • L: {consonant+vowel+consonant}
In this language 'cat' and 'hot' are valid but cot and hat are not because whether the non-terminal vowel can be a or o depends on what surrounds it. In a CFG of consonant+vowel+consonant we couldn’t specify this as vowel and consonant would be defined once and be the same forever.

Real life programming languages are context-sensitive because they have “semantics”, meaning certain things which are syntactically valid don’t make sense. For example we can’t do a=5;b=”hello”;a+b because b is not a number. We could do a=5;b=4;a+b. The line a+b is the same in both programs, but in one it has an error and in the other it doesn't: context matters.

Recursively enumerable languages

The last type is ‘unrestricted grammars’ defining “recursively enumerable languages”. These can have any production rules. They can represent natural (human) language more easily than the other types but they are much harder to process and understand. Natural language processing (NLP) and linguistics deal with these. However there are programming languages such as “Thue” using these.

Automata

Finally, it’s insightful to understand how automata (theoretical mechanical computers following instructions) relate to languages. Each level of the Chomsky Hierarchy corresponds to a single type of automaton (from Finite State Automata capable of recognising regular language strings to full-on Turing machines which can implement unrestricted grammars), which can verify whether or not a given string is a member of a given language. These have their own characteristics and work in their own ways. For example a FSA has a set of defined states it can be in and follows rules to move between these. Automata theory and formal language theory were thought to be unrelated fields of computing and linguistics until Chomsky himself brought the two together in 1956, by realising that the description of a type of automaton and of a type of language are related. 

Conclusion

In summary, Noam Chomsky’s 1956 seminal paper in computational linguistics categorises grammars - sets of production rules to generate languages. The categories range from the most restrictive (regular) and easy to parse to the most free (unrestricted). We’ve seen both how each of these categories is relevant to real life questions and how Chomsky’s genius indelibly linked Computer Science and Linguistics forever.  


Sources: [All accessed 20th May 2021]

[1] https://en.wikipedia.org/wiki/Chomsky_hierarchy

[2] https://en.wikipedia.org/wiki/Noam_Chomsky

[3] https://www.reddit.com/r/compsci/comments/aqdlpy/what_are_some_real_applications_of_the_chomsky/

[4] https://en.wikipedia.org/wiki/Context-free_grammar

[5] https://en.wikipedia.org/wiki/Regular_language

[6] https://en.wikipedia.org/wiki/Regular_grammar

[7] https://en.wikipedia.org/wiki/Thue_(programming_language) 

[8] https://en.wikipedia.org/wiki/Unrestricted_grammar

[9] https://en.wikipedia.org/wiki/Context-sensitive_grammar

[10] https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

[11] https://en.wikipedia.org/wiki/Chomsky_hierarchy

[12] https://www.coursera.org/lecture/c-plus-plus-b/4-10-c-11-standard-4mLyd 

[13] You can visit https://regexr.com/ to learn about and test regular expressions.