This article placed third in the inaugural Fuller Research Prize competition 2021
HAMISH STARLING
Even the least technical among us are familiar with programming languages in a loose sense: purposefully invented syntaxes constructed from keywords, symbols and identifiers used to tell a computer what to do. These confections power our modern world. From the operating system on which you are reading this article to the aeroplane which just passed overhead, most things are now controlled by code. So to fully comprehend the scope, characteristics and limitations of computers, it was realised in the 1950s that understanding the computational structures behind language was critical. In this piece I’ll discuss the Chomsky Hierarchy, a mathematical classification of languages into 4 types - regular, context-free, context-sensitive and recursively enumerable - explaining what each means. We’ll also discuss why this concept is relevant in the real world and how it links to “Automata”.
Languages, Grammars and Production Rules
But first: what do we mean by a “language” and a “grammar”? Noam Chomsky, the American linguist and philosopher, created the formal definition that a language is a set of text strings made up of characters from some alphabet. A grammar is a list of rules (“production rules”) defining how to build all of the strings in the language from the available characters. For example if the language L contains {“cat”,”cot”}, we have the symbols {c,a,t,o,w}. We could say that the language L is defined as:
- vowel: {a,o}
- L: ‘c’+vowel+’t’
Regular Languages
With that out of the way, let’s classify some languages! Chomsky’s most restrictive, and least expressive, language is a “regular language”, generated by a type 3 (regular) grammar. A regular grammar is one in which production rules are defined in terms of only one symbol which is a member of the basic alphabet (these are called terminal symbols) and one which is not (“nonterminal”/derived symbols), with all nonterminals appearing on the same end of the rule. So for example the language L:
- consonant: {c,v,d}
- L : {consonant,’a’+L,’e’+L}
So what use do regular grammars have? The most well-known example is “Regular Expressions” (regex) - an example of a regular grammar supported by most “Find/Replace” tools. You specify the format of a piece of text you want to locate in your file instead of a specific text string. Say you wanted to find all mobile numbers in a document. These have a 0 and then ten more digits. You can’t just search for the number itself as you don’t know the number you are looking for and you want to match all phone numbers present. A regex for this would be: 0[0-9]{10} i.e 0 then the digits 0-9 ten times. The regex processor understands this grammar and can find all matching strings for you.
Context-Free Grammars
Enclosing Regular Languages we have Context-Free Grammars (CFGs). All Regular Grammars are CFGs but not the other way round. A CFG is one in which all production rules are defined as any sequence of terminals and nonterminals. Now we can have multiple of each in a sequence and they can be any position. A useful example of this is in bracket matching, for example the CFG:
- L:{LL,(L),[L],(),[]}
Another real-world example of the importance of the Chomsky Hierarchy involves the distinction between CFGs and regular languages. The regular expressions we talked about in the previous paragraph can’t be used to match arbitrary HTML (webpage) code because HTML is a context-free language whereas regular expressions are a regular grammar. They are provably not “powerful” enough to express HTML patterns.
Context-Sensitive Grammars
After this, we have context-sensitive grammars (CSGs) in which whether or not a string is valid depends on what surrounds it. We can start to write definitions in which the left hand side contains terminal characters. Example:
- ‘c’+vowel+’t’: cat
- ‘h’+vowel+’t’: hot
- consonant: {c,t,h}
- L: {consonant+vowel+consonant}
Real life programming languages are context-sensitive because they have “semantics”, meaning certain things which are syntactically valid don’t make sense. For example we can’t do a=5;b=”hello”;a+b because b is not a number. We could do a=5;b=4;a+b. The line a+b is the same in both programs, but in one it has an error and in the other it doesn't: context matters.
Recursively enumerable languages
The last type is ‘unrestricted grammars’ defining “recursively enumerable languages”. These can have any production rules. They can represent natural (human) language more easily than the other types but they are much harder to process and understand. Natural language processing (NLP) and linguistics deal with these. However there are programming languages such as “Thue” using these.
Automata
Finally, it’s insightful to understand how automata (theoretical mechanical computers following instructions) relate to languages. Each level of the Chomsky Hierarchy corresponds to a single type of automaton (from Finite State Automata capable of recognising regular language strings to full-on Turing machines which can implement unrestricted grammars), which can verify whether or not a given string is a member of a given language. These have their own characteristics and work in their own ways. For example a FSA has a set of defined states it can be in and follows rules to move between these. Automata theory and formal language theory were thought to be unrelated fields of computing and linguistics until Chomsky himself brought the two together in 1956, by realising that the description of a type of automaton and of a type of language are related.
Conclusion
In summary, Noam Chomsky’s 1956 seminal paper in computational linguistics categorises grammars - sets of production rules to generate languages. The categories range from the most restrictive (regular) and easy to parse to the most free (unrestricted). We’ve seen both how each of these categories is relevant to real life questions and how Chomsky’s genius indelibly linked Computer Science and Linguistics forever.
Sources: [All accessed 20th May 2021]
[1] https://en.wikipedia.org/wiki/Chomsky_hierarchy
[2] https://en.wikipedia.org/wiki/Noam_Chomsky
[3] https://www.reddit.com/r/compsci/comments/aqdlpy/what_are_some_real_applications_of_the_chomsky/
[4] https://en.wikipedia.org/wiki/Context-free_grammar
[5] https://en.wikipedia.org/wiki/Regular_language
[6] https://en.wikipedia.org/wiki/Regular_grammar
[7] https://en.wikipedia.org/wiki/Thue_(programming_language)
[8] https://en.wikipedia.org/wiki/Unrestricted_grammar
[9] https://en.wikipedia.org/wiki/Context-sensitive_grammar
[10] https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
[11] https://en.wikipedia.org/wiki/Chomsky_hierarchy
[12] https://www.coursera.org/lecture/c-plus-plus-b/4-10-c-11-standard-4mLyd
[13] You can visit https://regexr.com/ to learn about and test regular expressions.