6. Phrase-Structure Grammars and the Chomsky Hierarchy
This note presents an overview of the Chomsky hierarchy of phrase-structure grammars and of the languages they generate. The information covered here serves principally to provide context for the context-free grammars and languages. The presentation consists of brief observations on the following topics:
 - Phrase-Structure Grammars
 - Productions
 - Sentences
 - Languages
 - Equivalent Grammars
 - Chomsky Hierarchy
 - Hierarchy of Languages
 - λ-Productions
 - λ-Free Grammars
 - Normal Form Grammars
 - Chomsky Normal Form
 - Self-Embedding Grammars
 - Type 3 Grammars
 - Type 3 Normal Forms
 - Type 3 Grammar Parse Tree
 
Phrase-Structure Grammars. A phrase-structure grammar G is defined as an ordered quadruple of the form
G = ⟨VN, VT, S, P⟩
where the entries are identified as follows:
- VN is a nonterminal vocabulary consisting of the lexical and syntactic category labels;
 - VT denotes a set of words, called the terminal vocabulary of G;
 - S is a special member of VN that, in addition to being the label of the sentence category, identifies the starting symbol of G; and
 - P identifies the collection of rewriting rules, described as the production set of the grammar.
 
Productions. Members of the production set P of G have the general form
α → β
where α ∈ (VT ∪ VN)+ and β ∈ (VT ∪ VN)*, subject to the following conditions:
- At least one member of P has the form S → β with β ∈ (VT ∪ VN)*;
 - At least one of the symbols in the left-hand side string α of each of the members of P must be an element of VN; and
 - There must be at least one member of P the right-hand side of which consists only of terminal symbols, that is, β ∈ VT*.
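The quadruple and the three conditions on P can be written down directly. The following is a minimal Python sketch, not part of the original note: the names Grammar and satisfies_conditions, and the choice to store each production as a pair of symbol tuples (left-hand side, right-hand side), are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset  # VN
    terminals: frozenset     # VT
    start: str               # S
    productions: tuple       # P, as (lhs, rhs) pairs of symbol tuples


def satisfies_conditions(g):
    """Check the three conditions on the production set P."""
    # At least one production rewrites the start symbol on its own.
    has_start_rule = any(lhs == (g.start,) for lhs, _ in g.productions)
    # Every left-hand side is non-empty and contains a nonterminal.
    lhs_ok = all(
        any(sym in g.nonterminals for sym in lhs) for lhs, _ in g.productions
    )
    # At least one right-hand side consists only of terminal symbols.
    has_terminal_rhs = any(
        all(sym in g.terminals for sym in rhs) for _, rhs in g.productions
    )
    return has_start_rule and lhs_ok and has_terminal_rhs


toy = Grammar(
    nonterminals=frozenset({"S", "NP", "VP"}),
    terminals=frozenset({"dogs", "bark"}),
    start="S",
    productions=(
        (("S",), ("NP", "VP")),
        (("NP",), ("dogs",)),
        (("VP",), ("bark",)),
    ),
)
print(satisfies_conditions(toy))
```

Tuples are used for both sides of a production so that the same representation covers all four grammar types, including those whose left-hand sides contain more than one symbol.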
 
Sentences. Derivations are undertaken by beginning with the starting symbol S of a grammar G and applying productions from P. A derivation terminates when a sentential form is obtained that consists solely of words from the vocabulary VT of terminal symbols. A string s ∈ VT* is said to be a sentence generated by G if and only if
S ⇒* s
Languages. The language generated by G, denoted L(G), where L(G) ⊆ VT*, is the set of all strings of terminal symbols derivable from S. Thus,
L(G) = {s | s ∈ VT*, S ⇒* s}
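For a small grammar, a finite part of L(G) can be listed by searching the sentential forms reachable from S. The sketch below is illustrative only: the function name sentences and the length bound max_len are assumptions, and because forms longer than max_len are discarded, the listing is guaranteed complete only when no production shortens a sentential form.

```python
from collections import deque


def sentences(start, productions, terminals, max_len):
    """Return every sentence of length <= max_len found by breadth-first search.

    productions is a list of (lhs, rhs) pairs of symbol tuples.
    """
    seen = {(start,)}
    queue = deque([(start,)])
    found = set()
    while queue:
        form = queue.popleft()
        if all(sym in terminals for sym in form):
            found.add(form)  # only terminals remain: this form is a sentence
            continue
        for lhs, rhs in productions:
            n = len(lhs)
            # Apply the production at every position where lhs occurs.
            for i in range(len(form) - n + 1):
                if form[i:i + n] == lhs:
                    new = form[:i] + rhs + form[i + n:]
                    if len(new) <= max_len and new not in seen:
                        seen.add(new)
                        queue.append(new)
    return found


# Example grammar (an assumption for illustration): S -> a S b | a b
rules = [(("S",), ("a", "S", "b")), (("S",), ("a", "b"))]
for s in sorted(sentences("S", rules, {"a", "b"}, 6), key=len):
    print(" ".join(s))
```

With the bound set to 6, the search finds the three sentences a b, a a b b, and a a a b b b.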
Equivalent Grammars. Two phrase-structure grammars Ga and Gb with the same terminal vocabulary VT, but with production sets Pa and Pb, respectively, where Pa ≠ Pb, could generate different languages; however, they also may generate the same language. Different grammars which generate the same language are said to be equivalent; that is, two grammars Ga and Gb are equivalent if and only if
L(Ga) = L(Gb)
Chomsky Hierarchy. Phrase-structure grammars may be classified according to the form of their productions. The conventional classification scheme is known as the Chomsky Hierarchy. According to this scheme, there are four types of phrase-structure grammar. The first of these, identified as the Type 0 grammars, consists of those grammars which have no restrictions on their productions other than the conditions noted above. The other three classes of grammar, identified as the Type 1 through Type 3 grammars, are defined by increasingly restrictive conditions on their productions, with the Type 3 grammars being those that are subject to the most stringent conditions. The grammars in the Chomsky hierarchy are also identified by name, and these names together with the forms of their productions are shown in the following table:
Type | Name | Productions
0 | Unrestricted | α → β with α ∈ (VT ∪ VN)+ and β ∈ (VT ∪ VN)*
1 | Context-Sensitive | α1Aα2 → α1βα2 with A ∈ VN, α1, α2 ∈ (VT ∪ VN)*, and β ∈ (VT ∪ VN)+
2 | Context-Free | A → β with A ∈ VN and β ∈ (VT ∪ VN)*
3 | Regular, Finite | A → bB or A → b with A, B ∈ VN and b ∈ VT*
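The production forms in the table can be tested mechanically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, productions are (lhs, rhs) pairs of symbol tuples, and chomsky_type reports the most restrictive type whose form every production satisfies, taking the forms in the table literally.

```python
def is_type3(lhs, rhs, VN, VT):
    # A -> bB or A -> b, with b a (possibly empty) string of terminals.
    if not (len(lhs) == 1 and lhs[0] in VN):
        return False
    body = rhs[:-1] if rhs and rhs[-1] in VN else rhs
    return all(sym in VT for sym in body)


def is_type2(lhs, rhs, VN, VT):
    # A -> beta, with a single nonterminal on the left.
    return len(lhs) == 1 and lhs[0] in VN


def is_type1(lhs, rhs, VN, VT):
    # Look for a split lhs = a1 A a2 such that rhs = a1 beta a2, beta non-empty.
    for i, sym in enumerate(lhs):
        if sym not in VN:
            continue
        a1, a2 = lhs[:i], lhs[i + 1:]
        beta_len = len(rhs) - len(a1) - len(a2)
        if (beta_len >= 1
                and rhs[:len(a1)] == a1
                and rhs[len(rhs) - len(a2):] == a2):
            return True
    return False


def chomsky_type(productions, VN, VT):
    """Return 3, 2, 1 or 0 according to the forms of the productions."""
    for level, test in ((3, is_type3), (2, is_type2), (1, is_type1)):
        if all(test(lhs, rhs, VN, VT) for lhs, rhs in productions):
            return level
    return 0


VN, VT = {"S", "B", "C"}, {"a", "b"}
print(chomsky_type([(("S",), ("a", "S")), (("S",), ("a",))], VN, VT))
print(chomsky_type([(("S",), ("a", "S", "b")), (("S",), ("a", "b"))], VN, VT))
```

The check classifies a grammar by the form of its rules only. It says nothing about the type of the language generated, since a language of Type 3 can also be generated by a grammar written with Type 2 or Type 0 rules.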
Hierarchy of Languages. A language over a vocabulary VT is said to be of Type i if it can be generated by a Type i Grammar. For example, a Type 2 Language, also known as a Context-Free Language, is generated by a Type 2 Grammar, that is, by a Context-Free Grammar. It is possible to speak of the family of languages over VT generated by the grammars of Type i, and identify this family or class of languages by writing Li. For example, the family of languages generated by the context-free, or Type 2 grammars may be denoted by L2. The families of languages generated by the grammars in the Chomsky hierarchy form an inclusive hierarchy that may be represented as
L3 ⊂ L2 ⊂ L1 ⊂ L0
Thus, for example, the context-free languages are included in the family of context-sensitive languages, and the context-free languages include the regular or finite languages. The inclusion is proper so that, for example, there are context-free languages that are not regular or finite.
λ-Productions. A λ-Production is a rewriting rule of the form
A → λ
with its right-hand side being the null string λ. Note that the condition on the right-hand sides of context-free rules that they be strings in (VT ∪ VN)* includes the possibility that the production set might contain λ-productions. It can be demonstrated that λ-productions are required only if the language generated by the grammar includes the null string as a sentence. Hence, an equivalent grammar can be constructed wherein all of the λ-productions have been eliminated, and replaced by the two rules
S' → λ
S' → S
The new symbol S' is added to the nonterminal vocabulary of the grammar and the starting symbol of the resulting grammar is replaced by S'. The production S' → λ is then applied only to derive the sentence consisting of the null string.
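The note does not spell out how the equivalent grammar is constructed. One standard construction for context-free grammars, sketched below as an assumption rather than as the author's method, first finds the nonterminals that can derive the null string and then adds, for every rule, the variants obtained by omitting those symbols. The function name eliminate_lambda and the convention of naming the new start symbol by appending an apostrophe are hypothetical.

```python
from itertools import product


def eliminate_lambda(start, productions):
    """Return (new_start, rules) with no empty right-hand side except S' -> ().

    productions is a list of (A, rhs) pairs; A is a single nonterminal and
    rhs is a tuple of symbols, with () standing for the null string.
    """
    # 1. Find the nullable nonterminals, i.e. those that derive the null string.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in nullable and all(sym in nullable for sym in rhs):
                nullable.add(lhs)
                changed = True
    # 2. For each rule, keep or drop each nullable symbol in every combination.
    rules = set()
    for lhs, rhs in productions:
        options = [((sym,), ()) if sym in nullable else ((sym,),) for sym in rhs]
        for choice in product(*options):
            variant = tuple(sym for part in choice for sym in part)
            if variant:  # never emit an empty right-hand side here
                rules.add((lhs, variant))
    # 3. Add the new start symbol with S' -> S, and S' -> () only if needed.
    new_start = start + "'"
    rules.add((new_start, (start,)))
    if start in nullable:
        rules.add((new_start, ()))
    return new_start, rules


new_start, new_rules = eliminate_lambda("S", [("S", ("a", "S", "b")), ("S", ())])
for lhs, rhs in sorted(new_rules):
    print(lhs, "->", " ".join(rhs) or "λ")
```

In this sketch the rule for the null string is added only when the original start symbol can derive the null string, which matches the observation that λ-productions are needed only when the null string is a sentence of the language.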
λ-Free Grammars. A language that does not include the null string as a sentence is called λ-free and a grammar that generates a λ-free language is said to be λ-free itself. Note that the form of the context-sensitive productions presented above means that they generate only λ-free languages. Nevertheless, a context-sensitive language is permitted to include the null string as a sentence. In such cases, the rule S → λ is added to the production set of the generating grammar, but S cannot appear in the right-hand side of any of the other productions.
Normal Form Grammars. A Normal Form for a given type or class of grammar imposes specified additional conditions on the productions, such that every grammar of the given type has an equivalent grammar that satisfies those conditions. For example, the productions of a Type 2, or context-free grammar have the general form
A → β with A ∈ VN and β ∈ (VT ∪ VN)*
The right-hand sides of such rules may consist of strings that include a mixture of terminal and non-terminal symbols. Thus, the rule
NP → the N
with NP, N ∈ VN and 'the' ∈ VT is an acceptable context-free production. Normally, we would replace such productions with rules of the form
DET → the
NP → DET N
with the symbol DET being added to the nonterminal vocabulary VN if it is not already included therein. The right-hand sides of the rules then consist either of a single terminal symbol, or of a string of non-terminals. Hence, all the rules of the context-free grammar have either one or the other of the following two general forms:
A → a
B → β
with A, B ∈ VN, a ∈ VT, and β ∈ VN*. The productions of any context-free grammar can be replaced with rules of the foregoing form, with additional symbols being added to the vocabulary VN of non-terminal symbols as required. The resulting grammar generates the same language as the original, and hence, is equivalent to the original context-free grammar.
Chomsky Normal Form. A λ-free context-free grammar is said to be in Chomsky Normal Form if each of its productions has one of the following forms:
A → a
B → CD
with A, B, C, D ∈ VN and a ∈ VT. If the original grammar is not λ-free, then the production set of the equivalent Chomsky Normal Form grammar includes the production S → λ. Because the right-hand sides of rules of the second variety always consist of exactly two symbols, Chomsky Normal Form grammars are sometimes called binary grammars.
If the productions of the original grammar are already in the normal form described above, namely A → a and B → β with A, B ∈ VN, a ∈ VT, and β ∈ VN*, then an equivalent Chomsky Normal Form grammar can be constructed by replacing those rules of the original grammar wherein the right-hand side string β consists of more than two nonterminal symbols. For example, a rewriting rule such as
NP → DET A N
is eliminated and two rules such as the following are included in the production set of the normal form grammar:
NP → A N
NP → DET NP
Chain rules such as
NP → N
VP → V
wherein the right-hand side consists of a single nonterminal symbol, must also be eliminated. In these cases, this is done by treating collective nouns as comprising NPs and intransitive verbs as VPs.
In the foregoing examples, the nonterminal symbols in the normal form grammar are already in the nonterminal vocabulary of the original grammar. In general, however, the nonterminal vocabulary of the Chomsky Normal Form grammar will include new symbols, not included in the nonterminal vocabulary of the original grammar.
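When no existing category can be reused, long right-hand sides are commonly split by introducing fresh nonterminals. The sketch below shows that general technique, which differs from the NP example above in that it always creates a new symbol. The function name binarize and the fresh names X1, X2, and so on are assumptions, and the names are taken not to occur already in the grammar.

```python
def binarize(productions):
    """Split each rule B -> C1 C2 ... Cn with n > 2 into two-symbol rules.

    productions is a list of (B, rhs) pairs; rhs is a tuple of nonterminals.
    """
    result = []
    counter = 0
    for lhs, rhs in productions:
        while len(rhs) > 2:
            counter += 1
            fresh = f"X{counter}"  # hypothetical fresh nonterminal
            # Keep the first symbol and delegate the rest to the fresh symbol.
            result.append((lhs, (rhs[0], fresh)))
            lhs, rhs = fresh, rhs[1:]
        result.append((lhs, rhs))
    return result


for lhs, rhs in binarize([("NP", ("DET", "A", "N"))]):
    print(lhs, "->", " ".join(rhs))
```

Applied to NP → DET A N, the sketch yields NP → DET X1 and X1 → A N, so the new symbol X1 must be added to the nonterminal vocabulary.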
Self-Embedding Grammars. A grammar is said to be self-embedding if there is a symbol A ∈ VN and a derivation such that
A ⇒+ βAγ where β, γ ∈ (VT ∪ VN)+
For example, a context-free grammar with the rules
S → NP VP
NP → NP S
is self-embedding because of the derivation
S ⇒ NP VP ⇒ NP S VP
which might be included in the generation of a sentence such as
'a cat the dog saw ran.'
The clause 'the dog saw' is sometimes described as being centre-embedded in the matrix sentence 'a cat ran.'
Type 3 Grammars. A context-free grammar that is not self-embedding generates a Type 3 language. Thus, it is the self-embedding property of some context-free grammars which enables them to generate languages that are not Type 3. The rules of a Type 3 grammar, that is, a finite or regular grammar, all have one or the other of the following two forms
A → bB, or
A → b
with A, B ∈ VN and b ∈ VT*. Consequently, although a Type 3 rule might be recursive, with the form
A → bA
and could yield right-embedded constituents, Type 3 rules cannot produce a centre-embedding because there can be no string to the right of the nonterminal symbol A.
A Type 3 grammar is a Type 2 grammar because the left-hand side of each Type 3 production consists of a single nonterminal symbol. Furthermore, for any non-self-embedding Type 2 grammar there is an equivalent Type 3 grammar; that is, provided it is not self-embedding, a context-free grammar can be converted to a Type 3 grammar.
Type 3 Normal Forms. Every Type 3 language, that is, every finite or regular language, can be generated by a grammar all the rules of which have one or the other of the following forms:
A → aB with A, B ∈ VN and a ∈ VT
A → λ with A ∈ VN
In a variant of this normal form Type 3 grammar, rules of the form
A → a where a ∈ VT
can be substituted in place of those with the form A → λ.
A rule such as
S → dogs like NP
with S, NP ∈ VN and the string 'dogs like' ∈ VT* is a legitimate Type 3 production. To obtain normal form rules, a new symbol such as VPp can be added to the nonterminal vocabulary of the original Type 3 grammar, and the following rules can be added to its production set:
S → dogs VPp
VPp → like NP
with S, NP, VPp ∈ VN and 'dogs', 'like' ∈ VT. The original rule S → dogs like NP is eliminated. If this rule happens to be the only one in the original grammar that is not in the normal form, then its replacement with the two rules cited above, and the addition of the new symbol VPp to the nonterminal vocabulary will yield a Type 3 normal form grammar. In other words, if G where
G = ⟨VN, VT, S, P⟩
denotes the original Type 3 grammar, then the normal form Type 3 grammar G' can be defined as
G' = ⟨V'N, VT, S, P'⟩
where
V'N = VN ∪ {VPp}
P' = (P - {S → dogs like NP}) ∪ {S → dogs VPp, VPp → like NP}
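The same replacement can be applied to every rule whose right-hand side begins with more than one word. The sketch below is a hypothetical illustration: the function name normalize_type3 is an assumption, and the fresh nonterminals are named by appending a counter to the current left-hand side (for example S_1) instead of using a linguistically motivated label such as VPp. It only splits multi-word rules and leaves rules of other shapes unchanged.

```python
def normalize_type3(productions, nonterminals):
    """Rewrite each rule A -> w1 ... wn B (n >= 2) as a chain of rules A -> a B.

    productions is a list of (A, rhs) pairs; rhs is a tuple of symbols.
    """
    result = []
    counter = 0
    for lhs, rhs in productions:
        ends_in_nonterminal = bool(rhs) and rhs[-1] in nonterminals
        words = rhs[:-1] if ends_in_nonterminal else rhs
        tail = rhs[-1:] if ends_in_nonterminal else ()
        while len(words) > 1:
            counter += 1
            fresh = f"{lhs}_{counter}"  # hypothetical fresh nonterminal
            # Emit the first word and hand the remainder to the fresh symbol.
            result.append((lhs, (words[0], fresh)))
            lhs, words = fresh, words[1:]
        result.append((lhs, words + tail))
    return result


for lhs, rhs in normalize_type3([("S", ("dogs", "like", "NP"))], {"S", "NP"}):
    print(lhs, "->", " ".join(rhs))
```

For the rule S → dogs like NP the sketch produces S → dogs S_1 and S_1 → like NP, which is the replacement described above with S_1 playing the role of VPp.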
Type 3 Grammar Parse Tree. The accompanying figure shows the structure generated by a Type 3 normal form grammar for the sentence 'dogs like them.' Observe that, because all the rules of the grammar are either of the form A → λ, or of the form A → aB, such as in this example S → dogs VPp and VPp → like NP, every node of the tree, except for the lowest one labelled H here, has two branches. The left-hand branch of the two connects the node to a word, while the right-hand branch connects it to another node. Thus, the parse tree is a strictly right-branching structure, with the last branch ending in a node connected to the null string.
Note that, according to the alternative Type 3 normal form, all the rules have one or the other of the two forms A → a or A → aB. Thus, applying the alternative normal form to this example would eliminate the rule H → λ and replace the rule NP → them H with NP → them. The upper part of the tree generated by this grammar will then be the same as that shown in the figure; but the lowest node of the new tree will be labelled NP, and its single branch will connect to the word 'them.'
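Because every rule of the first normal form either consumes one word and moves to another nonterminal or ends the derivation, such a grammar can be run word by word in the manner of a finite-state recognizer. The sketch below illustrates this for the example rules mentioned in this note; the function name accepts and the representation of λ as an empty tuple are assumptions.

```python
def accepts(words, rules, start="S"):
    """Recognise a word sequence with rules of the forms A -> a B and A -> λ."""
    current = {start}  # nonterminals that may label the current tree node
    for word in words:
        # Follow every rule A -> word B from a current nonterminal A.
        current = {
            rhs[1]
            for lhs, rhs in rules
            if lhs in current and len(rhs) == 2 and rhs[0] == word
        }
    # Accept if some current nonterminal can be rewritten as the null string.
    return any(lhs in current and rhs == () for lhs, rhs in rules)


RULES = [
    ("S", ("dogs", "VPp")),
    ("VPp", ("like", "NP")),
    ("NP", ("them", "H")),
    ("H", ()),  # H -> λ
]
print(accepts(["dogs", "like", "them"], RULES))
print(accepts(["dogs", "like"], RULES))
```

The sequence of nonterminals visited (S, VPp, NP, H) corresponds to the chain of right-hand nodes in the right-branching tree.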
                    
                
                
            
        