1. Sets and Sentence Analysis

Sets and Sentence Analysis

Set theory is the mathematics of classes. Sets are classes. The notion of class is so fundamental to thought that we cannot hope to define it in more general terms. We can say that a class is any aggregate, any collection, any combination of objects of any sort; if it helps, well and good. But even this will be less help than hindrance unless we keep clearly in mind that the aggregating or collecting or combining here is to connote no actual displacement of the objects ... In short, a class may be thought of as an aggregate or collection or combination of objects just so long as 'aggregate' or 'collection' or 'combination' is understood strictly in the sense of 'class'. Quine (1969).

A set is a collection of things. The term collection, in fact, as well as such terms as class, category, family, aggregate, combination, and ensemble may be used synonymously with the term set. This page reviews some of the concepts and notation from the theory of sets that are applied to sentence analysis. The notation discussed here serves principally as a language whereby sentence analysis can described. The page begins with some illustrations of this notation, and the concepts for which it stands. The application of this notation to sets of words and to sentence analysis is then illustrated.

Specifying a set by listing its members. The specification of the membership of a set completely characterises the set. A set may be specified by listing its members, or by specifying particular properties or attributes the members must have in order that they be considered members of the set. For example, the set of things on my desk may be specified by listing these things as follows:

{mug,pencil,paper,dictionary,eraser,lamp,elbow,dust}

By convention, such lists are enclosed in braces, { and }, sometimes called curly brackets, and the individual members or elements of the set are separated by commas. The left-hand brace may be read as "the set consisting of", with the commas separating the elements being read as " and".

A letter may be used to name or represent a set, as

D = {mug,pencil,paper,dictionary,eraser,lamp,elbow,dust}

where the equal sign may be read "represents" or " identifies" or "denotes", and is understood to have the meaning that the letter on the left may be substituted for the set specification on the right wherever this might be convenient. Combinations of characters, including subscripts and superscripts, also may be used to name, identify, or label sets.

The order in which elements are listed is not important. Thus, we can represent the set D by listing its elements as follows:

D = {elbow,mug,lamp,paper,dictionary,dust,pencil,eraser}

which is the same set D as was specified previously. Also, one or more elements may be named more than once without changing the set that is specified. For example

D = {elbow,elbow,mug,lamp,paper,pencil,dictionary,dust,pencil,eraser,pencil}

specifies the same set as that identified as D, above, notwithstanding the element elbow has been cited twice, and pencil three times.

Note that neither D itself, nor the list of its elements is the set: D denotes or identifies the set, and the list merely denotes or identifies its members. Furthermore, the words in D identify objects. If we wanted to refer to the words themselves, we would enclose them in apostrophes, also know as single quote marks, as in the following example

D_w = {'mug','pencil','paper','dictionary','eraser','lamp','elbow','dust'}

where D_w is a different set from D. The apostrophes may be omitted, in the interests of readability and to facilitate typing, but only so long as it is clear from the context that the elements of the set are words, and not the things identified by the words.

Specifying a set by identifying a property of its members. The foregoing examples illustrate the specification of sets by listing their members. A set also may be specified by citing an attribute or property comprising a criterion on the basis of which things are collected together as a set. For example, the set of things on my desk may be specified as

D = {d | d a thing on my desk}

wherein the "dummy variable," d in this case, denotes a "typical element" of the set being specified. The vertical stroke, which is sometimes replaced by a colon, is read as " such that". The attribute or property that d must satisfy is given following the vertical stroke. Note that, in this example, both the specification by attribute and by a list of members have been used to specify the same set D.

Identifying a member of a set. The statement that my elbow is a member or element of the set D may be written

elbow Î D

wherein the character Î stands for the membership relation. It is read as " is an element of", " is a member of", or "belongs to".

The symbol Ï denotes the negation of Î , and is read " is not an element of", " is not a member of", or "does not belong to". Thus, it would be used in a statement such as

computer Ï D

to say that my computer is not on my desk. (My computer is on a separate table beside my desk.)

Subset of a set. It is possible to talk about subsets of sets. For example, suppose we want to consider only those elements of D that are used in writing and that someone might describe as writing tools or materials. Then, we might specify this set of writing tools or materials as follows:

M = {m | m Î D, m used in writing}

On the basis of this specification, the set M of writing materials might alternatively be identified by listing its elements as follows:

M = {paper,pencil}

The set M can be described as a subset of D because every element of M is an element of the set D. This fact can be represented by writing

M Í D

where the symbol Í can be read as " is a subset of" or " is contained by". We can then also write

D Ê M

which can be read as a statement that "D is a superset of M," or "D contains M." In general, given two sets X and Y, X is a subset of Y, and we can write X Í Y, if and only if every element of X is also a member of Y.

Note that the symbols Í and Ê denote a relationship among sets, or a relation on sets. The symbol Î , however, denotes a relation between elements and sets. For example, we can write

paper,pencil Î D

to indicate that the individual things, pencil and paper, are elements of the set D, but we write

{paper,pencil} Í D

to indicate that the set consisting of the elements pencil and paper is a subset of D. As idicated above, we have used M to identify this set made up of the two the elements, pencil and paper. To illustrate the difference between Î and Í , we can write

paper Î {paper} Í M

to say that paper is an element of the set {paper} which is a subset of M. The set {paper}, incidentally, since it consists of a single element, is called a singleton set.

Proper subset of a set. Since the set D contains elements that are not members of M, we can say the M is a proper subset of D and write

M Ì D

where the symbol Ì may be read as " is a proper subset". We can also write

D É M

where É may be read as "properly contains". In general, given two sets X and Y, X is a proper subset of Y, and we can write X Ì Y, if and only if every element of X is also a member of Y, but there is an element of Y that is not an element of X.

Equality of sets. The horizontal line added beneath É and Ì , which yields the symbols Ê and Í , respectively, indicates the possibility that two sets might be equal. For example, we can write

D Ê D

that is, D contains itself. In general, given two sets identified as X and Y, if X Í Y and Y Í X, or in other words, if every element of X is a member of Y and every element of Y is a member of X, then we can say that X equals Y and write

X = Y

meaning that X and Y must designate the same set.

Equivalence of sets. It is also possible to speak of equivalent sets. Two sets are said to be equivalent if and only if they contain the same number of elements. For example, the set D of things on my desk, and the set D_w of words identifying the things on my desk, are equivalent because we can match each word in D_w with a thing in D. We can state this relation between the two sets by writing

D_w º D

where º can be read as " is equivalent to". Recall, however, that D_w and D are different sets, and notwithstanding they are equivalent, they are not equal. Hence, we can write

D_w ¹ D

Since D includes elements, such as eraser and dictionary, which are not members of M, we can also write

M ¹ D

The specification of the set M is rather vague and open to alternative interpretations. Someone might, for example, consider that the eraser should be included, along with the pencil and paper, because it is obviously used in writing. Consequently, this person will interpret the specification as actually meaning the set M_e where

M_e = {eraser,pencil,paper}

with he subscript e being appended to distinguish it from the original set M. Clearly, M Í M_e because pencil,paper Î M and pencil,paper Î M_e, since eraser Î M_e, but eraser Ï M, we could also write M Ì M_e.

Suppose another person thinks that the erase should not be included, but she claims she cannot write without a dictionary. Hence, she construes the specification of materials used in writing as meaning the set

M_d = {dictionary,paper,pencil}

where the subscript d serves to distinguish the set M_d from the other subsets M_e and M of D. Since each of the two sets M_d and M_e includes an element that is not included in the other, we can write

M_d ¹ M_e

Intersection of sets. As is the case for M_e, M_d also includes M as a proper subset, that is, M_d É M. Thus, the two individuals agree in their interpretations that, among the elements in D, paper and pencil can be considered as things used in writing. The two sets, M_d and M_e can consequently be said to intersect. The intersection of two sets is the set consisting of those elements that are shared by, or included in both sets. In the case of two sets M_d and M_e, we can specify that their intersection is the set M by writing

M = M_d Ç M_e

where the symbol Ç can be read as " intersect", although the foregoing line would normally be read as "M is the intersection of M_d and M_e".

In general, the intersection of any two sets X and Y is the set of all elements that are members of both X and Y. The set intersection operation is commutative, that is

X Ç Y = Y Ç X

for any sets X and Y. Set intersection is also associative so that

(X Ç Y) Ç Z = X Ç (Y Ç Z)

for any sets X, Y, and Z. The parentheses indicate that the intersection operation in the expression they enclose is performed first. Thus, performing the intersection X Ç Y, and then taking the intersection of the resulting set with Z, yields the same result as performing the intersection Y Ç Z first, then taking the intersection of the resulting set with X.

The null set. If two sets share no elements in common, their intersection is the empty or null set, which is usually identified by the symbol Æ, or by writing {}. For example, suppose we consider the subset

M_c = {dust,elbow,mug,lamp}

of those items in D that no one in this illustration thinks of as things that can be used in writing, and we compare it with the set M = {paper,pencil} of things upon which everyone agrees are used in writing. These two sets share no elements in common. Hence, their intersection is the null set, that is

M Ç M_c = Æ

Sets that have no elements in common, so that their intersection is empty, are said to be disjoint.

The null set is a subset of any set so that, for example, we can write

Æ Ì M and Æ Ì M_c

Also, the intersection of any set X with the null set yields the null set, that is

X Ç Æ = Æ

Thus, the null set is analogous to the number 0, and taking the intersection of a any set with the null set is something like multiplying any number by 0.

Union of sets. In identifying the set of those elements of D that are used in writing, one of the two individuals included the dictionary, in addition to the paper and pencil, while the other included the eraser. Thus, the four items, the paper, pencil, eraser, and dictionary, were selected by one or the other of the two individuals as things used in writing. We might use M_u to name this set and identify its elements as follows

M_u = {paper,pencil,eraser,dictionary}

The set M_u can be called the union of M_d and M_e, and we can specify the relationship among the three sets by writing

M_u = M_d È M_e

where the symbol È can be read as " union", while the equation can be read as stating that "M_u is the union of M_d and M_e".

In general, the union of any two sets X and Y is the set of all elements that are members of either X or Y. Like the set intersection operation, set union is commutative, that is

X È Y = Y È X

for any sets X and Y. Set union is also associative so that

(X È Y) È Z = Y È (X È Z)

for any sets X, Y, and Z.

Difference of sets. The individuals selecting the subsets M_e and M_d of D agree that the remaining elements, namely, those comprising the subset M_c consisting of the mug, lamp, elbow, and dust, cannot be considered as things used in writing. These things comprise a set consisting of those elements of D with the elements in the set M_u excluded or removed. We can identify this set as the difference between D and M_u, or as the complement of M_u relative to D. If the set M_c denotes this difference, or relative complement set, we can define it as follows:

M_c = D - M_u

where the minus sign can be construed as meaning the removal from the set D of all those elements that are also members of M_u.

Because M_u Í D, the difference operation entails that all the elements of M_u are removed from D. Thus, the intersection of M_c and M_u is the empty set so that we can write

M_c Ç M_u = Æ

The interpretation of the difference operation also means that we can obtain the complement of a set such as M_d relative to another set such as M_e. In this case, the removal from M_e of those elements that are also members of M_d leaves the singleton set {eraser}.

M_e - M_d = {eraser}

Notwithstanding dictionary is an element of M_d, only those elements shared by M_d and M_e, namely, the set M = {paper,pencil}, can be removed from M_e. Thus, only those elements in the intersection of the two sets are removed from M_e. The complement of M_e relative to M_d consequently yields the set {dictionary} so that we can write

M_d - M_e = {dictionary}

Sets of sets. A set can also contain other sets as elements. For example, we can define a set M_de as consisting of the two sets M_d and M_e, that is

M_de = {M_d, M_e}

Note that this set has two elements. Note especially that M_de is most certainly not the union of M_d and M_e. This fact is evident from examination of the following specification of M_de wherein M_d and M_e have been replaced by the lists of elements comprising the sets that they name:

M_de = {{dictionary,paper,pencil}, {eraser,paper,pencil}}

Sets of words. The concepts and notation discussed above can be applied to sets of words. For example, consider the following sentence:

s = 'they stand by the bow and see her bow near the stand'

Note that the lowercase Greek letter, sigma in this case, is used here to identify the sentence, following the convention the lowercase letters from the Greek alphabet are used to designate sequences of words. We might then specify a set W consisting of the words in s by writing

W = {w | w a word in s}

or we might specify this set by listing its elements as follows:

W = {'they','stand','by','the','bow','and','see','her','near'}

Notice that, although the words 'stand', 'the', and 'bow' each occur twice in the sentence s, they appear only once in the set W.

Subsets of words. We can identify several subsets of W. For example, we might consider the following subsets:

N = {w | w Î W, w a noun in s}
V = {w | w Î W, w a verb in s}

Assuming these specifications are sufficiently unambiguous, and the concepts "noun" and "verb" are clear, then the sets N and V consist of the following elements:

N = {'stand','bow'}
V = {'stand','bow','see'}

Both of the sets N and V are proper subsets of W so that we can write

N Ì W and V Ì W

As a noun in s, 'bow' for example can identify the front end of a ship, a device for launching arrows, or a bending in general or of the head or body in a gesture of greeting or respect. As a verb, 'bow' might correspond to the action of bending in general or of the head or body to perform a gesture of greeting or respect. Since the two sets share elements in common, their intersection is not empty. In fact,

N Ç V = N

We can also look at the union and the difference of the sets.

N È V = V
V - N = {'see'}

As the definitions of the sets N and V illustrate, the identification of such sets corresponds to the classification of the words in s according to their part of speech or lexical category. We can also specify other subsets of W on the same basis as follows:

PRO = {'they','her'}
DET = {'the'}
P = {'near','by'}
CONJ = {'and'}

where the set names PRO, DET, P, and CONJ correspond to the labels PRO, DET, P, and CONJ for the pronoun, determiner, preposition, and conjunction lexical categories, respectively. Observe that set names have been indicated above using a slanted or italic font, while the category labels are written in a what is usually called a roman type face.

Finite, infinite, denumerable, and countable sets. The sets illustrated above can all be described as finite; that is, they are either empty, or they consist of n distinct elements, where n denotes a whole number, usually called an integer, which is greater than zero. Thus, finite sets are equivalent to the set consisting of the first n positive integers, which is sometimes represented by writing

I_n = {1, 2, 3, ¼, n}

Notwithstanding there is no order among the elements of a set, the ellipsis, '¼', is used as an abbreviation in this case to indicate the numbers between 3 and n. The equivalence of an arbitrary set X to I_n is established by matching each element of I_n with an element of X. This matching up of the elements of the two sets is obviously just the counting process, but for a finite set, the counting stops at some number n, that may nonetheless be very large, rather than continuing indefinitely.

A set that is not finite is said to be infinite. There can be sets that are equivalent to the set of all positive integers, I₊, which may be represented by writing

I₊ = {1, 2, 3, ¼}

where the ellipsis in this case indicates that the numbers continue indefinitely, without bound, so that I₊ is itself an infinite set. Thus, sets equivalent to I₊ are infinite, but because the elements of such sets are matched with the elements of I₊, these sets are described as denumerable, and a set that is either finite or denumerable is said to be countable. (Incidentally, there are sets, such as the fractions in the interval between 0 and 1, that are infinite, but are not denumerable.)

Products of sets. The lexicons or vocabularies of natural languages are considered to constitute finite sets, and sentences can, in principle, be analysed in terms of subsets of these finite sets of words. Sets, however, include no relationship of order among their elements, and in languages such as English, the order the words in a sentence is important. To represent this order, an order relation must be added to the relations on sets, such as union and subset, described above. Although there are other ways of introducing an order relation on sets of words, the method most commonly employed is based upon the product of sets.

Ordered pairs. To illustrate the concept of a product of sets, we consider the set W of the words that make up the sentence s and the set C, consisting of lexical category labels cited above, including the lables N and V for the noun and verb categories, respectively, where

C = {PRO, DET, P, CONJ, N, V}

We then define the product of W and C as follows:

W × C = {áw,cñ | w Î W, c Î C}

which consists of all the ordered pairs of elements of the form

áw,cñ for every w Î W and c Î C

where the first entry in each pair is a word from the set W and the second entry is a category label from the set C.

The set W × C is called variously the Cartesian product, the cross product, or simply the product of C and W. As shown above, the elements of W × C consist of ordered pairs with the form áw,cñ, wherein the entries, namely, w and c, are separated by a comma and the pair is enclosed in angled brackets, á and ñ. The relative order of the entries in these pairs is determined by the positions of the sets W and C relative to the × sign. Since W appears to the left of the ×, its elements appear as the first entry of each ordered pair. The elements of C appear in the second entry of each pair áw,cñ because the set C is to the right of the × sign.

Ordered pairs and rewriting rules. The product set W × C includes all pairings of the elements of the two sets C and W. But in applying the product set to a sentence analysis, we select a subset of its elements according to the conditions that the second entry of each pair corresponds to the lexical category label of the word in the first entry. Thus, we select those pairs that correspond to rewriting rules of the form

c ® w

where w Î W and c Î C, with W being the set of words and C the set of lexical category labels as above. The foregoing rewriting rule can be recast as the following statement:

w Î c

which just says that the word w is in the lexical category c. For example, the rule

PRO ® they

is a statement that

they Î PRO

Then, if we were analysing a sentence such as

b = 'they see the bow'

made up of words from W, we might choose a subset of W × C according to the following identifications:

PRO ® they áthey, PROñ

V ® see ásee, Vñ

DET ® the áthe, DETñ

N ® bow ábow, Nñ

in order to assign lexical category labels to the words of b. Thus, we will choose the following subset F of W × C:

F = {áthey, PROñ, ásee, Vñ, áthe, DETñ, ábow, Nñ}

where F is itself a subset of the set

{ áw, cñ | áw, cñ Î W × C, w Î c } .

with F being obtained by restricting w to the words in b.

Notice that the apostrophes used previously to distinguish words from the things they identify have been omitted because it is clear from the context that the lowercase letters on the right-hand sides of the rewriting rules and in the first entry of the ordered pairs are words.

The ordered pair notation can be viewed as an alternative way of representating rewriting rules, or depending upon your point of view, the rewriting rule format might more appropriately be seen as an alternative to the ordered pair notation.

Mappings and partial mappings. The set F Í W × C can be called a mapping or function that maps from the set of words W into the set of category labels C. The set C is usually called the range of F, while W is called the domain of F. F represents an association of the elements of its domain W with particular elements of its range C. Actually, in this example, F might be more appropriately identified as a partial function or partial mapping because it is defined only for a subset of W. Functions such as F would normally be defined for every element of their domain set; that is, for every w Î W, there would be a pair áw,cñ Î F. A function meeting this condition might be called a total function or a complete function to distinguish it from a partial function. Normally, if it is described simply as a function, then the function is assumed to be total.

The partial function F enables us, in principle, to associate each word in the sentence b with a lexical category label. To complete the analysis of b, however, we require functions, or partial functions, that map the lexical category labels into syntactic category labels NP and VP, for example. Thus, we need a partial function, or functions, with elements corresponding to rewiting rules such as NP ® DET N and VP ® V NP. Notice that the right-hand sides of these particular rules include two category labels for a total, including the left-hand label, of three symbols. Hence, the elements of the function we require must consist of ordered triples.

Ordered triples and rewriting rules. We can redefine the set C of category labels to include the elements NP, VP, and S, as well as the lexical labels, so that

C = {S, NP, VP, PRO, DET, P, CONJ, N, V}

where

S = {NP, VP},
NP = {DET, N, PRO}, and
VP = {V, NP} .

The ordered triples we require are then elements of the set

C × C × C = {áa,b,cñ | a,b,c Î C}

This product set may also be represented by writing C³, where the superscripted number 3 denotes the number of factors, each consisting of the set C, in the cross product. The particular elements of interest in C³, and the rewriting rules to which they correspond, are as follows:

áNP, VP, Sñ S ® NP VP

áDET, N, NPñ NP ® DET N

áV, NP, VPñ VP ® V NP

We can therefore obtain the following partial mapping G Í C³:

G = {áNP, VP, Sñ áDET, N, NPñ áV, NP, VPñ}

by first imposing the conditions

NP, VP Î, S
V, NP Î VP, and
DET, N Î NP .

Additional conditions, however, must be applied. These conditions follow from the observation that the ordered triples comprising G constitute a partial mapping from C × C into C. Thus, the domain of this partial function can be viewed as a set of ordered pairs. In fact, the ordered triples identified above can actually be written as ordered pairs, but with the first entry being itself an ordered pair. For example, áNP, VP, Sñ can be represented as ááNP, VPñ, Sñ which can be regarded as an element of the set C² × C.

Although C² × C = C³, the representation of the rule S ® NP VP as ááNP, VPñ, Sñ makes it clear that the two symbols on the right-hand side of the rule constitute an ordered pair. Thus, in addition to the condition that a simple English sentence consists of an NP and a VP (and not two NPs or two VPs), the pair áNP, VPñ can be seen as a statement that the subject NP of a simple English sentence precedes the predicate VP. Similarly, the pair áV, NPñ says that the object NP follows the verb in the VP, and áDET, Nñ states that the DET precedes the N in an NP.

The foregoing conditions thus restrict the elements of C³ to those listed in G. But since the subject noun phrase of the sentence b consists of a pronoun, we also require an ordered pair áPRO, NPñ Î C² corresponding to the rewriting rule NP ® PRO. Hence, we define a partial mapping H Í C² as follows:

H = {áPRO, NPñ}

consisting of one element.

The ordered pair in H, together with the ordered triples comprising the partial mapping G Í C³, and the ordered pairs making up the partial mapping F Í W × C permit us to define the set

R = F È G È H

of ordered n-tuples corresponding to the rewriting rules that might be applied to the analysis of the sentence b.

Ordered n-tuples and strings. Before the elements of R can be applied to analyse b, however, it must be recognised that b is itself an ordered quadruple; that is, it is an element of W⁴. We can therefore show b as

b = áthey, see, the, bowñ Î W⁴

Writing b as an ordered quadruple just makes explict what is implicit in the representation of b as a sequence of words separated with blanks; that is, the sequence is an ordered listing of the words. Thus, when we write

b = 'they see the bow'

we actually mean that the four words constitute an ordered quadruple. The two representations are actually not all that different. If we replace the apostrophes with angled brackets, and replace the blanks separating the words with commas, we can convert the ordered list into the ordered quadruple form.

Sequences of words such as b, regardless whether they are written as a list of words or are represented as an n-tuple, are called strings. In this example, b is said to be a string over W.

Strings over a vocabulary. A set such as W is generally described as a vocabulary. Sequences of words such as 'the bow' and 'they see', or even single words such as 'see' that make up b are described as substrings of b.

The string b is just one instance of all the different strings over the vocabulary W. For example, if we consider only that subset of W consisting of the four words in b, we can devise the following strings over the vocabulary W:

b₁ = áseeñ Î W¹
b₂ = áthey, seeñ Î W²
b₃ = ásee, the, bowñ Î W³
b₄ = áthey, the, bow, seeñ Î W⁴

where in the case of b₁ we have extended the angled bracket notation to include strings consisting of just one word, with W¹ being just W.

Length of a string. The subscript numbers here serve only to distinguish the foregoing strings over W from b. The subscripts on b₁, b₂, and so on, however, also happen to correspond to the length of the string in each case. The length of a string is, naturally enough, defined to be the number of words it contains. The length of an arbitrary string a over W is denoted as | a | where

| a | = i if and only if a Î Wⁱ for 0 £ i £ n

The vertical strokes on either side of a can be read as "the length of". The notation 0 £ i £ n means that the length of the string a is an integer i between zero and n, where n denotes some, perhaps very large integer. The symbol £ can thus be read as " is less than or equal to".

The upper bound n on the length a string, as signified by the condition i £ n, indicates that we normally deal only with finite strings; that is, for any string a over a vocabulary W, although a might be very long, it is always possible to identify an integer n such that | a | £ n.

Null string. That the length of a string might be zero, as signified by the condition 0 £ i, includes the possibility that a might be the string consisting of no words, that is, it is the null string. The null string might be represented by writing áñ or W⁰, but the symbols l or e, the lowercase Greek letters lambda or espsilon, respectively, are usually used to denote it. The null string is analoguous to the null set; however, they are not the same entities. Note that {l} is not the null set; rather, it is the set consisting of the null string.

Concatenation of strings. Strings can be concatenated. For example, the string áthe, bowñ can be concatenated to the string áthey, seeñ to obtain the string áthey, see, the, bowñ. The concatenation of strings might be described as a combining of two strings by, in effect, adding one of the strings onto the end of the other (so long as it is understood that " adding" in this context is not an arithmetic operation). In terms of the apostrophes and blanks notation, the concatenation of two strings can be represented by writing one of the strings after so that, for example, the concatenation of 'they see' and 'the bow' yields the string 'they see the bow'.

Concatenation is obviously not commutative, unlike the union or intersection of sets, because the string 'the bow they see', obtained by concatenating 'they see' to 'the bow', is not the string 'they see the bow'. Concatenation, however, is associative. For example, concatenating 'the bow' to 'see' to produce 'see the bow', and then catcatenating this string to 'they' yields 'they see the bow'. The same result is obtained by first concatenating 'see' to 'they', and then concatenating 'the bow'.

In general, the result of concatenating any strings g and d over a vocabulary such as W is written as gd where

gd Î W^i+j if and only if g Î Wⁱ and d Î W^j

Thus, if | g | = i and | d | = j, then | gd | = i+j (where g and d are the lowercase Greek letters gamma and delta, respectively). Two strings such as g and d are equal (that is, they are the same string) if and only if they are the same length and each of their entries are equal (are the same). Hence, we can write

g = d if and only if g_k = d_k for 1 £ k £ n = | g | = | d |

Because concatenation is not commutative,

gd = dg if and only if g = d;
gd ¹ dg otherwise

The concatenation of the null string with any string yields just the string so that we can write

gl = lg = g

for any string g. Thus, concatenation of the null string with any string is something like the union of the null set with any set, or the addition of 0 to any number.

Finite concatenative closure. Strings over a vocabulary such as W can, in principle, be of any length, provided they are finite. The set of all finite strings over W is denoted by W^*, where the superscript asterisk is called the Kleene star. This set might be represented by writing

W^* = {W⁰ ÈW¹ ÈW² È¼ÈWⁿ}

where, as indicated above, W⁰ can stand for the null string, W¹ is the set of all single-word strings, W² is the set of all two-word strings, and so on up to the set Wⁿ of all strings of n words, where n does not denote any particular integer, but rather indicates that, regardless how long they might be, the strings in W^* are nonetheless finite. W^* is called the finite concatenative closure, or simply the closure of W because it is considered that the concatenation of any two stings in W^* is also a string in W^*, that is

gd Î W^* if and only if g, d Î W^*

For example,

'they stand' Î W^* and
'and' Î W^* means that
'they stand and' Î W^* and
'see her near' Î W^* means that
'they stand and see her near' Î W^* and
'the stand by the bow' Î W^* means that
'they stand and see her near the stand by the bow' Î W^*

The concept of finite closure under concatenation in the domain of strings is analogous to the concept of closure under the operation of addition in the domain of integers. The integers are said to be closed under addition because adding two of them together yields another integer, not some other kind of object. Thus, adding two integers, say i and j, produces the integer i+j which is finite because it is possible to identify another integer n such that i+j £ n.

Closure without the null string. The set W^* includes the null string l. The set W⁺ is just W^* but without the null string; that is,

W⁺ = W^* - {l}

The superscipted plus sign on W⁺ may be thought of in terms of an analogy with the set of positive integers without 0.

The concept of a concatenative closure is not restricted to just sets or vocabularies of words such as W. One can also consider the closure of the set or vocabulary of category labels such as C, defined above as

C = {S, NP, VP, PRO, DET, P, CONJ, N, V}

For example, application of the partial mapping F defined above that associates the words in the string b, where

b = áthey, see, the, bowñ Î W⁴

with their lexical categories, yields the string

áPRO, V, DET, Nñ Î C⁴

Partial mappings and sentence analysis. Application of other partial mappings from the set R defined above, and which correspond to other rewriting rules, can produce the following strings in the course of an analysis of b:

áPRO, V, NPñ Î C³
áPRO, VPñ Î C²
áNP, VPñ Î C²
áSñ Î C¹

Note that a mapping can be applied only if a substring is in the domain of the mapping. For example, the substring áV, NPñ of the first of the strings listed above is in the domain of the partial mapping that includes the element áV, NP, Sñ. Hence, the substring can be mapped to áVPñ, which is a substring of the second of the foregoing strings. If, on the other hand, a string such as b₄ = áthey, the, bow, seeñ were being analysed, the string áPRO, NP, Vñ would be produced in the course of its analysis. Since the substring áNP, Vñ is not in the domain of a mapping in R, the analysis could proceed no further.

The analysis of the string b yields strings that are elements of he sets C¹, C², and C³. These sets are subsets of the finite concatenative closure C^* of the vocabulary of category labels C. Hence, all of the foregoing strings are elements of C^*. The closure C⁺ of C excluding the null string can also be defined

C⁺ = C^* - {l}

Notice that the string áPRO, V, DET, Nñ is the result of mapping all the words of b to their lexical categories. This element of C⁴ is thus the last in a sequence of strings which includes the following:

áPRO, see, the, bowñ Î C¹×W³
áPRO, V, the, bowñ Î C²×W²
áPRO, V, DET, bowñ Î C³×W¹

where each of these strings is produced as the words in b are mapped to their category labels one word at a time. The sets C¹×W³, C²×W² and C³×W¹ are actually all subsets the set (C ÈW)⁴, which includes strings beginning with one, two, or three entries from C, as well as strings with either words from W or category labels from C in each entry.

Partial mappings from words to category labels, and among category labels, can, in principle, be applied anywhere in a string. The only condition is that a substring of the string be an element of the domain of the mapping. Thus, the following strings might be produced in the course of analysis of b:

áthey, see, NPñ Î (C ÈW)³
áthey, VPñ Î (C ÈW)²
áSñ Î (C ÈW)¹

The foregoing examples illustrate that the strings produced during an analysis can be treated as elements of the finite concatenative closure of the union of the sets of words and category labels, (C ÈW)^*. The concatenative closure (C ÈW)⁺ excluding the null string l is defined as (C ÈW)^* - {l}.

Derivations and sentential forms. If a string such as b Î W^* can be reduced to a string consisting of just the sentence category label S Î C, then b can be identified as a sentence. The intermediate strings such as those shown above that are produced during the course of an analysis are called sentential forms. The sequence of sentential forms produced during the analysis of b which ends in áSñ can be called a derivation.

The derivation relation. Note that the term "derivation" is normally applied to the sequence of sentential forms that begin with áSñ and which ends with a sentence such as b. For example, the string áNP, V, NPñ can be derived from the string áNP, VPñ, and we can write

áNP, VPñÞ áNP, V, NPñ

by applying the rewriting rule

VP ® V, NP

The double-shafted arrow symbol Þ dnotes the derivation relation on strings which can be read as "derives".

Notice the difference between the double-shafted derivation arrow and the single-shafted rewriting arrow. The string on the right-hand side of the Þ derivation arrow is the string that results from applying a rewriting rule to the string to the left of the Þ. The string on the left-hand side includes a substring, consisting of the single category label VP in this example, that matches the left-hand side of the rewriting rule VP ® V, NP. The label VP is replaced by the string V NP on the right-hand side of the rewriting rule to produce the string áNP, V, NPñ on the right of the Þ derivation arrow. Thus, the strings on either side of the derivation arrow are just strings before and after the application of a rewriting rule.

In general, suppose that a sentential form consisting of the string ág₁, A, g₂ñ, where g₁,g₂ Î (C ÈW)^* and A Î C, is produced at some point in a derivation. Then, if there is a rewriting rule A ®a, where a Î (C ÈW)^*, application of this rule will derive a sentential form consisting of the string ág₁, a, g₂ñ.

Closure of the derivation relation. The symbol Þ^* denotes the finite closure of the derivation relation, and means that the string on its right-hand side can be derived from a string on the left-hand side by the application of zero rewriting rules, or by at most a finite number of applications of rewriting rules. Thus, we can write

S Þ^* áthey, see, the, bowñ

In general, if

S Þ^* s, where s Î W^*

then s is a sentence. Thus, of all the strings in W^*, that is, strings consisting exclusively of words, only those that can be derived from the sentence category label S Î C are sentences.

Note that Þ^* includes the application of no rewriting rules so that we can actually write something such as

áNP, VPñÞ^* áNP, VPñ

The derivation arrow Þ⁺, however, specifies that at least one rewriting rule is required; that is, the application of zero rules is excluded. Hence, it would normally be the case that

S Þ⁺ s, where s Î W⁺

Athough it is possible in principle to have a sentence consisting of the null string, and there could be a rewriting rule S ® l, sentences usually consist of one or more words and at least one rewriting rule is required to derive them from S.

The notion of a derivation has been introduced above in the context of applying rewriting rules, although we have continued to represent strings using the ordered n-tuple notation. In the context of partial mappings, rather than rewriting rules, derivations correspond to reductions; that is, strings of words are reduced to a string consisting of the sentence category label. Of course, a string s Î W⁺ can be reduced to S Î C if and only if s is a sentence, and we can write

sÜ⁺ S, where s Î W⁺

or more generally

sÜ^* S, where s Î W^*

wherein we have reused the double-shafted arrow notation, but have written the arrows pointing leftward rather than to the right. The application of the partial mapping

áV, NP, VPñ

whereby the substring áV, NPñ is mapped to the substring áVPñ, can reduce the string áNP, V, NPñ to the string áNP, VPñ so that we can write

áNP, V, NPñÜ áNP, VPñ

Terminal and non-terminal vocabularies. The particular sets C and W of category labels and words, respectively, have been used to provide concrete examples for concepts and notation discussed above. Generally, the names S (the uppercase Greek letter sigma) or V_T is used to identify the vocabulary, or a subset of the vocabulary of a language. The subscript T is appended to V in V_T because the words comprising the vocabulary are often called terminal symbols. The term "terminal" is used because strings consisting exclusively of words are the end result of a derivation. The term "symbol" is used to include both the words and the category labels that are processed during a derivation. The category labels are identified as "non-terminal symbols" or as variables, and they comprise the non-terminal vocabulary, V_N. (When the vocabulary of words or terminal symbols is identified using S, then vocabulary of non-terminal symbols or variables is sometimes identified simply as V.) The two vocabularies are normally disjoint, that is,

V_T ÇV_N = Æ

In terms of these two vocabularies, rewriting rules often have the form

B ® b, where b Î V_T, B Î V_N
A ® a, where a Î V_N^*, A Î V_N

that is, the uppercase Roman letters A and B stand for category labels, such as NP and N, and are elements of the non-terminal vocabulary V_N; b stands for a word in the vocabulary of terminal symbols V_T; and a in this illustration stands for a string of non-terminal symbols in the V_N^*, the closure of the non-terminal vocabulary. Thus, if A were to stand for the category label NP, then a might stand for the string áDET, Nñ, which is usually written as DET N, as in the rewriting rule NP ® DET N. Elements of the partial mappings that correspond to the foregoing rewriting rules appear as follows:

áb, Bñ, where b Î V_T, B Î V_N
áa, Añ, where a Î V_N^*, A Î V_N

Generative grammar. A set of partial mappings or rewriting rules such as those illustrated here, together with the terminal and non-terminal vocabularies V_T and V_N, respectively, comprise a grammar. Such grammars are also called generative grammars because sentences can be thought of as being generated from the sentence category label by applying the rewriting rules. These grammars are also called phrase-structure grammars or constituent-structure grammars because they are based upon the notion of sentence constituents or phrases such as the noun and verb phrases. A generative or phrase-structure grammar G is defined by an ordered quadruple

G = áV_N, V_T, S, Pñ

where V_N and V_T denote the non-terminal and terminal vocabularies, respectively. The sentence category label, S Î V_N, sometimes called the starting or root symbol of the grammar, is included as one of the elements of the quadruple. P identifies the set of rewriting rules, which are sometimes called productions because sentences can be thought of as being produced by their application. (The set of rewriting rules or partial mappings or functions may also be identified by the names F or R.)

Language generated by a grammar. A string s Î V_T^* that can be derived from the sentence category label S Î V_N by the rewriting rules in P, or which can be reduced to a string áSñ consisting of only the sentence category label by the corresponding partial mappings, is said to be a sentence generated by G. The set of all such strings is described as the language generated by G, denoted as as L(G), where

L(G) = {s | s Î V_T^*, S Þ^*s}

L(G) = {s | s Î V_T^*, sÜ^*áSñ}

with L(G) Í V_T^*.

posted @ 2006-09-23 02:38 -> 阅读(294) 评论(0) 收藏举报

刷新页面返回顶部

PRO ® they	áthey, PROñ
V ® see	ásee, Vñ
DET ® the	áthe, DETñ
N ® bow	ábow, Nñ

áNP, VP, Sñ	S ® NP VP
áDET, N, NPñ	NP ® DET N
áV, NP, VPñ	VP ® V NP

1. Sets and Sentence Analysis

Sets and Sentence Analysis

公告