4. Chart Parsing

A chart parser is a sentence analysis procedure that yields all possible analyses of the sentences it processes. The analysis procedure emulates parallel processing by, in effect, concurrently pursuing all alternative analyses at each stage in the analysis of a sentence. In fact, a chart parser works in serial fashion, following alternative analyses one at a time, in sequence. However, it pursues these alternatives at each stage of the analysis as each word is processed, and it records the intermediate results of these alternative analyses in a chart data structure.

A chart parser may be regarded as a production system. Thus, it consists of a production set, an interpreter, and a database. The production set is a collection of rewriting rules, and the interpreter is a procedure specifying how these rules will be applied in order to analyse a sentence. The database of a production system often consists of just a location at which sentential forms are stored during an analysis and upon which rewriting rules are applied. The database could perhaps consist of a stack structure to enable the interpreter to backtrack when it becomes blocked. The database of a chart parser, on the other hand, consists of the more elaborate chart data structure.

As a data structure, a chart consists of an organised, or linked, collection of data elements. The data elements comprising the structure are employed to record information about the productions which might be, or have been, applied during the course of a recognition, and where, relative to the tokens constituting a putative sentence, productions can be, or have been, applied. The locations of their application, and the symbols contained in the individual productions, comprise the links connecting the elements to form the organisation of the structure.

At each stage of a recognition, all the productions which can be applied are recorded in individual elements of the chart structure (although some of these productions may never actually contribute to the recognition of the putative sentence). Thus, while the identification of applicable productions, and then their application, takes place in a serial, step by step sequential fashion, the procedure emulates a parallel process. Since all potentially applicable productions are identified at each stage, the interpreter never becomes blocked (unless the input tokens do not comprise a sentence) and need never backtrack as must a normal production system.

The interpreter of the chart parser has available to it, in its chart data structure, at each stage of a recognition, all alternative sequences of production applications; one or more of these sequences may yield recognition of the putative sentence. Consequently, if the sentence is structurally ambiguous, the chart parser will identify the multiple readings and produce corresponding phrase markers.

The subsequences or phrases of a putative sentence which have been found to be "grammatical" are described as "well-formed substrings." For particular subsequences, there may be more than one reading, and although some of these "readings" may not result in recognition, others might, so that, if an input sequence is a sentence, one or more of these collections of well-formed substrings will yield recognition.

The location at which a production can be, or has been applied during a recognition is recorded relative to the positions of the tokens constituting the putative sentence. This location of application is established by associating integers with the positions of the tokens. The beginning, left-hand end of the sequence of tokens is identified using 0; the right-hand end of a sequence of n tokens is identified using n. The spaces between the tokens then are numbered from 1 to n-1. For example, if the putative sentence consists of the three tokens

radio broadcasts pay

the positions of the tokens are identified as follows:

0 radio 1 broadcasts 2 pay 3

Then, if the production A --> radio is applied, its location of application is recorded as the ordered pair <0,1>. If productions N --> broadcasts and NP --> A N are applied, the locations of their application are recorded as <1,2> and <0,2>, respectively.
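The numbering scheme can be sketched in a few lines of Python. This is only an illustration: the function names number_positions and span_of are hypothetical helpers, not part of any parser described here.

```python
def number_positions(tokens):
    """Interleave the vertex numbers 0..n with the n tokens."""
    parts = ["0"]
    for i, token in enumerate(tokens, start=1):
        parts.extend([token, str(i)])
    return " ".join(parts)


def span_of(tokens, phrase):
    """Return the (left, right) vertex pair of the first occurrence of phrase."""
    n = len(phrase)
    for left in range(len(tokens) - n + 1):
        if tokens[left:left + n] == phrase:
            return (left, left + n)
    return None  # phrase does not occur in tokens


tokens = "radio broadcasts pay".split()
print(number_positions(tokens))                  # 0 radio 1 broadcasts 2 pay 3
print(span_of(tokens, ["radio"]))                # (0, 1)
print(span_of(tokens, ["broadcasts"]))           # (1, 2)
print(span_of(tokens, ["radio", "broadcasts"]))  # (0, 2)
```

A phrase covering m tokens and starting at vertex i always ends at vertex i + m, which is why a pair of integers is enough to locate any application of a production.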

The application of these three productions to the putative sentence may be depicted as follows:

          A                         A --> radio
       -------
      |       |
      0 radio 1 broadcasts 2 pay 3

          A         N               N --> broadcasts
       ------- ------------
      |       |            |
      0 radio 1 broadcasts 2 pay 3

               NP                   NP --> A N
       --------------------
      |                    |
      |   A         N      |
      |-------|------------|
      |       |            |
      0 radio 1 broadcasts 2 pay 3

This diagram should suggest why the term "chart" is used to describe the data structure employed by a chart parser to store information regarding the progress of a recognition. The picture should also explain why the elements stored in the chart data structure are normally called "edges." The foregoing figure shows three edges, labelled A, N, and NP, where these labels derive from the left-hand side symbols of the productions to which they correspond. The ends of the edges are called "vertices." Individual edges are identified, in part, by citing their labels and their vertices. This information is recorded in the data element which is stored in the chart when a production is applied. The data elements corresponding to the edges depicted in the foregoing diagram may be represented as follows:

      <0, 1, A, [], [radio]>
      <1, 2, N, [], [broadcasts]>
      <0, 2, NP, [], [A,N]>

that is, as ordered 5-tuples (quintuples) wherein the left- and right-hand vertices and the label of the edge are recorded in the first through third entries. The fourth entry will be described below. The fifth entry consists of a list of the right-hand side symbols which were matched when the corresponding production was applied.

Initialisation Step. If the interpreter works from left to right through the tokens of a putative sentence, and if, in this example, the production set includes the productions

      N --> radio
      V --> radio

in addition to the production

      A --> radio

cited above, then the following three edges will be stored in the chart:

      <0, 1, A, [], [radio]>
      <0, 1, N, [], [radio]>
      <0, 1, V, [], [radio]>

and these will be the first edges added to the chart. The storing of such lexical category edges constitutes what is called the "Initialisation Step" of a chart parsing procedure.
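As a rough sketch, the Initialisation Step for one token might look as follows in Python. The dictionary LEXICON and the function initialise are assumptions made for this illustration; edges are written as plain 5-tuples, with Python tuples standing in for the bracketed lists.

```python
# Hypothetical lexicon holding the three lexical productions for "radio".
LEXICON = {"radio": ["A", "N", "V"]}


def initialise(tokens, position, lexicon):
    """Return the inactive lexical edges for the token that starts at `position`."""
    word = tokens[position]
    return [(position, position + 1, category, (), (word,))
            for category in lexicon.get(word, [])]


edges = initialise(["radio", "broadcasts", "pay"], 0, LEXICON)
for edge in edges:
    print(edge)
# (0, 1, 'A', (), ('radio',))
# (0, 1, 'N', (), ('radio',))
# (0, 1, 'V', (), ('radio',))
```

Each lexical edge spans exactly one token, so its right-hand vertex is always one greater than its left-hand vertex.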

If the interpreter works bottom-up and depth-first, then, before further lexical category edges are added, constituent productions are tested. The first right-hand side symbols of these productions are compared with the labels of edges already in the chart. For those productions for which there is a match, edges are added to the chart. Note that, at this stage, no attempt is made to match the other right-hand side symbols of productions with more than one right-hand symbol. Thus, for example, in the case of the production

      NP --> A N

no attempt is made to match the second right-hand symbol N; it is sufficient that, if the edge

      <0, 1, A, [], [radio]>

is in the chart, the first right-hand symbol, A, of the candidate production matches its label. Since there is a match, an edge of the form

      <0, 0, NP, [A,N], []>

is added to the chart. Edges such as these are special, intermediate elements, which do not correspond to an edge which could be depicted (easily, in any event) in a diagram such as that presented above. (In the context of a production system, edges of this nature correspond to the "selection" of a production, on the basis that it could be applied; but, the production has not yet actually been applied.) Both of the vertices of these edges, in the first and second entries of the 5-tuple, match the left-hand vertex of the edge the label of which has been matched. This information indicates where, if the production can be applied, the resulting edge will start. The edge label, as has been described already, appears as the third entry of the 5-tuple.

The fourth entry is a list of the right-hand side symbols of the corresponding production. Recall that the fourth entry of the 5-tuples cited above contained a null list. In those examples, edges corresponding to productions which had actually been applied were shown. In those cases, matches for all the right-hand side symbols of the corresponding productions had been found, and this fact is represented by the null list. In this case, since the production has not yet been applied, all the right-hand side symbols are listed. Note further that the fifth entry of this 5-tuple is the null list, while the fifth entry of the previously illustrated edges contained the right-hand side symbols of the corresponding productions. The fifth entry is used to list the right-hand side symbols for which matches have been found; the null list shows that, since the corresponding production has not yet been applied, none of the right-hand side symbols has been matched.

The fourth and fifth entries of an edge may be referred to as the "ToFind" and "Found" fields, respectively, of a chart data element. Thus, the format of an edge may be depicted as follows:

      < Left_Vertex, Right_Vertex, Label, ToFind, Found >

An edge with the null list in its ToFind field is said to be "inactive," because matches have been found for (all of) the right-hand side symbols of the production to which it corresponds. The Found field of such edges will contain a list of the right-hand side symbols.

An edge with a list of one or more symbols in its ToFind field is said to be "active" because the interpreter will continue to seek matches for symbols in the ToFind list. An edge with the null list in its Found field, indicating that no matches have yet been established, is sometimes described as an "empty edge." Of course, empty edges are active edges.
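The edge format and the three descriptions (inactive, active, empty) can be sketched as a small Python type. The class name Edge and the method names are illustrative choices, not terminology fixed by the text.

```python
from typing import NamedTuple, Tuple


class Edge(NamedTuple):
    """A chart edge: < Left_Vertex, Right_Vertex, Label, ToFind, Found >."""
    left: int
    right: int
    label: str
    to_find: Tuple[str, ...]
    found: Tuple[str, ...]

    def is_active(self) -> bool:
        # Active: at least one right-hand side symbol is still to be matched.
        return len(self.to_find) > 0

    def is_empty(self) -> bool:
        # Empty: no right-hand side symbol has been matched yet.
        return len(self.found) == 0


inactive = Edge(0, 2, "NP", (), ("A", "N"))
partial = Edge(0, 1, "NP", ("N",), ("A",))
empty = Edge(0, 0, "NP", ("A", "N"), ())

print(inactive.is_active())                    # False
print(partial.is_active(), partial.is_empty())  # True False
print(empty.is_active(), empty.is_empty())      # True True
```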

Bottom-up Rule. The addition to the chart of empty edges such as <0, 0, NP, [A,N], []> is effected by the interpreter according to what is called the "Bottom-up Rule" of chart parsing. The Bottom-up Rule is applied in conjunction with, and immediately following, the addition of an inactive edge to the chart. For example, the empty active edge <0, 0, NP, [A,N], []> would be added following addition of the inactive edge <0, 1, A, [], [radio]>. A general statement of the Bottom-up Rule is as follows:

If an edge < i, j, A, [], [X'] > is added to the chart, then for every production in the grammar of the form B --> A Y', add the edge < i, i, B, [A,Y'], [] > to the chart, where X' may be a terminal symbol, or a sequence of one or more nonterminals, and Y' may denote a sequence of one or more nonterminal symbols, or the null string.

Note that the inactive edge need not necessarily be added to the chart as a consequence of the application of a lexical category production. For example, the addition to the chart of the inactive edge <0, 2, NP, [], [A,N]>, as a result of the matching of the right-hand symbols A and N of the production NP --> A N with the labels of the edges <0, 1, A, [], [radio]> and <1, 2, N, [], [broadcasts]>, will result in the empty active edge <0, 0, S, [NP,VP], []> being added to the chart according to the Bottom-up Rule.
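A minimal Python sketch of the Bottom-up Rule follows. The list GRAMMAR and the function bottom_up are assumptions for illustration; edges are plain 5-tuples, and each production is a pair of a left-hand symbol and a tuple of right-hand symbols.

```python
# Constituent productions used in the running example.
GRAMMAR = [("S", ("NP", "VP")), ("NP", ("A", "N")), ("VP", ("V",))]


def bottom_up(edge, grammar):
    """Return the empty active edges licensed by a newly added inactive edge."""
    left, right, label, to_find, found = edge
    if to_find:
        return []  # the rule fires only for inactive edges
    return [(left, left, lhs, rhs, ())
            for lhs, rhs in grammar
            if rhs[0] == label]  # only the first right-hand symbol is compared


print(bottom_up((0, 1, "A", (), ("radio",)), GRAMMAR))
# [(0, 0, 'NP', ('A', 'N'), ())]
print(bottom_up((0, 2, "NP", (), ("A", "N")), GRAMMAR))
# [(0, 0, 'S', ('NP', 'VP'), ())]
```

Both vertices of each new edge equal the left-hand vertex of the triggering edge, which records where the production will start if it is eventually applied.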

Fundamental Rule. The actual application of productions, following their selection, is effected according to what is called the "Fundamental Rule" of chart parsing, a general statement of which is as follows:

If the chart contains edges < j, k, A, [], [X'] > and < i, j, B, [A,Y'], [Z'] >, then add the edge < i, k, B, [Y'], [Z',A] >, where X' may be a terminal symbol, or a sequence of one or more nonterminals, Y' and Z' each may be a sequence of one or more nonterminal symbols, or the null string, and i may equal j, but k will always differ from j. The newly matched symbol A is placed at the end of the Found list, so that Found records the matched symbols in the order in which they appear in the production.

For example, suppose the chart contains edges <0, 1, A, [], [radio]> and <0, 0, NP, [A,N], []>, which have been added according to the Initialisation Step and the Bottom-up Rule, respectively. Then, according to the Fundamental Rule, the edge <0, 1, NP, [N], [A]> is added to the chart. The Found field of this edge (its fifth entry) shows that the symbol A (the first right-hand side symbol of the corresponding production NP --> A N) has matched the label of the edge <0, 1, A, [], [radio]>; but, no match has yet been found for the symbol N, so it remains in the ToFind field.
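The Fundamental Rule can be sketched in Python as a function of one active and one inactive edge. The function name fundamental is hypothetical, and edges are plain 5-tuples. The matched symbol is appended to the end of Found, which reproduces the edges given in the worked examples.

```python
def fundamental(active, inactive):
    """Combine an active edge with an adjacent inactive edge, or return None."""
    i, j, b, to_find, found = active
    j2, k, a, inactive_to_find, _ = inactive
    if not to_find or inactive_to_find:
        return None  # needs one active edge and one inactive edge
    if j != j2 or to_find[0] != a:
        return None  # edges must meet at j, and the label must be the next symbol
    # Move the matched symbol from ToFind to the end of Found.
    return (i, k, b, to_find[1:], found + (a,))


print(fundamental((0, 0, "NP", ("A", "N"), ()), (0, 1, "A", (), ("radio",))))
# (0, 1, 'NP', ('N',), ('A',))
print(fundamental((0, 1, "NP", ("N",), ("A",)), (1, 2, "N", (), ("broadcasts",))))
# (0, 2, 'NP', (), ('A', 'N'))
```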

Observe that, with the addition of the edge <0, 1, NP, [N], [A]>, the production NP --> A N has been only "partially applied"; that is, the interpreter is still seeking a match for the second right-hand side symbol, N. If the production set includes only the productions

      A  --> radio
      N  --> broadcasts
      V  --> pay
      S  --> NP VP
      NP --> A N
      VP --> V

and the chart at this point contains only the three edges

      <0, 1, A, [], [radio]>
      <0, 0, NP, [A,N], []>
      <0, 1, NP, [N], [A]>

then there is no edge with a label matching the as yet unmatched symbol N.

Since only one token of the putative sentence has been consumed so far, and there are others which have not yet been read, the interpreter can perform an Initialisation Step and add the edge <1, 2, N, [], [broadcasts]> to the chart. It is then able to undertake the Fundamental Rule, whereby the symbol N is matched with the label of the edge just added, and add the edge <0, 2, NP, [], [A,N]> to the chart, thereby completing application of the NP --> A N production.

The <0, 2, NP, [], [A,N]> edge is inactive, and hence, the interpreter undertakes the Bottom-up Rule to add the empty active edge <0, 0, S, [NP,VP], []> to the chart, corresponding to selection of the production S --> NP VP because of the match of its first right-hand side symbol with the label of the <0, 2, NP, [], [A,N]> edge. The interpreter will then invoke the Fundamental Rule to add the active edge <0, 2, S, [VP], [NP]>, thereby partially applying the S --> NP VP production.
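The interplay of the three rules can be sketched as a small bottom-up, depth-first recogniser in Python. This is one possible arrangement under stated assumptions, not a definitive implementation: the name chart_parse, the use of a stack as an agenda, and the dictionary and list formats for the lexicon and grammar are all choices made for this illustration.

```python
def chart_parse(tokens, lexicon, grammar):
    """Return the chart as a list of 5-tuple edges, in the order they were added."""
    chart = []
    for position, word in enumerate(tokens):
        # Initialisation Step: lexical edges for the next unread token.
        agenda = [(position, position + 1, cat, (), (word,))
                  for cat in reversed(lexicon.get(word, []))]
        while agenda:
            edge = agenda.pop()  # treating the agenda as a stack gives depth-first order
            if edge in chart:
                continue
            chart.append(edge)
            left, right, label, to_find, found = edge
            new_edges = []
            if not to_find:
                # Bottom-up Rule: select productions whose first symbol matches.
                new_edges += [(left, left, lhs, rhs, ())
                              for lhs, rhs in grammar if rhs[0] == label]
                # Fundamental Rule: new inactive edge meets existing active edges.
                new_edges += [(i, right, b, tf[1:], f + (label,))
                              for (i, j, b, tf, f) in chart
                              if tf and j == left and tf[0] == label]
            else:
                # Fundamental Rule: new active edge meets existing inactive edges.
                new_edges += [(left, k, label, to_find[1:], found + (a,))
                              for (j, k, a, tf, f) in chart
                              if not tf and j == right and a == to_find[0]]
            agenda.extend(reversed(new_edges))
    return chart


LEXICON = {"radio": ["A"], "broadcasts": ["N"], "pay": ["V"]}
GRAMMAR = [("S", ("NP", "VP")), ("NP", ("A", "N")), ("VP", ("V",))]

chart = chart_parse("radio broadcasts pay".split(), LEXICON, GRAMMAR)
for edge in chart:
    print(edge)
print((0, 3, "S", (), ("NP", "VP")) in chart)  # True: the sequence is recognised
```

With this lexicon and grammar the sketch adds eleven edges, beginning with <0, 1, A, [], [radio]>, <0, 0, NP, [A,N], []> and <0, 1, NP, [N], [A]> as described above, and ending with the inactive edge <0, 3, S, [], [NP,VP]> that spans the whole sequence.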

The selection and the progress of partial applications of productions are frequently represented in descriptions of the chart parsing procedure by including a dot among the right-hand side symbols of a production. For example, given a production A --> B, the fact that it has been selected, but not yet applied, is indicated by writing A --> .B, and then by writing A --> B. after the production has been applied. With a production such as S --> NP VP, writing S --> .NP VP shows that it has been selected, writing S --> NP .VP shows that NP has been matched, and S --> NP VP. shows that both right-hand side symbols have been matched, and hence, the production has been applied.
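The dotted form of a production can be read directly off an edge: the symbols in Found precede the dot and the symbols in ToFind follow it. A small Python sketch, with dotted as a hypothetical helper name and edges as plain 5-tuples:

```python
def dotted(edge):
    """Render the production behind an edge in dotted-rule notation."""
    _, _, label, to_find, found = edge
    if to_find:
        # The dot is attached to the next symbol still to be matched.
        rhs = " ".join(list(found) + ["." + to_find[0]] + list(to_find[1:]))
    else:
        # Everything has been matched, so the dot closes the production.
        rhs = " ".join(found) + "."
    return f"{label} --> {rhs}"


print(dotted((0, 0, "S", ("NP", "VP"), ())))   # S --> .NP VP
print(dotted((0, 2, "S", ("VP",), ("NP",))))   # S --> NP .VP
print(dotted((0, 3, "S", (), ("NP", "VP"))))   # S --> NP VP.
```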

The "dotted rule" notation was introduced into common use by Earley (1970) in his description of a recogniser for sentences generated by context-free grammars. The Earley algorithm might be regarded as the precursor to the chart parser devised by Kay (1970). Both procedures are included in the class of "tabular recognisers," a name deriving from the nature of the structure employed as a memory device to keep track of the progress of the recognition by storing intermediate stages (consisting of well-formed substrings), and enabling the procedure to emulate parallel processing, whereby all alternatives are followed concurrently, eliminating the need for backtracking, and yielding all possible structures of a given sentence.

Figures

The following figures illustrate the progress of the recognition of the sequence of tokens

       radio broadcasts pay

given the lexical productions

       A --> radio
       N --> broadcasts
       V --> pay

and the constituent productions

       S  --> NP VP
       NP --> A N
       VP --> V

In the figures, the steps of the recognition process are numbered according to the order in which they occur, and the chart parsing rule applied (Initialisation, Bottom-up, or Fundamental) is identified. The production applied, and the edge added to the chart, are shown. The progress of the application of each production is indicated using the dot convention. A partial tree is also included for each step to illustrate the progress of the recognition as each edge is added to the chart.

   1. Initialisation Step

      <0, 1, A, [], [radio]>
      A --> radio.

          A
          |
      0-radio-1

   2. Bottom-up Rule

      <0, 0, NP, [A,N], []>
      NP --> .A N

         NP
        /  \
       A    N

          A
          |
      0-radio-1

   3. Fundamental Rule

      <0, 1, NP, [N], [A]>
      NP --> A .N

              NP
            _/  \
           /     N
          A
          |
      0-radio-1

   4. Initialisation Step

      <1, 2, N, [], [broadcasts]>
      N --> broadcasts.

              NP
            _/  \
           /     N
          A         N
          |         |
      0-radio-1-broadcasts-2

   5. Fundamental Rule

      <0, 2, NP, [], [A,N]>
      NP --> A N.

              NP
            __/ \__
           /       \
          A         N
          |         |
      0-radio-1-broadcasts-2

   6. Bottom-up Rule

      <0, 0, S, [NP,VP], []>
      S --> .NP VP

               S
              / \
            NP   VP

              NP
            __/ \__
           /       \
          A         N
          |         |
      0-radio-1-broadcasts-2

   7. Fundamental Rule

      <0, 2, S, [VP], [NP]>
      S --> NP .VP

                    S
                 __/ \__
                /       \
              NP         VP
            __/ \__
           /       \
          A         N
          |         |
      0-radio-1-broadcasts-2

   8. Initialisation Step

      <2, 3, V, [], [pay]>
      V --> pay.

                    S
                 __/ \__
                /       \
              NP         VP
            __/ \__
           /       \
          A         N         V
          |         |         |
      0-radio-1-broadcasts-2-pay-3

   9. Bottom-up Rule

      <2, 2, VP, [V], []>
      VP --> .V

                    S
                 __/ \__
                /       \
              NP         VP    VP
            __/ \__             |
           /       \            V
          A         N         V
          |         |         |
      0-radio-1-broadcasts-2-pay-3

   10. Fundamental Rule

      <2, 3, VP, [], [V]>
      VP --> V.

                    S
                 __/ \__
                /       \
              NP         VP
            __/ \__           VP
           /       \          |
          A         N         V
          |         |         |
      0-radio-1-broadcasts-2-pay-3

   11. Fundamental Rule

      <0, 3, S, [], [NP,VP]>
      S --> NP VP.

                     S
                 ___/ \___
                /         \
              NP           \
            __/ \__          VP
           /       \          |
          A         N         V
          |         |         |
      0-radio-1-broadcasts-2-pay-3

The following figures show chart representations of three structures of "radio broadcasts pay." These diagrams show the spanning of the tokens of the sentence by the labelled edges comprising the chart. Vertices of each edge may be determined by following the vertical lines down to the numbers inserted between the words and at either end of the sentence. The list representation of each structure is also shown.

      [s, [np, [a, [radio]], [n, [broadcasts]]], [vp, [v, [pay]]]]

                            S
      |--------------------------------------------|
      |                                            |
      |             NP                     VP      |
      |-----------------------------|--------------|
      |                             |              |
      |       A             N       |      V       |
      |--------------|--------------|--------------|
      |              |              |              |
      |     radio    |  broadcasts  |     pay      |
      0--------------1--------------2--------------3

      [s, [np, [n, [radio]]], [vp, [v, [broadcasts]], [np, [n, [pay]]]]]

                            S
      |--------------------------------------------|
      |                                            |
      |                            VP              |
      |              |-----------------------------|
      |              |                             |
      |       NP     |                     NP      |
      |--------------|              |--------------|
      |              |              |              |
      |       N      |      V       |      N       |
      |--------------|--------------|--------------|
      |              |              |              |
      |     radio    |  broadcasts  |     pay      |
      0--------------1--------------2--------------3

      [s, [vp, [v, [radio]]], [s, [np, [n, [broadcasts]]], [vp, [v, [pay]]]]]

                            S
      |--------------------------------------------|
      |                                            |
      |                             S              |
      |              |-----------------------------|
      |              |                             |
      |       VP     |      NP             VP      |
      |--------------|--------------|--------------|
      |              |              |              |
      |       V      |      N       |      V       |
      |--------------|--------------|--------------|
      |              |              |              |
      |     radio    |  broadcasts  |     pay      |
      0--------------1--------------2--------------3
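The correspondence between a list representation and the edges of its chart can be sketched in Python. The function spans is a hypothetical helper; it assumes the nested-list format used above, in which a node is a list whose first item is its label and a word is a one-item list.

```python
def spans(tree, start=0):
    """Return (edges, end), where edges holds a (left, right, label) per node."""
    label, children = tree[0], tree[1:]
    edges, position = [], start
    for child in children:
        if len(child) == 1:
            position += 1  # a word such as ["radio"] occupies one token position
        else:
            child_edges, position = spans(child, position)
            edges.extend(child_edges)
    edges.append((start, position, label))
    return edges, position


second = ["s", ["np", ["n", ["radio"]]],
               ["vp", ["v", ["broadcasts"]], ["np", ["n", ["pay"]]]]]
edges, end = spans(second)
for edge in edges:
    print(edge)
# (0, 1, 'n')
# (0, 1, 'np')
# (1, 2, 'v')
# (2, 3, 'n')
# (2, 3, 'np')
# (1, 3, 'vp')
# (0, 3, 's')
```

The vertex pairs printed for the second structure agree with the chart drawn for it: VP spans vertices 1 to 3, and S spans the whole sequence from 0 to 3.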

posted @ 2006-09-23 02:41