Analyzing Sentence Structure

Earlier chapters focused on words: how to identify them, analyze their structure, assign them to lexical categories, and access their meanings. We have also seen how to identify patterns in word sequences or n-grams.

However, these methods only scratch the surface of the complex constraints that govern sentences. We need a way to deal with the ambiguity that natural language is famous for. This chapter sets out to answer the following questions: How can we use a formal grammar to describe the structure of an unlimited set of sentences? How do we represent the structure of sentences using syntax trees? How do parsers analyze a sentence and automatically build a syntax tree? Along the way, we will cover the fundamentals of English syntax, and see that there are systematic aspects of meaning that are much easier to capture once we have identified the structure of sentences.

1   Linguistic Data and Unlimited Possibilities

Previous chapters have shown you how to process and analyze text corpora, and we have stressed the challenges for NLP in dealing with the vast amount of electronic language data that is growing daily. Imagine a gigantic corpus containing everything that has been uttered or written in English in recent decades. Such a corpus would contain many odd word sequences, but speakers of English will say that most such examples are errors, and therefore not part of English after all. Accordingly, we can argue that “modern English” is not equivalent to the very big set of word sequences in our imaginary corpus. Speakers of English can make judgements about these sequences, and will reject some of them as being ungrammatical. Equally, it is easy to compose a new sentence and have speakers agree that it is perfectly good English. For example, sentences have the interesting property that they can be embedded inside larger sentences.
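The embedding property can be made concrete with a few lines of Python. The following is only a minimal sketch: the template strings, the base sentence, and the function name `embed` are all made up for illustration, and the `{S}` slot simply marks where a whole sentence is inserted.

```python
# Hypothetical templates; "{S}" marks the slot that takes a whole sentence.
TEMPLATES = [
    "I think {S}",
    "Andre said {S}",
    "the newspaper reported that {S}",
]


def embed(sentence, depth):
    """Wrap `sentence` inside `depth` templates, cycling through TEMPLATES."""
    for i in range(depth):
        template = TEMPLATES[i % len(TEMPLATES)]
        sentence = template.format(S=sentence)
    return sentence


# Each extra level of embedding yields a longer sentence built from the last.
for depth in range(4):
    print(embed("the dog barked", depth))
```

Because `depth` can be any non-negative integer, a handful of finite templates already describes an unbounded set of sentences.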

Patterns such as “Andre said S” and “I think S”, where S stands for a whole sentence, are templates for taking a sentence and constructing a bigger sentence. With a bit of ingenuity we can construct some really long sentences using these templates. Here’s an impressive example from a Winnie the Pooh story by A. A. Milne. We can see from this example that language provides us with constructions which seem to allow us to extend sentences indefinitely. It is also striking that we can understand sentences of arbitrary length that we’ve never heard before: it’s not hard to concoct an entirely novel sentence, one that has probably never been used before in the history of the language, yet all speakers of the language will understand it.

The purpose of a grammar is to give an explicit description of a language. But the way in which we think of a grammar is closely intertwined with what we consider to be a language.

Is it a large but finite set of observed utterances and written texts? Is it something more abstract like the implicit knowledge that competent speakers have about grammatical sentences? Or is it some combination of the two? We won’t take a stand on this issue, but instead will introduce the main approaches. In this chapter, we will adopt the formal framework of “generative grammar”, in which a “language” is considered to be nothing more than an enormous collection of all grammatical sentences, and a grammar is a formal notation that can be used for “generating” the members of this set.

A well-known example of ambiguity comes from the Groucho Marx movie Animal Crackers (1930): While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don’t know.
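To see how a formal grammar can expose the ambiguity in the elephant sentence, here is a minimal sketch in plain Python. The toy grammar `GRAMMAR` and the function `count_parses` are assumptions introduced for illustration; they are not a definitive account of English, and a real project would more likely use a parsing library. The function counts how many distinct syntax trees the grammar assigns to a sequence of words.

```python
from functools import lru_cache

# Toy grammar (an assumption for illustration): each nonterminal maps to a
# list of right-hand sides. Any symbol that is not a key is treated as a word.
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}


def count_parses(tokens, start="S"):
    """Count the distinct trees that derive `tokens` from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        # Number of trees in which `symbol` covers exactly tokens[i:j].
        if symbol not in GRAMMAR:  # a word: must match a single token
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(expand(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def expand(rhs, i, j):
        # Number of ways the symbol sequence `rhs` covers tokens[i:j].
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        # The first symbol covers tokens[i:k]; every remaining symbol
        # must be left at least one token.
        return sum(
            derive(rhs[0], i, k) * expand(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return derive(start, 0, len(tokens))


print(count_parses("I shot an elephant in my pajamas".split()))
```

Under this toy grammar the sentence receives two analyses: in one, “in my pajamas” attaches to the verb phrase (the shooting happened in pajamas); in the other, it attaches to the noun phrase (the elephant was in the pajamas). The joke depends on the second reading.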