developerWorks: Charming Python: The Natural Language Toolkit
Jun 28, 2004, 03:30
By David Mertz
"Your humble writer knows a little bit about a lot of things;
but despite writing a fair amount about text processing (a book,
for example), linguistic processing is a relatively novel area for
me. Forgive me if I stumble through my explanations of the quite
remarkable Natural Language Toolkit (NLTK), a wonderful tool for
teaching, and working in, computational linguistics using Python.
Computational linguistics, moreover, is closely related to the
fields of artificial intelligence, language/speech recognition,
translation, and grammar checking.
"It is natural to think of NLTK as a stacked series of layers
that build on each other. Readers familiar with lexing and parsing
of artificial languages (like, say, Python) will not have too much
of a leap to understand the similar -- but deeper -- layers
involved in natural language modeling. While NLTK comes with a
number of corpora that have been pre-processed (often manually) to
various degrees, conceptually each layer relies on the processing
in the adjacent lower layer. Tokenization comes first; then words
are tagged; then groups of words are parsed into grammatical
elements, like noun phrases or sentences (according to one of
several techniques, each with advantages and drawbacks); finally
sentences or other grammatical units can be classified..."
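
The layering described in the excerpt can be sketched in a few lines of plain Python. This is a toy illustration only, not NLTK's API: the LEXICON table, the tag names, and the functions tokenize, tag, chunk_noun_phrases, and classify are all hypothetical stand-ins invented for this sketch, and a real tagger or parser would be trained on a corpus rather than driven by a hand-written dictionary. What it tries to show is that each layer consumes only the output of the layer directly below it.

```python
import re

# Hypothetical toy lexicon (an assumption for this sketch); a real tagger
# would learn word-to-tag mappings from a tagged corpus.
LEXICON = {
    "the": "DET", "a": "DET",
    "quick": "ADJ", "lazy": "ADJ",
    "fox": "NOUN", "dog": "NOUN",
    "jumps": "VERB", "sleeps": "VERB",
    "over": "PREP",
}


def tokenize(text):
    """Layer 1: split raw text into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens):
    """Layer 2: pair each token with a part-of-speech tag from the lexicon."""
    tagged = []
    for tok in tokens:
        if not tok[0].isalnum():
            pos = "PUNCT"
        else:
            pos = LEXICON.get(tok.lower(), "UNK")
        tagged.append((tok, pos))
    return tagged


def chunk_noun_phrases(tagged):
    """Layer 3: group runs of (DET)? (ADJ)* (NOUN)+ into noun phrases."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":          # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":   # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":  # one or more nouns
            k += 1
        if k > j:                          # found at least one noun
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases


def classify(tagged):
    """Layer 4: assign a coarse label to the whole unit from its tags."""
    if tagged and tagged[-1][0] == "?":
        return "question"
    has_verb = any(pos == "VERB" for _, pos in tagged)
    return "statement" if has_verb else "fragment"


SENTENCE = "The quick fox jumps over the lazy dog."
tokens = tokenize(SENTENCE)
tagged = tag(tokens)
phrases = chunk_noun_phrases(tagged)
label = classify(tagged)
print(tokens)
print(tagged)
print(phrases)
print(label)
```

Note that chunk_noun_phrases never looks at the raw string and classify never re-tokenizes; swapping in a better tagger would improve the upper layers without changing their code, which is the point of the stacked design.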