developerWorks: Parsing with the Spark Module
Jan 03, 2003, 05:30
By David Mertz
"In this article, which follows on an earlier installment of
'Charming Python' devoted to SimpleParse, I introduce some basic
concepts in parsing and discuss the Spark module. Parsing
frameworks are a rich topic that warrants quite a bit of study to
get the full picture; these two articles make a good start, for
both readers and myself.
"In my programming life, I have frequently needed to identify
parts and structures that exist inside textual documents: log
files, configuration files, delimited data, and more free-form (but
still semi-structured) report formats. All of these documents have
their own 'little languages' for what can occur within them. The
way I have programmed these informal parsing tasks has always been
something of a hodgepodge of custom state machines, regular
expressions, and context-driven string tests. The pattern in these
programs was always, roughly, 'read a bit of text, figure out if we
can make something of it, maybe read a bit more text afterwards,
keep trying.'
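As a rough illustration of that ad hoc style, here is a minimal sketch in Python. The INI-like sample text, the two regular expressions, and the function name parse_config are all assumptions invented for this example; none of them comes from the article. The "state machine" is nothing more than a variable remembering the current section.

```python
import re

# Hypothetical sample input: an INI-like "little language".
SAMPLE = """\
# global settings
[server]
host = example.org
port = 8080

[paths]
root = /var/www
"""

# Illustrative patterns, not taken from the article.
SECTION = re.compile(r"^\[(\w+)\]$")
SETTING = re.compile(r"^(\w+)\s*=\s*(.*)$")

def parse_config(text):
    """Ad hoc parse: the only 'state' is the current section name."""
    result = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        # Context-driven string test: skip blanks and comments.
        if not line or line.startswith("#"):
            continue
        m = SECTION.match(line)
        if m:
            # State change: later settings belong to this section.
            section = m.group(1)
            result[section] = {}
            continue
        m = SETTING.match(line)
        if m and section is not None:
            result[section][m.group(1)] = m.group(2)
            continue
        # Nothing could be made of this bit of text.
        raise ValueError("cannot make sense of line: %r" % raw)
    return result

config = parse_config(SAMPLE)
```

The code works, but the rules of the format are scattered across tests, patterns, and control flow rather than stated in one place.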
"Parsers distill descriptions of the parts and structures in
documents into concise, clear, and declarative rules identifying
what makes up a document. Most formal parsers use variants on
Extended Backus-Naur Form (EBNF) to define the 'grammars' of the
languages they describe. Basically, an EBNF grammar gives names to
the parts one might find in a document; additionally, larger parts
are frequently composed of smaller parts. The frequency and order
in which small parts may occur in larger parts is specified by
operators..."
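To make the idea of named parts and operators concrete, here is a small sketch. The two-rule grammar in the comments is hypothetical, written in an EBNF-like notation, and the hand-written recursive-descent code below it uses only the Python standard library. It does not use Spark and should not be read as Spark's API; the names tokenize and parse_record are assumptions made for this example.

```python
import re

# Hypothetical grammar in an EBNF-like notation:
#   record ::= NAME '=' value (',' value)*
#   value  ::= NUMBER | NAME
# '|' means "either part"; '*' means "zero or more of the preceding part".

TOKEN = re.compile(
    r"\s*(?:(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[=,]))"
)

def tokenize(text):
    """Split text into (kind, text) pairs, skipping whitespace."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError("unexpected character at %d" % pos)
        kind = m.lastgroup  # name of the alternative that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

def parse_record(tokens):
    """record ::= NAME '=' value (',' value)*"""
    pos = 0

    def expect(kind, text=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok_kind, tok_text = tokens[pos]
        if tok_kind != kind or (text is not None and tok_text != text):
            raise SyntaxError("unexpected token %r" % tok_text)
        pos += 1
        return tok_text

    def value():
        # value ::= NUMBER | NAME
        if pos < len(tokens) and tokens[pos][0] == "NUMBER":
            return int(expect("NUMBER"))
        return expect("NAME")

    name = expect("NAME")
    expect("OP", "=")
    values = [value()]
    # The '*' operator in the grammar becomes a loop.
    while pos < len(tokens) and tokens[pos] == ("OP", ","):
        expect("OP", ",")
        values.append(value())
    if pos != len(tokens):
        raise SyntaxError("trailing input after record")
    return name, values

record = parse_record(tokenize("ports = 80, 443, alt"))
```

Each grammar rule maps onto one function, which is the correspondence a parsing framework automates: the declarative rules are written once and the tool supplies the matching machinery.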