Of Spiders and Scrapers: Decomposing Web Pages 101
Jul 30, 2009, 10:34
By Martin Streicher
"With so many different platforms connecting to the Internet
these days, the traditional HTML Web page is just one of many
outlets of information. RSS syndicates content to aggregators and
specialized readers; messaging services such as Twitter and
Facebook keep audiences engaged with frequent, even real-time
alerts; and programmatic interfaces, or APIs, provide automated
access and further blur the distinction between client and server.
If you're authoring a specialized client or a 'mashup' application
for a new site, there's likely no shortage of methods to collect
and repurpose content.
"Of course, not all sites proffer slick RESTful interfaces and
XML feeds. Indeed, most don't. In those cases, collecting data
requires some good, old-fashioned scraping: identify the pages you
want, download the content, and sift through the text or HTML of
each page to extract the pertinent data. Depending on the
complexity of the source, scraping can be simple or extremely
difficult; nonetheless, the tools required are largely the same
from task to task."
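
The three steps named in the excerpt (identify the pages, download the content, sift through the HTML) can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the approach taken in the original article. The names LinkExtractor, download, extract_links, and SAMPLE_PAGE are hypothetical, and the sample markup stands in for a real page so the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def download(url):
    """Fetch a page and decode it (UTF-8 is an assumption made for brevity)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Sift through the HTML and keep only the pertinent data."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup standing in for a downloaded page.
SAMPLE_PAGE = """
<html><body>
  <h1>Headlines</h1>
  <ul>
    <li><a href="/story/1">First story</a></li>
    <li><a href="/story/2">Second <b>story</b></a></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_PAGE):
        print(href, text)
```

Against a live site, the output of download(url) would be passed to extract_links in place of SAMPLE_PAGE. Using an event-driven parser rather than regular expressions is a design choice here: it tolerates nested tags inside a link, which pattern matching on raw text tends to handle poorly.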