Of Spiders and Scrapers: Decomposing Web Pages 101

Jul 30, 2009, 10:34
(Other stories by Martin Streicher)

" With so many different platforms connecting to the Internet these days, the traditional, HTML Web page is just one of many outlets of information. RSS syndicates content to aggregators and specialized readers; messaging services such as Twitter and Facebook keep audiences engaged with frequent, even real-time alerts; and programmatic interfaces, or APIs, provide automated access and further blur the distinction between client and server. If you're authoring a specialized client or a "mashup" application for a new site, there's likely no shortage of methods to collect and repurpose content.

"Of course, not all sites proffer slick RESTful interfaces and XML feeds. Indeed, most don't. In those cases, collecting data requires some good, old-fashioned scraping: identify the pages you want, download the content, and sift through the text or HTML of each page to extract the pertinent data. Depending on the complexity of the source, scraping can be simple or extremely difficult; nonetheless, the tools required are largely the same from task to task."

Complete Story
