Linux Today: Linux News On Internet Time.


Of Spiders and Scrapers: Decomposing Web Pages 101

Jul 30, 2009, 10:34 (0 Talkback[s])
(Other stories by Martin Streicher)



"With so many different platforms connecting to the Internet these days, the traditional HTML Web page is just one of many outlets of information. RSS syndicates content to aggregators and specialized readers; messaging services such as Twitter and Facebook keep audiences engaged with frequent, even real-time alerts; and programmatic interfaces, or APIs, provide automated access and further blur the distinction between client and server. If you're authoring a specialized client or a "mashup" application for a new site, there's likely no shortage of methods to collect and repurpose content.

"Of course, not all sites proffer slick RESTful interfaces and XML feeds. Indeed, most don't. In those cases, collecting data requires some good, old-fashioned scraping: identify the pages you want, download the content, and sift through the text or HTML of each page to extract the pertinent data. Depending on the complexity of the source, scraping can be simple or extremely difficult; nonetheless, the tools required are largely the same from task to task."
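The three steps the excerpt describes (identify the pages, download the content, sift through the HTML to extract the data) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the author's code: the sample HTML, the `LinkExtractor` class, and the tag of interest are all hypothetical.

```python
# A minimal scraping sketch using only the Python standard library.
# In a real scrape, the page would first be downloaded, e.g. with
# urllib.request.urlopen(url).read().decode(); here a sample page
# is inlined so the example is self-contained.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Sift through HTML and collect the href of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Hypothetical downloaded page content.
page = '<html><body><a href="/story/1">One</a> <a href="/story/2">Two</a></body></html>'

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # → ['/story/1', '/story/2']
```

The same pattern scales to messier pages: subclass `HTMLParser`, override the handler methods for the tags you care about, and accumulate the pertinent data as the parser walks the document.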

Complete Story
