Crawling in Open Source, Part 1

Feb 20, 2011, 15:02

[ Thanks to linuxaria for this link. ]

"Today I present you this excellent and comprehensive article on an open source search engine: Nutch, you can find the original article with the code examples here

"After reading this article readers should be somewhat familiar with the basic crawling concepts and core MapReduce jobs in Nutch.

"What is a web crawler?

"A Web Crawler is a computer program that usually discovers and downloads content from the web via an HTTP protocol. The discovery process of a crawler is usually simple and straightforward. A crawler is first given a set of URLs, often called seeds. Next the crawler goes and downloads the content from those URLs and then extracts hyperlinks or URLs from the downloaded content. This is exactly the same thing that happens in the real world when a human is interfacing with a web browser and clicks on links from a homepage, and pages that follow, one after another."

Complete Story
