[ Thanks to linuxaria for this link. ]
“Today I present an excellent and comprehensive article on an
open source search engine, Nutch. You can find the original
article with the code examples here.”

“After reading this article, readers should be somewhat familiar
with the basic crawling concepts and core MapReduce jobs in
Nutch.”

“What is a web crawler?”
“A web crawler is a computer program that discovers and downloads
content from the web via the HTTP protocol. The discovery process
of a crawler is usually simple and straightforward. The crawler is
first given a set of URLs, often called seeds. It then downloads
the content from those URLs and extracts hyperlinks, or new URLs,
from the downloaded content. This is exactly what happens in the
real world when a human sits in front of a web browser and clicks
links on a homepage, and then on the pages that follow, one after
another.”
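To make that loop concrete, here is a minimal sketch of the discover, download, and extract cycle in Java, the language Nutch itself is written in. This is not Nutch's implementation: the seed URL, the regex-based link extraction, and the fixed ten-page budget are all simplifying assumptions made for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniCrawler {

    // Naive link extraction: pull absolute href="..." values out of the HTML.
    // Real crawlers use a proper HTML parser instead of a regex.
    private static final Pattern HREF =
            Pattern.compile("href=\"(https?://[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // The frontier starts out containing only the seed URLs.
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add("https://example.com/");        // placeholder seed URL

        Set<String> seen = new HashSet<>(frontier);  // avoid re-fetching a URL
        int budget = 10;                             // stop after ten pages

        while (!frontier.isEmpty() && budget-- > 0) {
            String url = frontier.poll();

            // Download the content from the next URL over HTTP.
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Fetched " + url + " (" + response.statusCode() + ")");

            // Extract outgoing links and add unseen ones back to the frontier.
            Matcher m = HREF.matcher(response.body());
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
    }
}
```

A production crawler such as Nutch adds what this sketch deliberately omits: robots.txt handling, per-host politeness delays, URL normalization and deduplication, and persistent crawl state, which in Nutch's case is managed by the MapReduce jobs the article goes on to describe.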