Crawling in Open Source, Part 1 | Linux Today

Written By
Web Webster
Feb 20, 2011

[ Thanks to linuxaria for this link. ]

“Today I present to you this excellent and comprehensive
article on an open source search engine, Nutch. You can find the
original article with the code examples here.

“After reading this article readers should be somewhat familiar
with the basic crawling concepts and core MapReduce jobs in
Nutch.

“What is a web crawler?

“A Web Crawler is a computer program that usually discovers and
downloads content from the web via the HTTP protocol. The discovery
process of a crawler is usually simple and straightforward. A
crawler is first given a set of URLs, often called seeds. Next, the
crawler downloads the content from those URLs and then
extracts hyperlinks or URLs from the downloaded content. This is
much the same thing that happens in the real world when a human
uses a web browser and clicks on links from a
homepage, and on the pages that follow, one after another.”
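
The discovery loop described in the excerpt (start from seeds, download each page, extract its hyperlinks, repeat) can be sketched in a few lines of Python. This is only an illustrative, single-process sketch and not how Nutch itself is implemented; Nutch runs these steps as MapReduce jobs. The names `crawl`, `extract_links`, `fetch_http`, and the `max_pages` limit are assumptions introduced here, and the sketch deliberately ignores robots.txt, politeness delays, and retries.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for the hyperlinks found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page URL and drop #fragments.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def fetch_http(url):
    """Download a page over HTTP (hypothetical helper: no retries, no robots.txt)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch=fetch_http, max_pages=10):
    """Breadth-first crawl starting from the seed URLs; returns {url: html}."""
    frontier = deque(seeds)   # URLs waiting to be downloaded
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        for link in extract_links(url, html):
            # Only follow HTTP(S) links that have not been queued before.
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing a different `fetch` function (for example, one that reads from an in-memory dictionary) lets the loop be exercised without touching the network, while `crawl(["http://example.com/"])` would use the default HTTP download.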



