Apache Nutch 2.0 indexes at web scale
Jul 11, 2012, 07:00
The Apache Nutch developers have announced that version 2.0 of the web crawling, indexing and search framework is now available. Nutch is built on top of other Apache projects, including Solr, Tika, Hadoop and Gora, and is designed to crawl "at web scale" so that organisations can create searchable indexes of their web-published content. It adds web-specific functionality to Solr with a link-graph database, and uses Tika to parse web pages and a number of other document formats.