Under the Hood in Apache Lucene 4.0

Aug 23, 2011, 18:02
(Other stories by Sam Dean)

"One of the most significant changes in Lucene 4.0 is the full switch to using bytes (UTF8) in place of text strings for indexing within the search engine library. This change has improved the efficiency of a number of core processes: the 'term dictionary', used as a core part of the index, can now be loaded up to 30 times faster; it uses 10% of the memory; and search speeds are increased by removing the need for string conversion.

"This switch to using bytes for indexing has also facilitated one of the main goals for Lucene 4.0, which is 'flexible indexing'. The data structure for the index format can now be chosen and loaded into Lucene as a pluggable codec. As such, optimised codecs can be loaded to suit the indexing of individual datasets or even individual fields."

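To make the two ideas in the excerpt concrete, here is a minimal, self-contained Java sketch. It does not use Lucene, and every name in it (FlexibleIndexingSketch, PostingsCodec, PlainCodec, DeltaCodec, PerFieldCodecs) is hypothetical and chosen only for illustration. The first part shows terms held as UTF-8 bytes and compared byte by byte, with no conversion back to strings. The second part shows a per-field lookup of interchangeable encoders, which is roughly the shape the "pluggable codec" idea suggests. Delta encoding of document ids is an assumed example of an alternative format, not a claim about what Lucene 4.0 ships.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: none of these types are Lucene classes.
class FlexibleIndexingSketch {

    // Encode a term to UTF-8 once; later comparisons work on the bytes.
    static byte[] toTermBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Compare two terms as unsigned bytes, without decoding to String.
    static int compareTerms(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Hypothetical codec: turns a sorted list of document ids into a stored form.
    interface PostingsCodec {
        int[] encode(int[] sortedDocIds);
        int[] decode(int[] stored);
    }

    // Stores the ids unchanged.
    static final class PlainCodec implements PostingsCodec {
        public int[] encode(int[] sortedDocIds) { return sortedDocIds.clone(); }
        public int[] decode(int[] stored) { return stored.clone(); }
    }

    // Stores the gaps between consecutive ids, which are smaller numbers.
    static final class DeltaCodec implements PostingsCodec {
        public int[] encode(int[] sortedDocIds) {
            int[] out = new int[sortedDocIds.length];
            int prev = 0;
            for (int i = 0; i < sortedDocIds.length; i++) {
                out[i] = sortedDocIds[i] - prev;
                prev = sortedDocIds[i];
            }
            return out;
        }
        public int[] decode(int[] stored) {
            int[] out = new int[stored.length];
            int sum = 0;
            for (int i = 0; i < stored.length; i++) {
                sum += stored[i];
                out[i] = sum;
            }
            return out;
        }
    }

    // Chooses a codec per field name, falling back to a default.
    static final class PerFieldCodecs {
        private final Map<String, PostingsCodec> byField = new HashMap<>();
        private final PostingsCodec fallback;

        PerFieldCodecs(PostingsCodec fallback) { this.fallback = fallback; }

        void register(String field, PostingsCodec codec) { byField.put(field, codec); }

        PostingsCodec forField(String field) {
            PostingsCodec codec = byField.get(field);
            return codec != null ? codec : fallback;
        }
    }

    public static void main(String[] args) {
        // Byte-wise ordering of two terms.
        System.out.println(compareTerms(toTermBytes("apple"), toTermBytes("zebra")) < 0);

        // The "body" field gets the delta codec; other fields use the plain one.
        PerFieldCodecs codecs = new PerFieldCodecs(new PlainCodec());
        codecs.register("body", new DeltaCodec());
        int[] docIds = {3, 7, 12};
        System.out.println(Arrays.toString(codecs.forField("body").encode(docIds)));
        System.out.println(Arrays.toString(codecs.forField("title").encode(docIds)));
    }
}
```

In this sketch the caller never needs to know which encoding a field uses: it asks PerFieldCodecs for the codec and calls encode or decode. Swapping the format for one field is a single register call, which is the kind of per-dataset or per-field tuning the excerpt describes.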