Under the Hood in Apache Lucene 4.0
Aug 23, 2011, 18:02
(Other stories by Sam Dean)
"One of the most significant changes in Lucene 4.0 is the full
switch to using bytes (UTF8) in place of text strings for indexing
within the search engine library. This change has improved the
efficiency of a number of core processes: the 'term dictionary',
used as a core part of the index, can now be loaded up to 30 times
faster; it uses 10% of the memory; and search speeds are increased
by removing the need for string conversion.
"This switch to using bytes for indexing has also facilitated
one of the main goals for Lucene 4.0, which is 'flexible indexing'.
The data structure for the index format can now be chosen and
loaded into Lucene as a pluggable codec. As such, optimised codecs
can be loaded to suit the indexing of individual datasets or even