This example will more or less bridge the gap between DirectoryFileContents (which looks at the Lucene files on disk) and SearchWithTermsEnum (which looks at how a TermQuery runs against an IndexReader, the layer below IndexSearcher).
The example will cover the following:

1. Reading the latest "segment infos" commit file to discover the segments in the index.
2. Exploring the files that make up a segment, including manually decoding the compound file entries.
3. Using the codec to load the terms for a field, the layer that an IndexReader builds on.
package example.basic;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.codecs.CompoundDirectory;
import org.apache.lucene.codecs.FieldsProducer;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ChecksumIndexInput;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.Lock;
import org.apache.lucene.util.BytesRef;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
public class BottomUpIndexReader {
We’ll create an index similar to the one in DirectoryFileContents, but we’ll stick with the default codec, since we’re going to explore how the codec is used to decode the index files (“codec” is short for “coder and decoder”).
private static Path createLuceneIndex() throws IOException {
Path indexDir = Files.createTempDirectory(DirectoryFileContents.class.getSimpleName());
try (FSDirectory directory = FSDirectory.open(indexDir);
IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig())) {
List<String> texts = List.of(
"The quick fox jumped over the lazy, brown dog",
"Lorem ipsum, dolor sit amet",
"She left the web, she left the loom, she made three paces through the room",
"The sly fox sneaks past the oblivious dog"
);
int i = 0;
for (String text : texts) {
List<IndexableField> doc = new ArrayList<>();
doc.add(new TextField("text", text, Field.Store.YES));
doc.add(new IntField("val", i++, Field.Store.YES));
indexWriter.addDocument(doc);
}
}
return indexDir;
}
public static void main(String[] args) throws IOException {
Path indexDir = createLuceneIndex();
Lock lock = null;
The Lucene Directory class provides an abstraction matching a filesystem directory (with no subdirectories). In our case, the Lucene directory really is a filesystem directory corresponding to the path indexDir.

Note that on most 64-bit Java runtime environments, FSDirectory.open(path) just calls new MMapDirectory(path). Using FSDirectory.open is the more portable choice, though.
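As a quick illustration (a sketch, assuming a 64-bit JVM where memory-mapping is supported, plus an extra import of org.apache.lucene.store.MMapDirectory), the explicit equivalent would be:

/* Roughly what FSDirectory.open(indexDir) does on a 64-bit JVM: */
try (Directory mmapDir = new MMapDirectory(indexDir)) {
System.out.println("Opened a " + mmapDir.getClass().getSimpleName());
}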
try (Directory dir = FSDirectory.open(indexDir)) {
The following snippet outputs the files in the directory and their sizes:
Directory contents:
_0.cfe size = 479
_0.cfs size = 2337
_0.si size = 348
segments_1 size = 154
write.lock size = 0
System.out.println("Directory contents:");
for (String file : dir.listAll()) {
System.out.println(" " + file + " size = " + dir.fileLength(file));
}
The write.lock file is there to make sure that only one IndexWriter can operate on the directory at a time. For fun, let’s lock the directory, to make extra sure that another IndexWriter can’t mess with our files. We’ll release it in the finally block.
lock = dir.obtainLock("write.lock");
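If you want to see the lock doing its job, the following sketch (not part of the original example; it assumes an extra import of org.apache.lucene.store.LockObtainFailedException) tries to obtain the same lock a second time:

/* While we hold write.lock, a second obtainLock should fail. */
try (Lock again = dir.obtainLock("write.lock")) {
System.out.println("This should not print: " + again);
} catch (LockObtainFailedException e) {
System.out.println("As expected, the lock is already held: " + e.getMessage());
}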
The next file of interest is segments_1. An IndexWriter or IndexReader looks at these segments_N files to determine the current “commit generation” and understand what segments are in the index. That is, each time we commit, this number is incremented by 1 and written (in base 36, so 0-9a-z) as a suffix to this “segment infos” file. We’re currently on generation 1.
System.out.println("Commit generation is " + SegmentInfos.getLastCommitGeneration(dir.listAll()));
Unlike most Lucene files, the segment infos file needs a relatively stable format to be understood by different versions of Lucene. Instead of digging into the specific bytes in this file, we’ll use SegmentInfos.readCommit to load it and call some methods to output its contents. Note that the SegmentInfos instance maintains a list of SegmentCommitInfo.
Segment _0 has the following files:
_0.cfe
_0.si
_0.cfs
SegmentInfos segmentInfos = SegmentInfos.readCommit(dir, "segments_1");
for (SegmentCommitInfo sci : segmentInfos) {
SegmentInfo segmentInfo = sci.info;
System.out.println("Segment " + segmentInfo.name + " has the following files:");
for (String segmentFile : sci.files()) {
System.out.println(segmentFile);
}
}
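Each SegmentCommitInfo also carries some per-commit bookkeeping. As a small aside (a sketch, not part of the original output), we could print each segment’s document count and deletion count:

for (SegmentCommitInfo sci : segmentInfos) {
/* maxDoc counts every document written to the segment; getDelCount reports how many have since been deleted (zero here, as explained next). */
System.out.println("Segment " + sci.info.name + ": maxDoc=" + sci.info.maxDoc() + ", delCount=" + sci.getDelCount());
}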
Since Lucene segments are written once, documents are never “deleted” from a segment. Instead, a subsequent commit writes another file (suffixed with the deletion generation and a .liv extension, like _0_9.liv), which is a bit set indicating which documents are still “live” (i.e. not deleted).

A SegmentCommitInfo holds a reference to the (written-once) SegmentInfo and a reference to the live docs for the current commit. If the index uses updatable doc values, it will also reference the (per-commit) doc values updates.
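We can’t demonstrate that here while we hold the write lock, but as a sketch (hypothetical code, requiring an extra import of org.apache.lucene.index.Term), deleting a document in a later session would produce such a file:

/* In a separate session (after releasing our lock), this would write a live-docs file like _0_1.liv plus a new segments_2: */
try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
writer.deleteDocuments(new Term("text", "lorem"));
writer.commit();
}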
The files in each segment start with an underscore and the segment generation (also a base-36 counter, but separate from the commit generation). The file extensions indicate what they contain. In this case, the _0.si file contains the SegmentInfo for the segment. We already loaded this file when we called SegmentInfos.readCommit. Among other things, the SegmentInfo knows what Codec to use to read the other files in the segment. (Technically, the Codec class name is stored in the segments_1 file, since SegmentInfos.readCommit() uses the Codec to read the .si file.)
At the time of writing (and subject to change with new versions), this outputs:
Segment _0 was written with the Lucene99 codec.
SegmentCommitInfo seg0commitInfo = segmentInfos.info(0);
SegmentInfo seg0info = seg0commitInfo.info;
Codec codec = seg0info.getCodec();
System.out.println("Segment _0 was written with the " + codec.getName() + " codec.");
Since our single segment is small, Lucene chose to stick all the individual files into the single file _0.cfs, to avoid consuming many operating system file handles for a bunch of even smaller files. (It’s not a problem with a single small segment, but it can become a problem with many small segments or many small Lucene indices open on the same machine.) The _0.cfe file holds the “compound file entries”, which record the start offset and length of each file stored within the .cfs file.

The compound file segment is modeled as a directory itself.
Compound file segment _0.cfs has the following contents:
_0_Lucene99_0.tip
_0.nvm
_0.fnm
_0_Lucene99_0.doc
_0_Lucene99_0.tim
_0_Lucene99_0.pos
_0_Lucene90_0.dvd
_0.kdd
_0.kdm
_0_Lucene99_0.tmd
_0_Lucene90_0.dvm
_0.kdi
_0.fdm
_0.nvd
_0.fdx
_0.fdt
CompoundDirectory seg0dir = codec.compoundFormat().getCompoundReader(dir, seg0info);
System.out.println("Compound file segment _0.cfs has the following contents:");
for (String innerFile : seg0dir.listAll()) {
System.out.println(" " + innerFile);
}
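Compound files are optional, by the way. Had we wanted the individual files written directly (a sketch of a hypothetical configuration, not what this example does), we could have configured the IndexWriter accordingly:

/* Flushed segments would then write .fnm, .fdt, .tim, etc. as separate files instead of packing them into .cfs/.cfe. */
IndexWriterConfig noCfsConfig = new IndexWriterConfig().setUseCompoundFile(false);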
Let’s decode that .cfe file ourselves.
DON’T TRY THIS AT HOME. (Or at least don’t ever do this in a real application.)
Reading the Lucene90CompoundFormat source code, we see that the .cfe file has the following format:
<header: validates file integrity>
<numfiles: VInt>
<file 0 suffix: string> <file 0 data offset: long> <file 0 data length: long>
<file 1 suffix: string> <file 1 data offset: long> <file 1 data length: long>
...
<file N suffix: string> <file N data offset: long> <file N data length: long>
<footer: validates file integrity>
This assumes we wrote the index using the Lucene90CompoundFormat. If it doesn’t work for you, you can delete or comment out this code. The higher-level APIs should continue to compile and work, but low-level details are subject to change.
File with suffix .nvd starts at offset 48 and has length 63
File with suffix .fdx starts at offset 112 and has length 64
File with suffix .kdi starts at offset 176 and has length 68
File with suffix _Lucene99_0.tip starts at offset 248 and has length 72
File with suffix _Lucene90_0.dvd starts at offset 320 and has length 74
File with suffix .kdd starts at offset 400 and has length 82
File with suffix _Lucene99_0.doc starts at offset 488 and has length 87
File with suffix .nvm starts at offset 576 and has length 103
File with suffix _Lucene99_0.pos starts at offset 680 and has length 114
File with suffix .kdm starts at offset 800 and has length 135
File with suffix .fdm starts at offset 936 and has length 157
File with suffix _Lucene90_0.dvm starts at offset 1096 and has length 162
File with suffix _Lucene99_0.tmd starts at offset 1264 and has length 190
File with suffix .fnm starts at offset 1456 and has length 250
File with suffix _Lucene99_0.tim starts at offset 1712 and has length 290
File with suffix .fdt starts at offset 2008 and has length 313
These should be equal: 479 == 479
try (ChecksumIndexInput cfeInput = dir.openChecksumInput("_0.cfe")) {
/* If/when Lucene changes this file format, this header check should fail. */
CodecUtil.checkIndexHeader(cfeInput, "Lucene90CompoundEntries", 0, 0, seg0info.getId(), "");
int numFiles = cfeInput.readVInt();
for (int i = 0; i < numFiles; i++) {
String fileSuffix = cfeInput.readString();
long start = cfeInput.readLong();
long length = cfeInput.readLong();
System.out.printf("File with suffix %s starts at offset %d and has length %d%n", fileSuffix, start, length);
}
CodecUtil.checkFooter(cfeInput);
/* Check that we actually reached the end of the file */
System.out.println("These should be equal: " + cfeInput.getFilePointer() + " == " + dir.fileLength("_0.cfe"));
}
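Those offsets are all the CompoundDirectory needs to carve an inner file out of _0.cfs. As a sketch (hard-coding the .nvd offset and length printed above, which may differ on your machine; it also assumes an extra import of org.apache.lucene.store.IndexInput), we could slice one out by hand:

try (IndexInput cfsInput = dir.openInput("_0.cfs", IOContext.DEFAULT)) {
/* Slice out _0.nvd using the offset/length we decoded from the .cfe file. */
IndexInput nvdSlice = cfsInput.slice("_0.nvd", 48, 63);
System.out.println("These should match: " + nvdSlice.length() + " == " + seg0dir.fileLength("_0.nvd"));
}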
We’ve previously covered term queries at a high level in SimpleSearch, then at a slightly lower level in SearchWithTermsEnum. Let’s close the gap by reading the terms directly from this segment (essentially what the LeafReader did for us in SearchWithTermsEnum).
First we need the information about the fields in the index, available via FieldInfos, which we read from the .fnm file in our compound file segment. Then we get a FieldsProducer using the codec’s postings format. That lets us load the Terms (and therefore the TermsEnum) for a given field.
The code outputs the following terms:
The field 'text' has the following terms:
amet brown dog dolor fox ipsum jumped lazy left loom lorem made oblivious over paces past quick room
she sit sly sneaks the three through web
FieldInfos fieldInfos = codec.fieldInfosFormat().read(seg0dir, seg0info, "", IOContext.DEFAULT);
SegmentReadState segmentReadState = new SegmentReadState(seg0dir, seg0info, fieldInfos, IOContext.DEFAULT);
try (FieldsProducer fieldsProducer = codec.postingsFormat().fieldsProducer(segmentReadState)) {
Terms terms = fieldsProducer.terms("text");
TermsEnum termsEnum = terms.iterator();
BytesRef termBytesRef;
System.out.println("The field 'text' has the following terms:");
StringBuilder builder = new StringBuilder();
int lineLength = 0;
while ((termBytesRef = termsEnum.next()) != null) {
String termString = termBytesRef.utf8ToString();
builder.append(termString).append(' ');
lineLength += termString.length();
if (lineLength > 80) {
builder.append('\n');
lineLength = 0;
}
}
System.out.println(builder);
}
In the above example, we initialized a FieldsProducer, which first opened an IndexInput “slice” of the _0.cfs IndexInput corresponding to the _0_Lucene99_0.doc file (for the postings – the doc IDs for each term). Then it opened another IndexInput slice in the CFS file for _0_Lucene99_0.tim (the terms), as well as slices for _0_Lucene99_0.tip (the terms index) and _0_Lucene99_0.tmd (the term metadata).

For each indexed field in the index (in this case, just the text field), the postings reader (technically the Lucene90BlockTreeTermsReader) allocated a FieldReader. Each FieldReader holds some field metadata (e.g. first/last term, sum of doc frequencies and term frequencies across all terms for the field), but otherwise it’s a lazy data structure that reads from the files opened above when asked (except the .tmd file – we’re done with that). It’s also the Terms instance that we retrieved above.
Similarly, each time we call termsEnum.next(), we’re just advancing a pointer into a memory-mapped file. The file is probably fully paged into memory (since the whole _0.cfs file is only 2337 bytes), so it’s really not much worse than reading values out of objects in the JVM heap.
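To really close the loop with SearchWithTermsEnum, we can also seek the TermsEnum to a specific term and walk its postings, which is the heart of what a TermQuery does. The following sketch would go inside the try block above, while termsEnum is still open (it assumes extra imports of org.apache.lucene.index.PostingsEnum and org.apache.lucene.search.DocIdSetIterator):

if (termsEnum.seekExact(new BytesRef("fox"))) {
/* NONE means we only need doc IDs, not frequencies or positions. */
PostingsEnum postings = termsEnum.postings(null, PostingsEnum.NONE);
for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
System.out.println("'fox' appears in doc " + doc); // expect docs 0 and 3
}
}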
} finally {
for (String indexFile : FSDirectory.listAll(indexDir)) {
Files.deleteIfExists(indexDir.resolve(indexFile));
}
Files.delete(indexDir);
if (lock != null) {
lock.close();
}
}
}
}