Lucene file content (with SimpleTextCodec)

This worked example helps explain what data structures exist in a Lucene index and how they are stored (albeit with a text representation – the real implementations use very compact binary files).

Unlike some other examples, the interesting part is not the code, but rather the output, which is included below the DirectoryFileContents class.

There are a number of changes that can be made to this example to make it more interesting (at the expense of producing files too large to walk through here). Readers are encouraged to modify the createDocuments method to add other field types, add more documents, etc.

package example.basic;

import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

Example code

public class DirectoryFileContents {

Creating Documents

We will mostly reuse the same documents from SimpleSearch, but we’ll also add a numeric IntField which will write both points and doc values. Points are covered in more detail by examples in the “points” package.

To make the output mildly more interesting, let’s not add the numeric field to one of the documents.

    private static List<List<IndexableField>> createDocuments() {
        List<String> texts = List.of(
                "The quick fox jumped over the lazy, brown dog",
                "Lorem ipsum, dolor sit amet",
                "She left the web, she left the loom, she made three paces through the room",
                "The sly fox sneaks past the oblivious dog"
        );
        List<List<IndexableField>> docs = new ArrayList<>();
        int i = 0;
        for (String text : texts) {
            List<IndexableField> doc = new ArrayList<>();
            doc.add(new TextField("text", text, Field.Store.YES));
            /* I want one document to miss "val" */
            if (++i != 2) {
                doc.add(new IntField("val", i, Field.Store.YES));
            }
            docs.add(doc);
        }
        return docs;
    }

Create the Lucene index with SimpleTextCodec

    public static void main(String[] args) throws IOException {
        Path tmpDir = Files.createTempDirectory(DirectoryFileContents.class.getSimpleName());

In other examples, we have been using the default IndexWriterConfig. This time, we construct the IndexWriterConfig, but override the codec.

Codecs are Lucene’s abstraction that define how low-level constructs are written as files. The default codecs are highly-tuned for compact size and good read/write performance, while SimpleTextCodec is designed to be human-readable. The JavaDoc for SimpleTextCodec (and its associated classes) says FOR RECREATIONAL USE ONLY.

        IndexWriterConfig conf = new IndexWriterConfig();
        conf.setCodec(new SimpleTextCodec());

By default, Lucene writes small segments as a single “compound” file. To make the output easier to read, we disable that with setUseCompoundFile(false).

        conf.setUseCompoundFile(false);
        try (Directory directory = FSDirectory.open(tmpDir);
             IndexWriter writer = new IndexWriter(directory, conf)) {
            for (List<IndexableField> doc : createDocuments()) {
                writer.addDocument(doc);
            }
        } finally {

Dump the index contents

Once we’ve closed the IndexWriter, before we delete each file, we print each file name and its size.

            for (String indexFile : FSDirectory.listAll(tmpDir)) {
                Path path = tmpDir.resolve(indexFile);
                long size = Files.size(path);
                System.out.println("File " + indexFile + " has size " + size);

Don’t output the segmentinfos (.si) file yet, because it may contain non-UTF-8 bytes. (See https://github.com/apache/lucene/pull/12897.) Also, don’t output the “segments_1” file, because it is always a binary file, independent of the codec. (The IndexReader opens the last “segments_*”, which explains what codec was used to write each segment.)

                if (!indexFile.endsWith(".si") && !indexFile.startsWith("segments_")) {
                    System.out.println(Files.readString(path));
                }
                Files.deleteIfExists(path);
            }

Then we delete the directory itself.

            Files.deleteIfExists(tmpDir);
        }
    }
}
/*

Program output

The program dumps the following files. Let’s explore the files one-by-one.

Note that the SimpleTextCodec is an implementation that is conceptually similar to the real binary codecs, but certainly // not identical. There are compromises that SimpleTextCodec has made to implement a fully-functioning codec in plain text.

Doc values

The .dat file stores the doc values for the val field.

The IntField uses the “binary” doc values format. In this case, each value has a maximum length of 1 byte. The “maxlength” and “pattern” values let us efficiently seek to the start of a document’s values. Specifically, relative to the start of the ddc values (i.e. the byte following the newline after pattern 0), a given document’s values start at startOffset + (9 + pattern.length + maxlength + 2) * docNum (taken from the Javadoc for SimpleTextDocValuesFormat).

Each document’s entry has a length, specifying how many doc values are present in the document. In our case, each document has a single value for val, except the second document, which has none.

Each Lucene file has a trailing checksum used to verify the file integrity and protect against flipped bits.

File _0.dat has size 136
field val
  type BINARY
  maxlength 1
  pattern 0
length 1
1
T
length 0

F
length 1
3
T
length 1
4
T
END
checksum 00000000001474172410

Points

The .dii file stores an index to locate the point data for individual fields in the .dim file. In this case, the point tree for the field val starts at byte 113 in the .dim file.

That byte corresponds to the line num data dims 1. The contents of the file before that are the “blocks” in the “block K-d” tree, corresponding to leaves of the tree. In this case, since we only have 3 documents with points, they all fit in a single block. The offset of this leaf is specified as part of the tree definition, in the line block fp 0 (i.e. the block starts at byte 0 of the file).

File _0.dii has size 79
field count 1
  field fp name val
  field fp 113
checksum 00000000001996750873

File _0.dim has size 361
block count 3
  doc 0
  doc 2
  doc 3
  block value [80 0 0 1]
  block value [80 0 0 3]
  block value [80 0 0 4]
num data dims 1
num index dims 1
bytes per dim 4
max leaf points 3
index count 1
min value [80 0 0 1]
max value [80 0 0 4]
point count 3
doc count 3
  block fp 0
split count 1
  split dim 0
  split value [0 0 0 0]
END
checksum 00000000000107327399

Stored fields

The .fld file keep stored field, used to retrieve the original field values as sent to the index writer.

Stored fields are organized into a hierarchy of Document -> Field ordinal -> Field value. A multi-valued field (not to be confused with a multi-dimensional point) is just the same field added to the document multiple times, and will be assigned multiple stored field ordinals based on the order that the fields were added.

File _0.fld has size 593
doc 0
  field 0
    name text
    type string
    value The quick fox jumped over the lazy, brown dog
  field 1
    name val
    type int
    value 1
doc 1
  field 0
    name text
    type string
    value Lorem ipsum, dolor sit amet
doc 2
  field 0
    name text
    type string
    value She left the web, she left the loom, she made three paces through the room
  field 1
    name val
    type int
    value 3
doc 3
  field 0
    name text
    type string
    value The sly fox sneaks past the oblivious dog
  field 1
    name val
    type int
    value 4
END
checksum 00000000000213864262

Field infos

The .inf file stores information about each field.

Note that many of the properties (e.g. vector encoding and vector similarity) are not applicable to the fields that we added. The values shown here are the field defaults. Several of the data structures are only created for a field if some property is set. For example, vectors are only written for fields where the “vector number of dimensions” is greater than zero. Doc values are only written when the doc values type for a field is not NONE. See the IndexingChain.processField method to see exactly how field type values decide what structures get written to an index for a field based on the field type properties.

File _0.inf has size 758
number of fields 2
  name text
  number 0
  index options DOCS_AND_FREQS_AND_POSITIONS
  term vectors false
  payloads false
  norms true
  doc values NONE
  doc values gen -1
  attributes 0
  data dimensional count 0
  index dimensional count 0
  dimensional num bytes 0
  vector number of dimensions 0
  vector encoding FLOAT32
  vector similarity EUCLIDEAN
  soft-deletes false
  name val
  number 1
  index options NONE
  term vectors false
  payloads false
  norms true
  doc values SORTED_NUMERIC
  doc values gen -1
  attributes 0
  data dimensional count 1
  index dimensional count 1
  dimensional num bytes 4
  vector number of dimensions 0
  vector encoding FLOAT32
  vector similarity EUCLIDEAN
  soft-deletes false
checksum 00000000000798287814

Norms

The .len file contains the norms for each text field in the index. The norms are the length (relative to term positions) of a text field in each document containing that field.

In this case, we have a single text field called text. The norms are encoded as a “delta” from the shortest document in the segment. In this case, our shortest document is the second one with length 5 (represented as 00 more than the minvalue). The longest document is the third one with length 15 (i.e. 10 more than minValue).

Field length per document is an important value used in the tf-idf and BM25 scoring formulae.

File _0.len has size 106
field text
  type NUMERIC
  minvalue 5
  pattern 00
04
T
00
T
10
T
03
T
END
checksum 00000000003850040528

Postings

The postings, stored in .pst files, are the key data structure used for efficient text search in Lucene.

Postings are organized from field to term to matching documents. In this case, since the text field was indexed with DOCS_AND_FREQS_AND_POSITIONS (see the “Field infos” section above), each document entry for a term encodes the frequency of the term in the document (used in scoring calculations), as well as the positions at which the term can be found in the document (used for phrase and span queries).

While many of the terms in our example occur in a single position in a single document, look at the postings for the term the, which appears in 3 documents, in multiple positions for each.

File _0.pst has size 1508
field text
  term amet
    doc 1
      freq 1
      pos 4
  term brown
    doc 0
      freq 1
      pos 7
  term dog
    doc 0
      freq 1
      pos 8
    doc 3
      freq 1
      pos 7
  term dolor
    doc 1
      freq 1
      pos 2
  term fox
    doc 0
      freq 1
      pos 2
    doc 3
      freq 1
      pos 2
  term ipsum
    doc 1
      freq 1
      pos 1
  term jumped
    doc 0
      freq 1
      pos 3
  term lazy
    doc 0
      freq 1
      pos 6
  term left
    doc 2
      freq 2
      pos 1
      pos 5
  term loom
    doc 2
      freq 1
      pos 7
  term lorem
    doc 1
      freq 1
      pos 0
  term made
    doc 2
      freq 1
      pos 9
  term oblivious
    doc 3
      freq 1
      pos 6
  term over
    doc 0
      freq 1
      pos 4
  term paces
    doc 2
      freq 1
      pos 11
  term past
    doc 3
      freq 1
      pos 4
  term quick
    doc 0
      freq 1
      pos 1
  term room
    doc 2
      freq 1
      pos 14
  term she
    doc 2
      freq 3
      pos 0
      pos 4
      pos 8
  term sit
    doc 1
      freq 1
      pos 3
  term sly
    doc 3
      freq 1
      pos 1
  term sneaks
    doc 3
      freq 1
      pos 3
  term the
    doc 0
      freq 2
      pos 0
      pos 5
    doc 2
      freq 3
      pos 2
      pos 6
      pos 13
    doc 3
      freq 2
      pos 0
      pos 5
  term three
    doc 2
      freq 1
      pos 10
  term through
    doc 2
      freq 1
      pos 12
  term web
    doc 2
      freq 1
      pos 3
END
checksum 00000000001512782415

Segment info

The segment info file .si stores information about all the other files involved in a segment.

While the segment info file is managed by the codec, the SimpleTextSegmentInfoFormat implementation currently outputs the raw bytes for the segment’s unique ID, so it is not a valid UTF-8 encoding. See https://github.com/apache/lucene/pull/12897.

File _0.si has size 739

Commit file

Each commit increments the commit generation and writes a segments_<generation> file. When indexing from a single thread with regular commits, the commit generation will often match the ordinal of the last segment (since each counts up by one on each commit). If segments are flushed without committing or flushed from multiple threads, the segment numbers will usually be higher than the commit generation.

The commit file holds the “SegmentInfos” (plural). It is not managed by the codec, since it encodes the information about what segments are part of the given commit and which codecs were used to write each segment. Since the file was not written by SimpleTextCodec, it is a binary file, so we don’t output it here.

File segments_1 has size 156

Write lock

On creation of the IndexWriter, a write.lock file is created and locked. The lock implementation is configurable, but is usually based on a java.nio.channels.FileLock.

The write lock ensures that no more than one IndexWriter is ever writing to the same directory.

File write.lock has size 0

 */