• Jump To … +
    AnalyzerBasics.java BooleanQueryANDInternals.java BooleanQueryIntro.java BottomUpIndexReader.java BytesRefHashExample.java CombinedFieldQueryExample.java DirectoryFileContents.java DocValuesSearchExample.java FunctionQuerySearchExample.java KnnSearchExample.java PointTreeRangeQuery.java PrimitivesRef.java SearchWithTermsEnum.java SimpleSearch.java TextVectorSearchExample.java VisualizePointTree.java
  • §

    One of Lucene’s performance advantages is that it is very cautious about allocating memory, striving to reuse buffers where possible. While Java makes management of short-lived objects pretty lightweight, it’s even more lightweight to avoid allocating them altogether.

    To simplify the management of arrays used as buffers, Lucene provides some wrapper classes, like BytesRef, CharsRef, IntsRef, and LongsRef.

    This example will focus on BytesRef, which is the most widely used within the Lucene codebase.

    package example.foundations;
    
    import org.apache.lucene.util.ArrayUtil;
    
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    
    public class PrimitivesRef {
    
    
    
  • §

    Lucene’s BytesRef holds a reference to a byte array, a start offset into the byte array representing the start of the value of interest, and the length of the value of interest. Since we aim to reuse the buffer, the length of the value is often less than the length of the array, since the array may have previously held a larger value.

        private static class BytesRef {
            public static final byte[] EMPTY_BYTES = new byte[0];
            public byte[] bytes;
            public int offset;
            public int length;
    
            public BytesRef() {
                bytes = EMPTY_BYTES;
            }
        }
    
    
  • §

    Let’s use BytesRef to simulate tokenizing a text file into lines, and then splitting those lines into individual words. This is similar to the logic used to tokenize a text file.

    To start, let’s create a helper method that creates a text file. The content is taken from “A Book About Words” by G. F. Graham, published in 1869, copied from https://www.gutenberg.org/ebooks/55200.

    Note that by using Files#newBufferedWriter, we guarantee that the resulting file is encoded with the UTF-8 character set.

        private static Path createTextFile() throws IOException {
            Path file = Files.createTempFile("text", ".txt");
            try (BufferedWriter writer = Files.newBufferedWriter(file)) {
                writer.append("""
                        What is meant by a Language? It is a collection of all the words,
                        phrases, grammatical forms, idioms, &c., which are used by one
                        people. It is the outward expression of the tendencies, turn of
                        mind, and habits of thought of some one nation, and the best
                        criterion of their intellect and feelings. If this explanation be
                        admitted, it will naturally follow that the connection between a
                        people and their language is so close, that the one may be judged
                        of by the other; and that the language is a lasting monument of the
                        nature and character of the people.
                        
                        Every language, then, has its genius; forms of words, idioms, and
                        turns of expression peculiar to itself; by which, independently of
                        other differences, one nation may be distinguished from another.
                        This condition may be produced by various causes; such as soil,
                        climate, conquest, immigration, &c. Out of the old Roman, or Latin,
                        there arose several modern languages of Europe; all known by the
                        generic name--Romance; viz. Italian, French, Provençal, Spanish,
                        and Portuguese. These may be called daughters of ancient Latin; and
                        the natives of all these countries down to the seventh century,
                        both spoke and wrote that language. But when the Scandinavian and
                        Germanic tribes invaded the West of Europe, the Latin was broken
                        up, and was succeeded by Italian, French, Spanish, &c. The Latin
                        now became gradually more and more corrupt, and was, at length, in
                        each of these countries, wholly remodelled.
                        
                        History has been called ‘the study of the law of change;’ i.e. the
                        process by which human affairs are transferred from one condition
                        to another. The history of a language has naturally a close
                        analogy with political history; the chief difference being that
                        the materials of the latter are facts, events, and institutions;
                        whilst the former treats of words, forms, and constructions. Now,
                        in the same way as a nation never stands still, but is continually
                        undergoing a silent--perhaps imperceptible--transformation, so
                        it is with its language. This is proved both by experience and
                        reason. We need hardly say that the English of the present time
                        differ widely from the English of the fourteenth century; and we
                        may be quite sure that the language of this country, two or three
                        centuries hence, will be very different from what it is at present.
                        It would be impossible for a nation either to improve or decay, and
                        for its language at the same time to remain stationary. The one
                        being a reflex of the other, they must stand or fall together.
                        """);
            }
            return file;
        }
    
        public static void main(String[] args) throws IOException {
    
            Path textFile = createTextFile();
    
    
  • §

    We allocate an initially-empty BytesRef for each line. We will dynamically resize it to accommodate the maximum size needed for a line. We also allocate a BytesRef that will provide a view over each word in each line. Note that the variable memory used by this program is given by the maximum size of lineBytesRef.bytes. Everything else operates within bytes of that same buffer.

            BytesRef lineBytesRef = new BytesRef();
            BytesRef wordBytesRef = new BytesRef();
    
    
  • §

    We open a basic Java InputStream, so that we can process the input file byte-by-byte.

            try (InputStream is = Files.newInputStream(textFile)) {
    
  • §

    The following loop will collect each line into linesBytesRef.

                int linePos = 0;
                int nextByte;
                while ((nextByte = is.read()) >= 0) {
    
  • §

    Let’s consider the case where we’re not at the end of a line yet.

                    if (nextByte != '\n') {
    
  • §

    If the current lineBytesRef is not large enough, we use Lucene’s ArrayUtil#grow helper method, which will allocate an exponentially-larger array and copy the existing contents.

    If we did not want to preserve the existing contents, we could use ArrayUtil.growNoCopy.

                        if (linePos == lineBytesRef.bytes.length) {
                            lineBytesRef.bytes = ArrayUtil.grow(lineBytesRef.bytes, linePos + 1);
                        }
    
  • §

    Append the current character to the line.

                        lineBytesRef.bytes[linePos++] = (byte) (nextByte & 0xFF);
                    } else {
    
  • §

    Finished the current line. We set the length to the current position.

                        lineBytesRef.length = linePos;
    
    
  • §

    In the following method call, wordsBytesRef is just used as a reusable scratch variable that lets us filter over a subset of lineBytesRef.

                        outputWords(lineBytesRef, wordBytesRef);
    
  • §

    Reset linePos in preparation for the next line.

                        linePos = 0;
                    }
                }
            } finally {
    
  • §

    Clean up the file when we’re done.

                Files.deleteIfExists(textFile);
            }
        }
    
    
  • §

    The following helper method uses wordsBytesRef as a window over lineBytesRef. They share the same underlying byte array, but wordsBytesRef is updated to point to the starting offset and length of each word.

        private static void outputWords(BytesRef lineBytesRef, BytesRef wordBytesRef) {
            wordBytesRef.offset = lineBytesRef.offset;
            wordBytesRef.bytes = lineBytesRef.bytes;
            for (int i = lineBytesRef.offset; i < lineBytesRef.offset + lineBytesRef.length; ++i) {
                if (Character.isWhitespace(wordBytesRef.bytes[i])) {
    
  • §

    Reached the end of a word. Update the length of the word, then output it.

                    wordBytesRef.length = i - wordBytesRef.offset;
                    outputWord(wordBytesRef);
    
  • §

    Update the word offset in preparation for the next word.

                    wordBytesRef.offset = i + 1;
                }
            }
    
  • §

    Output the last word from each line.

            wordBytesRef.length = lineBytesRef.length + lineBytesRef.offset - wordBytesRef.offset;
            outputWord(wordBytesRef);
        }
    
        private static void outputWord(BytesRef bytesRef) {
    
  • §

    Output each word character-by-character.

    Note that because the input above contains some UTF-8 multi-byte characters, we have three branches to handle 1-, 2-, and 3-byte characters. This logic is lifted from Lucene’s UnicodeUtil#UTF8toUTF16 method, which is conveniently called from BytesRef#utf8ToString. (I skipped the 4-byte branch, since it’s complicated and not needed for this specific example.)

    In this case, I didn’t want to use BytesRef#utf8ToString, since the goal of this example is to avoid allocating any unnecessary extra objects, including the String objects returned by that method.

            for (int i = bytesRef.offset; i < bytesRef.offset + bytesRef.length; ++i) {
                int b = bytesRef.bytes[i] & 0xff;
                if (b < 0xc0) {
                    assert b < 0x80;
                    System.out.print((char) b);
                } else if (b < 0xe0) {
                    char c  = (char) (((b & 0x1f) << 6) + (bytesRef.bytes[++i] & 0x3f));
                    System.out.print(c);
                } else if (b < 0xf0) {
                    char c =
                            (char) (((b & 0xf) << 12) + ((bytesRef.bytes[i+1] & 0x3f) << 6) + (bytesRef.bytes[i + 2] & 0x3f));
                    i += 2;
                    System.out.print(c);
                } else {
                    throw new IllegalArgumentException();
                }
            }
    
  • §

    Add a newline between words.

            System.out.println();
        }
    }