BytesRef (and other array references)

One of Lucene’s performance advantages is that it is very cautious about allocating memory, striving to reuse buffers where possible. While Java makes management of short-lived objects pretty lightweight, it’s even more lightweight to avoid allocating them altogether.

To simplify the management of arrays used as buffers, Lucene provides some wrapper classes, like BytesRef, CharsRef, IntsRef, and LongsRef.

This example will focus on BytesRef, which is the most widely used within the Lucene codebase.

package example.foundations;

import org.apache.lucene.util.ArrayUtil;

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class PrimitivesRef {

Lucene’s BytesRef holds a reference to a byte array, a start offset into the byte array representing the start of the value of interest, and the length of the value of interest. Since we aim to reuse the buffer, the length of the value is often less than the length of the array, since the array may have previously held a larger value.

    private static class BytesRef {
        public static final byte[] EMPTY_BYTES = new byte[0];
        public byte[] bytes;
        public int offset;
        public int length;

        public BytesRef() {
            bytes = EMPTY_BYTES;
        }
    }

Let’s use BytesRef to simulate tokenizing a text file into lines, and then splitting those lines into individual words. This is similar to the logic used to tokenize a text file.

To start, let’s create a helper method that creates a text file. The content is taken from “A Book About Words” by G. F. Graham, published in 1869, copied from https://www.gutenberg.org/ebooks/55200.

Note that by using Files#newBufferedWriter, we guarantee that the resulting file is encoded with the UTF-8 character set.

    private static Path createTextFile() throws IOException {
        Path file = Files.createTempFile("text", ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(file)) {
            writer.append("""
                    What is meant by a Language? It is a collection of all the words,
                    phrases, grammatical forms, idioms, &c., which are used by one
                    people. It is the outward expression of the tendencies, turn of
                    mind, and habits of thought of some one nation, and the best
                    criterion of their intellect and feelings. If this explanation be
                    admitted, it will naturally follow that the connection between a
                    people and their language is so close, that the one may be judged
                    of by the other; and that the language is a lasting monument of the
                    nature and character of the people.
                    
                    Every language, then, has its genius; forms of words, idioms, and
                    turns of expression peculiar to itself; by which, independently of
                    other differences, one nation may be distinguished from another.
                    This condition may be produced by various causes; such as soil,
                    climate, conquest, immigration, &c. Out of the old Roman, or Latin,
                    there arose several modern languages of Europe; all known by the
                    generic name--Romance; viz. Italian, French, Provençal, Spanish,
                    and Portuguese. These may be called daughters of ancient Latin; and
                    the natives of all these countries down to the seventh century,
                    both spoke and wrote that language. But when the Scandinavian and
                    Germanic tribes invaded the West of Europe, the Latin was broken
                    up, and was succeeded by Italian, French, Spanish, &c. The Latin
                    now became gradually more and more corrupt, and was, at length, in
                    each of these countries, wholly remodelled.
                    
                    History has been called ‘the study of the law of change;’ i.e. the
                    process by which human affairs are transferred from one condition
                    to another. The history of a language has naturally a close
                    analogy with political history; the chief difference being that
                    the materials of the latter are facts, events, and institutions;
                    whilst the former treats of words, forms, and constructions. Now,
                    in the same way as a nation never stands still, but is continually
                    undergoing a silent--perhaps imperceptible--transformation, so
                    it is with its language. This is proved both by experience and
                    reason. We need hardly say that the English of the present time
                    differ widely from the English of the fourteenth century; and we
                    may be quite sure that the language of this country, two or three
                    centuries hence, will be very different from what it is at present.
                    It would be impossible for a nation either to improve or decay, and
                    for its language at the same time to remain stationary. The one
                    being a reflex of the other, they must stand or fall together.
                    """);
        }
        return file;
    }

    public static void main(String[] args) throws IOException {

        Path textFile = createTextFile();

We allocate an initially-empty BytesRef for each line. We will dynamically resize it to accommodate the maximum size needed for a line. We also allocate a BytesRef that will provide a view over each word in each line. Note that the variable memory used by this program is given by the maximum size of lineBytesRef.bytes. Everything else operates within bytes of that same buffer.

        BytesRef lineBytesRef = new BytesRef();
        BytesRef wordBytesRef = new BytesRef();

We open a basic Java InputStream, so that we can process the input file byte-by-byte.

        try (InputStream is = Files.newInputStream(textFile)) {

The following loop will collect each line into linesBytesRef.

            int linePos = 0;
            int nextByte;
            while ((nextByte = is.read()) >= 0) {

Let’s consider the case where we’re not at the end of a line yet.

                if (nextByte != '\n') {

If the current lineBytesRef is not large enough, we use Lucene’s ArrayUtil#grow helper method, which will allocate an exponentially-larger array and copy the existing contents.

If we did not want to preserve the existing contents, we could use ArrayUtil.growNoCopy.

                    if (linePos == lineBytesRef.bytes.length) {
                        lineBytesRef.bytes = ArrayUtil.grow(lineBytesRef.bytes, linePos + 1);
                    }

Append the current character to the line.

                    lineBytesRef.bytes[linePos++] = (byte) (nextByte & 0xFF);
                } else {

Finished the current line. We set the length to the current position.

                    lineBytesRef.length = linePos;

In the following method call, wordsBytesRef is just used as a reusable scratch variable that lets us filter over a subset of lineBytesRef.

                    outputWords(lineBytesRef, wordBytesRef);

Reset linePos in preparation for the next line.

                    linePos = 0;
                }
            }
        } finally {

Clean up the file when we’re done.

            Files.deleteIfExists(textFile);
        }
    }

The following helper method uses wordsBytesRef as a window over lineBytesRef. They share the same underlying byte array, but wordsBytesRef is updated to point to the starting offset and length of each word.

    private static void outputWords(BytesRef lineBytesRef, BytesRef wordBytesRef) {
        wordBytesRef.offset = lineBytesRef.offset;
        wordBytesRef.bytes = lineBytesRef.bytes;
        for (int i = lineBytesRef.offset; i < lineBytesRef.offset + lineBytesRef.length; ++i) {
            if (Character.isWhitespace(wordBytesRef.bytes[i])) {

Reached the end of a word. Update the length of the word, then output it.

                wordBytesRef.length = i - wordBytesRef.offset;
                outputWord(wordBytesRef);

Update the word offset in preparation for the next word.

                wordBytesRef.offset = i + 1;
            }
        }

Output the last word from each line.

        wordBytesRef.length = lineBytesRef.length + lineBytesRef.offset - wordBytesRef.offset;
        outputWord(wordBytesRef);
    }

    private static void outputWord(BytesRef bytesRef) {

Output each word character-by-character.

Note that because the input above contains some UTF-8 multi-byte characters, we have three branches to handle 1-, 2-, and 3-byte characters. This logic is lifted from Lucene’s UnicodeUtil#UTF8toUTF16 method, which is conveniently called from BytesRef#utf8ToString. (I skipped the 4-byte branch, since it’s complicated and not needed for this specific example.)

In this case, I didn’t want to use BytesRef#utf8ToString, since the goal of this example is to avoid allocating any unnecessary extra objects, including the String objects returned by that method.

        for (int i = bytesRef.offset; i < bytesRef.offset + bytesRef.length; ++i) {
            int b = bytesRef.bytes[i] & 0xff;
            if (b < 0xc0) {
                assert b < 0x80;
                System.out.print((char) b);
            } else if (b < 0xe0) {
                char c  = (char) (((b & 0x1f) << 6) + (bytesRef.bytes[++i] & 0x3f));
                System.out.print(c);
            } else if (b < 0xf0) {
                char c =
                        (char) (((b & 0xf) << 12) + ((bytesRef.bytes[i+1] & 0x3f) << 6) + (bytesRef.bytes[i + 2] & 0x3f));
                i += 2;
                System.out.print(c);
            } else {
                throw new IllegalArgumentException();
            }
        }

Add a newline between words.

        System.out.println();
    }
}

BytesRef (and other array references)

Table of Contents