Building on the previous example in PrimitivesRef, which introduced the Lucene BytesRef class, this example
dives into BytesRefHash, a highly-optimized data structure for storing and sorting distinct BytesRef instances.
A BytesRefHash operates like a HashSet
The internal structure of BytesRefHash is fairly simple. It includes:
count: an int that serves as an auto-incrementing ID generator (starting at zero) and counter for the number of elements,ids: an int[] acting as a hash map from values to their auto-incrementing IDs, with -1 indicating an empty slot,byteStart: an int[] array from IDs to starting byte offsets in a large byte buffer, andpool : a large, chunked byte buffer that holds the actual BytesRef values, with their length encoded as a prefix.The basic flow for inserting into a BytesRefHash (within the add(BytesRef) method) is as follows:
BytesRef, modulo the size of ids, as a possible insert position.ids, check if it is the current BytesRef. The check works
by comparing against the value in pool starting at byteStart[ids[position]].BytesRef, then check the next position.ids at the returned position.-1, then we must already have the current value. Nothing to do. Returncounter to ids[position]. This maps from the hash bucket to the numeric ID.pool, saving the start offset in byteStart[counter].counter in preparation for the next value.package example.foundations;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefHash;
import org.apache.lucene.util.UnicodeUtil;
public class BytesRefHashExample {
public static void main(String[] args) {
BytesRefHash bytesRefHash = new BytesRefHash();The basic flow for inserting into a BytesRefHash (within the add(BytesRef) method) is as follows:
BytesRef, modulo the size of ids, as a possible insert position.ids, check if it is the current BytesRef. The check works
by comparing against the value in pool starting at byteStart[ids[position]].BytesRef, then check the next position.ids at the returned position.-1, then we must already have the current value. Nothing to do. Returncounter to ids[position]. This maps from the hash bucket to the numeric ID.pool, saving the start offset in byteStart[counter].counter in preparation for the next value.
BytesRef bytesRef = new BytesRef("foo");
int id = bytesRefHash.add(bytesRef);Position will be greater than zero, because we inserted a new value.
assert id >= 0;
setBytesRefValue(bytesRef, "bar");Position will be greater than zero, because we inserted a new value.
id = bytesRefHash.add(bytesRef);
assert id >= 0;
setBytesRefValue(bytesRef, "foo");
id = bytesRefHash.add(bytesRef);Position will be negative, because we’ve seen this value before.
assert id < 0;Let’s add a bunch of values to the BytesRefHash:
for (String word : "One point to be observed in the nature and history of words is their tendency to contract in form and degenerate in meaning.".split(" ")) {
setBytesRefValue(bytesRef, word);
bytesRefHash.add(bytesRef);
}The values in a BytesRefHash are assigned IDs sequentially from 0 (inclusive) to bytesRefHash.size() (exclusive).
Since a BytesRef is intended to be reused, we can use the same BytesRef instance to access values from
a BytesRefHash. Note, though, that the BytesRef is updated to point into the backing pool of the BytesRefHash,
so we must not modify the returned bytes array.
The get() method does not do a hash table lookup, as it only needs to know the starting offset for each
value, which it gets from the internal byteStart array.
The following will return values in the order that they were added to the BytesRefHash (without duplicates).
System.out.println("---- Insertion-ordered values ----");
for (int i = 0; i < bytesRefHash.size(); ++i) {
bytesRefHash.get(i, bytesRef);
System.out.println(i + " -> " + bytesRef.utf8ToString());
}
System.out.println();There are two “destructive” operations on BytesRefHash, so named because they discard the structure of the
ids hash table.
The first is compact(), which moves all ids values to the beginning of the table and
returns them. Note that the returned array is the size of the full hash table, with the -1 entries pushed
to the end, so we must take care to only iterate over the first bytesRefHash.size() entries.
The following outputs the set of inserted values in seemingly random (hash) order.
System.out.println("---- Hash-ordered values ----");
int[] ids = bytesRefHash.compact();
for (int i = 0; i < bytesRefHash.size(); i++) {
bytesRefHash.get(ids[i], bytesRef);
System.out.println(ids[i] + " -> " + bytesRef.utf8ToString());
}
System.out.println();The compact() method on its own is not used anywhere in Lucene outside of the sort() method, described
next. The compact() method should probably be made private.
The second destructive operation is sort(). Part of why it is destructive is that it relies on compact().
The output of sort() is also an array of ids, but instead of appearing in hash order, they are in lexicographic
order of the corresponding values.
System.out.println("---- Sorted values ----");
int[] sortedIds = bytesRefHash.sort();
for (int i = 0; i < bytesRefHash.size(); i++) {
bytesRefHash.get(sortedIds[i], bytesRef);
System.out.println(sortedIds[i] + " -> " + bytesRef.utf8ToString());
}The sort() method is used to produce a sorted set of unique values. This is how Lucene builds
SortedSetDocValues (via SortedSetDocValuesWriter), as well as the term dictionary used during
an index writer flush (via TermsHashPerField and its subclasses). This deduplicating and sorting
behavior is key to the construction of a search index.
}The following is a helper function to load string values into a (reusable) BytesRef instance.
private static void setBytesRefValue(BytesRef bytesRef, String value) {
bytesRef.bytes = ArrayUtil.growNoCopy(bytesRef.bytes, UnicodeUtil.maxUTF8Length(value.length()));
bytesRef.offset = 0;
bytesRef.length = UnicodeUtil.UTF16toUTF8(value, 0, value.length(), bytesRef.bytes);
}
}