This example partly builds on top of the concepts introduced in BooleanQueryIntro
, so I suggest you go check
that example first.
If you happen to have arrived here through a web search, please DO NOT ACTUALLY USE THIS CODE. Lucene already
has a BooleanQuery class. See the neighboring BooleanQueryIntro
example for some guidance on using it.
This example digs into some of the implementation details for Boolean queries, specifically conjunctive (AND
)
queries. Since Lucene’s real Boolean queries are quite sophisticated, most of this example will involve building
our own version of a Boolean query that behaves like a binary “AND” operator. That is, it has exactly two clauses,
which are both required (and participate in scoring). Along the way, we’ll learn some things about how Lucene
queries in general are implemented, and how Boolean queries implement “leap-frogging”.
Before we get into the code, here is an overview of the chain of classes we need to implement for our own query:
Query
: A query represents the instruction for what we want to match/score, without getting into the details
of how we’re going to do so.Weight
: Returned by the createWeight
method on the query. The JavaDoc for Weight
says, “The purpose of
Weight is to ensure searching does not modify a Query, so that a Query instance can be reused.” While
the Query
is independent of any index state, the Weight
is constructed based on the current
IndexSearcher
, and properties of the search action (e.g. whether scoring is needed).Scorer
: For each index segment, the Weight
creates a Scorer
. The Scorer
may be used once to step through
matching documents in the segment (via its embedded DocIdSetIterator
– explained next), and returns
the score (at this query’s level) for the document.DocIdSetIterator
: A DocIdSetIterator
(often abbreviated to DISI) steps through matching document IDs for the
given segment in increasing order. Many DISIs support an efficient advance(int n)
method that returns
the matching doc ID greater than or equal to n
.package example.basic;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.ScorerSupplier;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.Weight;
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
public class BooleanQueryANDInternals {
We’ll reuse the documents from BooleanQueryIntro
. In brief, the documents contain the words for their
prime factors less than 10.
private static List<String> createDocumentText(int numDocs) {
List<String> docs = new ArrayList<>();
docs.add("zero"); /* First doc will just have "zero", since the first doc ID is 0 */
for (int i = 1; i < numDocs; i++) {
StringBuilder sb = new StringBuilder();
sb.append("one");
if (i % 2 == 0) {
sb.append(" two");
}
if (i % 3 == 0) {
sb.append(" three");
}
if (i % 5 == 0) {
sb.append(" five");
}
if (i % 7 == 0) {
sb.append(" seven");
}
docs.add(sb.toString());
}
return docs;
}
Let’s start with our custom query class. Since we want to represent the specific a AND b
case, we will
explicitly take a
and b
as “left” and “right” operands.
private static final class BinaryAndQuery extends Query {
private final Query left;
private final Query right;
public BinaryAndQuery(Query left, Query right) {
this.left = Objects.requireNonNull(left);
this.right = Objects.requireNonNull(right);
}
A query should implement a friendly toString
method that provides a human-readable representation.
@Override
public String toString(String field) {
return left.toString(field) + " AND " + right.toString(field);
}
To allow operations on complex query hierarchies, queries are expected to take a QueryVisitor
and pass
it down the tree. The “leaf” queries at the end of a query hierarchy should call visitor.visitLeaf(this)
.
Functionally, this implementation is identical to what BooleanQuery
will do with a pair of MUST
clauses.
@Override
public void visit(QueryVisitor visitor) {
QueryVisitor childVisitor = visitor.getSubVisitor(BooleanClause.Occur.MUST, this);
left.visit(childVisitor);
right.visit(childVisitor);
}
Since a Query
may be used as a hash key (e.g. if the results of the query are cached), they are expected
to correctly implement equals
and hashCode
.
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
BinaryAndQuery that = (BinaryAndQuery) o;
return Objects.equals(left, that.left) && Objects.equals(right, that.right);
}
@Override
public int hashCode() {
return Objects.hash(left, right);
}
Finally, when it comes time to search with our query, we create the Weight
instance. Note that
we need to call createWeight
on each of our nested queries, too. Note that while we could call
createWeight
on the left
and right
queries directly, it is more correct to delegate to the searcher
,
which may have one of the nested queries in its cache. (See the isCacheable
method in BinaryAndWeight
,
below.)
@Override
public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) throws IOException {
return new BinaryAndWeight(this, searcher.createWeight(left, scoreMode, boost), searcher.createWeight(right, scoreMode, boost));
}
}
For our purposes, the role of the Weight
implementation is to hold the Weight
instances for the operands,
create the Scorer
for each segment, and implement the explain
method.
private static class BinaryAndWeight extends Weight {
private final Weight leftWeight;
private final Weight rightWeight;
public BinaryAndWeight(Query query, Weight leftWeight, Weight rightWeight) {
super(query);
this.leftWeight = leftWeight;
this.rightWeight = rightWeight;
}
Every Weight
is expected to implement the explain
method to show how a given document’s score was
derived (or to explain why a non-matching document did not match).
@Override
public Explanation explain(LeafReaderContext context, int doc) throws IOException {
Explanation leftExplain = leftWeight.explain(context, doc);
Explanation rightExplain = rightWeight.explain(context, doc);
if (!leftExplain.isMatch()) {
return Explanation.noMatch("no match on required clause (" + leftWeight.getQuery().toString() + ")", leftExplain);
} else if (!rightExplain.isMatch()) {
return Explanation.noMatch("no match on required clause (" + rightWeight.getQuery().toString() + ")", rightExplain);
}
Scorer scorer = scorer(context);
int advanced = scorer.iterator().advance(doc);
assert advanced == doc;
return Explanation.match(scorer.score(), "sum of:", leftExplain, rightExplain);
}
We return our custom Scorer
implementation for a given index segment, after creating the scorers for
each of our operands.
@Override
public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
ScorerSupplier leftScorerSupplier = leftWeight.scorerSupplier(context);
ScorerSupplier rightScorerSupplier = rightWeight.scorerSupplier(context);
Weight weight = this;
return new ScorerSupplier() {
@Override
public Scorer get(long l) throws IOException {
return new BinaryAndScorer(weight, leftScorerSupplier.get(l), rightScorerSupplier.get(l));
}
@Override
public long cost() {
Worst case, we match everything from left and right.
return leftScorerSupplier.cost() + rightScorerSupplier.cost();
}
};
}
The IndexSearcher
holds a QueryCache
instance that may store the matching doc IDs for a given query.
Not all queries are readily cacheable, though, so the IndexSearcher
asks the weight if it is cacheable
or not.
@Override
public boolean isCacheable(LeafReaderContext ctx) {
return leftWeight.isCacheable(ctx) && rightWeight.isCacheable(ctx);
}
}
Next, we need to implement the Scorer
class. Most of the interesting logic lives in the Scorer
‘s associated
DocIdSetIterator
.
private static class BinaryAndScorer extends Scorer {
private final Scorer leftScorer;
private final Scorer rightScorer;
private final DocIdSetIterator docIdSetIterator;
public BinaryAndScorer(Weight weight, Scorer leftScorer, Scorer rightScorer) {
this.leftScorer = leftScorer;
this.rightScorer = rightScorer;
this.docIdSetIterator = new BinaryAndDocIdSetIterator(leftScorer.iterator(), rightScorer.iterator());
}
If this scorer is pointing at a doc ID, then both the left and right scorers must be pointing at the same doc ID (since it’s the current match for the left AND right).
@Override
public int docID() {
assert leftScorer.docID() == rightScorer.docID();
return docIdSetIterator.docID();
}
@Override
public DocIdSetIterator iterator() {
return docIdSetIterator;
}
This getMaxScore
can be used to skip entire blocks of uncompetitive documents, by providing an upper bound
on the possible score of a match. In this case, we ask each of our subscorers what their maximum scores
are upTo
some future target and sum them together. But the max from left
and the max from right
might
not come from the same document, so there may not actually be a document with this score. That’s okay,
though – the skipping is a best-effort optimization.
@Override
public float getMaxScore(int upTo) throws IOException {
return leftScorer.getMaxScore(upTo) + rightScorer.getMaxScore(upTo);
}
Consistent with BooleanQuery (using MUST clauses), we sum up the scores of our individual clauses.
@Override
public float score() throws IOException {
assert leftScorer.docID() == rightScorer.docID();
return leftScorer.score() + rightScorer.score();
}
}
This class is really the whole reason for this example (though the journey was hopefully informative too). The DocIdSetIterator (DISI) classes are fundamental to Lucene’s matching logic.
Conjunctions of required clauses (like our AND query) step through all documents where all of their underlying
DISIs point to the same thing. We achieve this with the (generally) fast advance
method, asking a DISI
that’s behind to skip to the first value greater than or equal to the DISI that’s in the lead. If they point
to the same thing, then we have a match and return it. Otherwise, the old lead is now behind and it can leap
over the new lead, like a game of leapfrog.
Note that our advance
and nextDoc
methods guarantee that they only return when leftIterator
and
rightIterator
point to the same value (though that value may be NO_MORE_DOCS
).
private static class BinaryAndDocIdSetIterator extends DocIdSetIterator {
private final DocIdSetIterator leftIterator;
private final DocIdSetIterator rightIterator;
public BinaryAndDocIdSetIterator(DocIdSetIterator leftIterator, DocIdSetIterator rightIterator) {
this.leftIterator = leftIterator;
this.rightIterator = rightIterator;
}
We can (and do) check the invariant that the left and right iterators are pointing to the same doc, and return the left doc ID.
@Override
public int docID() {
assert leftIterator.docID() == rightIterator.docID();
return leftIterator.docID();
}
For nextDoc
, we arbitrarily step to the next document with leftIterator
, then jump over to our
leapfrogging logic in doNext
.
@Override
public int nextDoc() throws IOException {
return doNext(leftIterator.nextDoc());
}
As with nextDoc
, we can advance
the leftIterator
and leapfrog in doNext
until the iterators
converge.
@Override
public int advance(int target) throws IOException {
return doNext(leftIterator.advance(target));
}
Here it is, the exciting part of the example and one of the most important concepts in Lucene.
The left and right iterators jump over each other, with the lagging iterator advancing to the
first doc ID greater than or equal to the leading iterator. Several of the built-in query types (including
TermQuery) have efficient advance
methods that can return in logarithmic time.
This while
loop is guaranteed to terminate, because eventually both iterators will point to NO_MORE_DOCS
,
and passing NO_MORE_DOCS
to an advance
call is expected to return NO_MORE_DOCS
. (The comparators work
because NO_MORE_DOCS
is equal to Integer.MAX_VALUE
.)
private int doNext(int target) throws IOException {
while (leftIterator.docID() != rightIterator.docID()) {
if (rightIterator.docID() < leftIterator.docID()) {
rightIterator.advance(leftIterator.docID());
} else {
leftIterator.advance(rightIterator.docID());
}
}
return leftIterator.docID();
}
The cost()
of a DISI is a measure of how expensive it would be to step through it fully, generally
corresponding to an upper bound on the number of docs the DISI matches. We don’t know how many documents
are in the intersection of left
and right
without stepping through them, but we know that the
intersection of left
and right
is definitely no bigger than whichever is smaller. (Picture the Venn
diagram of “A AND B”.)
One use of cost()
is to lead with the cheapest (sparsest) iterator. Since our advance
method always
steps with left
first and our “AND” operator is commutative, we could have assigned leftIterator
to the lower-cost iterator to get a speed-up. (That’s what Lucene’s ConjunctionDISI
class does.)
@Override
public long cost() {
return Math.min(leftIterator.cost(), rightIterator.cost());
}
}
We can reuse the custom similarity from BooleanQueryIntro
to give our example term queries scores of 1
on a
match, rather than worrying about BM25 scores.
private static class CountMatchingClauseSimilarity extends SimilarityBase {
@Override
protected double score(BasicStats stats, double freq, double docLen) {
return 1;
}
@Override
public String toString() {
return "Every match has a score of 1";
}
While we didn’t look at it in BooleanQueryIntro
, a custom similarity can also explain its scoring
methodology. In our case, we can pass along the explanation from the toString
method.
@Override
protected void explain(List<Explanation> subExpls, BasicStats stats, double freq, double docLen) {
subExpls.add(Explanation.match(1, this.toString()));
}
}
Let’s put our custom query type to work.
After indexing documents 0-999, we create a twoAndFive
query, similar to what we did in BooleanQueryIntro
,
but this time it’s using the logic implemented above.
When the IndexSearcher
(or technically the DefaultBulkScorer
) first calls nextDoc
on our
DocIdSetIterator
, the left iterator (“two) will step to its first match – document 2. Then the right (“five”)
iterator steps to its first match greater than or equal to 2 – document 5. Then the left iterator advances to
document 6. Then the right iterator advances to document 10. Finally, the left iterator advances to document 10
too, and we return it. (It’s the first hit in the returned TopDocs
.)
The logic for the next (and every other) matching document is similar:
left=12
right=15
left=16
right=20 <--
left=20 <-- HIT
After returning the top 10 hits, we can take a look at the explain
output to see the output from our
custom Weight
and custom Similarity
.
public static void main(String[] args) throws IOException {
Path tmpDir = Files.createTempDirectory(BooleanQueryIntro.class.getSimpleName());
try (Directory directory = FSDirectory.open(tmpDir);
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig())) {
for (String doc : createDocumentText(1000)) {
writer.addDocument(List.of(new TextField("text", doc, Field.Store.NO)));
}
try (IndexReader reader = DirectoryReader.open(writer)) {
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new CountMatchingClauseSimilarity());
BinaryAndQuery twoAndFive = new BinaryAndQuery(
new TermQuery(new Term("text", "two")),
new TermQuery(new Term("text", "five"))
);
System.out.println(twoAndFive);
TopDocs topDocs = searcher.search(twoAndFive, 10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
System.out.println(scoreDoc.doc + " " + scoreDoc.score);
}
System.out.println(searcher.explain(twoAndFive, 20));
/*
2.0 = sum of:
1.0 = weight(text:two in 20) [CountMatchingClauseSimilarity], result of:
1.0 = score(CountMatchingClauseSimilarity, freq=1.0), computed from:
1 = Every match has a score of 1
1.0 = weight(text:five in 20) [CountMatchingClauseSimilarity], result of:
1.0 = score(CountMatchingClauseSimilarity, freq=1.0), computed from:
1 = Every match has a score of 1
*/
System.out.println(searcher.explain(twoAndFive, 22));
/*
0.0 = no match on required clause (text:five)
0.0 = no matching term
*/
}
} finally {
for (String indexFile : FSDirectory.listAll(tmpDir)) {
Files.deleteIfExists(tmpDir.resolve(indexFile));
}
Files.deleteIfExists(tmpDir);
}
}
}