Interface  Description 

LMSimilarity.CollectionModel 
A strategy for computing the collection language model.

Class  Description 

AfterEffect 
This class acts as the base class for the implementations of the first
normalization of the informative content in the DFR framework.

AfterEffectB 
Model of the information gain based on the ratio of two Bernoulli processes.

AfterEffectL 
Model of the information gain based on Laplace's law of succession.

Axiomatic 
Axiomatic approaches for IR.

AxiomaticF1EXP 
F1EXP is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term))
where IDF(t) = pow((N+1)/df(t), k) N=total num of docs, df=doc freq

AxiomaticF1LOG 
F1LOG is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term))
where IDF(t) = ln((N+1)/df(t)) N=total num of docs, df=doc freq

AxiomaticF2EXP 
F2EXP is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term))
where IDF(t) = pow((N+1)/df(t), k) N=total num of docs, df=doc freq

AxiomaticF2LOG 
F2LOG is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term))
where IDF(t) = ln((N+1)/df(t)) N=total num of docs, df=doc freq

AxiomaticF3EXP 
F3EXP is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen))
where IDF(t) = pow((N+1)/df(t), k) N=total num of docs, df=doc freq
gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl
NOTE: the gamma function of this similarity creates negative scores

AxiomaticF3LOG 
F3LOG is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen))
where IDF(t) = ln((N+1)/df(t)) N=total num of docs, df=doc freq
gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl
NOTE: the gamma function of this similarity creates negative scores

BasicModel 
This class acts as the base class for the specific basic model
implementations in the DFR framework.

BasicModelG 
Geometric as limiting form of the Bose-Einstein model.

BasicModelIF 
An approximation of the I(n_{e}) model.

BasicModelIn 
The basic tf-idf model of randomness.

BasicModelIne 
Tf-idf model of randomness, based on a mixture of Poisson and inverse
document frequency.

BasicStats 
Stores all statistics commonly used by ranking methods.

BM25Similarity 
BM25 Similarity.

BooleanSimilarity 
Simple similarity that gives terms a score that is equal to their query
boost.

ClassicSimilarity 
Expert: Historical scoring implementation.

DFISimilarity 
Implements the Divergence from Independence (DFI) model based on Chi-square statistics
(i.e., standardized Chi-squared distance from independence in term frequency tf).

DFRSimilarity 
Implements the divergence from randomness (DFR) framework
introduced by Gianni Amati and Cornelis Joost Van Rijsbergen.

Distribution 
The probabilistic distribution used to model term occurrence
in information-based models.

DistributionLL 
Log-logistic distribution.

DistributionSPL 
The smoothed power-law (SPL) distribution for the information-based framework
that is described in the original paper.

IBSimilarity 
Provides a framework for the family of information-based models, as described
by Stéphane Clinchant and Eric Gaussier.

Independence 
Computes the measure of divergence from independence for DFI
scoring functions.

IndependenceChiSquared 
Normalized chi-squared measure of distance from independence

IndependenceSaturated 
Saturated measure of distance from independence

IndependenceStandardized 
Standardized measure of distance from independence

Lambda 
The lambda (λ_{w}) parameter in information-based models.

LambdaDF 
Computes lambda as (docFreq+1) / (numberOfDocuments+1).

LambdaTTF 
Computes lambda as (totalTermFreq+1) / (numberOfDocuments+1).

LMDirichletSimilarity 
Bayesian smoothing using Dirichlet priors.

LMJelinekMercerSimilarity 
Language model based on the Jelinek-Mercer smoothing method.

LMSimilarity 
Abstract superclass for language modeling Similarities.

LMSimilarity.DefaultCollectionModel 
Models p(w|C) as the number of occurrences of the term in the
collection, divided by the total number of tokens + 1.

LMSimilarity.LMStats 
Stores the collection distribution of the current term.

MultiSimilarity 
Implements the CombSUM method for combining evidence from multiple
similarity values described in: Joseph A.

Normalization 
This class acts as the base class for the implementations of the term
frequency normalization methods in the DFR framework.

Normalization.NoNormalization 
Implementation used when there is no normalization.

NormalizationH1 
Normalization model that assumes a uniform distribution of the term frequency.

NormalizationH2 
Normalization model in which the term frequency is inversely related to the
length.

NormalizationH3 
Dirichlet Priors normalization

NormalizationZ 
Pareto-Zipf Normalization

PerFieldSimilarityWrapper 
Provides the ability to use a different Similarity for different fields.

Similarity 
Similarity defines the components of Lucene scoring.

Similarity.SimScorer 
Stores the weight for a query across the indexed collection.

SimilarityBase 
A subclass of Similarity that provides a simplified API for its descendants.

TFIDFSimilarity 
Implementation of Similarity with the Vector Space Model.

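The Axiomatic entries above give their IDF and gamma terms only as inline
formulas. As a rough illustration, the following dependency-free Java sketch
evaluates those terms directly. The class name, method names, and the sample
values in main are hypothetical; this is not Lucene's implementation, and the
tf() and tfln() functions are left out because the summaries do not define them.

```java
// Hypothetical helper class: it only mirrors the formulas quoted in the
// Axiomatic* summaries and is not part of Lucene.
final class AxiomaticFormulaSketch {

    /** Logarithmic IDF of the *LOG variants: ln((N+1)/df). */
    static double idfLog(long numDocs, long docFreq) {
        return Math.log((numDocs + 1.0) / docFreq);
    }

    /** Exponential IDF of the *EXP variants: pow((N+1)/df, k). */
    static double idfExp(long numDocs, long docFreq, double k) {
        return Math.pow((numDocs + 1.0) / docFreq, k);
    }

    /** gamma(docLen, queryLen) = (docLen - queryLen) * queryLen * s / avdl. */
    static double gamma(double docLen, double queryLen, double s, double avdl) {
        return (docLen - queryLen) * queryLen * s / avdl;
    }

    public static void main(String[] args) {
        // Arbitrary sample statistics: 999 documents, term appears in 10 of them.
        System.out.println("idfLog = " + idfLog(999, 10));
        System.out.println("idfExp = " + idfExp(999, 10, 0.5));

        // gamma grows with document length. The F3 formulas subtract it, which
        // is how a long document can end up with a negative contribution.
        System.out.println("gamma (short doc) = " + gamma(120, 2, 0.25, 100));
        System.out.println("gamma (long doc)  = " + gamma(5000, 2, 0.25, 100));
    }
}
```

The two IDF variants differ only in how strongly rare terms are rewarded: the
exponent k (an assumed value of 0.5 here) makes the EXP form grow faster than
the logarithm as df shrinks.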
Similarity serves as the base for ranking functions. For searching, users can
employ the models already implemented or create their own by extending one of
the classes in this package.

BM25Similarity is an optimized implementation of the successful Okapi BM25
model.

ClassicSimilarity is the original Lucene scoring function. It is based on the
Vector Space Model. For more information, see TFIDFSimilarity.

SimilarityBase provides a basic implementation of the Similarity contract and
exposes a highly simplified interface, which makes it an ideal starting point
for new ranking functions. The methods Lucene ships on top of SimilarityBase
include the DFR framework (DFRSimilarity), the information-based models
(IBSimilarity), and the language models (LMSimilarity and its subclasses).

Because SimilarityBase is not optimized to the same extent as
ClassicSimilarity and BM25Similarity, a difference in performance is to be
expected when using the methods listed above. However, optimizations can
always be implemented in subclasses; see below.

Chances are the available Similarities are sufficient for all your searching
needs. However, in some applications it may be necessary to customize your
Similarity implementation. For instance, some applications do not need to
distinguish between shorter and longer documents and could set BM25's b
parameter to 0.

To change Similarity, one must do so for both indexing and searching, and the
changes must happen before either of these actions takes place. Although in
theory there is nothing stopping you from changing mid-stream, it just isn't
well-defined what is going to happen.

To make this change, implement your own Similarity (likely you'll want to
simply subclass SimilarityBase), and then register the new class by calling
IndexWriterConfig.setSimilarity(Similarity) before indexing and
IndexSearcher.setSimilarity(Similarity) before searching.

BM25Similarity has two parameters that may be tuned:
k1: a value of 0 makes term frequency completely ignored, so documents are
scored only on the IDF of the matched terms. Higher values of k1 increase the
impact of term frequency on the final score. Default value is 1.2.
b: must be in [0, 1]. A value of 0 disables length normalization completely.
Default value is 0.75.

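To make the effect of the two parameters concrete, here is a small
dependency-free sketch of the textbook Okapi BM25 term-frequency component.
The class and method names are hypothetical, the IDF factor is deliberately
left out, and Lucene's own implementation may differ in constant factors, so
treat this as an illustration of the parameter behavior only.

```java
// Hypothetical illustration class, not Lucene's BM25Similarity.
final class Bm25ParameterSketch {

    /**
     * Textbook BM25 term-frequency component:
     * freq * (k1 + 1) / (freq + k1 * (1 - b + b * docLen / avgDocLen)).
     */
    static double tfComponent(double freq, double docLen, double avgDocLen,
                              double k1, double b) {
        double lengthNorm = 1.0 - b + b * docLen / avgDocLen;
        return freq * (k1 + 1.0) / (freq + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double avgDocLen = 100.0; // assumed average field length

        // k1 = 0: the component no longer depends on the term frequency.
        System.out.println(tfComponent(1, 100, avgDocLen, 0.0, 0.75));
        System.out.println(tfComponent(50, 100, avgDocLen, 0.0, 0.75));

        // b = 0: a short and a long document receive the same value.
        System.out.println(tfComponent(3, 50, avgDocLen, 1.2, 0.0));
        System.out.println(tfComponent(3, 500, avgDocLen, 1.2, 0.0));

        // Defaults (k1 = 1.2, b = 0.75): the longer document is penalized.
        System.out.println(tfComponent(3, 50, avgDocLen, 1.2, 0.75));
        System.out.println(tfComponent(3, 500, avgDocLen, 1.2, 0.75));
    }
}
```

With k1 = 0 only the IDF of the matched terms would remain in the full score,
and with b = 0 the document length drops out of the formula entirely.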
The easiest way to quickly implement a new ranking method is to extend
SimilarityBase, which provides basic implementations for the low-level parts
of the Similarity contract. Subclasses are only required to implement the
SimilarityBase.score(BasicStats, double, double) and SimilarityBase.toString()
methods.

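As a rough, dependency-free illustration of that two-method contract, the
sketch below uses hypothetical stand-in classes (StatsStandIn,
SimilarityBaseStandIn, ToyTfIdf) instead of Lucene's own types, so it compiles
without Lucene on the classpath. The toy scoring formula is an assumption
chosen for brevity, not one of Lucene's ranking functions. In real code the
subclass would extend SimilarityBase and receive a BasicStats instead.

```java
// Stand-in for the statistics object; NOT Lucene's BasicStats.
final class StatsStandIn {
    final long numberOfDocuments;
    final long docFreq;

    StatsStandIn(long numberOfDocuments, long docFreq) {
        this.numberOfDocuments = numberOfDocuments;
        this.docFreq = docFreq;
    }
}

// Stand-in for the base class; NOT Lucene's SimilarityBase. It only mimics
// the shape of the two methods a subclass has to provide.
abstract class SimilarityBaseStandIn {
    /** Scores one term in one document from its frequency and the document length. */
    protected abstract double score(StatsStandIn stats, double freq, double docLen);

    /** Forces subclasses to describe themselves. */
    @Override
    public abstract String toString();
}

/** Toy ranking function: length-normalized term frequency times a smoothed IDF. */
final class ToyTfIdf extends SimilarityBaseStandIn {
    @Override
    protected double score(StatsStandIn stats, double freq, double docLen) {
        double idf = Math.log(1.0 + (double) stats.numberOfDocuments / stats.docFreq);
        return (freq / docLen) * idf;
    }

    @Override
    public String toString() {
        return "ToyTfIdf";
    }
}
```

Everything else (norm handling, collecting statistics, combining per-term
scores) is the part the base class is described as taking care of, which is
why the subclass stays this small.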
Another option is to extend one of the frameworks based on SimilarityBase.
These Similarities are implemented modularly, e.g. DFRSimilarity delegates
computation of the three parts of its formula to the classes BasicModel,
AfterEffect and Normalization. Instead of subclassing the Similarity, one can
simply introduce a new basic model and tell DFRSimilarity to use it.

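The modular design described above can be sketched without Lucene as three
small strategy interfaces that a scorer combines. All type names below
(BasicModelSketch, AfterEffectSketch, NormalizationSketch, DfrSketch) are
hypothetical stand-ins, and the formulas in main are intended to follow the
textbook DFR forms for the In, L and H2 components (with c = 1); they may
differ from what Lucene's classes actually compute.

```java
// Hypothetical stand-in interfaces; NOT Lucene's BasicModel, AfterEffect
// and Normalization classes.
interface BasicModelSketch {
    double score(double tfn, long numDocs, long docFreq);
}

interface AfterEffectSketch {
    double gain(double tfn);
}

interface NormalizationSketch {
    double tfn(double tf, double docLen, double avgDocLen);
}

final class DfrSketch {
    private final BasicModelSketch basicModel;
    private final AfterEffectSketch afterEffect;
    private final NormalizationSketch normalization;

    DfrSketch(BasicModelSketch basicModel, AfterEffectSketch afterEffect,
              NormalizationSketch normalization) {
        this.basicModel = basicModel;
        this.afterEffect = afterEffect;
        this.normalization = normalization;
    }

    /** Normalizes tf first, then multiplies the information gain by the basic model. */
    double score(double tf, double docLen, double avgDocLen, long numDocs, long docFreq) {
        double tfn = normalization.tfn(tf, docLen, avgDocLen);
        return afterEffect.gain(tfn) * basicModel.score(tfn, numDocs, docFreq);
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Textbook-style components (assumed forms, see the note above).
        BasicModelSketch basicIn = (tfn, n, df) -> tfn * log2((n + 1.0) / (df + 0.5));
        AfterEffectSketch laplace = tfn -> 1.0 / (tfn + 1.0);
        NormalizationSketch h2 = (tf, dl, avdl) -> tf * log2(1.0 + avdl / dl);

        DfrSketch inL2 = new DfrSketch(basicIn, laplace, h2);
        System.out.println(inL2.score(3, 120, 100, 1000, 10));

        // Introducing a new basic model only means swapping one component.
        BasicModelSketch custom = (tfn, n, df) -> tfn * Math.log(1.0 + (double) n / df);
        DfrSketch customL2 = new DfrSketch(custom, laplace, h2);
        System.out.println(customL2.score(3, 120, 100, 1000, 10));
    }
}
```

The point of the sketch is the last two lines of main: the scorer itself is
untouched, and only the basic-model component changes.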
Copyright © 2000-2020 Apache Software Foundation. All Rights Reserved.