Information Retrieval Models: Foundations and Relationships
Information Retrieval (IR) models are a core component of IR research and IR systems. The past decade brought a consolidation of the family of IR models, which around 2000 still consisted of relatively isolated views on TF-IDF, the vector-space model, the probabilistic relevance framework (PRF) and BM25, and language modelling (LM). Though TF-IDF is the model best known outside of IR, within the IR research community there is the view that "TF-IDF is not a model." Moreover, TF-IDF is referred to as "purely heuristic," while BM25 and LM are viewed as instantiations of the PRF. BM25 is seen as a mostly heuristic composition of well-defined (since PRF-based) parameters. LM is viewed as an overall clean, surprisingly simple, and effective approach to IR, but it does not come near TF-IDF regarding the strong intuition that made TF-IDF so popular. The binary independence retrieval (BIR) model and the Poisson model both feed into BM25. Whereas the BIR model is relatively well known and understood, the Poisson model, and its role in exposing the probabilistic roots of TF-IDF and BM25, is less well known.

This book takes a horizontal approach, gathering the foundations of TF-IDF, PRF, BIR, Poisson, BM25, LM, probabilistic inference networks (PINs), and divergence-based models. We also include precision and recall as a "model," since conditional probability and total probability form a link between retrieval models and evaluation models.

After the chapter on models, the book focuses on relationships between models, starting with the frameworks that allow for modelling or exploring those relationships. We review the role of the vector-space "model" as a framework for implementing IR models. Moreover, we reconsider the PRF as the binding element between BIR (BM25), TF-IDF, and LM, thereby pointing out that an LM-based relevance model, "BM25-LM," is a research challenge. When pairing the models to discuss their relationships, we choose TF-IDF as the centre point, and it turns out that the most popular