# [Repost] RankLib: A Library of Learning to Rank Algorithms

Overview

RankLib is a library of learning to rank algorithms. Currently six popular algorithms have been implemented:

• MART (Multiple Additive Regression Trees, a.k.a. gradient boosted regression trees) [6]
• RankNet [1]
• RankBoost [2]
• AdaRank [3]
• Coordinate Ascent [4]
• LambdaMART [5]

It also implements many retrieval metrics and provides several ways to carry out evaluation.

• Changed the calculation of NDCG to the standard version: (2^{rel_i} - 1) / log_2(i+1). As a result, absolute NDCG scores may be slightly lower than in previous versions.
• Fixed a divide-by-zero bug in the sum normalization: for a query q with documents D = {d_1, d_2, …, d_n} and features F = {f_1, f_2, …, f_m}, each feature value is normalized as f_k(d_i) = f_k(d_i) / Σ_{d_j ∈ D} |f_k(d_j)|.
• Added the ability to split the training file into x% train and (100-x)% validation (previous versions only allowed a train/test split, not train/validation).
• Added some minor command-line parameters.
• Internal code cleanup for a slight improvement in efficiency/speed.
• Some command-line parameter strings have changed.
• Older versions: [Here]
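The standard NDCG formula above can be sketched in Java as follows. This is an illustrative re-implementation, not RankLib's actual code; the class and method names are hypothetical.

```java
import java.util.Arrays;

public class NDCGExample {
    // DCG@k with the standard gain: (2^{rel_i} - 1) / log2(i + 1),
    // where i is the 1-based rank position.
    static double dcg(int[] rels, int k) {
        double sum = 0.0;
        for (int i = 1; i <= Math.min(k, rels.length); i++) {
            sum += (Math.pow(2, rels[i - 1]) - 1) / (Math.log(i + 1) / Math.log(2));
        }
        return sum;
    }

    // NDCG@k = DCG@k of the ranked list divided by DCG@k of the ideal
    // (relevance-sorted) ranking of the same documents.
    static double ndcg(int[] rels, int k) {
        int[] ideal = rels.clone();
        Arrays.sort(ideal); // ascending
        // reverse into descending order for the ideal ranking
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            int t = ideal[i]; ideal[i] = ideal[j]; ideal[j] = t;
        }
        double idcg = dcg(ideal, k);
        return idcg == 0.0 ? 0.0 : dcg(rels, k) / idcg;
    }

    public static void main(String[] args) {
        // relevance labels of documents in ranked order
        System.out.println(ndcg(new int[]{3, 2, 3, 0, 1}, 5));
    }
}
```

Note that a perfectly ordered list scores 1.0, and any swap that moves a lower-relevance document above a higher-relevance one lowers the score.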

The binary is located in the “bin” directory and should work right out of the box. If you want to rebuild it, you need ant installed on your machine: open a terminal, go to the RankLib directory, and type “ant”, and you’re good to go.

Mailing list

If you find RankLib useful and want to receive email notices when a new version or bug fix comes out, please subscribe to this Google Groups mailing list. The list is used strictly for announcements, so you won’t see any discussion or spam emails.

If you are not using Gmail, please email me from your favorite email account saying you want to know about RankLib updates, and I will keep you posted.

How to use

Open a terminal and, from the RankLib directory, type “java -jar bin/RankLib.jar”; you will see all necessary parameters. Parameters in square brackets are optional, with their default values shown at the end of the line.

```
Usage: java -jar RankLib.jar <Params>
Params:
  [+] Training (+ tuning and evaluation)
      -train <file>          Training data
      -ranker <type>         Specify which ranking algorithm to use
                               0: MART (gradient boosted regression tree)
                               1: RankNet
                               2: RankBoost
                               3: AdaRank
                               4: Coordinate Ascent
                               6: LambdaMART
      [ -feature <file> ]    Feature description file: list features to be
                             considered by the learner, each on a separate line.
                             If not specified, all features will be used.
      [ -metric2t <metric> ] Metric to optimize on the training data.
                             Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k
                             (default=ERR@10)
      [ -metric2T <metric> ] Metric to evaluate on the test data (default to
                             the same as specified for -metric2t)
      [ -gmax
```

NOTE: ALWAYS include -letor if you’re doing experiments on the LETOR 4.0 dataset. The reason is that a relevance degree of 2 in the dataset is actually counted as 3 (this is based on the evaluation script they provide). To be consistent with their numbers, this program will change 2 to 3 when it loads the data into memory if the -letor flag is specified.
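The relabeling that -letor performs can be sketched as below. The class and method names here are hypothetical, not RankLib's actual API; only the 2-to-3 mapping itself is taken from the note above.

```java
public class LetorLabelAdjust {
    // With -letor, a LETOR 4.0 relevance degree of 2 is counted as 3,
    // to match the dataset's official evaluation script.
    static float adjust(float label) {
        return (label == 2.0f) ? 3.0f : label;
    }

    public static void main(String[] args) {
        System.out.println(adjust(2.0f)); // prints 3.0
        System.out.println(adjust(1.0f)); // prints 1.0
    }
}
```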

File format

The file format of the training, test, and validation files is the same as for SVM-Rank. This is also the format used in the LETOR datasets. Each line represents one training example and has the following format:

```
<line>    .=. <target> qid:<qid> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target>  .=. <float>
<qid>     .=. <positive integer>
<feature> .=. <positive integer>
<value>   .=. <float>
<info>    .=. <string>
```

Here’s an example (taken from the SVM-Rank website). Note that everything after “#” is ignored.

```
3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 3A
3 qid:3 1:1 2:1 3:0 4:0.3 5:0 # 3B
4 qid:3 1:1 2:0 3:0 4:0.4 5:1 # 3C
1 qid:3 1:0 2:1 3:1 4:0.5 5:0 # 3D
```
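A minimal parser for one line of this format can be sketched in Java. This is illustrative only (RankLib ships its own loader); the class name and fields are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SvmRankLine {
    final float target;                 // relevance label
    final int qid;                      // query id
    final Map<Integer, Float> features = new HashMap<>();

    // Parse one line of the SVM-Rank / LETOR format described above.
    // Everything after '#' is a comment and is ignored.
    SvmRankLine(String line) {
        int hash = line.indexOf('#');
        if (hash >= 0) line = line.substring(0, hash);
        String[] toks = line.trim().split("\\s+");
        target = Float.parseFloat(toks[0]);
        qid = Integer.parseInt(toks[1].substring("qid:".length()));
        for (int i = 2; i < toks.length; i++) {
            String[] kv = toks[i].split(":");
            features.put(Integer.parseInt(kv[0]), Float.parseFloat(kv[1]));
        }
    }

    public static void main(String[] args) {
        SvmRankLine ex = new SvmRankLine("3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A");
        System.out.println(ex.target + " qid=" + ex.qid + " f4=" + ex.features.get(4));
    }
}
```

A full loader would read the file line by line and group examples by qid, since ranking metrics are computed per query.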

Examples

Go to the LETOR website and download any of their datasets. For instance, let’s pick MQ2008 from the LETOR 4.0 dataset and put it under the RankLib directory. Type this into the command line:

```
java -jar bin/RankLib.jar -train MQ2008/Fold1/train.txt -test MQ2008/Fold1/test.txt -validate MQ2008/Fold1/vali.txt -ranker 6 -metric2t NDCG@10 -metric2T ERR@10
```

This command trains a LambdaMART ranker: it trains on the training data and records the model that performs best on the validation data. The training metric is NDCG@10. After training completes, the trained model is evaluated on the test data with ERR@10.

The parameter -validate is optional but it often leads to better models. In particular, -validate is very important for RankNet/MART/LambdaMART. Coordinate Ascent, on the other hand, works pretty well without validation data. As a result, my implementation of Coordinate Ascent completely ignores the validation data to reduce training time (this is the only exception in RankLib).

Note, though, that -metric2t (e.g. NDCG, ERR, etc.) only applies to list-wise algorithms (AdaRank, Coordinate Ascent, and LambdaMART). Point-wise and pair-wise techniques (MART, RankNet, RankBoost), due to their nature, always use their internal RMSE / pair-wise loss as the optimization criterion. Thus, -metric2t has no effect on them.

Disclaimer

This program is free to use for any purpose. However, the software and derivatives of it must not be distributed without prior permission of the author.

The program is provided “as is” without warranty of any kind. The author should NOT be held responsible for any damages that result from using this software.

References

[1] C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton and G. Hullender. Learning to rank using gradient descent. In Proc. of ICML, pages 89-96, 2005.
[2] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4: 933-969, 2003.
[3] J. Xu and H. Li. AdaRank: a boosting algorithm for information retrieval. In Proc. of SIGIR, pages 391-398, 2007.
[4] D. Metzler and W.B. Croft. Linear feature-based models for information retrieval. Information Retrieval, 10(3): 257-274, 2007.
[5] Q. Wu, C.J.C. Burges, K. Svore and J. Gao. Adapting Boosting for Information Retrieval Measures. Journal of Information Retrieval, 2007.
[6] J.H. Friedman. Greedy function approximation: A gradient boosting machine. Technical Report, IMS Reitz Lecture, Stanford, 1999; see also Annals of Statistics, 2001.

Feedback

For any questions, suggestions, or bug reports, please email me at “vdang at cs.umass.edu”.