A Benchmark Data Set for Community Question-Answering Research

CQADupStack is a benchmark dataset for community question-answering (cQA) research. It contains threads from twelve StackExchange subforums, annotated with duplicate question information, and comes with predefined training, development, and test splits for both retrieval and classification experiments.

The script below can be used to generate these splits and provides easy access to the questions, answers, comments, users, and other metadata in the set. It also implements several evaluation metrics.
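
To illustrate the kind of retrieval metric commonly reported on duplicate-question data, here is a minimal Python sketch of mean average precision (MAP). It is not taken from the CQADupStack script and does not reflect its API; the function names, the dictionary layout, and the toy post IDs are all assumptions made for the example.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Average precision of one ranked list against a set of known duplicates."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, post_id in enumerate(ranked_ids, start=1):
        if post_id in relevant_ids:
            hits += 1
            # Precision at this rank, counted only where a duplicate is found.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(
    rankings: Dict[str, List[str]], duplicates: Dict[str, Set[str]]
) -> float:
    """Mean of average precision over queries that have at least one duplicate."""
    queries = [q for q in rankings if duplicates.get(q)]
    if not queries:
        return 0.0
    total = sum(average_precision(rankings[q], duplicates[q]) for q in queries)
    return total / len(queries)


# Hypothetical toy data: query post ID -> ranked candidate post IDs.
rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
# Hypothetical gold labels: query post ID -> IDs of its known duplicates.
duplicates = {"q1": {"a", "c"}, "q2": {"e"}}

map_score = mean_average_precision(rankings, duplicates)
print(round(map_score, 4))
```

In this sketch, queries without any annotated duplicate are skipped rather than scored as zero; that is one possible design choice, and a real evaluation may treat them differently.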

Some initial baseline results were published at the Australasian Document Computing Symposium (ADCS) 2015, and an analysis of the quality of the set was published at the WebQA Workshop at SIGIR 2016.


The Data

The Script


Papers that make use of CQADupStack:

