CQADupStack

CQADupStack is a benchmark dataset for community question-answering (cQA) research. It contains threads from twelve StackExchange¹ subforums, annotated with duplicate question information, and comes with predefined training, development, and test splits for both retrieval and classification experiments.

The script below can be used to generate these splits, and provides easy access to the questions, answers, comments, users, and other metadata in the set. It also implements several evaluation metrics.
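To give a sense of what evaluating duplicate-question retrieval involves, here is a minimal, self-contained sketch of mean average precision (MAP). It is not taken from the CQADupStack script and does not use its API: the function names, the dictionary layout, and the choice to skip queries without any known duplicates are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], duplicate_ids: Set[str]) -> float:
    """Average precision of one ranked result list against known duplicates."""
    if not duplicate_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, post_id in enumerate(ranked_ids, start=1):
        if post_id in duplicate_ids:
            hits += 1
            # Precision at this rank, counted only where a duplicate is found.
            precision_sum += hits / rank
    return precision_sum / len(duplicate_ids)


def mean_average_precision(
    rankings: Dict[str, List[str]], gold: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision.

    Assumption: queries with no known duplicates are skipped rather than
    scored as 0; other conventions are possible.
    """
    scores = [
        average_precision(rankings.get(query_id, []), duplicates)
        for query_id, duplicates in gold.items()
        if duplicates
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical post IDs, for illustration only.
example_rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
example_gold = {"q1": {"a", "c"}, "q2": {"y"}}
example_map = mean_average_precision(example_rankings, example_gold)
```

How queries without duplicates are treated can change the reported score noticeably, so it is worth checking which convention a given evaluation uses before comparing numbers.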

Some initial baseline results were published at the Australasian Document Computing Symposium (ADCS) 2015, and an analysis of the quality of the set was published at the WebQA Workshop at SIGIR 2016.

¹http://stackexchange.com/

The Data

The Script

Papers:

CQADupStack: A Benchmark Data Set for Community Question-Answering Research (ADCS 2015)
CQADupStack: Gold or Silver? (WebQA Workshop at SIGIR 2016)

Papers that make use of CQADupStack:

MixKMeans: Clustering Question-Answer Archives, P. Deepak, EMNLP 2016
An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation, Jey Han Lau and Timothy Baldwin, RepL4NLP 2016
Detecting Duplicate Posts in Programming QA Communities via Latent Semantics and Association Rules, Wei Emma Zhang, Quan Z. Sheng, Jey Han Lau, and Ermyas Abebe, WWW 2017
SemEval-2017 Task 3: Community Question Answering, Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, and Karin M. Verspoor, SemEval 2017

For any questions contact