This data contains lists of conference and journal names culled from the Web by Rexa. Given a set of strings referring to the same conference or journal, the task is to determine which string should be the canonical one. The canonical string should be free of spelling, segmentation, and OCR errors, and should in some sense be prototypical of the entity.
For more details, see this paper.
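As a rough illustration of what "prototypical" could mean, the sketch below picks the string that is most similar on average to the other strings in its set (a medoid). This is only a naive baseline assumed here for illustration, not the method described in the paper. The names `similarity` and `pick_medoid` are hypothetical, and the similarity measure is the generic one from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two strings."""
    # autojunk=False disables difflib's heuristic for long sequences.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_medoid(strings: List[str]) -> str:
    """Return the string with the highest total similarity to the others.

    Ties are broken in favour of the string that appears first.
    """
    if not strings:
        raise ValueError("need at least one string")

    def total_similarity(i: int) -> float:
        return sum(
            similarity(strings[i], other)
            for j, other in enumerate(strings)
            if j != i
        )

    best_index = max(range(len(strings)), key=total_similarity)
    return strings[best_index]
```

A baseline like this cannot tell a clean string from a near-identical one with an OCR or segmentation error, which is part of what makes the task non-trivial.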
Strings referring to different venues are separated by a blank line, so each block of consecutive lines lists the strings for a single venue. The first string in each block is the one that has been manually annotated as canonical.
For example, in the block
in proceedings of the ninth national conference on artificial intelligence
in proc 9th conf aaai
in proceedings of the ninth national conference on artifici al intelligence
the first string is the canonical one. The second is an abbreviated form, and the third contains a segmentation error ("artifici al").
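The layout described above (blocks separated by blank lines, canonical string first) can be read with a few lines of Python. This is a minimal sketch: the names `VenueBlock` and `parse_venue_blocks` are hypothetical, and it assumes the data is plain text with one string per line.

```python
from typing import List, NamedTuple


class VenueBlock(NamedTuple):
    canonical: str        # first string in the block (annotated as canonical)
    variants: List[str]   # remaining strings referring to the same venue


def parse_venue_blocks(text: str) -> List[VenueBlock]:
    """Split text into blank-line-separated blocks of venue strings."""
    blocks: List[VenueBlock] = []
    current: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the block collected so far.
            blocks.append(VenueBlock(current[0], current[1:]))
            current = []
    if current:
        # The last block may not be followed by a blank line.
        blocks.append(VenueBlock(current[0], current[1:]))
    return blocks


sample = """in proceedings of the ninth national conference on artificial intelligence
in proc 9th conf aaai
in proceedings of the ninth national conference on artifici al intelligence
"""
blocks = parse_venue_blocks(sample)
print(blocks[0].canonical)
print(len(blocks[0].variants))  # 2
```

Keeping the canonical string separate from the variants makes it straightforward to score a predicted canonical string against the annotation.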
Because the canonical string is in some sense subjective, the data has two sets of labels, located in separate directories:
Some statistics about the data: