Publication Venue Canonicalization Data v1.0

6/6/07: The data is available here: canonicalization_data_v1.0.tgz (md5sum = 1752b38bbe66d99af38a429a3b157bbd)
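To confirm a download against the md5sum listed above, something like the following sketch could be used. The function names and the chunked-read approach are illustrative choices, not part of the distribution; the archive is assumed to sit at whatever local path you pass in.

```python
import hashlib

# Checksum copied from the download note above.
EXPECTED_MD5 = "1752b38bbe66d99af38a429a3b157bbd"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the expected checksum (hypothetical helper)."""
    return md5_of_file(path) == expected.lower()
```

For example, archive_is_intact("canonicalization_data_v1.0.tgz") should return True for an uncorrupted copy of the archive.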

This data contains lists of conference and journal names culled from the Web by Rexa. Given a set of strings referring to the same conference or journal, the task is to determine which string should be the canonical one. The canonical string should be free of spelling, segmentation, and OCR errors, and should in some sense be prototypical of the entity.

For more details, see this paper.

Each block lists the strings that refer to a single venue, and blocks for different venues are separated by a blank line. The first string in each block is the one that has been manually annotated as canonical. For example, consider the block
in proceedings of the ninth national conference on artificial intelligence
in proc 9th conf aaai
in proceedings of the ninth national conference on artifici al intelligence

The first string is the canonical one. The second is an abbreviated form, and the third contains a segmentation error ("artifici al").
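The block format described above can be read with a few lines of Python. This is a minimal sketch, assuming one string per line and blank lines as the only separator; the function name parse_venue_blocks is hypothetical and not part of the distribution.

```python
from typing import Iterable, List, Tuple


def parse_venue_blocks(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Group lines into blank-line-separated blocks.

    Returns one (canonical, variants) pair per block, where the canonical
    string is the first line of the block and variants are the remaining lines.
    """
    blocks: List[Tuple[str, List[str]]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the current block.
            blocks.append((current[0], current[1:]))
            current = []
    if current:
        # The last block may not be followed by a blank line.
        blocks.append((current[0], current[1:]))
    return blocks


sample = """in proceedings of the ninth national conference on artificial intelligence
in proc 9th conf aaai
in proceedings of the ninth national conference on artifici al intelligence
"""
blocks = parse_venue_blocks(sample.splitlines())
canonical, variants = blocks[0]
```

Here canonical holds the first string of the example block and variants holds the other two.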

Because the choice of canonical string is somewhat subjective, the data has two sets of labels, located in separate directories:

  • long - The canonical string is the long, full venue name (e.g., National Conference on Artificial Intelligence)
  • short - The canonical string is the abbreviated venue name (e.g., AAAI)

Each directory is also split into five cross-validation splits (e.g., long/train.1 and long/test.1 form one split).
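A small helper can enumerate the train/test file pairs for one label set. This is a sketch under assumptions: the splits are taken to be numbered 1 through 5 (only train.1 and test.1 are named above), and the root directory name used in the example is a guess at where the archive unpacks. The function name split_paths is hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple


def split_paths(
    root: Path, label_set: str, n_splits: int = 5
) -> Iterator[Tuple[Path, Path]]:
    """Yield (train, test) paths for each cross-validation split.

    Assumes files are named train.<i> and test.<i> for i = 1..n_splits.
    """
    for i in range(1, n_splits + 1):
        yield root / label_set / f"train.{i}", root / label_set / f"test.{i}"


# The root directory name below is an assumption about the extracted archive.
pairs = list(split_paths(Path("canonicalization_data_v1.0"), "long"))
```

Passing "short" instead of "long" gives the corresponding pairs for the abbreviated-name labels.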

Some statistics about the data:

  • 3,683 venue strings
  • 100 unique venues
  • 32,556 tokens