Design and data collection: Each pair was randomly matched with a comparison pair and rated in this setting (as either more or less related than the comparison point) by a single Turker (hence, there is no well-defined inter-subject agreement). Rather than asking annotators to give an absolute score reflecting how semantically related a word pair is (as in WordSim353), we asked them to make comparative judgements on two pairs at a time, because this is more natural for an individual annotator and permits seamless integration of the supervision from many annotators, each of whom may have a different internal "calibration" for relatedness strength. Moreover, binary choices were preferred because they make the construction of "right" and "wrong" control items straightforward. In total, each pair was rated in this way against 50 comparison pairs, so that, although the Turkers' choices were binary, each pair received a final score on a 50-point scale.
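As a rough illustration of how the binary choices add up to a 50-point score (this is not the original collection pipeline; the data layout and function name below are assumptions), the judgments can simply be counted per target pair:

```python
from collections import defaultdict

def aggregate_scores(judgments):
    """Sum binary comparative judgments into per-pair scores.

    `judgments` is assumed to be an iterable of (target_pair, won) tuples,
    where `won` is True if the Turker judged the target pair to be more
    related than its randomly matched comparison pair. With 50 comparisons
    per target pair, the resulting scores fall on a 0-50 scale.
    """
    scores = defaultdict(int)
    for target_pair, won in judgments:
        if won:
            scores[target_pair] += 1
    return dict(scores)

# Toy example: ("car", "automobile") wins 2 of the 3 comparisons shown here.
judgments = [
    (("car", "automobile"), True),
    (("car", "automobile"), True),
    (("car", "automobile"), False),
]
print(aggregate_scores(judgments))  # {('car', 'automobile'): 2}
```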
Instructions: We did not instruct the subjects about the difference between similarity and relatedness. We used only the latter term, and gave examples involving both similarity, which is a special case of relatedness ("car-automobile"), and more general relatedness ("wheels-car").
Participants: We requested that participants be native speakers and only accepted those connecting from an English-speaking country. We cannot guarantee that non-native speakers did not take part in the study, but our subject filtering, based on control "gold standard" pairs, ensures that only the data of speakers with a good command of English were retained.
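A minimal sketch of this kind of control-based filtering is given below; the data structures and the 0.8 accuracy threshold are illustrative assumptions, not the exact criteria used for MEN.

```python
def filter_raters(control_answers, correct_answers, min_accuracy=0.8):
    """Keep only raters who answer enough control ("gold standard") items correctly.

    `control_answers` maps rater id -> list of (control_item, response);
    `correct_answers` maps control_item -> expected response.
    The 0.8 threshold is illustrative, not the value used for MEN.
    """
    kept = set()
    for rater, answers in control_answers.items():
        correct = sum(
            1 for item, response in answers
            if correct_answers.get(item) == response
        )
        if answers and correct / len(answers) >= min_accuracy:
            kept.add(rater)
    return kept

# Toy example: rater B misses one of two control items and is dropped.
raters = {
    "A": [("gold-1", "more"), ("gold-2", "more")],
    "B": [("gold-1", "less"), ("gold-2", "more")],
}
gold = {"gold-1": "more", "gold-2": "more"}
print(filter_raters(raters, gold))  # {'A'}
```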
Agreement: Because raters saw the MEN pairs matched to different random items, with the number of pairs also varying from rater to rater, it is not possible to compute annotator agreement scores for MEN. However, to get a sense of human agreement, the first and third authors rated all 3,000 pairs (presented in different random orders) on a standard 1-7 Likert scale. The Spearman correlation between the two authors' ratings is 0.68, and the correlation of their average ratings with the MEN scores is 0.84. On the one hand, this high correlation suggests that MEN contains meaningful semantic ratings; on the other, it can also be taken as an upper bound on what computational models can realistically achieve when simulating the human MEN judgments.
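Comparisons like these, and evaluations of computational models on MEN, are based on Spearman correlation. The sketch below, which assumes scores stored in dictionaries keyed by word pair and uses scipy.stats.spearmanr, shows one way to compute it:

```python
from scipy.stats import spearmanr

def evaluate_on_men(men_scores, model_scores):
    """Spearman correlation between MEN gold scores and model scores.

    Both arguments are assumed to map a (word1, word2) tuple to a number;
    only pairs scored by both are compared.
    """
    shared = sorted(set(men_scores) & set(model_scores))
    gold = [men_scores[p] for p in shared]
    pred = [model_scores[p] for p in shared]
    rho, _ = spearmanr(gold, pred)
    return rho
```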
All sets (development, testing and combined) are available in two formats:
If you publish results based on this data set, please cite:
E. Bruni, N. K. Tran and M. Baroni. 2014. Multimodal Distributional Semantics. Journal of Artificial Intelligence Research 49: 1-47.
Please also inform your readers of the current location of the data set:
http://clic.cimec.unitn.it/~elia.bruni/MEN