Evaluating embeddings on dictionary-based similarity

We propose a method for evaluating embeddings against dictionaries with tens or hundreds of thousands of entries, covering the entire gamut of the vocabulary.


Introduction
Continuous vector representations (embeddings) are, to a remarkable extent, supplementing and potentially taking over the role of dictionaries in a broad variety of tasks ranging from POS tagging (Collobert et al., 2011) and parsing (Socher et al., 2013) to MT (Zou et al., 2013), and beyond (Karpathy, Joulin, and Li, 2014). Yet an evaluation method that directly compares embeddings on their ability to handle word similarity across the entire breadth of a dictionary has been lacking. This is all the more regrettable given that embeddings are normally generated from gigaword or larger corpora, while the state-of-the-art test sets surveyed in Chiu, Korhonen, and Pyysalo (2016) range from a low of 30 word pairs (MC-30) to a high of 3,000 (MEN).
We propose to develop a dictionary-based standard in two steps. First, given a dictionary such as the freely available Collins-COBUILD (Sinclair, 1987), which has over 77,400 headwords, or Wiktionary (162,400 headwords), we compute a frequency list F that gives the probabilities of the headwords (this is standard, and discussed only briefly), and a dense similarity matrix M or an embedding ψ; this is discussed in Section 2. Next, in Section 3 we take an arbitrary embedding φ and systematically compare both its frequency and its similarity predictions to the gold standard embodied in F and ψ, building on the insights of Arora et al. (2015). Pilot studies conducted along these lines are discussed in Section 4. Before turning to the details, in the rest of this Introduction we attempt to evaluate the proposed evaluation itself, primarily in terms of the criteria listed in the call. As we shall see, our method is highly replicable by other researchers for English and, to the extent monolingual dictionaries are available, for other languages as well. Low-resource languages will typically lack a monolingual dictionary, but this is less of a problem in that they also lack larger corpora, so building robust embeddings is already out of the question for them. The costs are minimal, since we are just running software on preexisting dictionaries. Dictionaries are initially hard to assemble, require a great deal of manual labor, and are often copyrighted, but our point here is to leverage the manual (often crowdsourced) work they already embody.
The proposed algorithm, as we present it here, is aimed primarily at word-level evaluation, but there are standard methods for extending word-level measures to sentence similarity (Han et al., 2013). Perhaps the most attractive downstream application we see is MT, in particular word sense disambiguation during translation. As for linguistic/semantic/psychological properties, dictionaries, both mono- and bilingual, are crucial resources not only for humans (language learners, translators, etc.) but also for a variety of NLP applications, including MT, cross-lingual information retrieval, cross-lingual QA, computer-assisted language learning, and many more. The mandate of lexicographers is to capture a huge number of linguistic phenomena ranging from gross synonymy to subtle meaning distinctions, and at the semantic level the inter-annotator agreement is very high, a point we discuss in greater detail below. Gladkova and Drozd (2016) quote Schütze (2016) to the effect that "human linguistic judgments (...) are subject to over 50 potential linguistic, psychological, and social confounds", and many of these taint the crowd-sourced dictionaries, but lexicographers are annotators of a highly trained sort, and their work gives us valuable data, as near to laboratory purity as it gets.

Constructing the standard
Our main inputs are a frequency list F , ideally generated from a corpus we consider representative of the text of interest (the expected input to the downstream task), and a preexisting dictionary D which is not assumed to be task-specific. For English, we use both the Collins-COBUILD dictionary (CED) and Wiktionary, as these are freely available, but other general-purpose dictionaries would be just as good, and for specific tasks (e.g. medical or legal texts) it may make sense to add in a task-specific dictionary if available. Neither D nor F need contain the other, but we assume that they are stemmed using the same stemmer.
Figure 1: Building the standard (parse dictionary → adjacency matrix → SVD)

The first step is to parse D into word, definition stanzas. (This step is specific to the dictionary at hand; see e.g. Mark Lieberman's readme for CED.) Next, we turn the definitions into dependency graphs. We use the Stanford dependency parser (Chen and Manning, 2014) at this stage, and have not experimented with alternatives. This way, we can assign to each word a graph with dependency labels; see Fig 2 for an example, and Recski (2016) for details. The dependency graphs are not part of the current incarnation of the evaluation method proposed here, but are essential for our future plans of extending the evaluation pipeline (see Section 4).
In the second step we construct two global graphs: the definitional dependency graph DD, which has a node for each word in the dictionary and a directed edge running from w_i to w_j if w_j appears in the definition of w_i; and the headword graph HG, which only retains the edge running from the definiendum to the head of the definiens. We take the head to be the 'root' node returned by the Stanford parser, but in many dictionaries the syntactic head of the definition is typographically set aside and can be obtained directly from the raw D.
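The construction of DD and HG can be sketched in a few lines. In this sketch the dictionary is a plain word-to-tokenized-definition mapping, and a caller-supplied `head_of` callable stands in for the Stanford parser's root extraction; the function and variable names, and the toy dictionary, are invented for illustration.

```python
from collections import defaultdict

def build_graphs(dictionary, head_of):
    """Build the definitional dependency graph DD and the headword graph HG.

    dictionary: maps each headword to its tokenized definition.
    head_of: maps a tokenized definition to its syntactic head (in the paper
             this is the 'root' node of a dependency parse; here it is a
             pluggable callable so any parser can be substituted).
    """
    DD = defaultdict(set)  # edge w_i -> w_j iff w_j occurs in the definition of w_i
    HG = defaultdict(set)  # only the edge from definiendum to the head of the definiens
    vocab = set(dictionary)
    for word, definition in dictionary.items():
        for tok in definition:
            if tok in vocab:  # keep only tokens that are themselves headwords
                DD[word].add(tok)
        head = head_of(definition)
        if head in vocab:
            HG[word].add(head)
    return DD, HG

# Toy dictionary; the stand-in "head" heuristic just picks the first token.
toy = {
    "client": ["person", "who", "uses", "the", "services", "of", "a", "professional"],
    "person": ["human", "being"],
    "services": ["work", "done", "for", "a", "person"],
}
DD, HG = build_graphs(toy, head_of=lambda d: d[0])
```

With a real dictionary the only change is swapping the toy data for the parsed stanzas and `head_of` for the parser's root extractor.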
At first blush it may appear that the results of this process are highly dependent on the choice of D, and perhaps on the choice of the parser as well. Consider the definition of client taken from four separate sources: 'someone who gets services or advice from a professional person, company, or organization' (Longman); 'a person who pays a professional person or organization for services' (Webster); 'a person who uses the services or advice of a professional person or organization' (Oxford); 'a person or group that uses the professional advice or services of a lawyer, accountant, advertising agency, architect, etc.' (dictionary.com). The definitions do not literally preserve the headword (hypernym, genus, IS A): in three cases we have 'person', in one 'someone'. But semantically, these two headwords are very close synonyms, distinguished more by POS than by content. Similarly, the various definitions do not present the exact same verbal pivot, 'engage/hire/pay for/use the services of', but their semantic relatedness is evident. Finally, there are differences in attachment, e.g. is the service rendered professional, or is the person/organization rendering the service professional? In Section 3 we will present evidence that the proposed method is not overly sensitive to these differences, because the subsequent steps wipe out such subtle distinctions.
In the third step, by performing SVD on the Laplacian of the graphs DD and HG we obtain two embeddings we call the definitional and the head embedding. For any embedding ψ, a (symmetric, dense) similarity matrix M_{i,j} is given by the cosine similarity of ψ(w_i) and ψ(w_j). Other methods for computing the similarity matrix M are also possible, and the embedding could also be obtained by direct computation, setting the context window of each word to its definition; we defer the discussion of these and similar alternatives to the concluding Section 4. Now we define the direct similarity of two embeddings φ and ψ as the average of the (cosine) similarities of the words that occur in both:

S(φ, ψ) = (1/|V|) Σ_{w∈V} cos(φ(w), ψ(w)), where V is the shared vocabulary.

It may also make sense to use a frequency-weighted average, since we already have a frequency table F; we return to this matter in Section 3. In and of itself, S is not a very useful measure, in that even random seeding effects are sufficient to destroy similarity between near-identical embeddings, such as could be obtained from two halves of the same corpus. For example, the value of S between 300-dimensional GloVe (Pennington, Socher, and Manning, 2014) embeddings generated from the first and the second halves of the UMBC Webbase (Han et al., 2013) is only 0.0003. But for any two embeddings, it is an easy matter to compute the rotation (orthonormal transform) R and the general linear transform G that maximize S(φ, R(ψ)) and S(φ, G(ψ)) respectively, and it is these rotational resp. general similarities S_R and S_G that we will use. For the same embeddings, we obtain S_R = 0.709, S_G = 0.734. Note that only S_R is symmetrical between embeddings of the same dimension; for S_G the order of arguments matters.
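A minimal sketch of S and S_R, assuming embeddings are stored as word-to-numpy-vector dicts. The rotation here is the standard orthogonal Procrustes solution, which maximizes the summed dot products after alignment; this is used as a stand-in for maximizing the mean cosine directly, and the function names are illustrative.

```python
import numpy as np

def s_direct(phi, psi):
    """Direct similarity S: mean cosine similarity over the shared vocabulary."""
    shared = sorted(set(phi) & set(psi))
    cos = [np.dot(phi[w], psi[w]) / (np.linalg.norm(phi[w]) * np.linalg.norm(psi[w]))
           for w in shared]
    return float(np.mean(cos))

def rotational_similarity(phi, psi):
    """S_R: align psi to phi with the orthonormal transform R solving the
    orthogonal Procrustes problem, then return S(phi, R(psi))."""
    shared = sorted(set(phi) & set(psi))
    A = np.array([phi[w] for w in shared])  # target embedding
    B = np.array([psi[w] for w in shared])  # embedding to be rotated
    U, _, Vt = np.linalg.svd(B.T @ A)       # Procrustes solution: R = U @ Vt
    R = U @ Vt
    rotated = {w: B[i] @ R for i, w in enumerate(shared)}
    return s_direct(phi, rotated)
```

As a sanity check, an embedding compared against a randomly rotated copy of itself should score near 0 under S but near 1 under S_R, mirroring the GloVe half-corpus example above.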
With this, the essence of our proposal should be clear: we generate ψ from a dictionary, and measure the goodness of an arbitrary embedding φ by means of computing S R or S G between φ and ψ. What remains to be seen is that different dictionary-based embeddings are close to one another, and measure the same thing.

Using the standard
In the random walk on context space model of Arora et al. (2015), we expect the log frequency of words to have a simple linear relation to the squared length of the word vectors: log p(w) = ‖w‖²/(2d) − log Z ± o(1). Kornai and Kracht (2015) compared GloVe to the Google 1T frequency count (Brants and Franz, 2006) and found a correlation of 0.395, with the frequency model failing primarily in distinguishing mid- from low-frequency words. The key insight we take from Arora et al. (2015) is that an embedding is both a model of frequency, whose merit can be tested by direct comparison to F, and a model of cooccurrence, given by log p(w, w′) = ‖w + w′‖²/(2d) − 2 log Z ± o(1). Needless to say, the word, definition stanzas of a dictionary do not constitute a random walk: to the contrary, they amount to statements of semantic, rather than cooccurrence-based, similarity between definiendum and definiens, and this is precisely what makes dictionaries the appropriate yardstick for evaluating embeddings.
State of the art on Simlex-999 was ρ = 0.64 (Banjade et al., 2015), obtained by combining many methods and data sources. More recently, Wieting et al. (2015) added paraphrase data to achieve 0.69, and subsequent work added dictionary data to get to 0.76. Standard, widely used embeddings used in isolation do not come near this: the best we tested was GoogleNews-vectors-negative300, which gets only ρ = 0.44; senna gets 0.27; and hpca.2B.200d gets 0.16, very much in line with the design goals of Simlex-999. The purely dictionary-based embeddings are even worse: the best obtains only ρ = 0.082 at 300 dimensions, and ρ = 0.079 at 30 dimensions.
A heuristic indication that the choice of dictionary will be a secondary factor comes from the fact that dictionary-based embeddings are close to one another. As can be seen, the ρ and S_R numbers largely, though not entirely, move together. This is akin to the astronomers' method of building the 'distance ladder', starting from well-understood measurements (in our case, Simlex-999) and correlating these with the new technique proposed here. While Chiu, Korhonen, and Pyysalo (2016) make a rather compelling case that test sets such as MEN, MTurk-28, RareWord, and WS353 are not reliable for predicting downstream results, we present here ρ values for the two largest tasks: MEN, with 3,000 word pairs, and RareWord, ideally 2,034, but in practice considerably fewer, depending on the intersection of the embedding vocabulary with the Rare Word vocabulary (given in the last column of Table 2). We attribute the failure of the lesser test sets, amply demonstrated by Chiu, Korhonen, and Pyysalo (2016), simply to undersampling: a good embedding will have 10^5 or more words, and the idea of assessing its quality on less than 1% of these simply makes no sense, given the variability of the data. A dictionary-wide evaluation improves this by an order of magnitude or more.
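For completeness, the ρ computation on such word-pair test sets is straightforward; a self-contained sketch (all names illustrative). The tied-rank handling matters on real test sets, where human similarity scores repeat, and the coverage count corresponds to the vocabulary-intersection column of Table 2.

```python
def ranks(values):
    """Ranks starting at 1; ties receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

def evaluate(embedding, pairs):
    """Rho between gold scores and embedding cosines on the covered pairs;
    also returns the coverage count (intersection with the vocabulary)."""
    covered = [(w1, w2, g) for w1, w2, g in pairs
               if w1 in embedding and w2 in embedding]
    gold = [g for _, _, g in covered]
    pred = [cosine(embedding[w1], embedding[w2]) for w1, w2, _ in covered]
    return spearman(gold, pred), len(covered)
```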

Conclusions, further directions
An important aspect of the proposal is the possibility of making better use of F. By optimizing the frequency-weighted rotation we put the emphasis on the function words, which may be very appropriate for some tasks. In other tasks, we may want to simply omit the high-frequency words, or give them very low weights. In medical texts we may want to emphasize the words that stand out from the background English frequency counts. To continue with astronomy, the method proposed in this paper is akin to a telescope, which can be pointed at various phenomena. It is clear from the foregoing that we are offering not a single measurement yardstick but rather a family of these. Lexicographers actually include information that we are only beginning to explore, such as the NSUBJ and DOBJ relations that are also returned in the dependency parse. These can also be built into, or even selectively emphasized in, the similarity matrix M, which would offer a more direct measurement of the potential of individual embeddings in e.g. semantic role labeling tasks. We can also create large-scale systematic evaluations of paraphrase quality, using definitions of the same word coming from different dictionaries; Wieting et al. (2015) already demonstrated the value of paraphrase information on Simlex-999.
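The simplest of these reweightings, the frequency-weighted variant of the direct similarity S, is a one-line change to the unweighted average; a sketch, with the weights taken from F and normalized over the shared vocabulary (function name illustrative):

```python
def s_weighted(phi, psi, freqs):
    """Frequency-weighted direct similarity: each shared word's cosine is
    weighted by its (renormalized) probability in the frequency list F."""
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        return num / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)
    shared = [w for w in phi if w in psi and w in freqs]
    total = sum(freqs[w] for w in shared)
    return sum(freqs[w] / total * cos(phi[w], psi[w]) for w in shared)
```

Omitting high-frequency words or emphasizing domain-specific ones amounts to substituting a different weight table for `freqs`.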
We have experimented with headword graphs that retain only the head of a definition, typically the genus. Since the results were very poor, we do not burden the paper with them, but note the following. HGs are very sparse, and SVD does not preserve much information from them (the ultimate test of an embedding would be the ability to reconstruct the dictionary relations from the vectors). Even in the best of cases, such as hypernyms derived from WordNet, the relative weight of this information is low (Banjade et al., 2015). That said, the impact of hypernym/genus on the problem of hubness (Dinu, Lazaridou, and Baroni, 2015) is worth investigating further.
One avenue of research opened up by dictionary-based embeddings is to use not just the definitional dependency graph, but an enriched graph that contains the unification of all definition graphs parsed from the definitions. This will, among other issues, enable the study of selectional restrictions (Chomsky, 1965), e.g. that the subject of elapse must be a time interval, the object of drink must be a liquid, and so on. Such information is routinely encoded in dictionaries. Consider the definition of wilt '(of a plant) to become weak and begin to bend towards the ground, or (of a person) to become weaker, tired, or less confident'. To the extent the network derived from the dictionary already contains selectional restriction information, a better fit with the dictionary-based embedding is good news for any downstream task.