Synonym Generation at BloomReach

This is a post by Apurva Gupta, Antariksh Bothale and Soubhik Bhattacharya, engineers from the Search Quality Team at BloomReach. 

Abstract

As a company that’s in the business of helping people find things and helping our customers “Get Found,” it’s important for BloomReach to accurately understand what people mean when they search for something. User queries are written in human language, which leads to dialectal variations in vocabulary (different names for the same thing—cookies/biscuits). Then there is the problem of context. In the phrases “frozen yogurt” and “frozen meat,” the word “frozen” carries a very different meaning from the meaning it carries in the phrase, “Frozen toys,” which refers to the hit Disney movie. This problem isn’t helped by the fact that search queries are often just bunches of words, devoid of capitalization, punctuation and/or syntactical clues that can make the meaning clearer – ”nj devils tees” would need to be understood as ”New Jersey Devils T-shirts.” Over- and under-specificity of search queries is another common problem – people searching for “hp printers,” might also be interested in buying printers from other brands.

From new fashion trends and terminology (jorts, anyone?), to words acquiring new connotations (such as the frozen → Frozen™ example), the e-commerce world springs several traps on us on our way to improving query understanding and improving search quality.

Synonym extraction and generation is one of the ways in which we deal with the continuously evolving, contextual nature of human language. To bridge the gap between the query and actual content, we figure out different possible representations of a word and process them to generate synonyms. To mine for synonym pairs, we look at descriptions of 100 million products, process billions of lines of text downloaded from the Web and also look for possible synonyms for queries amongst other queries [30 million queries].

At BloomReach, we do this every week, learning new words and meanings from new data.

Why do synonyms need automated extraction/generation?

Traditionally e-commerce search engines have relied on following approaches to attain synonyms:

  1. Thesauri: There are many freely available thesauri  on the market, which capture different synonyms of words.
  2. Stemmers: e.g. Porter stemmer, which contains rules to reduce words to their root forms — “mangoes => mango.”
  3. Curated Dictionary: Manual entries that define synonyms.

In our tests we found these approaches to have a slew of problems. For example:

  1. Context: A thesaurus will have “black => dark” as synonyms. This is applicable in certain contexts such as “black night => dark night”, but not in “black dress => dark dress”.
  2. Usage: Thesauri do not cover general use of language. “rolling laptop case => rolling computer bag”. In other words, thesauri are not designed for e-commerce or products.
  3. Evolution: “frozen princess => elsa” is a relationship with which all viewers of the movie “Frozen” would be familiar, but which dictionaries would not be.
  4. Precision: In Web search, a user is presented with millions of results and generally finds relevant results among the first few. Whereas in e-commerce, people tend to go through hundreds of results to find the product they wanted. In such a scenario, tail results also need to be relevant, e.g. showing “short stemmed roses” for a search of “shorts” is a very bad user experience.
  5. Sorting: Also, e-commerce portals allow users to sort by price, popularity etc. Such sorting prominently shows users results that otherwise would have been at the end. Hence adding a bad or context-free synonym such as “black dress => dark dress,” may, under certain sorts, show users dresses which are not black.
  6. Hyponyms: It is difficult to find mappings such as “computer accessories => mouse, keyboard” in a dictionary. These are not synonyms but nevertheless are required in order to bridge the gap from user query to the product.

Although these problems can be resolved by using heuristics or by cleaning/augmenting the dictionaries manually, such an approach requires significant investment of manpower; and as language keeps evolving, it is a continuous investment.

Different types of synonyms

Given a query Q, a phrase Q’ will be called synonym of Q, if results of Q’ are also relevant for Q.

Adhering to above definition, we can classify synonyms into various classes: 

  1. Spelling or Lemma variants (e.g. for the query “wifi adapter”)
    1. Abbreviation: wireless usb adapter
    2. Space variation: wi fi usb adapter
    3. Lemma Variants: wifi usb adaptor
    4. Stem: wifi usb adapters, wireless usb adapters
  2. Same meaning
    1. women’s =>  ladies’
    2. graph paper => grid paper
  3. Hyponyms/Hypernyms.
    In this class, synonyms describe a superset of the query:

    1. desktop pc => computers.
    2. football gear => football cleats, football gloves, football helmet.
  4. Related. These are not synonyms, but related phrases. They can be substituted when results are not available for true synonyms.
    1. paper plates => plastic plates
    2. abacus => calculator

Different types of synonyms are applied with varying confidence and methodologies: e.g. adapter => adapters can be applied with very high confidence and irrespective of context but “apple => apples” cannot be. Similarly, “plastic plates” should be shown for a search of “paper plates,” only when paper plates are not available. Synonyms need to be graded by quality and type when they are generated in order to account for the complexity of properly applying them.

Generation

The process of synonym generation is closer to “mining” than “searching:” e.g. we do not specifically generate synonyms for a query Q, but we generate all possible synonym pairs and expect pairs with Q or its subparts to have been generated. Just like mining, we can increase the probability of finding useful pairs by looking into the right data and using the right techniques. And just like mining, it will be all in the background, hidden from search users.

Representation

In a few words, the process of synonym generation can be classified into one which filters candidate synonym pairs, where each pair is a tuple of two phrases <P1,P2>. Let us take the example of this pair <wireless presenter, presentation clicker>. To us (humans) it is obvious that these two phrases are talking about the same objects. But to a machine, which has no context of what a presentation is or what it means to click, these two phrases are just sequences of letters, as randomly paired as any two other phrases. A simple English phrase, such as “wireless presenter,” contains a lot of contextual information, which goes amiss when we use just these two words to describe something. In representation phase we get a more complete description about this phrase using different sources of information. A few such possible representations and the information they augment are:

  1. Neighboring Context (Bag or embeddings)
    You shall know a word by the company it keeps (Firth, J. R.)We collect phrases surrounding a phrase (e.g. frozen toys) from all over the Web. Then we count and sort these phrases by their frequency. Thus we obtain a list of frequently occurring phrases around “frozen toys.” These phrases describe the contexts in which “frozen toys” occur. Intuition behind such representation is that words/phrases with similar meaning would occur in similar contexts. Thus by comparing contexts of two phrases, we can compute a similarity score – which gives a quantitative association between two words, or phrases:

    • P(P1,P2) = probability of P1 and P2 occurring adjacently in same query
    • Rep(P1) = P(P1,Pi) for all Pi where {Pi} is the set of all phrases in corpus and Rep(P1) is the final representation of P1 in this method.

    Recently, word embeddings such as word vectors generated by word2vec have been discovered to represent such contextual information in many fewer dimensions.

  2. Product Clicks (Behavioral): Each phrase is represented as a probability distribution of products/documents that have been viewed/clicked/bought etc., after a user used this phrase as a query. Intuition behind such a  representation is that for two search queries, q1 and q2, if users have the same/similar intent, they will click on similar documents and will have similar choices of documents. For example, if a product document is a relevant swimsuit, it will also be relevant  swimwear. So, intuitively it should get user clicks for both the search queries and it will also be clicked more frequently than a less relevant swimwear/swimsuit.
    • P(d1,q) = probability of document d1 being clicked after q was queried and before any other query is done
    • REP(q) = {P(di,q)} for all documents di
  3. Document vector: A phrase is represented by the “average of documents“ that contain this phrase. To achieve this, we first represent each document containing the phrase as a tf-idf vector of terms/phrases in that document. Then we take the average of all these vectors to achieve a vector representation for the “average of documents.”
    • Rep(D1) =<tf-idf(Pi)> for all phrases Pi in document D1 {this is representation of document D1}
    • Rep(P1) = average of Rep(Di) for all Di where P1 occurs {average of all documents containing P1 , represents P1}
  4. Nearby queries: Whenever a user queries for “X”  and then issues another query, “Y”, we add “Y” to representation of “X”. Thus we obtain a frequency count of phrases (such as “Y”), which can be used to represent “X”.P(P1,P2) = probability of P1 and P2 occurring in same sessionRep(P1) = P(P1,Pi) for all Pi where {Pi} is the set of all phrases in corpus and Rep(P1) is the final representation of P1 in this method.

Such representations map the humble two-word pairs to a pair of vectors, which have hundreds of dimensions. This new pair contains far more information and is more obvious for an algorithm as a pair.

Similarity

We borrow different distance and divergence metrics from various spaces to judge the quality of a candidate pair.

  1. Different similarity measures:
    • Asymmetric: Asymmetric measures are used to figure out hyponym/hypernym pairs. For example a low value of |A∩B| / |A| and high value of |A∩B| / |B|, indicates that B is close to being a subset of A. If A and B were products viewed for queries A’ and B’, this measure can be used to determine if A is hypernym of B or vice versa.
    • Symmetric: Symmetric measures determine if A and B are approximately equivalent or contain similar information. To compare the distributional representation of clicks (2) of two phrases, we use KL – divergence, which is a measure of the information lost when distribution Q is used to approximate distribution P.

    KL – Divergence of two distributions for phrase P and phrase Q:

    D_{KL}(P||Q) = \sum_i \ln \left( \frac{P(i)}{Q(i)} \right) P(i)

    Where summation is over the set of documents and P(i) is the probability of a document being clicked for phrase P. But this is a directional measure and is valid only when Q = 0 implies P = 0. Hence we use a symmetrized version which is JS divergence.

    JSD(P||Q) = \frac{D(P||M) + D(Q||M)}{2};

    where

    M = \frac{1}{2}\left(P + Q \right)

    A good JSD score implies that P and Q are both good approximations of each other. Although, if P and Q are over a very small event space, confidence on this score will not be enough to use P and Q as synonym. Having another measure of confidence has proven to help in such cases.

  2. Augment with extra similarities: Besides computing the similarities on the basis of the above representations, we can augment certain similarities such as edit distance.

After computing all similarities and augmentation, we have various scores which tell us the grade and type of synonym pairs. We observed that simple heuristics such as thresholding over the vector of scores works well. But these scores are continuously distributed and choosing a threshold manually caused a lot of good synonyms to get removed. This led us to the finding that thresholds would have to be phrase specific. For example, the distance between “black” and “dark” is less than the distance between “timer => stopwatch,” whereas, the second one happens to be a better synonym pair. We found that using scores of some obvious synonyms for a query helped in determining the threshold: e.g. Any phrase which is closer to “shoe” than “shoes” would be a good synonym for shoe. Thus, D(P, Q) < D(P,P’) where P’ is an obvious variation of P implies that P, Q are good candidates for synonymy.

Since we had already represented phrases and similarity of pairs as vectors , we can also feed these vectors to a supervised classifier for extracting synonyms.

Diagram

Illustration of synonym generation system

Illustration of synonym generation system

Conclusion

In this blog we have described how we generate different types of synonyms at BloomReach. It helps us in bridging the gap between our customers and the end users who search on e-commerce websites. We have tried our best to keep the methods independent of language, by not relying on grammar, etc. This will help us in serving people searching for products in different languages. Just like spoken language, our algorithms have their share of problems. They also make mistakes and have their good and bad days. We continuously work at improving them and welcome any suggestions or feedback.

References