Friday, January 08, 2021

Cooking and music words: Solution; and the importance of style

The submission deadline has passed, so let's solve this puzzle! And spoiler warning -- the solution will be revealed in the second half of this post. Again, here's the Sunday Puzzle from 1/3/2021:

This week's challenge comes from listener Robert Flood of Allen, Texas. Think of a seven-letter hyphenated word for a kind of cooking. Change the middle letter to get a new word describing a kind of music. What words are these?

In the preview post, I suggested that we start with a list of cooking words and a list of music words. How do we get such word lists? Here are some ideas:
  1. Search the web for lists. We might get lucky and find that someone has already published lists like those we need.
  2. Scrape relevant sources on the web. We could come up with a handful of food and recipe sites like allrecipes.com and epicurious.com and music sites like pitchfork.com and last.fm. Then we could use a couple of python tools to "scrape" these sites and clean the text. But then what? We'd need to use something like term frequency-inverse document frequency (tf-idf) to find words that occur at a higher rate in the cooking and music documents than they occur in a mixed collection of documents--this is basically what a "word cloud" is. This approach would require significant effort.
  3. Wikipedia. Here's a page listing music styles, for example.
  4. Language models. We could use a pre-trained language model to predict words that are similar to known cooking and music words, or that are found in similar contexts.
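As a quick illustration of the tf-idf idea from approach 2, here's a toy sketch (not something we'll actually use for this puzzle; the documents and scoring details are simplified for illustration):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document tf-idf scores: words frequent in one document
    but rare across the collection score high; words that appear
    everywhere (like "the") score zero."""
    n = len(docs)
    # Document frequency: how many documents contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (tf[w] / len(doc)) * math.log(n / df[w]) for w in tf})
    return scores

cooking = "braise the lamb then roast the garlic".split()
music = "the adagio then the allegro".split()
scores = tf_idf([cooking, music])
# "roast" gets a positive score in the cooking doc;
# "the" and "then" appear in both docs, so they score zero.
```

On real scraped sites we'd have thousands of documents and would need tokenization and cleanup first, which is why I called this approach significant effort.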
Approach 2 would be fun but also a lot of effort. Let's try that on some future puzzle. In my previous post, I had success with a language modeling approach, so I think we'll start with approach 4 above.

Let's return to BERT in masking mode. BERT is a language model that trains on massive amounts of text. Such models can be used for a variety of tasks, like scoring how similar two sentences are, or predicting a masked word or the next sentence in a document. In our case, we'll use it to predict words that fill in a blank. This will give us lists of candidate words.

Aaaand... that won't work. I started trying some mask queries on BERT and quickly realized that it does not produce any hyphenated words. I'm no expert in BERT, but I suspect that when it tokenizes the training data, it breaks up hyphenated words. In other words, if it isn't trained on hyphenated words, it won't include them in its predictions. Perhaps there's a way around this, but I didn't find a quick fix.

Word2vec is an approach to language modeling that predates BERT. It also trains on massive amounts of text. For each word in the vocabulary, it learns a vector (a list of numbers) that captures the contexts the word tends to appear in. Words that occur in similar contexts end up with similar vectors, and two vectors can be compared with cosine similarity to measure how similar the words are.
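To make that concrete, here's a toy cosine similarity calculation. The 3-dimensional "word vectors" below are made up for illustration; real word2vec vectors have 100+ dimensions learned from text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: "bake" and "roast" point in similar directions,
# "adagio" points elsewhere.
bake = [0.9, 0.1, 0.0]
roast = [0.8, 0.2, 0.1]
adagio = [0.0, 0.1, 0.9]

cosine_similarity(bake, roast)   # high: similar contexts
cosine_similarity(bake, adagio)  # low: unrelated contexts
```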

We can use the word2vec implementation in Gensim, an open source Python package for textual similarity and related tasks. The tool can take a "seed word" and give us the top n similar words.

I'm going to start with a list of seed words for each of the two lists we need. First, let's list some seed words for "a kind of cooking". I started with this Wikipedia page for "Cooking Methods". After dropping the multi-word terms (to keep this simple), and adding a base verb form for the gerund ("-ing") forms listed there, I have 58 cooking seed words, like these:
  • bake
  • baking
  • barbecue
  • barbecuing
  • blacken
  • blackening
  • etc.
And some seed words for "describing a kind of music". I couldn't find a comparable list, so I poked around Wikipedia and the web and cobbled together my own list of 77 terms like these:
  • a-capella
  • acapella
  • acoustic
  • adagio
  • allegro
  • anthem
  • anthemic
  • etc.
From there, I put together a script to create a list of cooking words by running each of my seed words through word2vec, generating a list of the 100 most similar words for each seed word. There is overlap in the results, so duplicates are removed.

My script then does the same for my music seed words.
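For the curious, the expansion step looks roughly like this. It's a sketch, not my exact script, and it assumes a gensim KeyedVectors-style model, where `word in model` tests vocabulary membership and `model.most_similar(word, topn=n)` returns `(word, score)` pairs:

```python
def expand_seeds(model, seeds, topn=100):
    """Union of the topn most-similar words for each seed word,
    with duplicates removed. Seeds missing from the model's
    vocabulary are skipped."""
    candidates = set()
    for seed in seeds:
        if seed not in model:
            continue
        candidates.update(word for word, _score in model.most_similar(seed, topn=topn))
    return candidates
```

Running this over the 58 cooking seeds and again over the 77 music seeds gives the two candidate lists.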

At this point, we have a few thousand words in each list, and we expect that the target cooking word and the target music word are among them.

Now the script simply needs to iterate through them, applying the rules of the puzzle to find a match. So first it prunes our list of cooking words, keeping only words that would be 7 characters if we removed hyphens. Similarly for music words, only 7-letter words are kept.

Finally, the script iterates through the remaining cooking words, one by one. It deletes any hyphens in the cooking word, leaving only 7 letters. Then it starts iterating through the music words, one by one. If the first 3 letters don't match with the cooking word, it moves on to the next music word. If the first 3 letters do match, it then checks to see if the final 3 letters match, and finally checks that the 4th (middle) letter does not match. If it checks a cooking word against all the music words and doesn't find a match, it moves on to the next cooking word and iterates through the music words again.
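Here's a minimal sketch of that matching logic (not my exact script; the function name is my own). Note that for 7-letter words, "first 3 match, last 3 match, middle letter differs" covers the whole word:

```python
def find_pairs(cooking_words, music_words):
    """Return (cooking, music) pairs where the de-hyphenated cooking
    word and the music word are both 7 letters and differ only in
    the middle (4th) letter."""
    music = [m for m in music_words if len(m) == 7]
    pairs = []
    for original in cooking_words:
        c = original.replace("-", "")  # e.g. "bar-b-que" -> "barbque"
        if len(c) != 7:
            continue
        for m in music:
            # Same first 3 and last 3 letters, different middle letter.
            if c[:3] == m[:3] and c[4:] == m[4:] and c[3] != m[3]:
                pairs.append((original, m))
    return pairs
```

This is a brute-force nested loop, but with only a few thousand short words per list it runs in well under a second.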

Does it work?

Well, yeah... But there were some interesting failures first.

At first, I was only getting a handful of word pairs that fit the form requirements but weren't actually cooking or music words. Sometimes a seed word can have multiple word senses, so our results might include some unwanted words. "Smoking," for example, is a cooking seed word, but it has multiple non-cooking meanings too. One example pair was "sub-sect" and "subject"---the pattern fits but the meanings do not.

Ultimately, I ended up manually looking through a list of the generated music words, and the target word caught my attention. I quickly realized why my script wasn't working---the cooking word has 2 hyphens, and I had mistakenly assumed it had only 1 hyphen. You know what happens when you assume, right?

The solution:
bar-b-que and baroque

Once I knew what the script should be finding, I went back to fix it. No problem.

But... it still didn't include bar-b-que in the results, so the match was not found. Argh. What gives?

I think this is where things get really interesting! The Gensim word2vec implementation currently has 11 pre-trained English word vector models. Each model is trained on different datasets. (Well, not exactly---some are just larger "more detailed" models derived from the same data as smaller models). The training sets fall into roughly 3 categories:
  • Wikipedia text
  • Newspaper text
  • Twitter text
I started with a model trained on Wikipedia and newspaper text (glove-wiki-gigaword-100). Do you see why this might be a problem?

I've had enough journalism-ish experience to know that news publications follow strict rules like the AP Stylebook. This dictates things like grammar and punctuation (Oxford commas? The AP says no!). It also specifies how to spell particular words, especially those with multiple possible spellings and/or foreign words and names: Hanukkah, not Chanukah; Quran, not Koran.

Heck, I even found an old tweet from the AP on the subject. Also, presumably, not bar-b-que! The point here is that bar-b-que would likely never appear in the newspaper text, and thus it isn't even in the model's vocabulary. (I don't know as much about Wikipedia, but I believe it follows similarly strict guidelines.)

I switched to a model trained on Twitter, which of course has no style guide, and the correct solution popped right out!

Lessons from this puzzle: Question your assumptions and consider your training data!

For anyone interested, the solver script is on GitHub. I also uploaded a related script that I used to run each model interactively, giving it a seed word to see if it produces a given target word (bar-b-que).
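That interactive check boils down to something like this (a sketch, assuming a gensim-style model object; the function name is my own):

```python
def seed_produces_target(model, seed, target, topn=100):
    """True if `target` appears among the topn words most similar
    to `seed` in the given model. Returns False if the seed itself
    is missing from the model's vocabulary."""
    if seed not in model:
        return False
    return target in {word for word, _score in model.most_similar(seed, topn=topn)}
```

Running this with seed "baroque" and target "bar-b-que" across the glove-wiki-gigaword and glove-twitter models is what exposed the style-guide effect above.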

Thanks for reading, and see you next week!

--Levi King

