Welcome back, Puzzle Gang! I think I put together a pretty cool process to solve this one. Once again, here's the puzzle:
This week's challenge comes from listener Greg VanMechelen, of Berkeley, Calif. Name something birds do. Put the last sound of this word at the start and the first sound at the end, and phonetically you'll name something else birds do. What are these things?
To solve it, I need three things:
- B, a list of "something birds do" words;
- P, a pronunciation dictionary;
- f_swap, a function to swap the initial and final phonemes.
To build B, I started with these seed words:
['flying', 'flapping', 'nesting', 'brooding', 'pecking', 'roosting', 'tweeting', 'cawing', 'crowing']
For each seed word, I took the top 100 word2vec suggestions. These suggestions are words that appear in similar contexts to the seed word. I chose "-ing" progressive verb forms in hopes that they would return more verbs; "nesting" vs "nest" should do that, right? The suggestions from word2vec come in many word forms, so I then ran everything through a lemmatizer to get the base form of each word. So we now have list B, with a few hundred candidate "bird words." Let's set that aside for now.
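Here's a minimal sketch of that "expand the seeds, then lemmatize" step. It is not the code from my solver script: `build_bird_words`, `toy_most_similar`, `toy_lemmatize` and the toy data are all hypothetical names made up for illustration. The two callables are stand-ins so the sketch runs without downloading anything; in practice you might plug in something like gensim's `KeyedVectors.most_similar(word, topn=100)` (which returns (word, score) pairs) and the `lemma` attribute from a Stanza pipeline, but that wiring is an assumption here.

```python
def build_bird_words(seeds, most_similar, lemmatize, topn=100):
    """Collect the lemmatized neighbours of every seed word into one set."""
    candidates = set()
    for seed in seeds:
        # most_similar is assumed to return up to `topn` neighbour words.
        for neighbour in most_similar(seed, topn):
            candidates.add(lemmatize(neighbour.lower()))
    return candidates


# Toy stand-ins (hypothetical data, NOT real word2vec or lemmatizer output).
TOY_NEIGHBOURS = {
    "nesting": ["nests", "roosting"],
    "pecking": ["pecked", "nests"],
}
TOY_LEMMAS = {"nests": "nest", "roosting": "roost", "pecked": "peck"}


def toy_most_similar(word, topn):
    return TOY_NEIGHBOURS.get(word, [])[:topn]


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


bird_words = build_bird_words(["nesting", "pecking"], toy_most_similar, toy_lemmatize)
print(sorted(bird_words))  # ['nest', 'peck', 'roost']
```

Using a set means a word suggested by several seeds only shows up once in B.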
For a pronunciation dictionary, I used the Carnegie Mellon University Pronouncing Dictionary. It contains over 134,000 words and their pronunciations (nice!) and for phoneme symbols, it uses ARPAbet (yuck!). It's available in a plain text file; here's an excerpt:
missourians M AH0 Z UH1 R IY0 AH0 N Z
misspeak M IH0 S S P IY1 K
misspeak(2) M IH0 S P IY1 K
misspell M IH0 S S P EH1 L
misspell(2) M IH0 S P EH1 L
misspelled M IH0 S S P EH1 L D
We don't need to worry too much about the details of ARPAbet or the CMU dictionary, but here are some relevant points. Every vowel (i.e., every syllable nucleus) carries a stress marker of 0, 1, or 2, and consonants never do. So if we split each line on whitespace, we can take the first piece as the word (we may need to remove "(2)" or presumably "(3)", etc.) and the remaining pieces as the pronunciation. Taking the pronunciation, we can just keep removing phonemes from the left side until we hit a vowel (a symbol containing 0, 1 or 2) to get the onset of the first syllable. We do the same from the right side of the pronunciation to get the coda of the final syllable. Then we swap the onset and coda and recombine them with the rest of the pronunciation. That's our f_swap function.
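To make that concrete, here's a rough sketch of the parsing and the swap. It follows the description above rather than reproducing my script line for line; `parse_cmu_line` and `is_vowel` are helper names invented for this example, and the input line is taken from the excerpt above.

```python
import re


def parse_cmu_line(line):
    """Split one dictionary line into (word, list of phonemes)."""
    word, *phones = line.split()
    # Drop alternate-pronunciation markers such as "(2)" or "(3)".
    word = re.sub(r"\(\d+\)$", "", word)
    return word, phones


def is_vowel(phone):
    # In the CMU dictionary, vowels end in a stress digit: 0, 1, or 2.
    return phone[-1] in "012"


def f_swap(phones):
    """Swap the onset of the first syllable with the coda of the last."""
    vowel_positions = [i for i, p in enumerate(phones) if is_vowel(p)]
    if not vowel_positions:
        return list(phones)  # no vowel found; leave the entry unchanged
    first, last = vowel_positions[0], vowel_positions[-1]
    onset = phones[:first]          # consonants before the first vowel
    middle = phones[first:last + 1]  # first vowel through last vowel
    coda = phones[last + 1:]        # consonants after the last vowel
    return coda + middle + onset


word, phones = parse_cmu_line("misspell(2) M IH0 S P EH1 L")
print(word, f_swap(phones))  # misspell ['L', 'IH0', 'S', 'P', 'EH1', 'M']
```

Note that this swaps the whole onset and the whole coda, so a cluster like "S T" moves as a unit; if either one is empty, the other simply moves across.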
I put my solver script on GitHub, and it contains plenty of notes so you can follow along. (Note that if you want to try my script, you'll need to download word2vec, Stanza and the CMU dictionary.) The whole thing works like this:
- Query word2vec with seed words to get B (bird words);
- Lemmatize B;
- For each word w in B, query w's pronunciations from the CMU dictionary;
- For each w pronunciation (let's call each p_w), swap onset and coda, call this p_ws (for "pron_w_swapped");
- For each other word x in B (not equal to w; yes, we're iterating through B again), query x's pronunciations;
- For each x pronunciation (p_x):
- if p_x equals p_ws, we have a solution (w & x).
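The search loop in the steps above can be sketched roughly like this. Again, this is an illustration under stated assumptions, not my actual script: `find_solutions` is a made-up name, `f_swap` is a compact re-statement of the onset/coda swap, and `TOY_PRONS` is a tiny hand-typed dictionary standing in for the full CMU file (with deliberately non-bird words, so no spoilers).

```python
def f_swap(phones):
    """Swap the first syllable's onset with the last syllable's coda."""
    vowels = [i for i, p in enumerate(phones) if p[-1] in "012"]
    if not vowels:
        return phones
    first, last = vowels[0], vowels[-1]
    return phones[last + 1:] + phones[first:last + 1] + phones[:first]


def find_solutions(bird_words, pron_dict):
    """Return (w, x) pairs where swapping w's ends gives a pronunciation of x."""
    solutions = []
    for w in bird_words:
        for p_w in pron_dict.get(w, []):
            p_ws = f_swap(p_w)
            for x in bird_words:
                if x == w:
                    continue
                # Compare against every pronunciation listed for x.
                if p_ws in pron_dict.get(x, []):
                    solutions.append((w, x))
    return solutions


# Hand-typed toy entries: word -> list of pronunciations (tuples of phonemes).
TOY_PRONS = {
    "tip": [("T", "IH1", "P")],
    "pit": [("P", "IH1", "T")],
    "nest": [("N", "EH1", "S", "T")],
}
B = ["tip", "pit", "nest"]

print(find_solutions(B, TOY_PRONS))  # [('tip', 'pit'), ('pit', 'tip')]
```

Because B is scanned twice, every match is reported in both directions, and the work grows with the square of the size of B. With a few hundred candidates that's still quick.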
And what do you know: I had to try a couple of different word2vec models and experiment with the number of suggestions I wanted for each seed word, but I got the solution! Here are the two solutions my system produced (one valid, one not):