If you're new to the blog, welcome! Join us here each week as we use Natural Language Processing (NLP), language modeling and linguistics to solve the Sunday Puzzle from NPR. Here's this week's puzzle:
This week's challenge comes from listener Andrew Chaikin, of San Francisco. Think of an 11-letter word that might describe milk. Change one letter in it to an A, and say the result out loud. You'll get a hyphenated word that might describe beef. What is it?
For this puzzle, the solution pretty quickly jumped out at me, and maybe it did for you too. Regardless, let's walk through how we can use our tools and skills to get a solution. Here's what we need:
- M, a list of words that might describe milk
- We can ask an LLM directly for this list, or
- We can query an LLM to fill in blanks for us with the top candidates, e.g.:
- "Some milk is more BLANK than other milk."
- "Milk sold in stores must be BLANK."
- (I asked an LLM to give me a list, and I got a list of about 90 words)
- We'll need to filter this list down to only the 11-letter words (because LLMs are terrible at counting letters)
- B, a list of words that might describe beef
- Ideally, we'll have a list for this side of the equation, too, then we'll check to see if we can apply the transformation to a word in M and match it to a word in B
- Alternatively, we can simply take each candidate in M, then apply the transformation ("change one letter in it to an A and say the result out loud") and evaluate the result in the context where it is describing beef.
- Of course this is tricky, because we're not matching strings, we're matching pronunciations; see the next point.
- transform(): We need a function to apply the prescribed transformation: "change one letter in it to an A and say the result out loud"
- How would we do this programmatically? We might need some guidelines to keep this manageable.
- Firstly, I think we probably want to restrict our replacement targets to other vowels.
- Secondly, the puzzle asks us to replace a letter with another letter (A), but we're only interested in the resulting pronunciation, not the resulting spelling. So we need to operate on pronunciations. We can use the CMU Pronouncing Dictionary as we have in past puzzles.
- The CMU dictionary uses ARPAbet symbols, so we need a list of all the ARPAbet symbols that an "A" can correspond to. Then we can generate new pronunciations by replacing each vowel sound in the pronunciation one by one with these various "A" vowel sounds.
- A few ARPAbet examples:
- AE T : "at"
- EY T : "ate"
- TH EY T AH : "theta"
- Next we need a way to convert the pronunciation symbols back into a spelling ("phoneme-to-grapheme"). In our case, we'll probably simply want to query the CMU Dictionary to see if the new pronunciation exists in the dictionary, then we pull the corresponding spelling.
- One more twist: Note that the puzzle specifies we'll get a hyphenated word. The hyphenated word is probably not going to have its own entry in the pronunciation dictionary. This means we'll need to try tokenizing the new pronunciation; in other words, we need to split it in two, then try finding each of the two pronunciations in the dictionary. We could simply brute force this, trying the split at each possible location (between syllables).
- evaluation: Let's imagine the above steps have gone well, and we've found a number of potential spellings for the "hyphenated word that might describe beef". Now we need a language model to evaluate each of the candidates in context. As we've done in the past, we can insert each candidate into a set of sentence templates and get a perplexity score for each. Ideally, the correct solution should have the lowest or near-lowest perplexity. Our templates might be something like these:
- "That was some of the best [BLANK] beef I have ever had."
- "[BLANK] beef is expensive but usually high quality."
- "The supermarket is having a special on [BLANK] beef this week."
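To make the transform() step a bit more concrete, here's a minimal sketch of the vowel swap plus the brute-force split. A few assumptions to flag: `TOY_DICT` is a tiny hand-typed stand-in for the CMU Pronouncing Dictionary (stress digits stripped), `A_SOUNDS` is my guess at the vowel sounds a written "A" can stand for, and the function names are made up for illustration. It also tries the split at every phoneme boundary rather than only between syllables, which is a cruder superset of what's described above.

```python
# ARPAbet vowel phonemes, with stress digits removed.
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}

# Assumption: vowel sounds that a written "A" commonly maps to.
A_SOUNDS = ["AE", "EY", "AA", "AH", "AO"]

# Hypothetical stand-in for the CMU dictionary: word -> phoneme tuple.
TOY_DICT = {
    "sun": ("S", "AH", "N"),
    "lit": ("L", "IH", "T"),
    "late": ("L", "EY", "T"),
    "sunlit": ("S", "AH", "N", "L", "IH", "T"),
}


def vowel_swaps(phones):
    """Yield copies of `phones` with one vowel replaced by an "A" sound."""
    for i, phone in enumerate(phones):
        if phone in ARPABET_VOWELS:
            for a_sound in A_SOUNDS:
                if a_sound != phone:
                    yield phones[:i] + (a_sound,) + phones[i + 1:]


def hyphenated_matches(phones, by_pron):
    """Split `phones` at every boundary; keep splits where both halves are words."""
    matches = []
    for cut in range(1, len(phones)):
        for left in by_pron.get(phones[:cut], []):
            for right in by_pron.get(phones[cut:], []):
                matches.append(f"{left}-{right}")
    return matches


def candidates(word, pron_dict):
    """Return hyphenated spellings reachable from `word` by one vowel swap."""
    # Reverse index: pronunciation -> list of spellings.
    by_pron = {}
    for spelling, phones in pron_dict.items():
        by_pron.setdefault(tuple(phones), []).append(spelling)

    found = set()
    for new_phones in vowel_swaps(tuple(pron_dict[word])):
        found.update(hyphenated_matches(new_phones, by_pron))
    return sorted(found)


print(candidates("sunlit", TOY_DICT))  # ['sun-late']
```

With the real dictionary you'd run `candidates()` over every word in M and pass the results on to the evaluation step. Expect plenty of noise, since short words match lots of pronunciation fragments.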
That's how I'd go about approaching this one systematically. In truth, it would be overkill for this easier-than-usual puzzle. Alternatively, we could simply start with M (our list of milk words) and from there, most people would spot the solution right away. I went ahead and did just that; if you'd like to give it a shot, or even flesh out the full approach described above, you can start by running my script to get a list of 11-letter milk-related words. It only returns 10 words, so I'm confident you'll recognize the solution.
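If you'd rather roll your own filter for M, the length check is only a few lines. This is a minimal sketch, not the script mentioned above: `keep_n_letter_words` is a hypothetical helper name, and the word list is a made-up sample rather than real LLM output.

```python
def keep_n_letter_words(words, n=11):
    """Keep unique, purely alphabetic words of exactly n letters, in order."""
    seen = set()
    kept = []
    for raw in words:
        word = raw.strip().lower()  # normalize stray whitespace and casing
        if len(word) == n and word.isalpha() and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


# Made-up sample standing in for an LLM's list of milk words.
llm_words = [
    "creamy", "Homogenized", "condensed", "unsweetened ",
    "evaporated", "refrigerated", "homogenized",
]

print(keep_n_letter_words(llm_words))  # ['homogenized', 'unsweetened']
```

Note that the `isalpha()` check drops hyphenated entries, so loosen it if you want to count letters across a hyphen.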
The deadline for NPR submissions has passed, so click the spoiler button below to see my solution.