Hi Puzzlers! Welcome back to another exciting installment of Natural Language Puzzling, where we use natural language processing to solve the NPR Sunday Puzzle. Let's look at this week's puzzle:
This week's challenge comes from listener Michael Schwartz, of Florence, Oregon. Take the name of a watercraft that contains an odd number of letters. Remove the middle letter and rearrange the remaining ones to name a body of water. What words are these?
It's a puzzle format as old as time: take an item from Class A (watercraft), apply a transformation (remove a letter and rearrange), and return an item from Class B (body of water). A potentially tricky part here is that these are both open classes---we'll never have a complete list of types of watercraft or bodies of water. New watercraft get invented, for example.
So how do we go about this? Here's what I need, how I'll get it and how I'll use it:
- C: a list of watercraft
- It's not clear whether this should be a list of common nouns, like raft and submarine, or of specific craft, like the Titanic
- After years of following the Sunday Puzzle, I suspect it's the former, so I'll start with that
- Between web search, Wikipedia, and asking ChatGPT, I cobbled together a list of 100+ different types of watercraft. I filtered this down to only those that are one word, because these puzzles tend to prefer that kind of simplicity. This left me with about 70 words, which I saved in my Python script
- W: a list of bodies of water
- Again, it's not clear if we want common nouns like lake and reservoir or proper nouns like Mediterranean Sea
- I am working under the assumption that we want the simple common nouns, because the puzzles tend to work that way unless they specify otherwise. Also, the proper nouns will typically be more than one word, which would make this a bit messy, and the puzzles tend to be cleaner than that.
- However, I'm not confident I can assemble a complete list here, so instead of starting with one, I'll use a language model to evaluate the candidate anagrams and see whether they make sense in a sentence as a body of water
- L: lexicon, a list of valid English spellings (see "script" section below)
- LLM: Large language model
- We'll use a large language model to evaluate the probability (via perplexity) of sentences in which each candidate word appears in context as a body of water.
- I'll be using the GPT-2 implementation from the Python transformers library, which is more than powerful enough for our needs here. I'll start with a sentence frame (or template):
- As the largest body of water in the area, the _____ is a popular summer destination.
- So for the watercraft steamer, we drop the middle letter to get stemer, and we find the anagrams meters and metres
- I get scores from the LLM for the resulting sentences:
- 3.91 As the largest body of water in the area, the metres is a popular summer destination.
- 3.87 As the largest body of water in the area, the meters is a popular summer destination.
- This is a perplexity score, so a lower score indicates a more likely sentence. Of course, we can't conclude much from these two examples alone. I took them from a larger list (which I can't post here because spoilers), and in that context these scores could probably be taken to indicate that the words are real and not incoherent in the context, though an LLM wouldn't find them particularly likely.
- For comparison, I tried this with carburetor and lake:
- 3.67 As the largest body of water in the area, the carburetor is a popular summer destination.
- 2.94 As the largest body of water in the area, the lake is a popular summer destination.
- script:
- We need a script that can iterate through our list of watercraft and apply some functions:
- Keep only those with an odd number of letters
- Remove the middle letter
- Find all the anagrams
- We'll simply generate all permutations of the remaining letters, then keep only those that appear in an English lexicon (L)
- Slot these candidate strings into the sentence frame and pass each sentence to the LLM for a score
- Keep only the sentences that score below a reasonable perplexity threshold
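To make the word-wrangling steps concrete, here's a minimal sketch of the candidate-generation side: filter to odd-length craft, drop the middle letter, and find anagrams by checking permutations against the lexicon. The function names and toy lexicon are mine for illustration, not necessarily what the actual script uses.

```python
from itertools import permutations

def remove_middle(word: str) -> str:
    """Drop the middle letter of an odd-length word, e.g. steamer -> stemer."""
    mid = len(word) // 2
    return word[:mid] + word[mid + 1:]

def anagrams(letters: str, lexicon: set[str]) -> set[str]:
    """All rearrangements of `letters` that are valid words in the lexicon."""
    return {"".join(p) for p in permutations(letters)} & lexicon

def candidates(watercraft: list[str], lexicon: set[str]) -> dict[str, set[str]]:
    """Map each odd-length craft name to its valid post-removal anagrams."""
    return {
        craft: anagrams(remove_middle(craft), lexicon)
        for craft in watercraft
        if len(craft) % 2 == 1  # odd number of letters only
    }

# Toy example with a tiny stand-in lexicon:
lex = {"meters", "metres", "lake"}
print(candidates(["steamer", "raft"], lex))
```

Brute-force permutations are fine at this scale: the one-word craft names are short, and a lexicon lookup via set membership keeps the filtering cheap.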
From there, if I've done this correctly and set an appropriate perplexity threshold, I should get a manageable number of scored sentences back. This is a human-in-the-loop (HITL) approach. That's me, I'm the human. I'll review the results, and ideally the correct solution will have a relatively low perplexity score and be fairly easy to spot.
And in fact, that's exactly what I did, and I got a solution that I'll be back to share with you after NPR's Thursday submission deadline.
In the meantime, if you'd like to try my script or modify my approach, you can find it here. Make sure you have the torch and transformers libraries installed for Python, and you'll also need to download this lexicon file.