Monday, June 23, 2025

Director, anagram, film award

Welcome back to Natural Language Puzzling, the blog where we use natural language processing and linguistics to solve the Sunday Puzzle from NPR. Here's this week's puzzle:

This week's challenge comes from listener Bob Weisz. Take the name of a major film director. Drop the last six letters of his name, and rearrange what remains. You'll get the name of a major film award -- for which this director has been nominated six times. Who is he and what is the award?

Full disclosure---this one's probably pretty easy to solve with just your brain, assuming you know a little about movies and directors. Nonetheless, let's break this down and enumerate the data and functions we'll need in order to find a solution programmatically.

  • D: a list of film directors
    • note the "he"---we're looking for male directors;
    • "drop the last six letters"---the name should probably be at least 10 letters so we have enough left over to spell an award;
    • "this director has been nominated six times"---we can probably assume the six nominations represent six different movies, so this director has at least six movies to his credit;
    • Options for acquiring this list:
      • brainstorm our own
      • search the web for a list of top directors
      • ask an LLM to provide a list of directors that meet the above criteria
        • This was my approach, and I got a pretty good list:
        • ['Steven Spielberg', 'Martin Scorsese', 'Akira Kurosawa', 'Stanley Kubrick', 'Alfred Hitchcock', 'Quentin Tarantino', 'Ridley Scott', 'Francis Ford Coppola', 'Clint Eastwood', 'Woody Allen', 'Ingmar Bergman', 'Federico Fellini', 'Billy Wilder', 'Christopher Nolan', 'Pedro Almodóvar', 'Wes Anderson', 'Paul Thomas Anderson', 'David Fincher', 'Joel Coen', 'Ethan Coen', 'Ang Lee', 'Jean-Luc Godard', 'Robert Altman', 'John Ford', 'George Lucas', 'Frank Capra', 'Roman Polanski', 'Michael Haneke', 'Ken Loach', 'Spike Lee']
  • A: a list of major film awards
    • Our options for acquiring this list are the same; here's my list:
      • ['Academy Award', 'Oscar', 'BAFTA', 'Golden Globe', 'Cannes Film Festival Award', 'Palme d Or', 'Directors Guild of America Award', 'Screen Actors Guild Award', 'Critics Choice Movie Award', 'Independent Spirit Award']
      • Obviously, some of these are unlikely solutions---we probably won't find an anagram of 'Independent Spirit Award'. 
  • anagram(word): a function for generating anagrams
    • While we've done this before, it's typically when we aren't confident that we have complete lists for solving both sides of the equation, i.e., it's an open class problem. This is technically an open class problem too, because we can't ever be certain our lists of directors or awards are complete. However, I'm confident that our lists are adequate here. This means we don't actually need to form valid anagrams with our function---we don't need to confirm that the anagram string is a real word or name in some lexicon. So, let's instead take the following approach:
  • compare(d,a):
    • The function we need is simpler than anagramming. We need to lowercase the strings; remove any spaces, punctuation, and diacritics; and drop the final six letters of the director's name. Python can handle all that easily, and I like to use a package called slugify for changing things like "ó" to "o". Our function doesn't need to form valid words---it just sorts the letters alphabetically. When we have the solution, the sorted letters of the truncated director's name will match the sorted letters of the award.
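Here's a minimal sketch of compare(), using the standard library's unicodedata in place of slugify so the snippet is self-contained (the "Oscar ABCDEF" input below is a synthetic string just to exercise the function, not a real name):

```python
import re
import unicodedata

def normalize(s):
    # Lowercase, strip diacritics (e.g., "ó" -> "o"), drop non-letters
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z]", "", s.lower())

def compare(director, award):
    # Drop the final six letters of the director's name, then check
    # whether the sorted letters match the award's sorted letters
    return sorted(normalize(director)[:-6]) == sorted(normalize(award))

print(compare("Pedro Almodóvar", "Oscar"))  # False
print(compare("Oscar ABCDEF", "Oscar"))     # True (synthetic check)
```

Looping compare() over every director/award pair from the two lists is all the script needs to do.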
Using the above approach, I've put together a python script that is spitting out the correct solution. Have you solved the puzzle? Check back here after Thursday's NPR submission deadline (no spoilers here!) and I'll link my script and post my solution. Good luck!
The deadline for NPR submissions has passed, so click the spoiler button below to see my solution, and click here to see my script.

Monday, May 05, 2025

Country anagrams rhyming with "Spain"

Welcome back to Natural Language Puzzling, the blog where we use linguistics, natural language processing (NLP), language modeling, and artificial intelligence to solve the weekly challenge from NPR's Sunday Puzzle. Let's dive into the latest puzzle:

This week's challenge comes from Andrew Tuite, of Chicago. There are four countries whose names have one-syllable anagrams that rhyme with "Spain." What are they?

As usual, let's break this one down and determine what tools and resources we need to solve it.

  • C: a list of countries
    • This is a closed class---we can easily obtain a complete list.
    • There are 193 United Nations member states---195 countries if you count the two UN observer states.
    • We'll use some heuristic to filter this down to strings that can be anagrammed to a one-syllable word
      • Wikipedia claims a handful of 10- and 11-letter words might be pronounced as one syllable
      • I'll limit my list of countries to those with seven or fewer letters
      • Note that the anagram must be one syllable, but the original string might have more than one, as in MASTER --> STREAM
  • Anagrammer: a function to anagram strings
    • This will need a lexicon to ensure that the anagrams are real words.
    • It will also need a pronunciation dictionary to ensure that the anagrams rhyme with Spain.
      • We can simply use the pronunciation dictionary as the lexicon, too.
      • As we've done in the past, I'll be using the CMU Pronouncing Dictionary.
  • Python script: We'll need a bit of scripting to tie all this together:
    • Iterate through the country names and filter out strings longer than seven letters;
    • Generate possible anagrams and keep only those found in the dictionary;
    • Look up pronunciations and determine if they rhyme with Spain. CMU uses ARPAbet, which represents Spain as S P EY1 N
      • We'll need to find any anagrams whose pronunciations end in EY1 N.
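Putting those pieces together, here's a sketch with a toy CMU-style dictionary. A real run would load the full cmudict file and also confirm the one-syllable count via stress digits; "Nairt" below is a made-up string, not a country, so as not to spoil any answers:

```python
from itertools import permutations

# Toy CMU-style pronunciations; a real run loads the full cmudict file
pron = {
    "spain": ["S", "P", "EY1", "N"],
    "chain": ["CH", "EY1", "N"],
    "train": ["T", "R", "EY1", "N"],
}

def rhymes_with_spain(word):
    # Rhymes if the pronunciation ends with Spain's stressed vowel + coda
    p = pron.get(word)
    return p is not None and p[-2:] == ["EY1", "N"]

def one_syllable_rhymes(country):
    # All in-dictionary anagrams of the country name that rhyme with Spain
    letters = country.lower().replace(" ", "")
    candidates = {"".join(p) for p in permutations(letters)}
    return sorted(w for w in candidates if rhymes_with_spain(w))

print(one_syllable_rhymes("Nairt"))  # ['train'] -- hypothetical input
```

With the full dictionary and the filtered country list, running one_syllable_rhymes() over each country should surface the four answers.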
That's the approach I'll be taking. Would you do it differently? Leave your comments below. And check back here after the NPR deadline for submissions on Thursday to see my solution. Happy Puzzling!


The deadline for NPR submissions has passed, so click the spoiler button below to see my solution, and click here to see my script.

Monday, April 21, 2025

Animal (5) --> bird (7) --> animal (9)

It's Monday Funday here at Natural Language Puzzling, the blog where we explore natural language processing (NLP), artificial intelligence (AI), language modeling and linguistics for the purposes of solving the weekly challenge from NPR's Sunday Puzzle. Welcome back! Let's take a crack at this week's challenge:

This week's challenge comes from Philip Goodman, of Binghamton, N.Y. Name an animal in five letters. Add two letters and rearrange the result to name a bird in seven letters. Then add two letters to that and rearrange the result to name another animal in nine letters. What creatures are these?

How should we approach this one? It basically follows the classic Sunday Puzzle pattern:

Take a thing from Class A and apply a transformation to yield a thing from Class B.

In this case, it's a little more complicated but really we're just adding a second transformation from Class B to Class C (which is similar to Class A, but the strings are nine letters instead of five).

As you can probably see, this one is relatively light on NLP, AI, and linguistics. We really just need good lists of birds and animals to start with, and it'll all be string and list manipulation from there.

Here's the approach I plan to take. I'll need these things:

  • A: A big ol' list of animals
    • I've cobbled one together from lists on the web
    • You could also ask your favorite LLM chatbot
    • When we load the list to work with it in python, we'll filter it into a list of five-letter words and another list of nine-letter words
  • B: A big ol' list of birds
    • Again, I found multiple different lists on the web and combined them, then removed duplicates, lowercased everything, etc.
    • We'll filter this one and keep only the birds with seven-letter names
  • check_letters: We need some kind of function to compare the letters in each of the strings in the above lists
    • We can brute force this as it's not super computationally heavy and it's a one-off solution anyway.
    • I'll take each five-letter animal (i.e., "for m in five_letter_animals:")
      • split it into a list of letters
      • take each (seven-letter) bird
        • split it into a list of letters
        • take each letter from the animal and try to remove it from the list of bird letters
        • if only two letters remain in the bird letter list, we keep this pair as a possible solution
    • Then we can more or less repeat this process, starting with the bird list and nine-letter animal list;
      • Again, if we can remove all the letters from the shorter word and we're left with two letters from the longer word, we have a possible solution.
      • We'll know the solution when we have a bird that is a possible solution in both of these comparisons
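The letter-removal comparison above boils down to a tiny helper; here's a sketch (the words in the examples are placeholders, not puzzle answers):

```python
def can_extend(shorter, longer):
    # True if every letter of `shorter` can be removed from `longer`
    # with exactly two letters left over (the "add two and rearrange" step)
    pool = list(longer)
    for ch in shorter:
        if ch not in pool:
            return False
        pool.remove(ch)
    return len(pool) == 2

print(can_extend("act", "cater"))  # True: removing a, c, t leaves e, r
print(can_extend("dog", "cater"))  # False: d isn't available
```

The full chain is then just the pairs where can_extend(animal5, bird7) and can_extend(bird7, animal9) both hold, with the bird name shared between the two comparisons.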
I'm happy to report that the approach above worked for me and I have the solution! I'll be keeping it (and my python script) to myself until after the Thursday 3pm ET deadline for submitting solutions to NPR, as I wouldn't want to get on host Will Shortz' naughty list for giving spoilers. But please check back after then if you'd like to see exactly how I solved it. Good luck, Puzzlers!


The deadline for NPR submissions has passed, so click the spoiler button below to see my solution, and click here to see my script.

Monday, April 14, 2025

European tourist site, planting medium

It's Monday, and you Puzzlers know what that means. Welcome back to Natural Language Puzzling, the blog where we use natural language processing (NLP), language models and tools, and good old-fashioned linguistics to solve the weekly Sunday Puzzle from NPR. Here is this week's puzzle:

This week's challenge comes from listener Jessica Popp, of Indiana, Pa. Name a famous European tourist site in nine letters. Rearrange its last four letters to name something that its first five letters can be planted in.

It's been a few weeks since we've had a "normal" puzzle like this one. It follows the pattern of many Sunday Puzzles: take thing from category A, apply a prescribed transformation, get thing from category B.

In these kinds of puzzles, we often start with lists: a list for Category A and/or a list for Category B. Then we need some kind of function to apply the transformation. Finally, we either check to see if the result is in our list for Category B, or, if we don't have such a list, we use some kind of language model to evaluate each transformed candidate in an appropriate context and see how well it fits.

This time around, we'll need:

  • E: a list of European tourist sites
    • we can search the web for existing lists
    • we can ask a chatbot LLM to generate a list for us
    • we can try to cobble one together ourselves
    • we'll need to filter this list to only nine-letter entries
      • Note that LLMs are still not very good at returning words of specified length
  • P: a list of things that some unknown thing can be planted in
    • this one probably makes sense to ask chatbot LLM for
    • we could also query a non-chatbot LLM to fill in blanks for us
    • again, we'll need to filter this by length--for 4-letter words this time
  • functions/tools:
    • from my reading of the puzzle, the first five letters of the European tourist site should be something that can be planted; thus, the first five letters alone should be a common noun, so we could start with a function that takes the first five letters of each item in list E and checks to see if they make a valid word.
    • if we find a valid word, we can:
      • generate all the permutations of the last four letters
      • keep only those that are valid words
      • use a sentence template or templates with an LLM to see how likely the candidates are in context, e.g.:
        • I know the perfect [BLANK] to plant the new [BLANK] in.
      • In theory, the solution would be among the candidate pairs that result in the lowest perplexity score from the above sentence.
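As a sketch of the site check, assuming toy word lists (the mini-lexicon and the nine-letter string "Tulipoils" are made up for illustration---neither is the actual answer):

```python
from itertools import permutations

def planting_pairs(site, words):
    # For a nine-letter name, pair its first five letters with every
    # valid-word anagram of its last four letters
    s = site.lower()
    if len(s) != 9 or s[:5] not in words:
        return []
    tails = {"".join(p) for p in permutations(s[5:])}
    return [(s[:5], t) for t in sorted(tails) if t in words]

# Hypothetical mini-lexicon and a made-up "site" string
words = {"tulip", "soil", "loam"}
print(planting_pairs("Tulipoils", words))  # [('tulip', 'soil')]
```

Any surviving pairs would then be plugged into the sentence template above and ranked by perplexity.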
This is the approach I'll be taking. How would you do things differently? Share your thoughts in the comments. I'll be back with an update after the Thursday NPR submission deadline (no spoilers here!) to share my scripts and my solution (if I manage one). Good luck!

The deadline for NPR submissions has passed, so click the spoiler button below to see my solution.

Thursday, April 10, 2025

Milk and Beef

Hello Puzzlers. We're almost at this week's NPR submission deadline, but let's get a quick solution posted for posterity.

If you're new to the blog, welcome! Join us here each week as we use Natural Language Processing (NLP), language modeling and linguistics to solve the Sunday Puzzle from NPR. Here's this week's puzzle:

This week's challenge comes from listener Andrew Chaikin, of San Francisco. Think of an 11-letter word that might describe milk. Change one letter in it to an A, and say the result out loud. You'll get a hyphenated word that might describe beef. What is it?

For this puzzle, the solution pretty quickly jumped out at me, and maybe it did for you too. Regardless, let's walk through how we can use our tools and skills to get a solution. Here's what we need:

  • M, a list of words that might describe milk
    • We can ask an LLM directly for this list, or
    • We can query an LLM to fill in blanks for us with the top candidates, e.g.:
      • "Some milk is more BLANK than other milk."
      • "Milk sold in stores must be BLANK."
    • (I asked an LLM to give me a list, and I got a list of about 90 words)
    • We'll need to filter this to contain only 11-letter words (because LLMs are terrible about that)
  • B, a list of words that might describe beef
    • Ideally, we'll have a list for this side of the equation, too, then we'll check to see if we can apply the transformation to a word in M and match it to a word in B
    • Alternatively, we can simply take each candidate in M, then apply the transformation ("change one letter in it to an A and say the result out loud") and evaluate the result in the context where it is describing beef.
      • Of course this is tricky, because we're not matching strings, we're matching pronunciations; see the next point.
  • transform(): We need a function to apply the prescribed transformation: "change one letter in it to an A and say the result out loud"
    • How would we do this programmatically? We might need some guidelines to keep this manageable.
    • Firstly, I think we probably want to restrict our replacement targets to other vowels.
    • Secondly, the puzzle asks us to replace a letter with another letter (A), but we're only interested in the resulting pronunciation, not the resulting spelling. So we need to operate on pronunciations. We can use the CMU Pronouncing Dictionary as we have in past puzzles.
    • The CMU dictionary uses ARPAbet symbols, so we need a list of all the ARPAbet symbols that an "A" can correspond to. Then we can generate new pronunciations by replacing each vowel sound in the pronunciation one by one with these various "A" vowel sounds.
      • A few ARPAbet examples:
      • AE T : "at"
      • EY T : "ate"
      • TH EY T AH : "theta"
    • Next we need a way to convert the pronunciation symbols back into a spelling ("phoneme-to-grapheme"). In our case, we'll probably simply want to query the CMU Dictionary to see if the new pronunciation exists in the dictionary, then we pull the corresponding spelling.
    • One more twist: Note that the puzzle specifies we'll get a hyphenated word. The hyphenated word is probably not going to have its own entry in the pronunciation dictionary. This means we'll need to try tokenizing the new pronunciation; in other words, we need to split it in two, then try finding each of the two pronunciations in the dictionary. We could simply brute force this, trying the split at each possible location (between syllables).  
  • evaluation: Let's imagine the above steps have gone well, and we've found a number of potential spellings for the "hyphenated word that might describe beef". Now we need a language model to evaluate each of the candidates in context. As we've done in the past, we can insert each candidate into a set of sentence templates and get a perplexity score for each, and ideally, the correct solution should have the lowest or near lowest perplexity. Our templates might be something like these:
    • "That was some of the best [BLANK] beef I have ever had."
    • "[BLANK] beef is expensive but usually high quality."
    • "The supermarket is having a special on [BLANK] beef this week."
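The vowel-swapping part of transform() might look like the sketch below. The A-vowel inventory is a simplifying assumption, and a real run would iterate over full cmudict pronunciations rather than a hand-typed example:

```python
# Vowel phonemes the letter "A" commonly spells (a simplifying assumption)
A_VOWELS = ["AA1", "AE1", "AH0", "EY1"]

def a_variants(phones):
    # For each vowel (phonemes carry a stress digit), yield copies of the
    # pronunciation with that vowel swapped for each "A" vowel
    for i, ph in enumerate(phones):
        if ph[-1].isdigit():
            for a in A_VOWELS:
                yield phones[:i] + [a] + phones[i + 1:]

for variant in a_variants(["G", "R", "IY1", "S"]):
    print(variant)
```

Each variant would then be split in two at candidate syllable boundaries and looked up in the CMU dictionary, as described above.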
That's how I'd go about approaching this one systematically. In truth, it would be overkill for this easier-than-usual puzzle. Alternatively, we could simply start with M (our list of milk words) and from there, most people would spot the solution right away. I went ahead and did just that; if you'd like to give it a shot, or even flesh out the full approach described above, you can start by running my script to get a list of 11-letter milk-related words. It only returns 10 words, so I'm confident you'll recognize the solution.


The deadline for NPR submissions has passed, so click the spoiler button below to see my solution.

Monday, March 24, 2025

Smooth booth

It's Monday, Puzzlers, and you know what that means! Welcome to Natural Language Puzzling, the blog where we use natural language processing (NLP), linguistics, data science, language modeling, programming and our own noggins to solve the weekly Sunday Puzzle from NPR. Here's this week's puzzle:

This week's challenge comes from listener Dan Asimov, of Berkeley, Calif. In English the two-letter combination TH can be pronounced in two different ways: one as in the word "booth," the other as in "smooth." What is the only common English word, other than "smooth," that ends in the letters TH as pronounced in smooth?

Well, this is certainly a diversion from the usual Sunday Puzzle format, which involves taking one string of text, applying a prescribed transformation and generating another string of text.

So what do we need to solve this puzzle?

Really only one thing: a robust pronunciation dictionary. For that, we can turn to an old friend---the Carnegie Mellon University (CMU) Pronouncing Dictionary. You can read more about that on the project website, or simply download the file from GitHub.

Now that we have the dictionary, we can navigate to it from the command line. From that location, we can query the pronunciations for "booth" and "smooth".

Mac:cmudict-master puzzler$ grep 'smooth ' cmudict-dict.txt 

smooth S M UW1 DH

Mac:cmudict-master puzzler$ grep 'booth ' cmudict-dict.txt 

booth B UW1 TH

tollbooth T OW1 L B UW2 TH


This shows us how the phonemes in question are represented in the CMU Pronouncing Dictionary, which uses the ARPAbet symbols. We can see that "smooth" ends with the symbol "DH", so our target word (the only common English word, other than "smooth," that ends in the letters TH as pronounced in smooth) must also end with the symbol "DH".

The next step is simply to query the dictionary file for any pronunciations that end in "DH". For reference, here's what the body of the file looks like:

carousel K EH1 R AH0 S EH2 L

carousing K ER0 AW1 Z IH0 NG

carow K AE1 R OW0

carozza K ER0 AA1 Z AH0

carp K AA1 R P

carpal K AA1 R P AH0 L


We can use grep plus the "$" anchor to ensure that we find "DH" at the end of a line:

Mac:cmudict-master puzzler$ grep 'DH$' cmudict-dict.txt 

bathe B EY1 DH

blithe B L AY1 DH

blythe B L AY1 DH

boothe B UW1 DH

bothe B OW1 DH

breathe B R IY1 DH

clothe K L OW1 DH

...


I truncated the output so as not to spoil the solution. As we can see, the results shown all end in "-the", whereas the target word should end in "-th". In fact, there's only one word (other than "smooth") in the output of the grep query that fits the bill, so that must be the solution. I'll be back after NPR's Thursday deadline for submissions to share that solution. Good luck, Puzzlers!
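For completeness, both conditions (spelled -th but not -the, pronounced with a final DH) can be checked in one pass; here's a Python sketch over a toy sample in the dictionary's word-then-phonemes format:

```python
def ends_th_pronounced_dh(entry):
    # Keep entries spelled "...th" (but not "...the") whose pronunciation
    # ends in the DH phoneme
    word, _, phones = entry.partition(" ")
    return (word.endswith("th") and not word.endswith("the")
            and phones.split()[-1] == "DH")

# Toy sample in the dictionary's format; a real run reads the full file
sample = ["bathe B EY1 DH", "booth B UW1 TH", "smooth S M UW1 DH"]
print([e.split()[0] for e in sample if ends_th_pronounced_dh(e)])  # ['smooth']
```

Run over the full file, this filter leaves "smooth" plus the one word that must be the solution.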

The deadline for NPR submissions has passed, so click the spoiler button below to see my solution.

Monday, March 10, 2025

'Jon Stewart' & classic movies

Welcome back, Puzzlers! It's Monday Funday at Natural Language Puzzling, the blog where we use natural language processing, computational linguistics and data science to solve the weekly Sunday Puzzle from NPR. Here is this week's puzzle:

This week's challenge comes from listener Al Gori, of Cozy Lake, N.J. Take the name JON STEWART, as in the comedian and TV host. Rearrange the letters to spell the titles of three classic movies. One of the titles is in its familiar shortened form.

 Let's break this down. Here's what we need:

  • M: A list of "classic movies" to iterate through
    • Find a list online
    • Build a list ourselves
    • Ask an LLM for a list
    • I extracted a list of all the titles from this spreadsheet of every Oscar winner and nominee
      • (We don't know for certain that the movies in the solution are Oscar winners or nominees, but it's a fairly safe bet and a good place to start.)
  • fx(M): We need a function that can generate combinations of three movie titles from our list, such that:
    • the total length of all 3 movie titles is 10 letters (len('jonstewart'))
      • we can use this as a first filter
    • then we check that the letters in the combined movie titles are the same as the letters in 'jonstewart'
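Here's a sketch of fx(M) using collections.Counter for the letter comparison (the "titles" in the example are placeholder strings, not the solution):

```python
from itertools import combinations
from collections import Counter

def letters(s):
    # Multiset of letters, ignoring case, spaces, and punctuation
    return Counter(c for c in s.lower() if c.isalpha())

TARGET = letters("Jon Stewart")

def find_triples(titles):
    # All 3-title combinations whose pooled letters match JON STEWART
    out = []
    for trio in combinations(titles, 3):
        combined = letters(" ".join(trio))
        if sum(combined.values()) == 10 and combined == TARGET:
            out.append(trio)
    return out

# Placeholder "titles" purely to exercise the function
print(find_triples(["Jon", "Stew", "Art", "Up"]))  # [('Jon', 'Stew', 'Art')]
```

The 10-letter length test is the cheap first filter; the Counter equality then confirms the letters match exactly.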
Clearly, this puzzle is on the easier side, with minimal NLP required. The critical piece here is the starting list of movies. I have my solution, and I'll be back after the Thursday NPR deadline to share it along with my script. In the meantime, if you'd like to try this yourself, you can start from my list of about 3700 Oscar winners and nominees.

Update & Solution

The deadline for submissions has passed, so click the spoiler button below if you want to see my answer. Click here for my python script on GitHub. See you next week!


Monday, March 03, 2025

Singer/actress, dangerous object

Greetings, Puzzlers, and welcome back to Natural Language Puzzling, the blog where we use natural language processing, linguistics and data science to solve the Sunday Puzzle from NPR!

Here's this week's puzzle:

This week's challenge comes from listener Dennis Burnside, of Lincoln, Neb. Think of a famous singer and actress, first and last names, two syllables each. The second syllable of the last name followed by the first syllable of the first name spell something that can be dangerous to run into. What is it?

Let's break this down and determine how we'll approach solving the puzzle.

This is what I call a "Classic format" Sunday Puzzle: Take thing from Class A (singer/actress), apply transformation, yield thing from Class B (something dangerous to run into).

So here's what we'll need to solve it:
  • A: list of singer/actresses
    • This is an open class, so we'll need robust coverage
    • We could try finding an existing list online, or cobbling one together from various lists and listicles
    • We could ask a chatbot LLM like ChatGPT/Gemini/etc. to give us a list
  • B: list of things that are dangerous to run into
    • Also open class
    • A little vague. Is it inherently dangerous, or only dangerous if you run into it?
      • e.g., knife vs tree
    • Is there wordplay here? (probably not)
      • e.g., ex-spouse
    • We don't actually need a list here a priori. We can iterate through list A and evaluate the resulting candidates for "something dangerous to run into".
      • We simply need an LLM that can provide us with perplexity scores for sentences
      • We use a sentence template:
        • "Stay alert because running into [BLANK] can be very dangerous."
      • We plug the string derived from combining the two syllables from the singer/actress into the blank, and get a score representing how (im)probable the resulting sentence is (i.e., its perplexity).
      • Ideally, the correct solution will be among those ranked as most probable here.
  • Functions:
    • Syllabification:
      • We need the ability to divide each string into syllables. This is for counting and for forming the "something dangerous" string.
      • This seems simple but it can be messy...
      • We're dealing with names, which can have highly variable orthography and phonology, and pronunciation dictionaries are certain to be missing lots of less frequent names.
      • It's not always easy to determine syllables based on orthography. A sequence of vowels could be a single syllable (a diphthong as in train) or multiple syllables (a hiatus as in trivia)
      • I'm going to rely on the CMU Pronouncing Dictionary, which we can call from within NLTK in Python. Because each vowel is marked with 0, 1 or 2 for stress---and each vowel is one syllable---we can simply count the number of numeral digits in the pronunciation. For example, the pronunciation for trivia is given as ['T', 'R', 'IH1', 'V', 'IY0', 'AH0'].
      • For strings that are out of vocabulary (OOV), I'll back off to a rule-based approach that tries to count syllables. I plan to use the syllapy library, as described here.
    • String transformation:
      • We also need a simple function to recombine the syllables as described: second syllable of the last name followed by the first syllable of the first name.
      • We'll handle this in python as well.
    • Perplexity scoring:
      • As mentioned above, we'll need an LLM that can give us perplexity scores for sentences. I'm going to use a huggingface transformers implementation of a GPT model. (But stay tuned--I'm building implementations of several other LLMs and we'll be evaluating those in future puzzling!)
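The syllable count reduces to counting stress digits; here's a minimal sketch (the NLTK lookup is left as a comment so the snippet runs standalone):

```python
def count_syllables(phones):
    # CMU marks each vowel phoneme with a stress digit (0, 1, or 2), and
    # each vowel is one syllable, so count phonemes ending in a digit
    return sum(1 for ph in phones if ph[-1].isdigit())

# With NLTK, e.g.: from nltk.corpus import cmudict
#                  phones = cmudict.dict()["trivia"][0]
print(count_syllables(["T", "R", "IH1", "V", "IY0", "AH0"]))  # 3 ("trivia")
```

Out-of-vocabulary names would fall through to the rule-based syllapy backoff mentioned above.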
That more or less covers my approach. Would you handle this differently? Do you think an LLM could solve this puzzle outright? (So far, the ones I've tried have not managed a solution).

Please leave your comments and insights (but no spoilers please). I'll be back after NPR's Thursday deadline for submissions to share my solution and my implementation. Good luck!

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!

Monday, December 02, 2024

Classic TV actor, sleepwear

Happy Monday funday, Puzzlers! Welcome back to Natural Language Puzzling, the blog where we use natural language processing, linguistics and data science to solve the weekly Sunday Puzzle from NPR. Let's take a look at this week's puzzle:

This week's challenge comes from the crossword constructor and editor Peter Gordon. Think of a classic television actor — first and last names. Add a long-E sound at the end of each name and you'll get two things that are worn while sleeping. What are they?

 Let's break this down into the tools and resources that we'll need:

  • A: a list of classic television actors
  • P: a pronunciation dictionary
    • I'll be using this python implementation: https://github.com/Kyubyong/g2p
    • It queries the CMU Pronouncing Dictionary and uses a grapheme to phoneme model to predict pronunciations for out of vocabulary (OOV) items
    • Unfortunately, this implementation allows us to query a word to get its pronunciation but doesn't allow the reverse; i.e., it's g2p but not p2g. We'll also need to query a pronunciation to get an orthographic word. This is because we need to add "long-E" (/i/) to the end of the pronunciation to see if it results in a word that is a "thing worn while sleeping". So in this case we'll need to load the CMU dictionary file and work with it.
    • **See my note below
  • S: list of things worn while sleeping
    • It seems like this should be a fairly short list, especially given that we're only looking for items that end in /i/ ("long E").
    • I'm not so confident that I could brainstorm a comprehensive list, so we'll use an LLM approach here instead.
  • LLM: We'll use the python transformers implementation of GPT2 to score sleepwear candidates.
The algorithm is this, roughly:
  • We're searching for the actor name that gives us two items of sleepwear after we add /i/ to the end of the first and last names.
  • So we take each actor, get the first name and last name, convert each to pronunciation, and add /i/ to the pronunciation:
    • Bob Hope --> B AA B . HH OW P .
    • B AA B . HH OW P . --> B AA B IY . HH OW P IY .
  • Next, we use the pronunciation dictionary to try to map each pronunciation back to a word:
    • B AA B IY . --> [ 'bobby', 'bobbie' ]
    • HH OW P IY . --> [ 'hopi' ]
  • Then we use an LLM to evaluate the probability of these sleepwear candidates in the context of a relevant sentence:
    • "I usually sleep in my pajamas or my __1__, but last night I slept in my __2__."
    • "I usually sleep in my pajamas or my bobby, but last night I slept in my hopi."
  • We sort all these candidate sentences by their LLM perplexity scores, and we expect to find the correct solution among those with the lowest perplexity. Our "bobby" and "hopi" sentence would likely have a high perplexity score, as it doesn't really make much sense.
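The pronunciation-to-spelling lookup can be sketched by inverting the dictionary; the entries below are a toy sample of the CMU format, not the full file:

```python
# Toy pronunciation dictionary; a real run loads the full CMU file
cmu = {
    "bob": ("B", "AA1", "B"),
    "bobby": ("B", "AA1", "B", "IY0"),
    "bobbie": ("B", "AA1", "B", "IY0"),
    "hope": ("HH", "OW1", "P"),
    "hopi": ("HH", "OW1", "P", "IY0"),
}

# Invert it: pronunciation -> list of spellings (the p2g direction)
p2g = {}
for word, phones in cmu.items():
    p2g.setdefault(phones, []).append(word)

def add_long_e(word):
    # Append /i/ (ARPAbet IY0) to the pronunciation, map back to spellings
    phones = cmu[word] + ("IY0",)
    return sorted(p2g.get(phones, []))

print(add_long_e("bob"))   # ['bobbie', 'bobby']
print(add_long_e("hope"))  # ['hopi']
```

Each actor's first and last names go through add_long_e(), and any spellings that come back are the sleepwear candidates for the LLM to score.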

**Oddly enough, one of the sleepwear items is not present in the CMU Pronouncing Dictionary. This is odd because there are just over 135,000 unique word types there, and the word we are hunting is not particularly rare. In other words, I figured out the solution manually as I was attempting to solve this one, and when I checked the CMU dict, I found the word missing. I added the word and pronunciation to my own copy of the dictionary in order to make this approach work. This has me thinking that I should start maintaining my own addendum to the CMU Pronouncing Dictionary that I could share on my GitHub.

Anyway, here's what some of my output looks like, omitting the correct solution of course. I'm pleased to say that the correct solution has the lowest perplexity of all candidate solutions. I'll be back after the Thursday NPR deadline for submissions to share that solution with you. Good luck!

3.404    Rose Byrne    rosie    bernie    I usually sleep in ...
3.655    Jack Black    jackie    blackie    I usually sleep in ...
3.741    Paul Rudd    pauli     ruddy    I usually sleep in ...
3.893    Will Smith    willie     smithee    I usually sleep in ...

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Sunday, November 24, 2024

State Capital, TV character, TV game show host

Happy Puzzle Day, friends! Let's take a crack at today's Sunday Puzzle from NPR:

This week's challenge comes from listener Greg VanMechelen, of Berkeley, Calif. Name a state capital. Inside it in consecutive letters is the first name of a popular TV character of the past. Remove that name, and the remaining letters in order will spell the first name of a popular TV game show host of the past. What is the capital and what are the names?

I'll break this puzzle down as usual. Here's what we need to solve it:

  • C: a list of state capitals
    • easy peasy--we love starting with a closed class (i.e., there is a fixed list and we don't need to worry about whether our list contains all the right candidates)
  • get_substrings(): We need some kind of a (python) function to iterate through each state capital string and apply the puzzle transformation
    • In this case this means extracting a substring, squishing the remaining letters together into a string, and returning the substring and the remaining string.
    • I suggest we do this in a way that allows each substring to be between 2 and 6 letters long. This is a somewhat arbitrary decision, but my intuition tells me that we're unlikely to find a sequence of 7 letters embedded in a state capital that happen to spell out a person's name and still allow the rest of the letters to spell another name. We can of course modify this limit as needed.
  • LM: We need a (large) language model to evaluate candidates.
    • I'm using a pretrained huggingface transformers library implementation of GPT2.
    • I'll iterate through the candidates, inserting them into a sentence template and getting a perplexity score from the LLM.
    • For example, we could imagine: Madison --> di, mason
    • And we use them in the sentence template: "Di was a popular TV character and Mason was a popular TV game show host."
      • Note that capitalization matters here. Without capitalizing the names, the correct solution is ranked quite poorly, but with capitalization it emerges among the top 8 candidates.
    • We return a list of these sentences ranked by perplexity scores, and theoretically, the correct solution should be among those with the lowest perplexity.
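The substring-extraction step above can be sketched in python. This is a minimal version of what I'm describing; `get_substrings` is my name for it here, and the actual script may differ in its details:

```python
def get_substrings(capital, min_len=2, max_len=6):
    """Yield (substring, remainder) pairs: every consecutive substring
    of the capital between min_len and max_len letters long, plus the
    string left over when that substring is removed."""
    s = capital.lower()
    for length in range(min_len, max_len + 1):
        for start in range(len(s) - length + 1):
            inner = s[start:start + length]             # candidate TV character name
            remainder = s[:start] + s[start + length:]  # candidate game show host name
            yield inner, remainder
```

For "madison", one of the yielded pairs is ("di", "mason"), matching the example above; each pair then gets capitalized and dropped into the sentence template for scoring.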
As I mentioned above, my approach did lead me to the solution. You can try my script here. Here's a look at some of the output (not including the solution).
  • (__1__ is a TV character and __2__ is a TV game show host)
  • 3.54 | Jackson | Son | Jack | Son is a TV character and Jack is ...
  • 3.69 | Annapolis | Anna | Polis | Anna is a TV character and Polis is ...
  • 3.71 | Honolulu | Hono | Lulu | Hono is a TV character and Lulu is ...
  • 3.80 | Salem | Le | Sam | Le is a TV character and Sam is ...
  • 3.82 | Helena | He | Lena | He is a TV character and Lena is ...
  • 3.86 | Littlerock | Ock | Littler | Ock is a TV character and Littler is ...
  • 3.87 | Hartford | Hart | Ford | Hart is a TV character and Ford is ...
Good luck! I'll be back after NPR's Thursday deadline for submissions to share my solution.

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Sunday, November 03, 2024

Experiment place and person

It's Sunday Funday here on Natural Language Puzzling, where we use natural language processing and linguistics to solve the Sunday Puzzle from NPR! Let's look at this week's puzzle:

This week's challenge comes from listener Mark Maxwell-Smith. Name a place where experiments are done (two words). Drop the last letter of each word. The remaining letters, reading from left to right, will name someone famously associated with experiments. Who is it?

Classic Sunday Puzzle, right? Take thing from Class A, apply a string transformation, get thing from Class B.

Here's how we'll solve the puzzle:

  • P: (list) places where experiments are done (two words)
    • We need this list to iterate through in search of the solution
    • We can ask any available LLM to generate a list to get us started
      • I started with such a list, but broke it into a list of first words and second words, that my python script then recombines to get a longer list, so:
        • [science lab, physics room] --> 
        • [science lab, science room, physics lab, physics room]
    • This is an open class, but the clue is so restrictive I think we can cover all the possibilities this way
  • S: (list) someone famously associated with experiments
    • Again, we can ask an LLM to provide us a list, but we can't be certain we'll cover the solution because this is a fairly vague clue and definitely a very open class
    • Alternatively, we can simply iterate through P, apply the transformation, and plug the new string into an LLM to get a score
  • transform(str): (function) We need to define a function that can take in a string like "science laboratory" and return the transformed string: "Scienclaborator"
    • Note that we'll capitalize the string as we're expecting a name
  • LLM: We'll use the python transformers library and its pretrained GPT2 model
    • We'll start with a sentence frame like this:
      • '[BLANK] was a famous scientist known for experiments.'
    • We insert each transformed string into the sentence frame, pass it to the LLM, and return a perplexity score
    • We rank these candidates by perplexity
      • The solution should be among the candidates with the lowest perplexity score.
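The transform(str) step is simple enough to sketch directly, assuming the input is a space-separated phrase:

```python
def transform(phrase):
    """Drop the last letter of each word, join what remains from left
    to right, and capitalize the result (we're expecting a name)."""
    joined = "".join(word[:-1] for word in phrase.split())
    return joined.capitalize()
```

As in the example above, transform("science laboratory") returns "Scienclaborator".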

And that's how I solved it! If you have alternative approaches, drop a note in the comments. I'll be back right here after NPR's Thursday deadline to share my solution. Good luck!

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. Click here for the python script on GitHub. See you next week!


Monday, October 28, 2024

Place name and animal behavior

It's Monday Funday at Natural Language Puzzling, the blog where we use natural language processing and linguistics to solve the weekly Sunday Puzzle from NPR! Let's take a look at this week's puzzle:

Name a place somewhere on the globe -- in two words. Rearrange the letters of the first word to name some animals. The second word in the place name is something those animals sometimes do. What is it?

This one immediately strikes me as a relatively difficult one. Let's look at tools and data we need to solve this one and get a better understanding of what makes it difficult. We'll need:

  • P: a list of two word place names anywhere on the globe
    • this is a very unrestricted class
    • we can return to the OSM Names file that we've used in recent puzzles
      • it has 21 million entries, so we'll narrow that down, keeping only the list of names (and dumping the rest of the tsv file)
      • we'll also filter it down to only two word place names, and only places written with Roman letters
      • this still leaves about 8 million place names; *we'll return to this later
  • anagram_checker(string): A function to take a string and return all the valid anagrams as a first step to rearranging the letters of the first word to "name some animals"
    • This will require an English lexicon for validating words; I'll be using a lexicon extracted from the Brown Corpus in the NLTK (much like I did in this recent puzzle)
  • LLM: We'll use an LLM to evaluate the likelihood that each valid anagram fits the "name some animals" clue.
    • I'll use the pretrained GPT2 model in the python transformers library
    • I'll plug each candidate into a sentence template like "The wildlife photographer published a set of photos of [BLANK]."
    • I'll keep the candidates that score below some (currently unestablished) threshold for perplexity.
  • For the most likely candidates, I'll use the anagram_checker on the second word of the place name to find valid English words
    • Then I'll use the LLM again to evaluate candidates in a sentence like this one: "[BLANK2] is something that [BLANK1] sometimes do."
    • Ideally, the correct solution should be among the candidates with the lowest perplexity scores.
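A quick sketch of the anagram_checker idea: rather than generating every permutation of the first word, it's much cheaper to index the lexicon by each word's sorted letters and look candidates up by the same key. The toy lexicon below is just a stand-in for the Brown Corpus wordlist:

```python
from collections import defaultdict

def build_anagram_index(lexicon):
    """Map each sorted-letter signature to the lexicon words sharing it."""
    index = defaultdict(list)
    for word in lexicon:
        index["".join(sorted(word.lower()))].append(word)
    return index

def anagram_checker(string, index):
    """Return every lexicon word that is an anagram of `string`,
    excluding the string itself."""
    key = "".join(sorted(string.lower()))
    return [w for w in index.get(key, []) if w.lower() != string.lower()]
```

Each returned word then goes into the wildlife-photographer sentence template to be scored as a possible animal name.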
Regarding the list of 8 million place names of two words: Naturally, many of these are places most of us have never heard of. I'm certain that the solution will be a place name that is familiar to most listeners. So we need a way to sort the 8 million candidate place names by how well known they are. Again, I returned to the LLM and plugged the candidates into this template: "[BLANK] is one of the most famous places in the world." And I'm storing the resulting list of ranked candidates. When you consider the approach I'm taking above, it's very brute force and computationally heavy, so we want to start with a short list of place names. I'll likely try running my script with the first 1000 place names, and if the solution isn't found there, take the second 1000 place names, and so on.

Good luck this week! I'll be working on my approach and I'll be back to share my solution after the Thursday submission deadline from NPR (i.e., no spoilers here).

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Sunday, October 20, 2024

US city and state with 13 unique letters

Hello and welcome back to Natural Language Puzzling, the only blog on the web where we use natural language processing and linguistics to solve the Sunday Puzzle from NPR! Let's take a look at this week's puzzle:

This week's challenge comes from listener David Dickerson, of Tucson, Arizona. The city UTICA, NEW YORK, when spelled out, contains 12 letters, all of them different. Think of a well-known U.S. city, that when its name is spelled out, contains 13 letters, all of them different. Your answer doesn't have to match mine.

Immediately we can tell that this is going to be an easy one with minimal NLP or linguistics, and not a lot of tools required. Here's what we need:

We need a python script that can do a bit of cleanup on place names-- lowercase everything (so "a" and "A" are counted as one letter type and not two), remove punctuation, etc. Then we simply check if the original length of the word and the count of unique letters in the word are the same.
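Here's a minimal sketch of that check (the helper name is mine):

```python
import string

def isogram_length(city):
    """Lowercase the name, drop punctuation and spaces, and return the
    letter count if all letters are different, else None."""
    letters = [c for c in city.lower() if c in string.ascii_lowercase]
    return len(letters) if len(letters) == len(set(letters)) else None
```

isogram_length("Utica, New York") returns 12, matching the example in the puzzle; we keep any candidate where it returns 13.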

My script is producing 38 solutions, although only one or two of them are "well-known". Good luck! I'll be back to share my solution after the Thursday NPR submission deadline.

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Monday, October 14, 2024

Athletes, beverage

Welcome back to Natural Language Puzzling, where we use natural language processing and linguistics to solve the Sunday Puzzle from NPR. Here's this week's puzzle:

This week's challenge comes from listener Mike Selinker, of Renton, Wash. Think of something to drink whose name is a compound word. Delete the first letter of the first part and you'll get some athletes. Delete the first letter of the second part and you'll get where these athletes compete. What words are these?

 This one is not terribly different from last week's puzzle. Here's how I plan to approach this problem:

  • B: We need a sizable list of beverages
    • I've pieced together a list of about 200 beverages including cocktails, coffee and tea drinks, etc.
    • We know the solution should be a single string and it should be a compound word (e.g., "houseboat", "sunshine", etc.)
      • After we drop a letter from each part of the compound word, we'll still have a word for athletes and a word for a place where they compete
      • These two strings probably need to be a minimum of 3 letters each
      • So we should start with a beverage word that is at least 8 letters long
  • compound word function: Ideally, we need a function or a tool that can tell us whether a string is a compound word, and if so, what its component words are
    • I haven't found a readily available solution for this, so I effectively had to create my own
  • splitting function: To create our own compound word checker, we'll need to split each beverage word in two, such that each of the two strings contains a minimum of 3 letters; we can do this in python
  • Lexicon: With each beverage split into a pair or multiple pairs, we need to check whether each half of the split is a real English word.
    • There are many ways to approach this and numerous wordlists we could choose from
    • I chose to use NLTK to extract the 6000 most frequent words from the Brown Corpus
    • We can find much larger lexicons, but they tend to be too permissive
  • LLM: Now that we know which pairs are valid words, we can plug these words into a sentence template, pass that sentence to an LLM to get a perplexity score, and then rank the candidates according to their scores. The correct solution should be among the sentences with the lowest perplexity.
    • My sentence template:
      • We went to the __1__ to see the __0__ compete in the tournament.
    • I'm using GPT2 as implemented in the transformers python library.
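The splitting-plus-lexicon check can be sketched like this; the toy lexicon here stands in for the Brown Corpus wordlist:

```python
def split_compound(word, lexicon, min_len=3):
    """Return every way to split `word` into two halves of at least
    `min_len` letters such that both halves are in the lexicon."""
    return [
        (word[:i], word[i:])
        for i in range(min_len, len(word) - min_len + 1)
        if word[:i] in lexicon and word[i:] in lexicon
    ]
```

Each surviving pair then has the first letter of each half dropped and goes into the sentence template for scoring.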

This approach worked for me. I'll be back to share my solution after NPR's Thursday submission deadline. In the meantime, feel free to examine my script on GitHub and drop your suggestions in the comments. Good luck, Puzzlers!

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Tuesday, October 08, 2024

Place name, palindrome, body part

Ahoy! Welcome back to Natural Language Puzzling, the weekly blog where we solve the NPR Sunday puzzle using natural language processing (NLP), language modeling and linguistics. Let's dive into this week's puzzle:

This week's challenge comes from listener Joe Krozel, of Creve Coeur, MO. Think of a place in America. Two words, 10 letters altogether. The first five letters read the same forward and backward. The last five letters spell something found in the body. What place is this?

Okay, so here's what I need to solve this problem:
  • P: a list of place names in America
    • Let's assume "America" means "USA" here
    • The longer the list, the better
    • We could do a web search for an existing list of place names
    • We could ask an LLM for a list
    • I'm going to use OpenStreetMap data (**details below)
  • is_palindrome(string): We need a function to check whether a string reads the same forward and backward; we'll use python to take the first five letters and run this function.
  • is_found_in_body(string): We also need a function to determine whether the last five letters are "something found in the body".
    • We could start out with a list of body parts, fluids, etc., but "something found in the body" is worded kind of vaguely and I'm not confident we can easily assemble an exhaustive list
    • I'm going to use an LLM instead; I'll take the last five letters from each of our candidate place names, insert the string into a sentence template, pass this sentence to the LLM, get a perplexity score, then sort these candidate solutions from highest to lowest perplexity; the correct solution should be among the lowest perplexities. I'll use the python transformers library's implementation of GPT2 for this.
There's a little more to it, but that's the basic idea. We'll also need some functions to do a bit of string cleaning and manipulation and to filter our list to candidates that have exactly two words and 10 letters. I've done all this in a python script, which you can view here on the GitHub repo.
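The filtering can be sketched as a single pass over the place names. Note the function name is mine, and the place name "Anana Heart" below is invented purely to exercise the code:

```python
import re

def palindrome_candidates(place_names):
    """Keep two-word place names with exactly 10 letters whose first
    five letters read the same forward and backward; return each
    survivor along with its last five letters for LLM scoring."""
    out = []
    for place in place_names:
        letters = re.sub(r"[^a-z]", "", place.lower())
        if (len(place.split()) == 2 and len(letters) == 10
                and letters[:5] == letters[:5][::-1]):
            out.append((place, letters[5:]))
    return out
```

Each (place, last-five-letters) pair then feeds the anatomy sentence templates below.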

**Regarding the OpenStreetMap data
OSMNames is a free, open source, downloadable database of over 21 million place names around the world. It downloads as a massive file with lots of columns that are totally extraneous for our purposes, so I use a few BASH commands to filter it down to what we need. To reproduce the file I'm using in my script, you can start with the downloaded file and run the following commands:

## unzip the file (on Linux you can probably skip this step and use "zcat" instead of "cat" in the next one)
gunzip planet-latest_geonames.tsv.gz

## get only place names for the USA
cat planet-latest_geonames.tsv | awk -F '\t' -v OFS='\t' 'NR == 1 || $16 == "us"' > us1.tsv

## keep only column 1 and column 14 (place name and state)
cut -f1,14 < us1.tsv > us2.tsv

## remove any duplicate entries
awk '!seen[$0]++' us2.tsv > usa-places.tsv

Now that we have an appropriately filtered list, our script can create a subset of candidate place names that are exactly two words and 10 letters. For my approach, I assumed this could be simply a two word place name that totals 10 letters, like San Antonio. But it could also be a combination of one-word place name and one-word state name, like Dayton Ohio. So my python script churns through the input file to find all such possibilities.

Next I iterate through this list, keeping only the candidates where the first five letters are palindromic.

Then I iterate through the remaining list, plugging each candidate into a couple of sentence templates for the LLM to evaluate. Here are the two sentence templates I've come up with:
  • In anatomy class, students learn about all the things inside the body, like bones and [BLANK].
  • Each anatomy lesson focuses on a new part of the body, like a bone or a [BLANK].
In the end, I get a long list of sentences ranked by perplexity, and indeed the sentence with the lowest perplexity score is the correct solution! Without giving you the solution, here's what some of that output looks like:

5.0561   Each anatomy ... like a bone or a ararc  mirimararc

5.0450   In anatomy ... like bones and aradr  caracaradr

5.0373   Each anatomy ... like a bone or a awaii  hanahawaii

5.0224   In anatomy ... like bones and acove  noconacove

4.9808   In anatomy ... like bones and nroad  aegeanroad

4.9779   Each anatomy ... like a bone or a uaway  hilihuaway

Good luck out there, puzzlers! I'll be back after the Thursday afternoon NPR deadline to share my solution. In the meantime, feel free to give my script a try.

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!


