Natural Language Puzzling: March 2025

Monday, March 24, 2025

Smooth booth

It's Monday, Puzzlers, and you know what that means! Welcome to Natural Language Puzzling, the blog where we use natural language processing (NLP), linguistics, data science, language modeling, programming and our own noggins to solve the weekly Sunday Puzzle from NPR. Here's this week's puzzle:

This week's challenge comes from listener Dan Asimov, of Berkeley, Calif. In English the two-letter combination TH can be pronounced in two different ways: once as in the word "booth," the other as in "smooth." What is the only common English word, other than "smooth," that ends in the letters TH as pronounced in smooth?

Well, this is certainly a diversion from the usual Sunday Puzzle format, which involves taking one string of text, applying a prescribed transformation and generating another string of text.

So what do we need to solve this puzzle?

Really only one thing: a robust pronunciation dictionary. For that, we can turn to an old friend: the Carnegie Mellon University (CMU) Pronouncing Dictionary. You can read more about that on the project website, or simply download the file from GitHub.

Now that we have the dictionary, we can navigate to it from the command line. From that location, we can query the pronunciations for "booth" and "smooth".

Mac:cmudict-master puzzler$ grep 'smooth ' cmudict-dict.txt

smooth S M UW1 DH

Mac:cmudict-master puzzler$ grep 'booth ' cmudict-dict.txt

booth B UW1 TH

tollbooth T OW1 L B UW2 TH

This shows us how the phonemes in question are represented in the CMU Pronouncing Dictionary, which uses ARPAbet symbols: "TH" is the voiceless sound at the end of "booth", and "DH" is the voiced sound at the end of "smooth". Our target word (the only common English word, other than "smooth," that ends in the letters TH as pronounced in smooth) must therefore also end with the symbol "DH".

The next step is simply to query the dictionary file for any pronunciations that end in "DH". For reference, here's what the body of the file looks like:

carousel K EH1 R AH0 S EH2 L

carousing K ER0 AW1 Z IH0 NG

carow K AE1 R OW0

carozza K ER0 AA1 Z AH0

carp K AA1 R P

carpal K AA1 R P AH0 L

We can use grep plus the "$" anchor to ensure that we find "DH" at the end of a line:

Mac:cmudict-master puzzler$ grep 'DH$' cmudict-dict.txt

bathe B EY1 DH

blithe B L AY1 DH

blythe B L AY1 DH

boothe B UW1 DH

bothe B OW1 DH

breathe B R IY1 DH

clothe K L OW1 DH

...

I truncated the output so as not to spoil the solution. As we can see, the results shown all end in "-the", whereas the target word should end in "-th". In fact, there's only one word (other than "smooth") in the output of the grep query that fits the bill, so that must be the solution. I'll be back after NPR's Thursday deadline for submissions to share that solution. Good luck, Puzzlers!
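For anyone who prefers Python to grep, here is a minimal sketch of the same query with the extra spelling check built in. The helper names are my own, and I'm assuming the dictionary file may mark alternate pronunciations as "word(2)" and trailing comments with "#". The sample lines are entries already shown above, not the full dictionary.

```python
def parse_entry(line):
    """Split one dictionary line into (word, phonemes), or None if empty."""
    line = line.split("#", 1)[0].strip()   # assumption: '#' starts a comment
    if not line:
        return None
    word, *phonemes = line.split()
    word = word.split("(", 1)[0]           # assumption: variants look like 'word(2)'
    return word, phonemes


def words_ending_in_voiced_th(lines, exclude=("smooth",)):
    """Return words spelled with final 'th' whose last phoneme is DH."""
    hits = []
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        word, phonemes = entry
        if (word.endswith("th") and phonemes and phonemes[-1] == "DH"
                and word not in exclude and word not in hits):
            hits.append(word)
    return hits


# A few entries copied from the grep output above.
sample = ["smooth S M UW1 DH", "booth B UW1 TH", "bathe B EY1 DH"]
print(words_ending_in_voiced_th(sample, exclude=()))

# To run against the real file (path as used in this post):
# with open("cmudict-dict.txt") as f:
#     print(words_ending_in_voiced_th(f))
```

On the sample, "bathe" is rejected for its spelling and "booth" for its final phoneme, which is exactly the filtering we did by eye on the grep output.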

The deadline for NPR submissions has passed, so click the spoiler button below to see my solution.

Monday, March 10, 2025

'Jon Stewart' & classic movies

Welcome back, Puzzlers! It's Monday Funday at Natural Language Puzzling, the blog where we use natural language processing, computational linguistics and data science to solve the weekly Sunday Puzzle from NPR. Here is this week's puzzle:

This week's challenge comes from listener Al Gori, of Cozy Lake, N.J. Take the name JON STEWART, as in the comedian and TV host. Rearrange the letters to spell the titles of three classic movies. One of the titles is its familiar shortened form.

Let's break this down. Here's what we need:

M: A list of "classic movies" to iterate through. We have a few options:

Find a list online
Build a list ourselves
Ask an LLM for a list

In the end, I extracted a list of all the titles from this spreadsheet of every Oscar winner and nominee.

(We don't know for certain that the movies in the solution are Oscar winners or nominees, but it's a fairly safe bet and a good place to start.)

fx(M): We need a function that can generate combinations of three movie titles from our list, such that:

the total length of all 3 movie titles is 10 letters (len('jonstewart'))

we can use this as a first filter

then we check that the letters in the combined movie titles are the same as the letters in 'jonstewart'
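The two checks above can be sketched roughly as follows. This is not my actual script, just a minimal illustration: the function names are hypothetical, and the titles and target in the example are toy inputs chosen so as not to spoil anything.

```python
from collections import Counter
from itertools import combinations


def normalize(title):
    """Lowercase a title and keep only its letters."""
    return "".join(ch for ch in title.lower() if ch.isalpha())


def find_title_triples(titles, target):
    """Return every 3-title combination that uses exactly the target's letters."""
    target_letters = Counter(normalize(target))
    target_len = sum(target_letters.values())

    # Cheap pre-filter: drop any title containing a letter the target lacks.
    candidates = {}
    for title in titles:
        key = normalize(title)
        if key and not (Counter(key) - target_letters):
            candidates.setdefault(title, key)

    matches = []
    for combo in combinations(sorted(candidates), 3):
        keys = [candidates[t] for t in combo]
        if sum(len(k) for k in keys) != target_len:    # first filter: length
            continue
        if Counter("".join(keys)) == target_letters:   # second filter: letters
            matches.append(combo)
    return matches


# Toy input, not the puzzle: three short titles and a made-up target string.
print(find_title_triples(["Up", "Big", "Her", "Jaws"], "big her up"))
```

The pre-filter matters in practice: with a few thousand titles, checking every triple is slow, but very few titles can be spelled from a ten-letter target, so the combination step ends up with a small pool.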

Clearly, this puzzle is on the easier side, with minimal NLP required. The critical piece here is the starting list of movies. I have my solution, and I'll be back after the Thursday NPR deadline to share it along with my script. In the meantime, if you'd like to try this yourself, you can start from my list of about 3700 Oscar winners and nominees.

Update & Solution

The deadline for submissions has passed, so click the spoiler button below if you want to see my answer. Click here for my Python script on GitHub. See you next week!

Monday, March 03, 2025

Singer/actress, dangerous object

Greetings, Puzzlers, and welcome back to Natural Language Puzzling, the blog where we use natural language processing, linguistics and data science to solve the Sunday Puzzle from NPR!

Here's this week's puzzle:

This week's challenge comes from listener Dennis Burnside, of Lincoln, Neb. Think of a famous singer and actress, first and last names, two syllables each. The second syllable of the last name followed by the first syllable of the first name spell something that can be dangerous to run into. What is it?

Let's break this down and determine how we'll approach solving the puzzle.

This is what I call a "Classic format" Sunday Puzzle: Take thing from Class A (singer/actress), apply transformation, yield thing from Class B (something dangerous to run into).

So here's what we'll need to solve it:

A: list of singer/actresses

This is an open class, so we'll need robust coverage
We could try finding an existing list online, or cobbling one together from various lists and listicles
We could ask a chatbot LLM like ChatGPT/Gemini/etc. to give us a list

B: list of things that are dangerous to run into

Also open class
A little vague. Is it inherently dangerous, or only dangerous if you run into it?

e.g., knife vs tree

Is there wordplay here? (probably not)

e.g., ex-spouse

We don't actually need a list here a priori. We can iterate through list A and evaluate the resulting candidates for "something dangerous to run into".

We simply need an LLM that can provide us with perplexity scores for sentences
We use a sentence template:

"Stay alert because running into [BLANK] can be very dangerous."

We plug the string derived from combining the two syllables from the singer/actress into the blank, and get a score representing how (im)probable the resulting sentence is (i.e., its perplexity).
Ideally, the correct solution will be among those ranked as most probable here.

Functions:

Syllabification:

We need the ability to divide each string into syllables. This is for counting and for forming the "something dangerous" string.
This seems simple but it can be messy...
We're dealing with names, which can have highly variable orthography and phonology, and pronunciation dictionaries are certain to be missing lots of less frequent names.
It's not always easy to determine syllables based on orthography. A sequence of vowels could be a single syllable (a diphthong as in train) or multiple syllables (a hiatus as in trivia)
I'm going to rely on the CMU Pronouncing Dictionary, which we can call from within NLTK in Python. Because each vowel phoneme (and therefore each syllable) is marked with 0, 1 or 2 for stress, we can simply count the digits in the pronunciation. For example, the pronunciation for trivia is given as ['T', 'R', 'IH1', 'V', 'IY0', 'AH0'], which contains three digits and therefore three syllables.
For strings that are out of vocabulary (OOV), I'll back off to a rule-based approach that tries to count syllables. I plan to use the syllapy library, as described here.
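Here is a rough sketch of that lookup-then-back-off logic. The function names are hypothetical, and the fallback below is a deliberately crude vowel-run counter standing in for syllapy, not syllapy itself. The toy dictionary mimics the shape I'd expect from NLTK's cmudict.dict(): a word mapped to a list of pronunciations.

```python
def syllables_from_phonemes(phonemes):
    """Count ARPAbet vowels, which are the phonemes ending in a stress digit."""
    return sum(1 for p in phonemes if p[-1].isdigit())


def naive_syllable_count(word):
    """Crude stand-in fallback: count runs of vowel letters (at least 1)."""
    count = 0
    prev_vowel = False
    for ch in word.lower():
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)


def count_syllables(word, pron_dict, fallback=naive_syllable_count):
    """Use the first dictionary pronunciation if present, else the fallback."""
    prons = pron_dict.get(word.lower())
    if prons:
        return syllables_from_phonemes(prons[0])
    return fallback(word)


# Toy dictionary with the one entry quoted above. With NLTK installed and the
# corpus downloaded, this could instead be:
#   from nltk.corpus import cmudict; pron_dict = cmudict.dict()
toy_dict = {"trivia": [["T", "R", "IH1", "V", "IY0", "AH0"]]}
print(count_syllables("trivia", toy_dict))      # dictionary lookup
print(count_syllables("Streisand", toy_dict))   # out of vocabulary, uses fallback
```

Note that the crude fallback counts "trivia" as two syllables because it treats "ia" as one vowel run, which is the hiatus problem described above and the reason the dictionary gets first priority.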

String transformation:

We also need a simple function to recombine the syllables as described: second syllable of the last name followed by the first syllable of the first name.
We'll handle this in Python as well.
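As a minimal sketch, the recombination step might look like this. The function name is hypothetical, and the syllable splits in the example are entered by hand; producing those splits automatically from the spelling is the messy part and is not shown here. I'm also assuming we enforce the puzzle's two-syllable requirement at this step.

```python
def recombine(first_name_syllables, last_name_syllables):
    """Second syllable of the last name + first syllable of the first name.

    Returns None unless both names have exactly two syllables, as the
    puzzle requires.
    """
    if len(first_name_syllables) != 2 or len(last_name_syllables) != 2:
        return None
    return last_name_syllables[1] + first_name_syllables[0]


# Hand-split syllables (an assumption for illustration).
print(recombine(["dol", "ly"], ["par", "ton"]))
```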

Perplexity scoring:

As mentioned above, we'll need an LLM that can give us perplexity scores for sentences. I'm going to use a Hugging Face Transformers implementation of a GPT model. (But stay tuned: I'm building implementations of several other LLMs and we'll be evaluating those in future puzzling!)
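To make the scoring step concrete, here is a hedged sketch rather than my actual implementation. The function names are hypothetical, and the "gpt2" checkpoint is an assumption; any causal language model from the Transformers library should slot in the same way. The ranking function takes the scorer as an argument, so it can be tried out with a dummy scorer before downloading a model.

```python
import math

TEMPLATE = "Stay alert because running into {} can be very dangerous."


def fill_template(candidate, template=TEMPLATE):
    """Plug a candidate string into the sentence template."""
    return template.format(candidate)


def rank_candidates(candidates, score_fn, template=TEMPLATE):
    """Score each filled-in sentence and sort from lowest score to highest."""
    scored = [(score_fn(fill_template(c, template)), c) for c in candidates]
    return sorted(scored)


def make_gpt2_scorer(model_name="gpt2"):
    """Build a perplexity scorer; requires torch and transformers."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def score(sentence):
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            # Passing the inputs as labels yields the mean token cross-entropy.
            out = model(enc["input_ids"], labels=enc["input_ids"])
        return math.exp(out.loss.item())   # perplexity = exp(mean loss)

    return score


# Usage (downloads the model on first run):
# scorer = make_gpt2_scorer()
# for score, candidate in rank_candidates(["a sandbar", "a tondol"], scorer):
#     print(f"{score:.3f}\t{candidate}")
```

Lower perplexity means the model finds the sentence less surprising, so the most plausible "dangerous thing" should sort toward the top of the list.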

That more or less covers my approach. Would you handle this differently? Do you think an LLM could solve this puzzle outright? (So far, the ones I've tried have not managed a solution.)

Please leave your comments and insights (but no spoilers please). I'll be back after NPR's Thursday deadline for submissions to share my solution and my implementation. Good luck!

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!

4.247	Barbra Streisand	Stay alert because running into a sandbar can be very dangerous.
4.943	Linda Ronstadt	Stay alert because running into a stadtlin can be very dangerous.
5.071	Dolly Parton	Stay alert because running into a tondol can be very dangerous.
5.116	Alanis Morissette	Stay alert because running into a setteala can be very dangerous.
