It's Monday, Puzzlers, and you know what that means! Welcome to Natural Language Puzzling, the blog where we use natural language processing (NLP), linguistics, data science, language modeling, programming and our own noggins to solve the weekly Sunday Puzzle from NPR. Here's this week's puzzle:
This week's challenge comes from listener Dan Asimov, of Berkeley, Calif. In English the two-letter combination TH can be pronounced in two different ways: once as in the word "booth," the other as in "smooth." What is the only common English word, other than "smooth," that ends in the letters TH as pronounced in smooth?
Well, this is certainly a departure from the usual Sunday Puzzle format, which involves taking one string of text, applying a prescribed transformation and generating another string of text.
So what do we need to solve this puzzle?
Really only one thing: a robust pronunciation dictionary. For that, we can turn to an old friend: the Carnegie Mellon University (CMU) Pronouncing Dictionary. You can read more about that on the project website, or simply download the file from GitHub.
Now that we have the dictionary, we can navigate to it from the command line. From that location, we can query the pronunciations for "booth" and "smooth".
This shows us how the phonemes in question are represented in the CMU Pronouncing Dictionary, which uses ARPAbet symbols. We can see that "smooth" ends with the symbol "DH", so our target word must also end with the symbol "DH".
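The same lookup can be sketched in Python. This is only an illustration: the two entries are typed out by hand in the dictionary's "word PHONEME PHONEME ..." line format, and the helper names `parse_entry` and `final_phoneme` are made up for this example.

```python
# Two entries written by hand in the dictionary's line format
# (an assumption: check them against your own copy of the file).
SAMPLE_LINES = [
    "booth B UW1 TH",
    "smooth S M UW1 DH",
]


def parse_entry(line):
    """Split one dictionary line into (word, list of ARPAbet symbols)."""
    word, *phonemes = line.split()
    return word, phonemes


def final_phoneme(line):
    """Return the last ARPAbet symbol of a dictionary entry."""
    _, phonemes = parse_entry(line)
    return phonemes[-1]


for line in SAMPLE_LINES:
    word, _ = parse_entry(line)
    print(word, "ends in", final_phoneme(line))
```

The two words share everything up to the final symbol, so the final symbol is the only thing we need to compare.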
The next step is simply to query the dictionary file for any pronunciations that end in "DH". For reference, here's what the body of the file looks like:
carousel K EH1 R AH0 S EH2 L
carousing K ER0 AW1 Z IH0 NG
carow K AE1 R OW0
carozza K ER0 AA1 Z AH0
carp K AA1 R P
carpal K AA1 R P AH0 L
We can use grep plus the "$" anchor to ensure that we find "DH" at the end of a line:
I truncated the output so as not to spoil the solution. As we can see, the results shown all end in "-the", whereas the target word should end in "-th". In fact, there's only one word (other than "smooth") in the output of the grep query that fits the bill, so that must be the solution. I'll be back after NPR's Thursday deadline for submissions to share that solution. Good luck, Puzzlers!
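If you'd rather not use grep, the same filter can be sketched in Python. This is a rough equivalent under a few assumptions: the sample lines are typed out by hand (and deliberately leave out the answer), the function name `words_ending_in_dh` is made up, and the handling of "(2)"-style variant markers and trailing "#" comments reflects how some copies of the file are laid out.

```python
import re


def words_ending_in_dh(lines, spelling_suffix="th"):
    """Return words pronounced with a final DH and spelled with the given suffix."""
    matches = []
    for line in lines:
        # Assumption: some copies of the file carry trailing "# comment" notes.
        entry = line.split("#")[0].split()
        if len(entry) < 2:
            continue  # skip blank or malformed lines
        word, phonemes = entry[0], entry[1:]
        # Assumption: alternate pronunciations are marked like "word(2)".
        word = re.sub(r"\(\d+\)$", "", word)
        if phonemes[-1] == "DH" and word.endswith(spelling_suffix):
            if word not in matches:
                matches.append(word)
    return matches


# Hand-typed sample in the dictionary's format; it omits the solution on purpose.
SAMPLE = [
    "bathe B EY1 DH",
    "booth B UW1 TH",
    "breathe B R IY1 DH",
    "carp K AA1 R P",
    "smooth S M UW1 DH",
]
print(words_ending_in_dh(SAMPLE))  # ['smooth']
```

To run it on the real dictionary, pass an open file handle instead of `SAMPLE`. Checking the spelling as well as the final phoneme is what removes the "-the" words from the results.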
The deadline for NPR submissions has passed, so click the spoiler button below to see my solution.
The CMU Pronouncing Dictionary gives these four pronunciations for "with". Two of them fit the bill:
Welcome back, Puzzlers! It's Monday Funday at Natural Language Puzzling, the blog where we use natural language processing, computational linguistics and data science to solve the weekly Sunday Puzzle from NPR. Here is this week's puzzle:
This week's challenge comes from listener Al Gori, of Cozy Lake, N.J. Take the name JON STEWART, as in the comedian and TV host. Rearrange the letters to spell the titles of three classic movies. One of the titles is its familiar shortened form.
(We don't know for certain that the movies in the solution are Oscar winners or nominees, but it's a fairly safe bet and a good place to start.)
fx(M): We need a function that can generate combinations of three movie titles from our list, such that:
the total length of all 3 movie titles is 10 letters (len('jonstewart'))
we can use this as a first filter
then we check that the letters in the combined movie titles are the same as the letters in 'jonstewart'
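The steps above can be sketched roughly as follows. This is not the script linked in the solution, just a minimal illustration: the function names are made up, and the titles and target in the usage example are a toy case chosen so as not to give away the answer. With thousands of titles, trying every triple is slow, so the sketch first discards any title that uses a letter the target doesn't have.

```python
from collections import Counter
from itertools import combinations


def letters(text):
    """Lowercase a string and keep only its letters."""
    return [ch for ch in text.lower() if ch.isalpha()]


def find_title_triples(titles, target):
    """Return every 3-title combination whose letters rearrange to the target."""
    target_count = Counter(letters(target))
    target_len = sum(target_count.values())

    # Pre-filter: keep only titles whose letters all fit inside the target.
    candidates = []
    for title in titles:
        count = Counter(letters(title))
        if count and not (count - target_count):
            candidates.append((title, count))

    hits = []
    for combo in combinations(candidates, 3):
        # First filter: the combined length must match the target length.
        if sum(sum(count.values()) for _, count in combo) != target_len:
            continue
        # Second filter: the combined letters must match exactly.
        combined = combo[0][1] + combo[1][1] + combo[2][1]
        if combined == target_count:
            hits.append(tuple(title for title, _ in combo))
    return hits


# Toy example (not the puzzle): three short titles that anagram to "burp gain".
print(find_title_triples(["Up", "Big", "Ran", "Heat"], "burp gain"))
# [('Up', 'Big', 'Ran')]
```

Stripping punctuation and spaces in `letters` matters here, because a title in its shortened form may be written with periods or other marks.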
Clearly, this puzzle is on the easier side, with minimal NLP required. The critical piece here is the starting list of movies. I have my solution, and I'll be back after the Thursday NPR deadline to share it along with my script. In the meantime, if you'd like to try this yourself, you can start from my list of about 3700 Oscar winners and nominees.
Update & Solution
The deadline for submissions has passed, so click the spoiler button below if you want to see my answer. Click here for my Python script on GitHub. See you next week!
Greetings, Puzzlers, and welcome back to Natural Language Puzzling, the blog where we use natural language processing, linguistics and data science to solve the Sunday Puzzle from NPR!
This week's challenge comes from listener Dennis Burnside, of Lincoln, Neb. Think of a famous singer and actress, first and last names, two syllables each. The second syllable of the last name followed by the first syllable of the first name spell something that can be dangerous to run into. What is it?
Let's break this down and determine how we'll approach solving the puzzle.
This is what I call a "Classic format" Sunday Puzzle: Take thing from Class A (singer/actress), apply transformation, yield thing from Class B (something dangerous to run into).
So here's what we'll need to solve it:
A: list of singer/actresses
This is an open class, so we'll need robust coverage
We could try finding an existing list online, or cobbling one together from various lists and listicles
We could ask a chatbot LLM like ChatGPT/Gemini/etc. to give us a list
B: list of things that are dangerous to run into
Also open class
A little vague. Is it inherently dangerous, or only dangerous if you run into it?
e.g., knife vs tree
Is there wordplay here? (probably not)
e.g., ex-spouse
We don't actually need a list here a priori. We can iterate through list A and evaluate the resulting candidates for "something dangerous to run into".
We simply need an LLM that can provide us with perplexity scores for sentences
We use a sentence template:
"Stay alert because running into [BLANK] can be very dangerous."
We plug the string derived from combining the two syllables from the singer/actress into the blank, and get a score representing how (im)probable the resulting sentence is (i.e., its perplexity).
Ideally, the correct solution will be among those ranked as most probable here.
Functions:
Syllabification:
We need the ability to divide each string into syllables. This is for counting and for forming the "something dangerous" string.
This seems simple but it can be messy...
We're dealing with names, which can have highly variable orthography and phonology, and pronunciation dictionaries are certain to be missing lots of less frequent names.
It's not always easy to determine syllables based on orthography. A sequence of vowels could be a single syllable (a diphthong as in "train") or multiple syllables (a hiatus as in "trivia").
I'm going to rely on the CMU Pronouncing Dictionary, which we can call from within NLTK in Python. Each vowel phoneme is marked with a stress digit (0, 1 or 2), and each syllable contains exactly one vowel phoneme, so we can count syllables by counting the phonemes that carry a digit. For example, the pronunciation for trivia is given as ['T', 'R', 'IH1', 'V', 'IY0', 'AH0'], which has three such phonemes and therefore three syllables.
For strings that are out of vocabulary (OOV), I'll back off to a rule-based approach that tries to count syllables. I plan to use the syllapy library, as described here.
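The digit-counting trick can be sketched in a few lines. The function name `count_syllables` is made up for this example, and the pronunciation list is the one quoted above; with NLTK, lists like it come from `nltk.corpus.cmudict.dict()` once the corpus has been downloaded. The rule-based fallback for OOV names is not shown.

```python
def count_syllables(pronunciation):
    """Count syllables as the number of phonemes ending in a stress digit."""
    return sum(1 for phoneme in pronunciation if phoneme[-1].isdigit())


# Pronunciation of "trivia" as quoted above.
trivia = ['T', 'R', 'IH1', 'V', 'IY0', 'AH0']
print(count_syllables(trivia))  # 3
```

A name passes the two-syllable test when this count is 2 for both the first and the last name.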
String transformation:
We also need a simple function to recombine the syllables as described: second syllable of the last name followed by the first syllable of the first name.
We'll handle this in Python as well.
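A minimal sketch of that recombination, assuming each name has already been split into a list of orthographic syllables. The function name `recombine` and the hand-made syllable splits in the usage line are assumptions for illustration.

```python
def recombine(first_name_syllables, last_name_syllables):
    """Join the last name's second syllable to the first name's first syllable."""
    if len(first_name_syllables) != 2 or len(last_name_syllables) != 2:
        raise ValueError("both names must have exactly two syllables")
    return (last_name_syllables[1] + first_name_syllables[0]).lower()


# Hand-split syllables, for illustration only.
print(recombine(["Lin", "da"], ["Ron", "stadt"]))  # stadtlin
```

Raising an error for names that aren't two syllables each keeps bad candidates from slipping through to the scoring step.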
Perplexity scoring:
As mentioned above, we'll need an LLM that can give us perplexity scores for sentences. I'm going to use a Hugging Face transformers implementation of a GPT model. (But stay tuned: I'm building implementations of several other LLMs and we'll be evaluating those in future puzzling!)
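To make the scoring step concrete, here's a minimal sketch of the arithmetic around the model, not the model call itself. It assumes we already have a natural-log probability for each token of a sentence; how those are obtained depends on the model, and with a causal language model the exponential of the mean token loss is one common way to reach the same quantity. The names `fill_template`, `perplexity` and `rank_candidates` are made up for this example.

```python
import math

TEMPLATE = "Stay alert because running into {} can be very dangerous."


def fill_template(candidate):
    """Plug a candidate string into the sentence template."""
    return TEMPLATE.format(candidate)


def perplexity(token_log_probs):
    """Perplexity: exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def rank_candidates(scored_sentences):
    """Sort (sentence, score) pairs so the most probable sentence comes first."""
    return sorted(scored_sentences, key=lambda pair: pair[1])


print(fill_template("a stadtlin"))
# Sanity check: if every token has probability 1/4, the perplexity is 4.
print(round(perplexity([math.log(0.25)] * 5), 6))  # 4.0
```

Lower perplexity means the model finds the sentence less surprising, which is why the ranking sorts in ascending order.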
That more or less covers my approach. Would you handle this differently? Do you think an LLM could solve this puzzle outright? (So far, the ones I've tried have not managed a solution.)
Please leave your comments and insights (but no spoilers please). I'll be back after NPR's Thursday deadline for submissions to share my solution and my implementation. Good luck!
Update & Solution
The deadline for submissions has passed, so click below if you want to see my answer. See you next week!
Score | Name | Sentence
4.247 | Barbra Streisand | Stay alert because running into a sandbar can be very dangerous.
4.943 | Linda Ronstadt | Stay alert because running into a stadtlin can be very dangerous.
5.071 | Dolly Parton | Stay alert because running into a tondol can be very dangerous.
5.116 | Alanis Morissette | Stay alert because running into a setteala can be very dangerous.
I confess that I cheated here; I tried several syllabification/hyphenation tools, and none of them could return ['strei', 'sand'] for 'Streisand'. They would not recognize the string and thus return 'Streisand' unchanged.
I also tried using a grapheme-to-phoneme tool (https://github.com/Kyubyong/g2p) as a back-off for out-of-vocabulary (OOV) items that are not in the CMU dictionary (which most syllabification tools rely on). However, this tool and others like it return a phonemic transcription of the syllables, not an orthographic one. So it could return ['STRAY1', 'ZAH0ND'], but not ['strei', 'sand'].
In other words, for OOVs, there's no tool to map from a g2p-generated phonemic transcription of the syllables back to the original spelling, split into syllables. I suspect this won't be the last time we'll need to do this for the Sunday Puzzle, so I'm currently working on such a tool and I'll share it when it's ready.