Monday, December 02, 2024

Classic TV actor, sleepwear

Happy Monday funday, Puzzlers! Welcome back to Natural Language Puzzling, the blog where we use natural language processing, linguistics and data science to solve the weekly Sunday Puzzle from NPR. Let's take a look at this week's puzzle:

This week's challenge comes from the crossword constructor and editor Peter Gordon. Think of a classic television actor — first and last names. Add a long-E sound at the end of each name and you'll get two things that are worn while sleeping. What are they?

 Let's break this down into the tools and resources that we'll need:

  • A: a list of classic television actors
  • P: a pronunciation dictionary
    • I'll be using this python implementation: https://github.com/Kyubyong/g2p
    • It queries the CMU Pronouncing Dictionary and uses a grapheme to phoneme model to predict pronunciations for out of vocabulary (OOV) items
    • Unfortunately, this implementation lets us query a word to get its pronunciation but not the reverse; i.e., it's g2p but not p2g. We'll also need to query a pronunciation to get an orthographic word, because we need to add "long-E" (/i/) to the end of the pronunciation to see if it results in a word that is a "thing worn while sleeping". So in this case we'll need to load the CMU dictionary file and work with it ourselves.
    • **See my note below
  • S: list of things worn while sleeping
    • It seems like this should be a fairly short list, especially given that we're only looking for items that end in /i/ ("long E").
    • I'm not so confident that I could brainstorm a comprehensive list, so we'll use an LLM approach here instead.
  • LLM: We'll use the python transformers implementation of GPT2 to score sleepwear candidates.
The algorithm is this, roughly:
  • We're searching for the actor name that gives us two items of sleepwear after we add /i/ to the end of the first and last names.
  • So we take each actor, get the first name and last name, convert each to pronunciation, and add /i/ to the pronunciation:
    • Bob Hope --> B AA B . HH OW P .
    • B AA B . HH OW P . --> B AA B IY . HH OW P IY .
  • Next, we use the pronunciation dictionary to try to map each pronunciation back to a word:
    • B AA B IY . --> [ 'bobby', 'bobbie' ]
    • HH OW P IY . --> [ 'hopi' ]
  • Then we use an LLM to evaluate the probability of these sleepwear candidates in the context of a relevant sentence:
    • "I usually sleep in my pajamas or my __1__, but last night I slept in my __2__."
    • "I usually sleep in my pajamas or my bobby, but last night I slept in my hopi."
  • We sort all these candidate sentences by their LLM perplexity scores, and we expect to find the correct solution among those with the lowest perplexity. Our "bobby" and "hopi" pairing would likely get a high perplexity score, as the sentence doesn't really make sense.
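The reverse-lookup (p2g) step can be sketched in a few lines of python. This is a minimal illustration, not my full script: the tiny `CMU_SUBSET` dict below is a hardcoded stand-in for the real CMU Pronouncing Dictionary file, and the LLM scoring step is omitted.

```python
# Stand-in for the CMU Pronouncing Dictionary (word -> ARPABET pronunciation).
# The real file has ~135,000 entries; these few are hardcoded for illustration.
CMU_SUBSET = {
    "bobby":  "B AA B IY",
    "bobbie": "B AA B IY",
    "hopi":   "HH OW P IY",
}

# Invert the dictionary so we can go from pronunciation back to spellings (p2g).
pron_to_words = {}
for word, pron in CMU_SUBSET.items():
    pron_to_words.setdefault(pron, []).append(word)

def long_e_candidates(name_pron):
    """Append long-E (ARPABET 'IY') to a name's pronunciation and
    return any dictionary words with that pronunciation."""
    return pron_to_words.get(name_pron + " IY", [])

# Bob Hope: B AA B -> bobby/bobbie, HH OW P -> hopi
print(long_e_candidates("B AA B"))   # ['bobby', 'bobbie']
print(long_e_candidates("HH OW P"))  # ['hopi']
```

Each (first, last) candidate pair then gets slotted into the sentence template and scored by GPT2.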

**Oddly enough, one of the sleepwear items is not present in the CMU Pronouncing Dictionary. This is odd because there are just over 135,000 unique word types there, and the word we are hunting is not particularly rare. In other words, I figured out the solution manually as I was attempting to solve this one, and when I checked the CMU dict, I found the word missing. I added the word and pronunciation to my own copy of the dictionary in order to make this approach work. This has me thinking that I should start maintaining my own addendum to the CMU Pronouncing Dictionary that I could share on my GitHub.

Anyway, here's what some of my output looks like, omitting the correct solution of course. I'm pleased to say that the correct solution has the lowest perplexity of all candidate solutions. I'll be back after the Thursday NPR deadline for submissions to share that solution with you. Good luck!

3.404    Rose Byrne    rosie    bernie    I usually sleep in ...
3.655    Jack Black    jackie    blackie    I usually sleep in ...
3.741    Paul Rudd    pauli     ruddy    I usually sleep in ...
3.893    Will Smith    willie     smithee    I usually sleep in ...

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Sunday, November 24, 2024

State Capital, TV character, TV game show host

Happy Puzzle Day, friends! Let's take a crack at today's Sunday Puzzle from NPR:

This week's challenge comes from listener Greg VanMechelen, of Berkeley, Calif. Name a state capital. Inside it in consecutive letters is the first name of a popular TV character of the past. Remove that name, and the remaining letters in order will spell the first name of a popular TV game show host of the past. What is the capital and what are the names?

I'll break this puzzle down as usual. Here's what we need to solve it:

  • C: a list of state capitals
    • easy peasy--we love starting with a closed class (i.e., there is a fixed list and we don't need to worry about whether our list contains all the right candidates)
  • get_substrings(): We need some kind of a (python) function to iterate through each state capital string and apply the puzzle transformation
    • In this case this means extracting a substring, squishing the remaining letters together into a string, and returning the substring and the remaining string.
    • I suggest we do this in a way that allows each substring to be between 2 and 6 letters long. This is a somewhat arbitrary decision, but my intuition tells me that we're unlikely to find a sequence of 7 letters embedded in a state capital that happen to spell out a person's name and still allow the rest of the letters to spell another name. We can of course modify this limit as needed.
  • LM: We need a (large) language model to evaluate candidates.
    • I'm using a pretrained huggingface transformers library implementation of GPT2.
    • I'll iterate through the candidates, inserting them into a sentence template and getting a perplexity score from the LLM.
    • For example, we could imagine: Madison --> di, mason
    • And we use them in the sentence template: "Di was a popular TV character and Mason was a popular TV game show host."
      • Note that capitalization matters here. Without capitalizing the names, the correct solution is ranked quite poorly, but with capitalization it emerges among the top 8 candidates.
    • We return a list of these sentences ranked by perplexity scores, and theoretically, the correct solution should be among those with the lowest perplexity.
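Here's a minimal sketch of what get_substrings() might look like. The function name and the 2-to-6-letter bounds come from the description above; the implementation details are my own:

```python
def get_substrings(capital, min_len=2, max_len=6):
    """Yield (inner, remainder) pairs: every consecutive substring of
    min_len..max_len letters, plus the remaining letters squished together."""
    s = capital.lower()
    for length in range(min_len, max_len + 1):
        for start in range(len(s) - length + 1):
            inner = s[start:start + length]
            remainder = s[:start] + s[start + length:]
            yield inner, remainder

# Madison -> ('di', 'mason') appears among the candidates
print(("di", "mason") in list(get_substrings("Madison")))  # True
```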
As I mentioned above, my approach did lead me to the solution. You can try my script here. Here's a look at some of the output (not including the solution).
  • (__1__ is a TV character and __2__ is a TV game show host)
  • 3.54 | Jackson | Son | Jack | Son is a TV character and Jack is ...
  • 3.69 | Annapolis | Anna | Polis | Anna is a TV character and Polis is ...
  • 3.71 | Honolulu | Hono | Lulu | Hono is a TV character and Lulu is ...
  • 3.80 | Salem | Le | Sam | Le is a TV character and Sam is ...
  • 3.82 | Helena | He | Lena | He is a TV character and Lena is ...
  • 3.86 | Littlerock | Ock | Littler | Ock is a TV character and Littler is ...
  • 3.87 | Hartford | Hart | Ford | Hart is a TV character and Ford is ...
Good luck! I'll be back after NPR's Thursday deadline for submissions to share my solution.

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Sunday, November 03, 2024

Experiment place and person

It's Sunday Funday here on Natural Language Puzzling, where we use natural language processing and linguistics to solve the Sunday Puzzle from NPR! Let's look at this week's puzzle:

This week's challenge comes from listener Mark Maxwell-Smith. Name a place where experiments are done (two words). Drop the last letter of each word. The remaining letters, reading from left to right, will name someone famously associated with experiments. Who is it?

Classic Sunday Puzzle, right? Take thing from Class A, apply a string transformation, get thing from Class B.

Here's how we'll solve the puzzle:

  • P: (list) places where experiments are done (two words)
    • We need this list to iterate through in search of the solution
    • We can ask any available LLM to generate a list to get us started
      • I started with such a list, then broke it into lists of first words and second words, which my python script recombines to get a longer list, so:
        • [science lab, physics room] --> 
        • [science lab, science room, physics lab, physics room]
    • This is an open class, but the clue is so restrictive I think we can cover all the possibilities this way
  • S: (list) someone famously associated with experiments
    • Again, we can ask an LLM to provide us a list, but we can't be certain we'll cover the solution because this is a fairly vague clue and definitely a very open class
    • Alternatively, we can simply iterate through P, apply the transformation, and plug the new string into an LLM to get a score
  • transform(str): (function) We need to define a function that can take in a string like "science laboratory" and return the transformed string: "Scienclaborator"
    • Note that we'll capitalize the string as we're expecting a name
  • LLM: We'll use the python transformers library and its pretrained GPT2 model
    • We'll start with a sentence frame like this:
      • '[BLANK] was a famous scientist known for experiments.'
    • We insert each transformed string into the sentence frame, pass it to the LLM, and return a perplexity score
    • We rank these candidates by perplexity
      • The solution should be among the candidates with the lowest perplexity score.
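The transform(str) function described above is simple enough to sketch directly. A minimal version (the real script presumably also handles punctuation and extra whitespace):

```python
def transform(phrase):
    """Drop the last letter of each word, join left to right, and
    capitalize the result (since we're expecting a name)."""
    return "".join(w[:-1] for w in phrase.split()).capitalize()

print(transform("science laboratory"))  # Scienclaborator
```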

And that's how I solved it! If you have alternative approaches, drop a note in the comments. I'll be back right here after NPR's Thursday deadline to share my solution. Good luck!

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. Click here for the python script on GitHub. See you next week!


Monday, October 28, 2024

Place name and animal behavior

It's Monday Funday at Natural Language Puzzling, the blog where we use natural language processing and linguistics to solve the weekly Sunday Puzzle from NPR! Let's take a look at this week's puzzle:

Name a place somewhere on the globe -- in two words. Rearrange the letters of the first word to name some animals. The second word in the place name is something those animals sometimes do. What is it?

This one immediately strikes me as a relatively difficult one. Let's look at tools and data we need to solve this one and get a better understanding of what makes it difficult. We'll need:

  • P: a list of two word place names anywhere on the globe
    • this is a very unrestricted class
    • we can return to the OSM Names file that we've used in recent puzzles
      • it has 21 million entries, so we'll narrow that down, keeping only the list of names (and dumping the rest of the tsv file)
      • we'll also filter it down to only two word place names, and only places written with Roman letters
      • this still leaves about 8 million place names; *we'll return to this later
  • anagram_checker(string): A function to take a string and return all the valid anagrams as a first step to rearranging the letters of the first word to "name some animals"
    • This will require an English lexicon for validating words; I'll be using a lexicon extracted from the Brown Corpus in the NLTK (much like I did in this recent puzzle)
  • LLM: We'll use an LLM to evaluate the likelihood that each valid anagram fits the "name some animals" clue.
    • I'll use the pretrained GPT2 model in the python transformers library
    • I'll plug each candidate into a sentence template like "The wildlife photographer published a set of photos of [BLANK]."
    • I'll keep the candidates that score below some (currently unestablished) threshold for perplexity.
  • For the most likely candidates, I'll use the anagram_checker on the second word of the place name to find valid English words
    • Then I'll use the LLM again to evaluate candidates in a sentence like this one: "[BLANK2] is something that [BLANK1] sometimes do."
    • Ideally, the correct solution should be among the candidates with the lowest perplexity scores.
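A sorted-letter "signature" makes the anagram check cheap: two strings are anagrams exactly when their sorted letters match. This sketch uses a tiny hardcoded stand-in lexicon; the real script draws its lexicon from the Brown Corpus via NLTK.

```python
from collections import defaultdict

# Tiny stand-in lexicon; the real one comes from the Brown Corpus via NLTK.
LEXICON = ["bears", "saber", "bares", "otter"]

# Index words by their sorted-letter signature for O(1) anagram lookup.
by_signature = defaultdict(list)
for w in LEXICON:
    by_signature["".join(sorted(w))].append(w)

def anagram_checker(word):
    """Return all lexicon words that are anagrams of `word`, excluding itself."""
    sig = "".join(sorted(word.lower()))
    return [w for w in by_signature[sig] if w != word.lower()]

print(anagram_checker("bares"))  # ['bears', 'saber']
```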
Regarding the list of 8 million place names of two words: Naturally, many of these are places most of us have never heard of. I'm certain that the solution will be a place name that is familiar to most listeners. So we need a way to sort the 8 million candidate place names by how well known they are. Again, I returned to the LLM and plugged the candidates into this template: "[BLANK] is one of the most famous places in the world." And I'm storing the resulting list of ranked candidates. When you consider the approach I'm taking above, it's very brute force and computationally heavy, so we want to start with a short list of place names. I'll likely try running my script with the first 1000 place names, and if the solution isn't found there, take the second 1000 place names, and so on.

Good luck this week! I'll be working on my approach and I'll be back to share my solution after the Thursday submission deadline from NPR (i.e., no spoilers here).

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Sunday, October 20, 2024

US city and state with 13 unique letters

Hello and welcome back to Natural Language Puzzling, the only blog on the web where we use natural language processing and linguistics to solve the Sunday Puzzle from NPR! Let's take a look at this week's puzzle:

This week's challenge comes from listener David Dickerson, of Tucson, Arizona. The city UTICA, NEW YORK, when spelled out, contains 12 letters, all of them different. Think of a well-known U.S. city, that when its name is spelled out, contains 13 letters, all of them different. Your answer doesn't have to match mine.

Immediately we can tell that this is going to be an easy one with minimal NLP or linguistics, and not a lot of tools required. Here's what we need:

We need a python script that can do a bit of cleanup on place names-- lowercase everything (so "a" and "A" are counted as one letter type and not two), remove punctuation, etc. Then we simply check if the original length of the word and the count of unique letters in the word are the same.
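The whole check fits in a few lines. A minimal sketch (the function name is mine):

```python
import string

def unique_letter_count(name):
    """Return the letter count if all letters in the cleaned-up name are
    different, else None. Lowercases and drops punctuation/spaces first."""
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    return len(letters) if len(letters) == len(set(letters)) else None

print(unique_letter_count("Utica, New York"))  # 12
```

Filtering a city list for `unique_letter_count(city) == 13` then yields the candidate solutions.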

My script is producing 38 solutions, although only one or two of them are "well-known". Good luck! I'll be back to share my solution after the Thursday NPR submission deadline.

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Monday, October 14, 2024

Athletes, beverage

Welcome back to Natural Language Puzzling, where we use natural language processing and linguistics to solve the Sunday Puzzle from NPR. Here's this week's puzzle:

This week's challenge comes from listener Mike Selinker, of Renton, Wash. Think of something to drink whose name is a compound word. Delete the first letter of the first part and you'll get some athletes. Delete the first letter of the second part and you'll get where these athletes compete. What words are these?

 This one is not terribly different from last week's puzzle. Here's how I plan to approach this problem:

  • B: We need a sizable list of beverages
    • I've pieced together a list of about 200 beverages including cocktails, coffee and tea drinks, etc.
    • We know the solution should be a single string and it should be a compound word (e.g., "houseboat", "sunshine", etc.)
      • After we drop a letter from each part of the compound word, we'll still have a word for athletes and a word for a place where they compete
      • These two strings probably need to be a minimum of 3 letters each
      • So we should start with a beverage word that is at least 8 letters long
  • compound word function: Ideally, we need a function or a tool that can tell us whether a string is a compound word, and if so, what its component words are
    • I haven't found a readily available solution for this, so I effectively had to create my own
  • splitting function: To create our own compound word checker, we'll need to split each beverage word in two, such that each of the two strings contains a minimum of 3 letters; we can do this in python
  • Lexicon: With each beverage split into a pair or multiple pairs, we need to check whether each half of the split is a real English word.
    • There are many ways to approach this and numerous wordlists we could choose from
    • I chose to use NLTK to extract the 6000 most frequent words from the Brown Corpus
    • We can find much larger lexicons, but they tend to be too permissive
  • LLM: Now that we know which pairs are valid words, we can plug these words into a sentence template, pass that sentence to an LLM to get a perplexity score, and rank the candidates according to their scores; the correct solution should be among the sentences with the lowest perplexity.
    • My sentence template:
      • We went to the __1__ to see the __0__ compete in the tournament.
    • I'm using GPT2 as implemented in the transformers python library.
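The splitting function and lexicon check together form the compound-word checker. A minimal sketch, with a tiny hardcoded stand-in for the Brown Corpus lexicon described above:

```python
# Tiny stand-in lexicon; the real script uses the 6000 most frequent
# words from the Brown Corpus via NLTK.
LEXICON = {"butter", "milk", "ginger", "ale", "root", "beer"}

def splits(word, min_part=3):
    """Yield every way to split `word` in two, each part >= min_part letters."""
    for i in range(min_part, len(word) - min_part + 1):
        yield word[:i], word[i:]

def compound_parts(word):
    """Return the splits where both halves are valid lexicon words."""
    return [(a, b) for a, b in splits(word) if a in LEXICON and b in LEXICON]

print(compound_parts("buttermilk"))  # [('butter', 'milk')]
```

From there, we drop the first letter of each half and send the surviving pairs to the LLM template.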

This approach worked for me. I'll be back to share my solution after NPR's Thursday submission deadline. In the meantime, feel free to examine my script on GitHub and drop your suggestions in the comments. Good luck, Puzzlers!

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Tuesday, October 08, 2024

Place name, palindrome, body part

Ahoy! Welcome back to Natural Language Puzzling, the weekly blog where we solve the NPR Sunday puzzle using natural language processing (NLP), language modeling and linguistics. Let's dive into this week's puzzle:

This week's challenge comes from listener Joe Krozel, of Creve Coeur, MO. Think of a place in America. Two words, 10 letters altogether. The first five letters read the same forward and backward. The last five letters spell something found in the body. What place is this?

Okay, so here's what I need to solve this problem:
  • P: a list of place names in America
    • Let's assume "America" means "USA" here
    • The longer the list, the better
    • We could do a web search for an existing list of place names
    • We could ask an LLM for a list
    • I'm going to use OpenStreetMap data (**details below)
  • is_palindrome(string): We need a function to check whether a string is the same forward and backward; we'll use python to take the first five letters and run this function.
  • is_found_in_body(string): We also need a function to determine whether the last five letters are "something found in the body".
    • We could start out with a list of body parts, fluids, etc., but "something found in the body" is worded kind of vaguely and I'm not confident we can easily assemble an exhaustive list
    • I'm going to use an LLM instead; I'll take the last five letters from each of our candidate place names, insert the string into a sentence template, pass this sentence to the LLM, get a perplexity score, then sort these candidate solutions from highest to lowest perplexity; the correct solution should be among the lowest perplexities. I'll use the python transformers library's implementation of GPT2 for this.
There's a little more to it, but that's the basic idea. We'll also need some functions to do a bit of string cleaning and manipulation and to filter our list to candidates that have exactly two words and 10 letters. I've done all this in a python script, which you can view here on the GitHub repo.
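The filtering step (two words, 10 letters, palindromic first five) can be sketched like this; "Madam Heart" is a made-up place name used purely to exercise the check:

```python
def is_candidate(place):
    """True if the place name has two words, 10 letters total, and its
    first five letters read the same forward and backward."""
    words = place.split()
    letters = "".join(c for c in place.lower() if c.isalpha())
    if len(words) != 2 or len(letters) != 10:
        return False
    return letters[:5] == letters[:5][::-1]

print(is_candidate("Madam Heart"))  # True (made-up example)
print(is_candidate("San Antonio"))  # False ('sanan' is not a palindrome)
```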

**Regarding the OpenStreetMap data
OSMNames is a free, open source, downloadable database of over 21 million place names around the world. It downloads as a massive file with lots of columns that are totally extraneous for our purposes, so I use a few BASH commands to filter it down to what we need. To reproduce the file I'm using in my script, you can start with the downloaded file and run the following commands:

## unzip the file (if you're on a Mac/Linux machine, you can probably skip this and use "zcat" instead of "cat" on the next step)
gunzip planet-latest_geonames.tsv.gz

## get only place names for the USA
cat planet-latest_geonames.tsv | awk -F '\t' -v OFS='\t' 'NR == 1 || $16 == "us"' > us1.tsv

## keep only column 1 and column 14 (place name and state)
cut -f1,14 < us1.tsv > us2.tsv

## remove any duplicate entries
awk '!seen[$0]++' us2.tsv > usa-places.tsv

Now that we have an appropriately filtered list, our script can create a subset of candidate place names that are exactly two words and 10 letters. For my approach, I assumed this could be simply a two word place name that totals 10 letters, like San Antonio. But it could also be a combination of one-word place name and one-word state name, like Dayton Ohio. So my python script churns through the input file to find all such possibilities.

Next I iterate through this list, keeping only the candidates where the first five letters are palindromic.

Then I iterate through the remaining list, plugging each candidate into a couple of sentence templates for the LLM to evaluate. Here are the two sentence templates I've come up with:
  • In anatomy class, students learn about all the things inside the body, like bones and [BLANK].
  • Each anatomy lesson focuses on a new part of the body, like a bone or a [BLANK].
In the end, I get a long list of sentences ranked by perplexity, and indeed the sentence with the lowest perplexity score is the correct solution! Without giving you the solution, here's what some of that output looks like:

5.0561   Each anatomy ... like a bone or a ararc  mirimararc

5.0450   In anatomy ... like bones and aradr  caracaradr

5.0373   Each anatomy ... like a bone or a awaii  hanahawaii

5.0224   In anatomy ... like bones and acove  noconacove

4.9808   In anatomy ... like bones and nroad  aegeanroad

4.9779   Each anatomy ... like a bone or a uaway  hilihuaway

Good luck out there, puzzlers! I'll be back after the Thursday afternoon NPR deadline to share my solution. In the meantime, feel free to give my script a try.

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Wednesday, October 02, 2024

Breakfast cereal characters

Welcome back to Natural Language Puzzling, the only blog on the web that teaches you how to solve complex word puzzles using natural language processing (NLP), computational linguistics, corpus linguistics, language modeling and computer programming.

Let's take a shot at this week's Sunday Puzzle from NPR:

This week's challenge comes from listener Curtis Guy, of Buffalo, N.Y. Name a certain breakfast cereal character. Remove the third, fifth, and sixth letters and read the result backward. You'll get a word that describes this breakfast cereal character. What is it?

This seems like an easy one, right? You can probably come up with the solution in your head within a few minutes, like I did. But let's imagine we need to solve this programmatically. What do we need to do so?

  • C: a list of breakfast cereal characters
    • This is going to be a pretty short list, which makes this an easy puzzle
    • We can brainstorm one and/or do a little web searching
  • transform(c): a function to transform each breakfast cereal character (c) as prescribed in the puzzle (remove letters and reverse the string)
    • We'll use python to perform this
  • evaluate(w): a function to evaluate the likelihood of each transformed string as "a word that describes this breakfast cereal character"
    • lexicon: given the restrictive nature of this puzzle (i.e., there won't be a lot of candidates to consider), we can probably simply rely on a lexicon--any transformed string that isn't found in the English lexicon is rejected, so we only need to manually read through the remaining candidates to find the solution.
    • language model: As we've done in the past, we can load an LLM, slot the character and the transformed string into a sentence frame, and score the candidates according to the perplexity score provided by our model.
    • I'm going to use both a lexicon and an LLM. First, if the transformed string isn't in the lexicon, it's immediately rejected. Any transformed strings that are in the lexicon are then passed to the LLM as described above. I'll use the GPT2 model in the python transformers library.
If this approach works, we'll get a list of sentences ranked by perplexity, and we'll expect the correct solution to be among the sentences with the lowest perplexity. Good luck! I'll be back after the Thursday NPR deadline to share my solution. In the meantime, you can see my python script here. If you need help with your list of cereal characters, you can find mine hardcoded in that script.
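The transform(c) step can be sketched directly (letter positions are 1-indexed, as the puzzle counts them; this is a minimal illustration, not my full script):

```python
def transform(name):
    """Remove the 3rd, 5th, and 6th letters (1-indexed), then reverse
    the result. Spaces and punctuation are dropped before counting."""
    letters = [c for c in name.lower() if c.isalpha()]
    kept = [c for i, c in enumerate(letters, start=1) if i not in (3, 5, 6)]
    return "".join(reversed(kept))

print(transform("abcdefg"))  # gdba
```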

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Monday, September 23, 2024

NEW TOWELS anagram, brand name

It's Monday Funday on Natural Language Puzzling, the puzzle blog where we use natural language processing, corpus linguistics, computational linguistics, language modeling and other tools of the trade to take on the weekly Sunday Puzzle from Weekend Edition on NPR.

This week's challenge:  Take the phrase NEW TOWELS. Rearrange its nine letters to get the brand name of a product that you might buy at a supermarket.

What approach would you take here? 

Here's how I will set this problem up:

  • C: a list of candidate anagrams (strings)
    • We shouldn't assume that the solution will be a single word. We'll need to allow for multiword solutions, so long as they use the same 9 letters as NEW TOWELS.
    • We can use our own approach to generating anagrams, as we did in this puzzle from two weeks ago (script here);
    • I'm using a more efficient implementation that I found in this script on Github, credit to Raghav Gurung;
    • Ideally we'll set this up recursively, so that for each word we find that can be spelled with some subset of NEWTOWELS, we then look for additional spellable words within the remaining letters of NEWTOWELS.
    • In fact, I'll likely brute force it by simply rerunning the function 3 or 4 times on the remaining letters until none remain or no spellable words exist. If this works, we can approach this as an exercise in recursive programming and come back to reimplement--sounds like fun!
Then, we'll need one of these two:
  • B: a list of product brand names you might find at a supermarket
    • We might be able to find a decent list already on the web somewhere;
    • We could probably get something workable from an LLM if we run enough queries;
    • This is an open class, so we'll never have a complete list
OR:
  • LM: a language model (LM) that we can use to evaluate the probability of a sentence.
    • In fact, I will be using the GPT2 implementation in the Python Transformers library, as this is probably adequate and we can run it easily on a laptop.
      • I'll plug each candidate into a sentence template like this:
      • "I prefer to buy name brands like [BLANK] that can be found in supermarkets everywhere."
      • Then the sentence is passed to the LM, which returns a perplexity score
      • I'll sort the sentences by this score; the solution should be the one with the lowest or near lowest perplexity.
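The recursive version mentioned above might look like this minimal sketch. The tiny LEXICON is a stand-in for a real word list, and letter pools are tracked with collections.Counter:

```python
from collections import Counter

# Tiny stand-in lexicon; the real script would use a much larger word list.
LEXICON = ["towel", "news", "stew", "owl", "ten"]

def spellable(word, pool):
    """True if `word` can be spelled from the letters remaining in `pool`."""
    return not (Counter(word) - pool)

def find_anagrams(pool, partial=()):
    """Recursively yield word tuples that use every letter in `pool`."""
    if not +pool:                       # all letters consumed
        yield partial
        return
    for w in LEXICON:
        if spellable(w, pool):
            yield from find_anagrams(pool - Counter(w), partial + (w,))

# Deduplicate orderings: ('towel', 'news') and ('news', 'towel') are the same.
results = {frozenset(r) for r in find_anagrams(Counter("newtowels"))}
print(frozenset(("towel", "news")) in results)  # True
```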
That's the approach I've taken in my script, and I'll be back after NPR's Thursday deadline for submissions to share my solution. Good luck!

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Monday, September 16, 2024

State capital, world capital, expensive entrée

Welcome to my Natural Language Puzzling blog, where I use natural language processing (NLP), large language models (LLMs), and other linguistics tricks and tools of the trade to solve the Weekly Puzzle from Weekend Edition Sunday on NPR. Here's this week's puzzle:

This week's challenge comes from listener Rawson Scheinberg, of Northville, Mich. Name a U.S. state capital. Then name a world capital. Say these names one after the other and phonetically you'll get an expensive dinner entrée. What is it?

As usual, let's begin with a breakdown of this problem.

We're concerned with 3 classes here:

  • S: state capitals; this is a closed class of just 50 items, so we can easily obtain this list.

  • W: world capitals; also a closed list of ~200 items, easy to obtain.
  • X: expensive dinner entrees; this is an open list, so theoretically harder to obtain.
    • However, I suspect if we simply came up with the top 10 or 20 possibilities, we'd have the solution covered.
We also need some phonetic resources:
  • p: pronunciation dictionary function
    • Note: This assumes we have an adequate list for X (expensive dinner entrees)
    • We'll use p to obtain the phonetic spelling of each string in S, W, and X:
    • We've used CMUdict before by loading it directly as a text file and parsing it ourselves, but this time I'm experimenting with an existing python implementation. This includes a pretrained g2p (grapheme-to-phoneme) model; if a spelling isn't found in the dictionary, the model will predict the most likely way to pronounce that word.
  • m: a matching function
    • We need to be able to compare the entrée phonetic spelling with the combined state capital + world capital phonetic spelling
      • m('S T EY1 K', 'AO L B AH0 N IY0 B ER0 L IH1 N') returns 'False'
    • To avoid rejecting the correct solution, this will almost certainly need to allow some "fuzzy" matching
      • We'll definitely want to drop the stress markers (0, 1, 2).
      • We may even want to drop vowels, as vowel pronunciations can be highly context dependent, and when we combine the state capital and world capital strings, we're changing the context and the stress, so vowel changes are quite likely.
    • Ideally, we'd have a tool that can take two words, find their phonetic spellings and then return a phonetic similarity score using an edit distance or Levenshtein distance
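The fuzzy matcher m might look like this sketch, stripping the stress digits and optionally dropping vowels entirely (the vowel set is the standard ARPABET inventory used by CMUdict):

```python
import re

ARPABET_VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
                  "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

def strip_stress(pron):
    """Drop ARPABET stress digits: 'S T EY1 K' -> 'S T EY K'."""
    return re.sub(r"\d", "", pron)

def consonant_skeleton(pron):
    """Drop vowels for the fuzziest comparison."""
    return " ".join(p for p in strip_stress(pron).split()
                    if p not in ARPABET_VOWELS)

def m(entree_pron, combined_pron, fuzzy=True):
    """Match two pronunciations, ignoring stress (and vowels, if fuzzy)."""
    if fuzzy:
        return consonant_skeleton(entree_pron) == consonant_skeleton(combined_pron)
    return strip_stress(entree_pron) == strip_stress(combined_pron)

print(m("S T EY1 K", "AO L B AH0 N IY0 B ER0 L IH1 N"))  # False
```

A Levenshtein distance over the phoneme sequences, rather than exact equality, would be the natural next refinement.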
Alternatively, I'd also like to try another approach, just for fun. What if we couldn't obtain a reasonable list for X, the expensive dinner entrees? How else could we approach the problem?

I think we'd have to iterate through the combinations of state and world capitals, generating pronunciations for each, most of which would not be real words. Then we'd need a p2g, or phoneme to grapheme model, to take each string of phonemes and map them to the most likely orthographic spelling of real English text. This resulting spelling would be treated as a candidate for the "expensive dinner entrée". From there, we'd query a language model to judge the likelihood of the candidate; in the past, we've used python implementations of BERT/SBERT and GPT2 for this evaluation. In this case, we might ask the model to score this sentence with each candidate inserted: "The [blank] is the most expensive entrée on the menu, but very highly reviewed by food critics."

Let's see how far we get with the first approach. I'll be back after the Thursday deadline for NPR submissions (so we don't spoil it for listeners) to share my solution. In the meantime, you can see the approach I'm tinkering with in this script. Good luck!

Levi King

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!



Monday, September 09, 2024

Watercraft and body of water

Hi Puzzlers! Welcome back to another exciting installment of Natural Language Puzzling, where we use natural language processing to solve the NPR Sunday Puzzle. Let's look at this week's puzzle:

This week's challenge comes from listener Michael Schwartz, of Florence, Oregon. Take the name of a watercraft that contains an odd number of letters. Remove the middle letter and rearrange the remaining ones to name a body of water. What words are these?

It's a puzzle format as old as time: Take item from Class A (boat), apply transformation (remove letter and rearrange), return item from Class B (body of water). A potentially tricky part here is that these are both open classes---we'll never have a complete list of types of watercraft or bodies of water. New watercraft get invented, for example.

So how do we go about this? Here's what I need, how I'll get it and how I'll use it:

  • C: a list of watercraft
    • It's not clear whether this is a list of common nouns like raft and submarine, and/or specific craft, like Titanic
      • After years of following the Sunday Puzzle, I suspect it's the former, so I'll start with that
      • Between web search, Wikipedia, and asking ChatGPT, I cobbled together a list of 100+ different types of watercraft. I filtered this down to only those that are one word, because these puzzles tend to prefer that kind of simplicity. This left me with about 70 words, which I saved in my Python script
  • W: a list of bodies of water
    • Again, it's not clear if we want common nouns like lake and reservoir or proper nouns like Mediterranean Sea
      • I am working under the assumption we want the simple common nouns, because the puzzles tend to work that way unless they specify otherwise. Also, the proper nouns will typically be more than one word, which would make this a bit messy and the puzzles tend to be cleaner than that.
      • However, I'm not super confident about assembling this list, so I think instead of starting with a list, I'll use a language model to evaluate the candidate anagrams to see if they make sense in a sentence as a body of water
  • L: lexicon, a list of valid English spellings (see "script" section below)
  • LLM: Large language model
    • We'll use a large language model to evaluate the probability (or perplexity) of sentences where the candidate words are evaluated in context as a body of water.
    • I'll be using the GPT2 implementation in the Python transformers library, which is more than powerful enough for our needs here. I'll start with a sentence frame (or template):
      • As the largest body of water in the area, the _____ is a popular summer destination.
      • So for the watercraft steamer, we drop the middle letter to get stemer, and we find the anagrams meters, metres
      • I get scores from the LLM for the resulting sentences:
      • 3.91 As the largest body of water in the area, the metres is a popular summer destination.
      • 3.87 As the largest body of water in the area, the meters is a popular summer destination.
      • This is a perplexity score, so a lower score indicates a more likely sentence than a higher score. Of course, with only these two examples, we can't conclude much. I took these from a larger list (which I can't post here because spoilers), and in that context these scores could probably be taken to indicate that the words are real words and are not incoherent in the context, but an LLM wouldn't find them particularly likely.
      • For comparison, I tried this with carburetor and lake:
      • 3.67 As the largest body of water in the area, the carburetor is a popular summer destination.
      • 2.94 As the largest body of water in the area, the lake is a popular summer destination.
  • script:
    • We need a script that can iterate through our list of watercraft and apply some functions:
      • Keep only those with an odd number of letters
      • Remove the middle letter
      • Find all the anagrams
        • We'll simply get all the recombined permutations of the letters, then check them against an English lexicon (L)
      • Slot these candidate strings into the sentence frame and pass each sentence to the LLM for a score
      • Keep all the scores below a reasonable perplexity threshold
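The first three steps above (odd length, drop the middle letter, find real-word anagrams) can be sketched in a few lines of python. The watercraft list and lexicon below are tiny stand-ins for the real lists in my script.

```python
from itertools import permutations

def remove_middle(word: str) -> str:
    """Drop the middle letter of an odd-length word: 'steamer' -> 'stemer'."""
    mid = len(word) // 2
    return word[:mid] + word[mid + 1:]

def real_anagrams(letters: str, lexicon: set) -> set:
    """Every rearrangement of `letters` that is a real word in the lexicon."""
    return {"".join(p) for p in permutations(letters)} & lexicon

# Toy stand-ins for the real watercraft list (C) and English lexicon (L):
watercraft = ["steamer", "canoe", "submarine"]
lexicon = {"meters", "metres", "lake", "ocean"}

for craft in watercraft:
    if len(craft) % 2 == 1:  # keep only odd letter counts
        for candidate in real_anagrams(remove_middle(craft), lexicon):
            print(craft, "->", candidate)
```

Brute-force permutations are fine at these word lengths, but they grow factorially; for longer strings, comparing sorted-letter signatures against the lexicon scales much better.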


From there, if I've done this correctly and set an appropriate perplexity threshold, I should get a manageable number of scored sentences back. This will be a human-in-the-loop (HITL) approach. That's me, I'm the human. I'll then review the results and ideally, the correct solution will have a relatively low perplexity score and should be fairly easy to spot.
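For the scoring step itself, here is a rough sketch of how scores like the ones above can be computed with the GPT2 model in the transformers library. Values in the 3-4 range suggest the scores are per-token negative log-likelihoods (i.e., log of perplexity); either way, lower means more likely. The imports are deferred into the function because torch and transformers are heavy dependencies and the model is downloaded on first use.

```python
FRAME = "As the largest body of water in the area, the {} is a popular summer destination."

def fill_frame(candidate: str) -> str:
    """Slot a candidate body-of-water word into the sentence frame."""
    return FRAME.format(candidate)

def gpt2_score(sentence: str) -> float:
    """Per-token negative log-likelihood of the sentence under GPT-2;
    lower means more likely. Requires torch and transformers."""
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy
        # over tokens, i.e. the log of the sentence's perplexity.
        loss = model(ids, labels=ids).loss
    return loss.item()

# Example (requires the model download; reproduces the ranking shown above):
#   for word in ("meters", "metres", "lake"):
#       print(round(gpt2_score(fill_frame(word)), 2), fill_frame(word))
```

In a real run, the model and tokenizer should be loaded once outside the loop rather than per sentence.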

And in fact, that's exactly what I've done, and I got a solution that I'll be back to share with you after the Thursday deadline for submissions to NPR.

In the meantime, if you'd like to try my script or modify my approach, you can find it here. Make sure you have the torch and transformers libraries installed for Python, and you'll need to also download this lexicon file.

Friday, September 06, 2024

A TV Personality of the past and a creature of the past

Happy Friday, Puzzlers! I'm later in the week than usual, but I've got a puzzle breakdown and solution for you. Let's look at this week's NPR Sunday Puzzle:

Yes, it comes from listener Ethan Kane (ph) of Albuquerque, N.M. Name a famous TV personality of the past. Drop the second letter of this person's last name, and phonetically, the first and last names together will sound like a creature of the past. What celebrity is this?
So again, a famous TV personality of the past. Drop the second letter of this person's last name, and say it out loud. The first and last names together will sound like a creature of the past. What celebrity is it?

This is a familiar format-- take item from Class A, apply a transformation and return an item from Class B.

In this case, we have an extra layer of representation to deal with--the speech sounds of the two items in question. In other words, we're dealing with both orthography and phonology.

So here's what we need to solve this puzzle:

  • T: a list of candidate names for the TV personality of the past
  • C: a list of candidate "creatures of the past"
  • pronunciation dictionary: we'll have the spelling of the personalities and creatures, but we need this resource to get the phonetic spellings so we can compare them
  • script, which will:
    • load the data (list of TV personalities, list of creatures, pronunciation dictionary)
    • split personalities and creatures into separate strings that we can query in the pron dictionary;
      • "Bob Hope" --> ["bob", "hope"]
      • "Saber-toothed tiger" --> ["saber", "toothed", "tiger"]
      • "Dodo" --> ["dodo"]
    • query these separate strings for pronunciations:
      • "bob" --> "B AA1 B"
      • "hope" --> "HH OW1 P"
      • "saber" --> "S EY1 B ER0"
      • "toothed" --> "T UW1 TH T"
      • "tiger" --> "T AY1 G ER0"
    • rejoin the pronunciation strings and normalize (remove numeral accent markers and ensure correct spaces):
      • ["B AA1 B", "HH OW1 P"] --> "B AA B HH OW P"
      • ["S EY1 B ER0", "T UW1 TH T", "T AY1 G ER0"] --> "S EY B ER T UW TH T T AY G ER"
      • Note: we do this because stress can be quite variable anyway, and we probably want to relax the constraint here.
        • I'll start with this approach, and revisit if necessary. For example, I can see that perhaps we'd want to reduce the double /T T/ to a single /T/ in the phonetic spelling of saber-toothed tiger, as a contextual phonotactic rule would normally apply here in running speech anyway. So this is something to revisit if I'm striking out.
    • Iterate through the person pronunciations, then through the creature pronunciations, looking for a match and printing out any results.
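The normalize-and-rejoin step is the heart of the comparison, and it's small enough to sketch here. The pronunciations below are hard-coded just to show the matching; the real script pulls them from the CMU dictionary via g2p.

```python
import re

def normalize(prons) -> str:
    """Join per-word ARPABET strings and drop the stress digits:
    ['B AA1 B', 'HH OW1 P'] -> 'B AA B HH OW P'."""
    return re.sub(r"\d", "", " ".join(prons))

# Hard-coded illustrations of the matching step:
person = normalize(["B AA1 B", "HH OW1 P"])
creature = normalize(["D OW1 D OW0"])
print(person)              # 'B AA B HH OW P'
print(person == creature)  # False: Bob Hope is not a dodo
```

In the full loop, we'd apply the letter-drop transformation to each personality's last name before generating its pronunciation, then compare against every normalized creature pronunciation.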
I've done a bit of a "draw the rest of the owl" trick here, of course, because much of the challenge is simply coming up with T and C, our lists of candidate TV personalities and creatures. I tried using BERT in masking mode to fill in blanks like "The paleontologists discovered a rare complete [MASK] skeleton last month" in hopes of getting a list of suitable creatures, but I found it really difficult to tune these prompts so they produced good answers without a flood of bad ones. We had success with BERT masked mode before, like in this puzzle, but I pivoted to other sources this time: for the creatures list, I found a few lists online and cobbled together a list of only about 30 creatures, and for the TV personalities, I ran a few ChatGPT queries and eventually had a list of about 250 names.

You can see my work on GitHub; you'll also want my list of TV personalities (the list of creatures is much shorter and hard-coded). As these puzzles are always one-off solutions, I rarely optimize the script for efficiency--it's simply a race to get a solution. This script could benefit from some TLC (feel free to push your changes), but I'm happy we got a solution. The deadline for submissions has passed, so click below if you just want to see my answer. See you next week!



Friday, August 30, 2024

Famous name with 8 letters but only 1 consonant

Hi Puzzlers, have you solved this week's Sunday Puzzle from NPR yet?

This week's challenge comes from listener Lillian Range, of New Orleans. The word NONUNION has four N's and no other consonant. What famous American of the past -- first and last names, 8 letters in all -- has four instances of the same consonant and no other consonant?

Clearly, this one is going to be light on the NLP side of things. All we really need here is a long enough list of famous Americans and a few python functions to iterate through them.

I've scraped together a list of 10,000 famous names, which has proven adequate. [A caveat: my list is noisy. It was cleaned with automated functions that ASCII-fy everything, so diacritics are removed, some spellings may be broken, etc. On top of that, there are lots of non-American names, and I even noticed some fictional characters. This is a bootstrapped list, and I wouldn't necessarily rely on it for future puzzles.]

I also wrote a little python script to read in each name, keep only those that are 8 letters total, and finally print out any names that have only 1 consonant with exactly 4 instances (in data science terms-- 1 type, 4 tokens). I've got it printing out the correct name now. Feel free to download the script and list and try it yourself.
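The filter described above fits in a few lines; this is a sketch of the idea rather than my exact script. One simplification worth flagging: 'y' is treated as a consonant here.

```python
def qualifies(name: str) -> bool:
    """True if the name has exactly 8 letters and its consonants are
    4 instances of a single letter (1 type, 4 tokens).
    Treats 'y' as a consonant, a simplification."""
    letters = [c for c in name.lower() if c.isalpha()]
    if len(letters) != 8:
        return False
    consonants = [c for c in letters if c not in "aeiou"]
    return len(consonants) == 4 and len(set(consonants)) == 1

print(qualifies("nonunion"))  # True: the puzzle's own example word
print(qualifies("Bob Hope"))  # False: only 7 letters
```

Running every name in the list through `qualifies` leaves a tiny handful of candidates to eyeball.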

Or now that the Thursday deadline has passed, you can see my solution below:


See you at the next puzzle!

Sunday, August 18, 2024

An actor and a profession

It's Sunday, Puzzlers, and you know what that means---it's time to solve this week's Sunday Puzzle from NPR:

This week's challenge comes from listener Peter Collins, of Ann Arbor, Michigan. Think of a famous movie star -- first and last names, nine letters in all. The third, fourth, fifth, seventh, and eighth letters, in order, name a profession. The star's last name is something that this profession uses. Who is the movie star and what is the profession?


Classic, easy-peasy transformation puzzle, right? We take thing from Class A, apply the prescribed transformation, and return thing from Class B. Quite similar to last week's puzzle, really. So what do we need to solve this one?

  • Actors:
    • We need a list of actors; fortunately this is a common element in puzzles, so we can use this list of actors from previous puzzles.
  • Professions:
    • Ideally, we'd check to see if the specified transformation gives us a string that's found in a list of professions. I think this is going to be overkill, however. I suspect that if we simply check to see if the resulting string is a real word, that will reduce the number of candidates down to a number that we can quickly scan visually in order to find the solution. In fact, this may give us the solution only and no other candidates. So, we need:
  • transformation:
    • I'll be using python to implement the transformation where we extract the five letters from the specified positions
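That transformation is only a couple of lines of python. Here's a sketch; the nine-letter "name" in the example is made up purely to show which positions are extracted.

```python
def profession_candidate(name: str):
    """Letters 3, 4, 5, 7 and 8 (1-indexed, spaces ignored) of a
    nine-letter star name, per the puzzle's transformation."""
    letters = "".join(c.lower() for c in name if c.isalpha())
    if len(letters) != 9:
        return None
    return "".join(letters[i - 1] for i in (3, 4, 5, 7, 8))

# A made-up nine-letter 'name', just to show the extraction:
print(profession_candidate("Abc Defghi"))  # 'cdegh'
print(profession_candidate("Brad Pitt"))   # None: only 8 letters
```

From there, each candidate string just needs a membership check against the English lexicon.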
Does this approach make sense to you? Would you go a step further and use some semantic tools like LMs or embeddings to compare the transformed strings against a list of professions? Do you think this approach is adequate?

I think my approach is working. If you'd like to explore it, I'm sharing my python script as usual on GitHub. I'll be back after NPR's Thursday deadline to share my solution. Good luck!
-- (Puzzle Dr.) Levi King

Update

The Thursday deadline has passed, so here's my solution: 


See you next week!

Tuesday, August 13, 2024

A food transformed

 Hi Puzzlers! Let's take a shot at this week's Sunday Puzzle from NPR.

This week's challenge comes from listener Greg VanMechelen, of Berkeley, Calif. Think of a popular food item in six letters. Change the last two letters to a K to make a common five-letter word in which none of the letters are pronounced the same as in the six-letter food. What food is this?

Ah, a classic transformation puzzle-- take a thing from Class A, apply the given transformation, return a thing from Class B. We've done this a million times! Let's break it down.

We need:

  • A list of food words:
    • Foods are a common puzzle theme, so we can simply reuse the food lexicon we created for previous puzzles, here.
    • We'll filter this list down to 6-letter words.
  • An English lexicon:
    • After we apply the transformation, we'll need to check if the resulting string is a real word.
    • Here's a lexicon of 10k words we've scraped and cobbled together for previous puzzles.
  • A script with the following functions:
    • load lexicons from text files (i.e., food lexicon & general English lexicon);
    • filter the lexicons to 6-letter and 5-letter words, respectively;
    • iterate through the food lexicon and:
      • transform spelling (replace last two letters of food word with "k");
      • check English lexicon for transformed string;
      • print original string and new string (if new string is in English lexicon);
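The core of that script looks roughly like this; the toy food list and five-letter lexicon below stand in for the real files loaded from disk.

```python
def transform(food: str) -> str:
    """Replace the last two letters of a six-letter food word with 'k'."""
    return food[:-2] + "k"

# Toy stand-ins for the filtered food lexicon and 5-letter English lexicon:
foods_6 = ["potato", "tomato", "cereal"]
english_5 = {"stick", "block", "crank"}

for food in foods_6:
    candidate = transform(food)
    if candidate in english_5:
        print(food, "->", candidate)
```

With the real lexicons, very few candidate pairs survive this filter, which is why I expect to skip the pronunciation check entirely.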
But wait--what about the pronunciation bit? We certainly could do this easily enough using a pronunciation dictionary, as we did with the CMU pronunciation dictionary for this puzzle, for example. However, I think it would be overkill. I'm confident that the number of candidate pairs where we have a food word that is transformed into a real English word is going to be so small that we can easily pick the correct solution from the results. In fact, there may only be one candidate pair--we'll find out soon!

I've whipped up a script as described above; feel free to try to get it working. I'll be back after the Thursday NPR deadline to share my results. Good luck!

Update

The Thursday deadline has passed, so here's my solution: 


See you next week!

Monday, May 13, 2024

7 letters, 5 syllables, geographic name

It's Monday, so let's start the week off right--with the Sunday puzzle from NPR:

This week's challenge: Think of a well-known seven-letter geographical name in a single word that has just two consonants and yet is pronounced in five syllables.

Wow, that's two weeks in a row with a puzzle that breaks the usual transform thing from Class A to get another thing from Class B format! How should we approach this one? What do we need?

  • Lexicon
    • If we have a sufficiently large list of words, we can iterate through it to find any items that are 7 letters long and contain exactly 5 vowels.
    • "geographical name": I understand this to mean a place name like "Atlanta" or "Borneo".
      • Some of the lexicons we've used in the past might be light on proper names, so we may want to look for something called a gazetteer. These days, a gazetteer is usually just a list of place names, but the name comes from the index printed at the end of an atlas.
  • Pronunciation dictionary
    • Let's imagine our orthographic search returns 1,000 results of 7-letter, 5-vowel names. With such a large list of results, it would be helpful to run these results through a pronunciation dictionary and return only those with 5 syllables.
    • More likely, our search is going to return a much smaller number of results, in which case we can skim through them manually to find the real solution to the puzzle.
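Both filters are short. Here's a sketch; the syllable counter assumes ARPABET pronunciations like those in the CMU dictionary, where exactly the vowel phones carry a stress digit, and the example pronunciation is an ARPABET-style rendering of "Atlanta" (not necessarily the exact CMU entry).

```python
def is_candidate(word: str) -> bool:
    """Seven letters with exactly two consonants (hence five vowel letters)."""
    w = word.lower()
    vowels = sum(c in "aeiou" for c in w)
    return len(w) == 7 and w.isalpha() and len(w) - vowels == 2

def syllables(pron: str) -> int:
    """In ARPABET, only vowel phones carry a stress digit (0/1/2),
    so counting digit-bearing phones counts syllables."""
    return sum(1 for phone in pron.split() if phone[-1].isdigit())

print(is_candidate("Atlanta"))           # False: four consonants
print(syllables("AH0 T L AE1 N T AH0"))  # 3
```

Chaining the two (orthographic filter first, syllable check on the survivors) should leave a short list to skim.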
Well, that's going to be my approach, and I've got a script started on GitHub that might help you out. Leave a comment if you have a different idea. Good luck and I'll be back after the deadline to share my solution!

Tuesday, May 07, 2024

3 words that look alike but don't rhyme

Hello Puzzlers! Time for this week's Sunday Puzzle from NPR:

This week's challenge: This week's challenge comes from listener Jim Bricker, of Wayland, Mass. Think of three common six-letter words that have vowels in the second and fifth positions. The last five letters of the words are the same. Only the first letters differ. And none of the words rhyme with either of the others. What words are they?

It's nice to see a puzzle that doesn't follow the familiar format (take a thing from Class A and apply a defined transformation to get a thing from Class B).

How should we approach this puzzle and what resources do we need?

We want our script to do the following:
  1. Read in the lexicon.
  2. Keep only 6-letter words.
  3. Keep only those with (orthographic) vowels (a, e, i, o, u, y) in the 2nd and 5th position.
  4. Group remaining words into lists where each list contains words that share the same final 5 letters.
  5. Keep only the word groups that contain 3 words or more.
  6. Iterate through the groups and retrieve each word's pronunciation(s) from the pron dict.
  7. For each pronunciation in the group, compare the (phonetic) vowels. (If two pronunciations have distinct vowel sequences, they do not rhyme.) Print out any groups where at least 3 words have distinct vowel sequences. This should give us the solution.
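Step 7 — the simplified rhyme check — can be sketched like this, using the ARPABET pronunciations shown in the output below (stress digits mark the vowel phones):

```python
def vowel_seq(pron: str):
    """Vowel phones in order, stress stripped: 'B AA1 R OW2' -> ('AA', 'OW')."""
    return tuple(p[:-1] for p in pron.split() if p[-1].isdigit())

def no_two_rhyme(prons) -> bool:
    """Step 7's simplified criterion: the group passes only if every
    pronunciation has a distinct vowel sequence."""
    seqs = [vowel_seq(p) for p in prons]
    return len(set(seqs)) == len(seqs)

# borrow / morrow / sorrow all share ('AA', 'OW'), so the group fails:
print(no_two_rhyme(["B AA1 R OW2", "M AA1 R OW0", "S AA1 R OW0"]))  # False
```

A stricter rhyme test would compare only the phones from the last stressed vowel onward; this version is coarser, which is exactly why a few near-rhyming false solutions slip through.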
I've done exactly this, and indeed it is returning the solution. It's also returning a few false solutions, like this one:

[('borrow', 'B AA1 R OW2')]
[('morrow', 'M AA1 R OW0'), ('sorrow', 'S AA1 R OW0')]
[('morrow', 'M AO1 R OW0')]

For my money, these pronunciations more or less do rhyme. That may say something about my own dialect, but I'm confident that another solution is better, and I'll share it after the deadline this Thursday. In the meantime, if you'd like to see my script, you can find it here. Good luck and happy puzzling!

--Levi King

Update

The Thursday deadline has passed, so here's my solution: 


See you next week!

Monday, April 22, 2024

Major American corporation of the past

Welcome back, Puzzlers! And welcome back to host Will Shortz!

Let's take a look at this week's challenge from the NPR Sunday Puzzle:

This week's challenge comes from listener Jim Vespe, of Mamaroneck, N.Y. Think of a major American corporation of the past (two words, 15 letters altogether). Change the last three letters in the second word and the resulting phrase will name something that will occur later this year. What is it?

Oof, this one feels like a total freebie, and not to brag, but I got this right away after considering just a few possibilities.

The major American corporation of the past doesn't narrow things down much, but knowing that it must be two words, 15 letters altogether would definitely help us filter down a list if we had one handy.

The something that will occur later this year feels like the best place to start, because I can't think of a lot of possibilities. I don't think it would be something that occurs every year.

That leaves special events for 2024. It's safe to assume that this would be something that your average NPR listener would be aware of---not some obscure event or gathering.

The next World Cup is in 2026, so that's not relevant. Likewise, we already had a major eclipse.

The Summer Olympics takes place this year in Paris. This is a major election year in the USA.  Can you think of variations on one of these that might give us a solution? How about other major, newsworthy events happening this year?

I ran this challenge by Microsoft Copilot (using GPT-3.5, I believe), and while it made some wrong assumptions, it got close enough that it would lead a reasonable person to the correct answer. I also passed it to Google Gemini (formerly Bard), but the results were nowhere near the correct answer.

I'm going to pass on writing a full script for solving this one since it's so easy, but let's walk through a hypothetical NLP approach here.

This is a pretty standard word puzzle format for the Sunday Puzzle.

There exists some string (2 words, 15 letters) among List A (American corporations of the past) such that when the last 3 letters are changed, the resulting string can be found among List B (something that will occur later this year). In other words, take a thing from List A, transform it, get a thing from List B.

In an NLP approach, the main challenge would be gathering List A and List B and ensuring that they are reasonably complete. These are both open classes, meaning there is no complete list, because we could always keep digging and find something else. This is in contrast to closed classes, which are things like U.S. states or Academy Award Best Picture winners.

I would suggest that we search the web for a ready-made list of American Corporations, and/or ask an LLM to provide us a list. Alternatively, as we've done in the past, we can use a pretrained (L)LM like BERT or GPT-2 (both available in python libraries) to fill in a blank for us with the 100 or 200 most likely candidates: "In the past, major American corporations like ________ provided employees with a wide range of benefits." We know these items need to be 2 words, 15 letters total, so we remove any items that don't fit that pattern. A little cleaning and formatting and we have List A.

For List B, I would turn to the web again. Possibly we could do a web search and find a list of major events taking place this year. We could ask Copilot. We could also just brainstorm our own list (but in doing so the solution would probably be so obvious we don't need an NLP approach). Let's assume we manage to find or assemble a list of plausible candidates. Again, we'll want to filter our list to only include strings that give us 2 words, 15 letters total.

The next element is the transformation--change the last three letters in the second word. In a brute force approach, we could indeed iterate through all permutations of 3 letters, but wouldn't this be unnecessary at this stage? Instead, we should iterate through List A, then iterate through List B, looking for a string match between ItemA[:-3] and ItemB[:-3]. In other words, we're comparing the strings but ignoring the final 3 letters. If we find such a match, our python script would print it out, at which point we can see how the 3 letters were changed.
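That matching loop can be sketched like this. The corporation and event names below are made up purely to exercise the logic; they are not candidates.

```python
def letters_only(s: str) -> str:
    """Lowercase letters with spaces/punctuation stripped, for comparison."""
    return "".join(c.lower() for c in s if c.isalpha())

def matches(list_a, list_b):
    """Pairs whose 15-letter spellings agree everywhere except the last
    three letters (ItemA[:-3] == ItemB[:-3])."""
    out = []
    for a in list_a:
        for b in list_b:
            la, lb = letters_only(a), letters_only(b)
            if len(la) == len(lb) == 15 and la[:-3] == lb[:-3] and la != lb:
                out.append((a, b))
    return out

# Made-up 15-letter phrases, purely to exercise the matching logic:
corps = ["Consolidated Ore"]
events = ["Consolidated Oak"]
print(matches(corps, events))
```

Comparing on the letters-only form sidesteps any differences in spacing or punctuation between the two lists.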

I suspect we'd only find one solution, but let's take this a step farther. Let's assume, for the sake of challenge, that this would result in thousands of potential solutions--too many to review manually. How could we sort through those potential solutions?

Again, I would suggest we use a language model here. This time, instead of asking the model to fill in a blank for us, we're going to have the script fill in a blank with each of our potential solutions and then rank them according to a probability (or perplexity) score. In fact, we used such an approach just last week. A good sentence template to use here might be: "Many people are excited because the ________ will take place later this year." So our script will use the remaining items from List B and tell us how well each one fits in the context of our template.

What do you think of this hypothetical approach? Did you also find this puzzle easier than usual? I'll be back after the Thursday submission deadline to share my solution.

--Levi King

Update

Here's my solution: 


See you next week!

Director, anagram, film award

Welcome back to Natural Language Puzzling, the blog where we use natural language processing and linguistics to solve the Sunday Puzzle from...