It's Monday Funday at Natural Language Puzzling, the blog where we use natural language processing and linguistics to solve the weekly Sunday Puzzle from NPR! Let's take a look at this week's puzzle:
Name a place somewhere on the globe -- in two words. Rearrange the letters of the first word to name some animals. The second word in the place name is something those animals sometimes do. What is it?
This one immediately strikes me as relatively difficult. Let's look at the tools and data we'll need to solve it and get a better understanding of what makes it hard. We'll need:
P: a list of two-word place names anywhere on the globe
this is a very unrestricted class
we can return to the OSM Names file that we've used in recent puzzles
it has 21 million entries, so we'll narrow that down, keeping only the list of names (and dumping the rest of the tsv file)
we'll also filter it down to only two-word place names, and only places written in Roman letters
this still leaves about 8 million place names; *we'll return to this later
anagram_checker(string): a function that takes a string and returns all of its valid anagrams, as a first step toward rearranging the letters of the first word to "name some animals" (see the sketch after this list)
This will require an English lexicon for validating words; I'll be using a lexicon extracted from the Brown Corpus in the NLTK (much like I did in this recent puzzle)
LLM: We'll use an LLM to evaluate the likelihood that each valid anagram fits the "name some animals" clue.
I'll use the pretrained GPT2 model in the python transformers library
I'll plug each candidate into a sentence template like "The wildlife photographer published a set of photos of [BLANK]."
I'll keep the candidates that score below some (currently unestablished) threshold for perplexity.
For the most likely candidates, I'll check that the second word of the place name is a valid English word in our lexicon; per the puzzle, the second word is used as-is, so no rearranging is needed here
Then I'll use the LLM again to evaluate candidates in a sentence like this one: "[BLANK2] is something that [BLANK1] sometimes do."
Ideally, the correct solution should be among the candidates with the lowest perplexity scores.
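Here's a minimal sketch of the anagram_checker idea in python, assuming the Brown Corpus lexicon described above (nltk.download("brown") must have been run). Indexing the lexicon by each word's sorted letters avoids generating permutations:

from collections import defaultdict
from nltk.corpus import brown

# lexicon: alphabetic word types from the Brown Corpus
lexicon = {w.lower() for w in brown.words() if w.isalpha()}

# index every word by its sorted letters, so all anagrams share one key
by_signature = defaultdict(set)
for word in lexicon:
    by_signature["".join(sorted(word))].add(word)

def anagram_checker(string):
    # return all lexicon words built from exactly the same letters
    key = "".join(sorted(string.lower()))
    return by_signature[key] - {string.lower()}

For example, anagram_checker("loin") should return "lion" (assuming both words appear in the lexicon).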
Regarding the list of 8 million two-word place names: naturally, many of these are places most of us have never heard of. I'm confident the solution will be a place name familiar to most listeners, so we need a way to sort the 8 million candidates by how well known they are. Again, I turn to the LLM, plugging each candidate into this template: "[BLANK] is one of the most famous places in the world." I'm storing the resulting ranked list of candidates. The approach I've outlined above is very brute-force and computationally heavy, so we want to start with a short list of place names. I'll likely run my script on the first 1000 place names, and if the solution isn't found there, take the second 1000, and so on.
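For the curious, here's a minimal sketch of that perplexity ranking, using the pretrained GPT2 model from the python transformers library; the two sample place names are just stand-ins for the real candidate list:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence):
    # lower perplexity = the model finds the sentence more plausible
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

place_names = ["Niagara Falls", "Walla Walla"]  # stand-in for the full candidate list
template = "{} is one of the most famous places in the world."
ranked = sorted(place_names, key=lambda p: perplexity(template.format(p)))

The puzzles further down this page reuse this same perplexity() helper wherever GPT2 scoring comes up.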
Good luck this week! I'll be working on my approach and I'll be back to share my solution after the Thursday submission deadline from NPR (i.e., no spoilers here).
Update & Solution
The deadline for submissions has passed, so click below if you want to see my answer. See you next week!
Hello and welcome back to Natural Language Puzzling, the only blog on the web where we use natural language processing and linguistics to solve the Sunday Puzzle from NPR! Let's take a look at this week's puzzle:
This week's challenge comes from listener David Dickerson, of Tucson, Arizona. The city UTICA, NEW YORK, when spelled out, contains 12 letters, all of them different. Think of a well-known U.S. city that, when its name is spelled out, contains 13 letters, all of them different. Your answer doesn't have to match mine.
Immediately we can tell that this is going to be an easy one with minimal NLP or linguistics, and not a lot of tools required. Here's what we need:
C: a long list of cities (and their states) in the US
Note that our list includes non-city places-- counties, streets, parks, etc., but that's okay.
f: a function to check whether the city+state contains 13 unique letters (i.e., 13 types)
this is an easy job for python
We need a python script that can do a bit of cleanup on place names -- lowercase everything (so "a" and "A" are counted as one letter type and not two), remove punctuation and spaces, etc. Then we simply check that the cleaned-up name is exactly 13 letters long and that its length equals its count of unique letters (i.e., no letter repeats).
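A minimal sketch of f in python (the function name is mine, not necessarily the script's):

import string

def f(name):
    # keep only letters, lowercased (drops spaces, commas, etc.)
    letters = [ch for ch in name.lower() if ch in string.ascii_lowercase]
    # exactly 13 letters, and every letter appears only once
    return len(letters) == 13 and len(set(letters)) == 13

print(f("Utica, New York"))  # False: only 12 letters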
My script is producing 38 solutions, although only one or two of them are "well-known". Good luck! I'll be back to share my solution after the Thursday NPR submission deadline.
Update & Solution
The deadline for submissions has passed, so click below if you want to see my answer. See you next week!
Welcome back to Natural Language Puzzling, where we use natural language processing and linguistics to solve the Sunday Puzzle from NPR. Here's this week's puzzle:
This week's challenge comes from listener Mike Selinker, of Renton, Wash. Think of something to drink whose name is a compound word. Delete the first letter of the first part and you'll get some athletes. Delete the first letter of the second part and you'll get where these athletes compete. What words are these?
This one is not terribly different from last week's puzzle. Here's how I plan to approach this problem:
B: We need a sizable list of beverages
I've pieced together a list of about 200 beverages including cocktails, coffee and tea drinks, etc.
We know the solution should be a single string and it should be a compound word (e.g., "houseboat", "sunshine", etc.)
After we drop a letter from each part of the compound word, we'll still have a word for athletes and a word for a place where they compete
These two strings probably need to be a minimum of 3 letters each
So we should start with a beverage word that is at least 8 letters long
compound word function: Ideally, we need a function or a tool that can tell us whether a string is a compound word, and if so, what its component words are
I haven't found a readily available solution for this, so I effectively had to create my own
splitting function: To create our own compound word checker, we'll need to split each beverage word in two, such that each of the two strings contains a minimum of 3 letters; we can do this in python (see the sketch after this list)
Lexicon: With each beverage split into a pair or multiple pairs, we need to check whether each half of the split is a real English word.
There are many ways to approach this and numerous wordlists we could choose from
I chose to use NLTK to extract the 6000 most frequent words from the Brown Corpus
We can find much larger lexicons, but they tend to be too permissive
LLM: Now that we know which pairs are valid words, we can plug these words into a sentence template, pass that sentence to an LLM to get a perplexity score, then rank the candidate words according to their scores; the correct solution should be among the sentences with the lowest perplexity.
My sentence template:
We went to the __1__ to see the __0__ compete in the tournament.
I'm using GPT2 as implemented in the transformers python library.
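Here's a minimal sketch of the splitting function plus lexicon check, assuming the 6000-word Brown Corpus lexicon described above:

from nltk import FreqDist
from nltk.corpus import brown

# lexicon: the ~6000 most frequent word types in the Brown Corpus
freq = FreqDist(w.lower() for w in brown.words() if w.isalpha())
lexicon = {w for w, _ in freq.most_common(6000)}

def compound_splits(word, min_len=3):
    # try every split point that leaves at least min_len letters on
    # each side; keep splits where both halves are lexicon words
    word = word.lower()
    return [(word[:i], word[i:])
            for i in range(min_len, len(word) - min_len + 1)
            if word[:i] in lexicon and word[i:] in lexicon]

# e.g., compound_splits("buttermilk") should include ("butter", "milk")

From each surviving split, drop the first letter of each half, check that both shortened strings are also in the lexicon, and only then plug them into the sentence template for GPT2 to score.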
This approach worked for me. I'll be back to share my solution after NPR's Thursday submission deadline. In the meantime, feel free to examine my script on GitHub and drop your suggestions in the comments. Good luck, Puzzlers!
Update & Solution
The deadline for submissions has passed, so click below if you want to see my answer. See you next week!
Ahoy! Welcome back to Natural Language Puzzling, the weekly blog where we solve the NPR Sunday puzzle using natural language processing (NLP), language modeling and linguistics. Let's dive into this week's puzzle:
This week's challenge comes from listener Joe Krozel, of Creve Coeur, MO. Think of a place in America. Two words, 10 letters altogether. The first five letters read the same forward and backward. The last five letters spell something found in the body. What place is this?
Okay, so here's what I need to solve this problem:
P: a list of place names in America
Let's assume "America" means "USA" here
The longer the list, the better
We could do a web search for an existing list of place names
We could ask an LLM for a list
I'm going to use OpenStreetMap data (**details below)
is_palindrome(string): We need a function to check whether a string reads the same forward and backward; we'll use python to take the first five letters and run this function.
is_found_in_body(string): We also need a function to determine whether the last five letters are "something found in the body".
We could start out with a list of body parts, fluids, etc., but "something found in the body" is worded kind of vaguely and I'm not confident we can easily assemble an exhaustive list
I'm going to use an LLM instead; I'll take the last five letters from each of our candidate place names, insert the string into a sentence template, pass this sentence to the LLM, get a perplexity score, then sort these candidate solutions from highest to lowest perplexity; the correct solution should be among the lowest perplexities. I'll use the python transformers library's implementation of GPT2 for this.
There's a little more to it, but that's the basic idea. We'll also need some functions to do a bit of string cleaning and manipulation and to filter our list to candidates that have exactly two words and 10 letters. I've done all this in a python script, which you can view here on the GitHub repo.
**Regarding the OpenStreetMap data
OSMNames is a free, open source, downloadable database of over 21 million place names around the world. It downloads as a massive file with lots of columns that are totally extraneous for our purposes, so I use a few bash commands to filter it down to what we need. To reproduce the file I'm using in my script, you can start with the downloaded file and run the following commands:
## unzip the downloaded file, assumed here to be saved as us1.tsv.gz (on a Linux machine you can skip this step and pipe "zcat us1.tsv.gz" into the next command instead of reading us1.tsv)
gunzip us1.tsv.gz
## keep only column 1 and column 14 (place name and state)
cut -f1,14 < us1.tsv > us2.tsv
## remove any duplicate entries
awk '!seen[$0]++' us2.tsv > usa-places.tsv
Now that we have an appropriately filtered list, our script can create a subset of candidate place names that are exactly two words and 10 letters. For my approach, I assumed this could simply be a two-word place name that totals 10 letters, like San Antonio. But it could also be a combination of a one-word place name and a one-word state name, like Dayton Ohio. So my python script churns through the input file to find all such possibilities.
Next I iterate through this list, keeping only the candidates where the first five letters are palindromic.
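Here's a minimal sketch of that candidate generation and palindrome filter, assuming the two-column usa-places.tsv built above (place name, then state):

candidates = []
with open("usa-places.tsv") as f:
    for line in f:
        name, _, state = line.rstrip("\n").partition("\t")
        # try the place name alone (San Antonio) and place + state (Dayton Ohio)
        for phrase in (name, name + " " + state):
            words = phrase.split()
            letters = "".join(words).lower()
            # exactly two words, 10 letters total, no punctuation or digits
            if len(words) == 2 and len(letters) == 10 and letters.isalpha():
                # keep it only if the first five letters are palindromic
                if letters[:5] == letters[:5][::-1]:
                    candidates.append((phrase, letters[5:]))

The second element of each pair is the last five letters, ready for the "found in the body" templates below.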
Then I iterate through the remaining list, plugging each candidate into a couple of sentence templates for the LLM to evaluate. Here are the two sentence templates I've come up with:
In anatomy class, students learn about all the things inside the body, like bones and [BLANK].
Each anatomy lesson focuses on a new part of the body, like a bone or a [BLANK].
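Scoring then looks something like this, reusing the perplexity() helper sketched near the top of this page:

templates = [
    "In anatomy class, students learn about all the things inside the body, like bones and {}.",
    "Each anatomy lesson focuses on a new part of the body, like a bone or a {}.",
]

# score every candidate's last five letters in both templates
scored = sorted((perplexity(t.format(tail)), t.format(tail), phrase)
                for phrase, tail in candidates
                for t in templates)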
In the end, I get a long list of sentences ranked by perplexity, and indeed the sentence with the lowest perplexity score is the correct solution! Without giving you the solution, here's what some of that output looks like:
5.0561 Each anatomy ... like a bone or a ararc. mirimararc
5.0450 In anatomy ... like bones and aradr. caracaradr
5.0373 Each anatomy ... like a bone or a awaii. hanahawaii
5.0224 In anatomy ... like bones and acove. noconacove
4.9808 In anatomy ... like bones and nroad. aegeanroad
4.9779 Each anatomy ... like a bone or a uaway. hilihuaway
Good luck out there, puzzlers! I'll be back after the Thursday afternoon NPR deadline to share my solution. In the meantime, feel free to give my script a try.
Update & Solution
The deadline for submissions has passed, so click below if you want to see my answer. See you next week!
Welcome back to Natural Language Puzzling, the only blog on the web that teaches you how to solve complex word puzzles using natural language processing (NLP), computational linguistics, corpus linguistics, language modeling and computer programming.
Let's take a shot at this week's Sunday Puzzle from NPR:
This week's challenge comes from listener Curtis Guy, of Buffalo, N.Y. Name a certain breakfast cereal character. Remove the third, fifth, and sixth letters and read the result backward. You'll get a word that describes this breakfast cereal character. What is it?
This seems like an easy one, right? You can probably come up with the solution in your head within a few minutes, like I did. But let's imagine we need to solve this programmatically. What do we need in order to do so?
C: a list of breakfast cereal characters
This is going to be a pretty short list, which makes this an easy puzzle
We can brainstorm one and/or do a little web searching
transform(c): a function to transform each breakfast cereal character (c) as prescribed in the puzzle (remove letters and reverse the string)
We'll use python for this
evaluate(w): a function to evaluate the likelihood of each transformed string as "a word that describes this breakfast cereal character"
lexicon: given the restrictive nature of this puzzle (i.e., there won't be a lot of candidates to consider), we can probably simply rely on a lexicon--any transformed string that isn't found in the English lexicon is rejected, so we only need to manually read through the remaining candidates to find the solution.
language model: As we've done in the past, we can load an LLM, slot the character and the transformed string into a sentence frame, and score the candidates according to the perplexity score provided by our model.
I'm going to use both a lexicon and an LLM. First, if the transformed string isn't in the lexicon, it's immediately rejected. Any transformed strings that are in the lexicon are then passed to the LLM as described above. I'll use the GPT2 model in the python transformers library.
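Here's a minimal sketch of transform plus the lexicon filter, again assuming the Brown Corpus lexicon; treating spaces and punctuation as skippable (so they don't count as letter positions) is my assumption, and the two characters shown are stand-ins for the full list:

from nltk.corpus import brown

lexicon = {w.lower() for w in brown.words() if w.isalpha()}

def transform(c):
    # letters only: spaces and punctuation don't count as positions
    letters = [ch for ch in c.lower() if ch.isalpha()]
    # drop the 3rd, 5th, and 6th letters (1-indexed), then reverse
    kept = [ch for i, ch in enumerate(letters, start=1) if i not in (3, 5, 6)]
    return "".join(reversed(kept))

characters = ["Tony the Tiger", "Cap'n Crunch"]  # stand-in for the full list
survivors = [(c, transform(c)) for c in characters if transform(c) in lexicon]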
If this approach works, we'll get a list of sentences ranked by perplexity, and we'll expect the correct solution to be among the sentences with the lowest perplexity. Good luck! I'll be back after the Thursday NPR deadline to share my solution. In the meantime, you can see my python script here. If you need help with your list of cereal characters, you can find mine hardcoded in that script.
Update & Solution
The deadline for submissions has passed, so click below if you want to see my answer. See you next week!