It's Monday Funday at Natural Language Puzzling, the blog where we use natural language processing and linguistics to solve the weekly Sunday Puzzle from NPR! Let's take a look at this week's puzzle:
Name a place somewhere on the globe -- in two words. Rearrange the letters of the first word to name some animals. The second word in the place name is something those animals sometimes do. What is it?
This one immediately strikes me as relatively difficult. Let's look at the tools and data we need to solve it and get a better understanding of what makes it difficult. We'll need:
- P: a list of two-word place names anywhere on the globe
- this is a very unrestricted class
- we can return to the OSM Names file that we've used in recent puzzles
- it has 21 million entries, so we'll narrow that down, keeping only the list of names (and dumping the rest of the TSV file)
- we'll also filter it down to only two-word place names, and only places written with Roman letters
- this still leaves about 8 million place names; we'll return to this later
- anagram_checker(string): A function to take a string and return all the valid anagrams as a first step to rearranging the letters of the first word to "name some animals"
- This will require an English lexicon for validating words; I'll be using a lexicon extracted from the Brown Corpus in the NLTK (much like I did in this recent puzzle)
- LLM: We'll use an LLM to evaluate the likelihood that each valid anagram fits the "name some animals" clue.
- I'll use the pretrained GPT-2 model in the Python transformers library
- I'll plug each candidate into a sentence template like "The wildlife photographer published a set of photos of [BLANK]."
- I'll keep the candidates that score below some (currently unestablished) threshold for perplexity.
- For the most likely candidates, I'll use the anagram_checker on the second word of the place name to find valid English words
- Then I'll use the LLM again to evaluate candidates in a sentence like this one: "[BLANK2] is something that [BLANK1] sometimes do."
- Ideally, the correct solution should be among the candidates with the lowest perplexity scores.
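
The non-LLM pieces of the plan above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not necessarily the script used for this puzzle: apart from `anagram_checker`, the function names are hypothetical; the lexicon and the place name "Tabs Fly" are made-up toy data; and "Roman letters" is simplified to plain ASCII letters, which would drop accented names. The `perplexity` helper only applies the standard definition to a list of per-token log-probabilities, which in practice would come from the GPT-2 model.

```python
import math
from collections import defaultdict


def is_two_word_roman(name):
    """True if name is exactly two whitespace-separated words of ASCII letters.

    Assumption: ASCII-only is a simplification of "Roman letters".
    """
    words = name.split()
    return len(words) == 2 and all(w.isascii() and w.isalpha() for w in words)


def build_anagram_index(lexicon):
    """Map each sorted-letter signature to the set of lexicon words sharing it."""
    index = defaultdict(set)
    for word in lexicon:
        lowered = word.lower()
        index["".join(sorted(lowered))].add(lowered)
    return index


def anagram_checker(string, index):
    """Return every lexicon word that uses exactly the letters of `string`."""
    signature = "".join(sorted(string.lower()))
    return sorted(index.get(signature, ()))


def fill_template(template, candidate):
    """Drop a candidate into the [BLANK] slot of a sentence template."""
    return template.replace("[BLANK]", candidate)


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy data, invented for illustration only (not the puzzle answer).
toy_lexicon = ["bats", "stab", "tabs", "fly", "dog"]
toy_places = ["Tabs Fly", "Paris", "São Paulo", "New York City"]

index = build_anagram_index(toy_lexicon)
two_word_places = [p for p in toy_places if is_two_word_roman(p)]

first_word, second_word = two_word_places[0].split()
animal_candidates = anagram_checker(first_word, index)

template = "The wildlife photographer published a set of photos of [BLANK]."
sentences = [fill_template(template, c) for c in animal_candidates]
```

Indexing the lexicon by sorted letters makes each lookup a single dictionary access, which matters when the same check runs over millions of place names. Each sentence in `sentences` would then be scored by the language model, and only the low-perplexity candidates kept.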
Regarding the list of 8 million two-word place names: naturally, many of these are places most of us have never heard of. I'm certain that the solution will be a place name that is familiar to most listeners, so we need a way to sort the 8 million candidates by how well known they are. Again, I'm turning to the LLM, plugging each candidate into this template: "[BLANK] is one of the most famous places in the world." I'm storing the resulting list of ranked candidates.

The approach above is very brute-force and computationally heavy, so we want to start with a short list of place names. I'll likely try running my script with the first 1000 place names, and if the solution isn't found there, take the second 1000 place names, and so on.
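
The rank-then-batch idea can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the place names and scores are made up, and `toy_try_batch` stands in for the full anagram-plus-LLM check on one batch. Lower perplexity is treated as "better known", so the list is sorted in ascending order.

```python
def rank_by_perplexity(names, score_fn):
    """Sort names so the lowest-perplexity (most plausible) ones come first."""
    return sorted(names, key=score_fn)


def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def search_in_batches(ranked_names, size, try_batch):
    """Run try_batch on each batch in rank order; stop at the first batch with hits."""
    for batch_number, batch in enumerate(batches(ranked_names, size), start=1):
        hits = try_batch(batch)
        if hits:
            return batch_number, hits
    return None, []


# Made-up perplexity scores standing in for real model output.
made_up_scores = {
    "Alpha Place": 40.0,
    "Beta Place": 12.5,
    "Gamma Place": 95.0,
    "Delta Place": 20.0,
    "Epsilon Place": 60.0,
}
ranked = rank_by_perplexity(list(made_up_scores), made_up_scores.get)


def toy_try_batch(batch):
    """Placeholder solver: pretends "Epsilon Place" is the only name that works."""
    return [name for name in batch if name == "Epsilon Place"]


# A batch size of 2 keeps the toy example small; the plan above uses 1000.
batch_number, hits = search_in_batches(ranked, 2, toy_try_batch)
```

Stopping at the first batch with hits means the expensive scoring only touches as much of the list as needed, provided the ranking puts well-known places near the top.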
Good luck this week! I'll be working on my approach and I'll be back to share my solution after the Thursday submission deadline from NPR (i.e., no spoilers here).
Update & Solution
The deadline for submissions has passed, so click below if you want to see my answer. See you next week!