Tuesday, October 08, 2024

Place name, palindrome, body part

Ahoy! Welcome back to Natural Language Puzzling, the weekly blog where we solve the NPR Sunday puzzle using natural language processing (NLP), language modeling and linguistics. Let's dive into this week's puzzle:

This week's challenge comes from listener Joe Krozel, of Creve Coeur, MO. Think of a place in America. Two words, 10 letters altogether. The first five letters read the same forward and backward. The last five letters spell something found in the body. What place is this?

Okay, so here's what I need to solve this problem:
  • P: a list of place names in America
    • Let's assume "America" means "USA" here
    • The longer the list, the better
    • We could do a web search for an existing list of place names
    • We could ask an LLM for a list
    • I'm going to use OpenStreetMap data (**details below)
  • is_palindrome(string): We need a function to check whether a string reads the same forward and backward; we'll use Python to take the first five letters of each candidate and run this function.
  • is_found_in_body(string): We also need a function to determine whether the last five letters are "something found in the body".
    • We could start out with a list of body parts, fluids, etc., but "something found in the body" is worded kind of vaguely and I'm not confident we can easily assemble an exhaustive list
    • I'm going to use an LLM instead. I'll take the last five letters from each of our candidate place names, insert the string into a sentence template, pass this sentence to the LLM, and get a perplexity score. Then I'll sort the candidate solutions from highest to lowest perplexity; the correct solution should be among the lowest perplexities. I'll use the Python transformers library's implementation of GPT-2 for this.
There's a little more to it, but that's the basic idea. We'll also need some functions to do a bit of string cleaning and manipulation and to filter our list to candidates that have exactly two words and 10 letters. I've done all this in a Python script, which you can view here on the GitHub repo.
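As a sketch of that cleaning-and-filtering step (the function names here are my own, not necessarily those in the script):

```python
import re

def clean(name):
    """Lowercase a place name and strip anything that isn't a letter or a space."""
    return re.sub(r"[^a-z ]", "", name.lower()).strip()

def is_candidate(name):
    """True if a cleaned name has exactly two words and 10 letters in total."""
    words = clean(name).split()
    return len(words) == 2 and sum(len(w) for w in words) == 10
```

So is_candidate("San Antonio") passes (3 + 7 = 10 letters), while a name like "Chicago" fails on the word count.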

**Regarding the OpenStreetMap data
OSMNames is a free, open source, downloadable database of over 21 million place names around the world. It downloads as a massive file with lots of columns that are totally extraneous for our purposes, so I use a few Bash commands to filter it down to what we need. To reproduce the file I'm using in my script, you can start with the downloaded file and run the following commands:

## unzip the file (on Linux or macOS, you can probably skip this step and use "zcat" instead of "cat" in the next step)
gunzip planet-latest_geonames.tsv.gz

## get only place names for the USA
cat planet-latest_geonames.tsv | awk -F '\t' -v OFS='\t' 'NR == 1 || $16 == "us"' > us1.tsv

## keep only column 1 and column 14 (place name and state)
cut -f1,14 < us1.tsv > us2.tsv

## remove any duplicate entries
awk '!seen[$0]++' us2.tsv > usa-places.tsv

Now that we have an appropriately filtered list, our script can create a subset of candidate place names that are exactly two words and 10 letters. For my approach, I assumed this could simply be a two-word place name totaling 10 letters, like San Antonio. But it could also be a combination of a one-word place name and a one-word state name, like Dayton Ohio. So my Python script churns through the input file to find all such possibilities.
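Here's a minimal sketch of that candidate-generation pass, assuming the two-column name/state TSV produced above (the function name is hypothetical, and a header row, if present, would need to be skipped):

```python
import csv

def candidate_pairs(tsv_path):
    """Yield 10-letter, two-word candidates from a name<TAB>state file:
    either a two-word place name, or a one-word place plus a one-word state."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for name, state in csv.reader(f, delimiter="\t"):
            words = name.lower().split()
            if len(words) == 2 and sum(map(len, words)) == 10:
                # e.g. "San Antonio" -> "san antonio"
                yield f"{words[0]} {words[1]}"
            elif (len(words) == 1 and len(state.split()) == 1
                    and len(words[0]) + len(state) == 10):
                # e.g. "Dayton" + "Ohio" -> "dayton ohio"
                yield f"{words[0]} {state.lower()}"
```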

Next I iterate through this list, keeping only the candidates where the first five letters are palindromic.
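The palindrome check itself is a one-liner; a sketch of that filter (the helper name is my own):

```python
def is_palindrome(s):
    """True if a string reads the same forward and backward."""
    return s == s[::-1]

def first_five_palindromic(candidate):
    """Drop the space between the two words, then test the first five letters."""
    letters = candidate.replace(" ", "").lower()
    return is_palindrome(letters[:5])
```

For example, "san antonio" becomes "sanantonio", whose first five letters "sanan" are not palindromic, so it would be filtered out at this stage.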

Then I iterate through the remaining list, plugging each candidate into a couple of sentence templates for the LLM to evaluate. Here are the two sentence templates I've come up with:
  • In anatomy class, students learn about all the things inside the body, like bones and [BLANK].
  • Each anatomy lesson focuses on a new part of the body, like a bone or a [BLANK].
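Here's a sketch of how those templates might be scored with GPT-2 via the transformers library (requires torch and transformers installed); the exact model variant and scoring details are my assumptions, not necessarily what the script does:

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

TEMPLATES = [
    "In anatomy class, students learn about all the things inside the body, like bones and {}.",
    "Each anatomy lesson focuses on a new part of the body, like a bone or a {}.",
]

def perplexity(model, tokenizer, sentence):
    """Perplexity = exp of the mean token-level cross-entropy loss."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def score_candidate(model, tokenizer, last_five):
    """Take the lower perplexity across the two templates for a candidate's last five letters."""
    return min(perplexity(model, tokenizer, t.format(last_five))
               for t in TEMPLATES)

# Example usage:
# tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
# score_candidate(model, tokenizer, "heart")
```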
In the end, I get a long list of sentences ranked by perplexity, and indeed the sentence with the lowest perplexity score is the correct solution! Without giving you the solution, here's what some of that output looks like:

5.0561   Each anatomy ... like a bone or a ararc  mirimararc
5.0450   In anatomy ... like bones and aradr  caracaradr
5.0373   Each anatomy ... like a bone or a awaii  hanahawaii
5.0224   In anatomy ... like bones and acove  noconacove
4.9808   In anatomy ... like bones and nroad  aegeanroad
4.9779   Each anatomy ... like a bone or a uaway  hilihuaway

Good luck out there, puzzlers! I'll be back after the Thursday afternoon NPR deadline to share my solution. In the meantime, feel free to give my script a try.

Update & Solution

The deadline for submissions has passed, so click below if you want to see my answer. See you next week!


