Natural Language Puzzling: Names from U.S. history (Preview)

Happy Monday, Puzzlers! Let's take a look at the latest Sunday Puzzle:

This week's challenge comes from listener Ed Pegg Jr., who runs mathpuzzle.com. Think of someone who has been in the news this year in a positive way. Say this person's first initial and last name out loud. It will sound like an important person in U.S. history. Who is it?

FYI, it's always a good idea to listen to the puzzle, too. In this case, Puzzlemaster Will Shortz dropped an important clue on the air that didn't make it into the write-up. While repeating the puzzle, he rephrased "someone who has been in the news this year" as "an official who has been in the news this year." Good to know.

Let's break this puzzle down a little.

We need:

O: a list of officials making positive news this year
H: a list of important persons in U.S. history

We could also use a dictionary of common U.S. name spellings and their pronunciations; i.e., a mapping from orthographic spelling to phonemic spelling. That might be hard to come by, so we may instead need a pronunciation model that can predict the phonemic spelling for previously unseen words (or in this case, names). This is commonly referred to as a grapheme-to-phoneme model, or g2p.

How do we come up with O, the list of officials making positive news this year? We're not likely to find such a list ready to use online. The thorough but hard way would be to use a web scraping tool like Beautiful Soup to hoover up loads of articles from a few non-paywalled news sites, making sure to limit it to articles from 2021. I'm just spitballing here, but I think next we would run a sentiment analysis tool. I'd probably use Stanza, the Stanford NLP Group's Python toolkit, but other good options are NLTK or TextBlob. These tools typically take some text and generate a score indicating where the sentiment lies on a continuum from very negative to very positive, often between -1.0 and 1.0. We'd keep all the positive news articles.

Then we'd run a named entity recognizer (NER) tool to extract a list of persons mentioned in the positive articles. In past puzzles, we've used the Stanza NER tool, which is easy to use out of the box. We could refine this a little, but that would probably do the trick to get us a decent list for O. We should go ahead and convert these names to the form we need, with first initial and last name, e.g., Barack Obama --> B. Obama.
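The last two steps of that pipeline are easy to sketch in plain Python. This is only a rough illustration: the helper names (keep_positive, to_initial_form) and the sample scores are made up, and I'm assuming the sentiment scores have already been produced by whichever tool we settle on.

```python
def keep_positive(scored_articles, threshold=0.0):
    """Keep the text of articles whose sentiment score is above the threshold.

    scored_articles is a list of (text, score) pairs; the scores are assumed
    to come from an external sentiment tool and to fall between -1.0 and 1.0.
    """
    return [text for text, score in scored_articles if score > threshold]


def to_initial_form(full_name):
    """Reduce a full name to first initial plus last name."""
    parts = full_name.split()
    if len(parts) < 2:
        return full_name  # single-token names are left unchanged
    return f"{parts[0][0].upper()}. {parts[-1]}"


# Made-up scores, purely for illustration.
scored = [("good news story", 0.7), ("bad news story", -0.4)]
positive_texts = keep_positive(scored)

# Names as an NER tool might return them.
initial_forms = [to_initial_form(n) for n in ["Barack Obama", "Ruth Bader Ginsburg"]]
```

Note that this takes the last token as the last name, so a name ending in a suffix like "Jr." would need extra handling.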

We could also try to scrape a list for H; if we had a lot of U.S. history text documents, we could just use NER to find names, and choose the top 100 (or 500, etc.) most frequent names. For this list, though, we can probably just find one online. Here's a nice-looking list of 100 from Smithsonian Magazine.
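If we did go the scraping route for H, picking the most frequent names is a short job for collections.Counter. Here's a minimal sketch; top_names is a hypothetical helper, and the mentions list stands in for whatever the NER tool would actually return.

```python
from collections import Counter


def top_names(person_mentions, n=100):
    """Return the n most frequently mentioned names, most frequent first."""
    counts = Counter(person_mentions)
    return [name for name, _ in counts.most_common(n)]


# Stand-in for NER output over a pile of history texts.
mentions = [
    "Abraham Lincoln", "Babe Ruth", "Abraham Lincoln",
    "Jane Addams", "Abraham Lincoln", "Jane Addams",
]
top_two = top_names(mentions, n=2)
```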

At this point, I would go ahead and apply whatever phonemic resource we're using to get a phonemic spelling (or spellings, if alternate pronunciations apply) for each name in H.
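To make the "spelling or spellings" part concrete, here's a sketch using a tiny hand-written lexicon in place of a real pronunciation dictionary or g2p model. The lexicon entries are my own rough IPA approximations (stress omitted), and name_pronunciations is a hypothetical helper. Words are joined without spaces because the eventual match can cross a word boundary.

```python
from itertools import product

# Toy lexicon for illustration only; a real run would use a pronunciation
# dictionary or a g2p model instead.
TOY_LEXICON = {
    "jane": ["dʒeɪn"],
    "addams": ["ædəmz"],
    "theodore": ["θiːədɔːr"],
    "roosevelt": ["roʊzəvɛlt", "ruːzəvɛlt"],  # two common variants
}


def name_pronunciations(name, lexicon):
    """Return every pronunciation of a full name, one per combination of
    per-word alternates. Returns [] if any word is missing from the lexicon
    (that is where a g2p model would take over)."""
    words = name.lower().split()
    if any(word not in lexicon for word in words):
        return []
    options = [lexicon[word] for word in words]
    return ["".join(combo) for combo in product(*options)]


addams = name_pronunciations("Jane Addams", TOY_LEXICON)
roosevelt = name_pronunciations("Theodore Roosevelt", TOY_LEXICON)
```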

We also need a list of the pronunciations of the "names" of the 26 letters of the alphabet (actually, 25; we can skip the oddly named "w"). For example, in IPA:

Letter	IPA
a	eɪ
b	biː
c	siː

From here, I would first make a quick pass through H and eliminate any name that starts with an IPA pronunciation that is not on our list of letter name pronunciations. For example, Babe Ruth would be eliminated here because the first phonemes in the pronunciation (beɪ) do not match anything in our list of letter names. Jane Addams, however, would not be eliminated, because the letter J and the name Jane both start with dʒeɪ. (Of course this name would later fail, as there is no "J. Nadams" in our list O.)
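That quick pass might look something like the sketch below. The letter-name pronunciations are my own approximate General American IPA with stress omitted, and the two name pronunciations are hand-written for the demo; matching_letters is a hypothetical helper.

```python
# Approximate IPA for the names of 25 letters ("w" is skipped on purpose).
LETTER_NAMES = {
    "a": "eɪ", "b": "biː", "c": "siː", "d": "diː", "e": "iː",
    "f": "ɛf", "g": "dʒiː", "h": "eɪtʃ", "i": "aɪ", "j": "dʒeɪ",
    "k": "keɪ", "l": "ɛl", "m": "ɛm", "n": "ɛn", "o": "oʊ",
    "p": "piː", "q": "kjuː", "r": "ɑːr", "s": "ɛs", "t": "tiː",
    "u": "juː", "v": "viː", "x": "ɛks", "y": "waɪ", "z": "ziː",
}


def matching_letters(pronunciation):
    """Return the letters whose name is a proper prefix of the pronunciation."""
    return [
        letter
        for letter, letter_name in LETTER_NAMES.items()
        if pronunciation.startswith(letter_name)
        and len(pronunciation) > len(letter_name)
    ]


# Hand-written pronunciations, with words joined without spaces.
history = {"Babe Ruth": "beɪbruːθ", "Jane Addams": "dʒeɪnædəmz"}
survivors = [name for name, pron in history.items() if matching_letters(pron)]
```

Here Babe Ruth drops out and Jane Addams survives, matching the reasoning above.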

From there, I'd probably just have the script print out all the remaining names. Skimming the list by hand, I'd estimate this leaves only about 15 of the 100 names we started with. That's surely a small enough number that we can read through them out loud to see if anything clicks.

But let's assume for some reason we really want to get the solution from our script. Instead of just printing out the remaining names in H, we would use our phonemic resources to generate pronunciations for all the names in O, allowing for multiple alternate pronunciations when needed. Then we simply iterate through H, looking for any matches in O.
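Here's a rough sketch of that matching step. Everything in the sample data is hypothetical: the pronunciations are hand-written approximations, and "J. Nadams" is a made-up official included only so the demo has something to find. For each official, we glue the letter name of the initial onto each pronunciation of the last name and look the result up among the historical pronunciations.

```python
def find_matches(history, officials, letter_names):
    """Return (official, historical name) pairs whose pronunciations coincide.

    history maps a full name to a list of pronunciations.
    officials maps an initial form like "J. Nadams" to a list of
    pronunciations of the last name only.
    """
    # Index historical pronunciations so each lookup is a dictionary hit.
    index = {}
    for name, prons in history.items():
        for pron in prons:
            index.setdefault(pron, set()).add(name)

    matches = []
    for official, last_name_prons in officials.items():
        letter_name = letter_names.get(official[0].lower())
        if letter_name is None:
            continue  # e.g. "w", which we chose to skip
        for last_pron in last_name_prons:
            for hist_name in sorted(index.get(letter_name + last_pron, ())):
                matches.append((official, hist_name))
    return matches


# Toy data for illustration only.
letter_names = {"j": "dʒeɪ", "b": "biː"}
history = {"Jane Addams": ["dʒeɪnædəmz"], "Babe Ruth": ["beɪbruːθ"]}
officials = {"J. Nadams": ["nædəmz"], "B. Obama": ["oʊbɑːmə"]}

matches = find_matches(history, officials, letter_names)
```

Exact string equality is a strict test; a real run would probably want some tolerance for near-matches in vowels or stress.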

I'm currently working on installing and familiarizing myself with some phonemic tools and models, and I promise I'll go all the way on such an approach in the near future when we get a puzzle with an interesting pronunciation aspect. I think it's overkill here, so I'm going to hold off this time.

In fact, this week's puzzle is fairly easy, so long as you consume a decent amount of news and current events. I solved it by taking the first five or ten important persons from U.S. history that came to mind, then just mentally applying the same logic as in my Babe Ruth / Jane Addams examples above until I came to a match. I'm sure you can do the same!

Good luck, and I'll see you Friday with a solution!

-- Levi King

Monday, February 08, 2021
