Welcome back, Puzzlers!
Did you solve the latest puzzle? I did, so consider that your spoiler warning -- the solution is shown later in this post.
Let's get into it. Here's the latest Sunday Puzzle:
This week's challenge comes from listener Michael Shteyman, of Freeland, Md. Name a person in 2011 world news in eight letters. Remove the third, fourth and fifth letters. The remaining letters, in order, will name a person in 2021 world news. What names are these?
This week's puzzle is another of the f_x(string1) = string2 variety. That is, we're looking for string1 and string2, where each is a word or phrase; we're given function x, and when we apply function x to string1 the result is string2.
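For this week's puzzle, function x is simple enough to write down directly. Here is a minimal sketch in Python; the function name `remove_third_to_fifth` and the placeholder input are my own illustrations and are not taken from the solver script.

```python
def remove_third_to_fifth(letters: str) -> str:
    """Drop the 3rd, 4th and 5th characters (1-indexed), keeping the rest in order."""
    # Python slices are 0-indexed: keep positions 0-1, skip 2-4, keep 5 onward.
    return letters[:2] + letters[5:]


# Placeholder input, not a puzzle answer: 8 letters in, 5 letters out.
print(remove_third_to_fifth("abcdefgh"))  # abfgh
```

Applied to an eight-letter string1, this always yields a five-letter string2, which is why the two candidate lists can be filtered by length before any comparison.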
Once again, we need two lists of words; more specifically in this case, we need two lists of names. In the preview, I suggested that we turn to Wikipedia, because it should be the fastest and easiest source. Wikipedia has a page for each year, listing major events from that year. Here are the 2011 and 2021 entries. I suggested that we collect the text from these articles and pipe it through a named entity recognition (NER) tool to get our candidate lists.
About that: there are tools in NLP built to perform exactly this task, NER. Like the other language models we've discussed, NER tools are typically trained on large amounts of text. The training can be supervised or unsupervised; supervised means the training data is manually labeled in the way we want, and unsupervised means the training data is unlabeled and the system must learn to do this classification on its own. An NER tool learns the kinds of contexts in which names occur, and when it sees a new word, it decides based on context whether that word is a name and labels it accordingly.
As always, I've posted my solver script to the companion GitHub repository for this blog. You'll also need the accompanying 2011 and 2021 Wikipedia text files uploaded there. I used the Stanford CoreNLP NER tool; specifically, I used the version implemented in the Stanza package for Python, with the default pre-trained English model, as you can see in the solver script. This tool provides different named entity labels. I used the default settings, which produce the labels: DATE, LOCATION, MONEY, ORGANIZATION, PERCENT, PERSON, TIME.
My script reads in the text files taken from Wikipedia and pulls out only the PERSON entities. For the 2011 names, it looks for names with 8 letters, and for the 2021 names, it looks for names with 5 letters. This time I learned my lesson from last week: we shouldn't assume! In this case, I didn't assume that "name a person in 8 letters" means a single string of 8 characters where each is a letter. Instead, when checking this rule, the script drops every non-letter character, then checks whether 8 characters remain. This means we also need to split the name on whitespace, then try reasonable combinations of the resulting strings. Namely, we want to try the first name only, for celebrities commonly referred to by first name alone. We also want to try first plus last name, and finally the last two words of the name, if there are more than two, because it could be a name like du Pont, de Soto, Von Trapp, etc.
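The "drop any non-letter, then count" check might look something like the sketch below. The helper names `letters_only` and `has_letter_count` are hypothetical, and the solver script in the repository may do this differently.

```python
def letters_only(name: str) -> str:
    """Keep only alphabetic characters, dropping spaces, hyphens, apostrophes, etc."""
    # str.isalpha() is also True for accented and other non-ASCII letters.
    return "".join(ch for ch in name if ch.isalpha())


def has_letter_count(name: str, n: int) -> bool:
    """True if the name has exactly n letters once non-letters are removed."""
    return len(letters_only(name)) == n


print(letters_only("O'Neal Jr."))        # ONealJr
print(has_letter_count("Van Halen", 8))  # True
print(has_letter_count("Van Halen", 5))  # False
```

Counting this way lets a two-word name pass the eight-letter test even though the raw string has nine characters including the space.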
I am assuming here that the names will more or less follow the most common patterns for names written with the English alphabet.
For the named entity "Eddie Van Halen," for example, we want to try the following combinations and see if we find eight letters:
- "Eddie"
- "Eddie Halen"
- "Van Halen"
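The three combinations above can be generated with a few lines of Python. This is only a sketch under the assumptions already stated (whitespace-separated names following common English-alphabet patterns); `name_candidates` and `count_letters` are illustrative names, not necessarily what the solver script uses.

```python
from typing import List


def name_candidates(entity: str) -> List[str]:
    """Return the name variants to test: first only, first + last, last two words."""
    parts = entity.split()
    if not parts:
        return []
    candidates = [parts[0]]                       # first name only
    if len(parts) >= 2:
        candidates.append(f"{parts[0]} {parts[-1]}")  # first plus last
    if len(parts) > 2:
        candidates.append(" ".join(parts[-2:]))   # last two words, e.g. "Van Halen"
    return candidates


def count_letters(text: str) -> int:
    """Count alphabetic characters only, ignoring spaces and punctuation."""
    return sum(ch.isalpha() for ch in text)


for candidate in name_candidates("Eddie Van Halen"):
    print(candidate, count_letters(candidate))
# Eddie 5
# Eddie Halen 10
# Van Halen 8
```

Of the three, only "Van Halen" has eight letters, so it is the only variant that would go into the 2011 candidate list.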