Saturday, January 30, 2021

Sound it out (Solution)

Now that the submission deadline has passed, it's time to post the solution. Spoiler warning: solution below! First, let's take another look at this week's Sunday Puzzle:

This week's challenge is a spinoff of my on-air puzzle, and it's a little tricky. Think of a hyphenated word you might use to describe a young child that sounds like three letters spoken one after the other.

Okay, we'd better step back and look at the on-air challenge. Here is the description:

Every answer today is a word or name that sounds like it starts with two spoken letters of the alphabet.

Example: Wanting what other people have --> ENVIOUS (N-V-ous)

The full on-air challenge has more examples, and you can read or listen at the link above.

So we need a list of candidate words, then we can check that they fit the rules: hyphenated, describes a young child, sounds like three letters spoken one after the other.

This week, I didn't come up with a script that gives a solution outright, but I did write a script to help me generate a list of words and then narrow it down to a number that I could manually search through.

My script uses word2vec: I give it a seed word, and it gives me the top k most similar words---most similar based on the contexts in which the words appear. Given that we have some specific form requirements (3 syllables, hyphenated), we probably want to generate a really large list of candidates, then filter them down so we keep only those matching the form requirements. In other words, we are prioritizing recall over precision here. This means we need quite a few seed words.

I struggled to come up with seed words on my own, so I decided to use BERT in masking mode: I gave it a few "fill in the blank" phrases, and it returned predictions for the blanks:

  • 'that little kid is such a [MASK]'
  • 'play with the [MASK] toddler'
  • etc.
I didn't include this part in the script. Instead, I manually filtered through the BERT predictions and kept the words that I thought would make good seed words for the task, e.g.:

'active', 'adorable', 'adventurous', 'angel', 'angry', 'animated', 'annoying', 'anxious', 'athletic', 'beauty', 'blessing', etc.

The seed words I used are coded into the script, so you can see the full list there if you'd like. The list is 125 words long.

I didn't pursue the phonemic dictionary approach that I discussed in the preview post. Instead, I went for a more "quick and dirty" approach. I generated a list of candidates from the seed words. Then I filtered out any results without a hyphen. My first attempt resulted in over 10,000 hyphenated words. This was too many to manually read through to find a solution, especially because it really takes a moment to read each one and consider whether the pronunciation "sounds like three letters spoken one after another."

In skimming through this long alphabetical list of candidates, I realized that some spans of words could be pruned out smartly with just a little more work, simply based on the first couple of letters. For example, no letter has a "name" that is pronounced with a consonant cluster. I decided to go through the alphabet, letter by letter, and for each letter list all the letter sequences that might possibly start a spelling that sounds like that letter. Then I could filter out any candidates that don't start with one of these allowed sequences. For example:

  • A: "a", "ei", 
  • B: "be", "bi"
  • C: "ce", "ci", "se", "si"
  • D: "de", "di", "dy"
  • etc.
Again, it's better to be too lax than too strict here--we'd rather keep a lot of bad candidates than filter out the correct one! In some cases, particularly the vowels, we can really only rely on the first letter. A could start with "ay" or "ei", but it could also simply be "a" as in "able" or "ate".
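To make this concrete, here's a minimal sketch of the prefix filter in Python. The onset lists are illustrative--a small sample, not my full alphabet table--and the candidate words are just examples:

```python
# Sketch of the prefix filter: keep only candidates whose first letters
# could plausibly begin the spoken name of some letter of the alphabet.
# The onset lists below are a small illustrative sample, not exhaustive.
ALLOWED_ONSETS = [
    "a", "ei",                # A
    "be", "bi",               # B
    "ce", "ci", "se", "si",   # C
    "de", "di", "dy",         # D
    "e", "i",                 # E, I (vowels: rely on the first letter alone)
    "ge", "gi", "je", "ji",   # G / J
    "pe", "pi",               # P
    "q", "cu", "kew",         # Q
]

def passes_prefix_filter(word):
    """True if the (dehyphenated) word starts with an allowed onset."""
    w = word.replace("-", "").lower()
    return any(w.startswith(onset) for onset in ALLOWED_ONSETS)

candidates = ["beauty-queen", "cutie-pie", "pint-sized", "gimlet-eyed"]
kept = [w for w in candidates if passes_prefix_filter(w)]
```

Note that a word like "frost-bitten" is dropped immediately: no letter name begins with a consonant cluster like "fr".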

I had the script run word2vec for my seed words with four different pre-trained models. This is probably overkill, but it's better to cast a wide net here. The rationale for using different models is that some are trained on newspaper text, others are trained on Twitter, etc. There might be words that appear on Twitter but not in news articles, or vice versa. With each of these four models providing the top 300 most similar words to each of the 125 seed words, and after running the results through the filters described above, 477 words remained. Most of them make no sense at all, and very few of them come anywhere close to sounding like three letters. Here are 20 of the most reasonable from that list, and the solution is among them:

    1. beauty-queen
    2. big-boned
    3. bitter-sweet
    4. cared-for
    5. cat-eyed
    6. city-born
    7. curly-headed
    8. cutie-pie
    9. denim-clad
    10. diva-esque
    11. gimlet-eyed
    12. girl-child
    13. pencil-thin
    14. pet-loving
    15. pie-eyed
    16. pink-cheeked
    17. pint-sized
    18. pixie-like
    19. teddy-bear
    20. teeny-bopper

    Did you spot the solution in that list? If so, you probably let out a little groan, right? I'm glad I didn't go the full phonemic dictionary route, because I wouldn't have caught the solution that way. The instructions did mention that this one would be tricky, but using a Greek letter is a bit of a stretch!

    Thanks for reading, Puzzlers. I'll see you for the next one!

    --Levi King

    Tuesday, January 26, 2021

    Sound it out (Preview)

    New week, new Sunday Puzzle:

    This week's challenge is a spinoff of my on-air puzzle, and it's a little tricky. Think of a hyphenated word you might use to describe a young child that sounds like three letters spoken one after the other.

    Okay, we'd better step back and look at the on-air challenge. Here is the description:

    Every answer today is a word or name that sounds like it starts with two spoken letters of the alphabet.

    Example: Wanting what other people have --> ENVIOUS (N-V-ous)

    The full on-air challenge has more examples, and you can read or listen at the link above.

    The weekly challenge is slightly different, however. Let's break it down.

    hyphenated word you might use to describe a young child

    As usual, we need a list of candidate words. They must be hyphenated and appropriate for describing a young child. In the past we've used word2vec: we give it a "seed" word and it gives us the top k similar words. We can brainstorm some seed words and do this again... precocious, mischievous, baby-faced, etc.

    sounds like three letters spoken one after the other

    Whew, okay, this looks tricky. What we really need is a phonemic dictionary (often referred to as a phonetic dictionary, but that's technically a misnomer). We'd also need a list of the phonemic transcriptions of the names of each letter of the alphabet. We would query the phonemic dictionary with each of our candidate words to get its transcription, then check that the transcription contains exactly three syllables and that those syllables appear in our list of letter-name transcriptions. I'm not sure what's out there in terms of free, publicly available phonemic dictionaries, so I'll have to poke around a little.
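    To illustrate what that check might look like, here's a small sketch. The letter-name transcriptions are simplified ARPAbet-style stand-ins I hand-coded for illustration only; a real run would pull transcriptions from an actual phonemic dictionary:

```python
# A tiny stand-in for the phonemic-dictionary approach. The letter-name
# transcriptions below are simplified ARPAbet-style approximations,
# hand-coded for illustration; a real run would use a full dictionary.
LETTER_NAMES = {
    "B": ["B", "IY"], "D": ["D", "IY"], "K": ["K", "EY"],
    "N": ["EH", "N"], "O": ["OW"], "V": ["V", "IY"],
}

def spells_three_letters(phonemes):
    """True if the phoneme sequence is exactly three letter names in a row."""
    def helper(seq, letters_used):
        if not seq:
            return letters_used == 3
        if letters_used == 3:
            return False
        for name in LETTER_NAMES.values():
            n = len(name)
            if seq[:n] == name and helper(seq[n:], letters_used + 1):
                return True
        return False
    return helper(list(phonemes), 0)
```

For example, a transcription like OW K EY D IY parses as O + K + D, while OW K EY alone stops short of three letters.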

    But what if we can't come up with a phonemic dictionary? It's not all bad. The way I read it, this rule does mean that the target word contains exactly three syllables. That's helpful. And how many possible combinations of three letters are there?

    Well, we have three positions, with 26 possibilities for each position, so that's 26^3 = 17,576.

    But we can do better than that. Can all 26 letters work here? There's at least one exception: w. The "name" of this letter, "double-u," is three syllables all by itself, so it's problematic here. So that leaves us with 25^3 = 15,625. Well, that's a little better, but still too many to manually work our way through. With a smaller number, perhaps we could just generate all those possibilities, then sit down and read through them out loud until we find a pronunciation that makes sense. And in fact, maybe we could reduce the total number by thinking more carefully about the 25 letter pronunciations; we may determine that certain sequences do not occur in English.
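    The counting above is easy to sanity-check with a couple of lines of Python:

```python
from itertools import product
from string import ascii_lowercase

# All three-letter sequences, then the same count with "w" excluded.
all_triples = list(product(ascii_lowercase, repeat=3))
no_w = [t for t in all_triples if "w" not in t]

print(len(all_triples))  # 26^3 = 17576
print(len(no_w))         # 25^3 = 15625
```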

    More likely, without a phonemic dictionary, we'd approach from the other end---generate a list of candidate words, keep only those with hyphens, then read through them manually to find one that fits. Given that we expect three syllables, we could probably further filter our candidates first using some range for the number of letters. I would guess, for example, that the target word contains between 5 and 11 letters.

    Good luck, Puzzlers! I'll see you on Friday, and I hope to have a solution by then!

    --Levi King

    Friday, January 22, 2021

    Landmarks, elements, and states (Solution)

    Time for this week's solution. Spoilers ahead!

    Here's the Sunday Puzzle again:

    This challenge comes from listener Gerry Reynolds of Chicago. Name a national landmark (6,3). Add the name of a chemical element. Rearrange all the letters to name two states. What are they?

    As we discussed, (US) states are a closed set of 50, and elements are a closed set of 118. But we need a list of national landmarks, which is an open set. I suggested we start with Wikipedia for a list of landmarks.

    Surprisingly, I didn't find what I wanted at Wikipedia. There are lists of landmarks by state, and national parks, etc., but nothing like a single big list of landmarks and tourist attractions.

    I did some web searching instead and found a nice clean list of 200 national landmarks.

    Given that we're dealing with three fairly small lists, I just hard coded the lists into my solver script (rather than storing them in files to be read by the script). As always, you can find the solver script on the companion GitHub repo for this blog. It's annotated with comments so it should be easy to follow. Basically, I simply followed the plan I laid out in the preview post:

    1. keep only landmarks that fit (6,3) (i.e., a 6-letter word followed by a 3-letter word)
    2. generate all landmark-element pairs
    3. generate all state-state pairs
    4. for each landmark-element pair, compare each state-state pair for a match:
      1. for comparison, for each pair: lowercase, remove non-letters, sort letters, join all letters to single sorted string; e.g., ["North Dakota", "Ohio"] --> "aadhhiknoooortt"
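    The comparison in step 4.1 can be written as a small signature function; a landmark-element pair matches a state-state pair when their signatures are equal:

```python
def signature(strings):
    """Lowercase, strip non-letters, sort the letters, join to one string."""
    return "".join(sorted(ch for s in strings
                          for ch in s.lower() if ch.isalpha()))

print(signature(["North Dakota", "Ohio"]))  # aadhhiknoooortt
```

Because the letters are sorted, the signature is the same no matter how the letters are rearranged across the two names, which is exactly what an anagram check needs.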
    And it worked. Let's get to the solution, below. Scroll slowly, I'm going to drop some clues along the way.
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    Clue 1. The partially-redacted string produced in step 4 above:

    ##eh###oor#v
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    Clue 2. There was only one landmark from my list of 200 that fit the (6,3) requirement. The first word (6) is spelled with the letters shown above in Clue 1.
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    Clue 3. The full string produced in step 4 above:

    adehimnoortv
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    Clue 4. Okay, you've probably solved it by now. If not, the states are:

    idaho, vermont
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    .
    The solution:

    hoover dam, tin / idaho, vermont


    See you next week!
    --Levi King

    Tuesday, January 19, 2021

    Landmarks, elements, and states (Preview)

    Welcome back, Puzzlers! Let's take a look at the current Sunday Puzzle from NPR and do some brainstorming.

    This challenge comes from listener Gerry Reynolds of Chicago. Name a national landmark (6,3). Add the name of a chemical element. Rearrange all the letters to name two states. What are they?

    Immediately we can see this is essentially the familiar puzzle format: f_x(string1) = string2. That is, we're looking for string1 and string2, where each is a word or phrase; we're given function x, and when we apply function x to string1 the result is string2.

    Two things are a little different this time, however. First, each string we're matching is actually a pair of strings---no big deal. Second, we're dealing with two closed classes and one open class. These are common concepts in language and linguistics. A closed class is a fixed list that doesn't change, like determiners (roughly articles) or prepositions. We could come up with a complete list of all English prepositions, and it really doesn't change much from year to year or even decade to decade. An open class is a list that expands, like verbs and nouns. We could try, but we'd never really come up with a complete list of these, and new words enter the language every year. Of course, we can list all 50 states easily, and we can list all 118 elements easily---not from memory, but that's okay!

    So we have states and elements covered, but we need a list of landmarks. As I often do, I suggest we start with Wikipedia and see if there are any pages from which we can easily extract a list.

    Assuming we have all three lists, how do we begin? Here's my plan:

    1. keep only landmarks that fit (6,3) (i.e., a 6-letter word followed by a 3-letter word)
    2. generate all landmark-element pairs
    3. generate all state-state pairs
    4. for each landmark-element pair, compare each state-state pair for a match:
      1. for comparison, for each pair: lowercase, remove non-letters, sort letters, join all letters to single sorted string; e.g., ["North Dakota", "Ohio"] --> "aadhhiknoooortt"
    That's pretty much it. Assuming we start with a robust list of landmarks that we are confident contains the target, this approach should work. I'll be back Friday to share it with you! Good luck!

    --Levi

    Friday, January 15, 2021

    Names in the News (Solution)

    Welcome back, Puzzlers!

    Did you solve the latest puzzle? I did, so consider that your spoiler warning -- the solution is shown later in this post.

    Let's get into it. Here's the latest Sunday Puzzle:

    This week's challenge comes from listener Michael Shteyman, of Freeland, Md. Name a person in 2011 world news in eight letters. Remove the third, fourth and fifth letters. The remaining letters, in order, will name a person in 2021 world news. What names are these?

    This week's puzzle is another of the f_x(string1) = string2 variety. That is, we're looking for string1 and string2, where each is a word or phrase; we're given function x, and when we apply function x to string1 the result is string2.

    Once again, we need two lists of words; more specifically in this case, we need two lists of names. In the preview, I suggested that we turn to Wikipedia, because it should be the fastest and easiest. Wikipedia has a page for each year, listing major events from that year. Here are the 2011 and 2021 entries. I suggested that we collect the text from these articles and pipe it through a named entity recognition (NER) tool to get our candidate lists.

    About that--NER is an established task in NLP, and there are tools built to perform it. Like the other language models we've discussed, they are typically trained on large amounts of text. This training can be supervised or unsupervised; supervised means the training data is manually labeled in the way we want, and unsupervised means the training data is unlabeled and the system must learn to do this classification on its own. An NER tool learns the kinds of contexts in which names occur, and when it sees a new word, it decides based on context whether that word is a name or not and labels it as such.

    As always, I've posted my solver script to the companion GitHub repository for this blog. You'll also need the accompanying 2011 and 2021 Wikipedia text files uploaded there. I used the Stanford CoreNLP NER tool; specifically, I used the version implemented in the Stanza package for Python, with the default pre-trained English model, as you can see in the solver script. The tool can assign various named entity labels. I used the default settings, which produce the labels: DATE, LOCATION, MONEY, ORGANIZATION, PERCENT, PERSON, TIME.

    My script reads in the text files taken from Wikipedia and pulls out only the PERSON entities. For the 2011 names, it looks for names with 8 letters, and for the 2021 names, it looks for 5 letters. This time I learned my lesson from last week--we shouldn't assume! In this case, I didn't assume that "name a person in 8 letters" means a single string of 8 characters where each is a letter. Instead, when checking this rule, the script drops any non-letter, then checks for 8 characters. This means we also need to split the name on whitespace, then try reasonable combinations of the resulting strings. Namely, we want to try using the first name only, for celebrities commonly referred to by first name only. We also want to try first plus last name, and finally the last two names, if there are more than two, because it could be a name like du Pont, de Soto, Von Trapp, etc.

    I am assuming here that the names will more or less follow the most common patterns for names written with the English alphabet.

    For the named entity "Eddie Van Halen," for example, we want to try the following combinations and see if we find eight letters:

    1. "Eddie"
    2. "Eddie Halen"
    3. "Van Halen"
    If "Eddie Van Halen" were in our 2021 text, we'd add "Van Halen" to our list of 2021 candidate names, because it's eight letters.
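    Here's a sketch of that combination step (the helper names are my own, not necessarily those in the actual solver script):

```python
def letter_count(s):
    """Count only the letters, ignoring spaces and punctuation."""
    return sum(ch.isalpha() for ch in s)

def candidate_forms(name):
    """First name alone; first + last; and the last two names, if any."""
    parts = name.split()
    forms = [parts[0]]
    if len(parts) >= 2:
        forms.append(parts[0] + " " + parts[-1])
    if len(parts) > 2:
        forms.append(" ".join(parts[-2:]))
    return forms

forms = candidate_forms("Eddie Van Halen")
# ["Eddie", "Eddie Halen", "Van Halen"]; only "Van Halen" has 8 letters
eight = [f for f in forms if letter_count(f) == 8]
```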

    Now that the script has lists of candidate names of the appropriate length, it can apply the transformation rules and see whether there are any matches. It starts with 2011 candidate 1, removes letters 3, 4 and 5, then looks at 2021 candidate 1 to see if it matches. If not, it looks at 2021 candidate 2, and so on. If there is no match for the transformed 2011 name in the 2021 list, it moves on to the next 2011 name and repeats.
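    A sketch of that matching loop (again, the function names here are illustrative, not lifted from the solver script):

```python
def transform(name):
    """Drop non-letters, then remove the 3rd, 4th, and 5th letters."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    return "".join(letters[:2] + letters[5:])

def find_matches(names_2011, names_2021):
    """Pair each 2011 name with a 2021 name its transformation matches."""
    targets = {n.lower().replace(" ", "") for n in names_2021}
    return [(name, transform(name)) for name in names_2011
            if transform(name) in targets]
```

For instance, transform("Van Halen") drops the space, then removes n, h, a, yielding "valen".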

    And it turns out that only one pair of names fits the pattern:

    bin Laden + Biden

    Thanks for reading! I'll see you for the next puzzle!

    --Levi King

    Monday, January 11, 2021

    Names in the News (Preview)

    As we prepare to crack our third puzzle, I want to reiterate what this blog is about and welcome any newcomers.

    Every Sunday on National Public Radio (NPR), the program Weekend Edition Sunday airs a segment called Sunday Puzzle. It's been a recurring segment since the 1980s, I believe. Will Shortz, the editor of the New York Times crossword puzzle and "NPR Puzzlemaster," hosts the segment. Each week, there is a "listener challenge" puzzle, and listeners have until Thursday to submit the answer. A winner is drawn from the correct entries, and that person gets to call in and play an on-air challenge with Will Shortz (and Weekend Edition Sunday host Lulu Garcia-Navarro) during the subsequent Sunday Puzzle.

    The puzzle is typically some kind of word puzzle. In this blog, I focus on applying natural language processing (NLP) to find the solution, hence "Natural Language Puzzling." I've listened to the puzzle for years, often brainstorming what NLP methods or tools I would use. Now I've committed to working out these puzzles on this blog. I keep the writing fairly non-technical, so I'm hoping it will be accessible to a broad range of puzzlers. I also upload any scripts I put together to solve a puzzle so those who want a closer look can download them. I post a preview on Sunday or Monday, discussing the new puzzle and possible approaches. Then, I post my approach and the solution on Friday (after the Thursday deadline so as not to spoil it).

    This week's puzzle is another of the f_x(string1) = string2 variety. That is, we're looking for string1 and string2, where each is a word or phrase; we're given function x, and when we apply function x to string1 the result is string2. Our function x is, as always, some kind of transformation, like "remove two letters"  or "swap the first and last letters."

    Here's this week's puzzle:

    This week's challenge comes from listener Michael Shteyman, of Freeland, Md. Name a person in 2011 world news in eight letters. Remove the third, fourth and fifth letters. The remaining letters, in order, will name a person in 2021 world news. What names are these?

    I confess that I solved this one already, sans NLP. This was one of those rare times when the answer popped right out at me. Nonetheless, I want to see if I can put together a script using NLP approaches that will spit out the correct solution.

    So what do we need to solve this one? We know we need two "persons" from world news. The wording here seems deliberate; we don't know if these are full names, last names, maybe even nicknames. Crucially, we don't know if we should expect spaces within the strings.

    Like the previous two problems, we'll want to start with two lists of candidate words (names), then iterate through list 1 to see if the transformation function produces a word from list 2.

    Where can we get these candidate word lists?

    1. Search the web: Maybe we can find lists that others have posted for "names in the news" for  2011 and 2021
    2. News text: For example, if we had a corpus of Wall Street Journal articles from 2011, we could extract a list of names using a Named Entity Recognition (NER) tool, like the one included in the Stanford Core NLP package. It might be hard to come by such datasets for free, however.
    3. Twitter: I suspect we might be able to find datasets that contain a large random sample of Tweets, possibly divided into months or years in a way we can use. The NER approach would work there too. It might be less accurate, because news text follows strict style guidelines and is thus relatively more predictable than Twitter text, which follows no guidelines.
    4. Wikipedia: I saved my preferred approach for last. Preferred because it should be the fastest and easiest. Wikipedia has a page for each year, listing major events from that year. I think there are also more specific year-in-review type articles, too, that focus on entertainment news, politics news, sports news, etc. It would be fairly straightforward to collect the text from these articles and pipe it through an NER tool to get our candidate lists.
    With candidate lists in hand, we need to script the transformation function and iterate.

    That's my plan. I'll be back on Friday to walk you through it.

    --Levi King

    Friday, January 08, 2021

    Cooking and music words: Solution; and the importance of style

    The submission deadline has passed, so let's solve this puzzle! And spoiler warning -- the solution will be revealed in the second half of this post. Again, here's the Sunday Puzzle from 1/3/2021:

    This week's challenge comes from listener Robert Flood of Allen, Texas. Think of a seven-letter hyphenated word for a kind of cooking. Change the middle letter to get a new word describing a kind of music. What words are these?

    In the preview post, I suggested that we start with a list of cooking words and a list of music words. How do we get such word lists? Here are some ideas:
    1. Search the web for lists. We might get lucky and find that someone has already published lists like those we need.
    2. Scrape relevant sources on the web. We could come up with a handful of food and recipe sites like allrecipes.com and epicurious.com and music sites like pitchfork.com and last.fm. Then we could use a couple of Python tools to "scrape" these sites and clean the text. But then what? We'd need to use something like term frequency-inverse document frequency (tf-idf) to find words that occur at a higher rate in the cooking and music documents than they occur in a mixed collection of documents--this is basically what a "word cloud" is. This approach would require significant effort.
    3. Wikipedia. Here's a page listing music styles, for example.
    4. Language models. We could use a pre-trained language model to predict words that are similar to known cooking and music words, or that are found in similar contexts.
    Approach 2 would be fun but also a lot of effort. Let's try that on some future puzzle. In my previous post, I had success with a language modeling approach, so I think we'll start with approach 4 above.

    Let's return to BERT in masking mode. BERT is a sentence embedding tool that trains on massive amounts of text to create a language model. Such models can be used for a variety of tasks, like scoring how similar two sentences are, or predicting the next word or sentence in a document. In our case, we'll use it to predict words that fill in a blank. This will give us lists of candidate words.

    Aaaand... that won't work. I started trying some mask queries on BERT and quickly realized that it does not produce any hyphenated words. I'm no expert in BERT, but I suspect that when it tokenizes the training data, it breaks up hyphenated words. In other words, if it isn't trained on hyphenated words, it won't include them in its predictions. Perhaps there's a way around this, but I didn't find a quick fix.

    Word2vec is an approach to language modeling that predates BERT. It also trains on massive amounts of text. It basically learns where each word in the vocabulary occurs on average in relation to each other word in the vocabulary, and represents this information as a vector (a list of numbers). Words that occur in similar contexts have similar vectors. Vectors can be compared using cosine to measure word similarity.
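    Cosine similarity itself is simple to compute. Here it is on toy three-dimensional vectors (real word2vec vectors have hundreds of dimensions, and Gensim computes this for you):

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similar = cosine([1.0, 2.0, 0.5], [1.1, 1.9, 0.4])      # close to 1.0
dissimilar = cosine([1.0, 2.0, 0.5], [-2.0, 0.1, 3.0])  # much lower
```

Vectors pointing in nearly the same direction score close to 1; unrelated directions score near 0 (or below).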

    We can use the word2vec implementation in Gensim, an open source Python package for textual similarity and related tasks. The tool can take a "seed word" and give us the top n similar words.

    I'm going to start with a list of seed words for each of the two lists we need. First, let's list some seed words for "a kind of cooking". I started with this Wikipedia page for "Cooking Methods". After dropping the multi-word terms (to keep this simple), and adding a base verb form for the gerund ("-ing") forms listed there, I have 58 cooking seed words, like these:
    • bake
    • baking
    • barbecue
    • barbecuing
    • blacken
    • blackening
    • etc.
    And some seed words for "describing a kind of music". I couldn't find a comparable list, so I poked around Wikipedia and the web and cobbled together my own list of 77 terms like these:
    • a-capella
    • acapella
    • acoustic
    • adagio
    • allegro
    • anthem
    • anthemic
    • etc.
    From there, I put together a script to create a list of cooking words by running each of my seed words through word2vec, generating a list of the 100 most similar words for each seed word. There is overlap in the results, so duplicates are removed.

    My script then does the same for my music seed words.

    At this point, we have a few thousand words in each list, and we expect that the target cooking word and the target music word are among them.

    Now the script simply needs to iterate through them, applying the rules of the puzzle to find a match. So first it prunes our list of cooking words, keeping only words that would be 7 characters if we removed hyphens. Similarly for music words, only 7-letter words are kept.

    Finally, the script iterates through the remaining cooking words, one by one. It deletes any hyphens in the cooking word, leaving only 7 letters. Then it starts iterating through the music words, one by one. If the first 3 letters don't match with the cooking word, it moves on to the next music word. If the first 3 letters do match, it then checks to see if the final 3 letters match, and finally checks that the 4th (middle) letter does not match. If it checks a cooking word against all the music words and doesn't find a match, it moves on to the next cooking word and iterates through the music words again.
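    That check boils down to a few string comparisons. Here's a sketch with a pair that fits the form (though obviously not the meanings):

```python
def is_match(cooking_word, music_word):
    """Dehyphenated 7-letter words sharing the first 3 and last 3 letters
    and differing in the middle (4th) letter."""
    c = cooking_word.replace("-", "")
    m = music_word.replace("-", "")
    if len(c) != 7 or len(m) != 7:
        return False
    return c[:3] == m[:3] and c[-3:] == m[-3:] and c[3] != m[3]

print(is_match("sub-sect", "subject"))  # True: the form fits
```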

    Does it work?

    Well, yeah... But there were some interesting failures first.

    At first, I was only getting a handful of word pairs that fit the form requirements but weren't actually cooking or music words. Sometimes a seed word can have multiple word senses, so our results might include some unwanted words. "Smoking," for example, is a cooking seed word, but it has multiple non-cooking meanings too. One example pair was "sub-sect" and "subject"---the pattern fits but the meanings do not.

    Ultimately, I ended up manually looking through a list of the generated music words, and the target word caught my attention. I quickly realized why my script wasn't working---the cooking word has 2 hyphens, and I had mistakenly assumed it had only 1 hyphen. You know what happens when you assume, right?

    The solution:
    bar-b-que and baroque

    Once I knew what the script should be finding, I went back to fix it. No problem.

    But... it still didn't include bar-b-que in the results, so the match was not found. Argh. What gives?

    I think this is where things get really interesting! The Gensim word2vec implementation currently has 11 pre-trained English word vector models. Each model is trained on different datasets. (Well, not exactly---some are just larger "more detailed" models derived from the same data as smaller models). The training sets fall into roughly 3 categories:
    • Wikipedia text
    • Newspaper text
    • Twitter text
    I started with a model trained on Wikipedia and newspaper text (glove-wiki-gigaword-100). Do you see why this might be a problem?

    I've had enough journalism-ish experience to know that news publications use strict rules like the AP Stylebook. This dictates things like grammar and punctuation (Oxford commas? Yes!). It also lists how to spell some particular words, especially those with multiple possible spellings and/or foreign words and names. Hanukkah not Chanuka; Qur'an not Koran.

    Heck, I even found this old tweet from the AP:
    Also, presumably, not bar-b-que! The point here is that bar-b-que would likely never appear in the newspaper text, and thus it isn't even in the model's vocabulary. (I don't know as much about Wikipedia, but I believe it observes similar strict guidelines.)

    I switched to a model trained on Twitter, which of course has no style guide, and the correct solution popped right out!

    Lessons from this puzzle: Question your assumptions and consider your training data!

    For anyone interested, the solver script is on GitHub. I also uploaded a related script that I used to run each model interactively, giving it a seed word to see if it produces a given target word (bar-b-que).

    Thanks for reading, and see you next week!

    --Levi King

    Monday, January 04, 2021

    Cooking and music words (preview)

    Welcome back to Natural Language Puzzling. Let's take a look at the Sunday Puzzle from yesterday, 1/3/2021

    This week's challenge comes from listener Robert Flood of Allen, Texas. Think of a seven-letter hyphenated word for a kind of cooking. Change the middle letter to get a new word describing a kind of music. What words are these?

    I will post my solution on Friday, after the Thursday deadline for submissions. I don't want Will Shortz to come break my kneecaps, after all. For now, let's do a bit of brainstorming.

    A "word for a kind of cooking". It's not super clear but I'm assuming this is either a method, like grill or broil, or possibly a cuisine, like French or Turkish. It would be great if we had a long list of words like these. From there, we could eliminate any that aren't a seven-letter hyphenated word.

    A "word describing a kind of music". This is probably a genre, like punk or bluegrass, or maybe it's an adjective used to describe music, like up-beat or pizzicato. Again, if we had such a list we could filter to keep only seven-letter hyphenated words.

    If we had such lists, we could simply compare them pairwise until we find a pair where only the middle letter differs.
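    The pairwise comparison can be sketched as a simple filter. Here the input lists are hypothetical placeholders for whatever cooking and music word lists we eventually assemble, and I'm reading "seven-letter hyphenated word" as a seven-character string containing a hyphen:

    ```python
    def middle_letter_pairs(cooking_words, music_words):
        """Find pairs of seven-character hyphenated words that differ
        only in the middle (4th) character."""
        def fits_form(w):
            return len(w) == 7 and '-' in w

        pairs = []
        for a in filter(fits_form, cooking_words):
            for b in filter(fits_form, music_words):
                # same first three and last three characters, different middle
                if a[:3] == b[:3] and a[4:] == b[4:] and a[3] != b[3]:
                    pairs.append((a, b))
        return pairs
    ```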

    So how do we get these lists of cooking words and music words?

    See you Friday with the solution (I hope)!

    --Levi

    Saturday, January 02, 2021

    First challenge: 12/20/2020: Christmas Wishes

    Let's kick off this project by looking at the most recent challenge, from Sunday, December 20, 2020. The deadline for submitting solutions has already passed, but let's see if we can find one anyway.

    This week's challenge: This week's challenge comes from listener Dan Pitt, of Palo Alto, Calif. Take the name BUENOS AIRES. Remove one letter. The remaining letters can be rearranged to name two things that many people wish for around this time of year. What are they?

    Okay, so we're looking for two words, and these are things that people wish for around the Christmas (or Hanukkah?) season. I suppose these could be actual gifts like "toys" or "chocolates" or "socks", or they could be more abstract like "peace" for the new year, or "wealth", or "snow" for a white Christmas.

    BUENOS AIRES has 11 letters, and we have to drop 1, so we need two words containing 10 letters total. I think it's safe to assume neither word will be just one or two letters long, so our two words will use one of these combinations of letters:

    • 3+7
    • 4+6
    • 5+5
    Here's a rough sketch of the approach I'm thinking of:
    1. Start with a list of candidate words, W;
    2. Remove any word in W that cannot be spelled with BUENOSAIRES;
    3. For each word wd_a in W, find every other word wd_b in W for which the length of wd_a plus the length of wd_b equals 10; store these pairs as a list P;
    4. Remove any pair in P for which wd_a and wd_b together cannot be spelled with the letters of BUENOSAIRES (each letter used at most as often as it appears);
    5. Print remaining pairs and find solution among them.
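    The letter-availability checks in these steps are a natural fit for collections.Counter. Here's a minimal sketch of steps 2-4, with a tiny placeholder word list standing in for the real candidate list W:

    ```python
    from collections import Counter

    POOL = Counter("buenosaires")

    def spellable(word, pool=POOL):
        """True if `word` can be spelled from the letter pool, using each
        letter at most as often as it appears in the pool."""
        need = Counter(word)
        return all(need[c] <= pool[c] for c in need)

    def find_pairs(words):
        """Keep spellable words, then find pairs whose combined length is 10
        and whose combined letters fit within BUENOSAIRES."""
        cands = [w for w in words if spellable(w)]
        pairs = []
        for i, a in enumerate(cands):
            for b in cands[i:]:
                if len(a) + len(b) == 10 and spellable(a + b):
                    pairs.append((a, b))
        return pairs
    ```

    Because the pool has 11 letters and each pair uses 10, checking that the combined pair is spellable from the pool is equivalent to checking that it uses all of BUENOSAIRES minus exactly one letter.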

    Sounds great, but... how do we get a list of candidate words in step 1 above? Similar challenges might give us more to go on here, such as "women's names" or "hobbies," for example. In this case, we only know that the two words are things wished for around Christmas time. We may be able to use a sentence embedding model like BERT to generate a list of possibilities.

    BERT can be used in a "masking" mode. In this mode, the tool infers the most likely words to fill in a blank. The example given on the HuggingFace page is:

    "Paris is the [MASK] of France."

    The top prediction is "capital", but other predictions include "center", "heart", "city", etc.

    Perhaps now you can see where I'm going with this. We can construct a few queries to collect our candidate words:
    • "I'm wishing for [MASK] this Christmas"
    • "Classic Christmas gifts like [MASK]"
    • "I was so happy when I got [MASK] for Christmas"
    • etc.
    Then we can continue with our approach, filtering for length and spelling.
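    The candidate-collection step looks something like the sketch below. To keep it self-contained, `fill_fn` stands in for HuggingFace's fill-mask pipeline (e.g. the result of `pipeline("fill-mask", model="bert-base-uncased")`), which takes a query containing [MASK] and returns a list of prediction dicts with a "token_str" key:

    ```python
    def collect_candidates(fill_fn, queries, k=100):
        """Pool the top-k predicted mask fillers across all queries."""
        candidates = set()
        for q in queries:
            for pred in fill_fn(q)[:k]:
                candidates.add(pred["token_str"].strip().lower())
        return candidates
    ```

    With transformers installed, passing the real pipeline as `fill_fn` works directly (its `top_k` argument can replace the slicing here).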

    I've implemented all of this in a Python script. You can find it at my companion GitHub Repo for this blog, linked here.


    The functions I've implemented there are annotated to describe what each one does, so it should be fairly easy to follow. I admit it's not an optimal implementation--I could probably simplify it or remove some redundant bits--but it works and produces a solution in just a few seconds, so I'm leaving well enough alone.

    The deadline has passed and the solution will air tomorrow, but if you want to work this out for yourself, here's your warning that I'll be spoiling the solution later on this page!

    That is to say--my approach worked.

    Here is the full set of queries I ran through the BERT masking model:

    • All I want for Christmas is [MASK]
    • All I want for Christmas is a [MASK]
    • [MASK] are selling out this Christmas
    • sold out of [MASK] this Christmas
    • I'm wishing for [MASK] this Christmas
    • I'm wishing for a [MASK] this Christmas
    • I got [MASK] for Christmas
    • I got a [MASK] for Christmas
    • Classic Christmas gifts like [MASK] and 
    • [MASK] is the hottest Christmas gift
    • I wish for [MASK] this Christmas
    • asked Santa Claus for [MASK]
    • asked Santa for [MASK]
    • asked Santa Claus for a [MASK]
    • asked Santa for a [MASK]
    • best Christmas gift was [MASK]
    • Christmas gift was a [MASK]
    • gave me a [MASK] for Christmas
    • gave me [MASK] for Christmas
    • the gift of [MASK] this Christmas
    I didn't investigate exactly which of these queries turned up the words in the solution, but finding out would be fairly simple with a small modification to the code.

    In this task, we're mostly concerned with recall rather than precision; we don't mind turning up some bogus solutions so long as the correct solution is among them.

    I found that by setting k, the number of predictions per query, to 100, I get exactly one solution--the correct one.

    If I set k to 1500, I get the 17 pairs below. Most of the wrong answers here don't make sense, but I think beans and euros would be a pretty good Christmas gift!

    Spoiler alert: The correct response is in this list.
    • ['air', 'bonuses']
    • ['aires', 'bones']
    • ['aires', 'bonus']
    • ['are', 'bonuses']
    • ['ari', 'bonuses']
    • ['beans', 'euros']
    • ['beans', 'rosie']
    • ['beau', 'sirens']
    • ['bee', 'russian']
    • ['been', 'russia']
    • ['ben', 'serious']
    • ['bone', 'raises']
    • ['bone', 'russia']
    • ['bones', 'raise']
    • ['bonus', 'raise']
    • ['bruises', 'neo']
    • ['bruises', 'one']
    Thanks for reading! See you for the next puzzle.
    --Levi K

    Welcome to Natural Language Puzzling

    Happy New Year and welcome to my new blog, Natural Language Puzzling!

    This is a project I've been mulling over for several months, and what better time to begin than at the start of a new year? I've been an academic and professional in the field of natural language processing (NLP, aka computational linguistics or "data science for language") for about a decade now, and it's time to put those skills toward something fun. On this blog, I'll show how we can use the tools and methods of NLP to solve word puzzles, riddles and the like. I plan to focus primarily on the weekly Sunday Puzzle from Weekend Edition on NPR.

    Each Sunday, Will Shortz, editor of the New York Times crossword puzzle, hosts the Sunday Puzzle, a roughly five-minute segment featuring two puzzles. The first is the on-air challenge, in which a caller is presented with a series of rapid-fire puzzle questions, usually following some pattern. The segment then closes with the weekly challenge: Shortz reads a puzzle, and listeners have until that Thursday to submit their solutions. A winner is drawn at random from the correct responses, and that person becomes the caller for the next on-air challenge.

    This blog will focus on these weekly challenges. I won't be spoiling the answers, however; I plan to post solutions after the Thursday deadline only--assuming I'm able to find the solution, that is! I may also discuss how NLP could augment the on-air puzzles, and sometimes I'll look at older puzzles or puzzles from other sources entirely. I also expect to recruit occasional "guest bloggers" to help when I get stuck or feel that a particular problem plays well to a colleague's strengths or expertise.

    In terms of the kinds of NLP tools I'll be discussing here, you can expect to see Python scripting; NLP and data science packages like NLTK and spaCy; and tools like Stanford CoreNLP and BERT models. We'll no doubt use methods like part-of-speech tagging, dependency parsing, word embeddings, n-gram language modeling, sentence transformers, tf-idf, mutual information, similarity metrics, machine translation, web scraping, code and cipher cracking, and so on. My goal is to provide practical examples of how such tools can be used to solve various kinds of language problems. Most of this will involve solving puzzles, but the methods and tools discussed here apply to many practical problems as well.

    I'd love to see this become a collaborative space where others can share their thoughts on how to solve problems with NLP as well. There's rarely a single "right" approach to this.

    Director, anagram, film award

    Welcome back to Natural Language Puzzling, the blog where we use natural language processing and linguistics to solve the Sunday Puzzle from...