Monday, March 29, 2021

Two Bird Words, One Function (Preview)

Alrighty, we're getting back on track. I didn't post a solution to the last puzzle because I didn't come up with one. IMO, it was a pretty lame one anyway, but let's try for a win this week!

Let's take a crack at the current puzzle:

This week's challenge comes from listener Greg VanMechelen, of Berkeley, Calif. Name something birds do. Put the last sound of this word at the start and the first sound at the end, and phonetically you'll name something else birds do. What are these things?

Okay, time to break it down.

  • Here we have the familiar puzzle format: transform(string_a) = string_b
    • In other words, we're looking for string_a and string_b; we have a transformation function (swap initial and final sounds), and when we apply the function to string_a, we should get string_b;
  • "something birds do" / "this word" / "something else birds do"
    • This indicates string_a is a single word. Most likely, string_b would also be one word, but there's nothing in the clue to confirm that, so we should keep that in mind.
    • Both strings are probably verbs (or maybe a verb phrase for string_b);
      • we'll probably need to try a few verb forms (infinitive, progressive, simple present 3rd person singular and plural...);
  • "the last sound" / "the first sound"
    • "sound" is not exactly a technical term; we don't know if this is a single phoneme, a consonant cluster, or maybe a whole syllable, so we'll probably have to consider all of these;
    • in linguistic terms, a syllable is an onset + nucleus + coda;
      • the nucleus is essentially the vowel sound and is required for any syllable;
        • words like "Oh" or "eye" or "a" are single syllable words with a nucleus only;
      • the onset is optional; this can be a single consonant phoneme or a consonant cluster;
        • "b" in "bay", "tr" in "tray", "spr" in "spring";
      • the coda is optional; again, a consonant or consonant cluster;
        • "k" in "book", "ng" in "spring", "lt" in "bolt";
  • B: a list of bird activity words:
    • As usual, we need a list of words to start with. This time just one list, since string_a and string_b are both coming from the same list.
    • I will probably use BERT in masked-language-model mode, Word2Vec, or some combination of the two to obtain this list of words.
      • BERT can suggest words to fill in a blank, e.g.:
        • "This time of year, most birds need to [MASK] before the weather changes."
      • Word2Vec can take a seed word and give us words that appear in similar contexts, e.g.:
        • fly --> soar, glide, flap, climb, dive, hover, wing, etc.
One approach would be to simply generate a list of a few dozen words and then manually look through them until I find something that makes sense. I'll probably start from there.

A more powerful approach would be to rely on a phonemic dictionary, where I can look up the phonemic transcription (or transcriptions, since many words have more than one) for a given spelling. I'd take my list of words (B), store all their pronunciations, and iterate through them, applying a function that slices off some number of initial and final phonemes and swaps them, then look for a matching pronunciation in the same list. This is sort of overkill, but I've been wanting to attack a pronunciation-based puzzle in this way, so if the fast and dirty approach fails, I'll try to make time for it this week. As for the phoneme-swapping function, we'd want to try various recombinations for each candidate, since we don't know how much of the word counts as a "sound". One thing that could be useful here would be to apply some phonotactic rules to identify the onset, nucleus and coda. Just something to think about.
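The swap-and-look-up idea can be sketched roughly as follows. This is only a minimal illustration, assuming pronunciations are stored as tuples of ARPAbet-style symbols with stress digits (the convention used by the CMU Pronouncing Dictionary). The function names and the tiny `PRONS` dictionary are made up for the example, and the words in it are deliberately not bird words. It tries two readings of "sound": the whole leading/trailing consonant cluster, and a single phoneme at each edge.

```python
# ARPAbet vowel symbols with stress digits removed; anything else counts as a consonant.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

def strip_stress(phones):
    """Drop stress digits, e.g. 'IY1' -> 'IY'."""
    return tuple(p.rstrip("012") for p in phones)

def split_edges(phones):
    """Split into (leading consonants, middle, trailing consonants)."""
    phones = strip_stress(phones)
    start = 0
    while start < len(phones) and phones[start] not in VOWELS:
        start += 1
    end = len(phones)
    while end > start and phones[end - 1] not in VOWELS:
        end -= 1
    return phones[:start], phones[start:end], phones[end:]

def swap_variants(phones):
    """Return candidate pronunciations with first and last 'sounds' swapped."""
    onset, middle, coda = split_edges(phones)
    variants = {coda + middle + onset}               # swap whole clusters
    p = strip_stress(phones)
    if len(p) >= 2:
        variants.add(p[-1:] + p[1:-1] + p[:1])       # swap single phonemes
    return variants

def find_swap_pairs(prons):
    """prons maps word -> list of pronunciations; return sorted (word, match) pairs."""
    index = {}
    for word, variants in prons.items():
        for phones in variants:
            index.setdefault(strip_stress(phones), set()).add(word)
    pairs = set()
    for word, variants in prons.items():
        for phones in variants:
            for swapped in swap_variants(phones):
                for other in index.get(swapped, ()):
                    if other != word:
                        pairs.add((word, other))
    return sorted(pairs)

# Hypothetical toy dictionary, just to exercise the functions.
PRONS = {
    "pat":   [("P", "AE1", "T")],
    "tap":   [("T", "AE1", "P")],
    "cheap": [("CH", "IY1", "P")],
    "peach": [("P", "IY1", "CH")],
    "soar":  [("S", "AO1", "R")],
}

print(find_swap_pairs(PRONS))
```

With a real pronunciation dictionary, `PRONS` would be restricted to the bird-activity list (B), and every verb form of each word would need its own entry.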

Okay, good luck with this one! I'll be back after the submission deadline to share what I've come up with.

--Levi

Wednesday, March 24, 2021

14 capital letters (Preview)

Hi Puzzlers! I got behind on my weekly posts here but I'm back now!

Let's take a look at this week's puzzle:

This week's challenge comes from Ed Pegg Jr. of Champaign, Ill. Take the phrase ZANY BOX KEPT HIM. Write it in capital letters. Something is special about the 14 letters in this sentence that sets them apart from all the other 12 letters of the alphabet. What is it?

Okay, this is different! I don't see much of a role for natural language processing here, but we'll try for a solution anyway.

The target letters (14):

  • ZANY BOX KEPT HIM
  • A B E H I K M N O P T X Y Z (alphabetized)
The remaining letters (12):
  • C D F G J L Q R S U V W
The fact that capital letters are used might suggest the solution has something to do with the actual shape of the capital letters--something that likely wouldn't hold true using the lowercase letters. Nothing pops out at me. We have various forms of symmetry and asymmetry in both groups. Open and closed forms in both groups. Varying numbers of pen strokes in both groups. Speaking of which...

The puzzle specifies that we should write the capital letters--I'm guessing it could be relevant that we actually write this by hand.

I just took a couple minutes to write it out by hand, multiple times. Also the remaining letters. Nothing is really occurring to me.

I also took a photo of my handwritten ZANY BOX KEPT HIM and tried rotating it, mirroring it, flipping it upside down, etc. Again, I don't see any pattern.

One big question here is this: Is the solution something inherent about the characters and their orthography, or are they related by some kind of real world knowledge? For the latter, for example, the 14 letters could represent all the letters used in Roman numerals (they don't). Or something along those lines, which would mean the 14 letters represent a real world closed class.

If we are interested in a closed class of some real world thing, and if the orthography is critical, then this real world class would have to be case sensitive, if you will. For example, US state postal abbreviations are always written in capital letters. Nothing comes to mind here. Letters from both groups appear in state abbreviations. 
Capitalization is crucial in the periodic table of elements, too, where both capital and lowercase letters appear. Maybe one of the two groups contains all the single letter abbreviations for elements. And... nope.
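Checking a candidate "closed class" against the two groups is easy to automate. Here is a minimal sketch: the helper name `fits_one_group` is just something made up for illustration, and the two candidate sets are the Roman numeral letters and the single-letter element symbols mentioned above.

```python
import string

phrase = "ZANY BOX KEPT HIM"
target = set(phrase.replace(" ", ""))          # the 14 letters in the phrase
rest = set(string.ascii_uppercase) - target    # the other 12 letters

def fits_one_group(candidate):
    """True if every letter of the candidate class falls in the same group."""
    candidate = set(candidate)
    return candidate <= target or candidate <= rest

roman_numerals = "IVXLCDM"
single_letter_elements = "HBCNOFPSKVYIWU"      # H, B, C, N, O, F, P, S, K, V, Y, I, W, U

print(sorted(target))
print(sorted(rest))
print(fits_one_group(roman_numerals))          # False: I, X, M vs. V, L, C, D
print(fits_one_group(single_letter_elements))  # False: letters land in both groups
```

Any new idea for a real world class can be dropped in as another string and tested the same way.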

Okay, I'll have to mull this one over for a while.

Sunday, March 07, 2021

Two Companies (Follow-up)

Now that the solution is out, let's take another look at this Sunday Puzzle:

This week's challenge comes from Joseph Young of St. Cloud, Minn. I'm looking for the names of two companies. One of them has a two-part name (5,5). The other has a three-part name (5,7,5). The last five-letter part of the two names is the same. And the first five-letter part of the first company's name is something the second company wants. What is it?

I'm disappointed I didn't solve this one, but perhaps it's cold comfort that the relevant companies simply weren't on my list. It's hard to believe that they weren't included in any of the lists of biggest companies that I cobbled together for this puzzle. I had over 28,000 companies in my final list. I've added the correct companies now, and my script does find them among the possible solutions.

So what were the companies?


OK, on to the next one!

Saturday, March 06, 2021

Two Companies (Solution - UNSOLVED!)

 Welp. I'm finally stumped! Did you solve it?

Let's take another look at the Sunday Puzzle:

This week's challenge comes from Joseph Young of St. Cloud, Minn. I'm looking for the names of two companies. One of them has a two-part name (5,5). The other has a three-part name (5,7,5). The last five-letter part of the two names is the same. And the first five-letter part of the first company's name is something the second company wants. What is it?

In the preview post, I shared this breakdown:



    • Lists:
      • C: list of companies;
        • We can try scraping text from the web then running a named entity recognizer (NER) tool to extract all the corporation names;
        • More likely, we can find a list of company names already on the web;
        • Stock market listings might be helpful;

    • Functions:
      • filter_names(C,[word lengths]): This function should take our list C and a list of integers representing the lengths of each word in the company name; it should return only the company names from C that match the word lengths.
        • For example, to find companies matching the second company's word lengths, we'd use the function like this: filter_names(C,[5,7,5])
      • proposition_probability(some_sentence): This function should generate a score between 0 and 1 to represent the likelihood of a sentence. 
        • We'll use this to solve the last part of the puzzle. We know that Company A is two words, (5,5). Let's call these cax and cay. Let's call the (5,7,5) company Company B, or cb. So if we have a rich language model and/or a knowledge graph, we can take our lists of companies with matching word lengths and the matching final 5-letter word, and iterate through to find the most likely propositions.
        • For example, we'll take each cb candidate, then iterate through each ca candidate; we'll use a sentence template to construct propositions that we can evaluate with a model. An example might be: "[cb] is looking to acquire more [cax]". (Most likely, we'd want to use a few variations on this and take the average.) So we'll use this function to evaluate all our candidates; we can rank them by probability, then manually skim through the most probable propositions to find the solution.
        • What resource can we use here? I've been using BERT for various language modeling tasks lately, so I plan to do a little reading and see if it seems suitable. I'm not sure if it really captures the kind of knowledge we need involving specific companies.



      So that's pretty much what I tried so far. I cobbled together a huge list of companies -- 28,605 to be exact. From these, I do a lot of pre-processing to expand the list with alternate versions of each name. For example, I look for a lot of generic words like "corp" and "inc" and drop those. Then I just find all the Company A (5,5) candidates, and all the Company B (5,7,5) candidates. Then we find all pairs where Company A word2 matches Company B word3 (the final five-letter word of each name). Then I use some templates to create sentences and evaluate these with BERT, which returns a list sorted by the score.
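The string-handling part of that pipeline (everything before the BERT scoring) might look something like the sketch below. It is not the actual script: the `GENERIC` word list, the function names, and the sample input are assumptions made for illustration, and "Alpha Freight Shack" is an invented name that only exists to give the matcher something to find.

```python
# Generic suffix words to drop when building alternate versions of a name (assumed list).
GENERIC = {"corp", "corporation", "inc", "co", "company", "ltd", "llc"}

def name_variants(name):
    """Return the cleaned name plus a version with generic words removed."""
    words = name.replace(",", " ").replace(".", " ").split()
    trimmed = [w for w in words if w.lower() not in GENERIC]
    return {" ".join(words), " ".join(trimmed)} - {""}

def shape(name):
    """Word-length signature of a name, e.g. 'Shake Shack' -> (5, 5)."""
    return tuple(len(w) for w in name.split())

def candidate_pairs(companies):
    """Pair each (5,5) name with each (5,7,5) name that shares its last word."""
    expanded = set()
    for name in companies:
        expanded |= name_variants(name)
    a_names = [n for n in expanded if shape(n) == (5, 5)]
    b_names = [n for n in expanded if shape(n) == (5, 7, 5)]
    return sorted(
        (a, b)
        for a in a_names
        for b in b_names
        if a.split()[-1].lower() == b.split()[-1].lower()
    )

# Sample input; the last entry is a made-up company.
companies = ["Shake Shack Inc.", "Exxon Mobil Corp", "Alpha Freight Shack, Inc."]
print(candidate_pairs(companies))
```

Comparing the last words case-insensitively and stripping punctuation before counting letters are the two places where a near miss in string handling would silently drop a valid pair, so they are worth checking first if a known answer fails to show up.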

      I didn't really find anything that makes sense. Also, most of the companies that fit 5,5 or 5,7,5 are companies I've never heard of, and I suspect the companies in the solution will sound familiar. A couple of famous names pop up as Company A candidates:

      • Shake Shack
      • Exxon Mobil
      • Jamba Juice
      But Exxon and Jamba don't really make sense as something Company B wants, and I can't find a 5,7,shack candidate to match Shake Shack.

      When I have the solution, I'll revise my script to get it working; I'm mostly curious to learn whether or not I have the company names in my list, and if so, what is wrong with my string handling that results in not finding the solution?

      Tuesday, March 02, 2021

      Two Companies (Preview)

       I hope all my Puzzlers are having a great week! Let's take a look at the latest Sunday Puzzle from NPR:

      This week's challenge comes from Joseph Young of St. Cloud, Minn. I'm looking for the names of two companies. One of them has a two-part name (5,5). The other has a three-part name (5,7,5). The last five-letter part of the two names is the same. And the first five-letter part of the first company's name is something the second company wants. What is it?

       OK. This looks tricky, but let's break it down.

      What resources do we need?

      • Lists:
        • C: list of companies;
          • We can try scraping text from the web then running a named entity recognizer (NER) tool to extract all the corporation names;
          • More likely, we can find a list of company names already on the web;
          • Stock market listings might be helpful;

      • Functions:
        • filter_names(C,[word lengths]): This function should take our list C and a list of integers representing the lengths of each word in the company name; it should return only the company names from C that match the word lengths.
          • For example, to find companies matching the second company's word lengths, we'd use the function like this: filter_names(C,[5,7,5])
        • proposition_probability(some_sentence): This function should generate a score between 0 and 1 to represent the likelihood of a sentence. 
          • We'll use this to solve the last part of the puzzle. We know that Company A is two words, (5,5). Let's call these cax, cay. Let's call the (5,7,5) company Company B, or cb. So if we have a rich language model and/or a knowledge graph, we can take our lists of companies with matching word lengths and the matching final 5-letter word, and iterate through to find the most likely propositions.
          • For example, we'll take each cb candidate, then iterate through each ca candidate; we'll use a sentence template to construct propositions that we can evaluate with a model. An example might be: "[cb] is looking to acquire more [cax]". (Most likely, we'd want to use a few variations on this and take the average.) So we'll use this function to evaluate all our candidates; we can rank them by probability, then manually skim through the most probable propositions to find the solution.
          • What resource can we use here? I've been using BERT for various language modeling tasks lately, so I plan to do a little reading and see if it seems suitable. I'm not sure if it really captures the kind of knowledge we need involving specific companies.
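As a rough illustration of the filter_names function described in the list above, here is a minimal sketch. It assumes names are split on whitespace and that punctuation has already been cleaned up; the sample list is only for demonstration, and "Alpha Freight Shack" is an invented name.

```python
def filter_names(companies, word_lengths):
    """Return the names whose words have exactly the given lengths, in order."""
    wanted = list(word_lengths)
    return [
        name for name in companies
        if [len(word) for word in name.split()] == wanted
    ]

# Small sample list; the fourth entry is a made-up company.
C = ["Shake Shack", "Exxon Mobil", "Jamba Juice",
     "Alpha Freight Shack", "General Electric"]

print(filter_names(C, [5, 5]))
print(filter_names(C, [5, 7, 5]))
```

Passing the lengths as a list keeps the same function usable for both companies, and for any future puzzle that gives an enumeration like (5,7,5).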

      That's my plan. Do you have ideas or suggestions?

      Thanks for stopping by. I'll see you later this week with a solution (I hope)!

      --Levi

      Director, anagram, film award

      Welcome back to Natural Language Puzzling, the blog where we use natural language processing and linguistics to solve the Sunday Puzzle from...