Saturday, March 06, 2021

Two Companies (Solution - UNSOLVED!)

 Welp. I'm finally stumped! Did you solve it?

Let's take another look at the Sunday Puzzle:

This week's challenge comes from Joseph Young of St. Cloud, Minn. I'm looking for the names of two companies. One of them has a two-part name (5,5). The other has a three-part name (5,7,5). The last five-letter part of the two names is the same. And the first five-letter part of the first company's name is something the second company wants. What is it?

In the preview post, I shared this breakdown:



    • Lists:
      • C: list of companies;
        • We can try scraping text from the web then running a named entity recognizer (NER) tool to extract all the corporation names;
        • More likely, we can find a list of company names already on the web;
        • Stock market listings might be helpful;

    • Functions:
      • filter_names(C,[word lengths]): This function should take our list C and a list of integers representing the lengths of each word in the company name; it should return only the company names from C that match those word lengths.
        • For example, to find companies matching the second company's word lengths, we'd use the function like this: filter_names(C,[5,7,5])
      • proposition_probability(some_sentence): This function should generate a score between 0 and 1 to represent the likelihood of a sentence. 
        • We'll use this to solve the last part of the puzzle. We know that Company A, or ca, is two words, (5,5). Let's call these cax and cay. Let's call the (5,7,5) company Company B, or cb. So if we have a rich language model and/or a knowledge graph, we can take our lists of companies with matching word lengths and the matching final 5-letter word, and iterate through to find the most likely propositions.
        • For example, we'll take each cb candidate, then iterate through each ca candidate; we'll use a sentence template to construct propositions that we can evaluate with a model. An example might be: "[cb] is looking to acquire more [cax]". (Most likely, we'd want to use a few variations on this and take the average.) So we'll use this function to evaluate all our candidates; we can rank them by probability, then manually skim through the most probable propositions to find the solution.
        • What resource can we use here? I've been using BERT for various language modeling tasks lately, so I plan to do a little reading and see if it seems suitable. I'm not sure if it really captures the kind of knowledge we need involving specific companies.
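Here is a rough sketch of how those two functions might look in Python. This is not the actual script; it only illustrates the breakdown above. The name rank_propositions, the pairs argument, and the score_fn parameter are my own placeholders: score_fn stands in for whatever model ends up scoring a sentence (BERT or otherwise) and is assumed to return a number between 0 and 1.

```python
import re


def filter_names(companies, word_lengths):
    """Return the names in `companies` whose words have exactly these lengths."""
    matches = []
    for name in companies:
        # Count letters only, so stray punctuation doesn't change a word's length.
        lengths = [len(re.sub(r"[^A-Za-z]", "", w)) for w in name.split()]
        if lengths == list(word_lengths):
            matches.append(name)
    return matches


def rank_propositions(pairs, templates, score_fn):
    """Score each (ca, cb) pair with every template and sort by average score.

    `pairs` is a list of (ca, cb) tuples that already share a final word.
    `score_fn` is a hypothetical callable: sentence -> score in [0, 1].
    """
    ranked = []
    for ca, cb in pairs:
        cax = ca.split()[0]  # first word of Company A
        sentences = [t.format(cb=cb, cax=cax) for t in templates]
        average = sum(score_fn(s) for s in sentences) / len(sentences)
        ranked.append((average, ca, cb))
    # Highest-scoring propositions first.
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# One template from the breakdown above; more variations could be added.
TEMPLATES = ["{cb} is looking to acquire more {cax}"]
```

With these in place, filter_names(C, [5,5]) and filter_names(C, [5,7,5]) give the two candidate lists, and rank_propositions gives a sorted list to skim by hand.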



      So that's pretty much what I've tried so far. I cobbled together a huge list of companies -- 28,605 to be exact. From these, I did a lot of pre-processing to expand the list with alternate versions of each name. For example, I looked for generic words like "corp" and "inc" and dropped those. Then I found all the Company A (5,5) candidates and all the Company B (5,7,5) candidates. Next, I found all pairs where Company A word2 matches Company B word3, since the puzzle says the last five-letter part of the two names is the same. Finally, I used some templates to create sentences and evaluated these with BERT, which gave me a list sorted by score.
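For anyone who wants to try the string-handling part, here is a minimal sketch of the expand-then-pair step described above. It is not my actual script. Only "corp" and "inc" come from my real list of generic words; the other entries in GENERIC_WORDS, and the function names, are assumptions for illustration.

```python
from collections import defaultdict

# "corp" and "inc" are from the post; the rest are assumed extras.
GENERIC_WORDS = {"corp", "corporation", "inc", "incorporated", "co", "company", "ltd", "llc"}


def expand_name(name):
    """Return the original name plus a variant with generic words dropped."""
    words = name.split()
    # Strip trailing periods/commas, then drop generic words like "Inc." or "Corp".
    kept = [w.strip(".,") for w in words if w.strip(".,").lower() not in GENERIC_WORDS]
    variants = {name}
    if kept:
        variants.add(" ".join(kept))
    return variants


def word_lengths(name):
    return [len(w) for w in name.split()]


def find_pairs(companies):
    """Return sorted (Company A, Company B) pairs that share a final word."""
    expanded = set()
    for name in companies:
        expanded |= expand_name(name)

    # Index the (5,5) candidates by their last word, case-insensitively.
    a_by_last_word = defaultdict(list)
    for name in expanded:
        if word_lengths(name) == [5, 5]:
            a_by_last_word[name.split()[-1].lower()].append(name)

    pairs = []
    for name in expanded:
        if word_lengths(name) == [5, 7, 5]:
            # Company B's word3 must equal Company A's word2.
            for a_name in a_by_last_word.get(name.split()[-1].lower(), []):
                pairs.append((a_name, name))
    return sorted(pairs)
```

Comparing the last word case-insensitively and stripping trailing punctuation are the two spots where string handling could silently drop a real match, so those are the first things worth checking once the answer is known.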

      I didn't really find anything that makes sense. Also, most of the companies that fit 5,5 or 5,7,5 are companies I've never heard of, and I suspect the companies in the solution will sound familiar. A couple of famous names pop up as Company A candidates:

      • Shake Shack
      • Exxon Mobil
      • Jamba Juice
      But Exxon and Jamba don't really make sense as something Company B wants, and I can't find a 5,7,shack candidate to match Shake Shack.

      When I have the solution, I'll revise my script to get it working; I'm mostly curious to learn whether or not I have the company names in my list, and if so, what is wrong with my string handling that results in not finding the solution?
