Welp. I'm finally stumped! Did you solve it?
Let's take another look at the Sunday Puzzle:
This week's challenge from Joseph Young of St. Cloud, Minn. I'm looking for the names of two companies. One of them has a two-part name (5,5). The other has a three-part name (5,7,5). The last five-letter part of the two names is the same. And the first five-letter part of the first company's name is something the second company wants. What is it?
In the preview post, I shared this breakdown:
- Lists:
  - C: list of companies;
    - We can try scraping text from the web then running a named entity recognizer (NER) tool to extract all the corporation names;
    - More likely, we can find a list of company names already on the web;
    - Stock market listings might be helpful;
- Functions:
  - filter_names(C,[word lengths]): This function should take our list C and a list of integers representing the lengths of each word in the company name; it should return only the company names from C that match the word lengths.
    - For example, to find companies matching the second company's word lengths, we'd use the function like this: filter_names(C,[5,7,5])
  - proposition_probability(some_sentence): This function should generate a score between 0 and 1 to represent the likelihood of a sentence.
    - We'll use this to solve the last part of the puzzle. We know that Company A is two words, (5,5). Let's call these cax, cay. Let's call the (5,7,5) company Company B, or cb. So if we have a rich language model and/or a knowledge graph, we can take our lists of companies with matching word lengths and the matching final 5-letter word, and iterate through to find the most likely propositions.
    - For example, we'll take each cb candidate, then iterate through each ca candidate; we'll use a sentence template to construct propositions that we can evaluate with a model. An example might be: "[cb] is looking to acquire more [cax]". (Most likely, we'd want to use a few variations on this and take the average.) So we'll use this function to evaluate all our candidates; we can rank them by probability, then manually skim through the most probable propositions to find the solution.
    - What resource can we use here? I've been using BERT for various language modeling tasks lately, so I plan to do a little reading and see if it seems suitable. I'm not sure if it really captures the kind of knowledge we need involving specific companies.
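Here's a minimal sketch of how those two pieces could fit together, assuming company names are plain whitespace-separated strings. This is an illustration, not the code I actually ran: `rank_propositions`, the `templates` argument, and the `score` callable are hypothetical names, and `score` is left as a parameter so that any model (BERT or otherwise) could stand in for proposition_probability.

```python
from itertools import product
from statistics import mean


def filter_names(companies, word_lengths):
    """Return the names whose words have exactly the given lengths."""
    return [
        name for name in companies
        if [len(word) for word in name.split()] == list(word_lengths)
    ]


def rank_propositions(ca_names, cb_names, templates, score):
    """Score every (Company A, Company B) pair, averaging over the templates.

    `score` is any callable mapping a sentence to a number between 0 and 1;
    it is a placeholder for proposition_probability.
    """
    ranked = []
    for ca, cb in product(ca_names, cb_names):
        cax = ca.split()[0].lower()  # first word of Company A
        avg = mean(score(t.format(cb=cb, cax=cax)) for t in templates)
        ranked.append((avg, ca, cb))
    # Highest average score first.
    return sorted(ranked, reverse=True)
```

With a template such as `"{cb} is looking to acquire more {cax}"`, the top of the returned list is what we'd skim by hand. Averaging over several templates is meant to smooth out quirks in how any single sentence is worded.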
So that's pretty much what I've tried so far. I cobbled together a huge list of companies (28,605, to be exact). From these, I did a lot of pre-processing to expand the list with alternate versions of each name; for example, I looked for generic words like "corp" and "inc" and dropped them. Then I found all the Company A (5,5) candidates and all the Company B (5,7,5) candidates, and kept every pair where Company A's second word matches Company B's third word (the shared final five-letter part). Finally, I used some templates to create sentences and evaluated them with BERT, which gave me a list sorted by score.
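The pre-processing and pairing steps could be sketched roughly like this. It's a simplified illustration rather than my actual pipeline: the `GENERIC_WORDS` set and the function names are assumptions, and the pairing matches on the last word of each name, which is what the puzzle asks for.

```python
# Assumed list of generic suffixes; the real list would likely be longer.
GENERIC_WORDS = {"corp", "corporation", "inc", "incorporated",
                 "co", "company", "ltd", "llc"}


def name_variants(name):
    """Return the original name plus a version with generic words dropped."""
    words = name.replace(",", " ").split()
    kept = [w for w in words if w.strip(".").lower() not in GENERIC_WORDS]
    variants = {" ".join(words)}
    if kept:
        variants.add(" ".join(kept))
    return variants


def has_lengths(name, lengths):
    """True if the name's words have exactly these lengths."""
    return [len(w) for w in name.split()] == lengths


def candidate_pairs(companies):
    """Pair (5,5) names with (5,7,5) names that share the same last word."""
    names = set()
    for company in companies:
        names |= name_variants(company)
    ca = [n for n in names if has_lengths(n, [5, 5])]
    cb = [n for n in names if has_lengths(n, [5, 7, 5])]
    return sorted(
        (a, b) for a in ca for b in cb
        if a.split()[-1].lower() == b.split()[-1].lower()
    )


# "Grand Central Shack" is a made-up name, used only to show a match.
print(candidate_pairs(["Shake Shack Inc.", "Exxon Mobil Corp", "Grand Central Shack"]))
```

Keeping the original name alongside the trimmed variant means a company whose generic word is genuinely part of its name isn't lost.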
I didn't really find anything that makes sense. Also, most of the companies that fit 5,5 or 5,7,5 are companies I've never heard of, and I suspect the companies in the solution will sound familiar. A couple of famous names pop up as Company A candidates:
- Shake Shack
- Exxon Mobil
- Jamba Juice