||[Oct. 1st, 2002|06:22 pm]
patrick's post about pangrams the other day, I started wondering how possible it'd be to automatically generate them.After reading |
A pangram is a sentence containing every letter once. Ideally, only once. The smaller the sentence, the better.
To auto-generate a pangram there are two major components involved:
1) generate all possible sets of words with no duplicate letters.
2) arrange a given set such that it's grammatical. this step could be done by hand, if the number of sets are small enough. if the number of sets is too large, data from the various lexicon databases could be used to automatically arrange a lot of the sets, or more likely: automatically discard sets that are known not to work.
I started playing with the first part of the problem.
45,392 -- words in English. (well, in /usr/share/dict/words, at least)
13,165 -- words without duplicate letters (I care only about perfect pangrams)
It's convenient to represent a word's set of letters in a 32 bit integer, one bit per letter.
11,179 -- sets of words (each set being a distinct letter usage field)
Now, I'm storing 11,179 files in a tree of directories (stupid filesystem), each file containing that set's words.
At this point I saw a few options:
1) Keep merging sets until the set representing all letters is found.
For example, there are 1,582,990 double-word sets of single-word sets. The basic problem with this method seems to be that lots of duplicate sets are found. That 1.5M-sized set, for example, might include masks already found by separate words in the single-word set.
I decided this sucked.
2) Recursively find all sets matching any part of given mask (starting with the full mask) but not any part of the inverse of the mask, creating a new union-set everytime you fail to make it all the way. A union-set would be a set of sets needed to fulfill the requested mask.
Memoize everything to disk. This way I can start/stop the program and not lose its position ... because whatever strategy I decide on, it's gonna be slow.
At the end, find all combinations of child sets of tree of union-sets.
3) Go through the alphabet, requesting sets of words containing the next needed letter, but not those already had. Here, the 11,179 sets can be split into 26 index files.
Search depth-first. On interrupt, serialize stack to disk. On startup, skip past permutations already considered.
I'm thinking #2 and #3 both have promise, but #3 should be faster to start with because it shouldn't need to do much disk I/O. The 26 indexes would all fit in memory, then it's just CPU limited.
Then again, #2 might catch up and win by the end, after its cache fills up. I'd need to put ReiserFS on a spare machine for all the small files, though.