[x]Blackmoor Vituperative

Sunday, 2009-03-08

Doublet solver for Perl

Filed under: Programming — bblackmoor @ 23:27

I have translated my doublet solver for Python to Perl. Having already done the difficult part — figuring out what the script needs to do — putting it into a specific language is, aside from a “gotcha” now and then, mostly just grunt work. But I was curious how it would compare with the Python version. I expected it to take an hour or so, but it took me most of the evening, partly because I got distracted by Celebrity Apprentice, and partly because I had to re-think how to handle arrays of arrays (more on that later).

So here is a link to the script, for anyone who might find it educational (rename it to doublet.pl in order to run it), and the official Scrabble word list (you will need to unzip it, of course).

For instructions on how to use the script, run doublet.pl -h

For the rules of the puzzle, what the output looks like, and so on, read my blog entry for the doublet solver for Python.

Some observations:

I have been writing Perl for years: well over a decade. At the time I discovered it, it could do things which were not feasible any other way. When you need to drive a screw and all you have is a hammer, you use a hammer and you are grateful to have it. I was grateful to have Perl.

Times have changed. There are a number of other widely supported languages which do what Perl does with more elegance and less perversity: PHP and Python, to name a couple you might have heard of. None of these languages are perfect. I suspect that no language will ever be perfect, because human beings are not perfect. Nonetheless, if given the choice, I would not choose to use Perl now that there are, in my opinion, better alternatives available.

As an example of things that Perl does poorly when compared to other languages, compare how PHP handles arrays to how Perl handles arrays. I mentioned in my doublet solver for Python that I was a bit disappointed in how Python handles arrays (what Python calls “lists” and “dictionaries”), but I think Python is a step above Perl in this area, while PHP is superior to them both.

Because of the way Perl handles (or rather, does not handle) multidimensional arrays, I had to put the code through some contortions to translate the doublet solver: instead of an array of arrays, I had to choose between using an array of references to other arrays (which is cumbersome), or an array of strings that pretend to be arrays (which is inelegant). I chose the latter. It works, yes, but I can’t say that I am pleased with it. I suppose I could have used “hashes” (what Perl calls associative arrays), but that would have been even uglier than the string solution, in my opinion.

I should point out that in both this script and the Python doublet solver, I used very few comments. Partly that is because the scripts are so short, and partly because I think what they do is self-evident. In general, I think comments are useful for a relatively small number of cases:

  1. Documenting the API for reusing a function or object.
  2. Explaining decisions which may be counterintuitive or inobvious.
  3. Pointing out kludges which, for whatever reason, seemed appropriate at the time.
  4. Notes for later improvements.
  5. Explaining algorithms which may not be obvious.
  6. Explaining function calls or API calls where the names used for variables, etc., are cryptic or misleading.
  7. Explaining the purpose of code so obfuscated that it looks like line noise.

That last item shows up in Perl a lot. But in this case, I think I made the code as transparent as it can be.

Saturday, 2009-02-28

Doublet solver for Python

Filed under: Programming — bblackmoor @ 17:50

I was recently shown an interesting programming puzzle. After reading it, I recognized it as a variation on the classic “doublet” game (I read a lot of Lewis Carroll as a kid, as well as being a huge fan of Alice in her various guises).

The Rules

  1. Call two words “adjacent” if you can change one word into the other by adding, deleting, or changing a single letter.
  2. Using the computer language of your choice, write a program which takes two words as inputs and creates an ordered list of unique words where each word in the sequence is adjacent to the previous and next words in the sequence.
  3. If you know Perl, a Perl solution is preferred.

The puzzle suggests using Perl to solve the puzzle, but I do not care for Perl (despite having worked with it longer than any other language I know — or perhaps because of that). I have been learning more about Python (and liking it more the more I use it, unlike Perl), so I decided to give a Python based solution a go.

I started with a basic plan for the script:

  • Input the two starting words
  • Load the dictionary
  • Filter the dictionary to words of the appropriate length
  • Find every variation of every word in the filtered dictionary.
  • Find variations of the starting word
  • If none of them are the ending word, find variations of those variations
  • Repeat until the ending word is found

(This was not the best way to start. More on that later.)

A couple of things immediately occurred to me. First, that removing letters would be easier than adding them, if the two end words are of different length. Second, I decided that I wanted to be able to pass arguments and options to the script on the command line: the two words, obviously, but I also wanted to be able to specify a dictionary file of valid words, as well as being able to pass an option that would show me every word tried while the script was running.

I am relatively new to Python, so I did not know how to handle command-line parameters gracefully. My first thought was to test each argument for a leading dash “-” character, and then split them up into arrays. Fortunately, I stumbled across a site with a sample script demonstrating the Python “getopt” module, which does just what I needed. While reading up on that, I then discovered another Python goodie called “doc strings“. I was really rolling now!

My next task was to load the dictionary, and filter out words of an inappropriate length. My initial idea was to build a cross-reference table where the first column would be all of the words of an appropriate length, and the other columns would be every possible single-step variation on those words. For example, for “dog” there would be “cog”, “log”, “dogs”, and so on.

But then I hit a hurdle. When reading in the dictionary file, I only wanted words which are no longer than the longer of the two starting words, and no shorter than the shorter of the two starting words. So naturally, after reading each line, I would test for that, but it seemed that Python was not automatically stripping off the newline characters. But then, why would it, since it does not know which style of newline I happen to be using (I am working on this project on a Windows laptop, but I usually use Unix line endings because I do most of my real work on Linux machines). How does Python treat newline characters when doing readline? And how do I strip them off?

There are two steps to that answer. First, open the file with the U option, for “universal newline support” (e.g., dictionary_file = open(dictionary,"rU")). Second, use strip() to remove whitespace (e.g., line = line.strip()).

Now for my next problem: as far as I know, Python’s arrays (aka, “lists”) are only able to use integers as indexes. For my initial dictionary idea to work, I needed some way of using strings as indexes (aka, “associative arrays”), the way PHP can. As it turns out, the Python feature that offers this functionality is actually called a dictionary. Go figure. Python’s dictionaries aren’t treated like normal (normal for PHP, that is) associative arrays: it does not have the same syntax or methods, and personally, this is an area where I think PHP is superior to Python. Of course, after learning all of this, I went back and thought of a simpler way of building the dictionary that did not use “dictionaries” (aka associative arrays), and simply used lists (aka arrays).

(Why do people insist on renaming everything? Why not call something the name by which it is nigh-universally known? “Arrays” and “associative arrays”. Is that so difficult?)

Before too long, I ran into another problem: how backslashes are treated in strings. I wanted to insert “\b” (for a word boundary) into a string, but Python interprets this to mean something special. So I needed to add an “r” before the opening quotation mark to indicate that this is a “raw string”. This is the first time I have encountered something in Python which reminds me of Perl (which is littered with bizarre, nonintuitive things like this).

These hurdles overcome, I tested my script using a small set of sample data (a dictionary of about twenty words), and it worked great. The problem came with larger dictionaries. When using a dictionary with thousands of words, creating the initial cross-reference index took forever. I also soon learned that my first idea, which was to try every possible variation of a word before giving up and going to the next word resulted in ridiculously long word paths. It soon dawned on me that it would probably be faster to search one letter-layer at a time (what people in computer science classes call a “breadth-first search“) than going down the full path of a particular word path (aka, a “depth-first search“). Amusingly enough, the code to do so was actually simpler than my initial “build a massive cross-reference table” approach.

(By the way, while refreshing my memory on breadth-first searches, I also ran across an algorithm for finding shortest paths called Dijkstra’s algorithm. That looks really interesting, and I want to play with that some day, but I figured I would have enough on my plate just implementing the search in Python, so Dijkstra’s algorithm will have to wait for another day.)

The final hitch I ran into was getting the recursive function to return its value. I must have banged my head against that for three hours before realizing I had made a stupid mistake. I was only returning the value when the function was successful, not when it was calling itself recursively. Since it’s recursive, it needed a return statement for every step of the recursion. A silly mistake on my part, but hey, it happens.

So here is a link to the script, for anyone who might find it educational (rename it to doublet.py in order to run it), and the official Scrabble word list (you will need to unzip it, of course).

For instructions on how to use the script, run doublet.pl -h.

Here is some typical output:

doublet.py -d scrabble_dictionary.txt beginning end
Fail: could not find a path between words
2 potential paths were examined, taking 1.5630 seconds

doublet.py -d scrabble_dictionary.txt begin end
707 potential paths were examined, taking 5.8280 seconds
begin >> began >> bean >> bead >> bend >> end

And here are my answers to the “bonus questions” asked by the original puzzle:

  • What is the shortest path between “crawl” and “run”?
    In theory, the shortest path would be five (including the two words at the ends). As it happens, that is also the shortest path when using the dictionary I used.
  • What is the shortest path between “mouse” and “elephant”?
    In theory, the shortest path would be eight (including the two words at the ends). However, I could not find a path using the dictionary I used, possibly due to one of the assumptions I used (see below).
  • Does your program necessarily return the shortest path?
    Yes.
  • What assumptions did you make in your program?
    Other than the parameters provided by the puzzle, I set a limitation that the words which would be examined when trying to make a path would be no shorter than one character less than the shorter of the two words, and no longer than one character more than the longer of the two words. I realize that this will, in some cases, result in not finding a path where one may actually exist.
  • How did you test your program?
    Using the words given in the puzzle, as well as a few words that I did not think had a path between them. I also tested the various failure conditions.
  • What is the Big-O complexity of your program?
    I think it’s O(n^2), but it’s been a number of years since I studied big-O notation, so I could be wrong.
  • Suppose each letter had a point value (for example, as in Scrabble). Discuss (but don’t code) how your algorithm would change if you wanted to find the the path with the lowest possible point total.
    I would have had to continue the script after it finds a path, rather than immediately ending. Then I would have had to contemplate its point value. Finally, when all of the paths are found, or some arbitrary limit is reached, the script would return the found path with the lowest point value.
  • Sometimes the shortest path isn’t unique. Discuss (but don’t code) how your algorithm would change if you needed to return all of the shortest paths between two words.
    I would have had to continue the script after it finds a path, rather than immediately ending. Finally, when all of the paths of that length are found, the script would return the found paths.
  • Discuss (don’t code) how you might change your program if your concern was minimizing memory usage.
    I would probably replace the search path array with files for each path.
  • Discuss (don’t code) how you might change your program if your concern was minimizing CPU time.
    I could be wrong, but I think it already minimizes CPU time. Slicing up the dictionary file by word length would probably help, though.
  • Discuss (don’t code) how you might change your program if your concern was minimizing programmer time.
    If conserving my time was important, I would not have added the various optional parameters or tested for so many error conditions.
  • Discuss (but don’t code) how your algorithm would change if you wanted to find the longest path between two words.
    I would have had to continue the script after it finds a path, rather than immediately ending. Then, when all of the paths are found, or some arbitrary limit is reached, the script would return the last path found.

Final thoughts:

Just thinking about this took about three days. I wasn’t even sure how to approach the problem at first. Once I had a plan, though, things progressed pretty quickly. I would say that writing the actual code took about four days (an hour here, an hour there). Much of this time was simply spent learning how Python does things, but a good chunk was spent redesigning the script after I realized the problems with my first design.

The script works, and I do not think I will refine it any further. However, I am not entirely satisfied with it, and I can see several things I would like to do differently next time. I did not even attempt to make this object oriented, and that is something that I want to tackle in Python soon. Also, I am not entirely pleased with how I handled the global variables in this script: there are far too many of them for my taste. I am also slightly disillusioned about Python itself. While I think it is a significant improvement over Perl, it still has some peculiar deficiencies. However, I do consider it a significant improvement over Perl, and considering how much time I spent writing and modifying scripts, I think learning Python will be time well spent.

Monday, 2009-01-12

NSA initiative pinpoints 25 top coding errors

Filed under: Programming,Security — bblackmoor @ 18:40

It looks like the NSA is actually doing something useful for a change, rather than just spying on American citizens.

Friday, 2008-03-28

JMRI Defense: Keeping an Open-Source Project Alive

Filed under: Intellectual Property,Programming — bblackmoor @ 15:21

The JMRI Defense fund is a worthwhile cause. Think about sending a few dollars their way.

Friday, 2007-11-23

Categorical Links Page plugin for WordPress

Filed under: Programming — bblackmoor @ 22:32

I updated WordPress to 2.3.1 today. In the process, a plugin I use stopped working. So I fixed it. If you would like to us it, you may download it from here:

Categorical Links Page plugin

You will need 7-zip to decompress the file.

Friday, 2007-04-27

Adobe decides to open Flex

Filed under: Programming,The Internet — bblackmoor @ 14:00

Adobe Systems has announced its plans to open-source its Flex Web development framework.

The San Jose, Calif., company is releasing its Adobe Flex source code to the open-source community to enable developers throughout the world to tap the capabilities of Flex and participate in the ongoing development of the technology.

Flex is a framework for building cross-operating system RIAs (rich Internet applications) for the Web and enabling new Adobe Apollo applications for the desktop, the company said.

“We’ll be open-sourcing Flex with the next release of the technology, which is code-named Moxie,” said Jeff Whatcott, vice president of product marketing in Adobe’s Enterprise and Developer Business Unit.

Whatcott said Adobe will introduce the first public pre-release version of “Moxie” in June, “and we’ll be providing public daily builds of the technology starting at that time. We’ll also be launching a public bug database, so it’ll look, act and feel like an open-source project” even then.

However, the technology will not be open-sourced until “Moxie” is released in the second half of 2007—most likely in the fall, Whatcott said.

Upon release, the open-source Flex software development kit (SDK) and documentation will be available under the MPL (Mozilla Public License), Whatcott said.

Using the MPL for open-sourcing Flex will allow full and free access to source code, and developers will be able to freely download, extend and contribute to the source code for the Flex compiler, components and application framework.

Adobe will also continue to make the Flex SDK and other Flex products available under their existing commercial licenses, allowing both new and existing partners and customers to choose the license terms that best suit their requirements.

Whatcott said the MPL “strikes a good balance” for developers, particularly those who want to take a staged approach to working with open-source technology.

“This is the culmination of a long path toward opening up Flex,” Whatcott said.

(from eWeek, Adobe Open-Source Move Sets Showdown with Microsoft)

I have it on good authority that Flex is going to be the Next Big Thing. If you like to stay abreast of web technology, this is the time to start gearing up with Flex.

Silverlight isn’t even an also-ran.

Monday, 2007-03-19

Adobe releases alpha of Apollo

Filed under: Programming,The Internet — bblackmoor @ 11:19

Adobe Systems has announced the first public alpha release of Apollo, its cross-operating system run-time for Web developers.

The technology is available on the Adobe Labs site.

So… is this good, or evil? I hate to say it, but it sounds to me that Adobe wants to do what Macromedia wanted to do and nearly succeeded in doing: subverting the Internet and turning it into their proprietary product. But I will have to do more research before I make up my mind.

Thursday, 2006-06-29

Spring vs. EJB

Filed under: Programming — bblackmoor @ 16:28

In quite a few design brainstorming sessions, the debate between Spring and EJB results in a deadlock. There are developers who are damn passionate about Spring and hate EJBs. Let’s have a look at the main important differences between the two….

(from TechRepublic, Spring vs. EJB)

Thursday, 2006-04-13

Prototype: Easing AJAX’s Pain

Filed under: Programming,The Internet — bblackmoor @ 20:15

Are you looking to simplify AJAX code? Bruce Perry has some help in “Prototype: Easing AJAX’s Pain,” which introduces the “Prototype” library. Prototype introduces a number of development-easing shortcuts for JavaScript authors. Further, “Prototype also wraps the functionality of XMLHttpRequest with its own Ajax.Request and related objects, so that you don’t have to bother with writing code for instantiating this object for various browsers.”

Tuesday, 2006-04-11

Intro to Maven 2.0

Filed under: Programming — bblackmoor @ 15:39

Chris Hardin has written a really good (if brief) introduction to Maven 2.0. Check it out.

« Previous PageNext Page »