brad's life - Naming twins in Python & Perl [entries|archive|friends|userinfo]
Brad Fitzpatrick

[ website | bradfitz.com ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

Naming twins in Python & Perl [Jan. 6th, 2008|02:16 pm]
Previous Entry Add to Memories Share Next Entry
[Tags|, , , ]

Last night at Beau's party, one of Beau's guests mentioned he's expecting twins shortly, which is why is wife wasn't at the party.

I drunkenly suggested he name his kids two names that were anagrams of each other. I then wandered off downstairs to find such suitable names.

Because I'm supposed to be be working in Python these days, not Perl, I gathered what little Python knowledge I had and wrote out:
#!/usr/bin/python

by_anagram = {}

names_file = open("dist.male.first")
for line in names_file:
    # lines in file look like:
    # WILLIAM        2.451 14.812      5
    # we want just the first field.
    name = (line.split())[0]
    letters = [letter for letter in name]
    letters.sort()
    sorted_name = "".join(letters)
    if not sorted_name in by_anagram:
        by_anagram[sorted_name] = []
    by_anagram[sorted_name].append(name)

for sorted_name in by_anagram:
    if len(by_anagram[sorted_name]) < 2:
        continue

    print by_anagram[sorted_name]
Not so happy with it, but it worked, and I printed out the results and brought them up to the guy:
['TROY', 'TORY']
['CLAY', 'LACY']
['JEFFREY', 'JEFFERY']
['ANGEL', 'GALEN']
['FOREST', 'FOSTER']
['ANDRE', 'DAREN', 'ARDEN']
['LEROY', 'ELROY']
['LEONARD', 'RENALDO', 'LEANDRO']
['ARIEL', 'ARLIE']
['BRENDAN', 'BRANDEN']
['WARREN', 'WARNER']
['DEAN', 'DANE']
['CHRISTOPER', 'CRISTOPHER']
['COLE', 'CLEO']
['CARLO', 'CAROL']
['ELMER', 'MERLE']
['REUBEN', 'RUEBEN']
['JOSEPH', 'JOESPH', 'JOSPEH']
['COREY', 'ROYCE']
['JASON', 'JONAS']
['RAMON', 'ROMAN']
['JAMIE', 'JAIME']
['CARMELO', 'MARCELO']
['BYRON', 'BRYON']
['LEON', 'NOEL', 'OLEN']
['NEAL', 'LANE']
['MICHAEL', 'MICHEAL', 'MICHALE']
['KEITH', 'KIETH']
['BERT', 'BRET']
['BRIAN', 'BRAIN']
['OLIN', 'LINO']
['DION', 'DINO']
['DANA', 'ADAN']
['RONALD', 'ROLAND', 'ARNOLD']
['ISRAEL', 'ISREAL']
['DARNELL', 'RANDELL']
['ANTOINE', 'ANTIONE']
['ORLANDO', 'ROLANDO', 'ARNOLDO']
Just now, I was wondering the equivalent in Perl, and wrote:
#!/usr/bin/perl

use strict;
open (my $fh, "dist.male.first") or die;
my %by_anagram;
while (<$fh>) {
    chomp;
    s/\s.*//;
    my $name = $_;
    my $sorted_name = join('', sort split //, $name);
    push @{$by_anagram{$sorted_name}}, $name;
}

foreach my $sn (grep { @{$by_anagram{$_}} > 1 } keys %by_anagram) {
    print "@{$by_anagram{$sn}}\n";
}
In particular, I like about Python that errors-are-exceptions is the norm. But I like that regexps are built into Perl. (also hate Python's general hating on functional programming, unrelated to this post) I'm sure my Python could be way shorter, too. Anybody want to post either their short Python version, or their more-idiomatic Python version?

Also interesting: (fastest of a 3 consecutive runs each)
sammy$ time ./ananames.pl > /dev/null

real	0m0.026s
user	0m0.024s
sys	0m0.004s

sammy$ time ./ananames.py > /dev/null

real	0m0.043s
user	0m0.036s
sys	0m0.008s
LinkReply

Comments:
Page 1 of 2
<<[1] [2] >>
[User Picture]From: avva
2008-01-06 10:32 pm (UTC)

(Link)

Rueben and Reuben is the winning pair.
[User Picture]From: herbie
2008-01-06 10:47 pm (UTC)

(Link)

Think, though, of the pressures put on a Brian/Brain pair.
[User Picture]From: ciphergoth
2008-01-06 10:35 pm (UTC)

(Link)

#!/usr/bin/python

by_anagram = {}

for line in open("dist.male.first"):
    name = (line.split())[0]
    sorted_name = "".join(sorted([letter for letter in name]))
    by_anagram.setdefault(sorted_name, []).append(name)
    
for names in by_anagram.itervalues():
    if len(names) > 1:
       print names
From: evan
2008-01-06 11:03 pm (UTC)

(Link)

sorted(name) also works.
(no subject) - (Anonymous) Expand
[User Picture]From: jarodrussell
2008-01-06 10:38 pm (UTC)

(Link)

That's an elegantly simple way of sorting and comparing the letters in a name. Beautiful code.
[User Picture]From: hellarad
2008-01-06 10:58 pm (UTC)

(Link)

1] why didn't you do this for meeee?
2] same names spelled differently so do not count!!!
[User Picture]From: dossy
2008-01-06 11:23 pm (UTC)

(Link)

I'm assuming you're using this dist.male.first?

As long as you don't mind a little golfing, here's the Tcl version:

set f [open dist.male.first r]
while {[gets $f line] >= 0} {
    set name [lindex [split $line " "] 0]
    set sorted [lsort [split $name ""]]
    lappend by_anagram($sorted) $name
}

foreach sorted [array names by_anagram] {
    if {[llength $by_anagram($sorted)] > 1} {
        puts $by_anagram($sorted)
    }
}
close $f


I'd like to know how fast or slow it is on your machine, to get an apples-to-apples benchmark comparison (assuming you have Tcl/tclsh installed). Include the version of tclsh you used, if you do.
[User Picture]From: pne
2008-01-07 07:17 am (UTC)

golf

(Link)

Couldn't you turn set name [lindex [split $line " "] 0] into set name [lindex $line 0]?

I thought that strings and lists in Tcl were interconvertible, and lindex'ing a string essentially split it on whitespace.
(Deleted comment)
From: http://intertwingly.net/blog/
2008-01-07 12:47 am (UTC)

Ruby 1.9

(Link)

File.open('dist.male.first').
  map {|line| line.split.first}.
  group_by {|name| name.chars.sort}.
  each {|key,names| p names if names.length>1}
From: elvenforever
2008-01-07 12:18 am (UTC)

(Link)

Heh. This year was my 10th wedding anniversary. I make jewelry. So, to celebrate, I calculated our wedding date in binary and made it into a necklace. Geek love. :)
[User Picture]From: jope
2008-01-07 12:27 am (UTC)

(Link)

Where were you when the best Hasbro could come up with was Tomax and Xamot?
[User Picture]From: ciphergoth
2008-01-07 01:56 am (UTC)

(Link)

Going a bit more mad for brevity with list comprehensions:
#!/usr/bin/python

by_anagram = {}

for name in [l.split()[0] for l in open("dist.male.first")]:
        by_anagram.setdefault("".join(sorted(name)), []).append(name)

print "\n".join([" ".join(n) for n in by_anagram.itervalues() if len(n) > 1])
[User Picture]From: krow
2008-01-07 02:46 am (UTC)

(Link)

Ask Guido someday about how Python's Threads are implemented...

In some fairness though, Python's internals are much easier to follow then the MACRO hell of Perl's.
From: ex_n1ckname629
2008-01-07 03:02 am (UTC)

(Link)

oki doki

$ time ./ana.py > /dev/null

real 0m0.028s
user 0m0.016s
sys 0m0.008s
$ time ./ana.pl > /dev/null

real 0m0.019s
user 0m0.020s
sys 0m0.000s
$ time ./ananames.bin > /dev/null

real 0m0.009s
user 0m0.008s
sys 0m0.004s
[User Picture]From: mechanyx
2008-01-07 03:31 am (UTC)

(Link)

Can I ask why you are "supposed" to be writing in Python if your Perl is so strong?
[User Picture]From: brad
2008-01-07 04:16 am (UTC)

(Link)

Google has a shitload of code and infrastructure. Google also has a shitload of engineers that altogether know a shitload of languages. If every engineer were allowed to write in his/her favorite pet language of the week, the necessary explosion of substandard bindings for each library * each language would be unmaintainable. Given that Perl/Python/Ruby are all effectively the same, it makes sense to standardize on one. Python has the right mix of learnability, readability, industry/community support, etc. I don't object to having to write in Python... it makes a ton of sense.
From: evan
2008-01-07 04:05 am (UTC)

i can has golf

(Link)

import Control.Arrow
import List
import qualified Data.Map as M

main = do
  dat <- readFile "dist.male.first"
  let pairs = map ((sort &&& (:[])) . head . words) $ lines dat
  mapM_ print $ filter ((>1) . length) $ M.elems $ M.fromListWith (++) pairs
Hooray for large standard libraries.
[User Picture]From: mobley
2008-01-07 04:23 am (UTC)

(Link)

if i was him, i would name my children
orlando and rolando
and be done with it.

also, if i had a twin named branden
while growing up
one of us would've been
driven to kill by now.
[User Picture]From: nothings
2008-01-07 04:29 am (UTC)

(Link)

For what little it's worth:

#define STB_DEFINE
#include "stb.h"     // http://nothings.org/stb.h

stb_sdict *names;

int main(int argc, char **argv)
{
   int i,j;
   char **lines, *y;
   names = stb_sdict_new(1);
   lines = stb_stringfile("/sean/writing/tools/male.txt", NULL);
   for (; *lines; ++lines) {
      char **z = stb_tokens(*lines, " ", NULL), *p = strdup(z[0]), **s;
      qsort(z[0], strlen(z[0]), 1, stb_charcmp);
      s = stb_sdict_get(names, z[0]);
      stb_arr_push(s, p);
      stb_sdict_set(names, z[0], s);
   }
   stb_sdict_for(names, i, y, lines) {
      if (stb_arr_len(lines) > 1) {
         printf("[ ");
         for (j=0; j < stb_arr_len(lines); ++j) {
            printf("'%s' ", lines[j]);
         }
         printf("]\n");
      }
   }
   return 0;
}


Of course that leaks all memory, and I don't have perl or python installed to do a performance comparison anyway.
From: edge_walker
2008-01-07 05:00 am (UTC)

(Link)

Hey, no fair. No one’s showing the Perl version any love even though it has a lot of stuff you can get rid of:

#!/usr/bin/perl
use strict;

sub sort_chars { join '', sort split //, shift }

my %by_anagram;

@ARGV = "dist.male.first"; # better yet: pass it on the command line
while (<>) {
    s/\s.*//s;
    push @{ $by_anagram{ sort_chars($_) } }, $_;
}

for ( values %by_anagram ) {
    print "@$_\n" if @$_ > 1;
}

(Untested.)

Although I’d write it slightly longer in order to de-uglify the push line, by factoring out the addressing into an extra temporary:

    my $bucket = \$by_anagram{ sort_chars($_) }; # oh hai, iz mah...
    push @$bucket, $_;

Much nicer.

From: (Anonymous)
2008-01-08 05:13 am (UTC)

A little less opaque with most of his code concepts intact ...

(Link)

#!/usr/bin/perl open (my $fh, "dist.male.first") or die; my (%by_anagram,$name,$first,$rest,$sorted_name); while (($name=<$fh>)=~ s/\s.*\n//m) { push @{$by_anagram{join('', sort split //, $name) }},$name; } foreach my $sn ( keys %by_anagram) { printf "%s\n",join(", ",@{$by_anagram{$sn}}) if (@{$by_anagram{$sn}} > 1); } Removed as much of the extraneous cde as possible, though I didn't alter his character sort (join, sort, split). landman@lightning:~$ time ./ana2.pl > m real 0m0.036s user 0m0.028s sys 0m0.004s
From: (Anonymous)
2008-01-07 06:06 am (UTC)

(Link)

In situations like these, I'd prefer line.strip() to (line.split())[0]. The latter doesn't need the extra parentheses, anyway (and I don't know the layout of the file, if there is more than one thing on every line my replacement doesn't work).
[User Picture]From: ciphergoth
2008-01-07 11:34 pm (UTC)

(Link)

It doesn't - see the comment in the Perl code, there are other fields in the file. I think this is the cleanest way.
[User Picture]From: mart
2008-01-07 10:19 am (UTC)

(Link)

I thought it might be fun to do it in C#, just for kicks:

using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;

public class AnagramNames {
	public static void Main(string[] args) {
		var names = from name in File.ReadAllLines("dist.male.first") select name.Split(' ')[0];
		var pairs = from pair in (
			from name in names group name by SortCharsInString(name)
		) where pair.Count() > 1 select pair;

		foreach (var pair in pairs) {
			Console.WriteLine(String.Join(",", pair.ToArray()));
		}
	}
	public static string SortCharsInString(string s) {
		char[] arr = s.ToArray();
		Array.Sort(arr);
		return new string(arr);
	}
}

It could be even shorter and more memory-efficient if I could find a way in the standard library to:

  • Get an IEnumerable<string> for "lines in a file" without reading the whole file into an array (it's easy to write such a thing, so I'm sure it's in the standard library somewhere...)
  • Sort the characters in a string in a functional way, rather than in-place

No timings, since I had to write this in Windows; Mono doesn't have stable LINQ support yet.

[User Picture]From: mtbg
2008-01-07 05:42 pm (UTC)

(Link)

(it's easy to write such a thing, so I'm sure it's in the standard library somewhere...)

So, by deduction, it must be really really hard to write a Pair class in Java...

I have very little to say about the Python version that ciphergoth hasn't said. I probably would've implemented it as the list comprehension version. One comment about your version: letters = [letter for letter in name] is better spelled letters = list(name).
From: free_meal
2008-01-07 02:12 pm (UTC)

Hello.

(Link)

I'll be a bit offtop. Sorry. I'm writing a work for russian university, work is called "livejournal.com as unofficial massmedia", and i'd like to know your opinion about this. What do you think of the fact that livejournal.com is becoming or became already fourth authority? And can you predict lj's future? What do you prefer: newspapers and tv news or it? Thank you. Danil.
From: (Anonymous)
2008-01-07 04:00 pm (UTC)

(Link)

;; Emacs Lisp version :D
(require 'cl)
(with-current-buffer (find-file-noselect "input")
 (let ((hash (make-hash-table :test #'equal)))
    (while (and (not (eobp))
                (re-search-forward "^\\w*" nil t))
      (setf str (match-string 0))
      (sort* str #'<)
      (push (match-string 0) (gethash str hash)))
    (maphash (lambda (k v)
               (when (cdr v) (print v))) hash)))
From: (Anonymous)
2008-01-07 05:19 pm (UTC)

A php version

(Link)

I know there will be haters but here is a php version.

#!/usr/local/php5/bin/php
$line) {
$parts = explode(" ", $line);
$letters = str_split($parts[0]);
sort($letters);
$letters = implode('', $letters);
if (!array_key_exists($letters, $anagrams)) {
$anagrams[$letters] = array();
}
$anagrams[$letters][] = $parts[0];
}

foreach ($anagrams as $key => $value) {
if (count($value) < 2) {
continue;
}
echo implode(", ", $value)."\n";
}
From: (Anonymous)
2008-01-10 04:12 pm (UTC)

Re: A php version

(Link)

Well, it seems you be a hater of indenting.
Please stop the hate!
[User Picture]From: pawnhearts
2008-01-07 05:41 pm (UTC)

(Link)

by_anagram = {}
for line in file("dist.male.first"):
  by_anagram[sorted(line)]=by_anagram.get(sorted(line),[])+[line,]
for line in by_anagram.values():
  print line

[User Picture]From: ciphergoth
2008-01-07 11:46 pm (UTC)

(Link)

Test this - it will throw a TypeException; in addition you need to print only the entries with more than one name in the list. "setdefault" is easier. Also, compare my "list comprehensions" version above.
[User Picture]From: crw
2008-01-07 07:21 pm (UTC)

(Link)

Alice / Celia, sadly missing (from Jeff Noon's "Automated Alice")
From: blandest.myopenid.com
2008-01-07 08:51 pm (UTC)

(Link)

Nice post, it's fun to see different approaches. Here's mine in Common Lisp:

(let ((hash (make-hash-table :test #'equal)))
    (with-open-file (stream "names")
                    (loop for line = (read-line stream nil nil)
                          for key = (sort (copy-seq line) #'char<)
                          until (null line)
                          do (push line (gethash key hash nil))))
    (loop for v being the hash-value of hash
          do (when (cdr v) (format t "~{~A~^, ~}~%" v))))

From: blog.moertel.com
2008-01-08 12:56 am (UTC)

Generalize the solution

(Link)

This problem is one of many that involve clustering values based on their signatures. If you factor out the signature-computing method as a parameter, what's left is a handy, reusable function that solves a large family of related problems:

ClusterBy: a handy little function for the toolbox (http://blog.moertel.com/articles/2007/09/01/clusterby-a-handy-little-function-for-the-toolbox).

Cheers,
Tom
Page 1 of 2
<<[1] [2] >>