?

Log in

No account? Create an account
brad's life [entries|archive|friends|userinfo]
Brad Fitzpatrick

[ website | bradfitz.com ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

Naming twins in Python & Perl [Jan. 6th, 2008|02:16 pm]
Brad Fitzpatrick
[Tags|, , , ]

Last night at Beau's party, one of Beau's guests mentioned he's expecting twins shortly, which is why is wife wasn't at the party.

I drunkenly suggested he name his kids two names that were anagrams of each other. I then wandered off downstairs to find such suitable names.

Because I'm supposed to be be working in Python these days, not Perl, I gathered what little Python knowledge I had and wrote out:
#!/usr/bin/python

by_anagram = {}

names_file = open("dist.male.first")
for line in names_file:
    # lines in file look like:
    # WILLIAM        2.451 14.812      5
    # we want just the first field.
    name = (line.split())[0]
    letters = [letter for letter in name]
    letters.sort()
    sorted_name = "".join(letters)
    if not sorted_name in by_anagram:
        by_anagram[sorted_name] = []
    by_anagram[sorted_name].append(name)

for sorted_name in by_anagram:
    if len(by_anagram[sorted_name]) < 2:
        continue

    print by_anagram[sorted_name]
Not so happy with it, but it worked, and I printed out the results and brought them up to the guy:
['TROY', 'TORY']
['CLAY', 'LACY']
['JEFFREY', 'JEFFERY']
['ANGEL', 'GALEN']
['FOREST', 'FOSTER']
['ANDRE', 'DAREN', 'ARDEN']
['LEROY', 'ELROY']
['LEONARD', 'RENALDO', 'LEANDRO']
['ARIEL', 'ARLIE']
['BRENDAN', 'BRANDEN']
['WARREN', 'WARNER']
['DEAN', 'DANE']
['CHRISTOPER', 'CRISTOPHER']
['COLE', 'CLEO']
['CARLO', 'CAROL']
['ELMER', 'MERLE']
['REUBEN', 'RUEBEN']
['JOSEPH', 'JOESPH', 'JOSPEH']
['COREY', 'ROYCE']
['JASON', 'JONAS']
['RAMON', 'ROMAN']
['JAMIE', 'JAIME']
['CARMELO', 'MARCELO']
['BYRON', 'BRYON']
['LEON', 'NOEL', 'OLEN']
['NEAL', 'LANE']
['MICHAEL', 'MICHEAL', 'MICHALE']
['KEITH', 'KIETH']
['BERT', 'BRET']
['BRIAN', 'BRAIN']
['OLIN', 'LINO']
['DION', 'DINO']
['DANA', 'ADAN']
['RONALD', 'ROLAND', 'ARNOLD']
['ISRAEL', 'ISREAL']
['DARNELL', 'RANDELL']
['ANTOINE', 'ANTIONE']
['ORLANDO', 'ROLANDO', 'ARNOLDO']
Just now, I was wondering the equivalent in Perl, and wrote:
#!/usr/bin/perl

use strict;
open (my $fh, "dist.male.first") or die;
my %by_anagram;
while (<$fh>) {
    chomp;
    s/\s.*//;
    my $name = $_;
    my $sorted_name = join('', sort split //, $name);
    push @{$by_anagram{$sorted_name}}, $name;
}

foreach my $sn (grep { @{$by_anagram{$_}} > 1 } keys %by_anagram) {
    print "@{$by_anagram{$sn}}\n";
}
In particular, I like about Python that errors-are-exceptions is the norm. But I like that regexps are built into Perl. (also hate Python's general hating on functional programming, unrelated to this post) I'm sure my Python could be way shorter, too. Anybody want to post either their short Python version, or their more-idiomatic Python version?

Also interesting: (fastest of a 3 consecutive runs each)
sammy$ time ./ananames.pl > /dev/null

real	0m0.026s
user	0m0.024s
sys	0m0.004s

sammy$ time ./ananames.py > /dev/null

real	0m0.043s
user	0m0.036s
sys	0m0.008s
LinkReply

Comments:
Page 2 of 3
<<[1] [2] [3] >>
From: edge_walker
2008-01-07 05:00 am (UTC)

Hey, no fair. No one’s showing the Perl version any love even though it has a lot of stuff you can get rid of:

#!/usr/bin/perl
use strict;

sub sort_chars { join '', sort split //, shift }

my %by_anagram;

@ARGV = "dist.male.first"; # better yet: pass it on the command line
while (<>) {
    s/\s.*//s;
    push @{ $by_anagram{ sort_chars($_) } }, $_;
}

for ( values %by_anagram ) {
    print "@$_\n" if @$_ > 1;
}

(Untested.)

Although I’d write it slightly longer in order to de-uglify the push line, by factoring out the addressing into an extra temporary:

    my $bucket = \$by_anagram{ sort_chars($_) }; # oh hai, iz mah...
    push @$bucket, $_;

Much nicer.

(Reply) (Thread)
From: (Anonymous)
2008-01-08 05:13 am (UTC)

A little less opaque with most of his code concepts intact ...

#!/usr/bin/perl open (my $fh, "dist.male.first") or die; my (%by_anagram,$name,$first,$rest,$sorted_name); while (($name=<$fh>)=~ s/\s.*\n//m) { push @{$by_anagram{join('', sort split //, $name) }},$name; } foreach my $sn ( keys %by_anagram) { printf "%s\n",join(", ",@{$by_anagram{$sn}}) if (@{$by_anagram{$sn}} > 1); } Removed as much of the extraneous cde as possible, though I didn't alter his character sort (join, sort, split). landman@lightning:~$ time ./ana2.pl > m real 0m0.036s user 0m0.028s sys 0m0.004s
(Reply) (Parent) (Thread)
From: (Anonymous)
2008-01-07 06:06 am (UTC)
In situations like these, I'd prefer line.strip() to (line.split())[0]. The latter doesn't need the extra parentheses, anyway (and I don't know the layout of the file, if there is more than one thing on every line my replacement doesn't work).
(Reply) (Thread)
[User Picture]From: ciphergoth
2008-01-07 11:34 pm (UTC)
It doesn't - see the comment in the Perl code, there are other fields in the file. I think this is the cleanest way.
(Reply) (Parent) (Thread)
[User Picture]From: mart
2008-01-07 10:19 am (UTC)

I thought it might be fun to do it in C#, just for kicks:

using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;

public class AnagramNames {
	public static void Main(string[] args) {
		var names = from name in File.ReadAllLines("dist.male.first") select name.Split(' ')[0];
		var pairs = from pair in (
			from name in names group name by SortCharsInString(name)
		) where pair.Count() > 1 select pair;

		foreach (var pair in pairs) {
			Console.WriteLine(String.Join(",", pair.ToArray()));
		}
	}
	public static string SortCharsInString(string s) {
		char[] arr = s.ToArray();
		Array.Sort(arr);
		return new string(arr);
	}
}

It could be even shorter and more memory-efficient if I could find a way in the standard library to:

  • Get an IEnumerable<string> for "lines in a file" without reading the whole file into an array (it's easy to write such a thing, so I'm sure it's in the standard library somewhere...)
  • Sort the characters in a string in a functional way, rather than in-place

No timings, since I had to write this in Windows; Mono doesn't have stable LINQ support yet.

(Reply) (Thread)
[User Picture]From: mtbg
2008-01-07 05:42 pm (UTC)
(it's easy to write such a thing, so I'm sure it's in the standard library somewhere...)

So, by deduction, it must be really really hard to write a Pair class in Java...

I have very little to say about the Python version that ciphergoth hasn't said. I probably would've implemented it as the list comprehension version. One comment about your version: letters = [letter for letter in name] is better spelled letters = list(name).
(Reply) (Parent) (Thread)
(Deleted comment)
From: free_meal
2008-01-07 02:12 pm (UTC)

Hello.

I'll be a bit offtop. Sorry. I'm writing a work for russian university, work is called "livejournal.com as unofficial massmedia", and i'd like to know your opinion about this. What do you think of the fact that livejournal.com is becoming or became already fourth authority? And can you predict lj's future? What do you prefer: newspapers and tv news or it? Thank you. Danil.
(Reply) (Thread)
From: (Anonymous)
2008-01-07 04:00 pm (UTC)
;; Emacs Lisp version :D
(require 'cl)
(with-current-buffer (find-file-noselect "input")
 (let ((hash (make-hash-table :test #'equal)))
    (while (and (not (eobp))
                (re-search-forward "^\\w*" nil t))
      (setf str (match-string 0))
      (sort* str #'<)
      (push (match-string 0) (gethash str hash)))
    (maphash (lambda (k v)
               (when (cdr v) (print v))) hash)))
(Reply) (Thread)
From: (Anonymous)
2008-01-07 05:19 pm (UTC)

A php version

I know there will be haters but here is a php version.

#!/usr/local/php5/bin/php
$line) {
$parts = explode(" ", $line);
$letters = str_split($parts[0]);
sort($letters);
$letters = implode('', $letters);
if (!array_key_exists($letters, $anagrams)) {
$anagrams[$letters] = array();
}
$anagrams[$letters][] = $parts[0];
}

foreach ($anagrams as $key => $value) {
if (count($value) < 2) {
continue;
}
echo implode(", ", $value)."\n";
}
(Reply) (Thread)
From: (Anonymous)
2008-01-10 04:12 pm (UTC)

Re: A php version

Well, it seems you be a hater of indenting.
Please stop the hate!
(Reply) (Parent) (Thread)
(Deleted comment)
[User Picture]From: ciphergoth
2008-01-07 11:46 pm (UTC)
Test this - it will throw a TypeException; in addition you need to print only the entries with more than one name in the list. "setdefault" is easier. Also, compare my "list comprehensions" version above.
(Reply) (Parent) (Thread)
[User Picture]From: crw
2008-01-07 07:21 pm (UTC)
Alice / Celia, sadly missing (from Jeff Noon's "Automated Alice")
(Reply) (Thread)
From: ext_76681
2008-01-07 08:51 pm (UTC)
Nice post, it's fun to see different approaches. Here's mine in Common Lisp:

(let ((hash (make-hash-table :test #'equal)))
    (with-open-file (stream "names")
                    (loop for line = (read-line stream nil nil)
                          for key = (sort (copy-seq line) #'char<)
                          until (null line)
                          do (push line (gethash key hash nil))))
    (loop for v being the hash-value of hash
          do (when (cdr v) (format t "~{~A~^, ~}~%" v))))

(Reply) (Thread)
From: ext_76693
2008-01-08 12:56 am (UTC)

Generalize the solution

This problem is one of many that involve clustering values based on their signatures. If you factor out the signature-computing method as a parameter, what's left is a handy, reusable function that solves a large family of related problems:

ClusterBy: a handy little function for the toolbox (http://blog.moertel.com/articles/2007/09/01/clusterby-a-handy-little-function-for-the-toolbox).

Cheers,
Tom
(Reply) (Thread)
[User Picture]From: stillsheryl
2008-01-08 05:54 am (UTC)
hey, it was great to meet you and I had a great time at the party. I hope I didn't freak you out or anything. My brother was telling me I hugged you and beau goodbye like 8 times. I guess a little absinthe will do that, hey?
(Reply) (Thread)
[User Picture]From: brad
2008-01-08 07:47 am (UTC)
Heh, it's okay... I was amused. After the 5th hug or so I figured they'd keep coming for awhile. :-)
(Reply) (Parent) (Thread) (Expand)
From: (Anonymous)
2008-01-08 07:23 pm (UTC)

Finding Anagrams with XSLT

See how this is done in pure XSLT here:

http://dnovatchev.spaces.live.com/blog/cns!44B0A32C2CCF7488!357.entry


Cheers,
Dimitre Novatchev
(Reply) (Thread)
From: (Anonymous)
2008-01-09 10:36 pm (UTC)

naive scala implementation

still figuring scala out so this is probably dumb but still within a reasonable #LOC...
      import scala.io.Source
      import scala.collection.mutable.HashMap
      var nameset = new HashMap[String,String]
      for ( line <- Source.fromURL("http://www.census.gov/genealogy/names/dist.male.first").getLines ) {
          val name = line.split(" ")(0)
          val sname = name.split("").toList.filter(x => x.length > 0).sort(_ < _).reduceLeft[String]((x,y) => x + y)
          if ( nameset.contains(sname) ) {
              println(name + " - " + nameset(sname))
          } else {
              nameset(sname) = name
          }
      }
(Reply) (Thread)
From: (Anonymous)
2008-01-10 05:16 am (UTC)

Now in the Io programming language

Just for fun, same thing in the Io programming language (http://www.iolanguage.com):

#! /usr/bin/env io

names := File open("dist.male.first") readLines map(line, line split at(0))

by_anagram := Map clone
sorted := Object clone

names foreach(name,
sorted = name clone;
sorted sort;
(by_anagram hasKey(sorted)) ifFalse(by_anagram atPut(sorted, List clone));
by_anagram at(sorted) append(name))

by_anagram values select(li,li size > 1) map(asString) join("\n") println
(Reply) (Thread)
From: (Anonymous)
2008-01-10 05:46 am (UTC)

shorter python code

#!/usr/bin/env python My shorter python version (also at http://pastebin.ca/849068): by_anagram = {} names_file = open('dist.male.first') for line in names_file: name, _ = line.split(1) sorted_name = ''.join(sorted(letter for letter in name)) if sorted_name not in by_anagram: by_anagram[sorted_name] = [] by_anagram[sorted_name].append(name) print '\n'.join(x for x in by_anagram if len(x) < 2)
(Reply) (Thread)
Page 2 of 3
<<[1] [2] [3] >>