Log in

CS, trees, graphs, ontologies, ireland - brad's life [entries|archive|friends|userinfo]
Brad Fitzpatrick

[ website | bradfitz.com ]
[ userinfo | livejournal userinfo ]
[ archive | journal archive ]

CS, trees, graphs, ontologies, ireland [Jan. 9th, 2007|11:52 pm]
Brad Fitzpatrick
[Tags|, ]

The CS geek in me wants to see the world as a tree. Everything has a single parent node. The relational database geek in me says "fuck all that, m:n yo!"

Yet I try to make the tree work.

zum Beispiel,

Say I only got a hypothetical byte or two to encode the major population centers of the (LiveJournal) world. If I make a tree of regions and number the regions as I walk the tree depth-first, you get something like:
  1  AU
  2  . New South Wales
  3  . . Sydney
  4  . Victoria
  5  . . Melbourne
  6  DE
  7  PH
  8  RU
  9  . Moscow
 10  . Saint Petersburg
 11  UA
 12  . Kiev
 13  UK
 14  . England
 15  . . London
 16  . Northern Ireland
 17  . Scotland
 18  . Wales
 19  US
 20  . CA
 21  . . Bay Area
 22  . . . North Bay
 23  . . . . Marin
 24  . . . . . Sausalito
 25  . . . . . Tiburon
 26  . . . . Napa
 27  . . . . Solano
 28  . . . . Sonoma
 29  . . . San Francisco
 30  . OR
 31  . . Portland Metro Area
 32  . . . Beaverton
 33  . . . Portland
 34  . WA
 35  . . Seattle

Then, if you want to search people in the bay area, you find users with location IDs from 21-29. Or all of California is 20-29. etc. This is all fine and dandy if your world fits into a pretty little tree, but then you got stuff like Ireland...

Northern Island is part of the UK.
The rest of Ireland is, politically, part of the EU.

But shouldn't there be a common ancestor node for the island of Ireland? The tree breaks down!

But I shouldn't have even mentioned search because that clouds the intent of this post. This isn't about solving any technical problem. At a high level, what I want to provide people with is a way to search by broader or narrow "region" where that region is interesting. Not "50 miles from here". That's easy. But that city over there across the river is basically inaccessible.... you're more likely to drive 3 times as far in that other direction.

Side story: my car recently recommended I drive like 3 times as far because its stupid navigation can only find the nearest junkfood restaurants as-the-crow-flies and the location 9 miles north (along my direction of travel) was farther than the 8 miles as-the-crow-flies which actually was 5.5 miles north, 5.5 miles east (out of the way), eat, 5.5 west (back to free way). So 11 miles wasted. Yay quadtree!

(don't worry, the humans won and belayed the computer's order)

Anyway. I want a tree of the world locations, in useful break-downs that are useful to real humans. Community project or does this data freely exist?

Note that this project is unrelated to my previous post which is all about equivalences of user-inputted cities. This post is about letting homeboy admit he's from the surburbs without being excluded from the big-city "metro area" searches (which will be checked by default, to let homeboy give accurate data....)

Enlighten me. Tell me that my fantasy tree exists. I'll also, less happily, accept a free graph.

(Deleted comment)
(Deleted comment)
[User Picture]From: toast0
2007-01-10 09:06 am (UTC)
You can't make the tree work if you want metro areas. See the Quad Cities for example. Or the Four Corners.

At Yahoo! Travel we organize the cities of the world by state/whatever, country, continentish. This works out to a nice tree. But we also have some number of 'Areas' which are collections of cities (data for US only I think). For more detail and flexibility, you could put neighborhoods in below cities (I hope... but maybe some neighborhoods cross city lines), and allow collections of more than just cities, but I think if you have collections of collections it gets messy fast.

(Reply) (Thread)
[User Picture]From: vadda
2007-01-10 09:13 am (UTC)
Sounds like a humongous project. Customized "local areas"? How do you define what's "local" to a specific person and what's not? To me - San Francisco is local and Palo Alto is not, to my neighbor across the street it's the opposite, cause he happens to work in Palo Alto, therefore his "local area" is different. In modern world it boils down to individuals or fairly small groups of individuals confined within specific areas. Sure, in medieval France it won't be an issue since majority of the people did not travel much. So, you either stick to formal boundaries (state, county or "oblast'") or drill down to individuals.

Just to make a counterpoint - isince it is now possible to track all wireless devices/users by using cell towers (down to few meters I believe), by gathering statistics and making combined charts it would be possible to see what are the most common routes/areas for people living in specific places. Would be interesting to see what the real "localities" are ...
(Reply) (Thread)
[User Picture]From: nyssabee
2007-01-10 09:27 am (UTC)
You forgot Brisbane in Australia :)
(Reply) (Thread)
[User Picture]From: edm
2007-01-10 09:49 am (UTC)


My CS/math geek gut feel from your description is that you want a DAG (Directed Acyclic Graph), which is essentially a tree where some of the branches have "grown together" again so that a given leaf may be reached through multiple paths through the branches. (But because it's (a) directed and (b) acyclic you don't get loops in walking the graph.) Your use case ("who is in this enclosing group") might make the actual search implementation a little tricky, at least in SQL, but in terms of graph traversal it is pretty trivial (start at desired search item, walk along directions in graph to find all the things which belong). In may be easier to pull in the entire DAG and search it in memory (with perl) than try to do it in SQL directly (and if you build the DAG at startup, as a const structure, you should be able to share it amongst all your processes.)

FWIW, I think at the point you're saying, eg, "Ireland" can belong to two different things (which in turn perhaps have a common parent -- the island they're on), you've got to abandon the idea that what you have is a tree.

Incidentally the nice thing about having such a structure is that you could allow a user to walk down the DAG to find the place they identify with, and thus (given a large enough data set) pretty much solve the data entry issue you have (which gives you half a dozen different "St Petersburg" locations). But I think you might have to build this data set yourself, perhaps in a similar way to the schools list ("pick your location -- not on the list? Submit a new one here").


PS: Possibly of use: Boost::Graph, which has a Boost::Graph::Directed module, and Graph which has a Graph::Directed. The first one appears to be a wrapper around a C++ Boost library.
(Reply) (Thread)
[User Picture]From: pne
2007-01-10 08:27 pm (UTC)


Your use case ("who is in this enclosing group") might make the actual search implementation a little tricky, at least in SQL, but in terms of graph traversal it is pretty trivial (start at desired search item, walk along directions in graph to find all the things which belong). In may be easier to pull in the entire DAG and search it in memory (with perl) than try to do it in SQL directly (and if you build the DAG at startup, as a const structure, you should be able to share it amongst all your processes.)

I didn't know it was called a DAG, but that's a data structure we use at work, and we search it with SQL.

We have zones which can be part of other zones, though zones can have more than one parent zone, and sometimes those lines later re-join. (With your example, we might have both Northern Ireland -> UK -> EU -> World and Northern Ireland -> Ireland (the island) -> Europe -> World, or something like that.)

We do quite a few queries both up ("What zones enclose this zone?") and down ("Which zones are part of those zone?") the "tree" (well, DAG, I suppose).

We do use non-ANSI SQL syntax, though (specifically, Oracle's "CONNECT BY" feature, which works really well with this sort of hierarchical data, at least for us).
(Reply) (Parent) (Thread)
From: jamesd
2007-01-10 10:02 am (UTC)
Probably not currently doable with public data, but that just points to the need for a new schools project, possibly with involvment from the fair community at wikipedia that is interested in GIS.

I did once do this for ports, working out routings from port to port. The bad news is that the end paths in the last 10-300 miles were often complex enough to require human tuning of routes and your world here is all about the end of the graph, so you need human help. The generic rules were overridable with human entered port to port or waypoint to waypoint distances and I don't see a practical alternative for you when it comes to neeing sensible obstacle-avoiding distances.

The good news is that these local routes are just another lookup table, so it's not unduly tough when you can throw a lot of people at the problem.
(Reply) (Thread)
From: jmason
2007-01-10 11:02 am (UTC)

re Ireland

not to provide a useful answer -- but:

actually, the Irish situation is a little more complex; if you want to consider it geographically, Eire (aka the Republic of Ireland) and Northern Ireland *are* both part of the island of Ireland, which is part of Europe.

However, politically, it's slightly different: Eire is part of the EU, as is the UK. Northern Ireland is part of the UK, not of Eire.

meh, politics. Of course, if you're talking about some kind of system for people to identify their location, some inhabitants might take offense to suggestions that don't match their world-view, so you've got to take it into account ;)
(Reply) (Thread)
[User Picture]From: nothings
2007-01-10 11:05 am (UTC)
Not a tree of the whole thing, but a diagram for the whole british isles mess. (I think wikipedia has a similar one, and pendanticizes that it's an 'Euler diagram' not a 'Venn diagram', but whatever.)


Basically it boils down to whether you want geography or political. There's no tree that's geopolitical, because of the UK vs. island-of-Ireland conflict you mention.

Also, I'm puzzled that you'd alphabetize the tree everywhere. Surely for nice consistency you should sort each level so the most populous is at the top. Not that you need to for any reason, it's just weird otherwise.
(Reply) (Thread)
[User Picture]From: brad
2007-01-10 04:17 pm (UTC)
Hah, that's a great diagram... thanks!

(The alphabetization was just for show.)

(Reply) (Parent) (Thread)
[User Picture]From: henrylyne
2007-01-10 06:46 pm (UTC)
My father calls the British Isles the Wise Islands, for Wales, Ireland, Scotland and England. Not sure where he got that from.
(Reply) (Parent) (Thread)
[User Picture]From: scosol
2007-01-11 08:00 pm (UTC)
> Basically it boils down to whether you want geography or political.

yeah- perhaps more important is language even-

a "tree" doesn't work- you need to have multiple parents
(Reply) (Parent) (Thread)
[User Picture]From: mart
2007-01-10 11:16 am (UTC)

For a project at work I needed a similar data structure. Since it's UK-only right now, I made some scripts to screen-scrape the data on UK ceremonial counties and their constituent towns/areas from Wikipedia. If you do find some decent data, I'd love to see it too!

(Reply) (Thread)
[User Picture]From: crschmidt
2007-01-10 12:48 pm (UTC)
Vmap0 has geographic data down to country level for the whole world, and down to the city level for important cities, though whether it's going to be useful to solve your problem is another question. (Solving the 'what regions, in descending order of size/importance, cover this geographic point' problem is one that I've been told by the very smart people I work with is very difficult, which is why we don't solve it.)

For the US, there's a lot of city data in census stuff, and it's all public domain.

So, the answer is probably "You can seed a decent database (if you can come up with a tree algorithm for data) but you'll need more input from users after that."

If you do come up with a user-input database, getting a dump of it for a project like OpenStreetMap would be cool, since then any work you do could be leveraged by others in the future, rather than building their own.

Unfortunately for use case, OpenStreetMap probably has almost none of the data you need, so it's not useful in the other direction.
(Reply) (Thread)
[User Picture]From: figg
2007-01-10 01:27 pm (UTC)
This is all fine and dandy if your world fits into a pretty little tree, but then you got stuff like Ireland...

And the Isle of man, not in the uk, not in the eu, but british citizens. A similar thing happens for the Channel Islands. (Crown Dependancies are weird).

There are also British overseas territories like gibraltar, bermuda, and the falkland islands.

If you're including the EU as a parent node, remember that many countries are in Europe, but not in the EU. Like norway (Famous for is absence on the euro coin), and Switzerland.

What about Campione d'Italia? Part of Italy, in Switzerland. Or Busingen, german, also in switzerland.

Or the Vatican/San Marino - in italy, but not part of italy or the eu.

Is the Kaliningrad Oblast part of Europe, even though it's in russia?

Baarle-Hertog/Baarle-Nassau is the same town, just in different countires, and partially inside one another. (Half of this shop is in belgium, half of it is in the netherlands).

If you're using the EU as a parent node, what about the commonwealth of nations? The Arab League? CIS?

Is Taiwan/Taipei part of PRC or not? Is the falklands part of argentinia or the uk? Is western sahara part of morroco? Is palestine part of israel? Is kashmir part of india or pakistan?

Let us not forget some of the geographical oddities in the us, like the district of columbia (a federal district, not a state or in none), or point roberts (only accessible by canada, but in Washington).

Tell me that my fantasy tree exists

Do you want a pony for christmas too? :)
(Reply) (Thread)
[User Picture]From: brad
2007-01-10 04:19 pm (UTC)
Wonderful links, thanks!

I imagine ponies are too much work and I have enough. :-)
(Reply) (Parent) (Thread)
[User Picture]From: dossy
2007-01-10 01:42 pm (UTC)
Careful, Brad, you're on the precipice of the very slippery slope which is the Travelling Salesman Problem.

Of course, I'd personally love to see you solve it, but that's a bit of a daunting challenge. :-)
(Reply) (Thread)
[User Picture]From: brad
2007-01-10 04:20 pm (UTC)
Don't worry -- am aware of TSP and not trying to solve it.
(Reply) (Parent) (Thread)
From: asija
2007-01-10 03:01 pm (UTC)
To solve a similar problem I used this directory of world locations:


It doesn't completely answer your question (I'm not sure about "useful to real humans" aspect), but for some countries (Italy and Russia, for example) the tree is very good.
(Reply) (Thread)
[User Picture]From: brad
2007-01-10 04:24 pm (UTC)
Thanks! Great resource!
(Reply) (Parent) (Thread)
[User Picture]From: jedermann
2007-01-10 03:36 pm (UTC)
From a conceptual point of view, what you need is a weighted graph.

The world can be tesselated, relatively crudely into a set of politically and culturally homogeneous regions (tiles) of area not greater than, say, A (defined by, say, how much one can cover by car within 2 hours). We can assume that all points inside each tile are equally accessible from any other point inside the same tile. For example, there is obviously no need (but it still may be sensible) to isolate individual states in the US because there is no real border of any kind between them. Countries in Europe are most likely valid tiles because there are the language barriers.

After that, each tile is assumed to be a node in the graph. Adjacent tiles are connected by edges. The weight of each edge represents the difficulty to cross the tile border (visa, language, etc.). After that you basically do the < 50 miles type search on this weighted graph.

Obviously, pretty impossible to implement without a lot of research and work, but I think that's ultimately what one needs.
(Reply) (Thread)
[User Picture]From: brad
2007-01-10 04:23 pm (UTC)
I totally agree. That'd be a wonderful project.

[Un]fortunately I need the 90%-good-enough in 5% of the time project. :-)
(Reply) (Parent) (Thread)
[User Picture]From: nick
2007-01-10 04:46 pm (UTC)
(Reply) (Thread)
[User Picture]From: brad
2007-01-10 04:49 pm (UTC)
And can you believe it only took like 30 seconds?
(Reply) (Parent) (Thread)
[User Picture]From: amdiranifani
2007-01-10 06:22 pm (UTC)
Leaving aside the question of political entities, for dealing with geographic data, the data structure you want is an R-tree or its variants (R*, R+).
(Reply) (Thread)
From: chrislightfoot
2007-01-10 10:57 pm (UTC)
you might want to have a look at the stuff I did in this area for mySociety:
(Reply) (Thread)
[User Picture]From: taral
2007-01-10 11:33 pm (UTC)
It's not hard to conceive of an algorithm to fix the stupid POI handling in most nav systems.
(Reply) (Thread)
From: legolas
2007-01-10 11:51 pm (UTC)
This makes me think about how I handle email (store in folders) vs how google handles email (search, tagging(?)) and how lj and others handle posts (tagging). Feels like you're doing the tree, and you really want something else. (But then I can't imagine working with my email the google way yet, so perhaps I'm not the right guy to tell you all this ;-)
(Reply) (Thread)
[User Picture]From: algeh
2007-01-11 02:55 am (UTC)
Another thing to consider is that some suburbs are kind of between communities. For example, think of where you used to live in Aloha/Hillsboro. From there, Beaverton, Portland, and Hillsboro all kind of seemed "local", but it wasn't really contained in any of them (I think it was technically in Hillsboro) - for any reasonable definition of "nearby" one would be equally interested in Beaverton stuff or Hillsboro stuff, yet that would not be true of someone who lived in a more central part of Hillsboro or Beaverton. I'm not sure this is the best example, but I think there are more overlapping rings than hierarchical trees in what is "local" to a person/community.
(Reply) (Thread)
[User Picture]From: ossie
2007-01-11 04:14 am (UTC)

why not try timezones

that should give you around 20 different parent parts of the tree

then you can have subsections for states in the US
countries in Europe
to give you
1 a-z
2 a-z
then you can further branch out from that to cities within the countries

or you can use a globe and actually figure out location from ISPs and have 18 by 24 sections from lattitude and longtitude so you get 432 sections, although there will be a lot of ocean with noone in them

(Reply) (Thread)