You can't make the tree work if you want metro areas. See the Quad Cities
for example. Or the Four Corners
At Yahoo! Travel we organize the cities of the world by state/whatever, country, continentish. This works out to a nice tree. But we also have some number of 'Areas' which are collections of cities (data for US only I think). For more detail and flexibility, you could put neighborhoods in below cities (I hope... but maybe some neighborhoods cross city lines), and allow collections of more than just cities, but I think if you have collections of collections it gets messy fast.
Sounds like a humongous project. Customized "local areas"? How do you define what's "local" to a specific person and what's not? To me - San Francisco is local and Palo Alto is not, to my neighbor across the street it's the opposite, cause he happens to work in Palo Alto, therefore his "local area" is different. In modern world it boils down to individuals or fairly small groups of individuals confined within specific areas. Sure, in medieval France it won't be an issue since majority of the people did not travel much. So, you either stick to formal boundaries (state, county or "oblast'") or drill down to individuals.
Just to make a counterpoint - isince it is now possible to track all wireless devices/users by using cell towers (down to few meters I believe), by gathering statistics and making combined charts it would be possible to see what are the most common routes/areas for people living in specific places. Would be interesting to see what the real "localities" are ...
You forgot Brisbane in Australia :)
2007-01-10 09:49 am (UTC)
My CS/math geek gut feel from your description is that you want a DAG (Directed Acyclic Graph
), which is essentially a tree where some of the branches have "grown together" again so that a given leaf may be reached through multiple paths through the branches. (But because it's (a) directed and (b) acyclic you don't get loops in walking the graph.) Your use case ("who is in this enclosing group") might make the actual search implementation a little tricky, at least in SQL, but in terms of graph traversal it is pretty trivial (start at desired search item, walk along directions in graph to find all the things which belong). In may be easier to pull in the entire DAG and search it in memory (with perl) than try to do it in SQL directly (and if you build the DAG at startup, as a const structure, you should be able to share it amongst all your processes.)
FWIW, I think at the point you're saying, eg, "Ireland" can belong to two different things (which in turn perhaps have a common parent -- the island they're on), you've got to abandon the idea that what you have is a tree.
Incidentally the nice thing about having such a structure is that you could allow a user to walk down the DAG to find the place they identify with, and thus (given a large enough data set) pretty much solve the data entry issue you have (which gives you half a dozen different "St Petersburg" locations). But I think you might have to build this data set yourself, perhaps in a similar way to the schools list ("pick your location -- not on the list? Submit a new one here").
PS: Possibly of use: Boost::Graph
, which has a Boost::Graph::Directed module, and Graph
which has a Graph::Directed. The first one appears to be a wrapper around a C++ Boost library.
2007-01-10 08:27 pm (UTC)
Your use case ("who is in this enclosing group") might make the actual search implementation a little tricky, at least in SQL, but in terms of graph traversal it is pretty trivial (start at desired search item, walk along directions in graph to find all the things which belong). In may be easier to pull in the entire DAG and search it in memory (with perl) than try to do it in SQL directly (and if you build the DAG at startup, as a const structure, you should be able to share it amongst all your processes.)
I didn't know it was called a DAG, but that's a data structure we use at work, and we search it with SQL.
We have zones which can be part of other zones, though zones can have more than one parent zone, and sometimes those lines later re-join. (With your example, we might have both Northern Ireland -> UK -> EU -> World and Northern Ireland -> Ireland (the island) -> Europe -> World, or something like that.)
We do quite a few queries both up ("What zones enclose this zone?") and down ("Which zones are part of those zone?") the "tree" (well, DAG, I suppose).
We do use non-ANSI SQL syntax, though (specifically, Oracle's "CONNECT BY" feature, which works really well with this sort of hierarchical data, at least for us).
Probably not currently doable with public data, but that just points to the need for a new schools project, possibly with involvment from the fair community at wikipedia that is interested in GIS.
I did once do this for ports, working out routings from port to port. The bad news is that the end paths in the last 10-300 miles were often complex enough to require human tuning of routes and your world here is all about the end of the graph, so you need human help. The generic rules were overridable with human entered port to port or waypoint to waypoint distances and I don't see a practical alternative for you when it comes to neeing sensible obstacle-avoiding distances.
The good news is that these local routes are just another lookup table, so it's not unduly tough when you can throw a lot of people at the problem.
2007-01-10 11:02 am (UTC)
not to provide a useful answer -- but:
actually, the Irish situation is a little more complex; if you want to consider it geographically, Eire (aka the Republic of Ireland) and Northern Ireland *are* both part of the island of Ireland, which is part of Europe.
However, politically, it's slightly different: Eire is part of the EU, as is the UK. Northern Ireland is part of the UK, not of Eire.
meh, politics. Of course, if you're talking about some kind of system for people to identify their location, some inhabitants might take offense to suggestions that don't match their world-view, so you've got to take it into account ;)
2007-01-10 04:17 pm (UTC)
Hah, that's a great diagram... thanks!
(The alphabetization was just for show.)
My father calls the British Isles the Wise Islands, for Wales, Ireland, Scotland and England. Not sure where he got that from.
> Basically it boils down to whether you want geography or political.
yeah- perhaps more important is language even-
a "tree" doesn't work- you need to have multiple parents
2007-01-10 11:16 am (UTC)
For a project at work I needed a similar data structure. Since it's UK-only right now, I made some scripts to screen-scrape the data on UK ceremonial counties and their constituent towns/areas from Wikipedia. If you do find some decent data, I'd love to see it too!
Vmap0 has geographic data down to country level for the whole world, and down to the city level for important cities, though whether it's going to be useful to solve your problem is another question. (Solving the 'what regions, in descending order of size/importance, cover this geographic point' problem is one that I've been told by the very smart people I work with is very difficult, which is why we don't solve it.)
For the US, there's a lot of city data in census stuff, and it's all public domain.
So, the answer is probably "You can seed a decent database (if you can come up with a tree algorithm for data) but you'll need more input from users after that."
If you do come up with a user-input database, getting a dump of it for a project like OpenStreetMap
would be cool, since then any work you do could be leveraged by others in the future, rather than building their own.
Unfortunately for use case, OpenStreetMap probably has almost none of the data you need, so it's not useful in the other direction.
2007-01-10 01:27 pm (UTC)
This is all fine and dandy if your world fits into a pretty little tree, but then you got stuff like Ireland...
And the Isle of man, not in the uk, not in the eu, but british citizens. A similar thing happens for the Channel Islands. (Crown Dependancies
There are also British overseas territories
like gibraltar, bermuda, and the falkland islands.
If you're including the EU as a parent node, remember that many countries are in Europe, but not in the EU. Like norway (Famous for is absence on the euro coin
), and Switzerland.
What about Campione d'Italia
? Part of Italy, in Switzerland. Or Busingen
, german, also in switzerland.
Or the Vatican/San Marino - in italy, but not part of italy or the eu.
Is the Kaliningrad Oblast
part of Europe, even though it's in russia?Baarle-Hertog
is the same town, just in different countires, and partially inside one another. (Half of this shop is in belgium, half of it is in the netherlands
If you're using the EU as a parent node, what about the commonwealth of nations? The Arab League? CIS?
Is Taiwan/Taipei part of PRC or not? Is the falklands part of argentinia or the uk? Is western sahara part of morroco? Is palestine part of israel? Is kashmir part of india or pakistan?
Let us not forget some of the geographical oddities in the us, like the district of columbia (a federal district, not a state or in none), or point roberts
(only accessible by canada, but in Washington).Tell me that my fantasy tree exists
Do you want a pony for christmas too? :)
2007-01-10 04:19 pm (UTC)
Wonderful links, thanks!
I imagine ponies are too much work and I have enough. :-)
Careful, Brad, you're on the precipice of the very slippery slope which is the Travelling Salesman Problem
Of course, I'd personally love
to see you solve it, but that's a bit of a daunting challenge. :-)
2007-01-10 04:20 pm (UTC)
Don't worry -- am aware of TSP and not trying to solve it.
To solve a similar problem I used this directory of world locations:
It doesn't completely answer your question (I'm not sure about "useful to real humans" aspect), but for some countries (Italy and Russia, for example) the tree is very good.
2007-01-10 04:24 pm (UTC)
Thanks! Great resource!
From a conceptual point of view, what you need is a weighted graph.
The world can be tesselated
, relatively crudely into a set of politically and culturally homogeneous regions (tiles) of area not greater than, say, A (defined by, say, how much one can cover by car within 2 hours). We can assume that all points inside each tile are equally accessible from any other point inside the same tile. For example, there is obviously no need (but it still may be sensible) to isolate individual states in the US because there is no real border of any kind between them. Countries in Europe are most likely valid tiles because there are the language barriers.
After that, each tile is assumed to be a node in the graph. Adjacent tiles are connected by edges. The weight of each edge represents the difficulty to cross the tile border (visa, language, etc.). After that you basically do the < 50 miles type search on this weighted graph.
Obviously, pretty impossible to implement without a lot of research and work, but I think that's ultimately what one needs.
2007-01-10 04:23 pm (UTC)
I totally agree. That'd be a wonderful project.
[Un]fortunately I need the 90%-good-enough in 5% of the time project. :-)
2007-01-10 04:46 pm (UTC)
DOOD UR MAPS R DA SHIT!!! U SHUD B A SURVEYOR 4 REALZ!!!
2007-01-10 04:49 pm (UTC)
And can you believe it only took like 30 seconds?
Leaving aside the question of political entities, for dealing with geographic data, the data structure you want is an R-tree
or its variants (R*, R+).
you might want to have a look at the stuff I did in this area for mySociety:
It's not hard to conceive of an algorithm to fix the stupid POI handling in most nav systems.
This makes me think about how I handle email (store in folders) vs how google handles email (search, tagging(?)) and how lj and others handle posts (tagging). Feels like you're doing the tree, and you really want something else. (But then I can't imagine working with my email the google way yet, so perhaps I'm not the right guy to tell you all this ;-)
Another thing to consider is that some suburbs are kind of between communities. For example, think of where you used to live in Aloha/Hillsboro. From there, Beaverton, Portland, and Hillsboro all kind of seemed "local", but it wasn't really contained in any of them (I think it was technically in Hillsboro) - for any reasonable definition of "nearby" one would be equally interested in Beaverton stuff or Hillsboro stuff, yet that would not be true of someone who lived in a more central part of Hillsboro or Beaverton. I'm not sure this is the best example, but I think there are more overlapping rings than hierarchical trees in what is "local" to a person/community.
2007-01-11 04:14 am (UTC)
why not try timezones
that should give you around 20 different parent parts of the tree
then you can have subsections for states in the US
countries in Europe
to give you
then you can further branch out from that to cities within the countries
or you can use a globe and actually figure out location from ISPs and have 18 by 24 sections from lattitude and longtitude so you get 432 sections, although there will be a lot of ocean with noone in them