$good->meets($evil) [Mar. 25th, 2006|04:36 pm]
Brad Fitzpatrick

Dear Graph Theorists and Wanna-Be Graph Theorists (like me),

Please help.

My fun Saturday morning project was writing a test harness so people like you could show me up, and show each other up, by writing an awesome graph analysis algorithm.

What does it have to do?

Separate out the "good nodes" from the "bad nodes".

Restrictions:
  • You implement one function, "calc", which takes a node and returns a trust value from 0.0 to 1.0 (inclusive), or undef if you don't have a solid answer. (For the shape of such a function, see the sketch after this list.)

  • Your calc function will run hundreds or thousands of times for each node, and in no particular order. The trust value of each node is constantly changing.

  • You can't keep state between invocations. If you try, it's cheating, and won't work in reality anyway when dozens/hundreds of processes are running independently.

  • All you can get from a provided node object is its current trust level, all nodes connected by inbound edges, and their trust levels. You can keep going deeper and deeper if you want, but:

  • I won't let you go more than two levels deep. You can't recurse the graph forever searching for stuff. That's why I also don't let you walk the graph forward.
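
For concreteness, here's a minimal sketch of what a calc module can look like. (The linked_from and trust accessors are the ones the node object exposes; whether plain averaging of inbound trust is any good is exactly the open question.)

package TrustGraph::Algorithm::Average;

# A trivial calc: return the average of the defined trust values
# on our inbound edges, or undef if none are defined yet.
sub calc {
    my $node = shift;

    my ($sum, $n) = (0, 0);
    for my $in (@{ $node->linked_from }) {
        next unless defined $in->trust;
        $sum += $in->trust;
        $n++;
    }

    return undef unless $n;
    return $sum / $n;
}

1;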

The test harness starts out like this:
my $good = TrustGraph::Graph->load(file => "$Bin/data/good-network.dat.gz");

Wherein we load the first 100,000 LJ users' relationships amongst each other.

Then this:
my $evil = TrustGraph::Graph->generate(nodes => 10_000,
                                       edges => 30_000);

That makes a 10,000-node graph of "evil" entities, with 30,000 edges amongst themselves.

Then we merge them:
my $all = $good->meets($evil);
But they're not linked yet. They just exist in the same graph, with two major islands.

We mark some as root nodes (in the good graph) to seed it all. They have a fixed trust of 1.0:
# mark certain prominent members as trusted root nodes
foreach my $goodid (qw(2 10 14 15 1571)) {
    $good->node($goodid)->mark_as_root;
}
Now, we take a bunch of confused losers who accidentally befriend evil people:
for (1..$confused) {
    $good->random_node->links($evil->random_node);
}
Then we run your calc function on the collective graph:
$all->reset_trust;
$all->mass_compute(
                   iterations => 750_000,
                   func => $calc_func,
                   every => [10_000, $dump_stats],
                   );

As it goes it dumps stats, so you can abort it early if it's looking like it's sucking:

$ ./simulate --module=Brad --confused=2000

PROGRESS: 150000 / 750000 (20.00%)
$VAR1 = {
          'good' => {
                      'coverage' => '0.564841498559078',
                      'avg' => '0.813768195517867',
                    },
          'evil' => {
                      'avg' => '0.345016327849719',
                    }
        };
GOOD:
0.00 - 0.05: 
0.05 - 0.10: 
0.10 - 0.15: 
0.15 - 0.20: 
0.20 - 0.25: 
0.25 - 0.30: 
0.30 - 0.35: #
0.35 - 0.40: #
0.40 - 0.45: #
0.45 - 0.50: ##
0.50 - 0.55: 
0.55 - 0.60: ###
0.60 - 0.65: ###
0.65 - 0.70: 
0.70 - 0.75: #############
0.75 - 0.80: 
0.80 - 0.85: ##########
0.85 - 0.90: 
0.90 - 0.95: ########
0.95 - 1.00: ########################################

BAD:
0.00 - 0.05: ###################
0.05 - 0.10: ##################################
0.10 - 0.15: ###########################
0.15 - 0.20: ########################################
0.20 - 0.25: #################################
0.25 - 0.30: ########################
0.30 - 0.35: ###############################
0.35 - 0.40: #################################
0.40 - 0.45: #######################
0.45 - 0.50: ################
0.50 - 0.55: ############
0.55 - 0.60: ######################
0.60 - 0.65: #################
0.65 - 0.70: 
0.70 - 0.75: #####################
0.75 - 0.80: 
0.80 - 0.85: ###########
0.85 - 0.90: 
0.90 - 0.95: ####
0.95 - 1.00: ##

TODO:
  • Prettier output, use GD for graphs
  • more attack models (notably all evil nodes linking to good nodes) Update: done
  • better harness for running/evaluating lots of algorithms
  • .....
Want to write a calc function?

$ svn co http://code.sixapart.com/svn/TrustGraph/trunk TrustGraph

If you write a good one, tease others in the comments here with the output.

Comments:
From: nothings
2006-03-26 01:52 am (UTC)
So the actual algorithm is just to run calc on all the nodes, iterate, and hope it converges? (Is it possible that's not the best possible algorithm? I guess it may be the only one that easily scales to multiprocessing. But there are even tricks across iterations, such as over-relaxation, that would still allow the same scaling within an iteration. Not that over-relaxation is actually a good idea here, but there might be other such ideas.)

Similarly, are you sure you're not just excluding some good algorithm that works forward? For example, some radiosity algorithms work backwards ("gathering") and some work forwards ("shooting"), and there was a lot of experimentation about which was more efficient; light is linear/additive, so that's a particularly special case, and yet I could imagine that the idea could be useful generally.

From: brad
2006-03-26 02:12 am (UTC)
So the actual algorithm is just to run calc on all the nodes, iterate and hope it converges?

Yes. But the idea was to build this test suite to reduce the "hope" part. I may not be a great (or even good) graph theorist, but I can test and measure until I get what I want. This lets me (or others) play with ideas and see results on a small but realistically shaped dataset first.

Is it possible that's not the best possible algorithm?

Almost certainly! For some attribute. But like you said, for other attributes (scalability to lots of machines without sharing state, and scalability on the size of the graph), it might be exactly what we need. We'll see.

As for going forwards, there's no reason we couldn't allow it. I'm not against it; I just didn't think it'd be too useful. After all, the only outbound links I'd care about are those which are reciprocal, which is easy to find with inbound links alone by searching two levels deep. (Inefficient, yes, but possible, and the second-order network isn't generally that big....) I suppose I should just provide it easily and let people shoot themselves in the foot if they try to use it wrong. (By, say, giving trust to people for linking to really trustworthy people....)

In any case, you know more about all this than I do, so feel free to school me with links/terms and I can google away.

From: nothings
2006-03-26 02:38 am (UTC)
Unfortunately I don't really know anything about the problem space; most of my theoretical knowledge predates PageRank, which was the big push in this area, and most of it is of this totally linear domain (lighting) where there are a lot of really nice mathematical properties that don't apply for non-linearity. It looks like the most relevant topic is 'trust metric' or 'reputation metric', e.g. look at that on Wikipedia (lots of links) and Google.

Based on my general knowledge, the most likely scenario where you could push the algorithm would be to allow more iterations for "important" areas, e.g. pushing forwards from 'definitely good' areas and pushing backwards from 'definitely evil' areas; those are the ones where you get the most bang for your buck in terms of spreading stuff out. I'm not sure exactly how you'd go about designing the implementation so it scales, though, and the problem is that 'pushing good outwards' is difficult to define meaningfully in a non-linear space.

The whole scaling thing is pretty curious; if you're multi-threaded/processor for the sake of efficiency, but you force an otherwise inefficient implementation, are you really gaining? I guess the answer depends on the ratio of threads to inefficiencies.

From: brad
2006-03-26 02:45 am (UTC)
I thought I remembered this not scaling well, but maybe it does:
http://www.advogato.org/trust-metric.html

From: nothings
2006-03-26 02:59 am (UTC)
I can't find much about distributed implementations of maximum-flow (which seems to be at the heart of that algorithm) or anything like that, yeah, so it's probably difficult to scale. (I see mention of 'distributed Ford-Fulkerson', but not enough description to get details.) You might try running it on the reduced data set to compare the results, though.

From: evan
2006-03-26 03:13 am (UTC)
http://www.aaronsw.com/2002/raphchat

"A chat from opn:p2p-hackers on the attack-resistance of Google PageRank, anonymous remailer networks, and lots of stuff in between. [2002-03-02]"

(Raph from Advogato talking about PageRank as an algorithm for trust networks.)

From: billemon
2006-03-26 01:30 pm (UTC)
The example you mentioned (radiosity) seems to me like a great analogy for this, if you consider the input "trust" points (the roots of your graph) as lights and treat "trust" in this case as something like how much light is shed on each of the other nodes. Especially if you call the root nodes "luminaries" ;)

One thing that bothers me is that there's a potential for a loop of links to artificially inflate the "degree of trust", perhaps, but that follows naturally from the idea of having (say) a group of reflective objects near each other: you would naturally get a stronger reflection.

Reciprocal links, though, do imply an increased degree of trust (such that you ignore serial adders as long as few people link back to them). So this approach does give a reasonable model of how trust works, imo.
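
Something like this, maybe, as a rough sketch of the light-shedding idea (the 0.5 attenuation factor is an arbitrary pick, not anything principled):

# Each inbound edge "sheds" a damped fraction of its source's
# trust on this node; we saturate at 1.0.
sub calc {
    my $node = shift;

    my $light = 0;
    for my $in (@{ $node->linked_from }) {
        next unless defined $in->trust;
        $light += 0.5 * $in->trust;
    }

    return undef unless $light > 0;
    return $light > 1 ? 1 : $light;
}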

From: evan
2006-03-26 02:53 am (UTC)
I think you also want to specify that evil nodes can link to good ones. Otherwise, you just reverse all the links and find the maximal set of nodes reachable from your roots by following links.

From: brad
2006-03-26 02:56 am (UTC)
That's one of my todo items. ("more attack models (notably all evil nodes linking to good nodes)")

From: moconnor
2006-03-26 07:17 am (UTC)
Yeah, if the evil nodes don't do any linking, you can get pretty close to perfect staying within the constraints you gave:

package TrustGraph::Algorithm::Matthew;

sub calc {
    my $node = shift;
    my $back_links = $node->linked_from;    # nodes linking to us

    # Return full trust iff some inbound neighbor with a defined
    # trust value is one we link back to (a reciprocal edge).
    for my $n (@$back_links) {
        my $links = $n->linked_from;
        for my $o (@$links) {
            return 1 if $o == $node and defined $n->trust;
        }
    }

    return undef;
}
1;

From: no_gritzko_here
2006-03-26 07:24 am (UTC)

Islands aren't necessarily evil

Islands aren't necessarily evil. It might be some RPG clique or the Russian part of LJ. You have no negative opinions in the model (no distrust) and no personalization (how much Alice trusts Bob). So a popularity metric (like PageRank) is the only thing you can do here.

From: brad
2006-03-26 05:19 pm (UTC)

Re: Islands aren't necessarily evil

In reality (people have done analysis on the entire graph), there are no islands bigger than a couple people. Even RPG cliques are read from the outside, and Russians also often speak English and get linked into the graph.

That said, the idea isn't to automatically bitch-slap any account for being marked "bad" (aka not good), but rather just have it available as one of many existing variables for detection of suspicious accounts.

From: xlerb
2006-03-27 10:27 am (UTC)

Re: Islands aren't necessarily evil

One problem I ran into, I think, wasn't islands so much as peninsulae — nodes with only a single friend-of, and other nodes reachable from the roots only via the single-in nodes. (But I tried to analyze this at one point, and I don't think it accounts for all of the ~40% my latest leaves behind.)

From: ciphergoth
2006-03-26 08:12 am (UTC)
Parallel eigenvector finding is a very well studied problem!
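
E.g. power iteration: each sweep reads only the previous sweep's values, so the per-node work parallelizes trivially (which is also roughly the shape of the mass_compute loop above). A toy sketch of finding the dominant eigenvector (eigenvector centrality) of a made-up three-node inbound-adjacency hash:

# %in maps node => listref of nodes linking to it (made-up data).
my %in = (a => ['b', 'c'], b => ['a'], c => ['a', 'b']);
my %t  = map { $_ => 1.0 } keys %in;    # initial guess

for (1 .. 50) {
    my %next;
    for my $node (keys %in) {
        my $sum = 0;
        $sum += $t{$_} for @{ $in{$node} };
        $next{$node} = $sum;
    }
    # renormalize by the largest entry so the vector converges
    # instead of blowing up
    my ($max) = sort { $b <=> $a } values %next;
    %t = map { $_ => $next{$_} / $max } keys %next;
}

printf "%s: %.3f\n", $_, $t{$_} for sort keys %t;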

From: xlerb
2006-03-26 11:05 am (UTC)

Still working on not sacrificing quite so many of the alleged good nodes…

PROGRESS: 750000 / 7500000 (10.00%)
$VAR1 = {
          'good' => {
                      'coverage' => '0.959052002236655',
                      'avg' => '0.6531029173693',
                      'percentiles' => [
                                         '0.00874063037052613',
                                         '0.217309368389952',
                                         '0.86611734270536',
                                         '0.999998028111126',
                                         '1'
                                       ]
                    },
          'evil' => {
                      'avg' => '0.126750513741312',
                      'percentiles' => [
                                         0,
                                         '0.0203900550090053',
                                         '0.0814786963984952',
                                         '0.181406981788721',
                                         '0.29723033559995'
                                       ]
                    }
        };
GOOD:
0.00 - 0.05: ############
0.05 - 0.10: ###
0.10 - 0.15: ##
0.15 - 0.20: ##
0.20 - 0.25: ###
0.25 - 0.30: ##
0.30 - 0.35: #
0.35 - 0.40: #
0.40 - 0.45: 
0.45 - 0.50: 
0.50 - 0.55: 
0.55 - 0.60: 
0.60 - 0.65: 
0.65 - 0.70: 
0.70 - 0.75: #
0.75 - 0.80: #####
0.80 - 0.85: ##
0.85 - 0.90: ##
0.90 - 0.95: #
0.95 - 1.00: ########################################

BAD:
0.00 - 0.05: ########################################
0.05 - 0.10: ###################
0.10 - 0.15: #############
0.15 - 0.20: ##########
0.20 - 0.25: #######
0.25 - 0.30: #####
0.30 - 0.35: ###
0.35 - 0.40: #
0.40 - 0.45: 
0.45 - 0.50: 
0.50 - 0.55: 
0.55 - 0.60: 
0.60 - 0.65: 
0.65 - 0.70: 
0.70 - 0.75: 
0.75 - 0.80: 
0.80 - 0.85: 
0.85 - 0.90: 
0.90 - 0.95: 
0.95 - 1.00: 

From: xlerb
2006-03-26 11:17 am (UTC)

And now, in the spirit of open source…

On the system I first tried this on (MacOS X), zcat wanted to append a .Z to the filename, so I replaced that with gunzip, and later on filled in a percentile method which at least seems to work. The diff is, uh, attached:
begin 644 Graph_pm.diff.gz
M'XL(`/=V)D0"`Z526VO;,!1^GG_%MU0%NXZ;&`IC\5(,HQU]V,-@C$$<,L>6
M8S%?A"2W2U+OMT^2G6T=Y&EZ..)<OG/YSGEH<OIC@8IM9Y]%)]4'D?)R9N4U
MKYWE_S\G"()S^5^Y@CXRR=H&X8WG^+Y_/O*I%=]9LT/6\KWGQ#&",)R^@6]E
M'#LPK]Z#2%H56&);42EQ[*<@695*&3D88G)&,9&=Y#13*%A%F[2F$[`"Q&A8
M_L1LE<CD6^(FWO-Z%CF!@8U.3`Y9JD;M^1]8<KT[$`WP7P!V77-@'$&&Y),U
M)G=GD;;#EM,&KIFE**=#D(=6#)W?IUK-H=HAS'H7(*\G(_BI-#G==QI[Z^$X
M&,W+RK;FT<#<_*VE3G_AC27/@>RVX%1DM%$FP?$WH:YE=(J8MZQ1TM,CQ9N1
M%$%5)QJLUGKF86H-B*7J"K,"V0J%(TB*K.8@6_38"<JU*:<%:_04/>K4Z&03
MW"JS=6UY3*N.2ES:LB.7@SQ5&T"#[;1UINN1S2R<SW$%<F%;B%Z&F)8("_00
M+F'>7TXW#$CA71$+6A&V]DGQ1_/#]1C;GSB`MJ`WK.F\EU],P^_3K*01<('[
IAZ\?[Q:@-5=[Z,O6:Q*ITC<NS6ZR$G6;LV*OJ="G[?P"7A)6&($#````
`
end

From: brad
2006-03-26 05:11 pm (UTC)

Re: And now, in the spirit of open source…

Hahahaha. File attachments in LiveJournal via uuencode. I fucking love it. :-)

From: photomatt
2006-03-26 06:04 pm (UTC)

Maybe Not

This is pretty interesting, and seems like it would work particularly well in a community like LJ. Is this at all related to the email I sent you a few days ago?

From: brad
2006-03-26 06:20 pm (UTC)

Re: Maybe Not

Between you, Google, yandex, ljseek, and technorati: yeah. We'd never had a problem with linkfarmers until recently and I'd really rather they go away. (We'd always cared about spamming, not linkfarming....) So first step is detecting them, which is looking easy.

Sorry I didn't reply to your latest email yet.

From: ezrac
2006-03-26 09:39 pm (UTC)

Should be pretty quick

Brad: What's the diameter of the graph? (The big component, I mean.)

If trust is just a function of the trust values of the neighbors, I think your approach should label all the nodes in d iterations, with d the diameter. I'm assuming that you start out with some labeled 0/1 and some UNlabeled, so that you can tell when you're done, and you don't bother re-processing labelled nodes. You say the trust value of each node is constantly changing, but I think it's better to let trust flow deterministically from root/evil nodes to the others.

I'm always reading that the diameter of a "natural" graph is likely to be small (the six-degrees phenomenon), so I wager that LiveJournal's is small as well, and this approach should work pretty well.
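
In calc terms, that deterministic flood could be as simple as this sketch (each full sweep advances the labeled frontier one hop, so about d sweeps cover the component):

# Adopt the strongest defined trust among inbound neighbors;
# roots are pinned at 1.0 by the harness, so their trust floods
# outward one hop per sweep.
sub calc {
    my $node = shift;

    my $best;
    for my $in (@{ $node->linked_from }) {
        my $t = $in->trust;
        next unless defined $t;
        $best = $t if !defined($best) || $t > $best;
    }

    return $best;    # undef until some inbound neighbor is labeled
}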

From: orbadelic
2006-03-27 07:10 pm (UTC)

any good reading on the topic?

I know I am somewhat naive when it comes to graph theory; is there any good reading on exactly what it is you are trying to find/do?