Popular Blogs - brad's life
Brad Fitzpatrick (bradfitz.com)

Popular Blogs [Jan. 31st, 2007|09:10 am]
Brad Fitzpatrick

You can always tell when you've found a popular blog ....


Comments:
From: topbit
2007-01-31 05:23 pm (UTC)
clearly they need a little memcached loving - or at least filesystem caching.
From: crschmidt
2007-01-31 05:32 pm (UTC)
A popular blog on a badly configured server. Not to say that my own wouldn't necessarily fall into that category, but it's not impossible to configure your server so that you're not breaking your site like that.
From: brad
2007-01-31 05:34 pm (UTC)
It seems to be impossible for anybody running WordPress: even WordPress.com can't survive being linked to from any halfway popular site!

I can think of a half dozen ways to prevent the above from ever showing to users (even if it happens internally). If you only have m resources and n requests, get in line! You don't fucking just explode.

Grrrr.
From: crschmidt
2007-01-31 05:39 pm (UTC)
I'm surprised that you'd see that on something hosted at wordpress.com, and amused, since their spam catching service, Akismet, says: "One of the reasons we're doing Akismet is we've built up a highly fault-tolerant infrastructure that can handle huge amounts of traffic and processing."

"We can handle spam, but we can't handle your blog. Sorry!"

For the most part, you simply see this error because the default Apache has MaxClients set to something like 100 or 150, and the default MySQL install has max_connections set to something lower than that. Connect 150 web server threads to 100 MySQL connections and watch the rest fall over! Boom.

And of course, it works 99% of the time, because most people don't have many users, and when they do get that many, they can't get in to see the error anyway.
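(For anyone following along, the mismatch looks roughly like this in the two stock config files. The numbers are just the commonly cited defaults of that era, not anything measured from the site in brad's screenshot.)

    # httpd.conf, prefork MPM: up to 150 request-handling processes,
    # each of which may want its own MySQL connection.
    <IfModule prefork.c>
        StartServers       5
        MaxClients       150
    </IfModule>

    # my.cnf: but the server only accepts 100 simultaneous connections,
    # so under load the last 50 Apache children get "Too many connections".
    [mysqld]
    max_connections = 100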
From: brad
2007-01-31 05:53 pm (UTC)
Stop making excuses.

Sorry, I'm hating on everything today.
From: bsdguru
2007-01-31 05:57 pm (UTC)
They use lighttpd for WordPress.com and many instances of MySQL, with each blog having something like 14 or so tables. It's insanity now!
From: photomatt
2007-02-01 09:21 am (UTC)
We've never used Lighttpd, maybe you're thinking of Litespeed?
From: bsdguru
2007-01-31 05:59 pm (UTC)
It's all about caching data, which they don't seem to do. And don't get me started on all the SQL queries they do to generate just one page.
From: nothings
2007-01-31 07:45 pm (UTC)
(brad, this Q is for you too):

As an old-school C programmer who always figures out some solution involving something you can load into memory and has never touched a database and has an unreasonable distrust for them, I have to ask: why don't databases perform magically better than they do? Isn't that about half of the point? (The other half is so you don't have to roll something from scratch that supports all the queries they do.) I mean, despite avoiding them forever, I've started poking at the concepts, and am slowly swinging over to the belief that they're a good idea, and that they ought to be competitive or better on the performance-front, because the database engine can aggressively optimize every possible query, whereas something you hand-code you have to put in new effort for every new query.

Or let me drill down to a simpler question relating to your specific reply above: why, under the hood, deep down, is something like memcached necessary? Why isn't the SQL server's cache as effective?

Oh, I understand that memcached allows you to distribute it, so you can throw 8 other machines with 4GB each of cache at the problem and significantly increase the cache size. But the impression I've gotten about the effectiveness of memcached is that it's beyond that. In normal caches, doubling its size doesn't halve performance. Indeed I think I read somewhere about somebody running a memcached on the same machine as the SQL server anyway.

Now, one thing is you can say that memcached can be smarter because it requires app side code that figures out what to cache, so memcached has access to info that SQL doesn't have. But this is of course the naive argument I make against databases in the first place. There's no reason the database can't sit there and figure out which things are queried more often and optimize for them. There's no reason the database can't pursue a much more sophisticated cacheing strategy than LRU. There's no reason you can't hint to the server about what to cache, in the same way indices are hints about improving performance for queries you'll make.

Is it just something simple like, say, databases use page-oriented cacheing, and memcached caches arbitrarily sized blocks, and if what you're cacheing is fairly small, memcached thus can store way more useful info in the same space? Because if it's something like that, somebody needs to kick the database people's asses and get them to address this problem so responses like "It's all about caching data, which they don't seem to do" will become irrelevant, like they clearly should be.

Or is my impression of memcached's success wrong, and it's only a win because you're scaling your cache size by a factor of 10 by using a bunch more processes/machines?
From: jamesd
2007-01-31 08:31 pm (UTC)
Say you have this nice setup with 500 web servers, each of which can display pages from any blog. A request for a blog view comes in, the web server starts building a page and gets nicely cached data from the database server. Now the next 499 requests come in and are load balanced to the other 499 web servers. The database server is now doing as much work as all 500 web servers combined, even though all it's doing is handing each of them a cached result. Oops, the database server has a problem.

Now insert an application that tries to do its writes in transactions, with a commit to disk after each write to meet the durability guarantee of an ACID database server, on a drive that can do this 25 times per second. Those 500 requests are now waiting for 25 commits per second. Ouch. Someone forgot to design for the predictable load: cache those requests and insert them in a batch.
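(A minimal sketch of that batching, in Perl with DBI; the hits table, column names, hosts, and credentials are invented for illustration. The point is one durable commit per flush rather than one per request.)

    use DBI;

    my $dbh = DBI->connect("DBI:mysql:database=blog;host=dbmaster",
                           "blog", "secret", { RaiseError => 1 });

    my @pending;                     # writes queued in memory, not yet on disk

    sub record_hit {                 # called once per request; costs no disk I/O
        my ($post_id) = @_;
        push @pending, [ $post_id, time() ];
    }

    sub flush_hits {                 # called every few seconds, or every N hits
        return unless @pending;
        $dbh->begin_work;            # one transaction for the whole batch...
        my $sth = $dbh->prepare(
            "INSERT INTO hits (post_id, hit_time) VALUES (?, ?)");
        $sth->execute(@$_) for @pending;
        $dbh->commit;                # ...and therefore one commit to disk
        @pending = ();
    }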

Caching approaches such as reverse proxies (Squid) or application-specific caching in memcached (like LiveJournal caching all writes to memcached and reading from a set of n memcached servers instead of fewer database servers) help to deal with this concentration of load at the database server.
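(To make "application-specific caching in memcached" concrete, here is a hedged sketch of the usual read-through pattern in Perl, using the Cache::Memcached client; the key format, hosts, credentials, and query are invented.)

    use Cache::Memcached;
    use DBI;

    my $memd = Cache::Memcached->new({
        servers => [ "cache1:11211", "cache2:11211" ],   # hypothetical cache boxes
    });
    my $dbh = DBI->connect("DBI:mysql:database=blog;host=dbslave",
                           "blog", "secret", { RaiseError => 1 });

    # Only a cache miss ever touches the database; every later request for
    # the same row is a single hash lookup on a memcached box.
    sub get_user {
        my ($username) = @_;
        my $key  = "user:$username";
        my $user = $memd->get($key);
        return $user if $user;

        $user = $dbh->selectrow_hashref(
            "SELECT * FROM user WHERE username = ?", undef, $username);
        $memd->set($key, $user, 300) if $user;           # cache for five minutes
        return $user;
    }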

Places like Wikipedia and Livejournal have been hugely optimised to process page requests efficiently and avoid doing stupid things.
From: nothings
2007-01-31 09:05 pm (UTC)
The database server is now doing as much work as all 500 web servers combined, even though all it's doing is handing each of them a cached result.


Well, only if the web server actually does nothing. Cacheing a piece of data should not require very much of _any_ resource, so I'm just not sure why it's a performance problem. If this logic made sense, you'd need 500 memcached servers to cope with your 500 web servers and one database server, and that just doesn't sound at all plausible to me.

Now insert an application that tries to do its writes in transactions, with a commit to disk after each write to meet the durability guarantee of an ACID database server, on a drive that can do this 25 times per second. Those 500 requests are now waiting for 25 commits per second. Ouch. Someone forgot to design for the predictable load: cache those requests and insert them in a batch.

That misses my point: if the semantics of the SQL queries force them to get blocked by the atomic commits, but then you willfully break those semantics left and right in your external-to-the-db cacheing scheme, it seems to me that there's something fundamentally wrong with the db scheme--that there's an all-or-nothing-ness to the atomicity.

I'm trying to drill down into what underlying physical, mechanical, or resource-consumptive detail explains this. You're being a little too abstract, operating at too high a level. I see a situation like what you're describing and say "something seems wrong here", and I'd like to drill down. As far as I can see, your answer just restates the situation I was already describing and offers that description as the explanation. Maybe when you wrote this description you had more concrete things in mind and just didn't state them clearly, though.

The point isn't that people need to avoid doing stupid things; the point is why you can do stupid things in the first place. (Well, you can always do stupid things, my point is why is it the case that this particular thing you're calling stupid is stupid.)
From: jamesd
2007-02-01 01:53 am (UTC)
Tell me what you can see happening when a database server is told twelve consecutive times for one web page to "get me the row from this table with primary key 'nothings'".

Tell me what you can see happening when a web server asks a memcached server to "store this web page that contains the completely built page that is the result of those twelve database queries and some pretty page formatting by the web server".

What do you see being required when the memcached server is told to return the single cached item that is that page? How does the elapsed time taken for 12 ask-answer queries in the database situation compare to the single ask-answer memcached case? What implications does that have for the total number of connections that the different servers have to maintain at one time?

What implications does it have for the number of individual RAM allocations being done on the memcached server and the database server? (Yes, these really matter - see 21727, a performance bug caused by freeing and reallocating instead of reusing RAM.)

You're right that I have concrete thoughts and experience of this - both at wiki and at work (I tend to do lots of database-related high-performance and architecture work).
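(The twelve-queries-versus-one-get comparison above, as a rough Perl sketch; build_page_from_database and the key format are hypothetical stand-ins for whatever the blog software actually does.)

    use Cache::Memcached;

    my $memd = Cache::Memcached->new({ servers => [ "cache1:11211" ] });

    sub render_entry_page {
        my ($journal, $entry_id) = @_;
        my $key = "page:$journal:$entry_id";

        # One ask-answer round trip when the finished page is already cached.
        my $html = $memd->get($key);
        return $html if defined $html;

        # Otherwise the twelve queries plus the pretty formatting run once,
        # and every subsequent viewer is served the stored result.
        $html = build_page_from_database($journal, $entry_id);   # hypothetical
        $memd->set($key, $html, 60);                             # short TTL
        return $html;
    }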
From: nothings
2007-02-01 02:17 am (UTC)
Tell me what you can see happening when a database server is told twelve consecutive times for one web page to "get me the row from this table with primary key 'nothings'".


That's my point: why isn't it nearly free the 11 times after the first one? It might add round-trip networking latency, and lead to a slower response time, but what resource (besides bandwidth) should actually get used up by it? And if bandwidth isn't a limiting factor, why should that affect the overall system throughput?

Obviously if you cache whole web pages, that's fairly different in terms of the performance gains, but my impression of how LJ uses memcached was that it was at a much lower level. Any highly-dynamic web page (like LJ, or wikipedia for non-anon users) obviously has issues here.

Again, a lot of this comes back to the thing I thought I read somewhere where somebody got a win just running one memcached on the same machine as the database server. If that's totally a myth, if the win of memcached is octupling the available RAM to use as cache, then ok, I have no real issue here--that's just as fast as you can make things go with that much memory. But if that sort of story is true, it really feels like something is wrong here.

Let me present my concern about that in a different way. Sometimes I'll hear about some friend who's trying to figure out how to do some very complicated query right. Presumably it involves lots of hideous internal grinding matching things between tables or whatever. And when they've exhausted the obvious solutions with rewriting the query differently or adding more indices, they don't eventually bail and say "I know, I'll go write a perl program that can handle this query more efficiently, and then later go rewrite it in C". They do go implement a new data structure that allows querying this information more efficiently--but they do it by defining new fields in a table, or a totally new table. They apply the tools of the database to solving the problem.

It feels to me like cacheing ought to just work, that the case I quoted above ought to be fast, and shouldn't require somebody to hack through their code to cache things locally themselves, or to implement something like memcached to improve the cacheing they get across multiple related page views.

Again, I've never actually used databases myself, much less databases in production environments, so I have no idea how things shake out in practice. I'd naturally assume that database servers are primarily limited by their disk performance. More memory will allow a larger disk cache and reduce disk reads (but not writes). CPU power I would expect to be largely irrelevant. But the case of the 11 extra reads of the same data shouldn't use anything but a little memory and some amount of CPU, if everything is still cached. Or is this wrong, and the CPU performance itself is a limiting factor?
From: edm
2007-02-01 10:26 am (UTC)

Database working set

If your database is small enough (or at least your active working set is small enough) and your memory is large enough then, yes, everything that matters ends up being cached in RAM (the filesystem cache if nothing else). At which point the only thing which determines speed of reading from the database is the efficiency of parsing and executing the queries. So requesting the same thing redundantly 11 extra times means that you're taking 12 times as long as you should be taking -- with no disk I/O we're basically talking a linear factor of slowdown.

Caching with, e.g., memcached might help a little, but again it's all in RAM, so you're basically down to comparing the complexity of a memcached lookup (distributed hash?) with parsing and executing a query. Given the heavyweight nature of SQL I could see a distributed hash style lookup being noticeably quicker, but still not the win of simply not doing the queries at all 'cause you already have the answer.

Databases are generally only limited by disk I/O on (a) writes (which need to be committed to stable storage for ACID), and (b) when they're sufficiently large (and/or your RAM is sufficiently small) that you're forever paging things in off disk.

Ewen

PS: It's got to be at least months since I last saw database handle exhaustion errors out of LJ. (This was for a while after the last power incident, which I guess took out sufficient hardware to be a problem under load.)
From: nothings
2007-01-31 09:11 pm (UTC)
Ok, having just checked your blog, I can see that you must certainly have concrete thoughts about this and you were just being vague.

Keep in mind that you're explaining this to a unique kind of idiot: I understand the principles of databases and of networks and of scaling, but I've never ever in my life worked with any of them.
From: askbjoernhansen
2007-02-04 10:47 am (UTC)

uh

nothings, your questions don't make much sense. The reason to do caching in the application is that your application can be smarter about it than the DB can. Caching is, generally speaking, a tradeoff between less load on the systems and getting fresh data. The DB can't make that trade off - your application can. The typical caching mechanisms are also much much much simpler (less resource usage) and easier to scale horizontally than anything you'd do on the DB layer.
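(One concrete version of that tradeoff, sketched in Perl with invented names: the application knows exactly which cache entry a write makes stale, so it can invalidate just that entry, whereas MySQL's built-in query cache has to throw away every cached result touching a table whenever that table is written to.)

    use Cache::Memcached;
    use DBI;

    my $memd = Cache::Memcached->new({ servers => [ "cache1:11211" ] });
    my $dbh  = DBI->connect("DBI:mysql:database=blog;host=dbmaster",
                            "blog", "secret", { RaiseError => 1 });

    sub update_user_bio {
        my ($username, $bio) = @_;
        $dbh->do("UPDATE user SET bio = ? WHERE username = ?",
                 undef, $bio, $username);
        $memd->delete("user:$username");   # only this entry goes stale;
                                           # the next read repopulates it
    }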

You should look up the slides from Brad's LJ scaling talk or maybe the slides from my similar-ish talk:
http://develooper.com/talks/
http://develooper.com/talks/Real-World-Scalability-Web-Builder-2006.pdf


- ask
From: bsdguru
2007-01-31 05:58 pm (UTC)
They did give it some loving. Apparently they purchased some more boxes from Layered.
From: photomatt
2007-02-01 09:18 am (UTC)

Really?

"even WordPress.com can't survive being linked to from any halfway popular site!"

We're on digg/slashdot/fark/etc several times a week; it doesn't even make a blip on our stats. (Which are public, like yours.) The above error message literally doesn't exist on WordPress.com; it's just the default DB error in stock downloadable WP. It's usually triggered when people enter incorrect information in their config file.

Popular sites are never a problem because they're so easy to cache. What causes most of our load are the tons of blogs that get 1-10 hits per day. Have you seen something weird on the site I should be looking into?
From: brad
2007-02-01 10:44 am (UTC)

Re: Really?

Maybe you've got better lately, but historically (over the past year?), it's wordpress.com that I associate with the aforelinked image.

Can't you make WordPress by default stall in mod_php waiting to get a DB connection before it proceeds? Hanging is better than a 500.
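(WordPress is PHP, so this isn't how it would actually be patched, but the stall-instead-of-die idea being suggested looks something like this generic Perl/DBI sketch; the hostnames, credentials, and timeouts are arbitrary.)

    use DBI;
    use Time::HiRes qw(sleep);

    # Keep the request waiting briefly for a free DB connection instead of
    # sending the visitor an error the instant the pool is exhausted.
    sub connect_with_patience {
        my $deadline = time() + 10;          # give up after ~10 seconds
        while (1) {
            my $dbh = DBI->connect("DBI:mysql:database=blog;host=db1",
                                   "blog", "secret", { PrintError => 0 });
            return $dbh if $dbh;             # got a connection, carry on
            die "database unavailable\n" if time() >= $deadline;
            sleep(0.25);                     # back off a little, then retry
        }
    }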
From: photomatt
2007-02-01 11:15 am (UTC)

Re: Really?

I just checked out the external monitoring from grabperf.org; it doesn't look like anything crazy has happened on wp.com as far back as its graphs go. What site did you take the screenshot from?

In core WP, if you see that message, PHP and the web server are working just fine -- it's not a 500 error -- but there is no connection to the DB, so WP dies with that error because there's nothing else left for it to do. It lists those bullet points because usually it's from people who manually edit their wp-config.php file and have a typo in the DB connection details. If it's on an overloaded site (probably on a $7/mo shared host) it just means MySQL crapped out before the web server, which isn't uncommon because most hosts configure MySQL very badly.

On WP.com that rarely happens. If it can't connect to a DB instance it just tries other slaves, failing over to a remote DC if all the local instances are down.
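(Not WP.com's actual code, obviously, but the try-the-next-slave behaviour described above reduces to something like this Perl sketch; the host list and credentials are invented.)

    use DBI;

    # Local read slaves first, a remote datacenter only as a last resort.
    my @slaves = qw(db-local-1 db-local-2 db-local-3 db-remote-1);

    sub connect_to_any_slave {
        for my $host (@slaves) {
            my $dbh = DBI->connect("DBI:mysql:database=blog;host=$host",
                                   "blog_ro", "secret",
                                   { PrintError => 0, RaiseError => 0 });
            return $dbh if $dbh;             # first slave that answers wins
        }
        die "no database slaves reachable\n";
    }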
From: foobarbazbax
2007-01-31 06:12 pm (UTC)
I think a lot of people fail to properly load test their application. First I ask, "how much traffic do I need to support?" Then I ask, "how much traffic can I currently support?" Then I optimize appropriately.

I really enjoyed this presentation by Goranka Bjedov from Google. She does an excellent job explaining how to do proper load testing. She basically just pimps JMeter, which I've used a couple of times and have been fairly impressed with.
From: foobarbazbax
2007-01-31 06:16 pm (UTC)
Plus, when you fail, fail gracefully.
From: gargoylemusic
2007-01-31 06:34 pm (UTC)
So... I guess I'll ask the obvious [n00bish] question: how does one properly handle this kind of thing (that is, what software/apache modules/thinkgeek gimmicks/.vimrc settings/gaim plugins/etc would one use)?
From: photomatt
2007-02-01 09:39 am (UTC)
There's a drop-in plugin for WordPress called wp-cache that uses file caching and eliminates all DB hits. It can easily push 15-25 reqs/sec on cheap hardware, essentially static PHP performance.
From: anildash
2007-01-31 07:31 pm (UTC)
I'm collecting "blog failure" images. I have some good ones; I should put 'em all online.
From: evan
2007-01-31 08:45 pm (UTC)
When your image has been shrunk this small, you sorta look like a mannequin.
From: anildash
2007-02-02 03:34 am (UTC)
I'm a mannequin doing the robot. Dancin', dancin', dancin'! I'm a dancin' machine.

Dancing to the sound of blogs failing. I have one of my own site showing an offensive error message I wrote myself. Hooray!
From: ghewgill
2007-02-01 05:47 am (UTC)
Does that mean that livejournal doesn't have any popular blogs? :)