linux-kernel; inlines - brad's life
Brad Fitzpatrick
[ website | bradfitz.com ]
linux-kernel; inlines [Jan. 2nd, 2006|09:37 pm]
Brad Fitzpatrick

I'm addicted to reading the linux-kernel list. There's a big thread going on about changing the meaning of "inline" in the kernel tree to mean "if gcc4 wants to" instead of the historical "always inline!", which was required due to gcc3 quirks, and introducing a new "__always_inline" that actually means __attribute__((always_inline)), for the few places in the kernel that require inlining.

I guess the whole argument is that inline has turned into a "ricing option" that programmers throw about for tons of bogus reasons, not understanding gcc, not understanding other architectures, etc. Hence the patches to remove them all and just let the compiler do it, because it can't get any worse.

I liked this post from Ingo:
...
furthermore, there's also a new CPU-architecture argument: the cost of
icache misses has gone up disproportionally over the past couple of
years, because on the first hand lots of instruction-scheduling
'metadata' got embedded into the L1 cache (like what used to be the BTB
cache), and secondly because the (physical) latency gap between L1 cache
and L2 cache has increased. Thirdly, CPUs are much better at untangling
data dependencies, hence more compact but also more complex code can
still perform well. So the L1 icache is more important than it used to
be, and small code size is more important than raw cycle count - _and_
small code has less of a speed hit than it used to have. x86 CPUs have
become simple JIT compilers, and code size reductions tend to become the
best way to inform the CPU of what operations we want to compute.
...

Comments:
From: the_p0pe
2006-01-03 05:42 am (UTC)
That's some good linux drama
From: way2tired
2006-01-03 05:51 am (UTC)
I'm so amused that I read "Ricing option" more than once. What a term.
From: pyesetz
2006-01-03 05:55 am (UTC)

Ricing option?

The linked page mentions "ricing option" but does not define it.  Google has only five hits for "ricing option", none of which define it.
From: brad
2006-01-03 06:24 am (UTC)

Re: Ricing option?

At least one international user on the list was confused as well. Alan Cox replied "think go-fast stripes".

I say: http://funroll-loops.org/
From: mulix
2006-01-03 06:34 am (UTC)

Re: Ricing option?

some of my best friends are gentoo developers!
From: brad
2006-01-03 06:41 am (UTC)

Re: Ricing option?

I have no problems with $x developers, it's generally $x users I can't stand.
From: mulix
2006-01-03 07:02 am (UTC)

Re: Ricing option?

... for every value of x.
From: brad
2006-01-03 07:12 am (UTC)

Re: Ricing option?

Well, not quite. After posting that, I realized there exist some $x for which I hate the developer(s) but don't mind the users.
From: mulix
2006-01-03 09:31 am (UTC)

Re: Ricing option?

rant away, cap'n!
From: jwz
2006-01-05 05:07 am (UTC)

Re: Ricing option?

My psychic powers tell me you're thinking of djb!
From: nothings
2006-01-03 06:50 am (UTC)
I'm sure it's already addressed in the thread (which I didn't read), but my 2 cents is: if you inline something with a conditional branch, you replicate that branch at two (or more) distinct addresses, which can help the branch predictor, since it will predict each copy independently.

I'm a little surprised that icache matters that much, since there's prefetch, but I guess these people have actually measured it and that's the reality.
From: brad
2006-01-03 06:53 am (UTC)
There are certainly places it helps, but (I guess?) gcc4 can finally do it well enough that you can just let it decide in almost all cases.

The bigger problem is that people have been throwing "inline" on every static function with one caller "just because". The problem, as many are quick to point out, is that eventually you have two callers and forget to remove the inline; then your icache goes to shit and you have 20,000 inlines in the kernel, expanded 100,000 times (the current situation). Also, those cases of "but there's just one caller!" actually make gcc generate worse code, because gcc handles large functions with lots of local variables poorly, and a call would've been quicker than all the spilling.
From: nothings
2006-01-03 07:04 am (UTC)
Yeah, it makes sense that the compiler can do a reasonable job, and that the kernel has encrufted. That's a longstanding complaint about "optimization annotations" that I remember reading on comp.compilers 12+ years ago: things like the C "register" keyword or "inline" get added to code at some point as an optimization, but then there's no development mechanism to verify them or back them out as the code ages.
From: brad
2006-01-03 07:13 am (UTC)
Yup. "inline is the new register" is heard a lot lately.
From: jeffr
2006-01-04 01:16 am (UTC)
With regard to the branch predictor, that's valid if the branch result tends to be the same regardless of the caller.

All of this discussion misses one great use of inlines, and that is code reduction due to constant propagation. Depending on the function, you can often write cleaner-looking code whose unused paths get eliminated in the dead-code stage for most callers. In that case inlining is much cheaper than a function call.

With regard to cache prefetch: you still have to wait for the memory to return the results. If your icache line holds 16 instructions or so (variable, given x86's variable-width instructions), those 16 instructions are not likely to take longer to execute than fetching the next icache line. So your best bet is to keep code contiguous, so that the prefetcher can fetch far ahead of your stream. Inlining keeps your code contiguous, but takes up more icache space, so if you're likely to keep multiple copies in cache at once it's more harmful to inline.

So inline is good for your prefetcher, bad for icache utilization, and sometimes bad, sometimes good for BTB and branch-predictor entries. It really comes down to all sorts of details that the programmer is unlikely to get right to start with, and that are unlikely to stay correct for the long term.

Now, don't get me started on the likely() unlikely() crap that is popping up all over the linux kernel. In 10 years perhaps they'll start making the same arguments about this and realize their mistakes. Or explicit cache prefetching in general purpose code. bleh.
From: brad
2006-01-04 03:32 am (UTC)
Actually, I would love to get you started on [un]likely() ... I've always thought those at least made a hell of a lot more sense than inline or register, at least if you only use them on really hot paths where the certainty of your likely/unlikely is high-ish (90%?). Is your distaste for them that they're used too often, or that the code generation isn't enough better to warrant the source-code pollution?
From: jeffr
2006-01-05 03:16 am (UTC)
There are a few problems.

1) aesthetics
2) aesthetics
3) The branch predictor is commonly 98% or more accurate for a variety of workloads. It is the least of your optimization worries.
4) It is premature optimization and an optimization that is unlikely to produce a measurable gain.
5) You can only order branches such that the static branch predictor gets them right. Branch prediction, however, is not that simple: there are global branch-history bits that may influence the decision, as well as history for this particular location. Simply rearranging the targets doesn't guarantee that the path you want is taken.
6) Code changes. Probably the weakest argument but it still fits.

The best solution, for this and for the question of inlines, is to do profiled execution runs and then recompile/relink. I believe gcc has support for this now, and Microsoft and others have had it for some time. Detailed counters keep track of branch history, code-locality issues, etc. This can all be used to make more informed optimization decisions than the programmer can, especially considering the lifetime of code in some open-source projects.
From: ghewgill
2006-01-03 02:45 pm (UTC)
I'm usually of the opinion that the less you try to tell your compiler/computer how to do things, the better off you are. The last company I worked at offered distributed computing infrastructure for customers, who would run it over anywhere from tens to thousands of desktop machines. Some customer would always want an option to schedule a given job only on machines with more than X CPU MHz, or Y amount of memory, or that hadn't been used for Z hours, or whatever. We used to call these "go-slow" options, because it almost invariably turned out that if you just let the scheduler do its thing, without imposing artificial constraints to try to make the job go faster, your job would in fact complete sooner. Nevertheless, the scheduler ended up full of exceptions just to handle these artificial constraints.

I see "inline" and its ilk (such as "register") very similarly. I'm not going to try to tell the compiler how to do its job, even if in version i.j.k of the compiler it happens to be faster when I flip some particular switch. The compiler has a much better view of the global optimization landscape than I do.

There's also of course the 90/10 rule where it doesn't matter how inefficiently 90% of the code is executed because it doesn't run very often.
From: wolf550e
2006-01-03 07:00 pm (UTC)
I agree, compilers are almost always better at optimization than programmers, especially with code that gets compiled on so many different architectures. But I wanted to point out that the argument that modern processors are so smart won't stay correct for long: newer architectures sacrifice a lot of out-of-order optimization logic (like scheduling and branch prediction) to cram more (simpler) execution cores into a target silicon size.