Since I'm apparently an endless font of information regarding all things technology, I've decided to host a little class on computers here, aimed at those who know computers pretty well, but still want to know a bit more advanced stuff. I'll be using some C code and assembly in this thread (GAS syntax, not Intel). I'll also be assuming that you know about stuff like CAS latencies, caches, and memory controllers. Don't worry if you can't read assembly - I've heavily commented it, so it should be pretty much understandable.
By the way, this is going to be a long read. Seriously, it's ~2400 words. I said "Ludicrously Detailed", and I meant it.
We'll start with the core of the machine: the Central Processing Unit. Its job, as you know, is to do math and stuff.
We'll do some simple math on it. A common task is to find the distance between two points on a Cartesian grid. Such calculations are particularly common in games and physics simulations. We'll solve for the distance between (1, 1) and (4, 5). [sp]It's 5.[/sp]
In C, this code would be simple:
[code]#include <math.h>

float x1 = 4;
float x2 = 1;
float y1 = 5;
float y2 = 1;
float distance;
if (x2 > x1) {
    distance = (y2 > y1) ? sqrt((x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1))
                         : sqrt((x2 - x1) * (x2 - x1) + (y1 - y2) * (y1 - y2));
} else {
    distance = (y2 > y1) ? sqrt((x1 - x2) * (x1 - x2) + (y2 - y1) * (y2 - y1))
                         : sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2));
}[/code]
However, those letters and numbers mean nothing to a CPU. When you compile that code, it's converted into something called [i]assembly language[/i], which is pretty much a 1-to-1 representation of the actual machine code the CPU uses.
Insert obligatory "I wish - it's machine code" reference here.
This code could be translated into assembly like so (some compilers might produce simpler, unoptimized code; some might optimize it even more. I'm trying to show as many aspects of the x86-64 as possible, so I do a few weird things):
[code]_main:
MOVL $4, %rax; load x1 to AX
MOVL $1, %rbx; load x2 to BX
MOVL $5, %rcx; load y1 to CX
MOVL $1, %rdx; load y2 to DX
CVTSI2SS %rax, %xmm0; convert AX to floating-point, store on XMM0
CVTSI2SS %rbx, %xmm1; convert BX to floating-point, store on XMM1
CVTSI2SS %rcx, %xmm2; convert CX to floating-point, store on XMM2
CVTSI2SS %rdx, %xmm3; convert DX to floating-point, store on XMM3
; if (x2 > x1){
CMP %rax, %rbx; comparison
JG xflip; if x1 > x2, goto xflip
FLD %xmm0; push XMM0 (x1) onto FPU stack
FLD %xmm1; push XMM1 (x2) onto FPU stack
JMP do_x_sqrd; goto do_x_sqrd
xflip: ; x1 is > x2
FLD %xmm1; push XMM1 (x2) onto FPU stack
FLD %xmm0; push XMM0 (x1) onto FPU stack
do_x_sqrd: ; do the (x2 - x1)^2 thing
FSUB %st0, %st1; subtract and push result onto stack
FLD %st0; push onto stack again
FMUL %st0, %st1; multiply and push result onto stack
; clear unneeded registers, to prevent overflow
FFREE %st1
FFREE %st2
FFREE %st3
FFREE %st4
; (y2 > y1) ?
CMP %rcx, %rdx; comparison
JG yflip; if y1 > y2, goto yflip
FLD %xmm2; push XMM2 (y1) onto FPU stack
FLD %xmm3; push XMM3 (y2) onto FPU stack
JMP do_y_sqrd;
yflip:
FLD %xmm3; push XMM3 (y2) onto FPU stack
FLD %xmm2; push XMM2 (y1) onto FPU stack
do_y_sqrd:
FSUB %st0, %st1; subtract and push result onto stack
FLD %st0; push onto stack again
FMUL %st0, %st1; multiply and push result onto stack
FADD %st0, %st5; add dx^2 and dy^2, and push it onto the stack
FSQRT %st0; calculate the square root
; and we're done[/code]
Obviously, that's a lot more code than the plain C was. However, it also tells you more about what the processor is actually doing.
You see, not all computer languages work at the same level. C is considered a "low-level" language - you still have to think about things like allocating and freeing memory by hand (there's no garbage collector), converting floats to ints, and so on. Compare that to Python, where resources are automatically deallocated for you, and where you never even have to worry about "does this int have enough bits to hold a value this large?".
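To make the C side of that concrete, here's a tiny sketch of what "managing memory yourself" looks like (the helper copy_string is just an illustration, not anything from the code above):

```c
#include <stdlib.h>
#include <string.h>

/* In C, heap memory is managed by hand: every malloc()
   needs a matching free() somewhere, or you leak memory. */
char *copy_string(const char *src) {
    char *dst = malloc(strlen(src) + 1);  /* +1 for the '\0' terminator */
    if (dst != NULL)
        strcpy(dst, src);
    return dst;  /* the caller is responsible for free()ing this */
}
```

In Python, the equivalent is just s2 = s1, and the runtime reclaims the memory on its own once nothing references it anymore.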
Assembly is even lower than C. It's effectively as close to bare metal as any modern programmer gets.
We're going to go deeper.
Now, I can't actually get into the micro-ops that modern (since the late 90s) x86 processors use, because there's almost no public documentation on them I can find.
However, I can get into more detailed elements. I'm going to use a fictitious processor here, just to make the math simpler.
The CPU will have all the modern technologies: 64-bit addressing, Hyper-Threading, and so on. It could be multi-core, but that's irrelevant to the code we're running. It'll have all the caches modern CPUs have - a split L1 cache, a unified L2 cache, an L3 cache shared with other cores, and a two-level split translation lookaside buffer. Base clock will be 100 MHz, with a multiplier of 20x, giving a CPU clock speed of 2 GHz. This means the CPU can execute one instruction every 0.5 nanoseconds (ns). RAM will be DDR3-1600, 10-10-10-30 timings, meaning a memory access takes 5 ns, and a data transfer takes 1.25 ns.
The time is now 0 ns. The first 64 bytes of code are already loaded into the L1 Instruction cache, but everything else is still in RAM. (We'll pretend every instruction is a fixed 32 bits, so that's 16 instructions; real x86-64 instructions are actually variable-length, anywhere from 1 to 15 bytes.) All clocks are just beginning a cycle.
First line of assembly: MOVL $4, %rax; load x1 to AX
Well, the constant "4" is stored at a spot in memory somewhere. The CPU checks the TLB; the address translation isn't cached there, so it has to be worked out the slow way. Half a nanosecond later, it's determined that the data isn't in any cache, only in RAM. The CPU orders the memory controller to go get it. Unfortunately, it just missed the memory clock, and has to wait 0.75 ns for the cycle to complete.
The time is 1.25 ns. The memory controller issues a command to the RAM - ACTIVATE, telling the RAM to open the appropriate bank and row. However, there needs to be a delay between issuing the "Row access" command and "Column access" command, due to the way DRAM works (this is the tRCD, the second number in the timings). For us, that's 10 FSB cycles, or 12.5 ns.
The time is 13.75 ns. The memory controller issues a command to the RAM - READ, telling the RAM to select the appropriate column and send it to the CPU.
Once again, we're waiting on the RAM for another 12.5 ns, due to the CAS latency. For those who are wondering what the CPU is doing, it's most likely taking this time to process another thread. That's one way that Hyper-Threading gives a performance boost - it can process things while it's waiting on the RAM for something.
The time is now 26 ns. Data is flowing across the FSB directly into the L1 Data cache, 64 bits every 1.25 ns. The RAM has also prefetched the next 8 columns, so we can access those quickly. Fortunately, any assembler worth a damn will have put all the data in one spot.
The time is now 28 ns. "4" has been loaded into register A.
Next instruction is MOVL $1, %rbx. The memory controller requests it from RAM, which simply grabs the data from the prefetch buffer and sends it over. Takes 2.5 ns to access, and another 0.5 ns to load it into the register.
Same thing happens for the next two lines of code (MOVL $5, %rcx and MOVL $1, %rdx). And the RAM still has four quadwords of data prefetched.
The time is now 37 ns.
The next instructions are a little weird. "CVTSI2SS %rax, %xmm0" is an SSE instruction, and as such it uses one of the SSE registers instead of the normal registers. All it's doing is converting whatever's in register A into 32-bit floating-point and loading it into the zeroth SSE register.
This is a relatively complex operation. Now, it really depends on the exact processor, but it most likely would take more than one clock cycle to do. I'll call it a two-clock operation. This is then repeated for the other three stored values, just to get everything ready for actual math.
The time is now 41 ns. The current instruction is "CMP %rax, %rbx"
What that does is compare the values of registers A and B (variables x1 and x2, if you're losing track). By itself, it's useless, but the next operation actually does something with it.
The time is now 41.5 ns. The current instruction is "JG xflip". If the previous comparison was that A > B, the processor will "jump" (a goto in a high-level language) to the address of the label "xflip".
However, while the processor is pondering that difficult problem for half a nanosecond, the CPU has a quandary. It needs to know what the next instruction is, so it can start feeding it into the pipeline, but it could be one of two instructions. It doesn't want to waste time - enough time was already wasted with that RAM access. There's actually a complex system called a "branch predictor" here, which guesses based heavily on what happened the previous times the branch ran. I'll be a pessimist and say that it guesses wrong.
The time is now 42 ns. The CPU has determined that it needs to jump to line 18 ("FLD %xmm1"), but the currently loaded instruction is line 13 ("FLD %xmm0").
Not a problem. The CPU finishes line 13, but then just throws the result away and executes line 18. Fortunately, line 18 was in cache - otherwise, it would have kept calculating from line 13 while it fetched line 18 from RAM.
"FLD" is also a weird instruction - it's an FPU command, not a CPU command. You see, back in the 80s, a floating-point unit was generally unneeded, and a rather large expense. So, it was made as a second chip, which plugged into another socket. With the 486DX, they decided to just include it in the CPU, but the system had been designed to be separate, and still sort of acts like it. The FPU has its own registers (organized as a stack, rather than completely freely-addressable), and its own opcodes. It even has a separate execution unit.
As an addendum, this is another way Hyper-Threading gives a boost. If one thread is using the FPU, another can be using the main integer processing unit, or using the SSE SIMD processing unit. It's generally rare that this gives a major boost, though, except for certain scientific applications. But hey - now you know.
"FLD" simply pushes a value onto the stack. So, we have "1.0" on top of the stack, with nothing beneath it.
The time is now 43 ns. The current operation is "FLD %xmm0". So, "4.0" gets pushed onto the stack, with "1.0" beneath it. Simple enough.
The time is now 43.5 ns. It's time to do some fucking MATH. "FSUB %st0, %st1" is our current instruction. Easy enough: what's 4.0 - 1.0? Whatever it is, push it onto the stack! The stack is now 3.0, 4.0, 1.0, if you lost track.
Now, we need to square that. Simple enough - push the top value onto the stack again ("FLD %st0"), giving us 3.0, 3.0, 4.0, 1.0.
And, time to multiply ("FMUL %st0, %st1"). 3.0 * 3.0 = 9.0, or so I've been taught. Let's push that onto the stack, giving us 9.0, 3.0, 3.0, 4.0, 1.0. This is getting rather close to filling the stack, so we're going to free up some of it.
The time is now 45 ns. The next 4 operations are very similar: FFREE %st1, FFREE %st2, FFREE %st3, and FFREE %st4. All this does is clear the stack except for %st0, which is the 9.0 we need. Fortunately, the integer unit can be working on things now as well: the instructions after that are "CMP %rcx, %rdx" and "JG yflip". The comparison goes simply, and the branch predictor even comes up with the right guess, sending the CPU the instructions "FLD %xmm3" and "FLD %xmm2". Unfortunately, those are FPU instructions, and the FPU is busy for the next nanosecond. It processes them as soon as it finishes those FFREEs.
The time is now 47 ns. The FPU stack is 5.0, 1.0, 9.0. There's just a few instructions left to do.
First, another subtraction ("FSUB %st0, %st1"). 5.0 - 1.0 = 4.0, making the stack 4.0, 5.0, 1.0, 9.0.
The next instruction ("FLD %st0") pushes another 4.0 onto the stack (4.0, 4.0, 5.0, 1.0, 9.0).
The time is now 48 ns. The current instruction is a multiply ("FMUL %st0, %st1"), leaving the stack nearly full at 16.0, 4.0, 4.0, 5.0, 1.0, 9.0.
Next comes an addition ("FADD %st0, %st5"), which adds the 9.0 from earlier and pushes a 25.0 onto the stack.
The time is now 49 ns. The last instruction is a complex one: "FSQRT %st0". Theoretically, this would take 32 cycles, or 16 ns, to calculate using an iterative method. However, processors include "lookup tables" that allow the processor to skip over some of that calculation.
As a quick aside, there was a bug in one of these lookup tables in early Pentium processors. It wasn't in the FSQRT table, though, it was in the FDIV table: a handful of entries were missing, so divisions that happened to hit them gave bad results. Of course, these bad results were still accurate to about 4 significant digits.
Anyways, about eight cycles later, we get the result of our FSQRT, and the program is done. The result, 5.0, is left at the top of the FPU stack. All in all, it took 53 nanoseconds to do all that, about half of which was spent waiting on the RAM.
And now you know.
Wait, you're still reading? After all that massive text, you want more info? I'm done for today, but I might do another "ludicrous detail" thread sometime. Tell me what you want to know, and I'll see what I can explain.
Interesting but a little :psyduck:
Awesome. No I did not read it all.
[QUOTE=kukiric;27200543]Interesting but a little :psyduck:[/QUOTE]
If your head isn't hurting a bit after reading this, I did not go into enough detail.
We need to go deeper.
My head didn't hurt
I already knew this, or at least most of it to matter, made an effort to learn x86 assembly by trying to make a simple OS (Key words, effort, trying) - It was a learning experience. :ninja:
Good read. Nice detail on the RAM timing.
What next? A complete dissection of how the internet works right from the physical layer to the application layer?
Nice read. Thanks :smile:
[QUOTE=Fatal-Error;27200751]We need to go deeper.[/QUOTE]
Ask him to translate this into microcode.
I think there's a typo in the OP, you wrote x96 instead of x86.
Good read though, I'd love to see more threads like this.
Make a version of this that's readable by computer illiterate people.
I love stuff like this.
[QUOTE=VistaPOWA;27200939]I think there's a typo in the OP, you wrote x96 instead of x86.
Good read though, I'd love to see more threads like this.[/QUOTE]
Good spot. Thanks.
[editline]4th January 2011[/editline]
[QUOTE=Cyril;27200905]Ask him to translate this into microcode.[/QUOTE]
I actually considered doing that, but I figured it was too pointlessly difficult.
That was fascinating, thank you. I'm studying for a degree in physics, specialising in solid state physics, and this appeals to my interests tremendously. However, I found the title a little misleading. What you actually told us was: "How a computer works". The CPU works by processing binary bits, in a rather complicated manner best explained with logic arrow diagrams. I'd recommend a book like:
[url]http://www.amazon.co.uk/Electronic-Computers-S-H-Hollingdale/dp/B0007EI67S/ref=sr_1_1?ie=UTF8&qid=1294180467&sr=8-1[/url]
which, while a piece of history that belongs in a museum, explains the basics incredibly well, along with some more advanced concepts. I'm sure you know this stuff, but I would have liked to see an example of that - could be interesting.
[QUOTE=Jimjim32;27201445]That was fascinating, thank you. I'm studying for a degree in physics, specialising in solid state physics, and this appeals to my interests tremendously. However, I found the title a little misleading. What you actually told us was: "How a computer works". The CPU works by processing binary bits, in a rather complicated manner best explained with logic arrow diagrams. I'd recommend a book like:
[url]http://www.amazon.co.uk/Electronic-Computers-S-H-Hollingdale/dp/B0007EI67S/ref=sr_1_1?ie=UTF8&qid=1294180467&sr=8-1[/url]
which, while a piece of history that belongs in a museum, explains the basics incredibly well, along with some more advanced concepts. I'm sure you know this stuff, but I would have liked to see an example of that - could be interesting.[/QUOTE]
I was focusing on the CPU, but I guess I did go into a few other things. Mainly, I tried to go into "what's a CPU made of" by way of actual use. Hence why I used the old FPU instructions to solve it instead of solving the problem in about 3 SSE instructions.
[editline]4th January 2011[/editline]
[QUOTE=Tezzanator92;27200765]What next? A complete dissection of how the internet works right from the physical layer to the application layer?[/QUOTE]
Maybe. Might be interesting.
I read it all.
My head hurts.
Haha, that question's an entire career for some people.
Please never stop posting here.
[editline]4th January 2011[/editline]
And I'm sorry your thread in the literacy group didn't get any attention. If it's worth anything, I read this entire thread and the aforementioned.
And this is why I hate assembly :argh:
and thats how the cookie crumbles
ewww x86 instructions... if you wanted to explain it more simply, I would have used MIPS as the example...
Informative thread. I think it should be stickied.
[QUOTE=doonbugie2;27203020]Informative thread. I think it should be stickied.[/QUOTE]
It's too impractical
Imma read this when I have time, looks interesting
[QUOTE=B!N4RY;27203372]It's too impractical[/QUOTE]
depends.
[QUOTE=Tezzanator92;27200765]My head didn't hurt
I already knew this, or at least most of it to matter, made an effort to learn x86 assembly by trying to make a simple OS (Key words, effort, trying) - It was a learning experience. :ninja:
Good read. Nice detail on the RAM timing.
What next? A complete dissection of how the internet works right from the physical layer to the application layer?[/QUOTE]
i would love this
you mostly just explained assembly
[QUOTE=ButtsexV2;27204059]you mostly just explained assembly[/QUOTE]
But you do have to learn more about how a CPU works in order to understand assembly.
And I did explain a few things like hyper-threading and lookup tables that aren't involved in assembly.
So it wasn't exactly "how a CPU works", but it was still informative and ludicrously detailed.
:iia:
Now you should make Ludicrously Detailed thread about porn hiding.
Wow. Very well written, I managed to follow it nicely.
I actually want to try out a bit of assembly now.
[QUOTE=gman003-main;27200436]Wait, you're still reading?[/QUOTE]
No, I gave up about halfway through and skipped to the end :v: