Sunday, December 23, 2007

x86 assembly generated by various GCC releases for the main() function

In this article, I will look at the assembly code generated for the main() function of a very simple C program by various releases of GCC on FreeBSD, and I will try to explain what it does and why (though a couple of things are still mysterious). I will also illustrate each listing with a snapshot of the call stack right before strcpy() is called. Things shouldn't be much different on other operating systems (such as Linux).

All tests are compiled with the following GCC command line:
gcc -S -O test.c


Here is the very simple (vulnerable) C program I will use:

#include <string.h>

int
main(int ac, char *av[])
{
	char buf[16];

	if (ac < 2)
		return 0;
	strcpy(buf, av[1]);
	return 1;
}

Expectation


First, let's see what I would expect. This first listing comes with verbose comments; later ones won't.

main:
    enter $0, $0            /* Create a new stack frame */
    subl $16, %esp          /* Allocate buf */
    cmpl $1, 8(%ebp)        /* Compare ac to 1... */
    jle .byebye0            /* ...and set the return value */
                            /* to 0 if lower or equal */
    movl 12(%ebp), %eax     /* Load av */
    pushl 4(%eax)           /* Push av[1] */
    leal -16(%ebp), %eax    /* Compute &buf... */
    pushl %eax              /* ...and push it */
    call strcpy
    movl $1, %eax           /* Set the return value to 1 */
    jmp .byebye
.byebye0:
    movl $0, %eax
.byebye:
    leave                   /* Restore the old stack frame */
    ret                     /* Return to the caller */


Now let's look at the stack frame:

      |   av   |
      |   ac   |
      |  ret   |
ebp-> |  sebp  |  saved ebp from the previous stack frame
      |/ / / / |  ^
      | / / / /|  |
      |/ / / / |  |  buf, 16 bytes wide
      | / / / /|  v
      | av[1]  |
esp-> |  &buf  |
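To double-check the displacements used in the listing and the picture above (ac at 8(%ebp), av at 12(%ebp), buf at -16(%ebp)), the frame can be modelled as a C struct whose fields follow increasing addresses. This is only a sketch: struct expected_frame and the EBP_DISP/ESP_DISP macros are made-up names, and each 4-byte slot is a uint32_t so that the model mimics 32-bit x86 whatever machine compiles it.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* uint32_t, int32_t, uint8_t */

/*
 * Hypothetical model of the expected main() frame on 32-bit x86, right
 * before "call strcpy". Fields go from the lowest address (%esp) to the
 * highest one; 4-byte slots are uint32_t so that the layout does not
 * depend on the pointer size of the machine compiling this sketch.
 */
struct expected_frame {
    uint32_t dst;        /* &buf, first strcpy() argument; %esp points here */
    uint32_t src;        /* av[1], second strcpy() argument */
    uint8_t  buf[16];    /* char buf[16] */
    uint32_t saved_ebp;  /* caller's %ebp; %ebp points here */
    uint32_t ret;        /* return address pushed by "call main" */
    int32_t  ac;         /* first argument of main() */
    uint32_t av;         /* second argument of main() */
};

/* Displacement d such that d(%ebp) addresses the field. */
#define EBP_DISP(field)                                  \
    ((long)offsetof(struct expected_frame, field) -      \
     (long)offsetof(struct expected_frame, saved_ebp))

/* Displacement d such that d(%esp) addresses the field. */
#define ESP_DISP(field) ((long)offsetof(struct expected_frame, field))

/* Checks the displacements hard-coded in the expected listing. */
static void check_expected_listing(void)
{
    assert(sizeof(struct expected_frame) == 40);  /* no hidden padding */
    assert(EBP_DISP(ac) == 8);                    /* ac at 8(%ebp) */
    assert(EBP_DISP(av) == 12);                   /* av at 12(%ebp) */
    assert(EBP_DISP(buf) == -16);                 /* buf at -16(%ebp) */
}
```

The same macros also confirm that the return address sits at 4(%ebp) and that the two strcpy() arguments are at 0(%esp) and 4(%esp).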


GCC 2.8.1


With GCC 2.8.1, the assembly is pretty close to what I expected:

main:
    pushl %ebp              /* Don't know why enter isn't */
    movl %esp,%ebp          /* used, but this pair is */
                            /* semantically equivalent */
    subl $16,%esp
    cmpl $1,8(%ebp)
    jle .L2
    movl 12(%ebp),%eax
    pushl 4(%eax)
    leal -16(%ebp),%eax
    pushl %eax
    call strcpy
    movl $1,%eax
    jmp .L3
    .align 4
.L2:
    xorl %eax,%eax
.L3:
    leave
    ret


The stack right before calling strcpy() is identical to the expectation.

GCC 2.95.3


Here, the generated assembly is very close to the previous one, but the size allocated on the stack is 8 bytes larger (24 instead of 16). Moreover, 8 additional bytes are allocated on the stack before the arguments are pushed.

main:
    pushl %ebp
    movl %esp,%ebp
    subl $24,%esp           /* 24 bytes are allocated */
                            /* instead of 16 */
    cmpl $1,8(%ebp)
    jle .L3
    addl $-8,%esp           /* 8 more bytes before */
                            /* pushing the arguments */
    movl 12(%ebp),%eax
    pushl 4(%eax)
    leal -16(%ebp),%eax     /* surprisingly, a 16-byte */
                            /* buffer is still provided */
                            /* to strcpy() */
    pushl %eax
    call strcpy
    movl $1,%eax
    jmp .L4
    .p2align 4,,7
.L3:
    xorl %eax,%eax
.L4:
    leave
    ret


And the corresponding stack frame:

      |   av   |
      |   ac   |
      |  ret   |
ebp-> |  sebp  |
      |/ / / / |  ^  ^
      | / / / /|  |  |
      |/ / / / |  |  |  buf, 16 bytes wide
      | / / / /|  |  v
      |////////|  |  ^
      |////////|  v  v  8 unused bytes
      |\\\\\\\\|  ^
      |\\\\\\\\|  v     8 unused bytes
      | av[1]  |
esp-> |  &buf  |


In order to understand what kind of optimization GCC tried to perform, I compiled the test program with multiple buffer sizes:

sizeof(buf)   stack alloc (bytes)
0             8    (= 16 * 1 - 8)
1-16          24   (= 16 * 2 - 8)
17-32         40   (= 16 * 3 - 8)
33-48         56   (= 16 * 4 - 8)
49-64         72   (= 16 * 5 - 8)


Given that GCC 2.95.3 allocates an additional 8 bytes afterward, it is undoubtedly doing 16-byte alignment here. The intent is good, but in my opinion it lacks the initial alignment of the stack pointer that later GCC releases perform, as we are going to see.
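The table and the alignment claim can be checked with a few lines of C. This is a sketch resting on one assumption that is not visible in the listing: that the caller keeps %esp on a 16-byte boundary right before "call main" (the behaviour governed by GCC's -mpreferred-stack-boundary option). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next multiple of 16. */
static uint32_t round_up16(uint32_t n)
{
    return (n + 15u) & ~15u;
}

/*
 * Operand of "subl $N,%esp" for a buf of n bytes, as in the table:
 * buf rounded up to 16 bytes, plus 8 (i.e. 16 * k - 8).
 */
static uint32_t gcc295_frame_alloc(uint32_t n)
{
    return round_up16(n) + 8u;
}

/*
 * %esp right before "call strcpy", starting from the caller's %esp right
 * before "call main". Assumption: the caller keeps %esp 16-byte aligned.
 */
static uint32_t gcc295_esp_at_call(uint32_t caller_esp, uint32_t n)
{
    assert(caller_esp % 16u == 0);
    uint32_t esp = caller_esp;
    esp -= 4;                       /* return address pushed by call main */
    esp -= 4;                       /* pushl %ebp */
    esp -= gcc295_frame_alloc(n);   /* subl $N,%esp */
    esp -= 8;                       /* addl $-8,%esp */
    esp -= 8;                       /* pushl 4(%eax); pushl %eax */
    return esp;
}
```

For the program above (n = 16), %esp goes down by 4 + 4 + 24 + 8 + 8 = 48 bytes between the two calls, a multiple of 16. In other words, the "- 8" in the table compensates for the return address and the saved %ebp.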

GCC 3.4.6


This release is the greediest of the ones we are looking at with regard to stack usage.

main:
    pushl %ebp
    movl %esp, %ebp
    subl $24, %esp          /* Allocate a 24-byte buffer */
    andl $-16, %esp         /* Align the stack on a */
                            /* 16-byte boundary */
    subl $16, %esp          /* Allocate another 16-byte */
                            /* block */
    movl $0, %eax           /* Return value if ac < 2 */
    cmpl $1, 8(%ebp)
    jle .L1
    subl $8, %esp           /* And yet another 8-byte one */
    movl 12(%ebp), %eax
    pushl 4(%eax)
    leal -24(%ebp), %eax    /* The whole 24-byte buffer */
                            /* is used for strcpy() */
    pushl %eax
    call strcpy
    movl $1, %eax
.L1:
    leave
    ret


Looking at the stack, we can see it is greedily stuffed:

      |   av   |
      |   ac   |
      |  ret   |
ebp-> |  sebp  |
      |/ / / / |  ^
      | / / / /|  |
      |/ / / / |  |
      | / / / /|  |  buf, 24 bytes wide
      |/ / / / |  |
      | / / / /|  v
      |\\\\\\\\|  ^
      |\\\\\\\\|  |  stack alignment on a 16-byte boundary
      |\\\\\\\\|  v  (runtime-dependent size)
      |/ / / / |  ^
      | / / / /|  |
      |/ / / / |  |  16 unused bytes
      | / / / /|  v
      |\ \ \ \ |  ^
      | \ \ \ \|  v  8 unused bytes
      | av[1]  |
esp-> |  &buf  |


I performed the same test as above by varying the size of buf, and I got exactly the same results as with GCC 2.95.3. However, it is difficult to argue that this is meant to maintain alignment, since the arguments of strcpy() are pushed far below and the stack is explicitly aligned right afterward anyway. This question remains open.

Interestingly, while GCC 2.95.3 allocates 24 bytes but only provides strcpy() with an address that has 16 bytes available, GCC 3.4.6 also allocates 24 bytes but provides the whole buffer to strcpy(). My feeling is that instead of wasting 8 bytes, the GCC developers preferred to include them in the buffer in order to mitigate off-by-one/off-by-two exploits.

The 8-byte block is certainly there to form a 16-byte block together with the two arguments of strcpy(), so that %esp is back on a 16-byte boundary when strcpy() is called. Nonetheless, I don't understand the purpose of the 16-byte block at all.
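The alignment question can at least be narrowed down by replaying the %esp arithmetic of the GCC 3.4.6 listing for several entry values. The helper below is hypothetical and only tracks %esp, not memory contents.

```c
#include <assert.h>
#include <stddef.h>   /* NULL */
#include <stdint.h>

/*
 * Replays the %esp arithmetic of the GCC 3.4.6 listing. entry_esp is
 * %esp when main() starts (pointing at the return address) and can be
 * any multiple of 4. Returns %esp right before "call strcpy" and stores
 * the size of the runtime-dependent alignment gap in *gap.
 */
static uint32_t gcc346_esp_at_call(uint32_t entry_esp, uint32_t *gap)
{
    assert(entry_esp % 4u == 0 && gap != NULL);
    uint32_t esp = entry_esp;
    esp -= 4;                  /* pushl %ebp (then movl %esp, %ebp) */
    esp -= 24;                 /* subl $24, %esp: buf */
    uint32_t unaligned = esp;
    esp &= ~15u;               /* andl $-16, %esp */
    *gap = unaligned - esp;    /* 0, 4, 8 or 12 bytes */
    esp -= 16;                 /* subl $16, %esp: the unexplained block */
    esp -= 8;                  /* subl $8, %esp */
    esp -= 8;                  /* pushl 4(%eax); pushl %eax */
    return esp;
}
```

Whatever the entry value, the andl brings %esp back to a 16-byte boundary and the 16 + 8 + 8 bytes below it keep it there at the call; only the gap varies, which is why it is drawn with a runtime-dependent size in the stack snapshot.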

GCC 4.2.1


This release generates assembly that seems quite odd at first glance, but after thinking a bit about it, everything is (mostly) understandable, unlike with GCC 3.4.6 :-).

main:
    leal 4(%esp), %ecx      /* Load &ac into %ecx */
    andl $-16, %esp         /* Align the stack on a */
                            /* 16-byte boundary */
    pushl -4(%ecx)          /* Push ret again */
    pushl %ebp              /* Create a new stack frame */
    movl %esp, %ebp
    pushl %ecx              /* Push &ac */
    subl $36, %esp          /* Allocate a 36-byte buffer */
    movl 4(%ecx), %edx      /* Load av into %edx */
    movl $0, %eax           /* Return value will be 0 if */
                            /* the comparison fails */
    cmpl $1, (%ecx)         /* Compare ac to 1 */
    jle .L4
    movl 4(%edx), %eax      /* Load av[1] into %eax */
    movl %eax, 4(%esp)      /* Fake push */
    leal -20(%ebp), %eax    /* Load &buf (16 bytes, plus 4 */
                            /* to skip over &ac: -20) */
    movl %eax, (%esp)       /* Fake push */
    call strcpy
    movl $1, %eax
.L4:
    addl $36, %esp          /* Clean up the stack */
    popl %ecx               /* Pop &ac into %ecx */
    popl %ebp               /* Restore %ebp */
    leal -4(%ecx), %esp     /* Restore %esp thanks to &ac */
    ret


Evil, isn't it? :-) Let's look at the stack snapshot right before strcpy() is called. You will see that the stack is actually handled very neatly.

      |   av   |
      |   ac   |
      |  ret   |
      |\\\\\\\\|  ^
      |\\\\\\\\|  |  stack alignment on a 16-byte boundary
      |\\\\\\\\|  v  (runtime-dependent size)
      |  ret   |
ebp-> |  sebp  |
      |  &ac   |
      |/ / / / |  ^  ^
      | / / / /|  |  |
      |/ / / / |  |  |  buf, 16 bytes wide
      | / / / /|  |  v
      |////////|  |  ^
      |////////|  |  |  12 unused bytes
      |////////|  |  v
      | av[1]  |  |
esp-> |  &buf  |  v


As you may have already noticed, stack alignment is performed before the new stack frame is created. Therefore, unlike in the previous listings, ac and av cannot be reached with a simple displacement from %ebp, since the size of the gap created by the alignment cannot be known at compile time. So the very first instruction stores the address of ac in %ecx; main()'s arguments are then reachable as (%ecx) for ac and 4(%ecx) for av. Note that -4(%ecx) is the address where ret is stored.

Still before creating the new stack frame, ret is pushed once again. I'm not sure about the purpose of this, but my feeling is that it is meant to create an authentic-looking stack frame, in case something inspects it (a debugger walking the %ebp chain, for instance, expects the return address at 4(%ebp)). Only then is the new stack frame created. &ac is pushed afterward so that it is not lost if %ecx is used for something else.

A 36-byte buffer is allocated on the stack; it holds buf at its end (top) and the arguments of strcpy() at its beginning (bottom). Indeed, you can see that before calling strcpy(), instead of using push to put the arguments on the stack, the code simply stores them at displacements from %esp. OK, fine, but we only need 16 + 8 = 24 bytes to hold all this, not 36, so there are 12 superfluous bytes. Yes, but GCC wants to keep its 16-byte alignment! Look: after the explicit stack alignment, 12 bytes (3 words) have been pushed; 16 bytes are needed for buf and 8 bytes (2 words) for the arguments of strcpy(). 12 + 16 + 8 = 36; the next multiple of 16 is 48, so GCC allocates 48 - 36 = 12 "superfluous" bytes. Neat, eh?
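The padding computation in the previous paragraph can be written as a small function. It is a sketch of that reasoning only, not of GCC's actual frame-layout code: the function name and its parameters are made up, and for buffer sizes other than 16 its result is a prediction that would still have to be checked against real GCC output.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Operand of "subl $N, %esp" predicted by the reasoning above for
 * GCC 4.2.1 (hypothetical helper): the bytes pushed after the alignment,
 * the locals and the outgoing arguments must add up to a multiple of 16,
 * and the locals, the arguments and the padding are allocated at once.
 */
static uint32_t gcc421_frame_alloc(uint32_t pushed, uint32_t locals,
                                   uint32_t outgoing_args)
{
    assert(pushed % 4u == 0);                     /* pushes are 32-bit words */
    uint32_t used = pushed + locals + outgoing_args;
    uint32_t padding = (16u - used % 16u) % 16u;  /* up to the next multiple of 16 */
    return locals + outgoing_args + padding;
}
```

With pushed = 12 (ret copy, %ebp, %ecx), locals = 16 and outgoing_args = 8, it returns 36, the operand of the subl in the listing, and 12 + 36 = 48 is indeed a multiple of 16.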

Upon return, &ac is popped into %ecx, %ebp is popped as well, and %esp is restored to its initial value from %ecx.
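The round trip through %ecx can be replayed on a toy 32-bit stack to check that (%ecx) and 4(%ecx) really reach ac and av, that %esp is 16-byte aligned at the call, and that the epilogue restores both %esp and %ebp whatever the initial misalignment. This is a hypothetical simulator: mem[], struct regs and the helper functions are made up, and only the instructions of the listing that touch %esp, %ebp and %ecx (plus the reads of ac and av) are modelled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of a 32-bit stack: addresses are byte offsets into mem[],
 * which holds 32-bit words. Sizes and names are hypothetical.
 */
#define STACK_WORDS 64u
static uint32_t mem[STACK_WORDS];

static uint32_t load(uint32_t addr)
{
    assert(addr % 4u == 0 && addr / 4u < STACK_WORDS);
    return mem[addr / 4u];
}

static void store(uint32_t addr, uint32_t value)
{
    assert(addr % 4u == 0 && addr / 4u < STACK_WORDS);
    mem[addr / 4u] = value;
}

struct regs { uint32_t esp, ebp, ecx; };

static void push(struct regs *r, uint32_t value) { r->esp -= 4; store(r->esp, value); }
static uint32_t pop(struct regs *r) { uint32_t v = load(r->esp); r->esp += 4; return v; }

/*
 * Replays the GCC 4.2.1 prologue and epilogue of main(). On entry r->esp
 * points at the return address, with ac and av right above it. Reports
 * what (%ecx) and 4(%ecx) read and returns %esp right before "call strcpy".
 */
static uint32_t gcc421_main(struct regs *r, uint32_t *ac, uint32_t *av)
{
    r->ecx = r->esp + 4;           /* leal 4(%esp), %ecx: &ac */
    r->esp &= ~15u;                /* andl $-16, %esp */
    push(r, load(r->ecx - 4));     /* pushl -4(%ecx): copy of ret */
    push(r, r->ebp);               /* pushl %ebp */
    r->ebp = r->esp;               /* movl %esp, %ebp */
    push(r, r->ecx);               /* pushl %ecx: save &ac */
    r->esp -= 36;                  /* subl $36, %esp */
    *ac = load(r->ecx);            /* cmpl $1, (%ecx) reads ac */
    *av = load(r->ecx + 4);        /* movl 4(%ecx), %edx reads av */
    uint32_t esp_at_call = r->esp;
    r->esp += 36;                  /* addl $36, %esp */
    r->ecx = pop(r);               /* popl %ecx */
    r->ebp = pop(r);               /* popl %ebp */
    r->esp = r->ecx - 4;           /* leal -4(%ecx), %esp */
    return esp_at_call;
}
```

For example, storing a return address, ac and av at some address A, A + 4 and A + 8, setting r.esp to A and calling gcc421_main() should leave r.esp equal to A again, with the copy of the return address sitting at (A & ~15) - 4, right above the saved %ebp.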

One thing I don't quite understand is why GCC stores &ac in %ecx. Why not simply perform "movl %esp, %ecx" at the beginning? One could argue this is because ac is used later in a comparison, but I tried removing the test from the C source and it doesn't change anything. So I think this is merely a convention.

Without wanting to be too picky, I also wonder why av is loaded into %edx before the test, although it is only needed after it. I have no idea about this :-).