http://blog.oxff.net/#6kymkyx4h6gcivm4iknq

2011-05-02 13:47

64bit Linux, MAP_32BIT and fs Segment

I recently tried to port my 32bit emulator for Windows malware to 64bit Linux (it was previously compiled as a 32bit process even on 64bit Linux). In theory, it is very simple: we just need to reserve the 2 GiB of userspace memory at a 32bit address, so we can reference it in the LDT, and add an additional LDT descriptor for our translated 32bit code segment (as the Windows malware is 32bit code and I don't want to rewrite it to 64bit). In theory.

In practice, Linux 64bit processes are a horrible bitchy thing, if you get to the low level. First of all, reserving 2 GiB of memory (of course in the 32bit address space) is no problem at all in a 32bit process:

mmap(0, 1024UL * 1024UL * 1024UL * 2UL, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

Of course Linux wants to optimize stuff, and therefore in a 64bit process we get a high address >= 0x100000000. But mmap to the rescue: 64bit Linux added a new flag, MAP_32BIT. In theory, this would allocate our memory in the 32bit area of the address space, except for the little hitch that it just doesn't work in practice. mmap just returns MAP_FAILED (-1) and errno tells us that unfortunately, the memory could not be allocated. If there is enough space to reserve 2 GiB contiguously in a 32bit process, why isn't there in a 64bit process as well? /proc/pid/maps indicates large enough holes, so there probably is just a bug in the kernel. Talking to the linux-mm people did not help at all; they recommended that I debug it myself (and use printk and not gdb with a VM) -- but I'd have to work around this anyway, since I don't want to distribute my code with a ''latest unstable kernel plz'' requirement.
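
To poke at this behaviour in isolation, a small probe along the following lines may help. It is only a sketch and not taken from the emulator: map_low is a hypothetical helper name, and the code assumes an x86-64 Linux build where <sys/mman.h> defines MAP_32BIT. On other targets it simply reports failure.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS and MAP_32BIT */
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: try to reserve `length` bytes with MAP_32BIT.
 * Returns the mapping, or MAP_FAILED after printing the reason. */
static void *map_low(size_t length)
{
        assert(length > 0);
#ifdef MAP_32BIT
        void *p = mmap(NULL, length, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);

        if (p == MAP_FAILED)
                fprintf(stderr, "MAP_32BIT, %zu bytes: %s\n",
                        length, strerror(errno));
        return p;
#else
        /* Assumption: MAP_32BIT is only provided on x86-64. */
        errno = ENOSYS;
        return MAP_FAILED;
#endif
}
```

Calling map_low(4096) next to map_low(1024UL * 1024UL * 1024UL * 2UL) shows whether only the large request is refused. The mmap(2) man page describes MAP_32BIT as putting the mapping into the first 2 GiB of the address space, which may be why a full 2 GiB request finds no room there.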

Luckily, we can work around this issue: for referencing the memory in the LDT, we don't need the whole area to be in the 32bit address space, only the starting address. The following code hence reserves 2 GiB of memory, referenceable from the LDT:

mmap((void *) 0xfffff000, 1024UL * 1024UL * 1024UL * 2UL, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
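
Since the address is only a hint (there is no MAP_FIXED), the kernel is free to put the mapping somewhere else, so it seems wise to check the result before handing it to the LDT. A minimal sketch of such a check, with reserve_near and fits_ldt_base being hypothetical names of my own choosing:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

/* An LDT base is a 32-bit field, so only the start address has to fit. */
static int fits_ldt_base(const void *address)
{
        return (uintptr_t) address <= UINT32_MAX;
}

/* Hypothetical helper: reserve `length` bytes near `hint` without MAP_FIXED.
 * The kernel may ignore the hint, so the result is checked afterwards. */
static void *reserve_near(uintptr_t hint, size_t length)
{
        assert(length > 0);

        void *p = mmap((void *) hint, length, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return MAP_FAILED;

        if (!fits_ldt_base(p)) {
                fprintf(stderr, "got %p, which does not fit an LDT base\n", p);
                munmap(p, length);
                return MAP_FAILED;
        }
        return p;
}
```

With this, reserve_near(0xfffff000, 1024UL * 1024UL * 1024UL * 2UL) either returns a start address usable as a descriptor base or fails cleanly.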

So all that is left now is adding an LDT descriptor for this nice memory area, right? I know that I'll have to deal with the fs segment as well, and in 64bit code (which I'm using in the following testing prototype) segment descriptor bases are only considered for the fs and gs segments (cf. Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A, Chapter 3.2.4), so I gave fs a shot. And of course it did not work.

Let us first consider the following prototype, which (ab)uses gs:

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <asm/ldt.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MMAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS)
#define SIZE (1024UL * 1024UL * 1024UL * 2UL)

int populateDescriptor(uint16_t selector, uint32_t base, uint32_t size, int type)
{
        struct user_desc descriptor = { 0 };

        /* The upper 13 bits of a selector are the index into the LDT. */
        descriptor.entry_number = selector >> 3;
        descriptor.base_addr = base;
        descriptor.limit = (size - 1) / sysconf(_SC_PAGESIZE) + 1;
        descriptor.seg_32bit = 1;
        descriptor.contents = (int) type;
        descriptor.read_exec_only = 0;
        descriptor.limit_in_pages = 1;
        descriptor.seg_not_present = 0;
        descriptor.useable = 1;

        /* modify_ldt with func 1 writes a single LDT entry. */
        if(syscall(SYS_modify_ldt, 1, &descriptor, sizeof(descriptor)) < 0)
        {
                perror(__PRETTY_FUNCTION__);
                return 0;
        }

        return 1;
}

uint16_t allocateDescriptor(void)
{
        static uint16_t index = 0;

        /* index << 3, plus TI = 1 (LDT) and RPL = 3 in the low three bits */
        return ((++index << 3) | 7);
}

int main(int argc, char * argv[])
{
        void * result = mmap((void *) 0xfffff000, SIZE, PROT_NONE, MMAP_FLAGS, -1, 0);

        if(result != MAP_FAILED)
                printf("Successfully reserved %lx bytes of memory, starting at %p.\n", SIZE, result);
        else
        {
                perror("Failed to reserve memory");
                return -1;
        }

        if(mprotect(result, sysconf(_SC_PAGESIZE), PROT_READ | PROT_WRITE) < 0)
        {
                perror("Failed to protect first page");
                return -1;
        }

        uint16_t selector = allocateDescriptor(), oldSelector;
        uint32_t value;

        if(!populateDescriptor(selector, (uint32_t) (size_t) result, SIZE, MODIFY_LDT_CONTENTS_DATA))
                return -1;

        memset(result, 'A', 1024);

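        /* volatile keeps the compiler from dropping these accesses or reordering them against each other. */
        asm volatile("mov %%gs, %0" : "=r" (oldSelector) );
        asm volatile("mov %0, %%gs" : : "r" (selector) );
        asm volatile("movl %%gs:0, %0" : "=r" (value) );
        asm volatile("mov %0, %%gs" : : "r" (oldSelector) );

        printf("! %08x\n", value);

        return 0;
}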

The output looks quite like what I would have expected:

Successfully reserved 80000000 bytes of memory, starting at 0xfffff000.
! 41414141
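
The selector arithmetic in allocateDescriptor and populateDescriptor above packs three fields into 16 bits: the table index in the upper 13 bits, the table indicator in bit 2 and the requested privilege level in bits 0-1. Spelled out as a small sketch (the helper and macro names are hypothetical, not part of the prototype):

```c
#include <assert.h>
#include <stdint.h>

#define SELECTOR_RPL_MASK 0x3u  /* bits 0-1: requested privilege level */
#define SELECTOR_TI_LDT   0x4u  /* bit 2: table indicator, 1 = LDT */

/* Build a ring-3 selector for LDT entry `index`. */
static uint16_t ldtSelector(uint16_t index)
{
        assert(index < 8192);   /* only 13 bits are available for the index */
        return (uint16_t) ((index << 3) | SELECTOR_TI_LDT | 3u);
}

/* Take a selector apart again. */
static uint16_t selectorIndex(uint16_t selector)
{
        return selector >> 3;
}

static int selectorIsLdt(uint16_t selector)
{
        return (selector & SELECTOR_TI_LDT) != 0;
}

static unsigned selectorRpl(uint16_t selector)
{
        return selector & SELECTOR_RPL_MASK;
}
```

The first selector handed out by allocateDescriptor is (1 << 3) | 7 = 0x0f, i.e. LDT entry 1 at privilege level 3, which is the same value ldtSelector(1) produces.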

Now what if we change gs to fs?

        asm volatile("mov %%fs, %0" : "=r" (oldSelector) );
        asm volatile("mov %0, %%fs" : : "r" (selector) );
        asm volatile("movl %%fs:0, %0" : "=r" (value) );
        asm volatile("mov %0, %%fs" : : "r" (oldSelector) );

We get a nice little segmentation fault:

(gdb) run
Starting program: xxx/mmap-32bit-test 
Successfully reserved 80000000 bytes of memory, starting at 0xfffff000.

Program received signal SIGSEGV, Segmentation fault.
0x00007fab7d6089d7 in vfprintf () from /lib/libc.so.6
(gdb) disas $rip $rip+1
Dump of assembler code from 0x7fab7d6089d7 to 0x7fab7d6089d8:
0x00007fab7d6089d7 <vfprintf+55>: mov    %fs:(%rdx),%eax
End of assembler dump.

But wait, we restored the fs selector to its old value before we executed the printf, because we of course know that Linux might abuse fs or gs for its vdso fast system call stuff. Still, why are we getting a segmentation fault? Let's verify what we're doing (useless junk omitted):

(gdb) break main
Breakpoint 1 at 0x4007f9
(gdb) run
Breakpoint 1, 0x00000000004007f9 in main ()
(gdb) p/x $fs
$1 = 0x0
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x00007fd0228b49d7 in vfprintf () from /lib/libc.so.6
(gdb) p/x $fs
$2 = 0x0

What the hell, fs was unused all the time and we also restored it to its initial null value. Still, the code is expecting something special in it?! Yet more bullshit I will have to work around (by dynamically rewriting fs in the emulated code to gs, which is unused in Windows anyway).

Linux kernel developers do not only fix security bugs silently, they also suck leagues above my incompetence to debug their code.

Update: A very simple testcase for the fs bug is:

#include <stdint.h>

int main(void)
{
        uint16_t fs;

        asm volatile("mov %%fs, %0" : "=r" (fs) );
        asm volatile("mov %0, %%fs" : : "r" (fs) );

        return 0;
}

Update 2: Finally I can justify my open IRC shells to my boss! Thanks to erg0t from #social, I was able to fix the strange fs bug (in my code!). He pointed out that arch_prctl can be used to set the fs and gs bases as well. However, the way the kernel does it is by modifying MSR C000_0100h, also known as IA32_FS_BASE. Writing to the fs segment selector register flushes this value, though. So even restoring the original selector destroys glibc's use of the fs register (even though it looked unused because it contained 0). Here is a working fix:

        if(!populateDescriptor(selector, (uint32_t) (size_t) result, SIZE, MODIFY_LDT_CONTENTS_DATA))
                return -1;

        memset(result, 'A', 1024);

        /* ARCH_GET_FS and ARCH_SET_FS are defined in <asm/prctl.h>; originalFsBase is an unsigned long. */
        arch_prctl(ARCH_GET_FS, &originalFsBase);
        printf("Original FS base: %lx\n", originalFsBase);

        asm volatile("mov %0, %%fs" : : "r" (selector) );
        asm volatile("movl %%fs:0, %0" : "=r" (value) );
        asm volatile("mov %0, %%fs" : : "r" (0) );
        arch_prctl(ARCH_SET_FS, originalFsBase);

        printf("! %08x\n", value);

        return 0;
}

Which gives us the expected results:

Successfully reserved 80000000 bytes of memory, starting at 0xfffff000.
Original FS base: 7fe4ab5436f0
! 41414141
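
To make the difference between the visible selector and the hidden base easier to see, the two can be read side by side. The following is a sketch under the assumption of x86-64 Linux; readFsSelector and readFsBase are hypothetical helper names, and the call goes through syscall(SYS_arch_prctl, ...) so it does not depend on a prototype for arch_prctl being available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__x86_64__) && defined(__linux__)
#include <asm/prctl.h>

/* Read the visible fs selector; reading (unlike writing) leaves the base alone. */
static uint16_t readFsSelector(void)
{
        uint16_t selector;

        __asm__ volatile("mov %%fs, %0" : "=r" (selector));
        return selector;
}

/* Fetch the fs base via arch_prctl(ARCH_GET_FS). Returns 0 on success. */
static int readFsBase(unsigned long *base)
{
        assert(base != NULL);
        return (int) syscall(SYS_arch_prctl, ARCH_GET_FS, base);
}
#else
/* Assumption: outside x86-64 Linux there is nothing to query. */
static uint16_t readFsSelector(void)
{
        return 0;
}

static int readFsBase(unsigned long *base)
{
        assert(base != NULL);
        *base = 0;
        return -1;
}
#endif
```

In a glibc process this typically shows a selector of 0 next to a non-zero base, which would explain why fs looked unused in the gdb session above.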

Thanks again, erg0t from #social! ;)