While hooking code in userland seems to be fairly common for various purposes (such as sandboxing malware by API hooking), hooking system calls is usually not done in userland. Since the same information is available from hooks in kernelland (just after the transition), people usually deploy their hooks there, where they benefit from added security and stability if implemented properly. That being said, there is one application of system call hooking that rightfully belongs in userland: hooking 32bit system calls in a native 64bit environment.
WOW64 is the emulation / abstraction layer introduced in 64bit Windows to support 32bit applications. There are many details about it that I don't want to cover. However, for various reasons (I'll leave it to your creativity to find your own; I found a good one playing together with Tillmann Werner), one might be interested in hooking the 32bit system calls that are issued by a 32bit application running in such an environment.
On 32bit Windows XP, there used to be a function pointer within the KUSER_SHARED_DATA page at offset 0x300 that pointed to the symbol ntdll!KiFastSystemCall on any modern machine and was used by every system call wrapper in ntdll to issue a system call:
    0:001> u poi(0x7ffe0000+0x300)
    ntdll!KiFastSystemCall:
    7c90e510 8bd4            mov     edx,esp
    7c90e512 0f34            sysenter
    ntdll!KiFastSystemCallRet:
    7c90e514 c3              ret
    7c90e515 8da42400000000  lea     esp,[esp]
    7c90e51c 8d642400        lea     esp,[esp]
    ntdll!KiIntSystemCall:
    7c90e520 8d542408        lea     edx,[esp+8]
    7c90e524 cd2e            int     2Eh
    7c90e526 c3              ret
Hooking this would not make much sense, since one could gather the same data right after the sysenter within kernelland.
Now fast forward to Windows 7, 64bit with a 32bit process running on WOW64. For the following, I will use the 64bit WinDbg version.
In this newer environment, the code executed by a system call wrapper, such as ntdll!ZwCreateFile in this example, does not take any indirection through KUSER_SHARED_DATA. Instead, it calls a function pointer within the TEB:
    0:000:x86> u ntdll32!ZwCreateFile
    ntdll32!ZwCreateFile:
    77a80054 b852000000      mov     eax,52h
    77a80059 33c9            xor     ecx,ecx
    77a8005b 8d542404        lea     edx,[esp+4]
    77a8005f 64ff15c0000000  call    dword ptr fs:[0C0h]
    77a80066 83c404          add     esp,4
    77a80069 c22c00          ret     2Ch
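To make the wrapper layout concrete, here is a minimal sketch that pulls the system call number and the fs-relative offset out of such a stub. The names `stub_info` and `parse_wow64_stub` are hypothetical and not part of any Windows API, and the parser assumes exactly the byte layout shown in the dump above; other Windows versions may lay the wrapper out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: what the wrapper tells us before the call. */
typedef struct {
    uint32_t syscall_number; /* imm32 of "mov eax, imm32" */
    uint32_t teb_offset;     /* disp32 of "call dword ptr fs:[disp32]" */
} stub_info;

/* Read a little-endian 32bit value without alignment assumptions. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Assumed layout (taken from the dump above):
 *   +0   b8 imm32        mov  eax, imm32
 *   +5   33 c9           xor  ecx, ecx
 *   +7   8d 54 24 04     lea  edx, [esp+4]
 *   +11  64 ff 15 disp32 call dword ptr fs:[disp32]
 * Returns 0 on success, -1 if the bytes do not match this layout. */
int parse_wow64_stub(const uint8_t *code, size_t len, stub_info *out)
{
    assert(code != NULL && out != NULL);
    if (len < 18)
        return -1;
    if (code[0] != 0xB8)
        return -1;
    if (code[11] != 0x64 || code[12] != 0xFF || code[13] != 0x15)
        return -1;
    out->syscall_number = read_le32(code + 1);
    out->teb_offset = read_le32(code + 14);
    return 0;
}
```

Fed the 18 bytes of the ZwCreateFile wrapper above, this should yield system call number 0x52 and offset 0xC0. At the time the call through fs:[0C0h] is taken, eax holds that number and edx points at the 32bit arguments, which is what a hook placed behind the pointer gets to see.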
This new field is called WOW32Reserved and points into wow64cpu:
    +0x0c0 WOW32Reserved : 0x743b2320
    0:000:x86> u 743b2320 L1
    wow64cpu!X86SwitchTo64BitMode:
    743b2320 ea1e273b743300  jmp     0033:743B271E
This is in turn a far jmp into the 64bit code segment. The absolute address points into the 64bit part of wow64cpu, which sets up the 64bit stack first:
    0:000> u 743B271E
    wow64cpu!CpupReturnFromSimulatedCode:
    00000000`743b271e 67448b0424        mov     r8d,dword ptr [esp]
    00000000`743b2723 458985bc000000    mov     dword ptr [r13+0BCh],r8d
    00000000`743b272a 4189a5c8000000    mov     dword ptr [r13+0C8h],esp
    00000000`743b2731 498ba42480140000  mov     rsp,qword ptr [r12+1480h]
Following this, the code converts the system call specific parameters to their 64bit equivalents and then transitions to the original kernel code.
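Coming back to the far jmp at X86SwitchTo64BitMode: its seven bytes (ea 1e273b74 3300) are the opcode 0xEA followed by a 32bit offset and a 16bit segment selector, both little-endian. The following is a small sketch of decoding that instruction; `far_jmp_target` and `decode_far_jmp` are illustrative names, not existing APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the two operands of "jmp ptr16:32". */
typedef struct {
    uint32_t offset;   /* 32bit offset within the target segment */
    uint16_t selector; /* target code segment selector */
} far_jmp_target;

/* Decodes a direct far jmp: opcode 0xEA, 4-byte offset, 2-byte selector.
 * Returns 0 on success, -1 if the first byte is not 0xEA. */
int decode_far_jmp(const uint8_t insn[7], far_jmp_target *out)
{
    assert(insn != NULL && out != NULL);
    if (insn[0] != 0xEA)
        return -1;
    out->offset = (uint32_t)insn[1] | ((uint32_t)insn[2] << 8) |
                  ((uint32_t)insn[3] << 16) | ((uint32_t)insn[4] << 24);
    out->selector = (uint16_t)(insn[5] | (insn[6] << 8));
    return 0;
}
```

For the bytes in the dump, this gives offset 0x743B271E and selector 0x33, matching the disassembly. A hook that wants to continue into the original 64bit code needs exactly these two values.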
So the only way to grab the unmodified 32bit system calls (and parameters), before any conversion is done, is to hook this code. My first idea was to hijack the writable function pointer inside the TEB, but that has the drawback that I would need to track threads and modify it for every new thread. Since this function pointer always points to the same location, I decided to go for an inline function hook. In this case, the hook is very simple, since I know that there will be one long enough instruction with fixed length operands. However, we have to take into account SMP systems that might be decoding this instruction while we're writing there, so it is desirable to use a locked write. Unfortunately, there is not enough room around the instruction to write the hook there and overwrite the original instruction with a short jmp (two bytes, can be written atomically with mov if the address is word-aligned or xchg in the general case).
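To illustrate what has to be written over the far jmp, here is a rough sketch that builds the 8-byte image for the patch site. It assumes the hook is reached through a 5-byte `jmp rel32` (opcode 0xE9), which the post does not spell out, and that the remaining three bytes of the 8-byte window are kept as they were. `build_patch_qword` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Builds the little-endian qword to store at the patch site.
 *   original: the 8 bytes currently at the site (as a little-endian qword)
 *   site:     32bit address of the instruction being overwritten
 *   hook:     32bit address of the hook function
 * Bytes 0..4 become "jmp rel32"; bytes 5..7 are copied from original. */
uint64_t build_patch_qword(uint64_t original, uint32_t site, uint32_t hook)
{
    /* rel32 is relative to the end of the 5-byte jmp; unsigned wraparound
     * yields the correct two's complement displacement. */
    uint32_t rel32 = hook - (site + 5u);
    uint64_t keep = original & 0xFFFFFF0000000000ULL;
    uint64_t patch = keep | ((uint64_t)rel32 << 8) | 0xE9u;

    /* Sanity check: the untouched tail must survive the merge. */
    assert((patch >> 40) == (original >> 40));
    return patch;
}
```

Since the resulting value is a single 8-byte quantity, it is a natural fit for one 64bit write rather than several smaller ones.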
Hence we need to write our five bytes with one single locked write. There is (at least?) one instruction on x86 in 32bit mode which can do that: cmpxchg8b. Reading the processor manual, it becomes obvious that we can abuse this to do an unconditional write if we just execute two cmpxchg8b in a row. If the first comparison fails, the instruction loads the current memory contents into edx:eax, so the second one is guaranteed to match and store ecx:ebx (assuming that no one else is writing there concurrently):
[Update: As @ange4771 correctly pointed out, cmpxchg8b requires a lock prefix to be atomic]
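    /* Assumes origTrampoline and trampoline are 8-byte buffers:
     * low dword at offset 0, high dword at offset 4. */
    asm volatile("lock cmpxchg8b (%6)\n\t"
                 "lock cmpxchg8b (%6)"
        : "=a" (* (DWORD *) origTrampoline), "=d" (* (DWORD *) &origTrampoline[4])
        : "a" (* (DWORD *) trampoline), "d" (* (DWORD *) &trampoline[4]),
          "b" (* (DWORD *) trampoline), "c" (* (DWORD *) &trampoline[4]),
          "D" (fnX86SwitchTo64BitMode)
        : "memory", "cc"); /* the asm writes memory and changes ZF */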
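The same double compare-exchange idea can be modelled in portable C11, which may be easier to reason about than the inline assembly. This is only a sketch of the technique, not the code used for the hook: `swap_via_double_cas` is a made-up name, and whether the compiler actually emits `lock cmpxchg8b` for it depends on the target (it typically does on 32bit x86, but the C standard does not promise that).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stores new_value into *target using two compare-exchange operations and
 * returns the value that was there before. Like the assembly, this assumes
 * that nobody else writes to *target between the two operations. */
uint64_t swap_via_double_cas(_Atomic uint64_t *target, uint64_t new_value)
{
    assert(target != NULL);

    /* Mirrors edx:eax = ecx:ebx = new_value before the first cmpxchg8b. */
    uint64_t expected = new_value;

    /* First attempt: normally the comparison fails, and the current
     * contents of *target are loaded into expected (edx:eax). */
    atomic_compare_exchange_strong(target, &expected, new_value);

    /* At this point expected holds the original contents. */
    uint64_t original = expected;

    /* Second attempt: expected now matches memory, so the store happens. */
    atomic_compare_exchange_strong(target, &expected, new_value);

    return original;
}
```

If the memory already happened to contain `new_value`, the first operation succeeds and the second one simply stores the same value again, so the result is the same in both cases.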
Between those two instructions, one can read out the original jump destination from edx:eax and use it to hotpatch the hook before it is eventually inserted. This is especially useful when a debugger is attached, as single-stepping results in the syscall trampoline being silently executed (this is great for debugger detection). The hook can then just end in the same jmp far 0x33:?? that was present at X86SwitchTo64BitMode; one just needs to preserve