Three Ways to Inject Your Code into Another Process

Download entire package - 157.31 KB
Download WinSpy - 20 KB (demo application)

Introduction

[Figure: WinSpy]

Several password spy tutorials have been posted to The Code Project, but all of them rely on Windows hooks. Is there any other way to make such a utility? Yes, there is. But first, let me review the problem briefly, just to make sure we're all on the same page.

To "read" the contents of any control - either belonging to your application or not - you generally send the WM_GETTEXT message to it. This also applies to edit controls, except in one special case. If the edit control belongs to another process and the ES_PASSWORD style is set, this approach fails. Only the process that "owns" the password control can get its contents via WM_GETTEXT. So, our problem reduces to the following: How to get

::SendMessage( hPwdEdit, WM_GETTEXT, nMaxChars, psBuffer );

executed in the address space of another process.

In general, there are three possibilities to solve this problem:

1. Put your code into a DLL; then, map the DLL to the remote process via Windows hooks.
2. Put your code into a DLL and map the DLL to the remote process using the CreateRemoteThread & LoadLibrary technique.
3. Instead of writing a separate DLL, copy your code to the remote process directly - via WriteProcessMemory - and start its execution with CreateRemoteThread. A detailed description of this technique can be found here.

I. Windows Hooks

Demo applications: HookSpy and HookInjEx

The primary role of windows hooks is to monitor the message traffic of some thread. In general there are:

1. Local hooks, where you monitor the message traffic of any thread belonging to your process.
2. Remote hooks, which can be:
   a. thread-specific, to monitor the message traffic of a thread belonging to another process;
   b. system-wide, to monitor the message traffic for all threads currently running on the system.

If the hooked thread belongs to another process (cases 2a & 2b), your hook procedure must reside in a dynamic-link library (DLL). The system then maps the DLL containing the hook procedure into the address space of the hooked thread. Windows will map the entire DLL, not just the hook procedure. That is why Windows hooks can be used to inject code into another process's address space.

While I won't discuss hooks in this article further (take a look at the SetWindowsHookEx API in MSDN for more details), let me give you two more hints that you won't find in the documentation, but might still be useful:

1. After a successful call to SetWindowsHookEx, the system maps the DLL into the address space of the hooked thread automatically, but not necessarily immediately. Because Windows hooks are all about messages, the DLL isn't really mapped until an adequate event happens. For example:
   If you install a hook that monitors all nonqueued messages of some thread (WH_CALLWNDPROC), the DLL won't be mapped into the remote process until a message is actually sent to (some window of) the hooked thread. In other words, if UnhookWindowsHookEx is called before a message is sent to the hooked thread, the DLL will never be mapped into the remote process (although the call to SetWindowsHookEx itself succeeded). To force an immediate mapping, send an appropriate message to the thread concerned right after the call to SetWindowsHookEx.

   The same is true for unmapping the DLL after calling UnhookWindowsHookEx. The DLL isn't really unmapped until an adequate event happens.
2. When you install hooks, they can affect the overall system performance (especially system-wide hooks). However, you can easily overcome this shortcoming if you use thread-specific hooks solely as a DLL mapping mechanism, and not to trap messages. Consider the following code snippet:
```
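BOOL APIENTRY DllMain( HANDLE hModule,
                       DWORD  ul_reason_for_call,
                       LPVOID lpReserved )
{
    if( ul_reason_for_call == DLL_PROCESS_ATTACH )
    {
        // Increase reference count via LoadLibrary
        // (ANSI variants, because lib_name is a char buffer)
        char lib_name[MAX_PATH];
        ::GetModuleFileNameA( (HMODULE)hModule, lib_name, MAX_PATH );
        ::LoadLibraryA( lib_name );

        // Safely remove hook; g_hHook is the handle returned by
        // SetWindowsHookEx and must be visible in this process, too
        // (e.g. stored in a shared data section)
        ::UnhookWindowsHookEx( g_hHook );
    }
    return TRUE;
}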
```
So, what happens? First we map the DLL to the remote process via Windows hooks. Then, right after the DLL has actually been mapped, we unhook it. Normally, the DLL would now be unmapped, too, as soon as the first message arrives at the hooked thread. The trick is that we prevent this unmapping by increasing the DLL's reference count via LoadLibrary.

The question that remains is: How do we unload the DLL once we are finished? UnhookWindowsHookEx won't do it because we unhooked the thread already. You could do it this way:
- Install another hook, just before you want to unmap the DLL;
- Send a "special" message to the remote thread;
- Catch this message in your hook procedure; in response, call FreeLibrary & UnhookWindowsHookEx.

Now, hooks are used only while mapping/unmapping the DLL to/from the remote process; there is no influence on the performance of the "hooked" thread in the meantime. Put another way: We get a DLL mapping mechanism that doesn't interfere with the target process any more than the LoadLibrary technique discussed below does (see Section II). However, unlike the LoadLibrary technique, this solution works on both WinNT and Win9x.

But, when should one use this trick? Whenever the DLL has to be present in the remote process for a longer period of time (e.g. if you subclass a control belonging to another process) and you want to interfere with the target process as little as possible. I didn't use it in HookSpy because the DLL there is injected just for a moment - just long enough to get the password. Instead, I provided another example - HookInjEx - to demonstrate it. HookInjEx maps/unmaps a DLL into "explorer.exe", where it subclasses the Start button. More precisely: It swaps the left and right mouse clicks for the Start button.

You will find HookSpy and HookInjEx as well as their sources in the download package at the beginning of the article.

II. The CreateRemoteThread & LoadLibrary Technique

Demo application: LibSpy

In general, any process can load a DLL dynamically by using the LoadLibrary API. But, how do we force an external process to call this function? The answer is CreateRemoteThread.

Let's take a look at the declaration of the LoadLibrary and FreeLibrary APIs first:

HINSTANCE LoadLibrary(
  LPCTSTR lpLibFileName   // address of filename of library module
);

BOOL FreeLibrary(
  HMODULE hLibModule      // handle to loaded library module
);

Now, compare them with the declaration of ThreadProc - the thread routine - passed to CreateRemoteThread:

DWORD WINAPI ThreadProc(
  LPVOID lpParameter   // thread data
);

As you can see, all functions use the same calling convention and all accept a 32-bit parameter. Also, the size of the returned value is the same. In other words: We may pass a pointer to LoadLibrary/FreeLibrary as the thread routine to CreateRemoteThread.

However, there are two problems (see the description for CreateRemoteThread below):

1. The lpStartAddress parameter in CreateRemoteThread must represent the starting address of the thread routine in the remote process.
2. If lpParameter - the parameter passed to ThreadProc - is interpreted as an ordinary 32-bit value (FreeLibrary interprets it as an HMODULE), everything is fine. However, if lpParameter is interpreted as a pointer (LoadLibraryA interprets it as a pointer to a char string), it must point to some data in the remote process.

The first problem actually solves itself. Both LoadLibrary and FreeLibrary are functions residing in kernel32.dll. Because kernel32 is guaranteed to be present and at the same load address in every "normal" process (see Appendix A), the address of LoadLibrary/FreeLibrary is the same in every process too. This ensures that a valid pointer is passed to the remote process.

The second problem is also easy to solve: Simply copy the DLL module name (needed by LoadLibrary) to the remote process via WriteProcessMemory.

So, to use the CreateRemoteThread & LoadLibrary technique, follow these steps:

1. Retrieve a HANDLE to the remote process (OpenProcess).
2. Allocate memory for the DLL name in the remote process (VirtualAllocEx).
3. Write the DLL name, including full path, to the allocated memory (WriteProcessMemory).
4. Map your DLL to the remote process via CreateRemoteThread & LoadLibrary.
5. Wait until the remote thread terminates (WaitForSingleObject); that is, until the call to LoadLibrary returns. Put another way, the thread will terminate as soon as our DllMain (called with reason DLL_PROCESS_ATTACH) returns.
6. Retrieve the exit code of the remote thread (GetExitCodeThread). Note that this is the value returned by LoadLibrary, thus the base address (HMODULE) of our mapped DLL.
7. Free the memory allocated in Step #2 (VirtualFreeEx).
8. Unload the DLL from the remote process via CreateRemoteThread & FreeLibrary. Pass the HMODULE handle retrieved in Step #6 to FreeLibrary (via lpParameter in CreateRemoteThread).
   Note: If your injected DLL spawns any new threads, be sure they are all terminated before unloading it.
9. Wait until the thread terminates (WaitForSingleObject).

Also, don't forget to close all the handles once you are finished: To both threads, created in Steps #4 and #8; and the handle to the remote process, retrieved in Step #1.

Let's examine some parts of LibSpy's sources now, to see how the above steps are implemented in practice. For the sake of simplicity, error handling and Unicode support are removed.

HANDLE hThread;
char    szLibPath[_MAX_PATH];  // The name of our "LibSpy.dll" module
                               // (including full path!);
void*   pLibRemote;   // The address (in the remote process) where 
                      // szLibPath will be copied to;
DWORD   hLibModule;   // Base address of loaded module (==HMODULE);
HMODULE hKernel32 = ::GetModuleHandle("Kernel32");

// initialize szLibPath
//...

// 1. Allocate memory in the remote process for szLibPath
// 2. Write szLibPath to the allocated memory
pLibRemote = ::VirtualAllocEx( hProcess, NULL, sizeof(szLibPath),
                               MEM_COMMIT, PAGE_READWRITE );
::WriteProcessMemory( hProcess, pLibRemote, (void*)szLibPath,
                      sizeof(szLibPath), NULL );


// Load "LibSpy.dll" into the remote process
// (via CreateRemoteThread & LoadLibrary)
hThread = ::CreateRemoteThread( hProcess, NULL, 0,
            (LPTHREAD_START_ROUTINE) ::GetProcAddress( hKernel32,
                                       "LoadLibraryA" ),
             pLibRemote, 0, NULL );
::WaitForSingleObject( hThread, INFINITE );

// Get handle of the loaded module
::GetExitCodeThread( hThread, &hLibModule );

// Clean up
::CloseHandle( hThread );
// with MEM_RELEASE, dwSize must be 0 (the whole allocation is freed)
::VirtualFreeEx( hProcess, pLibRemote, 0, MEM_RELEASE );

Assume our SendMessage - the code that we actually wanted to inject - was placed in DllMain (DLL_PROCESS_ATTACH), so it has already been executed by now. Then, it is time to unload the DLL from the target process:

// Unload "LibSpy.dll" from the target process
// (via CreateRemoteThread & FreeLibrary)
hThread = ::CreateRemoteThread( hProcess, NULL, 0,
            (LPTHREAD_START_ROUTINE) ::GetProcAddress( hKernel32,
                                       "FreeLibrary" ),
            (void*)hLibModule, 0, NULL );
::WaitForSingleObject( hThread, INFINITE );

// Clean up
::CloseHandle( hThread );

Interprocess Communications

Until now, we have only talked about how to inject the DLL into the remote process. However, in most situations the injected DLL will need to communicate with your original application in some way (recall that the DLL is now mapped into some remote process, not into our local application!). Take our Password Spy: The DLL has to know the handle to the control that actually contains the password. Obviously, this value can't be hardcoded into it at compile time. Similarly, once the DLL gets the password, it has to send it back to our application so we can display it appropriately.

Fortunately, there are many ways to deal with this situation: File Mapping, WM_COPYDATA, the Clipboard, and the sometimes very handy #pragma data_seg, to name just a few. I won't describe these techniques here because they are all well documented either in MSDN (see Interprocess Communications) or in other tutorials. In the LibSpy example, I used only #pragma data_seg.

You will find LibSpy and its sources in the download package at the beginning of the article.

III. The CreateRemoteThread & WriteProcessMemory Technique

Demo application: WinSpy

Another way to copy some code to another process's address space and then execute it in the context of this process involves the use of remote threads and the WriteProcessMemory API. Instead of writing a separate DLL, you copy the code to the remote process directly now - via WriteProcessMemory - and start its execution with CreateRemoteThread.

Let's take a look at the declaration of CreateRemoteThread first:

HANDLE CreateRemoteThread(
  HANDLE hProcess,        // handle to process to create thread in
  LPSECURITY_ATTRIBUTES lpThreadAttributes,  // pointer to security
                                             // attributes
  DWORD dwStackSize,      // initial thread stack size, in bytes
  LPTHREAD_START_ROUTINE lpStartAddress,     // pointer to thread
                                             // function
  LPVOID lpParameter,     // argument for new thread
  DWORD dwCreationFlags,  // creation flags
  LPDWORD lpThreadId      // pointer to returned thread identifier
);

If you compare it to the declaration of CreateThread (MSDN), you will notice the following differences:

- CreateRemoteThread has an additional hProcess parameter. It is the handle to the process in which the thread is to be created.
- The lpStartAddress parameter in CreateRemoteThread represents the starting address of the thread in the remote process's address space. The function must exist in the remote process, so we can't simply pass a pointer to the local ThreadFunc. We have to copy the code to the remote process first.
- Similarly, the data pointed to by lpParameter must exist in the remote process, so we have to copy it there, too.

Now, we can summarize this technique in the following steps:

1. Retrieve a HANDLE to the remote process (OpenProcess).
2. Allocate memory in the remote process's address space for injected data (VirtualAllocEx).
3. Write a copy of the initialised INJDATA structure to the allocated memory (WriteProcessMemory).
4. Allocate memory in the remote process's address space for injected code.
5. Write a copy of ThreadFunc to the allocated memory.
6. Start the remote copy of ThreadFunc via CreateRemoteThread.
7. Wait until the remote thread terminates (WaitForSingleObject).
8. Retrieve the result from the remote process (ReadProcessMemory or GetExitCodeThread).
9. Free the memory allocated in Steps #2 and #4 (VirtualFreeEx).
10. Close the handles retrieved in Steps #6 and #1 (CloseHandle).

Additional caveats that ThreadFunc has to obey:

- ThreadFunc should not call any functions besides those in kernel32.dll and user32.dll; only kernel32 and user32 are, if present (note that user32 isn't mapped into every Win32 process!), guaranteed to be at the same load address in both the local and the target process (see Appendix A). If you need functions from other libraries, pass the addresses of LoadLibrary and GetProcAddress to the injected code, and let it go and get the rest itself. You could also use GetModuleHandle instead of LoadLibrary if, for one reason or another, the DLL in question is already mapped into the target process.
- Similarly, if you want to call your own subroutines from within ThreadFunc, copy each routine to the remote process individually and supply their addresses to ThreadFunc via INJDATA.
- Don't use any static strings. Instead, pass all strings to ThreadFunc via INJDATA.
  Why? The compiler puts all static strings into the ".data" section of an executable and only references (=pointers) remain in the code. The copy of ThreadFunc in the remote process would then point to something that doesn't exist (at least not in its address space).
- Remove the /GZ compiler switch; it is set by default in debug builds (see Appendix B).
- Either declare ThreadFunc and AfterThreadFunc as static or disable incremental linking (see Appendix C).
- There must be less than a page-worth (4 KB) of local variables in ThreadFunc (see Appendix D). Note that in debug builds some 10 bytes of the available 4 KB are used for internal variables.
- If you have a switch block with more than three case statements, either split it up like this:

switch( expression ) {
    case constant1: statement1; goto END;
    case constant2: statement2; goto END;
    case constant3: statement3; goto END;
}
switch( expression ) {
    case constant4: statement4; goto END;
    case constant5: statement5; goto END;
    case constant6: statement6; goto END;
}
END:

or modify it into an if-else if sequence (see Appendix E).

You will almost certainly crash the target process if you don't play by those rules. Just remember: Don't assume anything in the target process is at the same address as it is in your process (see Appendix F).
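
The rule about static strings can be illustrated with a small, portable sketch. It does not touch a real remote process: `copy_to_remote` is a hypothetical stand-in for WriteProcessMemory that merely copies raw bytes into a separate buffer, and both struct names are invented for this illustration. Under those assumptions, it shows that a string stored by value travels with the INJDATA bytes, while a pointer member carries only an address that is meaningful in the original process.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical variant of INJDATA that embeds the string itself.
struct InjDataByValue {
    char text[32];
};

// Hypothetical variant that stores only a pointer to the string.
struct InjDataByPointer {
    const char* text;
};

// Stand-in for WriteProcessMemory (an assumption of this sketch): copies the
// raw bytes of `data` into a buffer that plays the role of the remote process.
template <typename T>
std::vector<unsigned char> copy_to_remote(const T& data) {
    std::vector<unsigned char> remote(sizeof(T));
    std::memcpy(remote.data(), &data, sizeof(T));
    return remote;
}

// True if the characters of `s` appear somewhere in the copied bytes.
bool remote_contains(const std::vector<unsigned char>& remote, const char* s) {
    const std::size_t n = std::strlen(s);
    if (n == 0 || remote.size() < n) return false;
    for (std::size_t i = 0; i + n <= remote.size(); ++i) {
        if (std::memcmp(remote.data() + i, s, n) == 0) return true;
    }
    return false;
}
```

Copying an `InjDataByValue` whose `text` holds "user32.dll" yields a buffer that contains those characters; copying an `InjDataByPointer` yields only the bytes of the pointer, so the string itself never reaches the "remote" side.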

GetWindowTextRemote(A/W)

All the functionality you need to get the password from a "remote" edit control is encapsulated in GetWindowTextRemote(A/W):

int GetWindowTextRemoteA( HANDLE hProcess, HWND hWnd, LPSTR  lpString );
int GetWindowTextRemoteW( HANDLE hProcess, HWND hWnd, LPWSTR lpString );

Parameters

hProcess: Handle to the process the edit control belongs to.

hWnd: Handle to the edit control containing the password.

lpString: Pointer to the buffer that is to receive the text.

Return Value

The return value is the number of characters copied.

Let's examine some parts of its sources now - especially the injected data and code - to see how GetWindowTextRemote works. Again, Unicode support is removed for the sake of simplicity.

INJDATA

typedef LRESULT     (WINAPI *SENDMESSAGE)(HWND,UINT,WPARAM,LPARAM);

typedef struct {    
    HWND hwnd;                    // handle to edit control
    SENDMESSAGE  fnSendMessage;   // pointer to user32!SendMessageA

    char psText[128];    // buffer that is to receive the password
} INJDATA;

INJDATA is the data structure being injected into the remote process. However, before doing so, the structure's pointer to SendMessageA is initialised in our application. The trick here is that user32.dll is (if present!) always mapped to the same address in every process; thus, the address of SendMessageA is always the same, too. This ensures that a valid pointer is passed to the remote process.

ThreadFunc

static DWORD WINAPI ThreadFunc (INJDATA *pData) 
{
    pData->fnSendMessage( pData->hwnd, WM_GETTEXT,    // Get password
                          sizeof(pData->psText),
                          (LPARAM)pData->psText );  
    return 0;
}

// This function marks the memory address after ThreadFunc.
// int cbCodeSize = (PBYTE) AfterThreadFunc - (PBYTE) ThreadFunc.
static void AfterThreadFunc (void)
{
}

ThreadFunc is the code executed by the remote thread. Point of interest:

Note how AfterThreadFunc is used to calculate the code size of ThreadFunc. In general this isn't the best idea, because the linker is free to change the order of your functions (i.e. it could place ThreadFunc behind AfterThreadFunc). However, you can be pretty sure that in small projects, like our WinSpy, the order of your functions will be preserved. If necessary, you could also use the /ORDER linker option to help you out; or, better yet, determine the size of ThreadFunc with a disassembler.

How to Subclass a Remote Control With this Technique

Demo application: InjectEx

Let's explain something more complicated now: How to subclass a control belonging to another process with this technique?

First of all, note that you have to copy two functions to the remote process to accomplish this task:

ThreadFunc, which actually subclasses the control in the remote process via SetWindowLong, and
NewProc, the new window procedure of the subclassed control.

However, the main problem is how to pass data to the remote NewProc. Because NewProc is a callback function and thus has to conform to specific guidelines, we can't simply pass a pointer to INJDATA to it as an argument. Fortunately, there are other ways to solve this problem (I found two), but all rely on assembly language. So, although I have tried to keep the assembly confined to the appendixes until now, we can't do without it this time.

Solution 1

Observe the following picture:

[Figure: The virtual address space]

Note that INJDATA is placed immediately before NewProc in the remote process. This way, NewProc knows the memory location of INJDATA in the remote process's address space at compile time. More precisely: It knows the address of INJDATA relative to its own location, but that's actually all we need. Now NewProc might look like this:

static LRESULT CALLBACK NewProc(
  HWND hwnd,       // handle to window
  UINT uMsg,       // message identifier
  WPARAM wParam,   // first message parameter
  LPARAM lParam )  // second message parameter
{
    INJDATA* pData = (INJDATA*) NewProc;  // pData points to
                                          // NewProc;
    pData--;              // now pData points to INJDATA;
                          // recall that INJDATA in the remote 
                          // process is immediately before NewProc;

    //-----------------------------
    // subclassing code goes here
    // ........
    //-----------------------------

    // call original window procedure;
    // fnOldProc (returned by SetWindowLong) was initialised
    // by (the remote) ThreadFunc and stored in (the remote) INJDATA;
    return pData->fnCallWindowProc( pData->fnOldProc, 
                                    hwnd,uMsg,wParam,lParam );
}

However, there is still a problem. Observe the first line:

INJDATA* pData = (INJDATA*) NewProc;

This way, a hardcoded value (the memory location of the original NewProc in our process) will be assigned to pData. That is not quite what we want. What we need is the memory location of the "current" copy of NewProc in the remote process, regardless of the location it has actually been moved to. In other words, we would need some kind of a "this pointer."

While there is no way to solve this in C/C++, it can be done with inline assembly. Consider the modified NewProc:

static LRESULT CALLBACK NewProc(
  HWND hwnd,      // handle to window
  UINT uMsg,      // message identifier
  WPARAM wParam,  // first message parameter
  LPARAM lParam ) // second message parameter
{
    // calculate location of the INJDATA struct;
    // remember that INJDATA in the remote process
    // is placed right before NewProc;
    INJDATA* pData;
    _asm {
        call    dummy
dummy:
        pop     ecx         // <- ECX contains the current EIP
        sub     ecx, 9      // <- ECX contains the address of NewProc
        mov     pData, ecx
    }
    pData--;

    //-----------------------------
    // subclassing code goes here
    // ........
    //-----------------------------

    // call original window procedure
    return pData->fnCallWindowProc( pData->fnOldProc, 
                                    hwnd,uMsg,wParam,lParam );
}

So, what's going on? Virtually every processor has a special register that points to the memory location of the next instruction to be executed. That's the so-called instruction pointer, denoted EIP on 32-bit Intel and AMD processors. Because EIP is a special-purpose register, you can't access it programmatically as you can general-purpose registers (EAX, EBX, etc). Put another way: There is no opcode with which you could address EIP and read or change its contents explicitly. However, EIP can still be changed (and is changed all the time) implicitly, by instructions such as JMP, CALL and RET. Let's, for example, explain how the subroutine CALL/RET mechanism works on 32-bit Intel and AMD processors:

When you call a subroutine (via CALL), the address of the subroutine is loaded into EIP. But, even before EIP is modified, its old value is automatically pushed onto the stack (for use later as a return instruction-pointer). At the end of a subroutine, the RET instruction automatically pops the top of the stack into EIP.

Now you know how EIP is modified via CALL and RET, but how do you get its current value?
Well, remember that CALL pushes EIP onto the stack? So, in order to get its current value, call a "dummy function" and pop the stack right thereafter. Let's explain the whole trick using our compiled NewProc:

 Address   OpCode/Params   Decoded instruction
--------------------------------------------------
:00401000  55              push ebp            ; entry point of
                                               ; NewProc
:00401001  8BEC            mov ebp, esp
:00401003  51              push ecx
:00401004  E800000000      call 00401009       ; *a*    call dummy
:00401009  59              pop ecx             ; *b*
:0040100A  83E909          sub ecx, 00000009   ; *c*
:0040100D  894DFC          mov [ebp-04], ecx   ; mov pData, ECX
:00401010  8B45FC          mov eax, [ebp-04]
:00401013  83E814          sub eax, 00000014   ; pData--;
.....
.....
:0040102D  8BE5            mov esp, ebp
:0040102F  5D              pop ebp
:00401030  C21000          ret 0010

a. A dummy function call; it just jumps to the next instruction and pushes EIP onto the stack.
b. Pop the stack into ECX. ECX then holds EIP; this is exactly the address of the "pop ECX" instruction as well.
c. Note that the "distance" between the entry point of NewProc and the "pop ECX" instruction is 9 bytes; thus, to calculate the address of NewProc, subtract 9 from ECX.
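
The arithmetic behind steps *b* and *c*, plus the following `pData--`, can be checked with a few lines of portable C++. This is only a sketch of the address calculation, not injected code: the function names are made up for illustration, and the two constants are the values read off the listing above (9 bytes to the "pop ecx" instruction, and 0x14 bytes for sizeof(INJDATA) as implied by `sub eax, 00000014`).

```cpp
#include <cassert>
#include <cstdint>

// Values read off the disassembly listing; both change with compiler
// options, so they must be re-measured for every build.
constexpr std::uint32_t kEntryToPopDistance = 9;    // NewProc entry -> "pop ecx"
constexpr std::uint32_t kInjDataSize        = 0x14; // from "sub eax, 14h" (pData--)

// Address of NewProc, given the value that "pop ecx" loads into ECX.
constexpr std::uint32_t newproc_from_popped_eip(std::uint32_t popped_eip) {
    return popped_eip - kEntryToPopDistance;
}

// Address of INJDATA, which sits immediately before NewProc.
constexpr std::uint32_t injdata_from_newproc(std::uint32_t newproc) {
    return newproc - kInjDataSize;
}
```

With the addresses from the listing, ECX holds 0x00401009 after the pop, so NewProc is at 0x00401000 and INJDATA would be expected at 0x00400FEC. If the same bytes were copied so that NewProc started at, say, 0x008A0014 (a hypothetical remote address), the pop would yield 0x008A001D and the same two subtractions would give 0x008A0000 for INJDATA, with no hardcoded absolute address involved.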

This way, NewProc can always calculate its own address, regardless of to what location it is actually moved! However, be aware that the distance between the entry point of NewProc and the "pop ECX" instruction might change as you change your compiler/linker options, and is thus different in release and debug builds, too. But, the point is that you still know the exact value at compile time:

First, compile your function.
Determine the correct distance with a disassembler.
Finally, recompile with the correct distance.

That's the solution used in InjectEx. InjectEx, like HookInjEx, swaps the left and right mouse clicks for the Start button.

Solution 2

Placing INJDATA right before NewProc in the remote process's address space isn't the only way to solve our problem. Consider the following variant of NewProc:

static LRESULT CALLBACK NewProc(
  HWND hwnd,      // handle to window
  UINT uMsg,      // message identifier
  WPARAM wParam,  // first message parameter
  LPARAM lParam ) // second message parameter
{
    INJDATA* pData = (INJDATA*) 0xA0B0C0D0;    // a dummy value

    //-----------------------------
    // subclassing code goes here
    // ........
    //-----------------------------

    // call original window procedure
    return pData->fnCallWindowProc( pData->fnOldProc, 
                                    hwnd,uMsg,wParam,lParam );
}

Here, 0xA0B0C0D0 is just a placeholder for the real (absolute!) address of INJDATA in the remote process's address space. Recall that you can't know this address at compile time. However, you do know the location of INJDATA in the remote process right after the call to VirtualAllocEx (for INJDATA) is made.

Our NewProc could compile into something like this:

 Address   OpCode/Params     Decoded instruction
--------------------------------------------------
:00401000  55                push ebp
:00401001  8BEC              mov ebp, esp
:00401003  C745FCD0C0B0A0    mov [ebp-04], A0B0C0D0
:0040100A  ...
....
....
:0040102D  8BE5              mov esp, ebp
:0040102F  5D                pop ebp
:00401030  C21000            ret 0010

Thus, its compiled code (in hexadecimal) would be: 558BECC745FCD0C0B0A0......8BE55DC21000.

Now, you would proceed as follows:

1. Copy INJDATA, ThreadFunc and NewProc to the target process.
2. Change the code of NewProc, so that pData holds the real address of INJDATA.
   For example, let's say the address of INJDATA (the value returned by VirtualAllocEx) in the target process is 0x008a0000. Then you modify the code of NewProc as follows:

   558BECC745FCD0C0B0A0......8BE55DC21000 <- original NewProc ¹

   558BECC745FC00008A00......8BE55DC21000 <- modified NewProc with real address of INJDATA

   Put another way: You replace the dummy value A0B0C0D0 with the real address of INJDATA. ²
3. Start execution of the remote ThreadFunc, which in turn subclasses the control in the remote process.
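
The patching step can be sketched on a plain byte buffer. The helper names below are hypothetical, and the sketch works on a local copy of the code bytes; in a real injector the patched bytes would still have to reach the target process (for example via WriteProcessMemory). It also simply patches the first match, so it assumes the placeholder byte pattern occurs only once in the function.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes a 32-bit value the way x86 stores it: low-order byte first.
std::array<std::uint8_t, 4> to_little_endian(std::uint32_t value) {
    return {{ static_cast<std::uint8_t>(value),
              static_cast<std::uint8_t>(value >> 8),
              static_cast<std::uint8_t>(value >> 16),
              static_cast<std::uint8_t>(value >> 24) }};
}

// Replaces the first occurrence of `placeholder` in `code` with
// `real_address`. Returns false (and leaves `code` untouched) if the
// placeholder bytes are not found.
bool patch_placeholder(std::vector<std::uint8_t>& code,
                       std::uint32_t placeholder,
                       std::uint32_t real_address) {
    const auto needle = to_little_endian(placeholder);
    auto hit = std::search(code.begin(), code.end(),
                           needle.begin(), needle.end());
    if (hit == code.end()) return false;
    const auto patch = to_little_endian(real_address);
    std::copy(patch.begin(), patch.end(), hit);
    return true;
}
```

Applied to the bytes 55 8B EC C7 45 FC D0 C0 B0 A0 ... with placeholder 0xA0B0C0D0 and real address 0x008A0000, the four bytes after C7 45 FC become 00 00 8A 00, matching the modified NewProc shown above.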

¹ One might wonder why the addresses A0B0C0D0 and 008a0000 in the compiled code appear in reverse order. It's because Intel and AMD processors use little-endian notation to represent their (multi-byte) data. In other words: The low-order byte of a number is stored in memory at the lowest address, and the high-order byte at the highest address.
Imagine the word UNIX stored in four bytes. In big-endian systems, it would be stored as UNIX. In little-endian systems, it would be stored as XINU.

² Some (bad) cracks modify the code of an executable in a similar way. However, once loaded into memory, a program can't change its own code (the code resides in the ".text" section of an executable, which is write protected). Still, we could modify our remote NewProc, because it was previously copied to a piece of memory with PAGE_EXECUTE_READWRITE permission.

When to use the CreateRemoteThread & WriteProcessMemory technique

The CreateRemoteThread & WriteProcessMemory technique of code injection is, when compared to the other methods, more flexible in that you don't need an additional DLL. Unfortunately, it is also more complicated and riskier than the other methods. You can (and most probably will) easily crash the remote process as soon as something is wrong with your ThreadFunc (see Appendix F). Because debugging a remote ThreadFunc can also be a nightmare, you should use this technique only when injecting at most a few instructions. To inject a larger piece of code, use one of the methods discussed in Sections I and II.

Again, WinSpy and InjectEx, as well as their sources, can be found in the download package at the beginning of the article.

Some Final Words

At the end, let's summarize some facts we didn't mention so far:

| Technique | OS | Processes |
|---|---|---|
| I. Hooks | Win9x and WinNT | only processes that link with USER32.DLL¹ |
| II. CreateRemoteThread & LoadLibrary | WinNT only² | all processes³, including system services⁴ |
| III. CreateRemoteThread & WriteProcessMemory | WinNT only | all processes, including system services |

¹ Obviously you can't hook a thread that has no message queue. Also, SetWindowsHookEx won't work with system services, even if they link against USER32.DLL.
² There is neither CreateRemoteThread nor VirtualAllocEx on Win9x. (Actually, they can be emulated on Win9x, too; but that's a story for yet another day.)
³ All processes = All Win32 processes + csrss.exe
Native applications (smss.exe, os2ss.exe, autochk.exe, etc) don't use Win32 APIs, and thus don't link against kernel32.dll either. The only exception is csrss.exe, the Win32 subsystem itself. It's a native application but some of its libraries (~winsrv.dll) require Win32 DLLs, including kernel32.dll.
⁴ If you want to inject code into system services (lsass.exe, services.exe, winlogon.exe, and so on) or into csrss.exe, enable the "SeDebugPrivilege" privilege for your process (AdjustTokenPrivileges) before opening a handle to the remote process (OpenProcess).

That's almost it. There is just one more thing that you should bear in mind: Your injected code can, especially if something is wrong with it, easily pull the target process down to oblivion with it. Just remember: Power comes with responsibility!

Because many examples in this article were about passwords, you might also find it interesting to read the article Super Password Spy++ by Zhefu Zhang. There he explains how to get the passwords out of an Internet Explorer password field. What's more, he even shows you how to protect your password controls against such attacks.

Last note: The only reward someone gets for writing and publishing an article is the feedback he gets, so if you found it useful, simply drop in a comment or vote for it. But even more importantly: Let me know if something is wrong or buggy, if you think something could be done better, or if something is still left unclear.

Acknowledgments

First, thanks to my readers at CodeGuru, where this text was initially published. It is mainly because of your questions that the article grew from its initial 1200 words to what it is today: a 6000-word "animal." However, if there is someone who especially deserves to be singled out, then it is Rado Picha. Parts of the article greatly benefited from his suggestions and explanations. Last, but not least, thanks to Susan Moore for helping me through that minefield called the English language, and making my article more readable.

A) Why are kernel32.dll and user32.dll always mapped to the same address?

My presumption: Because Microsoft programmers thought that it could be a useful speed optimization. Let's explain why.

In general, an executable is composed of several sections, including a ".reloc" section.

When the linker creates an EXE or DLL file, it makes an assumption about where the file will be mapped into memory. That's the so-called assumed/preferred load/base address. All the absolute addresses in the image are based on this linker-assumed load address. If for whatever reason the image isn't loaded at this address, the PE (portable executable) loader has to fix all the absolute addresses in the image. That is where the ".reloc" section comes in: It contains a list of all the places in the image where the difference between the linker-assumed load address and the actual load address needs to be factored in. (Note that most of the instructions produced by the compiler use some kind of relative addressing; as a result, there are not as many relocations as you might think.) If, on the other hand, the loader is able to load the image at the linker's preferred base address, the ".reloc" section is completely ignored.
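
To make the role of the ".reloc" section more concrete, here is a minimal sketch of the fix-up step. It is not the real PE loader and not the real relocation format: the function names (applyRelocations, readAddress), the flat list of offsets, and the addresses in the usage note are assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical, simplified model of base relocation (not the real PE format):
// 'image' holds the raw bytes of a module, and 'relocOffsets' lists the byte
// offsets of every 32-bit absolute address stored inside that image.
void applyRelocations(std::vector<std::uint8_t>& image,
                      const std::vector<std::size_t>& relocOffsets,
                      std::uint32_t preferredBase,
                      std::uint32_t actualBase)
{
    // Loaded at the preferred base: the relocation list is simply ignored.
    if (actualBase == preferredBase) {
        return;
    }

    // Difference between actual and assumed load address (wraps modulo 2^32).
    const std::uint32_t delta = actualBase - preferredBase;

    for (std::size_t offset : relocOffsets) {
        if (offset + sizeof(std::uint32_t) > image.size()) {
            continue;  // ignore entries that point outside the image
        }
        std::uint32_t address;
        std::memcpy(&address, &image[offset], sizeof address);  // read stored address
        address += delta;                                        // rebase it
        std::memcpy(&image[offset], &address, sizeof address);  // write it back
    }
}

// Helper for inspecting the result: reads the 32-bit value stored at 'offset'.
std::uint32_t readAddress(const std::vector<std::uint8_t>& image, std::size_t offset)
{
    std::uint32_t value;
    std::memcpy(&value, &image[offset], sizeof value);
    return value;
}
```

For instance, an image linked for base 0x00400000 that stores the address 0x00401234 and is then loaded at 0x00A00000 would hold 0x00A01234 after the fix-up; loaded at its preferred base, it would be left untouched.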

But how do kernel32.dll, user32.dll and their load addresses fit into the story? Because every Win32 application needs kernel32.dll, and most of them need user32.dll, too, you can improve the load time of all executables by always mapping them (kernel32 and user32) to their preferred bases. Then the loader never has to fix any (absolute) addresses in kernel32.dll and user32.dll.

Let's close out this discussion with the following example:

Set the image base of some App.exe to KERNEL32's (/base:"0x77e80000") or to USER32's (/base:"0x77e10000") preferred base. If App.exe doesn't import from USER32, just LoadLibrary it. Then compile App.exe and try to run it. An error box pops up ("Illegal System DLL Relocation") and App.exe fails to load.

Why? When creating a process, the loader on Win 2000, Win XP and Win 2003 checks if kernel32.dll and user32.dll (their names are hardcoded into the loader) are mapped at their preferred bases; if not, a hard error is raised. In WinNT 4 ole32.dll was also checked. In WinNT 3.51 and lower such checks were not present, so kernel32.dll and user32.dll could be anywhere. Anyway, the only module that is always at its base is ntdll.dll. The loader doesn't check it, but if ntdll.dll is not at its base, the process just can't be created.

To summarize, on WinNT 4 and higher:

DLLs that are always mapped to their bases: kernel32.dll, user32.dll and ntdll.dll.
DLLs that are present in every Win32 application (+ csrss.exe): kernel32.dll and ntdll.dll.
The only DLL that is present in every process, even in native applications: ntdll.dll.

B) The /GZ compiler switch

In Debug builds, the /GZ compiler feature is turned on by default. You can use it to catch some errors (see the documentation for details). But what does it mean to our executable?

When /GZ is turned on, the compiler will add some additional code to every function residing in the executable, including a function call (added at the very end of every function) that verifies the ESP stack pointer hasn't changed through our function. But wait, a function call is added to ThreadFunc? That's the road to disaster. Now the remote copy of ThreadFunc will call a function that doesn't exist in the remote process (at least not at the same address).

C) Static functions Vs. Incremental linking

Incremental linking is used to shorten the linking time when building your applications. The difference between normally and incrementally linked executables is that in incrementally linked ones each function call goes through an extra JMP instruction emitted by the linker (functions declared as static are an exception to this rule). These JMPs allow the linker to move the functions around in memory without updating all the CALL instructions that reference the function. But it's exactly this JMP that causes problems too: now ThreadFunc and AfterThreadFunc will point to the JMP instructions instead of the real code. So, when calculating the size of ThreadFunc this way:

const int cbCodeSize = ((LPBYTE) AfterThreadFunc - (LPBYTE) ThreadFunc);

you will actually calculate the "distance" between the JMPs that point to ThreadFunc and AfterThreadFunc respectively (usually they will appear one right after the other; but don't count on this). Now suppose our ThreadFunc is at address 004014C0 and the accompanying JMP instruction at 00401020.

:00401020   jmp  004014C0
 ...
:004014C0   push EBP          ; real address of ThreadFunc
:004014C1   mov  EBP, ESP
 ...

Then

WriteProcessMemory( .., &ThreadFunc, cbCodeSize, ..);

will copy the "JMP 004014C0" instruction (and all instructions in the range of cbCodeSize that follow it) to the remote process - not the real ThreadFunc. The first thing the remote thread will execute will be a "JMP 004014C0". Well, it will also be among its last instructions - not only to the remote thread, but to the whole process.

However, there is an exception to this JMP instruction "rule." If a function is declared as static, it will be called directly, even if linked incrementally. That's why Rule #4 says to declare ThreadFunc and AfterThreadFunc as static or disable incremental linking. (Some other aspects of incremental linking can be found in the article "Remove Fatty Deposits from Your Applications Using Our 32-bit Liposuction Tools" by Matt Pietrek)
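
The thunk problem can be illustrated with a small sketch. Assuming the linker's thunk is a plain five-byte "JMP rel32" (opcode E9), which is what the listing above shows but is ultimately a linker implementation detail, the real function address can be recovered from the thunk's bytes. The function name resolveJmpThunk and its parameters are hypothetical; the sketch works on a byte buffer and does not touch any real function pointers.

```cpp
#include <cstddef>
#include <cstdint>

// Given the bytes found at 'address', follow an x86 "JMP rel32" (opcode E9)
// and return the address it jumps to. If the bytes do not start with such a
// jump, 'address' is returned unchanged.
std::uint32_t resolveJmpThunk(const std::uint8_t* bytes, std::size_t length,
                              std::uint32_t address)
{
    const std::uint32_t kJmpSize = 5;  // 1 opcode byte + 4 displacement bytes
    if (bytes == nullptr || length < kJmpSize || bytes[0] != 0xE9) {
        return address;
    }

    // The displacement is stored little-endian; assemble it byte by byte.
    const std::uint32_t displacement =
          static_cast<std::uint32_t>(bytes[1])
        | static_cast<std::uint32_t>(bytes[2]) << 8
        | static_cast<std::uint32_t>(bytes[3]) << 16
        | static_cast<std::uint32_t>(bytes[4]) << 24;

    // Target = address of the instruction after the JMP + displacement.
    // Unsigned wrap-around handles negative (backward) displacements.
    return address + kJmpSize + displacement;
}
```

With the bytes E9 9B 04 00 00 located at 00401020, the function yields 004014C0, the real address of ThreadFunc in the example above. Declaring the functions static or disabling incremental linking remains the simpler fix.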

D) Why can my ThreadFunc have only 4k of local variables?

Local variables are always stored on the stack. If a function has, say, 256 bytes of local variables, the stack pointer is decreased by 256 when entering the function (more precisely, in the function's prologue). The following function:

void Dummy(void) {
    BYTE var[256];
    var[0] = 0;
    var[1] = 1;
    var[255] = 255;
}

could, for instance, compile into something like this:

:00401000   push ebp
:00401001   mov  ebp, esp
:00401003   sub  esp, 00000100           ; change ESP as storage for
                                         ; local variables is needed
:00401006   mov  byte ptr [esp], 00      ; var[0] = 0;
:0040100A   mov  byte ptr [esp+01], 01   ; var[1] = 1;
:0040100F   mov  byte ptr [esp+FF], FF   ; var[255] = 255;
:00401017   mov  esp, ebp                ; restore stack pointer
:00401019   pop  ebp
:0040101A   ret

Note how the stack pointer (ESP) was changed in the above example? But what is different if a function needs more than 4 Kb for its local variables? Well, then the stack pointer isn't changed directly. Rather, another function (a stack probe) is called, which in turn changes it appropriately. But it's exactly this additional function call that makes our ThreadFunc "corrupt," because its remote copy would call something that's not there.

Let's see what the documentation says about stack probes and the /Gs compiler option:

"The /Gssize option is an advanced feature with which you can control stack probes. A stack probe is a sequence of code that the compiler inserts into every function call. When activated, a stack probe reaches benignly into memory by the amount of space required to store the associated function's local variables.
If a function requires more than size stack space for local variables, its stack probe is activated. The default value of size is the size of one page (4 Kb for 80x86 processors). This value allows a carefully tuned interaction between an application for Win32 and the Windows NT virtual-memory manager to increase the amount of memory committed to the program stack at run time."

I'm sure some readers have wondered about the above statement: "...a stack probe reaches benignly into memory...". Those compiler options (or rather, their descriptions) can be really irritating, at least until you look under the hood and see what's going on. If, for instance, a function needs 12 Kb of storage for its local variables, the memory on the stack would be "allocated" (more precisely: committed) this way:

sub    esp, 0x1000    ; "allocate" first 4 Kb
test  [esp], eax      ; touches memory in order to commit a
                      ; new page (if not already committed)
sub    esp, 0x1000    ; "allocate" second 4 Kb
test  [esp], eax      ; ...
sub    esp, 0x1000
test  [esp], eax

Note how the stack pointer is changed in 4 Kb steps now and, more importantly, how the bottom of the stack is "touched" (via test) after each step. This ensures the page containing the bottom of the stack is being committed, before "allocating" (committing) another page.
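
The page-by-page pattern can be mimicked with a few lines of portable code. This is only a simplified model of what the listing above does, not the compiler's actual probe routine: the function name probeOffsets is hypothetical, the 4 Kb page size is the default quoted from the documentation, and any remainder smaller than a page is ignored in this sketch.

```cpp
#include <cstddef>
#include <vector>

// Returns the offsets (in bytes below the original stack pointer) that a
// page-by-page stack probe would touch for 'localsSize' bytes of locals.
// Frames that fit into a single page are assumed to need no probe at all.
std::vector<std::size_t> probeOffsets(std::size_t localsSize,
                                      std::size_t pageSize = 4096)
{
    std::vector<std::size_t> touched;
    if (pageSize == 0 || localsSize <= pageSize) {
        return touched;               // small frame: plain "sub esp, N", no probe
    }
    for (std::size_t offset = pageSize; offset <= localsSize; offset += pageSize) {
        touched.push_back(offset);    // "sub esp, 0x1000" followed by "test [esp], eax"
    }
    return touched;
}
```

For the 12 Kb example this gives three touches, at 4096, 8192 and 12288 bytes below the original stack pointer, while the 256-byte Dummy function from the beginning of this appendix gets none.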

After reading ..

"Each new thread receives its own stack space, consisting of both committed and reserved memory. By default, each thread uses 1 Mb of reserved memory, and one page of committed memory. The system will commit one page block from the reserved stack memory as needed." (see MSDN CreateThread > dwStackSize > "Thread Stack Size").

.. it should also be clear why the documentation about /Gs says that you get with stack probes a carefully tuned interaction between your application and the Windows NT virtual-memory manager.

Now back to our ThreadFunc and 4 Kb limit:
Although you could prevent calls to the stack probe routine with /Gs, the documentation warns you about doing so. Further, the documentation says you can turn stack probes on or off by using the #pragma check_stack directive. However, it seems this pragma doesn't affect stack probes at all (either the documentation is buggy, or I am missing something). Anyway, recall that the CreateRemoteThread & WriteProcessMemory technique should be used only when injecting small pieces of code, so your local variables should rarely consume more than a few bytes and thus not get even close to the 4 Kb limit.

E) Why should I split up my switch block with more than three case statements?

Again, it is easiest to explain it with an example. Consider the following function:

int Dummy( int arg1 ) 
{
    int ret =0;

    switch( arg1 ) {
    case 1: ret = 1; break;
    case 2: ret = 2; break;
    case 3: ret = 3; break;
    case 4: ret = 0xA0B0; break;
    }
    return ret;
}

It would compile into something like this:

 Address   OpCode/Params    Decoded instruction
--------------------------------------------------
                                             ; arg1 -> ECX
:00401000  8B4C2404         mov ecx, dword ptr [esp+04]
:00401004  33C0             xor eax, eax     ; EAX = 0
:00401006  49               dec ecx          ; ECX --
:00401007  83F903           cmp ecx, 00000003
:0040100A  771E             ja 0040102A

; JMP to one of the addresses in table ***
; note that ECX contains the offset
:0040100C  FF248D2C104000   jmp dword ptr [4*ecx+0040102C]

:00401013  B801000000       mov eax, 00000001   ; case 1: eax = 1;
:00401018  C3               ret
:00401019  B802000000       mov eax, 00000002   ; case 2: eax = 2;
:0040101E  C3               ret
:0040101F  B803000000       mov eax, 00000003   ; case 3: eax = 3;
:00401024  C3               ret
:00401025  B8B0A00000       mov eax, 0000A0B0   ; case 4: eax = 0xA0B0;
:0040102A  C3               ret
:0040102B  90               nop

; Address table ***
:0040102C  13104000         DWORD 00401013   ; jump to case 1
:00401030  19104000         DWORD 00401019   ; jump to case 2
:00401034  1F104000         DWORD 0040101F   ; jump to case 3
:00401038  25104000         DWORD 00401025   ; jump to case 4

Note how the switch-case was implemented?
Rather than examining every single case statement separately, the compiler creates an address table. Then we jump to the right case by simply calculating the offset into the address table. If you think about it for a moment, this really is an improvement. Imagine you had a switch with 50 case statements. Without the above trick, you would have to execute 50 CMP and JMP instructions to get to the last case. With the address table, on the contrary, you can jump to any case with a single table look-up. In terms of time complexity, we replace an O(n) sequence of about 2n instructions with an O(1) sequence of about five instructions, where:

O denotes the worst-case time complexity.
We assume five instructions are necessary to calculate the offset, do the table look-up, and finally jump to the appropriate address.

Now, one might think the above was possible only because the case constants were carefully chosen to be consecutive (1,2,3,4). Fortunately, it turns out the same solution can be applied to most real-world examples; only the offset calculation becomes somewhat more complicated. There are two exceptions, though:

if there are three or fewer case statements or
if the case constants are completely unrelated to each other (e.g. "case 1", "case 13", "case 50", and "case 1000")

then the resulting code does it the long way by examining every single case constant separately, with the CMP and JMP instructions. In other words, then the resulting code is essentially the same as if you had an ordinary if-else if sequence.

Point of interest: If you ever wondered for what reason only a constant-expression can accompany a case statement, then you know why by now. In order to create the address table, this value obviously has to be known at compile time.

Now back to the problem!
Notice the JMP instruction at address 0040100C? Let's see what Intel's documentation says about the hex opcode FF:

Opcode    Instruction    Description
FF /4     JMP r/m32      Jump near, absolute indirect,
                         address given in r/m32

Oops, the JMP in question uses absolute addressing. In other words, one of its operands (0040102C in our case) represents an absolute address. Need I say more? The remote ThreadFunc would blindly assume that the address table for its switch is at 0040102C, JMP to a wrong place, and thus effectively crash the remote process.
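
One way to sidestep the address table is to spell the comparisons out by hand. The sketch below rewrites the Dummy function from this appendix as an if-else if chain; the name DummyIfElse is made up for this example. Treat it as an illustration rather than a guarantee: an optimizing compiler is free to turn such a chain back into a table, so it is still worth checking the disassembly of the code you actually inject.

```cpp
// Same behaviour as the Dummy() switch in this appendix, written as an
// explicit chain of compare-and-branch steps instead of a switch block.
int DummyIfElse(int arg1)
{
    int ret = 0;

    if      (arg1 == 1) ret = 1;
    else if (arg1 == 2) ret = 2;
    else if (arg1 == 3) ret = 3;
    else if (arg1 == 4) ret = 0xA0B0;

    return ret;  // any other value of arg1 leaves ret at 0
}
```

The price is the linear chain of comparisons described earlier, which hardly matters for the handful of cases a small injected function is likely to contain.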

F) Why does the remote process crash, anyway?

When your remote process crashes, it will always be for one of the following reasons:

You referenced a string inside of ThreadFunc that doesn't exist.
One or more instructions in ThreadFunc use absolute addressing (see Appendix E for an example).
ThreadFunc calls a function that doesn't exist (the call could be added by the compiler/linker). When you look at ThreadFunc in a disassembler in this case, you will see something like this:
```
:004014C0    push EBP         ; entry point of ThreadFunc
:004014C1    mov EBP, ESP
 ...
:004014C5    call 0041550     ; this will crash the
                              ; remote process
 ...
:00401502    ret
```
If the CALL in question was added by the compiler (because some "forbidden" switch, such as /GZ, was turned on), it will be located either somewhere at the beginning or near the end of ThreadFunc.

In any case, you can't be careful enough with the CreateRemoteThread & WriteProcessMemory technique. Especially watch for your compiler/linker options. They could easily add something to your ThreadFunc.
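
As a rough sanity check before injecting, one could scan the bytes of ThreadFunc for CALL instructions that leave the copied range. The sketch below is deliberately naive and is not a disassembler: it treats every E8 byte as the start of a "CALL rel32", so it can report false positives when E8 merely appears as data or as part of another instruction. The names SuspectCall and findOutOfRangeCalls, and the addresses in the usage note, are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct SuspectCall {
    std::uint32_t callAddress;    // address of the E8 opcode
    std::uint32_t targetAddress;  // where the call would land
};

// Naive scan over 'size' bytes of code that live at 'baseAddress'.
// Reports every "CALL rel32" candidate whose target lies outside the
// range [baseAddress, baseAddress + size), i.e. outside the copied code.
std::vector<SuspectCall> findOutOfRangeCalls(const std::uint8_t* code,
                                             std::size_t size,
                                             std::uint32_t baseAddress)
{
    std::vector<SuspectCall> suspects;
    if (code == nullptr) {
        return suspects;
    }
    for (std::size_t i = 0; i + 5 <= size; ++i) {
        if (code[i] != 0xE8) {
            continue;
        }
        // Little-endian 32-bit displacement following the opcode.
        const std::uint32_t displacement =
              static_cast<std::uint32_t>(code[i + 1])
            | static_cast<std::uint32_t>(code[i + 2]) << 8
            | static_cast<std::uint32_t>(code[i + 3]) << 16
            | static_cast<std::uint32_t>(code[i + 4]) << 24;

        const std::uint32_t callAddress = baseAddress + static_cast<std::uint32_t>(i);
        const std::uint32_t target = callAddress + 5u + displacement;

        // Unsigned subtraction also catches targets below baseAddress.
        const bool inside = static_cast<std::uint32_t>(target - baseAddress) < size;
        if (!inside) {
            suspects.push_back({callAddress, target});
        }
    }
    return suspects;
}
```

Fed with the bytes 55 8B EC 90 90 E8 86 00 00 00 C3 at base 004014C0, the scan reports one suspect: a call at 004014C5 to 00401550, which lies outside the eleven copied bytes. A call to the next instruction (E8 00 00 00 00) stays inside the range and is not reported.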

Appendices

References:

Load Your 32-bit DLL into Another Process's Address Space Using INJLIB by Jeffrey Richter. MSJ May, 1994
HOWTO: Subclass a Window in Windows 95; Microsoft Knowledge Base Article - 125680
Tutorial 24: Windows Hooks by Iczelion
CreateRemoteThread by Felix Kasza
API hooking revealed by Ivo Ivanov
Peering Inside the PE: A Tour of the Win32 Portable Executable File Format by Matt Pietrek, March 1994
Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference

Article History

July 25, 2003: Article published
August 19, 2003: Applied only some minor formatting changes

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Robert Kuster

Software Developer

Germany


Sometimes I dream that I could fly a jet fighter or that I could play guitar like the Gipsy Kings. While I then always realize that bombing people probably isn't so cool, I'm still not so sure about the Gipsy Kings and the guitar. Still, I think writing code is the next best thing that I could do so I will try to make the best out of it.

You can visit me at www.windbg.info or at www.rkuster.com. You are also welcome to check out my:

- WinDbg. From A to Z! - PDF Booklet (111 slides)
- Common WinDbg Commands (Thematically Grouped)



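Three Ways to Inject Code

Author: Robert Kuster
Translated and compiled by: VCKBASE

Original article: Three Ways to Inject Your Code into Another Process

Download source code

Introduction

This article discusses how to inject code into the address space of a different process and then execute it in the context of that process. Several window/password spy examples can be found on the web, and most of them rely on Windows hooks. This article discusses techniques other than Windows hooks that achieve the same thing. Figure 1 shows such a utility:

Figure 1: The WinSpy password spy

To find a solution, let us first briefly review the problem.
To "read" the contents of any control, whether or not it belongs to your application, you generally send it the WM_GETTEXT message. This also applies to edit controls, but if the edit control belongs to another process and has the ES_PASSWORD style set, the approach fails. Only the process that "owns" the password control can get its contents via WM_GETTEXT. So our problem becomes how to get the following executed in the address space of another process:

::SendMessage( hPwdEdit, WM_GETTEXT, nMaxChars, psBuffer );

In general, there are three possible ways to solve this problem:

Put your code into a DLL; then map the DLL into the remote process via Windows hooks.
Put your code into a DLL; then map the DLL into the remote process using the CreateRemoteThread and LoadLibrary technique.
Instead of writing a separate DLL, copy your code directly into the remote process via WriteProcessMemory and start its execution with CreateRemoteThread. Part III of this article describes this technique in detail.

Part I: Windows Hooks

Demo applications: HookSpy and HookInjEx

The primary role of Windows hooks is to monitor the message traffic of some thread. Hooks are usually divided into local hooks, remote hooks, and system-wide hooks. A local hook monitors the message traffic of a thread belonging to your own process; a remote hook is thread-specific and monitors the message traffic of a thread belonging to another process; a system-wide hook monitors the message traffic of all threads currently running on the system.
If the hooked thread belongs to another process, your hook procedure must reside in a dynamic-link library (DLL). The system then maps the DLL containing the hook procedure into the address space of the hooked thread. Windows maps the entire DLL, not just the hook procedure. That is why Windows hooks can be used to inject code into another process's address space.
I won't go into the details of hooks here (see the SetWindowsHookEx API in the MSDN Library), but I do want to give two useful tips that you won't find in the documentation:

After a successful call to SetWindowsHookEx, the system maps the DLL into the address space of the hooked thread automatically, but not necessarily immediately. Because Windows hooks are all about messages, the DLL isn't really mapped until an adequate event happens. For example:
If you install a hook that monitors nonqueued messages of some thread (WH_CALLWNDPROC), the DLL won't be mapped into the remote process until a message is actually sent to (some window of) the hooked thread. In other words, if UnhookWindowsHookEx is called before a message was sent to the hooked thread, the DLL will never be mapped into the remote process (even though the call to SetWindowsHookEx itself succeeded). To force an immediate mapping, send an appropriate event to the concerned thread right after the call to SetWindowsHookEx.
The same is true for unmapping the DLL after calling UnhookWindowsHookEx. The DLL isn't really unmapped until an adequate event happens.
When you install hooks, they can affect overall system performance (especially system-wide hooks). You can easily overcome this shortcoming if you use thread-specific hooks solely as a DLL mapping mechanism, and not to trap messages. Consider the following code: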
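```
BOOL APIENTRY DllMain( HANDLE hModule,
                       DWORD  ul_reason_for_call,
                       LPVOID lpReserved )
{
    if( ul_reason_for_call == DLL_PROCESS_ATTACH )
    {
        // Increase reference count via LoadLibrary
        char lib_name[MAX_PATH]; 
        ::GetModuleFileName( (HMODULE)hModule, lib_name, MAX_PATH );
        ::LoadLibrary( lib_name );

        // Safely remove hook
        ::UnhookWindowsHookEx( g_hHook );
    }    
    return TRUE;
}
```
So, what happens? First we map the DLL into the remote process through Windows hooks. Then, right after the DLL has actually been mapped, we unhook it. Normally the DLL would now be unmapped too, as soon as the first message arrives at the hooked thread. The trick is that we prevent this unmapping by increasing the DLL's reference count via LoadLibrary.
The remaining question is how to unload the DLL once we are finished. UnhookWindowsHookEx won't do it, because the thread is no longer hooked. You could do it this way:
- Install another hook just before you want to unmap the DLL.
- Send a "special" message to the remote thread.
- Catch this message in your hook procedure; in response, call FreeLibrary and UnhookWindowsHookEx.
Now hooks are used only while mapping the DLL into and unmapping it from the remote process. In the meantime they have no influence on the performance of the "hooked" thread.
Put another way, we get a DLL mapping mechanism that interferes with the target process no more than the LoadLibrary technique discussed below does. Unlike the LoadLibrary technique, however, the approach described in this part works on both WinNT and Win9x.
But when should you use this trick? Whenever the DLL has to stay in the remote process for a longer period of time (for example, if you subclass a control belonging to another process) and you want to interfere with the target process as little as possible. I didn't use it in HookSpy, because the DLL there is injected only for a moment, just long enough to get the password. I provided another example, HookInjEx, to demonstrate it. HookInjEx maps a DLL into "explorer.exe" (and unmaps it again), where it subclasses the Start button and swaps the left and right mouse clicks for it.

The sources of both HookSpy and HookInjEx can be found in the download package of this article.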

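Part II: The CreateRemoteThread and LoadLibrary Technique

Demo application: LibSpy

In general, any process can load a DLL dynamically by using the LoadLibrary API. But how do we force an external process to call this function? The answer is CreateRemoteThread.
Let's take a look at the declarations of the LoadLibrary and FreeLibrary APIs first:

HINSTANCE LoadLibrary(
    LPCTSTR lpLibFileName // address of the file name of the library module
);

BOOL FreeLibrary(
    HMODULE hLibModule // handle to the loaded library module
);

Now compare them with the declaration of ThreadProc, the thread routine passed to CreateRemoteThread:

DWORD WINAPI ThreadProc(
    LPVOID lpParameter // thread data
);

As you can see, all functions use the same calling convention, all accept a single 32-bit parameter, and their return values are the same size. That is, we can pass a pointer to LoadLibrary/FreeLibrary as the thread routine to CreateRemoteThread. However, there are two problems (see the description of CreateRemoteThread below):

The lpStartAddress parameter of CreateRemoteThread must represent the starting address of the thread routine in the remote process.
If lpParameter, the parameter passed to ThreadFunc, is interpreted as an ordinary 32-bit value (FreeLibrary interprets it as an HMODULE), everything is fine. However, if lpParameter is interpreted as a pointer (LoadLibraryA interprets it as a pointer to a string), it must point to data in the remote process.

The first problem actually solves itself. Both LoadLibrary and FreeLibrary are functions residing in kernel32.dll. Because kernel32 is guaranteed to be present, and at the same load address, in every "normal" process, the address of LoadLibrary/FreeLibrary is the same in every process too. This ensures that a valid pointer is passed to the remote process.
The second problem is also easy to solve: simply copy the DLL module name (needed by LoadLibrary) to the remote process via WriteProcessMemory.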

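So, to use the CreateRemoteThread and LoadLibrary technique, follow these steps:

1. Retrieve a HANDLE to the remote process (OpenProcess).
2. Allocate memory for the DLL name in the remote process (VirtualAllocEx).
3. Write the DLL name, including the full path, to the allocated memory (WriteProcessMemory).
4. Map your DLL into the remote process via CreateRemoteThread and LoadLibrary.
5. Wait until the remote thread terminates (WaitForSingleObject), that is, until the call to LoadLibrary returns. Put another way, the thread terminates as soon as DllMain (called with DLL_PROCESS_ATTACH) returns.
6. Retrieve the exit code of the remote thread (GetExitCodeThread). Note that this is the value returned by LoadLibrary, and thus the base address (HMODULE) of the mapped DLL.
7. Free the memory allocated in step 2 (VirtualFreeEx).
8. Unload the DLL from the remote process via CreateRemoteThread and FreeLibrary. Pass the HMODULE handle retrieved in step 6 to FreeLibrary (via the lpParameter parameter of CreateRemoteThread).
Note: If your injected DLL spawns any new threads, be sure they are all terminated before you unload the DLL.
9. Wait until the thread terminates (WaitForSingleObject).

Also, don't forget to close all handles once you are finished: the handles to both threads created in steps 4 and 8, and the handle to the remote process retrieved in step 1. Now let's examine some parts of LibSpy's source code. For simplicity, error handling and Unicode support are omitted.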

HANDLE hThread;
char    szLibPath[_MAX_PATH];  // name of "LibSpy.dll" (including the full path)
void*   pLibRemote;   // address in the remote process where szLibPath will be copied to
DWORD   hLibModule;   // base address of the loaded module (HMODULE)
HMODULE hKernel32 = ::GetModuleHandle("Kernel32");

// initialize szLibPath
//...
// 1. Allocate memory in the remote process for szLibPath
// 2. Write szLibPath to the allocated memory
pLibRemote = ::VirtualAllocEx( hProcess, NULL, sizeof(szLibPath),
                               MEM_COMMIT, PAGE_READWRITE );
::WriteProcessMemory( hProcess, pLibRemote, (void*)szLibPath,
                      sizeof(szLibPath), NULL );

// Load "LibSpy.dll" into the remote process
// (via CreateRemoteThread and LoadLibrary)
hThread = ::CreateRemoteThread( hProcess, NULL, 0,
            (LPTHREAD_START_ROUTINE) ::GetProcAddress( hKernel32,
                                       "LoadLibraryA" ),
             pLibRemote, 0, NULL );
::WaitForSingleObject( hThread, INFINITE );

// Get the handle of the loaded module
::GetExitCodeThread( hThread, &hLibModule );

// Clean up (with MEM_RELEASE the size argument must be 0)
::CloseHandle( hThread );
::VirtualFreeEx( hProcess, pLibRemote, 0, MEM_RELEASE );

Assume that the code we actually wanted to inject, the SendMessage call, was placed in DllMain (DLL_PROCESS_ATTACH), so it has already been executed by now. It is then time to unload the DLL from the target process:

// Unload "LibSpy.dll" from the target process
// (via CreateRemoteThread and FreeLibrary)
hThread = ::CreateRemoteThread( hProcess, NULL, 0,
            (LPTHREAD_START_ROUTINE) ::GetProcAddress( hKernel32,
                                       "FreeLibrary" ),
            (void*)hLibModule, 0, NULL );
::WaitForSingleObject( hThread, INFINITE );

// Clean up
::CloseHandle( hThread );

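Interprocess Communications

Until now, we have only talked about how to inject the DLL into the remote process. In most situations, however, the injected DLL needs to communicate with your original application in some way (recall that the DLL is now mapped into the address space of a remote process, not into that of the local application). Take our password spy: the DLL has to know the handle to the control that actually contains the password. Obviously, this value can't be hard-coded at compile time. Similarly, once the DLL has the password, it has to send it back to the original application so that it can be displayed.
Fortunately, there are many ways to deal with this: file mapping, WM_COPYDATA, the clipboard, and the very simple #pragma data_seg shared data section, to name a few. I won't describe these techniques here, because they are well documented in MSDN (see the "Interprocess Communications" section) and elsewhere. I did use #pragma data_seg in the LibSpy example, though; see the LibSpy source code for details.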

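Part III: The CreateRemoteThread and WriteProcessMemory Technique

Demo application: WinSpy

Another way to copy code into another process's address space and execute it in the context of that process is to use remote threads and the WriteProcessMemory API. Instead of writing a separate DLL, you copy the code into the remote process directly with WriteProcessMemory and start its execution with CreateRemoteThread. Let's look at the declaration of CreateRemoteThread first:

HANDLE CreateRemoteThread(
  HANDLE hProcess,        // handle to the process in which the new thread is created
  LPSECURITY_ATTRIBUTES lpThreadAttributes,  // pointer to security attributes
  DWORD dwStackSize,      // initial thread stack size, in bytes
  LPTHREAD_START_ROUTINE lpStartAddress,     // pointer to the thread function
  LPVOID lpParameter,     // argument for the new thread
  DWORD dwCreationFlags,  // creation flags
  LPDWORD lpThreadId      // pointer to the returned thread identifier
);

If you compare it with the declaration of CreateThread (MSDN), you will notice the following differences:

The hProcess parameter is additional in CreateRemoteThread. It is the handle to the process in which the new thread is created.
The lpStartAddress parameter of CreateRemoteThread represents the starting address of the thread in the remote process's address space. The thread function must exist in the remote process, so we can't simply pass a pointer to the local ThreadFunc. We have to copy the code to the remote process first.
Similarly, the data pointed to by lpParameter must exist in the remote process, so we have to copy it there too.

To summarize, we have to proceed as follows:

1. Retrieve a HANDLE to the remote process (OpenProcess).
2. Allocate memory in the remote process's address space for the injected data (VirtualAllocEx).
3. Write a copy of the initialized INJDATA structure to the allocated memory (WriteProcessMemory).
4. Allocate memory in the remote process's address space for the injected code.
5. Write a copy of ThreadFunc to the allocated memory.
6. Start the remote copy of ThreadFunc via CreateRemoteThread.
7. Wait until the remote thread terminates (WaitForSingleObject).
8. Retrieve the result from the remote process (ReadProcessMemory or GetExitCodeThread).
9. Free the memory allocated in steps 2 and 4 (VirtualFreeEx).
10. Close the handles retrieved in steps 6 and 1 (CloseHandle).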

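Rules that ThreadFunc has to obey:

1. ThreadFunc should not call any functions besides those in kernel32.dll and user32.dll; only kernel32.dll and user32.dll are guaranteed to be at the same load address in both the local and the target process (note that user32.dll isn't mapped into every Win32 process). If you need functions from other libraries, pass the addresses of LoadLibrary and GetProcAddress to the injected code and let it get the rest itself. If the DLL in question is already mapped into the target process, you can also use GetModuleHandle instead of LoadLibrary.
Similarly, if you want to call your own subroutines from within ThreadFunc, copy each routine to the remote process individually and supply their addresses to ThreadFunc via INJDATA.
2. Don't use static strings. Rather, pass all strings to ThreadFunc via INJDATA. Why? The compiler puts all static strings into the data section of an executable, and only references (pointers) to them remain in the code. The copy of ThreadFunc in the remote process would then point to something that doesn't exist in the remote process's address space.
3. Remove the /GZ compiler switch; it is set by default in debug builds.
4. Either declare ThreadFunc and AfterThreadFunc as static or disable incremental linking.
5. There must be less than a page (4 KB) of local variables in ThreadFunc.
Note that in debug builds about 10 bytes of these 4 KB are used for internal variables.

6. If you have a switch block with more than three case statements, split it up like this: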

switch( expression ) {
    case constant1: statement1; goto END;
    case constant2: statement2; goto END;
    case constant3: statement3; goto END;
}
switch( expression ) {
    case constant4: statement4; goto END;
    case constant5: statement5; goto END;
    case constant6: statement6; goto END;
}
END:

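or change it into an if-else if sequence (see Appendix E).

If you don't play by these rules, the target process will most likely crash, so keep them in mind. Don't assume that anything in the target process will be the same as in your local process (see Appendix F).

GetWindowTextRemote(A/W)

All the functionality you need to get the password from a "remote" edit control is encapsulated in GetWindowTextRemote(A/W):

int GetWindowTextRemoteA( HANDLE hProcess, HWND hWnd, LPSTR lpString );
int GetWindowTextRemoteW( HANDLE hProcess, HWND hWnd, LPWSTR lpString );

Parameters:
hProcess: handle to the process the edit control belongs to.
hWnd: handle to the edit control containing the password.
lpString: pointer to the buffer that receives the text.
Return value: the number of characters copied.

Let's examine some parts of its source code, especially the injected data and code, to see how GetWindowTextRemote works. Again, Unicode support is omitted for simplicity.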

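INJDATA
typedef LRESULT (WINAPI *SENDMESSAGE)(HWND,UINT,WPARAM,LPARAM);

typedef struct { 
    HWND hwnd;                  // handle to the edit control
    SENDMESSAGE fnSendMessage;  // pointer to SendMessageA in user32.dll

    char psText[128];           // buffer that receives the password
} INJDATA;

INJDATA is the data structure being injected into the remote process. Before the injection, however, the pointer to SendMessageA in the structure is initialized in our own application. Because user32.dll is always mapped to the same address in every process that uses it, the address of SendMessageA is the same too. This ensures that a valid pointer is passed to the remote process.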

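ThreadFunc

static DWORD WINAPI ThreadFunc (INJDATA *pData) 
{
	pData->fnSendMessage( pData->hwnd, WM_GETTEXT, // Get password
				sizeof(pData->psText),
				(LPARAM)pData->psText ); 
	return 0;
}

// This function marks the memory address after ThreadFunc.
// int cbCodeSize = (PBYTE) AfterThreadFunc - (PBYTE) ThreadFunc.
static void AfterThreadFunc (void)
{
}

ThreadFunc is the code executed by the remote thread.

Note: Notice how AfterThreadFunc is used to calculate the code size of ThreadFunc. In general this isn't the best idea, because the linker is free to change the order of your functions (that is, it could place ThreadFunc after AfterThreadFunc). However, you can be pretty sure that in small projects, such as WinSpy, the order of your functions will be preserved. If necessary, you can also use the /ORDER linker option to fix the function order, or determine the size of ThreadFunc with a disassembler.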

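How to Subclass a Remote Control with This Technique

Demo application: InjectEx

Let's look at something more complicated now: how to subclass a control belonging to another process.

First of all, you have to copy two functions to the remote process to accomplish this task:

ThreadFunc, which actually subclasses the control in the remote process via SetWindowLong.
NewProc, the new window procedure of the subclassed control.

The main problem here is how to pass data to the remote window procedure NewProc. Because NewProc is a callback function and thus has to conform to specific guidelines, we can't simply pass a pointer to INJDATA as an argument. Fortunately, I found two ways to solve this problem, although both rely on assembly language. So don't neglect assembly: it comes in handy when it matters!

Method 1:

[Figure: INJDATA placed immediately before NewProc in the remote process]

In the remote process, INJDATA is placed immediately before NewProc. This way NewProc knows at compile time the memory location of INJDATA in the remote process's address space. More precisely, it knows the address of INJDATA relative to its own location, and that is all the information we need. Here is the code of NewProc: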

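static LRESULT CALLBACK NewProc(
  HWND hwnd,       // handle to window
  UINT uMsg,       // message identifier
  WPARAM wParam,   // first message parameter
  LPARAM lParam )  // second message parameter
{
    INJDATA* pData = (INJDATA*) NewProc;  // pData points to NewProc
    pData--;              // now pData points to INJDATA;
                          // recall that INJDATA is placed immediately
                          // before NewProc in the remote process

    //-----------------------------
    // subclassing code goes here
    // ........
    //-----------------------------

    // call the original window procedure;
    // fnOldProc (returned by SetWindowLong) was initialized
    // by the (remote) ThreadFunc and stored in the (remote) INJDATA
    return pData->fnCallWindowProc( pData->fnOldProc, 
                                    hwnd,uMsg,wParam,lParam );
}

However, there is still a problem. Look at the first line:

INJDATA* pData = (INJDATA*) NewProc;

This way, pData gets a hard-coded value (the memory location of the original NewProc in our own process). That is not quite what we want: we need the memory location of the "current" copy of NewProc in the remote process, regardless of where it was moved to. In other words, we need some kind of "this pointer."
Although there is no way to solve this in C/C++, inline assembly can do it. Consider the modified NewProc: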

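static LRESULT CALLBACK NewProc(
  HWND hwnd,       // handle to window
  UINT uMsg,       // message identifier
  WPARAM wParam,   // first message parameter
  LPARAM lParam )  // second message parameter
{
    // Calculate the location of the INJDATA struct;
    // remember that INJDATA in the remote process
    // is placed right before NewProc
    INJDATA* pData;
    _asm {
        call    dummy
dummy:
        pop     ecx         // <- ECX contains the current EIP
        sub     ecx, 9      // <- ECX contains the address of NewProc
        mov     pData, ecx
    }
    pData--;

    //-----------------------------
    // subclassing code goes here
    // ........
    //-----------------------------

    // call the original window procedure
    return pData->fnCallWindowProc( pData->fnOldProc, 
                                    hwnd,uMsg,wParam,lParam );
}

So, what is going on? Every processor has a special register that points to the memory location of the next instruction to be executed. This is the so-called instruction pointer, denoted EIP on 32-bit Intel and AMD processors. Because EIP is a special-purpose register, you can't access it programmatically the way you can general-purpose registers (EAX, EBX, etc.). Put another way, there is no opcode with which you could address EIP and read or change its contents directly. However, EIP can still be changed implicitly (and is changed all the time) by instructions such as JMP, CALL, and RET. As an example, let's explain how the subroutine CALL/RET mechanism works on 32-bit Intel and AMD processors.
When you call a subroutine (via CALL), the address of the subroutine is loaded into EIP. But even before EIP is modified, its old value is automatically pushed onto the stack (to be used later as the return instruction pointer). At the end of the subroutine, the RET instruction automatically pops the top of the stack into EIP.
Now you know how EIP is modified via CALL and RET, but how do you get its current value? Remember that CALL pushes EIP onto the stack. So, to get its current value, call a "dummy function" and pop the top of the stack right afterwards. Let's explain the whole trick using the compiled NewProc: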

Address   OpCode/Params   Decoded instruction
--------------------------------------------------
:00401000  55              push ebp            ; entry point of
                                               ; NewProc
:00401001  8BEC            mov ebp, esp
:00401003  51              push ecx
:00401004  E800000000      call 00401009       ; *a*    call dummy
:00401009  59              pop ecx             ; *b*
:0040100A  83E909          sub ecx, 00000009   ; *c*
:0040100D  894DFC          mov [ebp-04], ecx   ; mov pData, ECX
:00401010  8B45FC          mov eax, [ebp-04]
:00401013  83E814          sub eax, 00000014   ; pData--;
.....
.....
:0040102D  8BE5            mov esp, ebp
:0040102F  5D              pop ebp
:00401030  C21000          ret 0010

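a. A dummy function call; it just jumps to the next instruction and pushes EIP onto the stack.
b. Pop the top of the stack into ECX. ECX then holds EIP, which is also exactly the address of the "pop ECX" instruction.
c. Note that the "distance" between the entry point of NewProc and the "pop ECX" instruction is 9 bytes; hence, to calculate the address of NewProc, subtract 9 from ECX.

This way, NewProc can always calculate its own address, no matter where it is moved to. Be aware, however, that the distance between the entry point of NewProc and the "pop ECX" instruction may change as you change your compiler/linker options, and may thus differ between release and debug builds too. The point is that you still know the exact value at compile time:

First, compile your function.
Determine the correct distance with a disassembler.
Finally, recompile with the correct distance.

This is the solution used in InjectEx. Similar to HookInjEx, InjectEx swaps the left and right mouse clicks for the Start button.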
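
The address arithmetic behind this trick is small enough to write down as a worked example. The helper below is only a sketch: the name locateInjData and its parameters are hypothetical, and it merely repeats on plain integers what the "sub ecx, 9" and "pData--" steps do in the listing above.

```cpp
#include <cstdint>

// poppedEip       : value popped into ECX right after "call dummy"
// distanceToEntry : bytes between the entry point of NewProc and "pop ecx"
// injDataSize     : sizeof(INJDATA), which sits directly before NewProc
std::uint32_t locateInjData(std::uint32_t poppedEip,
                            std::uint32_t distanceToEntry,
                            std::uint32_t injDataSize)
{
    const std::uint32_t newProcEntry = poppedEip - distanceToEntry;  // sub ecx, 9
    return newProcEntry - injDataSize;                                // pData--
}
```

With the values from the disassembly above (EIP popped as 00401009, a distance of 9 bytes, and an INJDATA size of 0x14 bytes), the result is 00400FEC. If the same code were copied to start at 008A0020, the popped value would be 008A0029 and the result 008A000C, which is the point of the trick: the calculation follows the code wherever it is moved.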

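Method 2:

Placing INJDATA right before NewProc in the remote process's address space isn't the only way to solve our problem. Consider the following variant of NewProc:

static LRESULT CALLBACK NewProc(
  HWND hwnd,      // handle to window
  UINT uMsg,      // message identifier
  WPARAM wParam,  // first message parameter
  LPARAM lParam ) // second message parameter
{
    INJDATA* pData = (INJDATA*) 0xA0B0C0D0;    // a dummy value

    //-----------------------------
    // subclassing code goes here
    // ........
    //-----------------------------

    // call the original window procedure
    return pData->fnCallWindowProc( pData->fnOldProc, 
                                    hwnd,uMsg,wParam,lParam );
}

Here 0xA0B0C0D0 is just a placeholder for the real (absolute) address of INJDATA in the remote process's address space. Recall that you can't know this address at compile time. However, you do know the location of INJDATA in the remote process right after the call to VirtualAllocEx (for INJDATA). Compiling our NewProc gives something like this: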

 Address   OpCode/Params     Decoded instruction
--------------------------------------------------
:00401000  55                push ebp
:00401001  8BEC              mov ebp, esp
:00401003  C745FCD0C0B0A0    mov [ebp-04], A0B0C0D0
:0040100A  ...
....
:0040102D  8BE5              mov esp, ebp
:0040102F  5D                pop ebp
:00401030  C21000            ret 0010

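Thus, its compiled code (in hex) would be:

558BECC745FCD0C0B0A0......8BE55DC21000.

Now you would proceed as follows:

Copy INJDATA, ThreadFunc and NewProc to the target process.
Change the code of NewProc so that pData holds the real address of INJDATA.
For example, suppose the address of INJDATA (the value returned by VirtualAllocEx) in the target process is 0x008a0000. You would then modify the code of NewProc as follows:
```
	558BECC745FCD0C0B0A0......8BE55DC21000 <- original NewProc (Note 1)
	558BECC745FC00008A00......8BE55DC21000 <- modified NewProc, using the real address of INJDATA
```
That is, you replace the dummy value A0B0C0D0 with the real address of INJDATA (Note 2).
Start execution of the remote ThreadFunc, which in turn subclasses the control in the remote process.

Note 1: You might wonder why the addresses A0B0C0D0 and 008a0000 appear in reverse byte order in the compiled code. That is because Intel and AMD processors use little-endian byte order for multi-byte data: the low-order byte of a number is stored at the lowest memory address, and the high-order byte at the highest.
Imagine the word "UNIX" stored in four bytes. On a big-endian system it is stored as "UNIX"; on a little-endian system it is stored as "XINU".
Note 2: Some (bad) cracks modify the code of an executable in a similar way. Once loaded into memory, however, a program cannot by default change its own code (the code resides in the ".text" section of the executable, which is write-protected). We can still modify our remote NewProc because it was copied into a block of memory allocated with PAGE_EXECUTE_READWRITE permission.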
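
The patching step can be sketched as follows. This is a minimal illustration, not the code from InjectEx: it works on a local byte buffer holding a copy of NewProc (which you would then write to the target with WriteProcessMemory), the function name patch_placeholder is hypothetical, and it assumes that the byte pattern D0 C0 B0 A0 occurs exactly once in the code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PLACEHOLDER 0xA0B0C0D0u  /* dummy value compiled into NewProc */

/* Read a 32-bit value stored least-significant byte first, as on x86. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Write a 32-bit value least-significant byte first. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Replace the first occurrence of PLACEHOLDER in a local copy of the code
   with real_addr. Returns the patched byte offset, or -1 if the placeholder
   was not found. (Hypothetical helper, for illustration only.) */
long patch_placeholder(uint8_t *code, size_t size, uint32_t real_addr)
{
    assert(code != NULL);
    for (size_t i = 0; i + 4 <= size; i++) {
        if (load_le32(code + i) == PLACEHOLDER) {
            store_le32(code + i, real_addr);
            return (long)i;
        }
    }
    return -1;
}
```

Applied to the bytes 55 8B EC C7 45 FC D0 C0 B0 A0 ... with a real address of 0x008A0000, the placeholder at offset 6 becomes 00 00 8A 00, which matches the modified NewProc shown above.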

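When to use the CreateRemoteThread & WriteProcessMemory technique

Compared with the other methods, the CreateRemoteThread & WriteProcessMemory technique is more flexible, because you do not need an additional DLL. Unfortunately, it is also more complicated and riskier: as soon as anything is wrong with your ThreadFunc, you can easily (and most probably will) crash the remote process (see Appendix F). Because debugging a remote ThreadFunc can also be a nightmare, use this technique only when injecting at most a few instructions. To inject a larger piece of code, use one of the methods discussed in Sections I and II.

WinSpy and InjectEx, together with their source code, are available from the download links at the beginning of the article.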

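Some Final Words

To conclude, let's summarize some facts we have not mentioned so far:

Solution	OS	Processes
I. Hooks	Win9x and WinNT	only processes that link with USER32.DLL (Note 3)
II. CreateRemoteThread & LoadLibrary	WinNT only (Note 4)	all processes (Note 5), including system services (Note 6)
III. CreateRemoteThread & WriteProcessMemory	WinNT only	all processes, including system services

Note 3: Obviously, you cannot hook a thread that has no message queue. In addition, SetWindowsHookEx does not work with system services, even if they link with USER32.DLL.
Note 4: Win9x has neither CreateRemoteThread nor VirtualAllocEx. (They can in fact be emulated on Win9x, but that is beyond the scope of this article.)
Note 5: All processes = all Win32 processes + csrss.exe.
Native applications (smss.exe, os2ss.exe, autochk.exe, etc.) do not use the Win32 API and therefore do not link with kernel32.dll. The only exception is csrss.exe, the Win32 subsystem itself: it is a native application, but some of its libraries (~winsrv.dll) require Win32 DLLs, including kernel32.dll.
Note 6: If you want to inject code into system services (lsass.exe, services.exe, winlogon.exe, etc.) or into csrss.exe, enable the "SeDebugPrivilege" privilege for your process (AdjustTokenPrivileges) before opening a handle to the remote process (OpenProcess).

Finally, one more thing to keep in mind: your injected code can easily destroy the target process, especially if the injected code itself is faulty. So remember: with power comes responsibility!
Because many examples in this article deal with passwords, you might also want to read the article "Super Password Spy++" by Zhefu Zhang. In it, he explains how to get the contents of an Internet Explorer password field, and he also shows how to protect your own password controls against such attacks.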

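Appendix A:

Why are kernel32.dll and user32.dll always mapped to the same address?

My presumption: because Microsoft's programmers considered it a useful speed optimization. Here is why. An executable is generally composed of several sections, one of which is ".reloc". When the linker creates an EXE or DLL file, it makes an assumption about where the file will be mapped into memory. This is the so-called preferred load (base) address. All absolute addresses in the image are based on this linker-assumed load address. If, for whatever reason, the image is not loaded at that address, the loader has to fix up every absolute address in the image, and this is where ".reloc" comes in: it lists all the places in the image where the difference between the linker-assumed load address and the actual load address must be factored in. (Note, however, that most instructions produced by the compiler use some form of relative addressing, so there are not as many relocations as you might think.) If, on the other hand, the loader is able to load the image at the linker's preferred base address, the ".reloc" section is ignored completely.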
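
What the loader does with ".reloc" can be pictured with a deliberately simplified model. The sketch below assumes a flat list of byte offsets, each marking a 32-bit absolute address inside the image; the real ".reloc" section groups its entries by page and tags each one with a type, which is omitted here. The name apply_fixups and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of base relocation: fixups[] holds byte offsets of
   32-bit absolute addresses stored inside image[]. Each of them was computed
   by the linker relative to preferred_base and must be shifted by the
   difference to actual_base. (Hypothetical helper, not the real PE format.) */
void apply_fixups(uint8_t *image, size_t image_size,
                  const uint32_t *fixups, size_t count,
                  uint32_t preferred_base, uint32_t actual_base)
{
    /* Unsigned arithmetic wraps, so a lower actual base works as well. */
    uint32_t delta = actual_base - preferred_base;

    if (delta == 0)
        return;  /* loaded at the preferred base: nothing to fix */

    for (size_t i = 0; i < count; i++) {
        uint32_t value;
        assert((size_t)fixups[i] + sizeof value <= image_size);
        memcpy(&value, image + fixups[i], sizeof value);
        value += delta;
        memcpy(image + fixups[i], &value, sizeof value);
    }
}
```

When the preferred and actual bases are equal, the function returns immediately, which mirrors the case where ".reloc" is ignored and explains the load-time saving discussed in this appendix.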
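But how do kernel32.dll and user32.dll and their load addresses fit into this story? Because every Win32 application needs kernel32.dll, and most of them need user32.dll as well, always mapping these two DLLs to their preferred bases improves the load time of all executables: the loader then never has to fix up any (absolute) addresses in kernel32.dll and user32.dll. Let's illustrate this with the following example.
Set the image base of some application App.exe to KERNEL32's preferred base (/base:"0x77e80000") or to USER32's preferred base (/base:"0x77e10000"). If App.exe does not import from USER32, load USER32 via LoadLibrary instead. Then compile App.exe and try to run it: an error box pops up ("Illegal System DLL Relocation") and App.exe fails to load.
Why? When creating a process, the loader on Win 2000, Win XP and Win 2003 checks whether kernel32.dll and user32.dll (their names are hardcoded into the loader) are mapped at their preferred bases; if not, it raises an error. On WinNT 4, ole32.dll was checked as well. On WinNT 3.51 and earlier, no such checks were made, so kernel32.dll and user32.dll could be loaded anywhere. The only module that is always at its base is ntdll.dll: the loader does not check it, but if ntdll.dll is not at its base, the process simply cannot be created.

To summarize, on WinNT 4 and higher:

DLLs that are always mapped to their bases: kernel32.dll, user32.dll and ntdll.dll.
DLLs present in every Win32 application (+ csrss.exe): kernel32.dll and ntdll.dll.
The only DLL present in every process, even in native applications: ntdll.dll.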

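Appendix B:

The /GZ compiler switch

In Debug builds, the /GZ compiler feature is turned on by default. You can use it to catch certain errors (see the documentation for details). But what does it mean for our executable?
When /GZ is turned on, the compiler adds some extra code to every function in the executable, including a function call (added at the very end of each function) that verifies the ESP stack pointer has not been changed by the function. But wait, a function call gets added to ThreadFunc? That is a recipe for disaster: the remote copy of ThreadFunc will now call a function that does not exist in the remote process (at least not at the same address).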

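Appendix C:

Static functions and incremental linking

Incremental linking is used to shorten link time while you build your application. The difference between normally and incrementally linked executables is that in an incrementally linked one, every function call goes through an extra JMP instruction emitted by the linker (functions declared static are an exception to this rule). These JMPs allow the linker to move functions around in memory without updating all the CALL instructions that reference them. But this same JMP also causes a problem: ThreadFunc and AfterThreadFunc now point to the JMP instructions instead of the real code. So, when you calculate the size of ThreadFunc like this:

const int cbCodeSize = ((LPBYTE) AfterThreadFunc - (LPBYTE) ThreadFunc);

you actually calculate the "distance" between the JMPs that point to ThreadFunc and AfterThreadFunc respectively (they usually sit right next to each other, but do not rely on the distance). Now suppose ThreadFunc is at address 004014C0 and its accompanying JMP instruction is at 00401020.

:00401020   jmp  004014C0
 ...
:004014C0   push EBP          ; real address of ThreadFunc
:004014C1   mov  EBP, ESP
 ...

Then

WriteProcessMemory( .., &ThreadFunc, cbCodeSize, ..);

copies the "JMP 004014C0" instruction (and all the instructions within cbCodeSize bytes after it) to the remote process, not the real ThreadFunc. The first thing the remote thread executes is therefore "JMP 004014C0". It will also be among its last instructions, both for the remote thread and for the remote process. As mentioned, though, this "rule" about JMP instructions has an exception: if a function is declared static, it is called directly, even when linked incrementally. That is why rule #4 says to declare ThreadFunc and AfterThreadFunc as static or to disable incremental linking. (Other aspects of incremental linking are covered in Matt Pietrek's article "Remove Fatty Deposits from Your Applications Using Our 32-bit Liposuction Tools".)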
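
To see what such a thunk looks like at the byte level, the following sketch decodes it. It assumes the thunk is the five-byte near relative jump (opcode E9 followed by a 32-bit little-endian displacement); resolve_jmp_thunk is a hypothetical helper for inspection only, and declaring the functions static or disabling incremental linking remains the fix described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* If the five bytes at `code` (located at virtual address `addr`) form a
   near relative jump (E9 rel32), return the jump target; otherwise return
   `addr` unchanged. The caller must pass at least five readable bytes.
   (Hypothetical helper, for illustration only.) */
uint32_t resolve_jmp_thunk(const uint8_t *code, uint32_t addr)
{
    assert(code != NULL);

    if (code[0] != 0xE9)
        return addr;  /* not a thunk: this is already the real code */

    uint32_t rel = (uint32_t)code[1] | ((uint32_t)code[2] << 8) |
                   ((uint32_t)code[3] << 16) | ((uint32_t)code[4] << 24);

    /* The displacement is relative to the instruction after the 5-byte jump;
       unsigned wrap-around handles backward jumps. */
    return addr + 5u + rel;
}
```

For the example above, a thunk at 00401020 with the bytes E9 9B 04 00 00 resolves to 004014C0, while bytes that start with the prologue 55 8B EC are reported as already being the real function.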

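Appendix D:

Why can my ThreadFunc have only 4 KB of local variables?

Local variables are always stored on the stack. If a function has, say, 256 bytes of local variables, the stack pointer is decreased by 256 when the function is entered (more precisely, in the function's prologue). For example, the following function: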

void Dummy(void) {
    BYTE var[256];
    var[0] = 0;
    var[1] = 1;
    var[255] = 255;
}

could compile into something like this:

:00401000   push ebp
:00401001   mov  ebp, esp
:00401003   sub  esp, 00000100           ; change ESP as storage for
                                         ; local variables is needed
:00401006   mov  byte ptr [esp], 00      ; var[0] = 0;
:0040100A   mov  byte ptr [esp+01], 01   ; var[1] = 1;
:0040100F   mov  byte ptr [esp+FF], FF   ; var[255] = 255;
:00401017   mov  esp, ebp                ; restore stack pointer
:00401019   pop  ebp
:0040101A   ret

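Note how the stack pointer (ESP) was changed in the example above. But what changes if a function needs more than 4 KB for its local variables? In that case the stack pointer is not changed directly; instead, another function is called (a stack probe), which in turn allocates the required memory. It is exactly this additional function call that "breaks" our ThreadFunc, because its remote copy would call something that is not there.
Let's see what the documentation says about stack probes and the /Gs compiler option:
"/Gs is an advanced feature with which you can control stack probes. A stack probe is a sequence of code that the compiler inserts into every function call. When activated, a stack probe reaches benignly into memory by the amount of space required to store the associated function's local variables.
If a function requires more stack space for its local variables than the configured size, its stack probe is activated. The default size is one page (4 KB on 80x86 processors). This value allows a carefully tuned interaction between an application for Win32 and the Windows NT virtual-memory manager to increase the amount of memory committed to the program stack at run time."
I'm sure some readers will wonder about the statement "... a stack probe reaches benignly into memory ...". Compiler options (or rather their descriptions) can be really irritating until you look under the hood and see what is going on. If, for example, a function needs 12 KB of storage for its local variables, the memory on the stack is "allocated" (more precisely, committed) like this:

sub    esp, 0x1000    ; "allocate" the first 4 KB
test  [esp], eax      ; touch memory to commit a new page (if not already committed)
sub    esp, 0x1000    ; "allocate" the second 4 KB
test  [esp], eax      ; ...
sub    esp, 0x1000
test  [esp], eax

Note how the stack pointer is now changed in 4 KB steps and, more importantly, how the bottom of the stack is "touched" (via test) after each step. This ensures that the page containing the bottom of the stack is committed before another page is "allocated" (committed).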
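
Why touch every page instead of moving ESP by 12 KB in one go? The toy model below illustrates the idea under a stated assumption: the committed part of the stack is followed by a single guard page, touching that guard page commits it and moves the guard one page down, and touching anything below it is a fault. All names (ToyStack, touch_page, alloc_with_probes) are made up for this sketch; it simulates the bookkeeping and does not touch real memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 0x1000u

/* Toy model of a thread stack: every address at or above committed_low is
   committed; the one page just below it is the guard page. */
typedef struct {
    uint32_t committed_low;
} ToyStack;

/* Touch an address. Touching the guard page commits it and moves the guard
   one page down. Touching anything lower counts as a fault (returns false). */
bool touch_page(ToyStack *s, uint32_t addr)
{
    assert(s != NULL && s->committed_low >= PAGE_BYTES);
    if (addr >= s->committed_low)
        return true;                      /* already committed */
    if (addr >= s->committed_low - PAGE_BYTES) {
        s->committed_low -= PAGE_BYTES;   /* guard page hit: commit it */
        return true;
    }
    return false;                         /* jumped past the guard page */
}

/* "Allocate" locals the way the probe sequence above does: lower ESP by at
   most one page at a time and touch the new bottom after every step. */
bool alloc_with_probes(ToyStack *s, uint32_t *esp, uint32_t bytes)
{
    assert(esp != NULL);
    while (bytes > 0) {
        uint32_t step = bytes < PAGE_BYTES ? bytes : PAGE_BYTES;
        *esp -= step;
        if (!touch_page(s, *esp))
            return false;
        bytes -= step;
    }
    return true;
}
```

In this model, allocating 0x3000 bytes through alloc_with_probes commits three further pages one after another, whereas a single touch 0x3000 bytes below ESP lands beyond the guard page and fails.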

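Notes
"Each new thread receives its own stack space, consisting of both committed and reserved memory. By default, each thread uses 1 MB of reserved memory and one page of committed memory. The system will commit one page block from the reserved stack memory as needed." (see MSDN CreateThread > dwStackSize > Thread Stack Size)
It should now also be clear why the /Gs documentation says that stack probes give a carefully tuned interaction between a Win32 application and the Windows NT virtual-memory manager.

Now back to our ThreadFunc and the 4 KB limit
Although you could prevent calls to the stack probe routine with /Gs, the documentation warns against doing so. The documentation also says that you can turn stack probes on or off with the #pragma check_stack directive. However, this pragma does not seem to affect stack probes at all (either the documentation is wrong, or I am missing something). In any case, remember that the CreateRemoteThread & WriteProcessMemory technique should be used only for injecting small pieces of code, so your local variables should rarely consume more than a few bytes and should never come close to the 4 KB limit.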

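Appendix E:

Why should I split up a switch block with more than three case statements?

Again, this is easiest to explain with an example. Consider the following function: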

int Dummy( int arg1 ) 
{
    int ret =0;

    switch( arg1 ) {
    case 1: ret = 1; break;
    case 2: ret = 2; break;
    case 3: ret = 3; break;
    case 4: ret = 0xA0B0; break;
    }
    return ret;
}

It would compile into something like this:

Address   OpCode/Params     Decoded instruction
--------------------------------------------------
                                             ; arg1 -> ECX
:00401000  8B4C2404         mov ecx, dword ptr [esp+04]
:00401004  33C0             xor eax, eax     ; EAX = 0
:00401006  49               dec ecx          ; ECX --
:00401007  83F903           cmp ecx, 00000003
:0040100A  771E             ja 0040102A

; JMP to one of the addresses in table ***
; note that ECX contains the offset
:0040100C  FF248D2C104000   jmp dword ptr [4*ecx+0040102C]

:00401013  B801000000       mov eax, 00000001   ; case 1: eax = 1;
:00401018  C3               ret
:00401019  B802000000       mov eax, 00000002   ; case 2: eax = 2;
:0040101E  C3               ret
:0040101F  B803000000       mov eax, 00000003   ; case 3: eax = 3;
:00401024  C3               ret
:00401025  B8B0A00000       mov eax, 0000A0B0   ; case 4: eax = 0xA0B0;
:0040102A  C3               ret
:0040102B  90               nop

; address table ***
:0040102C  13104000         DWORD 00401013   ; jump to case 1
:00401030  19104000         DWORD 00401019   ; jump to case 2
:00401034  1F104000         DWORD 0040101F   ; jump to case 3
:00401038  25104000         DWORD 00401025   ; jump to case 4

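Notice how the switch statement was implemented?

Rather than examining each case separately, the compiler creates an address table and jumps to the right case simply by calculating an offset into that table. This really is an improvement. Imagine a switch with 50 case statements: without this trick, you would have to execute 50 CMP and JMP instructions to reach the last case. With the address table, you can jump to any case with a single table look-up. In terms of algorithms and time complexity, we replace an O(2n) algorithm with an O(5) one (that is, linear time with constant time), where:

O denotes the worst-case time complexity;
we assume that five instructions are needed to calculate the offset, do the table look-up, and finally jump to the appropriate address.

Now, you might think this was possible only because the case constants were deliberately chosen to be consecutive (1, 2, 3, 4). Fortunately, the same approach works for most real-world examples; only the offset calculation becomes somewhat more complicated. There are two exceptions, though:

if there are three or fewer case statements;
if the case constants are completely unrelated to each other (e.g. "case 1", "case 13", "case 50", and "case 1000").

In these cases the resulting code takes the long way and examines each case constant separately with CMP and JMP instructions; in other words, it is the same as the code for an ordinary if-else if sequence.
Point of interest: if you ever wondered why a case label must be a constant expression, now you know. To build the address table, the value obviously has to be known at compile time.

Now back to the problem!
Notice the JMP instruction at address 0040100C? Let's see what Intel's documentation says about the hex opcode FF:

Opcode  Instruction  Description
FF /4   JMP r/m32    Jump near, absolute indirect,
                     address given in r/m32

So this JMP uses absolute addressing: one of its operands (0040102C in our case) is an absolute address. Need I say more? The remote ThreadFunc would blindly assume that the address table for its switch is at 0040102C, JMP to a wrong place, and thus crash the remote process.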
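
One way to apply this advice is sketched below: the Dummy function from above rewritten once as two switch blocks with at most three cases each, and once as a plain if-else if chain. The names DummySplit, DummyIfElse and CheckRewrites are hypothetical, and whether a given compiler really avoids the address table for these forms is an assumption you should confirm in a disassembler for your own build.

```c
#include <assert.h>

/* Dummy() split into switch blocks of at most three cases each, so that the
   compiler has less reason to build an absolute address table. */
int DummySplit(int arg1)
{
    int ret = 0;

    switch (arg1) {
    case 1: ret = 1; break;
    case 2: ret = 2; break;
    case 3: ret = 3; break;
    }

    switch (arg1) {
    case 4: ret = 0xA0B0; break;
    }

    return ret;
}

/* The same logic as an ordinary if-else if sequence (CMP/JMP style). */
int DummyIfElse(int arg1)
{
    if (arg1 == 1)
        return 1;
    else if (arg1 == 2)
        return 2;
    else if (arg1 == 3)
        return 3;
    else if (arg1 == 4)
        return 0xA0B0;
    return 0;
}

/* Debug-build sanity check that both rewrites agree for a given input. */
void CheckRewrites(int arg1)
{
    assert(DummySplit(arg1) == DummyIfElse(arg1));
}
```

Both rewrites return the same values as the original Dummy for every input; only the generated code differs, which is the part that matters once the function is copied to another address space.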

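Appendix F:

Why does the remote process crash anyway?

When your remote process crashes, it will always be for one of the following reasons:

You referenced a string inside ThreadFunc that does not exist in the remote process.
One or more instructions in ThreadFunc use absolute addressing (see Appendix E for an example).
ThreadFunc calls a function that does not exist in the remote process (the call may have been added by the compiler or linker). In a disassembler, you will see something like this:
```
:004014C0    push EBP         ; entry point of ThreadFunc
:004014C1    mov EBP, ESP
 ...
:004014C5    call 0041550     ; this will crash the remote process
 ...
:00401502    ret
```
If the CALL in question was added by the compiler (because some "forbidden" switch such as /GZ was turned on), it will be located either near the beginning or near the end of ThreadFunc.

In any case, you cannot be too careful with the CreateRemoteThread & WriteProcessMemory technique. Watch your compiler/linker options in particular: they can easily add something to your ThreadFunc.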


Posted on 2012-06-06 16:51 by Richard.FreeBSD
