
An Introduction to ARM Assembly Language

2010-06-01 15:26  王克伟

Author: Jason Fuller

Who is this document for?

This document is intended for anyone who occasionally needs to debug compiled ARM code at the assembly language level. 

 

Why would I want to do that?

Because retail builds have compiler optimizations turned on, and compiler optimizations confuse the source-level debugger.  For example, the values it displays for your local variables are often wrong because the real values are kept in registers, not on the stack where the debugger knows how to find them.

 

Registers

The ARM CPU has 16 registers (r0 through r15; the last three are usually called sp, lr and pc):

r0 through r3 are used as general purpose registers, but they are also used to pass the first four parameters into a function.  Their values are not guaranteed to be preserved across function calls. 

r0 is also used to return the return value from a function.

r4 through r11 are general purpose registers.  It is the responsibility of a called function to preserve these values, i.e., to ensure that the values of these registers are the same when the function exits as when it was entered.  (r12 is a scratch register that is not preserved across calls, and r13 is the stack pointer, described next.)

sp is the Stack Pointer.

lr is the Link Register, which stores the return address when a function is called.  But note that lr can also be used as a general-purpose register when it is not being used as the link register, so be careful.

pc is the Program Counter, i.e., the instruction pointer.

 

Condition Flags

There are four condition flags that are set as the result of executing instructions:

  N  Negative   Set if result is negative

  Z  Zero             Set if result is zero

  C  Carry          Set if a carry occurs, or a bit is shifted off the end by a shift instruction

  V  Overflow  Set if signed overflow occurs, i.e., the signed result does not fit in 32 bits
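
To make the flag definitions a bit more concrete, here is a minimal C sketch that computes all four flags for a 32-bit subtraction, which is what the cmp instruction (covered below) does before throwing the result away.  The names arm_flags and flags_for_sub are made up for this illustration; nothing here comes from a real library.

```c
#include <stdint.h>

/* Hypothetical container for the four condition flags (each 0 or 1). */
typedef struct {
    int n, z, c, v;
} arm_flags;

/* Flags produced by "cmp a, b", i.e. by computing a - b and discarding the result. */
static arm_flags flags_for_sub(uint32_t a, uint32_t b)
{
    uint32_t result = a - b;
    arm_flags f;
    f.n = (int)((result >> 31) & 1);   /* top bit set: negative when viewed as signed */
    f.z = (result == 0);
    f.c = (a >= b);                    /* after a subtract, C set means "no borrow" */
    /* Signed overflow: a and b have different signs, and the result's sign differs from a's. */
    f.v = (int)((((a ^ b) & (a ^ result)) >> 31) & 1);
    return f;
}
```

For example, comparing 5 with 5 sets Z and C, comparing 3 with 5 sets only N, and comparing 0x80000000 with 1 sets C and V.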

 

Instructions

A full list of ARM instructions can be found here, but it’s not particularly readable or educational, so I’ll go over the most common instructions here. 

Throughout this document, I’ll use C code in the right column to explain what the assembly instruction in the left column does.  I use C syntax simply because everyone is familiar with it.

Instruction                               C language equivalent

-------------                                ---------------------------

mov   rx, ry                              rx = ry  // Move register ry into rx

mov   rx, ry, lsl #5                rx = ry << 5 // Logical Shift Left

mov   rx, ry, lsr #6                rx = ry >> 6 // Logical Shift Right (unsigned)

mov   rx, #0x12                     rx = 0x12  // Move 0x12 into rx

mov   rx, #0x21, 28             rx = 0x21 rotated right 28 bits (see below)

str   rx, [ry, #0x12]              DWORD *ry; ry[0x12/4] = rx  // Store rx into memory at ry + 0x12

str   rx, [ry, rz]                        DWORD *ry; ry[rz/4] = rx  // Store rx into memory at ry + rz

ldr   rx, [ry], -rz                     rx = *ry;  ry -= rz  // Post-indexed: load from ry, then subtract rz from ry

cmp   rx, ry                               Compare rx to ry, and set the Condition flags accordingly, for example, Z = (rx == ry)

cmp  rx, #0x12                       Compare rx to 0x12. 

add   rx, ry, #0x12                 rx = ry + 0x12

sub   rx, ry, #0x12                 rx = ry - 0x12

mul   rx, ry, rz                       rx = ry * rz  // mul takes register operands only, not immediates

orr   rx, ry, #2                          rx = ry | 2

and   rx, ry, #2                         rx = ry & 2

bic   rx, ry, #5                          rx = ry & ~5 // Bit Clear

bx    rx                                         rx(); // Jump to the address in rx

A little explanation is in order about instructions that use “shifter operands” such as mov rx, #0x21, 28.  It may seem like an odd instruction but it is actually quite common, because it lets the compiler encode many 32-bit constants in just 12 bits of an instruction: 8 bits for the constant (0x21) and 4 bits for the rotation amount (28), which can be any even number from 0 to 30.  Without this trick of stuffing the constant into the 32-bit instruction, the compiler would have to load the constant from memory, which is much more expensive.

By the way, the example we’ve been using:

mov rx, #0x21, 28

rx = 0x21 rotated right 28 bits

is essentially the same as :

rx = 0x21 << (32 - 28)

or

rx = 0x21 << 4
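
If you want to decode these operands without doing the rotation in your head, a small helper like the following works.  This is just a sketch; ror32 and decode_rotated_imm are names I made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits. */
static uint32_t ror32(uint32_t value, unsigned n)
{
    n &= 31;
    return n == 0 ? value : (value >> n) | (value << (32 - n));
}

/* Decode the "#imm8, rot" operand form: an 8-bit constant rotated right by an even amount. */
static uint32_t decode_rotated_imm(uint32_t imm8, unsigned rot)
{
    assert(imm8 <= 0xFF);               /* only 8 bits are available for the constant */
    assert(rot <= 30 && rot % 2 == 0);  /* only even rotations can be encoded */
    return ror32(imm8, rot);
}
```

With this, decode_rotated_imm(0x21, 28) gives 0x210, which matches 0x21 << 4.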

 

PC-relative addressing

Even though the use of shifter operands can sometimes allow the compiler to fit a constant into an instruction, sometimes the constant simply won’t fit and must be loaded from memory.  Where does the compiler put these constants?  Right in the instruction stream, between where one function ends and the next one starts.  This allows the compiler to load a constant using “pc-relative addressing”, that is, using the program counter as if it were a pointer to data.  For example:

ldr r1, [pc, #0x1C]                     DWORD *pc;  r1 = pc[0x1c / 4];

The one catch is that the value of pc used in the pointer arithmetic is not the address of the instruction itself.  It is the address of the instruction plus 8. (This is just an artifact of the way the chip works.  By the time the instruction actually executes, the pc has already been incremented.)  So, for example:

Address       Instruction        Disassembly

-------              -----------            -----------

01F05640    e59f101c            ldr                  r1, [pc, #0x1C]

01F05664    12345678            ??? // The data is at address 1F05640 + 8 + 1C

By the way, this is why when you’re looking at a disassembly window in Platform Builder, some instructions will look weird, or show as ???.  It’s because they’re not really instructions, they’re data.
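
The address arithmetic is easy to get wrong by 8, so here is the calculation written out as a tiny C function.  The function name literal_address is hypothetical, and the sketch assumes ordinary 32-bit ARM instructions with a positive offset.

```c
#include <assert.h>
#include <stdint.h>

/* Address read by "ldr rX, [pc, #offset]": pc reads as the
   instruction's own address plus 8. */
static uint32_t literal_address(uint32_t instruction_address, uint32_t offset)
{
    assert(instruction_address % 4 == 0);  /* ARM instructions are word aligned */
    assert(offset < 4096);                 /* the immediate offset field is 12 bits wide */
    return instruction_address + 8 + offset;
}
```

For the example above, literal_address(0x01F05640, 0x1C) returns 0x01F05664, which is where the data word 12345678 lives.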

 

Suffixes

B and H

By default, instructions operate on 32-bit words.  (Note that this definition of a word is different from the Win32 concept of a WORD, which is 16 bits.)  However, an instruction that has the H suffix operates on halfwords (16 bits), and an instruction with the B suffix operates on bytes.  For example:

strb       r1, [r8, #0x28]                ((BYTE*)r8)[0x28] = (BYTE)r1  // Store the low byte of r1
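
In C terms, the B and H suffixes simply narrow the value before it is written.  The sketch below models that; store_byte and store_halfword are hypothetical names, and the halfword version uses memcpy only so that the C code itself is safe at any address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* strb r1, [base, #offset]: only the low 8 bits of r1 are written. */
static void store_byte(uint8_t *base, uint32_t offset, uint32_t r1)
{
    assert(base != NULL);
    base[offset] = (uint8_t)r1;
}

/* strh r1, [base, #offset]: only the low 16 bits of r1 are written. */
static void store_halfword(uint8_t *base, uint32_t offset, uint32_t r1)
{
    uint16_t half = (uint16_t)r1;
    assert(base != NULL);
    memcpy(base + offset, &half, sizeof half);
}
```

Storing 0x12345678 with store_byte writes just 0x78, and the neighboring bytes are left alone.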

S

Some instructions take an optional S suffix, which means “update the condition flags based on the result of this instruction”.

Conditional suffixes

All ARM instructions are conditional, i.e., they can all be modified by a suffix indicating under what conditions the instructions should be executed.  Here are the most common condition suffixes:

Suffix    Meaning                                          Condition under which instruction is to execute

-------                                                       --------------------------------------------------------

eq        equal                                                             Z == 1

ne        not equal                                                     Z == 0

hi         unsigned higher                                       C == 1 && Z == 0

ls          unsigned lower or same                       C==0 || Z==1

ge        signed greater or equal                         N==V

lt          signed less than                                        N != V

gt         signed greater than                                Z==0 && N==V

le         signed less than or equal                      Z==1 || N != V

Don’t worry too much about the third column.  The suffixes work the way you would expect them to.  For example:

cmp             r1, #0x5

addeq         r2, r1, r3                 if (r1 == 5) r2 = r1 + r3;

movne       r3, #0x18               if (r1 != 5) r3 = 0x18;

movgt        r3, #0x18               if (r1 > 5)  r3 = 0x18;

bleq             Function                if (r1 == 5) Function();

If the condition is not true, then the instruction does nothing.
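
If you ever do need to work out by hand whether a conditional instruction will execute, the flag tests can be written down directly in C.  This is a sketch covering only the suffixes listed above; the function name condition_passes is made up for the example.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if an instruction with the given suffix would execute, 0 if it
   would be skipped, and -1 for a suffix this sketch does not know.
   n, z, c and v are the current flag values (each 0 or 1). */
static int condition_passes(const char *suffix, int n, int z, int c, int v)
{
    assert(suffix != NULL);
    if (strcmp(suffix, "eq") == 0) return z == 1;
    if (strcmp(suffix, "ne") == 0) return z == 0;
    if (strcmp(suffix, "hi") == 0) return c == 1 && z == 0;
    if (strcmp(suffix, "ls") == 0) return c == 0 || z == 1;
    if (strcmp(suffix, "ge") == 0) return n == v;
    if (strcmp(suffix, "lt") == 0) return n != v;
    if (strcmp(suffix, "gt") == 0) return z == 0 && n == v;
    if (strcmp(suffix, "le") == 0) return z == 1 || n != v;
    return -1;
}
```

For instance, after cmp r1, #0x5 with r1 equal to 5, the flags are N=0, Z=1, C=1, V=0, so the eq, ge and le forms execute and the ne and gt forms do nothing.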

Conditional instructions allow the compiler to compile simple “if then else” statements without using any “jump” instructions.  For example:

if (r1 == 5)

            r2 = 6;

else

            r3 = 7;

would become :

cmp r1, #5

moveq r2, #6

movne r3, #7

whereas the x86 compiler (generally) would have to generate two “jump” instructions: one to jump over the “then” clause if the condition was not met,  and one to jump over the “else” clause if it was.

 

Data alignment

The ARM CPU can only access DWORDs in memory that are aligned on addresses that are divisible by 4.  Likewise, it can only access 16-bit values on addresses divisible by 2.  An unaligned access will result in a Datatype Misalignment exception.
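
When you are staring at a faulting address in the debugger, the check is simple enough to do mentally, but here it is as C for completeness.  The helper name is_aligned is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if an access of `size` bytes at `address` is naturally aligned. */
static int is_aligned(uint32_t address, uint32_t size)
{
    assert(size == 1 || size == 2 || size == 4);
    return (address % size) == 0;
}
```

So a 4-byte load from 0x1002 would fault, while a 2-byte load from the same address would be fine.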

 

Function Calls

One of the most important mechanisms to understand is how a function call happens.

Step 1 - Parameters:

The caller sets up the parameters.   r0 through r3 are used to transfer the first four parameters of a function.  If there are more than four, the rest are pushed on the stack.  (The rightmost parameter is pushed first).
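
The following sketch shows where to look for a given parameter at the moment the call is made.  It assumes every parameter is a single 32-bit word (larger types complicate things), and param_location is a name invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Where the caller places parameter `index` (0-based, counted left to right),
   assuming every parameter is one 32-bit word.
   Returns the register number 0..3 for the first four parameters.
   Otherwise returns -1 and sets *stack_offset to the byte offset from sp
   at the time of the call. */
static int param_location(int index, int *stack_offset)
{
    assert(index >= 0);
    assert(stack_offset != NULL);
    if (index < 4) {
        *stack_offset = -1;            /* not on the stack */
        return index;
    }
    /* The rightmost parameter is pushed first, so the fifth parameter
       ends up at the lowest address, i.e. at sp + 0. */
    *stack_offset = (index - 4) * 4;
    return -1;
}
```

For example, the fifth parameter (index 4) is at sp + 0 and the seventh (index 6) is at sp + 8.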

Step 2 - Call:

The function is called by executing the Branch and Link instruction:

bl    MyFunction                    lr = address of the next instruction;  MyFunction();

Note how this instruction sets lr to be the address to return to when the function is done.

For C++ virtual method calls, you’ll see the bx instruction instead of bl.  bx jumps to the address in a register.  For example, if r0 is your C++ “this” pointer:

ldr   r2, [r0]          r2 = &vtable

ldr   r3, [r2, #0xC]    r3 = vtable[3] // == fourth method

mov   lr, pc            Manually set up lr since we’re not using bl

bx    r3                Jump to r3, i.e., call the fourth method
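
Here is roughly what that four-instruction sequence is doing, written as plain C with an explicit table of function pointers.  The object layout and all of the names (object, method_fn, call_fourth_method and so on) are assumptions made for illustration; a real C++ compiler generates the table for you.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*method_fn)(void *self);

/* Hypothetical object layout: the first word points to the method table. */
typedef struct {
    const method_fn *vtable;   /* this is what "ldr r2, [r0]" loads */
    int value;
} object;

static int method0(void *self) { (void)self; return 0; }
static int method1(void *self) { (void)self; return 1; }
static int method2(void *self) { (void)self; return 2; }
static int method3(void *self) { return ((object *)self)->value; }

static const method_fn object_vtable[4] = { method0, method1, method2, method3 };

/* Fetch the table, fetch entry 3, and call it with "this" as the first argument. */
static int call_fourth_method(object *obj)
{
    assert(obj != NULL && obj->vtable != NULL);
    const method_fn *vtable = obj->vtable;   /* ldr r2, [r0]                      */
    method_fn target = vtable[3];            /* ldr r3, [r2, #0xC] (4-byte slots) */
    return target(obj);                      /* mov lr, pc  then  bx r3           */
}
```

The byte offset 0xC in the disassembly corresponds to index 3 here because each table entry is 4 bytes on 32-bit ARM.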

Step 3 – Preserve registers:

The first instruction in a function usually looks something like this:

stmdb       sp!, {r4 - r6, lr}      push(lr); push(r6);

                                                         push(r5); push(r4);

This is the “Store Multiple Decrement Before” instruction, which is my favorite CPU instruction of all time.  It pushes an entire specified set of registers onto the stack in one instruction.  This serves two purposes.  First, it helps the called function fulfill its obligation to preserve registers r4 through r11.  Whatever subset of these registers the function is going to later use (i.e. trash), it can push onto the stack.  Second, stmdb safely stores away the return address, lr.

Note that the order in which it pushes the registers may be the opposite of what you expect.  The nice thing about this, though, is that if you are looking at a memory dump of the stack, the registers will be in the same order in memory as they are listed in the code.
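
To see why the memory order matches the order in the listing, here is a small C simulation of the instruction.  It is only a sketch: the function name stmdb, the word-indexed stack array and the register bitmask are conventions I chose for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Simulate "stmdb sp!, {list}".
   reg_mask has bit i set if register i is in the list (bit 14 is lr).
   regs[] holds the register values, stack[] is memory viewed as words,
   and *sp is a word index into stack[]. */
static void stmdb(uint32_t *stack, uint32_t *sp, const uint32_t regs[16], uint32_t reg_mask)
{
    /* Walk from the highest-numbered register down, so that the
       lowest-numbered register ends up at the lowest address. */
    for (int r = 15; r >= 0; r--) {
        if (reg_mask & (1u << r)) {
            assert(*sp > 0);          /* make sure there is room left */
            *sp -= 1;                 /* decrement before ...         */
            stack[*sp] = regs[r];     /* ... then store               */
        }
    }
}
```

After simulating stmdb sp!, {r4 - r6, lr}, reading memory upward from the new sp gives r4, r5, r6, lr, in that order.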

Step 4 - Locals:

Next, the callee decrements the stack pointer in order to reserve space on the stack for its local variables:

sub         sp, sp, #0xC            sp -= 12;

Note that the size it reserves may not be what you expect from looking at the local variables in the C code.  This is because the optimizing compiler may not need to store a local on the stack at all; it may be able to get away with using registers.

Step 5 – Body

Next, the body of the function is executed.  Somewhere along the way, (the optimizing compiler is free to decide where) the compiler sets r0 to the return value of the function.

Step 6 – Return

If space was allocated on the stack for locals, it is released:

add         sp, sp, #0xC

Then the registers that we saved away at the beginning of the function are restored using the “Load Multiple Increment After” instruction:

ldmia       sp!, {r4 - r6, lr}      pop(r4); pop(r5); pop(r6); pop(lr);

And finally we jump to the return address:

bx          lr               

Note, if you are debugging a pre-WM5 app, a different code sequence will be used:

ldmia       sp!, {r4 - r6, pc}      pop(r4); pop(r5); pop(r6); pop(pc);

Note that this is the same list of registers as used in the stmdb instruction at the beginning of the function, except that now lr has been replaced by pc.  So the value of lr when the function was entered, i.e., the return address, is now loaded into pc, the program counter.  This has the effect of jumping to the return address.  In other words, returning from the function.

 

The Frame Pointer

The Frame Pointer is not a register; it’s just a concept.   The Frame Pointer is the value of the stack pointer while the body of a function is executing.  In other words, it’s a pointer to a stack frame.  What’s in a stack frame?  Well, from steps 3 and 4 above, a stack frame contains local variables, followed by the registers that the function needed to preserve.

Every function call in a call stack has a frame pointer, and Platform Builder will show you the frame pointers if you right-click in the Call Stack window and check “Frame Pointer”.

 

Finding local variables

Now that you understand the basics of ARM assembly, you can use them to do some cool debugging tricks.  For example, the debugger often can’t give you the value of local variables in a retail build, because the debugger assumes locals are on the stack, and the optimizing compiler often keeps them in registers instead.  But, by looking at the disassembly window in Platform Builder, and figuring out what the disassembly is doing and how it relates to the C source code, you can find where the compiler is storing your local variables.

Understanding the optimized code the compiler generates can be tricky, and there’s no cookbook to finding your locals, but here’s one tip:

Before a function is called, the parameters have to be loaded into r0, r1, r2, and r3.  So it’s fairly easy to look at what the compiler is loading into these registers and to match them up to the C code.  For example:

RECT rc;

GetWindowRect(hWnd, &rc);

becomes:

add         r1, sp, #0x20     // This tells us that rc is

// located at sp + 0x20

mov         r0, r4            // This tells us that hWnd was in r4

// before this snippet of code

bl          GetWindowRect

 

Finding local variables in other stack frames

Now let’s tackle a harder problem.   Suppose you want to know the value of a local that lives in a function that is not at the top of the call stack?  For example, suppose your call stack looks like this:

            Generic.exe!MyRegisterClass

            Generic.exe!InitInstance        

            Generic.exe!WinMain

            Generic.exe!WinMainCRTStartup    

And you want to know what the value of hInstance was in WinMain.  The first step is to do what we did before:  read the disassembly and figure out where the value was before InitInstance was called:

int WINAPI WinMain(HINSTANCE hInstance,

                   HINSTANCE hPrevInstance,

                   LPTSTR    lpCmdLine,

                   int       nCmdShow)

{

00011960  stmdb       sp!, {r4, lr}

00011964  sub         sp, sp, #0x1C

00011968  mov         r4, r0

      MSG msg;

      // Perform application initialization:

      if (!InitInstance(hInstance, nCmdShow))

0001196C  mov         r1, r3

00011970  bl          |InitInstance ( 117f8h )|

Since hInstance is the first parameter to WinMain, hInstance must have been in r0 when WinMain was entered.  The mov r4,r0 instruction tells you that hInstance was in r4 when InitInstance was called.  Of course, it’s not still in r4 by the time we got to MyRegisterClass.  So how do you figure out what r4 used to be? 

Remember that it is the responsibility of a called function to preserve the values of r4 through r11.  So, one solution is to just step in the debugger back out to WinMain and then look at r4.  But there are a number of reasons why this might not be possible: you may be looking at a post-mortem Watson dump; the debugger might be misbehaving; you might need to do more investigation before you step back out and lose your current state, etc.  So then what do you do?

Since it is the responsibility of a called function to preserve r4 through r11, one of the functions in the call stack (above the function you care about) must have preserved your r4 if any of them changed it. So, starting at WinMain (since that’s where your local lives) walk up the stack one frame at a time looking for a function that preserved r4 on the stack.

So we start at InitInstance.  Find the beginning of the function:

BOOL InitInstance(HINSTANCE hInstance, int nCmdShow)

{

000117F8  stmdb       sp!, {r4 - r7, lr}

000117FC  sub         sp, sp, #0x75, 30

And there it is.  InitInstance preserves r4.  But where exactly did it put r4?  Remember what the stack looks like:

MyRegisterClass locals                             ←MyRegisterClass frame pointer

MyRegisterClass preserved registers

InitInstance locals                                      ←InitInstance frame pointer

InitInstance preserved registers

WinMain locals                                            ←WinMain frame pointer

WinMain preserved registers

So first you need to find the frame pointer for InitInstance, since that is the function that saved r4.  Platform Builder’s call stack window will give you this.  (If for some reason you don’t have a debugger, you can even figure out the frame pointer yourself by starting with the current value of the stack pointer and looking at the prolog of each function in the stack to see how big each frame is.)

So now we know InitInstance’s frame pointer, which points to its locals.  The “sub    sp,sp,#0x75,30” instruction tells us that InitInstance has 0x1D4 bytes of locals (0x1D4 == 0x75 << (32-30)).  So InitInstance’s preserved registers start at the frame pointer + 0x1D4.  And since r4 is the first of the preserved registers, r4 lives at the frame pointer + 0x1D4.  And there you have it, you found the value of WinMain’s local hInstance variable.
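
The same arithmetic works for any saved register, not just the first one.  The sketch below assumes a prolog of the form “stmdb sp!, {rN - rM, lr}” followed by “sub sp, sp, #size”, like the one at 000117F8, where the saved registers form a contiguous run.  The function name saved_register_address is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Address of a register saved by a prolog of the form
       stmdb sp!, {first_saved_reg - ..., lr}
       sub   sp, sp, #locals_size
   given that function's frame pointer (the value of sp during its body).
   Assumes the saved list is a contiguous run starting at first_saved_reg. */
static uint32_t saved_register_address(uint32_t frame_pointer, uint32_t locals_size,
                                       unsigned first_saved_reg, unsigned wanted_reg)
{
    assert(wanted_reg >= first_saved_reg);
    return frame_pointer + locals_size + 4u * (wanted_reg - first_saved_reg);
}
```

With 0x1D4 bytes of locals and a saved list starting at r4, the saved r4 is at the frame pointer + 0x1D4 and the saved r6 is 8 bytes beyond that.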

By the way, if you reach the very top of the call stack, and none of the functions preserved r4, that means that none of them trash r4, and so the current value of r4 is what you are looking for.