The Art of 64-Bit Assembly, Volume 1 (Complete)

Source: zh.annas-archive.org/md5/9c64d259f1c4545987e6e1ced8b28e1f

Translator: 飞龙

License: CC BY-NC-SA 4.0

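Preface

This book is the culmination of 30 years’ work. The earliest versions of this book were notes I copied for my students at Cal Poly Pomona and UC Riverside under the title “How to Program the IBM PC Using 8088 Assembly Language.” I had lots of input from students and from a good friend of mine, Mary Philips, that softened the edges a bit. Bill Pollock rescued that early version from obscurity on the internet, and with the help of Karol Jurado, the first edition of The Art of Assembly Language became a reality in 2003.

Thousands of readers (and their suggestions), together with input from Bill Pollock, Alison Peterson, Ansel Staton, Riley Hoffman, Megan Dunchak, Linda Recktenwald, Susan Glinert Stevens, and Nancy Bell (of No Starch Press), and a technical review by Nathan Baker, led to the second edition of this book in 2010.

Ten years later, The Art of Assembly Language (or AoA, as I refer to it) was losing popularity because it was tied to the 35-year-old 32-bit design of the Intel x86. Today, someone who wants to learn 80x86 assembly language wants to learn 64-bit assembly on the newer x86-64 CPUs. So in early 2020, I began the process of moving the old 32-bit AoA (which was based on the High-Level Assembler, or HLA) to 64 bits by using the Microsoft Macro Assembler (MASM).

When I first started the project, I thought I’d translate a few HLA programs to MASM, tweak a little text, and wind up with The Art of 64-Bit Assembly with minimal effort. I was wrong. Between No Starch Press wanting to push the envelope on readability and understanding, and the incredible job Tony Tribelli did in his technical review of every line of text and code in this book, the project became as much work as writing a new book from scratch. That’s okay; I think you’ll really appreciate the work that went into this book.

A Note About the Source Code in This Book

A considerable amount of x86-64 assembly language (and C/C++) source code is presented throughout this book. Typically, source code comes in three forms: code snippets, single assembly language procedures or functions, and full-blown programs.

Code snippets are fragments of a program; they are not stand-alone, and you cannot compile (assemble) them by using MASM (or a C++ compiler, in the case of C/C++ source code). Code snippets exist to make a point or to provide a small example of a programming technique. Here is a typical example of a code snippet you will find in this book: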

someConst = 5
   .
   .
   .
mov eax, someConst

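The vertical ellipsis (. . .) denotes arbitrary code that could appear in its place (not all snippets use the ellipsis, but it’s worthwhile to point this out).

Assembly language procedures are also not stand-alone code. Though you can assemble many of the assembly language procedures appearing in this book (by simply copying the code out of the book into an editor and then running MASM on the resulting text file), they will not execute on their own. Code snippets and assembly language procedures differ in one major way: procedures appear as part of the downloadable source files for this book (at artofasm.randallhyde.com/).

Full-blown programs, which you can compile and execute, are labeled as listings in this book. They have a listing number/identifier of the form “Listing C-N,” where C is the chapter number and N is a sequentially increasing listing number, starting at 1 for each chapter. Here is an example of a program listing that appears in this book: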

; Listing 1-3

; A simple MASM module that contains
; an empty function to be called by
; the C++ code in Listing 1-2.

        .CODE

; The "option casemap:none" statement
; tells MASM to make all identifiers
; case-sensitive (rather than mapping
; them to uppercase). This is necessary
; because C++ identifiers are case-
; sensitive.

        option  casemap:none

; Here is the "asmFunc" function.

        public  asmFunc
asmFunc PROC

; Empty function just returns to C++ code.

        ret     ; Returns to caller

asmFunc ENDP
        END

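Listing 1: A MASM program that the C++ program in Listing 1-2 calls

Like procedures, all listings are available in electronic form at my website: artofasm.randallhyde.com/. This link will take you to the page containing all the source files and other support information for this book (such as errata, electronic chapters, and other useful information). A few chapters attach listing numbers to procedures and macros, which are not full programs, for legibility purposes. A few listings demonstrate MASM syntax errors or are otherwise unrunnable. Their source code still appears in the electronic distribution under that listing name.

Typically, this book follows executable listings with a build command and sample output. Here is a typical example (user input is given in boldface):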

C:\>**build listing4-7**

C:\>**echo off**
 Assembling: listing4-7.asm
c.cpp

C:\>**listing4-7**
Calling Listing 4-7:
aString: maxLen:20, len:20, string data:'Initial String Data'
Listing 4-7 terminated

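Most of the programs in this book run from a Windows command line (that is, inside the cmd.exe application). By default, this book assumes you’re running the programs from the root directory of the C: drive. Therefore, every build command and sample output typically has the prefix C:\> before any command you would type on the command line. However, you can run the programs from any drive or directory.

If you are completely unfamiliar with the Windows command line, please take a little time to learn about the Windows command line interpreter (CLI). You can start the CLI by executing the cmd.exe program from the Windows Run command. As you’re going to be running the CLI frequently while reading this book, I recommend creating a shortcut to cmd.exe on your desktop. In Appendix C, I describe how to create this shortcut so that it automatically sets up the environment variables you will need to easily run MASM (and the Microsoft Visual C++ compiler). Appendix D provides a quick introduction to the Windows CLI for those who are unfamiliar with it.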

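Part I

Machine Organization

Chapter 1: Hello, World of Assembly Language

This chapter is a “quick-start” chapter that lets you begin writing basic assembly language programs as rapidly as possible. By the conclusion of this chapter, you should understand the basic syntax of a Microsoft Macro Assembler (MASM) program and the prerequisites for learning new assembly language features in the chapters that follow.

This chapter covers the following:

* Basic syntax of a MASM program
* The Intel central processing unit (CPU) architecture
* Setting aside memory for variables
* Using machine instructions to control the CPU
* Linking a MASM program with C/C++ code so you can call routines in the C Standard Library
* Writing some simple assembly language programs

1.1 What You’ll Need

You’ll need a few prerequisites to learn assembly language programming with MASM: a 64-bit version of MASM, a text editor (for creating and modifying MASM source files), a linker, various library files, and a C++ compiler.

Today’s software engineers drop down into assembly language only when their C++, C#, Java, Swift, or Python code is running too slowly and they need to improve the performance of certain modules (or functions) in their code. Because you’ll typically be interfacing assembly language with C++ or other high-level language (HLL) code when using assembly in the real world, this book does so as well.

Another reason to use C++ is the C Standard Library. While different individuals have created several useful libraries for MASM (see www.masm32.com/ for a good example), there is no universally accepted standard set of libraries. To make the C Standard Library immediately accessible to MASM programs, this book presents examples with a short C/C++ main function that calls a single external function written in assembly language using MASM. Compiling the C++ main program along with the MASM source file produces a single executable file that you can run and test.

Do you need to know C++ to learn assembly language? Not really. This book will spoon-feed you the C++ you’ll need to run the example programs. Nevertheless, assembly language isn’t the best choice for your first language, so this book assumes that you have some experience in a language such as C/C++, Pascal (or Delphi), Java, Swift, Rust, BASIC, Python, or any other imperative or object-oriented programming language.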

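1.2 Setting Up MASM on Your Machine

MASM is a Microsoft product that is part of the Visual Studio suite of developer tools. Because it’s Microsoft’s tool set, you need to be running some variant of Windows (as of this writing, Windows 10 is the latest version; however, any later version of Windows will likely work as well). Appendix C provides a complete description of how to install Visual Studio Community (the “no-cost” version, which includes MASM and the Visual C++ compiler, plus other tools you will need). Please refer to that appendix for more details.

1.3 Setting Up a Text Editor on Your Machine

Visual Studio includes a text editor that you can use to create and edit MASM and C++ programs. Because you have to install the Visual Studio package to obtain MASM, you automatically get a production-quality programmer’s text editor that you can use for your assembly language source files.

However, you can use any editor that works with straight ASCII files (UTF-8 is also fine) to create MASM and C++ source files, such as Notepad++ or the text editor available from www.masm32.com/. Word processing programs, such as Microsoft Word, are not appropriate for editing program source files.

1.4 The Anatomy of a MASM Program

A typical (stand-alone) MASM program looks like Listing 1-1.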

; Comments consist of all text from a semicolon character
; to the end of the line.

; The ".code" directive tells MASM that the statements following
; this directive go in the section of memory reserved for machine
; instructions (code).

        .code

; Here is the "main" function. (This example assumes that the
; assembly language program is a stand-alone program with its
; own main function.)

main    PROC

        ; Machine instructions go here

        ret    ; Returns to caller

main    ENDP

; The END directive marks the end of the source file.

        END

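Listing 1-1: Trivial shell program

A typical MASM program contains one or more sections representing the type of data appearing in memory. These sections begin with a MASM statement such as .code or .data. Variables and other memory values appear in a data section. Machine instructions appear in procedures within a code section, and so on. The individual sections appearing in an assembly language source file are optional, so not every type of section will appear in a particular source file. For example, Listing 1-1 contains only a single code section.

The .code statement is an example of an assembler directive: a statement that tells MASM something about the program but is not an actual x86-64 machine instruction. In particular, the .code directive tells MASM to group the statements following it into a special section of memory reserved for machine instructions.

1.5 Running Your First MASM Program

A traditional first program, popularized by Brian Kernighan and Dennis Ritchie’s The C Programming Language (Prentice Hall, 1978), is the “Hello, world!” program. The whole purpose of this program is to provide a simple example that someone learning a new programming language can use to figure out how to use the tools needed to compile and run programs in that language.

Unfortunately, writing something as simple as a “Hello, world!” program is a major production in assembly language. You have to learn several machine instructions and assembler directives, not to mention Windows system calls, to print the string “Hello, world!” At this point, that’s too much to ask of a beginning assembly language programmer (those who want to blast on ahead can take a look at the sample program in Appendix C).

However, the program shell in Listing 1-1 is actually a complete assembly language program. You can compile (assemble) and run it. It doesn’t produce any output; it simply returns to Windows immediately after you start it. However, it does run, and it will serve as the mechanism for showing you how to assemble, link, and run an assembly language source file.

MASM is a traditional command line assembler, which means you need to run it from a Windows command line prompt (available by running the cmd.exe program). To do so, enter something like the following into the command line prompt or shell window: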

C:\>**ml64 programShell.asm /link /subsystem:console /entry:main**

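This command tells MASM to assemble the programShell.asm program (where I’ve saved Listing 1-1) to an executable file, link the result to produce a console application (one that you can run from the command line), and begin execution at the label main in the assembly language source file. Assuming that no errors occur, you can run the resulting program by typing the following command into your command prompt window:

C:\>**programShell**

Windows should immediately respond with a new command line prompt (as the programShell application simply returns control to Windows after it starts running).

1.6 Running Your First MASM/C++ Hybrid Program

This book commonly combines an assembly language module (containing one or more functions written in assembly language) with a C/C++ main program that calls those functions. Because the compilation and execution process is slightly different from that of a stand-alone MASM program, this section demonstrates how to create, compile, and run a hybrid assembly/C++ program. Listing 1-2 provides the main C++ program that calls the assembly language module.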

// Listing 1-2

// A simple C++ program that calls an assembly language function.
// Need to include stdio.h so this program can call "printf()".

#include <stdio.h>

// extern "C" namespace prevents "name mangling" by the C++
// compiler.

extern "C"
{
    // Here's the external function, written in assembly
    // language, that this program will call:

    void asmFunc(void);
};

int main(void)
{
    printf("Calling asmFunc:\n");
    asmFunc();
    printf("Returned from asmFunc\n");
}

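Listing 1-2: A sample C/C++ program, listing1-2.cpp, that calls an assembly language function

Listing 1-3 is a slight modification of the stand-alone MASM program; it contains the asmFunc() function that the C++ program calls.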

; Listing 1-3

; A simple MASM module that contains an empty function to be 
; called by the C++ code in Listing 1-2.

        .CODE

; (See text concerning option directive.)

        option  casemap:none

; Here is the "asmFunc" function.

        public  asmFunc
asmFunc PROC

; Empty function just returns to C++ code.

        ret    ; Returns to caller

asmFunc ENDP
        END

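Listing 1-3: A MASM program, listing1-3.asm, that the C++ program in Listing 1-2 calls

Listing 1-3 has three changes from the original programShell.asm source file. First, there are two new statements: the option statement and the public statement.

The option statement tells MASM to make all symbols case-sensitive. This is necessary because MASM, by default, is case-insensitive and maps all identifiers to uppercase (so asmFunc() would become ASMFUNC()). C++ is a case-sensitive language and treats asmFunc() and ASMFUNC() as two different identifiers. Therefore, it’s important to tell MASM to respect the case of the identifiers so as not to confuse the C++ program.

The public statement declares that the asmFunc() identifier will be visible outside the MASM source/object file. Without this statement, asmFunc() would be accessible only within the MASM module, and the C++ build would complain that asmFunc() is an undefined identifier.

The third difference between Listing 1-3 and Listing 1-1 is that the function’s name was changed from main() to asmFunc(). The C++ compiler and linker would get confused if the assembly code used the name main(), as that’s also the name of the C++ main() function.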
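The linkage rules just described can be tried without an assembler. The following is a minimal sketch, not the book's listing: it assumes a C++ stand-in body for asmFunc() (in the book's setup that body lives in listing1-3.asm and is a single ret), and the names ASMFUNC and callBoth are hypothetical, added only to show that C++ treats differently cased identifiers as distinct symbols.

```cpp
#include <cstdio>

// Declaration as the C++ main program sees it. extern "C" asks the compiler
// to use the plain, unmangled symbol name asmFunc, which is the name a
// "public asmFunc" statement in a MASM module would export.
extern "C" void asmFunc(void);

// Stand-in definition (an assumption for this sketch): in the book's setup
// this body is written in MASM and simply returns to its caller.
extern "C" void asmFunc(void)
{
    // Empty function: just returns to the caller.
}

// A different identifier as far as C++ is concerned, because C++ is
// case-sensitive. Hypothetical name, used only to show the distinction.
extern "C" int ASMFUNC(void)
{
    return 1;
}

// Hypothetical helper: calls both functions and returns ASMFUNC's result.
int callBoth(void)
{
    std::printf("Calling asmFunc:\n");
    asmFunc();
    std::printf("Returned from asmFunc\n");
    return ASMFUNC();
}
```

If MASM mapped asmFunc to uppercase, the linker would be looking for a symbol that the C++ side never references, which is why the option casemap:none statement matters.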

To compile and run these source files, you use the following commands:

C:\>**ml64 /c listing1-3.asm**
Microsoft (R) Macro Assembler (x64) Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: listing1-3.asm

C:\>**cl listing1-2.cpp listing1-3.obj**
Microsoft (R) C/C++ Optimizing Compiler Version 19.15.26730 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

listing1-2.cpp
Microsoft (R) Incremental Linker Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:listing1-2.exe
listing1-2.obj
listing1-3.obj

C:\>**listing1-2**
Calling asmFunc:
Returned from asmFunc

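The ml64 command uses the `/c` option, which stands for *compile-only*, and does not attempt to run the linker (which would fail because *listing1-3.asm* is not a stand-alone program). The output from MASM is an object code file (*listing1-3.obj*), which serves as input to the Microsoft Visual C++ (MSVC) compiler in the next command.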

The `cl` command runs the MSVC compiler on the *listing1-2.cpp* file and links in the assembled code (*listing1-3.obj*). The output from the MSVC compiler is the *listing1-2.exe* executable file. Executing that program from the command line produces the output we expect.

## 1.7 An Introduction to the Intel x86-64 CPU Family

Thus far, you’ve seen a single MASM program that will actually compile and run. However, the program does nothing more than return control to Windows. Before you can progress any further and learn some real assembly language, a detour is necessary: unless you understand the basic structure of the Intel x86-64 CPU family, the machine instructions will make little sense.

The Intel CPU family is generally classified as a *von Neumann architecture machine*. Von Neumann computer systems contain three main building blocks: the *central processing unit (CPU)*, *memory*, and *input/output (I/O) devices*. These three components are interconnected via the *system bus* (consisting of the address, data, and control buses). The block diagram in Figure 1-1 shows these relationships.

The CPU communicates with memory and I/O devices by placing a numeric value on the address bus to select one of the memory locations or I/O device port locations, each of which has a unique numeric *address*. Then the CPU, memory, and I/O devices pass data among themselves by placing the data on the data bus. The control bus contains signals that determine the direction of the data transfer (to/from memory and to/from an I/O device).

![f01001](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f01001.png)

Figure 1-1: Von Neumann computer system block diagram

Within the CPU, special locations known as *registers* are used to manipulate data. The x86-64 CPU registers can be broken into four categories: general-purpose registers, special-purpose application-accessible registers, segment registers, and special-purpose kernel-mode registers.
Because the segment registers aren’t used much in modern 64-bit operating systems (such as Windows), there is little need to discuss them in this book. The special-purpose kernel-mode registers are intended for writing operating systems, debuggers, and other system-level tools. Such software construction is well beyond the scope of this text.

The x86-64 (Intel family) CPUs provide several *general-purpose registers* for application use. These include the following:

* Sixteen 64-bit registers that have the following names: RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8, R9, R10, R11, R12, R13, R14, and R15
* Sixteen 32-bit registers: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, R8D, R9D, R10D, R11D, R12D, R13D, R14D, and R15D
* Sixteen 16-bit registers: AX, BX, CX, DX, SI, DI, BP, SP, R8W, R9W, R10W, R11W, R12W, R13W, R14W, and R15W
* Twenty 8-bit registers: AL, AH, BL, BH, CL, CH, DL, DH, DIL, SIL, BPL, SPL, R8B, R9B, R10B, R11B, R12B, R13B, R14B, and R15B

Unfortunately, these are not 68 independent registers; instead, the x86-64 overlays the 64-bit registers over the 32-bit registers, the 32-bit registers over the 16-bit registers, and the 16-bit registers over the 8-bit registers. Table 1-1 shows these relationships.

Because the general-purpose registers are not independent, modifying one register may modify as many as four other registers. For example, modifying the EAX register may very well modify the AL, AH, AX, and RAX registers. This fact cannot be overemphasized. A common mistake in programs written by beginning assembly language programmers is register value corruption due to the programmer not completely understanding the ramifications of the relationships shown in Table 1-1.
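To make the overlay concrete, here is a small software model of RAX and its sub-registers, written in C++ rather than assembly so it can be run anywhere. It is only a sketch: the `Rax` struct, its method names, and `demoClobber` are illustrative inventions, not part of MASM or the book's code. It also encodes one hardware rule that this passage does not spell out: on the x86-64, writing a 32-bit register clears the upper 32 bits of the corresponding 64-bit register, whereas writing an 8-bit or 16-bit register leaves the remaining bits unchanged.

```cpp
#include <cstdint>

// Software model of one general-purpose register (RAX and the registers
// overlaid on it). All names here are illustrative only.
struct Rax {
    std::uint64_t value = 0;   // the full 64-bit RAX

    // Reads: each smaller register is a window onto the same bits.
    std::uint64_t rax() const { return value; }
    std::uint32_t eax() const { return static_cast<std::uint32_t>(value); }
    std::uint16_t ax()  const { return static_cast<std::uint16_t>(value); }
    std::uint8_t  ah()  const { return static_cast<std::uint8_t>(value >> 8); }
    std::uint8_t  al()  const { return static_cast<std::uint8_t>(value); }

    void setRax(std::uint64_t v) { value = v; }

    // Writing a 32-bit register zero-extends into the upper 32 bits.
    void setEax(std::uint32_t v) { value = v; }

    // Writing a 16-bit or 8-bit register leaves the other bits unchanged.
    void setAx(std::uint16_t v) { value = (value & ~0xFFFFull) | v; }
    void setAh(std::uint8_t v)  { value = (value & ~0xFF00ull) | (std::uint64_t(v) << 8); }
    void setAl(std::uint8_t v)  { value = (value & ~0xFFull) | v; }
};

// Shows the corruption the text warns about: loading EAX destroys a value
// that the program was keeping in RAX.
std::uint64_t demoClobber()
{
    Rax r;
    r.setRax(0x1122334455667788ull);
    r.setEax(0xAABBCCDDu);   // AL, AH, AX, and RAX all change
    return r.rax();          // 0x00000000AABBCCDD
}
```

The practical lesson is the one the text gives: before using any register, check whether a larger or smaller register that shares its bits is already holding something you need.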
Table 1-1: General-Purpose Registers on the x86-64

| **Bits 0–63** | **Bits 0–31** | **Bits 0–15** | **Bits 8–15** | **Bits 0–7** |
| --- | --- | --- | --- | --- |
| RAX | EAX | AX | AH | AL |
| RBX | EBX | BX | BH | BL |
| RCX | ECX | CX | CH | CL |
| RDX | EDX | DX | DH | DL |
| RSI | ESI | SI | | SIL |
| RDI | EDI | DI | | DIL |
| RBP | EBP | BP | | BPL |
| RSP | ESP | SP | | SPL |
| R8 | R8D | R8W | | R8B |
| R9 | R9D | R9W | | R9B |
| R10 | R10D | R10W | | R10B |
| R11 | R11D | R11W | | R11B |
| R12 | R12D | R12W | | R12B |
| R13 | R13D | R13W | | R13B |
| R14 | R14D | R14W | | R14B |
| R15 | R15D | R15W | | R15B |

In addition to the general-purpose registers, the x86-64 provides special-purpose registers, including eight *floating-point registers* implemented in the x87 *floating-point unit (FPU)*. Intel named these registers ST(0) to ST(7). Unlike with the general-purpose registers, an application program cannot directly access these. Instead, a program treats the floating-point register file as an eight-entry-deep stack and accesses only the top one or two entries (see “Floating-Point Arithmetic” in Chapter 6 for more details).

Each floating-point register is 80 bits wide, holding an extended-precision real value (hereafter just *extended precision*). Although Intel added other floating-point registers to the x86-64 CPUs over the years, the FPU registers still find common use in code because they support this 80-bit floating-point format.

In the 1990s, Intel introduced the MMX register set and instructions to support *single instruction, multiple data (SIMD)* operations. The *MMX register set* is a group of eight 64-bit registers that overlay the ST(0) to ST(7) registers on the FPU. Intel chose to overlay the FPU registers because this made the MMX registers immediately compatible with multitasking operating systems (such as Windows) without any code changes to those OSs.
Unfortunately, this choice meant that an application could not simultaneously use the FPU and MMX instructions. Intel corrected this issue in later revisions of the x86-64 by adding the *XMM register set*. For that reason, you rarely see modern applications using the MMX registers and instruction set. They are available if you really want to use them, but it is almost always better to use the XMM registers (and instruction set) and leave the registers in FPU mode.

To overcome the limitations of the MMX/FPU register conflicts, AMD/Intel added sixteen 128-bit XMM registers (XMM0 to XMM15) and the SSE/SSE2 instruction set. Each register can be configured as four 32-bit floating-point registers; two 64-bit double-precision floating-point registers; or sixteen 8-bit, eight 16-bit, four 32-bit, two 64-bit, or one 128-bit integer registers. In later variants of the x86-64 CPU family, AMD/Intel doubled the size of the registers to 256 bits each (renaming them YMM0 to YMM15) to support eight 32-bit floating-point values or four 64-bit double-precision floating-point values (integer operations were still limited to 128 bits).

The *RFLAGS* (or just *FLAGS*) register is a 64-bit register that encapsulates several single-bit Boolean (true/false) values.^(1) Most of the bits in the RFLAGS register are either reserved for kernel mode (operating system) functions or are of little interest to the application programmer. Eight of these bits (or *flags*) are of interest to application programmers writing assembly language programs: the overflow, direction, interrupt disable,^(2) sign, zero, auxiliary carry, parity, and carry flags.
Figure 1-2 shows the layout of the flags within the lower 16 bits of the RFLAGS register.

![f01002](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f01002.png)

Figure 1-2: Layout of the FLAGS register (lower 16 bits of RFLAGS)

Four flags in particular are extremely valuable: the overflow, carry, sign, and zero flags, collectively called the *condition codes*.^(3) The state of these flags lets you test the result of previous computations. For example, after comparing two values, the condition code flags will tell you whether one value is less than, equal to, or greater than a second value.

One important fact that comes as a surprise to those just learning assembly language is that almost all calculations on the x86-64 CPU involve a register. For example, to add two variables together and store the sum into a third variable, you must load one of the variables into a register, add the second operand to the value in the register, and then store the register away in the destination variable. Registers are a middleman in nearly every calculation.

You should also be aware that, although the registers are called *general-purpose*, you cannot use any register for any purpose. All the x86-64 registers have their own special purposes that limit their use in certain contexts. The RSP register, for example, has a very special purpose that effectively prevents you from using it for anything else (it’s the *stack pointer*). Likewise, the RBP register has a special purpose that limits its usefulness as a general-purpose register. For the time being, avoid the use of the RSP and RBP registers for generic calculations; also, keep in mind that the remaining registers are not completely interchangeable in your programs.

## 1.8 The Memory Subsystem

The *memory subsystem* holds data such as program variables, constants, machine instructions, and other information. Memory is organized into cells, each of which holds a small piece of information.
The system can combine the information from these small cells (or *memory locations*) to form larger pieces of information. The x86-64 supports *byte-addressable memory*, which means the basic memory unit is a byte, sufficient to hold a single character or a (very) small integer value (we’ll talk more about that in Chapter 2).

Think of memory as a linear array of bytes. The address of the first byte is 0, and the address of the last byte is 2³² – 1. For an x86 processor with 4GB memory installed,^(4) the following pseudo-Pascal array declaration is a good approximation of memory:

```
Memory: array [0..4294967295] of byte;
```

C/C++ and Java users might prefer the following syntax:

```
byte Memory[4294967296];
```

For example, to execute the equivalent of the Pascal statement `Memory [125] := 0;`, the CPU places the value `0` on the data bus, places the address `125` on the address bus, and asserts the write line (this generally involves setting that line to `0`), as shown in Figure 1-3.

![f01003](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f01003.png)

Figure 1-3: Memory write operation

To execute the equivalent of `CPU := Memory [125];`, the CPU places the address `125` on the address bus, asserts the read line (because the CPU is reading data from memory), and then reads the resulting data from the data bus (see Figure 1-4).

![f01004](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f01004.png)

Figure 1-4: Memory read operation

To store larger values, the x86 uses a sequence of consecutive memory locations. Figure 1-5 shows how the x86 stores bytes, *words* (2 bytes), and *double words* (4 bytes) in memory. The memory address of each object is the address of the first byte of each object (that is, the lowest address).
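The idea of building words and double words out of consecutive byte cells can be sketched in C++ with a small array standing in for memory. This is an illustration under stated assumptions, not code from the book: the `Memory` alias and the `storeWord`, `storeDword`, and `loadDword` helpers are hypothetical names. The byte order used here (low-order byte at the lowest address, known as little-endian) is how x86 processors lay out multi-byte values, although this passage does not name it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A small byte-addressable "memory": each element is one byte cell, and the
// element index plays the role of the memory address.
using Memory = std::vector<std::uint8_t>;

// Store a 16-bit word at addr, low-order byte first (little-endian).
void storeWord(Memory& mem, std::size_t addr, std::uint16_t v)
{
    mem.at(addr)     = static_cast<std::uint8_t>(v);
    mem.at(addr + 1) = static_cast<std::uint8_t>(v >> 8);
}

// Store a 32-bit double word at addr as two consecutive words.
void storeDword(Memory& mem, std::size_t addr, std::uint32_t v)
{
    storeWord(mem, addr,     static_cast<std::uint16_t>(v));
    storeWord(mem, addr + 2, static_cast<std::uint16_t>(v >> 16));
}

// Read a double word back from its four consecutive byte cells. The address
// of the object is the address of its first (lowest) byte.
std::uint32_t loadDword(const Memory& mem, std::size_t addr)
{
    return  std::uint32_t(mem.at(addr))
         | (std::uint32_t(mem.at(addr + 1)) << 8)
         | (std::uint32_t(mem.at(addr + 2)) << 16)
         | (std::uint32_t(mem.at(addr + 3)) << 24);
}
```

For instance, after `storeDword(mem, 192, 0x12345678)`, cells 192 through 195 hold the bytes 0x78, 0x56, 0x34, and 0x12, in that order, and `loadDword(mem, 192)` reassembles the original value.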
![f01005](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f01005.png)

Figure 1-5: Byte, word, and double-word storage in memory

## 1.9 Declaring Memory Variables in MASM

Although it is possible to reference memory by using numeric addresses in assembly language, doing so is painful and error-prone. Rather than having your program state, “Give me the 32-bit value held in memory location 192 and the 16-bit value held in memory location 188,” it’s much nicer to state, “Give me the contents of `elementCount` and `portNumber`.” Using variable names, rather than memory addresses, makes your program much easier to write, read, and maintain.

To create (writable) data variables, you have to put them in a data section of the MASM source file, defined using the `.data` directive. This directive tells MASM that all following statements (up to the next `.code` or other section-defining directive) will define data declarations to be grouped into a read/write section of memory.

Within a `.data` section, MASM allows you to declare variable objects by using a set of data declaration directives. The basic form of a data declaration directive is

```
label  directive  ?
```

where `label` is a legal MASM identifier and `directive` is one of the directives appearing in Table 1-2.
Table 1-2: MASM Data Declaration Directives

| **Directive** | **Meaning** |
| --- | --- |
| `byte` (or `db`) | Byte (unsigned 8-bit) value |
| `sbyte` | Signed 8-bit integer value |
| `word` (or `dw`) | Unsigned 16-bit (word) value |
| `sword` | Signed 16-bit integer value |
| `dword` (or `dd`) | Unsigned 32-bit (double-word) value |
| `sdword` | Signed 32-bit integer value |
| `qword` (or `dq`) | Unsigned 64-bit (quad-word) value |
| `sqword` | Signed 64-bit integer value |
| `tbyte` (or `dt`) | Unsigned 80-bit (10-byte) value |
| `oword` | 128-bit (octal-word) value |
| `real4` | Single-precision (32-bit) floating-point value |
| `real8` | Double-precision (64-bit) floating-point value |
| `real10` | Extended-precision (80-bit) floating-point value |

The question mark (`?`) operand tells MASM that the object will not have an explicit value when the program loads into memory (the default initialization is zero). If you would like to initialize the variable with an explicit value, replace the `?` with the initial value; for example:

```
hasInitialValue  sdword  -1
```

Some of the data declaration directives in Table 1-2 have a signed version (the directives with the `s` prefix). For the most part, MASM ignores this prefix. It is the machine instructions you write that differentiate between signed and unsigned operations; MASM itself usually doesn’t care whether a variable holds a signed or an unsigned value. Indeed, MASM allows both of the following:

```
        .data
u8      byte    -1      ; Negative initializer is okay
i8      sbyte   250     ; even though +127 is maximum signed byte
```

All MASM cares about is whether the initial value will fit into a byte. The `-1`, even though it is not an unsigned value, will fit into a byte in memory.
Even though `250` is too large to fit into a signed 8-bit integer (see “Signed and Unsigned Numbers” in Chapter 2), MASM will happily accept this because `250` will fit into a byte variable (as an unsigned number).

It is possible to reserve storage for multiple data values in a single data declaration directive. The string multi-valued data type is critical to this chapter (later chapters discuss other types, such as arrays in Chapter 4). You can create a null-terminated string of characters in memory by using the `byte` directive as follows:

```
; Zero-terminated C/C++ string.

strVarName  byte  'String of characters', 0
```

Notice the `, 0` that appears after the string of characters. In any data declaration (not just byte declarations), you can place multiple data values in the operand field, separated by commas, and MASM will emit an object of the specified size and value for each operand. For string values (surrounded by apostrophes in this example), MASM emits a byte for each character in the string (plus a zero byte for the `, 0` operand at the end of the string). MASM allows you to define strings by using either apostrophes or quotes; you must terminate the string of characters with the same delimiter that begins the string (quote or apostrophe).

### 1.9.1 Associating Memory Addresses with Variables

One of the nice things about using an assembler/compiler like MASM is that you don’t have to worry about numeric memory addresses. All you need to do is declare a variable in MASM, and MASM associates that variable with a unique set of memory addresses. For example, say you have the following declaration section:

```
        .data
i8      sbyte   ?
i16     sword   ?
i32     sdword  ?
i64     sqword  ?
```

MASM will find an unused 8-bit byte in memory and associate it with the `i8` variable; it will find a pair of consecutive unused bytes and associate them with `i16`; it will find four consecutive locations and associate them with `i32`; finally, MASM will find 8 consecutive unused bytes and associate them with `i64`. You’ll always refer to these variables by their name. You generally don’t have to concern yourself with their numeric address. Still, you should be aware that MASM is doing this for you.

When MASM is processing declarations in a `.data` section, it assigns consecutive memory locations to each variable.^(5) Assuming `i8` (in the previous declarations) has a memory address of 101, MASM will assign the addresses appearing in Table 1-3 to `i8`, `i16`, `i32`, and `i64`.

Table 1-3: Variable Address Assignment

| **Variable** | **Memory address** |
| --- | --- |
| `i8` | 101 |
| `i16` | 102 (address of `i8` plus 1) |
| `i32` | 104 (address of `i16` plus 2) |
| `i64` | 108 (address of `i32` plus 4) |

Whenever you have multiple operands in a data declaration statement, MASM will emit the values to sequential memory locations in the order they appear in the operand field. The label associated with the data declaration (if one is present) is associated with the address of the first (leftmost) operand’s value. See Chapter 4 for more details.

### 1.9.2 Associating Data Types with Variables

During assembly, MASM associates a data type with every label you define, including variables. This is rather advanced for an assembly language (most assemblers simply associate a value or an address with an identifier).

For the most part, MASM uses the variable’s size (in bytes) as its type (see Table 1-4).
Table 1-4: MASM Data Types

| **Type** | **Size** | **Description** |
| --- | --- | --- |
| `byte` (`db`) | 1 | 1-byte memory operand, unsigned (generic integer) |
| `sbyte` | 1 | 1-byte memory operand, signed integer |
| `word` (`dw`) | 2 | 2-byte memory operand, unsigned (generic integer) |
| `sword` | 2 | 2-byte memory operand, signed integer |
| `dword` (`dd`) | 4 | 4-byte memory operand, unsigned (generic integer) |
| `sdword` | 4 | 4-byte memory operand, signed integer |
| `qword` (`dq`) | 8 | 8-byte memory operand, unsigned (generic integer) |
| `sqword` | 8 | 8-byte memory operand, signed integer |
| `tbyte` (`dt`) | 10 | 10-byte memory operand, unsigned (generic integer or BCD) |
| `oword` | 16 | 16-byte memory operand, unsigned (generic integer) |
| `real4` | 4 | 4-byte single-precision floating-point memory operand |
| `real8` | 8 | 8-byte double-precision floating-point memory operand |
| `real10` | 10 | 10-byte extended-precision floating-point memory operand |
| `proc` | N/A | Procedure label (associated with `PROC` directive) |
| `label`: | N/A | Statement label (any identifier immediately followed by a `:`) |
| `constant` | Varies | Constant declaration (equate) using `=` or `EQU` directive |
| `text` | N/A | Textual substitution using macro or `TEXTEQU` directive |

Later sections and chapters fully describe the `proc`, `label`, `constant`, and `text` types.

## 1.10 Declaring (Named) Constants in MASM

MASM allows you to declare manifest constants by using the `=` directive. A *manifest constant* is a symbolic name (identifier) that MASM associates with a value. Everywhere the symbol appears in the program, MASM will directly substitute the value of that symbol for the symbol.

A manifest constant declaration takes the following form:

```
label = expression
```

Here, `label` is a legal MASM identifier, and `expression` is a constant arithmetic expression (typically, a single literal constant value).
The following example defines the symbol `dataSize` to be equal to `256`:

```
dataSize = 256
```

Most of the time, MASM’s `equ` directive is a synonym for the `=` directive. For the purposes of this chapter, the following statement is largely equivalent to the previous declaration:

```
dataSize equ 256
```

Constant declarations (*equates* in MASM terminology) may appear anywhere in your MASM source file, prior to their first use. They may appear in a `.data` section, a `.code` section, or even outside any sections.

## 1.11 Some Basic Machine Instructions

The x86-64 CPU family provides from just over a couple hundred to many thousands of machine instructions, depending on how you define a machine instruction. But most assembly language programs use around 30 to 50 machine instructions,^(6) and you can write several meaningful programs with only a few. This section provides a small handful of machine instructions so you can start writing simple MASM assembly language programs right away.

### 1.11.1 The mov Instruction

Without question, the `mov` instruction is the most oft-used assembly language statement. In a typical program, anywhere from 25 percent to 40 percent of the instructions are `mov` instructions. As its name suggests, this instruction moves data from one location to another.^(7) Here’s the generic MASM syntax for this instruction:

```
mov  destination_operand, source_operand
```

The `source_operand` may be a (general-purpose) register, a memory variable, or a constant. The `destination_operand` may be a register or a memory variable. The x86-64 instruction set does not allow both operands to be memory variables. In a high-level language like Pascal or C/C++, the `mov` instruction is roughly equivalent to the following assignment statement:

```
destination_operand = source_operand ;
```

The `mov` instruction’s operands must both be the same size.
That is, you can move data between a pair of byte (8-bit) objects, word (16-bit) objects, double-word (32-bit) objects, or quad-word (64-bit) objects; you may not, however, mix the sizes of the operands. Table 1-5 lists all the legal combinations for the `mov` instruction. You should study this table carefully because most of the general-purpose x86-64 instructions use this syntax.

Table 1-5: Legal x86-64 `mov` Instruction Operands

| **Source**^(*) | **Destination** |
| --- | --- |
| reg[8] | reg[8] |
| reg[8] | mem[8] |
| mem[8] | reg[8] |
| constant^(**) | reg[8] |
| constant | mem[8] |
| reg[16] | reg[16] |
| reg[16] | mem[16] |
| mem[16] | reg[16] |
| constant | reg[16] |
| constant | mem[16] |
| reg[32] | reg[32] |
| reg[32] | mem[32] |
| mem[32] | reg[32] |
| constant | reg[32] |
| constant | mem[32] |
| reg[64] | reg[64] |
| reg[64] | mem[64] |
| mem[64] | reg[64] |
| constant | reg[64] |
| constant[32] | mem[64] |

^(*) reg[*n*] means an *n*-bit register, and mem[*n*] means an *n*-bit memory location.

^(**) The constant must be small enough to fit in the specified destination operand.

This table includes one important thing to note: the x86-64 allows you to move only a 32-bit constant value into a 64-bit memory location (it will sign-extend this value to 64 bits; see “Sign Extension and Zero Extension” in Chapter 2 for more information about sign extension). Moving a 64-bit constant into a 64-bit register is the only x86-64 instruction that allows a 64-bit constant operand. This inconsistency in the x86-64 instruction set is annoying. Welcome to the x86-64.

### 1.11.2 Type Checking on Instruction Operands

MASM enforces some type checking on instruction operands. In particular, the size of an instruction’s operands must agree. For example, MASM will generate an error for the following:

```
i8  byte  ?
     .
     .
     .
    mov  ax, i8
```

The problem is that you are attempting to load an 8-bit variable (`i8`) into a 16-bit register (AX).
As their sizes are not compatible, MASM assumes that this is a logic error in the program and reports an error.^(8) For the most part, MASM ignores the difference between signed and unsigned variables. MASM is perfectly happy with both of these `mov` instructions: ``` i8 sbyte ? u8 byte ? . . . mov al, i8 mov bl, u8 ``` All MASM cares about is that you’re moving a byte variable into a byte-sized register. Differentiating signed and unsigned values in those registers is up to the application program. MASM even allows something like this: ``` r4v real4 ? r8v real8 ? . . . mov eax, r4v mov rbx, r8v ``` Again, all MASM really cares about is the size of the memory operands, not that you wouldn’t normally load a floating-point variable into a general-purpose register (which typically holds integer values). In Table 1-4, you’ll notice that there are `proc`, `label`, and `constant` types. MASM will report an error if you attempt to use a `proc` or `label` reserved word in a `mov` instruction. The procedure and label types are associated with addresses of machine instructions, not variables, and it doesn’t make sense to “load a procedure” into a register. However, you may specify a `constant` symbol as a source operand to an instruction; for example: ``` someConst = 5 . . . mov eax, someConst ``` As there is no size associated with constants, the only type checking MASM will do on a constant operand is to verify that the constant will fit in the destination operand. For example, MASM will reject the following: ``` wordConst = 1000 . . . mov al, wordConst ``` ### 1.11.3 The add and sub Instructions The x86-64 `add` and `sub` instructions add or subtract two operands, respectively. Their syntax is nearly identical to the `mov` instruction: ``` add `destination_operand`*,* `source_operand` sub `destination_operand`*,* `source_operand` ``` However, constant operands are limited to a maximum of 32 bits. 
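The 32-bit limit on constants is easiest to see as a range check. The following C++ sketch is not from the book; `fitsInSignExtendedImm32` and `signExtendImm32` are hypothetical helper names chosen for illustration. Assuming the sign-extension behavior this chapter describes, it models which 64-bit values a 32-bit immediate can represent once the CPU widens it:

```cpp
#include <cstdint>

// Hypothetical helper (not part of MASM or the book's code): returns true
// if `value` survives a round trip through a 32-bit immediate that is
// sign-extended to 64 bits.
bool fitsInSignExtendedImm32(std::int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}

// Models the widening step itself: a 32-bit immediate becomes a 64-bit
// value by copying its sign bit into bits 32 to 63.
std::int64_t signExtendImm32(std::int32_t imm)
{
    return static_cast<std::int64_t>(imm);
}
```

Under this model, `-1` and `0x7FFFFFFF` fit, but `0x80000000` and `0xFFFFFFFF` (taken as positive 64-bit values) do not, because sign extension would turn them into negative numbers. For such constants, the 64-bit register form of `mov` is the one instruction that accepts a full 64-bit constant.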
If your destination operand is 64 bits, the CPU allows only a 32-bit immediate source operand (it will sign-extend that operand to 64 bits; see “Sign Extension and Zero Extension” in Chapter 2 for more details on sign extension). The `add` instruction does the following: ``` `destination_operand` = `destination_operand` + `source_operand` ``` The `sub` instruction does the calculation: ``` `destination_operand` = `destination_operand` - `source_operand` ``` With these three instructions, plus some MASM control structures, you can actually write sophisticated programs. ### 1.11.4 The lea Instruction Sometimes you need to load the address of a variable into a register rather than the value of that variable. You can use the `lea` (*load effective address*) instruction for this purpose. The `lea` instruction takes the following form: ``` lea `reg64`, `memory_var` ``` Here, `reg64` is any general-purpose 64-bit register, and `memory_var` is a variable name. Note that `memory_var`’s type is irrelevant; it doesn’t have to be a `qword` variable (as is the case with `mov`, `add`, and `sub` instructions). Every variable has a memory address associated with it, and that address is always 64 bits. The following example loads the RCX register with the address of the first character in the `strVar` string: ``` strVar byte "Some String", 0 . . . lea rcx, strVar ``` The `lea` instruction is roughly equivalent to the C/C++ unary `&` (*address-of*) operator. The preceding assembly example is conceptually equivalent to the following C/C++ code: ``` char strVar[] = "Some String"; char *RCX; . . . RCX = &strVar[0]; ``` ### 1.11.5 The call and ret Instructions and MASM Procedures To make function calls (as well as write your own simple functions), you need the `call` and `ret` instructions. 
The `ret` instruction serves the same purpose in an assembly language program as the `return` statement in C/C++: it returns control from an assembly language procedure (assembly language functions are called *procedures*). For the time being, this book will use the variant of the `ret` instruction that does not have an operand: ``` ret ``` (The `ret` instruction does allow a single operand, but unlike in C/C++, the operand does not specify a function return value. You’ll see the purpose of the `ret` instruction operand in Chapter 5.) As you might guess, you call a MASM procedure by using the `call` instruction. This instruction can take a couple of forms. The most common is ``` call `proc_name` ``` where `proc_name` is the name of the procedure you want to call. As you’ve seen in a couple code examples already, a MASM procedure consists of the line ``` `proc_name` proc ``` followed by the body of the procedure (typically ending with a `ret` instruction). At the end of the procedure (typically immediately after the `ret` instruction), you end the procedure with the following statement: ``` `proc_name` endp ``` The label on the `endp` directive must be identical to the one you supply for the `proc` statement. In the stand-alone assembly language program in Listing 1-4, the main program calls `myProc`, which will immediately return to the main program, which then immediately returns to Windows. ``` ; Listing 1-4 ; A simple demonstration of a user-defined procedure. .code ; A sample user-defined procedure that this program can call. myProc proc ret ; Immediately return to the caller myProc endp ; Here is the "main" procedure. main PROC ; Call the user-defined procedure. 
call myProc ret ; Returns to caller main endp end ``` Listing 1-4: A sample user-defined procedure in an assembly language program You can compile this program and try running it by using the following commands: ``` C:\>**ml64 listing1-4.asm /link /subsystem:console /entry:main** Microsoft (R) Macro Assembler (x64) Version 14.15.26730.0 Copyright (C) Microsoft Corporation. All rights reserved. Assembling: listing1-4.asm Microsoft (R) Incremental Linker Version 14.15.26730.0 Copyright (C) Microsoft Corporation. All rights reserved. /OUT:listing1-4.exe listing1-4.obj /subsystem:console /entry:main C:\>**listing1-4** ``` ## 1.12 Calling C/C++ Procedures While writing your own procedures and calling them are quite useful, the reason for introducing procedures at this point is not to allow you to write your own procedures, but rather to give you the ability to call procedures (functions) written in C/C++. Writing your own procedures to convert and output data to the console is a rather complex task (probably well beyond your capabilities at this point). Instead, you can call the C/C++ `printf()` function to produce program output and verify that your programs are actually doing something when you run them. Unfortunately, if you call `printf()` in your assembly language code without providing a `printf()` procedure, MASM will complain that you’ve used an undefined symbol. To call a procedure outside your source file, you need to use the MASM `externdef` directive.^(9) This directive has the following syntax: ``` externdef `symbol`:`type` ``` Here, `symbol` is the external symbol you want to define, and `type` is the type of that symbol (which will be `proc` for external procedure definitions). To define the `printf()` symbol in your assembly language file, use this statement: ``` externdef printf:proc ``` When defining external procedure symbols, you should put the `externdef` directive in your `.code` section. 
The `externdef` directive doesn’t let you specify parameters to pass to the `printf()` procedure, nor does the `call` instruction provide a mechanism for specifying parameters. Instead, you can pass up to four parameters to the `printf()` function in the x86-64 registers RCX, RDX, R8, and R9. The `printf()` function requires that the first parameter be the address of a format string. Therefore, you should load RCX with the address of a zero-terminated string prior to calling `printf()`. If the format string contains any format specifiers (for example, `%d`), you must pass appropriate parameter values in RDX, R8, and R9. Chapter 5 goes into great detail concerning procedure parameters, including how to pass floating-point values and more than four parameters.

## 1.13 Hello, World!

At this point (many pages into this chapter), you finally have enough information to write this chapter’s namesake application: the “Hello, world!” program, shown in Listing 1-5.

```
; Listing 1-5

; A "Hello, world!" program using the C/C++ printf() function to
; provide the output.

        option  casemap:none
        .data

; Note: "10" value is a line feed character, also known as the
; "C" newline character.

fmtStr  byte    'Hello, world!', 10, 0

        .code

; External declaration so MASM knows about the C/C++ printf()
; function.

        externdef printf:proc

; Here is the "asmFunc" function.

        public  asmFunc
asmFunc proc

; "Magic" instruction offered without explanation at this point:

        sub     rsp, 56

; Here's where we'll call the C printf() function to print
; "Hello, world!" Pass the address of the format string
; to printf() in the RCX register. Use the LEA instruction
; to get the address of fmtStr.

        lea     rcx, fmtStr
        call    printf

; Another "magic" instruction that undoes the effect of the
; previous one before this procedure returns to its caller.
        add     rsp, 56

        ret     ; Returns to caller

asmFunc endp
        end
```

Listing 1-5: Assembly language code for the “Hello, world!” program

The assembly language code contains two “magic” statements that this chapter includes without further explanation. Just accept the fact that subtracting from the RSP register at the beginning of the function and then adding this value back to RSP at the end of the function are needed to make the calls to C/C++ functions work properly. Chapter 5 more fully explains the purpose of these statements.

The C++ function in Listing 1-6 calls the assembly code and makes the `printf()` function available for use.

```
// Listing 1-6

// C++ driver program to demonstrate calling printf() from assembly
// language.

// Need to include stdio.h so this program can call "printf()".

#include <stdio.h>

// extern "C" namespace prevents "name mangling" by the C++
// compiler.

extern "C"
{
    // Here's the external function, written in assembly
    // language, that this program will call:

    void asmFunc(void);
};

int main(void)
{
    // Need at least one call to printf() in the C program to allow
    // calling it from assembly.

    printf("Calling asmFunc:\n");
    asmFunc();
    printf("Returned from asmFunc\n");
}
```

Listing 1-6: C++ code for the “Hello, world!” program

Here’s the sequence of steps needed to compile and run this code on my machine:

```
C:\>**ml64 /c listing1-5.asm**
Microsoft (R) Macro Assembler (x64) Version 14.15.26730.0
Copyright (C) Microsoft Corporation. All rights reserved.

 Assembling: listing1-5.asm

C:\>**cl listing1-6.cpp listing1-5.obj**
Microsoft (R) C/C++ Optimizing Compiler Version 19.15.26730 for x64
Copyright (C) Microsoft Corporation. All rights reserved.

listing1-6.cpp
Microsoft (R) Incremental Linker Version 14.15.26730.0
Copyright (C) Microsoft Corporation. All rights reserved.

/out:listing1-6.exe
listing1-6.obj
listing1-5.obj

C:\>**listing1-6**
Calling asmFunc:
Hello, world!
Returned from asmFunc
```

You can finally print “Hello, world!” on the console!
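The `fmtStr` declaration in Listing 1-5 spells out the newline (`10`) and the terminating zero by hand, which a C/C++ string literal does implicitly. As a small sketch (the names `fmtStrBytes`, `fmtStrLiteral`, and `sameBytes` are made up for this illustration, and it assumes an ASCII-compatible character set where `'\n'` is 10), the following C++ code compares the two forms byte for byte:

```cpp
#include <cstring>

// The bytes MASM emits for:  fmtStr byte 'Hello, world!', 10, 0
const char fmtStrBytes[] =
{
    'H', 'e', 'l', 'l', 'o', ',', ' ',
    'w', 'o', 'r', 'l', 'd', '!', 10, 0
};

// The C/C++ literal a C programmer would pass to printf() instead.
// The compiler appends the zero-terminating byte automatically.
const char fmtStrLiteral[] = "Hello, world!\n";

// Returns true if both arrays have the same length and contents.
bool sameBytes()
{
    return sizeof fmtStrBytes == sizeof fmtStrLiteral &&
           std::memcmp(fmtStrBytes, fmtStrLiteral, sizeof fmtStrLiteral) == 0;
}
```

Both arrays occupy 15 bytes: 13 visible characters, the line feed, and the zero terminator. In that sense, loading RCX with the address of `fmtStr` and calling `printf` corresponds to the C call `printf("Hello, world!\n")`.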
## 1.14 Returning Function Results in Assembly Language

In a previous section, you saw how to pass up to four parameters to a procedure written in assembly language. This section describes the opposite process: returning a value to code that has called one of your procedures.

In pure assembly language (where one assembly language procedure calls another), passing parameters and returning function results are strictly a convention that the caller and callee procedures share with one another. Either the callee (the procedure being called) or the caller (the procedure doing the calling) may choose where function results appear.

From the callee viewpoint, the procedure returning the value determines where the caller can find the function result, and whoever calls that function must respect that choice. If a procedure returns a function result in the XMM0 register (a common place to return floating-point results), whoever calls that procedure must expect to find the result in XMM0. A different procedure could return its function result in the RBX register.

From the caller’s viewpoint, the choice is reversed. Existing code expects a function to return its result in a particular location, and the function being called must respect that wish.

Unfortunately, without appropriate coordination, one section of code might demand that functions it calls return their function results in one location, while a set of existing library functions might insist on returning their function results in another location. Clearly, such functions would not be compatible with the calling code. While there are ways to handle this situation (typically by writing facade code that sits between the caller and callee and moves the return results around), the best solution is to ensure that everybody agrees on things like where function return results will be found prior to writing any code.

This agreement is known as an *application binary interface (ABI)*.
An ABI is a contract, of sorts, between different sections of code that describes *calling conventions* (where things are passed, where they are returned, and so on), data types, memory usage and alignment, and other attributes. CPU manufacturers, compiler writers, and operating system vendors all provide their own ABIs. For obvious reasons, this book uses the Microsoft Windows ABI.

Once again, it’s important to understand that when you’re writing your own assembly language code, the way you pass data between your procedures is totally up to you. One of the benefits of using assembly language is that you can decide the interface on a procedure-by-procedure basis. The only time you have to worry about adhering to an ABI is when you call code that is outside your control (or if that external code makes calls to your code). This book covers writing assembly language under Microsoft Windows (specifically, assembly code that interfaces with MSVC); therefore, when dealing with external code (Windows and C++ code), you have to use the Windows/MSVC ABI. The Microsoft ABI specifies that the first four parameters to `printf()` (or any C++ function, for that matter) must be passed in RCX, RDX, R8, and R9.

The Windows ABI also states that functions (procedures) return integer and pointer values (that fit into 64 bits) in the RAX register. So if some C++ code expects your assembly procedure to return an integer result, you would load the integer result into RAX immediately before returning from your procedure.

To demonstrate returning a function result, we’ll use the C++ program in Listing 1-7 (*c.cpp*, a generic C++ program that this book uses for most of the C++/assembly examples hereafter).
This C++ program includes two extra function declarations: `getTitle()` (supplied by the assembly language code), which returns a pointer to a string containing the title of the program (the C++ code prints this title), and `readLine()` (supplied by the C++ program), which the assembly language code can call to read a line of text from the user (and put into a string buffer in the assembly language code). ``` // Listing 1-7 // c.cpp // Generic C++ driver program to demonstrate returning function // results from assembly language to C++. Also includes a // "readLine" function that reads a string from the user and // passes it on to the assembly language code. // Need to include stdio.h so this program can call "printf()" // and string.h so this program can call strlen. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <string.h> // extern "C" namespace prevents "name mangling" by the C++ // compiler. extern "C" { // asmMain is the assembly language code's "main program": void asmMain(void); // getTitle returns a pointer to a string of characters // from the assembly code that specifies the title of that // program (that makes this program generic and usable // with a large number of sample programs in "The Art of // 64-Bit Assembly"). char *getTitle(void); // C++ function that the assembly // language program can call: int readLine(char *dest, int maxLen); }; // readLine reads a line of text from the user (from the // console device) and stores that string into the destination // buffer the first argument specifies. Strings are limited in // length to the value specified by the second argument // (minus 1). // This function returns the number of characters actually // read, or -1 if there was an error. // Note that if the user enters too many characters (maxlen or // more), then this function returns only the first maxlen-1 // characters. This is not considered an error. 
int readLine(char *dest, int maxLen)
{
    // Note: fgets returns NULL if there was an error, else
    // it returns a pointer to the string data read (which
    // will be the value of the dest pointer).

    char *result = fgets(dest, maxLen, stdin);
    if(result != NULL)
    {
        // Wipe out the newline character at the
        // end of the string (if one is present; a
        // truncated line has no newline to remove):

        int len = (int) strlen(result);
        if(len > 0 && dest[len - 1] == '\n')
        {
            dest[len - 1] = 0;
        }
        return len;
    }
    return -1; // If there was an error
}

int main(void)
{
    // Get the assembly language program's title:

    try
    {
        char *title = getTitle();

        printf("Calling %s:\n", title);
        asmMain();
        printf("%s terminated\n", title);
    }
    catch(...)
    {
        printf
        (
            "Exception occurred during program execution\n"
            "Abnormal program termination.\n"
        );
    }
}
```

Listing 1-7: Generic C++ code for calling assembly language programs

The `try..catch` block catches any exceptions the assembly code generates, so you get some sort of indication if the program aborts abnormally.

Listing 1-8 provides assembly code that demonstrates several new concepts, foremost returning a function result (to the C++ program). The assembly language function `getTitle()` returns a pointer to a string that the calling C++ code will print as the title of the program. In the `.data` section, you’ll see a string variable `titleStr` that is initialized with the name of this assembly code (`Listing 1-8`). The `getTitle()` function loads the address of that string into RAX and returns this string pointer to the C++ code (Listing 1-7) that prints the title before and after running the assembly code.

This program also demonstrates reading a line of text from the user. The assembly code calls the `readLine()` function appearing in the C++ code. The `readLine()` function expects two parameters: the address of a character buffer (C string) and a maximum buffer length. The code in Listing 1-8 passes the address of the character buffer to the `readLine()` function in RCX and the maximum buffer size in RDX.
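In C++ terms, the call sequence just described looks roughly like the following sketch. It is an illustration rather than the book's code: `readLineFrom` and `readIntoInput` are hypothetical names, the stream is passed as a parameter (instead of always using `stdin`) so the sketch can be exercised without a console, and it removes the final character only when that character is a newline.

```cpp
#include <cstdio>
#include <cstring>

const int maxLen = 256;     // Same value as the maxLen equate in Listing 1-8
char input[maxLen];         // Plays the role of: input byte maxLen dup (?)

// Hypothetical variant of readLine() that reads from a caller-supplied
// stream. Returns the number of characters fgets() stored (counting the
// newline, if any), or -1 if there was an error.
int readLineFrom(std::FILE *in, char *dest, int maxChars)
{
    char *result = std::fgets(dest, maxChars, in);
    if (result == nullptr)
    {
        return -1;
    }
    int len = static_cast<int>(std::strlen(result));
    if (len > 0 && dest[len - 1] == '\n')
    {
        dest[len - 1] = 0;  // Strip the trailing newline
    }
    return len;
}

// Mirrors the assembly sequence:
//     mov input, 0 / lea rcx, input / mov rdx, maxLen / call readLine
int readIntoInput(std::FILE *in)
{
    input[0] = 0;           // Keep the buffer zero-terminated on error
    return readLineFrom(in, input, maxLen);
}
```

The first argument here corresponds to the pointer the assembly code passes in RCX, and the second to the buffer size it passes in RDX.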
The maximum buffer length must include room for two extra characters: a newline character (line feed) and a zero-terminating byte.

Finally, Listing 1-8 demonstrates declaring a character buffer (that is, an array of characters). In the `.data` section, you will find the following declaration:

```
input byte maxLen dup (?)
```

The `maxLen dup (?)` operand tells MASM to duplicate the `(?)` (that is, an uninitialized byte) `maxLen` times. `maxLen` is a constant set to `256` by an equate directive (`=`) at the beginning of the source file. (For more details, see “Declaring Arrays in Your MASM Programs” in Chapter 4.)

```
; Listing 1-8

; An assembly language program that demonstrates returning
; a function result to a C++ program.

        option  casemap:none

nl      =       10  ; ASCII code for newline
maxLen  =       256 ; Maximum string size + 1

         .data
titleStr byte   'Listing 1-8', 0
prompt   byte   'Enter a string: ', 0
fmtStr   byte   "User entered: '%s'", nl, 0

; "input" is a buffer having "maxLen" bytes. This program
; will read a user string into this buffer.

; The "maxLen dup (?)" operand tells MASM to make "maxLen"
; duplicate copies of a byte, each of which is uninitialized.

input    byte   maxLen dup (?)

        .code

        externdef printf:proc
        externdef readLine:proc

; The C++ function calling this assembly language module
; expects a function named "getTitle" that returns a pointer
; to a string as the function result. This is that function:

         public getTitle
getTitle proc

; Load address of "titleStr" into the RAX register (RAX holds
; the function return result) and return back to the caller:

         lea    rax, titleStr
         ret
getTitle endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc
        sub     rsp, 56

; Call the readLine function (written in C++) to read a line
; of text from the console.

; int readLine(char *dest, int maxLen)

; Pass a pointer to the destination buffer in the RCX register.
; Pass the maximum buffer size (max chars + 1) in EDX.
; This function ignores the readLine return result.
; Prompt the user to enter a string: lea rcx, prompt call printf ; Ensure the input string is zero-terminated (in the event ; there is an error): mov input, 0 ; Read a line of text from the user: lea rcx, input mov rdx, maxLen call readLine ; Print the string input by the user by calling printf(): lea rcx, fmtStr lea rdx, input call printf add rsp, 56 ret ; Returns to caller asmMain endp end ``` Listing 1-8: Assembly language program that returns a function result To compile and run the programs in Listings 1-7 and 1-8, use statements such as the following: ``` C:\>**ml64 /c listing1-8.asm** Microsoft (R) Macro Assembler (x64) Version 14.15.26730.0 Copyright (C) Microsoft Corporation. All rights reserved. Assembling: listing1-8.asm C:\>**cl /EHa /Felisting1-8.exe c.cpp listing1-8.obj** Microsoft (R) C/C++ Optimizing Compiler Version 19.15.26730 for x64 Copyright (C) Microsoft Corporation. All rights reserved. c.cpp Microsoft (R) Incremental Linker Version 14.15.26730.0 Copyright (C) Microsoft Corporation. All rights reserved. /out:listing1-8.exe c.obj listing1-8.obj C:\> **listing1-8** Calling Listing 1-8: Enter a string: This is a test User entered: 'This is a test' Listing 1-8 terminated ``` The `/Felisting1-8.exe` command line option tells MSVC to name the executable file *listing1-8.exe*. Without the `/Fe` option, MSVC would name the resulting executable file *c.exe* (after *c.cpp*, the generic example C++ file from Listing 1-7). ## 1.15 Automating the Build Process At this point, you’re probably thinking it’s a bit tiresome to type all these (long) command lines every time you want to compile and run your programs. This is especially true if you start adding more command line options to the `ml64` and `cl` commands. 
Consider the following commands:

```
ml64 /nologo /c /Zi /Cp listing1-8.asm
cl /nologo /O2 /Zi /utf-8 /EHa /Felisting1-8.exe c.cpp listing1-8.obj
listing1-8
```

The `/Zi` option tells MASM and MSVC to compile extra debug information into the code. The `/nologo` option tells MASM and MSVC to skip printing copyright and version information during compilation. The MASM `/Cp` option tells MASM to preserve the case of user identifiers, which makes compilations case-sensitive (so you don’t need the `option casemap:none` directive in your assembly source file). The `/O2` option tells MSVC to optimize the machine code the compiler produces. The `/utf-8` option tells MSVC to use UTF-8 Unicode encoding (which is ASCII-compatible) rather than UTF-16 encoding (or other character encoding). The `/EHa` option tells MSVC to handle processor-generated exceptions (such as memory access faults, a common exception in assembly language programs). As noted earlier, the `/Fe` option specifies the executable output filename. Typing all these command line options every time you want to build a sample program is going to be a lot of work.

The easy solution is to create a batch file that automates this process. You could, for example, type the three previous command lines into a text file, name it *l8.bat*, and then simply type `l8` at the command line to automatically execute those three commands. That saves a lot of typing and is much quicker (and less error-prone) than typing these three commands every time you want to compile and run the program.

The only drawback to putting those three commands into a batch file is that the batch file is specific to the *listing1-8.asm* source file, and you would have to create a new batch file to compile other programs. Fortunately, it is easy to create a batch file that will work with any single assembly source file that compiles and links with the generic *c.cpp* program.
Consider the following *build.bat* batch file: ``` echo off ml64 /nologo /c /Zi /Cp %1.asm cl /nologo /O2 /Zi /utf-8 /EHa /Fe%1.exe c.cpp %1.obj ``` The `%1` item in these commands tells the Windows command line processor to substitute a command line parameter (specifically, command line parameter number 1) in place of the `%1`. If you type the following from the command line ``` build listing1-8 ``` then Windows executes the following three commands: ``` echo off ml64 /nologo /c /Zi /Cp listing1-8.asm cl /nologo /O2 /Zi /utf-8 /EHa /Felisting1-8.exe c.cpp listing1-8.obj ``` With this *build.bat* file, you can compile several projects simply by specifying the assembly language source file name (without the *.asm* suffix) on the build command line. The *build.bat* file does not run the program after compiling and linking it. You could add this capability to the batch file by appending a single line containing `%1` to the end of the file. However, that would always attempt to run the program, even if the compilation failed because of errors in the C++ or assembly language source files. For that reason, it’s probably better to run the program manually after building it with the batch file, as follows: ``` C:\>**build listing1-8** C:\>**listing1-8** ``` A little extra typing, to be sure, but safer in the long run. Microsoft provides another useful tool for controlling compilations from the command line: *makefiles*. They are a better solution than batch files because makefiles allow you to conditionally control steps in the process (such as running the executable) based on the success of earlier steps. However, using Microsoft’s make program (*nmake.exe*) is beyond the scope of this chapter. It’s a good tool to learn (and Chapter 15 will teach you the basics). However, batch files are sufficient for the simple projects appearing throughout most of this book and require little extra knowledge or training to use. 
If you are interested in learning more about makefiles, see Chapter 15 or “For More Information” (section 1.17).

## 1.16 Microsoft ABI Notes

As noted earlier (see “Returning Function Results in Assembly Language,” section 1.14), the Microsoft ABI is a contract between modules in a program to ensure compatibility (between modules, especially modules written in different programming languages).^(10) In this book, the C++ programs will be calling assembly language code, and the assembly modules will be calling C++ code, so it’s important that the assembly language code adhere to the Microsoft ABI.

Even if you were to write stand-alone assembly language code, it would still be calling C++ code, as it would (undoubtedly) need to make Windows *application programming interface (API)* calls. The Windows API functions are all written in C++, so calls to Windows must respect the Windows ABI.

Because following the Microsoft ABI is so important, each chapter in this book (if appropriate) includes a section at the end discussing those components of the Microsoft ABI that the chapter introduces or heavily uses. This section covers several concepts from the Microsoft ABI: variable size, register usage, and stack alignment.

### 1.16.1 Variable Size

Although dealing with different data types in assembly language is completely up to the assembly language programmer (and the choice of machine instructions to use on that data), it’s crucial to maintain the size of the data (in bytes) between the C++ and assembly language programs. Table 1-6 lists several common C++ data types and the corresponding assembly language types (that maintain the size information).
Table 1-6: C++ and Assembly Language Types

| **C++ type** | **Size (in bytes)** | **Assembly language type** |
| --- | --- | --- |
| `char` | 1 | `sbyte` |
| `signed char` | 1 | `sbyte` |
| `unsigned char` | 1 | `byte` |
| `short int` | 2 | `sword` |
| `short unsigned` | 2 | `word` |
| `int` | 4 | `sdword` |
| `unsigned (unsigned int)` | 4 | `dword` |
| `long` | 4 | `sdword` |
| `long int` | 4 | `sdword` |
| `long unsigned` | 4 | `dword` |
| `long long int` | 8 | `sqword` |
| `long long unsigned` | 8 | `qword` |
| `__int64` | 8 | `sqword` |
| `unsigned __int64` | 8 | `qword` |
| `float` | 4 | `real4` |
| `double` | 8 | `real8` |
| `pointer` (for example, `void *`) | 8 | `qword` |

Although MASM provides signed type declarations (`sbyte`, `sword`, `sdword`, and `sqword`), assembly language instructions do not differentiate between the unsigned and signed variants. You could process a signed integer (`sdword`) by using unsigned instruction sequences, and you could process an unsigned integer (`dword`) by using signed instruction sequences. In an assembly language source file, these different directives mainly serve as a documentation aid to help describe the programmer’s intentions.^(11)

Listing 1-9 is a simple program that verifies the sizes of each of these C++ data types.
``` // Listing 1-9 // A simple C++ program that demonstrates Microsoft C++ data // type sizes: #include <stdio.h> int main(void) { char v1; unsigned char v2; short v3; short int v4; short unsigned v5; int v6; unsigned v7; long v8; long int v9; long unsigned v10; long long int v11; long long unsigned v12; __int64 v13; unsigned __int64 v14; float v15; double v16; void * v17; printf ( "Size of char: %2zd\n" "Size of unsigned char: %2zd\n" "Size of short: %2zd\n" "Size of short int: %2zd\n" "Size of short unsigned: %2zd\n" "Size of int: %2zd\n" "Size of unsigned: %2zd\n" "Size of long: %2zd\n" "Size of long int: %2zd\n" "Size of long unsigned: %2zd\n" "Size of long long int: %2zd\n" "Size of long long unsigned: %2zd\n" "Size of __int64: %2zd\n" "Size of unsigned __int64: %2zd\n" "Size of float: %2zd\n" "Size of double: %2zd\n" "Size of pointer: %2zd\n", sizeof v1, sizeof v2, sizeof v3, sizeof v4, sizeof v5, sizeof v6, sizeof v7, sizeof v8, sizeof v9, sizeof v10, sizeof v11, sizeof v12, sizeof v13, sizeof v14, sizeof v15, sizeof v16, sizeof v17 ); } ``` Listing 1-9: Output sizes of common C++ data types Here’s the build command and output from Listing 1-9: ``` C:\>**cl listing1-9.cpp** Microsoft (R) C/C++ Optimizing Compiler Version 19.15.26730 for x64 Copyright (C) Microsoft Corporation. All rights reserved. listing1-9.cpp Microsoft (R) Incremental Linker Version 14.15.26730.0 Copyright (C) Microsoft Corporation. All rights reserved. 
/out:listing1-9.exe
listing1-9.obj

C:\>**listing1-9**
Size of char: 1
Size of unsigned char: 1
Size of short: 2
Size of short int: 2
Size of short unsigned: 2
Size of int: 4
Size of unsigned: 4
Size of long: 4
Size of long int: 4
Size of long unsigned: 4
Size of long long int: 8
Size of long long unsigned: 8
Size of __int64: 8
Size of unsigned __int64: 8
Size of float: 4
Size of double: 8
Size of pointer: 8
```

### 1.16.2 Register Usage

*Register usage* in an assembly language procedure (including the main assembly language function) is also subject to certain Microsoft ABI rules. Within a procedure, the Microsoft ABI has this to say about register usage:^(12)

* Code that calls a function can pass the first four (integer) arguments to the function (procedure) in the RCX, RDX, R8, and R9 registers, respectively. Programs pass the first four floating-point arguments in XMM0, XMM1, XMM2, and XMM3.
* Registers RAX, RCX, RDX, R8, R9, R10, and R11 are *volatile*, which means that the function/procedure does not need to save the registers’ values across a function/procedure call.
* XMM0/YMM0 through XMM5/YMM5 are also volatile. The function/procedure does not need to preserve these registers across a call.
* RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are nonvolatile registers. A procedure/function must preserve these registers’ values across a call. If a procedure modifies one of these registers, it must save the register’s value before the first such modification and restore the register’s value from the saved location prior to returning from the function/procedure.
* XMM6 through XMM15 are nonvolatile. A function must preserve these registers across a function/procedure call (that is, when a procedure returns, these registers must contain the same values they had upon entry to that procedure).
* Programs that use the x86-64’s floating-point coprocessor instructions must preserve the value of the floating-point control word across procedure calls.
Such procedures should also leave the floating-point stack cleared. * Any procedure/function that uses the x86-64’s direction flag must leave that flag cleared upon return from the procedure/function. Microsoft C++ expects function return values to appear in one of two places. Integer (and other non-scalar) results come back in the RAX register (up to 64 bits). If the return type is smaller than 64 bits, the upper bits of the RAX register are undefined—for example, if a function returns a short int (16-bit) result, bits 16 to 63 in RAX may contain garbage. Microsoft’s ABI specifies that floating-point (and vector) function return results shall come back in the XMM0 register. ### 1.16.3 Stack Alignment Some “magic” instructions appear in various source listings throughout this chapter (they basically add or subtract values from the RSP register). These instructions have to do with stack alignment (as required by the Microsoft ABI). This chapter (and several that follow) supply these instructions in the code without further explanation. For more details on the purpose of these instructions, see Chapter 5. ## 1.17 For More Information This chapter has covered a lot of ground! While you still have a lot to learn about assembly language programming, this chapter, combined with your knowledge of HLLs (especially C/C++), provides just enough information to let you start writing real assembly language programs. Although this chapter covered many topics, the three primary ones of interest are the x86-64 CPU architecture, the syntax for simple MASM programs, and interfacing with the C Standard Library. 
The following resources provide more information about makefiles: * Wikipedia: [`en.wikipedia.org/wiki/Make_(software)`](https://en.wikipedia.org/wiki/Make_(software)) * *Managing Projects with GNU Make* by Robert Mecklenburg (O’Reilly Media, 2004) * *The GNU Make Book,* First Edition, by John Graham-Cumming (No Starch Press, 2015) * *Managing Projects with make*, by Andrew Oram and Steve Talbott (O’Reilly & Associates, 1993) For more information about MVSC: * *Microsoft Visual Studio websites: [`visualstudio.microsoft.com/`](https://visualstudio.microsoft.com/ ) and [`visualstudio.microsoft.com/vs/`](https://visualstudio.microsoft.com/vs/)* ** *Microsoft free developer offers: [`visualstudio.microsoft.com/free-developer-offers/`](https://visualstudio.microsoft.com/free-developer-offers/)** **For more information about MASM: * *Microsoft, C++, C, and Assembler documentation: [`docs.microsoft.com/en-us/cpp/assembler/masm/masm-for-x64-ml64-exe?view=msvc-160/`](https://docs.microsoft.com/en-us/cpp/assembler/masm/masm-for-x64-ml64-exe?view=msvc-160/)* ** *Waite Group MASM Bible (covers MASM 6, which is 32-bit only, but still contains lots of useful information about MASM): [`www.amazon.com/Waite-Groups-Microsoft-Macro-Assembler/dp/0672301555/`](https://www.amazon.com/Waite-Groups-Microsoft-Macro-Assembler/dp/0672301555/)** **For more information about the ABI: * The best documentation comes from Agner Fog’s website: [`www.agner.org/optimize/`](https://www.agner.org/optimize/). * Microsoft’s website also has information on Microsoft ABI calling conventions (see [`docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-160`](https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-160) or search for *Microsoft calling conventions*). ## 1.18 Test Yourself 1. What is the name of the Windows command line interpreter program? 2. What is the name of the MASM executable program file? 3. What are the names of the three main system buses? 4. 
Which register(s) overlap the RAX register? 5. Which register(s) overlap the RBX register? 6. Which register(s) overlap the RSI register? 7. Which register(s) overlap the R8 register? 8. Which register holds the condition code bits? 9. How many bytes are consumed by the following data types? 1. `word` 2. `dword` 3. `oword` 4. `qword` with a `4 dup (?)` operand 5. `real8` 10. If an 8-bit (byte) memory variable is the destination operand of a `mov` instruction, what source operands are legal? 11. If a `mov` instruction’s destination operand is the EAX register, what is the largest constant (in bits) you can load into that register? 12. For the `add` instruction, fill in the largest constant size (in bits) for all the destination operands specified in the following table: | **Destination** | **Constant size** | | RAX | | | EAX | | | AX | | | AL | | | AH | | | mem[32] | | | mem[64] | | 13. What is the destination (register) operand size for the `lea` instruction? 14. What is the source (memory) operand size of the `lea` instruction? 15. What is the name of the assembly language instruction you use to call a procedure or function? 16. What is the name of the assembly language instruction you use to return from a procedure or function? 17. What does *ABI* stand for? 18. In the Windows ABI, where do you return the following function return results? 1. 8-bit byte values 2. 16-bit word values 3. 32-bit integer values 4. 64-bit integer values 5. Floating-point values 6. 64-bit pointer values 19. Where do you pass the first parameter to a Microsoft ABI–compatible function? 20. Where do you pass the second parameter to a Microsoft ABI–compatible function? 21. Where do you pass the third parameter to a Microsoft ABI–compatible function? 22. Where do you pass the fourth parameter to a Microsoft ABI–compatible function? 23. What assembly language data type corresponds to a C/C++ `long int`? 24. 
What assembly language data type corresponds to a C/C++ `long long unsigned`?****** ```` `````****

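Chapter 2: Computer Data Representation and Operations

A major stumbling block for many beginners learning assembly language is the common use of the binary and hexadecimal numbering systems. Although hexadecimal numbers are a little strange at first, their advantages far outweigh their disadvantages. Understanding the binary and hexadecimal numbering systems is important because their use simplifies the discussion of other topics, including bit operations, signed numeric representation, character codes, and packed data.

This chapter discusses several important concepts, including the following:

The binary and hexadecimal numbering systems
Binary data organization (bits, nibbles, bytes, words, and double words)
Signed and unsigned numbering systems
Arithmetic, logical, shift, and rotate operations on binary values
Bit fields and packed data
Floating-point and binary-coded decimal formats
Character data

This is basic material, and the remainder of this text depends on your understanding of these concepts. If you have already met these terms in other courses or study, you should at least skim this material before proceeding to the next chapter. If you are unfamiliar with this material, or only vaguely familiar with it, you should study it carefully before proceeding. All of the material in this chapter is important! Do not skip over any of it.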

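2.1 Numbering Systems

Most modern computer systems do not represent numeric values by using the decimal (base-10) system. Instead, they typically use a binary, or two's complement, numbering system.

2.1.1 A Review of the Decimal System

You have been using the decimal numbering system for so long that you probably take it for granted. When you see a number like 123, you don't think about the value 123; rather, you generate a mental image of how many items this value represents. In reality, however, the number 123 represents the following:

(1 × 10²) + (2 × 10¹) + (3 × 10⁰)
or
100 + 20 + 3

In a decimal positional numbering system, each digit to the left of the decimal point represents a value between 0 and 9 times an increasing power of 10. Each digit to the right of the decimal point represents a value between 0 and 9 times an increasingly negative power of 10. For example, the value 123.456 means this:

(1 × 10²) + (2 × 10¹) + (3 × 10⁰) + (4 × 10^(-1)) + (5 × 10^(-2)) + (6 × 10^(-3))
or
100 + 20 + 3 + 0.4 + 0.05 + 0.006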

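2.1.2 The Binary Numbering System

Most modern computer systems operate using binary logic. The computer represents values by using two voltage levels (usually 0 V and +2.4 to 5 V). These two levels can represent exactly two unique values. These could be any two different values, but they typically represent the values 0 and 1, the two digits in the binary numbering system.

The binary numbering system works just like the decimal numbering system, except binary allows only the digits 0 and 1 (rather than 0 to 9) and uses powers of 2 rather than powers of 10. Therefore, converting a binary number to decimal is easy. For each 1 in a binary string, add in 2^(n), where n is the zero-based position of the binary digit. For example, the binary value 11001010[2] represents the following:

(1 × 2⁷) + (1 × 2⁶) + (0 × 2⁵) + (0 × 2⁴) + (1 × 2³) + (0 × 2²) + (1 × 2¹) + (0 × 2⁰)
=
128[10] + 64[10] + 8[10] + 2[10]
=
202[10]

Converting decimal to binary is slightly more difficult. You must find those powers of 2 that, when added together, produce the decimal result.

A simple way to convert decimal to binary is the even/odd, divide-by-two algorithm. This algorithm uses the following steps:

1. If the number is even, emit a 0. If the number is odd, emit a 1.
2. Divide the number by 2 and throw away any fractional component or remainder.
3. If the quotient is 0, the algorithm is complete.
4. If the quotient is not 0 and is odd, insert a 1 before the current string; if the quotient is even, prefix the binary string with 0.
5. Go back to step 2 and repeat.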
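The steps above can be sketched in C. This is only an illustration: `to_binary` is a hypothetical helper name, not something from the book's source files, and the caller is assumed to supply a buffer of at least 65 bytes.

```c
#include <string.h>

/* Convert value to a string of binary digits by repeatedly testing
   even/odd and dividing by 2. Each new digit is placed in front of
   the digits produced so far. buf must hold at least 65 bytes.
   (to_binary is a hypothetical name used only for this sketch.) */
char *to_binary(unsigned long long value, char *buf)
{
    char tmp[65];
    int pos = 64;

    tmp[pos] = '\0';
    do {
        tmp[--pos] = (value & 1) ? '1' : '0'; /* odd emits 1, even emits 0 */
        value /= 2;                           /* discard the remainder     */
    } while (value != 0);                     /* stop when quotient is 0   */

    strcpy(buf, &tmp[pos]);
    return buf;
}
```

Calling `to_binary(202, buf)` walks through the quotients 101, 50, 25, 12, 6, 3, 1, and 0, building the digit string from right to left.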

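Binary numbers, although they have little importance in high-level languages, appear everywhere in assembly language programs. So you should be comfortable with them.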

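2.1.3 Binary Conventions

In the purest sense, every binary number contains an infinite number of digits (or bits, which is short for binary digits). For example, we can represent the number 5 in any of the following ways:

101 00000101 0000000000101 . . . 000000000000101

Any number of leading-zero digits may precede the binary number without changing its value. Because the x86-64 typically works with groups of 8 bits, we'll zero-extend all binary numbers to a multiple of 4 or 8 bits. Following this convention, we'd represent the number 5 as 0101[2] or 00000101[2].

To make larger numbers easier to read, we will separate each group of 4 binary bits with an underscore. For example, we will write the binary value 1010111110110010 as 1010_1111_1011_0010.

We'll number each bit as follows:

The rightmost bit in a binary number is bit position 0.
Each bit to the left is given the next successive bit number.

An 8-bit binary value uses bits 0 to 7:

X[7] X[6] X[5] X[4] X[3] X[2] X[1] X[0]

A 16-bit binary value uses bit positions 0 to 15:

X[15] X[14] X[13] X[12] X[11] X[10] X[9] X[8] X[7] X[6] X[5] X[4] X[3] X[2] X[1] X[0]

A 32-bit binary value uses bit positions 0 to 31, and so on.

Bit 0 is the low-order (LO) bit; some refer to this as the least significant bit. The leftmost bit is called the high-order (HO) bit, or the most significant bit. We'll refer to the intermediate bits by their respective bit numbers.

In MASM, you can specify binary values as a string of 0 or 1 digits ending with the character b. Remember, MASM does not allow underscores in binary numbers.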

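2.2 The Hexadecimal Numbering System

Unfortunately, binary numbers are verbose. Representing the value 202[10] requires eight binary digits but only three decimal digits. When dealing with large values, binary numbers quickly become unwieldy. Unfortunately, the computer "thinks" in binary, so most of the time using the binary numbering system is convenient. Although we can convert between decimal and binary, the conversion is not a trivial task.

The hexadecimal (base-16) numbering system solves many of the problems inherent in the binary system: hexadecimal numbers are compact, and converting them to binary and back is simple. Therefore, most engineers use the hexadecimal numbering system.

Because the radix (base) of a hexadecimal number is 16, each hexadecimal digit to the left of the hexadecimal point represents a value times successive powers of 16. For example, the number 1234[16] is equal to this:

(1 × 16³) + (2 × 16²) + (3 × 16¹) + (4 × 16⁰)
or
4096 + 512 + 48 + 4 = 4660[10]

Each hexadecimal digit can represent one of 16 values between 0 and 15[10]. Because there are only 10 decimal digits, we need 6 additional digits to represent the values in the range 10[10] to 15[10]. Rather than create new symbols for these digits, we use the letters A to F. The following are all examples of valid hexadecimal numbers:

1234[16] DEAD[16] BEEF[16] 0AFB[16] F001[16] D8B4[16]

Because we'll often need to enter hexadecimal numbers into the computer system, and on most computer systems you cannot enter a subscript to denote the radix of the associated value, we need a different mechanism for representing hexadecimal numbers. We'll adopt the following MASM conventions:

All hexadecimal values begin with a numeric character and have an h suffix; for example, 123A4h and 0DEADh.
All binary values end with a b character; for example, 10010b.
Decimal numbers do not have a suffix character.
If the radix is clear from the context, this book may drop the trailing h or b character.

Here are some examples of valid hexadecimal numbers using MASM notation:

1234h 0DEADh 0BEEFh 0AFBh 0F001h 0D8B4h

As you can see, hexadecimal numbers are compact and easy to read. In addition, you can easily convert between hexadecimal and binary. Table 2-1 provides all the information you need to convert any hexadecimal number into a binary number, or vice versa.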

Table 2-1: Binary/Hexadecimal Conversion

Binary	Hexadecimal
0000	0
0001	1
0010	2
0011	3
0100	4
0101	5
0110	6
0111	7
1000	8
1001	9
1010	A
1011	B
1100	C
1101	D
1110	E
1111	F

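To convert a hexadecimal number into a binary number, substitute the corresponding 4 bits for each hexadecimal digit in the number. For example, to convert 0ABCDh into a binary value, convert each hexadecimal digit according to Table 2-1, as shown here:

A	B	C	D	Hexadecimal
1010	1011	1100	1101	Binary

Converting a binary number into hexadecimal format is almost as easy:

1. Pad the binary number with 0s to make sure it contains a multiple of 4 bits. For example, given the binary number 1011001010, add 2 bits to the left of the number so that it contains 12 bits: 001011001010.
2. Separate the binary value into groups of 4 bits; for example, 0010_1100_1010.
3. Look up these binary values in Table 2-1 and substitute the appropriate hexadecimal digits: 2CAh.

Contrast this with the difficulty of converting between decimal and binary, or between decimal and hexadecimal!

Because converting between hexadecimal and binary is an operation you will need to perform over and over again, you should take a few minutes to memorize the conversion table. Even if you have a calculator that will do the conversion for you, you'll find manual conversion to be a lot faster and more convenient.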
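The pad, group, and look-up procedure for binary-to-hexadecimal conversion can be sketched in C. The function name `bin_to_hex` is hypothetical, and the sketch assumes the input contains only the characters 0 and 1 (no underscores) and that the caller supplies a large enough output buffer.

```c
#include <string.h>

/* Convert a string of '0'/'1' characters to hexadecimal digits.
   out must have room for (strlen(bits) + 3) / 4 + 1 characters.
   (bin_to_hex is a hypothetical name used only for this sketch.) */
char *bin_to_hex(const char *bits, char *out)
{
    static const char digits[] = "0123456789ABCDEF"; /* Table 2-1 as a lookup */
    size_t len = strlen(bits);
    size_t count = (4 - len % 4) % 4; /* implied leading-zero padding */
    unsigned nibble = 0;
    size_t i, o = 0;

    for (i = 0; i < len; i++) {
        nibble = (nibble << 1) | (unsigned)(bits[i] == '1');
        if (++count == 4) {           /* a full 4-bit group is ready  */
            out[o++] = digits[nibble];
            nibble = 0;
            count = 0;
        }
    }
    out[o] = '\0';
    return out;
}
```

Rather than physically inserting the padding zeros, the sketch starts the first group's bit counter at the pad length, which has the same effect.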

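2.3 A Note About Numbers vs. Representation

Many people confuse numbers and their representation. A common question beginning assembly language students ask is, "I have a binary number in the EAX register. How do I convert that to a hexadecimal number in the EAX register?" The answer is, "You don't."

Although a strong argument could be made that numbers in memory or in registers are represented in binary, it is best to view values in memory or in a register as abstract numeric quantities. Strings of symbols like 128, 80h, or 10000000b are not different numbers; they are simply different representations of the same abstract quantity that we refer to as one hundred twenty-eight. Inside the computer, a number is a number regardless of representation; the only time representation matters is when you input or output the value.

Human-readable forms of numeric quantities are always strings of characters. To print the value 128 in human-readable form, you must convert the numeric value 128 to the character sequence "1" followed by "2" followed by "8". This provides the decimal representation of the numeric quantity. If you prefer, you could convert the numeric value 128 to the character sequence 80h. It's the same number, but we've converted it to a different sequence of characters because (presumably) we wanted to view the number in hexadecimal rather than decimal. Likewise, if we want to see the number in binary, we must convert the numeric value to a string containing a 1 followed by seven 0 characters.

Pure assembly language has no generic print or write functions you can call to display numeric quantities as strings on your console. You could write your own procedures to handle this (and this book considers some of those procedures later). For the time being, the MASM code in this book relies on the C Standard Library printf() function to display numeric values. Consider the program in Listing 2-1, which displays various values along with their hexadecimal equivalents.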

; Listing 2-1

; Displays some numeric values on the console.

        option  casemap:none

nl      =       10  ; ASCII code for newline

         .data
i        qword  1
j        qword  123
k        qword  456789

titleStr byte   'Listing 2-1', 0

fmtStrI  byte   "i=%d, converted to hex=%x", nl, 0
fmtStrJ  byte   "j=%d, converted to hex=%x", nl, 0
fmtStrK  byte   "k=%d, converted to hex=%x", nl, 0

        .code
        externdef   printf:proc

; Return program title to C++ program:

         public getTitle
getTitle proc

; Load address of "titleStr" into the RAX register (RAX holds
; the function return result) and return back to the caller:

         lea rax, titleStr
         ret
getTitle endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc

; "Magic" instruction offered without explanation at this point:

        sub     rsp, 56

; Call printf three times to print the three values i, j, and k:

; printf("i=%d, converted to hex=%x\n", i, i);

        lea     rcx, fmtStrI
        mov     rdx, i
        mov     r8, rdx
        call    printf

; printf("j=%d, converted to hex=%x\n", j, j);

        lea     rcx, fmtStrJ
        mov     rdx, j
        mov     r8, rdx
        call    printf

; printf("k=%d, converted to hex=%x\n", k, k);

        lea     rcx, fmtStrK
        mov     rdx, k
        mov     r8, rdx
        call    printf

; Another "magic" instruction that undoes the effect of the previous
; one before this procedure returns to its caller.

        add     rsp, 56

        ret     ; Returns to caller

asmMain endp
        end

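Listing 2-1: Decimal-to-hexadecimal conversion program

Listing 2-1 uses the generic c.cpp program from Chapter 1 (and the generic build.bat batch file as well). You can compile and run this program by using the following commands at the command line: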

C:\>**build  listing2-1**

C:\>**echo off**
 Assembling: listing2-1.asm
c.cpp

C:\> **listing2-1**
Calling Listing 2-1:
i=1, converted to hex=1
j=123, converted to hex=7b
k=456789, converted to hex=6f855
Listing 2-1 terminated

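2.4 Data Organization

In pure mathematics, a value's representation may require an arbitrary number of bits. Computers, on the other hand, generally work with a specific number of bits. Common collections are single bits, groups of 4 bits (called nibbles), 8 bits (bytes), 16 bits (words), 32 bits (double words, or dwords), 64 bits (quad words, or qwords), 128 bits (octal words, or owords), and more.

2.4.1 Bits

The smallest unit of data on a binary computer is a single bit. With a single bit, you can represent any two distinct items. Examples include 0 or 1, true or false, and right or wrong. However, you are not limited to representing binary data types; you could use a single bit to represent the numbers 723 and 1245 or, perhaps, the colors red and blue, or even the color red and the number 3256. You can represent any two different values with a single bit, but only two values.

Different bits can represent different things. For example, you could use 1 bit to represent the values 0 and 1, while a different bit could represent the values true and false. How can you tell by looking at the bits? The answer is that you can't. This illustrates the whole idea behind computer data structures: data is what you define it to be. If you use a bit to represent a Boolean (true/false) value, then that bit (by your definition) represents true or false. However, you must be consistent. If you're using a bit to represent true or false at one point in your program, you shouldn't use that value to represent red or blue later.

2.4.2 Nibbles

A nibble is a collection of 4 bits. With a nibble, we can represent up to 16 distinct values because a string of 4 bits has 16 unique combinations:

0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111

The nibble is an interesting data structure because it takes 4 bits to represent a single binary-coded decimal (BCD) digit^1 or a single hexadecimal digit. In the case of hexadecimal numbers, the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F are each represented with 4 bits. BCD uses 10 different digits (0, 1, 2, 3, 4, 5, 6, 7, 8, and 9) and also requires 4 bits (because we can represent only 8 different values with 3 bits; the 6 additional values we can represent with 4 bits are never used in BCD representation). In fact, any 16 distinct values can be represented with a nibble, though hexadecimal and BCD digits are the primary items we represent with a single nibble.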

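2.4.3 Bytes

Without question, the most important data structure used by the x86-64 microprocessor is the byte, which consists of 8 bits. Main memory and I/O addresses on the x86-64 are all byte addresses. This means that the smallest item that an x86-64 program can individually access is an 8-bit value. To access anything smaller, you must read the byte containing the data and mask out the unwanted bits. The bits in a byte are normally numbered from 0 to 7, as shown in Figure 2-1.

Figure 2-1: Bit numbering

Bit 0 is the LO bit, or least significant bit, and bit 7 is the HO bit, or most significant bit, of the byte. We'll refer to all other bits by their number.

A byte contains exactly two nibbles (see Figure 2-2).

Figure 2-2: The two nibbles in a byte

Bits 0 to 3 make up the low-order nibble, and bits 4 to 7 form the high-order nibble. Because a byte contains exactly two nibbles, byte values require two hexadecimal digits.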
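As a small illustration of the nibble layout, the following C sketch splits a byte into its two nibbles and reassembles it. The helper names `lo_nibble`, `hi_nibble`, and `make_byte` are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Bits 0 to 3: the low-order nibble (the rightmost hexadecimal digit). */
uint8_t lo_nibble(uint8_t b)
{
    return b & 0x0F;
}

/* Bits 4 to 7: the high-order nibble (the leftmost hexadecimal digit). */
uint8_t hi_nibble(uint8_t b)
{
    return (uint8_t)(b >> 4);
}

/* Rebuild a byte from a high-order and a low-order nibble. */
uint8_t make_byte(uint8_t hi, uint8_t lo)
{
    return (uint8_t)((hi << 4) | (lo & 0x0F));
}
```

For the byte 0A7h, the high-order nibble is 0Ah and the low-order nibble is 7, which matches the two hexadecimal digits of the value.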

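Because a byte contains 8 bits, it can represent 2⁸ (256) different values. Generally, we'll use a byte to represent numeric values in the range 0 through 255, signed numbers in the range –128 through +127 (see "Signed and Unsigned Numbers" in section 2.7), ASCII IBM character codes, and other special data types requiring no more than 256 different values. Many data types have fewer than 256 items, so 8 bits are usually sufficient.

Because the x86-64 is a byte-addressable machine, it's more efficient to manipulate a whole byte than an individual bit or nibble. So it's more efficient to use a whole byte to represent data types that require no more than 256 items, even if fewer than 8 bits would suffice.

Probably the most important use for a byte is holding a character value. Characters typed at the keyboard, displayed on the screen, and printed on the printer all have numeric values. To communicate with the rest of the world, PCs typically use a variant of the ASCII character set or the Unicode character set. The ASCII character set has 128 defined codes.

Bytes are also the smallest variable you can create in a MASM program. To create an arbitrary byte variable, you should use the byte data type, as follows: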

 .data
byteVar  byte ?

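The byte data type is a partially untyped data type. The only type information associated with a byte object is its size (1 byte).^(2) You can store any 8-bit value (small signed integers, small unsigned integers, characters, and the like) into a byte variable. It is up to you to keep track of the type of object you've put into a byte variable.

2.4.4 Words

A word is a group of 16 bits. We'll number the bits in a word from 0 to 15, as Figure 2-3 shows. As with the byte, bit 0 is the low-order bit. For words, bit 15 is the high-order bit. When referencing the other bits in a word, we'll use their bit position number.

Figure 2-3: Bit numbers in a word

A word contains exactly 2 bytes (and, therefore, four nibbles). Bits 0 to 7 form the low-order byte, and bits 8 to 15 form the high-order byte (see Figures 2-4 and 2-5).

Figure 2-4: The 2 bytes in a word

Figure 2-5: Nibbles in a word

With 16 bits, you can represent 2¹⁶ (65,536) values. These could be the values in the range 0 to 65,535 or, as is usually the case, the signed values –32,768 to +32,767, or any other data type with no more than 65,536 values.

The three major uses for words are short signed integer values, short unsigned integer values, and Unicode characters. Unsigned numeric values are represented by the binary value corresponding to the bits in the word. Signed numeric values use the two's complement form (see "Signed and Unsigned Numbers" in section 2.7). As Unicode characters, words can represent up to 65,536 characters, allowing the use of non-Roman character sets in a computer program. Unicode is an international standard, like ASCII, that allows computers to process non-Roman characters such as Kanji, Greek, and Russian characters.

As with bytes, you can also create word variables in a MASM program. To create an arbitrary word variable, use the word data type as follows: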

 .data
w        word  ?

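2.4.5 Double Words

A double word is exactly what its name indicates: a pair of words. Therefore, a double-word quantity is 32 bits long, as shown in Figure 2-6.

Figure 2-6: Bit numbers in a double word

Naturally, this double word can be divided into a high-order word and a low-order word, 4 bytes, or eight different nibbles (see Figure 2-7).

Double words (dwords) can represent all kinds of things. A common use of a double word is to represent a 32-bit integer value (which allows unsigned numbers in the range 0 to 4,294,967,295 or signed numbers in the range –2,147,483,648 to 2,147,483,647). 32-bit floating-point values also fit into a double word.

Figure 2-7: Nibbles, bytes, and words in a double word

You can create an arbitrary double-word variable by using the dword data type, as the following example demonstrates: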

 .data
d     dword  ?

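2.4.6 Quad Words and Octal Words

Quad-word (64-bit) values are also important because 64-bit integers, pointers, and certain floating-point data types require 64 bits. Likewise, the SSE/MMX instruction set of modern x86-64 processors can manipulate 64-bit values. In a similar vein, octal-word (128-bit) values are important because the AVX/SSE instruction set can manipulate 128-bit values. MASM allows the declaration of 64-bit and 128-bit values by using the qword and oword types, as follows: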

 .data
o     oword ?
q     qword ?

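You may not directly manipulate 128-bit integer objects by using standard instructions like mov, add, and sub because the standard x86-64 integer registers process only 64 bits at a time. In Chapter 8, you will see how to manipulate these extended-precision values; Chapter 11 describes how to directly manipulate oword values by using SIMD instructions.

2.5 Logical Operations on Bits

We'll do four primary logical operations (Boolean functions) with hexadecimal and binary numbers: AND, OR, XOR (exclusive-or), and NOT.

2.5.1 The AND Operation

The logical AND operation is a dyadic operation (meaning it accepts exactly two operands).^3 These operands are individual binary bits. The AND operation is shown here: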

0 and 0 = 0
0 and 1 = 0
1 and 0 = 0
1 and 1 = 1

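A compact way to represent the logical AND operation is with a truth table. A truth table takes the form shown in Table 2-2.

Table 2-2: AND Truth Table

AND	0	1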
0	0	0
1	0	1

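This is just like the multiplication tables you've encountered in school. The values in the left column correspond to the left operand of the AND operation. The values in the top row correspond to the right operand of the AND operation. The value located at the intersection of the row and column (for a particular pair of input values) is the result of logically ANDing those two values together.

In English, the logical AND operation is, "If the first operand is 1 and the second operand is 1, the result is 1; otherwise, the result is 0." We could also state this as, "If either or both operands are 0, the result is 0."

You can use the logical AND operation to force a 0 result: if one of the operands is 0, the result is always 0 regardless of the other operand. In Table 2-2, for example, the row labeled with a 0 input contains only 0s, and the column labeled with a 0 contains only 0s. Conversely, if one operand contains a 1, the result is exactly the value of the second operand. These results of the AND operation are important, particularly when we want to force bits to 0. We will investigate these uses of the logical AND operation in the next section.

2.5.2 The OR Operation

The logical OR operation is also a dyadic operation. Its definition is as follows: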

0 or 0 = 0
0 or 1 = 1
1 or 0 = 1
1 or 1 = 1

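Table 2-3 shows the truth table for the OR operation.

Table 2-3: OR Truth Table

OR	0	1
0	0	1
1	1	1

Colloquially, the logical OR operation is, "If the first operand or the second operand (or both) is 1, the result is 1; otherwise, the result is 0." This is also known as the inclusive-or operation.

If one of the operands to the logical OR operation is a 1, the result is always 1 regardless of the second operand's value. If one operand is 0, the result is always the value of the second operand. Like the logical AND operation, this is an important side effect of the logical OR operation that will prove quite useful.

Note that there is a difference between this form of the inclusive logical OR operation and the standard English meaning. Consider the sentence "I am going to the store or I am going to the park." Such a statement implies that the speaker is going to the store or to the park, but not to both places. Therefore, the English version of logical OR is slightly different from the inclusive-or operation; indeed, this is the definition of the exclusive-or operation.

2.5.3 The XOR Operation

The logical XOR (exclusive-or) operation is also a dyadic operation. Its definition follows: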

0 xor 0 = 0
0 xor 1 = 1
1 xor 0 = 1
1 xor 1 = 0

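Table 2-4 shows the truth table for the XOR operation.

Table 2-4: XOR Truth Table

XOR	0	1
0	0	1
1	1	0

In English, the logical XOR operation is, "If the first operand or the second operand, but not both, is 1, the result is 1; otherwise, the result is 0." The exclusive-or operation is closer to the English meaning of the word or than is the logical OR operation.

If one of the operands to the logical exclusive-or operation is a 1, the result is always the inverse of the other operand; that is, if one operand is 1, the result is 0 if the other operand is 1, and the result is 1 if the other operand is 0. If the first operand contains a 0, the result is exactly the value of the second operand. This feature lets you selectively invert bits in a bit string.

2.5.4 The NOT Operation

The logical NOT operation is a monadic operation (meaning it accepts only one operand):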

not 0 = 1
not 1 = 0

The truth table for the NOT operation appears in Table 2-5.

Table 2-5: NOT Truth Table

NOT	0	1
	1	0

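2.6 Logical Operations on Binary Numbers and Bit Strings

The previous section defines the logical functions for single-bit operands. Because the x86-64 uses groups of 8, 16, 32, 64, or more bits,^(4) we need to extend the definition of these functions to deal with operands wider than a single bit.

Logical functions on the x86-64 operate on a bit-by-bit (or bitwise) basis. Given two values, these functions operate on bit 0 of each value, producing bit 0 of the result; then they operate on bit 1 of the input values, producing bit 1 of the result, and so on. For example, if you want to compute the logical AND of the following two 8-bit numbers, you would perform the logical AND operation on each column independently of the others: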

1011_0101b
1110_1110b
----------
1010_0100b

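You may apply this bit-by-bit calculation to the other logical functions as well.

To perform a logical operation on two hexadecimal numbers, you should convert them to binary first.

The ability to force bits to 0 or 1 by using the logical AND/OR operations, and the ability to invert bits by using the logical XOR operation, is very important when working with strings of bits (for example, binary numbers). These operations let you selectively manipulate certain bits within a bit string while leaving other bits unaffected.

For example, if you have an 8-bit binary value X and you want to guarantee that bits 4 to 7 contain 0s, you could logically AND the value X with the binary value 0000_1111b. This bitwise logical AND operation would force the HO 4 bits to 0 and pass the LO 4 bits of X unchanged. Likewise, you could force the LO bit of X to 1 and invert bit 2 of X by logically ORing X with 0000_0001b and logically XORing X with 0000_0100b, respectively.

Using the logical AND, OR, and XOR operations to manipulate bit strings in this fashion is known as masking bit strings. We use the term masking because we can use certain values (1 for AND, 0 for OR/XOR) to mask out or mask in certain bits from the operation when forcing bits to 0, 1, or their inverse.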
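The three masking operations just described map directly onto C's bitwise operators, as the following sketch shows. The function names are hypothetical; only the mask values 0000_1111b, 0000_0001b, and 0000_0100b come from the text.

```c
#include <stdint.h>

/* AND with 0000_1111b: force bits 4 to 7 to 0, pass bits 0 to 3 through. */
uint8_t clear_high_nibble(uint8_t x)
{
    return x & 0x0F;
}

/* OR with 0000_0001b: force the LO bit to 1, leave the other bits alone. */
uint8_t set_bit0(uint8_t x)
{
    return x | 0x01;
}

/* XOR with 0000_0100b: invert bit 2, leave the other bits alone. */
uint8_t toggle_bit2(uint8_t x)
{
    return x ^ 0x04;
}
```

In each case, the mask bits that must not disturb X are 1 for AND and 0 for OR and XOR, which is exactly the rule given above.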

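The x86-64 CPUs support four instructions that apply these bitwise logical operations to their operands. The instructions are and, or, xor, and not. The and, or, and xor instructions use the same syntax as the add and sub instructions: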

and  `dest`, `source`
or   `dest`, `source`
xor  `dest`, `source`

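These operands have the same limitations as the add operands. Specifically, the source operand has to be a constant, memory, or register operand, and the dest operand must be a memory or register operand. Also, the operands must be the same size and cannot both be memory operands. If the destination operand is 64 bits and the source operand is a constant, that constant is limited to 32 bits (or fewer), and the CPU will sign-extend the value to 64 bits (see "Sign Extension and Zero Extension" in section 2.8).

These instructions compute the obvious bitwise logical operation via the following equation:

dest = `dest` `operator` `source`

The x86-64 logical not instruction, because it has only a single operand, uses a slightly different syntax. This instruction takes the following form:

not  `dest`

This instruction computes the following result:

dest = not(`dest`)

The dest operand must be a register or memory operand. This instruction inverts all the bits in the specified destination operand.

The program in Listing 2-2 applies the and, or, xor, and not instructions to several hexadecimal values and prints the results.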

; Listing 2-2

; Demonstrate AND, OR, XOR, and NOT logical instructions.

            option  casemap:none

nl          =       10  ; ASCII code for newline

             .data
leftOp       dword   0f0f0f0fh
rightOp1     dword   0f0f0f0f0h
rightOp2     dword   12345678h

titleStr     byte   'Listing 2-2', 0

fmtStr1      byte   "%lx AND %lx = %lx", nl, 0
fmtStr2      byte   "%lx OR  %lx = %lx", nl, 0
fmtStr3      byte   "%lx XOR %lx = %lx", nl, 0
fmtStr4      byte   "NOT %lx = %lx", nl, 0

            .code
            externdef   printf:proc

; Return program title to C++ program:

            public getTitle
getTitle    proc

;  Load address of "titleStr" into the RAX register (RAX holds the
;  function return result) and return back to the caller:

            lea rax, titleStr
            ret
getTitle    endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc

; "Magic" instruction offered without explanation at this point:

            sub     rsp, 56

; Demonstrate the AND instruction:

            lea     rcx, fmtStr1
            mov     edx, leftOp
            mov     r8d, rightOp1
            mov     r9d, edx  ; Compute leftOp
            and     r9d, r8d  ; AND rightOp1
            call    printf

            lea     rcx, fmtStr1
            mov     edx, leftOp
            mov     r8d, rightOp2
            mov     r9d, r8d
            and     r9d, edx
            call    printf

; Demonstrate the OR instruction:

            lea     rcx, fmtStr2
            mov     edx, leftOp
            mov     r8d, rightOp1
            mov     r9d, edx  ; Compute leftOp
            or      r9d, r8d  ; OR rightOp1
            call    printf

            lea     rcx, fmtStr2
            mov     edx, leftOp
            mov     r8d, rightOp2
            mov     r9d, r8d
            or      r9d, edx
            call    printf

; Demonstrate the XOR instruction:

            lea     rcx, fmtStr3
            mov     edx, leftOp
            mov     r8d, rightOp1
            mov     r9d, edx  ; Compute leftOp
            xor     r9d, r8d  ; XOR rightOp1
            call    printf

            lea     rcx, fmtStr3
            mov     edx, leftOp
            mov     r8d, rightOp2
            mov     r9d, r8d
            xor     r9d, edx
            call    printf

; Demonstrate the NOT instruction:

            lea     rcx, fmtStr4
            mov     edx, leftOp
            mov     r8d, edx  ; Compute not leftOp
            not     r8d
            call    printf

            lea     rcx, fmtStr4
            mov     edx, rightOp1
            mov     r8d, edx  ; Compute not rightOp1
            not     r8d
            call    printf

            lea     rcx, fmtStr4
            mov     edx, rightOp2
            mov     r8d, edx  ; Compute not rightOp2
            not     r8d
            call    printf

; Another "magic" instruction that undoes the effect of the previous
; one before this procedure returns to its caller.

            add     rsp, 56

            ret     ; Returns to caller

asmMain     endp
            end

Listing 2-2: and, or, xor, and not example

Here's the result of building and running this code:

C:\MASM64>**build  listing2-2**

C:\MASM64>**ml64 /nologo /c /Zi /Cp  listing2-2.asm**
 Assembling: listing2-2.asm

C:\MASM64>**cl /nologo /O2 /Zi /utf-8 /Fe listing2-2.exe c.cpp  listing2-2.obj**
c.cpp

C:\MASM64> **listing2-2**
Calling Listing 2-2:
f0f0f0f AND f0f0f0f0 = 0
f0f0f0f AND 12345678 = 2040608
f0f0f0f OR  f0f0f0f0 = ffffffff
f0f0f0f OR  12345678 = 1f3f5f7f
f0f0f0f XOR f0f0f0f0 = ffffffff
f0f0f0f XOR 12345678 = 1d3b5977
NOT f0f0f0f = f0f0f0f0
NOT f0f0f0f0 = f0f0f0f
NOT 12345678 = edcba987
Listing 2-2 terminated

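By the way, you will often see the following "magic" instruction:

xor `reg`, `reg`

XORing a register with itself sets that register to 0. Except for the 8-bit registers, the xor instruction is usually more efficient than moving the immediate constant 0 into the register. Consider the following: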

xor eax, eax  ; Just 2 bytes long in machine code
mov eax, 0    ; Depending on register, often 6 bytes long

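The savings can be even greater when dealing with 64-bit registers (because a full 64-bit immediate constant 0 is itself 8 bytes long).

2.7 Signed and Unsigned Numbers

Thus far, we've treated binary numbers as unsigned values. The binary number . . . 00000 represents 0, . . . 00001 represents 1, . . . 00010 represents 2, and so on toward infinity. With n bits, we can represent 2^(n) unsigned numbers. What about negative numbers? If we assign half of the possible combinations to the negative values, and half to the positive values and 0, with n bits we can represent the signed values in the range –2^(n–1) to +2^(n–1) – 1. So we can represent the negative values –128 to –1 and the non-negative values 0 to 127 with a single 8-bit byte. With a 16-bit word, we can represent values in the range –32,768 to +32,767. With a 32-bit double word, we can represent values in the range –2,147,483,648 to +2,147,483,647.

In mathematics (and computer science), the complement method encodes negative and non-negative (positive plus zero) numbers into two equal sets in such a way that they can use the same algorithm (or hardware) to perform addition and produce the correct result regardless of the sign.

The x86-64 microprocessor uses the two's complement notation to represent signed numbers. In this system, the HO bit of a number is a sign bit (dividing the integers into two equal sets). If the sign bit is 0, the number is positive (or zero); if the sign bit is 1, the number is negative (taking a complement form, which I'll describe in a moment). Following are some examples.

For 16-bit numbers:

8000h is negative because the HO bit is 1.
100h is positive because the HO bit is 0.
7FFFh is positive.
0FFFFh is negative.
0FFFh is positive.

If the HO bit is 0, the number is positive (or 0) and uses the standard binary format. If the HO bit is 1, the number is negative and uses the two's complement form (which is the "magic form" that supports addition of negative and non-negative numbers with no special hardware).

To convert a positive number to its negative, two's complement form, use the following algorithm:

1. Invert all the bits in the number; that is, apply the logical NOT function.
2. Add 1 to the inverted result and ignore any carry out of the HO bit.

This produces a bit pattern that satisfies the mathematical definition of the complement form. In particular, adding negative and non-negative numbers using this form produces the expected result.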

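For example, to compute the 8-bit equivalent of –5:

0000_0101b 5 (in binary).
1111_1010b Invert all the bits.
1111_1011b Add 1 to obtain the result.

If we take –5 and perform the two's complement operation on it, we get our original value, 0000_0101b, back again:

1111_1011b Two's complement for –5.
0000_0100b Invert all the bits.
0000_0101b Add 1 to obtain the result (+5).

Note that if we add +5 and –5 together (ignoring any carry out of the HO bit), we get the expected result of 0: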

      1111_1011b         Two's complement for -5
    + 0000_0101b         Invert all the bits and add 1
      ----------
  (1) 0000_0000b         Sum is zero, if we ignore carry

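The following examples provide some positive and negative 16-bit signed values:

7FFFh: +32,767, the largest 16-bit positive number
8000h: –32,768, the smallest 16-bit negative number
4000h: +16,384

To convert the preceding numbers to their negative counterpart (that is, to negate them), do the following: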

7FFFh:      0111_1111_1111_1111b   +32,767
            1000_0000_0000_0000b   Invert all the bits (8000h)
            1000_0000_0000_0001b   Add 1 (8001h or -32,767)

4000h:      0100_0000_0000_0000b   16,384
            1011_1111_1111_1111b   Invert all the bits (0BFFFh)
            1100_0000_0000_0000b   Add 1 (0C000h or -16,384)

8000h:      1000_0000_0000_0000b   -32,768
            0111_1111_1111_1111b   Invert all the bits (7FFFh)
            1000_0000_0000_0000b   Add one (8000h or -32,768)

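8000h inverted becomes 7FFFh. After adding 1, we obtain 8000h! Wait, what's going on here? –(–32,768) is –32,768? Of course not. But the value +32,768 cannot be represented with a 16-bit signed number, so we cannot negate the smallest negative value.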
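The invert-and-add-1 procedure, including the 8000h corner case, can be checked with a short C sketch. The names `negate8` and `negate16` are hypothetical; the code works on unsigned bit patterns so that the carry out of the HO bit is simply discarded by the final cast.

```c
#include <stdint.h>

/* Two's complement negation of an 8-bit pattern: invert every bit,
   add 1, and keep only the low 8 bits (discarding any carry). */
uint8_t negate8(uint8_t x)
{
    return (uint8_t)(~x + 1u);
}

/* The same operation on a 16-bit pattern. */
uint16_t negate16(uint16_t x)
{
    return (uint16_t)(~x + 1u);
}
```

Negating 05h gives 0FBh and negating 0FBh gives 05h again, while negating 8000h yields 8000h, because +32,768 has no 16-bit signed representation.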

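Usually, you will not need to perform the two's complement operation by hand. The x86-64 microprocessor provides an instruction, neg (negate), that performs this operation for you:

neg `dest`

This instruction computes dest = -`dest`; the operand must be a memory location or a register. The neg instruction operates on byte-, word-, dword-, and qword-sized objects. Because this is a signed integer operation, it makes sense to operate only on signed integer values. The program in Listing 2-3 demonstrates the two's complement operation and the neg instruction on signed 8-bit integer values.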

; Listing 2-3

; Demonstrate two's complement operation and input of numeric values.

        option  casemap:none

nl       =      10  ; ASCII code for newline
maxLen   =      256

         .data
titleStr byte   'Listing 2-3', 0

prompt1  byte   "Enter an integer between 0 and 127:", 0
fmtStr1  byte   "Value in hexadecimal: %x", nl, 0
fmtStr2  byte   "Invert all the bits (hexadecimal): %x", nl, 0
fmtStr3  byte   "Add 1 (hexadecimal): %x", nl, 0
fmtStr4  byte   "Output as signed integer: %d", nl, 0
fmtStr5  byte   "Using neg instruction: %d", nl, 0

intValue sqword ?
input    byte   maxLen dup (?)

            .code
            externdef printf:proc
            externdef atoi:proc
            externdef readLine:proc

; Return program title to C++ program:

            public getTitle
getTitle    proc
            lea rax, titleStr
            ret
getTitle    endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc

; "Magic" instruction offered without explanation at this point:

            sub     rsp, 56

; Read an unsigned integer from the user: This code will blindly
; assume that the user's input was correct. The atoi function returns
; zero if there was some sort of error on the user input. Later
; chapters in Ao64A will describe how to check for errors from the
; user.

            lea     rcx, prompt1
            call    printf

            lea     rcx, input
            mov     rdx, maxLen
            call    readLine

; Call C stdlib atoi function.

; i = atoi(str)

            lea     rcx, input
            call    atoi
            and     rax, 0ffh      ; Only keep LO 8 bits
            mov     intValue, rax

; Print the input value (in decimal) as a hexadecimal number:

            lea     rcx, fmtStr1
            mov     rdx, rax
            call    printf

; Perform the two's complement operation on the input number.
; Begin by inverting all the bits (just work with a byte here).

            mov     rdx, intValue
            not     dl             ; Only work with 8-bit values!
            lea     rcx, fmtStr2
            call    printf

; Invert all the bits and add 1 (still working with just a byte).

            mov     rdx, intValue
            not     rdx
            add     rdx, 1
            and     rdx, 0ffh      ; Only keep LO eight bits
            lea     rcx, fmtStr3
            call    printf

; Negate the value and print as a signed integer (work with a full
; integer here, because C++ %d format specifier expects a 32-bit
; integer). HO 32 bits of RDX get ignored by C++.

            mov     rdx, intValue
            not     rdx
            add     rdx, 1
            lea     rcx, fmtStr4
            call    printf

; Negate the value using the neg instruction.

            mov     rdx, intValue
            neg     rdx
            lea     rcx, fmtStr5
            call    printf

; Another "magic" instruction that undoes the effect of the previous
; one before this procedure returns to its caller.

            add     rsp, 56
            ret     ; Returns to caller
asmMain     endp
            end

Listing 2-3: Two's complement example

The following commands build and run the program in Listing 2-3:

C:\>**build  listing2-3**

C:\>**echo off**
 Assembling: listing2-3.asm
c.cpp

C:\> **listing2-3**
Calling Listing 2-3:
Enter an integer between 0 and 127:123
Value in hexadecimal: 7b
Invert all the bits (hexadecimal): 84
Add 1 (hexadecimal): 85
Output as signed integer: -123
Using neg instruction: -123
Listing 2-3 terminated

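In addition to the two's complement operation (both by inversion/add 1 and by using the neg instruction), this program demonstrates one new feature: user numeric input. Numeric input is accomplished by reading an input string from the user (using the readLine() function that is part of the c.cpp source file) and then calling the C Standard Library atoi() function. This function requires a single parameter (passed in RCX) that points to a string containing an integer value. It translates that string to the corresponding integer and returns the integer value in RAX.^(5)

2.8 Sign Extension and Zero Extension

Converting an 8-bit two's complement value to 16 bits, and conversely converting a 16-bit value to 8 bits, can be accomplished via sign extension and contraction operations.

To extend a signed value from a certain number of bits to a greater number of bits, copy the sign bit into all the additional bits in the new format. For example, to sign-extend an 8-bit number to a 16-bit number, copy bit 7 of the 8-bit number into bits 8 to 15 of the 16-bit number. To sign-extend a 16-bit number to a double word, copy bit 15 into bits 16 to 31 of the double word.

You must use sign extension when manipulating signed values of varying lengths. For example, to add a byte quantity to a word quantity, you must sign-extend the byte quantity to a word before adding the two values. Other operations (multiplication and division, in particular) may require a sign extension to 32 bits; see Table 2-6.

Table 2-6: Sign Extension

8 bits	16 bits	32 bits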
80h	0FF80h	0FFFFFF80h
28h	0028h	00000028h
9Ah	0FF9Ah	0FFFFFF9Ah
7Fh	007Fh	0000007Fh
	1020h	00001020h
	8086h	0FFFF8086h

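To extend an unsigned value to a larger one, you must zero-extend the value, as shown in Table 2-7. Zero extension is easy: just store a 0 into the HO byte(s) of the larger operand. For example, to zero-extend the 8-bit value 82h to 16 bits, you fill the HO byte with 0, yielding 0082h.

Table 2-7: Zero Extension

8 bits	16 bits	32 bits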
80h	0080h	00000080h
28h	0028h	00000028h
9Ah	009Ah	0000009Ah
7Fh	007Fh	0000007Fh
	1020h	00001020h
	8086h	00008086h
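The rules behind Tables 2-6 and 2-7 can be written out explicitly in C. This is a sketch with hypothetical function names; it manipulates unsigned bit patterns and copies the sign bit by hand rather than relying on the compiler's own integer conversions.

```c
#include <stdint.h>

/* Zero extension: the HO byte of the result is simply 0. */
uint16_t zero_extend_8_to_16(uint8_t b)
{
    return (uint16_t)b;
}

/* Sign extension: copy bit 7 of the byte into bits 8 to 15. */
uint16_t sign_extend_8_to_16(uint8_t b)
{
    return (b & 0x80) ? (uint16_t)(0xFF00u | b) : (uint16_t)b;
}

/* Sign extension: copy bit 15 of the word into bits 16 to 31. */
uint32_t sign_extend_16_to_32(uint16_t w)
{
    return (w & 0x8000u) ? (0xFFFF0000u | w) : (uint32_t)w;
}
```

Feeding the table entries through these functions reproduces the rows shown: 80h sign-extends to 0FF80h but zero-extends to 0080h, and 8086h sign-extends to 0FFFF8086h.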

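2.9 Sign Contraction and Saturation

Sign contraction, converting a value with a certain number of bits to the identical value with a fewer number of bits, is a little more troublesome. Given an n-bit number, you cannot always convert it to an m-bit number if m < n. For example, consider the value –448. As a 16-bit signed number, its hexadecimal representation is 0FE40h. The magnitude of this number is too large to fit into an 8-bit value, so you cannot sign-contract it to 8 bits (doing so would create an overflow condition).

To properly sign-contract a value, the HO bytes to discard must all contain either 0 or 0FFh, and the HO bit of your resulting value must match every bit you've removed from the number. Here are some examples (16 bits to 8 bits):

0FF80h can be sign-contracted to 80h.
0040h can be sign-contracted to 40h.
0FE40h cannot be sign-contracted to 8 bits.
0100h cannot be sign-contracted to 8 bits.

If you must convert a larger object to a smaller object, and you're willing to live with loss of precision, you can use saturation. To convert a value via saturation, you copy the larger value to the smaller value if it is within the range of the smaller object. If the larger value is outside the range of the smaller object, you clip the value by setting it to the largest (or smallest) value within the range of the smaller object.

For example, when converting a 16-bit signed integer to an 8-bit signed integer, if the 16-bit value is in the range –128 to +127, you copy the LO byte of the 16-bit object to the 8-bit object. If the 16-bit signed value is greater than +127, you clip the value to +127 and store +127 into the 8-bit object. Likewise, if the value is less than –128, you clip the final 8-bit object to –128.

Although clipping the value to the limits of the smaller object results in loss of precision, sometimes this is acceptable because the alternative is to raise an exception or otherwise reject the calculation. For many applications, such as audio or video processing, the clipped result is still recognizable, so this is a reasonable conversion.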
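Both the sign-contraction test and the saturating conversion from 16 bits to 8 bits fit in a few lines of C. The names `fits_in_int8` and `saturate_to_int8` are hypothetical, and the sketch expresses the rules as range checks on the signed value rather than as tests on the discarded HO byte.

```c
#include <stdbool.h>
#include <stdint.h>

/* A 16-bit signed value can be sign-contracted to 8 bits only if it
   already lies in the 8-bit signed range. */
bool fits_in_int8(int16_t v)
{
    return v >= -128 && v <= 127;
}

/* Saturation: copy the value if it fits; otherwise clip it to the
   nearest limit of the 8-bit signed range. */
int8_t saturate_to_int8(int16_t v)
{
    if (v > 127)
        return 127;
    if (v < -128)
        return -128;
    return (int8_t)v;
}
```

With these definitions, –448 (0FE40h) and 256 (0100h) are reported as not contractible, and saturation clips them to –128 and +127, respectively.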

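2.10 Brief Detour: An Introduction to Control Transfer Instructions

The assembly language examples thus far have limped along without making use of conditional execution (that is, the ability to make decisions while executing code). Indeed, except for the call and ret instructions, you haven't seen any way to affect the straight-line execution of assembly code.

However, this book is rapidly approaching the point where meaningful examples require the ability to conditionally execute different sections of code. This section provides a brief introduction to the subject of conditional execution and transferring control to other sections of your program.

2.10.1 The `jmp` Instruction

Perhaps the best place to start is with a discussion of the x86-64 unconditional transfer-of-control instruction: the jmp instruction. The jmp instruction takes several forms, but the most common form is

jmp `statement_label`

where statement_label is an identifier attached to a machine instruction in your .code section. The jmp instruction immediately transfers control to the statement prefaced by the label. This is semantically equivalent to a goto statement in an HLL.

Here is an example of a statement label in front of a mov instruction: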

stmtLbl: mov eax, 55

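Like all MASM symbols, statement labels have two major attributes associated with them: an address (which is the memory address of the machine instruction following the label) and a type. The type is label, which is the same type as a proc directive's identifier.

Statement labels don't have to be on the same physical source line as a machine instruction. Consider the following example: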

anotherLabel:
   mov eax, 55

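This example is semantically equivalent to the previous one. The value (address) bound to anotherLabel is the address of the machine instruction following the label. In this case, that is still the mov instruction, even though it appears on the next line (it still follows the label, with no other code-generating MASM statements in between).

Technically, you could also jump to a proc label instead of a statement label. However, the jmp instruction does not set up a return address, so if the procedure executes a ret instruction, the return location may be undefined. (Chapter 5 explores return addresses in greater detail.)

2.10.2 The Conditional Jump Instructions

Although the common form of the jmp instruction is indispensable in assembly language programs, it doesn't provide any ability to conditionally execute different sections of code, hence the name unconditional jump.^(6) Fortunately, the x86-64 CPUs provide a wide array of conditional jump instructions that, as their name suggests, allow conditional execution of code.

These instructions test the condition code bits in the FLAGS register (see "An Introduction to the Intel x86-64 CPU Family" in Chapter 1) to determine whether a branch should be taken. There are four condition code bits in the FLAGS register that these conditional jump instructions test: the carry, sign, overflow, and zero flags.^(7)

The x86-64 CPUs provide eight instructions that test these four flags, two per flag (see Table 2-8). The basic operation of the conditional jump instructions is that they test a flag to see if it is set (1) or clear (0) and branch to a target label if the test succeeds. If the test fails, the program continues execution with the next instruction following the conditional jump instruction.

Table 2-8: Conditional Jump Instructions That Test the Condition Code Flags

Instruction	Explanation
`jc` `label`	Jump if carry. Jump to the label if the carry flag is set (`1`); fall through if the carry flag is clear (`0`).
`jnc` `label`	Jump if no carry. Jump to the label if the carry flag is clear (`0`); fall through if the carry flag is set (`1`).
`jo` `label`	Jump if overflow. Jump to the label if the overflow flag is set (`1`); fall through if the overflow flag is clear (`0`).
`jno` `label`	Jump if no overflow. Jump to the label if the overflow flag is clear (`0`); fall through if the overflow flag is set (`1`).
`js` `label`	Jump if sign (negative). Jump to the label if the sign flag is set (`1`); fall through if the sign flag is clear (`0`).
`jns` `label`	Jump if not sign. Jump to the label if the sign flag is clear (`0`); fall through if the sign flag is set (`1`).
`jz` `label`	Jump if zero. Jump to the label if the zero flag is set (`1`); fall through if the zero flag is clear (`0`).
`jnz` `label`	Jump if not zero. Jump to the label if the zero flag is clear (`0`); fall through if the zero flag is set (`1`).

To use a conditional jump instruction, you must first execute an instruction that affects one (or more) of the condition code flags. For example, an unsigned arithmetic overflow will set the carry flag (and likewise, if overflow does not occur, the carry flag will be clear). Therefore, you could use the jc and jnc instructions after an add instruction to see if an (unsigned) overflow occurred during the calculation. For example: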

    mov eax, int32Var
    add eax, anotherVar
    jc  overflowOccurred

; Continue down here if the addition did not
; produce an overflow.

    .
    .
    .

overflowOccurred:

; Execute this code if the sum of int32Var and anotherVar
; does not fit into 32 bits.

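Not all instructions affect the flags. Of all the instructions we've looked at thus far (mov, add, sub, and, or, not, xor, and lea), only the add, sub, and, or, and xor instructions affect the flags; the not instruction, like mov and lea, leaves them unchanged. The add and sub instructions affect the flags as shown in Table 2-9.

Table 2-9: Flag Settings After Executing add or sub

Flag	Explanation
Carry	Set if an unsigned overflow occurs (for example, adding the byte values 0FFh and 01h). Clear if no overflow occurs. Note that subtracting 1 from 0 also sets the carry flag, because the subtraction requires a borrow (an unsigned underflow).
Overflow	Set if a signed overflow occurs (for example, adding the byte values 07Fh and 01h). Signed overflow occurs when the result does not fit in the signed range of the operand size; for byte-sized calculations, examples are 7Fh + 01h producing 80h, or 80h + 0FFh producing 7Fh.
Sign	The sign flag is set if the HO bit of the result is set. The sign flag is clear otherwise (that is, the sign flag reflects the state of the HO bit of the result).
Zero	The zero flag is set if the result of a computation produces 0; it is clear otherwise.

The logical instructions and, or, and xor always clear the carry and overflow flags. They copy the HO bit of their result into the sign flag and set or clear the zero flag depending on whether they produce a zero or nonzero result. The not instruction does not modify any flags.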
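To make the flag rules for addition concrete, here is a C sketch that models a byte-sized add and derives the four condition codes from the result. The `AddFlags8` structure and the `add8` function are hypothetical; this is a model of the behavior described above, not how the CPU computes the flags internally.

```c
#include <stdbool.h>
#include <stdint.h>

/* Result of a modeled 8-bit addition plus the four condition codes. */
typedef struct {
    uint8_t result;
    bool carry;
    bool overflow;
    bool sign;
    bool zero;
} AddFlags8;

AddFlags8 add8(uint8_t a, uint8_t b)
{
    AddFlags8 f;
    unsigned wide = (unsigned)a + (unsigned)b;

    f.result = (uint8_t)wide;
    f.carry = wide > 0xFF;                 /* unsigned overflow           */
    /* Signed overflow: both operands have the same sign bit, and the
       result's sign bit differs from it. */
    f.overflow = ((~(a ^ b) & (a ^ f.result)) & 0x80) != 0;
    f.sign = (f.result & 0x80) != 0;       /* HO bit of the result        */
    f.zero = f.result == 0;
    return f;
}
```

Adding 0FFh and 01h sets carry and zero but not overflow, whereas adding 7Fh and 01h sets overflow and sign but not carry.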

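In addition to the conditional jump instructions, the x86-64 CPUs also provide a set of conditional move instructions. Chapter 7 covers those instructions.

2.10.3 The `cmp` Instruction and Corresponding Conditional Jumps

The `cmp` (compare) instruction is probably the most useful instruction to execute prior to a conditional jump. It has the same syntax as the `sub` instruction and, in fact, it also subtracts the second operand from the first operand and sets the condition code flags based on the result of the subtraction.^(8) However, `cmp` does not store the difference back into the first (destination) operand. Its whole purpose is to set the condition code flags based on the result of the subtraction.

Although you could use the `jc`/`jnc`, `jo`/`jno`, `js`/`jns`, and `jz`/`jnz` instructions immediately after a `cmp` instruction (to test how `cmp` has set the individual flags), the flag names don't mean much in the context of a comparison. Logically, when you see the following instruction (note that the operand syntax of `cmp` is identical to that of the `add`, `sub`, and `mov` instructions),

cmp `left_operand`, `right_operand`

you read it as "compare the `left_operand` to the `right_operand`." The questions you would normally ask after such a comparison are as follows:

* Is the `left_operand` equal to the `right_operand`?
* Is the `left_operand` not equal to the `right_operand`?
* Is the `left_operand` less than the `right_operand`?
* Is the `left_operand` less than or equal to the `right_operand`?
* Is the `left_operand` greater than the `right_operand`?
* Is the `left_operand` greater than or equal to the `right_operand`?

The conditional jump instructions presented thus far don't (intuitively) answer any of these questions.

The x86-64 CPUs provide an additional set of conditional jump instructions, shown in Table 2-10, that allow you to test for comparison conditions.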

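Table 2-10: Conditional Jump Instructions for Use After a `cmp` Instruction

| **Instruction** | **Flags tested** | **Explanation** |
| --- | --- | --- |
| `je` `label` | `ZF == 1` | Jump if equal. Transfers control to the target label if the `left_operand` is equal to the `right_operand`. This is a synonym for `jz`, as the zero flag will be set if the two operands are equal (their subtraction produces a 0 result). |
| `jne` `label` | `ZF == 0` | Jump if not equal. Transfers control to the target label if the `left_operand` is not equal to the `right_operand`. This is a synonym for `jnz`, as the zero flag will be clear if the two operands are not equal (their subtraction produces a nonzero result). |
| `ja` `label` | `CF == 0` and `ZF == 0` | Jump if above. Transfers control to the target label if the unsigned `left_operand` is greater than the unsigned `right_operand`. |
| `jae` `label` | `CF == 0` | Jump if above or equal. Transfers control to the target label if the unsigned `left_operand` is greater than or equal to the unsigned `right_operand`. This is a synonym for `jnc`, as no unsigned overflow (actually, underflow) occurs if the `left_operand` is greater than or equal to the `right_operand`. |
| `jb` `label` | `CF == 1` | Jump if below. Transfers control to the target label if the unsigned `left_operand` is less than the unsigned `right_operand`. This is a synonym for `jc`, as an unsigned overflow (actually, underflow) occurs if the `left_operand` is less than the `right_operand`. |
| `jbe` `label` | `CF == 1` or `ZF == 1` | Jump if below or equal. Transfers control to the target label if the unsigned `left_operand` is less than or equal to the unsigned `right_operand`. |
| `jg` `label` | `SF == OF` and `ZF == 0` | Jump if greater. Transfers control to the target label if the signed `left_operand` is greater than the signed `right_operand`. |
| `jge` `label` | `SF == OF` | Jump if greater or equal. Transfers control to the target label if the signed `left_operand` is greater than or equal to the signed `right_operand`. |
| `jl` `label` | `SF ≠ OF` | Jump if less. Transfers control to the target label if the signed `left_operand` is less than the signed `right_operand`. |
| `jle` `label` | `ZF == 1` or `SF ≠ OF` | Jump if less or equal. Transfers control to the target label if the signed `left_operand` is less than or equal to the signed `right_operand`. |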

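One of the most important things to note about Table 2-10 is that separate conditional jump instructions test for signed and unsigned comparisons. Consider the two byte values 0FFh and 01h. From an unsigned perspective, 0FFh is greater than 01h. However, when we treat these as signed numbers (using the two's complement numbering system), 0FFh is actually -1, which is clearly less than 1. The two values have the same bit representations either way, yet comparing them as signed numbers and comparing them as unsigned numbers gives opposite results.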
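
To make the distinction concrete, here is a small sketch that compares the same two bytes both ways. It is an illustration, not code from the book: the procedure names `isAbove` and `isGreater` are hypothetical, and the arguments are assumed to arrive in CL and DL per the Microsoft x64 calling convention.

```asm
; Hypothetical helpers (not from the book's listings):
;
;   int isAbove(unsigned char left, unsigned char right);
;   int isGreater(signed char left, signed char right);
;
; Assumption: Microsoft x64 convention, left in CL, right in DL,
; result returned in EAX (1 = true, 0 = false).

        option  casemap:none
        .code

; Unsigned comparison: uses the "above/below" jumps.

        public  isAbove
isAbove proc
        xor     eax, eax        ; Assume false (must precede cmp; xor changes flags)
        cmp     cl, dl
        jbe     notAbove        ; Unsigned left <= right
        mov     eax, 1
notAbove:
        ret
isAbove endp

; Signed comparison: uses the "greater/less" jumps.

        public  isGreater
isGreater proc
        xor     eax, eax        ; Assume false
        cmp     cl, dl
        jle     notGreater      ; Signed left <= right
        mov     eax, 1
notGreater:
        ret
isGreater endp
        end
```

With `left` = 0FFh and `right` = 01h, `isAbove` returns 1 (255 is above 1), whereas `isGreater` returns 0 (-1 is not greater than 1), even though both procedures execute the identical `cmp cl, dl` instruction.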

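2.10.4 Conditional Jump Synonyms

Some of the instructions are synonyms for other instructions. For example, `jb` and `jc` are the same instruction (that is, they have the same numeric machine code encoding). This is done for convenience and readability's sake. After a `cmp` instruction, for example, `jb` is much more meaningful than `jc`. MASM defines several synonyms for various conditional jump instructions that make coding a little easier. Table 2-11 lists many of these synonyms.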

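Table 2-11: Conditional Jump Synonyms

| **Instruction** | **Equivalents** | **Description** |
| --- | --- | --- |
| `ja` | `jnbe` | Jump if above, jump if not below or equal. |
| `jae` | `jnb`, `jnc` | Jump if above or equal, jump if not below, jump if no carry. |
| `jb` | `jc`, `jnae` | Jump if below, jump if carry, jump if not above or equal. |
| `jbe` | `jna` | Jump if below or equal, jump if not above. |
| `jc` | `jb`, `jnae` | Jump if carry, jump if below, jump if not above or equal. |
| `je` | `jz` | Jump if equal, jump if zero. |
| `jg` | `jnle` | Jump if greater, jump if not less or equal. |
| `jge` | `jnl` | Jump if greater or equal, jump if not less. |
| `jl` | `jnge` | Jump if less, jump if not greater or equal. |
| `jle` | `jng` | Jump if less or equal, jump if not greater. |
| `jna` | `jbe` | Jump if not above, jump if below or equal. |
| `jnae` | `jb`, `jc` | Jump if not above or equal, jump if below, jump if carry. |
| `jnb` | `jae`, `jnc` | Jump if not below, jump if above or equal, jump if no carry. |
| `jnbe` | `ja` | Jump if not below or equal, jump if above. |
| `jnc` | `jnb`, `jae` | Jump if no carry, jump if not below, jump if above or equal. |
| `jne` | `jnz` | Jump if not equal, jump if not zero. |
| `jng` | `jle` | Jump if not greater, jump if less or equal. |
| `jnge` | `jl` | Jump if not greater or equal, jump if less. |
| `jnl` | `jge` | Jump if not less, jump if greater or equal. |
| `jnle` | `jg` | Jump if not less or equal, jump if greater. |
| `jnz` | `jne` | Jump if not zero, jump if not equal. |
| `jz` | `je` | Jump if zero, jump if equal. |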

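There is one very important thing to note about the `cmp` instruction: it sets the flags only for integer comparisons (which also covers characters and other types you can encode as integers). Specifically, it does not compare floating-point values and set the flags as appropriate for a floating-point comparison. To learn more about floating-point arithmetic (and comparisons), see "Floating-Point Arithmetic" in Chapter 6.

2.11 Shifts and Rotates

Another set of logical operations that apply to bit strings are the shift and rotate operations. These two categories can be further broken down into left shifts, left rotates, right shifts, and right rotates.

The shift-left operation moves each bit in a bit string one position to the left, as shown in Figure 2-8.

Figure 2-8: Shift-left operation

Bit 0 moves into bit position 1, the previous value in bit position 1 moves into bit position 2, and so on. We shift a 0 into bit 0, and the previous value of the high-order bit becomes the carry out of this operation.

The x86-64 provides a shift-left instruction, `shl`, that performs this useful operation. The syntax for the `shl` instruction is shown here: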

shl `dest`, `count`

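The `count` operand is either the CL register or a constant in the range 0 to n, where n is one less than the number of bits in the destination operand (for example, n = 7 for 8-bit operands, n = 15 for 16-bit operands, n = 31 for 32-bit operands, and n = 63 for 64-bit operands). The `dest` operand is a typical destination operand; it can be either a memory location or a register.

When the `count` operand is the constant 1, the `shl` instruction does the operation shown in Figure 2-9.

Figure 2-9: `shl` by 1 operation

In Figure 2-9, the C represents the carry flag; that is, the HO bit shifted out of the operand moves into the carry flag. Therefore, you can test for overflow after a `shl dest, 1` instruction by testing the carry flag immediately after executing the instruction (for example, by using `jc` and `jnc`).

The `shl` instruction sets the zero flag based on the result (ZF=1 if the result is zero, ZF=0 otherwise). It sets the sign flag if the HO bit of the result is 1. If the shift count is 1, `shl` sets the overflow flag if the HO bit changes (that is, you shift a 0 into the HO bit when it was previously 1, or shift a 1 into the HO bit when it was previously 0); the overflow flag is undefined for all other shift counts.

Shifting a value to the left one digit is the same thing as multiplying it by its radix (base). For example, shifting a decimal number one position to the left (adding a 0 to the right of the number) effectively multiplies it by 10 (the radix):

1234 shl 1 = 12340

(`shl 1` means shift one digit position to the left.)

Because the radix of a binary number is 2, shifting it left multiplies it by 2. If you shift a value to the left n times, you multiply that value by 2^(n).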
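
Multiplications by constants that are not powers of 2 can often be built from a few shifts and adds. The following sketch multiplies by 10 as 8x + 2x. It is an illustration only: the procedure name `times10` is hypothetical, the argument is assumed to arrive in ECX per the Microsoft x64 calling convention, and the code makes no attempt to detect overflow.

```asm
; Hypothetical helper (not one of the book's listings):
;
;   unsigned times10(unsigned x);
;
; Assumption: Microsoft x64 convention, x in ECX, result in EAX.
; Overflow is ignored: the result is x * 10 modulo 2^32.

        option  casemap:none
        .code

        public  times10
times10 proc
        mov     eax, ecx        ; EAX = x
        shl     eax, 3          ; EAX = x * 8  (2^3)
        shl     ecx, 1          ; ECX = x * 2  (2^1)
        add     eax, ecx        ; EAX = x * 8 + x * 2 = x * 10
        ret
times10 endp
        end
```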

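A shift-right operation works the same way, except we're moving the data in the opposite direction. For a byte value, bit 7 moves into bit 6, bit 6 moves into bit 5, bit 5 moves into bit 4, and so on. During a right shift, we shift a 0 into bit 7, and bit 0 becomes the carry out of the operation (see Figure 2-10).

Figure 2-10: Shift-right operation

As you would probably expect, the x86-64 provides a `shr` instruction that shifts the bits in a destination operand to the right. The syntax is similar to that of the `shl` instruction: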

shr `dest`, `count`

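This instruction shifts a 0 into the HO bit of the destination operand; it shifts the other bits one place to the right (from a higher bit number to a lower bit number). Finally, bit 0 is shifted into the carry flag. If you specify a count of 1, the `shr` instruction does the operation shown in Figure 2-11.

Figure 2-11: `shr` by 1 operation

The `shr` instruction sets the zero flag based on the result (ZF=1 if the result is zero, ZF=0 otherwise). It clears the sign flag (because the HO bit of the result is always 0). If the shift count is 1, `shr` sets the overflow flag if the HO bit changes, that is, if the HO bit was 1 before the shift (because `shr` always shifts a 0 into the HO bit, it can never change from 0 to 1); the overflow flag is undefined for all other shift counts.

Because a left shift is equivalent to a multiplication by 2, it should come as no surprise that a right shift is roughly comparable to a division by 2 (or, in general, a division by the radix of the number). If you perform n right shifts, you divide that number by 2^(n).

However, a shift right is equivalent only to an unsigned division by 2. For example, if you shift the unsigned representation of 254 (0FEh) one place to the right, you get 127 (7Fh), exactly what you would expect. However, if you shift the two's complement representation of -2 (0FEh) to the right one position, you get 127 (7Fh), which is not correct. This problem occurs because we're shifting a 0 into bit 7. If bit 7 previously contained a 1, we're changing the value from a negative number to a positive one, which is not what a division by 2 should do.

To use the shift right as a division operator, we must define a third shift operation: arithmetic shift right.^(9) This works just like the normal shift-right operation (a logical shift right) except that, instead of shifting a 0 into the HO bit, an arithmetic shift-right operation copies the HO bit back into itself; that is, during the shift operation, it does not modify the HO bit, as shown in Figure 2-12.

Figure 2-12: Arithmetic shift-right operation

An arithmetic shift right generally produces the result you expect. For example, if you perform the arithmetic shift-right operation on -2 (0FEh), you get -1 (0FFh). However, this operation always rounds the result to the closest integer that is less than or equal to the actual result. For example, if you apply the arithmetic shift-right operation to -1 (0FFh), the result is -1, not 0. The exact quotient is -0.5, and because -1 is less than -0.5, the arithmetic shift-right operation rounds toward -1. This is not a bug in the arithmetic shift-right operation; it just uses a different (though valid) definition of integer division.

The x86-64 provides an arithmetic shift-right instruction, `sar` (shift arithmetic right). This instruction's syntax is nearly identical to that of `shl` and `shr`: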

sar `dest`, `count`

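The usual limitations on the count and destination operands apply. This instruction operates as shown in Figure 2-13 if the count is 1.

Figure 2-13: `sar` `dest`, 1 operation

The `sar` instruction sets the zero flag based on the result (ZF=1 if the result is zero, ZF=0 otherwise). It sets the sign flag to the HO bit of the result. The overflow flag should always be clear after a `sar` instruction, as signed overflow is impossible with this operation.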
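
The difference between the logical and arithmetic right shifts is easiest to see side by side. The sketch below is illustrative rather than taken from the book: the procedure names `halfUnsigned` and `halfSigned` are hypothetical, and the argument is assumed to arrive in ECX per the Microsoft x64 calling convention.

```asm
; Hypothetical helpers (not from the book's listings):
;
;   unsigned halfUnsigned(unsigned x);
;   int      halfSigned(int x);
;
; Assumption: Microsoft x64 convention, x in ECX, result in EAX.

        option  casemap:none
        .code

; Logical shift right: a 0 enters the HO bit (unsigned divide by 2).

        public  halfUnsigned
halfUnsigned proc
        mov     eax, ecx
        shr     eax, 1
        ret
halfUnsigned endp

; Arithmetic shift right: the HO (sign) bit is preserved
; (signed divide by 2, rounding toward negative infinity).

        public  halfSigned
halfSigned proc
        mov     eax, ecx
        sar     eax, 1
        ret
halfSigned endp
        end
```

For the bit pattern 0FFFFFFFEh, `halfUnsigned` returns 7FFFFFFFh, while `halfSigned` (which sees the same bits as -2) returns 0FFFFFFFFh, or -1. Passing -1 to `halfSigned` also returns -1, which demonstrates the rounding behavior described earlier.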

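The rotate-left and rotate-right operations behave like the shift-left and shift-right operations, except that the bit shifted out from one end is shifted back in at the other end. Figure 2-14 diagrams these operations.

Figure 2-14: Rotate-left and rotate-right operations

The x86-64 provides `rol` (rotate left) and `ror` (rotate right) instructions that perform these basic operations on their operands. The syntax for these two instructions is similar to that of the shift instructions: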

rol `dest`, `count`
ror `dest`, `count`

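If the shift count is 1, these two instructions copy the bit shifted out of the destination operand into the carry flag, as Figures 2-15 and 2-16 show.

Figure 2-15: `rol` `dest`, 1 operation

Figure 2-16: `ror` `dest`, 1 operation

Unlike the shift instructions, the rotate instructions do not affect the settings of the sign or zero flags. The OF flag is defined only for 1-bit rotates; it is undefined in all other cases (a zero-bit rotate does nothing; that is, it affects no flags). For left rotates, the OF flag is set to the exclusive-OR of the original HO 2 bits. For right rotates, the OF flag is set to the exclusive-OR of the HO 2 bits after the rotate.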
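
A common use of the rotate instructions is to exchange the two halves of a value without losing any bits. The following sketch swaps the two nibbles of a byte with a single `rol`. It is an illustration only: the procedure name `swapNibbles` is hypothetical, and the argument is assumed to arrive in CL per the Microsoft x64 calling convention.

```asm
; Hypothetical helper (not one of the book's listings):
;
;   unsigned char swapNibbles(unsigned char b);
;
; Assumption: Microsoft x64 convention, b in CL, result in EAX
; (zero-extended from AL).

        option  casemap:none
        .code

        public  swapNibbles
swapNibbles proc
        mov     al, cl          ; AL = b
        rol     al, 4           ; Bits rotated out of bit 7 re-enter at bit 0,
                                ; so the HO and LO nibbles trade places
        movzx   eax, al         ; Zero-extend the result for the caller
        ret
swapNibbles endp
        end
```

For example, an input of 12h (0001_0010b) produces 21h (0010_0001b). For an 8-bit operand, `ror al, 4` would produce the same result.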

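It is often more convenient for the rotate operation to shift the output bit through the carry flag and to shift the previous carry value back into the input bit of the shift operation. The x86-64 `rcl` (rotate through carry left) and `rcr` (rotate through carry right) instructions achieve this for you. These instructions use the following syntax: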

```
rcl dest, count
rcr dest, count
```

The `count` operand is either a constant or the CL register, and the `dest` operand is a memory location or register. The `count` operand must be a value that is less than the number of bits in the `dest` operand. For a count value of 1, these two instructions do the rotation shown in Figure 2-17.

![f02017a](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02017a.png)![f02017b](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02017b.png)

Figure 2-17: `rcl` `dest`, 1 and `rcr` `dest`, 1 operations

Unlike the shift instructions, the rotate-through-carry instructions do not affect the settings of the sign or zero flags. The OF flag is defined only for the 1-bit rotates. For left rotates, the OF flag is set if the original HO 2 bits differ (that is, it is set to their exclusive OR). For right rotates, the OF flag is set to the exclusive OR of the resultant HO 2 bits.

## 2.12 Bit Fields and Packed Data

Although the x86-64 operates most efficiently on `byte`, `word`, `dword`, and `qword` data types, occasionally you'll need to work with a data type that uses a number of bits other than 8, 16, 32, or 64. You can also zero-extend a nonstandard data size to the next larger power of 2 (such as extending a 22-bit value to a 32-bit value). This turns out to be fast, but if you have a large array of such values, slightly more than 31 percent of the memory is going to waste (10 bits in every 32-bit value). However, suppose you were to repurpose those 10 bits for something else? By *packing* the separate 22-bit and 10-bit values into a single 32-bit value, you don't waste any space.

For example, consider a date of the form 04/02/01. Representing this date requires three numeric values: month, day, and year values. Months, of course, take on the values 1 to 12. At least 4 bits (a maximum of 16 different values) are needed to represent the month. Days range from 1 to 31. So it will take 5 bits (a maximum of 32 different values) to represent the day entry. The year value, assuming that we're working with values in the range 0 to 99, requires 7 bits (which can be used to represent up to 128 different values). So, 4 + 5 + 7 = 16 bits, or 2 bytes.

In other words, we can pack our date data into 2 bytes rather than the 3 that would be required if we used a separate byte for each of the month, day, and year values. This saves 1 byte of memory for each date stored, which could be a substantial savings if you need to store many dates. The bits could be arranged as shown in Figure 2-18.

![f02018](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02018.png)

Figure 2-18: Short packed date format (2 bytes)

*MMMM* represents the 4 bits making up the month value, *DDDDD* represents the 5 bits making up the day, and *YYYYYYY* is the 7 bits composing the year. Each collection of bits representing a data item is a *bit field*. For example, April 2, 2001, would be represented as 4101h:

```
0100 00010 0000001 = 0100_0001_0000_0001b or 4101h
  4     2      01
```

Although packed values are *space-efficient* (that is, they make efficient use of memory), they are computationally *inefficient* (slow!). The reason? It takes extra instructions to unpack the data packed into the various bit fields. These extra instructions take additional time to execute (and additional bytes to hold the instructions); hence, you must carefully consider whether packed data fields will save you anything. The sample program in Listing 2-4 demonstrates the effort that must go into packing and unpacking this 16-bit date format.

```
; Listing 2-4

; Demonstrate packed data types.

            option  casemap:none

NULL        =       0
nl          =       10      ; ASCII code for newline
maxLen      =       256

; New data declaration section.
; .const holds data values for read-only constants.

            .const
ttlStr      byte    'Listing 2-4', 0
moPrompt    byte    'Enter current month: ', 0
dayPrompt   byte    'Enter current day: ', 0
yearPrompt  byte    'Enter current year '
            byte    '(last 2 digits only): ', 0

packed      byte    'Packed date is %04x', nl, 0
theDate     byte    'The date is %02d/%02d/%02d'
            byte    nl, 0

badDayStr   byte    'Bad day value was entered '
            byte    '(expected 1-31)', nl, 0

badMonthStr byte    'Bad month value was entered '
            byte    '(expected 1-12)', nl, 0

badYearStr  byte    'Bad year value was entered '
            byte    '(expected 00-99)', nl, 0

            .data
month       byte    ?
day         byte    ?
year        byte    ?
date        word    ?

input       byte    maxLen dup (?)

            .code
            externdef printf:proc
            externdef readLine:proc
            externdef atoi:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Here's a user-written function that reads a numeric value from the
; user:
;
; int readNum(char *prompt);
;
; A pointer to a string containing a prompt message is passed in the
; RCX register.
;
; This procedure prints the prompt, reads an input string from the
; user, then converts the input string to an integer and returns the
; integer value in RAX.

readNum     proc

; Must set up stack properly (using this "magic" instruction) before
; we can call any C/C++ functions:

            sub     rsp, 56

; Print the prompt message. Note that the prompt message was passed to
; this procedure in RCX, we're just passing it on to printf:

            call    printf

; Set up arguments for readLine and read a line of text from the user.
; Note that readLine returns NULL (0) in RAX if there was an error.

            lea     rcx, input
            mov     rdx, maxLen
            call    readLine

; Test for a bad input string:

            cmp     rax, NULL
            je      badInput

; Okay, good input at this point, try converting the string to an
; integer by calling atoi. The atoi function returns zero if there was
; an error, but zero is a perfectly fine return result, so we ignore
; errors.

            lea     rcx, input      ; Ptr to string
            call    atoi            ; Convert to integer

badInput:
            add     rsp, 56         ; Undo stack setup
            ret
readNum     endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            sub     rsp, 56

; Read the date from the user. Begin by reading the month:

            lea     rcx, moPrompt
            call    readNum

; Verify the month is in the range 1..12:

            cmp     rax, 1
            jl      badMonth
            cmp     rax, 12
            jg      badMonth

; Good month, save it for now:

            mov     month, al       ; 1..12 fits in a byte

; Read the day:

            lea     rcx, dayPrompt
            call    readNum

; We'll be lazy here and verify only that the day is in the range
; 1..31.

            cmp     rax, 1
            jl      badDay
            cmp     rax, 31
            jg      badDay

; Good day, save it for now:

            mov     day, al         ; 1..31 fits in a byte

; Read the year:

            lea     rcx, yearPrompt
            call    readNum

; Verify that the year is in the range 0..99.

            cmp     rax, 0
            jl      badYear
            cmp     rax, 99
            jg      badYear

; Good year, save it for now:

            mov     year, al        ; 0..99 fits in a byte

; Pack the data into the following bits:
;
;  15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
;   m  m  m  m  d  d  d  d  d  y  y  y  y  y  y  y

            movzx   ax, month
            shl     ax, 5
            or      al, day
            shl     ax, 7
            or      al, year
            mov     date, ax

; Print the packed date:

            lea     rcx, packed
            movzx   rdx, date
            call    printf

; Unpack the date and print it:

            movzx   rdx, date
            mov     r9, rdx
            and     r9, 7fh         ; Keep LO 7 bits (year)
            shr     rdx, 7          ; Get day in position
            mov     r8, rdx
            and     r8, 1fh         ; Keep LO 5 bits
            shr     rdx, 5          ; Get month in position
            lea     rcx, theDate
            call    printf

            jmp     allDone

; Come down here if a bad day was entered:

badDay:
            lea     rcx, badDayStr
            call    printf
            jmp     allDone

; Come down here if a bad month was entered:

badMonth:
            lea     rcx, badMonthStr
            call    printf
            jmp     allDone

; Come down here if a bad year was entered:

badYear:
            lea     rcx, badYearStr
            call    printf

allDone:
            add     rsp, 56
            ret                     ; Returns to caller
asmMain     endp
            end
```

Listing 2-4: Packing and unpacking date data

Here's the result of building and running this program:

```
C:\>build listing2-4

C:\>echo off
 Assembling: listing2-4.asm
c.cpp

C:\>listing2-4
Calling Listing 2-4:
Enter current month: 2
Enter current day: 4
Enter current year (last 2 digits only): 68
Packed date is 2244
The date is 02/04/68
Listing 2-4 terminated
```

Of course, having gone through the problems with Y2K (Year 2000),^(10) you know that using a date format that limits you to 100 years (or even 127 years) would be quite foolish. To future-proof the packed date format, we can extend it to 4 bytes packed into a double-word variable, as shown in Figure 2-19. (As you will see in Chapter 4, you should always try to create data objects whose length is an even power of 2: 1 byte, 2 bytes, 4 bytes, 8 bytes, and so on. Otherwise, you will pay a performance penalty.)

![f02019](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02019.png)

Figure 2-19: Long packed date format (4 bytes)

The Month and Day fields now consist of 8 bits each, so they can be extracted as a byte object from the double word. This leaves 16 bits for the year, with a range of 65,536 years. By rearranging the bits so the Year field is in the HO bit positions, the Month field is in the middle bit positions, and the Day field is in the LO bit positions, the long date format allows you to easily compare two dates to see if one date is less than, equal to, or greater than another date. Consider the following code:

```
    mov eax, Date1      ; Assume Date1 and Date2 are dword variables
    cmp eax, Date2      ; using the Long Packed Date format
    jna d1LEd2

    ; Do something if Date1 > Date2

d1LEd2:
```

Had you kept the different date fields in separate variables, or organized the fields differently, you would not have been able to compare `Date1` and `Date2` as easily as you can with the long packed date format. Therefore, this example demonstrates another reason for packing data even if you don't realize any space savings: it can make certain computations more convenient or even more efficient (contrary to what normally happens when you pack data).

Examples of practical packed data types abound.
You could pack eight Boolean values into a single byte, you could pack two BCD digits into a byte, and so on.

A classic example of packed data is the FLAGS register (the LO 16 bits of RFLAGS). This register packs nine important Boolean objects (along with seven important system flags) into a single 16-bit register. You will commonly need to access many of these flags. You can test many of the condition code flags by using the conditional jump instructions and manipulate the individual bits in the FLAGS register with the instructions in Table 2-12 that directly affect certain flags.

Table 2-12: Instructions That Affect Certain Flags

| **Instruction** | **Explanation** |
| --- | --- |
| `cld` | Clears (sets to `0`) the direction flag. |
| `std` | Sets (to `1`) the direction flag. |
| `cli` | Clears the interrupt flag (disabling maskable interrupts). |
| `sti` | Sets the interrupt flag (enabling maskable interrupts). |
| `clc` | Clears the carry flag. |
| `stc` | Sets the carry flag. |
| `cmc` | Complements (inverts) the carry flag. |
| `sahf` | Stores the AH register into the LO 8 bits of the FLAGS register. (Warning: certain early x86-64 CPUs do not support this instruction.) |
| `lahf` | Loads AH from the LO 8 bits of the FLAGS register. (Warning: certain early x86-64 CPUs do not support this instruction.) |

The `lahf` and `sahf` instructions provide a convenient way to access the LO 8 bits of the FLAGS register as an 8-bit byte (rather than as eight separate 1-bit values). See Figure 2-20 for a layout of the FLAGS register.

![f02020](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02020.png)

Figure 2-20: FLAGS register as packed Boolean data

The `lahf` (*load AH with the LO 8 bits of the FLAGS register*) and `sahf` (*store AH into the LO byte of the FLAGS register*) instructions use the following syntax:

```
lahf
sahf
```

## 2.13 IEEE Floating-Point Formats

When Intel planned to introduce a floating-point unit (the 8087 FPU) for its new 8086 microprocessor, it hired the best numerical analyst it could find to design a floating-point format. That person then hired two other experts in the field, and the three of them (William Kahan, Jerome Coonen, and Harold Stone) designed Intel's floating-point format. They did such a good job designing the KCS Floating-Point Standard that the Institute of Electrical and Electronics Engineers (IEEE) adopted this format for its floating-point format.^(11)

To handle a wide range of performance and accuracy requirements, Intel actually introduced *three* floating-point formats: single-precision, double-precision, and extended-precision. The single- and double-precision formats correspond to C's `float` and `double` types or FORTRAN's real and double-precision types. The extended-precision format contains 16 extra bits that long chains of computations can use as guard bits before rounding down to a double-precision value when storing the result.

### 2.13.1 Single-Precision Format

The *single-precision format* uses a one's complement 24-bit mantissa, an 8-bit excess-127 exponent, and a single sign bit. The *mantissa* usually represents a value from 1.0 to just under 2.0. The HO bit of the mantissa is always assumed to be 1 and represents a value just to the left of the *binary point*.^(12) The remaining 23 mantissa bits appear to the right of the binary point. Therefore, the mantissa represents the value:

```
1.mmmmmmm mmmmmmmm mmmmmmmm
```

The `mmmm` characters represent the 23 bits of the mantissa.
Note that because the HO bit of the mantissa is always 1, the single-precision format doesn't actually store this bit within the 32 bits of the floating-point number. This is known as an *implied bit*.

Because we are working with binary numbers, each position to the right of the binary point represents a value (`0` or `1`) times a successive negative power of 2. The implied 1 bit is always multiplied by 2⁰, which is 1. This is why the mantissa is always greater than or equal to 1. Even if the other mantissa bits are all 0, the implied 1 bit always gives us the value 1.^(13) Of course, even if we had an almost infinite number of 1 bits after the binary point, they still would not add up to 2. This is why the mantissa can represent values in the range 1 to just under 2.

Although there is an infinite number of values between 1 and 2, we can represent only about 8 million of them because we use a 23-bit mantissa (with the implied 24th bit always 1). This is the reason for inaccuracy in floating-point arithmetic: we are limited to a fixed number of bits in computations involving single-precision floating-point values.

The mantissa uses a *one's complement* format rather than two's complement to represent signed values. The 24-bit value of the mantissa is simply an unsigned binary number, and the sign bit determines whether that value is positive or negative (a scheme more commonly called *sign-magnitude*). Numbers in this format have the unusual property that there are two representations for 0 (with the sign bit set or clear). Generally, this is important only to the person designing the floating-point software or hardware system. We will assume that the value 0 always has the sign bit clear.

To represent values outside the range 1.0 to just under 2.0, the exponent portion of the floating-point format comes into play. The floating-point format raises 2 to the power specified by the exponent and then multiplies the mantissa by this value. The exponent is 8 bits and is stored in an *excess-127* format. In excess-127 format, the exponent 0 is represented by the value 127 (7Fh), negative exponents are values in the range 0 to 126, and positive exponents are values in the range 128 to 255. To convert an exponent to excess-127 format, add 127 to the exponent value. The use of excess-127 format makes it easier to compare floating-point values. The single-precision floating-point format takes the form shown in Figure 2-21.

![f02021](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02021.png)

Figure 2-21: Single-precision (32-bit) floating-point format

With a 24-bit mantissa, you will get approximately six and a half (decimal) digits of precision (half a digit of precision means that the first six digits can all be in the range 0 to 9, but the seventh digit can be only in the range 0 to *x*, where *x* < 9 and is generally close to 5). With an 8-bit excess-127 exponent, the dynamic range^(14) of single-precision floating-point numbers is approximately 2^(±127), or about 10^(±38).

Although single-precision floating-point numbers are perfectly suitable for many applications, the precision and dynamic range are somewhat limited and unsuitable for many financial, scientific, and other applications. Furthermore, during long chains of computations, the limited accuracy of the single-precision format may introduce serious error.

### 2.13.2 Double-Precision Format

The *double-precision format* helps overcome the problems of single-precision floating-point. Using twice the space, the double-precision format has an 11-bit excess-1023 exponent and a 53-bit mantissa (with an implied HO bit of 1) plus a sign bit. This provides a dynamic range of about 10^(±308) and 14.5 digits of precision, sufficient for most applications. Double-precision floating-point values take the form shown in Figure 2-22.
![f02022](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02022.png)

Figure 2-22: 64-bit double-precision floating-point format

### 2.13.3 Extended-Precision Format

To ensure accuracy during long chains of computations involving double-precision floating-point numbers, Intel designed the *extended-precision format*. It uses 80 bits. Twelve of the additional 16 bits are appended to the mantissa, and 4 of the additional bits are appended to the end of the exponent. Unlike the single- and double-precision values, the extended-precision format's mantissa does not have an implied HO bit. Therefore, the extended-precision format provides a 64-bit mantissa, a 15-bit excess-16383 exponent, and a 1-bit sign. Figure 2-23 shows the format for the extended-precision floating-point value.

![f02023](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02023.png)

Figure 2-23: 80-bit extended-precision floating-point format

On the x86-64 FPU, all computations are done using the extended-precision format. Whenever you load a single- or double-precision value, the FPU automatically converts it to an extended-precision value. Likewise, when you store a single- or double-precision value to memory, the FPU automatically rounds the value down to the appropriate size before storing it. By always working with the extended-precision format, Intel guarantees that a large number of guard bits are present to ensure the accuracy of your computations.

### 2.13.4 Normalized Floating-Point Values

To maintain maximum precision during computation, most computations use normalized values. A *normalized floating-point value* is one whose HO mantissa bit contains 1. Almost any non-normalized value can be normalized: shift the mantissa bits to the left and decrement the exponent until a 1 appears in the HO bit of the mantissa.

Remember, the exponent is a binary exponent. Each time you increment the exponent, you multiply the floating-point value by 2. Likewise, whenever you decrement the exponent, you divide the floating-point value by 2. By the same token, shifting the mantissa to the left one bit position multiplies the floating-point value by 2; likewise, shifting the mantissa to the right divides the floating-point value by 2. Therefore, shifting the mantissa to the left one position *and* decrementing the exponent does not change the value of the floating-point number at all.

Keeping floating-point numbers normalized is beneficial because it maintains the maximum number of bits of precision for a computation. If the HO *n* bits of the mantissa are all 0, the mantissa has that many fewer bits of precision available for computation. Therefore, a floating-point computation will be more accurate if it involves only normalized values.

In two important cases, a floating-point number cannot be normalized. Zero is one of these special cases. Obviously, it cannot be normalized because the floating-point representation for 0 has no 1 bits in the mantissa. This, however, is not a problem because we can exactly represent the value 0 with only a single bit.

In the second case, we have some HO bits in the mantissa that are 0, but the biased exponent is also 0 (and we cannot decrement it to normalize the mantissa). Rather than disallow certain small values, whose HO mantissa bits and biased exponent are 0 (the most negative exponent possible), the IEEE standard allows special *denormalized* values to represent these smaller values.^(15) Although the use of denormalized values allows IEEE floating-point computations to produce better results than if underflow occurred, keep in mind that denormalized values offer fewer bits of precision.

### 2.13.5 Non-Numeric Values

The IEEE floating-point standard recognizes three special non-numeric values: -infinity, +infinity, and a special not-a-number (NaN).
For each of these special numbers, the exponent field is filled with all 1 bits.

If the exponent is all 1 bits and the mantissa is all 0 bits, then the value is infinity. The sign bit will be `0` for +infinity, and `1` for -infinity.

If the exponent is all 1 bits and the mantissa is not all 0 bits, then the value is an invalid number (known as a *not-a-number* in IEEE 754 terminology). NaNs represent illegal operations, such as trying to take the square root of a negative number.

Unordered comparisons occur whenever either operand (or both) is a NaN. As NaNs have an indeterminate value, they cannot be compared (that is, they are incomparable). Any attempt to perform an unordered comparison typically results in an exception or some sort of error. Ordered comparisons, on the other hand, involve two operands, neither of which are NaNs.

### 2.13.6 MASM Support for Floating-Point Values

MASM provides several data types to support the use of floating-point data in your assembly language programs. MASM floating-point constants allow the following syntax:

* An optional `+` or `-` symbol, denoting the sign of the mantissa (if this is not present, MASM assumes that the mantissa is positive)
* Followed by one or more decimal digits
* Followed by a decimal point and zero or more decimal digits
* Optionally followed by an `e` or `E`, optionally followed by a sign (`+` or `-`) and one or more decimal digits

The decimal point or the `e`/`E` must be present in order to differentiate this value from an integer or unsigned literal constant. Here are some examples of legal literal floating-point constants:

```
1.234  3.75e2  -1.0  1.1e-1  1.e+4  0.1  -123.456e+789  +25.0e0  1.e3
```

A floating-point literal constant must begin with a decimal digit, so you must use, for example, 0.1 to represent .1 in your programs.

To declare a floating-point variable, you use the `real4`, `real8`, or `real10` data types. The number at the end of these data type declarations specifies the number of bytes used for each type's binary representation. Therefore, you use `real4` to declare single-precision real values, `real8` to declare double-precision floating-point values, and `real10` to declare extended-precision floating-point values. Aside from using these types to declare floating-point variables rather than integers, their use is nearly identical to that of `byte`, `word`, `dword`, and so on. The following examples demonstrate these declarations and their syntax:

```
          .data

fltVar1   real4   ?
fltVar1a  real4   2.7
pi        real4   3.14159
DblVar    real8   ?
DblVar2   real8   1.23456789e+10
XPVar     real10  ?
XPVar2    real10  -1.0e-104
```

As usual, this book uses the C/C++ `printf()` function to print floating-point values to the console output. Certainly, an assembly language routine could be written to do this same thing, but the C Standard Library provides a convenient way to avoid writing that (complex) code, at least for the time being.

## 2.14 Binary-Coded Decimal Representation

Although the integer and floating-point formats cover most of the numeric needs of an average program, in some special cases other numeric representations are convenient. In this section, we'll discuss the *binary-coded decimal (BCD)* format because the x86-64 CPU provides a small amount of hardware support for this data representation.

BCD values are a sequence of nibbles, with each nibble representing a value in the range 0 to 9. With a single byte, we can represent values containing two decimal digits, or values in the range 0 to 99 (see Figure 2-24).

![f02024](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02024.png)

Figure 2-24: BCD data representation in memory

As you can see, BCD storage isn't particularly memory efficient.
For example, an 8-bit BCD variable can represent values in the range 0 to 99, while that same 8 bits, when holding a binary value, can represent values in the range 0 to 255\. Likewise, a 16-bit binary value can represent values in the range 0 to 65,535, while a 16-bit BCD value can represent only about one-sixth of those values (0 to 9999). However, it’s easy to convert BCD values between the internal numeric representation and their string representation, and to encode multi-digit decimal values in hardware (for example, using a thumb wheel or dial) using BCD. For these two reasons, you’re likely to see people using BCD in embedded systems (such as toaster ovens, alarm clocks, and nuclear reactors) but rarely in general-purpose computer software. The Intel x86-64 floating-point unit supports a pair of instructions for loading and storing BCD values. Internally, however, the FPU converts these BCD values to binary and performs all calculations in binary. It uses BCD only as an external data format (external to the FPU, that is). This generally produces more-accurate results and requires far less silicon than having a separate coprocessor that supports decimal arithmetic. ## 2.15 Characters Perhaps the most important data type on a personal computer is the `character` data type. The term *character* refers to a human or machine-readable symbol that is typically a non-numeric entity, specifically any symbol that you can normally type on a keyboard (including some symbols that may require multiple keypresses to produce) or display on a video display. Letters (*alphabetic characters*), punctuation symbols, numeric digits, spaces, tabs, carriage returns (enter), other control characters, and other special symbols are all characters. Most computer systems use a 1- or 2-byte sequence to encode the various characters in binary form. Windows, macOS, FreeBSD, and Linux use either the ASCII or Unicode encodings for characters. 
This section discusses the ASCII and Unicode character sets and the character declaration facilities that MASM provides.

### 2.15.1 The ASCII Character Encoding

The *American Standard Code for Information Interchange (ASCII) character set* maps 128 textual characters to the unsigned integer values 0 to 127 (0 to 7Fh). Although the exact mapping of characters to numeric values is arbitrary and unimportant, using a standardized code for this mapping is important because when you communicate with other programs and peripheral devices, you all need to speak the same “language.” ASCII is a standardized code that nearly everyone has agreed on: if you use the ASCII code 65 to represent the character `A`, then you know that a peripheral device (such as a printer) will correctly interpret this value as the character `A` whenever you transmit data to that device.

Despite some major shortcomings, ASCII data has become *the* standard for data interchange across computer systems and programs.^(16) Most programs can accept ASCII data; likewise, most programs can produce ASCII data. Because you will be dealing with ASCII characters in assembly language, it would be wise to study the layout of the character set and memorize a few key ASCII codes (for example, for `0`, `A`, `a`, and so on). See Appendix A for a list of all the ASCII character codes.

The ASCII character set is divided into four groups of 32 characters. The first 32 characters, ASCII codes 0 to 1Fh (31), form a special set of nonprinting characters, the *control characters*. We call them control characters because they perform various printer/display control operations rather than display symbols. Examples include *carriage return*, which positions the cursor to the left side of the current line of characters;^(17) *line feed*, which moves the cursor down one line on the output device; and *backspace*, which moves the cursor back one position to the left.
Unfortunately, different control characters perform different operations on different output devices. Little standardization exists among output devices. To find out exactly how a control character affects a particular device, you will need to consult its manual.

The second group of 32 ASCII character codes contains various punctuation symbols, special characters, and the numeric digits. The most notable characters in this group include the space character (ASCII code 20h) and the numeric digits (ASCII codes 30h to 39h).

The third group of 32 ASCII characters contains the uppercase alphabetic characters. The ASCII codes for the characters `A` to `Z` lie in the range 41h to 5Ah (65 to 90). Because there are only 26 alphabetic characters, the remaining 6 codes hold various special symbols.

The fourth, and final, group of 32 ASCII character codes represents the lowercase alphabetic symbols, 5 additional special symbols, and another control character (delete). The lowercase character symbols use the ASCII codes 61h to 7Ah. If you convert the codes for the upper- and lowercase characters to binary, you will notice that the uppercase symbols differ from their lowercase equivalents in exactly one bit position. For example, consider the character codes for `E` and `e` appearing in Figure 2-25.

![f02025](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02025.png)

Figure 2-25: ASCII codes for *E* and *e*

The only place these two codes differ is in bit 5. Uppercase characters always contain a 0 in bit 5; lowercase alphabetic characters always contain a 1 in bit 5. You can use this fact to quickly convert between upper- and lowercase. If you have an uppercase character, you can force it to lowercase by setting bit 5 to 1. If you have a lowercase character, you can force it to uppercase by setting bit 5 to 0. You can toggle an alphabetic character between upper- and lowercase by simply inverting bit 5.
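
The bit 5 trick described above can be sketched in C, the language this book already uses for its driver code. This is only an illustration under stated assumptions: the names `CASE_BIT`, `is_ascii_letter`, `ascii_to_lower`, `ascii_to_upper`, and `ascii_toggle_case` are hypothetical, and the letter check is an added safeguard, because changing bit 5 of a non-alphabetic character yields an unrelated character.

```c
#include <stdint.h>

/* Bit 5 (mask 20h) is the only bit that differs between an ASCII
   uppercase letter and its lowercase equivalent. */
#define CASE_BIT 0x20u

/* Returns nonzero if ch is an ASCII letter (41h-5Ah or 61h-7Ah). */
int is_ascii_letter(uint8_t ch)
{
    uint8_t upper = (uint8_t)(ch & ~CASE_BIT);  /* clear bit 5 */
    return upper >= 'A' && upper <= 'Z';
}

/* Force lowercase by setting bit 5 (logical OR). */
uint8_t ascii_to_lower(uint8_t ch)
{
    return is_ascii_letter(ch) ? (uint8_t)(ch | CASE_BIT) : ch;
}

/* Force uppercase by clearing bit 5 (logical AND with the inverted mask). */
uint8_t ascii_to_upper(uint8_t ch)
{
    return is_ascii_letter(ch) ? (uint8_t)(ch & ~CASE_BIT) : ch;
}

/* Toggle the case by inverting bit 5 (logical XOR). */
uint8_t ascii_toggle_case(uint8_t ch)
{
    return is_ascii_letter(ch) ? (uint8_t)(ch ^ CASE_BIT) : ch;
}
```

The three operations map directly onto the OR, AND, and XOR instructions you would use for the same job in assembly language.
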
Indeed, bits 5 and 6 determine which of the four groups in the ASCII character set you’re in, as Table 2-13 shows.

Table 2-13: ASCII Groups

| **Bit 6** | **Bit 5** | **Group** |
| --- | --- | --- |
| 0 | 0 | Control characters |
| 0 | 1 | Digits and punctuation |
| 1 | 0 | Uppercase and special |
| 1 | 1 | Lowercase and special |

So you could, for instance, convert any upper- or lowercase (or corresponding special) character to its equivalent control character by setting bits 5 and 6 to 0.

Consider, for a moment, the ASCII codes of the numeric digit characters appearing in Table 2-14.

Table 2-14: ASCII Codes for Numeric Digits

| **Character** | **Decimal** | **Hexadecimal** |
| --- | --- | --- |
| 0 | 48 | 30h |
| 1 | 49 | 31h |
| 2 | 50 | 32h |
| 3 | 51 | 33h |
| 4 | 52 | 34h |
| 5 | 53 | 35h |
| 6 | 54 | 36h |
| 7 | 55 | 37h |
| 8 | 56 | 38h |
| 9 | 57 | 39h |

The LO nibble of the ASCII code is the binary equivalent of the represented number. By stripping away (that is, setting to `0`) the HO nibble of a numeric character, you can convert that character code to the corresponding binary representation. Conversely, you can convert a binary value in the range 0 to 9 to its ASCII character representation by simply setting the HO nibble to `3`. You can use the logical AND operation to force the HO bits to 0; likewise, you can use the logical OR operation to force the HO bits to 0011b (3).

Unfortunately, you *cannot* convert a string of numeric characters to their equivalent binary representation by simply stripping the HO nibble from each digit in the string. Converting 123 (31h 32h 33h) in this fashion yields 3 bytes, 010203h, but the correct value for 123 is 7Bh. The conversion described in the preceding paragraph works only for single digits.

### 2.15.2 MASM Support for ASCII Characters

MASM provides support for character variables and literals in your assembly language programs.
Character literal constants in MASM take one of two forms: a single character surrounded by apostrophes or a single character surrounded by quotes, as follows:

```
'A'
"A"
```

Both forms represent the same character (`A`).

If you wish to represent an apostrophe or a quote within a string, use the other character as the string delimiter. For example:

```
'A "quotation" appears within this string'
"Can't have quotes in this string"
```

Unlike the C/C++ language, MASM doesn’t use different delimiters for single-character objects versus string objects, or differentiate between a character constant and a string constant with a single character. A character literal constant has a single character between the quotes (or apostrophes); a string literal has multiple characters between the delimiters.

To declare a character variable in a MASM program, you use the `byte` data type. For example, the following declaration demonstrates how to declare a variable named `UserInput`:

```
            .data
UserInput   byte    ?
```

This declaration reserves 1 byte of storage that you could use to store any character value (including 8-bit extended ASCII/ANSI characters). You can also initialize character variables as follows:

```
                .data
TheCharA        byte    'A'
ExtendedChar    byte    128   ; Character code greater than 7Fh
```

Because character variables are 8-bit objects, you can manipulate them using 8-bit registers. You can move character variables into 8-bit registers, and you can store the value of an 8-bit register into a character variable.

## 2.16 The Unicode Character Set

The problem with ASCII is that it supports only 128 character codes. Even if you extend the definition to 8 bits (as IBM did on the original PC), you’re limited to 256 characters. This is way too small for modern multinational/multilingual applications. Back in the 1990s, several companies developed an extension to ASCII, known as *Unicode*, using a 2-byte character size.
Therefore, (the original) Unicode supported up to 65,536 character codes. Alas, as well thought out as the original Unicode standard was, systems engineers discovered that even 65,536 symbols were insufficient. Today, Unicode defines 1,112,064 possible characters, encoded using a variable-length character format.

### 2.16.1 Unicode Code Points

A Unicode *code point* is an integer value that Unicode associates with a particular character symbol. The convention for Unicode code points is to specify the value in hexadecimal with a preceding U+ prefix; for example, U+0041 is the Unicode code point for the `A` character (41h is also the ASCII code for `A`; Unicode code points in the range U+0000 to U+007F correspond to the ASCII character set).

### 2.16.2 Unicode Code Planes

The Unicode standard defines code points in the range U+000000 to U+10FFFF (10FFFFh is 1,114,111, so the range holds 1,114,112 code points; 1,112,064 of them are the characters in the Unicode character set, and the remaining 2,048 code points are reserved for use as *surrogates*, which are Unicode extensions).^(18) The Unicode standard breaks this range up into 17 *multilingual planes*, each supporting up to 65,536 code points. The HO two hexadecimal digits of the six-digit code point value specify the multilingual plane, and the remaining four digits specify the character within the plane.

The first multilingual plane, U+000000 to U+00FFFF, roughly corresponds to the original 16-bit Unicode definition; the Unicode standard calls this the *Basic Multilingual Plane (BMP)*. Planes 1 (U+010000 to U+01FFFF), 2 (U+020000 to U+02FFFF), and 14 (U+0E0000 to U+0EFFFF) are supplementary (extension) planes. Unicode reserves planes 3 to 13 for future expansion, and planes 15 and 16 for user-defined character sets.

Obviously, representing Unicode code points outside the BMP requires more than 2 bytes.
To reduce memory usage, Unicode (specifically the UTF-16 encoding; see the next section) uses 2 bytes for the Unicode code points in the BMP, and uses 4 bytes to represent code points outside the BMP. Within the BMP, Unicode reserves the surrogate code points (U+D800–U+DFFF) to specify the 16 planes after the BMP. Figure 2-26 shows the encoding.

![f02026](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f02026.png)

Figure 2-26: Surrogate code point encoding for Unicode planes 1 to 16

Note that the two words (unit 1 and unit 2) always appear together. The unit 1 value (with HO bits 110110b) specifies the upper 10 bits (b[10] to b[19]) of the Unicode scalar, and the unit 2 value (with HO bits 110111b) specifies the lower 10 bits (b[0] to b[9]) of the Unicode scalar. Therefore, bits b[16] to b[19] (plus one) specify Unicode plane 1 to 16. Bits b[0] to b[15] specify the Unicode scalar value within the plane.

### 2.16.3 Unicode Encodings

As of Unicode v2.0, the standard supports a 21-bit character space capable of handling over a million characters (though most of the code points remain reserved for future use). Rather than use a 3-byte (or worse, 4-byte) encoding to allow the larger character set, Unicode, Inc., allowed different encodings, each with its own advantages and disadvantages.

*UTF-32* uses 32-bit integers to hold Unicode scalars.^(19) The advantage to this scheme is that a 32-bit integer can represent every Unicode scalar value (which requires only 21 bits). Random access to characters in strings (without having to search for surrogate pairs) and other constant-time operations are (mostly) possible when using UTF-32. The obvious drawback to UTF-32 is that each Unicode scalar value requires 4 bytes of storage (twice that of the original Unicode definition and four times that of ASCII characters).

The second encoding format that Unicode supports is *UTF-16*.
As the name suggests, UTF-16 uses 16-bit (unsigned) integers to represent Unicode values. To handle scalar values greater than 0FFFFh, UTF-16 uses the surrogate pair scheme to represent values in the range 010000h to 10FFFFh (see the discussion of code planes and surrogate code points in the previous section). Because the vast majority of useful characters fit into 16 bits, most UTF-16 characters require only 2 bytes. For those rare cases where surrogates are necessary, UTF-16 requires two words (32 bits) to represent the character.

The last encoding, and unquestionably the most popular, is *UTF-8*. The UTF-8 encoding is upward compatible from the ASCII character set. In particular, all ASCII characters have a single-byte representation (their original ASCII code, where the HO bit of the byte containing the character contains a 0 bit). If the UTF-8 HO bit is 1, UTF-8 requires additional bytes (1 to 3 additional bytes) to represent the Unicode code point. Table 2-15 provides the UTF-8 encoding schema.

Table 2-15: UTF-8 Encoding

| **Bytes** | **Bits for code point** | **First code point** | **Last code point** | **Byte 1** | **Byte 2** | **Byte 3** | **Byte 4** |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7 | U+00 | U+7F | 0*xxxxxxx* | | | |
| 2 | 11 | U+80 | U+7FF | 110*xxxxx* | 10*xxxxxx* | | |
| 3 | 16 | U+800 | U+FFFF | 1110*xxxx* | 10*xxxxxx* | 10*xxxxxx* | |
| 4 | 21 | U+10000 | U+10FFFF | 11110*xxx* | 10*xxxxxx* | 10*xxxxxx* | 10*xxxxxx* |

The `xxx...` bits are the Unicode code point bits. For multi-byte sequences, byte 1 contains the HO bits, byte 2 contains the next HO bits, and so on. For example, the 2-byte sequence 11011111b, 10000001b corresponds to the Unicode scalar 0000_0111_1100_0001b (U+07C1).

## 2.17 MASM Support for Unicode

Unfortunately, MASM provides almost zero support for Unicode text in a source file. Fortunately, MASM’s macro facilities provide a way for you to create your own Unicode support for strings in MASM.
See Chapter 13 for more details on MASM macros. I will also return to this subject in *The Art of 64-Bit Assembly*, Volume 2, where I will spend considerable time describing how to force MASM to accept and process Unicode strings in source and resource files.

## 2.18 For More Information

For general information about data representation and Boolean functions, consider reading my book *Write Great Code*, Volume 1, Second Edition (No Starch Press, 2020), or a textbook on data structures and algorithms (available at any bookstore).

ASCII, EBCDIC, and Unicode are all international standards. You can find out more about the Extended Binary Coded Decimal Interchange Code (EBCDIC) character set families on IBM’s website ([`www.ibm.com/`](http://www.ibm.com/)). ASCII and Unicode are both International Organization for Standardization (ISO) standards, and ISO provides reports for both character sets. Generally, those reports cost money, but you can also find out lots of information about the ASCII and Unicode character sets by searching for them by name on the internet. You can also read about Unicode at [`www.unicode.org/`](http://www.unicode.org/). *Write Great Code* also contains additional information on the history, use, and encoding of the Unicode character set.

## 2.19 Test Yourself

1. What does the decimal value 9384.576 represent (in terms of powers of 10)?
2. Convert the following binary values to decimal:
    1. 1010
    2. 1100
    3. 0111
    4. 1001
    5. 0011
    6. 1111
3. Convert the following binary values to hexadecimal:
    1. 1010
    2. 1110
    3. 1011
    4. 1101
    5. 0010
    6. 1100
    7. 1100_1111
    8. 1001_1000_1101_0001
4. Convert the following hexadecimal values to binary:
    1. 12AF
    2. 9BE7
    3. 4A
    4. 137F
    5. F00D
    6. BEAD
    7. 4938
5. Convert the following hexadecimal values to decimal:
    1. A
    2. B
    3. F
    4. D
    5. E
    6. C
6. How many bits are there in a:
    1. Word
    2. Qword
    3. Oword
    4. Dword
    5. BCD digit
    6. Byte
    7. Nibble
7. How many bytes are there in a:
    1. Word
    2. Dword
    3. Qword
    4. Oword
8. How many different values can you represent with a:
    1. Nibble
    2. Byte
    3. Word
    4. Bit
9. How many bits does it take to represent a hexadecimal digit?
10. How are the bits in a byte numbered?
11. Which bit number is the LO bit of a word?
12. Which bit number is the HO bit of a dword?
13. Compute the logical AND of the following binary values:
    1. 0 and 0
    2. 0 and 1
    3. 1 and 0
    4. 1 and 1
14. Compute the logical OR of the following binary values:
    1. 0 and 0
    2. 0 and 1
    3. 1 and 0
    4. 1 and 1
15. Compute the logical XOR of the following binary values:
    1. 0 and 0
    2. 0 and 1
    3. 1 and 0
    4. 1 and 1
16. The logical NOT operation is the same as XORing with what value?
17. Which logical operation would you use to force bits to 0 in a bit string?
18. Which logical operation would you use to force bits to 1 in a bit string?
19. Which logical operation would you use to invert all the bits in a bit string?
20. Which logical operation would you use to invert selected bits in a bit string?
21. Which machine instruction will invert all the bits in a register?
22. What is the two’s complement of the 8-bit value 5 (00000101b)?
23. What is the two’s complement of the signed 8-bit value –2 (11111110b)?
24. Which of the following signed 8-bit values are negative?
    1. 1111_1111b
    2. 0111_0001b
    3. 1000_0000b
    4. 0000_0000b
    5. 1000_0001b
    6. 0000_0001b
25. Which machine instruction takes the two’s complement of a value in a register or memory location?
26. Which of the following 16-bit values can be correctly sign-contracted to 8 bits?
    1. 1111_1111_1111_1111
    2. 1000_0000_0000_0000
    3. 0000_0000_0000_0001
    4. 1111_1111_1111_0000
    5. 1111_1111_0000_0000
    6. 0000_1111_0000_1111
    7. 0000_0000_1111_1111
    8. 0000_0001_0000_0000
27. What machine instruction provides the equivalent of an HLL `goto` statement?
28. What is the syntax for a MASM statement label?
29. What flags are the condition codes?
30. *JE* is a synonym for what instruction that tests a condition code?
31. *JB* is a synonym for what instruction that tests a condition code?
32. Which conditional jump instructions transfer control based on an unsigned comparison?
33. Which conditional jump instructions transfer control based on a signed comparison?
34. How does the SHL instruction affect the zero flag?
35. How does the SHL instruction affect the carry flag?
36. How does the SHL instruction affect the overflow flag?
37. How does the SHL instruction affect the sign flag?
38. How does the SHR instruction affect the zero flag?
39. How does the SHR instruction affect the carry flag?
40. How does the SHR instruction affect the overflow flag?
41. How does the SHR instruction affect the sign flag?
42. How does the SAR instruction affect the zero flag?
43. How does the SAR instruction affect the carry flag?
44. How does the SAR instruction affect the overflow flag?
45. How does the SAR instruction affect the sign flag?
46. How does the RCL instruction affect the carry flag?
47. How does the RCL instruction affect the zero flag?
48. How does the RCR instruction affect the carry flag?
49. How does the RCR instruction affect the sign flag?
50. A shift left is equivalent to what arithmetic operation?
51. A shift right is equivalent to what arithmetic operation?
52. When performing a chain of floating-point addition, subtraction, multiplication, and division operations, which operations should you try to do first?
53. How should you compare floating-point values for equality?
54. What is a normalized floating-point value?
55. How many bits does a (standard) ASCII character require?
56. What is the hexadecimal representation of the ASCII characters 0 through 9?
57. What delimiter character(s) does MASM use to define character constants?
58. What are the three common encodings for Unicode characters?
59. What is a Unicode code point?
60. What is a Unicode code plane?

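# 3 Memory Access and Organization

Chapters 1 and 2 showed you how to declare and access simple variables in an assembly language program. This chapter fully explains x86-64 memory access. In this chapter, you will learn how to efficiently organize your variable declarations to speed up access to their data. You’ll also learn about the x86-64 stack and how to manipulate data on the stack.

This chapter discusses several important concepts, including the following:

- Memory organization
- Memory allocation by program
- x86-64 memory addressing modes
- Indirect and scaled-indexed addressing modes
- Data type coercion
- The x86-64 stack

This chapter will teach you to make efficient use of your computer’s memory resources.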

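## 3.1 Runtime Memory Organization

A running program uses memory in many ways, depending on the data’s type. Here are some common data classifications you’ll find in an assembly language program:

Code

Memory values that encode machine instructions.

Uninitialized static data

An area in memory that the program sets aside for uninitialized variables that exist the whole time the program runs; Windows initializes this storage area to 0s when it loads the program into memory.

Initialized static data

A section of memory that also exists the whole time the program runs. However, Windows loads values for all the variables appearing in this section from the program’s executable file, so they have an initial value when the program first begins execution.

Read-only data

Similar to initialized static data insofar as Windows loads initial data for this section of memory from the executable file. However, this section of memory is marked read-only to prevent inadvertent modification of the data. Programs typically store constants and other unchanging data in this section of memory (by the way, the operating system also marks the code section as read-only).

Heap

This special section of memory is designated to hold dynamically allocated storage. Functions such as C’s `malloc()` and `free()` are responsible for allocating and deallocating storage in the heap area. “Pointer Variables and Dynamic Memory Allocation” in Chapter 4 discusses dynamic storage allocation in greater detail.

Stack

In this special section in memory, the program maintains local variables for procedures and functions, program state information, and other transient data. See “The Stack Segment and the push and pop Instructions” for more information about the stack section.

These are the typical sections you will find in common programs (assembly language or otherwise). Smaller programs won’t use all of these sections (code, stack, and data sections are a good minimum configuration). Complex programs may create additional sections in memory for their own purposes. Some programs may combine several of these sections. For example, many programs combine the code and read-only data sections into the same section in memory (as the data in both sections is marked read-only). Some programs combine the uninitialized and initialized data sections (initializing the uninitialized variables to 0). Combining sections is generally handled by the linker program. See the Microsoft linker documentation for more details on combining sections.^(1)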

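Windows tends to put different types of data into different sections (or *segments*) of memory. Although it is possible to reconfigure memory by running the linker and specifying various parameters, by default Windows loads a MASM program into memory by using an organization similar to that in Figure 3-1.^(2)

Figure 3-1: MASM typical runtime memory organization

Windows reserves the lowest memory addresses. Generally, your application cannot access data (or execute instructions) at these low addresses. One reason the OS reserves this space is to help trap NULL pointer references: if you attempt to access memory location 0 (NULL), the OS will generate a *general protection fault* (also known as a *segmentation fault*), meaning you’ve accessed a memory location that doesn’t contain valid data.

The remaining six areas in the memory map hold different types of data associated with your program. These sections of memory are the stack section, the heap section, the `.code` section, the `.data` (static) section, the `.const` section, and the `.data?` (storage) section. Each corresponds to a type of data you can create in your MASM programs. The `.code`, `.data`, `.const`, and `.data?` sections are described next in detail.^(3)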

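### 3.1.1 The `.code` Section

The `.code` section contains the machine instructions that appear in a MASM program. MASM translates each machine instruction you write into a sequence of one or more byte values. The CPU interprets these byte values as machine instructions during program execution.

By default, when MASM links your program, it tells the system that your program can execute instructions and read data from the code segment but cannot write data to the code segment. The OS will generate a general protection fault if you attempt to store any data into the code segment.

### 3.1.2 The `.data` Section

The `.data` section is where you will typically put your variables. In addition to declaring static variables, you can also embed lists of data into the `.data` declaration section. You use the same technique to embed data into your `.data` section that you use to embed data into the `.code` section: you use the `byte`, `word`, `dword`, `qword`, and so on, directives. Consider the following example: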

```
    .data
b   byte    0
    byte    1, 2, 3

u   dword   1
    dword   5, 2, 10

c   byte    ?
    byte    'a', 'b', 'c', 'd', 'e', 'f'

bn  byte    ?
    byte    true  ; Assumes true is defined as "1"
```

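Values that MASM places in the `.data` memory segment by using these directives are written to the segment after the preceding variables. For example, the byte values `1`, `2`, and `3` are emitted to the `.data` section after `b`’s `0` byte. Because there aren’t any labels associated with these values, you do not have direct access to them in your program. You can use the indexed addressing modes to access these extra values.

In the preceding example, note that the `c` and `bn` variables do not have an (explicit) initial value. However, if you don’t provide an initial value, MASM initializes the variables in the `.data` section to 0, so MASM assigns the NULL character (ASCII code 0) to `c` as its initial value. Likewise, MASM assigns false as the initial value for `bn` (assuming false is defined as `0`). Variable declarations in the `.data` section always consume memory, even if you haven’t assigned them an initial value.

### 3.1.3 The `.const` Section

The `.const` data section holds constants, tables, and other data that your program cannot change during execution. You create read-only objects by declaring them in the `.const` declaration section. The `.const` section is similar to the `.data` section, with three differences:

- The `.const` section begins with the reserved word `.const` rather than `.data`.
- All declarations in the `.const` section have an initializer.
- The system does not allow you to write data to variables in a `.const` object while the program is running.

Here’s an example: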

```
        .const
pi      real4     3.14159
e       real4     2.71
MaxU16  word      65535
MaxI16  sword     32767
```

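All `.const` object declarations must have an initializer because you cannot initialize the value under program control. For many purposes, you can treat `.const` objects as literal constants. However, because they are actually memory objects, they behave like (read-only) `.data` objects. You cannot use a `.const` object everywhere a literal constant is allowed; for example, you cannot use one as a displacement in an addressing mode (see “The x86-64 Addressing Modes”), and you cannot use one in a constant expression. In practice, you can use them anywhere that reading a `.data` variable is legal.

As with the `.data` section, you may embed data values in the `.const` section by using the `byte`, `word`, `dword`, and so on, data declarations, though all declarations must be initialized. For example: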

```
        .const
roArray byte     0
        byte     1, 2, 3, 4, 5
qwVal   qword    1
        qword    0
```

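Note that you can also declare constant values in the `.code` section. Data values you declare in this section are also read-only objects, as Windows write-protects the `.code` section. If you do place constant declarations in a `.code` section, you should take care to place them in a location that the program will not attempt to execute as code (such as after a `jmp` or `ret` instruction). Unless you’re manually encoding x86 machine instructions by using data declarations (which would be rare, and done only by expert programmers), you don’t want your program to attempt to execute data as machine instructions; the result is usually undefined.^(4)

### 3.1.4 The `.data?` Section

The `.const` section requires that you initialize all objects you declare. The `.data` section lets you optionally initialize objects (or leave them uninitialized, in which case they have the default initial value of `0`). The `.data?` section lets you declare variables that are always uninitialized when the program begins running. The `.data?` section begins with the `.data?` reserved word and contains variable declarations without initializers. Here is an example: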

```
            .data?
UninitUns32 dword  ?
i           sdword ?
character   byte   ?
b           byte   ?
```

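Windows initializes all `.data?` objects to 0 when it loads your program into memory. However, it’s probably not a good idea to depend on this implicit initialization. If you need an object initialized with 0, declare it in a `.data` section and explicitly set it to 0.

Variables you declare in the `.data?` section may consume less disk space in the executable file for the program. This is because MASM writes out initial values for `.const` and `.data` objects to the executable file, but it may use a compact representation for uninitialized variables you declare in the `.data?` section; note, however, that this behavior depends on the OS version and object-module format.

### 3.1.5 Organization of Declaration Sections Within Your Programs

The `.data`, `.const`, `.data?`, and `.code` sections may appear zero or more times in your program. The declaration sections may appear in any order, as the following example demonstrates: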

```
           .data
i_static   sdword    0

           .data?
i_uninit   sdword    ?

           .const
i_readonly dword     5

           .data
j          dword     ?

           .const
i2         dword     9

           .data?
c          byte      ?

           .data?
d          dword     ?

           .code

; Code goes here

           end
```

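The sections may appear in an arbitrary order, and a given declaration section may appear more than once in your program. As noted previously, when multiple declaration sections of the same type (for example, the three `.data?` sections in the preceding example) appear in your program, MASM combines them into a single group (in any order it pleases).

### 3.1.6 Memory Access and 4K Memory Management Unit Pages

The x86-64’s *memory management unit (MMU)* divides memory into blocks known as *pages*.^(5) The operating system is responsible for managing pages in memory, so application programs don’t typically worry about page organization. However, you should be aware of a couple of issues when working with pages in memory: specifically, whether the CPU even allows access to a given memory location, and whether that location is read/write or read-only (write-protected).

Each program section appears in memory in contiguous MMU pages. That is, the `.const` section begins at offset 0 in an MMU page and sequentially consumes pages in memory for all the data appearing in that section. The next section in memory (perhaps `.data`) begins at offset 0 in the next MMU page following the last page of the previous section. If that previous section (for example, `.const`) did not consume an integral multiple of 4096 bytes, padding space will be present between the end of that section’s data and the end of its last page (to guarantee that the next section begins on an MMU page boundary).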
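
The padding arithmetic in the preceding paragraph can be sketched in C. This is a rough illustration rather than anything MASM or the linker exposes: the names `PAGE_SIZE`, `section_padding`, and `next_section_offset` are hypothetical, and the sketch assumes the 4096-byte page size this section describes.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* MMU page size assumed in this section */

/* Padding bytes between the end of a section's data and the end of
   its last page (0 when the data exactly fills its pages). */
uint64_t section_padding(uint64_t section_bytes)
{
    return (PAGE_SIZE - section_bytes % PAGE_SIZE) % PAGE_SIZE;
}

/* Offset at which the next section begins, given the page-aligned
   offset of the current section and the size of its data. */
uint64_t next_section_offset(uint64_t section_offset, uint64_t section_bytes)
{
    assert(section_offset % PAGE_SIZE == 0);  /* sections start on a page boundary */
    return section_offset + section_bytes + section_padding(section_bytes);
}
```

For example, a section holding 100 bytes of data leaves 3996 bytes of padding, so a section that follows it starts 4096 bytes later.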

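Each new section starts in its own MMU page because the MMU controls access to memory by using page granularity. For example, the MMU controls whether a page in memory is readable/writable or read-only. For `.const` sections, you want the memory to be read-only. For the `.data` section, you want to allow reads and writes. Because the MMU can enforce these attributes only on a page-by-page basis, you cannot have `.data` section information in the same MMU page as a `.const` section.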

Normally, all of this is completely transparent to your code. Data you declare in a `.data` (or `.data?`) section is readable and writable, and data in a `.const` section (and `.code` section) is read-only (`.code` sections are also *executable*). Beyond placing data in a particular section, you don’t have to worry too much about the page attributes.

You do have to worry about MMU page organization in memory in one situation. Sometimes it is convenient to access (read) data beyond the end of a data structure in memory (for legitimate reasons; see Chapter 11 on SIMD instructions and Chapter 14 on string instructions). However, if that data structure is aligned with the end of an MMU page, accessing the next page in memory could be problematic. Some pages in memory are *inaccessible*; the MMU does not allow reading, writing, or execution to occur on that page. Attempting to do so will generate an x86-64 *general protection (segmentation) fault* and abort the normal execution of your program.^(6) If you have a data access that crosses a page boundary, and the next page in memory is inaccessible, this will crash your program. For example, consider a word access to a byte object at the very end of an MMU page, as shown in Figure 3-2.

![f03002](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03002.png)

Figure 3-2: Word access at the end of an MMU page

As a general rule, you should never read data beyond the end of a data structure.^(7) If for some reason you need to do so, you should ensure that it is legal to access the next page in memory (alas, there is no instruction on modern x86-64 CPUs to allow this; the only way to be sure that access is legal is to make sure there is valid data after the data structure you are accessing).

## 3.2 How MASM Allocates Memory for Variables

MASM associates a current *location counter* with each of the four declaration sections (`.code`, `.data`, `.const`, and `.data?`).
These location counters initially contain `0`, and whenever you declare a variable in one of these sections (or write code in a code section), MASM associates the current value of that section’s location counter with the variable; MASM also bumps up the value of that location counter by the size of the object you’re declaring. As an example, assume that the following is the only `.data` declaration section in a program:

```
    .data
b   byte   ?    ; Location counter = 0,  size = 1
w   word   ?    ; Location counter = 1,  size = 2
d   dword  ?    ; Location counter = 3,  size = 4
q   qword  ?    ; Location counter = 7,  size = 8
o   oword  ?    ; Location counter = 15, size = 16
                ; Location counter is now 31
```

As you can see, the variable declarations appearing in a (single) `.data` section have contiguous offsets (location counter values) into the `.data` section. Given the preceding declaration, `w` will immediately follow `b` in memory, `d` will immediately follow `w` in memory, `q` will immediately follow `d`, and so on. These offsets aren’t the actual runtime address of the variables. At runtime, the system loads each section to a (base) address in memory. The linker and Windows add the base address of the memory section to each of these location counter values (which we call *displacements*, or *offsets*) to produce the actual memory address of the variables.

Keep in mind that you may link other modules with your program (for example, from the C Standard Library) or even additional `.data` sections in the same source file, and the linker has to merge the `.data` sections together. Each section has its own location counter that also starts from zero when allocating storage for the variables in the section. Hence, the offset of an individual variable may have little bearing on its final memory address.

Remember that MASM allocates memory objects you declare in `.const`, `.data`, and `.data?` sections in completely different regions of memory.
Therefore, you cannot assume that the following three memory objects appear in adjacent memory locations (indeed, they probably will not):

```
    .data
b   byte    ?

    .const
w   word    1234h

    .data?
d   dword   ?
```

In fact, MASM will not even guarantee that variables you declare in separate `.data` (or whatever) sections are adjacent in memory, even if there is nothing between the declarations in your code. For example, you cannot assume that `b`, `w`, and `d` are in adjacent memory locations in the following declarations, nor can you assume that they *won’t* be adjacent in memory:

```
    .data
b   byte    ?

    .data
w   word    1234h

    .data
d   dword   ?
```

If your code requires these variables to consume adjacent memory locations, you must declare them in the same `.data` section.

## 3.3 The Label Declaration

The `label` declaration lets you declare variables in a section (`.code`, `.data`, `.const`, and `.data?`) without allocating memory for the variable. The `label` directive tells MASM to assign the current address in a declaration section to a variable but not to allocate any storage for the object. That variable shares the same memory address as the next object appearing in the variable declaration section. Here is the syntax for the `label` declaration:

```
variable_name label type
```

The following code sequence provides an example of using the `label` declaration in the `.const` section:

```
        .const
abcd    label   dword
        byte    'a', 'b', 'c', 'd'
```

In this example, `abcd` is a double word whose LO byte contains 97 (the ASCII code for `a`), byte 1 contains 98 (`b`), byte 2 contains 99 (`c`), and the HO byte contains 100 (`d`). MASM does not reserve storage for the `abcd` variable, so MASM associates the following 4 bytes in memory (allocated by the `byte` directive) with `abcd`.
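
To see what value a double-word load of `abcd` would produce, the byte layout just described can be modeled in C. This is a minimal sketch, not part of MASM: the function name `dword_at` is hypothetical, and it builds the value arithmetically (LO byte first, as the x86-64 does) rather than relying on the byte order of the machine running the C code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine four bytes stored at increasing addresses into the double
   word an x86-64 CPU would load from the first address: bytes[0] is
   the LO byte and bytes[3] is the HO byte. */
uint32_t dword_at(const uint8_t *bytes)
{
    assert(bytes != NULL);
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}
```

With the bytes `'a'`, `'b'`, `'c'`, `'d'` (61h, 62h, 63h, 64h) stored in that order, the function returns 64636261h.
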
## 3.4 Little-Endian and Big-Endian Data Organization

Back in “The Memory Subsystem” in Chapter 1, this book pointed out that the x86-64 stores multi-byte data types in memory with the LO byte at the lowest address in memory and the HO byte at the highest address in memory (see Figure 1-5 in Chapter 1). This type of data organization in memory is known as *little endian*. Little-endian data organization (in which the LO byte comes first and the HO byte comes last) is a common memory organization shared by many modern CPUs. It is not, however, the only possible data organization.

The *big-endian* data organization reverses the order of the bytes in memory. The HO byte of the data structure appears first (in the lowest memory address), and the LO byte appears in the highest memory address. Tables 3-1, 3-2, and 3-3 describe the memory organization for words, double words, and quad words, respectively.

Table 3-1: Word Object Little- and Big-Endian Data Organizations

| **Data byte** | **Memory organization for little endian** | **Memory organization for big endian** |
| --- | --- | --- |
| 0 (LO byte) | base + 0 | base + 1 |
| 1 (HO byte) | base + 1 | base + 0 |

Table 3-2: Double-Word Object Little- and Big-Endian Data Organizations

| **Data byte** | **Memory organization for little endian** | **Memory organization for big endian** |
| --- | --- | --- |
| 0 (LO byte) | base + 0 | base + 3 |
| 1 | base + 1 | base + 2 |
| 2 | base + 2 | base + 1 |
| 3 (HO byte) | base + 3 | base + 0 |

Table 3-3: Quad-Word Object Little- and Big-Endian Data Organizations

| **Data byte** | **Memory organization for little endian** | **Memory organization for big endian** |
| --- | --- | --- |
| 0 (LO byte) | base + 0 | base + 7 |
| 1 | base + 1 | base + 6 |
| 2 | base + 2 | base + 5 |
| 3 | base + 3 | base + 4 |
| 4 | base + 4 | base + 3 |
| 5 | base + 5 | base + 2 |
| 6 | base + 6 | base + 1 |
| 7 (HO byte) | base + 7 | base + 0 |

Normally, you wouldn’t be too concerned with big-endian
memory organization on an x86-64 CPU. However, on occasion you may need to deal with data produced by a different CPU (or by a protocol, such as TCP/IP, that uses big-endian organization as its canonical integer format). If you were to load a big-endian value in memory into a CPU register, your calculations would be incorrect.

If you have a 16-bit big-endian value in memory and you load it into a 16-bit register, it will have its bytes swapped. For 16-bit values, you can correct this issue by using the `xchg` instruction. It has the syntax

```
xchg reg, reg
xchg reg, mem
```

where `reg` is any 8-, 16-, 32-, or 64-bit general-purpose register, and `mem` is any appropriate memory location. The `reg` operands in the first instruction, or the `reg` and `mem` operands in the second instruction, must both be the same size.

Though you can use the `xchg` instruction to exchange the values between any two arbitrary (like-sized) registers, or a register and a memory location, it is also useful for converting between (16-bit) little- and big-endian formats. For example, if AX contains a big-endian value that you would like to convert to little-endian form prior to some calculations, you can use the following instruction to swap the bytes in the AX register to convert the value to little-endian form:

```
xchg al, ah
```

You can use the `xchg` instruction to convert between little- and big-endian form for any of the 16-bit registers AX, BX, CX, and DX by using the low/high register designations (AL/AH, BL/BH, CL/CH, and DL/DH).

Unfortunately, the `xchg` trick doesn’t work for registers other than AX, BX, CX, and DX. To handle larger values, Intel introduced the `bswap` (*byte swap*) instruction. As its name suggests, this instruction swaps the bytes in a 32- or 64-bit register. It swaps the HO and LO bytes, and the (HO – 1) and (LO + 1) bytes (plus all the other bytes, in opposing pairs, for 64-bit registers).
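
As a rough model of what these two byte swaps compute, here is a short C sketch. It only mirrors the results described above using shifts and masks; the function names `swap16` and `swap32` are hypothetical, and a compiler may or may not turn them into `xchg` or `bswap` instructions.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value (the effect of "xchg al, ah"
   on the value in AX). */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Reverse the four bytes of a 32-bit value (the effect of bswap on a
   32-bit register): HO and LO bytes trade places, as do the two
   middle bytes. */
uint32_t swap32(uint32_t v)
{
    return  (v << 24)
         | ((v << 8) & 0x00FF0000u)
         | ((v >> 8) & 0x0000FF00u)
         |  (v >> 24);
}
```

Applying either function twice returns the original value, which is why the same operation converts in both directions between little- and big-endian form.
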
The `bswap` instruction works for all general-purpose 32-bit and 64-bit registers.

## 3.5 Memory Access

As you saw in “The Memory Subsystem” in Chapter 1, the x86-64 CPU fetches data from memory on the data bus. In an idealized CPU, the data bus is the size of the standard integer registers on the CPU; therefore, you would expect the x86-64 CPUs to have a 64-bit data bus. In practice, modern CPUs often make the physical data bus connection to main memory much larger in order to improve system performance. The bus brings in large chunks of data from memory in a single operation and places that data in the CPU’s *cache*, which acts as a buffer between the CPU and physical memory.

From the CPU’s point of view, the cache *is* memory. Therefore, when the remainder of this section discusses memory, it’s generally talking about data sitting in the cache. As the system transparently maps memory accesses into the cache, we can discuss memory as though the cache were not present and discuss the advantages of the cache as necessary.

On early x86 processors, memory was arranged as an array of bytes (8-bit machines such as the 8088), words (16-bit machines such as the 8086 and 80286), or double words (on 32-bit machines such as the 80386). On a 16-bit machine, the LO bit of the address did not physically appear on the address bus. So the addresses 126 and 127 put the same bit pattern on the address bus (126, with an implicit `0` in bit position 0), as shown in Figure 3-3.^(8)

![f03003](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03003.png)

Figure 3-3: Address and data bus for 16-bit processors

When reading a byte, the CPU uses the LO bit of the address to select the LO byte or HO byte on the data bus. Figure 3-4 shows the process when accessing a byte at an even address (126 in this figure). Figure 3-5 shows the same operation when reading a byte from an odd address (127 in this figure).
Note that in both Figures 3-4 and 3-5, the address appearing on the address bus is 126.

![f03004](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03004.png)

Figure 3-4: Reading a byte from an even address on a 16-bit CPU

![f03005](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03005.png)

Figure 3-5: Reading a byte from an odd address on a 16-bit CPU

So, what happens when this 16-bit CPU wants to access 16 bits of data at an odd address? For example, suppose in these figures the CPU reads the word at address 125. When the CPU puts address 125 on the address bus, the LO bit doesn’t physically appear. Therefore, the actual address on the bus is 124. If the CPU were to read the LO 8 bits off the data bus at this point, it would get the data at address 124, not address 125.

Fortunately, the CPU is smart enough to figure out what is going on here, and extracts the data from the HO 8 bits on the data bus and uses this as the LO 8 bits of the data operand. However, the HO 8 bits that the CPU needs are not found on the data bus. The CPU has to initiate a second read operation, placing address 126 on the address bus, to get the HO 8 bits (which will be sitting in the LO 8 bits of the data bus, but the CPU can figure that out). The bottom line is that it takes two memory cycles for this read operation to complete. Therefore, the instruction reading the data from memory will take longer to execute than had the data been read from an address that was an integral multiple of two.

The same problem exists on 32-bit processors, except the 32-bit data bus allows the CPU to read 4 bytes at a time. Reading a 32-bit value at an address that is not an integral multiple of four incurs the same performance penalty. Note, however, that accessing a 16-bit operand at an odd address doesn’t always guarantee an extra memory cycle; only addresses whose remainder when divided by four is 3 incur the penalty.
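The cycle counts described above can be reproduced with a little arithmetic. The sketch below is a simplified model, not a description of any real bus controller: it assumes each memory cycle transfers exactly one aligned block of `bus_bytes` bytes (2 for a 16-bit bus, 4 for a 32-bit bus), and the function name `bus_cycles` is hypothetical.

```c
#include <stdint.h>

/*
 * Number of bus transfers needed to read "size" bytes starting at "addr"
 * when each transfer fetches one aligned block of "bus_bytes" bytes.
 * Assumes size >= 1 and bus_bytes >= 1.
 */
unsigned bus_cycles(uint64_t addr, unsigned size, unsigned bus_bytes)
{
    uint64_t first_block = addr / bus_bytes;              /* block holding the first byte */
    uint64_t last_block  = (addr + size - 1) / bus_bytes; /* block holding the last byte  */
    return (unsigned)(last_block - first_block + 1);
}
```

Under this model, a word at address 125 on a 16-bit bus costs two cycles (`bus_cycles(125, 2, 2)`), while a word at address 126 costs one. On a 32-bit bus, a word whose address leaves a remainder of 1 when divided by four still fits in a single block, whereas a remainder of 3 straddles two blocks.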
In particular, if you access a 16-bit value (on a 32-bit bus) at an address where the LO 2 bits contain 01b, the CPU can read the word in a single memory cycle, as shown in Figure 3-6.

Modern x86-64 CPUs, with cache systems, have largely eliminated this problem. As long as the data (1, 2, 4, 8, or 10 bytes in size) is fully within a cache line, there is no memory cycle penalty for an unaligned access. If the access does cross a cache line boundary, the CPU will run a bit slower while it executes two memory operations to get (or store) the data.

![f03006](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03006.png)

Figure 3-6: Accessing a word on a 32-bit data bus

## 3.6 MASM Support for Data Alignment

To write fast programs, you need to ensure that you properly align data objects in memory. Proper *alignment* means that the starting address of an object is a multiple of a certain size, usually the object’s own size when that size is a power of 2 (for objects up to 32 bytes in length). For objects greater than 32 bytes, aligning the object on an 8-, 16-, or 32-byte address boundary is probably sufficient. For objects smaller than 16 bytes, aligning the object at an address that is the next power of 2 greater than the object’s size is usually fine. Accessing data that is not aligned at an appropriate address may require extra time (as noted in the previous section); so, if you want to ensure that your program runs as rapidly as possible, you should try to align data objects according to their size.

Data becomes misaligned whenever you allocate storage for different-sized objects in adjacent memory locations. For example, if you declare a byte variable, it will consume 1 byte of storage, and the next variable you declare in that declaration section will have the address of that byte object plus 1. If the byte variable’s address happens to be an even address, the variable following that byte will start at an odd address.
If that following variable is a word or double-word object, its starting address will not be optimal. In this section, we’ll explore ways to ensure that a variable is aligned at an appropriate starting address based on that object’s size.

Consider the following MASM variable declarations:

```
      .data
dw    dword  ?
b     byte   ?
w     word   ?
dw2   dword  ?
w2    word   ?
b2    byte   ?
dw3   dword  ?
```

The first `.data` declaration in a program (running under Windows) places its variables at an address that is a multiple of 4096 bytes. Whatever variable first appears in that `.data` declaration is guaranteed to be aligned on a reasonable address. Each successive variable is allocated at an address that is the sum of the sizes of all the preceding variables plus the starting address of that `.data` section. Therefore, assuming MASM allocates the variables in the previous example at a starting address of `4096`, MASM will allocate them at the following addresses:

```
                   ; Start Adrs     Length
dw    dword  ?     ;   4096           4
b     byte   ?     ;   4100           1
w     word   ?     ;   4101           2
dw2   dword  ?     ;   4103           4
w2    word   ?     ;   4107           2
b2    byte   ?     ;   4109           1
dw3   dword  ?     ;   4110           4
```

With the exception of the first variable (which is aligned on a 4KB boundary) and the byte variables (whose alignment doesn’t matter), all of these variables are misaligned. The `w`, `w2`, and `dw2` variables start at odd addresses, and the `dw3` variable is aligned on an even address that is not a multiple of four.

An easy way to guarantee that your variables are aligned properly is to put all the double-word variables first, the word variables second, and the byte variables last in the declaration, as shown here:

```
      .data
dw    dword  ?
dw2   dword  ?
dw3   dword  ?
w     word   ?
w2    word   ?
b     byte   ?
b2    byte   ?
```

This organization produces the following addresses in memory:

```
                   ; Start Adrs     Length
dw    dword  ?     ;   4096           4
dw2   dword  ?     ;   4100           4
dw3   dword  ?     ;   4104           4
w     word   ?     ;   4108           2
w2    word   ?     ;   4110           2
b     byte   ?     ;   4112           1
b2    byte   ?     ;   4113           1
```

As you can see, these variables are all aligned at reasonable addresses. Unfortunately, it is rarely possible for you to arrange your variables in this manner. While many technical reasons make this alignment impossible, a good practical reason for not doing this is that it doesn’t let you organize your variable declarations by logical function (that is, you probably want to keep related variables next to one another regardless of their size).

To resolve this problem, MASM provides the `align` directive, which uses the following syntax:

```
align integer_constant
```

The integer constant must be one of the following small unsigned integer values: 1, 2, 4, 8, or 16. If MASM encounters the `align` directive in a `.data` section, it will align the very next variable on an address that is a multiple of the specified alignment constant. The previous example could be rewritten, using the `align` directive, as follows:

```
      .data
      align  4
dw    dword  ?
b     byte   ?
      align  2
w     word   ?
      align  4
dw2   dword  ?
w2    word   ?
b2    byte   ?
      align  4
dw3   dword  ?
```

If MASM determines that the current address (location counter value) of an `align` directive is not an integral multiple of the specified value, MASM will quietly emit extra bytes of padding after the previous variable declaration until the current address in the `.data` section is a multiple of the specified value. This makes your program slightly larger (by a few bytes) in exchange for faster access to your data. Given that your program will grow by only a few bytes when you use this feature, this is probably a good trade-off.

As a general rule, if you want the fastest possible access, you should choose an alignment value that is equal to the size of the object you want to align. That is, you should align words to even boundaries by using an `align 2` statement, double words to 4-byte boundaries by using `align 4`, quad words to 8-byte boundaries by using `align 8`, and so on.
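The padding computation that an `align` directive implies is easy to express in C. This is a sketch under stated assumptions rather than a description of MASM’s internals: the helper names `align_up` and `padding_for` are hypothetical, and the alignment is assumed to be a power of 2 (as the directive requires).

```c
#include <stdint.h>

/*
 * Round "addr" up to the next multiple of "alignment".
 * Assumes "alignment" is a power of 2 (1, 2, 4, 8, or 16).
 */
uint64_t align_up(uint64_t addr, uint64_t alignment)
{
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Number of padding bytes needed to reach the aligned address. */
uint64_t padding_for(uint64_t addr, uint64_t alignment)
{
    return align_up(addr, alignment) - addr;
}
```

Assuming the `.data` section in the aligned example starts at 4096, the location counter is 4101 after `b`, so `align 2` moves it to `align_up(4101, 2)`, which is 4102 (one padding byte before `w`). After `b2` the counter is 4111, and `align 4` moves it to 4112 (one padding byte before `dw3`). The `align 4` before `dw2` adds nothing, because 4104 is already a multiple of four.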
If the object’s size is not a power of 2, align it to the next higher power of 2 (up to a maximum of 16 bytes). Note, however, that you need only align `real80` (and `tbyte`) objects on an 8-byte boundary.

Note that data alignment isn’t always necessary. The cache architecture of modern x86-64 CPUs actually handles most misaligned data. Therefore, you should use the alignment directives only with variables for which speedy access is absolutely critical. This is a reasonable space/speed trade-off.

## 3.7 The x86-64 Addressing Modes

Until now, you’ve seen only a single way to access a variable: the *PC-relative* addressing mode. In this section, you’ll see additional ways your programs can access memory by using x86-64 memory addressing modes. An *addressing mode* is a mechanism the CPU uses to determine the address of a memory location an instruction will access.

The x86-64 memory addressing modes provide flexible access to memory, allowing you to easily access variables, arrays, records, pointers, and other complex data types. Mastery of the x86-64 addressing modes is the first step toward mastering x86-64 assembly language.

The x86-64 provides several addressing modes:

* Register addressing modes
* PC-relative memory addressing modes
* Register-indirect addressing modes: `[reg64]`
* Indirect-plus-offset addressing modes: `[reg64 + expression]`
* Scaled-indexed addressing modes: `[reg64 + reg64 * scale]` and `[reg64 + expression + reg64 * scale]`

The following sections describe each of these modes.

### 3.7.1 x86-64 Register Addressing Modes

The *register addressing modes* provide access to the x86-64’s general-purpose register set. By specifying the name of the register as an operand to the instruction, you can access the contents of that register. This section uses the x86-64 `mov` (*move*) instruction to demonstrate the register addressing mode.
The generic syntax for the `mov` instruction is shown here:

```
mov destination, source
```

The `mov` instruction copies the data from the `source` operand to the `destination` operand. The 8-, 16-, 32-, and 64-bit registers are all valid operands for this instruction. The only restriction is that both operands must be the same size. The following `mov` instructions demonstrate the use of various registers:

```
mov ax, bx      ; Copies the value from BX into AX
mov dl, al      ; Copies the value from AL into DL
mov esi, edx    ; Copies the value from EDX into ESI
mov rsp, rbp    ; Copies the value from RBP into RSP
mov ch, cl      ; Copies the value from CL into CH
mov ax, ax      ; Yes, this is legal! (Though not very useful)
```

The registers are the best place to keep variables. Instructions using the registers are shorter and faster than those that access memory. Because most computations require at least one register operand, the register addressing mode is popular in x86-64 assembly code.

### 3.7.2 x86-64 64-Bit Memory Addressing Modes

The addressing modes provided by the x86-64 family include PC-relative, register-indirect, indirect-plus-offset, and scaled-indexed. Variations on these four forms provide all the addressing modes on the x86-64.

#### 3.7.2.1 The PC-Relative Addressing Mode

The most common addressing mode, and the one that’s easiest to understand, is the *PC-relative* (or *RIP-relative*) addressing mode. This mode consists of a 32-bit constant that the CPU adds to the current value of the RIP (instruction pointer) register to specify the address of the target location.
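The address arithmetic behind this mode can be sketched in C. The functions below are illustrative only (the names `rip_relative_target` and `rip_relative_disp` are hypothetical), and they assume that RIP holds the address of the next instruction at the moment the CPU adds the displacement.

```c
#include <stdint.h>

/* Address the CPU accesses: RIP of the next instruction plus the
   sign-extended 32-bit displacement stored in the instruction. */
uint64_t rip_relative_target(uint64_t next_rip, int32_t disp32)
{
    return next_rip + (uint64_t)(int64_t)disp32;
}

/* Displacement an assembler would have to encode. Returns 1 and stores
   the value through "disp32" if the distance fits in a signed 32-bit
   field; returns 0 (leaving "disp32" untouched) if it does not. */
int rip_relative_disp(uint64_t next_rip, uint64_t target, int32_t *disp32)
{
    int64_t delta = (int64_t)(target - next_rip);
    if (delta < INT32_MIN || delta > INT32_MAX)
        return 0;
    *disp32 = (int32_t)delta;
    return 1;
}
```

With a next-instruction address of 8000h and a target at 8088h, the encoded displacement is 88h. Because the displacement is signed, targets that lie before the instruction (negative displacements) work the same way.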
The syntax for the PC-relative addressing mode is to use the name of a symbol you declare in one of the many MASM sections (`.data`, `.data?`, `.const`, `.code`, etc.), as this book has been doing all along:

```
mov al, symbol  ; PC-relative addressing mode automatically provides [RIP]
```

Assuming that variable `j` is an `int8` variable appearing at offset 8088h from RIP, the instruction `mov al, j` loads the AL register with a copy of the byte at memory location RIP + 8088h. Likewise, if `int8` variable `K` is at address RIP + 1234h in memory, then the instruction `mov K, dl` stores the value in the DL register to memory location RIP + 1234h (see Figure 3-7).

![f03007](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03007.png)

Figure 3-7: PC-relative addressing mode

MASM does not directly encode the address of `j` or `K` into the instruction’s *operation code* (or *opcode*, the numeric machine encoding of the instruction). Instead, it encodes a signed displacement from the end of the current instruction’s address to the variable’s address in memory. For example, if `j` were located at address 8088h and the next instruction’s opcode were sitting in memory at location 8000h (the end of the current instruction), then the displacement would be 8088h − 8000h = 88h, and MASM would encode the value 88h as a 32-bit signed constant for `j` in the instruction opcode.

You can also access words and double words on the x86-64 processors by specifying the address of their first byte (see Figure 3-8).

![f03008](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03008.png)

Figure 3-8: Accessing a word or dword by using the PC-relative addressing mode

#### 3.7.2.2 The Register-Indirect Addressing Modes

The x86-64 CPUs let you access memory indirectly through a register by using the *register-indirect* addressing modes. The term *indirect* means that the operand is not the actual address, but the operand’s value specifies the memory address to use.
In the case of the register-indirect addressing modes, the value held in the register is the address of the memory location to access. For example, the instruction `mov [rbx], eax` tells the CPU to store EAX’s value at the location whose address is currently in RBX (the square brackets around RBX tell MASM to use the register-indirect addressing mode).

The x86-64 has 16 forms of this addressing mode. The following instructions provide examples of these 16 forms:

```
mov [reg64], al
```

where `reg64` is one of the 64-bit general-purpose registers: RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8, R9, R10, R11, R12, R13, R14, or R15. This addressing mode references the memory location at the offset found in the register enclosed by brackets.

The register-indirect addressing modes require a 64-bit register. You cannot specify a 32-, 16-, or 8-bit register in the square brackets when using an indirect addressing mode. Technically, you could load a 64-bit register with an arbitrary numeric value and access that location indirectly using the register-indirect addressing mode:

```
mov rbx, 12345678
mov [rbx], al   ; Attempts to access location 12345678
```

Unfortunately (or fortunately, depending on how you look at it), this will probably cause the operating system to generate a protection fault because it’s not always legal to access arbitrary memory locations. As it turns out, there are better ways to load the address of an object into a register, and you’ll see those shortly.

You can use the register-indirect addressing modes to access data referenced by a pointer, you can use them to step through array data, and, in general, you can use them whenever you need to modify the address of a variable while your program is running.
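For readers who think in terms of a high-level language, a C pointer is a reasonable analogy for a register holding an address. The sketch below is only an analogy, not the code a compiler is guaranteed to emit; the function name `sum_bytes` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum "count" bytes starting at "p". The pointer plays the role of RBX
   in "mov al, [rbx]"; advancing it plays the role of "inc rbx". */
uint32_t sum_bytes(const uint8_t *p, size_t count)
{
    uint32_t total = 0;
    while (count-- > 0) {
        total += *p;   /* indirect access through the pointer */
        p++;           /* change the address at run time      */
    }
    return total;
}
```

The point of the example is that the address being accessed is a run-time value held in a variable (or register), so the same instruction can reach a different memory location on every iteration.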
The register-indirect addressing mode provides an example of an *anonymous* variable; when using a register-indirect addressing mode, you refer to the value of a variable by its numeric memory address (the value you load into a register) rather than by the name of the variable.

The x86-64 provides a simple instruction that you can use to take the address of a variable and put it into a 64-bit register, the `lea` (*load effective address*) instruction:

```
lea rbx, j
```

After executing this `lea` instruction, you can use the `[rbx]` register-indirect addressing mode to indirectly access the value of `j`.

#### 3.7.2.3 Indirect-Plus-Offset Addressing Mode

The indirect-plus-offset addressing modes compute an *effective address* by adding a 32-bit signed constant to the value of a 64-bit register.^(9) The instruction then uses the data at this effective address in memory.

The indirect-plus-offset addressing modes use the following syntax:

```
mov [reg64 + constant], source
mov [reg64 - constant], source
```

where `reg64` is a 64-bit general-purpose register, `constant` is a 4-byte constant (±2 billion), and `source` is a register or constant value.

If `constant` is 1100h and RBX contains 12345678h, then

```
mov [rbx + 1100h], al
```

stores AL into the byte at address 12346778h in memory (see Figure 3-9).

![f03009](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03009.png)

Figure 3-9: Indirect-plus-offset addressing mode

The indirect-plus-offset addressing modes are really handy for accessing fields of classes and records/structures. You will see how to use these addressing modes for that purpose in Chapter 4.
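The effective-address computation, and its use for reaching a field inside a record, can be sketched in C. Everything here is illustrative: `ea_base_disp`, `struct point`, and `load_y` are hypothetical names introduced for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Effective address for [reg64 + constant]: the 32-bit constant is
   sign-extended to 64 bits and added to the register's value. */
uint64_t ea_base_disp(uint64_t base, int32_t disp32)
{
    return base + (uint64_t)(int64_t)disp32;
}

struct point { int32_t x; int32_t y; };   /* hypothetical record */

/* Read the "y" field the way an indirect-plus-offset access would:
   start from the record's base address and add the field's constant
   byte offset. */
int32_t load_y(const struct point *p)
{
    int32_t value;
    memcpy(&value, (const unsigned char *)p + offsetof(struct point, y),
           sizeof value);
    return value;
}
```

Calling `ea_base_disp(0x12345678, 0x1100)` reproduces the address 12346778h from the example above. The `load_y` function shows why the mode suits records: the base address varies at run time, while each field’s offset is a constant known when the code is built.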
#### 3.7.2.4 Scaled-Indexed Addressing Modes

The *scaled-indexed addressing modes* are similar to the indirect-plus-offset addressing modes, except that they let you combine two registers plus a displacement and multiply the index register by a (scaling) factor of 1, 2, 4, or 8. The CPU computes the effective address by adding the base register, the index register multiplied by the scaling factor, and the displacement. (Figure 3-10 shows an example involving RBX as the base register and RSI as the index register.)

The syntax for the scaled-indexed addressing modes is shown here:

```
[base_reg64 + index_reg64*scale]
[base_reg64 + index_reg64*scale + displacement]
[base_reg64 + index_reg64*scale - displacement]
```

`base_reg64` represents any general-purpose 64-bit register, `index_reg64` represents any general-purpose 64-bit register except RSP, and `scale` must be one of the constants 1, 2, 4, or 8.

![f03010](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03010.png)

Figure 3-10: Scaled-indexed addressing mode

In Figure 3-10, suppose that RBX contains 1000FF00h, RSI contains 20h, and `const` is 2000h; then the instruction

```
mov al, [rbx + rsi*4 + 2000h]
```

will move the byte at address 10011F80h (that is, 1000FF00h + (20h × 4) + 2000h) into the AL register.

The scaled-indexed addressing modes are useful for accessing array elements that are 2, 4, or 8 bytes each. These addressing modes are also useful for accessing elements of an array when you have a pointer to the beginning of the array.

### 3.7.3 Large Address Unaware Applications

One advantage of 64-bit addresses is that they can access a frightfully large amount of memory (something like 8TB under Windows). By default, the Microsoft linker (when it links together the C++ and assembly language code) sets a flag named `LARGEADDRESSAWARE` to true (`yes`). This makes it possible for your programs to access a huge amount of memory.
However, there is a price to be paid for operating in `LARGEADDRESSAWARE` mode: the `const` component of the `[reg64 + const]` addressing mode is limited to 32 bits and cannot span the entire address space.

Because of instruction-encoding limitations, the `const` value is limited to a signed value in the range ±2GB. This is probably far more than enough when the register contains a 64-bit base address and you want to access a memory location at a fixed offset (less than ±2GB) around that base address. A typical way you would use this addressing mode is as follows:

```
lea rcx, someStructure
mov al, [rcx+fieldOffset]
```

Prior to the introduction of 64-bit addresses, the `const` offset appearing in the (32-bit) indirect-plus-offset addressing mode could span the entire (32-bit) address space. So if you had an array declaration such as

```
      .data
buf   byte   256 dup (?)
```

you could access elements of this array by using the following addressing mode form:

```
mov al, buf[ebx]  ; EBX was used on 32-bit processors
```

If you were to attempt to assemble the instruction `mov al, buf[rbx]` in a 64-bit program (or any other addressing mode involving `buf` other than PC-relative), MASM would assemble the code properly, but the linker would report an error:

```
error LNK2017: 'ADDR32' relocation to 'buf' invalid without /LARGEADDRESSAWARE:NO
```

The linker is complaining that in an address space exceeding 32 bits, it is impossible to encode the offset to the `buf` buffer because the machine instruction opcodes provide only a 32-bit offset to hold the address of `buf`.

However, if we were to artificially limit the amount of memory that our application uses to 2GB, then MASM can encode the 32-bit offset to `buf` into the machine instruction. As long as we kept our promise and never used any more memory than 2GB, several new variations on the indirect-plus-offset and scaled-indexed addressing modes become possible.
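The constraint the linker enforces boils down to a range check: a value can live in the instruction’s 32-bit field only if it survives sign extension back to 64 bits. The following C sketch shows that check; the function name `fits_disp32` is hypothetical, and the code does not reproduce the linker’s actual logic.

```c
#include <stdint.h>

/* Returns 1 if "offset" can be encoded as a sign-extended 32-bit
   displacement (roughly ±2GB), 0 otherwise. */
int fits_disp32(int64_t offset)
{
    return offset >= INT32_MIN && offset <= INT32_MAX;
}
```

An address below 2GB (at most 7FFF_FFFFh) passes this check, which is why restricting the application to the low 2GB makes forms such as `buf[rbx]` encodable. An address of 8000_0000h or above does not.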
To turn off the large address–aware flag, you need to add an extra command line option to the `cl` command (which passes it on to the linker). This is easily done in the *build.bat* file; let’s create a new *build.bat* file and call it *sbuild.bat*. This file will have the following lines:

```
echo off
ml64 /nologo /c /Zi /Cp %1.asm
cl /nologo /O2 /Zi /utf-8 /EHa /Fe%1.exe c.cpp %1.obj /link /largeaddressaware:no
```

This set of commands (*sbuild.bat* for *small build*) tells MSVC to pass a command to the linker that turns off the large address–aware flag. MASM, MSVC, and the Microsoft linker will construct an executable file that requires only 32-bit addresses (ignoring the 32 HO bits in the 64-bit registers appearing in addressing modes).

Once you’ve disabled `LARGEADDRESSAWARE`, several new variants of the indirect-plus-offset and scaled-indexed addressing modes become available to your programs:

```
variable[reg64]
variable[reg64 + const]
variable[reg64 - const]
variable[reg64 * scale]
variable[reg64 * scale + const]
variable[reg64 * scale - const]
variable[reg64 + reg_not_RSP64 * scale]
variable[reg64 + reg_not_RSP64 * scale + const]
variable[reg64 + reg_not_RSP64 * scale - const]
```

where `variable` is the name of an object you’ve declared in your source file by using directives like `byte`, `word`, `dword`, and so on; `const` is a (maximum 32-bit) constant expression; and `scale` is 1, 2, 4, or 8. These addressing mode forms use the address of `variable` as the base address and add in the current value of the 64-bit registers (see Figures 3-11 through 3-16 for examples).
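All of these forms compute their effective address with the same sum, which the following C sketch spells out. It is an illustration rather than a model of the instruction encoding (where the variable’s address and the constant are folded into a single 32-bit displacement); the function name `effective_address` and its parameter names are hypothetical.

```c
#include <stdint.h>

/*
 * Effective address for the general form
 *     variable[base + index * scale + const]
 * Pass 0 for any component that a particular form omits.
 * "scale" must be 1, 2, 4, or 8; "*ok" is set to 0 (and 0 is returned)
 * if it is not, and to 1 otherwise.
 */
uint64_t effective_address(uint64_t variable_addr, uint64_t base,
                           uint64_t index, unsigned scale,
                           int32_t constant, int *ok)
{
    *ok = (scale == 1 || scale == 2 || scale == 4 || scale == 8);
    if (!*ok)
        return 0;
    return variable_addr + base + index * scale
         + (uint64_t)(int64_t)constant;
}
```

With `variable_addr` set to 0, the same function covers the register-only scaled-indexed modes: a base of 1000FF00h, an index of 20h, a scale of 4, and a constant of 2000h give 10011F80h.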
![f03011](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03011.png)

Figure 3-11: Base address form of indirect-plus-offset addressing mode

Although the small address forms (`LARGEADDRESSAWARE:NO`) are convenient and efficient, they can fail spectacularly if your program ever uses more than 2GB of memory. Should your programs ever grow beyond that point, you will have to completely rewrite every instruction that uses one of these addresses (that uses a global data object as the base address rather than loading the base address into a register). This can be very painful and error prone. Think twice before ever using the `LARGEADDRESSAWARE:NO` option.

![f03012](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03012.png)

Figure 3-12: Small address plus constant form of indirect-plus-offset addressing mode

![f03013](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03013.png)

Figure 3-13: Small address form of base-plus-scaled-indexed addressing mode

![f03014](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03014.png)

Figure 3-14: Small address form of base-plus-scaled-indexed-plus-constant addressing mode

![f03015](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03015.png)

Figure 3-15: Small address form of scaled-indexed addressing mode

![f03016](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03016.png)

Figure 3-16: Small address form of scaled-indexed-plus-constant addressing mode

## 3.8 Address Expressions

Often, when accessing variables and other objects in memory, we need to access memory locations immediately before or after a variable rather than the memory at the address specified by the variable. For example, when accessing an element of an array or a field of a structure/record, the exact element or field is probably not at the address of the variable itself.
Address expressions provide a mechanism to attach an arithmetic expression to an address to access memory around a variable’s address.

This book considers an *address expression* to be any legal x86-64 addressing mode that includes a displacement (that is, variable name) or an offset. For example, the following are legal address expressions:

```
[reg64 + offset]
[reg64 + reg_not_RSP64 * scale + offset]
```

Consider the following legal MASM syntax for a memory address, which isn’t actually a new addressing mode but simply an extension of the PC-relative addressing mode:

```
variable_name[offset]
```

This extended form computes its effective address by adding the constant offset within the brackets to the variable’s address. For example, the instruction `mov al, Address[3]` loads the AL register with the byte in memory that is 3 bytes beyond the `Address` object (see Figure 3-17).

The `offset` value in these examples must be a constant. If `index` is an `int32` variable, then `variable[index]` is not a legal address expression. If you wish to specify an index that varies at runtime, you must use one of the indirect or scaled-indexed addressing modes.

Another important thing to remember is that the offset in `Address[offset]` is a byte address. Although this syntax is reminiscent of array indexing in a high-level language like C/C++ or Java, this does not properly index into an array of objects unless `Address` is an array of bytes.

![f03017](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03017.png)

Figure 3-17: Using an address expression to access data beyond a variable

Until this point, the offset in all the addressing mode examples has always been a single numeric constant. However, MASM also allows a *constant expression* anywhere an offset is legal.
A constant expression consists of one or more constant terms manipulated by operators such as addition, subtraction, multiplication, division, modulo, and a wide variety of others. Most address expressions, however, will involve only addition, subtraction, multiplication, and sometimes division. Consider the following example:

```
mov al, X[2*4 + 1]
```

This instruction will move the byte at address `X + 9` into the AL register.

The value of an address expression is always computed at compile time, never while the program is running. When MASM encounters the preceding instruction, it calculates 2 × 4 + 1 on the spot and adds this result to the base address of `X` in memory. MASM encodes this single sum (base address of `X` plus 9) as part of the instruction; MASM does not emit extra instructions to compute this sum for you at runtime (which is good, because doing so would be less efficient). Because MASM computes the value of address expressions at compile time, all components of the expression must be constants because MASM cannot know the runtime value of a variable while it is compiling the program.

Address expressions are useful for accessing the data in memory beyond a variable, particularly when you’ve used the `byte`, `word`, `dword`, and so on, statements in a `.data` or `.const` section to tack on additional bytes after a data declaration. For example, consider the program in Listing 3-1 that uses address expressions to access the four consecutive bytes associated with variable `i`.

```
; Listing 3-1

; Demonstrate address expressions.

        option  casemap:none

nl      =       10  ; ASCII code for newline

        .const
ttlStr  byte    'Listing 3-1', 0
fmtStr1 byte    'i[0]=%d ', 0
fmtStr2 byte    'i[1]=%d ', 0
fmtStr3 byte    'i[2]=%d ', 0
fmtStr4 byte    'i[3]=%d',nl, 0

        .data
i       byte    0, 1, 2, 3

        .code
        externdef printf:proc

; Return program title to C++ program:

        public  getTitle
getTitle proc
        lea     rax, ttlStr
        ret
getTitle endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc
        push    rbx

; "Magic" instruction offered without
; explanation at this point:

        sub     rsp, 48

        lea     rcx, fmtStr1
        movzx   rdx, i[0]
        call    printf

        lea     rcx, fmtStr2
        movzx   rdx, i[1]
        call    printf

        lea     rcx, fmtStr3
        movzx   rdx, i[2]
        call    printf

        lea     rcx, fmtStr4
        movzx   rdx, i[3]
        call    printf

        add     rsp, 48
        pop     rbx
        ret     ; Returns to caller
asmMain endp
        end
```

Listing 3-1: Demonstration of address expressions

Here’s the output from the program:

```
C:\>build listing3-1

C:\>echo off
 Assembling: listing3-1.asm
c.cpp

C:\>listing3-1
Calling Listing 3-1:
i[0]=0 i[1]=1 i[2]=2 i[3]=3
Listing 3-1 terminated
```

The program in Listing 3-1 displays the four values `0`, `1`, `2`, and `3` as though they were array elements. This is because the value at the address of `i` is `0`. The address expression `i[1]` tells MASM to fetch the byte appearing at `i`’s address plus 1. This is the value `1`, because the `byte` statement in this program emits the value `1` to the `.data` segment immediately after the value `0`. Likewise for `i[2]` and `i[3]`, this program displays the values `2` and `3`.

Note that MASM also provides a special operator, `this`, that returns the current location counter (current position) within a section. You can use the `this` operator to represent the address of the current instruction in an address expression. See “Constant Expressions” in Chapter 4 for more details.

## 3.9 The Stack Segment and the push and pop Instructions

The x86-64 maintains the stack in the `stack` segment of memory. The *stack* is a dynamic data structure that grows and shrinks according to certain needs of the program. The stack also stores important information about the program, including local variables, subroutine information, and temporary data.

The x86-64 controls its stack via the RSP (stack pointer) register. When your program begins execution, the operating system initializes RSP with the address of the last memory location in the `stack` memory segment.
You write data to the `stack` segment by “pushing” it onto the stack, and you read it back by “popping” it off the stack.

### 3.9.1 The Basic push Instruction

Here’s the syntax for the x86-64 `push` instruction:

```
push  `reg`[16]
push  `reg`[64]
push  `memory`[16]
push  `memory`[64]
pushw `constant`[16]
push  `constant`[32]  ; Sign extends `constant`[32] to 64 bits
```

These six forms allow you to push 16-bit or 64-bit registers, 16-bit or 64-bit memory locations, and 16-bit constants or 32-bit constants that the CPU sign-extends to 64 bits. You cannot push a 32-bit value onto the stack.

The `push` instruction does the following:

```
RSP   := RSP - `size_of_register_or_memory_operand` (2 or 8)
[RSP] := `operand's_value`
```

For example, assuming that RSP contains 00FF_FFE8h, the instruction `push rax` will set RSP to 00FF_FFE0h and store the current value of RAX into memory location 00FF_FFE0h, as Figures 3-18 and 3-19 show.

![f03018](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03018.png)

Figure 3-18: Stack segment before the `push rax` operation

![f03019](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03019.png)

Figure 3-19: Stack segment after the `push rax` operation

Although the x86-64 supports 16-bit push operations, their primary use is in 16-bit environments such as Microsoft Disk Operating System (MS-DOS). For maximum performance, the stack pointer’s value should always be a multiple of eight; indeed, your program may malfunction under a 64-bit OS if RSP contains a value that is not a multiple of eight. The only practical reason for pushing fewer than 8 bytes at a time on the stack is to build up a quad word via four successive word pushes.

### 3.9.2 The Basic pop Instruction

To retrieve data you’ve pushed onto the stack, you use the `pop` instruction.
The basic `pop` instruction allows the following forms:

```
pop `reg`[16]
pop `reg`[64]
pop `memory`[16]
pop `memory`[64]
```

Like the `push` instruction, the `pop` instruction supports only 16-bit and 64-bit operands; you cannot pop an 8-bit or 32-bit value from the stack. As with the `push` instruction, you should avoid popping 16-bit values (unless you do four 16-bit pops in a row) because 16-bit pops may leave the RSP register containing a value that is not a multiple of eight. One major difference between `push` and `pop` is that you cannot pop a constant value (which makes sense, because the operand for `push` is a source operand, while the operand for `pop` is a destination operand).

Formally, here’s what the `pop` instruction does:

```
`operand` := [RSP]
RSP       := RSP + `size_of_operand` (2 or 8)
```

As you can see, the `pop` operation is the converse of the `push` operation. Note that the `pop` instruction copies the data from memory location `[RSP]` before adjusting the value in RSP. See Figures 3-20 and 3-21 for details on this operation.

![f03020](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03020.png)

Figure 3-20: Memory before a `pop rax` operation

![f03021](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03021.png)

Figure 3-21: Memory after the `pop rax` operation

The value popped from the stack is still present in memory. Popping a value does not erase the value in memory; it just adjusts the stack pointer so that it points at the next value above the popped value. However, you should never attempt to access a value you’ve popped off the stack. The next time something is pushed onto the stack, the popped value will be obliterated. Because your code isn’t the only thing that uses the stack (for example, the operating system uses the stack, as do subroutines), you cannot rely on data remaining in stack memory once you’ve popped it off the stack.
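To make the two formal descriptions concrete, here is a minimal sketch that spells out what `push rax` and `pop rax` do by using `sub`, `mov`, and `add`. The procedure name `pushPopDemo` is hypothetical, and the explicit sequences are only approximately equivalent to the real instructions: `sub` and `add` modify the flags, whereas `push` and `pop` do not.

```asm
; Sketch only: "pushPopDemo" is a hypothetical name.

        option  casemap:none

        .code
        public  pushPopDemo
pushPopDemo proc
        mov     rax, 1234h

; Manual equivalent of "push rax":

        sub     rsp, 8          ; RSP := RSP - 8
        mov     [rsp], rax      ; [RSP] := RAX

        xor     eax, eax        ; Destroy RAX on purpose

; Manual equivalent of "pop rax":

        mov     rax, [rsp]      ; RAX := [RSP]
        add     rsp, 8          ; RSP := RSP + 8

        ret                     ; RAX is 1234h again
pushPopDemo endp
        end
```

Because the sequence subtracts and adds the same amount, RSP has its original value when `ret` executes, which is what lets the procedure return correctly.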
### 3.9.3 Preserving Registers with the push and pop Instructions

Perhaps the most common use of the `push` and `pop` instructions is to save register values during intermediate calculations. Because registers are the best place to hold temporary values, and registers are also needed for the various addressing modes, it is easy to run out of registers when writing code that performs complex calculations. The `push` and `pop` instructions can come to your rescue when this happens.

Consider the following program outline:

```
`Some instructions that use the RAX register`

`Some instructions that need to use RAX, for a`
`different purpose than the above instructions`

`Some instructions that need the original value in RAX`
```

The `push` and `pop` instructions are perfect for this situation. By inserting a `push` instruction before the middle sequence and a `pop` instruction after the middle sequence, you can preserve the value in RAX across those calculations:

```
`Some instructions that use the RAX register`

      push rax

`Some instructions that need to use RAX, for a`
`different purpose than the above instructions`

      pop rax

`Some instructions that need the original value in RAX`
```

This `push` instruction copies the data computed in the first sequence of instructions onto the stack. Now the middle sequence of instructions can use RAX for any purpose it chooses. After the middle sequence of instructions finishes, the `pop` instruction restores the value in RAX so the last sequence of instructions can use the original value in RAX.

## 3.10 The Stack Is a LIFO Data Structure

You can push more than one value onto the stack without first popping previous values off the stack. However, the stack is a *last-in, first-out (LIFO)* data structure, so you must be careful how you push and pop multiple values.
For example, suppose you want to preserve RAX and RBX across a block of instructions; the following code demonstrates the obvious way to handle this:

```
push rax
push rbx
`Code that uses RAX and RBX goes here`
pop rax
pop rbx
```

Unfortunately, this code will not work properly! Figures 3-22 through 3-25 show the problem. Because this code pushes RAX first and RBX second, the stack pointer is left pointing at RBX’s value on the stack. When the `pop rax` instruction comes along, it removes the value that was originally in RBX from the stack and places it in RAX! Likewise, the `pop rbx` instruction pops the value that was originally in RAX into the RBX register. The result is that this code manages to swap the values in the registers by popping them in the same order that it pushes them.

![f03022](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03022.png)

Figure 3-22: Stack after pushing RAX

To rectify this problem, you must note that the stack is a LIFO data structure, so the first thing you must pop is the last thing you push onto the stack. Therefore, you must always observe the following maxim: *always pop values in the reverse order that you push them.*

The correction to the previous code is shown here:

```
push rax
push rbx
`Code that uses RAX and RBX goes here`
pop rbx
pop rax
```

![f03023](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03023.png)

Figure 3-23: Stack after pushing RBX

![f03024](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03024.png)

Figure 3-24: Stack after popping RAX

Another important maxim to remember is this: *always pop exactly the same number of bytes that you push.* This generally means that the number of pushes and pops must exactly agree. If you have too few pops, you will leave data on the stack, which may confuse the running program.
If you have too many pops, you will accidentally remove previously pushed data, often with disastrous results.

A corollary to the preceding maxim is *be careful when pushing and popping data within a loop.* Often it is quite easy to put the pushes in a loop and leave the pops outside the loop (or vice versa), creating an inconsistent stack. Remember, it is the execution of the `push` and `pop` instructions that matters, not the number of `push` and `pop` instructions that appear in your program. At runtime, the number (and order) of the `push` instructions the program executes must match the number (and reverse order) of the `pop` instructions.

![f03025](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03025.png)

Figure 3-25: Stack after popping RBX

One final thing to note: *the Microsoft ABI requires the stack to be aligned on a 16-byte boundary.* If you push and pop items on the stack, make sure that the stack is aligned on a 16-byte boundary before calling any functions or procedures that adhere to the Microsoft ABI (and require the stack to be aligned on a 16-byte boundary).

## 3.11 Other push and pop Instructions

The x86-64 provides four additional `push` and `pop` instructions in addition to the basic ones:

1. `pushf` and `popf`
2. `pushfq` and `popfq`

The `pushf`, `pushfq`, `popf`, and `popfq` instructions push and pop the RFLAGS register. These instructions allow you to preserve condition code and other flag settings across the execution of a sequence of instructions. Unfortunately, unless you go to a lot of trouble, it is difficult to preserve individual flags. When using the `pushf(q)` and `popf(q)` instructions, it’s an all-or-nothing proposition: you preserve all the flags when you push them; you restore all the flags when you pop them.

You should really use the `pushfq` and `popfq` instructions to push the full 64-bit version of the RFLAGS register (rather than pushing only the 16-bit FLAGS portion).
Although the extra 48 bits you push and pop are essentially ignored when writing applications, you still want to keep the stack aligned by pushing and popping only quad words.

## 3.12 Removing Data from the Stack Without Popping It

Quite often you may discover that you’ve pushed data onto the stack that you no longer need. Although you could pop the data into an unused register or memory location, there is an easier way to remove unwanted data from the stack: simply adjust the value in the RSP register to skip over the unwanted data on the stack.

Consider the following dilemma (in pseudocode, not actual assembly language):

```
push rax
push rbx

`Some code that winds up computing some values we want to keep`
`in RAX and RBX`

if(`Calculation_was_performed`) then

    ; Whoops, we don't want to pop RAX and RBX!
    ; What to do here?

else

    ; No calculation, so restore RAX, RBX.

    pop rbx
    pop rax

endif;
```

Within the `then` section of the `if` statement, this code wants to remove the old values of RAX and RBX without otherwise affecting any registers or memory locations. How can we do this? Because the RSP register contains the memory address of the item on the top of the stack, we can remove the item from the top of the stack by adding the size of that item to the RSP register. In the preceding example, we wanted to remove two quad-word items from the top of the stack. We can easily accomplish this by adding 16 to the stack pointer (see Figures 3-26 and 3-27 for the details):

```
push rax
push rbx

`Some code that winds up computing some values we want to keep`
`in RAX and RBX`

if(`Calculation_was_performed`) then

    ; Remove unneeded RAX/RBX values
    ; from the stack.

    add rsp, 16

else

    ; No calculation, so restore RAX, RBX.

    pop rbx
    pop rax

endif;
```

![f03026](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03026.png)

Figure 3-26: Removing data from the stack, before `add rsp, 16`

![f03027](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03027.png)

Figure 3-27: Removing data from the stack, after `add rsp, 16`

Effectively, this code pops the data off the stack without moving it anywhere. Also note that this code is faster than two dummy `pop` instructions because it can remove any number of bytes from the stack with a single `add` instruction.

## 3.13 Accessing Data You’ve Pushed onto the Stack Without Popping It

Once in a while, you will push data onto the stack and will want to get a copy of that data’s value, or perhaps you will want to change that data’s value without actually popping the data off the stack (that is, you wish to pop the data off the stack at a later time). The x86-64 `[reg64 ± offset]` addressing mode provides the mechanism for this.

Consider the stack after the execution of the following two instructions (see Figure 3-28):

```
push rax
push rbx
```

![f03028](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f03028.png)

Figure 3-28: Stack after pushing RAX and RBX

If you wanted to access the original RBX value without removing it from the stack, you could cheat and pop the value and then immediately push it again. Suppose, however, that you wish to access RAX’s old value or another value even further up the stack. Popping all the intermediate values and then pushing them back onto the stack is problematic at best, impossible at worst. However, as you will notice from Figure 3-28, each value pushed on the stack is at a certain offset from the RSP register in memory. Therefore, we can use the `[rsp ± offset]` addressing mode to gain direct access to the value we are interested in.
In the preceding example, you can reload RAX with its original value by using this single instruction:

```
mov rax, [rsp + 8]
```

This code copies the 8 bytes starting at memory address `rsp + 8` into the RAX register. This value just happens to be the previous value of RAX that was pushed onto the stack. You can use this same technique to access other data values you’ve pushed onto the stack.

The previous section pointed out how to remove data from the stack by adding a constant to the RSP register. That pseudocode example could probably be written more safely as this:

```
push rax
push rbx

`Some code that winds up computing some values we want to keep`
`in RAX and RBX`

if(`Calculation_was_performed`) then

    `Overwrite saved values on stack with`
    `new RAX/RBX values (so the pops that`
    `follow won't change the values in RAX/RBX)`

    mov [rsp + 8], rax
    mov [rsp], rbx

endif
pop rbx
pop rax
```

In this code sequence, the calculated result was stored over the top of the values saved on the stack. Later, when the program pops the values, it loads these calculated values into RAX and RBX.

## 3.14 Microsoft ABI Notes

About the only feature this chapter introduces that affects the Microsoft ABI is data alignment. As a general rule, the Microsoft ABI requires all data to be aligned on a natural boundary for that data object. A *natural boundary* is an address that is a multiple of the object’s size (up to 16 bytes). Therefore, if you intend to pass a word/sword, dword/sdword, or qword/sqword value to a C++ procedure, you should attempt to align that object on a 2-, 4-, or 8-byte boundary, respectively.

When calling code written in a Microsoft ABI–aware language, you must ensure that the stack is aligned on a 16-byte boundary before issuing a `call` instruction. This can severely limit the usefulness of the `push` and `pop` instructions.
If you use the `push` instructions to save a register’s value prior to a call, you must make sure you push two (64-bit) values, or otherwise make sure the RSP address is a multiple of 16 bytes, prior to making the call. Chapter 5 explores this issue in greater detail.

## 3.15 For More Information

An older, 16-bit version of my book *The Art of Assembly Language Programming* can be found at [`artofasm.randallhyde.com/`](https://artofasm.randallhyde.com/). In that text, you will find information about the 8086 16-bit addressing modes and segmentation. The published edition of that book (No Starch Press, 2010) covers the 32-bit addressing modes. Of course, the Intel x86 documentation (found at [`www.intel.com/`](http://www.intel.com/)) provides complete information on x86-64 address modes and machine instruction encoding.

## 3.16 Test Yourself

1. The PC-relative addressing mode indexes off which 64-bit register?
2. What does *opcode* stand for?
3. What type of data is the PC-relative addressing mode typically used for?
4. What is the address range of the PC-relative addressing mode?
5. In a register-indirect addressing mode, what does the register contain?
6. Which of the following registers is valid for use with the register-indirect addressing mode?
   1. AL
   2. AX
   3. EAX
   4. RAX
7. What instruction would you normally use to load the address of a memory object into a register?
8. What is an effective address?
9. What scaling values are legal with the scaled-indexed addressing mode?
10. What is the memory limitation on a `LARGEADDRESSAWARE:NO` application?
11. What is the advantage of using the `LARGEADDRESSAWARE:NO` option when compiling a program?
12. What is the difference between the `.data` section and the `.data?` section?
13. Which (standard MASM) memory sections are read-only?
14. Which (standard MASM) memory sections are readable and writable?
15. What is the location counter?
16. Explain how to use the `label` directive to coerce data to a different type.
17. Explain what happens if two (or more) `.data` sections appear in a MASM source file.
18. How would you align a variable in the `.data` section to an 8-byte boundary?
19. What does *MMU* stand for?
20. If `b` is a byte variable in read/write memory, explain how a `mov ax, b` instruction could cause a general protection fault.
21. What is an address expression?
22. What is the purpose of the MASM PTR operator?
23. What is the difference between a big-endian value and a little-endian value?
24. If AX contains a big-endian value, what instruction could you use to convert it to a little-endian value?
25. If EAX contains a little-endian value, what instruction could you use to convert it to a big-endian value?
26. If RAX contains a big-endian value, what instruction could you use to convert it to a little-endian value?
27. Explain, step by step, what the `push rax` instruction does.
28. Explain, step by step, what the `pop rax` instruction does.
29. When using the `push` and `pop` instructions to preserve registers, you must always pop the registers in the ________ order that you pushed them.
30. What does *LIFO* stand for?
31. How do you access data on the stack without using the `push` and `pop` instructions?
32. How can pushing RAX onto the stack before calling a Windows ABI–compatible function create problems?

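Chapter 4: Constants, Variables, and Data Types

Chapter 2 discussed the basic format for data in memory. Chapter 3 covered how a computer system physically organizes that data in memory. This chapter finishes the discussion by connecting the concept of *data representation* to its actual physical representation. As the title indicates, this chapter concerns itself with three main topics: constants, variables, and data structures. I do not assume that you've had a formal course in data structures, though such experience would be useful.

This chapter discusses how to declare and use constants, scalar variables, integers, data types, pointers, arrays, records/structs, and unions. You must master these subjects before going on to the next chapter. Declaring and accessing arrays, in particular, seems to present a multitude of problems to beginning assembly language programmers. However, the rest of this text depends on your understanding of these data structures and their memory representation. Do not try to skim over this material with the expectation that you will pick it up as you need it later. You will need it right away, and trying to learn this material along with later material will only confuse you more.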

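4.1 The `imul` Instruction

This chapter introduces arrays and other concepts that will require you to expand your knowledge of the x86-64 instruction set. In particular, you will need to learn how to multiply two values; hence, this section looks at the `imul` (integer multiply) instruction.

The `imul` instruction has several forms. This section doesn't cover all of them, just the ones that are useful for array calculations (for the remaining `imul` instructions, see "Arithmetic Expressions" in Chapter 6). The `imul` variants of interest right now are as follows: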

; The following computes `destreg` = `destreg` * `constant`:

imul `destreg`[16], `constant`
imul `destreg`[32], `constant`
imul `destreg`[64], `constant`[32]

; The following computes `dest` = `src` * `constant`:

imul `destreg`[16], `srcreg`[16], `constant`
imul `destreg`[16], `srcmem`[16], `constant`

imul `destreg`[32], `srcreg`[32], `constant`
imul `destreg`[32], `srcmem`[32], `constant`

imul `destreg`[64], `srcreg`[64], `constant`[32]
imul `destreg`[64], `srcmem`[64], `constant`[32]

; The following computes `dest` = `destreg` * `src`:

imul `destreg`[16], `srcreg`[16]
imul `destreg`[16], `srcmem`[16]
imul `destreg`[32], `srcreg`[32]
imul `destreg`[32], `srcmem`[32]
imul `destreg`[64], `srcreg`[64]
imul `destreg`[64], `srcmem`[64]

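Note that the syntax of the `imul` instruction is different from that of the `add` and `sub` instructions. In particular, the destination operand must be a register (`add` and `sub` both allow a memory operand as a destination). Also note that `imul` allows three operands when the last operand is a constant. Another important difference is that the `imul` instruction allows only 16-, 32-, and 64-bit operands; it does not multiply 8-bit operands. Finally, as with most instructions that support the immediate addressing mode, the CPU limits constant sizes to 32 bits. For 64-bit operands, the x86-64 sign-extends the 32-bit immediate constant to 64 bits.

`imul` computes the product of its specified operands and stores the result into the destination register. If an overflow occurs (which is always a signed overflow, because `imul` multiplies only signed integer values), this instruction sets both the carry and overflow flags. `imul` leaves the other condition code flags undefined (so, for example, you cannot meaningfully check the sign flag or the zero flag after executing `imul`).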
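As a rough illustration of the three operand patterns above, the following sketch multiplies a small value through each form and checks the overflow flag afterward. The names `imulDemo`, `index`, and `overflowed` are hypothetical and exist only for this example.

```asm
; Sketch only: "imulDemo", "index", and "overflowed"
; are hypothetical names.

        option  casemap:none

        .data
index   dword   3

        .code
        public  imulDemo
imulDemo proc
        mov     eax, index      ; EAX = 3
        imul    eax, 10         ; destreg = destreg * constant: EAX = 30
        imul    ecx, eax, 4     ; dest = src * constant:        ECX = 120
        imul    ecx, index      ; dest = destreg * src:         ECX = 360
        jo      overflowed      ; OF (and CF) set on signed overflow

        mov     eax, ecx        ; Return the product in EAX
        ret

overflowed:
        mov     eax, -1         ; Signal overflow to the caller
        ret
imulDemo endp
        end
```

Only the carry and overflow flags are tested here because those are the only flags `imul` defines; a `jz` or `js` after `imul` would be testing undefined values.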

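4.2 The `inc` and `dec` Instructions

As several examples up to this point have indicated, adding or subtracting 1 from a register or memory location is a very common operation. In fact, these operations are so common that Intel's engineers included a pair of instructions to perform these specific operations: `inc` (increment) and `dec` (decrement).

The `inc` and `dec` instructions use the following syntax: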

inc `mem`/`reg`
dec `mem`/`reg`

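The single operand can be any legal 8-, 16-, 32-, or 64-bit register or memory operand. The `inc` instruction adds 1 to the specified operand, and the `dec` instruction subtracts 1 from the specified operand.

These two instructions are slightly shorter than the corresponding `add` or `sub` instructions (their encoding uses fewer bytes). There is also one slight difference between these two instructions and the corresponding `add` or `sub` instructions: they do not affect the carry flag.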
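The carry-flag difference is easy to overlook, so here is a small sketch that wraps the same value with `inc` and with `add` and captures the carry flag after each. The procedure name `incCarryDemo` is hypothetical.

```asm
; Sketch only: "incCarryDemo" is a hypothetical name.

        option  casemap:none

        .code
        public  incCarryDemo
incCarryDemo proc
        clc                     ; Start with CF = 0
        mov     eax, 0FFFFFFFFh
        inc     eax             ; EAX wraps to 0; CF is left at 0
        setc    cl              ; CL = 0

        mov     eax, 0FFFFFFFFh ; mov does not change the flags
        add     eax, 1          ; EAX wraps to 0; CF = 1
        setc    ch              ; CH = 1

        movzx   eax, cx         ; Return 0100h: CH = 1, CL = 0
        ret
incCarryDemo endp
        end
```

Code that relies on the carry flag after an increment, such as multiprecision arithmetic, therefore has to use `add` or `sub` rather than `inc` or `dec`.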

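4.3 MASM Constant Declarations

MASM provides three directives that let you define constants in your assembly language programs.^(1) Collectively, these three directives are known as *equates*. You've already seen the most common form: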

`symbol` = `constant_expression`

For example:

MaxIndex = 15

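Once you declare a symbolic constant in this manner, you may use the symbolic identifier anywhere the corresponding literal constant is legal. These constants are known as *manifest constants*: symbolic representations that allow you to substitute the literal value for the symbol anywhere in the program.

Contrast this with `.const` variables; a `.const` variable is certainly a constant value because you cannot change its value at runtime. However, a memory location is associated with a `.const` variable; the operating system, not the MASM assembler, enforces the read-only attribute. Although it will certainly crash your program when it runs, it is perfectly legal to write an instruction like `mov ReadOnlyVar, eax`. On the other hand, it is no more legal to write `mov MaxIndex, eax` (using the preceding declaration) than it is to write `mov 15, eax`. In fact, both statements are equivalent because the assembler substitutes `15` for `MaxIndex` whenever it encounters this manifest constant.

Constant declarations are great for defining "magic" numbers that might change during program modification. Most of the listings throughout this book have used manifest constants like `nl` (newline), `maxLen`, and `NULL`.

In addition to the `=` directive, MASM provides the `equ` directive: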

`symbol` equ `constant_expression`

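With a few exceptions, these two equate directives do the same thing: they define a manifest constant, and MASM will substitute the value of `constant_expression` wherever `symbol` appears in the source file.

The first difference between the two is that MASM allows you to redefine symbols that use the `=` directive. Consider the following code snippet: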

maxSize  = 100

`Code that uses maxSize, expecting it to be 100`

maxSize  = 256

`Code that uses maxSize, expecting it to be 256`

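You might question the term *constant* when it is fairly clear in this example that the value of `maxSize` changes at various points in the source file. Note, however, that although the value of `maxSize` changes during assembly, at runtime the particular literal constant (100 or 256 in this example) never changes.

You cannot redefine the value of a constant you declare with the `equ` directive (at runtime or assembly time). Any attempt to redefine an `equ` symbol results in a symbol redefinition error from MASM. So if you want to prevent the accidental redefinition of a constant symbol in your source file, you should use the `equ` directive rather than the `=` directive.

Another difference between the `=` and `equ` directives is that constants you define with `=` must be representable as a 64-bit (or smaller) integer. Short character strings are legal as `=` operands, but only if they have eight or fewer characters (which fit into a 64-bit value). Equates using `equ` have no such limitation.

Ultimately, the difference between `=` and `equ` is that the `=` directive computes the value of a numeric expression and saves that value to substitute wherever the symbol appears in the program. The `equ` directive works the same way if its operand can be reduced to a numeric value. However, if the `equ` operand cannot be converted to a numeric value, the `equ` directive saves its operand as textual data and substitutes that textual data in place of the symbol.

Because of the numeric/text processing, `equ` can occasionally get confused by its operand. Consider the following example: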

SomeStr  equ   "abcdefgh"
          .
          .
          .
memStr   byte  SomeStr

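MASM will report an error (`initializer magnitude too large for specified size` or something similar) because the 64-bit value formed by the eight characters `abcdefgh` will not fit into a byte variable. However, if we add one more character to the string, MASM will gladly accept this: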

SomeStr  equ   "abcdefghi"
          .
          .
          .
memStr   byte  SomeStr

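The difference between these two examples is that in the first example, MASM decides that it can represent the string as a 64-bit integer, so the constant is a quad-word constant rather than a string of characters. In the second example, MASM cannot represent the string of characters as an integer, so it treats the operand as a text operand rather than a numeric operand. When MASM substitutes the text `"abcdefghi"` for `SomeStr` in the `memStr` declaration of the second example, it assembles the code properly because strings are perfectly legitimate operands for the `byte` directive.

Assuming you really want MASM to treat a string of eight characters or fewer as a string rather than as an integer value, there are two solutions. The first is to surround the operand with text delimiters. MASM uses the symbols `<` and `>` as text delimiters in an `equ` operand field. So, you could use the following code to solve this problem: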

SomeStr  equ   <"abcdefgh">
          .
          .
          .
memStr   byte  SomeStr

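Because the `equ` directive's operand can be somewhat ambiguous at times, Microsoft introduced a third equate directive, `textequ`, to use when you want to create a text equate. Here's the current example using a text equate: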

SomeStr  textequ   <"abcdefgh">
          .
          .
          .
memStr   byte      SomeStr

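Note that `textequ` operands must always use the text delimiters (`<` and `>`) in the operand field.

Whenever MASM encounters a symbol defined with the text directive in a source file, it immediately substitutes the text associated with that directive for the identifier. This is somewhat similar to the C/C++ `#define` macro (except you don't get to specify any parameters). Consider the following example: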

maxCnt  =       10
max     textequ <maxCnt>
max     =       max+1

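MASM substitutes `maxCnt` for `max` throughout the program (after the `textequ` that declares `max`). In the third line of this example, the substitution yields the following statement: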

maxCnt  =       maxCnt+1

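Thereafter in the program, MASM will substitute the value 11 everywhere it sees the symbol `maxCnt`. Whenever MASM sees `max` after that point, it will substitute `maxCnt`, and then it will substitute 11 for `maxCnt`.

You could even use the MASM text equate to do something like the following: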

mv    textequ  <mov>
        .
        .
        .
       mv      rax,0

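MASM will substitute `mov` for `mv` and assemble the last statement in this sequence into a `mov` instruction. Most people would consider this a huge violation of assembly language programming style, but it's perfectly legal.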
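To pull the three equate directives together, the following sketch uses one of each in a single module. Every symbol name here (`bufSize`, `greeting`, `bufA`, `bufB`, `msg`, `equateDemo`) is hypothetical, and the comments describe the behavior I would expect from the rules given in this section rather than output from a particular assembler run.

```asm
; Sketch only: all symbol names here are hypothetical.

        option  casemap:none

bufSize  =       16                 ; Numeric equate; may be redefined
nl       equ     10                 ; Numeric equate; may not be redefined
greeting textequ <"Hello", nl, 0>   ; Text equate; substituted as text

         .data
bufA     byte    bufSize dup (0)    ; Reserves 16 bytes

bufSize  =       bufSize * 4        ; Redefinition is legal with "="

bufB     byte    bufSize dup (0)    ; Reserves 64 bytes
msg      byte    greeting           ; Same as: byte "Hello", nl, 0

         .code
         public  equateDemo
equateDemo proc
         mov     eax, bufSize       ; Assembles as: mov eax, 64
         ret
equateDemo endp
         end
```

Changing the `bufSize` redefinition to use `equ` instead of `=` should produce the symbol redefinition error described earlier.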

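4.3.1 Constant Expressions

Thus far, this chapter has given the impression that a symbolic constant definition consists of an identifier, an optional type, and a literal constant. Actually, MASM constant declarations can be a lot more sophisticated than this because MASM allows the assignment of a constant expression, not just a literal constant, to a symbolic constant. The generic constant declaration takes one of the following two forms:

`identifier` =   `constant_expression`
`identifier` equ `constant_expression`

Constant (integer) expressions take the familiar form you're used to in high-level languages like C/C++ and Python. They may contain literal constant values, previously declared symbolic constants, and various arithmetic operators.

The constant expression operators follow standard precedence rules (similar to those in C/C++); you may use parentheses to override the precedence if necessary. In general, if the precedence isn't obvious, use parentheses to state the order of evaluation explicitly. Table 4-1 lists the arithmetic operators MASM allows in constant (and address) expressions.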

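Table 4-1: Operations Allowed in Constant Expressions

Operator	Description
Arithmetic operators
`-` (unary negation)
`*`
`/` (integer division)
`mod`
`/` (real division)
`+`
`-`
`[]`
Comparison operators
`EQ`
`NE`
`LT`
`LE`
`GT`
`GE`
Logical operators
`AND`
`OR`
`NOT`
Unary operators
`HIGH`
`HIGHWORD`
`HIGH32`
`LENGTHOF`
`LOW`
`LOWWORD`
`LOW32`
`OFFSET`
`OPATTR`	Returns the attributes of the expression following the operator. The attributes are returned as a bitmap with the following meanings: bit 0: there is a code label in the expression; bit 1: the expression is relocatable; bit 2: the expression is a constant expression; bit 3: the expression uses direct addressing; bit 4: the expression is a register; bit 5: the expression contains no undefined symbols; bit 6: the expression is a stack-segment memory expression; bit 7: the expression references an external label; bits 8-11: language type (probably 0 for 64-bit code).
`SIZE`	Returns the size, in bytes, of the first initializer in a symbol's declaration.
`SIZEOF`	Returns the size, in bytes, allocated for a given symbol.
`THIS`	Returns an address expression equal to the current location counter within a section. You must include a type after `this`; for example, `this byte`.
`$`	Synonym for `this`.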

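4.3.2 The `this` and `$` Operators

The last two operators in Table 4-1 deserve special mention. The `this` and `$` operators (which are roughly synonyms for one another) return the current offset into the section containing them. The current offset into the section is known as the *location counter* (see "How MASM Allocates Memory for Variables" in Chapter 3). Consider the following: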

someLabel equ $

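This sets the label's offset to the current location in the program. The type of the symbol will be *statement label* (for example, `proc`). Typically, people use the `$` operator for branch labels (and advanced features). For example, the following creates an infinite loop (effectively locking up the CPU):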

jmp $     ; "$" is equivalent to the address of the jmp instr

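You can also use an instruction like the following to jump a fixed number of bytes ahead of (or behind) the current instruction in the code stream: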

jmp $+5   ; Skip to a position 5 bytes beyond the jmp

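For the most part, creating operands like this is crazy because it depends on knowing the number of bytes of machine code each machine instruction assembles into. Obviously, this is an advanced operation and not recommended for beginning assembly language programmers (it's hard to recommend even for most advanced assembly language programmers).

One practical use of the `$` operator (and probably its most common use) is to compute the size of a block of data declarations in the source file: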

someData     byte 1, 2, 3, 4, 5
sizeSomeData =    $-someData

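The address expression `$-someData` computes the current offset minus the offset of `someData` in the current section. In this case, it produces 5, the number of bytes in the `someData` operand field. In this simple example, you're probably better off using the `sizeof someData` expression, which also returns the number of bytes required for the `someData` declaration. However, consider the following statements: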

someData     byte 1, 2, 3, 4, 5
             byte 6, 7, 8, 9, 0
sizeSomeData =    $-someData

In this case, `sizeof someData` still returns 5 (because it returns only the length of the operands attached to `someData`), whereas `sizeSomeData` is set to 10.
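A common place for this idiom is measuring a string that spans several `byte` directives. The sketch below contrasts `$ - label` with `sizeof` on such a declaration; the names `msg`, `msgLen`, `firstLen`, and `msgLenDemo` are hypothetical.

```asm
; Sketch only: "msg", "msgLen", "firstLen", and
; "msgLenDemo" are hypothetical names.

        option  casemap:none

        .const
msg      byte   "Hello, world", 10      ; 12 characters + 1 = 13 bytes
         byte   "Second line", 10, 0    ; 11 characters + 2 = 13 bytes
msgLen   =      $ - msg                 ; 26: covers both directives
firstLen =      sizeof msg              ; 13: first directive only

        .code
        public  msgLenDemo
msgLenDemo proc
        mov     eax, msgLen     ; EAX = 26
        mov     edx, firstLen   ; EDX = 13
        ret
msgLenDemo endp
        end
```

The `msgLen` equate must appear immediately after the last `byte` directive; any declaration placed between the data and the equate would be counted as part of the length.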

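If an identifier appears in a constant expression, that identifier must be a constant identifier that you have previously defined in your program with an equate directive. You may not use variable identifiers in a constant expression, because their values are not defined at assembly time when MASM evaluates the constant expression. Also, don't confuse compile-time and runtime operations: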

; Constant expression, computed while MASM
; is assembling your program:

x     = 5
y     = 6
Sum   = x + y

; Runtime calculation, computed while your program
; is running, long after MASM has assembled it:

     mov al, x
     add al, y

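The `this` operator differs from the `$` operator in one important way: `$` has a default type of statement label, whereas the `this` operator lets you specify a type. The syntax for the `this` operator is the following: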

this `type`

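where `type` is one of the usual data types (`byte`, `sbyte`, `word`, `sword`, and so forth). Therefore, `this proc` is the direct equivalent of `$`. Note that the following two MASM statements are equivalent: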

someLabel label byte
someLabel equ   this byte

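4.3.3 Constant Expression Evaluation

MASM immediately interprets the value of a constant expression during assembly. It does not emit any machine instructions to compute `x + y` in the constant expression of the example in the previous section. Instead, it directly computes the sum of these two constant values. From that point forward, MASM associates the value 11 with the constant `Sum`, just as if the program had contained the statement `Sum = 11` rather than `Sum = x + y`. On the other hand, MASM does not precompute the value 11 in AL for the `mov` and `add` instructions in the previous section; it faithfully emits the object code for these two instructions, and the x86-64 computes their sum when the program runs (sometime after the assembly is complete).

In general, constant expressions don't get very sophisticated in assembly language programs. Usually, you're adding, subtracting, or multiplying two integer values. For example, the following set of equates defines a set of constants that have consecutive values: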

TapeDAT        =  0
Tape8mm        =  TapeDAT + 1
TapeQIC80      =  Tape8mm + 1
TapeTravan     =  TapeQIC80 + 1
TapeDLT        =  TapeTravan + 1

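These constants have the following values: `TapeDAT` = 0, `Tape8mm` = 1, `TapeQIC80` = 2, `TapeTravan` = 3, and `TapeDLT` = 4. This example, by the way, demonstrates how you would create a list of enumerated data constants in MASM.

4.4 The MASM typedef Statement

Let's say that you do not like the names that MASM uses for declaring `byte`, `word`, `dword`, `real4`, and other variables. Let's say that you prefer Pascal's naming convention or perhaps C's naming convention. You want to use terms like `integer`, `float`, `double`, or whatever. If MASM were Pascal, you could redefine the names in the `type` section of the program. With C, you could use a `typedef` statement to accomplish the task. Well, MASM, like C/C++, has its own type statement that also lets you create aliases of these names. The MASM `typedef` statement takes the following form: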

`new_type_name`  typedef  `existing_type_name`

The following example demonstrates how to set up some names in your MASM programs that are compatible with C/C++ or Pascal:

integer   typedef  sdword
float     typedef  real4
double    typedef  real8
colors    typedef  byte

Now you can declare your variables with more meaningful statements like these:

 .data
i          integer ?
x          float   1.0
HouseColor colors  ?

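If you program in Ada, C/C++, or FORTRAN (or any other language), you can pick type names you're more comfortable with. Of course, this doesn't change how the x86-64 or MASM reacts to these variables one iota, but it does let you create programs that are easier to read and understand, because the type names are more indicative of the actual underlying types. One warning for C/C++ programmers: don't get too excited and go off and define an `int` data type. Unfortunately, `int` is an x86-64 machine instruction (interrupt), and therefore it is a reserved word in MASM.

4.5 Type Coercion

Although MASM is fairly loose when it comes to type checking, MASM does ensure that you specify appropriate operand sizes to an instruction. For example, consider the following (incorrect) program in Listing 4-1.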

; Listing 4-1

; Type checking errors.

        option  casemap:none

nl      =       10  ; ASCII code for newline

        .data
i8      sbyte   ?
i16     sword   ?
i32     sdword  ?
i64     sqword  ?

        .code

; Here is the "asmMain" function.

        public  asmMain
asmMain proc

        mov     eax, i8
        mov     al, i16
        mov     rax, i32
        mov     ax, i64

        ret     ; Returns to caller
asmMain endp
        end

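Listing 4-1: MASM type checking

MASM will generate errors for these four `mov` instructions because the operand sizes are incompatible. The `mov` instruction requires both operands to be the same size. The first instruction attempts to move a byte into EAX, the second instruction attempts to move a word into AL, and the third instruction attempts to move a double word into RAX. The fourth instruction attempts to move a quad word into AX. Here's the output from the assembler when you attempt to assemble this file: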

C:\>**ml64 /c listing4-1.asm**
Microsoft (R) Macro Assembler (x64) Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: listing4-1.asm
listing4-1.asm(24) : error A2022:instruction operands must be the same size
listing4-1.asm(25) : error A2022:instruction operands must be the same size
listing4-1.asm(26) : error A2022:instruction operands must be the same size
listing4-1.asm(27) : error A2022:instruction operands must be the same size

While this is a good feature in MASM,^(2) sometimes it gets in the way. Consider the following code fragments:

 .data
byte_values  label byte
             byte  0, 1

             .
             .
             .

             mov ax, byte_values

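In this example, let's assume that the programmer really wants to load the word starting at the address of `byte_values` into the AX register because they want to load AL with 0 and AH with 1 by using a single instruction (0 is held in the LO memory byte, and 1 is held in the HO memory byte). MASM will refuse, claiming a type mismatch error (because `byte_values` is a byte object and AX is a word object).

The programmer could break this into two instructions, one to load AL with the byte at address `byte_values` and the other to load AH with the byte at address `byte_values[1]`. Unfortunately, such a decomposition makes the program slightly less efficient (which was probably the reason for using the single `mov` instruction in the first place). To tell MASM that we know what we're doing and we want to treat the `byte_values` variable as a `word` object, we can use type coercion.

*Type coercion* is the process of telling MASM that you want to treat an object as an explicit type, regardless of its actual type.^(3) To coerce the type of a variable, you use the following syntax: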

`new_type_name` ptr `address_expression`

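The `new_type_name` item is the new type you wish to associate with the memory location specified by `address_expression`. You may use this coercion operator anywhere a memory address is legal. To correct the previous example, so MASM doesn't complain about type mismatches, you would use the following statement: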

mov ax, word ptr byte_values

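This instruction tells MASM to load the AX register with the word starting at address `byte_values` in memory. Assuming `byte_values` still contains its initial value, this instruction will load 0 into AL and 1 into AH.

Table 4-2 lists all the MASM type-coercion operators.

Table 4-2: MASM Type-Coercion Operators

Directive	Meaning
`byte ptr`	Byte (unsigned 8-bit) value
`sbyte ptr`	Signed 8-bit integer value
`word ptr`	Unsigned 16-bit (word) value
`sword ptr`	Signed 16-bit integer value
`dword ptr`	Unsigned 32-bit (double-word) value
`sdword ptr`	Signed 32-bit integer value
`qword ptr`	Unsigned 64-bit (quad-word) value
`sqword ptr`	Signed 64-bit integer value
`tbyte ptr`	Unsigned 80-bit (10-byte) value
`oword ptr`	128-bit (octal-word) value
`xmmword ptr`	128-bit (octal-word) value; same as `oword ptr`
`ymmword ptr`	256-bit value (for use with AVX YMM registers)
`zmmword ptr`	512-bit value (for use with AVX-512 ZMM registers)
`real4 ptr`	Single-precision (32-bit) floating-point value
`real8 ptr`	Double-precision (64-bit) floating-point value
`real10 ptr`	Extended-precision (80-bit) floating-point value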

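Type coercion is necessary when you specify an anonymous variable as the operand to an instruction that directly modifies memory (for example, `neg`, `shl`, `not`, and so on). Consider the following statement: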

not [rbx]

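MASM will generate an error on this instruction because it cannot determine the size of the memory operand. The instruction does not supply sufficient information to determine whether the program should invert the bits in the byte pointed at by RBX, the word pointed at by RBX, the double word pointed at by RBX, or the quad word pointed at by RBX. You must use a type-coercion operator to explicitly specify the size of anonymous references with these types of instructions: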

not byte ptr [rbx]
not dword ptr [rbx]

Consider the following statement (where `byteVar` is an 8-bit variable):

mov dword ptr byteVar, eax

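Without the type-coercion operator, MASM complains about this instruction because it attempts to store a 32-bit register in an 8-bit memory location. Beginning programmers, wanting their programs to assemble, may take a shortcut and use the type-coercion operator, as shown in this instruction. This certainly quiets the assembler, which no longer complains about a type mismatch, so the beginning programmers are happy.

However, the program is still incorrect; the only difference is that MASM no longer warns you about your error. The type-coercion operator does not fix the problem of attempting to store a 32-bit value into an 8-bit memory location. It simply allows the instruction to store a 32-bit value starting at the address specified by the 8-bit variable. The program still stores 4 bytes, overwriting the 3 bytes following `byteVar` in memory.

This often produces unexpected results, including the phantom modification of variables in your program.^(4) Another, rarer possibility is for the program to abort with a general protection fault, if the 3 bytes following `byteVar` are not allocated in real memory or if those bytes just happen to fall in a read-only section of memory. The important thing to remember about the type-coercion operator is this: if you cannot exactly state the effect this operator has, don't use it.

Also keep in mind that the type-coercion operator does not perform any translation of the data in memory. It simply tells the assembler to treat the bits in memory as a different type. It will not automatically extend an 8-bit value to 32 bits, nor will it convert an integer to a floating-point value.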

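4.6 Pointer Data Types

You've probably experienced pointers firsthand in the Pascal, C, or Ada programming languages, and you're probably getting worried right now. Almost everyone has a bad experience when they first encounter pointers in a high-level language. Well, fear not! Pointers are actually easier to deal with in assembly language than in high-level languages.

Besides, most of the problems you had with pointers probably had nothing to do with pointers but rather with the linked list and tree data structures you were trying to implement with them. Pointers, on the other hand, have many uses in assembly language that have nothing to do with linked lists, trees, and other scary data structures. Indeed, simple data structures like arrays and records often involve the use of pointers. So, if you have some deep-rooted fear about pointers, forget everything you know about them. You're going to learn how great pointers really are.

Probably the best place to start is with the definition of a pointer. A *pointer* is a memory location whose value is the address of another memory location. Unfortunately, high-level languages like C/C++ tend to hide the simplicity of pointers behind a wall of abstraction. This added complexity (which exists for good reason, by the way) tends to frighten programmers because they don't understand what's going on.

To illuminate what's really happening, consider the following array declaration in Pascal: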

M: array [0..1023] of integer;

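Even if you don't know Pascal, the concept here is easy to understand. `M` is an array with 1024 integers in it, indexed from `M[0]` to `M[1023]`. Each one of these array elements can hold an integer value that is independent of all the others. In other words, this array gives you 1024 different integer variables, each of which you refer to by number (the array index) rather than by name.

If you encounter a program that has the statement `M[0] := 100;`, you probably won't have to think at all about what is happening with this statement. It is storing the value `100` into the first element of the array `M`. Now consider the following two statements: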

i := 0;      (Assume "i" is an integer variable)
M [i] := 100;

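You should agree, without too much hesitation, that these two statements perform the same operation as `M[0] := 100;`. Indeed, you're probably willing to agree that you can use any integer expression in the range 0 to 1023 as an index into this array. The following statements still perform the same operation as our single assignment to index 0: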

i := 5;      (Assume all variables are integers)
j := 10;
k := 50;
m [i*j-k] := 100;

"Okay, so what's the point?" you're probably thinking. "Anything that produces an integer in the range 0 to 1023 is legal. So what?" Okay, how about the following:

M [1] := 0;
M [M [1]] := 100;

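Whoa! Now that takes a few moments to digest. However, if you take it slowly, it makes sense, and you'll discover that these two instructions perform the same operation you've been doing all along. The first statement stores `0` into array element `M[1]`. The second statement fetches the value of `M[1]`, which is an integer so you can use it as an array index into `M`, and uses that value (`0`) to control where it stores the value `100`.

If you're willing to accept this as reasonable (perhaps bizarre, but usable nonetheless), then you'll have no problems with pointers. Because `M[1]` is a pointer! Well, not really, but if you were to change `M` to *memory* and treat this array as all of memory, this is the exact definition of a pointer: a memory location whose value is the address (or index, if you prefer) of another memory location. Pointers are easy to declare and use in an assembly language program. You don't even have to worry about array indices or anything like that.

4.6.1 Using Pointers in Assembly Language

A MASM pointer is a 64-bit value that may contain the address of another variable. If you have a qword variable `p` that contains 1000_0000h, then `p` "points" at memory location 1000_0000h. To access the qword that `p` points at, you could use code like the following: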

mov  rbx, p       ; Load RBX with the value of pointer p
mov  rax, [rbx]   ; Fetch the data that p points at

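By loading the value of p into RBX, this code loads the value 1000_0000h into RBX (assuming p contains 1000_0000h). The second instruction loads the RAX register with the qword starting at the location whose address appears in RBX. Because RBX now contains 1000_0000h, this loads RAX from locations 1000_0000h through 1000_0007h.

Why not just load RAX directly from location 1000_0000h by using an instruction like mov rax, mem (assuming mem is at address 1000_0000h)? Well, there are several reasons. But the primary reason is that this mov instruction always loads RAX from location mem. You cannot change the address from which it loads RAX. The former instructions, however, always load RAX from the location where p is pointing. This is easy to change under program control. In fact, the two instructions mov rax, offset mem2 and mov p, rax will cause those previous two instructions to load RAX from mem2 the next time they execute. Consider the following code fragment: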

    mov rax, offset i
    mov p, rax
      .
      .
      .      ; Code that sets or clears the carry flag.

    jc skipSetp

       mov rax, offset j
       mov p, rax
        .
        .
        .

skipSetp:
    mov rbx, p           ; Assume both code paths wind up
    mov rax, [rbx]       ; down here

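This short example demonstrates two execution paths through the program. The first path loads the variable p with the address of the variable i. The second path loads p with the address of the variable j. Both execution paths converge on the last two mov instructions, which load RAX with i or j, depending on which execution path was taken. In many respects, this is like a parameter to a procedure in a high-level language such as Swift. Executing the same instructions accesses different variables depending on whose address (i or j) winds up in p.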
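For readers more comfortable in C, the two-path fragment above corresponds roughly to the following sketch. The function name fetch, the carry_set parameter (standing in for the carry flag), and the initial values of i and j are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

int64_t i = 11;   /* illustrative values, not from the text */
int64_t j = 22;
int64_t *p;       /* the pointer variable */

/* carry_set models the state of the carry flag tested by jc. */
int64_t fetch(int carry_set)
{
    p = &i;                 /* mov rax, offset i / mov p, rax        */
    if (!carry_set) {       /* jc skipSetp skips this when carry set */
        p = &j;             /* mov rax, offset j / mov p, rax        */
    }
    return *p;              /* mov rbx, p / mov rax, [rbx]           */
}
```

The final dereference is the same statement on both paths; only the address stored in p decides which variable it reads.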

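4.6.2 Declaring Pointers in MASM

Because pointers are 64 bits long, you could use the qword type to allocate storage for your pointers. However, rather than use qword declarations, an arguably better approach is to use typedef to create a pointer type: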

          .data
pointer   typedef qword
b         byte    ?
d         dword   ?
pByteVar  pointer b
pDWordVar pointer d

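This example demonstrates that it is possible to initialize as well as declare pointer variables in MASM. Note that you may specify the addresses of static variables (.data, .const, and .data? objects) in the operand field of a qword/pointer directive, so you can initialize pointer variables only with the addresses of static objects.

4.6.3 Pointer Constants and Pointer Constant Expressions

MASM allows very simple constant expressions wherever a pointer constant is legal. Pointer constant expressions take one of the three following forms:^(5)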

offset StaticVarName [PureConstantExpression]
offset StaticVarName + PureConstantExpression
offset StaticVarName - PureConstantExpression

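The PureConstantExpression term is a numeric constant expression that does not involve any pointer constants. This type of expression produces a memory address that is the specified number of bytes before or after (- or +, respectively) the StaticVarName variable in memory. Note that the first two forms shown here are semantically equivalent; both return a pointer constant whose address is the sum of the static variable's address and the constant expression.

Because you can create pointer constant expressions, it should come as no surprise to discover that MASM lets you define manifest pointer constants by using equates. The program in Listing 4-2 demonstrates how you can do this.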

; Listing 4-2

; Pointer constant demonstration.

        option  casemap:none

nl      =       10

        .const
ttlStr  byte    "Listing 4-2", 0
fmtStr  byte    "pb's value is %ph", nl
        byte    "*pb's value is %d", nl, 0

        .data
b       byte    0
        byte    1, 2, 3, 4, 5, 6, 7

pb      textequ <offset b[2]>

        .code
        externdef printf:proc

; Return program title to C++ program:

         public getTitle
getTitle proc
         lea rax, ttlStr
         ret
getTitle endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc

; "Magic" instruction offered without
; explanation at this point:

        sub     rsp, 48

        lea     rcx, fmtStr
        mov     rdx, pb
        movzx   r8, byte ptr [rdx]
        call    printf

        add     rsp, 48
        ret     ; Returns to caller

asmMain endp
        end

Listing 4-2: Pointer constant expressions in a MASM program

Here's the assembly and execution of this code:

C:\>**build listing4-2**

C:\>**echo off**
 Assembling: listing4-2.asm
c.cpp

C:\>**listing4-2**
Calling Listing 4-2:
pb's value is 00007FF6AC381002h
*pb's value is 2
Listing 4-2 terminated

Note that the address printed may vary on different machines and different versions of Windows.
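C has a close analogue of pointer constant expressions: an address constant formed from a static object plus or minus an integer constant. The sketch below mirrors the data in Listing 4-2; the names pb_alt and pb_back are hypothetical and only illustrate the three forms.

```c
#include <assert.h>

/* Same bytes as b in Listing 4-2: 0 followed by 1..7. */
unsigned char b[8] = {0, 1, 2, 3, 4, 5, 6, 7};

/* Address constants resolved at build/load time, not computed at run time.
   All three name the same byte, two bytes past the start of b. */
unsigned char *const pb      = &b[2];      /* like offset b[2]      */
unsigned char *const pb_alt  = b + 2;      /* like offset b + 2     */
unsigned char *const pb_back = &b[5] - 3;  /* like offset b[5] - 3  */
```

Dereferencing any of these pointers yields 2, matching the value the listing prints for *pb.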

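4.6.4 Pointer Variables and Dynamic Memory Allocation

Pointer variables are the perfect place to store the return result from the C Standard Library malloc() function. This function returns the address of the storage it allocates in the RAX register; therefore, you can store the address directly into a pointer variable with a single mov instruction immediately after a call to malloc(). Listing 4-3 demonstrates calls to the C Standard Library malloc() and free() functions.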

; Listing 4-3

; Demonstration of calls
; to C standard library malloc
; and free functions.

        option  casemap:none

nl      =       10

        .const
ttlStr  byte    "Listing 4-3", 0
fmtStr  byte    "Addresses returned by malloc: %ph, %ph", nl, 0

        .data
ptrVar  qword   ?
ptrVar2 qword   ?

        .code
        externdef printf:proc
        externdef malloc:proc
        externdef free:proc

; Return program title to C++ program:

         public getTitle
getTitle proc
         lea rax, ttlStr
         ret
getTitle endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc

; "Magic" instruction offered without
; explanation at this point:

        sub     rsp, 48

; C standard library malloc function.

; ptr = malloc(byteCnt);

        mov     rcx, 256        ; Allocate 256 bytes
        call    malloc
        mov     ptrVar, rax     ; Save pointer to buffer

        mov     rcx, 1024       ; Allocate 1024 bytes
        call    malloc
        mov     ptrVar2, rax    ; Save pointer to buffer

        lea     rcx, fmtStr
        mov     rdx, ptrVar
        mov     r8, rax         ; Print addresses
        call    printf

; Free the storage by calling
; C standard library free function.

; free(ptrToFree);

        mov     rcx, ptrVar
        call    free

        mov     rcx, ptrVar2
        call    free

        add     rsp, 48
        ret     ; Returns to caller

asmMain endp
        end

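Listing 4-3: Demonstration of malloc() and free() calls

Here's the output I obtained when building and running this program. Note that the addresses malloc() returns may vary by system, by operating system version, and for other reasons. Therefore, you will likely get different numbers than I obtained on my system.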

C:\>**build listing4-3**

C:\>**echo off**
 Assembling: listing4-3.asm
c.cpp

C:\>**listing4-3**
Calling Listing 4-3:
Addresses returned by malloc: 0000013B2BC43AD0h, 0000013B2BC43BE0h
Listing 4-3 terminated

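4.6.5 Common Pointer Problems

Programmers encounter five common problems when using pointers. Some of these errors will cause your programs to immediately stop with a diagnostic message; other problems are more subtle, yielding incorrect results without otherwise reporting an error or simply affecting the performance of your program without displaying an error. These five problems are as follows:

Using an uninitialized pointer
Using a pointer that contains an illegal value (for example, NULL)
Continuing to use malloc()-allocated storage after that storage has been freed
Failing to free() storage once the program is finished using it
Accessing indirect data by using the wrong data type

The first problem is using a pointer variable before you have assigned a valid memory address to the pointer. Beginning programmers often don't realize that declaring a pointer variable reserves storage only for the pointer itself; it does not reserve storage for the data that the pointer references. The short program in Listing 4-4 demonstrates this problem (don't try to assemble and run this program; it will crash).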

; Listing 4-4

; Uninitialized pointer demonstration.
; Note that this program will not
; run properly.

        option  casemap:none

nl      =       10

        .const
ttlStr  byte    "Listing 4-4", 0
fmtStr  byte    "Pointer value= %p", nl, 0

        .data
ptrVar  qword   ?

        .code
        externdef printf:proc

; Return program title to C++ program:

         public getTitle
getTitle proc
         lea rax, ttlStr
         ret
getTitle endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc

; "Magic" instruction offered without
; explanation at this point:

        sub     rsp, 48

        lea     rcx, fmtStr
        mov     rdx, ptrVar
        mov     rdx, [rdx]      ; Will crash system
        call    printf

        add     rsp, 48
        ret     ; Returns to caller

asmMain endp
        end

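Listing 4-4: Uninitialized pointer demonstration

Although variables you declare in the .data section are, technically, initialized, static initialization still doesn't initialize the pointer in this program with a valid address (it initializes the pointer with 0, which is NULL).

Of course, there is no such thing as a truly uninitialized variable on the x86-64. What you really have are variables that you've explicitly given an initial value and variables that just happen to inherit whatever bit pattern was in memory when storage for the variable was allocated. Much of the time, these garbage bit patterns lying around in memory don't correspond to a valid memory address. Attempting to dereference such a pointer (that is, access the data in memory at which it points) typically raises a memory access violation exception.

Sometimes, however, those random bits in memory just happen to correspond to a valid memory location you can access. In this situation, the CPU will access the specified memory location without aborting the program. Although to a naive programmer this situation may seem preferable to stopping the program, in reality this is far worse because your defective program continues to run without alerting you to the problem. If you store data through an uninitialized pointer, you may very well overwrite the values of other important variables in memory. This defect can produce some very difficult-to-locate problems in your program.

The second problem programmers have with pointers is storing invalid address values into a pointer. The first problem is actually a special case of this second problem (with garbage bits in memory supplying the invalid address, rather than your producing it via a miscalculation). The effects are the same; if you attempt to dereference a pointer containing an invalid address, you either will get a memory access violation exception or will access an unexpected memory location.

The third problem listed is also known as the dangling pointer problem. To understand this problem, consider the following code fragment: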

mov  rcx, 256
call malloc       ; Allocate some storage
mov  ptrVar, rax  ; Save address away in ptrVar
 .
 .    ; Code that uses the pointer variable ptrVar.
 .
mov   rcx, ptrVar
call  free        ; Free storage associated with ptrVar
  .
  .   ; Code that does not change the value in ptrVar.
  .
mov rbx, ptrVar
mov [rbx], al

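In this example, the program allocates 256 bytes of storage and saves the address of that storage in the ptrVar variable. Then the code uses this block of 256 bytes for a while and frees the storage, returning it to the system for other uses. Note that calling free() does not change the value of ptrVar in any way; ptrVar still points at the block of memory allocated by malloc() earlier. Indeed, free() is not required to change the data in this block, so upon return from free(), ptrVar may well still point at the data stored into the block by this code.

However, note that the call to free() tells the system that the program no longer needs this 256-byte block of memory and that the system can use this region of memory for other purposes. The free() function cannot enforce that you will never access this data again; you are simply promising that you won't. Of course, the preceding code fragment breaks this promise; as you can see in the last two instructions, the program fetches the value in ptrVar and accesses the data it points at in memory.

The biggest problem with dangling pointers is that you can get away with using them a good part of the time. As long as the system doesn't reuse the storage you've freed, using a dangling pointer produces no ill effects in your program. However, with each new call to malloc(), the system may decide to reuse the memory released by that previous call to free(). When this happens, any attempt to dereference the dangling pointer may produce unintended consequences. The problems range from reading data that has been overwritten (by the new, legal use of the data storage), to overwriting the new data, to (the worst case) overwriting system heap management pointers (doing so will probably cause your program to crash). The solution is clear: never use a pointer value once you free the storage associated with that pointer.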
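One common defensive habit, shown here as a hedged C sketch rather than anything the book prescribes, is to clear the pointer variable at the moment the storage is freed. The helper name free_and_clear is hypothetical. This does not remove other copies of the address, but on typical systems a later dereference of the cleared variable faults immediately instead of silently touching reused memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: release the block *pp refers to, then set the
   caller's pointer variable to NULL so it no longer dangles. */
void free_and_clear(void **pp)
{
    assert(pp != NULL);
    free(*pp);      /* free(NULL) is defined to do nothing */
    *pp = NULL;     /* the variable no longer holds a stale address */
}
```

In assembly, the equivalent would be storing 0 into ptrVar immediately after the call to free.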

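Of all the problems, the fourth (failing to free allocated storage) will probably have the least impact on the proper operation of your program. The following code fragment demonstrates this problem: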

mov  rcx, 256
call malloc
mov  ptrVar, rax
 .              ; Code that uses ptrVar.
 .              ; This code does not free up the storage
 .              ; associated with ptrVar.
mov  rcx, 512
call malloc
mov  ptrVar, rax

; At this point, there is no way to reference the original
; block of 256 bytes pointed at by ptrVar.

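In this example, the program allocates 256 bytes of storage and references this storage by using the ptrVar variable. Later, the program allocates another block of 512 bytes and overwrites the value in ptrVar with the address of this new block. Note that the former value in ptrVar is lost. Because the program no longer has this address value, there is no way to call free() to return the storage to the system for later use.

As a result, this memory is no longer available to your program. While making 256 bytes of memory inaccessible to your program may not seem like a big deal, imagine that this code is in a loop that repeats over and over again. With each execution of the loop, the program loses another 256 bytes of memory. After a sufficient number of loop iterations, the program will exhaust the memory available on the heap. This problem is often called a memory leak because the effect is the same as though the memory bits were leaking out of your computer (yielding less and less available storage) during program execution.

Memory leaks are far less damaging than dangling pointers. Indeed, memory leaks create only two problems: the danger of running out of heap space (which, ultimately, may cause the program to abort, though this is rare) and performance problems due to virtual memory page swapping. Nevertheless, you should get in the habit of always freeing all storage once you have finished using it. When your program quits, the operating system reclaims all storage, including the data lost via memory leaks. Therefore, memory lost via a leak is lost only to your program, not to the whole system.

The last problem with pointers is the lack of type-safe access. This can occur because MASM cannot and does not enforce pointer type checking. For example, consider the program in Listing 4-5.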

; Listing 4-5

; Demonstration of lack of type
; checking in assembly language
; pointer access.

          option  casemap:none

nl        =     10
maxLen    =     256

          .const
ttlStr    byte    "Listing 4-5", 0
prompt    byte    "Input a string: ", 0
fmtStr    byte    "%d: Hex value of char read: %x", nl, 0

          .data
bufPtr    qword   ?
bytesRead qword   ?

        .code
        externdef readLine:proc
        externdef printf:proc
        externdef malloc:proc
        externdef free:proc

; Return program title to C++ program:

         public getTitle
getTitle proc
         lea rax, ttlStr
         ret
getTitle endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc
        push    rbx             ; Preserve RBX

; "Magic" instruction offered without
; explanation at this point:

        sub     rsp, 40

; C standard library malloc function.
; Allocate sufficient characters
; to hold a line of text input
; by the user:

        mov     rcx, maxLen     ; Allocate 256 bytes
        call    malloc
        mov     bufPtr, rax     ; Save pointer to buffer

; Read a line of text from the user and place in
; the newly allocated buffer:

        lea     rcx, prompt     ; Prompt user to input
        call    printf          ; a line of text

        mov     rcx, bufPtr     ; Pointer to input buffer
        mov     rdx, maxLen     ; Maximum input buffer length
        call    readLine        ; Read text from user
        cmp     rax, -1         ; Skip output if error
        je      allDone
        mov     bytesRead, rax  ; Save number of chars read

; Display the data input by the user:

        xor     rbx, rbx        ; Set index to zero
dispLp: mov     r9, bufPtr      ; Pointer to buffer
        mov     rdx, rbx        ; Display index into buffer
        mov     r8d, [r9+rbx*1] ; Read dword rather than byte!
        lea     rcx, fmtStr
        call    printf

        inc     rbx             ; Repeat for each char in buffer
        cmp     rbx, bytesRead
        jb      dispLp

; Free the storage by calling
; C standard library free function.

; free(bufPtr);

allDone:
        mov     rcx, bufPtr
        call    free

        add     rsp, 40
        pop     rbx     ; Restore RBX
        ret             ; Returns to caller
asmMain endp
        end

Listing 4-5: Type-unsafe pointer access example

Here are the commands to build and run this sample program:

C:\>**build listing4-5**

C:\>**echo off**
 Assembling: listing4-5.asm
c.cpp

C:\>**listing4-5**
Calling Listing 4-5:
Input a string: Hello, World!
0: Hex value of char read: 6c6c6548
1: Hex value of char read: 6f6c6c65
2: Hex value of char read: 2c6f6c6c
3: Hex value of char read: 202c6f6c
4: Hex value of char read: 57202c6f
5: Hex value of char read: 6f57202c
6: Hex value of char read: 726f5720
7: Hex value of char read: 6c726f57
8: Hex value of char read: 646c726f
9: Hex value of char read: 21646c72
10: Hex value of char read: 21646c
11: Hex value of char read: 2164
12: Hex value of char read: 21
13: Hex value of char read: 5c000000
Listing 4-5 terminated

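The program in Listing 4-5 reads data from the user as character values and then displays the data as double-word hexadecimal values. While a powerful feature of assembly language is that it lets you ignore data types at will and automatically coerce the data without any effort, this power is a double-edged sword. If you make a mistake and access indirect data by using the wrong data type, MASM and the x86-64 may not catch the mistake, and your program may produce inaccurate results. Therefore, when using pointers and indirection in your programs, you need to take care that you use the data consistently with respect to data type.

This demonstration program has a fundamental flaw that could create a problem for you: when reading the last two characters of the input buffer, the program accesses data beyond the characters input by the user. If the user inputs 255 characters (plus the zero-terminating byte that readLine() appends), this program will access data beyond the end of the buffer allocated by malloc(). In theory, this could cause the program to crash. This is yet another problem that can occur when accessing data by using the wrong type via pointers.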
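To see where values such as 6c6c6548 in the output come from, the following C sketch reproduces the dword read byte by byte. The function names are hypothetical; the little-endian assembly of the four bytes matches what mov r8d, [r9+rbx*1] does on x86-64, and the explicit length check is an addition that the listing itself lacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read 4 bytes starting at buf[index] as a little-endian dword.
   The assert rejects reads that would run past the end of the buffer. */
uint32_t read_dword_le(const unsigned char *buf, size_t len, size_t index)
{
    assert(index + 4 <= len);
    return (uint32_t)buf[index]
         | (uint32_t)buf[index + 1] << 8
         | (uint32_t)buf[index + 2] << 16
         | (uint32_t)buf[index + 3] << 24;
}

/* Type-correct access: fetch exactly one character. */
unsigned char read_char(const unsigned char *buf, size_t len, size_t index)
{
    assert(index < len);
    return buf[index];
}
```

For the input "Hello, World!", reading a dword at index 0 combines 'H' (48h), 'e' (65h), 'l' (6Ch), and 'l' (6Ch) into 6C6C6548h, whereas the type-correct byte read yields just 48h.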

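4.7 Composite Data Types

Composite data types, also known as aggregate data types, are those that are built up from other (generally scalar) data types. The next several sections cover some of the more important composite data types: character strings, arrays, multidimensional arrays, records/structs, and unions. A string is a good example of a composite data type; it is a data structure built up from a sequence of individual characters and other data.

4.8 Character Strings

After integer values, character strings are probably the most common data type that modern programs use. The x86-64 does support a handful of string instructions, but these instructions are really intended for block memory operations, not for a specific implementation of a character string. Therefore, this section provides a couple of definitions of character strings and discusses how to process them.

In general, a character string is a sequence of ASCII characters that possesses two main attributes: a length and some character data. Different languages use different data structures to represent strings. Assembly language (at least, sans any library routines) doesn't really care how you implement strings. All you need to do is create a sequence of machine instructions to process the string data in whatever format the strings take.

4.8.1 Zero-Terminated Strings

Without question, zero-terminated strings are the most common string representation in use today because this is the native string format for C, C++, and other languages. A zero-terminated string consists of a sequence of zero or more ASCII characters ending with a 0 byte. For example, in C/C++, the string "abc" requires 4 bytes: the three characters a, b, and c followed by a 0. As you'll soon see, MASM character strings are upward compatible with zero-terminated strings, but in the meantime, you should note that creating zero-terminated strings in MASM is easy. The easiest place to do this is in the .data section, using code like the following: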

           .data
zeroString byte   "This is the zero-terminated string", 0

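Whenever a character string appears in the byte directive as it does here, MASM emits each character in the string to successive memory locations. The zero value at the end of the string terminates this string.

Zero-terminated strings have two principal attributes: they are simple to implement, and the strings can be any length. On the other hand, zero-terminated strings have a few drawbacks. First, zero-terminated strings cannot contain the NUL character (whose ASCII code is 0). Generally, this isn't a problem, but it does create havoc once in a while. The second problem with zero-terminated strings is that many operations on them are somewhat inefficient. For example, to compute the length of a zero-terminated string, you must scan the entire string looking for that 0 byte (counting characters up to the 0). The following program fragment demonstrates how to compute the length of the preceding string: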

          lea rbx, zeroString
          xor rax, rax    ; Set RAX to zero
whileLp:  cmp byte ptr [rbx+rax*1], 0
          je  endwhile

          inc rax
          jmp whileLp

endwhile:

; String length is now in RAX.

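As you can see from this code, the time it takes to compute the length of the string is proportional to the length of the string; as the string gets longer, it takes longer to compute its length.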
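The same scan, written as a C sketch with the corresponding instructions noted in comments, makes the linear cost easy to see. The function name zstr_length is hypothetical; in practice you would call the C Standard Library strlen instead.

```c
#include <assert.h>
#include <stddef.h>

/* Count characters up to, but not including, the terminating 0 byte. */
size_t zstr_length(const char *s)
{
    size_t n = 0;           /* xor rax, rax                  */
    while (s[n] != 0) {     /* cmp byte ptr [rbx+rax*1], 0   */
        n++;                /* inc rax                       */
    }
    return n;               /* one loop iteration per character */
}
```

The loop body runs once per character, so a string twice as long takes roughly twice as long to measure.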

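4.8.2 Length-Prefixed Strings

The length-prefixed string format overcomes some of the problems with zero-terminated strings. Length-prefixed strings are common in languages like Pascal; they generally consist of a length byte followed by zero or more character values. The first byte specifies the string length, and the following bytes (up to the specified length) are the character data. In a length-prefixed scheme, the string "abc" would consist of 4 bytes: 03 (the string length) followed by a, b, and c. You can create length-prefixed strings in MASM by using code like the following: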

        .data
lengthPrefixedString label byte
        byte 3, "abc"

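Counting the characters ahead of time and inserting the count into the byte statement, as was done here, may seem like a major pain. Fortunately, there are ways to have MASM automatically compute the string length for you.

Length-prefixed strings solve the two major problems associated with zero-terminated strings. It is possible to include the NUL character in length-prefixed strings, and those operations on zero-terminated strings that are relatively inefficient (for example, string length) are more efficient when using length-prefixed strings. However, length-prefixed strings have their own drawbacks. The principal drawback is that they are limited to a maximum of 255 characters in length (assuming a 1-byte length prefix).

Of course, if you have a problem with a string length limitation of 255 characters, you can create a length-prefixed string by using any number of bytes for the length as you need. For example, the High-Level Assembler (HLA) uses a 4-byte length variant of length-prefixed strings, allowing strings up to 4GB long.^(6) The point is that in assembly language, you can define string formats however you like.

If you want to create length-prefixed strings in your assembly language programs, you don't want to manually count the characters in the string and emit that length in your code. It's far better to have the assembler do this kind of grunt work for you. This is easily accomplished by using the location counter operator ($) as follows: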

     .data
lengthPrefixedString label byte
     byte lpsLen, "abc"
lpsLen = $-lengthPrefixedString-1

The lpsLen operand subtracts 1 in the address expression because $-lengthPrefixedString also includes the length prefix byte, which isn't considered part of the string length.
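As a rough C illustration of the 1-byte length-prefixed layout, the sketch below stores the same bytes MASM would emit and reads the length in constant time. The names lps_abc, lps_nul, lps_length, and lps_data are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* "abc" as a length-prefixed string: 03 'a' 'b' 'c' (4 bytes total). */
const unsigned char lps_abc[] = { 3, 'a', 'b', 'c' };

/* Unlike a zero-terminated string, the data may contain a NUL byte. */
const unsigned char lps_nul[] = { 3, 'a', 0, 'c' };

/* The length is simply the first byte: no scan is needed. */
size_t lps_length(const unsigned char *s)
{
    return s[0];
}

/* The character data starts immediately after the length byte. */
const unsigned char *lps_data(const unsigned char *s)
{
    return s + 1;
}
```

Note that sizeof lps_abc - 1 equals the stored length, which is the same subtraction the lpsLen equate performs with the location counter.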

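4.8.3 String Descriptors

Another common string format is a string descriptor. A string descriptor is typically a small data structure (a record or structure; see the "Records/Structs" section) that contains several pieces of data describing a string. At a bare minimum, a string descriptor will probably have a pointer to the actual string data and a field specifying the number of characters in the string (that is, the string length). Other possible fields might include the number of bytes currently occupied by the string,^(7) the maximum number of bytes the string could occupy, the string encoding (for example, ASCII, Latin-1, UTF-8, or UTF-16), and any other information the string data structure's designer could dream up.

By far, the most common descriptor format incorporates a pointer to the string's data and a size field specifying the number of bytes currently occupied by that string data. Note that this particular string descriptor is not the same thing as a length-prefixed string. In a length-prefixed string, the length immediately precedes the character data itself. In a descriptor, the length and a pointer are kept together, and this pair is (usually) separate from the character data itself.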
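A minimal pointer-plus-length descriptor might look like the following C sketch. The type StrDesc and both functions are hypothetical names chosen for illustration; real descriptor formats often carry additional fields such as capacity or encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: the pair lives apart from the character data. */
typedef struct {
    const char *data;    /* pointer to the first character     */
    size_t      length;  /* number of bytes currently in use   */
} StrDesc;

/* Build a descriptor for an existing zero-terminated string. */
StrDesc strdesc_from_cstr(const char *s)
{
    StrDesc d;
    d.data = s;
    d.length = strlen(s);   /* scan once, then the length is cached */
    return d;
}

/* Describe a substring without copying any character data. */
StrDesc strdesc_slice(StrDesc d, size_t start, size_t count)
{
    assert(start <= d.length && count <= d.length - start);
    StrDesc out;
    out.data = d.data + start;
    out.length = count;
    return out;
}
```

Because the length travels with the pointer, taking a substring is just pointer arithmetic, which is one reason descriptor-based formats are popular.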

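4.8.4 Pointers to Strings

Most of the time, an assembly language program won't directly work with strings appearing in the .data (or .const or .data?) section. Instead, the program will work with pointers to strings (including strings whose storage the program has dynamically allocated with a call to a function like malloc()). Listing 4-5 provided a simple (if not broken) example. In such applications, your assembly code will typically load a pointer to a string into a base register and then use a second (index) register to access individual characters in the string.

4.8.5 String Functions

Unfortunately, very few assemblers provide a set of string functions you can call from your assembly language programs.^(8) As an assembly language programmer, you're expected to write these functions on your own. Fortunately, a couple of solutions are available if you don't quite feel up to the task.

The first set of string functions you can call (without having to write them yourself) are the C Standard Library string functions (from the string.h header file in C). Of course, you'll have to use C strings (zero-terminated strings) in your code when calling C Standard Library functions, but this generally isn't a big problem. Listing 4-6 provides examples of calls to various C string functions.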

; Listing 4-6

; Calling C Standard Library string functions.

          option  casemap:none

nl        =       10
maxLen    =     256

          .const
ttlStr    byte  "Listing 4-6", 0
prompt    byte  "Input a string: ", 0
fmtStr1   byte  "After strncpy, resultStr='%s'", nl, 0
fmtStr2   byte  "After strncat, resultStr='%s'", nl, 0
fmtStr3   byte  "After strcmp (3), eax=%d", nl, 0
fmtStr4   byte  "After strcmp (4), eax=%d", nl, 0
fmtStr5   byte  "After strcmp (5), eax=%d", nl, 0
fmtStr6   byte  "After strchr, rax='%s'", nl, 0
fmtStr7   byte  "After strstr, rax='%s'", nl, 0
fmtStr8   byte  "resultStr length is %d", nl, 0

str1      byte  "Hello, ", 0
str2      byte  "World!", 0
str3      byte  "Hello, World!", 0
str4      byte  "hello, world!", 0
str5      byte  "HELLO, WORLD!", 0

          .data
strLength dword ?
resultStr byte  maxLen dup (?)

        .code
        externdef readLine:proc
        externdef printf:proc
        externdef malloc:proc
        externdef free:proc

; Some C standard library string functions:

; size_t strlen(char *str)

        externdef strlen:proc

; char *strncat(char *dest, const char *src, size_t n)

        externdef strncat:proc

; char *strchr(const char *str, int c)

        externdef strchr:proc

; int strcmp(const char *str1, const char *str2)

        externdef strcmp:proc

; char *strncpy(char *dest, const char *src, size_t n)

        externdef strncpy:proc

; char *strstr(const char *inStr, const char *search4)

        externdef strstr:proc

; Return program title to C++ program:

         public getTitle
getTitle proc
         lea rax, ttlStr
         ret
getTitle endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc

; "Magic" instruction offered without
; explanation at this point:

        sub     rsp, 48

; Demonstrate the strncpy function to copy a
; string from one location to another:

        lea     rcx, resultStr  ; Destination string
        lea     rdx, str1       ; Source string
        mov     r8, maxLen      ; Max number of chars to copy
        call    strncpy

        lea     rcx, fmtStr1
        lea     rdx, resultStr
        call    printf

; Demonstrate the strncat function to concatenate str2 to
; the end of resultStr:

        lea     rcx, resultStr
        lea     rdx, str2
        mov     r8, maxLen
        call    strncat

        lea     rcx, fmtStr2
        lea     rdx, resultStr
        call    printf

; Demonstrate the strcmp function to compare resultStr
; with str3, str4, and str5:

        lea     rcx, resultStr
        lea     rdx, str3
        call    strcmp

        lea     rcx, fmtStr3
        mov     rdx, rax
        call    printf

        lea     rcx, resultStr
        lea     rdx, str4
        call    strcmp

        lea     rcx, fmtStr4
        mov     rdx, rax
        call    printf

        lea     rcx, resultStr
        lea     rdx, str5
        call    strcmp

        lea     rcx, fmtStr5
        mov     rdx, rax
        call    printf

; Demonstrate the strchr function to search for
; "," in resultStr:

        lea     rcx, resultStr
        mov     rdx, ','
        call    strchr

        lea     rcx, fmtStr6
        mov     rdx, rax
        call    printf

; Demonstrate the strstr function to search for
; str2 in resultStr:

        lea     rcx, resultStr
        lea     rdx, str2
        call    strstr

        lea     rcx, fmtStr7
        mov     rdx, rax
        call    printf

; Demonstrate a call to the strlen function:

        lea     rcx, resultStr
        call    strlen

        lea     rcx, fmtStr8
        mov     rdx, rax
        call    printf

        add     rsp, 48
        ret     ; Returns to caller
asmMain endp
        end

Listing 4-6: Calling C Standard Library string functions from MASM source code

Here are the commands to build and run Listing 4-6:

C:\>**build listing4-6**

C:\>**echo off**
 Assembling: listing4-6.asm
c.cpp

C:\>**listing4-6**
Calling Listing 4-6:
After strncpy, resultStr='Hello, '
After strncat, resultStr='Hello, World!'
After strcmp (3), eax=0
After strcmp (4), eax=-1
After strcmp (5), eax=1
After strchr, rax=', World!'
After strstr, rax='World!'
resultStr length is 13
Listing 4-6 terminated

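Of course, you could make a good argument that if all your assembly code does is call a bunch of C Standard Library functions, you should have written your application in C in the first place. Most of the benefits of writing code in assembly language happen only when you "think" in assembly language, not C. In particular, you can dramatically improve the performance of your string function calls if you stop using zero-terminated strings and switch to another string format (such as length-prefixed or descriptor-based strings that include a length component).

In addition to the C Standard Library, you can find lots of x86-64 string functions written in assembly language on the internet. A good place to start is the MASM Forum at masm32.com/board/ (despite the name, this message forum supports 64-bit as well as 32-bit MASM programming). Chapter 14 discusses string functions written in assembly language in greater detail.

4.9 Arrays

Along with strings, arrays are probably the most commonly used composite data type. Yet most beginning programmers don't understand how arrays operate internally or the efficiency trade-offs associated with them. It's surprising how many novice (and even advanced!) programmers view arrays from a completely different perspective once they learn how to deal with arrays at the machine level.

Abstractly, an array is an aggregate data type whose members (elements) are all the same type. Selection of a member from the array is by an integer index.^(9) Different indices select unique elements of the array. This book assumes that the integer indices are contiguous (though this is by no means required). That is, if the number x is a valid index into the array and y is also a valid index, with x < y, then all i such that x < i < y are valid indices.

Whenever you apply the indexing operator to an array, the result is the specific array element chosen by that index. For example, A[i] chooses the ith element from array A. There is no formal requirement that element i be anywhere near element i+1 in memory. As long as A[i] always refers to the same memory location and A[i+1] always refers to its corresponding location (and the two are different), the definition of an array is satisfied.

In this book, we assume that array elements occupy contiguous locations in memory. An array with five elements will appear in memory as shown in Figure 4-1.

Figure 4-1: Array layout in memory

The base address of an array is the address of the array's first element and always appears in the lowest memory location. The second array element directly follows the first in memory, the third element follows the second, and so on. Indices are not required to start at zero. They may start with any number as long as they are contiguous. However, for the purposes of discussion, this book starts all indices at zero.

To access an element of an array, you need a function that translates an array index to the address of the indexed element. For a single-dimensional array, this function is very simple: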

`element_address = base_address + ((index - initial_index) * element_size)`

where initial_index is the value of the first index in the array (which you can ignore if it's zero), and the value element_size is the size, in bytes, of an individual array element.
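The formula translates directly into code. The C sketch below is illustrative only: the function name is hypothetical, and it assumes a flat address space (as on x86-64) in which converting a pointer to uintptr_t yields its byte address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* element_address = base_address + (index - initial_index) * element_size */
uintptr_t element_address(uintptr_t base_address,
                          ptrdiff_t index,
                          ptrdiff_t initial_index,
                          size_t element_size)
{
    assert(index >= initial_index);   /* stay at or above the base */
    return base_address + (uintptr_t)(index - initial_index) * element_size;
}
```

For example, with a base address of 1000, an array whose first index is 1, and 4-byte elements, index 5 lands at 1000 + (5 - 1) * 4 = 1016.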

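4.9.1 Declaring Arrays in Your MASM Programs

Before you can access elements of an array, you need to set aside storage for that array. Fortunately, array declarations build on the declarations you've already seen. To allocate n elements in an array, you would use a declaration like the following in one of the variable declaration sections:

`array_name`  `base_type` n dup (?)

array_name is the name of the array variable, and base_type is the type of an element of that array. This declaration sets aside storage for the array. To obtain the base address of the array, just use array_name.

The n dup (?) operand tells MASM to duplicate the object n times. Now let's look at some specific examples: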

           .data

; Character array with elements 0 to 127.

CharArray  byte 128 dup (?)

; Array of bytes with elements 0 to 9.

ByteArray  byte  10 dup (?)

; Array of double words with elements 0 to 3.

DWArray    dword  4 dup (?)

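These examples all allocate storage for uninitialized arrays. You may also specify that the elements of the arrays be initialized by using declarations like the following in the .data and .const sections: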

RealArray   real4  1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0
IntegerAry  sdword 1, 1, 1, 1, 1, 1, 1, 1

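Both definitions create arrays with eight elements. The first definition initializes each 4-byte real value to 1.0, and the second declaration initializes each 32-bit integer (sdword) element to 1.

If all the array elements have the same initial value, you can save a little work by using the following declarations: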

RealArray   real4  8 dup (1.0)
IntegerAry  sdword 8 dup (1)

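These operand fields tell MASM to make eight copies of the value inside the parentheses. In past examples, this has always been ? (an uninitialized value). However, you can put an initial value inside the parentheses, and MASM will duplicate that value. In fact, you can put a comma-separated list of values, and MASM will duplicate everything inside the parentheses: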

RealArray   real4  4 dup (1.0, 2.0)
IntegerAry  sdword 4 dup (1, 2)

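These two examples also create eight-element arrays. Their initial values are 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, and 1, 2, 1, 2, 1, 2, 1, 2, respectively.

4.9.2 Accessing Elements of a Single-Dimensional Array

To access an element of a zero-based array, you can use this formula:

`element_address = base_address + index * element_size`

If you are operating in LARGEADDRESSAWARE:NO mode, for the base_address entry you can use the name of the array (because MASM associates the address of the first element of an array with the name of that array). If you are operating in a large address mode, you'll need to load the base address of the array into a 64-bit (base) register; for example: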

lea rbx, `base_address`

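The element_size entry is the number of bytes for each array element. If the object is an array of bytes, the element_size field is 1 (resulting in a very simple computation). If each element of the array is a word (or other 2-byte type), then element_size is 2, and so on. To access an element of the IntegerAry array from the previous section, you'd use the following formula (the size is 4 because each element is an sdword object):

`element_address = IntegerAry + (index * 4)`

Assuming LARGEADDRESSAWARE:NO, the x86-64 code equivalent to the statement eax = IntegerAry[index] is as follows: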

mov rbx, index
mov eax, IntegerAry[rbx*4]

In large address mode (LARGEADDRESSAWARE:YES), you'd have to load the address of the array into a base register; for example:

lea rdx, IntegerAry
mov rbx, index
mov eax, [rdx + rbx*4]

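These instruction sequences don't explicitly multiply the index register (RBX) by 4 (the size of a 32-bit integer element in IntegerAry). Instead, they use the scaled-indexed addressing mode to perform the multiplication.

Another thing to note is that these instruction sequences do not explicitly compute the sum of the base address plus the index times 4. Instead, they rely on the scaled-indexed addressing mode to implicitly compute this sum. The instruction mov eax, IntegerAry[rbx*4] loads EAX from location IntegerAry + rbx*4, which is the base address plus index*4 (because RBX contains index). Similarly, mov eax, [rdx+rbx*4] computes this same sum as part of the addressing mode. Sure, you could have used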

lea rax, IntegerAry
mov rbx, index
shl rbx, 2     ; Sneaky way to compute 4 * RBX
add rbx, rax   ; Compute base address plus index * 4
mov eax, [rbx]

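in place of the previous sequences, but why use five instructions when two or three will do the same job? This is a good example of why you should know your addressing modes inside and out. Choosing the proper addressing mode can reduce the size of your program, thereby speeding it up.

However, if you need to multiply by a constant other than 1, 2, 4, or 8, you cannot use the scaled-indexed addressing modes. Similarly, if you need to multiply by an element size that is not a power of 2, you will not be able to use the shl instruction to multiply the index by the element size; instead, you will have to use imul or another instruction sequence to do the multiplication.

The indexed addressing mode on the x86-64 is a natural for accessing elements of a single-dimensional array. Indeed, its syntax even suggests an array access. The important thing to keep in mind is that you must remember to multiply the index by the size of an element. Failure to do so will produce incorrect results.

The examples appearing in this section assume that the index variable is a 64-bit value. In reality, integer indices into arrays are generally 32-bit integers or 32-bit unsigned integers. Therefore, you'd typically use the following instruction to load the index value into RBX: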

mov ebx, index  ; Zero-extends into RBX

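Because loading a 32-bit value into a general-purpose register automatically zero-extends that register to 64 bits, the former instruction sequences (which expect a 64-bit index value) will still work properly when you're using 32-bit integers as indices into an array, provided the index is non-negative.

4.9.3 Sorting an Array of Values

Almost every textbook on this planet gives an example of a sort when introducing arrays. Because you've probably seen how to do a sort in high-level languages already, it's instructive to take a quick look at a sort in MASM. Listing 4-7 uses a variant of the bubble sort, which is great for short lists of data and lists that are nearly sorted, but horrible for just about everything else.^(10)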

; Listing 4-7

; A simple bubble sort example.

; Note: This example must be assembled
; and linked with LARGEADDRESSAWARE:NO.

        option  casemap:none

nl      =       10
maxLen  =       256
true    =       1
false   =       0

bool    typedef ptr byte

        .const
ttlStr  byte    "Listing 4-7", 0
fmtStr  byte    "Sortme[%d] = %d", nl, 0

        .data

; sortMe - A 16-element array to sort:

sortMe  label   dword
        dword   1, 2, 16, 14
        dword   3, 9, 4,  10
        dword   5, 7, 15, 12
        dword   8, 6, 11, 13
sortSize = ($ - sortMe) / sizeof dword    ; Number of elements

; didSwap - A Boolean value that indicates
;          whether a swap occurred on the
;          last loop iteration.

didSwap bool    ?

        .code
        externdef printf:proc

; Return program title to C++ program:

         public getTitle
getTitle proc
         lea rax, ttlStr
         ret
getTitle endp

; Here's the bubblesort function.

;       sort(dword *array, qword count);

; Note: this is not an external (C)
; function, nor does it call any
; external functions. So it will
; dispense with some of the Windows
; calling sequence stuff.

; array - Address passed in RCX.
; count - Element count passed in RDX.

sort    proc
        push    rax             ; In pure assembly language
        push    rbx             ; it's always a good idea
        push    rcx             ; to preserve all registers
        push    rdx             ; you modify
        push    r8

        dec     rdx             ; numElements - 1

; Outer loop:

outer:  mov     didSwap, false

        xor     rbx, rbx        ; RBX = 0
inner:  cmp     rbx, rdx        ; while RBX < count - 1
        jnb     xInner

        mov     eax, [rcx + rbx*4]      ; EAX = sortMe[RBX]
        cmp     eax, [rcx + rbx*4 + 4]  ; If EAX > sortMe[RBX + 1]
        jna     dontSwap                ; then swap

        ; sortMe[RBX] > sortMe[RBX + 1], so swap elements:

        mov     r8d, [rcx + rbx*4 + 4]
        mov     [rcx + rbx*4 + 4], eax
        mov     [rcx + rbx*4], r8d
        mov     didSwap, true

dontSwap:
        inc     rbx                     ; Next loop iteration
        jmp     inner

; Exited from inner loop, test for repeat
; of outer loop:

xInner: cmp     didSwap, true
        je      outer

        pop     r8
        pop     rdx
        pop     rcx
        pop     rbx
        pop     rax
        ret
sort    endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc
        push    rbx

; "Magic" instruction offered without
; explanation at this point:

        sub     rsp, 40

; Sort the "sortMe" array:

        lea     rcx, sortMe
        mov     rdx, sortSize           ; 16 elements in array
        call    sort

; Display the sorted array:

        xor     rbx, rbx
dispLp: mov     r8d, sortMe[rbx*4]
        mov     rdx, rbx
        lea     rcx, fmtStr
        call    printf

        inc     rbx
        cmp     rbx, sortSize
        jb      dispLp

        add     rsp, 40
        pop     rbx
        ret     ; Returns to caller
asmMain endp
        end

Listing 4-7: A simple bubble sort example

Here are the commands to assemble and run this sample code:

C:\>**sbuild listing4-7**

C:\>**echo off**
 Assembling: listing4-7.asm
c.cpp

C:\>**listing4-7**
Calling Listing 4-7:
Sortme[0] = 1
Sortme[1] = 2
Sortme[2] = 3
Sortme[3] = 4
Sortme[4] = 5
Sortme[5] = 6
Sortme[6] = 7
Sortme[7] = 8
Sortme[8] = 9
Sortme[9] = 10
Sortme[10] = 11
Sortme[11] = 12
Sortme[12] = 13
Sortme[13] = 14
Sortme[14] = 15
Sortme[15] = 16
Listing 4-7 terminated

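The bubble sort works by comparing adjacent elements in an array. The cmp instruction (the one carrying the comment ; If EAX > sortMe[RBX + 1]) compares EAX (which contains sortMe[rbx*4]) against sortMe[rbx*4 + 4]. Because each element of this array is 4 bytes (dword), the index [rbx*4 + 4] references the next element beyond [rbx*4].

As is typical for a bubble sort, this algorithm terminates if the innermost loop finishes without swapping any data. If the data is already presorted, the bubble sort is very efficient, making only one pass over the data. Unfortunately, if the data is not sorted (worst case, if the data is sorted in reverse order), then this algorithm is extremely inefficient. However, the bubble sort is easy to implement and understand (which is why introductory texts continue to use it in examples).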
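For comparison, here is a C sketch of the same algorithm as the sort procedure in Listing 4-7, with the outer loop driven by the didSwap flag. The early return for arrays of fewer than two elements is an addition for safety; the assembly version assumes a non-empty array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bubble sort with early exit: stop after a pass that swaps nothing. */
void sort(uint32_t *array, size_t count)
{
    if (count < 2) {
        return;                 /* nothing to compare (added guard) */
    }
    bool didSwap;
    do {
        didSwap = false;                         /* mov didSwap, false */
        for (size_t i = 0; i < count - 1; i++) { /* inner loop on RBX  */
            if (array[i] > array[i + 1]) {       /* unsigned, like jna */
                uint32_t tmp = array[i + 1];     /* swap the pair      */
                array[i + 1] = array[i];
                array[i] = tmp;
                didSwap = true;
            }
        }
    } while (didSwap);                           /* cmp didSwap, true  */
}
```

On already-sorted input the do/while body runs exactly once, which is the single pass described above.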

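4.10 Multidimensional Arrays

The x86-64 hardware can easily handle single-dimensional arrays. Unfortunately, there is no magic addressing mode that lets you easily access elements of multidimensional arrays. That's going to take some work and several instructions.

Before discussing how to declare or access multidimensional arrays, it would be a good idea to figure out how to implement them in memory. The first problem is to figure out how to store a multidimensional object in a one-dimensional memory space.

Consider for a moment a Pascal array of the form A:array[0..3,0..3] of char;. This array contains 16 bytes organized as four rows of four characters. Somehow, you have to draw a correspondence between each of the 16 bytes in this array and 16 contiguous bytes in main memory. Figure 4-2 shows one way to do this.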

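Figure 4-2: Mapping a 4×4 array to sequential memory locations

The actual mapping is not important as long as two things occur: (1) each element maps to a unique memory location (that is, no two entries in the array occupy the same memory locations), and (2) the mapping is consistent (that is, a given element in the array always maps to the same memory location). So, what you really need is a function with two input parameters (row and column) that produces an offset into a linear array of 16 memory locations.

Now, any function that satisfies these constraints will work fine. Indeed, you could randomly choose a mapping, as long as it was consistent. However, what you really want is a mapping that is efficient to compute at runtime and works for any size array (not just 4×4 or even limited to two dimensions). While a large number of possible functions fit this bill, two functions in particular are used by most programmers and high-level languages: row-major ordering and column-major ordering.

4.10.1 Row-Major Ordering

Row-major ordering assigns successive elements, moving across the rows and then down the columns, to successive memory locations. This mapping is demonstrated in Figure 4-3.

Figure 4-3: Row-major array element ordering

Row-major ordering is the method most high-level programming languages employ. It is easy to implement and use in machine language. You start with the first row (row 0) and then concatenate the second row to its end. You then concatenate the third row to the end of the list, then the fourth row, and so on (see Figure 4-4).

Figure 4-4: Another view of row-major ordering for a 4×4 array

The actual function that converts a list of index values into an offset is a slight modification of the formula for computing the address of an element of a single-dimensional array. The formula to compute the offset for a two-dimensional row-major ordered array is as follows: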

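`element_address = base_address + (col_index * row_size + row_index) * element_size`

As usual, base_address is the address of the first element of the array (A[0][0] in this case), and element_size is the size of an individual element of the array, in bytes. col_index is the leftmost index, and row_index is the rightmost index into the array. row_size is the number of elements in one row of the array (4, in this case, because each row has four elements). Assuming element_size is 1, this formula computes the following offsets from the base address: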

Column          Row             Offset
Index           Index           into Array
0               0               0
0               1               1
0               2               2
0               3               3
1               0               4
1               1               5
1               2               6
1               3               7
2               0               8
2               1               9
2               2               10
2               3               11
3               0               12
3               1               13
3               2               14
3               3               15
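The table can be reproduced with a one-line function. The C sketch below uses the same parameter names as the formula (col_index is the leftmost index, row_index the rightmost); the function name row_major_offset is hypothetical. Because C itself stores arrays in row-major order, the result can be checked against a real two-dimensional array.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of element [col_index][row_index] in a row-major array
   whose rows each hold row_size elements of element_size bytes. */
size_t row_major_offset(size_t col_index, size_t row_index,
                        size_t row_size, size_t element_size)
{
    assert(row_index < row_size);
    return (col_index * row_size + row_index) * element_size;
}
```

For the 4×4 character array, indices (2, 1) give (2 * 4 + 1) * 1 = 9, matching the corresponding row of the table.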

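For a three-dimensional array, the formula to compute the offset into memory is the following:

`Address = Base + ((depth_index * col_size + col_index) * row_size + row_index) * element_size`

col_size is the number of items in a column, and row_size is the number of items in a row. In C/C++, if you've declared the array as type A[i][j][k];, then row_size is equal to k and col_size is equal to j.

For a four-dimensional array, declared in C/C++ as type A[i][j][k][m];, the formula for computing the address of an array element is shown here:

`Address = Base + (((left_index * depth_size + depth_index) * col_size + col_index) * row_size + row_index) * element_size`

depth_size is equal to j, col_size is equal to k, and row_size is equal to m. left_index represents the value of the leftmost index.

By now you're probably beginning to see a pattern. There is a generic formula that will compute the offset into memory for an array with any number of dimensions; however, you'll rarely use more than four.

Another convenient way to think of row-major arrays is as arrays of arrays. Consider the following single-dimensional Pascal array definition: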

A: array [0..3] of sometype;

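where sometype is the type sometype = array [0..3] of char;.

A is a single-dimensional array. Its individual elements happen to be arrays, but you can safely ignore that for the time being. The formula to compute the address of an element of a single-dimensional array is as follows:

`element_address = Base + index * element_size`

In this case, element_size happens to be 4 because each element of A is an array of four characters. So, this formula computes the base address of each row in this 4×4 array of characters (see Figure 4-5).

Figure 4-5: Viewing a 4×4 array as an array of arrays

Of course, once you compute the base address of a row, you can reapply the single-dimensional formula to get the address of a particular element. While this doesn't affect the computation, it's probably a little easier to deal with several single-dimensional computations rather than a complex multidimensional array computation.

Consider a Pascal array defined as A:array [0..3, 0..3, 0..3, 0..3, 0..3] of char;. You can view this five-dimensional array as a single-dimensional array of arrays. The following Pascal code provides such a definition: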

type
 OneD   = array[0..3] of char;
 TwoD   = array[0..3] of OneD;
 ThreeD = array[0..3] of TwoD;
 FourD  = array[0..3] of ThreeD;
var
 A: array[0..3] of FourD;

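The size of OneD is 4 bytes. Because TwoD contains four OneD arrays, its size is 16 bytes. Likewise, ThreeD is four TwoDs, so it is 64 bytes long. Finally, FourD is four ThreeDs, so it is 256 bytes long. To compute the address of A[b, c, d, e, f], you could use the following steps:

Compute the address of A[b] by using the formula Base + b * size. Here size is 256 bytes. Use this result as the new base address in the next computation.
Compute the address of A[b, c] by using the formula Base + c * size, where Base is the value obtained in the previous step and size is 64. Use the result as the new base in the next computation.
Compute the base address of A[b, c, d] by using the formula Base + d * size, where Base comes from the previous computation and size is 16. Use the result as the new base in the next computation.
Compute the address of A[b, c, d, e] by using the formula Base + e * size, where Base comes from the previous computation and size is 4. Use this value as the base for the next computation.
Finally, compute the address of A[b, c, d, e, f] by using the formula Base + f * size, where Base comes from the previous computation and size is 1 (obviously, you can ignore this final multiplication). The result you obtain at this point is the address of the desired element.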
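The five steps above can be sketched in C as a running offset from the array's base, one single-dimensional computation per step. The function name offset_5d is hypothetical, and the sizes are hard-coded for the 4×4×4×4×4 character array under discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of A[b, c, d, e, f] from the base of a 4x4x4x4x4 char array,
   computed as five successive one-dimensional steps. */
size_t offset_5d(size_t b, size_t c, size_t d, size_t e, size_t f)
{
    assert(b < 4 && c < 4 && d < 4 && e < 4 && f < 4);
    size_t off = 0;
    off += b * 256;   /* step 1: each FourD is 256 bytes  */
    off += c * 64;    /* step 2: each ThreeD is 64 bytes  */
    off += d * 16;    /* step 3: each TwoD is 16 bytes    */
    off += e * 4;     /* step 4: each OneD is 4 bytes     */
    off += f * 1;     /* step 5: each char is 1 byte      */
    return off;
}
```

For example, A[1, 2, 3, 0, 1] lies at offset 256 + 128 + 48 + 0 + 1 = 433 from the base address, and the last element A[3, 3, 3, 3, 3] lies at offset 1023.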

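One of the main reasons you won't find higher-dimensional arrays in assembly language is that assembly language emphasizes the inefficiencies associated with such access. It's easy to enter something like A[b, c, d, e, f] into a Pascal program, not realizing what the compiler is doing with the code. Assembly language programmers are not so cavalier; they see the mess you wind up with when you use higher-dimensional arrays. Indeed, good assembly language programmers try to avoid two-dimensional arrays and often resort to tricks in order to access data in such an array when its use becomes absolutely mandatory.

4.10.2 Column-Major Ordering

Column-major ordering is the other function high-level languages frequently use to compute the address of an array element. FORTRAN and various dialects of BASIC (for example, older versions of Microsoft BASIC) use this method.

In row-major ordering, the rightmost index increases the fastest as you move through consecutive memory locations. In column-major ordering, the leftmost index increases the fastest. Pictorially, a column-major ordered array is organized as shown in Figure 4-6.

The formula for computing the address of an array element when using column-major ordering is similar to that for row-major ordering. You reverse the indices and sizes in the computation.

Figure 4-6: Column-major array element ordering

For a two-dimensional column-major array:

`element_address = base_address + (row_index * col_size + col_index) * element_size`

For a three-dimensional column-major array:

`Address = Base + ((row_index * col_size + col_index) * depth_size + depth_index) * element_size`

For a four-dimensional column-major array:

`Address = Base + (((row_index * col_size + col_index) * depth_size + depth_index) * left_size + left_index) * element_size`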
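A C sketch of the two-dimensional column-major formula follows. It keeps this chapter's naming, in which col_index is the leftmost index and row_index the rightmost, so col_size is the number of elements along the leftmost dimension. The function name col_major_offset is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of element [col_index][row_index] in a column-major array.
   col_size is the extent of the leftmost dimension. */
size_t col_major_offset(size_t col_index, size_t row_index,
                        size_t col_size, size_t element_size)
{
    assert(col_index < col_size);
    return (row_index * col_size + col_index) * element_size;
}
```

In a 4×4 byte array, stepping the leftmost index from 0 to 1 moves the offset by 1, while stepping the rightmost index moves it by 4, which is the reverse of the row-major behavior.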

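4.10.3 Allocating Storage for Multidimensional Arrays

If you have an m × n array, it will have m × n elements and require m × n × element_size bytes of storage. To allocate storage for an array, you must reserve this memory. As usual, there are several ways of accomplishing this task. To declare a multidimensional array in MASM, you could use a declaration like the following:

`array_name` `element_type` `size1`*`size2`*`size3`*...*`sizen` dup (?)

where size1 to sizen are the sizes of each of the dimensions of the array.

For example, here is a declaration for a 4×4 array of characters: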

GameGrid byte 4*4 dup (?)

Here is another example that shows how to declare a three-dimensional array of strings (assuming the array holds 64-bit pointers to the strings):

NameItems qword 2 * 3 * 3 dup (?)

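As with single-dimensional arrays, you may initialize every element of the array to a specific value by following the declaration with the values of the array constant. Array constants ignore dimension information; all that matters is that the number of elements in the array constant corresponds to the number of elements in the actual array. The following example shows the GameGrid declaration with an initializer: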

GameGrid byte 'a', 'b', 'c', 'd'
         byte 'e', 'f', 'g', 'h'
         byte 'i', 'j', 'k', 'l'
         byte 'm', 'n', 'o', 'p'

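This example is laid out to enhance readability (which is always a good idea). MASM does not interpret the four separate lines as rows of data in the array; humans do, which is why it's good to write the data this way. All that matters is that there are 16 (4 × 4) characters in the array constant. You'll probably agree that this is much easier to read than the following: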

GameGrid byte  'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p'

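Of course, if you have a large array, an array with really long rows, or an array with many dimensions, there is little hope of winding up with something readable. That's when comments that carefully explain everything come in handy.

As with single-dimensional arrays, you can use the dup operator to initialize each element of a large array with the same value. The following example initializes a 256×64 array of bytes so that each byte contains the value 0FFh: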

StateValue byte 256*64 dup (0FFh)

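Using a constant expression to compute the number of array elements, rather than the literal constant 16,384 (256 × 64), makes it clearer that this code is initializing a 256×64 element array.

Another MASM trick you can use to improve the readability of your programs is the nested dup declaration. Here is an example: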

StateValue byte 256 dup (64 dup (0FFh))

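MASM replicates anything inside the parentheses the number of times specified by the constant preceding the dup operator; this includes nested dup declarations. This example says, "Duplicate the stuff inside the parentheses 256 times." Inside the parentheses, there is a dup operator that says, "Duplicate 0FFh 64 times," so the outer dup operator duplicates the 64 copies of 0FFh 256 times.

It is probably a good programming convention to declare multidimensional arrays by using the "dup of dup (... of dup)" syntax. This makes it clearer that you're creating a multidimensional array rather than a single-dimensional array with a large number of elements.

4.10.4 Accessing Multidimensional Array Elements in Assembly Language

You've seen the formulas for computing the address of a multidimensional array element. Now it's time to see how to access elements of those arrays by using assembly language.

The mov, shl, and imul instructions make short work of the various equations that compute offsets into multidimensional arrays. Let's consider a two-dimensional array first: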

 .data
i        sdword  ?
j        sdword  ?
TwoD     sdword  4 dup (8 dup (?))

           .
           .
           .

; To perform the operation TwoD[i,j] := 5;
; you'd use code like the following.
; Note that the array index computation is (i*8 + j)*4.

          mov ebx, i   ; Remember, zero-extends into RBX
          shl rbx, 3   ; Multiply by 8
          add ebx, j   ; Also zero-extends result into RBX^(11)
          mov TwoD[rbx*4], 5

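Note that this code does not require a two-register addressing mode on the x86-64 (at least, not when using the LARGEADDRESSAWARE:NO option). Although an addressing mode like TwoD[rbx][rsi] looks like a natural fit for accessing two-dimensional arrays, that isn't the purpose of this addressing mode.

Now consider a second example that uses a three-dimensional array (again, assuming LARGEADDRESSAWARE:NO):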

 .data
i       dword  ?
j       dword  ?
k       dword  ?
ThreeD  sdword 3 dup (4 dup (5 dup (?)))
          .
          .
          .

; To perform the operation ThreeD[i,j,k] := ESI;
; you'd use the following code that computes
; ((i*4 + j)*5 + k)*4 as the address of ThreeD[i,j,k].

          mov  ebx, i   ; Zero-extends into RBX
          shl  ebx, 2   ; Four elements per column
          add  ebx, j
          imul ebx, 5   ; Five elements per row
          add  ebx, k
          mov  ThreeD[rbx*4], esi

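This code uses the imul instruction to multiply the value in RBX by 5, because the shl instruction can multiply a register only by a power of 2. Although there are ways to multiply the value in a register by a constant other than a power of 2, the imul instruction is more convenient.^(12) Also remember that operations on the 32-bit general-purpose registers automatically zero-extend their result into the 64-bit register.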
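
As a quick sanity check of the index arithmetic in the two code sequences above, the following Python sketch mirrors each instruction as one statement. It is only a model under stated assumptions (nonnegative indexes that stay in range, so no register overflow is modeled); the function names are illustrative.

```python
def two_d_offset(i, j):
    """Byte offset of TwoD[i, j] in a 4x8 array of 4-byte elements."""
    rbx = i << 3      # shl rbx, 3  (multiply by 8)
    rbx = rbx + j     # add ebx, j
    return rbx * 4    # the *4 scale factor in TwoD[rbx*4]


def three_d_offset(i, j, k):
    """Byte offset of ThreeD[i, j, k] in a 3x4x5 array of 4-byte elements."""
    rbx = i << 2      # shl ebx, 2  (multiply by 4)
    rbx = rbx + j     # add ebx, j
    rbx = rbx * 5     # imul ebx, 5
    rbx = rbx + k     # add ebx, k
    return rbx * 4    # the *4 scale factor in ThreeD[rbx*4]


if __name__ == "__main__":
    # The last element of each array should sit 4 bytes before the end.
    print(two_d_offset(3, 7), 4 * 8 * 4 - 4)
    print(three_d_offset(2, 3, 4), 3 * 4 * 5 * 4 - 4)
```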

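4.11 Records/Structs

Another major composite data structure is the Pascal record or C/C++/C# structure.^(13) The Pascal terminology is probably better, because it tends to avoid confusion with the more general term data structure. However, MASM uses the term struct, so this book does too.

Whereas an array is homogeneous, with elements that are all the same type, the elements in a struct can have different types. Arrays let you select a particular element via an integer index. With structs, you must select an element (known as a field) by name.

The whole purpose of a structure is to let you encapsulate different, though logically related, data into a single package. The Pascal record declaration for a student is a typical example: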

student = 
     record
          Name:     string[64];
          Major:    integer;
          SSN:      string[11];
          Midterm1: integer;
          Midterm2: integer;
          Final:    integer;
          Homework: integer;
          Projects: integer;
     end;

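Most Pascal compilers allocate each field in a record to contiguous memory locations. This means that Pascal will reserve the first 65 bytes for the name,^(14) the next 2 bytes for the major code (assuming a 16-bit integer), the next 12 bytes for the Social Security number, and so on.

4.11.1 MASM Struct Declarations

In MASM, you can create record types by using the struct/ends declaration. You would encode the preceding record as follows: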

student  struct
sName    byte    65 dup (?)  ; "Name" is a MASM reserved word
Major    word    ?
SSN      byte    12 dup (?)
Midterm1 word    ?
Midterm2 word    ?
Final    word    ?
Homework word    ?
Projects word    ?
student  ends

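As you can see, the MASM declaration is similar to the Pascal declaration. To be true to the Pascal declaration, this example uses character arrays rather than strings for the sName and SSN (US Social Security number) fields. Also, the MASM declaration assumes that integers are unsigned 16-bit values (which is probably appropriate for this type of data structure).

The field names within the struct must be unique; the same name may not appear two or more times in the same record. However, all field names are local to that record. Therefore, you may reuse those field names elsewhere in the program or in different records.

The struct/ends declaration may appear anywhere in the source file as long as you define it before you use it. A struct declaration does not actually allocate any storage for a student variable. Instead, you have to explicitly declare a variable of type student. The following example demonstrates how to do this: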

 .data
John    student  {}

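The odd-looking operand ({}) is a MASM idiosyncrasy that you simply have to remember.

The John variable declaration allocates 89 bytes of storage, laid out as shown in Figure 4-7.

Figure 4-7: Student data structure storage in memory

If the label John corresponds to the base address of this record, the sName field is at offset John + 0, the Major field is at offset John + 65, the SSN field is at offset John + 67, and so on.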
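
The offsets and the 89-byte total can be reproduced with a small Python sketch that simply places each field immediately after the previous one. This is a model of the layout described above, not MASM output; the helper name packed_layout and the STUDENT list are illustrative assumptions.

```python
def packed_layout(fields):
    """Return ({name: offset}, total_size) for fields stored back to back."""
    offsets = {}
    offset = 0
    for name, size in fields:
        offsets[name] = offset
        offset += size
    return offsets, offset


# (field name, size in bytes), in declaration order.
STUDENT = [
    ("sName", 65), ("Major", 2), ("SSN", 12),
    ("Midterm1", 2), ("Midterm2", 2), ("Final", 2),
    ("Homework", 2), ("Projects", 2),
]

if __name__ == "__main__":
    offsets, size = packed_layout(STUDENT)
    for name, offset in offsets.items():
        print(f"{name:9} John + {offset}")
    print("total size:", size)
```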

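4.11.2 Accessing Record/Struct Fields

To access an element of a structure, you need to know the offset from the beginning of the structure to the desired field. For example, the Major field in the variable John is at offset 65 from the base address of John. Therefore, you could store the value in AX into this field by using this instruction: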

mov word ptr John[65], ax

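Unfortunately, memorizing all the offsets to fields in a struct defeats the purpose of using them in the first place. After all, if you have to deal with these numeric offsets, why not just use an array of bytes instead of a struct?

Fortunately, MASM lets you refer to field names in a record by using the same mechanism most HLLs use: the dot operator. To store AX into the Major field, you could use mov John.Major, ax instead of the previous instruction. This is much more readable and certainly easier to use.

The use of the dot operator does not introduce a new addressing mode. The instruction mov John.Major, ax still uses the PC-relative addressing mode. MASM simply adds the base address of John to the offset of the Major field (65) to get the actual displacement to encode into the instruction.

The dot operator works quite well when dealing with struct variables you declare in one of the static sections (.data, .const, or .data?) and access via the PC-relative addressing mode. However, what happens when you have a pointer to a record object? Consider the following code fragment: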

mov  rcx, sizeof student  ; Size of student struct
call malloc               ; Returns pointer in RAX
mov [rax].Final, 100

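Unfortunately, the Final field name is local to the student structure. As a result, MASM will complain that the name Final is undefined in this code sequence. To get around this problem, you add the structure name to the dot-operator name list when using pointer references. Here's the correct form of the preceding code: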

mov  rcx, sizeof student  ; Size of student struct
call malloc
mov [rax].student.Final, 100

4.11.3 Nesting MASM Structs

MASM allows you to define fields of a structure that are themselves structure types. Consider the following two struct declarations:

grades    struct
Midterm1  word  ?
Midterm2  word  ?
Final     word  ?
Homework  word  ?
Projects  word  ?
grades    ends

student   struct
sName     byte  65 dup (?)  ; "Name" is a MASM reserved word
Major     word  ?
SSN       byte  12 dup (?)
sGrades   grades {}
student   ends

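The sGrades field now holds all the individual grade fields that were formerly separate fields of the student structure. Note that this particular example has the same memory layout as the previous one (see Figure 4-7). The grades structure itself doesn't add any new data; it simply organizes the grade fields under its own substructure.

To access the subfields, you use the same syntax you'd use with C/C++ (and most other HLLs supporting records/structures). If the John variable declaration appearing in previous sections were of this new struct type, you'd access the Homework field by using a statement such as the following: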

mov ax, John.sGrades.Homework

4.11.4 Initializing Struct Fields

A typical structure declaration such as the following:

 .data
structVar  structType  {}

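leaves all fields in structType uninitialized (similar to the effect of using the ? operand in other variable declarations). MASM allows you to provide initial values for all the fields of a structure by supplying a comma-separated list of items between the braces in the operand field of a structure variable declaration, as shown in Listing 4-8.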

; Listing 4-8

; Sample struct initialization example.

         option  casemap:none

nl       =       10

         .const
ttlStr   byte    "Listing 4-8", 0
fmtStr   byte    "aString: maxLen:%d, len:%d, string data:'%s'"
         byte    nl, 0

; Define a struct for a string descriptor:

strDesc  struct
maxLen   dword   ?
len      dword   ?
strPtr   qword   ?
strDesc  ends

         .data

; Here's the string data we will initialize the
; string descriptor with:

charData byte   "Initial String Data", 0
len      =      lengthof charData ; Includes zero byte

; Create a string descriptor initialized with
; the charData string value:

aString  strDesc {len, len, offset charData}

        .code
        externdef printf:proc

; Return program title to C++ program:

         public getTitle
getTitle proc
         lea rax, ttlStr
         ret
getTitle endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc

; "Magic" instruction offered without
; explanation at this point:

        sub     rsp, 48

; Display the fields of the string descriptor.

        lea     rcx, fmtStr
        mov     edx, aString.maxLen ; Zero-extends!
        mov     r8d, aString.len    ; Zero-extends!
        mov     r9,  aString.strPtr
        call    printf

        add     rsp, 48 ; Restore RSP
        ret             ; Returns to caller
asmMain endp
        end

Listing 4-8: Initializing the fields of a structure

Here are the build command and output for Listing 4-8:

C:\>**build listing4-8**

C:\>**echo off**
 Assembling: listing4-8.asm
c.cpp

C:\>**listing4-8**
Calling Listing 4-8:
aString: maxLen:20, len:20, string data:'Initial String Data'
Listing 4-8 terminated

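If a structure field is an array object, you'll need special syntax to initialize that array data. Consider the following structure definition: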

aryStruct struct
aryField1 byte    8 dup (?)
aryField2 word    4 dup (?)
aryStruct ends

The initialization operand must be either a string or a single item. Therefore, the following is not legal:

a aryStruct {1,2,3,4,5,6,7,8,  1,2,3,4}

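This is (presumably) an attempt to initialize aryField1 with {1,2,3,4,5,6,7,8} and aryField2 with {1,2,3,4}. MASM, however, won't accept this. MASM wants only two values in the operand field (one for aryField1 and one for aryField2). The solution is to put the array constants for the two arrays in their own sets of braces: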

a aryStruct {{1,2,3,4,5,6,7,8}, {1,2,3,4}}

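If you supply too many initializers for a given array field, MASM will report an error. If you supply too few initializers, MASM will quietly fill in the remaining array entries with 0 values: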

a aryStruct {{1,2,3,4}, {1,2,3,4}}

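This example initializes a.aryField1 with {1,2,3,4,0,0,0,0} and initializes a.aryField2 with {1,2,3,4}.

If the field is an array of bytes, you can substitute a character string (with no more characters than the array size) for the list of byte values: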

b aryStruct {"abcdefgh", {1,2,3,4}}

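If you supply too few characters, MASM will fill out the rest of the byte array with 0 bytes; too many characters produce an error.

4.11.5 Arrays of Structs

It is a perfectly reasonable operation to create an array of structures. To do so, you create a struct type and then use the standard array declaration syntax. The following example demonstrates how you could do this: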

recElement struct
   `Fields for this record` 
recElement ends
            .
            .
            .
           .data
recArray   recElement 4 dup ({})

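To access an element of this array, you use the standard array-indexing techniques. Because recArray is a single-dimensional array, you'd compute the address of an element of this array by using the formula base_address + index * sizeof(recElement). For example, to access an element of recArray, you'd use code like the following:

; Access element i of recArray:
; RBX := i*sizeof(recElement)

   imul ebx, i, sizeof recElement     ; Zero-extends EBX to RBX!
   mov  eax, recArray.someField[rbx]  ; LARGEADDRESSAWARE:NO!

The index specification follows the entire variable name; remember, this is assembly language, not a high-level language (in a high-level language, you'd probably use recArray[i].someField).

Naturally, you can create multidimensional arrays of records as well. You would use the row-major or column-major order functions to compute the address of an element within such an array. The only thing that really changes (from the discussion of arrays) is that the size of each element is the size of the record object: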

 .data
rec2D   recElement 4 dup (6 dup ({}))
          .
          .
          .
; Access element [i,j] of rec2D and load someField into EAX:

     imul ebx, i, 6
     add  ebx, j
     imul ebx, sizeof recElement
     lea  rcx, rec2D  ; To avoid requiring LARGEADDRESS...
     mov  eax, [rcx].recElement.someField[rbx*1]
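
The address arithmetic behind both record-array accesses can be sketched in Python as follows. The source does not give the fields of recElement, so the record size (16 bytes) and the offset of someField (4) used in the demonstration are purely hypothetical; the function names are illustrative as well.

```python
def rec_array_field_address(base, i, record_size, field_offset):
    """Address of recArray[i].someField in a one-dimensional array."""
    return base + i * record_size + field_offset


def rec_2d_field_address(base, i, j, columns, record_size, field_offset):
    """Address of rec2D[i, j].someField using row-major ordering."""
    index = i * columns + j    # imul ebx, i, 6 / add ebx, j
    return base + index * record_size + field_offset


if __name__ == "__main__":
    # Hypothetical record: 16 bytes long, someField at offset 4.
    print(hex(rec_array_field_address(0x2000, 3, 16, 4)))
    print(hex(rec_2d_field_address(0x2000, 1, 2, 6, 16, 4)))
```

In both cases the element size that scales the index is the size of the whole record, and the field offset is added once at the end.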

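4.11.6 Aligning Fields Within a Record

To achieve maximum performance in your programs, or to ensure that MASM's structures properly map to records or structures in a high-level language, you will often need to be able to control the alignment of fields within a record. For example, you might want to ensure that a double-word field's offset is a multiple of four. You use the align directive to do this. The following creates a structure with unaligned fields: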

Padded  struct
b       byte    ?
d       dword   ?
b2      byte    ?
b3      byte    ?
w       word    ?
Padded  ends

Here's how MASM organizes this structure's fields in memory:^(15)

 Name                     Size Offset     Type

Padded . . . . . . . . . . . . . 00000009
  b  . . . . . . . . . . . . . .         00000000        byte
  d  . . . . . . . . . . . . . .         00000001        dword
  b2 . . . . . . . . . . . . . .         00000005        byte
  b3 . . . . . . . . . . . . . .         00000006        byte
  w  . . . . . . . . . . . . . .         00000007        word

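As you can see from this example, the d and w fields both sit at odd offsets, which may result in slower performance. Ideally, you would like d to be aligned on a double-word offset (a multiple of four) and w to be aligned on an even offset.

You can fix this problem by adding align directives to the structure, as follows: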

Padded  struct
b       byte    ?
        align   4
d       dword   ?
b2      byte    ?
b3      byte    ?
        align   2
w       word    ?
Padded  ends

Now, MASM uses the following offsets for each of these fields:

Padded . . . . . . . . . . . . .         0000000C
  b  . . . . . . . . . . . . . .         00000000        byte
  d  . . . . . . . . . . . . . .         00000004        dword
  b2 . . . . . . . . . . . . . .         00000008        byte
  b3 . . . . . . . . . . . . . .         00000009        byte
  w  . . . . . . . . . . . . . .         0000000A        word

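As you can see, d is now aligned on a 4-byte offset, and w is aligned on an even offset.

MASM provides one additional option that lets you automatically align objects in a struct declaration. If you supply a value (which must be 1, 2, 4, 8, or 16) as the operand to the struct statement, MASM will automatically align all fields in the structure to an offset that is a multiple of that field's size or of the value you specify as the operand, whichever is smaller. Consider the following example: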

Padded  struct  4
b       byte    ?
d       dword   ?
b2      byte    ?
b3      byte    ?
w       word    ?
Padded  ends

Here's the alignment MASM produces for this structure:

Padded . . . . . . . . . . . . .         0000000C
  b  . . . . . . . . . . . . . .         00000000        byte
  d  . . . . . . . . . . . . . .         00000004        dword
  b2 . . . . . . . . . . . . . .         00000008        byte
  b3 . . . . . . . . . . . . . .         00000009        byte
  w  . . . . . . . . . . . . . .         0000000A        word

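Note that MASM properly aligns d on a dword boundary and w on a word boundary (within the structure). Also note that w is not aligned on a dword boundary (even though the struct operand was 4). This is because MASM uses the smaller of the operand or the field's size as the alignment value (and w's size is 2).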
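
The "smaller of the operand or the field's size" rule can be modeled with a short Python sketch. It is a simplified model, not MASM itself: it assumes every field is a scalar whose size is a power of 2, and it does not model any padding that might follow the last field. The names aligned_layout and PADDED are illustrative.

```python
def aligned_layout(fields, max_align):
    """Return ({name: offset}, end_offset) for scalar fields.

    Each field is aligned to min(field size, max_align).
    """
    offsets = {}
    offset = 0
    for name, size in fields:
        align = min(size, max_align)
        # Round the running offset up to the next multiple of align.
        offset = (offset + align - 1) // align * align
        offsets[name] = offset
        offset += size
    return offsets, offset


PADDED = [("b", 1), ("d", 4), ("b2", 1), ("b3", 1), ("w", 2)]

if __name__ == "__main__":
    print(aligned_layout(PADDED, 1))  # no alignment: matches the first listing
    print(aligned_layout(PADDED, 4))  # struct 4: matches the last listing
```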

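4.12 Unions

A record/struct definition assigns different offsets to each field in the record according to the size of those fields. This behavior is quite similar to the allocation of memory offsets in a .data?, .data, or .const section. MASM provides a second type of structure declaration, the union, that does not assign different addresses to each object; instead, each field in a union declaration has the same offset: zero. The following example demonstrates the syntax for a union declaration: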

unionType union
 `Fields (syntactically identical to struct declarations)`
unionType ends

Yes, it seems odd that MASM still uses ends to mark the end of a union (rather than endu). If this really bothers you, just create a textequ for endu as follows:

endu  textequ <ends>

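Now, you can use endu to your heart's content to mark the end of a union.

You access the fields of a union exactly the same way you access the fields of a struct: by using dot notation and field names. The following is a concrete example of a union type declaration and a variable of the union type: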

numeric  union
i        sdword  ?
u        dword   ?
q        qword   ?
numeric  ends
           .
           .
           .
         .data
number  numeric  {}
           .
           .
           .
     mov number.u, 55
           .
           .
           .
     mov number.i, -62
           .
           .
           .
     mov rbx, number.q

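The important thing to note about union objects is that all the fields of a union have the same offset in the structure. In the preceding example, the number.u, number.i, and number.q fields all have the same offset: zero. Therefore, the fields of a union overlap in memory; this is similar to the way the x86-64 8-, 16-, 32-, and 64-bit general-purpose registers overlap one another. Usually, you may access only one field of a union at a time; you do not manipulate separate fields of a particular union variable concurrently, because writing to one field overwrites the other fields. In the preceding example, any modification of number.u would also change number.i and number.q.

Programmers typically use unions for two reasons: to conserve memory or to create aliases. Memory conservation is the intended use of this data structure facility. To see how this works, let's compare the numeric union in the preceding example with a corresponding structure type: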

numericRec  struct
i           sdword  ?
u           dword   ?
q           qword   ?
numericRec  ends

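If you declare a variable, say n, of type numericRec, you access the fields as n.i, n.u, and n.q, exactly as though you had declared n to be of type numeric. The difference between the two is that numericRec variables allocate separate storage for each field of the structure, whereas numeric (union) objects allocate the same storage for all fields. Therefore, sizeof numericRec is 16, because the record contains two double-word fields and a quad-word field. The value of sizeof numeric, however, is 8. The reason is that all the fields of a union occupy the same memory locations, and the size of a union object is the size of the largest field of that object (see Figure 4-8).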
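
The size difference and the overlap of fields can be observed from Python with the standard ctypes module, which lays out Structure and Union types the way a C compiler would. This is an illustration in a different language rather than MASM behavior; the class names NumericRec and Numeric simply mirror the declarations above.

```python
import ctypes

FIELDS = [
    ("i", ctypes.c_int32),   # sdword
    ("u", ctypes.c_uint32),  # dword
    ("q", ctypes.c_uint64),  # qword
]


class NumericRec(ctypes.Structure):
    """Each field gets its own storage."""
    _fields_ = FIELDS


class Numeric(ctypes.Union):
    """All fields start at offset zero and share storage."""
    _fields_ = FIELDS


if __name__ == "__main__":
    print(ctypes.sizeof(NumericRec), ctypes.sizeof(Numeric))
    n = Numeric()
    n.u = 55
    n.i = -62          # overwrites the same 4 bytes that n.u occupies
    print(hex(n.u))    # n.u now reflects the bit pattern of -62
```

After the second store, n.u no longer holds 55, which is the overwriting behavior described for number.u and number.i.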

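Figure 4-8: Layout of a union versus a struct variable

In addition to conserving memory, programmers often use unions to create aliases in their code. As you may recall, an alias is a different name for the same memory object. Aliases are often a source of confusion in a program, so you should use them sparingly; sometimes, however, using an alias can be quite convenient. For example, in one section of your program, you might need to constantly use type coercion to refer to an object by using a different type. Although you can use a MASM textequ to simplify this process, another way to do this is to use a union variable with fields representing the different types you want to use for the object. As an example, consider the following code: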

CharOrUns union
chr       byte      ?
u         dword     ?
CharOrUns ends

          .data
v         CharOrUns {}

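With a declaration like this, you can manipulate an unsigned 32-bit object by accessing v.u. If, at some point, you need to treat the LO byte of this dword variable as a character, you can do so by accessing the v.chr variable; for example: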

mov v.u, eax
mov ch, v.chr

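You can use unions exactly the same way you use structures in a MASM program. In particular, union declarations may appear as fields in structures, struct declarations may appear as fields in unions, array declarations may appear within unions, you can create arrays of unions, and so on.

4.12.1 Anonymous Unions

Within a struct declaration, you can place a union declaration without specifying a field name for the union object. The following example demonstrates the syntax: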

HasAnonUnion struct
r            real8    ?

             union
u            dword    ?
i            sdword   ?
             ends

s            qword    ?
HasAnonUnion ends

             .data
v            HasAnonUnion {}

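Whenever an anonymous union appears within a record, you can access the fields of the union as though they were unenclosed fields of the record. In the preceding example, for instance, you would access v's u and i fields by using the syntax v.u and v.i, respectively. The u and i fields have the same offset in the record (8, because they follow a real8 object). The fields of v have the following offsets from v's base address: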

v.r           0
v.u           8
v.i           8
v.s          12

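sizeof(v) is 20 because the u and i fields consume only 4 bytes.

MASM also allows anonymous structures within unions. Please see the MASM documentation for more details; the syntax and usage are identical to anonymous unions within structures.

4.12.2 Variant Types

One big use of unions in programs is to create variant types. A variant variable can change its type dynamically while the program is running. A variant object can be an integer at one point in the program, switch to a string at a different part of the program, and then change to a real value at a later time. Many very high-level language (VHLL) systems use a dynamic type system (that is, variant objects) to reduce the overall complexity of the program; indeed, proponents of many VHLLs insist that the use of a dynamic typing system is one of the reasons you can write complex programs with so few lines of code.

Of course, if you can create variant objects in a VHLL, you can certainly do it in assembly language. In this section, we'll look at how we can use the union structure to create variant types.

At any one given instant during program execution, a variant object has a specific type, but under program control, the variable can switch to a different type. Therefore, when the program processes a variant object, it must use an if statement or a switch statement (or something similar) to execute different instructions based on the object's current type. VHLLs do this transparently.

In assembly language, you have to provide the code to test the type yourself. To achieve this, the variant type needs additional information beyond the object's value. Specifically, the variant object needs a field that specifies the current type of the object. This field (often known as the tag field) is an enumerated type or integer that specifies the object's type at any given instant. The following code demonstrates how to create a variant type: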

VariantType struct
tag         dword    ?  ; 0-uns32, 1-int32, 2-real64

            union
u           dword    ?
i           sdword   ?
r           real8    ?
            ends
VariantType ends

            .data
v           VariantType {}

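The program would test the v.tag field to determine the current type of the v object. Based on this test, the program would manipulate the v.i, v.u, or v.r field.

Of course, when operating on variant objects, the program's code must constantly be testing the tag field and executing a separate sequence of instructions for dword, sdword, or real8 values. If you use variant fields often, it makes a lot of sense to write procedures to handle these operations for you (for example, vadd, vsub, vmul, and vdiv).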
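
The tag-testing pattern is sketched below in Python to show the control flow a vadd procedure would need. The source names vadd but does not define it, so everything here is an assumption: the Variant class, the requirement that both operands carry the same tag, and the choice to wrap 32-bit results the way 32-bit register arithmetic would.

```python
TAG_UNS32, TAG_INT32, TAG_REAL64 = 0, 1, 2  # same encoding as the tag comment


class Variant:
    """A value plus the tag that says how to interpret it."""

    def __init__(self, tag, value):
        self.tag = tag
        self.value = value


def vadd(a, b):
    """Add two variants by dispatching on the tag field."""
    if a.tag != b.tag:
        raise TypeError("operands must carry the same tag")
    if a.tag == TAG_UNS32:
        return Variant(TAG_UNS32, (a.value + b.value) & 0xFFFFFFFF)
    if a.tag == TAG_INT32:
        total = (a.value + b.value) & 0xFFFFFFFF
        if total >= 0x80000000:       # reinterpret bit 31 as the sign bit
            total -= 0x100000000
        return Variant(TAG_INT32, total)
    if a.tag == TAG_REAL64:
        return Variant(TAG_REAL64, float(a.value) + float(b.value))
    raise ValueError("unknown tag")


if __name__ == "__main__":
    print(vadd(Variant(TAG_REAL64, 1.5), Variant(TAG_REAL64, 2.25)).value)
```

An assembly version would replace the chain of if statements with compares and conditional jumps (or a jump table) on v.tag.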

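4.13 Microsoft ABI Notes

The Microsoft ABI expects fields of a structure to be aligned on their natural size: the offset from the beginning of the structure to a given field must be a multiple of the field's size. On top of this, the whole structure must be aligned at a memory address that is a multiple of the size of the largest object in the structure (up to 16 bytes). Finally, the size of the entire structure must be a multiple of the size of the largest element in the structure (you must add padding bytes to the end of the structure to appropriately fill out the structure's size).

The Microsoft ABI expects arrays to begin at an address in memory that is a multiple of the element size. For example, if you have an array of 32-bit objects, the array must begin on a 4-byte boundary.

Of course, if you're not passing array or structure data to another language (you're processing the struct or array only in your assembly code), you can align (or misalign) the data any way you want.

4.14 For More Information

For additional information about data structure representation in memory, consider reading my book Write Great Code, Volume 1 (No Starch Press, 2004). For an in-depth discussion of data types, consult a textbook on data structures and algorithms. Of course, the MASM online documentation (at www.microsoft.com/) is a good source of information.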

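4.15 Test Yourself

What is the two-operand form of the imul instruction that multiplies a register by a constant?
What is the three-operand form of the imul instruction that multiplies a register by a constant and leaves the result in a destination register?
What is the syntax for the imul instruction that multiplies one register by another?
What is a manifest constant?
Which directive(s) would you use to create a manifest constant?
What is the difference between a text equate and a numeric equate?
Explain how you would use an equate to define literal strings whose length is greater than eight characters.
What is a constant expression?
What operator would you use to determine the number of data elements in the operand field of a byte directive?
What is the location counter?
What operator returns the current location counter?
How would you compute the number of bytes between two declarations in the .data section?
How would you create a set of enumerated data constants by using MASM?
How do you define your own data types by using MASM?
What is a pointer (how is it implemented)?
How do you dereference a pointer in assembly language?
How do you declare pointer variables in assembly language?
What operator would you use to obtain the address of a static data object (for example, in the .data section)?
What are five common problems encountered when using pointers in a program?
What is a dangling pointer?
What is a memory leak?
What is a composite data type?
What is a zero-terminated string?
What is a length-prefixed string?
What is a descriptor-based string?
What is an array?
What is the base address of an array?
Provide an example of an array declaration using the dup operator.
Describe how to create an array whose elements you initialize at assembly time.
What is the formula for accessing elements of each of the following?
1. A single-dimensional array dword A[10]
2. A two-dimensional array word W[4, 8]
3. A three-dimensional array real8 R[2, 4, 6]
What is row-major order?
What is column-major order?
Provide an example of a two-dimensional array declaration (word array W[4, 8]) using nested dup operators.
What is a record/struct?
What MASM directives do you use to declare a record data structure?
What operator do you use to access fields of a record/struct?
What is a union?
What directives do you use to declare unions in MASM?
What is the difference between the memory organization of fields in a union versus those in a record/struct?
What is an anonymous union in a struct?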

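Part II

Assembly Language Programming

Chapter 5: Procedures

In a procedural programming language, the basic unit of code is the procedure. A procedure is a set of instructions that compute a value or take an action (such as printing or reading a character value). This chapter discusses how MASM implements procedures, parameters, and local variables. By the end of this chapter, you should be well versed in writing your own procedures and functions, and fully understand parameter passing and the Microsoft ABI calling convention.

5.1 Implementing Procedures

Most procedural programming languages implement procedures by using the call/return mechanism. The code calls a procedure, the procedure does its thing, and then the procedure returns to the caller. The call and return instructions provide the x86-64's procedure invocation mechanism. The calling code calls a procedure with the call instruction, and the procedure returns to the caller with the ret instruction. For example, the following x86-64 instruction calls the C Standard Library printf() function: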

call printf

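Alas, the C Standard Library does not supply all the routines you will ever need. Most of the time, you'll have to write your own procedures. To do this, you will use MASM's procedure-declaration facilities. A basic MASM procedure declaration takes the following form: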

`proc_name` proc `options`
          `Procedure statements`
`proc_name` endp

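Procedure declarations appear in the .code section of your program. In the preceding syntax example, proc_name represents the name of the procedure you wish to define. This can be any valid (and unique) MASM identifier.

Here is a concrete example of a MASM procedure declaration. This procedure stores 0s into the 256 double words that RCX points at upon entry into the procedure: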

zeroBytes proc
          mov eax, 0
          mov edx, 256
repeatlp: mov [rcx+rdx*4-4], eax
          dec rdx
          jnz repeatlp
          ret
zeroBytes endp

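As you've probably noticed, this simple procedure doesn't bother with the "magic" instructions that add and subtract a value to and from the RSP register. Those instructions are a requirement of the Microsoft ABI when the procedure will be calling other C/C++ code (or other code written in a Microsoft ABI-compliant language). Because this little function doesn't call any other procedures, it doesn't bother executing such code. Also note that this code uses the loop index to count down from 256 to 0, filling in the 256-dword array backward (from end to beginning) rather than from beginning to end. This is a common technique in assembly language.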
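
To see why the countdown loop touches every double word exactly once, here is a Python model of zeroBytes that mirrors the addressing expression [rcx+rdx*4-4]. The bytearray stands in for memory and the parameter names are illustrative; like the assembly loop, the model assumes count is at least 1.

```python
def zero_dwords(memory, base, count=256):
    """Model of zeroBytes: store 0 into count dwords starting at base."""
    rdx = count                                       # mov edx, 256
    while True:
        offset = base + rdx * 4 - 4                   # [rcx+rdx*4-4]
        memory[offset:offset + 4] = b"\x00\x00\x00\x00"   # mov [...], eax
        rdx -= 1                                      # dec rdx
        if rdx == 0:                                  # jnz repeatlp not taken
            break


if __name__ == "__main__":
    # 256 dwords, each initialized to 1 (little-endian), as in dwArray.
    buffer = bytearray(b"\x01\x00\x00\x00" * 256)
    zero_dwords(buffer, 0)
    print(buffer == bytearray(1024))
```

The -4 in the address expression compensates for the index running from 256 down to 1 instead of from 255 down to 0, which lets dec set the zero flag that jnz tests without a separate cmp instruction.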

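You can use the x86-64 call instruction to call this procedure. When, during program execution, the code reaches the ret instruction, the procedure returns to whoever called it, and execution continues with the first instruction beyond the call instruction. The program in Listing 5-1 provides an example of a call to the zeroBytes routine.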

; Listing 5-1

; Simple procedure call example.

         option  casemap:none

nl       =       10

         .const
ttlStr   byte    "Listing 5-1", 0

        .data
dwArray dword   256 dup (1)

        .code

; Return program title to C++ program:

         public getTitle
getTitle proc
         lea rax, ttlStr
         ret
getTitle endp

; Here is the user-written procedure
; that zeroes out a buffer.

zeroBytes proc
          mov eax, 0
          mov edx, 256
repeatlp: mov [rcx+rdx*4-4], eax
          dec rdx
          jnz repeatlp
          ret
zeroBytes endp

; Here is the "asmMain" function.

        public  asmMain
asmMain proc

; "Magic" instruction offered without
; explanation at this point:

        sub     rsp, 48

        lea     rcx, dwArray
        call    zeroBytes 

        add     rsp, 48     ; Restore RSP
        ret                 ; Returns to caller
asmMain endp
        end

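Listing 5-1: Example of a simple procedure

5.1.1 The call and ret Instructions

The x86-64 call instruction does two things. First, it pushes the (64-bit) address of the instruction immediately following the call onto the stack; then it transfers control to the address of the specified procedure. The value that call pushes onto the stack is known as the return address.

When a procedure wants to return to the caller and continue execution with the first statement following the call instruction, it usually does so by executing a ret (return) instruction. The ret instruction pops a (64-bit) return address off the stack and transfers control indirectly to that address.

The following is an example of the minimal procedure: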

minimal proc
        ret
minimal endp

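If you call this procedure with the call instruction, minimal will simply pop the return address off the stack and return to the caller. If you fail to put the ret instruction in the procedure, the program will not return to the caller upon encountering the endp statement. Instead, the program will fall through to whatever code happens to follow the procedure in memory.

The example program in Listing 5-2 demonstrates this problem. The main program calls noRet, which falls straight through to followingProc (printing the message followingProc was called).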

; Listing 5-2

; A procedure without a ret instruction.

               option  casemap:none

nl             =       10

              .const
ttlStr        byte    "Listing 5-2", 0
fpMsg         byte    "followingProc was called", nl, 0

              .code
              externdef printf:proc

; Return program title to C++ program:

              public getTitle
getTitle      proc
              lea rax, ttlStr
              ret
getTitle      endp

; noRet - Demonstrates what happens when a procedure
;         does not have a return instruction.

noRet         proc
noRet         endp

followingProc proc
              sub  rsp, 28h
              lea  rcx, fpMsg
              call printf
              add  rsp, 28h
              ret
followingProc endp

; Here is the "asmMain" function.

              public  asmMain
asmMain       proc
              push    rbx

              sub     rsp, 40   ; "Magic" instruction

              call    noRet

              add     rsp, 40   ; "Magic" instruction
              pop     rbx
              ret               ; Returns to caller
asmMain       endp
              end

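Listing 5-2: Effect of a missing ret instruction in a procedure

Although this behavior might be desirable in certain rare circumstances, it usually represents a defect in most programs. Therefore, always remember to explicitly return from the procedure by using the ret instruction.

5.1.2 Labels in a Procedure

Procedures may contain statement labels, just like the main procedure in your assembly language program (after all, the main procedure, asmMain in most of the examples in this book, is just another procedure declaration as far as MASM is concerned). Note, however, that statement labels defined within a procedure are local to that procedure; such symbols are not visible outside the procedure.

In most situations, having scoped symbols in a procedure is nice (see "Local (Automatic) Variables" later in this chapter for a discussion of scope). You don't have to worry about namespace pollution (conflicting symbol names) among the different procedures in your source file. Sometimes, however, MASM's name scoping can create problems. You might actually want to refer to a statement label outside a procedure.

One way to do this on a label-by-label basis is to use a global statement label declaration. Global statement labels are similar to normal statement labels in a procedure, except you follow the symbol with two colons instead of a single colon, like so: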

globalSymbol:: mov eax, 0

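Global statement labels are visible outside the procedure. You can use an unconditional or conditional jump instruction to transfer control to a global symbol from outside the procedure; you can even use a call instruction to call that global symbol (in which case, it becomes a second entry point to the procedure). Generally, having multiple entry points to a procedure is considered bad programming style, and the use of multiple entry points often leads to programming errors. As such, you should rarely use global symbols in assembly language procedures.

If, for some reason, you don't want MASM to treat all the statement labels in a procedure as local to that procedure, you can turn scoping on and off with the following statements: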

option scoped
option noscoped

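The option noscoped directive disables scoping in procedures (for all procedures following the directive). The option scoped directive turns scoping back on. Therefore, you can turn scoping off for a single procedure (or set of procedures) and turn it back on immediately afterward.

5.2 Saving the State of the Machine

Take a look at Listing 5-3. This program attempts to print 20 lines of 40 spaces and an asterisk. Unfortunately, a subtle bug creates an infinite loop. The main program uses the jnz astLp instruction to create a loop that calls print40Spaces 20 times. This function uses EBX to count off the 40 spaces it prints, and then returns with EBX containing 0 (because 32-bit operations zero-extend, all of RBX contains 0). The main program then prints an asterisk and a newline, decrements RBX, and then repeats because RBX isn't 0 (at this point, it always contains 0FFFF_FFFF_FFFF_FFFFh).

The problem here is that the print40Spaces subroutine doesn't preserve the EBX register. Preserving a register means you save it upon entry into the subroutine and restore it before leaving. Had the print40Spaces subroutine preserved the contents of the EBX register, Listing 5-3 would have functioned properly.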
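
The failure is easy to reproduce in a small Python model of the RBX register. This is a simulation under stated assumptions, not the program itself: registers are modeled as a dictionary, printing is omitted, and the main loop is capped at max_iterations so the "infinite" loop terminates. All names are illustrative.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def print40_spaces_clobbering(regs):
    """Uses RBX as its loop counter and returns with RBX == 0."""
    regs["rbx"] = 40                      # mov ebx, 40
    while regs["rbx"] != 0:
        regs["rbx"] -= 1                  # dec ebx / jnz printLoop


def print40_spaces_preserving(regs):
    """Same loop, but RBX is saved on entry and restored on exit."""
    saved = regs["rbx"]                   # push rbx
    print40_spaces_clobbering(regs)
    regs["rbx"] = saved                   # pop rbx


def run_main(callee, max_iterations=100):
    """Model of the astLp loop; returns (iterations run, final RBX)."""
    regs = {"rbx": 20}                    # mov rbx, 20
    iterations = 0
    while iterations < max_iterations:
        callee(regs)                      # call print40Spaces
        iterations += 1
        regs["rbx"] = (regs["rbx"] - 1) & MASK64   # dec rbx
        if regs["rbx"] == 0:              # jnz astLp not taken
            break
    return iterations, regs["rbx"]


if __name__ == "__main__":
    print(run_main(print40_spaces_clobbering))   # hits the iteration cap
    print(run_main(print40_spaces_preserving))   # stops after 20 iterations
```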

; Listing 5-3

; Preserving registers (failure) example.

               option  casemap:none

nl             =       10

              .const
ttlStr        byte    "Listing 5-3", 0
space         byte    " ", 0
asterisk      byte    '*, %d', nl, 0

              .code
              externdef printf:proc

; Return program title to C++ program:

              public getTitle
getTitle      proc
              lea rax, ttlStr
              ret
getTitle      endp

; print40Spaces - Prints out a sequence of 40 spaces
;                 to the console display.

print40Spaces proc
              sub  rsp, 48   ; "Magic" instruction
              mov  ebx, 40
printLoop:    lea  rcx, space
              call printf
              dec  ebx
              jnz  printLoop ; Until EBX == 0
              add  rsp, 48   ; "Magic" instruction
              ret
print40Spaces endp

; Here is the "asmMain" function.

              public  asmMain
asmMain       proc
              push    rbx

; "Magic" instruction offered without
; explanation at this point:

              sub     rsp, 40   ; "Magic" instruction

              mov     rbx, 20
astLp:        call    print40Spaces
              lea     rcx, asterisk
              mov     rdx, rbx
              call    printf
              dec     rbx
              jnz     astLp

              add     rsp, 40   ; "Magic" instruction
              pop     rbx
              ret     ; Returns to caller
asmMain       endp
              end

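Listing 5-3: Program with an unintended infinite loop

You can use the x86-64's push and pop instructions to preserve register values while you need to use them for something else. Consider the following code for print40Spaces: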

print40Spaces proc
              push rbx
              sub  rsp, 40   ; "Magic" instruction
              mov  ebx, 40
printLoop:    lea  rcx, space
              call printf
              dec  ebx
              jnz  printLoop ; Until EBX == 0
              add  rsp, 40   ; "Magic" instruction
              pop  rbx
              ret
print40Spaces endp

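print40Spaces saves and restores RBX by using push and pop instructions. Either the caller (the code containing the call instruction) or the callee (the subroutine) can take responsibility for preserving the registers. In the preceding example, the callee preserves the registers.

Listing 5-4 shows what this code might look like if the caller preserves the registers (for reasons that will become clear in "Saving the State of the Machine, Part II," the main program saves the value of RBX in a static memory location rather than using the stack).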

; Listing 5-4

; Preserving registers (caller) example.

               option  casemap:none

nl             =       10

              .const
ttlStr        byte    "Listing 5-4", 0
space         byte    " ", 0
asterisk      byte    '*, %d', nl, 0

              .data
saveRBX       qword   ?

              .code
              externdef printf:proc

; Return program title to C++ program:

              public getTitle
getTitle      proc
              lea rax, ttlStr
              ret
getTitle      endp

; print40Spaces - Prints out a sequence of 40 spaces
;                 to the console display.

print40Spaces proc
              sub  rsp, 48   ; "Magic" instruction
              mov  ebx, 40
printLoop:    lea  rcx, space
              call printf
              dec  ebx
              jnz  printLoop ; Until EBX == 0
              add  rsp, 48   ; "Magic" instruction
              ret
print40Spaces endp

; Here is the "asmMain" function.

              public  asmMain
asmMain       proc
              push    rbx

; "Magic" instruction offered without
; explanation at this point:

              sub     rsp, 40

              mov     rbx, 20
astLp:        mov     saveRBX, rbx
              call    print40Spaces
              lea     rcx, asterisk
              mov     rdx, saveRBX
              call    printf
              mov     rbx, saveRBX
              dec     rbx
              jnz     astLp

              add     rsp, 40
              pop     rbx
              ret     ; Returns to caller
asmMain       endp
              end

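Listing 5-4: Demonstration of caller register preservation

Callee preservation has two advantages: space and maintainability. If the callee (the procedure) preserves all affected registers, only one copy of the push and pop instructions exists: those the procedure contains. If the caller saves the values in the registers, the program needs a set of preservation instructions around every call. This not only makes your programs longer but also makes them harder to maintain. Remembering which registers to save and restore on each procedure call is not easy.

On the other hand, a subroutine may unnecessarily preserve some registers if it preserves all the registers it modifies. In Listing 5-4, for example, print40Spaces doesn't save RBX; although it changes RBX, this doesn't affect the program's operation, because the caller has already saved the value it cares about. If the caller is preserving the registers, it doesn't have to save registers it doesn't care about.

One big problem with having the caller preserve registers is that your program may change over time. You may modify the calling code or the procedure to use additional registers. Such changes, of course, may change the set of registers that you must preserve. Worse still, if the modification is in the subroutine itself, you will need to locate every call to the routine and verify that the subroutine does not change any registers the calling code uses.

Assembly language programmers follow a common convention with respect to register preservation: unless there is a good reason (performance) for doing otherwise, most programmers preserve all registers that a procedure modifies (and that don't explicitly return a value in a modified register). This reduces the likelihood of defects caused by a procedure modifying a register that the caller expects to be preserved. Of course, you could follow the Microsoft ABI rules for volatile and nonvolatile registers; however, such calling conventions impose their own inefficiencies on programmers (and other programs).

Preserving registers isn't all there is to preserving the environment. You can also push and pop variables and other values that a subroutine might change. Because the x86-64 allows you to push and pop memory locations, you can easily preserve these values as well.

5.3 Procedures and the Stack

Because procedures use the stack to hold the return address, you must exercise caution when pushing and popping data within a procedure. Consider the following simple (and defective) procedure: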

MessedUp   proc

           push rax
           ret

MessedUp   endp

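At the point the program encounters the ret instruction, the x86-64 stack takes the form shown in Figure 5-1.

Figure 5-1: Stack contents before ret in the MessedUp procedure

The ret instruction isn't aware that the value on the top of the stack is not a valid address. It simply pops whatever value is on top and jumps to that location. In this example, the top of the stack contains the saved RAX value. Because it is very unlikely that the value of RAX pushed on the stack is the proper return address, this program will probably crash or exhibit some other undefined behavior. Therefore, when pushing data onto the stack within a procedure, you must take care to properly pop that data prior to returning from the procedure.

Popping extra data off the stack prior to executing the ret instruction can also create havoc in your programs. Consider the following defective procedure: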

MessedUp2  proc

           pop rax
           ret

MessedUp2  endp

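Upon reaching the ret instruction in this procedure, the x86-64 stack looks something like Figure 5-2.

Figure 5-2: Stack contents before ret in MessedUp2

Once again, the ret instruction blindly pops whatever data happens to be on the top of the stack and attempts to return to that address. Unlike the previous example, in which the top of the stack was unlikely to contain a valid return address (because it contained the value in RAX), the top of the stack in this example may well contain a valid return address. However, this will not be the proper return address for the MessedUp2 procedure; instead, it will be the return address for the procedure that called MessedUp2. To understand the effect of this code, consider the program in Listing 5-5.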

; Listing 5-5

; Popping a return address by mistake.

               option  casemap:none

nl             =       10

              .const
ttlStr        byte    "Listing 5-5", 0
calling       byte    "Calling proc2", nl, 0
call1         byte    "Called proc1", nl, 0
rtn1          byte    "Returned from proc 1", nl, 0
rtn2          byte    "Returned from proc 2", nl, 0

              .code
              externdef printf:proc

; Return program title to C++ program:

              public getTitle
getTitle      proc
              lea rax, ttlStr
              ret
getTitle      endp

; proc1 - Gets called by proc2, but returns
;         back to the main program.

proc1         proc
              pop   rcx     ; Pops return address off stack
              ret
proc1         endp

proc2         proc
              call  proc1   ; Will never return

; This code never executes because the call to proc1
; pops the return address off the stack and returns
; directly to asmMain.

              sub   rsp, 40
              lea   rcx, rtn1
              call  printf
              add   rsp, 40
              ret
proc2         endp

; Here is the "asmMain" function.

              public asmMain
asmMain       proc

              sub   rsp, 40

              lea   rcx, calling
              call  printf

              call  proc2
              lea   rcx, rtn2
              call  printf

              add   rsp, 40
              ret           ; Returns to caller
asmMain       endp
              end

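Listing 5-5: Effect of popping too much data off the stack

Because a valid return address is sitting on the top of the stack when proc1 is entered, you might think that this program will actually work (properly). However, when returning from the proc1 procedure, this code returns directly to the asmMain program rather than to the proper return address in the proc2 procedure. Therefore, all code in the proc2 procedure that follows the call to proc1 does not execute.

When reading the source code, you may find it very difficult to figure out why those statements are not executing, because they immediately follow the call to the proc1 procedure. It isn't clear, unless you look very closely, that the program is popping an extra return address off the stack and therefore doesn't return to proc2 but rather returns directly to whoever called proc2. Therefore, you should always be careful about pushing and popping data in a procedure, and verify that a one-to-one relationship exists between the pushes in your procedures and the corresponding pops.^(1)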
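
The effect of the extra pop in Listing 5-5 can be traced with a toy stack in Python. Return addresses are represented by the name of the procedure they return into; this representation and the function name are illustrative assumptions, not part of the listing.

```python
def proc1_return_target(extra_pops):
    """Return the place proc1's ret jumps to after extra_pops stray pops."""
    stack = []
    stack.append("asmMain")   # call proc2 pushes a return address into asmMain
    stack.append("proc2")     # call proc1 pushes a return address into proc2
    for _ in range(extra_pops):
        stack.pop()           # pop rcx: discards whatever is on top
    return stack.pop()        # ret: pops the next value and jumps there


if __name__ == "__main__":
    print(proc1_return_target(0))   # balanced stack: back to proc2
    print(proc1_return_target(1))   # one stray pop: straight back to asmMain
```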

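5.3.1 Activation Records

Whenever you call a procedure, the program associates certain information with that procedure call, including the return address, parameters, and automatic local variables, by using a data structure called an activation record.^(2) The program creates an activation record when calling (activating) a procedure, and the data in the structure is organized in a manner identical to records.

Construction of an activation record begins in the code that calls a procedure. The caller makes room for the parameter data (if any) on the stack and copies the data onto the stack. Then the call instruction pushes the return address onto the stack. At this point, construction of the activation record continues within the procedure itself. The procedure pushes registers and other important state information and then makes room in the activation record for local variables. The procedure might also update the RBP register so that it points at the base address of the activation record.

To see what a traditional activation record looks like, consider the following C++ procedure declaration: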

void ARDemo(unsigned i, int j, unsigned k)
{
     int a;
     float r;
     char c;
     bool b;
     short w;
     .
     .
     .
}

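Whenever a program calls the ARDemo procedure, it begins by pushing the data for the parameters onto the stack. In the original C/C++ calling convention (ignoring the Microsoft ABI), the calling code pushes the parameters onto the stack in the opposite order from that in which they appear in the parameter list, from right to left. Therefore, the calling code first pushes the value for the k parameter, then the value for the j parameter, and finally the data for the i parameter. After pushing the parameters, the program calls the ARDemo procedure. Immediately upon entry into the ARDemo procedure, the stack contains these four items (the three parameters and the return address), arranged as shown in Figure 5-3. Because the parameters are pushed in reverse order, they appear on the stack in the correct order (with the first parameter at the lowest address in memory).

Figure 5-3: Stack organization immediately upon entry into ARDemo

The first few instructions in ARDemo push the current value of RBP onto the stack and then copy the value of RSP into RBP.^(3) Next, the code drops the stack pointer down in memory to make room for the local variables. This produces the stack organization shown in Figure 5-4.

Figure 5-4: Activation record for ARDemo

5.3.1.1 Accessing Objects in the Activation Record

To access objects in the activation record, you must use offsets from the RBP register to the desired object. The two items of immediate interest to you are the parameters and the local variables. You can access the parameters at positive offsets from the RBP register; you can access the local variables at negative offsets from the RBP register, as Figure 5-5 shows.

Intel specifically reserves the RBP (Base Pointer) register for use as a pointer to the base of the activation record. This is why you should avoid using the RBP register for general calculations. If you arbitrarily change the value in the RBP register, you could lose access to the current procedure's parameters and local variables.

The local variables are aligned on offsets that are equal to their natural size (chars are aligned on 1-byte addresses, shorts/words are aligned on 2-byte addresses, longs/ints/unsigneds/dwords are aligned on 4-byte addresses, and so forth). In the ARDemo example, all of the locals just happen to be allocated on appropriate addresses (assuming a compiler allocates storage in the order of declaration).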

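Figure 5-5: Offsets of objects in the ARDemo activation record

5.3.1.2 Using Microsoft ABI Parameter Conventions

The Microsoft ABI makes several modifications to the activation record model, in particular:

The caller passes the first four parameters in registers rather than on the stack (though it must still reserve storage on the stack for those parameters).
Parameters are always 8-byte values.
The caller must reserve (at least) 32 bytes of parameter data on the stack, even if there are fewer than five parameters (plus 8 bytes for each additional parameter if there are five or more parameters).
RSP must be 16-byte-aligned immediately before the call instruction pushes the return address onto the stack.

For more information, see "Microsoft ABI Notes" in Chapter 1. You must follow these conventions only when calling Windows or other Microsoft ABI-compliant code. For assembly language procedures that you write and call, you can use any convention you like.

5.3.2 The Assembly Language Standard Entry Sequence

The caller of a procedure is responsible for allocating storage for parameters on the stack and moving the parameter data to its appropriate location. In the simplest case, this just involves pushing the data onto the stack by using 64-bit push instructions. The call instruction pushes the return address onto the stack. It is the procedure's responsibility to construct the rest of the activation record. You can accomplish this by using the following assembly language standard entry sequence code: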

push rbp          ; Save a copy of the old RBP value
mov rbp, rsp      ; Get ptr to activation record into RBP
sub rsp, `num_vars` ; Allocate local variable storage plus padding

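If the procedure doesn't have any local variables, the third instruction shown here, sub rsp, num_vars, isn't necessary.

num_vars represents the number of bytes of local variables needed by the procedure, a constant that should be a multiple of 16 (so the RSP register remains aligned on a 16-byte boundary).^(4) If the number of bytes of local variables in the procedure is not a multiple of 16, you should round up the value to the next higher multiple of 16 before subtracting this constant from RSP. Doing so will slightly increase the amount of storage the procedure uses for local variables but will not otherwise affect the operation of the procedure.

If a Microsoft ABI-compliant program calls your procedure, the stack will be aligned on a 16-byte boundary immediately prior to the execution of the call instruction. Because the return address adds 8 bytes to the stack, immediately upon entry into your procedure, the stack will be aligned on an (RSP mod 16) == 8 address (aligned on an 8-byte address but not on a 16-byte address). Pushing RBP onto the stack (to save the old value before copying RSP into RBP) adds another 8 bytes to the stack so that RSP is now 16-byte-aligned. Therefore, assuming the stack was 16-byte-aligned prior to the call, and the number you subtract from RSP is a multiple of 16, the stack will be 16-byte-aligned after allocating storage for local variables.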
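
The alignment argument in the preceding paragraph is just modular arithmetic, which the following Python sketch walks through one instruction at a time. The starting RSP value 7FF0h and the function name are arbitrary illustrative choices; any multiple of 16 behaves the same way.

```python
def trace_entry_sequence(rsp_before_call, num_vars):
    """Return (RSP on entry, RSP after the standard entry sequence)."""
    rsp = rsp_before_call - 8    # call pushes the 8-byte return address
    on_entry = rsp
    rsp -= 8                     # push rbp
    rsp -= num_vars              # sub rsp, num_vars
    return on_entry, rsp


if __name__ == "__main__":
    on_entry, final = trace_entry_sequence(0x7FF0, 32)
    print(on_entry % 16, final % 16)
    # A num_vars value that is not a multiple of 16 breaks the alignment:
    print(trace_entry_sequence(0x7FF0, 12)[1] % 16)
```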

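If you cannot be sure that the stack was 16-byte-aligned before the call (that is, that RSP mod 16 == 8 upon entry into your procedure), you can force 16-byte alignment by using the following sequence at the beginning of your procedure: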

push rbp
mov rbp, rsp
sub rsp, `num_vars`  ; Make room for local variables
and rsp, -16       ; Force 16-byte stack alignment

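The value -16 is equivalent to 0FFFF_FFFF_FFFF_FFF0h. The and instruction forces the stack to be aligned on a 16-byte boundary (it reduces the value in the stack pointer so that it is a multiple of 16).

The ARDemo activation record has only 12 bytes of local storage. Therefore, subtracting 12 from RSP for the local variables will not leave the stack 16-byte-aligned. The and instruction in the preceding sequence, however, guarantees that RSP is 16-byte-aligned regardless of RSP's value upon entry into the procedure (this adds the padding bytes shown in Figure 5-5). The few bytes and CPU cycles needed to execute this instruction pay off handsomely if RSP was not 16-byte-aligned. Of course, if you know that the stack was aligned properly before the call, you could dispense with the extra and instruction and simply subtract 16 from RSP rather than 12 (in other words, reserve 4 more bytes than the ARDemo procedure needs, to keep the stack aligned).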
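
Both alignment techniques discussed above (rounding num_vars up to a multiple of 16, and masking RSP with -16) reduce to simple bit operations. The Python sketch below shows them side by side; the function names and the sample RSP value are illustrative assumptions.

```python
def round_up_16(num_bytes):
    """Round a local-variable byte count up to the next multiple of 16."""
    return (num_bytes + 15) & -16


def force_align_16(rsp):
    """Model of 'and rsp, -16': clear the low 4 bits of the stack pointer."""
    return rsp & -16


if __name__ == "__main__":
    print(round_up_16(12))               # bytes to reserve for ARDemo's locals
    print(hex(force_align_16(0x7FD4)))   # RSP after sub rsp, 12 in the earlier trace
```

Because the mask only clears bits, it can only move RSP down in memory, so it never releases storage that was already allocated.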

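5.3.3 The Assembly Language Standard Exit Sequence

Before a procedure returns to its caller, it needs to clean up the activation record. Standard MASM procedures and procedure calls, therefore, assume that it is the procedure's responsibility to clean up the activation record, although it is possible to share the cleanup duties between the procedure and the procedure's caller.

If a procedure does not have any parameters, the exit sequence is simple. It requires only three instructions: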

mov rsp, rbp   ; Deallocate locals and clean up stack
pop rbp        ; Restore pointer to caller's activation record
ret            ; Return to the caller

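In the Microsoft ABI (as opposed to pure assembly procedures), it is the caller's responsibility to clean up any parameters pushed on the stack. Therefore, if you are writing a function to be called from C/C++ (or other Microsoft ABI-compliant code), your procedure doesn't have to do anything at all about the parameters on the stack.

If you are writing procedures that will be called only from your assembly language programs, it is possible to have the callee (the procedure) rather than the caller clean up the parameters on the stack upon returning to the caller, by using the following standard exit sequence: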

mov rsp, rbp    ; Deallocate locals and clean up stack
pop rbp         ; Restore pointer to caller's activation record
ret `parm_bytes`  ; Return to the caller and pop the parameters

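The parm_bytes operand of the ret instruction is a constant that specifies the number of bytes of parameter data to remove from the stack after the return instruction pops the return address. For example, the ARDemo example code in the previous sections has three quad words reserved for the parameters (because we want to keep the stack qword-aligned). Therefore, the standard exit sequence would take the following form: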

mov rsp, rbp
pop rbp
ret 24

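If you do not specify a 16-bit constant operand to the ret instruction, the x86-64 will not pop the parameters off the stack upon return. Those parameters will still be sitting on the stack when you execute the first instruction following the call to the procedure. Similarly, if you specify a value that is too small, some of the parameters will be left on the stack upon return from the procedure. If the ret operand you specify is too large, the ret instruction will actually pop some of the caller's data off the stack, usually with disastrous consequences.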
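
The effect of the ret operand on the stack pointer can be summarized in one line of arithmetic, shown here as a Python sketch. The function name and the sample RSP value 1000h are illustrative; the model covers only the stack-pointer adjustment, not the transfer of control.

```python
def rsp_after_ret(rsp, parm_bytes=0):
    """RSP after 'ret parm_bytes': pop the return address, then the parameters."""
    return rsp + 8 + parm_bytes


if __name__ == "__main__":
    print(hex(rsp_after_ret(0x1000)))       # plain ret: parameters stay on the stack
    print(hex(rsp_after_ret(0x1000, 24)))   # ret 24: ARDemo's three qwords removed
```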

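By the way, Intel has added a special instruction to the instruction set to shorten the standard exit sequence: leave. This instruction copies RBP into RSP and then pops RBP. The following is equivalent to the standard exit sequence presented thus far: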

leave
ret `optional_const`

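The choice is up to you. Most compilers generate the leave instruction (because it's shorter), so using it is the standard choice.

5.4 Local (Automatic) Variables

Procedures and functions in most high-level languages let you declare local variables. These are generally accessible only within the procedure; they are not accessible by the code that calls the procedure.

Local variables possess two special attributes in HLLs: scope and lifetime. The scope of an identifier determines where that identifier is visible (accessible) in the source file during compilation. In most HLLs, the scope of a procedure's local variable is the body of that procedure; the identifier is inaccessible outside that procedure.

Whereas scope is a compile-time attribute of a symbol, lifetime is a runtime attribute. The lifetime of a variable extends from the point when storage is first bound to the variable until the point where the storage is no longer available for that variable. Static objects (those you declare in the .data, .const, .data?, and .code sections) have a lifetime equivalent to the total runtime of the application. The program allocates storage for such variables when the program first loads into memory, and those variables maintain that storage until the program terminates.

Local variables (or, more properly, automatic variables) have their storage allocated upon entry into a procedure, and that storage is returned for other use when the procedure returns to its caller. The name automatic refers to the program automatically allocating and deallocating storage for the variable on procedure invocation and return.

A procedure can access any global .data, .data?, or .const object the same way the main program accesses such variables, by referencing the name (using the PC-relative addressing mode). Accessing global objects is convenient and easy. Of course, accessing global objects makes your programs harder to read, understand, and maintain, so you should avoid using global variables within procedures. Although accessing global variables within a procedure may sometimes be the best solution to a given problem, you likely won't be writing such code at this point, so you should carefully consider your options before doing so.^(5)

5.4.1 Low-Level Implementation of Automatic (Local) Variables

Your program accesses local variables in a procedure by using negative offsets from the activation record base address (RBP). Consider the MASM procedure in Listing 5-6 (which admittedly doesn't do much, other than demonstrate the use of local variables).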

; Listing 5-6

; Accessing local variables.

               option  casemap:none
               .code

; sdword a is at offset -4 from RBP.
; sdword b is at offset -8 from RBP.

; On entry, ECX and EDX contain values to store
; into the local variables a and b (respectively):

localVars     proc
              push rbp
              mov  rbp, rsp
              sub  rsp, 16       ; Make room for a and b

              mov  [rbp-4], ecx  ; a = ECX
              mov  [rbp-8], edx  ; b = EDX

    ; Additional code here that uses a and b:

              mov   rsp, rbp
              pop   rbp
              ret
localVars     endp

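Listing 5-6: Sample procedure for accessing local variables

The standard entry sequence allocates 16 bytes of storage even though locals a and b require only 8. This keeps the stack 16-byte-aligned. If this isn't necessary for a particular procedure, subtracting 8 would work just as well.

The activation record for localVars appears in Figure 5-6.

Of course, having to refer to the local variables by their offset from the RBP register is truly horrible. This code is not only difficult to read (does [RBP-4] refer to a or to b?) but also hard to maintain. For example, if you decide you no longer need the a variable, you'd have to find every occurrence of [RBP-8] (accessing the b variable) and change it to [RBP-4].

Figure 5-6: Activation record for the localVars procedure

A slightly better solution is to create equates for your local variable names. Consider the modification to Listing 5-6 shown in Listing 5-7.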

; Listing 5-7

; Accessing local variables #2.

            option  casemap:none
            .code

; localVars - Demonstrates local variable access.

; sdword a is at offset -4 from RBP.
; sdword b is at offset -8 from RBP.

; On entry, ECX and EDX contain values to store
; into the local variables a and b (respectively):

a           equ     <[rbp-4]>
b           equ     <[rbp-8]>
localVars   proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 16  ; Make room for a and b

            mov     a, ecx
            mov     b, edx

    ; Additional code here that uses a and b:

            mov     rsp, rbp
            pop     rbp
            ret
localVars   endp

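Listing 5-7: Local variables using equates

This is considerably easier to read and maintain than the earlier program in Listing 5-6. It's possible to improve on this equate system. For example, the following four equates are perfectly legitimate: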

a  equ <[rbp-4]>
b  equ a-4
d  equ b-4
e  equ d-4

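MASM will associate [RBP-4] with a, [RBP-8] with b, [RBP-12] with d, and [RBP-16] with e. However, getting too crazy with fancy equates doesn't pay; MASM provides a high-level-like declaration for local variables (and parameters) that you can use if you really want your declarations to be as maintainable as possible.

5.4.2 The MASM local Directive

Creating equates for local variables is a lot of work and error prone. It's easy to specify the wrong offset when defining equates, and adding and removing local variables from a procedure is a headache. Fortunately, MASM provides a directive that lets you specify local variables, and MASM automatically fills in the offsets for the locals. That directive, local, uses the following syntax:

local  `list_of_declarations`

The list_of_declarations is a list of local variable declarations, separated by commas. A local variable declaration has two main forms:

`identifier`:`type`
`identifier` [`elements`]:`type`

Here, type is one of the usual MASM data types (byte, word, dword, and so forth), and identifier is the name of the local variable you are declaring. The second form declares local arrays, where elements is the number of array elements. elements must be a constant expression that MASM can resolve at assembly time.

local directives, if they appear in a procedure, must be the first statements after the procedure declaration (the proc directive). A procedure may have more than one local statement; if there is more than one, they must all appear together immediately after the proc declaration. Here's a code snippet with examples of local variable declarations: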

procWithLocals proc
               local  var1:byte, local2:word, dVar:dword
               local  qArray[4]:qword, rlocal:real4
               local  ptrVar:qword
               local  userTypeVar:userType
                 .
                 .   ; Other statements in the procedure.
                 .
procWithLocals endp

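MASM automatically associates appropriate offsets with each variable you declare via the local directive. MASM assigns offsets to the variables by subtracting the variable's size from the current offset (starting at zero) and then rounding down to an offset that is a multiple of the object's size. For example, if userType is typedef'd to real8, MASM assigns offsets to the local variables in procWithLocals as shown in the following MASM listing output: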

var1 . . . . . . . . . . . . .        byte     rbp - 00000001
local2 . . . . . . . . . . . .        word     rbp - 00000004
dVar . . . . . . . . . . . . .        dword    rbp - 00000008
qArray . . . . . . . . . . . .        qword    rbp - 00000028
rlocal . . . . . . . . . . . .        dword    rbp - 0000002C
ptrVar . . . . . . . . . . . .        qword    rbp - 00000034
userTypeVar  . . . . . . . . .        qword    rbp - 0000003C

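In addition to assigning an offset to each local variable, MASM associates the [RBP-constant] addressing mode with each of these symbols. So, if you use a statement like mov ax, local2 in the procedure, MASM substitutes [RBP-4] for the symbol local2.

Of course, upon entry into the procedure, you must still allocate storage for the local variables on the stack; that is, you must still provide the code for the standard entry (and standard exit) sequence. This means you must add up all the storage needed for the local variables so you can subtract that value from RSP after moving RSP's value into RBP. Once again, this is grunt work that could turn out to be a source of defects in the procedure (if you miscount the number of bytes of local variable storage), so you must take care when manually computing the storage requirements.

MASM does provide a solution (of sorts) for this problem: the option directive. You've seen the option casemap:none, option noscoped, and option scoped directives already; the option directive actually supports a wide array of arguments that control MASM's behavior. Two option operands control procedure code generation when using the local directive: prologue and epilogue. These operands typically take the following two forms: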

option prologue:PrologueDef
option prologue:none
option epilogue:EpilogueDef
option epilogue:none

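By default, MASM assumes prologue:none and epilogue:none. When you set prologue and epilogue to none, MASM does not generate any extra code to support local variable storage allocation and deallocation in a procedure; you are responsible for supplying the standard entry and exit sequences for the procedure.

If you insert option prologue:PrologueDef (default prologue generation) and option epilogue:EpilogueDef (default epilogue generation) into your source file, all following procedures automatically generate the appropriate standard entry and exit sequences for you (assuming local directives are in the procedure). MASM quietly generates the standard entry sequence (the prologue) immediately after the last local directive (and before the first machine instruction) in a procedure, consisting of the usual standard entry sequence instructions: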

push  rbp
mov   rbp, rsp
sub   rsp, `local_size`

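where local_size is a constant specifying the number of bytes of local variable storage plus a (possible) additional amount to keep the stack aligned on a 16-byte boundary. (MASM usually assumes the stack was aligned on a mod 16 == 8 boundary prior to the push rbp instruction.)

For MASM's automatically generated prologue code to work, the procedure must have exactly one entry point. If you define a global statement label as a second entry point, MASM won't know that it is supposed to generate the prologue code at that point. Entering the procedure at that second entry point will create problems unless you explicitly include the standard entry sequence yourself. Moral of the story: procedures should have exactly one entry point.

Generating the standard exit sequence for the epilogue is a bit more problematic. Although it is rare for an assembly language procedure to have more than a single entry point, it's common to have multiple exit points. After all, the exit point is controlled by the programmer's placement of a ret instruction, not by some directive (like endp). MASM deals with multiple exit points by automatically translating any ret instruction it finds into the standard exit sequence: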

leave
ret

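assuming, of course, that option epilogue:EpilogueDef is active.

You can control whether MASM generates prologues (standard entry sequences) and epilogues (standard exit sequences) independently of one another. So if you'd prefer to write the leave instruction yourself (while having MASM generate the standard entry sequence), you can.

One final note about the prologue: and epilogue: options. In addition to specifying prologue:PrologueDef and epilogue:EpilogueDef, you can supply a macro identifier after the prologue: or epilogue: option. If you supply a macro identifier, MASM expands that macro for the standard entry or exit sequence. For more information on macros, see "Macros and the MASM Compile-Time Language" in Chapter 13.

Most of the example programs throughout the remainder of this book continue to use text equates for local variables rather than the local directive, to make the use of the [RBP-constant] addressing mode and local variable offsets more explicit.

5.4.3 Automatic Allocation

One big advantage of automatic storage allocation is that it efficiently shares a fixed pool of memory among several procedures. For example, say you call three procedures in a row, like so: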

call ProcA
call ProcB
call ProcC

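The first procedure (ProcA in this code) allocates its local variables on the stack. Upon return, ProcA deallocates that stack storage. Upon entry into ProcB, the program allocates storage for ProcB's local variables by using the same memory locations just freed by ProcA. Likewise, when ProcB returns and the program calls ProcC, ProcC uses the same stack space for its local variables that ProcB recently freed. This memory reuse makes efficient use of system resources and is probably the greatest advantage to using automatic variables.

Now that you've seen how assembly language allocates and deallocates storage for local variables, it's easy to understand why automatic variables do not maintain their values between two calls to the same procedure. Once the procedure returns to its caller, the storage for the automatic variable is lost, and therefore so is the value. Thus, you must always assume that a local variable is uninitialized upon entry into a procedure. If you need to maintain the value of a variable between calls to a procedure, you should use one of the static variable declaration types.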
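
The same distinction exists in C, which makes it easy to observe. The sketch below is an illustration in C rather than MASM; count_auto and count_static are hypothetical names, and the automatic version initializes its variable explicitly because its storage holds no meaningful value on entry.

```c
/* Automatic variable: storage is allocated on every call,
   so the count starts over each time. */
int count_auto(void)
{
    int calls = 0;          /* must be initialized on every entry */
    calls++;
    return calls;
}

/* Static variable: storage lasts for the whole program run,
   so the count survives between calls. */
int count_static(void)
{
    static int calls = 0;   /* initialized once, before first use */
    calls++;
    return calls;
}
```

Calling count_auto() repeatedly always returns 1, whereas successive calls to count_static() return 1, 2, 3, and so on.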

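5.5 Parameters

Although many procedures are totally self-contained, most require some input data and return some data to the caller. Parameters are values that you pass to and from a procedure. In straight assembly language, passing parameters can be a real chore.

The first thing to consider when discussing parameters is how we pass them to a procedure. If you are familiar with Pascal or C/C++, you've probably seen two ways to pass parameters: pass by value and pass by reference. Anything that can be done in an HLL can be done in assembly language (obviously, because HLL code compiles into machine code), but you have to provide the instruction sequence to access those parameters in an appropriate fashion.

Another concern you will face when dealing with parameters is where you pass them. There are many places to pass parameters: in registers, on the stack, in the code stream, in global variables, or in a combination of these. This chapter covers several of the possibilities.

5.5.1 Pass by Value

A parameter passed by value is just that: the caller passes a value to the procedure. Pass-by-value parameters are input-only parameters. You can pass them to a procedure, but the procedure cannot return values through them. Consider this C/C++ function call: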

CallProc(I);

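If you pass I by value, CallProc() does not change the value of I, regardless of what happens to the parameter inside CallProc().

Because you must pass a copy of the data to the procedure, you should use this method only for passing small objects like bytes, words, double words, and quad words. Passing large arrays and records by value is inefficient (because you must create and pass a copy of the object to the procedure).^(6)

5.5.2 Pass by Reference

To pass a parameter by reference, you must pass the address of a variable rather than its value. In other words, you must pass a pointer to the data. The procedure must dereference this pointer to access the data. Passing parameters by reference is useful when you must modify the actual parameter or when you pass large data structures between procedures. Because pointers on the x86-64 are 64 bits wide, a parameter that you pass by reference will be a quad-word value.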
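
The difference between the two mechanisms can be seen in a few lines of C. This is a minimal sketch with hypothetical function names; C has no reference parameters as such, so the pass-by-reference case is written with an explicit pointer, which is exactly what the assembly version does.

```c
/* Pass by value: the procedure receives a copy of the argument. */
void add_one_by_value(int n)
{
    n = n + 1;              /* changes only the local copy */
}

/* Pass by reference: the procedure receives the argument's address
   and must dereference it to reach the caller's variable. */
void add_one_by_reference(int *n)
{
    *n = *n + 1;            /* changes the caller's variable */
}

/* Returns the value of i after both calls, starting from start. */
int demo_passing(int start)
{
    int i = start;
    add_one_by_value(i);        /* i is unchanged     */
    add_one_by_reference(&i);   /* i is now start + 1 */
    return i;
}
```

Only the call that passes the address of i is able to change it, so demo_passing(5) returns 6.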

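You can compute the address of an object in memory in two common ways: the offset operator or the lea instruction. You can use the offset operator to take the address of any static variable you've declared in your .data, .data?, .const, or .code sections. Listing 5-8 demonstrates how to obtain the address of a static variable (staticVar) and pass that address to a procedure (someFunc) in the RCX register.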

; Listing 5-8

; Demonstrate obtaining the address
; of a static variable using offset
; operator.

            option  casemap:none

            .data
staticVar   dword   ?

            .code
            externdef someFunc:proc

getAddress  proc

            mov     rcx, offset staticVar
            call    someFunc

            ret
getAddress  endp

            end

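Listing 5-8: Using the offset operator to obtain the address of a static variable

Using the offset operator raises a couple of issues. First of all, it can compute the address of only a static variable; you cannot obtain the address of an automatic (local) variable or parameter, nor can you compute the address of a memory reference involving a complex memory addressing mode (for example, [RBX+RDX*1-5]). Another problem is that an instruction like mov rcx, offset staticVar assembles into a large number of bytes (because the offset operator returns a 64-bit constant). If you look at the assembly listing MASM produces (with the /Fl command line option), you can see how big this instruction is: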

00000000  48/ B9                mov    rcx, offset staticVar
           0000000000000000 R
0000000A  E8 00000000 E        call    someFunc

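As you can see here, the mov instruction is 10 (0Ah) bytes long.

You've seen the second way to obtain the address of a variable many times already: the lea instruction (for example, when loading the address of a format string into RCX prior to calling printf()). Listing 5-9 shows the example in Listing 5-8 recoded to use the lea instruction.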

; Listing 5-9

; Demonstrate obtaining the address
; of a variable using the lea instruction.

            option  casemap:none

            .data
staticVar   dword   ?

            .code
            externdef someFunc:proc

getAddress  proc

            lea     rcx, staticVar
            call    someFunc

            ret
getAddress  endp
            end

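Listing 5-9: Obtaining the address of a variable using the lea instruction

Looking at the listing MASM produces for this code, we find that the lea instruction is only 7 bytes long: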

00000000  48/ 8D 0D       lea     rcx, staticVar
           00000000 R
00000007  E8 00000000 E   call    someFunc

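So, if nothing else, your programs will be shorter if you use the lea instruction rather than the offset operator.

Another advantage to using lea is that it accepts any memory addressing mode, not just the name of a static variable. For example, if staticVar were an array of 32-bit integers, you could load the address of the current element (indexed by the RDX register) into RCX by using an instruction such as this: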

lea rcx, staticVar[rdx*4]  ; Assumes LARGEADDRESSAWARE:NO

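Pass by reference is usually less efficient than pass by value. You must dereference all pass-by-reference parameters on each access; this is slower than simply using a value because it typically requires at least two instructions. However, when passing a large data structure, pass by reference is faster because you do not have to copy the whole structure before calling the procedure. Of course, you'd probably need to access elements of that large data structure (for example, an array) by using a pointer anyway, so little efficiency is lost when you pass large arrays by reference.

5.5.3 Low-Level Parameter Implementation

A parameter-passing mechanism is a contract between the caller and the callee (the procedure). Both parties have to agree on where the parameter data will appear and what form it will take (for example, value or address). If your assembly language procedures are being called only by other assembly language code that you've written, you control both sides of the contract and get to decide where and how you're going to pass parameters.

However, if external code is calling your procedure, or your procedure is calling external code, your procedure has to adhere to whatever calling convention that external code uses. On 64-bit Windows systems, that calling convention will undoubtedly be the Windows ABI.

Before discussing the Windows calling conventions, we'll consider the situation for code you call that you've written yourself (and, therefore, whose calling conventions you completely control). The following sections provide insight into the various ways you can pass parameters in pure assembly language code (without the overhead associated with the Microsoft ABI).

5.5.3.1 Passing Parameters in Registers

Having touched on how to pass parameters to a procedure, the next thing to discuss is where to pass them. This depends on the size and number of those parameters. If you are passing a small number of parameters to a procedure, the registers are an excellent place to pass them. If you are passing a single parameter to a procedure, you should use the registers listed in Table 5-1 for the accompanying data types.

Table 5-1: Parameter Location by Size

Data size	Pass in this register
Byte	CL
Word	CX
Double word	ECX
Quad word	RCX

This is not a hard-and-fast rule. However, these registers are convenient because they mesh with the first parameter register in the Microsoft ABI (which is where most people will pass a single parameter).

If you are passing several parameters to a procedure in the x86-64's registers, you should probably use up the registers in the following order: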

First                                           Last
RCX, RDX, R8, R9, R10, R11, RAX, XMM0/YMM0-XMM5/YMM5

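In general, you should pass integer and other non-floating-point values in the general-purpose registers, and floating-point values in the XMMx/YMMx registers. This is not a hard requirement, but Microsoft reserves these registers for passing parameters and for local (volatile) values, so using them to pass parameters won't disturb the Microsoft ABI's nonvolatile registers. Of course, if you intend to have Microsoft ABI-compliant code call your procedure, you must follow the Microsoft calling conventions exactly (see section 5.6, "Calling Conventions and the Microsoft ABI").

Of course, if you're writing pure assembly language code (with no calls to or from any code you didn't write), you can use most of the general-purpose registers as you see fit (RSP is the exception, and you should avoid RBP, but the others are fair game). The same goes for the XMM/YMM registers.

As an example, consider the strfill(s,c) procedure that copies the character c (passed by value in AL) to each character position in s (passed by reference in RDI) up to a zero-terminating byte (Listing 5-10).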

; Listing 5-10

; Demonstrate passing parameters in registers.

            option  casemap:none

            .data
staticVar   dword   ?

            .code
            externdef someFunc:proc

; strfill - Overwrites the data in a string with a character.

;     RDI -  Pointer to zero-terminated string
;            (for example, a C/C++ string).
;      AL -  Character to store into the string.

strfill     proc
            push    rdi     ; Preserve RDI because it changes

; While we haven't reached the end of the string:

whlNot0:    cmp     byte ptr [rdi], 0
            je      endOfStr

; Overwrite character in string with the character
; passed to this procedure in AL:

            mov     [rdi], al

; Move on to the next character in the string and
; repeat this process:

            inc     rdi
            jmp     whlNot0

endOfStr:   pop     rdi
            ret
strfill     endp
            end

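Listing 5-10: Passing parameters in registers to the strfill procedure

To call the strfill procedure, you'd load the address of the string data into RDI and the character value into AL prior to the call. The following code fragment demonstrates a typical call to strfill: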

lea  rdi, stringData ; Load address of string into RDI
mov  al, ' '         ; Fill string with spaces
call strfill

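This code passes the string by reference and the character data by value.

5.5.3.2 Passing Parameters in the Code Stream

Another place where you can pass parameters is in the code stream immediately after the call instruction. Consider the following print routine that prints a literal string constant to the standard output device: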

call print
byte "This parameter is in the code stream.",0

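Normally, a subroutine returns control to the first instruction immediately following the call instruction. Were that to happen here, the x86-64 would attempt to interpret the ASCII codes for "This..." as instructions. This would produce undesirable results. Fortunately, you can skip over this string before returning from the subroutine.

So how do you gain access to these parameters? Easy. The return address on the stack points at them. Consider the implementation of print appearing in Listing 5-11.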

; Listing 5-11

; Demonstration passing parameters in the code stream.

        option  casemap:none

nl          =       10
stdout      =       -11

            .const
ttlStr      byte    "Listing 5-11", 0

            .data
soHandle    qword   ?
bWritten    dword   ?

            .code

            ; Magic equates for Windows API calls:

            extrn __imp_GetStdHandle:qword
            extrn __imp_WriteFile:qword

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Here's the print procedure.
; It expects a zero-terminated string
; to follow the call to print.

print       proc
            push    rbp
            mov     rbp, rsp
            and     rsp, -16         ; Ensure stack is 16-byte-aligned
            sub     rsp, 48          ; Set up stack for MS ABI

; Get the pointer to the string immediately following the
; call instruction and scan for the zero-terminating byte.

            mov     rdx, [rbp+8]     ; Return address is here
            lea     r8, [rdx-1]      ; R8 = return address - 1
search4_0:  inc     r8               ; Move on to next char
            cmp     byte ptr [R8], 0 ; At end of string?
            jne     search4_0

; Fix return address and compute length of string:

            inc     r8               ; Point at new return address
            mov     [rbp+8], r8      ; Save return address
            sub     r8, rdx          ; Compute string length
            dec     r8               ; Don't include 0 byte

; Call WriteFile to print the string to the console:

; WriteFile(fd, bufAdrs, len, &bytesWritten, lpOverlapped);

; Note: pointer to the buffer (string) is already
; in RDX. The len is already in R8. Just need to
; load the file descriptor (handle) into RCX:

            mov     rcx, soHandle    ; 64-bit handle value
            lea     r9, bWritten     ; Address of "bWritten" in R9
            mov     qword ptr [rsp+32], 0 ; lpOverlapped = NULL (5th parm)
            call    __imp_WriteFile

            leave
            ret
print       endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 40

; Call getStdHandle with "stdout" parameter
; in order to get the standard output handle
; we can use to call write. Must set up
; soHandle before first call to print procedure.

            mov     ecx, stdout      ; Zero-extends!
            call    __imp_GetStdHandle
            mov     soHandle, rax    ; Save handle

; Demonstrate passing parameters in code stream
; by calling the print procedure:

            call    print
            byte    "Hello, world!", nl, 0

; Clean up, as per Microsoft ABI:

            leave
            ret     ; Returns to caller

asmMain     endp
            end

Listing 5-11: The print procedure implementation (using code stream parameters)

One quick note about a machine idiom in Listing 5-11. The instruction

lea  r8, [rdx-1]

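isn't really loading an address into R8. Rather, it is an arithmetic instruction that computes R8 = RDX - 1 (with a single instruction rather than the two that would normally be required). This is a common usage of the lea instruction in assembly language programs, so it's a little programming trick you should become comfortable with.

Besides showing how to pass parameters in the code stream, the print routine also exhibits another concept: variable-length parameters. The string following the call can be any practical length. The zero-terminating byte marks the end of the parameter list.

We have two easy ways to handle variable-length parameters: either use a special terminating value (like 0) or pass a special length value that tells the subroutine how many parameters you are passing. Both methods have their advantages and disadvantages.

Using a special value to terminate a parameter list requires that you choose a value that never appears in the list. For example, print uses 0 as the terminating value, so it cannot print the NUL character (whose ASCII code is 0). Sometimes this isn't a limitation. Specifying a length parameter is another mechanism you can use to pass a variable-length parameter list. While this doesn't require any special codes or limit the range of possible values that can be passed to a subroutine, setting up the length parameter and maintaining the resulting code can be a real nightmare.^(8)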
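
The trade-off between the two schemes is easy to demonstrate outside assembly. The following C sketch uses hypothetical helper names and a byte list rather than the code stream; it only illustrates why a terminator restricts the data while an explicit length does not.

```c
#include <stddef.h>

/* Terminator approach: scan until the special value 0.
   The list itself can therefore never contain a 0. */
size_t count_terminated(const unsigned char *list)
{
    size_t n = 0;
    while (list[n] != 0)
        n++;
    return n;
}

/* Length approach: the caller states how many items it passes,
   so any byte value (including 0) may appear in the list. */
unsigned sum_counted(const unsigned char *list, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += list[i];
    return sum;
}
```

Given the 4 bytes 'H', 'i', 0, '!', the terminator version stops after 2 items and never sees the '!', while the counted version processes all 4 as long as the caller supplies the right length.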

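Despite the convenience afforded by passing parameters in the code stream, this approach has disadvantages. First, if you fail to provide the exact number of parameters the procedure requires, the subroutine will get confused. Consider the print example. It prints a string of characters up to a zero-terminating byte and then returns control to the first instruction following that byte. If you leave off the zero-terminating byte, the print routine happily prints the following opcode bytes as ASCII characters until it finds a zero byte. Because zero bytes often appear in the middle of an instruction, the print routine might return control into the middle of some other instruction, which will probably crash the machine.

Inserting an extra 0, which occurs more often than you might think, is another problem programmers have with the print routine. In such a case, the print routine would return upon encountering the first zero byte and attempt to execute the following ASCII characters as machine code. Problems notwithstanding, the code stream is an efficient place to pass parameters whose values do not change.

5.5.3.3 Passing Parameters on the Stack

Most high-level languages use the stack to pass a large number of parameters because this method is fairly efficient. Although passing parameters on the stack is slightly less efficient than passing them in registers, the register set is limited (especially if you're limiting yourself to the four registers the Microsoft ABI sets aside for this purpose), and you can pass only a few value or reference parameters through registers. The stack, on the other hand, allows you to pass a large amount of parameter data without difficulty. This is the reason most programs pass their parameters on the stack (at least, when passing more than about three to six parameters).

To manually pass parameters on the stack, push them immediately before calling the subroutine. The subroutine then reads this data from the stack memory and operates on it appropriately. Consider the following high-level language function call:

CallProc(i,j,k);

Back in the days of 32-bit assembly language, you could have passed these parameters to CallProc by using an instruction sequence such as the following: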

push  k  ; Assumes i, j, and k are all 32-bit
push  j  ; variables
push  i  
call  CallProc

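Unfortunately, with the advent of the x86-64 64-bit CPU, the 32-bit push instruction was removed from the instruction set (the 64-bit push instruction replaced it). If you want to pass parameters to a procedure by using the push instruction, they must be 64-bit operands.^(9)

Because keeping RSP aligned on an appropriate boundary (8 or 16 bytes) is crucial, the Microsoft ABI simply requires that every parameter consume 8 bytes on the stack, and thus doesn't allow larger arguments on the stack. If you're controlling both sides of the parameter contract (caller and callee), you can pass larger arguments to your procedures. However, it is a good idea to ensure that all parameter sizes are a multiple of 8 bytes.

One simple solution is to make all your variables qword objects. Then you can directly push them onto the stack by using the push instruction prior to calling a procedure. However, not all objects fit nicely into 64 bits (characters, for example). Even those objects that could be 64 bits (for example, integers) often don't require that much storage.

One sneaky way to use the push instruction on smaller objects is to use type coercion. Consider the following calling sequence for CallProc: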

push  qword ptr k
push  qword ptr j
push  qword ptr i
call  CallProc

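This sequence pushes the 64-bit values starting at the addresses associated with variables i, j, and k, regardless of the size of these variables. If i, j, and k are smaller objects (perhaps 32-bit integers), these push instructions will push their values onto the stack along with additional data beyond these variables. As long as CallProc treats these parameter values as their actual size (say, 32 bits) and ignores the HO bits pushed for each argument, this will usually work out correctly.

Pushing extra data beyond the bounds of the variable onto the stack creates one possible problem. If the variable is at the very end of a page in memory and the following page is not readable, then pushing data beyond the variable may attempt to read from that next page, resulting in a memory access violation (which will crash your program). Therefore, if you use this technique, you must ensure that such variables do not appear at the very end of a memory page (with the possibility that the next page in memory is inaccessible). The easiest way to do this is to make sure the variables you push on the stack in this fashion are never the last variables you declare in your data sections; for example: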

i    dword ?
j    dword ?
k    dword ?
pad  qword ?  ; Ensures that there are at least 64 bits
              ; beyond the k variable

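Although pushing extra data beyond a variable will work, it's still a questionable programming practice. A better technique is to abandon the push instructions altogether and use a different technique to move the parameter data onto the stack.

Another way to "push" data onto the stack is to drop the RSP register down an appropriate amount in memory and then simply move data onto the stack by using a mov (or similar) instruction. Consider the following calling sequence for CallProc: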

sub  rsp, 12
mov  eax, k
mov  [rsp+8], eax
mov  eax, j
mov  [rsp+4], eax
mov  eax, i
mov  [rsp], eax
call CallProc

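Although this takes twice as many instructions as the previous examples (eight versus four), this sequence is safe (no possibility of accessing inaccessible memory pages). Furthermore, it pushes exactly the amount of data needed for the parameters onto the stack (32 bits for each object, for a total of 12 bytes).

The major problem with this approach is that it is a really bad idea to have an address in the RSP register that is not aligned on an 8-byte boundary. In the worst case, a stack that is not 8-byte-aligned will crash your program; in the very best case, it will affect the performance of your program. So even if you want to pass the parameters as 32-bit integers, you should always allocate a multiple of 8 bytes for parameters on the stack prior to a call. The previous example would be encoded as follows: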

sub  rsp, 16   ; Allocate a multiple of 8 bytes
mov  eax, k
mov  [rsp+8], eax
mov  eax, j
mov  [rsp+4], eax
mov  eax, i
mov  [rsp], eax
call CallProc

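Note that CallProc will simply ignore the extra 4 bytes allocated on the stack in this fashion (don't forget to remove this extra storage from the stack on return).

To satisfy the requirement of the Microsoft ABI (and, in fact, of most application binary interfaces for the x86-64 CPUs) that each parameter consume exactly 8 bytes (even if its native data size is smaller), you can use the following code (same number of instructions, just a little more stack space):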

sub  rsp, 24   ; Allocate a multiple of 8 bytes
mov  eax, k
mov  [rsp+16], eax
mov  eax, j
mov  [rsp+8], eax
mov  eax, i
mov  [rsp], eax
call CallProc

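The mov instructions spread out the data on 8-byte boundaries. The HO dword of each 64-bit entry on the stack will contain garbage (whatever data was in stack memory prior to this sequence). That's okay; the CallProc procedure (presumably) will ignore that extra data and operate only on the LO 32 bits of each parameter value.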
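
The "garbage in the HO dword" behavior can be modeled with plain integer arithmetic. In this C sketch each stack slot is represented as a 64-bit value; store_parm32 and load_parm32 are hypothetical helpers that mimic a 32-bit mov into the low half of a slot and the callee's 32-bit read of it.

```c
#include <stdint.h>

/* Store a 32-bit parameter into the LO half of an 8-byte stack slot,
   keeping whatever was already in the HO half (the "garbage").
   Returns the new contents of the slot. */
uint64_t store_parm32(uint64_t slot, uint32_t value)
{
    return (slot & 0xFFFFFFFF00000000ULL) | value;
}

/* What the callee should do: look only at the LO 32 bits. */
uint32_t load_parm32(uint64_t slot)
{
    return (uint32_t)(slot & 0xFFFFFFFFULL);
}
```

If a slot previously held 0DEADBEEFCAFEF00Dh and the caller stores the value 20, the slot becomes 0DEADBEEF00000014h. The callee still reads 20, provided it looks at only the low 32 bits.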

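Upon entry into CallProc, using this sequence, the x86-64's stack looks like Figure 5-7.

Figure 5-7: Stack layout upon entry into CallProc

If your procedure includes the standard entry and exit sequences, you may directly access the parameter values in the activation record by indexing off the RBP register. Consider the layout of the activation record for CallProc when it uses the following declaration: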

CallProc proc
         push  rbp      ; This is the standard entry sequence
         mov   rbp, rsp ; Get base address of activation record into RBP
          .
          .
          .
         leave
         ret   24

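Assuming you've pushed three quad-word parameters onto the stack, the stack should look something like Figure 5-8 immediately after the execution of mov rbp, rsp in CallProc.

Now you can access the parameters by indexing off the RBP register: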

mov eax, [rbp+32]    ; Accesses the k parameter
mov ebx, [rbp+24]    ; Accesses the j parameter
mov ecx, [rbp+16]    ; Accesses the i parameter

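Figure 5-8: Activation record for CallProc after the standard entry sequence executes

5.5.3.4 Accessing Value Parameters on the Stack

Accessing parameters passed by value is no different from accessing a local variable object. One way to accomplish this is by using equates, as was demonstrated for local variables earlier. Listing 5-12 provides an example program whose procedure accesses a parameter that the main program passes to it by value.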

; Listing 5-12

; Accessing a parameter on the stack.

        option  casemap:none

nl          =       10
stdout      =       -11

            .const
ttlStr      byte    "Listing 5-12", 0
fmtStr1     byte    "Value of parameter: %d", nl, 0

            .data
value1      dword   20
value2      dword   30

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

theParm     equ     <[rbp+16]>
ValueParm   proc
            push    rbp
            mov     rbp, rsp

            sub     rsp, 32         ; "Magic" instruction

            lea     rcx, fmtStr1
            mov     edx, theParm
            call    printf

            leave
            ret
ValueParm   endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 40

            mov     eax, value1
            mov     [rsp], eax      ; Store parameter on stack
            call    ValueParm

            mov     eax, value2
            mov     [rsp], eax
            call    ValueParm

; Clean up, as per Microsoft ABI:

            leave
            ret                     ; Returns to caller

asmMain     endp
            end

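Listing 5-12: Demonstration of value parameters

Although you could access the value of theParm by using the anonymous address [RBP+16] within your code, using the equate in this fashion makes your code more readable and maintainable.

5.5.4 Declaring Parameters with the `proc` Directive

MASM provides another solution for declaring parameters for procedures using the proc directive. You can supply a list of parameters as operands to the proc directive, as follows:

`proc_name`  proc  `parameter_list`

where parameter_list is a list of one or more parameter declarations separated by commas. Each parameter declaration takes the form

`parm_name`:`type`

where parm_name is a valid MASM identifier, and type is one of the usual MASM types (proc, byte, word, dword, and so forth). With one exception, the parameter list declarations are the same as the local directive's operands: MASM does not allow arrays as parameters. (MASM parameters assume that the Microsoft ABI is being used, and the Microsoft ABI allows only 64-bit parameters.)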

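The parameter declarations appearing as proc operands assume that a standard entry sequence is executed and that the program will access parameters off the RBP register, with the saved RBP and return address values at offsets 0 and 8 from RBP (so the first parameter starts at offset 16). MASM assigns offsets for each parameter that are 8 bytes apart (per the Microsoft ABI). As an example, consider the following parameter declaration: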

procWithParms proc  k:byte, j:word, i:dword
                .
                .
                .
procWithParms endp

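k will have the offset [RBP+16], j will have the offset [RBP+24], and i will have the offset [RBP+32]. Again, the offsets are always 8 bytes apart, regardless of the parameter's data type.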
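
The offset rule reduces to a one-line formula. The C helper below is a hypothetical illustration of that arithmetic (it is not something MASM exposes); it assumes the standard entry sequence, so that the saved RBP and the return address occupy the first 16 bytes above RBP.

```c
/* Offset from RBP of the parameter at the given zero-based position
   in the proc operand list, assuming the standard entry sequence:
   [RBP+0] = saved RBP, [RBP+8] = return address, then 8-byte slots. */
int parm_offset(int position)
{
    return 16 + 8 * position;
}
```

For procWithParms, positions 0, 1, and 2 (k, j, and i) give 16, 24, and 32, matching the offsets listed above.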

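As per the Microsoft ABI, MASM will allocate storage on the stack for the first four parameters, even though you would normally pass these parameters in RCX, RDX, R8, and R9. These 32 bytes of storage (starting at RBP+16) are called shadow storage in Microsoft ABI nomenclature. Upon entry into the procedure, the parameter values do not appear in this shadow storage (instead, the values are in the registers). The procedure can save the register values in this preallocated storage, or it can use the shadow storage for any purpose it desires (such as for additional local variable storage). However, if the procedure refers to the parameter names declared in the proc operand field, expecting to access the parameter data, the procedure should store the values from these registers into that shadow storage (assuming the parameters were passed in the RCX, RDX, R8, and R9 registers). Of course, if you push these arguments on the stack prior to the call (in assembly language, ignoring the Microsoft ABI calling convention), then the data is already in place, and you don't have to worry about shadow storage issues.

When calling a procedure whose parameters you declare in the operand field of a proc directive, don't forget that MASM assumes you push the parameters onto the stack in the reverse order in which they appear in the parameter list, to ensure that the first parameter in the list is at the lowest memory address on the stack. For example, if you call the procWithParms procedure from the preceding code snippet, you'd typically use code like the following to push the parameters: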

mov   eax, dwordValue
push  rax             ; Parms are always 64 bits
mov   ax, wordValue
push  rax
mov   al, byteValue
push  rax
call  procWithParms

Another possible solution (a few bytes longer, but often faster) is to use the following code:

sub   rsp, 24         ; Reserve storage for parameters
mov   eax, dwordValue ; i
mov   [rsp+16], eax
mov   ax, wordValue
mov   [rsp+8], ax     ; j
mov   al, byteValue
mov   [rsp], al       ; k
call  procWithParms

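Don't forget, if it is the caller's responsibility to clean up the stack, you'd probably use an add rsp, 24 instruction after either of the two preceding sequences to remove the parameters from the stack. Of course, you can also have the procedure itself clean up the stack by specifying the number to add to RSP as a ret instruction operand, as explained earlier in this chapter.

5.5.5 Accessing Reference Parameters on the Stack

Because you pass the addresses of objects as reference parameters, accessing the reference parameters within a procedure is slightly more difficult than accessing value parameters: you have to dereference the pointers to the reference parameters.

In Listing 5-13, the RefParm procedure has a single pass-by-reference parameter. A pass-by-reference parameter is always a (64-bit) pointer to an object. To access the value associated with the parameter, this code has to load that quad-word address into a 64-bit register and access the data indirectly. The mov rax, theParm instruction in Listing 5-13 fetches this pointer into the RAX register, and then procedure RefParm uses the [RAX] addressing mode to access the actual value of theParm.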

; Listing 5-13

; Accessing a reference parameter on the stack.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 5-13", 0
fmtStr1     byte    "Value of parameter: %d", nl, 0

            .data
value1      dword   20
value2      dword   30

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

theParm     equ     <[rbp+16]> 
RefParm     proc
            push    rbp
            mov     rbp, rsp

            sub     rsp, 32         ; "Magic" instruction

            lea     rcx, fmtStr1
            mov     rax, theParm    ; Dereference parameter
            mov     edx, [rax]
            call    printf

            leave
            ret
RefParm     endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 40

            lea     rax, value1
            mov     [rsp], rax      ; Store address on stack
            call    RefParm

            lea     rax, value2
            mov     [rsp], rax
            call    RefParm

; Clean up, as per Microsoft ABI:

            leave
            ret     ; Returns to caller

asmMain     endp
            end

Listing 5-13: Accessing a reference parameter

Here are the build commands and program output for Listing 5-13:

C:\>**build listing5-13**

C:\>**echo off**
 Assembling: listing5-13.asm
c.cpp

C:\>**listing5-13**
Calling Listing 5-13:
Value of parameter: 20
Value of parameter: 30
Listing 5-13 terminated

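As you can see, accessing (small) pass-by-reference parameters is a little less efficient than accessing value parameters because you need an extra instruction to load the address into a 64-bit pointer register (not to mention you have to reserve a 64-bit register for this purpose). If you access reference parameters frequently, these extra instructions can really begin to add up, reducing the efficiency of your program. Furthermore, it's easy to forget to dereference a reference parameter and use the address of the value in your calculations. Therefore, unless you really need to affect the value of the actual parameter, you should use pass by value to pass small objects to a procedure.

Passing large objects, like arrays and records, is where using reference parameters becomes efficient. When passing these objects by value, the calling code has to make a copy of the actual parameter; if it is a large object, the copy process can be inefficient. Because computing the address of a large object is just as efficient as computing the address of a small scalar object, no efficiency is lost when passing large objects by reference. Within the procedure, you must still dereference the pointer to access the object, but the efficiency loss due to indirection is minimal when compared with the cost of copying that large object. The program in Listing 5-14 demonstrates how to use pass by reference to initialize an array of records.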

; Listing 5-14

; Passing a large object by reference.

        option  casemap:none

nl          =       10
NumElements =       24

Pt          struct
x           byte    ?
y           byte    ?
Pt          ends

            .const
ttlStr      byte    "Listing 5-14", 0
fmtStr1     byte    "RefArrayParm[%d].x=%d ", 0
fmtStr2     byte    "RefArrayParm[%d].y=%d", nl, 0

            .data
index       dword   ?
Pts         Pt      NumElements dup ({})

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

ptArray     equ     <[rbp+16]> 
RefAryParm  proc
            push    rbp
            mov     rbp, rsp

            mov     rdx, ptArray
            xor     rcx, rcx        ; RCX = 0

; While ECX < NumElements, initialize each
; array element. x = ECX/8, y = ECX % 8.

ForEachEl:  cmp     ecx, NumElements
            jnl     LoopDone

            mov     al, cl
            shr     al, 3           ; AL = ECX / 8
            mov     [rdx][rcx*2].Pt.x, al

            mov     al, cl
            and     al, 111b        ; AL = ECX % 8
            mov     [rdx][rcx*2].Pt.y, al
            inc     ecx
            jmp     ForEachEl

LoopDone:   leave
            ret
RefAryParm  endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 40

; Initialize the array of points:

            lea     rax, Pts
            mov     [rsp], rax      ; Store address on stack
            call    RefAryParm

; Display the array:

            mov     index, 0
dispLp:     cmp     index, NumElements
            jnl     dispDone

            lea     rcx, fmtStr1
            mov     edx, index              ; Zero-extends!
            lea     r8, Pts                 ; Get array base
            movzx   r8, [r8][rdx*2].Pt.x    ; Get x field
            call    printf

            lea     rcx, fmtStr2
            mov     edx, index              ; Zero-extends!
            lea     r8, Pts                 ; Get array base
            movzx   r8, [r8][rdx*2].Pt.y    ; Get y field
            call    printf

            inc     index
            jmp     dispLp

; Clean up, as per Microsoft ABI:

dispDone:
            leave
            ret     ; Returns to caller

asmMain     endp
            end

Listing 5-14: Passing an array of records by reference

Here are the build commands and output for Listing 5-14:

C:\>**build listing5-14**

C:\>**echo off**
 Assembling: listing5-14.asm
c.cpp

C:\>**listing5-14**
Calling Listing 5-14:
RefArrayParm[0].x=0 RefArrayParm[0].y=0
RefArrayParm[1].x=0 RefArrayParm[1].y=1
RefArrayParm[2].x=0 RefArrayParm[2].y=2
RefArrayParm[3].x=0 RefArrayParm[3].y=3
RefArrayParm[4].x=0 RefArrayParm[4].y=4
RefArrayParm[5].x=0 RefArrayParm[5].y=5
RefArrayParm[6].x=0 RefArrayParm[6].y=6
RefArrayParm[7].x=0 RefArrayParm[7].y=7
RefArrayParm[8].x=1 RefArrayParm[8].y=0
RefArrayParm[9].x=1 RefArrayParm[9].y=1
RefArrayParm[10].x=1 RefArrayParm[10].y=2
RefArrayParm[11].x=1 RefArrayParm[11].y=3
RefArrayParm[12].x=1 RefArrayParm[12].y=4
RefArrayParm[13].x=1 RefArrayParm[13].y=5
RefArrayParm[14].x=1 RefArrayParm[14].y=6
RefArrayParm[15].x=1 RefArrayParm[15].y=7
RefArrayParm[16].x=2 RefArrayParm[16].y=0
RefArrayParm[17].x=2 RefArrayParm[17].y=1
RefArrayParm[18].x=2 RefArrayParm[18].y=2
RefArrayParm[19].x=2 RefArrayParm[19].y=3
RefArrayParm[20].x=2 RefArrayParm[20].y=4
RefArrayParm[21].x=2 RefArrayParm[21].y=5
RefArrayParm[22].x=2 RefArrayParm[22].y=6
RefArrayParm[23].x=2 RefArrayParm[23].y=7
Listing 5-14 terminated

As you can see from this example, passing large objects by reference is very efficient.
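
For comparison, here is roughly what the same initialization looks like in C, where passing an array always passes its address. This is a sketch, not a translation the book provides: init_points and point_after_init are hypothetical names, and the element count is passed as an extra argument rather than taken from a NumElements constant.

```c
#include <stddef.h>

typedef struct {
    unsigned char x;
    unsigned char y;
} Pt;

/* The array is passed by reference: only its address and element
   count cross the call, and the callee writes through the pointer.
   Element i gets x = i / 8 and y = i % 8, as in RefAryParm. */
void init_points(Pt *pts, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        pts[i].x = (unsigned char)(i >> 3);   /* i / 8 */
        pts[i].y = (unsigned char)(i & 7);    /* i % 8 */
    }
}

/* Convenience wrapper for checking one element of a 24-entry array. */
Pt point_after_init(size_t index)
{
    Pt pts[24];
    init_points(pts, 24);
    return pts[index];
}
```

No copy of the 48-byte array is ever made; element 13, for instance, ends up with x = 1 and y = 5, in agreement with the program output above.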

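5.6 Calling Conventions and the Microsoft ABI

Back in the days of 32-bit programs, different compilers and languages typically used completely different parameter-passing conventions. As a result, a program written in Pascal could not call a C/C++ function (at least, not using the native Pascal parameter-passing conventions). Similarly, C/C++ programs couldn't call FORTRAN, BASIC, or functions written in other languages without special help from the programmer. It was literally a Tower of Babel situation, as the languages were incompatible with one another.^(10)

To resolve these problems, CPU manufacturers such as Intel devised a set of protocols called the application binary interface (ABI) to provide conformity to procedure calls. Languages that conformed to the CPU manufacturer's ABI could call functions and procedures written in other languages that also conformed to the same ABI. This brought a modicum of sanity to the world of programming language interoperability.

For programs running on Windows, Microsoft took a subset of the Intel ABI and created the Microsoft calling convention (which most people call the Microsoft ABI). The next section covers the Microsoft calling conventions in detail. However, first it's worthwhile to discuss many of the other calling conventions that existed prior to the Microsoft ABI.^(11)

One of the older formal calling conventions is the Pascal calling convention. In this convention, a caller pushes parameters on the stack in the order that they appear in the actual parameter list (from left to right). On the 80x86/x86-64 CPUs, where the stack grows down in memory, the first parameter winds up at the highest address on the stack, and the last parameter winds up at the lowest address on the stack.

While it might look like the parameters appear backward on the stack, the computer doesn't really care. After all, the procedure will access the parameters by using a numeric offset, and it doesn't care about the offset's value.^(12) On the other hand, for simple compilers, it's much easier to generate code that pushes the parameters in the order they appear in the source file, so the Pascal calling convention makes life a little easier for compiler writers (though optimizing compilers often rearrange the code anyway).

Another feature of the Pascal calling convention is that the callee (the procedure itself) is responsible for removing parameter data from the stack upon subroutine return. This localizes the cleanup code to the procedure so that parameter cleanup isn't repeated across every call to the procedure.

The big drawback to the Pascal calling sequence is that handling variable parameter lists is difficult. If one call to a procedure has three parameters, and a second call has four parameters, the offset to the first parameter will vary depending on the actual number of parameters. Furthermore, it's more difficult (though certainly not impossible) for a procedure to clean up the stack after itself if the number of parameters varies. This is not an issue for Pascal programs, as standard Pascal does not allow user-written procedures and functions to have varying parameter lists. For languages like C/C++, however, this is an issue.

Because C (and other C-based programming languages) supports varying parameter lists (for example, the printf() function), C adopted a different calling convention: the C calling convention, also known as the cdecl calling convention. In C, the caller pushes parameters on the stack in the reverse order that they appear in the actual parameter list. So, it pushes the last parameter first and pushes the first parameter last. Because the stack is a LIFO data structure, the first parameter winds up at the lowest address on the stack (and at a fixed offset from the return address, typically right above it; this is true regardless of how many actual parameters appear on the stack). Also, because C supports varying parameter lists, it is up to the caller to clean up the parameters on the stack after the return from the function.

The third common calling convention in use on 32-bit Intel machines, STDCALL, is basically a combination of the Pascal and C/C++ calling conventions. Parameters are passed right to left (as in C/C++). However, the callee is responsible for cleaning up the parameters on the stack before returning.

One problem with these three calling conventions is that they all use only memory to pass their parameters to a procedure. Of course, the most efficient place to pass parameters is in machine registers. This led to a fourth common calling convention known as the FASTCALL calling convention. In this convention, the calling program passes parameters in registers to a procedure. However, as registers are a limited resource on most CPUs, the FASTCALL calling convention typically passes only the first three to six parameters in registers. If more parameters are needed, the FASTCALL convention passes the remaining parameters on the stack (typically in reverse order, like the C/C++ and STDCALL calling conventions).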
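
The variable-argument problem with the Pascal convention shows up clearly if you compute parameter offsets. The C sketch below models a 32-bit activation record under two simplifying assumptions that are mine, not the text's: every parameter occupies one 4-byte slot, and the procedure uses the standard entry sequence. The function names are hypothetical.

```c
/* 32-bit model: after "push ebp / mov ebp, esp", the saved EBP is at
   [EBP+0], the return address at [EBP+4], and 4-byte parameter slots
   start at [EBP+8]. index is zero-based, left to right in the call. */

/* Pascal: pushed left to right, so the LAST parameter is lowest. */
int pascal_offset(int index, int count)
{
    return 8 + 4 * (count - 1 - index);
}

/* cdecl/STDCALL: pushed right to left, so the FIRST is lowest. */
int cdecl_offset(int index)
{
    return 8 + 4 * index;
}
```

Under this model the first parameter of a Pascal call sits at offset 16 when there are three parameters but at offset 20 when there are four, whereas with cdecl it is at offset 8 no matter how many follow. That fixed position is what lets a function like printf() find its format string.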

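5.7 The Microsoft ABI and Microsoft Calling Convention

This chapter has repeatedly referred to the Microsoft ABI. Now it's time to formally describe the Microsoft calling convention.

5.7.1 Data Types and the Microsoft ABI

As noted in the "Microsoft ABI Notes" sections in Chapters 1, 3, and 4, the native data type sizes are 1, 2, 4, and 8 bytes (see Table 1-6 in Chapter 1). All such variables should be aligned in memory on their native size.

For parameters, all procedure/function parameters must consume exactly 64 bits. If a data object is smaller than 64 bits, the HO bits of the parameter value (the bits beyond the actual parameter's native size) are undefined (and not guaranteed to be zero). Procedures should access only the actual data bits for the parameter's native type and ignore the HO bits.

If a parameter's native type is larger than 64 bits, the Microsoft ABI requires the caller to pass the parameter by reference rather than by value (that is, the caller must pass the address of the data).

5.7.2 Parameter Locations

The Microsoft ABI uses a variant of the FASTCALL calling convention that requires the caller to pass the first four parameters in registers. Table 5-2 lists the register locations for these parameters.

Table 5-2: FASTCALL Parameter Locations

Parameter	If scalar/reference	If floating point
1	RCX	XMM0
2	RDX	XMM1
3	R8	XMM2
4	R9	XMM3
5 to n	On stack, right to left	On stack, right to left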

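If the procedure has floating-point parameters, the calling convention skips the use of the general-purpose register for that same parameter location. Say you have the following C/C++ function: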

void someFunc(int a, double b, char *c, double d)

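Then the Microsoft calling convention would expect the caller to pass a in (the LO 32 bits of) RCX, b in XMM1, a pointer to c in R8, and d in XMM3, skipping RDX, R9, XMM0, and XMM2. This rule has an exception: for vararg (variable number of parameters) or unprototyped functions, floating-point values must be duplicated in the corresponding general-purpose register (see docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-160#parameter-passing/).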
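
The position-based rule in Table 5-2 can be written as a small lookup. The following C sketch uses hypothetical function names and returns register names purely as labels; it covers only the basic rule and ignores the vararg duplication exception just mentioned.

```c
#include <string.h>

/* Where the Microsoft x64 convention places the parameter at a given
   one-based position. is_float is nonzero for float/double values.
   The returned names are just labels for this illustration. */
const char *ms_x64_parm_location(int position, int is_float)
{
    static const char *const int_regs[4] = { "RCX", "RDX", "R8", "R9" };
    static const char *const flt_regs[4] = { "XMM0", "XMM1", "XMM2", "XMM3" };

    if (position < 1)
        return "invalid";
    if (position > 4)
        return "stack";
    return is_float ? flt_regs[position - 1] : int_regs[position - 1];
}

/* Nonzero if the parameter at that position travels in the named place. */
int parm_goes_in(int position, int is_float, const char *where)
{
    return strcmp(ms_x64_parm_location(position, is_float), where) == 0;
}
```

For someFunc(int a, double b, char *c, double d), the lookup yields RCX, XMM1, R8, and XMM3 for positions 1 through 4. The register is chosen by position alone, which is why RDX, R9, XMM0, and XMM2 go unused.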

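Although the Microsoft calling convention passes the first four parameters in registers, it still requires the caller to allocate storage on the stack for these parameters (shadow storage).^(13) In fact, the Microsoft calling convention requires the caller to allocate storage for four parameters on the stack even if the procedure doesn't have four parameters (or any parameters at all). The caller doesn't need to copy the parameter data into this stack storage area; leaving the parameter data only in the registers is sufficient. However, that stack space must be present. Microsoft compilers assume the stack space is there and will use it to save the register values (for example, if the procedure calls another procedure and needs to preserve the register values across that other call). Sometimes Microsoft's compilers use this shadow storage as local variables.

If you're calling an external function (such as a C/C++ library function) that adheres to the Microsoft calling convention and you do not allocate the shadow storage, the application will almost certainly crash.

5.7.3 Volatile and Nonvolatile Registers

As noted way back in Chapter 1, the Microsoft ABI declares certain registers to be volatile and others to be nonvolatile. Volatile means that a procedure can modify the contents of the register without preserving its value. Nonvolatile means that a procedure must preserve a register's value if it modifies that value. Table 5-3 lists the registers and their volatility.

Table 5-3: Register Volatility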

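Register	Volatile/nonvolatile
RAX	Volatile
RBX	Nonvolatile
RCX	Volatile
RDX	Volatile
RDI	Nonvolatile
RSI	Nonvolatile
RBP	Nonvolatile
RSP	Nonvolatile
R8	Volatile
R9	Volatile
R10	Volatile
R11	Volatile
R12	Nonvolatile
R13	Nonvolatile
R14	Nonvolatile
R15	Nonvolatile
XMM0/YMM0	Volatile
XMM1/YMM1	Volatile
XMM2/YMM2	Volatile
XMM3/YMM3	Volatile
XMM4/YMM4	Volatile
XMM5/YMM5	Volatile
XMM6/YMM6	XMM6 nonvolatile, upper half of YMM6 volatile
XMM7/YMM7	XMM7 nonvolatile, upper half of YMM7 volatile
XMM8/YMM8	XMM8 nonvolatile, upper half of YMM8 volatile
XMM9/YMM9	XMM9 nonvolatile, upper half of YMM9 volatile
XMM10/YMM10	XMM10 nonvolatile, upper half of YMM10 volatile
XMM11/YMM11	XMM11 nonvolatile, upper half of YMM11 volatile
XMM12/YMM12	XMM12 nonvolatile, upper half of YMM12 volatile
XMM13/YMM13	XMM13 nonvolatile, upper half of YMM13 volatile
XMM14/YMM14	XMM14 nonvolatile, upper half of YMM14 volatile
XMM15/YMM15	XMM15 nonvolatile, upper half of YMM15 volatile
FPU	Volatile, but the FPU stack must be empty upon return
Direction flag	Must be cleared upon return

It is perfectly reasonable to use nonvolatile registers within a procedure. However, you must preserve those register values so that they are unchanged upon return from the function. If you're not using the shadow storage for anything else, this is a good place to save and restore nonvolatile register values; for example: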

someProc  proc
          push  rbp
          mov   rbp, rsp
          mov   [rbp+16], rbx    ; Save RBX in parm 1's shadow
           .
           .  ; Procedure's code
           .
          mov    rbx, [rbp+16]   ; Restore RBX from shadow
          leave
          ret
someProc  endp

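Of course, if you're using the shadow storage for another purpose, you can always save nonvolatile register values in local variables, or you can even push and pop the register values: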

someProc  proc        ; Save RBX via push
          push  rbx   ; Note that this affects parm offsets
          push  rbp
          mov   rbp, rsp
           .
           .  ; Procedure's code
           .
          leave
          pop   rbx   ; Restore RBX from stack
          ret
someProc  endp

someProc2 proc        ; Save RBX in a local
          push  rbp
          mov   rbp, rsp
          sub   rsp, 16       ; Keep stack aligned
          mov   [rbp-8], rbx  ; Save RBX
           .
           .  ; Procedure's code
           .
          mov   rbx, [rbp-8]  ; Restore RBX
          leave
          ret
someProc2 endp

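5.7.4 Stack Alignment

As I've mentioned many times now, the Microsoft ABI requires the stack to be aligned on a 16-byte boundary whenever you make a call to a procedure. When Windows transfers control to your assembly code (or when other Windows ABI-compliant code calls your assembly code), you're guaranteed that the stack will be aligned on an 8-byte boundary that is not also a 16-byte boundary (because the return address consumed 8 bytes after the stack was 16-byte-aligned). If, within your assembly code, you don't care about 16-byte alignment, you can do anything you like with the stack (however, you should keep it aligned on at least an 8-byte boundary).

On the other hand, if you ever plan to call code that uses the Microsoft calling convention, you need to make sure the stack is properly aligned before the call. There are two ways to do this: carefully manage any modifications to the RSP register after entry into your code (so you know the stack is 16-byte-aligned whenever you make a call), or force the stack to an appropriate alignment prior to making a call. Forcing alignment to 16 bytes is easily achieved using this instruction:

and rsp, -16

However, you must execute this instruction before setting up parameters for a call. If you execute it immediately before the call instruction (but after placing all the parameters on the stack), it could shift RSP down in memory, and then the parameters will not be at the expected offsets upon entry into the procedure.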

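Suppose you don't know the state of RSP and need to call a procedure that expects five parameters (40 bytes, which is not a multiple of 16 bytes). Here's a typical calling sequence you would use:

  sub rsp, 40  ; Make room for 4 shadow parms plus a 5th parm
  and rsp, -16 ; Guarantee RSP is now 16-byte-aligned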

; Code to move four parameters into registers and the
; 5th parameter to location [RSP+32]:

  mov rcx, parm1
  mov rdx, parm2
  mov r8,  parm3
  mov r9,  parm4
  mov rax, parm5
  mov [rsp+32], rax
  call procWith5Parms

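The only problem with this code is that it is hard to clean up the stack upon return (because you don't know exactly how many bytes you reserved on the stack as a result of the and instruction). However, as you'll see in the next section, you'll rarely clean up the stack after an individual procedure call, so you don't have to worry about the stack cleanup here.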
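The effect of and rsp, -16 is ordinary integer arithmetic, so it can be modeled in C. The following is only a sketch; the helper names align_down_16 and bytes_skipped are assumptions made for this illustration and do not come from the book's code. It shows why the number of bytes the and instruction reserves is not known in advance: it depends on the low 4 bits of the incoming RSP value.

```c
#include <stdint.h>

/* Model of "and rsp, -16": clear the low 4 bits of a
   stack-pointer value, rounding it down to a multiple of 16.
   (Hypothetical helper name, used only for illustration.) */
uint64_t align_down_16(uint64_t rsp)
{
    return rsp & ~(uint64_t)15;   /* -16 is ...FFF0h in two's complement */
}

/* Number of bytes by which the "and" instruction lowered RSP. */
uint64_t bytes_skipped(uint64_t rsp)
{
    return rsp - align_down_16(rsp);
}
```

If RSP is kept 8-byte-aligned, as the text recommends, bytes_skipped can only be 0 or 8, but the caller still cannot tell which without inspecting RSP.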

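5.7.5 Parameter Setup and Cleanup (or "What's with These Magic Instructions?")

The Microsoft ABI requires the caller to set up the parameters and then clean them up (remove them from the stack) upon return from the function. In theory, this means that a call to a Microsoft ABI-compliant function is going to look something like the following: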

; Make room for parameters. `parm_size` is a constant
; with the number of bytes of parameters required
; (including 32 bytes for the shadow parameters).

  sub rsp, `parm_size`

  `Code that copies parameters to the stack`

  call procedure

; Clean up the stack after the call:

  add rsp, `parm_size`

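There are two problems with this allocation and cleanup sequence. First, you have to repeat the sequence (sub rsp, parm_size and add rsp, parm_size) for every call in your program, which can be rather inefficient. Second, as you saw in the preceding section, sometimes aligning the stack to a 16-byte boundary forces you to adjust the stack downward by an unknown amount, so you don't know how many bytes to add to RSP in order to clean up the stack.

If you have several calls sprinkled through a given procedure, you can optimize the process of allocating and deallocating parameters on the stack by doing this operation just once. To understand how this works, consider the following code sequence: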

; 1st procedure call:

  sub rsp, `parm_size`   ; Allocate storage for proc1 parms
  `Code that copies parameters to the registers and stack`
  call proc1
  add  rsp, `parm_size`  ; Clean up the stack

; 2nd procedure call:

  sub rsp, `parm_size2`  ; Allocate storage for proc2 parms
  `Code that copies parameters to the registers and stack`
  call proc2
  add rsp, `parm_size2`  ; Clean up the stack

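If you study this code, you should be able to convince yourself that the first add and the second sub are somewhat redundant. If you modify the first sub instruction to lower the stack pointer by the greater of parm_size and parm_size2, and use this same value in the final add instruction, you can eliminate the add and sub instructions appearing between the two calls: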

; 1st procedure call:

  sub rsp, `max_parm_size`   ; Allocate storage for all parms
  `Code that copies parameters to the registers and stack for proc1`
  call proc1

  `Code that copies parameters to the registers and stack for proc2`
  call proc2
  add rsp, `max_parm_size`   ; Clean up the stack

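If you determine the maximum number of bytes of parameters needed by all calls within your procedure, you can eliminate all the individual stack allocations and cleanups throughout the procedure (don't forget, the minimum parameter size is 32 bytes, even if the procedure has no parameters at all, because of the shadow storage requirements).

It gets even better, though. If your procedure has local variables, you can combine the sub instruction that allocates local variables with the one that allocates storage for your parameters. Similarly, if you're using the standard entry/exit sequence, the leave instruction at the end of your procedure will automatically deallocate all the parameters (as well as the local variables) when you exit the procedure.

Throughout this book, you've seen lots of "magic" add and subtract instructions that have been offered without much in the way of explanation. Now you know what those instructions have been doing: they've been allocating storage for local variables and all the parameter space for the procedures being called, as well as keeping the stack 16-byte-aligned.

Here's one last example of a procedure that uses the standard entry/exit sequence to set up locals and parameter space: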

rbxSave  equ   [rbp-8]
someProc proc
         push  rbp
         mov   rbp, rsp
         sub   rsp, 48       ; Also leave stack 16-byte-aligned
         mov   rbxSave, rbx  ; Preserve RBX
          .
          .
          .
         lea   rcx, fmtStr
         mov   rdx, rbx      ; Print value in RBX (presumably)
         call  printf
          .
          .
          .
         mov   rbx, rbxSave  ; Restore RBX
         leave               ; Clean up stack
         ret
someProc endp

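However, if you use this trick to allocate storage for your procedures' parameters, you will not be able to use push instructions to move the data onto the stack. The storage has already been allocated on the stack for the parameters; you must use mov instructions (with the [RSP+constant] addressing mode) to copy the fifth and subsequent parameters onto the stack.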
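The "magic" constant in a sub rsp instruction can be derived with a little arithmetic. The sketch below is an illustration only: frame_size is a hypothetical helper (nothing like it exists in MASM), and it assumes the standard entry sequence (push rbp, then mov rbp, rsp) with no other pushes, so that RSP is 16-byte-aligned just before the sub instruction executes.

```c
#include <stddef.h>

/* Bytes to subtract from RSP after "push rbp / mov rbp, rsp":
   local-variable bytes plus the largest outgoing parameter area
   (never less than the 32 bytes of shadow storage), rounded up
   to a multiple of 16 so RSP stays 16-byte-aligned.
   frame_size is a hypothetical helper, not part of MASM. */
size_t frame_size(size_t local_bytes, size_t max_parm_bytes)
{
    size_t total;

    if (max_parm_bytes < 32)
        max_parm_bytes = 32;            /* shadow storage minimum */
    total = local_bytes + max_parm_bytes;
    return (total + 15) & ~(size_t)15;  /* round up to a multiple of 16 */
}
```

For the someProc example above, one 8-byte local (rbxSave) plus 32 bytes of shadow storage for printf gives 40, which rounds up to the 48 that the listing subtracts from RSP.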

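5.8 Functions and Function Results

Functions are procedures that return a result to the caller. In assembly language, few syntactical differences exist between a procedure and a function, which is why MASM doesn't provide a specific declaration for a function. Nevertheless, there are some semantic differences; although you can declare them the same way in MASM, you use them differently.

Procedures are a sequence of machine instructions that fulfill a task. The result of executing a procedure is the accomplishment of that activity. Functions, on the other hand, execute a sequence of machine instructions specifically to compute a value to return to the caller. Of course, a function can perform an activity as well, and procedures can compute values, but the main difference is that the purpose of a function is to return a computed result; procedures don't have this requirement.

In assembly language, you don't define a function by using special syntax. To MASM, everything is a proc. A section of code becomes a function by virtue of the programmer explicitly deciding to return a function result somewhere (typically in a register) via the procedure's execution.

The x86-64's registers are the most common place to return function results. The strlen() routine in the C Standard Library is a good example: it returns the length of the string (whose address you pass as a parameter) in the RAX register.

By convention, programmers try to return 8-, 16-, 32-, and 64-bit (non-floating-point) results in the AL, AX, EAX, and RAX registers, respectively. This is where most high-level languages return these types of results, and it's where the Microsoft ABI states that you should return function results. The exception is floating-point values. The Microsoft ABI states that you should return floating-point values in the XMM0 register.

Of course, there is nothing particularly sacred about the AL, AX, EAX, and RAX registers. You can return function results in any register if it is more convenient to do so. However, if you're calling a Microsoft ABI-compliant function (such as strlen()), you have no choice but to expect the function's return result in the RAX register (strlen() returns a 64-bit integer in RAX, for example).

If you need to return a function result that is larger than 64 bits, you obviously must return it somewhere other than in RAX (which can hold only 64-bit values). For values slightly larger than 64 bits (for example, 128 bits or maybe even as many as 256 bits), you can split the result into pieces and return those parts in two or more registers. It is common to see functions returning 128-bit values in the RDX:RAX register pair. The XMM/YMM registers are another good place to return large values. Just keep in mind that these schemes are not Microsoft ABI-compliant, so they're practical only when calling code you've written.

If you need to return a large object as a function result (say, an array of 1000 elements), you obviously are not going to be able to return the function result in the registers. You can deal with large function return results in two common ways: either pass the return value as a reference parameter, or allocate storage on the heap (for example, using the C Standard Library malloc() function) for the object and return a pointer to it in a 64-bit register. Of course, if you return a pointer to storage you've allocated on the heap, the calling program must free this storage when it has finished with it.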
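The two ways of returning a large object can be sketched in C, which makes the ownership of the storage explicit. This is a minimal illustration under assumed names (fill_squares, make_squares, and NUM_ELEMENTS are invented for the example); the same division of responsibility applies when the caller and callee are written in assembly.

```c
#include <stdlib.h>

#define NUM_ELEMENTS 1000

/* Option 1: the caller supplies the storage (pass by reference). */
void fill_squares(int dest[], size_t count)
{
    for (size_t i = 0; i < count; i++)
        dest[i] = (int)(i * i);
}

/* Option 2: the function allocates the result on the heap and
   returns a pointer to it; the caller must free() that pointer.
   Returns NULL if the allocation fails. */
int *make_squares(size_t count)
{
    int *result = malloc(count * sizeof *result);

    if (result != NULL)
        fill_squares(result, count);
    return result;
}
```

With the first form the caller decides where the array lives (static data, the stack, or the heap); with the second, every successful call must eventually be paired with a free().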

5.9 Recursion

Recursion occurs when a procedure calls itself. The following, for example, is a recursive procedure:

Recursive proc

          call Recursive
          ret

Recursive endp

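Of course, the CPU will never return from this procedure. Upon entry into Recursive, this procedure will immediately call itself again, and control will never pass to the end of the procedure. In this particular case, runaway recursion results in an infinite loop.^(14)

Like a looping structure, recursion requires a termination condition in order to stop infinite recursion. Recursive could be rewritten with a termination condition as follows: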

Recursive proc

          dec  eax
          jz   allDone
          call Recursive
allDone:
          ret

Recursive endp

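This modification to the routine causes Recursive to call itself the number of times appearing in the EAX register. On each call, Recursive decrements the EAX register by 1 and then calls itself again. Eventually, Recursive decrements EAX to 0 and returns from each call until it returns to the original caller.

So far, however, there hasn't been a real need for recursion. After all, you could efficiently code this procedure as follows: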

Recursive proc
iterLp:
          dec  eax
          jnz  iterLp
          ret
Recursive endp

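Both examples repeat the body of the procedure the number of times passed in the EAX register.^(15) As it turns out, there are only a few recursive algorithms that you cannot implement in an iterative fashion. However, many recursively implemented algorithms are more efficient than their iterative counterparts, and most of the time the recursive form of the algorithm is much easier to understand.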
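To make the equivalence concrete, here is a small C model of the two assembly versions. It is a sketch rather than a translation of the book's code: the function names are invented, the count parameter stands in for EAX, and each function returns how many times the body ran. As with the assembly, the caller is assumed to pass a count of at least 1 (a count of 0 would wrap around).

```c
/* Recursive form: mirrors "dec eax / jz allDone / call Recursive". */
unsigned recursive_calls(unsigned count)
{
    count--;                        /* dec eax */
    if (count == 0)                 /* jz allDone */
        return 1;                   /* only this invocation ran */
    return 1 + recursive_calls(count);
}

/* Iterative form: mirrors "iterLp: dec eax / jnz iterLp". */
unsigned loop_passes(unsigned count)
{
    unsigned passes = 0;

    do {
        count--;                    /* dec eax */
        passes++;
    } while (count != 0);           /* jnz iterLp */
    return passes;
}
```

Both functions return the same value for any count of 1 or more; the recursive one additionally consumes one return address of stack space per invocation.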

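The quicksort algorithm is probably the most famous algorithm that usually appears in recursive form. A MASM implementation of this algorithm appears in Listing 5-15.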

; Listing 5-15

; Recursive quicksort.

        option  casemap:none

nl          =       10
numElements =       10

            .const
ttlStr      byte    "Listing 5-15", 0
fmtStr1     byte    "Data before sorting: ", nl, 0
fmtStr2     byte    "%d "   ; Use nl and 0 from fmtStr3
fmtStr3     byte    nl, 0
fmtStr4     byte    "Data after sorting: ", nl, 0

            .data
theArray    dword   1,10,2,9,3,8,4,7,5,6

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; quicksort - Sorts an array using the
;             quicksort algorithm.

; Here's the algorithm in C, so you can follow along
; (low and high are inclusive bounds):
;
;   void quicksort(int a[], int low, int high)
;   {
;       int i,j,Middle;
;       if(low < high)
;       {
;           Middle = a[(low+high)/2];
;           i = low;
;           j = high;
;           do
;           {
;               while(a[i] < Middle) i++;
;               while(a[j] > Middle) j--;
;               if(i <= j)
;               {
;                   swap(a[i],a[j]);
;                   i++;
;                   j--;
;               }
;           } while(i <= j);
;
;           // Recursively sort the two subarrays.
;
;           if(low < j) quicksort(a,low,j);
;           if(i < high) quicksort(a,i,high);
;       }
;   }

; Args:
    ; RCX (_a):      Pointer to array to sort
    ; RDX (_lowBnd): Index to low bound of array to sort
    ; R8 (_highBnd): Index to high bound of array to sort

_a          equ     [rbp+16]        ; Ptr to array
_lowBnd     equ     [rbp+24]        ; Low bounds of array
_highBnd    equ     [rbp+32]        ; High bounds of array

; Local variables (register save area):

saveR9      equ     [rbp+40]        ; Shadow storage for R9
saveRDI     equ     [rbp-8]
saveRSI     equ     [rbp-16]
saveRBX     equ     [rbp-24]
saveRAX     equ     [rbp-32]

; Within the procedure body, these registers
; have the following meaning:

; RCX: Pointer to base address of array to sort.
; EDX: Lower bound of array (32-bit index).
; R8D: Higher bound of array (32-bit index).

; EDI: index (i) into array.
; ESI: index (j) into array.
; R9D: Middle element to compare against.

quicksort   proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 32

; This code doesn't mess with RCX. No
; need to save it. When it does mess
; with RDX and R8, it saves those registers
; at that point.

; Preserve other registers we use:

            mov     saveRAX, rax
            mov     saveRBX, rbx
            mov     saveRSI, rsi
            mov     saveRDI, rdi
            mov     saveR9, r9

            mov     edi, edx          ; i = low
            mov     esi, r8d          ; j = high

; Compute a pivotal element by selecting the
; physical middle element of the array.

            lea     rax, [rsi+rdi*1]  ; RAX = i+j
            shr     rax, 1            ; (i + j)/2
            mov     r9d, [rcx][rax*4] ; Middle = ary[(i + j)/2]

; Repeat until the EDI and ESI indexes cross one
; another (EDI works from the start toward the end
; of the array, ESI works from the end toward the
; start of the array).

rptUntil:

; Scan from the start of the array forward
; looking for the first element greater or equal
; to the middle element):

            dec     edi     ; To counteract inc, below
while1:     inc     edi     ; i = i + 1
            cmp     r9d, [rcx][rdi*4] ; While Middle > ary[i]
            jg      while1

; Scan from the end of the array backward, looking
; for the first element that is less than or equal
; to the middle element.

            inc     esi     ; To counteract dec, below
while2:     dec     esi     ; j = j - 1
            cmp     r9d, [rcx][rsi*4] ; While Middle < ary[j]
            jl      while2 

; If we've stopped before the two pointers have
; passed over one another, then we've got two
; elements that are out of order with respect
; to the middle element, so swap these two elements.

            cmp     edi, esi  ; If i <= j
            jnle    endif1

            mov     eax, [rcx][rdi*4] ; Swap ary[i] and ary[j]
            mov     r9d, [rcx][rsi*4]
            mov     [rcx][rsi*4], eax
            mov     [rcx][rdi*4], r9d

            inc     edi       ; i = i + 1
            dec     esi       ; j = j - 1

endif1:     cmp     edi, esi  ; Until i > j
            jng     rptUntil

; We have just placed all elements in the array in
; their correct positions with respect to the middle
; element of the array. So all elements at indexes
; greater than the middle element are also numerically
; greater than this element. Likewise, elements at
; indexes less than the middle (pivotal) element are
; now less than that element. Unfortunately, the
; two halves of the array on either side of the pivotal
; element are not yet sorted. Call quicksort recursively
; to sort these two halves if they have more than one
; element in them (if they have zero or one elements, then
; they are already sorted).

            cmp     edx, esi  ; If lowBnd < j
            jnl     endif2

            ; Note: a is still in RCX,
            ; low is still in RDX.
            ; Need to preserve R8 (high).
            ; Note: quicksort doesn't require stack alignment.

            push    r8
            mov     r8d, esi
            call    quicksort ; (a, low, j)
            pop     r8

endif2:     cmp     edi, r8d  ; If i < high
            jnl     endif3

            ; Note: a is still in RCX,
            ; High is still in R8D.
            ; Need to preserve RDX (low).
            ; Note: quicksort doesn't require stack alignment.

            push    rdx
            mov     edx, edi
            call    quicksort ; (a, i, high)
            pop     rdx

; Restore registers and leave:

endif3:
            mov     rax, saveRAX
            mov     rbx, saveRBX
            mov     rsi, saveRSI
            mov     rdi, saveRDI
            mov     r9, saveR9
            leave
            ret
quicksort   endp

; Little utility to print the array elements:

printArray  proc
            push    r15
            push    rbp
            mov     rbp, rsp
            sub     rsp, 40   ; Shadow parameters

            lea     r9, theArray
            mov     r15d, 0
whileLT10:  cmp     r15d, numElements
            jnl     endwhile1

            lea     rcx, fmtStr2
            lea     r9, theArray
            mov     edx, [r9][r15*4]
            call    printf

            inc     r15d
            jmp     whileLT10

endwhile1:  lea     rcx, fmtStr3
            call    printf
            leave
            pop     r15
            ret
printArray  endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 32   ; Shadow storage

; Display unsorted array:

            lea     rcx, fmtStr1
            call    printf
            call    printArray

; Sort the array:

            lea     rcx, theArray
            xor     rdx, rdx                ; low = 0
            mov     r8d, numElements-1      ; high = 9
            call    quicksort               ; (theArray, 0, 9)

; Display sorted results:

            lea     rcx, fmtStr4
            call    printf
            call    printArray

            leave
            ret     ; Returns to caller

asmMain     endp
            end

Listing 5-15: Recursive quicksort program

Here's the build command and sample output for the quicksort program:

C:\>**build listing5-15**

C:\>**echo off**
 Assembling: listing5-15.asm
c.cpp

C:\>**listing5-15**
Calling Listing 5-15:
Data before sorting:
1
10
2
9
3
8
4
7
5
6

Data after sorting:
1
2
3
4
5
6
7
8
9
10

Listing 5-15 terminated

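Note that this quicksort procedure uses registers for all its local variables. The quicksort function calls no procedure other than itself, and its recursive calls do not depend on stack alignment. Therefore, it doesn't need to align the stack on a 16-byte boundary. Also, as is a good idea for any pure assembly procedure (one that will be called only by other assembly language procedures), this quicksort procedure preserves all the registers whose values it modifies (even the volatile registers). This is good programming practice even if it is slightly less efficient.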
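For readers who want to experiment with the algorithm outside the assembler, the following C function follows the same steps as the assembly code in Listing 5-15: a middle-element pivot, two inward scans, a swap, and recursion on [low..j] and [i..high]. It is a sketch written for this comparison, not code taken from the book, and it assumes the caller passes inclusive bounds with low <= high.

```c
/* Sorts a[low..high] (inclusive bounds, low <= high), using the
   same steps as the assembly listing. */
void quicksort(int a[], int low, int high)
{
    int i = low;
    int j = high;
    int middle = a[(low + high) / 2];   /* pivot: physical middle element */

    do {
        while (a[i] < middle) i++;      /* while Middle > ary[i] */
        while (a[j] > middle) j--;      /* while Middle < ary[j] */
        if (i <= j) {
            int tmp = a[i];             /* swap ary[i] and ary[j] */
            a[i] = a[j];
            a[j] = tmp;
            i++;
            j--;
        }
    } while (i <= j);

    /* Recursively sort the two partitions if they hold
       more than one element. */
    if (low < j)  quicksort(a, low, j);
    if (i < high) quicksort(a, i, high);
}
```

Calling quicksort(theArray, 0, 9) on the ten values from the listing corresponds to the call made in asmMain.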

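5.10 Procedure Pointers

The x86-64 call instruction allows three basic forms: PC-relative calls (via a procedure name), indirect calls through a 64-bit general-purpose register, and indirect calls through a quad-word pointer variable. The call instruction supports the following (low-level) syntax: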

call `proc_name`  ; Direct call to procedure `proc_name`
call reg64      ; Indirect call to procedure whose address
                ; appears in reg64
call qwordVar   ; Indirect call to the procedure whose address
                ; appears in the qwordVar quad-word variable

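We've been using the first form throughout this book, so there is no need to discuss it here. The second form, the register indirect call, calls the procedure whose address is held in the specified 64-bit register. The address of a procedure is the byte address of the first instruction to execute within that procedure. On a von Neumann architecture machine (like the x86-64), the system stores machine instructions in memory along with other data. The CPU fetches the instruction opcode values from memory prior to executing them. When you execute the register indirect call instruction, the x86-64 first pushes the return address onto the stack and then begins fetching the next opcode byte (instruction) from the address specified by the register's value.

The third form of the preceding call instruction fetches the address of a procedure's first instruction from a quad-word variable in memory. Although this instruction suggests that the call uses the direct addressing of the procedure, you should realize that any legal memory addressing mode is also legal here. For example, call procPtrTable[rbx*8] is perfectly legitimate; this statement fetches the quad word from the array of quad words (procPtrTable) and calls the procedure whose address is the value contained within that quad word.

MASM treats procedure names like static objects. Therefore, you can compute the address of a procedure by using the offset operator along with the procedure's name or by using the lea instruction. For example, offset proc_name is the address of the very first instruction of the proc_name procedure. So, all three of the following code sequences wind up calling the proc_name procedure: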

call `proc_name`
 .
 .
 .
mov  rax, offset `proc_name`
call rax
 .
 .
 .
lea   rax, `proc_name`
call  rax

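Because the address of a procedure fits in a 64-bit object, you can store such an address into a quad-word variable; in fact, you can initialize a quad-word variable with the address of a procedure by using code like the following: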

p     proc
        .
        .
        .
p     endp
        .
        .
        .
       .data
ptrToP qword   offset p
        .
        .
        .
     call ptrToP ; Calls p if ptrToP has not changed

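As with all pointer objects, you should not attempt to indirectly call a procedure through a pointer variable unless you've initialized that variable with an appropriate address. You can initialize a procedure pointer variable in two ways: .data and .const objects allow an initializer, or you can compute the address of a routine (as a 64-bit value) and store that 64-bit address directly into the procedure pointer at runtime. The following code fragment demonstrates both ways to initialize a procedure pointer:

             .data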
ProcPointer  qword  offset p   ; Initialize ProcPointer with 
                               ; the address of p
              .
              .
              .
             call ProcPointer  ; First invocation calls p

; Reload ProcPointer with the address of q.

             lea   rax, q
             mov  ProcPointer, rax
              .
              .
              .
             call  ProcPointer ; This invocation calls q

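Although all the examples in this section use static variable declarations (.data, .const, .data?), don't get the impression that you can declare simple procedure pointers only in the static variable declaration sections. You can also declare procedure pointers (which are just qword variables) as local variables, pass them as parameters, or declare them as fields of a record or a union.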
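C function pointers are the high-level counterpart of the quad-word procedure pointers described in this section, so a short C sketch may help connect the two. The names p, q, ProcPointer, and procPtrTable echo the placeholders used in the text; the bodies of p and q and the wrapper functions are assumptions added for the illustration.

```c
/* Two stand-in procedures; their bodies are placeholders. */
static int p(void) { return 1; }
static int q(void) { return 2; }

/* Statically initialized, like "ProcPointer qword offset p". */
static int (*ProcPointer)(void) = p;

/* Table of procedure addresses, like procPtrTable. */
static int (*const procPtrTable[])(void) = { p, q };

int call_through_pointer(void)
{
    return ProcPointer();           /* call ProcPointer */
}

void retarget_to_q(void)
{
    ProcPointer = q;                /* lea rax, q / mov ProcPointer, rax */
}

int call_from_table(unsigned index)
{
    return procPtrTable[index]();   /* call procPtrTable[rbx*8] */
}
```

The first call through ProcPointer reaches p; after retarget_to_q() runs, the same call instruction reaches q, which is the behavior the assembly fragment above demonstrates.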

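5.11 Procedural Parameters

One place where procedure pointers are quite invaluable is in parameter lists. Selecting one of several procedures to call by passing the address of a procedure is a common operation. Of course, a procedural parameter is just a quad-word variable containing the address of a procedure, so this is really no different from using a local variable to hold a procedure pointer (except, of course, that the caller initializes the parameter with the address of the procedure to call indirectly).

When using the MASM proc directive, you can specify a procedure pointer type by using the proc type specifier; for example: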

procWithProcParm proc  parm1:word, procParm:proc

You can call the procedure pointed at by this parameter with the following call instruction:

call procParm
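The same idea in C, as a rough sketch: a function receives the address of another function as a parameter and calls through it. The names apply_twice, add_three, and negate are invented for the example; only procParm comes from the text.

```c
/* Receives another procedure's address as a parameter and calls
   through it twice (the C analogue of "call procParm"). */
int apply_twice(int value, int (*procParm)(int))
{
    return procParm(procParm(value));
}

/* Two candidate procedures the caller can choose between. */
int add_three(int x) { return x + 3; }
int negate(int x)    { return -x; }
```

The caller selects the behavior simply by choosing which address to pass, for example apply_twice(1, add_three) versus apply_twice(1, negate).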

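5.12 Saving the State of the Machine, Part II

"Saving the State of the Machine," earlier in this chapter, described the use of the push and pop instructions to save the state of the registers across a procedure call (callee register preservation). While this is certainly one way to preserve registers across a procedure call, it isn't the only way, and it isn't always (or even usually) the best way to save and restore registers.

The push and pop instructions have a couple of major benefits: they are short (pushing or popping a 64-bit register uses a 1-byte instruction opcode), and they work with constants and memory operands. These instructions do have drawbacks, however: they modify the stack pointer, they work with only 2- or 8-byte registers, they work only with the general-purpose integer registers (and the FLAGS register), and they might be slower than an equivalent instruction that moves the register data onto the stack. Often, a better solution is to reserve storage in the local variable space and simply move the registers to and from those local variables on the stack.

Consider the following procedure declaration that preserves registers by using push and pop instructions: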

preserveRegs proc
             push   rax
             push   rbx
             push   rcx
               .
               .
               .
             pop    rcx
             pop    rbx
             pop    rax
             ret
preserveRegs endp

You can achieve the same thing with the following code:

preserveRegs proc
saveRAX      textequ <[rsp+16]>
saveRBX      textequ <[rsp+8]>
saveRCX      textequ <[rsp]>

             sub     rsp, 24      ; Make room for locals
             mov     saveRAX, rax
             mov     saveRBX, rbx
             mov     saveRCX, rcx
               .
               .
               .
             mov     rcx, saveRCX
             mov     rbx, saveRBX
             mov     rax, saveRAX
             add     rsp, 24      ; Deallocate locals
             ret
preserveRegs endp

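The disadvantage of this code is that two extra instructions are needed to allocate (and deallocate) the storage on the stack for the local variables that hold the register values. The push and pop instructions automatically allocate this storage, sparing you from having to supply these extra instructions. For a simple situation such as this, the push and pop instructions probably are the better solution.

For more complex procedures, especially those that expect parameters on the stack or have local variables, the procedure is already setting up the activation record, and subtracting a larger number from RSP doesn't require any additional instructions:

             option  prologue:PrologueDef
             option  epilogue:EpilogueDef
preserveRegs proc    parm1:byte, parm2:dword
             local   localVar1:dword, localVar2:qword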
             local   saveRAX:qword, saveRBX:qword
             local   saveRCX:qword

             mov     saveRAX, rax
             mov     saveRBX, rbx
             mov     saveRCX, rcx
               .
               .
               .
             mov     rcx, saveRCX
             mov     rbx, saveRBX
             mov     rax, saveRAX
             ret
preserveRegs endp

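MASM automatically generates the code to allocate the storage for saveRAX, saveRBX, and saveRCX (along with all the other local variables) on the stack, as well as to clean up the local storage on return.

When a procedure allocates local variables on the stack along with storage for the parameters it passes to the functions it calls, pushing and popping registers to preserve them becomes problematic. For example, consider the following procedure: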

callsFuncs   proc
saveRAX      textequ <[rbp-8]>
saveRBX      textequ <[rbp-16]>
saveRCX      textequ <[rbp-24]>
             push    rbp
             mov     rbp, rsp
             sub     rsp, 48      ; Make room for locals and parms
             mov     saveRAX, rax ; Preserve registers in
             mov     saveRBX, rbx ; local variables
             mov     saveRCX, rcx

               .
               .
               .
             mov    [rsp], rax    ; Store parm1
             mov    [rsp+8], rbx  ; Store parm2
             mov    [rsp+16], rcx ; Store parm3
             call   theFunction
               .
               .
               .
             mov     rcx, saveRCX ; Restore registers
             mov     rbx, saveRBX
             mov     rax, saveRAX
             leave                ; Deallocate locals
             ret
callsFuncs   endp

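Had this function pushed RAX, RBX, and RCX onto the stack after subtracting 48 from RSP, those saved registers would have wound up on the stack exactly where the function passes parm1, parm2, and parm3 to theFunction. That's why the push and pop instructions don't work well with functions that build an activation record containing local storage.

5.13 Microsoft ABI Notes

This chapter has all but completed the discussion of the Microsoft calling conventions. Specifically, a Microsoft ABI-compliant function must follow these rules:

(Scalar) parameters must be passed in RCX, RDX, R8, and R9, then pushed on the stack. Floating-point parameters substitute XMM0, XMM1, XMM2, and XMM3 for RCX, RDX, R8, and R9, respectively.
Varargs functions (functions with a variable number of parameters, such as printf()) and unprototyped functions must pass floating-point values in both the general-purpose (integer) registers and in the XMM registers. (For what it's worth, printf() seems to be happy with just passing the floating-point values in the integer registers, although that might be a happy accident with the version of MSVC used in the preparation of this book.)
All parameters must be less than or equal to 64 bits in size; larger parameters must be passed by reference.
On the stack, parameters always consume 64 bits (8 bytes) regardless of their actual size; the HO bits of smaller objects are undefined.
The stack must be 16-byte-aligned immediately before a call instruction.
Registers RAX, RCX, RDX, R8, R9, R10, R11, and XMM0/YMM0 to XMM5/YMM5 are volatile. The caller must preserve these registers across a call if it needs their values after the call. Also note that the HO 128 bits of YMM0 to YMM15 are volatile, and the caller must preserve these registers if it needs these bits to survive a call.
Registers RBX, RSI, RDI, RBP, RSP, R12 to R15, and XMM6 to XMM15 are nonvolatile. The callee must preserve these registers if it changes their values. As noted earlier, while the LO 128 bits of YMM6 to YMM15 (that is, XMM6 to XMM15) are nonvolatile, the upper 128 bits of these registers can be considered volatile. However, if a procedure preserves the LO 128 bits of YMM6 to YMM15, it might as well preserve all the bits (this inconsistency in the Microsoft ABI exists to support legacy code running on CPUs that don't support the YMM registers).
Scalar function returns (64 bits or fewer) come back in the RAX register. If the data type is smaller than 64 bits, the HO bits of RAX are undefined.
For functions that return values larger than 64 bits, the caller must allocate storage for the return value and pass the address of that storage in the first parameter (RCX). On return, the function must return this pointer in the RAX register.
Functions return floating-point results (double or single) in the XMM0 register.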

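5.14 For More Information

The electronic edition of the 32-bit edition of this book (found at artofasm.randallhyde.com/) contains a whole "volume" on advanced and intermediate procedures. Though that book covers 32-bit assembly language programming, the concepts apply directly to 64-bit assembly by simply using 64-bit addresses rather than 32-bit addresses.

The information in this chapter covers 99 percent of the material that assembly programmers typically use, but there is additional information on procedures and parameters that you may find interesting. In particular, the electronic edition covers additional parameter-passing mechanisms (pass by value/result, pass by result, pass by name, and pass by lazy evaluation) and goes into greater detail about the places you can pass parameters. The electronic edition also covers iterators, lazy computation, and other advanced procedure types. Finally, a good compiler construction textbook will cover additional details about runtime support for procedures.

For more information on the Microsoft ABI, search for Microsoft calling conventions on the Microsoft website (or on the internet).

5.15 Test Yourself

Explain, step by step, how the call instruction works.
Explain, step by step, how the ret instruction works.
What does the ret instruction, with a numeric constant operand, do?
What value is pushed on the stack for a return address?
What is namespace pollution?
How do you define a single global symbol in a procedure?
How would you make all symbols in a procedure non-scoped (that is, all the symbols in a procedure would be global)?
Explain how to use the push and pop instructions to preserve registers in a function.
What is a major drawback to caller preservation?
What is a major problem with callee preservation?
What happens if you fail to pop a value in a function that you pushed on the stack at the beginning of the function?
What happens if you pop extra data off the stack in a function (data that you did not push on the stack in the function)?
What is an activation record?
What register usually points at an activation record, providing access to the data in that record?
How many bytes are reserved for a typical parameter on the stack when using the Microsoft ABI?
What is the standard entry sequence for a procedure (the instructions)?
What is the standard exit sequence for a procedure (the instructions)?
What instruction can you use to force 16-byte alignment of the stack pointer if the current value in RSP is unknown?
What is the scope of a variable?
What is the lifetime of a variable?
What is an automatic variable?
When does the system allocate storage for an automatic variable?
Explain two ways to declare local/automatic variables in a procedure.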

Given the following procedure source code snippet, provide the offsets for each of the local variables:

procWithLocals proc
               local  var1:word, local2:dword, dVar:byte
               local  qArray[2]:qword, rlocal[2]:real4
               local  ptrVar:qword
                 .
                 .   ; Other statements in the procedure.
                 .
procWithLocals endp

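What statements would you insert in the source file to tell MASM to automatically generate the standard entry and standard exit sequences for a procedure?
When MASM automatically generates a standard entry sequence for a procedure, how does it determine where to put the code sequence?
When MASM automatically generates a standard exit sequence for a procedure, how does it determine where to put the code sequence?
What value does a pass-by-value parameter pass to a function?
What value does a pass-by-reference parameter pass to a function?
When passing four integer parameters to a function, where does the Windows ABI state those parameters are to be passed?
When passing a floating-point value as one of the first four parameters, where does the Windows ABI insist the values will be passed?
When passing more than four parameters to a function, where does the Windows ABI state the parameters will be passed?
What is the difference between a volatile and a nonvolatile register in the Windows ABI?
Which registers are volatile in the Windows ABI?
Which registers are nonvolatile in the Windows ABI?
When passing parameters in the code stream, how does a function access the parameter data?
What is a shadow parameter?
How many bytes of shadow storage will a function require if it has a single 32-bit integer parameter?
How many bytes of shadow storage will a function require if it has two 64-bit integer parameters?
How many bytes of shadow storage will a function require if it has six 64-bit integer parameters?
What offsets will MASM associate with each of the parameters in the following proc declaration?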
```
procWithParms proc  parm1:byte, parm2:word, parm3:dword, parm4:qword
```
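Suppose that parm4 in the preceding question is a pass-by-reference character parameter. How would you load that character into the AL register (provide a code sequence)?

What offsets will MASM associate with each of the local variables in the following proc fragment?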

procWithLocals proc
               local lclVar1:byte, lclVar2:word, lclVar3:dword, lclVar4:qword

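What is the best way to pass a large array to a procedure?
What does ABI stand for?
Where is the most common place to return a function result?
What is a procedural parameter?
How would you call a procedure passed as a parameter to a function/procedure?
If a procedure has local variables, what is the best way to preserve registers within that procedure?

Chapter 6: Arithmetic

This chapter discusses arithmetic computation in assembly language. By the end of this chapter, you should be able to translate arithmetic expressions and assignment statements from high-level languages like Pascal and C/C++ into x86-64 assembly language.

6.1 x86-64 Integer Arithmetic Instructions

Before you learn how to encode arithmetic expressions in assembly language, it would be a good idea to first discuss the remaining arithmetic instructions in the x86-64 instruction set. Previous chapters have covered most of the arithmetic and logical instructions, so this section covers the few remaining instructions you'll need.

6.1.1 Sign- and Zero-Extension Instructions

Several arithmetic operations require sign- or zero-extended values before the operation. So let's first consider the sign- and zero-extension instructions. The x86-64 provides several instructions to sign- or zero-extend a smaller number to a larger number. Table 6-1 lists instructions that will sign-extend the AL, AX, EAX, and RAX registers.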

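Table 6-1: Instructions for Extending AL, AX, EAX, and RAX

Instruction	Explanation
`cbw`	Converts the byte in AL to a word in AX via sign extension
`cwd`	Converts the word in AX to a double word in DX:AX via sign extension
`cdq`	Converts the double word in EAX to a quad word in EDX:EAX via sign extension
`cqo`	Converts the quad word in RAX to an octa word (128 bits) in RDX:RAX via sign extension
`cwde`	Converts the word in AX to a double word in EAX via sign extension
`cdqe`	Converts the double word in EAX to a quad word in RAX via sign extension

Note that the cwd (convert word to double word) instruction does not sign-extend the word in AX to a double word in EAX. Instead, it stores the HO word of the sign extension into the DX register (the notation DX:AX indicates that you have a double-word value, with DX containing the upper 16 bits and AX containing the lower 16 bits of the value). If you want the sign extension of AX to go into EAX, you should use the cwde (convert word to double word, extended) instruction. In a similar fashion, the cdq instruction sign-extends EAX into EDX:EAX. Use the cdqe instruction if you want to sign-extend EAX into RAX.

For general sign-extension operations, the x86-64 provides an extension of the mov instruction, movsx (move with sign extension), that copies data and sign-extends the data while copying it. The movsx instruction's syntax is similar to that of mov: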

movsxd `dest`, `source` ; If dest is 64 bits and source is 32 bits
movsx  `dest`, `source` ; For all other operand combinations

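The big difference in syntax between these instructions and the mov instruction is that the destination operand must usually be larger than the source operand.^(1) For example, if the source operand is a byte, then the destination operand must be a word, dword, or qword. The destination operand must also be a register; the source operand, however, can be a memory location.^(2) The movsx instruction does not allow constant operands.

For whatever reason, MASM requires a different instruction mnemonic (instruction name) when sign-extending a 32-bit operand into a 64-bit register (movsxd rather than movsx).

To zero-extend a value, you can use the movzx instruction. It does not have the restrictions of movsx; as long as the destination operand is larger than the source operand, the instruction works fine. It supports extending 8 bits to 16, 32, or 64 bits, and 16 bits to 32 or 64 bits. There is no 32-to-64-bit version (it turns out this is unnecessary).

For historical reasons, the x86-64 CPU always zero-extends a 32-bit register into its 64-bit register when performing 32-bit operations. Therefore, to zero-extend a 32-bit register into a 64-bit register, you need only move the (32-bit) register into itself; for example: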

mov eax, eax  ; Zero-extends EAX into RAX

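Zero-extending certain 8-bit registers (AL, BL, CL, and DL) into their corresponding 16-bit registers is easily accomplished without using movzx by loading the complementary HO register (AH, BH, CH, or DH) with 0. To zero-extend AX into DX:AX or EAX into EDX:EAX, all you need to do is load DX or EDX with 0.^(3)

Because of instruction-encoding limitations, the x86-64 does not allow you to zero- or sign-extend the AH, BH, CH, or DH registers into any of the 64-bit registers.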
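The extension instructions can be modeled with C integer conversions, which may help when checking a translation by hand. The sketch below is only an approximation of the hardware behavior: the function names are invented to match the mnemonics, and the narrowing casts to signed types assume the usual two's-complement wraparound (which is what mainstream compilers implement).

```c
#include <stdint.h>

/* cbw: sign-extend AL into AX. */
uint16_t cbw(uint8_t al)
{
    return (uint16_t)(int16_t)(int8_t)al;
}

/* cwd: sign-extend AX into DX:AX; returns the value left in DX. */
uint16_t cwd_dx(uint16_t ax)
{
    return (ax & 0x8000u) ? 0xFFFFu : 0x0000u;
}

/* cwde: sign-extend AX into EAX. */
uint32_t cwde(uint16_t ax)
{
    return (uint32_t)(int32_t)(int16_t)ax;
}

/* movzx r32, r8: zero-extend a byte into 32 bits. */
uint32_t movzx_8_to_32(uint8_t src)
{
    return (uint32_t)src;
}

/* mov eax, eax: a 32-bit write clears bits 32-63 of RAX. */
uint64_t mov_eax_eax(uint64_t rax)
{
    return (uint64_t)(uint32_t)rax;
}
```

Note how cwd_dx produces only the HO word: the LO word stays in AX, which is exactly why cwd is not a substitute for cwde.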

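6.1.2 The mul and imul Instructions

You've already seen a subset of the imul instructions available in the x86-64 instruction set (see "The imul Instruction" in Chapter 4). This section presents the extended-precision version of imul along with the unsigned mul instruction.

The multiplication instructions provide you with another taste of irregularity in the x86-64's instruction set. Instructions like add, sub, and many others in the x86-64 instruction set support two operands, just like the mov instruction. Unfortunately, there weren't enough bits in the original 8086 opcode byte to support all instructions, so the x86-64 treats the mul (unsigned multiply) and imul (signed integer multiply) instructions as single-operand instructions, just like the inc, dec, and neg instructions. Of course, multiplication is a two-operand function. To work around this fact, the x86-64 always assumes the accumulator (AL, AX, EAX, or RAX) is the destination operand.

Another problem with the mul and imul instructions is that you cannot use them to multiply the accumulator by a constant. Intel quickly discovered the need to support multiplication by a constant and added the more general versions of the imul instruction to overcome this problem. Nevertheless, you must be aware that the basic mul and imul instructions do not support the full range of operands that the imul appearing in Chapter 4 does.

The multiply instruction has two forms: unsigned multiplication (mul) and signed multiplication (imul). Unlike addition and subtraction, you need separate instructions for signed and unsigned operations.

The single-operand multiply instructions take the following forms:

Unsigned multiplication: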

mul `reg`[8]   ; Returns AX
mul `reg`[16]  ; Returns DX:AX
mul `reg`[32]  ; Returns EDX:EAX
mul `reg`[64]  ; Returns RDX:RAX

mul `mem`[8]   ; Returns AX
mul `mem`[16]  ; Returns DX:AX
mul `mem`[32]  ; Returns EDX:EAX
mul `mem`[64]  ; Returns RDX:RAX

Signed (integer) multiplication:

imul `reg`[8]  ; Returns AX
imul `reg`[16] ; Returns DX:AX
imul `reg`[32] ; Returns EDX:EAX
imul `reg`[64] ; Returns RDX:RAX

imul `mem`[8]  ; Returns AX
imul `mem`[16] ; Returns DX:AX
imul `mem`[32] ; Returns EDX:EAX
imul `mem`[64] ; Returns RDX:RAX

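The result of multiplying two n-bit values may require as many as 2 × n bits. Therefore, if the operand is an 8-bit quantity, the result could require 16 bits. Likewise, a 16-bit operand produces a 32-bit result, a 32-bit operand produces a 64-bit result, and a 64-bit operand requires as many as 128 bits to hold the result. Table 6-2 lists the various computations.

Table 6-2: mul and imul Operations

Instruction	Computes
`mul` `operand`[8]	AX = AL × operand[8] (unsigned)
`imul` `operand`[8]	AX = AL × operand[8] (signed)
`mul` `operand`[16]	DX:AX = AX × operand[16] (unsigned)
`imul` `operand`[16]	DX:AX = AX × operand[16] (signed)
`mul` `operand`[32]	EDX:EAX = EAX × operand[32] (unsigned)
`imul` `operand`[32]	EDX:EAX = EAX × operand[32] (signed)
`mul` `operand`[64]	RDX:RAX = RAX × operand[64] (unsigned)
`imul` `operand`[64]	RDX:RAX = RAX × operand[64] (signed)

If an 8×8-, 16×16-, 32×32-, or 64×64-bit product requires more than 8, 16, 32, or 64 bits (respectively), the mul and imul instructions set the carry and overflow flags. mul and imul scramble the sign and zero flags.

You'll use the single-operand mul and imul instructions quite a lot when you learn about extended-precision arithmetic in Chapter 8. Unless you're doing multiprecision work, however, you'll probably want to use the more generic multi-operand version of the imul instruction in place of the extended-precision mul or imul. However, the generic imul (see Chapter 4) is not a complete replacement for these two instructions; in addition to the number of operands, several differences exist. The following rules apply specifically to the generic (multi-operand) imul instruction:

There isn't an 8×8-bit multi-operand imul instruction.
The generic imul instruction does not produce a 2×n-bit result, but truncates the result to n bits. That is, a 16×16-bit multiply produces a 16-bit result. Likewise, a 32×32-bit multiply produces a 32-bit result. These instructions set the carry and overflow flags if the result does not fit into the destination register.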
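The difference between the widening single-operand form and the truncating generic form can be modeled in C for the 16-bit case. This is a sketch with invented names (Mul16Result, mul16, imul16_truncating); it models only the product and the carry/overflow indication, not the other flags.

```c
#include <stdint.h>
#include <stdbool.h>

/* Result of the one-operand "mul operand16": DX:AX = AX * operand. */
typedef struct {
    uint16_t dx;     /* HO 16 bits of the product */
    uint16_t ax;     /* LO 16 bits of the product */
    bool     carry;  /* CF and OF: set when the product needs DX */
} Mul16Result;

Mul16Result mul16(uint16_t ax, uint16_t operand)
{
    uint32_t product = (uint32_t)ax * operand;
    Mul16Result r;

    r.dx = (uint16_t)(product >> 16);
    r.ax = (uint16_t)product;
    r.carry = (r.dx != 0);
    return r;
}

/* Generic "imul reg16, operand16": keeps only the LO 16 bits and
   reports (through *overflow) whether the signed product fit. */
int16_t imul16_truncating(int16_t dest, int16_t src, bool *overflow)
{
    int32_t product = (int32_t)dest * src;

    *overflow = (product < INT16_MIN || product > INT16_MAX);
    return (int16_t)(uint16_t)product;
}
```

With mul16 the full product is always recoverable from DX:AX; with the truncating form the HO bits are lost, and the overflow indication is the only sign that anything went missing.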

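6.1.3 The `div` and `idiv` Instructions

The x86-64 divide instructions perform a 128/64-bit division, a 64/32-bit division, a 32/16-bit division, or a 16/8-bit division. These instructions take the following forms: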

div `reg`[8]
div `reg`[16]
div `reg`[32]
div `reg`[64]

div `mem`[8]
div `mem`[16]
div `mem`[32]
div `mem`[64]

idiv `reg`[8]
idiv `reg`[16]
idiv `reg`[32]
idiv `reg`[64]

idiv `mem`[8]
idiv `mem`[16]
idiv `mem`[32]
idiv `mem`[64]

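The div instruction is an unsigned division operation. If the operand is an 8-bit operand, div divides the AX register by the operand, leaving the quotient in AL and the remainder (modulo) in AH. If the operand is a 16-bit quantity, the div instruction divides the 32-bit quantity in DX:AX by the operand, leaving the quotient in AX and the remainder in DX. With 32-bit operands, div divides the 64-bit value in EDX:EAX by the operand, leaving the quotient in EAX and the remainder in EDX. Finally, with 64-bit operands, div divides the 128-bit value in RDX:RAX by the operand, leaving the quotient in RAX and the remainder in RDX.

There is no variant of the div or idiv instructions that allows you to divide a value by a constant. If you want to divide a value by a constant, you need to create a memory object (preferably in the .const section) that is initialized with the constant, and then use that memory value as the div/idiv operand. For example:

         .const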
ten     dword   10
          .
          .
          .
         div    ten ; Divides EDX:EAX by 10

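The idiv instruction computes a signed quotient and remainder. The syntax for the idiv instruction is identical to div (except for the use of the idiv mnemonic), though creating signed operands for idiv may require a different sequence of instructions prior to executing idiv than for div.

You cannot, on the x86-64, simply divide one unsigned 8-bit value by another. If the denominator is an 8-bit value, the numerator must be a 16-bit value. If you need to divide one unsigned 8-bit value by another, you must zero-extend the numerator to 16 bits by loading the numerator into the AL register and then moving 0 into the AH register. Failing to zero-extend AL before executing div may cause the x86-64 to produce incorrect results! When you need to divide two 16-bit unsigned values, you must zero-extend the AX register (which contains the numerator) into the DX register. To do this, just load 0 into the DX register. If you need to divide one 32-bit value by another, you must zero-extend the EAX register into EDX (by loading a 0 into EDX) before the division. Finally, to divide one 64-bit number by another, you must zero-extend RAX into RDX (for example, using an xor rdx, rdx instruction) prior to the division.

When dealing with signed integer values, you will need to sign-extend AL into AX, AX into DX, EAX into EDX, or RAX into RDX before executing idiv. To do so, use the cbw, cwd, cdq, or cqo instructions.^(4) Failure to do so may produce incorrect results.

The x86-64's divide instructions have one other issue: you can get a fatal error when using them. First, of course, you can attempt to divide a value by 0. Another problem is that the quotient may be too large to fit into the RAX, EAX, AX, or AL register. For example, the 16/8-bit division 8000h / 2 produces the quotient 4000h with a remainder of 0. 4000h will not fit into 8 bits. If this happens, or you attempt to divide by 0, the x86-64 will generate a division exception or integer overflow exception. This usually means your program will crash. If this happens to you, chances are you didn't sign- or zero-extend your numerator before executing the division operation. Because this error may cause your program to crash, you should be very careful about the values you select when using division.

The x86-64 leaves the carry, overflow, sign, and zero flags undefined after a division operation. Therefore, you cannot test for problems after a division operation by checking the flag bits.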
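The conditions under which the 16/8-bit div faults can be written down as a short C model. This is a sketch, not a description of the processor's internal logic: div8 is an invented name, and returning false stands in for the exception the CPU would raise.

```c
#include <stdint.h>
#include <stdbool.h>

/* Model of "div operand8": divides AX by an 8-bit operand,
   leaving the quotient in AL and the remainder in AH.
   Returns false where the CPU would raise a divide error
   (the divisor is 0, or the quotient does not fit in 8 bits). */
bool div8(uint16_t ax, uint8_t divisor, uint8_t *al, uint8_t *ah)
{
    if (divisor == 0)
        return false;

    uint16_t quotient = ax / divisor;
    if (quotient > 0xFF)
        return false;

    *al = (uint8_t)quotient;
    *ah = (uint8_t)(ax % divisor);
    return true;
}
```

The model also shows why stale data in AH is dangerous: dividing 200 by 3 works when AX holds 00C8h, but if AH was left holding 12h (AX = 12C8h), the quotient no longer fits in AL and the division faults.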

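6.1.4 The cmp Instruction, Revisited

As noted in "The cmp Instruction and Corresponding Conditional Jumps" in Chapter 2, the cmp instruction updates the x86-64's flags according to the result of the subtraction operation (leftOperand - rightOperand). The x86-64 sets the flags in an appropriate fashion so that we can read this instruction as "compare leftOperand to rightOperand." You can test the result of the comparison by using the conditional set instructions to check the appropriate flags in the FLAGS register (see "The setcc Instructions," section 6.1.5) or the conditional jump instructions (Chapter 2 or Chapter 7).

Probably the first place to start when exploring the cmp instruction is to look at exactly how it affects the flags. Consider the following cmp instruction: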

cmp ax, bx

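This instruction performs the computation AX - BX and sets the flags depending on the result of the computation. The flags are set as follows (also see Table 6-3):

The zero flag is set if and only if AX = BX. This is the only time AX - BX produces a 0 result. Hence, you can use the zero flag to test for equality or inequality.

The sign flag is set to 1 if the result is negative. At first glance, you might think that this flag would be set if AX is less than BX, but this isn't always the case. If AX = 7FFFh and BX = -1 (0FFFFh), then subtracting BX from AX produces 8000h, which is negative (and so the sign flag will be set), even though AX is greater than BX. So, for signed comparisons, the sign flag alone doesn't contain the proper status. For unsigned operands, consider AX = 0FFFFh and BX = 1. Here, AX is greater than BX, but their difference is 0FFFEh, which is still negative. As it turns out, the sign flag and the overflow flag, taken together, can be used for comparing two signed values.

The overflow flag is set after a cmp operation if the difference of AX and BX produced an overflow or underflow. As mentioned previously, the sign and overflow flags are both used when performing signed comparisons.

The carry flag is set after a cmp operation if subtracting BX from AX requires a borrow. This occurs only when AX is less than BX, where AX and BX are both unsigned values.

Table 6-3: Condition Code Settings After cmp

Unsigned operands	Signed operands
ZF: Equality/inequality	ZF: Equality/inequality
CF: Left `<` Right (C = 1) Left `≥` Right (C = 0)	CF: No meaning
SF: No meaning	SF: See discussion in this section
OF: No meaning	OF: See discussion in this section

Given that the cmp instruction sets the flags in this fashion, you can test the comparison of the two operands with the following flags: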

cmp `Left`, `Right`

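For signed comparisons, the SF (sign) and OF (overflow) flags, taken together, have the following meanings:

If [(SF = 0) and (OF = 1)] or [(SF = 1) and (OF = 0)], then Left < Right for a signed comparison.
If [(SF = 0) and (OF = 0)] or [(SF = 1) and (OF = 1)], then Left ≥ Right for a signed comparison.

Note that (SF xor OF) is 1 if the left operand is less than the right operand. Conversely, (SF xor OF) is 0 if the left operand is greater than or equal to the right operand.

To understand why these flags are set in this manner, consider the examples in Table 6-4.

Table 6-4: Sign and Overflow Flag Settings After Subtraction

Left	Minus	Right	SF	OF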
0FFFFh (–1)	–	0FFFEh (–2)	0	0
8000h (–32,768)	–	0001h	0	1
0FFFEh (–2)	–	0FFFFh (–1)	1	0
7FFFh (32767)	–	0FFFFh (–1)	1	1

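Remember, the cmp operation is really a subtraction; therefore, the first example in Table 6-4 computes (-1) - (-2), which is (+1). The result is positive and an overflow did not occur, so both the S and O flags are 0. Because (SF xor OF) is 0, Left is greater than or equal to Right.

In the second example, the cmp instruction computes (-32,768) - (+1), which is (-32,769). Because a 16-bit signed integer cannot represent this value, the value wraps around to 7FFFh (+32,767) and sets the overflow flag. The result is positive (at least as a 16-bit value), so the CPU clears the sign flag. (SF xor OF) is 1 here, so Left is less than Right.

In the third example, cmp computes (-2) - (-1), which produces (-1). No overflow occurred, so OF is 0, and the result is negative, so SF is 1. Because (SF xor OF) is 1, Left is less than Right.

In the fourth (and final) example, cmp computes (+32,767) - (-1). This produces (+32,768), setting the overflow flag. Furthermore, the value wraps around to 8000h (-32,768), so the sign flag is set as well. Because (SF xor OF) is 0, Left is greater than or equal to Right.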
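The four rows of Table 6-4 can be checked mechanically with a small C model of a 16-bit cmp. This is a sketch under assumed names (Flags, cmp16, signed_less); it derives the flags from the operands using standard two's-complement rules rather than reading them from a real CPU.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool zf, cf, sf, of;
} Flags;

/* Model of "cmp left, right" on 16-bit operands: compute
   left - right, discard the difference, keep the flags. */
Flags cmp16(uint16_t left, uint16_t right)
{
    uint16_t diff = (uint16_t)(left - right);
    Flags f;

    f.zf = (diff == 0);
    f.cf = (left < right);                  /* unsigned borrow */
    f.sf = (diff & 0x8000u) != 0;           /* HO bit of the result */
    /* Signed overflow: the operands differ in sign and the result's
       sign differs from the left operand's sign. */
    f.of = (((left ^ right) & (left ^ diff)) & 0x8000u) != 0;
    return f;
}

/* Signed "less than" as the text describes it: SF xor OF. */
bool signed_less(Flags f)
{
    return f.sf != f.of;
}
```

Feeding in the table's operand pairs (0FFFFh and 0FFFEh, 8000h and 0001h, and so on) reproduces the SF and OF columns, and signed_less agrees with the signed interpretation of each pair.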

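6.1.5 The setcc Instructions

The setcc (set on condition) instructions set a single-byte operand (register or memory) to 0 or 1 depending on the values in the FLAGS register. The general formats for the setcc instructions are as follows: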

set`cc` `reg`[8]
set`cc` `mem`[8]

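The setcc represents the mnemonics appearing in Tables 6-5, 6-6, and 6-7. These instructions store a 0 in the corresponding operand if the condition is false, and they store a 1 in the 8-bit operand if the condition is true.

Table 6-5: setcc Instructions That Test Flags

Instruction	Description	Condition	Comments
`setc`	Set if carry	Carry = 1	Same as `setb`, `setnae`
`setnc`	Set if no carry	Carry = 0	Same as `setnb`, `setae`
`setz`	Set if zero	Zero = 1	Same as `sete`
`setnz`	Set if not zero	Zero = 0	Same as `setne`
`sets`	Set if sign	Sign = 1
`setns`	Set if no sign	Sign = 0
`seto`	Set if overflow	Overflow = 1
`setno`	Set if no overflow	Overflow = 0
`setp`	Set if parity	Parity = 1	Same as `setpe`
`setpe`	Set if parity even	Parity = 1	Same as `setp`
`setnp`	Set if no parity	Parity = 0	Same as `setpo`
`setpo`	Set if parity odd	Parity = 0	Same as `setnp`

The setcc instructions in Table 6-5 simply test the flags without any other meaning attached to the operation. You could, for example, use setc to check the carry flag after a shift, rotate, bit test, or arithmetic operation.

The setp/setpe and setnp/setpo instructions check the parity flag. These instructions appear here for completeness, but this book will not spend much time discussing the parity flag; in modern code, it is typically used only to check for an FPU not-a-number (NaN) condition.

The cmp instruction works synergistically with the setcc instructions. Immediately after a cmp operation, the processor flags provide information concerning the relative values of the operands. They allow you to see if one operand is less than, equal to, or greater than the other.

Two additional groups of setcc instructions are useful after a cmp operation. The first group deals with the result of an unsigned comparison (Table 6-6); the second group deals with the result of a signed comparison (Table 6-7).

Table 6-6: setcc Instructions for Unsigned Comparisons

Instruction	Description	Condition	Comments
`seta`	Set if above (`>`)	Carry `=` 0, Zero `=` 0	Same as `setnbe`
`setnbe`	Set if not below or equal (not `≤`)	Carry `=` 0, Zero `=` 0	Same as `seta`
`setae`	Set if above or equal (`≥`)	Carry `=` 0	Same as `setnc`, `setnb`
`setnb`	Set if not below (not `<`)	Carry `=` 0	Same as `setnc`, `setae`
`setb`	Set if below (`<`)	Carry `=` 1	Same as `setc`, `setnae`
`setnae`	Set if not above or equal (not `≥`)	Carry `=` 1	Same as `setc`, `setb`
`setbe`	Set if below or equal (`≤`)	Carry `=` 1 or Zero `=` 1	Same as `setna`
`setna`	Set if not above (not `>`)	Carry `=` 1 or Zero `=` 1	Same as `setbe`
`sete`	Set if equal (`==`)	Zero `=` 1	Same as `setz`
`setne`	Set if not equal (`≠`)	Zero `=` 0	Same as `setnz`

Table 6-7: setcc Instructions for Signed Comparisons

Instruction	Description	Condition	Comments
`setg`	Set if greater (`>`)	Sign `==` Overflow and Zero `==` 0	Same as `setnle`
`setnle`	Set if not less than or equal (not `≤`)	Sign `==` Overflow and Zero `==` 0	Same as `setg`
`setge`	Set if greater than or equal (`≥`)	Sign `==` Overflow	Same as `setnl`
`setnl`	Set if not less than (not `<`)	Sign `==` Overflow	Same as `setge`
`setl`	Set if less than (`<`)	Sign `≠` Overflow	Same as `setnge`
`setnge`	Set if not greater than or equal (not `≥`)	Sign `≠` Overflow	Same as `setl`
`setle`	Set if less than or equal (`≤`)	Sign `≠` Overflow or Zero `==` 1	Same as `setng`
`setng`	Set if not greater than (not `>`)	Sign `≠` Overflow or Zero `==` 1	Same as `setle`
`sete`	Set if equal (`==`)	Zero `==` 1	Same as `setz`
`setne`	Set if not equal (`≠`)	Zero `==` 0	Same as `setnz`

The setcc instructions are particularly valuable because they can convert the result of a comparison to a Boolean value (false/true or 0/1). This is especially important when translating statements from a high-level language like Swift or C/C++ into assembly language. The following example shows how to use these instructions in this manner: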

; bool = a <= b:

          mov eax, a
          cmp eax, b
          setle bool    ; bool is a byte variable

Because the setcc instructions always produce 0 or 1, you can use the results with the and and or instructions to compute complex Boolean values:

; bool = ((a <= b) && (d == e)):

          mov   eax, a
          cmp   eax, b
          setle bl
          mov   eax, d
          cmp   eax, e
          sete  bh
          and   bh, bl
          mov   bool, bh
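
As a cross-check on the sequence above, here is a minimal C sketch of the same computation. The helper names (`set_le`, `set_e`, `both_true`) are hypothetical and only model what the `setcc` instructions leave in a byte register; the sketch assumes 32-bit signed operands, as in the assembly.

```c
#include <stdint.h>

/* Models `cmp a, b` followed by `setle`: the result is exactly 0 or 1. */
static uint8_t set_le(int32_t a, int32_t b) { return (uint8_t)(a <= b); }

/* Models `cmp d, e` followed by `sete`: the result is exactly 0 or 1. */
static uint8_t set_e(int32_t d, int32_t e) { return (uint8_t)(d == e); }

/* bool = ((a <= b) && (d == e)), computed without branches.
   Because both inputs are 0 or 1, a bitwise AND behaves like logical AND. */
uint8_t both_true(int32_t a, int32_t b, int32_t d, int32_t e)
{
    uint8_t bl = set_le(a, b);
    uint8_t bh = set_e(d, e);
    return (uint8_t)(bh & bl);
}
```

The bitwise AND is a safe stand-in for `&&` here only because each operand is guaranteed to be 0 or 1.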

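6.1.6 The `test` Instruction

The x86-64 `test` instruction is to the `and` instruction what the `cmp` instruction is to `sub`. That is, the `test` instruction computes the logical AND of its two operands and sets the condition code flags based on the result; it does not, however, store the result of the logical AND back into the destination operand. The syntax for the `test` instruction is similar to that of `and`:

test `operand1`, `operand2`

The `test` instruction sets the zero flag if the result of the logical AND operation is 0. It sets the sign flag if the HO bit of the result contains a 1. The `test` instruction always clears the carry and overflow flags.

The primary use of the `test` instruction is to check whether an individual bit contains a 0 or a 1. Consider the instruction `test al, 1`. This instruction logically ANDs AL with the value 1; if bit 0 of AL contains 0, the result is 0 (setting the zero flag) because all the other bits in the constant 1 are 0. Conversely, if bit 0 of AL contains 1, the result is not 0, so `test` clears the zero flag. Therefore, you can test the zero flag after this `test` instruction to see whether bit 0 contains a 0 or a 1 (for example, by using the `setz` or `setnz` instructions, or the `jz`/`jnz` instructions).

The `test` instruction can also check whether all the bits in a specified set of bits contain 0. The instruction `test al, 0fh` sets the zero flag if and only if the LO 4 bits of AL all contain 0.

One important use of the `test` instruction is to check whether a register contains 0. The instruction `test reg, reg`, where both operands are the same register, logically ANDs that register with itself. If the register contains 0, the result is 0 and the CPU sets the zero flag. However, if the register contains a nonzero value, logically ANDing that value with itself produces the same nonzero value, so the CPU clears the zero flag. Therefore, you can check the zero flag immediately after executing this instruction (for example, using the `setz` or `setnz` instructions, or the `jz` and `jnz` instructions) to see whether the register contains 0. Here are some examples: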

          test eax, eax
          setz bl          ; BL is set to 1 if EAX contains 0
               .
               .
               .
          test bl, bl
          jz   bxIs0

     `Do something if BL != 0`

bxIs0:

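One major failing of the `test` instruction is that immediate (constant) operands can be no larger than 32 bits (as is the case with most instructions), which makes it difficult to use this instruction to test for set bits beyond bit position 31. For testing individual bits, you can use the `bt` (bit test) instruction (see “Instructions That Manipulate Bits” in Chapter 12). Otherwise, you’ll have to move the 64-bit constant into a register (the `mov` instruction does support 64-bit immediate operands) and then test your target register against the 64-bit constant value in the newly loaded register.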
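
The flag behavior described in this section can be modeled in a few lines of C. This is only an illustrative sketch: the function names are hypothetical, and each function returns what the zero flag (or its inverse) would report after the corresponding `test` instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Zero flag after `test value, mask`: set when (value AND mask) == 0.
   The AND result itself is discarded, just as `test` discards it. */
bool test_sets_zf(uint64_t value, uint64_t mask)
{
    return (value & mask) == 0;
}

/* Like `test al, 1` followed by `setnz`: is bit 0 a 1? */
bool bit0_is_set(uint8_t al) { return !test_sets_zf(al, 1); }

/* Like `test al, 0fh` followed by `setz`: are the LO 4 bits all 0? */
bool low_nibble_clear(uint8_t al) { return test_sets_zf(al, 0x0F); }

/* Like `test reg, reg` followed by `setz`: does the register contain 0? */
bool is_zero(uint64_t reg) { return test_sets_zf(reg, reg); }
```

Passing a 64-bit mask to `test_sets_zf` corresponds to the register-against-register form of `test`, because the instruction's immediate operand cannot hold a full 64-bit constant.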

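6.2 Arithmetic Expressions

Probably the biggest shock to beginners facing assembly language for the first time is the lack of familiar arithmetic expressions. Arithmetic expressions, in most high-level languages, look similar to their algebraic equivalents. For example:

x = y * z;

In assembly language, you’ll need several statements to accomplish this same task: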

mov  eax, y
imul eax, z
mov  x, eax

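Obviously, the high-level language version is much easier to type, read, and understand. Although a lot of typing is involved, converting an arithmetic expression into assembly language isn’t difficult. By attacking the problem in steps, the same way you would solve the problem by hand, you can easily break any arithmetic expression into an equivalent sequence of assembly language statements.

6.2.1 Simple Assignments

The easiest expressions to convert to assembly language are simple assignments. Simple assignments copy a single value into a variable and take one of two forms:

`variable` = `constant`

or

`var1` = `var2`

Converting the first form to assembly language is simple; just use the following assembly language statement:

mov `variable`, `constant`

This `mov` instruction copies the constant into the variable.

The second form of assignment is slightly more complicated because the x86-64 doesn’t provide a memory-to-memory `mov` instruction. Therefore, to copy one memory variable into another, you must move the data through a register. By convention (and for slight efficiency reasons), most programmers tend to use AL, AX, EAX, or RAX for this purpose. For example:

`var1` = `var2`;

becomes

mov eax, `var2`
mov `var1`, eax

assuming that `var1` and `var2` are 32-bit variables. Use AL if they are 8-bit variables, AX if they are 16-bit variables, or RAX if they are 64-bit variables.

Of course, if you’re already using AL, AX, EAX, or RAX for something else, one of the other registers will suffice. Regardless, you will generally use a register to transfer one memory location to another.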

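6.2.2 Simple Expressions

The next level of complexity is a simple expression. A simple expression takes the form

`var1` = `term1` `op` `term2`;

where `var1` is a variable, `term1` and `term2` are variables or constants, and `op` is an arithmetic operator (addition, subtraction, multiplication, and so on). Most expressions take this form. It should come as no surprise, then, that the x86-64 architecture was optimized for just this type of expression.

A typical conversion for this type of expression takes the following form:

mov eax, `term1`
`op`  eax, `term2`
mov `var1`, eax

where `op` is the mnemonic that corresponds to the specified operation (for example, + is `add`, - is `sub`, and so forth).

Note that the simple expression `var1` = `const1` `op` `const2`; is easily handled with a compile-time expression and a single `mov` instruction. For example, to compute `var1 = 5 + 3;`, use the single instruction `mov var1, 5 + 3`.

You need to be aware of a few inconsistencies. When dealing with the `(i)mul` and `(i)div` instructions on the x86-64, you must use the AL, AX, EAX, and RAX registers and the AH, DX, EDX, and RDX registers; you cannot use arbitrary registers as you can with other operations (the two- and three-operand forms of `imul` are the exception). Also, don’t forget the sign-extension instructions if you’re performing a division operation to divide one 16-, 32-, or 64-bit number by another. Finally, don’t forget that some instructions may cause overflow. You may want to check for an overflow (or underflow) condition after an arithmetic operation.

Here are examples of common simple expressions: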

; x = y + z:

          mov eax, y
          add eax, z
          mov x, eax

; x = y - z:

          mov eax, y
          sub eax, z
          mov x, eax

; x = y * z; (unsigned):

          mov eax, y
          mul z              ; Don't forget this wipes out EDX
          mov x, eax

; x = y * z; (signed):

          mov  eax, y
          imul eax, z        ; Does not affect EDX!
          mov x, eax

; x = y div z; (unsigned div):

          mov eax, y
          xor edx, edx       ; Zero-extend EAX into EDX
          div z
          mov x, eax

; x = y idiv z; (signed div):

          mov eax, y
          cdq                ; Sign-extend EAX into EDX
          idiv z
          mov x, eax

; x = y % z; (unsigned remainder):

          mov  eax, y
          xor  edx, edx      ; Zero-extend EAX into EDX
          div  z
          mov  x, edx        ; Note that remainder is in EDX

; x = y % z; (signed remainder):

          mov  eax, y
          cdq                ; Sign-extend EAX into EDX
          idiv z
          mov  x, edx        ; Remainder is in EDX
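
The division and remainder sequences above have direct C counterparts, which can help when checking a hand conversion. The sketch below is illustrative only: the struct and function names are hypothetical, and it assumes C99 or later (where `/` truncates toward zero and `%` takes the sign of the dividend, matching `idiv`), a nonzero divisor, and a signed quotient that fits in 32 bits.

```c
#include <stdint.h>

typedef struct { int32_t quotient; int32_t remainder; } SignedDiv;
typedef struct { uint32_t quotient; uint32_t remainder; } UnsignedDiv;

/* Mirrors `mov eax, y` / `cdq` / `idiv z`:
   quotient in EAX, remainder in EDX. */
SignedDiv signed_divide(int32_t y, int32_t z)
{
    SignedDiv r;
    r.quotient  = y / z;   /* truncates toward zero */
    r.remainder = y % z;   /* same sign as the dividend y */
    return r;
}

/* Mirrors `mov eax, y` / `xor edx, edx` / `div z`:
   the dividend is zero-extended, so the bit pattern is treated as unsigned. */
UnsignedDiv unsigned_divide(uint32_t y, uint32_t z)
{
    UnsignedDiv r;
    r.quotient  = y / z;
    r.remainder = y % z;
    return r;
}
```

Feeding the same bit pattern to both functions shows why the choice between `cdq` and `xor edx, edx` matters: the pattern for -7 divides to -3 when treated as signed, but to a large positive quotient when treated as unsigned.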

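Certain unary operations also qualify as simple expressions, producing additional inconsistencies to the general rule. A good example of a unary operation is negation. In a high-level language, negation takes one of two possible forms:

`var` = `-var`

or

`var1` = `-var2`

Note that `var` = `-constant` is really a simple assignment, not a simple expression. You can specify a negative constant as an operand to the `mov` instruction:

mov var, -14

To handle `var1` = `-var1`, use this single assembly language statement:

; `var1` = `-var1`;

neg `var1`

If two different variables are involved, use the following:

; `var1` = `-var2`;

mov eax, `var2`
neg eax
mov `var1`, eax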

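6.2.3 Complex Expressions

A complex expression is any arithmetic expression involving more than two terms and one operator. Such expressions are commonly found in programs written in a high-level language. Complex expressions may include parentheses to override operator precedence, function calls, array accesses, and so on. This section outlines the rules for converting such expressions.

A complex expression that is easy to convert to assembly language is one that involves three terms and two operators. For example:

w = w - y - z;

Clearly, the straightforward assembly language conversion of this statement requires two `sub` instructions. However, even with an expression as simple as this, the conversion is not trivial. There are actually two ways to convert the preceding statement into assembly language:

mov eax, w
sub eax, y
sub eax, z
mov w, eax

and

mov eax, y
sub eax, z
sub w, eax

The second conversion, because it is shorter, looks better. However, it produces an incorrect result (assuming C-like semantics for the original statement). Associativity is the problem. The second sequence computes `w = w - (y - z)`, which is not the same as `w = (w - y) - z`. How we place the parentheses around the subexpressions can affect the result. Note that if you are interested in a shorter form, you can use the following sequence:

mov eax, y
add eax, z
sub w, eax

This computes `w = w - (y + z)`, which is equivalent to `w = (w - y) - z`.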
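
The three groupings discussed above are easy to compare numerically. The following C sketch is an illustration with hypothetical function names; it assumes small operand values so that no signed overflow occurs (signed overflow is undefined in C, whereas the assembly code would simply wrap around).

```c
#include <stdint.h>

/* What the C statement w = w - y - z means: left-associative. */
int32_t left_assoc(int32_t w, int32_t y, int32_t z)
{
    return (w - y) - z;
}

/* What the shorter but incorrect sequence computes. */
int32_t wrong_grouping(int32_t w, int32_t y, int32_t z)
{
    return w - (y - z);
}

/* What the shorter and correct sequence computes. */
int32_t short_form(int32_t w, int32_t y, int32_t z)
{
    return w - (y + z);
}
```

With w = 10, y = 3, and z = 2, the first and third functions both return 5, while the second returns 9.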

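Precedence is another issue. Consider this expression:

x = w * y + z;

Once again, we can evaluate this expression in two ways:

x = (w * y) + z;

or

x = w * (y + z);

By now, you’re probably thinking that this explanation is crazy. Everyone knows the correct way to evaluate these expressions is by the former form. However, you’d be wrong. The APL programming language, for example, evaluates expressions solely from right to left and does not give one operator precedence over another. Which way is “correct” depends entirely on how you define precedence in your arithmetic system.

Consider this expression:

x `op1` y `op2` z

If `op1` takes precedence over `op2`, then this evaluates to (x `op1` y) `op2` z. Otherwise, if `op2` takes precedence over `op1`, this evaluates to x `op1` (y `op2` z). Depending on the operators and operands involved, these two computations could produce different results.

Most high-level languages use a fixed set of precedence rules to describe the order of evaluation in an expression involving two or more different operators. Such programming languages usually compute multiplication and division before addition and subtraction. Those that support exponentiation (for example, FORTRAN and BASIC) usually compute that before multiplication and division. These rules are intuitive because almost everyone learns them before high school.

When converting expressions into assembly language, you must be sure to compute the subexpression with the highest precedence first. The following example demonstrates this technique: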

; w = x + y * z:

          mov ebx, x
          mov eax, y     ; Must compute y * z first because "*"
          imul eax, z    ; has higher precedence than "+"
          add eax, ebx
          mov w, eax

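If two operators appearing within an expression have the same precedence, you determine the order of evaluation by using associativity rules. Most operators are left-associative, meaning they evaluate from left to right. Addition, subtraction, multiplication, and division are all left-associative. A right-associative operator evaluates from right to left. The exponentiation operator in FORTRAN is a good example of a right-associative operator:

2**2**3

is equal to

2**(2**3)

not

(2**2)**3

The precedence and associativity rules determine the order of evaluation. Indirectly, these rules tell you where to place parentheses in an expression to determine the order of evaluation. Of course, you can always use parentheses to override the default precedence and associativity. However, the ultimate point is that your assembly code must complete certain operations before others to correctly compute the value of a given expression. The following examples demonstrate this principle: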

; w = x - y - z:

          mov eax, x     ; All the same operator precedence,
          sub eax, y     ; so we need to evaluate from left
          sub eax, z     ; to right because they are left-
          mov w, eax     ; associative

; w = x + y * z:

          mov  eax, y    ; Must compute y * z first because
          imul eax, z    ; multiplication has a higher
          add eax, x     ; precedence than addition
          mov w, eax

; w = x / y - z:

          mov  eax, x    ; Here we need to compute division
          cdq            ; first because it has the highest
          idiv y         ; precedence
          sub eax, z
          mov w, eax

; w = x * y * z:

          mov  eax, y     ; Addition and multiplication are
          imul eax, z     ; commutative; therefore, the order
          imul eax, x     ; of evaluation does not matter
          mov  w, eax

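The associativity rule has one exception: if an expression involves multiplication and division, it is generally better to perform the multiplication first. For example, given an expression of the form

w = x / y * z      ; Note: This is (x * z) / y, not x / (y * z)

it is usually better to compute `x * z` and then divide the result by `y` rather than divide `x` by `y` and multiply the quotient by `z`.

This approach has two advantages. First, remember that the single-operand (extended) `imul` instruction always produces a 64-bit result (assuming 32-bit operands). By doing the multiplication first, you automatically sign-extend the product into the EDX register, so you do not have to sign-extend EAX prior to the division.

A second reason for doing the multiplication first is to increase the accuracy of the computation. Remember, (integer) division often produces an inexact result. For example, if you compute 5 / 2, you will get the value 2, not 2.5. Computing (5 / 2) × 3 produces 6. However, if you compute (5 × 3) / 2, you get the value 7, which is a little closer to the real quotient (7.5). Therefore, if you encounter an expression of the form

w = x / y * z;

you can usually convert it to the following assembly code:

mov  eax, x
imul z      ; Note the use of extended imul!
idiv y
mov  w, eax

If the algorithm you’re encoding depends on the truncation effect of the division operation, you cannot use this trick to improve the algorithm. Moral of the story: always make sure you fully understand any expression you are converting to assembly language. If the semantics dictate that you must perform the division first, then do so.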
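
To see the difference the ordering makes, here is a minimal C sketch of both evaluation orders. The function names are hypothetical. The 64-bit intermediate stands in for the EDX:EAX product left by the extended `imul`, and the sketch assumes a nonzero divisor and a final quotient that fits in 32 bits (the real `idiv` raises a divide error otherwise).

```c
#include <stdint.h>

/* Division first: the truncation happens before the multiplication. */
int32_t divide_then_multiply(int32_t x, int32_t y, int32_t z)
{
    return (x / y) * z;
}

/* Multiplication first, as in the imul/idiv sequence above. */
int32_t multiply_then_divide(int32_t x, int32_t y, int32_t z)
{
    int64_t product = (int64_t)x * z;   /* like imul z: 64-bit product */
    return (int32_t)(product / y);      /* like idiv y: quotient only  */
}
```

For x = 5, y = 2, z = 3, the first function returns 6 and the second returns 7, matching the figures in the text. The wide intermediate also lets the second form handle products that would not fit in 32 bits.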

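Consider the following statement:

w = x - y * x;

Because subtraction is not commutative, you cannot compute `y * x` and then subtract `x` from this result. Rather than use a straightforward multiplication-and-addition sequence, you’ll have to load `x` into a register, multiply `y` and `x` (leaving their product in a different register), and then subtract this product from `x`. For example: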

```
mov  ecx, x
mov  eax, y
imul eax, x
sub  ecx, eax
mov  w, ecx
```

This trivial example demonstrates the need for *temporary variables* in an expression. The code uses the ECX register to temporarily hold a copy of `x` until it computes the product of `y` and `x`. As your expressions increase in complexity, the need for temporaries grows. Consider the following C statement:

```
w = (a + b) * (y + z);
```

Following the normal rules of algebraic evaluation, you compute the subexpressions inside the parentheses first (that is, the two subexpressions with the highest precedence) and set their values aside. When you’ve computed the values for both subexpressions, you can compute their product. One way to deal with a complex expression like this is to reduce it to a sequence of simple expressions whose results wind up in temporary variables. For example, you can convert the preceding single expression into the following sequence:

```
temp1 = a + b;
temp2 = y + z;
w = temp1 * temp2;
```

Because converting simple expressions to assembly language is quite easy, it’s now a snap to compute the former complex expression in assembly. The code is shown here:

```
mov  eax, a
add  eax, b
mov  temp1, eax
mov  eax, y
add  eax, z
mov  temp2, eax
mov  eax, temp1
imul eax, temp2
mov  w, eax
```

This code is grossly inefficient and requires that you declare a couple of temporary variables in your data segment. However, it is easy to optimize this code by keeping temporary variables, as much as possible, in x86-64 registers.
By using x86-64 registers to hold the temporary results, this code becomes the following:

```
mov  eax, a
add  eax, b
mov  ebx, y
add  ebx, z
imul eax, ebx
mov  w, eax
```

Here’s yet another example:

```
x = (y + z) * (a - b) / 10;
```

This can be converted to a set of four simple expressions:

```
temp1 = (y + z)
temp2 = (a - b)
temp1 = temp1 * temp2
x = temp1 / 10
```

You can convert these four simple expressions into the following assembly language statements:

```
          .const
ten       dword 10
           .
           .
           .
          mov  eax, y     ; Compute EAX = y + z
          add  eax, z
          mov  ebx, a     ; Compute EBX = a - b
          sub  ebx, b
          imul ebx        ; This sign-extends EAX into EDX
          idiv ten
          mov  x, eax
```

The most important thing to keep in mind is that you should keep temporary values in registers for efficiency. Use memory locations to hold temporaries only if you’ve run out of registers.

Ultimately, converting a complex expression to assembly language is very similar to solving the expression by hand, except instead of actually computing the result at each stage of the computation, you simply write the assembly code that computes the result.

### 6.2.4 Commutative Operators

If `op` represents an operator, that operator is *commutative* if the following relationship is always true:

```
(A `op` B) = (B `op` A)
```

As you saw in the previous section, commutative operators are nice because the order of their operands is immaterial, and this lets you rearrange a computation, often making it easier or more efficient. Often, rearranging a computation allows you to use fewer temporary variables. Whenever you encounter a commutative operator in an expression, you should always check whether you can use a better sequence to improve the size or speed of your code. Tables 6-8 and 6-9, respectively, list the commutative and noncommutative operators you typically find in high-level languages.

Table 6-8: Common Commutative Binary Operators

| **Pascal** | **C/C++** | **Description** |
| --- | --- | --- |
| `+` | `+` | Addition |
| `*` | `*` | Multiplication |
| `and` | `&&` or `&` | Logical or bitwise AND |
| `or` | `\|\|` or `\|` | Logical or bitwise OR |
| `xor` | `^` | (Logical or) bitwise exclusive-OR |
| `=` | `==` | Equality |
| `<>` | `!=` | Inequality |

Table 6-9: Common Noncommutative Binary Operators

| **Pascal** | **C/C++** | **Description** |
| --- | --- | --- |
| `-` | `-` | Subtraction |
| `/` or `div` | `/` | Division |
| `mod` | `%` | Modulo or remainder |
| `<` | `<` | Less than |
| `<=` | `<=` | Less than or equal |
| `>` | `>` | Greater than |
| `>=` | `>=` | Greater than or equal |

## 6.3 Logical (Boolean) Expressions

Consider the following expression from a C/C++ program:

```
b = ((x == y) && (a <= d)) || ((z - a) != 5);
```

Here, `b` is a Boolean variable, and the remaining variables are all integers.

Although it takes only a single bit to represent a Boolean value, most assembly language programmers allocate a whole byte or word to represent Boolean variables. Most programmers (and, indeed, some programming languages like C) choose 0 to represent false and anything else to represent true. Some people prefer to represent true and false with 1 and 0 (respectively) and not allow any other values. Others select all 1 bits (0FFFF_FFFF_FFFF_FFFFh, 0FFFF_FFFFh, 0FFFFh, or 0FFh) for true and 0 for false. You could also use a positive value for true and a negative value for false. All these mechanisms have their advantages and drawbacks.

Using only 0 and 1 to represent false and true offers two big advantages. First, the `setcc` instructions produce these results, so this scheme is compatible with those instructions. Second, the x86-64 logical instructions (`and`, `or`, `xor`, and, to a lesser extent, `not`) operate on these values exactly as you would expect.
That is, if you have two Boolean variables `a` and `b`, then the following instructions perform the basic logical operations on these two variables:

```
; d = a AND b:

          mov al, a
          and al, b
          mov d, al

; d = a || b:

          mov al, a
          or  al, b
          mov d, al

; d = a XOR b:

          mov al, a
          xor al, b
          mov d, al

; b = NOT a:

          mov al, a    ; Note that the NOT instruction does not
          not al       ; properly compute AL = NOT AL by itself.
          and al, 1    ; That is, (NOT 0) does not equal 1. The AND
          mov b, al    ; instruction corrects this problem

          mov al, a    ; Another way to do b = NOT a;
          xor al, 1    ; Inverts bit 0
          mov b, al
```

As pointed out here, the `not` instruction will not properly compute logical negation. The bitwise `not` of 0 is 0FFh, and the bitwise `not` of 1 is 0FEh. Neither result is 0 or 1. However, by ANDing the result with 1, you get the proper result. Note that you can implement the `not` operation more efficiently by using the `xor al, 1` instruction because it affects only the LO bit.

As it turns out, using 0 for false and anything else for true has a lot of subtle advantages. Specifically, the test for true or false is often implicit in the execution of any logical instruction. However, this mechanism suffers from a big disadvantage: you cannot use the x86-64 `and`, `or`, `xor`, and `not` instructions to implement the Boolean operations of the same name. Consider the two values 55h and 0AAh. They’re both nonzero, so they both represent the value true. However, if you logically AND 55h and 0AAh together by using the x86-64 `and` instruction, the result is 0. True AND true should produce true, not false. Although you can account for situations like this, it usually requires a few extra instructions and is somewhat less efficient when computing Boolean operations.

A system that uses nonzero values to represent true and 0 to represent false is an *arithmetic logical system*.
A system that uses two distinct values like 0 and 1 to represent false and true is called a *Boolean logical system*, or simply a Boolean system. You can use either system, as convenient. Consider again this Boolean expression:

```
b = ((x == y) && (a <= d)) || ((z - a) != 5);
```

The resulting simple expressions might be as follows:

```
          mov   eax, x
          cmp   eax, y
          sete  al        ; AL = x == y;

          mov   ebx, a
          cmp   ebx, d
          setle bl        ; BL = a <= d;
          and   bl, al    ; BL = (x == y) && (a <= d);

          mov   eax, z
          sub   eax, a
          cmp   eax, 5
          setne al
          or    al, bl    ; AL = ((x == y) && (a <= d)) ||
          mov   b, al     ;      ((z - a) != 5);
```

When working with Boolean expressions, don’t forget that you might be able to optimize your code by simplifying them with algebraic transformations. In Chapter 7, you’ll also see how to use control flow to calculate a Boolean result, which is generally quite a bit more efficient than using *complete Boolean evaluation*, as the examples in this section teach.

## 6.4 Machine and Arithmetic Idioms

An *idiom* is an idiosyncrasy (a peculiarity). Several arithmetic operations and x86-64 instructions have idiosyncrasies that you can take advantage of when writing assembly language code. Some people refer to the use of machine and arithmetic idioms as *tricky programming* that you should always avoid in well-written programs. While it is wise to avoid tricks just for the sake of tricks, many machine and arithmetic idioms are well known and commonly found in assembly language programs. You will see some important idioms all the time, so it makes sense to discuss them.

### 6.4.1 Multiplying Without mul or imul

When multiplying by a constant, you can sometimes write faster code by using shifts, additions, and subtractions in place of multiplication instructions.

Remember, a `shl` instruction computes the same result as multiplying the specified operand by 2. Shifting to the left two bit positions multiplies the operand by 4. Shifting to the left three bit positions multiplies the operand by 8.
In general, shifting an operand to the left *n* bits multiplies it by 2^(*n*). You can multiply any value by a constant by using a series of shifts and additions or shifts and subtractions. For example, to multiply the AX register by 10, you need only multiply it by 8 and then add two times the original value. That is, 10 × AX = 8 × AX + 2 × AX. The code to accomplish this is as follows:

```
shl ax, 1      ; Multiply AX by 2
mov bx, ax     ; Save 2 * AX for later
shl ax, 2      ; Multiply AX by 8 (*4 really,
               ; but AX contains *2)
add ax, bx     ; Add in AX * 2 to AX * 8 to get AX * 10
```

If you look at the instruction timings, the preceding shift-and-add example requires fewer clock cycles on some processors in the 80x86 family than the `mul` instruction. Of course, the code is somewhat larger (by a few bytes), but the performance improvement is usually worth it.

You can also use subtraction with shifts to perform a multiplication operation. Consider the following multiplication by 7:

```
mov ebx, eax   ; Save EAX * 1
shl eax, 3     ; EAX = EAX * 8
sub eax, ebx   ; EAX * 8 - EAX * 1 is EAX * 7
```

A common error that beginning assembly language programmers make is subtracting or adding 1 or 2 rather than EAX × 1 or EAX × 2. The following does not compute EAX × 7:

```
shl eax, 3
sub eax, 1
```

It computes (8 × EAX) – 1, something entirely different (unless, of course, EAX = 1). Beware of this pitfall when using shifts, additions, and subtractions to perform multiplication operations.

You can also use the `lea` instruction to compute certain products. The trick is to use the scaled-index addressing modes.
The following examples demonstrate some simple cases:

```
lea eax, [ecx][ecx]       ; EAX = ECX * 2
lea eax, [eax][eax * 2]   ; EAX = EAX * 3
lea eax, [eax * 4]        ; EAX = EAX * 4
lea eax, [ebx][ebx * 4]   ; EAX = EBX * 5
lea eax, [eax * 8]        ; EAX = EAX * 8
lea eax, [edx][edx * 8]   ; EAX = EDX * 9
```

As time has progressed, Intel (and AMD) has improved the performance of the `imul` instruction to the point that it rarely makes sense to try to improve performance by using *strength-reduction optimizations* such as substituting shifts and additions for a multiplication. You should consult the Intel and AMD documentation (particularly the section on instruction timing) to see if a multi-instruction sequence is faster. Generally, a single shift instruction (for multiplication by a power of 2) or `lea` is going to produce better results than `imul`; beyond that, it’s best to measure and see.

### 6.4.2 Dividing Without div or idiv

Just as the `shl` instruction is useful for simulating a multiplication by a power of 2, the `shr` and `sar` instructions can simulate a division by a power of 2. Unfortunately, you cannot easily use shifts, additions, and subtractions to perform division by an arbitrary constant. Therefore, this trick is useful only when dividing by powers of 2. Also, don’t forget that the `sar` instruction rounds toward negative infinity, unlike the `idiv` instruction, which rounds toward 0.

You can also divide by a value by multiplying by its reciprocal. Because the `mul` instruction is faster than the `div` instruction, multiplying by a reciprocal is usually faster than division.

To multiply by a reciprocal when dealing with integers, we must cheat. If you want to multiply by 1/10, there is no way you can load the value 1/10 into an x86-64 integer register prior to performing the multiplication. However, we could multiply 1/10 by 10, perform the multiplication, and then divide the result by 10 to get the final result.
Of course, this wouldn’t buy you anything; in fact, it would make things worse because you’re now doing a multiplication by 10 as well as a division by 10. However, suppose you multiply 1/10 by 65,536 (6554), perform the multiplication, and then divide by 65,536. This would still perform the correct operation, and, as it turns out, if you set up the problem correctly, you can get the division operation for free. Consider the following code that divides AX by 10:

```
mov dx, 6554    ; 6554 = round(65,536 / 10)
mul dx
```

This code leaves AX/10 in the DX register. Because 6554 is a rounded approximation of 65,536 / 10, the quotient can be off by 1 for some larger values of AX (for example, 16,389 yields 1639 rather than 1638).

To understand how this works, consider what happens when you use the `mul` instruction to multiply AX by 65,536 (1_0000h). This moves AX into DX and sets AX to 0 (a multiplication by 1_0000h is equivalent to a shift left by 16 bits). Multiplying by 6554 (65,536 divided by 10) puts AX divided by 10 into the DX register. Because `mul` is faster than `div`, this technique runs a little faster than using division.

Multiplying by a reciprocal works well when you need to divide by a constant. You could even use this approach to divide by a variable, but the overhead to compute the reciprocal pays off only if you perform the division many, many times by the same value.

### 6.4.3 Implementing Modulo-N Counters with AND

If you want to implement a counter variable that counts up to 2^(*n*) – 1 and then resets to 0, use the following code:

```
inc CounterVar
and CounterVar, `n_bits`
```

where `n_bits` is a binary value containing *n* bits of 1s right-justified in the number. For example, to create a counter that cycles between 0 and 15 (2⁴ – 1), you could use the following:

```
inc CounterVar
and CounterVar, 00001111b
```

## 6.5 Floating-Point Arithmetic

Integer arithmetic does not let you represent fractional numeric values. Therefore, modern CPUs support an approximation of *real* arithmetic: *floating-point arithmetic*.
To represent real numbers, most floating-point formats employ scientific notation and use a certain number of bits to represent a mantissa and a smaller number of bits to represent an exponent.

For example, in the number 3.456e+12, the mantissa consists of 3.456, and the exponent digits are 12. Because the number of bits is fixed in computer-based representations, computers can represent only a certain number of digits (known as *significant digits*) in the mantissa. For example, if a floating-point representation could handle only three significant digits, then the fourth digit in 3.456e+12 (the 6) could not be accurately represented with that format, as three significant digits can represent only 3.45e+12 correctly.

Because computer-based floating-point representations also use a finite number of bits to represent the exponent, the exponent also has a limited range of values, ranging from 10^(±38) for the single-precision format to 10^(±308) for the double-precision format (and up to 10^(±4932) for the extended-precision format). This is known as the *dynamic range* of the value.

A big problem with floating-point arithmetic is that it does not follow the standard rules of algebra. Normal algebraic rules apply only to *infinite-precision* arithmetic.

Consider the simple statement *x* = *x* + 1, where *x* is an integer. On any modern computer, this statement follows the normal rules of algebra *as long as overflow does not occur*. That is, this statement is valid only for certain values of *x* (*minint* ≤ *x* < *maxint*). Most programmers do not have a problem with this because they are well aware that integers in a program do not follow the standard algebraic rules (for example, 5 / 2 does not equal 2.5).

Integers do not follow the standard rules of algebra because the computer represents them with a finite number of bits. You cannot represent any of the (integer) values above the maximum integer or below the minimum integer.
Floating-point values suffer from this same problem, only worse. After all, integers are a subset of real numbers. Therefore, the floating-point values must represent the same infinite set of integers. However, an infinite number of real values exists between any two integer values. In addition to having to limit your values between a maximum and minimum range, you cannot represent all the values between any pair of integers, either.

To demonstrate the impact of limited-precision arithmetic, we will adopt a simplified decimal floating-point format for our examples. Our floating-point format will provide a mantissa with three significant digits and a decimal exponent with two digits. The mantissa and exponents are both signed values, as shown in Figure 6-1.

![f06001](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f06001.png)

Figure 6-1: A floating-point format

When adding and subtracting two numbers in scientific notation, we must adjust the two values so that their exponents are the same. Multiplication and division don’t require the exponents to be the same; instead, the exponent after a multiplication is the sum of the two operand exponents, and the exponent after a division is the difference of the dividend and divisor’s exponents.

For example, when adding 1.2e1 and 4.5e0, we must adjust the values so they have the same exponent. One way to do this is to convert 4.5e0 to 0.45e1 and then add. This produces 1.65e1. Because the computation and result require only three significant digits, we can compute the correct result via the representation shown in Figure 6-1. However, suppose we want to add the two values 1.23e1 and 4.56e0. Although both values can be represented using the three-significant-digit format, the computation and result do not fit into three significant digits.
That is, 1.23e1 + 0.456e1 requires four digits of precision in order to compute the correct result of 1.686, so we must either *round* or *truncate* the result to three significant digits. Rounding generally produces the most accurate result, so let’s round the result to obtain 1.69e1.

In fact, the rounding does not occur after adding the two values together (that is, producing the sum 1.686e1 and then rounding this to 1.69e1). The rounding actually occurs when converting 4.56e0 to 0.456e1, because the value 0.456e1 requires four digits of precision to maintain. Therefore, during the conversion, we have to round it to 0.46e1 so that the result fits into three significant digits. Then, the sum of 1.23e1 and 0.46e1 produces the final (rounded) sum of 1.69e1.

As you can see, the lack of *precision* (the number of digits or bits we maintain in a computation) affects the *accuracy* (the correctness of the computation).

In the addition/subtraction example, we were able to round the result because we maintained *four* significant digits *during* the calculation (specifically, when converting 4.56e0 to 0.456e1). If our floating-point calculation had been limited to three significant digits during computation, we would have had to truncate the last digit of the smaller number, obtaining 0.45e1, resulting in a sum of 1.68e1, a value that is even less accurate.

To improve the accuracy of floating-point calculations, it is useful to maintain one or more extra digits for use during the calculation (such as the extra digit used to convert 4.56e0 to 0.456e1). Extra digits available during a computation are known as *guard digits* (or *guard bits* in the case of a binary format). They greatly enhance accuracy during a long chain of computations.

In a sequence of floating-point operations, the error can *accumulate* and greatly affect the computation itself. For example, suppose we were to add 1.23e3 to 1.00e0.
Adjusting the numbers so their exponents are the same before the addition produces 1.23e3 + 0.001e3. The sum of these two values, even after rounding, is 1.23e3. This might seem perfectly reasonable to you; after all, we can maintain only three significant digits, so adding in a small value shouldn’t affect the result at all. However, suppose we were to add 1.00e0 to 1.23e3 *10 times*.^(5) The first time we add 1.00e0 to 1.23e3, we get 1.23e3. Likewise, we get this same result the second, third, fourth . . . and tenth times when we add 1.00e0 to 1.23e3. On the other hand, had we added 1.00e0 to itself 10 times, then added the result (1.00e1) to 1.23e3, we would have gotten a different result, 1.24e3. This is an important fact to know about limited-precision arithmetic:

> The order of evaluation can affect the accuracy of the result.

You will get more accurate results if the relative magnitudes (the exponents) are close to one another when adding and subtracting floating-point values. If you are performing a chain calculation involving addition and subtraction, you should attempt to group the values appropriately.

Another problem with addition and subtraction is that you can wind up with *false precision*. Consider the computation 1.23e0 – 1.22e0, which produces 0.01e0. Although the result is mathematically equivalent to 1.00e–2, this latter form suggests that the last two digits are exactly 0. Unfortunately, we have only a single significant digit at this time (remember, the original result was 0.01e0, and those two leading 0s were significant digits). Indeed, some floating-point unit (FPU) or software packages might actually insert random digits (or bits) into the LO positions. This brings up a second important rule concerning limited-precision arithmetic:

> Subtracting two numbers with the same signs (or adding two numbers with different signs) can produce high-order significant digits (bits) that are 0. This reduces the number of significant digits (bits) by a like amount in the final result.

By themselves, multiplication and division do not produce particularly poor results. However, they tend to multiply any error that already exists in a value. For example, if you multiply 1.23e0 by 2, when you should be multiplying 1.24e0 by 2, the result is even less accurate. This brings up a third important rule when working with limited-precision arithmetic:

> When performing a chain of calculations involving addition, subtraction, multiplication, and division, try to perform the multiplication and division operations first.

Often, by applying normal algebraic transformations, you can arrange a calculation so the multiply and divide operations occur first. For example, suppose you want to compute `x * (y + z)`. Normally, you would add `y` and `z` together and multiply their sum by `x`. However, you will get a little more accuracy if you transform `x * (y + z)` to get `x * y + x * z` and compute the result by performing the multiplications first.^(6)

Multiplication and division are not without their own problems. When two very large or very small numbers are multiplied, it is quite possible for *overflow* or *underflow* to occur. The same situation occurs when dividing a small number by a large number, or dividing a large number by a small (fractional) number. This brings up a fourth rule you should attempt to follow when multiplying or dividing values:

> When multiplying and dividing sets of numbers, try to arrange the multiplications so that they multiply large and small numbers together; likewise, try to divide numbers that have the same relative magnitudes.

Given the inaccuracies present in any computation (including converting an input string to a floating-point value), you should *never* compare two floating-point values to see if they are equal.
In a binary floating-point format, different computations that produce the same (mathematical) result may differ in their least significant bits. For example, 1.31e0 + 1.69e0 should produce 3.00e0. Likewise, 1.50e0 + 1.50e0 should produce 3.00e0. However, if you were to compare (1.31e0 + 1.69e0) against (1.50e0 + 1.50e0), you might find out that these sums are *not* equal to one another. The test for equality succeeds if and only if all bits (or digits) in the two operands are exactly the same. Because this is not necessarily true after two different floating-point computations that should produce the same result, a straight test for equality may not work. Instead, you should use the following test:

```
if `Value1` >= (`Value2` - `error`) and `Value1` <= (`Value2` + `error`) then ...
```

Another common way to handle this same comparison is to use a statement of this form:

```
if abs(`Value1` - `Value2`) <= `error` then ...
```

`error` should be a value slightly greater than the largest amount of error that will creep into your computations. The exact value will depend on the particular floating-point format you use. Here is the final rule we will state in this section:

> When comparing two floating-point numbers, always compare one value to see if it is in the range given by the second value plus or minus a small error value.

Many other little problems can occur when using floating-point values. This book can point out only some of the major problems and make you aware that you cannot treat floating-point arithmetic like real arithmetic because of the inaccuracies present in limited-precision arithmetic. A good text on numerical analysis or even scientific computing can help fill in the details. If you are going to be working with floating-point arithmetic *in any language*, you should take the time to study the effects of limited-precision arithmetic on your computations.
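
The range test from the final rule can be written directly in C. This is a minimal sketch, not a general-purpose comparison routine: the function name `nearly_equal` is hypothetical, the tolerance is an absolute error chosen by the caller, and the sketch assumes `error` is nonnegative and that neither value is a NaN.

```c
#include <stdbool.h>

/* Returns true when value1 lies within value2 plus or minus error,
   mirroring the first form of the test shown in the text. */
bool nearly_equal(double value1, double value2, double error)
{
    return value1 >= (value2 - error) && value1 <= (value2 + error);
}
```

A typical use would be `nearly_equal(0.1 + 0.2, 0.3, 1e-9)` in place of a direct `==` comparison. An absolute tolerance like this suits values of similar, known magnitude; for values spanning a wide dynamic range, a tolerance scaled to the operands is a common alternative.
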
### 6.5.1 Floating-Point on the x86-64

When the 8086 CPU first appeared in the late 1970s, semiconductor technology had not advanced to the point where Intel could put floating-point instructions directly on the 8086 CPU. Therefore, Intel devised a scheme to use a second chip to perform the floating-point calculations: the *8087 floating-point unit* (or *x87 FPU*).^(7) By the release of the Intel Pentium chip, semiconductor technology had advanced to the point that the FPU was fully integrated onto the x86 CPU. Today, the x86-64 still contains the x87 FPU device, but it has also expanded the floating-point capabilities by using the SSE, SSE2, AVX, and AVX2 instruction sets.

This section describes the x87 FPU instruction set. Later sections (and chapters) discuss the more advanced floating-point capabilities of the SSE through AVX2 instruction sets.

### 6.5.2 FPU Registers

The x87 FPUs add 14 registers to the x86-64: eight floating-point data registers, a control register, a status register, a tag register, an instruction pointer, a data pointer, and an opcode register. The *data registers* are similar to the x86-64’s general-purpose register set insofar as all floating-point calculations take place in these registers. The *control register* contains bits that let you decide how the FPU handles certain degenerate cases like rounding of inaccurate computations; it also contains bits that control precision and so on. The *status register* is similar to the x86-64’s FLAGS register; it contains the condition code bits and several other floating-point flags that describe the state of the FPU. The *tag register* contains several groups of bits that determine the state of the value in each of the eight floating-point data registers. The *instruction pointer*, *data pointer*, and *opcode* registers contain certain state information about the last floating-point instruction executed. We do not consider the last four registers here; see the Intel documentation for more details.
#### 6.5.2.1 FPU Data Registers

The FPUs provide eight 80-bit data registers organized as a stack, a significant departure from the organization of the general-purpose registers on the x86-64 CPU. MASM refers to these registers as ST(0), ST(1), . . . ST(7).^(8)

The biggest difference between the FPU register set and the x86-64 register set is the stack organization. On the x86-64 CPU, the AX register is always the AX register, no matter what happens. On the FPU, however, the register set is an eight-element stack of 80-bit floating-point values (Figure 6-2).

![f06002](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f06002.png)

Figure 6-2: FPU floating-point register stack

ST(0) refers to the item on the top of stack, ST(1) refers to the next item on the stack, and so on. Many floating-point instructions push and pop items on the stack; therefore, ST(1) will refer to the previous contents of ST(0) after you push something onto the stack. Getting used to the register numbers changing will take some thought and practice, but this is an easy problem to overcome.

#### 6.5.2.2 The FPU Control Register

When Intel designed the 8087 (and, essentially, the IEEE floating-point standard), there were no standards in floating-point hardware. Different (mainframe and mini) computer manufacturers all had different and incompatible floating-point formats. Unfortunately, several applications had been written taking into account the idiosyncrasies of these different floating-point formats.

Intel wanted to design an FPU that could work with the majority of the software out there (keep in mind that the IBM PC was three to four years away when Intel began designing the 8087, so Intel couldn’t rely on that “mountain” of software available for the PC to make its chip popular). Unfortunately, many of the features found in these older floating-point formats were mutually incompatible.
For example, in some floating-point systems, rounding would occur when there was insufficient precision; in others, truncation would occur. Some applications would work with one floating-point system but not with the other.

Intel wanted as many applications as possible to work with as few changes as possible on its 8087 FPUs, so it added a special register, the *FPU control register*, that lets the user choose one of several possible operating modes for the FPU. The 80x87 control register contains 16 bits organized as shown in Figure 6-3.

![f06003](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f06003.png)

Figure 6-3: FPU control register

Bits 10 and 11 of the FPU control register provide rounding control according to the values in Table 6-10.

Table 6-10: Rounding Control

| **Bits 10 and 11** | **Function** |
| --- | --- |
| 00 | To nearest or even |
| 01 | Round down |
| 10 | Round up |
| 11 | Truncate |

The 00 setting is the default. The FPU rounds up values above one-half of the least significant bit. It rounds down values below one-half of the least significant bit. If the value below the least significant bit is exactly one-half of the least significant bit, the FPU rounds the value toward the value whose least significant bit is 0. For long strings of computations, this provides a reasonable, automatic way to maintain maximum precision.

The round-up and round-down options are present for those computations requiring accuracy. By setting the rounding control to round down and performing the operation, then repeating the operation with the rounding control set to round up, you can determine the minimum and maximum ranges between which the true result will fall.

The truncate option forces all computations to truncate any excess bits. You will rarely use this option if accuracy is important. However, you might use this option to help when porting older software to the FPU.
This option is also extremely useful when converting a floating-point value to an integer. Because most software expects floating-point-to-integer conversions to truncate the result, you will need to use the truncation rounding mode to achieve this.

Bits 8 and 9 of the control register specify the precision during computation. This capability is provided to allow compatibility with older software as required by the IEEE 754 standard. The precision-control bits use the values in Table 6-11.

Table 6-11: Mantissa Precision-Control Bits

| **Bits 8 and 9** | **Precision control** |
| --- | --- |
| 00 | 24 bits |
| 01 | Reserved |
| 10 | 53 bits |
| 11 | 64 bits |

Some CPUs may operate faster with floating-point values whose precision is 53 bits (that is, 64-bit floating-point format) rather than 64 bits (that is, 80-bit floating-point format). See the documentation for your specific processor for details. Generally, the CPU defaults these bits to 11 to select the 64-bit mantissa precision.

Bits 0 to 5 are the *exception masks*. These are similar to the interrupt enable bit in the x86-64’s FLAGS register. If a mask bit contains a 1, the FPU ignores the corresponding condition. However, if a mask bit contains a 0 and the corresponding condition occurs, the FPU immediately generates an interrupt so the program can handle the degenerate condition.

Bit 0 corresponds to an invalid operation error, which generally occurs as the result of a programming error. Situations that raise the invalid operation exception include pushing more than eight items onto the stack or attempting to pop an item off an empty stack, taking the square root of a negative number, or loading a non-empty register.

Bit 1 masks the *denormalized* interrupt that occurs whenever you try to manipulate denormalized values. Denormalized exceptions occur when you load arbitrary extended-precision values into the FPU or work with very small numbers just beyond the range of the FPU’s capabilities.
Normally, you would probably *not* enable this exception. If you enable this exception and the FPU generates this interrupt, the Windows runtime system raises an exception.

Bit 2 masks the *zero-divide* exception. If this bit contains 0, the FPU will generate an interrupt if you attempt to divide a nonzero value by 0. If you do not enable the zero-divide exception, the FPU will produce an infinity whenever you divide a nonzero value by 0. It’s probably a good idea to enable this exception by programming a 0 into this bit. Note that if your program generates this interrupt, the Windows runtime system will raise an exception.

Bit 3 masks the *overflow* exception. The FPU will raise the overflow exception if a calculation overflows or if you attempt to store a value that is too large to fit into the destination operand (for example, storing a large extended-precision value into a single-precision variable). If you enable this exception and the FPU generates this interrupt, the Windows runtime system raises an exception.

Bit 4, if set, masks the *underflow* exception. Underflow occurs when the result is too small to fit in the destination operand. Like overflow, this exception can occur whenever you store a small extended-precision value into a smaller variable (single or double precision) or when the result of a computation is too small for extended precision. If you enable this exception and the FPU generates this interrupt, the Windows runtime system raises an exception.

Bit 5 controls whether the *precision* exception can occur. A precision exception occurs whenever the FPU produces an imprecise result, generally the result of an internal rounding operation. Although many operations will produce an exact result, many more will not. For example, dividing 1 by 10 will produce an inexact result. Therefore, this bit is usually 1 because inexact results are common. If you enable this exception and the FPU generates this interrupt, the Windows runtime system raises an exception.
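
The control-register fields described above (exception masks in bits 0 to 5, precision control in bits 8 and 9, rounding control in bits 10 and 11) can be picked apart with ordinary shifts and masks. The following C sketch is for illustration only: the function and constant names are invented for this example, and the sample value `0x037F` (the control word Intel documents as the state after `finit`) is used as an assumed starting value rather than something read from a real FPU.

```c
#include <stdint.h>

/* Rounding-control encodings from Table 6-10 (names are hypothetical). */
enum { RC_NEAREST = 0, RC_DOWN = 1, RC_UP = 2, RC_TRUNCATE = 3 };

/* Bits 10 and 11: rounding control. */
unsigned rounding_control(uint16_t cw) { return (cw >> 10) & 3u; }

/* Bits 8 and 9: precision control. */
unsigned precision_control(uint16_t cw) { return (cw >> 8) & 3u; }

/* Mantissa width in bits per Table 6-11 (0 marks the reserved encoding). */
unsigned mantissa_bits(uint16_t cw)
{
    static const unsigned bits[4] = { 24, 0, 53, 64 };
    return bits[precision_control(cw)];
}

/* Bits 0 to 5: exception masks (a 1 bit means the exception is masked). */
unsigned exception_masks(uint16_t cw) { return cw & 0x3Fu; }

/* Returns cw with the rounding-control field replaced by rc. */
uint16_t with_rounding(uint16_t cw, unsigned rc)
{
    return (uint16_t)((cw & ~0x0C00u) | ((rc & 3u) << 10));
}
```

For the sample value `0x037F`, these helpers report round-to-nearest, a 64-bit mantissa, and all six exceptions masked; `with_rounding(0x037F, RC_TRUNCATE)` yields `0x0F7F`, the kind of value a program would store in a word variable and hand to the FPU.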
Bits 6 and 7, and 12 to 15, in the control register are currently undefined and reserved for future use (bits 7 and 12 were valid on older FPUs but are no longer used).

The FPU provides two instructions, `fldcw` (*load control word*) and `fstcw` (*store control word*), that let you load and store the contents of the control register, respectively. The single operand to these instructions must be a 16-bit memory location. The `fldcw` instruction loads the control register from the specified memory location. `fstcw` stores the control register into the specified memory location. The syntax for these instructions is shown here:

```
fldcw   mem16
fstcw   mem16
```

Here’s some example code that sets the rounding control to *truncate result* and sets the rounding precision to 24 bits:

```
            .data
fcw16       word    ?
             .
             .
             .
            fstcw   fcw16
            mov     ax, fcw16
            and     ax, 0f0ffh      ; Clears bits 8-11
            or      ax, 0c00h       ; Rounding control = %11, Precision = %00
            mov     fcw16, ax
            fldcw   fcw16
```

#### 6.5.2.3 The FPU Status Register

The 16-bit FPU status register provides the status of the FPU at the instant you read it; its layout appears in Figure 6-4. The `fstsw` instruction stores the 16-bit floating-point status register into a word variable.

![f06004](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f06004.png)

Figure 6-4: The FPU status register

Bits 0 through 5 are the exception flags. These bits appear in the same order as the exception masks in the control register. If the corresponding condition exists, the bit is set. These bits are independent of the exception masks in the control register. The FPU sets and clears these bits regardless of the corresponding mask setting.

Bit 6 indicates a *stack fault*. A stack fault occurs whenever a stack overflow or underflow occurs. When this bit is set, the C[1] condition code bit determines whether there was a stack overflow (C[1] = 1) or stack underflow (C[1] = 0) condition.
Bit 7 of the status register is set if *any* error condition bit is set. It is the logical `or` of bits 0 through 5. A program can test this bit to quickly determine if an error condition exists.

Bits 8, 9, 10, and 14 are the coprocessor condition code bits. Various instructions set the condition code bits, as shown in Tables 6-12 and 6-13.

Table 6-12: FPU Comparison Condition Code Bits (X = “Don’t care”)

| **Instruction** | **C[3]** | **C[2]** | **C[1]** | **C[0]** | **Condition** |
| --- | --- | --- | --- | --- | --- |
| `fcom`, `fcomp`, `fcompp`, `ficom`, `ficomp` | 0 | 0 | X | 0 | ST `>` source |
| | 0 | 0 | X | 1 | ST `<` source |
| | 1 | 0 | X | 0 | ST `=` source |
| | 1 | 1 | X | 1 | ST or source not comparable |
| `ftst` | 0 | 0 | X | 0 | ST is positive |
| | 0 | 0 | X | 1 | ST is negative |
| | 1 | 0 | X | 0 | ST is 0 (+ or –) |
| | 1 | 1 | X | 1 | ST is not comparable |
| `fxam` | 0 | 0 | 0 | 0 | Unsupported |
| | 0 | 0 | 1 | 0 | Unsupported |
| | 0 | 1 | 0 | 0 | + Normalized |
| | 0 | 1 | 1 | 0 | – Normalized |
| | 1 | 0 | 0 | 0 | + 0 |
| | 1 | 0 | 1 | 0 | – 0 |
| | 1 | 1 | 0 | 0 | + Denormalized |
| | 1 | 1 | 1 | 0 | – Denormalized |
| | 0 | 0 | 0 | 1 | + NaN |
| | 0 | 0 | 1 | 1 | – NaN |
| | 0 | 1 | 0 | 1 | + Infinity |
| | 0 | 1 | 1 | 1 | – Infinity |
| | 1 | 0 | X | 1 | Empty register |
| `fucom`, `fucomp`, `fucompp` | 0 | 0 | X | 0 | ST `>` source |
| | 0 | 0 | X | 1 | ST `<` source |
| | 1 | 0 | X | 0 | ST `=` source |
| | 1 | 1 | X | 1 | Unordered/not comparable |

Table 6-13: FPU Condition Code Bits (X = “Don’t care”)

| **Instruction** | **C[3]** | **C[2]** | **C[1]** | **C[0]** |
| --- | --- | --- | --- | --- |
| `fcom`, `fcomp`, `fcompp`, `ftst`, `fucom`, `fucomp`, `fucompp`, `ficom`, `ficomp` | Result of comparison, see Table 6-12. | Operands are not comparable. | Set to 0. | Result of comparison, see Table 6-12. |
| `fxam` | See Table 6-12. | See Table 6-12. | Sign of result, or stack overflow/underflow if stack exception bit is set. | See Table 6-12. |
| `fprem`, `fprem1` | Bit 1 of quotient | 0 = reduction done; 1 = reduction incomplete | Bit 0 of quotient, or stack overflow/underflow if stack exception bit is set. | Bit 2 of quotient |
| `fist`, `fbstp`, `frndint`, `fst`, `fstp`, `fadd`, `fmul`, `fdiv`, `fdivr`, `fsub`, `fsubr`, `fscale`, `fsqrt`, `fpatan`, `f2xm1`, `fyl2x`, `fyl2xp1` | Undefined | Undefined | Rounding direction if exception; otherwise, set to 0. | Undefined |
| `fptan`, `fsin`, `fcos`, `fsincos` | Undefined | Set to 1 if the operand is out of range; otherwise, 0. | Round-up occurred or stack overflow/underflow if stack exception bit is set. Undefined if C[2] is set. | Undefined |
| `fchs`, `fabs`, `fxch`, `fincstp`, `fdecstp`, constant loads, `fxtract`, `fld`, `fild`, `fbld`, `fstp` (80 bit) | Undefined | Undefined | Set to 0 or stack overflow/underflow if stack exception bit is set. | Undefined |
| `fldenv`, `frstor` | Restored from memory operand | Restored from memory operand | Restored from memory operand | Restored from memory operand |
| `fldcw`, `fstenv`, `fstcw`, `fstsw`, `fclex` | Undefined | Undefined | Undefined | Undefined |
| `finit`, `fsave` | Cleared to 0 | Cleared to 0 | Cleared to 0 | Cleared to 0 |

Bits 11 to 13 of the FPU status register provide the register number of the top of stack. During computations, the FPU adds (modulo 8) the logical register numbers supplied by the programmer to these 3 bits to determine the *physical* register number at runtime.

Bit 15 of the status register is the *busy bit*. It is set whenever the FPU is busy. This bit is a historical artifact from the days when the FPU was a separate chip; most programs will have little reason to access this bit.

### 6.5.3 FPU Data Types

The FPU supports seven data types: three integer types, a packed decimal type, and three floating-point types. The *integer types* support 16-, 32-, and 64-bit integers, although it is often faster to do the integer arithmetic by using the integer unit of the CPU. The *packed decimal type* provides an 18-digit signed decimal (BCD) integer. The primary purpose of the BCD format is to convert between strings and floating-point values.
The remaining three data types are the 32-, 64-, and 80-bit *floating-point data types*. The 80x87 data types appear in Figures 6-5, 6-6, and 6-7. Note, for future reference, that the largest BCD value the x87 supports is an 18-digit BCD value (bits 72 to 78 are unused in this format).

![f06005](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f06005.png)

Figure 6-5: FPU floating-point formats

The FPU generally stores values in a *normalized* format. The HO bit of the mantissa is always 1 when a floating-point number is normalized. In the 32- and 64-bit floating-point formats, the FPU does not actually store this bit; the FPU always assumes that it is 1. Therefore, 32- and 64-bit floating-point numbers are always normalized. In the extended-precision 80-bit floating-point format, the FPU does *not* assume that the HO bit of the mantissa is 1; the HO bit of the mantissa appears as part of the string of bits.

![f06006](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f06006.png)

Figure 6-6: FPU integer formats

Normalized values provide the greatest precision for a given number of bits. However, many values very close to 0 *cannot* be represented in normalized form with the 80-bit format; representing them requires a mantissa whose HO bit is 0. For these values, the FPUs support a special 80-bit form known as *denormalized* values. Denormalized values allow the FPU to encode very small values it cannot encode using normalized values, but denormalized values offer fewer bits of precision than normalized values. Therefore, using denormalized values in a computation may introduce slight inaccuracy. Of course, this is always better than underflowing the denormalized value to 0 (which could make the computation even less accurate), but you must keep in mind that if you work with very small values, you may lose some accuracy in your computations.
The FPU status register contains a bit you can use to detect when the FPU uses a denormalized value in a computation.

![f06007](https://github.com/OpenDocCN/greenhat-zh/raw/master/docs/art-64b-asm-vol1/img/f06007.png)

Figure 6-7: FPU packed decimal format

### 6.5.4 The FPU Instruction Set

The FPU adds many instructions to the x86-64 instruction set. We can classify these instructions as data movement instructions, conversions, arithmetic instructions, comparisons, constant instructions, transcendental instructions, and miscellaneous instructions. The following sections describe each of the instructions in these categories.

### 6.5.5 FPU Data Movement Instructions

The *data movement instructions* transfer data between the internal FPU registers and memory. The instructions in this category are `fld`, `fst`, `fstp`, and `fxch`. The `fld` instruction always pushes its operand onto the floating-point stack. The `fstp` instruction always pops the top of stack after storing it. The remaining instructions do not affect the number of items on the stack.

#### 6.5.5.1 The fld Instruction

The `fld` instruction loads a 32-, 64-, or 80-bit floating-point value onto the stack. This instruction converts 32- and 64-bit operands to an 80-bit extended-precision value before pushing the value onto the floating-point stack.

The `fld` instruction first decrements the TOS pointer (bits 11 to 13 of the status register) and then stores the 80-bit value in the physical register specified by the new TOS pointer. If the source operand of the `fld` instruction is a floating-point data register, `st(i)`, then the actual register that the FPU uses for the load operation is the register number *before* decrementing the TOS pointer. Therefore, `fld st(0)` duplicates the value on the top of stack.

The `fld` instruction sets the stack fault bit if stack overflow occurs. It sets the denormalized exception bit if you load an 80-bit denormalized value.
It sets the invalid operation bit if you attempt to load an empty floating-point register onto the TOS (or perform another invalid operation).

Here are some examples:

```
fld     st(1)
fld     real4_variable
fld     real8_variable
fld     real10_variable
fld     real8 ptr [rbx]
```

There is no way to directly load a 32-bit integer register onto the floating-point stack, even if that register contains a `real4` value. To do so, you must first store the integer register into a memory location, and then push that memory location onto the FPU stack by using the `fld` instruction. For example:

```
mov     tempReal4, eax    ; Save real4 value in EAX to memory
fld     tempReal4         ; Push that value onto the FPU stack
```

#### 6.5.5.2 The fst and fstp Instructions

The `fst` and `fstp` instructions copy the value on the top of the floating-point stack to another floating-point register or to a 32-, 64-, or (`fstp` only) 80-bit memory variable. When copying data to a 32- or 64-bit memory variable, the FPU rounds the 80-bit extended-precision value on the TOS to the smaller format as specified by the rounding control bits in the FPU control register.

By incrementing the TOS pointer in the status register after accessing the data in ST(0), the `fstp` instruction pops the value off the top of stack when moving it to the destination location. If the destination operand is a floating-point register, the FPU stores the value at the specified register number *before* popping the data off the top of stack.

Executing an `fstp st(0)` instruction effectively pops the data off the top of stack with no data transfer. Here are some examples:

```
fst     real4_variable
fst     real8_variable
fst     realArray[rbx * 8]
fst     st(2)
fstp    st(1)
```

The last example effectively pops ST(1) while leaving ST(0) on the top of stack.

The `fst` and `fstp` instructions will set the stack exception bit if a stack underflow occurs (attempting to store a value from an empty register stack).
They will set the precision bit if a loss of precision occurs during the store operation (for example, when storing an 80-bit extended-precision value into a 32- or 64-bit memory variable and some bits are lost during conversion). They will set the underflow exception bit when storing an 80-bit value into a 32- or 64-bit memory variable, but the value is too small to fit into the destination operand. Likewise, these instructions will set the overflow exception bit if the value on the top of stack is too big to fit into a 32- or 64-bit memory variable. They set the invalid operation flag if an invalid operation (such as storing into an empty register) occurs. Finally, these instructions set the C[1] condition bit if rounding occurs during the store operation (this occurs only when storing into a 32- or 64-bit memory variable and you have to round the mantissa to fit into the destination) or if a stack fault occurs.

#### 6.5.5.3 The fxch Instruction

The `fxch` instruction exchanges the value on the top of stack with one of the other FPU registers. This instruction takes two forms: one with a single FPU register as an operand and the second without any operands. The first form exchanges the top of stack with the specified register. The second form of `fxch` swaps the top of stack with ST(1).

Many FPU instructions (for example, `fsqrt`) operate only on the top of the register stack. If you want to perform such an operation on a value that is not on top, you can use the `fxch` instruction to swap that register with TOS, perform the desired operation, and then use `fxch` to swap the TOS with the original register. The following example takes the square root of ST(2):

```
fxch    st(2)
fsqrt
fxch    st(2)
```

The `fxch` instruction sets the stack exception bit if the stack is empty; it sets the invalid operation bit if you specify an empty register as the operand; and it always clears the C[1] condition code bit.
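
The way `fld`, `fstp`, and `fxch` move the top-of-stack pointer can be easier to follow in a small software model. The C sketch below is a toy, not a description of the hardware's internals: it uses `double` in place of the 80-bit format, tracks empty registers with a simple flag array, and every name in it (`FpuModel`, `fpu_fld`, and so on) is hypothetical.

```c
#include <stdbool.h>

/* Toy model of the x87 register stack: eight slots plus a 3-bit TOS field. */
typedef struct {
    double   reg[8];    /* physical registers 0..7 (double stands in for 80 bits) */
    bool     used[8];   /* false = empty register */
    unsigned top;       /* TOS pointer, as in bits 11 to 13 of the status register */
} FpuModel;

void fpu_reset(FpuModel *f)
{
    for (int i = 0; i < 8; i++) { f->reg[i] = 0.0; f->used[i] = false; }
    f->top = 0;
}

/* Physical register number for logical register ST(i): (TOS + i) modulo 8. */
unsigned fpu_phys(const FpuModel *f, unsigned i) { return (f->top + i) & 7u; }

/* fld: decrement TOS, then store; fails (stack overflow) if the slot is in use. */
bool fpu_fld(FpuModel *f, double value)
{
    unsigned p = (f->top + 7u) & 7u;        /* TOS - 1, modulo 8 */
    if (f->used[p]) return false;
    f->top = p;
    f->reg[p] = value;
    f->used[p] = true;
    return true;
}

/* fstp: copy ST(0) out, mark it empty, increment TOS; fails on an empty stack. */
bool fpu_fstp(FpuModel *f, double *dest)
{
    unsigned p = f->top;
    if (!f->used[p]) return false;
    *dest = f->reg[p];
    f->used[p] = false;
    f->top = (p + 1u) & 7u;
    return true;
}

/* fxch st(i): swap ST(0) with ST(i); fails if either register is empty. */
bool fpu_fxch(FpuModel *f, unsigned i)
{
    unsigned a = fpu_phys(f, 0), b = fpu_phys(f, i);
    if (!f->used[a] || !f->used[b]) return false;
    double t = f->reg[a];
    f->reg[a] = f->reg[b];
    f->reg[b] = t;
    return true;
}

/* Value currently in ST(i); the caller must ensure the register is not empty. */
double fpu_st(const FpuModel *f, unsigned i) { return f->reg[fpu_phys(f, i)]; }
```

In this model, pushing 1.0, 2.0, and 3.0 leaves 3.0 in ST(0) and 1.0 in ST(2); `fpu_fxch(&f, 2)` then brings 1.0 to the top, which is the position an instruction such as `fsqrt` would operate on.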
### 6.5.6 Conversions

The FPU performs all arithmetic operations on 80-bit real quantities. In a sense, the `fld` and `fst`/`fstp` instructions are conversion instructions because they automatically convert between the internal 80-bit real format and the 32- and 64-bit memory formats. Nonetheless, we’ll classify them as data movement operations, rather than conversions, because they are moving real values to and from memory. The FPU provides six other instructions that convert to or from integer or BCD format when moving data. These instructions are `fild`, `fist`, `fistp`, `fisttp`, `fbld`, and `fbstp`.

#### 6.5.6.1 The fild Instruction

The `fild` (*integer load*) instruction converts a 16-, 32-, or 64-bit two’s complement integer to the 80-bit extended-precision format and pushes the result onto the stack. This instruction always expects a single operand: the address of a word, double-word, or quad-word integer variable. You cannot specify one of the x86-64’s 16-, 32-, or 64-bit general-purpose registers. If you want to push the value of an x86-64 general-purpose register onto the FPU stack, you must first store it into a memory variable and then use `fild` to push that memory variable.

The `fild` instruction sets the stack exception bit and C[1] (accordingly) if stack overflow occurs while pushing the converted value. Look at these examples:

```
fild    word_variable
fild    dword_val[rcx * 4]
fild    qword_variable
fild    sqword ptr [rbx]
```

#### 6.5.6.2 The fist, fistp, and fisttp Instructions

The `fist`, `fistp`, and `fisttp` instructions convert the 80-bit extended-precision variable on the top of stack to a 16-, 32-, or (`fistp`/`fisttp` only) 64-bit integer and store the result into the memory variable specified by the single operand. The `fist` and `fistp` instructions convert the value on TOS to an integer according to the rounding setting in the FPU control register (bits 10 and 11).
The `fisttp` instruction always does the conversion using the truncation mode. As with the `fild` instruction, the `fist`, `fistp`, and `fisttp` instructions will not let you specify one of the x86-64’s general-purpose 16-, 32-, or 64-bit registers as the destination operand.

The `fist` instruction converts the value on the top of stack to an integer and then stores the result; it does not otherwise affect the floating-point register stack. The `fistp` and `fisttp` instructions pop the value off the floating-point register stack after storing the converted value.

These instructions set the stack exception bit if the floating-point register stack is empty (this will also clear C[1]). They set the precision (imprecise operation) and C[1] bits if rounding occurs (that is, if the value in ST(0) has any fractional component). These instructions set the underflow exception bit if the result is too small (less than 1 but greater than 0, or less than 0 but greater than –1). Here are some examples:

```
fist    word_var[rbx * 2]
fist    dword_var
fisttp  dword_var
fistp   qword_var
```

The `fist` and `fistp` instructions use the rounding control settings to determine how they will convert the floating-point data to an integer during the store operation. By default, the rounding control is usually set to round mode; yet, most programmers expect `fist`/`fistp` to truncate the decimal portion during conversion. If you want `fist`/`fistp` to truncate floating-point values when converting them to an integer, you will need to set the rounding control bits appropriately in the floating-point control register (or use the `fisttp` instruction to truncate the result regardless of the rounding control bits). Here’s an example:

```
            .data
fcw16       word    ?
fcw16_2     word    ?
IntResult   sdword  ?
             .
             .
             .
            fstcw   fcw16
            mov     ax, fcw16
            or      ax, 0c00h       ; Rounding = %11 (truncate)
            mov     fcw16_2, ax     ; Store and reload the ctrl word
            fldcw   fcw16_2

            fistp   IntResult       ; Truncate ST(0) and store as int32

            fldcw   fcw16           ; Restore original rounding control
```

#### 6.5.6.3 The fbld and fbstp Instructions

The `fbld` and `fbstp` instructions load and store 80-bit BCD values. The `fbld` instruction converts a BCD value to its 80-bit extended-precision equivalent and pushes the result onto the stack. The `fbstp` instruction pops the extended-precision real value on TOS, converts it to an 80-bit BCD value (rounding according to the bits in the floating-point control register), and stores the converted result at the address specified by the destination memory operand. There is no `fbst` instruction.

The `fbld` instruction sets the stack exception bit and C[1] if stack overflow occurs. The results are undefined if you attempt to load an invalid BCD value. The `fbstp` instruction sets the stack exception bit and clears C[1] if stack underflow occurs (the stack is empty). It sets the underflow flag under the same conditions as `fist` and `fistp`. Look at these examples:

```
; Assuming fewer than eight items on the stack, the following
; code sequence is equivalent to an fbst instruction:

        fld     st(0)
        fbstp   tbyte_var

; The following example easily converts an 80-bit BCD value to
; a 64-bit integer:

        fbld    tbyte_var
        fistp   qword_var
```

These two instructions are especially useful for converting between string and floating-point formats. Along with the `fild` and `fist` instructions, you can use `fbld` and `fbstp` to convert between integer and string formats (see “Converting Unsigned Decimal Values to Strings” in Chapter 9).

### 6.5.7 Arithmetic Instructions

*Arithmetic instructions* make up a small but important subset of the FPU’s instruction set. These instructions fall into two general categories: those that operate on real values and those that operate on a real and an integer value.
#### 6.5.7.1 The fadd, faddp, and fiadd Instructions

The `fadd`, `faddp`, and `fiadd` instructions take the following forms:

```
fadd
faddp
fadd    st(i), st(0)
fadd    st(0), st(i)
faddp   st(i), st(0)
fadd    mem32
fadd    mem64
fiadd   mem16
fiadd   mem32
```

The `fadd` instruction, with no operands, is a synonym for `faddp`. The `faddp` instruction (also with no operands) pops the two values on the top of stack, adds them, and pushes their sum back onto the stack.

The next two forms of the `fadd` instruction, those with two FPU register operands, behave like the x86-64’s `add` instruction. They add the value in the source register operand to the value in the destination register operand. One of the register operands must be ST(0).

The `faddp` instruction with two operands adds ST(0) (which must always be the source operand) to the destination operand and then pops ST(0). The destination operand must be one of the other FPU registers.

The last two forms of `fadd`, those with a memory operand, add a 32- or 64-bit floating-point variable to the value in ST(0). These forms convert the 32- or 64-bit operand to an 80-bit extended-precision value before performing the addition. Note that `fadd` does *not* allow an 80-bit memory operand.

There are also instructions for adding 16- and 32-bit integers in memory to ST(0): `fiadd mem16` and `fiadd mem32`.

These instructions can raise the stack, precision, underflow, overflow, denormalized, and illegal operation exceptions, as appropriate. If a stack fault exception occurs, C[1] denotes stack overflow or underflow, or the rounding direction (see Table 6-13).

Listing 6-1 demonstrates the various forms of the `fadd` instruction.

```
; Listing 6-1

; Demonstration of various forms of fadd.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 6-1", 0
fmtSt0St1   byte    "st(0):%f, st(1):%f", nl, 0
fmtAdd1     byte    "fadd: st0:%f", nl, 0
fmtAdd2     byte    "faddp: st0:%f", nl, 0
fmtAdd3     byte    "fadd st(1), st(0): st0:%f, st1:%f", nl, 0
fmtAdd4     byte    "fadd st(0), st(1): st0:%f, st1:%f", nl, 0
fmtAdd5     byte    "faddp st(1), st(0): st0:%f", nl, 0
fmtAdd6     byte    "fadd mem: st0:%f", nl, 0

zero        real8   0.0
one         real8   1.0
two         real8   2.0
minusTwo    real8   -2.0

            .data
st0         real8   0.0
st1         real8   0.0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; printFP - Prints values of st0 and (possibly) st1.
;           Caller must pass in ptr to fmtStr in RCX.

printFP     proc
            sub     rsp, 40

; For varargs (for example, printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.

            mov     rdx, qword ptr st0
            mov     r8, qword ptr st1
            call    printf
            add     rsp, 40
            ret
printFP     endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48         ; Shadow storage

; Demonstrate various fadd instructions:

            mov     rax, qword ptr one
            mov     qword ptr st1, rax
            mov     rax, qword ptr minusTwo
            mov     qword ptr st0, rax
            lea     rcx, fmtSt0St1
            call    printFP

; fadd (same as faddp):

            fld     one
            fld     minusTwo
            fadd                    ; Pops st(0)!
            fstp    st0

            lea     rcx, fmtAdd1
            call    printFP

; faddp:

            fld     one
            fld     minusTwo
            faddp                   ; Pops st(0)!
            fstp    st0

            lea     rcx, fmtAdd2
            call    printFP

; fadd st(1), st(0):

            fld     one
            fld     minusTwo
            fadd    st(1), st(0)
            fstp    st0
            fstp    st1

            lea     rcx, fmtAdd3
            call    printFP

; fadd st(0), st(1):

            fld     one
            fld     minusTwo
            fadd    st(0), st(1)
            fstp    st0
            fstp    st1

            lea     rcx, fmtAdd4
            call    printFP

; faddp st(1), st(0):

            fld     one
            fld     minusTwo
            faddp   st(1), st(0)
            fstp    st0

            lea     rcx, fmtAdd5
            call    printFP

; fadd mem64:

            fld     one
            fadd    two
            fstp    st0

            lea     rcx, fmtAdd6
            call    printFP

            leave
            ret                     ; Returns to caller

asmMain     endp
            end
```

Listing 6-1: Demonstration of `fadd` instructions

Here’s the build command and output for the program in Listing 6-1:

```
C:\>build listing6-1

C:\>echo off
 Assembling: listing6-1.asm
c.cpp

C:\>listing6-1
Calling Listing 6-1:
st(0):-2.000000, st(1):1.000000
fadd: st0:-1.000000
faddp: st0:-1.000000
fadd st(1), st(0): st0:-2.000000, st1:-1.000000
fadd st(0), st(1): st0:-1.000000, st1:1.000000
faddp st(1), st(0): st0:-1.000000
fadd mem: st0:3.000000
Listing 6-1 terminated
```

#### 6.5.7.2 The fsub, fsubp, fsubr, fsubrp, fisub, and fisubr Instructions

These six instructions take the following forms:

```
fsub
fsubp
fsubr
fsubrp

fsub    st(i), st(0)
fsub    st(0), st(i)
fsubp   st(i), st(0)
fsub    mem32
fsub    mem64

fsubr   st(i), st(0)
fsubr   st(0), st(i)
fsubrp  st(i), st(0)
fsubr   mem32
fsubr   mem64

fisub   mem16
fisub   mem32
fisubr  mem16
fisubr  mem32
```

With no operands, `fsub` is the same as `fsubp` (without operands). With no operands, the `fsubp` instruction pops ST(0) and ST(1) from the register stack, computes ST(1) – ST(0), and then pushes the difference back onto the stack. The `fsubr` and `fsubrp` instructions (*reverse subtraction*) operate in an identical fashion except they compute ST(0) – ST(1).

With two register operands (*destination, source*), the `fsub` instruction computes *destination* = *destination* – *source*. One of the two registers must be ST(0).
With two registers as operands, the `fsubp` instruction also computes *destination* = *destination* – *source*, and then it pops ST(0) off the stack after computing the difference. For the `fsubp` instruction, the source operand must be ST(0).

With two register operands, the `fsubr` and `fsubrp` instructions work in a similar fashion to `fsub` and `fsubp`, except they compute *destination* = *source* – *destination*.

The `fsub` `mem`[32], `fsub` `mem`[64], `fsubr` `mem`[32], and `fsubr` `mem`[64] instructions accept a 32- or 64-bit memory operand. They convert the memory operand to an 80-bit extended-precision value and subtract this from ST(0) (`fsub`) or subtract ST(0) from this value (`fsubr`) and store the result back into ST(0).

There are also instructions for subtracting 16- and 32-bit integers in memory from ST(0): `fisub mem`[16] and `fisub mem`[32] (also `fisubr mem`[16] and `fisubr mem`[32]).

These instructions can raise the stack, precision, underflow, overflow, denormalized, and illegal operation exceptions, as appropriate. If a stack fault exception occurs, C[1] denotes stack overflow or underflow; otherwise, it indicates the rounding direction (see Table 6-13).

Listing 6-2 demonstrates the `fsub`/`fsubr` instructions.

```
; Listing 6-2

; Demonstration of various forms of fsub/fsubr.
option casemap:none nl = 10 .const ttlStr byte "Listing 6-2", 0 fmtSt0St1 byte "st(0):%f, st(1):%f", nl, 0 fmtSub1 byte "fsub: st0:%f", nl, 0 fmtSub2 byte "fsubp: st0:%f", nl, 0 fmtSub3 byte "fsub st(1), st(0): st0:%f, st1:%f", nl, 0 fmtSub4 byte "fsub st(0), st(1): st0:%f, st1:%f", nl, 0 fmtSub5 byte "fsubp st(1), st(0): st0:%f", nl, 0 fmtSub6 byte "fsub mem: st0:%f", nl, 0 fmtSub7 byte "fsubr st(1), st(0): st0:%f, st1:%f", nl, 0 fmtSub8 byte "fsubr st(0), st(1): st0:%f, st1:%f", nl, 0 fmtSub9 byte "fsubrp st(1), st(0): st0:%f", nl, 0 fmtSub10 byte "fsubr mem: st0:%f", nl, 0 zero real8 0.0 three real8 3.0 minusTwo real8 -2.0 .data st0 real8 0.0 st1 real8 0.0 .code externdef printf:proc ; Return program title to C++ program: public getTitle getTitle proc lea rax, ttlStr ret getTitle endp ; printFP - Prints values of st0 and (possibly) st1. ; Caller must pass in ptr to fmtStr in RCX. printFP proc sub rsp, 40 ; For varargs (for example, printf call), double ; values must appear in RDX and R8 rather ; than XMM1, XMM2. ; Note: if only one double arg in format ; string, printf call will ignore 2nd ; value in R8. mov rdx, qword ptr st0 mov r8, qword ptr st1 call printf add rsp, 40 ret printFP endp ; Here is the "asmMain" function. public asmMain asmMain proc push rbp mov rbp, rsp sub rsp, 48 ; Shadow storage ; Demonstrate various fsub instructions: mov rax, qword ptr three mov qword ptr st1, rax mov rax, qword ptr minusTwo mov qword ptr st0, rax lea rcx, fmtSt0St1 call printFP ; fsub (same as fsubp): fld three fld minusTwo fsub ; Pops st(0)! fstp st0 lea rcx, fmtSub1 call printFP ; fsubp: fld three fld minusTwo fsubp ; Pops st(0)! 
fstp st0 lea rcx, fmtSub2 call printFP ; fsub st(1), st(0): fld three fld minusTwo fsub st(1), st(0) fstp st0 fstp st1 lea rcx, fmtSub3 call printFP ; fsub st(0), st(1): fld three fld minusTwo fsub st(0), st(1) fstp st0 fstp st1 lea rcx, fmtSub4 call printFP ; fsubp st(1), st(0): fld three fld minusTwo fsubp st(1), st(0) fstp st0 lea rcx, fmtSub5 call printFP ; fsub mem64: fld three fsub minusTwo fstp st0 lea rcx, fmtSub6 call printFP ; fsubr st(1), st(0): fld three fld minusTwo fsubr st(1), st(0) fstp st0 fstp st1 lea rcx, fmtSub7 call printFP ; fsubr st(0), st(1): fld three fld minusTwo fsubr st(0), st(1) fstp st0 fstp st1 lea rcx, fmtSub8 call printFP ; fsubrp st(1), st(0): fld three fld minusTwo fsubrp st(1), st(0) fstp st0 lea rcx, fmtSub9 call printFP ; fsubr mem64: fld three fsubr minusTwo fstp st0 lea rcx, fmtSub10 call printFP leave ret ; Returns to caller asmMain endp end ``` Listing 6-2: Demonstration of the `fsub` instructions Here’s the build command and output for Listing 6-2: ``` C:\>**build listing6-2** C:\>**echo off** Assembling: listing6-2.asm c.cpp C:\>**listing6-2** Calling Listing 6-2: st(0):-2.000000, st(1):3.000000 fsub: st0:5.000000 fsubp: st0:5.000000 fsub st(1), st(0): st0:-2.000000, st1:5.000000 fsub st(0), st(1): st0:-5.000000, st1:3.000000 fsubp st(1), st(0): st0:5.000000 fsub mem: st0:5.000000 fsubr st(1), st(0): st0:-2.000000, st1:-5.000000 fsubr st(0), st(1): st0:5.000000, st1:3.000000 fsubrp st(1), st(0): st0:-5.000000 fsubr mem: st0:-5.000000 Listing 6-2 terminated ``` #### 6.5.7.3 The fmul, fmulp, and fimul Instructions The `fmul` and `fmulp` instructions multiply two floating-point values. The `fimul` instruction multiples an integer and a floating-point value. These instructions allow the following forms: ``` fmul fmulp fmul st(0), st(`i`) fmul st(`i`), st(0) fmul `mem`[32] fmul `mem`[64] fmulp st(`i`), st(0) fimul `mem`[16] fimul `mem`[32] ``` With no operands, `fmul` is a synonym for `fmulp`. 
The `fmulp` instruction, with no operands, will pop ST(0) and ST(1), multiply these values, and push their product back onto the stack. The `fmul` instructions with two register operands compute *destination* = *destination* × *source*. One of the registers (source or destination) must be ST(0).

The `fmulp st(i), st(0)` instruction computes ST(*i*) = ST(*i*) × ST(0) and then pops ST(0). This instruction uses the value for *i* before popping ST(0).

The `fmul` `mem`[32] and `fmul` `mem`[64] instructions require a 32- or 64-bit memory operand, respectively. They convert the specified memory variable to an 80-bit extended-precision value and then multiply ST(0) by this value. There are also instructions for multiplying 16- and 32-bit integers in memory by ST(0): `fimul mem`[16] and `fimul mem`[32].

These instructions can raise the stack, precision, underflow, overflow, denormalized, and illegal operation exceptions, as appropriate. If rounding occurs during the computation, these instructions set the C[1] condition code bit. If a stack fault exception occurs, C[1] denotes stack overflow or underflow.

Listing 6-3 demonstrates the various forms of the `fmul` instruction.

```
; Listing 6-3

; Demonstration of various forms of fmul.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 6-3", 0
fmtSt0St1   byte    "st(0):%f, st(1):%f", nl, 0
fmtMul1     byte    "fmul: st0:%f", nl, 0
fmtMul2     byte    "fmulp: st0:%f", nl, 0
fmtMul3     byte    "fmul st(1), st(0): st0:%f, st1:%f", nl, 0
fmtMul4     byte    "fmul st(0), st(1): st0:%f, st1:%f", nl, 0
fmtMul5     byte    "fmulp st(1), st(0): st0:%f", nl, 0
fmtMul6     byte    "fmul mem: st0:%f", nl, 0

zero        real8   0.0
three       real8   3.0
minusTwo    real8   -2.0

            .data
st0         real8   0.0
st1         real8   0.0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; printFP - Prints values of st0 and (possibly) st1.
;           Caller must pass in ptr to fmtStr in RCX.
printFP proc sub rsp, 40 ; For varargs (for example, printf call), double ; values must appear in RDX and R8 rather ; than XMM1, XMM2. ; Note: if only one double arg in format ; string, printf call will ignore 2nd ; value in R8. mov rdx, qword ptr st0 mov r8, qword ptr st1 call printf add rsp, 40 ret printFP endp ; Here is the "asmMain" function. public asmMain asmMain proc push rbp mov rbp, rsp sub rsp, 48 ; Shadow storage ; Demonstrate various fmul instructions: mov rax, qword ptr three mov qword ptr st1, rax mov rax, qword ptr minusTwo mov qword ptr st0, rax lea rcx, fmtSt0St1 call printFP ; fmul (same as fmulp): fld three fld minusTwo fmul ; Pops st(0)! fstp st0 lea rcx, fmtMul1 call printFP ; fmulp: fld three fld minusTwo fmulp ; Pops st(0)! fstp st0 lea rcx, fmtMul2 call printFP ; fmul st(1), st(0): fld three fld minusTwo fmul st(1), st(0) fstp st0 fstp st1 lea rcx, fmtMul3 call printFP ; fmul st(0), st(1): fld three fld minusTwo fmul st(0), st(1) fstp st0 fstp st1 lea rcx, fmtMul4 call printFP ; fmulp st(1), st(0): fld three fld minusTwo fmulp st(1), st(0) fstp st0 lea rcx, fmtMul5 call printFP ; fmulp mem64: fld three fmul minusTwo fstp st0 lea rcx, fmtMul6 call printFP leave ret ; Returns to caller asmMain endp end ``` Listing 6-3: Demonstration of the `fmul` instruction Here is the build command and output for Listing 6-3: ``` C:\>**build listing6-3** C:\>**echo off** Assembling: listing6-3.asm c.cpp C:\>**listing6-3** Calling Listing 6-3: st(0):-2.000000, st(1):3.000000 fmul: st0:-6.000000 fmulp: st0:-6.000000 fmul st(1), st(0): st0:-2.000000, st1:-6.000000 fmul st(0), st(1): st0:-6.000000, st1:3.000000 fmulp st(1), st(0): st0:-6.000000 fmul mem: st0:-6.000000 Listing 6-3 terminated ``` #### 6.5.7.4 The fdiv, fdivp, fdivr, fdivrp, fidiv, and fidivr Instructions These six instructions allow the following forms: ``` fdiv fdivp fdivr fdivrp fdiv st(0), st(`i`) fdiv st(`i`), st(0) fdivp st(`i`), st(0) fdivr st(0), st(`i`) fdivr st(`i`), st(0) fdivrp st(`i`), 
st(0)
fdiv `mem`[32]
fdiv `mem`[64]
fdivr `mem`[32]
fdivr `mem`[64]
fidiv `mem`[16]
fidiv `mem`[32]
fidivr `mem`[16]
fidivr `mem`[32]
```

With no operands, the `fdiv` instruction is a synonym for `fdivp`. The `fdivp` instruction with no operands computes ST(1) = ST(1) / ST(0). The `fdivr` and `fdivrp` instructions work in a similar fashion to `fdiv` and `fdivp` except that they compute ST(0) / ST(1) rather than ST(1) / ST(0).

With two register operands, these instructions compute the following quotients:

```
fdiv   st(0), st(`i`)   ; st(0) = st(0)/st(`i`)
fdiv   st(`i`), st(0)   ; st(`i`) = st(`i`)/st(0)
fdivp  st(`i`), st(0)   ; st(`i`) = st(`i`)/st(0) then pop st0
fdivr  st(0), st(`i`)   ; st(0) = st(`i`)/st(0)
fdivr  st(`i`), st(0)   ; st(`i`) = st(0)/st(`i`)
fdivrp st(`i`), st(0)   ; st(`i`) = st(0)/st(`i`) then pop st0
```

The `fdivp` and `fdivrp` instructions also pop ST(0) after performing the division operation. The value for *i* in these two instructions is computed before popping ST(0).

These instructions can raise the stack, precision, underflow, overflow, denormalized, zero divide, and illegal operation exceptions, as appropriate. If rounding occurs during the computation, these instructions set the C[1] condition code bit. If a stack fault exception occurs, C[1] denotes stack overflow or underflow.

Listing 6-4 provides a demonstration of the `fdiv`/`fdivr` instructions.

```
; Listing 6-4

; Demonstration of various forms of fdiv/fdivr.
option casemap:none nl = 10 .const ttlStr byte "Listing 6-4", 0 fmtSt0St1 byte "st(0):%f, st(1):%f", nl, 0 fmtDiv1 byte "fdiv: st0:%f", nl, 0 fmtDiv2 byte "fdivp: st0:%f", nl, 0 fmtDiv3 byte "fdiv st(1), st(0): st0:%f, st1:%f", nl, 0 fmtDiv4 byte "fdiv st(0), st(1): st0:%f, st1:%f", nl, 0 fmtDiv5 byte "fdivp st(1), st(0): st0:%f", nl, 0 fmtDiv6 byte "fdiv mem: st0:%f", nl, 0 fmtDiv7 byte "fdivr st(1), st(0): st0:%f, st1:%f", nl, 0 fmtDiv8 byte "fdivr st(0), st(1): st0:%f, st1:%f", nl, 0 fmtDiv9 byte "fdivrp st(1), st(0): st0:%f", nl, 0 fmtDiv10 byte "fdivr mem: st0:%f", nl, 0 three real8 3.0 minusTwo real8 -2.0 .data st0 real8 0.0 st1 real8 0.0 .code externdef printf:proc ; Return program title to C++ program: public getTitle getTitle proc lea rax, ttlStr ret getTitle endp ; printFP - Prints values of st0 and (possibly) st1. ; Caller must pass in ptr to fmtStr in RCX. printFP proc sub rsp, 40 ; For varargs (for example, printf call), double ; values must appear in RDX and R8 rather ; than XMM1, XMM2. ; Note: if only one double arg in format ; string, printf call will ignore 2nd ; value in R8. mov rdx, qword ptr st0 mov r8, qword ptr st1 call printf add rsp, 40 ret printFP endp ; Here is the "asmMain" function. public asmMain asmMain proc push rbp mov rbp, rsp sub rsp, 48 ; Shadow storage ; Demonstrate various fdiv instructions: mov rax, qword ptr three mov qword ptr st1, rax mov rax, qword ptr minusTwo mov qword ptr st0, rax lea rcx, fmtSt0St1 call printFP ; fdiv (same as fdivp): fld three fld minusTwo fdiv ; Pops st(0)! fstp st0 lea rcx, fmtDiv1 call printFP ; fdivp: fld three fld minusTwo fdivp ; Pops st(0)! 
fstp st0 lea rcx, fmtDiv2 call printFP ; fdiv st(1), st(0): fld three fld minusTwo fdiv st(1), st(0) fstp st0 fstp st1 lea rcx, fmtDiv3 call printFP ; fdiv st(0), st(1): fld three fld minusTwo fdiv st(0), st(1) fstp st0 fstp st1 lea rcx, fmtDiv4 call printFP ; fdivp st(1), st(0): fld three fld minusTwo fdivp st(1), st(0) fstp st0 lea rcx, fmtDiv5 call printFP ; fdiv mem64: fld three fdiv minusTwo fstp st0 lea rcx, fmtDiv6 call printFP ; fdivr st(1), st(0): fld three fld minusTwo fdivr st(1), st(0) fstp st0 fstp st1 lea rcx, fmtDiv7 call printFP ; fdivr st(0), st(1): fld three fld minusTwo fdivr st(0), st(1) fstp st0 fstp st1 lea rcx, fmtDiv8 call printFP ; fdivrp st(1), st(0): fld three fld minusTwo fdivrp st(1), st(0) fstp st0 lea rcx, fmtDiv9 call printFP ; fdivr mem64: fld three fdivr minusTwo fstp st0 lea rcx, fmtDiv10 call printFP leave ret ; Returns to caller asmMain endp end ``` Listing 6-4: Demonstration of the `fdiv`/`fdivr` instructions Here’s the build command and sample output for Listing 6-4: ``` C:\>**build listing6-4** C:\>**echo off** Assembling: listing6-4.asm c.cpp C:\>**listing6-4** Calling Listing 6-4: st(0):-2.000000, st(1):3.000000 fdiv: st0:-1.500000 fdivp: st0:-1.500000 fdiv st(1), st(0): st0:-2.000000, st1:-1.500000 fdiv st(0), st(1): st0:-0.666667, st1:3.000000 fdivp st(1), st(0): st0:-1.500000 fdiv mem: st0:-1.500000 fdivr st(1), st(0): st0:-2.000000, st1:-0.666667 fdivr st(0), st(1): st0:-1.500000, st1:3.000000 fdivrp st(1), st(0): st0:-0.666667 fdivr mem: st0:-0.666667 Listing 6-4 terminated ``` #### 6.5.7.5 The fsqrt Instruction The `fsqrt` routine does not allow any operands. It computes the square root of the value on TOS and replaces ST(0) with this result. The value on TOS must be 0 or positive; otherwise, `fsqrt` will generate an invalid operation exception. This instruction can raise the stack, precision, denormalized, and invalid operation exceptions, as appropriate. 
If rounding occurs during the computation, `fsqrt` sets the C[1] condition code bit. If a stack fault exception occurs, C[1] denotes stack overflow or underflow.

Here’s an example:

```
; Compute z = sqrt(x**2 + y**2):

        fld     x       ; Load x
        fld     st(0)   ; Duplicate x on TOS
        fmulp           ; Compute x**2

        fld     y       ; Load y
        fld     st(0)   ; Duplicate y
        fmul            ; Compute y**2

        faddp           ; Compute x**2 + y**2
        fsqrt           ; Compute sqrt(x**2 + y**2)
        fstp    z       ; Store result away into z
```

#### 6.5.7.6 The fprem and fprem1 Instructions

The `fprem` and `fprem1` instructions compute a *partial remainder* (a value that may require additional computation to produce the actual remainder). Intel designed the `fprem` instruction before the IEEE finalized its floating-point standard. In the final draft of that standard, the definition of `fprem` was a little different from Intel’s original design. To maintain compatibility with the existing software that used the `fprem` instruction, Intel designed a new version to handle the IEEE partial remainder operation, `fprem1`. You should always use `fprem1` in new software; therefore, we will discuss only `fprem1` here, although you use `fprem` in an identical fashion.

`fprem1` computes the partial remainder of ST(0) / ST(1). If the difference between the exponents of ST(0) and ST(1) is less than 64, `fprem1` can compute the exact remainder in one operation. Otherwise, you will have to execute `fprem1` two or more times to get the correct remainder value. The C[2] condition code bit indicates whether the computation is complete. Note that `fprem1` does *not* pop the two operands off the stack; it leaves the partial remainder in ST(0) and the original divisor in ST(1) in case you need to compute another partial remainder to complete the result.

The `fprem1` instruction sets the stack exception flag if there aren’t two values on the top of stack. It sets the underflow and denormal exception bits if the result is too small.
It sets the invalid operation bit if the values on TOS are inappropriate for this operation. It sets the C[2] condition code bit if the partial remainder operation is not complete (or on stack underflow). Finally, it loads C[1], C[3], and C[0] with bits 0, 1, and 2 of the quotient, respectively.

An example follows:

```
; Compute z = x % y:

        fld     y
        fld     x
repeatLp:
        fprem1
        fstsw   ax          ; Get condition code bits into AX
        and     ah, 4       ; See if C2 is set (status bit 10 = bit 2 of AH)
        jnz     repeatLp    ; Repeat until C2 is clear

        fstp    z           ; Store away the remainder
        fstp    st(0)       ; Pop old y value
```

#### 6.5.7.7 The frndint Instruction

The `frndint` instruction rounds the value on TOS to the nearest integer by using the rounding algorithm specified in the control register.

This instruction sets the stack exception flag if there is no value on the TOS (it will also clear C[1] in this case). It sets the precision and denormal exception bits if a loss of precision occurred. It sets the invalid operation flag if the value on the TOS is not a valid number. Note that the result on the TOS is still a floating-point value; it simply does not have a fractional component.

#### 6.5.7.8 The fabs Instruction

`fabs` computes the absolute value of ST(0) by clearing the mantissa sign bit of ST(0). It sets the stack exception bit and invalid operation bits if the stack is empty.

Here’s an example:

```
; Compute x = sqrt(abs(x)):

        fld     x
        fabs
        fsqrt
        fstp    x
```

#### 6.5.7.9 The fchs Instruction

`fchs` changes the sign of ST(0)’s value by inverting the mantissa sign bit (this is the floating-point negation instruction). It sets the stack exception bit and invalid operation bits if the stack is empty.

Look at this example:

```
; Compute x = -x if x is positive, x = x if x is negative.
; That is, force x to be a negative value.

        fld     x
        fabs
        fchs
        fstp    x
```

### 6.5.8 Comparison Instructions

The FPU provides several instructions for comparing real values.
The `fcom`, `fcomp`, and `fcompp` instructions compare the two values on the top of stack and set the condition codes appropriately. The `ftst` instruction compares the value on the top of stack with 0. Generally, most programs test the condition code bits immediately after a comparison. Unfortunately, no instructions test the FPU condition codes. Instead, you use the `fstsw` instruction to copy the floating-point status register into the AX register, then the `sahf` instruction to copy the AH register into the x86-64’s condition code bits. Then you can test the standard x86-64 flags to check for a condition. This technique copies C[0] into the carry flag, C[2] into the parity flag, and C[3] into the zero flag. The `sahf` instruction does not copy C[1] into any of the x86-64’s flag bits. Because `sahf` does not copy any FPU status bits into the sign or overflow flags, you cannot use signed comparison instructions. Instead, use unsigned operations (for example, `seta`, `setb`, `ja`, `jb`) when testing the results of a floating-point comparison. Yes, these instructions normally test unsigned values, and *floating-point numbers are signed values*. However, use the unsigned operations anyway; the `fstsw` and `sahf` instructions set the x86-64 FLAGS register as though you had compared unsigned values with the `cmp` instruction. The x86-64 processors provide an extra set of floating-point comparison instructions that directly affect the x86-64 condition code flags. These instructions circumvent having to use `fstsw` and `sahf` to copy the FPU status into the x86-64 condition codes. These instructions include `fcomi` and `fcomip`. You use them just like the `fcom` and `fcomp` instructions, except, of course, you do not have to manually copy the status bits to the FLAGS register. 
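The flag mapping described above can be sketched as a small three-way comparison routine. This is only an illustration, not one of the book’s listings: the procedure name `fpCompare`, the `real8` variables `x` and `y`, and the return-value convention are all assumptions made for the example. It checks the parity flag first because an unordered result (a NaN operand) sets C[0], C[2], and C[3] together, which would otherwise look like “below” and “equal” at the same time.

```
; Sketch only: fpCompare, x, and y are hypothetical names.
; x and y are assumed to be real8 variables declared elsewhere.
;
; Returns EAX = -1 if x < y
;                0 if x = y
;                1 if x > y
;                2 if x and y are unordered (NaN)

fpCompare   proc
            fld     y               ; ST(0) = y
            fld     x               ; ST(0) = x, ST(1) = y
            fcompp                  ; Compare x to y, pop both
            fstsw   ax              ; FPU status word into AX
            sahf                    ; C0 -> CF, C2 -> PF, C3 -> ZF

; The mov instructions below do not modify the flags,
; so each conditional jump still tests the sahf result.

            mov     eax, 2
            jp      done            ; PF = 1: unordered
            mov     eax, -1
            jb      done            ; CF = 1: x < y (unsigned-style test)
            mov     eax, 0
            je      done            ; ZF = 1: x = y
            mov     eax, 1          ; Otherwise x > y
done:       ret
fpCompare   endp
```

Note the use of `jb` rather than `jl`: as the text explains, `sahf` does not load the sign or overflow flags, so only the unsigned condition codes are meaningful here.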
#### 6.5.8.1 The fcom, fcomp, and fcompp Instructions The `fcom`, `fcomp`, and `fcompp` instructions compare ST(0) to the specified operand and set the corresponding FPU condition code bits based on the result of the comparison. The legal forms for these instructions are as follows: ``` fcom fcomp fcompp fcom st(`i`) fcomp st(`i`) fcom `mem`[32] fcom `mem`[64] fcomp `mem`[32] fcomp `mem`[64] ``` With no operands, `fcom`, `fcomp`, and `fcompp` compare ST(0) against ST(1) and set the FPU flags accordingly. In addition, `fcomp` pops ST(0) off the stack, and `fcompp` pops both ST(0) and ST(1) off the stack. With a single-register operand, `fcom` and `fcomp` compare ST(0) against the specified register. `fcomp` also pops ST(0) after the comparison. With a 32- or 64-bit memory operand, the `fcom` and `fcomp` instructions convert the memory variable to an 80-bit extended-precision value and then compare ST(0) against this value, setting the condition code bits accordingly. `fcomp` also pops ST(0) after the comparison. These instructions set C[2] (which winds up in the parity flag when using `sahf`) if the two operands are not comparable (for example, NaN). If it is possible for an illegal floating-point value to wind up in a comparison, you should check the parity flag for an error before checking the desired condition (for example, with the `setp`/`setnp` or `jp`/`jnp` instructions). These instructions set the stack fault bit if there aren’t two items on the top of the register stack. They set the denormalized exception bit if either or both operands are denormalized. They set the invalid operation flag if either or both operands are NaNs. These instructions always clear the C[1] condition code. Let’s look at an example of a floating-point comparison: ``` fcompp fstsw ax sahf setb al ; AL = true if st(0) < st(1) . . . fcompp fstsw ax sahf jnb st1GEst0 ; Code that executes if st(0) < st(1). 
st1GEst0: ``` Because all x86-64 64-bit CPUs support the `fcomi` and `fcomip` instructions (described in the next section), you should consider using those instructions as they spare you from having to store the FPU status word into AX and then copy AH into the FLAGS register before testing the condition. On the other hand, `fcomi` and `fcomip` support only a limited number of operand forms (the `fcom` and `fcomp` instructions are more general). Listing 6-5 is a sample program that demonstrates the use of the various `fcom` instructions. ``` ; Listing 6-5 ; Demonstration of fcom instructions. option casemap:none nl = 10 .const ttlStr byte "Listing 6-5", 0 fcomFmt byte "fcom %f < %f is %d", nl, 0 fcomFmt2 byte "fcom(2) %f < %f is %d", nl, 0 fcomFmt3 byte "fcom st(1) %f < %f is %d", nl, 0 fcomFmt4 byte "fcom st(1) (2) %f < %f is %d", nl, 0 fcomFmt5 byte "fcom mem %f < %f is %d", nl, 0 fcomFmt6 byte "fcom mem %f (2) < %f is %d", nl, 0 fcompFmt byte "fcomp %f < %f is %d", nl, 0 fcompFmt2 byte "fcomp (2) %f < %f is %d", nl, 0 fcompFmt3 byte "fcomp st(1) %f < %f is %d", nl, 0 fcompFmt4 byte "fcomp st(1) (2) %f < %f is %d", nl, 0 fcompFmt5 byte "fcomp mem %f < %f is %d", nl, 0 fcompFmt6 byte "fcomp mem (2) %f < %f is %d", nl, 0 fcomppFmt byte "fcompp %f < %f is %d", nl, 0 fcomppFmt2 byte "fcompp (2) %f < %f is %d", nl, 0 three real8 3.0 zero real8 0.0 minusTwo real8 -2.0 .data st0 real8 ? st1 real8 ? .code externdef printf:proc ; Return program title to C++ program: public getTitle getTitle proc lea rax, ttlStr ret getTitle endp ; printFP - Prints values of st0 and (possibly) st1. ; Caller must pass in ptr to fmtStr in RCX. printFP proc sub rsp, 40 ; For varargs (for example, printf call), double ; values must appear in RDX and R8 rather ; than XMM1, XMM2. ; Note: if only one double arg in format ; string, printf call will ignore 2nd ; value in R8. 
mov rdx, qword ptr st0 mov r8, qword ptr st1 movzx r9, al call printf add rsp, 40 ret printFP endp ; Here is the "asmMain" function. public asmMain asmMain proc push rbp mov rbp, rsp sub rsp, 48 ; Shadow storage ; fcom demo: xor eax, eax fld three fld zero fcom fstsw ax sahf setb al fstp st0 fstp st1 lea rcx, fcomFmt call printFP ; fcom demo 2: xor eax, eax fld zero fld three fcom fstsw ax sahf setb al fstp st0 fstp st1 lea rcx, fcomFmt2 call printFP ; fcom st(`i`) demo: xor eax, eax fld three fld zero fcom st(1) fstsw ax sahf setb al fstp st0 fstp st1 lea rcx, fcomFmt3 call printFP ; fcom st(`i`) demo 2: xor eax, eax fld zero fld three fcom st(1) fstsw ax sahf setb al fstp st0 fstp st1 lea rcx, fcomFmt4 call printFP ; fcom mem64 demo: xor eax, eax fld three ; Never on stack so fstp st1 ; copy for output fld zero fcom three fstsw ax sahf setb al fstp st0 lea rcx, fcomFmt5 call printFP ; fcom mem64 demo 2: xor eax, eax fld zero ; Never on stack so fstp st1 ; copy for output fld three fcom zero fstsw ax sahf setb al fstp st0 lea rcx, fcomFmt6 call printFP ; fcomp demo: xor eax, eax fld zero fld three fst st0 ; Because this gets popped fcomp fstsw ax sahf setb al fstp st1 lea rcx, fcompFmt call printFP ; fcomp demo 2: xor eax, eax fld three fld zero fst st0 ; Because this gets popped fcomp fstsw ax sahf setb al fstp st1 lea rcx, fcompFmt2 call printFP ; fcomp demo 3: xor eax, eax fld zero fld three fst st0 ; Because this gets popped fcomp st(1) fstsw ax sahf setb al fstp st1 lea rcx, fcompFmt3 call printFP ; fcomp demo 4: xor eax, eax fld three fld zero fst st0 ; Because this gets popped fcomp st(1) fstsw ax sahf setb al fstp st1 lea rcx, fcompFmt4 call printFP ; fcomp demo 5: xor eax, eax fld three fstp st1 fld zero fst st0 ; Because this gets popped fcomp three fstsw ax sahf setb al lea rcx, fcompFmt5 call printFP ; fcomp demo 6: xor eax, eax fld zero fstp st1 fld three fst st0 ; Because this gets popped fcomp zero fstsw ax sahf setb al lea rcx, fcompFmt6 call 
printFP ; fcompp demo: xor eax, eax fld zero fst st1 ; Because this gets popped fld three fst st0 ; Because this gets popped fcompp fstsw ax sahf setb al lea rcx, fcomppFmt call printFP ; fcompp demo 2: xor eax, eax fld three fst st1 ; Because this gets popped fld zero fst st0 ; Because this gets popped fcompp fstsw ax sahf setb al lea rcx, fcomppFmt2 call printFP leave ret ; Returns to caller asmMain endp end ``` Listing 6-5: Program that demonstrates the `fcom` instructions Here’s the build command and output for the program in Listing 6-5: ``` C:\>**build listing6-5** C:\>**echo off** Assembling: listing6-5.asm c.cpp C:\>**listing6-5** Calling Listing 6-5: fcom 0.000000 < 3.000000 is 1 fcom(2) 3.000000 < 0.000000 is 0 fcom st(1) 0.000000 < 3.000000 is 1 fcom st(1) (2) 3.000000 < 0.000000 is 0 fcom mem 0.000000 < 3.000000 is 1 fcom mem 3.000000 (2) < 0.000000 is 0 fcomp 3.000000 < 0.000000 is 0 fcomp (2) 0.000000 < 3.000000 is 1 fcomp st(1) 3.000000 < 0.000000 is 0 fcomp st(1) (2) 0.000000 < 3.000000 is 1 fcomp mem 0.000000 < 3.000000 is 1 fcomp mem (2) 3.000000 < 0.000000 is 0 fcompp 3.000000 < 0.000000 is 0 fcompp (2) 0.000000 < 3.000000 is 1 Listing 6-5 terminated ``` #### 6.5.8.2 The fcomi and fcomip Instructions The `fcomi` and `fcomip` instructions compare ST(0) to the specified operand and set the corresponding FLAGS condition code bits based on the result of the comparison. You use these instructions in a similar manner to `fcom` and `fcomp` except you can test the CPU’s flag bits directly after the execution of these instructions without first moving the FPU status bits into the FLAGS register. The legal forms for these instructions are as follows: ``` fcomi st(0), st(`i`) fcomip st(0), st(`i`) ``` Note that a *pop-pop* version (`fcomipp`) does not exist. If all you want to do is compare the top two items on the FPU stack, you will have to explicitly pop that item yourself (for example, by using the `fstp st(0)` instruction). 
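As a minimal sketch of that approach, the following fragment compares two values with `fcomip` and then discards the remaining operand, leaving the FPU stack as `fcompp` would. The `real8` variables `x` and `y` are hypothetical names used only for illustration.

```
; Sketch only: x and y are hypothetical real8 variables.

        fld     y               ; ST(0) = y
        fld     x               ; ST(0) = x, ST(1) = y
        fcomip  st(0), st(1)    ; Compare x to y, set ZF/PF/CF, pop x
        fstp    st(0)           ; Pop y (fstp does not change FLAGS)
        setb    al              ; AL = 1 if x < y
```

Because `fstp` leaves the FLAGS register alone, the `setb` can safely follow the extra pop. Keep in mind that an unordered comparison also sets the carry flag, so code that might see a NaN should test the parity flag (for example, with `jp` or `setp`) before trusting the result in AL.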
Listing 6-6 is a sample program that demonstrates the operation of the `fcomi` and `fcomip` instructions.

```
; Listing 6-6

; Demonstration of fcomi and fcomip instructions.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 6-6", 0
fcomiFmt    byte    "fcomi %f < %f is %d", nl, 0
fcomiFmt2   byte    "fcomi(2) %f < %f is %d", nl, 0
fcomipFmt   byte    "fcomip %f < %f is %d", nl, 0
fcomipFmt2  byte    "fcomip (2) %f < %f is %d", nl, 0

three       real8   3.0
zero        real8   0.0
minusTwo    real8   -2.0

            .data
st0         real8   ?
st1         real8   ?

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; printFP - Prints values of st0 and (possibly) st1.
;           Caller must pass in ptr to fmtStr in RCX.

printFP     proc
            sub     rsp, 40

; For varargs (for example, printf call), double
; values must appear in RDX and R8 rather
; than XMM1, XMM2.
; Note: if only one double arg in format
; string, printf call will ignore 2nd
; value in R8.

            mov     rdx, qword ptr st0
            mov     r8, qword ptr st1
            movzx   r9, al
            call    printf
            add     rsp, 40
            ret
printFP     endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48         ; Shadow storage

; Test to see if 0 < 3.
; Note: ST(0) contains 0, ST(1) contains 3.

            xor     eax, eax
            fld     three
            fld     zero
            fcomi   st(0), st(1)
            setb    al
            fstp    st0
            fstp    st1
            lea     rcx, fcomiFmt
            call    printFP

; Test to see if 3 < 0.
; Note: ST(0) contains 3, ST(1) contains 0.

            xor     eax, eax
            fld     zero
            fld     three
            fcomi   st(0), st(1)
            setb    al
            fstp    st0
            fstp    st1
            lea     rcx, fcomiFmt2
            call    printFP

; Test to see if 3 < 0.
; Note: ST(0) contains 3, ST(1) contains 0.

            xor     eax, eax
            fld     zero
            fld     three
            fst     st0             ; Because this gets popped
            fcomip  st(0), st(1)
            setb    al
            fstp    st1
            lea     rcx, fcomipFmt
            call    printFP

; Test to see if 0 < 3.
; Note: ST(0) contains 0, ST(1) contains 3.
            xor     eax, eax
            fld     three
            fld     zero
            fst     st0             ; Because this gets popped
            fcomip  st(0), st(1)
            setb    al
            fstp    st1
            lea     rcx, fcomipFmt2
            call    printFP

            leave
            ret     ; Returns to caller

asmMain     endp
            end
```

Listing 6-6: Sample program demonstrating floating-point comparisons

Here’s the build command and output for the program in Listing 6-6:

```
C:\>**build listing6-6**

C:\>**echo off**
 Assembling: listing6-6.asm
c.cpp

C:\>**listing6-6**
Calling Listing 6-6:
fcomi 0.000000 < 3.000000 is 1
fcomi(2) 3.000000 < 0.000000 is 0
fcomip 3.000000 < 0.000000 is 0
fcomip (2) 0.000000 < 3.000000 is 1
Listing 6-6 terminated
```

#### 6.5.8.3 The ftst Instruction

The `ftst` instruction compares the value in ST(0) against 0.0. It behaves just like the `fcom` instruction would if ST(1) contained 0.0. This instruction does not differentiate –0.0 from +0.0. If the value in ST(0) is either of these values, `ftst` will set C[3] to denote equality (or unordered). This instruction does *not* pop ST(0) off the stack.

Here’s an example:

```
        ftst
        fstsw   ax
        sahf
        sete    al      ; Set AL to 1 if TOS = 0.0
```

### 6.5.9 Constant Instructions

The FPU provides several instructions that let you load commonly used constants onto the FPU’s register stack. These instructions set the stack fault, invalid operation, and C[1] flags if a stack overflow occurs; they do not otherwise affect the FPU flags. The specific instructions in this category include the following:

```
fldz    ; Pushes +0.0
fld1    ; Pushes +1.0
fldpi   ; Pushes pi (3.14159...)
fldl2t  ; Pushes log2(10)
fldl2e  ; Pushes log2(e)
fldlg2  ; Pushes log10(2)
fldln2  ; Pushes ln(2)
```

### 6.5.10 Transcendental Instructions

The FPU provides eight *transcendental* (logarithmic and trigonometric) instructions to compute sine, cosine, partial tangent, partial arctangent, 2^(*x*) – 1, *y* × log2(*x*), and *y* × log2(*x* + 1). Using various algebraic identities, you can easily compute most of the other common transcendental functions by using these instructions.
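As one small illustration of such an identity, the natural logarithm follows from ln(*x*) = ln(2) × log2(*x*). The fragment below is only a sketch: it combines the `fldln2` constant instruction with the `fyl2x` instruction (covered later in this section), and the `real8` variables `x` and `lnX` are hypothetical names chosen for the example. It assumes `x` is greater than 0.

```
; Sketch only: x and lnX are hypothetical real8 variables; x > 0.

        fldln2              ; ST(0) = ln(2)
        fld     x           ; ST(0) = x, ST(1) = ln(2)
        fyl2x               ; ST(0) = ln(2) * log2(x) = ln(x)
        fstp    lnX         ; Store the result and pop it
```

Swapping `fldln2` for `fldlg2` in the same sequence would yield the base-10 logarithm instead, because log10(*x*) = log10(2) × log2(*x*).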
#### 6.5.10.1 The f2xm1 Instruction

`f2xm1` computes 2^(ST(0)) – 1. The value in ST(0) must be in the range –1.0 to +1.0. If ST(0) is out of range, `f2xm1` generates an undefined result but raises no exceptions. The computed value replaces the value in ST(0). Here’s an example computing 10^(*i*) using the identity 10^(*i*) = 2^(*i* × log2(10)). This is useful for only a small range of *i* that doesn’t put ST(0) outside the previously mentioned valid range:

```
fld     i
fldl2t
fmul
f2xm1
fld1
fadd
```

Because `f2xm1` computes 2^(*x*) – 1, the preceding code adds 1.0 to the result at the end of the computation.

#### 6.5.10.2 The fsin, fcos, and fsincos Instructions

These instructions pop the value off the top of the register stack and compute the sine, cosine, or both, and push the result(s) back onto the stack. The `fsincos` instruction pushes the sine followed by the cosine of the original operand; hence, it leaves cos(ST(0)) in ST(0) and sin(ST(0)) in ST(1).

These instructions assume ST(0) specifies an angle in radians, and this angle must be in the range –2⁶³ < ST(0) < +2⁶³. If the original operand is out of range, these instructions set the C[2] flag and leave ST(0) unchanged. You can use the `fprem1` instruction, with a divisor of 2π, to reduce the operand to a reasonable range.

These instructions set the stack fault (or rounding)/C[1], precision, underflow, denormalized, and invalid operation flags according to the result of the computation.

#### 6.5.10.3 The fptan Instruction

`fptan` computes the tangent of ST(0), replaces ST(0) with this value, and then pushes 1.0 onto the stack. Like the `fsin` and `fcos` instructions, the value of ST(0) must be in radians and in the range –2⁶³ < ST(0) < +2⁶³. If the value is outside this range, `fptan` sets C[2] to indicate that the conversion did not take place.
As with the `fsin`, `fcos`, and `fsincos` instructions, you can use the `fprem1` instruction to reduce this operand to a reasonable range by using a divisor of 2π.

If the argument is invalid (that is, π/2 or 3π/2 radians, where the cosine is 0 and the tangent would require a division by 0), the result is undefined and this instruction raises no exceptions. `fptan` will set the stack fault/rounding, precision, underflow, denormal, invalid operation, C[2], and C[1] bits as required by the operation.

#### 6.5.10.4 The fpatan Instruction

`fpatan` expects two values on the top of stack. It pops them and computes ST(0) = tan^(-1)(ST(1) / ST(0)).

The resulting value is the arctangent of the ratio on the stack expressed in radians. If you want to compute the arctangent of a particular value, use `fld1` to create the appropriate ratio and then execute the `fpatan` instruction.

This instruction affects the stack fault/C[1], precision, underflow, denormal, and invalid operation bits if a problem occurs during the computation. It sets the C[1] condition code bit if it has to round the result.

#### 6.5.10.5 The fyl2x Instruction

The `fyl2x` instruction computes ST(0) = ST(1) × log2(ST(0)). The instruction itself has no operands, but expects two operands on the FPU stack in ST(1) and ST(0), thus using the following syntax:

```
fyl2x
```

To compute the log of any other base, you can use the arithmetic identity log*n*(*x*) = log2(*x*) / log2(*n*). So if you first compute log2(*n*) and put its reciprocal on the stack, then push *x* onto the stack and execute `fyl2x`, you wind up with log*n*(*x*).

The `fyl2x` instruction sets the C[1] condition code bit if it has to round up the value. It clears C[1] if no rounding occurs or if a stack overflow occurs. The remaining floating-point condition codes are undefined after the execution of this instruction. `fyl2x` can raise the following floating-point exceptions: invalid operation, denormal result, overflow, underflow, and inexact result.
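The base-*n* procedure just described can be sketched as follows. The names `nVal`, `xVal`, `logNx`, and `logN` are hypothetical; the fragment assumes both values are positive, that *n* is not 1.0, and that the FPU stack is empty on entry:

```asm
; Sketch only: logNx = log base nVal of xVal, computed as
; (1 / log2(n)) * log2(x).
; nVal, xVal, logNx, and logN are hypothetical names.

            option  casemap:none

            .data
nVal        real8   3.0
xVal        real8   81.0
logNx       real8   ?

            .code
            public  logN
logN        proc
            fld1                ; ST(0) = 1.0 (numerator of the reciprocal)
            fld1                ; ST(0) = 1.0, ST(1) = 1.0
            fld     nVal        ; ST(0) = n, ST(1) = 1.0, ST(2) = 1.0
            fyl2x               ; ST(0) = 1.0 * log2(n), ST(1) = 1.0
            fdiv                ; ST(0) = 1.0 / log2(n)
            fld     xVal        ; ST(0) = x, ST(1) = 1.0 / log2(n)
            fyl2x               ; ST(0) = log2(x) / log2(n)
            fstp    logNx       ; Store the result and empty the FPU stack
            ret
logN        endp
            end
```

The first `fyl2x` uses 1.0 as its multiplier simply to obtain log2(*n*); the second uses the reciprocal as the multiplier, which is what turns the base-2 log into a base-*n* log.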
Note that the `fldl2t` and `fldl2e` instructions turn out to be quite handy when using the `fyl2x` instruction (for computing log10 and ln).

#### 6.5.10.6 The fyl2xp1 Instruction

`fyl2xp1` computes ST(0) = ST(1) × log2(ST(0) + 1.0) from two operands on the FPU stack. The syntax for this instruction is as follows:

```
fyl2xp1
```

Otherwise, the instruction is identical to `fyl2x`.

### 6.5.11 Miscellaneous Instructions

The FPU includes several additional instructions that control the FPU, synchronize operations, and let you test or set various status bits: `finit`/`fninit`, `fldcw`, `fstcw`, `fclex`/`fnclex`, and `fstsw`.

#### 6.5.11.1 The finit and fninit Instructions

The `finit` and `fninit` instructions initialize the FPU for proper operation. Your code should execute one of these instructions before executing any other FPU instructions. They initialize the control register to 37Fh, the status register to 0, and the tag word to 0FFFFh. The other registers are unaffected. Here are some examples:

```
finit
fninit
```

The difference between `finit` and `fninit` is that `finit` first checks for any pending floating-point exceptions before initializing the FPU; `fninit` does not.

#### 6.5.11.2 The fldcw and fstcw Instructions

The `fldcw` and `fstcw` instructions require a single 16-bit memory operand:

```
fldcw   `mem`[16]
fstcw   `mem`[16]
```

These two instructions load the control word from a memory location (`fldcw`) or store the control word to a 16-bit memory location (`fstcw`).

When you use `fldcw` to turn on one of the exceptions, if the corresponding exception flag is set when you enable that exception, the FPU will generate an immediate interrupt before the CPU executes the next instruction. Therefore, you should use `fclex` to clear any pending interrupts before changing the FPU exception enable bits.
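As a minimal sketch of that sequence, the following fragment enables (unmasks) the zero-divide exception. It assumes the x87 control-word layout in which bit 2 is the zero-divide mask; the names `ctrlWord` and `enableZDiv` are hypothetical:

```asm
; Sketch only: unmask the x87 zero-divide exception.
; Assumes bit 2 of the FPU control word is the zero-divide
; mask (1 = masked, 0 = enabled).
; ctrlWord and enableZDiv are hypothetical names.

            option  casemap:none

            .data
ctrlWord    word    ?

            .code
            public  enableZDiv
enableZDiv  proc
            fclex                       ; Clear stale exception flags first
            fstcw   ctrlWord            ; Fetch the current control word
            and     ctrlWord, 0FFFBh    ; Clear bit 2 (zero-divide mask)
            fldcw   ctrlWord            ; Activate the new setting
            ret
enableZDiv  endp
            end
```

Executing `fclex` before `fldcw` means a zero-divide flag left over from earlier code cannot trigger an interrupt the moment the mask bit is cleared.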
#### 6.5.11.3 The fclex and fnclex Instructions

The `fclex` and `fnclex` instructions clear all exception bits, the stack fault bit, and the busy flag in the FPU status register. Here are examples:

```
fclex
fnclex
```

The difference between these instructions is the same as that between `finit` and `fninit`: `fclex` first checks for pending floating-point exceptions.

#### 6.5.11.4 The fstsw and fnstsw Instructions

These instructions store the FPU status word into a 16-bit memory location or the AX register:

```
fstsw   ax
fnstsw  ax
fstsw   `mem`[16]
fnstsw  `mem`[16]
```

These instructions are unusual in the sense that they can copy an FPU value into one of the x86-64 general-purpose registers (specifically, AX). The purpose is to allow the CPU to easily test the condition code register with the `sahf` instruction. The difference between `fstsw` and `fnstsw` is the same as that for `fclex` and `fnclex`.

## 6.6 Converting Floating-Point Expressions to Assembly Language

Because the FPU register organization is different from the x86-64 integer register set, translating arithmetic expressions involving floating-point operands is a little different from translating integer expressions. Therefore, it makes sense to spend some time discussing how to manually translate floating-point expressions into assembly language.

The FPU uses *postfix notation* (also called *reverse Polish notation*, or *RPN*) for arithmetic operations. Once you get used to using postfix notation, it’s actually a bit more convenient for translating expressions because you don’t have to worry about allocating temporary variables; they always wind up on the FPU stack. Postfix notation, as opposed to standard *infix notation*, places the operands before the operator. Table 6-14 provides simple examples of infix notation and the corresponding postfix notation.
Table 6-14: Infix-to-Postfix Translation

| **Infix notation** | **Postfix notation** |
| --- | --- |
| 5 + 6 | 5 6 + |
| 7 – 2 | 7 2 – |
| y × z | y z × |
| a / b | a b / |

A postfix expression like `5 6 +` says, “Push 5 onto the stack, push 6 onto the stack, and then pop the value off the top of stack (6) and add it to the new top of stack.” Sound familiar? This is exactly what the `fld` and `fadd` instructions do. In fact, you can calculate the result by using the following code:

```
fld     five    ; Declared somewhere as five real8 5.0 (or real4/real10)
fld     six     ; Declared somewhere as six real8 6.0 (or real4/real10)
fadd            ; 11.0 is now on the top of the FPU stack
```

As you can see, postfix is a convenient notation because it’s easy to translate this code into FPU instructions.

Another advantage to postfix notation is that it doesn’t require any parentheses. The examples in Table 6-15 demonstrate some slightly more complex infix-to-postfix conversions.

Table 6-15: More-Complex Infix-to-Postfix Translations

| **Infix notation** | **Postfix notation** |
| --- | --- |
| (y + z) * 2 | y z + 2 * |
| y * 2 – (a + b) | y 2 * a b + – |
| (a + b) * (c + d) | a b + c d + * |

The postfix expression `y z + 2 *` says, “Push *y*, then push *z*; next, add those values on the stack (producing `y + z` on the stack). Next, push 2 and then multiply the two values (`2` and `y + z`) on the stack to produce two times the quantity `y + z`.” Once again, we can translate these postfix expressions directly into assembly language.
The following code demonstrates the conversion for each of the preceding expressions:

```
; y z + 2 *

        fld     y
        fld     z
        fadd
        fld     const2  ; const2 real8 2.0 in .data section
        fmul

; y 2 * a b + -

        fld     y
        fld     const2  ; const2 real8 2.0 in .data section
        fmul
        fld     a
        fld     b
        fadd
        fsub

; a b + c d + *

        fld     a
        fld     b
        fadd
        fld     c
        fld     d
        fadd
        fmul
```

### 6.6.1 Converting Arithmetic Expressions to Postfix Notation

For simple expressions, those involving two operands and a single operator, the translation from infix to postfix notation is trivial: simply move the operator from the infix position to the postfix position (that is, move the operator from between the operands to after the second operand). For example, `5 + 6` becomes `5 6 +`. Other than separating your operands so you don’t confuse them (that is, is it 5 and 6 or 56?), converting simple infix expressions into postfix notation is straightforward.

For complex expressions, the idea is to convert the simple subexpressions into postfix notation and then treat each converted subexpression as a single operand in the remaining expression. The following discussion surrounds completed conversions with square brackets so it is easy to see which text needs to be treated as a single operand in the conversion.

As for integer expression conversion, the best place to start is in the innermost parenthetical subexpression and then work your way outward, considering precedence, associativity, and other parenthetical subexpressions. As a concrete working example, consider the following expression:

```
x = ((y - z) * a) - (a + b * c) / 3.14159
```

A possible first translation is to convert the subexpression (`y - z`) into postfix notation:

```
x = ([y z -] * a) - (a + b * c) / 3.14159
```

Square brackets surround the converted postfix code just to separate it from the infix code, for readability. Remember, for the purposes of conversion, we will treat the text inside the square brackets as a single operand.
Therefore, you would treat `[y z -]` as though it were a single variable name or constant.

The next step is to translate the subexpression (`[y z -] * a`) into postfix form. This yields the following:

```
x = [y z - a *] - (a + b * c) / 3.14159
```

Next, we work on the parenthetical expression (`a + b * c`). Because multiplication has higher precedence than addition, we convert `b * c` first:

```
x = [y z - a *] - (a + [b c *]) / 3.14159
```

After converting `b * c`, we finish the parenthetical expression:

```
x = [y z - a *] - [a b c * +] / 3.14159
```

This leaves only two infix operators: subtraction and division. Because division has the higher precedence, we’ll convert that first:

```
x = [y z - a *] - [a b c * + 3.14159 /]
```

Finally, we convert the entire expression into postfix notation by dealing with the last infix operation, subtraction:

```
x = [y z - a *] [a b c * + 3.14159 /] -
```

Removing the square brackets yields the following postfix expression:

```
x = y z - a * a b c * + 3.14159 / -
```

The following steps demonstrate another infix-to-postfix conversion for this expression:

```
a = (x * y - z + t) / 2.0
```

1. Work inside the parentheses. Because multiplication has the highest precedence, convert that first:

```
a = ([x y *] - z + t) / 2.0
```

2. Still working inside the parentheses, we note that addition and subtraction have the same precedence, so we rely on associativity to determine what to do next. These operators are left-associative, so we must translate the expressions from left to right. This means translate the subtraction operator first:

```
a = ([x y * z -] + t) / 2.0
```

3. Now translate the addition operator inside the parentheses. Because this finishes the parenthetical operators, we can drop the parentheses:

```
a = [x y * z - t +] / 2.0
```

4. Translate the final infix operator (division). This yields the following:

```
a = [x y * z - t + 2.0 /]
```

5. Drop the square brackets, and we’re done:

```
a = x y * z - t + 2.0 /
```

### 6.6.2 Converting Postfix Notation to Assembly Language

Once you’ve translated an arithmetic expression into postfix notation, finishing the conversion to assembly language is easy. All you have to do is issue an `fld` instruction whenever you encounter an operand and issue an appropriate arithmetic instruction when you encounter an operator. This section uses the completed examples from the previous section to demonstrate how little there is to this process.

```
x = y z - a * a b c * + 3.14159 / -
```

1. Convert `y` to `fld y`.
2. Convert `z` to `fld z`.
3. Convert `-` to `fsub`.
4. Convert `a` to `fld a`.
5. Convert `*` to `fmul`.
6. Continuing in a left-to-right fashion, generate the following code for the expression:

```
fld     y
fld     z
fsub
fld     a
fmul
fld     a
fld     b
fld     c
fmul
fadd
fldpi           ; Loads pi (3.14159)
fdiv
fsub
fstp    x       ; Store result away into x
```

Here’s the translation for the second example in the previous section:

```
a = x y * z - t + 2.0 /

fld     x
fld     y
fmul
fld     z
fsub
fld     t
fadd
fld     const2  ; const2 real8 2.0 in .data section
fdiv
fstp    a       ; Store result away into a
```

As you can see, the translation is fairly simple once you’ve converted the infix notation to postfix notation. Also note that, unlike integer expression conversion, you don’t need any explicit temporaries. It turns out that the FPU stack provides the temporaries for you.^(9) For these reasons, converting floating-point expressions into assembly language is actually easier than converting integer expressions.

## 6.7 SSE Floating-Point Arithmetic

Although the x87 FPU is relatively easy to use, the stack-based design of the FPU created performance bottlenecks as CPUs became more powerful.
After introducing the *Streaming SIMD Extensions (SSE)* in its Pentium III CPUs (way back in 1999), Intel decided to resolve the FPU performance bottleneck and added scalar (non-vector) floating-point instructions to the SSE instruction set that could use the XMM registers. Most modern programs favor the use of the SSE (and later) registers and instructions for floating-point operations over the x87 FPU, using only those x87 operations available exclusively on the x87.

The SSE instruction set supports two floating-point data types: 32-bit single-precision (Intel calls these *scalar single* operations) and 64-bit double-precision values (Intel calls these *scalar double* operations).^(10) The SSE does not support the 80-bit extended-precision floating-point data types of the x87 FPU. If you need the extended-precision format, you’ll have to use the x87 FPU.

### 6.7.1 SSE MXCSR Register

The SSE MXCSR register is a 32-bit status and control register that controls SSE floating-point operations. Bits 16 to 31 are reserved and currently have no meaning. Table 6-16 lists the functions of the LO 16 bits.

Table 6-16: SSE MXCSR Register

| **Bit** | **Name** | **Function** |
| --- | --- | --- |
| 0 | IE | Invalid operation exception flag. Set if an invalid operation was attempted. |
| 1 | DE | Denormal exception flag. Set if an operation used a denormalized operand. |
| 2 | ZE | Divide-by-zero exception flag. Set if an attempt to divide by 0 was made. |
| 3 | OE | Overflow exception flag. Set if there was an overflow. |
| 4 | UE | Underflow exception flag. Set if there was an underflow. |
| 5 | PE | Precision exception flag. Set if there was a precision exception. |
| 6 | DAZ | Denormals are 0. If set, treat denormalized values as 0. |
| 7 | IM | Invalid operation mask. If set, ignore invalid operation exceptions. |
| 8 | DM | Denormal mask. If set, ignore denormal exceptions. |
| 9 | ZM | Divide-by-zero mask. If set, ignore division-by-zero exceptions. |
| 10 | OM | Overflow mask. If set, ignore overflow exceptions. |
| 11 | UM | Underflow mask. If set, ignore underflow exceptions. |
| 12 | PM | Precision mask. If set, ignore precision exceptions. |
| 13–14 | Rounding Control | 00: Round to nearest; 01: Round toward –infinity; 10: Round toward +infinity; 11: Round toward 0 (truncate) |
| 15 | FTZ | Flush to zero. When set, all underflow conditions set the register to 0. |

Access to the SSE MXCSR register is via the following two instructions:

```
ldmxcsr `mem`[32]
stmxcsr `mem`[32]
```

The `ldmxcsr` instruction loads the MXCSR register from the specified 32-bit memory location. The `stmxcsr` instruction stores the current contents of the MXCSR register to the specified memory location.

By far, the most common use of these two instructions is to set the rounding mode. In typical programs using the SSE floating-point instructions, it is common to switch between the round-to-nearest and round-to-zero (truncate) modes.

### 6.7.2 SSE Floating-Point Move Instructions

The SSE instruction set provides two instructions to move floating-point values between XMM registers and memory: `movss` (*move scalar single*) and `movsd` (*move scalar double*). Here is their syntax:

```
movss   `xmm`[*n*], `mem`[32]
movss   `mem`[32], `xmm`[*n*]
movsd   `xmm`[*n*], `mem`[64]
movsd   `mem`[64], `xmm`[*n*]
```

As for the standard general-purpose registers, the `movss` and `movsd` instructions move data between an appropriate memory location (containing a 32- or 64-bit floating-point value) and one of the 16 XMM registers (XMM0 to XMM15).

For maximum performance, `movss` memory operands should appear at a double-word-aligned memory address, and `movsd` memory operands should appear at a quad-word-aligned memory address. Though these instructions will function properly if the memory operands are not properly aligned in memory, there is a performance hit for misaligned accesses.
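As a small sketch of these alignment guidelines, the following fragment uses the MASM `align` directive to place a `real4` variable on a double-word boundary and `real8` variables on quad-word boundaries before moving them with `movss` and `movsd`. The names `sngVar`, `dblVar`, `dblCopy`, and `copyFP` are hypothetical:

```asm
; Sketch only: aligned scalar loads and stores.
; sngVar, dblVar, dblCopy, and copyFP are hypothetical names.

            option  casemap:none

            .data
            align   4               ; Double-word alignment for real4
sngVar      real4   1.5

            align   8               ; Quad-word alignment for real8
dblVar      real8   2.5
dblCopy     real8   ?

            .code
            public  copyFP
copyFP      proc
            movss   xmm0, sngVar    ; Load 32-bit single-precision value
            movsd   xmm1, dblVar    ; Load 64-bit double-precision value
            movsd   dblCopy, xmm1   ; Store the double back to memory
            ret
copyFP      endp
            end
```

Because `dblCopy` immediately follows an aligned 8-byte variable, it is quad-word aligned as well without a second `align` directive.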
In addition to the `movss` and `movsd` instructions that move floating-point values between XMM registers or XMM registers and memory, you’ll find a couple of other SSE move instructions useful that move data between XMM and general-purpose registers, `movd` and `movq`:

```
movd    `reg`[32], `xmm`[*n*]
movd    `xmm`[*n*], `reg`[32]
movq    `reg`[64], `xmm`[*n*]
movq    `xmm`[*n*], `reg`[64]
```

These instructions also have a form that allows a source memory operand. However, you should use `movss` and `movsd` to move floating-point variables into XMM registers.

The `movq` and `movd` instructions are especially useful for copying XMM registers into 64-bit general-purpose registers prior to a call to `printf()` (when printing floating-point values). As you’ll see in a few sections, these instructions are also useful for floating-point comparisons on the SSE.

### 6.7.3 SSE Floating-Point Arithmetic Instructions

The Intel SSE instruction set adds the following floating-point arithmetic instructions:

```
addss   `xmm`[*n*], `xmm`[*n*]
addss   `xmm`[*n*], `mem`[32]
addsd   `xmm`[*n*], `xmm`[*n*]
addsd   `xmm`[*n*], `mem`[64]
subss   `xmm`[*n*], `xmm`[*n*]
subss   `xmm`[*n*], `mem`[32]
subsd   `xmm`[*n*], `xmm`[*n*]
subsd   `xmm`[*n*], `mem`[64]
mulss   `xmm`[*n*], `xmm`[*n*]
mulss   `xmm`[*n*], `mem`[32]
mulsd   `xmm`[*n*], `xmm`[*n*]
mulsd   `xmm`[*n*], `mem`[64]
divss   `xmm`[*n*], `xmm`[*n*]
divss   `xmm`[*n*], `mem`[32]
divsd   `xmm`[*n*], `xmm`[*n*]
divsd   `xmm`[*n*], `mem`[64]
minss   `xmm`[*n*], `xmm`[*n*]
minss   `xmm`[*n*], `mem`[32]
minsd   `xmm`[*n*], `xmm`[*n*]
minsd   `xmm`[*n*], `mem`[64]
maxss   `xmm`[*n*], `xmm`[*n*]
maxss   `xmm`[*n*], `mem`[32]
maxsd   `xmm`[*n*], `xmm`[*n*]
maxsd   `xmm`[*n*], `mem`[64]
sqrtss  `xmm`[*n*], `xmm`[*n*]
sqrtss  `xmm`[*n*], `mem`[32]
sqrtsd  `xmm`[*n*], `xmm`[*n*]
sqrtsd  `xmm`[*n*], `mem`[64]
rcpss   `xmm`[*n*], `xmm`[*n*]
rcpss   `xmm`[*n*], `mem`[32]
rsqrtss `xmm`[*n*], `xmm`[*n*]
rsqrtss `xmm`[*n*], `mem`[32]
```

The `adds`*x*, `subs`*x*, `muls`*x*, and `divs`*x* instructions perform the expected floating-point arithmetic operations. The `mins`*x* instructions compute the minimum value of the two operands, storing the minimum value into the destination (first) operand. The `maxs`*x* instructions do the same thing, but compute the maximum of the two operands. The `sqrts`*x* instructions compute the square root of the source (second) operand and store the result into the destination (first) operand. The `rcpss` instruction, which exists only in single-precision form, computes the reciprocal of the source, storing the result into the destination.^(11) The `rsqrtss` instruction, also single-precision only, computes the reciprocal of the square root.^(12)

The operand syntax is somewhat limited for the SSE instructions (compared with the generic integer instructions): the destination operand must always be an XMM register.

### 6.7.4 SSE Floating-Point Comparisons

The *SSE floating-point comparisons* work quite a bit differently from the integer and x87 FPU compare instructions. Rather than having a single generic instruction that sets flags (to be tested by `set`*cc* or `j`*cc* instructions), the SSE provides a set of condition-specific comparison instructions that store true (all 1 bits) or false (all 0 bits) into the destination operand. You can then test the result value for true or false.
Here are the instructions:

```
cmpss      `xmm`[*n*], `xmm`[*m*]/`mem`[32], `imm`[8]
cmpsd      `xmm`[*n*], `xmm`[*m*]/`mem`[64], `imm`[8]

cmpeqss    `xmm`[*n*], `xmm`[*m*]/`mem`[32]
cmpltss    `xmm`[*n*], `xmm`[*m*]/`mem`[32]
cmpless    `xmm`[*n*], `xmm`[*m*]/`mem`[32]
cmpunordss `xmm`[*n*], `xmm`[*m*]/`mem`[32]
cmpneqss   `xmm`[*n*], `xmm`[*m*]/`mem`[32]
cmpnltss   `xmm`[*n*], `xmm`[*m*]/`mem`[32]
cmpnless   `xmm`[*n*], `xmm`[*m*]/`mem`[32]
cmpordss   `xmm`[*n*], `xmm`[*m*]/`mem`[32]

cmpeqsd    `xmm`[*n*], `xmm`[*m*]/`mem`[64]
cmpltsd    `xmm`[*n*], `xmm`[*m*]/`mem`[64]
cmplesd    `xmm`[*n*], `xmm`[*m*]/`mem`[64]
cmpunordsd `xmm`[*n*], `xmm`[*m*]/`mem`[64]
cmpneqsd   `xmm`[*n*], `xmm`[*m*]/`mem`[64]
cmpnltsd   `xmm`[*n*], `xmm`[*m*]/`mem`[64]
cmpnlesd   `xmm`[*n*], `xmm`[*m*]/`mem`[64]
cmpordsd   `xmm`[*n*], `xmm`[*m*]/`mem`[64]
```

The immediate constant is a value in the range 0 to 7 and represents one of the comparisons in Table 6-17.

Table 6-17: SSE Compare Immediate Operand

| **imm[8]** | **Comparison** |
| --- | --- |
| 0 | First operand `==` second operand |
| 1 | First operand `<` second operand |
| 2 | First operand `<=` second operand |
| 3 | First operand unordered second operand |
| 4 | First operand `≠` second operand |
| 5 | First operand not less than second operand (`≥`) |
| 6 | First operand not less than or equal to second operand (`>`) |
| 7 | First operand ordered second operand |

The instructions without the third (immediate) operand are special *pseudo-ops* MASM provides that automatically supply the appropriate third operand. You can use the `nlt` form for `ge` and `nle` form for `gt`, assuming the operands are ordered.

The *unordered* comparison returns true if either (or both) operands are unordered (typically, NaN values). Likewise, the ordered comparison returns true if both operands are ordered.

As noted, these instructions leave 0 or all 1 bits in the destination register to represent false or true.
If you want to branch based on these conditions, you should move the destination XMM register into a general-purpose register and test that register for zero/not zero. You can use the `movq` or `movd` instructions to accomplish this:

```
cmpeqsd xmm0, xmm1
movd    eax, xmm0       ; Move true/false to EAX
test    eax, eax        ; Test for true/false
jnz     xmm0EQxmm1      ; Branch if xmm0 == xmm1

; Code to execute if xmm0 != xmm1.
```

### 6.7.5 SSE Floating-Point Conversions

The x86-64 provides several floating-point conversion instructions that convert between floating-point and integer formats. Table 6-18 lists these instructions and their syntax.

Table 6-18: SSE Conversion Instructions

| **Instruction syntax** | **Description** |
| --- | --- |
| `cvtsd2si` `reg`[32/64], `xmm`[*n*]/`mem`[64] | Converts scalar double-precision FP to 32- or 64-bit integer. Uses the current rounding mode in the MXCSR to determine how to deal with fractional components. Result is stored in a general-purpose 32- or 64-bit register. |
| `cvtsd2ss` `xmm`[*n*], `xmm`[*n*]/`mem`[64] | Converts scalar double-precision FP (in an XMM register or memory) to scalar single-precision FP and leaves the result in the destination XMM register. Uses the current rounding mode in the MXCSR to determine how to deal with inexact conversions. |
| `cvtsi2sd` `xmm`[*n*], `reg`[32/64]/`mem`[32/64] | Converts a 32- or 64-bit integer in an integer register or memory to a double-precision floating-point value, leaving the result in an XMM register. |
| `cvtsi2ss` `xmm`[*n*], `reg`[32/64]/`mem`[32/64] | Converts a 32- or 64-bit integer in an integer register or memory to a single-precision floating-point value, leaving the result in an XMM register. |
| `cvtss2sd` `xmm`[*n*], `xmm`[*n*]/`mem`[32] | Converts a single-precision floating-point value in an XMM register or memory to a double-precision value, leaving the result in the destination XMM register. |
| `cvtss2si` `reg`[32/64], `xmm`[*n*]/`mem`[32] | Converts a single-precision floating-point value in an XMM register or memory to an integer and leaves the result in a general-purpose 32- or 64-bit register. Uses the current rounding mode in the MXCSR to determine how to deal with inexact conversions. |
| `cvttsd2si` `reg`[32/64], `xmm`[*n*]/`mem`[64] | Converts scalar double-precision FP to a 32- or 64-bit integer. Conversion is done using truncation (does not use the rounding control setting in the MXCSR). Result is stored in a general-purpose 32- or 64-bit register. |
| `cvttss2si` `reg`[32/64], `xmm`[*n*]/`mem`[32] | Converts scalar single-precision FP to a 32- or 64-bit integer. Conversion is done using truncation (does not use the rounding control setting in the MXCSR). Result is stored in a general-purpose 32- or 64-bit register. |

## 6.8 For More Information

The Intel and AMD processor manuals fully describe the operation of each of the integer and floating-point arithmetic instructions, including a detailed description of how these instructions affect the condition code bits and other flags in the FLAGS and FPU status registers. To write the best possible assembly language code, you need to be intimately familiar with how the arithmetic instructions affect the execution environment, so spending time with the Intel and AMD manuals is a good idea.

Chapter 8 discusses multiprecision integer arithmetic. See that chapter for details on handling integer operands that are greater than 64 bits in size.

The AVX instruction set extensions found on later iterations of the x86-64 CPU provide support for floating-point arithmetic using the AVX register set. Consult the Intel and AMD documentation for details concerning the AVX floating-point instruction set.

## 6.9 Test Yourself

1. What are the implied operands for the single-operand `imul` and `mul` instructions?
2. What is the result size for an 8-bit `mul` operation? A 16-bit `mul` operation?
A 32-bit `mul` operation? A 64-bit `mul` operation? Where does the CPU put the products?
3. What result(s) does an x86 `div` instruction produce?
4. When performing a signed 16×16–bit division using `idiv`, what must you do before executing the `idiv` instruction?
5. When performing an unsigned 32×32–bit division using `div`, what must you do before executing the `div` instruction?
6. What are the two conditions that will cause a `div` instruction to produce an exception?
7. How do the `mul` and `imul` instructions indicate overflow?
8. How do the `mul` and `imul` instructions affect the zero flag?
9. What is the difference between the extended-precision (single operand) `imul` instruction and the more generic (multi-operand) `imul` instruction?
10. What instructions would you normally use to sign-extend the accumulator prior to executing an `idiv` instruction?
11. How do the `div` and `idiv` instructions affect the carry, zero, overflow, and sign flags?
12. How does the `cmp` instruction affect the zero flag?
13. How does the `cmp` instruction affect the carry flag (with respect to an unsigned comparison)?
14. How does the `cmp` instruction affect the sign and overflow flags (with respect to a signed comparison)?
15. What operands do the `set`*cc* instructions take?
16. What do the `set`*cc* instructions do to their operand?
17. What is the difference between the `test` instruction and the `and` instruction?
18. What are the similarities between the `test` instruction and the `and` instruction?
19. Explain how you would use the `test` instruction to see if an individual bit is 1 or 0 in an operand.
20. Convert the following expressions to assembly language (assume all variables are signed 32-bit integers):

```
x = x + y
x = y - z
x = y * z
x = y + z * t
x = (y + z) * t
x = -((x * y) / z)
x = (y == z) && (t != 0)
```

21. Compute the following expressions without using an `imul` or `mul` instruction (assume all variables are signed 32-bit integers):

```
x = x * 2
x = y * 5
x = y * 8
```

22. Compute the following expressions without using a `div` or `idiv` instruction (assume all variables are unsigned 16-bit integers):

```
x = x / 2
x = y / 8
x = z / 10
```

23. Convert the following expressions to assembly language by using the FPU (assume all variables are `real8` floating-point values):

```
x = x + y
x = y - z
x = y * z
x = y + z * t
x = (y + z) * t
x = -((x * y) / z)
```

24. Convert the following expressions to assembly language by using SSE instructions (assume all variables are `real4` floating-point values):

```
x = x + y
x = y - z
x = y * z
x = y + z * t
```

25. Convert the following expressions to assembly language by using FPU instructions; assume `b` is a one-byte Boolean variable and `x`, `y`, and `z` are `real8` floating-point variables:

```
b = x < y
b = x >= y && x < z
```

````

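Chapter 7: Low-Level Control Structures

This chapter discusses how to convert high-level language (HLL) control structures into assembly language control statements. The examples up to this point have created assembly control structures in an ad hoc manner. Now it's time to formalize how to control the operation of your assembly language programs. By the time you finish this chapter, you should be able to convert HLL control structures into assembly language.

Control structures in assembly language consist of conditional branches and indirect jumps. This chapter discusses those instructions and how to emulate HLL control structures (such as if/else, switch, and loop statements). This chapter also discusses labels (the targets of conditional branches and jump statements) as well as the scope of labels in an assembly language source file.

7.1 Statement Labels

Before discussing the jump instructions and how to emulate control structures with them, an in-depth discussion of assembly language statement labels is necessary. In an assembly language program, labels stand in as symbolic names for addresses. It is far more convenient to refer to a position in your code by using a name such as LoopEntry than by using a numeric address such as 0AF1C002345B7901Eh. For this reason, assembly language low-level control structures make extensive use of labels within source code (see "Brief Detour: An Introduction to Control Transfer Instructions" in Chapter 2).

You can do three things with (code) labels: transfer control to a label via a (conditional or unconditional) jump instruction, call a label via the call instruction, and take the address of a label. Taking the address of a label is useful when you want to indirectly transfer control to that address at a later point in your program.

The following code sequence demonstrates two ways to take the address of a label in your program (using the lea instruction and using the offset operator):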

stmtLbl:
    .
    .
    .
  mov rcx, offset stmtLbl2
    .
    .
    .
  lea rax, stmtLbl
    .
    .
    .
stmtLbl2:

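Because addresses are 64-bit quantities, you'll typically use the lea instruction to load an address into a 64-bit general-purpose register. Because that instruction uses a 32-bit displacement relative to the current instruction, its encoding is significantly shorter than that of the mov instruction (which must encode a full 8-byte constant in addition to the opcode bytes).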
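As a minimal sketch of why taking a label's address is useful, the following fragment loads the address of a label with lea and then transfers control to it indirectly through a register. The names jmpDemo and skipTo are hypothetical:

```asm
; Sketch only: take a label's address, then jump to it
; indirectly. jmpDemo and skipTo are hypothetical names.

            option  casemap:none

            .code
            public  jmpDemo
jmpDemo     proc
            lea     rax, skipTo     ; RAX = address of skipTo
            jmp     rax             ; Indirect jump through RAX

            mov     eax, 0          ; Never executed; the jump skips it

skipTo:
            mov     eax, 1          ; Execution resumes here
            ret
jmpDemo     endp
            end
```

Because the target address sits in a register, code could choose among several labels at run time before executing the jmp instruction.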

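7.1.1 Using Local Symbols in Procedures

Statement labels you define within a proc/endp procedure are local to that procedure, in the sense of lexical scope: the statement label is visible only within that procedure, and you cannot refer to it from outside the procedure. Listing 7-1 demonstrates that you cannot refer to a symbol inside another procedure (note that this program will not assemble because of this error).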

; Listing 7-1

; Demonstration of local symbols.
; Note that this program will not
; compile; it fails with an
; undefined symbol error.

        option  casemap:none

            .code

hasLocalLbl proc

localStmtLbl:
            ret
hasLocalLbl endp

; Here is the "asmMain" function.

asmMain     proc

asmLocal:   jmp     asmLocal        ; This is okay
            jmp     localStmtLbl    ; Undefined in asmMain
asmMain     endp
            end

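Listing 7-1: Demonstration of lexically scoped symbols

The command to assemble this file (and the corresponding diagnostic message) is as follows: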

C:\>**ml64 /c listing7-1.asm**
Microsoft (R) Macro Assembler (x64) Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: listing7-1.asm
listing7-1.asm(26) : error A2006:undefined symbol : localStmtLbl

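If you really want to access a statement (or any other) label outside a procedure, you can use the option directive to turn off local scope within a section of your program, as noted in Chapter 5: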

option noscoped
option scoped

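The first form tells MASM to stop making symbols (inside proc/endp) local to the procedure containing them. The second form restores the lexical scoping of symbols in procedures. Therefore, using these two directives, you can turn scoping on or off for various sections of your source file (including as little as a single statement, if you like). Listing 7-2 demonstrates how to use the option directive to make a single symbol global outside the procedure containing it (note that this program still fails to assemble).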

; Listing 7-2

; Demonstration of local symbols #2.
; Note that this program will not
; compile; it fails with two
; undefined symbol errors.

            option  casemap:none

            .code

hasLocalLbl proc

localStmtLbl:
            option noscoped
notLocal:
            option scoped
isLocal:
            ret
hasLocalLbl endp

; Here is the "asmMain" function.

asmMain     proc

            lea     rcx, localStmtLbl  ; Generates an error
            lea     rcx, notLocal      ; Assembles fine
            lea     rcx, isLocal       ; Generates an error
asmMain     endp
            end

Listing 7-2: The option scoped and option noscoped directives

Here's the build command (and diagnostic output) for Listing 7-2:

C:\>**ml64 /c listing7-2.asm**
Microsoft (R) Macro Assembler (x64) Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: listing7-2.asm
listing7-2.asm(29) : error A2006:undefined symbol : localStmtLbl
listing7-2.asm(31) : error A2006:undefined symbol : isLocal

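As you can see from MASM's output, the notLocal symbol (appearing after the option noscoped directive) did not generate an undefined symbol error. However, the localStmtLbl and isLocal symbols, which are local to the hasLocalLbl procedure, are undefined outside that procedure.

7.1.2 Initializing Arrays with Label Addresses

MASM also allows you to initialize quad-word variables with the addresses of statement labels. However, labels that appear in the initialization portion of a variable declaration have some restrictions. The most important restriction is that the symbol must be in the same lexical scope as the data declaration attempting to use it. So, either the qword directive must appear inside the same procedure as the statement label, or you must use the option noscoped directive to make the symbol global to the procedure. Listing 7-3 demonstrates these two ways to initialize a qword variable with statement label addresses.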

; Listing 7-3

; Initializing qword values with the
; addresses of statement labels.

        option  casemap:none

            .data
lblsInProc  qword   globalLbl1, globalLbl2  ; From procWLabels

            .code

; procWLabels - Just a procedure containing private (lexically scoped)
;               and global symbols. This really isn't an executable
;               procedure.

procWLabels proc
privateLbl:
            nop     ; "No operation" just to consume space
            option  noscoped
globalLbl1: jmp     globalLbl2
globalLbl2: nop
            option  scoped
privateLbl2:
            ret
dataInCode  qword   privateLbl, globalLbl1
            qword   globalLbl2, privateLbl2
procWLabels endp

            end

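Listing 7-3: Initializing qword variables with the addresses of statement labels

If you assemble Listing 7-3 with the following command, you'll get no assembly errors:

ml64 /c /Fl listing7-3.asm

If you look at the listing7-3.lst output file that MASM produces, you can see that MASM properly initializes the quad-word declarations with the (section-relative/relocatable) offsets of the statement labels: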

00000000                        .data
00000000           lblsInProc   qword   globalLbl1, globalLbl2
       0000000000000001 R
       0000000000000003 R
          .
          .
          .
 00000005           dataInCode  qword   privateLbl, globalLbl1
       0000000000000000 R
       0000000000000001 R
 00000015  0000000000000003 R   qword   globalLbl2, privateLbl2
       0000000000000004 R

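Transferring control from outside a procedure to a statement label inside it is generally considered bad programming practice. Unless you have a good reason to do so, you probably shouldn't.

As addresses on the x86-64 are 64-bit quantities, you will typically use the qword directive (as in the previous examples) to initialize a data object with the address of a statement label. However, if your program is (always going to be) smaller than 2GB, and you set the LARGEADDRESSAWARE:NO flag (using sbuild.bat), you can get away with using dword data declarations to hold the address of a label. Of course, as this book has pointed out many times, using 32-bit addresses in your 64-bit programs can lead to problems if your program ever exceeds 2GB of storage.

7.2 Unconditional Transfer of Control (jmp)

The jmp (jump) instruction unconditionally transfers control to another point in the program. This instruction has three forms: a direct jump and two indirect jumps. They take the following forms:

jmp `label`
jmp `reg64`
jmp `mem64`

The first instruction is a direct jump, which you've seen in various sample programs up to this point. For direct jumps, you normally specify the target address by using a statement label. The label appears either on the same line as an executable machine instruction or by itself on a line preceding an executable machine instruction. The direct jump is completely equivalent to a goto statement in a high-level language.^(1)

Here's an example: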

 `Statements`
          jmp laterInPgm
               .
               .
               .
laterInPgm:
 `Statements`

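7.2.1 Register-Indirect Jumps

The second form of the jmp instruction given earlier, jmp `reg64`, is a register-indirect jump instruction that transfers control to the instruction whose address appears in the specified 64-bit general-purpose register. To use this form of the jmp instruction, you must load a 64-bit register with the address of a machine instruction before executing the jmp. When several paths each load the register with a different address, control transfers to whichever location the path taken up to that point selected.

Listing 7-4 reads a string of characters from the user that contains an integer value. It uses the C Standard Library function strtol() to convert that string to a binary integer value. The strtol() function isn't the greatest at reporting errors, so this program tests the return results to verify correct input and uses register-indirect jumps to transfer control to different code paths based on the result.

The first section of Listing 7-4 contains constants, variables, external declarations, and the (usual) getTitle() function.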

; Listing 7-4

; Demonstration of register-indirect jumps.

        option  casemap:none

nl          =       10
maxLen      =       256
EINVAL      =       22      ; "Magic" C stdlib constant, invalid argument
ERANGE      =       34      ; Value out of range

            .const
ttlStr      byte    "Listing 7-4", 0
fmtStr1     byte    "Enter an integer value between "
            byte    "1 and 10 (0 to quit): ", 0

badInpStr   byte    "There was an error in readLine "
            byte    "(ctrl-Z pressed?)", nl, 0

invalidStr  byte    "The input string was not a proper number"
            byte    nl, 0

rangeStr    byte    "The input value was outside the "
            byte    "range 1-10", nl, 0

unknownStr  byte    "There was a problem with strToInt "
            byte    "(unknown error)", nl, 0

goodStr     byte    "The input value was %d", nl, 0

fmtStr      byte    "result:%d, errno:%d", nl, 0

            .data
            externdef _errno:dword  ; Error return by C code
endStr      qword   ?
inputValue  dword   ?
buffer      byte    maxLen dup (?)

            .code
            externdef readLine:proc
            externdef strtol:proc
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

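The next section of Listing 7-4 is the strToInt() function, a wrapper around the C Standard Library strtol() function that does a more thorough job of handling erroneous user input. See the comments in the code for the function's return values.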

; strToInt - Converts a string to an integer, checking for errors.

; Argument:
;    RCX -   Pointer to string containing (only) decimal
;            digits to convert to an integer.

; Returns:
;    RAX -   Integer value if conversion was successful.
;    RCX -   Conversion state. One of the following:
;            0 - Conversion successful.
;            1 - Illegal characters at the beginning of the
;                string (or empty string).
;            2 - Illegal characters at the end of the string.
;            3 - Value too large for 32-bit signed integer.

strToInt    proc
strToConv   equ     [rbp+16]        ; Flush RCX here
endPtr      equ     [rbp-8]         ; Save ptr to end of str
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48        ; Shadow storage + endPtr; keeps RSP 16-byte aligned

            mov     strToConv, rcx ; Save, so we can test later

            ; RCX already contains string parameter for strtol:

            lea     rdx, endPtr    ; Ptr to end of string goes here
            mov     r8d, 10        ; Decimal conversion
            call    strtol

; On return:

;    RAX    - Contains converted value, if successful.
;    endPtr - Pointer to 1 position beyond last char in string.

; If strtol returns with endPtr == strToConv, then there were no
; legal digits at the beginning of the string.

            mov     ecx, 1         ; Assume bad conversion
            mov     rdx, endPtr
            cmp     rdx, strToConv
            je      returnValue

; If endPtr is not pointing at a zero byte, then we've got
; junk at the end of the string.

            mov     ecx, 2         ; Assume junk at end
            mov     rdx, endPtr
            cmp     byte ptr [rdx], 0
            jne     returnValue

; If the return result is 7FFF_FFFFh or 8000_0000h (max long and
; min long, respectively), and the C global _errno variable 
; contains ERANGE, then we've got a range error.

            mov     ecx, 0         ; Assume good input
            cmp     _errno, ERANGE
            jne     returnValue
            mov     ecx, 3         ; Assume out of range
            cmp     eax, 7fffffffh
            je      returnValue
            cmp     eax, 80000000h
            je      returnValue

; If we get to this point, it's a good number.

            mov     ecx, 0

returnValue:
            leave
            ret
strToInt    endp
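
For comparison, the same three checks can be written directly in C. This is only a sketch: the name `str_to_int` and the `CONV_*` constants are invented here and are not part of the book's code. It also differs from the listing in two small ways: it clears `errno` before the call (the usual C idiom), and it compares against `LONG_MAX`/`LONG_MIN` rather than fixed 32-bit constants, because the width of `long` varies between platforms.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical status codes that mirror the values strToInt returns in RCX. */
enum { CONV_OK = 0, CONV_BAD_START = 1, CONV_JUNK_AT_END = 2, CONV_RANGE = 3 };

/* Converts s to a long, stores it in *value, and returns a CONV_* code. */
int str_to_int(const char *s, long *value)
{
    char *end;

    errno = 0;                      /* so a stale ERANGE is not misread    */
    *value = strtol(s, &end, 10);

    if (end == s)                   /* no digits were consumed at all      */
        return CONV_BAD_START;
    if (*end != '\0')               /* conversion stopped before the NUL   */
        return CONV_JUNK_AT_END;
    if (errno == ERANGE && (*value == LONG_MAX || *value == LONG_MIN))
        return CONV_RANGE;
    return CONV_OK;
}
```

The order of the tests matters in the same way it does in the assembly: the "no digits" case must be ruled out before looking at the character `end` points to.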

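The final section of Listing 7-4 is the main program. This is the code most interesting to us. It loads the RBX register with the address of the code to execute based on the strToInt() return results. The strToInt() function returns one of the following states (see the comments in the previous code for an explanation):

Valid input
Illegal characters at the beginning of the string
Illegal characters at the end of the string
Range error

The program then transfers control to different sections of asmMain() based on the value held in RBX (which identifies the kind of result strToInt() returned).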

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
saveRBX     equ     qword ptr [rbp-8]      ; Must preserve RBX
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48                ; Shadow storage

            mov     saveRBX, rbx           ; Must preserve RBX

            ; Prompt the user to enter a value
            ; between 1 and 10:

repeatPgm:  lea     rcx, fmtStr1
            call    printf

            ; Get user input:

            lea     rcx, buffer
            mov     edx, maxLen     ; Zero-extends!
            call    readLine
            lea     rbx, badInput   ; Initialize state machine
            test    rax, rax        ; RAX is -1 on bad input
            js      hadError        ; (only neg value readLine returns)

            ; Call strToInt to convert string to an integer and
            ; check for errors:

            lea     rcx, buffer     ; Ptr to string to convert
            call    strToInt
            lea     rbx, invalid
            cmp     ecx, 1
            je      hadError
            cmp     ecx, 2
            je      hadError

            lea     rbx, range
            cmp     ecx, 3
            je      hadError

            lea     rbx, unknown
            cmp     ecx, 0
            jne     hadError

; At this point, input is valid and is sitting in EAX.

; First, check to see if the user entered 0 (to quit
; the program).

            test    eax, eax        ; Test for zero
            je      allDone

; However, we need to verify that the number is in the
; range 1-10.

            lea     rbx, range
            cmp     eax, 1
            jl      hadError
            cmp     eax, 10
            jg      hadError

; Pretend a bunch of work happens here dealing with the
; input number.

            lea     rbx, goodInput
            mov     inputValue, eax

; The different code streams all merge together here to
; execute some common code (we'll pretend that happens;
; for brevity, no such code exists here).

hadError:

; At the end of the common code (which doesn't mess with
; RBX), separate into five different code streams based
; on the pointer value in RBX:

            jmp     rbx

; Transfer here if readLine returned an error:

badInput:   lea     rcx, badInpStr
            call    printf
            jmp     repeatPgm

; Transfer here if there was a non-digit character
; in the string:

invalid:    lea     rcx, invalidStr
            call    printf
            jmp     repeatPgm

; Transfer here if the input value was out of range:

range:      lea     rcx, rangeStr
            call    printf
            jmp     repeatPgm

; Shouldn't ever get here. Happens if strToInt returns
; a value outside the range 0-3.

unknown:    lea     rcx, unknownStr
            call    printf
            jmp     repeatPgm

; Transfer down here on a good user input.

goodInput:  lea     rcx, goodStr
            mov     edx, inputValue ; Zero-extends!
            call    printf
            jmp     repeatPgm

; Branch here when the user selects "quit program" by
; entering the value zero:

allDone:    mov     rbx, saveRBX    ; Must restore before returning
            leave
            ret                     ; Returns to caller

asmMain     endp
            end

Listing 7-4: Using register-indirect jmp instructions

Here's the build command and a sample run of the program in Listing 7-4:

C:\>**build listing7-4**

C:\>**echo off**
 Assembling: listing7-4.asm
c.cpp

C:\>**listing7-4**
Calling Listing 7-4:
Enter an integer value between 1 and 10 (0 to quit): ^Z
There was an error in readLine (ctrl-Z pressed?)
Enter an integer value between 1 and 10 (0 to quit): a123
The input string was not a proper number
Enter an integer value between 1 and 10 (0 to quit): 123a
The input string was not a proper number
Enter an integer value between 1 and 10 (0 to quit): 1234567890123
The input value was outside the range 1-10
Enter an integer value between 1 and 10 (0 to quit): -1
The input value was outside the range 1-10
Enter an integer value between 1 and 10 (0 to quit): 11
The input value was outside the range 1-10
Enter an integer value between 1 and 10 (0 to quit): 5
The input value was 5
Enter an integer value between 1 and 10 (0 to quit): 0
Listing 7-4 terminated
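
The "load a target address along each path, then jump through it once" pattern has a close C analog in function pointers. The sketch below is an illustration only: `handler_t`, `choose_handler`, and the handler functions are hypothetical names, and the quit-on-zero case from the listing is left out. The variable `target` plays the role of RBX.

```c
/* A handler returns the message that the corresponding label would print. */
typedef const char *(*handler_t)(void);

static const char *bad_input(void)  { return "readLine error"; }
static const char *invalid(void)    { return "not a proper number"; }
static const char *range(void)      { return "outside the range 1-10"; }
static const char *unknown(void)    { return "unknown error"; }
static const char *good_input(void) { return "good"; }

/* Picks the code to run, the way asmMain loads RBX before "jmp rbx". */
handler_t choose_handler(int read_ok, int conv_state, long value)
{
    handler_t target = bad_input;               /* lea rbx, badInput */
    if (!read_ok) return target;

    target = invalid;                           /* lea rbx, invalid  */
    if (conv_state == 1 || conv_state == 2) return target;

    target = range;                             /* lea rbx, range    */
    if (conv_state == 3) return target;

    target = unknown;                           /* lea rbx, unknown  */
    if (conv_state != 0) return target;

    target = range;                             /* range check 1..10 */
    if (value < 1 || value > 10) return target;

    return good_input;                          /* lea rbx, goodInput */
}
```

A caller would write `choose_handler(ok, state, value)()`. The call through the pointer stands in for `jmp rbx`, with the difference that a C call returns to the caller whereas the jump does not.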

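7.2.2 Memory-Indirect Jumps

The third form of the jmp instruction is a memory-indirect jump, which fetches a quad-word (8-byte) value from the memory location and jumps to that address. This is similar to the register-indirect jmp, except the address appears in a memory location rather than in a register.

Listing 7-5 demonstrates a rather trivial use of this form of the jmp instruction.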

; Listing 7-5

; Demonstration of memory-indirect jumps.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 7-5", 0
fmtStr1     byte    "Before indirect jump", nl, 0
fmtStr2     byte    "After indirect jump", nl, 0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48                 ; Shadow storage

            lea     rcx, fmtStr1
            call    printf
            jmp     memPtr

memPtr      qword   ExitPoint

ExitPoint:  lea     rcx, fmtStr2
            call    printf

            leave
            ret     ; Returns to caller

asmMain     endp
            end

Listing 7-5: Using memory-indirect jmp instructions

Here's the build command and output for Listing 7-5:

C:\>**build listing7-5**

C:\>**echo off**
 Assembling: listing7-5.asm
c.cpp

C:\>**listing7-5**
Calling Listing 7-5:
Before indirect jump
After indirect jump
Listing 7-5 terminated

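Note that your program will likely crash if you execute an indirect jump through an invalid pointer value.

7.3 Conditional Jump Instructions

Although Chapter 2 provided an overview of the conditional jump instructions, repeating that discussion and expanding upon it here is worthwhile, as conditional jumps are the principal tool for creating control structures in assembly language.

Unlike the unconditional jmp instruction, the conditional jump instructions do not provide an indirect form. They allow only a branch to a statement label in your program.

Intel's documentation defines various synonyms or instruction aliases for many conditional jump instructions. Tables 7-1, 7-2, and 7-3 list all the aliases for each instruction, as well as the opposite branches. You'll soon see the purpose of the opposite branches.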

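Table 7-1: `jcc` Instructions That Test Flags

| Instruction | Description | Condition | Aliases | Opposite |
|---|---|---|---|---|
| `jc` | Jump if carry | Carry = 1 | `jb`, `jnae` | `jnc` |
| `jnc` | Jump if no carry | Carry = 0 | `jnb`, `jae` | `jc` |
| `jz` | Jump if zero | Zero = 1 | `je` | `jnz` |
| `jnz` | Jump if not zero | Zero = 0 | `jne` | `jz` |
| `js` | Jump if sign | Sign = 1 | | `jns` |
| `jns` | Jump if no sign | Sign = 0 | | `js` |
| `jo` | Jump if overflow | Overflow = 1 | | `jno` |
| `jno` | Jump if no overflow | Overflow = 0 | | `jo` |
| `jp` | Jump if parity | Parity = 1 | `jpe` | `jnp` |
| `jpe` | Jump if parity even | Parity = 1 | `jp` | `jpo` |
| `jnp` | Jump if no parity | Parity = 0 | `jpo` | `jp` |
| `jpo` | Jump if parity odd | Parity = 0 | `jnp` | `jpe` |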

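Table 7-2: `jcc` Instructions for Unsigned Comparisons

| Instruction | Description | Condition | Aliases | Opposite |
|---|---|---|---|---|
| `ja` | Jump if above (`>`) | Carry = 0 and Zero = 0 | `jnbe` | `jna` |
| `jnbe` | Jump if not below or equal (not `≤`) | Carry = 0 and Zero = 0 | `ja` | `jbe` |
| `jae` | Jump if above or equal (`≥`) | Carry = 0 | `jnc`, `jnb` | `jnae` |
| `jnb` | Jump if not below (not `<`) | Carry = 0 | `jnc`, `jae` | `jb` |
| `jb` | Jump if below (`<`) | Carry = 1 | `jc`, `jnae` | `jnb` |
| `jnae` | Jump if not above or equal (not `≥`) | Carry = 1 | `jc`, `jb` | `jae` |
| `jbe` | Jump if below or equal (`≤`) | Carry = 1 or Zero = 1 | `jna` | `jnbe` |
| `jna` | Jump if not above (not `>`) | Carry = 1 or Zero = 1 | `jbe` | `ja` |
| `je` | Jump if equal (`=`) | Zero = 1 | `jz` | `jne` |
| `jne` | Jump if not equal (`≠`) | Zero = 0 | `jnz` | `je` |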

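Table 7-3: `jcc` Instructions for Signed Comparisons

| Instruction | Description | Condition | Aliases | Opposite |
|---|---|---|---|---|
| `jg` | Jump if greater (`>`) | Sign = Overflow and Zero = 0 | `jnle` | `jng` |
| `jnle` | Jump if not less than or equal (not `≤`) | Sign = Overflow and Zero = 0 | `jg` | `jle` |
| `jge` | Jump if greater than or equal (`≥`) | Sign = Overflow | `jnl` | `jnge` |
| `jnl` | Jump if not less than (not `<`) | Sign = Overflow | `jge` | `jl` |
| `jl` | Jump if less than (`<`) | Sign ≠ Overflow | `jnge` | `jnl` |
| `jnge` | Jump if not greater or equal (not `≥`) | Sign ≠ Overflow | `jl` | `jge` |
| `jle` | Jump if less than or equal (`≤`) | Sign ≠ Overflow or Zero = 1 | `jng` | `jnle` |
| `jng` | Jump if not greater than (not `>`) | Sign ≠ Overflow or Zero = 1 | `jle` | `jg` |
| `je` | Jump if equal (`=`) | Zero = 1 | `jz` | `jne` |
| `jne` | Jump if not equal (`≠`) | Zero = 0 | `jnz` | `je` |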

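In many instances, you will need to generate the opposite of a specific branch instruction (examples appear later in this section). With only two exceptions, the opposite branch (N/No N) rule describes how to generate an opposite branch:

If the second letter of the `jcc` instruction is not an n, insert an n after the j. For example, je becomes jne, and jl becomes jnl.
If the second letter of the `jcc` instruction is an n, remove that n from the instruction. For example, jng becomes jg, and jne becomes je.

The two exceptions to this rule are jpe (jump if parity is even) and jpo (jump if parity is odd).^(2) However, you can use the aliases jp and jnp as synonyms for jpe and jpo, and the N/No N rule applies to jp and jnp.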
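
The N/No N rule is mechanical enough to express as a few lines of string handling. The following C sketch is purely illustrative: `opposite_jcc` is a made-up helper, it assumes a well-formed lowercase `jcc` mnemonic from Tables 7-1 through 7-3, and it does not cover instructions outside those tables.

```c
#include <stdio.h>
#include <string.h>

/* Writes the opposite of the mnemonic jcc into out (at least 6 bytes). */
void opposite_jcc(const char *jcc, char *out)
{
    /* The two exceptions to the rule. */
    if (strcmp(jcc, "jpe") == 0) { strcpy(out, "jpo"); return; }
    if (strcmp(jcc, "jpo") == 0) { strcpy(out, "jpe"); return; }

    if (jcc[1] == 'n')
        sprintf(out, "j%s", jcc + 2);   /* second letter is n: drop it   */
    else
        sprintf(out, "jn%s", jcc + 1);  /* otherwise insert n after the j */
}
```

Because jp and jnp are handled by the general rule, only jpe and jpo need special cases.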

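The x86-64 conditional jump instructions give you the ability to split program flow into one of two paths, depending on a certain condition. Suppose you want to increment the AX register if BX is equal to CX. You can accomplish this with the following code:

          cmp bx, cx
          jne SkipStmts
          inc ax
SkipStmts: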

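Instead of checking for equality directly and branching to code that handles that condition, the common approach is to use the opposite branch to skip over the instructions you want to execute if the condition is true. That is, if BX is not equal to CX, jump over the increment instruction. Always use the opposite branch (N/No N) rule given earlier to select the opposite branch.

You can also use the conditional jump instructions to synthesize loops. For example, the following code sequence reads a sequence of characters from the user and stores each character in successive elements of an array until the user presses ENTER (newline):

      mov edi, 0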
RdLnLoop:
      call getchar         ; Some function that reads a character
                           ; into the AL register
      mov Input[rdi], al   ; Store away the character
      inc rdi              ; Move on to the next character
      cmp al, nl           ; See if the user pressed ENTER
      jne RdLnLoop

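The conditional jump instructions only test the x86-64 flags; they do not affect any of them.

From an efficiency point of view, it's important to note that each conditional jump has two machine code encodings: a 2-byte form and a 6-byte form.

The 2-byte form consists of the `jcc` opcode followed by a 1-byte PC-relative displacement. The 1-byte displacement allows the instruction to transfer control to a target instruction within about ±127 bytes of the current instruction. Given that the average x86-64 instruction is probably 4 to 5 bytes long, the 2-byte form of `jcc` can branch to a target within about 20 to 25 instructions.

Because a range of 20 to 25 instructions is insufficient for all conditional jumps, the x86-64 provides a second (6-byte) form with a 2-byte opcode and a 4-byte displacement. The 6-byte form lets you jump to an instruction within approximately ±2GB of the current instruction, which is sufficient for almost any reasonable program.

If you have the opportunity to branch to a nearby label rather than one that is far away (and still achieve the same result), branching to the nearby label will make your code shorter and possibly faster.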
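
To make the range limits concrete, here is a small C sketch of the decision an assembler has to make. The function name `jcc_length` is hypothetical. It relies on the fact that the displacement is measured from the end of the jump instruction, and it assumes the target is within reach of the 6-byte form.

```c
#include <stdint.h>

/* Returns 2 if the short (1-byte displacement) form can reach the target
   from a jcc located at jcc_addr; otherwise returns 6. */
int jcc_length(int64_t jcc_addr, int64_t target_addr)
{
    /* The short form is 2 bytes long, so the displacement is relative
       to jcc_addr + 2 (the address of the next instruction). */
    int64_t disp = target_addr - (jcc_addr + 2);

    return (disp >= -128 && disp <= 127) ? 2 : 6;
}
```

For a jump at address 1000, for example, any target from 874 through 1129 fits in the 2-byte form.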

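7.4 Trampolines

In the rare case you need to branch to a location beyond the range of the 6-byte `jcc` instructions, you can use an instruction sequence such as the following:

        jn`cc`  skipJmp  ; Opposite jump of the one you want to use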
        jmp   destPtr  ; JMP PC-relative is also limited to ±2GB
destPtr qword destLbl  ; so code must use indirect jump
skipJmp:

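The opposite conditional branch transfers control to the normal fall-through point in the code (the code you'd normally fall through to if the condition is false). If the condition is true, control transfers to a memory-indirect jump that jumps to the original target location via a 64-bit pointer.

This sequence is known as a trampoline, because a program jumps to this point to jump even further in the program (much as jumping on a trampoline lets you jump higher and higher). Trampolines are useful for call and unconditional jump instructions that use the PC-relative addressing mode (and are therefore limited to a ±2GB range around the current instruction).

You'll rarely use trampolines to transfer control to another location within your program. However, trampolines are useful when transferring control to a dynamically linked library or OS subroutine that could be far away in memory.

7.5 Conditional Move Instructions

Sometimes all you need to do after a comparison or other conditional test is to load a value into a register (and, conversely, not load that value if the test or comparison fails). Because branches can be somewhat expensive to execute, the x86-64 CPUs support a set of conditional move instructions, `cmovcc`. These instructions appear in Tables 7-4, 7-5, and 7-6; the generic syntax for these instructions is as follows:

cmov`cc` `reg16`, `reg16`
cmov`cc` `reg16`, `mem16`
cmov`cc` `reg32`, `reg32`
cmov`cc` `reg32`, `mem32`
cmov`cc` `reg64`, `reg64`
cmov`cc` `reg64`, `mem64`

The destination is always a general-purpose register (16, 32, or 64 bits). You can use these instructions only to load a register from memory or copy data from one register to another; you cannot use them to conditionally store data to memory.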
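
As a rough illustration of where a conditional move fits, consider selecting the larger of two values. The C sketch below (the name `max_i32` is invented for this example) is written in the "load a default, then conditionally replace it" shape that maps onto a cmp/cmovg pair; the register choices in the comments are assumptions, and whether a given compiler actually emits a conditional move is up to that compiler.

```c
#include <stdint.h>

/* Returns the larger of a and b (signed comparison). */
int32_t max_i32(int32_t a, int32_t b)
{
    int32_t result = b;     /* mov   eax, b   (assume a is in ECX)  */
    if (a > b)              /* cmp   ecx, eax                       */
        result = a;         /* cmovg eax, ecx                       */
    return result;
}
```

Note that the selection happens in a register; storing `result` to memory afterward is an ordinary, unconditional store.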

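Table 7-4: `cmovcc` Instructions That Test Flags

| Instruction | Description | Condition | Aliases |
|---|---|---|---|
| `cmovc` | Move if carry | Carry = 1 | `cmovb`, `cmovnae` |
| `cmovnc` | Move if no carry | Carry = 0 | `cmovnb`, `cmovae` |
| `cmovz` | Move if zero | Zero = 1 | `cmove` |
| `cmovnz` | Move if not zero | Zero = 0 | `cmovne` |
| `cmovs` | Move if sign | Sign = 1 | |
| `cmovns` | Move if no sign | Sign = 0 | |
| `cmovo` | Move if overflow | Overflow = 1 | |
| `cmovno` | Move if no overflow | Overflow = 0 | |
| `cmovp` | Move if parity | Parity = 1 | `cmovpe` |
| `cmovpe` | Move if parity even | Parity = 1 | `cmovp` |
| `cmovnp` | Move if no parity | Parity = 0 | `cmovpo` |
| `cmovpo` | Move if parity odd | Parity = 0 | `cmovnp` |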

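Table 7-5: `cmovcc` Instructions for Unsigned Comparisons

| Instruction | Description | Condition | Aliases |
|---|---|---|---|
| `cmova` | Move if above (`>`) | Carry = 0 and Zero = 0 | `cmovnbe` |
| `cmovnbe` | Move if not below or equal (not `≤`) | Carry = 0 and Zero = 0 | `cmova` |
| `cmovae` | Move if above or equal (`≥`) | Carry = 0 | `cmovnc`, `cmovnb` |
| `cmovnb` | Move if not below (not `<`) | Carry = 0 | `cmovnc`, `cmovae` |
| `cmovb` | Move if below (`<`) | Carry = 1 | `cmovc`, `cmovnae` |
| `cmovnae` | Move if not above or equal (not `≥`) | Carry = 1 | `cmovc`, `cmovb` |
| `cmovbe` | Move if below or equal (`≤`) | Carry = 1 or Zero = 1 | `cmovna` |
| `cmovna` | Move if not above (not `>`) | Carry = 1 or Zero = 1 | `cmovbe` |
| `cmove` | Move if equal (`=`) | Zero = 1 | `cmovz` |
| `cmovne` | Move if not equal (`≠`) | Zero = 0 | `cmovnz` |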

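Table 7-6: `cmovcc` Instructions for Signed Comparisons

| Instruction | Description | Condition | Aliases |
|---|---|---|---|
| `cmovg` | Move if greater (`>`) | Sign = Overflow and Zero = 0 | `cmovnle` |
| `cmovnle` | Move if not less than or equal (not `≤`) | Sign = Overflow and Zero = 0 | `cmovg` |
| `cmovge` | Move if greater than or equal (`≥`) | Sign = Overflow | `cmovnl` |
| `cmovnl` | Move if not less than (not `<`) | Sign = Overflow | `cmovge` |
| `cmovl` | Move if less than (`<`) | Sign ≠ Overflow | `cmovnge` |
| `cmovnge` | Move if not greater or equal (not `≥`) | Sign ≠ Overflow | `cmovl` |
| `cmovle` | Move if less than or equal (`≤`) | Sign ≠ Overflow or Zero = 1 | `cmovng` |
| `cmovng` | Move if not greater than (not `>`) | Sign ≠ Overflow or Zero = 1 | `cmovle` |
| `cmove` | Move if equal (`=`) | Zero = 1 | `cmovz` |
| `cmovne` | Move if not equal (`≠`) | Zero = 0 | `cmovnz` |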

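In addition, a set of conditional floating-point move instructions (`fcmovcc`) will move data between ST0 and one of the other FPU registers on the FPU stack. Sadly, these instructions aren't all that useful in modern programs. See the Intel documentation for more details if you're interested in using them.

7.6 Implementing Common Control Structures in Assembly Language

This section shows you how to implement decisions, loops, and other control constructs by using pure assembly language.

7.6.1 Decisions

In its most basic form, a decision is a branch within the code that switches between two possible execution paths based on a certain condition. Normally (though not always), conditional instruction sequences are implemented with the conditional jump instructions. Conditional instructions correspond to the if/then/endif statement in an HLL: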

if(`expression`) then
    `Statements`
endif;

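To convert this to assembly language, you must write statements that evaluate the expression and then branch around the statements if the result is false. For example, if you had the C statement

if(a == b)
{
    printf("a is equal to b\n");
}

you could translate it to assembly as follows:

      mov  eax, a           ; Assume a and b are 32-bit integers
      cmp  eax, b
      jne  aNEb
      lea  rcx, aIsEqlBstr  ; "a is equal to b\n"
      call printf
aNEb: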

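Generally, conditional statements fall into three basic categories: if statements, switch/case statements, and indirect jumps. The following sections describe these program structures, how to use them, and how to write them in assembly language.

7.6.2 `if/then/else` Sequences

The most common conditional statements are the if/then/endif and if/then/else/endif statements. These two statements take the forms shown in Figure 7-1.

Figure 7-1: if/then/else/endif and if/then/endif statement flow

The if/then/endif statement is just a special case of the if/then/else/endif statement (with an empty else block). The basic implementation of an if/then/else/endif statement in x86-64 assembly language looks something like this:

          `Sequence of statements to test a condition`
          j`cc` ElseCode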

 `Sequence of statements corresponding to the THEN block`
          jmp EndOfIf

ElseCode: 
 `Sequence of statements corresponding to the ELSE block`

EndOfIf:

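where j`cc` represents a conditional jump instruction.

For example, to convert the C/C++ statement

if(a == b)
    c = d;
else 
    b = b + 1;

to assembly language, you could use the following x86-64 code:

          mov eax, a
          cmp eax, b
          jne ElseBlk
          mov eax, d
          mov c, eax
          jmp EndOfIf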

ElseBlk:
          inc b

EndOfIf:

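For simple expressions like (a == b), generating the proper code for an if/then/else/endif statement is almost trivial. Should the expression become more complex, the code complexity increases as well. Consider the following C/C++ if statement: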

if(((x > y) && (z < t)) || (a != b))
    c = d;

To convert a complex if statement such as this one, break it into a sequence of three if statements as follows:

if(a != b) c = d;
else if(x > y)
     if(z < t)
           c = d;

This conversion comes from the following C/C++ equivalences:

if(`expr1` && `expr2`) `Stmt`;

is equivalent to

if(`expr1`) if(`expr2`) `Stmt`;

and

if(`expr1` || `expr2`) `Stmt`;

is equivalent to

if(`expr1`) `Stmt`;
else if(`expr2`) `Stmt`;

In assembly language, the compound if statement given earlier becomes

; if(((x > y) && (z < t)) || (a != b)) c = d;

          mov eax, a
          cmp eax, b
          jne DoIf
          mov eax, x
          cmp eax, y
          jng EndOfIf
          mov eax, z
          cmp eax, t
          jnl EndOfIf
DoIf:
          mov eax, d
          mov c, eax
EndOfIf:
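
One way to convince yourself that this branch sequence matches the original expression is to write both forms in C and compare them. The sketch below is an illustration only (all three function names are made up): `assign_goto` makes the same decisions in the same order as the assembly, using goto in place of the conditional jumps, and `count_mismatches` compares it against the direct form for every combination of 0/1 inputs.

```c
/* The compound test written directly. Returns the final value of c. */
int assign_direct(int x, int y, int z, int t, int a, int b, int c, int d)
{
    if (((x > y) && (z < t)) || (a != b))
        c = d;
    return c;
}

/* The same decisions in the order the assembly code makes them. */
int assign_goto(int x, int y, int z, int t, int a, int b, int c, int d)
{
    if (a != b)   goto DoIf;        /* jne DoIf    */
    if (!(x > y)) goto EndOfIf;     /* jng EndOfIf */
    if (!(z < t)) goto EndOfIf;     /* jnl EndOfIf */
DoIf:
    c = d;
EndOfIf:
    return c;
}

/* Counts the input combinations on which the two forms disagree. */
int count_mismatches(void)
{
    int bad = 0;
    for (int bits = 0; bits < 64; ++bits) {
        int x = bits & 1,        y = (bits >> 1) & 1, z = (bits >> 2) & 1;
        int t = (bits >> 3) & 1, a = (bits >> 4) & 1, b = (bits >> 5) & 1;
        if (assign_direct(x, y, z, t, a, b, 0, 1) !=
            assign_goto(x, y, z, t, a, b, 0, 1))
            ++bad;
    }
    return bad;
}
```

Each `if (!(...)) goto` line corresponds to one opposite branch in the assembly version.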

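Probably the biggest problem with complex conditional statements in assembly language is trying to figure out what you've done after you've written the code. High-level language expressions are much easier to read and comprehend. Well-written comments are essential for clear assembly language implementations of if/then/else/endif statements. An elegant implementation of the preceding example follows: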

; if ((x > y) && (z < t)) or (a != b)  c = d;
; Implemented as: 
; if (a != b) then goto DoIf: 

          mov eax, a
          cmp eax, b
          jne DoIf

; if not (x > y) then goto EndOfIf:

          mov eax, x
          cmp eax, y
          jng EndOfIf

; if not (z < t) then goto EndOfIf:

          mov eax, z
          cmp eax, t
          jnl EndOfIf

; THEN block:

DoIf:     
          mov eax, d
          mov c, eax

; End of IF statement.

EndOfIf:

Admittedly, this goes overboard for such a simple example. The following would probably suffice:

; if (((x > y) && (z < t)) || (a != b))  c = d;
; Test the Boolean expression:

          mov eax, a
          cmp eax, b
          jne DoIf
          mov eax, x
          cmp eax, y
          jng EndOfIf
          mov eax, z
          cmp eax, t
          jnl EndOfIf

; THEN block:

DoIf:
          mov eax, d
          mov c, eax

; End of IF statement.

EndOfIf:

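However, as your if statements become complex, the density (and quality) of your comments becomes more and more important.

7.6.3 Complex if Statements Using Complete Boolean Evaluation

Many Boolean expressions involve conjunction (and) or disjunction (or) operations. This section describes how to convert such Boolean expressions into assembly language. You can do this in two ways: using complete Boolean evaluation or using short-circuit Boolean evaluation. This section discusses complete Boolean evaluation; the next section discusses short-circuit Boolean evaluation.

Conversion via complete Boolean evaluation is almost identical to converting arithmetic expressions into assembly language, as covered in Chapter 6. However, for Boolean evaluation, you do not need to store the result in a variable; once the evaluation of the expression is complete, you check whether you have a false (0) or true (1, or nonzero) result and take whatever action the Boolean expression dictates. Usually, the last logical instruction (and/or) sets the zero flag if the result is false and clears the zero flag if the result is true, so you don't have to explicitly test the result. Consider the following if statement and its conversion to assembly language using complete Boolean evaluation:

;     if(((x < y) && (z > t)) || (a != b))
;          `Stmt1`

          mov   eax, x
          cmp   eax, y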
          setl  bl        ; Store x < y in BL
          mov   eax, z
          cmp   eax, t
          setg  bh        ; Store z > t in BH
          and   bl, bh    ; Put (x < y) && (z > t) into BL
          mov   eax, a
          cmp   eax, b
          setne bh        ; Store a != b into BH
          or    bl, bh    ; Put (x < y) && (z > t) || (a != b) into BL
          je    SkipStmt1 ; Branch if result is false

 `Code for Stmt1 goes here`

SkipStmt1:

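This code computes a Boolean result in the BL register and then, at the end of the computation, tests this value to see whether it contains true or false. If the result is false, this sequence skips over the code associated with Stmt1. The important thing to note in this example is that the program will execute every instruction that computes this Boolean result (up to the je instruction).

7.6.4 Short-Circuit Boolean Evaluation

If you are willing to expend a little more effort, you can usually convert a Boolean expression to a much shorter and faster sequence of assembly language instructions by using short-circuit Boolean evaluation. This approach attempts to determine whether an expression is true or false by executing only some of the instructions that would compute the complete expression.

Consider the expression a && b. Once you determine that a is false, there is no need to evaluate b, because there is no way the expression can be true. If b represents a complex subexpression rather than a single Boolean variable, it should be clear that evaluating only a is more efficient.

As a concrete example, consider the subexpression ((x < y) && (z > t)) from the previous section. Once you determine that x is not less than y, there is no need to check whether z is greater than t, because the expression will be false regardless of z's and t's values. The following code fragment shows how you can implement short-circuit Boolean evaluation for this expression: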

; if((x < y) && (z > t)) then ...

          mov eax, x
          cmp eax, y
          jnl TestFails
          mov eax, z
          cmp eax, t
          jng TestFails

 `Code for THEN clause of IF statement`

TestFails:

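The code skips any further testing once it determines that x is not less than y. Of course, if x is less than y, the program has to test z to see if it is greater than t; if not, the program skips over the then clause. Only if the program satisfies both conditions does the code fall through to the then clause.

For the logical or operation, the technique is similar. If the first subexpression evaluates to true, there is no need to test the second operand. Whatever the second operand's value is at that point, the full expression still evaluates to true. The following example demonstrates the use of short-circuit evaluation with disjunction (or):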

; if(ch < 'A' || ch > 'Z')
;     then printf("Not an uppercase char");
; endif;

          cmp ch, 'A'
          jb ItsNotUC
          cmp ch, 'Z'
          jna ItWasUC

ItsNotUC:
 `Code to process ch if it's not an uppercase character`

ItWasUC:

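Because the conjunction and disjunction operators are commutative, you can evaluate the left or right operand first if it is more convenient to do so.^(3) As one last example in this section, consider the full Boolean expression from the previous section: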

; if(((x < y) && (z > t)) || (a != b)) `Stmt1` ;

          mov eax, a
          cmp eax, b
          jne DoStmt1
          mov eax, x
          cmp eax, y
          jnl SkipStmt1
          mov eax, z
          cmp eax, t
          jng SkipStmt1

DoStmt1:
 `Code for Stmt1 goes here`

SkipStmt1:

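The code in this example evaluates a != b first, because it is shorter and faster,^(4) and evaluates the remaining subexpression last. This is a common technique assembly language programmers use to write more efficient code.^(5)

7.6.5 Short-Circuit vs. Complete Boolean Evaluation

When using complete Boolean evaluation, every statement in the sequence for that expression will execute; short-circuit Boolean evaluation, on the other hand, may not require the execution of every statement associated with the Boolean expression. As you've seen in the previous two sections, code based on short-circuit evaluation is usually shorter and faster.

However, short-circuit Boolean evaluation may not produce the correct result in some cases. Given an expression with side effects, short-circuit Boolean evaluation will produce a different result than complete Boolean evaluation. Consider the following C/C++ example:

if((x == y) && (++z != 0)) `Stmt` ;

Using complete Boolean evaluation, you might generate the following code:

          mov   eax, x      ; See if x == y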
          cmp   eax, y
          sete  bl 
          inc   z           ; ++z
          cmp   z, 0        ; See if incremented z is 0
          setne bh
          and   bl, bh      ; Test x == y && ++z != 0
          jz    SkipStmt

 `Code for Stmt goes here`

SkipStmt:

Using short-circuit Boolean evaluation, you might generate this code:

          mov eax, x      ; See if x == y
          cmp eax, y
          jne SkipStmt
          inc z           ; ++z - sets ZF if z becomes zero
          je  SkipStmt    ; See if incremented z is 0

 `Code for Stmt goes here`

SkipStmt:

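Notice a subtle but important difference between these two conversions: if x is not equal to y, the first version still increments z and compares it to 0 before skipping the code associated with Stmt; the short-circuit version, on the other hand, skips the code that increments z when x is not equal to y. Therefore, the behavior of these two code fragments differs whenever x is not equal to y.

Neither implementation is particularly wrong; depending on the circumstances, you may or may not want the code to increment z when x is not equal to y. However, it is important to realize that these two schemes produce different results, so you can choose an appropriate implementation if the effect of this code on z matters to your program.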
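
The difference is easy to observe from C. In this sketch (the function names are invented for illustration), `complete_eval` computes both operands before combining them with the bitwise `&`, much as the sete/setne/and sequence does, while `short_circuit` uses the ordinary `&&` operator. The only observable difference is what happens to z when x differs from y.

```c
/* Complete evaluation: both operands are always computed. */
int complete_eval(int x, int y, int *z)
{
    int lhs = (x == y);         /* sete  bl           */
    int rhs = (++*z != 0);      /* inc z / setne bh   */
    return lhs & rhs;           /* and   bl, bh       */
}

/* Short-circuit evaluation: ++*z runs only when x == y. */
int short_circuit(int x, int y, int *z)
{
    return (x == y) && (++*z != 0);
}
```

Starting from z = 5 with x = 1 and y = 2, both functions report false, but `complete_eval` leaves z at 6 and `short_circuit` leaves it at 5.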

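Many programs take advantage of short-circuit Boolean evaluation and rely on the program not evaluating certain components of the expression. The following C/C++ code fragment demonstrates perhaps the most common example that requires short-circuit Boolean evaluation: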

if(pntr != NULL && *pntr == 'a')  `Stmt` ;

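If it turns out that pntr is NULL, the expression is false, and there is no need to evaluate the remainder of the expression. This statement relies on short-circuit Boolean evaluation for correct operation. Were C/C++ to use complete Boolean evaluation, the second half of the expression would attempt to dereference a NULL pointer whenever pntr is NULL.

Consider the translation of this statement using complete Boolean evaluation: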

; Complete Boolean evaluation:

          mov   rax, pntr
          test  rax, rax   ; Check to see if RAX is 0 (NULL is 0)
          setne bl
          mov   al, [rax]  ; Get *pntr into AL
          cmp   al, 'a'
          sete  bh
          and   bl, bh
          jz    SkipStmt

 `Code for Stmt goes here`

SkipStmt:

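If pntr contains NULL (0), this program will attempt to access the data at location 0 in memory via the mov al, [rax] instruction. Under most operating systems, this will cause a memory access fault (general protection fault).

Now consider the short-circuit Boolean conversion: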

; Short-circuit Boolean evaluation:

      mov  rax, pntr   ; See if pntr contains NULL (0) and
      test rax, rax    ; immediately skip past Stmt if this
      jz   SkipStmt    ; is the case

      mov  al, [rax]   ; If we get to this point, pntr contains
      cmp  al, 'a'     ; a non-NULL value, so see if it points
      jne  SkipStmt    ; at the character "a"

 `Code for Stmt goes here`

SkipStmt:

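In this example, the problem with dereferencing the NULL pointer doesn't exist. If pntr is NULL, this code skips over the statements that attempt to access the memory address pntr contains.

7.6.6 Efficient Implementation of `if` Statements in Assembly Language

Encoding if statements efficiently in assembly language takes a bit more thought than simply choosing short-circuit evaluation over complete Boolean evaluation. To write code that executes as quickly as possible in assembly language, you must carefully analyze the situation and generate the code appropriately. The following paragraphs provide some suggestions you can apply to your programs to improve their performance.

7.6.6.1 Know Your Data!

Programmers often mistakenly assume that data is random. In reality, data is rarely random, and if you know the types of values that your program commonly uses, you can write better code. To see how, consider the following C/C++ statement:

if((a == b) && (c < d)) ++i;

Because C/C++ uses short-circuit evaluation, this code will first test whether a is equal to b. If so, it will test whether c is less than d. If you expect a to be equal to b most of the time but don't expect c to be less than d most of the time, this statement will execute slower than it should. Consider the following MASM implementation of this code:

          mov eax, a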
          cmp eax, b
          jne DontIncI

          mov eax, c
          cmp eax, d
          jnl DontIncI

          inc i

DontIncI:

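As you can see, if a is equal to b most of the time and c is not less than d most of the time, you will have to execute all six instructions nearly every time to determine that the expression is false. Now consider the following implementation, which takes advantage of this knowledge and the fact that the && operator is commutative:

          mov eax, c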
          cmp eax, d
          jnl DontIncI

          mov eax, a
          cmp eax, b
          jne DontIncI

          inc i

DontIncI:

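The code first checks whether c is less than d. If most of the time c is not less than d, this code determines that it has to skip to the label DontIncI after executing only three instructions in the typical case (compared with six instructions in the previous example).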
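
The effect of the ordering can be counted directly. In the following C sketch (both function names are hypothetical), each function performs the increment exactly when (a == b) && (c < d) holds and returns the number of comparisons it had to execute to reach its decision.

```c
/* Tests a == b first, then c < d. Returns the comparisons executed. */
int test_a_first(int a, int b, int c, int d, int *i)
{
    if (a != b) return 1;
    if (c >= d) return 2;
    ++*i;
    return 2;
}

/* Tests c < d first, then a == b. Returns the comparisons executed. */
int test_c_first(int a, int b, int c, int d, int *i)
{
    if (c >= d) return 1;
    if (a != b) return 2;
    ++*i;
    return 2;
}
```

For the typical data described above (a equal to b, c not less than d), `test_a_first` needs two comparisons and `test_c_first` needs only one, while both leave i unchanged.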

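This fact is much more obvious in assembly language than in a high-level language, which is one of the main reasons assembly programs are often faster than their HLL counterparts: optimizations are more obvious in assembly language than in a high-level language. Of course, the key here is to understand the behavior of your data so you can make intelligent decisions such as the preceding one.

7.6.6.2 Rearranging Expressions

Even if your data is random (or you can't determine how the input values will affect your decisions), rearranging the terms in your expressions may still be beneficial. Some calculations take far longer to compute than others. For example, the div instruction is much slower than a simple cmp instruction. Therefore, if you have a statement like the following, you may want to rearrange the expression so that the cmp comes first:

if((x % 10 == 0) && (x != y)) ++x;

Converted to assembly code, this if statement becomes the following:

          mov  eax, x        ; Compute X % 10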
          cdq                ; Must sign-extend EAX -> EDX:EAX
          idiv ten           ; "ten dword 10" in .const section
          test edx, edx      ; Remainder is in EDX, test for 0
          jnz  SkipIf

          mov  eax, x
          cmp  eax, y
          je   SkipIf

          inc  x

SkipIf:

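The idiv instruction is expensive (often 50 to 100 times slower than most of the other instructions in this example). Unless it is 50 to 100 times more likely that the remainder is 0 than that x is equal to y, it would be better to do the comparison first and the remainder calculation afterward:

          mov  eax, x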
          cmp  eax, y
          je   SkipIf

          mov  eax, x     ; Compute X % 10
          cdq             ; Must sign-extend EAX -> EDX:EAX
          idiv ten        ; "ten dword 10" in .const section
          test edx, edx   ; See if remainder (EDX) is 0
          jnz  SkipIf

          inc  x

SkipIf:

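Because the && and || operators are not commutative when short-circuit evaluation occurs, consider such transformations carefully before making them. This example works fine because there are no side effects or possible exceptions being shielded by the reordered evaluation of the && operator.

7.6.6.3 Destructuring Your Code

Structured code is sometimes less efficient than unstructured code because it introduces code duplication or extra branches that might not be present in unstructured code.^(6) Most of the time, this is tolerable because unstructured code is difficult to read and maintain; sacrificing some performance in exchange for maintainable code is often acceptable. In certain instances, however, you may need all the performance you can get and might choose to compromise the readability of your code.

Taking previously written structured code and rewriting it in an unstructured fashion to improve performance is known as destructuring code. The difference between unstructured code and destructured code is that unstructured code was written that way in the first place; destructured code started out as structured code and was purposefully rewritten in an unstructured fashion to make it more efficient. Pure unstructured code is usually hard to read and maintain. Destructured code isn't quite as bad because you limit the damage (unstructuring the code) to only those sections where it is absolutely necessary.

One classic way to destructure code is to use code movement (physically moving sections of code elsewhere in the program) to move code that your program rarely uses out of the way of code that executes most of the time. Code movement can improve the efficiency of a program in two ways.

First, a branch that is taken is more expensive (time-consuming) than a branch that is not taken.^(7) If you move the rarely used code to another spot in the program and branch to it on the rare occasion it is needed, most of the time you will fall straight through to the code that executes most frequently.

Second, sequential machine instructions consume cache storage. If you move rarely executed statements out of the normal code stream to another section of the program (one that is rarely loaded into cache), this will improve the cache performance of the system.

For example, consider the following pseudo C/C++ statement: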

if(!`see_if_an_error_has_occurred`)
{
 `Statements to execute if no error`
}
else
{
 `Error-handling statements`
}

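In normal code, you don't expect errors to be frequent. Therefore, you would normally expect the then section of the preceding if to execute far more often than the else clause. The preceding code could translate into the following assembly code:

     cmp `see_if_an_error_has_occurred`, true
     je HandleTheError

     `Statements to execute if no error`

     jmp EndOfIf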

HandleTheError:
 `Error-handling statements`
EndOfIf:

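If no error has occurred, this code falls through to the normal statements and then jumps over the error-handling statements. Instructions that transfer control from one point in your program to another (for example, jmp instructions) tend to be slow. It is much faster to execute a sequential set of instructions than to jump all over the place in your program. Unfortunately, the preceding code doesn't allow this.

One way to rectify this problem is to move the else clause of the code somewhere else in your program. You could rewrite the code as follows:

     cmp `see_if_an_error_has_occurred`, true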
     je HandleTheError

 `Statements to execute if no error`

EndOfIf:

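Then, at some other point in your program (typically after a jmp instruction), you would insert the following code:

HandleTheError:
     `Error-handling statements`
     jmp EndOfIf

The program isn't any shorter. The jmp you removed from the original sequence winds up at the end of the else clause. However, because the else clause rarely executes, moving the jmp instruction from the then clause (which executes frequently) to the else clause is a big performance win, because the then clause now executes using only straight-line code. This technique is surprisingly effective in many time-critical code segments.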
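
The same layout can be imitated in C with goto, which makes the control flow of the destructured version easy to see. This is a sketch under assumptions: `process` and its doubling "work" are placeholders, and a C compiler is free to rearrange the generated code regardless of how the source is laid out.

```c
/* Returns value * 2 on the common path and -1 on the rare error path. */
int process(int value, int error_occurred)
{
    int result;

    if (error_occurred)
        goto HandleTheError;    /* rarely taken branch              */

    result = value * 2;         /* hot path: straight-line code     */

EndOfIf:
    return result;

HandleTheError:                 /* cold code, kept out of the way   */
    result = -1;
    goto EndOfIf;               /* the jmp now lives in the rare path */
}
```

Notice that the common path executes no jump at all; only the error path pays for two transfers of control.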

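7.6.6.4 Calculation Rather Than Branching

On many processors in the x86-64 family, branches (jumps) are expensive compared to many other instructions. For this reason, it is sometimes better to execute more instructions in a sequence than fewer instructions that involve branching.

For example, consider the simple assignment eax = abs(eax). Unfortunately, no x86-64 instruction computes the absolute value of an integer. The obvious way to handle this is with an instruction sequence that uses a conditional jump to skip over the neg instruction (which creates a positive value in EAX if EAX was negative):

          test eax, eax
          jns ItsPositive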

          neg eax

ItsPositive:

Now consider the following sequence, which will also do the job:

; Set EDX to 0FFFF_FFFFh if EAX is negative, 0000_0000 if EAX is
; 0 or positive:

          cdq

; If EAX was negative, the following code inverts all the bits in
; EAX; otherwise, it has no effect on EAX.

          xor eax, edx

; If EAX was negative, the following code adds 1 to EAX;
; otherwise, it doesn't modify EAX's value.

          and edx, 1   ; EDX = 0 or 1 (1 if EAX was negative)
          add eax, edx

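This code inverts all the bits in EAX and then adds 1 to EAX if EAX was negative prior to the sequence; that is, it negates the value in EAX. If EAX was zero or positive, this code does not change the value in EAX.

Though this sequence takes four instructions rather than the three that the previous example requires, there are no transfer-of-control instructions, so it may execute faster on many CPUs in the x86-64 family. Of course, if you use the cmovns instruction presented earlier, this can be done with the following three instructions (again with no transfer of control):

mov    edx, eax
neg    edx
cmovns eax, edx

This demonstrates why it's good to know the instruction set!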
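
The cdq/xor/and/add sequence can be traced step by step in C. The sketch below uses a made-up function name, `abs_branchless`, and does its arithmetic on unsigned 32-bit values so that every step is well defined; it returns the magnitude as an unsigned number, which means the most negative input yields 2147483648 rather than overflowing. Whether a compiler turns the `value < 0` test into branch-free code is not guaranteed.

```c
#include <stdint.h>

/* Mirrors cdq / xor / and / add. Returns |value| as an unsigned number. */
uint32_t abs_branchless(int32_t value)
{
    uint32_t eax = (uint32_t)value;
    uint32_t edx = 0u - (uint32_t)(value < 0);  /* cdq: all 1s if negative */

    eax ^= edx;     /* xor eax, edx : invert the bits if negative    */
    edx &= 1u;      /* and edx, 1   : 1 if negative, otherwise 0     */
    eax += edx;     /* add eax, edx : complete the two's complement  */
    return eax;
}
```

For example, `abs_branchless(-5)` inverts 0FFFF_FFFBh to 4 and then adds 1, giving 5.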

7.6.7 `switch/case` Statements

The C/C++ switch statement takes the following form:

      switch(`expression`)
      {
          case `const1`:
              `Stmts1: Code to execute if`
              `expression equals const1`

          case `const2`:
              `Stmts2: Code to execute if`
              `expression equals const2`
            .
            .
            .
          case `constn`:
              `Stmtsn: Code to execute if`
              `expression equals constn`

          default:  // Note that the default section is optional
              `Stmts_default: Code to execute if expression`
              `does not equal`
              `any of the case values`
      }

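When this statement executes, it checks the value of the expression against the constants const1 to constn. If it finds a match, the corresponding statements execute.

C/C++ places a few restrictions on the switch statement. First, the switch statement allows only an integer expression (or something whose underlying type can be an integer). Second, all the constants in the case clauses must be unique. The reason for these restrictions will become clear in a moment.

7.6.7.1 `switch` Statement Semantics

Most introductory programming texts introduce the switch/case statement by explaining it as a sequence of if/then/elseif/else/endif statements. They might claim that the following two pieces of C/C++ code are equivalent: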

switch(`expression`)
{
    case 0: printf("i=0"); break;
    case 1: printf("i=1"); break;
    case 2: printf("i=2"); break;
}

if(eax == 0)
    printf("i=0");
else if(eax == 1)
    printf("i=1");
else if(eax == 2)
    printf("i=2");

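While semantically these two code segments may be the same, their implementation is usually different. Whereas the if/then/elseif/else/endif chain does a comparison for each conditional statement in the sequence, the switch statement normally uses an indirect jump to transfer control to any one of several statements with a single computation.

7.6.7.2 `if/else` Implementation of `switch`

The switch (and if/else/elseif) statements could be written in assembly language with the following code: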

; if/then/else/endif form:

          mov eax, i
          test eax, eax   ; Check for 0
          jnz Not0

 `Code to print "i = 0"`
          jmp EndCase

Not0:
          cmp eax, 1
          jne Not1

 `Code to print "i = 1"`
          jmp EndCase

Not1:
          cmp eax, 2
          jne EndCase;

 `Code to print "i = 2"`
EndCase:

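Probably the only thing worth noting about this code is that it takes longer to determine that the last case should execute than it does to determine whether the first case executes. This is because the if/else/elseif version implements a linear search through the case values, checking them one at a time from first to last until it finds a match.

7.6.7.3 Indirect Jump `switch` Implementation

A faster implementation of the switch statement is possible using an indirect jump table. This implementation uses the switch expression as an index into a table of addresses; each address points at the target case's code to execute. Consider the following example: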

; Indirect Jump Version.

        mov eax, i
        lea rcx, JmpTbl
        jmp qword ptr [rcx][rax * 8]

JmpTbl  qword Stmt0, Stmt1, Stmt2

Stmt0:
 `Code to print "i = 0"`
        jmp EndCase;

Stmt1:
 `Code to print "i = 1"`
        jmp EndCase;

Stmt2:
 `Code to print "i = 2"`

EndCase:

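To begin with, a switch statement requires that you create an array of pointers, with each element containing the address of a statement label in your code (those labels must be attached to the sequence of instructions to execute for each case in the switch statement). In the preceding example, the JmpTbl array, initialized with the addresses of the statement labels Stmt0, Stmt1, and Stmt2, serves this purpose. I've placed this array in the procedure itself because the labels are local to the procedure. Note, however, that you must place the array in a location that will never be executed as code (such as immediately after a jmp instruction, as in this example).

The program loads the RAX register with the value of i (assuming i is a 32-bit integer, the mov instruction zero-extends EAX into RAX), then uses this value as an index into the JmpTbl array (RCX holds the base address of the JmpTbl array) and transfers control to the 8-byte address found at the specified location. For example, if RAX contains 0, the jmp [rcx][rax * 8] instruction will fetch the quad word at address JmpTbl+0 (RAX × 8 = 0). Because the first quad word in the table contains the address of Stmt0, the jmp instruction transfers control to the first instruction following the Stmt0 label. Likewise, if i (and therefore RAX) contains 1, the indirect jmp instruction fetches the quad word at offset 8 from the table and transfers control to the first instruction following the Stmt1 label (because the address of Stmt1 appears at offset 8 in the table). Finally, if i/RAX contains 2, this code fragment transfers control to the statements following the Stmt2 label, because its address appears at offset 16 in the JmpTbl table.

As you add more (consecutive) cases, the jump table implementation becomes more efficient (in terms of both space and speed) than the if/elseif form. Except for simple cases, the switch statement is almost always faster, usually by a large margin. As long as the case values are consecutive, the switch statement version is usually smaller as well.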

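7.6.7.4 Noncontiguous Jump Table Entries and Range Limiting

What happens if you need to include nonconsecutive case labels or cannot be sure that the switch value doesn't go out of range? With the C/C++ switch statement, such an occurrence transfers control to the first statement after the switch statement (or to a default case, if one is present in the switch).

However, this doesn't happen in the preceding example. If variable i does not contain 0, 1, or 2, executing the previous code produces undefined results. For example, if i contains 5 when you execute the code, the indirect jmp instruction will fetch the qword at offset 40 (5 × 8) from JmpTbl and transfer control to that address. Unfortunately, JmpTbl doesn't have six entries, so the program fetches whatever quad word happens to sit in the sixth quad-word slot counted from the start of JmpTbl and uses it as the target address, which will often crash your program or transfer control to an unexpected location.

The solution is to place a few instructions before the indirect jmp to verify that the switch selection value is within a reasonable range. In the previous example, we'd probably want to verify that the value of i is in the range 0 to 2 before executing the jmp instruction. If the value of i is outside this range, the program should simply jump to the EndCase label (this corresponds to dropping down to the first statement after the entire switch statement). The following code provides this modification: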

        mov eax, i
        cmp eax, 2
        ja  EndCase
        lea rcx, JmpTbl
        jmp qword ptr [rcx][rax * 8]

JmpTbl  qword Stmt0, Stmt1, Stmt2

Stmt0:
 `Code to print "i = 0"`
        jmp EndCase;

Stmt1:
 `Code to print "i = 1"`
        jmp EndCase;

Stmt2:
 `Code to print "i = 2"`

EndCase:
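A rough C analogue of this range-checked jump table may help make the idea concrete. It is only a sketch: the handler functions, the jmp_tbl array, and the dispatch function are names invented for this example, and an array of function pointers stands in for the table of code addresses.

```c
#include <stddef.h>

/* Hypothetical case handlers; each returns the text its case would print. */
static const char *stmt0(void) { return "i = 0"; }
static const char *stmt1(void) { return "i = 1"; }
static const char *stmt2(void) { return "i = 2"; }

typedef const char *(*handler_t)(void);

/* Plays the role of JmpTbl: one code address per case value. */
static const handler_t jmp_tbl[] = { stmt0, stmt1, stmt2 };

/* Returns the selected case's text, or NULL when i is out of range
   (the equivalent of jumping straight to EndCase). */
const char *dispatch(unsigned i)
{
    if (i > 2)              /* cmp eax, 2 / ja EndCase */
        return NULL;
    return jmp_tbl[i]();    /* indirect transfer through the table */
}
```

Treating the selector as unsigned mirrors the ja instruction: a negative value looks like a very large unsigned number, so a single comparison rejects values on both sides of the range.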

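Although the preceding example handles the problem of selection values being outside the range 0 to 2, it still suffers from a couple of severe restrictions:

The cases must start with the value 0. That is, the minimum case constant has to be 0 in this example.
The case values must be contiguous.

Solving the first problem is easy, and you deal with it in two steps. First, you compare the case selection value against a lower and upper bound before determining whether the case value is legal. For example: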

; SWITCH statement specifying cases 5, 6, and 7:
; WARNING: This code does *NOT* work.
; Keep reading to find out why.

     mov eax, i
     cmp eax, 5
     jb  EndCase
     cmp eax, 7              ; Verify that i is in the range
     ja  EndCase             ; 5 to 7 before the indirect jmp
     lea rcx, JmpTbl
     jmp qword ptr [rcx][rax * 8]

JmpTbl  qword Stmt5, Stmt6, Stmt7

Stmt5:
 `Code to print "i = 5"`
        jmp EndCase;

Stmt6:
 `Code to print "i = 6"`
        jmp EndCase;

Stmt7:
 `Code to print "i = 7"`

EndCase:

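This code adds a pair of extra instructions, cmp and jb, to test the selection value to ensure it is in the range 5 to 7. If not, control drops down to the EndCase label; otherwise, control transfers via the indirect jmp instruction. Unfortunately, as the comments point out, this code is broken.

Consider what happens if variable i contains the value 5: the code will verify that 5 is in the range 5 to 7 and then fetch the qword at offset 40 (5 × 8) and jump to that address. As before, however, this loads 8 bytes from beyond the bounds of the table and does not transfer control to a defined location. One solution is to subtract the smallest case selection value from EAX before executing the jmp instruction, as shown in the following example: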

; SWITCH statement specifying cases 5, 6, and 7.
; WARNING: There is a better way to do this; keep reading.

     mov eax, i
     cmp eax, 5
     jb  EndCase
     cmp eax, 7              ; Verify that i is in the range
     ja  EndCase             ; 5 to 7 before the indirect jmp
     sub eax, 5              ; 5 to 7 -> 0 to 2
     lea rcx, JmpTbl
     jmp qword ptr [rcx][rax * 8]

JmpTbl  qword Stmt5, Stmt6, Stmt7

Stmt5:
 `Code to print "i = 5"`
        jmp EndCase;

Stmt6:
 `Code to print "i = 6"`
        jmp EndCase;

Stmt7:
 `Code to print "i = 7"`

EndCase:

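By subtracting 5 from the value in EAX, we force EAX to take on the value 0, 1, or 2 prior to the jmp instruction. Therefore, a case-selection value of 5 jumps to Stmt5, a value of 6 transfers control to Stmt6, and a value of 7 jumps to Stmt7.

To improve this code, you can eliminate the sub instruction by merging it into the jmp instruction's address expression. The following code does this: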

; SWITCH statement specifying cases 5, 6, and 7:

     mov eax, i
     cmp eax, 5
     jb  EndCase
     cmp eax, 7                           ; Verify that i is in the range
     ja  EndCase                          ; 5 to 7 before the indirect jmp
     lea rcx, JmpTbl
     jmp qword ptr [rcx][rax * 8 - 5 * 8] ; 5 * 8 compensates for zero index

JmpTbl  qword Stmt5, Stmt6, Stmt7

Stmt5:
 `Code to print "i = 5"`
        jmp EndCase;

Stmt6:
 `Code to print "i = 6"`
        jmp EndCase

Stmt7:
 `Code to print "i = 7"`

EndCase:

The C/C++ switch statement provides a default clause that executes if the case-selection value doesn't match any of the case values. For example:

switch(`expression`)
{

    case 5:  printf("ebx = 5"); break;
    case 6:  printf("ebx = 6"); break;
    case 7:  printf("ebx = 7"); break;
    default:
        printf("ebx does not equal 5, 6, or 7");
}

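Implementing the equivalent of the default clause in pure assembly language is easy. Just use a different target label in the jb and ja instructions at the beginning of the code. The following example implements a MASM switch statement similar to the preceding one: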

; SWITCH statement specifying cases 5, 6, and 7
; with a DEFAULT clause:

     mov eax, i
     cmp eax, 5 
     jb  DefaultCase
     cmp eax, 7                           ; Verify that i is in the range
     ja  DefaultCase                      ; 5 to 7 before the indirect jmp
     lea rcx, JmpTbl
     jmp qword ptr [rcx][rax * 8 - 5 * 8] ; 5 * 8 compensates for zero index

JmpTbl  qword Stmt5, Stmt6, Stmt7

Stmt5:
 `Code to print "i = 5"`
        jmp EndCase

Stmt6:
 `Code to print "i = 6"`
        jmp EndCase

Stmt7:
 `Code to print "i = 7"`
        jmp EndCase

DefaultCase:
 `Code to print "EBX does not equal 5, 6, or 7"`

EndCase:

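The second restriction noted earlier (that is, the case values need to be contiguous) is easy to handle by inserting extra entries into the jump table. Consider the following C/C++ switch statement: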

switch(i)
{
    case 1:  printf("i = 1"); break;
    case 2:  printf("i = 2"); break;
    case 4:  printf("i = 4"); break;
    case 8:  printf("i = 8"); break;
    default:
        printf("i is not 1, 2, 4, or 8");
}

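The minimum switch value is 1, and the maximum value is 8. Therefore, the code before the indirect jmp instruction needs to compare the value in i against 1 and 8. If the value is between 1 and 8, it's still possible that i might not contain a legal case-selection value. However, because the jmp instruction indexes into a table of quad words using the case-selection value, the table must have eight quad-word entries.

To handle the values between 1 and 8 that are not case-selection values, simply put the statement label of the default clause (or the label of the first instruction after the switch if there is no default clause) in each of the jump table entries that don't have a corresponding case clause. The following code demonstrates this technique: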

; SWITCH statement specifying cases 1, 2, 4, and 8
; with a DEFAULT clause:

     mov eax, i
     cmp eax, 1
     jb  DefaultCase
     cmp eax, 8                           ; Verify that i is in the range
     ja  DefaultCase                      ; 1 to 8 before the indirect jmp
     lea rcx, JmpTbl
     jmp qword ptr [rcx][rax * 8 - 1 * 8] ; 1 * 8 compensates for zero index

JmpTbl  qword Stmt1, Stmt2, DefaultCase, Stmt4
        qword DefaultCase, DefaultCase, DefaultCase, Stmt8

Stmt1:
 `Code to print "i = 1"`
        jmp EndCase

Stmt2:
 `Code to print "i = 2"`
        jmp EndCase

Stmt4:
 `Code to print "i = 4"`
        jmp EndCase

Stmt8:
 `Code to print "i = 8"`
        jmp EndCase

DefaultCase:
 `Code to print "i does not equal 1, 2, 4, or 8"`

EndCase:
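The same two ideas (biasing the index by the smallest case value and filling the holes with the default target) can be sketched in C. The handler names, the LOW/HIGH constants, and switch_1_2_4_8 are all invented for this illustration; each handler simply returns the case value it stands for, and the default handler returns -1.

```c
/* Hypothetical handlers for cases 1, 2, 4, and 8, plus a default. */
static int on_1(void)       { return 1;  }
static int on_2(void)       { return 2;  }
static int on_4(void)       { return 4;  }
static int on_8(void)       { return 8;  }
static int on_default(void) { return -1; }

typedef int (*handler_t)(void);

enum { LOW = 1, HIGH = 8 };

/* One entry per value in LOW..HIGH; entries without a case clause
   point at the default handler, just like the DefaultCase fillers. */
static const handler_t jmp_tbl[HIGH - LOW + 1] = {
    on_1,       on_2,       on_default, on_4,
    on_default, on_default, on_default, on_8
};

int switch_1_2_4_8(int i)
{
    if (i < LOW || i > HIGH)      /* jb DefaultCase / ja DefaultCase */
        return on_default();
    return jmp_tbl[i - LOW]();    /* index biased by the lowest case value */
}
```

Here the subtraction i - LOW is explicit; the assembly version folds the same bias into the address expression as - 1 * 8.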

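7.6.7.5 Sparse Jump Tables

The current implementation of the switch statement has a problem. If the case values contain nonconsecutive entries that are widely spaced, the jump table could become exceedingly large. The following switch statement would generate an extremely large code file: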

switch(i)
{
    case 1:       `Stmt1` ;
    case 100:     `Stmt2` ;
    case 1000:    `Stmt3` ;
    case 10000:   `Stmt4` ;
    default:      `Stmt5` ;

}

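In this situation, your program will be much smaller if you implement the switch statement with a sequence of if statements rather than using an indirect jump statement. However, keep one thing in mind: the size of the jump table does not normally affect the execution speed of the program. Whether the jump table contains two entries or two thousand, the switch statement executes the multiway branch in a constant amount of time. The if statement implementation requires a linearly increasing amount of time for each case label appearing in the case statement.

Probably the biggest advantage of using assembly language over an HLL like Pascal or C/C++ is that you get to choose the actual implementation of statements like switch. In some instances, you can implement a switch statement as a sequence of if/then/elseif statements, or you can implement it as a jump table, or you can use a hybrid of the two: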

switch(i)
{
    case 0:   `Stmt0` ;
    case 1:   `Stmt1` ;
    case 2:   `Stmt2` ;
    case 100: `Stmt3` ;
    default:  `Stmt4` ;

}

This switch statement could become the following:

mov eax, i
cmp eax, 100
je  DoStmt3;
cmp eax, 2
ja  TheDefaultCase
lea rcx, JmpTbl
jmp qword ptr [rcx][rax * 8]
 .
 .
 .

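If you are willing to live with a program that cannot exceed 2GB in size (and use the LARGEADDRESSAWARE:NO command line option), you can improve your implementation of the switch statement and save one instruction: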

; SWITCH statement specifying cases 5, 6, and 7
; with a DEFAULT clause:

     mov eax, i
     cmp eax, 5
     jb  DefaultCase
     cmp eax, 7                  ; Verify that i is in the range
     ja  DefaultCase             ; 5 to 7 before the indirect jmp
     jmp JmpTbl[rax * 8 - 5 * 8] ; 5 * 8 compensates for zero index

JmpTbl  qword Stmt5, Stmt6, Stmt7

Stmt5:
 `Code to print "i = 5"`
        jmp EndCase

Stmt6:
 `Code to print "i = 6"`
        jmp EndCase

Stmt7:
 `Code to print "i = 7"`
        jmp EndCase

DefaultCase:
 `Code to print "EBX does not equal 5, 6, or 7"`

EndCase:

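This code removes the lea rcx, JmpTbl instruction and replaces jmp [rcx][rax * 8 - 5 * 8] with jmp JmpTbl[rax * 8 - 5 * 8]. This is a small improvement, but an improvement nonetheless (this sequence not only is one instruction shorter but also uses one less register). Of course, always be aware of the dangers of writing 64-bit programs that are not large-address aware.

Some switch statements have sparse cases but with groups of contiguous cases within the overall set of cases. Consider the following C/C++ switch statement: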

switch(`expression`)
{
    case 0:
 `Code for case 0`
        break;

    case 1:
 `Code for case 1`
        break;

    case 2:
 `Code for case 2`
        break;

    case 10:
 `Code for case 10`
        break;

    case 11:
 `Code for case 11`
        break;

    case 100:
 `Code for case 100`
        break;

    case 101:
 `Code for case 101`
        break;

    case 103:
 `Code for case 103`
        break;

    case 1000:
 `Code for case 1000`
        break;

    case 1001:
 `Code for case 1001`
        break;

    case 1003:
 `Code for case 1003`
        break;

    default:
 `Code for default case`
        break;
} // end switch

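You can convert a switch statement that consists of widely separated groups of (nearly) contiguous cases to assembly language code using one jump table implementation for each contiguous group, and you can then use compare instructions to determine which jump table instruction sequence to execute. Here's one possible implementation of the previous C/C++ code: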

; Assume expression has been computed and is sitting in EAX/RAX
; at this point...

         cmp   eax, 100
         jb    try0_11
         cmp   eax, 103
         ja    try1000_1003
         lea   rcx, jt100
         jmp   qword ptr [rcx][rax * 8 - 100 * 8]
jt100    qword case100, case101, defaultCase, case103

try0_11: cmp   eax, 11 ; Handle cases 0-11 here
         ja    defaultCase
         lea   rcx, jt0_11
         jmp   qword ptr [rcx][rax * 8]
jt0_11   qword case0, case1, case2, defaultCase 
         qword defaultCase, defaultCase, defaultCase
         qword defaultCase, defaultCase, defaultCase
         qword case10, case11

try1000_1003:
         cmp   eax, 1000
         jb    defaultCase
         cmp   eax, 1003
         ja    defaultCase
         lea   rcx, jt1000
         jmp   qword ptr [rcx][rax * 8 - 1000 * 8]
jt1000   qword case1000, case1001, defaultCase, case1003
           .
           .
           .
 `Code for the actual cases here`

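This code sequence combines groups 0 to 2 and 10 to 11 into a single group (requiring seven additional jump table entries) in order to spare having to write an additional jump table sequence.

Of course, for a set of cases this simple, it's probably easier to just use compare-and-branch sequences. This example was simplified a bit just to make a point.

7.6.7.6 Other `switch` Statement Alternatives

What happens if the cases are too sparse to do anything but compare the expression's value case by case? Is the code doomed to being translated into the equivalent of an if/elseif/else/endif sequence? Not necessarily. However, before we consider other alternatives, it's important to mention that not all if/elseif/else/endif sequences are created equal. Look back at the previous example. A straightforward implementation might have been something like this: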

if(unsignedExpression <= 11)
{
 `Switch for 0 to 11`
}
else if(unsignedExpression >= 100 && unsignedExpression <= 103)
{
 `Switch for 100 to 103`
}
else if(unsignedExpression >= 1000 && unsignedExpression <= 1003)
{
 `Switch for 1000 to 1003`
}
else
{
 `Code for default case`
}

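Instead, the former implementation first tests against the value 100 and branches based on the comparison being less than (cases 0 to 11) or greater than (cases 1000 to 1003), effectively creating a small binary search that reduces the number of comparisons. It's hard to see the savings in the HLL code, but in assembly code you can count the number of instructions that would be executed in the best and worst cases and see an improvement over the standard linear search approach of simply comparing the values in the cases in the order they appear in the switch statement.^(8)

If your cases are too sparse (no meaningful groups at all), such as the 1, 10, 100, 1000, 10000 example given in a previous section, you're not going to be able to (reasonably) implement the switch statement by using a jump table. Rather than devolving into a straight linear search (which can be slow), a better solution is to sort your cases and test them by using a binary search.

With a binary search, you first compare the expression value against the middle case value. If it's less than the middle value, you repeat the search on the first half of the list of values; if it's greater than the middle value, you repeat the search on the second half of the list; if it's equal, you drop into the code that handles that case. Here's a binary search version of the 1, 10, 100, . . . example: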

; Assume expression has been calculated into EAX.

        cmp eax, 100
        jb  try1_10
        ja  try1000_10000

 `Code to handle case 100 goes here`
        jmp AllDone

try1_10:
        cmp eax,1
        je  case1
        cmp eax, 10
        jne defaultCase

 `Code to handle case 10 goes here`
        jmp AllDone
case1:
 `Code to handle case 1 goes here`
        jmp AllDone

try1000_10000:
        cmp eax, 1000
        je  case1000
        cmp eax, 10000
        jne defaultCase

 `Code to handle case 10000 goes here`
        jmp AllDone

case1000:
 `Code to handle case 1000 goes here`
        jmp AllDone

defaultCase:
 `Code to handle defaultCase goes here`

AllDone:
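The shape of this compare tree is perhaps easier to see in C. The following sketch follows the same order of comparisons as the assembly code; the function name classify is hypothetical, and it returns the matched case value (or -1 for the default case) where the real code would run the case's statements.

```c
/* Binary-search style dispatch over the sorted cases 1, 10, 100, 1000, 10000.
   Returns the matching case value, or -1 for the default case. */
int classify(unsigned x)
{
    if (x < 100) {                 /* jb try1_10 */
        if (x == 1)  return 1;
        if (x == 10) return 10;
        return -1;                 /* defaultCase */
    }
    if (x > 100) {                 /* ja try1000_10000 */
        if (x == 1000)  return 1000;
        if (x == 10000) return 10000;
        return -1;                 /* defaultCase */
    }
    return 100;                    /* fell through both tests: x == 100 */
}
```

With five cases, no path needs more than three comparison steps, whereas a linear chain could need five before reaching the last case or the default.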

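The techniques presented in this section have many possible alternatives. For example, one common solution is to create a table containing a set of records (structures), with each record entry a two-tuple containing a case value and a jump address. Rather than having a long sequence of compare instructions, a short loop can sequence through all the table elements, searching for the case value and transferring control to the corresponding jump address if there is a match. This scheme is slower than the other techniques in this section but can be much shorter than the traditional if/elseif/else/endif implementation.^(9)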
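A minimal C sketch of such a table of (case value, target) records might look like the following. Everything here is illustrative: the struct case_entry type, the handlers, and table_switch are names assumed for this example, and the handlers just return a small identifying number.

```c
#include <stddef.h>

typedef int (*handler_t)(void);

/* Each record pairs a case value with the code to run for it. */
struct case_entry {
    int       value;
    handler_t target;
};

/* Hypothetical handlers; the default handler returns -1. */
static int on_1(void)       { return 0;  }
static int on_100(void)     { return 1;  }
static int on_1000(void)    { return 2;  }
static int on_10000(void)   { return 3;  }
static int on_default(void) { return -1; }

static const struct case_entry case_tbl[] = {
    { 1,     on_1     },
    { 100,   on_100   },
    { 1000,  on_1000  },
    { 10000, on_10000 }
};

int table_switch(int x)
{
    size_t n = sizeof case_tbl / sizeof case_tbl[0];

    /* Short loop: scan the records for a matching case value. */
    for (size_t k = 0; k < n; ++k) {
        if (case_tbl[k].value == x)
            return case_tbl[k].target();
    }
    return on_default();           /* no record matched */
}
```

Adding a case means adding one table record rather than another compare-and-branch pair, which is where the space savings come from.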

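Note, by the way, that the defaultCase label often appears in several jcc instructions in a (non-jump-table) switch implementation. Because the conditional jump instructions have two encodings, a 2-byte form and a 6-byte form, you should try to place defaultCase near these conditional jumps so you can use the short form of the instruction as much as possible. Although the examples in this section have typically put the jump tables (which consume a large number of bytes) immediately after their corresponding indirect jump, you could move these tables elsewhere in the procedure to help keep the conditional jump instructions short. Here's a variant of the earlier multiple-jump-table example (this time with cases 0 to 3, 10 to 13, 100 to 103, and 1000 to 1003) written with this in mind: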

; Assume expression has been computed and is sitting in EAX/RAX
; at this point...

         cmp   eax, 100
         jb    try0_13
         cmp   eax, 103
         ja    try1000_1003
         lea   rcx, jt100
         jmp   qword ptr [rcx][rax * 8 - 100 * 8]

try0_13: cmp   eax, 13      ; Handle cases 0 to 13 here
         ja    defaultCase
         lea   rcx, jt0_13
         jmp   qword ptr [rcx][rax * 8]

try1000_1003:
         cmp   eax, 1000    ; Handle cases 1000 to 1003 here
         jb    defaultCase
         cmp   eax, 1003
         ja    defaultCase
         lea   rcx, jt1000
         jmp   qword ptr [rcx][rax * 8 - 1000 * 8]

defaultCase:
 `Put defaultCase here to keep it near all the`
 `conditional jumps to defaultCase` 

         jmp   AllDone

jt0_13   qword case0, case1, case2, case3
         qword defaultCase, defaultCase, defaultCase
         qword defaultCase, defaultCase, defaultCase
         qword case10, case11, case12, case13
jt100    qword case100, case101, case102, case103
jt1000   qword case1000, case1001, case1002, case1003
           .
           .
           .
 `Code for the actual cases here`

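7.7 State Machines and Indirect Jumps

Another control structure commonly found in assembly language programs is the state machine. A state machine uses a state variable to control program flow. The FORTRAN programming language provides this capability with the assigned goto statement. Certain variants of C (for example, GNU's GCC from the Free Software Foundation) provide similar features. In assembly language, the indirect jump can implement state machines.

So what is a state machine? In basic terms, it is a piece of code that keeps track of its execution history by entering and leaving certain states. For the purposes of this chapter, we'll just assume that a state machine is a piece of code that (somehow) remembers the history of its execution (its state) and executes sections of code based on that history.

In a real sense, all programs are state machines. The CPU registers and values in memory constitute the state of that machine. However, we'll use a much more constrained view. Indeed, for most purposes, only a single variable (or the value in the RIP register) will denote the current state.

Now let's consider a concrete example. Suppose you have a procedure and want to perform one operation the first time you call it, a different operation the second time you call it, yet something else the third time you call it, and then something new again on the fourth call. After the fourth call, it repeats these four operations in order.

For example, suppose you want the procedure to add EAX and ECX the first time, subtract them on the second call, multiply them on the third, and divide them on the fourth. You could implement this procedure as shown in Listing 7-6.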

; Listing 7-6

; A simple state machine example.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 7-6", 0
fmtStr0     byte    "Calling StateMachine, "
            byte    "state=%d, EAX=5, ECX=6", nl, 0

fmtStr0b    byte    "Calling StateMachine, "
            byte    "state=%d, EAX=1, ECX=2", nl, 0

fmtStrx     byte    "Back from StateMachine, "
            byte    "state=%d, EAX=%d", nl, 0

fmtStr1     byte    "Calling StateMachine, "
            byte    "state=%d, EAX=50, ECX=60", nl, 0

fmtStr2     byte    "Calling StateMachine, "
            byte    "state=%d, EAX=10, ECX=20", nl, 0

fmtStr3     byte    "Calling StateMachine, "
            byte    "state=%d, EAX=50, ECX=5", nl, 0

            .data
state       byte    0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

StateMachine proc
             cmp    state, 0
             jne    TryState1

; State 0: Add ECX to EAX and switch to state 1:

             add    eax, ecx
             inc    state           ; State 0 becomes state 1
             jmp    exit

TryState1:
             cmp    state, 1
             jne    TryState2

; State 1: Subtract ECX from EAX and switch to state 2:

             sub    eax, ecx
             inc    state           ; State 1 becomes state 2
             jmp    exit

TryState2:   cmp    state, 2
             jne    MustBeState3

; If this is state 2, multiply ECX by EAX and switch to state 3:

             imul   eax, ecx
             inc    state           ; State 2 becomes state 3
             jmp    exit

; If it isn't one of the preceding states, we must be in state 3,
; so divide EAX by ECX and switch back to state 0.

MustBeState3:
             push   rdx          ; Preserve this 'cause it
                                 ; gets whacked by div
             xor    edx, edx     ; Zero-extend EAX into EDX
             div    ecx
             pop    rdx          ; Restore EDX's value preserved above
             mov    state, 0     ; Reset the state back to 0

exit:        ret

StateMachine endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48         ; Shadow storage

            mov     state, 0        ; Just to be safe

; Demonstrate state 0:

            lea     rcx, fmtStr0
            movzx   rdx, state
            call    printf

            mov     eax, 5
            mov     ecx, 6
            call    StateMachine

            lea     rcx, fmtStrx
            mov     r8, rax
            movzx   edx, state
            call    printf

; Demonstrate state 1:

            lea     rcx, fmtStr1
            movzx   rdx, state
            call    printf

            mov     eax, 50
            mov     ecx, 60
            call    StateMachine

            lea     rcx, fmtStrx
            mov     r8, rax
            movzx   edx, state
            call    printf

; Demonstrate state 2:

            lea     rcx, fmtStr2
            movzx   rdx, state
            call    printf

            mov     eax, 10
            mov     ecx, 20
            call    StateMachine

            lea     rcx, fmtStrx
            mov     r8, rax
            movzx   edx, state
            call    printf

; Demonstrate state 3:

            lea     rcx, fmtStr3
            movzx   rdx, state
            call    printf

            mov     eax, 50
            mov     ecx, 5
            call    StateMachine

            lea     rcx, fmtStrx
            mov     r8, rax
            movzx   edx, state
            call    printf

; Demonstrate back in state 0:

            lea     rcx, fmtStr0b
            movzx   rdx, state
            call    printf

            mov     eax, 1
            mov     ecx, 2
            call    StateMachine

            lea     rcx, fmtStrx
            mov     r8, rax
            movzx   edx, state
            call    printf

            leave
            ret     ; Returns to caller

asmMain     endp
            end

Listing 7-6: A state machine example

Here's the build command and program output:

C:\>**build listing7-6**

C:\>**echo off**
 Assembling: listing7-6.asm
c.cpp

C:\>**listing7-6**
Calling Listing 7-6:
Calling StateMachine, state=0, EAX=5, ECX=6
Back from StateMachine, state=1, EAX=11
Calling StateMachine, state=1, EAX=50, ECX=60
Back from StateMachine, state=2, EAX=-10
Calling StateMachine, state=2, EAX=10, ECX=20
Back from StateMachine, state=3, EAX=200
Calling StateMachine, state=3, EAX=50, ECX=5
Back from StateMachine, state=0, EAX=10
Calling StateMachine, state=0, EAX=1, ECX=2
Back from StateMachine, state=1, EAX=3
Listing 7-6 terminated

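Technically, this procedure is not the state machine. Instead, the variable state and the cmp/jne instructions constitute the state machine. The procedure is little more than a switch statement implemented via the if/then/elseif construct. The only unique thing about it is that it remembers how many times it has been called^(10) and behaves differently depending upon the number of calls.

While this is a correct implementation of the desired state machine, it is not particularly efficient. The astute reader, of course, would recognize that this code could be made a little faster by using an actual switch statement rather than the if/then/elseif/endif implementation. However, an even better solution exists.

It's common to use an indirect jump to implement a state machine in assembly language. Rather than having a state variable that contains a value like 0, 1, 2, or 3, we could load the state variable with the address of the code to execute upon entry into the procedure. By simply jumping to that address, the state machine could save the tests needed to select the proper code fragment. Consider the implementation in Listing 7-7 that uses the indirect jump.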

; Listing 7-7

; An indirect jump state machine example.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 7-7", 0
fmtStr0     byte    "Calling StateMachine, "
            byte    "state=0, EAX=5, ECX=6", nl, 0

fmtStr0b    byte    "Calling StateMachine, "
            byte    "state=0, EAX=1, ECX=2", nl, 0

fmtStrx     byte    "Back from StateMachine, "
            byte    "EAX=%d", nl, 0

fmtStr1     byte    "Calling StateMachine, "
            byte    "state=1, EAX=50, ECX=60", nl, 0

fmtStr2     byte    "Calling StateMachine, "
            byte    "state=2, EAX=10, ECX=20", nl, 0

fmtStr3     byte    "Calling StateMachine, "
            byte    "state=3, EAX=50, ECX=5", nl, 0

             .data
state        qword  state0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; StateMachine version 2.0 - using an indirect jump.

             option noscoped     ; state`x` labels must be global
StateMachine proc

             jmp    state

; State 0: Add ECX to EAX and switch to state 1:

state0:      add    eax, ecx
             lea    rcx, state1
             mov    state, rcx
             ret

; State 1: Subtract ECX from EAX and switch to state 2:

state1:      sub    eax, ecx
             lea    rcx, state2
             mov    state, rcx
             ret

; If this is state 2, multiply ECX by EAX and switch to state 3:

state2:      imul   eax, ecx
             lea    rcx, state3
             mov    state, rcx
             ret

state3:      push   rdx          ; Preserve this 'cause it 
                                 ; gets whacked by div
             xor    edx, edx     ; Zero-extend EAX into EDX
             div    ecx
             pop    rdx          ; Restore EDX's value preserved above
             lea    rcx, state0
             mov    state, rcx
             ret

StateMachine endp
             option scoped

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48         ; Shadow storage

            lea     rcx, state0
            mov     state, rcx      ; Just to be safe

; Demonstrate state 0:

            lea     rcx, fmtStr0
            call    printf

            mov     eax, 5
            mov     ecx, 6
            call    StateMachine

            lea     rcx, fmtStrx
            mov     rdx, rax
            call    printf

; Demonstrate state 1:

            lea     rcx, fmtStr1
            call    printf

            mov     eax, 50
            mov     ecx, 60
            call    StateMachine

            lea     rcx, fmtStrx
            mov     rdx, rax
            call    printf

; Demonstrate state 2:

            lea     rcx, fmtStr2
            call    printf

            mov     eax, 10
            mov     ecx, 20
            call    StateMachine

            lea     rcx, fmtStrx
            mov     rdx, rax
            call    printf

; Demonstrate state 3:

            lea     rcx, fmtStr3
            call    printf

            mov     eax, 50
            mov     ecx, 5
            call    StateMachine

            lea     rcx, fmtStrx
            mov     rdx, rax
            call    printf

; Demonstrate back in state 0:

            lea     rcx, fmtStr0b
            call    printf

            mov     eax, 1
            mov     ecx, 2
            call    StateMachine

            lea     rcx, fmtStrx
            mov     rdx, rax
            call    printf

            leave
            ret     ; Returns to caller

asmMain     endp
            end

Listing 7-7: A state machine using an indirect jump

Here's the build command and program output:

C:\>**build listing7-7**

C:\>**echo off**
 Assembling: listing7-7.asm
c.cpp

C:\>**listing7-7**
Calling Listing 7-7:
Calling StateMachine, state=0, EAX=5, ECX=6
Back from StateMachine, EAX=11
Calling StateMachine, state=1, EAX=50, ECX=60
Back from StateMachine, EAX=-10
Calling StateMachine, state=2, EAX=10, ECX=20
Back from StateMachine, EAX=200
Calling StateMachine, state=3, EAX=50, ECX=5
Back from StateMachine, EAX=10
Calling StateMachine, state=0, EAX=1, ECX=2
Back from StateMachine, EAX=3
Listing 7-7 terminated

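The jmp instruction at the beginning of the StateMachine procedure transfers control to the location pointed at by the state variable. The first time you call StateMachine, it points at the state0 label. Thereafter, each subsection of code sets the state variable to point at the appropriate successor code.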
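For comparison, here is a hedged C sketch of the same idea, with a function pointer playing the role of the state variable. The names state_machine and state_machine_reset are invented for this example, and (unlike the listing, which uses the unsigned div instruction) the division here is signed and assumes a nonzero divisor.

```c
typedef int (*state_fn)(int a, int c);

static int state0(int a, int c);
static int state1(int a, int c);
static int state2(int a, int c);
static int state3(int a, int c);

/* The state variable holds the address of the code to run next. */
static state_fn state = state0;

static int state0(int a, int c) { state = state1; return a + c; }
static int state1(int a, int c) { state = state2; return a - c; }
static int state2(int a, int c) { state = state3; return a * c; }
static int state3(int a, int c) { state = state0; return a / c; }  /* c != 0 assumed */

/* Equivalent of "jmp state": one indirect transfer, no comparisons. */
int state_machine(int a, int c)
{
    return state(a, c);
}

/* Puts the machine back into state 0 (like the setup code in asmMain). */
void state_machine_reset(void)
{
    state = state0;
}
```

Calling state_machine with the operand pairs used in Listing 7-7, after a reset, yields the same sequence of results as the program output shown above.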

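7.8 Loops

Loops represent the final basic control structure (sequences, decisions, and loops) that make up a typical program. As with so many other structures in assembly language, you'll find yourself using loops in places you've never dreamed of using loops.

Most HLLs have implied loop structures hidden away. For example, consider the BASIC statement if A$ = B$ then 100. This if statement compares two strings and jumps to statement 100 if they are equal. In assembly language, you would need to write a loop to compare each character in A$ to the corresponding character in B$ and then jump to statement 100 if and only if all the characters matched.^(11)

Program loops consist of three components: an optional initialization component, an optional loop-termination test, and the body of the loop. The order in which you assemble these components can dramatically affect the loop's operation. Three permutations of these components appear frequently in programs: while loops, repeat/until loops (do/while in C/C++), and infinite loops (for example, for(;;) in C/C++).

7.8.1 while Loops

The most general loop is the while loop. In C/C++, it takes the following form: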

while(`expression`) `statement(s)`;

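In the while loop, the termination test appears at the beginning of the loop. As a direct consequence of the position of the termination test, the body of the loop may never execute if the Boolean expression is already false when the loop is first entered.

Consider the following C/C++ while loop: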

i = 0;
while(i < 100)
{
    ++i;
}

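The i = 0; statement is the initialization code for this loop. i is a loop-control variable, because it controls the execution of the body of the loop. i < 100 is the loop-termination condition: the loop will not terminate as long as i is less than 100. The single statement ++i; (increment i) is the loop body that executes on each loop iteration.

A C/C++ while loop can be easily synthesized using if and goto statements. For example, you may replace the previous C while loop with the following C code: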

i = 0;
WhileLp:
if(i < 100)
{

    ++i;
      goto WhileLp;

}

More generally, you can construct any while loop as follows:

`Optional initialization code`

UniqueLabel:
if(`not_termination_condition`)
{
 `Loop body`
    goto UniqueLabel;

}

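Therefore, you can use the techniques from earlier in this chapter to convert if statements to assembly language and add a single jmp instruction to produce a while loop. The example in this section translates to the following pure x86-64 assembly code:^(12)

          mov i, 0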
WhileLp:
          cmp i, 100
          jnl WhileDone
          inc i
          jmp WhileLp;

WhileDone:

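7.8.2 repeat/until Loops

The repeat/until (do/while) loop tests for the termination condition at the end of the loop rather than at the beginning. In Pascal, the repeat/until loop takes the following form: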

`Optional initialization code`
repeat

 `Loop body`

until(`termination_condition`);

This is comparable to the following C/C++ do/while loop:

`Optional initialization code`
do
{
 `Loop body`

}while(`not_termination_condition`);

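This sequence executes the initialization code, then executes the loop body, and finally tests a condition to see whether the loop should repeat. If the termination condition evaluates to false, the loop repeats; otherwise, the loop terminates. The two things you should note about the repeat/until loop are that the termination test appears at the end of the loop and, as a direct consequence, the loop body always executes at least once.

Like the while loop, the repeat/until loop can be synthesized with an if statement and a jump (goto here, jmp in assembly). You could use the following: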

`Initialization code`
SomeUniqueLabel:

 `Loop body`

if(`not_termination_condition`) goto SomeUniqueLabel;

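Based on the material presented in the previous sections, you can easily synthesize repeat/until loops in assembly language. The following is a simple example: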

 repeat (`Pascal code`)

          write('Enter a number greater than 100:');
          readln(i);

     until(i > 100);

// This translates to the following if/jmp code:

     RepeatLabel:

          write('Enter a number greater than 100:');
          readln(i);

     if(i <= 100) then goto RepeatLabel;

// It also translates into the following assembly code:

RepeatLabel:

          call print
          byte "Enter a number greater than 100: ", 0
          call readInt  ; Function to read integer from user

          cmp  eax, 100 ; Assume readInt returns integer in EAX
          jng  RepeatLabel

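7.8.3 forever/endfor Loops

If while loops test for termination at the beginning of the loop and repeat/until/do/while loops check for termination at the end of the loop, the only place left to test for termination is in the middle of the loop. The C/C++ high-level for(;;) loop, combined with the break statement, provides this capability. The C/C++ infinite loop takes the following form: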

for(;;)
{
 `Loop body`

}

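There is no explicit termination condition. Unless otherwise provided, the for(;;) construct forms an infinite loop. A break statement usually handles loop termination. Consider the following C++ code that employs the for(;;) construct: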

for(;;)
{
     cin >> `character`;
     if(`character` == '.') break;
     cout << `character`;

}

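Converting a for(ever) loop to pure assembly language is easy. All you need is a label and a jmp instruction. The break statement in this example is also nothing more than a jmp instruction (or conditional jump). The pure assembly language version of the preceding code looks something like the following: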

foreverLabel:

          call getchar    ; Assume it returns char in AL
          cmp  al, '.'
          je   ForIsDone

          mov  cl, al     ; Pass char read from getchar to putchar
          call putchar    ; Assume this prints the char in CL
          jmp  foreverLabel

ForIsDone:

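7.8.4 `for` Loops

The standard for loop is a special form of the while loop that repeats the loop body a specific number of times (this is known as a definite loop). In C/C++, the for loop takes this form: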

for(`initialization_Stmt`; `termination_expression`; `inc_Stmt`)
{
 `Statements`

}

This is equivalent to the following:

`initialization_Stmt`;
while(`termination_expression`)
{
 `Statements` 

    `inc_Stmt`;

}

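Traditionally, programs use the for loop to process arrays and other objects accessed in sequential order. We normally initialize a loop-control variable with the initialization statement and then use the loop-control variable as an index into the array (or other data type). For example: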

for(i = 0; i < 7; ++i)
{
     printf("Array Element = %d\n", SomeArray[i]);

}

To convert this to pure assembly language, begin by translating the for loop into an equivalent while loop:

i = 0;
while(i < 7)
{
    printf("Array Element = %d\n", SomeArray[i]);
    ++i;
}

Now, using the techniques from Section 7.8.1, "while Loops," translate the code into pure assembly language:

          xor  rbx, rbx      ; Use RBX to hold loop index
WhileLp:  cmp  ebx, 7
          jnl  EndWhileLp

          lea  rcx, fmtStr   ; fmtStr = "Array Element = %d", nl, 0
          lea  rdx, SomeArray
          mov  edx, [rdx][rbx * 4] ; Assume SomeArray is 4-byte ints
          call printf

          inc  rbx
          jmp  WhileLp;

EndWhileLp:

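7.8.5 `break` and `continue` Statements

The C/C++ break and continue statements both translate into a single jmp instruction. The break statement exits the loop that immediately contains the break statement; the continue statement restarts the loop that contains the continue statement.

To convert a break statement to pure assembly language, just emit a goto/jmp instruction that transfers control to the first statement following the end of the loop to exit. You can do this by placing a label after the loop body and jumping to that label. The following code fragments demonstrate this technique for the various loops.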

// Breaking out of a FOR(;;) loop:

for(;;)
{
 `Stmts`
          // break;
          goto BreakFromForever;
 `Stmts`
}
BreakFromForever:

// Breaking out of a FOR loop:

for(initStmt; expr; incStmt)
{
 `Stmts`
          // break;
          goto BrkFromFor;
 `Stmts`
}
BrkFromFor:

// Breaking out of a WHILE loop:

while(expr)
{
 `Stmts`
          // break;
          goto BrkFromWhile;
 `Stmts`
}
BrkFromWhile:

// Breaking out of a REPEAT/UNTIL loop (DO/WHILE is similar):

repeat
 `Stmts`
          // break;
          goto BrkFromRpt;
 `Stmts`
until(expr);
BrkFromRpt:

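In pure assembly language, convert the appropriate control structures to assembly and replace the goto with a jmp instruction.

The continue statement is slightly more complex than the break statement. The implementation is still a single jmp instruction; however, the target label doesn't wind up going in the same spot for each of the different loops. Figures 7-2, 7-3, 7-4, and 7-5 show where the continue statement transfers control for each of the loops.

Figure 7-2: continue destination for the for(;;) loop

Figure 7-3: continue destination and the while loop

Figure 7-4: continue destination and the for loop

Figure 7-5: continue destination and the repeat/until loop

The following code fragments demonstrate how to convert the continue statement into an appropriate jmp instruction for each of these loop types: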

for(;;)/continue/endfor

; Conversion of FOREVER loop with continue
; to pure assembly:
 for(;;)
 {
 `Stmts`
      continue;
 `Stmts`
 }

; Converted code:

foreverLbl:
 `Stmts`
          ; continue;
          jmp foreverLbl
 `Stmts`
     jmp foreverLbl

while/continue/endwhile

; Conversion of WHILE loop with continue
; into pure assembly:

 while(expr)
 {
 `Stmts`
      continue;
 `Stmts`
 }

; Converted code:

whlLabel:
 `Code to evaluate expr`
     jcc EndOfWhile    ; Skip loop on expr failure
 `Stmts`
          ; continue;
          jmp whlLabel ; Jump to start of loop on continue
 `Stmts`
     jmp whlLabel      ; Repeat the code
EndOfWhile:

for/continue/endfor

; Conversion for a FOR loop with continue
; into pure assembly:

 for(initStmt; expr; incStmt)
 {
 `Stmts`
     continue;
 `Stmts`
 }

; Converted code:

 `initStmt`
ForLpLbl:
 `Code to evaluate expr`
          jcc EndOfFor     ; Branch if expression fails
 `Stmts`

          ; continue;
          jmp ContFor      ; Branch to incStmt on continue

 `Stmts`

ContFor:
 `incStmt`
          jmp ForLpLbl

EndOfFor:
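To see why continue in a for loop must branch to the increment step rather than to the top of the loop, consider this small C sketch. Both function names are invented for the example; the first function is written with goto in the same shape as the converted code above, and the second is the ordinary for/continue form it is meant to match.

```c
/* Sum of the odd values in 0..n-1, written in the "converted" shape. */
int sum_odd_goto(int n)
{
    int sum = 0;
    int i;

    i = 0;                            /* initStmt */
ForLpLbl:
    if (!(i < n)) goto EndOfFor;      /* expr fails: leave the loop */

    if (i % 2 == 0) goto ContFor;     /* continue: go to incStmt, not ForLpLbl */
    sum += i;

ContFor:
    ++i;                              /* incStmt */
    goto ForLpLbl;

EndOfFor:
    return sum;
}

/* The same computation with an ordinary for loop and continue. */
int sum_odd_for(int n)
{
    int sum = 0;

    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0) continue;
        sum += i;
    }
    return sum;
}
```

If the continue in the first function jumped to ForLpLbl instead of ContFor, i would never advance past the first even value and the loop would not terminate.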

repeat/continue/until

 repeat
      ` Stmts`
      continue;
      ` Stmts`
 until(expr);

 do
 {
      ` Stmts`
      continue;
      ` Stmts`

 }while(!expr);

; Converted code:

RptLpLbl:
     ` Stmts`
          ; continue;
          jmp ContRpt  ; Continue branches to termination test
          ` Stmts`
ContRpt:
     ` Code to test expr`
     jcc RptLpLbl      ; Jumps if expression evaluates false

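7.8.6 Register Usage and Loops

Given that the x86-64 accesses registers more efficiently than memory locations, registers are the ideal spot to place loop-control variables (especially for small loops). However, registers are a limited resource; there are only 16 general-purpose registers (and some, such as RSP and RBP, are reserved for special purposes). Compared with memory, you cannot place much data in the registers, despite them being more efficient to use than memory.

Loops present a special challenge for registers. Registers are perfect for loop-control variables because they're efficient to manipulate and can serve as indexes into arrays and other data structures (a common use for loop-control variables). However, the limited availability of registers often creates problems when using registers in this fashion. Consider the following code, which will not work properly because it attempts to reuse a register (CX) that is already in use (leading to the corruption of the outer loop's loop-control variable):

          mov cx, 8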
loop1:    
          mov cx, 4
loop2:
 `Stmts`
          dec cx
          jnz loop2

          dec cx
          jnz loop1

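The intent here was to create a set of nested loops, that is, one loop inside another. The inner loop (loop2) should repeat four times for each of the eight executions of the outer loop (loop1). Unfortunately, both loops use the same register as a loop-control variable. Therefore, this will form an infinite loop. Because CX is always 0 upon encountering the second dec instruction, control will always transfer to the loop1 label (because decrementing 0 produces a nonzero result). The solution is to save and restore the CX register or to use a different register in place of CX for the outer loop:

          mov cx, 8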
loop1:
          push rcx
          mov  cx, 4
loop2:
 `Stmts`
          dec cx
          jnz loop2;

          pop rcx
          dec cx
          jnz loop1
or
          mov dx,8
loop1:
          mov cx, 4
loop2:
 `Stmts`
          dec cx
          jnz loop2

          dec dx
          jnz loop1

寄存器损坏是汇编语言程序中循环的主要错误来源之一，因此要时刻注意这个问题。

7.9 循环性能优化

由于循环是程序中性能问题的主要来源，它们是加速软件时首先需要关注的地方。虽然如何编写高效程序的讨论超出了本章的范围，但在设计程序中的循环时，你应该注意以下概念。它们的目的都是通过去除循环中的不必要指令，从而减少执行单次循环迭代所需的时间。

7.9.1 将终止条件移到循环末尾

请考虑以下三种类型的循环的流程图：

REPEAT/UNTIL loop:
     Initialization code
          Loop body
     Test for termination
     Code following the loop

WHILE loop:
     Initialization code
     Loop-termination test
          Loop body
          Jump back to test
     Code following the loop

FOREVER/ENDFOR loop:
     Initialization code
          Loop body part one
          Loop-termination test
          Loop body part two
          Jump back to Loop body part one
     Code following the loop

正如你所看到的，repeat/until循环是其中最简单的。这在这些循环的汇编语言实现中得到了体现。考虑以下语义上完全相同的repeat/until和while循环：

; Example involving a WHILE loop:

         mov  esi, edi
         sub  esi, 20

; while(ESI <= EDI)

whileLp: cmp  esi, edi
         jnle endwhile

 `Stmts`

         inc  esi
         jmp  whileLp
endwhile:

; Example involving a REPEAT/UNTIL loop:

         mov esi, edi
         sub esi, 20
repeatLp:

 `Stmts`

         inc  esi
         cmp  esi, edi
         jng  repeatLp

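Testing for the termination condition at the end of the loop lets us remove the jmp instruction from the loop, which can be significant when the loop is nested inside other loops. From the loop's definition you can see that it executes exactly 21 times (ESI runs from EDI - 20 through EDI, assuming the subtraction doesn't overflow). The body therefore always runs at least once, so converting this particular loop to a repeat/until loop is trivial and safe.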

不幸的是，情况并不总是这么简单。考虑以下 C 代码：

while(esi <= edi)
{
 `Stmts`
    ++esi;
}

在这个特定的例子中，我们在进入循环时完全不知道 ESI 寄存器包含什么。因此，我们不能假设循环体至少会执行一次。所以，在执行循环体之前，我们必须测试循环终止条件。可以在循环末尾通过加入一个jmp指令来进行测试：

 jmp WhlTest
TopOfLoop:
 `Stmts`
          inc  esi 
WhlTest:  cmp  esi, edi
          jle TopOfLoop

尽管代码与原始的while循环一样长，但jmp指令只执行一次，而不是在每次循环重复时执行。然而，这种效率的轻微提升是通过轻微的可读性损失来实现的（因此一定要加上注释）。第二段代码比原始实现更接近“意大利面条代码”。这种小小的性能提升常常以牺牲清晰度为代价。因此，你应该仔细分析代码，确保性能提升值得放弃清晰度。
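
The rotated loop can be sketched in C with goto statements that mirror the assembly. This is only an illustration: the function names are hypothetical, the counter stands in for the loop body, and the assertion rules out the one input (EDI holding the largest 32-bit value) for which ++esi would overflow in C.

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward form: test at the top, jump back at the bottom. */
unsigned count_while(int32_t esi, int32_t edi)
{
    assert(edi < INT32_MAX);      /* keeps ++esi from overflowing */
    unsigned iterations = 0;
    while (esi <= edi) {
        ++iterations;             /* stands in for the loop body */
        ++esi;
    }
    return iterations;
}

/* Rotated form: one jump into the test, then the test sits at the bottom. */
unsigned count_rotated(int32_t esi, int32_t edi)
{
    assert(edi < INT32_MAX);
    unsigned iterations = 0;
    goto test;                    /* executed once, like "jmp WhlTest" */
top:
    ++iterations;                 /* stands in for the loop body */
    ++esi;
test:
    if (esi <= edi) goto top;     /* like "cmp esi, edi" / "jle TopOfLoop" */
    return iterations;
}
```

Both functions should report the same iteration count for any valid input, including the case where the body never runs (for example, esi = 5 and edi = 4).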

7.9.2 执行反向循环

由于 x86-64 的标志位特性，从某个数值递减到（或递增到）0 的循环比从 0 到另一个值的循环效率更高。比较以下 C/C++的for循环与相应的汇编语言代码：

for(j = 1; j <= 8; ++j)
{
 `Stmts`
}

; Conversion to pure assembly (as well as using a
; REPEAT/UNTIL form):

mov j, 1
ForLp:
 `Stmts`
     inc j
     cmp j, 8
     jle ForLp

现在考虑另一个循环，它也有八次迭代，但它将循环控制变量从 8 递减到 1，而不是从 1 递增到 8，从而节省了每次循环重复时的比较操作：

 mov j, 8
LoopLbl:
 `Stmts`
     dec j
     jnz LoopLbl

节省每次迭代中cmp指令的执行时间可能会使代码运行更快。不幸的是，你无法强制所有循环都反向运行。然而，通过一些努力和技巧，你应该能够编写许多for循环，使它们反向操作。

上述例子之所以有效，是因为循环从 8 递减到 1。当循环控制变量变为 0 时，循环终止。如果你需要在循环控制变量变为 0 时执行循环会发生什么？例如，假设前面的循环需要从 7 递减到 0。只要下限是非负的，你可以将早期代码中的jnz指令替换为jns指令：

 mov j, 7
LoopLbl:
 `Stmts`
     dec j
     jns LoopLbl

该循环将重复八次，j的值从 7 递减到 0。当它将 0 递减到-1 时，它会设置标志位并终止循环。

请记住，有些值看起来可能是正数，但实际上是负数。如果循环控制变量是一个字节，则在二进制补码系统中，范围从 128 到 255 的值是负数。因此，使用任何在 129 到 255（或当然是 0）范围内的 8 位值初始化循环控制变量，在执行一次后就会终止循环。如果不小心，这可能会导致问题。
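
The pitfall can be checked with a small C model of the dec/jns pair on an 8-bit counter. This is a sketch under stated assumptions: iterations_dec_jns is a hypothetical helper, and it models only the sign flag, not the full instruction semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Counts how many times a loop body runs when an 8-bit counter is
   initialized to start and the loop ends with "dec j" / "jns LoopLbl". */
unsigned iterations_dec_jns(int start)
{
    assert(start >= 0 && start <= 255);   /* must be an 8-bit value */
    uint8_t j = (uint8_t)start;
    unsigned count = 0;
    do {
        ++count;                          /* loop body */
        j = (uint8_t)(j - 1);             /* dec j (wraps modulo 256) */
    } while ((j & 0x80u) == 0);           /* jns: repeat while sign bit is clear */
    return count;
}
```

Starting at 7 gives eight iterations, while starting at 200 (or 0) gives only one, because the decremented value already has its sign bit set.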

7.9.3 使用循环不变计算

循环不变计算 是一种出现在循环中的计算，结果始终相同。你不必在循环内部执行这样的计算。你可以在循环外部计算它们，并在循环内部引用这些计算的结果。以下 C 代码演示了一个不变计算：

for(i = 0; i < n; ++i)
{
    k = (j - 2) + i;
}

因为 j 在整个循环执行过程中都不会改变，所以子表达式 j - 2 可以在循环外部计算：

jm2 = j - 2;
for(i = 0; i < n; ++i)
{
    k = jm2 + i;
}

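Moving j - 2 out of the loop eliminates one instruction per iteration, but an invariant component remains if the loop accumulates its results, that is, if the body is effectively k = k + (j - 2) + i with k starting at 0. (With the plain assignment shown above, k would simply end up equal to (j - 2) + n - 1.) In the accumulating form, j - 2 is added to the total n times, so that work can be replaced by a single multiplication before the loop: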

k = (j - 2) * n;
for(i = 0; i < n; ++i)
{
    k = k + i;
}

这将转换为以下汇编代码：

      mov  eax, j
      sub  eax, 2
      imul eax, n
      mov  ecx, 0
lp:   cmp  ecx, n
      jnl  loopDone
      add  eax, ecx   ; Single instruction implements loop body!
      inc  ecx
      jmp  lp
loopDone:
      mov  k, eax

对于这个特定的循环，实际上你可以在根本不使用循环的情况下计算结果（公式对应于之前的迭代计算）。尽管如此，这个简单的例子展示了如何从循环中消除循环不变计算。
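
As a sketch of that closed form: the loop above starts k at (j - 2) * n and then adds every i from 0 through n - 1, and that sum equals n(n - 1)/2. The function names below are hypothetical, and the code assumes n is non-negative and small enough that the products do not overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Loop form, mirroring the assembly: k starts at (j - 2) * n and
   accumulates i for i = 0 .. n-1. */
int64_t k_by_loop(int64_t j, int64_t n)
{
    int64_t k = (j - 2) * n;
    for (int64_t i = 0; i < n; ++i)
        k += i;
    return k;
}

/* Closed form: 0 + 1 + ... + (n-1) = n(n-1)/2, so no loop is needed. */
int64_t k_closed_form(int64_t j, int64_t n)
{
    assert(n >= 0);               /* sketch assumes a non-negative trip count */
    return (j - 2) * n + n * (n - 1) / 2;
}
```

For j = 5 and n = 4, both functions return 18.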

7.9.4 展开循环

对于小型循环——即其主体只有几条语句的循环——处理循环所需的开销可能占总处理时间的一个重要比例。例如，看看以下 Pascal 代码及其相关的 x86-64 汇编语言代码：

 for i := 3 downto 0 do A[i] := 0;

          mov i, 3
          lea rcx, A
LoopLbl:
          mov ebx, i
          mov dword ptr [rcx][rbx * 4], 0
          dec i
          jns LoopLbl

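Each iteration of the loop executes four instructions. Only one of them does the desired work (storing 0 into an element of A); the remaining three control the loop. It therefore takes 16 instructions to do what logically requires only 4.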

While we could make many improvements to this loop based on the information presented thus far, consider carefully exactly what this loop is doing: it's storing four 0s into `A[0]` through `A[3]`. A more efficient approach is to use four `mov` instructions to accomplish the same task. For example, if `A` is an array of double words, the following code initializes `A` much faster than the preceding code:

```
          mov A[0], 0
          mov A[4], 0
          mov A[8], 0
          mov A[12], 0
```

Although this is a simple example, it shows the benefit of *loop unraveling* (also known as *loop unrolling*). If this simple loop appeared buried inside a set of nested loops, the 4:1 instruction reduction could possibly double the performance of that section of your program.

Of course, you cannot unravel all loops. Loops that execute a variable number of times are difficult to unravel because there is rarely a way to determine at assembly time the number of loop iterations. Therefore, unraveling a loop is a process best applied to loops that execute a known number of times, with the number of times known at assembly time.

Even if you repeat a loop a fixed number of iterations, it may not be a good candidate for loop unraveling. Loop unraveling produces impressive performance improvements when the number of instructions controlling the loop (and handling other overhead operations) represents a significant percentage of the total number of instructions in the loop. Had the previous loop contained 36 instructions in the body (exclusive of the four overhead instructions), the performance improvement would be, at best, only 10 percent (compared with the 300 to 400 percent it now enjoys). Therefore, the costs of unraveling a loop (all the extra code that must be inserted into your program) quickly reach a point of diminishing returns as the body of the loop grows larger or as the number of iterations increases. Furthermore, entering that code into your program can become quite a chore.
Therefore, loop unraveling is a technique best applied to small loops.

Note that the superscalar 80x86 chips (Pentium and later) have *branch-prediction hardware* and use other techniques to improve performance. Loop unrolling on such systems may actually *slow* the code because these processors are optimized to execute short loops. Whenever applying “improvements” to speed up your code, you should always measure the performance before and after to ensure there was sufficient gain to justify the change.

7.9.5 Using Induction Variables

Consider the following Pascal loop:

```
for i := 0 to 255 do csetVar[i] := [];
```

Here the program is initializing each element of an array of character sets to the empty set. The straightforward code to achieve this is the following:

```
          mov i, 0
          lea rcx, csetVar
FLp:

; Compute the index into the array (assume that each
; element of a csetVar array contains 16 bytes).

          mov ebx, i    ; Zero-extends into RBX!
          shl ebx, 4

; Set this element to the empty set (all 0 bits).

          xor rax, rax
          mov qword ptr [rcx][rbx], rax
          mov qword ptr [rcx][rbx + 8], rax

          inc i
          cmp i, 256
          jb  FLp
```

Although unraveling this code will still improve performance, it will take 1024 instructions to accomplish this task, too many for all but the most time-critical applications. However, you can reduce the execution time of the body of the loop by using induction variables. An *induction variable* is one whose value depends entirely on the value of another variable.

In the preceding example, the index into the array `csetVar` tracks the loop-control variable (it’s always equal to the value of the loop-control variable times 16). Because `i` doesn’t appear anywhere else in the loop, there is no sense in performing the computations on `i`. Why not operate directly on the array index value?
The following code demonstrates this technique:

```
          xor rbx, rbx      ; i * 16 in RBX
          xor rax, rax      ; Loop invariant
          lea rcx, csetVar  ; Base address of csetVar array
FLp:
          mov qword ptr [rcx][rbx], rax
          mov qword ptr [rcx][rbx + 8], rax

          add ebx, 16
          cmp ebx, 256 * 16
          jb  FLp
;         mov ebx, 256      ; If you care to maintain same semantics as C code
```

The induction that takes place in this example occurs when the code increments the loop-control variable (moved into EBX for efficiency) by 16 on each iteration of the loop rather than by 1. Multiplying the loop-control variable by 16 (and the final loop-termination constant value) allows the code to eliminate multiplying the loop-control variable by 16 on each iteration of the loop (that is, this allows us to remove the `shl` instruction from the previous code). Further, because this code no longer refers to the original loop-control variable (`i`), the code can maintain the loop-control variable strictly in the EBX register.

7.10 For More Information

*Write Great Code*, Volume 2, by this author (Second Edition, No Starch Press, 2020) provides a good discussion of the implementation of various HLL control structures in low-level assembly language. It also discusses optimizations such as induction, unrolling, strength reduction, and so on, that apply to optimizing loops.

7.11 Test Yourself

1. What are the two typical mechanisms for obtaining the address of a label appearing in a program?
2. What statement can you use to make all symbols global that appear within a procedure?
3. What statement can you use to make all symbols local that appear within a procedure?
4. What are the two forms of the indirect `jmp` instruction?
5. What is a state machine?
6. What is the general rule for converting a branch to its opposite branch?
7. What are the two exceptions to the rule for converting a branch to its opposite branch?
8. What is a trampoline?
9. What is the general syntax of the conditional move instruction?
10. What is the advantage of a conditional move instruction over a conditional jump?
11. What are some disadvantages of conditional moves?
12. Explain the difference between short-circuit and complete Boolean evaluation.
13. Convert the following `if` statements to assembly language sequences by using complete Boolean evaluation (assume all variables are unsigned 32-bit integer values):

```
if(x == y || z > t)
{
    `Do something`
}

if(x != y && z < t)
{
    `THEN statements`
}
else
{
    `ELSE statements`
}
```

14. Convert the preceding statements to assembly language by using short-circuit Boolean evaluation (assume all variables are signed 16-bit integer values).
15. Convert the following `switch` statements to assembly language (assume all variables are unsigned 32-bit integers):

```
switch(s)
{
    case 0:  ` case 0 code ` break;
    case 1:  ` case 1 code ` break;
    case 2:  ` case 2 code ` break;
    case 3:  ` case 3 code ` break;
}

switch(t)
{
    case 2:  ` case 0 code ` break;
    case 4:  ` case 4 code ` break;
    case 5:  ` case 5 code ` break;
    case 6:  ` case 6 code ` break;
    default: `Default code`
}

switch(u)
{
    case 10: ` case 10 code ` break;
    case 11: ` case 11 code ` break;
    case 12: ` case 12 code ` break;
    case 25: ` case 25 code ` break;
    case 26: ` case 26 code ` break;
    case 27: ` case 27 code ` break;
    default: `Default code`
}
```

16. Convert the following `while` loops to assembly code (assume all variables are signed 32-bit integers):

```
while(i < j)
{
    `Code for loop body`
}

while(i < j && k != 0)
{
    ` Code for loop body, part a`
    if(m == 5) continue;
    ` Code for loop body, part b`
    if(n < 6) break;
    ` Code for loop body, part c`
}

do
{
    `Code for loop body`
} while(i != j);

do
{
    ` Code for loop body, part a`
    if(m != 5) continue;
    ` Code for loop body, part b`
    if(n == 6) break;
    `Code for loop body, part c`
} while(i < j && k > j);

for(int i = 0; i < 10; ++i)
{
    `Code for loop body`
}
```

第八章：高级算术

本章介绍了扩展精度算术、不同大小操作数的算术运算以及十进制算术。通过本章的学习，你将知道如何对任何大小的整数操作数进行算术和逻辑运算，包括那些大于 64 位的操作数，并且如何将不同大小的操作数转换为兼容格式。最后，你将学习如何使用 x86-64 BCD 指令在 x87 FPU 上执行十进制算术，这使你能够在那些确实需要基数为 10 的操作的应用中使用十进制算术。

8.1 扩展精度运算

汇编语言相对于高级语言的一个大优势是，汇编语言不限制整数运算的大小。例如，标准的 C 编程语言定义了三种整数大小：short int、int和long int。^(1) 在 PC 上，这些通常是 16 位和 32 位整数。

尽管 x86-64 的机器指令限制你只能使用单条指令处理 8 位、16 位、32 位或 64 位整数，但你可以使用多条指令处理任何大小的整数。如果你想将 256 位整数相加，也不成问题。本节将介绍如何将各种算术和逻辑操作从 16 位、32 位或 64 位扩展到任意位数。

8.1.1 扩展精度加法

x86-64 的add指令将两个 8 位、16 位、32 位或 64 位的数字相加。执行add后，如果和的最高位（HO 位）溢出，x86-64 的进位标志将被设置。你可以利用这一信息执行扩展精度加法操作。^(2) 请考虑你手动执行多位数加法操作的方式（如图 8-1 所示）。

图 8-1：多位数相加

x86-64 处理扩展精度算术的方式与此相同，不同的是它不是每次添加一个数字，而是每次添加一个字节、字、双字或四字。考虑图 8-2 中的三个四字（192 位）加法操作。

图 8-2：将两个 192 位对象相加

如你所见，基本思路是将一个较大的操作分解为一系列较小的操作。由于 x86 处理器系列每次最多能相加 64 位（使用通用寄存器），因此该操作必须以 64 位或更少的块进行。以下是步骤：

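1. Add the two LO quad words together by using the add instruction, just as you would add the two LO digits in the manual algorithm. If the LO addition overflows, add sets the carry flag to 1; otherwise, it clears the carry flag.
2. Use the adc (add with carry) instruction to add the second pair of quad words in the two 192-bit values, plus the carry (if any) from the previous addition. The adc instruction has the same syntax as add and performs almost the same operation:
```
adc `dest`, `source` ; `dest` := `dest` + `source` + `C`
```
The only difference is that adc adds in the value of the carry flag along with the source and destination operands. It sets the flags the same way as add (including setting the carry flag on unsigned overflow). This is exactly what we need to add the middle two quad words of the 192-bit sum.
3. Use adc again to add the HO quad words of the 192-bit values, along with the carry out of the sum of the middle two quad words.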

总结来说，add指令将低位四字加在一起，而adc将所有其他四字对加在一起。在扩展精度加法序列结束时，进位标志指示无符号溢出（如果设置），溢出标志指示符号溢出，符号标志表示结果的符号。零标志在扩展精度加法结束时没有任何实际意义（它只是表示两个高位四字的和为 0，并不表示整个结果为 0）。

例如，假设你有两个 128 位的值需要相加，定义如下：

 .data
X       oword   ?
Y       oword   ?

假设你还想将结果存储到第三个变量Z中，它也是一个oword。以下 x86-64 代码将完成此任务：

mov rax, qword ptr X    ; Add together the LO 64 bits
add rax, qword ptr Y    ; of the numbers and store the
mov qword ptr Z, rax    ; result into the LO qword of Z

mov rax, qword ptr X[8] ; Add together (with carry) the
adc rax, qword ptr Y[8] ; HO 64 bits and store the result
mov qword ptr Z[8], rax ; into the HO qword of Z

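The first three instructions add the LO quad words of X and Y and store the result into the LO quad word of Z. The last three instructions add the HO quad words of X and Y, along with the carry out of the LO quad words, and store the result into the HO quad word of Z.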

请记住，X、Y和Z是oword对象（128 位），像mov rax, X这样的指令会尝试将 128 位值加载到 64 位寄存器中。要加载 64 位值，特别是低 64 位，qword ptr操作符会将符号X、Y和Z强制转换为 64 位。要加载高 64 位四字，你需要使用类似X[8]的地址表达式，并配合qword ptr操作符，因为 x86 内存空间按字节寻址，八个连续字节构成一个四字。

你可以通过使用adc将更高位的值加到一起，将这个算法扩展到任何位数。例如，要将声明为四个四字数组的两个 256 位值相加，你可以使用如下代码：

 .data
BigVal1 qword  4 dup (?)
BigVal2 qword  4 dup (?)
BigVal3 qword  4 dup (?)   ; Holds the sum
     .
     .
     .
; Note that there is no need for "qword ptr"
; because the base type of BigVal`x` is qword.

    mov rax, BigVal1[0]
    add rax, BigVal2[0]
    mov BigVal3[0], rax

    mov rax, BigVal1[8]
    adc rax, BigVal2[8]
    mov BigVal3[8], rax

    mov rax, BigVal1[16]
    adc rax, BigVal2[16]
    mov BigVal3[16], rax

    mov rax, BigVal1[24]
    adc rax, BigVal2[24]
    mov BigVal3[24], rax
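
The add/adc chain generalizes to any number of quad words. The following C sketch models it with an explicit carry variable, since C has no carry flag; add_limbs and its parameters are hypothetical names chosen for illustration, and each array holds its value least-significant quad word first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* sum = a + b over n 64-bit limbs, least-significant limb first.
   Returns the final carry (1 on unsigned overflow), like the carry flag. */
unsigned add_limbs(uint64_t *sum, const uint64_t *a, const uint64_t *b, size_t n)
{
    assert(sum != NULL && a != NULL && b != NULL);
    unsigned carry = 0;                     /* "add" starts with no carry in */
    for (size_t i = 0; i < n; ++i) {
        uint64_t partial = a[i] + b[i];     /* may wrap modulo 2^64 */
        unsigned c1 = partial < a[i];       /* carry out of a + b */
        uint64_t total = partial + carry;   /* add the carry in, as adc does */
        unsigned c2 = total < partial;      /* carry out of adding the carry */
        sum[i] = total;
        carry = c1 | c2;                    /* at most one of the two is set */
    }
    return carry;
}
```

Calling add_limbs with n = 4 corresponds to the 256-bit addition shown above.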

8.1.2 扩展精度减法

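As with addition, the x86-64 performs multi-byte subtraction the same way you do it manually, except that it subtracts whole bytes, words, double words, or quad words at a time rather than decimal digits. Use the sub instruction on the LO byte, word, double word, or quad word, and the sbb (subtract with borrow) instruction on the higher-order values.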

以下示例演示了一个使用 x86-64 上的 64 位寄存器的 128 位减法：

 .data
Left    oword   ?
Right   oword   ?
Diff    oword   ?
         .
         .
         .
    mov rax, qword ptr Left
    sub rax, qword ptr Right
    mov qword ptr Diff, rax

    mov rax, qword ptr Left[8]
    sbb rax, qword ptr Right[8]
    mov qword ptr Diff[8], rax

以下示例演示了一个 256 位减法：

 .data
BigVal1  qword 4 dup (?)
BigVal2  qword 4 dup (?)
BigVal3  qword 4 dup (?)
 .
     .
     .

; Compute BigVal3 := BigVal1 - BigVal2.

; Note: don't need to coerce types of
; BigVal1, BigVal2, or BigVal3 because
; their base types are already qword.

    mov rax, BigVal1[0]
    sub rax, BigVal2[0]
    mov BigVal3[0], rax

    mov rax, BigVal1[8]
    sbb rax, BigVal2[8]
    mov BigVal3[8], rax

    mov rax, BigVal1[16]
    sbb rax, BigVal2[16]
    mov BigVal3[16], rax

    mov rax, BigVal1[24]
    sbb rax, BigVal2[24]
    mov BigVal3[24], rax
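
The sub/sbb chain can be modeled in C the same way, with an explicit borrow standing in for the carry flag. This is a sketch, not part of the book's code: sub_limbs is a hypothetical name, and the arrays hold their values least-significant quad word first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* diff = a - b over n 64-bit limbs, least-significant limb first.
   Returns the final borrow (1 if a < b as unsigned values). */
unsigned sub_limbs(uint64_t *diff, const uint64_t *a, const uint64_t *b, size_t n)
{
    assert(diff != NULL && a != NULL && b != NULL);
    unsigned borrow = 0;                    /* "sub" starts with no borrow */
    for (size_t i = 0; i < n; ++i) {
        uint64_t partial = a[i] - b[i];     /* may wrap modulo 2^64 */
        unsigned b1 = a[i] < b[i];          /* borrow out of a - b */
        uint64_t total = partial - borrow;  /* subtract the borrow, as sbb does */
        unsigned b2 = partial < borrow;     /* borrow out of that step */
        diff[i] = total;
        borrow = b1 | b2;                   /* at most one of the two is set */
    }
    return borrow;
}
```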

8.1.3 扩展精度比较

不幸的是，没有“与借位比较”指令可以用来执行扩展精度比较。幸运的是，你可以通过仅使用cmp指令来比较扩展精度值，正如你将很快看到的那样。

考虑两个无符号值 2157h 和 1293h。这两个值的低字节不会影响比较结果。只需比较高字节，即 21h 和 12h，我们就可以知道第一个值大于第二个值。

你只需要在一对值的高字节相等时，查看这对值的两个字节。在所有其他情况下，比较高字节就足以告诉你这些值的一切。对于任意数量的字节，情况都是如此，而不仅仅是两个字节。以下代码通过首先比较它们的高四字（quad word）来比较两个有符号的 128 位整数，只有在高四字相等的情况下，才会比较它们的低四字：

; This sequence transfers control to location "IsGreater" if
; QWordValue > QWordValue2. It transfers control to "IsLess" if
; QWordValue < QWordValue2. It falls through to the instruction
; following this sequence if QWordValue = QWordValue2.
; To test for inequality, change the "IsGreater" and "IsLess"
; operands to "NotEqual" in this code.

        mov rax, qword ptr QWordValue[8]  ; Get HO qword
        cmp rax, qword ptr QWordValue2[8]
        jg  IsGreater
        jl  IsLess

        mov rax, qword ptr QWordValue[0]  ; If HO qwords equal,
        cmp rax, qword ptr QWordValue2[0] ; then we must compare
        ja  IsGreater                     ; the LO qwords, which hold
        jb  IsLess                        ; no sign bit (so: unsigned)

; Fall through to this point if the two values were equal.

要比较无符号值，可以使用ja和jb指令来代替jg和jl。

你可以通过前述序列合成任何比较，如下所示，演示了有符号比较；如果你想进行无符号比较，只需将jg、jge、jl和jle分别替换为ja、jae、jb和jbe。以下每个示例假设有这些声明：

 .data
OW1     oword  ?
OW2     oword  ?

OW1q    textequ <qword ptr OW1>
OW2q    textequ <qword ptr OW2>

以下代码实现了一个 128 位测试，检查OW1 < OW2（有符号）。如果OW1 < OW2，控制会转移到IsLess标签。如果不成立，控制会继续执行下一条语句：

    mov rax, OW1q[8]    ; Get HO qword
    cmp rax, OW2q[8]
    jg  NotLess
    jl  IsLess

    mov rax, OW1q[0]    ; Fall through to here if the HO
    cmp rax, OW2q[0]    ; qwords are equal; the LO qwords
    jb  IsLess          ; always compare as unsigned
NotLess:

Here is a 128-bit test to see if OW1 <= OW2 (signed). It jumps to IsLessEQ if the condition is true:

    mov rax, OW1q[8]    ; Get HO qword
    cmp rax, OW2q[8]
    jg  NotLessEQ
    jl  IsLessEQ

    mov rax, OW1q[0]    ; Fall through to here if the HO
    cmp rax, OW2q[0]    ; qwords are equal; the LO qwords
    jbe IsLessEQ        ; always compare as unsigned
NotLessEQ:

这是一个 128 位测试，检查OW1 > OW2（有符号）。如果此条件成立，程序会跳转到IsGtr：

    mov rax, OW1q[8]    ; Get HO qword
    cmp rax, OW2q[8]
    jg  IsGtr
    jl  NotGtr

    mov rax, OW1q[0]    ; Fall through to here if the HO
    cmp rax, OW2q[0]    ; qwords are equal; the LO qwords
    ja  IsGtr           ; always compare as unsigned
NotGtr:

以下是一个 128 位测试，检查OW1 >= OW2（有符号）。如果成立，代码会跳转到标签IsGtrEQ：

    mov rax, OW1q[8]    ; Get HO qword
    cmp rax, OW2q[8]
    jg  IsGtrEQ
    jl  NotGtrEQ

    mov rax, OW1q[0]    ; Fall through to here if the HO
    cmp rax, OW2q[0]    ; qwords are equal; the LO qwords
    jae IsGtrEQ         ; always compare as unsigned
NotGtrEQ:

这是一个 128 位测试，检查OW1 == OW2（有符号或无符号）。如果OW1 == OW2，代码会跳转到标签IsEqual。如果它们不相等，程序会继续执行下一条指令：

    mov rax, OW1q[8]    ; Get HO qword
    cmp rax, OW2q[8]
    jne NotEqual

    mov rax, OW1q[0]    ; Fall through to here if the HO
    cmp rax, OW2q[0]    ; qwords are equal
    je  IsEqual
NotEqual:

以下是一个 128 位测试，检查OW1 != OW2（有符号或无符号）。如果OW1 != OW2，该代码会跳转到标签IsNotEqual。如果它们相等，则继续执行下一条指令：

    mov rax, OW1q[8]    ; Get HO qword
    cmp rax, OW2q[8]
    jne IsNotEqual

    mov rax, OW1q[0]    ; Fall through to here if the HO
    cmp rax, OW2q[0]    ; qwords are equal
    jne IsNotEqual

; Fall through to this point if they are equal.

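To generalize the preceding code to objects larger than 128 bits, start the comparison with the objects' HO quad words and work down toward their LO quad words, continuing only as long as the corresponding quad words are equal. The following example compares two 256-bit values to see if the first is less than or equal (unsigned) to the second: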

 .data
Big1    qword  4 dup (?)
Big2    qword  4 dup (?)
         .
         .
         .
        mov rax, Big1[24]
        cmp rax, Big2[24]
        jb  isLE
        ja  notLE

        mov rax, Big1[16]
        cmp rax, Big2[16]
        jb  isLE
        ja  notLE

        mov rax, Big1[8]
        cmp rax, Big2[8]
        jb  isLE
        ja  notLE

        mov  rax, Big1[0]
        cmp  rax, Big2[0]
        jnbe notLE
isLE:
        `Code to execute if Big1 <= Big2`
          .
          .
          .
notLE:
        `Code to execute if Big1 > Big2`
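
The same top-down comparison can be sketched in C for values of any length. The function names are hypothetical, and the arrays hold their values least-significant quad word first. The signed variant treats only the most-significant quad word as signed; every lower quad word is compared as unsigned, because it carries no sign bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned compare of n-limb values. Returns -1, 0, or +1. */
int cmp_unsigned(const uint64_t *a, const uint64_t *b, size_t n)
{
    for (size_t i = n; i-- > 0; ) {         /* start at the HO limb */
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;    /* first difference decides */
    }
    return 0;                               /* all limbs equal */
}

/* Signed (two's complement) compare of n-limb values. */
int cmp_signed(const uint64_t *a, const uint64_t *b, size_t n)
{
    assert(n > 0);                          /* a signed value needs a HO limb */
    const uint64_t sign = (uint64_t)1 << 63;
    /* Flipping the sign bit maps signed order onto unsigned order. */
    uint64_t ha = a[n - 1] ^ sign;
    uint64_t hb = b[n - 1] ^ sign;
    if (ha != hb)
        return ha < hb ? -1 : 1;
    return cmp_unsigned(a, b, n - 1);       /* lower limbs: unsigned */
}
```

For example, with two limbs, the value whose low limb is all 1 bits and whose high limb is 0 compares greater than 1 under both functions, even though its low limb would look negative to a signed 64-bit comparison.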

8.1.4 扩展精度乘法

尽管 8×8、16×16、32×32 或 64×64 位乘法通常已经足够，但有时你可能需要将更大的值相乘。你可以使用 x86-64 单操作数的 mul 和 imul 指令进行扩展精度乘法操作，使用你在手动乘法时采用的相同技术。考虑一下你手动执行多位数乘法的方式（图 8-3）。

图 8-3：多位数乘法

x86-64 在执行扩展精度乘法时使用相同的方法，唯一的区别是它处理的是字节、字、双字和四字，而不是数字，如图 8-4 所示。

执行扩展精度乘法时，最重要的一点是你还必须同时执行扩展精度加法。将所有部分积相加需要多次加法。

图 8-4：扩展精度乘法

清单 8-1 演示了如何使用 32 位指令将两个 64 位值相乘（得到一个 128 位结果）。从技术上讲，你可以使用单条指令执行 64 位乘法，但这个例子展示了一种方法，你可以通过使用 x86-64 的 64 位寄存器，而不是 32 位寄存器，轻松地将其扩展到 128 位。

; Listing 8-1

; 128-bit multiplication.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 8-1", 0
fmtStr1     byte    "%d * %d = %I64d (verify:%I64d)", nl, 0

 .data
op1         qword   123456789
op2         qword   234567890
product     oword   ?
product2    oword   ?

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; mul64 - Multiplies two 64-bit values passed in RDX and RAX by
;         doing a 64x64-bit multiplication, producing a 128-bit result.
;         Algorithm is easily extended to 128x128 bits by switching the
;         32-bit registers for 64-bit registers.

; Stores result to location pointed at by R8.

mul64       proc
mp          equ     <dword ptr [rbp - 8]>     ; Multiplier
mc          equ     <dword ptr [rbp - 16]>    ; Multiplicand
prd         equ     <dword ptr [r8]>          ; Result

            push    rbp
            mov     rbp, rsp
            sub     rsp, 24

            push    rbx     ; Preserve these register values
            push    rcx

; Save parameters passed in registers:

            mov     qword ptr mp, rax
            mov     qword ptr mc, rdx

; Multiply the LO dword of multiplier times multiplicand.

            mov eax, mp
            mul mc          ; Multiply LO dwords
            mov prd, eax    ; Save LO dword of product
            mov ecx, edx    ; Save HO dword of partial product result

            mov eax, mp
            mul mc[4]       ; Multiply mp(LO) * mc(HO)
            add eax, ecx    ; Add to the partial product
            adc edx, 0      ; Don't forget the carry!
            mov ebx, eax    ; Save partial product for now
            mov ecx, edx

; Multiply the HO dword of multiplier with multiplicand.

            mov eax, mp[4]  ; Get HO dword of multiplier
            mul mc          ; Multiply by LO dword of multiplicand
            add eax, ebx    ; Add to the partial product
            mov prd[4], eax ; Save the partial product
            adc ecx, edx    ; Add in the carry!
            mov ebx, 0      ; mov leaves the flags alone, so this
            adc ebx, 0      ; captures the carry out of the adc above

            mov eax, mp[4]  ; Multiply the two HO dwords together
            mul mc[4]
            add eax, ecx    ; Add in partial product
            adc edx, ebx    ; Don't forget the carries!

            mov prd[8], eax ; Save HO qword of result
            mov prd[12], edx

; EDX:EAX contains the HO 64 bits of the result at this point.

            pop     rcx     ; Restore these registers
            pop     rbx
            leave
            ret    
mul64       endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64         ; Shadow storage

; Test the mul64 function:

            mov     rax, op1
            mov     rdx, op2
            lea     r8, product
            call    mul64

; Use a 64-bit multiply to test the result:

            mov     rax, op1
            mov     rdx, op2
            imul    rax, rdx
            mov     qword ptr product2, rax

; Print the results:

            lea     rcx, fmtStr1
            mov     rdx, op1
            mov     r8,  op2
            mov     r9,  qword ptr product
            mov     rax, qword ptr product2
            mov     [rsp + 32], rax
            call    printf

            leave
            ret     ; Returns to caller

asmMain     endp
            end

清单 8-1：扩展精度乘法

该代码仅适用于无符号操作数。要将两个带符号值相乘，你必须在乘法前注意操作数的符号，取两个操作数的绝对值，进行无符号乘法，然后根据原始操作数的符号调整结果积的符号。带符号操作数的乘法留给你自己完成。

清单 8-1 中的示例相当直接，因为可以将部分积保存在不同的寄存器中。如果你需要将更大的值相乘，你将需要在临时（内存）变量中保存部分积。除此之外，清单 8-1 使用的算法可以推广到任意数量的双字。

8.1.5 扩展精度除法

你不能通过使用 div 和 idiv 指令合成一个通用的 n-位 / m-位除法操作——尽管一个较不通用的操作，将一个 n-位数除以一个 64 位数，可以通过使用 div 指令完成。一个通用的扩展精度除法需要一系列的移位和减法指令（这需要相当多的指令，运行速度较慢）。本节介绍了两种方法（使用 div 和移位减法）来进行扩展精度除法。

8.1.5.1 使用 div 指令的特殊情况形式

将 128 位数除以 64 位数是由div和idiv指令直接处理的，只要结果商能够适应 64 位寄存器。然而，如果商无法适应 64 位，则必须执行扩展精度除法。

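For example, suppose you want to divide 0004_0000_0000_0000_1234h by 2. The naive approach would look something like the following (assuming the value is held in a pair of quad-word variables named dividend, and divisor is a quad word containing 2):

; This code does *NOT* work!

mov rax, qword ptr dividend[0]    ; Get dividend into RDX:RAX
mov rdx, qword ptr dividend[8]
div divisor                       ; Divide RDX:RAX by divisor

Although this code is syntactically correct and will assemble, it raises a divide error exception when run. With div, the quotient must fit into the RAX register, and 2_0000_0000_0000_091Ah will not fit because it is a 66-bit quantity (try dividing by 8 if you want to see it produce a result that does fit).

Instead, the trick is to divide the (zero- or sign-extended) HO quad word of the dividend by the divisor and then repeat the process with the remainder and the LO quad word of the dividend, as follows:

          .data
dividend  qword    1234h, 4
divisor   qword    2      ; dividend/divisor = 2_0000_0000_0000_091Ah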
quotient  qword    2 dup (?)
remainder qword    ?
     .
     .
     .
    mov rax, dividend[8]
    xor edx, edx          ; Zero-extend for unsigned division
    div divisor
    mov quotient[8], rax  ; Save HO qword of the quotient
    mov rax, dividend[0]  ; This code doesn't zero-extend
    div divisor           ; RAX into RDX before div instr
    mov quotient[0], rax  ; Save LO qword of the quotient (91Ah)
    mov remainder, rdx    ; Save the remainder

quotient变量是 128 位，因为结果可能需要与被除数一样多的位数（例如，如果除以 1）。无论dividend和divisor操作数的大小如何，余数最多只有 64 位（在这种情况下）。因此，本示例中的remainder变量只是一个四字。

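The code computes the 128/64 quotient by first computing the 64/64 quotient of dividend[8] / divisor. The quotient from this first division becomes the HO quad word of the final quotient. The remainder from this division becomes the extension in RDX for the second half of the division operation. The second part of the code divides rdx:dividend[0] by divisor to produce the LO quad word of the quotient and the remainder from the division. The code does not zero-extend RAX into RDX before the second div instruction, because RDX already contains valid bits that must not be disturbed.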

上述 128 / 64 除法操作是通用除法算法的一个特例，用于将任意大小的值除以 64 位除数。通用算法如下：

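1. Move the HO quad word of the dividend into RAX and zero-extend it into RDX.
2. Divide by the divisor.
3. Store the value in RAX into the corresponding quad word position of the quotient result variable (the position of the dividend quad word loaded into RAX prior to the division).
4. Load RAX with the next-lower quad word in the dividend, without modifying RDX.
5. Repeat steps 2 to 4 until you've processed all the quad words in the dividend.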

最后，RDX 寄存器将包含余数，商将出现在目标变量中，即步骤 3 存储结果的地方。Listing 8-2 展示了如何用 64 位除数除以 256 位数，从而得到一个 256 位商和一个 64 位余数。

; Listing 8-2

; 256-bit by 64-bit division.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 8-2", 0
fmtStr1     byte    "quotient  = "
            byte    "%08x_%08x_%08x_%08x_%08x_%08x_%08x_%08x"
            byte    nl, 0

fmtStr2     byte    "remainder = %I64x", nl, 0

            .data

; op1 is a 256-bit value. Initial values were chosen
; to make it easy to verify the result.

op1         oword   2222eeeeccccaaaa8888666644440000h
            oword   2222eeeeccccaaaa8888666644440000h

op2         qword   2
result      oword   2 dup (0) ; Also 256 bits
remain      qword   0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; div256 - Divides a 256-bit number by a 64-bit number.

; Dividend  - passed by reference in RCX.
; Divisor   - passed in RDX.

; Quotient  - passed by reference in R8.
; Remainder - passed by reference in R9.

div256      proc
divisor     equ     <qword ptr [rbp - 8]>
dividend    equ     <qword ptr [rcx]>
quotient    equ     <qword ptr [r8]>
remainder   equ     <qword ptr [r9]>

            push    rbp
            mov     rbp, rsp
            sub     rsp, 8

            mov     divisor, rdx

            mov     rax, dividend[24]  ; Begin div with HO qword
            xor     rdx, rdx           ; Zero-extend into RDX
            div     divisor            ; Divide HO qword
            mov     quotient[24], rax  ; Save HO result

            mov     rax, dividend[16]  ; Get dividend qword #2
            div     divisor            ; Continue with division
            mov     quotient[16], rax  ; Store away qword #2

            mov     rax, dividend[8]   ; Get dividend qword #1
            div     divisor            ; Continue with division
            mov     quotient[8], rax   ; Store away qword #1

            mov     rax, dividend[0]   ; Get LO dividend qword
            div     divisor            ; Continue with division
            mov     quotient[0], rax   ; Store away LO qword

            mov     remainder, rdx     ; Save remainder

            leave
            ret
div256      endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 80         ; Shadow storage

; Test the div256 function:

            lea     rcx, op1
            mov     rdx, op2
            lea     r8, result
            lea     r9, remain
            call    div256

; Print the results:

            lea     rcx, fmtStr1
            mov     edx, dword ptr result[28]
            mov     r8d, dword ptr result[24]
            mov     r9d, dword ptr result[20]
            mov     eax, dword ptr result[16]
            mov     [rsp + 32], rax
            mov     eax, dword ptr result[12]
            mov     [rsp + 40], rax
            mov     eax, dword ptr result[8]
            mov     [rsp + 48], rax
            mov     eax, dword ptr result[4]
            mov     [rsp + 56], rax
            mov     eax, dword ptr result[0]
            mov     [rsp + 64], rax
            call    printf

            lea     rcx, fmtStr2
            mov     rdx, remain
            call    printf

            leave
            ret    ; Returns to caller

asmMain     endp
            end

Listing 8-2: Unsigned 256/64-bit extended-precision division

这是构建命令和程序输出（请注意，你可以通过简单地查看结果来验证除法是否正确，注意每个数字是原始值的一半）：

C:\>**build listing8-2**

C:\>**echo off**
 Assembling: listing8-2.asm
c.cpp

C:\>**listing8-2**
Calling Listing 8-2:
quotient  = 11117777_66665555_44443333_22220000_11117777_66665555_44443333_22220000
remainder = 0
Listing 8-2 terminated

你可以通过向序列中添加更多的 mov-div-mov 指令来扩展此代码以支持任意位数。像上一节的扩展精度乘法一样，这个扩展精度除法算法只适用于无符号操作数。要除以两个带符号的数，必须注意它们的符号，取它们的绝对值，进行无符号除法，然后根据操作数的符号设置结果的符号。
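
The mov-div-mov pattern can be modeled in portable C by scaling it down one size: standard C has no 128/64-bit divide, so this sketch uses 32-bit limbs with a 64-bit intermediate in place of RDX:RAX. The name div_limbs and its parameters are hypothetical, and the arrays hold their values least-significant limb first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Divides an n-limb unsigned value (32-bit limbs) by a 32-bit divisor.
   Writes the n-limb quotient and returns the remainder. */
uint32_t div_limbs(uint32_t *quotient, const uint32_t *dividend,
                   size_t n, uint32_t divisor)
{
    assert(divisor != 0);
    uint64_t rem = 0;                               /* plays the role of RDX */
    for (size_t i = n; i-- > 0; ) {                 /* start at the HO limb */
        uint64_t cur = (rem << 32) | dividend[i];   /* like RDX:RAX */
        quotient[i] = (uint32_t)(cur / divisor);    /* fits: rem < divisor */
        rem = cur % divisor;                        /* carried to next limb */
    }
    return (uint32_t)rem;
}
```

Because the running remainder is always smaller than the divisor, each limb's quotient fits in a single limb, which is the property that keeps the div instruction from raising a divide error in the assembly version.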

8.1.5.2 通用 N 位除以 M 位

要使用大于 64 位的除数，必须通过使用移位和减法策略来实现除法，这种方法有效，但非常慢。与乘法一样，理解计算机如何执行除法的最佳方式是研究你是如何学习手工做长除法的。考虑操作 3456 / 12 以及你手动执行此操作时的步骤，如图 8-5 所示。

图 8-5：手动逐位除法操作

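This algorithm is actually easier in binary, because at each step you don't have to guess how many times 12 goes into the remainder, nor do you have to multiply 12 by your guess to obtain the amount to subtract. At each step in the binary algorithm, the divisor goes into the remainder exactly zero or one time. As an example, consider the division of 27 (11011) by 3 (11), shown in Figure 8-6.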

图 8-6：二进制长除法

以下算法以同时计算商和余数的方式实现了这个二进制除法操作：

Quotient := Dividend;
Remainder := 0;
for i := 1 to NumberBits do

    Remainder:Quotient := Remainder:Quotient SHL 1;
    if Remainder >= Divisor then

        Remainder := Remainder - Divisor;
        Quotient := Quotient + 1;

    endif
endfor

NumberBits 是 Remainder（余数）、Quotient（商）、Divisor（除数）和 Dividend（被除数）变量中的位数。SHL 是左移操作符。Quotient := Quotient + 1; 语句将 Quotient 的最低有效位设置为 1，因为该算法之前将 Quotient 左移了 1 位。清单 8-3 实现了这个算法。
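
Before reading the assembly version, it may help to see the algorithm in C on 64-bit operands. This is a minimal sketch with hypothetical names. It follows the pseudocode step for step, with one addition that is an assumption of this sketch: it also records the bit shifted out of the remainder, which matters only when the divisor uses its top bit.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-subtract division of two unsigned 64-bit values. */
void shift_subtract_div(uint64_t dividend, uint64_t divisor,
                        uint64_t *quotient, uint64_t *remainder)
{
    assert(divisor != 0);
    uint64_t q = dividend;                   /* Quotient := Dividend */
    uint64_t r = 0;                          /* Remainder := 0 */
    for (int i = 0; i < 64; ++i) {           /* NumberBits = 64 */
        /* Remainder:Quotient := Remainder:Quotient SHL 1 */
        unsigned out = (unsigned)(r >> 63);  /* bit shifted out of Remainder */
        r = (r << 1) | (q >> 63);
        q <<= 1;
        if (out || r >= divisor) {
            r -= divisor;                    /* wraps to the right value if out == 1 */
            q |= 1;                          /* Quotient := Quotient + 1 */
        }
    }
    *quotient = q;
    *remainder = r;
}
```

Dividing 3456 by 12 with this function gives a quotient of 288 and a remainder of 0, matching the manual example.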

; Listing 8-3

; 128-bit by 128-bit division.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 8-3", 0
fmtStr1     byte    "quotient  = "
            byte    "%08x_%08x_%08x_%08x"
            byte    nl, 0

fmtStr2     byte    "remainder = "
            byte    "%08x_%08x_%08x_%08x"
            byte    nl, 0

fmtStr3     byte    "quotient (2)  = "
            byte    "%08x_%08x_%08x_%08x"
            byte    nl, 0

             .data

; op1 is a 128-bit value. Initial values were chosen
; to make it easy to verify the result.

op1         oword   2222eeeeccccaaaa8888666644440000h
op2         oword   2
op3         oword   11117777666655554444333322220000h
result      oword   ?
remain      oword   ?

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; div128 - This procedure does a general 128 / 128 division operation
;          using the following algorithm (all variables are assumed
;          to be 128-bit objects).

; Quotient := Dividend;
; Remainder := 0;
; for i := 1 to NumberBits do

;    Remainder:Quotient := Remainder:Quotient SHL 1;
;    if Remainder >= Divisor then

;      Remainder := Remainder - Divisor;
;      Quotient := Quotient + 1;

; endif
; endfor

; Data passed:

; 128-bit dividend, by reference in RCX.
; 128-bit divisor, by reference in RDX.

; Data returned:

; Pointer to 128-bit quotient in R8.
; Pointer to 128-bit remainder in R9.

div128      proc
remainder   equ     <[rbp - 16]>
dividend    equ     <[rbp - 32]>
quotient    equ     <[rbp - 32]>    ; Aliased to dividend
divisor     equ     <[rbp - 48]>

            push    rbp
            mov     rbp, rsp
            sub     rsp, 48

            push    rax
            push    rcx

            xor     rax, rax        ; Initialize remainder to 0
            mov     remainder, rax
            mov     remainder[8], rax

; Copy the dividend to local storage:

            mov     rax, [rcx]
            mov     dividend, rax
            mov     rax, [rcx+8]
            mov     dividend[8], rax

; Copy the divisor to local storage:

            mov     rax, [rdx]
            mov     divisor, rax
            mov     rax, [rdx + 8]
            mov     divisor[8], rax

            mov     cl, 128         ; Count off bits in CL

; Compute Remainder:Quotient := Remainder:Quotient SHL 1:

repeatLp:   shl     qword ptr dividend[0], 1  ; 256-bit extended-
            rcl     qword ptr dividend[8], 1  ; precision shift
            rcl     qword ptr remainder[0], 1 ; through remainder
            rcl     qword ptr remainder[8], 1

; Do a 128-bit comparison to see if the remainder
; is greater than or equal to the divisor.

            mov     rax, remainder[8]
            cmp     rax, divisor[8]
            ja      isGE
            jb      notGE

            mov     rax, remainder
            cmp     rax, divisor
            ja      isGE
            jb      notGE

; Remainder := Remainder - Divisor;

isGE:       mov     rax, divisor
            sub     remainder, rax
            mov     rax, divisor[8]
            sbb     remainder[8], rax

; Quotient := Quotient + 1;

            add     qword ptr quotient, 1
            adc     qword ptr quotient[8], 0

notGE:      dec     cl
            jnz     repeatLp

; Okay, copy the quotient (left in the dividend variable)
; and the remainder to their return locations.

            mov     rax, quotient[0]
            mov     [r8], rax
            mov     rax, quotient[8]
            mov     [r8][8], rax

            mov     rax, remainder[0]
            mov     [r9], rax
            mov     rax, remainder[8]
            mov     [r9][8], rax

            pop     rcx
            pop     rax
            leave
            ret

div128      endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64        ; Shadow storage

; Test the div128 function:

            lea     rcx, op1
            lea     rdx, op2
            lea     r8, result
            lea     r9, remain
            call    div128

; Print the results:

            lea     rcx, fmtStr1
            mov     edx, dword ptr result[12]
            mov     r8d, dword ptr result[8]
            mov     r9d, dword ptr result[4]
            mov     eax, dword ptr result[0]
            mov     [rsp + 32], rax
            call    printf

            lea     rcx, fmtStr2
            mov     edx, dword ptr remain[12]
            mov     r8d, dword ptr remain[8]
            mov     r9d, dword ptr remain[4]
            mov     eax, dword ptr remain[0]
            mov     [rsp + 32], rax
            call    printf

; Test the div128 function:

            lea     rcx, op1
            lea     rdx, op3
            lea     r8, result
            lea     r9, remain
            call    div128

; Print the results:

            lea     rcx, fmtStr3
            mov     edx, dword ptr result[12]
            mov     r8d, dword ptr result[8]
            mov     r9d, dword ptr result[4]
            mov     eax, dword ptr result[0]
            mov     [rsp + 32], rax
            call    printf

            lea     rcx, fmtStr2
            mov     edx, dword ptr remain[12]
            mov     r8d, dword ptr remain[8]
            mov     r9d, dword ptr remain[4]
            mov     eax, dword ptr remain[0]
            mov     [rsp + 32], rax
            call    printf

            leave
            ret    ; Returns to caller

asmMain     endp
            end

清单 8-3：扩展精度除法

这是构建命令和程序输出：

C:\>**build listing8-3**

C:\>**echo off**
 Assembling: listing8-3.asm
c.cpp

C:\>**listing8-3**
Calling Listing 8-3:
quotient  = 11117777_66665555_44443333_22220000
remainder = 00000000_00000000_00000000_00000000
quotient (2)  = 00000000_00000000_00000000_00000002
remainder = 00000000_00000000_00000000_00000000
Listing 8-3 terminated

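This code does not check for division by 0 (if you attempt to divide by 0, the loop leaves every bit of the quotient set, that is, 0FFFF_FFFF_FFFF_FFFFh in each quad word); it handles only unsigned values, and it is very slow (one or two orders of magnitude slower than the div and idiv instructions). To handle division by 0, check the divisor against 0 before running this code and return an appropriate error code if the divisor is 0. Signed values are handled the same way as in the earlier division algorithm: note the signs, take the absolute values of the operands, do the unsigned division, and then fix up the sign of the result.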

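You can boost the performance of this division considerably with the following trick. Check whether the divisor uses only 64 bits. Often, even though the divisor is a 128-bit variable, the value itself fits in 64 bits (that is, the HO quad word of the divisor is 0), and you can use the div instruction, which is much faster. The improved algorithm is a bit more complex because you first have to compare the HO quad word against 0, but on average it runs much faster while still being able to divide any pair of values.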
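The shift-and-subtract loop, together with the divide-by-zero check, can be cross-checked against a small C reference model. This is only a sketch: the `u128` struct and the name `div128_model` are hypothetical and are not part of the book's source files.

```c
#include <stdint.h>

/* Hypothetical 128-bit unsigned type: lo = bits 0-63, hi = bits 64-127. */
typedef struct { uint64_t lo, hi; } u128;

/* Returns 0 on success, -1 if the divisor is 0 (outputs are left untouched). */
int div128_model(u128 dividend, u128 divisor, u128 *quotient, u128 *remainder)
{
    if (divisor.lo == 0 && divisor.hi == 0)
        return -1;                          /* reject division by zero up front */

    u128 q = dividend;                      /* Quotient := Dividend */
    u128 r = { 0, 0 };                      /* Remainder := 0 */

    for (int i = 0; i < 128; i++) {
        /* Remainder:Quotient := Remainder:Quotient SHL 1 (256-bit shift) */
        r.hi = (r.hi << 1) | (r.lo >> 63);
        r.lo = (r.lo << 1) | (q.hi >> 63);
        q.hi = (q.hi << 1) | (q.lo >> 63);
        q.lo <<= 1;

        /* if Remainder >= Divisor (128-bit unsigned comparison) */
        if (r.hi > divisor.hi || (r.hi == divisor.hi && r.lo >= divisor.lo)) {
            uint64_t borrow = (r.lo < divisor.lo);
            r.lo -= divisor.lo;
            r.hi = r.hi - divisor.hi - borrow;
            q.lo |= 1;                      /* Quotient + 1: bit 0 is 0 after the shift */
        }
    }
    *quotient = q;
    *remainder = r;
    return 0;
}
```

Feeding the model the same operands as the listing (op1 divided by 2, then op1 divided by op3) should reproduce the quotients shown in the program output.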

8.1.6 扩展精度取反运算

neg 指令没有提供通用的扩展精度形式。然而，取反等同于从 0 中减去一个值，因此我们可以通过使用 sub 和 sbb 指令轻松地模拟扩展精度取反。以下代码提供了一种简单的方法，通过使用扩展精度减法将一个（320 位）值从 0 中减去来实现取反：

 .data
Value  qword 5 dup (?) ; 320-bit value
        .
        .
        .
    xor rax, rax       ; RAX = 0
    sub rax, Value
    mov Value, rax

    mov eax, 0         ; Cannot use XOR here:
    sbb rax , Value[8] ; must preserve carry!
    mov Value[8], rax

    mov eax, 0         ; Zero-extends!
    sbb rax, Value[16]
    mov Value[16], rax

    mov eax, 0
    sbb rax, Value[24]
    mov Value[24], rax

    mov rax, 0
    sbb rax, Value[32]
    mov Value[32], rax

一种稍微更高效的方法来取反较小的值（128 位）是使用 neg 和 sbb 指令的组合。这个技巧利用了 neg 将操作数从 0 中减去的事实。特别地，它设置标志位的方式与 sub 指令将目标值从 0 中减去时相同。此代码的形式如下（假设你想要取反 RDX:RAX 中的 128 位值）：

neg rdx
neg rax
sbb rdx, 0

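The first two instructions negate the HO and LO quad words of the 128-bit value. However, if negating the LO quad word produces a borrow (think of neg rax as subtracting RAX from 0, which can produce a carry/borrow), that borrow is not subtracted from the HO quad word. The sbb instruction at the end of the sequence subtracts nothing from RDX if no borrow occurred when negating RAX; it subtracts 1 from RDX if a borrow was needed.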

经过大量的工作，实际上可以将此方案扩展到超过 128 位。然而，在 256 位左右（当然，一旦超过 256 位），实际上使用从零减去的通用方案所需的指令更少。
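To see why the neg/neg/sbb sequence yields the correct 128-bit negation, it can help to mirror the three instructions in C. This is a sketch under the assumption that the value is held as two 64-bit halves; the `u128` struct and `neg128_model` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical 128-bit type: lo models RAX, hi models RDX. */
typedef struct { uint64_t lo, hi; } u128;

u128 neg128_model(u128 v)
{
    u128 r;
    r.hi = 0 - v.hi;                 /* neg rdx */
    r.lo = 0 - v.lo;                 /* neg rax */
    /* neg sets the carry (borrow) flag unless its operand was 0. */
    uint64_t borrow = (v.lo != 0);
    r.hi -= borrow;                  /* sbb rdx, 0 */
    return r;
}
```

Negating 1 gives all ones in both halves, while negating 2^64 (lo = 0, hi = 1) gives lo = 0 and hi = all ones, because no borrow propagates out of a zero LO quad word.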

8.1.7 扩展精度与运算

执行 n 字节与运算很简单：只需对两个操作数的相应字节进行与运算，并保存结果。例如，要对所有 128 位长的操作数执行与运算，你可以使用以下代码：

mov rax,  qword ptr source1
and rax,  qword ptr source2
mov qword ptr dest, rax

mov rax,  qword ptr source1[8]
and rax,  qword ptr source2[8]
mov qword ptr dest[8], rax

为了将此技巧扩展到任意数量的四字，你可以在操作数中逻辑地将相应的字节、字、双字或四字进行与运算。

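This sequence sets the flags according to the result of the last and operation. If you AND the HO quad words last, this correctly sets all the flags except the zero flag. If you need to test the zero flag after this sequence, logically OR the two resulting quad words together (or otherwise compare both of them against 0).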

8.1.8 扩展精度或运算

多字节的逻辑或（OR）操作与多字节与（AND）操作的执行方式相同。你将两个操作数中对应的字节进行按位或运算。例如，要对两个 192 位的值进行逻辑或运算，可以使用以下代码：

mov rax,  qword ptr source1
or  rax,  qword ptr source2
mov qword ptr dest, rax

mov rax,  qword ptr source1[8]
or  rax,  qword ptr source2[8]
mov qword ptr dest[8], rax

mov rax,  qword ptr source1[16]
or  rax,  qword ptr source2[16]
mov qword ptr dest[16], rax

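As in the previous example, this does not set the zero flag correctly for the entire operation. If you need to test the zero flag after an extended-precision OR, you must logically OR all the resulting quad words together.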

8.1.9 扩展精度异或操作

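As with the other logical operations, an extended-precision XOR exclusive-ORs the corresponding bytes of the two operands to produce the extended-precision result. The following code sequence operates on two 128-bit operands, computes their exclusive-OR, and stores the result into a 128-bit variable: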

mov rax,  qword ptr source1
xor rax,  qword ptr source2
mov qword ptr dest, rax

mov rax,  qword ptr source1[8]
xor rax,  qword ptr source2[8]
mov qword ptr dest[8], rax

前两个部分关于零标志的评论，以及关于 XMM 和 YMM 寄存器的评论，在此处同样适用。

8.1.10 扩展精度 NOT 操作

not 指令会反转指定操作数中的所有位。通过在所有受影响的操作数上执行 not 指令来执行扩展精度的 NOT 操作。例如，要对 RDX:RAX 中的值执行 128 位的 NOT 操作，请执行以下指令：

not rax
not rdx

请记住，如果执行两次 NOT 指令，结果将返回原始值。此外，使用全 1（0FFh、0FFFFh、0FFFF_FFFFh 或 0FFFF_FFFF_FFFF_FFFFh）对一个值进行异或操作，所执行的操作与 not 指令相同。

8.1.11 扩展精度移位操作

扩展精度移位操作需要移位和旋转指令。本节描述了如何构造这些操作。

8.1.11.1 扩展精度左移

128 位的 shl（左移）操作的形式如图 8-7 所示。

图 8-7：128 位左移操作

要通过机器指令完成此操作，首先需要将低位四字（LO qword）左移（例如，使用 shl 指令），并将来自第 63 位的输出捕获（方便的是，进位标志会为我们完成此操作）。然后，我们必须将此位移入高位四字（HO qword）的低位，同时将其他所有位向左移（并通过进位标志捕获输出）。

您可以使用 shl 和 rcl 指令来实现这种 128 位移位操作。例如，要将 RDX:RAX 中的 128 位数据左移一位，可以使用以下指令：

shl rax, 1
rcl rdx, 1

shl 指令会将一个 0 移入 128 位操作数的第 0 位，并将第 63 位移入进位标志。然后，rcl 指令将进位标志移入第 64 位，并将第 127 位移入进位标志。最终结果正是我们所需要的。

使用此技术，您每次只能将扩展精度的值移动 1 位。您不能使用 CL 寄存器将扩展精度操作数移动多个位，也不能在使用此技术时指定大于 1 的常数值。

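To perform a left shift on an operand larger than 128 bits, use additional rcl instructions. An extended-precision left shift always starts with the least significant quad word, and each succeeding rcl instruction operates on the next more significant quad word. For example, to perform a 192-bit left shift on a memory location, you could use the following instructions: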

shl qword ptr Operand[0], 1
rcl qword ptr Operand[8], 1
rcl qword ptr Operand[16], 1

如果需要将数据移位 2 位或更多位，可以选择重复前述的指令序列所需的次数（对于常数移位次数）或将这些指令放入循环中，以一定次数重复执行它们。例如，以下代码将 192 位的值Operand左移 CL 指定的位数：

ShiftLoop:
    shl qword ptr Operand[0], 1
    rcl qword ptr Operand[8], 1
    rcl qword ptr Operand[16], 1
    dec cl
    jnz ShiftLoop
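The carry chain in the loop above can be modeled in C, one bit per pass, to check results by hand. This is a sketch with a hypothetical function name; note one deliberate difference, stated as an assumption: the model treats a count of 0 as "no shift," whereas the assembly loop as written would decrement CL from 0 to 255 and run 256 times.

```c
#include <stdint.h>

/* Shift the 192-bit value in op[0..2] (op[0] = LO qword) left by count bits.
   Returns the last bit shifted out of bit 191 (what the carry flag would hold). */
unsigned shl192_model(uint64_t op[3], unsigned count)
{
    unsigned carry = 0;
    while (count-- > 0) {
        unsigned c0 = (unsigned)(op[0] >> 63);   /* shl: bit 63 goes to carry   */
        op[0] <<= 1;
        unsigned c1 = (unsigned)(op[1] >> 63);   /* rcl: carry in, bit 63 out   */
        op[1] = (op[1] << 1) | c0;
        carry = (unsigned)(op[2] >> 63);         /* rcl on the HO qword         */
        op[2] = (op[2] << 1) | c1;
    }
    return carry;
}
```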

8.1.11.2 扩展精度右移和算术右移

实现shr（右移）和sar（算术右移）的方式类似，唯一的不同是你必须从操作数的最高有效字（HO word）开始，并逐步移至最低有效字（LO word）：

; Extended-precision SAR:

    sar qword ptr Operand[16], 1
    rcr qword ptr Operand[8], 1
    rcr qword ptr Operand[0], 1

; Extended-precision SHR:

    shr qword ptr Operand[16], 1
    rcr qword ptr Operand[8], 1
    rcr qword ptr Operand[0], 1

扩展精度移位会以不同于 8 位、16 位、32 位和 64 位对应指令的方式设置标志，因为旋转指令与移位指令对标志的影响不同。幸运的是，进位标志是你在移位操作后最常检查的标志，而扩展精度移位操作（即旋转指令）会正确设置此标志。

8.1.11.3 高效的多比特扩展精度移位

shld和shrd指令让你能够高效地实现多位扩展精度移位。这些指令具有以下语法：

shld Operand1, Operand2, constant
shld Operand1, Operand2, cl
shrd Operand1, Operand2, constant
shrd Operand1, Operand2, cl

shld指令的工作原理如图 8-8 所示。

图 8-8: shld操作

Operand2 必须是 16 位、32 位或 64 位寄存器。Operand1 可以是寄存器或内存位置。两个操作数的大小必须相同。第三个操作数constant或cl指定要移位的位数，其值范围可以是 0 到n – 1，其中n是前两个操作数的大小。

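The shld instruction shifts Operand1 to the left by the number of bits specified by the third operand and stores the result back into Operand1; Operand2 itself is not modified. The HO bits of Operand1 shift out through the carry flag, and the HO bits of Operand2 shift into the LO bits of Operand1. If the count is n and the operands are size bits wide, the carry flag ends up holding original bit size – n of Operand1 (the instruction keeps only the last bit shifted out). The shld instruction sets the flags as follows: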

如果移位计数为 0，shld不会影响任何标志。
进位标志包含从Operand1 的最高有效位（HO bit）移出的最后一位。
如果移位计数为 1，则如果Operand1 的符号位在移位过程中发生变化，溢出标志将为 1。如果移位计数不是 1，溢出标志是未定义的。
如果移位结果为 0，零标志将为 1。
符号标志将包含结果的高位（HO bit）。

shrd 指令类似于 shld，只不过它是将位向右移位，而不是向左移位。为了更清楚地理解 shrd 指令，请参考图 8-9。

图 8-9：shrd 操作

shrd 指令会设置标志位，具体如下：

如果移位计数为 0，shrd 不会影响任何标志。
进位标志包含从 Operand1 的低位（LO bit）移出的最后一位。
如果移位计数为 1，当 Operand1 的高位（HO bit）发生变化时，溢出标志将为 1。如果计数不是 1，溢出标志则未定义。
如果移位结果为 0，零标志位将为 1。
符号标志将包含结果的高位（HO）位。

考虑以下代码序列：

 .data
ShiftMe qword   012345678h, 90123456h, 78901234h
     .
     .
     .
    mov  rax, ShiftMe[8]
    shld ShiftMe[16], rax, 6
    mov  rax, ShiftMe[0]
    shld ShiftMe[8], rax, 6
    shl  ShiftMe[0], 6

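The first shld instruction shifts the bits from ShiftMe[8] into ShiftMe[16] without affecting the value in ShiftMe[8]. The second shld instruction shifts the bits from ShiftMe into ShiftMe[8]. Finally, the shl instruction shifts the LO quad word the appropriate amount.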

关于这段代码有两点需要注意。首先，与其他扩展精度左移操作不同，这个序列是从高四字（HO quad word）向低四字（LO quad word）移位。其次，进位标志不会包含来自高位移位操作的进位。如果需要在那时保留进位标志，必须在第一次 shld 指令后保存标志，并在 shl 指令后恢复标志。

你可以使用 shrd 指令执行扩展精度的右移操作。它的工作方式几乎与前面的代码序列相同，只是你从低四字（LO quad word）开始，最后处理高四字（HO quad word）。解决方案留给你作为练习。
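Before writing the shrd version in assembly, the data flow can be sketched in C. The helper below models shrd for 64-bit operands; `shrd64` and `shr192_model` are hypothetical names, and the sketch assumes a shift count from 1 to 63 (a count of 0 would make the C shift by 64 undefined).

```c
#include <stdint.h>

/* Model of "shrd dest, src, n" for 64-bit operands, 1 <= n <= 63:
   dest shifts right and the LO bits of src fill the vacated HO bits. */
static uint64_t shrd64(uint64_t dest, uint64_t src, unsigned n)
{
    return (dest >> n) | (src << (64 - n));
}

/* 192-bit logical right shift by n bits; op[0] is the LO qword. */
void shr192_model(uint64_t op[3], unsigned n)
{
    op[0] = shrd64(op[0], op[1], n);   /* LO qword first, fed from op[1]   */
    op[1] = shrd64(op[1], op[2], n);   /* then the middle qword            */
    op[2] >>= n;                       /* plain shr on the HO qword last   */
}
```

The order matters for the same reason as in the shld sequence: each step must read its neighbor before that neighbor is modified.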

8.1.12 扩展精度旋转操作

rcl 和 rcr 操作的扩展方式类似于 shl 和 shr。例如，要执行 192 位的 rcl 和 rcr 操作，使用以下指令：

rcl qword ptr Operand[0], 1
rcl qword ptr Operand[8], 1
rcl qword ptr Operand[16], 1

rcr qword ptr Operand[16], 1
rcr qword ptr Operand[8], 1
rcr qword ptr Operand[0], 1

这段代码与扩展精度移位操作的代码唯一不同之处是，第一条指令是 rcl 或 rcr，而不是 shl 或 shr。

执行扩展精度的 rol 或 ror 操作并不像简单的左移或右移那样，因为输入位的处理方式不同。你可以使用 bt、shld 和 shrd 指令来实现扩展精度的 rol 或 ror 指令。^3 以下代码展示了如何使用 shld 和 bt 指令执行 128 位扩展精度的 rol 操作：

; Compute rol RDX:RAX, 4:

        mov  rbx, rdx
        shld rdx, rax, 4
        shld rax, rbx, 4
        bt   rbx, 60        ; Set carry flag, if desired (last bit rotated out)

扩展精度的 ror 指令类似；只需记住，你首先在对象的低端（LO）进行操作，最后在高端（HO）进行操作。
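A C model of the 128-bit rotate makes it easy to check which bit ends up in the carry flag. This is a sketch with hypothetical names (`u128`, `rol128_model`), assuming a rotate count from 1 to 63; for a left rotate by n, the last bit rotated out of the top is original bit 128 – n, which is bit 64 – n of the original HO quad word.

```c
#include <stdint.h>

/* Hypothetical 128-bit type: lo models RAX, hi models RDX. */
typedef struct { uint64_t lo, hi; } u128;

/* Rotate v left by n bits (1 <= n <= 63).
   *carry receives the last bit rotated out of bit 127. */
u128 rol128_model(u128 v, unsigned n, unsigned *carry)
{
    u128 r;
    r.hi = (v.hi << n) | (v.lo >> (64 - n));      /* shld rdx, rax, n            */
    r.lo = (v.lo << n) | (v.hi >> (64 - n));      /* shld rax, rbx, n (old RDX)  */
    *carry = (unsigned)((v.hi >> (64 - n)) & 1);  /* bt on bit 64 - n of old RDX */
    return r;
}
```

As with the single-register rol instruction, the carry always equals bit 0 of the rotated result.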

8.2 操作不同大小的操作数

有时，您可能需要对一对大小不同的操作数进行计算。例如，您可能需要将一个字（word）和一个双字（double word）相加，或者从一个字值中减去一个字节（byte）值。为此，需要将较小的操作数扩展到较大操作数的大小，然后对两个相同大小的操作数进行运算。对于带符号的操作数，您需要将较小的操作数符号扩展到与较大操作数相同的大小；对于无符号值，您则需要将较小的操作数零扩展。这适用于任何操作。

以下示例演示了如何将一个字节变量与一个字变量相加：

 .data
var1    byte    ?
var2    word    ?
         .
         .
         .
; Unsigned addition:

        movzx   ax, var1
        add     ax, var2

; Signed addition:

        movsx   ax, var1
        add     ax, var2

在这两种情况下，字节变量被加载到 AL 寄存器中，扩展为 16 位，然后与字操作数相加。如果您可以选择操作的顺序（例如，将 8 位值加到 16 位值），这段代码效果非常好。

有时，您无法指定操作的顺序。也许 16 位值已经在 AX 寄存器中，您想要加上一个 8 位值。对于无符号加法，您可以使用以下代码：

    mov ax, var2    ; Load 16-bit value into AX
    .               ; Do some other operations, leaving
    .               ; a 16-bit quantity in AX
    add al, var1    ; Add in the 8-bit value
    adc ah, 0       ; Add carry into the HO byte

第一个 add 指令将 var1 中的字节加到累加器中的 LO 字节。adc 指令将加法中的进位加到累加器的 HO 字节中。如果省略了 adc，可能无法获得正确的结果。

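Adding an 8-bit signed operand to a 16-bit signed value is a little more difficult. Unfortunately, you cannot add an immediate value (as in the preceding example) to the HO byte of AX, because the HO extension byte can be either 0 or 0FFh. If a register is available, the best thing to do is the following: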

mov   bx, ax      ; BX is the available register
movsx ax, var1
add   ax, bx

如果没有额外的寄存器，您可以尝试以下代码：

push  ax          ; Save word value
movsx ax, var1    ; Sign-extend 8-bit operand to 16 bits
add   ax, [rsp]   ; Add in previous word value
add   rsp, 2      ; Pop junk from stack

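This works because the x86-64 can push 16-bit registers. One word of advice: don't leave the RSP register misaligned (not on an 8-byte boundary) for very long. If you're working with 32- or 64-bit registers, you'll have to push the full 64-bit register and add 8 to RSP when you're done with the stack.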

Another alternative is to store the 16-bit value in the accumulator into a memory location and then proceed as before:

mov   temp, ax
movsx ax, var1
add   ax, temp

所有这些示例都将一个字节值加到一个字值中。通过将较小的操作数零扩展或符号扩展到较大操作数的大小，您可以轻松地将两个不同大小的变量相加。

作为最后一个示例，考虑将一个 8 位带符号值与一个 oword（128 位）值相加：

      .data
OVal  oword   ?
BVal  byte    ?
     .
     .
     .
movsx rax, BVal
cqo
add   rax, qword ptr OVal
adc   rdx, qword ptr OVal[8]
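The movsx/cqo/add/adc sequence can be checked against a C model that performs the same sign extension and carry propagation. This is a sketch; the `u128` struct and `add_i8_to_128` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical 128-bit type: lo = bits 0-63, hi = bits 64-127. */
typedef struct { uint64_t lo, hi; } u128;

u128 add_i8_to_128(u128 oval, int8_t bval)
{
    uint64_t lo = (uint64_t)(int64_t)bval;        /* movsx rax, BVal              */
    uint64_t hi = (bval < 0) ? UINT64_MAX : 0;    /* cqo: RDX = sign bits of RAX  */
    u128 r;
    r.lo = lo + oval.lo;                          /* add rax, qword ptr OVal      */
    r.hi = hi + oval.hi + (r.lo < lo);            /* adc rdx, qword ptr OVal[8]   */
    return r;
}
```

For example, adding –1 to 2^64 (lo = 0, hi = 1) gives lo = all ones and hi = 0, which is 2^64 – 1.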

8.3 十进制运算

x86-64 CPU 使用二进制编号系统来表示其本地的内部表示。在计算机早期，设计者认为十进制（基数为 10）运算对商业计算更为精确。数学家们已经证明，这并非如此；然而，某些算法依赖于十进制运算来生成正确的结果。因此，尽管十进制运算通常比二进制运算效率低且准确性差，但十进制运算的需求仍然存在。

为了在本地二进制格式中表示十进制数字，最常见的技术是使用二进制编码十进制（BCD）表示法。这使用 4 位来表示 10 个可能的十进制数字（见表 8-1）。这 4 位的二进制值等于对应的十进制值，范围是 0 到 9。当然，4 位可以表示 16 个不同的值；BCD 格式忽略剩余的六个位组合。因为每个 BCD 数字需要 4 位，我们可以使用一个字节表示一个两位数的 BCD 值。这意味着我们可以通过一个字节表示范围在 0 到 99 之间的十进制值（而不是在二进制格式下一个字节表示的 0 到 255 的范围）。

表 8-1：二进制编码十进制表示法

BCD 表示法	十进制等效值
0000	0
0001	1
0010	2
0011	3
0100	4
0101	5
0110	6
0111	7
1000	8
1001	9
1010	非法
1011	非法
1100	非法
1101	非法
1110	非法
1111	非法

8.3.1 字面 BCD 常量

MASM 不提供字面 BCD 常量，也不需要字面 BCD 常量。因为 BCD 仅仅是十六进制表示法的一种形式，不允许使用 0Ah 到 0Fh 的值，你可以通过使用 MASM 的十六进制表示法轻松创建 BCD 常量。例如，下面的mov指令将 BCD 值 99 复制到 AL 寄存器中：

mov al, 99h

需要牢记的重要一点是，你不能使用 MASM 字面十进制常量表示 BCD 值。也就是说，mov al, 95不会将 95 的 BCD 表示加载到 AL 寄存器中。相反，它会将 5Fh 加载到 AL 寄存器中，这是一个非法的 BCD 值。
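The difference between a decimal literal and its BCD encoding is easy to demonstrate in C. The two helpers below are hypothetical (they are not MASM features or library routines): one packs a value from 0 to 99 into a BCD byte, and the other checks that both nibbles are legal BCD digits.

```c
#include <stdint.h>

/* Pack a decimal value 0..99 into one BCD byte (two 4-bit digits). */
uint8_t to_bcd(unsigned value)
{
    return (uint8_t)(((value / 10) << 4) | (value % 10));
}

/* Returns 1 if both nibbles of b are valid BCD digits (0 through 9). */
int is_valid_bcd(uint8_t b)
{
    return (b >> 4) <= 9 && (b & 0x0F) <= 9;
}
```

Here `to_bcd(95)` yields 95h, whereas the plain decimal constant 95 is the byte 5Fh, whose LO nibble (0Fh) is not a legal BCD digit.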

8.3.2 使用 FPU 进行打包十进制运算

为了提高依赖于十进制运算的应用程序的性能，英特尔将十进制运算的支持直接集成到 FPU 中。FPU 支持精度高达 18 位十进制数字的值，并且在所有 FPU 算术能力下进行计算，从加法到超越操作。如果你能接受只有 18 位精度以及一些其他限制，那么在 FPU 上进行十进制运算是正确的选择。

FPU 仅支持一种 BCD 数据类型：一个 10 字节 18 位的打包十进制值。打包十进制格式使用前 9 个字节以标准的打包十进制格式存储 BCD 值。第一个字节包含两个低位数字（LO），第九个字节包含两个高位数字（HO）。第十个字节的高位（HO）位用于存储符号位，FPU 忽略第十个字节中剩余的位（因为使用这些位会产生 FPU 无法在本地浮点格式中精确表示的 BCD 值）。

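The FPU stores negative BCD values as a sign bit plus an unsigned magnitude (the digits themselves are not complemented). The sign bit contains 1 if the number is negative and 0 if it is positive. If the number is 0, the sign bit may be either 0 or 1 because, as in the binary one's complement format, there are two distinct representations of 0.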

MASM 的tbyte类型是用于定义打包 BCD 变量的标准数据类型。fbld和fbstp指令需要一个tbyte操作数（你可以用十六进制/BCD 值初始化它）。

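The FPU does not fully support decimal arithmetic. Instead, it provides two instructions, fbld and fbstp, that convert between the packed decimal and binary floating-point formats when moving data to and from the FPU. The fbld (float/BCD load) instruction loads an 80-bit packed BCD value onto the top of the FPU stack after converting that BCD value to the binary floating-point format. Likewise, the fbstp (float/BCD store and pop) instruction pops the floating-point value off the top of the stack, converts it to a packed BCD value, and stores the BCD value into the destination memory location. This means that calculations are done using binary arithmetic. If you have an algorithm that absolutely depends on decimal arithmetic, it may fail if you use the FPU to implement it.^(4)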

打包 BCD 与浮点格式之间的转换并不是一个廉价的操作。fbld和fbstp指令可能会非常慢（例如，比fld和fstp慢两个数量级以上）。因此，如果你只进行简单的加减法，这些指令可能会很昂贵。

由于 FPU 将打包的十进制值转换为内部浮点格式，因此你可以在同一个计算中混合使用打包十进制、浮点和（二进制）整数格式。以下代码片段演示了如何实现这一点：

 .data
tb        tbyte    654321h
two       real8    2.0
one       dword    1

          fbld     tb 
          fmul     two
          fiadd    one
          fbstp    tb

; TB now contains: 1308643h.

FPU 将打包十进制值视为整数值。因此，如果你的计算产生了小数结果，fbstp指令将根据当前的 FPU 舍入模式对结果进行舍入。如果你需要处理小数值，你需要坚持使用浮点结果。

8.4 更多信息

唐纳德·克努斯（Donald Knuth）的《计算机程序设计的艺术》，第二卷：半数值算法（Addison-Wesley Professional，1997）包含了许多关于十进制算术和扩展精度算术的有用信息，尽管该书内容是通用的，并未描述如何在 x86-64 汇编语言中实现此操作。关于 BCD 算术的更多信息也可以在以下网站找到：

BCD 算术教程，homepage.divms.uiowa.edu/~jones/bcd/bcd.html

General Decimal Arithmetic, speleotrove.com/decimal/

Intel Decimal Floating-Point Math Library, software.intel.com/en-us/articles/intel-decimal-floating-point-math-library/

8.5 Test Yourself

提供计算 x = y + z 的代码，假设以下条件：
1. x、y 和 z 是 128 位整数
2. x 和 y 是 96 位整数，z 是 64 位整数
3. x、y 和 z 是 48 位整数
提供计算 x = y − z 的代码，假设以下条件：
1. x、y 和 z 是 192 位整数
2. x、y 和 z 是 96 位整数
提供计算 x = y × z 的代码，假设 x、y 和 z 是 128 位无符号整数。
提供计算 x = y / z 的代码，假设 x 和 y 是 128 位带符号整数，z 是 64 位带符号整数。
假设 x 和 y 是无符号的 128 位整数，将以下内容转换为汇编语言：
1. 如果 (x == y) 则执行代码
2. 如果 (x < y) 则执行代码
3. 如果 (x > y) 则执行代码
4. 如果 (x ≠ y) 则执行代码
假设 x 和 y 是带符号的 96 位整数，将以下内容转换为汇编语言：
1. 如果 (x == y) 则执行代码
2. 如果 (x < y) 则执行代码
3. 如果 (x > y) 则执行代码
假设 x 和 y 是带符号的 128 位整数，提供两种不同的方法将以下内容转换为汇编语言：
1. x = –x
2. x = –y
假设 x、y 和 z 都是 128 位整数，转换以下内容为汇编语言：
1. x = y & z （按位逻辑与）
2. x = y | z （按位逻辑或）
3. x = y ^ z （按位逻辑异或）
4. x = ~y （按位逻辑非）
5. x = y << 1 （按位左移）
6. x = y >> 1 （按位右移）
假设 x 和 y 是带符号的 128 位值，将 x = y >> 1 转换为汇编语言（按位算术右移）。
提供汇编代码，通过进位标志（左移 1 位）旋转 x 的 128 位值。
Provide assembly code that rotates the 128-bit value in x right 1 bit through the carry flag.

第九章：数字转换

本章讨论了不同数字格式之间的转换，包括整数到十进制字符串、整数到十六进制字符串、浮点数到字符串、十六进制字符串到整数、十进制字符串到整数，以及实数字符串到浮点数。除了基本的转换外，本章还讨论了错误处理（对于字符串到数字的转换）和性能优化。本章讨论了标准精度转换（适用于 8 位、16 位、32 位和 64 位整数格式）以及扩展精度转换（例如，128 位整数和字符串转换）。

9.1 将数字值转换为字符串

到目前为止，本书依赖于 C 标准库来执行数字输入输出（将数字数据写入显示器并从用户读取数字数据）。然而，C 标准库没有提供扩展精度的数字输入输出功能（甚至 64 位数字输入输出也有问题；本书使用了 Microsoft 扩展的printf()来进行 64 位数字输出）。因此，现在是时候解析并讨论如何在汇编语言中进行数字输入输出了——嗯，算是吧。因为大多数操作系统仅支持字符或字符串输入输出，我们不会进行实际的数字输入输出。相反，我们将编写将数字值与字符串之间转换的函数，然后进行字符串输入输出。

本节中的示例专门处理 64 位（非扩展精度）和 128 位值，但算法是通用的，可以扩展到任何位数。

9.1.1 将数字值转换为十六进制字符串

将数值转换为十六进制字符串相对简单。只需将二进制表示中的每个半字节（4 位）转换为“0”到“9”或“A”到“F”中的一个字符。请参考清单 9-1 中的btoh函数，该函数接收 AL 寄存器中的一个字节，并返回 AH（高半字节）和 AL（低半字节）中的两个对应字符。

; btoh - This procedure converts the binary value
;        in the AL register to two hexadecimal
;        characters and returns those characters
;        in the AH (HO nibble) and AL (LO nibble)
;        registers.

btoh        proc

            mov     ah, al      ; Do HO nibble first
            shr     ah, 4       ; Move HO nibble to LO
            or      ah, '0'     ; Convert to char
            cmp     ah, '9' + 1 ; Is it "A" through "F"?
            jb      AHisGood

; Convert 3Ah to 3Fh to "A" through "F":

            add     ah, 7

; Process the LO nibble here:

AHisGood:   and     al, 0Fh     ; Strip away HO nibble
            or      al, '0'     ; Convert to char
            cmp     al, '9' + 1 ; Is it "A" through "F"?
            jb      ALisGood

; Convert 3Ah to 3Fh to "A" through "F":

            add     al, 7
ALisGood:   ret
btoh        endp

清单 9-1：一个将字节转换为两个十六进制字符的函数

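You can convert any numeric value in the range 0 to 9 to its corresponding ASCII character by ORing the value with '0' (30h). Unfortunately, this maps the values 0Ah through 0Fh to 3Ah through 3Fh. Therefore, the code in Listing 9-1 checks whether the result is 3Ah or greater and, if so, adds 7 to produce a final character code in the range 41h to 46h ("A" through "F").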

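Once we can convert a single byte to a pair of hexadecimal characters, creating a string to output to the display is straightforward. We can call the btoh (byte to hex) function for each byte in the number and store the corresponding characters away in a string. Listing 9-2 provides examples of the btoStr (byte to string), wtoStr (word to string), dtoStr (double word to string), and qtoStr (quad word to string) functions.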

; Listing 9-2

; Numeric-to-hex string functions.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 9-2", 0
fmtStr1     byte    "btoStr: Value=%I64x, string=%s"
            byte    nl, 0

fmtStr2     byte    "wtoStr: Value=%I64x, string=%s"
            byte    nl, 0

fmtStr3     byte    "dtoStr: Value=%I64x, string=%s"
            byte    nl, 0

fmtStr4     byte    "qtoStr: Value=%I64x, string=%s"
            byte    nl, 0

            .data
buffer      byte    20 dup (?)

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; btoh - This procedure converts the binary value
;        in the AL register to two hexadecimal
;        characters and returns those characters
;        in the AH (HO nibble) and AL (LO nibble)
;        registers.

btoh        proc

            mov     ah, al      ; Do HO nibble first
            shr     ah, 4       ; Move HO nibble to LO
            or      ah, '0'     ; Convert to char
            cmp     ah, '9' + 1 ; Is it "A" to "F"?
            jb      AHisGood

; Convert 3Ah through 3Fh to "A" to "F":

            add     ah, 7

; Process the LO nibble here:

AHisGood:   and     al, 0Fh     ; Strip away HO nibble
            or      al, '0'     ; Convert to char
            cmp     al, '9' + 1 ; Is it "A" to "F"?
            jb      ALisGood

; Convert 3Ah through 3Fh to "A" to "F":

            add     al, 7   
ALisGood:   ret

btoh        endp

; btoStr - Converts the byte in AL to a string of hexadecimal
;          characters and stores them at the buffer pointed at
;          by RDI. Buffer must have room for at least 3 bytes.
;          This function zero-terminates the string.

btoStr      proc
            push    rax
            call    btoh        ; Do conversion here

; Create a zero-terminated string at [RDI] from the
; two characters we converted to hex format:

            mov     [rdi], ah
            mov     [rdi + 1], al
            mov     byte ptr [rdi + 2], 0
            pop     rax
            ret
btoStr      endp

; wtoStr - Converts the word in AX to a string of hexadecimal
;          characters and stores them at the buffer pointed at
;          by RDI. Buffer must have room for at least 5 bytes.
;          This function zero-terminates the string.

wtoStr      proc
            push    rdi
            push    rax     ; Note: leaves LO byte at [RSP]

; Use btoStr to convert HO byte to a string:

            mov     al, ah
            call    btoStr

            mov     al, [rsp]       ; Get LO byte
            add     rdi, 2          ; Skip HO chars
            call    btoStr

            pop     rax
            pop     rdi
            ret
wtoStr      endp

; dtoStr - Converts the dword in EAX to a string of hexadecimal
;          characters and stores them at the buffer pointed at
;          by RDI. Buffer must have room for at least 9 bytes.
;          This function zero-terminates the string.

dtoStr      proc
            push    rdi
            push    rax     ; Note: leaves LO word at [RSP]

; Use wtoStr to convert HO word to a string:

            shr     eax, 16
            call    wtoStr

            mov     ax, [rsp]       ; Get LO word
            add     rdi, 4          ; Skip HO chars
            call    wtoStr

            pop     rax
            pop     rdi
            ret
dtoStr      endp

; qtoStr - Converts the qword in RAX to a string of hexadecimal
;          characters and stores them at the buffer pointed at
;          by RDI. Buffer must have room for at least 17 bytes.
;          This function zero-terminates the string.

qtoStr      proc
            push    rdi
            push    rax     ; Note: leaves LO dword at [RSP]

; Use dtoStr to convert HO dword to a string:

            shr     rax, 32
            call    dtoStr

            mov     eax, [rsp]      ; Get LO dword
            add     rdi, 8          ; Skip HO chars
            call    dtoStr

            pop     rax
            pop     rdi
            ret
qtoStr      endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage (keeps RSP 16-byte aligned after two pushes)

; Because all the (x)toStr functions preserve RDI,
; we need to do the following only once:

            lea     rdi, buffer

; Demonstrate call to btoStr:

            mov     al, 0aah
            call    btoStr

            lea     rcx, fmtStr1
            mov     edx, eax
            mov     r8, rdi
            call    printf

; Demonstrate call to wtoStr:

            mov     ax, 0a55ah
            call    wtoStr

            lea     rcx, fmtStr2
            mov     edx, eax
            mov     r8, rdi
            call    printf

; Demonstrate call to dtoStr:

            mov     eax, 0aa55FF00h
            call    dtoStr

            lea     rcx, fmtStr3
            mov     edx, eax
            mov     r8, rdi
            call    printf

; Demonstrate call to qtoStr:

            mov     rax, 1234567890abcdefh
            call    qtoStr

            lea     rcx, fmtStr4
            mov     rdx, rax
            mov     r8, rdi
            call    printf

            leave
            pop     rdi
            ret     ; Returns to caller

asmMain     endp
            end

清单 9-2：btoStr、wtoStr、dtoStr和qtoStr函数

这是构建命令和示例输出：

C:\>**build listing9-2**

C:\>**echo off**
 Assembling: listing9-2.asm
c.cpp

C:\>**listing9-2**
Calling Listing 9-2:
btoStr: Value=aa, string=AA
wtoStr: Value=a55a, string=A55A
dtoStr: Value=aa55ff00, string=AA55FF00
qtoStr: Value=1234567890abcdef, string=1234567890ABCDEF
Listing 9-2 terminated

清单 9-2 中的每个后续函数都建立在前一个函数的基础上。例如，wtoStr调用btoStr两次，将 AX 中的 2 个字节转换为 4 个十六进制字符的字符串。如果你在每个调用这些函数的地方都内联展开它们，代码会更快（但也会变得更大）。如果你只需要其中一个函数，内联展开它的所有调用会值得付出额外的努力。

这是qtoStr的一个版本，包含两个改进：内联展开对dtoStr、wtoStr和btoStr的调用，以及使用一个简单的表查找（数组访问）来进行半字节到十六进制字符的转换（有关表查找的更多信息，请参见第十章）。这个更快版本的qtoStr的框架出现在清单 9-3 中。

; qtoStr - Converts the qword in RAX to a string of hexadecimal
;          characters and stores them at the buffer pointed at
;          by RDI. Buffer must have room for at least 17 bytes.
;          This function zero-terminates the string.

hexChar             byte    "0123456789ABCDEF"

qtoStr      proc
            push    rdi
            push    rcx
            push    rdx
            push    rax                ; Leaves LO dword at [RSP]

            lea     rcx, hexChar

            xor     edx, edx           ; Zero-extends!
            shld    rdx, rax, 4
            shl     rax, 4
            mov     dl, [rcx][rdx * 1] ; Table lookup
            mov     [rdi], dl

; Emit bits 56-59:

            xor     edx, edx
            shld    rdx, rax, 4
            shl     rax, 4
            mov     dl, [rcx][rdx * 1]
            mov     [rdi + 1], dl

; Emit bits 52-55:

            xor     edx, edx
            shld    rdx, rax, 4
            shl     rax, 4
            mov     dl, [rcx][rdx * 1]
            mov     [rdi + 2], dl
             .
             .
             .
; Code to emit bits 8-51 was deleted for length reasons.
; The code should be obvious if you look at the output
; for the other nibbles appearing here.
             .
             .
             .
; Emit bits 4-7:

            xor     edx, edx
            shld    rdx, rax, 4
            shl     rax, 4
            mov     dl, [rcx][rdx * 1]
            mov     [rdi + 14], dl

; Emit bits 0-3:

            xor     edx, edx
            shld    rdx, rax, 4
            shl     rax, 4
            mov     dl, [rcx][rdx * 1]
            mov     [rdi + 15], dl

; Zero-terminate string:

            mov     byte ptr [rdi + 16], 0

            pop     rax
            pop     rdx
            pop     rcx
            pop     rdi
            ret
qtoStr      endp

清单 9-3：qtoStr的更快实现

I wrote a short main program containing the following loop:

            lea     rdi, buffer
            mov     rax, 07fffffffh
loopit:     call    qtoStr
            dec     eax
            jnz     loopit

然后，我使用一台 2012 年款的 2.6 GHz Intel Core i7 处理器，通过秒表得到了qtoStr内联版本和原始版本的大致执行时间：

内联版本：19 秒
原始版本：85 秒

如你所见，内联版本显著（快了四倍）更快，但你可能不会经常将 64 位数字转换为十六进制字符串，因此不足以为内联版本那种不够简洁的代码辩护。

说实话，你可能通过使用一个更大的表（256 个 16 位条目）来表示十六进制字符，并一次转换一个字节，而不是一个半字节，从而将时间几乎减少一半。这将需要比内联版本少一半的指令（尽管表的大小将增加 32 倍）。
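The byte-at-a-time table idea can be sketched in C to show the data layout: 256 entries of two characters each (512 bytes), indexed by the byte value. The names `hexPairs`, `init_hex_pairs`, and `qtoStr_model` are hypothetical, and building the table at run time is an assumption made to keep the sketch short; an assembly version would more likely define the table statically.

```c
#include <stdint.h>

/* 256 entries x 2 characters = 512 bytes (32 times the 16-byte nibble table). */
static char hexPairs[256][2];

/* Fill the table once before the first conversion. */
void init_hex_pairs(void)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 256; i++) {
        hexPairs[i][0] = digits[i >> 4];      /* HO nibble character */
        hexPairs[i][1] = digits[i & 0x0F];    /* LO nibble character */
    }
}

/* Convert value to 16 hex characters plus a zero terminator.
   buffer must have room for at least 17 bytes. */
void qtoStr_model(uint64_t value, char *buffer)
{
    for (int i = 0; i < 8; i++) {
        unsigned byte = (unsigned)(value >> 56);   /* take the HO byte first */
        buffer[2 * i]     = hexPairs[byte][0];
        buffer[2 * i + 1] = hexPairs[byte][1];
        value <<= 8;                               /* expose the next byte   */
    }
    buffer[16] = '\0';
}
```

Each pass does one lookup per byte instead of one per nibble, which is where the halving of the instruction count would come from.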

9.1.2 将扩展精度十六进制值转换为字符串

扩展精度的十六进制到字符串的转换非常简单。它只是上一节中正常十六进制转换例程的扩展。例如，这里是一个 128 位的十六进制转换函数：

; otoStr - Converts the oword in RDX:RAX to a string of hexadecimal
;          characters and stores them at the buffer pointed at
;          by RDI. Buffer must have room for at least 33 bytes.
;          This function zero-terminates the string.

otoStr      proc
            push    rdi
            push    rax     ; Note: leaves LO qword at [RSP]

; Use qtoStr to convert each qword to a string:

            mov     rax, rdx
            call    qtoStr

            mov     rax, [rsp]      ; Get LO qword
            add     rdi, 16         ; Skip HO chars
            call    qtoStr

            pop     rax
            pop     rdi
            ret
otoStr      endp

9.1.3 将无符号十进制值转换为字符串

十进制输出比十六进制输出稍微复杂一些，因为二进制数字的高位（HO 位）会影响十进制表示中的低位数字（十六进制值并不受此影响，这也是为什么十六进制输出如此简单的原因）。因此，我们需要通过从数字中提取每一位十进制数字，来创建二进制数的十进制表示。

输出无符号十进制数的最常见方法是不断地将值除以 10，直到结果变为 0。第一次除法后的余数是一个 0 到 9 之间的数值，这个值对应十进制数的低位数字。通过连续除以 10（以及对应的余数），可以提取数字的每一位。

对这个问题的迭代解决方案通常会分配足够大的存储空间来容纳整个数字的字符字符串。然后，代码在循环中提取十进制数字，并将它们逐一放入字符串中。在转换过程结束时，例程会以相反的顺序打印字符串中的字符（记住，除法算法先提取低位数字，最后提取高位数字，这与你需要打印的顺序正好相反）。
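The iterative approach just described can be sketched in C as follows. The function name `utoStr_iter` is hypothetical; the sketch stores the digits into the caller's buffer in the correct order rather than printing them, and it assumes the buffer holds at least 21 bytes (20 digits plus the terminator).

```c
#include <stdint.h>

/* Convert an unsigned 64-bit value to a decimal string.
   buffer must have room for at least 21 bytes. */
void utoStr_iter(uint64_t value, char *buffer)
{
    char tmp[20];                 /* digits arrive LO digit first */
    int n = 0;

    do {
        tmp[n++] = (char)('0' + value % 10);   /* remainder = next LO digit */
        value /= 10;
    } while (value != 0);

    while (n > 0)                 /* emit in reverse: HO digit first */
        *buffer++ = tmp[--n];
    *buffer = '\0';
}
```

Using a do/while loop means the value 0 needs no special case: the first pass stores a single "0" digit.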

本节采用了递归解决方案，因为它稍微更优雅一些。该解决方案首先通过将值除以 10 并将余数保存在局部变量中开始。如果商不为 0，例程会递归调用自己，先输出所有前导数字。递归调用返回后（输出了所有前导数字），递归算法会输出与余数相关的数字，完成操作。当打印十进制值 789 时，操作过程如下：

将 789 除以 10。商为 78，余数为 9。
将余数（9）保存在一个局部变量中，并递归地调用该例程，使用商值作为参数。
递归入口 1：将 78 除以 10。商为 7，余数为 8。
将余数（8）保存在局部变量中，并递归地调用该例程，使用商值作为参数。
递归入口 2：将 7 除以 10。商为 0，余数为 7。
将余数（7）保存在局部变量中。由于商为 0，不再递归调用例程。
输出保存在局部变量中的余数值（7）。返回到调用者（递归入口 1）。
返回到递归入口 1：输出在递归入口 1 中保存在局部变量中的余数值（8）。返回到调用者（原始例程调用）。
原始调用：输出原始调用中保存在局部变量中的余数值（9）。返回到输出例程的原始调用者。

列表 9-4 实现了递归算法。

; Listing 9-4

; Numeric unsigned integer-to-string function.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 9-4", 0
fmtStr1     byte    "utoStr: Value=%I64u, string=%s"
            byte    nl, 0

            .data
buffer      byte    24 dup (?)

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; utoStr - Unsigned integer to string.

; Inputs:

;    RAX:   Unsigned integer to convert.
;    RDI:   Location to hold string.

; Note: for 64-bit integers, resulting
; string could be as long as 21 bytes
; (including the zero-terminating byte).

utoStr      proc
            push    rax
            push    rdx
            push    rdi

; Handle zero specially:

            test    rax, rax
            jnz     doConvert

            mov     byte ptr [rdi], '0'
            inc     rdi
            jmp     allDone 

doConvert:  call    rcrsvUtoStr

; Zero-terminate the string and return:

allDone:    mov     byte ptr [rdi], 0
            pop     rdi
            pop     rdx
            pop     rax
            ret
utoStr      endp

ten         qword   10

; Here's the recursive code that does the
; actual conversion:

rcrsvUtoStr proc

            xor     rdx, rdx           ; Zero-extend RAX -> RDX
            div     ten
            push    rdx                ; Save output value
            test    rax, rax           ; Quit when RAX is 0
            jz      allDone 

; Recursive call to handle value / 10:

            call    rcrsvUtoStr

allDone:    pop     rax                ; Retrieve char to print
            and     al, 0Fh            ; Convert to "0" to "9"
            or      al, '0'
            mov     byte ptr [rdi], al ; Save in buffer
            inc     rdi                ; Next char position
            ret
rcrsvUtoStr endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

; Because all the (x)toStr functions preserve RDI,
; we need to do the following only once:

            lea     rdi, buffer
            mov     rax, 1234567890
            call    utoStr

; Print the result:

            lea     rcx, fmtStr1
            mov     rdx, rax
            mov     r8, rdi
            call    printf

            leave
            pop     rdi
            ret     ; Returns to caller

asmMain     endp
            end

列表 9-4：无符号整数到字符串的转换函数（递归）

这是构建命令和程序输出：

C:\>**build listing9-4**

C:\>**echo off**
 Assembling: listing9-4.asm
c.cpp

C:\>**listing9-4**
Calling Listing 9-4:
utoStr: Value=1234567890, string=1234567890
Listing 9-4 terminated

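Unlike hexadecimal output, there is no real need to provide byte-, word-, or double-word-sized numeric-to-decimal-string conversion functions. Simply zero-extend the smaller value to 64 bits. Unlike the hexadecimal conversions, the utoStr function does not emit leading zeros, so the output is the same for variables of all sizes (64 bits and smaller).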

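Unlike hexadecimal conversion (which is very fast to begin with, and which you probably won't use that often), you will call the integer-to-string conversion function frequently. Because it uses the div instruction, it can be fairly slow. Fortunately, we can speed it up by using the fild and fbstp instructions.

The fbstp instruction converts the 80-bit floating-point value currently on the top of the FPU stack to an 18-digit packed BCD value (using the format shown in Figure 6-7 in Chapter 6). The fild instruction lets you load a 64-bit integer onto the FPU stack. So, by using these two instructions, you can (mostly) convert a 64-bit integer to a packed BCD value, which encodes one decimal digit in each 4 bits. You can therefore use the same algorithm that converts hexadecimal digits to a string to convert the packed BCD result produced by fbstp into a character string.

There is one catch to using fild and fbstp to convert an integer to a string: the Intel packed BCD format (see Figure 6-7 in Chapter 6) supports only 18 digits, whereas a 64-bit integer can have up to 19 digits (this count holds for values below 10^19; the largest unsigned 64-bit values have 20 digits). Therefore, any fbstp-based utoStr function has to handle the 19th digit as a special case. With that in mind, Listing 9-5 provides this new version of the utoStr function.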
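The second half of the job, turning the 18 packed BCD digits into characters while skipping leading zeros, can be sketched in C. The function name `bcd18_to_str` is hypothetical, and the sketch assumes its input is the first 9 bytes of a packed BCD value (byte 0 holding the two LO digits); the sign byte is ignored.

```c
#include <stdint.h>

/* Convert 18 packed BCD digits to a decimal string without leading zeros.
   bcd[0] holds the two LO digits, bcd[8] the two HO digits.
   buffer must have room for at least 19 bytes. */
void bcd18_to_str(const uint8_t bcd[9], char *buffer)
{
    int started = 0;                      /* set once a nonzero digit is seen */

    for (int i = 17; i >= 0; i--) {       /* digit 17 is the HO digit */
        /* Even digits live in the LO nibble, odd digits in the HO nibble. */
        unsigned digit = (bcd[i / 2] >> ((i & 1) * 4)) & 0x0F;
        if (digit != 0 || started || i == 0) {
            *buffer++ = (char)('0' + digit);
            started = 1;
        }
    }
    *buffer = '\0';
}
```

The `i == 0` test guarantees that an all-zero input still produces the one-character string "0".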

; Listing 9-5

; Fast unsigned integer-to-string function
; using fild and fbstp.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 9-5", 0
fmtStr1     byte    "utoStr: Value=%I64u, string=%s"
            byte    nl, 0

            .data
buffer      byte    30 dup (?)

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; utoStr - Unsigned integer to string.

; Inputs:

;    RAX:   Unsigned integer to convert.
;    RDI:   Location to hold string.

; Note: for 64-bit integers, resulting
; string could be as long as 21 bytes
; (including the zero-terminating byte).

bigNum      qword   1000000000000000000
utoStr      proc
            push    rcx
            push    rdx
            push    rdi
            push    rax
            sub     rsp, 10

; Quick test for zero to handle that special case:

            test    rax, rax
            jnz     not0
            mov     byte ptr [rdi], '0'
            inc     rdi               ; Step past "0" so the terminator follows it
            jmp     allDone

; The FBSTP instruction supports only 18 digits.
; 64-bit integers can have up to 19 digits.
; Handle that 19th possible digit here:

not0:       cmp     rax, bigNum
            jb      lt19Digits

; The number has 19 digits (which can be 0-9).
; Pull off the 19th digit:

            xor     edx, edx
            div     bigNum            ; 19th digit in AL
            mov     [rsp + 10], rdx   ; Remainder
            or      al, '0'
            mov     [rdi], al
            inc     rdi

; The number to convert is nonzero.
; Use BCD load and store to convert
; the integer to BCD:

lt19Digits: fild    qword ptr [rsp + 10]
            fbstp   tbyte ptr [rsp]

; Begin by skipping over leading zeros in
; the BCD value (max 19 digits, so the most
; significant digit will be in the LO nibble
; of DH).

            mov     dx, [rsp + 8]
            mov     rax, [rsp]
            mov     ecx, 20
            jmp     testFor0

Skip0s:     shld    rdx, rax, 4
            shl     rax, 4
testFor0:   dec     ecx         ; Count digits we've processed
            test    dh, 0fh     ; Because the number is not 0
            jz      Skip0s      ; this always terminates

; At this point the code has encountered
; the first nonzero digit. Convert the remaining
; digits to a string:

cnvrtStr:   and     dh, 0fh
            or      dh, '0'
            mov     [rdi], dh
            inc     rdi
            mov     dh, 0
            shld    rdx, rax, 4
            shl     rax, 4
            dec     ecx
            jnz     cnvrtStr

; Zero-terminate the string and return:

allDone:    mov     byte ptr [rdi], 0
            add     rsp, 10
            pop     rax
            pop     rdi
            pop     rdx
            pop     rcx
            ret
utoStr      endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64         ; Shadow storage

; Because all the (x)toStr functions preserve RDI,
; we need to do the following only once:

            lea     rdi, buffer
            mov     rax, 9123456789012345678
            call    utoStr

            lea     rcx, fmtStr1
            mov     rdx, 9123456789012345678
            lea     r8, buffer
            call    printf

            leave
            ret     ; Returns to caller
asmMain     endp
            end

Listing 9-5: fild- and fbstp-based utoStr function

Here's the build command and sample output for this program:

C:\>**build listing9-5**

C:\>**echo off**
 Assembling: listing9-5.asm
c.cpp

C:\>**listing9-5**
Calling Listing 9-5:
utoStr: Value=9123456789012345678, string=9123456789012345678
Listing 9-5 terminated

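The program in Listing 9-5 does use a div instruction, but it executes at most once, and only if the number has 19 or 20 digits. Therefore, the execution time of this div instruction has little impact on the overall speed of the utoStr function (especially when you consider how often you actually print 19-digit numbers).

I got the following execution times on a 2.6 GHz circa-2012 Core i7 processor:

Original utoStr: 108 seconds
fild and fbstp implementation: 11 seconds

Clearly, the fild and fbstp implementation is the winner.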
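To make the special-case handling concrete, here is a minimal C sketch of the same split: one division by 10^18 peels off the leading one or two digits, and the remainder is then written as exactly 18 digits so that its leading zeros are not lost. The function names are hypothetical, and plain integer arithmetic stands in for the fild and fbstp step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write exactly `count` decimal digits of `value`, zero-padded on the left. */
static char *put_fixed_digits(char *out, uint64_t value, int count)
{
    for (int i = count - 1; i >= 0; i--) {
        out[i] = (char)('0' + value % 10);
        value /= 10;
    }
    return out + count;
}

/* Convert a 64-bit unsigned value, treating digits 19 and 20 specially. */
void u64_to_str_split(uint64_t value, char *out)
{
    const uint64_t big = 1000000000000000000ULL;    /* 10^18 */
    assert(out != NULL);

    if (value >= big) {
        uint64_t head = value / big;                /* 1..18 */
        uint64_t tail = value % big;                /* low 18 digits */
        if (head >= 10) {                           /* a 20th digit is always 1 */
            *out++ = '1';
            head -= 10;
        }
        *out++ = (char)('0' + head);
        out = put_fixed_digits(out, tail, 18);      /* keep leading zeros */
    } else {
        char tmp[18];                               /* at most 18 digits here */
        int n = 0;
        do {
            tmp[n++] = (char)('0' + value % 10);
            value /= 10;
        } while (value != 0);
        while (n > 0)
            *out++ = tmp[--n];
    }
    *out = '\0';
}
```

Values such as 1,000,000,000,000,000,005 are the interesting test cases: the remainder after the split is 5, and dropping its leading zeros would produce the wrong string.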

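9.1.4 Converting Signed Integer Values to Strings

To convert a signed integer value to a string, first check whether the number is negative; if it is, emit a hyphen (-) and take the absolute value. Then call the utoStr function to finish the job. Listing 9-6 shows the relevant code.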

; itoStr - Signed integer-to-string conversion.

; Inputs:
;    RAX -   Signed integer to convert.
;    RDI -   Destination buffer address.

itoStr      proc
            push    rdi
            push    rax
            test    rax, rax
            jns     notNeg

; Number was negative, emit "-" and negate
; value.

            mov     byte ptr [rdi], '-'
            inc     rdi
            neg     rax

; Call utoStr to convert non-negative number:

notNeg:     call    utoStr
            pop     rax
            pop     rdi
            ret
itoStr      endp

Listing 9-6: Signed integer-to-string conversion
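The same sign-then-magnitude approach can be sketched in C. The one subtlety is the most negative value, whose magnitude does not fit in a signed 64-bit integer; negating in unsigned arithmetic (which is what neg rax followed by an unsigned conversion amounts to) sidesteps that. The name i64_to_str is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert a signed 64-bit integer to a decimal string. */
void i64_to_str(int64_t value, char *out)
{
    assert(out != NULL);

    /* Work on the magnitude as an unsigned value so that
       INT64_MIN (magnitude 2^63) is handled correctly. */
    uint64_t mag = (uint64_t)value;
    if (value < 0) {
        *out++ = '-';
        mag = 0 - mag;              /* unsigned negation, well defined in C */
    }

    char tmp[20];                   /* 2^63 has 19 digits */
    int n = 0;
    do {
        tmp[n++] = (char)('0' + mag % 10);
        mag /= 10;
    } while (mag != 0);
    while (n > 0)
        *out++ = tmp[--n];
    *out = '\0';
}
```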

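9.1.5 Converting Extended-Precision Unsigned Integers to Strings

For extended-precision output, the only operation in the entire string-conversion algorithm that requires extended-precision arithmetic is the divide-by-10 operation. Because we are dividing an extended-precision value by a value that easily fits into a quad word, we can use the fast (and easy) extended-precision division algorithm that uses the div instruction (see the section on the special-case form using the div instruction in Chapter 8). Listing 9-7 implements a 128-bit decimal output routine that uses this technique.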

; Listing 9-7

; Extended-precision numeric unsigned 
; integer-to-string function.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 9-7", 0
fmtStr1     byte    "otoStr(0): string=%s", nl, 0
fmtStr2     byte    "otoStr(1234567890): string=%s", nl, 0
fmtStr3     byte    "otoStr(2147483648): string=%s", nl, 0
fmtStr4     byte    "otoStr(4294967296): string=%s", nl, 0
fmtStr5     byte    "otoStr(FFF...FFFF): string=%s", nl, 0

            .data
buffer      byte    40 dup (?)

b0          oword   0
b1          oword   1234567890
b2          oword   2147483648
b3          oword   4294967296

; Largest oword value
; (decimal=340,282,366,920,938,463,463,374,607,431,768,211,455):

b4          oword   0FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; DivideBy10 - Divides "dividend" by 10 using fast
;              extended-precision division algorithm
;              that employs the div instruction.

; Returns quotient in "quotient."
; Returns remainder in RAX.
; Trashes RDX.

; RCX - Points at oword dividend and location to
;       receive quotient.

ten         qword   10

DivideBy10  proc
parm        equ     <[rcx]>

            xor     edx, edx       ; Zero-extends!
            mov     rax, parm[8]
            div     ten
            mov     parm[8], rax

            mov     rax, parm
            div     ten
            mov     parm, rax
            mov     eax, edx       ; Remainder (always "0" to "9"!)
            ret    
DivideBy10  endp

; Recursive version of otoStr.
; A separate "shell" procedure calls this so that
; this code does not have to preserve all the registers
; it uses (and DivideBy10 uses) on each recursive call.

; On entry:
;    Stack - Contains oword in/out parameter (dividend in/quotient out).
;    RDI   - Contains location to place output string.

; Note: this function must clean up stack (parameters)
;       on return.

rcrsvOtoStr proc
value       equ     <[rbp + 16]>
remainder   equ     <[rbp - 8]>
            push    rbp
            mov     rbp, rsp
            sub     rsp, 8
            lea     rcx, value
            call    DivideBy10
            mov     remainder, al

; If the quotient (left in value) is not 0, recursively
; call this routine to output the HO digits.

            mov     rax, value
            or      rax, value[8]
            jz      allDone

            mov     rax, value[8]
            push    rax
            mov     rax, value
            push    rax
            call    rcrsvOtoStr

allDone:    mov     al, remainder
            or      al, '0'
            mov     [rdi], al
            inc     rdi
            leave
            ret     16      ; Remove parms from stack
rcrsvOtoStr endp

; Nonrecursive shell to the above routine so we don't bother
; saving all the registers on each recursive call.

; On entry:

;   RDX:RAX - Contains oword to print.
;   RDI     - Buffer to hold string (at least 40 bytes).

otostr      proc

            push    rax
            push    rcx
            push    rdx
            push    rdi

; Special-case zero:

            test    rax, rax
            jnz     not0
            test    rdx, rdx
            jnz     not0
            mov     byte ptr [rdi], '0'
            inc     rdi
            jmp     allDone

not0:       push    rdx
            push    rax
            call    rcrsvOtoStr

; Zero-terminate string before leaving:

allDone:    mov     byte ptr [rdi], 0

            pop     rdi
            pop     rdx
            pop     rcx
            pop     rax
            ret

otostr      endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

; Because all the (x)toStr functions preserve RDI,
; we need to do the following only once:

            lea     rdi, buffer

; Convert b0 to a string and print the result:

            mov     rax, qword ptr b0
            mov     rdx, qword ptr b0[8]
            call    otostr

            lea     rcx, fmtStr1
            lea     rdx, buffer
            call    printf

; Convert b1 to a string and print the result:

            mov     rax, qword ptr b1
            mov     rdx, qword ptr b1[8]
            call    otostr

            lea     rcx, fmtStr2
            lea     rdx, buffer
            call    printf

; Convert b2 to a string and print the result:

            mov     rax, qword ptr b2
            mov     rdx, qword ptr b2[8]
            call    otostr

            lea     rcx, fmtStr3
            lea     rdx, buffer
            call    printf

; Convert b3 to a string and print the result:

            mov     rax, qword ptr b3
            mov     rdx, qword ptr b3[8]
            call    otostr

            lea     rcx, fmtStr4
            lea     rdx, buffer
            call    printf

; Convert b4 to a string and print the result:

            mov     rax, qword ptr b4
            mov     rdx, qword ptr b4[8]
            call    otostr

            lea     rcx, fmtStr5
            lea     rdx, buffer
            call    printf

            leave
            pop     rdi
            ret     ; Returns to caller

asmMain     endp
            end

Listing 9-7: 128-bit extended-precision decimal output routine

Here's the build command and program output:

C:\>**build listing9-7**

C:\>**echo off**
 Assembling: listing9-7.asm
c.cpp

C:\>**listing9-7**
Calling Listing 9-7:
otoStr(0): string=0
otoStr(1234567890): string=1234567890
otoStr(2147483648): string=2147483648
otoStr(4294967296): string=4294967296
otoStr(FFF...FFFF):
        string=340282366920938463463374607431768211455
Listing 9-7 terminated

Sadly, we cannot use the fbstp instruction to improve the performance of this algorithm, because fbstp is limited to 80-bit BCD values.
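For readers who want to experiment outside of MASM, the following C sketch applies the same "divide by a small value, carrying the remainder into the next chunk" technique. It is an illustration under stated assumptions rather than a translation of Listing 9-7: portable C has no 128-by-64-bit divide, so the sketch uses four 32-bit limbs with a 64-bit intermediate, and it collects digits iteratively instead of recursively. The names div10_u128 and u128_to_str are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Divide the 128-bit value held in limb[3..0] (limb[3] is the most
   significant 32 bits) by 10 in place and return the remainder.
   The remainder of each step becomes the high part of the next step. */
unsigned div10_u128(uint32_t limb[4])
{
    uint64_t rem = 0;
    for (int i = 3; i >= 0; i--) {
        uint64_t cur = (rem << 32) | limb[i];
        limb[i] = (uint32_t)(cur / 10);
        rem = cur % 10;
    }
    return (unsigned)rem;
}

/* Convert the unsigned 128-bit value hi:lo to a decimal string.
   `out` must have room for 40 bytes (39 digits plus the terminator). */
void u128_to_str(uint64_t hi, uint64_t lo, char *out)
{
    assert(out != NULL);
    uint32_t limb[4] = {
        (uint32_t)lo, (uint32_t)(lo >> 32),
        (uint32_t)hi, (uint32_t)(hi >> 32)
    };
    char tmp[39];
    int n = 0;
    do {
        tmp[n++] = (char)('0' + div10_u128(limb));
    } while (limb[0] | limb[1] | limb[2] | limb[3]);
    while (n > 0)
        *out++ = tmp[--n];          /* digits were produced LO first */
    *out = '\0';
}
```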

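9.1.6 Converting Extended-Precision Signed Decimal Values to Strings

Once you have an extended-precision unsigned decimal output routine, writing an extended-precision signed decimal output routine is easy. The basic algorithm is similar to the one given earlier for 64-bit integers:

Check the sign of the number.
If it is positive, call the unsigned output routine to print it. If it is negative, print a minus sign, negate the number, and call the unsigned output routine to print it.

To check the sign of an extended-precision integer, test the HO bit of the number. To negate a large value, the best solution is probably to subtract that value from 0. Listing 9-8 is a quick version of i128toStr that uses the otoStr routine from the previous section.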

; i128toStr - Converts a 128-bit signed integer to a string.

; Inputs:
;    RDX:RAX - Signed integer to convert.
;    RDI     - Pointer to buffer to receive string.

i128toStr   proc
            push    rax
            push    rdx
            push    rdi

            test    rdx, rdx  ; Is number negative?
            jns     notNeg

            mov     byte ptr [rdi], '-'
            inc     rdi
            neg     rdx       ; 128-bit negation
            neg     rax
            sbb     rdx, 0

notNeg:     call    otostr
            pop     rdi
            pop     rdx
            pop     rax
            ret
i128toStr   endp

Listing 9-8: 128-bit signed integer-to-string conversion
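The three-instruction negation in Listing 9-8 (neg rdx, neg rax, sbb rdx, 0) can be mirrored in C to see why it works: negating the low half produces a borrow whenever the low half is nonzero, and that borrow must be subtracted from the negated high half. The u128 type and the function names below are hypothetical helpers for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t hi;    /* HO 64 bits */
    uint64_t lo;    /* LO 64 bits */
} u128;

/* Two's-complement negation of a 128-bit value, that is, 0 - v. */
u128 neg128(u128 v)
{
    u128 r;
    r.lo = 0 - v.lo;
    r.hi = 0 - v.hi - (v.lo != 0);  /* borrow out of the low half */
    return r;
}

/* Return the magnitude of a signed 128-bit value and report its sign. */
u128 abs128(u128 v, int *was_negative)
{
    assert(was_negative != NULL);
    *was_negative = (v.hi >> 63) != 0;      /* test the HO bit */
    return *was_negative ? neg128(v) : v;
}
```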

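9.1.7 Formatted Conversions

The code in the previous sections converted signed and unsigned integers to strings by using the minimum number of necessary character positions. To create nicely formatted tables of values, you need to write functions that provide appropriate padding in front of the string of digits before emitting the number. Once you have the "unformatted" versions of these routines, implementing the formatted versions is easy.

The first step is to write the iSize and uSize routines, which compute the minimum number of character positions needed to display a value. One algorithm to accomplish this is similar to the numeric string conversion routines. In fact, the only difference is that you initialize a counter to 0 upon entry into the routine (for example, in the nonrecursive shell routine) and then increment this counter, rather than emitting a digit, on each recursive call. (Don't forget to increment the counter inside iSize if the number is negative; you must allow room for the minus sign.) After the calculation is complete, these routines should return the size of the operand in the EAX register.

The only problem is that such a conversion scheme is slow (using recursion and div is not very fast). As it turns out, a simple brute-force version that compares the integer value against 1, 10, 100, 1,000, and so on, runs much faster. Here's the code that does this: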

; uSize - Determines how many character positions it will take
;         to hold a 64-bit numeric-to-string conversion.

; Input:
;   RAX -    Number to check.

; Returns:
;   RAX -    Number of character positions required.

dig2        qword   10
dig3        qword   100
dig4        qword   1000
dig5        qword   10000
dig6        qword   100000
dig7        qword   1000000
dig8        qword   10000000
dig9        qword   100000000
dig10       qword   1000000000
dig11       qword   10000000000
dig12       qword   100000000000
dig13       qword   1000000000000
dig14       qword   10000000000000
dig15       qword   100000000000000
dig16       qword   1000000000000000
dig17       qword   10000000000000000
dig18       qword   100000000000000000
dig19       qword   1000000000000000000
dig20       qword   10000000000000000000

uSize       proc
            push    rdx
            cmp     rax, dig10
            jae     ge10
            cmp     rax, dig5
            jae     ge5
            mov     edx, 4
            cmp     rax, dig4
            jae     allDone
            dec     edx
            cmp     rax, dig3
            jae     allDone
            dec     edx
            cmp     rax, dig2
            jae     allDone
            dec     edx
            jmp     allDone

ge5:        mov     edx, 9
            cmp     rax, dig9
            jae     allDone
            dec     edx
            cmp     rax, dig8
            jae     allDone
            dec     edx
            cmp     rax, dig7
            jae     allDone
            dec     edx
            cmp     rax, dig6
            jae     allDone
            dec     edx      ; Must be 5
            jmp     allDone

ge10:       cmp     rax, dig14
            jae     ge14
            mov     edx, 13
            cmp     rax, dig13
            jae     allDone
            dec     edx
            cmp     rax, dig12
            jae     allDone
            dec     edx
            cmp     rax, dig11
            jae     allDone
            dec     edx      ; Must be 10
            jmp     allDone

ge14:       mov     edx, 20
            cmp     rax, dig20
            jae     allDone
            dec     edx
            cmp     rax, dig19
            jae     allDone
            dec     edx
            cmp     rax, dig18
            jae     allDone
            dec     edx
            cmp     rax, dig17
            jae     allDone
            dec     edx
            cmp     rax, dig16
            jae     allDone
            dec     edx
            cmp     rax, dig15
            jae     allDone
            dec     edx      ; Must be 14

allDone:    mov     rax, rdx ; Return digit count
            pop     rdx
            ret
uSize       endp
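
The comparison idea behind uSize can also be sketched compactly in C. This version walks the powers of 10 in order rather than using the hand-unrolled decision tree above, so it trades some speed for brevity; it still performs no division. The name usize_u64 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Return the number of decimal digits needed to print `value`
   by comparing it against 10, 100, 1000, and so on. */
int usize_u64(uint64_t value)
{
    int digits = 1;
    uint64_t limit = 10;            /* smallest value with digits + 1 digits */
    while (value >= limit) {
        digits++;
        if (digits == 20)
            break;                  /* 10^20 does not fit in 64 bits */
        limit *= 10;
    }
    assert(digits >= 1 && digits <= 20);
    return digits;
}
```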

For signed integers, you can use the following code:

; iSize - Determines the number of print positions required by 
;         a 64-bit signed integer.

iSize       proc
            test    rax, rax
            js      isNeg

            jmp     uSize   ; Effectively a call and ret

; If the number is negative, negate it, call uSize,
; and then bump the size up by 1 (for the "-" character):

isNeg:      neg     rax
            call    uSize
            inc     rax
            ret
iSize       endp

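For extended-precision size operations, the brute-force approach quickly becomes unwieldy (64 bits is bad enough). The best solution is to divide the extended-precision value by a power of 10 (say, 1e+18). This reduces the size of the number by 18 digits. Repeat the process as long as the quotient is larger than 64 bits, keeping track of the number of times you have divided by 1e+18. When the quotient fits into 64 bits (at most 19 or 20 digits), call the 64-bit uSize function and add in the number of digits you eliminated with the division operations (18 for each division by 1e+18). The implementation is left to you.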
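As one possible starting point for that exercise, here is a C sketch of the same strategy. Note the assumption: because portable C lacks a 128-by-64-bit divide, the sketch divides by 1e+9 (which fits in a 32-bit limb) rather than 1e+18, so each division removes 9 digits instead of 18. The names div_1e9 and usize_u128 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Divide the 128-bit value in limb[3..0] by 10^9 in place
   (limb[3] is the most significant 32 bits); return the remainder. */
static uint32_t div_1e9(uint32_t limb[4])
{
    uint64_t rem = 0;
    for (int i = 3; i >= 0; i--) {
        uint64_t cur = (rem << 32) | limb[i];
        limb[i] = (uint32_t)(cur / 1000000000u);
        rem = cur % 1000000000u;
    }
    return (uint32_t)rem;
}

/* Number of decimal digits needed to print the 128-bit value hi:lo. */
int usize_u128(uint64_t hi, uint64_t lo)
{
    uint32_t limb[4] = {
        (uint32_t)lo, (uint32_t)(lo >> 32),
        (uint32_t)hi, (uint32_t)(hi >> 32)
    };
    int removed = 0;

    /* Strip 9 digits at a time while the value needs more than 64 bits. */
    while (limb[2] | limb[3]) {
        (void)div_1e9(limb);
        removed += 9;
    }

    /* The rest fits in 64 bits; count its digits directly. */
    uint64_t rest = ((uint64_t)limb[1] << 32) | limb[0];
    int digits = 1;
    while (rest >= 10) {
        rest /= 10;
        digits++;
    }
    assert(digits + removed <= 39);
    return digits + removed;
}
```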

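Once you have the iSize and uSize routines, writing the formatted output routines, utoStrSize and itoStrSize, is easy. On initial entry, these routines call the corresponding iSize or uSize routine to determine the number of character positions the number requires. If the value that iSize or uSize returns is greater than or equal to the minimum-size parameter (passed in to utoStrSize or itoStrSize), no other formatting is necessary. If the value of the size parameter is greater than the value that iSize or uSize returns, the routine must compute the difference between the two values and emit that many spaces (or other filler characters) to the string before the numeric conversion. Listing 9-9 shows the utoStrSize and itoStrSize functions.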

; utoStrSize - Converts an unsigned integer to a formatted string
;              having at least "minDigits" character positions.
;              If the actual number of digits is smaller than
;              "minDigits" then this procedure inserts enough
;              "pad" characters to extend the size of the string.

; Inputs:
;    RAX -   Number to convert to string.
;    CL  -   minDigits (minimum print positions).
;    CH  -   Padding character.
;    RDI -   Buffer pointer for output string.

utoStrSize  proc
            push    rcx
            push    rdi
            push    rax

            call    uSize           ; Get actual number of digits
            sub     cl, al          ; >= the minimum size?
            jbe     justConvert

; If the minimum size is greater than the number of actual
; digits, we need to emit padding characters here.

; Note that this code used "sub" rather than "cmp" above.
; As a result, CL now contains the number of padding
; characters to emit to the string (CL is always positive
; at this point as negative and zero results would have
; branched to justConvert).

padLoop:    mov     [rdi], ch
            inc     rdi
            dec     cl
            jne     padLoop

; Okay, any necessary padding characters have already been
; added to the string. Call utoStr to convert the number
; to a string and append to the buffer:

justConvert:
            mov     rax, [rsp]      ; Retrieve original value
            call    utoStr

            pop     rax
            pop     rdi
            pop     rcx
            ret
utoStrSize  endp

; itoStrSize - Converts a signed integer to a formatted string
;              having at least "minDigits" character positions.
;              If the actual number of digits is smaller than
;              "minDigits" then this procedure inserts enough
;              "pad" characters to extend the size of the string.

; Inputs:
;    RAX -   Number to convert to string.
;    CL  -   minDigits (minimum print positions).
;    CH  -   Padding character.
;    RDI -   Buffer pointer for output string.

itoStrSize  proc
            push    rcx
            push    rdi
            push    rax

            call    iSize           ; Get actual number of digits
            sub     cl, al          ; >= the minimum size?
            jbe     justConvert

; If the minimum size is greater than the number of actual
; digits, we need to emit padding characters here.

; Note that this code used "sub" rather than "cmp" above.
; As a result, CL now contains the number of padding
; characters to emit to the string (CL is always positive
; at this point as negative and zero results would have
; branched to justConvert).

padLoop:    mov     [rdi], ch
            inc     rdi
            dec     cl
            jne     padLoop

; Okay, any necessary padding characters have already been
; added to the string. Call itoStr to convert the number
; to a string and append to the buffer:

justConvert:
            mov     rax, [rsp]     ; Retrieve original value
            call    itoStr

            pop     rax
            pop     rdi
            pop     rcx
            ret
itoStrSize  endp

Listing 9-9: Formatted integer-to-string conversion functions
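The padding logic is easy to check in a high-level language. The following C sketch is a rough equivalent of utoStrSize: it determines the digit count, emits any padding needed to reach the minimum width, and then appends the digits. The name u64_to_str_size and its parameter order are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert `value` to decimal, right-justified in at least
   `min_width` character positions, padding on the left with `pad`. */
void u64_to_str_size(uint64_t value, int min_width, char pad, char *out)
{
    assert(out != NULL && min_width >= 0);

    char tmp[20];                   /* a 64-bit value has at most 20 digits */
    int n = 0;
    do {
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    /* Emit padding only when the minimum width exceeds the digit count. */
    for (int i = n; i < min_width; i++)
        *out++ = pad;

    while (n > 0)
        *out++ = tmp[--n];
    *out = '\0';
}
```

As with the assembly version, the width is a minimum: a value that needs more positions than requested is emitted in full rather than truncated.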

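9.1.8 Converting Floating-Point Values to Strings

The code appearing thus far in this chapter has dealt with converting integer numeric values to character strings (typically for output to the user). Converting floating-point values to strings is just as important. This section (and its subsections) covers that conversion.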

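Floating-point values can be converted to strings in one of two forms:

Decimal notation conversion (for example, ± xxx.yyy format)
Exponential (or scientific) notation conversion (for example, ± x.yyyyye ± zz format)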

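Regardless of the final output format, two distinct operations are needed to convert a floating-point value to a character string. First, you must convert the mantissa to an appropriate string of digits. Second, you must convert the exponent to a string of digits.

However, this isn't a simple case of converting two integer values to decimal strings and concatenating them (with an e between the mantissa and exponent). First of all, the mantissa is not an integer value: it is a fixed-point fractional binary value. Simply treating it as an n-bit binary value (where n is the number of mantissa bits) will almost always result in an incorrect conversion. Second, while the exponent is, more or less, an integer value,^(1) it represents a power of 2, not a power of 10. Displaying that power of 2 as an integer value is not appropriate for decimal floating-point representation. Dealing with these two issues (fractional mantissa and binary exponent) is the major complication associated with converting a floating-point value to a string.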

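Though there are three floating-point formats on the x86-64 (single-precision, 32-bit real4; double-precision, 64-bit real8; and extended-precision, 80-bit real10), the x87 FPU automatically converts the real4 and real8 formats to real10 upon loading the value into the FPU. Therefore, by using the x87 FPU for all floating-point arithmetic during the conversion, we need only write code to convert real10 values into string form.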

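real10 floating-point values have a 64-bit mantissa. This is not a 64-bit integer. Instead, those 64 bits represent a value between 0 and slightly less than 2. (See "IEEE Floating-Point Formats" in Chapter 2 for more details on the IEEE 80-bit floating-point format.) Bit 63 is usually 1. If bit 63 is 0, the mantissa is denormalized, representing numbers between 0 and about 3.36 × 10^(-4932) (the smallest nonzero denormalized magnitude is about 3.65 × 10^(-4951)).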
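A small C sketch may help make the "mantissa is a fixed-point value, not an integer" point concrete. It takes the two fields of a real10 (the 16-bit sign and exponent word and the 64-bit mantissa) and recovers the integer part of the value by shifting the mantissa according to the unbiased exponent. The function name and its restriction to non-negative values in the range 1 to 2^64 are assumptions made to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Recover the integer part of a non-negative real10 value.
   sign_exp: bits 64..79 (sign bit plus 15-bit biased exponent)
   mantissa: bits 0..63 (bit 63 is the explicit integer bit)
   Returns 1 on success, 0 if the value is outside this sketch's range. */
int real10_to_u64(uint16_t sign_exp, uint64_t mantissa, uint64_t *result)
{
    assert(result != NULL);
    int exponent = (int)(sign_exp & 0x7FFF) - 16383;    /* remove the bias */
    if ((sign_exp & 0x8000) || exponent < 0 || exponent > 63)
        return 0;

    /* The mantissa is 1.xxx... scaled by 2^63; an exponent of e means
       the binary point sits e bits to the right of bit 63. */
    *result = mantissa >> (63 - exponent);              /* truncates the fraction */
    return 1;
}
```

For example, the encoding 403A_DE0B_6B3A_7640_0000h that appears later in this chapter has an unbiased exponent of 59, so shifting the mantissa right by 4 bits yields exactly 1,000,000,000,000,000,000.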

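To output the mantissa in decimal form with approximately 18 digits of precision, the trick is to successively multiply or divide the floating-point value by 10 until the number is between 1e+17 and just less than 1e+18 (that is, 9.9999...e+17). Once the exponent is in the appropriate range, the mantissa bits form an 18-digit integer value (no fractional part), which can be converted to a decimal string to obtain the 18 digits that make up the mantissa value (using our good friend, the fbstp instruction). In practice, you would multiply or divide the floating-point value by large powers of 10 to get the value into the range 1e+17 to 1e+18. This is faster (fewer floating-point operations) and more accurate (also because of fewer floating-point operations).

To convert the exponent to an appropriate decimal string, you need to track the number of multiplications or divisions by 10. For each division by 10, add 1 to the decimal exponent value; for each multiplication by 10, subtract 1 from the decimal exponent value. At the end of the process, the 18-digit integer times 10 raised to this count equals the original value. To express the number with a single digit to the left of the decimal point, add 17 to the decimal exponent value (because the scaled value has an exponent of 17), and then convert the decimal exponent value to a string.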
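The simple one-step-at-a-time version of this scaling can be sketched in C as follows. Two assumptions apply: the sketch uses double rather than real10, so only about 15 to 17 of the 18 digits are meaningful for arbitrary inputs, and it handles only positive finite values. The name scale_to_18_digits is hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Scale a positive value into [1e17, 1e18) by repeated multiplication
   or division by 10, return it as an 18-digit integer, and store the
   decimal exponent for d.ddd...e+xx notation in *dec_exp. */
uint64_t scale_to_18_digits(double value, int *dec_exp)
{
    assert(dec_exp != NULL);
    assert(isfinite(value) && value > 0.0);

    int count = 0;                  /* +1 per divide, -1 per multiply */
    while (value >= 1e18) {
        value /= 10.0;
        count++;
    }
    while (value < 1e17) {
        value *= 10.0;
        count--;
    }

    /* The scaled value is d.ddd... times 10^17, hence the + 17. */
    *dec_exp = count + 17;
    return (uint64_t)value;
}
```

For instance, 2500.0 is multiplied by 10 fourteen times to reach 2.5e+17, giving a count of -14 and a decimal exponent of 3, which matches 2.5e+3.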

9.1.8.1 Converting Floating-Point Exponents

To convert the exponent to a string of decimal digits, use the following algorithm:

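1. If the number is 0.0, directly produce the mantissa output string of " 000000000000000000" (note the space at the beginning of the string).
2. Initialize the decimal exponent to 0.
3. If the number is negative, emit a hyphen (-) and negate the value; if it is positive, emit a space character.
4. If the (possibly negated) value is less than 1.0, go to step 8.
5. Positive exponents: Compare the number against successively smaller powers of 10, starting with 10^(+4096), then 10^(+2048), then 10^(+1024), and so on, down to 10^(0). After each comparison, if the current value is greater than or equal to the power of 10, divide by that power of 10 and add the power-of-10 exponent (4096, 2048, ..., 0) to the decimal exponent value.
6. Repeat step 5 until the exponent is 0 (that is, the value is in the range 1.0 ≤ value < 10.0).
7. Go to step 10.
8. Negative exponents: Compare the number against successively larger powers of 10, starting with 10^(-4096), then 10^(-2048), then 10^(-1024), and so on, up to 10^(0). After each comparison, if the current value is less than the power of 10, divide by that power of 10 (equivalently, multiply by the matching positive power) and subtract the power-of-10 exponent (4096, 2048, ..., 0) from the decimal exponent value.
9. Repeat step 8 until the exponent is 0 (that is, the value is in the range 1.0 ≤ value < 10.0).
10. Certain legitimate floating-point values are too large to represent with 18 digits (for example, 9,223,372,036,854,775,807 fits into 63 bits but requires more than 18 significant digits to represent). Specifically, once the value has been scaled to form an 18-digit integer, values in the range 403A_DE0B_6B3A_763F_FF01h to 403A_DE0B_6B3A_763F_FFFFh lie just below 1e+18 yet still fit in the 64-bit mantissa; the topmost of them exceed 999,999,999,999,999,999, and the fbstp instruction cannot convert those to a packed BCD value. To resolve this problem, the code should explicitly test for values in this range and round them up to 1e+18, which it represents as 1e+17 with the decimal exponent value incremented. In some cases, the scaled value could be 1e+18 or greater. In such instances, a last division by 10.0 will solve the problem.
11. At this point, the floating-point value is a reasonable value that the fbstp instruction can convert to a packed BCD value, so the conversion function uses fbstp to do this conversion.
12. Finally, convert the packed BCD value to a string of ASCII characters by using an operation that converts numeric values to hexadecimal (BCD) strings (see "Converting Unsigned Decimal Values to Strings" and Listing 9-5).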
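
Steps 4 through 9 can be sketched in C to show how the power-of-10 tables get a value into the range 1.0 ≤ value < 10.0 in a handful of operations. This is an illustration under stated assumptions: it uses double instead of real10, so the tables stop at 10^(±256), and it adds a final fix-up loop in case rounding leaves the value a hair outside the range. The name normalize_1_10 is hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Scale a positive finite *value into [1.0, 10.0) and return the
   decimal exponent, so that original == (*value) * 10^exponent. */
int normalize_1_10(double *value)
{
    static const double pos[] = {1e256, 1e128, 1e64, 1e32, 1e16, 1e8, 1e4, 1e2, 1e1};
    static const double neg[] = {1e-256, 1e-128, 1e-64, 1e-32, 1e-16, 1e-8, 1e-4, 1e-2, 1e-1};
    static const int    exp10[] = {256, 128, 64, 32, 16, 8, 4, 2, 1};
    const size_t count = sizeof exp10 / sizeof exp10[0];

    assert(value != NULL && isfinite(*value) && *value > 0.0);
    double v = *value;
    int e = 0;

    if (v >= 1.0) {
        /* Positive exponents: divide by each power that still fits. */
        for (size_t i = 0; i < count; i++) {
            if (v >= pos[i]) {
                v /= pos[i];
                e += exp10[i];
            }
        }
    } else {
        /* Negative exponents: multiply by the matching positive power
           whenever the value is below the negative power. */
        for (size_t i = 0; i < count; i++) {
            if (v < neg[i]) {
                v *= pos[i];
                e -= exp10[i];
            }
        }
        if (v < 1.0) {              /* value is in [0.1, 1.0): one last step */
            v *= 10.0;
            e -= 1;
        }
    }

    /* Rounding can leave v slightly outside [1, 10); fix it up. */
    while (v >= 10.0) { v /= 10.0; e++; }
    while (v < 1.0)   { v *= 10.0; e--; }

    *value = v;
    return e;
}
```

Because each table entry is the square of the next one, a single pass over the table is enough; the exponent is effectively assembled one binary digit at a time.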

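Listing 9-10 provides the (abbreviated) code and data to implement the mantissa-to-string conversion function, FPDigits. FPDigits converts the mantissa to a sequence of 18 digits and returns the decimal exponent value in the EAX register. It doesn't place a decimal point anywhere in the string, nor does it process the exponent at all.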

            .data

            align   4

; TenTo17 - Holds the value 1.0e+17. Used to get a floating-
;           point number into the range x.xxxxxxxxxxxxe+17.

TenTo17     real10  1.0e+17

; PotTblN - Hold powers of 10 raised to negative powers of 2.

PotTblN     real10  1.0,
                    1.0e-1,
                    1.0e-2,
                    1.0e-4,
                    1.0e-8,
                    1.0e-16,
                    1.0e-32,
                    1.0e-64,
                    1.0e-128,
                    1.0e-256,
                    1.0e-512,
                    1.0e-1024,
                    1.0e-2048,
                    1.0e-4096

; PotTblP - Hold powers of 10 raised to positive powers of 2.

            align   4
PotTblP     real10  1.0,
                    1.0e+1,
                    1.0e+2,
                    1.0e+4,
                    1.0e+8,
                    1.0e+16,
                    1.0e+32,
                    1.0e+64,
                    1.0e+128,
                    1.0e+256,
                    1.0e+512,
                    1.0e+1024,
                    1.0e+2048,
                    1.0e+4096

; ExpTab - Integer equivalents to the powers
;          in the tables above.

            align   4
ExpTab      dword   0,
                    1,
                    2,
                    4,
                    8,
                    16,
                    32,
                    64,
                    128,
                    256,
                    512,
                    1024,
                    2048,
                    4096
               .
               .
               .

;*************************************************************

; FPDigits - Used to convert a floating-point number on the FPU
;            stack (ST(0)) to a string of digits.

; Entry Conditions:

; ST(0) -    80-bit number to convert.
;            Note: code requires two free FPU stack elements.
; RDI   -    Points at array of at least 18 bytes where 
;            FPDigits stores the output string.

; Exit Conditions:

; RDI   -    Converted digits are found here.
; RAX   -    Contains exponent of the number.
; CL    -    Contains the sign of the mantissa (" " or "-").
; ST(0) -    Popped from stack.

;*************************************************************

P10TblN     equ     <real10 ptr [r8]>
P10TblP     equ     <real10 ptr [r9]>
xTab        equ     <dword ptr [r10]>

FPDigits    proc
            push    rbx
            push    rdx
            push    rsi
            push    r8
            push    r9
            push    r10

; Special case if the number is zero.

            ftst
            fstsw   ax
            sahf
            jnz     fpdNotZero

; The number is zero, output it as a special case.

            fstp    tbyte ptr [rdi] ; Pop value off FPU stack
            mov     rax, "00000000"
            mov     [rdi], rax 
            mov     [rdi + 8], rax 
            mov     [rdi + 16], ax
            add     rdi, 18 
            xor     edx, edx        ; Return an exponent of 0
            mov     bl, ' '         ; Sign is positive
            jmp     fpdDone

fpdNotZero:

; If the number is not zero, then fix the sign of the value.

            mov     bl, ' '         ; Assume it's positive
            jnc     WasPositive     ; Flags set from sahf above

            fabs                 ; Deal only with positive numbers
            mov     bl, '-'      ; Set the sign return result

WasPositive:

; Get the number between 1 and 10 so we can figure out 
; what the exponent is.  Begin by checking to see if we have
; a positive or negative exponent.

            xor     edx, edx     ; Initialize exponent to 0
            fld1
            fcomip  st(0), st(1)
            jbe     PosExp

; We've got a value between zero and one, exclusive,
; at this point.  That means this number has a negative
; exponent.  Multiply the number by an appropriate power
; of 10 until we get it in the range 1 through 10.

            mov     esi, sizeof PotTblN  ; After last element
            mov     ecx, sizeof ExpTab   ; Ditto
            lea     r8, PotTblN
            lea     r9, PotTblP
            lea     r10, ExpTab

CmpNegExp:
            sub     esi, 10          ; Move to previous element
            sub     ecx, 4           ; Zeroes HO bytes
            jz      test1

            fld     P10TblN[rsi * 1] ; Get current power of 10
            fcomip  st(0), st(1)     ; Compare against NOS
            jbe     CmpNegExp        ; While table <= value

            mov     eax, xTab[rcx * 1]
            test    eax, eax
            jz      didAllDigits

            sub     edx, eax
            fld     P10TblP[rsi * 1]
            fmulp
            jmp     CmpNegExp

; If the remainder is *exactly* 1.0, then we can branch
; on to InRange1_10; otherwise, we still have to multiply
; by 10.0 because we've overshot the mark a bit.

test1:
            fld1
            fcomip  st(0), st(1)
            je      InRange1_10

didAllDigits:

; If we get to this point, then we've indexed through
; all the elements in the PotTblN and it's time to stop.

            fld     P10TblP[10]   ; 10.0
            fmulp
            dec     edx
            jmp     InRange1_10

; At this point, we've got a number that is 1 or greater.
; Once again, our task is to get the value between 1 and 10.

PosExp:

            mov     esi, sizeof PotTblP ; After last element
            mov     ecx, sizeof ExpTab  ; Ditto
            lea     r9, PotTblP
            lea     r10, ExpTab

CmpPosExp:
            sub     esi, 10             ; Move back 1 element in
            sub     ecx, 4              ; PotTblP and ExpTbl
            fld     P10TblP[rsi * 1]
            fcomip  st(0), st(1)
            ja      CmpPosExp;
            mov     eax, xTab[rcx * 1]
            test    eax, eax
            jz      InRange1_10

            add     edx, eax
            fld     P10TblP[rsi * 1]
            fdivp
            jmp     CmpPosExp

InRange1_10:

; Okay, at this point the number is in the range 1 <= x < 10.
; Let's multiply it by 1e+17 to put the most significant digit
; into the 18th print position.  Then convert the result to
; a BCD value and store away in memory.

            sub     rsp, 24         ; Make room for BCD result
            fld     TenTo17
            fmulp

; We need to check the floating-point result to make sure it
; is not outside the range we can legally convert to a BCD 
; value.

; Illegal values will be in the range:

; >999,999,999,999,999,999 ... <1,000,000,000,000,000,000
; $403a_de0b_6b3a_763f_ff01 ... $403a_de0b_6b3a_763f_ffff

; Should one of these values appear, round the result up to
; $403a_de0b_6b3a_7640_0000:

            fstp    real10 ptr [rsp]
            cmp     word ptr [rsp + 8], 403ah
            jne     noRounding

            cmp     dword ptr [rsp + 4], 0de0b6b3ah
            jne     noRounding

            mov     eax, [rsp]
            cmp     eax, 763fff01h
            jb      noRounding;
            cmp     eax, 76400000h
            jae     TooBig

            fld     TenTo17
            inc     edx           ; Inc exp as this is really 10^18
            jmp     didRound

; If we get down here, there were problems getting the
; value in the range 1 <= x < 10 above and we've got a value
; that is 1e+18 or slightly larger. We need to compensate for
; that here.

TooBig:
            lea     r9, PotTblP
            fld     real10 ptr [rsp]
            fld     P10TblP[10]   ; /10
            fdivp
            inc     edx           ; Adjust exp due to fdiv
            jmp     didRound

noRounding:
            fld     real10 ptr [rsp]
didRound:   
            fbstp   tbyte ptr [rsp]

; The data on the stack contains 18 BCD digits. Convert these
; to ASCII characters and store them at the destination location
; pointed at by RDI.

            mov     ecx, 8
repeatLp:
            mov     al, byte ptr [rsp + rcx]
            shr     al, 4         ; Always in the
            or      al, '0'       ; range "0" to "9"
            mov     [rdi], al
            inc     rdi

            mov     al, byte ptr [rsp + rcx]
            and     al, 0fh
            or      al, '0'
            mov     [rdi], al
            inc     rdi

            dec     ecx
            jns     repeatLp

            add     rsp, 24         ; Remove BCD data from stack

fpdDone:

            mov     eax, edx        ; Return exponent in EAX
            mov     cl, bl          ; Return sign in CL
            pop     r10
            pop     r9
            pop     r8
            pop     rsi
            pop     rdx
            pop     rbx
            ret

FPDigits    endp

Listing 9-10: Floating-point mantissa-to-string conversion

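9.1.8.2 Converting a Floating-Point Value to a Decimal String

The FPDigits function does most of the work needed to convert a floating-point value to a string in decimal notation: it converts the mantissa to a string of digits and provides the exponent in a decimal integer form. Although the decimal format does not explicitly display the exponent value, a procedure that converts the floating-point value to a decimal string needs the (decimal) exponent value in order to determine where to put the decimal point. Along with a few additional arguments that the caller supplies, it's relatively easy to take the output from FPDigits and convert it to an appropriately formatted decimal string of digits.

The final function to write is r10ToStr, the main function to call when converting a real10 value to a string. This is a formatted output function that converts a binary floating-point value by using standard formatting options to control the output width, the number of positions after the decimal point, and the fill character to write where digits don't appear (usually, this is a space). The r10ToStr function call needs the following arguments: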

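r10

The real10 value to convert to a string (if r10 is a real4 or real8 value, the FPU automatically converts it to a real10 value when loading it into the FPU).

fWidth

The field width. This is the total number of character positions that the string will consume. This count includes room for a sign (which could be a space or a hyphen) but does not include space for a zero-terminating byte for the string. The field width must be greater than 0 and less than or equal to 1024.

decDigits

The number of digits to the right of the decimal point. This value must be at least 3 less than fWidth because there must be room for a sign character, at least one digit to the left of the decimal point, and the decimal point. If this value is 0, the conversion routine will not emit a decimal point to the string. This is an unsigned value; if the caller supplies a negative number here, the procedure treats it as a very large positive value (and will return an error).

fill

The fill character. If the numeric string that r10ToStr produces uses fewer characters than fWidth, the procedure will right-justify the numeric value in the output string and fill the leftmost characters with this fill character (which is usually a space character).

buffer

A buffer to receive the numeric string.

maxLength

The size of the buffer (including the zero-terminating byte). If the conversion routine attempts to create a string larger than this value (meaning fWidth is greater than or equal to this value), then it returns an error.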

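The string output operation has only three real tasks: properly position the decimal point (if present), copy only those digits specified by the fWidth value, and round the truncated digits into the output digits.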

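The rounding operation is the most interesting part of the procedure. The r10ToStr function converts the real10 value to ASCII characters before rounding because it's easier to round the result after the conversion. So the rounding operation consists of adding 5 to the (ASCII) digit just beyond the least significant displayed digit. If this sum exceeds (the character) 9, the rounding algorithm has to add 1 to the least significant displayed digit. If that sum exceeds 9, the algorithm must subtract (the value) 10 from the character and add 1 to the next more significant digit. This repeats until reaching the most significant digit or until there is no carry out of a given digit (that is, the sum does not exceed 9). In the (rare) case that rounding bubbles through all the digits (for example, the string is "999999 . . . 9"), the rounding algorithm has to replace the string with "10000 . . . 0" and increment the decimal exponent by 1.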
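The carry-propagation logic is easier to follow in a short C sketch. This version compares the first discarded digit against '5' (which is equivalent to adding 5 and checking for a result above 9) and then ripples the carry leftward through the ASCII digits. The function name and its return convention are assumptions made for this illustration, not the interface that r10ToStr uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round the NUL-terminated ASCII digit string `digits` (most
   significant digit first) to its first `keep` digits, in place.
   Returns 1 if the carry rippled out of the top digit, in which case
   the string is now "100...0" and the caller must add 1 to the
   decimal exponent; returns 0 otherwise. */
int round_digits(char *digits, int keep)
{
    assert(digits != NULL);
    int len = (int)strlen(digits);
    assert(keep >= 1 && keep <= len);

    /* A first discarded digit of 5 or more produces a carry. */
    int carry = (keep < len && digits[keep] >= '5');
    digits[keep] = '\0';

    for (int i = keep - 1; carry && i >= 0; i--) {
        if (digits[i] == '9') {
            digits[i] = '0';        /* 9 + 1 wraps to 0; carry continues */
        } else {
            digits[i]++;
            carry = 0;
        }
    }

    if (carry) {                    /* "999...9" has become "000...0" */
        digits[0] = '1';
        return 1;
    }
    return 0;
}
```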

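The algorithm for emitting the string differs for values with negative and non-negative exponents. Negative exponents are probably the easiest to process. Here's the algorithm for emitting values with a negative exponent: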

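The function begins by adding 3 to decDigits.
If decDigits is less than 4, the function sets it to 4 as a default value.^(3)
If decDigits is greater than fWidth, the function emits fWidth "#" characters to the string and returns.
If decDigits is less than fWidth, the function emits (fWidth - decDigits) padding characters (fill) to the output string.
If r10 was negative, the function emits -0. to the string; otherwise, it emits 0. to the string (with a leading space in front of the 0 if non-negative).
Next, the function emits the digits from the converted number. If the field width is less than 21 (18 digits plus the 3 leading 0. or -0. characters), the function emits the specified (fWidth) characters from the converted digit string. If the width is greater than 21, the function emits all 18 digits from the converted digits and follows them with however many 0 characters are necessary to fill out the field width.
Finally, the function zero-terminates the string and returns.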

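If the exponent is positive or 0, the conversion is slightly more complicated. First, the code has to determine the number of character positions required by the result. This is computed as follows: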

`exponent` + 2 + `decDigits` + (0 if `decDigits` is 0, 1 otherwise)

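The exponent value is the number of digits to the left of the decimal point (minus 1). The 2 component is present because there is always a position for the sign character (space or hyphen) and there is always at least one digit to the left of the decimal point. The decDigits component adds in the number of digits to appear after the decimal point. Finally, this equation adds in 1 for the dot character if a decimal point is present (that is, if decDigits is greater than 0).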

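Once the required width is computed, the function compares this value against the fWidth value that the caller supplies. If the computed value is greater than fWidth, the function emits fWidth "#" characters and returns. Otherwise, it can emit the digits to the string.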
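The width calculation and the overflow check are simple enough to express directly. The following C sketch is illustrative only; required_width and fits_or_hash are hypothetical names, and the sketch covers just the non-negative-exponent case described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Character positions needed for a value with a non-negative decimal
   exponent: sign + (exponent + 1) integer digits + optional dot
   + dec_digits fractional digits. */
int required_width(int exponent, int dec_digits)
{
    assert(exponent >= 0 && dec_digits >= 0);
    return exponent + 2 + dec_digits + (dec_digits == 0 ? 0 : 1);
}

/* If the value cannot fit in f_width positions, fill `out` with
   f_width '#' characters and return 0; otherwise return 1 and leave
   `out` untouched so the caller can emit the digits. */
int fits_or_hash(int exponent, int dec_digits, int f_width, char *out)
{
    assert(out != NULL && f_width > 0);
    if (required_width(exponent, dec_digits) > f_width) {
        memset(out, '#', (size_t)f_width);
        out[f_width] = '\0';
        return 0;
    }
    return 1;
}
```

For example, 123.456 has a decimal exponent of 2 and three digits after the decimal point, so it needs 2 + 2 + 3 + 1 = 8 positions: a space for the sign, three integer digits, the dot, and three fractional digits.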

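As happens with negative exponents, the code begins by determining whether the number will consume all the character positions in the output string. If not, it computes the difference between fWidth and the actual number of characters and outputs fill characters to pad the numeric string. Next, it outputs a space or a hyphen character (depending on the sign of the original value). Then the function outputs the digits to the left of the decimal point (by counting down the exponent value). If the decDigits value is nonzero, the function emits the dot character and any digits remaining in the digit string that FPDigits produced. If the function ever exceeds the 18 digits that FPDigits produced (either before or after the decimal point), the function fills the remaining positions with the 0 character. Finally, the function emits the zero-terminating byte for the string and returns to the caller.

Listing 9-11 provides the source code for the r10ToStr function.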

;***********************************************************

; r10ToStr -  Converts a real10 floating-point number to the
;             corresponding string of digits.  Note that this
;             function always emits the string using decimal
;             notation.  For scientific notation, use the e10ToBuf
;             routine.

; On Entry:

;    r10        -    real10 value to convert.
;                    Passed in ST(0).

;    fWidth     -    Field width for the number (note that this
;                    is an *exact* field width, not a minimum
;                    field width).
;                    Passed in EAX (RAX).

;    decimalpts -    # of digits to display after the decimal pt.
;                    Passed in EDX (RDX). 

;    fill       -    Padding character if the number is smaller
;                    than the specified field width.
;                    Passed in CL (RCX).

;    buffer     -    Stores the resulting characters in
;                    this string.
;                    Address passed in RDI.

;    maxLength  -    Maximum string length.
;                    Passed in R8d (R8).

; On Exit:

; Buffer contains the newly formatted string.  If the
; formatted value does not fit in the width specified,
; r10ToStr will store "#" characters into this string.

; Carry -    Clear if success; set if an exception occurs.
;            If width is larger than the maximum length of
;            the string specified by buffer, this routine
;            will return with the carry set and RAX = -1,
;            -2, or -3.

;***********************************************************

r10ToStr    proc

; Local variables:

fWidth      equ     <dword ptr [rbp - 8]>    ; RAX: uns32
decDigits   equ     <dword ptr [rbp - 16]>   ; RDX: uns32
fill        equ     <[rbp - 24]>             ; CL: char
bufPtr      equ     <[rbp - 32]>             ; RDI: pointer
exponent    equ     <dword ptr [rbp - 40]>   ; uns32
sign        equ     <byte ptr [rbp - 48]>    ; char
digits      equ     <byte ptr [rbp - 128]>   ; char[80]
maxWidth    =       64              ; Must be smaller than 80 - 2

            push    rdi
            push    rbx
            push    rcx
            push    rdx
            push    rsi
            push    rax
            push    rbp
            mov     rbp, rsp
            sub     rsp, 128        ; 128 bytes of local vars

; First, make sure the number will fit into the 
; specified string.

            cmp     eax, r8d        ; R8d = max length
            jae     strOverflow

; If the width is zero, raise an exception:

            test    eax, eax
            jz      voor            ; Value out of range

            mov     bufPtr, rdi
            mov     qword ptr decDigits, rdx
            mov     fill, rcx
            mov     qword ptr fWidth, rax

; If the width is too big, raise an exception:

            cmp     eax, maxWidth
            ja      badWidth

; Okay, do the conversion.
; Begin by processing the mantissa digits:

            lea     rdi, digits     ; Store result here
            call    FPDigits        ; Convert r80 to string
            mov     exponent, eax   ; Save exp result
            mov     sign, cl        ; Save mantissa sign char

; Round the string of digits to the number of significant 
; digits we want to display for this number:

            cmp     eax, 17
            jl      dontForceWidthZero

            xor     rax, rax        ; If the exp is negative or
                                    ; too large, set width to 0
dontForceWidthZero:
            mov     rbx, rax        ; Really just 8 bits
            add     ebx, decDigits  ; Compute rounding position
            cmp     ebx, 17
            jge     dontRound       ; Don't bother if a big #

; To round the value to the number of significant digits,
; go to the digit just beyond the last one we are considering
; (EAX currently contains the number of decimal positions)
; and add 5 to that digit.  Propagate any overflow into the
; remaining digit positions.

            inc     ebx                 ; Index + 1 of last sig digit
            mov     al, digits[rbx * 1] ; Get that digit
            add     al, 5               ; Round (for example, +0.5)
            cmp     al, '9'
            jbe     dontRound

            mov     digits[rbx * 1], '0' + 10 ; Force to zero

whileDigitGT9:                                ; (See sub 10 below)
            sub     digits[rbx * 1], 10       ; Sub out overflow, 
            dec     ebx                       ; carry, into prev
            js      hitFirstDigit;            ; digit (until 1st
                                              ; digit in the #)
            inc     digits[rbx * 1]
            cmp     digits[rbx], '9'          ; Overflow if > "9"
            ja      whileDigitGT9
            jmp     dontRound

hitFirstDigit:

; If we get to this point, then we've hit the first
; digit in the number.  So we've got to shift all
; the characters down one position in the string of
; bytes and put a "1" in the first character position.

            mov     ebx, 17

repeatUntilEBXeq0:

            mov     al, digits[rbx * 1]
            mov     digits[rbx * 1 + 1], al
            dec     ebx
            jnz     repeatUntilEBXeq0

            mov     digits, '1'
            inc     exponent    ; Because we added a digit

dontRound: 

; Handle positive and negative exponents separately.

            mov     rdi, bufPtr ; Store the output here
            cmp     exponent, 0
            jge     positiveExponent

; Negative exponents:
; Handle values between 0 and 1.0 here (negative exponents
; imply negative powers of 10).

; Compute the number's width.  Since this value is between
; 0 and 1, the width calculation is easy: it's just the
; number of decimal positions they've specified plus three
; (since we need to allow room for a leading "-0.").

            mov     ecx, decDigits
            add     ecx, 3
            cmp     ecx, 4
            jae     minimumWidthIs4

            mov     ecx, 4      ; Minimum possible width is four

minimumWidthIs4:
            cmp     ecx, fWidth
            ja      widthTooBig 

; This number will fit in the specified field width,
; so output any necessary leading pad characters.

            mov     al, fill
            mov     edx, fWidth
            sub     edx, ecx
            jmp     testWhileECXltWidth

whileECXltWidth:
            mov     [rdi], al
            inc     rdi
            inc     ecx

testWhileECXltWidth:
            cmp     ecx, fWidth
            jb      whileECXltWidth

; Output " 0." or "-0.", depending on the sign of the number.

            mov     al, sign
            cmp     al, '-'
            je      isMinus

            mov     al, ' '

isMinus:    mov     [rdi], al
            inc     rdi
            inc     edx

            mov     word ptr [rdi], '.0'
            add     rdi, 2
            add     edx, 2

; Now output the digits after the decimal point:

            xor     ecx, ecx        ; Count the digits in ECX
            lea     rbx, digits     ; Pointer to data to output

; If the exponent is currently negative, or if
; we've output more than 18 significant digits,
; just output a zero character.

repeatUntilEDXgeWidth: 
            mov     al, '0'
            inc     exponent
            js      noMoreOutput

            cmp     ecx, 18
            jge     noMoreOutput

            mov     al, [rbx]
            inc     rbx             ; Advance the full 64-bit pointer

noMoreOutput:
            mov     [rdi], al
            inc     rdi
            inc     ecx
            inc     edx
            cmp     edx, fWidth
            jb      repeatUntilEDXgeWidth
            jmp     r10BufDone

; If the number's actual width was bigger than the width
; specified by the caller, emit a sequence of "#" characters
; to denote the error.

widthTooBig:

; The number won't fit in the specified field width,
; so fill the string with the "#" character to indicate
; an error.

            mov     ecx, fWidth
            mov     al, '#'
fillPound:  mov     [rdi], al
            inc     rdi
            dec     ecx
            jnz     fillPound
            jmp     r10BufDone

; Handle numbers with a positive exponent here.

positiveExponent:

; Compute # of digits to the left of the ".".
; This is given by:

;                   Exponent        ; # of digits to left of "."
;           +       2               ; Allow for sign and there
;                                   ; is always 1 digit left of "."
;           +       decimalpts      ; Add in digits right of "."
;           +       1               ; If there is a decimal point

            mov     edx, exponent   ; Digits to left of "."
            add     edx, 2          ; 1 digit + sign posn
            cmp     decDigits, 0
            je      decPtsIs0

            add     edx, decDigits  ; Digits to right of "."
            inc     edx             ; Make room for the "."

decPtsIs0:

; Make sure the result will fit in the
; specified field width.

            cmp     edx, fWidth
            ja      widthTooBig

; If the actual number of print positions
; is fewer than the specified field width,
; output leading pad characters here.

            cmp     edx, fWidth
            jae     noFillChars

            mov     ecx, fWidth
            sub     ecx, edx
            jz      noFillChars
            mov     al, fill
fillChars:  mov     [rdi], al
            inc     rdi
            dec     ecx
            jnz     fillChars

noFillChars:

; Output the sign character.

            mov     al, sign
            cmp     al, '-'
            je      outputMinus;

            mov     al, ' '

outputMinus:
            mov     [rdi], al
            inc     rdi

; Okay, output the digits for the number here.

            xor     ecx, ecx        ; Counts # of output chars
            lea     rbx, digits     ; Ptr to digits to output

; Calculate the number of digits to output
; before and after the decimal point.

            mov     edx, decDigits  ; Chars after "."
            add     edx, exponent   ; # chars before "."
            inc     edx             ; Always one digit before "."

; If we've output fewer than 18 digits, go ahead
; and output the next digit.  Beyond 18 digits,
; output zeros.

repeatUntilEDXeq0:
            mov     al, '0'
            cmp     ecx, 18
            jnb     putChar

            mov     al, [rbx]
            inc     rbx

putChar:    mov     [rdi], al
            inc     rdi

; If the exponent decrements to zero,
; then output a decimal point.

            cmp     exponent, 0
            jne     noDecimalPt
            cmp     decDigits, 0
            je      noDecimalPt

            mov     al, '.'
            mov     [rdi], al
            inc     rdi

noDecimalPt:
            dec     exponent        ; Count down to "." output
            inc     ecx             ; # of digits thus far
            dec     edx             ; Total # of digits to output
            jnz     repeatUntilEDXeq0

; Zero-terminate string and leave:

r10BufDone: mov     byte ptr [rdi], 0
            leave
            clc                     ; No error
            jmp     popRet

badWidth:   mov     rax, -2     ; Illegal width
            jmp     ErrorExit

strOverflow:
            mov     rax, -3     ; String overflow
            jmp     ErrorExit

voor:       or      rax, -1     ; Range error
ErrorExit:  leave
            stc     ; Error
            mov     [rsp], rax  ; Change RAX on return

popRet:     pop     rax
            pop     rsi
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rdi
            ret

r10ToStr    endp

Listing 9-11: r10ToStr conversion function

9.1.8.3 Converting a Floating-Point Value to Exponential Form

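Converting a floating-point value to exponential (scientific) form is a bit easier than converting it to decimal form. The mantissa always takes the form sx.y, where s is a hyphen or a space, x is exactly one decimal digit, and y is one or more decimal digits. The FPDigits function does almost all the work to create this string. The exponential conversion function needs to output the mantissa string with the sign and decimal point characters, and then output the decimal exponent for the number. Converting the exponent value (returned as a decimal integer in the EAX register by FPDigits) to a string is just the numeric-to-decimal-string conversion given earlier in this chapter, using a different output format.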

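The function appearing in this chapter allows you to specify the number of digits for the exponent as 1, 2, 3, or 4. If the exponent requires more digits than the caller specifies, the function returns a failure. If it requires fewer digits than the caller specifies, the function pads the exponent with leading 0s. To emulate the typical floating-point conversion forms, specify 2 exponent digits for single-precision values, 3 for double-precision values, and 4 for extended-precision values.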
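The padding and failure rules for the exponent can be sketched in C as follows. This is an illustrative stand-in, not a translation of the assembly routine: the name expToText and its return convention are hypothetical, it uses division rather than the FPU's BCD conversion, and it rejects values above 9999 in the 4-digit case as an extra assumption.

```c
/* Hypothetical sketch: writes exactly expDigs decimal digits of exponent
   into out, padded on the left with '0', followed by a zero byte.
   Returns 0 on success, -1 if expDigs is outside 1..4 or the exponent
   needs more digits than expDigs allows. */
int expToText(unsigned exponent, unsigned expDigs, char *out)
{
    static const unsigned limit[] = { 0, 10, 100, 1000, 10000 };

    if (expDigs < 1 || expDigs > 4)
        return -1;                          /* unsupported digit count */
    if (exponent >= limit[expDigs])
        return -1;                          /* exponent will not fit */

    for (unsigned i = expDigs; i > 0; i--) {
        out[i - 1] = (char)('0' + exponent % 10);   /* fill right to left */
        exponent /= 10;
    }
    out[expDigs] = 0;
    return 0;
}
```

Here an exponent of 5 with two digits requested becomes "05", while an exponent of 123 with two digits requested is reported as a failure.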

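Listing 9-12 provides a quick-and-dirty function that converts the decimal exponent value to the appropriate string form and emits those characters to a buffer. As the listing is written, the function returns with RDI pointing at the last exponent digit and stores a zero byte immediately after that digit. It's really just a helper function that outputs characters for the e10ToStr function, which appears in the next listing.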

;*************************************************************

; expToBuf - Unsigned integer to buffer.
;            Used to output up to 4-digit exponents.

; Inputs:

;    EAX:   Unsigned integer to convert.
;    ECX:   Print width 1-4.
;    RDI:   Points at buffer.

;    FPU:   Uses FPU stack.

; Returns:

;    RDI:   Points at end of buffer.

expToBuf    proc

expWidth    equ     <[rbp + 16]>
exp         equ     <[rbp + 8]>
bcd         equ     <[rbp - 16]>

            push    rdx
            push    rcx            ; At [RBP + 16]
            push    rax            ; At [RBP + 8]
            push    rbp
            mov     rbp, rsp
            sub     rsp, 16

; Verify exponent digit count is in the range 1-4:

            cmp     rcx, 1
            jb      badExp
            cmp     rcx, 4
            ja      badExp
            mov     rdx, rcx

; Verify the actual exponent will fit in the number of digits:

            cmp     rcx, 2
            jb      oneDigit
            je      twoDigits
            cmp     rcx, 3
            ja      fillZeros      ; 4 digits, no error
            cmp     eax, 1000
            jae     badExp
            jmp     fillZeros

oneDigit:   cmp     eax, 10
            jae     badExp
            jmp     fillZeros

twoDigits:  cmp     eax, 100
            jae     badExp

; Fill in zeros for exponent:

fillZeros:  mov     byte ptr [rdi + rcx * 1 - 1], '0'
            dec     ecx
            jnz     fillZeros

; Point RDI at the end of the buffer:

            lea     rdi, [rdi + rdx * 1 - 1]
            mov     byte ptr [rdi + 1], 0
            push    rdi             ; Save pointer to end

; Quick test for zero to handle that special case:

            test    eax, eax
            jz      allDone

; The number to convert is nonzero.
; Use BCD load and store to convert
; the integer to BCD:

            fild    dword ptr exp   ; Get integer value
            fbstp   tbyte ptr bcd   ; Convert to BCD

; Begin by skipping over leading zeros in
; the BCD value (max 10 digits, so the most
; significant digit will be in the HO nibble
; of byte 4).

            mov     eax, bcd        ; Get exponent digits
            mov     ecx, expWidth   ; Number of total digits

OutputExp:  mov     dl, al
            and     dl, 0fh
            or      dl, '0'
            mov     [rdi], dl
            dec     rdi
            shr     ax, 4
            jnz     OutputExp

; Zero-terminate the string and return:

allDone:    pop     rdi
            leave
            pop     rax
            pop     rcx
            pop     rdx
            clc
            ret

badExp:     leave
            pop     rax
            pop     rcx
            pop     rdx
            stc
            ret

expToBuf    endp

Listing 9-12: Exponent conversion function

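The actual e10ToStr function, in Listing 9-13, is similar to the r10ToStr function. The mantissa output is less complex because the form is fixed, but emitting the exponent takes a little additional work. For details on the operation of this code, refer back to "Converting a Floating-Point Value to a Decimal String" on page 527.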

;***********************************************************

; e10ToStr - Converts a real10 floating-point number to the
;            corresponding string of digits.  Note that this
;            function always emits the string using scientific
;            notation; use the r10ToStr routine for decimal notation.  

; On Entry:

;    e10         -   real10 value to convert.
;                    Passed in ST(0).

;    width       -   Field width for the number (note that this
;                    is an *exact* field width, not a minimum
;                    field width).
;                    Passed in RAX (LO 32 bits).

;    fill        -   Padding character if the number is smaller
;                    than the specified field width.
;                    Passed in RCX.

;    buffer      -   e10ToStr stores the resulting characters in
;                    this buffer (passed in RDI).

;    expDigs     -   Number of exponent digits (2 for real4,
;                    3 for real8, and 4 for real10).
;                    Passed in RDX (LO 8 bits).

;    maxLength   -   Maximum buffer size.
;                    Passed in R8.

; On Exit:                                                  

;    RDI         -  Points at end of converted string.      

; Buffer contains the newly formatted string.  If the    
; formatted value does not fit in the width specified,   
; e10ToStr will store "#" characters into this string.   

; If there was an error, EAX contains -1, -2, or -3      
; denoting the error (value out of range, bad width,     
; or string overflow, respectively).                     

;***********************************************************

; Unlike the integer-to-string conversions, this routine    
; always right-justifies the number in the specified        
; string.  Width must be a positive number; negative        
; values are illegal (actually, they are treated as         
; *really* big positive numbers that will always raise      
; a string overflow exception).                              

;***********************************************************

e10ToStr    proc

fWidth      equ     <[rbp - 8]>       ; RAX
buffer      equ     <[rbp - 16]>      ; RDI
expDigs     equ     <[rbp - 24]>      ; RDX
rbxSave     equ     <[rbp - 32]>
rcxSave     equ     <[rbp - 40]>
rsiSave     equ     <[rbp - 48]>
Exponent    equ     <dword ptr [rbp - 52]>
MantSize    equ     <dword ptr [rbp - 56]>
Sign        equ     <byte ptr [rbp - 60]>
Digits      equ     <byte ptr [rbp - 128]>

            push    rbp
            mov     rbp, rsp
            sub     rsp, 128

            mov     buffer, rdi
            mov     rsiSave, rsi
            mov     rcxSave, rcx
            mov     rbxSave, rbx
            mov     fWidth, rax
            mov     expDigs, rdx

            cmp     eax, r8d
            jae     strOvfl
            mov     byte ptr [rdi + rax * 1], 0 ; Zero-terminate str

; First, make sure the width isn't zero.

            test    eax, eax
            jz      voor

; Just to be on the safe side, don't allow widths greater 
; than 1024:

            cmp     eax, 1024
            ja      badWidth

; Okay, do the conversion.

            lea     rdi, Digits     ; Store result string here
            call    FPDigits        ; Convert e80 to digit str
            mov     Exponent, eax   ; Save away exponent result
            mov     Sign, cl        ; Save mantissa sign char

; Verify that there is sufficient room for the mantissa's sign,
; the decimal point, two mantissa digits, the "E", and the
; exponent's sign.  Also add in the number of digits required
; by the exponent (2 for real4, 3 for real8, 4 for real10).

; -1.2e+00    :real4
; -1.2e+000   :real8
; -1.2e+0000  :real10

            mov     ecx, 6          ; Char posns for above chars
            add     ecx, expDigs    ; # of digits for the exp
            cmp     ecx, fWidth
            jbe     goodWidth

; Output a sequence of "#...#" chars (to the specified width)
; if the width value is not large enough to hold the 
; conversion:

            mov     ecx, fWidth
            mov     al, '#'
            mov     rdi, buffer
fillPound:  mov     [rdi], al
            inc     rdi
            dec     ecx
            jnz     fillPound
            jmp     exit_eToBuf

; Okay, the width is sufficient to hold the number; do the
; conversion and output the string here:

goodWidth:

            mov     ebx, fWidth     ; Compute the # of mantissa
            sub     ebx, ecx        ; digits to display
            add     ebx, 2          ; ECX allows for 2 mant digs
            mov     MantSize,ebx

; Round the number to the specified number of print positions.
; (Note: since there are a maximum of 18 significant digits,
; don't bother with the rounding if the field width is greater
; than 18 digits.)

            cmp     ebx, 18
            jae     noNeedToRound

; To round the value to the number of significant digits,
; go to the digit just beyond the last one we are considering
; (EBX currently contains the number of decimal positions)
; and add 5 to that digit.  Propagate any overflow into the
; remaining digit positions.

            mov     al, Digits[rbx * 1] ; Get least sig digit + 1
            add     al, 5               ; Round (for example, +0.5)
            cmp     al, '9'
            jbe     noNeedToRound
            mov     Digits[rbx * 1], '9' + 1
            jmp     whileDigitGT9Test

whileDigitGT9:

; Subtract out overflow and add the carry into the previous
; digit (unless we hit the first digit in the number).

            sub     Digits[rbx * 1], 10     
            dec     ebx                     
            cmp     ebx, 0                  
            jl      firstDigitInNumber      

            inc     Digits[rbx * 1]
            jmp     whileDigitGT9Test

firstDigitInNumber:

; If we get to this point, then we've hit the first
; digit in the number.  So we've got to shift all
; the characters down one position in the string of
; bytes and put a "1" in the first character position.

            mov     ebx, 17
repeatUntilEBXeq0:

            mov     al, Digits[rbx * 1]
            mov     Digits[rbx * 1 + 1], al
            dec     ebx
            jnz     repeatUntilEBXeq0

            mov     Digits, '1'
            inc     Exponent         ; Because we added a digit
            jmp     noNeedToRound

whileDigitGT9Test:
            cmp     Digits[rbx], '9' ; Overflow if char > "9"
            ja      whileDigitGT9 

noNeedToRound:      

; Okay, emit the string at this point.  This is pretty easy
; since all we really need to do is copy data from the
; digits array and add an exponent (plus a few other simple chars).

            xor     ecx, ecx    ; Count output mantissa digits
            mov     rdi, buffer
            xor     edx, edx    ; Count output chars
            mov     al, Sign
            cmp     al, '-'
            je      noMinus

            mov     al, ' '

noMinus:    mov     [rdi], al

; Output the first character and a following decimal point
; if there are more than two mantissa digits to output.

            mov     al, Digits
            mov     [rdi + 1], al
            add     rdi, 2
            add     edx, 2
            inc     ecx
            cmp     ecx, MantSize
            je      noDecPt

            mov     al, '.'
            mov     [rdi], al
            inc     rdi
            inc     edx

noDecPt:

; Output any remaining mantissa digits here.
; Note that if the caller requests the output of
; more than 18 digits, this routine will output zeros
; for the additional digits.

            jmp     whileECXltMantSizeTest

whileECXltMantSize:

            mov     al, '0'
            cmp     ecx, 18
            jae     justPut0

            mov     al, Digits[rcx * 1]

justPut0:
            mov     [rdi], al
            inc     rdi
            inc     ecx
            inc     edx

whileECXltMantSizeTest:
            cmp     ecx, MantSize
            jb      whileECXltMantSize

; Output the exponent:

            mov     byte ptr [rdi], 'e'
            inc     rdi
            inc     edx
            mov     al, '+'
            cmp     Exponent, 0
            jge     noNegExp

            mov     al, '-'
            neg     Exponent

noNegExp:
            mov     [rdi], al
            inc     rdi
            inc     edx

            mov     eax, Exponent
            mov     ecx, expDigs
            call    expToBuf
            jc      error

exit_eToBuf:
            mov     rsi, rsiSave
            mov     rcx, rcxSave
            mov     rbx, rbxSave
            mov     rax, fWidth
            mov     rdx, expDigs
            leave
            clc
            ret

strOvfl:    mov     rax, -3
            jmp     error

badWidth:   mov     rax, -2
            jmp     error

voor:       mov     rax, -1
error:      mov     rsi, rsiSave
            mov     rcx, rcxSave
            mov     rbx, rbxSave
            mov     rdx, expDigs
            leave
            stc
            ret

e10ToStr   endp

Listing 9-13: e10ToStr conversion function

9.2 String-to-Numeric Conversion Routines

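The routines that convert numeric values to strings and those that convert strings to numeric values have a couple of fundamental differences. First of all, numeric-to-string conversions generally occur without the possibility of error;^(4) string-to-numeric conversion, on the other hand, must handle the real possibility of errors such as illegal characters and numeric overflow.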

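A typical numeric input operation consists of reading a string of characters from the user and then translating this string of characters into an internal numeric representation. For example, in C++ a statement like cin >> i32; reads a line of text from the user and converts a sequence of digits appearing at the beginning of that line of text into a 32-bit signed integer (assuming i32 is a 32-bit int object). The cin >> i32; statement skips over certain characters, like leading spaces, that may appear before the actual numeric characters. The input string may also contain additional data beyond the end of the numeric input (for example, it is possible to read two integer values from the same input line), so the input conversion routine must determine where the numeric data ends in the input stream.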

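Typically, C++ achieves this by looking for a character from a set of delimiter characters. The delimiter character set could be something as simple as "any character that is not a numeric digit," or the set of whitespace characters (space, tab, and so on), and perhaps a few other characters such as a comma (,) or some other punctuation character. For the sake of example, the code in this section assumes that any leading spaces or tab characters (ASCII code 9) may precede the numeric digits, and that the conversion stops on the first nondigit character it encounters. Possible error conditions are as follows:

No numeric digits at all at the beginning of the string (following any spaces or tabs).
The string of digits is a value that is too large for the intended numeric size (for example, 64 bits).

It is up to the caller to determine whether the numeric string ends with an invalid character (upon return from the function call).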
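To illustrate how a caller might handle these two error conditions and then check the delimiter itself, here is a hedged C sketch built on the standard library's strtoull. The wrapper name parseField, its return codes, and the particular delimiter set (end of string, space, tab, comma) are assumptions made for this example. Note also that strtoull is more permissive than the routines in this section: it skips every whitespace character that isspace recognizes and accepts a leading sign.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical wrapper around the standard strtoull function.
   Returns  0 on success (*value and *next are set),
           -1 if no digits were found,
           -2 on overflow,
           -3 if the character after the digits is not an accepted delimiter. */
int parseField(const char *s, unsigned long long *value, const char **next)
{
    char *end;

    errno = 0;
    unsigned long long v = strtoull(s, &end, 10);

    if (end == s)
        return -1;                 /* nothing was converted */
    if (errno == ERANGE)
        return -2;                 /* value too large for the type */

    /* The delimiter check is the caller's job; this set is an assumption. */
    if (*end != '\0' && *end != ' ' && *end != '\t' && *end != ',')
        return -3;

    *value = v;
    *next = end;                   /* first character beyond the digits */
    return 0;
}
```

Because the function reports where the digits ended, a caller can continue parsing a second value from the same line by starting again just past the delimiter.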

9.2.1 Converting Decimal Strings to Integers

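The basic algorithm to convert a string containing decimal digits to a number is the following:

1. Initialize an accumulator variable to 0.
2. Skip any leading spaces or tabs in the string.
3. Fetch the first character after the spaces or tabs.
4. If the character is not a numeric digit, return an error. If the character is a numeric digit, fall through to step 5.
5. Convert the numeric character to a numeric value (using AND 0Fh).
6. Set the accumulator = (accumulator × 10) + current numeric value.
7. If overflow occurs, return and report an error. If no overflow occurs, fall through to step 8.
8. Fetch the next character from the string.
9. If the character is a numeric digit, go back to step 5; otherwise, fall through to step 10.
10. Return success, with the accumulator containing the converted value.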

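For signed integer input, you use this same algorithm with the following modifications:

If the first character that is not a space or tab is a hyphen (-), set a flag denoting that the number is negative and skip the "-" character (if the first character is not -, clear the flag).
At the end of a successful conversion, if the flag is set, negate the integer result before returning (you must check for overflow on the negate operation).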
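Before turning to the assembly version, it may help to see the algorithm in C. The sketch below is an illustration under stated assumptions: the names strToU64 and strToI64 and their integer return codes are hypothetical, and the overflow test is done arithmetically because C has no carry flag to inspect. The signed variant rejects any magnitude above 7FFFFFFFFFFFFFFFh for both positive and negative input, which is a choice made for this sketch.

```c
#include <stdint.h>

/* Hypothetical sketch of the unsigned algorithm.
   Returns 0 on success, -1 if no digit is found, -2 on overflow.
   On success, *end points at the first character beyond the digits. */
int strToU64(const char *s, uint64_t *result, const char **end)
{
    uint64_t acc = 0;                               /* step 1 */

    while (*s == ' ' || *s == '\t')                 /* step 2 */
        s++;
    if (*s < '0' || *s > '9')                       /* steps 3 and 4 */
        return -1;

    do {
        uint64_t digit = (uint64_t)(*s & 0x0F);     /* step 5 */
        if (acc > (UINT64_MAX - digit) / 10)        /* step 7, tested first */
            return -2;
        acc = acc * 10 + digit;                     /* step 6 */
        s++;                                        /* step 8 */
    } while (*s >= '0' && *s <= '9');               /* step 9 */

    *result = acc;                                  /* step 10 */
    *end = s;
    return 0;
}

/* Hypothetical sketch of the signed variant, built on strToU64. */
int strToI64(const char *s, int64_t *result, const char **end)
{
    int negative = 0;
    uint64_t magnitude;
    const char *e;

    while (*s == ' ' || *s == '\t')
        s++;
    if (*s == '-') {                                /* remember the sign */
        negative = 1;
        s++;
    }

    int rc = strToU64(s, &magnitude, &e);
    if (rc != 0)
        return rc;
    if (magnitude > (uint64_t)INT64_MAX)            /* would overflow */
        return -2;

    *result = negative ? -(int64_t)magnitude : (int64_t)magnitude;
    *end = e;
    return 0;
}
```

Testing for overflow before the multiply keeps the arithmetic within range, whereas the assembly code can simply examine the carry flag after each mul and add instruction.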

Listing 9-14 implements the conversion algorithm.

; Listing 9-14

; String-to-numeric conversion.

        option  casemap:none

false       =       0
true        =       1
tab         =       9
nl          =       10

            .const
ttlStr      byte    "Listing 9-14", 0
fmtStr1     byte    "strtou: String='%s'", nl
            byte    "    value=%I64u", nl, 0

fmtStr2     byte    "Overflow: String='%s'", nl
            byte    "    value=%I64x", nl, 0

fmtStr3     byte    "strtoi: String='%s'", nl
            byte    "    value=%I64i",nl, 0

unexError   byte    "Unexpected error in program", nl, 0

value1      byte    "  1", 0
value2      byte    "12 ", 0
value3      byte    " 123 ", 0
value4      byte    "1234", 0
value5      byte    "1234567890123456789", 0
value6      byte    "18446744073709551615", 0
OFvalue     byte    "18446744073709551616", 0
OFvalue2    byte    "999999999999999999999", 0

ivalue1     byte    "  -1", 0
ivalue2     byte    "-12 ", 0
ivalue3     byte    " -123 ", 0
ivalue4     byte    "-1234", 0
ivalue5     byte    "-1234567890123456789", 0
ivalue6     byte    "-9223372036854775807", 0
OFivalue    byte    "-9223372036854775808", 0
OFivalue2   byte    "-999999999999999999999", 0

            .data
buffer      byte    30 dup (?)

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; strtou -   Converts string data to a 64-bit unsigned integer.

; Input:
;   RDI  -   Pointer to buffer containing string to convert.

; Output:
;   RAX  -   Contains converted string (if success), error code
;            if an error occurs.

;   RDI  -   Points at first char beyond end of numeric string.
;            If error, RDI's value is restored to original value.
;            Caller can check character at [RDI] after a
;            successful result to see if the character following
;            the numeric digits is a legal numeric delimiter.

;   C    -   (carry flag) Set if error occurs, clear if
;            conversion was successful. On error, RAX will
;            contain 0 (illegal initial character) or
;            0FFFFFFFFFFFFFFFFh (overflow).

strtou      proc
            push    rdi      ; In case we have to restore RDI
            push    rdx      ; Munged by mul 
            push    rcx      ; Holds input char

            xor     edx, edx ; Zero-extends!
            xor     eax, eax ; Zero-extends!

; The following loop skips over any whitespace (spaces and
; tabs) that appears at the beginning of the string.

            dec     rdi      ; Because of inc below
skipWS:     inc     rdi
            mov     cl, [rdi]
            cmp     cl, ' '
            je      skipWS
            cmp     cl, tab  ; Input char is in CL, not AL
            je      skipWS

; If we don't have a numeric digit at this point,
; return an error.

            cmp     cl, '0'  ; Note: "0" < "1" < ... < "9"
            jb      badNumber
            cmp     cl, '9'
            ja      badNumber

; Okay, the first digit is good. Convert the string
; of digits to numeric form:

convert:    and     ecx, 0fh ; Convert to numeric in RCX
            mul     ten      ; Accumulator *= 10
            jc      overflow
            add     rax, rcx ; Accumulator += digit
            jc      overflow
            inc     rdi      ; Move on to next character
            mov     cl, [rdi]
            cmp     cl, '0'
            jb      endOfNum
            cmp     cl, '9'
            jbe     convert

; If we get to this point, we've successfully converted
; the string to numeric form:

endOfNum:   pop     rcx
            pop     rdx

; Because the conversion was successful, this procedure
; leaves RDI pointing at the first character beyond the
; converted digits. As such, we don't restore RDI from
; the stack. Just bump the stack pointer up by 8 bytes
; to throw away RDI's saved value.

            add     rsp, 8
            clc              ; Return success in carry flag
            ret

; badNumber - Drop down here if the first character in
;             the string was not a valid digit.

badNumber:  mov     rax, 0
            pop     rcx
            pop     rdx
            pop     rdi
            stc              ; Return error in carry flag
            ret     

overflow:   mov     rax, -1  ; 0FFFFFFFFFFFFFFFFh
            pop     rcx
            pop     rdx
            pop     rdi
            stc              ; Return error in carry flag
            ret

ten         qword   10

strtou      endp

; strtoi - Converts string data to a 64-bit signed integer.

; Input:
;   RDI  -   Pointer to buffer containing string to convert.

; Output:
;   RAX  -   Contains converted string (if success), error code
;            if an error occurs.

;   RDI  -   Points at first char beyond end of numeric string.
;            If error, RDI's value is restored to original value.
;            Caller can check character at [RDI] after a
;            successful result to see if the character following
;            the numeric digits is a legal numeric delimiter.

;   C    -   (carry flag) Set if error occurs, clear if
;            conversion was successful. On error, RAX will
;            contain 0 (illegal initial character) or
;            0FFFFFFFFFFFFFFFFh (-1, indicating overflow).

strtoi      proc
negFlag     equ     <byte ptr [rsp]>

            push    rdi      ; In case we have to restore RDI
            sub     rsp, 8

; Assume we have a non-negative number.

            mov     negFlag, false

; The following loop skips over any whitespace (spaces and
; tabs) that appears at the beginning of the string.

            dec     rdi      ; Because of inc below
skipWS:     inc     rdi
            mov     al, [rdi]
            cmp     al, ' '
            je      skipWS
            cmp     al, tab
            je      skipWS

; If the first character we've encountered is "-",
; then skip it, but remember that this is a negative
; number.

            cmp     al, '-'
            jne     notNeg
            mov     negFlag, true
            inc     rdi             ; Skip "-"

notNeg:     call    strtou          ; Convert string to integer
            jc      hadError

; strtou returned success. Check the negative flag and
; negate the input if the flag contains true.

            cmp     negFlag, true
            jne     itsPosOr0

            cmp     rax, tooBig     ; Number is too big
            ja      overflow
            neg     rax
itsPosOr0:  add     rsp, 16         ; Success, so don't restore RDI
            clc                     ; Return success in carry flag
            ret

; If we have an error, we need to restore RDI from the stack:

overflow:   mov     rax, -1         ; Indicate overflow
hadError:   add     rsp, 8          ; Remove locals
            pop     rdi
            stc                     ; Return error in carry flag
            ret 

tooBig      qword   7fffffffffffffffh
strtoi      endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64         ; Shadow storage

; Test unsigned conversions:

            lea     rdi, value1
            call    strtou

            jc      UnexpectedError

            lea     rcx, fmtStr1
            lea     rdx, value1
            mov     r8, rax
            call    printf

            lea     rdi, value2
            call    strtou
            jc      UnexpectedError

            lea     rcx, fmtStr1
            lea     rdx, value2
            mov     r8, rax
            call    printf

            lea     rdi, value3
            call    strtou
            jc      UnexpectedError

            lea     rcx, fmtStr1
            lea     rdx, value3
            mov     r8, rax
            call    printf

            lea     rdi, value4
            call    strtou
            jc      UnexpectedError

            lea     rcx, fmtStr1
            lea     rdx, value4
            mov     r8, rax
            call    printf

            lea     rdi, value5
            call    strtou
            jc      UnexpectedError

            lea     rcx, fmtStr1
            lea     rdx, value5
            mov     r8, rax
            call    printf

            lea     rdi, value6
            call    strtou
            jc      UnexpectedError

            lea     rcx, fmtStr1
            lea     rdx, value6
            mov     r8, rax
            call    printf

            lea     rdi, OFvalue
            call    strtou
            jnc     UnexpectedError
            test    rax, rax        ; Nonzero for overflow
            jz      UnexpectedError

            lea     rcx, fmtStr2
            lea     rdx, OFvalue
            mov     r8, rax
            call    printf

            lea     rdi, OFvalue2
            call    strtou
            jnc     UnexpectedError
            test    rax, rax        ; Nonzero for overflow
            jz      UnexpectedError

            lea     rcx, fmtStr2
            lea     rdx, OFvalue2
            mov     r8, rax
            call    printf

; Test signed conversions:

            lea     rdi, ivalue1
            call    strtoi
            jc      UnexpectedError

            lea     rcx, fmtStr3
            lea     rdx, ivalue1
            mov     r8, rax
            call    printf

            lea     rdi, ivalue2
            call    strtoi
            jc      UnexpectedError

            lea     rcx, fmtStr3
            lea     rdx, ivalue2
            mov     r8, rax
            call    printf

            lea     rdi, ivalue3
            call    strtoi
            jc      UnexpectedError

            lea     rcx, fmtStr3
            lea     rdx, ivalue3
            mov     r8, rax
            call    printf

            lea     rdi, ivalue4
            call    strtoi
            jc      UnexpectedError

            lea     rcx, fmtStr3
            lea     rdx, ivalue4
            mov     r8, rax
            call    printf

            lea     rdi, ivalue5
            call    strtoi
            jc      UnexpectedError

            lea     rcx, fmtStr3
            lea     rdx, ivalue5
            mov     r8, rax
            call    printf

            lea     rdi, ivalue6
            call    strtoi
            jc      UnexpectedError

            lea     rcx, fmtStr3
            lea     rdx, ivalue6
            mov     r8, rax
            call    printf

            lea     rdi, OFivalue
            call    strtoi
            jnc     UnexpectedError
            test    rax, rax        ; Nonzero for overflow
            jz      UnexpectedError

            lea     rcx, fmtStr2
            lea     rdx, OFivalue
            mov     r8, rax
            call    printf

            lea     rdi, OFivalue2
            call    strtoi
            jnc     UnexpectedError
            test    rax, rax        ; Nonzero for overflow
            jz      UnexpectedError

            lea     rcx, fmtStr2
            lea     rdx, OFivalue2
            mov     r8, rax
            call    printf

            jmp     allDone

UnexpectedError:
            lea     rcx, unexError
            call    printf

allDone:    leave
            ret     ; Returns to caller
asmMain     endp
            end

Listing 9-14: String-to-numeric conversion

Here's the build command and sample output for this program:

C:\>**build listing9-14**

C:\>**echo off**
 Assembling: listing9-14.asm
c.cpp

C:\>**listing9-14**
Calling Listing 9-14:
strtou: String='  1'
    value=1
strtou: String='12 '
    value=12
strtou: String=' 123 '
    value=123
strtou: String='1234'
    value=1234
strtou: String='1234567890123456789'
    value=1234567890123456789
strtou: String='18446744073709551615'
    value=18446744073709551615
Overflow: String='18446744073709551616'
    value=ffffffffffffffff
Overflow: String='999999999999999999999'
    value=ffffffffffffffff
strtoi: String='  -1'
    value=-1
strtoi: String='-12 '
    value=-12
strtoi: String=' -123 '
    value=-123
strtoi: String='-1234'
    value=-1234
strtoi: String='-1234567890123456789'
    value=-1234567890123456789
strtoi: String='-9223372036854775807'
    value=-9223372036854775807
Overflow: String='-9223372036854775808'
    value=ffffffffffffffff
Overflow: String='-999999999999999999999'
    value=ffffffffffffffff
Listing 9-14 terminated

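For extended-precision string-to-numeric conversion, you simply modify the strtou function so that it uses an extended-precision accumulator and performs an extended-precision multiplication by 10 (rather than a standard multiplication).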

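9.2.2 Converting Hexadecimal Strings to Numeric Form

As with numeric output, hexadecimal input is the easiest numeric input routine to write. The basic algorithm for hexadecimal string-to-numeric conversion is the following:

1. Initialize an extended-precision accumulator value to 0.
2. For each input character that is a valid hexadecimal digit, repeat steps 3 through 6; drop down to step 7 when the character is not a valid hexadecimal digit.
3. Convert the hexadecimal character to a value in the range 0 to 15 (0h to 0Fh).
4. If the HO 4 bits of the extended-precision accumulator value are nonzero, raise an exception.
5. Multiply the current extended-precision value by 16 (that is, shift it left 4 bits).
6. Add the converted hexadecimal digit value to the accumulator.
7. Check the current input character to ensure it is a valid delimiter. Raise an exception if it is not.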
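As a rough cross-check before reading the assembly, the steps above can be sketched in C for a 64-bit accumulator. This is only an illustration: the function name hex_to_u64 and its return codes (-1 for a bad first character, -2 for overflow) are assumptions made here, not part of the book's code, and step 7 (the delimiter check) is left to the caller through the end pointer.

```c
#include <stdint.h>

/* Hypothetical helper (not from the book): converts a hex string to a
   64-bit value. Returns 0 on success, -1 if no hex digit was found,
   -2 on overflow. On success, *end points at the first unconverted
   character so the caller can perform the delimiter check (step 7). */
int hex_to_u64(const char *s, uint64_t *out, const char **end)
{
    uint64_t acc = 0;                        /* Step 1 */
    int digits = 0;

    while (*s == ' ' || *s == '\t')          /* Skip leading whitespace */
        s++;

    for (;; s++) {
        unsigned d;
        char c = *s;

        /* Steps 2 and 3: classify the character and convert it to 0..15 */
        if (c >= '0' && c <= '9')
            d = (unsigned)(c - '0');
        else if (c >= 'a' && c <= 'f')
            d = (unsigned)(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F')
            d = (unsigned)(c - 'A') + 10;
        else
            break;                           /* Not a hex digit: stop */

        if (acc & 0xF000000000000000ULL)     /* Step 4: HO 4 bits in use? */
            return -2;

        acc = (acc << 4) | d;                /* Steps 5 and 6 */
        digits++;
    }

    if (digits == 0)
        return -1;                           /* Illegal initial character */

    *out = acc;
    *end = s;
    return 0;
}
```

The overflow test happens before the shift, so a 17th significant digit is rejected instead of silently pushing the top nibble out of the accumulator.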

Listing 9-15 implements this extended-precision hexadecimal input routine for 64-bit values.

; Listing 9-15

; Hexadecimal string-to-numeric conversion.

        option  casemap:none

false       =       0
true        =       1
tab         =       9
nl          =       10

            .const
ttlStr      byte    "Listing 9-15", 0
fmtStr1     byte    "strtoh: String='%s' "
            byte    "value=%I64x", nl, 0

fmtStr2     byte    "Error, RAX=%I64x, str='%s'", nl, 0 
fmtStr3     byte    "Error, expected overflow: RAX=%I64x, "
            byte    "str='%s'", nl, 0

fmtStr4     byte    "Error, expected bad char: RAX=%I64x, "
            byte    "str='%s'", nl, 0 

hexStr      byte    "1234567890abcdef", 0
hexStrOVFL  byte    "1234567890abcdef0", 0
hexStrBAD   byte    "x123", 0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; strtoh -   Converts string data to a 64-bit unsigned integer.

; Input:
;   RDI  -   Pointer to buffer containing string to convert.

; Output:
;   RAX  -   Contains converted string (if success), error code
;            if an error occurs.

;   RDI  -   Points at first char beyond end of hexadecimal string.
;            If error, RDI's value is restored to original value.
;            Caller can check character at [RDI] after a
;            successful result to see if the character following
;            the numeric digits is a legal numeric delimiter.

;   C    -   (carry flag) Set if error occurs, clear if
;            conversion was successful. On error, RAX will
;            contain 0 (illegal initial character) or
;            0FFFFFFFFFFFFFFFFh (overflow).

strtoh      proc
            push    rcx      ; Holds input char
            push    rdx      ; Special mask value
            push    rdi      ; In case we have to restore RDI

; This code will use the value in RDX to test and see if overflow
; will occur in RAX when shifting to the left 4 bits:

            mov     rdx, 0F000000000000000h
            xor     eax, eax ; Zero out accumulator

; The following loop skips over any whitespace (spaces and
; tabs) that appears at the beginning of the string.

            dec     rdi      ; Because of inc below
skipWS:     inc     rdi
            mov     cl, [rdi]
            cmp     cl, ' '
            je      skipWS
            cmp     cl, tab  ; Input char is in CL, not AL
            je      skipWS

; If we don't have a hexadecimal digit at this point,
; return an error.

            cmp     cl, '0'  ; Note: "0" < "1" < ... < "9"
            jb      badNumber
            cmp     cl, '9'
            jbe     convert
            and     cl, 5fh  ; Cheesy LC -> UC conversion
            cmp     cl, 'A'
            jb      badNumber
            cmp     cl, 'F'
            ja      badNumber
            sub     cl, 7    ; Maps 41h to 46h -> 3Ah to 3Fh

; Okay, the first digit is good. Convert the string
; of digits to numeric form:

convert:    test    rdx, rax ; See if adding in the current
            jnz     overflow ; digit will cause an overflow

            and     ecx, 0fh ; Convert to numeric in RCX

; Multiply 64-bit accumulator by 16 and add in new digit:

            shl     rax, 4
            add     al, cl   ; Never overflows outside LO 4 bits

; Move on to next character:

            inc     rdi
            mov     cl, [rdi]
            cmp     cl, '0'
            jb      endOfNum
            cmp     cl, '9'
            jbe     convert

            and     cl, 5fh  ; Cheesy LC -> UC conversion
            cmp     cl, 'A'
            jb      endOfNum
            cmp     cl, 'F'
            ja      endOfNum
            sub     cl, 7    ; Maps 41h to 46h -> 3Ah to 3Fh
            jmp     convert

; If we get to this point, we've successfully converted
; the string to numeric form:

endOfNum:

; Because the conversion was successful, this procedure
; leaves RDI pointing at the first character beyond the
; converted digits. As such, we don't restore RDI from
; the stack. Just bump the stack pointer up by 8 bytes
; to throw away RDI's saved value.

            add     rsp, 8   ; Remove original RDI value
            pop     rdx      ; Restore RDX
            pop     rcx      ; Restore RCX
            clc              ; Return success in carry flag
            ret

; badNumber- Drop down here if the first character in
;            the string was not a valid digit.

badNumber:  xor     rax, rax
            jmp     errorExit

overflow:   or      rax, -1  ; Return -1 as error on overflow
errorExit:  pop     rdi      ; Restore RDI if an error occurs
            pop     rdx
            pop     rcx
            stc              ; Return error in carry flag
            ret

strtoh      endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64  ; Shadow storage

; Test hexadecimal conversion:

            lea     rdi, hexStr
            call    strtoh
            jc      error

            lea     rcx, fmtStr1
            mov     r8, rax
            lea     rdx, hexStr
            call    printf

; Test overflow conversion:

            lea     rdi, hexStrOVFL
            call    strtoh
            jnc     unexpected

            lea     rcx, fmtStr2
            mov     rdx, rax
            mov     r8, rdi
            call    printf

; Test bad character:

            lea     rdi, hexStrBAD
            call    strtoh
            jnc     unexp2

            lea     rcx, fmtStr2
            mov     rdx, rax
            mov     r8, rdi
            call    printf
            jmp     allDone

unexpected: lea     rcx, fmtStr3
            mov     rdx, rax
            mov     r8, rdi
            call    printf
            jmp     allDone

unexp2:     lea     rcx, fmtStr4
            mov     rdx, rax
            mov     r8, rdi
            call    printf
            jmp     allDone

error:      lea     rcx, fmtStr2
            mov     rdx, rax
            mov     r8, rdi
            call    printf

allDone:    leave
            ret     ; Returns to caller
asmMain     endp
            end

Listing 9-15: Hexadecimal string-to-numeric conversion

Here's the build command and program output:

C:\>**build listing9-15**

C:\>**echo off**
 Assembling: listing9-15.asm
c.cpp

C:\>**listing9-15**
Calling Listing 9-15:
strtoh: String='1234567890abcdef' value=1234567890abcdef
Error, RAX=ffffffffffffffff, str='1234567890abcdef0'
Error, RAX=0, str='x123'
Listing 9-15 terminated

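For hexadecimal string conversions that handle numbers larger than 64 bits, you have to use an extended-precision shift left by 4 bits. Listing 9-16 demonstrates the modifications to the strtoh function needed for a 128-bit conversion.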

; strtoh128 - Converts string data to a 128-bit unsigned integer.

; Input:
;   RDI     - Pointer to buffer containing string to convert.

; Output:
;   RDX:RAX - Contains converted string (if success), error code
;             if an error occurs.

;   RDI     - Points at first char beyond end of hex string.
;             If error, RDI's value is restored to original value.
;             Caller can check character at [RDI] after a
;             successful result to see if the character following
;             the numeric digits is a legal numeric delimiter.

;   C       - (carry flag) Set if error occurs, clear if
;             conversion was successful. On error, RAX will
;             contain 0 (illegal initial character) or
;             0FFFFFFFFFFFFFFFFh (overflow).

strtoh128   proc
            push    rbx      ; Special mask value
            push    rcx      ; Input char to process
            push    rdi      ; In case we have to restore RDI

; This code will use the value in RBX to test and see if overflow
; will occur in RDX:RAX when shifting to the left 4 bits:

            mov     rbx, 0F000000000000000h
            xor     eax, eax ; Zero out accumulator
            xor     edx, edx

; The following loop skips over any whitespace (spaces and
; tabs) that appears at the beginning of the string.

            dec     rdi      ; Because of inc below
skipWS:     inc     rdi
            mov     cl, [rdi]
            cmp     cl, ' '
            je      skipWS
            cmp     cl, tab  ; Input char is in CL, not AL
            je      skipWS

; If we don't have a hexadecimal digit at this point,
; return an error.

            cmp     cl, '0'  ; Note: "0" < "1" < ... < "9"
            jb      badNumber
            cmp     cl, '9'
            jbe     convert
            and     cl, 5fh  ; Cheesy LC -> UC conversion
            cmp     cl, 'A'
            jb      badNumber
            cmp     cl, 'F'
            ja      badNumber
            sub     cl, 7    ; Maps 41h to 46h -> 3Ah to 3Fh

; Okay, the first digit is good. Convert the string
; of digits to numeric form:

convert:    test    rdx, rbx ; See if adding in the current
            jnz     overflow ; digit will cause an overflow

            and     ecx, 0fh ; Convert to numeric in RCX

; Multiply 128-bit accumulator by 16 and add in new digit:

            shld    rdx, rax, 4
            shl     rax, 4
            add     al, cl   ; Never overflows outside LO 4 bits

; Move on to next character:

            inc     rdi      
            mov     cl, [rdi]
            cmp     cl, '0'
            jb      endOfNum
            cmp     cl, '9'
            jbe     convert

            and     cl, 5fh  ; Cheesy LC -> UC conversion
            cmp     cl, 'A'
            jb      endOfNum
            cmp     cl, 'F'
            ja      endOfNum
            sub     cl, 7    ; Maps 41h to 46h -> 3Ah to 3Fh
            jmp     convert

; If we get to this point, we've successfully converted
; the string to numeric form:

endOfNum:

; Because the conversion was successful, this procedure
; leaves RDI pointing at the first character beyond the
; converted digits. As such, we don't restore RDI from
; the stack. Just bump the stack pointer up by 8 bytes
; to throw away RDI's saved value.

            add     rsp, 8   ; Remove original RDI value
            pop     rcx      ; Restore RCX
            pop     rbx      ; Restore RBX
            clc              ; Return success in carry flag
            ret

; badNumber - Drop down here if the first character in
;             the string was not a valid digit.

badNumber:  xor     rax, rax
            jmp     errorExit

overflow:   or      rax, -1  ; Return -1 as error on overflow
errorExit:  pop     rdi      ; Restore RDI if an error occurs
            pop     rcx
            pop     rbx
            stc              ; Return error in carry flag
            ret

strtoh128   endp

Listing 9-16: 128-bit hexadecimal string-to-numeric conversion
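The only step that really changes in the 128-bit version is the multiply by 16, where shld carries the top 4 bits of the low quadword into the high quadword. The following C fragment is a hedged sketch of that one step; the u128_t struct and the function name are hypothetical stand-ins for the RDX:RAX register pair.

```c
#include <stdint.h>

/* Hypothetical stand-in for the RDX:RAX pair (hi = RDX, lo = RAX). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_t;

/* Shifts the 128-bit accumulator left 4 bits and merges in one hex
   digit (0..15). Returns 0 on success, or -1 if the shift would
   discard nonzero bits (overflow). */
int u128_append_hex_digit(u128_t *acc, unsigned digit)
{
    /* Same test as "test rdx, rbx" with RBX = 0F000000000000000h */
    if (acc->hi & 0xF000000000000000ULL)
        return -1;

    /* shld rdx, rax, 4: HO 4 bits of lo move into the LO 4 bits of hi */
    acc->hi = (acc->hi << 4) | (acc->lo >> 60);

    /* shl rax, 4, then add the digit into the now-empty LO 4 bits */
    acc->lo = (acc->lo << 4) | (digit & 0xFu);
    return 0;
}
```

The high half must be updated before the low half, because it needs the low half's original top 4 bits; the assembly code orders shld before shl for the same reason.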

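9.2.3 Converting Unsigned Decimal Strings to Integers

The algorithm for unsigned decimal input is nearly identical to that for hexadecimal input. In fact, the only difference (beyond accepting only decimal digits) is that you multiply the accumulating value by 10 rather than 16 for each input character. (In general, the algorithm is the same for any base; just multiply the accumulating value by the input base.) Listing 9-17 demonstrates how to write a 64-bit unsigned decimal input routine.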

; Listing 9-17

; 64-bit unsigned decimal string-to-numeric conversion.

        option  casemap:none

false       =       0
true        =       1
tab         =       9
nl          =       10

            .const
ttlStr      byte    "Listing 9-17", 0
fmtStr1     byte    "strtou: String='%s' value=%I64u", nl, 0
fmtStr2     byte    "strtou: error, rax=%d", nl, 0

qStr      byte    "12345678901234567", 0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; strtou -   Converts string data to a 64-bit unsigned integer.

; Input:
;   RDI  -   Pointer to buffer containing string to convert.

; Output:
;   RAX  -   Contains converted string (if success), error code
;            if an error occurs.

;   RDI  -   Points at first char beyond end of numeric string.
;            If error, RDI's value is restored to original value.
;            Caller can check character at [RDI] after a
;            successful result to see if the character following
;            the numeric digits is a legal numeric delimiter.

;   C    -   (carry flag) Set if error occurs, clear if
;            conversion was successful. On error, RAX will
;            contain 0 (illegal initial character) or
;            0FFFFFFFFFFFFFFFFh (overflow).

strtou      proc
            push    rcx      ; Holds input char
            push    rdx      ; Save, used for multiplication
            push    rdi      ; In case we have to restore RDI

            xor     rax, rax ; Zero out accumulator

; The following loop skips over any whitespace (spaces and
; tabs) that appears at the beginning of the string.

            dec     rdi      ; Because of inc below
skipWS:     inc     rdi
            mov     cl, [rdi]
            cmp     cl, ' '
            je      skipWS
            cmp     cl, tab  ; Input char is in CL, not AL
            je      skipWS

; If we don't have a numeric digit at this point,
; return an error.

            cmp     cl, '0'  ; Note: "0" < "1" < ... < "9"
            jb      badNumber
            cmp     cl, '9'
            ja      badNumber

; Okay, the first digit is good. Convert the string
; of digits to numeric form:

convert:    and     ecx, 0fh ; Convert to numeric in RCX

; Multiply 64-bit accumulator by 10:

            mul     ten
            test    rdx, rdx ; Test for overflow
            jnz     overflow

            add     rax, rcx
            jc      overflow

; Move on to next character:

            inc     rdi
            mov     cl, [rdi]
            cmp     cl, '0'
            jb      endOfNum
            cmp     cl, '9'
            jbe     convert

; If we get to this point, we've successfully converted
; the string to numeric form:

endOfNum:

; Because the conversion was successful, this procedure
; leaves RDI pointing at the first character beyond the
; converted digits. As such, we don't restore RDI from
; the stack. Just bump the stack pointer up by 8 bytes
; to throw away RDI's saved value.

            add     rsp, 8   ; Remove original RDI value
            pop     rdx
            pop     rcx      ; Restore RCX
            clc              ; Return success in carry flag
            ret

; badNumber - Drop down here if the first character in
;             the string was not a valid digit.

badNumber:  xor     rax, rax
            jmp     errorExit

overflow:   mov     rax, -1  ; 0FFFFFFFFFFFFFFFFh
errorExit:  pop     rdi
            pop     rdx
            pop     rcx
            stc              ; Return error in carry flag
            ret

ten         qword   10

strtou      endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64  ; Shadow storage

; Test unsigned decimal conversion:

            lea     rdi, qStr
            call    strtou
            jc      error

            lea     rcx, fmtStr1
            mov     r8, rax
            lea     rdx, qStr
            call    printf
            jmp     allDone

error:      lea     rcx, fmtStr2
            mov     rdx, rax
            call    printf

allDone:    leave
            ret     ; Returns to caller
asmMain     endp
            end

Listing 9-17: Unsigned decimal string-to-numeric conversion

Here's the build command and sample output for the program in Listing 9-17:

C:\>**build listing9-17**

C:\>**echo off**
 Assembling: listing9-17.asm
c.cpp

C:\>**listing9-17**
Calling Listing 9-17:
strtou: String='12345678901234567' value=12345678901234567
Listing 9-17 terminated
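For comparison, here is a minimal C sketch of the same multiply-by-10 loop. C has no direct access to the carry flag or to the high half of a product, so this version checks for overflow before each operation rather than after it, as the mul/test and add/jc pairs do. The name dec_to_u64 and its return codes are assumptions made for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper (not from the book): converts a decimal string
   to a 64-bit unsigned value. Returns 0 on success, -1 if the first
   character is not a digit, -2 on overflow. On success, *end points
   at the first character past the digits. */
int dec_to_u64(const char *s, uint64_t *out, const char **end)
{
    uint64_t acc = 0;

    while (*s == ' ' || *s == '\t')      /* Skip leading whitespace */
        s++;

    if (*s < '0' || *s > '9')
        return -1;                       /* Illegal initial character */

    while (*s >= '0' && *s <= '9') {
        unsigned d = (unsigned)(*s - '0');

        if (acc > UINT64_MAX / 10)       /* acc * 10 would overflow */
            return -2;
        acc *= 10;

        if (acc > UINT64_MAX - d)        /* acc + d would overflow */
            return -2;
        acc += d;

        s++;
    }

    *out = acc;
    *end = s;
    return 0;
}
```

Changing the two occurrences of 10 (and the digit test) is all it takes to adapt the loop to another base, which is the point the text makes about the algorithm being the same for any radix.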

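Is it possible to create a faster function that uses the fbld (x87 FPU BCD load) instruction? Probably not. The fbstp instruction was much faster for integer-to-string conversions because the standard algorithm executes the (very slow) div instruction many times. Decimal string-to-numeric conversion uses the mul instruction, which is much faster than div. Though I haven't actually tried it, I suspect that using fbld won't produce faster-running code.

9.2.4 Conversion of Extended-Precision String to Unsigned Integer

The algorithm for (decimal) string-to-numeric conversion is the same regardless of integer size. You read a decimal character, convert it to an integer, multiply the accumulating result by 10, and add in the converted character. The only things that change for values larger than 64 bits are the multiply-by-10 and addition operations. For example, to convert a string to a 128-bit integer, you need to be able to multiply a 128-bit value by 10 and add an 8-bit value (zero-extended to 128 bits) to a 128-bit value.

Listing 9-18 demonstrates how to write a 128-bit unsigned decimal input routine. Other than the 128-bit multiply-by-10 and 128-bit addition operations, this code is functionally identical to the 64-bit string-to-integer conversion.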

; strtou128 - Converts string data to a 128-bit unsigned integer.

; Input:
;   RDI     - Pointer to buffer containing string to convert.

; Output:
;   RDX:RAX - Contains converted string (if success), error code
;             if an error occurs.

;   RDI     - Points at first char beyond end of numeric string.
;             If error, RDI's value is restored to original value.
;             Caller can check character at [RDI] after a
;             successful result to see if the character following
;             the numeric digits is a legal numeric delimiter.

;   C       - (carry flag) Set if error occurs, clear if
;             conversion was successful. On error, RAX will
;             contain 0 (illegal initial character) or
;             0FFFFFFFFFFFFFFFFh (overflow).

strtou128   proc
accumulator equ     <[rbp - 16]>
partial     equ     <[rbp - 24]>
            push    rcx      ; Holds input char
            push    rdi      ; In case we have to restore RDI
            push    rbp
            mov     rbp, rsp
            sub     rsp, 24  ; Accumulate result here

            xor     edx, edx ; Zero-extends!
            mov     accumulator, rdx
            mov     accumulator[8], rdx

; The following loop skips over any whitespace (spaces and
; tabs) that appears at the beginning of the string.

            dec     rdi      ; Because of inc below
skipWS:     inc     rdi
            mov     cl, [rdi]
            cmp     cl, ' '
            je      skipWS
            cmp     cl, tab  ; Input char is in CL, not AL
            je      skipWS

; If we don't have a numeric digit at this point,
; return an error.

            cmp     cl, '0'         ; Note: "0" < "1" < ... < "9"
            jb      badNumber
            cmp     cl, '9'
            ja      badNumber

; Okay, the first digit is good. Convert the string
; of digits to numeric form:

convert:    and     ecx, 0fh        ; Convert to numeric in RCX

; Multiply 128-bit accumulator by 10:

            mov     rax, accumulator 
            mul     ten
            mov     accumulator, rax
            mov     partial, rdx    ; Save partial product
            mov     rax, accumulator[8]
            mul     ten
            jc      overflow1
            add     rax, partial
            mov     accumulator[8], rax
            jc      overflow1

; Add in the current character to the 128-bit accumulator:

            mov     rax, accumulator
            add     rax, rcx
            mov     accumulator, rax
            mov     rax, accumulator[8]
            adc     rax, 0
            mov     accumulator[8], rax
            jc      overflow2

; Move on to next character:

            inc     rdi
            mov     cl, [rdi]
            cmp     cl, '0'
            jb      endOfNum
            cmp     cl, '9'
            jbe     convert

; If we get to this point, we've successfully converted
; the string to numeric form:

endOfNum:

; Because the conversion was successful, this procedure
; leaves RDI pointing at the first character beyond the
; converted digits. As such, we don't restore RDI from
; the stack. Just bump the stack pointer up by 8 bytes
; to throw away RDI's saved value.

            mov     rax, accumulator
            mov     rdx, accumulator[8]
            leave
            add     rsp, 8   ; Remove original RDI value
            pop     rcx      ; Restore RCX
            clc              ; Return success in carry flag
            ret

; badNumber - Drop down here if the first character in
;             the string was not a valid digit.

badNumber:  xor     rax, rax
            xor     rdx, rdx
            jmp     errorExit

overflow1:  mov     rax, -1
            cqo              ; RDX = -1, too
            jmp     errorExit

overflow2:  mov     rax, -2  ; 0FFFFFFFFFFFFFFFEh
            cqo              ; Just to be consistent
errorExit:  leave            ; Remove accumulator from stack
            pop     rdi
            pop     rcx
            stc              ; Return error in carry flag
            ret

ten         qword   10

strtou128   endp

Listing 9-18: Extended-precision unsigned decimal input
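The core of the 128-bit routine is the two-part multiply by 10 followed by the add-with-carry. The C sketch below mirrors that structure under stated assumptions: u128_t, mul10_64, and u128_mul10_add are hypothetical names, and because portable C has no 64x64-to-128-bit multiply, the helper splits each quadword into 32-bit halves to recover the part of the product that mul leaves in RDX.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 128-bit accumulator on the stack. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_t;

/* Multiplies x by 10. Returns the low 64 bits of the product and
   stores the high part (0..9) in *carry, like RAX and RDX after mul. */
static uint64_t mul10_64(uint64_t x, uint64_t *carry)
{
    uint64_t lo = (x & 0xFFFFFFFFULL) * 10;      /* Low 32 bits times 10  */
    uint64_t hi = (x >> 32) * 10 + (lo >> 32);   /* High 32 bits + carry  */

    *carry = hi >> 32;
    return (hi << 32) | (lo & 0xFFFFFFFFULL);
}

/* acc = acc * 10 + digit. Returns 0 on success, -1 if the multiply
   overflows 128 bits, -2 if the addition does (matching the two
   overflow exits in the listing). acc is unchanged on failure. */
int u128_mul10_add(u128_t *acc, unsigned digit)
{
    uint64_t partial, over;
    uint64_t lo = mul10_64(acc->lo, &partial);   /* Low quadword times 10  */
    uint64_t hi = mul10_64(acc->hi, &over);      /* High quadword times 10 */

    if (over != 0)
        return -1;                               /* Product exceeds 128 bits */
    if (hi > UINT64_MAX - partial)
        return -1;                               /* Adding partial overflows */
    hi += partial;

    lo += digit;
    if (lo < digit) {                            /* Carry out of low half */
        if (hi == UINT64_MAX)
            return -2;
        hi++;
    }

    acc->lo = lo;
    acc->hi = hi;
    return 0;
}
```

Feeding the digits of a decimal string through u128_mul10_add one at a time reproduces the conversion loop; the whitespace skipping and digit validation are the same as in the 64-bit case.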

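9.2.5 Conversion of Extended-Precision Signed Decimal String to Integer

Once you have an unsigned decimal input routine, writing a signed decimal input routine is easy, as described by the following algorithm:

1. Consume any delimiter characters at the beginning of the input stream.
2. If the next input character is a minus sign, consume this character and set a flag noting that the number is negative; otherwise, just drop down to step 3.
3. Call the unsigned decimal input routine to convert the rest of the string to an integer.
4. Check the return result to make sure its HO bit is clear. Raise a value-out-of-range exception if the HO bit of the result is set.
5. If the code encountered a minus sign in step 2, negate the result.

I'll leave the actual code implementation as a programming exercise for you.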
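If you'd like a reference point before attempting the exercise in assembly, the five steps can be sketched in C at 64-bit precision. This is only an illustration of the control flow, not the extended-precision solution: dec_to_i64 and its return codes are assumptions made here, and the unsigned conversion of step 3 is inlined so the block stands alone.

```c
#include <stdint.h>

/* Hypothetical helper (not from the book): converts a signed decimal
   string to a 64-bit integer. Returns 0 on success, -1 if no digit
   follows the optional sign, -2 if the value is out of range. */
int dec_to_i64(const char *s, int64_t *out, const char **end)
{
    int negative = 0;
    uint64_t mag = 0;

    while (*s == ' ' || *s == '\t')          /* Step 1: skip delimiters */
        s++;

    if (*s == '-') {                         /* Step 2: note the sign */
        negative = 1;
        s++;
    }

    if (*s < '0' || *s > '9')
        return -1;

    while (*s >= '0' && *s <= '9') {         /* Step 3: unsigned conversion */
        unsigned d = (unsigned)(*s - '0');

        if (mag > (UINT64_MAX - d) / 10)     /* mag * 10 + d would overflow */
            return -2;
        mag = mag * 10 + d;
        s++;
    }

    if (mag > (uint64_t)INT64_MAX)           /* Step 4: HO bit must be clear */
        return -2;

    *out = negative ? -(int64_t)mag : (int64_t)mag;   /* Step 5 */
    *end = s;
    return 0;
}
```

Because step 4 insists that the HO bit be clear, this sketch rejects the most negative value (-9223372036854775808), which is the same behavior the sample output of Listing 9-14 shows for strtoi.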

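9.2.6 Conversion of Real String to Floating-Point

Converting a string of characters representing a floating-point number to the 80-bit real10 format is slightly easier than the real10-to-string conversion appearing earlier in this chapter. Because decimal conversion (with no exponent) is a subset of the more general scientific notation conversion, if you can handle scientific notation, you get decimal conversion for free. Beyond that, the basic algorithm is to convert the mantissa characters to a packed BCD form (so the function can use the fbld instruction to do the string-to-numeric conversion) and then read the (optional) exponent and adjust the real10 exponent accordingly. The algorithm to do the conversion is the following: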

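1. Begin by stripping away any leading space or tab characters (and any other delimiters).
2. Check for a leading plus (+) or minus (-) sign character. Skip it if one is present. Set a sign flag to true if the number is negative (false if it is non-negative).
3. Initialize an exponent value to -18. The algorithm will create a left-justified packed BCD value from the mantissa digits in the string to provide to the fbld instruction, and left-justified packed BCD values are always greater than or equal to 10¹⁸. Initializing the exponent to -18 accounts for this.
4. Initialize a significant-digit-counter variable, which counts the number of significant digits processed thus far, to 18.
5. If the number begins with any leading zeros, skip over them (do not change the exponent or the significant-digit counter for leading zeros to the left of the decimal point).
6. If the scan encounters a decimal point after processing any leading zeros, go to step 11; otherwise, fall through to step 7.
7. For each nonzero digit to the left of the decimal point, if the significant-digit counter is not zero, insert the nonzero digit into a "digit string" array at the position specified by the significant-digit counter (minus 1).^(5) Note that this inserts the characters into the string in reversed order.
8. For each digit to the left of the decimal point, increment the exponent value (originally initialized to -18) by 1.
9. If the significant-digit counter is not zero, decrement the significant-digit counter (this also provides the index into the digit string array).
10. If the first nondigit character encountered is not a decimal point, skip to step 14.
11. Skip over the decimal point character.
12. For each digit to the right of the decimal point, continue adding the digits (in reverse order) to the digit string array as long as the significant-digit counter is not zero. If the significant-digit counter is greater than zero, decrement it. Also, decrement the exponent value.
13. If the algorithm hasn't encountered at least one decimal digit by this point, report an illegal character exception and return.
14. If the current character is not e or E, go to step 20.^(6) Otherwise, skip over the e or E character and continue with step 15.
15. If the next character is + or -, skip over it. Set a flag to true if the sign character is -, and set it to false otherwise (note that this exponent sign flag is different from the mantissa sign flag set earlier in this algorithm).
16. If the next character is not a decimal digit, report an error.
17. Convert the string of digits (starting with the current decimal digit character) to an integer.
18. Add the converted integer to the exponent value (which was initialized to -18 at the start of this algorithm).
19. If the exponent value is outside the range -4930 to +4930, report an out-of-range exception.
20. Convert the digit string array to an 18-digit (9-byte) packed BCD value by stripping the HO 4 bits of each character, merging pairs of characters into a single byte (by shifting the odd-indexed byte to the left 4 bits and logically ORing it with the even-indexed byte of each pair), and then setting the HO (10th) byte to 0.
21. Convert the packed BCD value to a real10 value (using the fbld instruction).
22. Take the absolute value of the exponent (though preserve the sign of the exponent). This value will be 13 bits or less (4096 has bit 12 set, so 4930 or less will have some combination of bits 0 to 12 set to 1, with all other bits 0).
23. If the exponent was positive, then for each set bit in the exponent, multiply the current real10 value by 10 raised to the power specified by that bit. For example, if bits 12, 10, and 1 are set, multiply the real10 value by 10⁴⁰⁹⁶, 10¹⁰²⁴, and 10².
24. If the exponent was negative, then for each set bit in the exponent, divide the current real10 value by 10 raised to the power specified by that bit. For example, if bits 12, 10, and 1 are set, divide the real10 value by 10⁴⁰⁹⁶, 10¹⁰²⁴, and 10².
25. If the mantissa is negative (the first sign flag set at the beginning of the algorithm), negate the floating-point number.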
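Two of these steps are easier to picture in a high-level language: the BCD packing of step 20 and the bit-by-bit power-of-10 scaling of steps 22 through 24. The C sketch below illustrates both under loud assumptions: the function names are hypothetical, the digit array is taken to hold ASCII digits (or zero bytes) with index 0 as the least significant digit, and the scaling uses double rather than real10, so its table stops at 10^256 and it only handles exponent magnitudes up to 511.

```c
#include <stdint.h>

/* Step 20 (sketch): pack 18 digit characters into 9 BCD bytes and
   zero the 10th (HO) byte. digits[0] is the least significant digit;
   unused slots may hold 0. */
void pack_bcd18(const unsigned char digits[18], unsigned char bcd[10])
{
    for (int i = 0; i < 9; i++) {
        unsigned lo = digits[2 * i]     & 0x0Fu;  /* Even index: LO nibble */
        unsigned hi = digits[2 * i + 1] & 0x0Fu;  /* Odd index:  HO nibble */
        bcd[i] = (unsigned char)((hi << 4) | lo);
    }
    bcd[9] = 0;                                   /* HO byte (sign) = 0 */
}

/* Steps 22-24 (sketch): scale value by 10^exp10, one table entry per
   set bit in |exp10|. Assumes |exp10| <= 511 because double cannot
   represent the larger powers a real10 table would contain. */
double scale_pow10(double value, int exp10)
{
    /* Entry i corresponds to bit (8 - i) of the exponent magnitude. */
    static const double pot[] = {
        1e256, 1e128, 1e64, 1e32, 1e16, 1e8, 1e4, 1e2, 1e1
    };
    unsigned mag = (unsigned)(exp10 < 0 ? -exp10 : exp10);
    unsigned bit = 256;

    for (int i = 0; i < 9; i++, bit >>= 1) {
        if (mag & bit)
            value = (exp10 < 0) ? value / pot[i] : value * pot[i];
    }
    return value;
}
```

For example, an exponent of 3 has bits 1 and 0 set, so scale_pow10 multiplies by 10² and then by 10¹. At most one multiplication or division is needed per exponent bit, rather than one per unit of the exponent.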

Listing 9-19 provides an implementation of this algorithm.

; Listing 9-19

; Real string-to-floating-point conversion.

        option  casemap:none

false       =       0
true        =       1
tab         =       9
nl          =       10

            .const
ttlStr      byte    "Listing 9-19", 0
fmtStr1     byte    "strToR10: str='%s', value=%e", nl, 0

fStr1a      byte    "1.234e56",0
fStr1b      byte    "-1.234e56",0
fStr1c      byte    "1.234e-56",0
fStr1d      byte    "-1.234e-56",0
fStr2a      byte    "1.23",0
fStr2b      byte    "-1.23",0
fStr3a      byte    "1",0
fStr3b      byte    "-1",0
fStr4a      byte    "0.1",0
fStr4b      byte    "-0.1",0
fStr4c      byte    "0000000.1",0
fStr4d      byte    "-0000000.1",0
fStr4e      byte    "0.1000000",0
fStr4f      byte    "-0.1000000",0
fStr4g      byte    "0.0000001",0
fStr4h      byte    "-0.0000001",0
fStr4i      byte    ".1",0
fStr4j      byte    "-.1",0

values      qword   fStr1a, fStr1b, fStr1c, fStr1d,
                    fStr2a, fStr2b,
                    fStr3a, fStr3b,
                    fStr4a, fStr4b, fStr4c, fStr4d,
                    fStr4e, fStr4f, fStr4g, fStr4h,
                    fStr4i, fStr4j,
                    0

            align   4
PotTbl      real10  1.0e+4096,
                    1.0e+2048,
                    1.0e+1024,
                    1.0e+512,
                    1.0e+256,
                    1.0e+128,
                    1.0e+64,
                    1.0e+32,
                    1.0e+16,
                    1.0e+8,
                    1.0e+4,
                    1.0e+2,
                    1.0e+1,
                    1.0e+0

            .data
r8Val       real8   ?

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; *********************************************************

; strToR10 - RSI points at a string of characters that represent a
;            floating-point value. This routine converts that string
;            to the corresponding FP value and leaves the result on
;            the top of the FPU stack. On return, RSI points at the
;            first character this routine couldn't convert.

; Like the other ATOx routines, this routine raises an
; exception if there is a conversion error or if RSI
; contains NULL.

; *********************************************************

strToR10    proc

sign        equ     <cl>
expSign     equ     <ch>

DigitStr    equ     <[rbp - 20]>
BCDValue    equ     <[rbp - 30]>
rsiSave     equ     <[rbp - 40]>

            push    rbp
            mov     rbp, rsp
            sub     rsp, 40

            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    rax

; Verify that RSI is not NULL.

            test    rsi, rsi
            jz      refNULL

; Zero out the DigitStr and BCDValue arrays.

            xor     rax, rax
            mov     qword ptr DigitStr, rax
            mov     qword ptr DigitStr[8], rax
            mov     dword ptr DigitStr[16], eax

            mov     qword ptr BCDValue, rax
            mov     word ptr BCDValue[8], ax

; Skip over any leading space or tab characters in the sequence.

            dec     rsi
whileDelimLoop:
            inc     rsi
            mov     al, [rsi]
            cmp     al, ' '
            je      whileDelimLoop
            cmp     al, tab
            je      whileDelimLoop

; Check for "+" or "-".

            cmp     al, '-'
            sete    sign
            je      doNextChar
            cmp     al, '+'
            jne     notPlus
doNextChar: inc     rsi             ; Skip the "+" or "-"
            mov     al, [rsi]

notPlus:

; Initialize EDX with -18 since we have to account
; for BCD conversion (which generates a number * 10^18 by
; default). EDX holds the value's decimal exponent.

            mov     rdx, -18

; Initialize EBX with 18, which is the number of significant
; digits left to process and it is also the index into the
; DigitStr array.

            mov     ebx, 18         ; Zero-extends!

; At this point, we're beyond any leading sign character.
; Therefore, the next character must be a decimal digit
; or a decimal point.

            mov     rsiSave, rsi    ; Save to look ahead 1 digit
            cmp     al, '.'
            jne     notPeriod

; If the first character is a decimal point, then the
; second character needs to be a decimal digit.

            inc     rsi
            mov     al, [rsi]

notPeriod:
            cmp     al, '0'
            jb      convError
            cmp     al, '9'
            ja      convError
            mov     rsi, rsiSave    ; Go back to orig char
            mov     al, [rsi]
            jmp     testWhlAL0

; Eliminate any leading zeros (they do not affect the value or
; the number of significant digits).

whileAL0:   inc     rsi
            mov     al, [rsi]
testWhlAL0: cmp     al, '0'
            je      whileAL0

; If we're looking at a decimal point, we need to get rid of the
; zeros immediately after the decimal point since they don't
; count as significant digits.  Unlike zeros before the decimal
; point, however, these zeros do affect the number's value as
; we must decrement the current exponent for each such zero.

            cmp     al, '.'
            jne     testDigit

            inc     edx             ; Counteract dec below
repeatUntilALnot0:
            dec     edx
            inc     rsi
            mov     al, [rsi]
            cmp     al, '0'
            je      repeatUntilALnot0
            jmp     testDigit2

; If we didn't encounter a decimal point after removing leading
; zeros, then we've got a sequence of digits before a decimal
; point.  Process those digits here.

; Each digit to the left of the decimal point increases
; the number by an additional power of 10.  Deal with
; that here.

whileADigit:
            inc     edx     

; Save all the significant digits, but ignore any digits
; beyond the 18th digit.

            test    ebx, ebx
            jz      Beyond18

            mov     DigitStr[rbx * 1], al
            dec     ebx

Beyond18:   inc     rsi
            mov     al, [rsi]

testDigit:  
            sub     al, '0'
            cmp     al, 10
            jb      whileADigit

            cmp     al, '.'-'0'
            jne     testDigit2

            inc     rsi             ; Skip over decimal point
            mov     al, [rsi]
            jmp     testDigit2

; Okay, process any digits to the right of the decimal point.

whileDigit2:
            test    ebx, ebx
            jz      Beyond18_2

            mov     DigitStr[rbx * 1], al
            dec     ebx

Beyond18_2: inc     rsi
            mov     al, [rsi]

testDigit2: sub     al, '0'
            cmp     al, 10
            jb      whileDigit2

; At this point, we've finished processing the mantissa.
; Now see if there is an exponent we need to deal with.

            mov     al, [rsi]       
            cmp     al, 'E'
            je      hasExponent
            cmp     al, 'e'
            jne     noExponent

hasExponent:
            inc     rsi
            mov     al, [rsi]       ; Skip the "E".
            cmp     al, '-'
            sete    expSign
            je      doNextChar_2
            cmp     al, '+'
            jne     getExponent;

doNextChar_2:
            inc     rsi             ; Skip "+" or "-"
            mov     al, [rsi]

; Okay, we're past the "E" and the optional sign at this
; point.  We must have at least one decimal digit.

getExponent:
            sub     al, '0'
            cmp     al, 10
            jae     convError

            xor     ebx, ebx        ; Compute exponent value in EBX
ExpLoop:    movzx   eax, byte ptr [rsi] ; Zero-extends to RAX!
            sub     al, '0'
            cmp     al, 10
            jae     ExpDone

            imul    ebx, 10
            add     ebx, eax
            inc     rsi
            jmp     ExpLoop

; If the exponent was negative, negate our computed result.

ExpDone:
            cmp     expSign, false
            je      noNegExp

            neg     ebx

noNegExp:

; Add in the BCD adjustment (remember, values in DigitStr, when
; loaded into the FPU, are multiplied by 10¹⁸ by default.
; The value in EDX adjusts for this).

            add     edx, ebx

noExponent:

; Verify that the exponent is between -4930 and +4930 (which
; is the maximum dynamic range for an 80-bit FP value).

            cmp     edx, 4930
            jg      voor            ; Value out of range
            cmp     edx, -4930
            jl      voor

; Now convert the DigitStr variable (unpacked BCD) to a packed
; BCD value.

            mov     r8, 8
for9:       mov     al, DigitStr[r8 * 2 + 2]
            shl     al, 4
            or      al, DigitStr[r8 * 2 + 1]
            mov     BCDValue[r8 * 1], al

            dec     r8
            jns     for9

            fbld    tbyte ptr BCDValue

; Okay, we've got the mantissa into the FPU.  Now multiply the
; mantissa by 10 raised to the value of the computed exponent
; (currently in EDX).

; This code uses power of 10 tables to help make the 
; computation a little more accurate.

; We want to determine which power of 10 is just less than the
; value of our exponent.  The powers of 10 we are checking are
; 10**4096, 10**2048, 10**1024, 10**512, and so on. A slick way to
; do this check is by shifting the bits in the exponent
; to the left.  Bit #12 is the 4096 bit.  So if this bit is set,
; our exponent is >= 10**4096.  If not, check the next bit down
; to see if our exponent >= 10**2048, etc.

            mov     ebx, -10 ; Initial index into power of 10 table
            test    edx, edx
            jns     positiveExponent

; Handle negative exponents here.

            neg     edx
            shl     edx, 19 ; Bits 0 to 12 -> 19 to 31
            lea     r8, PotTbl

whileEDXne0:
            add     ebx, 10
            shl     edx, 1
            jnc     testEDX0

            fld     real10 ptr [r8][rbx * 1]
            fdivp

testEDX0:   test    edx, edx
            jnz     whileEDXne0
            jmp     doMantissaSign

; Handle positive exponents here.

positiveExponent:
            lea     r8, PotTbl
            shl     edx, 19 ; Bits 0 to 12 -> 19 to 31
            jmp     testEDX0_2

whileEDXne0_2:
            add     ebx, 10
            shl     edx, 1
            jnc     testEDX0_2

            fld     real10 ptr [r8][rbx * 1]
            fmulp

testEDX0_2: test    edx, edx
            jnz     whileEDXne0_2

; If the mantissa was negative, negate the result down here.

doMantissaSign:
            cmp     sign, false
            je      mantNotNegative

            fchs

mantNotNegative:
            clc                     ; Indicate success
            jmp     Exit

refNULL:    mov     rax, -3
            jmp     ErrorExit

convError:  mov     rax, -2
            jmp     ErrorExit

voor:       mov     rax, -1         ; Value out of range
            jmp     ErrorExit

illChar:    mov     rax, -4

ErrorExit:  stc                     ; Indicate failure
            mov     [rsp], rax      ; Save error code
Exit:       pop     rax
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            leave
            ret

strToR10    endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64         ; Shadow storage

; Test floating-point conversion:

            lea     rbx, values
ValuesLp:   cmp     qword ptr [rbx], 0
            je      allDone

            mov     rsi, [rbx]
            call    strToR10
            fstp    r8Val

            lea     rcx, fmtStr1
            mov     rdx, [rbx]
            mov     r8, qword ptr r8Val
            call    printf
            add     rbx, 8
            jmp     ValuesLp

allDone:    leave
            pop     rsi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 9-19: The strToR10 function

Here are the build command and sample output for Listing 9-19.

C:\>**build listing9-19**

C:\>**echo off**
 Assembling: listing9-19.asm
c.cpp

C:\>**listing9-19**
Calling Listing 9-19:
strToR10: str='1.234e56', value=1.234000e+56
strToR10: str='-1.234e56', value=-1.234000e+56
strToR10: str='1.234e-56', value=1.234000e-56
strToR10: str='-1.234e-56', value=-1.234000e-56
strToR10: str='1.23', value=1.230000e+00
strToR10: str='-1.23', value=-1.230000e+00
strToR10: str='1', value=1.000000e+00
strToR10: str='-1', value=-1.000000e+00
strToR10: str='0.1', value=1.000000e-01
strToR10: str='-0.1', value=-1.000000e-01
strToR10: str='0000000.1', value=1.000000e-01
strToR10: str='-0000000.1', value=-1.000000e-01
strToR10: str='0.1000000', value=1.000000e-01
strToR10: str='-0.1000000', value=-1.000000e-01
strToR10: str='0.0000001', value=1.000000e-07
strToR10: str='-0.0000001', value=-1.000000e-07
strToR10: str='.1', value=1.000000e-01
strToR10: str='-.1', value=-1.000000e-01
Listing 9-19 terminated
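
The exponent-scaling step near the end of strToR10 may be easier to follow in a high-level language. The following C sketch is illustrative only and is not part of the book's listing: the names pot_tbl and scale_by_pow10 are hypothetical, and it uses double rather than the 80-bit real10 format, so its table stops at 10^256 instead of 10^4096. Like the listing, it scans the bits of the exponent's magnitude from high to low and multiplies (or divides, for a negative exponent) by the matching power of 10 for each set bit.

```c
/* Powers of ten whose exponents are powers of two: 10^1, 10^2, 10^4, ...
   (Assumption: a shortened stand-in for the listing's PotTbl, limited to
   what a C double can hold.) */
static const double pot_tbl[] = {
    1e1, 1e2, 1e4, 1e8, 1e16, 1e32, 1e64, 1e128, 1e256
};

#define POT_ENTRIES ((int)(sizeof pot_tbl / sizeof pot_tbl[0]))

/* Scale mantissa by 10^exp10, using one multiply (or divide) per set bit
   in |exp10|. This sketch assumes |exp10| < 512. */
double scale_by_pow10(double mantissa, int exp10)
{
    unsigned magnitude = (exp10 < 0) ? (unsigned)(-exp10) : (unsigned)exp10;

    for (int bit = POT_ENTRIES - 1; bit >= 0; --bit) {
        if (magnitude & (1u << bit)) {
            if (exp10 < 0)
                mantissa /= pot_tbl[bit];   /* negative exponent: divide */
            else
                mantissa *= pot_tbl[bit];   /* positive exponent: multiply */
        }
    }
    return mantissa;
}
```

For an exponent of 56 (binary 111000), the loop touches only the 10^32, 10^16, and 10^8 entries, so three multiplications replace 56 multiplications by 10.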

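9.3 For More Information

Donald Knuth's The Art of Computer Programming, Volume 2: Seminumerical Algorithms (Addison-Wesley Professional, 1997) contains a lot of useful information about decimal arithmetic and extended-precision arithmetic, though the text is generic and doesn't describe how to do this in x86 assembly language.

9.4 Test Yourself

1. What code would convert an 8-bit hexadecimal value in AL into two hexadecimal digits (in AH and AL)?
2. How many hexadecimal digits will dToStr produce?
3. Explain how to use qToStr to write a 128-bit hexadecimal output routine.
4. What instruction should you use to produce the fastest 64-bit decimal-to-string conversion function?
5. Given an unsigned decimal-to-string conversion function, how would you write a signed decimal-to-string conversion?
6. What are the parameters for the utoStrSize function?
7. What string will uSizeToStr produce if the number requires more print positions than the minDigits parameter specifies?
8. What are the parameters for the r10ToStr function?
9. What string will r10ToStr produce if the output won't fit in the string size specified by the fWidth parameter?
10. What are the parameters for the e10ToStr function?
11. What is a delimiter character?
12. What are two possible errors that could occur during a string-to-numeric conversion?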

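Chapter 10: Table Lookups

This chapter discusses how to speed up or reduce the complexity of computations by using table lookups. In the early days of x86 programming, replacing expensive computations with table lookups was a common way to improve program performance. Today, memory speeds in modern systems limit the performance gains that table lookups can deliver. For complex calculations, however, this is still a viable technique for writing high-performance code. This chapter demonstrates the space/speed trade-offs involved in using table lookups.

10.1 Tables

To an assembly language programmer, a table is an array containing initialized values that do not change once created. In assembly language, you can use tables for a variety of purposes: computing functions, controlling program flow, or simply looking things up. In general, tables provide a fast mechanism for performing an operation at the expense of space in your program (the extra space holds the tabular data). In this section, we'll explore some of the possible uses of tables in an assembly language program.

10.1.1 Function Computation via Table Lookup

A simple-looking high-level-language arithmetic expression can be equivalent to a considerable amount of x86-64 assembly language code and, therefore, can be expensive to compute. Assembly language programmers often precompute many values and use a table lookup of those values to speed up their programs. This approach has the advantage of being easier, and it is often more efficient as well.

Consider the following Pascal statement: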

if (character >= 'a') and (character <= 'z') then 
      character := chr(ord(character) - 32);

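This Pascal if statement converts the value of the character variable from lowercase to uppercase if character is in the range a to z. The MASM code that does the same thing requires seven machine instructions, as follows:

        mov al, character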
        cmp al, 'a'
        jb  notLower
        cmp al, 'z'
        ja  notLower

        and al, 5fh  ; Same as sub(32, al) in this code
        mov character, al
notLower:

However, using a table lookup allows you to reduce this sequence to just four instructions:

mov   al, character
lea   rbx, CnvrtLower
xlat
mov   character, al

The xlat, or translate, instruction does the following:

mov al, [rbx + al * 1]

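This instruction uses the current value of the AL register as an index into the array whose base address is found in RBX. It fetches the byte at that index in the array and copies that byte into the AL register. Intel calls this instruction translate because programmers typically use it to translate characters from one form to another by using a lookup table, exactly the way we are using it here.

In the previous example, CnvrtLower is a 256-byte table that contains the values 0 to 60h at indices 0 to 60h, 41h to 5Ah at indices 61h to 7Ah, and 7Bh to 0FFh at indices 7Bh to 0FFh. Therefore, if AL contains a value in the range 0 to 60h or 7Bh to 0FFh, the xlat instruction returns the same value, effectively leaving AL unchanged. However, if AL contains a value in the range 61h to 7Ah (the ASCII codes for a to z), the xlat instruction replaces the value in AL with a value in the range 41h to 5Ah (the ASCII codes for A to Z), thereby converting lowercase to uppercase.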
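
The text describes the contents of CnvrtLower but does not show the table itself. As a rough illustration, the following C sketch builds a table with the layout described above and performs the same one-step lookup that xlat does. The names cnvrt_lower, build_cnvrt_lower, and to_upper_by_table are hypothetical and are not part of the book's code.

```c
#include <stdint.h>

/* 256-entry translation table, one byte per possible input value. */
static uint8_t cnvrt_lower[256];

/* Fill the table: every index maps to itself except 61h..7Ah ('a'..'z'),
   which map to 41h..5Ah ('A'..'Z'). */
void build_cnvrt_lower(void)
{
    for (int i = 0; i < 256; ++i) {
        if (i >= 0x61 && i <= 0x7A)
            cnvrt_lower[i] = (uint8_t)(i - 0x20);
        else
            cnvrt_lower[i] = (uint8_t)i;
    }
}

/* The C equivalent of "lea rbx, CnvrtLower / xlat": the input byte is
   the index, and the table entry is the result. */
uint8_t to_upper_by_table(uint8_t ch)
{
    return cnvrt_lower[ch];
}
```

Call build_cnvrt_lower once before the first lookup; after that, each conversion is a single indexed load with no comparisons or branches.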

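As the complexity of a function increases, the performance benefits of the table-lookup method increase dramatically. While you would almost never use a lookup table to convert lowercase to uppercase, consider what happens if you want to swap the case of a letter, for example, via computation:

        mov al, character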
        cmp al, 'a'
        jb  notLower
        cmp al, 'z'
        ja  allDone

        and al, 5fh
        jmp allDone

notLower:
        cmp al, 'A'
        jb  allDone
        cmp al, 'Z'
        ja  allDone

        or  al, 20h
allDone:
        mov character, al

This code has 13 machine instructions.

The table-lookup code to compute this same function is as follows:

mov   al, character
lea   rbx, SwapUL
xlat
mov   character, al

As you can see, when you use a table lookup to compute a function, only the table changes; the code remains the same.
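
To make the "only the table changes" point concrete, here is a minimal C sketch of how a SwapUL-style table might be filled in. It assumes ASCII, and the names swap_ul, build_swap_ul, and swap_case are hypothetical, chosen for this illustration only.

```c
#include <stdint.h>

/* 256-entry table that swaps the case of ASCII letters. */
static uint8_t swap_ul[256];

void build_swap_ul(void)
{
    for (int i = 0; i < 256; ++i) {
        if (i >= 'a' && i <= 'z')
            swap_ul[i] = (uint8_t)(i & 0x5F);   /* lowercase -> uppercase */
        else if (i >= 'A' && i <= 'Z')
            swap_ul[i] = (uint8_t)(i | 0x20);   /* uppercase -> lowercase */
        else
            swap_ul[i] = (uint8_t)i;            /* everything else unchanged */
    }
}

/* Swap the case of every letter in a NUL-terminated string, in place.
   The lookup is the same single indexed load used for any other table. */
void swap_case(char *s)
{
    for (; *s != '\0'; ++s)
        *s = (char)swap_ul[(uint8_t)*s];
}
```

All of the comparisons from the 13-instruction version move into the one-time table construction; the per-character work stays a single lookup.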

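10.1.1.1 Function Domains and Range

Functions computed via table lookup have a limited domain (the set of possible input values they accept), because each element in the domain of a function requires an entry in the lookup table. For example, our previous case-conversion functions have the 256-character extended ASCII character set as their domain. A function such as sin or cos accepts the (infinite) set of real numbers as possible input values. You won't find it practical to implement a function via table lookup whose domain is the set of real numbers, because you must limit the domain to a small set.

Most lookup tables are quite small, usually 10 to 256 entries. Rarely do lookup tables grow beyond 1000 entries. Most programmers don't have the patience to create (and verify the correctness of) a 1000-entry table (though see "Generating Tables," Section 10.1.2, for a discussion of generating tables programmatically).

Another limitation of functions based on lookup tables is that the elements in the domain must be fairly contiguous. Table lookups use the input value as an index into the table and return the value at that entry. A function that accepts the values 0, 100, 1000, and 10,000 would require 10,001 elements in the lookup table because of the range of input values. Therefore, you cannot efficiently create such a function by using a table lookup. Throughout this section on tables, we'll assume that the domain of the function is a fairly contiguous set of values.

The range of a function is the set of possible output values it produces. From the perspective of a table lookup, a function's range determines the size of each table entry. For example, if a function's range is the integer values 0 through 255, each table entry requires a single byte; if the range is 0 through 65,535, each table entry requires 2 bytes, and so on.

The best functions you can implement via table lookups are those whose domain and range are always 0 to 255 (or a subset of this range). Any such function can be computed by using the same two instructions: lea rbx, table and xlat. The only thing that ever changes is the lookup table. The case-conversion routines presented earlier are good examples of such functions.

Once the domain or range of a function exceeds 0 to 255, you can no longer (conveniently) use the xlat instruction to compute the function value. There are three situations to consider:

The domain is outside 0 to 255, but the range is within 0 to 255.
The domain is inside 0 to 255, but the range is outside 0 to 255.
Both the domain and range of the function are outside 0 to 255.

We will consider these cases in the following sections.

10.1.1.2 Domain Outside 0 to 255, Range Within 0 to 255

If the domain of a function is outside 0 to 255 but the range of the function falls within this set of values, the lookup table will require more than 256 entries, but each entry can be represented with a single byte. Therefore, the lookup table can be an array of bytes. Other than those lookups that can use the xlat instruction, functions falling into this class are the most efficient. The following Pascal function invocation

B := Func(X);

where Func is

function Func(X:dword):byte;

is easily converted to the following MASM code: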

mov edx, X    ; Zero-extends into RDX!
lea rbx, FuncTable
mov al, [rbx][rdx * 1]
mov B, al

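This code loads the function parameter into RDX, uses this value (in the range 0 to ??) as an index into the FuncTable table, fetches the byte at that location, and stores the result into B. Obviously, the table must contain a valid entry for each possible value of X. For example, suppose you want to map a cursor position on an 80×25 text-based video display (in the range 0 to 1999; there are 2000 character positions on an 80×25 display) to its X (0 to 79) or Y (0 to 24) coordinate on the screen. You could compute the X coordinate via the function

X = Posn % 80;

and the Y coordinate with the formula

Y = Posn / 80;

(where Posn is the cursor position on the screen). This can be computed with the following x86-64 code: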

mov ax, Posn
mov cl, 80
div cl

; X is now in AH, Y is now in AL.

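However, the div instruction on the x86-64 is very slow. If you need to do this computation for every character you write to the screen, you will seriously degrade the speed of your video-display code. The following code, which realizes these two functions via table lookup, may improve the performance of your code considerably: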

lea   rbx, yCoord
movzx ecx, Posn           ; Use a plain mov instr if Posn 
mov   al, [rbx][rcx * 1]  ; is uns32 rather than an 
lea   rbx, xCoord         ; uns16 value
mov   ah, [rbx][rcx * 1]

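Keep in mind that loading a value into ECX automatically zero-extends that value into RCX. Therefore, the movzx instruction in this code sequence actually zero-extends Posn into RCX, not just ECX.

If you're willing to live with the limitations of the LARGEADDRESSAWARE:NO linking option (see "Large Address Unaware Applications" in Chapter 3), you can simplify this code somewhat: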

movzx ecx, Posn           ; Use a plain mov instr if Posn
mov   al, yCoord[rcx * 1] ; is uns32 rather than an
mov   ah, xCoord[rcx * 1] ; uns16 value
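
The assembly code above assumes that the xCoord and yCoord tables already exist. As a hedged illustration of where their 2000 entries could come from, the following C sketch fills both tables by using the Posn % 80 and Posn / 80 formulas given earlier. The names x_coord, y_coord, and build_coord_tables are hypothetical and do not appear in the book.

```c
#include <stdint.h>

#define SCREEN_COLS  80
#define SCREEN_ROWS  25
#define SCREEN_CELLS (SCREEN_COLS * SCREEN_ROWS)   /* 2000 positions */

/* One byte per cursor position: the range of both functions fits in 0..255
   even though the domain (0..1999) does not. */
static uint8_t x_coord[SCREEN_CELLS];
static uint8_t y_coord[SCREEN_CELLS];

/* Pay for the slow division once per table entry, up front. */
void build_coord_tables(void)
{
    for (int posn = 0; posn < SCREEN_CELLS; ++posn) {
        x_coord[posn] = (uint8_t)(posn % SCREEN_COLS);
        y_coord[posn] = (uint8_t)(posn / SCREEN_COLS);
    }
}
```

The two tables cost 4000 bytes in total, which is the space side of the space/speed trade-off this chapter describes.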

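10.1.1.3 Domain in 0 to 255 and Range Outside 0 to 255, or Both Outside 0 to 255

If the domain of a function is within 0 to 255 but the range is outside this set, the lookup table will contain 256 or fewer entries, but each entry will require 2 or more bytes. If both the range and the domain of the function are outside 0 to 255, each entry will require 2 or more bytes, and the table will contain more than 256 entries.

Recall from Chapter 4 that the formula for indexing into a single-dimensional array (of which a table is a special case) is as follows:

element_address = Base + index * element_size

If elements in the range of the function require 2 bytes, you must multiply the index by 2 before indexing into the table. Likewise, if each entry requires 3, 4, or more bytes, the index must be multiplied by the size of each table entry before being used as an index into the table. For example, suppose you have a function, F(x), defined by the following (pseudo) Pascal declaration:

function F(x:dword):word;

You can use the following x86-64 code to create this function (and, of course, the appropriately named table F):

mov   ebx, x              ; x is a dword; zero-extends into RBX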
lea   r8, F
mov   ax, [r8][rbx * 2]

If you can live with the limitations of LARGEADDRESSAWARE:NO, you can reduce this as follows:

mov   ebx, x              ; x is a dword; zero-extends into RBX
mov   ax, F[rbx * 2]

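Any function whose domain is small and mostly contiguous is a good candidate for computation via table lookup. In some cases, noncontiguous domains are acceptable as well, as long as the domain can be coerced into an appropriate set of values (an example you've already seen is processing switch statement expressions). Such operations are called conditioning and are the subject of the next section.

10.1.1.4 Domain Conditioning

Domain conditioning is taking a set of values in the domain of a function and massaging them so that they are more acceptable as inputs to that function. Consider the following function:

sin(x) = sin x | x ∈ [-2π, 2π]

This says that the (computer) function sin(x) is equivalent to the (mathematical) function sin x where

-2π <= x <= 2π

As we know, sine is a circular function, which will accept any real-value input. The formula used to compute sine, however, accepts only this small set of values.

This range limitation doesn't present any real problems; by simply computing sin(x mod (2 * pi)), we can compute the sine of any input value. Modifying an input value so that we can easily compute a function is called conditioning the input. In the preceding example, we computed x mod (2 * pi) and used the result as the input to the sin function. This truncates x to the domain sin needs without affecting the result. We can apply input conditioning to table lookups as well. In fact, scaling the index to handle word entries is a form of input conditioning. Consider the following Pascal function:

function val(x:word):word; begin
    case x of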
        0: val := 1;
        1: val := 1;
        2: val := 4;
        3: val := 27;
        4: val := 256;
        otherwise val := 0;
    end;
end;

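This function computes a value for x in the range 0 to 4 and returns 0 if x is outside this range. Because x can take on 65,536 different values (being a 16-bit word), creating a table containing 65,536 words in which only the first five entries are nonzero seems quite wasteful. However, we can still compute this function by using a table lookup if we use input conditioning. The following assembly language code presents this principle:

        mov   ax, 0      ; AX = 0, assume x > 4
        movzx ebx, x     ; Note that HO bits of RBX must be 0!
        lea   r8, val
        cmp   bx, 4
        ja    defaultResult

        mov   ax, [r8][rbx * 2]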

defaultResult:

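This code checks whether x is outside the range 0 to 4. If so, it manually sets AX to 0; otherwise, it looks up the function value through the val table. With input conditioning, you can implement several functions that would otherwise be impractical to do via table lookup.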
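
The same idea can be sketched in C. This is only an illustration of the range check followed by a small table lookup; the table name val_tbl is hypothetical, and the five entries are the values from the Pascal case statement.

```c
#include <stdint.h>

/* Only the five inputs with nonzero results get table entries. */
static const uint16_t val_tbl[5] = { 1, 1, 4, 27, 256 };

/* Input conditioning: reject everything outside 0..4 before indexing,
   so a 5-entry table can serve a 65,536-value domain. */
uint16_t val(uint16_t x)
{
    if (x > 4)
        return 0;          /* default result, like the ja defaultResult path */
    return val_tbl[x];
}
```

The table shrinks from 65,536 words to 5 at the cost of one comparison and one branch per call.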

10.1.2 Generating Tables

One big problem with using table lookups is creating the table in the first place. This is particularly true if the table has many entries. Figuring out the data to place in the table, then laboriously entering the data and, finally, checking that data to make sure it is valid, is very time-consuming and boring. For many tables, there is no way around this process. For other tables, there is a better way: using the computer to generate the table for you.

An example is probably the best way to describe this. Consider the following modification to the sine function:

(sin x) × r = (r × (1000 × sin x)) / 1000 | x ∈ [0, 359]

This states that x is an integer in the range 0 to 359 and r must be an integer. The computer can easily compute this with the following code:

Thousand    dword   1000
              .
              .
              .
            lea     r8, Sines
            movzx   ebx, x
            movsx   eax, word ptr [r8][rbx * 2] ; Get sin(x) * 1000 (word-sized entries)
            imul    r                   ; Note that this extends EAX into EDX
            idiv    Thousand            ; Compute (r * (sin(x) * 1000)) / 1000

(This provides the usual improvement if you can live with the limitations of LARGEADDRESSAWARE:NO.)

Note that integer multiplication and division are not associative. You cannot remove the multiplication by 1000 and the division by 1000 because they appear to cancel each other out. Furthermore, this code must compute this function in exactly this order.

All that we need to complete this function is Sines, a table containing 360 different values corresponding to the sine of the angle (in degrees) times 1000. The C/C++ program in Listing 10-1 generates this table for you.

// Listing 10-1: GenerateSines

// A C program that generates a table of sine values for
// an assembly language lookup table.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv)
{
    FILE *outFile;
    int angle;
    int r;

    // Open the file:

    outFile = fopen("sines.asm", "w");

    // Emit the initial part of the declaration to
    // the output file:

    fprintf
    (
        outFile,
        "Sines:"  // sin(0) = 0
    );

    // Emit the sines table:

    for(angle = 0; angle <= 359; ++angle)
    {
        // Convert angle in degrees to an angle in
        // radians using:

        // radians = angle * 2.0 * pi / 360.0;

        // Multiply by 1000 and store the truncated
        // result into the integer variable r.

        double theSine =
            sin
            (
                angle * 2.0 *
                3.14159265358979323846 /
                360.0
            );
        r = (int) (theSine * 1000.0);

        // Write out the integers eight per line to the
        // source file.
        // Note: If (angle AND %111) is 0, then angle
        // is divisible by 8 and we should output a
        // newline first.

        if((angle & 7) == 0)
        {
            fprintf(outFile, "\n\tword\t");
        }
        fprintf(outFile, "%5d", r);
        if ((angle & 7) != 7)
        {
            fprintf(outFile, ",");
        }

    } // endfor
    fprintf(outFile, "\n");

    fclose(outFile);
    return 0;

} // end main

Listing 10-1: A C program that generates a table of sines

This program produces the following output (truncated for brevity):

Sines:
        word    0, 17, 34, 52, 69, 87, 104, 121
        word    139, 156, 173, 190, 207, 224, 241, 258
        word    275, 292, 309, 325, 342, 358, 374, 390
        word    406, 422, 438, 453, 469, 484, 499, 515
        word    529, 544, 559, 573, 587, 601, 615, 629
        word    642, 656, 669, 681, 694, 707, 719, 731
        word    743, 754, 766, 777, 788, 798, 809, 819
        word    829, 838, 848, 857, 866, 874, 882, 891
        word    898, 906, 913, 920, 927, 933, 939, 945
        word    951, 956, 961, 965, 970, 974, 978, 981
        word    984, 987, 990, 992, 994, 996, 997, 998
        word    999, 999, 1000, 999, 999, 998, 997, 996
        word    994, 992, 990, 987, 984, 981, 978, 974
        word    970, 965, 961, 956, 951, 945, 939, 933
        word    927, 920, 913, 906, 898, 891, 882, 874
          .
          .
          .
        word    -898, -891, -882, -874, -866, -857, -848, -838
        word    -829, -819, -809, -798, -788, -777, -766, -754
        word    -743, -731, -719, -707, -694, -681, -669, -656
        word    -642, -629, -615, -601, -587, -573, -559, -544
        word    -529, -515, -500, -484, -469, -453, -438, -422
        word    -406, -390, -374, -358, -342, -325, -309, -292
        word    -275, -258, -241, -224, -207, -190, -173, -156
        word    -139, -121, -104, -87, -69, -52, -34, -17

Obviously, it's much easier to write the C program that generated this data than to enter (and verify) this data by hand. Of course, you don't even have to write the table-generation program in C (or Pascal/Delphi, Java, C#, Swift, or another high-level language). Because the program will execute only once, the performance of the table-generation program is not an issue.

Once you run your table-generation program, all that remains to be done is to cut and paste the table from the file (sines.asm in this example) into the program that will actually use the table.

10.1.3 Table-Lookup Performance

In the early days of PCs, table lookups were a preferred way to do high-performance computations. Today, it is not uncommon for a CPU to be 10 to 100 times faster than main memory. As a result, using a table lookup may not be faster than doing the same calculation with machine instructions. However, the on-chip CPU cache memory subsystems operate at near CPU speeds. Therefore, table lookups can be cost-effective if your table resides in cache memory on the CPU. This means that the way to get good performance using table lookups is to use small tables (because there's only so much room on the cache) and use tables whose entries you reference frequently (so the tables stay in the cache).

See Write Great Code, Volume 1 (No Starch Press, 2020) or the electronic version of The Art of Assembly Language at https://www.randallhyde.com/ for details concerning the operation of cache memory and how you can optimize your use of cache memory.

10.2 For More Information

Donald Knuth's The Art of Computer Programming, Volume 3: Searching and Sorting (Addison-Wesley Professional, 1998) contains a lot of useful information about searching for data in tables. Searching for data is an alternative when a straight array access won't work in a given situation.

10.3 Test Yourself

1. What is the domain of a function?
2. What is the range of a function?
3. What does the xlat instruction do?
4. Which domain and range values allow you to use the xlat instruction?
5. Provide the code that implements the following functions (using pseudo-C prototypes and f as the table name):
   1. byte f(byte input)
   2. word f(byte input)
   3. byte f(word input)
   4. word f(word input)
6. What is domain conditioning?
7. Why might table lookups not be effective on modern processors?

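Chapter 11: SIMD Instructions

This chapter discusses the vector instructions on the x86-64. This special class of instructions provides parallel processing and is traditionally known as single-instruction, multiple-data (SIMD) instructions because, quite literally, a single instruction operates on several pieces of data concurrently. As a result of this concurrency, SIMD instructions can often execute several times faster (in theory, as much as 32 to 64 times faster) than the comparable single-instruction, single-data (SISD), or scalar, instructions that make up the standard x86-64 instruction set.

The x86-64 actually provides three sets of vector instructions: the Multimedia Extensions (MMX) instruction set, the Streaming SIMD Extensions (SSE) instruction set, and the Advanced Vector Extensions (AVX) instruction set. This book does not consider the MMX instructions, as they are obsolete (SSE equivalents exist for the MMX instructions).

The x86-64 vector instruction set (SSE/AVX) is almost as large as the scalar instruction set. A whole book could be written about SSE/AVX programming and algorithms. However, this is not that book; SIMD and parallel algorithms are an advanced subject beyond the scope of this book, so this chapter settles for introducing a fair number of SSE/AVX instructions and leaves it at that.

This chapter begins with some prerequisite information. First, it discusses the x86-64 vector architecture and streaming data types. Then it discusses how to detect the presence of the various vector instructions (which are not present on all x86-64 CPUs) by using the cpuid instruction. Because most vector instructions require special memory alignment for data operands, this chapter also discusses MASM segments.

11.1 The SSE/AVX Architectures

Let's begin by taking a quick look at the SSE and AVX features in x86-64 CPUs. The SSE and AVX instructions have several variants: the original SSE, plus SSE2, SSE3, SSSE3, SSE4 (SSE4.1 and SSE4.2), AVX, AVX2 (AVX and AVX2 are sometimes called AVX-256), and AVX-512. SSE3 was introduced along with the Pentium 4F (Prescott) CPU, Intel's first 64-bit CPU. Therefore, you can assume that all Intel 64-bit CPUs support the SSE3 and earlier SIMD instructions.

The SSE/AVX architectures have three main generations:

The SSE architecture, which (on 64-bit CPUs) provides sixteen 128-bit XMM registers supporting integer and floating-point data types
The AVX/AVX2 architecture, which supports sixteen 256-bit YMM registers (also supporting integer and floating-point data types)
The AVX-512 architecture, which supports up to thirty-two 512-bit ZMM registers

As a general rule, this chapter's examples stick to AVX2 and earlier instructions. See the Intel and AMD CPU manuals for a discussion of additional instruction set extensions such as AVX-512. This chapter does not attempt to describe every SSE or AVX instruction. Most streaming instructions have very specialized purposes and aren't particularly useful in generic applications.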

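11.2 Streaming Data Types

The SSE and AVX programming models support two basic data types: scalars and vectors. Scalars hold one single- or double-precision floating-point value. Vectors hold multiple floating-point or integer values (between 2 and 32 values, depending on the scalar data type of byte, word, dword, qword, single precision, or double precision, and the register and memory size of 128 or 256 bits).

The XMM registers (XMM0 to XMM15) can hold a single 32-bit floating-point value (a scalar) or four single-precision floating-point values (a vector). The YMM registers (YMM0 to YMM15) can hold eight single-precision (32-bit) floating-point values (a vector); see Figure 11-1.

Figure 11-1: Packed and scalar single-precision floating-point data types

The XMM registers can hold a single double-precision scalar value or a vector containing a pair of double-precision values. The YMM registers can hold a vector containing four double-precision floating-point values, as shown in Figure 11-2.

Figure 11-2: Packed and scalar double-precision floating-point types

The XMM registers can hold 16 byte values (YMM registers can hold 32 byte values), allowing the CPU to perform 16 (32) byte-sized computations with one instruction (Figure 11-3).

Figure 11-3: Packed byte data type

The XMM registers can hold eight word values (YMM registers can hold sixteen word values), allowing the CPU to perform eight (sixteen) 16-bit word-sized integer computations with one instruction (Figure 11-4).

Figure 11-4: Packed word data type

The XMM registers can hold four dword values (YMM registers can hold eight dword values), allowing the CPU to perform four (eight) 32-bit dword-sized integer computations with one instruction (Figure 11-5).

Figure 11-5: Packed double-word data type

The XMM registers can hold two qword values (YMM registers can hold four qword values), allowing the CPU to perform two (four) 64-bit qword computations with one instruction (Figure 11-6).

Figure 11-6: Packed quad-word data type

Intel's documentation calls the vector elements in an XMM or a YMM register lanes. For example, a 128-bit XMM register has 16 bytes. Bits 0 to 7 are lane 0, bits 8 to 15 are lane 1, bits 16 to 23 are lane 2, and so on, up to bits 120 to 127, which are lane 15. A 256-bit YMM register has 32 byte-sized lanes, and a 512-bit ZMM register has 64 byte-sized lanes.

Similarly, a 128-bit XMM register has eight word-sized lanes (lanes 0 to 7). A 256-bit YMM register has sixteen word-sized lanes (lanes 0 to 15). On AVX-512-capable CPUs, a ZMM register (512 bits) has thirty-two word-sized lanes, numbered 0 to 31.

An XMM register has four dword-sized lanes (lanes 0 to 3); it also has four single-precision (32-bit) floating-point lanes (also numbered 0 to 3). A YMM register has eight dword or single-precision lanes (lanes 0 to 7). An AVX-512 ZMM register has sixteen dword or single-precision-sized lanes (numbered 0 to 15).

An XMM register supports two qword-sized lanes (or two double-precision lanes), numbered 0 to 1. As expected, a YMM register has twice as many (four lanes, numbered 0 to 3), and an AVX-512 ZMM register has four times as many lanes (0 to 7).

Several SSE/AVX instructions refer to various lanes within these registers. In particular, the shuffle and unpack instructions allow you to move data between lanes in SSE and AVX operands. See the section "The Shuffle and Unpack Instructions" for examples of lane usage.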
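
The lane counts quoted above all follow from one division: register width in bits divided by lane width in bits. The short C sketch below is only a way to check that arithmetic and to show how a lane number maps to a byte offset in a 128-bit value stored in memory. The type xmm_image and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the 16 bytes of an XMM register as they
   would appear in memory. */
typedef struct {
    uint8_t bytes[16];
} xmm_image;

/* Number of lanes = register width / lane width (both in bits). */
unsigned lane_count(unsigned register_bits, unsigned lane_bits)
{
    return register_bits / lane_bits;
}

/* Fetch dword lane n (0..3): lane n occupies bytes 4n .. 4n+3. */
uint32_t dword_lane(const xmm_image *v, unsigned lane)
{
    uint32_t out;
    memcpy(&out, &v->bytes[lane * 4], sizeof out);
    return out;
}
```

For example, lane_count(256, 16) yields the sixteen word-sized lanes of a YMM register, and lane_count(512, 64) yields the eight qword lanes of a ZMM register.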

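11.3 Using cpuid to Differentiate Instruction Sets

Intel introduced the 8086 (and shortly thereafter, the 8088) microprocessor in 1978. With almost every succeeding CPU generation, Intel added new instructions to the instruction set. Until this chapter, this book has used instructions that are generally available on all x86-64 CPUs (Intel and AMD). This chapter presents instructions that are available only on later-model x86-64 CPUs. To allow programmers to determine which CPU their applications are running on, so they can dynamically avoid using newer instructions on older processors, Intel introduced the cpuid instruction.

The cpuid instruction expects a single parameter (called a leaf function) passed in the EAX register. It returns various pieces of information about the CPU in different 32-bit registers, based on the value passed in EAX. An application can test the returned information to see whether certain CPU features are available.

As Intel introduced new instructions, it changed the behavior of cpuid to reflect those changes. Specifically, Intel changed the range of values a program could legally pass in EAX to cpuid; this is known as the highest function supported. As a result, some 64-bit CPUs accept only values in the range 0h to 05h. The instructions this chapter discusses may require passing values in the range 0h to 07h. Therefore, the first thing you have to do when using cpuid is to verify that it accepts EAX = 07h as a valid argument.

To determine the highest function supported, you load EAX with 0 or 8000_0000h and execute the cpuid instruction (all 64-bit CPUs support these two function values). The return value is the maximum you can pass to cpuid in EAX. The Intel and AMD documentation (also see en.wikipedia.org/wiki/CPUID) lists the values cpuid returns for various CPUs; for the purposes of this chapter, we need only verify that the highest function supported is 01h (which is true for all 64-bit CPUs) or 07h (for certain instructions).

In addition to providing the highest function supported, the cpuid instruction with EAX = 0h also returns a 12-character vendor ID in the EBX, ECX, and EDX registers. For x86-64 chips, this will be one of the following two:

GenuineIntel (EBX is 756e_6547h, EDX is 4965_6e69h, and ECX is 6c65_746eh)
AuthenticAMD (EBX is 6874_7541h, EDX is 6974_6E65h, and ECX is 444D_4163h)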
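
The register values listed above are just the 12 ASCII characters packed four to a register, low-order byte first, in the order EBX, EDX, ECX. The following C sketch unpacks them back into a string. It does not execute cpuid itself; the function name vendor_id_string is hypothetical, and the inputs are the constants quoted in the text.

```c
#include <stdint.h>

/* Rebuild the 12-character vendor ID from the three register values.
   The characters come from EBX, then EDX, then ECX, and within each
   register the low-order byte is the first character.
   out must have room for 13 bytes (12 characters plus the terminator). */
void vendor_id_string(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };

    for (int r = 0; r < 3; ++r) {
        for (int b = 0; b < 4; ++b) {
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xFF);
        }
    }
    out[12] = '\0';
}
```

This is the same ordering Listing 11-1 relies on when it stores EBX, EDX, and ECX into consecutive dwords of VendorID.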

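To determine whether the CPU can execute most SSE and AVX instructions, you must execute cpuid with EAX = 01h and test various bits placed in the ECX register. For a few of the more advanced features (advanced bit-manipulation functions and AVX2 instructions), you'll need to execute cpuid with EAX = 07h and check the results in the EBX register. The cpuid instruction (with EAX = 1) returns the interesting SSE/AVX feature flags in the ECX bits shown in Table 11-1; with EAX = 07h, it returns the bit-manipulation or AVX2 flags in EBX, as shown in Table 11-2. If a bit is set, the CPU supports the specific instruction(s).

Table 11-1: Intel cpuid Feature Flags (EAX = 1)

Bit	ECX
0	SSE3 support
1	PCLMULQDQ support
9	SSSE3 support
19	CPU supports SSE4.1 instructions
20	CPU supports SSE4.2 instructions
28	Advanced Vector Extensions

Table 11-2: Intel cpuid Extended Feature Flags (EAX = 7, ECX = 0)

Bit	EBX
3	Bit Manipulation Instruction Set 1
5	Advanced Vector Extensions 2 (AVX2)
8	Bit Manipulation Instruction Set 2

Listing 11-1 queries the vendor ID and basic feature flags on a CPU.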

; Listing 11-1

; CPUID Demonstration.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 11-1", 0

            .data
maxFeature  dword   ?
VendorID    byte    14 dup (0)

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Used for debugging:

print       proc
            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11

            push    rbp
            mov     rbp, rsp
            sub     rsp, 40
            and     rsp, -16

            mov     rcx, [rbp + 72]   ; Return address
            call    printf

            mov     rcx, [rbp + 72]
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     [rbp + 72], rcx

            leave
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print       endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

            xor     eax, eax
            cpuid
            mov     maxFeature, eax
            mov     dword ptr VendorID, ebx 
            mov     dword ptr VendorID[4], edx 
            mov     dword ptr VendorID[8], ecx

            lea     rdx, VendorID
            mov     r8d, eax
            call    print
            byte    "CPUID(0): Vendor ID='%s',  "
            byte    "max feature=0%xh", nl, 0

; Leaf function 1 is available on all CPUs that support
; CPUID, no need to test for it. 

            mov     eax, 1
            cpuid
            mov     r8d, edx
            mov     edx, ecx
            call    print
            byte    "cpuid(1), ECX=%08x, EDX=%08x", nl, 0

; Most likely, leaf function 7 is supported on all modern CPUs
; (for example, x86-64), but we'll test its availability nonetheless.

            cmp     maxFeature, 7
            jb      allDone

            mov     eax, 7
            xor     ecx, ecx
            cpuid
            mov     edx, ebx
            mov     r8d, ecx
            call    print
            byte    "cpuid(7), EBX=%08x, ECX=%08x", nl, 0

allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 11-1: cpuid demonstration program

On an old MacBook Pro Retina with an Intel i7-3720QM CPU, running under Parallels, you get the following output:

C:\>**build listing11-1**

C:\>**echo off**
 Assembling: listing11-1.asm
c.cpp

C:\>**listing11-1**
Calling Listing 11-1:
CPUID(0): Vendor ID='GenuineIntel', max feature=0dh
cpuid(1), ECX=ffba2203, EDX=1f8bfbff
cpuid(7), EBX=00000281, ECX=00000000
Listing 11-1 terminated

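This CPU supports SSE3 instructions (bit 0 of ECX is 1), SSE4.1 and SSE4.2 instructions (bits 19 and 20 of ECX are 1), and the AVX instructions (bit 28 is 1). Those, largely, are the instructions this chapter describes. Most modern CPUs will support these instructions (the i7-3720QM, which Intel released in 2012, supports them). The processor doesn't support some of the more interesting extended features of the Intel instruction set (the extended bit-manipulation instructions and the AVX2 instruction set). Programs using those instructions will not execute on this (ancient) MacBook Pro.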
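
Reading individual bits out of a hexadecimal value such as ffba2203h by eye is error-prone. The following C sketch shows one way to test the bit positions from Table 11-1 against a value that cpuid has already returned in ECX. The enum names and the feature_bit function are hypothetical; only the bit numbers come from the table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in ECX after cpuid with EAX = 1 (from Table 11-1). */
enum {
    ECX_SSE3      = 0,
    ECX_PCLMULQDQ = 1,
    ECX_SSSE3     = 9,
    ECX_SSE41     = 19,
    ECX_SSE42     = 20,
    ECX_AVX       = 28
};

/* Return true if the given bit number is set in a cpuid result register. */
bool feature_bit(uint32_t reg, unsigned bit)
{
    return ((reg >> bit) & 1u) != 0;
}
```

Applied to the sample output above, feature_bit(0xffba2203, ECX_AVX) is true, which agrees with the reading that bit 28 is set on the i7-3720QM.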

Running this program on a more recent CPU (an iMac Pro with a 10-core Intel Xeon W-2150B) produces the following output:

C:\>**listing11-1**
Calling Listing 11-1:
CPUID(0): Vendor ID='GenuineIntel', max feature=016h
cpuid(1), ECX=fffa3203, EDX=1f8bfbff
cpuid(7), EBX=d09f47bb, ECX=00000000
Listing 11-1 terminated

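As you can see from the extended feature bits, the newer Xeon CPU does support these additional instructions. The code fragment in Listing 11-2 provides a quick modification to Listing 11-1 that tests for the availability of the BMI1 and BMI2 bit-manipulation instruction sets (insert the following code right before the allDone label in Listing 11-1).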

; Test for extended bit manipulation instructions 
; (BMI1 and BMI2):

            and     ebx, 108h       ; Test bits 3 and 8
            cmp     ebx, 108h       ; Both must be set
            jne     Unsupported
            call    print
            byte    "CPU supports BMI1 & BMI2", nl, 0
            jmp     allDone 

Unsupported:
            call    print
            byte    "CPU does not support BMI1 & BMI2 "
            byte    "instructions", nl, 0

allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp

Listing 11-2: Test for BMI1 and BMI2 instruction sets

Here's the build command and program output on the Intel i7-3720QM CPU:

C:\>**build listing11-2**

C:\>**echo off**
 Assembling: listing11-2.asm
c.cpp

C:\>**listing11-2**
Calling Listing 11-2:
CPUID(0): Vendor ID='GenuineIntel', max feature=0dh
cpuid(1), ECX=ffba2203, EDX=1f8bfbff
cpuid(7), EBX=00000281, ECX=00000000
CPU does not support BMI1 & BMI2 instructions
Listing 11-2 terminated

Here's the same program running on the iMac Pro (Intel Xeon W-2150B):

C:\>**listing11-2**
Calling Listing 11-2:
CPUID(0): Vendor ID='GenuineIntel', max feature=016h
cpuid(1), ECX=fffa3203, EDX=1f8bfbff
cpuid(7), EBX=d09f47bb, ECX=00000000
CPU supports BMI1 & BMI2
Listing 11-2 terminated
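
The and/cmp pair in Listing 11-2 checks that two bits are set at the same time: masking with 108h keeps only bits 3 and 8, and comparing the result against 108h succeeds only if both survived. The following C sketch expresses the same test; the macro and function names are hypothetical, and the sample inputs are the EBX values printed in the two runs above.

```c
#include <stdbool.h>
#include <stdint.h>

#define BMI1_BIT (1u << 3)                 /* EBX bit 3 (Table 11-2) */
#define BMI2_BIT (1u << 8)                 /* EBX bit 8 (Table 11-2) */
#define BMI_MASK (BMI1_BIT | BMI2_BIT)     /* 108h, as in Listing 11-2 */

/* True only when both BMI1 and BMI2 are reported in the EBX value
   returned by cpuid with EAX = 7, ECX = 0. */
bool has_bmi1_and_bmi2(uint32_t leaf7_ebx)
{
    return (leaf7_ebx & BMI_MASK) == BMI_MASK;
}
```

Testing for a nonzero result after the mask would instead report success when only one of the two instruction sets is present, which is why the comparison against the full mask matters.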

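11.4 Full-Segment Syntax and Segment Alignment

As you will soon see, SSE and AVX memory data require alignment on 16-, 32-, and even 64-byte boundaries. Although you can use the align directive to align data (see "MASM Support for Data Alignment" in Chapter 3), it doesn't work beyond 16-byte alignment when using the simplified segment directives presented thus far in this book. If you need alignment beyond 16 bytes, you have to use MASM full-segment declarations.

If you want to create a segment with complete control over segment attributes, you need to use the segment and ends directives.^(1) The generic syntax for a segment declaration is as follows:

segname  segment readonly alignment 'class'
         statements
segname  ends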

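segname is an identifier. This is the name of the segment (which must also appear before the ends directive). It need not be unique; you can have several segment declarations that share the same name. MASM will combine segments with the same name when emitting code to the object file. Avoid the segment names _TEXT, _DATA, _BSS, and _CONST, as MASM uses these names for the .code, .data, .data?, and .const directives, respectively.

The readonly option is either blank or the MASM-reserved word readonly. This is a hint to MASM that the segment will contain read-only (constant) data. If you attempt to (directly) store a value into a variable that you declare in a read-only segment, MASM will complain that you cannot modify a read-only segment.

The alignment option is also optional and allows you to specify one of the following options:

byte
word
dword
para
page
align(n) (n is a constant that must be a power of 2)

The alignment options tell MASM that the first byte emitted for this particular segment must appear at an address that is a multiple of the alignment option. The byte, word, and dword reserved words specify 1-, 2-, or 4-byte alignment. The para alignment option specifies paragraph alignment (16 bytes). The page alignment option specifies an address alignment of 256 bytes. Finally, the align(n) alignment option lets you specify any address alignment that is a power of 2 (1, 2, 4, 8, 16, 32, and so on).

The default segment alignment, if you don't explicitly specify one, is paragraph alignment (16 bytes). This is also the default alignment for the simplified segment directives (.code, .data, .data?, and .const).

If you have some (SSE/AVX) data objects that must start at an address that is a multiple of 32 or 64 bytes, then creating a new data segment with 64-byte alignment is what you want. Here's an example of such a segment: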

dseg64  segment align(64)
obj64   oword   0, 1, 2, 3   ; Starts on 64-byte boundary
b       byte    0            ; Messes with alignment
        align   32           ; Sets alignment to 32 bytes
obj32   oword   0, 1         ; Starts on 32-byte boundary
dseg64  ends

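The optional class field is a string (delimited by apostrophes, that is, single quotes) that is typically one of the following names: CODE, DATA, or CONST. Note that MASM and the Microsoft linker will combine segments that have the same class name even if their segment names are different.

This chapter presents examples of these segment declarations as they are needed.

11.5 SSE, AVX, and AVX2 Memory Operand Alignment

SSE and AVX instructions typically allow access to a variety of memory operand sizes. The so-called scalar instructions, which operate on single data elements, can access byte-, word-, dword-, and qword-sized memory operands. In many respects, these types of memory accesses are similar to memory accesses by the non-SIMD instructions. The SSE, AVX, and AVX2 instruction set extensions also access packed or vector operands in memory. Unlike with the scalar memory operands, stringent rules limit the access of packed memory operands. This section discusses those rules.

The SSE instructions can access up to 128 bits of memory (16 bytes) with a single instruction. Most multi-operand SSE instructions can specify an XMM register or a 128-bit memory operand as their source (second) operand. As a general rule, these memory operands must appear at a 16-byte-aligned memory address (that is, the LO 4 bits of the memory address must contain 0s).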
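
The "LO 4 bits must contain 0s" rule generalizes to any power-of-2 alignment: an address is aligned on an n-byte boundary exactly when its low log2(n) bits are zero. The following C sketch expresses that check; the function name is_aligned is hypothetical, and the addresses used to exercise it are arbitrary example values.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if address is a multiple of alignment.
   Assumes alignment is a power of 2 (16, 32, 64, ...), so that
   alignment - 1 is a mask of exactly the low-order bits that must be 0. */
bool is_aligned(uintptr_t address, uintptr_t alignment)
{
    return (address & (alignment - 1)) == 0;
}
```

For 16-byte alignment the mask is 0Fh (the LO 4 bits); for 32-byte alignment it is 1Fh (5 bits), and for 64-byte alignment it is 3Fh (6 bits).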

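Because segments have a default alignment of para (16 bytes), you can easily ensure that any 16-byte packed data objects are 16-byte-aligned by using the align directive: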

align 16

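MASM will report an error if you attempt to use align 16 in a segment you've defined with the byte, word, or dword alignment type. It will work properly with para, page, or any align(n) option where n is greater than or equal to 16.

If you are using AVX instructions to access 256-bit (32-byte) memory operands, you must ensure that those memory operands begin on a 32-byte address boundary. Unfortunately, align 32 won't work, because the default segment alignment is para (16-byte) alignment, and the segment's alignment must be greater than or equal to the operand field of any align directive appearing within that segment. Therefore, to be able to define 256-bit variables usable by AVX instructions, you must explicitly define a (data) segment that is aligned on a (minimum) 32-byte boundary, such as the following: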

avxData    segment  align(32)
           align    32    ; This is actually redundant here
someData   oword    0, 1  ; 256 bits of data
             .
             .
             .
avxData    ends

虽然说这有些多余，但它非常重要，值得一再强调：

几乎所有 AVX/AVX2 指令，如果你尝试在一个不是 32 字节对齐的地址访问一个 256 位对象，都会引发对齐错误。始终确保你的 AVX 打包操作数正确对齐。

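If you use AVX-512 instructions with 512-bit memory operands, you must ensure that those operands appear at addresses in memory that are multiples of 64 bytes. As with the AVX instructions, you will have to define a segment that has an alignment of 64 bytes or greater, such as this: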

avx2Data   segment  align(64)
someData   oword    0, 1, 2, 3  ; 512 bits of data
             .
             .
             .
avx2Data   ends

请原谅重复，但重要的是要记住：

几乎所有 AVX-512 指令，如果你尝试在一个不是 64 字节对齐的地址访问一个 512 位对象，都会引发对齐错误。始终确保你的 AVX-512 打包操作数正确对齐。

如果你在同一个应用程序中使用 SSE、AVX 和 AVX2 数据类型，你可以通过为单个段使用 64 字节对齐选项来创建一个单一的数据段来保存所有这些数据值，而不是为每种数据类型的大小创建单独的段。记住，段的对齐必须大于或等于特定数据类型所要求的对齐方式。因此，64 字节对齐对于 SSE 和 AVX/AVX2 变量以及 AVX-512 变量都能很好地工作：

SIMDData   segment  align(64)
sseData    oword    0    ; 64-byte-aligned is also 16-byte-aligned
           align    32   ; Alignment for AVX data
avxData    oword    0, 1 ; 32 bytes of data aligned on 32 bytes
           align    64
avx2Data   oword    0, 1, 2, 3  ; 64 bytes of data
             .
             .
             .
SIMDData   ends

如果你指定的对齐选项远大于你需要的（例如 256 字节的 page 对齐），你可能会不必要地浪费内存。

当你的 SSE、AVX 和 AVX2 数据值是静态或全局变量时，align 指令表现良好。那么当你想要在栈上创建局部变量或在堆上创建动态变量时会发生什么呢？即使你的程序遵循微软的 ABI，你在进入程序（或进入一个过程）时，栈上的对齐保证只有 16 字节对齐。同样，根据你的堆管理函数，malloc（或类似函数）返回的地址也无法保证适合 SSE、AVX 或 AVX2 数据对象的对齐方式。

在一个过程内部，你可以通过过度分配存储、将对象的大小减去 1 添加到分配的地址，然后使用 and 指令将地址的低位清零（16 字节对齐对象清除 4 位，32 字节对齐对象清除 5 位，64 字节对齐对象清除 6 位）来为 16 字节、32 字节或 64 字节对齐的变量分配存储空间。然后你可以通过使用这个指针来引用该对象。下面的示例代码演示了如何做到这一点：

sseproc     proc
sseptr      equ     <[rbp - 8]>
avxptr      equ     <[rbp - 16]>
avx2ptr     equ     <[rbp - 24]>
            push    rbp
            mov     rbp, rsp
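            sub     rsp, 192    ; 112 bytes of data + up to 48 bytes of
                                ; alignment padding + 24 bytes of pointers,
                                ; rounded up to a multiple of 16

; Load RAX with an address 63 bytes
; above the current stack pointer. A
; 64-byte-aligned address will be somewhere
; between RSP and RSP + 63.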

            lea     rax, [rsp + 63]

; Mask out the LO 6 bits of RAX. This
; generates an address in RAX that is
; aligned on a 64-byte boundary and is
; between RSP and RSP + 63:

            and     rax, -64 ; 0FFFF...FC0h

; Save this 64-byte-aligned address as
; the pointer to the AVX2 data:

            mov     avx2ptr, rax

; Add 64 to AVX2's address. This skips
; over AVX2's data. The address is also
; 64-byte-aligned (which means it is
; also 32-byte-aligned). Use this as
; the address of AVX's data:

            add     rax, 64
            mov     avxptr, rax

; Add 32 to AVX's address. This skips
; over AVX's data. The address is also
; 32-byte-aligned (which means it is
; also 16-byte-aligned). Use this as
; the address of SSE's data:

            add     rax, 32
            mov     sseptr, rax
             .
             . `Code that accesses the`
             . `AVX2, AVX, and SSE data`
             . `areas using avx2ptr`,
             . `avxptr, and sseptr`

            leave
            ret
sseproc     endp

对于在堆上分配的数据，你可以做同样的事情：分配额外的存储（最多分配大小的两倍减去 1），将对象的大小减去 1（15、31 或 63）加到地址中，然后使用 -64、-32 或 -16 来屏蔽新形成的地址，以分别产生 64 字节、32 字节或 16 字节对齐的对象。
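The heap technique just described can be sketched in C roughly as follows. This is only an illustration: `align_up` and `alloc_aligned` are hypothetical helper names invented here, not functions from the book's source code, and the sketch assumes the alignment is a power of 2.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round addr up to the next multiple of align (a power of 2).
   Same idea as: lea rax, [ptr + align - 1] / and rax, -align */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + (align - 1)) & ~(align - 1);
}

/* Over-allocate by align - 1 bytes and return an aligned pointer
   inside the block. *raw receives the pointer to pass to free(). */
static void *alloc_aligned(size_t size, size_t align, void **raw)
{
    *raw = malloc(size + align - 1);
    if (*raw == NULL)
        return NULL;
    return (void *)align_up((uintptr_t)*raw, align);
}
```

Note that the caller must keep the original (unaligned) pointer around, because that is the address the heap manager expects when the block is released.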

11.6 SIMD 数据移动指令

x86-64 CPU 提供了多种数据移动指令，用于在（SSE/AVX）寄存器之间复制数据、从内存加载寄存器以及将寄存器值存储到内存。以下小节描述了每条指令。

11.6.1 (v)movd 和 (v)movq 指令

对于 SSE 指令集，movd（移动 dword）和movq（移动 qword）指令将来自 32 位或 64 位通用寄存器或内存位置的值复制到 XMM 寄存器的低 dword 或 qword 中：^(2)

movd `xmm`n, `reg`[32]/`mem`[32]
movq `xmm`n, `reg`[64]/`mem`[64]

如图 11-7 和 11-8 所示，这些指令将值零扩展到 XMM 寄存器中的剩余高位（HO 位）。

图 11-7：将 32 位值从内存移动到 XMM 寄存器（带零扩展）

图 11-8：将 64 位值从内存移动到 XMM 寄存器（带零扩展）

以下指令将 XMM 寄存器的低 32 位或 64 位存储到 dword 或 qword 内存位置或通用寄存器：

movd `reg`[32]/`mem`[32], `xmm`n
movq `reg`[64]/`mem`[64], `xmm`n

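The movq instruction also lets you copy the LO 64 bits (the LO qword) of one XMM register to another XMM register, but for whatever reason, the movd instruction does not allow two XMM register operands: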

movq `xmm`n, `xmm`n

对于 AVX 指令，你可以使用以下指令：^(3)

vmovd `xmm`n, `reg`[32]/`mem`[32]
vmovd `reg`[32]/`mem`[32], `xmm`n
vmovq `xmm`n, `reg`[64]/`mem`[64]
vmovq `reg`[64]/`mem`[64], `xmm`n

具有 XMM 目标操作数的指令还会将它们的值零扩展到高位（最多扩展到位 255，不像标准 SSE 指令不会修改 YMM 寄存器的上位）。

因为movd和movq指令访问的是 32 位和 64 位内存值（而不是 128 位、256 位或 512 位值），所以这些指令不要求它们的内存操作数按 16 字节、32 字节或 64 字节对齐。当然，如果它们的操作数在内存中按 dword（movd）或 qword（movq）对齐，指令执行可能会更快。

11.6.2 (v)movaps、(v)movapd 和 (v)movdqa 指令

movaps（移动对齐的打包单精度）、movapd（移动对齐的打包双精度）和movdqa（移动双四字对齐）指令在内存与 XMM 寄存器之间或两个 XMM 寄存器之间移动 16 字节的数据。AVX 版本（带有v前缀）在内存与 XMM 或 YMM 寄存器之间，或两个 XMM 或 YMM 寄存器之间移动 16 字节或 32 字节的数据（涉及 XMM 寄存器的移动会将相应 YMM 寄存器的高位清零）。内存位置必须按 16 字节或 32 字节边界对齐（分别），否则 CPU 将生成未对齐访问错误。

这三条 mov* 指令将 16 字节数据加载到 XMM 寄存器中，理论上可以互换使用。实际上，Intel 可能会针对它们所移动的数据类型（单精度浮点值、双精度浮点值或整数值）对操作进行优化，因此最好根据所使用的数据类型选择适当的指令（有关说明，请参见第 622 页的“性能问题与 SIMD 移动指令”）。同样，所有三条 vmov* 指令将 16 或 32 字节的数据加载到 XMM 或 YMM 寄存器中，也可以互换使用。

这些指令具有以下形式：

movaps `xmm`n, `mem`[128]    vmovaps `xmm`n, `mem`[128]    vmovaps `ymm`n, `mem`[256]
movaps `mem`[128], `xmm`n    vmovaps `mem`[128], `xmm`n    vmovaps `mem`[256], `ymm`n
movaps `xmm`n, `xmm`n     vmovaps `xmm`n, `xmm`n     vmovaps `ymm`n, `ymm`n
movapd `xmm`n, `mem`[128]    vmovapd `xmm`n, `mem`[128]    vmovapd `ymm`n, `mem`[256]
movapd `mem`[128], `xmm`n    vmovapd `mem`[128], `xmm`n    vmovapd `mem`[256], `ymm`n
movapd `xmm`n, `xmm`n     vmovapd `xmm`n, `xmm`n     vmovapd `ymm`n, `ymm`n
movdqa `xmm`n, `mem`[128]    vmovdqa `xmm`n, `mem`[128]    vmovdqa `ymm`n, `mem`[256]
movdqa `mem`[128], `xmm`n    vmovdqa `mem`[128], `xmm`n    vmovdqa `mem`[256], `ymm`n
movdqa `xmm`n, `xmm`n     vmovdqa `xmm`n, `xmm`n     vmovdqa `ymm`n, `ymm`n

mem128 操作数应为一个包含四个单精度浮点值的向量（数组），用于 (v)movaps 指令；应为一个包含两个双精度浮点值的向量，用于 (v)movapd 指令；当使用 (v)movdqa 指令时，应为一个 16 字节的值（16 字节，8 个字，4 个双字或 2 个四字）。如果无法保证操作数在 16 字节边界上对齐，请改用 movups、movupd 或 movdqu 指令（请参见下一节）。

mem256 操作数应为一个包含八个单精度浮点值的向量（数组），用于 vmovaps 指令；应为一个包含四个双精度浮点值的向量，用于 vmovapd 指令；当使用 vmovdqa 指令时，应为一个 32 字节的值（32 字节，16 个字，8 个双字或 4 个四字）。如果无法保证操作数是 32 字节对齐的，请改用 vmovups、vmovupd 或 vmovdqu 指令。

尽管物理机器指令本身对内存操作数的数据类型并不特别关心，但 MASM 的汇编语法显然是关心的。如果指令与以下任一类型不匹配，则需要使用操作数类型强制转换。

movaps 指令允许 real4、dword 和 oword 操作数。
movapd 指令允许 real8、qword 和 oword 操作数。
movdqa 指令仅允许 oword 操作数。
vmovaps 指令允许 real4、dword 和 ymmword ptr 操作数（当使用 YMM 寄存器时）。
vmovapd 指令允许 real8、qword 和 ymmword ptr 操作数（当使用 YMM 寄存器时）。
vmovdqa 指令仅允许 ymmword ptr 操作数（当使用 YMM 寄存器时）。

通常你会看到 memcpy（内存复制）函数使用 (v)movapd 指令进行高性能操作。更多详情请访问 Agner Fog 的网站 www.agner.org/optimize/。

11.6.3 （v）movups、（v）movupd 和（v）movdqu 指令

当你无法保证打包数据的内存操作数位于 16 字节或 32 字节对齐的地址边界时，可以使用 (v)movups（无对齐的打包单精度）、(v)movupd（无对齐的打包双精度）和 (v)movdqu（无对齐的双四字）指令，在 XMM 或 YMM 寄存器与内存之间移动数据。

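As with the aligned moves, all the unaligned moves do the same thing: they copy 16 (or 32) bytes of data between memory and a register. The conventions for the various data types are the same as for the aligned data movement instructions.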

11.6.4 对齐与不对齐移动的性能

列表 11-3 和列表 11-4 提供了示范程序，展示了使用对齐和不对齐内存访问的 mova* 和 movu* 指令的性能。

; Listing 11-3

; Performance test for packed versus unpacked
; instructions. This program times aligned accesses.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 11-3", 0

dseg        segment align(64) 'DATA'

; Aligned data types:

            align   64
alignedData byte    64 dup (0)
dseg        ends

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Used for debugging:

print       proc

; Print code removed for brevity.
; See Listing 11-1 for actual code.

print       endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

            call    print
            byte    "Starting", nl, 0

            mov     rcx, 4000000000 ; 4,000,000,000
            lea     rdx, alignedData
            mov     rbx, 0
rptLp:      mov     rax, 15
rptLp2:     movaps  xmm0, xmmword ptr [rdx + rbx * 1]
            movapd  xmm0, real8 ptr   [rdx + rbx * 1]
            movdqa  xmm0, xmmword ptr [rdx + rbx * 1]
            vmovaps ymm0, ymmword ptr [rdx + rbx * 1]
            vmovapd ymm0, ymmword ptr [rdx + rbx * 1]
            vmovdqa ymm0, ymmword ptr [rdx + rbx * 1]
            vmovaps zmm0, zmmword ptr [rdx + rbx * 1]
            vmovapd zmm0, zmmword ptr [rdx + rbx * 1]

            dec     rax
            jns     rptLp2

            dec     rcx
            jnz     rptLp

            call    print
            byte    "Done", nl, 0

allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

列表 11-3：对齐内存访问时序代码

; Listing 11-4

; Performance test for packed versus unpacked
; instructions. This program times unaligned accesses. 

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 11-4", 0

dseg        segment align(64) 'DATA'

; Aligned data types:

            align   64
alignedData byte    64 dup (0)
dseg        ends

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Used for debugging:

print       proc

; Print code removed for brevity.
; See Listing 11-1 for actual code.

print       endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

            call    print
            byte    "Starting", nl, 0

            mov     rcx, 4000000000 ; 4,000,000,000
            lea     rdx, alignedData
rptLp:      mov     rbx, 15
rptLp2:
            movups  xmm0, xmmword ptr [rdx + rbx * 1]
            movupd  xmm0, real8 ptr   [rdx + rbx * 1]
            movdqu  xmm0, xmmword ptr [rdx + rbx * 1]
            vmovups ymm0, ymmword ptr [rdx + rbx * 1]
            vmovupd ymm0, ymmword ptr [rdx + rbx * 1]
            vmovdqu ymm0, ymmword ptr [rdx + rbx * 1]
            vmovups zmm0, zmmword ptr [rdx + rbx * 1]
            vmovupd zmm0, zmmword ptr [rdx + rbx * 1]
            dec     rbx
            jns     rptLp2

            dec     rcx
            jnz     rptLp

            call    print
            byte    "Done", nl, 0

allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

列表 11-4：不对齐内存访问时序代码

列表 11-3 中的代码在 3GHz Xeon W CPU 上执行大约需要 1 分 7 秒。在相同的处理器上，列表 11-4 中的代码执行需要 1 分 55 秒。如你所见，某些情况下，在对齐的地址边界上访问 SIMD 数据是有优势的。

11.6.5 `(v)movlps` 和 `(v)movlpd` 指令

(v)movl* 指令和 (v)movh* 指令（见下节）可能看起来像是普通的移动指令。它们的行为与许多其他 SSE/AVX 移动指令相似。然而，这些指令的设计目的是支持浮点向量的打包和解包。具体而言，这些指令允许你将来自两个不同源的两对单精度浮点数或一对双精度浮点操作数合并到一个单一的 XMM 寄存器中。

(v)movlps 指令使用以下语法：

movlps  `xmm`[dest], `mem`[64]
movlps  `mem`[64],  `xmm`[src]
vmovlps `xmm`[dest], `xmm`[src], `mem`[64]
vmovlps `mem`[64],  `xmm`[src]

movlps xmmdest, mem64 形式将一对单精度浮点值复制到目标 XMM 寄存器的两个低 32 位通道，如图 11-9 所示。此指令不改变高 64 位。

图 11-9：movlps 指令

movlps mem64, xmmsrc 形式将 XMM 源寄存器中的低 64 位（两个低单精度通道）复制到指定的内存位置。从功能上讲，这与 movq 或 movsd 指令等价（因为它将 64 位数据复制到内存），尽管如果 XMM 寄存器的低 64 位实际包含两个单精度值，则此指令可能会稍微更快一些（有关详细解释，请参见《性能问题与 SIMD 移动指令》一节，第 622 页）。

vmovlps 指令有三个操作数：一个目标 XMM 寄存器，一个源 XMM 寄存器和一个源（64 位）内存位置。该指令将内存位置中的两个单精度值复制到目标 XMM 寄存器的低 64 位。它还将源寄存器的高 64 位（也包含两个单精度值）复制到目标寄存器的高 64 位。图 11-10 显示了该操作。请注意，该指令通过一条指令合并了一对操作数。

图 11-10: vmovlps 指令

类似于 movsd，movlpd（移动低字打包双精度）指令将源操作数的低 64 位（一个双精度浮点值）复制到目标操作数的低 64 位。不同之处在于，movlpd 指令在从内存移动数据到 XMM 寄存器时不会进行零扩展，而 movsd 指令则会将值零扩展到目标 XMM 寄存器的上 64 位。（无论是 movsd 还是 movlpd，在 XMM 寄存器之间复制数据时都不会进行零扩展；当然，当将数据存储到内存时，零扩展也不适用。）^(4)

11.6.6 `movhps` 和 `movhpd` 指令

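The movhps and movhpd instructions move a 64-bit value (two single-precision floats in the case of movhps, or a single double-precision value in the case of movhpd) into the HO quad word of a destination XMM register. Figure 11-11 shows the operation of the movhps instruction; Figure 11-12 shows the movhpd instruction.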

图 11-11: movhps 指令

图 11-12: movhpd 指令

The movhps and movhpd instructions can also store the HO quad word of an XMM register into memory. The allowable syntax is shown here:

movhps `xmm`n, `mem`[64]
movhps `mem`[64], `xmm`n
movhpd `xmm`n, `mem`[64]
movhpd `mem`[64], `xmm`n

这些指令不会影响 YMM 寄存器的 128 到 255 位（如果 CPU 上存在 YMM 寄存器）。

通常你会使用 movlps 指令，然后再使用 movhps 指令将四个单精度浮点数加载到 XMM 寄存器中，浮点数来自两个不同的数据源（类似地，你可以使用 movlpd 和 movhpd 指令从不同源加载一对双精度值到一个 XMM 寄存器）。相反，你也可以使用该指令将一个向量结果拆分成两部分，并将这两部分存储到不同的数据流中。这可能就是该指令的预期用途。当然，如果你能将其用于其他目的，也可以尝试。
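To make the merge pattern concrete, here is a small C model of the register-load forms of movlps and movhps. It is a sketch of the semantics only: an XMM register is represented as an array of four floats (element 0 is the LO lane), and the function names are made up for this illustration.

```c
#include <assert.h>

/* movlps xmm, mem64: load two floats into lanes 0 and 1;
   lanes 2 and 3 (the HO 64 bits) are left unchanged. */
static void movlps_model(float xmm[4], const float mem[2])
{
    xmm[0] = mem[0];
    xmm[1] = mem[1];
}

/* movhps xmm, mem64: load two floats into lanes 2 and 3;
   lanes 0 and 1 (the LO 64 bits) are left unchanged. */
static void movhps_model(float xmm[4], const float mem[2])
{
    xmm[2] = mem[0];
    xmm[3] = mem[1];
}
```

Calling `movlps_model` and then `movhps_model` with two different source arrays fills all four lanes, which mirrors the movlps/movhps pairing described above.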

MASM（版本 14.15.26730.0，至少）似乎要求 movhps 操作数为 64 位数据类型，并不允许使用 real4 操作数。^(5) 因此，在使用此指令时，你可能需要显式地将一对 real4 值转换为 qword ptr：

r4m         real4   1.0, 2.0, 3.0, 4.0
r8m         real8   1.0, 2.0
              .
              .
              .
            movhps  xmm0, qword ptr r4m
            movhpd  xmm0, r8m

11.6.7 `vmovhps` 和 `vmovhpd` 指令

尽管 AVX 指令扩展提供了 vmovhps 和 vmovhpd 指令，但它们并非 SSE movhps 和 movhpd 指令的简单扩展。这些指令的语法如下：

vmovhps `xmm`[dest], `xmm`[src], `mem`[64]
vmovhps `mem`[64],  `xmm`[src]
vmovhpd `xmm`[dest], `xmm`[src], `mem`[64]
vmovhpd `mem`[64],  `xmm`[src]

将数据存储到 64 位内存位置的指令行为类似于 movhps 和 movhpd 指令。将数据加载到 XMM 寄存器的指令有两个源操作数。它们将完整的 128 位（四个单精度值或两个双精度值）加载到目标 XMM 寄存器中。高 64 位来自内存操作数；低 64 位来自源 XMM 寄存器的低 64 位，如图 11-13 所示。这些指令还将值零扩展到（重叠的）YMM 寄存器的上 128 位。

图 11-13：vmovhpd 和 vmovhps 指令

与 movhps 指令不同，MASM 正确接受 real4 源操作数用于 vmovhps 指令：

r4m         real4   1.0, 2.0, 3.0, 4.0
r8m         real8   1.0, 2.0
              .
              .
              .
            vmovhps xmm0, xmm1, r4m
            vmovhpd xmm0, xmm1, r8m

11.6.8 `movlhps` 和 `vmovlhps` 指令

movlhps 指令将一对 32 位单精度浮点值从源 XMM 寄存器的低 64 位移动到目标 XMM 寄存器的高 64 位。它保持目标寄存器的低 64 位不变。如果目标寄存器位于支持 256 位 AVX 寄存器的 CPU 上，此指令还将保持重叠 YMM 寄存器的高 128 位不变。

这些指令的语法如下：

movlhps  `xmm`[dest], `xmm`[src]
vmovlhps `xmm`[dest], `xmm`[src1], `xmm`[src2]

你不能使用此指令在内存和 XMM 寄存器之间移动数据；它仅在 XMM 寄存器之间传输数据。没有双精度版本的此指令。

vmovlhps 指令类似于 movlhps，但有以下不同之处：

vmovlhps 需要三个操作数：两个源 XMM 寄存器和一个目标 XMM 寄存器。
vmovlhps 将第一个源寄存器的低 64 位拷贝到目标寄存器的低 64 位。
vmovlhps 将第二个源寄存器的低 64 位拷贝到目标寄存器的 64 至 127 位。
vmovlhps 将结果零扩展到重叠 YMM 寄存器的上 128 位。

没有 vmovlhpd 指令。

11.6.9 `movhlps` 和 `vmovhlps` 指令

movhlps 指令的语法如下：

movhlps `xmm`[dest], `xmm`[src]

movhlps 指令将源操作数的高 64 位中的一对 32 位单精度浮点值拷贝到目标寄存器的低 64 位，而不改变目标寄存器的高 64 位（这是 movlhps 的反操作）。此指令仅在 XMM 寄存器之间拷贝数据；不允许使用内存操作数。

vmovhlps 指令需要三个 XMM 寄存器操作数；其语法如下：

vmovhlps `xmm`[dest], `xmm`[src1], `xmm`[src2]

该指令将第一个源寄存器的高 64 位复制到目标寄存器的高 64 位，将第二个源寄存器的高 64 位复制到目标寄存器的 0 到 63 位，最后将结果零扩展到覆盖的 YMM 寄存器的上位。

没有 movhlpd 或 vmovhlpd 指令。

11.6.10 (v)movshdup 和 (v)movsldup 指令

movshdup 指令将源操作数（内存或 XMM 寄存器）中的两个奇数索引的单精度浮点值移动，并将每个元素复制到目标 XMM 寄存器，如图 11-14 所示。

图 11-14: movshdup 和 vmovshdup 指令

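This instruction ignores the single-precision floating-point values at the even-indexed lanes of the source. The vmovshdup instruction works the same way but on YMM registers, duplicating four single-precision values rather than two (and, of course, zeroing the HO bits). The syntax for these instructions is shown here: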

movshdup  `xmm`[dest], `mem`[128]/`xmm`[src]
vmovshdup `xmm`[dest], `mem`[128]/`xmm`[src]
vmovshdup `ymm`[dest], `mem`[256]/`ymm`[src]

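The movsldup instruction works just like the movshdup instruction, except that it copies and duplicates the two single-precision values at the even indexes of the source into the destination XMM register. Likewise, the vmovsldup instruction copies and duplicates the four single-precision values at the even indexes of the source YMM register, as shown in Figure 11-15.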

图 11-15: movsldup 和 vmovsldup 指令

语法如下：

movsldup  `xmm`[dest], `mem`[128]/`xmm`[src]
vmovsldup `xmm`[dest], `mem`[128]/`xmm`[src]
vmovsldup `ymm`[dest], `mem`[256]/`ymm`[src]
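The 128-bit forms of these two instructions can be modeled in C as shown next. This is a sketch under the assumption that an XMM register is an array of four floats with element 0 as the LO lane; the function names are hypothetical.

```c
#include <assert.h>

/* movshdup: duplicate the odd-indexed lanes (1 and 3). */
static void movshdup_model(float dest[4], const float src[4])
{
    float v1 = src[1], v3 = src[3];   /* read first so dest may alias src */
    dest[0] = v1; dest[1] = v1;
    dest[2] = v3; dest[3] = v3;
}

/* movsldup: duplicate the even-indexed lanes (0 and 2). */
static void movsldup_model(float dest[4], const float src[4])
{
    float v0 = src[0], v2 = src[2];   /* read first so dest may alias src */
    dest[0] = v0; dest[1] = v0;
    dest[2] = v2; dest[3] = v2;
}
```

With a source of {1, 2, 3, 4}, the first function produces {2, 2, 4, 4} and the second produces {1, 1, 3, 3}.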

11.6.11 (v)movddup 指令

movddup 指令将 XMM 寄存器的低 64 位或 64 位内存位置中的双精度值复制并重复到目标 XMM 寄存器的低 64 位；然后，它还会将该值复制到目标寄存器的 64 到 127 位，如图 11-16 所示。

图 11-16: movddup 指令的行为

该指令不会影响 YMM 寄存器的高 128 位（如果适用）。该指令的语法如下：

movddup `xmm`[dest], `mem`[64]/`xmm`[src]

vmovddup 指令在 XMM 或 YMM 目标寄存器与 XMM 或 YMM 源寄存器或 128 位或 256 位内存位置之间进行操作。128 位版本的工作方式与 movddup 指令相同，但它会将目标 YMM 寄存器的高位清零。256 位版本将源值中偶数索引（0 和 2）处的一对双精度值复制到目标 YMM 寄存器中对应的索引，并将这些值复制到目标寄存器中的奇数索引位置，如图 11-17 所示。

图 11-17: vmovddup 指令的行为

该指令的语法如下：

vmovddup `xmm`[dest], `mem`[64]/`xmm`[src]
vmovddup `ymm`[dest], `mem`[256]/`ymm`[src]

11.6.12 (v)lddqu 指令

(v)lddqu指令在操作上与(v)movdqu完全相同。如果（内存）源操作数没有正确对齐并且跨越了内存中的缓存行边界，你有时可以使用此指令来提高性能。有关此指令及其性能限制的更多详细信息，请参阅 Intel 或 AMD 的文档（特别是优化手册）。

这些指令通常采用以下形式：

lddqu  `xmm`[dest], `mem`[128]
vlddqu `xmm`[dest], `mem`[128]
vlddqu `ymm`[dest], `mem`[256]

11.6.13 性能问题与 SIMD 移动指令

当你从编程模型层次查看 SSE/AVX 指令的语义时，你可能会质疑为什么某些指令出现在指令集中。例如，movq、movsd和movlps指令都可以从内存位置加载 64 位数据到 XMM 寄存器的 LO 64 位部分。为什么要这样做？为什么不使用一条指令直接将内存中的四字数据复制到 XMM 寄存器的 LO 64 位部分（无论是 64 位整数、一对 32 位整数、64 位双精度浮点值，还是一对 32 位单精度浮点值）？答案就在于微架构这个术语中。

x86-64 宏架构是软件工程师所看到的编程模型。在宏架构中，XMM 寄存器是一个 128 位的资源，在任何给定时刻，它可以容纳一个 128 位的位数组（或一个整数）、一对 64 位整数值、一对 64 位双精度浮点值、一组四个单精度浮点值、一组四个双字整数、八个字或 16 个字节。所有这些数据类型是相互叠加的，就像 8 位、16 位、32 位和 64 位的通用寄存器相互叠加一样（这被称为别名）。如果你将两个双精度浮点值加载到 XMM 寄存器中，然后修改位位置 0 到 15 的（整数）字，你实际上也在改变 XMM 寄存器的 LO 四字中的双精度值中相同的位（0 到 15）。x86-64 编程模型的语义要求这样做。

然而，从微体系结构的角度来看，并没有要求 CPU 在 CPU 中使用相同的物理位来存储整数、单精度和双精度值（即使它们被别名映射到同一个寄存器）。微体系结构可以为单个寄存器设置一组单独的位，用来存储整数、单精度和双精度值。例如，当你使用movq指令将 64 位加载到 XMM 寄存器时，该指令实际上可能会将位复制到底层的整数寄存器中（而不影响单精度或双精度子寄存器）。同样，movlps指令会将一对单精度值复制到单精度寄存器中，movsd指令则会将一个双精度值复制到双精度寄存器中（见图 11-18）。这些独立的子寄存器（整数、单精度和双精度）可以直接连接到处理它们特定数据类型的算术或逻辑单元，从而使对这些子寄存器的算术和逻辑操作更加高效。只要数据位于适当的子寄存器中，一切都能顺利进行。

图 11-18：微体系结构级别的寄存器别名

然而，如果你使用movq指令将一对单精度浮点数加载到 XMM 寄存器中，然后尝试对这两个值执行单精度向量操作，会发生什么情况呢？从宏体系结构的角度来看，这两个单精度值正坐落在 XMM 寄存器的适当位置，因此这应该是一个合法操作。然而，从微体系结构的角度来看，这两个单精度浮点数坐落在整数子寄存器中，而不是单精度子寄存器中。底层微体系结构必须注意到这些值位于错误的子寄存器，并在执行单精度算术或逻辑操作之前将它们移动到适当的（单精度）子寄存器中。这可能会引入轻微的延迟（当微体系结构移动数据时），这就是为什么你应该始终为数据类型选择适当的移动指令。

11.6.14 对 SIMD 移动指令的一些最终评论

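The SIMD data movement instructions are a confusing bunch. Their syntax is inconsistent, many instructions duplicate the actions of other instructions, and they have some perplexing irregularities. Someone new to the x86-64 instruction set might ask, "Why was the instruction set designed this way?" Why, indeed?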

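The answer to that question is historical. The earliest x86 CPUs had no SIMD instructions. Intel added the MMX instruction set to the Pentium series of CPUs. At that time (the mid-1990s), the technology of the day allowed Intel to add only a few instructions, and the MMX registers were limited to 64 bits in size. Furthermore, software engineers and computer systems designers were only beginning to explore the multimedia capabilities of modern computers, so it was not entirely clear which instructions (and data types) would be needed to support the kind of software we see decades later. As a result, the earliest SIMD instructions and data types were limited in scope.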

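As time passed, CPUs gained additional silicon resources, and software and systems engineers discovered new uses for computers (and new algorithms to run on them), so Intel (and AMD) responded by adding new SIMD instructions to support these more modern multimedia applications. The original MMX instructions, for example, supported only integer data types, so Intel added floating-point support in the SSE instruction set, because multimedia applications needed real (floating-point) data types. Intel then extended the integer types from 64 bits to 128, 256, and even 512 bits. With each extension, Intel (and AMD) had to retain the older instruction set extensions so that existing software could run on the new CPUs.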

结果是，新的指令集不断堆积了与旧指令相同功能的新指令（并附带一些额外的功能）。这就是为什么像movaps和vmovaps这样的指令在功能上有显著重叠的原因。如果 CPU 资源早些时候就已到位（例如，能够在 CPU 上放置 256 位的 YMM 寄存器），那么几乎就不需要movaps指令了——vmovaps本可以完成所有工作。^(6)

从理论上讲，我们可以通过从头开始重新设计一个架构优雅的 x86-64 变种，设计一个最小的指令集来处理当前 x86-64 的所有活动，而不需要现有指令集中存在的所有冗余和臃肿。然而，这样的 CPU 将失去 x86-64 的主要优势：运行为 Intel 架构编写的数十年软件的能力。能够运行所有这些旧软件的代价是，汇编语言程序员（和编译器开发者）必须处理指令集中所有这些不规则性。

11.7 洗牌和解包指令

SSE/AVX 洗牌和解包指令是移动指令的变体。除了移动数据之外，这些指令还可以重新排列出现在 XMM 和 YMM 寄存器不同通道中的数据。

11.7.1 (v)pshufb 指令

pshufb 指令是第一个打包字节洗牌 SIMD 指令（首次出现在 MMX 指令集中）。由于其起源，语法和行为与指令集中其他洗牌指令有所不同。其语法如下：

pshufb `xmm`[dest], `xmm`/`mem`[128]

第一个（目标）操作数是一个 XMM 寄存器，其字节车道将由 pshufb 洗牌（重新排列）。第二个操作数（可以是 XMM 寄存器或 128 位 oword 内存位置）是一个包含 16 个字节值的数组，这些值控制洗牌操作。如果第二个操作数是内存位置，该 oword 值必须在 16 字节边界上对齐。

第二个操作数中的每个字节（车道）都为第一个操作数中相应的字节车道选择一个值，如图 11-19 所示。

图 11-19：pshufb 指令的车道索引对应关系

第二个操作数中的 16 字节索引分别采用图 11-20 中所示的形式。

Figure 11-20: pshufb byte index

pshufb 指令会忽略索引字节中的第 4 到第 6 位。第 7 位是清除位；如果此位为 1，pshufb 指令将忽略车道索引位，并在 XMM[dest] 中对应的字节位置存储 0。如果清除位为 0，pshufb 指令将执行洗牌操作。

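The pshufb shuffle operation takes place lane by lane. The instruction first makes a temporary copy of XMM[dest]. Then, for each index byte whose HO bit is 0, pshufb uses the LO 4 bits of the index to select a lane in the temporary copy, and it copies that lane's value into the XMM[dest] lane that occupies the same position as the index byte, as shown in Figure 11-21. In this example, the index in lane 6 contains the value 00000011b. This selects the value in lane 3 of the temporary (original XMM[dest]) value and copies it to lane 6 of XMM[dest]. The pshufb instruction repeats this operation for all 16 lanes.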

图 11-21：洗牌操作
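The lane-by-lane behavior described above can be written as a short C model. This is a sketch of the semantics, not of how the hardware works: both operands are 16-byte arrays with element 0 as the LO byte lane, and `pshufb_model` is a hypothetical name chosen for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of: pshufb xmm_dest, index */
static void pshufb_model(uint8_t dest[16], const uint8_t index[16])
{
    uint8_t temp[16];
    memcpy(temp, dest, 16);              /* temporary copy of the original dest */

    for (int lane = 0; lane < 16; lane++) {
        if (index[lane] & 0x80)          /* bit 7 set: store 0 in this lane */
            dest[lane] = 0;
        else                             /* bits 0-3 pick a lane; bits 4-6 ignored */
            dest[lane] = temp[index[lane] & 0x0F];
    }
}
```

For instance, an index byte of 00000011b in lane 6 copies the original lane 3 into lane 6, matching the example in the text.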

AVX 指令集扩展引入了 vpshufb 指令。其语法如下：

vpshufb `xmm`[dest], `xmm`[src], `xmm`[index]/`mem`[128]
vpshufb `ymm`[dest], `ymm`[src], `ymm`[index]/`mem`[256]

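The AVX variant adds a source register (rather than using XMM[dest] as both the source and destination register). Instead of making a temporary copy of XMM[dest] before the operation and picking values from that copy, the vpshufb instruction selects the source bytes from the XMM[src] register. Apart from this, and from zeroing the HO bits of YMM[dest], the 128-bit variant operates identically to the SSE pshufb instruction.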

AVX 指令允许您指定 256 位的 YMM 寄存器，除了 128 位的 XMM 寄存器之外。^(7)

11.7.2 (v)pshufd 指令

SSE 扩展首次引入了 pshufd 指令。AVX 扩展增加了 vpshufd 指令。这些指令以类似于 (v)pshufb 指令的方式打乱 XMM 和 YMM 寄存器中的双字（不是双精度值）。然而，打乱索引的指定方式与 (v)pshufb 不同。(v)pshufd 指令的语法如下：

pshufd  `xmm`[dest], `xmm`[src]/`mem`[128], `imm`[8]
vpshufd `xmm`[dest], `xmm`[src]/`mem`[128], `imm`[8]
vpshufd `ymm`[dest], `ymm`[src]/`mem`[256], `imm`[8]

第一个操作数（XMM[dest] 或 YMM[dest]）是目标操作数，用于存储被打乱的值。第二个操作数是源操作数，指令将从中选择双字以放入目标寄存器；通常，如果这是内存操作数，则必须将其对齐到适当的（16 字节或 32 字节）边界。第三个操作数是一个 8 位立即数，指定从源操作数中选择双字的索引。

对于带有 XMM[dest] 操作数的 (v)pshufd 指令，imm[8] 操作数的编码如表 11-3 所示。位 0 到 1 中的值选择源操作数中的特定双字，并将其放入 XMM[dest] 操作数的双字 0 中。位 2 到 3 中的值选择源操作数中的一个双字，并将其放入 XMM[dest] 操作数的双字 1 中。位 4 到 5 中的值选择源操作数中的一个双字，并将其放入 XMM[dest] 操作数的双字 2 中。最后，位 6 到 7 中的值选择源操作数中的一个双字，并将其放入 XMM[dest] 操作数的双字 3 中。

表 11-3：(v)pshufd imm[8] 操作数值

位位置	目标通道
0 到 1	0
2 到 3	1
4 到 5	2
6 到 7	3

128 位 pshufd 与 vpshufd 指令的区别在于，pshufd 会保持底层 YMM 寄存器的高 128 位不变，而 vpshufd 会将底层 YMM 寄存器的高 128 位清零。

vpshufd 的 256 位变体（当使用 YMM 寄存器作为源和目标操作数时）仍然使用 8 位立即数操作数作为索引值。每个 2 位的索引值操作 YMM 寄存器中的两个双字值。位 0 到 1 控制双字 0 和 4，位 2 到 3 控制双字 1 和 5，位 4 到 5 控制双字 2 和 6，位 6 到 7 控制双字 3 和 7，如表 11-4 所示。

表 11-4：vpshufd YMM[dest], YMM[src]/mem[src], imm[8] 的双字传输

索引	YMM/mem[src] [索引] 复制到	YMM/mem[src] [索引 + 4] 复制到
imm[8] 的位 0 到 1	YMM[dest][0]	YMM[dest][4]
imm[8] 的位 2 到 3	YMM[dest][1]	YMM[dest][5]
imm[8] 的位 4 到 5	YMM[dest][2]	YMM[dest][6]
imm[8] 的位 6 到 7	YMM[dest][3]	YMM[dest][7]

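The 256-bit version is slightly less flexible, because it copies two dwords at a time rather than one. It processes the LO 128 bits exactly as the 128-bit version does; it also copies the corresponding lanes in the upper 128 bits of the source to the YMM destination register by using the same shuffle pattern. Unfortunately, you cannot independently control the HO and LO halves of the YMM register with the vpshufd instruction. If you really need to shuffle dwords independently, you can use vpshufb with appropriate indexes that copy 4 bytes (in place of a single dword).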
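The imm[8] decoding for both the 128-bit and 256-bit forms can be sketched in C as follows. Registers are modeled as arrays of dwords (element 0 is the LO dword), and the function names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of: pshufd xmm_dest, src, imm8 */
static void pshufd_model(uint32_t dest[4], const uint32_t src[4], uint8_t imm8)
{
    uint32_t r[4];
    for (int lane = 0; lane < 4; lane++)
        r[lane] = src[(imm8 >> (2 * lane)) & 3];   /* 2-bit index per dest lane */
    for (int lane = 0; lane < 4; lane++)
        dest[lane] = r[lane];                      /* staged so dest may alias src */
}

/* Model of: vpshufd ymm_dest, src, imm8
   The same four 2-bit indexes are applied to each 128-bit half. */
static void vpshufd256_model(uint32_t dest[8], const uint32_t src[8], uint8_t imm8)
{
    pshufd_model(dest, src, imm8);            /* dwords 0 to 3 */
    pshufd_model(dest + 4, src + 4, imm8);    /* dwords 4 to 7 */
}
```

With imm8 = 1Bh (00 01 10 11b), each group of four dwords is reversed; the 256-bit model reverses each half separately, which illustrates why the two halves cannot be controlled independently.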

11.7.3 (v)pshuflw 和 (v)pshufhw 指令

pshuflw 和 vpshuflw 以及 pshufhw 和 vpshufhw 指令支持在 XMM 或 YMM 寄存器内进行 16 位字的洗牌。这些指令的语法如下：

pshuflw  `xmm`[dest], `xmm`[src]/`mem`[128], `imm`[8]
pshufhw  `xmm`[dest], `xmm`[src]/`mem`[128], `imm`[8]

vpshuflw `xmm`[dest], `xmm`[src]/`mem`[128], `imm`[8]
vpshufhw `xmm`[dest], `xmm`[src]/`mem`[128], `imm`[8]

vpshuflw `ymm`[dest], `ymm`[src]/`mem`[256], `imm`[8]
vpshufhw `ymm`[dest], `ymm`[src]/`mem`[256], `imm`[8]

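The 128-bit lw variants copy the HO 64 bits of the source operand to the same positions in the XMM[dest] operand. Then they use the index (imm[8]) operand to select words from the LO quad word (word lanes 0 to 3) of the XMM[src]/mem[128] operand and move them into the LO four word lanes of the destination operand. For example, if the LO 2 bits of imm[8] are 10b, the pshuflw instruction copies lane 2 of the source into lane 0 of the destination operand (see Figure 11-22). Note that pshuflw does not modify the HO 128 bits of the overlaid YMM register, whereas vpshuflw zeros those HO bits.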

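Figure 11-22: (v)pshuflw xmm, xmm/mem, imm8 operation

The 256-bit vpshuflw instruction (with a YMM destination register) shuffles two sets of words at a time: one set in the LO 128 bits and one set in the HO 128 bits of the 256-bit source and the YMM destination register, as shown in Figure 11-23. The index (imm[8]) selection is the same for the LO and HO 128 bits.

Figure 11-23: vpshuflw ymm, ymm/mem, imm8 operation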

128 位的 hw 变种将源操作数的低 64 位复制到目标操作数的相同位置。然后，它们使用索引操作数选择源操作数中第 4 到第 7 个字（按 0 到 3 索引），并将其移动到目标操作数的高 4 个字道中（参见图 11-24）。

图 11-24: (v)pshufhw 操作

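The 256-bit vpshufhw instruction (with a YMM destination register) shuffles two sets of words at a time: one set in the LO 128 bits and one set in the HO 128 bits of the 256-bit source and the YMM destination register, as shown in Figure 11-25.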

图 11-25: vpshufhw 操作
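The following C sketch models the 128-bit pshuflw and pshufhw operations. It assumes an XMM register is an array of eight words with element 0 as the LO word; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of: pshuflw xmm_dest, src, imm8
   Shuffles word lanes 0 to 3; copies lanes 4 to 7 unchanged. */
static void pshuflw_model(uint16_t dest[8], const uint16_t src[8], uint8_t imm8)
{
    uint16_t r[8];
    for (int lane = 0; lane < 4; lane++) {
        r[lane]     = src[(imm8 >> (2 * lane)) & 3];
        r[lane + 4] = src[lane + 4];
    }
    for (int i = 0; i < 8; i++)
        dest[i] = r[i];
}

/* Model of: pshufhw xmm_dest, src, imm8
   Copies word lanes 0 to 3 unchanged; shuffles lanes 4 to 7. */
static void pshufhw_model(uint16_t dest[8], const uint16_t src[8], uint8_t imm8)
{
    uint16_t r[8];
    for (int lane = 0; lane < 4; lane++) {
        r[lane]     = src[lane];
        r[lane + 4] = src[4 + ((imm8 >> (2 * lane)) & 3)];
    }
    for (int i = 0; i < 8; i++)
        dest[i] = r[i];
}
```

The 256-bit forms would apply the same pattern to each 128-bit half with the same imm8 value.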

11.7.4 shufps 和 shufpd 指令

shuffle 指令（shufps 和 shufpd）从源操作数中提取单精度或双精度值，并将它们放置到目标操作数的指定位置。第三个操作数，一个 8 位立即数值，选择从源操作数中提取哪些值并移动到目标寄存器。以下是这两条指令的语法：

shufps `xmm`[src1/dest], `xmm`[src2]/`mem`[128], `imm`[8]
shufpd `xmm`[src1/dest], `xmm`[src2]/`mem`[128], `imm`[8]

For the shufps instruction, the third operand is an 8-bit immediate value that is actually a four-element array of 2-bit values.

imm[8] 位 0 和 1 从 XMM[src1/dest] 操作数的四个通道中选择一个单精度值，并将其存储到目标操作中的通道 0。位 2 和 3 从 XMM[src1/dest] 操作数的四个通道中选择一个单精度值，并将其存储到目标操作中的通道 1（目标操作数同样为 XMM[src1/dest]）。

imm[8] 位 4 和 5 从 XMM[src2]/mem[src2] 操作数的四个通道中选择一个单精度值，并将其存储到目标操作中的通道 2。位 6 和 7 从 XMM[src2]/mem[src2] 操作数的四个通道中选择一个单精度值，并将其存储到目标操作中的通道 3。

图 11-26 显示了 shufps 指令的操作。

图 11-26：shufps 操作

For example, the instruction

shufps xmm0, xmm1, 0E4h  ; 0E4h = 11 10 01 00

loads XMM0 with the following single-precision values:

XMM0[0 to 31] from XMM0[0 to 31]
XMM0[32 to 63] from XMM0[32 to 63]
XMM0[64 to 95] from XMM1[64 to 95]
XMM0[96 to 127] from XMM1[96 to 127]

如果第二个操作数（XMM[src2]/mem[src2]）与第一个操作数（XMM[src1/dest]）相同，则可以重新排列 XMM[dest] 寄存器中的四个单精度值（这可能就是指令名称 shuffle 的来源）。
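As a cross-check on the 0E4h example, the shufps selection rules can be modeled in C. This is a sketch: registers are arrays of four floats (element 0 is lane 0), and `shufps_model` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Model of: shufps xmm_src1_dest, src2, imm8 */
static void shufps_model(float dest[4], const float src2[4], uint8_t imm8)
{
    float r[4];
    r[0] = dest[imm8 & 3];           /* bits 0-1 select from src1/dest */
    r[1] = dest[(imm8 >> 2) & 3];    /* bits 2-3 select from src1/dest */
    r[2] = src2[(imm8 >> 4) & 3];    /* bits 4-5 select from src2      */
    r[3] = src2[(imm8 >> 6) & 3];    /* bits 6-7 select from src2      */
    for (int i = 0; i < 4; i++)
        dest[i] = r[i];              /* staged so src2 may alias dest  */
}
```

Passing the same array for both operands, with imm8 = 1Bh, reverses the four values in place, which is the single-register rearrangement mentioned above.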

shufpd 指令的工作方式类似，打乱双精度值。由于 XMM 寄存器中只有两个双精度值，因此只需一个位来选择这两个值中的一个。同样，因为目标寄存器中只有两个双精度值，指令只需要两个（单比特）数组元素来选择目标。结果，第三个操作数 imm[8] 实际上只是一个 2 位值；指令会忽略 imm[8] 操作数中的位 2 到 7。imm[8] 操作数的位 0 选择从 XMM[src1/dest] 操作数中选择通道 0 和位 0 到 63（如果为 0）或通道 1 和位 64 到 127（如果为 1），并将其放入 XMM[dest] 的通道 0 和位 0 到 63 中。imm[8] 操作数的位 1 选择从 XMM[src]/mem[128] 操作数中选择通道 0 和位 0 到 63（如果为 0）或通道 1 和位 64 到 127（如果为 1），并将其放入 XMM[dest] 的通道 1 和位 64 到 127 中。图 11-27 显示了这个操作。

图 11-27：shufpd 操作

11.7.5 vshufps 和 vshufpd 指令

vshufps和vshufpd指令类似于shufps和shufpd。它们允许你在 128 位 XMM 寄存器或 256 位 YMM 寄存器中进行值的洗牌。^(8) vshufps和vshufpd指令有四个操作数：一个目标 XMM 或 YMM 寄存器，两个源操作数（src[1]必须是 XMM 或 YMM 寄存器，src[2]可以是 XMM 或 YMM 寄存器，或者是 128 位或 256 位的内存位置），以及一个 imm[8]操作数。它们的语法如下：

vshufps `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128], `imm`[8]
vshufpd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128], `imm`[8]

vshufps `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256], `imm`[8]
vshufpd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256], `imm`[8]

而 SSE 洗牌指令使用目标寄存器作为隐式源操作数，AVX 洗牌指令则允许你指定显式的目标和源操作数（它们可以完全不同，或完全相同，或是任何组合）。

对于 256 位的vshufps指令，imm[8] 操作数是一个包含四个 2 位值的数组（位 0:1, 2:3, 4:5, 和 6:7）。这些 2 位值从源位置选择四个单精度值中的一个，具体如表 11-5 所示。

表 11-5：vshufps 目标选择

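imm[8] bits	Destination	imm[8] value 00	imm[8] value 01	imm[8] value 10	imm[8] value 11
0 to 1	Dest[0 to 31]	Src[1][0 to 31]	Src[1][32 to 63]	Src[1][64 to 95]	Src[1][96 to 127]
	Dest[128 to 159]	Src[1][128 to 159]	Src[1][160 to 191]	Src[1][192 to 223]	Src[1][224 to 255]
2 to 3	Dest[32 to 63]	Src[1][0 to 31]	Src[1][32 to 63]	Src[1][64 to 95]	Src[1][96 to 127]
	Dest[160 to 191]	Src[1][128 to 159]	Src[1][160 to 191]	Src[1][192 to 223]	Src[1][224 to 255]
4 to 5	Dest[64 to 95]	Src[2][0 to 31]	Src[2][32 to 63]	Src[2][64 to 95]	Src[2][96 to 127]
	Dest[192 to 223]	Src[2][128 to 159]	Src[2][160 to 191]	Src[2][192 to 223]	Src[2][224 to 255]
6 to 7	Dest[96 to 127]	Src[2][0 to 31]	Src[2][32 to 63]	Src[2][64 to 95]	Src[2][96 to 127]
	Dest[224 to 255]	Src[2][128 to 159]	Src[2][160 to 191]	Src[2][192 to 223]	Src[2][224 to 255]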

如果两个源操作数相同，你可以随意重新排列单精度值的顺序（如果目标和两个源操作数相同，你可以在该寄存器内任意重新排列双字）。

vshufps指令还允许你指定 XMM 和 128 位内存操作数。在这种形式下，它的行为与shufps指令非常相似，不同之处在于你可以指定两个不同的 128 位源操作数（而不是只有一个 128 位源操作数），并且它会将对应的 YMM 寄存器的高 128 位清零。如果目标操作数与第一个源操作数不同，这种方式可能会很有用。如果vshufps的第一个源操作数与目标操作数相同，应该使用shufps指令，因为其机器编码更短。

vshufpd 指令是 shufpd 指令的扩展，支持 256 位（并增加了第二个源操作数）。由于 256 位 YMM 寄存器中包含四个双精度浮点数值，vshufpd 需要 4 位来选择源索引（而 shufpd 只需 2 位）。表 11-6 描述了 vshufpd 如何将数据从源操作数复制到目标操作数。

表 11-6：vshufpd 目标选择

imm[8] bit	Destination	imm[8] value 0	imm[8] value 1
0	Dest[0 to 63]	Src[1][0 to 63]	Src[1][64 to 127]
1	Dest[64 to 127]	Src[2][0 to 63]	Src[2][64 to 127]
2	Dest[128 to 191]	Src[1][128 to 191]	Src[1][192 to 255]
3	Dest[192 to 255]	Src[2][128 to 191]	Src[2][192 to 255]

与 vshufps 指令类似，vshufpd 也允许你指定 XMM 寄存器，如果你想要 shufpd 的三操作数版本。
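The 256-bit vshufpd selection can be summarized by a small C model. This is a sketch under the assumption that a YMM register is an array of four doubles with element 0 in bits 0 to 63; the name `vshufpd256_model` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of: vshufpd ymm_dest, ymm_src1, src2, imm8
   Each of the LO 4 bits of imm8 picks the low (0) or high (1)
   double within one 128-bit half of one source. */
static void vshufpd256_model(double dest[4], const double src1[4],
                             const double src2[4], uint8_t imm8)
{
    double r[4];
    r[0] = src1[0 + (imm8 & 1)];          /* bit 0: src1, LO half */
    r[1] = src2[0 + ((imm8 >> 1) & 1)];   /* bit 1: src2, LO half */
    r[2] = src1[2 + ((imm8 >> 2) & 1)];   /* bit 2: src1, HO half */
    r[3] = src2[2 + ((imm8 >> 3) & 1)];   /* bit 3: src2, HO half */
    for (int i = 0; i < 4; i++)
        dest[i] = r[i];                   /* staged so dest may alias a source */
}
```

Bits 4 to 7 of imm8 play no part in the selection in this model.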

11.7.6 (v)unpcklps、(v)unpckhps、(v)unpcklpd 和 (v)unpckhpd 指令

解包（和合并）指令是洗牌指令的简化变种。这些指令将单精度和双精度浮点数值从源操作数的固定位置复制，并将这些值插入到目标操作数的固定位置。它们本质上是没有 imm[8] 操作数并且具有固定洗牌模式的洗牌指令。

unpcklps 和 unpckhps 指令从两个源中选择它们各自的单精度操作数的一半，将这些值合并（交错排列），然后将合并的结果存储到目标操作数中（目标操作数与第一个源操作数相同）。这两个指令的语法如下：

unpcklps `xmm`[dest], `xmm`[src]/`mem`[128]
unpckhps `xmm`[dest], `xmm`[src]/`mem`[128]

XMM[dest] 操作数既作为第一个源操作数，也作为目标操作数。XMM[src]/mem[128] 操作数是第二个源操作数。

两者的区别在于它们选择源操作数的方式。unpcklps 指令将两个低位单精度浮点数值从源操作数复制到位位置 32 到 63（dword 1）和 96 到 127（dword 3）。它保留目标操作数中的 dword 0 不变，并将原本在 dword 1 中的值复制到目标操作数的 dword 2 中。图 11-28 展示了此操作。

图 11-28：unpcklps 指令操作

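The unpckhps instruction copies the two HO single-precision floating-point values from each of the two sources and interleaves them in the destination register: the destination's original dword 2 moves to dword 0, the source's dword 2 goes to dword 1, the destination's original dword 3 moves to dword 2, and the source's dword 3 goes to dword 3, as shown in Figure 11-29.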

图 11-29：unpckhps 指令操作

unpcklpd 和 unpckhpd 指令的功能与 unpcklps 和 unpckhps 相同，只不过它们处理的是双精度浮点数值，而不是单精度浮点数值。图 11-30 和 11-31 展示了这两个指令的操作。

图 11-30：unpcklpd指令操作

图 11-31：unpckhpd指令操作

vunpcklps、vunpckhps、vunpcklpd和vunpckhpd指令的语法如下：

vunpcklps `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vunpckhps `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]

vunpcklps `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
vunpckhps `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

它们的工作原理与非v变种相似，存在一些差异：

AVX 变种支持使用 YMM 寄存器和 XMM 寄存器。
AVX 变种需要三个操作数。第一个（目标）和第二个（源[1]）操作数必须是 XMM 或 YMM 寄存器。第三个（源[2]）操作数可以是 XMM 或 YMM 寄存器，或者是 128 位或 256 位内存位置。两操作数形式只是三操作数形式的一种特殊情况，其中第一个和第二个操作数指定相同的寄存器名称。
128 位变种会将 YMM 寄存器的 HO 位清零，而不是让这些位保持不变。

当然，带有 YMM 寄存器的 AVX 指令交错处理的单精度或双精度值的数量是原来的两倍。交错扩展按直观方式发生，vunpcklps（图 11-32）如下：

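The single-precision value in Src[1] (bits 0 to 31) is written to bits 0 to 31 of the destination.
The single-precision value in Src[2] (bits 0 to 31) is written to bits 32 to 63 of the destination.
The single-precision value in Src[1] (bits 32 to 63) is written to bits 64 to 95 of the destination.
The single-precision value in Src[2] (bits 32 to 63) is written to bits 96 to 127 of the destination.
The single-precision value in Src[1] (bits 128 to 159) is written to bits 128 to 159 of the destination.
The single-precision value in Src[2] (bits 128 to 159) is written to bits 160 to 191 of the destination.
The single-precision value in Src[1] (bits 160 to 191) is written to bits 192 to 223 of the destination.
The single-precision value in Src[2] (bits 160 to 191) is written to bits 224 to 255 of the destination.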

图 11-32：vunpcklps指令操作

vunpckhps指令（图 11-33）执行以下操作：

源[1]中的单精度值（位 64 到 95）首先被写入目标的位 0 到 31。
源[2]中的单精度值（位 64 到 95）被写入目标的位 32 到 63。
源[1]中的单精度值（位 96 到 127）被写入目标的位 64 到 95。
源[2]中的单精度值（位 96 到 127）被写入目标的位 96 到 127。

图 11-33：vunpckhps指令操作

同样，vunpcklpd和vunpckhpd用于移动双精度值。
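The interleaving performed by the 256-bit single-precision forms can be sketched in C. The model below treats a YMM register as eight floats (element 0 in bits 0 to 31) and uses a hypothetical `high` flag to choose between the vunpcklps and vunpckhps patterns. The text above lists only the LO 128 bits of the vunpckhps result; the sketch assumes the HO 128 bits follow the same pattern within the upper half, as they do for vunpcklps.

```c
#include <assert.h>

/* Model of: vunpcklps (high == 0) or vunpckhps (high != 0)
   with YMM operands. Each 128-bit half is processed separately. */
static void vunpckps256_model(float dest[8], const float src1[8],
                              const float src2[8], int high)
{
    float r[8];
    int first = high ? 2 : 0;      /* first lane taken from each half */

    for (int half = 0; half < 2; half++) {
        int base = 4 * half;
        r[base + 0] = src1[base + first];
        r[base + 1] = src2[base + first];
        r[base + 2] = src1[base + first + 1];
        r[base + 3] = src2[base + first + 1];
    }
    for (int i = 0; i < 8; i++)
        dest[i] = r[i];            /* staged so dest may alias a source */
}
```

The double-precision forms follow the same idea with two 64-bit lanes per half instead of four 32-bit lanes.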

11.7.7 整数解包指令

punpck*指令提供了一组整数解包指令，以补充浮点变种。这些指令出现在表 11-7 中。

表 11-7：整数解包指令

指令	描述
`punpcklbw`	解包低字节为字
`punpckhbw`	解包高字节为字
`punpcklwd`	解包低字为双字
`punpckhwd`	解包高字为双字
`punpckldq`	解包低双字为四字
`punpckhdq`	解包高双字为四字
`punpcklqdq`	解包低四字为双四字（双四字）
`punpckhqdq`	解包高四字为双四字（双四字）

11.7.7.1 punpck* 指令

punpck* 指令从两个不同的源中提取一半的字节、字、双字或四字，并将这些值合并到目标 SSE 寄存器中。以下是这些指令的语法：

punpcklbw  `xmm`[dest], `xmm`[src]
punpcklbw  `xmm`[dest], `mem`[src]
punpckhbw  `xmm`[dest], `xmm`[src]
punpckhbw  `xmm`[dest], `mem`[src]
punpcklwd  `xmm`[dest], `xmm`[src]
punpcklwd  `xmm`[dest], `mem`[src]
punpckhwd  `xmm`[dest], `xmm`[src]
punpckhwd  `xmm`[dest], `mem`[src]
punpckldq  `xmm`[dest], `xmm`[src]
punpckldq  `xmm`[dest], `mem`[src]
punpckhdq  `xmm`[dest], `xmm`[src]
punpckhdq  `xmm`[dest], `mem`[src]
punpcklqdq `xmm`[dest], `xmm`[src]
punpcklqdq `xmm`[dest], `mem`[src]
punpckhqdq `xmm`[dest], `xmm`[src]
punpckhqdq `xmm`[dest], `mem`[src]

图 11-34 至 11-41 展示了这些指令的每个数据传输。

图 11-34: punpcklbw 指令操作

图 11-35: punpckhbw 操作

图 11-36: punpcklwd 操作

图 11-37: punpckhwd 操作

图 11-38: punpckldq 操作

图 11-39: punpckhdq 操作

图 11-40: punpcklqdq 操作

图 11-41: punpckhqdq 操作
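Because the figures are not reproduced here, a short C model of one representative instruction, punpcklbw, may help. It is a sketch only: XMM registers are 16-byte arrays with element 0 as the LO byte, and `punpcklbw_model` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of: punpcklbw xmm_dest, src
   Interleaves the LO eight bytes of dest and src:
   dest0, src0, dest1, src1, ..., dest7, src7. */
static void punpcklbw_model(uint8_t dest[16], const uint8_t src[16])
{
    uint8_t r[16];
    for (int i = 0; i < 8; i++) {
        r[2 * i]     = dest[i];   /* even result bytes come from dest */
        r[2 * i + 1] = src[i];    /* odd result bytes come from src   */
    }
    memcpy(dest, r, 16);
}
```

One consequence of this pattern is that if the source operand is all zeros, each of the LO eight bytes of the destination ends up zero-extended to a 16-bit word (on a little-endian view of the register).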

11.7.7.2 vpunpck* SSE 指令

AVX vpunpck* 指令提供了一组 AVX 整数解包指令，以补充 SSE 变体。这些指令出现在表 11-8 中。

表 11-8: AVX 整数解包指令

指令	描述
`vpunpcklbw`	解包低字节为字
`vpunpckhbw`	解包高字节为字
`vpunpcklwd`	解包低字为双字
`vpunpckhwd`	解包高字为双字
`vpunpckldq`	解包低双字为四字
`vpunpckhdq`	解包高双字为四字
`vpunpcklqdq`	解包低四字为双四字（双四字）
`vpunpckhqdq`	解包高四字为双四字（双四字）

vpunpck* 指令从两个不同的源中提取一半的字节、字、双字或四字，并将这些值合并到目标 AVX 或 SSE 寄存器中。以下是这些指令的 SSE 形式语法：

vpunpcklbw  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpunpckhbw  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpunpcklwd  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpunpckhwd  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpunpckldq  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpunpckhdq  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpunpcklqdq `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpunpckhqdq `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]

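Functionally, the only difference between these AVX instructions (vpunpck*) and the SSE instructions (punpck*) is that the SSE variants leave the upper bits of the YMM AVX registers (bits 128 to 255) unchanged, whereas the AVX variants zero-extend the result to 256 bits. See Figures 11-34 through 11-41 for a description of the operation of these instructions.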

11.7.7.3 vpunpck* AVX 指令

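The AVX vpunpck* instructions also support the use of the AVX YMM registers, in which case the unpack and merge operation extends from 128 bits to 256 bits. The syntax for these instructions is as follows: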

vpunpcklbw  `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
vpunpckhbw  `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
vpunpcklwd  `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
vpunpckhwd  `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
vpunpckldq  `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
vpunpckhdq  `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
vpunpcklqdq `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
vpunpckhqdq `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

11.7.8 `(v)pextrb`、`(v)pextrw`、`(v)pextrd` 和 `(v)pextrq` 指令

(v)pextrb、(v)pextrw、(v)pextrd 和 (v)pextrq 指令从 128 位 XMM 寄存器中提取一个字节、字、双字或四字，并将这些数据复制到通用寄存器或内存位置。这些指令的语法如下：

pextrb  `reg`[32], `xmm`[src], `imm`[8]   ; imm[8] = 0 to 15
pextrb  `reg`[64], `xmm`[src], `imm`[8]   ; imm[8] = 0 to 15
pextrb  `mem`[8], `xmm`[src], `imm`[8]    ; imm[8] = 0 to 15
vpextrb `reg`[32], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 15
vpextrb `reg`[64], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 15
vpextrb `mem`[8], `xmm`[src], `imm`[8]   ; imm[8] = 0 to 15

pextrw  `reg`[32], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 7
pextrw  `reg`[64], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 7
pextrw  `mem`[16], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 7
vpextrw `reg`[32], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 7
vpextrw `reg`[64], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 7
vpextrw `mem`[16], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 7

pextrd  `reg`[32], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 3
pextrd  `mem`[32], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 3
vpextrd `reg`[32], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 3
vpextrd `mem`[32], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 3

pextrq  `reg`[64], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 1
pextrq  `mem`[64], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 1
vpextrq `reg`[64], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 1
vpextrq `mem`[64], `xmm`[src], `imm`[8]  ; imm[8] = 0 to 1

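The byte and word instructions expect a 32- or 64-bit general-purpose register as their destination (first) operand, or a memory location that is the same size as the instruction (that is, pextrb expects a byte-sized memory operand, pextrw expects a word-sized operand, and so on). The source (second) operand is a 128-bit XMM register. The index (third) operand is an 8-bit immediate value that specifies an index (lane number). These instructions fetch the byte, word, dword, or qword in the lane specified by the 8-bit immediate value and copy that value into the destination operand. The double-word and quad-word variants require a 32-bit or 64-bit general-purpose register, respectively. If the destination operand is a 32- or 64-bit general-purpose register, the instruction zero-extends the value to 32 or 64 bits, if necessary.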

11.7.9 `(v)pinsrb`、`(v)pinsrw`、`(v)pinsrd` 和 `(v)pinsrq` 指令

(v)pinsr{b,w,d,q} 指令从通用寄存器或内存位置提取一个字节、字、双字或四字，并将该数据存储到 XMM 寄存器的一个通道中。这些指令的语法如下：^(9)

pinsrb  `xmm`[dest], `reg`[32], `imm`[8]          ; imm[8] = 0 to 15
pinsrb  `xmm`[dest], `mem`[8], `imm`[8]           ; imm[8] = 0 to 15
vpinsrb `xmm`[dest], `xmm`[src2], `reg`[32], `imm`[8]   ; imm[8] = 0 to 15
vpinsrb `xmm`[dest], `xmm`[src2], `mem`[8], `imm`[8]    ; imm[8] = 0 to 15

pinsrw  `xmm`[dest], `reg`[32], `imm`[8]          ; imm[8] = 0 to 7
pinsrw  `xmm`[dest], `mem`[16], `imm`[8]          ; imm[8] = 0 to 7
vpinsrw `xmm`[dest], `xmm`[src2], `reg`[32], `imm`[8]  ; imm[8] = 0 to 7
vpinsrw `xmm`[dest], `xmm`[src2], `mem`[16], `imm`[8]  ; imm[8] = 0 to 7

pinsrd  `xmm`[dest], `reg`[32], `imm`[8]          ; imm[8] = 0 to 3
pinsrd  `xmm`[dest], `mem`[32], `imm`[8]          ; imm[8] = 0 to 3
vpinsrd `xmm`[dest], `xmm`[src2], `reg`[32], `imm`[8]  ; imm[8] = 0 to 3
vpinsrd `xmm`[dest], `xmm`[src2], `mem`[32], `imm`[8]  ; imm[8] = 0 to 3

pinsrq  `xmm`[dest], `reg`[64], `imm`[8]          ; imm[8] = 0 to 1
pinsrq  `xmm`[dest], `mem`[64], `imm`[8]          ; imm[8] = 0 to 1
vpinsrq `xmm`[dest], `xmm`[src2], `reg`[64], `imm`[8]  ; imm[8] = 0 to 1
vpinsrq `xmm`[dest], `xmm`[src2], `mem`[64], `imm`[8]  ; imm[8] = 0 to 1

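The destination (first) operand is a 128-bit XMM register. The pinsr* instructions expect a memory location or a 32-bit general-purpose register as their source (second) operand (except for pinsrq, which requires a 64-bit register). The index (third) operand is an 8-bit immediate that selects the lane.

These instructions fetch a byte, word, dword, or qword from the general-purpose register or memory location and copy it into the lane of the XMM register specified by the 8-bit immediate. The pinsr{b,w,d,q} instructions leave any HO bits in the underlying YMM register unchanged (if applicable).

The vpinsr{b,w,d,q} instructions copy the data from the XMM source register into the destination register and then copy the byte, word, dword, or qword into the specified lane of the destination register. These instructions zero-extend the value through the HO bits of the underlying YMM register.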

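11.7.10 The `(v)extractps` and `(v)insertps` Instructions

The extractps and vextractps instructions are functionally equivalent to pextrd and vpextrd. They extract a 32-bit (single-precision floating-point) value from an XMM register and move it into a 32-bit general-purpose register or a 32-bit memory location. The syntax for the (v)extractps instructions is shown here: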

extractps  `reg`[32], `xmm`[src], `imm`[8]
extractps  `mem`[32], `xmm`[src], `imm`[8]
vextractps `reg`[32], `xmm`[src], `imm`[8]
vextractps `mem`[32], `xmm`[src], `imm`[8]

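The insertps and vinsertps instructions insert a 32-bit floating-point value into an XMM register and optionally zero out other lanes in that register. Their syntax is as follows: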

insertps  `xmm`[dest], `xmm`[src], `imm`[8]
insertps  `xmm`[dest], `mem`[32], `imm`[8]
vinsertps `xmm`[dest], `xmm`[src1], `xmm`[src2], `imm`[8]
vinsertps `xmm`[dest], `xmm`[src1], `mem`[32], `imm`[8]

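For the insertps and vinsertps instructions, the imm[8] operand contains the fields listed in Table 11-9.

Table 11-9: imm[8] Bit Fields for the insertps and vinsertps Instructions

Bit(s)	Meaning
6 to 7	(Only if the source operand is an XMM register): selects the 32-bit lane (0, 1, 2, or 3) of the source XMM register. If the source operand is a 32-bit memory location, the instruction ignores this field and uses the full 32 bits from memory.
4 to 5	Specifies the lane in the destination XMM register where the single-precision value is stored.
3	If set, zeroes lane 3 of XMM[dest].
2	If set, zeroes lane 2 of XMM[dest].
1	If set, zeroes lane 1 of XMM[dest].
0	If set, zeroes lane 0 of XMM[dest].

On CPUs with the AVX extensions, insertps does not modify the HO bits of the YMM register; vinsertps zeroes the HO bits.

The vinsertps instruction first copies XMM[src1] to XMM[dest] and then performs the insert operation. The HO bits of the corresponding YMM register are set to 0.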
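
To make the imm[8] fields of Table 11-9 concrete, here is a small scalar C sketch of the register-source form of insertps. The function name `model_insertps` is an assumption of this sketch, and lanes are held as raw 32-bit patterns because the instruction moves bits without any floating-point conversion.

```c
#include <stdint.h>

/* Scalar model of "insertps xmm_dest, xmm_src, imm8" (register source).
   dest[0] and src[0] are lane 0. */
void model_insertps(uint32_t dest[4], const uint32_t src[4], uint8_t imm8)
{
    unsigned src_lane  = (imm8 >> 6) & 3;   /* bits 6-7: source lane      */
    unsigned dest_lane = (imm8 >> 4) & 3;   /* bits 4-5: destination lane */
    unsigned zero_mask = imm8 & 0x0F;       /* bits 0-3: lanes to clear   */

    uint32_t value = src[src_lane];         /* read first: dest may alias src */
    dest[dest_lane] = value;

    /* The zero mask is applied after the insertion. */
    for (unsigned lane = 0; lane < 4; lane++)
        if (zero_mask & (1u << lane))
            dest[lane] = 0;
}
```

For example, imm8 = 98h copies source lane 2 into destination lane 1 and then clears destination lane 3.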

The x86-64 architecture does not provide (v)extractpd or (v)insertpd instructions.

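11.8 SIMD Arithmetic and Logical Operations

The SSE and AVX instruction set extensions provide a variety of scalar and vector arithmetic and logical operations.

"SSE Floating-Point Arithmetic" in Chapter 6 has already covered floating-point arithmetic using the scalar SSE instruction set, so this section does not repeat that discussion. Instead, it covers the vector (or packed) arithmetic and logical instructions.

The vector instructions perform multiple operations in parallel on the different data lanes of an SSE or AVX register. Given two source operands, a typical SSE instruction simultaneously computes two double-precision floating-point results, two qword integer results, four single-precision floating-point results, four dword integer results, eight word integer results, or sixteen byte results. The AVX registers (YMM) double the number of lanes and therefore double the number of parallel computations.

Figure 11-42 shows how the SSE and AVX instructions perform parallel computations: a value is taken from the same lane in each of the two source locations, the computation is performed, and the instruction stores the result into the same lane of the destination. This happens simultaneously for every lane in the source and destination operands. For example, if a pair of XMM registers holds four single-precision floating-point values, a SIMD packed floating-point addition instruction adds the single-precision values in the corresponding lanes of the source operands and stores the sums into the corresponding lanes of the destination XMM register.

Figure 11-42: SIMD parallel arithmetic and logical operations

Certain operations (for example, logical AND, ANDN (and not), OR, and XOR) don't have to be broken into lanes, because they produce the same result regardless of the instruction size; the lane size is effectively a single bit. The corresponding SSE/AVX instructions therefore operate on their entire operands without regard to lane size.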

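11.9 SIMD Logical (Bitwise) Instructions

The SSE and AVX instruction set extensions provide the logical operations shown in Table 11-10 (using C/C++ bitwise operator syntax).

Table 11-10: SSE/AVX Logical Instructions

Operation	Description
`andpd`	dest = dest & source (128-bit operands)
`vandpd`	dest = source1 & source2 (128-bit or 256-bit operands)
`andnpd`	dest = ~dest & source (128-bit operands)
`vandnpd`	dest = ~source1 & source2 (128-bit or 256-bit operands)
`orpd`	dest = dest \| source (128-bit operands)
`vorpd`	dest = source1 \| source2 (128-bit or 256-bit operands)
`xorpd`	dest = dest ^ source (128-bit operands)
`vxorpd`	dest = source1 ^ source2 (128-bit or 256-bit operands)

The syntax for these instructions is the following: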

andpd   `xmm`[dest], `xmm`[src]/`mem`[128]
vandpd  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vandpd  `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

andnpd  `xmm`[dest], `xmm`[src]/`mem`[128]
vandnpd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vandnpd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

orpd    `xmm`[dest], `xmm`[src]/`mem`[128]
vorpd   `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vorpd   `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

xorpd   `xmm`[dest], `xmm`[src]/`mem`[128]
vxorpd  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vxorpd  `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

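The SSE instructions (without the v prefix) leave the HO bits of the underlying YMM register unchanged (if applicable). The AVX instructions (with the v prefix) that have 128-bit operands zero-extend their result into the HO bits of the YMM register.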

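If the (second) source operand of an SSE instruction is a memory location, it must be aligned on a 16-byte boundary; otherwise, the instruction raises a memory alignment fault at run time. The VEX-encoded AVX forms don't fault on a misaligned mem[128] or mem[256] operand, though keeping such operands aligned (on 16- or 32-byte boundaries, respectively) generally remains the faster choice.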

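11.9.1 The (v)ptest Instructions

The ptest instruction (packed test) is similar to the standard integer test instruction. It logically ANDs its two operands and sets the zero flag if the result is 0. It sets the carry flag if the logical AND of the second operand with the inverted bits of the first operand is 0. The ptest instruction supports the following syntax: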

ptest  `xmm`[src1], `xmm`[src2]/`mem`[128]
vptest `xmm`[src1], `xmm`[src2]/`mem`[128]
vptest `ymm`[src1], `ymm`[src2]/`mem`[256]
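
The two flag computations can be written out as a scalar C sketch of the 128-bit form. The `ptest_flags` structure and the name `model_ptest` are inventions of this sketch; each 128-bit operand is modeled as two 64-bit halves.

```c
#include <stdint.h>

typedef struct {
    int zf;   /* zero flag  */
    int cf;   /* carry flag */
} ptest_flags;

/* Scalar model of "ptest src1, src2" for 128-bit operands. */
ptest_flags model_ptest(const uint64_t src1[2], const uint64_t src2[2])
{
    ptest_flags f;
    /* ZF = 1 when (src1 AND src2) is all zeros. */
    f.zf = ((src1[0] & src2[0]) | (src1[1] & src2[1])) == 0;
    /* CF = 1 when ((NOT src1) AND src2) is all zeros. */
    f.cf = ((~src1[0] & src2[0]) | (~src1[1] & src2[1])) == 0;
    return f;
}
```

Read this way, ZF answers "do the operands share no set bits?" and CF answers "is every set bit of the second operand also set in the first?"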

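11.9.2 Byte Shift Instructions

The SSE and AVX instruction set extensions also support a set of logical and arithmetic shift instructions. The first two to consider are pslldq and psrldq. Although they begin with a p, suggesting that they are packed (vector) instructions, they are really just 128-bit logical shift-left and shift-right instructions. Their syntax is as follows: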

pslldq  `xmm`[dest], `imm`[8]
vpslldq `xmm`[dest], `xmm`[src], `imm`[8]
vpslldq `ymm`[dest], `ymm`[src], `imm`[8]
psrldq  `xmm`[dest], `imm`[8]
vpsrldq `xmm`[dest], `xmm`[src], `imm`[8]
vpsrldq `ymm`[dest], `ymm`[src], `imm`[8]

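The pslldq instruction shifts its destination XMM register to the left by the number of bytes specified by the imm[8] operand. It shifts 0s into the vacated LO bytes.

The vpslldq instruction takes the value in the source register (XMM or YMM), shifts that value to the left by imm[8] bytes, and stores the result into the destination register. For the 128-bit variant, the instruction zero-extends the result into bits 128 to 255 of the underlying YMM register (on AVX-capable CPUs). The 256-bit variant shifts each 128-bit half of the register independently; bytes do not cross from the lower half into the upper half.

The psrldq and vpsrldq instructions operate like (v)pslldq except, of course, that they shift their operands to the right rather than to the left. These are logical shift-right operations, so they shift 0s into the HO bytes of their operand, and bits shifted out of bit 0 are lost.

The pslldq and psrldq instructions shift bytes rather than bits. For example, many SSE instructions produce byte masks of 0 or 0FFh to represent Boolean results. These instructions shift the equivalent of a bit in one of those byte masks by shifting whole bytes at a time.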
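
A scalar C sketch of the 128-bit byte shifts may help make the "whole bytes, zero fill" behavior concrete. The names `model_pslldq` and `model_psrldq` are assumed for this sketch; `reg[0]` stands for the LO byte of the XMM register.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of "pslldq xmm, imm8": shift left by imm8 bytes. */
void model_pslldq(uint8_t reg[16], uint8_t imm8)
{
    unsigned count = imm8 > 16 ? 16 : imm8;   /* counts above 15 clear the register */
    uint8_t result[16] = {0};                 /* vacated LO bytes stay 0 */
    for (unsigned i = count; i < 16; i++)
        result[i] = reg[i - count];
    memcpy(reg, result, 16);
}

/* Scalar model of "psrldq xmm, imm8": shift right by imm8 bytes. */
void model_psrldq(uint8_t reg[16], uint8_t imm8)
{
    unsigned count = imm8 > 16 ? 16 : imm8;
    uint8_t result[16] = {0};                 /* vacated HO bytes stay 0 */
    for (unsigned i = 0; i + count < 16; i++)
        result[i] = reg[i + count];
    memcpy(reg, result, 16);
}
```

Shifting a register of 0/0FFh byte masks by one byte therefore moves every Boolean result to the neighboring lane without disturbing its value.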

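11.9.3 Bit Shift Instructions

The SSE/AVX instruction set extensions also provide vector bit-shift operations that work on two or more integer lanes in parallel. These instructions provide word, dword, and qword variants of the logical shift-left, logical shift-right, and arithmetic shift-right operations, using the following syntax: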

`shift`  `xmm`[dest], `imm`[8]
`shift`  `xmm`[dest], `xmm`[src]/`mem`[128]
`vshift` `xmm`[dest], `xmm`[src], `imm`[8]
`vshift` `xmm`[dest], `xmm`[src], `xmm`/`mem`[128]
`vshift` `ymm`[dest], `ymm`[src], `imm`[8]
`vshift` `ymm`[dest], `ymm`[src], `xmm`/`mem`[128]

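where shift = psllw, pslld, psllq, psrlw, psrld, psrlq, psraw, or psrad, and vshift = vpsllw, vpslld, vpsllq, vpsrlw, vpsrld, vpsrlq, vpsraw, vpsrad, or vpsraq.

The (v)psl* instructions shift their operands to the left; the (v)psr* instructions shift their operands to the right. The (v)psll* and (v)psrl* instructions are logical shifts that move 0s into the bit positions vacated by the shift. Any bits shifted out of the operand are lost. The (v)psra* instructions are arithmetic shift-right instructions: they replicate the HO bit of each lane as they shift that lane's bits to the right, and all bits shifted out of the LO end are lost.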

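The SSE two-operand instructions treat their first operand as both the source and the destination. The second operand specifies the number of bits to shift (either an 8-bit immediate constant or a value held in an XMM register or 128-bit memory location). Meaningful shift counts are 0 to 15, 0 to 31, or 0 to 63, depending on the lane size (that is, 4, 5, or 6 bits' worth). A larger count is not reduced modulo the lane size: the logical shifts then produce 0 in every lane, and the arithmetic right shifts fill every lane with copies of its sign bit.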

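The AVX three-operand instructions specify a separate source and destination register for the shift operation. These instructions take the value from the source register, shift it by the specified number of bits, and store the shifted result into the destination register. The source register remains unmodified (unless, of course, the instruction names the same register for both the source and destination operands). For the AVX instructions, the source and destination registers can be XMM (128-bit) or YMM (256-bit) registers. The third operand is an 8-bit immediate, an XMM register, or a 128-bit memory location, and it specifies the bit shift count (just as for the SSE instructions). The count is specified in an XMM register even when the source and destination registers are 256-bit YMM registers.

The w suffix instructions shift 16-bit operands (eight lanes for 128-bit destination operands, sixteen lanes for 256-bit destinations). The d suffix instructions shift 32-bit dword operands (four lanes for 128-bit destination operands, eight lanes for 256-bit destinations). The q suffix instructions shift 64-bit operands (two lanes for 128-bit operands, four lanes for 256-bit operands).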
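
As an illustration of the difference between the logical and arithmetic right shifts, including what happens when the count is larger than the lane, here is a scalar C sketch of the word-sized forms. The function names are hypothetical, and a single count is applied to all eight lanes, as the instructions do.

```c
#include <stdint.h>

/* Scalar model of psrlw: logical right shift of eight word lanes. */
void model_psrlw(uint16_t lanes[8], uint64_t count)
{
    for (int i = 0; i < 8; i++)
        lanes[i] = (count > 15) ? 0 : (uint16_t)(lanes[i] >> count);
}

/* Scalar model of psraw: arithmetic right shift of eight word lanes. */
void model_psraw(uint16_t lanes[8], uint64_t count)
{
    /* A count above 15 behaves like 15: only sign bits remain. */
    unsigned n = (count > 15) ? 15 : (unsigned)count;
    for (int i = 0; i < 8; i++) {
        uint16_t v = lanes[i];
        /* Build the n copies of the sign bit that enter from the HO end. */
        uint16_t fill = (v & 0x8000u) ? (uint16_t)~(0xFFFFu >> n) : 0;
        lanes[i] = (uint16_t)((v >> n) | fill);
    }
}
```

With a count of 4, the lane 8000h becomes 0800h under psrlw but 0F800h under psraw.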

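11.10 SIMD Integer Arithmetic Instructions

The SSE and AVX instruction set extensions deal mainly with floating-point calculations. They do, however, include a set of signed and unsigned integer arithmetic operations. This section describes the SSE/AVX integer arithmetic instructions.

11.10.1 SIMD Integer Addition

The SIMD integer addition instructions appear in Table 11-11. These instructions do not affect any flags and thus do not indicate when an overflow (signed or unsigned) occurs during their execution. The program itself must ensure that the source operands are all within the appropriate range before performing an addition. If a carry occurs during an addition, the carry is lost.

Table 11-11: SIMD Integer Addition Instructions

Instruction	Operands	Description
`paddb`	`xmm`[dest], `xmm`/`mem`[128]	16-lane byte add
`vpaddb`	`xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]	16-lane byte add
`vpaddb`	`ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]	32-lane byte add
`paddw`	`xmm`[dest], `xmm`/`mem`[128]	8-lane word add
`vpaddw`	`xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]	8-lane word add
`vpaddw`	`ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]	16-lane word add
`paddd`	`xmm`[dest], `xmm`/`mem`[128]	4-lane dword add
`vpaddd`	`xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]	4-lane dword add
`vpaddd`	`ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]	8-lane dword add
`paddq`	`xmm`[dest], `xmm`/`mem`[128]	2-lane qword add
`vpaddq`	`xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]	2-lane qword add
`vpaddq`	`ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]	4-lane qword add

These addition instructions are known as vertical additions because, if we stack the two source operands on top of each other (on the printed page), the lane additions occur vertically (a lane of one source sits directly above the corresponding lane of the second source).

The packed additions ignore any overflow from the addition operation, keeping only the LO byte, word, dword, or qword of each sum. As long as overflow can never occur, this is not a problem. However, for certain algorithms (especially audio and video, which commonly use packed addition), truncating away the overflow can produce bizarre results.

A cleaner solution is to use saturation arithmetic. For unsigned addition, saturation arithmetic clips (or saturates) an overflow to the largest value the instruction's size can handle. For example, if the addition of two byte values exceeds 0FFh, saturation arithmetic produces 0FFh, the largest possible unsigned 8-bit value (likewise, saturating subtraction produces 0 if underflow occurs). For signed saturation arithmetic, clipping occurs at the largest positive and smallest negative values (for example, 7Fh/+127 for positive values and 80h/-128 for negative values).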
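
The contrast between wraparound and saturating addition is easy to see in a one-lane scalar C sketch. The helper names below are assumptions of this sketch; each one models what a single byte lane of the named instruction computes.

```c
#include <stdint.h>

/* One lane of paddb: the sum wraps around modulo 256. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* One lane of paddusb: unsigned saturation clips at 0FFh. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 0xFF ? 0xFF : (uint8_t)sum;
}

/* One lane of paddsb: signed saturation clips at +127 and -128. */
int8_t add_sat_s8(int8_t a, int8_t b)
{
    int sum = (int)a + b;
    if (sum > 127)  return 127;
    if (sum < -128) return -128;
    return (int8_t)sum;
}
```

Adding 200 and 100 gives 44 with wraparound but 255 with unsigned saturation, which is why saturation is the gentler failure mode for sample data.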

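The x86 SIMD instructions provide both signed and unsigned saturation arithmetic, though only for 8- and 16-bit quantities.^(10) These instructions appear in Table 11-12.

Table 11-12: SIMD Integer Saturating Addition Instructions

Instruction	Operands	Description
`paddsb`	`xmm`[dest], `xmm`/`mem`[128]	16-lane byte signed saturating add
`vpaddsb`	`xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]	16-lane byte signed saturating add
`vpaddsb`	`ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]	32-lane byte signed saturating add
`paddsw`	`xmm`[dest], `xmm`/`mem`[128]	8-lane word signed saturating add
`vpaddsw`	`xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]	8-lane word signed saturating add
`vpaddsw`	`ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]	16-lane word signed saturating add
`paddusb`	`xmm`[dest], `xmm`/`mem`[128]	16-lane byte unsigned saturating add
`vpaddusb`	`xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]	16-lane byte unsigned saturating add
`vpaddusb`	`ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]	32-lane byte unsigned saturating add
`paddusw`	`xmm`[dest], `xmm`/`mem`[128]	8-lane word unsigned saturating add
`vpaddusw`	`xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]	8-lane word unsigned saturating add
`vpaddusw`	`ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]	16-lane word unsigned saturating add

As usual, the padd* and vpadd* instructions accept 128-bit XMM registers (sixteen 8-bit additions or eight 16-bit additions). The padd* instructions leave the HO 128 bits of any corresponding YMM destination register unchanged; the vpadd* variants clear those HO bits. Also note that the padd* instructions have only two operands (the destination register is also a source), whereas the vpadd* instructions have two source operands and a single destination operand. The vpadd* instructions with YMM registers provide double the number of parallel additions.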

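11.10.2 Horizontal Additions

The SSE/AVX instruction sets also support three horizontal addition instructions, listed in Table 11-13.

Table 11-13: Horizontal Addition Instructions

Instruction	Description
`(v)phaddw`	16-bit (word) horizontal add
`(v)phaddd`	32-bit (dword) horizontal add
`(v)phaddsw`	16-bit (word) horizontal add and saturate

The horizontal addition instructions add adjacent words or dwords in their two source operands and store the sums into the destination, as shown in Figure 11-43.

Figure 11-43: Horizontal addition operation

The phaddw instruction has the following syntax: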

phaddw `xmm`[dest], `xmm`[src]/`mem`[128]

It computes the following:

temp[0 to 15]    = `xmm`[dest][0 to 15]        + `xmm`[dest][16 to 31]
temp[16 to 31]   = `xmm`[dest][32 to 47]       + `xmm`[dest][48 to 63]
temp[32 to 47]   = `xmm`[dest][64 to 79]       + `xmm`[dest][80 to 95]
temp[48 to 63]   = `xmm`[dest][96 to 111]      + `xmm`[dest][112 to 127]
temp[64 to 79]   = `xmm`[src]/`mem`[128][0 to 15]   + `xmm`[src]/`mem`[128][16 to 31]
temp[80 to 95]   = `xmm`[src]/`mem`[128][32 to 47]  + `xmm`[src]/`mem`[128][48 to 63]
temp[96 to 111]  = `xmm`[src]/`mem`[128][64 to 79]  + `xmm`[src]/`mem`[128][80 to 95]
temp[112 to 127] = `xmm`[src]/`mem`[128][96 to 111] + `xmm`[src]/`mem`[128][112 to 127]
`xmm`[dest] = temp

As is true of most SSE instructions, phaddw does not affect the HO bits of the corresponding YMM destination register, only the LO 128 bits.
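
The pairing of adjacent lanes can also be expressed as a short scalar C sketch. The name `model_phaddw` is made up for illustration; index 0 of each array is word lane 0, and the temporary array plays the role of temp in the listing above.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of "phaddw xmm_dest, xmm_src" on eight word lanes. */
void model_phaddw(uint16_t dest[8], const uint16_t src[8])
{
    uint16_t temp[8];
    for (int i = 0; i < 4; i++) {
        /* Low four results come from adjacent pairs in dest ... */
        temp[i]     = (uint16_t)(dest[2 * i] + dest[2 * i + 1]);
        /* ... and the high four from adjacent pairs in src. */
        temp[i + 4] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
    }
    memcpy(dest, temp, sizeof temp);   /* overflow simply wraps */
}
```

Applying the operation three times to a register paired with itself leaves the sum of all eight original lanes in every lane, which is a common way to reduce a vector to a single total.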

The 128-bit vphaddw instruction has the following syntax:

vphaddw `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]

It computes the following:

`xmm`[dest][0 to 15]    = `xmm`[src1][0 to 15]         + `xmm`[src1][16 to 31]
`xmm`[dest][16 to 31]   = `xmm`[src1][32 to 47]        + `xmm`[src1][48 to 63]
`xmm`[dest][32 to 47]   = `xmm`[src1][64 to 79]        + `xmm`[src1][80 to 95]
`xmm`[dest][48 to 63]   = `xmm`[src1][96 to 111]       + `xmm`[src1][112 to 127]
`xmm`[dest][64 to 79]   = `xmm`[src2]/`mem`[128][0 to 15]   + `xmm`[src2]/`mem`[128][16 to 31]
`xmm`[dest][80 to 95]   = `xmm`[src2]/`mem`[128][32 to 47]  + `xmm`[src2]/`mem`[128][48 to 63]
`xmm`[dest][96 to 111]  = `xmm`[src2]/`mem`[128][64 to 79]  + `xmm`[src2]/`mem`[128][80 to 95]
`xmm`[dest][112 to 127] = `xmm`[src2]/`mem`[128][96 to 111] + `xmm`[src2]/`mem`[128][112 to 127]

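The vphaddw instruction zeroes the HO 128 bits of the corresponding YMM destination register.

The 256-bit vphaddw instruction has the following syntax:

vphaddw `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

The 256-bit vphaddw does not simply extend the 128-bit version in the intuitive fashion. Instead, it interleaves the computations as follows (where SRC1 is YMM[src1] and SRC2 is YMM[src2]/mem[256]):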

`ymm`[dest][0 to 15]    = SRC1[16 to 31]   + SRC1[0 to 15]
`ymm`[dest][16 to 31]   = SRC1[48 to 63]   + SRC1[32 to 47]
`ymm`[dest][32 to 47]   = SRC1[80 to 95]   + SRC1[64 to 79]
`ymm`[dest][48 to 63]   = SRC1[112 to 127] + SRC1[96 to 111]
`ymm`[dest][64 to 79]   = SRC2[16 to 31]   + SRC2[0 to 15]
`ymm`[dest][80 to 95]   = SRC2[48 to 63]   + SRC2[32 to 47]
`ymm`[dest][96 to 111]  = SRC2[80 to 95]   + SRC2[64 to 79]
`ymm`[dest][112 to 127] = SRC2[112 to 127] + SRC2[96 to 111]
`ymm`[dest][128 to 143] = SRC1[144 to 159] + SRC1[128 to 143]
`ymm`[dest][144 to 159] = SRC1[176 to 191] + SRC1[160 to 175]
`ymm`[dest][160 to 175] = SRC1[208 to 223] + SRC1[192 to 207]
`ymm`[dest][176 to 191] = SRC1[240 to 255] + SRC1[224 to 239]
`ymm`[dest][192 to 207] = SRC2[144 to 159] + SRC2[128 to 143]
`ymm`[dest][208 to 223] = SRC2[176 to 191] + SRC2[160 to 175]
`ymm`[dest][224 to 239] = SRC2[208 to 223] + SRC2[192 to 207]
`ymm`[dest][240 to 255] = SRC2[240 to 255] + SRC2[224 to 239]
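
One way to read the listing above is that the 256-bit instruction runs the 128-bit horizontal add separately on the lower and upper halves of its operands. The scalar C sketch below expresses that view; the name `model_vphaddw256` is an assumption of the sketch, and index 0 is word lane 0.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of "vphaddw ymm_dest, ymm_src1, ymm_src2". */
void model_vphaddw256(uint16_t dest[16],
                      const uint16_t src1[16],
                      const uint16_t src2[16])
{
    uint16_t temp[16];
    for (int half = 0; half < 2; half++) {
        int base = 8 * half;              /* first word lane of this 128-bit half */
        for (int i = 0; i < 4; i++) {
            temp[base + i]     = (uint16_t)(src1[base + 2 * i] + src1[base + 2 * i + 1]);
            temp[base + 4 + i] = (uint16_t)(src2[base + 2 * i] + src2[base + 2 * i + 1]);
        }
    }
    memcpy(dest, temp, sizeof temp);      /* dest may alias a source */
}
```

The practical consequence is that the sums of SRC1 end up in word lanes 0 to 3 and 8 to 11 rather than in lanes 0 to 7.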

11.10.3 Double-Word-Sized Horizontal Additions

The phaddd instruction has the following syntax:

phaddd `xmm`[dest], `xmm`[src]/`mem`[128]

It computes the following:

temp[0 to 31]   = `xmm`[dest][0 to 31]       + `xmm`[dest][32 to 63]
temp[32 to 63]  = `xmm`[dest][64 to 95]      + `xmm`[dest][96 to 127]
temp[64 to 95]  = `xmm`[src]/`mem`[128][0 to 31]  + `xmm`[src]/`mem`[128][32 to 63]
temp[96 to 127] = `xmm`[src]/`mem`[128][64 to 95] + `xmm`[src]/`mem`[128][96 to 127]
`xmm`[dest] = temp

The 128-bit vphaddd instruction has the following syntax:

vphaddd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]

It computes the following:

`xmm`[dest][0 to 31]     = `xmm`[src1][0 to 31]        + `xmm`[src1][32 to 63]
`xmm`[dest][32 to 63]    = `xmm`[src1][64 to 95]       + `xmm`[src1][96 to 127]
`xmm`[dest][64 to 95]    = `xmm`[src2]/`mem`[128][0 to 31]  + `xmm`[src2]/`mem`[128][32 to 63]
`xmm`[dest][96 to 127]   = `xmm`[src2]/`mem`[128][64 to 95] + `xmm`[src2]/`mem`[128][96 to 127]
(`ymm`[dest][128 to 255] = 0)

Like vphaddw, the 256-bit vphaddd instruction has the following syntax:

vphaddd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

It computes the following:

`ymm`[dest][0 to 31]    = `ymm`[src1][32 to 63]         + `ymm`[src1][0 to 31]
`ymm`[dest][32 to 63]   = `ymm`[src1][96 to 127]        + `ymm`[src1][64 to 95]
`ymm`[dest][64 to 95]   = `ymm`[src2]/`mem`[256][32 to 63]   + `ymm`[src2]/`mem`[256][0 to 31]
`ymm`[dest][96 to 127]  = `ymm`[src2]/`mem`[256][96 to 127]  + `ymm`[src2]/`mem`[256][64 to 95]
`ymm`[dest][128 to 159] = `ymm`[src1][160 to 191]       + `ymm`[src1][128 to 159]
`ymm`[dest][160 to 191] = `ymm`[src1][224 to 255]       + `ymm`[src1][192 to 223]
`ymm`[dest][192 to 223] = `ymm`[src2]/`mem`[256][160 to 191] + `ymm`[src2]/`mem`[256][128 to 159]
`ymm`[dest][224 to 255] = `ymm`[src2]/`mem`[256][224 to 255] + `ymm`[src2]/`mem`[256][192 to 223]

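If an overflow occurs during a horizontal addition, (v)phaddw and (v)phaddd ignore the overflow and store the LO 16 or 32 bits of the result into the destination.

The (v)phaddsw instructions take the following forms: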

phaddsw  `xmm`[dest], `xmm`[src]/`mem`[128]
vphaddsw `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vphaddsw `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

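The (v)phaddsw instruction (horizontal signed integer add with saturation, word) is a slightly different form of (v)phaddw: rather than storing only the LO bits into the destination, this instruction saturates the result. Saturation means that any (positive) overflow produces the value 7FFFh, regardless of the actual result. Likewise, any negative underflow produces the value 8000h.

Saturation arithmetic works well for audio and video processing. If you add two sound samples using standard (wraparound/modulo) addition, the result produces horrible clicking sounds. Saturation arithmetic, by contrast, produces a clipped audio signal. That is not ideal, but it sounds considerably better than the result of modulo arithmetic. Similarly, in video processing, saturation produces washed-out (white) colors rather than the bizarre colors that result from modulo arithmetic.

Sadly, there is no horizontal add with saturation for dword operands (for example, to handle 24-bit audio).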

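11.10.4 SIMD Integer Subtraction

The SIMD integer subtraction instructions appear in Table 11-14. As with the SIMD addition instructions, they do not affect any flags; any carry, borrow, overflow, or underflow information is lost. These instructions subtract the second source operand from the first source operand (which is also the destination operand for the SSE-only instructions) and store the result into the destination operand.

Table 11-14: SIMD Integer Subtraction Instructions

Instruction	Operands	Description
`psubb`	`xmm`[dest], `xmm`/`mem`[128]	16-lane byte subtract
`vpsubb`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	16-lane byte subtract
`vpsubb`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	32-lane byte subtract
`psubw`	`xmm`[dest], `xmm`/`mem`[128]	8-lane word subtract
`vpsubw`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	8-lane word subtract
`vpsubw`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	16-lane word subtract
`psubd`	`xmm`[dest], `xmm`/`mem`[128]	4-lane dword subtract
`vpsubd`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	4-lane dword subtract
`vpsubd`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	8-lane dword subtract
`psubq`	`xmm`[dest], `xmm`/`mem`[128]	2-lane qword subtract
`vpsubq`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	2-lane qword subtract
`vpsubq`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	4-lane qword subtract

The (v)phsubw, (v)phsubd, and (v)phsubsw horizontal subtraction instructions work just like the horizontal addition instructions, except (of course) that they compute the difference of the two source operands rather than their sum. See the previous sections for details on the horizontal addition instructions.

Likewise, there is a set of signed and unsigned byte and word saturating subtraction instructions (see Table 11-15). For the signed instructions, the byte-sized instructions saturate positive overflow to 7Fh (+127) and negative underflow to 80h (-128). The word-sized instructions saturate to 7FFFh (+32,767) and 8000h (-32,768). The unsigned saturation instructions saturate to 0FFh (+255) or 0FFFFh (+65,535) and 0; because an unsigned subtraction can only underflow, in practice they clip at 0.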

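Table 11-15: SIMD Integer Saturating Subtraction Instructions

Instruction	Operands	Description
`psubsb`	`xmm`[dest], `xmm`/`mem`[128]	16-lane byte signed saturating subtract
`vpsubsb`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	16-lane byte signed saturating subtract
`vpsubsb`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	32-lane byte signed saturating subtract
`psubsw`	`xmm`[dest], `xmm`/`mem`[128]	8-lane word signed saturating subtract
`vpsubsw`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	8-lane word signed saturating subtract
`vpsubsw`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	16-lane word signed saturating subtract
`psubusb`	`xmm`[dest], `xmm`/`mem`[128]	16-lane byte unsigned saturating subtract
`vpsubusb`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	16-lane byte unsigned saturating subtract
`vpsubusb`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	32-lane byte unsigned saturating subtract
`psubusw`	`xmm`[dest], `xmm`/`mem`[128]	8-lane word unsigned saturating subtract
`vpsubusw`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	8-lane word unsigned saturating subtract
`vpsubusw`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	16-lane word unsigned saturating subtract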

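11.10.5 SIMD Integer Multiplication

The SSE/AVX instruction set extensions support multiplication to a limited degree. Lane-by-lane multiplication requires that the result of an operation on two n-bit values fit in n bits, but an n × n multiplication can produce a 2×n-bit result. So a lane-by-lane multiplication loses any overflow. The basic packed integer multiplication multiplies a pair of lanes and stores the LO bits of the result into the destination lane. For extended arithmetic, other packed integer multiplication instructions produce the HO bits of the result.

The instructions in Table 11-16 handle 16-bit multiplication. The (v)pmullw instruction multiplies the 16-bit values in the lanes of the source operands and stores the LO word of each product into the corresponding destination lane. This instruction works for both signed and unsigned values. The (v)pmulhw instruction computes the product of two signed word values and stores the HO word of the result into the destination lane. For unsigned operands, (v)pmulhuw does the same job. By executing both (v)pmullw and (v)pmulh(u)w with the same operands, you can compute the full 32-bit result of a 16×16-bit multiplication. (You can use the punpck* instructions to merge the results into 32-bit integers.)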
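
The way the LO and HO halves fit together can be checked with a one-lane scalar C sketch. The function names are hypothetical; each models a single word lane of the named instruction, and `combine_halves` stands in for the merge that punpck* would perform on real registers.

```c
#include <stdint.h>

/* One lane of pmullw: LO word of the product (same for signed and unsigned). */
uint16_t model_pmullw_lane(int16_t a, int16_t b)
{
    return (uint16_t)((uint32_t)((int32_t)a * b) & 0xFFFFu);
}

/* One lane of pmulhw: HO word of the signed product. */
uint16_t model_pmulhw_lane(int16_t a, int16_t b)
{
    return (uint16_t)((uint32_t)((int32_t)a * b) >> 16);
}

/* One lane of pmulhuw: HO word of the unsigned product. */
uint16_t model_pmulhuw_lane(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Rebuild the full 32-bit product from its two halves. */
uint32_t combine_halves(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}
```

For -300 × 200 the halves are 0FFFFh and 15A0h, which recombine to 0FFFF15A0h, the 32-bit two's complement form of -60,000.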

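Table 11-16: SIMD 16-Bit Packed Integer Multiplication Instructions

Instruction	Operands	Description
`pmullw`	`xmm`[dest], `xmm`/`mem`[128]	8-lane word multiplication, producing the LO word of the product
`vpmullw`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	8-lane word multiplication, producing the LO word of the product
`vpmullw`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	16-lane word multiplication, producing the LO word of the product
`pmulhuw`	`xmm`[dest], `xmm`/`mem`[128]	8-lane word unsigned multiplication, producing the HO word of the product
`vpmulhuw`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	8-lane word unsigned multiplication, producing the HO word of the product
`vpmulhuw`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	16-lane word unsigned multiplication, producing the HO word of the product
`pmulhw`	`xmm`[dest], `xmm`/`mem`[128]	8-lane word signed multiplication, producing the HO word of the product
`vpmulhw`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	8-lane word signed multiplication, producing the HO word of the product
`vpmulhw`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	16-lane word signed multiplication, producing the HO word of the product

Table 11-17 lists the 32- and 64-bit versions of the packed multiplication instructions. There are no (v)pmulhd or (v)pmulhq instructions; see (v)pmuludq and (v)pmuldq for 32- and 64-bit packed multiplication.

Table 11-17: SIMD 32- and 64-Bit Packed Integer Multiplication Instructions

Instruction	Operands	Description
`pmulld`	`xmm`[dest], `xmm`/`mem`[128]	4-lane dword multiplication, producing the LO dword of the product
`vpmulld`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	4-lane dword multiplication, producing the LO dword of the product
`vpmulld`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	8-lane dword multiplication, producing the LO dword of the product
`vpmullq`	`xmm`[dest], `xmm`[src], `xmm`/`mem`[128]	2-lane qword multiplication, producing the LO qword of the product (available only on AVX-512 CPUs)
`vpmullq`	`ymm`[dest], `ymm`[src], `ymm`/`mem`[256]	4-lane qword multiplication, producing the LO qword of the product (available only on AVX-512 CPUs)

At some point along the way, Intel introduced (v)pmuldq and (v)pmuludq to perform signed and unsigned 32×32-bit multiplications, producing a 64-bit result. The syntax for these instructions is as follows: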

pmuldq   `xmm`[dest], `xmm`/`mem`[128]
vpmuldq  `xmm`[dest], `xmm`[src1], `xmm`/`mem`[128]
vpmuldq  `ymm`[dest], `ymm`[src1], `ymm`/`mem`[256]

pmuludq  `xmm`[dest], `xmm`/`mem`[128]
vpmuludq `xmm`[dest], `xmm`[src1], `xmm`/`mem`[128]
vpmuludq `ymm`[dest], `ymm`[src1], `ymm`/`mem`[256]

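The 128-bit variants multiply the dwords in lanes 0 and 2 and store the 64-bit results into qword lanes 0 and 1 (dword lanes 0 and 1, and 2 and 3). On CPUs with AVX registers,^(11) pmuldq and pmuludq do not affect the HO 128 bits of the YMM register. The vpmuldq and vpmuludq instructions zero-extend the result to 256 bits. The 256-bit variants multiply the dwords in lanes 0, 2, 4, and 6, producing 64-bit results that they store into qword lanes 0, 1, 2, and 3 (dword lanes 0 and 1, 2 and 3, 4 and 5, and 6 and 7).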

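The pclmulqdq instruction multiplies two qword values, producing a 128-bit result. Be aware that this is a carry-less multiplication (a polynomial multiplication over GF(2), in which the partial products are combined with XOR rather than added), so the result generally differs from an ordinary integer product. The syntax for this instruction is as follows: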

pclmulqdq  `xmm`[dest], `xmm`/`mem`[128], `imm`[8]
vpclmulqdq `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128], `imm`[8]

These instructions multiply a qword value from XMM[dest] by a qword value from XMM[src] and leave the 128-bit result in XMM[dest]. The imm[8] operand specifies which qwords to use as the source operands. The possible combinations for pclmulqdq appear in Table 11-18, and those for vpclmulqdq appear in Table 11-19.

Table 11-18: imm[8] Operand Values for the pclmulqdq Instruction

imm[8]	Result
00h	XMM[dest] = XMM[dest][0 to 63] * XMM/mem[128][0 to 63]
01h	XMM[dest] = XMM[dest][64 to 127] * XMM/mem[128][0 to 63]
10h	XMM[dest] = XMM[dest][0 to 63] * XMM/mem[128][64 to 127]
11h	XMM[dest] = XMM[dest][64 to 127] * XMM/mem[128][64 to 127]

Table 11-19: imm[8] Operand Values for the vpclmulqdq Instruction

imm[8]	Result
00h	XMM[dest] = XMM[src1][0 to 63] * XMM[src2]/mem[128][0 to 63]
01h	XMM[dest] = XMM[src1][64 to 127] * XMM[src2]/mem[128][0 to 63]
10h	XMM[dest] = XMM[src1][0 to 63] * XMM[src2]/mem[128][64 to 127]
11h	XMM[dest] = XMM[src1][64 to 127] * XMM[src2]/mem[128][64 to 127]

As usual, pclmulqdq leaves the HO 128 bits of the corresponding YMM destination register unchanged, while vpclmulqdq zeroes those bits.
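
The following scalar C sketch shows how the imm[8] operand selects the two qwords and what a carry-less product looks like. It is an illustrative model only: `clmul64` and `model_pclmulqdq` are names invented here, and the XMM registers are represented as two-element arrays with index 0 holding bits 0 to 63.

```c
#include <stdint.h>

/* Carry-less 64 x 64 -> 128-bit multiply: partial products are XORed. */
void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t l = 0, h = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            l ^= a << i;
            if (i != 0)                 /* avoid an undefined shift by 64 */
                h ^= a >> (64 - i);
        }
    }
    *lo = l;
    *hi = h;
}

/* Scalar model of "pclmulqdq xmm_dest, xmm_src, imm8". */
void model_pclmulqdq(uint64_t dest[2], const uint64_t src[2], uint8_t imm8)
{
    uint64_t a = dest[imm8 & 1];          /* bit 0 picks the dest qword */
    uint64_t b = src[(imm8 >> 4) & 1];    /* bit 4 picks the src qword  */
    clmul64(a, b, &dest[0], &dest[1]);
}
```

As a quick check of the difference from integer multiplication, the carry-less product of 3 and 3 is 5, not 9.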

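11.10.6 SIMD Integer Averages

The (v)pavgb and (v)pavgw instructions compute the average of two sets of bytes or words. These instructions add the byte or word values in corresponding lanes of the source and destination operands, divide the sum by 2, round the result, and store the averaged results in the lanes of the destination operand. Their syntax is as follows: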

pavgb  `xmm`[dest], `xmm`/`mem`[128]
vpavgb `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpavgb `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
pavgw  `xmm`[dest], `xmm`/`mem`[128]
vpavgw `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpavgw `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

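The 128-bit pavgb and vpavgb instructions compute 16 byte-sized averages (for the 16 lanes in the source and destination operands). The 256-bit variant of the vpavgb instruction computes 32 byte-sized averages.

The 128-bit pavgw and vpavgw instructions compute eight word-sized averages (for the eight lanes in the source and destination operands). The 256-bit variant of the vpavgw instruction computes 16 word-sized averages.

The vpavgb and vpavgw instructions compute the average of the first XMM or YMM source operand and the second XMM, YMM, or memory source operand, storing the average in the destination XMM or YMM register.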
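
The rounding behavior is easiest to see in a one-lane scalar C sketch. The helper names are assumptions of this sketch; the key point it illustrates is that the operands are treated as unsigned and that 1 is added before the shift so that halves round up.

```c
#include <stdint.h>

/* One lane of (v)pavgb: unsigned average, rounded up. */
uint8_t model_pavgb_lane(uint8_t a, uint8_t b)
{
    /* The intermediate sum is kept in a wider type so no carry is lost. */
    return (uint8_t)(((unsigned)a + b + 1) >> 1);
}

/* One lane of (v)pavgw. */
uint16_t model_pavgw_lane(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a + b + 1) >> 1);
}
```

Because the sum is formed at a wider width, averaging 255 and 255 yields 255 rather than overflowing.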

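Unfortunately, there are no (v)pavgd or (v)pavgq instructions. No doubt these instructions were originally intended for mixing 8- and 16-bit audio or video streams (or photo manipulation), and the x86-64 CPU designers never saw the need to extend them beyond 16 bits (even though 24-bit audio is common among professional audio engineers).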

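11.10.7 SIMD Integer Minimum and Maximum

The SSE4.1 instruction set extensions added eight packed integer minimum and maximum instructions; the full set appears in Table 11-20. These instructions scan the lanes of a pair of 128- or 256-bit operands and copy the maximum or minimum value from each lane to the same lane in the destination operand.

Table 11-20: SIMD Minimum and Maximum Instructions

Instruction	Description
`(v)pmaxsb`	Destination byte lanes are set to the maximum of the two signed byte values found in the corresponding source lanes.
`(v)pmaxsw`	Destination word lanes are set to the maximum of the two signed word values found in the corresponding source lanes.
`(v)pmaxsd`	Destination dword lanes are set to the maximum of the two signed dword values found in the corresponding source lanes.
`vpmaxsq`	Destination qword lanes are set to the maximum of the two signed qword values found in the corresponding source lanes. (This instruction requires AVX-512 support.)
`(v)pmaxub`	Destination byte lanes are set to the maximum of the two unsigned byte values found in the corresponding source lanes.
`(v)pmaxuw`	Destination word lanes are set to the maximum of the two unsigned word values found in the corresponding source lanes.
`(v)pmaxud`	Destination dword lanes are set to the maximum of the two unsigned dword values found in the corresponding source lanes.
`vpmaxuq`	Destination qword lanes are set to the maximum of the two unsigned qword values found in the corresponding source lanes. (This instruction requires AVX-512 support.)
`(v)pminsb`	Destination byte lanes are set to the minimum of the two signed byte values found in the corresponding source lanes.
`(v)pminsw`	Destination word lanes are set to the minimum of the two signed word values found in the corresponding source lanes.
`(v)pminsd`	Destination dword lanes are set to the minimum of the two signed dword values found in the corresponding source lanes.
`vpminsq`	Destination qword lanes are set to the minimum of the two signed qword values found in the corresponding source lanes. (This instruction requires AVX-512 support.)
`(v)pminub`	Destination byte lanes are set to the minimum of the two unsigned byte values found in the corresponding source lanes.
`(v)pminuw`	Destination word lanes are set to the minimum of the two unsigned word values found in the corresponding source lanes.
`(v)pminud`	Destination dword lanes are set to the minimum of the two unsigned dword values found in the corresponding source lanes.
`vpminuq`	Destination qword lanes are set to the minimum of the two unsigned qword values found in the corresponding source lanes. (This instruction requires AVX-512 support.)

The generic syntax for these instructions is as follows:^(12)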

pm`xxyz`  `xmm`[dest], `xmm`[src]/`mem`[128]
vpm`xxyz` `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpm`xxyz` `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

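The SSE instructions compute the minimum or maximum of the corresponding lanes in the source and destination operands and store the result into the corresponding lanes of the destination register. The AVX instructions compute the minimum or maximum of the values in the same lanes of the two source operands and store the result into the corresponding lanes of the destination register.

11.10.8 SIMD Integer Absolute Value

The SSE/AVX instruction set extensions provide three sets of instructions for computing the absolute values of signed byte, word, and dword integers: (v)pabsb, (v)pabsw, and (v)pabsd.^(13) Their syntax is the following: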

pabsb  `xmm`[dest], `xmm`[src]/`mem`[128]
vpabsb `xmm`[dest], `xmm`[src]/`mem`[128]
vpabsb `ymm`[dest], `ymm`[src]/`mem`[256]

pabsw  `xmm`[dest], `xmm`[src]/`mem`[128]
vpabsw `xmm`[dest], `xmm`[src]/`mem`[128]
vpabsw `ymm`[dest], `ymm`[src]/`mem`[256]

pabsd  `xmm`[dest], `xmm`[src]/`mem`[128]
vpabsd `xmm`[dest], `xmm`[src]/`mem`[128]
vpabsd `ymm`[dest], `ymm`[src]/`mem`[256]

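On systems that support AVX registers, the SSE pabsb, pabsw, and pabsd instructions do not modify the HO bits of the YMM registers. The 128-bit versions of the AVX instructions (vpabsb, vpabsw, and vpabsd) zero-extend the result through the HO bits.

11.10.9 SIMD Integer Sign-Adjustment Instructions

The (v)psignb, (v)psignw, and (v)psignd instructions apply the sign found in a source lane to the corresponding destination lane. The algorithm works as follows: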

if source lane value is less than zero then
    negate the corresponding destination lane
else if source lane value is equal to zero
    set the corresponding destination lane to zero
else 
    leave the corresponding destination lane unchanged

The syntax for these instructions is as follows:

psignb  `xmm`[dest], `xmm`[src]/`mem`[128]
vpsignb `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpsignb `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

psignw  `xmm`[dest], `xmm`[src]/`mem`[128]
vpsignw `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpsignw `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

psignd  `xmm`[dest], `xmm`[src]/`mem`[128]
vpsignd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpsignd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

As usual, the 128-bit SSE instructions do not modify the HO bits of the YMM register (if applicable), and the 128-bit AVX instructions zero-extend the result into the HO bits of the YMM register.
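
The three-way sign-adjustment algorithm given above translates directly into a one-lane scalar C sketch. The function name `model_psignb_lane` is hypothetical; it models a single byte lane of psignb.

```c
#include <stdint.h>

/* One lane of "psignb dest, src". */
int8_t model_psignb_lane(int8_t dest, int8_t src)
{
    if (src < 0)
        return (int8_t)(-(int)dest);   /* negate the destination lane */
    if (src == 0)
        return 0;                      /* zero the destination lane   */
    return dest;                       /* leave it unchanged          */
}
```

A typical use, under these semantics, is multiplying a vector by a vector of -1, 0, and +1 factors without a multiply instruction.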

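11.10.10 SIMD Integer Comparison Instructions

The (v)pcmpeqb, (v)pcmpeqw, (v)pcmpeqd, (v)pcmpeqq, (v)pcmpgtb, (v)pcmpgtw, (v)pcmpgtd, and (v)pcmpgtq instructions provide packed signed integer comparisons. These instructions compare the corresponding bytes, words, dwords, or qwords (depending on the instruction suffix) in the lanes of their operands.^(14) They store the result of the comparison into the corresponding destination lanes.

11.10.10.1 SSE Compare-for-Equality Instructions

The syntax for the SSE compare-for-equality instructions (pcmpeq*) is shown here: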

pcmpeqb `xmm`[dest], `xmm`[src]/`mem`[128]  ; Compares 16 bytes
pcmpeqw `xmm`[dest], `xmm`[src]/`mem`[128]  ; Compares 8 words
pcmpeqd `xmm`[dest], `xmm`[src]/`mem`[128]  ; Compares 4 dwords
pcmpeqq `xmm`[dest], `xmm`[src]/`mem`[128]  ; Compares 2 qwords

These instructions compute

`xmm`[dest][`lane`] = `xmm`[dest][`lane`] == `xmm`[src]/`mem`[128][`lane`]

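where lane varies from 0 to 15 for pcmpeqb, 0 to 7 for pcmpeqw, 0 to 3 for pcmpeqd, and 0 to 1 for pcmpeqq. The == operator produces a value of all 1 bits if the two values in the same lane are equal and all 0 bits if they are not.

11.10.10.2 SSE Compare-for-Greater-Than Instructions

The syntax for the SSE compare-for-greater-than instructions (pcmpgt*) is shown here: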

pcmpgtb `xmm`[dest], `xmm`[src]/`mem`[128]  ; Compares 16 bytes
pcmpgtw `xmm`[dest], `xmm`[src]/`mem`[128]  ; Compares 8 words
pcmpgtd `xmm`[dest], `xmm`[src]/`mem`[128]  ; Compares 4 dwords
pcmpgtq `xmm`[dest], `xmm`[src]/`mem`[128]  ; Compares 2 qwords

These instructions compute

`xmm`[dest][`lane`] = `xmm`[dest][`lane`] > `xmm`[src]/`mem`[128][`lane`]

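where lane is the same as for the compare-for-equality instructions, and the > operator produces a value of all 1 bits if the signed integer in the XMM[dest] lane is greater than the signed value in the corresponding XMM[src]/MEM[128] lane (and all 0 bits otherwise).

On AVX-capable CPUs, the SSE packed integer comparisons preserve the value in the HO bits of the underlying YMM register.

11.10.10.3 AVX Comparison Instructions

The 128-bit variants of these instructions have the following syntax: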

vpcmpeqb `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]  ; Compares 16 bytes
vpcmpeqw `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]  ; Compares 8 words
vpcmpeqd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]  ; Compares 4 dwords
vpcmpeqq `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]  ; Compares 2 qwords

vpcmpgtb `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]  ; Compares 16 bytes
vpcmpgtw `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]  ; Compares 8 words
vpcmpgtd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]  ; Compares 4 dwords
vpcmpgtq `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]  ; Compares 2 qwords

These instructions compute as follows:

`xmm`[dest][`lane`] = `xmm`[src1][`lane`] == `xmm`[src2]/`mem`[128][`lane`]
`xmm`[dest][`lane`] = `xmm`[src1][`lane`] >  `xmm`[src2]/`mem`[128][`lane`]

These AVX instructions write 0s to the HO bits of the underlying YMM register.

The 256-bit variants of these instructions have the following syntax:

vpcmpeqb `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]  ; Compares 32 bytes
vpcmpeqw `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]  ; Compares 16 words
vpcmpeqd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]  ; Compares 8 dwords
vpcmpeqq `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]  ; Compares 4 qwords

vpcmpgtb `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]  ; Compares 32 bytes
vpcmpgtw `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]  ; Compares 16 words
vpcmpgtd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]  ; Compares 8 dwords
vpcmpgtq `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]  ; Compares 4 qwords

These instructions compute as follows:

`ymm`[dest][`lane`] = `ymm`[src1][`lane`] == `ymm`[src2]/`mem`[256][`lane`]
`ymm`[dest][`lane`] = `ymm`[src1][`lane`] >  `ymm`[src2]/`mem`[256][`lane`]

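Of course, the principal difference between the 256- and 128-bit instructions is that the 256-bit variants support twice as many byte (32), word (16), dword (8), and qword (4) signed integer lanes.

11.10.10.4 Compare-for-Less-Than Instructions

There are no packed compare-for-less-than instructions. You can synthesize a less-than comparison by swapping the operands and using a greater-than comparison. That is, if x < y, then it is also true that y > x. If both packed operands are sitting in XMM or YMM registers, swapping the registers is relatively easy (especially when using the three-operand AVX instructions). If the second operand is a memory operand, you must first load that operand into a register so you can swap the operands (a memory operand must always be the second operand).

11.10.10.5 Using Packed Comparison Results

The question remains of what to do with the result you obtain from a packed comparison. SSE/AVX packed signed integer comparisons do not affect the condition code flags (because they compare multiple values, and only one of those comparisons could be moved into the flags). Instead, the packed comparisons simply produce Boolean results. You can use these results with the packed AND instructions (pand, vpand, pandn, and vpandn), the packed OR instructions (por and vpor), or the packed XOR instructions (pxor and vpxor) to mask or otherwise modify other packed data values. Of course, you could also extract the individual lane values and test them (via a conditional jump). The following section describes a straightforward way to achieve this.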
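
As one illustration of masking with comparison results, the scalar C sketch below selects the larger of two signed bytes in each lane without branching on the data, using the same four logical steps that a pcmpgtb, pand, pandn, por sequence would apply to whole registers. The function name `lane_max_via_mask` is made up for this sketch, and the lane-at-a-time loop is only a model of the parallel operation.

```c
#include <stdint.h>

/* Per-lane signed maximum built from a compare mask and bitwise ops. */
void lane_max_via_mask(const int8_t a[16], const int8_t b[16], int8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t mask   = (a[i] > b[i]) ? 0xFF : 0x00;          /* pcmpgtb: all 1s or all 0s */
        uint8_t keep_a = mask & (uint8_t)a[i];                  /* pand:  a where a > b      */
        uint8_t keep_b = (uint8_t)(~mask & (uint8_t)b[i]);      /* pandn: b elsewhere        */
        out[i] = (int8_t)(keep_a | keep_b);                     /* por:   merge the two      */
    }
}
```

The same mask-and-merge pattern works for any per-lane choice between two values, not just a maximum.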

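11.10.10.6 The (v)pmovmskb Instructions

The (v)pmovmskb instruction extracts the HO bit from all the bytes in an XMM or YMM register and stores the 16 or 32 bits (respectively) into a general-purpose register. These instructions set all the HO bits of the general-purpose register to 0 (beyond those needed to hold the mask bits). The syntax is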

pmovmskb  `reg`, `xmm`[src]
vpmovmskb `reg`, `xmm`[src]
vpmovmskb `reg`, `ymm`[src]

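where reg is any 32-bit or 64-bit general-purpose integer register. The semantics of the pmovmskb and vpmovmskb instructions with an XMM source register are the same, but the encoding of pmovmskb is more efficient.

The (v)pmovmskb instruction copies the sign bit of each byte lane into the corresponding bit position of the general-purpose register. It copies bit 7 of the XMM register (the sign bit for lane 0) into bit 0 of the destination register; it copies bit 15 of the XMM register (the sign bit for lane 1) into bit 1 of the destination register; it copies bit 23 of the XMM register (the sign bit for lane 2) into bit 2 of the destination register; and so on.

The 128-bit instructions fill only bits 0 through 15 of the destination register (zeroing out all other bits). The 256-bit form of the vpmovmskb instruction fills bits 0 through 31 of the destination register (zeroing out the HO bits if you specify a 64-bit register).

You can use the pmovmskb instruction to extract a single bit from each byte lane in an XMM or YMM register after a (v)pcmpeqb or (v)pcmpgtb instruction. Consider the following code sequence: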

pcmpeqb  xmm0, xmm1
pmovmskb eax,  xmm0

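After the execution of these two instructions, bit 0 of EAX is 1 or 0 depending on whether byte 0 of XMM0 was equal or not equal to byte 0 of XMM1, respectively. Likewise, bit 1 of EAX contains the result of comparing byte 1 of XMM0 with byte 1 of XMM1, and so on through bit 15, which reflects the comparison of byte 15 of XMM0 and XMM1.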
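
The bit gathering that the 128-bit pmovmskb performs can be modeled in a few lines of scalar C. The name `model_pmovmskb` is an assumption of this sketch; `xmm[0]` is byte lane 0.

```c
#include <stdint.h>

/* Scalar model of "pmovmskb reg32, xmm". */
uint32_t model_pmovmskb(const uint8_t xmm[16])
{
    uint32_t mask = 0;
    for (int lane = 0; lane < 16; lane++)
        mask |= (uint32_t)(xmm[lane] >> 7) << lane;   /* sign bit of lane -> bit 'lane' */
    return mask;                                       /* bits 16 to 31 stay 0 */
}
```

After a byte comparison, a result of 0FFFFh therefore means that all 16 lanes matched, and 0 means that none did.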

Unfortunately, there are no pmovmskw, pmovmskd, or pmovmskq instructions. You can achieve the same result as pmovmskw by using the following code sequence:

pcmpeqw  xmm0, xmm1
pmovmskb eax, xmm0
mov      cl, 0     ; Put result here
shl      ax, 1     ; Shift out lane 7 result (bit 15)
rcl      cl, 1     ; Shift bit into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 6 result
rcl      cl, 1     ; Shift lane 6 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 5 result
rcl      cl, 1     ; Shift lane 5 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 4 result
rcl      cl, 1     ; Shift lane 4 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 3 result
rcl      cl, 1     ; Shift lane 3 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 2 result
rcl      cl, 1     ; Shift lane 2 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 1 result
rcl      cl, 1     ; Shift lane 1 result into CL
shl      ax, 1     ; Ignore this bit
shl      ax, 1     ; Shift out lane 0 result
rcl      cl, 1     ; Shift lane 0 result into CL

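Because pcmpeqw produces a sequence of words containing 0000h or 0FFFFh, and pmovmskb expects byte values, pmovmskb produces twice as many result bits as we expect, and every odd-numbered bit is a duplicate of the preceding even-numbered bit (because the inputs are either 0000h or 0FFFFh). This code grabs every odd-numbered bit (starting with bit 15 and working down) and skips the even-numbered bits. Although this code is easy enough to follow, it is rather long and slow. If you're willing to live with an 8-bit result whose bit numbers don't match the lane numbers, you can use more efficient code: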

pcmpeqw  xmm0, xmm1
pmovmskb eax, xmm0
shr      al, 1     ; Move odd bits to even positions
and      al, 55h   ; Zero out the odd bits, keep even bits
and      ah, 0aah  ; Zero out the even bits, keep odd bits
or       al, ah    ; Merge the two sets of bits

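This interleaves the bits as shown in Figure 11-44. Handling this rearrangement in software is usually easy enough. Of course, you can also use a 256-entry lookup table (see Chapter 10) to rearrange the bits any way you like. And if you just want to test individual bits rather than use them as some sort of mask, you can test the bits that pmovmskb leaves in EAX directly; you don't have to merge them into a single byte.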
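
To see exactly where each lane's result lands after the shorter sequence, the four register operations can be mirrored in scalar C. This is a sketch under the assumption that the input is the 16-bit value pmovmskb produces after pcmpeqw (two identical bits per word lane); the names `merge_word_mask` and `bit_for_lane` are invented here.

```c
#include <stdint.h>

/* Mirror of: shr al,1 / and al,55h / and ah,0AAh / or al,ah */
uint8_t merge_word_mask(uint16_t byte_mask)
{
    uint8_t al = (uint8_t)(byte_mask & 0xFF);   /* lanes 0 to 3, two bits each */
    uint8_t ah = (uint8_t)(byte_mask >> 8);     /* lanes 4 to 7, two bits each */
    al = (uint8_t)(al >> 1);                    /* move odd bits to even positions */
    al &= 0x55;                                 /* keep even bit positions         */
    ah &= 0xAA;                                 /* keep odd bit positions          */
    return (uint8_t)(al | ah);
}

/* Bit position in the merged byte that holds the result for a word lane. */
unsigned bit_for_lane(unsigned lane)
{
    return lane < 4 ? 2 * lane : 2 * (lane - 4) + 1;
}
```

Under this model, lanes 0 to 3 occupy bits 0, 2, 4, and 6 of the merged byte, and lanes 4 to 7 occupy bits 1, 3, 5, and 7.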

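Figure 11-44: Merging the bits from pcmpeqw

When using the dword or qword packed comparisons, you could use a scheme similar to the one given here for pcmpeqw. However, the floating-point mask move instructions (see "The (v)movmskps, (v)movmskpd Instructions") do the job more efficiently, at the cost of breaking the rule about using SIMD instructions that are appropriate for the data type.

11.10.11 Integer Conversions

The SSE and AVX instruction set extensions provide various instructions that convert integer values from one form to another. There are zero- and sign-extension instructions that convert from a smaller value to a larger one. Other instructions convert larger values to smaller ones. This section covers these instructions.

11.10.11.1 Packed Zero-Extension Instructions

The move with zero-extension instructions perform the conversions appearing in Table 11-21.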

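Table 11-21: SSE4.1 and AVX Packed Zero-Extension Instructions

Syntax	Description
`pmovzxbw` `xmm`[dest], `xmm`[src]/`mem`[64]	Zero-extends the eight byte values in the LO 8 bytes of XMM[src]/mem[64] to word values in XMM[dest].
`pmovzxbd` `xmm`[dest], `xmm`[src]/`mem`[32]	Zero-extends the four byte values in the LO 4 bytes of XMM[src]/mem[32] to dword values in XMM[dest].
`pmovzxbq` `xmm`[dest], `xmm`[src]/`mem`[16]	Zero-extends the two byte values in the LO 2 bytes of XMM[src]/mem[16] to qword values in XMM[dest].
`pmovzxwd` `xmm`[dest], `xmm`[src]/`mem`[64]	Zero-extends the four word values in the LO 8 bytes of XMM[src]/mem[64] to dword values in XMM[dest].
`pmovzxwq` `xmm`[dest], `xmm`[src]/`mem`[32]	Zero-extends the two word values in the LO 4 bytes of XMM[src]/mem[32] to qword values in XMM[dest].
`pmovzxdq` `xmm`[dest], `xmm`[src]/`mem`[64]	Zero-extends the two dword values in the LO 8 bytes of XMM[src]/mem[64] to qword values in XMM[dest].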

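A comparable set of AVX instructions also exists (same syntax, but with a v prefix on the instruction mnemonics). As usual, the SSE instructions leave the HO bits of the YMM register unchanged, whereas the AVX instructions store 0s into the HO bits of the YMM register.

The AVX2 instruction set extensions double the number of lanes by allowing the use of YMM registers. They take operands similar to those of the SSE/AVX instructions (substituting YMM for the destination register and doubling the size of the memory locations) and process twice the number of lanes to produce sixteen words, eight dwords, or four qwords in a YMM destination register. See Table 11-22 for details.

Table 11-22: AVX2 Packed Zero-Extension Instructions

Syntax	Description
`vpmovzxbw` `ymm`[dest], `xmm`[src]/`mem`[128]	Zero-extends the sixteen byte values in the LO 16 bytes of XMM[src]/mem[128] to word values in YMM[dest].
`vpmovzxbd` `ymm`[dest], `xmm`[src]/`mem`[64]	Zero-extends the eight byte values in the LO 8 bytes of XMM[src]/mem[64] to dword values in YMM[dest].
`vpmovzxbq` `ymm`[dest], `xmm`[src]/`mem`[32]	Zero-extends the four byte values in the LO 4 bytes of XMM[src]/mem[32] to qword values in YMM[dest].
`vpmovzxwd` `ymm`[dest], `xmm`[src]/`mem`[128]	Zero-extends the eight word values in the LO 16 bytes of XMM[src]/mem[128] to dword values in YMM[dest].
`vpmovzxwq` `ymm`[dest], `xmm`[src]/`mem`[64]	Zero-extends the four word values in the LO 8 bytes of XMM[src]/mem[64] to qword values in YMM[dest].
`vpmovzxdq` `ymm`[dest], `xmm`[src]/`mem`[128]	Zero-extends the four dword values in the LO 16 bytes of XMM[src]/mem[128] to qword values in YMM[dest].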

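11.10.11.2 Packed Sign-Extension Instructions

The SSE/AVX/AVX2 instruction set extensions provide a comparable set of instructions that sign-extend byte, word, and dword values. Table 11-23 lists the SSE packed sign-extension instructions.

Table 11-23: SSE Packed Sign-Extension Instructions

Syntax	Description
`pmovsxbw` `xmm`[dest], `xmm`[src]/`mem`[64]	Sign-extends the eight byte values in the LO 8 bytes of XMM[src]/mem[64] to word values in XMM[dest].
`pmovsxbd` `xmm`[dest], `xmm`[src]/`mem`[32]	Sign-extends the four byte values in the LO 4 bytes of XMM[src]/mem[32] to dword values in XMM[dest].
`pmovsxbq` `xmm`[dest], `xmm`[src]/`mem`[16]	Sign-extends the two byte values in the LO 2 bytes of XMM[src]/mem[16] to qword values in XMM[dest].
`pmovsxwd` `xmm`[dest], `xmm`[src]/`mem`[64]	Sign-extends the four word values in the LO 8 bytes of XMM[src]/mem[64] to dword values in XMM[dest].
`pmovsxwq` `xmm`[dest], `xmm`[src]/`mem`[32]	Sign-extends the two word values in the LO 4 bytes of XMM[src]/mem[32] to qword values in XMM[dest].
`pmovsxdq` `xmm`[dest], `xmm`[src]/`mem`[64]	Sign-extends the two dword values in the LO 8 bytes of XMM[src]/mem[64] to qword values in XMM[dest].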

还有一组对应的 AVX 指令（其助记符以 v 前缀为标识）。通常，SSE 和 AVX 指令的区别在于，SSE 指令不会改变 YMM 寄存器的高位（如果适用），而 AVX 指令则会将这些高位置为 0。

支持 AVX2 的处理器还允许 YMM[dest] 目标寄存器，这将使指令能够处理更多的（输出）值；见表 11-24。

Table 11-24: AVX2 Packed Sign-Extension Instructions

Syntax	Description
`vpmovsxbw ymm[dest], xmm[src]/mem[128]`	Sign-extends the sixteen byte values in the LO 16 bytes of XMM[src]/mem[128] to word values in YMM[dest].
`vpmovsxbd ymm[dest], xmm[src]/mem[64]`	Sign-extends the eight byte values in the LO 8 bytes of XMM[src]/mem[64] to dword values in YMM[dest].
`vpmovsxbq ymm[dest], xmm[src]/mem[32]`	Sign-extends the four byte values in the LO 4 bytes of XMM[src]/mem[32] to qword values in YMM[dest].
`vpmovsxwd ymm[dest], xmm[src]/mem[128]`	Sign-extends the eight word values in the LO 16 bytes of XMM[src]/mem[128] to dword values in YMM[dest].
`vpmovsxwq ymm[dest], xmm[src]/mem[64]`	Sign-extends the four word values in the LO 8 bytes of XMM[src]/mem[64] to qword values in YMM[dest].
`vpmovsxdq ymm[dest], xmm[src]/mem[128]`	Sign-extends the four dword values in the LO 16 bytes of XMM[src]/mem[128] to qword values in YMM[dest].
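
The difference between the zero-extending and sign-extending forms is easiest to see one lane at a time. The following is a minimal scalar sketch in C, not the instructions themselves; the function names are hypothetical and only model what a single byte-to-word lane of pmovzxbw and pmovsxbw produces.

```c
#include <stdint.h>

/* One lane of pmovzxbw: the upper 8 bits of the word are filled with 0. */
static uint16_t zero_extend_byte(uint8_t b)
{
    return (uint16_t)b;
}

/* One lane of pmovsxbw: the upper 8 bits copy the byte's sign bit.
   (b ^ 0x80) - 0x80 maps 0x00..0x7F to 0..127 and 0x80..0xFF to -128..-1. */
static int16_t sign_extend_byte(uint8_t b)
{
    return (int16_t)((b ^ 0x80) - 0x80);
}

/* Scalar model of pmovsxbw: only the LO 8 bytes of the source are used. */
static void model_pmovsxbw(int16_t dest[8], const uint8_t src[16])
{
    for (int lane = 0; lane < 8; lane++)
        dest[lane] = sign_extend_byte(src[lane]);
}
```

The byte 80h therefore becomes 0080h under zero extension and 0FF80h (that is, -128) under sign extension.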

11.10.11.3 打包符号扩展与饱和

除了将较小的有符号或无符号值转换为较大格式外，支持 SSE/AVX/AVX2 的 CPU 还能够通过饱和将较大值转换为较小值；请参见表 11-25。

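Table 11-25: SSE Packed Sign Extension with Saturation Instructions

Syntax	Description
`packsswb xmm[dest], xmm[src]/mem[128]`	Packs sixteen signed word values from two 128-bit sources into sixteen byte lanes in a 128-bit destination register, using signed saturation.
`packuswb xmm[dest], xmm[src]/mem[128]`	Packs sixteen word values (interpreted as signed) from two 128-bit sources into sixteen unsigned byte lanes in a 128-bit destination register, using unsigned saturation.
`packssdw xmm[dest], xmm[src]/mem[128]`	Packs eight signed dword values from two 128-bit sources into eight word lanes in a 128-bit destination register, using signed saturation.
`packusdw xmm[dest], xmm[src]/mem[128]`	Packs eight dword values (interpreted as signed) from two 128-bit sources into eight unsigned word lanes in a 128-bit destination register, using unsigned saturation.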

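A saturate operation checks its operand to see whether the value exceeds the range of the result (-128 to +127 for signed bytes, 0 to 255 for unsigned bytes, -32,768 to +32,767 for signed words, and 0 to 65,535 for unsigned words). When saturating to a signed byte, a source value less than -128 becomes -128; when saturating to a signed word, a source value less than -32,768 becomes -32,768. Likewise, a source value greater than +127 or +32,767 is replaced by +127 or +32,767, respectively. For the unsigned forms (packuswb and packusdw), the source lanes are still interpreted as signed values: anything above +255 (bytes) or +65,535 (words) is clipped to that maximum, and any negative source value is clipped to 0.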
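
As a rough illustration of the clamping just described, here is a scalar C sketch of one word-to-byte lane. The function names are hypothetical. It assumes, following Intel's documentation for packsswb and packuswb, that the source word is interpreted as a signed value in both cases, so the unsigned form clamps negative inputs to 0.

```c
#include <stdint.h>

/* One lane of packsswb: signed word -> signed byte, signed saturation. */
static int8_t sat_word_to_sbyte(int16_t w)
{
    if (w < -128) return -128;
    if (w >  127) return  127;
    return (int8_t)w;
}

/* One lane of packuswb: signed word -> unsigned byte, unsigned saturation. */
static uint8_t sat_word_to_ubyte(int16_t w)
{
    if (w < 0)   return 0;
    if (w > 255) return 255;
    return (uint8_t)w;
}
```

A word holding 300 thus packs to 127 with signed saturation and to 255 with unsigned saturation, while -300 packs to -128 and 0, respectively.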

支持 AVX 的 CPU 提供这些指令的 128 位变体，支持三个操作数：两个源操作数和一个独立的目标操作数。这些指令（助记符与 SSE 指令相同，但以v为前缀）具有以下语法：

vpacksswb  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpackuswb  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpackssdw  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vpackusdw  `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]

这些指令大致等同于 SSE 变体，唯一的不同是，这些指令使用 XMM[src1]作为第一个源操作数，而不是 SSE 指令使用的 XMM[dest]。另外，SSE 指令不会修改 YMM 寄存器的高位（如果 CPU 上存在该寄存器），而 AVX 指令会将 0 存入 YMM 寄存器的高位。

支持 AVX2 的 CPU 还允许使用 YMM 寄存器（和 256 位内存位置），以便指令能够饱和更多的值（参见表 11-26）。当然，在使用这些指令之前，请不要忘记检查 AVX2（和 AVX）兼容性。

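Table 11-26: AVX2 Packed Sign Extension with Saturation Instructions

Syntax	Description
`vpacksswb ymm[dest], ymm[src1], ymm[src2]/mem[256]`	Packs 32 signed word values from two 256-bit sources into 32 byte lanes in a 256-bit destination register, using signed saturation.
`vpackuswb ymm[dest], ymm[src1], ymm[src2]/mem[256]`	Packs 32 word values (interpreted as signed) from two 256-bit sources into 32 unsigned byte lanes in a 256-bit destination register, using unsigned saturation.
`vpackssdw ymm[dest], ymm[src1], ymm[src2]/mem[256]`	Packs 16 signed dword values from two 256-bit sources into 16 word lanes in a 256-bit destination register, using signed saturation.
`vpackusdw ymm[dest], ymm[src1], ymm[src2]/mem[256]`	Packs 16 dword values (interpreted as signed) from two 256-bit sources into 16 unsigned word lanes in a 256-bit destination register, using unsigned saturation.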

11.11 SIMD 浮点算术运算

SSE 和 AVX 指令集扩展为“第六章中的 SSE 浮点算术”中的所有标量浮点指令提供了打包算术等效指令。本节不重复标量浮点操作的讨论；有关详细信息，请参见第六章。

128 位 SSE 打包浮点指令具有以下通用语法（其中 instr 是表 11-27 中的浮点指令之一）：

`instr`ps `xmm`[dest], `xmm`[src]/`mem`[128]
`instr`pd `xmm`[dest], `xmm`[src]/`mem`[128]

打包单精度（*ps）指令同时执行四个单精度浮点运算。打包双精度（*pd）指令同时执行两个双精度浮点运算。像典型的 SSE 指令一样，这些打包算术指令计算

`xmm`[dest][`lane`] = `xmm`[dest][`lane`] `op` `xmm`[src]/`mem`[128][`lane`]

lane 在打包单精度指令中取值范围为 0 到 3，在打包双精度指令中取值范围为 0 到 1。op 表示操作（例如加法或减法）。当 SSE 指令在支持 AVX 扩展的 CPU 上执行时，SSE 指令会将 AVX 寄存器的高位保持不变。

128 位 AVX 打包浮点指令具有以下语法：^(15)

v`instr`ps `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128] ; For dyadic operations
v`instr`pd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128] ; For dyadic operations
v`instr`ps `xmm`[dest], `xmm`[src]/`mem`[128]          ; For monadic operations
v`instr`pd `xmm`[dest], `xmm`[src]/`mem`[128]          ; For monadic operations

这些指令计算

`xmm`[dest][`lane`] = `xmm`[src1][`lane`] `op` `xmm`[src2]/`mem`[128][`lane`]

其中 op 对应于与特定指令相关的操作（例如，vaddps 执行打包单精度加法）。这些 128 位 AVX 指令会清除底层 YMM[dest] 寄存器的 HO 位。

256 位 AVX 打包浮点指令具有以下语法：

v`instr`ps `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256] ; For dyadic operations
v`instr`pd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256] ; For dyadic operations
v`instr`ps `ymm`[dest], `ymm`[src]/`mem`[256]          ; For monadic operations
v`instr`pd `ymm`[dest], `ymm`[src]/`mem`[256]          ; For monadic operations

这些指令计算

`ymm`[dest][`lane`] = `ymm`[src1][`lane`] `op` `ymm`[src2]/`mem`[256][`lane`]

其中，op对应于与特定指令关联的操作（例如，vaddps是打包的单精度加法）。由于这些指令操作的是 256 位操作数，它们计算的数据通道数是 128 位指令的两倍。具体来说，它们同时计算八个单精度结果（v*ps指令）或四个双精度结果（v*pd指令）。

表 11-27 提供了 SSE/AVX 打包指令的列表。

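Table 11-27: Floating-Point Arithmetic Instructions

Instruction	Lanes	Description
`addps`	4	Adds four single-precision floating-point values
`addpd`	2	Adds two double-precision floating-point values
`vaddps`	4/8	Adds four (128-bit/XMM operand) or eight (256-bit/YMM operand) single-precision values
`vaddpd`	2/4	Adds two (128-bit/XMM operand) or four (256-bit/YMM operand) double-precision values
`subps`	4	Subtracts four single-precision floating-point values
`subpd`	2	Subtracts two double-precision floating-point values
`vsubps`	4/8	Subtracts four (128-bit/XMM operand) or eight (256-bit/YMM operand) single-precision values
`vsubpd`	2/4	Subtracts two (128-bit/XMM operand) or four (256-bit/YMM operand) double-precision values
`mulps`	4	Multiplies four single-precision floating-point values
`mulpd`	2	Multiplies two double-precision floating-point values
`vmulps`	4/8	Multiplies four (128-bit/XMM operand) or eight (256-bit/YMM operand) single-precision values
`vmulpd`	2/4	Multiplies two (128-bit/XMM operand) or four (256-bit/YMM operand) double-precision values
`divps`	4	Divides four single-precision floating-point values
`divpd`	2	Divides two double-precision floating-point values
`vdivps`	4/8	Divides four (128-bit/XMM operand) or eight (256-bit/YMM operand) single-precision values
`vdivpd`	2/4	Divides two (128-bit/XMM operand) or four (256-bit/YMM operand) double-precision values
`maxps`	4	Computes the maximum of four pairs of single-precision floating-point values
`maxpd`	2	Computes the maximum of two pairs of double-precision floating-point values
`vmaxps`	4/8	Computes the maximum of four (128-bit/XMM operand) or eight (256-bit/YMM operand) pairs of single-precision values
`vmaxpd`	2/4	Computes the maximum of two (128-bit/XMM operand) or four (256-bit/YMM operand) pairs of double-precision values
`minps`	4	Computes the minimum of four pairs of single-precision floating-point values
`minpd`	2	Computes the minimum of two pairs of double-precision floating-point values
`vminps`	4/8	Computes the minimum of four (128-bit/XMM operand) or eight (256-bit/YMM operand) pairs of single-precision values
`vminpd`	2/4	Computes the minimum of two (128-bit/XMM operand) or four (256-bit/YMM operand) pairs of double-precision values
`sqrtps`	4	Computes the square root of four single-precision floating-point values
`sqrtpd`	2	Computes the square root of two double-precision floating-point values
`vsqrtps`	4/8	Computes the square root of four (128-bit/XMM operand) or eight (256-bit/YMM operand) single-precision values
`vsqrtpd`	2/4	Computes the square root of two (128-bit/XMM operand) or four (256-bit/YMM operand) double-precision values
`rsqrtps`	4	Computes the approximate reciprocal square root of four single-precision floating-point values^(*)
`vrsqrtps`	4/8	Computes the approximate reciprocal square root of four (128-bit/XMM operand) or eight (256-bit/YMM operand) single-precision values
^(*) Relative error ≤ 1.5 × 2^(-12).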

SSE/AVX 指令集扩展还包括浮点水平加法和减法指令。这些指令的语法如下：

haddps  `xmm`[dest], `xmm`[src]/`mem`[128]
vhaddps `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vhaddps `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
haddpd  `xmm`[dest], `xmm`[src]/`mem`[128]
vhaddpd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vhaddpd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

hsubps  `xmm`[dest], `xmm`[src]/`mem`[128]
vhsubps `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vhsubps `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]
hsubpd  `xmm`[dest], `xmm`[src]/`mem`[128]
vhsubpd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128]
vhsubpd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256]

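As with the integer horizontal addition and subtraction instructions, these instructions add or subtract the values in adjacent lanes of the same register and store the result in the destination register (lane 2), as shown in Figure 11-43.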
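
To make the lane pairing concrete, here is a scalar C sketch of the 128-bit haddps form. The function name is hypothetical, and the layout follows Intel's description of the instruction: the two LO result lanes come from adjacent pairs in the destination operand, and the two HO result lanes come from adjacent pairs in the source operand. The 256-bit vhaddps form applies the same pattern independently to each 128-bit half.

```c
/* Scalar model of: haddps dest, src (four single-precision lanes each). */
static void model_haddps(float dest[4], const float src[4])
{
    float r0 = dest[0] + dest[1];   /* adjacent pair from dest */
    float r1 = dest[2] + dest[3];
    float r2 = src[0] + src[1];     /* adjacent pair from src  */
    float r3 = src[2] + src[3];

    dest[0] = r0;
    dest[1] = r1;
    dest[2] = r2;
    dest[3] = r3;
}
```

The hsubps form uses the same pairing but subtracts the higher-numbered lane of each pair from the lower-numbered one.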

11.12 SIMD 浮点比较指令

与整数打包比较类似，SSE/AVX 浮点比较比较两组浮点值（无论是单精度还是双精度，具体取决于指令语法），并将结果布尔值（所有 1 位表示真，所有 0 位表示假）存储到目标通道。然而，浮点比较比整数对应物更为全面。部分原因是浮点运算更为复杂；然而，CPU 设计师的硅预算不断增加也是原因之一。

11.12.1 SSE 和 AVX 比较

有两组基本的浮点比较：(v)cmpps，它比较一组打包的单精度值；和(v)cmppd，它比较一组打包的双精度值。这些指令并不直接将比较类型编码到助记符中，而是使用一个 imm[8]操作数，其值指定比较类型。这些指令的通用语法如下：

cmpps  `xmm`[dest], `xmm`[src]/`mem`[128], `imm`[8]
vcmpps `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128], `imm`[8]
vcmpps `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256], `imm`[8]

cmppd  `xmm`[dest], `xmm`[src]/`mem`[128], `imm`[8]
vcmppd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128], `imm`[8]
vcmppd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256], `imm`[8]

imm[8]操作数指定比较类型。共有 32 种可能的比较方式，详见表 11-28。

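Table 11-28: imm[8] Values for the cmpps and cmppd Instructions^(†)

imm[8]	Description	A < B	A = B
00h	EQ, ordered, quiet	0	1
01h	LT, ordered, signaling	1	0
02h	LE, ordered, signaling	1	1
03h	Unordered, quiet	0	0
04h	NE, unordered, quiet	1	0
05h	NLT, unordered, signaling	0	1
06h	NLE, unordered, signaling	0	0
07h	Ordered, quiet	1	1
08h	EQ, unordered, quiet	0	1
09h	NGE, unordered, signaling	1	0
0Ah	NGT, unordered, signaling	1	1
0Bh	False, ordered, quiet	0	0
0Ch	NE, ordered, quiet	1	0
0Dh	GE, ordered, signaling	0	1
0Eh	GT, ordered, signaling	0	0
0Fh	True, unordered, quiet	1	1
10h	EQ, ordered, signaling	0	1
11h	LT, ordered, quiet	1	0
12h	LE, ordered, quiet	1	1
13h	Unordered, signaling	0	0
14h	NE, unordered, signaling	1	0
15h	NLT, unordered, quiet	0	1
16h	NLE, unordered, quiet	0	0
17h	Ordered, signaling	1	1
18h	EQ, unordered, signaling	0	1
19h	NGE, unordered, quiet	1	0
1Ah	NGT, unordered, quiet	1	1
1Bh	False, ordered, signaling	0	0
1Ch	NE, ordered, signaling	1	0
1Dh	GE, ordered, quiet	0	1
1Eh	GT, ordered, quiet	0	0
1Fh	True, unordered, signaling	1	1
^(†) The entries for imm[8] values 08h through 1Fh (darkly shaded in the original table) are available only on CPUs that support the AVX extensions. The A < B and A = B columns give the result when that relation holds between the operands; whether a comparison is signaling or quiet is stated in the description.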

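The true and false comparisons always store true or false into the destination lanes. For the most part, these comparisons aren't particularly useful. The pxor, xorps, xorpd, vxorps, and vxorpd instructions are probably better choices for setting an XMM or YMM register to 0. Prior to AVX2, the true comparison was the shortest instruction that would set all the bits of an XMM or YMM register to 1, though pcmpeqb is commonly used for this as well (be aware of microarchitectural inefficiencies with this latter instruction).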

请注意，非 AVX CPU 不支持 GT、GE、NGT 和 NGE 指令。在这些 CPU 上，可以使用反向操作（例如，使用 NLT 替代 GE），或交换操作数并使用相反条件（就像打包整数比较中所做的那样）。

11.12.2 无序与有序比较

无序关系在至少有一个源操作数是 NaN 时为真；有序关系在两个源操作数都不是 NaN 时为真。拥有有序和无序比较使您可以根据最终布尔结果的解释，将错误条件通过比较传递为假或真。无序结果，顾名思义，是不可比较的。当您比较两个值，其中一个不是数字时，必须始终将结果视为失败的比较。

为了处理这种情况，您可以使用有序或无序比较来强制结果为假或真，这与您最终期望的比较结果相反。例如，假设您正在比较一系列值，并希望在所有比较有效时，结果掩码为真（例如，您正在检查是否所有的 src[1] 值都大于相应的 src[2] 值）。在这种情况下，您会使用有序比较，如果被比较的某个值是 NaN，它将强制某个特定的元素为假。另一方面，如果您正在检查所有条件是否在比较后为假，则可以使用无序比较，如果任何值是 NaN，它将强制结果为真。
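
The effect of choosing an ordered or an unordered predicate can be sketched in scalar C, since C's relational operators also evaluate to false when either operand is a NaN (assuming IEEE 754 semantics and no fast-math style compiler options). The function names below are hypothetical; each one models a single single-precision lane and returns the all-ones or all-zeros mask the corresponding comparison would produce.

```c
#include <math.h>
#include <stdint.h>

/* imm[8] = 01h (LT, ordered): a NaN operand forces the lane to false. */
static uint32_t cmp_lt_ordered(float a, float b)
{
    return (a < b) ? 0xFFFFFFFFu : 0u;
}

/* imm[8] = 05h (NLT, unordered): a NaN operand forces the lane to true. */
static uint32_t cmp_nlt_unordered(float a, float b)
{
    return !(a < b) ? 0xFFFFFFFFu : 0u;
}
```

For ordinary numbers the two predicates are exact opposites; with a NaN operand, the ordered form yields false and the unordered form yields true, which is what lets you steer an invalid comparison toward whichever outcome your algorithm treats as failure.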

11.12.3 信号与静默比较

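Signaling comparisons generate an invalid arithmetic operation exception (IA) when an operand is a quiet NaN. Quiet comparisons do not raise the exception for quiet NaNs and only reflect the status in the MXCSR (see “SSE MXCSR Register” in Chapter 6). Note that you can also mask signaling exceptions in the MXCSR register; if you want to allow the exceptions, you must explicitly set the IM (invalid operation mask, bit 7) in the MXCSR to 0.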

11.12.4 指令同义词

MASM 支持使用某些同义词，这样您就不必记住 32 种编码。表 11-29 列出了这些同义词。在此表中，x1 表示目标操作数（XMM[n] 或 YMM[n]），x2 表示源操作数（XMM[n]/mem[128] 或 YMM[n]/mem[256]，具体情况而定）。

Table 11-29: Synonyms for Common Packed Floating-Point Comparisons

Synonym	Instruction	Synonym	Instruction
`cmpeqps x1, x2`	`cmpps x1, x2, 0`	`cmpeqpd x1, x2`	`cmppd x1, x2, 0`
`cmpltps x1, x2`	`cmpps x1, x2, 1`	`cmpltpd x1, x2`	`cmppd x1, x2, 1`
`cmpleps x1, x2`	`cmpps x1, x2, 2`	`cmplepd x1, x2`	`cmppd x1, x2, 2`
`cmpunordps x1, x2`	`cmpps x1, x2, 3`	`cmpunordpd x1, x2`	`cmppd x1, x2, 3`
`cmpneqps x1, x2`	`cmpps x1, x2, 4`	`cmpneqpd x1, x2`	`cmppd x1, x2, 4`
`cmpnltps x1, x2`	`cmpps x1, x2, 5`	`cmpnltpd x1, x2`	`cmppd x1, x2, 5`
`cmpnleps x1, x2`	`cmpps x1, x2, 6`	`cmpnlepd x1, x2`	`cmppd x1, x2, 6`
`cmpordps x1, x2`	`cmpps x1, x2, 7`	`cmpordpd x1, x2`	`cmppd x1, x2, 7`

同义词允许你编写诸如

cmpeqps  xmm0, xmm1

而不是

cmpps  xmm0, xmm1, 0       ; Compare xmm0 to xmm1 for equality

显然，使用同义词可以使代码更易读和理解。并非所有的比较都有同义词。为了创建易读的同义词，对于 MASM 不支持的指令，可以使用宏（或更具可读性的符号常量）。有关宏的更多信息，请参阅第十三章。

11.12.5 AVX 扩展比较

这些指令的 AVX 版本支持三个寄存器操作数：目标 XMM 或 YMM 寄存器、源 XMM 或 YMM 寄存器，以及源 XMM 或 YMM 寄存器或 128 位或 256 位内存位置（后面跟着指定比较类型的 imm[8]操作数）。基本语法如下：

vcmpps `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128], `imm`[8]
vcmpps `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256], `imm`[8]

vcmppd `xmm`[dest], `xmm`[src1], `xmm`[src2]/`mem`[128], `imm`[8]
vcmppd `ymm`[dest], `ymm`[src1], `ymm`[src2]/`mem`[256], `imm`[8]

128 位的vcmpps指令比较 XMM[src1]寄存器每个通道中的四个单精度浮点值与相应的 XMM[src2]/mem[128]通道中的值，并将结果（全 1 位表示真，全 0 位表示假）存储到 XMM[dest]寄存器的相应通道。256 位的vcmpps指令比较 YMM[src1]寄存器每个通道中的八个单精度浮点值与相应的 YMM[src2]/mem[256]通道中的值，并将真或假的结果存储到 YMM[dest]寄存器的相应通道。

vcmppd指令比较两个通道（128 位版本）或四个通道（256 位版本）中的双精度值，并将结果存储到目标寄存器的相应通道中。

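As with the SSE compare instructions, the AVX instructions provide synonyms that eliminate the need to memorize the 32 imm[8] values. Table 11-30 lists the 32 instruction synonyms.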

Table 11-30: AVX Packed Compare Instructions

imm[8]	Instruction
00h	`vcmpeqps` or `vcmpeqpd`
01h	`vcmpltps` or `vcmpltpd`
02h	`vcmpleps` or `vcmplepd`
03h	`vcmpunordps` or `vcmpunordpd`
04h	`vcmpneqps` or `vcmpneqpd`
05h	`vcmpnltps` or `vcmpnltpd`
06h	`vcmpnleps` or `vcmpnlepd`
07h	`vcmpordps` or `vcmpordpd`
08h	`vcmpeq_uqps` or `vcmpeq_uqpd`
09h	`vcmpngeps` or `vcmpngepd`
0Ah	`vcmpngtps` or `vcmpngtpd`
0Bh	`vcmpfalseps` or `vcmpfalsepd`
0Ch	`vcmpneq_oqps` or `vcmpneq_oqpd`
0Dh	`vcmpgeps` or `vcmpgepd`
0Eh	`vcmpgtps` or `vcmpgtpd`
0Fh	`vcmptrueps` or `vcmptruepd`
10h	`vcmpeq_osps` or `vcmpeq_ospd`
11h	`vcmplt_oqps` or `vcmplt_oqpd`
12h	`vcmple_oqps` or `vcmple_oqpd`
13h	`vcmpunord_sps` or `vcmpunord_spd`
14h	`vcmpneq_usps` or `vcmpneq_uspd`
15h	`vcmpnlt_uqps` or `vcmpnlt_uqpd`
16h	`vcmpnle_uqps` or `vcmpnle_uqpd`
17h	`vcmpord_sps` or `vcmpord_spd`
18h	`vcmpeq_usps` or `vcmpeq_uspd`
19h	`vcmpnge_uqps` or `vcmpnge_uqpd`
1Ah	`vcmpngt_uqps` or `vcmpngt_uqpd`
1Bh	`vcmpfalse_osps` or `vcmpfalse_ospd`
1Ch	`vcmpneq_osps` or `vcmpneq_ospd`
1Dh	`vcmpge_oqps` or `vcmpge_oqpd`
1Eh	`vcmpgt_oqps` or `vcmpgt_oqpd`
1Fh	`vcmptrue_usps` or `vcmptrue_uspd`

11.12.6 使用 SIMD 比较指令

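As with the integer comparisons (see “Using Packed Comparison Results”), the floating-point comparison instructions produce a vector of Boolean results that you can use for further operations on the data lanes. You can use the packed logical instructions (pand and vpand, pandn and vpandn, por and vpor, and pxor and vpxor) to manipulate these results. You can also extract individual lane values and test them with conditional jumps, though this is definitely not the SIMD way of doing things; the next section describes one way to extract these masks.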

11.12.7 (v)movmskps、(v)movmskpd 指令

movmskps 和 movmskpd 指令从它们的打包单精度和双精度浮点源操作数中提取符号位，并将这些位存储到通用寄存器的低 4 位（或 8 位）中。其语法为

movmskps  `reg`, `xmm`[src]
movmskpd  `reg`, `xmm`[src] 
vmovmskps `reg`, `ymm`[src]
vmovmskpd `reg`, `ymm`[src]

其中 reg 是任意 32 位或 64 位通用整数寄存器。

movmskps 指令从 XMM 源寄存器中的四个单精度浮点值中提取符号位，并将这些位复制到目标寄存器的低 4 位中，如图 11-45 所示。

movmskpd 指令将源 XMM 寄存器中的两个双精度浮点值的符号位复制到目标寄存器的第 0 位和第 1 位，如图 11-46 所示。

vmovmskps 指令从 XMM 和 YMM 源寄存器中的四个或八个单精度浮点值中提取符号位，并将这些位复制到目标寄存器的低 4 位或 8 位中。图 11-47 显示了使用 YMM 源寄存器的这一操作。

图 11-45: movmskps 操作

图 11-46: movmskpd 操作

图 11-47: vmovmskps 操作

vmovmskpd 指令从源 YMM 寄存器中的四个双精度浮点值复制符号位到目标寄存器的第 0 到第 3 位中，如图 11-48 所示。

图 11-48: vmovmskpd 操作

当使用 XMM 源寄存器时，此指令将从两个双精度浮点值中复制符号位到目标寄存器的第 0 位和第 1 位。在所有情况下，这些指令会将结果零扩展到通用目标寄存器的高位。请注意，这些指令不允许使用内存操作数。

虽然这些指令的声明数据类型是打包单精度和打包双精度，但你也将使用这些指令处理 32 位整数（movmskps和vmovmskps）和 64 位整数（movmskpd和vmovmskpd）。具体来说，这些指令非常适合从各种通道中提取 1 位布尔值，尤其是在进行（dword 或 qword）打包整数比较之后，以及在进行单精度或双精度浮点比较后（请记住，虽然打包浮点比较比较的是浮点值，其结果实际上是整数值）。

考虑以下指令序列：

         cmpeqpd  xmm0, xmm1
         movmskpd rax,  xmm0      ; Moves 2 bits into RAX
         lea      rcx,  jmpTable
         jmp      qword ptr [rcx][rax*8]

jmpTable qword    nene
         qword    neeq
         qword    eqne
         qword    eqeq

由于movmskpd从 XMM0 提取 2 位并将它们存储到 RAX 中，因此这段代码可以使用 RAX 作为跳转表的索引来选择四个不同的分支标签。如果两个比较都产生不相等，nene标签下的代码会执行；当通道 0 的值相等但通道 1 的值不等时，跳转到neeq标签；当通道 0 的值不等但通道 1 的值相等时，跳转到eqne标签；最后，当两个通道的值都相等时，跳转到eqeq标签。
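
The index that this sequence feeds to the jump table can be modeled in a few lines of scalar C. This is only a sketch with a hypothetical function name: it treats the XMM register as two 64-bit lanes and gathers bit 63 of each, which is what movmskpd does.

```c
#include <stdint.h>

/* Scalar model of movmskpd: bit 0 of the result is the sign bit of
   lane 0, bit 1 is the sign bit of lane 1; all other bits are 0. */
static unsigned model_movmskpd(const uint64_t lanes[2])
{
    unsigned bit0 = (unsigned)(lanes[0] >> 63);
    unsigned bit1 = (unsigned)(lanes[1] >> 63);
    return bit0 | (bit1 << 1);
}
```

After cmpeqpd, each lane is either all 1 bits or all 0 bits, so the result is 0 (nene), 1 (neeq, only lane 0 equal), 2 (eqne, only lane 1 equal), or 3 (eqeq), matching the order of the jmpTable entries.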

11.13 浮点转换指令

在之前的描述中，我介绍了几条指令，用于在各种标量浮点数和整数格式之间转换数据（详见第六章的“SSE 浮点转换”）。这些指令的变体也存在于打包数据转换中。表 11-31 列出了你将常用的许多这些指令。

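Table 11-31: SSE Conversion Instructions

Instruction syntax	Description
`cvtdq2pd xmm[dest], xmm[src]/mem[64]`	Converts two packed signed dword integers from XMM[src]/mem[64] to two packed double-precision floating-point values in XMM[dest]. If YMM registers are present, this instruction leaves the HO bits unchanged.
`vcvtdq2pd xmm[dest], xmm[src]/mem[64]`	(AVX) Converts two packed signed dword integers from XMM[src]/mem[64] to two packed double-precision floating-point values in XMM[dest]. This instruction stores 0s into the HO bits of the underlying YMM register.
`vcvtdq2pd ymm[dest], xmm[src]/mem[128]`	(AVX) Converts four packed signed dword integers from XMM[src]/mem[128] to four packed double-precision floating-point values in YMM[dest].
`cvtdq2ps xmm[dest], xmm[src]/mem[128]`	Converts four packed signed dword integers from XMM[src]/mem[128] to four packed single-precision floating-point values in XMM[dest]. If YMM registers are present, this instruction leaves the HO bits unchanged.
`vcvtdq2ps xmm[dest], xmm[src]/mem[128]`	(AVX) Converts four packed signed dword integers from XMM[src]/mem[128] to four packed single-precision floating-point values in XMM[dest]. This instruction stores 0s into the HO bits of the underlying YMM register.
`vcvtdq2ps ymm[dest], ymm[src]/mem[256]`	(AVX) Converts eight packed signed dword integers from YMM[src]/mem[256] to eight packed single-precision floating-point values in YMM[dest].
`cvtpd2dq xmm[dest], xmm[src]/mem[128]`	Converts two packed double-precision floating-point values from XMM[src]/mem[128] to two packed signed dword integers in XMM[dest]. If YMM registers are present, this instruction leaves the HO bits unchanged. The conversion uses the current rounding mode in the MXCSR.
`vcvtpd2dq xmm[dest], xmm[src]/mem[128]`	(AVX) Converts two packed double-precision floating-point values from XMM[src]/mem[128] to two packed signed dword integers in XMM[dest]. This instruction stores 0s into the HO bits of the underlying YMM register. The conversion uses the current rounding mode in the MXCSR.
`vcvtpd2dq xmm[dest], ymm[src]/mem[256]`	(AVX) Converts four packed double-precision floating-point values from YMM[src]/mem[256] to four packed signed dword integers in XMM[dest]. The conversion uses the current rounding mode in the MXCSR.
`cvtpd2ps xmm[dest], xmm[src]/mem[128]`	Converts two packed double-precision floating-point values from XMM[src]/mem[128] to two packed single-precision floating-point values in XMM[dest]. If YMM registers are present, this instruction leaves the HO bits unchanged.
`vcvtpd2ps xmm[dest], xmm[src]/mem[128]`	(AVX) Converts two packed double-precision floating-point values from XMM[src]/mem[128] to two packed single-precision floating-point values in XMM[dest]. This instruction stores 0s into the HO bits of the underlying YMM register.
`vcvtpd2ps xmm[dest], ymm[src]/mem[256]`	(AVX) Converts four packed double-precision floating-point values from YMM[src]/mem[256] to four packed single-precision floating-point values in XMM[dest].
`cvtps2dq xmm[dest], xmm[src]/mem[128]`	Converts four packed single-precision floating-point values from XMM[src]/mem[128] to four packed signed dword integers in XMM[dest]. If YMM registers are present, this instruction leaves the HO bits unchanged. The conversion uses the current rounding mode in the MXCSR.
`vcvtps2dq xmm[dest], xmm[src]/mem[128]`	(AVX) Converts four packed single-precision floating-point values from XMM[src]/mem[128] to four packed signed dword integers in XMM[dest]. This instruction stores 0s into the HO bits of the underlying YMM register. The conversion uses the current rounding mode in the MXCSR.
`vcvtps2dq ymm[dest], ymm[src]/mem[256]`	(AVX) Converts eight packed single-precision floating-point values from YMM[src]/mem[256] to eight packed signed dword integers in YMM[dest]. The conversion uses the current rounding mode in the MXCSR.
`cvtps2pd xmm[dest], xmm[src]/mem[64]`	Converts two packed single-precision floating-point values from XMM[src]/mem[64] to two packed double-precision values in XMM[dest]. If YMM registers are present, this instruction leaves the HO bits unchanged.
`vcvtps2pd xmm[dest], xmm[src]/mem[64]`	(AVX) Converts two packed single-precision floating-point values from XMM[src]/mem[64] to two packed double-precision values in XMM[dest]. This instruction stores 0s into the HO bits of the underlying YMM register.
`vcvtps2pd ymm[dest], xmm[src]/mem[128]`	(AVX) Converts four packed single-precision floating-point values from XMM[src]/mem[128] to four packed double-precision values in YMM[dest].
`cvttpd2dq xmm[dest], xmm[src]/mem[128]`	Converts two packed double-precision floating-point values from XMM[src]/mem[128] to two packed signed dword integers in XMM[dest], using truncation. If YMM registers are present, this instruction leaves the HO bits unchanged.
`vcvttpd2dq xmm[dest], xmm[src]/mem[128]`	(AVX) Converts two packed double-precision floating-point values from XMM[src]/mem[128] to two packed signed dword integers in XMM[dest], using truncation. This instruction stores 0s into the HO bits of the underlying YMM register.
`vcvttpd2dq xmm[dest], ymm[src]/mem[256]`	(AVX) Converts four packed double-precision floating-point values from YMM[src]/mem[256] to four packed signed dword integers in XMM[dest], using truncation.
`cvttps2dq xmm[dest], xmm[src]/mem[128]`	Converts four packed single-precision floating-point values from XMM[src]/mem[128] to four packed signed dword integers in XMM[dest], using truncation. If YMM registers are present, this instruction leaves the HO bits unchanged.
`vcvttps2dq xmm[dest], xmm[src]/mem[128]`	(AVX) Converts four packed single-precision floating-point values from XMM[src]/mem[128] to four packed signed dword integers in XMM[dest], using truncation. This instruction stores 0s into the HO bits of the underlying YMM register.
`vcvttps2dq ymm[dest], ymm[src]/mem[256]`	(AVX) Converts eight packed single-precision floating-point values from YMM[src]/mem[256] to eight packed signed dword integers in YMM[dest], using truncation.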

11.14 对齐 SIMD 内存访问

大多数 SSE 和 AVX 指令要求它们的内存操作数位于 16 字节（SSE）或 32 字节（AVX）边界上，但这并不总是可能的。处理未对齐内存地址的最简单方法是使用不要求对齐内存操作数的指令，例如 movdqu、movups 和 movupd。然而，使用未对齐的数据移动指令的性能损失通常会削弱使用 SSE/AVX 指令的初衷。

相反，SIMD 指令对齐数据的技巧是通过使用标准的通用寄存器处理前几个数据项，直到达到一个正确对齐的地址。例如，假设你想使用 pcmpeqb 指令比较一个大字节数组中的 16 字节块。pcmpeqb 要求它的内存操作数位于 16 字节对齐的地址上，因此如果内存操作数尚未 16 字节对齐，你可以使用标准（非 SSE）指令处理数组中的前 1 到 15 个字节，直到你达到一个适合 pcmpeqb 的地址。例如：

cmpLp:  mov  al, [rsi]
        cmp  al, someByteValue
        je   foundByte
        inc  rsi
        test rsi, 0Fh
        jnz  cmpLp
 `Use SSE instructions here, as RSI is now 16-byte-aligned`

将 RSI 与 0Fh 进行按位与运算，如果 RSI 的低 4 位包含 0，则会产生 0 的结果（并设置零标志）。如果 RSI 的低 4 位包含 0，则它所包含的地址是按 16 字节边界对齐的。^(16)

这种方法的唯一缺点是，在获得适当地址之前，你必须单独处理最多 15 字节。那是 6 × 15，或者 90 条机器指令。然而，对于大块数据（例如，超过约 48 或 64 字节），你可以摊销单字节比较的成本，这种方法就不那么糟糕了。

为了提高这段代码的性能，你可以修改初始地址，使其从 16 字节边界开始。将 RSI 中的值（在这个特定示例中）与 0FFFFFFFFFFFFFFF0h（–16）进行按位与运算，修改 RSI，使其保存包含原始地址的 16 字节块的起始地址。^(17)

 and  rsi, -16

为了避免匹配数据结构开始前的非预期字节，我们可以创建一个掩码来覆盖多余的字节。例如，假设我们使用以下指令序列快速比较每次 16 字节：

           sub      rsi, 16
cmpLp:     add      rsi, 16
           movdqa   xmm0, xmm2   ; XMM2 contains bytes to test
           pcmpeqb  xmm0, [rsi]
           pmovmskb eax, xmm0
           test     eax, eax     ; Integer test; ptest works only on XMM operands
           jz       cmpLp

如果在执行此代码之前使用 AND 指令对 RSI 寄存器进行对齐，那么在比较前 16 字节时，我们可能会得到错误的结果。为了解决这个问题，我们可以创建一个掩码，消除任何来自非预期比较的位。为了创建这个掩码，我们从所有 1 位开始，并将与 16 字节块的开始到我们要比较的第一个实际数据项对应的位清零。可以使用以下表达式来计算此掩码：

-1 << (startAdrs & 0xF)  ; Note: -1 is all 1 bits

这将在数据比较之前的位置创建 0 位，并在其后创建 1 位（对于前 16 字节）。我们可以使用此掩码将 pmovmskb 指令的非预期位结果清零。以下代码片段演示了这一技巧：

           mov    rcx, rsi
           and    rsi, -16   ; Align to 16 bytes
           and    ecx, 0fH   ; Strip out offset of start of data
           mov    ebx, -1    ; 0FFFFFFFFh – all 1 bits
           shl    ebx, cl    ; Create mask

; Special case for the first 1 to 16 bytes:

           movdqa   xmm0, xmm2
           pcmpeqb  xmm0, [rsi]
           pmovmskb eax, xmm0
           and      eax, ebx
           jnz      foundByte
cmpLp:     add      rsi, 16
           movdqa   xmm0, xmm2   ; XMM2 contains bytes to test
           pcmpeqb  xmm0, [rsi]
           pmovmskb eax, xmm0
           test     eax, eax
           jz       cmpLp
foundByte:
 `Do whatever needs to be done when the block of 16 bytes`
 `contains at least one match between the bytes in XMM2`
 `and the data at RSI`

假设例如，地址已经对齐到 16 字节边界。将该值与 0Fh 做按位与操作会得到 0。将 -1 向左移动零位得到 -1（全是 1 位）。稍后，当代码将其与通过 pcmpeqb 和 pmovmskb 指令获得的掩码做按位与操作时，结果不会改变。因此，代码测试所有 16 个字节（如果原始地址是 16 字节对齐的话，我们希望如此）。

当 RSI 中的地址在低 4 位中具有 0001b 的值时，实际数据从 16 字节块的偏移量 1 开始。因此，在将 XMM2 中的值与 [RSI] 处的 16 字节进行比较时，我们希望忽略第一个字节。在这种情况下，掩码为 0FFFFFFFEh，除了位 0 为 0 外，其余均为 1。比较后，如果 EAX 的第 0 位包含 1（表示偏移 0 处的字节匹配），则按位与操作会将该位消除（将其替换为 0），以免影响比较。同样，如果块的起始偏移量是 2、3、...、15，shl 指令会修改 EBX 中的位掩码，将这些偏移量处的字节从首次比较操作中排除。结果是，只需 11 条指令即可完成与原始（逐字节比较）示例中最多 90 条指令相同的工作。
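
The mask arithmetic can be checked with a small scalar C sketch. The helper names are hypothetical; the code assumes the address is available as an integer (uintptr_t) and uses an unsigned all-ones constant, because left-shifting a negative value is undefined in C even though shl on -1 is well defined on the CPU.

```c
#include <stdint.h>

/* Round an address down to the start of its 16-byte block
   (the "and rsi, -16" step). */
static uintptr_t align_down_16(uintptr_t addr)
{
    return addr & ~(uintptr_t)0xF;
}

/* Mask for the first block: 0 bits for the bytes that precede the
   real data, 1 bits from the first real byte onward
   (the "mov ebx, -1 / shl ebx, cl" steps). */
static uint32_t first_block_mask(uintptr_t start_addr)
{
    unsigned offset = (unsigned)(start_addr & 0xF);
    return 0xFFFFFFFFu << offset;
}
```

ANDing the pmovmskb result with first_block_mask(start) discards any matches that belong to bytes before the start of the data; for an already aligned start address the mask is all 1 bits and nothing is discarded.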

11.15 对齐字、双字和四字对象地址

在对齐非字节大小的对象时，你可以按对象的大小（以字节为单位）增加指针，直到获得一个 16 字节（或 32 字节）对齐的地址。但是，只有当对象大小为 2、4 或 8 时，这种方法才有效（因为其他任何值都可能错过那些是 16 的倍数的地址）。

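For example, you can process the first several elements of an array of word objects (where the first element of the array appears at an even address in memory) one word at a time, incrementing the pointer by 2, until you obtain an address that is divisible by 16 (or 32). Note that this scheme works only if the array begins at an address that is a multiple of the element size. For example, if an array of word values begins at an odd address, repeatedly adding 2 will never produce an address divisible by 16 or 32, and you won't be able to process that data with SSE/AVX instructions without first moving it to another, properly aligned location in memory.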

11.16 用多个相同值填充 XMM 寄存器

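For many SIMD algorithms, you will want multiple copies of the same value in an XMM or YMM register. You can use the (v)movddup, (v)movshdup, (v)pinsrd, (v)pinsrq, and (v)pshufd instructions for single- and double-precision floating-point values. For example, if you have a single-precision floating-point value, r4var, in memory and you want to replicate it throughout XMM0, you could use the following code: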

movss  xmm0, r4var
pshufd xmm0, xmm0, 0    ; Lanes 3, 2, 1, and 0 from lane 0

要将一对双精度浮点数从 r8var 复制到 XMM0 中，你可以使用：

movsd  xmm0, r8var
pshufd xmm0, xmm0, 44h  ; Lane 0 to lanes 0 and 2, lane 1 to lanes 1 and 3

当然，pshufd实际上是为双字整数操作设计的，因此在movsd或movss之后立即使用pshufd可能会涉及额外的延迟（时间）。尽管pshufd允许内存操作数，但该操作数必须是 16 字节对齐的 128 位内存操作数，因此它不适用于通过 XMM 寄存器直接复制浮点值。

对于双精度浮点值，你可以使用movddup将单个 64 位浮点数复制到 XMM 寄存器的低位到高位：

movddup xmm0, r8var

movddup指令允许不对齐的 64 位内存操作数，因此它可能是复制双精度值的最佳选择。

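To replicate byte, word, dword, or qword integer values throughout an XMM register, the pshufb, pshuflw, and pshufd instructions are good choices (there is no pshufq instruction; a qword can be replicated with pshufd and a selector of 44h, as in the preceding double-precision example). For example, to replicate a single byte throughout XMM0, you could use the following sequence: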

movzx  eax, byteToCopy
movd   xmm0, eax
pxor   xmm1, xmm1   ; Mask to copy byte 0 throughout
pshufb xmm0, xmm1

XMM1 操作数是一个包含掩码的字节数组，用于将数据从 XMM0 中的位置复制到 XMM0 自身。值 0 将 XMM0 中的字节 0 复制到 XMM0 中的所有其他位。通过简单地更改 XMM1 中的掩码值，你可以使用相同的代码复制字、双字和四字。或者，你也可以使用pshuflw或pshufd指令来完成此任务。这里是另一个变体，它将一个字节复制到 XMM0 中的所有位置：

movzx     eax, byteToCopy
mov       ah, al
movd      xmm0, eax
punpcklbw xmm0, xmm0    ; Copy bytes 0 and 1 to 2 and 3
pshufd    xmm0, xmm0, 0 ; Copy LO dword throughout

11.17 将一些常见常量加载到 XMM 和 YMM 寄存器

没有 SSE/AVX 指令可以将立即数常量加载到寄存器中。然而，你可以使用几个惯用法（技巧）将某些常见的常量值加载到 XMM 或 YMM 寄存器中。本节讨论了这些惯用法的一些例子。

向 SSE/AVX 寄存器加载 0 使用的惯用法与通用整数寄存器相同：将寄存器与自身进行异或。例如，要将 XMM0 中的所有位设置为 0，你可以使用以下指令：

pxor xmm0, xmm0

要将 XMM 或 YMM 寄存器中的所有位设置为 1，你可以使用pcmpeqb指令，如下所示：

pcmpeqb xmm0, xmm0

因为任何给定的 XMM 或 YMM 寄存器都等于它自身，所以该指令将 0FFh 存储到 XMM0 的所有字节中（或者任何你指定的 XMM 或 YMM 寄存器中）。

如果你想将 8 位值 01h 加载到 XMM 寄存器的所有 16 个字节中，你可以使用以下代码（来自 Intel）：

pxor    xmm0, xmm0
pcmpeqb xmm1, xmm1
psubb   xmm0, xmm1   ; 0 - (-1) is (1)

如果你想创建 16 位或 32 位结果（例如，XMM0 中的四个 32 位双字，每个包含值 00000001h），你可以在此示例中将psubb替换为psubw或psubd。

如果你希望 1 位位于不同的比特位置（而不是每个字节的比特 0），你可以在前面的序列之后使用pslld指令来重新定位这些位。例如，如果你想将 XMM0 寄存器加载为 8080808080808080h，你可以使用以下指令序列：

pxor    xmm0, xmm0
pcmpeqb xmm1, xmm1
psubb   xmm0, xmm1
pslld   xmm0, 7         ; 01h -> 80h in each byte

当然，你可以为pslld提供不同的立即数常量，以将寄存器中的每个字节加载为 02h、04h、08h、10h、20h 或 40h。

这是一个巧妙的技巧，你可以用它将 2^(n) – 1（直到第 n 位的所有 1 位）加载到 SSE/AVX 寄存器的所有通道中：^(18)

; For 16-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
psrlw    xmm0, 16 - `n`   ; Clear top 16 - `n` bits of xmm0

; For 32-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
psrld    xmm0, 32 - `n`   ; Clear top 32 - `n` bits of each lane

; For 64-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
psrlq    xmm0, 64 - `n`   ; Clear top 64 - `n` bits of each lane

你还可以通过左移而非右移来加载反向位（NOT(2^(n) – 1)，即从第 n 位到寄存器末尾的所有 1 位）：

; For 16-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
psllw    xmm0, `n`        ; Clear bottom `n` bits of xmm0

; For 32-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
pslld    xmm0, `n`        ; Clear bottom `n` bits of xmm0

; For 64-bit lanes:

pcmpeqd  xmm0, xmm0     ; Set all bits to 1
psllq    xmm0, `n`        ; Clear bottom `n` bits of xmm0
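
The values these two idioms leave in each lane can be checked with a scalar C sketch of a single 16-bit lane. The function names are hypothetical; the wider lane sizes follow the same pattern with 32 or 64 in place of 16.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit lane after pcmpeqd (all ones) followed by psrlw by 16 - n:
   the low n bits are 1, that is, 2^n - 1. Valid for 0 <= n <= 16. */
static uint16_t low_ones_16(unsigned n)
{
    assert(n <= 16);
    return (uint16_t)(0xFFFFu >> (16 - n));
}

/* One 16-bit lane after pcmpeqd followed by psllw by n:
   bits n through 15 are 1, that is, NOT(2^n - 1). Valid for 0 <= n <= 16. */
static uint16_t high_ones_16(unsigned n)
{
    assert(n <= 16);
    return (uint16_t)(0xFFFFu << n);
}
```

For any n, the two results are bitwise complements of each other, which is why the left-shift form gives the inverted mask.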

当然，你也可以通过将常量放入内存位置（最好是 16 字节或 32 字节对齐）来加载常量到 XMM 或 YMM 寄存器中，然后使用movdqu或movdqa指令将该值加载到寄存器中。不过，值得注意的是，如果内存中的数据没有出现在缓存中，这样的操作可能会比较慢。另一种可能性是，如果常量足够小，可以将常量加载到 32 位或 64 位的整数寄存器中，并使用movd或movq将该值复制到 XMM 寄存器中。

11.18 设置、清除、反转和测试 SSE 寄存器中的单个位

这是 Raymond Chen 提出的另一组技巧（blogs.msdn.microsoft.com/oldnewthing/20141222-00/?p=43333/），用于设置、清除或测试 XMM 寄存器中的单个位。

要设置单个位（假设第 n 位是常量），同时清除其他所有位，你可以使用以下宏：

; setXBit - Sets bit `n` in SSE register xReg.

setXBit  macro   xReg, n
         pcmpeqb xReg, xReg   ; Set all bits in xReg
         psrlq   xReg, 63     ; Set both 64-bit lanes to 01h
         if      n lt 64
         psrldq  xReg, 8      ; Clear the upper lane
         else
         pslldq  xReg, 8      ; Clear the lower lane
         endif
         if      (n and 3fh) ne 0
         psllq   xReg, (n and 3fh)
         endif
         endm

一旦你可以用单个位的值填充 XMM 寄存器，你就可以使用该寄存器的值在另一个 XMM 寄存器中设置、清除、反转或测试该位。例如，要在 XMM1 中设置第 n 位，而不影响 XMM1 中的其他位，你可以使用以下代码序列：

setXBit xmm0, `n`      ; Set bit `n` in XMM1 to 1 without
por     xmm1, xmm0   ; affecting any other bits

To clear bit n in an XMM register, you use the same instruction sequence but substitute the vpandn (and not) instruction for the por instruction:

setXBit xmm0, `n`            ; Clear bit `n` in XMM1 without
vpandn  xmm1, xmm0, xmm1   ; affecting any other bits

To invert a bit, simply substitute pxor for por or vpandn:

setXBit xmm0, `n`      ; Invert bit `n` in XMM1 without
pxor    xmm1, xmm0   ; affecting any other bits

要测试一个位是否已设置，你有几种选择。如果你的 CPU 支持 SSE4.1 指令集扩展，你可以使用ptest指令：

setXBit xmm0, `n`      ; Test bit `n` in XMM1
ptest   xmm1, xmm0
jnz     bitNisSet    ; Fall through if bit `n` is clear

如果你使用的是不支持ptest指令的老款 CPU，你可以如下使用pmovmskb：

; Remember, psllq shifts bits, not bytes.
; If bit `n` is not in bit position 7 of a given
; byte, then move it there. For example, if `n` = 0, then
; (7 - (0 and 7)) is 7, so psllq moves bit 0 to bit 7.

movdqa   xmm0, xmm1
if       7 - (`n` and 7)
psllq    xmm0, 7 - (`n` and 7)
endif

; Now that the desired bit to test is sitting in bit position
; 7 of *some* byte, use pmovmskb to extract all bit 7s into AX:

pmovmskb eax, xmm0

; Now use the (integer) test instruction to test that bit:

test    ax, 1 shl (`n` / 8)
jnz     bitNisSet
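
The bit bookkeeping in the pmovmskb variant is easy to get wrong, so here is a scalar C sketch that mirrors it step by step. The function name is hypothetical, and the 128-bit register is modeled as two 64-bit lanes (lane 0 holds bits 0 to 63).

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if bit n of the 128-bit value is set, using the same steps
   as the psllq / pmovmskb / test sequence. */
static int test_bit_via_pmovmskb(const uint64_t lanes[2], unsigned n)
{
    assert(n < 128);

    /* psllq: move bit n into bit 7 of the byte that contains it. */
    unsigned shift = 7 - (n & 7);
    uint64_t lo = lanes[0] << shift;
    uint64_t hi = lanes[1] << shift;

    /* pmovmskb: gather bit 7 of each of the 16 bytes into a 16-bit mask. */
    unsigned mask = 0;
    for (int i = 0; i < 8; i++) {
        mask |= (unsigned)((lo >> (8 * i + 7)) & 1) << i;
        mask |= (unsigned)((hi >> (8 * i + 7)) & 1) << (i + 8);
    }

    /* test ax, 1 shl (n / 8): byte n / 8 holds the bit of interest. */
    return (mask & (1u << (n / 8))) != 0;
}
```

Because the shift never exceeds 7, the bit stays inside its own byte, so the byte index n / 8 still identifies the correct mask bit after the shift.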

11.19 使用单一递增索引处理两个向量

有时你的代码需要同时处理两块数据，在循环执行过程中，指针会同时递增到两块数据中。

一种简单的方法是使用缩放索引寻址模式。如果 R8 和 R9 中包含指向你要处理的数据的指针，你可以通过使用如下代码遍历两个数据块：

          dec rcx
blkLoop:  inc rcx
          mov eax, [r8][rcx * 4]
          cmp eax, [r9][rcx * 4]
          je  theyreEqual
          cmp eax, sentinelValue
          jne blkLoop

这段代码通过两个 dword 数组进行遍历并比较值（用于搜索数组中相同索引处的相等值）。该循环使用了四个寄存器：EAX 用于比较数组中的两个值，两个数组的指针（R8 和 R9），然后是 RCX 索引寄存器，用于遍历两个数组。

通过在此循环中递增 R8 和 R9 寄存器，可以消除循环中的 RCX（假设修改 R8 和 R9 中的值是可以接受的）：

          sub r8, 4
          sub r9, 4
blkLoop:  add r8, 4
          add r9, 4
          mov eax, [r8]
          cmp eax, [r9]
          je  theyreEqual
          cmp eax, sentinelValue
          jne blkLoop

这种方案在循环中需要额外的 add 指令。如果该循环的执行速度至关重要，插入这条额外的加法指令可能会成为一个障碍。

然而，你可以使用一个巧妙的技巧，这样你每次迭代时只需要增量一个寄存器：

          sub r9, r8            ; R9 = R9 - R8
          sub r8, 4
blkLoop:  add r8, 4
          mov eax, [r8]
          cmp eax, [r9][r8 * 1] ; Address = R9 + R8
          je  theyreEqual
          cmp eax, sentinelValue
          jne blkLoop

注释在这里是因为它们解释了所使用的技巧。在代码的开始部分，你从 R9 中减去 R8 的值，并将结果保留在 R9 中。在循环体内，你通过使用 [r9][r8 * 1] 缩放索引寻址模式来补偿这个减法（其有效地址是 R8 和 R9 的和，从而恢复 R9 至其原始值，至少在循环的第一次迭代时是如此）。现在，因为 cmp 指令的内存地址是 R8 和 R9 的和，向 R8 加 4 也会将 4 加到 cmp 指令使用的有效地址上。因此，在每次循环迭代时，mov 和 cmp 指令会查看各自数组的连续元素，但代码只需要增量一个指针。
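
The same idea can be expressed in C as a rough sketch. The function name is hypothetical, and there is one assumption that the assembly version does not need: standard C defines the difference of two pointers only when both point into the same array object, so this version expects the two blocks to live in one underlying array.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first position where a[i] == b[i], or -1 if
   the sentinel is reached in a first. Both pointers must point into
   the same underlying array so that b - a is well defined in C. */
static ptrdiff_t find_equal(const int32_t *a, const int32_t *b,
                            int32_t sentinel)
{
    ptrdiff_t delta = b - a;               /* like "sub r9, r8"        */

    for (const int32_t *p = a; ; p++) {    /* only one pointer moves   */
        int32_t v = *p;                    /* like "mov eax, [r8]"     */
        if (v == p[delta])                 /* address = p + delta      */
            return p - a;
        if (v == sentinel)
            return -1;
    }
}
```

For instance, with a buffer holding {1, 2, 3, -1, 9, 8, 3, 7}, calling find_equal(buf, buf + 4, -1) compares the first half against the second half and finds the match at index 2.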

这种方案在使用 SSE 和 AVX 指令处理 SIMD 数组时特别有效，因为 XMM 和 YMM 寄存器分别是 16 字节和 32 字节，所以你不能使用正常的缩放因子（1、2、4 或 8）来索引打包数据值的数组。你最终会在遍历数组时必须将 16（或 32）加到指针上，从而失去缩放索引寻址模式的一个优势。例如：

; Assume R9 and R8 point at (32-byte-aligned) arrays of 20 double values.
; Assume R10 points at a (32-byte-aligned) destination array of 20 doubles.

          sub     r9, r8     ; R9 = R9 - R8
          sub     r10, r8    ; R10 = R10 – R8
          sub     r8, 32
          mov     ecx, 5     ; Vector with 20 (5 * 4) double values
addLoop:  add     r8, 32
          vmovapd ymm0, [r8]
          vaddpd  ymm0, ymm0, [r9][r8 * 1] ; Address = R9 + R8
          vmovapd [r10][r8 * 1], ymm0      ; Address = R10 + R8
          dec     ecx
          jnz     addLoop

11.20 对齐两个地址到边界

前面的 vmovapd 和 vaddpd 指令要求它们的内存操作数必须是 32 字节对齐的，否则会触发一般保护错误（内存访问违规）。如果你能够控制数组在内存中的位置，可以为数组指定对齐方式。如果你无法控制数据在内存中的位置，则有两种选择：无论性能损失如何，处理非对齐数据，或将数据移动到合适对齐的位置。

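If you must work with unaligned data, you can substitute an unaligned move for the aligned one (for example, vmovupd in place of vmovapd), or use an unaligned move to load the data into a YMM register and then operate on the data in that register with the instruction you want. For example: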

addLoop:  add     r8, 32
          vmovupd ymm0, [r8]
          vmovupd ymm1, [r9][r8 * 1]  ; Address = R9 + R8
          vaddpd  ymm0, ymm0, ymm1
          vmovupd [r10][r8 * 1], ymm0 ; Address = R10 + R8
          dec     ecx
          jnz     addLoop

可惜的是，vaddpd 指令不支持非对齐的内存访问，因此在进行打包加法操作之前，你必须先将第二个数组（由 R9 指向）的值加载到另一个寄存器（YMM1）中。这就是非对齐访问的缺点：不仅非对齐的移动操作更慢，而且你可能还需要使用额外的寄存器和指令来处理非对齐的数据。

当你有一个将在未来反复使用的数据操作数时，将数据移动到一个你可以控制其对齐方式的内存位置是一个选择。移动数据是一项昂贵的操作；然而，如果你有一个标准数据块将要与许多其他数据块进行比较，你可以将移动该数据块到新位置的成本分摊到所有需要执行的操作上。

移动数据尤其在当数据数组之一（或两者）出现在一个不是子元素大小的整数倍的地址时非常有用。例如，如果你有一个双字数组，它从一个奇数地址开始，你将永远无法将指针对齐到该数组数据的 16 字节边界，除非你移动数据。

11.21 处理长度不是 SSE/AVX 寄存器大小倍数的数据块

使用 SIMD 指令处理一个大数据集，同时处理 2、4、8、16 或 32 个值，通常可以使 SIMD 算法（向量化算法）的运行速度比 SISD（标量）算法快一个数量级。然而，有两个边界条件会带来问题：数据集的开始（当起始地址可能没有正确对齐时）和数据集的结束（当没有足够的数组元素来完全填充 XMM 或 YMM 寄存器时）。我已经处理了数据集开始部分的问题（数据未对齐）。本节将讨论后者的问题。

在大多数情况下，当数组末尾的数据用尽时（而 XMM 和 YMM 寄存器需要更多数据来执行打包操作），你可以使用前面提到的相同技术来对齐指针：将比必要的更多数据加载到寄存器中，并屏蔽掉不需要的结果。例如，如果在字节数组中只剩下 8 个字节需要处理，你可以加载 16 个字节，执行操作，并忽略最后 8 个字节的结果。在我在过去几节中使用的比较循环示例中，你可以执行以下操作：

movdqa   xmm0, [r8]
pcmpeqb  xmm0, [r9]    ; Byte compare, to match the byte-array example
pmovmskb eax, xmm0
and      eax, 0ffh     ; Mask out the last 8 compares
cmp      eax, 0ffh
je       matchedData

在大多数情况下，访问数据结构末尾之外的数据（例如，访问本例中 R8、R9 指向的数据，或两者）是无害的。然而，正如你在第三章“内存访问和 4K 内存管理单元页面”中看到的那样，如果额外的数据恰好跨越了内存管理单元页面，并且该新页面不允许读取访问，那么 CPU 会生成一个通用保护故障（内存访问或分段故障）。因此，除非你知道有效数据在内存中紧随数组之后（至少在指令引用的范围内），否则你不应该访问该内存区域；这样做可能会导致你的软件崩溃。

这个问题有两种解决方案。首先，你可以在与寄存器大小相同的地址边界上对内存访问进行对齐（例如，XMM 寄存器的 16 字节对齐）。使用 SSE/AVX 指令访问数据结构末尾以外的数据将不会跨越页面边界（因为在 16 字节边界上对齐的 16 字节访问总是会落在同一 MMU 页面内，32 字节对齐的 32 字节访问也一样）。

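The second solution is to examine the memory address before accessing the memory. Although you cannot touch the new page without possibly triggering an access fault,^(19) you can check the address itself to see whether accessing 16 (or 32) bytes at that address would reach into a new page. If it would, you can take precautions before accessing the data on the next page. For example, rather than continuing to process the data in SIMD mode, you could drop into SISD mode and finish processing the data to the end of the array with standard scalar instructions.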

要测试 SIMD 访问是否会跨越 MMU 页面边界，假设 R9 包含你即将使用 SSE 指令访问内存中 16 个字节的地址，可以使用如下代码：

mov  eax, r9d
and  eax, 0fffh
cmp  eax, 0ff0h
ja   willCrossPage

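Each MMU page is 4KB in size and is located on a 4KB address boundary in memory. Therefore, the LO 12 bits of an address give the offset of that address within its MMU page. The preceding code checks whether the address has a page offset greater than 0FF0h (4080). If so, accessing 16 bytes starting at that address will cross a page boundary. To check a 32-byte access, compare against 0FE0h instead.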
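
The same test generalizes to any access size, as the following C sketch suggests. The function name is hypothetical, and the 4KB page size is the assumption stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if reading `size` bytes starting at `addr` would
   extend past the end of the 4KB page that contains `addr`. */
static int will_cross_page(uintptr_t addr, unsigned size)
{
    assert(size >= 1 && size <= 4096);
    return (addr & 0xFFF) > (4096u - size);
}
```

With size = 16 the threshold is 0FF0h, and with size = 32 it is 0FE0h, matching the constants in the assembly version.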

11.22 动态测试 CPU 特性

在本章的开始，我提到过，当测试 CPU 功能集以确定它支持哪些扩展时，最好的解决方案是根据某些功能的存在或缺失动态选择一组函数。为了演示如何动态测试并使用（或避免使用）某些 CPU 特性——特别是测试 AVX 扩展的存在——我将修改（并扩展）我至今在示例中使用的print过程。

我一直在使用的print过程非常方便，但它没有保留任何 SSE 或 AVX 寄存器，而printf()调用可能（合法地）修改这些寄存器。print的通用版本应该保留易失性的 XMM 和 YMM 寄存器以及通用寄存器。

问题在于，你不能编写一个适用于所有 CPU 的通用版本的print。如果只保留 XMM 寄存器，代码将在任何 x86-64 CPU 上运行。然而，如果 CPU 支持 AVX 扩展，并且程序使用了 YMM0 到 YMM5 寄存器，那么打印例程将只保留这些寄存器的低 128 位，因为它们与对应的 XMM 寄存器是别名。如果你保存了易失的 YMM 寄存器，代码将在不支持 AVX 扩展的 CPU 上崩溃。因此，诀窍是编写代码，动态地确定 CPU 是否具有 AVX 寄存器，并在它们存在时保留这些寄存器，否则只保留 SSE 寄存器。

实现这一点的简单方法，可能也是print函数最合适的解决方案，就是将cpuid指令直接嵌入print中，并在保存（和恢复）寄存器之前立即测试结果。以下是一个代码片段，展示了如何实现这一点：

AVXSupport  =     10000000h              ; Bit 28

print       proc

; Preserve all the volatile registers
; (be nice to the assembly code that
; calls this procedure):

            push    rax
            push    rbx                  ; CPUID messes with EBX
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11

; Reserve space on the stack for the AVX/SSE registers.
; Note: SSE registers need only 96 bytes, but the code
; is easier to deal with if we reserve the full 192 bytes
; that the AVX registers need and ignore the extra 96
; bytes when running SSE code.

            sub     rsp, 192

; Determine if we have to preserve the YMM registers:

            mov     eax, 1
            cpuid
            test    ecx, AVXSupport      ; Test bit 28 (AVX)
            jnz     preserveAVX

; No AVX support, so just preserve the XMM0 to XMM5 registers:

            movdqu  xmmword ptr [rsp + 00], xmm0
            movdqu  xmmword ptr [rsp + 16], xmm1
            movdqu  xmmword ptr [rsp + 32], xmm2
            movdqu  xmmword ptr [rsp + 48], xmm3
            movdqu  xmmword ptr [rsp + 64], xmm4
            movdqu  xmmword ptr [rsp + 80], xmm5
            jmp     restOfPrint

; YMM0 to YMM5 are considered volatile, so preserve them:

preserveAVX: 
            vmovdqu ymmword ptr [rsp + 000], ymm0
            vmovdqu ymmword ptr [rsp + 032], ymm1
            vmovdqu ymmword ptr [rsp + 064], ymm2
            vmovdqu ymmword ptr [rsp + 096], ymm3
            vmovdqu ymmword ptr [rsp + 128], ymm4
            vmovdqu ymmword ptr [rsp + 160], ymm5

restOfPrint:
        `The rest of the print function goes here`

在print函数的末尾，当需要恢复所有内容时，你可以进行另一项测试，以确定是否恢复 XMM 或 YMM 寄存器。^(20)

对于其他函数，如果你不希望每次调用函数时都承担cpuid（以及保存它影响的所有寄存器）的开销，诀窍是编写三个函数：一个用于 SSE CPU，一个用于 AVX CPU，还有一个特殊的函数（你只需调用一次），该函数选择将来调用这两个中的哪一个。使这个方案高效的魔法是间接调用。你不会直接调用这些函数。相反，你将初始化一个指针，并将要调用的函数的地址赋给它，然后通过使用该指针间接调用这三个函数之一。对于当前的示例，我们将这个指针命名为print，并用第三个函数choosePrint的地址初始化它：

 .data
print     qword   choosePrint

这是choosePrint的代码：

; On first call, determine if we support AVX instructions
; and set the "print" pointer to point at print_AVX or
; print_SSE:

choosePrint proc
            push    rax             ; Preserve registers that get
            push    rbx             ; tweaked by CPUID
            push    rcx
            push    rdx

            mov     eax, 1
            cpuid
            test    ecx, AVXSupport ; Test bit 28 for AVX
            jnz     doAVXPrint

            lea     rax, print_SSE  ; From now on, call
            mov     print, rax      ; print_SSE directly

; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_SSE.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_SSE

doAVXPrint: lea     rax, print_AVX  ; From now on, call
            mov     print, rax      ; print_AVX directly

; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_AVX.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_AVX

choosePrint endp

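The print_SSE procedure runs on CPUs without AVX support, and the print_AVX procedure runs on CPUs with AVX support. The choosePrint procedure executes the cpuid instruction to determine whether the CPU supports the AVX extensions; if it does, choosePrint stores the address of print_AVX in the print pointer, and if not, it stores the address of print_SSE there.

choosePrint is not an explicit initialization procedure; you don't have to call it before calling print. The choosePrint procedure executes only once (assuming you call it through the print pointer rather than calling it directly). After that first execution, the print pointer contains the address of the print function appropriate for the CPU, and choosePrint never executes again.

You call through the print pointer just as you would call any other print function; for example: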

call print
byte "Hello, world!", nl, 0

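After setting up the print pointer, choosePrint must transfer control to the appropriate print procedure (print_SSE or print_AVX) to do the work the caller expects. Because the register values choosePrint saved are sitting on the stack, and the actual print routines expect to find only a return address there, choosePrint first restores all the (general-purpose) registers it saved and then jumps to (rather than calls) the appropriate print procedure. It jumps instead of calling because the return address pointing at the format string is already on the top of the stack. When print_SSE or print_AVX returns, control goes back to whoever called choosePrint (through the print pointer).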
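The self-replacing pointer scheme is not specific to assembly. The following C sketch mirrors it under stated assumptions: `cpu_has_avx` is a hypothetical stand-in for the cpuid test, the two print variants only report which one ran, and `chooser_calls` exists purely so the "runs only once" behavior can be observed. None of these names come from Listing 11-5.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: stands in for the CPUID test (leaf 1, ECX bit 28). */
bool cpu_has_avx = false;

/* Counts chooser runs so the "only once" claim can be checked. */
int chooser_calls = 0;

enum print_kind { USED_SSE = 1, USED_AVX = 2 };
typedef enum print_kind (*print_fn)(const char *fmt);

enum print_kind choose_print(const char *fmt);

/* Like "print qword choosePrint": the pointer starts at the chooser. */
print_fn print = choose_print;

/* Placeholder variants; real ones would save XMM or YMM registers. */
enum print_kind print_sse(const char *fmt) { (void)fmt; return USED_SSE; }
enum print_kind print_avx(const char *fmt) { (void)fmt; return USED_AVX; }

enum print_kind choose_print(const char *fmt)
{
    /* The chooser is reachable only while the pointer still targets it. */
    assert(print == choose_print);
    ++chooser_calls;

    /* Rebind the pointer so later calls skip the feature test. */
    print = cpu_has_avx ? print_avx : print_sse;

    /* Forward the original argument; this plays the role of the JMP. */
    return print(fmt);
}
```

Callers simply write `print("Hello, world!\n")`. The first call goes through `choose_print`, and every later call lands directly in the selected variant.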

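Listing 11-5 shows the complete print function, including print_SSE and print_AVX, along with a simple main program that calls print. I've extended print to accept arguments in R10 and R11 in addition to RDX, R8, and R9 (this function reserves RCX for the address of the format string that follows the call to print).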

; Listing 11-5

; Generic print procedure and dynamically
; selecting CPU features.

        option  casemap:none

nl          =       10

; SSE4.2 feature flags (in ECX):

SSE42       =       00180000h       ; Bits 19 and 20
AVXSupport  =       10000000h       ; Bit 28

; CPUID bits (EAX = 7, EBX register)

AVX2Support  =      20h             ; Bit 5 = AVX2

            .const
ttlStr      byte    "Listing 11-5", 0

            .data
            align   qword
print       qword   choosePrint     ; Pointer to print function

; Floating-point values for testing purposes:

fp1         real8   1.0
fp2         real8   2.0
fp3         real8   3.0
fp4         real8   4.0
fp5         real8   5.0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

;***************************************************************

; print - "Quick" form of printf that allows the format string to
;         follow the call in the code stream. Supports up to five
;         additional parameters in RDX, R8, R9, R10, and R11.

; This function saves all the Microsoft ABI–volatile,
; parameter, and return result registers so that code
; can call it without worrying about any registers being
; modified (this code assumes that Windows ABI treats
; YMM8 to YMM15 as nonvolatile).

; Of course, this code assumes that AVX instructions are
; available on the CPU.

; Allows up to 5 arguments in:

;  RDX - Arg #1
;  R8  - Arg #2
;  R9  - Arg #3
;  R10 - Arg #4
;  R11 - Arg #5

; Note that you must pass floating-point values in
; these registers, as well. The printf function
; expects real values in the integer registers. 

; There are two versions of this function, one that
; will run on CPUs without AVX capabilities (no YMM
; registers) and one that will run on CPUs that
; have AVX capabilities (YMM registers). The difference
; between the two is which registers they preserve
; (print_SSE preserves only XMM registers and will
; run properly on CPUs that don't have YMM register
; support; print_AVX will preserve the volatile YMM
; registers on CPUs with AVX support).

; On first call, determine if we support AVX instructions
; and set the "print" pointer to point at print_AVX or
; print_SSE:

choosePrint proc
            push    rax             ; Preserve registers that get
            push    rbx             ; tweaked by CPUID
            push    rcx
            push    rdx

            mov     eax, 1
            cpuid
            test    ecx, AVXSupport ; Test bit 28 for AVX
            jnz     doAVXPrint

            lea     rax, print_SSE  ; From now on, call
            mov     print, rax      ; print_SSE directly

; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_SSE.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_SSE

doAVXPrint: lea     rax, print_AVX  ; From now on, call
            mov     print, rax      ; print_AVX directly

; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_AVX.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_AVX

choosePrint endp

; Version of print that will preserve volatile
; AVX registers (YMM0 to YMM7):

print_AVX   proc

; Preserve all the volatile registers
; (be nice to the assembly code that
; calls this procedure):

            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11

; YMM0 to YMM7 are considered volatile, so preserve them:

            sub     rsp, 256
            vmovdqu ymmword ptr [rsp + 000], ymm0
            vmovdqu ymmword ptr [rsp + 032], ymm1
            vmovdqu ymmword ptr [rsp + 064], ymm2
            vmovdqu ymmword ptr [rsp + 096], ymm3
            vmovdqu ymmword ptr [rsp + 128], ymm4
            vmovdqu ymmword ptr [rsp + 160], ymm5
            vmovdqu ymmword ptr [rsp + 192], ymm6
            vmovdqu ymmword ptr [rsp + 224], ymm7

            push    rbp

returnAdrs  textequ <[rbp + 328]>

            mov     rbp, rsp
            sub     rsp, 128
            and     rsp, -16

; Format string (passed in RCX) is sitting at
; the location pointed at by the return address,
; load that into RCX:

            mov     rcx, returnAdrs

; To handle more than 3 arguments (4 counting
; RCX), you must pass data on stack. However, to the
; print caller, the stack is unavailable, so use
; R10 and R11 as extra parameters (could be just
; junk in these registers, but pass them just
; in case):

            mov     [rsp + 32], r10
            mov     [rsp + 40], r11
            call    printf

; Need to modify the return address so
; that it points beyond the zero-terminating byte.
; Could use a fast strlen function for this, but
; printf is so slow it won't really save us anything.

            mov     rcx, returnAdrs
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     returnAdrs, rcx

            leave
            vmovdqu ymm0, ymmword ptr [rsp + 000]
            vmovdqu ymm1, ymmword ptr [rsp + 032]
            vmovdqu ymm2, ymmword ptr [rsp + 064]
            vmovdqu ymm3, ymmword ptr [rsp + 096]
            vmovdqu ymm4, ymmword ptr [rsp + 128]
            vmovdqu ymm5, ymmword ptr [rsp + 160]
            vmovdqu ymm6, ymmword ptr [rsp + 192]
            vmovdqu ymm7, ymmword ptr [rsp + 224]
            add     rsp, 256
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print_AVX   endp

; Version that will run on CPUs without
; AVX support and will preserve the
; volatile SSE registers (XMM0 to XMM7):

print_SSE   proc

; Preserve all the volatile registers
; (be nice to the assembly code that
; calls this procedure):

            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11

; XMM0 to XMM7 are considered volatile, so preserve them:

            sub     rsp, 128
            movdqu  xmmword ptr [rsp + 00],  xmm0
            movdqu  xmmword ptr [rsp + 16],  xmm1
            movdqu  xmmword ptr [rsp + 32],  xmm2
            movdqu  xmmword ptr [rsp + 48],  xmm3
            movdqu  xmmword ptr [rsp + 64],  xmm4
            movdqu  xmmword ptr [rsp + 80],  xmm5
            movdqu  xmmword ptr [rsp + 96],  xmm6
            movdqu  xmmword ptr [rsp + 112], xmm7

            push    rbp

returnAdrs  textequ <[rbp + 200]>

            mov     rbp, rsp
            sub     rsp, 128
            and     rsp, -16

; Format string (passed in RCX) is sitting at
; the location pointed at by the return address,
; load that into RCX:

            mov     rcx, returnAdrs

; To handle more than 3 arguments (4 counting
; RCX), you must pass data on stack. However, to the
; print caller, the stack is unavailable, so use
; R10 and R11 as extra parameters (could be just
; junk in these registers, but pass them just
; in case):

            mov     [rsp + 32], r10
            mov     [rsp + 40], r11
            call    printf

; Need to modify the return address so
; that it points beyond the zero-terminating byte.
; Could use a fast strlen function for this, but
; printf is so slow it won't really save us anything.

            mov     rcx, returnAdrs
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     returnAdrs, rcx

            leave
            movdqu  xmm0, xmmword ptr [rsp + 00] 
            movdqu  xmm1, xmmword ptr [rsp + 16] 
            movdqu  xmm2, xmmword ptr [rsp + 32] 
            movdqu  xmm3, xmmword ptr [rsp + 48] 
            movdqu  xmm4, xmmword ptr [rsp + 64] 
            movdqu  xmm5, xmmword ptr [rsp + 80] 
            movdqu  xmm6, xmmword ptr [rsp + 96] 
            movdqu  xmm7, xmmword ptr [rsp + 112] 
            add     rsp, 128
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print_SSE   endp 

;***************************************************************

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

; Trivial example, no arguments:

            call    print
            byte    "Hello, world!", nl, 0

; Simple example with integer arguments:

            mov     rdx, 1          ; Argument #1 for printf
            mov     r8, 2           ; Argument #2 for printf
            mov     r9, 3           ; Argument #3 for printf
            mov     r10, 4          ; Argument #4 for printf
            mov     r11, 5          ; Argument #5 for printf
            call    print
            byte    "Arg 1=%d, Arg2=%d, Arg3=%d "
            byte    "Arg 4=%d, Arg5=%d", nl, 0

; Demonstration of floating-point operands. Note that
; args 1, 2, and 3 must be passed in RDX, R8, and R9.
; You'll have to load parameters 4 and 5 into R10 and R11.

            mov     rdx, qword ptr fp1
            mov     r8,  qword ptr fp2
            mov     r9,  qword ptr fp3
            mov     r10, qword ptr fp4
            mov     r11, qword ptr fp5
            call    print
            byte    "Arg1=%6.1f, Arg2=%6.1f, Arg3=%6.1f "
            byte    "Arg4=%6.1f, Arg5=%6.1f ", nl, 0

allDone:    leave
            pop     rdi
            pop     rsi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 11-5: Dynamically selected print procedure

Here are the build command and output for the program in Listing 11-5:

C:\>**build listing11-5**

C:\>**echo off**
 Assembling: listing11-5.asm
c.cpp

C:\>**listing11-5**
Calling Listing 11-5:
Hello, world!
Arg 1=1, Arg2=2, Arg3=3 Arg 4=4, Arg5=5
Arg1=   1.0, Arg2=   2.0, Arg3=   3.0 Arg4=   4.0, Arg5=   5.0
Listing 11-5 terminated

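11.23 The MASM Include Directive

As you've seen, including the source code for the print procedure in every sample listing in this book wastes a lot of space. Repeating the new version from the previous section in every listing would be impractical. In Chapter 15, I discuss include files, libraries, and other features you can use to break large projects into manageable pieces. In the meantime, it's worthwhile to discuss the MASM include directive so this book can eliminate a lot of unnecessary code duplication in its sample programs.

The MASM include directive uses the following syntax: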

include  `source_filename`

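where source_filename is the name of a text file (generally in the same directory as the source file containing this include directive). MASM inserts the contents of that file into the assembly at the point of the include directive, exactly as though the file's text had appeared in the source file being assembled.

For example, I have extracted all the source code associated with the new print procedure (the choosePrint, print_AVX, and print_SSE procedures, and the print qword variable) and inserted it into the print.inc source file.^21 In listings that follow in this book, I'll simply place the following directive in the code in place of the print function: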

include print.inc

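I've also put the getTitle procedure into its own header file (getTitle.inc) so that this common code can be removed from the sample listings.

11.24 And a Whole Lot More

This chapter doesn't even begin to describe all the various SSE, AVX, AVX2, and AVX512 instructions. As already mentioned, most of the SIMD instructions have specific purposes (such as interleaving or deinterleaving bytes associated with video or audio information) that aren't very useful outside their particular problem domains. Other instructions (at least, as this book was being written) are so new that many CPUs in use today can't execute them. If you're interested in learning about more of the SIMD instructions, check out the information in the next section.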

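11.25 For More Information

For more information about the cpuid instruction on AMD CPUs, see the 2010 AMD document "CPUID Specification" (www.amd.com/system/files/TechDocs/25481.pdf). For Intel CPUs, check out "Intel Architecture and Processor Identification with CPUID Model and Family Numbers" (software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers/).

Microsoft's website (particularly the Visual Studio documentation) has additional information on the MASM segment directive and x86-64 segments. A search for MASM Segment Directive on the internet, for example, brings up the page docs.microsoft.com/en-us/cpp/assembler/masm/segment?view=msvc-160/.

The complete discussion of all the SIMD instructions can be found in Intel's documentation: Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2: Instruction Set Reference.

You can easily find this documentation on Intel's website; for example:

software.intel.com/en-us/articles/intel-sdm/

software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4.html

AMD's variant can be found at www.amd.com/system/files/TechDocs/40332.pdf.

Although this chapter has presented many of the SSE/AVX/AVX2 instructions and what they do, it hasn't spent much time describing how you would use these instructions in a typical program. You can easily find lots of useful high-performance algorithms that use SSE and AVX instructions on the internet. The following URLs provide some examples: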

Tutorials on SIMD programming

SSE Arithmetic, by Stefano Tommesani, tommesani.com/index.php/2010/04/24/sse-arithmetic/
x86/x64 SIMD Instruction List, www.officedaytime.com/simd512e/
Basics of SIMD Programming, Sony Computer Entertainment, ftp.cvut.cz/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html

Sorting algorithms

"A Novel Hybrid Quicksort Algorithm Vectorized Using AVX-512 on Intel Skylake" by Berenger Bramas, arxiv.org/pdf/1704.08579.pdf
"Register Level Sort Algorithm on Multi-Core SIMD Processors" by Tian Xiaochen et al., olab.is.s.u-tokyo.ac.jp/~kamil.rocki/xiaochen_rocki_IA3_SC13.pdf
"Fast Quicksort Implementation Using AVX Instructions" by Shay Gueron and Vlad Krasnov, citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1009.7773&rep=rep1&type=pdf

Searching algorithms

"SIMD-Friendly Algorithms for Substring Searching" by Wojciech Mula, 0x80.pl/articles/simd-strfind.html
"Fast Multiple String Matching Using Streaming SIMD Extensions Technology" by Simone Faro and M. Oğuzhan Külekci, citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1041.3831&rep=rep1&type=pdf
"k-Ary Search on Modern Processors" by Benjamin Schlegel et al., event.cwi.nl/damon2009/DaMoN09-KarySearch.pdf

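11.26 Test Yourself

How can you determine whether a particular SSE or AVX feature is available on the CPU?
Why is it important to check the manufacturer of the CPU?
What EAX setting do you use with cpuid to obtain the feature flags?
Which feature flag bit tells you that the CPU supports SSE4.2 instructions?
What are the default segment names used by the following directives?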
1. .code
2. .data
3. .data?
4. .const
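What is the default segment alignment?
How would you create a data segment aligned on a 64-byte boundary?
Which instruction set extensions support the YMMx registers?
What is a lane?
What is the difference between a scalar instruction and a vector instruction?
SSE memory operands (XMM) must usually be aligned on what memory boundary?
AVX memory operands (YMM) must usually be aligned on what memory boundary?
AVX-512 memory operands (ZMM) must usually be aligned on what memory boundary?
What instruction would you use to move data from a 32-bit general-purpose integer register into the LO 32 bits of an XMM or YMM register?
What instruction would you use to move data from a 64-bit general-purpose integer register into the LO 64 bits of an XMM or YMM register?
What three instructions would you use to load 16 bytes from an aligned memory location into an XMM register?
What three instructions would you use to load 16 bytes from an arbitrary memory address into an XMM register?
If you want to move the HO 64 bits of an XMM register into the HO 64 bits of another XMM register without affecting the LO 64 bits of the destination, what instruction would you use?
If you want to duplicate a double-precision value in an XMM register into both quad words (LO and HO) of another XMM register, what instruction would you use?
Which instruction would you use to rearrange the bytes in an XMM register?
Which instruction would you use to rearrange the double-word lanes in an XMM register?
Which instructions would you use to extract bytes, words, double words, or quad words from an XMM register and move them into a general-purpose register?
Which instruction would you use to insert a byte, word, double word, or quad word from a general-purpose register at some position in an XMM register?
What does the andnpd instruction do?
Which instruction would you use to shift the bytes in an XMM register one byte position (8 bits) to the left?
Which instruction would you use to shift the bytes in an XMM register one byte position (8 bits) to the right?
If you want to shift the two quad words in an XMM register n bit positions to the left, what instruction would you use?
If you want to shift the two quad words in an XMM register n bit positions to the right, what instruction would you use?
What happens in a paddb instruction when a sum will not fit into 8 bits?
What is the difference between a vertical addition and a horizontal addition?
Where does the pcmpeqb instruction put the result of the comparison? How does it indicate that the result is true?
There is no pcmpltq instruction. Explain how to compare the lanes in a pair of XMM registers for the less-than condition.
What does the pmovmskb instruction do?
How many additions do the following instructions perform?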
1. addps
2. addpd
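If you have a pointer to data in RAX and want to force that address to be aligned on a 16-byte boundary, what instruction would you use?
How can you set all the bits in the XMM0 register to 0?
How can you set all the bits in the XMM1 register to 1?
What directive do you use to insert the contents of a source file into the current source file during assembly?

Chapter 12: Bit Manipulation

Manipulating bits in memory is, perhaps, the feature for which assembly language is most famous. Even the C programming language, known for its bit manipulation, doesn't provide as complete a set of bit-manipulation operations.

This chapter discusses how to manipulate strings of bits in memory and registers by using x86-64 assembly language. It begins with a review of the bit-manipulation instructions covered thus far, introduces a few new instructions, and then reviews information on packing and unpacking bit strings in memory, which is the basis for many bit-manipulation operations. Finally, this chapter discusses several bit-centric algorithms and their implementation in assembly language.

12.1 What Is Bit Data, Anyway?

Bit manipulation refers to working with bit data: data types that consist of strings of bits that are noncontiguous or not a multiple of 8 bits long. Generally, such bit objects will not represent numeric integers, although we will not place this restriction on our bit strings.

A bit string is a contiguous sequence of 1 or more bits. It does not have to start or end at any special point. For example, a bit string could start in bit 7 of one byte in memory and continue through bit 6 of the next byte. Likewise, a bit string could begin in bit 30 of EAX, consume the upper 2 bits of EAX, and then continue from bit 0 through bit 17 of EBX. In memory, the bits must be physically contiguous (that is, the bit numbers are always increasing except when crossing a byte boundary, and at byte boundaries the memory address increases by 1 byte). In registers, if a bit string crosses a register boundary, the application defines the continuation register, but the bit string always continues at bit 0 of that second register.

A bit run is a sequence of bits that all have the same value. A run of zeros is a bit string containing only 0s, and a run of ones is a bit string containing only 1s. The first set bit is the position of the first bit in a bit string that contains a 1; that is, the first 1 bit following a possible run of zeros. The first clear bit has a similar definition. The last set bit is the last bit position in a bit string that contains a 1; the remaining bits after it form an uninterrupted run of zeros. The last clear bit has a similar definition.

A bit set is a collection of bits, not necessarily contiguous, within a larger data structure. For example, bits 0 to 3, 7, 12, 24, and 31 from a double word form a set of bits. Normally, we will deal with bit sets that are part of a container object (the data structure that encapsulates the bit set) no more than about 32 or 64 bits in size, though this limit is completely artificial. Bit strings are special cases of bit sets.

A bit offset is the number of bits from a boundary position (usually a byte boundary) to the specified bit. As noted in Chapter 2, we number the bits starting from 0 at the boundary location.

A mask is a sequence of bits that we use to manipulate certain bits in another value. For example, the bit string 0000_1111_0000b, when used with the and instruction, clears all the bits except bits 4 to 7. Likewise, if you use the same value with the or instruction, it sets bits 4 to 7 in the destination operand to 1. The term mask comes from the use of these bit strings with the and instruction. In those situations, the 1 and 0 bits behave like masking tape when you're painting something; they pass through certain bits unchanged while masking out (clearing) the other bits.

Armed with these definitions, we're ready to start manipulating some bits!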

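12.2 Instructions That Manipulate Bits

Bit manipulation generally consists of six activities: setting bits, clearing bits, inverting bits, testing and comparing bits, extracting bits from a bit string, and inserting bits into a bit string. The most basic bit-manipulation instructions are and, or, xor, not, test, and the shift and rotate instructions. The following paragraphs review these instructions, concentrating on how you can use them to manipulate bits in memory or registers.

12.2.1 The and Instruction

The and instruction provides the ability to replace unwanted bits in a bit sequence with 0s. This instruction is especially useful for isolating a bit string or a bit set that is merged with other, unrelated data (or, at least, data that is not part of the bit string or bit set). For example, suppose that a bit string consumes bit positions 12 to 24 of the EAX register; we can isolate this bit string by setting all other bits in EAX to 0 with the following instruction (see Figure 12-1):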

and eax, 1111111111111000000000000b

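In theory, you could use the or instruction to mask all unwanted bits to 1s rather than 0s, but later comparisons and operations are often easier if the unneeded bit positions contain 0.

Figure 12-1: Isolating a bit string by using the and instruction

Once you've cleared the unneeded bits in a set of bits, you can often operate on the bit set in place. For example, to see whether the string of bits in positions 12 to 24 of EAX contains 12F3h, you could use the following code: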

and eax, 1111111111111000000000000b 
cmp eax, 1001011110011000000000000b

Here's another solution, using constant expressions, that's a little easier to digest:

and eax, 1111111111111000000000000b
cmp eax, 12F3h shl 12

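To make the constants and other values you use with this bit string easier to deal with, you can use the shr instruction to align the bit string with bit 0 after you've masked it, like this: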

and eax, 1111111111111000000000000b
shr eax, 12
cmp eax, 12F3h
 `Other operations that require the bit string at bit #0`
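For readers more comfortable with a high-level notation, the mask-then-shift sequence above can be sketched in C. This is only an illustration; the names `FIELD_START`, `FIELD_LEN`, `FIELD_MASK`, and the three helper functions are hypothetical and do not appear in the book.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The field occupies bits 12 to 24: 13 bits starting at bit 12. */
enum { FIELD_START = 12, FIELD_LEN = 13 };
#define FIELD_MASK (((UINT32_C(1) << FIELD_LEN) - 1) << FIELD_START)

/* and eax, mask: keep the field in place, zero everything else. */
uint32_t isolate_field(uint32_t value)
{
    return value & FIELD_MASK;
}

/* and eax, mask / shr eax, 12: right-justify the field at bit 0. */
uint32_t field_value(uint32_t value)
{
    return (value & FIELD_MASK) >> FIELD_START;
}

/* cmp eax, expected: compare the right-justified field. */
bool field_is(uint32_t value, uint32_t expected)
{
    assert(expected < (UINT32_C(1) << FIELD_LEN));
    return field_value(value) == expected;
}
```

With these helpers, `field_is(x, 0x12F3)` corresponds to the and/shr/cmp sequence, and comparing `isolate_field(x)` against `0x12F3u << 12` corresponds to the version that leaves the field in place.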

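12.2.2 The or Instruction

The or instruction is especially useful for inserting a bit set into another bit string, using the following steps:

Clear all the bits surrounding your bit set in the source operand.
Clear all the bits in the destination operand where you wish to insert the bit set.
OR the bit set and the destination operand together.

For example, suppose you have a value in bits 0 to 12 of EAX that you wish to insert into bits 12 to 24 of EBX without affecting any of the other bits in EBX. You would begin by stripping out bits 13 and above from EAX; then you would strip out bits 12 to 24 in EBX. Next, you would shift the bits in EAX so the bit string occupies bits 12 to 24 of EAX. Finally, you would OR the value in EAX into EBX (see Figure 12-2), as shown here: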

and eax, 1FFFh      ; Strip all but bits 0 to 12 from EAX
and ebx, 0FE000FFFh ; Clear bits 12 to 24 in EBX
shl eax, 12         ; Move bits 0 to 12 to 12 to 24 in EAX
or ebx,eax          ; Merge the bits into EBX

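In Figure 12-2, the desired bits (AAAAAAAAAAAAA) form a bit string. However, this algorithm still works fine even if you're manipulating a noncontiguous set of bits. All you have to do is create a bit mask that has 1s in the appropriate places.

When you work with bit masks, it is terrible programming style to use literal numeric constants as in the past few examples. You should always create symbolic constants in MASM. By combining these with some constant expressions, you can produce code that is much easier to read and maintain. The current example is more properly written as follows: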

StartPosn = 12 
BitMask   = 1FFFh shl StartPosn ; Mask occupies bits 12 to 24
        . 
        .
        . 
   shl eax, StartPosn   ; Move into position
   and eax, BitMask     ; Strip all but bits 12 to 24 from EAX
   and ebx, not BitMask ; Clear bits 12 to 24 in EBX
   or  ebx, eax         ; Merge the bits into EBX

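Figure 12-2: Inserting bits 0 to 12 of EAX into bits 12 to 24 of EBX

Using the compile-time not operator to invert the bit mask saves you from having to create a second constant that you would need to update every time you modify the BitMask constant. Having to maintain two separate symbols whose values depend on each other is not a good practice in a program.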
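The same three steps (position and trim the source, clear the destination field, OR the two together) can be sketched in C. The constant names below follow the MASM example, but `insert_field` is a hypothetical helper introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the MASM constants: the field occupies bits 12 to 24. */
#define START_POSN 12
#define BIT_MASK   (UINT32_C(0x1FFF) << START_POSN)

/* Insert bits 0 to 12 of src into bits 12 to 24 of dest. */
uint32_t insert_field(uint32_t dest, uint32_t src)
{
    src = (src << START_POSN) & BIT_MASK;  /* shl, then and eax, BitMask */
    dest &= ~BIT_MASK;                     /* and ebx, not BitMask       */
    return dest | src;                     /* or  ebx, eax               */
}
```

As in the assembly version, the inverted mask is derived from `BIT_MASK` rather than written out by hand, so changing the field position requires editing only one definition.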

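Of course, in addition to merging one bit set with another, the or instruction is also useful for forcing bits to 1. By setting various bits in a source operand to 1, you can force the corresponding bits in the destination operand to 1 by using the or instruction.

12.2.3 The xor Instruction

The xor instruction allows you to invert selected bits in a bit set. Of course, if you want to invert all the bits in a destination operand, the not instruction is more appropriate; however, if you want to invert selected bits while not affecting others, xor is the way to go.

One interesting fact about xor's operation is that it lets you manipulate known data in just about any way imaginable. For example, if you know that a field contains 1010b, you can force that field to 0 by XORing it with 1010b. Similarly, you can force it to 1111b by XORing it with 0101b. Although this might seem like a waste, because you can easily force this 4-bit string to 0 or all 1s by using and/or, the xor instruction has two advantages. First, you are not limited to forcing the field to all 0s or all 1s; you can actually set these bits to any of the 16 valid combinations via xor. Second, if you need to manipulate other bits in the destination operand at the same time, and/or may not be able to do the job.

For example, suppose you know that one field contains 1010b that you want to force to 0, and another field in the same operand contains 1000b that you wish to increment by 1 (that is, set the field to 1001b). You cannot accomplish both operations with a single and or or instruction, but you can with a single xor instruction; just XOR the first field with 1010b and the second field with 0001b. Remember, however, that this trick works only if you know the current value of the bits you are changing in the destination operand.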
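The XOR constant for such a change is simply the known value XORed with the wanted value, shifted into the field's position. The C sketch below works through the two-field example; placing the first field in bits 0 to 3 and the second in bits 4 to 7 is an assumption made for illustration, as is the helper name `xor_constant`.

```c
#include <assert.h>
#include <stdint.h>

/* Build the XOR constant that turns a field known to hold `known`
   into `wanted`; `shift` is the bit position of the field's LO bit. */
uint32_t xor_constant(uint32_t known, uint32_t wanted, unsigned shift)
{
    assert(shift < 32);
    return (known ^ wanted) << shift;
}
```

For a field in bits 0 to 3 known to hold 1010b and a field in bits 4 to 7 known to hold 1000b, the combined constant is `xor_constant(0xA, 0x0, 0) | xor_constant(0x8, 0x9, 4)`, which is 1Ah. A single XOR with 1Ah clears the first field, sets the second to 1001b, and leaves every other bit of the operand untouched.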

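12.2.4 Flag Modification by Logical Instructions

In addition to setting, clearing, and inverting bits in a destination operand, the and, or, and xor instructions also affect various condition codes in the FLAGS register. These instructions do the following:

Always clear the carry and overflow flags.
Set the sign flag if the result has a 1 in the HO bit, and clear it otherwise; that is, these instructions copy the HO bit of the result into the sign flag.
Set the zero flag if the result is zero, and clear it if the result is nonzero.
Set the parity flag if there is an even number of set bits in the LO byte of the destination operand, and clear it if there is an odd number of set bits.

Because these instructions always clear the carry and overflow flags, you cannot expect the system to preserve the state of these two flags across their execution. A common mistake in many assembly language programs is the assumption that these instructions do not affect the carry flag. Many people will execute an instruction that sets or clears the carry flag; execute an and, or, or xor instruction; and then attempt to test the state of the carry from the earlier instruction. This simply will not work.

One of the more interesting aspects of these instructions is that they copy the HO bit of their result into the sign flag. Therefore, you can easily test the HO bit by testing the sign flag (using cmovs and cmovns, sets and setns, or the js and jns instructions). For this reason, many assembly language programmers place an important Boolean variable in the HO bit of an operand so they can easily test the state of that variable by using the sign flag after a logical operation.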

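12.2.4.1 The Parity Flag

Parity is a simple error-detection scheme originally employed by telegraphs and other serial communication protocols. The idea is to count the number of set bits in a character and include an extra bit in the transmission to indicate whether that character contains an even or odd number of set bits. The receiving end of the transmission also counts the bits and verifies that the extra parity bit indicates a successful transmission. The purpose of the parity flag is to help compute the value of this extra bit, though parity checking has since been taken over by hardware.^(1)

The x86-64 and, or, and xor instructions set the parity flag if the LO byte of their operand contains an even number of set bits. An important fact bears repeating here: the parity flag reflects only the number of set bits in the LO byte of the destination operand; it does not include the HO bytes in a word, double-word, or other-sized operand. The instruction set uses only the LO byte to compute parity because communication programs that use parity are typically character-oriented transmission systems (better error-checking schemes are available if you transmit more than 8 bits at a time).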
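To make the flag rules concrete, here is a small C model of the flags after a 32-bit and, or, or xor. It is only a sketch of the rules listed in this section; the `logic_flags` structure and the function name are hypothetical. Note in particular that the parity computation looks at the LO byte alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* The five flags affected by and/or/xor, as described in the text. */
typedef struct {
    bool cf, of;   /* always cleared                      */
    bool sf;       /* copy of the result's HO bit         */
    bool zf;       /* set when the result is zero         */
    bool pf;       /* set when the LO byte has even parity */
} logic_flags;

logic_flags flags_after_logic32(uint32_t result)
{
    logic_flags f;
    f.cf = false;
    f.of = false;
    f.sf = ((result >> 31) & 1u) != 0;
    f.zf = (result == 0);

    /* Fold the LO byte onto itself; bit 0 ends up as the XOR of all 8 bits. */
    uint8_t lo = (uint8_t)result;
    lo ^= (uint8_t)(lo >> 4);
    lo ^= (uint8_t)(lo >> 2);
    lo ^= (uint8_t)(lo >> 1);
    f.pf = ((lo & 1u) == 0);   /* even number of set bits */
    return f;
}
```

For example, a result of 0000_0100h has a single set bit overall, yet this model reports the parity flag as set, because the LO byte contains zero set bits.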

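12.2.4.2 The Zero Flag

The zero flag setting is one of the more important results produced by the and, or, and xor instructions. Indeed, programs reference this flag so often after the and instruction that Intel added a separate instruction, test, whose main purpose is to logically AND two operands and set the flags without otherwise affecting either instruction operand.

The zero flag has three main uses after the execution of an and or a test instruction: (1) checking whether a particular bit in an operand is set, (2) checking whether at least one of several bits in a bit set is 1, and (3) checking whether an operand is 0. Use (1) is actually a special case of (2), in which the bit set contains only a single bit. We'll explore each of these uses in the following paragraphs.

To test whether a particular bit is set in a given operand, use the and or test instruction with a constant value containing a single set bit at the position you wish to test. This clears all the other bits in the operand, leaving a 0 in the bit position under test if the operand contains a 0 there and a 1 if it contains a 1. Because all the other bits in the result are 0, the entire result is 0 if that particular bit is 0, and nonzero if that bit is 1. The x86-64 reflects this status in the zero flag (Z = 1 indicates a 0 bit; Z = 0 indicates a 1 bit). The following instruction sequence demonstrates how to test whether bit 4 is set in EAX: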

     test eax, 10000b  ; Check bit #4 to see if it is 0 or 1
     jnz  bitIsSet

    `Do this if the bit is clear`
        .
        .
        .
bitIsSet:   ; Branch here if the bit is set

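You can also use the and and test instructions to see whether any one of several bits is set. Simply supply a constant that has a 1 in all the positions you want to test (and 0s everywhere else). ANDing an operand with such a constant produces a nonzero value if any of the tested bits in the operand contain a 1. The following example tests whether the value in EAX contains a 1 in bit positions 1, 2, 4, or 7: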

     test eax, 10010110b 
     jz   noBitsSet

    `Do whatever needs to be done if one of the bits is set`

noBitsSet:

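You cannot use a single and or test instruction to see whether all the corresponding bits in the bit set are equal to 1. To accomplish this, you must first mask out the bits that are not in the set and then compare the result against the mask itself. If the result is equal to the mask, all the bits in the bit set contain 1s. You must use the and instruction for this operation because the test instruction does not store its result anywhere. The following example checks whether all the bits in a bit set (bitMask) are equal to 1: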

     and eax, bitMask 
     cmp eax, bitMask 
     jne allBitsArentSet 

; All the bit positions in EAX corresponding to the set 
; bits in bitMask are equal to 1 if we get here.

    `Do whatever needs to be done if the bits match`

allBitsArentSet:

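Of course, once we stick the cmp instruction in there, we don't really have to check whether all the bits in the bit set contain 1s. We can check for any combination of values by supplying the appropriate value as the operand to cmp.

Note that the test and and instructions set the zero flag only if all the bits in EAX (or other destination operand) are 0 in the positions where 1s appear in the constant operand. This suggests another way to check for all 1s in the bit set: invert the value in EAX prior to using the and or test instruction. Then, if the zero flag is set, you know that all the bits in the (original) bit set were 1s. For example: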

not  eax 
test eax, bitMask 
jnz  NotAllOnes

; At this point, EAX contained all 1s in the bit positions 
; occupied by 1s in the bitMask constant. 

    `Do whatever needs to be done at this point`

NotAllOnes:

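The previous paragraphs all suggest that bitMask (the source operand) is a constant, but you can use a variable or another register too. Simply load that variable or register with the appropriate bit mask before you execute the test, and, or cmp instructions in the preceding examples.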
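The three zero-flag idioms from this section (a single bit, any bit in a set, all bits in a set) reduce to short expressions in C. The sketch below uses hypothetical function names; the comments show which instruction sequence each one models.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* test eax, 1 shl bit / jnz: is one particular bit set? */
bool bit_is_set(uint32_t value, unsigned bit)
{
    assert(bit < 32);
    return (value & (UINT32_C(1) << bit)) != 0;
}

/* test eax, mask / jnz: is at least one bit of the set a 1? */
bool any_bits_set(uint32_t value, uint32_t mask)
{
    return (value & mask) != 0;
}

/* and eax, mask / cmp eax, mask / je: are all bits of the set 1s? */
bool all_bits_set(uint32_t value, uint32_t mask)
{
    return (value & mask) == mask;
}

/* not eax / test eax, mask / jz: the same check via inversion. */
bool all_bits_set_via_not(uint32_t value, uint32_t mask)
{
    return (~value & mask) == 0;
}
```

The last two functions always agree; the inverted form is the one that needs only the zero flag and no separate compare.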

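12.2.5 The Bit Test Instructions

Another set of instructions we've already seen that you can use to manipulate bits is the bit test instructions. These instructions include bt (bit test), bts (bit test and set), btc (bit test and complement), and btr (bit test and reset). The btx instructions use the following syntax: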

bt`x`  `bits_to_test`, `bit_number`
bt`x`  `reg`[16], `reg`[16]
bt`x`  `reg`[32], `reg`[32]
bt`x`  `reg`[64], `reg`[64]
bt`x`  `reg`[16], `constant`
bt`x`  `reg`[32], `constant`
bt`x`  `reg`[64], `constant`
bt`x`  `mem`[16], `reg`[16]
bt`x`  `mem`[32], `reg`[32]
bt`x`  `mem`[64], `reg`[64]
bt`x`  `mem`[16], `constant`
bt`x`  `mem`[32], `constant`
bt`x`  `mem`[64], `constant`

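where x is nothing, c, s, or r.

The btx instructions' second operand is a bit number that specifies which bit to check in the first operand. If the first operand is a register, the second operand must contain a value between 0 and the size of the register (in bits) minus 1; because the x86-64's largest (general-purpose) registers are 64 bits, this value has a maximum of 63 (for 64-bit registers). If the first operand is a memory location, the bit number is not limited to values in the range 0 to 63. If the second operand is a constant, it can be any 8-bit value in the range 0 to 255. If the second operand is a register, it has no (practical) limitation and, in fact, it allows negative bit offsets.

The bt instruction copies the bit of the first operand selected by the bit number in the second operand into the carry flag. For example, the bt eax, 8 instruction copies bit 8 of the EAX register into the carry flag. You can test the carry flag after this instruction to determine whether bit 8 in EAX was set or clear.

The bts, btc, and btr instructions manipulate the bit they test while they are testing it. These instructions may be slow (depending on the processor you're using), and you should avoid them if performance is your primary concern, particularly if you're using an older CPU. If performance (versus convenience) is an issue, you should always try two different algorithms, one that uses these instructions and one that uses and and or instructions, and measure the performance difference; then choose the better of the two approaches.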
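As a rough model of what the register forms of these instructions compute, the C sketch below returns the tested bit (the value that would land in the carry flag) and, for the modifying variants, updates the operand through a pointer. The function names are hypothetical, and the sketch covers only 32-bit register operands with in-range bit numbers; it says nothing about the memory forms or about timing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* bt: return the selected bit (the carry flag after the instruction). */
bool bt32(uint32_t value, unsigned bit)
{
    assert(bit < 32);
    return ((value >> bit) & 1u) != 0;
}

/* bts: test the bit, then set it. */
bool bts32(uint32_t *value, unsigned bit)
{
    bool carry = bt32(*value, bit);
    *value |= UINT32_C(1) << bit;
    return carry;
}

/* btr: test the bit, then reset (clear) it. */
bool btr32(uint32_t *value, unsigned bit)
{
    bool carry = bt32(*value, bit);
    *value &= ~(UINT32_C(1) << bit);
    return carry;
}

/* btc: test the bit, then complement it. */
bool btc32(uint32_t *value, unsigned bit)
{
    bool carry = bt32(*value, bit);
    *value ^= UINT32_C(1) << bit;
    return carry;
}
```

The bodies of the three modifying functions are exactly the or, and, and xor alternatives that the text suggests benchmarking against bts, btr, and btc.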

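12.2.6 Manipulating Bits with Shift and Rotate Instructions

The shift and rotate instructions are another group of instructions you can use to manipulate and test bits. These instructions move the HO (left shift and rotate) or LO (right shift and rotate) bits into the carry flag. Thus, you can test the carry flag after you execute one of these instructions to determine the original setting of the operand's HO or LO bit; for example: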

shr  al, 1
jc   LOBitWasSet

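The nice thing about the shift and rotate instructions is that they automatically move bits up or down in their operand so the next bit to test is in the proper location; this is especially useful when operating within a loop.

The shift and rotate instructions are also invaluable for aligning bit strings and for packing and unpacking data. Chapter 2 has several examples of this, and earlier examples in this chapter also use the shift instructions for this purpose.

12.3 The Carry Flag as a Bit Accumulator

The btx, shift, and rotate instructions set or clear the carry flag depending on the operation and selected bit. Because these instructions place their "bit result" in the carry flag, it is often convenient to think of the carry flag as a 1-bit register or accumulator for bit operations. In this section, we will explore some of the operations possible with this bit result in the carry flag.

Instructions that use the carry flag as some sort of input value are useful for manipulating bit results in the carry flag. For example: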

adc, sbb
rcl, rcr
cmc, clc, and stc
cmovc, cmovnc
jc, jnc
setc, setnc

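The adc and sbb instructions add or subtract their operands along with the carry flag, so if you've computed a bit result into the carry flag, you can figure that result into an addition or a subtraction by using these instructions.

To save a carry flag result, you can use the rotate-through-carry instructions (rcl and rcr), which move the carry flag into the LO or HO bit of their destination operand, respectively. These instructions are useful for packing a set of bit results into a byte, word, or double-word value.

The cmc (complement carry) instruction lets you easily invert the result of a bit operation. You can also use the clc and stc instructions to initialize the carry flag prior to a string of bit operations involving the carry flag.

Instructions that test the carry flag, such as jc, jnc, cmovc, cmovnc, setc, and setnc, are useful after a calculation that leaves its bit result in the carry flag.

If you have a sequence of bit calculations and would like to test whether those calculations produce a specific set of 1-bit results, you can clear a register or memory location and use the rcl or rcr instruction to shift each result into that location. Once the bit operations are complete, compare the register or memory location, holding the results, against a constant value. If you want to test a sequence of results involving ANDs and ORs, you can use the setc and setnc instructions to set a register to 0 or 1 and then use the and and or instructions to merge the results.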
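The "clear a register, then rotate each result in" technique can be modeled in C by pairing a value with an explicit carry bit. The sketch below is only an illustration: `reg_cf`, `rcl1`, and `gather_bits` are hypothetical names, and `gather_bits` stands for a loop of bt followed by rcl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A 32-bit register together with the carry flag. */
typedef struct {
    uint32_t value;
    bool     carry;
} reg_cf;

/* rcl reg, 1: the old carry enters bit 0; the old HO bit becomes the carry. */
reg_cf rcl1(reg_cf s)
{
    reg_cf r;
    r.value = (s.value << 1) | (s.carry ? 1u : 0u);
    r.carry = ((s.value >> 31) & 1u) != 0;
    return r;
}

/* Test each listed bit of `source` (bt) and rotate the result into an
   accumulator (rcl). The first bit tested ends up in the highest position. */
uint32_t gather_bits(uint32_t source, const unsigned *bits, size_t count)
{
    reg_cf acc = { 0, false };
    for (size_t i = 0; i < count; ++i) {
        assert(bits[i] < 32);
        acc.carry = ((source >> bits[i]) & 1u) != 0;   /* bt source, bit */
        acc = rcl1(acc);                               /* rcl acc, 1     */
    }
    return acc.value;
}
```

After the loop, a single comparison of the accumulator against a constant checks all the collected results at once, which is the pattern the paragraph above describes.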

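12.4 Packing and Unpacking Bit Strings

A common bit operation is inserting a bit string into an operand or extracting a bit string from an operand. Chapter 2 provided simple examples of packing and unpacking such data; now it's time to formally describe how to do this.

For our purposes, I will assume that we're dealing with bit strings that fit within a byte, word, double-word, or quad-word operand. Large bit strings that cross object boundaries require additional processing; we'll discuss bit strings that cross quad-word boundaries later in this section.

When packing and unpacking a bit string, we must consider its starting bit position and its length. The starting bit position is the bit number of the LO bit of the string in the larger operand. The length is the number of bits in the bit string.

To insert (pack) data into a destination operand, you start with a bit string of the appropriate length that is right-justified (starts in bit position 0) and zero-extended to 8, 16, 32, or 64 bits; then you insert this data at the appropriate starting position in another operand that is 8, 16, 32, or 64 bits wide. There is no guarantee that the destination bit positions contain any particular value.

The first two steps (which can occur in any order) are to clear out the corresponding bits in the destination operand and to shift (a copy of) the bit string so that its LO bit begins at the appropriate bit position. The third step is to OR the shifted result with the destination operand. This inserts the bit string into the destination operand (see Figure 12-3).

Figure 12-3: Inserting a bit string into a destination operand

The following three instructions insert a bit string of known length into a destination operand, as shown in Figure 12-3. These instructions assume that the source operand is in the BX register and the destination operand is in the AX register: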

shl  bx, 5 
and  ax, 1111111000011111b 
or   ax, bx

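If the length and the starting position aren't known when you're writing the program (that is, you have to calculate them at runtime), you can use a lookup table to insert the bit string. Let's assume that we have two 8-bit values: a starting bit position for the field we're inserting and a nonzero 8-bit length value. Also assume that the source operand is in EBX and the destination operand is in EAX. The mergeBits procedure in Listing 12-1 demonstrates how to do this.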

; Listing 12-1

; Demonstrate inserting bit strings into a register.

; Note that this program must be assembled and linked
; with the "LARGEADDRESSAWARE:NO" option.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 12-1", 0

; The index into the following table specifies the length 
; of the bit string at each position. There are 65 entries
; in this table (one for each bit length from 0 to 64). 

            .const
MaskByLen   equ     this qword
    qword   0
    qword   1,           3,           7,           0fh
    qword   1fh,         3fh,         7fh,         0ffh
    qword   1ffh,        3ffh,        7ffh,        0fffh
    qword   1fffh,       3fffh,       7fffh,       0ffffh
    qword   1ffffh,      3ffffh,      7ffffh,      0fffffh
    qword   1fffffh,     3fffffh,     7fffffh,     0ffffffh 
    qword   1ffffffh,    3ffffffh,    7ffffffh,    0fffffffh 
    qword   1fffffffh,   3fffffffh,   7fffffffh,   0ffffffffh

    qword   1ffffffffh,         03ffffffffh
    qword   7ffffffffh,         0fffffffffh

    qword   1fffffffffh,        03fffffffffh
    qword   7fffffffffh,        0ffffffffffh

    qword   1ffffffffffh,       03ffffffffffh
    qword   7ffffffffffh,       0fffffffffffh

    qword   1fffffffffffh,      03fffffffffffh
    qword   7fffffffffffh,      0ffffffffffffh

    qword   1ffffffffffffh,     03ffffffffffffh
    qword   7ffffffffffffh,     0fffffffffffffh

    qword   1fffffffffffffh,    03fffffffffffffh
    qword   7fffffffffffffh,    0ffffffffffffffh

    qword   1ffffffffffffffh,   03ffffffffffffffh
    qword   7ffffffffffffffh,   0fffffffffffffffh

    qword   1fffffffffffffffh,  03fffffffffffffffh
    qword   7fffffffffffffffh,  0ffffffffffffffffh

Val2Merge   qword   12h, 1eh, 5555h, 1200h, 120h
LenInBits   byte    5,     9,    16,    16,   12
StartPosn   byte    7,     4,     4,    12,   18

MergeInto   qword   0ffffffffh, 0, 12345678h
            qword   11111111h, 0f0f0f0fh

            include getTitle.inc
            include print.inc

            .code

; mergeBits(Val2Merge, MergeWith, Start, Length):
; Length (LenInBits[i]) value is passed in DL.
; Start (StartPosn[i]) is passed in CL.
; Val2Merge (Val2Merge[i]) and MergeWith (MergeInto[i])
; are passed in RBX and RAX.

; mergeBits result is returned in RAX.

mergeBits   proc
            push    rbx
            push    rcx
            push    rdx
            push    r8
            movzx   edx, dl         ; Zero-extends to RDX
            mov     rdx, MaskByLen[rdx * 8]
            shl     rdx, cl
            not     rdx
            shl     rbx, cl
            and     rax, rdx
            or      rax, rbx
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            ret
mergeBits   endp 

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

; The following loop calls mergeBits as
; follows:

;  mergeBits(Val2Merge[i], MergeInto[i], 
;            StartPosn[i], LenInBits[i]);

; Where "i" runs from 4 down to 0.

; Index of the last element in the arrays:

            mov     r10, (sizeof LenInBits) - 1
testLoop:   

; Fetch the Val2Merge element and write
; its value to the display while it is handy.

            mov     rdx, Val2Merge[r10 * 8]
            call    print
            byte    "merge( %x, ", 0
            mov     rbx, rdx

; Fetch the MergeInto element and write
; its value to the display.

            mov     rdx, MergeInto[r10 * 8]
            call    print
            byte    "%x, ", 0
            mov     rax, rdx

; Fetch the StartPosn element and write
; its value to the display.

            movzx   edx, StartPosn[r10 * 1] ; Zero-extends to RDX
            call    print
            byte    "%d, ", 0
            mov     rcx, rdx

; Fetch the LenInBits element and write
; its value to the display.

            movzx   edx, LenInBits[r10 * 1] ; Zero-extends to RDX
            call    print
            byte    "%d ) = ", 0

; Call mergeBits(Val2Merge, MergeInto,
;                StartPosn, LenInBits)

            call    mergeBits

; Display the function result (returned
; in RAX). For this program, the results
; are always 32 bits, so it prints only
; the LO 32 bits of RAX:

            mov     edx, eax
            call    print
            byte    "%x", nl, 0

; Repeat for each element of the array.

            dec     r10
            jns     testLoop

allDone:    leave
            pop     rdi
            pop     rsi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

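Listing 12-1: Inserting bits where the bit string length and starting position are variables

Here are the build command and output for the program in Listing 12-1. Because this program accesses arrays directly (rather than loading their addresses into registers, which would obscure the code), it must be built with the LARGEADDRESSAWARE:NO flag, hence the use of the sbuild.bat batch file (see "Large Address Unaware Applications" in Chapter 3 for a description of sbuild.bat).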

C:\>**sbuild listing12-1**

C:\>**echo off**
 Assembling: listing12-1.asm
c.cpp

C:\>**listing12-1**
Calling Listing 12-1:
merge(120, f0f0f0f, 18, 12) = 4830f0f
merge(1200, 11111111, 12, 16) = 11200111
merge(5555, 12345678, 4, 16) = 12355558
merge(1e, 0, 4, 9) = 1e0
merge(12, ffffffff, 7, 5) = fffff97f
Listing 12-1 terminated

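Each entry in the MaskByLen table (in Listing 12-1) contains the number of 1 bits specified by the index into the table. Using the mergeBits Length parameter value as an index into this table fetches a value that has as many 1 bits as the Length value. The mergeBits function fetches an appropriate mask, shifts it to the left so that the LO bit of this run of 1s matches the starting position of the field into which we want to insert the data, and then inverts the mask and uses the inverted value to clear the appropriate bits in the destination operand.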
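For comparison, here is a C sketch of the same merge. It is not a translation of Listing 12-1: instead of a 65-entry lookup table, the hypothetical `mask_by_len` helper computes the mask arithmetically, with a special case for a length of 64 because shifting a 64-bit value by 64 is undefined in C. Like the assembly version, `merge_bits` assumes the value to merge is already zero-extended to the field length.

```c
#include <assert.h>
#include <stdint.h>

/* Equivalent of MaskByLen[len]: a value with `len` 1 bits at bit 0. */
uint64_t mask_by_len(unsigned len)
{
    assert(len <= 64);
    return (len == 64) ? UINT64_MAX : (UINT64_C(1) << len) - 1;
}

/* Insert the LO `len` bits of `val` into `into`, starting at bit `start`. */
uint64_t merge_bits(uint64_t val, uint64_t into, unsigned start, unsigned len)
{
    assert(start < 64);
    uint64_t mask = mask_by_len(len) << start;   /* shl rdx, cl          */
    into &= ~mask;                               /* not rdx / and rax    */
    return into | (val << start);                /* shl rbx, cl / or rax */
}
```

Calling `merge_bits(0x12, 0xffffffff, 7, 5)` yields fffff97fh, matching the last line of the program output shown above.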

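To extract a bit string from a larger operand, all you have to do is mask out the unwanted bits and then shift the result until the LO bit of the bit string is in bit 0 of the destination operand. For example, to extract the 4-bit field starting at bit position 5 in EBX and leave the result in EAX, you could use the following code: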

mov eax, ebx        ; Copy data to destination
and eax, 111100000b ; Strip unwanted bits
shr eax, 5          ; Right-justify to bit position 0

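Even if you don't know the bit string's length and starting position when you're writing the program, you can still extract the desired bit string. The code is similar to insertion (though a little simpler). Assuming you have the Length and Start values we used when inserting a bit string, you can extract the corresponding bit string by using the following code (assuming the source operand is in RBX and the destination operand is in RAX):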

movzx edx, Length
lea   r8, MaskByLen      ; Table from Listing 12-1
mov   rdx, [r8][rdx * 8]
mov   cl, StartingPosition
mov   rax, rbx
shr   rax, cl
and   rax, rdx
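As a cross-check on the variable-position extraction above, here is a minimal C sketch of the same shift-then-mask idea. The names `mask_by_len` and `extract_bits` are hypothetical (they do not appear in the book); `mask_by_len` stands in for one lookup in the MaskByLen table.

```c
#include <stdint.h>

/* Mask with the low `len` bits set (len in 0..64). This plays the role
   of a single MaskByLen table entry. */
uint64_t mask_by_len(unsigned len)
{
    return (len >= 64) ? ~UINT64_C(0) : ((UINT64_C(1) << len) - 1);
}

/* Extract `len` bits of `src` starting at bit `start` (0..63) and
   right-justify them, mirroring the shr/and pair in the assembly. */
uint64_t extract_bits(uint64_t src, unsigned start, unsigned len)
{
    return (src >> start) & mask_by_len(len);
}
```

For example, `extract_bits(ebx_value, 5, 4)` corresponds to the fixed-position and/shr sequence shown earlier for a 4-bit field starting at bit 5.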

到目前为止的所有示例都假设比特串完全出现在一个四字（或更小）对象中。如果比特串的长度小于或等于 64 位，这种情况总是成立。然而，如果比特串的长度加上它在对象中起始位置的偏移量（模 8）大于 64，那么比特串将在对象内跨越一个四字边界。

提取这样的比特串需要最多三步操作：第一步操作提取比特串的起始位置（直到第一个四字边界），第二步操作复制整个四字（假设比特串的长度足够大，需要多个四字），最后一步操作复制位于比特串末尾的最后一个四字中的剩余位。该操作的实际实现留给你作为练习。
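One possible starting point for that exercise, sketched in C rather than assembly and restricted to fields of at most 64 bits (so only the first and last steps are needed). The function name `extract_span` and the array layout (bit 0 is the LO bit of `words[0]`) are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Extract `len` bits (1..64) starting at absolute bit index `bitpos`
   from an array of 64-bit words. The caller must guarantee that the
   whole field lies inside the array. */
uint64_t extract_span(const uint64_t *words, size_t bitpos, unsigned len)
{
    size_t   idx = bitpos / 64;              /* word holding the first bit  */
    unsigned off = (unsigned)(bitpos % 64);  /* offset of that bit in word  */
    uint64_t result = words[idx] >> off;     /* bits up to the boundary     */

    if (off != 0 && off + len > 64)          /* field crosses the boundary  */
        result |= words[idx + 1] << (64 - off);

    if (len < 64)                            /* discard bits past the field */
        result &= (UINT64_C(1) << len) - 1;
    return result;
}
```

Longer bit strings would add the middle step the text describes: copying whole quad words between the first and last partial words.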

12.5 BMI1 指令用于提取位并创建位掩码

如果你的 CPU 支持 BMI1（位操作指令集，第一集）指令集扩展，^(2)你可以使用bextr（位提取）指令从 32 位或 64 位通用寄存器中提取比特。该指令的语法如下：

bextr `reg`[dest], `reg`[src], `reg`[ctrl]
bextr `reg`[dest], `mem`[src], `reg`[ctrl]

操作数必须具有相同的大小，并且必须是 32 位或 64 位寄存器（或内存位置）。

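The bextr instruction encodes two parameters into `reg`[ctrl]:

Bits 0 to 7 of `reg`[ctrl] specify the starting bit position in the source operand (a value from 0 to 31 for 32-bit operands, or from 0 to 63 for 64-bit operands).
Bits 8 to 15 of `reg`[ctrl] specify the number of bits to extract from the source operand.

The bextr instruction extracts the specified bits from `reg`[src] or `mem`[src] and stores them, shifted down to bit 0, in `reg`[dest]. In general, you should use RAX/EAX, RBX/EBX, RCX/ECX, or RDX/EDX as the ctrl register, because you can easily set the starting and length values through the 8-bit register pairs AL and AH, BL and BH, CL and CH, or DL and DH. Listing 12-2 provides a quick demonstration of the bextr instruction.^(3)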

; Listing 12-2

; Demonstrate extracting bit strings from a register.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 12-2", 0

            include getTitle.inc
            include print.inc

; Here is the "asmMain" function.

            .code
            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

; >>>> Unique code for various listings:

            mov     rax, 123456788abcdefh
            mov     bl, 4
            mov     bh, 16

            bextr   rdx, rax, rbx

            call    print
            byte    "Extracted bits: %x", nl, 0

; <<<< End of unique code.

allDone:    leave
            pop     rdi
            pop     rsi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

示例 12-2：bextr指令示例

示例 12-2 会产生以下输出：

C:\>**build listing12-2**

C:\>**echo off**
 Assembling: listing12-2.asm
c.cpp

C:\>**listing12-2**
Calling Listing 12-2:
Extracted bits: bcde
Listing 12-2 terminated
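To make the control-word encoding concrete, the following C sketch models what a 64-bit bextr computes. It is an illustrative software model under the encoding described above, not the hardware implementation; the names `bextr64_model` and `bextr_ctrl` are hypothetical.

```c
#include <stdint.h>

/* Software model of a 64-bit bextr: ctrl bits 0-7 hold the starting
   bit position, ctrl bits 8-15 hold the number of bits to extract. */
uint64_t bextr64_model(uint64_t src, uint64_t ctrl)
{
    unsigned start = (unsigned)(ctrl & 0xFF);
    unsigned len   = (unsigned)((ctrl >> 8) & 0xFF);
    uint64_t bits;

    if (start >= 64 || len == 0)
        return 0;                        /* nothing to extract */
    bits = src >> start;                 /* right-justify the field */
    if (len < 64)
        bits &= (UINT64_C(1) << len) - 1;
    return bits;
}

/* Pack start and length the way "mov bl, start" / "mov bh, len" does. */
uint64_t bextr_ctrl(unsigned start, unsigned len)
{
    return (uint64_t)(start & 0xFF) | ((uint64_t)(len & 0xFF) << 8);
}
```

With the values from Listing 12-2, `bextr64_model(0x123456788abcdef, bextr_ctrl(4, 16))` yields 0xbcde, matching the program's output.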

BMI1 指令集扩展还包括一条提取寄存器中最低编号已设置位的指令：blsi（提取最低已设置的孤立位）。该指令的语法如下：

blsi `reg`[dest], `reg`[src]
blsi `reg`[dest], `mem`[src]

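All operands must be the same size and can be either 32 or 64 bits. This instruction locates the lowest set bit in the source operand (register or memory). It copies that bit to the destination register and clears all other bits in the destination. If the source value is 0, blsi copies 0 to the destination register, sets the zero flag, and clears the carry flag (blsi sets the carry flag only when the source is nonzero). Listing 12-3 is a simple demonstration of this instruction (note that I've omitted the common code from Listing 12-2).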

; >>>> Unique code for various listings.

mov     r8d, 12340000h
blsi    edx, r8d

call    print
byte    "Extracted bit: %x", nl, 0

; <<<< End of unique code.

清单 12-3：blsi 指令的简单演示

将其插入到一个示例程序壳中并运行，会产生以下输出：

Extracted bit: 40000

BMI1 andn 指令在与 blsi 配合使用时非常有用。andn（与非）指令具有以下通用语法：

andn `reg`[dest], `reg`[src1], `reg`[src2]
andn `reg`[dest], `reg`[src1], `mem`[src2]

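All operands must be the same size and must be 32 or 64 bits. This instruction logically ANDs an inverted (bitwise NOT) copy of the value in `reg`[src1] with the third operand (the src2 operand) and stores the result in the `reg`[dest] operand.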

你可以在执行 blsi 指令后立即使用 andn 指令，从 blsi 的源操作数中移除最低编号的位。清单 12-4 演示了此操作（和往常一样，省略了公共代码）。

; >>>> Unique code for various listings.

mov     r8d, 12340000h
blsi    edx, r8d
andn    r8d, edx, r8d

; Output value 1 is in RDX (extracted bit),
; output value 2 in R8 (value with deleted bit).

call    print
byte    "Extracted bit: %x, result: %x", nl, 0

; <<<< End of unique code.

清单 12-4：提取并移除操作数中最低位的已设置位

运行此代码会产生以下输出：

Extracted bit: 40000, result: 12300000

提取 LO 位并保留其余位（如在清单 12-4 中使用 blsi 和 andn 指令所做的）是如此常见的操作，以至于英特尔创建了一条专门处理此任务的指令：blsr（重置最低已设置位）。以下是其通用语法：

blsr `reg`[dest], `reg`[src]
blsr `reg`[dest], `mem`[src]

两个操作数必须大小相同，并且必须是 32 位或 64 位。此指令从源操作数中获取数据，将最低编号的已设置位清零，并将结果复制到目标寄存器。如果源操作数包含 0，此指令会将 0 复制到目标寄存器，并设置进位标志。

清单 12-5 演示了此指令的使用方法。

; >>>> Unique code for various listings.

mov     r8d, 12340000h
blsr    edx, r8d

; Output value 1 is in RDX (the source value with its lowest set bit cleared).

call    print
byte    "Value with extracted bit: %x", nl, 0

; <<<< End of unique code.

清单 12-5：blsr 指令示例

这是该代码片段的输出（插入到测试程序壳后）：

Value with extracted bit: 12300000

另一个有用的 BMI1 指令是 blsmsk。此指令通过查找最低编号的已设置位来创建一个位掩码。然后，它创建一个包含所有 1 位直到并包括最低已设置位的位掩码。blsmsk 指令将剩余位设置为 0。如果原始值为 0，blsmsk 会将目标寄存器中的所有位设置为 1，并设置进位标志。以下是 blsmsk 的通用语法：

blsmsk `reg`[dest], `reg`[src]
blsmsk `reg`[dest], `mem`[src]

清单 12-6 是一个示例代码片段及其将产生的输出。

; >>>> Unique code for various listings.

mov     r8d, 12340000h
blsmsk  edx, r8d

; Output value 1 is in RDX (mask).

call    print
byte    "Mask: %x", nl, 0

; <<<< End of unique code.

清单 12-6：blsmsk 示例

以下是示例输出：

Mask: 7ffff

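Note in particular that the mask the blsmsk instruction produces includes a 1 bit in the position of the lowest-numbered set bit in the source operand. Often, you will actually want a bit mask containing 1 bits in all positions below the lowest-numbered set bit. This is easy to achieve by using the blsi and dec instructions, as shown in Listing 12-7.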

; >>>> Unique code for various listings.

mov     r8, 12340000h
blsi    rdx, r8
dec     rdx

; Output value 1 is in RDX (mask).

call    print
byte    "Mask: %x", nl, 0

; <<<< End of unique code.

列表 12-7：创建一个不包含最低编号的已设置位的位掩码

这是输出：

Mask: 3ffff
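The results printed by Listings 12-3 through 12-7 can be reproduced with ordinary arithmetic. The following C sketch gives 32-bit software models of the value each instruction produces (flags are not modeled); the function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* blsi: isolate the lowest set bit (x AND -x). */
uint32_t blsi_model(uint32_t x)
{
    return x & (0u - x);
}

/* blsr: clear the lowest set bit (x AND (x - 1)). */
uint32_t blsr_model(uint32_t x)
{
    return x & (x - 1u);
}

/* blsmsk: ones up to and including the lowest set bit (x XOR (x - 1)).
   A source of 0 yields all ones. */
uint32_t blsmsk_model(uint32_t x)
{
    return x ^ (x - 1u);
}

/* andn: AND the inverted first source with the second source. */
uint32_t andn_model(uint32_t src1, uint32_t src2)
{
    return ~src1 & src2;
}

/* blsi followed by dec: ones strictly below the lowest set bit. */
uint32_t mask_below_lowest(uint32_t x)
{
    return blsi_model(x) - 1u;
}
```

Feeding 12340000h to each model gives 40000h, 12300000h, 7FFFFh, and 3FFFFh respectively, which agrees with the listings' output.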

BMI1 指令集中的最后一条指令是 tzcnt（尾随零计数）。该指令具有以下通用语法：

tzcnt `reg`[dest], `reg`[src]
tzcnt `reg`[dest], `mem`[src]

和往常一样，操作数必须具有相同的大小。tzcnt 指令在 BMI1 指令中是独一无二的，因为它支持 16 位、32 位和 64 位操作数。

tzcnt 指令计算源操作数中从最低有效位开始向上数的零位数，并将零位计数存储到目标寄存器中。方便的是，零位的计数值也就是源操作数中第一个已设置位的位索引。如果源操作数为 0，该指令将设置进位标志（此时它还会将目标寄存器设置为操作数的大小）。

要使用 bextr、blsi、blsr 和 blsmsk 查找并提取零位，在执行这些指令之前反转源操作数。同样，为了使用 tzcnt 计算尾随已设置位的数量，首先要反转源操作数。^(4)
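The inversion trick for tzcnt is easy to see in a small C model. This is a sketch of the value tzcnt produces for 32-bit operands (the carry flag is not modeled), and the function names are hypothetical.

```c
#include <stdint.h>

/* Count trailing zero bits; returns 32 (the operand size) when x is 0,
   as tzcnt does for a 32-bit operand. */
unsigned tzcnt32_model(uint32_t x)
{
    unsigned n = 0;

    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {   /* shift until a 1 reaches bit 0 */
        x >>= 1;
        n++;
    }
    return n;
}

/* Count trailing one bits by inverting the operand first. */
unsigned trailing_ones32(uint32_t x)
{
    return tzcnt32_model((uint32_t)~x);
}
```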

如果你在程序中使用了 bextr、blsi、blsr、blsmsk、tzcnt 或 andn，别忘了检查是否存在 BMI1 指令集扩展。并非所有 x86-64 CPU 都支持这些指令。

12.6 合并位集和分配位串

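Inserting and extracting bit sets differs little from inserting and extracting bit strings, provided the shape of the bit set you're inserting (or the bit set you're extracting) is the same as the shape of the bit set in the main object. The shape of a bit set is the distribution of the bits in the set, ignoring the starting bit position of the set. For example, a bit set that includes bits 0, 4, 5, 6, and 7 has the same shape as a bit set that includes bits 12, 16, 17, 18, and 19, because the distribution of the bits is the same.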

插入或提取该位集的代码与上一节的代码几乎相同；唯一的不同是你使用的掩码值。例如，要将这个位集从 EAX 中的 0 位开始插入到 EBX 中从第 12 位开始的相应位集中，你可以使用以下代码：

and ebx, not 11110001000000000000b ; Mask out destination bits
shl eax, 12                        ; Move source bits into position
or  ebx, eax                       ; Merge the bit set into EBX

然而，假设你在 EAX 中的 0 到 4 位上有 5 个已设置的位，并且你希望将它们合并到 EBX 中的第 12、16、17、18 和 19 位上。你必须以某种方式在对 EBX 执行逻辑或运算之前，先分配这些位。考虑到这个特定的位集由两段 1 位组成，过程变得相对简化。以下代码以巧妙的方式分配这些位：

and ebx, not 11110001000000000000b ; Mask out destination bits
and eax, 11111b                    ; Keep only the 5 source bits
shl eax, 2    ; Spread out bits: 1 to 4 goes to 3 to 6 and 0 goes to 2
btr eax, 2    ; Bit 2 -> carry and then clear bit 2
rcl eax, 13   ; Shift in carry and put bits into final position
or  ebx, eax  ; Merge the bit set into EBX

使用btr（位测试和重置）指令的这个技巧效果很好，因为我们在原始源操作数中只有 1 个位的位置不对。可惜的是，如果这些位相对于彼此都处于错误的位置，那么这个方案就不是一个高效的解决方案。稍后我们会看到一个更通用的解决方案。

提取这个位集并将位合并到位串中并不像看起来那么容易。然而，我们仍然有一些巧妙的技巧可以使用。考虑以下代码，它从 EBX 中提取位集，并将结果放入 EAX 中的位 0 到 4：

mov eax, ebx 
and eax, 11110001000000000000b  ; Strip unwanted bits
shr eax, 5                      ; Put bit 12 into bit 7, and so on
shr ah, 3                       ; Move bits 11 to 14 to 8 to 11
shr eax, 7                      ; Move down to bit 0

这段代码将（原始）位 12 移到位位置 7，即 AL 的 HO 位。同时，它将位 16 到 19 移到位 11 到 14（即 AH 的位 3 到 6）。然后，代码将 AH 中的位 3 到 6 移到位 0。这会将位集的 HO 位定位，使其与 AL 中剩下的位相邻。最后，代码将所有位移到位 0。再次强调，这并不是一个通用的解决方案，但它展示了如果仔细思考，这个问题的一个巧妙处理方式。

上述的合并和分配算法仅适用于它们特定的位集。一个更通用的解决方案（可能是允许你指定一个掩码，然后根据该掩码分配或合并位的方案）会更为复杂。以下代码演示了如何根据位掩码中的值来分配位于位串中的位：

; EAX - Originally contains a value into which we 
;       insert bits from EBX.
; EBX - LO bits contain the values to insert into EAX.
; EDX - Bitmap with 1s indicating the bit positions in 
;       EAX to insert.
; CL -  Scratchpad register.

          mov cl, 32      ; Count number of bits we rotate
          jmp DistLoop

CopyToEAX:
          rcr ebx, 1      ; Don't use SHR, must preserve Z-flag
          rcr eax, 1 
          jz  Done 
DistLoop: dec cl 
          shr edx, 1 
          jc  CopyToEAX 
          ror eax, 1      ; Keep current bit in EAX
          jnz DistLoop 

Done:     ror eax, cl     ; Reposition remaining bits

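If we load EDX with 11001001b, this code will copy bits 0 to 3 of EBX to bits 0, 3, 6, and 7 in EAX. Notice the short-circuit test that checks whether we've exhausted the values in EDX (by checking for 0 in EDX). The rotate instructions do not affect the zero flag, but the shift instructions do. Hence, the preceding shr instruction sets the zero flag when there are no more bits to distribute (when EDX becomes 0).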

合并位的通用算法比一般的分配算法稍微高效一些。以下是将位从 EBX 中提取出来，并通过 EDX 中的位掩码将结果保留在 EAX 中的代码：

; EAX - Destination register.
; EBX - Source register.
; EDX - Bitmap with 1s representing bits to copy to EAX.
; EBX and EDX are not preserved.

     xor eax, eax    ; Clear destination register 
     jmp ShiftLoop

ShiftInEAX:  
     rcl ebx, 1      ; EBX to EAX
     rcl eax, 1
ShiftLoop:   
     shl edx, 1      ; Check to see if we need to copy a bit
     jc  ShiftInEAX  ; If carry set, go copy the bit
     rcl ebx, 1      ; Current bit is uninteresting, skip it
     jnz ShiftLoop   ; Repeat as long as there are bits in EDX

这个过程还利用了移位和旋转指令的一个巧妙特性：移位指令会影响零标志，而旋转指令则不会。因此，shl edx, 1指令在 EDX 变为 0 时会设置零标志（经过移位后）。如果进位标志也被设置，代码将再次遍历循环，直到将一个位移入 EAX，但下一次代码将 EDX 左移 1 位时，EDX 仍然为 0，因此进位标志将被清除。在这一迭代中，代码将跳出循环。

另一种合并位的方法是通过查找表。通过一次获取一个字节的数据（这样你的表不会太大），你可以使用该字节的值作为查找表的索引，合并所有位直到位 0。最后，你可以将每个字节低位的位合并在一起。在某些情况下，这可能会产生一个更高效的合并算法。具体实现由你来决定。

12.7 使用 BMI2 指令合并和分配位字符串

英特尔的 BMI2（位操作指令集，第二集）^(5) 指令集扩展包括一组便捷的指令，可以用来插入或提取任意的位集：pdep（并行位存储）和 pext（并行位提取）。如果你的 CPU 支持这些指令，它们可以处理本章中许多使用非-BMI 指令的任务。它们确实是非常强大的指令。

这些指令具有以下语法：

pdep `reg`[dest], `reg`[src], `reg`[mask]
pdep `reg`[dest], `reg`[src], `mem`[mask]
pext `reg`[dest], `reg`[src], `reg`[mask]
pext `reg`[dest], `reg`[src], `mem`[mask]

所有操作数必须大小相同，并且必须为 32 位或 64 位。

pext 指令从源寄存器（第二个寄存器）提取任意的位字符串，并将这些位合并到目标寄存器中，从位 0 开始按连续的位位置排列。第三个操作数——掩码，控制着 pext 从源寄存器提取哪些位。

掩码操作数包含 pext 将从源寄存器提取的位位置上的 1 位。图 12-4 显示了这个位掩码的工作原理。对于掩码操作数中的每一个 1 位，pext 指令将源寄存器中对应的位复制到目标寄存器中下一个可用的位位置（从位 0 开始）。

图 12-4：pext 指令的位掩码

清单 12-8 是一个示例程序片段及其输出，展示了 pext 指令（与往常一样，此清单省略了常见代码）。

; >>>> Unique code for various listings.

mov     r8d, 12340000h
mov     r9d, 0F0f000Fh
pext    edx, r8d, r9d

; Output value 1 is in RDX (extracted bits).

call    print
byte    "Extracted: %x", nl, 0

; <<<< End of unique code.
------------------------------------------------------------------------------
Extracted: 240

清单 12-8：pext 指令示例

pdep 指令执行与 pext 相反的操作。它从源寄存器操作数的低位（LO 位）开始，获取连续的位集，并通过使用掩码操作数中的 1 位来决定这些位在目标寄存器中的分布，如图 12-5 所示。pdep 指令将目标寄存器中的所有其他位设置为 0。

图 12-5：pdep 指令操作

清单 12-9 是 pdep 指令及其输出的示例。

mov     r8d, 1234h
mov     r9d, 0F0FF00Fh 
pdep    edx, r8d, r9d

; Output value 1 is in RDX (distributed bits).

call    print
byte    "Distributed: %x", nl, 0
------------------------------------------------------------------------------
Distributed: 1023004

清单 12-9：pdep 指令示例
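For readers without BMI2 hardware, the behavior of pext and pdep can be sketched in portable C. These are illustrative 32-bit software models, not how the CPU implements the instructions, and the names `pext32_model` and `pdep32_model` are hypothetical.

```c
#include <stdint.h>

/* pext model: gather the source bits selected by mask into
   consecutive positions starting at bit 0. */
uint32_t pext32_model(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t out = 1;                          /* next destination bit */

    for (; mask != 0; mask &= mask - 1u) {     /* drop lowest mask bit */
        uint32_t bit = mask & (0u - mask);     /* lowest remaining mask bit */
        if (src & bit)
            result |= out;
        out <<= 1;
    }
    return result;
}

/* pdep model: scatter consecutive LO source bits into the positions
   selected by mask; all other destination bits are 0. */
uint32_t pdep32_model(uint32_t src, uint32_t mask)
{
    uint32_t result = 0;
    uint32_t in = 1;                           /* next source bit */

    for (; mask != 0; mask &= mask - 1u) {
        uint32_t bit = mask & (0u - mask);
        if (src & in)
            result |= bit;
        in <<= 1;
    }
    return result;
}
```

Using the operands from Listings 12-8 and 12-9, `pext32_model(0x12340000, 0x0F0F000F)` gives 240h and `pdep32_model(0x1234, 0x0F0FF00F)` gives 1023004h.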

如果在程序中使用了 pdep 或 pext 指令，别忘了测试是否支持 BMI2 指令集扩展。并非所有 x86-64 CPU 都支持这些指令。请参见第十一章的清单 11-2，查看如何检查是否支持 BMI2 指令集扩展。

12.8 位字符串的打包数组

尽管效率低得多，但完全可以创建大小不是 8 位倍数的元素数组。缺点是，计算数组元素的“地址”并操作该数组元素需要额外的工作。在本节中，我们将通过一些示例来看一看如何打包和解包数组元素，这些元素是任意位数长度的。

为什么你需要位对象数组？答案很简单：节省空间。如果一个对象只占 3 位，你可以通过打包数据，而不是为每个对象分配一个字节，将同样的空间装入 2.67 倍的元素。对于非常大的数组，这可以节省大量空间。当然，这种节省空间的代价是速度：你必须执行额外的指令来打包和解包数据，从而减慢对数据的访问速度。

在一个大块位中定位数组元素的位偏移量的计算几乎与标准数组访问相同：

`element_address_in_bits` = 
 `base_address_in_bits` + `index` * `element_size_in_bits`

一旦你计算出元素的位地址，你需要将其转换为字节地址（因为我们在访问内存时必须使用字节地址），并提取指定的元素。由于数组元素的基地址（几乎）总是从字节边界开始，我们可以使用以下公式来简化这一任务：

`byte_of_1st_bit` = 
    `base_address` + (`index` * `element_size_in_bits`) / 8

`offset_to_1st_bit` = 
    (`index` * `element_size_in_bits`) % 8

例如，假设我们有一个包含 200 个三位对象的数组，我们可以按如下方式声明：

             .data
AO3Bobjects  byte (200 * 3)/8 + 2 dup (?)  ; "+2" handles truncation

前面维度中的常量表达式为足够的字节预留空间来存储 600 位（200 个元素，每个元素 3 位）。正如注释所指出的，这个表达式在末尾添加了 2 个额外的字节，以确保我们不会丢失任何奇数位^(6)，并且允许我们访问数组末尾之后的 1 个字节（当向数组存储数据时）。

现在，假设你想访问这个数组的第i个三位元素。你可以通过以下代码提取这些位：

; Extract the i-th group of 3 bits in AO3Bobjects 
; and leave this value in EAX.

xor  ecx, ecx             ; Bit offset ((i * 3) mod 8) goes here
mov  eax, i               ; Get the index into the array
lea  rax, [rax + rax * 2] ; RAX := RAX * 3 (3 bits/element)
shrd rcx, rax, 3          ; RAX / 8 -> RAX and RAX mod 8 -> RCX 
                          ; (HO bits)
shr  rax, 3               ; Remember, shrd doesn't modify EAX
rol  rcx, 3               ; Put remainder into LO 3 bits of RCX

; Okay, fetch the word containing the 3 bits we want to 
; extract. We have to fetch a word because the last bit or two 
; could wind up crossing the byte boundary (that is, bit offset 6 
; and 7 in the byte).

lea r8, AO3Bobjects
mov ax, [r8][rax * 1]
shr ax, cl                ; Move bits down to bit 0
and eax, 111b             ; Remove the other bits (incl HO RAX)

将一个元素插入到数组中要稍微复杂一点。除了计算数组元素的基地址和位偏移量外，你还需要创建一个掩码来清除目标位置中你要插入新数据的位。Listing 12-10 将 EAX 的低 3 位插入到AO3Bobjects数组的第i个元素中。

; Listing 12-10

; Storing a 3-bit value into an array of 3-bit elements.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 12-10", 0

Masks       equ     this word
            word    not 0111b,            not 00111000b
            word    not 000111000000b,    not 1110b
            word    not 01110000b,        not 001110000000b
            word    not 00011100b,        not 11100000b

            .data
i           dword   5
AO3Bobjects byte    (200*3)/8 + 2 dup (?)   ; "+2" handles truncation

            include getTitle.inc
            include print.inc

            .code

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56           ; Shadow storage

            mov     eax, 7            ; Value to store

            mov     ebx, i            ; Get the index into the array
            mov     ecx, ebx          ; Use LO 3 bits as index
            and     ecx, 111b         ; into Masks table
            lea     r8, Masks
            mov     dx, [r8][rcx * 2] ; Get bit mask

; Convert index into the array into a bit index.
; To do this, multiply the index by 3:

            lea     rbx, [rbx + rbx * 2]

; Divide by 8 to get the byte index into EBX
; and the bit index (the remainder) into ECX:

            shrd    ecx, ebx, 3
            shr     ebx, 3
            rol     ecx, 3

; Grab the bits and clear those we're inserting.

            lea     r8, AO3Bobjects
            and     dx, [r8][rbx * 1]

; Put our 3 bits in their proper location.

            shl     ax, cl

; Merge bits into destination.

            or      dx, ax

; Store back into memory.

            mov     [r8][rbx * 1], dx

            mov     edx, dword ptr AO3Bobjects
            call    print
            byte    "value:%x", nl, 0

allDone:    leave
            pop     rdi
            pop     rsi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 12-10：将值 7（111b）存储到一个 3 位元素的数组中

将 Listing 12-10 中的代码插入到 Shell 汇编文件中，产生以下输出：

value:38000

print语句打印AO3Bobjects的前 32 位。由于每个元素是 3 位，所以数组看起来像这样：

000 000 000 000 000 111 000 000 000 000 00 ...

其中第 0 位是最左边的位。为了让它们更易读，我们将 32 位翻转过来，并按 4 位分组（方便转换为十六进制），得到：

0000 0000 0000 0011 1000 0000 0000 0000

结果是 38000h。

Listing 12-10 使用查找表生成清除数组中适当位置所需的掩码。该数组的每个元素包含所有 1，除了需要为给定位偏移清除的三个 0（注意使用 not 运算符来反转表中的常量）。
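The address arithmetic in this section can be summarized in a short C sketch. It computes the mask with a shift instead of the Masks lookup table (a deliberate substitution for brevity), and the type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { ELEM_BITS = 3, ELEM_COUNT = 200 };

/* "+ 2": one byte for truncation, one so that 16-bit accesses near
   the end of the array stay inside the buffer. */
typedef struct {
    uint8_t bytes[(ELEM_COUNT * ELEM_BITS) / 8 + 2];
} Packed3;

/* Fetch element `index`: locate the byte and bit offset, read a
   little-endian 16-bit word, shift down, and mask to 3 bits. */
unsigned packed3_get(const Packed3 *a, size_t index)
{
    size_t   bit  = index * ELEM_BITS;
    size_t   byte = bit / 8;
    unsigned off  = (unsigned)(bit % 8);
    unsigned word = a->bytes[byte] | ((unsigned)a->bytes[byte + 1] << 8);

    return (word >> off) & 7u;
}

/* Store the low 3 bits of `value` into element `index`. */
void packed3_set(Packed3 *a, size_t index, unsigned value)
{
    size_t   bit  = index * ELEM_BITS;
    size_t   byte = bit / 8;
    unsigned off  = (unsigned)(bit % 8);
    unsigned word = a->bytes[byte] | ((unsigned)a->bytes[byte + 1] << 8);

    word &= ~(7u << off);                 /* clear the 3-bit slot  */
    word |= (value & 7u) << off;          /* merge in the new bits */
    a->bytes[byte]     = (uint8_t)(word & 0xFF);
    a->bytes[byte + 1] = (uint8_t)((word >> 8) & 0xFF);
}
```

Storing 7 at index 5 of a zeroed array sets bits 15 through 17, so the first 32 bits read back as 38000h, the same value Listing 12-10 prints.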

12.9 搜索位

一个常见的位操作是定位位序列的结束。这个操作的一个特例是定位 16 位、32 位或 64 位值中第一个（或最后一个）设置或清除的位。在本节中，我们将探讨处理这种特例的方法。

第一个设置位 指的是在一个值中，从位 0 向高位扫描时，第一个包含 1 的位。对于 第一个清除位，也有类似的定义。最后一个设置位 是在一个值中，从高位向位 0 扫描时，第一个包含 1 的位。对于 最后一个清除位，也有类似的定义。

搜索第一个或最后一个位的一种明显方法是使用循环中的移位指令，并计算在移出 1（或 0）到进位标志之前的迭代次数。迭代次数指定了该位置。以下是一些示例代码，用于检查 EAX 中的第一个设置位，并将该位位置返回到 ECX：

          mov ecx, -32  ; Count off the bit positions in ECX
TstLp:    shr eax, 1    ; Check to see if current bit
                        ; position contains a 1
          jc  Done      ; Exit loop if it does 
          inc ecx       ; Bump up our bit counter by 1
          jnz TstLp     ; Exit if we execute this loop 32 times

Done:     add cl, 32    ; Adjust loop counter so it holds 
                        ; the bit position

; At this point, CL contains the bit position of the 
; first set bit. CL contains 32 if EAX originally 
; contained 0 (no set bits).

这段代码唯一复杂的地方是，它运行的循环计数器是从 -32 到 0，而不是从 32 到 0。这样，在循环结束后计算位位置就稍微容易一些。

这个特定循环的缺点是它的开销很大。根据 EAX 中的原始值，这个循环可能会重复最多 32 次。如果你检查的值在 EAX 的低位经常有很多 0，这段代码就会运行得比较慢。

搜索第一个（或最后一个）设置位是一个非常常见的操作，因此 Intel 特意增加了几条指令来加速这一过程。这些指令是 bsf (位扫描前进) 和 bsr (位扫描反向)。它们的语法如下：

bsr `reg`[dest], `reg`[src] 
bsr `reg`[dest], `mem`[src] 
bsf `reg`[dest], `reg`[src] 
bsf `reg`[dest], `mem`[src]

源操作数和目标操作数必须具有相同的大小（16 位、32 位或 64 位）。目标操作数必须是寄存器。源操作数可以是寄存器或内存位置。

bsf 指令扫描源操作数中第一个被设置的位（从位位置 0 开始）。bsr 指令通过从高位向低位扫描，查找源操作数中的最后一个被设置的位。如果这些指令在源操作数中找到一个已设置的位，它们会清除零标志并将该位位置放入目标寄存器中。如果源寄存器包含 0（即没有被设置的位），则这些指令会设置零标志，并在目标寄存器中留下一个不确定的值。你应该在这些指令执行后立即测试零标志，以验证目标寄存器的值。以下是一个示例：

mov ebx, SomeValue  ; Value whose bits we want to check
bsf eax, ebx        ; Put position of first set bit in EAX
jz  NoBitsSet       ; Branch if SomeValue contains 0
mov FirstBit, eax   ; Save location of first set bit
    .
    .
    .

你以相同的方式使用 bsr 指令，唯一的区别是它计算操作数中最后一个设置位的位置（即从高位向低位扫描时找到的第一个设置位）。

x86-64 CPU 不提供定位第一个包含 0 的位的指令。然而，你可以通过先取反源操作数（如果必须保留源操作数的值，则取反其副本），然后搜索第一个 1 位，来轻松扫描 0 位；这对应于原始操作数值中的第一个 0 位。

bsf 和 bsr 指令是复杂的 x86-64 指令，可能比其他指令更慢。在某些情况下，通过使用离散指令定位第一个设置的位可能会更快。但是，由于这些指令的执行时间因 CPU 而异，因此你应该在将它们用于时间关键代码之前测试它们的性能。

注意，bsf 和 bsr 指令不会影响源操作数。一种常见的操作是提取（并清除）操作数中找到的第一个或最后一个设置的位。如果源操作数在寄存器中，你可以在找到该位后使用 btr（或 btc）指令清除该位。下面是实现这一结果的一些代码：

          bsf ecx, eax       ; Locate first set bit in EAX
          jz  noBitFound     ; If we found a bit, clear it

          btr eax, ecx       ; Clear the bit we just found

noBitFound:

在这一序列的末尾，零标志指示我们是否找到了一位（注意，btr 不会影响零标志）。

因为 bsf 和 bsr 指令只支持 16 位、32 位和 64 位操作数，所以你需要稍微不同的方式来计算 8 位操作数的第一个位位置。有几种合理的做法。首先，你可以将 8 位操作数零扩展到 16 位或 32 位，然后使用 bsf 或 bsr 指令。另一种选择是创建一个查找表，其中每个条目包含你用作索引的值的位数；然后，你可以使用 xlat 指令来“计算”值中的第一个位位置（你需要将值 0 作为特殊情况处理）。另一种解决方案是使用本节开头出现的移位算法；对于 8 位操作数来说，这并非完全低效的解决方案。

你可以使用 bsf 和 bsr 来确定一段位的大小，前提是操作数中只有一段连续的位。只需定位这一段中的第一个和最后一个位（如前面的例子所示），然后计算这两个值之间的差（加 1）。当然，这种方案仅在第一个和最后一个设置位之间没有中断的 0 时有效。
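The run-length computation can be sketched in C as follows. The helper functions model the bit index that bsf and bsr return; because those instructions leave the destination undefined for a zero source, returning -1 in that case is a convention chosen for this sketch, and all names are hypothetical.

```c
#include <stdint.h>

/* Index of the first (lowest) set bit, or -1 if x is 0. */
int first_set_bit(uint32_t x)
{
    int n = 0;

    if (x == 0)
        return -1;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Index of the last (highest) set bit, or -1 if x is 0. */
int last_set_bit(uint32_t x)
{
    int n = 0;

    if (x == 0)
        return -1;
    while ((x >>= 1) != 0)
        n++;
    return n;
}

/* Length of a single contiguous run of 1 bits: last - first + 1.
   Only meaningful when x holds exactly one run with no embedded 0s. */
int run_length(uint32_t x)
{
    return (x == 0) ? 0 : last_set_bit(x) - first_set_bit(x) + 1;
}
```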

12.10 计数位

前一节中的最后一个例子展示了一个非常通用问题的特定案例：计数位。遗憾的是，这个例子有一个严重的限制：它只计算源操作数中出现的单一连续的 1 位。本节讨论了这个问题的更通用解决方案。

几乎每周，总有人在互联网上的新闻组中询问如何计算寄存器操作数中的位数。这是一个常见的请求，毫无疑问，许多汇编语言课程的教师布置了这个任务，作为教学手段，目的是让学生了解移位和旋转指令，具体如下：

; BitCount1:

; Counts the bits in the EAX register, 
; returning the count in EBX.

          mov cl, 32    ; Count the 32 bits in EAX
          xor ebx, ebx  ; Accumulate the count here
CntLoop:  shr eax, 1    ; Shift bit out of EAX and into carry
          adc bl, 0     ; Add the carry into the EBX register
          dec cl        ; Repeat 32 times
          jnz CntLoop

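The "trick" here is that this code uses the adc instruction to add the value of the carry flag into the BL register. Because the count can never exceed 32, the result fits comfortably in BL.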

不管代码是否复杂，这条指令序列的执行速度都不算特别快。前面的循环总是执行 32 次，因此这段代码序列会执行 130 条指令（每次迭代 4 条指令，加上 2 条额外指令）。

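For greater efficiency, use the popcnt instruction (population count, introduced alongside the SSE4.2 extensions and reported by its own CPUID feature bit), which counts the number of 1 bits in the source operand and stores the result in the destination operand: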

popcnt `reg`[dest], `reg`[src]
popcnt `reg`[dest], `mem`[src]

操作数必须大小相同，并且必须是 16 位、32 位或 64 位。
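When popcnt is unavailable, the count can be computed in software. The sketch below shows two C versions: `popcount_loop` follows the shift-and-add idea of BitCount1, while `popcount_clear_lowest` is a different, well-known technique (repeatedly clearing the lowest set bit) that is not taken from this chapter. Both names are hypothetical.

```c
#include <stdint.h>

/* Shift-and-add, as in BitCount1: always 32 iterations. */
unsigned popcount_loop(uint32_t x)
{
    unsigned count = 0;
    int i;

    for (i = 0; i < 32; i++) {
        count += x & 1u;      /* add the bit that adc would add */
        x >>= 1;
    }
    return count;
}

/* Clear the lowest set bit until none remain: one iteration per
   1 bit rather than one per bit position. */
unsigned popcount_clear_lowest(uint32_t x)
{
    unsigned count = 0;

    while (x != 0) {
        x &= x - 1u;
        count++;
    }
    return count;
}
```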

12.11 反转位串

另一个常见的编程项目是反转操作数中的位。这是一个独立有用的功能，也常被教学人员布置。这段程序将 LO 位与 HO 位交换，位 1 与次高位交换，依此类推。教学人员通常期待的解决方案如下：

; Reverse the 32 bits in EAX, leaving the result in EBX: 

               mov cl, 32     ; Process all 32 bits
RvsLoop:       shr eax, 1     ; Move current bit in EAX to the carry flag
               rcl ebx, 1     ; Shift the bit back into EBX, backward
               dec cl  
               jnz RvsLoop

和前面的例子一样，这段代码存在重复执行 32 次循环的问题，总共执行了 129 条指令（对于 32 位操作数，64 位操作数则需要翻倍）。通过展开循环，你可以将指令数减少到 64 条，但这依然有些昂贵。

优化问题的最佳解决方案通常是使用更好的算法，而不是试图通过选择更快的指令来调整代码。在前面的部分中，例如，我们通过替换更复杂的算法来加速位串计数，而不是使用简单的“移位计数”算法。在前面的示例中，关键是尽可能并行地进行大量工作。

假设我们只需要交换 32 位值中偶数位和奇数位。我们可以通过以下代码轻松地在 EAX 中交换偶数位和奇数位：

mov edx, eax        ; Make a copy of the odd bits
shr eax, 1          ; Move the even bits to the odd positions
and edx, 55555555h  ; Isolate the odd bits
and eax, 55555555h  ; Isolate the even bits
shl edx, 1          ; Move the odd bits to even positions
or  eax, edx        ; Merge the bits and complete the swap

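Swapping the even and odd bits takes us part of the way toward reversing the bits in the number. After executing the preceding code sequence, you can swap adjacent pairs of bits, which completes the reversal of the bits within every nibble of the 32-bit value, by using the following code: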

mov edx, eax        ; Make a copy of the odd-numbered bit pairs
shr eax, 2          ; Move the even bit pairs to the odd position
and edx, 33333333h  ; Isolate the odd pairs
and eax, 33333333h  ; Isolate the even pairs
shl edx, 2          ; Move the odd pairs to the even positions
or  eax, edx        ; Merge the bits and complete the swap

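After completing the preceding sequence, you swap the adjacent nibbles in the 32-bit register. Again, the only difference is the bit mask and the length of the shifts. Here's the code: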

mov edx, eax        ; Make a copy of the odd-numbered nibbles
shr eax, 4          ; Move the even nibbles to the odd position
and edx, 0f0f0f0fh  ; Isolate the odd nibbles
and eax, 0f0f0f0fh  ; Isolate the even nibbles
shl edx, 4          ; Move the odd nibbles to the even positions
or  eax, edx        ; Merge the bits and complete the swap

你可能已经看出其中的模式，并且能推测出在接下来的两步中，你需要交换字节和字。你可以像前面的例子一样编写代码，但有一种更好的方法：使用bswap。bswap（字节交换）指令使用以下语法：

bswap `reg`[32]

bswap指令交换指定 32 位寄存器中字节 0 和字节 3，以及字节 1 和字节 2，这正是反转比特时所需要的操作（以及在将数据在小端和大端数据格式之间转换时，这条指令的主要用途）。你可以直接使用bswap eax指令来完成工作，而无需再插入 12 条指令交换字节和字。在以下代码序列中展示了最终结果：

mov   edx, eax       ; Make a copy of the odd bits in the data
shr   eax, 1         ; Move the even bits to the odd positions
and   edx, 55555555h ; Isolate the odd bits
and   eax, 55555555h ; Isolate the even bits
shl   edx, 1         ; Move the odd bits to the even positions
or    eax, edx       ; Merge the bits and complete the swap

mov   edx, eax       ; Make a copy of the odd-numbered bit pairs
shr   eax, 2         ; Move the even bit pairs to the odd position
and   edx, 33333333h ; Isolate the odd pairs
and   eax, 33333333h ; Isolate the even pairs
shl   edx, 2         ; Move the odd pairs to the even positions
or    eax, edx       ; Merge the bits and complete the swap

mov   edx, eax       ; Make a copy of the odd-numbered nibbles
shr   eax, 4         ; Move the even nibbles to the odd position
and   edx, 0f0f0f0fh ; Isolate the odd nibbles
and   eax, 0f0f0f0fh ; Isolate the even nibbles
shl   edx, 4         ; Move the odd nibbles to the even positions
or    eax, edx       ; Merge the bits and complete the swap

bswap eax            ; Swap the bytes and words

该算法只需要 19 条指令，并且比之前的比特移位循环执行得更快。当然，这个序列的内存消耗略高。如果你更倾向于节省内存而非时钟周期，那么循环可能是一个更好的解决方案。
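The same swap-in-parallel algorithm translates directly to C, which can be handy for checking the assembly's results. This is a sketch; `reverse_bits32` is a hypothetical name, and the final expression performs in plain C the byte swap that bswap does in one instruction.

```c
#include <stdint.h>

uint32_t reverse_bits32(uint32_t x)
{
    /* Swap adjacent bits, then adjacent bit pairs, then adjacent nibbles. */
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);

    /* Swap bytes 0 and 3 and bytes 1 and 2 (the bswap step). */
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}
```

Applying the function twice returns the original value, which makes a convenient sanity check.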

12.12 合并比特串

另一个常见的比特串操作是通过合并或交错来自两个不同源的比特，生成一个单一的比特串。以下示例代码序列通过合并两个 16 位比特串中的交替比特，创建了一个 32 位比特串：

; Merge two 16-bit strings into a single 32-bit string.
; AX - Source for even-numbered bits.
; BX - Source for odd-numbered bits.
; CL  - Scratch register.
; EDX - Destination register.

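          mov  cl, 16 
MergeLp:  shrd edx, eax, 1     ; Shift a bit from EAX into EDX
          shr  eax, 1          ; shrd leaves its source unchanged, so advance EAX
          shrd edx, ebx, 1     ; Shift a bit from EBX into EDX
          shr  ebx, 1          ; Advance EBX to its next bit
          dec  cl 
          jne  MergeLp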

这个特定的例子将两个 16 位值合并在一起，交替地将它们的比特插入到结果值中。为了更快地实现此代码，可以展开循环以减少一半指令的使用。

通过一些小的修改，我们可以将四个 8 位值合并在一起，或从源字符串中合并其他比特集合。例如，以下代码从 EAX 复制比特 0 到 5，从 EBX 复制比特 0 到 4，从 EAX 复制比特 6 到 11，从 EBX 复制比特 5 到 15，最后从 EAX 复制比特 12 到 15：

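shrd edx, eax, 6    ; Bits 0 to 5 from EAX
shr  eax, 6         ; shrd does not modify its source; advance EAX
shrd edx, ebx, 5    ; Bits 0 to 4 from EBX
shr  ebx, 5         ; Advance EBX
shrd edx, eax, 6    ; Bits 6 to 11 from EAX
shr  eax, 6         ; Advance EAX
shrd edx, ebx, 11   ; Bits 5 to 15 from EBX
shrd edx, eax, 4    ; Bits 12 to 15 from EAX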

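Of course, if you have the BMI2 instruction set available, you can also use the pdep instruction to place the bits from each source into the desired positions of another register (and pext to extract them again).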
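The intended result of the 16-bit merge loop can be stated compactly in C. In this sketch (the name `interleave16` is hypothetical), bit i of the first source lands at bit 2i of the result and bit i of the second source lands at bit 2i + 1, matching the AX/BX roles listed in the loop's header comments.

```c
#include <stdint.h>

/* Interleave two 16-bit strings into one 32-bit string. */
uint32_t interleave16(uint16_t even_src, uint16_t odd_src)
{
    uint32_t result = 0;
    unsigned i;

    for (i = 0; i < 16; i++) {
        result |= (uint32_t)((even_src >> i) & 1u) << (2 * i);
        result |= (uint32_t)((odd_src  >> i) & 1u) << (2 * i + 1);
    }
    return result;
}
```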

12.13 提取比特串

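We can also extract and distribute the bits in a bit string among multiple destinations. The following code takes the 32-bit value in EAX and distributes alternating bits to the BX and DX registers:

           mov cl, 16   ; Count the loop iterations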
ExtractLp: shr eax, 1   ; Extract even bits to (E)BX
           rcr ebx, 1 
           shr eax, 1   ; Extract odd bits to (E)DX
           rcr edx, 1 
           dec cl       ; Repeat 16 times
           jnz ExtractLp
           shr ebx, 16  ; Need to move the results from the HO
           shr edx, 16  ; bytes of EBX and EDX to the LO bytes

该序列执行 99 条指令（循环内部 6 条，循环重复 16 次，加上循环外部 3 条）。你可以展开循环并使用其他技巧，但当一切完成后，可能不值得增加复杂性。

如果你有 BMI2 指令集扩展可用，你也可以使用pext指令高效地完成这项工作：

mov  ecx, 55555555h  ; Even bit positions
pext ebx, eax, ecx   ; Put even bits into EBX
mov  ecx, 0aaaaaaaah ; Odd bit positions
pext edx, eax, ecx   ; Put odd bits into EDX
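For comparison, here is a C sketch of the same de-interleaving operation, following the convention of the shift loop above (even-numbered bits to one destination, odd-numbered bits to the other). The name `deinterleave32` and its output parameters are hypothetical.

```c
#include <stdint.h>

/* Split a 32-bit value into its even-numbered and odd-numbered bits,
   each packed into a 16-bit result. */
void deinterleave32(uint32_t src, uint16_t *even_out, uint16_t *odd_out)
{
    uint16_t even = 0;
    uint16_t odd  = 0;
    unsigned i;

    for (i = 0; i < 16; i++) {
        even |= (uint16_t)(((src >> (2 * i))     & 1u) << i);
        odd  |= (uint16_t)(((src >> (2 * i + 1)) & 1u) << i);
    }
    *even_out = even;
    *odd_out  = odd;
}
```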

12.14 搜索比特模式

另一个与比特相关的操作是搜索比特串中是否包含特定的比特模式。例如，你可能希望从比特串的某个特定位置开始，找到1011b第一次出现的比特索引。在本节中，我们将探讨一些简单的算法来完成这个任务。

要搜索特定的位模式，我们需要知道四件事：

要搜索的模式（模式）
我们要搜索的模式的长度
我们将要搜索的位串（源）
要搜索的位串的长度

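The basic idea behind the search is to create a mask based on the length of the pattern and mask a copy of the source with this value. Then we can directly compare the pattern with the masked source for equality. If they are equal, we're finished; if they're not equal, increment a bit position counter, shift the source one position to the right, and try again. You repeat this operation length(source) - length(pattern) times. The algorithm fails if it does not detect the bit pattern after this many attempts (because we will have exhausted all the bits in the source operand that could match the pattern's length). Here's a simple algorithm that searches for a 4-bit pattern throughout the EBX register:

          mov cl, 28       ; 28 attempts because 32 - 4 = 28
                           ; (len(src) - len(pat))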
          mov ch, 1111b    ; Mask for the comparison
          mov al, `pattern`  ; Pattern to search for
          and al, ch       ; Mask unnecessary bits in AL
          mov ebx, `source`  ; Get the source value
ScanLp:   mov dl, bl       ; Copy the LO 4 bits of EBX
          and dl, ch       ; Mask unwanted bits
          cmp al, dl       ; See if we match the pattern
          jz  Matched
          dec cl           ; Repeat specified number of times
          shr ebx, 1 
          jnz ScanLp 

; Do whatever needs to be done if we failed to 
; match the bit string. 

          jmp Done

Matched: 

; If we get to this point, we matched the bit string. 
; We can compute the position in the original source as 28 - CL.

Done:

位串扫描是字符串匹配的一个特殊情况。字符串匹配是计算机科学中研究得非常透彻的问题，你可以用于字符串匹配的许多算法同样适用于位串匹配。这些算法超出了本章的范围，但为了让你对其工作原理有所了解，你可以通过计算一个函数（如 xor 或 sub）在模式与当前源位之间的值，并将结果用作查找表的索引来决定可以跳过多少位。这些算法允许你跳过多个位，而不是在每次扫描循环迭代时只偏移一次（正如之前的算法所做的那样）。
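The simple mask-compare-shift search can be sketched in C as shown below. The function name and the -1 "not found" convention are assumptions for this illustration; unlike the assembly loop, this version explicitly tests every starting position from 0 through length(source) - length(pattern).

```c
#include <stdint.h>

/* Return the lowest bit position at which the low `pat_len` bits of
   `pattern` occur in `source`, or -1 if they never occur.
   pat_len must be in the range 1..32. */
int find_bit_pattern(uint32_t source, uint32_t pattern, unsigned pat_len)
{
    uint32_t mask = (pat_len >= 32) ? 0xFFFFFFFFu : ((1u << pat_len) - 1u);
    unsigned pos;

    pattern &= mask;                         /* strip unnecessary bits */
    for (pos = 0; pos + pat_len <= 32; pos++) {
        if (((source >> pos) & mask) == pattern)
            return (int)pos;                 /* matched at this offset */
    }
    return -1;                               /* pattern not present */
}
```

For example, searching for 1011b in B0h reports position 4.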

12.15 获取更多信息

AMD Athlon 优化指南包含了用于基于位的计算的有用算法。要了解更多关于位查找算法的内容，请阅读一本关于数据结构和算法的教科书，并学习其中的字符串匹配算法章节。

关于位操作的终极书籍可能是 Hacker’s Delight（黑客的愉悦），第二版，作者是 Henry S. Warren（Addison-Wesley，2012）。虽然本书使用 C 编程语言举例，但几乎所有概念也适用于汇编语言程序。

12.16 自测

你会使用什么通用指令来清除寄存器中的位？
你可以使用什么指令清除寄存器中由位号指定的某一位？
你会使用什么通用指令来设置寄存器中的位？
你可以使用什么指令设置寄存器中由位号指定的某一位？
你会使用什么通用指令来反转寄存器中的位？
你可以使用什么指令反转寄存器中由位号指定的某一位？
你会使用什么通用指令来测试寄存器中某一位（或一组位）是否为 0 或 1？
你可以使用什么指令测试寄存器中由位号指定的单个位？
你可以使用什么单一指令提取并合并一组位？
你可以使用什么单一指令在寄存器中定位并插入一组位？
你可以使用什么单一指令从较大的位串中提取一个位子串？
什么指令允许你搜索寄存器中第一个已设置的位？
什么指令允许你搜索寄存器中最后一个已设置的位？
如何搜索寄存器中第一个未设置的位？
如何搜索寄存器中最后一个未设置的位？
你可以使用什么指令来计算寄存器中位的数量？

第十三章：宏和 MASM 编译时语言

本章讨论 MASM 编译时语言，包括非常重要的宏展开功能。宏是一个标识符，汇编程序会将其展开成额外的文本（通常是多行文本），允许你通过一个标识符来缩写大量代码。MASM 的宏功能实际上是一个计算机语言中的计算机语言；也就是说，你可以在 MASM 源文件中编写小型程序，这些程序的目的是生成其他 MASM 源代码，然后由 MASM 进行汇编。

这种语言中的语言，也称为编译时语言，由宏（编译时语言中的过程等效物）、条件语句（if语句）、循环和其他语句组成。本章介绍了 MASM 编译时语言的许多功能，并展示了如何使用它们来减少编写汇编语言代码的工作量。

13.1 编译时语言简介

MASM 实际上是将两种语言合并成一个程序。运行时语言是你在之前所有章节中阅读到的标准 x86-64/MASM 汇编语言。这被称为运行时语言，因为你编写的程序在运行可执行文件时执行。MASM 包含一个解释器，用于另一种语言，即 MASM 编译时语言（CTL）。MASM 源文件包含 MASM CTL 和运行时程序的指令，MASM 在汇编（编译）期间执行 CTL 程序。一旦 MASM 完成汇编，CTL 程序就会终止（参见图 13-1）。

图 13-1：编译时执行与运行时执行

CTL 应用程序不是 MASM 生成的运行时可执行文件的一部分，尽管 CTL 应用程序可以为你编写部分运行时程序，事实上，这就是 CTL 的主要用途。通过自动代码生成，CTL 使你能够轻松而优雅地输出重复的代码。通过学习如何使用 MASM CTL 并正确应用它，你可以像开发高级语言应用程序一样快速开发汇编语言应用程序（甚至更快，因为 MASM 的 CTL 让你能够创建非常高级语言的构造）。

13.2 echo 和 .err 指令

你可能还记得第一章开始时提到的大多数人学习新语言时编写的典型第一个程序——“Hello, world！”程序。列表 13-1 提供了用 MASM 编译时语言编写的基本“Hello, world！”程序。

; Listing 13-1

; CTL "Hello, world!" program.

echo    Listing 13-1: Hello, world!
end

列表 13-1：CTL “Hello, world！” 程序

该程序中的唯一 CTL 语句是echo语句。^(1) end语句仅仅是为了让 MASM 保持正常运行。

echo语句在汇编 MASM 程序时显示其参数列表的文本表示。因此，如果你使用以下命令编译前面的程序

ml64 /c listing13-1.asm

MASM 汇编器将立即打印以下文本：

Listing 13-1: Hello, world!

除了显示与echo参数列表相关的文本外，echo语句对程序的汇编没有任何影响。它对调试 CTL 程序非常宝贵，可以显示汇编过程的进度，以及在汇编过程中发生的假设和默认操作。

尽管汇编语言调用print也会将文本输出到标准输出，但在 MASM 源文件中的以下两组语句之间有一个很大的区别：

echo "Hello World"

call print
byte "Hello World", nl,0

第一个语句在汇编过程中打印"Hello World"（并带有换行符），并且对可执行程序没有影响。最后两行不会影响汇编过程（除了将代码输出到可执行文件）。然而，当你运行可执行文件时，第二组语句会打印字符串Hello World，后面跟着换行符序列。

.err指令，类似于echo，将在汇编期间将字符串显示到控制台，但这必须是一个文本字符串（由<和>界定）。.err语句将文本作为 MASM 错误诊断的一部分显示。此外，.err语句还会增加错误计数，这将导致 MASM 在处理完当前源文件后停止汇编（不进行汇编或链接）。通常，当你的 CTL 代码发现一些问题，导致它无法生成有效代码时，你会使用.err语句在汇编过程中显示错误消息。例如：

.err <Statement must have exactly one operand>

13.3 编译时常量和变量

就像运行时语言一样，编译时语言也支持常量和变量。你可以通过使用textequ或equ指令来声明编译时常量。你可以通过使用=指令（编译时赋值语句）来声明编译时变量。例如：

inc_by equ 1
ctlVar = 0
ctlVar = ctlVar + inc_by

13.4 编译时表达式和运算符

MASM CTL 支持常量表达式在 CTL 赋值语句中的使用。有关常量表达式的讨论，请参见第四章中的“MASM 常量声明”（这些也是 CTL 表达式和运算符）。

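In addition to the operators and functions described in that chapter, MASM includes several additional CTL operators, functions, and directives you will find useful. The following subsections describe them.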

13.4.1 MASM 转义（!）运算符

第一个运算符是!运算符。当它放在另一个符号前面时，MASM 会将该字符视为文本，而不是特殊符号。例如，!;创建一个由分号字符组成的文本常量，而不是一个注释，后者会导致 MASM 忽略；符号后的所有文本（对于 C/C++程序员来说，这类似于字符串常量中的反斜杠转义字符\）。

13.4.2 MASM 评估（%）运算符

第二个有用的 CTL 操作符是%。百分号操作符使得 MASM 评估它后面的表达式，并用该表达式的值替换它。例如，考虑以下代码序列：

num10   =        10
text10  textequ  <10>
tn11    textequ  %num10 + 1

如果你在汇编语言源文件中组装这个序列，并指示 MASM 生成汇编清单，它会报告以下三个符号：

num10  . . . . . . . . . . . . .        Number   0000000Ah
text10 . . . . . . . . . . . . .        Text     10
tn11 . . . . . . . . . . . . . .        Text     11

num10被正确地报告为数值（十进制 10），text10被报告为文本符号（包含字符串10），而tn11被报告为文本符号（正如你所预期的，因为该代码序列使用textequ指令来定义它）。然而，MASM 不会包含字符串%num10 + 1，而是评估表达式num10 + 1，产生数值 11，然后将其转换为文本数据。（顺便说一下，若要在文本字符串中放置百分号，请使用文本序列<!%>。）

如果你将%操作符放在源代码行的第一列，MASM 将把该行的所有数值表达式转换为文本形式。这在使用echo指令时很有用。它使得echo显示数值常量的值，而不仅仅是显示常量的名称。

13.4.3 `catstr`指令

catstr函数具有以下语法：

`identifier`   catstr  `string1`, `string2`, ...

identifier是一个（直到此时）未定义的符号。string1和string2操作数是被<和>符号包围的文本数据。这个语句将把两个字符串的连接结果存储到identifier中。请注意，identifier是一个文本对象，而不是字符串对象。如果你在代码中指定该标识符，MASM 会用文本字符串替换标识符，并尝试将该文本数据作为源代码的一部分进行处理。

catstr语句允许两个或更多用逗号分隔的操作数。catstr指令将按照它们在操作数字段中出现的顺序连接文本值。以下语句生成文本数据Hello, World!：

helloWorld catstr <Hello>, <, >, <World!!>

在这个示例中需要使用两个感叹号，因为!是一个操作符，告诉 MASM 将下一个符号视为文本而非操作符。只有一个!符号时，MASM 会认为你尝试将>符号包含为字符串的一部分，并报告错误（因为没有关闭的>符号）。在文本字符串中使用!!告诉 MASM 将第二个!符号视为文本字符。

13.4.4 `instr`指令

instr指令用于在一个字符串中查找另一个字符串的存在。该指令的语法是

`identifier`  instr  `start`, `source`, `search`

其中，identifier是一个符号，MASM 将在其中放入search字符串在source字符串中的偏移量。搜索从source中的start位置开始。不同于常规，source中的第一个字符的位置是 1（而不是 0）。以下示例在字符串Hello World中搜索World（从字符位置 1 开始，即H字符的索引）：

WorldPosn  instr 1, <Hello World>, <World>

该语句将 WorldPosn 定义为值为 7 的数字（因为如果从位置 1 开始计数，字符串 World 在 Hello World 中的位置是 7）。

13.4.5 `sizestr` 指令

sizestr 指令计算字符串的长度。^(2) 该指令的语法为：

`identifier`  sizestr  `string`

其中，identifier 是 MASM 将存储字符串长度的符号，string 是该指令计算其长度的字符串字面量。举个例子，

hwLen sizestr <Hello World>

将符号 hwLen 定义为一个数字，并将其值设为 11。

13.4.6 `substr` 指令

substr 指令从较大的字符串中提取子字符串。该指令的语法为：

`identifier` substr `source`, `start`, `len`

其中，identifier 是 MASM 将创建的符号（类型为 TEXT，初始化为子字符串字符），source 是 MASM 从中提取子字符串的源字符串，start 是从字符串中开始提取的起始位置，len 是要提取的子字符串的长度。len 操作数是可选的；如果未指定，MASM 会假定你想要使用从 start 位置开始的字符串剩余部分作为子字符串。以下是一个从字符串 Hello World 中提取 Hello 的示例：

hString substr <Hello World>, 1, 5

13.5 条件汇编（编译时决策）

MASM 的编译时语言提供了一个 if 语句，它让你在汇编时做出决策。if 语句有两个主要用途。if 的传统用法是支持 条件汇编，根据程序中各种符号或常量值的状态，在汇编过程中决定是否包含或排除代码。第二个用途是支持 MASM 编译时语言中的标准 if 语句决策过程。本节将讨论这两个 if 语句的用途。

MASM 编译时 if 语句的最简单形式使用以下语法：

if `constant_boolean_expression`  
      `Text`  
endif

在编译时，MASM 会评估 if 后面的表达式。该表达式必须是一个常量表达式，且结果为整数值。如果该表达式的值为真（非零），MASM 会继续处理源文件中的文本，就像 if 语句不存在一样。然而，如果表达式的值为假（零），MASM 会将 if 和对应的 endif 子句之间的所有文本视为注释（即忽略这些文本），如图 13-2 所示。

图 13-2：MASM 编译时 if 语句的操作

编译时表达式中的标识符必须是常量标识符或 MASM 编译时函数调用（具有适当的参数）。因为 MASM 在汇编时评估这些表达式，所以它们不能包含运行时变量。

MASM 的 if 语句支持可选的 elseif 和 else 子句，这些子句的行为直观易懂。if 语句的完整语法如下：

if `constant_boolean_expression1`
      `Text`  
elseif `constant_boolean_expression2`
      `Text`  
else 
      `Text`  
endif

如果第一个布尔表达式的值为真，MASM 会处理直到 elseif 子句的文本。然后它会跳过所有文本（即将其视为注释），直到遇到 endif 子句。MASM 会在 endif 子句之后按正常方式继续处理文本。

如果第一个布尔表达式的值为假，MASM 会跳过所有文本，直到遇到 elseif、else 或 endif 子句。如果遇到 elseif 子句（如前面的例子），MASM 会评估与该子句相关联的布尔表达式。如果该表达式的值为真，MASM 会处理 elseif 和 else 子句之间的文本（或者如果没有 else 子句，则处理到 endif 子句）。如果在处理该文本时，MASM 遇到另一个 elseif 或如前所述的 else 子句，MASM 将忽略所有后续文本，直到找到相应的 endif。如果前面例子中的第一个和第二个布尔表达式的值都为假，MASM 将跳过它们关联的文本，开始处理 else 子句中的文本。

你可以通过包含零个或多个 elseif 子句，并根据需要提供 else 子句，创建几乎无限种的 if 语句序列。

条件汇编的传统用途之一是开发可以轻松配置为多个环境的软件。例如，fcomip 指令使得浮点比较变得简单，但该指令仅在 Pentium Pro 及更高版本的处理器上可用。为了在支持此指令的处理器上使用它，并在较旧的处理器上回退到标准浮点比较，大多数工程师使用条件汇编将不同的指令序列嵌入到同一个源文件中（而不是编写和维护两个版本的程序）。以下示例演示了如何做到这一点：

; Set true (1) to use FCOMI`xx` instrs.

PentProOrLater = 0
          . 
          . 
          . 
        if PentProOrLater

          fcomip st(0), st(1) ; Compare ST1 to ST0 and set flags

        else 

          fcomp               ; Compare ST1 to ST0
          fstsw ax            ; Move the FPU condition code bits
          sahf                ; into the FLAGS register

        endif

如当前编写的代码片段，将编译 else 子句中的三条指令，并忽略 if 和 else 子句之间的代码（因为常量 PentProOrLater 为假）。通过将 PentProOrLater 的值更改为真，你可以告诉 MASM 编译单条 fcomip 指令，而不是三条指令序列。

尽管你只需维护一个源文件，但条件汇编并不能让你创建一个在所有处理器上都能高效运行的单一 可执行文件。使用这种技术时，你仍然需要创建两个可执行程序（一个用于 Pentium Pro 及更高版本的处理器，一个用于早期的处理器），通过编译源文件两次：第一次汇编时，你必须将 PentProOrLater 常量设置为假；第二次汇编时，你必须将其设置为真。

如果你熟悉其他语言中的条件汇编，如 C/C++，你可能会想知道 MASM 是否支持类似 C 的 #ifdef 语句。答案是肯定的，它支持。请考虑以下对前面代码的修改，使用了该指令：

; Note: uncomment the following line if you are compiling this 
; code for a Pentium Pro or later CPU. 

; PentProOrLater = 0       ; Value and type are irrelevant
          . 
          . 
          . 
ifdef PentProOrLater 

     fcomip st(0), st(1)   ; Compare ST1 to ST0 and set flags

else 

     fcomp                 ; Compare ST1 to ST0
     fstsw ax              ; Move the FPU condition code bits
     sahf                  ; into the FLAGS register

endif
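The companion ifndef directive is handy for supplying a default when nothing else has defined a symbol. The following is a minimal sketch; bufSize and buffer are hypothetical names. It assumes that the symbol, if defined at all, is defined before this point in the source file or on the assembler command line (for example, with the ml64 /D option):

```masm
            ifndef  bufSize
bufSize     =       256             ; Default used when nothing else defines bufSize
            endif

            .data
buffer      byte    bufSize dup (?)
```

Defining the symbol outside the source file in this way lets you produce differently configured builds without editing the source between assemblies.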

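Another common use of conditional assembly is to introduce debugging and testing code into your programs. A typical debugging technique that many MASM programmers use is to insert print statements at strategic points throughout their code; this enables them to trace through their code and display important values at various checkpoints.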

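A big problem with this technique, however, is that you must remove the debugging code prior to completing the project. Two further problems are as follows: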

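Programmers often forget to remove some debugging statements, and this creates defects in the final program.
After removing a debugging statement, these programmers often discover that they need that same statement to debug a different problem at a later time. Hence, they are constantly inserting and removing the same statements over and over again.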

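Conditional assembly can provide a solution to this problem. By defining a symbol (say, debug) to control debugging output in your program, you can activate or deactivate all debugging output by modifying a single line of source code. The following code fragment demonstrates this: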

; Set to true to activate debug output.

debug   =    0
          .
          .
          .
     if debug

        echo *** DEBUG build

        mov  edx, i
        call print
        byte "At point A, i=%d", nl, 0

     else

        echo *** RELEASE build

     endif

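As long as you surround all debugging output statements with an if statement like the preceding one, you don't have to worry about debugging output accidentally appearing in your final application. By setting the debug symbol to false, you automatically disable all such output. Likewise, you don't have to remove the debugging statements from your programs after they've served their immediate purpose. By using conditional assembly, you can leave these statements in your code because they are so easy to deactivate. Later, if you decide you need to view this same debugging information again, you can reactivate it by setting the debug symbol to true.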

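Although program configuration and debugging control are two of the more common, traditional uses for conditional assembly, don't forget that the if statement provides the basic conditional statement in the MASM CTL. You will use the if statement in your compile-time programs the same way you would use an if statement in MASM or some other language. Later sections in this chapter present lots of examples of using the if statement in this capacity.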

13.6 Repeat Assembly (Compile-Time Loops)

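MASM's while..endm, for..endm, and forc..endm statements provide compile-time loop constructs.^(3) The while statement tells MASM to process the same sequence of statements repetitively during assembly. This is handy for constructing data tables as well as providing a traditional looping structure for compile-time programs.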

The while statement uses the following syntax:

while `constant_boolean_expression`
      `Text` 
endm

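When MASM encounters the while statement during assembly, it evaluates the constant Boolean expression. If the expression evaluates false, MASM skips over the text between the while and endm clauses (the behavior is similar to the if statement when the expression evaluates false). If the expression evaluates true, MASM processes the statements between the while and endm clauses, then "jumps back" to the start of the while statement in the source file and repeats this process, as shown in Figure 13-3.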

Figure 13-3: MASM compile-time while statement operation

To understand how this process works, consider the program in Listing 13-2.

; Listing 13-2

; CTL while loop demonstration program.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 13-2", 0

            .data
ary         dword   2, 3, 5, 8, 13

            include getTitle.inc
            include print.inc

            .code

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56           ; Shadow storage

i           =       0            
            while   i LT lengthof ary ; 5  

            mov     edx, i            ; This is a constant!
            mov     r8d, ary[i * 4]   ; Index is a constant
            call    print
            byte    "array[%d] = %d", nl, 0

i           =       i + 1
            endm 

allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 13-2: while..endm demonstration

Here's the build command and program output for Listing 13-2:

C:\>**build listing13-2**

C:\>**echo off**
 Assembling: listing13-2.asm
c.cpp

C:\>**listing13-2**
Calling Listing 13-2:
array[0] = 2
array[1] = 3
array[2] = 5
array[3] = 8
array[4] = 13
Listing 13-2 terminated

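The while loop repeats five times during assembly. On each repetition of the loop, the MASM assembler processes the statements between the while and endm directives. Therefore, the preceding program is really equivalent to the code fragment shown in Listing 13-3.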

.
.
.
mov     edx, 0          ; This is a constant!
mov     r8d, ary[0]     ; Index is a constant
call    print
byte    "array[%d] = %d", nl, 0

mov     edx, 1          ; This is a constant!
mov     r8d, ary[4]     ; Index is a constant
call    print
byte    "array[%d] = %d", nl, 0

mov     edx, 2          ; This is a constant!
mov     r8d, ary[8]     ; Index is a constant
call    print
byte    "array[%d] = %d", nl, 0

mov     edx, 3          ; This is a constant!
mov     r8d, ary[12]    ; Index is a constant
call    print
byte    "array[%d] = %d", nl, 0

mov     edx, 4          ; This is a constant!
mov     r8d, ary[16]    ; Index is a constant
call    print
byte    "array[%d] = %d", nl, 0

Listing 13-3: Program equivalent to the code in Listing 13-2

As you can see in this example, the while statement is convenient for constructing repetitive code sequences, especially for unrolling loops.
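The same construct can also emit data rather than instructions. The following is a minimal sketch that builds a small table of squares at assembly time; the names squares and sqIdx are hypothetical:

```masm
            .const
squares     label   dword           ; Hypothetical table name
sqIdx       =       0
            while   sqIdx LT 8
            dword   sqIdx * sqIdx   ; Emits 0, 1, 4, 9, 16, 25, 36, 49
sqIdx       =       sqIdx + 1
            endm
```

Because sqIdx is a compile-time constant on each repetition, MASM can evaluate sqIdx * sqIdx itself and simply emits eight dword values.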

MASM provides two forms of the for..endm loop. These two loops take the following general form:

for `identifier`, <`arg1`, `arg2`, ..., `argn`> 
  . 
  . 
  . 
endm 

forc `identifier`, <`string`>
  . 
  . 
  . 
endm

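The first form of the for loop (plain for) repeats the code once for each of the arguments specified between the < and > brackets. On each repetition of the loop, it sets identifier to the text of the current argument: on the first iteration, identifier is set to arg1; on the second iteration, it is set to arg2; and so on, until the last iteration, when it is set to argn. For example, the following for loop generates code that pushes the RAX, RBX, RCX, and RDX registers onto the stack: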

for  reg, <rax, rbx, rcx, rdx>
push reg
endm

This for loop is equivalent to the following code:

push rax
push rbx
push rcx
push rdx

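The forc compile-time loop repeats the body of its loop for each character appearing in the string specified by the second argument. For example, the following forc loop generates a hexadecimal byte value for each character in the string: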

        forc   hex, <0123456789ABCDEF>
hexNum  catstr <0>,<hex>,<h>
        byte   hexNum
        endm

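The for loop is much more useful than forc. Nevertheless, forc is handy on occasion. Most of the time, when you use these loops, you'll pass them a variable set of arguments rather than a fixed string. As you will soon see, these loops are handy for processing macro parameters.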
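As one more sketch of the plain for loop, a pair of loops can save and restore a set of registers. Note that the second list must name the registers in the opposite order, because the stack is a last-in, first-out structure:

```masm
; Save four registers:

            for     reg, <rax, rbx, rcx, rdx>
            push    reg
            endm

; (Code that modifies the registers would go here.)

; Restore them in the opposite order:

            for     reg, <rdx, rcx, rbx, rax>
            pop     reg
            endm
```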

13.7 Macros (Compile-Time Procedures)

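Macros are objects that a language processor replaces with other text during compilation. Macros are great devices for replacing long, repetitive sequences of text with much shorter sequences of text. In addition to the traditional role that macros play (for example, #define in C/C++), MASM's macros also serve as the equivalent of a compile-time language procedure or function.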

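Macros are one of MASM's main features. The following sections explore MASM's macro-processing facilities and the relationship between macros and other MASM CTL control constructs.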

13.8 Standard Macros

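MASM supports a straightforward macro facility that lets you define macros in a manner similar to declaring a procedure. A typical, simple macro declaration takes the following form: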

`macro_name` macro `arguments` 
      `Macro body`
          endm

The following code is a concrete example of a macro declaration:

neg128 macro 

       neg rdx 
       neg rax 
       sbb rdx, 0 

       endm

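Executing this macro's code computes the two's complement of the 128-bit value in RDX:RAX (see the description of the neg instruction in "Extended-Precision Negation Operations" in Chapter 8).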

To execute the code associated with neg128, you specify the macro's name at the point you want to execute these instructions. For example:

mov    rax, qword ptr i128 
mov    rdx, qword ptr i128[8] 
neg128

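This intentionally looks just like any other instruction; the original purpose of macros was to create synthetic instructions to simplify assembly language programming.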

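Though you don't need to use a call instruction to invoke a macro, from the point of view of your program, invoking a macro executes a sequence of instructions just like calling a procedure. You could implement this simple macro as a procedure by using the following procedure declaration: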

neg128p  proc 

         neg   rdx
         neg   rax
         sbb   rdx, 0
         ret

neg128p  endp

The following two statements will both negate the value in RDX:RAX:

neg128
call   neg128p

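The difference between these two (the macro invocation versus the procedure call) is that macros expand their text inline, whereas a procedure call emits a call to the corresponding procedure elsewhere in the text. That is, MASM replaces the invocation neg128 directly with the following text: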

neg  rdx
neg  rax
sbb  rdx, 0

On the other hand, MASM replaces the procedure call, call neg128p, with the machine code for the call instruction:

call neg128p

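You should choose a macro versus a procedure call based on efficiency. Macros are slightly faster than procedure calls because you don't execute the call and corresponding ret instructions, but they can make your program larger because a macro invocation expands to the text of the macro's body on each invocation. If the macro body is large and you invoke the macro several times throughout your program, it will make your final executable much larger. Also, if the body of your macro executes more than a few simple instructions, the overhead of a call and ret sequence has little impact on the overall execution time of the code, so the execution time savings are nearly negligible. On the other hand, if the body of a procedure is very short (like the preceding neg128 example), the macro implementation can be faster and doesn't expand the size of your program by much. A good rule of thumb is as follows: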

Use macros for short, time-critical program units. Use procedures for longer blocks of code and when execution time is not as critical.

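Macros have many other disadvantages compared with procedures. Macros cannot have local (automatic) variables, macro parameters work differently than procedure parameters, macros don't support (runtime) recursion, and macros are a little more difficult to debug than procedures (just to name a few disadvantages). Therefore, you shouldn't use macros as a substitute for procedures except when performance is absolutely critical.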

13.9 Macro Parameters

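Like procedures, macros allow you to define parameters that let you supply different data on each macro invocation, which lets you write generic macros whose behavior can vary depending on the parameters you supply. By processing these macro parameters at compile time, you can write sophisticated macros.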

Macro parameter declaration syntax is straightforward. You supply a list of parameter names as the operands in a macro declaration:

neg128  macro reg64HO, reg64LO

        neg   reg64HO
        neg   reg64LO
        sbb   reg64HO, 0

        endm

When you invoke the macro, you supply the actual parameter values as arguments:

neg128  rdx, rax

13.9.1 Standard Macro Parameter Expansion

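MASM automatically associates the type text with macro parameters. This means that during a macro expansion, MASM substitutes the text you supply as the actual parameter everywhere the formal parameter name appears. The semantics of pass by textual substitution are a little different from pass by value or pass by reference, so exploring those differences here is worthwhile.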

Consider the following macro invocations, using the neg128 macro from the previous section:

neg128 rdx, rax
neg128 rbx, rcx

These two invocations expand into the following code:

; neg128 rdx, rax 

     neg rdx 
     neg rax 
     sbb rdx, 0

; neg128 rbx, rcx 

     neg rbx 
     neg rcx 
     sbb rbx, 0

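Macro invocations do not make a local copy of the parameters (as pass by value does), nor do they pass the address of the actual parameter to the macro. Instead, a macro invocation of the form neg128 rdx, rax is equivalent to the following: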

reg64HO  textequ <rdx> 
reg64LO  textequ <rax> 

         neg    reg64HO  
         neg    reg64LO  
         sbb    reg64HO, 0

The text objects immediately expand their string values inline, producing the former expansion for neg128 rdx, rax.

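Macro parameters are not limited to memory, register, or constant operands, as are instruction or procedure operands. Any text is fine as long as its expansion is legal wherever you use the formal parameter. Similarly, formal parameters may appear anywhere in the macro body, not just where memory, register, or constant operands are legal. Consider the following macro declaration and sample invocations, which demonstrate how you can expand a formal parameter into a whole instruction: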

chkError macro instr, jump, target

         instr 
         jump  target 

         endm

         chkError <cmp eax, 0>, jnl, RangeError  ; Example 1
          .
          .
          .
         chkError <test bl, 1>, jnz, ParityError ; Example 2

; Example 1 expands to:

     cmp  eax, 0 
     jnl  RangeError 

; Example 2 expands to:

     test bl, 1 
     jnz  ParityError

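We use the < and > brackets to treat the full cmp and test instructions as a single string (normally, the comma in these instructions would split them into two macro parameters).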

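In general, MASM assumes that all text between commas constitutes a single macro parameter. If MASM encounters any opening bracket symbol (left parenthesis, left brace, or left angle bracket), it includes all text up to the appropriate closing symbol, ignoring any commas that may appear within the bracketing symbols. Of course, MASM does not consider commas (and bracketing symbols) within a string constant as the end of an actual parameter. So the following macro and invocation are perfectly legal: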

_print macro strToPrint 

       call print
       byte strToPrint, nl, 0 

      endm 
       . 
       . 
       . 
      _print "Hello, world!"

MASM treats the string Hello, world! as a single parameter because the comma appears inside a literal string constant, just as your intuition suggests.

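You can run into some issues when MASM expands your macro parameters, because parameters are expanded as text, not values. Consider the following macro declaration and invocation: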

Echo2nTimes macro n, theStr
echoCnt     =     0
            while echoCnt LT n * 2

            call  print
            byte  theStr, nl, 0

echoCnt     =     echoCnt + 1
            endm
            endm
             .
             .
             .
            Echo2nTimes  3 + 1, "Hello"

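This example generates code that displays Hello five times, rather than the eight times you might intuitively expect. This is because the preceding while statement expands to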

while  echoCnt LT 3 + 1 * 2

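The actual parameter for n is 3 + 1; because MASM substitutes this text directly in place of n, you get an erroneous text expansion. At compile time, MASM computes 3 + 1 * 2 as the value 5 rather than the value 8 (which you would get if MASM passed this parameter by value rather than by textual substitution).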

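The common solution to this problem, when passing numeric parameters that may contain compile-time expressions, is to surround the formal parameter in the macro with parentheses; for example, you could rewrite the preceding macro as follows: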

Echo2nTimes macro n, theStr
echoCnt     =     0
            while echoCnt LT (n) * 2

            call  print
            byte  theStr, nl, 0

echoCnt     =     echoCnt + 1
            endm  ; while
            endm  ; macro

Now, the invocation expands to the following code, which produces the intuitive result:

while  echoCnt LT (3 + 1) * 2
call   print
byte   "Hello", nl, 0
echoCnt = echoCnt + 1
endm

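If you don't have control over the macro definition (perhaps it's part of a library module you use, and you can't change the macro definition because doing so could break existing code), there is another solution to this problem: use the MASM % operator before the argument in the macro invocation so that the CTL interpreter evaluates the expression before expanding the parameters. For example: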

Echo2nTimes  %3 + 1, "Hello"

This causes MASM to properly generate eight calls to the print procedure (and associated data).
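The same precedence effect can be reproduced outside a macro with a text equate, which may make the two fixes easier to compare side by side. This is only an illustrative sketch; all of the symbol names here are hypothetical:

```masm
nText       textequ <3 + 1>         ; Hypothetical text symbol; acts like a macro argument
unparen     =       nText * 2       ; Expands to 3 + 1 * 2, which is 5
paren       =       (nText) * 2     ; Expands to (3 + 1) * 2, which is 8

nValue      textequ %(3 + 1)        ; % evaluates first, so nValue holds the text 4
viaPct      =       nValue * 2      ; Expands to 4 * 2, which is 8
```

Parenthesizing protects the expression at the point of use, whereas % collapses the expression to a single number before any substitution takes place.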

13.9.2 Optional and Required Macro Parameters

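As a general rule, MASM treats macro arguments as optional arguments. If you define a macro that specifies two arguments and invoke that macro with only one argument, MASM will not (normally) complain about the invocation. Instead, it simply substitutes the empty string for the expansion of the second argument. In some cases, this is acceptable and possibly even desirable.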

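However, suppose you left off the second parameter in the neg128 macro given earlier. That would assemble to a neg instruction with a missing operand, and MASM would report an error; for example: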

neg128      macro   `arg1`, `arg2`      ; Line 6
            neg     `arg1`            ; Line 7
            neg     `arg2`            ; Line 8
            sbb     `arg1`, 0         ; Line 9
            endm                    ; Line 10
                                    ; Line 11
            neg128  rdx             ; Line 12

Here's the error that MASM reports:

listing14.asm(12) : error A2008:syntax error : in instruction
 neg128(2): Macro Called From
  listing14.asm(12): Main Line Code

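The (12) tells us that the error occurred on line 12 of the source file. The neg128(2) line tells us that the error occurred on line 2 of the neg128 macro. It's a bit difficult to see what is actually causing the problem here.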

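One solution is to use conditional assembly inside the macro to test for the presence of both parameters. At first, you might think you could use code like this: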

neg128  macro reg64HO, reg64LO

        if   reg64LO eq <>
        .err <neg128 requires 2 operands>
        endif

        neg  reg64HO
        neg  reg64LO
        sbb  reg64HO, 0
        endm
         .
         .
         .
        neg128 rdx

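Unfortunately, this fails for a couple of reasons. First of all, the eq operator doesn't work with text operands. MASM expands the text operands before attempting to apply this operator, so the if statement in the preceding example effectively becomes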

 if   eq

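because MASM substitutes the empty string for both of the operands around the eq operator. This, of course, generates a syntax error. Even if there were nonblank textual operands around the eq operator, this would still fail because eq expects numeric operands. MASM solves this issue by introducing several additional conditional if statements intended for use with text operands and macro arguments. Table 13-1 lists these additional if statements.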

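Table 13-1: Text-Handling Conditional if Statements

Statement	Text operand(s)	Meaning
`ifb`^(*)	`arg`	If blank: true if `arg` evaluates to an empty string.
`ifnb`	`arg`	If not blank: true if `arg` evaluates to a non-empty string.
`ifdif`	`arg1`, `arg2`	If different: true if `arg1` and `arg2` are different (case-sensitive).
`ifdifi`	`arg1`, `arg2`	If different: true if `arg1` and `arg2` are different (case-insensitive).
`ifidn`	`arg1`, `arg2`	If identical: true if `arg1` and `arg2` are exactly the same (case-sensitive).
`ifidni`	`arg1`, `arg2`	If identical: true if `arg1` and `arg2` are exactly the same (case-insensitive).
^(*) `ifb` `arg` is shorthand for `ifidn <arg>, <>`.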

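You can use these conditional if statements just like the standard if statement. You can also follow these if statements with an elseif or else clause, but there are no elseifb, elseifnb, and similar variants of these if statements (only a standard elseif with a Boolean expression may follow these statements).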
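As a brief sketch of one of these statements in use, the following hypothetical clearReg macro uses ifidni to special-case one register name regardless of how the caller capitalizes it:

```masm
clearReg    macro   reg
            ifidni  <reg>, <rax>
            xor     eax, eax        ; Writing EAX also zeroes the upper half of RAX
            else
            mov     reg, 0
            endif
            endm

            clearReg RAX            ; Expands to: xor eax, eax
            clearReg rbx            ; Expands to: mov rbx, 0
```

Keep in mind that the two expansions are not perfectly interchangeable: xor modifies the flags, whereas mov does not.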

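The following code fragment demonstrates how to use the ifb statement to ensure that the neg128 macro has exactly two arguments. There is no need to check whether reg64HO is blank; if reg64HO is blank, reg64LO will also be blank, and the ifb statement will report the appropriate error: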

neg128  macro reg64HO, reg64LO

        ifb  <reg64LO>
        .err <neg128 requires 2 operands>
        endif

        neg  reg64HO
        neg  reg64LO
        sbb  reg64HO, 0
        endm

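Be very careful about using ifb. It's easy to pass a text symbol to a macro and wind up testing whether the name of that symbol is blank rather than the text itself. Consider the following: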

symbol      textequ <>
            neg128  rax, symbol     ; Generates an error

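The neg128 invocation has two arguments, and the second one is not blank, so the ifb directive is happy with the argument list. However, inside the macro, when neg128 expands reg64LO after the neg instruction, the expansion is the empty string, producing an error (which is exactly what the ifb was supposed to prevent).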

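A different way to handle missing macro arguments is to explicitly tell MASM that an argument is required, by using the :req suffix on the macro definition line. Consider the following definition for the neg128 macro: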

neg128  macro reg64HO:req, reg64LO:req
        neg   reg64HO
        neg   reg64LO
        sbb   reg64HO, 0
        endm

With the :req option present, MASM reports the following if you are missing one or more of the macro arguments:

listing14.asm(12) : error A2125:missing macro argument

13.9.3 Default Macro Parameter Values

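One way to handle missing macro arguments is to define default values for those arguments. Consider the following definition for the neg128 macro: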

neg128  macro reg64HO:=<rdx>, reg64LO:=<rax>
        neg   reg64HO
        neg   reg64LO
        sbb   reg64HO, 0
        endm

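The := operator tells MASM to substitute the text constant to the right of the operator for the associated macro argument if an actual value is not present on the macro invocation line. Consider the following two invocations of neg128: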

neg128       ; Defaults to "RDX, RAX" for the args
neg128 rbx   ; Uses RBX:RAX for the 128-bit register pair
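Assuming the default-value definition of neg128 shown above, these two invocations should expand along the following lines:

```masm
; neg128 (no arguments, so both defaults apply):

            neg     rdx
            neg     rax
            sbb     rdx, 0

; neg128 rbx (the second argument defaults to rax):

            neg     rbx
            neg     rax
            sbb     rbx, 0
```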

13.9.4 Macros with a Variable Number of Parameters

It is possible to tell MASM to allow a variable number of arguments in a macro invocation:

varParms  macro varying:vararg 

     ` Macro body`

          endm 
           . 
           . 
           . 
          varParms 1 
          varParms 1, 2 
          varParms 1, 2, 3 
          varParms

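Within the macro, MASM creates a text object of the form <`arg1`, `arg2`, ..., `argn`> and assigns this text object to the associated parameter name (varying, in the preceding example). You can use the MASM for loop to extract the individual values of the varying argument. For example: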

varParms  macro varying:vararg 
          for   curArg, <varying>
          byte  curArg
          endm  ; End of FOR loop
          endm  ; End of macro

          varParms 1 
          varParms 1, 2 
          varParms 1, 2, 3
          varParms <5 dup (?)>

Here's the listing output for an assembly containing this example source code:

 00000000                        .data
                       varParms  macro varying:vararg
                                 for   curArg, <varying>
                                 byte  curArg
                                 endm  ; End of FOR loop
                                 endm  ; End of macro

                                 varParms 1
 00000000  01         2          byte  1
                                 varParms 1, 2
 00000001  01         2          byte  1
 00000002  02         2          byte  2
                                 varParms 1, 2, 3
 00000003  01         2          byte  1
 00000004  02         2          byte  2
 00000005  03         2          byte  3
                                 varParms <5 dup (?)>
 00000006  00000005 [ 2          byte  5 dup (?)
            00
           ]

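A macro can have, at most, one vararg parameter. If a macro has more than one parameter and also has a vararg parameter, the vararg parameter must be the last parameter.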
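The following sketch combines a required first parameter with a trailing vararg parameter. The saveRegs macro and its parameter names are hypothetical; the ifnb test guards the for loop so that it is skipped when the caller supplies only one register:

```masm
saveRegs    macro   first:req, rest:vararg
            push    first
            ifnb    <rest>
            for     reg, <rest>
            push    reg
            endm    ; End of for loop
            endif
            endm    ; End of macro

            saveRegs rbx            ; Pushes RBX only
            saveRegs rbx, rsi, rdi  ; Pushes RBX, then RSI, then RDI
```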

13.9.5 The Macro Expansion (&) Operator

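Inside a macro, you can use the & operator to replace a macro parameter name (or other text symbol) with its actual value. This operator is active anywhere, even within string literals. Consider the following example: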

expand      macro   parm
            byte    '&parm', 0
            endm    

            .data
            expand  a

The macro invocation in this example expands to the following code:

byte 'a', 0

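If, for some reason, you need the string '&parm' to be emitted within a macro (that has parm as one of its parameters), you will have to work around the expansion operator. Note that '!&parm' will not escape the & operator. One solution that works is to rewrite the byte directive: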

expand      macro   parm
            byte    '&', 'parm', 0
            endm

Now the & operator does not expand parm inside a string.
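The & operator is also commonly used to splice a parameter into a longer identifier, where the parameter name would otherwise not be recognized as a separate token. A minimal sketch, using a hypothetical defCounter macro:

```masm
defCounter  macro   name
cnt_&name   dword   0               ; & splices the argument into the symbol name
            endm

            .data
            defCounter reads        ; Expands to: cnt_reads dword 0
            defCounter writes       ; Expands to: cnt_writes dword 0
```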

13.10 Local Symbols in a Macro

Consider the following macro declaration:

jzc    macro  target

       jnz    NotTarget 
       jc     target 
NotTarget: 
       endm

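This macro simulates an instruction that jumps to the specified target location only if both the zero flag and the carry flag are set. Conversely, if either the zero flag or the carry flag is clear, this macro transfers control to the instruction immediately following the macro invocation.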

A serious problem exists with this macro. Consider what happens if you use this macro more than once in your program:

jzc Dest1 
  . 
  . 
  . 
jzc Dest2 
  . 
  . 
  .

The preceding macro invocations expand to the following code:

         jnz NotTarget
         jc Dest1
NotTarget:
          .
          .
          .
         jnz NotTarget
         jc Dest2
NotTarget:
          .
          .
          .

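These two macro invocations both emit the same label, NotTarget, during macro expansion. When MASM processes this code, it complains about a duplicate symbol definition.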

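MASM's solution to this problem is to allow the use of local symbols within a macro. Local macro symbols are unique to a specific invocation of a macro. You must explicitly tell MASM which symbols must be local by using the local directive: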

`macro_name`    macro  `optional_parameters` 
              local  `list_of_local_names`
         `Macro body`
              endm

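`list_of_local_names` is a sequence of one or more MASM identifiers separated by commas. Whenever MASM encounters one of these names in a particular macro invocation, it automatically substitutes a unique name for that identifier. For each macro invocation, MASM substitutes a different name for the local symbol.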

You can correct the problem with the jzc macro by using the following macro code:

jzc      macro   target
         local   NotTarget

         jnz     NotTarget
         jc      target
NotTarget: 

         endm

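Now, whenever MASM processes this macro, it automatically associates a unique symbol with each occurrence of NotTarget. This prevents the duplicate symbol error that occurs if you do not declare NotTarget as a local symbol.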

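MASM generates symbols of the form ??`nnnn`, where `nnnn` is a (unique) four-digit hexadecimal number, for each local symbol. So, if you see symbols such as ??0000 in your assembly listings, you know where they came from.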

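A macro definition can have multiple local directives, each with its own list of local names. However, if you have multiple local statements in a macro, they should all immediately follow the macro directive.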
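As another sketch of the same technique, the following hypothetical absRax macro uses a local label to skip over an instruction, so it can safely be invoked any number of times in one source file:

```masm
absRax      macro
            local   isPositive
            test    rax, rax
            jns     isPositive      ; Skip the negation if RAX is nonnegative
            neg     rax
isPositive:
            endm

            absRax                  ; isPositive becomes a symbol such as ??0000
            absRax                  ; This invocation gets a different ??nnnn symbol
```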

13.11 The exitm Directive

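The MASM exitm directive (which may appear only within a macro) tells MASM to immediately terminate the processing of the macro. MASM ignores any additional lines of text within the macro. If you think of a macro as a procedure, exitm is the return statement.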

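The exitm directive is useful in a conditional assembly sequence. Perhaps after checking for the presence (or absence) of certain macro arguments, you might want to stop processing the macro to avoid additional errors from MASM. For example, consider the earlier neg128 macro: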

neg128  macro reg64HO, reg64LO

        ifb   <reg64LO>
        .err  <neg128 requires 2 operands>
        exitm
        endif

        neg   reg64HO
        neg   reg64LO
        sbb   reg64HO, 0
        endm

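Without the exitm directive inside the conditional assembly, this macro would attempt to assemble the neg reg64LO instruction, generating another error because reg64LO expands to the empty string.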

13.12 MASM Macro Function Syntax

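Originally, MASM's macro design allowed programmers to create substitute mnemonics. A programmer could use a macro to replace a machine instruction or other statement (or sequence of statements) in an assembly language source file. Macros could create only whole lines of output text in the source file. This prevented programmers from using a macro invocation such as the following: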

mov rax, `some_macro_invocation`(`arguments`)

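Today, MASM supports additional syntax that allows you to create macro functions. A MASM macro function definition looks exactly like a normal macro definition with one addition: you use an exitm directive with a textual argument to return a function result from the macro. Consider the upperCase macro function in Listing 13-4.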

; Listing 13-4

; Macro function demonstration program.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 13-4", 0

; upperCase macro function.

; Converts text argument to a string, converting
; all lowercase characters to uppercase.

upperCase   macro   theString
            local   resultString, thisChar, sep
resultStr   equ     <> ; Initialize function result with ""
sep         textequ <> ; Initialize separator char with ""

            forc    curChar, theString

; Check to see if the character is lowercase.
; Convert it to uppercase if it is, otherwise
; output it to resultStr as is. Concatenate the
; current character to the end of the result string
; (with a ", " separator, if this isn't the first
; character appended to resultStr).

            if      ('&curChar' GE 'a') and ('&curChar' LE 'z')
resultStr   catstr  resultStr, sep, %'&curChar'-32
            else
resultStr   catstr  resultStr, sep, %'&curChar'
            endif

; First time through, sep is the empty string. For all
; other iterations, sep is the comma separator between
; values.

sep         textequ <, >
            endm    ; End for

            exitm   <resultStr>
            endm    ; End macro

; Demonstration of the upperCase macro function:

            .data
chars       byte    "Demonstration of upperCase"
            byte    "macro function:"
            byte    upperCase(<abcdEFG123>), nl, 0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

            lea     rcx, chars      ; Prints characters converted to uppercase
            call    printf

allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 13-4: Sample macro function

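Whenever you invoke a MASM macro function, you must always follow the macro name with a pair of parentheses enclosing the macro's arguments. Even if the macro has no arguments, an empty pair of parentheses must be present. This is how MASM differentiates standard macros and macro functions.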
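A macro function can also return a numeric result as text. The following minimal sketch defines a hypothetical timesTwo macro function; it assumes that exitm accepts a % expression as its text item, in which case the function result is the decimal text of the computed value:

```masm
timesTwo    macro   n
            local   result
result      =       (n) * 2
            exitm   %result         ; Return the value of result as text
            endm

            mov     eax, timesTwo(3 + 1)    ; Should assemble as: mov eax, 8
            mov     ecx, timesTwo(5)        ; Should assemble as: mov ecx, 10
```

Because the invocation appears in the operand field, the parentheses are what tell MASM to treat timesTwo as a macro function rather than an ordinary symbol.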

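Earlier versions of MASM included function forms of directives such as sizestr (using the name @sizestr). Recent versions of MASM have removed these functions. However, you can easily write your own macro functions to replace these missing functions. Here's a quick replacement for the @sizestr function: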

; @sizestr - Replacement for the MASM @sizestr function
;            that Microsoft removed from MASM.

@sizestr    macro   theStr
            local   theLen
theLen      sizestr <theStr>
            exitm   <&theLen>
            endm

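The & operator in the exitm directive forces the @sizestr macro to expand the text associated with the theLen local symbol inside the < and > string delimiters before returning the value to whoever invoked the macro function. Without the & operator, the @sizestr macro would return text of the form ??0002 (the unique symbol MASM creates for the local symbol theLen).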

13.13 Macros as Compile-Time Procedures and Functions

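Although programmers typically use macros to expand to a sequence of machine instructions, there is absolutely no requirement that a macro body contain any executable instructions. Indeed, many macros contain only compile-time language statements (for example, if, while, for, = assignments, and the like). By placing only compile-time language statements in the body of a macro, you can effectively write compile-time procedures and functions using macros.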

The following unique macro is a good example of a compile-time function that returns a string result:

unique macro 
       local  theSym
       exitm  <theSym>
       endm

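Whenever your code references this macro, MASM replaces the macro invocation with the text theSym. MASM generates unique symbols such as ??0000 for local macro symbols. Therefore, successive invocations of the unique macro generate a sequence of symbols such as ??0000, ??0001, ??0002, and so forth.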

13.14 Writing Compile-Time "Programs"

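The MASM compile-time language allows you to write short programs that write other programs; in particular, it lets you automate the creation of large or complex assembly language sequences. The following subsections provide simple examples of such compile-time programs.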

13.14.1 Constructing Data Tables at Compile Time

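Earlier, this book suggested that you could write programs to generate large, complex lookup tables for your assembly language programs (see the discussion of generating tables in Chapter 10). Chapter 10 provides C++ programs that generate tables to paste into assembly programs. In this section, we will use the MASM compile-time language to construct data tables during the assembly of the program that uses the tables.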

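One common use for the compile-time language is to build ASCII character lookup tables for alphabetic case manipulation with the xlat instruction at runtime. Listing 13-5 demonstrates how to construct an uppercase conversion table and a lowercase conversion table.^(4) Note the use of a macro as a compile-time procedure to reduce the complexity of the table-generating code.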

; Listing 13-5

; Creating lookup tables with macros.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 13-5", 0
fmtStr1     byte    "testString converted to UC:", nl
            byte    "%s", nl, 0

fmtStr2     byte    "testString converted to LC:", nl
            byte    "%s", nl, 0

testString  byte    "This is a test string ", nl
            byte    "Containing UPPERCASE ", nl
            byte    "and lowercase chars", nl, 0

emitChRange macro   start, last
            local   index, resultStr
index       =       start
            while   index lt last
            byte    index
index       =       index + 1
            endm
            endm

; Lookup table that will convert lowercase
; characters to uppercase. The byte at each
; index contains the value of that index,
; except for the bytes at indexes "a" to "z".
; Those bytes contain the values "A" to "Z".
; Therefore, if a program uses an ASCII
; character's numeric value as an index
; into this table and retrieves that byte,
; it will convert the character to uppercase.

lcToUC      equ             this byte
            emitChRange     0, 'a'
            emitChRange     'A', %'Z'+1
            emitChRange     %'z'+1, 0ffh

; As above, but this table converts uppercase
; to lowercase characters.

UCTolc      equ             this byte
            emitChRange     0, 'A'
            emitChRange     'a', %'z'+1
            emitChRange     %'Z'+1, 0ffh

            .data

; Store the destination strings here:

toUC        byte    256 dup (0)
TOlc        byte    256 dup (0)     

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rdi
            push    rsi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

; Convert the characters in testString to uppercase:

            lea     rbx, lcToUC
            lea     rsi, testString
            lea     rdi, toUC
            jmp     getUC

toUCLp:     xlat
            mov     [rdi], al
            inc     rsi
            inc     rdi
getUC:      mov     al, [rsi]
            cmp     al, 0
            jne     toUCLp

; Display the converted string:

            lea     rcx, fmtStr1
            lea     rdx, toUC
            call    printf

; Convert the characters in testString to lowercase:

            lea     rbx, UCTolc
            lea     rsi, testString
            lea     rdi, TOlc
            jmp     getLC

toLCLp:     xlat
            mov     [rdi], al
            inc     rsi
            inc     rdi
getLC:      mov     al, [rsi]
            cmp     al, 0
            jne     toLCLp

; Display the converted string:

            lea     rcx, fmtStr2
            lea     rdx, TOlc
            call    printf

allDone:    leave
            pop     rsi
            pop     rdi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 13-5: Generating case-conversion tables with the compile-time language

Here's the build command and sample output for the program in Listing 13-5:

C:\>**build listing13-5**

C:\>**echo off**
 Assembling: listing13-5.asm
c.cpp

C:\>**listing13-5**
Calling Listing 13-5:
testString converted to UC:
THIS IS A TEST STRING
CONTAINING UPPERCASE
AND LOWERCASE CHARS

testString converted to LC:
this is a test string
containing uppercase
and lowercase chars

Listing 13-5 terminated

13.14.2 Unrolling Loops

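Chapter 7 points out that you can unroll loops to improve the performance of certain assembly language programs. However, this requires a lot of extra typing, especially if you have many loop iterations. Fortunately, MASM's compile-time language facilities, especially the while loop, come to the rescue. With a small amount of extra typing plus one copy of the loop body, you can unroll a loop as many times as you please.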

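If you simply want to repeat the same code sequence a certain number of times, unrolling the code is especially trivial. All you have to do is wrap a MASM while..endm loop around the sequence and count off the specified number of iterations. For example, if you wanted to print Hello World 10 times, you could encode this as follows: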

count = 0
while count LT 10
     call print
     byte "Hello World", nl, 0 

count = count + 1
endm

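Although this code looks similar to a while loop in a high-level language, remember the fundamental difference: the preceding code simply consists of 10 straight calls to print in the program. Were you to encode this by using an actual loop, there would be only one call to print and lots of additional logic to loop back and execute that single call 10 times.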

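Unrolling loops becomes slightly more complicated if any instructions in that loop refer to the value of a loop control variable or another value that changes with each iteration of the loop. A typical example is a loop that zeroes the elements of an integer array: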

        xor eax, eax   ; Set EAX and RBX to 0
        xor rbx, rbx
        lea rcx, array
whlLp:  cmp rbx, 20
        jae loopDone
        mov [rcx][rbx * 4], eax
        inc rbx
        jmp whlLp

loopDone:

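In this code fragment, the loop uses the value of the loop control variable (in RBX) to index into array. Simply copying mov [rcx][rbx * 4], eax 20 times is not the proper way to unroll this loop. You must substitute an appropriate constant index in the range 0 to 76 (the corresponding loop index times 4) in place of rbx * 4 in this example. Correctly unrolling this loop should produce the following code sequence: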

mov  [rcx][0 * 4], eax
mov  [rcx][1 * 4], eax
mov  [rcx][2 * 4], eax
mov  [rcx][3 * 4], eax
mov  [rcx][4 * 4], eax
mov  [rcx][5 * 4], eax
mov  [rcx][6 * 4], eax
mov  [rcx][7 * 4], eax
mov  [rcx][8 * 4], eax
mov  [rcx][9 * 4], eax
mov [rcx][10 * 4], eax 
mov [rcx][11 * 4], eax 
mov [rcx][12 * 4], eax 
mov [rcx][13 * 4], eax 
mov [rcx][14 * 4], eax 
mov [rcx][15 * 4], eax 
mov [rcx][16 * 4], eax 
mov [rcx][17 * 4], eax 
mov [rcx][18 * 4], eax 
mov [rcx][19 * 4], eax

You can easily do this with the following compile-time code sequence:

iteration = 0
while iteration LT 20 
     mov [rcx][iteration * 4], eax
     iteration = iteration + 1
endm

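If the statements in a loop use the loop control variable's value, it is possible to unroll such loops only if those values are known at compile time. You cannot unroll loops when user input (or other runtime information) controls the number of iterations.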

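Of course, rather than loading RCX with the address of array prior to this loop, you could also use the following while loop, which addresses array directly and leaves the RCX register free for other uses: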

iteration = 0
while iteration LT 20 
     mov array[iteration * 4], eax
     iteration = iteration + 1
endm
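The same pattern works for any loop body whose only varying part is a constant index. As a further sketch, the following fragment sums the elements of a 20-element dword array into EAX with no runtime loop overhead; the declaration of array shown here is an assumption made only so that the fragment is complete:

```masm
            .data
array       dword   20 dup (?)      ; Assumed declaration, for illustration only

            .code
             .
             .
             .
            xor     eax, eax        ; EAX accumulates the sum
idx         =       0
            while   idx LT 20
            add     eax, array[idx * 4]
idx         =       idx + 1
            endm
```

As with the zeroing example, this expands to 20 separate add instructions, each with a different constant offset.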

13.15 Simulating HLL Procedure Calls

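Calling procedures (functions) in assembly language is a real chore. Loading registers with parameters, pushing values onto the stack, and other such activities are a complete distraction. High-level language procedure calls are far more readable and easier to write than the same calls to an assembly language function. Macros provide a good mechanism for calling procedures and functions in a high-level-like manner.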

13.15.1 HLL-Like Calls with No Parameters

Of course, the most trivial example is a call to an assembly language procedure that has no arguments at all:

someProc  macro
          call    _someProc
          endm

_someProc proc
            .
            .
            .
_someProc endp
            .
            .
            .
          someProc   ; Call the procedure

This simple example demonstrates a couple of conventions this book uses for calling procedures via macro invocation:

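If the procedure and all calls to the procedure occur within the same source file, place the macro definition immediately before the procedure to make it easy to find. (Chapter 15 discusses the placement of the macro if you call the procedure from several different source files.)
If you would normally name the procedure someProc, change the procedure's name to _someProc and then use someProc as the macro name.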

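While the advantage of using a macro invocation of the form someProc over calling the procedure with call someProc might seem somewhat dubious, keeping all procedure calls consistent (by using macro invocations for all of them) helps make your programs more readable.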

13.15.2 HLL-Like Calls with One Parameter

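The next step up in complexity is to call a procedure having a single parameter. Assuming you're using the Microsoft ABI and passing the parameter in RCX, the simplest solution is something like the following: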

someProc  macro   parm1
          mov     rcx, parm1
          call    _someProc
          endm
           .
           .
           .
          someProc Parm1Value

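This macro works well if you're passing a 64-bit integer by value. If the parameter is an 8-, 16-, or 32-bit value, you would substitute CL, CX, or ECX for RCX in the mov instruction.^(5)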

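If you're passing the first argument by reference, you would substitute a lea instruction for the mov instruction in this example. As reference parameters are always 64-bit values, the lea instruction would usually take this form: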

lea     rcx, `parm1`

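Finally, if you're passing a real4 or real8 value as the parameter, you'd substitute one of the following instructions for the mov instruction in the previous macro: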

movss  xmm0, parm1  ; Use this for real4 parameters
movsd  xmm0, parm1  ; Use this for real8 parameters

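As long as the actual parameter is a memory variable or an appropriate integer constant, this simple macro definition works quite well, covering a very large percentage of the real-world cases.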

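For example, to call the C Standard Library printf() function with a single argument (the format string) using the current macro scheme, you'd write the macro as follows:^(6)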

cprintf  macro  parm1
         lea    rcx, parm1
         call   printf
         endm

So you can invoke this macro as follows:

cprintf fmtStr

where fmtStr is (presumably) the name of a byte object in your .data section containing the printf format string.

For a more high-level-like syntax for our procedure calls, we should allow something like the following:

cprintf "This is a printf format string"

Unfortunately, the way the macro is currently written, this generates the following (syntactically incorrect) statement:

lea   rcx, "This is a printf format string"

We could modify this macro to allow this invocation by rewriting it as follows:

cprintf  macro  parm1
         local  fmtStr
         .data
fmtStr   byte   parm1, nl, 0
         .code
         lea    rcx, fmtStr
         call   printf
         endm

Invoking this macro with a string constant as the argument expands to the following code:

         .data
fmtStr   byte   "This is a printf format string", nl, 0
         .code
         lea    rcx, fmtStr  ; Technically, fmtStr will really be something
         call   printf       ; like ??0001

The only problem with this new form of the macro is that it no longer accepts invocations such as

cprintf fmtStr

where fmtStr is a byte object in the .data section. We'd really like to have a macro that can accept both forms.

13.15.3 Using `opattr` to Determine Argument Types

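The trick to this is the opattr operator (see Table 4-1 in Chapter 4). This operator returns an integer value with certain bits set based on the type of expression that follows. In particular, it sets bit 1 (mask value 2) if the expression is relocatable or otherwise references memory, and it sets bit 2 (mask value 4) if the expression is an immediate constant. Therefore, bit 2 will be clear if a variable such as fmtStr appears as the argument, whereas a short string literal counts as a constant and has bit 2 set (opattr actually returns the value 0 for string literals that are longer than eight characters, just so you know). Now consider the code in Listing 13-6.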

; Listing 13-6

; opattr demonstration.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 13-6", 0

fmtStr      byte    nl, "Hello, World! #2", nl, 0

            .code
            externdef printf:proc

; Return program title to C++ program:

            public  getTitle
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp

; cprintf macro:

;           cprintf fmtStr
;           cprintf "Format String"

cprintf     macro   fmtStrArg
            local   fmtStr, attr, isConst

attr        =       opattr fmtStrArg
isConst     =       (attr and 4) eq 4
            if      (attr eq 0) or isConst
            .data   
fmtStr      byte    fmtStrArg, nl, 0
            .code
            lea     rcx, fmtStr

            else

            lea     rcx, fmtStrArg

            endif
            call    printf
            endm

atw         =       opattr "Hello World"
bin         =       opattr "abcdefghijklmnopqrstuvwxyz"

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rdi
            push    rsi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

            cprintf "Hello World!"
            cprintf fmtStr

allDone:    leave
            pop     rsi
            pop     rdi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 13-6: opattr operator in a macro

Here's the build command and sample output for Listing 13-6:

C:\>**build listing13-6**

C:\>**echo off**
 Assembling: listing13-6.asm
c.cpp

C:\>**listing13-6**
Calling Listing 13-6:
Hello World!
Hello, World! #2
Listing 13-6 terminated

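This cprintf macro is far from perfect. For example, the C/C++ printf() function allows multiple arguments that this macro does not handle. But this macro does demonstrate how to handle two different calls to printf based on the type of the argument you pass to cprintf.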

13.15.4 HLL-Like Calls with a Fixed Number of Parameters

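Expanding the macro-calling mechanism from one parameter to two or more (assuming a fixed number of parameters) is fairly easy. All you need to do is add more formal parameters and handle those arguments in your macro definition. Listing 13-7 is a modification of Listing 9-11 in Chapter 9 that uses macro invocations for calls to r10ToStr, e10ToStr, and some fixed calls to printf (for brevity, as this is a very long program, only the macros and a few invocations are included).

           .
           .     ; About 1200 lines from Listing 9-10.
           .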

; r10ToStr - Macro to create an HLL-like call for the 
;            _r10ToStr procedure.

; Parameters:

;   r10    - Must be the name of a real4, real8, or 
;            real10 variable.
;   dest   - Must be the name of a byte buffer to hold 
;            string result.

;   wdth   - Output width for the string. Either an
;            integer constant or a dword variable.

;   dPts   - Number of positions after the decimal
;            point. Either an integer constant or
;            a dword variable.

;   fill   - Fill char. Either a character constant
;            or a byte variable.

;   mxLen  - Maximum length of output string. Either
;            an integer constant or a dword variable.

r10ToStr     macro   r10, dest, wdth, dPts, fill, mxLen
             fld     r10

; dest is a label associated with a string variable:

             lea     rdi, dest

; wdth is either a constant or a dword var:

             mov     eax, wdth

; dPts is either a constant or a dword var
; holding the number of decimal point positions:

            mov     edx, dPts

; Process fill character. If it's a constant, 
; directly load it into ECX (which zero-extends
; into RCX). If it's a variable, then move with
; zero extension into ECX (which also zero-
; extends into RCX).

; Note: bit 2 from opattr is 1 if fill is 
; a constant.

            if      ((opattr fill) and 4) eq 4
            mov     ecx, fill
            else
            movzx   ecx, fill
            endif

; mxLen is either a constant or a dword var.

            mov     r8d, mxLen
            call    _r10ToStr
            endm

; e10ToStr - Macro to create an HLL-like call for the 
;            _e10ToStr procedure.

; Parameters:

;   e10   - Must be the name of a real4, real8, or 
;           real10 variable.
;   dest  - Must be the name of a byte buffer to hold 
;           string result.

;   wdth  - Output width for the string. Either an
;           integer constant or a dword variable.

;   xDigs - Number of exponent digits.

;   fill  - Fill char. Either a character constant
;           or a byte variable.

;   mxLen - Maximum length of output string. Either
;           an integer constant or a dword variable.

e10ToStr    macro   e10, dest, wdth, xDigs, fill, mxLen
            fld     e10

; dest is a label associated with a string variable:

            lea     rdi, dest

; wdth is either a constant or a dword var:

            mov     eax, wdth

; xDigs is either a constant or a dword var
; holding the number of exponent digits:

            mov     edx, xDigs

; Process fill character. If it's a constant, 
; directly load it into ECX (which zero-extends
; into RCX). If it's a variable, then move with
; zero extension into ECX (which also zero-
; extends into RCX).

; Note: bit 2 from opattr is 1 if fill is 
; a constant.

            if      ((opattr fill) and 4) eq 4
            mov     ecx, fill
            else
            movzx   ecx, fill
            endif

; mxLen is either a constant or a dword var.

            mov     r8d, mxLen
            call    _e10ToStr
            endm

; puts - A macro to print a string using printf.

; Parameters:

;   fmt    - Format string (must be a byte
;            variable or string constant).

;   theStr - String to print (must be a
;            byte variable, a register,
;            or a string constant).

puts         macro   fmt, theStr
             local   strConst, bool

             lea     rcx, fmt

             if      ((opattr theStr) and 2)

; If memory operand:

             lea     rdx, theStr

             elseif  ((opattr theStr) and 10h)

; If register operand:

             mov     rdx, theStr

             else 

; Assume it must be a string constant.

            .data
strConst    byte    theStr, 0
            .code
            lea     rdx, strConst

            endif

            call    printf
            endm

            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64         ; Shadow storage

; F output:

            r10ToStr r10_1, r10str_1, 30, 16, '*', 32
            jc      fpError
            puts    fmtStr1, r10str_1

            r10ToStr r10_1, r10str_1, 30, 15, '*', 32
            jc      fpError
            puts    fmtStr1, r10str_1
             .
             .    ; Similar code to Listing 9-10 with macro
             .    ; invocations rather than procedure calls.
; E output:

            e10ToStr e10_1, r10str_1, 26, 3, '*', 32
            jc      fpError
            puts    fmtStr3, r10str_1

            e10ToStr e10_2, r10str_1, 26, 3, '*', 32
            jc      fpError
            puts    fmtStr3, r10str_1
             .
             .    ; Similar code to Listing 9-10 with macro
             .    ; invocations rather than procedure calls.

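Listing 13-7: Macro call implementation for converting floating-point values to strings

Compare these HLL-like calls with the original procedure calls in Listing 9-10: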

; F output:

fld     r10_1
lea     rdi, r10str_1
mov     eax, 30         ; fWidth
mov     edx, 16         ; decimalPts
mov     ecx, '*'        ; Fill
mov     r8d, 32         ; maxLength
call    r10ToStr
jc      fpError

lea     rcx, fmtStr1
lea     rdx, r10str_1
call    printf

fld     r10_1
lea     rdi, r10str_1
mov     eax, 30         ; fWidth
mov     edx, 15         ; decimalPts
mov     ecx, '*'        ; Fill
mov     r8d, 32         ; maxLength
call    r10ToStr
jc      fpError

lea     rcx, fmtStr1
lea     rdx, r10str_1
call    printf
.
.   ; Additional code from Listing 9-10.
.
; E output:

fld     e10_1
lea     rdi, r10str_1
mov     eax, 26         ; fWidth
mov     edx, 3          ; expDigits
mov     ecx, '*'        ; Fill
mov     r8d, 32         ; maxLength
call    e10ToStr
jc      fpError

lea     rcx, fmtStr3
lea     rdx, r10str_1
call    printf

fld     e10_2
lea     rdi, r10str_1
mov     eax, 26         ; fWidth
mov     edx, 3          ; expDigits
mov     ecx, '*'        ; Fill
mov     r8d, 32         ; maxLength
call    e10ToStr
jc      fpError

lea     rcx, fmtStr3
lea     rdx, r10str_1
call    printf
.
.   ; Additional code from Listing 9-10.
.

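Clearly, the macro version is easier to read (and, as it turns out, easier to debug and maintain as well).

13.15.5 HLL-Like Calls with a Varying Parameter List

Some procedures expect a varying number of parameters; the C/C++ printf() function is a good example. Other procedures accept only a fixed number of arguments, yet would be nicer to call with a varying argument list. For example, consider the print procedure that has appeared in many examples in this book; its string parameter (the one that follows the call to print in the code stream) is, technically, a single string argument. Consider the following macro implementation for a call to print: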

print       macro   arg
            call    _print
            byte    arg, 0
            endm

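You could invoke this macro as follows:

print  "Hello, World!"

The only problem with this macro is that you will often want to supply multiple arguments in its invocation, such as this:

print  "Hello, World!", nl, "It's a great day!", nl

Unfortunately, this macro will not accept this list of arguments. This is a natural use of the print macro, however, so it makes sense to modify print to accept multiple operands and combine them into a single string after the call to the _print function. Listing 13-8 provides such an implementation.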

; Listing 13-8

; HLL-like procedure calls with
; a varying parameter list.

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 13-8", 0

            .code
            externdef printf:proc

            include getTitle.inc

; Note: don't include print.inc here
; because this code uses a macro for
; print.

; print macro - HLL-like calling sequence for the _print
;               function (which is, itself, a shell for
;               the printf function).

; If print appears on a line by itself (no arguments), 
; then emit a string consisting of a single newline 
; character (and zero-terminating byte). If there are 
; one or more arguments, emit each argument and append 
; a single 0 byte after all the arguments.

; Examples:

;           print
;           print   "Hello, World!"
;           print   "Hello, World!", nl

print       macro   arg1, optArgs:vararg
            call    _print

            ifb     <arg1>

; If print is used by itself, print a
; newline character:

            byte    nl, 0

            else

; If we have one or more arguments, then
; emit each of them:

            byte    arg1

            for     oa, <optArgs>

            byte    oa

            endm

; Zero-terminate the string.

            byte    0

            endif
            endm

_print      proc
            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11

            push    rbp
            mov     rbp, rsp
            sub     rsp, 40
            and     rsp, -16

            mov     rcx, [rbp + 72]   ; Return address
            call    printf

            mov     rcx, [rbp + 72]
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     [rbp + 72], rcx

            leave
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
_print      endp

p           macro   arg
            call    _print
            byte    arg, 0
            endm      

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rdi
            push    rsi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

            print   "Hello world"
            print
            print   "Hello, World!", nl

allDone:    leave
            pop     rsi
            pop     rdi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 13-8: Varying-arguments implementation of the print macro

Here's the build command and output for the program in Listing 13-8:

C:\>**build listing13-8**

C:\>**echo off**
 Assembling: listing13-8.asm
c.cpp

C:\>**listing13-8**
Calling Listing 13-8:
Hello world
Hello, World!
Listing 13-8 terminated

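With this new print macro, you can now call the print procedure in an HLL-like fashion by simply listing the arguments in the print invocation:

print "Hello World", nl, "How are you today?", nl

This emits a sequence of byte directives that lay out the individual string components back to back, followed by a terminating 0 byte.

By the way, it is worth noting that you can pass a string containing multiple arguments to the original (single-argument) version of print. By rewriting the macro invocation

print "Hello World", nl

as

print <"Hello World", nl>

you get the desired output. MASM treats everything between the < and > brackets as a single argument. However, it's a bit of a pain to have to constantly put these brackets around multiple arguments (and your code is inconsistent, because single arguments don't require them). The varying-arguments implementation of the print macro is a much better solution.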
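
To make the byte layout concrete, here is a small Python model of what the varying-arguments print macro places in the code stream after the call to _print. This is only a sketch for illustration: emit_print_bytes and NL are hypothetical names, strings are assumed to be plain ASCII, and integer arguments stand in for numeric equates such as nl.

```python
NL = 10  # Stands in for the "nl" equate from the listing.

def emit_print_bytes(*args):
    """Model the bytes the print macro emits after 'call _print'."""
    if not args:
        # print with no arguments: a newline plus the zero terminator.
        return bytes([NL, 0])
    out = bytearray()
    for arg in args:
        if isinstance(arg, str):
            out += arg.encode("ascii")   # byte "text"
        else:
            out.append(arg)              # byte nl (or another numeric constant)
    out.append(0)                        # Zero-terminate the whole string.
    return bytes(out)

print(emit_print_bytes("Hello World", NL, "How are you today?", NL))
```

The model shows why a single trailing 0 byte is enough: every argument lands contiguously in memory, so _print sees one zero-terminated string.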

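13.16 The `invoke` Macro

At one time, MASM provided a special directive, invoke, that you could use to call a procedure and pass it parameters (it worked with the proc directive to determine the number and type of parameters a procedure expected). When Microsoft modified MASM to support 64-bit code, it removed the invoke statement from the MASM language.

However, some enterprising programmers have written MASM macros to simulate the invoke directive in 64-bit versions of MASM. The invoke macro is not only useful in its own right but also a great example of how to write advanced macros to call procedures. For more information about the invoke macro, visit www.masm32.com/ and download the MASM32 SDK. It includes a set of macros (and other utilities) for 64-bit programs, including the invoke macro.

13.17 Advanced Macro Parameter Parsing

The previous sections provided examples of macro parameter processing used to determine the type of a macro argument in order to decide what kind of code to generate. By carefully examining the attributes of an argument, a macro can make various choices about how to deal with that argument. This section presents some more advanced techniques you can use when processing macro arguments.

Clearly, the opattr compile-time operator is one of the most important tools you can use when looking at macro arguments. This operator uses the following syntax:

opattr `expression`

Note that a general address expression follows opattr; you are not limited to a single symbol.

The opattr operator returns an integer value that is a bit mask specifying the attributes of the associated expression. If the expression following opattr contains forward-referenced symbols or is an illegal expression, opattr returns 0. Microsoft's documentation indicates that opattr returns the values shown in Table 13-2.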

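Table 13-2: opattr Return Values

Bit	Meaning
0	There is a code label in the expression.
1	The expression is relocatable.
2	The expression is a constant expression.
3	The expression uses direct (PC-relative) addressing.
4	The expression is a register.
5	The expression contains no undefined symbols (obsolete).
6	The expression is a stack-segment memory expression.
7	The expression references an external symbol.
8–11	Language type^(*)
	Value	Language
	0	No language type
	1	C
	2	SYSCALL
	3	STDCALL
	4	Pascal
	5	FORTRAN
	6	BASIC
^(*) 64-bit code generally doesn't support a language type, so these bits are usually 0.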

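Quite honestly, Microsoft's documentation does not do the best job of explaining how MASM sets the bits. For example, consider the following MASM statements: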

codeLabel:
opcl       =  opattr codeLabel ; Sets opcl to 25h or 0010_0101b
opconst    =  opattr 0         ; Sets opconst to 36 or 0010_0100b

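opconst has bits 2 and 5 set, just as you would expect from Table 13-2. However, opcl has bits 0, 2, and 5 set; 0 and 5 make sense, but bit 2 (the expression is a constant expression) does not. If, in a macro, you were to test only bit 2 to determine whether the operand is a constant (as, I must admit, earlier examples in this chapter have done), you could get into trouble when bit 2 is set for a code label and you assume that the operand is a constant.

Probably the wisest thing to do is to mask off everything except bits 0 to 7 (or maybe just bits 0 to 6) and compare the result against an 8-bit value, rather than testing a single bit with a simple mask. Table 13-3 lists some common values you can test against.

Table 13-3: 8-Bit Values for opattr Results

Value	Meaning
0	Undefined (forward-referenced) symbol or illegal expression
34 / 22h	Memory access of the form `[reg + const]`
36 / 24h	Constant
37 / 25h	Code label (proc name or symbol with a `:` suffix) or expression of the form `offset code_label`
38 / 26h	Expression of the form `offset label`, where `label` is a variable in the `.data` section
42 / 2Ah	Global symbol (for example, a symbol in the `.data` section)
43 / 2Bh	Memory access of the form `[reg + code_label]`, where `code_label` is a proc name or symbol with a `:` suffix
48 / 30h	Register (general-purpose, MM, XMM, YMM, ZMM, floating-point/ST, or other special-purpose register)
98 / 62h	Stack-relative memory access (memory addresses of the form `[rsp + xxx]` and `[rbp + xxx]`)
165 / 0A5h	External code symbol (37 / 25h with bit 7 set)
171 / 0ABh	External data symbol (43 / 2Bh with bit 7 set)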
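
The difference between testing a single bit and comparing the whole low byte can be sketched in Python. The table below simply restates Table 13-3; OPATTR_KINDS, classify_opattr, and naive_is_constant are hypothetical names used for illustration and are not part of MASM.

```python
# Low-byte opattr values, restated from Table 13-3.
OPATTR_KINDS = {
    0x00: "undefined symbol or illegal expression",
    0x22: "[reg + const] memory access",
    0x24: "constant",
    0x25: "code label",
    0x26: "offset of a data label",
    0x2A: "global symbol",
    0x2B: "[reg + code_label] memory access",
    0x30: "register",
    0x62: "stack-relative memory access",
    0xA5: "external code symbol",
    0xAB: "external data symbol",
}

def naive_is_constant(opattr_value):
    """The risky test: look only at bit 2."""
    return (opattr_value & 4) == 4

def classify_opattr(opattr_value):
    """The safer test: keep bits 0 to 7 and compare the whole byte."""
    return OPATTR_KINDS.get(opattr_value & 0xFF, "unlisted combination")

for value in (0x24, 0x25):
    print(hex(value), naive_is_constant(value), classify_opattr(value))
```

Both 24h and 25h pass the bit-2 test, but only 24h compares equal to the constant entry, which is exactly the trap described for opcl.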

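Perhaps the biggest issue with opattr, as already noted, is that it considers a constant expression to be an integer that fits into 64 bits. This creates a problem for two important constant types: string literals (longer than 8 characters) and floating-point constants. opattr returns 0 for both.^(8)

13.17.1 Checking for String Literal Constants

Although opattr won't help us determine whether an operand is a string, we can use MASM's string-processing operations to test whether the first character of an operand is a quote. The following code does just that: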

; testStr is a macro function that tests its
; operand to see if it is a string literal.

testStr     macro   strParm
            local   firstChar

            ifnb    <strParm>
firstChar   substr  <strParm>, 1, 1

            ifidn   firstChar,<!">

; First character was ", so assume it's
; a string.

            exitm   <1>
            endif   ; ifidn
            endif   ; ifnb

; If we get to this point in the macro,
; we definitely do not have a string.

            exitm   <0>
            endm

Consider the following two invocations of the testStr macro:

isAStr  = testStr("String Literal")
notAStr = testStr(someLabel)

MASM sets the symbol isAStr to the value 1 and notAStr to the value 0.

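13.17.2 Checking for Real Constants

Real constants are another literal type that MASM's opattr operator doesn't support. Again, writing a macro to test for a real constant can resolve this issue. Sadly, parsing real numbers isn't as easy as checking for a string constant: there is no single leading character we can use to say, "Hey, we've got a floating-point constant here." The macro has to parse the operand character by character and validate it.

To begin with, here is a grammar that defines MASM floating-point constants: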

Sign     ::= (+|-) 
Digit    ::= [0-9]
Mantissa ::= (Digit)+ | '.' (Digit)+ | (Digit)+ '.' Digit*
Exp      ::= (e|E) Sign? Digit Digit? Digit? Digit?
Real     ::= Sign? Mantissa Exp?

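A real number consists of an optional sign, followed by a mantissa and an optional exponent. The mantissa contains at least one digit; it can also contain a decimal point with digits on either side (or on both sides). The mantissa cannot, however, consist of a decimal point by itself.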
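
As a cross-check on the grammar, the same rules can be written as a regular expression. This Python sketch is only a reference model, not the MASM implementation; REAL_PATTERN and is_real are hypothetical names, and the exponent is assumed to carry one to four digits, as the getExp macro later in this section requires.

```python
import re

# Sign? Mantissa Exp?   with
#   Mantissa ::= Digit+ ('.' Digit*)? | '.' Digit+
#   Exp      ::= (e|E) Sign? Digit{1,4}
REAL_PATTERN = re.compile(
    r"[+-]?"                           # Optional sign
    r"(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)"  # Mantissa: at least one digit
    r"(?:[eE][+-]?[0-9]{1,4})?"        # Optional exponent
)

def is_real(text):
    """True if the whole text is a syntactically valid real constant."""
    return REAL_PATTERN.fullmatch(text) is not None

for sample in ("3.4", ".2", "1.e1", "-1.0E-1234", ".", "1.1.0", "e1"):
    print(repr(sample), is_real(sample))
```

A model like this is handy for generating test cases for the macro version, which has to do the same work one character at a time.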

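The macro function that tests for a real constant should be callable as follows:

isReal = getReal(`some_text`)

where some_text is the textual data we want to test to see if it's a real constant. The macro for getReal could be the following: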

; getReal - Parses a real constant.

; Returns:
;    true  - If the parameter contains a syntactically
;            correct real number (and no extra characters).
;    false - If there are any illegal characters or
;            other syntax errors in the numeric string.

getReal      macro   origParm
             local   parm, curChar, result

; Make a copy of the parameter so we don't
; delete the characters in the original string.

parm         textequ &origParm

; Must have at least one character:

            ifb     parm
            exitm   <0>
            endif

; Extract the optional sign:

            if      isSign(parm)
curChar     textequ extract1st(parm)        ; Skip sign char
            endif

; Get the required mantissa:

            if      getMant(parm) eq 0
            exitm   <0>                     ; Bad mantissa
            endif

; Extract the optional exponent:

result      textequ getExp(parm)    
            exitm   <&result>       

            endm    ; getReal

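Testing for real constants is a complex process, so it's worthwhile to go through this macro (and all the subservient macros) step by step:

Make a copy of the original parameter string. During processing, getReal deletes characters from the parameter string as it parses it. The macro makes a copy so that it doesn't disturb the original text string passed in to it.
Check for a blank parameter. If the caller passes in an empty string, the result is not a valid real constant and getReal must return false. It's important to check for the empty string right away, because the rest of the code assumes the string contains at least one character.
Call the isSign macro function. This function (whose definition appears a little later) returns true if the first character of its argument is a + or - symbol; otherwise, it returns false.
If the first character is a sign character, invoke the extract1st macro:
```
curChar     textequ extract1st(parm)        ; Skip sign char
```
The extract1st macro returns the first character of its argument as the function result (which this statement assigns to the curChar symbol) and then deletes that first character from the argument. So if the original string passed to getReal was +1, this statement puts + into curChar and deletes the first character in parm (producing the string 1). The definition of extract1st appears a little later in this section.

getReal doesn't actually use the sign character assigned to curChar. The purpose of this extract1st invocation is strictly the side effect of deleting the first character in parm.
Invoke getMant. This macro function returns true if the prefix of its string argument is a valid mantissa. It returns false if the mantissa does not contain at least one numeric digit. Note that getMant stops processing the string on the first non-mantissa character it encounters (including a second decimal point, if the mantissa contains two or more decimal points). The getMant function doesn't care about illegal characters; it leaves it up to getReal to look at the remaining characters after the return from getMant to determine whether the whole string is valid. As a side effect, getMant deletes all the leading characters from the parameter string that it processes.
Invoke the getExp macro function to process any (optional) trailing exponent. The getExp macro is also responsible for ensuring that no garbage characters follow (which would result in a parse failure).

The isSign macro is fairly straightforward. Here's its implementation: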

; isSign - Macro function that returns true if the
;          first character of its parameter is a
;          "+" or "-".

isSign      macro   parm
            local   FirstChar
            ifb     <parm>
            exitm   <0>
            endif

FirstChar   substr  parm, 1, 1
            ifidn   FirstChar, <+>
            exitm   <1>
            endif
            ifidn   FirstChar, <->
            exitm   <1>
            endif
            exitm   <0>
            endm

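This macro uses the substr operation to extract the first character from the parameter and then compares it against the sign characters (+ or -). It returns true if the character is a sign character and false otherwise.

The extract1st macro function returns the first character of the argument passed to it as the function result. As a side effect, it also deletes that first character from the argument. Here's the implementation of extract1st: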

extract1st  macro   parm
            local   FirstChar
            ifb     <%parm>
            exitm   <>
            endif
FirstChar   substr  parm, 1, 1
            if      @sizestr(%parm) GE 2
parm        substr  parm, 2
            else
parm        textequ <>
            endif

            exitm   <FirstChar>
            endm

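The ifb directive checks whether the parameter string is empty. If it is, extract1st immediately returns the empty string without further modifying its parameter.

Note the % operator before the parm argument. The parm argument actually expands to the name of the string variable holding the real constant. Because of the copy made of the original parameter in the getReal function, this expansion turns out to be something like ??0005. If you simply specify ifb <parm>, the ifb directive sees <??0005>, which is not blank. Placing the % operator before the parm symbol tells MASM to evaluate the expression (which is just the ??0005 symbol) and replace it with the text it evaluates to (in this case, the actual string).

If the string is not blank, extract1st uses the substr directive to extract the first character from the string and assigns this character to the FirstChar symbol. The extract1st macro function will return this value as the function result.

Next, the extract1st function has to delete the first character from the parameter string. It uses the @sizestr function (whose definition appears a little earlier in this chapter) to determine whether the character string contains at least two characters. If so, extract1st uses the substr directive to extract all the characters from the parameter beginning at the second character position, and it assigns this substring back to the parameter passed in. If extract1st is processing the last character in the string (that is, if @sizestr returns 1), the code cannot use the substr directive because the index would be out of range. The else section of the if directive assigns an empty string to the parameter if @sizestr returns a value less than 2.

The next getReal subservient macro function is getMant. This macro is responsible for parsing the mantissa component of the floating-point constant. Its implementation is the following: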

getMant     macro   parm
            local   curChar, sawDecPt, rpt
sawDecPt    =       0
curChar     textequ extract1st(parm)        ; Get 1st char
            ifidn   curChar, <.>            ; Check for dec pt
sawDecPt    =       1
curChar     textequ extract1st(parm)        ; Get 2nd char
            endif

; Must have at least one digit:

            if      isDigit(curChar) eq 0
            exitm   <0>                     ; Bad mantissa
            endif

; Process zero or more digits. If we haven't already
; seen a decimal point, allow exactly one of those.

; Do loop at least once if there is at least one
; character left in parm:

rpt         =       @sizestr(%parm)
            while   rpt

; Get the 1st char from parm and see if
; it is a decimal point or a digit:

curChar     substr  parm, 1, 1
            ifidn   curChar, <.>
rpt         =       sawDecPt eq 0
sawDecPt    =       1
            else
rpt         =       isDigit(curChar)
            endif

; If char was legal, then extract it from parm:

            if      rpt
curChar     textequ extract1st(parm)        ; Get next char
            endif

; Repeat as long as we have more chars and the
; current character is legal:

rpt         =       rpt and (@sizestr(%parm) gt 0)
            endm    ; while

; If we've seen at least one digit, we've got a valid
; mantissa. We've stopped processing on the first 
; character that is not a digit or the 2nd "." char.

            exitm   <1>
            endm    ; getMant

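The mantissa must contain at least one decimal digit. It can contain zero or one occurrence of a decimal point (which may appear before the first digit, at the end of the mantissa, or in the middle of a string of digits). The getMant macro function uses the local symbol sawDecPt to keep track of whether it has seen a decimal point already. The function begins by initializing sawDecPt to false (0).

A valid mantissa must have at least one character (because it must have at least one decimal digit). So the next thing getMant does is extract the first character from the parameter string and place it in curChar. If the first character is a period (decimal point), the macro sets sawDecPt to true and extracts the next character.

The getMant function uses a while directive to process all the remaining characters in the mantissa. A local variable, rpt, controls the execution of the while loop. Once the leading period and first decimal digit have been accepted, the code sets rpt to the number of characters remaining in parm, so the loop body runs only if at least one character is left. The isDigit macro function tests the first character of its argument and returns true if it's one of the characters 0 to 9. The definition of isDigit appears shortly.

By this point getMant has already removed the leading period or digit from the beginning of the string, and it executes the body of the while loop the first time only if the new parameter string length is greater than zero.

The while loop grabs the first character from the current parameter string (without deleting it just yet) and tests it against a digit or a . character. If it's a digit, the loop removes that character from the parameter string and repeats. If the current character is a period, the code first checks whether it has already seen a decimal point (using sawDecPt). If this is a second decimal point, the function returns true (later code deals with the second . character). If the code has not yet seen a decimal point, the loop sets sawDecPt to true and continues with the loop execution.

The while loop repeats as long as it sees digits or a first decimal point and the string still has a length greater than zero. Once the loop completes, the getMant function returns true. The only way getMant returns false is if it does not see at least one digit (either at the beginning of the string or immediately after a leading decimal point).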
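
The control flow of getMant can be easier to follow in a conventional language. The Python sketch below mirrors the steps just described; because Python strings are immutable, the function returns the unconsumed remainder instead of deleting characters in place. The names get_mant and is_digit are hypothetical stand-ins for the macros.

```python
def is_digit(text):
    """True if the first character of text is one of 0 to 9."""
    return text != "" and text[0] in "0123456789"

def get_mant(text):
    """Consume a mantissa prefix; return (is_valid, remaining_text)."""
    saw_dec_pt = False
    cur, text = text[:1], text[1:]          # Get 1st char
    if cur == ".":
        saw_dec_pt = True
        cur, text = text[:1], text[1:]      # Get 2nd char
    if not is_digit(cur):
        return False, text                  # Bad mantissa
    while text:
        cur = text[0]                       # Peek; don't delete yet
        if cur == ".":
            if saw_dec_pt:
                break                       # 2nd "." ends the mantissa
            saw_dec_pt = True
        elif not is_digit(cur):
            break                           # Leave other chars for the caller
        text = text[1:]                     # Char was legal: consume it
    return True, text

print(get_mant("3.4e1"))
print(get_mant("1.1.0"))
```

In the second call the leftover text begins with a period, which is what lets the exponent check that follows reject 1.1.0 as a whole.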

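The isDigit macro function is a brute-force function that tests its first character against the 10 decimal digits. This function does not remove any characters from its parameter argument. The source code for isDigit is the following: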

isDigit     macro   parm
            local   FirstChar
            if      @sizestr(%parm) eq 0
            exitm   <0>
            endif

FirstChar   substr  parm, 1, 1
            ifidn   FirstChar, <0>
            exitm   <1>
            endif
            ifidn   FirstChar, <1>
            exitm   <1>
            endif
            ifidn   FirstChar, <2>
            exitm   <1>
            endif
            ifidn   FirstChar, <3>
            exitm   <1>
            endif
            ifidn   FirstChar, <4>
            exitm   <1>
            endif
            ifidn   FirstChar, <5>
            exitm   <1>
            endif
            ifidn   FirstChar, <6>
            exitm   <1>
            endif
            ifidn   FirstChar, <7>
            exitm   <1>
            endif
            ifidn   FirstChar, <8>
            exitm   <1>
            endif
            ifidn   FirstChar, <9>
            exitm   <1>
            endif
            exitm   <0>
            endm

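The only thing worth commenting on is the % operator in @sizestr (for reasons explained earlier).

Now we arrive at the last helper function appearing in getReal: the getExp (get exponent) macro function. Here's its implementation: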

getExp      macro   parm
            local   curChar

; Return success if no exponent present.

            if      @sizestr(%parm) eq 0
            exitm   <1>
            endif

; Extract the next character, return failure
; if it is not an "e" or "E" character:

curChar     textequ extract1st(parm)
            if      isE(curChar) eq 0
            exitm   <0>
            endif

; Extract the next character:

curChar     textequ extract1st(parm)

; If an optional sign character appears,
; remove it from the string:

            if      isSign(curChar)
curChar     textequ extract1st(parm)        ; Skip sign char
            endif                           ; isSign

; Must have at least one digit:

            if      isDigit(curChar) eq 0
            exitm   <0>
            endif

; Optionally, we can have up to three additional digits:

            if      @sizestr(%parm) gt 0
curChar     textequ extract1st(parm)        ; Skip 1st digit
            if      isDigit(curChar) eq 0
            exitm   <0>
            endif
            endif

            if      @sizestr(%parm) gt 0
curChar     textequ extract1st(parm)        ; Skip 2nd digit
            if      isDigit(curChar) eq 0
            exitm   <0>
            endif
            endif

            if      @sizestr(%parm) gt 0
curChar     textequ extract1st(parm)        ; Skip 3rd digit
            if      isDigit(curChar) eq 0
            exitm   <0>
            endif
            endif

; If we get to this point, we have a valid exponent.

            exitm   <1>     
            endm    ; getExp

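Exponents are optional in a real constant. Therefore, the first thing this macro function does is check whether it has been passed an empty string. If so, it immediately returns success. Once again, the emptiness test must apply the % operator to the parm argument (here, inside the @sizestr invocation).

If the parameter string is not empty, the first character in the string must be an E or e character. If it isn't, the function returns false. Checking for an E or e is done with the isE helper function, whose implementation is the following (note the use of ifidni, which is case-insensitive):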

isE         macro   parm
            local   FirstChar
            if      @sizestr(%parm) eq 0
            exitm   <0>
            endif

FirstChar   substr  parm, 1, 1
            ifidni   FirstChar, <e>
            exitm   <1>
            endif
            exitm   <0>
            endm

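Next, the getExp function looks for an optional sign character. If it encounters one, it deletes the sign character from the beginning of the string.

At least one, and at most four, digits must follow the e or E and sign characters. The remaining code in getExp handles this.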
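
For comparison, here is a Python sketch that follows the getExp macro step by step. It is a model for reading the macro, not a replacement for it; get_exp is a hypothetical name, and the function works on a plain string rather than a MASM text equate.

```python
DIGITS = "0123456789"

def get_exp(text):
    """Mirror getExp: True if text is empty or starts with a valid exponent."""
    if text == "":
        return True                         # No exponent present
    cur, text = text[:1], text[1:]
    if cur not in ("e", "E"):
        return False                        # Must start with e or E
    cur, text = text[:1], text[1:]
    if cur in ("+", "-"):
        cur, text = text[:1], text[1:]      # Skip optional sign
    if cur == "" or cur not in DIGITS:
        return False                        # Need at least one digit
    for _ in range(3):                      # Up to three additional digits
        if text == "":
            break
        cur, text = text[:1], text[1:]
        if cur not in DIGITS:
            return False
    return True

for sample in ("", "e1", "E-1234", "e", "ea1", "e1a"):
    print(repr(sample), get_exp(sample))
```

Like the macro text above, this sketch stops looking after the fourth digit, so it does not examine anything that follows it.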

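Listing 13-9 is a demonstration of the macro snippets appearing throughout this section. Note that this is a pure compile-time program; all its activity takes place while MASM assembles this source code. It does not generate any executable machine code.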

; Listing 13-9

; This is a compile-time program.
; It does not generate any executable code.

; Several useful macro functions:

; mout       - Like echo, but allows "%" operators.

; testStr    - Tests an operand to see if it
;              is a string literal constant.

; @sizestr   - Handles missing MASM function.

; isDigit    - Tests first character of its
;              argument to see if it's a decimal
;              digit.

; isSign     - Tests first character of its
;              argument to see if it's a "+"
;              or a "-" character.

; extract1st - Removes the first character
;              from its argument (side effect)
;              and returns that character as
;              the function result.

; getReal    - Parses the argument and returns
;              true if it is a reasonable-
;              looking real constant.

; Test strings and invocations for the
; getReal macro:

 `Note: actual macro code appears in previous code snippets`
 `and has been removed from this listing` `for brevity` 

mant1       textequ <1>
mant2       textequ <.2>
mant3       textequ <3.4>
rv4         textequ <1e1>
rv5         textequ <1.e1>
rv6         textequ <1.0e1>
rv7         textequ <1.0e+1>
rv8         textequ <1.0e-1>
rv9         textequ <1.0e12>
rva         textequ <1.0e1234>
rvb         textequ <1.0E123>
rvc         textequ <1.0E+1234>
rvd         textequ <1.0E-1234>
rve         textequ <-1.0E-1234>
rvf         textequ <+1.0E-1234>
badr1       textequ <>
badr2       textequ <a>
badr3       textequ <1.1.0>
badr4       textequ <e1>
badr5       textequ <1ea1>
badr6       textequ <1e1a>

% echo get_Real(mant1) = getReal(mant1) 
% echo get_Real(mant2) = getReal(mant2)
% echo get_Real(mant3) = getReal(mant3)
% echo get_Real(rv4)   = getReal(rv4)
% echo get_Real(rv5)   = getReal(rv5)
% echo get_Real(rv6)   = getReal(rv6)
% echo get_Real(rv7)   = getReal(rv7)
% echo get_Real(rv8)   = getReal(rv8)
% echo get_Real(rv9)   = getReal(rv9)
% echo get_Real(rva)   = getReal(rva)
% echo get_Real(rvb)   = getReal(rvb)
% echo get_Real(rvc)   = getReal(rvc)
% echo get_Real(rvd)   = getReal(rvd)
% echo get_Real(rve)   = getReal(rve)
% echo get_Real(rvf)   = getReal(rvf)
% echo get_Real(badr1) = getReal(badr1)
% echo get_Real(badr2) = getReal(badr2)
% echo get_Real(badr3) = getReal(badr3)
% echo get_Real(badr4) = getReal(badr4)
% echo get_Real(badr5) = getReal(badr5)
% echo get_Real(badr5) = getReal(badr5)
        end

Listing 13-9: Compile-time program with test code for the getReal macro

Here's the build command and (compile-time) program output:

C:\>**ml64 /c listing13-9.asm**
Microsoft (R) Macro Assembler (x64) Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: listing13-9.asm
get_Real(1) = 1
get_Real(.2) = 1
get_Real(3.4) = 1
get_Real(1e1)  = 1
get_Real(1.e1) = 1
get_Real(1.0e1) = 1
get_Real(1.0e+1) = 1
get_Real(1.0e-1) = 1
get_Real(1.0e12) = 1
get_Real(1.0e1234) = 1
get_Real(1.0E123) = 1
get_Real(1.0E+1234) = 1
get_Real(1.0E-1234) = 1
get_Real(-1.0E-1234) = 1
get_Real(+1.0E-1234) = 1
get_Real() = 0
get_Real(a) = 0
get_Real(1.1.0) = 0
get_Real(e1) = 0
get_Real(1ea1) = 0
get_Real(1ea1) = 0

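13.17.3 Checking for Registers

Although the opattr operator provides a bit to tell you that its operand is an x86-64 register, that's the only information opattr provides. In particular, opattr's return value won't tell you which register it has seen; whether it's a general-purpose, XMM, YMM, ZMM, MM, ST, or other register; or the size of that register. Fortunately, with a little work, you can determine all this information by using MASM's conditional assembly statements and other operators.

To begin with, here's a simple macro function, isReg, that returns 1 or 0 depending on whether its operand is a register. This is a simple shell around the opattr operator that returns the setting of bit 4: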

isReg       macro   parm
            local   result
result      textequ %(((opattr &parm) and 10h) eq 10h)
            exitm   <&result>
            endm

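Although this function provides some convenience, it doesn't really provide any information that the opattr operator doesn't already provide. We want to know which register appears in the operand and how large that register is.

Listing 13-10 (available online at artofasm.randallhyde.com/) provides a wide range of useful macro functions and equates for processing register operands in your own macros. The following paragraphs describe some of the more useful equates and macros.

Listing 13-10 contains a set of equates that map register names to numeric values. These equates use symbols of the form regXXX, where XXX is the register name (all uppercase). Examples include regAL, regSIL, regR8B, regAX, regBP, regR8W, regEAX, regEBP, regR8D, regRAX, regRSI, regR15, regST, regST0, regMM0, regXMM0, and regYMM0.

There is also a special equate for the symbol regNone that represents a non-register entity. These equates give each of these symbols a numeric value in the range 1 to 117 (regNone is given the value 0).

The purpose of all these equates (and, in general, of assigning numeric values to registers) is to make it easier to test for specific registers (or ranges of registers) within your macros by using conditional assembly.

Listing 13-10 also contains a set of useful macros that convert the textual form of a register name (that is, AL, AX, EAX, RAX, and so on) to its numeric form (regAL, regAX, regEAX, regRAX, and so on). The most general macro function that does this is whichReg(register). This function accepts a text object and returns the appropriate regXXX value for that text. If the text passed as an argument isn't one of the valid general-purpose, FPU, MMX, XMM, or YMM registers, whichReg returns the value regNone. Here are some examples of calls to whichReg: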

alVal  =       whichReg(al)
axTxt  textequ <ax>
axVal  =       whichReg(axTxt)

aMac   macro   parameter
       local   regVal
regVal =       whichReg(parameter)
       if      regVal eq regNone
       .err    <Expected a register argument>
       exitm
       endif
         .
         .
         .
       endm

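The whichReg macro function accepts any of the x86-64 general-purpose, FPU, MMX, XMM, or YMM registers. In many situations, you might want to limit the set of registers to a particular subset of these. Therefore, Listing 13-11 (also available online at artofasm.randallhyde.com/) provides the following macro functions:

isGPReg(text) returns a nonzero register value for any of the general-purpose (8-, 16-, 32-, or 64-bit) registers. It returns regNone (0) if the argument is not one of these registers.
is8BitReg(text) returns a nonzero register value for any of the general-purpose 8-bit registers. Otherwise, it returns regNone (0).
is16BitReg(text) returns a nonzero register value for any of the general-purpose 16-bit registers. Otherwise, it returns regNone (0).
is32BitReg(text) returns a nonzero register value for any of the general-purpose 32-bit registers. Otherwise, it returns regNone (0).
is64BitReg(text) returns a nonzero register value for any of the general-purpose 64-bit registers. Otherwise, it returns regNone (0).
isFPReg(text) returns a nonzero register value for any of the FPU registers (ST, and ST(0) to ST(7)). Otherwise, it returns regNone (0).
isMMReg(text) returns a nonzero register value for any of the MMX registers (MM0 to MM7). Otherwise, it returns regNone (0).
isXMMReg(text) returns a nonzero register value for any of the XMM registers (XMM0 to XMM15). Otherwise, it returns regNone (0).
isYMMReg(text) returns a nonzero register value for any of the YMM registers (YMM0 to YMM15). Otherwise, it returns regNone (0).

If you need other register classifications, it's easy to write your own macro functions to return an appropriate value. For example, if you want to test whether a particular register is one of the Windows ABI parameter registers (RCX, RDX, R8, or R9), you could create a macro function like the following: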

isWinParm  macro  theReg
           local  regVal, isParm
regVal      =     whichReg(theReg)
isParm      =     (regVal eq regRCX) or (regVal eq regRDX)
isParm      =     isParm or (regVal eq regR8)
isParm      =     isParm or (regVal eq regR9)

            if    isParm
            exitm <%regVal>
            endif
            exitm <%regNone>
            endm

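Once you've converted a register in text form to its numeric value, at some point you might need to convert that numeric value back to text so that you can use the register as part of an instruction. The toReg(reg_num) macro in Listing 13-10 accomplishes this. If you supply a value in the range 1 to 117 (the numeric values for the registers), this macro returns the text that corresponds to that register value. For example:

mov toReg(1), 0    ; Equivalent to mov al, 0

(Note that regAL = 1.)

If you pass regNone to the toReg macro, toReg returns an empty string. Any value outside the range 0 to 117 produces an undefined-symbol error message.

When working in macros where you've passed a register as an argument, you may find that you need to convert that register to a larger size (for example, convert AL to AX, EAX, or RAX; convert AX to EAX or RAX; or convert EAX to RAX). Listing 13-11 provides several macros to do this up-conversion. These macro functions accept a register number as their parameter input and produce a textual result holding the actual register name:

reg8To16 converts an 8-bit general-purpose register to its 16-bit equivalent^(8)
reg8To32 converts an 8-bit general-purpose register to its 32-bit equivalent
reg8To64 converts an 8-bit general-purpose register to its 64-bit equivalent
reg16To32 converts a 16-bit general-purpose register to its 32-bit equivalent
reg16To64 converts a 16-bit general-purpose register to its 64-bit equivalent
reg32To64 converts a 32-bit general-purpose register to its 64-bit equivalent

Another useful macro function appearing in Listing 13-10 is the regSize(reg_value) macro. This function returns the size (in bytes) of the register value passed as an argument. Here are some example calls: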

alSize    =  regSize(regAL)   ; Returns 1
axSize    =  regSize(regAX)   ; Returns 2
eaxSize   =  regSize(regEAX)  ; Returns 4
raxSize   =  regSize(regRAX)  ; Returns 8
stSize    =  regSize(regST0)  ; Returns 10
mmSize    =  regSize(regMM0)  ; Returns 8
xmmSize   =  regSize(regXMM0) ; Returns 16
ymmSize   =  regSize(regYMM0) ; Returns 32

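The macros and equates appearing in Listing 13-10 are quite useful when writing macros to handle generic code. For example, suppose you want to create a putInt macro that accepts an arbitrary 8-, 16-, or 32-bit register operand and prints that register's value as an integer. You would like to be able to pass any arbitrary (general-purpose) register and have the macro sign-extend it, if necessary, before printing. Listing 13-12 is one possible implementation of this macro.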

; Listing 13-12

; Demonstration of putInt macro.

; putInt - This macro expects an 8-, 16-, or 32-bit
;          general-purpose register argument. It will
;          print the value of that register as an
;          integer.

putInt      macro   theReg
            local   regVal, sz
regVal      =       isGPReg(theReg)

; Before we do anything else, make sure
; we were actually passed a register:

            if      regVal eq regNone
            .err    <Expected a register>
            endif

; Get the size of the register so we can
; determine if we need to sign-extend its
; value:

sz          =       regSize(regVal)

; If it was a 64-bit register, report an
; error:

            if      sz gt 4
            .err    64-bit register not allowed
            endif

; If it's a 1- or 2-byte register, we will need
; to sign-extend the value into EDX:

            if      (sz eq 1) or (sz eq 2)
            movsx   edx, theReg

; If it's a 32-bit register, but is not EDX, we need
; to move it into EDX (don't bother emitting
; the instruction if the register is EDX;
; the data is already where we want it):

            elseif  regVal ne regEDX
            mov     edx, theReg
            endif

; Print the value in EDX as an integer:

            call    print
            byte    "%d", 0
            endm

        option  casemap:none

nl          =       10

            .const
ttlStr      byte    "Listing 13-12", 0

 `Note: several thousand lines of code omitted here`
 `for brevity. This includes most of the text from`
 `Listing 13-11 plus the putInt macro`

            .code

            include getTitle.inc
            include print.inc
            public  asmMain
asmMain     proc
            push    rbx
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

            call    print
            byte    "Value 1:", 0
            mov     al, 55
            putInt  al

            call    print
            byte    nl, "Value 2:", 0
            mov     cx, 1234
            putInt  cx

            call    print
            byte    nl, "Value 3:", 0
            mov     ebx, 12345678
            putInt  ebx

            call    print
            byte    nl, "Value 4:", 0
            mov     edx, 1
            putInt  edx
            call    print
            byte    nl, 0

allDone:    leave
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 13-12: putInt macro function test program

Here's the build command and sample output for Listing 13-12:

C:\>**build listing13-12**

C:\>**echo off**
 Assembling: listing13-12.asm
c.cpp

C:\>**listing13-12**
Calling Listing 13-12:
Value 1:55
Value 2:1234
Value 3:12345678
Value 4:1
Listing 13-12 terminated

Although Listing 13-12 is a relatively simple example, it should give you a good idea of how you could make use of the macros in Listing 13-10.
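
The decisions putInt makes at assembly time can be summarized with a short Python model that returns the instructions the macro would emit for a given register. This is a sketch under stated assumptions: REG_SIZES is a small, hypothetical stand-in for the regSize lookup (it lists only a handful of registers), and put_int_code is an illustrative name, not part of the book's source.

```python
# Hypothetical subset of register sizes in bytes (stand-in for regSize).
REG_SIZES = {"al": 1, "cl": 1, "ax": 2, "cx": 2,
             "eax": 4, "ebx": 4, "edx": 4, "rax": 8, "rdx": 8}

def put_int_code(reg):
    """Return the instruction lines putInt would emit for reg."""
    size = REG_SIZES.get(reg.lower())
    if size is None:
        raise ValueError("Expected a register")
    if size > 4:
        raise ValueError("64-bit register not allowed")
    code = []
    if size in (1, 2):
        code.append(f"movsx   edx, {reg}")   # Sign-extend into EDX
    elif reg.lower() != "edx":
        code.append(f"mov     edx, {reg}")   # Copy 32-bit value into EDX
    # If the operand already is EDX, no move is emitted at all.
    code.append("call    print")
    code.append('byte    "%d", 0')
    return code

for reg in ("al", "cx", "ebx", "edx"):
    print(reg, put_int_code(reg))
```

The four registers in the loop match the four test cases in Listing 13-12: two sign extensions, one plain move, and one case where no move is needed.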

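13.17.4 Compile-Time Arrays

A compile-time constant array is an array that exists only at compile time; the data for the array does not exist at runtime. Sadly, MASM doesn't provide direct support for this useful CTL data type. Fortunately, you can use other MASM CTL features to simulate compile-time arrays.

This section considers two ways to simulate compile-time arrays: text strings and lists of equates (one equate per array element). The list of equates is probably the easiest implementation, so this section considers that approach first.

Listing 13-11 (available online) provides a very useful function that converts all the text in a string to uppercase characters (toUpper). The register macros use this macro to convert register names to uppercase characters (so that register name comparisons are case-insensitive). The toUpper macro is relatively straightforward. It extracts each character of a string and checks whether that character's value is in the range a to z; if it is, the macro uses that character's value as an index into an array (indexed from a to z) to extract the corresponding array element value (which will be A to Z for each element of the array). Here's the toUpper macro: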

; toUpper - Converts alphabetic characters to uppercase
;           in a text string.

toUpper     macro   lcStr
            local   result

; Build the result string in "result":

result      textequ <>

; For each character in the source string, 
; convert it to uppercase.

            forc    eachChar, <lcStr>

; See if we have a lowercase character:

            if      ('&eachChar' ge 'a') and ('&eachChar' le 'z')

; If lowercase, convert it to the symbol "lc_*" where "*"
; is the lowercase character. The equates below will map
; this character to uppercase:

eachChar    catstr  <lc_>,<eachChar>
result      catstr  result, &eachChar

            else

; If it wasn't a lowercase character, just append it
; to the end of the string:

result      catstr  result, <eachChar>

            endif
            endm            ; forc
            exitm   result  ; Return result string
            endm            ; toUpper

The "magic" statements that handle the array access are these two statements:

eachChar    catstr  <lc_>,<eachChar>
result      catstr  result, &eachChar

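The eachChar catstr operation produces a string of the form lc_a, lc_b, . . . , lc_z whenever this macro encounters a lowercase character. The result catstr operation expands a label of the form lc_a, and so on, to its value and concatenates the result to the end of the result string (which, in the register macros, is a register name). Immediately after the toUpper macro in Listing 13-11, you will find the following equates: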

lc_a        textequ <A>
lc_b        textequ <B>
lc_c        textequ <C>
lc_d        textequ <D>
lc_e        textequ <E>
lc_f        textequ <F>
lc_g        textequ <G>
lc_h        textequ <H>
lc_i        textequ <I>
lc_j        textequ <J>
lc_k        textequ <K>
lc_l        textequ <L>
lc_m        textequ <M>
lc_n        textequ <N>
lc_o        textequ <O>
lc_p        textequ <P>
lc_q        textequ <Q>
lc_r        textequ <R>
lc_s        textequ <S>
lc_t        textequ <T>
lc_u        textequ <U>
lc_v        textequ <V>
lc_w        textequ <W>
lc_x        textequ <X>
lc_y        textequ <Y>
lc_z        textequ <Z>

因此，lc_a 将扩展为字符 A，lc_b 将扩展为字符 B，依此类推。这一系列等式构成了 toUpper 使用的查找表（数组）。该数组应该被命名为 lc_，数组的索引是数组名称的后缀（a 到 z）。toUpper 宏通过将 character 追加到 lc_ 后来访问 lc_[``character``] 元素，然后扩展文本等式 lc_``character（扩展是通过应用 & 运算符到宏产生的 eachChar 字符串来实现的）。

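Note two things. First, the array index does not have to be an integer (or ordinal) value; any arbitrary string of characters can serve as an index.^(9) Second, if you supply an index that is out of range (not in a to z), the macro would try to expand a symbol of the form lc_xxxx, which is an undefined identifier, so MASM would report an undefined symbol error. This is not a problem for toUpper, because it validates the index (with the conditional if statement) before constructing the lc_xxxx symbol.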

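Listing 13-11 also demonstrates another way to implement a compile-time array: store the array elements in a text string and use substr to extract individual elements from that string. The isXXBitReg macros (is8BitReg, is16BitReg, and so on) pass a couple of data arrays to the more general lookupReg macro. Here is the is16BitReg macro:^(10)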

all16Regs   catstr <AX>,
                   <BX>,
                   <CX>,
                   <DX>,
                   <SI>,
                   <DI>,
                   <BP>,
                   <SP>,
                   <R8W>,
                   <R9W>,
                   <R10W>,
                   <R11W>,
                   <R12W>,
                   <R13W>,
                   <R14W>,
                   <R15W>

all16Lens   catstr <2>, <0>,           ; AX
                   <2>, <0>,           ; BX
                   <2>, <0>,           ; CX 
                   <2>, <0>,           ; DX
                   <2>, <0>,           ; SI
                   <2>, <0>,           ; DI
                   <2>, <0>,           ; BP
                   <2>, <0>,           ; SP
                   <3>, <0>, <0>,      ; R8W
                   <3>, <0>, <0>,      ; R9W
                   <4>, <0>, <0>, <0>, ; R10W
                   <4>, <0>, <0>, <0>, ; R11W
                   <4>, <0>, <0>, <0>, ; R12W
                   <4>, <0>, <0>, <0>, ; R13W
                   <4>, <0>, <0>, <0>, ; R14W
                   <4>, <0>, <0>, <0>  ; R15W

is16BitReg  macro   parm
            exitm   lookupReg(parm, all16Regs, all16Lens)
            endm    ; is16BitReg

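The all16Regs string is a list of register names (all concatenated into a single string). The lookupReg macro searches this string for the user-supplied register (parm) by using the MASM instr directive. If instr does not find the register in the list of names, parm is not a valid 16-bit register and instr returns 0. If it does find the string held in parm within all16Regs, instr returns the (nonzero) index of the match. By itself, a nonzero index does not mean that lookupReg has found a valid 16-bit register. For example, if the user supplies PR as a register name, instr returns a nonzero index into all16Regs (the index of the last character of SP, with the R coming from the first character of the R8W register name). Likewise, if the caller passes the string R8 to is16BitReg, instr returns the index of the first character of the R8W entry, but R8 is not a valid 16-bit register.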

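Although instr can reject a register name (by returning 0), a nonzero return value requires additional validation; this is where the all16Lens array comes in. The lookupReg macro uses the index that instr returns as an index into all16Lens. If the entry there is 0, the index into all16Regs is not a valid register index (it points at a position that is not the start of a register name). If the entry is nonzero, lookupReg compares it against the length of the parm string. If the two are equal, parm holds an actual 16-bit register name; if they are not, parm is too long or too short to be a valid 16-bit register name. Here is the complete lookupReg macro: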

; lookupReg - Given a (suspected) register and a lookup table, convert
;             that register to the corresponding numeric form.

lookupReg   macro   theReg, regList, regIndex
            local   regUpper, regConst, inst, regLen, indexLen

; Convert (possible) register to uppercase:

regUpper    textequ toUpper(theReg)
regLen      sizestr <&theReg>

; Does it exist in regList? If not, it's not a register.

inst        instr   1, regList, &regUpper
            if      inst ne 0

regConst    substr  &regIndex, inst, 1
            if      &regConst eq regLen

; It's a register (in text form). Create an identifier of
; the form "regXX" where "XX" represents the register name.

regConst    catStr  <reg>,regUpper

            ifdef   &regConst

; Return "regXX" as function result. This is the numeric value
; for the register.

            exitm   regConst
            endif
            endif
            endif

; If the parameter string wasn't in regList, then return
; "regNone" as the function result:

            exitm   <regNone>
            endm    ; lookupReg

Note that lookupReg also treats the register value constants (regNone, regAL, regBL, and so on) as an associative compile-time array (see the definition of regConst).
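The two-table check that lookupReg performs can be modeled outside of MASM. The following C sketch is only an illustration, not code from the book: the names regNames, regLens, and is16BitRegModel are hypothetical, strstr stands in for the instr directive, and the caller is assumed to pass a name that is already uppercase.

```c
#include <stddef.h>
#include <string.h>

/* Register names and per-position lengths, laid out like
   all16Regs and all16Lens (hypothetical names for this sketch). */
static const char regNames[] = "AXBXCXDXSIDIBPSPR8WR9WR10WR11WR12WR13WR14WR15W";
static const char regLens[]  = "2020202020202020300300400040004000400040004000";

/* Returns 1 if name (assumed uppercase) is a 16-bit register name. */
int is16BitRegModel(const char *name)
{
    size_t len = strlen(name);
    const char *hit;

    if (len == 0) return 0;            /* an empty string is never a register */
    hit = strstr(regNames, name);      /* plays the role of MASM's instr      */
    if (hit == NULL) return 0;         /* not present anywhere in the list    */

    /* Valid only if the match starts a table entry whose recorded
       length equals the length of the candidate string. */
    return (size_t)(regLens[hit - regNames] - '0') == len;
}
```

With this model, SP and R9W are accepted, while PR (a match that straddles two entries) and R8 (a prefix of R8W) are rejected by the length table.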

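13.18 Using Macros to Write Macros

One advanced use of macros is to have a macro invocation create one or more new macros. If you nest a macro declaration inside another macro, invoking that (enclosing) macro expands the enclosed macro definition and defines that macro at that point. Of course, if you invoke the outer macro more than once, you could wind up with a duplicate macro definition unless you take care in constructing the new macro (for example, by giving it a new name on each invocation of the outer macro). In some situations, being able to generate macros on the fly is quite useful.

Consider the compile-time array examples from the previous section. If you want to create a compile-time array by using the multiple-equates method, you have to manually define equates for all the array elements before you can use the array. This can be tedious, particularly when the array has a large number of elements. Fortunately, it is easy to create a macro that automates the process.

The following macro declaration accepts two arguments: the name of an array to create and the number of elements to put in it. The macro generates a list of definitions (using the = directive rather than the textequ directive), with each element initialized to 0: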

genArray    macro   arrayName, elements
            local   index, eleName, getName

; Loop over each element of the array:

index       =       0
            while   index lt &elements

; Generate a textequ statement to define a single
; element of the array, for example:

; aryXX = 0

; where "XX" is the index (0 to (elements - 1)).

eleName     catstr  <&arrayName>,%index,< = 0>

; Expand the text just created with the catstr directive.

            eleName

; Move on to next array index:

index       =       index + 1
            endm    ; while

            endm    ; genArray

For example, the following macro invocation creates 10 array elements, named ary0 to ary9:

genArray ary, 10

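You can access the array elements directly by using the names ary0, ary1, ary2, ..., ary9. If you want to access these elements programmatically (say, in a compile-time while loop), you would have to use the catstr directive to create a text equate that concatenates the array name (ary) with the index. Wouldn't it be more convenient to have a macro function create this text equate for you? Writing a macro to do this is easy: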

ary_get     macro   index
            local   element
element     catstr  <ary>,%index
            exitm   <element>
            endm

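With this macro, you can easily access elements of the ary array by invoking ary_get(index). You could also write a macro to store a value into a specified element of the ary array: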

ary_set     macro   index, value
            local   assign
assign      catstr  <ary>, %index, < = >, %value
            assign
            endm

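These two macros are so useful that you would probably want them every time you create an array with the genArray macro. So why not have genArray write them for you? Listing 13-13 provides an implementation that does exactly this.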

; Listing 13-13

; This is a compile-time program.
; It does not generate any executable code.

        option  casemap:none

genArray    macro   arrayName, elements
            local   index, eleName, getName

; Loop over each element of the array:

index       =       0
            while   index lt &elements

; Generate a textequ statement to define a single
; element of the array, for example:

; aryXX = 0

; where "XX" is the index (0 to (elements - 1)).

eleName     catstr  <&arrayName>,%index,< = 0>

; Expand the text just created with the catstr directive:

            eleName

; Move on to next array index:

index       =       index + 1
            endm    ; while

; Create a macro function to retrieve a value from
; the array:

getName     catstr  <&arrayName>,<_get>

getName     macro   theIndex
            local   element
element     catstr  <&arrayName>,%theIndex
            exitm   <element>
            endm

; Create a macro to assign a value to
; an array element.

setName     catstr  <&arrayName>,<_set>

setName     macro   theIndex, theValue
            local   assign
assign      catstr  <&arrayName>, %theIndex, < = >, %theValue
            assign
            endm

            endm    ; genArray

; mout - Replacement for echo. Allows "%" operator
;        in operand field to expand text symbols.

mout        macro   valToPrint
            local   cmd
cmd         catstr  <echo >, <valToPrint>
            cmd
            endm

; Create an array ("ary") with ten elements:

            genArray ary, 10

; Initialize each element of the array to
; its index value:

index       =       0
            while   index lt 10
            ary_set index, index
index       =       index + 1
            endm

; Print out the array values:

index       =       0
            while   index lt 10

value       =       ary_get(index)
            mout    ary[%index] = %value
index       =       index + 1
            endm

            end

Listing 13-13: A macro that writes another pair of macros

Here’s the build command and sample output for the compile-time program in Listing 13-13:

C:\>ml64 /c /Fl listing13-13.asm
Microsoft (R) Macro Assembler (x64) Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: listing13-13.asm
ary[0] = 0
ary[1] = 1
ary[2] = 2
ary[3] = 3
ary[4] = 4
ary[5] = 5
ary[6] = 6
ary[7] = 7
ary[8] = 8
ary[9] = 9

13.19 Compile-Time Program Performance

When writing compile-time programs, keep in mind that MASM is interpreting these programs during assembly. This can have a huge impact on the time it takes MASM to assemble your source files. Indeed, it is quite possible to create infinite loops that will cause MASM to (seemingly) hang up during assembly. Consider the following trivial example:

true        =       1
            while   true
            endm

Any attempt to assemble a MASM source file containing this sequence will lock up the system until you press ctrl-C (or use another mechanism to abort the assembly process).

Even without infinite loops, it is easy to create macros that take a considerable amount of time to process. If you use such macros hundreds (or even thousands) of times in a source file (as is common for some complex print-type macros), it could take a while for MASM to process your source files. Be aware of this, and be patient if MASM seems to hang up (it could simply be your compile-time programs taking a while to do their job).

If you think a compile-time program has entered an infinite loop, the echo directive (or macros like mout, appearing throughout this chapter) can help you track down the infinite loop (or other bugs) in your compile-time programs.

13.20 For More Information

Although this chapter has spent a considerable amount of time describing various features of MASM’s macro support and compile-time language features, the truth is this chapter has barely described what’s possible with MASM. Sadly, Microsoft’s documentation all but ignores the macro facilities of MASM. Probably the best place to learn about advanced macro programming with MASM is the MASM32 forum at http://www.masm32.com/board/index.php.

Although it is an older book, covering MASM version 6, The Waite Group’s Microsoft Macro Assembler Bible by Nabajyoti Barkakati and this author (Sams, 1992) does go into detail about the use of MASM’s macro facilities (as well as other directives that are poorly documented these days). Also, the MASM 6.x manual can still be found online at various sites. While this manual is woefully outdated with respect to the latest versions of MASM (it does not, for example, cover any of the 64-bit instructions or addressing modes), it does a decent job of describing MASM’s macro facilities and many of MASM’s directives. Just keep in mind when reading the older documentation that Microsoft has disabled many features that used to be present in MASM.

13.21 Test Yourself

1. What does CTL stand for?
2. When do CTL programs execute?
3. What directive would you use to print a message (not an error) during assembly?
4. What directive would you use to print an error message during assembly?
5. What directive would you use to create a CTL variable?
6. What is the MASM macro escape character operator?
7. What does the MASM % operator do?
8. What does the MASM macro & operator do?
9. What does the catstr directive do?
10. What does the MASM instr directive do?
11. What does the sizestr directive do?
12. What does the substr directive do?
13. What are the main (four) conditional assembly directives?
14. What directives could you use to create compile-time loops?
15. What directive would you use to extract the characters from a MASM text object in a loop?
16. What directives do you use to define a macro?
17. How do you invoke a macro in a MASM source file?
18. How do you specify macro parameters in a macro declaration?
19. How do you specify that a macro parameter is required?
20. How do you specify that a macro parameter is optional?
21. How do you specify a variable number of macro arguments?
22. Explain how you can manually test whether a macro parameter is present (without using the :req suffix).
23. How can you define local symbols in a macro?
24. What directive would you use (generally inside a conditional assembly sequence) to immediately terminate macro expansion without processing any additional statements in the macro?
25. How would you return a textual value from a macro function?
26. What operator could you use to test a macro parameter to see if it is a machine register versus a memory variable?

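Chapter 14: The String Instructions

A string is a collection of values stored in contiguous memory locations. The x86-64 CPUs can process four types of strings: byte strings, word strings, double-word strings, and quad-word strings.

The x86-64 microprocessor family supports several instructions designed specifically for strings. They can move strings, compare strings, search for a specific value within a string, initialize a string to a fixed value, and perform other primitive operations on strings. The x86-64 string instructions are also useful for assigning and comparing arrays, tables, and records, and they may speed up your array-manipulation code considerably. This chapter explores various uses of the string instructions.

14.1 The x86-64 String Instructions

All members of the x86-64 family support five string instructions: movs, cmps, scas, lods, and stos, each taking a size suffix x.^(1) (x = b, w, d, or q for byte, word, double word, or quad word, respectively; this book generally drops the x suffix when discussing these string instructions.) Moving, comparing, scanning, loading, and storing are the primitives from which you can build most other string operations.

The string instructions operate on blocks (contiguous linear arrays) of memory. For example, the movs instruction moves a sequence of bytes from one memory location to another, the cmps instruction compares two blocks of memory, and the scas instruction scans a block of memory for a particular value. The source and destination blocks (and any other values an instruction needs) are not supplied as explicit operands, however. Instead, the string instructions use specific registers as operands:

The RSI (source index) register
The RDI (destination index) register
The RCX (count) register
The AL, AX, EAX, and RAX registers
The direction flag in the FLAGS register

For example, the movs (move string) instruction copies RCX elements from the source address specified by RSI to the destination address specified by RDI. Likewise, the cmps instruction compares the string pointed at by RSI, of length RCX, with the string pointed at by RDI.

The following sections describe how to use these five instructions, starting with the prefixes that make them do what you would expect: repeat their operation for each value in the string pointed at by RSI.^(2)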

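14.1.1 The rep, repe, repz, repnz, and repne Prefixes

By themselves, the string instructions do not operate on strings of data. The movs instruction, for example, copies a single byte, word, double word, or quad word. The repeat prefixes tell the x86-64 to perform a multi-element string operation; specifically, to repeat the string operation up to RCX times.^(3)

The syntax for the string instructions with repeat prefixes is as follows: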

rep prefix: 
     rep  movs`x`(`x` is b, w, d, or q)
     rep  stos`x`

repe prefix: (Note: repz is a synonym for repe)
     repe  cmps`x` 
     repe  scas`x`

repne prefix: (Note: repnz is a synonym for repne)
     repne  cmps`x`
     repne  scas`x`

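You do not normally use the repeat prefixes with the lods instruction.

The rep prefix tells the CPU to "repeat this operation the number of times specified by the RCX register." The repe prefix says to "repeat this operation while the comparison is equal, or up to the number of times specified by RCX (whichever condition fails first)." The repne prefix's action is to "repeat this operation while the comparison is not equal, or up to the number of times specified by RCX." As it turns out, you will use repe for most character string comparisons; repne is used primarily with the scas instructions to locate a specific character within a string (such as a zero-terminating byte).

You can use repeat prefixes to process entire strings with a single instruction. You can also use the string instructions without a repeat prefix, as string primitive operations from which to synthesize more powerful string operations.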

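14.1.2 The Direction Flag

The direction flag in the FLAGS register controls how the CPU processes strings. If the direction flag is clear, the CPU increments RSI and RDI after operating on each string element. For example, executing movs moves the byte, word, double word, or quad word at RSI to RDI and then increments RSI and RDI by 1, 2, 4, or 8, respectively. When you specify the rep prefix before this instruction, the CPU increments RSI and RDI for each element in the string (the count in RCX specifies the number of elements). On completion, the RSI and RDI registers point at the first item beyond the strings.

If the direction flag is set, the x86-64 decrements RSI and RDI after it processes each string element (again, RCX specifies the number of elements for a repeated string operation). Afterward, RSI and RDI point at the first byte, word, double word, or quad word before the strings.

You can change the value of the direction flag by using the cld (clear direction flag) and std (set direction flag) instructions.

The Microsoft ABI requires that the direction flag be clear upon entry into a (Microsoft ABI-compliant) procedure. Therefore, if you set the direction flag within a procedure, you should always clear it when you are finished using it (and especially before calling any other code or returning from the procedure).

14.1.3 The movs Instruction

The movs instruction uses the following syntax: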

movsb
movsw
movsd
movsq
rep  movsb
rep  movsw
rep  movsd
rep  movsq

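The movsb (move string, bytes) instruction fetches the byte at address RSI, stores it at address RDI, and then increments or decrements the RSI and RDI registers by 1. If the rep prefix is present, the CPU checks RCX to see whether it contains 0. If not, it moves the byte from RSI to RDI and decrements the RCX register. This process repeats until RCX becomes 0. If RCX contains 0 upon initial execution, the movsb instruction does not copy any data bytes.

The movsw (move string, words) instruction fetches the word at address RSI, stores it at address RDI, and then increments or decrements RSI and RDI by 2. If there is a rep prefix, the CPU repeats this procedure RCX times.

The movsd instruction operates in a similar fashion on double words. It increments or decrements RSI and RDI by 4 after each data movement.

Finally, the movsq instruction does the same thing on quad words. It increments or decrements RSI and RDI by 8 after each data movement.

For example, this code copies 384 bytes from CharArray1 to CharArray2: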

CharArray1  byte 384 dup (?) 
CharArray2  byte 384 dup (?)
             . 
             . 
             . 
            cld
            lea  rsi, CharArray1
            lea  rdi, CharArray2
            mov  rcx, lengthof(CharArray1) ; = 384
        rep movsb

If you substitute movsw for movsb, the preceding code moves 384 words (768 bytes) rather than 384 bytes:

WordArray1  word 384 dup (?) 
WordArray2  word 384 dup (?)
             . 
             . 
             . 
            cld
            lea  rsi, WordArray1
            lea  rdi, WordArray2
            mov  rcx, lengthof(WordArray1) ; = 384
        rep movsw

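Remember, the RCX register contains the element count, not the byte count; fortunately, the MASM lengthof operator returns the number of array elements (words, in this case) rather than the number of bytes.

If you set the direction flag before executing a movsq, movsb, movsw, or movsd instruction, the CPU decrements the RSI and RDI registers after moving each string element. This means that RSI and RDI must point at the last element of their respective strings before the instruction executes. For example: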

CharArray1 byte  384 dup (?) 
CharArray2 byte  384 dup (?) 
            . 
            . 
            . 
           std
           lea rsi, CharArray1[lengthof(CharArray1) - 1] 
           lea rdi, CharArray2[lengthof(CharArray1) - 1]
           mov rcx, lengthof(CharArray1)
       rep movsb
           cld

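Although processing a string from tail to head is sometimes useful (see "Comparing Extended-Precision Integers" in section 14.1.4.2), you will generally process strings in the forward direction. For one class of string operations, being able to process strings in both directions is mandatory: moving strings when the source and destination blocks overlap. Consider what happens in the following code: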

CharArray1  byte ? 
CharArray2  byte 384 dup (?) 
             . 
             . 
             . 
            cld
            lea rsi, CharArray1
            lea rdi, CharArray2
            mov rcx, lengthof(CharArray2)
        rep movsb

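This sequence of instructions treats CharArray1 and CharArray2 as a pair of 384-byte strings. However, the last 383 bytes of the CharArray1 string overlap the first 383 bytes of the CharArray2 string. Let's trace the operation of this code byte by byte.

When the CPU executes the movsb instruction, it does the following:

Copies the byte at RSI (CharArray1) to the byte pointed at by RDI (CharArray2).
Increments RSI and RDI and decrements RCX by 1. Now the RSI register points at CharArray1 + 1 (which is the address of CharArray2), and the RDI register points at CharArray2 + 1.
Copies the byte pointed at by RSI to the byte pointed at by RDI. However, this is the byte originally copied from location CharArray1. So, the movsb instruction copies the value originally in location CharArray1 to both locations CharArray2 and CharArray2 + 1.
Again increments RSI and RDI, and decrements RCX.
Copies the byte from location CharArray1 + 2 (CharArray2 + 1) to location CharArray2 + 2. Once again, this is the value that originally appeared in location CharArray1.

Each repetition of the loop copies the next element in CharArray1 to the next available location in the CharArray2 array. Pictorially, it looks something like Figure 14-1. The end result is that the movsb instruction replicates X (the byte originally at CharArray1) throughout the string.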

Figure 14-1: Copying data between two overlapping arrays (forward direction)
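The replication effect can be reproduced with a small C model of rep movsb. This is a sketch for illustration only: repMovsb is a hypothetical helper, not a library function, and its df parameter stands in for the direction flag.

```c
#include <stddef.h>

/* Models "rep movsb": copies rcx bytes one at a time. With df == 0 the
   pointers step upward from the given addresses; with df != 0 they
   step downward, as they do after an std instruction. */
void repMovsb(unsigned char *rdi, const unsigned char *rsi,
              size_t rcx, int df)
{
    ptrdiff_t step = df ? -1 : 1;
    size_t i;

    for (i = 0; i < rcx; i++) {
        ptrdiff_t off = (ptrdiff_t)i * step;
        rdi[off] = rsi[off];        /* one movsb: load, then store */
    }
}
```

Given a five-byte buffer holding X, a, b, c, d, a forward copy of four bytes from the buffer's start to its second byte leaves X in all five positions, which is the behavior the figure describes.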

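If you really want to move one array into another when they overlap, you should move each element of the source string to the destination string starting at the end of the two strings, as shown in Figure 14-2.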

Figure 14-2: Using a backward copy to copy data in overlapping arrays

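Setting the direction flag and pointing RSI and RDI at the end of the strings lets you (correctly) move one string to another when the two strings overlap and the source string begins at a lower address than the destination string. If the two strings overlap and the source string begins at a higher address than the destination string, clear the direction flag and point RSI and RDI at the beginning of the two strings.

If the two strings do not overlap, you can use either technique to move the strings around in memory. Generally, operating with the direction flag clear is the easiest.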
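The direction rule above can be sketched in C. This is a minimal model, assuming a flat address space in which pointers can be compared as integers; moveBlock is a hypothetical name, and in real C code you would simply call memmove, which handles overlap for you.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies n bytes from src to dst, choosing the copy direction the same
   way you would choose between cld and std before a rep movsb. */
void moveBlock(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;

    if ((uintptr_t)src < (uintptr_t)dst) {
        /* Source is lower: copy from the last byte down (std case). */
        for (i = n; i > 0; i--)
            dst[i - 1] = src[i - 1];
    } else {
        /* Source is higher (or equal): copy from the first byte up. */
        for (i = 0; i < n; i++)
            dst[i] = src[i];
    }
}
```

Picking the backward copy whenever the source is lower is always safe, even if the blocks do not actually overlap, which keeps the decision to a single comparison.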

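You should not use the movs instructions to fill an array with a single byte, word, double-word, or quad-word value. Another string instruction, stos, is much better suited for this purpose.

If you are moving a large number of bytes from one array to another, the copy operation will be faster if you can use the movsq instruction rather than movsb. If the number of bytes you wish to move is a multiple of 8, this is a trivial change; just divide the number of bytes to copy by 8, load that value into RCX, and use the movsq instruction. If the number of bytes is not evenly divisible by 8, you can use the movsq instruction to copy all but the last 1, 2, ..., 7 bytes of the array (that is, the remainder after you divide the byte count by 8). For example, to efficiently move 4099 bytes, you can use the following instruction sequence: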

     lea  rsi, Source
     lea  rdi, Destination
     mov  rcx, 512     ; Copy 512 qwords = 4096 bytes
 rep movsq
     movsw             ; Copy bytes 4097 and 4098
     movsb             ; Copy the last byte

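Using this technique to copy data never requires more than four movs instructions, because you can copy 1, ..., 7 bytes with no more than one (each) of the movsb, movsw, and movsd instructions. This scheme is most efficient if the two arrays are aligned on quad-word boundaries. If they are not, you might want to move the movsb, movsw, or movsd instruction (or all three) before or after the movsq so that the movsq instruction works with quad-word-aligned data.

If you do not know the size of the block you are copying until the program executes, you can still use code like the following to improve the performance of a block move of bytes: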

     lea  rsi, Source
     lea  rdi, Destination
     mov  rcx, Length
     shr  rcx, 3       ; Divide by 8
     jz   lessThan8    ; Execute movsq only if 8 or more bytes

 rep movsq             ; Copy the qwords

lessThan8:
     mov  rcx, Length
     and  rcx, 111b      ; Compute (Length mod 8)
     jz   divisibleBy8   ; Execute movsb only if (Length mod 8) <> 0

 rep movsb             ; Copy the remaining 1 to 7 bytes

divisibleBy8:

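On many computer systems, the movsq instruction provides a fast way to copy data from one location to another. Although there may be faster ways to copy data on certain CPUs, ultimately the memory bus performance is the limiting factor, and the CPUs are generally much faster than the memory bus. Therefore, unless you have a special system, writing fancy code to improve memory-to-memory transfers is probably a waste of time.

Also, Intel has improved the performance of the movs instructions on later processors so that movsb operates as efficiently as movsw, movsd, and movsq when copying the same number of bytes. On these later processors, it may be more efficient to simply use movsb to copy the specified number of bytes rather than go through all the complexity outlined previously.

The bottom line is this: if the speed of a block move matters to you, try several approaches and pick the fastest one (or the simplest one, if they all run at the same speed, which is quite likely).

14.1.4 The cmps Instruction

The cmps instruction compares two strings. The CPU compares the value referenced by RDI to the value pointed at by RSI. RCX contains the number of elements in the source string when using the repe or repne prefix to compare entire strings. Like the movs instruction, MASM allows several forms of this instruction: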

cmpsb
cmpsw
cmpsd
cmpsq

repe   cmpsb
repe   cmpsw
repe   cmpsd
repe   cmpsq

repne  cmpsb
repne  cmpsw
repne  cmpsd
repne  cmpsq

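Without a repeat prefix, the cmps instruction subtracts the value at location RDI from the value at location RSI and updates the flags according to the result (which it discards). After comparing the two locations, cmps adjusts the RSI and RDI registers by 1, 2, 4, or 8 (for cmpsb, cmpsw, cmpsd, and cmpsq, respectively): it increments them if the direction flag is clear and decrements them otherwise.

Remember, the value in the RCX register determines the number of elements to process, not the number of bytes. Therefore, when using cmpsw, RCX specifies the number of words to compare. Likewise, for cmpsd and cmpsq, RCX contains the number of double words and quad words to process.

The repe prefix compares successive elements in the strings as long as they are equal and RCX is greater than 0. The repne prefix does the same as long as the elements are not equal.

After repne cmps executes, either the scan ran out of elements without finding a match (the zero flag is clear and RCX is 0, in which case the two strings are totally different), or it stopped at a matching pair (the zero flag is set, and RCX holds the number of elements that were left unexamined). While this form of the cmps instruction is not particularly useful for comparing strings, it is useful for locating the first pair of matching items in a couple of byte, word, double-word, or quad-word arrays.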

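14.1.4.1 Comparing Character Strings

Character strings are usually compared by using lexicographical ordering, the standard alphabetical ordering you grew up with. We compare corresponding elements until we encounter a character that does not match or we reach the end of the shorter string. If a pair of corresponding characters does not match, we order the two strings based on that single character. If the two strings match up to the length of the shorter string, we compare their lengths. The two strings are equal if and only if their lengths are equal and each corresponding pair of characters is identical. The length of a string affects the comparison only when the two strings are identical up to the length of the shorter string. For example, Zebra is less than Zebras because it is the shorter of the two strings; however, Zebra is greater than AAAAAAAAAAH! even though Zebra is shorter.

For (ASCII) character strings, use the cmpsb instruction in the following manner:

Clear the direction flag.
Load the RCX register with the length of the shorter string.
Point the RSI and RDI registers at the first characters in the two strings you want to compare.
Use the repe prefix with the cmpsb instruction to compare the strings on a byte-by-byte basis.
If the two strings are equal, compare their lengths.

The following code compares a couple of character strings: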

       cld
       mov  rsi, AdrsStr1
       mov  rdi, AdrsStr2
       mov  rcx, LengthSrc
       cmp  rcx, LengthDest
       jbe  srcIsShorter        ; Put the length of the 
                                ; shorter string in RCX
       mov  rcx, LengthDest 

srcIsShorter:
  repe cmpsb
       jnz   notEq              ; If equal to the length of the 
                                ; shorter string, cmp lengths
       mov   rcx, LengthSrc
       cmp   rcx, LengthDest

notEq:

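If you are using bytes to hold the string lengths, you should adjust this code appropriately (that is, use a movzx instruction to load the lengths into RCX).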
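As a cross-check on the ordering rules, here is the same algorithm as a C sketch. The function name compareStrings and the memcmp-style return value are choices made for this illustration; the strings are assumed to carry explicit lengths, as in the assembly code above.

```c
#include <stddef.h>

/* Lexicographic comparison of two byte strings with explicit lengths.
   Returns a negative value, 0, or a positive value, like memcmp. */
int compareStrings(const unsigned char *s1, size_t len1,
                   const unsigned char *s2, size_t len2)
{
    size_t n = len1 < len2 ? len1 : len2;   /* RCX = shorter length */
    size_t i;

    for (i = 0; i < n; i++) {               /* the repe cmpsb loop  */
        if (s1[i] != s2[i])
            return s1[i] < s2[i] ? -1 : 1;
    }

    /* Equal up to the shorter length: the lengths decide. */
    if (len1 == len2) return 0;
    return len1 < len2 ? -1 : 1;
}
```

Comparing Zebra with Zebras falls through to the length test, while comparing Zebra with AAAAAAAAAAH! is decided by the very first character.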

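14.1.4.2 Comparing Extended-Precision Integers

You can also use the cmps instruction to compare multi-word integer values (that is, extended-precision integer values). Because of the setup required for a string comparison, this is not practical for integer values less than six or eight double words in length, but for large integer values, it works well.

Unlike character strings, we cannot compare integer strings by using lexicographical ordering. When comparing strings, we compare the characters from the least significant byte to the most significant byte (that is, from the lowest address upward). When comparing integers, we must compare the values from the most significant byte, word, or double word down to the least significant. So, to compare two 32-byte (256-bit) integer values, use the following code: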

     std
     lea  rsi, SourceInteger[3 * 8]
     lea  rdi, DestInteger[3 * 8]
     mov  rcx, 4
repe cmpsq
     cld

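This code compares the integers from their most significant quad word down to the least significant quad word. The repe cmpsq instruction stops when the two values are unequal or upon decrementing RCX to 0 (implying that the two values are equal). Once again, the flags provide the result of the comparison.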
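The same high-to-low scan looks like this in C. This is a sketch under stated assumptions: the integers are unsigned and stored little-endian as arrays of 64-bit elements (element 0 is the least significant), and compareBig is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Compares two unsigned extended-precision integers of the given size.
   Returns a negative value, 0, or a positive value. */
int compareBig(const uint64_t *a, const uint64_t *b, size_t qwords)
{
    size_t i;

    /* Start at the most significant qword, as std + repe cmpsq does. */
    for (i = qwords; i > 0; i--) {
        if (a[i - 1] != b[i - 1])
            return a[i - 1] < b[i - 1] ? -1 : 1;
    }
    return 0;   /* every qword matched: the values are equal */
}
```

A value whose top qword is 1 compares greater than one whose top qword is 0, no matter what the lower qwords contain, which is why a low-to-high scan would give the wrong answer.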

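14.1.5 The scas Instruction

The scas (scan string) instruction searches for a particular element within a string; for example, it can quickly scan for a 0 within another string.

Unlike the movs and cmps instructions, scas requires only a destination string (pointed at by RDI). The source operand is the value in the AL (scasb), AX (scasw), EAX (scasd), or RAX (scasq) register. The scas instruction compares the value in the accumulator (AL, AX, EAX, or RAX) against the value pointed at by RDI and then increments (or decrements) RDI by 1, 2, 4, or 8. The CPU sets the flags according to the result of the comparison.

The scas instruction takes the following forms: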

scasb
scasw
scasd
scasq

repe   scasb
repe   scasw
repe   scasd
repe   scasq

repne  scasb
repne  scasw
repne  scasd
repne  scasq

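With the repe prefix, scas scans the string, searching for an element that does not match the value in the accumulator. With the repne prefix, scas scans the string, searching for the first element that is equal to the value in the accumulator. This is not as intuitive as it could be: repe scas keeps scanning while the accumulator is equal to the string element, and repne scas keeps scanning while the accumulator is not equal to the string element.

Like the cmps and movs instructions, the value in the RCX register specifies the number of elements, not bytes, to process when using a repeat prefix.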
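To make the stopping rule concrete, the following C sketch models repne scasb with the direction flag clear. The name repneScasb and the found output parameter (which plays the role of the zero flag) are inventions for this example.

```c
#include <stddef.h>

/* Models "repne scasb": examines up to rcx bytes at rdi, stopping after
   the first byte equal to al. Returns how many bytes were examined and
   sets *found to 1 if the scan stopped on a match, 0 otherwise. */
size_t repneScasb(const unsigned char *rdi, size_t rcx,
                  unsigned char al, int *found)
{
    size_t examined = 0;

    *found = 0;
    while (rcx != 0) {
        rcx--;                          /* count is consumed first     */
        examined++;                     /* RDI advances past the byte  */
        if (rdi[examined - 1] == al) {  /* compare sets the zero flag  */
            *found = 1;
            break;
        }
    }
    return examined;
}
```

When scanning a zero-terminated string for 0 with a count large enough to cover the terminator, the string's length is the number of bytes examined minus 1, because the scan also consumes the terminator itself.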

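14.1.6 The stos Instruction

The stos instruction stores the value in the accumulator at the location specified by RDI. After storing the value, the CPU increments or decrements RDI, depending on the state of the direction flag. Although the stos instruction has many uses, its primary use is to initialize arrays and strings to a constant value. For example, if you have a 256-byte array you want to clear out with 0s, use the following code: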

     cld
     lea  rdi, DestArray
     mov  rcx, 32          ; 32 quad words = 256 bytes
     xor  rax, rax         ; Zero out RAX
rep  stosq

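This code writes 32 quad words rather than 256 bytes because a single stosq operation is faster (on some older CPUs) than eight stosb operations.

The stos instruction takes eight forms: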

stosb
stosw
stosd
stosq

rep  stosb
rep  stosw
rep  stosd
rep  stosq

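The stosb instruction stores the value in the AL register into the specified memory location, stosw stores the AX register, stosd stores EAX, and stosq stores RAX. With the rep prefix, this process repeats the number of times specified by the RCX register.

If you need to initialize an array with elements that have different values, you cannot (easily) use the stos instruction.

14.1.7 The lods Instruction

The lods instruction copies the byte, word, double word, or quad word pointed at by RSI into the AL, AX, EAX, or RAX register, after which it increments or decrements the RSI register by 1, 2, 4, or 8. Use lods to fetch bytes (lodsb), words (lodsw), double words (lodsd), or quad words (lodsq) from memory for further processing.

Like the stos instruction, the lods instruction takes eight forms: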

lodsb
lodsw
lodsd
lodsq

rep  lodsb
rep  lodsw
rep  lodsd
rep  lodsq

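You will probably never use a repeat prefix with this instruction, because the accumulator register is overwritten each time lods repeats. At the end of the repeat operation, the accumulator contains the last value read from memory.^(4)

14.1.8 Building Complex String Functions from lods and stos

You can use the lods and stos instructions to synthesize any particular string operation. For example, suppose you want a string operation that converts all the uppercase characters in a string to lowercase. You could use the following code: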

     mov rsi, StringAddress  ; Load string address into RSI
     mov rdi, rsi            ; Also point RDI here
     mov rcx, stringLength   ; Presumably, this was precomputed 
     jrcxz skipUC            ; Don't do anything if length is 0
rpt:
     lodsb                   ; Get the next character in the string
     cmp   al, 'A'
     jb    notUpper
     cmp   al, 'Z'
     ja    notUpper
     or    al, 20h           ; Convert to lowercase
notUpper:
     stosb                   ; Store converted char into string
     dec   rcx
     jnz   rpt               ; Zero flag is set when RCX is 0
skipUC:

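The rpt loop fetches the byte at the location specified by RSI, tests whether it is an uppercase character, converts it to lowercase if it is (leaving it alone if it is not), stores the resulting character at the location specified by RDI, and then repeats this process the number of times specified by the value in RCX.

Because the lods and stos instructions use the accumulator as an intermediate location, you can use any accumulator operation to quickly manipulate string elements. This could be something as simple as a toLower (or toUpper) function or as complex as data encryption. You might even use this instruction sequence to compute a hash, checksum, or CRC value while moving data from one string to another. Any operation you would do on a string on a character-by-character basis while moving the string data around is a candidate.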

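14.2 Performance of the x86-64 String Instructions

In the early x86-64 processors, the string instructions provided the most efficient way to manipulate strings and blocks of data. However, these instructions are not part of Intel's RISC core instruction set and, as such, can be slower (though more compact) than doing the same operations with discrete instructions. Intel has optimized the movs and stos instructions on later processors so that they operate as rapidly as possible, but the other string instructions can be fairly slow.

As always, it is a good idea to implement performance-critical algorithms by using different approaches (with and without the string instructions) and compare their performance to determine which solution to use. Because the string instructions run at different speeds relative to other instructions depending on the processor, run your experiments on the processors where you expect your code to run.

14.3 SIMD String Instructions

The SSE4.2 instruction set extensions include four powerful instructions for manipulating character strings. These instructions were first introduced in 2008, so some computers in use today still might not support them. Always use cpuid to determine whether these instructions are available before attempting to use them in widely distributed commercial applications (see "Using cpuid to Differentiate Instruction Sets" in Chapter 11).

The four SSE4.2 instructions that process text and string fragments are as follows:

PCMPESTRI Packed compare explicit-length strings, return index
PCMPESTRM Packed compare explicit-length strings, return mask
PCMPISTRI Packed compare implicit-length strings, return index
PCMPISTRM Packed compare implicit-length strings, return mask

Implicit-length strings use a sentinel (trailing) byte to mark the end of the string, specifically a zero-terminating byte (or word, in the case of Unicode characters). Explicit-length strings are those for which you supply a string length.

The instructions that produce an index return the index of the first (or last) match within the source string. The instructions that return a bit mask return an array of all 0 or (all) 1 bits that marks each occurrence of a match within the two input strings.

The packed compare string instructions are among the most complex in the x86-64 instruction set. The syntax for these instructions is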

pcmp`X`str`Y`  `xmm`[src1], `xmm`[src2]/`mem`[src2], `imm`[8]
vpcmp`X`str`Y` `xmm`[src1], `xmm`[src2]/`mem`[src2], `imm`[8]

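where X is E or I, and Y is I or M. Both forms use 128-bit operands (there is no 256-bit YMM form of the v-prefixed instructions) and, unlike most SSE instructions, the (v)pcmpXstrY instructions allow memory operands that are not aligned on a 16-byte boundary (these instructions would be nearly useless for their intended purpose if they required 16-byte-aligned memory operands).

The (v)pcmpXstrY instructions compare corresponding bytes or words in a pair of XMM registers, combine the individual comparison results into a vector (bit mask), and return the results of all the comparisons. The imm[8] operand controls various comparison attributes, as described in the following sections.

14.3.1 Packed Compare Operand Sizes

Bits 0 and 1 of the immediate operand specify the size and type of the string elements. The elements can be bytes or words, and the instruction can treat them as unsigned or signed values for the comparison (see Table 14-1).

Bit 0 selects word (Unicode) or byte (ASCII) operands. Bit 1 selects whether the operands are signed or unsigned. Generally, for character strings, you use unsigned comparisons. However, in certain situations (or when processing strings of integers rather than characters), you may want to specify signed comparisons.

Table 14-1: Packed Compare imm[8] Bits 0 and 1

Bit(s)	Bit value	Meaning
0–1	00	Both source operands contain 16 unsigned bytes.
	01	Both source operands contain 8 unsigned words.
	10	Both source operands contain 16 signed bytes.
	11	Both source operands contain 8 signed words.

14.3.2 Comparison Type

Bits 2 and 3 of the immediate operand specify how the instruction compares the two strings. There are four comparison types: testing the characters of one string against a set of characters in the second string, testing the characters of one string against a set of character ranges, doing a straight string comparison, or searching for a substring within another string (see Table 14-2).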

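Table 14-2: Packed Compare imm[8] Bits 2 and 3

Bit(s)	Bit value	Meaning
2–3	00	Equal any: compares each character in the second source string against the set of characters appearing in the first source operand.
	01	Ranges: compares each value in the second source operand against a set of ranges specified by the first source operand.
	10	Equal each: compares each corresponding element of the two operands for equality (a character-by-character comparison).
	11	Equal ordered: searches for the substring specified by the first operand within the string specified by the second operand.

Bits 2 and 3 specify the type of comparison to perform (the aggregate operation, in Intel terminology). Equal each (10b) is probably the easiest comparison to understand. The packed compare instruction compares each corresponding character in the two strings (up to the length of the strings; more on that later) and sets a Boolean flag for the result of each byte or word comparison, as shown in Figure 14-3. This is comparable to the operation of the memcmp() or strcmp() function in C/C++.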

Figure 14-3: Equal each aggregate comparison operation

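The equal any comparison compares each byte in the second source operand (XMM[src2]/mem[src2]) to see whether it matches any of the characters in the first source operand (XMM[src1]). For example, if XMM[src1] contains the string abcdefABCDEF (and four 0 bytes) and XMM[src2]/mem[src2] contains 12AF89C0, the resulting comparison would yield 01001100b (1s in the bit positions corresponding to the A, F, and C characters). Note that the first character (1) maps to bit 0, the A and F characters map to bits 2 and 3, and the C character maps to bit 6. This is similar to the strspn() and strcspn() functions in the C Standard Library.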
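The mask in this example is easy to verify with a software model. The following C sketch covers only byte elements and explicit lengths; equalAnyMask is a hypothetical function written for this illustration, not an intrinsic.

```c
#include <stddef.h>
#include <string.h>

/* Software model of the equal any aggregate operation on bytes.
   Bit i of the result is 1 if str[i] appears in the set. */
unsigned equalAnyMask(const char *set, size_t setLen,
                      const char *str, size_t strLen)
{
    unsigned mask = 0;
    size_t i;

    if (setLen > 16) setLen = 16;   /* lengths saturate at 16 bytes */
    if (strLen > 16) strLen = 16;

    for (i = 0; i < strLen; i++) {
        if (memchr(set, str[i], setLen) != NULL)
            mask |= 1u << i;        /* character i is in the set */
    }
    return mask;
}
```

Calling equalAnyMask("abcdefABCDEF", 12, "12AF89C0", 8) sets bits 2, 3, and 6, which is the value 4Ch.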

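The equal ordered comparison searches the XMM[src2]/mem[src2] operand for each occurrence of the string held in XMM[src1]. For example, if the XMM[src2]/mem[src2] operand contains the string never need shine and the XMM[src1] operand contains the string ne (padded with 0s), the equal ordered comparison produces the vector 0100000001000001b (bits 0, 6, and 14 are set, marking where ne begins). This is similar to the strstr() function in the C Standard Library.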
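The vector above can be reproduced with a short C model. This sketch handles byte elements with explicit lengths only, and equalOrderedMask is a hypothetical name. It also makes one assumption that the text does not spell out here: a candidate position is tested only against the characters that fit within the 16-element register, so a partial match at the very end of the block is reported as a match.

```c
#include <stddef.h>

/* Software model of the equal ordered aggregate operation on bytes.
   Bit i of the result is 1 if sub matches str starting at element i. */
unsigned equalOrderedMask(const char *sub, size_t subLen,
                          const char *str, size_t strLen)
{
    unsigned mask = 0;
    size_t i, j;

    if (subLen > 16) subLen = 16;
    if (strLen > 16) strLen = 16;

    for (i = 0; i < 16; i++) {
        int match = 1;
        /* Compare only the part of sub that fits in the register. */
        for (j = 0; j < subLen && i + j < 16; j++) {
            /* Past the end of str, or a mismatch: no match here. */
            if (i + j >= strLen || str[i + j] != sub[j]) {
                match = 0;
                break;
            }
        }
        if (match)
            mask |= 1u << i;
    }
    return mask;
}
```

For the substring ne in never need shine, the model returns 4041h, the binary vector shown in the text. An empty substring matches at every position, which agrees with the forced-true entries for an invalid src1.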

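The ranges aggregate operation breaks the entries in the XMM[src1] operand into pairs (at even and odd indexes in the register). The first element (byte or word) of a pair specifies a lower bound, and the second specifies an upper bound. The XMM[src1] register supports up to eight byte ranges or four word ranges (if you need fewer ranges, pad the remaining pairs with 0s). This aggregate operation compares each character in the XMM[src2]/mem[src2] operand against each of these ranges and stores true in the resulting vector if the character is within one of the specified ranges (inclusive) and false if it lies outside all of them.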
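A software model makes the pairing of bounds explicit. The C sketch below handles unsigned byte elements with explicit lengths; rangesMask is a hypothetical name chosen for this illustration.

```c
#include <stddef.h>

/* Software model of the ranges aggregate operation on unsigned bytes.
   ranges holds (lower, upper) pairs; bit i of the result is 1 if str[i]
   falls inside at least one pair, bounds included. */
unsigned rangesMask(const unsigned char *ranges, size_t rangeLen,
                    const unsigned char *str, size_t strLen)
{
    unsigned mask = 0;
    size_t i, j;

    if (rangeLen > 16) rangeLen = 16;
    if (strLen > 16) strLen = 16;

    for (i = 0; i < strLen; i++) {
        for (j = 0; j + 1 < rangeLen; j += 2) {
            if (ranges[j] <= str[i] && str[i] <= ranges[j + 1]) {
                mask |= 1u << i;
                break;              /* one matching range is enough */
            }
        }
    }
    return mask;
}
```

With the two ranges a to z and A to Z (the four bytes azAZ), the string a1B2c produces bits 0, 2, and 4, marking the alphabetic characters.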

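14.3.3 Result Polarity

Bits 4 and 5 of the immediate operand specify the result polarity (see Table 14-3). This chapter fully discusses the meaning of these bits a little later (some additional explanation is necessary first).

Table 14-3: Packed Compare imm[8] Bits 4 and 5

Bit(s)	Bit value	Meaning
4–5	00	Positive polarity
	01	Negative polarity
	10	Positive masked
	11	Negative masked

14.3.4 Output Processing

Bit 6 of the immediate operand specifies the instruction result (see Table 14-4). The packed compare instructions do not use bit 7; it should always be 0.

Table 14-4: Packed Compare imm[8] Bit 6 (and 7)

Bit(s)	Bit value	Meaning
6	0	(v)pcmpXstri only: the index returned in ECX is that of the first result. (v)pcmpXstrm only: the mask appears in the LO bits of XMM0, zero-extended to 128 bits.
	1	(v)pcmpXstri only: the index returned in ECX is that of the last result. (v)pcmpXstrm only: the bit mask is expanded into a byte or word mask.
7	0	This bit is reserved and should always be 0.

The (v)pcmpestrm and (v)pcmpistrm instructions produce a bit-mask result and store it into the XMM0 register (this is fixed; the CPU does not determine it from the operands to these instructions). If bit 6 of the imm8 operand is 0, these two instructions pack the bit mask into 8 or 16 bits, store it into the LO 8 (or 16) bits of XMM0, and zero-extend that value through the upper bits of XMM0. If bit 6 of imm8 is 1, these instructions store the mask (all 1 bits in each byte or word that matched) throughout the entire XMM0 register.^(5)

The (v)pcmpestri and (v)pcmpistri instructions produce an index result and return it in the ECX register.^(6) If bit 6 of the imm8 operand is 0, these two instructions return the index of the least significant set bit in the result bit mask (that is, the first matching comparison). If bit 6 of imm8 is 1, they return the index of the most significant set bit in the result bit mask (that is, the last matching comparison). If there are no set bits in the result bit mask, these instructions return 16 (for byte comparisons) or 8 (for word comparisons) in ECX. Although these instructions internally generate a bit-mask result in order to compute the index, they do not overwrite the XMM0 register (as the (v)pcmpestrm and (v)pcmpistrm instructions do).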
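The index selection just described reduces to a bit scan over the internal mask. The following C sketch models it; maskToIndex is a hypothetical helper, elements is 16 for byte comparisons or 8 for word comparisons, and wantLast stands in for bit 6 of the immediate operand.

```c
/* Converts an internal result mask into the index an index-returning
   packed compare instruction would leave in ECX. */
int maskToIndex(unsigned mask, int elements, int wantLast)
{
    int i;

    if (wantLast) {
        for (i = elements - 1; i >= 0; i--)     /* highest set bit */
            if ((mask >> i) & 1u) return i;
    } else {
        for (i = 0; i < elements; i++)          /* lowest set bit  */
            if ((mask >> i) & 1u) return i;
    }
    return elements;    /* no bits set: 16 (bytes) or 8 (words) */
}
```

Because the "no match" value equals the element count, code that walks a long string in 16-byte steps can add the returned index to its running offset without special-casing an empty result.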

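14.3.5 Packed String Compare Lengths

The (v)pcmpXstrY instructions have a 16-byte (XMM register size) comparison limit. This is true even on AVX processors with 32-byte YMM registers. To compare larger strings, you must execute multiple (v)pcmpXstrY instructions.

The (v)pcmpistri and (v)pcmpistrm instructions use an implicit string length. The strings appear in the XMM registers or memory with the first character (if any) in the LO byte, followed by the remaining characters in order. The strings end with a zero-terminating byte or word. If there are more than 16 characters (for byte strings) or 8 characters (for word strings), the register (or 128-bit memory) size delimits the string.

The (v)pcmpestri and (v)pcmpestrm instructions use explicitly supplied string lengths. The EAX (or RAX) register specifies the length of the string in XMM[src1], and the EDX (or RDX) register specifies the length of the string in XMM[src2]/mem[src2]. If a string length is greater than 16 (for byte strings) or 8 (for word strings), the instruction saturates the length at 16 or 8. Also, the (v)pcmpestri and (v)pcmpestrm instructions take the absolute value of the length, so –1 to –16 is equivalent to 1 to 16.

The explicit-length instructions saturate the length at 16 (or 8) to allow programs to process larger strings in a loop. By processing 16 bytes (or 8 words) per iteration and decrementing the overall string length (from some large value down to 0), the packed string operations operate on 16 or 8 characters per loop iteration until the very last iteration. At that point, the instructions process the remaining (total length mod 16 or 8) characters in the string.

The explicit-length instructions take the absolute value of the length so that code processing large strings can either decrement the loop counter (from a large positive value down to 0) or increment it (from a negative value up to 0), whichever is more convenient for the program.

Whenever the length (implicit or explicit) is less than 16 (bytes) or 8 (words), some characters in the XMM register (or 128-bit memory location) are invalid. Specifically, the zero terminator and every character after it (for implicit-length strings), or every character beyond the count in EAX or EDX (for explicit-length strings), is invalid. Regardless of the presence of invalid characters, the packed compare instructions still produce an intermediate bit vector result by comparing all the characters in the strings.

Because the two input strings (in XMM[src1] and XMM[src2]/mem[src2]) are not necessarily the same length, there are four possible situations: src1 and src2 are both invalid, exactly one of the source operands is invalid (and the other is valid, so there are two cases here), or both are valid. Depending on which operands are valid or invalid, the packed compare instructions may force the result to true or false. Table 14-5 lists how these instructions force results, based on the comparison type (aggregate operation) specified by the imm8 operand.

Table 14-5: Comparison Result When Source 1 and Source 2 Are Valid or Invalid

Src1	Src2	Equal any	Ranges	Equal each	Equal ordered
Invalid	Invalid	Force false	Force false	Force true	Force true
Invalid	Valid	Force false	Force false	Force false	Force true
Valid	Invalid	Force false	Force false	Force false	Force false
Valid	Valid	Result	Result	Result	Result

To understand the entries in this table, you must consider each comparison type individually.

The equal any comparison checks whether each character appearing in src2 appears anywhere in the set of characters specified by src1. If a character in src1 is invalid, the instruction is comparing against a character that is not in the set; in that situation, you want to return false (regardless of whether src2 is valid). If src1 is valid but src2 is invalid, you are at (or beyond) the end of the string; that is not a valid comparison, so equal any forces a false result in this situation as well.

The ranges comparison is also (in a sense) comparing a source string (src2) against a set of characters (specified by the ranges in src1). Therefore, the packed compare instructions force false if either operand is invalid, for the same reasons as the equal any comparison.

The equal each comparison is the traditional string comparison operation, comparing the string in src2 to the string in src1. If the corresponding characters in both strings are invalid, you have moved beyond the end of both strings. The packed compare instructions force the result to true in this situation because these instructions are, essentially, comparing empty strings at that point (and empty strings are equal). If a character in one string is valid but the corresponding character in the other string is invalid, you are comparing actual characters against an empty string, which is always not equal; hence, the packed string compare instructions force a false result.

The equal ordered operation searches for the substring XMM[src1] within the larger string in XMM[src2]/mem[src2]. If you have gone beyond the end of both strings, you are comparing empty strings (and one empty string is always a substring of another empty string), so the packed compare instructions return a true result. If you have reached the end of the src1 string (the substring to search for), the result is true even if there are more characters in src2; therefore, the packed comparison returns true in this situation. However, if you have reached the end of the src2 string but not the end of the src1 (substring) string, there is no way the equal ordered operation can return true, so the packed compare instructions force a false in this situation.

If the polarity bits (bits 4 and 5 of imm8) contain 00b or 10b, the polarity bits do not affect the comparison operation. If the polarity bits are 01b, the packed string compare instructions invert all the bits in the temporary bit-map result before copying the data to XMM0 ((v)pcmpistrm and (v)pcmpestrm) or calculating the index ((v)pcmpestri and (v)pcmpistri). If the polarity bits are 11b, the packed string compare instructions invert a result bit only if the corresponding src2 character is valid.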
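The polarity step can be modeled as a small post-processing function over the internal mask. In the following C sketch, applyPolarity is a hypothetical name, polarity holds bits 4 and 5 of the immediate operand (as a value from 0 to 3), src2Len is assumed to be between 0 and elements, and elements is 16 or 8.

```c
/* Applies the polarity setting to an internal result mask. */
unsigned applyPolarity(unsigned mask, unsigned polarity,
                       int src2Len, int elements)
{
    unsigned all = (1u << elements) - 1u;       /* one bit per element */
    unsigned valid = src2Len >= elements
                   ? all
                   : (1u << src2Len) - 1u;      /* valid src2 elements */

    switch (polarity & 3u) {
    case 1u:                        /* 01b: invert every bit           */
        return ~mask & all;
    case 3u:                        /* 11b: invert valid elements only */
        return (mask ^ valid) & all;
    default:                        /* 00b and 10b: leave unchanged    */
        return mask & all;
    }
}
```

Starting from the mask 4Ch for an 8-byte src2, negative polarity yields FFB3h, while the masked negative form yields B3h because the eight positions past the end of the string are left alone.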

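14.3.6 Packed String Comparison Results

The last thing to note about the packed string compare instructions is how they affect the CPU flags. These instructions are unusual among the SSE/AVX instructions insofar as they affect the condition codes. However, they do not affect the condition codes in a standard way (for example, you cannot use the carry and zero flags to test for string less than or greater than, as you can with the cmps instructions). Instead, these instructions overload the meanings of the carry, zero, sign, and overflow flags; furthermore, each instruction defines the meanings of these flags independently.

All eight instructions ((v)pcmpestri, (v)pcmpistri, (v)pcmpestrm, and (v)pcmpistrm) clear the carry flag if all the bits in the (internal) result bit map are 0 (no comparison matched); they set the carry flag if at least 1 bit is set in the bit map. Note that the carry flag is set or cleared after the application of polarity.

The zero flag indicates whether the src2 length is less than 16 (8 for word characters). For the (v)pcmpestri and (v)pcmpestrm instructions, the zero flag is set if EDX is less than 16 (8); for the (v)pcmpistri and (v)pcmpistrm instructions, the zero flag is set if XMM[src2]/mem[src2] contains a null character.

The sign flag indicates whether the src1 length is less than 16 (8 for word characters). For the (v)pcmpestri and (v)pcmpestrm instructions, the sign flag is set if EAX is less than 16 (8); for the (v)pcmpistri and (v)pcmpistrm instructions, the sign flag is set if XMM[src1] contains a null character.

The overflow flag contains the setting of bit 0 of the result bit map (that is, whether the first character of the source string was a match). This can be useful after an equal ordered comparison to see whether the substring is a prefix of the larger string (for example).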

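14.4 Alignment and Memory Management Unit Pages

The (v)pcmpXstrY instructions are nice insofar as they do not require their memory operands to be 16-byte aligned. However, this lack of alignment creates a special problem of its own: a single (v)pcmpXstrY instruction memory access can cross an MMU page boundary. As noted in "Memory Access and 4K Memory Management Unit Pages" in Chapter 3, some MMU pages might not be accessible, and the CPU will generate a general protection fault if it attempts to read data from them.

If the string is less than 16 bytes long and ends just before a page boundary, using (v)pcmpXstrY to access that data may cause an inadvertent page fault, because the instruction reads a full 16 bytes from memory, including data beyond the end of the string. Although accessing data beyond the string that crosses into a new, inaccessible MMU page is a rare situation, it can happen, so you want to ensure you do not access data across MMU page boundaries unless the next MMU page contains actual data.

If you have aligned an address on a 16-byte boundary and you access 16 bytes of memory starting at that address, you do not have to worry about crossing into a new MMU page. MMU pages contain an integral multiple of 16 bytes (there are 256 16-byte blocks in a 4KB MMU page). If the CPU accesses 16 bytes starting at a 16-byte boundary, the last 15 bytes of that block fall in the same MMU page as the first byte. This is why most SSE memory accesses are okay: they require 16-byte-aligned memory operands. The exceptions are the unaligned move instructions and the (v)pcmpXstrY instructions.

You typically use the unaligned move instructions (for example, movdqu and movupd) to move 16 actual bytes of data into an SSE/AVX register; therefore, these instructions do not usually access extra bytes in memory. The (v)pcmpXstrY instructions, however, often access data bytes beyond the end of the actual string. These instructions read a full 16 bytes from memory even if the string consumes fewer than 16 of those bytes. Therefore, when using the (v)pcmpXstrY instructions (and the unaligned moves, if you use them to read beyond the end of a data structure), you should ensure that the memory address you supply is at least 16 bytes before the end of an MMU page, or that the next page in memory contains valid data.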
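The "at least 16 bytes before the end of the page" test is simple address arithmetic. The following C sketch assumes 4KB pages, as described above; read16StaysInPage is a hypothetical helper name.

```c
#include <stdint.h>

/* Returns 1 if a 16-byte read starting at addr stays inside a single
   4KB MMU page, or 0 if it would spill into the next page. */
int read16StaysInPage(uintptr_t addr)
{
    /* The low 12 bits are the offset within the page; the read is safe
       when that offset is no more than 4096 - 16. */
    return (addr & 0xFFFu) <= 0x1000u - 16u;
}
```

When the test fails, one option is to handle the tail of the string with byte-at-a-time code (or to copy the remaining bytes into a local 16-byte buffer first) rather than issuing the packed compare directly on that address.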

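As noted in Chapter 3, there is no machine instruction that lets you test a page in memory to see whether the application can legally access it. So you have to ensure that no memory access by a (v)pcmpXstrY instruction crosses a page boundary. The next chapter provides several examples.

14.5 For More Information

Agner Fog is one of the world's foremost experts on optimizing x86(-64) assembly language. His website (www.agner.org/optimize/#manuals/) has a lot to say about optimizing memory moves and other string instructions. This website is highly recommended if you want to write fast string code in x86 assembly language.

T. Herselman has spent a huge amount of time writing fast memcpy functions. You can find his results at www.codeproject.com/Articles/1110153/Apex-memmove-the-fastest-memcpy-memmove-on-x-x-EVE/ (or just search the web for Apex memmove). The length of this code will, undoubtedly, convince you to stick with the movs instruction (which runs fairly fast on modern x86-64 CPUs).

14.6 Test Yourself

What size operands do the generic string instructions support?
What are the five general-purpose string instructions?
What size operands do the pcmpXstrY instructions support?
What registers does the rep movsb instruction use?
What registers does the cmpsw instruction use?
What registers does the repne scasb instruction use?
What registers does the stosd instruction use?
If you want to increment the RSI and RDI registers after each string operation, what direction flag setting do you use?
If you want to decrement the RSI and RDI registers after each string operation, what direction flag setting do you use?
If a function or procedure modifies the direction flag, what should that function do before returning?
The Microsoft ABI requires the direction flag to be _ before a function returns if the function has modified the flag's value.
Which string instructions has Intel optimized for performance on later x86-64 CPUs?
When would you want to set the direction flag prior to using a movs instruction?
When would you want to clear the direction flag prior to using a movs instruction?
What can happen if the direction flag is not set properly when executing a movs instruction?
Which string prefix would you normally use with a cmpsb instruction to test two strings for equality?
How should the direction flag normally be set before comparing two strings?
Do you need to test whether RCX is 0 before executing a string instruction with a repeat prefix?
If you wanted to search for a zero-terminating byte in a C/C++ string, what (general-purpose) string instruction would be most appropriate?
If you wanted to fill a block of memory with 0s, what string instruction would be most appropriate?
If you wanted to concoct your own string operations, what string instructions would you use?
Which string instruction would you typically never use with a repeat prefix?
What should you do before using one of the pcmpXstrY instructions?
Which SSE string instructions automatically handle zero-terminated strings?
Which SSE string instructions require an explicit length value?
Where do you pass explicit lengths to the pcmpXstrY instructions?
Which pcmpXstrY aggregate operation searches for characters belonging to a set of characters?
Which pcmpXstrY aggregate operation compares two strings?
Which pcmpXstrY aggregate operation checks whether one string is a substring of another?
What is the problem with the pcmpXstrY instructions and MMU pages?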

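Chapter 15: Managing Complex Projects

Most assembly language source files are not stand-alone programs. They are components of a large set of source files, possibly written in different languages, that are compiled and linked together to form complex applications. Programming in the large is the term software engineers have coined to describe the processes, methodologies, and tools for handling the development of large software projects.

While everyone has their own idea of what large is, separate compilation is one of the more popular techniques that support programming in the large. Using separate compilation, you first break your large source files into manageable chunks. Then you compile the separate files into object code modules. Finally, you link the object modules together to form a complete program. If you need to make a small change to one of the modules, you need to reassemble only that one module; you do not need to reassemble the entire program. Once you have debugged and tested a large section of your code, continuing to assemble that same code when you make a small change to some other part of your program is a waste of time. Imagine having to wait 20 or 30 minutes on a fast PC to assemble a program in which you have changed only one line of code!

The following sections describe the tools MASM provides for separate compilation and how to use these tools effectively in your programs for modularity and reduced development time.

15.1 The include Directive

When the assembler encounters the include directive in a source file, it merges the specified file into the compilation at the point of the include directive. The syntax for the include directive is

include `filename`

where filename is a valid filename. By convention, MASM include files have an .inc (include) suffix, but any file containing MASM assembly language source code will work fine. A file being included into another file during assembly may itself include files.

Using the include directive by itself does not provide separate compilation. You could use the include directive to break up a large source file into separate modules and join these modules together when you compile your file. The following example would include the print.inc and getTitle.inc files during the compilation of your program: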

include  print.inc
include  getTitle.inc

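Your program now benefits from modularity. Alas, you will not save any development time. The include directive inserts the source file at the point of the include during assembly, exactly as though you had typed that code yourself. MASM still has to assemble the code, and that takes time. If you include a large number of source files (such as a huge library) in your assembly, the process can take a very long time.

In general, you should not use the include directive to include source code as in the preceding example.^(1) Instead, use it to insert a common set of constants, types, external procedure declarations, and similar items into a program. Typically, an assembly language include file does not contain any machine code (outside of macros; see Chapter 13 for details). The purpose of using include files this way will become clearer once you see how external declarations work.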

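15.2 Ignoring Duplicate Include Operations

As you begin to develop sophisticated modules and libraries, you will eventually run into a big problem: some header files need to include other header files. That by itself is not a problem. The trouble starts when one header file includes another, which includes another, which includes another, and so on, until the last header file includes the first one. That is a big problem, because it creates an infinite loop in the assembler and makes MASM complain about duplicate symbol definitions. After all, the first time MASM reads a header file it processes all the declarations in that file; the second time around, it views all those symbols as duplicates.

The standard technique for ignoring duplicate includes, well known to C/C++ programmers, is to use conditional assembly to have MASM ignore the contents of an include file. (See "Conditional Assembly (Compile-Time Decisions)" in Chapter 13.) The trick is to place an ifndef (if not defined) statement around all the statements in the include file. As the ifndef operand, use the include file's filename with an underscore substituted for the period (or any other symbol that is otherwise undefined). Then, immediately after the ifndef statement, define that symbol (typically with a numeric equate that assigns it the constant 0). Here is an example of this ifndef usage: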

              ifndef  myinclude_inc   ; Filename: myinclude.inc
myinclude_inc =       0

`Put all the source code lines for the include file here`

; The following statement should be the last non-blank line
; in the source file:

              endif  ; myinclude_inc

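On the second inclusion, MASM simply skips over the contents of the include file (including any include directives), which prevents the infinite loop and all the duplicate symbol definitions.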
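To see why the guard matters, consider two hypothetical include files, a.inc and b.inc, that include each other (the filenames and the symbols aConst and bConst are invented for this sketch). With the guards in place, the second attempt to read a.inc is skipped, so the chain terminates:

```masm
; ---- a.inc (hypothetical) ----

            ifndef  a_inc
a_inc       =       0

            include b.inc           ; b.inc includes a.inc in turn

aConst      equ     1               ; Processed exactly once

            endif   ; a_inc

; ---- b.inc (hypothetical) ----

            ifndef  b_inc
b_inc       =       0

            include a.inc           ; a_inc is already defined, so
                                    ; MASM skips a.inc's contents
bConst      equ     2

            endif   ; b_inc
```

A source file that includes a.inc first ends up with b_inc, bConst, and aConst each processed once; including b.inc first works the same way in the other order.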

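15.3 Assembly Units and External Directives

An assembly unit is the assembly of a source file plus any files it directly or indirectly includes. An assembly unit produces a single .obj file after assembly. The Microsoft linker combines multiple object files (produced by MASM or by other compilers, such as MSVC) into a single executable unit (an .exe file). The main purpose of this section (and, indeed, of this whole chapter) is to describe how these assembly units (.obj files) communicate linkage information to one another during the linking process. Assembly units are the basis for creating modular programs in assembly language.

To use MASM's assembly unit facilities, you must create at least two source files. One file contains a set of variables and procedures used by the second. The second file uses those variables and procedures without knowing how they are implemented.

Rather than using the include directive to create modular programs, which wastes time because MASM must reassemble bug-free code every time you assemble the main program, a much better solution is to preassemble the debugged modules and link the object code modules together. This is what the public, extern, and externdef directives allow you to do.

Technically, all of the programs appearing in this book up to this point have been separately assembled modules (which happen to link with a C/C++ main program rather than with another assembly language module). The assembly language main program named asmMain is nothing more than a C++-compatible function that the generic c.cpp program calls from its main program. Consider the body of asmMain from Listing 2-1 in Chapter 2: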

; Here is the "asmMain" function.

        public  asmMain
asmMain proc
         .
         .
         .
asmMain endp

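Every program that has an asmMain function has included the public asmMain statement without any definition or explanation. It's time to address that oversight.

Normal symbols in a MASM source file are private to that source file and are inaccessible from other source files (unless, of course, those files directly include the file containing the private symbols). That is, the scope of most symbols in a source file is limited to the lines of code within that file (and any files it includes). The public directive tells MASM to make the specified symbol global, that is, accessible by other assembly units during the link phase. With the public asmMain statement, the example programs in this book make the asmMain symbol global to the source file containing it, so that the c.cpp program can call the asmMain function.

Simply making a symbol public is not sufficient to use it in another source file. The source file that wants to use the symbol must also declare it as external. This notifies the linker that it must patch in the address of the public symbol wherever the file with the external declaration uses that symbol. For example, the c.cpp source file declares the asmMain symbol as external in the following lines of code (for what it's worth, this declaration also declares the external symbols getTitle and readLine):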

// extern "C" namespace prevents
// "name mangling" by the C++
// compiler.

extern "C"
{
    // asmMain is the assembly language
    // code's "main program":

    void asmMain(void);

    // getTitle returns a pointer to a
    // string of characters from the
    // assembly code that specifies the
    // title of that program (which makes
    // this program generic and usable
    // with a large number of sample
    // programs in "The Art of 64-Bit
    // Assembly").

    char *getTitle(void);

    // C++ function that the assembly
    // language program can call:

    int readLine(char *dest, int maxLen);

};

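Note that readLine is a C++ function defined in the c.cpp source file in this example. C/C++ does not have an explicit public declaration. Instead, if you supply the source code for a function in a source file that also declares that function to be external, C/C++ automatically makes that symbol public by virtue of the external declaration.

MASM actually has two external symbol declaration directives: extern and externdef.^(2) These two directives use the syntax

extern    `symbol`:`type`  {`optional_list_of_symbol:type_pairs`}
externdef `symbol`:`type`  {`optional_list_of_symbol:type_pairs`}

where symbol is the identifier you want to use from another assembly unit, and type is the data type of that symbol. The data type can be any of the following:

proc, which indicates that the symbol is a procedure (function) name or a statement label
Any MASM built-in data type (such as byte, word, dword, qword, oword, and so on)
Any user-defined data type (such as a struct name)
abs, which indicates a constant value

The abs type is not meant for declaring generic external constants such as someConst = 0. Pure constant declarations like that would normally appear in a header file (an include file), which this section describes shortly. Instead, the abs type is generally reserved for constants that are based on code offsets within an object module. For example, if you have the following code in an assembly unit,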

          public someLen
someStr   byte   "abcdefg"
someLen   =      $-someStr

the type of someLen, in an extern declaration, would be abs.
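A minimal sketch of how another assembly unit might consume these symbols follows. It assumes the defining unit also contains a public someStr statement (the fragment above exports only someLen); the file name consumer.asm and the procedure name getStr are hypothetical.

```masm
; consumer.asm (hypothetical) - uses someStr and someLen from
; another assembly unit.

            extern  someStr:byte    ; Assumes "public someStr" exists
            extern  someLen:abs     ; Constant supplied at link time

            .code

; getStr - Returns the string's address in RAX and its
;          length in ECX.

            public  getStr
getStr      proc
            lea     rax, someStr
            mov     ecx, someLen    ; Immediate operand, not a load
            ret
getStr      endp
            end
```

Because someLen is declared abs, the assembler treats it as a constant whose value the linker fills in, rather than as the address of a memory variable.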

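Both directives accept a comma-delimited list to allow multiple symbol declarations; for example:

extern p:proc, b:byte, d:dword, a:abs

I'd argue, however, that your programs will be easier to read if you limit each declaration to a single symbol.

When you place an extern directive in your program, MASM treats that declaration like any other symbol declaration. If the symbol already exists, MASM generates a symbol redefinition error. Generally, you should place all external declarations near the beginning of the source file to avoid scoping or forward-reference issues. Because the public directive does not actually define the symbol, its placement is not as critical. Some programmers put all the public declarations at the beginning of a source file; others put the public declaration right before the definition of the symbol (as I've done for the asmMain symbol in most of the sample programs). Either position is fine.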
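The following sketch shows both sides of a public/extern pairing in one place. The file names (counter.asm, user.asm) and the symbols callCount, bumpCount, and useCounter are hypothetical; the two units would be assembled separately with ml64 /c and combined by the linker.

```masm
; ---- counter.asm (hypothetical): defines and exports ----

            public  callCount
            public  bumpCount

            .data
callCount   dword   0

            .code
bumpCount   proc
            inc     callCount
            ret
bumpCount   endp
            end

; ---- user.asm (hypothetical): imports and uses ----

            extern  callCount:dword
            extern  bumpCount:proc

            .code
            public  useCounter
useCounter  proc
            sub     rsp, 40         ; Shadow storage + alignment
            call    bumpCount
            mov     eax, callCount  ; Return the updated count
            add     rsp, 40
            ret
useCounter  endp
            end
```

Here the external declarations sit at the top of user.asm, before any use, while counter.asm groups its public statements at the start of the file; placing each public just before its definition would work equally well.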

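15.4 Header Files in MASM

Because a public symbol from one source file can be used by many assembly units, a small problem arises: you must replicate the extern directive in every file that uses the symbol. For a small number of symbols, this is not much of a problem. As the number of external symbols grows, however, maintaining them across multiple source files becomes tedious. The MASM solution is the same as the C/C++ solution: header files.

Header files are include files containing external (and other) declarations that are common to multiple assembly units. They are called header files because the include statement that inserts them normally appears at the beginning (the "head") of the source file that uses them. This turns out to be the primary use of include files in MASM: to hold external (and other) common declarations.

15.5 The `externdef` Directive

When you start using header files with large sets of library modules (assembly units), you'll quickly discover a big problem with the extern directive. Typically, you will create a single header file for a large set of library functions, with each function possibly appearing in its own assembly unit. Some library functions might use other functions in the same library (a set of object files); therefore, the source file for that particular library function might want to include the library's header file in order to reference the external names of the other library functions.

Unfortunately, if the header file contains the external declaration for the function defined in the current source file, a symbol redefinition error occurs: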

; header.inc
           ifndef   header_inc
header_inc =        0

           extern  func1:proc
           extern  func2:proc

           endif   ; header_inc

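Assembling the following source file produces an error because func1 is already declared (by an extern directive) in the header.inc header file: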

; func1.asm

           include header.inc

           .code

func1      proc
             .
             .
             .
           call func2
             .
             .
             .
func1      endp
           end

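C/C++ does not suffer from this problem because the extern keyword doubles as both a public and an external declaration.

To overcome this problem, MASM introduced the externdef directive. This directive is similar to C/C++'s extern keyword: it behaves like an extern directive when the symbol is not defined in the source file, and like a public directive when the symbol is defined in the source file. In addition, multiple externdef declarations for the same symbol may appear in a source file (although they should all specify the same type for the symbol). Consider the previous header.inc header file, modified to use externdef declarations: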

; header.inc
           ifndef     header_inc
header_inc =          0

           externdef  func1:proc
           externdef  func2:proc

           endif      ; header_inc

With this header file, the func1.asm assembly unit will assemble properly.
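For completeness, here is a sketch of what the companion unit might look like. The text does not show func2.asm, so its body is an assumption; the point is only that the same header serves both units, with externdef acting as public for the symbol each file defines and as extern for the other.

```masm
; func2.asm (hypothetical companion to func1.asm)

            include header.inc      ; externdef func1:proc
                                    ; externdef func2:proc
            .code

; func2 is defined here, so its externdef behaves like
; "public func2"; func1 is not defined here, so its
; externdef behaves like "extern func1:proc".

func2       proc
            mov     eax, 2          ; Placeholder body
            ret
func2       endp
            end
```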

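15.6 Separate Compilation

Way back in "The MASM Include Directive" in Chapter 11, I started putting the print and getTitle functions into include files so that I could simply include them in every source file that needed them, rather than manually cutting and pasting the functions into each program. Clearly, these are good examples of code that should be made into assembly units and linked with other programs rather than being included during assembly.

Listing 15-1 is a header file that incorporates the necessary print and getTitle declarations:^(3)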

; aoalib.inc - Header file containing external function
;              definitions, constants, and other items used
;              by code in "The Art of 64-Bit Assembly."

            ifndef      aoalib_inc
aoalib_inc  equ         0

; Constant definitions:

; nl (newline constant):

nl          =           10

; SSE4.2 feature flags (in ECX):

SSE42       =       00180000h       ; Bits 19 and 20
AVXSupport  =       10000000h       ; Bit 28

; CPUID bits (EAX = 7, EBX register):

AVX2Support  =      20h             ; Bit 5 = AVX2

; **********************************************************

; External data declarations:

            externdef   ttlStr:byte

; **********************************************************

; External function declarations:

            externdef   print:qword
            externdef   getTitle:proc

; Definition of C/C++ printf function that
; the print function will call (and some
; AoA sample programs call this directly,
; as well).

            externdef   printf:proc

            endif       ; aoalib_inc

Listing 15-1: aoalib.inc header file

Listing 15-2 contains the print function used in "The MASM Include Directive" in Chapter 11, converted to an assembly unit.

; print.asm - Assembly unit containing the SSE/AVX dynamically
;             selectable print procedures.

            include aoalib.inc

            .data
            align   qword
print       qword   choosePrint     ; Pointer to print function

            .code

; print - "Quick" form of printf that allows the format string to
;         follow the call in the code stream. Supports up to five
;         additional parameters in RDX, R8, R9, R10, and R11.

; This function saves all the Microsoft ABI–volatile,
; parameter, and return result registers so that code
; can call it without worrying about any registers being
; modified (this code assumes that Windows ABI treats
; YMM6 to YMM15 as nonvolatile).

; Of course, this code assumes that AVX instructions are
; available on the CPU.

; Allows up to 5 arguments in:

;  RDX - Arg #1
;  R8  - Arg #2
;  R9  - Arg #3
;  R10 - Arg #4
;  R11 - Arg #5

; Note that you must pass floating-point values in
; these registers as well. The printf function
; expects real values in the integer registers. 

; There are two versions of this program, one that
; will run on CPUs without AVX capabilities (no YMM
; registers) and one that will run on CPUs that
; have AVX capabilities (YMM registers). The difference
; between the two is which registers they preserve
; (print_SSE preserves only XMM registers and will
; run properly on CPUs that don't have YMM register
; support; print_AVX will preserve the volatile YMM
; registers on CPUs with AVX support).

; On first call, determine if we support AVX instructions
; and set the "print" pointer to point at print_AVX or
; print_SSE:

choosePrint proc
            push    rax             ; Preserve registers that get
            push    rbx             ; tweaked by CPUID
            push    rcx
            push    rdx

            mov     eax, 1
            cpuid
            test    ecx, AVXSupport ; Test bit 28 for AVX
            jnz     doAVXPrint

            lea     rax, print_SSE  ; From now on, call
            mov     print, rax      ; print_SSE directly

; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_SSE.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_SSE

doAVXPrint: lea     rax, print_AVX  ; From now on, call
            mov     print, rax      ; print_AVX directly

; Return address must point at the format string
; following the call to this function! So we have
; to clean up the stack and JMP to print_AVX.

            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            jmp     print_AVX

choosePrint endp

; Version of print that preserves the AVX registers
; saved by the code below (YMM0 to YMM7):

thestr      byte "YMM4:%I64x", nl, 0
print_AVX   proc

; Preserve all the volatile registers
; (be nice to the assembly code that
; calls this procedure):

            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11

; YMM0 to YMM7 are considered volatile, so preserve them:

            sub     rsp, 256
            vmovdqu ymmword ptr [rsp + 000], ymm0
            vmovdqu ymmword ptr [rsp + 032], ymm1
            vmovdqu ymmword ptr [rsp + 064], ymm2
            vmovdqu ymmword ptr [rsp + 096], ymm3
            vmovdqu ymmword ptr [rsp + 128], ymm4
            vmovdqu ymmword ptr [rsp + 160], ymm5
            vmovdqu ymmword ptr [rsp + 192], ymm6
            vmovdqu ymmword ptr [rsp + 224], ymm7

            push    rbp

returnAdrs  textequ <[rbp + 328]>

            mov     rbp, rsp
            sub     rsp, 256
            and     rsp, -16

; Format string (passed in RCX) is sitting at
; the location pointed at by the return address;
; load that into RCX:

            mov     rcx, returnAdrs

; To handle more than three arguments (four counting
; RCX), you must pass data on stack. However, to the
; print caller, the stack is unavailable, so use
; R10 and R11 as extra parameters (could be just
; junk in these registers, but pass them just
; in case).

            mov     [rsp + 32], r10
            mov     [rsp + 40], r11
            call    printf

; Need to modify the return address so
; that it points beyond the zero-terminating byte.
; Could use a fast strlen function for this, but
; printf is so slow it won't really save us anything.

            mov     rcx, returnAdrs
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     returnAdrs, rcx

            leave
            vmovdqu ymm0, ymmword ptr [rsp + 000]
            vmovdqu ymm1, ymmword ptr [rsp + 032]
            vmovdqu ymm2, ymmword ptr [rsp + 064]
            vmovdqu ymm3, ymmword ptr [rsp + 096]
            vmovdqu ymm4, ymmword ptr [rsp + 128]
            vmovdqu ymm5, ymmword ptr [rsp + 160]
            vmovdqu ymm6, ymmword ptr [rsp + 192]
            vmovdqu ymm7, ymmword ptr [rsp + 224]
            add     rsp, 256
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print_AVX   endp

; Version that will run on CPUs without
; AVX support and will preserve the
; SSE registers saved below (XMM0 to XMM7):

print_SSE   proc

; Preserve all the volatile registers
; (be nice to the assembly code that
; calls this procedure):

            push    rax
            push    rbx
            push    rcx
            push    rdx
            push    r8
            push    r9
            push    r10
            push    r11

; Preserve XMM0 to XMM7 (a superset of the volatile XMM registers):

            sub     rsp, 128
            movdqu  xmmword ptr [rsp + 00],  xmm0
            movdqu  xmmword ptr [rsp + 16],  xmm1
            movdqu  xmmword ptr [rsp + 32],  xmm2
            movdqu  xmmword ptr [rsp + 48],  xmm3
            movdqu  xmmword ptr [rsp + 64],  xmm4
            movdqu  xmmword ptr [rsp + 80],  xmm5
            movdqu  xmmword ptr [rsp + 96],  xmm6
            movdqu  xmmword ptr [rsp + 112], xmm7

            push    rbp

returnAdrs  textequ <[rbp + 200]>

            mov     rbp, rsp
            sub     rsp, 128
            and     rsp, -16

; Format string (passed in RCX) is sitting at
; the location pointed at by the return address;
; load that into RCX:

            mov     rcx, returnAdrs

; To handle more than three arguments (four counting
; RCX), you must pass data on stack. However, to the
; print caller, the stack is unavailable, so use
; R10 and R11 as extra parameters (could be just
; junk in these registers, but pass them just
; in case):

            mov     [rsp + 32], r10
            mov     [rsp + 40], r11
            call    printf

; Need to modify the return address so
; that it points beyond the zero-terminating byte.
; Could use a fast strlen function for this, but
; printf is so slow it won't really save us anything.

            mov     rcx, returnAdrs
            dec     rcx
skipTo0:    inc     rcx
            cmp     byte ptr [rcx], 0
            jne     skipTo0
            inc     rcx
            mov     returnAdrs, rcx

            leave
            movdqu  xmm0, xmmword ptr [rsp + 00]
            movdqu  xmm1, xmmword ptr [rsp + 16]
            movdqu  xmm2, xmmword ptr [rsp + 32]
            movdqu  xmm3, xmmword ptr [rsp + 48]
            movdqu  xmm4, xmmword ptr [rsp + 64]
            movdqu  xmm5, xmmword ptr [rsp + 80]
            movdqu  xmm6, xmmword ptr [rsp + 96]
            movdqu  xmm7, xmmword ptr [rsp + 112]
            add     rsp, 128
            pop     r11
            pop     r10
            pop     r9
            pop     r8
            pop     rdx
            pop     rcx
            pop     rbx
            pop     rax
            ret
print_SSE   endp            
            end

Listing 15-2: The print function appearing in an assembly unit

To complete all the common aoalib functions used thus far, here is Listing 15-3.

; getTitle.asm - The getTitle function converted to
;                an assembly unit.

; Return program title to C++ program:

            include aoalib.inc

            .code
getTitle    proc
            lea     rax, ttlStr
            ret
getTitle    endp
            end

Listing 15-3: The getTitle function as an assembly unit

Listing 15-4 is a program that uses the assembly units in Listings 15-2 and 15-3.

; Listing 15-4

; Demonstration of linking.

            include aoalib.inc

            .data
ttlStr      byte    "Listing 15-4", 0

; ***************************************************************

; Here is the "asmMain" function.

            .code
            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage

            call    print
            byte    "Assembly units linked", nl, 0

            leave
            pop     rdi
            pop     rsi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

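Listing 15-4: A main program that uses the print and getTitle assembly modules

So how do you build and run this program? Unfortunately, the build.bat batch file this book has used up to this point will not do the job. Here are two commands that assemble all the units and link them together: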

ml64 /c print.asm getTitle.asm listing15-4.asm
cl /EHa c.cpp print.obj getTitle.obj listing15-4.obj

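These commands properly assemble and compile all the source files and link their object code together to produce the executable file c.exe.

Unfortunately, the preceding commands defeat one of the major benefits of separate compilation. When you issue the ml64 /c print.asm getTitle.asm listing15-4.asm command, it assembles all of the assembly source files. Remember, a major reason for separate compilation is to reduce build times on large projects. Although the preceding commands work, they do not achieve this goal.

To assemble the modules separately, you must run MASM on each of them separately. To handle all three source files this way, break the ml64 invocation into three separate commands: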

ml64 /c print.asm
ml64 /c getTitle.asm
ml64 /c listing15-4.asm
cl /EHa c.cpp print.obj getTitle.obj listing15-4.obj

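Of course, this sequence still assembles all three assembly source files. However, after the first time you execute these commands, you have built the print.obj and getTitle.obj files. From that point forward, as long as you do not change the print.asm or getTitle.asm source files (and do not delete print.obj or getTitle.obj), you can build and run the program in Listing 15-4 by using these commands: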

ml64 /c listing15-4.asm
cl /EHa c.cpp print.obj getTitle.obj listing15-4.obj

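Now you have saved the time needed to assemble the print.asm and getTitle.asm files.

15.7 An Introduction to Makefiles

The build.bat file used throughout this book has been far more convenient than typing the individual build commands. Unfortunately, the build mechanism that build.bat supports is really good for only a few fixed source files. While you could easily construct a batch file to assemble all the files in a large assembly project, running that batch file would reassemble every source file in the project. Although you can use complex command line features to avoid some of this, there is an easier way: makefiles.

A makefile is a script in a special language (designed in early releases of Unix) that specifies how to execute a series of commands based on certain conditions; a program named make executes it. If you've installed MSVC and MASM as part of Visual Studio, you've probably also installed (as part of that same process) Microsoft's variant of make: nmake.exe.^(4) To use nmake.exe, you execute it from a Windows command line as follows: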

nmake `optional_arguments`

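If you execute nmake by itself on a command line (without any arguments), nmake.exe searches for a file named makefile and attempts to process the commands in that file. For many projects, this is very convenient. You place all of your project's source files in a single directory (or in subdirectories of that directory), along with a single makefile named makefile. By changing into that directory and executing nmake (or make), you can build the project with minimal fuss.

If you want to use a filename other than makefile, you must preface the filename with the /f option, as follows:

nmake /f mymake.mak

The filename does not have to have the .mak extension. However, this is a popular convention when using makefiles that are not named makefile.

The nmake program does provide many command line options, and /help will list them. Look up the nmake documentation for a description of the other command line options (most of them are advanced and unnecessary for most tasks).

15.7.1 Basic Makefile Syntax

A makefile is a standard ASCII text file containing a sequence of lines (or a set of multiple occurrences of this sequence) of the following form: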

`target`: `dependencies`
    `commands`

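The `target`: `dependencies` line is optional. The commands item is a list of one or more command line commands and is also optional. The target item, if present, must begin in column 1 of its source line. The commands items must have at least one whitespace character (a space or tab) in front of them; that is, they must not begin in column 1. Consider the following valid makefile: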

c.exe:
  ml64 /c print.asm
  ml64 /c getTitle.asm
  ml64 /c listing15-4.asm
  cl /EHa c.cpp print.obj getTitle.obj listing15-4.obj

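If these commands appear in a file named makefile and you execute nmake, then nmake executes them exactly as the command line interpreter would have had they appeared in a batch file.

A target item is some sort of identifier or filename. Consider the following makefile: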

executable:
  ml64 /c listing15-4.asm
  cl /EHa c.cpp print.obj getTitle.obj listing15-4.obj

library:
  ml64 /c print.asm
  ml64 /c getTitle.asm

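This separates the build commands into two groups: one specified by the executable label and one specified by the library label.

If you run nmake without any command line options, nmake executes only the commands associated with the very first target in the makefile. In this example, running nmake by itself assembles listing15-4.asm, compiles c.cpp, and attempts to link the resulting c.obj with print.obj, getTitle.obj, and listing15-4.obj. This produces the c.exe executable, provided that print.obj and getTitle.obj already exist (for example, from an earlier nmake library command).

To process the commands after the library target, specify the target name as an nmake command line argument: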

nmake library

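This nmake command assembles print.asm and getTitle.asm. So if you execute this command once (and never change print.asm or getTitle.asm thereafter), you need only execute the nmake command by itself to generate the executable file (or use nmake executable if you want to state explicitly that you are building the executable).

15.7.2 Make Dependencies

Although the ability to specify on the command line which targets you want to build is very useful, as your projects grow larger (with many source files and library modules), keeping track of which source files need to be rebuilt becomes burdensome and error-prone. If you're not careful, you'll forget to reassemble an obscure library module after changing it and wonder why the application is still failing. The make dependencies option allows you to automate the build process to help avoid these problems.

In a makefile, a list of one or more (whitespace-separated) dependencies can follow a target:

`target`: `dependency1` `dependency2` `dependency3` ...

A dependency is either a target name (of a target appearing in that makefile) or a filename. If a dependency is a target name (rather than a filename), nmake executes the commands associated with that target. Consider the following makefile: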

executable:
  ml64 /c listing15-4.asm
  cl /EHa c.cpp print.obj getTitle.obj listing15-4.obj

library:
  ml64 /c print.asm
  ml64 /c getTitle.asm

all: library executable

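The all target depends on the library and executable targets, so nmake executes the commands associated with those targets (and in the order library, executable, which is important because the library object files must be built before the associated object modules can be linked into the executable program). The all identifier is a common target in makefiles; indeed, it is often the first or second target to appear in a makefile.

If a `target`: `dependencies` line becomes too long to be readable (nmake itself doesn't particularly care about line length), you can break it into multiple lines by putting a backslash character (\) as the last character on a line. The nmake program joins a source line that ends with a backslash with the next line in the makefile.

Target names and dependencies can also be filenames. Specifying a filename as a target name is generally done to tell the make system how to build that particular file. For example, we could rewrite the current example as follows: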

executable:
  ml64 /c listing15-4.asm
  cl /EHa c.cpp print.obj getTitle.obj listing15-4.obj

library: print.obj getTitle.obj

print.obj:
  ml64 /c print.asm

getTitle.obj:
  ml64 /c getTitle.asm

all: library executable

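When dependencies are associated with a target that is a filename, you can read the `target`: `dependencies` statement as "target depends on dependencies." When processing a make command, nmake compares the modification date and time stamps of the files specified as target filenames and dependency filenames.

If the date and time of the target is older than that of any of the dependencies (or the target file does not exist), nmake executes the commands after the target. If the target file's modification date and time is later (newer) than that of all the dependent files, nmake does not execute the commands. If one of the dependencies after a target is itself a target elsewhere, nmake first processes that target (to see whether doing so modifies the target object, changing its modification date and time and possibly causing nmake to execute the current target's commands). If a target or dependency is just a label (not a filename), nmake treats its modification date and time as older than any file's.

Consider the following modification to the running makefile example: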

c.exe: print.obj getTitle.obj listing15-4.obj
  cl /EHa c.cpp print.obj getTitle.obj listing15-4.obj

listing15-4.obj: listing15-4.asm
  ml64 /c listing15-4.asm

print.obj: print.asm
  ml64 /c print.asm

getTitle.obj: getTitle.asm
  ml64 /c getTitle.asm

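Note that the all and library targets have been removed (they turn out to be unnecessary) and that executable has been changed to c.exe (the final target executable file).

Consider the c.exe target. Because print.obj, getTitle.obj, and listing15-4.obj are all targets (as well as filenames), nmake first processes those targets. After doing so, nmake compares the modification date and time of c.exe against those of the three object files. If c.exe is older than any of those object files, nmake executes the command following the c.exe target line (to compile c.cpp and link it with the object files). If c.exe is newer than its dependent object files, nmake does not execute the command.

The same process happens recursively for each dependent object file. While processing the c.exe target, nmake processes the print.obj, getTitle.obj, and listing15-4.obj targets (in that order). In each case, nmake compares the modification date and time of the .obj file with that of the corresponding .asm file. If the .obj file is newer than the .asm file, nmake returns to processing the c.exe target without doing anything; if the .obj file is older than the .asm file (or doesn't exist), nmake executes the corresponding ml64 command to generate a new .obj file.

If c.exe is newer than all the .obj files (and they are all newer than the .asm files), executing nmake does nothing (well, it reports that c.exe is up to date, but it does not process any of the commands in the makefile). If any of the files are out of date (because they've been modified), this makefile assembles, compiles, and links only the files necessary to bring c.exe up to date.

The makefiles thus far are missing an important dependency: all of the .asm files include the aoalib.inc file. A change to aoalib.inc could require reassembly of these .asm files. This dependency has been added to Listing 15-5. The listing also demonstrates how to include comments in a makefile by using the # character at the beginning of a line.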

# listing15-5.mak

# makefile for Listing 15-4.

listing15-4.exe:print.obj getTitle.obj listing15-4.obj
    cl /nologo /O2 /Zi /utf-8 /EHa /Felisting15-4.exe c.cpp \
            print.obj getTitle.obj listing15-4.obj

listing15-4.obj: listing15-4.asm aoalib.inc
  ml64 /nologo /c listing15-4.asm

print.obj: print.asm aoalib.inc
  ml64 /nologo /c print.asm

getTitle.obj: getTitle.asm aoalib.inc
  ml64 /nologo /c getTitle.asm

Listing 15-5: Makefile to build Listing 15-4

Here is the nmake command that builds the program in Listing 15-4 by using the makefile in Listing 15-5:

C:\>nmake /f listing15-5.mak

Microsoft (R) Program Maintenance Utility Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

        ml64 /nologo /c print.asm
 Assembling: print.asm
        ml64 /nologo /c getTitle.asm
 Assembling: getTitle.asm
        ml64 /nologo /c listing15-4.asm
 Assembling: listing15-4.asm
        cl /nologo /O2 /Zi /utf-8 /EHa /Felisting15-4.exe c.cpp  print.obj getTitle.obj listing15-4.obj
c.cpp

C:\>listing15-4
Calling Listing 15-4:
Assembly units linked
Listing 15-4 terminated

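15.7.3 Make Clean and Touch

One common target you will find in most professionally made makefiles is clean. The clean target deletes an appropriate set of files to force the entire system to be rebuilt the next time you execute the makefile. This target commonly deletes all the .obj and .exe files associated with the project. Listing 15-6 adds a clean target to the makefile from Listing 15-5.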

# listing15-6.mak

# makefile for Listing 15-4.

listing15-4.exe:print.obj getTitle.obj listing15-4.obj
    cl /nologo /O2 /Zi /utf-8 /EHa /Felisting15-4.exe c.cpp \
            print.obj getTitle.obj listing15-4.obj

listing15-4.obj: listing15-4.asm aoalib.inc
    ml64 /nologo /c listing15-4.asm

print.obj: print.asm aoalib.inc
    ml64 /nologo /c print.asm

getTitle.obj: getTitle.asm aoalib.inc
    ml64 /nologo /c getTitle.asm

clean:
    del getTitle.obj
    del print.obj
    del listing15-4.obj
    del c.obj
    del listing15-4.ilk
    del listing15-4.pdb
    del vc140.pdb
    del listing15-4.exe

# Alternative clean (if you like living dangerously):

# clean:
#   del *.obj
#   del *.ilk
#   del *.pdb
#   del *.exe

Listing 15-6: A clean target example

Here is a sample clean and rebuild operation:

C:\>nmake /f listing15-6.mak clean

Microsoft (R) Program Maintenance Utility Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

        del getTitle.obj
        del print.obj
        del listing15-4.obj
        del c.obj
        del listing15-4.ilk
        del listing15-4.pdb
        del listing15-4.exe

C:\>nmake /f listing15-6.mak

Microsoft (R) Program Maintenance Utility Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

        ml64 /nologo /c print.asm
 Assembling: print.asm
        ml64 /nologo /c getTitle.asm
 Assembling: getTitle.asm
        ml64 /nologo /c listing15-4.asm
 Assembling: listing15-4.asm
        cl /nologo /O2 /Zi /utf-8 /EHa /Felisting15-4.exe c.cpp
           print.obj getTitle.obj listing15-4.obj
c.cpp

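If you want to force the reassembly of a single file (without manually editing and modifying it), a Unix utility comes in handy: touch. The touch program accepts a filename as its argument and updates the file's modification date and time (without otherwise modifying the file). For example, after building Listing 15-4 by using the makefile in Listing 15-6, were you to execute the command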

touch listing15-4.asm

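and then execute the makefile in Listing 15-6 again, it would reassemble the code in Listing 15-4, recompile c.cpp, and produce a new executable.

Unfortunately, while touch is a standard Unix application that ships with every Unix and Linux distribution, it is not a standard Windows application.^(5) Fortunately, you can easily find a version of touch for Windows on the internet. It is also a relatively simple program to write yourself.

15.8 The Microsoft Linker and Library Code

Many common projects reuse code that developers created long ago (or code that came from a source outside the developer's organization). These libraries of code are relatively static: they rarely change during the development of a project that uses them. In particular, you would not normally incorporate the building of the libraries into a given project's makefile. A specific project might list the library files as dependencies in its makefile, but the assumption is that the library files are built elsewhere and supplied as a whole to the project. Beyond that, one major difference exists between a library and a set of object code files: packaging.

Dealing with a large number of separate object files becomes troublesome, particularly with a true set of library object files. A library may contain tens, hundreds, or even thousands of object files. Listing all of them (or even just the ones a project uses) is a lot of work and can lead to consistency errors. The common way to deal with this problem is to combine the various object files into a separate package (file) known as a library file. Under Windows, library files typically have a .lib suffix.

For many projects, you will be given a library (.lib) file that packages together a specific library module. You supply this file to the linker when building your program, and the linker automatically picks out the object modules it needs from the library. This is an important point: including a library while building an executable does not automatically insert all of the code from that library into the executable. The linker is smart enough to extract only the object files it needs and to ignore the object files it does not use (remember, a library is just a package containing a bunch of object files).

So the question is, "How do you create a library file?" The short answer is, "By using the Microsoft Library Manager program (lib.exe)." The basic syntax for the lib program is

lib /out:`libname.lib` `list_of_.obj_files`

where libname.lib is the name of the library file you want to produce, and list_of_.obj_files is a (space-separated) list of the object files you want to collect into the library. For example, if you want to combine the print.obj and getTitle.obj files into a library module (aoalib.lib), here's the command to do it: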

lib /out:aoalib.lib getTitle.obj print.obj

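Once you have a library module, you can specify it on a linker (or ml64 or cl) command line just as you would an object file. For example, to link in the aoalib.lib module with the program in Listing 15-4, you could use the following command: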

cl /EHa /Felisting15-4.exe c.cpp listing15-4.obj aoalib.lib

The lib program supports several command line options. You can get a list of them by using the following command:

lib /?

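See the online Microsoft documentation for a description of the various options. Perhaps the most useful of them is

lib /list `lib_filename.lib`

where lib_filename.lib represents a library filename. This prints a list of the object files contained within that library module. For example, lib /list aoalib.lib produces the following output:

C:\>lib /list aoalib.lib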
Microsoft (R) Library Manager Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

getTitle.obj
print.obj

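MASM provides a special directive, includelib, that lets you specify libraries to include. The syntax for this directive is

includelib `lib_filename.lib`

where lib_filename.lib is the name of the library file you want to include. This directive embeds a command in the object file that MASM produces, passing this library filename along to the linker. The linker then automatically loads the library file when it processes the object module containing the includelib directive.

This activity is identical to manually specifying the library filename to the linker (from the command line). Whether you prefer to put the includelib directive in a MASM source file or to include the library name on the linker (or ml64/cl) command line is up to you. In my experience, most assembly language programmers (especially when writing stand-alone assembly language programs) prefer the includelib directive.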
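As a sketch of the includelib approach, the main program from Listing 15-4 could name the library itself. This is not one of the book's listings; it assumes aoalib.lib was built with the lib command shown earlier and sits where the linker can find it (for example, in the current directory).

```masm
; Variant of Listing 15-4 that requests aoalib.lib from
; within the source file.
;
; Build without naming the library on the command line:
;
;   ml64 /c listing15-4.asm
;   cl /EHa c.cpp listing15-4.obj

            include     aoalib.inc
            includelib  aoalib.lib  ; Linker pulls print.obj and
                                    ; getTitle.obj from here
            .data
ttlStr      byte    "Listing 15-4", 0

            .code
            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 48         ; Shadow storage, 16-byte aligned

            call    print
            byte    "Library linked via includelib", nl, 0

            leave
            ret
asmMain     endp
            end
```

The trade-off is one of visibility: the dependency on aoalib.lib now travels with the source file rather than living in the build script.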

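15.9 Object File and Library Impact on Program Size

The basic unit of linkage in a program is the object file. When combining object files to form an executable, the Microsoft linker takes all of the data from a single object file and merges it into the final executable. This is true even if the main program does not call all the functions (directly or indirectly) in the object module or use all the data in that object file. So, if you put 100 routines in a single assembly language source file and assemble them into one object module, the linker includes the code for all 100 routines in your final executable even if you use only one of them.

If you want to avoid this situation, you should break those 100 routines into 100 separate object modules and combine the resulting 100 object files into a single library. When the Microsoft linker processes that library file, it picks out the single object file containing the function the program uses and incorporates only that object file into the final executable. Generally, this is far more efficient than linking in a single object file with 100 functions buried in it.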
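A sketch of the one-routine-per-object-file layout follows. The routine names (addOne, addTwo), the filenames, and the library name mylib.lib are hypothetical. Each file is assembled on its own and the resulting objects are combined with lib, so a program that calls only addOne should cause the linker to pull in addone.obj alone.

```masm
; Build steps (the same tools this chapter has already used):
;
;   ml64 /c addone.asm
;   ml64 /c addtwo.asm
;   lib /out:mylib.lib addone.obj addtwo.obj

; ---- addone.asm (hypothetical) ----

            .code
            public  addOne
addOne      proc
            lea     eax, [rcx + 1]  ; Return argument + 1
            ret
addOne      endp
            end

; ---- addtwo.asm (hypothetical) ----

            .code
            public  addTwo
addTwo      proc
            lea     eax, [rcx + 2]  ; Return argument + 2
            ret
addTwo      endp
            end
```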

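The key word in that last sentence is generally. In fact, there are some good reasons for combining multiple functions into a single object file. First of all, consider what happens when the linker merges an object file into an executable. To ensure proper alignment, whenever the linker takes a section or segment (for example, the .code section) from an object file, it adds sufficient padding so that the data in that section is aligned on that section's specified alignment boundary. Most sections have a default alignment of 16 bytes, so the linker aligns each section from each object file it links on a 16-byte boundary. Normally, this isn't too bad, especially if your procedures are large. However, suppose those 100 procedures you've created are all really short (a few bytes each). Then you wind up wasting a lot of space.

Granted, on modern machines, a few hundred bytes of wasted space isn't going to amount to much. However, it might be more practical to combine several of these procedures into a single object module (even if you don't call all of them) to fill in some of the wasted space. Don't go overboard, though; once you've gone beyond the alignment boundary, whether you're wasting space because of padding or because you're including code that never gets called, you're still wasting space.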

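15.10 For More Information

Although it is an older book covering MASM version 6, The Waite Group's Microsoft Macro Assembler Bible by Nabajyoti Barkakati and this book's author (Sams, 1992) discusses MASM's external directives (extern, externdef, and public) and include files in detail.

You can also find the MASM 6 manual (the last published edition) online.

For more information about makefiles, see the following resources:

Wikipedia: en.wikipedia.org/wiki/Make_(software)
Managing Projects with GNU Make, Third Edition, by Robert Mecklenburg (O'Reilly Media, 2004)
The GNU Make Book, by John Graham-Cumming (No Starch Press, 2015)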

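15.11 Test Yourself

What statement(s) would you use to prevent recursive include files?
What is an assembly unit?
What directive would you use to tell MASM that a symbol is global and visible outside the current source file?
What directive(s) would you use to tell MASM to use a global symbol from another object module?
Which directive prevents duplicate symbol errors when an external symbol is defined within an assembly source file?
What external data type declaration would you use to access an external constant symbol?
What external data type declaration would you use to access an external procedure?
What is the name of Microsoft's make program?
What is the basic makefile syntax?
What is a makefile-dependent file?
What does a makefile clean command typically do?
What is a library file?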

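Chapter 16: Stand-Alone Assembly Language Programs

Until now, this book has relied upon a C/C++ main program to call the example code written in assembly language. Although this is probably the biggest use of assembly language in the real world, it is also possible to write stand-alone code (no C/C++ main program) in assembly language.

In the context of this chapter, a stand-alone assembly language program is an executable program you write in assembly that does not link directly into a C/C++ program for execution. Without a C/C++ main program calling your assembly code, you are not dragging along the C/C++ library code and runtime system, so your programs can be smaller, and you won't have external naming conflicts with C/C++ public names. However, you must do much of the work that the C/C++ library does yourself, either by writing comparable assembly code or by calling the Win32 API.

The Win32 API is a bare-metal interface to the Windows operating system that provides thousands of functions you can call from a stand-alone assembly language program, far too many to consider in this chapter. This chapter gives you a basic introduction to Win32 applications (and, in particular, console-based applications). This information will get you started writing stand-alone assembly language programs under Windows.

To use the Win32 API from your assembly programs, you need to download the MASM32 library package from www.masm32.com/.^(1) Most of the examples in this chapter assume that the MASM32 64-bit include files are available on your system in the C:\masm32 subdirectory.

16.1 Hello World, by Itself

Before showing you some of the wonders of Windows stand-alone assembly language programming, perhaps the best place to start is at the beginning: with a stand-alone "Hello, world!" program (Listing 16-1).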

; Listing 16-1.asm

; A stand-alone assembly language version of 
; the ubiquitous "Hello, world!" program.

; Link in the Windows Win32 API:

            includelib kernel32.lib

; Here are the two Windows functions we will need
; to send "Hello, world!" to the standard console device:

            extrn __imp_GetStdHandle:proc
            extrn __imp_WriteFile:proc

            .code
hwStr       byte    "Hello World!"
hwLen       =       $-hwStr

; This is the honest-to-goodness assembly language
; main program:

main        proc

; On entry, stack is aligned at 8 mod 16. Setting aside
; 8 bytes for "bytesWritten" ensures that calls in main have
; their stack aligned to 16 bytes (8 mod 16 inside function),
; as required by the Windows API (which __imp_GetStdHandle and
; __imp_WriteFile use. They are written in C/C++).

            lea     rbx, hwStr
            sub     rsp, 8
            mov     rdi, rsp      ; Hold # of bytes written here

; Note: must set aside 32 bytes (20h) for shadow registers for
; parameters (just do this once for all functions). 
; Also, WriteFile has a 5th argument (which is NULL), 
; so we must set aside 8 bytes to hold that pointer (and
; initialize it to zero). Finally, stack must always be 
; 16-byte-aligned, so reserve another 8 bytes of storage
; to ensure this.

            sub     rsp, 030h  ; Shadow storage for args

; Handle = GetStdHandle(-11);
; Single argument passed in ECX.
; Handle returned in RAX.

            mov     rcx, -11                     ; STD_OUTPUT
            call    qword ptr __imp_GetStdHandle ; Returns handle
                                                 ; in RAX

; WriteFile(handle, "Hello World!", 12, &bytesWritten, NULL);
; Zero out (set to NULL) "lpOverlapped" argument:

            xor     rcx, rcx
            mov     [rsp + 4 * 8], rcx

            mov     r9, rdi    ; Address of "bytesWritten" in R9
            mov     r8d, hwLen ; Length of string to write in R8D
            lea     rdx, hwStr ; Ptr to string data in RDX
            mov     rcx, rax   ; File handle passed in RCX
            call    qword ptr __imp_WriteFile

; Clean up stack and return:

            add     rsp, 38h
            ret
main        endp
            end

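Listing 16-1: Stand-alone "Hello, world!" program

The __imp_GetStdHandle and __imp_WriteFile procedures are functions inside Windows (they are part of the so-called Win32 API, even though this is 64-bit code that is executing). The __imp_GetStdHandle procedure, when passed the (admittedly magic) number -11 as an argument, returns a handle to the standard output device. With this handle, calls to __imp_WriteFile send output to the standard output device (the console). To build and run this program, use the following command: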

ml64 listing16-1.asm /link /subsystem:console /entry:main

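MASM's /link command line option tells it that the following options (to the end of the line) are to be passed on to the linker. The /subsystem:console (linker) command line option tells the linker that this program is a console application (that is, it will run in a command line window). The /entry:main linker option passes the name of the main program to the linker. The linker stores this address in a special location in the executable file so that Windows can determine the starting address of the main program after it loads the executable into memory.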
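The same build command works for any stand-alone program whose entry point is named main. As a minimal sketch (not one of the book's listings), the following program does nothing but terminate with an exit code by calling the Win32 ExitProcess function through its import pointer, following the same __imp_ calling pattern as Listing 16-1. The filename exitdemo.asm is hypothetical.

```masm
; exitdemo.asm (hypothetical)
;
; Build:  ml64 exitdemo.asm /link /subsystem:console /entry:main

            includelib kernel32.lib

            extrn __imp_ExitProcess:proc

            .code
main        proc

; On entry, RSP is at 8 mod 16. Reserving 28h bytes (32 bytes
; of shadow storage plus 8 for alignment) leaves RSP 16-byte
; aligned at the call.

            sub     rsp, 28h
            mov     ecx, 3          ; Process exit code
            call    qword ptr __imp_ExitProcess

; ExitProcess does not return, so no cleanup code follows.

main        endp
            end
```

After running the program, typing echo %ERRORLEVEL% at the command prompt should display the exit code passed in ECX.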

16.2 Header Files and the Windows Interface

Near the beginning of the "Hello, world!" example in Listing 16-1, you'll notice the following lines:

includelib kernel32.lib

; Here are the two Windows functions we will need
; to send "Hello, world!" to the standard console device:

extrn __imp_GetStdHandle:proc
extrn __imp_WriteFile:proc

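The kernel32.lib library file contains the object module definitions for many of the Win32 API functions, including the __imp_GetStdHandle and __imp_WriteFile procedures. Inserting extrn directives for all the Win32 API functions into your assembly language programs would be an incredible amount of work. The proper way to deal with these function definitions is to place them in a header (include) file and then include that file in every application you write that uses Win32 API functions.

The bad news is that creating an appropriate set of header files is a gargantuan task. The good news is that somebody else has already done all that work for you: the MASM32 headers. Listing 16-2 is a rework of Listing 16-1 that uses the MASM32 64-bit include files to obtain the Win32 external declarations. Note that we incorporate MASM32 via an include file, listing16-2.inc, rather than using it directly. This will be explained shortly.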

; Listing 16-2

            include    listing16-2.inc
            includelib kernel32.lib               ; File I/O library

; Include just the files we need from masm64rt.inc:

;           include \masm32\include64\masm64rt.inc
;           OPTION DOTNAME                        ; Required for macro files
;           option casemap:none                   ; Case sensitive
;           include \masm32\include64\win64.inc
;           include \masm32\macros64\macros64.inc
;           include \masm32\include64\kernel32.inc

            .data
bytesWrtn   qword   ?
hwStr       byte    "Listing 16-2", 0ah, "Hello, World!", 0
hwLen       =       sizeof hwStr

            .code

; **********************************************************

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rdi
            push    r15
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56            ; Shadow storage
            and     rsp, -16

            mov     rcx, -11           ; STD_OUTPUT
            call    __imp_GetStdHandle ; Returns handle

            xor     rcx, rcx
            mov     bytesWrtn, rcx
            mov     [rsp + 4 * 8], rcx ; lpOverlapped = NULL (5th argument)

            lea     r9, bytesWrtn      ; Address of "bytesWritten" in R9
            mov     r8d, hwLen         ; Length of string to write in R8D 
            lea     rdx, hwStr         ; Ptr to string data in RDX
            mov     rcx, rax           ; File handle passed in RCX
            call    __imp_WriteFile

allDone:    leave
            pop     r15
            pop     rdi
            pop     rsi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Here is the listing16-2.inc include file:

; listing16-2.inc

; Header file entries extracted from MASM32 header
; files (placed here rather than including the 
; full MASM32 headers to avoid namespace pollution
; and speed up assemblies).

PPROC           TYPEDEF PTR PROC        ; For include file prototypes

externdef __imp_GetStdHandle:PPROC
externdef __imp_WriteFile:PPROC

Listing 16-2: Using the MASM32 64-bit include files

Here are the build command and sample output:

C:\>ml64 /nologo listing16-2.asm kernel32.lib /link /nologo /subsystem:console /entry:asmMain
 Assembling: listing16-2.asm

C:\>listing16-2
Listing 16-2
Hello, World!

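The MASM32 include file

include \masm32\include64\masm64rt.inc

pulls in the hundreds of other include files that make up the MASM32 64-bit system. Adding this one include directive to your programs gives your application access to a vast number of Win32 API functions, data declarations, and other resources (such as the MASM32 macros).

However, your computer will pause for a moment when you assemble the source file, because that single include directive pulls thousands of lines of code into the program during assembly. If you know which header file contains the declarations you actually need, you can speed up assembly by including just the necessary files (as the commented-out include directives in listing16-2.asm illustrate).

Including masm64rt.inc in your programs has one other problem: namespace pollution. The MASM32 include files introduce thousands of symbols into your program, so there is a chance that a symbol you want to use has already been defined in the MASM32 include files (and possibly for a purpose different from the one you have in mind). If you have a file grep utility (a program that searches through the files in a directory, recursing into subdirectories, for a particular string), you can easily locate every occurrence of the symbol you want to use and copy that symbol's definition into your own source file (or, better yet, into a header file you create specifically for this purpose). This chapter uses this approach for many of its example programs.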

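16.3 The Win32 API and the Windows ABI

The Win32 API functions all adhere to the Windows ABI calling convention. This means that calls to these functions can modify any of the volatile registers (RAX, RCX, RDX, R8, R9, R10, R11, and XMM0 through XMM5) but must preserve the nonvolatile registers (all the others not listed here). Also, API calls pass their first four arguments in RCX, RDX, R8, and R9 (or XMM0 through XMM3 for floating-point arguments) and any remaining arguments on the stack; the stack must be 16-byte aligned before the API call. See the discussion of the Windows ABI earlier in this book for more details.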
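The alignment requirement comes down to a little arithmetic. The following C sketch is an illustration only (the function name align_frame is hypothetical and not part of the book's code); it models the sub rsp, N / and rsp, -16 sequence that this chapter's listings use, applied to a simulated stack pointer value:

```c
#include <stdint.h>

/* Model of the prologue sequence
 *     sub rsp, reserve
 *     and rsp, -16
 * applied to a simulated stack pointer. The result is always a
 * multiple of 16 and lies at least `reserve` bytes below the old value. */
uint64_t align_frame(uint64_t rsp, uint64_t reserve)
{
    rsp -= reserve;           /* sub rsp, reserve */
    rsp &= ~(uint64_t)15;     /* and rsp, -16     */
    return rsp;
}
```

Because the and instruction can only lower RSP, the code never ends up with less storage than it asked for; it may simply reserve up to 15 extra bytes.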

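16.4 Building a Stand-Alone Console Application

Take a look at the (simplified) build command from the previous section:

ml64 listing16-2.asm /link /subsystem:console /entry:asmMain

The /subsystem:console option tells the linker that, in addition to any GUI windows the application might create, the system must also create a special window in which the application displays console information. If you run the program from a Windows command line, it uses the console window already opened by cmd.exe.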

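16.5 Building a Stand-Alone GUI Application

To create a pure Windows GUI application that does not also open a console window, specify /subsystem:windows rather than /subsystem:console. The dialog box application in Listing 16-3 is about as simple as a Windows application gets: it displays a dialog box and then quits when the user clicks the OK button in that dialog box.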

; Listing 16-3

; Dialog box demonstration.

            include    listing16-3.inc
            includelib user32.lib

          ; include \masm32\include64\masm64rt.inc

            .data

msg         byte    "Dialog Box Demonstration",0
DBTitle     byte    "Dialog Box Title", 0

            .code

; **********************************************************

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbp
            mov     rbp, rsp
            sub     rsp, 56         ; Shadow storage
            and     rsp, -16

            xor     rcx, rcx        ; HWin = NULL
            lea     rdx, msg        ; Message to display
            lea     r8, DBTitle     ; Dialog box title
            mov     r9d, MB_OK      ; Has an "OK" button
            call    MessageBox

allDone:    leave
            ret     ; Returns to caller
asmMain     endp
            end

Listing 16-3: A simple dialog box application

Here is the listing16-3.inc include file:

; listing16-3.inc

; Header file entries extracted from MASM32 header
; files (placed here rather than including the 
; full MASM32 headers to avoid namespace pollution
; and speed up assemblies).

PPROC           TYPEDEF PTR PROC        ; For include file prototypes

MB_OK                                equ 0h

externdef __imp_MessageBoxA:PPROC
MessageBox equ <__imp_MessageBoxA>

Here is the build command for the program in Listing 16-3:

C:\>ml64 listing16-3.asm /link /subsystem:windows /entry:asmMain

Figure 16-1 shows the runtime output of Listing 16-3.

Figure 16-1: Sample dialog box output

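16.6 A Brief Look at the MessageBox Windows API Function

Although creating GUI applications in assembly language is beyond the scope of this book, the MessageBox function is useful enough (even in console applications) to deserve a special mention.

The MessageBox function has four parameters:

RCX Window handle. This is usually NULL (0), meaning the message box is a stand-alone dialog box not associated with any particular window.
RDX Message pointer. RDX contains a pointer to a zero-terminated string that is displayed in the body of the message box.
R8 Window title. R8 contains a pointer to a zero-terminated string that is displayed in the title bar of the message box window.
R9D Message box type. This is an integer value that specifies the buttons and icons that appear in the message box. Typical values are MB_OK, MB_OKCANCEL, MB_ABORTRETRYIGNORE, MB_YESNOCANCEL, MB_YESNO, and MB_RETRYCANCEL.

The MessageBox function returns an integer value in RAX (strictly speaking, in EAX) that identifies the button the user pressed. For a message box created with MB_OK, that value is IDOK (1) when the user clicks the OK button; a return value of 0 indicates that the call failed.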
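To make the relationship between the type values and the return values concrete, here is a small C sketch. The numeric values are those defined in the Windows SDK header winuser.h; the helper button_name is hypothetical and exists only for illustration. The block deliberately avoids windows.h so that it compiles anywhere:

```c
#include <stddef.h>

/* Button-set styles passed in R9D (values from winuser.h). */
enum { MB_OK = 0, MB_OKCANCEL = 1, MB_ABORTRETRYIGNORE = 2,
       MB_YESNOCANCEL = 3, MB_YESNO = 4, MB_RETRYCANCEL = 5 };

/* Return codes identifying the button the user pressed. */
enum { IDOK = 1, IDCANCEL = 2, IDABORT = 3, IDRETRY = 4,
       IDIGNORE = 5, IDYES = 6, IDNO = 7 };

/* Hypothetical helper: maps a MessageBox return value to a printable
 * name, or NULL for 0 (call failed) and for values not listed here. */
const char *button_name(int id)
{
    static const char *names[] = { NULL, "OK", "Cancel", "Abort",
                                   "Retry", "Ignore", "Yes", "No" };
    if (id < IDOK || id > IDNO)
        return NULL;
    return names[id];
}
```

Note that the style constant MB_OK (0) and the return code IDOK (1) are different numbers, so assembly code should compare EAX against the ID constants rather than the MB constants.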

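16.7 Windows File I/O

One thing missing from most of the example code in this book is a discussion of file I/O. Although you can easily make C Standard Library calls to open, read, write, and close files, it seems appropriate to fill in this missing detail by using file I/O as the example in this chapter.

The Win32 API provides many useful functions for file I/O, that is, for reading and writing file data. This section describes a small number of them:

CreateFileA A function that (despite its name) you use to open existing files or create new ones.
WriteFile A function that writes data to a file.
ReadFile A function that reads data from a file.
CloseHandle A function that closes a file and flushes any cached data to the storage device.
GetStdHandle A function, which you've already seen, that returns the handle of one of the standard devices (standard input, standard output, or standard error).
GetLastError A function you can call to retrieve a Windows error code if an error occurs while executing any of these functions.

Listing 16-4 demonstrates the use of these functions and creates some useful procedures that call them. Note that this code is rather long, so I've broken it into smaller chunks, with an explanation preceding each section.

The Win32 file I/O functions are all part of the kernel32.lib library module. Therefore, Listing 16-4 uses the includelib kernel32.lib statement to link in this library automatically during the build phase. To speed up assembly and reduce namespace pollution, this program does not pull in all the MASM32 equate files (via an include \masm32\include64\masm64rt.inc statement). Instead, I've collected all the necessary equates and other definitions from the MASM32 header files and placed them in the listing16-4.inc header file (which appears later in this chapter). Finally, the program also includes the aoalib.inc header file, just to use a few of the constants defined in that file (such as cr and nl):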

; Listing 16-4 

; File I/O demonstration.

            include    listing16-4.inc
            include    aoalib.inc   ; To get some constants
            includelib kernel32.lib ; File I/O library

            .const
prompt      byte    "Enter (text) filename:", 0
badOpenMsg  byte    "Could not open file", cr, nl, 0

            .data

inHandle    dword   ?
inputLn     byte    256 dup (0)

fileBuffer  byte    4096 dup (0)

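The following code builds wrapper code around each of the file I/O functions to preserve the volatile register values. These functions use the following text equates and macro definitions to save and restore the register values:

            .code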

rcxSave     textequ <[rbp - 8]>
rdxSave     textequ <[rbp - 16]>
r8Save      textequ <[rbp - 24]>
r9Save      textequ <[rbp - 32]>
r10Save     textequ <[rbp - 40]>
r11Save     textequ <[rbp - 48]>
xmm0Save    textequ <[rbp - 64]>
xmm1Save    textequ <[rbp - 80]>
xmm2Save    textequ <[rbp - 96]>
xmm3Save    textequ <[rbp - 112]>
xmm4Save    textequ <[rbp - 128]>
xmm5Save    textequ <[rbp - 144]>
var1        textequ <[rbp - 160]>

mkActRec    macro
            push    rbp
            mov     rbp, rsp
            sub     rsp, 256        ; Includes shadow storage
            and     rsp, -16        ; Align to 16 bytes
            mov     rcxSave, rcx
            mov     rdxSave, rdx
            mov     r8Save, r8
            mov     r9Save, r9
            mov     r10Save, r10
            mov     r11Save, r11
            movdqu  xmm0Save, xmm0
            movdqu  xmm1Save, xmm1
            movdqu  xmm2Save, xmm2
            movdqu  xmm3Save, xmm3
            movdqu  xmm4Save, xmm4
            movdqu  xmm5Save, xmm5
            endm

rstrActRec  macro
            mov     rcx, rcxSave
            mov     rdx, rdxSave
            mov     r8, r8Save 
            mov     r9, r9Save 
            mov     r10, r10Save
            mov     r11, r11Save
            movdqu  xmm0, xmm0Save
            movdqu  xmm1, xmm1Save
            movdqu  xmm2, xmm2Save
            movdqu  xmm3, xmm3Save
            movdqu  xmm4, xmm4Save
            movdqu  xmm5, xmm5Save
            leave
            endm

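The first function appearing in Listing 16-4 is getStdOutHandle. This is a wrapper around __imp_GetStdHandle that preserves the volatile registers and explicitly requests the standard output device handle. The function returns the standard output device handle in the RAX register. Following getStdOutHandle are comparable functions that retrieve the standard error handle and the standard input handle: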

; getStdOutHandle - Returns stdout handle in RAX:

getStdOutHandle proc
                mkActRec
                mov     rcx, STD_OUTPUT_HANDLE
                call    __imp_GetStdHandle  ; Returns handle
                rstrActRec
                ret
getStdOutHandle endp

; getStdErrHandle - Returns stderr handle in RAX:

getStdErrHandle proc
                mkActRec
                mov     rcx, STD_ERROR_HANDLE
                call    __imp_GetStdHandle  ; Returns handle
                rstrActRec
                ret
getStdErrHandle endp

; getStdInHandle - Returns stdin handle in RAX:

getStdInHandle proc
               mkActRec
               mov     rcx, STD_INPUT_HANDLE
               call    __imp_GetStdHandle   ; Returns handle
               rstrActRec
               ret
getStdInHandle endp

Now consider the wrapper code for the write function:

; write - Write data to a file handle.

; RAX - File handle.
; RSI - Pointer to buffer to write.
; RCX - Length of buffer to write.

; Returns:

; RAX - Number of bytes actually written
;       or -1 if there was an error.

write       proc
            mkActRec

            mov     rdx, rsi        ; Buffer address
            mov     r8, rcx         ; Buffer length
            lea     r9, var1        ; bytesWritten
            mov     rcx, rax        ; Handle
            xor     r10, r10        ; lpOverlapped is passed
            mov     [rsp+4*8], r10  ; on the stack
            call    __imp_WriteFile
            test    eax, eax        ; WriteFile returns a 32-bit BOOL in EAX
            mov     eax, var1       ; bytesWritten is a DWORD (zero-extends into RAX)
            jnz     rtnBytsWrtn     ; If RAX was not zero
            mov     rax, -1         ; Return error status

rtnBytsWrtn:
            rstrActRec
            ret
write       endp

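The write function writes data from a memory buffer to the output file specified by a file handle (which can also be the standard output or standard error handle, if you want to write the data to the console). The write function expects the following parameter data:

RAX File handle specifying the write destination. This is typically a handle obtained from the open or openNew functions (appearing a little later in the program) or from the getStdOutHandle and getStdErrHandle functions.
RSI Address of the buffer containing the data to write to the file.
RCX Number of bytes of data to write to the file (from the buffer).

This function does not follow the Windows ABI calling convention. Although there is no official assembly language calling convention, many assembly language programmers tend to use the same registers that the x86-64 string instructions use. For example, the source data (buffer) is passed in RSI (the source index register), and the count (buffer size) parameter appears in the RCX register. The write procedure moves the data into the locations that the call to __imp_WriteFile expects (and sets up the additional parameters).

The __imp_WriteFile function is the actual Win32 API write function (technically, __imp_WriteFile is a pointer to the function; the call instruction is an indirect call through this pointer). __imp_WriteFile has the following arguments:

RCX File handle.
RDX Buffer address.
R8 Buffer size (really just the 32 bits in R8D).
R9 Address of a DWORD variable that receives the number of bytes written to the file; this will equal the buffer size if the write operation succeeds.
[rsp + 32] The lpOverlapped value; set this to NULL (0). As per the Windows ABI, the caller passes all arguments beyond the fourth on the stack, above the 32 bytes of shadow storage reserved for the first four arguments.

On return from __imp_WriteFile, RAX contains a nonzero value (true) if the write succeeded and zero (false) if there was an error. If an error occurred, you can call the Win32 GetLastError function to retrieve the error code.

Note that the write function returns the number of bytes written to the file in the RAX register. If there was an error, write returns -1 in the RAX register.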
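The wrapper's return convention can be summarized in a few lines of C. This is only a model of the logic at the end of the write procedure, not code from the book; the function name write_result and its parameters are hypothetical:

```c
#include <stdint.h>

/* Hypothetical model of the wrapper's return logic.
 * ok            - the BOOL result of WriteFile (zero means failure)
 * bytes_written - the DWORD that WriteFile stored through the R9 pointer */
int64_t write_result(int32_t ok, uint32_t bytes_written)
{
    if (ok == 0)
        return -1;                   /* error: GetLastError has the details */
    return (int64_t)bytes_written;   /* byte count, zero-extended to 64 bits */
}
```

Both inputs are 32-bit quantities, which is why the model widens the byte count explicitly before returning it as a 64-bit value.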

Next come the puts and newLn functions:

; puts - Outputs a zero-terminated string to standard output device.

; RSI - Address of string to print to standard output.

            .data
stdOutHnd   qword   0
hasSOHndl   byte    0

            .code
puts        proc
            push    rax
            push    rcx
            cmp     hasSOHndl, 0
            jne     hasHandle

            call    getStdOutHandle
            mov     stdOutHnd, rax
            mov     hasSOHndl, 1

; Compute the length of the string:

hasHandle:  mov     rcx, -1
lenLp:      inc     rcx
            cmp     byte ptr [rsi][rcx * 1], 0
            jne     lenLp

            mov     rax, stdOutHnd
            call    write

            pop     rcx
            pop     rax
            ret
puts        endp

; newLn - Outputs a newline sequence to the standard output device:

newlnSeq    byte    cr, nl

newLn       proc
            push    rax
            push    rcx
            push    rsi
            cmp     hasSOHndl, 0
            jne     hasHandle

            call    getStdOutHandle
            mov     stdOutHnd, rax
            mov     hasSOHndl, 1

hasHandle:  lea     rsi, newlnSeq
            mov     rcx, 2
            mov     rax, stdOutHnd
            call    write

            pop     rsi
            pop     rcx
            pop     rax
            ret
newLn       endp

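The puts and newLn procedures write strings to the standard output device. The puts function writes a zero-terminated string whose address is passed in the RSI register. The newLn function writes a newline sequence (carriage return and line feed) to the standard output device.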

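These two functions include a small optimization: they call getStdOutHandle only once to obtain the standard output device handle. On the first call to either function, the code calls getStdOutHandle, caches the result (in the stdOutHnd variable), and sets a flag (hasSOHndl) indicating that the cached value is valid. Thereafter, these functions use the cached value rather than repeatedly calling getStdOutHandle to retrieve the standard output device handle.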
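The caching pattern is easy to see in a higher-level sketch. The C code below is an assumption-laden illustration rather than a translation of the listing: fake_get_std_out_handle stands in for getStdOutHandle, the handle value 0x1234 is made up, and the lookups counter exists only so the effect of the cache can be observed:

```c
#include <stdint.h>

static int64_t std_out_hnd;   /* cached handle (stdOutHnd in the listing) */
static int     has_so_hndl;   /* nonzero once the cache is valid          */
static int     lookups;       /* counts trips through the slow path       */

/* Stand-in for getStdOutHandle; the returned value is made up. */
static int64_t fake_get_std_out_handle(void)
{
    ++lookups;
    return 0x1234;
}

/* Returns the handle, fetching it only on the first call. */
int64_t cached_std_out(void)
{
    if (!has_so_hndl) {
        std_out_hnd = fake_get_std_out_handle();
        has_so_hndl = 1;
    }
    return std_out_hnd;
}

int lookup_count(void) { return lookups; }
```

A separate flag is needed because no handle value is reserved to mean "not fetched yet" in this scheme; the flag, not the handle, records whether the cache is valid.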

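The write function requires a buffer length; it does not work with zero-terminated strings. Therefore, the puts function must explicitly determine the length of the zero-terminated string before calling write. The newLn function doesn't have to do this, because it knows the length of the carriage return/line feed sequence (two characters).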
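The length computation in puts is the classic scan for a zero byte. As a rough C rendering of the lenLp loop (the function name zstr_len is hypothetical), the index starts at -1 and is incremented before each test, just as RCX is in the listing:

```c
#include <stddef.h>

/* Mirrors the lenLp loop in puts: start the index at -1,
 * pre-increment it, and stop at the first zero byte. */
size_t zstr_len(const char *s)
{
    size_t i = (size_t)-1;    /* mov rcx, -1 */
    do {
        ++i;                  /* inc rcx */
    } while (s[i] != 0);      /* cmp byte ptr [rsi][rcx * 1], 0 / jne lenLp */
    return i;
}
```

Starting at -1 lets the loop body consist of a single increment followed by the comparison, so the index already holds the string length when the loop exits.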

The next function in Listing 16-4 is the wrapper for the read function:

; read - Read data from a file handle.

; RAX - File handle.
; RDI - Pointer to buffer that receives the data.
; RCX - Length of data to read.

; Returns:

; RAX - Number of bytes actually read
;       or -1 if there was an error.

read        proc
            mkActRec

            mov     rdx, rdi        ; Buffer address
            mov     r8, rcx         ; Buffer length
            lea     r9, var1        ; bytesRead
            mov     rcx, rax        ; Handle
            xor     r10, r10        ; lpOverlapped is passed
            mov     [rsp+4*8], r10  ; on the stack
            call    __imp_ReadFile
            test    eax, eax        ; ReadFile returns a 32-bit BOOL in EAX
            mov     eax, var1       ; bytesRead is a DWORD (zero-extends into RAX)
            jnz     rtnBytsRead     ; If RAX was not zero
            mov     rax, -1         ; Return error status

rtnBytsRead:
            rstrActRec
            ret
read        endp

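The read function is the input analog of the write function. The parameters are similar (note, however, that read passes the buffer parameter in RDI, the destination index register):

RAX File handle.
RDI Destination buffer that receives the data read from the file.
RCX Number of bytes to read from the file.

The read function is a wrapper around the Win32 API __imp_ReadFile function, which has the following arguments:

RCX File handle.
RDX File buffer address.
R8 Number of bytes to read.
R9 Address of a DWORD variable that receives the number of bytes actually read.
[rsp + 32] Overlapped operation; should be NULL (0). As per the Windows ABI, the caller passes all arguments beyond the fourth on the stack, above the 32 bytes of shadow storage reserved for the first four arguments.

The read function returns -1 in RAX if an error occurred during the read operation. Otherwise, it returns the number of bytes actually read from the file. This value can be less than the requested amount if the read operation reaches the end of the file (EOF). A return value of 0 generally indicates that the end of the file has been reached.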
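The short-read and end-of-file behavior described above can be demonstrated without touching the Win32 API. The following C sketch uses an in-memory "file" in place of a Windows handle; MemFile, mem_read, and drain are hypothetical names, and the error (-1) case is omitted because an in-memory copy cannot fail:

```c
#include <stddef.h>
#include <string.h>

/* A tiny in-memory "file" standing in for a Windows file handle. */
typedef struct { const char *data; size_t size, pos; } MemFile;

/* Same contract as the chapter's read wrapper, minus the error case:
 * returns the number of bytes copied, which is less than `want`
 * near the end of the data and 0 once the end has been reached. */
long mem_read(MemFile *f, char *buf, size_t want)
{
    size_t left = f->size - f->pos;
    size_t n = want < left ? want : left;
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return (long)n;
}

/* Reads in `chunk`-byte pieces until mem_read reports 0;
 * returns the total number of bytes consumed. */
size_t drain(MemFile *f, size_t chunk)
{
    char buf[64];
    size_t total = 0;
    long n;
    if (chunk > sizeof buf)
        chunk = sizeof buf;
    while ((n = mem_read(f, buf, chunk)) > 0)
        total += (size_t)n;
    return total;
}
```

A loop written this way treats a count smaller than the request as ordinary data and stops only when the count reaches zero, which is the same structure the main program in Listing 16-4 uses.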

The open function opens an existing file for reading, writing, or both. It is a wrapper around the Windows CreateFileA API call:

; open - Open existing file for reading or writing.

; RSI - Pointer to filename string (zero-terminated).
; RAX - File access flags.
;       (GENERIC_READ, GENERIC_WRITE, or
;       "GENERIC_READ + GENERIC_WRITE")

; Returns:

; RAX - Handle of open file (or INVALID_HANDLE_VALUE if there
;       was an error opening the file).

open        proc
            mkActRec

            mov     rcx, rsi               ; Filename
            mov     rdx, rax               ; Read and write access
            xor     r8, r8                 ; Exclusive access
            xor     r9, r9                 ; No special security
            mov     r10, OPEN_EXISTING     ; Open an existing file
            mov     [rsp + 4 * 8], r10     
            mov     r10, FILE_ATTRIBUTE_NORMAL
            mov     [rsp + 5 * 8], r10
            mov     [rsp + 6 * 8], r9      ; NULL template file
            call    __imp_CreateFileA
            rstrActRec
            ret
open        endp

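The open procedure has two parameters:

RSI A pointer to a zero-terminated string containing the name of the file to open.
RAX A set of file access flags. This is typically the constant GENERIC_READ (to open a file for reading), GENERIC_WRITE (to open a file for writing), or GENERIC_READ + GENERIC_WRITE (to open a file for both reading and writing).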
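The access flags are single-bit masks, so adding two different flags happens to give the same result as OR-ing them. Adding the same flag twice does not: GENERIC_WRITE + GENERIC_WRITE is 80000000h, which is GENERIC_READ. The C sketch below (access_flags is a hypothetical helper; the two constants have the values given in listing16-4.inc) shows the safer way to build the mask:

```c
#include <stdint.h>

#define GENERIC_READ   0x80000000u
#define GENERIC_WRITE  0x40000000u

/* Combine access rights with bitwise OR. Unlike '+', OR-ing the
 * same flag twice cannot carry into a neighboring bit. */
uint32_t access_flags(int want_read, int want_write)
{
    uint32_t flags = 0;
    if (want_read)
        flags |= GENERIC_READ;
    if (want_write)
        flags |= GENERIC_WRITE;
    return flags;
}
```

In MASM source the equivalent is to write the constant expression with the OR operator (for example, GENERIC_READ OR GENERIC_WRITE) instead of +.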

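The open function calls the Windows CreateFileA function after setting up the appropriate arguments. The A suffix on CreateFileA stands for ASCII. This function expects the caller to pass an ASCII filename. A companion function, CreateFileW, expects a Unicode filename encoded as UTF-16. Windows uses Unicode filenames internally; when you call CreateFileA, it converts the ASCII filename to Unicode and then calls CreateFileW. The open function sticks with ASCII characters.

The CreateFileA function has the following arguments:

RCX Pointer to a zero-terminated (ASCII) string containing the name of the file to open.
RDX Read and write access flags (GENERIC_READ and GENERIC_WRITE).
R8 Sharing mode flags (0 means exclusive access). These control whether other processes can access the file while the current process has it open. Possible flag values are FILE_SHARE_READ, FILE_SHARE_WRITE, and FILE_SHARE_DELETE (or a combination of these).
R9 Pointer to a security descriptor. The open function doesn't specify any special security; it simply passes NULL (0) as this argument.
[rsp + 32] The creation disposition flag. The open function opens an existing file, so it passes OPEN_EXISTING here. The other possible values are CREATE_ALWAYS, CREATE_NEW, OPEN_ALWAYS, and TRUNCATE_EXISTING. OPEN_EXISTING requires that the file exist; otherwise, the call fails with an open error. As the fifth argument, this value is passed on the stack (in the fifth 64-bit slot, immediately above the shadow storage).
[rsp + 40] The file attributes. This function simply uses the FILE_ATTRIBUTE_NORMAL attribute (for example, not read-only).
[rsp + 48] A handle to a template file. The open function doesn't use a file template, so it passes NULL (0) in this argument.

The open function returns a file handle in the RAX register. If there was an error, the function returns INVALID_HANDLE_VALUE in RAX.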
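The [rsp + 32], [rsp + 40], and [rsp + 48] offsets follow a simple rule: argument n lives in the nth 8-byte slot above RSP, with the first four slots serving as shadow storage for the register arguments. The C sketch below works through that arithmetic; arg_slot_offset and call_area_size are hypothetical helper names used only for this illustration:

```c
/* Offset from RSP (just before the call instruction) of the nth
 * argument's 8-byte slot, counting n from 1. Slots 1-4 are shadow
 * storage for the register arguments; slots 5 and up hold the
 * arguments that are actually passed in memory. */
unsigned arg_slot_offset(unsigned n)
{
    return (n - 1) * 8;
}

/* Bytes to reserve below the return address for a call that takes
 * `nargs` arguments: at least the 32-byte shadow area, rounded up
 * to a multiple of 16. */
unsigned call_area_size(unsigned nargs)
{
    unsigned slots = nargs < 4 ? 4 : nargs;
    return (slots * 8 + 15) & ~15u;
}
```

For the five-argument WriteFile call this gives 48 bytes, and for the seven-argument CreateFileA call it gives 64 bytes. A procedure that makes several calls typically reserves the largest such area once in its prologue.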

The openNew function is also a wrapper around the CreateFileA function:

; openNew - Creates a new file and opens it for writing.

; RSI - Pointer to filename string (zero-terminated).

; Returns:

; RAX - Handle of open file (or INVALID_HANDLE_VALUE if there
;       was an error opening the file).

openNew     proc
            mkActRec

            mov     rcx, rsi                         ; Filename
            mov     rdx, GENERIC_WRITE               ; Write access
            xor     r8, r8                           ; Exclusive access
            xor     r9, r9                           ; No security
            mov     r10, CREATE_ALWAYS               ; Open a new file
            mov     [rsp + 4 * 8], r10 
            mov     r10, FILE_ATTRIBUTE_NORMAL
            mov     [rsp + 5 * 8], r10
            mov     [rsp + 6 * 8], r9                ; NULL template
            call    __imp_CreateFileA
            rstrActRec
            ret
openNew     endp

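openNew creates a new (empty) file on the disk. If the file previously existed, openNew discards its contents before opening the new file. This function is nearly identical to the preceding open function, with the following two differences:

The caller does not pass the file access flags in the RAX register. The file access is always assumed to be GENERIC_WRITE.
This function passes the CREATE_ALWAYS creation disposition flag to CreateFileA rather than OPEN_EXISTING.

The closeHandle function is a simple wrapper around the Windows CloseHandle function. You pass the handle of the file to close in the RAX register. The function returns 0 in RAX if there was an error, or a nonzero value if the close operation succeeded. The sole purpose of this wrapper is to preserve all the volatile registers across the call to the Windows CloseHandle function: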

; closeHandle - Closes a file specified by a file handle.

; RAX - Handle of file to close.

closeHandle proc
            mkActRec

            mov     rcx, rax        ; CloseHandle expects the handle in RCX
            call    __imp_CloseHandle

            rstrActRec
            ret
closeHandle endp

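Although this program doesn't explicitly use getLastError, it does provide a wrapper around the GetLastError function (just to show how it's written). Whenever one of the Windows functions in this program returns an error indication, you have to call getLastError to retrieve the actual error code. This function has no input parameters. It returns the most recently generated Windows error code in the RAX register.

It is very important to call getLastError immediately after a function returns an error indication. If your code calls any other Windows functions between the error and the retrieval of the error code, those intervening calls may overwrite the last error code value.

Like the closeHandle function, the getLastError procedure is a very simple wrapper around the Windows GetLastError function that preserves the volatile register values across the call: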

; getLastError - Returns the error code of the last Windows error.

; Returns:

; RAX - Error code.

getLastError proc
             mkActRec
             call   __imp_GetLastError
             rstrActRec
             ret
getLastError endp

stdin_read is a simple wrapper around the read function that reads its data from the standard input device (rather than from a file on another device):

; stdin_read - Reads data from the standard input.

; RDI - Buffer to receive data.
; RCX - Buffer count (note that data input will
;       stop on a newline character if that
;       comes along before RCX characters have
;       been read).

; Returns:

; RAX - -1 if error, bytes read if successful.

stdin_read  proc
            .data
hasStdInHnd byte    0
stdInHnd    qword   0
            .code
            mkActRec
            cmp     hasStdInHnd, 0
            jne     hasHandle

            call    getStdInHandle
            mov     stdInHnd, rax
            mov     hasStdInHnd, 1

hasHandle:  mov     rax, stdInHnd   ; Handle
            call    read

            rstrActRec
            ret
stdin_read  endp

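stdin_read is similar to the puts (and newLn) procedures insofar as it caches the standard input handle on its first call and uses that cached value on subsequent calls. Note that stdin_read does not itself need to preserve the volatile registers: it doesn't directly call any Windows functions, and the getStdInHandle and read functions it calls already preserve them. (The mkActRec and rstrActRec invocations in the listing are therefore redundant, though harmless.) The stdin_read function has the following parameters:

RDI Pointer to the destination buffer that will receive the characters read from the standard input device.
RCX Buffer size (maximum number of bytes to read).

This function returns the actual number of bytes read in the RAX register. This value may be less than the value passed in RCX. If the user presses ENTER, the function returns immediately. The function does not zero-terminate the string read from the standard input device; use the value in the RAX register to determine the string's length. If the function returns because the user pressed ENTER on the standard input device, that carriage return will appear in the buffer.

The stdin_getc function reads a single character from the standard input device and returns that character in the AL register: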

; stdin_getc - Reads a single character from the standard input.
;              Returns character in AL register.

stdin_getc  proc
            push    rdi
            push    rcx
            sub     rsp, 8

            mov     rdi, rsp
            mov     rcx, 1
            call    stdin_read
            cmp     eax, 1          ; Exactly one byte read?
            mov     eax, 0          ; Return 0 on EOF or error (mov leaves flags intact)
            jne     getcErr
            movzx   eax, byte ptr [rsp] ; Zero-extends into RAX

getcErr:    add     rsp, 8
            pop     rcx
            pop     rdi 
            ret
stdin_getc  endp

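The readLn function reads a string of characters from the standard input device and places them in a caller-specified buffer. The arguments are as follows:

RDI Address of the buffer.
RCX Maximum buffer size. (readLn allows the user to enter a maximum of RCX - 1 characters.)

This function appends a zero-terminating byte to the end of the string the user enters. Furthermore, it strips the carriage return (or line feed) character at the end of the line. It returns the character count in the RAX register (not counting the ENTER key):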

; readLn - Reads a line of text from the user.
;          Automatically processes backspace characters
;          (deleting previous characters, as appropriate).
;          Line returned from function is zero-terminated
;          and does not include the ENTER key code (carriage
;          return) or line feed.

; RDI - Buffer to place line of text read from user.
; RCX - Maximum buffer length.

; Returns:

; RAX - Number of characters read from the user
;       (does not include ENTER key).

readLn      proc
            push    rbx

            xor     rbx, rbx           ; Character count
            test    rcx, rcx           ; Allowable buffer is 0?
            je      exitRdLn
            dec     rcx                ; Leave room for 0 byte
readLp:
            call    stdin_getc         ; Read 1 char from stdin
            test    eax, eax           ; Treat error like ENTER
            jz      lineDone
            cmp     al, cr             ; Check for ENTER key
            je      lineDone
            cmp     al, nl             ; Check for newline code
            je      lineDone
            cmp     al, bs             ; Handle backspace character
            jne     addChar

; If a backspace character came along, remove the previous
; character from the input buffer (assuming there is a
; previous character).

            test    rbx, rbx           ; Ignore BS character if no
            jz      readLp             ; chars in the buffer
            dec     rbx
            jmp     readLp

; If a normal character (that we return to the caller),
; then add the character to the buffer if there is
; room for it (ignore the character if the buffer is full).

addChar:    cmp     ebx, ecx           ; See if we're at the
            jae     readLp             ; end of the buffer
            mov     [rdi][rbx * 1], al ; Save char to buffer
            inc     rbx
            jmp     readLp

; When the user presses ENTER (or the line feed) key
; during input, come down here and zero-terminate the string.

lineDone:   mov     byte ptr [rdi][rbx * 1], 0 

exitRdLn:   mov     rax, rbx        ; Return char cnt in RAX
            pop     rbx
            ret
readLn      endp
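The editing rules that readLn applies (backspace removes the previous character, excess characters are ignored, and the line ends at a carriage return, a line feed, or a failed read) can be exercised without a console. The following C sketch feeds characters from a string instead of calling stdin_getc; the name edit_line and the constants BS, NL, and CR are assumptions made for this example:

```c
#include <stddef.h>

enum { BS = 8, NL = 10, CR = 13 };

/* Runs the characters of `in` (standing in for stdin_getc) through
 * readLn's editing rules and stores the result in buf[0..max-1].
 * Returns the number of characters kept, excluding the terminator. */
size_t edit_line(const char *in, char *buf, size_t max)
{
    size_t count = 0;
    if (max == 0)
        return 0;                     /* no room even for the 0 byte */
    max--;                            /* leave room for the 0 byte   */
    for (;; ++in) {
        char c = *in;
        if (c == 0 || c == CR || c == NL)
            break;                    /* end of line (0 models a failed read) */
        if (c == BS) {
            if (count > 0)
                count--;              /* drop the previous character */
        } else if (count < max) {
            buf[count++] = c;         /* keep it if there is room    */
        }                             /* otherwise ignore it         */
    }
    buf[count] = 0;
    return count;
}
```

For example, the input "abc", backspace, "d", ENTER leaves the string "abd" in the buffer and returns 3.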

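Here is the main program for Listing 16-4, which reads a filename from the user, opens that file, reads the file's data, and displays that data on the standard output device:

; **********************************************************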

; Here is the "asmMain" function.

            public  asmMain
asmMain     proc
            push    rbx
            push    rsi
            push    rdi
            push    rbp
            mov     rbp, rsp
            sub     rsp, 64         ; Shadow storage
            and     rsp, -16

; Get a filename from the user:

            lea     rsi, prompt
            call    puts

            lea     rdi, inputLn
            mov     rcx, lengthof inputLn
            call    readLn

; Open the file, read its contents, and display
; the contents to the standard output device:

            lea     rsi, inputLn
            mov     rax, GENERIC_READ
            call    open

            cmp     eax, INVALID_HANDLE_VALUE
            je      badOpen

            mov     inHandle, eax

; Read the file 4096 bytes at a time:

readLoop:   mov     eax, inHandle
            lea     rdi, fileBuffer
            mov     ecx, lengthof fileBuffer
            call    read
            test    eax, eax        ; EOF (0) or error (-1)?
            jle     allDone
            mov     rcx, rax        ; Bytes to write

            call    getStdOutHandle
            lea     rsi, fileBuffer
            call    write
            jmp     readLoop

allDone:    mov     eax, inHandle
            call    closeHandle
            jmp     exitMain

; The open failed, so there is no handle to close:

badOpen:    lea     rsi, badOpenMsg
            call    puts

exitMain:   leave
            pop     rdi
            pop     rsi
            pop     rbx
            ret     ; Returns to caller
asmMain     endp
            end

Listing 16-4: File I/O demonstration program

Here are the build command and sample output for Listing 16-4:

C:\>nmake /nologo /f listing16-4.mak
        ml64 /nologo listing16-4.asm  /link /subsystem:console /entry:asmMain
 Assembling: listing16-4.asm
Microsoft (R) Incremental Linker Version 14.15.26730.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/OUT:listing16-4.exe
listing16-4.obj
/subsystem:console
/entry:asmMain

C:\>listing16-4
Enter (text) filename:listing16-4.mak
listing16-4.exe: listing16-4.obj listing16-4.asm
        ml64 /nologo listing16-4.asm \
                /link /subsystem:console /entry:asmMain

Here is the listing16-4.inc include file:

; listing16-4.inc

; Header file entries extracted from MASM32 header
; files (placed here rather than including the 
; entire set of MASM32 headers to avoid namespace 
; pollution and speed up assemblies).

STD_INPUT_HANDLE                     equ -10
STD_OUTPUT_HANDLE                    equ -11
STD_ERROR_HANDLE                     equ -12
CREATE_NEW                           equ 1
CREATE_ALWAYS                        equ 2
OPEN_EXISTING                        equ 3
OPEN_ALWAYS                          equ 4
FILE_ATTRIBUTE_READONLY              equ 1h
FILE_ATTRIBUTE_HIDDEN                equ 2h
FILE_ATTRIBUTE_SYSTEM                equ 4h
FILE_ATTRIBUTE_DIRECTORY             equ 10h
FILE_ATTRIBUTE_ARCHIVE               equ 20h
FILE_ATTRIBUTE_NORMAL                equ 80h
FILE_ATTRIBUTE_TEMPORARY             equ 100h
FILE_ATTRIBUTE_COMPRESSED            equ 800h
FILE_SHARE_READ                      equ 1h
FILE_SHARE_WRITE                     equ 2h
GENERIC_READ                         equ 80000000h
GENERIC_WRITE                        equ 40000000h
GENERIC_EXECUTE                      equ 20000000h
GENERIC_ALL                          equ 10000000h
INVALID_HANDLE_VALUE                 equ -1

PPROC           TYPEDEF PTR PROC        ; For include file prototypes

externdef __imp_GetStdHandle:PPROC
externdef __imp_WriteFile:PPROC
externdef __imp_ReadFile:PPROC
externdef __imp_CreateFileA:PPROC
externdef __imp_CloseHandle:PPROC
externdef __imp_GetLastError:PPROC

Here is the listing16-4.mak makefile:

listing16-4.exe: listing16-4.obj listing16-4.asm
    ml64 /nologo listing16-4.asm \
        /link /subsystem:console /entry:asmMain

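16.8 Windows Applications

This chapter has provided just a glimpse of what is possible when writing pure assembly language applications that run under Windows. The kernel32.lib library provides hundreds of functions you can call, covering such diverse topic areas as manipulating filesystems (for example, deleting files, looking up filenames in a directory, and changing directories), creating and synchronizing threads, processing environment strings, allocating and deallocating memory, manipulating the Windows registry, sleeping for a certain time period, waiting for events to occur, and much more.

The kernel32.lib library is but one of the libraries in the Win32 API. The gdi32.lib library contains most of the functions needed to create GUI applications that run under Windows. Creating such applications is well beyond the scope of this book, but if you want to create stand-alone Windows GUI applications, you need to become intimately familiar with this library. The following "For More Information" section provides links to internet resources if you're interested in creating stand-alone Windows GUI applications in assembly language.

16.9 For More Information

If you want to write stand-alone 64-bit assembly language programs that run on Windows, your first stop should be www.masm32.com/. Although this site is primarily dedicated to creating 32-bit assembly language programs that run on Windows, it also offers a large amount of information for 64-bit programmers. More importantly, this site contains the header files you need to access the Win32 API from your 64-bit assembly language programs.

If you are serious about writing Win32 API-based Windows applications in assembly language, Charles Petzold's Programming Windows, Fifth Edition (Microsoft Press, 1998) is an absolutely essential purchase. This book is very old (do not buy the newer edition, which covers C# and XAML), and you will probably have to buy a used copy. It was written for C programmers (not assembly programmers), but if you know the Windows ABI (which you should by now), translating all the C calls into assembly language isn't that difficult. Though much of the information about the Win32 API is available online (such as at the MASM32 site), having all the information together in a single (very large!) book is invaluable.

Another good source on the web for Win32 API calls is software analyst Geoff Chappell's Win32 Programming page (www.geoffchappell.com/studies/windows/win32/).

The Iczelion tutorials were the original standard for writing Windows programs in x86 assembly language. Although they were originally written for 32-bit x86 assembly language, there have been several translations of the code to 64-bit assembly language, for example: masm32.com/board/index.php?topic=4190.0/.

The HLA Standard Library and examples (which can be found at www.randallhyde.com/) contain a great deal of Windows code and API function calls. Though this code is all 32-bit, translating it to 64-bit MASM code is easy.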

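16.10 Test Yourself

What is the linker command line option needed to tell MASM you're building a console application?
What website should you visit to get Win32 programming information?
What is the major drawback to including \masm32\include64\masm64rt.inc in all your assembly language source files?
What linker command line option lets you specify the name of your assembly language main program?
What is the name of the Win32 API function that lets you bring up a dialog box?
What is wrapper code?
What is the Win32 API function you would use to open an existing file?
What Win32 API function do you use to retrieve the last Windows error code?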

Part III

Reference Material

Appendix A

ASCII Character Set

Binary	Hexadecimal	Decimal	Character
0000_0000	00	0	NULL
0000_0001	01	1	ctrl-A
0000_0010	02	2	ctrl-B
0000_0011	03	3	ctrl-C
0000_0100	04	4	ctrl-D
0000_0101	05	5	ctrl-E
0000_0110	06	6	ctrl-F
0000_0111	07	7	bell
0000_1000	08	8	backspace
0000_1001	09	9	tab
0000_1010	0A	10	line feed
0000_1011	0B	11	ctrl-K
0000_1100	0C	12	form feed
0000_1101	0D	13	carriage return
0000_1110	0E	14	ctrl-N
0000_1111	0F	15	ctrl-O
0001_0000	10	16	ctrl-P
0001_0001	11	17	ctrl-Q
0001_0010	12	18	ctrl-R
0001_0011	13	19	ctrl-S
0001_0100	14	20	ctrl-T
0001_0101	15	21	ctrl-U
0001_0110	16	22	ctrl-V
0001_0111	17	23	ctrl-W
0001_1000	18	24	ctrl-X
0001_1001	19	25	ctrl-Y
0001_1010	1A	26	ctrl-Z
0001_1011	1B	27	esc (ctrl-[)
0001_1100	1C	28	ctrl-\
0001_1101	1D	29	ctrl-]
0001_1110	1E	30	ctrl-^
0001_1111	1F	31	ctrl-_
0010_0000	20	32	space
0010_0001	21	33	!
0010_0010	22	34	"
0010_0011	23	35	#
0010_0100	24	36	$
0010_0101	25	37	%
0010_0110	26	38	&
0010_0111	27	39	'
0010_1000	28	40	(
0010_1001	29	41	)
0010_1010	2A	42	*
0010_1011	2B	43	+
0010_1100	2C	44	,
0010_1101	2D	45	-
0010_1110	2E	46	.
0010_1111	2F	47	/
0011_0000	30	48	0
0011_0001	31	49	1
0011_0010	32	50	2
0011_0011	33	51	3
0011_0100	34	52	4
0011_0101	35	53	5
0011_0110	36	54	6
0011_0111	37	55	7
0011_1000	38	56	8
0011_1001	39	57	9
0011_1010	3A	58	:
0011_1011	3B	59	;
0011_1100	3C	60	<
0011_1101	3D	61	=
0011_1110	3E	62	>
0011_1111	3F	63	?
0100_0000	40	64	@
0100_0001	41	65	A
0100_0010	42	66	B
0100_0011	43	67	C
0100_0100	44	68	D
0100_0101	45	69	E
0100_0110	46	70	F
0100_0111	47	71	G
0100_1000	48	72	H
0100_1001	49	73	I
0100_1010	4A	74	J
0100_1011	4B	75	K
0100_1100	4C	76	L
0100_1101	4D	77	M
0100_1110	4E	78	N
0100_1111	4F	79	O
0101_0000	50	80	P
0101_0001	51	81	Q
0101_0010	52	82	R
0101_0011	53	83	S
0101_0100	54	84	T
0101_0101	55	85	U
0101_0110	56	86	V
0101_0111	57	87	W
0101_1000	58	88	X
0101_1001	59	89	Y
0101_1010	5A	90	Z
0101_1011	5B	91	[
0101_1100	5C	92	\
0101_1101	5D	93	]
0101_1110	5E	94	^
0101_1111	5F	95	_
0110_0000	60	96	`
0110_0001	61	97	a
0110_0010	62	98	b
0110_0011	63	99	c
0110_0100	64	100	d
0110_0101	65	101	e
0110_0110	66	102	f
0110_0111	67	103	g
0110_1000	68	104	h
0110_1001	69	105	i
0110_1010	6A	106	j
0110_1011	6B	107	k
0110_1100	6C	108	l
0110_1101	6D	109	m
0110_1110	6E	110	n
0110_1111	6F	111	o
0111_0000	70	112	p
0111_0001	71	113	q
0111_0010	72	114	r
0111_0011	73	115	s
0111_0100	74	116	t
0111_0101	75	117	u
0111_0110	76	118	v
0111_0111	77	119	w
0111_1000	78	120	x
0111_1001	79	121	y
0111_1010	7A	122	z
0111_1011	7B	123	{
0111_1100	7C	124	|
0111_1101	7D	125	}
0111_1110	7E	126	~
0111_1111	7F	127	delete
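
Several regularities in the table are worth knowing because they turn into single-instruction bit operations. The C sketch below spells out three of them; the function names are hypothetical and chosen only for this illustration:

```c
/* Control code produced by ctrl-<letter>: keep the low 5 bits. */
int ctrl_code(char letter)   { return letter & 0x1F; }

/* Uppercase and lowercase letters differ only in bit 5 (20h). */
int to_lower_bit(char upper) { return upper | 0x20; }

/* A decimal digit's value is the low 4 bits of its character code. */
int digit_value(char digit)  { return digit & 0x0F; }
```

These helpers assume their argument is a character of the stated kind (a letter or a decimal digit); they do no validation.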

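Appendix B

Glossary

Symbols

.code

The program code section.

.const

A section for declaring initialized read-only values.

.data

A section for declaring initialized variables.

.data?

A section for declaring uninitialized variables.

A

ABI

See application binary interface.

address bus

A set of electronic signals that hold the binary address of a memory element.

aggregate data type

A data type composed of one or more smaller data types.

API

Application programming interface.

application binary interface

A set of conventions that code uses to ensure interoperability between code that calls other functions or procedures and the functions or procedures being called.

ASCII

American Standard Code for Information Interchange.

assembly unit

The assembly of a source file plus all the files it directly or indirectly includes.

associativity

Associativity determines how operators are grouped in a complex expression when all the operators have the same precedence. For example, given two operators, op1 and op2, associativity determines how the expression x op1 y op2 z is evaluated. Left-associative operators produce (x op1 y) op2 z, whereas right-associative operators produce x op1 (y op2 z).

automatic variable

See local variable.

AVX

Advanced Vector Extensions.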

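B

BCD

Binary-coded decimal.

big endian

Multibyte data objects in memory are big endian if their high-order byte appears at the lowest memory address and their low-order byte appears at the highest memory address.

C

calling convention

The protocol for passing data to a procedure and returning data from it, including where the data is passed, how it is aligned, and how large it is.

CLI

Command line interface, or command line interpreter (the Windows cmd.exe application).

code fragment

See code snippet.

coercion

Forcing one data type to be treated as another; for example, treating a character value as an integer.

column-major ordering

A function for storing the elements of a multidimensional array in linear memory by placing the elements of each column in consecutive memory locations and then laying out the columns one after another.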
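
For a two-dimensional array, the mapping reduces to one multiplication and one addition. A minimal C sketch (col_major_index is a hypothetical name, and indices are zero-based):

```c
#include <stddef.h>

/* Linear element index of a[row][col] in an array with `rows` rows
 * stored in column-major order: the columns are laid out one after
 * another, each occupying `rows` consecutive elements. */
size_t col_major_index(size_t row, size_t col, size_t rows)
{
    return col * rows + row;
}
```

Multiply the result by the element size to obtain a byte offset from the start of the array.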

可交换

如果（A op B）总是等于（B op A），则该操作是可交换的。

复合数据类型

见聚合数据类型。

控制总线

一组来自 CPU 的电子信号，用于控制如读取、写入和生成等待状态等活动。

控制字符

特殊的非打印字符，用于控制打印字符的机器方面。这包括像回车（将打印头移到行首）、换行（将打印设备移动到下一行）和退格（将打印位置回退一个字符）等操作。

CTL

编译时语言。

D

悬空指针

在已释放并返回给系统的内存上使用指针访问该内存（并且该内存可能正在被用于其他目的）。

数据总线

从 CPU 到外部设备（如内存或输入/输出设备）传输数据的一组电子信号。

分隔符字符

分隔属于一组的字符序列的字符（例如，空格或逗号分隔的数字字符串）。

依赖关系

在 makefile 中，如果改变某个文件需要重新编译（或其他操作）原文件，则该文件依赖于另一个文件。

解引用

访问由指针变量指定的地址上的数据。

描述符

描述另一个数据结构的数据结构。通常，描述符包含诸如指向实际数据的指针、类型信息或长度信息等信息。

指令

一种汇编语言语句，提供给汇编器的信息，但不是机器指令，不生成任何代码。

函数的定义域

函数接受的所有可能输入值的集合。

dword

Double word (two 16-bit words, forming a 32-bit value).

动态类型系统

允许对象类型在运行时改变的程序组织。

E

有效地址

指令将在内存中访问的最终地址，所有地址计算完成后。

尾声

清理过程局部变量存储的标准退出序列。通常，这包括以下语句：

leave
ret

F

外观代码

改变调用代码和被调用函数或过程之间参数或返回结果接口的代码，以使调用序列兼容。

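false precision

Extra bits in a computation result that contain garbage; their presence suggests more precision in the result than actually exists.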

字段

记录、结构或对象的成员。

floating-point unit

The section of the CPU that implements floating-point arithmetic.

FPU

See floating-point unit.

完整路径名

以反斜杠（\）字符开始的路径名，指定从根目录开始的路径。另见路径名。

G

粒度

最小的访问单元；例如，MMU 可以使用页面粒度访问内存，其中粒度为 4096 字节。

守卫数字（或位）

在计算过程中维护的额外数字（或位），用于增强长链计算的精度。

H

堆

程序中用于存储动态分配的内存对象的内存区域。

HLL

High-level language.

HO

High order.

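horizontal addition or subtraction

Adding or subtracting adjacent lanes within a single XMM or YMM register, rather than the usual addition or subtraction of corresponding lanes in two different XMM or YMM registers. See also vertical addition or subtraction.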

I

I/O

Input/output.

IDE

See integrated development environment.

习语

机器的特异性。

间接寻址

一种技术，在这种技术中，指令的操作数提供了指令可以找到对象地址的位置，而不是直接提供对象本身。

归纳变量

其值完全依赖于另一个变量的值的变量（通常在循环执行期间）。

集成开发环境

一套程序员工具，包括编译器、汇编器、链接器、调试器和编辑器，可以让你在同一系统中开发软件。

L

硬件通道

向量的一个元素（SSE/AVX 打包数据类型）。

叶函数

不调用任何其他函数的函数。这个名称来源于调用树图，其中叶节点是那些不调用任何其他过程（且没有从其节点出来的边）的过程。

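lexicographical ordering

Alphabetical ordering (more precisely, ordering based on character codes). Strings are compared character by character, from the first character up to the length of the shorter string. If the two strings match over the length of the shorter string, the longer string is considered the greater. Two strings are equal only if they have the same length and all their characters are equal.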
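The comparison rule in this entry can be sketched in C as follows. This is a minimal illustration for zero-terminated strings, not code from the book; the function name `lex_compare` is an assumption made for the example.

```c
#include <stddef.h>

/* Compare two zero-terminated strings lexicographically by character code.
   Returns a negative value if a < b, zero if a == b, positive if a > b. */
int lex_compare(const char *a, const char *b)
{
    size_t i = 0;

    /* Compare character by character up to the length of the shorter string. */
    while (a[i] != '\0' && b[i] != '\0') {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];
        if (ca != cb) {
            return (ca < cb) ? -1 : 1;
        }
        i++;
    }

    /* The strings match over the shorter length: equal only if both ended. */
    if (a[i] == '\0' && b[i] == '\0') {
        return 0;
    }

    /* Otherwise the longer string is the greater one. */
    return (a[i] == '\0') ? -1 : 1;
}
```

Under this sketch, a string that is a proper prefix of another compares as less than the longer string, which is the case the entry singles out.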

库模块

一组目标文件，通常被组织成一个.lib文件（尽管这对于库模块并不是必须的）。

生命周期

从存储首次分配给变量开始，到存储不再可用于该变量为止的时间段。

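LIFO

Last in, first out.

little endian

Multibyte data objects in memory are little endian if their LO byte appears at the lowest address and their HO byte appears at the highest address.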
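To make the two byte orders concrete, here is a hedged C sketch that writes a 32-bit value into a byte buffer in each order and reverses the bytes of a value (the effect the bswap instruction has on a 32-bit register). The helper names are hypothetical.

```c
#include <stdint.h>

/* Little endian: LO byte at the lowest address, HO byte at the highest. */
void store_le32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)(v);
    dst[1] = (uint8_t)(v >> 8);
    dst[2] = (uint8_t)(v >> 16);
    dst[3] = (uint8_t)(v >> 24);
}

/* Big endian: HO byte at the lowest address, LO byte at the highest. */
void store_be32(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)(v >> 24);
    dst[1] = (uint8_t)(v >> 16);
    dst[2] = (uint8_t)(v >> 8);
    dst[3] = (uint8_t)(v);
}

/* Reverse the four bytes of a value, converting between the two orders. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}
```

Storing 12345678h with `store_le32` places 78h in the first byte of the buffer, while `store_be32` places 12h there.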

LO

Low order.

局部变量

变量（更准确地称为自动变量）在进入过程时分配存储，并且当过程返回到调用者时，这些存储会返回给其他使用。

循环不变计算

在循环中出现的计算，每次迭代都会产生相同的结果。

M

机器码

汇编语言指令的二进制（或数字）编码。

macro

A sequence of text that a macro processor substitutes for a macro identifier everywhere that identifier appears in the source file.

宏架构

CPU 架构的一个视图，这个视图对软件可见。

宏函数

一种宏，你可以在源文件的任何地方调用（包括指令或指令的操作数字段中）；该宏返回一个文本字符串，宏调用会将其替换为该字符串。

明示常量

表示常量值的标识符。MASM 直接在程序中每次出现该标识符时替换为该常量的值。

MASM

微软宏汇编器

内存管理单元

CPU 的一个组件，负责将程序地址转换为物理内存地址，并处理非法内存访问。

微架构

CPU 的设计位于软件可见层级之下。

MMU

参见内存管理单元。

MMX

多媒体扩展（针对 x86 CPU 的扩展指令集，支持多媒体操作）。

助记符

字面意思是记忆辅助工具。应用于指令名称时，助记符实际上意味着缩写。例如，助记符lea代表加载有效地址。

MSVC

微软 Visual C++。

N

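namespace pollution

Having so many names in a source file that few new names remain available to the programmer. (When a source file contains a large number of symbols, programmers often create conflicts by reusing a name, which leads to duplicate-symbol errors during compilation.)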

非数字（NaN）

非数字。浮点异常值，表示无法获得有效的数值结果。

O

操作码（opcode）

操作码。机器指令的数字编码。

有序比较

两个值之间的比较，且两者都不是 NaN。

oword

八进制字（八个 16 位字，或一个 16 字节值）。

P

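partial pathname

A pathname that begins with a directory name (not a backslash character) and specifies a path that starts from the current (default) directory.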

通过引用传递

一种参数传递机制，调用者将实际参数数据的地址传递给过程或函数。

值传递

一种参数传递机制，调用者将参数的实际值传递给过程或函数。

路径名

一串由反斜杠（\）字符分隔的（子）目录名称，可能以文件名结尾。

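PC

Program counter. The address of the current instruction or directive in an assembly language program. PC-relative addressing specifies an offset from the current machine instruction.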

幂集

一种集合数据类型，通过使用单个位来表示集合中的每个对象。如果集合的基数（集合中的成员数）是n，则该集合数据类型将需要n个位。在数学中，任何集合S的幂集是S的所有子集的集合，包括空集和S本身；这需要 2^(n)个不同的集合，可以通过n位位串来表示。
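The bit-per-member representation described in this entry can be sketched in C with a 32-bit word, which covers sets whose members are numbered 0 through 31. The type and function names are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* A set over the members 0..31: bit m is 1 when member m is present. */
typedef uint32_t set32;

/* Callers must pass m in the range 0..31. */
set32 set_add(set32 s, unsigned m)    { return s | ((set32)1 << m); }
set32 set_remove(set32 s, unsigned m) { return s & ~((set32)1 << m); }
bool  set_has(set32 s, unsigned m)    { return ((s >> m) & 1u) != 0; }

/* Set operations map directly onto bitwise logical operations. */
set32 set_union(set32 a, set32 b)     { return a | b; }
set32 set_intersect(set32 a, set32 b) { return a & b; }

/* a is a subset of b when a has no member that b lacks. */
bool set_is_subset(set32 a, set32 b)  { return (a & ~b) == 0; }
```

With n = 32 possible members, every one of the 2^32 bit patterns of a `set32` value names a different subset, which is the sense in which the n-bit string represents the powerset.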

优先级

当两个不同的运算符出现在一个表达式中（没有括号来表示评估顺序）时，优先级决定先进行哪些操作。例如，对于运算符op1和op2，以及表达式x op1 y op2 z，评估顺序由运算符的优先级决定。如果op1的优先级高于op2，则表达式按（x op1 y） op2 z的顺序评估。如果op2的优先级高于op1，则表达式按x op1（y op2 z）的顺序评估。如果两个运算符具有相同的优先级，则结合性规则控制评估顺序（另见结合性）。

精度

在计算中保持的数字或位数。

大规模编程

使用过程、方法论和工具来处理大型软件系统的开发。

序言

标准的过程入口序列，通常包含以下语句：

push  rbp
mov   rbp, rsp
sub   rsp, `size_of_local_variables`

真子集

一个包含在另一个集合内的集合，且这两个集合不相等。

真超集

一个包含另一个集合所有元素的集合，且这两个集合不相等。

Q

qword

Quad word (four 16-bit values, forming a 64-bit value).

R

函数的值域

一个函数生成的所有可能输出值的集合。

记录

见结构体。

行主序排列

一个用于将多维数组排列到线性内存中的函数，方法是将每行的元素存储在连续的内存位置中，然后将每行按顺序放置在内存中。

S

饱和

将较大（位大小）值转换为较小值的过程，通过剪裁（即如果原始值过大以适应较小的结果，则强制为最大值或最小值）。
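A minimal C sketch of saturation, assuming a signed 32-bit source clipped to a signed 16-bit result and a signed 16-bit source clipped to an unsigned 8-bit result. The function names are hypothetical; the packed SSE/AVX saturating instructions apply the same clipping rule to each lane.

```c
#include <stdint.h>

/* Clip a signed 32-bit value into the signed 16-bit range. */
int16_t saturate_s32_to_s16(int32_t v)
{
    if (v > INT16_MAX) {
        return INT16_MAX;   /* too large: force to the maximum value */
    }
    if (v < INT16_MIN) {
        return INT16_MIN;   /* too small: force to the minimum value */
    }
    return (int16_t)v;      /* already fits: value is unchanged */
}

/* Clip a signed 16-bit value into the unsigned 8-bit range 0..255. */
uint8_t saturate_s16_to_u8(int16_t v)
{
    if (v > 255) {
        return 255;
    }
    if (v < 0) {
        return 0;
    }
    return (uint8_t)v;
}
```

Unlike simple truncation, which keeps only the LO bits and can turn a large positive value into a small or negative one, saturation keeps the result as close as possible to the original value.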

标量数据类型

一种原始的、不可分割的数据类型（例如整数或浮点值），不能被拆分成任何更小的部分（除了单独的位）。

作用域

标识符的作用域决定了在编译期间源文件中该标识符的可见（可访问）范围。在大多数高级语言中，过程局部变量的作用域是该过程的主体；标识符在该过程外部不可访问。

sign contraction

Converting a larger signed value to a smaller signed value.

有效数字

在计算过程中保持其值的数字位数。

SIMD

见单指令多数据指令。

单指令多数据指令

专用机器指令，能够同时操作两个或多个数据单元。为某些多媒体和其他应用提供更高性能的操作。

SISD

单指令单数据。

snippet

A small piece of code that demonstrates a concept.

SSE

流式 SIMD 扩展。

状态机

程序逻辑通过程序维护的状态来保持先前执行的历史。该状态可以保存在变量中，或者保存在状态机的当前执行位置中。

静态变量

变量的生命周期是整个程序的执行时间；通常，在汇编语言程序的.data、.data?或.const部分声明静态变量。

强度缩减优化

使用较便宜的操作来计算与更昂贵操作相同的结果。

字符串描述符

提供关于字符串数据的信息的数据结构。通常，字符串描述符包含指向实际字符串数据的指针、字符串中的字符数（即长度），以及可能的字符串类型或编码（如 ASCII、UTF-8，或描述其他编码的信息）。

结构体

由一组异质（不同类型）对象组成的复合数据结构。

系统总线

包含地址、数据和控制总线的电子信号集合。

T

时间戳

与系统中某个事件相关联的数值（通常是基于时间的）。时间戳是单调递增的；也就是说，如果两个事件有时间戳，它们的时间戳值较大的事件会在较晚的时间发生。

TOS

Top of stack.

跳板

代码中的一个固定点，程序可以跳转（或调用）到这个点，转移到代码的另一个位置，超出了jmp或call指令的正常范围。

tricky programming

Using non-obvious results of a computation in programming constructs.

U

无序比较

比较两个值，其中至少有一个值是 NaN。

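unraveling loops

Pulling the body of a loop out of the loop and expanding it in place multiple times (once for each loop iteration) to avoid the overhead of loop control at run time.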

URL

统一资源定位符（网页地址）。

V

变体类型

一种在程序执行期间可以动态变化的数据类型（即，它是一个可变类型）。

向量指令

对多个数据项同时操作的指令（SIMD 指令）。具体来说，是两个或更多数据值的数组。

垂直加法或减法

对两个 XMM 或 YMM 寄存器中的对应数据进行加法或减法运算。另见水平加法 或减法。

W

空白字符

在显示器上占位但没有可打印字符的字符（如空格和制表符）。

字

16 位值。

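wrapper code

Code that changes the behavior of a function call without directly modifying that function (for example, by changing where the caller passes parameters to the underlying function). Wrapper code is also known as facade code.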

Appendix C

Installing and Using Visual Studio

本书使用的 Microsoft 宏汇编器（MASM）、Microsoft C++ 编译器、Microsoft 链接器及其他工具，都可以在 Microsoft Visual Studio 包中找到。在写这篇文章时，你可以在 visualstudio.microsoft.com/vs/community/ 下载 Windows 版本的 Visual Studio Community 版。当然，网址会随时间变化。通过网络搜索 Microsoft Visual Studio download 应该能引导你到合适的页面。

C.1 安装 Visual Studio Community

下载 Visual Studio Community 版后，运行安装程序。由于 Microsoft 以其即使在发生小幅更新时也会完全更改程序的用户界面而闻名，因此本附录不提供逐步的操作指导。这里提供的任何指引在你尝试运行时可能已经过时。不过，最重要的是确保你下载并安装 Microsoft Visual C++ 桌面工具。

C.2 为 MASM 创建命令行提示符

为了使用 Microsoft Visual C++（MSVC）编译器和 MASM，我们需要通过使用 Visual Studio 提供的批处理文件来初始化环境，然后保持命令行解释器（CLI）打开，以便我们可以构建和运行程序。我们有两个选择：使用 Visual Studio 安装程序创建的环境，或者创建一个自定义环境。

在写这篇文章时，Visual Studio 2019 安装程序创建了各种命令行界面（CLI）环境：

VS 2019 的开发者命令提示符
VS 2019 的开发者 PowerShell
x64 原生工具命令提示符（VS 2019）
x64_x86 跨平台工具命令提示符（VS 2019）
x86 原生工具命令提示符（VS 2019）
x86_x64 跨平台工具命令提示符（VS 2019）

你可以通过点击开始（Windows 图标）在 Windows 任务栏上，然后导航到并点击Visual Studio 2019文件夹来找到这些工具。x86 指的是 32 位版本，而 x64 指的是 64 位版本的 Windows。

开发者命令提示符、开发者 PowerShell、x86 原生工具和 x64_x86 跨平台工具是面向 Windows 的 32 位版本，因此它们超出了本书的范围。x86_x64 跨平台工具面向 64 位 Windows，但环境中的工具本身是 32 位的。基本上，这些是为运行 32 位版本 Windows 的用户准备的工具。x64 原生工具是为面向和运行 64 位版本 Windows 的用户准备的。今天 32 位版本的 Windows 很少见，因此我们没有在 x86_x64 跨平台工具下使用或测试本书的代码。理论上，它应该能够组装和编译 64 位代码，但我们无法在这个 32 位环境中运行它。

我们使用并测试的是运行在 64 位 Windows 下的 x64 原生工具。如果你右键点击x64 原生工具，你可以将其固定到开始菜单，或者选择更多，你可以将其固定到任务栏。

或者，你可以创建自定义环境，我们现在将介绍这个过程。我们将通过以下步骤创建一个指向 MASM 命令行提示符的快捷方式：

找到名为vcvars64.bat的批处理文件（或类似文件）。如果找不到vcvars64.bat，可以尝试vcvarsall.bat。在编写本章时（使用 Visual Studio 2019），我找到了vcvars64.bat文件，路径为：*C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build*。
创建文件的快捷方式（通过在 Windows 资源管理器中右键点击它，并从弹出菜单中选择创建快捷方式）。将此快捷方式移到 Windows 桌面上，并将其重命名为VSCmdLine。
右键点击桌面上的快捷方式图标，然后点击属性▶快捷方式。找到包含vcvars64.bat文件路径的目标文本框；例如：
```
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat"
```
在此路径前添加前缀cmd /k：
```
cmd /k "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat"
```
cmd命令是 Microsoft 的cmd.exe命令行解释器。/k选项告诉cmd.exe执行后续的命令（即vcvars64.bat文件），并在命令执行完成后保持窗口打开。现在，当你双击桌面上的快捷方式图标时，它将初始化所有环境变量，并保持命令窗口打开，这样你就可以从命令行执行 Visual Studio 工具（例如 MASM 和 MSVC）。

如果你找不到vcvars64.bat，但有vcvarsall.bat，也在命令行末尾添加x64：
```
cmd /k "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" x64
```
在关闭快捷方式的属性对话框之前，将起始位置文本框修改为C:\，或者其他你通常在开始使用 Visual Studio 命令行工具时工作的目录。

双击桌面上的快捷方式图标；你应该看到一个命令窗口，里面有如下文本：
```
**********************************************************************
** Visual Studio 2019 Developer Command Prompt v16.9.0
** Copyright (c) 2019 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'
```
从命令行输入ml64。这应该会产生类似如下的输出：
```
C:\>ml64
Microsoft (R) Macro Assembler (x64) Version 14.28.29910.0
Copyright (C) Microsoft Corporation.  All rights reserved.

usage: ML64 [options] filelist [/link linkoptions]
Run "ML64 /help" or "ML64 /?" for more info
```
尽管 MASM 抱怨你没有提供要编译的文件名，但你收到此消息意味着ml64.exe已经在执行路径中，因此系统已正确设置环境变量，使你能够运行 Microsoft 宏汇编器。

作为最终测试，执行cl命令以验证是否能够运行 MSVC。你应该会看到类似如下的输出：

C:\>cl
Microsoft (R) C/C++ Optimizing Compiler Version 19.28.29910 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

usage: cl [option...] filename... [/link linkoption...]

最后，做一次最终检查，在 Windows 开始菜单中找到 Visual Studio 应用程序。点击它并验证是否能够启动 Visual Studio IDE。如果你愿意，可以复制此快捷方式并将其放到桌面上，以便通过双击快捷方式图标启动 Visual Studio。

C.3 编辑、汇编和运行 MASM 源文件

你将使用某种文本编辑器来创建和维护 MASM 汇编语言源文件。如果你还不熟悉 Visual Studio，并且希望使用一个更容易学习和使用的环境，可以考虑下载免费的 Notepad++ 文本编辑器应用程序。Notepad++ 对 MASM 提供了出色的支持，速度快，且易于学习和使用。无论你选择哪种文本编辑器（我使用一款名为 CodeWright 的商业产品），第一步是创建一个简单的汇编语言源文件。

MASM 要求所有源文件都必须有 .asm 后缀，所以用编辑器创建文件 hw64.asm 并输入以下内容：

includelib kernel32.lib

        extrn __imp_GetStdHandle:proc
        extrn __imp_WriteFile:proc

        .CODE
hwStr   byte    "Hello World!"
hwLen   =       $-hwStr

main    PROC

; On entry, stack is aligned at 8 mod 16. Setting aside 8
; bytes for "bytesWritten" ensures that calls in main have
; their stack aligned to 16 bytes (8 mod 16 inside function).

        lea     rbx, hwStr
        sub     rsp, 8
        mov     rdi, rsp        ; Hold # of bytes written here

; Note: must set aside 32 bytes (20h) for shadow registers for
; parameters (just do this once for all functions). 
; Also, WriteFile has a 5th argument (which is NULL), 
; so we must set aside 8 bytes to hold that pointer (and
; initialize it to zero). Finally, the stack must always be 
; 16-byte-aligned, so reserve another 8 bytes of storage
; to ensure this.

; Shadow storage for args (always 30h bytes).

        sub     rsp, 030h 

; Handle = GetStdHandle(-11);
; Single argument passed in ECX.
; Handle returned in RAX.

        mov     rcx, -11        ; STD_OUTPUT
        call    qword ptr __imp_GetStdHandle 

; WriteFile(handle, "Hello World!", 12, &bytesWritten, NULL);
; Zero out (set to NULL) "LPOverlapped" argument:

        mov     qword ptr [rsp + 4 * 8], 0  ; 5th argument on stack

        mov     r9, rdi         ; Address of "bytesWritten" in R9
        mov     r8d, hwLen      ; Length of string to write in R8D 
        lea     rdx, hwStr      ; Ptr to string data in RDX
        mov     rcx, rax        ; File handle passed in RCX
        call    qword ptr __imp_WriteFile
        add     rsp, 38h
        ret
main    ENDP
        END

这个（纯）汇编语言程序没有提供解释。书中的各个章节会解释机器指令。

回看源代码，你会看到第一行如下：

includelib kernel32.lib

kernel32.lib 是一个 Windows 库，其中包含了此汇编语言程序使用的 GetStdHandle 和 WriteFile 函数。Visual Studio 安装包中包含了此文件，并且 vcvars64.bat 文件应该会将它放入包含路径中，以便链接器能够找到它。如果你在汇编和链接程序（在下一步中）时遇到问题，只需复制此文件（无论你在 Visual Studio 安装中找到它的位置），并将该副本包含在你构建 hw64.asm 文件的目录中。

要编译（组装）这个文件，打开命令窗口（即之前创建的快捷方式）以获取命令提示符。然后输入以下命令：

ml64 hw64.asm /link /subsystem:console /entry:main

假设你没有输入错误，命令窗口应输出类似以下内容：

C:\MASM64>ml64 hw64.asm /link /subsystem:console /entry:main
Microsoft (R) Macro Assembler (x64) Version 14.28.29910.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: hw64.asm
Microsoft (R) Incremental Linker Version 14.28.29910.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/OUT:hw64.exe
hw64.obj
/subsystem:console
/entry:main

你可以通过在命令行提示符下输入命令hw64来运行此汇编产生的 hw64.exe 输出文件。输出应如下所示：

C:\MASM64>hw64
Hello World!

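Appendix D

The Windows Command Line Interpreter

Microsoft's MASM (the Microsoft Macro Assembler) is primarily a tool you use from the Windows command line. Therefore, to use MASM properly (at least for all the examples in this book), you need to be comfortable with the Windows command line interpreter (CLI).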

附录 C 显示了如何设置 Windows CLI 以便你能够使用它。本附录简要描述了一些你将在 CLI 中使用的常见命令。

D.1 命令行语法

一个基本的 Windows CLI 命令的格式是

`command options`

其中 command 是一个内置 CLI 命令、一个磁盘上的可执行程序（通常带有 .exe 文件后缀）或一个批处理文件名（带有 .bat 后缀），options 是该命令的零个或多个选项，选项是特定于命令的。

本书中最常见的命令行可执行程序示例可能是 ml64.exe 程序（MASM 汇编器）。微软的链接器 (link.exe)、库文件管理器 (lib.exe)、nmake (nmake.exe) 和 MSVC 编译器 (cl.exe) 也是你可能从命令行运行的可执行程序示例。

本书中的所有示例程序也是你可以从命令行运行的命令。例如，下面的命令执行 build.bat 批处理文件，以构建 listing2-1.exe 可执行文件（来自第二章）：

build listing2-1

在构建 listing2-1.exe 可执行文件后，你可以从命令行运行它。以下是命令及其产生的输出：

C:\>listing2-1
Calling Listing 2-1:
i=1, converted to hex=1
j=123, converted to hex=7b
k=456789, converted to hex=6f855
Listing 2-1 terminated

listing2-1.exe 可执行文件不支持任何命令行选项。如果你在命令行中输入 listing2-1 命令后面跟着任何内容，listing2-1.exe 程序将忽略这些文本。

虽然大多数选项是特定于命令的，但你可以将某些命令行选项应用于你从命令行运行的大多数程序：特别是 I/O 重定向。许多控制台应用程序将数据写入 标准输出设备（控制台窗口）。例如，本书中所有的 print 和 printf 函数调用都会将数据写入标准输出设备。通常，所有发送到标准输出设备的输出都会作为文本显示在命令行（控制台）窗口中。

然而，你可以通过使用 输出重定向选项 告诉 Windows 将数据发送到一个文件（甚至是另一个设备）。输出重定向选项的格式是

`command options` >`filename more_options`

其中 command 是命令名称，options 和 more_options 是零个或多个命令行选项（不包含输出重定向选项），filename 是你希望将 command 的输出发送到的文件名。请考虑以下命令行：

listing2-1 >listing2-1.txt

执行此命令不会产生任何显示输出。然而，你会发现该命令在磁盘上创建了一个新的文本文件。该文本文件将包含 listing2-1.exe 程序的输出（如前所示）。

Windows CLI 还支持使用以下语法进行标准输入重定向。

`command options` <`filename more_options`

其中，command是命令名称，options和more_options是零个或多个命令行选项（不包含输入重定向选项），filename是command将从中读取输入的文件名。

输入重定向使得一个通常从用户（键盘，即标准输入设备）读取数据的程序，转而从文本文件中读取数据。例如，假设你之前执行了listing2-1命令，并将输出重定向到listing2-1.txt输出文件。考虑以下命令（来自第一章），该命令从用户读取一行文本（在这个特定的例子中，我输入了hello来响应程序的输入请求）：

C:\>build listing1-8
C:\>echo off
 Assembling: listing1-8.asm
c.cpp
C:\>listing1-8
Calling Listing 1-8:
Enter a string: hello
User entered: 'hello'
Listing 1-8 terminated

现在考虑以下命令：

C:\>listing1-8 <listing2-1.txt
Calling Listing 1-8:
Enter a string: User entered: 'Calling Listing 2-1:'
Listing 1-8 terminated

在这个例子中，输入从先前执行的listing2-1.exe生成的listing2-1.txt文件重定向。listing1-8.exe程序将该文件的第一行作为输入读取（而不是从键盘读取一行文本）。程序不会回显从文件读取的文本（包括换行符）；这就是为什么User entered: 'Calling Listing 2-1:'文本出现在与Enter a string:提示相同的一行上的原因。当实际从键盘读取数据时，系统会将数据回显到显示屏上（包括换行符）。而在从文件重定向输入时则不会发生这种情况。

文件包含多行文本。然而，listing1-8.exe只读取一行文本，因此它忽略了listing2-1.txt文件中的其余行。
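The behavior described above (one line is read from standard input and the rest of the redirected file is ignored) can be sketched in C. This is a hypothetical stand-in, not the book's Listing 1-8; the function names `read_one_line` and `prompt_and_echo` are assumptions made for this example.

```c
#include <stdio.h>
#include <string.h>

/* Read one line from `in` into buf (at most size - 1 characters) and
   drop the trailing newline. Returns 1 on success, 0 on EOF or error. */
int read_one_line(FILE *in, char *buf, size_t size)
{
    if (size == 0 || fgets(buf, (int)size, in) == NULL) {
        return 0;
    }
    buf[strcspn(buf, "\r\n")] = '\0';
    return 1;
}

/* Prompt, read a single line, and report it. Any further lines that
   `in` could supply are simply never read. */
int prompt_and_echo(FILE *in, FILE *out)
{
    char line[256];

    fputs("Enter a string: ", out);
    if (!read_one_line(in, line, sizeof line)) {
        return 0;
    }
    fprintf(out, "User entered: '%s'\n", line);
    return 1;
}
```

A program whose main function calls `prompt_and_echo(stdin, stdout)` reads from the keyboard when run normally and from a file when started with `<filename`; the code itself does not change, because the redirection is set up by the CLI before the program starts.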

你可以在同一个命令中同时重定向标准输入和标准输出。考虑以下情况：

C:\>listing1-8 <listing2-1.txt >listing1-8.txt

这段代码从listing2-1.txt文件读取数据，并将所有输出发送到listing1-8.txt文件。

当将程序的输出重定向到文本文件时，如果输出文件已经存在，Windows 会在写入标准输出文本之前删除该文件。你还可以通过使用以下输出重定向语法（使用两个大于符号）指示 Windows 将命令的输出附加到现有文件中：

`command options` >>`filename more_options`

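Apart from redirection options, command line options are usually filenames (for example, ml64 mySource.asm) or options that control the command's behavior (for example, ml64's /c or /Fl options, which appear throughout this book). By convention, most Windows CLI commands prefix actual options (as opposed to filenames) with a slash character (/). This is a convention, not a hard requirement.

Some commands instead use the Unix convention of a dash or hyphen (-), either in place of or in addition to the slash character. This is an application-specific choice; check the documentation for the particular program you are using. All built-in CLI commands, and most Microsoft CLI programs, use the slash character to specify options.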

D.2 目录名称和驱动器字母

许多命令接受或要求文件或目录的路径名作为命令行选项。路径名由两个主要部分组成：驱动器字母和目录或文件的路径名。驱动器字母是一个字母（A 到 Z），后跟一个冒号；例如：

A: B: C: etc.

驱动器字母不区分大小写。A: 等同于命令行中的 a:。Windows 为软盘驱动器保留了 A: 和 B: 字母。由于现代机器上不常见软盘驱动器，所以你通常不会使用这些驱动器字母。然而，如果你有一台非常旧的机器……

C: 是启动驱动器的默认驱动器字母。如果你的机器只有一个硬盘（或固态硬盘），Windows 很可能会将 C: 关联到该驱动器。书中出现的所有示例假设你正在使用 C: 驱动器（尽管这并非强制要求）。

如果你有多个驱动器（无论是多个物理驱动器单元，还是你将硬盘划分为多个逻辑驱动器），Windows 通常会将连续的驱动器字母（D:, E: 等）与这些附加驱动器关联。如果你愿意，可以重新分配驱动器字母，因此无法保证所有驱动器字母在字母表中是连续的。

你可以通过在命令行中输入字母和冒号来切换默认驱动器。例如，

D:

会将默认驱动器切换到 D:，前提是该驱动器存在。如果该驱动器不存在，Windows 会报错并表示无法找到指定的驱动器，同时不会更改默认驱动器。

通常情况下（你可以更改此设置），Windows 会将当前的驱动器字母显示为命令行提示符的一部分（默认情况下，它也会显示默认的路径名）。例如，典型的 Windows 命令行提示符看起来像这样：

C:\>

命令提示符中出现的 \ 字符表示当前（默认）目录。在这种情况下，单独的 \ 表示 C: 驱动器上的根（或主）目录。如果当前目录是其他目录，Windows 会在驱动器字母后列出该目录。例如，如果当前目录是 \WINDOWS，CLI 会将以下内容显示为命令行提示符：

C:\WINDOWS>

正如你可能知道的，Windows 有一个层次化的文件系统，允许在（子）目录内部创建子目录。反斜杠字符用于分隔完整路径名中的目录名称。你常常会在 Windows 中看到两种路径形式：完整路径名和部分路径名。

完整路径名以反斜杠（\）字符开始，并从根目录开始。部分路径名不以反斜杠开始，路径从当前（默认）目录开始（部分路径名中的第一个子目录必须出现在当前默认子目录中）。

空格通常用于分隔命令行上的选项。如果路径名中包含空格，你必须用引号将整个路径名括起来；例如：

"\This\Path name\has\a\space"

CLI 支持路径名中的一对通配符字符。星号字符（*）将匹配零个或多个字符。问号字符（?）将匹配零个或一个字符。

命令必须明确支持通配符字符；Windows CLI 命令支持通配符选项，大多数 Microsoft 工具也支持（例如，ml64.exe）。然而，并非所有可执行文件都支持文件名中的通配符。通配符字符可以用于目录名和文件名，但它们不会替代路径名中的反斜杠字符（\）。

D.3 一些有用的内建命令

Windows CLI 包含许多内建命令（这些命令是 cmd.exe 程序的一部分，无需单独的 .exe 或 .bat 文件）。内建命令太多，无法一一列举（而且你也不会使用大部分命令）；因此，本节只介绍最常用的一小部分命令。

D.3.1 cd 和 chdir 命令

cd（change directory）命令将默认目录切换到你作为命令行选项指定的目录。请注意，chdir 是 cd 的同义词。其语法是：

cd `directory_name`

其中，directory_name 是新目录的完整或部分路径名。例如：

cd \masm32\examples

即使你在路径名中指定了驱动器字母，cd 命令通常不会改变默认的驱动器字母。例如，如果当前驱动器字母是 D:，则以下命令不会直接改变默认的驱动器字母和路径名：

D:\>cd C:\masm32\examples
D:\>

请注意，cd 命令执行后，命令提示符仍然是 D:\>。然而，如果你切换到 C: 驱动器（使用 C: 命令），Windows 会根据之前的命令设置默认目录：

D:\>C:
C:\masm32\examples>

如你所见，默认目录与驱动器字母相关联（每个驱动器字母都有自己的默认目录）。

如果你想用 cd 命令同时切换驱动器字母和路径名，只需在路径名前添加 /d 选项：

D:\>cd /d C:\masm32\examples
C:\masm32\examples>

切记，如果路径名中包含空格，使用 cd 命令时必须将路径名用引号括起来：

cd /d "C:\program files"

以下内容显示有关 cd 命令的帮助信息：

cd /?

如果你单独使用 cd 命令（没有命令行参数），此命令会显示当前（默认）路径名。

D.3.2 cls 命令

cls 命令清除屏幕（至少是命令窗口）。当你在编译之前希望清屏，并且只想看到与该次编译相关的消息时，这非常有用。

D.3.3 `copy` 命令

copy 命令将一个或多个文件复制到其他位置。通常，使用此命令可以在当前目录中创建文件的备份副本，或将文件复制到其他子目录中。该命令的语法如下：

copy `source_filename destination_filename`

该命令复制由 source_filename 指定的文件，并将其副本命名为 destination_filename。这两个名称可以是完整的或部分的路径名。

copy 命令支持多个命令行选项（除了源文件和目标文件名）。你可能不会经常使用这些选项。要了解更多详细信息，可以执行以下帮助命令：

copy /?

D.3.4 `date` 命令

date 命令本身会显示当前的系统日期，并提示你输入一个新的日期（该操作将永久设置系统日期——使用时请小心！）。若使用 /t 命令行选项，此命令只会显示日期，而不会要求你更改日期。示例如下：

C:\>date /t
Sat 02/23/2019

像往常一样，date /? 会显示此命令的帮助信息。

D.3.5 `del`（`erase`）命令

del 命令（erase 是 del 的同义词）将删除你指定的文件（或文件们），这些文件是作为命令行选项提供的。其语法为：

del `options files_to_delete`

其中，options 是以斜杠开头的命令行选项，files_to_delete 是要删除的文件名（路径名）列表，文件名之间用空格或逗号分隔。此命令支持通配符字符；例如，以下命令会删除当前目录中所有的 *.obj 文件：

del *.obj

不用说，使用此命令时要非常小心，尤其是在使用通配符字符时。例如，考虑以下命令（这可能是一个拼写错误）：

del * .obj

该命令删除当前目录中的所有文件，然后尝试删除名为 *.obj 的文件（该文件在命令删除子目录中的所有文件后将不存在）。

此命令与一些有用的命令行选项相关联。使用 /? 选项了解它们：

C:\>del /?

D.3.6 `dir` 命令

dir（目录）命令是最有用的命令行工具之一。它显示目录列表（即目录中的文件列表）。

如果没有任何命令行选项，该命令将显示当前目录下的所有文件。如果作为参数仅提供一个驱动器字母（加冒号），该命令将显示指定驱动器上默认目录中的所有文件。如果提供了指向子目录的路径名，该命令将显示指定目录中的所有文件。如果提供了指向单个文件名的路径名，该命令将显示该文件的目录信息。

像往常一样，该命令支持多个以斜杠字符开头的命令行选项。使用 dir /? 可以查看此命令的帮助信息。

D.3.7 `more` 命令

more 命令一次显示文本文件中的一屏内容。显示完一屏内容后，程序会等待用户按下回车或空格键。按下空格键会显示下一屏的内容；按下回车键会显示下一行内容。按下 Q 键则终止程序。

more 命令在命令行中需要指定一个或多个文件名作为参数。如果您指定了两个或更多文件，more 将按顺序显示输出。more 命令还支持多个命令行选项。您可以使用以下命令了解它们：

more /?

D.3.8 `move` 命令

move 命令将文件从一个位置移动到另一个位置（可能在移动的过程中重命名文件）。它类似于 copy，但 move 会在移动后删除源位置的文件。此命令的基本语法如下：

move `original_file new_file`

像往常一样，/? 命令行选项提供该命令的帮助信息。

D.3.9 `ren` 和 `rename` 命令

ren 命令（rename 是其同义词）用于更改文件名。其语法为

ren `original_filename new_filename`

其中（显然）original_filename 是您希望更改的旧文件名，new_filename 是您希望使用的新文件名。新旧文件必须位于同一目录中。如果您希望在重命名文件的同时将其移动到新目录，请使用 move 命令。

D.3.10 `rd` 和 `rmdir` 命令

rd 命令（rmdir 是其同义词）用于删除（移除）一个目录。在使用此命令之前，目录必须为空（尽管 /s 选项可以覆盖此要求）。此命令的基本语法是

rd `directory_path`

其中 directory_path 是您希望删除的目录的路径。使用 rd /? 命令获取帮助信息。

D.3.11 `time` 命令

如果没有参数，time 命令会显示当前系统时间并提示您更改时间。使用 /t 命令行参数，time 只会显示当前时间。使用 /? 显示该命令的帮助信息。

D.4 获取更多信息

本附录仅提供了 Windows 命令行解释器的最基础介绍——足以使用 MASM 编译和运行汇编语言程序。CLI 支持数十个内置命令（可能超过一百个）。了解这些命令的一个地方是 docs.microsoft.com/en-us/windows-server/administration/windows-commands/cmd/.

Appendix E

Answers to Questions

E.1 第一章问题的答案

cmd.exe
ml64.exe
地址、数据和控制
AL、AH、AX 和 EAX
BL、BH、BX 和 EBX
SIL、SI 和 ESI
R8B、R8W 和 R8D
FLAGS、EFLAGS 或 RFLAGS
(a) 2, (b) 4, (c) 16, (d) 32, (e) 8
任何 8 位寄存器和任何可以用 8 位表示的常量
32
| 目的地 | 常量大小 |
| --- | --- |
| RAX | 32 |
| EAX | 32 |
| AX | 16 |
| AL | 8 |
| AH | 8 |
| mem[32] | 32 |
| mem[64] | 32 |
64
任何内存操作数都可以工作，无论其大小如何。
call
ret
应用程序二进制接口
(a) AL, (b) AX, (c) EAX, (d) RAX, (e) XMM0, (f) RAX
RCX 用于整数操作数，XMM0 用于浮点/向量操作数
RDX 用于整数操作数，XMM1 用于浮点/向量操作数
R8 用于整数操作数，XMM2 用于浮点/向量操作数
R9 用于整数操作数，XMM3 用于浮点/向量操作数
dword 或 sdword
qword

E.2 第二章问题的答案

9 × 10³ + 3 × 10² + 8 × 10¹ + 4 × 10⁰ + 5 × 10^(-1) + 7 × 10^(-2) + 6 × 10^(-3)
(a) 10, (b) 12, (c) 7, (d) 9, (e) 3, (f) 15
(a) A, (b) E, (c) B, (d) D, (e) 2, (f) C, (g) CF, (h) 98D1
(a) 0001_0010_1010_1111, (b) 1001_1011_1110_0111, (c) 0100_1010, (d) 0001_0011_0111_1111, (e) 1111_0000_0000_1101, (f) 1011_1110_1010_1101, (g) 0100_1001_0011_1000
(a) 10, (b) 11, (c) 15, (d) 13, (e) 14, (f) 12
(a) 16, (b) 64, (c) 128, (d) 32, (e) 4, (f) 8, (g) 4
(a) 2, (b) 4, (c) 8, (d) 16
(a) 16, (b) 256, (c) 65,536, (d) 2
4
0 到 7
位 0
位 31
(a) 0, (b) 0, (c) 0, (d) 1
(a) 0, (b) 1, (c) 1, (d) 1
(a) 0, (b) 1, (c) 1, (d) 0
1
AND
OR
NOT
XOR
not
1111_1011
0000_0010
(a) 和 (c) 和 (e)
neg
(a) 和 (c) 和 (d)
jmp
label:
进位、溢出、零、符号
JZ
JC
JA, JAE, JBE, JB, JE, JNE（以及同义词 JNA、JNAE、JNB、JNBE，另有其他同义词）
JG、JGE、JL、JLE、JE、JNE（以及同义词 JNG、JNGE、JNL 和 JNLE）
如果移位的结果为 0，则 ZF = 1。
从操作数中移出的 HO 位进入进位标志。
如果下一个 HO 位与移位前的 HO 位不同，OF 会被设置；否则，它会被清除，但仅适用于 1 位移位。
SF 被设置为结果的 HO 位。
如果移位的结果为 0，则 ZF = 1。
从操作数中移出的 LO 位进入进位标志。
如果下一个 HO 位与移位前的 HO 位不同，OF 会被设置；否则，它会被清除，但仅适用于 1 位移位。
在 SHR 指令之后，SF 始终被清除，因为一个 0 总是被移入结果的 HO 位。
如果移位的结果为 0，则 ZF = 1。
从操作数中移出的 LO 位进入进位标志。
SAR 指令之后，OF 总是清除，因为符号不可能发生变化。
SF 被设置为结果的 HO 位，尽管从技术上讲它永远不会改变。
从操作数中移出的 HO 位进入进位标志。
它不会影响 ZF。
从操作数中移出的 LO 位进入进位标志。
它不会影响符号标志。
乘以 2
除以 2
乘法和除法
将它们相减并检查其差值是否小于一个小的误差值。
在 HO 尾数位置上有 1 位的值
7
30h 到 39h
撇号和引号
UTF-8、UTF-16 和 UTF-32
一个标量整数值，表示一个 Unicode 字符
一块 65,536 个不同的 Unicode 字符

E.3 第三章问题的答案

RIP
操作码，机器指令的数字编码
静态和标量变量
±2GB
要访问的内存位置的地址
RAX
lea
在完成所有寻址模式计算后获得的最终地址
1、2、4 或 8
2GB 总内存
你可以使用 VAR[REG]寻址模式直接访问数组的元素，使用 64 位寄存器作为数组的索引，而无需首先将数组的地址加载到单独的基寄存器中。
.data部分可以保存已初始化的数据值；.data?部分只能包含未初始化的变量。
.code 和 .const
.data和.data?
指向特定部分的偏移量（例如，.data）
使用some_ID label some_type来告知 MASM 以下数据的类型是some_type，尽管实际上它可能是另一种类型。
MASM 将它们合并为一个单独的部分。
使用align 8语句。
内存管理单元
如果b位于 MMU 页的最后一个字节处且下一个页面不可读，从以b开头的内存位置加载一个字会产生一般保护错误。
一个常量表达式加上变量在内存中的基地址
将以下操作数类型强制转换为另一种类型
小端值在内存中以其 LO 字节位于最低地址，HO 字节位于最高地址的形式出现。大端值则相反：它们的 HO 字节出现在最低地址，LO 字节出现在内存中的最高地址。
xchg al, ah 或 xchg ah, al
bswap eax
bswap rax
(a) 从 RSP 中减去 8，(b) 将 RAX 中的值存储到 RSP 指向的位置。
(a) 从 RSP 指向的 8 个字节中加载 RAX，(b) 将 8 加到 RSP。
反转
后进先出
使用[RSP ± const]寻址模式将数据从栈中移动进出。
Windows ABI 要求栈在 16 字节边界上对齐；推送 RAX 可能会使栈在 8 字节（而不是 16 字节）边界上对齐。

E.4 第四章问题的答案

imul reg, constant
imul destreg, srcreg, constant
imul destreg, srcreg
一个符号（命名）常量，MASM 将在源文件中每次出现该名称时替换为文字常量。
=, equ, textequ
文本等式替换为可以是任何文本的字符串；数值等式必须分配一个可以用 64 位整数表示的数值常量。
使用文本分隔符<和>包围字符串字面量；例如，<"a long string">。
MASM 在汇编过程中可以计算的算术表达式
lengthof。
当前段的偏移量。
this 和 $。
使用常量表达式 $-startingLocation。
使用一系列（数字）等式，每个连续的等式的值设置为前一个等式的值加一；例如：
```
val1 = 0
val2 = val1 + 1
val3 = val2 + 1
etc.
```
使用 typedef 指令。
指针是内存中的一个变量，它保存另一个内存对象的地址。
将指针变量加载到一个 64 位寄存器中，并使用寄存器间接寻址模式来引用该地址。
使用 qword 数据声明，或其他 64 位大小的数据类型。
offset 操作符。
(a) 未初始化的指针，(b) 使用指针保存非法值，(c) 在存储已被释放后使用指针（悬空指针），(d) 在不再使用存储后未释放存储（内存泄漏），(e) 使用错误的数据类型访问间接数据。
在存储已被释放后使用指针。
未能在使用完存储后释放它。
由较小的数据对象组成的聚合类型。
一个以 0 字节（或其他 0 值）结尾的字符序列。
一个包含长度值作为第一个元素的字符串。
描述符是一种数据类型，包含一个指向字符数据的指针、字符串长度以及可能描述字符串数据的其他信息。
一种同质的数据元素集合（所有元素类型相同）。
数组第一个元素的内存地址。
array byte 10 dup (?)（作为示例）。
只需将初始值填写为字节、字、双字或其他数据声明指令的操作数字段。此外，你还可以使用一个或多个常量值作为dup操作符的操作数；例如，5 dup (2, 3)。
(a) base_address + index * 4（4 是元素大小），(b) W[i,j] = base_address + (i * 8 + j) * 2（2 是元素大小），(c) R[i,j,k] = base_address +(((i * 4) + j) * 6 + k) * 8（8 是元素大小）。
一种多维数组的组织方式，在这种方式中，你将每一行的元素存储在连续的内存位置中，然后将每一行按顺序存储在内存中。
一种多维数组的组织方式，在这种方式中，你将每一列的元素存储在连续的内存位置中，然后将每一列按顺序存储在内存中。
W word 4 dup (8 dup (?))
一种异质的数据元素集合（每个字段可能有不同的类型）。
struct 和 ends。
点操作符。
一种异质的数据元素集合（每个字段可能有不同的类型）；联合体中每个字段的偏移量从 0 开始。
union 和 ends。
记录和结构体的字段在结构体内按顺序出现在连续的内存位置（每个字段都有自己的字节块）；而联合体的字段彼此重叠，每个字段都从联合体中的偏移量零开始。
一个未命名的联合体，它的字段被视为外部结构体的字段。
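The row-major address formulas given in the answers above for W and R can be cross-checked against the way a C compiler lays out its own arrays, because C also stores multidimensional arrays in row-major order. The following is a sketch under that assumption; the function names are hypothetical.

```c
#include <stddef.h>

/* Byte offset of element [i][j] in a row-major array with `cols` columns. */
size_t offset_2d(size_t i, size_t j, size_t cols, size_t elem_size)
{
    return (i * cols + j) * elem_size;
}

/* Byte offset of element [i][j][k] in a row-major array whose second and
   third dimensions have dim2 and dim3 elements, respectively. */
size_t offset_3d(size_t i, size_t j, size_t k,
                 size_t dim2, size_t dim3, size_t elem_size)
{
    return ((i * dim2 + j) * dim3 + k) * elem_size;
}
```

For W (8 columns, 2-byte elements), `offset_2d(i, j, 8, 2)` reproduces (i * 8 + j) * 2, and for R (inner dimensions 4 and 6, 8-byte elements), `offset_3d(i, j, k, 4, 6, 8)` reproduces ((i * 4 + j) * 6 + k) * 8. Adding the base address of the array to either offset gives the element's address.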

E.5 第五章问题的答案。

它将返回地址推送到栈上（调用后下一条指令的地址），然后跳转到操作数指定的地址。
它从栈中弹出一个返回地址，并将地址移动到 RIP 寄存器，将控制转移到调用当前过程后面的指令。
弹出返回地址后，CPU 将此值加到 RSP 中，从栈中移除相应字节的参数。
紧接着调用过程指令的地址
命名空间污染发生在源文件中定义了太多符号、标识符或名称，以至于在该源文件中很难选择新的、唯一的名称。
在名称后加两个冒号；例如，id::。
在过程之前使用option noscoped指令
在进入过程时使用push指令将寄存器值保存在栈上；然后使用pop指令在从过程返回之前立即恢复寄存器值。
代码难以维护。（其次的问题，虽然不大，是它占用更多空间。）
性能——因为你经常保存一些调用代码不需要保存的寄存器
当子程序尝试返回时，它会使用你在栈上留下的垃圾作为返回地址，这通常会产生未定义的结果（程序崩溃）。
子程序使用调用前栈上任何存在的内容作为返回地址，结果是未定义的。
一组与过程调用（激活）相关的数据，包括参数、局部变量、返回地址和其他项目
RBP
8 字节（64 位）

push rbp
mov  rbp, rsp
sub  rsp, sizeOfLocals ; Assuming there are local variables

```
leave
ret
```
and rsp, -16
源文件中的一部分（通常是过程的主体），在程序中符号可见且可用
从为变量分配存储空间开始，到系统释放该存储空间为止
进入代码块（通常是过程）时自动分配存储的变量，并在退出该代码块时自动释放该存储
进入过程时（或与自动变量关联的代码块）
使用textequ指令或 MASM 本地指令
var1: –2；local2: –8（MASM 将变量对齐到 dword 边界）；dVar: –9；qArray: –32（数组的基地址是最低的内存地址）；rlocal: –40（数组的基地址是最低的内存地址）；ptrVar: –48
option prologue:PrologueDef 和 option epilogue:EpilogueDef。还应该提供 option prologue:none 和 option epilogue:none 来关闭此功能。
在 MASM 生成过程代码之前，在所有本地指令之后
每当出现ret指令的地方
实际参数的值
实际参数值的内存地址
RCX, RDX, R8 和 R9（或这些寄存器的较小子组件）
XMM0, XMM1, XMM2 或 XMM3
在栈上，位于为寄存器传递的参数预留的阴影位置（32 字节）之上
程序可以自由修改易失性寄存器，而无需保留其值；但必须在过程调用之间保留非易失性寄存器的值。
RAX、RCX、RDX、R8、R9、R10、R11、XMM0、XMM1、XMM2、XMM3、XMM4、XMM5，以及所有 YMM 和 ZMM 寄存器的 HO 128 位
RBX、RSI、RDI、RBP、RSP、R12、R13、R14、R15 和 XMM6–XMM15。并且，返回过程时方向标志必须被清除。
使用来自 RBP 寄存器的正偏移量
为调用者通过 RCX、RDX、R8 和 R9 寄存器传递的参数在栈上保留的存储空间
32 字节
32 字节
32 字节
parm1：RBP + 16；parm2：RBP + 24；parm3：RBP + 32；parm4：RBP + 40
```
mov rax, parm4
mov al, [rax]
```
lclVar1：RBP – 1；lclVar2：RBP – 4（对齐到 2 字节边界）；lclVar3：RBP – 8；lclVar4：RBP – 16
通过引用
应用程序二进制接口
在 RAX 寄存器中
作为参数传递的过程的地址
间接地。通常通过使用call parm指令，其中parm是过程参数，一个包含过程地址的 qword 变量。你也可以将参数值加载到一个 64 位寄存器中，通过该寄存器间接调用过程。
分配本地存储空间以保存需要保留的寄存器值，并在过程入口时将寄存器数据移入该存储空间，然后在从过程返回前将数据移回寄存器。

E.6 第六章问题的答案

对于 8 位操作数使用 AL，16 位操作数使用 AX，32 位操作数使用 EAX，64 位操作数使用 RAX
8 位mul操作：16 位；16 位mul操作：32 位；32 位mul操作：64 位；64 位mul操作：128 位。CPU 将乘积存放在 AX 中用于 8×8 的乘积，DX:AX 用于 16×16 的乘积，EDX:EAX 用于 32×32 的乘积，RDX:RAX 用于 64×64 的乘积。
商存放在 AL、AX、EAX 或 RAX 中，余数存放在 AH、DX、EDX 或 RDX 中
将 AX 符号扩展到 DX。
将 EAX 零扩展到 EDX。
除以 0 并产生一个无法适应累加器寄存器（AL、AX、EAX 或 RAX）的商
通过设置进位标志和溢出标志
它们会打乱标志；也就是说，它们会将标志置于未定义的状态。
扩展精度的imul指令生成一个 2 × n位的结果，使用隐式操作数（AL、AX、EAX 和 RAX），并修改 AH、DX、EDX 和 RDX 寄存器。此外，扩展精度的imul指令不允许常量操作数，而通用的imul指令则允许。
cbw、cwd、cdq、cqo
它们会打乱所有标志，留下未定义的状态。
如果两个操作数相等，则设置零标志。
如果第一个操作数小于第二个操作数，则设置进位标志。
如果第一个操作数小于第二个操作数，则符号标志和溢出标志不同；如果第一个操作数大于或等于第二个操作数，则它们相同。
一个 8 位寄存器或内存位置
They set their operand to 1 if the condition is true and to 0 if it is false.
test 指令与 and 指令相同，唯一的不同是它不将结果存储到目标（第一个）操作数，而只是设置标志。
它们都以相同的方式设置条件码标志。
将要测试的操作数作为第一个（目标）操作数，并提供一个包含单个 1 位的立即常数，该位位于要测试的位位置。测试指令执行后，零标志将包含所需位的状态。

以下是一些可能的解决方案，并非唯一解：

x = x + y

mov eax, x
add eax, y
mov x, eax

x = y – z

mov eax, y
sub eax, z
mov x, eax

x = y * z

mov  eax, y
imul eax, z
mov  x, eax

x = y + z * t

mov  eax, z
imul eax, t
add  eax, y
mov  x, eax

x = (y + z) * t

mov  eax, y
add  eax, z
imul eax, t
mov  x, eax

x = -((x*y)/z)

mov  eax, x
imul y          ; Note: Sign-extends into EDX
idiv z
neg  eax        ; Apply the unary minus to the quotient
mov  x, eax

x = (y == z) && (t != 0)

mov   eax, y
cmp   eax, z
sete  bl
cmp   t, 0
setne bh
and   bl, bh
movzx eax, bl   ; Because x is a 32-bit integer
mov   x, eax

以下是一些可能的解决方案，并非唯一解：

x = x * 2

shl   x, 1

x = y * 5

mov   eax, y
lea   eax, [eax][eax*4]
mov   x, eax

这里是另一种解决方案：

mov   eax, y
mov   ebx, eax
shl   eax, 2
add   eax, ebx
mov   x, eax

x = y * 8

mov   eax, y
shl   eax, 3
mov   x, eax

x = x /2

shr   x, 1

x = y / 8

mov   ax, y
shr   ax, 3
mov   x, ax

x = z / 10

movzx eax, z
imul  eax, 6554  ; Or 6553
shr   eax, 16
mov   x, ax
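The multiply-and-shift sequence above relies on 6554 being approximately 65536 / 10, so multiplying by 6554 and discarding the LO 16 bits approximates division by 10. The following C sketch mirrors that instruction sequence and searches for the first 16-bit input where the approximation differs from true division. The function names are assumptions made for this example.

```c
#include <stdint.h>

/* Mirror of: movzx eax, z / imul eax, 6554 / shr eax, 16 */
uint16_t div10_by_multiply(uint16_t z)
{
    uint32_t product = (uint32_t)z * 6554u;  /* fits in 32 bits for any z */
    return (uint16_t)(product >> 16);
}

/* Return the smallest 16-bit z for which the approximation differs from
   z / 10, or -1 if the two agree for every 16-bit value. */
int32_t first_mismatch(void)
{
    uint32_t z;

    for (z = 0; z <= 65535u; z++) {
        if (div10_by_multiply((uint16_t)z) != z / 10u) {
            return (int32_t)z;
        }
    }
    return -1;
}
```

Because 6554 is slightly larger than 6553.6, the product overshoots by 0.4 * z / 65536. The overshoot first pushes a result up to the next integer when z ends in the digit 9 and z is at least 16384, that is, at z = 16389; below that the sequence matches z / 10 exactly. This is worth keeping in mind before using the trick on values that can span the full 16-bit range.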

x = x + y

fld   x
fld   y
faddp
fstp  x

x = y – z

fld   y
fld   z
fsubp
fstp  x

x = y * z

fld   y
fld   z
fmulp
fstp  x

x = y + z * t

fld   y
fld   z
fld   t
fmulp
faddp
fstp  x

x = (y + z) * t

fld   y
fld   z
faddp
fld   t
fmulp
fstp  x

x = -((x * y)/z)

fld   x
fld   y
fmulp
fld   z
fdivp
fchs
fstp  x

x = x + y

movss xmm0, x
addss xmm0, y
movss x, xmm0

x = y – z

movss xmm0, y
subss xmm0, z
movss x, xmm0

x = y * z

movss xmm0, y
mulss xmm0, z
movss x, xmm0

x = y + z * t

movss xmm0, z
mulss xmm0, t
addss xmm0, y
movss x, xmm0

b = x < y

fld    y
fld    x
fcomip st(0), st(1)
setb   b
fstp   st(0)

b = x >= y && x < z

fld    y
fld    x
fcomip st(0), st(1)
setae  bl
fstp   st(0)
fld    z
fld    x
fcomip st(0), st(1)
setb   bh
fstp   st(0)
and    bl, bh
mov    b, bl

E.7 第七章问题的答案

使用 lea 指令或 offset 操作符。
option noscoped
option scoped
jmp reg64 和 jmp mem64
维护历史信息的代码段，无论是通过变量还是程序计数器
如果跳转助记符的第二个字母是 n，则移除 n；否则，插入 n 作为第二个字符。
jpo 和 jpe
用于扩展跳转或调用指令范围的短代码序列，超出 ±2GB 范围
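cmovcc reg, src, where cc is one of the condition suffixes (the same ones that follow the j in conditional jumps), reg is a 16-, 32-, or 64-bit register, and src is a source register or memory location of the same size as reg.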
你可以通过使用条件跳转来有条件地执行一大组不同类型的指令，而无需控制转移的时间开销。
目标必须是寄存器，且不允许使用 8 位寄存器。
布尔表达式的完全求值会评估表达式的所有组成部分，即使从逻辑上看不需要这样做；短路求值在确定表达式必须为真或假时会立即停止。

if(x == y || z > t)
{
    `Do something` 
}
    mov  eax, x
    cmp  eax, y
    sete bl
    mov  eax, z
    cmp  eax, t
    seta bh
    or   bl, bh
    jz   skipIF
     `Code for statements that "do something"`
skipIF:

if(x != y && z < t)
{
     `THEN statements`
}
else
{
     `ELSE statements`
}
    mov   eax, x
    cmp   eax, y
    setne bl
    mov   eax, z
    cmp   eax, t
    setb  bh
    and   bl, bh
    jz    doElse
    ` Code for THEN statements`
    jmp   endOfIF

doElse:
    ` Code for ELSE statements`
endOfIF:

1st IF:
    mov  eax, x
    cmp  eax, y
    je   doBlock
    mov  eax, z
    cmp  eax, t
    jna  skipIF      ; Skip unless z > t (unsigned, matching seta above)
doBlock:     `Code for statements that "do something"`
skipIF:

2nd IF:
    mov   eax, x
    cmp   eax, y
    je    doElse
    mov   eax, z
    cmp   eax, t
    jnl   doElse
    ` Code for THEN statements`
    jmp   endOfIF

doElse:
    ` Code for ELSE statements`
endOfIF:

switch(s)
{
   case 0:   `case 0 code`  break;
   case 1:   `case 1 code`  break;
   case 2:   `case 2 code`  break;
   case 3:   `case 3 code`  break;
}

    mov eax, s ; Zero-extends!
    cmp eax, 3
    ja  skipSwitch
    lea rbx, jmpTbl
    jmp [rbx][rax * 8]
jmpTbl qword case0, case1, case2, case3

case0: `case 0 code`
       jmp skipSwitch

case1: `case 1 code`
       jmp skipSwitch

case2: `case 2 code`
       jmp skipSwitch

case3: `case 3 code`
 skipSwitch:

switch(t)
{
   case 2:  `case 2 code` break;
   case 4:  `case 4 code` break;
   case 5:  `case 5 code` break;
   case 6:  `case 6 code` break;
   default: `default code`
}
    mov eax, t ; Zero-extends!
    cmp eax, 2
    jb  swDefault
    cmp eax, 6
    ja  swDefault
    lea rbx, jmpTbl
    jmp [rbx][rax * 8 - 2 * 8]
jmpTbl qword case2, swDefault, case4, case5, case6

swDefault: `default code`
       jmp endSwitch

case2: `case 2 code`
       jmp endSwitch

case4: `case 4 code`
       jmp endSwitch

case5: `case 5 code`
       jmp endSwitch

case6: `case 6 code`

endSwitch:

switch(u)
{
   case 10:  ` case 10 code ` break;
   case 11:  ` case 11 code ` break;
   case 12:  ` case 12 code ` break;
   case 25:  ` case 25 code ` break;
   case 26:  ` case 26 code ` break;
   case 27:  ` case 27 code ` break;
   default:  ` default code`
} 
     lea rbx, jmpTbl1  ; Assume cases 10-12
     mov eax, u        ; Zero-extends!
     cmp eax, 10
     jb  swDefault
     cmp eax, 12
     jbe sw1
     cmp eax, 25
     jb  swDefault
     cmp eax, 27
     ja  swDefault
     lea rbx, jmpTbl2
     jmp [rbx][rax * 8 - 25 * 8]
sw1: jmp [rbx][rax * 8 - 10 * 8]
jmpTbl1 qword case10, case11, case12
jmpTbl2 qword case25, case26, case27

swDefault: `default code`
       jmp endSwitch

case10: `case 10 code`
       jmp endSwitch

case11: `case 11 code`
       jmp endSwitch

case12: `case 12 code`
       jmp endSwitch

case25: `case 25 code`
       jmp endSwitch

case26: `case 26 code`
       jmp endSwitch

case27: `case 27 code`

endSwitch:

while(i < j)
{
     `Code for loop body`
}

whlLp:
     mov eax, i
     cmp eax, j
     jnl endWhl
      `Code for loop body`
     jmp whlLp
endWhl:

while(i < j && k != 0)
{
     `Code for loop body, part a`
    if(m == 5) continue;
     `Code for loop body, part b`
    if(n < 6) break;
     `Code for loop body, part c`
}

; Assume short-circuit evaluation:
 whlLp:
     mov eax, i
     cmp eax, j
     jnl endWhl
     mov eax, k
     cmp eax, 0
     je  endWhl
     ` Code for loop body, part a`
     cmp m, 5
     je  whlLp
     ` Code for loop body, part b`
     cmp n, 6
     jl  endWhl
    `  Code for loop body, part c`
     jmp whlLp
endWhl:

do
{
   `Code for loop body`
} while(i != j);

doLp:
   `Code for loop body`
     mov eax, i
     cmp eax, j
     jne doLp

do
{
   `Code for loop body, part a`
    if(m != 5) continue;
   `Code for loop body, part b`
    if(n == 6) break;
   `Code for loop body, part c`
} while(i < j && k > j);

doLp:
  ` Code for loop body, part a`
     cmp m, 5
     jne doCont
  ` Code for loop body, part b`
     cmp n, 6
     je  doExit
  ` Code for loop body, part c`
doCont:     mov eax, i
     cmp eax, j
     jnl doExit
     mov eax, k
     cmp eax, j
     jg  doLp
doExit:

for(int i = 0; i < 10; ++i)
{
   `Code for loop body`
}

       mov i, 0
forLp: cmp i, 10
       jnl forDone
       ` Code for loop body`
       inc i
       jmp forLp
forDone:

E.8 第八章问题的答案

你可以通过以下方式计算 x = y + z：

mov rax, qword ptr y
add rax, qword ptr z
mov qword ptr x, rax
mov rax, qword ptr y[8]
adc rax, qword ptr z[8]
mov qword ptr x[8], rax
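The add/adc pair above can be modeled in C to show what the carry flag contributes. This is a sketch that assumes a 128-bit value stored as two 64-bit halves, LO half first; the type name `u128` and the function name `add128` are hypothetical.

```c
#include <stdint.h>

/* A 128-bit unsigned value as two 64-bit halves (lo at the lower address). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128;

/* x = y + z, propagating the carry from the LO half into the HO half. */
u128 add128(u128 y, u128 z)
{
    u128 x;
    uint64_t carry;

    x.lo = y.lo + z.lo;                 /* add: LO qwords                 */
    carry = (x.lo < y.lo) ? 1u : 0u;    /* the carry flag after that add  */
    x.hi = y.hi + z.hi + carry;         /* adc: HO qwords plus the carry  */
    return x;
}
```

The test `x.lo < y.lo` is true exactly when the unsigned LO addition wrapped around, which is the condition under which the CPU sets the carry flag. The same pattern extends to more than two qwords by repeating the adc step for each higher qword.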

mov rax, qword ptr y
add rax, qword ptr z
mov qword ptr x, rax
mov eax, dword ptr z[8] 
adc eax, dword ptr y[8]
mov dword ptr x[8], eax

mov eax, dword ptr y
add eax, dword ptr z
mov dword ptr x, eax
mov ax, word ptr z[4]
adc ax, word ptr y[4]
mov word ptr x[4], ax

你可以通过以下方式计算 x = y – z：

mov rax, qword ptr y
sub rax, qword ptr z
mov qword ptr x, rax
mov rax, qword ptr y[8]
sbb rax, qword ptr z[8]
mov qword ptr x[8], rax
mov rax, qword ptr y[16]
sbb rax, qword ptr z[16]
mov qword ptr x[16], rax

mov rax, qword ptr y
sub rax, qword ptr z
mov qword ptr x, rax
mov eax, dword ptr y[8]
sbb eax, dword ptr z[8]
mov dword ptr x[8], eax

mov rax, qword ptr y
mul qword ptr z
mov qword ptr x, rax
mov rbx, rdx

mov rax, qword ptr y
mul qword ptr z[8]
add rax, rbx
adc rdx, 0
mov qword ptr x[8], rax
mov rbx, rdx

mov rax, qword ptr y[8]
mul qword ptr z
add qword ptr x[8], rax
adc rbx, rdx

mov rax, qword ptr y[8]
mul qword ptr z[8]
add rax, rbx
mov qword ptr x[16], rax
adc rdx, 0
mov qword ptr x[24], rdx

mov  rax, qword ptr y[8]
cqo
idiv qword ptr z
mov  qword ptr x[8], rax
mov  rax, qword ptr y
idiv qword ptr z
mov  qword ptr x, rax

The conversion is as follows:

; Note: order of comparison (HO vs. LO) is irrelevant
; for "==" comparison.

    mov rax, qword ptr x[8]
    cmp rax, qword ptr y[8]
    jne skipElse
    mov rax, qword ptr x
    cmp rax, qword ptr y
    jne skipElse
    `then code`
skipElse:

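    mov rax, qword ptr x[8]
    cmp rax, qword ptr y[8]
    ja  skipElse            ; HO qword of x is above: x > y
    jb  doThen              ; HO qword of x is below: x < y
    mov rax, qword ptr x    ; HO qwords are equal: LO qwords decide
    cmp rax, qword ptr y
    jnb skipElse
doThen:
    `then code`
skipElse: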

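    mov rax, qword ptr x[8]
    cmp rax, qword ptr y[8]
    jb  skipElse            ; HO qword of x is below: x < y
    ja  doThen              ; HO qword of x is above: x > y
    mov rax, qword ptr x    ; HO qwords are equal: LO qwords decide
    cmp rax, qword ptr y
    jna skipElse
doThen:
    `then code`
skipElse: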

; Note: order of comparison (HO vs. LO) is irrelevant
; for "!=" comparison.

    mov rax, qword ptr x[8]
    cmp rax, qword ptr y[8]
    jne doElse
    mov rax, qword ptr x
    cmp rax, qword ptr y
    je skipElse
doElse:
    `then code`
skipElse:
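As a cross-check on the extended-precision comparisons, here is a minimal C sketch of the same logic for unsigned 128-bit values. The type `u128_t` and the functions `eq128`, `lt128`, and `gt128` are hypothetical names. The key point is that the LO qwords matter only when the HO qwords are equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit type: lo is the LO qword, hi is the HO qword. */
typedef struct { uint64_t lo, hi; } u128_t;

/* Equality: both halves must match; the order of the tests is irrelevant. */
static bool eq128(u128_t x, u128_t y)
{
    return x.hi == y.hi && x.lo == y.lo;
}

/* Unsigned less-than: the HO qwords decide unless they are equal. */
static bool lt128(u128_t x, u128_t y)
{
    if (x.hi != y.hi)
        return x.hi < y.hi;
    return x.lo < y.lo;
}

/* Unsigned greater-than is less-than with the operands swapped. */
static bool gt128(u128_t x, u128_t y)
{
    return lt128(y, x);
}
```

A value with a smaller HO qword is smaller no matter what the LO qwords hold, so a test case with a large LO qword and a small HO qword is a good probe for mistakes.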

The conversion is as follows:

; Note: order of comparison (HO vs. LO) is irrelevant
; for "==" comparison.

    mov eax, dword ptr x[8]
    cmp eax, dword ptr y[8]
    jne skipElse
    mov rax, qword ptr x
    cmp rax, qword ptr y
    jne skipElse
    `then code`
skipElse:

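    mov eax, dword ptr x[8]
    cmp eax, dword ptr y[8]
    ja  skipElse            ; HO dword of x is above: x > y
    jb  doThen              ; HO dword of x is below: x < y
    mov rax, qword ptr x    ; HO dwords are equal: LO qwords decide
    cmp rax, qword ptr y
    jnb skipElse
doThen:
    `then code`
skipElse: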

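    mov eax, dword ptr x[8]
    cmp eax, dword ptr y[8]
    jb  skipElse            ; HO dword of x is below: x < y
    ja  doThen              ; HO dword of x is above: x > y
    mov rax, qword ptr x    ; HO dwords are equal: LO qwords decide
    cmp rax, qword ptr y
    jna skipElse
doThen:
    `then code`
skipElse: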

The conversion is as follows:

neg qword ptr x[8]
neg qword ptr x
sbb qword ptr x[8], 0

xor rax, rax
xor rdx, rdx
sub rax, qword ptr x
sbb rdx, qword ptr x[8]
mov qword ptr x, rax
mov qword ptr x[8], rdx

mov rax, qword ptr y
mov rdx, qword ptr y[8]
neg rdx
neg rax
sbb rdx, 0
mov qword ptr x, rax
mov qword ptr x[8], rdx

xor rdx, rdx
xor rax, rax
sub rax, qword ptr y
sbb rdx, qword ptr y[8]
mov qword ptr x, rax
mov qword ptr x[8], rdx
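The two negation strategies above (negate each half and fix up the borrow, or subtract from zero) give the same result. A small C sketch of the subtract-from-zero form follows; `u128_t` and `neg128` are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit type: lo is the LO qword, hi is the HO qword. */
typedef struct { uint64_t lo, hi; } u128_t;

/* Two's complement negation computed as 0 - v, one qword at a time. */
static u128_t neg128(u128_t v)
{
    u128_t r;
    r.lo = 0 - v.lo;
    /* Subtracting a nonzero LO qword from zero borrows from the HO qword,
       which is what "sbb" accounts for in the assembly version. */
    uint64_t borrow = (v.lo != 0) ? 1u : 0u;
    r.hi = 0 - v.hi - borrow;
    return r;
}
```

Negating 1 should yield all one bits in both halves, and negating a value with a zero LO qword should leave that qword at zero with no borrow.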

The conversion is as follows:

mov rax, qword ptr y
and rax, qword ptr z
mov qword ptr x, rax
mov rax, qword ptr y[8]
and rax, qword ptr z[8]
mov qword ptr x[8], rax

mov rax, qword ptr y
or  rax, qword ptr z
mov qword ptr x, rax
mov rax, qword ptr y[8]
or  rax, qword ptr z[8]
mov qword ptr x[8], rax

mov rax, qword ptr y
xor rax, qword ptr z
mov qword ptr x, rax
mov rax, qword ptr y[8]
xor rax, qword ptr z[8]
mov qword ptr x[8], rax

mov rax, qword ptr y
not rax
mov qword ptr x, rax
mov rax, qword ptr y[8]
not rax
mov qword ptr x[8], rax

mov rax, qword ptr y
shl rax, 1
mov qword ptr x, rax
mov rax, qword ptr y[8]
rcl rax, 1
mov qword ptr x[8], rax

mov rax, qword ptr y[8]
shr rax, 1
mov qword ptr x[8], rax
mov rax, qword ptr y
rcr rax, 1
mov qword ptr x, rax

mov rax, qword ptr y[8]
sar rax, 1
mov qword ptr x[8], rax
mov rax, qword ptr y
rcr rax, 1
mov qword ptr x, rax

rcl qword ptr x, 1
rcl qword ptr x[8], 1

rcr qword ptr x[8], 1
rcr qword ptr x, 1
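The shl/rcl and shr/rcr pairs above pass one bit between the two qwords through the carry flag. In C the same one-bit shifts can be sketched by moving that bit explicitly; `u128_t`, `shl128_1`, and `shr128_1` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit type: lo is the LO qword, hi is the HO qword. */
typedef struct { uint64_t lo, hi; } u128_t;

/* Shift left by one: bit 63 of the LO qword becomes bit 0 of the HO qword
   (the bit that "shl" leaves in the carry flag for "rcl"). */
static u128_t shl128_1(u128_t v)
{
    u128_t r;
    r.hi = (v.hi << 1) | (v.lo >> 63);
    r.lo = v.lo << 1;
    return r;
}

/* Logical shift right by one: bit 0 of the HO qword becomes bit 63 of the
   LO qword (the bit that "shr" leaves in the carry flag for "rcr"). */
static u128_t shr128_1(u128_t v)
{
    u128_t r;
    r.lo = (v.lo >> 1) | (v.hi << 63);
    r.hi = v.hi >> 1;
    return r;
}
```

This is why the left shift starts with the LO qword and the right shift starts with the HO qword: the bit has to leave one half before it can enter the other.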

E.9 Answers to Questions in Chapter 9

btoh        proc

            mov     ah, al      ; Do HO nibble first
            shr     ah, 4       ; Move HO nibble to LO
            or      ah, '0'     ; Convert to char
            cmp     ah, '9' + 1 ; Is it "A" to "F"?
            jb      AHisGood

; Convert 3Ah to 3Fh to "A" to "F".

            add     ah, 7

; Process the LO nibble here.

AHisGood:   and     al, 0Fh     ; Strip away HO nibble
            or      al, '0'     ; Convert to char
            cmp     al, '9' + 1 ; Is it "A" to "F"?
            jb      ALisGood

; Convert 3Ah to 3Fh to "A" to "F".

            add     al, 7
ALisGood:   ret
btoh        endp
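For comparison, the same nibble-by-nibble conversion can be sketched in C. This assumes an ASCII character set, as the assembly code does; `nibble_to_hex` is a hypothetical helper, and this `btoh` writes its two characters to a caller-supplied array rather than returning them in AH and AL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert the LO 4 bits of nibble to an ASCII hex digit the same way the
   assembly does: OR in '0', then add 7 to move 3Ah..3Fh up to 'A'..'F'. */
static char nibble_to_hex(unsigned nibble)
{
    char c = (char)('0' | (nibble & 0x0Fu));
    if (c > '9')
        c = (char)(c + 7);
    return c;
}

/* Store the HO digit in out[0] and the LO digit in out[1]. */
static void btoh(uint8_t value, char out[2])
{
    assert(out != NULL);
    out[0] = nibble_to_hex(value >> 4);
    out[1] = nibble_to_hex(value);
}
```

The constant 7 is the gap between ':' (3Ah, the character after '9') and 'A' (41h) in ASCII.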

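8
Call qToStr twice: once with the HO 64 bits and once with the LO 64 bits. Then concatenate the two strings.
fbstp
If the input value is negative, emit a hyphen (-) character and negate the value; then call the unsigned decimal conversion function. If the number is 0 or positive, just call the unsigned decimal conversion function.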
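The signed-conversion strategy in the last answer (emit a hyphen, negate, then reuse the unsigned routine) might look like this in C. Both function names are hypothetical stand-ins and are not the book's routines; the caller is assumed to supply a buffer of at least 21 bytes (sign, up to 19 digits, terminator for a signed value).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unsigned converter: writes the decimal digits of v and a
   terminating zero byte into buf. */
static void u64_to_str(uint64_t v, char *buf)
{
    char tmp[20];               /* a 64-bit value has at most 20 digits */
    size_t n = 0;

    assert(buf != NULL);
    do {                        /* digits come out LO digit first */
        tmp[n++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0);
    while (n > 0)               /* reverse them into the output buffer */
        *buf++ = tmp[--n];
    *buf = '\0';
}

/* Signed converter: handle the sign, then defer to the unsigned routine. */
static void i64_to_str(int64_t v, char *buf)
{
    uint64_t magnitude = (uint64_t)v;

    assert(buf != NULL);
    if (v < 0) {
        *buf++ = '-';
        /* Negate in unsigned arithmetic so the most negative value
           does not overflow. */
        magnitude = 0 - magnitude;
    }
    u64_to_str(magnitude, buf);
}
```

Doing the negation on the unsigned copy sidesteps the one awkward input, the most negative 64-bit value, whose magnitude does not fit in a signed variable.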

; Inputs:
;    RAX -   Number to convert to string.
;    CL  -   minDigits (minimum print positions).
;    CH  -   Padding character.
;    RDI -   Buffer pointer for output string.

It will produce the full string that the value requires; the minDigits parameter specifies only the minimum size of the string.

; On Entry:

   ; r10        - Real10 value to convert.
   ;              Passed in ST(0).

   ; fWidth     - Field width for the number (note that this
   ;              is an *exact* field width, not a minimum
   ;              field width).
   ;              Passed in EAX (RAX).

   ; decimalpts - # of digits to display after the decimal pt.
   ;              Passed in EDX (RDX). 

   ; fill       - Padding character if the number is smaller
   ;              than the specified field width.
   ;              Passed in CL (RCX).

   ; buffer     - r10ToStr stores the resulting characters
   ;              in this string.
   ;              Address passed in RDI.

   ; maxLength  - Maximum string length.
   ;              Passed in R8D (R8).

A string containing fWidth # characters.

; On Entry:

;    e10     - Real10 value to convert.
;              Passed in ST(0).

;    width   - Field width for the number (note that this
;              is an *exact* field width, not a minimum
;              field width).
;              Passed in RAX (LO 32 bits).

;    fill    - Padding character if the number is smaller
;              than the specified field width.
;              Passed in RCX.

;    buffer  - e10ToStr stores the resulting characters in
;              this buffer (passed in EDI).
;              Passed in RDI (LO 32 bits).

;    expDigs - Number of exponent digits (2 for real4,
;              3 for real8, and 4 for real10).
;              Passed in RDX (LO 8 bits).

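A character used to separate one sequence of characters from other such sequences, such as the start or end of a numeric string
Illegal characters in the input and numeric overflow during the conversion

E.10 Answers to Questions in Chapter 10

The set of all possible input (parameter) values
The set of all possible function output (return) values
It computes AL = [RBX + AL × 1].
Byte values: the domain is the set of all integers from 0 to 255, and the range is also the set of all integers from 0 to 255.

The code to implement these functions is as follows: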

```
lea rbx, f
mov al, input
xlat
```

lea rbx, f
movzx rax, input
mov ax, [rbx][rax * 2]

lea rbx, f
movzx rax, input
mov al, [rbx][rax * 1]

lea rbx, f
movzx rax, input
mov ax, [rbx][rax * 2]

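Modify the input value so that it lies within the function's input domain.
Main memory is very slow, so computing a value may be faster than looking it up in a table.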
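A table lookup combined with input conditioning can be sketched in C as follows. The table `hexDigits` and the function `nibble_to_hex_lookup` are hypothetical examples chosen for brevity; the mask forces any input into the table's 0 to 15 domain before indexing.

```c
#include <assert.h>

/* Sixteen-entry lookup table: the function's domain is 0..15 and its
   range is the sixteen hexadecimal digit characters. */
static const char hexDigits[] = "0123456789ABCDEF";

static char nibble_to_hex_lookup(unsigned input)
{
    /* Condition the input: keep only the LO 4 bits so the index can
       never fall outside the table. */
    unsigned index = input & 0x0Fu;

    assert(index < 16);
    return hexDigits[index];
}
```

Masking is the cheapest form of conditioning; whether the table beats a short computation on a given CPU depends on whether the table is already in cache.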

E.11 Answers to Questions in Chapter 11

Use the cpuid instruction.
Because Intel and AMD have different feature sets
EAX = 1
Bit 20 of ECX
(a) _TEXT, (b) _DATA, (c) _BSS, (d) CONST
PARA, or 16 bytes

data  segment align(64) 'DATA'
           .
           .
           .
data  ends

AVX/AVX2/AVX-256/AVX-512
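The data type held in a SIMD register; typically 1, 2, 4, or 8 bytes wide
Scalar instructions operate on a single data item; vector instructions operate on two or more data items simultaneously.
16 bytes
32 bytes
64 bytes
movd
movq
movaps, movapd, and movdqa
movups, movupd, and movdqu
movhps or movhpd
movddup
pshufb
pshufd, though pshufb would also work
(v)pextrb, (v)pextrw, (v)pextrd, or (v)pextrq
(v)pinsrb, (v)pinsrw, (v)pinsrd, or (v)pinsrq
It takes the bits of the second operand, inverts them, and then logically ANDs the inverted bits with the first (destination) operand.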
pslldq
psrldq
psllq
psrlq
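The carry out of the HO bit is lost.
In a vertical addition, the CPU adds the values in the same lane of two different XMM registers; in a horizontal addition, the CPU adds the values in adjacent lanes of the same XMM register.
In the destination XMM register, by storing 0FFh into the corresponding lane to indicate true (0 indicates false)
Swap the operands of the pcmpgtq instruction.
It copies the HO bit of each byte in an XMM register to the corresponding bit position of a 16-bit general-purpose register; for example, bit 7 of lane 0 goes to bit 0.
(a) 4 on SSE, 8 on AVX2; (b) 2 on SSE, 4 on AVX2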
and rax, -16
pxor xmm0, xmm0
pcmpeqb xmm1, xmm1
include
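The answer above about copying the HO bit of each byte lane into a 16-bit mask can be illustrated with a scalar C model. This is only a sketch of the described behavior, not an intrinsic; the function name `movemask_bytes` is hypothetical, and the sixteen array elements stand in for the sixteen byte lanes of an XMM register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather bit 7 of each of the 16 byte lanes into bits 0..15 of the
   result: lane 0 supplies bit 0, lane 15 supplies bit 15. */
static uint16_t movemask_bytes(const uint8_t lanes[16])
{
    uint16_t mask = 0;

    assert(lanes != NULL);
    for (int i = 0; i < 16; ++i) {
        unsigned hoBit = lanes[i] >> 7;          /* 0 or 1 */
        mask = (uint16_t)(mask | (hoBit << i));
    }
    return mask;
}
```

Because packed comparisons store all ones in each lane that compares true, a mask built this way has one set bit per matching byte.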

E.12 Answers to Questions in Chapter 12

and/andn
btr
or
bts
xor
btc
test/and
bt
pext
pdep
bextr
bsf
bsr
Invert the register and use bsf.
Invert the register and use bsr.
popcnt
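The "invert, then scan" trick from the answers above can be sketched in portable C. `bit_scan_forward` is a hypothetical loop-based stand-in for bsf that returns -1 for a zero input (the real instruction signals that case through the zero flag instead), and `first_zero_bit` is likewise a name introduced only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit in v, or -1 if v has no set bits. */
static int bit_scan_forward(uint64_t v)
{
    int index = 0;

    if (v == 0)
        return -1;
    while ((v & 1u) == 0) {
        v >>= 1;
        ++index;
    }
    return index;
}

/* Index of the lowest clear bit: invert so that zeros become ones, then
   scan for the first set bit. */
static int first_zero_bit(uint64_t v)
{
    return bit_scan_forward(~v);
}
```

The same inversion works with a reverse scan to find the highest clear bit.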

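E.13 Answers to Questions in Chapter 13

Compile-time language
During assembly and compilation
echo (or %out)
.err
The = directive
!
It replaces the expression with text representing the value of that compile-time expression.
It replaces a text symbol with the expansion of that text symbol.
It concatenates two or more text strings at assembly time and stores the result in a text symbol.
It searches for a substring within a MASM text object and returns the index of that substring within the object; it returns 0 if the substring does not appear in the larger string.
It returns the length of a MASM text string.
It returns a substring from a larger MASM text string.
if, elseif, else, and endif
while, for, forc, and endm
forc
macro, endm
Specify the name of the macro at the point where the expansion should occur.
As operands of the macro directive
Specify :req after the parameter name in the macro operand field.
Macro parameters are optional by default when they have no :req suffix.
Use the :vararg suffix after the last macro parameter declaration.
Use a conditional assembly directive, such as ifb or ifnb, to see whether the actual macro argument is blank.
Use the local directive.
exitm
Use exitm <text>.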
opattr

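E.14 Answers to Questions in Chapter 14

Bytes, words, double words, and quad words
movs, cmps, scas, stos, and lods
Bytes and words
RSI, RDI, and RCX
RSI and RDI
RCX, RSI, and AL
RDI and EAX
Dir = 0
Dir = 1
Clear the direction flag, or preserve its value.
Clear
movs and stos
When the source and destination blocks overlap and the source block begins at a lower memory address than the destination block
This is the default condition; you would also clear the direction flag when the source and destination blocks overlap and the source block begins at a higher memory address than the destination block.
Portions of the source block may be replicated in the destination block.
repe
The direction flag should be clear.
No; when used with a repeat prefix, the string instructions test RCX before performing the string operation.
scasb
stos
lods and stos
lods
Verify that the CPU supports the SSE 4.2 instructions.
pcmpistri and pcmpistrm
pcmpestri and pcmpestrm
RAX holds the src1 length, and RDX holds the src2 length.
Equal any, or possibly equal range
Equal each
Equal ordered
The pcmpXstrY instructions always read 16 bytes of memory, even when the string is shorter than that, so an MMU page fault can occur if the read runs past the end of the string.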
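The direction-flag answers above come down to choosing which end of an overlapping block to copy first. A minimal C sketch of that decision follows; `copy_bytes` is a hypothetical function, and the pointer comparison assumes both pointers refer to the same underlying buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy count bytes from src to dst, choosing the direction so that an
   overlapping source is never overwritten before it has been read. */
static void copy_bytes(uint8_t *dst, const uint8_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);

    if (dst > src && dst < src + count) {
        /* Overlap with the source at the lower address: copy from the
           last byte down, as "std" followed by "rep movsb" would. */
        while (count-- > 0)
            dst[count] = src[count];
    } else {
        /* Otherwise copy from the first byte up, the direction-flag-clear
           case. */
        for (size_t i = 0; i < count; ++i)
            dst[i] = src[i];
    }
}
```

Copying in the wrong direction here would replicate the first bytes of the source through the destination, which is the effect one of the answers above describes.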

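E.15 Answers to Questions in Chapter 15

ifndef and endif
The assembly source file and all the files it directly or indirectly includes
public
extern and externdef
externdef
abs
proc
nmake.exe

Multiple blocks of the following form:

`target`: `dependencies`
    `commands`

A dependent file is one that the current file depends on to operate properly; the dependent file must be updated and built before the current file is compiled and linked.
Delete old object and executable files, and delete other miscellaneous files.
A collection of object files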

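E.16 Answers to Questions in Chapter 16

/subsystem:console
www.masm32.com/
It slows down the assembly process.
/entry:`procedure_name`
MessageBox
Code that surrounds a function call and changes the way you call the function (for example, parameter order and location)
__imp_CreateFileA
__imp_GetLastError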

Posted 2025-11-25 17:04 by 绝不原创的飞龙
