std::atomic from source to assembly, from gcc's perspective

intro

C++11 introduced many interesting features, including atomics. On reflection, std::atomic is really a machine-dependent feature: different architectures implement it in their own ways, notably RISC architectures such as ARM versus CISC ones such as x86. Some architectures may not support it at all.

So how do compilers deal with it?

the += operator overload

In the standard library (libstdc++), the atomic += operator is overloaded as the following function:

      __int_type
      operator+=(__int_type __i) noexcept
      { return __atomic_add_fetch(&_M_i, __i, int(memory_order_seq_cst)); }

The leading double underscore suggests that the name is reserved for the compiler, as stated in [C++ reference]( https://en.cppreference.com/c/language/identifier#:~:text=Note%3A in C%2B%2B%2C identifiers ,a%20double%20underscore%20are%20reserved.):

Note: in C++, identifiers with a double underscore anywhere are reserved everywhere; in C, only the ones that begin with a double underscore are reserved.
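
To make the mapping concrete, here is a minimal sketch that performs the same addition twice: once through std::atomic's operator+= and once by calling the builtin directly with the same sequentially consistent order. The helper names add_with_operator and add_with_builtin are made up for this illustration, and the second one assumes a compiler that provides the __atomic builtins (gcc or clang).

```cpp
#include <atomic>
#include <cassert>

// Adds `delta` through std::atomic's operator+= and returns the new value.
inline int add_with_operator(std::atomic<int>& counter, int delta) {
    return counter += delta;
}

// Adds `delta` by calling the builtin directly, with the same sequentially
// consistent ordering that operator+= passes, and returns the new value.
inline int add_with_builtin(int* counter, int delta) {
    assert(counter != nullptr);
    return __atomic_add_fetch(counter, delta, __ATOMIC_SEQ_CST);
}
```

Both helpers return the value after the addition, which is what the "add_fetch" (as opposed to "fetch_add") naming refers to.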

identifier recognition

__atomic_add_fetch is defined in sync-builtins.def, which is included by builtins.def:

DEF_SYNC_BUILTIN (BUILT_IN_ATOMIC_ADD_FETCH_N,
		  "__atomic_add_fetch",
		  BT_FN_VOID_VAR, ATTR_NOTHROWCALL_LEAF_LIST)

In the c_define_builtins() function, all builtin functions are registered by name, which is "__atomic_add_fetch" in our case, so that the front end can recognize the identifier when it appears in source code.


/* Build builtin functions common to both C and C++ language
   frontends.  */
static void
c_define_builtins (tree va_list_ref_type_node, tree va_list_arg_type_node)
{
///...

  c_init_attributes ();

#define DEF_BUILTIN(ENUM, NAME, CLASS, TYPE, LIBTYPE, BOTH_P, FALLBACK_P, \
		    NONANSI_P, ATTRS, IMPLICIT, COND)			\
  if (NAME && COND)							\
    def_builtin_1 (ENUM, NAME, CLASS,                                   \
		   builtin_types[(int) TYPE],                           \
		   builtin_types[(int) LIBTYPE],                        \
		   BOTH_P, FALLBACK_P, NONANSI_P,                       \
		   built_in_attributes[(int) ATTRS], IMPLICIT);
#include "builtins.def"

  targetm.init_builtins ();
///...
}
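
The registration trick above, defining a macro and then including a .def file, is the usual X-macro technique. The following is only a toy sketch of that idea, not gcc's code: every TOY_ name is invented for illustration, and a macro list stands in for the included builtins.def file.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for builtins.def: each entry pairs an enum value
// with the source-level name it is registered under.
#define TOY_BUILTINS(DEF)                              \
    DEF(TOY_ATOMIC_ADD_FETCH_N, "__atomic_add_fetch")  \
    DEF(TOY_ATOMIC_FETCH_ADD_N, "__atomic_fetch_add")

// First expansion: build the enum of function codes.
enum toy_builtin {
#define DEF(ENUM, NAME) ENUM,
    TOY_BUILTINS(DEF)
#undef DEF
    TOY_BUILTIN_COUNT
};

// Second expansion: build the name table, indexed by function code.
static const char* const toy_builtin_names[] = {
#define DEF(ENUM, NAME) NAME,
    TOY_BUILTINS(DEF)
#undef DEF
};

// Looks a name up in the table. Returns TOY_BUILTIN_COUNT when the name
// is not a registered builtin.
inline toy_builtin toy_lookup(const char* name) {
    assert(name != nullptr);
    for (int i = 0; i < TOY_BUILTIN_COUNT; ++i)
        if (std::strcmp(toy_builtin_names[i], name) == 0)
            return static_cast<toy_builtin>(i);
    return TOY_BUILTIN_COUNT;
}
```

The point of the pattern is that one list drives several expansions, so the enum and the name table cannot drift apart.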

implicit function code conversion

Surprisingly, there is no handler for BUILT_IN_ATOMIC_ADD_FETCH_N in expand_builtin(), which is supposed to expand it. After some debugging, I found that resolve_overloaded_builtin(), called from finish_call_expr(), converts the function code in a rather covert way: it adds exact_log2 (n) + 1 to orig_code, where n is the operand size in bytes. For this arithmetic to work, __atomic_add_fetch, __atomic_add_fetch_1, __atomic_add_fetch_2, __atomic_add_fetch_4, __atomic_add_fetch_8 and __atomic_add_fetch_16 must be declared consecutively so that their enum values are adjacent, which they are in sync-builtins.def.

/* Some builtin functions are placeholders for other expressions.  This
   function should be called immediately after parsing the call expression
   before surrounding code has committed to the type of the expression.

   LOC is the location of the builtin call.

   FUNCTION is the DECL that has been invoked; it is known to be a builtin.
   PARAMS is the argument list for the call.  The return value is non-null
   when expansion is complete, and null if normal processing should
   continue.  */

tree
resolve_overloaded_builtin (location_t loc, tree function,
			    vec<tree, va_gc> *params, bool complain)
{
///...
	fncode = (enum built_in_function)((int)orig_code + exact_log2 (n) + 1);
	new_function = builtin_decl_explicit (fncode);
///...
}

This means __atomic_add_fetch((int*)i, 1, 1) and __atomic_add_fetch((long*)i, 1, 1) resolve to BUILT_IN_ATOMIC_ADD_FETCH_4 and BUILT_IN_ATOMIC_ADD_FETCH_8 respectively (on a target where long is 8 bytes), even though they share the same function name.
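
The enum arithmetic can be sketched in isolation. This is a toy model under the assumption that the sized variants directly follow the generic entry, as described above; the toy_ names and the helper toy_exact_log2 are invented for illustration and are not gcc's definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical miniature of the consecutive layout in sync-builtins.def.
enum toy_code {
    TOY_ADD_FETCH_N,   // the generic, overloaded entry
    TOY_ADD_FETCH_1,
    TOY_ADD_FETCH_2,
    TOY_ADD_FETCH_4,
    TOY_ADD_FETCH_8,
    TOY_ADD_FETCH_16
};

// Integer log2, meant only for exact powers of two.
inline int toy_exact_log2(std::size_t n) {
    int log = 0;
    while (n > 1) { n >>= 1; ++log; }
    return log;
}

// Same arithmetic as the excerpt: generic code + log2(size in bytes) + 1.
inline toy_code toy_resolve(toy_code generic, std::size_t n) {
    assert(n == 1 || n == 2 || n == 4 || n == 8 || n == 16);
    return static_cast<toy_code>(static_cast<int>(generic)
                                 + toy_exact_log2(n) + 1);
}

// Picks the sized variant from the pointee type, like the overload
// resolution does from the first argument.
template <typename T>
toy_code toy_resolve_for(T*) {
    return toy_resolve(TOY_ADD_FETCH_N, sizeof(T));
}
```

With a 4-byte operand the offset is log2(4) + 1 = 3, which lands on the _4 entry; with 8 bytes it is 4, which lands on _8.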

convert to rtx

The call chain for BUILT_IN_ATOMIC_ADD_FETCH_8 is easy to find and turns out to be quite straightforward: expand_builtin => expand_builtin_atomic_fetch_op => expand_atomic_fetch_op => expand_atomic_fetch_op_no_fallback => get_atomic_op_for_code. For addition, the table of optabs is filled in by the case PLUS branch:

/* Fill in structure pointed to by OP with the various optab entries for an
   operation of type CODE.  */

static void
get_atomic_op_for_code (struct atomic_op_functions *op, enum rtx_code code)
{
  gcc_assert (op!= NULL);

  /* If SWITCHABLE_TARGET is defined, then subtargets can be switched
     in the source code during compilation, and the optab entries are not
     computable until runtime.  Fill in the values at runtime.  */
  switch (code)
    {
    case PLUS:
      op->mem_fetch_before = atomic_fetch_add_optab;
      op->mem_fetch_after = atomic_add_fetch_optab;
      op->mem_no_result = atomic_add_optab;
      op->fetch_before = sync_old_add_optab;
      op->fetch_after = sync_new_add_optab;
      op->no_result = sync_add_optab;
      op->reverse_code = MINUS;
      break;

The corresponding entry for atomic_fetch_add_optab is in optabs.def. Moreover, the second parameter gives the name pattern of the machine description instruction that implements the operation.

OPTAB_D (atomic_exchange_optab,	 "atomic_exchange$I$a")
OPTAB_D (atomic_fetch_add_optab, "atomic_fetch_add$I$a")
OPTAB_D (atomic_fetch_and_optab, "atomic_fetch_and$I$a")
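
The PLUS branch lists both a fetch_before and a fetch_after variant plus a reverse_code of MINUS. As I read it, this lets one variant be derived from the other when a target provides only one of them. A minimal sketch of that relationship at the source level, with hypothetical helper names and unsigned arithmetic to avoid signed overflow in the illustration:

```cpp
#include <cassert>

// OP_fetch built from fetch_OP: fetch the old value atomically, then redo
// the addition on the returned copy to obtain the new value.
inline unsigned add_fetch_via_fetch_add(unsigned* p, unsigned v) {
    assert(p != nullptr);
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST) + v;
}

// fetch_OP built from OP_fetch: get the new value atomically, then undo
// the addition with the reverse operation (MINUS) to recover the old value.
inline unsigned fetch_add_via_add_fetch(unsigned* p, unsigned v) {
    assert(p != nullptr);
    return __atomic_add_fetch(p, v, __ATOMIC_SEQ_CST) - v;
}
```

Only the memory update has to be atomic; the correction is plain arithmetic on a private copy of the result.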

machine instruction

i386

The machine instruction is described in sync.md:

;; For operand 2 nonmemory_operand predicate is used instead of
;; register_operand to allow combiner to better optimize atomic
;; additions of constants.
(define_insn "atomic_fetch_add<mode>"
  [(set (match_operand:SWI 0 "register_operand" "=<r>")
	(unspec_volatile:SWI
	  [(match_operand:SWI 1 "memory_operand" "+m")
	   (match_operand:SI 3 "const_int_operand")]		;; model
	  UNSPECV_XCHG))
   (set (match_dup 1)
	(plus:SWI (match_dup 1)
		  (match_operand:SWI 2 "nonmemory_operand" "0")))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_XADD"
  "lock{%;} %K3xadd{<imodesuffix>}\t{%0, %1|%1, %0}")

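To show what the xadd instruction in this pattern does, here is a hand-written sketch using inline assembly. The function name exchange_and_add is hypothetical; the x86 branch assumes GNU-style inline asm, and other targets fall back to the __atomic_fetch_add builtin.

```cpp
#include <cassert>

// Atomically adds `value` to `*target` and returns the value `*target`
// held before the addition.
inline int exchange_and_add(int* target, int value) {
    assert(target != nullptr);
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    // xadd swaps the register with the memory operand and stores their sum
    // in memory, so the register ends up holding the old memory value.
    __asm__ __volatile__("lock xaddl %0, %1"
                         : "+r"(value), "+m"(*target)
                         :
                         : "memory", "cc");
    return value;
#else
    // Fallback for other targets: let the compiler choose the sequence.
    return __atomic_fetch_add(target, value, __ATOMIC_SEQ_CST);
#endif
}
```

This matches the shape of the pattern: operand 0 (the result register) is tied to operand 2 (the addend) by the "0" constraint, because xadd reads and writes the same register.
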
arm

There is no such single instruction in the ARM architecture, so expand_atomic_fetch_op takes a fallback. In ARM's specific case, the classic compare-and-swap loop is adopted:

/* This function expands an atomic fetch_OP or OP_fetch operation:
   TARGET is an option place to stick the return value.  const0_rtx indicates
   the result is unused.
   atomically fetch MEM, perform the operation with VAL and return it to MEM.
   CODE is the operation being performed (OP)
   MEMMODEL is the memory model variant to use.
   AFTER is true to return the result of the operation (OP_fetch).
   AFTER is false to return the value before the operation (fetch_OP).  */
rtx
expand_atomic_fetch_op (rtx target, rtx mem, rtx val, enum rtx_code code,
			enum memmodel model, bool after)
{
///...

  /* If nothing else has succeeded, default to a compare and swap loop.  */
  if (can_compare_and_swap_p (mode, true))
    {
      rtx_insn *insn;
      rtx t0 = gen_reg_rtx (mode), t1;

      start_sequence ();

      /* If the result is used, get a register for it.  */
      if (!unused_result)
        {
	  if (!target || !register_operand (target, mode))
	    target = gen_reg_rtx (mode);
	  /* If fetch_before, copy the value now.  */
	  if (!after)
	    emit_move_insn (target, t0);
	}
      else
        target = const0_rtx;

      t1 = t0;
      if (code == NOT)
        {
	  t1 = expand_simple_binop (mode, AND, t1, val, NULL_RTX,
				    true, OPTAB_LIB_WIDEN);
	  t1 = expand_simple_unop (mode, code, t1, NULL_RTX, true);
	}
      else
	t1 = expand_simple_binop (mode, code, t1, val, NULL_RTX, true,
				  OPTAB_LIB_WIDEN);

      /* For after, copy the value now.  */
      if (!unused_result && after)
        emit_move_insn (target, t1);
      insn = end_sequence ();

      if (t1 != NULL && expand_compare_and_swap_loop (mem, t0, t1, insn))
        return target;
    }

  return NULL_RTX;
}
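
At the source level, the compare-and-swap fallback corresponds roughly to the loop below. This is a minimal sketch written with std::atomic rather than gcc internals; fetch_add_cas_loop is a hypothetical name, and unsigned is used to keep the addition free of signed overflow.

```cpp
#include <atomic>
#include <cassert>

// fetch_add written as an explicit compare-and-swap retry loop.
// Returns the value held before the addition.
inline unsigned fetch_add_cas_loop(std::atomic<unsigned>& target,
                                   unsigned val) {
    // Plays the role of t0 in the excerpt: the value we believe is stored.
    unsigned expected = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak overwrites `expected` with the
    // current value, so every retry recomputes the sum from fresh data.
    while (!target.compare_exchange_weak(expected, expected + val,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}
```

The store only succeeds if nobody changed the location between the load and the exchange, which is what makes the read-modify-write appear atomic.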

verify

The test code is as follows:

tsecer@harry: cat atomicadd.cpp 
void foo()
{
    __atomic_add_fetch((int*)0, 1, 1);
}
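
The third argument in the test, 1, is the memory order. With gcc's predefined macros it corresponds to __ATOMIC_CONSUME, which gcc's documentation says is currently implemented as the stronger acquire order. A small sketch that spells the constants out (add_one_consume is a hypothetical name):

```cpp
#include <cassert>

// Numeric values of the memory order macros predefined by gcc and clang.
static_assert(__ATOMIC_RELAXED == 0, "relaxed");
static_assert(__ATOMIC_CONSUME == 1, "consume");
static_assert(__ATOMIC_ACQUIRE == 2, "acquire");
static_assert(__ATOMIC_RELEASE == 3, "release");
static_assert(__ATOMIC_ACQ_REL == 4, "acq_rel");
static_assert(__ATOMIC_SEQ_CST == 5, "seq_cst");

// Same operation as in the test, with the memory order written by name
// instead of the literal 1. Returns the value after the increment.
inline int add_one_consume(int* p) {
    assert(p != nullptr);
    return __atomic_add_fetch(p, 1, __ATOMIC_CONSUME);
}
```

Note that std::atomic's operator+= passes memory_order_seq_cst (5), so the test exercises a weaker order than the library operator does.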

i386

tsecer@harry: g++ -c atomicadd.cpp 
tsecer@harry: objdump -d atomicadd.o 

atomicadd.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_Z3foov>:
   0:	f3 0f 1e fa          	endbr64
   4:	55                   	push   %rbp
   5:	48 89 e5             	mov    %rsp,%rbp
   8:	f0 83 04 25 00 00 00 	lock addl $0x1,0x0
   f:	00 01 
  11:	90                   	nop
  12:	5d                   	pop    %rbp
  13:	c3                   	ret
tsecer@harry: 
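
Note that the listing shows lock addl, not the lock xadd from the atomic_fetch_add pattern quoted earlier. My reading is that this is because the test discards the result, so the no-result optab (atomic_add_optab in the PLUS table) can be used. A small sketch to compare the two situations; the function names are hypothetical, and the expected instructions are an assumption worth confirming with objdump on your own compiler:

```cpp
#include <cassert>

// Result discarded: only the memory update matters, so on x86-64 I would
// expect a plain "lock add" here.
void bump(int* p) {
    assert(p != nullptr);
    __atomic_add_fetch(p, 1, __ATOMIC_SEQ_CST);
}

// Result consumed: the new value has to end up in a register, so I would
// expect "lock xadd" followed by an ordinary add here.
int bump_and_get(int* p) {
    assert(p != nullptr);
    return __atomic_add_fetch(p, 1, __ATOMIC_SEQ_CST);
}
```

Compiling this with g++ -c and disassembling it the same way as above should make the difference visible.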

arm

Although I don't have an ARM compiler at hand, this can be verified in Compiler Explorer with ARM GCC 16.1.0:

foo():
        push    {r7}
        add     r7, sp, #0
        movs    r3, #0
.L2:
        ldrex   r1, [r3]
        add     r1, r1, #1
        strex   r2, r1, [r3]
        cmp     r2, #0
        bne     .L2
        dmb     ish
        nop
        mov     sp, r7
        pop     {r7}
        bx      lr

outro

This is a very basic analysis of how gcc processes a builtin function, and it only skims over the tree, gimple, rtx and md (machine description) stages. Still, every statement goes through these same stages in the compiler, so it's pretty interesting to take a peek at what is going on under the hood, isn't it?

posted on 2026-06-07 18:37  tsecer