ARM Assembly Programming: A Guide for Assembler Developers

Emily Ross

ARM processors power the overwhelming majority of the world's smartphones, tablets, and IoT devices. With Apple Silicon on the desktop and AWS Graviton and Ampere in the data center, ARM has also become a serious desktop and server architecture. For a developer who already understands assembly language through HLASM, learning ARM is a matter of learning a new syntax and a new ISA; the foundational thinking is already in place.

This article introduces ARM assembly with explicit comparisons to z/Architecture and HLASM. If you are coming from the x86 track, the ARM-vs-x86 differences are also noted where they matter.


ARM Architecture Family

ARM Holdings licenses its processor designs and instruction set architectures to semiconductor companies. The three most relevant architecture profiles today are:

Architecture                              Bits     Common Devices
ARMv7-A                                   32-bit   Raspberry Pi 2, older Android devices
ARMv8-A / AArch64                         64-bit   iPhone, Android (2016+), Apple M-series, AWS Graviton
ARMv6-M / ARMv7-M / ARMv8-M (Cortex-M)    32-bit   Microcontrollers (STM32, nRF52, etc.)

This guide focuses primarily on AArch64 (the 64-bit execution state of ARMv8-A), as it is the architecture you will encounter most in modern development. Key differences from 32-bit ARM are noted where relevant.


AArch64 Registers

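AArch64 has 31 general-purpose registers named X0–X30, each 64 bits wide. The same registers can be referenced with a W prefix (W0–W30) to operate on the lower 32 bits only; writing a W register zeroes the upper 32 bits of the corresponding X register.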

Register      Role
X0–X7         Arguments and return values (calling convention)
X8            Indirect result register / syscall number
X9–X15        Temporary (caller-saved)
X16–X17       Intra-procedure call scratch (used by PLT)
X18           Platform register (reserved on some OSes)
X19–X28       Callee-saved registers
X29 (FP)      Frame pointer
X30 (LR)      Link register (holds the return address)
SP            Stack pointer (not a general register)
PC            Program counter (not directly accessible)
XZR / WZR     Zero register: always reads as 0, writes are discarded

Compared to HLASM: z/Architecture has 16 GPRs (GR0–GR15) against AArch64's 31. The ARM zero register (XZR) has no direct HLASM equivalent. The closest analogue is the addressing rule that GR0, when named as a base or index register, contributes zero to the address instead of its contents. ARM's BL instruction writes the return address to the link register (X30/LR) in hardware, whereas HLASM uses GR14 only by convention: BALR and BASR can deposit the return address in any register.

NZCV Flags Register

AArch64's status flags register is called NZCV:

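  • N (Negative): set if the result is negative (its top bit is 1)
  • Z (Zero): set if the result is zero
  • C (Carry): set on unsigned overflow of an addition; for a subtraction or CMP, set when no borrow occurs
  • V (Overflow): set on signed overflow

Only flag-setting instructions (CMP, CMN, TST, and the S-suffixed forms such as ADDS and SUBS) update NZCV; a plain ADD or SUB leaves the flags alone. Conditional instructions then test them. The role is analogous to the z/Architecture condition code, although that is a single 2-bit value rather than four independent flags.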
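
The flag rules can be checked with a small model. The following is a minimal Python sketch, not ARM code: `cmp_flags` is a hypothetical helper that mimics what CMP (a subtraction whose result is discarded) does to NZCV for 64-bit operands passed as unsigned integers.

```python
MASK64 = (1 << 64) - 1

def cmp_flags(a, b):
    """Model NZCV after CMP Xa, Xb for 64-bit register contents a and b."""
    result = (a - b) & MASK64
    n = result >> 63                  # top bit of the result
    z = int(result == 0)
    c = int(a >= b)                   # carry set means "no borrow" on subtract
    # Signed overflow: the operands have different signs and the result's
    # sign differs from the sign of the first operand.
    sign_a, sign_b = a >> 63, b >> 63
    v = int(sign_a != sign_b and n != sign_a)
    return n, z, c, v

print(cmp_flags(5, 5))          # equal operands
print(cmp_flags(3, 5))          # unsigned borrow
print(cmp_flags(1 << 63, 1))    # most negative signed value minus 1
```

Note the carry convention in particular: after comparing equal values C is 1, which surprises developers used to a borrow flag.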


Load-Store Architecture: Familiar Territory for HLASM Developers

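ARM is a strict load-store architecture: arithmetic and logical instructions operate only on registers, and separate load and store instructions move data between registers and memory. HLASM developers will recognize the pattern from L and ST, but z/Architecture is not a pure load-store design. Its RX-format instructions (A R1,FIELD) take a storage operand directly, and SS-format instructions (MVC, CLC) work storage to storage; on AArch64 each of those becomes an explicit load, operate, store sequence.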

asm
// Load a 64-bit value from memory
LDR  X0, [X1]          // X0 = memory[X1]
LDR  X0, [X1, #8]      // X0 = memory[X1 + 8]  (base + offset)

// Store a 64-bit value to memory
STR  X0, [X1]          // memory[X1] = X0
STR  X0, [X1, #8]      // memory[X1 + 8] = X0

// Load pair and store pair (two registers at once; common for save/restore)
LDP  X29, X30, [SP, #16]   // X29 = memory[SP+16], X30 = memory[SP+24]
STP  X29, X30, [SP, #-16]! // push X29 and X30, pre-decrement SP

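Compared to HLASM: HLASM's L R1,FIELD (load fullword) corresponds to LDR W0, [base, #offset], and ST R1,FIELD to STR W0, [base, #offset]; the 64-bit LG and STG correspond to LDR and STR on X registers. The base-plus-displacement addressing is the same idea with different syntax. What differs is that AArch64 has no instructions that compute directly on storage operands.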
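
One practical difference when the same bytes are loaded on both platforms is byte order: z/Architecture is big-endian, while AArch64 normally runs little-endian. A minimal Python sketch, assuming a hypothetical 4-byte field, shows what each load would place in the register and how a byte reversal (the REV instruction on AArch64) reconciles them.

```python
# Hypothetical 4-byte field as it sits in memory, lowest address first.
field = bytes([0x12, 0x34, 0x56, 0x78])

z_value = int.from_bytes(field, "big")       # models L R1,FIELD on z/Architecture
arm_value = int.from_bytes(field, "little")  # models LDR W0, [X1] on little-endian AArch64

def rev32(w):
    """Model REV Wd, Wn: reverse the four bytes of a 32-bit value."""
    return int.from_bytes(w.to_bytes(4, "little"), "big")

print(hex(z_value), hex(arm_value), hex(rev32(arm_value)))
```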

Pre-indexed and Post-indexed Addressing

ARM provides two powerful addressing variants for array traversal:

asm
LDR  X0, [X1, #8]!     // Pre-indexed: X1 = X1+8 first, then load
LDR  X0, [X1], #8      // Post-indexed: load from X1 first, then X1 = X1+8

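Post-indexed addressing covers the pointer-advancing half of HLASM's BXLE (Branch on Index Low or Equal) loop idiom: the load and the increment happen in one instruction. The comparison and branch that BXLE also performs still need a separate CMP and B.cond (or a counter tested with CBNZ).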
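
The order of "update the base" and "access memory" is the whole difference between the two forms. The following Python sketch models it with a dictionary standing in for memory; the addresses, values, and helper names are hypothetical.

```python
def ldr_pre(mem, regs, base, offset):
    """Model LDR Xt, [Xn, #offset]! : update the base first, then load."""
    regs[base] += offset
    return mem[regs[base]]

def ldr_post(mem, regs, base, offset):
    """Model LDR Xt, [Xn], #offset : load first, then update the base."""
    value = mem[regs[base]]
    regs[base] += offset
    return value

# Three doublewords at made-up addresses.
mem = {0x1000: 11, 0x1008: 22, 0x1010: 33}

regs = {"X1": 0x1000}
walked = []
for _ in range(3):                       # post-indexed walk over the array
    walked.append(ldr_post(mem, regs, "X1", 8))
print(walked, hex(regs["X1"]))

regs = {"X1": 0x1000}
print(ldr_pre(mem, regs, "X1", 8), hex(regs["X1"]))
```

In the post-indexed walk X1 ends up one element past the array, which is why this form suits forward traversal; the pre-indexed form skips the element at the original address.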


AArch64 Instruction Set

Arithmetic Instructions

asm
ADD  X0, X1, X2          // X0 = X1 + X2
ADD  X0, X1, #10         // X0 = X1 + 10 (immediate)
SUB  X0, X1, X2          // X0 = X1 - X2
MUL  X0, X1, X2          // X0 = X1 * X2 (lower 64 bits)
SDIV X0, X1, X2          // X0 = X1 / X2 (signed division)
NEG  X0, X1              // X0 = -X1

Compared to z/Architecture: HLASM's multiply M R1,FIELD puts the product in an even-odd register pair (R1 and R1+1) because 32-bit × 32-bit produces a 64-bit result. AArch64's MUL gives the lower 64 bits of the 64-bit × 64-bit product in a single register; UMULH (unsigned) or SMULH (signed) gives the upper 64 bits.
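
The split between MUL and UMULH can be illustrated with arbitrary-precision integers. This is a minimal Python sketch of the unsigned case only; `mul` and `umulh` are hypothetical helper names modeling the two instructions.

```python
MASK64 = (1 << 64) - 1

def mul(a, b):
    """Model MUL Xd, Xn, Xm: low 64 bits of the product."""
    return (a * b) & MASK64

def umulh(a, b):
    """Model UMULH Xd, Xn, Xm: high 64 bits of the unsigned 128-bit product."""
    return ((a * b) >> 64) & MASK64

a, b = 0xFFFF_FFFF_FFFF_FFFF, 2
lo, hi = mul(a, b), umulh(a, b)
print(hex(hi), hex(lo))

# Together the two halves reproduce the full 128-bit product.
print((hi << 64) | lo == a * b)
```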

Logical and Shift Instructions

asm
AND  X0, X1, X2          // bitwise AND
ORR  X0, X1, X2          // bitwise OR (note: ORR not OR)
EOR  X0, X1, X2          // bitwise XOR (Exclusive OR)
LSL  X0, X1, #3          // logical shift left by 3 (multiply by 8)
LSR  X0, X1, #1          // logical shift right by 1
ASR  X0, X1, #1          // arithmetic shift right (preserves sign)

The MOV X0, XZR idiom zeroes a register using the zero register — equivalent to HLASM's XR R1,R1.

Comparison and Branching

asm
CMP  X0, X1              // sets NZCV based on X0 - X1 (result discarded)
CMP  X0, #10             // compare with immediate
B.EQ label               // branch if equal (Z=1)
B.NE label               // branch if not equal (Z=0)
B.LT label               // branch if less than (signed)
B.GT label               // branch if greater than (signed)
B.LO label               // branch if lower (unsigned)
B.HI label               // branch if higher (unsigned)
B    label               // unconditional branch

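Compared to HLASM: CMP corresponds to CR (or CLC for operands in storage), and B.EQ corresponds to BE (Branch on Equal). The mechanism is conceptually the same, with one twist: on z/Architecture the compare instruction decides signedness (CR versus CLR), while on ARM a single CMP serves both and the branch condition decides (B.LT versus B.LO).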
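
How each branch condition reads the flags can be written out as a table of predicates. The sketch below is a Python model covering only the six conditions used above; the flag values in the example are those CMP would produce for X0 = 0xFFFFFFFFFFFFFFFF and X1 = 1.

```python
# Each condition is a predicate over the four flags (n, z, c, v).
CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "LT": lambda n, z, c, v: n != v,               # signed less than
    "GT": lambda n, z, c, v: z == 0 and n == v,    # signed greater than
    "LO": lambda n, z, c, v: c == 0,               # unsigned lower
    "HI": lambda n, z, c, v: c == 1 and z == 0,    # unsigned higher
}

def taken(cond, nzcv):
    """Return True if B.<cond> would branch with the given flags."""
    return CONDITIONS[cond](*nzcv)

# X0 is -1 when read as signed and the largest value when read as unsigned.
flags = (1, 0, 1, 0)
for cond in ("LT", "GT", "LO", "HI"):
    print(cond, taken(cond, flags))
```

The same flags make both B.LT and B.HI branch: the signed reading says -1 is less than 1, the unsigned reading says the all-ones value is higher than 1.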

Conditional Instructions

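AArch64 also offers a small family of conditional data-processing instructions that select or adjust a value according to the flags. (32-bit ARM could predicate almost any instruction; AArch64 dropped that in favor of the forms below plus conditional compare, CCMP.) The nearest z/Architecture counterparts are the load-on-condition and store-on-condition instructions such as LOCR and STOC: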

asm
CSEL  X0, X1, X2, EQ    // if Z=1: X0=X1, else X0=X2  (conditional select)
CSET  X0, GT             // X0 = 1 if GT condition, else 0
CINC  X0, X1, NE         // X0 = X1+1 if NE, else X0 = X1

These allow branchless conditionals — valuable for performance because they avoid branch misprediction penalties.
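
As a rough illustration of the semantics, the three instructions can be modeled in Python as plain functions of an already evaluated condition. The helper names are hypothetical, and the "maximum of two values" use is just one common pattern (CMP followed by CSEL with GT).

```python
MASK64 = (1 << 64) - 1

def csel(xn, xm, cond):
    """Model CSEL Xd, Xn, Xm, cond: pick Xn if the condition holds, else Xm."""
    return xn if cond else xm

def cset(cond):
    """Model CSET Xd, cond: 1 if the condition holds, else 0."""
    return 1 if cond else 0

def cinc(xn, cond):
    """Model CINC Xd, Xn, cond: Xn + 1 (wrapping at 64 bits) if true, else Xn."""
    return (xn + 1) & MASK64 if cond else xn

x1, x2 = 7, 9
gt = x1 > x2                  # stands in for the flags left by CMP X1, X2
print(csel(x1, x2, gt))       # maximum of the two values, with no branch
print(cset(gt), cinc(41, True))
```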


Function Calls and the AArch64 PCS

AArch64 Linux follows the Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64):

  • X0–X7: first eight integer/pointer arguments; X0 also carries the return value.
  • X8: indirect result register (holds the address of the result buffer when a large struct is returned in memory).
  • X30 (LR): the branch-with-link instruction (BL) stores the return address here automatically.
  • X29 (FP): frame pointer; points to the current stack frame.
  • X19–X28: callee-saved. A function that uses these must save and restore them.
  • Stack alignment: SP must be 16-byte aligned at every public interface and whenever it is used as the base address of a memory access.

asm
// Function prologue (save frame pointer and link register)
my_func:
    STP  X29, X30, [SP, #-16]!  // push FP and LR onto stack (pre-decrement SP)
    MOV  X29, SP                 // set up frame pointer

    // function body
    // X0 holds first argument, X1 second, etc.
    // result goes in X0

    LDP  X29, X30, [SP], #16    // pop FP and LR (post-increment SP)
    RET                          // branch to address in LR (X30)

Compared to HLASM linkage:

  • HLASM: STM R14,R12,12(R13) saves 15 registers to a 72-byte save area.
  • AArch64: STP X29,X30,[SP,#-16]! saves only FP and LR (plus any X19–X28 registers the function uses).

ARM's approach is lighter — only the link register and frame pointer are unconditionally saved. HLASM saves all registers because the z/OS linkage convention requires a complete register save area for the save area chain.
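
The 16-byte alignment rule means a function's frame size is always rounded up. The sketch below is a Python model of that arithmetic under an assumed layout (the FP/LR pair, 8 bytes per saved callee-saved register, then locals); `frame_size` is a hypothetical helper, and real compilers may lay frames out differently.

```python
def frame_size(local_bytes, callee_saved_regs):
    """Bytes to subtract from SP for one frame, rounded up to a multiple of 16.

    Assumed layout: 16 bytes for X29/X30, 8 bytes per callee-saved register
    the function uses, plus its local variables.
    """
    raw = 16 + 8 * callee_saved_regs + local_bytes
    return (raw + 15) & ~15          # round up to the next multiple of 16

print(frame_size(0, 0))    # leaf-style frame: only FP and LR
print(frame_size(20, 3))   # 60 bytes needed
print(frame_size(8, 2))    # 40 bytes needed
```

Saving an odd number of callee-saved registers therefore costs the same stack space as saving one more, which is one reason they are usually saved in pairs with STP.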


AArch64 System Calls (Linux)

asm
.global _start
.text
_start:
// write(1, message, 13)
MOV  X8, #64         // syscall: write (64 on AArch64 Linux)
MOV  X0, #1          // fd = stdout
ADR  X1, message     // buffer address
MOV  X2, #13         // byte count
SVC  #0              // supervisor call: invoke the kernel

// exit(0)
MOV  X8, #93         // syscall: exit
MOV  X0, #0          // exit code
SVC  #0

message:
    .ascii "Hello, world\n"   // 13 bytes; the text itself is an illustrative assumption

The AArch64 Linux syscall convention puts the syscall number in X8 and the arguments in X0–X5; the result (or a negative error code) comes back in X0. By contrast, x86-64 uses RAX for the number, and z/OS encodes it in the SVC instruction itself. SVC #0 is the AArch64 counterpart of z/OS's SVC n: both trap into the operating system.


NEON SIMD: ARM's Vector Instructions

ARM's SIMD extension is called NEON (Advanced SIMD). It provides 32 registers (V0–V31), each 128 bits wide, that can be treated as vectors of integers or floats.

asm
// Add four 32-bit integers simultaneously
LD1  {V0.4S}, [X0]       // load 4 × 32-bit ints from memory into V0
LD1  {V1.4S}, [X1]       // load 4 × 32-bit ints from memory into V1
ADD  V2.4S, V0.4S, V1.4S // V2[i] = V0[i] + V1[i] for i in {0,1,2,3}
ST1  {V2.4S}, [X2]       // store result

This is conceptually similar to x86-64's SSE/AVX extensions. z/Architecture has its own vector facility (introduced as the SIMD Vector Facility in 2015) with 32 × 128-bit vector registers.
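
The lane-wise behavior of the vector ADD can be modeled in a few lines. This Python sketch treats a 128-bit register as 16 little-endian bytes holding four unsigned 32-bit lanes; `add_4s` is a hypothetical helper, and the input values are made up.

```python
import struct

def add_4s(vn, vm):
    """Model ADD Vd.4S, Vn.4S, Vm.4S on two 16-byte vectors."""
    a = struct.unpack("<4I", vn)
    b = struct.unpack("<4I", vm)
    # Each lane is added independently and wraps at 32 bits;
    # no carry crosses into the neighboring lane.
    return struct.pack("<4I", *((x + y) & 0xFFFFFFFF for x, y in zip(a, b)))

v0 = struct.pack("<4I", 1, 2, 3, 0xFFFFFFFF)
v1 = struct.pack("<4I", 10, 20, 30, 1)
v2 = add_4s(v0, v1)
print(struct.unpack("<4I", v2))
```

The last lane overflows to 0 without disturbing the others, which is the defining property of a SIMD add as opposed to one 128-bit addition.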


Bare-Metal ARM: Programming Without an OS

One of ARM's most distinctive applications is embedded systems — microcontrollers that run without an operating system. On a Cortex-M microcontroller:

  • There is no OS, no virtual memory, no file system.
  • Execution starts from the vector table, located at address 0 by default: the first word is the initial stack pointer, and the words after it are exception handler addresses, beginning with the reset handler.
  • Memory-mapped I/O: peripherals (GPIOs, UARTs, timers) are controlled by reading and writing specific memory addresses.
  • Interrupts are handled by storing handler addresses in the vector table.

asm
// Cortex-M4 vector table (simplified)
    .word  0x20008000       // initial stack pointer (top of SRAM)
    .word  Reset_Handler    // reset handler — entry point
    .word  NMI_Handler      // NMI interrupt
    .word  HardFault_Handler
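
To make the layout concrete, the following Python sketch packs the same four entries into the bytes that would sit at the start of flash. The handler addresses are hypothetical; the stack pointer value is the one from the listing. Cortex-M cores execute only Thumb code, so each handler entry has bit 0 set, which the linker normally does for you.

```python
import struct

STACK_TOP = 0x20008000            # initial stack pointer from the listing

# Hypothetical handler addresses in flash, in vector table order.
handlers = {
    "Reset_Handler": 0x08000100,
    "NMI_Handler": 0x08000180,
    "HardFault_Handler": 0x080001A0,
}

def build_vector_table(stack_top, handler_addrs):
    """Pack the initial SP and handler addresses as little-endian words."""
    entries = [stack_top]
    for addr in handler_addrs:
        entries.append(addr | 1)  # bit 0 set marks a Thumb entry point
    return struct.pack(f"<{len(entries)}I", *entries)

table = build_vector_table(STACK_TOP, handlers.values())
print(len(table), table.hex())
```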

Compared to z/OS: z/OS provides a full operating system environment; you never write bare-metal code for a mainframe. The bare-metal ARM model is closer to writing a z/OS SVC handler or an I/O supervisor routine — direct hardware access with no safety net.


Debugging ARM Assembly

On Linux (Raspberry Pi, Android via adb, or QEMU):

bash
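# Install the cross-toolchain, QEMU user-mode emulation, and a multi-architecture GDB
sudo apt install gcc-aarch64-linux-gnu qemu-user gdb-multiarch

# Assemble (with debug info) and link
aarch64-linux-gnu-as -g program.s -o program.o
aarch64-linux-gnu-ld -o program program.o

# On a non-ARM host: run under QEMU, which waits for a debugger on port 1234
qemu-aarch64 -g 1234 ./program &

# Attach GDB to QEMU's debug stub
gdb-multiarch -ex "target remote localhost:1234" ./program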

For Cortex-M embedded development, OpenOCD with a JTAG/SWD probe (J-Link or ST-Link) provides hardware debugging — the equivalent of IBM's hardware service element on mainframes.


Conclusion

ARM assembly has a clean, orthogonal design that many assembly programmers find pleasant to work with. Its load and store instructions will feel familiar from HLASM, even though the strict register-only arithmetic is new. Its 31 general-purpose registers, conditional select instructions, and flexible addressing modes make it expressive and efficient.

The key differences from HLASM are the 31 vs 16 registers, the hardware link register (LR/X30) for return addresses, the lighter calling convention, the SVC #0 syscall interface, and byte order: ARM is little-endian on most implementations, whereas z/Architecture is big-endian.

Explore how the three architectures — z/Architecture, x86-64, and ARM — come together in practical programs through the Mini Projects in Assembly Language article.