Examining Compiler Output 2
2018-01-04In this post I present a short comparison between the machine code generated by GCC 7.2 and Clang 5.0.0 for the evaluation of a floating point polynomial.
For both compilers, only -O2 was used, but -O3 produced the same code.
float evaluate(float a, float b) {
return (a - b + 1.0f) * (a - b) * (a - b - 1.0f);
}
GCC
evaluate(float, float):
subss xmm0, xmm1
movss xmm2, DWORD PTR .LC0[rip]
movaps xmm1, xmm0
addss xmm1, xmm2
mulss xmm1, xmm0
subss xmm0, xmm2
mulss xmm0, xmm1
ret
.LC0:
.long 1065353216
Clang
.LCPI0_0:
.long 1065353216 # float 1
.LCPI0_1:
.long 3212836864 # float -1
evaluate(float, float): # @evaluate(float, float)
subss xmm0, xmm1
movss xmm1, dword ptr [rip + .LCPI0_0]
addss xmm1, xmm0
mulss xmm1, xmm0
addss xmm0, dword ptr [rip + .LCPI0_1]
mulss xmm0, xmm1
ret
Explanation (for the Clang output)
xmm0 = a
xmm1 = b
xmm0 = a - b
xmm1 = 1.0f
xmm1 = a - b + 1.0f
xmm1 = (a - b + 1.0f) * (a - b)
xmm0 = a - b - 1.0f
xmm0 = (a - b - 1.0f) * (a - b + 1.0f) * (a - b)
GCC seems to prefer movaps over movss, even though movss is sufficient in this case. A reason for doing so is that using movaps avoid stalls from partial updates to XMM registers. Clang doesn’t generate movaps, but uses two constants and only addss for them rather than only having one and using subss to subtract one from a register.
After benchmarking these alternatives, they had roughly the same throughput.