2018-02-26
Inspired by an analysis of the source code of
Rogue, I decided to find
out how modern compilers encode selecting the maximum of four integers.
int max(int a, int b, int c, int d) {
return a > b ? (a > c ? (a > d ? a : d) : (c > d ? c : d))
: (b > c ? (b > d ? b : d) : (c > d ? c : d));
}
It is important to note that this is isolated code generation and a modern
compiler could probably do better if more context was available.
Using Clang 5.0, I get the following (with optimizations enabled, of course).
0: 39 f7 cmp edi, esi
2: 0f 4d f7 cmovge esi, edi
5: 39 d6 cmp esi, edx
7: 0f 4c f2 cmovl esi, edx
a: 39 ce cmp esi, ecx
c: 0f 4c f1 cmovl esi, ecx
f: 89 f0 mov eax, esi
11: c3 ret
I will rename the code according to the calling convention used here.
a -> edi
b -> esi
c -> edx
d -> ecx
The result follows with some comments on what the code is doing.
cmp a, b
cmovge b, a # If a >= b, assign a to b.
# Now b has max(a, b).
cmp b, c
cmovl b, c # If max(a, b) < c, assign c to b.
# Now b has max(max(a, b), c).
cmp b, d
cmovl b, d # If max(max(a, b), c) < d, assign d to b.
# Now b has max(max(max(a, b), c), d).
mov eax, b # Return b through eax.
ret
I don’t think it can be done in less code than this.
As one might expect, using C++ and std::max, we get almost identical x86.
2018-01-04
In this post I present a short comparison between the machine code generated by
GCC 7.2 and Clang 5.0.0 for the evaluation of a floating point polynomial.
For both compilers, only -O2 was used, but -O3 produced the same code.
float evaluate(float a, float b) {
return (a - b + 1.0f) * (a - b) * (a - b - 1.0f);
}
GCC
evaluate(float, float):
subss xmm0, xmm1
movss xmm2, DWORD PTR .LC0[rip]
movaps xmm1, xmm0
addss xmm1, xmm2
mulss xmm1, xmm0
subss xmm0, xmm2
mulss xmm0, xmm1
ret
.LC0:
.long 1065353216
Clang
.LCPI0_0:
.long 1065353216 # float 1
.LCPI0_1:
.long 3212836864 # float -1
evaluate(float, float): # @evaluate(float, float)
subss xmm0, xmm1
movss xmm1, dword ptr [rip + .LCPI0_0]
addss xmm1, xmm0
mulss xmm1, xmm0
addss xmm0, dword ptr [rip + .LCPI0_1]
mulss xmm0, xmm1
ret
Explanation (for the Clang output)
xmm0 = a
xmm1 = b
xmm0 = a - b
xmm1 = 1.0f
xmm1 = a - b + 1.0f
xmm1 = (a - b + 1.0f) * (a - b)
xmm0 = a - b - 1.0f
xmm0 = (a - b - 1.0f) * (a - b + 1.0f) * (a - b)
GCC seems to prefer movaps over movss, even though movss is sufficient in this
case. A reason for doing so is that using movaps avoid stalls from partial
updates to XMM registers. Clang doesn’t generate movaps, but uses two constants
and only addss for them rather than only having one and using subss to subtract
one from a register.
After benchmarking these alternatives, they had roughly the same throughput.
2017-11-30
In this post I present a short comparison between the machine code generated by
Clang 5.0.0 for two different for loops.
Sum up to N
The first function sums from 1 to N and returns the result.
C++
int sumUpTo(int n) {
int x = 0;
for (int i = 1; i <= n; i++) {
x += i;
}
return x;
}
Output
sumUpTo(int):
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
shr rcx
lea eax, [rcx + 2*rdi - 1]
ret
.LBB0_1:
xor eax, eax
ret
As you can see, Clang used a closed formula to evaluate the expression!
What should it be
(1 + n)(n) / 2
= (n^2 + n) / 2
What Clang does
(n - 1)(n - 2) / 2 + (2n - 1)
= (n^2 - 3n + 2) / 2 + (4n - 2) / 2
= (n^2 + n) / 2
It is important to note that the reason why a compiler would rather evaluate
this convoluted expression is that it will not overflow like the first
expression would.
Sum up to N multiplying by 2
C++
int sumTwiceUpTo(int n) {
int x = 0;
for (int i = 1; i <= n; i++) {
x += 2 * i;
}
return x;
}
Output
sumTwiceUpTo(int):
test edi, edi
jle .LBB1_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul ecx, eax
and ecx, -2
lea eax, [rcx + 4*rdi - 2]
ret
.LBB1_1:
xor eax, eax
ret
What is is
What Clang does
((n - 1)(n - 2) & 111...0) + 4n - 2
= (n - 1)(n - 2) + 4n - 2
= n^2 + n
Again, a closed formula!
It is interesting to note that the and is completely useless here: it is forcing
the product (which is always even) into a even number. Also, it requires the
multiplication result to be evaluated, so it could add measurable delay.
2017-10-14
Description
Wunderlist to Taskwarrior is an application that fetches your tasks from
Wunderlist and inserts them into Taskwarrior. It can be regularly ran by the
operating system to keep Taskwarrior up-to-date.
Why I Made This
As I cannot update Taskwarrior when I am not near one of my computers I needed a
way to update Taskwarrior while on the road. The solution I found uses
Wunderlist as an intermediate. I already used Wunderlist for simple lists such
as the weekly groceries but I greatly prefer Taskwarrior over it, so I still use
Taskwarrior when I am at my computer.
Relevant Features
- Selective Synchronization. Lists starting with “!” are never synchronized.
- Data Safety. All communication is done via HTTPS.
- Persistence. SQLite is used to keep track of what has been synchronized.
I had to decide between a continuous process which would update Taskwarrior or a
utility application that would be scheduled to run regularly. I opted for the
latter as the application initialization and termination are not so expensive.
Currently, I’ve been running it every minute on one of my computers.
The Code
Wunderlist to Taskwarrior was written in Haskell using Stack so it should be
easy enough to get a build for your own computer.
This is the GitHub repository.
2017-09-10
Description
This is a post regarding the solution to a substring problem which follows a pattern that is quite common among substring problems: it is solvable with a pair of fast and slow iterators.
This problem can often appear as an interview question or as part of a competitive programming problem in sites such as LeetCode and Codeforces.
If this is not possible, we will return the empty string.
Problem Statement
Given two strings, S and T, find the minimum window in S which will contain all
the characters in T (with their frequencies) in linear time.
S |
T |
Answer |
"" |
"" |
"" |
"" |
"A" |
"" |
"A" |
"" |
"" |
"A" |
"A" |
"A" |
"A" |
"AB" |
"A" |
"ADOBECODEBANC" |
"ABC" |
"BANC" |
"ADOBECODECABANC" |
"ABC" |
"CAB" |
C++ Solution
string minimum_window_substring(string s, string t) {
if (t.empty()) {
return "";
}
// Frequency in the token.
unordered_map<char, size_t> tf;
for (char c : t) {
tf[c]++;
}
// Frequency in the string.
unordered_map<char, size_t> sf;
// How many characters we have to match.
const auto target = t.size();
const auto n = s.size();
size_t best_i = 0;
const auto no_size = numeric_limits<size_t>::max();
size_t best_size = no_size;
size_t slow = 0;
size_t fast = 0;
size_t matched = 0;
// Advance fast until we met the target.
// Then shrink with slow until we no longer meet the conditions.
// As this advances each pointer at most N times, this is O(n).
while (slow < n && fast < n) {
sf[s[fast]]++;
if (sf[s[fast]] <= tf[s[fast]]) {
matched++;
}
fast++;
// Here, the range is [slow, fast).
while (matched == target && slow < fast) {
const size_t match_size = fast - slow;
if (match_size < best_size) {
best_i = slow;
best_size = match_size;
}
sf[s[slow]]--;
if (sf[s[slow]] < tf[s[slow]]) {
matched--;
}
slow++;
}
}
if (best_size == no_size) {
return "";
}
return s.substr(best_i, best_size);
}