The precise semantics of floating-point arithmetic programs depends on the execution platform, including the compiler and the target hardware, even if these are fully compliant with the widely adopted IEEE 754 floating-point standard. Platform dependencies infringe on the highly desirable goal of software portability (which is in fact promised by heterogeneous computing frameworks like OpenCL): the same program run on the same inputs on different platforms can produce different results. Serious doubts on the portability of numeric applications arise when these differences are behavioral, i.e. when they lead to changes in the control flow of a program.
In this research we design an algorithm that takes a numeric procedure and determines an input that may lead to different branching decisions depending on how the arithmetic in the procedure is compiled. Our algorithm proceeds in two phases. The first phase determines a candidate input, such that minor variations of the value of exactly one of the input variables causes different branching decisions. In the second phase, the algorithm efficiently searches though all possible platform parameters such as decisions made by the compiler about evaluation order, and use or non-use of FMA hardware features. The goal is to find two sets of platform parameters that lead to different branching decisions.
Our implementation of the algorithm requires minimal intervention by the user. We illustrate its operation on a diverse set of examples, characteristic of scientific numeric computing, where control flow divergence actually occurs across different execution platforms.
Experiments
We test our algorithm on three procedures: (1) ray tracing, in the raySphere function there is a branching decision over the D variable (D> 0). The inputs are s, r, which represent the coordinates of the sphere origin and the ray, and radiusSq, a scalar value giving the square of the radius of sphere; (2) molecular dynamics, this open source code calculates the Lennard Jones potential of a molecular system. The energy is calculated based on pairwise distances of atoms and their velocity and acceleration. In the ComputeAccel function a decision branch occurs over (rr , rrCut). rr is the pairwise distance between atoms j1 and j2 and rrCut is the threshold for pairwise distances. The calculation of rr is directly effected by a density variable of the atom using a face centered cubic model; (3) long summation of 32 numbers, we compare serial C code which accumulates a value to a register, and a reduction kernel, written in OpenCL which is the common way to implement long summation on a parallel architecture. The result of the sum is compared to zero.
1. Generating Unstable Inputs
To test our algorithm, we first apply transformations to each procedure: we attach a main function that calls the tested procedure; we annotate the symbolic variables and volatile expressions by calling klee_make_symbolic_with_sort and klee_tag_reorderable. Following shows the ray tracing code after transformation.
//Dot Product 3-Vectors
float dot3(float *a, float *b){
float r = a[0] * b[0] + a[1] * b[1]
+ a[2] * b[2];
klee_tag_reorderable(&r, 0, 2);
return r;
}
int raySphere(float *r, float *s, float radiusSq) {
float A = dot3(r,r);
float B = -2.0 * dot3(s,r);
float C = dot3(s,s) - radiusSq;
float D = B*B - 4*A*C;
if (D > 0)
return 0;
}
int main() {
float r[3], s[3];
float radiusSq;
klee_make_symbolic_with_sort(&r, sizeof(r), "r", 8, 32);
klee_make_symbolic_with_sort(&s, sizeof(s), "s", 8, 32);
klee_make_symbolic_with_sort(&radiusSq, sizeof(radiusSq), "radiusSq", 0, 32);
return raySphere(r, s, radius);
}
For a conditional q in the execution path, there may exist multiple inputs that exhibit control flow instability. In our experiments we split the domian of one input variable into subintervals, and run our algorithm on each of these subintervals. In the table below it shows the experiment results.
Procedure | Variable | Domain | Subinterval length | #Unstable Inputs |
ray tracing | s[0] | [-50, 50] | 0.01 | 408 |
molecular dynamics | r_j1[0] | [0, 1] | 0.1 | 12 |
long summation | a[0] | [0, 1] | 0.01 | 100 |
2. Control Flow Divergence
To show control flow divergence, we took the generated sets of inputs and tested them on six different architectures. We convert the serial c code into OpenCL code so that we can run the same code and compare the results on these heterogeneous architectures. Below is the list of these architectures.
type | Manufacture | Description | Year | FMA? | |
1 | CPU | Intel | E5 2650 | 2012 | * |
2 | CPU | AMD | A8-3850 | 2011 | N |
3 | GPU | NVIDIA | GF108 Quadro 600 | 2010 | N |
4 | GPU | NVIDIA | Tesla C2075 | 2011 | Y |
5 | GPU | NVIDIA | Tesla K20 | 2013 | Y |
6 | GPU | AMD | Radeon HD 65550D | 2011 | N |
2.1 Ray Tracing Results
For the generated 408 sets of inputs. 45 of these produce results on different sides of zero for D on different architectures. In the table below some of these inputs are shown. D1, D2 and D3 show the value of D for D1: machine 1,2, D2: machine 3,4,5 and D3: machine 6.
s[0] | s[1] | s[2] | r[0] | r[1] | r[2] | radiusSq | D1 | D2 | D3 |
-33.3704719543457 | -53 | -52.0185546875 | -33.999900817871 | -54 | -53 | 2.9802322E-8 | 4.54933548 | -3.562728882 | 0.0 |
30.1333904266357 | -41 | -40.0234375 | 30.870059967041 | -42 | -41 | 1.907348633E-6 | 3.874450684 | -7.142879486 | 0.0 |
40.0934867858886 | -53 | -52.0185546875 | 40.8501510620117 | -54 | -53 | 2.9802322E-8 | 7.814369202 | -4.13470459 | 0.0 |
40.073875427246 | -53 | -52.0185546875 | 40.8301696777343 | -53 | -54 | 2.9802322E-8 | -6.409370422 | 7.924546242 | 0.0 |
-27.4816207885742 | -53 | -52.0185546875 | -27.9999465942382 | -54 | -53 | 2.9802322E-8 | 6.105501175 | -6.985953331 | 0.0 |
2.2 Molecular Dynamics Results
Following table shows some of the results. R is the difference in pairwise distances of the atoms j1 and j2 to rrCut (rr-rrCut). R1,R2 shows result for machine 1,2,6 and machine 3,4,5 respectively.
r_j1[0] | r_j1[1] | r_j1[2] | r_j2[0] | r_j2[1] | r_j2[2] | Density | rrCut | R1 | R2 |
0.50000011920928955 | 23.71043968200683594 | 22.29841232299804688 | 3.81469726562E-6 | 3.81469726562E-6 | 3.81469726562E-6 | 0.3 | 2.25 | 0.0 | -2.38418579101562E-7 |
0.5078125 | 0.12553329765796661 | 7.8125E-3 | 23.9375 | 23.75 | 22.25 | 0.3 | 2.25 | -2.38418579101562E-7 | 0.0 |
0.4000244140625 | 0.41066083312034607 | 26.125 | 6.25E-2 | 1.5625E-2 | 6.25E-2 | 0.2 | 1.44 | 5.72204590376657E-8 | -6.19888305131155E-8 |
0.10009765625 | 27.23913192749023438 | 4.8828125E-4 | 27.21875 | 0.125 | 25.9453125 | 0.2 | 1.44 | -6.19888305131155E-8 | 5.72204590376657E-8 |
0.2000005841255188 | 34.19741058349609375 | 1.81899E-12 | 0.18798828125 | 7.4505806E-9 | 33.69966888427734375 | 0.1 | 0.25 | -1.49011611938476E-8 | 0.0 |
0.3125 | 6.25E-2 | 33.701171875 | 0.300048828125 | 2.464281208813E-2 | 6.103515625E-5 | 0.1 | 0.25 | 0.0 | -1.49011611938476E-08 |
2.3 Long summation and Reduction Results
The 32 input numbers have a range between [-1.05,1]. Sum (1,2,3,5,6) shows the reduction results for machine 1,2,3,5,6 and sum 4 shows the reduction results for machine 4. Our code compared the result of the summation to 0. The reduction kernel and serial summation results on the inputs generated produce results on either side of zero.
Sum 1,2,3,5,6 | Sum 4 | Serial Sum |
-4.65661287E-8 | -8.7544322E-8 | 4.49363142E-8 |
-5.96046448E-8 | -8.7544322E-8 | 1.054722816E-7 |
-5.96046448E-8 | -8.7544322E-8 | 9.1502443E-8 |