Wednesday, July 13, 2016

Funny OAT compiler mistake

I noticed a funny bug in the OAT compiler ( http://stackoverflow.com/questions/28435860/what-does-oat-mean ) where the optimization apparently goes wrong. I have only tested this on a few devices that I have, so I don't know how well spread this problem is or if it's fixed in newer versions of Android. One device says it's Android 6.0.1 and one 5.1.1.

Consider the following example:
class MyThread extends Thread {
    private static MyThread instance;
    private static final int LENGTH = 100;

    public synchronized static void init() {
        if (instance == null) {
            instance = new MyThread();
            instance.start();
        }
    }

    public void run() {
        Random rnd = new Random(314);
        for (int i = 0; i < 10; i++) {
            int a = rnd.nextInt(9) - 4;
            int b = rnd.nextInt(9) - 4;
            int z0 = 0;
            int z = calcLoop(a, b, z0);
            int zz[] = calcArray(a, b, z0);

            Log.i("TestCalc", "a=" + a + " b=" + b + " z0=" + z0 + " loop=" + z + " array=" + zz[LENGTH - 1]);
            Log.i("TestCalc", "zz[0]=" + zz[0] + " zz[1]=" + zz[1] + " zz[2]=" + zz[2]);
        }
    }

    private int calcLoop(int a, int b, int z) {
        for (int i = 0; i < LENGTH; i++) {
            z = z * a + b;
        }
        return z;
    }

    private int[] calcArray(int a, int b, int z) {
        int ret[] = new int[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            ret[i] = -123;
        }
        for (int i = 0; i < LENGTH; i++) {
            z = z * a + b;
            ret[i] = z;
        }
        return ret;
    }
}

You would expect the values in the printout for z and zz to be the same. Here is what I got from my first device (Android 5.1.1 on a QualComm Snapdragon 616, 64-bit, platform):

a=-1 b=-1 z0=0 loop=-4 array=-1
zz[0]=-1 zz[1]=-1 zz[2]=-1
a=-1 b=1 z0=0 loop=4 array=1
zz[0]=1 zz[1]=1 zz[2]=1
a=-2 b=0 z0=0 loop=0 array=0
zz[0]=0 zz[1]=0 zz[2]=0
a=1 b=2 z0=0 loop=200 array=50
zz[0]=2 zz[1]=2 zz[2]=2
a=-2 b=3 z0=0 loop=134217732 array=33554433
zz[0]=3 zz[1]=3 zz[2]=3
a=3 b=-1 z0=0 loop=1934863036 array=1557457583
zz[0]=-1 zz[1]=-1 zz[2]=-1
a=4 b=1 z0=0 loop=1431655764 array=1431655765
zz[0]=1 zz[1]=1 zz[2]=1
a=3 b=-2 z0=0 loop=-425241224 array=-1180052130
zz[0]=-2 zz[1]=-2 zz[2]=-2
a=2 b=-4 z0=0 loop=-536870896 array=-134217724
zz[0]=-4 zz[1]=-4 zz[2]=-4
a=-4 b=-4 z0=0 loop=858993456 array=-858993460
zz[0]=-4 zz[1]=-4 zz[2]=-4
On a device with Android 6.0.1 and QualComm Snapdragon 801, 32-bit, platform:

a=-1 b=-1 z0=0 loop=-4 array=-1
zz[0]=-1 zz[1]=-1 zz[2]=-1
a=-1 b=1 z0=0 loop=4 array=1
zz[0]=1 zz[1]=1 zz[2]=1
a=-2 b=0 z0=0 loop=0 array=0
zz[0]=0 zz[1]=0 zz[2]=0
a=1 b=2 z0=0 loop=200 array=50
zz[0]=2 zz[1]=2 zz[2]=2
a=-2 b=3 z0=0 loop=134217732 array=33554433
zz[0]=3 zz[1]=3 zz[2]=3
a=3 b=-1 z0=0 loop=1934863036 array=1557457583
zz[0]=-1 zz[1]=-1 zz[2]=-1
a=4 b=1 z0=0 loop=1431655764 array=1431655765
zz[0]=1 zz[1]=1 zz[2]=1
a=3 b=-2 z0=0 loop=-425241224 array=-1180052130
zz[0]=-2 zz[1]=-2 zz[2]=-2
a=2 b=-4 z0=0 loop=-536870896 array=-134217724
zz[0]=-4 zz[1]=-4 zz[2]=-4
a=-4 b=-4 z0=0 loop=858993456 array=-858993460
zz[0]=-4 zz[1]=-4 zz[2]=-4
This makes no sense. The calculations are totally wrong in almost every case. I'm probably missing something vital here or doing something incredibly stupid. But let's have a look at how the code is compiled:

ubuntu$ adb shell
device$ su
device# cd /data/dalvik-cache/arm64
device# ll *testcalc*
device# oatdump --oat-file=data@app@com.nicklase.testcalc-1@base.apk@classes.dex --output=/sdcard/nicklas.txt
device# exit
device$ exit
ubuntu$  adb pull /sdcard/nicklas.txt .
Let's take a look at the calcLoop() method:

  2: int com.nicklase.testcalc.MyThread.calcLoop(int, int, int) (dex_method_idx=10812)
    DEX CODE:
      0x0000: const/4 v0, #+0
      0x0001: const/16 v1, #+100
      0x0003: if-lt v0, v1, +3
      0x0005: return v5
      0x0006: mul-int v1, v5, v3
      0x0008: add-int v5, v1, v4
      0x000a: add-int/lit8 v0, v0, #+1
      0x000c: goto -11
Seems ok? Let's look at the compiled ARM code:

    CODE: (code_offset=0x0027bd6c size_offset=0x0027bd68 size=192)...
      0x0027bd6c: d10103ff      sub sp, sp, #0x40 (64)
      0x0027bd70: a90157f4      stp x20, x21, [sp, #16]
      0x0027bd74: a9025ff6      stp x22, x23, [sp, #32]
      0x0027bd78: a9037bf8      stp x24, x30, [sp, #48]
      0x0027bd7c: b90003e0      str w0, [sp]
      0x0027bd80: b90047e1      str w1, [sp, #68]
      0x0027bd84: 2a0203f7      mov w23, w2
      0x0027bd88: 2a0303f8      mov w24, w3
      0x0027bd8c: 2a0403f4      mov w20, w4
      0x0027bd90: 2a1f03f5      mov w21, wzr
      0x0027bd94: 52800c96      movz w22, #0x64
      0x0027bd98: 52800082      movz w2, #0x4
      0x0027bd9c: 0b0202b5      add w21, w21, w2
      0x0027bda0: 510006b5      sub w21, w21, #0x1 (1)
      0x0027bda4: 6b1602bf      cmp w21, w22
      0x0027bda8: 5400018a      b.ge #+0x30 (addr 0x7fccd16cf4)
      0x0027bdac: 4f000400      unimplemented (Unimplemented)
      0x0027bdb0: 4e040ee1      unimplemented (Unimplemented)
      0x0027bdb4: 4e040f03      unimplemented (Unimplemented)
      0x0027bdb8: 4ea19c02      unimplemented (Unimplemented)
      0x0027bdbc: 4ea38440      unimplemented (Unimplemented)
      0x0027bdc0: 0b0202b5      add w21, w21, w2
      0x0027bdc4: 6b1602bf      cmp w21, w22
      0x0027bdc8: 54ffff8b      b.lt #-0x10 (addr 0x7fccd16cd4)
      0x0027bdcc: 4eb1b800      unimplemented (Unimplemented)
      0x0027bdd0: 0e043c03      unimplemented (Unimplemented)
      0x0027bdd4: 0b030294      add w20, w20, w3
      0x0027bdd8: 4b0202b5      sub w21, w21, w2
      0x0027bddc: 110006b5      add w21, w21, #0x1 (1)
      0x0027bde0: 6b1602bf      cmp w21, w22
      0x0027bde4: 540000ea      b.ge #+0x1c (addr 0x7fccd16ce0)
      0x0027bde8: 1b177e96      mul w22, w20, w23
      0x0027bdec: 0b1802d4      add w20, w22, w24
      0x0027bdf0: 110006b5      add w21, w21, #0x1 (1)
      0x0027bdf4: 52800c96      movz w22, #0x64
      0x0027bdf8: 6b1602bf      cmp w21, w22
      0x0027bdfc: 54ffff6b      b.lt #-0x14 (addr 0x7fccd16cd8)
      0x0027be00: 71000673      subs w19, w19, #0x1 (1)
      0x0027be04: 540000e0      b.eq #+0x1c (addr 0x7fccd16ce0)
      0x0027be08: 2a1403e0      mov w0, w20
      0x0027be0c: a94157f4      ldp x20, x21, [sp, #16]
      0x0027be10: a9425ff6      ldp x22, x23, [sp, #32]
      0x0027be14: a9437bf8      ldp x24, x30, [sp, #48]
      0x0027be18: 910103ff      add sp, sp, #0x40 (64)
      0x0027be1c: d65f03c0      ret
      0x0027be20: f941f65e      ldr x30, [x18, #1000]
      0x0027be24: d63f03c0      blr x30
      suspend point dex PC: 0x0005
      0x0027be28: 17fffff8      b #-0x20 (addr 0x7fccd16ce4)
Well, the "mul" and "add" stuff is there:

     0x0027bde8: 1b177e96      mul w22, w20, w23
      0x0027bdec: 0b1802d4      add w20, w22, w24
      0x0027bdf0: 110006b5      add w21, w21, #0x1 (1)
      0x0027bdf4: 52800c96      movz w22, #0x64
      0x0027bdf8: 6b1602bf      cmp w21, w22
      0x0027bdfc: 54ffff6b      b.lt #-0x14 (addr 0x7fccd16cd8)
But why is the result wrong? I don't know at this point. I will have to dig into it further. The problem seems serious enough to have escaped unnoticed. There is probably a bug report somewhere already...

And btw, if your run the same Java code on a standard Java VM on your PC it will produce what you would expect :-)

-----

Update 2016-07-13:

The oatdump utility isn't able to properly disassemble the code. But bringing up the code in gdb instead reveals the problem more clearly. The optimization that fails is that 4 multiply and add operations are brought together with vector operations. Well, it didn't work and I'm very puzzled about why the bug is there. How can have survived and not been discovered during normal regression testing?

The vector stuff:

   0x000000000027cdac <oatexec+933292>:    movi    v0.4s, #0x0
   0x000000000027cdb0 <oatexec+933296>:    dup    v1.4s, w23
   0x000000000027cdb4 <oatexec+933300>:    dup    v3.4s, w24
   0x000000000027cdb8 <oatexec+933304>:    mul    v2.4s, v0.4s, v1.4s
   0x000000000027cdbc <oatexec+933308>:    add    v0.4s, v2.4s, v3.4s
   0x000000000027cdc0 <oatexec+933312>:    add    w21, w21, w2
   0x000000000027cdc4 <oatexec+933316>:    cmp    w21, w22
   0x000000000027cdc8 <oatexec+933320>:    b.lt    0x27cdb8 <oatexec+933304>
   0x000000000027cdcc <oatexec+933324>:    addv    s0, v0.4s
   0x000000000027cdd0 <oatexec+933328>:    mov    w3, v0.s[0]
If there is a proper way of doing these operations via vector operations? I don't know. I have to think about that another day. In my original application where this problem was encountered, I wanted the "normal" 32-bit wrap around calculations.

No comments:

Post a Comment