2017-08-02 100 views
12

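Handling integers with values > 2^32 on Sparc 32-bit

I wrote a small program that measures the time spent in a loop (via an inline Sparc assembly snippet).

Everything works until I set the number of iterations above roughly 4.0e+9 (that is, above 2^32).

Here is the code snippet: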

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h> 
#include <unistd.h> 
#include <math.h> 
#include <stdint.h> 

int main (int argc, char *argv[]) 
{ 
    // For indices 
    int i; 
    // Set the number of executions 
    int nRunning = atoi(argv[1]); 
    // Set the sums 
    double avgSum = 0.0; 
    double stdSum = 0.0; 
    // Average of execution time 
    double averageRuntime = 0.0; 
    // Standard deviation of execution time 
    double deviationRuntime = 0.0; 

    // Init sum 
    unsigned long long int sum = 0; 
    // Number of iterations 
    unsigned long long int nLoop = 4000000000ULL; 
    //uint64_t nLoop = 4000000000; 

    // DEBUG 
    printf("sizeof(unsigned long long int) = %zu\n",sizeof(unsigned long long int)); 
    printf("sizeof(unsigned long int) = %zu\n",sizeof(unsigned long int)); 

    // Time intervals 
    struct timeval tv1, tv2; 
    double diff; 

    // Loop for multiple executions 
    for (i=0; i<nRunning; i++) 
    { 
    // Start time 
    gettimeofday (&tv1, NULL); 

    // Loop with Sparc assembly into C source 
    asm volatile ("clr %%g1\n\t" 
       "clr %%g2\n\t" 
       "mov %1, %%g1\n" // %1 = input parameter 
       "loop:\n\t" 
       "add %%g2, 1, %%g2\n\t" 
       "subcc %%g1, 1, %%g1\n\t" 
       "bne loop\n\t" 
       "nop\n\t" 
       "mov %%g2, %0\n" // %0 = output parameter 
       : "=r" (sum)  // output 
       : "r" (nLoop) // input 
       : "g1", "g2"); // clobbers 

    // End time 
    gettimeofday (&tv2, NULL); 

    // Compute runtime for loop 
    diff = (tv2.tv_sec - tv1.tv_sec) * 1000000ULL + (tv2.tv_usec - tv1.tv_usec); 

    // Summing diff time 
    avgSum += diff; 
    stdSum += (diff*diff); 

    // DEBUG 
    printf("diff = %e\n", diff); 
    printf("avgSum = %e\n", avgSum); 

    } 
    // Compute final averageRuntime 
    averageRuntime = avgSum/nRunning; 

    // Compute standard deviation 
    deviationRuntime = sqrt(stdSum/nRunning-averageRuntime*averageRuntime); 

    // Print results 
    printf("(Average Elapsed time, Standard deviation) = %e usec %e usec\n", averageRuntime, deviationRuntime); 
    // Print sum from assembly loop 
    printf("Sum = %llu\n", sum);

    return 0;
}

For example, with nLoop < 2^32 I get correct values for diff, avgSum and stdSum. Indeed, the printf output with nLoop = 4.0e+9 gives:

sizeof(unsigned long long int) = 8 
sizeof(unsigned long int) = 4 
diff = 9.617167e+06 
avgSum = 9.617167e+06 
diff = 9.499878e+06 
avgSum = 1.911704e+07 
(Average Elapsed time, Standard deviation) = 9.558522e+06 usec 5.864450e+04 usec 
Sum = 4000000000 

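The code was compiled on Debian Sparc 32 bits Etch with gcc 4.1.2.

Unfortunately, if I take for example nLoop = 5.0e+9, I get small and incorrect values for the measured times; here is the printf output in this case: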

sizeof(unsigned long long int) = 8 
sizeof(unsigned long int) = 4 
diff = 5.800000e+01 
avgSum = 5.800000e+01 
diff = 4.000000e+00 
avgSum = 6.200000e+01 
(Average Elapsed time, Standard deviation) = 3.100000e+01 usec 2.700000e+01 usec 
Sum = 5000000000 

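I don't know where the problem could come from. I ran other tests using uint64_t, but without success.

Maybe the problem is that I am handling large integers (> 2^32) on a 32-bit OS, or maybe the inline assembly code does not support 8-byte integers.

If anyone could give me some clues to fix this bug, I would be grateful.

Regards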

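UPDATE 1:

Following @Andrew Henle's suggestion, I took the same code, but instead of the inline Sparc assembly snippet I just put a simple loop.

Here is the program with the simple loop, using nLoop = 5.0e+9 (see the line "unsigned long long int nLoop = 5000000000ULL;"), so above the limit 2^32-1: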

#include <stdio.h> 
#include <stdlib.h> 
#include <sys/time.h> 
#include <unistd.h> 
#include <math.h> 
#include <stdint.h> 

int main (int argc, char *argv[]) 
{ 
    // For indices of nRunning 
    int i; 
    // For indices of nLoop
    unsigned long long int j; 
    // Set the number of executions 
    int nRunning = atoi(argv[1]); 
    // Set the sums 
    unsigned long long int avgSum = 0; 
    unsigned long long int stdSum = 0; 
    // Average of execution time 
    double averageRuntime = 0.0; 
    // Standard deviation of execution time 
    double deviationRuntime = 0.0; 

    // Init sum 
    unsigned long long int sum; 
    // Number of iterations 
    unsigned long long int nLoop = 5000000000ULL; 

    // DEBUG 
    printf("sizeof(unsigned long long int) = %zu\n",sizeof(unsigned long long int)); 
    printf("sizeof(unsigned long int) = %zu\n",sizeof(unsigned long int)); 

    // Time intervals 
    struct timeval tv1, tv2; 
    unsigned long long int diff; 

    // Loop for multiple executions 
    for (i=0; i<nRunning; i++) 
    { 
    // Reset sum 
    sum = 0; 

    // Start time 
    gettimeofday (&tv1, NULL); 

    // Loop with Sparc assembly into C source 
    /* asm volatile ("clr %%g1\n\t" 
       "clr %%g2\n\t" 
       "mov %1, %%g1\n" // %1 = input parameter 
     "loop:\n\t" 
     "add %%g2, 1, %%g2\n\t" 
     "subcc %%g1, 1, %%g1\n\t" 
     "bne loop\n\t" 
     "nop\n\t" 
     "mov %%g2, %0\n" // %0 = output parameter 
     : "=r" (sum)  // output 
     : "r" (nLoop) // input 
     : "g1", "g2"); // clobbers 
    */ 

    // Classic loop 
    for (j=0; j<nLoop; j++) 
     sum ++; 

    // End time 
    gettimeofday (&tv2, NULL); 

    // Compute runtime for loop 
    diff = (unsigned long long int) ((tv2.tv_sec - tv1.tv_sec) * 1000000 + (tv2.tv_usec - tv1.tv_usec)); 

    // Summing diff time 
    avgSum += diff; 
    stdSum += (diff*diff); 

    // DEBUG 
    printf("diff = %llu\n", diff); 
    printf("avgSum = %llu\n", avgSum); 
    printf("stdSum = %llu\n", stdSum); 
    // Print sum from assembly loop 
    printf("Sum = %llu\n", sum); 

    } 
    // Compute final averageRuntime 
    averageRuntime = avgSum/nRunning; 

    // Compute standard deviation 
    deviationRuntime = sqrt(stdSum/nRunning-averageRuntime*averageRuntime); 

    // Print results 
    printf("(Average Elapsed time, Standard deviation) = %e usec %e usec\n", averageRuntime, deviationRuntime); 

    return 0; 

} 

This snippet works fine, i.e. the variable sum is printed as (see "printf("Sum = %llu\n", sum)"):

Sum = 5000000000 

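So the problem comes from the version with the Sparc assembly block.

I suspect that, in the assembly code, the line "mov %1, %%g1\n" // %1 = input parameter fails to store nLoop correctly into the %g1 register (I think %g1 is a 32-bit register, so it cannot hold values above 2^32-1).

However, the output parameter (the variable sum) in the line: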

"mov %%g2, %0\n" // %0 = output parameter 

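is above that limit, since it equals 5 billion.

Attached is the vimdiff between the version with the assembly loop and the version without it:

(figure: vimdiff of the two source versions)

On the left is the program with assembly; on the right, the one without assembly (just a simple loop instead).

As a reminder, my problem is that for nLoop > 2^32-1 with the assembly loop, I get a valid sum parameter at the end of execution, but invalid (too short) average and standard deviation times (spent in the loop); here is an example of output: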

sizeof(unsigned long long int) = 8 
sizeof(unsigned long int) = 4 
diff = 17 
avgSum = 17 
stdSum = 289 
Sum = 5000000000 
diff = 4 
avgSum = 21 
stdSum = 305 
Sum = 5000000000 
(Average Elapsed time, Standard deviation) = 1.000000e+01 usec 7.211103e+00 usec 

With nLoop = 4.0e+9, i.e. nLoop = 4000000000ULL, there is no problem and the time values are valid.

UPDATE 2:

I dug deeper by generating the assembly code. The version with nLoop = 4000000000 (4.0e+9) is below:

.file "loop-WITH-asm-inline-4-Billions.c" 
    .section ".rodata" 
    .align 8 
.LLC1: 
    .asciz "sizeof(unsigned long long int) = %zu\n" 
    .align 8 
.LLC2: 
    .asciz "sizeof(unsigned long int) = %zu\n" 
    .align 8 
.LLC3: 
    .asciz "diff = %llu\n" 
    .align 8 
.LLC4: 
    .asciz "avgSum = %llu\n" 
    .align 8 
.LLC5: 
    .asciz "stdSum = %llu\n" 
    .align 8 
.LLC6: 
    .asciz "Sum = %llu\n" 
    .global __udivdi3 
    .global __cmpdi2 
    .global __floatdidf 
    .align 8 
.LLC7: 
    .asciz "(Average Elapsed time, Standard deviation) = %e usec %e usec\n" 
    .align 8 
.LLC0: 
    .long 0 
    .long 0 
    .section ".text" 
    .align 4 
    .global main 
    .type main, #function 
    .proc 04 
main: 
    save %sp, -248, %sp 
    st %i0, [%fp+68] 
    st %i1, [%fp+72] 
    ld [%fp+72], %g1 
    add %g1, 4, %g1 
    ld [%g1], %g1 
    mov %g1, %o0 
    call atoi, 0 
    nop 
    mov %o0, %g1 
    st %g1, [%fp-68] 
    st %g0, [%fp-64] 
    st %g0, [%fp-60] 
    st %g0, [%fp-56] 
    st %g0, [%fp-52] 
    sethi %hi(.LLC0), %g1 
    or %g1, %lo(.LLC0), %g1 
    ldd [%g1], %f8 
    std %f8, [%fp-48] 
    sethi %hi(.LLC0), %g1 
    or %g1, %lo(.LLC0), %g1 
    ldd [%g1], %f8 
    std %f8, [%fp-40] 
    mov 0, %g2 
    sethi %hi(4000000000), %g3 
    std %g2, [%fp-24] 
    sethi %hi(.LLC1), %g1 
    or %g1, %lo(.LLC1), %o0 
    mov 8, %o1 
    call printf, 0 
    nop 
    sethi %hi(.LLC2), %g1 
    or %g1, %lo(.LLC2), %o0 
    mov 4, %o1 
    call printf, 0 
    nop 
    st %g0, [%fp-84] 
    b .LL2 
    nop 
.LL3: 
    st %g0, [%fp-32] 
    st %g0, [%fp-28] 
    add %fp, -92, %g1 
    mov %g1, %o0 
    mov 0, %o1 
    call gettimeofday, 0 
    nop 
    ldd [%fp-24], %o4 
    clr %g1 
    clr %g2 
    mov %o4, %g1 
loop: 
    add %g2, 1, %g2 
    subcc %g1, 1, %g1 
    bne loop 
    nop 
    mov %g2, %o4 

    std %o4, [%fp-32] 
    add %fp, -100, %g1 
    mov %g1, %o0 
    mov 0, %o1 
    call gettimeofday, 0 
    nop 
    ld [%fp-100], %g2 
    ld [%fp-92], %g1 
    sub %g2, %g1, %g2 
    sethi %hi(999424), %g1 
    or %g1, 576, %g1 
    smul %g2, %g1, %g3 
    ld [%fp-96], %g2 
    ld [%fp-88], %g1 
    sub %g2, %g1, %g1 
    add %g3, %g1, %g1 
    st %g1, [%fp-12] 
    sra %g1, 31, %g1 
    st %g1, [%fp-16] 
    ldd [%fp-64], %o4 
    ldd [%fp-16], %g2 
    addcc %o5, %g3, %g3 
    addx %o4, %g2, %g2 
    std %g2, [%fp-64] 
    ld [%fp-16], %g2 
    ld [%fp-12], %g1 
    smul %g2, %g1, %g4 
    ld [%fp-16], %g2 
    ld [%fp-12], %g1 
    smul %g2, %g1, %g1 
    add %g4, %g1, %g4 
    ld [%fp-12], %g2 
    ld [%fp-12], %g1 
    umul %g2, %g1, %g3 
    rd %y, %g2 
    add %g4, %g2, %g4 
    mov %g4, %g2 
    ldd [%fp-56], %o4 
    addcc %o5, %g3, %g3 
    addx %o4, %g2, %g2 
    std %g2, [%fp-56] 
    sethi %hi(.LLC3), %g1 
    or %g1, %lo(.LLC3), %o0 
    ld [%fp-16], %o1 
    ld [%fp-12], %o2 
    call printf, 0 
    nop 
    sethi %hi(.LLC4), %g1 
    or %g1, %lo(.LLC4), %o0 
    ld [%fp-64], %o1 
    ld [%fp-60], %o2 
    call printf, 0 
    nop 
    sethi %hi(.LLC5), %g1 
    or %g1, %lo(.LLC5), %o0 
    ld [%fp-56], %o1 
    ld [%fp-52], %o2 
    call printf, 0 
    nop 
    sethi %hi(.LLC6), %g1 
    or %g1, %lo(.LLC6), %o0 
    ld [%fp-32], %o1 
    ld [%fp-28], %o2 
    call printf, 0 
    nop 
    ld [%fp-84], %g1 
    add %g1, 1, %g1 
    st %g1, [%fp-84] 
.LL2: 
    ld [%fp-84], %g2 
    ld [%fp-68], %g1 
    cmp %g2, %g1 
    bl .LL3 
    nop 
    ld [%fp-68], %g1 
    sra %g1, 31, %g1 
    ld [%fp-68], %g3 
    mov %g1, %g2 
    ldd [%fp-64], %o0 
    mov %g2, %o2 
    mov %g3, %o3 
    call __udivdi3, 0 
    nop 
    mov %o0, %g2 
    mov %o1, %g3 
    std %g2, [%fp-136] 
    ldd [%fp-136], %o0 
    mov 0, %o2 
    mov 0, %o3 
    call __cmpdi2, 0 
    nop 
    mov %o0, %g1 
    cmp %g1, 1 
    bl .LL6 
    nop 
    ldd [%fp-136], %o0 
    call __floatdidf, 0 
    nop 
    std %f0, [%fp-144] 
    b .LL5 
    nop 
.LL6: 
    ldd [%fp-136], %o4 
    and %o4, 0, %g2 
    and %o5, 1, %g3 
    ld [%fp-136], %o5 
    sll %o5, 31, %g1 
    ld [%fp-132], %g4 
    srl %g4, 1, %o5 
    or %o5, %g1, %o5 
    ld [%fp-136], %g1 
    srl %g1, 1, %o4 
    or %g2, %o4, %g2 
    or %g3, %o5, %g3 
    mov %g2, %o0 
    mov %g3, %o1 
    call __floatdidf, 0 
    nop 
    std %f0, [%fp-144] 
    ldd [%fp-144], %f8 
    ldd [%fp-144], %f10 
    faddd %f8, %f10, %f8 
    std %f8, [%fp-144] 
.LL5: 
    ldd [%fp-144], %f8 
    std %f8, [%fp-48] 
    ld [%fp-68], %g1 
    sra %g1, 31, %g1 
    ld [%fp-68], %g3 
    mov %g1, %g2 
    ldd [%fp-56], %o0 
    mov %g2, %o2 
    mov %g3, %o3 
    call __udivdi3, 0 
    nop 
    mov %o0, %g2 
    mov %o1, %g3 
    std %g2, [%fp-128] 
    ldd [%fp-128], %o0 
    mov 0, %o2 
    mov 0, %o3 
    call __cmpdi2, 0 
    nop 
    mov %o0, %g1 
    cmp %g1, 1 
    bl .LL8 
    nop 
    ldd [%fp-128], %o0 
    call __floatdidf, 0 
    nop 
    std %f0, [%fp-120] 
    b .LL7 
    nop 
.LL8: 
    ldd [%fp-128], %o4 
    and %o4, 0, %g2 
    and %o5, 1, %g3 
    ld [%fp-128], %o5 
    sll %o5, 31, %g1 
    ld [%fp-124], %g4 
    srl %g4, 1, %o5 
    or %o5, %g1, %o5 
    ld [%fp-128], %g1 
    srl %g1, 1, %o4 
    or %g2, %o4, %g2 
    or %g3, %o5, %g3 
    mov %g2, %o0 
    mov %g3, %o1 
    call __floatdidf, 0 
    nop 
    std %f0, [%fp-120] 
    ldd [%fp-120], %f8 
    ldd [%fp-120], %f10 
    faddd %f8, %f10, %f8 
    std %f8, [%fp-120] 
.LL7: 
    ldd [%fp-48], %f8 
    ldd [%fp-48], %f10 
    fmuld %f8, %f10, %f8 
    ldd [%fp-120], %f10 
    fsubd %f10, %f8, %f8 
    std %f8, [%fp-112] 
    ldd [%fp-112], %f8 
    fsqrtd %f8, %f8 
    std %f8, [%fp-152] 
    ldd [%fp-152], %f10 
    ldd [%fp-152], %f8 
    fcmpd %f10, %f8 
    nop 
    fbe .LL9 
    nop 
    ldd [%fp-112], %o0 
    call sqrt, 0 
    nop 
    std %f0, [%fp-152] 
.LL9: 
    ldd [%fp-152], %f8 
    std %f8, [%fp-40] 
    sethi %hi(.LLC7), %g1 
    or %g1, %lo(.LLC7), %o0 
    ld [%fp-48], %o1 
    ld [%fp-44], %o2 
    ld [%fp-40], %o3 
    ld [%fp-36], %o4 
    call printf, 0 
    nop 
    mov 0, %g1 
    mov %g1, %i0 
    restore 
    jmp %o7+8 
    nop 
    .size main, .-main 
    .ident "GCC: (GNU) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)" 
    .section ".note.GNU-stack" 

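When I generate the assembly code for the version with nLoop = 5000000000 (5.0e+9), the differences are shown in the following figure (with vimdiff):

(figure: vimdiff differences)

This block of the "4 billion" version: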

mov  0, %g2                               
sethi %hi(4000000000), %g3 

is replaced in the "5 billion" version by:

mov  1, %g2 
sethi %hi(705032192), %g3             
or  %g3, 512, %g3               

I can see that 5.0e+9 cannot be encoded in 32 bits, given the instruction

sethi %hi(705032192), %g3 
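To make the two constant-loading sequences easier to compare, here is a small C sketch (not part of the original post) of how a 64-bit constant splits into two 32-bit words, and how the low word is then split between sethi (upper 22 bits) and or (lower 10 bits). The names split_words, after_sethi and or_part are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32-bit words that make up a 64-bit constant. */
struct word_pair {
    uint32_t high;  /* most significant word, e.g. the "mov 1, %g2" part */
    uint32_t low;   /* least significant word, built with sethi / or    */
};

/* Register content after "sethi %hi(value), reg":
   upper 22 bits of value, low 10 bits cleared. */
static uint32_t after_sethi(uint32_t value)
{
    return value & ~UINT32_C(0x3ff);
}

/* Immediate that a following "or reg, imm, reg" must supply:
   the low 10 bits of value. */
static uint32_t or_part(uint32_t value)
{
    return value & UINT32_C(0x3ff);
}

/* Split a 64-bit constant into its high and low 32-bit words. */
static struct word_pair split_words(uint64_t value)
{
    struct word_pair p;
    p.high = (uint32_t)(value >> 32);
    p.low  = (uint32_t)(value & UINT32_C(0xffffffff));
    /* Sanity check: the sethi part and the or part rebuild the low word. */
    assert((after_sethi(p.low) | or_part(p.low)) == p.low);
    return p;
}
```

With 5000000000 this gives a high word of 1 and a low word of 705032704 = 705032192 + 512, which matches the mov 1 / sethi / or sequence above. With 4000000000 the high word is 0 and the low 10 bits are zero, so a single sethi is enough.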

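Paradoxically, when I compile the "5 billion" assembly version, the output parameter sum is computed correctly, i.e. it equals 5 billion, and I cannot explain it.

Any help or comment is welcome, thanks.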

+3

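You seem to access 'sum' in your assembly code, and it is an 'unsigned long long'. Of course you have to adapt your asm code to match the size and type of the parameters. Did you try using C code and letting the compiler do the work? If the compiler supports 8-byte integer values, it can generate the code to handle them. – Gerhardh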

+0

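@Gerhardh - if you look at the printf output, you can see that 'sum' is computed correctly (4.0e+9 in the first example and 5.0e+9 in the second). In both cases 'sum' is declared as 'unsigned long long int'. I don't understand why this is not the case for the assembly input parameter when 'nLoop > 2^32'. – youpilat13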

+1

Are you doing 64-bit arithmetic in your assembly code? You probably aren't. –

Answers

0

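A lot depends on which version of sparc and which ABI you are using. If you are using sparc v8 or earlier, you only have a 32-bit mode with 32-bit registers. In that case, when you try to load 5000000000 into a 32-bit register, it fails and loads 5000000000 mod 2^32 (i.e. 705032704) instead. That seems to be what is happening.

On the other hand, if you have a 64-bit sparc processor running in 32-bit mode (commonly called v8plus), then you can use 64-bit registers, so this would work.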
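The "mod 2^32" arithmetic mentioned above can be checked with a few lines of C. This is only an illustrative sketch; truncate_to_register is a hypothetical helper name, and it models a 32-bit register simply as a uint32_t.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only what fits in a 32-bit register: the low 32 bits of value. */
static uint32_t truncate_to_register(uint64_t value)
{
    /* Converting to an unsigned 32-bit type is defined as value mod 2^32. */
    uint32_t low = (uint32_t)value;
    assert(low == value % (UINT64_C(1) << 32));
    return low;
}
```

Calling truncate_to_register(5000000000ULL) yields 705032704, the value quoted in the answer, while 4000000000 still fits and comes back unchanged.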

+0

@Chris Dodd I am using Debian Sparc 32-bit Etch (with gcc 4.1.2) from a QEMU image. (The image is available at https://people.debian.org/~aurel32/qemu/sparc/). According to you, I have to create a QEMU image of Debian Sparc 64-bit instead of 32-bit. I found an ISO image of Debian-9 Sparc 64-bit at https://people.debian.org/~glaubitz/debian-cd/2017-03-24/. How can I create a QEMU image from this ISO under MacOS 10.9.5? I tried: qemu-system-sparc -hda debian_stretch_sparc64.img debian-9.0-sparc64-NETINST-1.iso -boot d but I get an error and no ".img" file is generated. – youpilat13

+0

I also tried: qemu-system-sparc -hda debian_stretch_sparc64.img -cdrom debian-9.0-sparc64-NETINST-1.iso -boot d. How can I avoid the "-cdrom" flag when there is no CD drive? – youpilat13

0

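You seem to be doing 32-bit operations on one half of a 64-bit value.

In the generated code, here is where nLoop is loaded with a double-word load into the two registers %o4 and %o5 (since it is a 64-bit long long value):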

ldd [%fp-24], %o4 
    clr %g1 
    clr %g2 

Then you only work with %o4:

mov %o4, %g1    ; <---- what about %o5???? 
loop: 
    add %g2, 1, %g2 
    subcc %g1, 1, %g1 
    bne loop 
    nop 
    mov %g2, %o4 
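A rough C model of that sequence may help to see why Sum still prints the expected value while the timings collapse. It assumes the register allocation shown in the generated assembly: %o4 holds the most significant word of nLoop (SPARC is big-endian, and ldd puts the word at the lower address into the even register), %o5 keeps the low word untouched, and std then stores the pair %o4:%o5 as sum. The function names are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Iterations of "subcc %g1, 1, %g1 / bne loop" for a 32-bit start value.
   The body runs before the test, so a start value of 0 wraps around
   and the loop runs 2^32 times. */
static uint64_t countdown_iterations(uint32_t start)
{
    return start == 0 ? (UINT64_C(1) << 32) : (uint64_t)start;
}

/* Model of the asm block when only the high register of the pair is used.
   Returns the 64-bit value that ends up in sum. */
static uint64_t model_sum(uint64_t nLoop, uint64_t *iterations)
{
    uint32_t o4 = (uint32_t)(nLoop >> 32);  /* high word, copied to %g1 */
    uint32_t o5 = (uint32_t)nLoop;          /* low word, never touched  */
    uint32_t g2;

    assert(iterations != NULL);
    *iterations = countdown_iterations(o4);
    g2 = (uint32_t)*iterations;             /* %g2 counts modulo 2^32   */
    o4 = g2;                                /* mov %g2, %o4             */
    return ((uint64_t)o4 << 32) | o5;       /* std %o4 stores %o4:%o5   */
}
```

Under these assumptions, 4000000000 leads to 2^32 iterations (the high word is 0, so the counter wraps), while 5000000000 leads to a single iteration. In both cases the stored result equals nLoop, which would be consistent with the Sum values and the timings printed in the question.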

To make this work, rewrite your assembly code so that it treats %o4 + %o5 as a single 64-bit value.
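As a portable illustration of what such a rewrite has to do, the sketch below counts down a 64-bit value held in two 32-bit halves, propagating the borrow from the low word into the high word. On SPARC V8 this would presumably map to a subcc on the low register followed by a subx on the high one, but the code here is plain C and the names pair64, pair_decrement and pair_countdown are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit counter kept in two 32-bit halves, like a register pair. */
struct pair64 {
    uint32_t high;
    uint32_t low;
};

/* One countdown step: subtract 1 from the low word and propagate the
   borrow into the high word. Returns nonzero while the pair is nonzero. */
static int pair_decrement(struct pair64 *p)
{
    uint32_t borrow = (p->low == 0);  /* borrow out of the low word   */
    p->low  -= 1;                     /* wraps to 0xffffffff on borrow */
    p->high -= borrow;                /* subtract-with-carry step      */
    return (p->high | p->low) != 0;
}

/* Run the countdown for n and return how many times the body executed. */
static uint64_t pair_countdown(uint64_t n)
{
    struct pair64 p = { (uint32_t)(n >> 32), (uint32_t)n };
    uint64_t iterations = 0;

    assert(n != 0);  /* the body runs before the test, as in the asm loop */
    do {
        iterations++;
    } while (pair_decrement(&p));
    return iterations;
}
```

The key point is the borrow: when the low word is 0, decrementing it wraps to 0xffffffff and the high word must drop by one, so the pair {1, 0} becomes {0, 0xffffffff} instead of stopping early.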
