How fast is thread-local variable access on Linux

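How fast is accessing a thread-local variable on Linux? From the code generated by the gcc compiler, I can see that it uses the fs segment register. So apparently, access to a thread-local variable should not cost extra cycles.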

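However, I keep reading horror stories about slow thread-local variable access. How come? Sure, sometimes different compilers use an approach other than the fs segment register, but is accessing a thread-local variable through the fs segment register slow too?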

2012-03-28 pythonic

What happens behind the scenes: http://www.akkadia.org/drepper/tls.pdf .. does anyone feel motivated to read this and summarize it in a short answer? :D – 2012-03-28 15:53:05

The "horror stories" may come from TSS (thread-specific storage), accessed through pthread_setspecific. TSS is slower than TLS, but not by much if done properly. – 2012-03-28 19:57:22
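To make the TSS/TLS distinction in this comment concrete, here is a minimal sketch that stores the same value both ways. The helper names (store_both, load_tss, load_tls) are made up for illustration, and the small integer is packed directly into the key's pointer slot to keep the example short. It only shows the two mechanisms side by side and makes no claim about their relative cost.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* TLS: the compiler and linker arrange the access to this variable. */
static __thread int tls_value;

/* TSS: the value is looked up at run time through a key. */
static pthread_key_t tss_key;
static pthread_once_t tss_once = PTHREAD_ONCE_INIT;

/* Creates the key exactly once for the whole process. */
static void make_key(void)
{
    int rc = pthread_key_create(&tss_key, NULL);  /* NULL: no destructor */
    assert(rc == 0);
    (void)rc;
}

/* Hypothetical helper: stores v in both kinds of per-thread storage
   for the calling thread. */
void store_both(int v)
{
    pthread_once(&tss_once, make_key);
    /* Illustration only: the int is packed into the pointer slot. */
    pthread_setspecific(tss_key, (void *)(intptr_t)v);  /* library call */
    tls_value = v;                                      /* plain store */
}

/* Reads the value back through the key (0 if nothing was stored). */
int load_tss(void)
{
    pthread_once(&tss_once, make_key);
    return (int)(intptr_t)pthread_getspecific(tss_key);
}

/* Reads the value back from the __thread variable. */
int load_tls(void)
{
    return tls_value;
}
```

Calling store_both(42) in a thread and then load_tss() and load_tls() in the same thread should give 42 from both; a different thread would see 0 from both until it stores its own value.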

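I can give you a horror story about the slowness of a _non_ thread-local variable (a simple integer counter) that was modified by multiple threads and slowed the system to a crawl because of cache snooping. Making it thread-local and summing all the thread-local counters at the end gave me a speedup factor of 100 or so. – hirschhornsalz 2012-03-29 08:53:26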
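The pattern described in this comment can be sketched roughly as follows. This is an illustration, not the commenter's actual code: the names (count_with_tls, worker, MAX_THREADS) and the fixed-size result array are assumptions. Each thread increments its own __thread counter and publishes it once when it finishes; the caller adds the results up after joining.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16  /* arbitrary limit chosen for this sketch */

/* Each thread increments its own private counter; no other thread
   writes to it while the loop runs. */
static __thread long local_count;

/* One result slot per thread, written once when the thread finishes. */
static long results[MAX_THREADS];

struct worker_arg {
    int index;        /* slot in results[] */
    long iterations;  /* how many increments to perform */
};

static void *worker(void *p)
{
    struct worker_arg *arg = p;
    for (long i = 0; i < arg->iterations; i++)
        local_count++;
    results[arg->index] = local_count;  /* publish once, at the end */
    return NULL;
}

/* Runs nthreads workers and returns the sum of their private counters. */
long count_with_tls(int nthreads, long iterations)
{
    pthread_t tids[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    long total = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    for (int i = 0; i < nthreads; i++) {
        args[i].index = i;
        args[i].iterations = iterations;
        int rc = pthread_create(&tids[i], NULL, worker, &args[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);  /* join before reading the slot */
        total += results[i];
    }
    return total;
}
```

The contended alternative would have every thread increment one shared counter (with an atomic or a lock) on each iteration. The speedup of about 100 quoted above is the commenter's measurement and is not something this sketch reproduces.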

How fast is accessing a thread-local variable on Linux

It depends, on a lot of things.

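Some processors (i*86) have a special segment register (fs or gs, depending on the mode). Other processors do not (but they usually have a register reserved for accessing the current thread, and TLS is easy to find using that dedicated register).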

On i*86, using fs, the access is almost as fast as a direct memory access.

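I keep reading horror stories about slow thread-local variable access

It would have helped if you had provided links to some of these horror stories. Without the links, it is impossible to tell whether their authors know what they are talking about.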

2012-03-28 16:08:01

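Horror story? No problem: I worked on an embedded MIPS platform where every access to thread-local storage resulted in a very slow kernel call. You could do about 8000 TLS accesses per second on that platform. – 2014-08-25 08:09:40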

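However, I keep reading horror stories about slow thread-local variable access. How come?

Let me demonstrate how slow thread-local variables can be on Linux x86_64. I took an example from http://software.intel.com/en-us/blogs/2011/05/02/the-hidden-performance-cost-of-accessing-thread-local-variables.

With no __thread variable, there is no slowness.

I will use the performance of this test as the baseline.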

#include <stdio.h>
#include <math.h>

double tlvar;
// the following line is needed so that get_value() is not inlined by the compiler
double get_value() __attribute__ ((noinline));
double get_value()
{
    return tlvar;
}

int main()
{
    int i;
    double f = 0.0;

    tlvar = 1.0;
    for (i = 0; i < 1000000000; i++)
    {
        f += sqrt(get_value());
    }
    printf("f = %f\n", f);
    return 0;
}

Here is the assembly code of get_value():

Dump of assembler code for function get_value: 
=> 0x0000000000400560 <+0>:  movsd 0x200478(%rip),%xmm0  # 0x6009e0 <tlvar> 
    0x0000000000400568 <+8>:  retq 
End of assembler dump.

Here is how fast it runs:

$ time ./inet_test_no_thread 
f = 1000000000.000000 

real 0m5.169s 
user 0m5.137s 
sys  0m0.002s

With a __thread variable in an executable (not in a shared library), there is still no slowness.

#include <stdio.h>
#include <math.h>

__thread double tlvar;
// the following line is needed so that get_value() is not inlined by the compiler
double get_value() __attribute__ ((noinline));
double get_value()
{
    return tlvar;
}

int main()
{
    int i;
    double f = 0.0;

    tlvar = 1.0;
    for (i = 0; i < 1000000000; i++)
    {
        f += sqrt(get_value());
    }
    printf("f = %f\n", f);
    return 0;
}

Here is the assembly code of get_value():

(gdb) disassemble get_value 
Dump of assembler code for function get_value: 
=> 0x0000000000400590 <+0>:  movsd %fs:0xfffffffffffffff8,%xmm0 
    0x000000000040059a <+10>: retq 
End of assembler dump.

Here is how fast it runs:

$ time ./inet_test 
f = 1000000000.000000 

real 0m5.232s 
user 0m5.158s 
sys  0m0.007s

So it is quite clear that when a __thread variable is in the executable, it is as fast as an ordinary global variable.

With a __thread variable in a shared library, there is slowness.

The executable:

$ cat inet_test_main.c 
#include <stdio.h>
#include <math.h>
int test();

int main()
{
    test();
    return 0;
}

The shared library:

$ cat inet_test_lib.c 
#include <stdio.h>
#include <math.h>

static __thread double tlvar;
// the following line is needed so that get_value() is not inlined by the compiler
double get_value() __attribute__ ((noinline));
double get_value()
{
    return tlvar;
}

int test()
{
    int i;
    double f = 0.0;

    tlvar = 1.0;
    for (i = 0; i < 1000000000; i++)
    {
        f += sqrt(get_value());
    }
    printf("f = %f\n", f);
    return 1;
}

Here is the assembly code of get_value(). See how different it is: it calls __tls_get_addr():

Dump of assembler code for function get_value: 
=> 0x00007ffff7dfc6d0 <+0>:  lea 0x200329(%rip),%rdi  # 0x7ffff7ffca00 
    0x00007ffff7dfc6d7 <+7>:  callq 0x7ffff7dfc5c8 <__tls_get_addr@plt>
    0x00007ffff7dfc6dc <+12>: movsd 0x0(%rax),%xmm0 
    0x00007ffff7dfc6e4 <+20>: retq 
End of assembler dump. 

(gdb) disas __tls_get_addr 
Dump of assembler code for function __tls_get_addr: 
    0x0000003c40a114d0 <+0>:  push %rbx 
    0x0000003c40a114d1 <+1>:  mov %rdi,%rbx 
=> 0x0000003c40a114d4 <+4>:  mov %fs:0x8,%rdi 
    0x0000003c40a114dd <+13>: mov 0x20fa74(%rip),%rax  # 0x3c40c20f58 <_rtld_local+3928> 
    0x0000003c40a114e4 <+20>: cmp %rax,(%rdi) 
    0x0000003c40a114e7 <+23>: jne 0x3c40a11505 <__tls_get_addr+53> 
    0x0000003c40a114e9 <+25>: xor %esi,%esi 
    0x0000003c40a114eb <+27>: mov (%rbx),%rdx 
    0x0000003c40a114ee <+30>: mov %rdx,%rax 
    0x0000003c40a114f1 <+33>: shl $0x4,%rax 
    0x0000003c40a114f5 <+37>: mov (%rax,%rdi,1),%rax 
    0x0000003c40a114f9 <+41>: cmp $0xffffffffffffffff,%rax 
    0x0000003c40a114fd <+45>: je  0x3c40a1151b <__tls_get_addr+75> 
    0x0000003c40a114ff <+47>: add 0x8(%rbx),%rax 
    0x0000003c40a11503 <+51>: pop %rbx 
    0x0000003c40a11504 <+52>: retq 
    0x0000003c40a11505 <+53>: mov (%rbx),%rdi 
    0x0000003c40a11508 <+56>: callq 0x3c40a11200 <_dl_update_slotinfo> 
    0x0000003c40a1150d <+61>: mov %rax,%rsi 
    0x0000003c40a11510 <+64>: mov %fs:0x8,%rdi 
    0x0000003c40a11519 <+73>: jmp 0x3c40a114eb <__tls_get_addr+27> 
    0x0000003c40a1151b <+75>: callq 0x3c40a11000 <tls_get_addr_tail> 
    0x0000003c40a11520 <+80>: jmp 0x3c40a114ff <__tls_get_addr+47> 
End of assembler dump.

It runs almost twice as slowly:

$ time ./inet_test_main 
f = 1000000000.000000 

real 0m9.978s 
user 0m9.906s 
sys  0m0.004s

Finally, this is what perf reports: __tls_get_addr accounts for about 21% of CPU time:

$ perf report --stdio 
# 
# Events: 10K cpu-clock 
# 
# Overhead   Command  Shared Object    Symbol 
# ........ .............. ................... .................. 
# 
    58.05% inet_test_main libinet_test_lib.so [.] test 
    21.15% inet_test_main ld-2.12.so   [.] __tls_get_addr 
    10.69% inet_test_main libinet_test_lib.so [.] get_value 
    5.07% inet_test_main libinet_test_lib.so [.] [email protected] 
    4.82% inet_test_main libinet_test_lib.so [.] [email protected] 
    0.23% inet_test_main [kernel.kallsyms] [k] 0xffffffffa0165b75

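So, as you can see, when a thread-local variable is in a shared library (declared static and used only within the shared library), access is noticeably slower. If a thread-local variable in a shared library is accessed rarely, this is not a performance problem. If it is accessed as often as in this test, the overhead will be significant.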
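If the variable has to stay in a shared library and is read in a hot loop, one thing that may be worth trying is taking its address once and reusing the pointer, so the address lookup happens at most once per call instead of once per iteration. The sketch below is an illustration under that assumption. The function names (sum_via_call, sum_hoisted, set_value) are made up, and the loop sums the value directly rather than calling sqrt(), to keep the example free of libm.

```c
#include <assert.h>

static __thread double tlvar;

/* Kept out of line, as in the test above, so every call does its own
   thread-local access. */
double get_value(void) __attribute__ ((noinline));
double get_value(void)
{
    return tlvar;
}

void set_value(double v)
{
    tlvar = v;
}

/* One thread-local access (inside get_value) per iteration. */
double sum_via_call(long n)
{
    double f = 0.0;
    assert(n >= 0);
    for (long i = 0; i < n; i++)
        f += get_value();
    return f;
}

/* The address is computed once; the loop uses an ordinary pointer. */
double sum_hoisted(long n)
{
    double *p = &tlvar;  /* valid only in the calling thread */
    double f = 0.0;
    assert(n >= 0);
    for (long i = 0; i < n; i++)
        f += *p;
    return f;
}
```

Both functions return the same result. Whether the hoisted version is measurably faster depends on the build (in particular on whether the code ends up in a shared library) and on whether the compiler already hoists the address computation on its own, so it should be measured rather than assumed.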

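The document mentioned in the comments, http://www.akkadia.org/drepper/tls.pdf, discusses four possible TLS access models. Frankly, I don't understand when the "Initial exec TLS model" is used, but as for the other three models, it is possible to avoid calling __tls_get_addr() only when the __thread variable is in the executable and is accessed from the executable.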
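As far as the linked document describes it, the initial-exec model is meant for modules that are part of the program's initial set (loaded at startup rather than via dlopen()). GCC lets you request a model per variable with the tls_model attribute, or for a whole file with -ftls-model. The sketch below assumes the file is built into a shared library that is linked at startup. Whether the model is appropriate for a given library is a judgment call, because a library built this way may fail to load through dlopen() if the static TLS space is exhausted.

```c
/* Assumption: this file is compiled with -fPIC into a shared library
   that is loaded at program startup, not through dlopen(). */
static __thread double tlvar __attribute__ ((tls_model("initial-exec")));

/* Kept out of line, as in the test above. */
double get_value(void) __attribute__ ((noinline));
double get_value(void)
{
    return tlvar;
}

/* Small setter so callers outside this file can initialize the value. */
void set_value(double v)
{
    tlvar = v;
}
```

With this model the compiler is expected to load the variable's offset from the GOT and access it relative to fs, without the call to __tls_get_addr() seen in the disassembly above. Disassembling get_value in the resulting library is the way to confirm what was actually generated.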

2014-08-25 07:40:18

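Great work doing all this testing. However, five nanoseconds per operation is not what I would call really slow. It is on the same order as a function call, so unless accessing thread-local variables is virtually the only thing you do, it should not be a problem. Thread synchronization is usually much more expensive. If you can avoid that by using thread-local storage, you have a huge win, shared library or not. – cmaster 2014-08-25 16:48:13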