2016-09-15 80 views
4

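CPU time: quadruple vs. double precision

I am running some long-term simulations in which I try to achieve the highest possible accuracy in the solution of systems of ODEs. I am trying to find out how much longer quadruple (128-bit) precision calculations take than double (64-bit) precision ones. I searched a bit and found several opinions: some say it takes 4 times longer, others 60-70 times... So I decided to take matters into my own hands and wrote a simple Fortran benchmark program: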

program QUAD_TEST 

implicit none 

integer,parameter :: dp = selected_real_kind(15) 
integer,parameter :: qp = selected_real_kind(33) 

integer :: cstart_dp,cend_dp,cstart_qp,cend_qp,crate 
real  :: time_dp,time_qp 
real(dp) :: sum_dp,sqrt_dp,pi_dp,mone_dp,zero_dp 
real(qp) :: sum_qp,sqrt_qp,pi_qp,mone_qp,zero_qp 
integer :: i 

! ============================================================================== 

! == TEST 1. ELEMENTARY OPERATIONS == 
sum_dp = 1._dp 
sum_qp = 1._qp 
call SYSTEM_CLOCK(count_rate=crate) 

write(*,*) 'Testing elementary operations...' 

call SYSTEM_CLOCK(count=cstart_dp) 
do i=1,50000000 
    sum_dp = sum_dp - 1._dp 
    sum_dp = sum_dp + 1._dp 
    sum_dp = sum_dp*2._dp 
    sum_dp = sum_dp/2._dp 
end do 
call SYSTEM_CLOCK(count=cend_dp) 
time_dp = real(cend_dp - cstart_dp)/real(crate) 
write(*,*) 'DP sum: ',sum_dp 
write(*,*) 'DP time: ',time_dp,' seconds' 

call SYSTEM_CLOCK(count=cstart_qp) 
do i=1,50000000 
    sum_qp = sum_qp - 1._qp 
    sum_qp = sum_qp + 1._qp 
    sum_qp = sum_qp*2._qp 
    sum_qp = sum_qp/2._qp 
end do 
call SYSTEM_CLOCK(count=cend_qp) 
time_qp = real(cend_qp - cstart_qp)/real(crate) 
write(*,*) 'QP sum: ',sum_qp 
write(*,*) 'QP time: ',time_qp,' seconds' 
write(*,*) 
write(*,*) 'DP is ',time_qp/time_dp,' times faster.' 
write(*,*) 

! == TEST 2. SQUARE ROOT == 
sqrt_dp = 2._dp 
sqrt_qp = 2._qp 

write(*,*) 'Testing square root ...' 

call SYSTEM_CLOCK(count=cstart_dp) 
do i = 1,10000000 
    sqrt_dp = sqrt(sqrt_dp) 
    sqrt_dp = 2._dp 
end do 
call SYSTEM_CLOCK(count=cend_dp) 
time_dp = real(cend_dp - cstart_dp)/real(crate) 
write(*,*) 'DP sqrt: ',sqrt_dp 
write(*,*) 'DP time: ',time_dp,' seconds' 

call SYSTEM_CLOCK(count=cstart_qp) 
do i = 1,10000000 
    sqrt_qp = sqrt(sqrt_qp) 
    sqrt_qp = 2._qp 
end do 
call SYSTEM_CLOCK(count=cend_qp) 
time_qp = real(cend_qp - cstart_qp)/real(crate) 
write(*,*) 'QP sqrt: ',sqrt_qp 
write(*,*) 'QP time: ',time_qp,' seconds' 
write(*,*) 
write(*,*) 'DP is ',time_qp/time_dp,' times faster.' 
write(*,*) 

! == TEST 3. TRIGONOMETRIC FUNCTIONS == 
pi_dp = acos(-1._dp); mone_dp = 1._dp; zero_dp = 0._dp 
pi_qp = acos(-1._qp); mone_qp = 1._qp; zero_qp = 0._qp 

write(*,*) 'Testing trigonometric functions ...' 

call SYSTEM_CLOCK(count=cstart_dp) 
do i = 1,10000000 
    mone_dp = cos(pi_dp) 
    zero_dp = sin(pi_dp) 
end do 
call SYSTEM_CLOCK(count=cend_dp) 
time_dp = real(cend_dp - cstart_dp)/real(crate) 
write(*,*) 'DP cos: ',mone_dp 
write(*,*) 'DP sin: ',zero_dp 
write(*,*) 'DP time: ',time_dp,' seconds' 

call SYSTEM_CLOCK(count=cstart_qp) 
do i = 1,10000000 
    mone_qp = cos(pi_qp) 
    zero_qp = sin(pi_qp) 
end do 
call SYSTEM_CLOCK(count=cend_qp) 
time_qp = real(cend_qp - cstart_qp)/real(crate) 
write(*,*) 'QP cos: ',mone_qp 
write(*,*) 'QP sin: ',zero_qp 
write(*,*) 'QP time: ',time_qp,' seconds' 
write(*,*) 
write(*,*) 'DP is ',time_qp/time_dp,' times faster.' 
write(*,*) 

end program QUAD_TEST 

Results of a typical run, compiled with gfortran 4.8.4 without any optimization flags:

Testing elementary operations... 
DP sum: 1.0000000000000000  
DP time: 0.572000027  seconds 
QP sum: 1.00000000000000000000000000000000000  
QP time: 4.32299995  seconds 

DP is 7.55769205  times faster. 

Testing square root ... 
DP sqrt: 2.0000000000000000  
DP time: 5.20000011E-02 seconds 
QP sqrt: 2.00000000000000000000000000000000000  
QP time: 2.60700011  seconds 

DP is 50.1346169  times faster. 

Testing trigonometric functions ... 
DP cos: -1.0000000000000000  
DP sin: 1.2246467991473532E-016 
DP time: 2.79600000  seconds 
QP cos: -1.00000000000000000000000000000000000  
QP sin: 8.67181013E-0035 
QP time: 5.90199995  seconds 

DP is 2.11087275  times faster. 

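Something must be going on here. My guess is that gfortran computes sqrt in double precision with an optimized algorithm that probably has not been implemented for quadruple precision. That may not be the case for sin and cos, but why are elementary operations 7.6 times slower in quadruple precision, while trigonometric functions slow down by only a factor of 2? If the algorithm used for the trigonometric functions were the same in quadruple and double precision, I would expect their CPU time to increase about sevenfold as well.

What is the average slowdown in scientific calculations when 128-bit precision is used instead of 64-bit?

I am running this on an Intel i7-4771 @ 3.50GHz.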

+4

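Do not use `CPU_TIME` on multi-core systems. You might end up taking the start time from one core and the end time from another. Since these times are unrelated, you could get anything. Using `system_clock` solves this problem. –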

+1

See [here](https://stackoverflow.com/questions/25465101/fortran-parallel-programming/25465290#25465290) and [here](https://stackoverflow.com/questions/6878246/fortran-intrinsic-timing-routines-which-is-better-cpu-time-or-system-clock). –

+0

Thanks @AlexanderVogt, I edited the post to use SYSTEM_CLOCK. – LeWavite

Answers

5

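More of an extended comment than an answer, but...

Current CPUs provide a lot of hardware acceleration for double-precision floating-point arithmetic. Some even provide facilities for extended precision. Beyond that, you are limited to (as you noticed) considerably slower software implementations.

In the general case, however, the exact factor of this slowdown is almost impossible to predict. It depends on your CPU (e.g. which acceleration it has built in) and on the software stack. Double precision typically uses different math libraries than quadruple precision, and these may use different algorithms for the basic operations.

For a particular operation/algorithm on given hardware, using the same algorithm in both precisions, you can probably derive a number, but it will certainly not be universally valid.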
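
To see which kinds a given compiler offers at all, a minimal sketch like the following can help. The program name `kind_probe` and the kind parameter `ep` are illustrative and not part of the question's code. The request for 18 decimal digits is an assumption aimed at the 80-bit extended type on x86; on a platform without such a type it may resolve to the quadruple kind or to a negative value.

```fortran
program kind_probe
    implicit none
    ! Request kinds by minimum decimal precision. A negative result means
    ! the compiler has no real kind with that precision.
    integer, parameter :: dp = selected_real_kind(15)  ! double
    integer, parameter :: ep = selected_real_kind(18)  ! extended, if available
    integer, parameter :: qp = selected_real_kind(33)  ! quadruple, if available

    write(*,*) 'double    kind: ', dp
    write(*,*) 'extended  kind: ', ep
    write(*,*) 'quadruple kind: ', qp
end program kind_probe
```

With gfortran on x86-64 the three values are typically 8, 10 and 16. As far as I know, kind 10 maps to the hardware 80-bit format there, while kind 16 is emulated in software, which is consistent with the slowdown measured in the question.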

0

It is interesting to note that if you change:

sqrt_qp = sqrt(sqrt_qp) 
sqrt_qp = 2._qp 

to

sqrt_qp = sqrt(2._qp) 

the computation becomes faster!

+0

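What? Your two pieces of code do completely different things! The first leaves the variable set to `2`, the other computes the square root of `2`. And `sqrt(2._qp)` can be evaluated at compile time. –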

+0

Also, is this just a comment or an answer to the original question? –

+0

The code is inside a loop, so the result is that sqrt(2.0) is computed repeatedly. –
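
One way to sidestep the compile-time evaluation issue raised in these comments is to make the argument change on every iteration and to use the result afterwards. The following is only a sketch under those assumptions, not the original poster's benchmark: the program name `sqrt_bench` and the variables `x` and `acc` are made up for illustration, and the loop also times an integer-to-real conversion and an addition, so the number it reports is not a pure sqrt cost.

```fortran
program sqrt_bench
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer, parameter :: qp = selected_real_kind(33)
    integer(int64) :: cstart, cend, crate
    real(qp) :: x, acc
    integer :: i

    call system_clock(count_rate=crate)
    acc = 0._qp

    call system_clock(count=cstart)
    do i = 1, 10000000
        x = real(i, qp)        ! argument differs each iteration, so it is not a constant
        acc = acc + sqrt(x)    ! accumulate so the result is actually used
    end do
    call system_clock(count=cend)

    write(*,*) 'QP sum of sqrt: ', acc
    write(*,*) 'QP time: ', real(cend - cstart)/real(crate), ' seconds'
end program sqrt_bench
```

Running the same loop with a double-precision kind gives a ratio that is less likely to be distorted by constant folding, although an optimizing compiler may still transform the loop in other ways.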