_umul128

在Visual C++中，当针对Windows 32位时_umul128未定义。针对Win32时，两个无符号64位整数如何相乘？该解决方案只需在针对Windows 32位的Visual C++ 2017上工作。_umul128

2017-10-22 Augusto

https://msdn.microsoft.com/en-us/library/windows/desktop/aa383711(v=vs.85).aspx –

你需要做很多吗？这可能值得使用SSE2或AVX2，具体取决于你想要做什么。虽然可能不会，因为没有SIMD附加功能。但是，请参阅https://stackoverflow.com/questions/17863411/sse-multiplication-of-2-64-bit-integers获取64位结果。不过，这需要更多的指令来获得128b的结果。 https://stackoverflow.com/questions/6738283/can-xmm-registers-be-used-to-do-any-128-bit-integer-math –

[SIMD使用无符号乘法对64位* 64位进行签名到128位]（https://stackoverflow.com/questions/28807341/simd-signed-with-unsigned-multiplication-for-64-bit-64-bit-to-128-bit），可能加速32位，如果你有AVX2或AVX512，使用'adc' +'mul'（或BMI2'mulx'）作为大数组数组的构建块，但是一次一个标量可能会更好。 –

我发现下面的代码（从xmrrig），这似乎做的工作就好了：

static inline uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{ 
    // multiplier = ab = a * 2^32 + b 
    // multiplicand = cd = c * 2^32 + d 
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d 
    uint64_t a = multiplier >> 32; 
    uint64_t b = multiplier & 0xFFFFFFFF; 
    uint64_t c = multiplicand >> 32; 
    uint64_t d = multiplicand & 0xFFFFFFFF; 

    //uint64_t ac = a * c; 
    uint64_t ad = a * d; 
    //uint64_t bc = b * c; 
    uint64_t bd = b * d; 

    uint64_t adbc = ad + (b * c); 
    uint64_t adbc_carry = adbc < ad ? 1 : 0; 

    // multiplier * multiplicand = product_hi * 2^64 + product_lo 
    uint64_t product_lo = bd + (adbc << 32); 
    uint64_t product_lo_carry = product_lo < bd ? 1 : 0; 
    *product_hi = (a * c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry; 

    return product_lo; 
}

来源

2017-10-25 02:43:29 Augusto

不幸的是，这个编译MSVC真的很差。对于uint64 * uint64 => 64位乘法，它使用'__allmul'的函数调用，即使两个输入的上半部分都是编译时0.也就是说，当它可以使用一个'mul'指令来执行32x32 => 64b全乘。 https://godbolt.org/g/53Qxyo。如果32x32 => 64b乘法的内在原因，它会使效率更高。铿锵很好地编译它;还有很多工作要做，但它对所有随身携带使用'adc'指令。 MSVC使用一些'adc'，但为其他分支：/ –

此外，请确保*不要*在内部可用的64位代码中使用它。（我猜如果编译器提供它，使用相同的名称是确保编译时错误的好方法）。它以64位模式编译，但不是一个'mul'指令。 –

如果你需要这个在32位代码中实际上很快，可以考虑使用http://gmplib.org/。他们有一个32位的asm实现乘法，但是我没有看到2个肢体* 2个肢体= 4个肢体的特殊情况。当你通过这些尺寸时，IDK通用函数的速度有多快。 https://gmplib.org/repo/gmp/file/tip/mpn/x86/mul_basecase.asm –

这个答案已经从MSVC 32位模式优化了对方的回答某个版本的xmrrig function的。原始版本与其他编译器，特别是铿锵有关。

我看着@ Augusto的功能MSVC的输出，它真的很糟糕。 对于32x32 => 64b乘法使用__emulu明显改善了它（因为MSVC是愚蠢的，并且在已知输入实际上仅为32位且上半部分为零的情况下不优化uint64_t * uint64_t = uint64_t）。其他编译器（gcc和clang）生成一个单一的mul指令，而不是调用辅助函数。 MSVC的代码还有其他问题，我不知道如何通过调整源代码来修复，尽管。我认为，如果你想在该编译器上获得良好的性能，那么必须使用内联asm（或单独编译的asm函数）。

如果您需要更灵活的任意精度（更大的数字），请参阅GMPlib's low-level functions，它们有asm实现，而不是尝试从此__umul128中构建256b乘法。但如果你确切需要这个，那么值得尝试。使用C++支持恒定传播和CSE优化，您不会用asm获得。

铛编译此没有大的问题，实际使用adc所有附加有进位（除了一个，它与setc指令保存）。 MSVC在进位检查上分支，并且只是产生令人讨厌的代码。海湾合作委员会也没有做得很好，有一些分支随身携带。（因为gcc不知道如何把carry = sum < a成adc，gcc bug 79173。）

IDK如果MSVC或GCC支持任何附加与携带的用于在32位模式的64位整数内部函数。 _addcarry_u64generates poor code with gcc anyway (in 64-bit mode)，但ICC可能会好。 IDK关于MSVC。

如果你想要一个asm实现，我建议使用这个函数的铿锵5.0的输出。你可能手工找到一些优化，但肯定比MSVC更好。但当然，https://gcc.gnu.org/wiki/DontUseInlineAsm中的大部分参数都适用：如果乘以内联变为常数的内容，或者上半部分已知为零，则阻塞恒定传播是一个主要问题。

Source + asm output for MSVC 32-bit and clang5.0 32-bit on Godbolt

漂亮的代码铿锵。有点不好的代码与MSVC，但比以前更好。有点不好与gcc也（没有改变与其他答案）。

#include <stdint.h> 

#ifdef _MSC_VER 
# include <intrin.h> 
#else 
// MSVC doesn't optimize 32x32 => 64b multiplication without its intrinsic 
// But good compilers can just use this to get a single mul instruction 
static inline 
uint64_t __emulu(uint32_t x, uint32_t y) { 
    return x * (uint64_t)y; 
} 
#endif 

// This is still pretty ugly with MSVC, branching on the carry 
// and using XMM store/integer reload to zero a register! 
// But at least it inlines 4 mul instructions 
// instead of calling a generic 64x64 => 64b multiply helper function 
uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{ 
    // multiplier = ab = a * 2^32 + b 
    // multiplicand = cd = c * 2^32 + d 
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d 
    uint64_t a = multiplier >> 32; 
    uint64_t b = (uint32_t)multiplier; // & 0xFFFFFFFF; 
    uint64_t c = multiplicand >> 32; 
    uint64_t d = (uint32_t)multiplicand; // & 0xFFFFFFFF; 

    //uint64_t ac = __emulu(a, c); 
    uint64_t ad = __emulu(a, d); 
    //uint64_t bc = __emulu(b, c); 
    uint64_t bd = __emulu(b, d); 

    uint64_t adbc = ad + __emulu(b , c); 
    uint64_t adbc_carry = (adbc < ad); // ? 1 : 0; 
    // MSVC gets confused by the ternary and makes worse code than using a boolean in an integer context for 1 : 0 

    // multiplier * multiplicand = product_hi * 2^64 + product_lo 
    uint64_t product_lo = bd + (adbc << 32); 
    uint64_t product_lo_carry = (product_lo < bd); // ? 1 : 0; 
    *product_hi = __emulu(a , c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry; 

    return product_lo; 
}

请确保您只在32位代码中使用它。在64位代码中，它无法优化到单个64位mul指令（这会产生全部结果的64位二分之一）。实现GNU C++扩展的编译器（clang，gcc，ICC）可以使用unsigned __int128并获得很好的代码。例如a * (unsigned __int128)b产生128b结果。（Godbolt示例）。

来源

2017-10-25 04:59:58

回答

相关问题