LZW又一惊一乍了
看完微软大神写的求平均值代码,我意识到自己还是too young了 | 量子位
On finding the average of two unsigned integers without overflow - The Old New Thing (microsoft.com)
assembly - x86_64 registers rax/eax/ax/al overwriting full register contents - Stack Overflow
Fisher, Randall J (2003). General-Purpose SIMD Within A Register: Parallel Processing on Consumer Microprocessors (PDF) (Ph.D.). Purdue University.
SWAR - History of The SWAR Programming Model (primidi.com)
欲求(a+b)/2,a+b可能会溢出,结果肯定不会溢出。a/2和b/2都肯定不会溢出,a/2 + b/2 和 (a+b)/2 就差1点了。拿a=1, b=2, a=4, b=6等试试,(a / 2) + (b / 2) + (a & b & 1)不难理解。这里的除是整除,13/2=6,而不是6.5. 只有a和b都是奇数时,a & b & 1才是1。3/2 = 1, 5/2=2, (3 & 5 & 1) = 1, 1+2+1=4.
(a & b) + (a ^ b) / 2; 请搜 hacker's delight, 或看下面的简介:
1010 8 + 0 + 2 + 0 = 10 0011 0 + 0 + 2 + 1 = 3 -----------------------
如果a和b的某位都是1, 那结果的该位也是1,如果都是0,那也是0,如(0+0)/2=0, (2+2)/2=2. 都是1时&的结果才是1。加0不影响结果。如果一个是0另一个不是0(那必然是1),异或后必然是1. 这个1应该/2,>>1应该也对且更好。说了是unsigned,那右移时符号位不会跟着跑。整除:13/2=6, 不是6.5。C里注意运算符优先级,写成(a & b) + ((a ^ b) >> 1)保险。把二进制数看作向量。
r?x是64位寄存器,e?x是32位的。e?x是r?x的低32位。mov edx, edx不是废话,会把rdx的高32位清零:trying to write anything at all to eax by any means leads to wiping of high 32 bits of rax register. 16位寄存器ax可以通过ah和al来访问(high 和 low),我记得mov al, 0不会导致ah改变,Intel技术倒退了:-)。
Intel's x86 architecture was not the only architecture to include SIMD-like parallel instructions. Sun's VIS, SGI's MDMX, and other multimedia instruction sets had been added to other manufacturers' existing instruction set architectures to support so-called ″new media″ applications. These extensions had significant differences in the precision of data and types of instructions supported.
Dietz and Fisher began developing the idea of a well-defined parallel programming model that would allow the programming to target the model without knowing the specifics of the target architecture. This model would become the basis of Fisher's dissertation. The acronym "SWAR" was coined by Dietz and Fisher one day in Hank's office in the MSEE building at Purdue University. It refers to this form of parallel processing, architectures which are designed to natively perform this type of processing, and the general-purpose programming model that is Fisher's dissertation.




Intel Integrated Performance Primitives is an extensive library of ready-to-use, domain-specific functions that are highly optimized for diverse Intel architectures.
窃以为SWAR的精神是:不要用if和for,写成a op1 b op2 c op3 d...这样的形式,op1, op2, op3是操作符,a, b, c, d都是长度和精度任意的向量。SWAR在不同的指令集上包了/抽象了一层,是个C库,其实未必有ipp (Intel Integrated Primitive)高级。因为:
- ipp有算高斯分布,可能还有DCT, IDCT等更高层的函数, SWAR更像个硬件抽象层。用户只是不用为不同的CPU写汇编了,DCT还得自己写
- ipp支持全系Intel CPU. MMX和SSE4差别也算不太小。ipp好像是每种CPU一个dll,动态加载。
(a/2) + (b/2) + (a & b & 1)也符合SWAR精神,/, +, &都是op. low + (high - low) / 2不太符合,但如果cmp算二元操作符,再定义个三元操作符,输入操作数是a, b和a>b的结果呢?
如果硬件有两个“向量除法器",a/2和b/2可以同时运算,并行度不低。而且a>>1和b>>1即可,不必做除法。举手之劳,为啥要依赖C编译器去优化?
顺路鼓吹FPGA卡。比如end user有个程序,C++写的,调用了ipp或CUDA. 首先请用户把代码里调用那些函数的部分挑出来(如grep),然后比如龙芯做个子集,比如mini_ipp_for_client_A. 龙芯连CPU都能造,堆16384个MAC(我也学坏了,就是乘和累加:-))应该piece of cake (again, I know a little English), 给客户一个firmware, 一个min_ipp.lib/.so/.dll和mini_ipp.h,客户刷firmware,重新编译和link程序。三赢:用户花钱不多,龙芯费事不多,速度可能还更快。nVidia的linux下的显卡驱动开源了(不好好支持linux是20年前的事了),可以借鉴(寄存器、DMA, 中断、共享内存……)。紫光和复旦微电子都有FPGA,紫光的EDA官网有下载link;复旦的能到1亿门。我是外行知道的少瞎说,没有任何倾向。哦,不必全部用FPGA实现,这事不能说太细…… :-) 哦哦,比方说add_vec(int n, T* a, T* b); 一定要问清楚n一般是多少。或者FPGA厂商仅提供16384xint8, 8192xint16, 1024xfloat32的点积等几种firmware和driver,中层程序用户自己改。我就是打个比方,实际情况比这复杂。还可以把解放号/猪八戒等兼职网站开到乌克兰/俄罗斯。我甚至觉得用这个写IDCT比用SSE写简单,一条"指令"可以处理64个数,我简直都想试试了。:-) 可以把cos都算出来烧在FPGA里。图像压缩与离散余弦变换
该卡集成2D显示,支持VGA, 1080p和2K几种分辨率就够了。这样用户不必再买显卡。如果用户的程序是用python写的,python调用C,国内能干这个的多的是。
为啥说又:Competition-Level Code Generation with AlphaCode
别人厉害就承认,但也不要长他人志气减自己信心。临渊羡鱼不如退而结网,不动脑子光watch别人的dust有啥意思。再来几句难听的:SWAR就是个demo,罗列了一堆CPU,真正支持的只有MMX和AltiVec,还不知道到啥程度。Microsoft真正厉害的是总部做开发的,Research的和中国的都不咋厉害,当然,都比我厉害很多。
Kinect was the controller-free interface for the Xbox 360. Although Kinect set a world record for the fastest-selling consumer electronics device in its first 60 days (eight million units), manufacturing ceased in late 2017. As Cult of Mac reveals, PrimeSense (who developed the technology behind the Kinect, and who do in fact still own the design and the IP, free of Microsoft influence) had developed some of the key systems inside Kinect in mid-2008. 如果很厉害,为啥要外购?
想10年内超过Intel凭啥?钱更多?人更厉害?技术积累更多?二进制翻译超过native? 我没听说过2000多条指令的指令集。3DNow!里能一步运算的向量好像都是长度为3的。CUDA从GPU改过来,好像以处理三角形为起源也别扭。有多少用户自己写SIMD汇编程序?这样的用户往往不在乎钱追求极致性能,反正也满足不了他们。ipp20多年历史了,久经锤炼。综上,Plan B: 主CPU risc-v, 高主频大cache;ippp, Intelligence Plus Performance Processor在卡上。先请俄罗斯人写个纯C版的ipp(.h文件他们自己下载),然后看懂代码,设计指令集。通用部分risc-v, SIMD部分的特点是寄存器特别长,电路优先支持SIVMD(very many data),如32个65536位的寄存器/1MB的一级cache之类。然后再请俄罗斯人改成C+汇编版。指令集不和别人兼容可随便设计。sendfile()不经过用户态把数据发到网卡。HANDLE load_matrix(const char* file_name); 把文件里的矩阵不经用户态读入"显卡上的GDDR"。HANDLE add_vec(HANDLE a, b); 比SIMD指令集更稀罕的是火了的SIMD指令集。AMD的3DNow!火了吗?超过AMD就已经很难了。
SIMD的M是multiple,不是many. Kinect和connect读音接近,kinetic energy是动能。Hmm, Intelligence Potential Performance Processor, potential能是势能。
load_vec()的参数应改成file_handle, offset, type, n.
好像有若干开源的科学计算库。每所大学的数学系写个函数,也不用麻烦俄罗斯人了。Graphics And Matrix Multiplication Acceleration=GAMMA. 先FPGA充分验证再28nm制程芯片。显卡散热器为啥没有多热管多鳍片无风扇的?开汽车没有引擎轰鸣声不过瘾?手机拍照要拿扬声器播放机械快门的声音?
我是外行纯转载:
浙公网安备 33010602011771号