      1 #! /usr/bin/env perl
      2 # Copyright 2010-2016 The OpenSSL Project Authors. All Rights Reserved.
      3 #
      4 # Licensed under the OpenSSL license (the "License").  You may not use
      5 # this file except in compliance with the License.  You can obtain a copy
      6 # in the file LICENSE in the source distribution or at
      7 # https://www.openssl.org/source/license.html
      8 
      9 #
     10 # ====================================================================
# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
     12 # project. The module is, however, dual licensed under OpenSSL and
     13 # CRYPTOGAMS licenses depending on where you obtain it. For further
     14 # details see http://www.openssl.org/~appro/cryptogams/.
     15 # ====================================================================
     16 #
     17 # March, May, June 2010
     18 #
# The module implements the "4-bit" GCM GHASH function and the
# underlying single multiplication operation in GF(2^128). "4-bit"
# means that it uses a 256-byte per-key table [+64/128 bytes of fixed
# table]. It has two code paths: vanilla x86 and vanilla SSE. The
# former is executed on 486 and Pentium, the latter on all others. SSE
# GHASH features the so-called "528B" variant of the "4-bit" method,
# utilizing an additional 256+16 bytes of per-key storage [+512 bytes
# of shared table]. Performance results are for the streamed GHASH
# subroutine and are expressed in cycles per processed byte, less is
# better:
     28 #
     29 #		gcc 2.95.3(*)	SSE assembler	x86 assembler
     30 #
     31 # Pentium	105/111(**)	-		50
     32 # PIII		68 /75		12.2		24
     33 # P4		125/125		17.8		84(***)
     34 # Opteron	66 /70		10.1		30
     35 # Core2		54 /67		8.4		18
     36 # Atom		105/105		16.8		53
     37 # VIA Nano	69 /71		13.0		27
     38 #
# (*)	gcc 3.4.x was observed to generate a few percent slower code,
#	which is one of the reasons why the 2.95.3 results were chosen;
#	another reason is the lack of 3.4.x results for older CPUs;
#	comparison with the SSE results is not completely fair, because
#	the C results are for the vanilla "256B" implementation, while
#	the assembler results are for "528B";-)
# (**)	the second number is the result for code compiled with the
#	-fPIC flag, which is actually more relevant, because the
#	assembler code is position-independent;
# (***)	see comment in non-MMX routine for further details;
     49 #
# To summarize, it's >2-5 times faster than gcc-generated code. To
# anchor it to something else, the SHA1 assembler processes one byte
# in ~7 cycles on contemporary x86 cores. As for the choice of MMX/SSE
# in particular, see the comment at the end of the file...
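#
# For orientation, the "4-bit" method is the classic table-driven one
# from gcm128.c: the 256-byte per-key table is simply the 16 nibble
# multiples of H (16 entries x 16 bytes), and the 64/128-byte fixed
# table is rem_4bit[], which folds the 4 bits shifted out of the
# accumulator back in. A rough sketch of one multiplication (for
# illustration only; see gcm_gmult_4bit in gcm128.c for the real
# thing):
#
#	Z = Htable[low nibble of Xi[15]]
#	for each remaining nibble n of Xi[15..0], low before high:
#		rem = Z & 0xf
#		Z >>= 4				# 128-bit shift
#		Z ^= rem_4bit[rem]		# reduction, top-aligned
#		Z ^= Htable[n]
#	Xi = byte-reversed Z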
     54 
     55 # May 2010
     56 #
# Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
# The question is how close is it to the theoretical limit? The
# pclmulqdq instruction latency appears to be 14 cycles and there
# can't be more than 2 of them executing at any given time. This means
# that a single Karatsuba multiplication would take 28 cycles *plus* a
# few cycles for pre- and post-processing. The multiplication then has
# to be followed by modulo-reduction. Given that the aggregated
# reduction method [see "Carry-less Multiplication and Its Usage for
# Computing the GCM Mode" white paper by Intel] allows you to perform
# the reduction only once in a while, we can assume that asymptotic
# performance can be estimated as (28+Tmod/Naggr)/16, where Tmod is
# the time to perform the reduction and Naggr is the aggregation
# factor.
     69 #
# Before we proceed to this implementation, let's have a closer look
# at the best-performing code suggested by Intel in their white paper.
# By tracing inter-register dependencies, Tmod is estimated as ~19
# cycles and the Naggr chosen by Intel is 4, resulting in 2.05 cycles
# per processed byte. As implied, this is a rather optimistic
# estimate, because it does not account for Karatsuba pre- and
# post-processing, which for a single multiplication is ~5 cycles.
# Unfortunately Intel does not provide performance data for GHASH
# alone. But benchmarking AES_GCM_encrypt ripped out of Fig. 15 of the
# white paper with aadt alone resulted in 2.46 cycles per byte out of
# a 16KB buffer. Note that the result even accounts for pre-computing
# the powers of the hash key H, but their portion is negligible at
# 16KB buffer size.
     82 #
# Moving on to the implementation in question. Tmod is estimated as
# ~13 cycles and Naggr is 2, giving asymptotic performance of ...
# 2.16. How is it possible that measured performance is better than
# the optimistic theoretical estimate? There is one thing Intel failed
# to recognize. By serializing GHASH with CTR in the same subroutine,
# the former's performance is indeed limited by the above
# (28+Tmod/Naggr)/16 equation. But if the GHASH procedure is detached,
# the modulo-reduction can be interleaved with Naggr-1 multiplications
# at the instruction level and under ideal conditions even disappear
# from the equation. So the optimistic theoretical estimate for this
# implementation is ... 28/16=1.75, and not 2.16. Well, that's
# probably way too optimistic, at least for such a small Naggr. I'd
# argue that (28+Tproc/Naggr)/16, where Tproc is the time required for
# Karatsuba pre- and post-processing, is a more realistic estimate. In
# this case it gives ... 1.91 cycles. In other words, depending on how
# well we can interleave the reduction with one of the two
# multiplications, the performance should be between 1.91 and 2.16. As
# already mentioned, this implementation processes one byte out of an
# 8KB buffer in 2.10 cycles, while the x86_64 counterpart does so in
# 2.02. The x86_64 performance is better because the larger register
# bank allows the reduction and multiplication to be interleaved more
# tightly.
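#
# For the record, the estimates above work out as follows (taking
# Tmul=28 cycles per 16 bytes): Intel's code, (28+19/4)/16 = 2.05;
# this code with the reduction fully exposed, (28+13/2)/16 = 2.16;
# reduction completely hidden, 28/16 = 1.75; only the Karatsuba
# overhead exposed, (28+5/2)/16 = 1.91 cycles per processed byte.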
    103 #
# Does it make sense to increase Naggr? To start with, it's virtually
# impossible in 32-bit mode, because of the limited register bank
# capacity. Otherwise the improvement has to be weighed against slower
# setup, as well as increased code size and complexity. As even the
# optimistic estimate doesn't promise a 30% performance improvement,
# there are currently no plans to increase Naggr.
    110 #
    111 # Special thanks to David Woodhouse for providing access to a
    112 # Westmere-based system on behalf of Intel Open Source Technology Centre.
    113 
# January 2011
    115 #
# Tweaked to optimize transitions between integer and FP operations
# on the same XMM register. The PCLMULQDQ subroutine was measured to
# process one byte in 2.07 cycles on Sandy Bridge, and in 2.12 - on
# Westmere. The minor regression on Westmere is outweighed by a ~15%
# improvement on Sandy Bridge. Strangely enough, an attempt to modify
# the 64-bit code in a similar manner resulted in almost 20%
# degradation on Sandy Bridge, where the original 64-bit code
# processes one byte in 1.95 cycles.
    123 
    124 #####################################################################
    125 # For reference, AMD Bulldozer processes one byte in 1.98 cycles in
    126 # 32-bit mode and 1.89 in 64-bit.
    127 
    128 # February 2013
    129 #
    130 # Overhaul: aggregate Karatsuba post-processing, improve ILP in
    131 # reduction_alg9. Resulting performance is 1.96 cycles per byte on
    132 # Westmere, 1.95 - on Sandy/Ivy Bridge, 1.76 - on Bulldozer.
    133 
    134 $0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
    135 push(@INC,"${dir}","${dir}../../../perlasm");
    136 require "x86asm.pl";
    137 
    138 $output=pop;
    139 open STDOUT,">$output";
    140 
    141 &asm_init($ARGV[0],$x86only = $ARGV[$#ARGV] eq "386");
    142 
    143 $sse2=0;
    144 for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); }
    145 
    146 ($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx");
    147 $inp  = "edi";
    148 $Htbl = "esi";
    149 
$unroll = 0;	# Affects x86 loop. Folded loop performs ~7% worse
		# than unrolled, which has to be weighed against
		# 2.5x x86-specific code size reduction.
    154 
    155 sub x86_loop {
    156     my $off = shift;
    157     my $rem = "eax";
    158 
    159 	&mov	($Zhh,&DWP(4,$Htbl,$Zll));
    160 	&mov	($Zhl,&DWP(0,$Htbl,$Zll));
    161 	&mov	($Zlh,&DWP(12,$Htbl,$Zll));
    162 	&mov	($Zll,&DWP(8,$Htbl,$Zll));
    163 	&xor	($rem,$rem);	# avoid partial register stalls on PIII
    164 
	# shrd practically kills P4, 2.5x deterioration, but P4 has
	# MMX code-path to execute. shrd runs a tad faster [than twice
	# the shifts, moves and ors] on pre-MMX Pentium (as well as
	# PIII and Core2), *but* it minimizes code size, spares a
	# register and thus allows the loop to be folded...
    170 	if (!$unroll) {
    171 	my $cnt = $inp;
    172 	&mov	($cnt,15);
    173 	&jmp	(&label("x86_loop"));
    174 	&set_label("x86_loop",16);
    175 	    for($i=1;$i<=2;$i++) {
    176 		&mov	(&LB($rem),&LB($Zll));
    177 		&shrd	($Zll,$Zlh,4);
    178 		&and	(&LB($rem),0xf);
    179 		&shrd	($Zlh,$Zhl,4);
    180 		&shrd	($Zhl,$Zhh,4);
    181 		&shr	($Zhh,4);
    182 		&xor	($Zhh,&DWP($off+16,"esp",$rem,4));
    183 
    184 		&mov	(&LB($rem),&BP($off,"esp",$cnt));
    185 		if ($i&1) {
    186 			&and	(&LB($rem),0xf0);
    187 		} else {
    188 			&shl	(&LB($rem),4);
    189 		}
    190 
    191 		&xor	($Zll,&DWP(8,$Htbl,$rem));
    192 		&xor	($Zlh,&DWP(12,$Htbl,$rem));
    193 		&xor	($Zhl,&DWP(0,$Htbl,$rem));
    194 		&xor	($Zhh,&DWP(4,$Htbl,$rem));
    195 
    196 		if ($i&1) {
    197 			&dec	($cnt);
    198 			&js	(&label("x86_break"));
    199 		} else {
    200 			&jmp	(&label("x86_loop"));
    201 		}
    202 	    }
    203 	&set_label("x86_break",16);
    204 	} else {
    205 	    for($i=1;$i<32;$i++) {
    206 		&comment($i);
    207 		&mov	(&LB($rem),&LB($Zll));
    208 		&shrd	($Zll,$Zlh,4);
    209 		&and	(&LB($rem),0xf);
    210 		&shrd	($Zlh,$Zhl,4);
    211 		&shrd	($Zhl,$Zhh,4);
    212 		&shr	($Zhh,4);
    213 		&xor	($Zhh,&DWP($off+16,"esp",$rem,4));
    214 
    215 		if ($i&1) {
    216 			&mov	(&LB($rem),&BP($off+15-($i>>1),"esp"));
    217 			&and	(&LB($rem),0xf0);
    218 		} else {
    219 			&mov	(&LB($rem),&BP($off+15-($i>>1),"esp"));
    220 			&shl	(&LB($rem),4);
    221 		}
    222 
    223 		&xor	($Zll,&DWP(8,$Htbl,$rem));
    224 		&xor	($Zlh,&DWP(12,$Htbl,$rem));
    225 		&xor	($Zhl,&DWP(0,$Htbl,$rem));
    226 		&xor	($Zhh,&DWP(4,$Htbl,$rem));
    227 	    }
    228 	}
    229 	&bswap	($Zll);
    230 	&bswap	($Zlh);
    231 	&bswap	($Zhl);
    232 	if (!$x86only) {
    233 		&bswap	($Zhh);
    234 	} else {
    235 		&mov	("eax",$Zhh);
    236 		&bswap	("eax");
    237 		&mov	($Zhh,"eax");
    238 	}
    239 }
    240 
    241 if ($unroll) {
    242     &function_begin_B("_x86_gmult_4bit_inner");
    243 	&x86_loop(4);
    244 	&ret	();
    245     &function_end_B("_x86_gmult_4bit_inner");
    246 }
    247 
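# deposit_rem_4bit lays the 16 rem_4bit reduction constants out on the
# stack (compare the static rem_4bit table at the end of the file);
# x86_loop above then expects to find them at %esp-relative offset
# $off+16, see its &DWP($off+16,"esp",$rem,4) loads.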
    248 sub deposit_rem_4bit {
    249     my $bias = shift;
    250 
    251 	&mov	(&DWP($bias+0, "esp"),0x0000<<16);
    252 	&mov	(&DWP($bias+4, "esp"),0x1C20<<16);
    253 	&mov	(&DWP($bias+8, "esp"),0x3840<<16);
    254 	&mov	(&DWP($bias+12,"esp"),0x2460<<16);
    255 	&mov	(&DWP($bias+16,"esp"),0x7080<<16);
    256 	&mov	(&DWP($bias+20,"esp"),0x6CA0<<16);
    257 	&mov	(&DWP($bias+24,"esp"),0x48C0<<16);
    258 	&mov	(&DWP($bias+28,"esp"),0x54E0<<16);
    259 	&mov	(&DWP($bias+32,"esp"),0xE100<<16);
    260 	&mov	(&DWP($bias+36,"esp"),0xFD20<<16);
    261 	&mov	(&DWP($bias+40,"esp"),0xD940<<16);
    262 	&mov	(&DWP($bias+44,"esp"),0xC560<<16);
    263 	&mov	(&DWP($bias+48,"esp"),0x9180<<16);
    264 	&mov	(&DWP($bias+52,"esp"),0x8DA0<<16);
    265 	&mov	(&DWP($bias+56,"esp"),0xA9C0<<16);
    266 	&mov	(&DWP($bias+60,"esp"),0xB5E0<<16);
    267 }
    268 
    269 if (!$x86only) {{{
    270 
    271 &static_label("rem_4bit");
    272 
    273 if (!$sse2) {{	# pure-MMX "May" version...
    274 
    275     # This code was removed since SSE2 is required for BoringSSL. The
    276     # outer structure of the code was retained to minimize future merge
    277     # conflicts.
    278 
    279 }} else {{	# "June" MMX version...
    280 		# ... has slower "April" gcm_gmult_4bit_mmx with folded
    281 		# loop. This is done to conserve code size...
    282 $S=16;		# shift factor for rem_4bit
    283 
    284 sub mmx_loop() {
# The MMX version performs 2.8 times better on P4 (see the comment in
# the non-MMX routine for further details), 40% better on Opteron and
# Core2, and 50% better on PIII... In other words, the effort is
# considered to be well spent...
    289     my $inp = shift;
    290     my $rem_4bit = shift;
    291     my $cnt = $Zhh;
    292     my $nhi = $Zhl;
    293     my $nlo = $Zlh;
    294     my $rem = $Zll;
    295 
    296     my ($Zlo,$Zhi) = ("mm0","mm1");
    297     my $tmp = "mm2";
    298 
    299 	&xor	($nlo,$nlo);	# avoid partial register stalls on PIII
    300 	&mov	($nhi,$Zll);
    301 	&mov	(&LB($nlo),&LB($nhi));
    302 	&mov	($cnt,14);
    303 	&shl	(&LB($nlo),4);
    304 	&and	($nhi,0xf0);
    305 	&movq	($Zlo,&QWP(8,$Htbl,$nlo));
    306 	&movq	($Zhi,&QWP(0,$Htbl,$nlo));
    307 	&movd	($rem,$Zlo);
    308 	&jmp	(&label("mmx_loop"));
    309 
    310     &set_label("mmx_loop",16);
    311 	&psrlq	($Zlo,4);
    312 	&and	($rem,0xf);
    313 	&movq	($tmp,$Zhi);
    314 	&psrlq	($Zhi,4);
    315 	&pxor	($Zlo,&QWP(8,$Htbl,$nhi));
    316 	&mov	(&LB($nlo),&BP(0,$inp,$cnt));
    317 	&psllq	($tmp,60);
    318 	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
    319 	&dec	($cnt);
    320 	&movd	($rem,$Zlo);
    321 	&pxor	($Zhi,&QWP(0,$Htbl,$nhi));
    322 	&mov	($nhi,$nlo);
    323 	&pxor	($Zlo,$tmp);
    324 	&js	(&label("mmx_break"));
    325 
    326 	&shl	(&LB($nlo),4);
    327 	&and	($rem,0xf);
    328 	&psrlq	($Zlo,4);
    329 	&and	($nhi,0xf0);
    330 	&movq	($tmp,$Zhi);
    331 	&psrlq	($Zhi,4);
    332 	&pxor	($Zlo,&QWP(8,$Htbl,$nlo));
    333 	&psllq	($tmp,60);
    334 	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
    335 	&movd	($rem,$Zlo);
    336 	&pxor	($Zhi,&QWP(0,$Htbl,$nlo));
    337 	&pxor	($Zlo,$tmp);
    338 	&jmp	(&label("mmx_loop"));
    339 
    340     &set_label("mmx_break",16);
    341 	&shl	(&LB($nlo),4);
    342 	&and	($rem,0xf);
    343 	&psrlq	($Zlo,4);
    344 	&and	($nhi,0xf0);
    345 	&movq	($tmp,$Zhi);
    346 	&psrlq	($Zhi,4);
    347 	&pxor	($Zlo,&QWP(8,$Htbl,$nlo));
    348 	&psllq	($tmp,60);
    349 	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
    350 	&movd	($rem,$Zlo);
    351 	&pxor	($Zhi,&QWP(0,$Htbl,$nlo));
    352 	&pxor	($Zlo,$tmp);
    353 
    354 	&psrlq	($Zlo,4);
    355 	&and	($rem,0xf);
    356 	&movq	($tmp,$Zhi);
    357 	&psrlq	($Zhi,4);
    358 	&pxor	($Zlo,&QWP(8,$Htbl,$nhi));
    359 	&psllq	($tmp,60);
    360 	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
    361 	&movd	($rem,$Zlo);
    362 	&pxor	($Zhi,&QWP(0,$Htbl,$nhi));
    363 	&pxor	($Zlo,$tmp);
    364 
    365 	&psrlq	($Zlo,32);	# lower part of Zlo is already there
    366 	&movd	($Zhl,$Zhi);
    367 	&psrlq	($Zhi,32);
    368 	&movd	($Zlh,$Zlo);
    369 	&movd	($Zhh,$Zhi);
    370 
    371 	&bswap	($Zll);
    372 	&bswap	($Zhl);
    373 	&bswap	($Zlh);
    374 	&bswap	($Zhh);
    375 }
    376 
    377 &function_begin("gcm_gmult_4bit_mmx");
    378 	&mov	($inp,&wparam(0));	# load Xi
    379 	&mov	($Htbl,&wparam(1));	# load Htable
    380 
    381 	&call	(&label("pic_point"));
    382 	&set_label("pic_point");
    383 	&blindpop("eax");
    384 	&lea	("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));
    385 
    386 	&movz	($Zll,&BP(15,$inp));
    387 
    388 	&mmx_loop($inp,"eax");
    389 
    390 	&emms	();
    391 	&mov	(&DWP(12,$inp),$Zll);
    392 	&mov	(&DWP(4,$inp),$Zhl);
    393 	&mov	(&DWP(8,$inp),$Zlh);
    394 	&mov	(&DWP(0,$inp),$Zhh);
    395 &function_end("gcm_gmult_4bit_mmx");
    396 
    398 ######################################################################
# The subroutine below is the "528B" variant of the "4-bit" GCM GHASH
# function (see gcm128.c for details). It provides a further 20-40%
# performance improvement over the above-mentioned "May" version.
    402 
    403 &static_label("rem_8bit");
    404 
    405 &function_begin("gcm_ghash_4bit_mmx");
    406 { my ($Zlo,$Zhi) = ("mm7","mm6");
    407   my $rem_8bit = "esi";
    408   my $Htbl = "ebx";
    409 
    410     # parameter block
    411     &mov	("eax",&wparam(0));		# Xi
    412     &mov	("ebx",&wparam(1));		# Htable
    413     &mov	("ecx",&wparam(2));		# inp
    414     &mov	("edx",&wparam(3));		# len
    415     &mov	("ebp","esp");			# original %esp
    416     &call	(&label("pic_point"));
    417     &set_label	("pic_point");
    418     &blindpop	($rem_8bit);
    419     &lea	($rem_8bit,&DWP(&label("rem_8bit")."-".&label("pic_point"),$rem_8bit));
    420 
    421     &sub	("esp",512+16+16);		# allocate stack frame...
    422     &and	("esp",-64);			# ...and align it
    423     &sub	("esp",16);			# place for (u8)(H[]<<4)
    424 
    425     &add	("edx","ecx");			# pointer to the end of input
    426     &mov	(&DWP(528+16+0,"esp"),"eax");	# save Xi
    427     &mov	(&DWP(528+16+8,"esp"),"edx");	# save inp+len
    428     &mov	(&DWP(528+16+12,"esp"),"ebp");	# save original %esp
    429 
    430     { my @lo  = ("mm0","mm1","mm2");
    431       my @hi  = ("mm3","mm4","mm5");
    432       my @tmp = ("mm6","mm7");
    433       my ($off1,$off2,$i) = (0,0,);
    434 
    435       &add	($Htbl,128);			# optimize for size
    436       &lea	("edi",&DWP(16+128,"esp"));
    437       &lea	("ebp",&DWP(16+256+128,"esp"));
    438 
    439       # decompose Htable (low and high parts are kept separately),
    440       # generate Htable[]>>4, (u8)(Htable[]<<4), save to stack...
    441       for ($i=0;$i<18;$i++) {
    442 
    443 	&mov	("edx",&DWP(16*$i+8-128,$Htbl))		if ($i<16);
    444 	&movq	($lo[0],&QWP(16*$i+8-128,$Htbl))	if ($i<16);
    445 	&psllq	($tmp[1],60)				if ($i>1);
    446 	&movq	($hi[0],&QWP(16*$i+0-128,$Htbl))	if ($i<16);
    447 	&por	($lo[2],$tmp[1])			if ($i>1);
    448 	&movq	(&QWP($off1-128,"edi"),$lo[1])		if ($i>0 && $i<17);
    449 	&psrlq	($lo[1],4)				if ($i>0 && $i<17);
    450 	&movq	(&QWP($off1,"edi"),$hi[1])		if ($i>0 && $i<17);
    451 	&movq	($tmp[0],$hi[1])			if ($i>0 && $i<17);
    452 	&movq	(&QWP($off2-128,"ebp"),$lo[2])		if ($i>1);
    453 	&psrlq	($hi[1],4)				if ($i>0 && $i<17);
    454 	&movq	(&QWP($off2,"ebp"),$hi[2])		if ($i>1);
    455 	&shl	("edx",4)				if ($i<16);
    456 	&mov	(&BP($i,"esp"),&LB("edx"))		if ($i<16);
    457 
    458 	unshift	(@lo,pop(@lo));			# "rotate" registers
    459 	unshift	(@hi,pop(@hi));
    460 	unshift	(@tmp,pop(@tmp));
    461 	$off1 += 8	if ($i>0);
    462 	$off2 += 8	if ($i>1);
    463       }
    464     }
    465 
    466     &movq	($Zhi,&QWP(0,"eax"));
    467     &mov	("ebx",&DWP(8,"eax"));
    468     &mov	("edx",&DWP(12,"eax"));		# load Xi
    469 
    470 &set_label("outer",16);
    471   { my $nlo = "eax";
    472     my $dat = "edx";
    473     my @nhi = ("edi","ebp");
    474     my @rem = ("ebx","ecx");
    475     my @red = ("mm0","mm1","mm2");
    476     my $tmp = "mm3";
    477 
    478     &xor	($dat,&DWP(12,"ecx"));		# merge input data
    479     &xor	("ebx",&DWP(8,"ecx"));
    480     &pxor	($Zhi,&QWP(0,"ecx"));
    481     &lea	("ecx",&DWP(16,"ecx"));		# inp+=16
    482     #&mov	(&DWP(528+12,"esp"),$dat);	# save inp^Xi
    483     &mov	(&DWP(528+8,"esp"),"ebx");
    484     &movq	(&QWP(528+0,"esp"),$Zhi);
    485     &mov	(&DWP(528+16+4,"esp"),"ecx");	# save inp
    486 
    487     &xor	($nlo,$nlo);
    488     &rol	($dat,8);
    489     &mov	(&LB($nlo),&LB($dat));
    490     &mov	($nhi[1],$nlo);
    491     &and	(&LB($nlo),0x0f);
    492     &shr	($nhi[1],4);
    493     &pxor	($red[0],$red[0]);
    494     &rol	($dat,8);			# next byte
    495     &pxor	($red[1],$red[1]);
    496     &pxor	($red[2],$red[2]);
    497 
    # Just like in the "May" version, modulo-schedule the critical
    # path in 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. The
    # final 'pxor' is scheduled so late that rem_8bit[] has to be
    # shifted *right* by 16, which is why the last argument to pinsrw
    # is 2, which corresponds to <<32=<<48>>16...
    503     for ($j=11,$i=0;$i<15;$i++) {
    504 
    505       if ($i>0) {
    506 	&pxor	($Zlo,&QWP(16,"esp",$nlo,8));		# Z^=H[nlo]
    507 	&rol	($dat,8);				# next byte
    508 	&pxor	($Zhi,&QWP(16+128,"esp",$nlo,8));
    509 
    510 	&pxor	($Zlo,$tmp);
    511 	&pxor	($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
    512 	&xor	(&LB($rem[1]),&BP(0,"esp",$nhi[0]));	# rem^(H[nhi]<<4)
    513       } else {
    514 	&movq	($Zlo,&QWP(16,"esp",$nlo,8));
    515 	&movq	($Zhi,&QWP(16+128,"esp",$nlo,8));
    516       }
    517 
    518 	&mov	(&LB($nlo),&LB($dat));
    519 	&mov	($dat,&DWP(528+$j,"esp"))		if (--$j%4==0);
    520 
    521 	&movd	($rem[0],$Zlo);
    522 	&movz	($rem[1],&LB($rem[1]))			if ($i>0);
    523 	&psrlq	($Zlo,8);				# Z>>=8
    524 
    525 	&movq	($tmp,$Zhi);
    526 	&mov	($nhi[0],$nlo);
    527 	&psrlq	($Zhi,8);
    528 
    529 	&pxor	($Zlo,&QWP(16+256+0,"esp",$nhi[1],8));	# Z^=H[nhi]>>4
    530 	&and	(&LB($nlo),0x0f);
    531 	&psllq	($tmp,56);
    532 
    533 	&pxor	($Zhi,$red[1])				if ($i>1);
    534 	&shr	($nhi[0],4);
    535 	&pinsrw	($red[0],&WP(0,$rem_8bit,$rem[1],2),2)	if ($i>0);
    536 
    537 	unshift	(@red,pop(@red));			# "rotate" registers
    538 	unshift	(@rem,pop(@rem));
    539 	unshift	(@nhi,pop(@nhi));
    540     }
    541 
    542     &pxor	($Zlo,&QWP(16,"esp",$nlo,8));		# Z^=H[nlo]
    543     &pxor	($Zhi,&QWP(16+128,"esp",$nlo,8));
    544     &xor	(&LB($rem[1]),&BP(0,"esp",$nhi[0]));	# rem^(H[nhi]<<4)
    545 
    546     &pxor	($Zlo,$tmp);
    547     &pxor	($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
    548     &movz	($rem[1],&LB($rem[1]));
    549 
    550     &pxor	($red[2],$red[2]);			# clear 2nd word
    551     &psllq	($red[1],4);
    552 
    553     &movd	($rem[0],$Zlo);
    554     &psrlq	($Zlo,4);				# Z>>=4
    555 
    556     &movq	($tmp,$Zhi);
    557     &psrlq	($Zhi,4);
    558     &shl	($rem[0],4);				# rem<<4
    559 
    560     &pxor	($Zlo,&QWP(16,"esp",$nhi[1],8));	# Z^=H[nhi]
    561     &psllq	($tmp,60);
    562     &movz	($rem[0],&LB($rem[0]));
    563 
    564     &pxor	($Zlo,$tmp);
    565     &pxor	($Zhi,&QWP(16+128,"esp",$nhi[1],8));
    566 
    567     &pinsrw	($red[0],&WP(0,$rem_8bit,$rem[1],2),2);
    568     &pxor	($Zhi,$red[1]);
    569 
    570     &movd	($dat,$Zlo);
    571     &pinsrw	($red[2],&WP(0,$rem_8bit,$rem[0],2),3);	# last is <<48
    572 
    573     &psllq	($red[0],12);				# correct by <<16>>4
    574     &pxor	($Zhi,$red[0]);
    575     &psrlq	($Zlo,32);
    576     &pxor	($Zhi,$red[2]);
    577 
    578     &mov	("ecx",&DWP(528+16+4,"esp"));	# restore inp
    579     &movd	("ebx",$Zlo);
    580     &movq	($tmp,$Zhi);			# 01234567
    581     &psllw	($Zhi,8);			# 1.3.5.7.
    582     &psrlw	($tmp,8);			# .0.2.4.6
    583     &por	($Zhi,$tmp);			# 10325476
    584     &bswap	($dat);
    585     &pshufw	($Zhi,$Zhi,0b00011011);		# 76543210
    586     &bswap	("ebx");
    587 
    588     &cmp	("ecx",&DWP(528+16+8,"esp"));	# are we done?
    589     &jne	(&label("outer"));
    590   }
    591 
    592     &mov	("eax",&DWP(528+16+0,"esp"));	# restore Xi
    593     &mov	(&DWP(12,"eax"),"edx");
    594     &mov	(&DWP(8,"eax"),"ebx");
    595     &movq	(&QWP(0,"eax"),$Zhi);
    596 
    597     &mov	("esp",&DWP(528+16+12,"esp"));	# restore original %esp
    598     &emms	();
    599 }
    600 &function_end("gcm_ghash_4bit_mmx");
    601 }}
    602 
    604 if ($sse2) {{
    605 ######################################################################
    606 # PCLMULQDQ version.
    607 
    608 $Xip="eax";
    609 $Htbl="edx";
    610 $const="ecx";
    611 $inp="esi";
    612 $len="ebx";
    613 
    614 ($Xi,$Xhi)=("xmm0","xmm1");	$Hkey="xmm2";
    615 ($T1,$T2,$T3)=("xmm3","xmm4","xmm5");
    616 ($Xn,$Xhn)=("xmm6","xmm7");
    617 
    618 &static_label("bswap");
    619 
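# clmul64x64_T2 computes the 256-bit carry-less product Xhi:Xi of the
# two 128-bit operands Xi and Hkey with three pclmulqdq's, Karatsuba
# style: lo = Xi.lo*H.lo, hi = Xi.hi*H.hi, and the middle term
# (Xi.lo^Xi.hi)*(H.lo^H.hi)^lo^hi, which is then split across the two
# output halves. The optional $HK argument lets the caller pass the
# pre-computed H.lo^H.hi, e.g. from the Karatsuba "salt" saved by
# gcm_init_clmul.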
    620 sub clmul64x64_T2 {	# minimal "register" pressure
    621 my ($Xhi,$Xi,$Hkey,$HK)=@_;
    622 
    623 	&movdqa		($Xhi,$Xi);		#
    624 	&pshufd		($T1,$Xi,0b01001110);
    625 	&pshufd		($T2,$Hkey,0b01001110)	if (!defined($HK));
    626 	&pxor		($T1,$Xi);		#
    627 	&pxor		($T2,$Hkey)		if (!defined($HK));
    628 			$HK=$T2			if (!defined($HK));
    629 
    630 	&pclmulqdq	($Xi,$Hkey,0x00);	#######
    631 	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
    632 	&pclmulqdq	($T1,$HK,0x00);		#######
    633 	&xorps		($T1,$Xi);		#
    634 	&xorps		($T1,$Xhi);		#
    635 
    636 	&movdqa		($T2,$T1);		#
    637 	&psrldq		($T1,8);
    638 	&pslldq		($T2,8);		#
    639 	&pxor		($Xhi,$T1);
    640 	&pxor		($Xi,$T2);		#
    641 }
    642 
    643 sub clmul64x64_T3 {
# Even though this subroutine offers visually better ILP, it was
# empirically found to be a tad slower than the above version, at
# least in the gcm_ghash_clmul context. But it's just as well, because
# loop modulo-scheduling is possible only thanks to the minimized
# "register" pressure...
    649 my ($Xhi,$Xi,$Hkey)=@_;
    650 
    651 	&movdqa		($T1,$Xi);		#
    652 	&movdqa		($Xhi,$Xi);
    653 	&pclmulqdq	($Xi,$Hkey,0x00);	#######
    654 	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
    655 	&pshufd		($T2,$T1,0b01001110);	#
    656 	&pshufd		($T3,$Hkey,0b01001110);
    657 	&pxor		($T2,$T1);		#
    658 	&pxor		($T3,$Hkey);
    659 	&pclmulqdq	($T2,$T3,0x00);		#######
    660 	&pxor		($T2,$Xi);		#
    661 	&pxor		($T2,$Xhi);		#
    662 
    663 	&movdqa		($T3,$T2);		#
    664 	&psrldq		($T2,8);
    665 	&pslldq		($T3,8);		#
    666 	&pxor		($Xhi,$T2);
    667 	&pxor		($Xi,$T3);		#
    668 }
    669 
if (1) {		# Algorithm 9 with <<1 twist.
			# Reduction is shorter and uses only two
			# temporary registers, which makes it a better
			# candidate for interleaving with the 64x64
			# multiplication. The pre-modulo-scheduled loop
			# was found to be ~20% faster than Algorithm 5
			# below. Algorithm 9 was therefore chosen for
			# further optimization...
    679 
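# reduction_alg9 folds the 256-bit Karatsuba product Xhi:Xi back into
# 128 bits modulo the GHASH polynomial. Thanks to the <<1 twist
# applied to H in gcm_init_clmul, the whole reduction comes down to
# two phases of shift-and-xor (the first phase effectively multiplies
# the low half by x^57+x^62+x^63, the second folds the result down),
# needing only the two temporaries mentioned above, which is what
# makes it easy to interleave with the pclmulqdq's in gcm_ghash_clmul.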
    680 sub reduction_alg9 {	# 17/11 times faster than Intel version
    681 my ($Xhi,$Xi) = @_;
    682 
    683 	# 1st phase
    684 	&movdqa		($T2,$Xi);		#
    685 	&movdqa		($T1,$Xi);
    686 	&psllq		($Xi,5);
    687 	&pxor		($T1,$Xi);		#
    688 	&psllq		($Xi,1);
    689 	&pxor		($Xi,$T1);		#
    690 	&psllq		($Xi,57);		#
    691 	&movdqa		($T1,$Xi);		#
    692 	&pslldq		($Xi,8);
    693 	&psrldq		($T1,8);		#
    694 	&pxor		($Xi,$T2);
    695 	&pxor		($Xhi,$T1);		#
    696 
    697 	# 2nd phase
    698 	&movdqa		($T2,$Xi);
    699 	&psrlq		($Xi,1);
    700 	&pxor		($Xhi,$T2);		#
    701 	&pxor		($T2,$Xi);
    702 	&psrlq		($Xi,5);
    703 	&pxor		($Xi,$T2);		#
    704 	&psrlq		($Xi,1);		#
    705 	&pxor		($Xi,$Xhi)		#
    706 }
    707 
    708 &function_begin_B("gcm_init_clmul");
    709 	&mov		($Htbl,&wparam(0));
    710 	&mov		($Xip,&wparam(1));
    711 
    712 	&call		(&label("pic"));
    713 &set_label("pic");
    714 	&blindpop	($const);
    715 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
    716 
    717 	&movdqu		($Hkey,&QWP(0,$Xip));
    718 	&pshufd		($Hkey,$Hkey,0b01001110);# dword swap
    719 
    720 	# <<1 twist
    721 	&pshufd		($T2,$Hkey,0b11111111);	# broadcast uppermost dword
    722 	&movdqa		($T1,$Hkey);
    723 	&psllq		($Hkey,1);
    724 	&pxor		($T3,$T3);		#
    725 	&psrlq		($T1,63);
    726 	&pcmpgtd	($T3,$T2);		# broadcast carry bit
    727 	&pslldq		($T1,8);
    728 	&por		($Hkey,$T1);		# H<<=1
    729 
    730 	# magic reduction
    731 	&pand		($T3,&QWP(16,$const));	# 0x1c2_polynomial
    732 	&pxor		($Hkey,$T3);		# if(carry) H^=0x1c2_polynomial
    733 
    734 	# calculate H^2
    735 	&movdqa		($Xi,$Hkey);
    736 	&clmul64x64_T2	($Xhi,$Xi,$Hkey);
    737 	&reduction_alg9	($Xhi,$Xi);
    738 
    739 	&pshufd		($T1,$Hkey,0b01001110);
    740 	&pshufd		($T2,$Xi,0b01001110);
    741 	&pxor		($T1,$Hkey);		# Karatsuba pre-processing
    742 	&movdqu		(&QWP(0,$Htbl),$Hkey);	# save H
    743 	&pxor		($T2,$Xi);		# Karatsuba pre-processing
    744 	&movdqu		(&QWP(16,$Htbl),$Xi);	# save H^2
    745 	&palignr	($T2,$T1,8);		# low part is H.lo^H.hi
    746 	&movdqu		(&QWP(32,$Htbl),$T2);	# save Karatsuba "salt"
    747 
    748 	&ret		();
    749 &function_end_B("gcm_init_clmul");
    750 
    751 &function_begin_B("gcm_gmult_clmul");
    752 	&mov		($Xip,&wparam(0));
    753 	&mov		($Htbl,&wparam(1));
    754 
    755 	&call		(&label("pic"));
    756 &set_label("pic");
    757 	&blindpop	($const);
    758 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
    759 
    760 	&movdqu		($Xi,&QWP(0,$Xip));
    761 	&movdqa		($T3,&QWP(0,$const));
    762 	&movups		($Hkey,&QWP(0,$Htbl));
    763 	&pshufb		($Xi,$T3);
    764 	&movups		($T2,&QWP(32,$Htbl));
    765 
    766 	&clmul64x64_T2	($Xhi,$Xi,$Hkey,$T2);
    767 	&reduction_alg9	($Xhi,$Xi);
    768 
    769 	&pshufb		($Xi,$T3);
    770 	&movdqu		(&QWP(0,$Xip),$Xi);
    771 
    772 	&ret	();
    773 &function_end_B("gcm_gmult_clmul");
    774 
    775 &function_begin("gcm_ghash_clmul");
    776 	&mov		($Xip,&wparam(0));
    777 	&mov		($Htbl,&wparam(1));
    778 	&mov		($inp,&wparam(2));
    779 	&mov		($len,&wparam(3));
    780 
    781 	&call		(&label("pic"));
    782 &set_label("pic");
    783 	&blindpop	($const);
    784 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
    785 
    786 	&movdqu		($Xi,&QWP(0,$Xip));
    787 	&movdqa		($T3,&QWP(0,$const));
    788 	&movdqu		($Hkey,&QWP(0,$Htbl));
    789 	&pshufb		($Xi,$T3);
    790 
    791 	&sub		($len,0x10);
    792 	&jz		(&label("odd_tail"));
    793 
    794 	#######
    795 	# Xi+2 =[H*(Ii+1 + Xi+1)] mod P =
    796 	#	[(H*Ii+1) + (H*Xi+1)] mod P =
    797 	#	[(H*Ii+1) + H^2*(Ii+Xi)] mod P
    798 	#
    799 	&movdqu		($T1,&QWP(0,$inp));	# Ii
    800 	&movdqu		($Xn,&QWP(16,$inp));	# Ii+1
    801 	&pshufb		($T1,$T3);
    802 	&pshufb		($Xn,$T3);
    803 	&movdqu		($T3,&QWP(32,$Htbl));
    804 	&pxor		($Xi,$T1);		# Ii+Xi
    805 
    806 	&pshufd		($T1,$Xn,0b01001110);	# H*Ii+1
    807 	&movdqa		($Xhn,$Xn);
    808 	&pxor		($T1,$Xn);		#
    809 	&lea		($inp,&DWP(32,$inp));	# i+=2
    810 
    811 	&pclmulqdq	($Xn,$Hkey,0x00);	#######
    812 	&pclmulqdq	($Xhn,$Hkey,0x11);	#######
    813 	&pclmulqdq	($T1,$T3,0x00);		#######
    814 	&movups		($Hkey,&QWP(16,$Htbl));	# load H^2
    815 	&nop		();
    816 
    817 	&sub		($len,0x20);
    818 	&jbe		(&label("even_tail"));
    819 	&jmp		(&label("mod_loop"));
    820 
    821 &set_label("mod_loop",32);
    822 	&pshufd		($T2,$Xi,0b01001110);	# H^2*(Ii+Xi)
    823 	&movdqa		($Xhi,$Xi);
    824 	&pxor		($T2,$Xi);		#
    825 	&nop		();
    826 
    827 	&pclmulqdq	($Xi,$Hkey,0x00);	#######
    828 	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
    829 	&pclmulqdq	($T2,$T3,0x10);		#######
    830 	&movups		($Hkey,&QWP(0,$Htbl));	# load H
    831 
    832 	&xorps		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
    833 	&movdqa		($T3,&QWP(0,$const));
    834 	&xorps		($Xhi,$Xhn);
    835 	 &movdqu	($Xhn,&QWP(0,$inp));	# Ii
    836 	&pxor		($T1,$Xi);		# aggregated Karatsuba post-processing
    837 	 &movdqu	($Xn,&QWP(16,$inp));	# Ii+1
    838 	&pxor		($T1,$Xhi);		#
    839 
    840 	 &pshufb	($Xhn,$T3);
    841 	&pxor		($T2,$T1);		#
    842 
    843 	&movdqa		($T1,$T2);		#
    844 	&psrldq		($T2,8);
    845 	&pslldq		($T1,8);		#
    846 	&pxor		($Xhi,$T2);
    847 	&pxor		($Xi,$T1);		#
    848 	 &pshufb	($Xn,$T3);
    849 	 &pxor		($Xhi,$Xhn);		# "Ii+Xi", consume early
    850 
    851 	&movdqa		($Xhn,$Xn);		#&clmul64x64_TX	($Xhn,$Xn,$Hkey); H*Ii+1
    852 	  &movdqa	($T2,$Xi);		#&reduction_alg9($Xhi,$Xi); 1st phase
    853 	  &movdqa	($T1,$Xi);
    854 	  &psllq	($Xi,5);
    855 	  &pxor		($T1,$Xi);		#
    856 	  &psllq	($Xi,1);
    857 	  &pxor		($Xi,$T1);		#
    858 	&pclmulqdq	($Xn,$Hkey,0x00);	#######
    859 	&movups		($T3,&QWP(32,$Htbl));
    860 	  &psllq	($Xi,57);		#
    861 	  &movdqa	($T1,$Xi);		#
    862 	  &pslldq	($Xi,8);
    863 	  &psrldq	($T1,8);		#
    864 	  &pxor		($Xi,$T2);
    865 	  &pxor		($Xhi,$T1);		#
    866 	&pshufd		($T1,$Xhn,0b01001110);
    867 	  &movdqa	($T2,$Xi);		# 2nd phase
    868 	  &psrlq	($Xi,1);
    869 	&pxor		($T1,$Xhn);
    870 	  &pxor		($Xhi,$T2);		#
    871 	&pclmulqdq	($Xhn,$Hkey,0x11);	#######
    872 	&movups		($Hkey,&QWP(16,$Htbl));	# load H^2
    873 	  &pxor		($T2,$Xi);
    874 	  &psrlq	($Xi,5);
    875 	  &pxor		($Xi,$T2);		#
    876 	  &psrlq	($Xi,1);		#
    877 	  &pxor		($Xi,$Xhi)		#
    878 	&pclmulqdq	($T1,$T3,0x00);		#######
    879 
    880 	&lea		($inp,&DWP(32,$inp));
    881 	&sub		($len,0x20);
    882 	&ja		(&label("mod_loop"));
    883 
    884 &set_label("even_tail");
    885 	&pshufd		($T2,$Xi,0b01001110);	# H^2*(Ii+Xi)
    886 	&movdqa		($Xhi,$Xi);
    887 	&pxor		($T2,$Xi);		#
    888 
    889 	&pclmulqdq	($Xi,$Hkey,0x00);	#######
    890 	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
    891 	&pclmulqdq	($T2,$T3,0x10);		#######
    892 	&movdqa		($T3,&QWP(0,$const));
    893 
    894 	&xorps		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
    895 	&xorps		($Xhi,$Xhn);
    896 	&pxor		($T1,$Xi);		# aggregated Karatsuba post-processing
    897 	&pxor		($T1,$Xhi);		#
    898 
    899 	&pxor		($T2,$T1);		#
    900 
    901 	&movdqa		($T1,$T2);		#
    902 	&psrldq		($T2,8);
    903 	&pslldq		($T1,8);		#
    904 	&pxor		($Xhi,$T2);
    905 	&pxor		($Xi,$T1);		#
    906 
    907 	&reduction_alg9	($Xhi,$Xi);
    908 
    909 	&test		($len,$len);
    910 	&jnz		(&label("done"));
    911 
    912 	&movups		($Hkey,&QWP(0,$Htbl));	# load H
    913 &set_label("odd_tail");
    914 	&movdqu		($T1,&QWP(0,$inp));	# Ii
    915 	&pshufb		($T1,$T3);
    916 	&pxor		($Xi,$T1);		# Ii+Xi
    917 
    918 	&clmul64x64_T2	($Xhi,$Xi,$Hkey);	# H*(Ii+Xi)
    919 	&reduction_alg9	($Xhi,$Xi);
    920 
    921 &set_label("done");
    922 	&pshufb		($Xi,$T3);
    923 	&movdqu		(&QWP(0,$Xip),$Xi);
    924 &function_end("gcm_ghash_clmul");
    925 
    927 } else {		# Algorithm 5. Kept for reference purposes.
    928 
    929 sub reduction_alg5 {	# 19/16 times faster than Intel version
    930 my ($Xhi,$Xi)=@_;
    931 
    932 	# <<1
    933 	&movdqa		($T1,$Xi);		#
    934 	&movdqa		($T2,$Xhi);
    935 	&pslld		($Xi,1);
    936 	&pslld		($Xhi,1);		#
    937 	&psrld		($T1,31);
    938 	&psrld		($T2,31);		#
    939 	&movdqa		($T3,$T1);
    940 	&pslldq		($T1,4);
    941 	&psrldq		($T3,12);		#
    942 	&pslldq		($T2,4);
    943 	&por		($Xhi,$T3);		#
    944 	&por		($Xi,$T1);
    945 	&por		($Xhi,$T2);		#
    946 
    947 	# 1st phase
    948 	&movdqa		($T1,$Xi);
    949 	&movdqa		($T2,$Xi);
    950 	&movdqa		($T3,$Xi);		#
    951 	&pslld		($T1,31);
    952 	&pslld		($T2,30);
    953 	&pslld		($Xi,25);		#
    954 	&pxor		($T1,$T2);
    955 	&pxor		($T1,$Xi);		#
    956 	&movdqa		($T2,$T1);		#
    957 	&pslldq		($T1,12);
    958 	&psrldq		($T2,4);		#
    959 	&pxor		($T3,$T1);
    960 
    961 	# 2nd phase
    962 	&pxor		($Xhi,$T3);		#
    963 	&movdqa		($Xi,$T3);
    964 	&movdqa		($T1,$T3);
    965 	&psrld		($Xi,1);		#
    966 	&psrld		($T1,2);
    967 	&psrld		($T3,7);		#
    968 	&pxor		($Xi,$T1);
    969 	&pxor		($Xhi,$T2);
    970 	&pxor		($Xi,$T3);		#
    971 	&pxor		($Xi,$Xhi);		#
    972 }
    973 
    974 &function_begin_B("gcm_init_clmul");
    975 	&mov		($Htbl,&wparam(0));
    976 	&mov		($Xip,&wparam(1));
    977 
    978 	&call		(&label("pic"));
    979 &set_label("pic");
    980 	&blindpop	($const);
    981 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
    982 
    983 	&movdqu		($Hkey,&QWP(0,$Xip));
    984 	&pshufd		($Hkey,$Hkey,0b01001110);# dword swap
    985 
    986 	# calculate H^2
    987 	&movdqa		($Xi,$Hkey);
    988 	&clmul64x64_T3	($Xhi,$Xi,$Hkey);
    989 	&reduction_alg5	($Xhi,$Xi);
    990 
    991 	&movdqu		(&QWP(0,$Htbl),$Hkey);	# save H
    992 	&movdqu		(&QWP(16,$Htbl),$Xi);	# save H^2
    993 
    994 	&ret		();
    995 &function_end_B("gcm_init_clmul");
    996 
    997 &function_begin_B("gcm_gmult_clmul");
    998 	&mov		($Xip,&wparam(0));
    999 	&mov		($Htbl,&wparam(1));
   1000 
   1001 	&call		(&label("pic"));
   1002 &set_label("pic");
   1003 	&blindpop	($const);
   1004 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
   1005 
   1006 	&movdqu		($Xi,&QWP(0,$Xip));
   1007 	&movdqa		($Xn,&QWP(0,$const));
   1008 	&movdqu		($Hkey,&QWP(0,$Htbl));
   1009 	&pshufb		($Xi,$Xn);
   1010 
   1011 	&clmul64x64_T3	($Xhi,$Xi,$Hkey);
   1012 	&reduction_alg5	($Xhi,$Xi);
   1013 
   1014 	&pshufb		($Xi,$Xn);
   1015 	&movdqu		(&QWP(0,$Xip),$Xi);
   1016 
   1017 	&ret	();
   1018 &function_end_B("gcm_gmult_clmul");
   1019 
   1020 &function_begin("gcm_ghash_clmul");
   1021 	&mov		($Xip,&wparam(0));
   1022 	&mov		($Htbl,&wparam(1));
   1023 	&mov		($inp,&wparam(2));
   1024 	&mov		($len,&wparam(3));
   1025 
   1026 	&call		(&label("pic"));
   1027 &set_label("pic");
   1028 	&blindpop	($const);
   1029 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
   1030 
   1031 	&movdqu		($Xi,&QWP(0,$Xip));
   1032 	&movdqa		($T3,&QWP(0,$const));
   1033 	&movdqu		($Hkey,&QWP(0,$Htbl));
   1034 	&pshufb		($Xi,$T3);
   1035 
   1036 	&sub		($len,0x10);
   1037 	&jz		(&label("odd_tail"));
   1038 
   1039 	#######
   1040 	# Xi+2 =[H*(Ii+1 + Xi+1)] mod P =
   1041 	#	[(H*Ii+1) + (H*Xi+1)] mod P =
   1042 	#	[(H*Ii+1) + H^2*(Ii+Xi)] mod P
   1043 	#
   1044 	&movdqu		($T1,&QWP(0,$inp));	# Ii
   1045 	&movdqu		($Xn,&QWP(16,$inp));	# Ii+1
   1046 	&pshufb		($T1,$T3);
   1047 	&pshufb		($Xn,$T3);
   1048 	&pxor		($Xi,$T1);		# Ii+Xi
   1049 
   1050 	&clmul64x64_T3	($Xhn,$Xn,$Hkey);	# H*Ii+1
   1051 	&movdqu		($Hkey,&QWP(16,$Htbl));	# load H^2
   1052 
   1053 	&sub		($len,0x20);
   1054 	&lea		($inp,&DWP(32,$inp));	# i+=2
   1055 	&jbe		(&label("even_tail"));
   1056 
   1057 &set_label("mod_loop");
   1058 	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)
   1059 	&movdqu		($Hkey,&QWP(0,$Htbl));	# load H
   1060 
   1061 	&pxor		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
   1062 	&pxor		($Xhi,$Xhn);
   1063 
   1064 	&reduction_alg5	($Xhi,$Xi);
   1065 
   1066 	#######
   1067 	&movdqa		($T3,&QWP(0,$const));
   1068 	&movdqu		($T1,&QWP(0,$inp));	# Ii
   1069 	&movdqu		($Xn,&QWP(16,$inp));	# Ii+1
   1070 	&pshufb		($T1,$T3);
   1071 	&pshufb		($Xn,$T3);
   1072 	&pxor		($Xi,$T1);		# Ii+Xi
   1073 
   1074 	&clmul64x64_T3	($Xhn,$Xn,$Hkey);	# H*Ii+1
   1075 	&movdqu		($Hkey,&QWP(16,$Htbl));	# load H^2
   1076 
   1077 	&sub		($len,0x20);
   1078 	&lea		($inp,&DWP(32,$inp));
   1079 	&ja		(&label("mod_loop"));
   1080 
   1081 &set_label("even_tail");
   1082 	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)
   1083 
   1084 	&pxor		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
   1085 	&pxor		($Xhi,$Xhn);
   1086 
   1087 	&reduction_alg5	($Xhi,$Xi);
   1088 
   1089 	&movdqa		($T3,&QWP(0,$const));
   1090 	&test		($len,$len);
   1091 	&jnz		(&label("done"));
   1092 
   1093 	&movdqu		($Hkey,&QWP(0,$Htbl));	# load H
   1094 &set_label("odd_tail");
   1095 	&movdqu		($T1,&QWP(0,$inp));	# Ii
   1096 	&pshufb		($T1,$T3);
   1097 	&pxor		($Xi,$T1);		# Ii+Xi
   1098 
   1099 	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H*(Ii+Xi)
   1100 	&reduction_alg5	($Xhi,$Xi);
   1101 
   1102 	&movdqa		($T3,&QWP(0,$const));
   1103 &set_label("done");
   1104 	&pshufb		($Xi,$T3);
   1105 	&movdqu		(&QWP(0,$Xip),$Xi);
   1106 &function_end("gcm_ghash_clmul");
   1107 
   1108 }
   1109 
   1111 &set_label("bswap",64);
   1112 	&data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
   1113 	&data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2);	# 0x1c2_polynomial
   1114 &set_label("rem_8bit",64);
   1115 	&data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E);
   1116 	&data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E);
   1117 	&data_short(0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E);
   1118 	&data_short(0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E);
   1119 	&data_short(0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E);
   1120 	&data_short(0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E);
   1121 	&data_short(0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E);
   1122 	&data_short(0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E);
   1123 	&data_short(0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE);
   1124 	&data_short(0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE);
   1125 	&data_short(0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE);
   1126 	&data_short(0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE);
   1127 	&data_short(0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E);
   1128 	&data_short(0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E);
   1129 	&data_short(0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE);
   1130 	&data_short(0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE);
   1131 	&data_short(0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E);
   1132 	&data_short(0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E);
   1133 	&data_short(0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E);
   1134 	&data_short(0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E);
   1135 	&data_short(0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E);
   1136 	&data_short(0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E);
   1137 	&data_short(0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E);
   1138 	&data_short(0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E);
   1139 	&data_short(0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE);
   1140 	&data_short(0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE);
   1141 	&data_short(0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE);
   1142 	&data_short(0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE);
   1143 	&data_short(0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E);
   1144 	&data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E);
   1145 	&data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE);
   1146 	&data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE);
   1147 }}	# $sse2
   1148 
   1149 &set_label("rem_4bit",64);
   1150 	&data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S);
   1151 	&data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S);
   1152 	&data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S);
   1153 	&data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S);
   1154 }}}	# !$x86only
   1155 
   1156 &asciz("GHASH for x86, CRYPTOGAMS by <appro\@openssl.org>");
   1157 &asm_finish();
   1158 
   1159 close STDOUT;
   1160 
# A question was raised about the choice of vanilla MMX. Or rather why
# wasn't SSE2 chosen instead? In addition to the fact that MMX runs on
# legacy CPUs such as PIII, the "4-bit" MMX version was observed to
# provide better performance than the *corresponding* SSE2 one even on
# contemporary CPUs. The SSE2 results were provided by Peter-Michael
# Hager. He maintains an SSE2 implementation featuring the full range
# of lookup-table sizes, but with per-invocation lookup table setup.
# The latter means that the table size is chosen depending on how much
# data is to be hashed in every given call; more data, larger table.
# The best reported result for Core2 is ~4 cycles per processed byte
# out of a 64KB block. This number accounts even for the 64KB table
# setup overhead. As discussed in gcm128.c, we choose to be more
# conservative with respect to lookup table sizes, but how do the
# results compare? The minimalistic "256B" MMX version delivers ~11
# cycles on the same platform. As also discussed in gcm128.c, the next
# in line "8-bit Shoup's" or "4KB" method should deliver twice the
# performance of the "256B" one, in other words not worse than ~6
# cycles per byte. It should also be noted that in the SSE2 case the
# improvement can be "super-linear," i.e. more than twice, mostly
# because >>8 maps to a single instruction on an SSE2 register. This
# is unlike the "4-bit" case, where >>4 maps to the same number of
# instructions in both the MMX and SSE2 cases. The bottom line is that
# a switch to SSE2 is considered justifiable only if we choose to
# implement the "8-bit" method...
   1183