      1 #!/usr/bin/env perl
      2 #
      3 # ====================================================================
       4 # Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
      5 # project. The module is, however, dual licensed under OpenSSL and
      6 # CRYPTOGAMS licenses depending on where you obtain it. For further
      7 # details see http://www.openssl.org/~appro/cryptogams/.
      8 # ====================================================================
      9 #
     10 # March, May, June 2010
     11 #
      12 # The module implements the "4-bit" GCM GHASH function and the
      13 # underlying single multiplication operation in GF(2^128). "4-bit"
      14 # means that it uses a 256-byte per-key table [+64/128 bytes of fixed
      15 # table]. It has two code paths: vanilla x86 and vanilla SSE. The
      16 # former is executed on 486 and Pentium, the latter on all others. SSE
      17 # GHASH features a so-called "528B" variant of the "4-bit" method
      18 # utilizing an additional 256+16 bytes of per-key storage [+512 bytes
      19 # of shared table]. Performance results are for the streamed GHASH
      20 # subroutine and are expressed in cycles per processed byte; less is better:
     21 #
     22 #		gcc 2.95.3(*)	SSE assembler	x86 assembler
     23 #
     24 # Pentium	105/111(**)	-		50
     25 # PIII		68 /75		12.2		24
     26 # P4		125/125		17.8		84(***)
     27 # Opteron	66 /70		10.1		30
     28 # Core2		54 /67		8.4		18
     29 # Atom		105/105		16.8		53
     30 # VIA Nano	69 /71		13.0		27
     31 #
      32 # (*)	gcc 3.4.x was observed to generate a few percent slower code,
      33 #	which is one of the reasons why the 2.95.3 results were chosen;
      34 #	the other reason is the lack of 3.4.x results for older CPUs;
     35 #	comparison with SSE results is not completely fair, because C
     36 #	results are for vanilla "256B" implementation, while
     37 #	assembler results are for "528B";-)
      38 # (**)	the second number is the result for code compiled with the
      39 #	-fPIC flag, which is actually more relevant, because the
      40 #	assembler code is position-independent;
     41 # (***)	see comment in non-MMX routine for further details;
     42 #
      43 # To summarize, it's >2-5 times faster than gcc-generated code. To
      44 # anchor it to something else, the SHA1 assembler processes one byte
      45 # in ~7 cycles on contemporary x86 cores. As for the choice of MMX/SSE
      46 # in particular, see the comment at the end of the file...
     47 
     48 # May 2010
     49 #
      50 # Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
      51 # The question is how close it is to the theoretical limit. The pclmulqdq
      52 # instruction latency appears to be 14 cycles and there can't be more
      53 # than 2 of them executing at any given time. This means that a single
      54 # Karatsuba multiplication would take 28 cycles *plus* a few cycles for
      55 # pre- and post-processing. The multiplication then has to be followed
      56 # by the modulo-reduction. Given that the aggregated reduction method
      57 # [see "Carry-less Multiplication and Its Usage for Computing the GCM
      58 # Mode" white paper by Intel] allows you to perform the reduction only
      59 # once in a while, we can assume that asymptotic performance can be
      60 # estimated as (28+Tmod/Naggr)/16 cycles per byte, where Tmod is the
      61 # time to perform the reduction and Naggr is the aggregation factor.
     62 #
      63 # Before we proceed to this implementation, let's have a closer look at
      64 # the best-performing code suggested by Intel in their white paper.
      65 # By tracing inter-register dependencies, Tmod is estimated as ~19
      66 # cycles and the Naggr chosen by Intel is 4, resulting in 2.05 cycles
      67 # per processed byte. As implied, this is a quite optimistic estimate,
      68 # because it does not account for Karatsuba pre- and post-processing,
      69 # which for a single multiplication is ~5 cycles. Unfortunately Intel
      70 # does not provide performance data for GHASH alone. But benchmarking
      71 # AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
      72 # alone resulted in 2.46 cycles per byte out of a 16KB buffer. Note that
      73 # the result even accounts for the pre-computation of powers of the hash
      74 # key H, but its share is negligible at 16KB buffer size.
     75 #
      76 # Moving on to the implementation in question. Tmod is estimated as ~13
      77 # cycles and Naggr is 2, giving an asymptotic performance of ... 2.16.
      78 # How is it possible that the measured performance is better than the
      79 # optimistic theoretical estimate? There is one thing Intel failed to
      80 # recognize. By serializing GHASH with CTR in the same subroutine, the
      81 # former's performance really is limited by the (Tmul+Tmod/Naggr)/16
      82 # equation above. But if the GHASH procedure is detached, the
      83 # modulo-reduction can be interleaved with Naggr-1 multiplications at
      84 # the instruction level and, under ideal conditions, even disappear
      85 # from the equation. So the optimistic theoretical estimate for this
      86 # implementation is ... 28/16=1.75, and not 2.16. Well, that's probably
      87 # way too optimistic, at least for such a small Naggr. I'd argue that
      88 # (28+Tproc/Naggr)/16, where Tproc is the time required for Karatsuba
      89 # pre- and post-processing, is a more realistic estimate; in this case
      90 # it gives ... 1.91 cycles. In other words, depending on how well we
      91 # can interleave the reduction and one of the two multiplications, the
      92 # performance should be between 1.91 and 2.16. As already mentioned,
      93 # this implementation processes one byte out of an 8KB buffer in 2.10
      94 # cycles, while the x86_64 counterpart does so in 2.02; its larger
      95 # register bank allows reduction and multiplication to interleave better.
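         #
         # For a quick sanity check of the estimates above, the same
         # (Tmul+Textra/Naggr)/16 cycles-per-byte model with Tmul=28 can be
         # evaluated directly, e.g.:
         #
         #	sub cpb { my ($Tmul,$Textra,$Naggr)=@_; ($Tmul+$Textra/$Naggr)/16 }
         #	cpb(28,19,4);	# ~2.05, Intel's code: Tmod=19, Naggr=4
         #	cpb(28,13,2);	# ~2.16, this code: Tmod=13, Naggr=2
         #	cpb(28, 0,2);	#  1.75, reduction fully hidden
         #	cpb(28, 5,2);	# ~1.91, only Karatsuba pre/post-processing exposed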
     96 #
      97 # Does it make sense to increase Naggr? To start with, it's virtually
      98 # impossible in 32-bit mode, because of the limited register bank
      99 # capacity. Otherwise the improvement has to be weighed against slower
     100 # setup, as well as the increase in code size and complexity. As even
     101 # the optimistic estimate doesn't promise a 30% performance improvement,
     102 # there are currently no plans to increase Naggr.
    103 #
     104 # Special thanks to David Woodhouse <dwmw2@infradead.org> for
    105 # providing access to a Westmere-based system on behalf of Intel
    106 # Open Source Technology Centre.
    107 
    108 # January 2010
    109 #
     110 # Tweaked to optimize transitions between integer and FP operations on the
     111 # same XMM register. The PCLMULQDQ subroutine was measured to process one
     112 # byte in 2.07 cycles on Sandy Bridge and in 2.12 on Westmere. The minor
     113 # regression on Westmere is outweighed by a ~15% improvement on Sandy
     114 # Bridge. Strangely enough, an attempt to modify the 64-bit code in a
     115 # similar manner resulted in an almost 20% degradation on Sandy Bridge,
     116 # where the original 64-bit code processes one byte in 1.95 cycles.
    117 
    118 #####################################################################
    119 # For reference, AMD Bulldozer processes one byte in 1.98 cycles in
    120 # 32-bit mode and 1.89 in 64-bit.
    121 
    122 # February 2013
    123 #
    124 # Overhaul: aggregate Karatsuba post-processing, improve ILP in
    125 # reduction_alg9. Resulting performance is 1.96 cycles per byte on
    126 # Westmere, 1.95 - on Sandy/Ivy Bridge, 1.76 - on Bulldozer.
    127 
    128 $0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
    129 push(@INC,"${dir}","${dir}../../../perlasm");
    130 require "x86asm.pl";
    131 
    132 $output=pop;
    133 open STDOUT,">$output";
    134 
    135 &asm_init($ARGV[0],$x86only = $ARGV[$#ARGV] eq "386");
    136 
    137 $sse2=0;
    138 for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); }
    139 
    140 ($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx");
    141 $inp  = "edi";
    142 $Htbl = "esi";
    143 
     145 $unroll = 0;	# Affects x86 loop. Folded loop performs ~7% worse
     146 		# than unrolled, which has to be weighed against the
     147 		# 2.5x x86-specific code size reduction.
    148 
    149 sub x86_loop {
    150     my $off = shift;
    151     my $rem = "eax";
    152 
    153 	&mov	($Zhh,&DWP(4,$Htbl,$Zll));
    154 	&mov	($Zhl,&DWP(0,$Htbl,$Zll));
    155 	&mov	($Zlh,&DWP(12,$Htbl,$Zll));
    156 	&mov	($Zll,&DWP(8,$Htbl,$Zll));
    157 	&xor	($rem,$rem);	# avoid partial register stalls on PIII
    158 
     159 	# shrd practically kills P4, a 2.5x deterioration, but P4 has the
     160 	# MMX code-path to execute. shrd runs a tad faster [than twice the
     161 	# shifts, moves and ors] on pre-MMX Pentium (as well as on PIII
     162 	# and Core2), *but* it minimizes code size, spares a register and
     163 	# thus allows the loop to be folded...
    164 	if (!$unroll) {
    165 	my $cnt = $inp;
    166 	&mov	($cnt,15);
    167 	&jmp	(&label("x86_loop"));
    168 	&set_label("x86_loop",16);
    169 	    for($i=1;$i<=2;$i++) {
    170 		&mov	(&LB($rem),&LB($Zll));
    171 		&shrd	($Zll,$Zlh,4);
    172 		&and	(&LB($rem),0xf);
    173 		&shrd	($Zlh,$Zhl,4);
    174 		&shrd	($Zhl,$Zhh,4);
    175 		&shr	($Zhh,4);
    176 		&xor	($Zhh,&DWP($off+16,"esp",$rem,4));
    177 
    178 		&mov	(&LB($rem),&BP($off,"esp",$cnt));
    179 		if ($i&1) {
    180 			&and	(&LB($rem),0xf0);
    181 		} else {
    182 			&shl	(&LB($rem),4);
    183 		}
    184 
    185 		&xor	($Zll,&DWP(8,$Htbl,$rem));
    186 		&xor	($Zlh,&DWP(12,$Htbl,$rem));
    187 		&xor	($Zhl,&DWP(0,$Htbl,$rem));
    188 		&xor	($Zhh,&DWP(4,$Htbl,$rem));
    189 
    190 		if ($i&1) {
    191 			&dec	($cnt);
    192 			&js	(&label("x86_break"));
    193 		} else {
    194 			&jmp	(&label("x86_loop"));
    195 		}
    196 	    }
    197 	&set_label("x86_break",16);
    198 	} else {
    199 	    for($i=1;$i<32;$i++) {
    200 		&comment($i);
    201 		&mov	(&LB($rem),&LB($Zll));
    202 		&shrd	($Zll,$Zlh,4);
    203 		&and	(&LB($rem),0xf);
    204 		&shrd	($Zlh,$Zhl,4);
    205 		&shrd	($Zhl,$Zhh,4);
    206 		&shr	($Zhh,4);
    207 		&xor	($Zhh,&DWP($off+16,"esp",$rem,4));
    208 
    209 		if ($i&1) {
    210 			&mov	(&LB($rem),&BP($off+15-($i>>1),"esp"));
    211 			&and	(&LB($rem),0xf0);
    212 		} else {
    213 			&mov	(&LB($rem),&BP($off+15-($i>>1),"esp"));
    214 			&shl	(&LB($rem),4);
    215 		}
    216 
    217 		&xor	($Zll,&DWP(8,$Htbl,$rem));
    218 		&xor	($Zlh,&DWP(12,$Htbl,$rem));
    219 		&xor	($Zhl,&DWP(0,$Htbl,$rem));
    220 		&xor	($Zhh,&DWP(4,$Htbl,$rem));
    221 	    }
    222 	}
    223 	&bswap	($Zll);
    224 	&bswap	($Zlh);
    225 	&bswap	($Zhl);
    226 	if (!$x86only) {
    227 		&bswap	($Zhh);
    228 	} else {
    229 		&mov	("eax",$Zhh);
    230 		&bswap	("eax");
    231 		&mov	($Zhh,"eax");
    232 	}
    233 }
    234 
    235 if ($unroll) {
    236     &function_begin_B("_x86_gmult_4bit_inner");
    237 	&x86_loop(4);
    238 	&ret	();
    239     &function_end_B("_x86_gmult_4bit_inner");
    240 }
    241 
    242 sub deposit_rem_4bit {
    243     my $bias = shift;
    244 
    245 	&mov	(&DWP($bias+0, "esp"),0x0000<<16);
    246 	&mov	(&DWP($bias+4, "esp"),0x1C20<<16);
    247 	&mov	(&DWP($bias+8, "esp"),0x3840<<16);
    248 	&mov	(&DWP($bias+12,"esp"),0x2460<<16);
    249 	&mov	(&DWP($bias+16,"esp"),0x7080<<16);
    250 	&mov	(&DWP($bias+20,"esp"),0x6CA0<<16);
    251 	&mov	(&DWP($bias+24,"esp"),0x48C0<<16);
    252 	&mov	(&DWP($bias+28,"esp"),0x54E0<<16);
    253 	&mov	(&DWP($bias+32,"esp"),0xE100<<16);
    254 	&mov	(&DWP($bias+36,"esp"),0xFD20<<16);
    255 	&mov	(&DWP($bias+40,"esp"),0xD940<<16);
    256 	&mov	(&DWP($bias+44,"esp"),0xC560<<16);
    257 	&mov	(&DWP($bias+48,"esp"),0x9180<<16);
    258 	&mov	(&DWP($bias+52,"esp"),0x8DA0<<16);
    259 	&mov	(&DWP($bias+56,"esp"),0xA9C0<<16);
    260 	&mov	(&DWP($bias+60,"esp"),0xB5E0<<16);
    261 }
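
         # The 16 constants above (and the rem_4bit table at the end of this
         # file) are the GF(2^128) reduction values for a nibble shifted out
         # on the right: each entry is the XOR of 0x1C2<<(4+j) over the set
         # bits j of its index. A sketch of how they could be regenerated,
         # for illustration only (the values stay hard-coded):
         #
         #	my @rem_4bit = map { my ($i,$v)=($_,0);
         #			     $v ^= 0x1C2<<(4+$_) for grep { $i>>$_&1 } 0..3;
         #			     $v } 0..15;
         #	# 0x0000,0x1C20,0x3840,0x2460,0x7080,0x6CA0,0x48C0,0x54E0,
         #	# 0xE100,0xFD20,0xD940,0xC560,0x9180,0x8DA0,0xA9C0,0xB5E0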
    262 
    263 if (!$x86only) {{{
    264 
    265 &static_label("rem_4bit");
    266 
    267 if (!$sse2) {{	# pure-MMX "May" version...
    268 
    269     # This code was removed since SSE2 is required for BoringSSL. The
    270     # outer structure of the code was retained to minimize future merge
    271     # conflicts.
    272 
    273 }} else {{	# "June" MMX version...
    274 		# ... has slower "April" gcm_gmult_4bit_mmx with folded
    275 		# loop. This is done to conserve code size...
    276 $S=16;		# shift factor for rem_4bit
    277 
    278 sub mmx_loop() {
     279 # The MMX version performs 2.8 times better on P4 (see the comment in
     280 # the non-MMX routine for further details), 40% better on Opteron and
     281 # Core2, 50% better on PIII... In other words, the effort is considered
     282 # to be well spent...
    283     my $inp = shift;
    284     my $rem_4bit = shift;
    285     my $cnt = $Zhh;
    286     my $nhi = $Zhl;
    287     my $nlo = $Zlh;
    288     my $rem = $Zll;
    289 
    290     my ($Zlo,$Zhi) = ("mm0","mm1");
    291     my $tmp = "mm2";
    292 
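         	# Roughly, each step below peels one nibble of Xi (bytes 15 down
         	# to 0, low nibble before high nibble): rem = Z&0xf; Z >>= 4;
         	# Z.hi ^= rem_4bit[rem]<<48 (the rem_4bit table keeps its entries
         	# pre-shifted by $S=16 into the top word); Z ^= Htable[nibble].
         	# The steps appear interleaved because the body is modulo-scheduled.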
    293 	&xor	($nlo,$nlo);	# avoid partial register stalls on PIII
    294 	&mov	($nhi,$Zll);
    295 	&mov	(&LB($nlo),&LB($nhi));
    296 	&mov	($cnt,14);
    297 	&shl	(&LB($nlo),4);
    298 	&and	($nhi,0xf0);
    299 	&movq	($Zlo,&QWP(8,$Htbl,$nlo));
    300 	&movq	($Zhi,&QWP(0,$Htbl,$nlo));
    301 	&movd	($rem,$Zlo);
    302 	&jmp	(&label("mmx_loop"));
    303 
    304     &set_label("mmx_loop",16);
    305 	&psrlq	($Zlo,4);
    306 	&and	($rem,0xf);
    307 	&movq	($tmp,$Zhi);
    308 	&psrlq	($Zhi,4);
    309 	&pxor	($Zlo,&QWP(8,$Htbl,$nhi));
    310 	&mov	(&LB($nlo),&BP(0,$inp,$cnt));
    311 	&psllq	($tmp,60);
    312 	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
    313 	&dec	($cnt);
    314 	&movd	($rem,$Zlo);
    315 	&pxor	($Zhi,&QWP(0,$Htbl,$nhi));
    316 	&mov	($nhi,$nlo);
    317 	&pxor	($Zlo,$tmp);
    318 	&js	(&label("mmx_break"));
    319 
    320 	&shl	(&LB($nlo),4);
    321 	&and	($rem,0xf);
    322 	&psrlq	($Zlo,4);
    323 	&and	($nhi,0xf0);
    324 	&movq	($tmp,$Zhi);
    325 	&psrlq	($Zhi,4);
    326 	&pxor	($Zlo,&QWP(8,$Htbl,$nlo));
    327 	&psllq	($tmp,60);
    328 	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
    329 	&movd	($rem,$Zlo);
    330 	&pxor	($Zhi,&QWP(0,$Htbl,$nlo));
    331 	&pxor	($Zlo,$tmp);
    332 	&jmp	(&label("mmx_loop"));
    333 
    334     &set_label("mmx_break",16);
    335 	&shl	(&LB($nlo),4);
    336 	&and	($rem,0xf);
    337 	&psrlq	($Zlo,4);
    338 	&and	($nhi,0xf0);
    339 	&movq	($tmp,$Zhi);
    340 	&psrlq	($Zhi,4);
    341 	&pxor	($Zlo,&QWP(8,$Htbl,$nlo));
    342 	&psllq	($tmp,60);
    343 	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
    344 	&movd	($rem,$Zlo);
    345 	&pxor	($Zhi,&QWP(0,$Htbl,$nlo));
    346 	&pxor	($Zlo,$tmp);
    347 
    348 	&psrlq	($Zlo,4);
    349 	&and	($rem,0xf);
    350 	&movq	($tmp,$Zhi);
    351 	&psrlq	($Zhi,4);
    352 	&pxor	($Zlo,&QWP(8,$Htbl,$nhi));
    353 	&psllq	($tmp,60);
    354 	&pxor	($Zhi,&QWP(0,$rem_4bit,$rem,8));
    355 	&movd	($rem,$Zlo);
    356 	&pxor	($Zhi,&QWP(0,$Htbl,$nhi));
    357 	&pxor	($Zlo,$tmp);
    358 
    359 	&psrlq	($Zlo,32);	# lower part of Zlo is already there
    360 	&movd	($Zhl,$Zhi);
    361 	&psrlq	($Zhi,32);
    362 	&movd	($Zlh,$Zlo);
    363 	&movd	($Zhh,$Zhi);
    364 
    365 	&bswap	($Zll);
    366 	&bswap	($Zhl);
    367 	&bswap	($Zlh);
    368 	&bswap	($Zhh);
    369 }
    370 
    371 &function_begin("gcm_gmult_4bit_mmx");
    372 	&mov	($inp,&wparam(0));	# load Xi
    373 	&mov	($Htbl,&wparam(1));	# load Htable
    374 
    375 	&call	(&label("pic_point"));
    376 	&set_label("pic_point");
    377 	&blindpop("eax");
    378 	&lea	("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax"));
    379 
    380 	&movz	($Zll,&BP(15,$inp));
    381 
    382 	&mmx_loop($inp,"eax");
    383 
    384 	&emms	();
    385 	&mov	(&DWP(12,$inp),$Zll);
    386 	&mov	(&DWP(4,$inp),$Zhl);
    387 	&mov	(&DWP(8,$inp),$Zlh);
    388 	&mov	(&DWP(0,$inp),$Zhh);
    389 &function_end("gcm_gmult_4bit_mmx");
    390 
    392 ######################################################################
     393 # The subroutine below is the "528B" variant of the "4-bit" GCM GHASH
     394 # function (see gcm128.c for details). It provides a further 20-40%
     395 # performance improvement over the above-mentioned "May" version.
    396 
    397 &static_label("rem_8bit");
    398 
    399 &function_begin("gcm_ghash_4bit_mmx");
    400 { my ($Zlo,$Zhi) = ("mm7","mm6");
    401   my $rem_8bit = "esi";
    402   my $Htbl = "ebx";
    403 
    404     # parameter block
    405     &mov	("eax",&wparam(0));		# Xi
    406     &mov	("ebx",&wparam(1));		# Htable
    407     &mov	("ecx",&wparam(2));		# inp
    408     &mov	("edx",&wparam(3));		# len
    409     &mov	("ebp","esp");			# original %esp
    410     &call	(&label("pic_point"));
    411     &set_label	("pic_point");
    412     &blindpop	($rem_8bit);
    413     &lea	($rem_8bit,&DWP(&label("rem_8bit")."-".&label("pic_point"),$rem_8bit));
    414 
    415     &sub	("esp",512+16+16);		# allocate stack frame...
    416     &and	("esp",-64);			# ...and align it
    417     &sub	("esp",16);			# place for (u8)(H[]<<4)
    418 
    419     &add	("edx","ecx");			# pointer to the end of input
    420     &mov	(&DWP(528+16+0,"esp"),"eax");	# save Xi
    421     &mov	(&DWP(528+16+8,"esp"),"edx");	# save inp+len
    422     &mov	(&DWP(528+16+12,"esp"),"ebp");	# save original %esp
    423 
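             # The stack frame built above is laid out roughly as follows
             # (offsets from the aligned %esp):
             #	  0- 15	(u8)(Htable[i]<<4), one byte per nibble value i
             #	 16-271	Htable[i].lo / Htable[i].hi, 8 bytes each
             #	272-527	(Htable[i]>>4).lo / (Htable[i]>>4).hi
             #	528-543	scratch copy of Xi^inp for the current block
             #	544-559	saved Xi pointer, inp, inp+len and original %esp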
    424     { my @lo  = ("mm0","mm1","mm2");
    425       my @hi  = ("mm3","mm4","mm5");
    426       my @tmp = ("mm6","mm7");
    427       my ($off1,$off2,$i) = (0,0,);
    428 
    429       &add	($Htbl,128);			# optimize for size
    430       &lea	("edi",&DWP(16+128,"esp"));
    431       &lea	("ebp",&DWP(16+256+128,"esp"));
    432 
    433       # decompose Htable (low and high parts are kept separately),
    434       # generate Htable[]>>4, (u8)(Htable[]<<4), save to stack...
    435       for ($i=0;$i<18;$i++) {
    436 
    437 	&mov	("edx",&DWP(16*$i+8-128,$Htbl))		if ($i<16);
    438 	&movq	($lo[0],&QWP(16*$i+8-128,$Htbl))	if ($i<16);
    439 	&psllq	($tmp[1],60)				if ($i>1);
    440 	&movq	($hi[0],&QWP(16*$i+0-128,$Htbl))	if ($i<16);
    441 	&por	($lo[2],$tmp[1])			if ($i>1);
    442 	&movq	(&QWP($off1-128,"edi"),$lo[1])		if ($i>0 && $i<17);
    443 	&psrlq	($lo[1],4)				if ($i>0 && $i<17);
    444 	&movq	(&QWP($off1,"edi"),$hi[1])		if ($i>0 && $i<17);
    445 	&movq	($tmp[0],$hi[1])			if ($i>0 && $i<17);
    446 	&movq	(&QWP($off2-128,"ebp"),$lo[2])		if ($i>1);
    447 	&psrlq	($hi[1],4)				if ($i>0 && $i<17);
    448 	&movq	(&QWP($off2,"ebp"),$hi[2])		if ($i>1);
    449 	&shl	("edx",4)				if ($i<16);
    450 	&mov	(&BP($i,"esp"),&LB("edx"))		if ($i<16);
    451 
    452 	unshift	(@lo,pop(@lo));			# "rotate" registers
    453 	unshift	(@hi,pop(@hi));
    454 	unshift	(@tmp,pop(@tmp));
    455 	$off1 += 8	if ($i>0);
    456 	$off2 += 8	if ($i>1);
    457       }
    458     }
    459 
    460     &movq	($Zhi,&QWP(0,"eax"));
    461     &mov	("ebx",&DWP(8,"eax"));
    462     &mov	("edx",&DWP(12,"eax"));		# load Xi
    463 
    464 &set_label("outer",16);
    465   { my $nlo = "eax";
    466     my $dat = "edx";
    467     my @nhi = ("edi","ebp");
    468     my @rem = ("ebx","ecx");
    469     my @red = ("mm0","mm1","mm2");
    470     my $tmp = "mm3";
    471 
    472     &xor	($dat,&DWP(12,"ecx"));		# merge input data
    473     &xor	("ebx",&DWP(8,"ecx"));
    474     &pxor	($Zhi,&QWP(0,"ecx"));
    475     &lea	("ecx",&DWP(16,"ecx"));		# inp+=16
    476     #&mov	(&DWP(528+12,"esp"),$dat);	# save inp^Xi
    477     &mov	(&DWP(528+8,"esp"),"ebx");
    478     &movq	(&QWP(528+0,"esp"),$Zhi);
    479     &mov	(&DWP(528+16+4,"esp"),"ecx");	# save inp
    480 
    481     &xor	($nlo,$nlo);
    482     &rol	($dat,8);
    483     &mov	(&LB($nlo),&LB($dat));
    484     &mov	($nhi[1],$nlo);
    485     &and	(&LB($nlo),0x0f);
    486     &shr	($nhi[1],4);
    487     &pxor	($red[0],$red[0]);
    488     &rol	($dat,8);			# next byte
    489     &pxor	($red[1],$red[1]);
    490     &pxor	($red[2],$red[2]);
    491 
     492     # Just like in the "May" version, modulo-schedule the critical path
     493     # in 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. The final
     494     # 'pxor' is scheduled so late that rem_8bit[] has to be shifted
     495     # *right* by 16, which is why the last argument to pinsrw is 2; it
     496     # corresponds to <<32=<<48>>16...
    497     for ($j=11,$i=0;$i<15;$i++) {
    498 
    499       if ($i>0) {
    500 	&pxor	($Zlo,&QWP(16,"esp",$nlo,8));		# Z^=H[nlo]
    501 	&rol	($dat,8);				# next byte
    502 	&pxor	($Zhi,&QWP(16+128,"esp",$nlo,8));
    503 
    504 	&pxor	($Zlo,$tmp);
    505 	&pxor	($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
    506 	&xor	(&LB($rem[1]),&BP(0,"esp",$nhi[0]));	# rem^(H[nhi]<<4)
    507       } else {
    508 	&movq	($Zlo,&QWP(16,"esp",$nlo,8));
    509 	&movq	($Zhi,&QWP(16+128,"esp",$nlo,8));
    510       }
    511 
    512 	&mov	(&LB($nlo),&LB($dat));
    513 	&mov	($dat,&DWP(528+$j,"esp"))		if (--$j%4==0);
    514 
    515 	&movd	($rem[0],$Zlo);
    516 	&movz	($rem[1],&LB($rem[1]))			if ($i>0);
    517 	&psrlq	($Zlo,8);				# Z>>=8
    518 
    519 	&movq	($tmp,$Zhi);
    520 	&mov	($nhi[0],$nlo);
    521 	&psrlq	($Zhi,8);
    522 
    523 	&pxor	($Zlo,&QWP(16+256+0,"esp",$nhi[1],8));	# Z^=H[nhi]>>4
    524 	&and	(&LB($nlo),0x0f);
    525 	&psllq	($tmp,56);
    526 
    527 	&pxor	($Zhi,$red[1])				if ($i>1);
    528 	&shr	($nhi[0],4);
    529 	&pinsrw	($red[0],&WP(0,$rem_8bit,$rem[1],2),2)	if ($i>0);
    530 
    531 	unshift	(@red,pop(@red));			# "rotate" registers
    532 	unshift	(@rem,pop(@rem));
    533 	unshift	(@nhi,pop(@nhi));
    534     }
    535 
    536     &pxor	($Zlo,&QWP(16,"esp",$nlo,8));		# Z^=H[nlo]
    537     &pxor	($Zhi,&QWP(16+128,"esp",$nlo,8));
    538     &xor	(&LB($rem[1]),&BP(0,"esp",$nhi[0]));	# rem^(H[nhi]<<4)
    539 
    540     &pxor	($Zlo,$tmp);
    541     &pxor	($Zhi,&QWP(16+256+128,"esp",$nhi[0],8));
    542     &movz	($rem[1],&LB($rem[1]));
    543 
    544     &pxor	($red[2],$red[2]);			# clear 2nd word
    545     &psllq	($red[1],4);
    546 
    547     &movd	($rem[0],$Zlo);
    548     &psrlq	($Zlo,4);				# Z>>=4
    549 
    550     &movq	($tmp,$Zhi);
    551     &psrlq	($Zhi,4);
    552     &shl	($rem[0],4);				# rem<<4
    553 
    554     &pxor	($Zlo,&QWP(16,"esp",$nhi[1],8));	# Z^=H[nhi]
    555     &psllq	($tmp,60);
    556     &movz	($rem[0],&LB($rem[0]));
    557 
    558     &pxor	($Zlo,$tmp);
    559     &pxor	($Zhi,&QWP(16+128,"esp",$nhi[1],8));
    560 
    561     &pinsrw	($red[0],&WP(0,$rem_8bit,$rem[1],2),2);
    562     &pxor	($Zhi,$red[1]);
    563 
    564     &movd	($dat,$Zlo);
    565     &pinsrw	($red[2],&WP(0,$rem_8bit,$rem[0],2),3);	# last is <<48
    566 
    567     &psllq	($red[0],12);				# correct by <<16>>4
    568     &pxor	($Zhi,$red[0]);
    569     &psrlq	($Zlo,32);
    570     &pxor	($Zhi,$red[2]);
    571 
    572     &mov	("ecx",&DWP(528+16+4,"esp"));	# restore inp
    573     &movd	("ebx",$Zlo);
    574     &movq	($tmp,$Zhi);			# 01234567
    575     &psllw	($Zhi,8);			# 1.3.5.7.
    576     &psrlw	($tmp,8);			# .0.2.4.6
    577     &por	($Zhi,$tmp);			# 10325476
    578     &bswap	($dat);
    579     &pshufw	($Zhi,$Zhi,0b00011011);		# 76543210
    580     &bswap	("ebx");
    581 
    582     &cmp	("ecx",&DWP(528+16+8,"esp"));	# are we done?
    583     &jne	(&label("outer"));
    584   }
    585 
    586     &mov	("eax",&DWP(528+16+0,"esp"));	# restore Xi
    587     &mov	(&DWP(12,"eax"),"edx");
    588     &mov	(&DWP(8,"eax"),"ebx");
    589     &movq	(&QWP(0,"eax"),$Zhi);
    590 
    591     &mov	("esp",&DWP(528+16+12,"esp"));	# restore original %esp
    592     &emms	();
    593 }
    594 &function_end("gcm_ghash_4bit_mmx");
    595 }}
    596 
    598 if ($sse2) {{
    599 ######################################################################
    600 # PCLMULQDQ version.
    601 
    602 $Xip="eax";
    603 $Htbl="edx";
    604 $const="ecx";
    605 $inp="esi";
    606 $len="ebx";
    607 
    608 ($Xi,$Xhi)=("xmm0","xmm1");	$Hkey="xmm2";
    609 ($T1,$T2,$T3)=("xmm3","xmm4","xmm5");
    610 ($Xn,$Xhn)=("xmm6","xmm7");
    611 
    612 &static_label("bswap");
    613 
    614 sub clmul64x64_T2 {	# minimal "register" pressure
    615 my ($Xhi,$Xi,$Hkey,$HK)=@_;
    616 
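         	# Karatsuba sketch: with X = Xi = X.hi:X.lo and H = Hkey = H.hi:H.lo
         	# over GF(2)[x], the three pclmulqdq below compute X.lo*H.lo,
         	# X.hi*H.hi and (X.lo^X.hi)*(H.lo^H.hi); XOR-ing all three yields
         	# the middle 64x64 product word, which the psrldq/pslldq pair then
         	# splits between the low (Xi) and high (Xhi) halves of the result.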
    617 	&movdqa		($Xhi,$Xi);		#
    618 	&pshufd		($T1,$Xi,0b01001110);
    619 	&pshufd		($T2,$Hkey,0b01001110)	if (!defined($HK));
    620 	&pxor		($T1,$Xi);		#
    621 	&pxor		($T2,$Hkey)		if (!defined($HK));
    622 			$HK=$T2			if (!defined($HK));
    623 
    624 	&pclmulqdq	($Xi,$Hkey,0x00);	#######
    625 	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
    626 	&pclmulqdq	($T1,$HK,0x00);		#######
    627 	&xorps		($T1,$Xi);		#
    628 	&xorps		($T1,$Xhi);		#
    629 
    630 	&movdqa		($T2,$T1);		#
    631 	&psrldq		($T1,8);
    632 	&pslldq		($T2,8);		#
    633 	&pxor		($Xhi,$T1);
    634 	&pxor		($Xi,$T2);		#
    635 }
    636 
    637 sub clmul64x64_T3 {
     638 # Even though this subroutine offers visually better ILP, it was
     639 # empirically found to be a tad slower than the version above, at
     640 # least in the gcm_ghash_clmul context. But it's just as well,
     641 # because loop modulo-scheduling is possible only thanks to the
     642 # minimized "register" pressure...
    643 my ($Xhi,$Xi,$Hkey)=@_;
    644 
    645 	&movdqa		($T1,$Xi);		#
    646 	&movdqa		($Xhi,$Xi);
    647 	&pclmulqdq	($Xi,$Hkey,0x00);	#######
    648 	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
    649 	&pshufd		($T2,$T1,0b01001110);	#
    650 	&pshufd		($T3,$Hkey,0b01001110);
    651 	&pxor		($T2,$T1);		#
    652 	&pxor		($T3,$Hkey);
    653 	&pclmulqdq	($T2,$T3,0x00);		#######
    654 	&pxor		($T2,$Xi);		#
    655 	&pxor		($T2,$Xhi);		#
    656 
    657 	&movdqa		($T3,$T2);		#
    658 	&psrldq		($T2,8);
    659 	&pslldq		($T3,8);		#
    660 	&pxor		($Xhi,$T2);
    661 	&pxor		($Xi,$T3);		#
    662 }
    663 
     665 if (1) {		# Algorithm 9 with <<1 twist.
     666 			# Reduction is shorter and uses only two
     667 			# temporary registers, which makes it a better
     668 			# candidate for interleaving with the 64x64
     669 			# multiplication. The pre-modulo-scheduled loop
     670 			# was found to be ~20% faster than Algorithm 5
     671 			# below. Algorithm 9 was therefore chosen for
     672 			# further optimization...
    673 
    674 sub reduction_alg9 {	# 17/11 times faster than Intel version
    675 my ($Xhi,$Xi) = @_;
    676 
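         	# This folds the 256-bit product Xhi:Xi modulo the GHASH polynomial
         	# x^128+x^7+x^2+x+1 in its bit-reflected, <<1-twisted representation:
         	# the psllq by 5, 1 and 57 in the 1st phase compose to
         	# Xi<<57 ^ Xi<<62 ^ Xi<<63 (per 64-bit lane), and the 2nd phase XORs
         	# Xi ^ Xi>>1 ^ Xi>>2 ^ Xi>>7 with Xhi, leaving the reduced value in Xi.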
    677 	# 1st phase
    678 	&movdqa		($T2,$Xi);		#
    679 	&movdqa		($T1,$Xi);
    680 	&psllq		($Xi,5);
    681 	&pxor		($T1,$Xi);		#
    682 	&psllq		($Xi,1);
    683 	&pxor		($Xi,$T1);		#
    684 	&psllq		($Xi,57);		#
    685 	&movdqa		($T1,$Xi);		#
    686 	&pslldq		($Xi,8);
    687 	&psrldq		($T1,8);		#
    688 	&pxor		($Xi,$T2);
    689 	&pxor		($Xhi,$T1);		#
    690 
    691 	# 2nd phase
    692 	&movdqa		($T2,$Xi);
    693 	&psrlq		($Xi,1);
    694 	&pxor		($Xhi,$T2);		#
    695 	&pxor		($T2,$Xi);
    696 	&psrlq		($Xi,5);
    697 	&pxor		($Xi,$T2);		#
    698 	&psrlq		($Xi,1);		#
    699 	&pxor		($Xi,$Xhi)		#
    700 }
    701 
    702 &function_begin_B("gcm_init_clmul");
    703 	&mov		($Htbl,&wparam(0));
    704 	&mov		($Xip,&wparam(1));
    705 
    706 	&call		(&label("pic"));
    707 &set_label("pic");
    708 	&blindpop	($const);
    709 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
    710 
    711 	&movdqu		($Hkey,&QWP(0,$Xip));
    712 	&pshufd		($Hkey,$Hkey,0b01001110);# dword swap
    713 
    714 	# <<1 twist
    715 	&pshufd		($T2,$Hkey,0b11111111);	# broadcast uppermost dword
    716 	&movdqa		($T1,$Hkey);
    717 	&psllq		($Hkey,1);
    718 	&pxor		($T3,$T3);		#
    719 	&psrlq		($T1,63);
    720 	&pcmpgtd	($T3,$T2);		# broadcast carry bit
    721 	&pslldq		($T1,8);
    722 	&por		($Hkey,$T1);		# H<<=1
    723 
    724 	# magic reduction
    725 	&pand		($T3,&QWP(16,$const));	# 0x1c2_polynomial
    726 	&pxor		($Hkey,$T3);		# if(carry) H^=0x1c2_polynomial
    727 
    728 	# calculate H^2
    729 	&movdqa		($Xi,$Hkey);
    730 	&clmul64x64_T2	($Xhi,$Xi,$Hkey);
    731 	&reduction_alg9	($Xhi,$Xi);
    732 
    733 	&pshufd		($T1,$Hkey,0b01001110);
    734 	&pshufd		($T2,$Xi,0b01001110);
    735 	&pxor		($T1,$Hkey);		# Karatsuba pre-processing
    736 	&movdqu		(&QWP(0,$Htbl),$Hkey);	# save H
    737 	&pxor		($T2,$Xi);		# Karatsuba pre-processing
    738 	&movdqu		(&QWP(16,$Htbl),$Xi);	# save H^2
    739 	&palignr	($T2,$T1,8);		# low part is H.lo^H.hi
    740 	&movdqu		(&QWP(32,$Htbl),$T2);	# save Karatsuba "salt"
    741 
    742 	&ret		();
    743 &function_end_B("gcm_init_clmul");
    744 
    745 &function_begin_B("gcm_gmult_clmul");
    746 	&mov		($Xip,&wparam(0));
    747 	&mov		($Htbl,&wparam(1));
    748 
    749 	&call		(&label("pic"));
    750 &set_label("pic");
    751 	&blindpop	($const);
    752 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
    753 
    754 	&movdqu		($Xi,&QWP(0,$Xip));
    755 	&movdqa		($T3,&QWP(0,$const));
    756 	&movups		($Hkey,&QWP(0,$Htbl));
    757 	&pshufb		($Xi,$T3);
    758 	&movups		($T2,&QWP(32,$Htbl));
    759 
    760 	&clmul64x64_T2	($Xhi,$Xi,$Hkey,$T2);
    761 	&reduction_alg9	($Xhi,$Xi);
    762 
    763 	&pshufb		($Xi,$T3);
    764 	&movdqu		(&QWP(0,$Xip),$Xi);
    765 
    766 	&ret	();
    767 &function_end_B("gcm_gmult_clmul");
    768 
    769 &function_begin("gcm_ghash_clmul");
    770 	&mov		($Xip,&wparam(0));
    771 	&mov		($Htbl,&wparam(1));
    772 	&mov		($inp,&wparam(2));
    773 	&mov		($len,&wparam(3));
    774 
    775 	&call		(&label("pic"));
    776 &set_label("pic");
    777 	&blindpop	($const);
    778 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
    779 
    780 	&movdqu		($Xi,&QWP(0,$Xip));
    781 	&movdqa		($T3,&QWP(0,$const));
    782 	&movdqu		($Hkey,&QWP(0,$Htbl));
    783 	&pshufb		($Xi,$T3);
    784 
    785 	&sub		($len,0x10);
    786 	&jz		(&label("odd_tail"));
    787 
    788 	#######
    789 	# Xi+2 =[H*(Ii+1 + Xi+1)] mod P =
    790 	#	[(H*Ii+1) + (H*Xi+1)] mod P =
    791 	#	[(H*Ii+1) + H^2*(Ii+Xi)] mod P
    792 	#
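         	# With Hkey=H here and H^2 loaded a few instructions below, this
         	# is the Naggr=2 aggregated reduction discussed at the top of the
         	# file: two carry-less multiplications per modulo-reduction.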
    793 	&movdqu		($T1,&QWP(0,$inp));	# Ii
    794 	&movdqu		($Xn,&QWP(16,$inp));	# Ii+1
    795 	&pshufb		($T1,$T3);
    796 	&pshufb		($Xn,$T3);
    797 	&movdqu		($T3,&QWP(32,$Htbl));
    798 	&pxor		($Xi,$T1);		# Ii+Xi
    799 
    800 	&pshufd		($T1,$Xn,0b01001110);	# H*Ii+1
    801 	&movdqa		($Xhn,$Xn);
    802 	&pxor		($T1,$Xn);		#
    803 	&lea		($inp,&DWP(32,$inp));	# i+=2
    804 
    805 	&pclmulqdq	($Xn,$Hkey,0x00);	#######
    806 	&pclmulqdq	($Xhn,$Hkey,0x11);	#######
    807 	&pclmulqdq	($T1,$T3,0x00);		#######
    808 	&movups		($Hkey,&QWP(16,$Htbl));	# load H^2
    809 	&nop		();
    810 
    811 	&sub		($len,0x20);
    812 	&jbe		(&label("even_tail"));
    813 	&jmp		(&label("mod_loop"));
    814 
    815 &set_label("mod_loop",32);
    816 	&pshufd		($T2,$Xi,0b01001110);	# H^2*(Ii+Xi)
    817 	&movdqa		($Xhi,$Xi);
    818 	&pxor		($T2,$Xi);		#
    819 	&nop		();
    820 
    821 	&pclmulqdq	($Xi,$Hkey,0x00);	#######
    822 	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
    823 	&pclmulqdq	($T2,$T3,0x10);		#######
    824 	&movups		($Hkey,&QWP(0,$Htbl));	# load H
    825 
    826 	&xorps		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
    827 	&movdqa		($T3,&QWP(0,$const));
    828 	&xorps		($Xhi,$Xhn);
    829 	 &movdqu	($Xhn,&QWP(0,$inp));	# Ii
    830 	&pxor		($T1,$Xi);		# aggregated Karatsuba post-processing
    831 	 &movdqu	($Xn,&QWP(16,$inp));	# Ii+1
    832 	&pxor		($T1,$Xhi);		#
    833 
    834 	 &pshufb	($Xhn,$T3);
    835 	&pxor		($T2,$T1);		#
    836 
    837 	&movdqa		($T1,$T2);		#
    838 	&psrldq		($T2,8);
    839 	&pslldq		($T1,8);		#
    840 	&pxor		($Xhi,$T2);
    841 	&pxor		($Xi,$T1);		#
    842 	 &pshufb	($Xn,$T3);
    843 	 &pxor		($Xhi,$Xhn);		# "Ii+Xi", consume early
    844 
    845 	&movdqa		($Xhn,$Xn);		#&clmul64x64_TX	($Xhn,$Xn,$Hkey); H*Ii+1
    846 	  &movdqa	($T2,$Xi);		#&reduction_alg9($Xhi,$Xi); 1st phase
    847 	  &movdqa	($T1,$Xi);
    848 	  &psllq	($Xi,5);
    849 	  &pxor		($T1,$Xi);		#
    850 	  &psllq	($Xi,1);
    851 	  &pxor		($Xi,$T1);		#
    852 	&pclmulqdq	($Xn,$Hkey,0x00);	#######
    853 	&movups		($T3,&QWP(32,$Htbl));
    854 	  &psllq	($Xi,57);		#
    855 	  &movdqa	($T1,$Xi);		#
    856 	  &pslldq	($Xi,8);
    857 	  &psrldq	($T1,8);		#
    858 	  &pxor		($Xi,$T2);
    859 	  &pxor		($Xhi,$T1);		#
    860 	&pshufd		($T1,$Xhn,0b01001110);
    861 	  &movdqa	($T2,$Xi);		# 2nd phase
    862 	  &psrlq	($Xi,1);
    863 	&pxor		($T1,$Xhn);
    864 	  &pxor		($Xhi,$T2);		#
    865 	&pclmulqdq	($Xhn,$Hkey,0x11);	#######
    866 	&movups		($Hkey,&QWP(16,$Htbl));	# load H^2
    867 	  &pxor		($T2,$Xi);
    868 	  &psrlq	($Xi,5);
    869 	  &pxor		($Xi,$T2);		#
    870 	  &psrlq	($Xi,1);		#
    871 	  &pxor		($Xi,$Xhi)		#
    872 	&pclmulqdq	($T1,$T3,0x00);		#######
    873 
    874 	&lea		($inp,&DWP(32,$inp));
    875 	&sub		($len,0x20);
    876 	&ja		(&label("mod_loop"));
    877 
    878 &set_label("even_tail");
    879 	&pshufd		($T2,$Xi,0b01001110);	# H^2*(Ii+Xi)
    880 	&movdqa		($Xhi,$Xi);
    881 	&pxor		($T2,$Xi);		#
    882 
    883 	&pclmulqdq	($Xi,$Hkey,0x00);	#######
    884 	&pclmulqdq	($Xhi,$Hkey,0x11);	#######
    885 	&pclmulqdq	($T2,$T3,0x10);		#######
    886 	&movdqa		($T3,&QWP(0,$const));
    887 
    888 	&xorps		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
    889 	&xorps		($Xhi,$Xhn);
    890 	&pxor		($T1,$Xi);		# aggregated Karatsuba post-processing
    891 	&pxor		($T1,$Xhi);		#
    892 
    893 	&pxor		($T2,$T1);		#
    894 
    895 	&movdqa		($T1,$T2);		#
    896 	&psrldq		($T2,8);
    897 	&pslldq		($T1,8);		#
    898 	&pxor		($Xhi,$T2);
    899 	&pxor		($Xi,$T1);		#
    900 
    901 	&reduction_alg9	($Xhi,$Xi);
    902 
    903 	&test		($len,$len);
    904 	&jnz		(&label("done"));
    905 
    906 	&movups		($Hkey,&QWP(0,$Htbl));	# load H
    907 &set_label("odd_tail");
    908 	&movdqu		($T1,&QWP(0,$inp));	# Ii
    909 	&pshufb		($T1,$T3);
    910 	&pxor		($Xi,$T1);		# Ii+Xi
    911 
    912 	&clmul64x64_T2	($Xhi,$Xi,$Hkey);	# H*(Ii+Xi)
    913 	&reduction_alg9	($Xhi,$Xi);
    914 
    915 &set_label("done");
    916 	&pshufb		($Xi,$T3);
    917 	&movdqu		(&QWP(0,$Xip),$Xi);
    918 &function_end("gcm_ghash_clmul");
    919 
    921 } else {		# Algorithm 5. Kept for reference purposes.
    922 
    923 sub reduction_alg5 {	# 19/16 times faster than Intel version
    924 my ($Xhi,$Xi)=@_;
    925 
    926 	# <<1
    927 	&movdqa		($T1,$Xi);		#
    928 	&movdqa		($T2,$Xhi);
    929 	&pslld		($Xi,1);
    930 	&pslld		($Xhi,1);		#
    931 	&psrld		($T1,31);
    932 	&psrld		($T2,31);		#
    933 	&movdqa		($T3,$T1);
    934 	&pslldq		($T1,4);
    935 	&psrldq		($T3,12);		#
    936 	&pslldq		($T2,4);
    937 	&por		($Xhi,$T3);		#
    938 	&por		($Xi,$T1);
    939 	&por		($Xhi,$T2);		#
    940 
    941 	# 1st phase
    942 	&movdqa		($T1,$Xi);
    943 	&movdqa		($T2,$Xi);
    944 	&movdqa		($T3,$Xi);		#
    945 	&pslld		($T1,31);
    946 	&pslld		($T2,30);
    947 	&pslld		($Xi,25);		#
    948 	&pxor		($T1,$T2);
    949 	&pxor		($T1,$Xi);		#
    950 	&movdqa		($T2,$T1);		#
    951 	&pslldq		($T1,12);
    952 	&psrldq		($T2,4);		#
    953 	&pxor		($T3,$T1);
    954 
    955 	# 2nd phase
    956 	&pxor		($Xhi,$T3);		#
    957 	&movdqa		($Xi,$T3);
    958 	&movdqa		($T1,$T3);
    959 	&psrld		($Xi,1);		#
    960 	&psrld		($T1,2);
    961 	&psrld		($T3,7);		#
    962 	&pxor		($Xi,$T1);
    963 	&pxor		($Xhi,$T2);
    964 	&pxor		($Xi,$T3);		#
    965 	&pxor		($Xi,$Xhi);		#
    966 }
    967 
    968 &function_begin_B("gcm_init_clmul");
    969 	&mov		($Htbl,&wparam(0));
    970 	&mov		($Xip,&wparam(1));
    971 
    972 	&call		(&label("pic"));
    973 &set_label("pic");
    974 	&blindpop	($const);
    975 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
    976 
    977 	&movdqu		($Hkey,&QWP(0,$Xip));
    978 	&pshufd		($Hkey,$Hkey,0b01001110);# dword swap
    979 
    980 	# calculate H^2
    981 	&movdqa		($Xi,$Hkey);
    982 	&clmul64x64_T3	($Xhi,$Xi,$Hkey);
    983 	&reduction_alg5	($Xhi,$Xi);
    984 
    985 	&movdqu		(&QWP(0,$Htbl),$Hkey);	# save H
    986 	&movdqu		(&QWP(16,$Htbl),$Xi);	# save H^2
    987 
    988 	&ret		();
    989 &function_end_B("gcm_init_clmul");
    990 
    991 &function_begin_B("gcm_gmult_clmul");
    992 	&mov		($Xip,&wparam(0));
    993 	&mov		($Htbl,&wparam(1));
    994 
    995 	&call		(&label("pic"));
    996 &set_label("pic");
    997 	&blindpop	($const);
    998 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
    999 
   1000 	&movdqu		($Xi,&QWP(0,$Xip));
   1001 	&movdqa		($Xn,&QWP(0,$const));
   1002 	&movdqu		($Hkey,&QWP(0,$Htbl));
   1003 	&pshufb		($Xi,$Xn);
   1004 
   1005 	&clmul64x64_T3	($Xhi,$Xi,$Hkey);
   1006 	&reduction_alg5	($Xhi,$Xi);
   1007 
   1008 	&pshufb		($Xi,$Xn);
   1009 	&movdqu		(&QWP(0,$Xip),$Xi);
   1010 
   1011 	&ret	();
   1012 &function_end_B("gcm_gmult_clmul");
   1013 
   1014 &function_begin("gcm_ghash_clmul");
   1015 	&mov		($Xip,&wparam(0));
   1016 	&mov		($Htbl,&wparam(1));
   1017 	&mov		($inp,&wparam(2));
   1018 	&mov		($len,&wparam(3));
   1019 
   1020 	&call		(&label("pic"));
   1021 &set_label("pic");
   1022 	&blindpop	($const);
   1023 	&lea		($const,&DWP(&label("bswap")."-".&label("pic"),$const));
   1024 
   1025 	&movdqu		($Xi,&QWP(0,$Xip));
   1026 	&movdqa		($T3,&QWP(0,$const));
   1027 	&movdqu		($Hkey,&QWP(0,$Htbl));
   1028 	&pshufb		($Xi,$T3);
   1029 
   1030 	&sub		($len,0x10);
   1031 	&jz		(&label("odd_tail"));
   1032 
   1033 	#######
   1034 	# Xi+2 =[H*(Ii+1 + Xi+1)] mod P =
   1035 	#	[(H*Ii+1) + (H*Xi+1)] mod P =
   1036 	#	[(H*Ii+1) + H^2*(Ii+Xi)] mod P
   1037 	#
   1038 	&movdqu		($T1,&QWP(0,$inp));	# Ii
   1039 	&movdqu		($Xn,&QWP(16,$inp));	# Ii+1
   1040 	&pshufb		($T1,$T3);
   1041 	&pshufb		($Xn,$T3);
   1042 	&pxor		($Xi,$T1);		# Ii+Xi
   1043 
   1044 	&clmul64x64_T3	($Xhn,$Xn,$Hkey);	# H*Ii+1
   1045 	&movdqu		($Hkey,&QWP(16,$Htbl));	# load H^2
   1046 
   1047 	&sub		($len,0x20);
   1048 	&lea		($inp,&DWP(32,$inp));	# i+=2
   1049 	&jbe		(&label("even_tail"));
   1050 
   1051 &set_label("mod_loop");
   1052 	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)
   1053 	&movdqu		($Hkey,&QWP(0,$Htbl));	# load H
   1054 
   1055 	&pxor		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
   1056 	&pxor		($Xhi,$Xhn);
   1057 
   1058 	&reduction_alg5	($Xhi,$Xi);
   1059 
   1060 	#######
   1061 	&movdqa		($T3,&QWP(0,$const));
   1062 	&movdqu		($T1,&QWP(0,$inp));	# Ii
   1063 	&movdqu		($Xn,&QWP(16,$inp));	# Ii+1
   1064 	&pshufb		($T1,$T3);
   1065 	&pshufb		($Xn,$T3);
   1066 	&pxor		($Xi,$T1);		# Ii+Xi
   1067 
   1068 	&clmul64x64_T3	($Xhn,$Xn,$Hkey);	# H*Ii+1
   1069 	&movdqu		($Hkey,&QWP(16,$Htbl));	# load H^2
   1070 
   1071 	&sub		($len,0x20);
   1072 	&lea		($inp,&DWP(32,$inp));
   1073 	&ja		(&label("mod_loop"));
   1074 
   1075 &set_label("even_tail");
   1076 	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H^2*(Ii+Xi)
   1077 
   1078 	&pxor		($Xi,$Xn);		# (H*Ii+1) + H^2*(Ii+Xi)
   1079 	&pxor		($Xhi,$Xhn);
   1080 
   1081 	&reduction_alg5	($Xhi,$Xi);
   1082 
   1083 	&movdqa		($T3,&QWP(0,$const));
   1084 	&test		($len,$len);
   1085 	&jnz		(&label("done"));
   1086 
   1087 	&movdqu		($Hkey,&QWP(0,$Htbl));	# load H
   1088 &set_label("odd_tail");
   1089 	&movdqu		($T1,&QWP(0,$inp));	# Ii
   1090 	&pshufb		($T1,$T3);
   1091 	&pxor		($Xi,$T1);		# Ii+Xi
   1092 
   1093 	&clmul64x64_T3	($Xhi,$Xi,$Hkey);	# H*(Ii+Xi)
   1094 	&reduction_alg5	($Xhi,$Xi);
   1095 
   1096 	&movdqa		($T3,&QWP(0,$const));
   1097 &set_label("done");
   1098 	&pshufb		($Xi,$T3);
   1099 	&movdqu		(&QWP(0,$Xip),$Xi);
   1100 &function_end("gcm_ghash_clmul");
   1101 
   1102 }
   1103 
   1105 &set_label("bswap",64);
   1106 	&data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
   1107 	&data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2);	# 0x1c2_polynomial
   1108 &set_label("rem_8bit",64);
   1109 	&data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E);
   1110 	&data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E);
   1111 	&data_short(0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E);
   1112 	&data_short(0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E);
   1113 	&data_short(0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E);
   1114 	&data_short(0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E);
   1115 	&data_short(0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E);
   1116 	&data_short(0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E);
   1117 	&data_short(0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE);
   1118 	&data_short(0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE);
   1119 	&data_short(0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE);
   1120 	&data_short(0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE);
   1121 	&data_short(0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E);
   1122 	&data_short(0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E);
   1123 	&data_short(0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE);
   1124 	&data_short(0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE);
   1125 	&data_short(0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E);
   1126 	&data_short(0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E);
   1127 	&data_short(0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E);
   1128 	&data_short(0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E);
   1129 	&data_short(0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E);
   1130 	&data_short(0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E);
   1131 	&data_short(0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E);
   1132 	&data_short(0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E);
   1133 	&data_short(0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE);
   1134 	&data_short(0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE);
   1135 	&data_short(0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE);
   1136 	&data_short(0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE);
   1137 	&data_short(0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E);
   1138 	&data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E);
   1139 	&data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE);
   1140 	&data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE);
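         	# (The rem_8bit entries above follow the same pattern as rem_4bit:
         	# entry i is the XOR of 0x1C2<<j over the set bits j of i, e.g.
         	# 0x054E = 0x01C2^0x0384^0x0708 for i=7.)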
   1141 }}	# $sse2
   1142 
   1143 &set_label("rem_4bit",64);
   1144 	&data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S);
   1145 	&data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S);
   1146 	&data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S);
   1147 	&data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S);
   1148 }}}	# !$x86only
   1149 
   1150 &asciz("GHASH for x86, CRYPTOGAMS by <appro\@openssl.org>");
   1151 &asm_finish();
   1152 
   1153 close STDOUT;
   1154 
    1155 # A question was raised about the choice of vanilla MMX, or rather, why
    1156 # wasn't SSE2 chosen instead? Besides the fact that MMX runs on legacy
    1157 # CPUs such as PIII, the "4-bit" MMX version was observed to outperform
    1158 # the *corresponding* SSE2 one even on contemporary CPUs. The SSE2
    1159 # results were provided by Peter-Michael Hager, who maintains an SSE2
    1160 # implementation featuring a full range of lookup-table sizes, but with
    1161 # per-invocation lookup table setup. The latter means that the table
    1162 # size is chosen depending on how much data is to be hashed in each
    1163 # call: more data, larger table. The best reported result for Core2 is
    1164 # ~4 cycles per processed byte out of a 64KB block, a number that even
    1165 # accounts for the 64KB table setup overhead. As discussed in gcm128.c,
    1166 # we choose to be more conservative with respect to lookup table sizes,
    1167 # but how do the results compare? The minimalistic "256B" MMX version
    1168 # delivers ~11 cycles on the same platform. As also discussed in
    1169 # gcm128.c, the next-in-line "8-bit Shoup's" or "4KB" method should
    1170 # deliver twice the performance of the "256B" one, in other words no
    1171 # worse than ~6 cycles per byte. It should also be noted that in the
    1172 # SSE2 case the improvement can be "super-linear," i.e. more than
    1173 # twofold, mostly because >>8 maps to a single instruction on an SSE2
    1174 # register, unlike the "4-bit" case where >>4 maps to the same number
    1175 # of instructions for both MMX and SSE2. Bottom line: a switch to SSE2
    1176 # is considered justifiable only if we choose to implement the "8-bit" method...
   1177