Documentation for ELL2.C Rich Schroeppel rcs@cs.arizona.edu January 1996 Acknowledgement: Oliver Spatscheck wrote some of the assembly code for the DEC Alpha, and contributed an important idea in optimizing the Alpha Xor Multiply. This work was supported by the National Computer Security Center (part of NSA), contract MDA904-92-C-5151, and by DARPA contract DABT63-94-C-0002. Introduction ELL2.C implements operations with elliptic curves over a finite field. These operations can be useful in cryptography, usually for exchanging keys. I recommend Al Menezes' book "Elliptic Curve Public Key Cryptosystems" for an introduction to elliptic curves. Any good algebra book will explain finite fields. I also recommend our paper from Crypto '95 (in Springer LNCS #963), or our tech-report (ftp://ftp.cs.arizona.edu/reports/1995/TR95-03.ps). This documentation uses the phrases "field element" and "polynomial" interchangeably. Compiling and Running the Program The compile instructions are at the front of ELL2.C. The file ell2.examples contains sample compilations. The ALPHA switch selects 64bit code. There is one assembly language routine for the Alpha. The Sparc compile takes several minutes. Example of compiling for the Sparc: ------------------------------------------------------- leibniz% gcc2 -O9 -o ell2-960117-gcc2O9b ell2.c ------------------------------------------------------- The name of the binary program, ell2-960117-gcc2O9, includes the date, compiler, and optimization level. Running the program. An example: ------------------------------------------------------- leibniz% time ell2-960117-gcc2O9 -m55 100000 GF155 here. Multiply test M55, 155*155 bits, 100000 times: 07531fca 53579ace ff003742 98765432 abcdef12 05371fac abdecf12 35795ace f003f742 98754632 001a94d3 ce915148 99c5bf6f 64daaef0 0f620c12 d6a32b04 d25a6f7a aeb1a2e5 c59b9857 71e42144 11.2u 0.0s 0:12 91% 0+336k 0+1io 0pf+0w ------------------------------------------------------- This test is for timing the multiplication of two field elements. The "-m55" switch selects the test; the number 100000 says to run the test 100000 times. (This number is required.) The compiled-in inputs are 155 bits each; the product is 309 bits. From this test, we learn that the time per multiplication is 11.2sec/100000 = 112 microseconds. The file ell2.examples contains more examples of running the program. There are about 100 routines that can be called from top level. See the code in "main" for a complete list, and the -names that call them. Most of the tests will run without arguments (other than the repetition count). To determine if a test allows other arguments, it's necessary to examine the code in "main" that calls the test. If you run the program without arguments, or forget to include the count, it runs a few basic tests. Timing note: The "main" program includes a fair amount of setup time (10-100 msec). I always run my timing tests for several seconds to minimize the (relative) effect of the setup. Programming Matters ELL2.C is a research tool, and it contains much obsolete code. This code is useful in explaining the thread of ideas that leads to the more complicated "optimized" routines. It may also be helpful when creating systems for other machines, with different time/memory/cache/register constraints. The older routines tend to be shorter and simpler. When several low-level routines with the same functionality exist, a #define is used to select which routine is used by higher level code. To make a production system, strip out all the inessential code. The top-level routine could be either "testecv" or "testdhtm". The Diffie Hellman key exchange protocol requires the program to choose random numbers to use as the point multipliers. I have instead used fixed, compiled-in values. A production version will need to include a suitable source of random numbers. I've used global variables and arrays in many places, so the code isn't reentrant. This program was written to run on two particular machines, the Sun Sparc model IPC, with 32bit words, running at 25MHz, under Sunos 4.1.1; and the DEC Alpha series, with 64bit words, under DEC Unix (OSF/1 V3.0); our timings were run on a 175 MHz machine. A long time (and many versions) ago, the program was compiled and run on a 66 MHz 486 machine under Mach+Unix, and on a DEC Mips 25 MHz workstation running Mach+Unix. The times for the basic elliptic curve operations were within 10% of the Sparc times. The Sparc has a 64KB instruction cache, more than enough for our purposes. The Alpha has an 8KB on-chip primary instruction cache, and getting the right code into that cache was a headache. [The Alpha has two 32bit instructions in each word, so the multiply and inversion have to fit in 2048 instructions. Recent versions of the Alpha chip have a 16KB Icache, ameliorating this problem.] We tried both the native compiler and gcc2 on the Alpha: The native compiler produced slightly faster code, but insisted on arranging the routines to its own taste. The low-level code ran faster, but we could find no way to get the correct code for doing a full ecurve calculation (a point multiplication) into the cache without overlap, so the time for a real unit of work was worse than with GCC2. GCC2 was dumber about rearranging the routines, so we were able to get the critical code into the cache unoverlaped simply by removing all extraneous code. The #define ECV switch eliminates everything that isn't needed to perform an elliptic curve point multiplication, and uses a stripped-down top-level. The MACRO64 switch causes some routines to be inlined; it's currently turned on within the code (under an #ifdef ALPHA). Use ECV and MACRO64 only on the Alpha. Porting: The C code is pretty much vanilla. Watch for problems with unsigned compares, and shifts of unsigned quantities. I've used the capability of running together strings in the print statements; older C compilers don't like this. The example file will be helpful in testing a port. The program will break badly if the data type (unsigned long) is other than 32 or 64 bits. Start with the lowest level routines and work up. To get the fastest results for your machine, it may be necessary to change the macros which select among the low-level routines. The principal code selection macros are MULT, SQUARE, REDUCE155, MODINV, Edouble, and Epow. There are some secondary selections that only matter when particular primary selecions are made. Data Structures Most arrays and variables that represent field elements are (unsigned long). Numerical variables, counters, etc. are generally (long). The polynomial representation (for elements of GF[2^155]) is five consecutive 32bit words of memory (three 64bit words for the Alpha). The lower degree powers of u go into the lower-order ("rightmost") bits. u^0 = 1 goes into the low-order bit with numerical value 1. The high- order bit represents u^31. Polynomials are usually represented by small arrays of five words (Alpha: three), with array[0] holding u^31...u^0, and array[4] holding u^154...u^128. The unused high-order 5 (Alpha: 37) bits of the last array word are 0, unless the code says otherwise. Sometimes, a polynomial is represented by a group of named variables, such as A0...A4. In this case, A0 holds the low degree terms u^31...u^0, and A4 holds the high degree terms u^154...u^128. Some computations require double-sized arrays to hold polynomials of twice the normal degree. In this case, the polynomial terms are placed consecutively: the bit representing u^155 is right next to the bit for u^154, and the "unused 5 bits" in array[4] are occupied by u^159...u^155. Of course there will be 10 unused high-order bits in array[9], and they will be 0. On the Alpha, array[2] will hold bits for u^191...u^128, while array[4] has 10 unused high order bits. A double-length array will usually have a wasted word array[5]. A double-length array is usually named with a "d" at the end, or with an upper-case letter at the end. Some routines need double length arrays as arguments; the comments tell all. The program has a fair number of compiled in test values (field elements), and the FIVE macro is used to create the array elements. The polynomial powers are given with the lowest degree part first: In FIVE(A,B,C,D,E), A represents u^31...u^0, and E represents u^155...u^128. Program Organization The routines are generally in the order they were written, with low- level primitives at the front of the file, and the top-level "main" routine at the end. Many routines are directly followed by a test routine. The test routine will call the target routine several times with either canned arguments or arguments supplied by the user from the command line. The test routine is called by "main". Sometimes the results are checked and/or printed. The last section of the program, before the top-level main routine, is a klunky implementation of (most of) the Schoof algorithm, which determines the number of points on a particular curve. The Schoof algorithm is explained in Menezes' book. My implementation isn't an exact match to his explanation. With a separate Lisp program to fill in some of the missing details, I've determined sizes for a couple of groups. The top-level of the program is a collection of timing tests, which time and/or test most of the routines. The interface to the program is a simple command line with arguments. The command line says which test to run, how many times, and with what arguments. The program runs the test, may print an answer, and exits. The user is responsible for wrapping the command in a Unix "time" command to determine the actual execution time for a routine. The most important routines, from the viewpoint of execution time, are those for multiplying two polynomials, and the inversion routine. I wrote many versions of these routines, and selected the fastest based on timing tests. (Which is fastest depends on the processor - if you are working on a new processor, you probably want to rerun the timing tests, and perhaps change the macros that select which routines are used in "production" code.) Most of the program works with field elements from GF[2^155], defined by the irreducible polynomial u^155 + u^62 + 1. The elliptic curve used is defined by the equation Y^2 + XY = X^3 + A6, where A6 is a field element. Generally the curve defaults to a compiled-in value if no other is specified. The routines Epowknown, testdhtm, elgamalsetup, and elgamalEnc also implicitly assume a particular curve, because the table of point values at ectabxy is built in. The program is committed to the field GF[2^155], and to the polynomial u^155+u^62+1 as the magic trinomial T(u). It is only slightly committed to the choice of elliptic curve and generating point. The particular elliptic curve is usually specified by giving a point on the curve, which determines the curve parameter A6. A6 must not be 0. Most of the paths from startup to the curve code include a setup for A6 (the setup routine is "a6calc"). You may use other means of setting up A6, so long as you also set up A60. For efficiency, the present code assumes that A6 is only 20 bits long, which restricts the starting points. If you limit the starting coordinates X,Y so that X<128 and Y<1024 (decimal), then A6 will be at most 20 bits. There are lots of larger points which will also evaluate to a small A6, but it's trickier to find them. If you need to use an A6 larger than 20 bits, modify the routine selected by MULTKRED (either mult15c20xred or mult13c20xred), or rewrite the point doubling routine. The point addition routine doesn't use A6, except when it calls the doubling code. Elliptic Curves for the Disinclined (Read this even if you don't want to know anything about elliptic curves.) What does this program do? - Very basic. The program mostly operates on 155 bit quantities. These are called field elements or polynomials. They are represented in 5 computer words on 32bit machines, and 3 words on 64bit machines. The field elements are not ordinary numbers, but they can be used in some of the same ways. There are operations that add and multiply field elements, and take reciprocals. Multiplication of two field elements gives a 309bit result, which is given to a "reduce" routine to turn it back into a 155bit quantity. Negation of a field element is a nop, and subtraction is the same as addition. The second data type is called an "elliptic curve point" and is a pair of field elements, just like an ordinary point in the plane is made up of a pair of real numbers. Not just any old pair of field elements will do - in order to qualify as a "point on the curve", the two elements have to satisfy a special equation (using the special addition and multiplication of field elements). Different equations give different curves (i.e., different sets of points). Each curve has roughly 2^155 points, but getting the exact count is a challenge. Elliptic curve points have their own kind of addition defined. This addition has nothing to do with ordinary arithmetic addition, or with addition of field elements, or addition of vectors. The only excuse for calling the operation "addition" is that it satisfies some of the same algebraic laws as ordinary addition. Once this special addition is defined, we can talk about a "multiple of a point P". If P is a point, then 2P is just P+P; and 3P means P+P+P (which is the same as P+2P); and 4P means P+P+P+P. Larger multiples can be calculated in several different ways: 4P can be done as P+3P, or 2P+2P. The program usually uses the straightforward double-and-increment scheme. It's easy to calculate -P from P, so sometimes points are subtracted. (This would be a good way to compute 127P.) The key exchange starts with a particular elliptic curve and point P on that curve. (There are default choices compiled in; you can change them by giving arguments to the program when you run it.) We assume all parties to the key exchange know the particular curve in use, and the starting point P. The program should choose a random 155bit number K (an ordinary integer!); it then calculates KP. This point is another pair of (155bit) field elements. This pair is supposed to be sent over the network to another host, which picks its own 155bit random number L, computes LP, and sends it back to us. We can then compute the Kth multiple of the point LP, which we denote by KLP. The remote host computes LKP. These two points are equal, which gives us 310 bits of "shared secret" to work with. In the actual program, the numbers K and L are compiled in, and the program pretends it is handling both ends of the transaction. To actually turn this program into a key-exchange scheme, you must add a source of random numbers, and network communications software. Elliptic Curves for the Less Disinclined (Read this if you want to muck around with the code.) What does this program do? - More details. The basic "numbers" that the program works with are 155bit quantities. These quantities represent polynomials of degree up to 154, with coefficients either 0 or 1 (only). The basic operations on the polynomials are addition, and multiplication, with all coefficients reduced modulo 2. This means that, for example, X+X = 2X = 0, and that (X+1)^2 = X^2 + 2X + 1 = X^2+1. In this system, addition of the polynomials is done by Xoring the bit representations. Multiplication of two polynomials is done by the usual shift-and-conditionally-add scheme for multiplying binary numbers, except that the addition step is changed to an xor step. Because we interpret the coefficients mod 2, addition is the same as subtraction; every polynomial is its own negative. Complication: Since the multiplication step can create polynomials of degree up to 308, we always reduce the result of a multiplication modulo the "magic trinomial" T(u) = u^155 + u^62 + 1. This forces the degree back down to 154 or less. For example, u^200 is reduced to u^107 + u^45. (The rule is that u^(155+N) is replaced with u^(62+N) + u^N, which can be done with shift and xor manipulations. Two iterations are needed when N>92.) It happens, because the magic trinomial T is irreducible mod 2, that every non-zero polynomial P of degree <=154 has a unique reciprocal P' polynomial, which satisfies the equation P * P' = 1 (mod T). For example, the reciprocal of u^3 is u^152 + u^59. If these two polynomials are multiplied the result is u^155 + u^62. After the modular reduction, the result is 1. In consequence, the set of polynomials of degree <= 154 forms a mathematical FIELD, in which the operations of addition, subtraction, multiplication, and division follow most of the usual rules of algebra. This field has exactly 2^155 elements - there are exactly 2^155 polynomials; and they can be represented as quantities of 155 bits. The first part of the program is concerned with implementing the operations for the field efficiently, especially multiplication and reciprocals. We've done very well in that regard. The next concept required is trickier. Consider the equation in two varibles X and Y: Y^2 + XY = X^3 + A. If A were a particular real number, this equation would be a curve in the X-Y plane. Particular combinations of X,Y that satisfied the equation would be points on the curve. We make a mathematical leap, and extend the notions of curve and point to the field we described above. We say that X and Y will be elements of the field (polynomials of degree up to 154, with coefficients only 0 or 1) and that A is a particular field element, perhaps u^19 + u^7 + 1. The set of X,Y pairs which satisfy the equation are called "points" on the "curve". This is quite a stretch - we've replaced ordinary numbers with these weird polynomials, we are using mod 2 addition and multiplication for the coefficients, and the equation is counted as true as long as long as it's true when the two sides are both reduced modulo the magic trinomial T. The real number curve has an infinite number of points, while in the field case we could in theory write down a complete list of the solutions. Moreover, the real number case is continuous: you can move along the curve, usually changing X and Y by a little bit. The notion of points being "nearby" is pretty much discarded in the finite field case. If X is changed by 1 bit, Y will be completely different. So what's the point of all this? There's a piece of magic, discovered centuries ago, whose implications were fully realized only in the last decade. Mathematicians were interested in finding solutions to equations like ours where X and Y were both rational numbers. Long ago, mathematicians discovered that you could use two solutions of the elliptic curve equation to generate a third solution. It turns out if you draw a line through two points on the curve, it usually intersects the curve in exactly one more point. When analytic geometry was invented, it turned out that that third solution could be specified by an algebraic formula. If you take the third point and reflect it through the curve's axis, you get a fourth point. This fourth point usually isn't on the same line as the first three points, so you can do the line trick again to get more solutions. In favorable cases, this scheme lets you use two starting solutions to create an infinite set of other solutions. If you've only got one solution, another trick comes to the rescue: you can draw a tangent to the curve (as if you had two equal solutions!) and there will still be a "third" point of intersection of the tangent with the curve. This third point can be reflected through the curve axis as before, and away you go. Using analytic geometry, the operation of taking two points, finding the third intersection, and reflecting it through the axis to get a fourth point on the curve, can be specified entirely in terms of algebraic formulas. This has two interesting consequences: The first is that we can carry out the steps by evaluating the formulas, even when the geometry makes no sense. In particular, we can do it in our field GF[2^155]. The second consequence is even more interesting: The manipulation that takes two points and produces a fourth is called "addition of elliptic curve points" because it has many of the properties of ordinary addition. [This is in spite of the fact that the operation has very little to do with ordinary vector addition of points, or addition of the coordinates. Mathematicians are pretty free when it comes to reusing names.] It's pretty obvious that swapping the two starting points won't change the answer, so the "addition" is commutative. But there's extra magic here: the addition turns out to also be associative - you can add up three points in any order, and always get the same answer. There's no easy way to explain why this works, but it's possible to carry out the algebra and check it. (It was probably discovered when people trying to generate points on a curve kept getting the same answers repeatedly, and had to be careful with the starting points in order to get new points.) There's one more operation we need to mention: reflecting a point through the curve axis gives the "negative" of the point. There's also a "zero" point, although it's kind of strange - both coordinates X and Y are taken to be "infinity". This happens when the line joining the two input points doesn't intersect the curve; then the answer is taken to be the zero point. When you are adding a point to itself ("doubling" a point), you use the formula for the tangent to the curve -- even though the notion of differentiation is meaningless in a finite field. (This complicates the proofs of associativity no end, because of all the special cases for zero and doubling.) So we have an addition operation defined for curve points, and it works for curves with field elements from GF[2^155]. What good is it? We've finally reached the payoff. The Diffie-Hellman Key Exchange method was originally defined to use arithmetic modulo a prime P, but it can be generalized to work in any group - for example, the group of points of an elliptic curve defined over GF[2^155] with a bizarro addition rule. The payoff is that elliptic curve systems appear to be immune to the most effective cryptanalytic attack that works on regular Diffie-Hellman. [The underlying reason for this was mentioned above: the concept "nearby" doesn't work for elliptic curves over finite fields.] In regular Diffie-Hellman, you pick a prime modulus P, and a generator G (both ordinary integers), and the parties to the key exchange calculate various powers of G (mod P). The calculations use multiplication (mod P) as the basic operation. To calculate G^X, you examine the binary representation for X, and do a series of squarings and some multiplications (all operations mod P). In the Elliptic Curve variation, you choose a curve by picking the element A, and find some curve point (X,Y) as the generator. The parties calculate some numeric multiple of the (X,Y), such as 65537 (X,Y), as the work of the key exchange. The multiples are calculated by a series of point doublings an additions. [65537 (X,Y) would be done with 16 doublings and one addition.] In either scheme, the bits of the final answer are used to create a secret key that is used for further communication. Why is it worth doing Diffie-Hellman in this oddball group? Because the main attack that works on ordinary Diffie-Hellman doesn't work against our oddball variation, the numbers involved can be substantially shorter. We pay a price in code complexity, but at the end of the day, our key exchange runs faster because the number of operations is smaller. In the 155bit case, our security is roughly the same as ordinary Diffie- Hellman with a 1024bit modulus, and our speed is three times as fast. Compared to ordinary Diffie-Hellman with a 512bit modulus, we have similar speeds but substantially stronger security. The program's compiled-in curve is based on the point (0x7b, 0x1c8), which has A60 = 0x7338f. The group order for this curve is 45671926166590716193865565914344635196769237316 which is 12*Prime. The point (0x7b, 0x1c8) is a cube (the triple of a generator of full order). The best possible group order with this class of curve equations is 4*Prime. (The size of the group is approx- imately fixed; the figure-of-merit is log2(largest-prime-factor)/2. 4*Prime would have a slightly better figure of merit. Our curve has f.o.m. 75.7, versus the best possible 76.5. Two Improvements Which Could be Made The routine for inverting a field element, modinvh2, has two improvements possible. The first is that only some of the length combinations of the variables are possible or likely, and the code could be shortened by removing impossible combinations and coalescing unlikely ones. This is of interest mainly on the Alpha, because the primary instruction cache is tight: If the inversion routine were shorter, we would have room to use Oliver's best multiplication routine -- presently, it won't fit in the instruction cache with the hogsized inversion routine. This might lead to a 10% time improvement. The second routine is important for both processors: The code does single bit shifts of the internal variables representing field elements. It would be better to test for a few low-order 0s in the A variable, and shift by several bits at once. This should lead to a 10-20% improvement. If the first improvement is made, shortening the code, then the second improvement shouldn't lengthen the code excessively. Modifying the Program to Prevent the Kocher Timing Attack If exponents are not reused in Diffie Hellman style key exchanges, the Kocher timing attack doesn't work. If an exponent is reused in many key exchanges, the timing attack must be thwarted. One way to prevent timing attacks is called blinding: The basic computation is modified in some way that produces the correct answer, but with the details of the modification invisible to the attacker. For example, the operation used in Elliptic Curve Key Exchange is to compute a multiple K of a curve point P, giving KP. A simple blind is to pre-select a random curve point Q, and pre-compute KQ. When the curve protocol wants to compute KP, instead compute K(P+Q)-KQ. The extra time in the protocol is minimal: an extra curve addition to compute P+Q, and an extra curve subtraction of KQ. If Q is changed only occasionally, the amortized extra cost of computing KQ is small. For example, it might be appropriate to select Q at the same time as K. To compute a random curve point, choose a random X coordinate and solve for Y. This will work 50% of the time -- half the Xs have no Ys. (The other Xs have two Ys, giving a curve point and its negative.) Check that X,Y satisfies the curve equation, and try another X if not. The routine "ysolve" solves for Y from X. Be sure that your random source is really not visible to an opponent -- random numbers based on the time of day usually aren't a safe choice.