# Parallelization of Integer Squaring Algorithms with Delayed Carry 

Korchenko Oleksandr, Kovtun Vladislav, Okhrimenko Andrew*<br>Academic Department of IT-Security, National Aviation University, Kiev, Ukraine<br>*Corresponding author: andrew.okhrimenko@gmail.com

Received May 30, 2014; Revised June 02, 2014; Accepted June 03, 2014


#### Abstract

Increasing amounts of information that needs to be protected put in claims specific requirements for information security systems. The main goal of this paper is to find ways to increase performance of cryptographic transformation with public key by increasing performance of integers squaring. Authors use delayed carry mechanism and approaches of effective parallelization for Comba multiplication algorithm, which was previously proposed by authors. They use the idea of carries accumulation by addition products of multiplying the relevant machine words in columns. As a result, it became possible to perform addition of such products in the column independently of each other. However, independent accumulation of products and carries require correction of the intermediate results to account for the accumulated carries. Due to the independence of accumulation in the columns, it became possible to parallelize the process of products accumulation that allowed formulating several approaches. In this paper received theoretical estimates of the computational complexity for proposed squaring algorithms. Software implementations of algorithms in C++ allowed receiving practical results of the performance, which are not contrary to theoretical estimates. The authors first proposed applying the method of delayed carry and parallelization techniques for squaring algorithms, which was previously proposed for integers multiplication.


Keywords: squaring, multiplication, integers, delayed carry, parallelization
Cite This Article: Korchenko Oleksandr, Kovtun Vladislav, and Okhrimenko Andrew, "Parallelization of Integer Squaring Algorithms with Delayed Carry." Journal of Computer Networks, vol. 2, no. 2 (2014): 10-17. doi: 10.12691/jcn-2-2-2.

## 1. Introduction

Cryptographic transformation with public key (CTPK) are the basis for most modern cryptosystems. Increasing amounts of information that needs to be protected, makes specific demands for CTPK. Multiplicative operations [2,3], such as multiplication and squaring of integers, are the most frequently used in CTPK. One of the performance increasing approaches in CTPK is increasing the productivity of basic operations, such as multiplication, squaring, modular reduction and multiplicative inversion. Performance increasing approaches in CTPK by increasing the productivity of integer multiplication were reviewed in $[4,5,6]$. The main goal of this paper is to find ways of increasing performance of CTPK, by increasing productivity of squaring integers, using the delayed carry mechanism [5,6] and efficient parallelization approaches [4].

Squaring is a special case of multiplication where both multipliers are equal $[1,3]$. Show features of multiplication and squaring by considering "schoolbook" multiplication of two integers 123 and 456, Figure 1.

Figure 1 shows that to calculate the product of two integers 123 and 456, it should complete 9 unique multiplication operations. Squaring using "schoolbook" multiplication allows some optimizations. Multiply
integer 123 by itself, using "schoolbook" multiplication Figure 2.

|  |  | 1 | 2 | 3 |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| $\times$ |  | 4 | 5 | 6 |  |
|  |  | $6 \cdot 1$ | $5 \cdot 2$ | $5 \cdot 2$ | $6 \cdot 3$ |
|  | $5 \cdot 1$ | $5 \cdot 3$ |  | Row 0 |  |
| 4.1 | $4 \cdot 2$ | $4 \cdot 3$ |  |  | Row 1 |
| Column 4 | Column 3 | Column 2 | Column 1 | Column 0 |  |

Figure 1. "Schoolbook" multiplication of two integers


Figure 2. "Schoolbook" multiplication integer by itself
Figure 2 shows how to multiply decimals in a different order, such as products in rows 0 and 2 in column 2 ( $3 * 1$ $=1 * 3)$, products in rows 0 and 1 in column $1(3 * 2=2 *$ 3) and products in rows 1 and 2 in column 3: ( $1 * 2=2$ * 1). Therefore, for squaring for n-digit number, there are only $\left(n^{2}+n\right) / 2$ unique multiplications required $\left(n^{2}\right.$ operations required for multiplication in common case).

Let $x$ be integer being squared, a $x_{k}-k$-th term of $x$. It is easy to notice features:

1. In row $k$ the product in column $2 k$ has a $x_{k}^{2}$ term in it. In Figure 2 it 3 * 3, 2 * 2, $\mathbf{1}^{*} \mathbf{1}$.
2. Every non-square term of a column will appear twice (product in column $j$ in row $k$, where $j \neq 2 k$ has a pair). In Figure 2 it $\mathbf{3 * 1}=\mathbf{1 * 3}, \mathbf{3 * 2}=2 * 3$ and $\mathbf{1}$ * $2=2$ * 1 . Every odd column is made-up entirely of product pairs.
3. For row $k$, such as $k \neq 0$ and $k \neq n-1$, the first unique product that is not a square, is located in the column $2 k+1$. In Figure 2 it $\mathbf{2 * 1}$.

## 2. Multiplication Algorithm Modified Comba

In [5] proposed generalized modified algorithm Comba for integer multiplication - Modified Comba (MC), which uses the idea of delayed carry. The basis of the algorithm is loops (p. 2 and p.3), and inner loops (p.2.1 and p.3.1). At the lowest level of the hierarchy, in loops p.2.1, p.3.1 there are multiplication and accumulation of delayed carry. Accumulated carry is taken into account in the final iterations of the loops p. 2 and p.3. Using $2 w$-bit variables for storing $w$-bit variables eliminates the carry accounting of $w$-bit variable after each arithmetic operation. Carry accumulated in the higher part of the 2 w -bit variable and is taken into account when needed, Figure 3. The generalized algorithm MC [5] for the w-bit systems is given below.

Multiplication algorithm 1. Modified Comba
INPUT: $a, b \in \mathbf{G F}(p), n=\log _{2^{w}} a, n k=2 n-1$
OUTPUT: $c=a \cdot b$

1. $r_{0}^{(2 w)} \leftarrow 0, r_{1}^{(2 w)} \leftarrow 0, r_{2}^{(2 w)} \leftarrow 0$.
2. For $k \leftarrow 0, k<n, k++$ do
2.1. For $i \leftarrow 0, j \leftarrow k, i \leq k, i++, j--$ do
2.1.1. $(u v)^{(2 w)} \leftarrow a_{i}^{(w)} \cdot b_{j}^{(w)}$.
2.1.2. $r_{0}^{(2 w)} \leftarrow r_{0}^{(2 w)}+v^{(w)}, r_{1}^{(2 w)} \leftarrow r_{1}^{(2 w)}+u^{(w)}$
2.2. $r_{1}^{(2 w)} \leftarrow r_{1}^{(2 w)}+\operatorname{hi}(w)\left(r_{0}^{(2 w)}\right)$,
$r_{2}^{(2 w)} \leftarrow r_{2}^{(2 w)}+\operatorname{hi}_{(w)}\left(r_{1}^{(2 w)}\right)$
2.3. $c_{k}^{(w)} \leftarrow \operatorname{low}_{(w)}\left(r_{0}^{(2 w)}\right), r_{0}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r_{1}^{(2 w)}\right)$,
$r_{1}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r_{2}^{(2 w)}\right), r_{2}^{(2 w)} \leftarrow 0$.
3. For $k \leftarrow n, l \leftarrow 1, k<n k, k++, l++$ do
3.1. For $i \leftarrow l, j \leftarrow k-l, i<n, i++, j--$ do
3.1.1. $(u v)^{(2 w)} \leftarrow a_{i}^{(w)} \cdot b_{j}^{(w)}$.
3.1.2. $r_{0}^{(2 w)} \leftarrow r_{0}^{(2 w)}+v^{(w)}, r_{1}^{(2 w)} \leftarrow r_{1}^{(2 w)}+u^{(w)}$
3.2. $r_{1}^{(2 w)} \leftarrow r_{1}^{(2 w)}+\operatorname{hi}(w)\left(r_{0}^{(2 w)}\right)$,
$r_{2}^{(2 w)} \leftarrow r_{2}^{(2 w)}+\operatorname{hi}_{(2)}\left(r_{1}^{(2 w)}\right)$
3.3. $c_{k}^{(w)} \leftarrow \operatorname{low}_{(w)}\left(r_{0}^{(2 w)}\right), r_{0}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r_{1}^{(2 w)}\right)$,
$r_{1}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r_{2}^{(2 w)}\right), r_{2}^{(2 w)} \leftarrow 0$.
4. $c_{n k}^{(w)} \leftarrow \operatorname{low}_{(2)}\left(r_{0}^{(2 w)}\right)$.
5. Return (c).

The computational complexity of the MC algorithm:

$$
I_{S q r}^{M C}=n^{2}\left(I_{m u l}^{w}+\left(\frac{3 n+1}{n}\right) I_{a d d}^{2 w+w}\right)
$$

where n - number of w -bit machine words required to store the multiplier of given size, $I_{m u l}^{w}$ - a multiplication operation for w-bit words, $I_{a d d}^{2 w+w}$ - an addition operation for 2w-bit and w-bit words. Assignment operations do not take into account in computational complexity of the algorithms.

Using the idea of delayed carry it can independently produce addition of multiplication results corresponding by columns, that enables to perform the accumulation of sum of high and least significant bit in separate parallel threads. However, it is necessary to make an adjustment (account carry) $r_{1}=r_{1}+\operatorname{Hi}\left(r_{0}\right), r_{2}=r_{2}+\operatorname{Hi}\left(r_{1}\right)$ and set result $c_{i}=\operatorname{Low}\left(r_{0}\right)$ after sum accumulation in each thread. Figure 3 and Figure 4 is a graphical interpretation of the MC algorithm, for $\mathrm{n}=3$, where well-defined results addition for corresponding products in columns.


Figure 3. Graphical interpretation of loop 2 in MC algorithm


Figure 4. Graphical interpretation of loop 3 in MC algorithm

## 3. Squaring Algorithms

3.1. Squaring algorithm Modified Comba SQR

Using delayed carry mechanism [5,6] and approaches to parallelization [4,5], were offer three squaring algorithms that take account the above features. Consider squaring features, the MC algorithm was modified in inner loops of delayed carry accumulation (p.2.1 and p.3.1), and added an additional check to avoid duplication in the sum accumulation (p.2.1.2 and p.3.1.2 ).

Modified Comba SQR (MCSQR) is squaring algorithm for $w$-bit machine words, based on multiplication algorithm MC [4,5,6].

Squaring algorithm 1. Modified Comba SQR.
INPUT: $a \in \mathbf{G F}(p), n=\log _{2^{w}} a, n k=2 n-1$
OUTPUT: $c=a^{2}$

1. $r_{0}^{(2 w)} \leftarrow 0, r_{1}^{(2 w)} \leftarrow 0, r_{2}^{(2 w)} \leftarrow 0$.
2. For $k \leftarrow 0, k<n, k++$ do
2.1. For $i \leftarrow 0, j \leftarrow k, i \leq j, i++, j--$ do
2.1.1. $\left(u^{2}\right)^{(2 w)} \leftarrow a_{i}^{(w)} \cdot a_{j}^{(w)}$.
2.1.2. if $(i<j)$ then $r_{0}^{(2 w)} \leftarrow r_{0}^{(2 w)}+\left(u^{(w)} \ll 1\right)$,
$r_{1}^{(2 w)} \leftarrow r_{1}^{(2 w)}+\left(\left(u^{(w)} \ll 1\right)\right.$ or $\left.\left(u^{(w)} \gg(w-1)\right)\right) ;$
else $r_{0}^{(2 w)} \leftarrow r_{0}^{(2 w)}+u^{(w)}, r_{1}^{(2 w)} \leftarrow r_{1}^{(2 w)}+u^{(w)}$
2.2. $r_{1}^{(2 w)} \leftarrow r_{1}^{(2 w)}+\operatorname{hi}(w)\left(r_{0}^{(2 w)}\right)$,
$r_{2}^{(2 w)} \leftarrow r_{2}^{(2 w)}+\operatorname{hi}_{(w)}\left(r_{1}^{(2 w)}\right)$
2.3. $c_{k}^{(w)} \leftarrow \operatorname{low}_{(w)}\left(r_{0}^{(2 w)}\right), r_{0}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r_{1}^{(2 w)}\right)$,
$r_{1}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r_{2}^{(2 w)}\right), r_{2}^{(2 w)} \leftarrow 0$.
3. For $k \leftarrow n, l \leftarrow 1, k<n k, k++, l++$ do
3.1. For $i \leftarrow l, j \leftarrow k-l, i \leq j, i++, j--$ do
3.1.1. $\left(u^{2}\right)^{(2 w)} \leftarrow a_{i}^{(w)} \cdot a_{j}^{(w)}$.
3.1.2. if $(i<j)$ then $r_{0}^{(2 w)} \leftarrow r_{0}^{(2 w)}+\left(u^{(w)} \ll 1\right)$,
$r_{1}^{(2 w)} \leftarrow r_{1}^{(2 w)}+\left(\left(u^{(w)} \ll 1\right)\right.$ or $\left.\left(u^{(w)} \gg(w-1)\right)\right)$;
else $r_{0}^{(2 w)} \leftarrow r_{0}^{(2 w)}+u^{(w)}, r_{1}^{(2 w)} \leftarrow r_{1}^{(2 w)}+u^{(w)}$
3.2. $r_{1}^{(2 w)} \leftarrow r_{1}^{(2 w)}+\operatorname{hi}_{(w)}\left(r_{0}^{(2 w)}\right)$,
$r_{2}^{(2 w)} \leftarrow r_{2}^{(2 w)}+\mathrm{hi}_{(2)}\left(r_{1}^{(2 w)}\right)$
3.3. $c_{k}^{(w)} \leftarrow \operatorname{low}_{(w)}\left(r_{0}^{(2 w)}\right), r_{0}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r_{1}^{(2 w)}\right)$, $r_{1}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r_{2}^{(2 w)}\right), r_{2}^{(2 w)} \leftarrow 0$.
4. $c_{n k}^{(w)} \leftarrow \operatorname{low}_{(2)}\left(r_{0}^{(2 w)}\right)$.

## 5. Return (c).

The computational complexity of the MCSQR algorithm:

$$
I_{s q r}^{M C S Q R}=n^{2}\left(I_{m u l}^{w}+\left(\frac{3 n+1}{n}\right) I_{a d d}^{2 w+w}+3\left(\frac{n-1}{2 n}\right) I_{\text {shift }}^{w}\right)
$$

where n - number of w-bit machine words required to store the multiplier of given size, $I_{\text {mul }}^{w}$ - a multiplication operation for w-bit words, $I_{\text {add }}^{2 w+w}$ - an addition operation for 2 w -bit and w-bit words, $I_{\text {shift }}^{w}$ - a word shift operation. Assignment operations do not take into account in computational complexity of the algorithms.

The evaluation results of computational complexity of MCSQR for different bit length multipliers are shown in Table 1 and Table 2 (MUL, ADD and SHIFT - amount of required multiplication, addition and shift operations).

Table 1. The number of operations for MCSQR (w=32 bit)

| BIT SIZE | MCSQR, $w=32$ bit |  |  |
| :---: | :---: | :---: | :---: |
|  | MUL | ADD | SHIFT |
| 128 | 16 | 52 | 18 |
| 256 | 64 | 200 | 84 |
| 512 | 256 | 784 | 360 |
| 1024 | 1024 | 3104 | 1488 |
| 2048 | 4096 | 12352 | 6048 |
| 3072 | 9216 | 27744 | 13680 |
| 4096 | 16384 | 49280 | 24384 |
| 6144 | 36864 | 110784 | 55008 |
| 8192 | 65536 | 196864 | 97920 |
| 12288 | 147456 | 442752 | 220608 |
| 16384 | 262144 | 786944 | 392448 |

Table 2. The number of operations for MCSQR (w=64 bit)

| BIT SIZE | MCSQR, $w=64$ bit |  |  |
| :---: | :---: | :---: | :---: |
|  | MUL | ADD | SHIFT |
| 128 | 4 | 14 | 3 |
| 256 | 16 | 52 | 18 |
| 512 | 64 | 200 | 84 |
| 1024 | 256 | 784 | 360 |
| 2048 | 1024 | 3104 | 1488 |
| 3072 | 2304 | 6960 | 3384 |
| 4096 | 4096 | 12352 | 6048 |
| 6144 | 9216 | 27744 | 13680 |
| 8192 | 16384 | 49280 | 24384 |
| 12288 | 36864 | 110784 | 55008 |
| 16384 | 65536 | 196864 | 97920 |

Table 2 shows that using 64-bit machine words in MCSQR algorithm can significantly reduce the number of necessary operations (including reducing the number of multiplications by 4 times).

### 3.2. Squaring Algorithms with Parallelization Techniques

The delayed carry mechanism allows formulating several approaches to the MCSQR parallelization:

- Parallel execution (in two parallel threads) of loops in the step 2 and 3 with further final result correction.
- Parallel execution (number of parallel threads) of iterations in loops in step 2 and 3 with further intermediate results (from parallel threads) merging.


### 3.2.1. Algorithm Modified Comba SQR 2x

Algorithm with two parallel processing threads and paralleling features of the MCSQR algorithm was
proposed. Modified Comba SQR 2x (MCSQR2x) is squaring algorithm for w-bit platforms, based on multiplication algorithm Modified Comba 2x (MC2x) [4,5] with two parallel processing threads.

Squaring algorithm 2. Modified Comba SQR 2x with two parallel processing threads:
INPUT: $a \in \mathbf{G F}(p), n=\log _{2^{w}} a, n k=2 n-1$
OUTPUT: $c=a^{2}$

1. \#pragma omp parallel sections begin
1.1. \#pragma omp section begin
1.1.1. $r l_{0}^{(2 w)} \leftarrow 0, r l_{1}^{(2 w)} \leftarrow 0, r l_{2}^{(2 w)} \leftarrow 0$.
1.1.2. For $k \leftarrow 0, k<n, k++$ do
1.1.2.1. For $i \leftarrow 0, j \leftarrow k, i \leq j, i++, j--$ do
1.1.2.1.1. $\left(u^{2}\right)^{(2 w)} \leftarrow a_{i}^{(w)} \cdot a_{j}^{(w)}$.
1.1.2.1.2. if $(i<j)$ then $r l_{0}^{(2 w)} \leftarrow r l_{0}^{(2 w)}+\left(u^{(w)} \ll 1\right)$,
$r l_{1}^{(2 w)} \leftarrow r l_{1}^{(2 w)}+\left(\left(u^{(w)} \ll 1\right)\right.$ or $\left.\left(u^{(w)} \gg(w-1)\right)\right)$;
else $r l_{0}^{(2 w)} \leftarrow r l_{0}^{(2 w)}+u^{(w)}, r l_{1}^{(2 w)} \leftarrow r l_{1}^{(2 w)}+u^{(w)}$
1.1.2.2. $r l_{1}^{(2 w)} \leftarrow r l_{1}^{(2 w)}+\operatorname{hi}_{(w)}\left(r l_{0}^{(2 w)}\right)$,
$r l_{2}^{(2 w)} \leftarrow r l_{2}^{(2 w)}+\operatorname{hi}_{(w)}\left(r l_{1}^{(2 w)}\right)$
1.1.2.3. $c_{k}^{(w)} \leftarrow \operatorname{low}_{(w)}\left(r l_{0}^{(2 w)}\right)$,
$r l_{0}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r l_{1}^{(2 w)}\right)$,
$r l_{1}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r l_{2}^{(2 w)}\right), r l_{2}^{(2 w)} \leftarrow 0$.
1.1.3. $r_{0}^{(2 w)} \leftarrow r l_{0}^{(2 w)}$.
\#pragma omp section end
1.2. \#pragma omp section begin
1.2.1. $r l_{0}^{(2 w)} \leftarrow 0, r l_{1}^{(2 w)} \leftarrow 0, r l_{2}^{(2 w)} \leftarrow 0$.
1.2.2. For $k \leftarrow n, l \leftarrow 1, k<n k, k++, l++$ do
1.2.2.1. For $i \leftarrow l, j \leftarrow n-1, i \leq j, i++, j--$ do
1.2.2.1.1. $\left(u^{2}\right)^{(2 w)} \leftarrow a_{i}^{(w)} \cdot a_{j}^{(w)}$.
1.2.2.1.2. if $(i<j)$ then $r l_{0}^{(2 w)} \leftarrow r l_{0}^{(2 w)}+\left(u^{(w)} \ll 1\right)$,
$r l_{1}^{(2 w)} \leftarrow r l_{1}^{(2 w)}+\left(\left(u^{(w)} \ll 1\right)\right.$ or $\left.\left(u^{(w)} \gg(w-1)\right)\right) ;$ else $r l_{0}^{(2 w)} \leftarrow r l_{0}^{(2 w)}+u^{(w)}, r l_{1}^{(2 w)} \leftarrow r l_{1}^{(2 w)}+u^{(w)}$ 1.2.2.2. $r l_{1}^{(2 w)} \leftarrow r l_{1}^{(2 w)}+\operatorname{hi}_{(w)}\left(r l_{0}^{(2 w)}\right)$,
$r l_{2}^{(2 w)} \leftarrow r l_{2}^{(2 w)}+\operatorname{hi}_{(w)}\left(r l_{1}^{(2 w)}\right)$
1.2.2.3. $c_{k}^{(w)} \leftarrow \operatorname{low}_{(w)}\left(r l_{0}^{(2 w)}\right)$,
$r l_{0}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r l_{1}^{(2 w)}\right)$,
$r l_{1}^{(2 w)} \leftarrow \operatorname{low}_{(w)}\left(r l_{2}^{(2 w)}\right), r l_{2}^{(2 w)} \leftarrow 0$.
\#pragma omp section end \#pragma omp parallel sections end
2. For $k \leftarrow n, k<n k, k++$ do
2.1. $r_{0}^{(2 w)} \leftarrow r_{0}^{(2 w)}+c_{k}^{(w)}$.
2.2. $c_{k}^{(w)} \leftarrow \operatorname{low}_{(w)}\left(r_{0}^{(2 w)}\right)$.
2.3. $\operatorname{low}_{(w)}\left(r_{0}^{(2 w)}\right) \leftarrow \operatorname{hi}_{(w)}\left(r_{0}^{(2 w)}\right)$.
3. $c_{n k}^{(w)} \leftarrow c_{n k}^{(w)}+\operatorname{low}_{(w)}\left(r_{0}^{(2 w)}\right)$.
4. Return (c).

The computational complexity of the MCSQR2x algorithm:

$$
\begin{aligned}
& I_{\text {sqr }}^{M C S Q R 2 x} \\
& =\max \left[n \left(\left[\frac{n}{2}\binom{I_{\text {mul }}^{w}+2\left(\frac{n+1}{2 n}\right) I_{\text {add }}^{2 w+w}}{+3\left(\frac{n-1}{2 n}\right) I_{\text {shift }}^{w}}+2 I_{\text {add }}^{2 w+w}\right)\right.\right. \text {, } \\
& \left.n\left(\left\lfloor\frac{n}{2}\right\rfloor\binom{ I_{m u l}^{w}+2\left(\frac{n+1}{2 n}\right) I_{\text {add }}^{2 w+w}}{+3\left(\frac{n-1}{2 n}\right) I_{\text {shift }}^{w}}+2 I_{\text {add }}^{2 w+w}\right)\right] \\
& +\frac{n}{2} I_{\text {add }}^{2 w+w}+I_{\text {add }}^{w}
\end{aligned}
$$

where n - number of w -bit machine words required to store the multiplier of given size, $I_{m u l}^{w}$ - a multiplication operation for w-bit words, $I_{\text {add }}^{2 w+w}$ - an addition operation for 2 w -bit and w-bit words, $I_{\text {shift }}^{w}$ - a shift operation. Assignment operations do not take into account in computational complexity of the algorithms.

The evaluation results of computational complexity of MCSQR2x for different bit length multipliers are shown in Table 3 and Table 4 (MUL, ADD and SHIFT - amount of required multiplication, addition and shift operations):

Table 3. The number of operations for MCSQR2X ( $w=32$ bit)

| BIT SIZE | MCSQR2x, $w=32$ bit |  |  |
| :---: | :---: | :---: | :---: |
|  | MUL | ADD | SHIFT |
| 128 | 8 | 19 | 9 |
| 256 | 32 | 53 | 42 |
| 512 | 128 | 169 | 180 |
| 1024 | 512 | 593 | 744 |
| 2048 | 2048 | 2209 | 3024 |
| 3072 | 4608 | 4849 | 6840 |
| 4096 | 8192 | 8513 | 12192 |
| 6144 | 18432 | 18913 | 27504 |
| 8192 | 32768 | 33409 | 48960 |
| 12288 | 73728 | 74689 | 110304 |
| 16384 | 131072 | 132353 | 196224 |

Table 4. The number of operations for MCSQR2x (w=64 bit)

| BIT SIZE | MCSQR2x, $w=64$ bit |  |  |
| :---: | :---: | :---: | :---: |
|  | MUL | ADD | SHIFT |
| 128 | 2 | 8 | 2 |
| 256 | 8 | 19 | 9 |
| 512 | 32 | 53 | 42 |
| 1024 | 128 | 169 | 180 |
| 2048 | 512 | 593 | 744 |
| 3072 | 1152 | 1273 | 1692 |
| 4096 | 2048 | 2209 | 3024 |
| 6144 | 4608 | 4849 | 6840 |
| 8192 | 8192 | 8513 | 12192 |
| 12288 | 18432 | 18913 | 27504 |
| 16384 | 32768 | 33409 | 48960 |

### 3.2.2. Algorithm Modified Comba SQR Mx

Modified Comba SQR Mx (MCSQRMx) is the squaring algorithm for w-bit platforms, based on multiplication algorithm Modified Comba Mx (MCMx) with multiple parallel processing threads [4,5] :

Squaring algorithm 3. Modified Comba SQR Mx with multiple parallel processing threads
INPUT: $a \in \mathbf{G F}(p), n=\log _{2} w, n k=2 n-1$
OUTPUT: $c=a^{2}$

1. $l \leftarrow 1$.
2. Arrays $r 0_{i}^{(w)}$ and $r 1_{i}^{(w)}, i=\overline{0, n k}$.
3. \#pragma omp parallel begin
4. \#pragma omp for nowait begin
4.1. For $k \leftarrow 0, k<n, k++$ do
4.1.1. $r l_{0}^{(w)} \leftarrow 0, r l_{1}^{(w)} \leftarrow 0$.
4.1.2. For $i \leftarrow 0, j \leftarrow k, i \leq j, i++, j--$ do
4.1.2.1. $\left(u^{2}\right)^{(2 w)} \leftarrow a_{i}^{(w)} \cdot a_{j}^{(w)}$.
4.1.2.2. if $(i<j)$ then $r l_{0}^{(2 w)} \leftarrow r l_{0}^{(2 w)}+\left(u^{(w)} \ll 1\right)$,
$r l_{1}^{(2 w)} \leftarrow r l_{1}^{(2 w)}+\left(\left(u^{(w)} \ll 1\right)\right.$ or $\left.\left(u^{(w)} \gg(w-1)\right)\right)$;
else $r l_{0}^{(2 w)} \leftarrow r l_{0}^{(2 w)}+u^{(w)}, r l_{1}^{(2 w)} \leftarrow r l_{1}^{(2 w)}+u^{(w)}$
4.1.3. $r 0_{k}^{(2 w)} \leftarrow r l_{0}^{(2 w)}, r 1_{k}^{(2 w)} \leftarrow r l_{1}^{(2 w)}$
\#pragma omp for end
5. \#pragma omp for nowait begin
5.1. For $k \leftarrow n, k<n k, k++$ do
5.1.1. $r l_{0}^{(2 w)} \leftarrow 0, r l_{1}^{(2 w)} \leftarrow 0$.
5.1.2. For $i \leftarrow l, j \leftarrow n-1, i \leq j, i++, j--$ do
5.1.2.1. $\left(u^{2}\right)^{(2 w)} \leftarrow a_{i}^{(w)} \cdot a_{j}^{(w)}$.
5.1.2.2. if $(i<j)$ then $r l_{0}^{(2 w)} \leftarrow r l_{0}^{(2 w)}+\left(u^{(w)} \ll 1\right)$, $r l_{1}^{(2 w)} \leftarrow r l_{1}^{(2 w)}+\left(\left(u^{(w)} \ll 1\right)\right.$ or $\left.\left(u^{(w)} \gg(w-1)\right)\right)$; else $r l_{0}^{(2 w)} \leftarrow r l_{0}^{(2 w)}+u^{(w)}, r l_{1}^{(2 w)} \leftarrow r l_{1}^{(2 w)}+u^{(w)}$ 5.1.3. $r 0_{k}^{(2 w)} \leftarrow r l_{0}^{(2 w)}, r 1_{k}^{(2 w)} \leftarrow r l_{1}^{(2 w)}$
5.1.4. l++ .
\#pragma omp for end
\#pragma omp parallel end
6. $r^{(2 w)} \leftarrow 0$
7. For $k \leftarrow 0, k<n k, k++$ do
7.1. $r l_{0}^{(2 w)} \leftarrow r 0_{k}^{(2 w)}$.
7.2. $r l_{0}^{(2 w)} \leftarrow r l_{0}^{(2 w)}+r^{(2 w)}$.
7.3. $c_{k}^{(w)} \leftarrow \operatorname{low}_{(w)}\left(r l_{0}^{(2 w)}\right)$.
7.4. $r^{(2 w)} \leftarrow r 1_{k}^{(2 w)}+h \mathrm{hi}_{(w)}\left(r l_{0}^{(2 w)}\right)$.
8. $c_{n k}^{(w)} \leftarrow \operatorname{low}_{(w)}\left(r^{(2 w)}\right)$.
9. Return (c).

The computational complexity of the MCSQRMx algorithm:

$$
\begin{aligned}
& I_{m u l}^{M C S Q R M x}=\frac{n}{Z}\left[\left[\begin{array}{l}
\frac{n}{2}
\end{array}\binom{I_{m u l}^{w}+2\left(\frac{n+1}{2 n}\right) I_{a d d}^{2 w+w}}{+3\left(\frac{n-1}{2 n}\right) I_{\text {shift }}^{w}}\right]\right. \\
& +\frac{n}{Z}\left[\left\lfloor\frac{n}{2}\right\rfloor\binom{ I_{m u l}^{w}+2\left(\frac{n+1}{2 n}\right) I_{a d d}^{2 w+w}}{+3\left(\frac{n-1}{2 n}\right) I_{\text {shift }}^{w}}\right]+(2 n-1) 3 I_{\text {add }}^{2 w+w}
\end{aligned}
$$

where Z - parallel threads count, n - number of w-bit machine words required to store the multiplier of given size, $I_{m u l}^{w}$ - a multiplication operation for w-bit words, $I_{a d d}^{2 w+w}$ - an addition operation for 2 w -bit and w-bit words, $I_{\text {shift }}^{w}$ - a word shift operation. Assignment operations do not take into account in computational complexity of the algorithms.

The evaluation results of computational complexity of MCSQR2x for different bit length multipliers are shown in Table 5 and Table 6 for $\mathrm{Z}=4$ (MUL, ADD and SHIFT amount of required multiplication, addition and shift operations):

Table 5. The number of operations for MCSQRMX ( $\mathbf{w}=\mathbf{3 2} \mathbf{~ b i t )}$

| B BIT SIZE | MCSQRMx, $w=32$ bit |  |  |
| :---: | :---: | :---: | :---: |
|  | MUL | ADD | SHIFT |
| 128 | 4 | 26 | 5 |
| 256 | 16 | 63 | 21 |
| 512 | 64 | 161 | 90 |
| 1024 | 256 | 453 | 372 |
| 2048 | 1024 | 1421 | 1512 |
| 3072 | 2304 | 2901 | 3420 |
| 4096 | 4096 | 4893 | 6096 |
| 6144 | 9216 | 10413 | 13752 |
| 8192 | 16384 | 17981 | 24480 |
| 12288 | 36864 | 39261 | 55152 |
| 16384 | 65536 | 68733 | 98112 |

Table 6. The number of operations for MCSQRMx ( $\mathbf{w}=\mathbf{6 4} \mathbf{~ b i t )}$

| BIT SIZE | MCSQRMx, $w=64$ bit |  |  |
| :---: | :---: | :---: | :---: |
|  | MUL | ADD | SHIFT |
| 128 | 1 | 11 | 1 |
| 256 | 4 | 26 | 5 |
| 512 | 16 | 63 | 21 |
| 1024 | 64 | 161 | 90 |
| 2048 | 256 | 453 | 372 |
| 3072 | 576 | 873 | 846 |
| 4096 | 1024 | 1421 | 1512 |
| 6144 | 2304 | 2901 | 3420 |
| 8192 | 4096 | 4893 | 6096 |
| 12288 | 9216 | 10413 | 13752 |
| 16384 | 16384 | 17981 | 24480 |

Theoretical calculations show that the parallel squaring algorithms have a lower computational complexity, primarily due to the parallel execution of the elementary operations of addition and multiplication. Furthermore, the use of 64-bit machine words reduces the number of multiplications by 4 times.

## 4. Field Research

Squaring algorithms MCSQR, MCSQR2x and MCSQRMx as previously proposed algorithms for multiplication MC, MC2x and MCMx (Kovtun et al., 2012), (Kovtun and Okhrimenko, 2012), (Kovtun and Okhrimenko, 2013) have been implemented in software in C++ using the Intel C + + Compiler XE 13. The proposed algorithms have been implemented for 32- and 64-bit platforms. Measurements were performed on a computer running Microsoft Windows 7 Ultimate x64 SP1 and the processor Intel Core i5-3570 (6M Cache, 3.40 GHz ) with four physical cores. For multiplication of two 64-bit integers, have been used the built-in compiler intrinsic function _umul128, (128-bit result of the multiplication is represented as an array of 64-bit words). Comparison of the results occurred by comparing the average time of multiplication operations in software implementation MC, MC2x and MCMx and the proposed algorithms squaring MCSQR, MCSQR2x and MCSQRMx, for 1 million iterations. The experimental results for 32-bit platforms are shown in Table 7 and Table 8.

Table 7. The experimental results for squaring ( $\mathrm{w}=\mathbf{3 2} \mathbf{~ b i t )}$

| BIT SIZE |  |  |  |
| :---: | :---: | :---: | :---: |
| 128 | MCSQR, ms | MCSQR2x, ms | MCSQRMx, ms |
| 256 | 147 | 681 | 726 |
| 512 | 495 | 702 | 750 |
| 1024 | 1722 | 1576 | 821 |
| 2048 | 6490 | 4056 | 936 |
| 3072 | 14305 | 8143 | 1432 |
| 4096 | 24992 | 13688 | 2196 |
| 6144 | 54928 | 29593 | 3023 |
| 8192 | 97266 | 51542 | 5519 |
| 12288 | 214921 | 111930 | 7438 |
| 16384 | 378488 | 196768 | 213960 |

Table 8. The experimental results for multiplication ( $\mathbf{w}=\mathbf{3 2} \mathbf{~ b i t )}$

| BIT SIZE | MC, ms | MC2x, ms | MCMx, ms |
| :---: | :---: | :---: | :---: |
| 128 | 62 | 723 | 768 |
| 256 | 156 | 749 | 796 |
| 512 | 530 | 936 | 874 |
| 1024 | 1919 | 2028 | 999 |
| 2048 | 7394 | 5772 | 1513 |
| 3072 | 28673 | 12028 | 2340 |
| 4096 | 63133 | 44569 | 3245 |
| 6144 | 246730 | 174174 | 5938 |
| 8192 | 435943 | 306416 | 7878 |
| 12288 | 16384 | 23306 |  |

It is proposed to normalize the results by dividing the results of MC, MC2x and MCMx to results MCSQR, MCSQR2x and MCSQRMx. The normalized results are shown in Table 9.

Table 9. Normalized results of experiments for w = 32 bit

| Table 9. Normaized resuits of experiments for w = 32 bit |  |  |  |
| :---: | :---: | :---: | :---: |
| BIT <br> SIZE | MC/ <br> MCSQR | MC2x/ <br> MCSQR2x | MCMx/ <br> MCSQRMx |
| 128 | 1,051 | 1,062 | 1,058 |
| 256 | 1,061 | 1,067 | 1,061 |
| 512 | 1,071 | 1,091 | 1,065 |
| 1024 | 1,114 | 1,287 | 1,067 |
| 2048 | 1,139 | 1,423 | 1,057 |
| 3072 | 1,143 | 1,477 | 1,066 |
| 4096 | 1,147 | 1,502 | 1,073 |
| 6144 | 1,149 | 1,506 | 1,076 |
| 8192 | 1,141 | 1,530 | 1,059 |
| 12288 | 1,148 | 1,556 | 1,054 |
| 16384 | 1,152 | 1,557 | 1,091 |

Table 9 shows that all proposed squaring algorithms for 32-bit platforms are effectively than multiplying algorithms. Single-threaded algorithm MCSQR is more efficient than algorithm MC by 5\%, and advantage increases to $15 \%$ with the increase of the multipliers bit size. The algorithm MCSQR2x is more efficient than algorithm MC2x by $6 \%$, and advantage increases to $56 \%$ with the increase of the multipliers bit size. Multi-threaded algorithm MCSQRMx is more effectively than algorithm MCMx by $6 \%$, and advantage increases to $9 \%$ with the increase of the multipliers bit size.

The experimental results for 64-bit platforms are shown in Table 10 and Table 11.

Table 10. The experimental results for squaring ( $\mathbf{w}=\mathbf{6 4}$ bit)

| BIT SIZE | MCSQR, ms | MCSQR2x, ms | MCSQRMx, ms |
| :---: | :---: | :---: | :---: |
| 128 | 14 | 628 | 687 |
| 256 | 46 | 687 | 721 |
| 512 | 150 | 699 | 780 |
| 1024 | 483 | 889 | 952 |
| 2048 | 1736 | 1541 | 1560 |
| 3072 | 3776 | 2816 | 2690 |
| 4096 | 6521 | 4112 | 4009 |
| 6144 | 14521 | 8549 | 8205 |
| 8192 | 25490 | 14430 | 13759 |
| 12288 | 56379 | 31794 | 29718 |
| 16384 | 99232 | 54507 | 51651 |

Table 11. The experimental results for multiplication ( $\mathrm{w}=\mathbf{6 4} \mathbf{~ b i t )}$

| BIT SIZE | MC, ms | MC2x, ms | MCMx, ms |
| :---: | :---: | :---: | :---: |
| 128 | 15 | 818 | 903 |
| 256 | 50 | 896 | 949 |
| 512 | 163 | 924 | 1070 |
| 1024 | 593 | 1184 | 1293 |
| 2048 | 2347 | 2053 | 2224 |
| 3072 | 5175 | 3801 | 3810 |
| 4096 | 8830 | 5600 | 5507 |
| 6144 | 19532 | 11590 | 11357 |
| 8192 | 34335 | 19858 | 18752 |
| 12288 | 76674 | 44991 | 40435 |
| 16384 | 135252 | 77127 | 71682 |

It is proposed to normalize the results for 64-bit platforms by dividing the results of MC, MC2x and MCMx to results MCSQR, MCSQR2x and MCSQRMx too. The normalized results are shown in Table 12.

Table 12. Normalized results of experiments for $w=64$ bit

| $\begin{gathered} \text { BIT } \\ \text { SIZE } \end{gathered}$ | $\begin{gathered} \mathrm{MC} / \\ \text { MCSQR } \end{gathered}$ | $\begin{gathered} \text { MC2x/ } \\ \text { MCSQR2x } \end{gathered}$ | $\begin{gathered} \text { MCMx/ } \\ \text { MCSQRMx } \\ \hline \end{gathered}$ |
| :---: | :---: | :---: | :---: |
| 128 | 1,071 | 1,303 | 1,314 |
| 256 | 1,087 | 1,304 | 1,316 |
| 512 | 1,087 | 1,322 | 1,372 |
| 1024 | 1,228 | 1,332 | 1,358 |
| 2048 | 1,352 | 1,332 | 1,426 |
| 3072 | 1,370 | 1,350 | 1,416 |
| 4096 | 1,354 | 1,362 | 1,374 |
| 6144 | 1,345 | 1,356 | 1,384 |
| 8192 | 1,347 | 1,376 | 1,363 |
| 12288 | 1,360 | 1,415 | 1,361 |
| 16384 | 1,363 | 1,415 | 1,388 |

Table 12 shows that all proposed squaring algorithms for 64-bit platforms are effectively than multiplying algorithms. Single-threaded algorithm MCSQR is more efficient than algorithm MC by 7\%, and advantage increases to $36 \%$ with the increase of the multipliers bit size. The algorithm MCSQR2x is more efficient than algorithm MC2x by 30\%, and advantage increases to $41 \%$ with the increase of the multipliers bit size. Multi-threaded algorithm MCSQRMx is more effectively than algorithm MCMx in average of $37 \%$.

To compare the performance of software implementations of squaring algorithms for 32 and 64 bit platforms, the results obtained on 32-bit platform were divided by results on 64-bit platform, for the same algorithms. The comparison results are shown in Table 13 and Table 14.

Table 14 shows that, as in case of multiplication, software implementations of squaring algorithms for 64bit platforms were more effective than the same implementation for 32-bit platforms (up to 4 times for the algorithm MCSQR, up to 3,6 times for the algorithm MCSQR2x). Multithreaded multiplication and squaring algorithms for 32-bit platforms was more effective than for 64-bit, because there are not support 128-bit operations in modern compilers, that's why it require software emulation such operations. MCSQRMx algorithm for 64bit platforms is more effectively on 128-512 bits multipliers (4-6\%), which are widely used in cryptography.

Table 13. Performance comparing of squaring software implementations

| BIT <br> SIZE | MCSQR x86/ <br> MCSQR x64 | MCSQR2x x86 / <br> MCSQR2x x64 | MCSQRMx x86 / <br> MCSQRMx x64 |
| :---: | :---: | :---: | :---: |
| 128 | 4,214 | 1,084 | 1,057 |
| 256 | 3,196 | 1,022 | 1,040 |
| 512 | 3,300 | 1,227 | 1,053 |
| 1024 | 3,565 | 1,773 | 0,983 |
| 2048 | 3,738 | 2,632 | 0,918 |
| 3072 | 3,788 | 2,892 | 0,816 |
| 4096 | 3,833 | 3,329 | 0,754 |
| 6144 | 3,783 | 3,462 | 0,673 |
| 8192 | 3,816 | 3,572 | 0,541 |
| 12288 | 3,812 | 3,520 | 0,470 |
| 16384 | 3,814 | 3,610 | 0,414 |

Table 14. Performance comparing of multiplication software implementations

| BIT SIZE | MC x86 / <br> MC x64 | MC2X x86 / <br> MC2X x64 | MCMX x86 / <br> MCMX x64 |
| :---: | :---: | :---: | :---: |
| 128 | 4,133 | 0,884 | 0,850 |
| 256 | 3,120 | 0,836 | 0,839 |
| 512 | 3,252 | 1,013 | 0,817 |
| 1024 | 3,236 | 1,713 | 0,773 |
| 2048 | 3,150 | 2,811 | 0,680 |
| 3072 | 3,159 | 3,164 | 0,614 |
| 4096 | 3,247 | 3,671 | 0,589 |
| 6144 | 3,232 | 3,845 | 0,523 |
| 8192 | 3,232 | 3,970 | 0,420 |
| 12288 | 3,218 | 3,871 | 0,364 |
| 16384 | 3,223 | 3,973 | 0,325 |

Theoretical estimation for MCSQR and MCSQR2x confirmed by practical results of research. MCSQRMx algorithm test results for 64-bit platforms indicate that there are no operations performed on 128-bit data array.

## 5. Conclusions

Using previously proposed approaches to improve the performance of the integers multiplication [4,5,6] have developed squaring algorithms Modified Comba SQR, Modified Comba SQR 2x and Modified Comba SQR Mx.

Theoretical calculations show that the parallel squaring algorithms have a lower computational complexity, primarily due to the parallel execution of the elementary operations of addition and multiplication. Furthermore, the use of 64-bit machine words reduces the number of multiplications by 4 times.

Software implementations of squaring algorithms for 64-bit platforms were more effective than the same implementation for 32-bit platforms (up to 4 times for the algorithm MCSQR, up to 3,6 times for the algorithm MCSQR2x). MCSQRMx algorithm for 64-bit platforms is more effectively on 128-512 bits multipliers (4-6\%), which are widely used in cryptography. Multithreaded multiplication and squaring algorithms for 32-bit platforms was more effective than for 64-bit, because there are not support 128 -bit operations in modern compilers, that's why it require software emulation such operations.
Experimental researches have shown the effectiveness of the proposed squaring algorithms over multiplication
algorithms for 32-bit and 64-bit platforms. The theoretical results are confirmed by practice.

The most perspective algorithm is MCSQRMx, which shows significantly better results than other presented algorithms. MCSQRMx has a high degree of parallelism, which allows implementing it on various microprocessor platforms, that's why further research will focus on its development using specialized software and hardware (e.g., NVIDIA CUDA and OpenCL).

## References

[1] Cohen H., Frey G., Avanzi R., Doche C., Lange T., Nguyen K., Vercauteren F. Handbook of Elliptic and Hyperelliptic Curve Cryptography, Chapman \& Hall/CRC, 2006, 843.
[2] Denis, T., Rose G. (2006). BigNum Math: Implementing Cryptographic Multiple Precision Arithmetic, Elsevier/Syngress, 2006, 315.
[3] Hankerson, D., Menezes, A.J., Vanstone, S. Guide to Elliptic Curve Cryptography, Springer-Verlag Professional Computing Series, 2004, 332.
[4] Kovtun, V.Y., Okhrimenko, A.O. "Approaches for the Parallelization of Software Implementation of Integer Multiplication", Radiotehnika. Vseukrainskij mezhvedomstvennyj nauchno-tehnicheskij sbornik, 171, 123-132. 2012.
[5] Kovtun, V.Y., Okhrimenko, A.O. "Integer multiplication algorithms with delayed carry for public-key cryptosystems". In: V.S. Ponomarenko (Eds.), Informacionnye tehnologi i sistemy v upravlenii, obrazovanii, nauke. Kharkiv: Cifrova drukarnja №1. 69-82. 2013.
[6] Kovtun, V.Y., Okhrimenko, A.O., Nechiporuk, V.V. "Approaches for the performance increasing of software implementation of integer multiplication in prime fields", Zashchita informacii, 1 (54), 68-75. 2012.

