High Performance Optimization Function for 32-Bits Microcontrollers in Key Scheduling of the Lightweight Cipher Algorithm CLEFIA

Edwar Jacinto, Holman Montiel* and Fernando Martinez
Technological Faculty, District University Francisco Jose de Caldas, Bogota D.C., Colombia; ejacintog@udistrital.edu.co, hmontiela@udistrital.edu.co, fmartinezs@udistrital.edu.co

Abstract

Objectives: This paper presents optimized code for lightweight cipher algorithms that seeks a balance between resource usage and communication speed. Methods/Analysis: A performance analysis on real hardware is applied to the cryptographic algorithm CLEFIA, standardized in ISO/IEC 29192-2, by optimizing the key scheduling code through bit-oriented instructions. The Freescale KL25Z development board is used to measure the response times and the execution times of the structural blocks of the cipher algorithm. Findings: A bit-level optimization was applied to some operative structures of the algorithm, taking advantage of the 32-bit architecture of the development platform. This yields a better response time for the application and a higher throughput than the reference code by SONY. Novelty/Improvement: The application was developed so that it can be used on many platforms, in any electronic application that requires an encryption process and where a PC is not justified because of its size and cost.

Keywords: Cipher Algorithm, ISO/IEC 29192, Lightweight Algorithm

## 1. Introduction

The development of electronic prototypes related to communication processes establishes the need for techniques that protect the handling of information through both basic tools and complex mechanisms, thereby offering a level of security for specific information. At present, there are a great number of developments based on embedded system solutions applied to low-cost, high-performance technologies, which exploit the robustness and practicality of the many boards available on the market.

These developments need standardized cryptographic techniques that add a level of reliability to communication tasks. Nevertheless, adapting these cipher algorithms often involves very complex analysis and development processes, in which the benefits of the embedded system architecture are not always exploited.

This paper uses one of the cipher algorithms standardized by ISO/IEC 29192-2, which specifies two block ciphers suitable for applications that require lightweight cryptographic implementations. The work addresses the optimization of the CLEFIA lightweight algorithm with a key length of 128 bits.

The proposal is to perform a functional analysis of the CLEFIA cipher algorithm, taking as reference the information given by its developer, SONY, in order to create a code library that can be executed on 32-bit embedded systems. The code is optimized through an architecture analysis and the use of bit-oriented instructions provided by the compiler of the mbed platform for the Freescale KL25Z. The mbed compiler provides an interface for the C and C++ languages and is designed to create, compile and download projects that run on an mbed microcontroller. No additional software needs to be installed to create projects, because the compiler is a web application that can be accessed from anywhere and allows projects to be stored online.

## 2. Proposed Work

The proposal is based on generating a code library, usable by any 32-bit embedded system, for the key scheduling part (128-bit key) of the CLEFIA encryption process. Taking into account the number of rounds of the structure, this part provides the whitening keys and round keys for the data processing part. The key scheduling part consists of two steps, generating and expanding, as shown in Figure 1. This linear operation yields the round keys associated with the permutations and exchanges that occur during encryption. The aim is to maintain software efficiency by taking advantage of the hardware architecture, based on the word size of the microcontroller used, and relying on appropriate handling of rotations and basic logic operations according to the variable type used during the process and the general processing capacity of the hardware.

![Figure 1. Key schedule – Components.](image1)

Figure 2 shows the flowchart of the proposal developed to optimize the code structure of the DoubleSwap function, used by the two operative blocks of the key scheduling.

![Figure 2. Flowchart code optimization for DoubleSwap Function.](image2)

## 3. Implementation and Results

### 3.1 Key Scheduling for CLEFIA

This is a structural part of the algorithm that generates the keys used in the encryption process. It is based on the DoubleSwap function, which takes a 128-bit input vector divided into four parts that are exchanged in order to increase the security of the algorithm.

\[
Y = \Sigma(X) \\
Y = X[7\text{-}63] \mid X[121\text{-}127] \mid X[0\text{-}6] \mid X[64\text{-}120]
\]

Here \(X[a\text{-}b]\) denotes the bit string cut from the \(a\)-th bit to the \(b\)-th bit of \(X\). Bit 0 is the most significant bit, and \(X\) has a length of 128 bits.
**Figure 3.** DoubleSwap Function.
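
The definition above can be transcribed directly, bit by bit, which gives a slow but easily checked reference for validating faster versions. The following C sketch is illustrative only: the names `get_bit`, `set_bit` and `double_swap_ref` are hypothetical helpers, not part of the SONY reference code, and the 128-bit string is assumed to be stored in 16 bytes with bit 0 as the most significant bit of byte 0.

```c
#include <stdint.h>
#include <string.h>

/* Read bit i of a 128-bit string (bit 0 = most significant bit of x[0]). */
static int get_bit(const uint8_t x[16], int i) {
    return (x[i / 8] >> (7 - (i % 8))) & 1;
}

/* Set bit i of a 128-bit string to 1. */
static void set_bit(uint8_t y[16], int i) {
    y[i / 8] |= (uint8_t)(1u << (7 - (i % 8)));
}

/* Y = X[7-63] | X[121-127] | X[0-6] | X[64-120], copied one bit at a time. */
static void double_swap_ref(const uint8_t x[16], uint8_t y[16]) {
    int i;
    int pos = 0;  /* next free bit position in y */

    memset(y, 0, 16);
    for (i = 7;   i <= 63;  i++, pos++) if (get_bit(x, i)) set_bit(y, pos);
    for (i = 121; i <= 127; i++, pos++) if (get_bit(x, i)) set_bit(y, pos);
    for (i = 0;   i <= 6;   i++, pos++) if (get_bit(x, i)) set_bit(y, pos);
    for (i = 64;  i <= 120; i++, pos++) if (get_bit(x, i)) set_bit(y, pos);
}
```

The four loops copy 57, 7, 7 and 57 bits respectively, so the output is again 128 bits long. For instance, input bit 7 moves to output bit 0, and input bit 0 moves to output bit 64.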

**Table 1.** Round keys by L

<table>
<thead>
<tr>
<th>$WK_0$</th>
<th>$WK_1$</th>
<th>$WK_2$</th>
<th>$WK_3$</th>
<th>$\leftarrow K$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$RK_0$</td>
<td>$RK_1$</td>
<td>$RK_2$</td>
<td>$RK_3$</td>
<td>$\leftarrow L \oplus (CON^{128}_{24} \mid CON^{128}_{25} \mid CON^{128}_{26} \mid CON^{128}_{27})$</td>
</tr>
<tr>
<td>$RK_4$</td>
<td>$RK_5$</td>
<td>$RK_6$</td>
<td>$RK_7$</td>
<td>$\leftarrow \Sigma(L) \oplus K \oplus (CON^{128}_{28} \mid CON^{128}_{29} \mid CON^{128}_{30} \mid CON^{128}_{31})$</td>
</tr>
<tr>
<td>$RK_8$</td>
<td>$RK_9$</td>
<td>$RK_{10}$</td>
<td>$RK_{11}$</td>
<td>$\leftarrow \Sigma^{2}(L) \oplus (CON^{128}_{32} \mid CON^{128}_{33} \mid CON^{128}_{34} \mid CON^{128}_{35})$</td>
</tr>
<tr>
<td>$RK_{12}$</td>
<td>$RK_{13}$</td>
<td>$RK_{14}$</td>
<td>$RK_{15}$</td>
<td>$\leftarrow \Sigma^{3}(L) \oplus K \oplus (CON^{128}_{36} \mid CON^{128}_{37} \mid CON^{128}_{38} \mid CON^{128}_{39})$</td>
</tr>
<tr>
<td>$RK_{16}$</td>
<td>$RK_{17}$</td>
<td>$RK_{18}$</td>
<td>$RK_{19}$</td>
<td>$\leftarrow \Sigma^{4}(L) \oplus (CON^{128}_{40} \mid CON^{128}_{41} \mid CON^{128}_{42} \mid CON^{128}_{43})$</td>
</tr>
<tr>
<td>$RK_{20}$</td>
<td>$RK_{21}$</td>
<td>$RK_{22}$</td>
<td>$RK_{23}$</td>
<td>$\leftarrow \Sigma^{5}(L) \oplus K \oplus (CON^{128}_{44} \mid CON^{128}_{45} \mid CON^{128}_{46} \mid CON^{128}_{47})$</td>
</tr>
<tr>
<td>$RK_{24}$</td>
<td>$RK_{25}$</td>
<td>$RK_{26}$</td>
<td>$RK_{27}$</td>
<td>$\leftarrow \Sigma^{6}(L) \oplus (CON^{128}_{48} \mid CON^{128}_{49} \mid CON^{128}_{50} \mid CON^{128}_{51})$</td>
</tr>
<tr>
<td>$RK_{28}$</td>
<td>$RK_{29}$</td>
<td>$RK_{30}$</td>
<td>$RK_{31}$</td>
<td>$\leftarrow \Sigma^{7}(L) \oplus K \oplus (CON^{128}_{52} \mid CON^{128}_{53} \mid CON^{128}_{54} \mid CON^{128}_{55})$</td>
</tr>
<tr>
<td>$RK_{32}$</td>
<td>$RK_{33}$</td>
<td>$RK_{34}$</td>
<td>$RK_{35}$</td>
<td>$\leftarrow \Sigma^{8}(L) \oplus (CON^{128}_{56} \mid CON^{128}_{57} \mid CON^{128}_{58} \mid CON^{128}_{59})$</td>
</tr>
</tbody>
</table>

Source: RFC 6114

### 3.2 General Structure

The CLEFIA cipher system relies on the generation of round keys for the data encryption process. Starting from the main key K, an intermediate key named L must first be generated.

The 128-bit intermediate key L is generated by the function \(GFN_{4,12}\) (Generalized Feistel Network), which takes the twenty-four 32-bit constant values \(CON^{128}_{i}\) \((0 \le i < 24)\) as round keys and \(K = K_0 \mid K_1 \mid K_2 \mid K_3\) as the input. Then, K and L are used to generate \(WK_i\) \((0 \le i < 4)\) and \(RK_j\) \((0 \le j < 36)\), together with the thirty-six 32-bit constant values \(CON^{128}_{i}\) \((24 \le i < 60)\), as in the following steps.

<table>
<thead>
<tr>
<th>Generating L from K</th>
</tr>
</thead>
<tbody>
<tr>
<td>Step 1: $L \leftarrow GFN_{4,12}(CON^{128}_{0}, \ldots, CON^{128}_{23}, K_0, \ldots, K_3)$</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Expanding K and L</th>
</tr>
</thead>
<tbody>
<tr>
<td>Step 2: $WK_0 \mid WK_1 \mid WK_2 \mid WK_3 \leftarrow K$</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Step 3:</th>
</tr>
</thead>
<tbody>
<tr>
<td>For i = 0 to 8 do:</td>
</tr>
<tr>
<td>$T \leftarrow L \oplus (CON^{128}_{24+4i} \mid CON^{128}_{24+4i+1} \mid CON^{128}_{24+4i+2} \mid CON^{128}_{24+4i+3})$</td>
</tr>
<tr>
<td>$L \leftarrow \Sigma(L)$</td>
</tr>
<tr>
<td>If i is odd: $T \leftarrow T \oplus K$</td>
</tr>
<tr>
<td>$RK_{4i} \mid RK_{4i+1} \mid RK_{4i+2} \mid RK_{4i+3} \leftarrow T$</td>
</tr>
</tbody>
</table>
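
Steps 2 and 3 can be sketched in C as follows, following the 128-bit key schedule of RFC 6114, in which L is updated with DoubleSwap after each iteration. This is a minimal sketch under stated assumptions, not the authors' library: the names `double_swap_words` and `expand_key_128` are hypothetical, the intermediate key L from Step 1 is assumed to be already computed, and the 36 constants \(CON^{128}_{24}\) to \(CON^{128}_{59}\) are passed in by the caller rather than listed here.

```c
#include <stdint.h>

/* DoubleSwap on four 32-bit words; x[0] holds bits 0-31 (bit 0 = MSB). */
static void double_swap_words(uint32_t x[4]) {
    uint32_t y0 = (x[0] << 7) | (x[1] >> 25);
    uint32_t y1 = (x[1] << 7) | (x[3] & 0x0000007fu);
    uint32_t y2 = (x[0] & 0xfe000000u) | (x[2] >> 7);
    uint32_t y3 = (x[2] << 25) | (x[3] >> 7);
    x[0] = y0; x[1] = y1; x[2] = y2; x[3] = y3;
}

/*
 * Steps 2 and 3 for a 128-bit key.
 *   k   : key K as four 32-bit words
 *   l   : intermediate key L from Step 1 (assumed given); updated in place
 *   con : assumed array of the 36 constants, con[0] standing for CON_128[24]
 *   wk  : output, 4 whitening keys
 *   rk  : output, 36 round keys
 */
void expand_key_128(const uint32_t k[4], uint32_t l[4], const uint32_t con[36],
                    uint32_t wk[4], uint32_t rk[36]) {
    int i, j;

    for (j = 0; j < 4; j++) {
        wk[j] = k[j];                           /* Step 2: WK <- K */
    }
    for (i = 0; i <= 8; i++) {                  /* Step 3 */
        for (j = 0; j < 4; j++) {
            uint32_t t = l[j] ^ con[4 * i + j]; /* T <- L xor CON */
            if (i & 1) {
                t ^= k[j];                      /* odd i: T <- T xor K */
            }
            rk[4 * i + j] = t;                  /* RK[4i..4i+3] <- T */
        }
        double_swap_words(l);                   /* L <- Sigma(L) */
    }
}
```

Each pass of the outer loop produces one row of Table 1: row i uses L after i applications of DoubleSwap, and the key K is mixed in only on odd rows.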

The generation of the intermediate key L is vital for the execution of the algorithm, because it is associated with each of the rounds of the encryption process. There is a direct link between the round key value and the round number. Table 1 shows the relation between the generation of each round key and the permutation performed with the DoubleSwap function.

### 3.3 Results and Discussion

The optimization of the DoubleSwap function takes advantage of the 32-bit architecture, focusing mainly on bit-oriented instructions combined with masks, which simplify the operations and reduce the execution time per block. Creating a specific function that performs the permutation through temporary arrays and bit shifts is directly related to the throughput increase of the whole algorithm. The basis for comparison is the reference code provided by SONY as developer, which implements the bit rotation on arrays of char variables; these arrays have 16 memory positions of 8 bits each. It uses a sequential scheme over the memory positions of the permutation array, in which only two memory positions can be processed at a time, in a continuous process from positions (0, 1) to (15, 14).
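
For comparison, the byte-oriented scheme described above can be sketched as follows. This is an illustrative reconstruction in the style of a byte-wise implementation, not SONY's reference code verbatim; the function name `double_swap_bytes` is hypothetical, and byte 0 is assumed to hold bits 0 to 7 of the 128-bit value.

```c
#include <stdint.h>
#include <string.h>

/* Byte-oriented DoubleSwap over a 16-byte value; byte 0 holds bits 0-7. */
void double_swap_bytes(uint8_t lk[16]) {
    uint8_t t[16];
    int i;

    /* Left half of the result: X[7-63] followed by X[121-127]. */
    for (i = 0; i < 7; i++) {
        t[i] = (uint8_t)((lk[i] << 7) | (lk[i + 1] >> 1));
    }
    t[7] = (uint8_t)((lk[7] << 7) | (lk[15] & 0x7f));

    /* Right half of the result: X[0-6] followed by X[64-120]. */
    t[8] = (uint8_t)((lk[8] >> 7) | (lk[0] & 0xfe));
    for (i = 9; i < 16; i++) {
        t[i] = (uint8_t)((lk[i] >> 7) | (lk[i - 1] << 1));
    }

    memcpy(lk, t, 16);  /* extra copy step that a word-based version can avoid */
}
```

Every output byte combines two input bytes, so the permutation takes 16 shift-and-OR steps plus a final copy, whereas a 32-bit formulation needs only four such steps.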

The performance optimization of this function starts by changing the type of the working variable; the following code shows the optimized function. It works with only three integer arrays of four 32-bit memory positions each, which are handled directly through masks and bit-oriented instructions. This takes full advantage of the word length of the microcontroller architecture and removes the need for additional functions to copy or assign the output data, as SONY uses in its reference code.

```c
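#include <stdint.h>

// Three working arrays of four 32-bit words each; L[0] holds bits 0-31 (bit 0 = MSB)
uint32_t L[4], out[4], temp_data[4];

void doubleswap(void) {
    uint32_t temp;
    temp_data[0] = L[0] << 7;
    temp = L[1] >> 25;
    out[0] = temp_data[0] | temp;                 // 1st block: X[7-38]
    // 2nd and 4th blocks follow from the DoubleSwap definition
    out[1] = (L[1] << 7) | (L[3] & 0x0000007f);   // 2nd block: X[39-63] | X[121-127]
    temp = L[2] >> 7;
    out[2] = temp | (L[0] & 0xfe000000);          // 3rd block: X[0-6] | X[64-88]
    out[3] = (L[2] << 25) | (L[3] >> 7);          // 4th block: X[89-120]
    L[0] = out[0];
    L[1] = out[1];
    L[2] = out[2];
    L[3] = out[3];
}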
```
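
The word-based function assumes that the 128-bit value is already held in four 32-bit words, with bit 0 as the most significant bit of the first word. When the key arrives as a byte stream, it has to be packed most significant byte first; on a little-endian core such as the Cortex-M0+ of the KL25Z, a plain pointer cast would reverse the byte order within each word. The helpers below are a hedged sketch of that packing step; the names `bytes_to_words` and `words_to_bytes` are hypothetical and not taken from the original library.

```c
#include <stdint.h>

/* Pack 16 bytes into four 32-bit words, most significant byte first. */
void bytes_to_words(const uint8_t in[16], uint32_t w[4]) {
    int i;
    for (i = 0; i < 4; i++) {
        w[i] = ((uint32_t)in[4 * i]     << 24) |
               ((uint32_t)in[4 * i + 1] << 16) |
               ((uint32_t)in[4 * i + 2] <<  8) |
                (uint32_t)in[4 * i + 3];
    }
}

/* Unpack four 32-bit words into 16 bytes, most significant byte first. */
void words_to_bytes(const uint32_t w[4], uint8_t dst[16]) {
    int i;
    for (i = 0; i < 4; i++) {
        dst[4 * i]     = (uint8_t)(w[i] >> 24);
        dst[4 * i + 1] = (uint8_t)(w[i] >> 16);
        dst[4 * i + 2] = (uint8_t)(w[i] >> 8);
        dst[4 * i + 3] = (uint8_t)(w[i]);
    }
}
```

Because the conversion uses shifts rather than memory layout, it gives the same result regardless of the endianness of the target.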

These modifications increased the throughput from 145.45 kbps to 250 kbps and reduced the execution time from 880 µs to 512 µs. The measurements were taken directly on the microcontroller using Timer objects and their start() and stop() functions, which gave full control over the validation and verification of the key permutation processes and of the complete execution of the cipher algorithm.
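
As a consistency check, the reported throughput values match the size of one 128-bit block divided by the measured execution time. This reading of how the throughput was computed is an assumption, since the formula is not stated explicitly:

\[
\frac{128\ \text{bits}}{880\ \mu\text{s}} \approx 145.45\ \text{kbps}, \qquad
\frac{128\ \text{bits}}{512\ \mu\text{s}} = 250\ \text{kbps}
\]

Under the same assumption, the execution time drops by a factor of \(880 / 512 \approx 1.72\), a reduction of roughly 42%.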

## 4. Conclusion

A code structure applicable to any 32-bit embedded system was obtained. It optimizes the key scheduling, improves the throughput and reduces the execution times of the structural blocks of the CLEFIA cipher algorithm. This shows that taking appropriate advantage of the architecture of an embedded system can improve application performance in environments where the computational power of a conventional PC is not needed to guarantee information security.

## 5. Acknowledgments

This work was supported by the District University Francisco Jose de Caldas, in part through CIDC and in part by the Technological Faculty. The views expressed in this paper are not necessarily endorsed by District University. The authors thank the research groups ARMOS and DIGITI for their evaluation of the prototypes, ideas and strategies.

## 6. References