Wednesday, July 12, 2017

Looking for fast OpenSource algorithms on lattices? Try fpLLL!

Dear ECRYPT-NET fellows and readers,
I have some news from CWI @ Science Park in Amsterdam where fplll-days-3, organized by Leo Ducas and Marc Stevens, are currently taking place!

Previously held at ENS Lyon, this is the third time already for such a combined effort to enhance the fplll OpenSource project. fplll has become a lively project with many suggestions that help to debug and feature requests for continuously improving the code-base in various directions.

As a brief history of fplll it can be noted that the first code was written by Damien Stehlé. It is now written by many active contributors (according to GitHub, the most active developers are: Martin Albrecht, Shi Bai, Damien Stehlé, Guillaume Bonnoron, Marc Stevens and Koen de Boer) and maintained by Martin Albrecht and Shi Bai.

What does fplll do?

What fplll does, depending on additionally specified parameters, is performing its implementation of the LLL algorithm using fast floating-point arithmetic under the hood. Other available lattice reduction algorithms are HKZ, BKZ reduction and variants. These algorithms can be applied on an input lattice represented by a matrix, given i.e. in a file as a set of row vectors, to obtain a reduced representation --- a versatile starting point of numerous applications. Furthermore, fplll allows to solve (arbitrarily adjustable approximate-) SVP and CVP instances, when used to find a shortest lattice vector relative to a user-chosen center.

To get started, one can not only use and compile the fplll C++ sources to run experiments, but the often dubbed 'user-friendlier variant' fpylll which provides Python access to the underlying, fast C++ functions. Finally, every mathematician's dear, Sage, (at least for anyone who isn't fully satisfied by pure Python) benefits from an improved fpylll as well, because importing the fpylll module seamlessly allows direct usage within Sage. Soon a new Sage version, SageMath 8.0, will be released, which ships the current fpylll module that accesses said, fast C++ routines.

The significance of lattice-based cryptography is repeatedly mentioned in paper's abstracts and has been explored in past ECRYPT-NET blog posts i.e. on 'What are lattices?' and 'Learning problems and cryptography: LPN, LWE and LWR'.

Significance for cryptanalysis

From a cryptanalysts point of view, the significance lies in the fact that most security models of lattice-based cryptography typically assume lattice reduction to be the most promising attack-vector against the underlying lattice-based primitive. Some security models are able to immediately (and provably) rule out certain classes of attacks and, for instance, a few others can be argued to be less promising than known formulations as lattice problems. Such arguing hence leads to fplll basically representing the SVP/CVP-oracle and it's performance is deemed as a lower bound for the practical performance of an attack. Typically attacks require many calls to such an oracle function, thus such an approach of taking the time of a single run as a lower bound is used to set parameters in experimental cryptosystems, when commonly a more conservative
lower bound including a security margin is chosen. Specifically, many proposed lattice-based crypto-schemes have been parametrized such that these best known-attacks are taken into account.

I suppose, I do not need to point out the numerous advantages of OpenSource software (over closed-source projects) but and its value to the research community the significance of having freely-available, fast lattice reduction routines is manifold.

To begin with, there is a discrepancy between what theory predicts and algorithmic performance in practice. Techniques described in the literature, summarized as BKZ2.0, leave a broad range of implementation choices. Different groups using different software and metrics where their approach is supreme, naturally lead to results that are hard to compare. If there was software that comes with meaningful defaults for many standard lattice tasks, is customizable, and extensible to individual lattice solutions, then there is hope that the community can agree on problem instances. Ideally, problem instances should cover deployed or model experimental cryptosystems such that they embody a meaningful benchmark for new designs.

Originally, fplll was trying to provide such algorithms with reasonable speed. Recently, developers broadened theirs goals and try to fill gaps of cryptanalytic research. Concretely, now fplll strives for speed from low level optimizations, and by implementing diverse techniques from the literature hence catching up with the state of the art. Additionally, it can be easily tweaked on a high algorithmic level with the Python layer fpylll, yet easily exploiting all the available optimized routines boosting the performance. One can argue that together with diverse lattice challanges this project helps to benchmark and compare various efforts to cryptanalyze cryptographic primitives used in cryptosystem's constructions.

A couple of Lattice Challenges have been proposed (SVP-, Ideal-, LWE- and Ring-Challenges) and it seems that researchers also test their code on these instances, which aids a comparison of approaches.

Having them conveniently accessible and high-level, fast lattice operations allows to quickly try out a new idea, or slightly different approach which saves time and hopefully makes researchers willing to share their tweaks and algorithmic tricks more often in the future.

The workshop

To come back to the start, the fplll-days are meant to be a hands-on, work-oriented workshop that enables direct discussions with core developers and with the goal to improve existing functions and the many algorithms involved. The general idea behind this meeting is to optimize often used routines, make it user-friendlier and accessible to cryptanalysts, for example.

By using code profiling tools, performance and memory usage bottlenecks can be spotted in a first overview, which allows to direct efforts where they might lead to significant speed-ups. After discussing know issues and useful features, this workshop tries to provide an implementation of numerically stable algorithmic variants to push the dimension LLL can handle (like Givens rotations while resorting only to machine floating point type), sophisticated pruning strategies to speed-up enumeration, and implementing sieving algorithms --- all as a promising new direction in finding short vectors faster.

It is exciting to join and shape such a project so let's hope for many interesting projects that got started and delegated here to be completed during this week and further interested researchers turning into active users joining the party and coming up with meaningful, reproducible research results. Remember that Newton "has seen further, by standing on the shoulders of giants" thus achieving progress, and so you too are encouraged to become active, using an already established framework!

Thursday, July 6, 2017

A Brief Survey of Physical Attacks on Cryptographic Hardware

Previously, the topic of side-channel attacks (SCA) was covered on this blog. These attacks are very popular, for they can be mounted using very cheap equipment and do not necessarily require high level of expertise. Hence, SCA are widely accessible and present a common danger. As a result, they are well researched, and various countermeasures have been developed. Still, they are just a small part of the stack of physical attacks. Figure 1. crudely depicts the this colorful “stack”. The one thing all physical attacks have in common is that it is assumed that the attacker must gain physical access to the target device, and attain it for a certain amount of time. In the remainder of this post, a brief survey of these attacks will be given. More detailed descriptions will be provided in a series of posts that will follow. 



Figure 1: Stack of Physical Attacks

Invasiveness
The first segregation is based on the “invasiveness”. Invasive attacks entail breach of target’s packaging, or its surrounding enclosure. This is often a very delicate process which often requires expensive equipment and a high level of expertise. Since the breach is destructive by nature, it can be easily detected by subsequent users — if the chip itself was not destroyed in the process that is. The goal of this breach is to gain access to internal state of a chip. Commonly attackers target on-chip busses or storage elements, which may contain sensitive intermediaries of cryptographic computations or keys themselves. Aforementioned enclosures are a privilege of expensive devices, often called Hardware Security Modules (HSMs). HSMs may cost tens of thousands of Euros, and are envisioned to provide secure computational environments at high speeds. Apart from restricting access to the chip using sturdy build and “tamper-proof” latches and locks, enclosures are frequently equipped with seals and coatings that are supposed to witness any foul play that may have taken place. Additionally, tamper detection measures may be built in, envisioned to void all sensitive information at the first glimpse of attacker’s activities. Hence, invading these juggernauts is commonly more expensive and time consuming. Unfortunately, market-share of HSMs compared to bare smart-cards and RFIDs is neighboring negligible, especially with the rise of the IoT.

On the contrary, non-invasive adversaries do not cause any structural damage to packaging nor enclosures. They interact with the target device using its existing interfaces, and mediums that require no mechanical interaction with the device. They are virtually free, but may require significant expertise of attackers.

Activeness
The second segregation is based on the “activeness” of the attacker. Active attacks entail induction of computational (logical) or structural changes in the target chip. When we talk about computational changes, a very common example are Fault Injection (FI) attacks. There are two phases to FI attacks: fault injection during the execution of the targeted algorithm, and the analysis based on the observations of faulty outputs. A common method for altering device’s execution is called clock glitching. Namely, by introducing a premature edge on the clock signal, attacker violates devices’s critical path. As a result, incorrect values are captured in device’s registers. Alternatively, faults can be induced by shooting a laser beam with enough power to change the state of the device, while allowing it to remain operational. Here, any data or control register fall under “state of the device”. For example, round counter, commonly used in implementations of block ciphers, is a very favored target for such faults. Active attacks may require higher level of technical skill, and a more sophisticated setup.

On the contrary, passive adversaries may only observe device’s execution, while interacting through its predefined interfaces. Well-known SCA fall under this category. These attacks are well researched, and can be mounted using very cheap equipment. Developed techniques (e.g., Mutual Information Analysis) are extremely powerful, and once incorporated in the attackers setup can be reproduced quite trivially. Consequently, although they entail only limited exposure of the device, they pose a serious threat for they are very accessible even to attackers with modest capabilities.

The Reality
Activeness and invasiveness are two orthogonal properties, resulting in a total of four possibilities (although I find that the existence of “invasive and passive attacks” calls for a philosophical debate). Unfortunately, situation is much more complex than that in practice. Firstly, attackers are likely to use combined attacks. For example, FI + SCA may be a very powerful combination. Additionally, the distinction mentioned above is not as binary. Rather, along each of the two orthogonal axes there are many shades. For example, faults can be injected in some chips by applying laser beams to their packaging (non-invasive), while others may be shielded from such beams (hence they have to be attacked invasively).

Consequently, there exists a myriad of possible attack variations. Moreover, even if we lock on a certain extreme — let us say passive, non-invasive, CPA — quality of the measurement setup plays a very significant role. A 500 Euro oscilloscope can hardly match its 30000 Euro counterpart. In hindsight, there are no upper bounds to the power of a skilled invasive attacker performing a battery of active and passive attacks, apart from the temporal and financial constraints.


Taking all above into account, choosing a set of countermeasures is a difficult task (let alone implementing them properly). Bare in mind that these countermeasures are not for free. They may significantly increase the price of devices, reducing the profit margins severely. Therefore, there are no silver bullets in protection against physical attacks. In other words, in practice security engineers work to demotivate attackers with high probability. They try to stop “the attacker of interest”, rather then stopping all attacks. To achieve this, first step is identifying potential attackers. This process is often called profiling, and in a nutshell I would describe it as follows. Please note that this is a gross simplification of the problem, meant to depict the general idea. No distinction is made between fixed (price of the setup) and recurring (every time the attack is mounted) costs, nor between temporal and financial costs. Lastly, please note that the value of assets is heavily simplified as well, for the sake of avoiding a philosophical discussion yet again.


Manufacturer’s Dilemma
Assume that a device D, which costs d to manufacture, protects assets worth x , and features a
countermeasure C that costs c to deploy. We may consider D to be secure against an attacker A, who can mount a successful attack at a cost a (which includes A’s investment in
development of expertise), as long as
x a+μA,

μA being the attackers profit margin. In other words, if the cost a is high enough attacker can not obtain desired amount of profit for given assets. On the other hand a manufacturer M that produces D wants to sell for a price m such that
m d + c + μM,
μ M being M’s profit margin. In other words, price of deploying countermeasures c directly cuts into manufacturer’s profits. Looking at these inequalities, it seems that there is no dilemma at all. Nevertheless, cost of attack depends on the selection of a countermeasure, i.e.,
a =f(c).
Assuming that an increase in c leads to the increase in a , by applying some high school math (readers are welcome to play with it), we see that the selection of C must be performed based on the value of assets it protects. A more detailed discussion on this topic will be given in one of the following posts.
In conclusion, physical attacks are a great threat. As IoT progresses, and the amount of ubiquitous devices increases their potential impact may only grow. Deploying devices that protect assets against physical attacks is a complex problem, which demands bespoke solutions, tailored to individual use cases. 

Friday, June 23, 2017

Algorithm and Key Size Document



The ECRYPT Algorithm and Key Size document is probably the most high impact output from our ECRYPT projects. It is referenced and used throughout the world, to guide the uses of cryptography in practice. The current version of the document can be found here


We are requesting input for the next edition of this document. To do this we have created a Slack channel where people can debate inputs.

We encourage everyone to get involved by sending us your email so we can add you. Once added you can add other people to the channel as you see fit. Please email Nigel Smart or Saartje Verheyen to be added if you do not know someone who is already involved.

We ask you to add comments to the slack channel of corrections and new text to add (including where you think it should go). After you have presented some input, other people can then comment on your text, and add further corrections. 

At the end of September we will freeze the discussion and start the process of incorporating all the suggestions into the final document. 

If you have contributed to the Slack discussion in a positive manner we will include you as an author on the final document as a contributor. That way you get to claim you have contributed to a high impact document (carrot); if you do not contribute however then you cannot complain if we say something you disagree with (stick). 

Of course in the end it is a community effort, and in case of disagreement the editors will need to take one side or another.

Friday, June 16, 2017

Boomerang Attacks

In cryptography, a boomerang attack is a method of cryptanalysis that is based on differential cryptanalysis.

Boomerang attacks were first introduced by Wagner and allow an adversary to concatenate two high probability differential trails to attack a cipher. This is especially useful if there is a lack of long differentials with sufficient probabilities. The adversary can therefore decompose the encryption function $F$ in two subciphers $f$ and $g$ such that $F = f \circ g$. Then the adversary can search for high probability trails $\Delta \rightarrow \Delta^*$ with probability $p$ for  $f$ and  $\nabla \rightarrow \nabla^*$ with probability $q$ for $g$. The differential trails can then be combined in a chosen plaintext/adaptive chosen ciphertext attack to mount a boomerang distinguisher and then a key recovery attack based on this distinguisher to recover the secret key.

A boomerang distinguisher



The basic attack works as follows:

  1. The adversary chooses a random plaintext $X_1$ and calculates $X_2 = X_1 \oplus \Delta$.
  2. The adversary requests the ciphertexts for $X_1$ and $X_2$ from an encryption oracle which are $Y_1 = F(X_1)$ and $Y_2 = F(X_2)$
  3. Calculate ciphertexts $Y_3 = Y_1 \oplus \nabla$ and $Y_4 = Y_2 \oplus \nabla$.
  4. Request the decryptions of $Y_3$ and $Y_4$ to obtain $X_3 = F^{-1}(Y_3)$ and $X_4 = F^{-1}(Y_4)$.
  5. If the difference between $X_3$ and $X_4$ is the same as between $X_1$ and $X_2$, namely $\Delta$ we obtain a correct quartett $(X_1, X_2, X_3, X_4)$.

Calculating a correct quartet requires an attacker to consider both plaintext pairs $(X_1, X_2)$ and $(X_3, X_4)$ and results in a total probability of (pq)^2.
For an attack to succeed, for the probability of the boomerang distinguisher it must hold that $(pq) > 2^{n/2}$. For N plaintext pairs, an adversary expects about $N\cdot(pq)^2$ correct quartets in an attack, while there are only $N\cdot2^{-n}$ (where n is the blocksize) correct quartets for an ideal primitive.







Thursday, April 27, 2017

Yet Another "Ni Hao (Hello)"

Hello, everyone. I am Junwei Wang from China. It’s pleasant to be an ECRYPT-NET fellow and to receive funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie actions (MSCA).

I received my bachelor degree and master degree from Shandong University in 2012 and the University of Luxembourg in 2015, respectively. My master thesis is about the practical implementation of countermeasures resistant again high-order differential power analysis (DPA). Presently, I am doing my Ph.D. student at CryptoExperts under the supervisor of Jean-Sébastien CORON, Sihem MESNAGER, Pascal PAILLIER, and Matthieu RIVAIN.  My research interest is white-box cryptography.

The advent of content-protected applications and mobile payment systems in recent years motivates the exploration of a solution to protect sensitive information from being extracted under the white-box context in which the attacker has the power of control the whole process of the execution of software. Though some theoretical evidence reveals the fact that no generic obfuscator can be constructed to shield algorithms to be attacked, and all existed practical schemes has been broken, it is still interesting to investigate new schemes which can withstand all the existing attacks, and to propose and analyse a new attacking method which is more generic than the existing ones. I hope somebody has similar interest with me, then we can deeply explore this area together.

By the way, I am looking forward to meeting with you in Paris during EUROCRYPT.

Monday, April 17, 2017

What does privacy mean to you? A 5-day challenge

I recently came across The Privacy Paradox project, a call by WNYC Studios to take part in a five-day-long challenge on defining what does privacy mean to yourself.

Amongst its best features, the process can be started anytime, by anyone, just by registering with an e-mail account. From that moment and during the five following days, you will receive a daily e-mail explaining you the topic addressed by that day's podcast, most of them finishing with an easy demand to the listener to reflect on the matter.

With the audio's length ranging from 11 to 15 minutes, and the topics from cryptography --with Bruce Schneier starring on the first episode-- to psychology, there is something on it for everyone, and at a very accessible level. I personally found the format very nice and entertaining, and it is maybe something to learn about when we talk about privacy with people not so interested in the topic.

Finally, the people behind the project also gathered a nice list with things you can do to face the current privacy situation.

https://project.wnyc.org/privacy-paradox/

Tuesday, April 11, 2017

Robust encryption for symmetric primitives

Robustness for encryption schemes has been introduced in the context of public-key encryption in the work of Abdalla, Bellare and Neven. In a nutshell, it states that a ciphertext cannot be decrypted under two different keys. Later, their work has been extended, by taking into account the cases where the keys are adversarially generated.
The work of Farshim, Orlandi and Roşie studies the robustness in the context of symmetric primitives under the incorrect usage of keys. Roughly speaking, a  key-robust scheme does not output ciphertexts/tags that are valid with respect to distinct keys. Key-robustness is a notion that is often tacitly expected/assumed in protocol design --- as is the case with anonymous auction, oblivious transfer, or public-key encryption.

To motivate the new notion, "consider the following protocol, for constructing a ${3 \choose 2}$-OT protocol using only ${3 \choose 1}$-OTs: the sender picks $3$ random keys $k_1,k_2,k_3$ and inputs the message $x_1=(k_2,k_3),x_2=(k_1,k_3)$ and $x_3=(k_1,k_2)$ to the OT. At the same time, the sender sends encryptions of his messages under these keys, i.e., sends $c_i=E(k_i,m_i)$ for $i=1..3$. Now the receiver inputs the index of the message he does not want to learn to the ${3 \choose 1}$-OT and learns all keys except $k_i$. Intuitively the fact that the messages are sent only once (encrypted) should guarantee that the sender's choice of messages is uniquely defined. However, consider the following attack: the corrupt sender inputs $x^*_1=(k_2, k^*)$ (instead of $x_1$) such that $D(k^*,c_3)=m^*_3$ with $m^*_3\neq m_3$ and $m^*_3\neq \bot$. This means that the receiver will see two different versions of $m_3$ depending on whether the receiver asked for the pair $(2,3)$ or $(1,3)$. (This attack is an example of input-dependence and is a clear breach of security since it cannot be simulated in the ideal world)."
In terms of the results, the new work considers "both notions where the adversary has control over the keys and notions where the keys are generated honestly. The strongest notion that is formulated is called complete robustness and allows an adversary to generate the keys used in the system. The work shows that whether the adversary is in control of the keys or not makes a significant difference, by giving separations between the notions. While previous work in the public-key setting also had to deal with adversarially generated keys that were also invalid, this is not an issue in the setting, since in the symmetric world keys are often bit-strings of some pre-specified length and can be easily checked for validity. By focusing on correctly formed keys it can can shown equivalence between  complete robustness and a syntactically simpler notion, which we call full robustness." Then, it is shown that full robustness composes well: any fully robust symmetric encryption when combined with a fully robust MAC results in a fully robust AE scheme. Analogous composition results also hold for MAC-then-Encrypt and Encrypt-and-MAC.

One of the most interesting question is if "feasibility results for robustness in the public-key setting can be translated to the symmetric setting. This turns out not to be the case. The main reason for this is that in the asymmetric setting the public key can be used as a mechanism to commit to its associated secret key. In the symmetric case, on the other hand, there is no such public information. It might be tempting to think that one can just commit to the secret key and append it to the ciphertext. Unfortunately this approach cannot be proven secure due to a circular key-dependency between the encryption and the commitment components. To give a provably secure construction, with the authors constructing appropriate commitments that can be used in this setting. This requires  a right-injective PRG, that can be in turn based on one-way permutations. This result relies on the one-time security of the MAC and its collision-resistance, which once again is based on right-injective PRGs."

Monday, April 3, 2017

Three everyday git scenarios

You may remember Ralph's post on git basics from a little over a year ago. In this post, I'll share three things I've learned are possible (and practically painless) to do with git that go beyond the basic add, merge, push, and pull. Everything in this post is done from the command line.

Scenario 1

You're working on a project with others. These hardworking colleagues of yours just pushed a bunch of commits and you want to see exactly what they changed since your last commit.

What to do

Use git diff. Besides using it to show unstaged changes, you can also use git diff to show changes between any two commits. Suppose that after your last commit, your colleagues made 4 commits. To see the changes, use git diff HEAD~4 HEAD or git diff HEAD^^^^ HEAD. If you don't want to count how many commits there were, you can look up the abbreviated commit IDs (SHA-1 hashes in hex) using git log --oneline. Then, you could use something like git diff 54180a8 129ec44.

There's a lot more to git diff: you can use it to compare specific files, you can show differences at the word level rather than at the line level, etc. If you want to learn more about it, see its page on the git website. If you're working on LaTeX files, git-latexdiff is convenient: given any two commits, it will produce a PDF file showing the changes, with removed text striked through and in red, and added text underlined and in blue.

Scenario 2

You were in the zone, hacking away at multiple issues and successfully completing all of them. Well done, you! Wait—each issue should have its own commit...

What to do

Use interactive staging: use git add -p or git add --patch to add the file or files. Git will look at each file you want to add and split the changes into hunks. For each hunk, it will show you the diff and ask you what to do with it:

Stage this hunk [y,n,q,a,d,/,s,e,?]?

Here is the full list of options. They're pretty easy to remember, especially if you're a vi user who is used to navigating with HJKL.

y - stage this hunk
n - do not stage this hunk
q - do not stage this hunk or any of the remaining ones
a - stage this hunk and all later hunks in the file
d - do not stage this hunk or any of the later hunks in the file
g - select a hunk to go to
/ - search for a hunk matching the given regex
j - leave this hunk undecided, see next undecided hunk
J - leave this hunk undecided, see next hunk
k - leave this hunk undecided, see previous undecided hunk
K - leave this hunk undecided, see previous hunk
s - split the current hunk into smaller hunks
e - manually edit the current hunk
? - print help

git add -p is a powerful command that helps you keep your commits reasonably sized. It does require some care, though, since each individual commit should be consistent.

Scenario 3

You have to stop working, but you haven't finished fixing the issue you're working on.

What to do

You should do something because you don't want to have a dirty working directory.

Option 1: commit now, then use git commit --amend (introduced in Ralph's post) once you've finished what you were working on. git commit --amend is useful for a bunch of other things, like adding files you forgot to stage to a commit and fixing typos in a commit message.

Commits are local, so what should you do if you're worried about hardware failure? If you're working on a personal project, it may be acceptable to push this commit and later push the amended commit. You'll need the -f (--force) flag for the latter push. If you're working on a project with others, however, it would be bad practice to amend a commit that you've already pushed. Instead, you could create a temporary personal branch, commit your changes to this branch, and push it to the remote repository. Then, you could push the amended commit to this branch without worrying about rewriting anyone else's history and merge it with the main branch when you've finished fixing the issue.

Option 2: shelve your changes with git stash, which will sweep away both staged and unstaged changes, leaving you with a clean working directory. You can stash changes more than once. To see all stashes, use git stash list. To re-apply the most recent stash you made, use git stash pop. By default, git stash excludes new files (that haven't yet been staged) and ignored files. git stash is also useful when you want to see upstream changes made by someone else, but aren't ready to commit your work.

There's much more you can do with stashes: apply one stash to multiple branches, delete stashes you no longer need, stash parts of a file, etc. Stashes are always local; they can never be pushed to a remote repository. Atlassian has a good, detailed tutorial on git stash.

Some books you may (or may not) find useful.

Tuesday, March 21, 2017

Learn You a GPU For Great Good! (Part 1?)

Side note: I stole the title from the most famous, most awesome Haskell book I know.

If you are reading this blog you are most likely interested in cryptography. Today I want to convince you that GPUs are also, well, pretty awesome. I have personally done a few crypto-related projects using GPUs and this post is my attempt at crystallizing the knowledge and experience I built up during that time.

The purpose of this post is to provide a simple, meaningful introduction to developing GPU-accelerated programs. We will discuss setup, the two primary frameworks, basic code examples and development workflow as well as some optimization tips. In the end, I want to show that developing this type of application is not hard at all. If the post is successful I may do a follow-up with a few more detailed and tricky examples. Throughout this post I will assume you are familiar with basic C and/or C++, as the code examples will be in that language. I will not focus too much on develop complicated kernels or how to exploit multi-dimensional parallelism, I will leave that for a later post. Instead, I will focus on a few things that may help you in making the firsts steps towards GPU programming easier, as well as a few things that may help it scale a bit better.

The Why & When


GPU programming was originally designed for, and should be used for, large-scale parallel computation problems. The more parallelism you can utilize, the better GPUs will fit your problem. The most simple example is probably when you loop over a very large collection of elements, performing on each a simple operation independently.

For large-scale parallel computation problems I tend to think of three different architectural setups that you can use (they also mix). The simplest is utilizing multi-core CPUs (possibly over many machines). This has the shortest development time due to its familiarity and easy-of-use and is suitable to many applications. CPUs are of course trivially available. On the other end of the spectrum is the development of custom hardware clusters, utilizing many FPGAs or even ASICs. Development time is fairly long, even for experienced hardware designers; the upside is that this very likely gives you optimal performance.

GPUs fall somewhere in the middle. Development time is very close to that for CPUs; the primary constraint is availability. It is simply easier to get access to CPU clusters. However, these days you can also rent all the GPU power you need from Amazon EC2 instances, as was done for the recent SHA1 collision. If you solve the availability issue, you can get a lot of bang out of your buck performance-wise.

The How


First, you need to get your hands on a machine with a GPU, preferably a remote machine or otherwise a machine with more than one GPU. The reason is that if your GPU is also driving your desktop environment, programming errors may cause your computer to hang or crash. It also allows you to more easily run long-lasting kernels as well as giving you more reliable performance.

CUDA vs OpenCL


Assuming you have a GPU in your system, your next choice is between CUDA and OpenCL, two programming environments for GPU programming. If you do not plan to use an NVIDIA GPU you are stuck with OpenCL, whereas you otherwise have the choice of using CUDA.
Having used both for different projects at different times, I can say that both are perfectly usable and that the differences are mostly superficial. OpenCL is more portable and integrates easier into existing projects; CUDA has the superior tool-chain.

The examples in this post will be for CUDA, as it typically involves less boilerplate. Also, we will use the more basic CUDA C++ implementation, as it provides a better basis for understanding than special-purpose libraries. This is particularly relevant if you want to computations that are not a native part of these libraries, which is definitely true if you want to, for instance, compute CPA-like correlations in parallel.

Hello World


I am not one to break tradition and thus we start the "Hello world" of classic parallel programming, namely SAXPY. Or, more formally, given input vectors \(\textbf{x}, \textbf{y}\) of length \(n\) and a scalar \(a\), compute the output vector \(\textbf{z}\) where \(\textbf{z} = a\textbf{x} + \textbf{y}\).
First let us consider the basic C implementation of this function, where \(z = y\), i.e., we update \(y\) using the scalar \(a\) and a vector \(x\).

 1: void saxpy(int n, float a, float * __restrict__ x, float * __restrict__ y) {
 2:   for (int i = 0; i < n; ++i) {
 3:     y[i] = a*x[i] + y[i];
 4:   }
 5: }
 6: 
 7: // ...
 8: int n = 1 << 20;
 9: // allocate vectors x,y of n elements each.
10: // ...
11: 
12: saxpy(n, 3.14, x, y);


Nothing too special going on here. We simply iterate over every element and perform our update with the scalar \(a=3.14\). Note the use of the __restrict__ keyword to indicate that x and y point to different objects in memory. Just giving the compiler a helping hand, which is generally a useful thing to do. Anything that makes it behave less like a random function, I say.

Conversion to CUDA is straightforward. In GPU programming you are always defining what a single parallel unit of computation is doing, this is called a kernel. When programming such a kernel, you are computing from the point of view of the thread. Before delving in too deep, let us see what the CUDA-equivalent code looks like.

 1: __global__
 2: void saxpy(int n, float a, float * __restrict__ x, float * __restrict__ y) {
 3:   int i = blockIdx.x*blockDim.x + threadIdx.x;
 4:   if (i < n) {
 5:     y[i] = a*x[i] + y[i];
 6:   }
 7: }
 8: 
 9: // ...
10: const int n = 1<<20;
11: 
12: // allocate and initialize host-side buffers x,y
13: // ...
14: 
15: // allocate device-side buffers x,y
16: cudaMalloc((void **)&d_x, sizeof(float) * n);
17: cudaMalloc((void **)&d_y, sizeof(float) * n);
18: 
19: // copy host-side buffers to the device
20: cudaMemcpy(d_x, x, sizeof(float) * n, cudaMemcpyHostToDevice);
21: cudaMemcpy(d_y, y, sizeof(float) * n, cudaMemcpyHostToDevice);
22: 
23: // compute the saxpy
24: const int threads_per_block = 256;
25: const int number_of_blocks = n / threads_per_block;
26: saxpy<<<number_of_blocks, threads_per_block>>>(n, 3.14, d_x, d_y);
27: 
28: // copy the output buffer from the device to the host
29: cudaMemcpy(y, d_y, sizeof(float) * n, cudaMemcpyDeviceToHost);
30: 
31: // free the device buffers
32: cudaFree(d_x);
33: cudaFree(d_y);
34: 
35: // clean up the x, y host buffers
36: // ...

Let us consider the kernel first, denoted by the simple fact of the function definition starting with __global__. The parameters to the function are the same as before, nothing special there. Line 3 is a key first step in any kernel: we need to figure out the correct offset into our buffers x and y. To understand this, we need to understand CUDA's notion of threads and blocks (or work groups and work items in OpenCL).
The Grid
The CUDA threading model is fairly straightforward to imagine. A thread essentially computes a single instance of a kernel. These threads form groups called blocks that have somewhat-more-efficient inter-thread communication primitives. The blocks together form what is known as the grid. The grid can have up to three dimensions, i.e., the blocks can be ordered into \((x,y, z)\) coordinates. The same goes for threads inside a block, they can be addressed with \((x, y, z)\) coordinates as well.
Mostly though, I have tended to stick to 1-dimensional grids. This is simply dividing a vector of \(n\) elements into $n/m$-sized sequential blocks (even better if \(n\) is a multiple of \(m\)). 
A quick note about warps (or wavefronts in OpenCL), which is a related concept. A warp is a unit of scheduling, it determines the amount of threads that actually execute in lockstep. It is good practice to have your block size as a multiple of the warp size but other than that you should not worry overly much about warps.
In this case we find our thread by multiplying the block id with the size of block and then adding the offset of the thread within the block. The rest of the kernel is straightforward, we simply perform the same computation as in the original code but we omit the for-loop. The conditional at line 4 makes sure we do not write outside the bounds of our vector, though that should not happen if we choose our grid carefully.

The rest of the code is the standard boilerplate that you will find in most CUDA programs. A key notion is that there is a distinction between buffers allocated on the device (the GPU) and buffers allocated on the host. Note that on line 26 we schedule the kernel for execution. The first two weird-looking parameters (within angle brackets) are the number of blocks and the block size respectively.

Improving & Testing "Hello World"


To showcase a few things that I found helpful we are going to improve this simple code example. And because this is my blog post and I decide what is in it, I get to talk to you about how to test your code. GPU code tends to be a bit flaky: it breaks easily. Thus, I argue that creating simple tests for your code is essential. These do not have to be very complicated but I recommend that you use a proper framework for writing unit tests. For C++ I have had success with Catch and doctest, both single-headers that you include into your project.

Before we include these tests however, I propose that we make two more changes to the program. First of all, we are going to add better error checking. Most of the cudaFoo functions return a value indicating whether the operation was successful. Otherwise, we get something which we can use to determine the error.

1: #define check(e) { _check((e), __FILE__, __LINE__); }
2: 
3: inline cudaError_t _check(cudaError_t result, const char *file, int line) {
4:   if (result != cudaSuccess) {
5:     fprintf(stderr, "CUDA Runtime Error: %s (%s:%d)\n", cudaGetErrorString(result), file, line);
6:     assert(result == cudaSuccess);
7:   }
8:   return result;
9: }

And then simply wrap the cudaFoo functions with this check macro. Alternatively, you may want to rewrite this to use exceptions instead of asserts. Pick your poison.

Another thing I would recommend adding if you are doing CUDA in C++ is wrapping most of the allocation and de-allocation logic in a class. I generally take a more utilitarian view of classes for simple pieces of code and thus the following is not necessarily idiomatic or good C++ code.

 1: class Saxpy {
 2: public:
 3:   const int n;
 4:   float *d_x;
 5:   float *d_y;
 6:   float *x;
 7:   float *y;
 8: 
 9:   Saxpy(const int n) : n(n) {
10:     x = new float[n];
11:     y = new float[n];
12: 
13:     check(cudaMalloc((void **)&d_x, sizeof(float) * n));
14:     check(cudaMalloc((void **)&d_y, sizeof(float) * n));
15:   }
16: 
17:   ~Saxpy() {
18:     check(cudaFree(d_x));
19:     check(cudaFree(d_y));
20: 
21:     delete[] x;
22:     delete[] y;
23:   }
24: 
25:   Saxpy& fill() {
26:     for (int i = 0; i < n; ++i) {
27:       x[i] = i / 12.34;
28:       y[i] = i / 56.78;
29:     }
30: 
31:     check(cudaMemcpy(d_x, x, sizeof(float) * n, cudaMemcpyHostToDevice));
32:     check(cudaMemcpy(d_y, y, sizeof(float) * n, cudaMemcpyHostToDevice));
33: 
34:     return *this;
35:   }
36: 
37:   Saxpy& run(float a) {
38:     const int threads_per_block = 256;
39:     const int number_of_blocks = n / threads_per_block;
40:     saxpy<<<number_of_blocks, threads_per_block>>>(n, a, d_x, d_y);
41: 
42:     return *this;
43:   }
44: 
45:   Saxpy& load() {
46:     check(cudaDeviceSynchronize());
47:     check(cudaMemcpy(y, d_y, sizeof(float) * n, cudaMemcpyDeviceToHost));
48:     return *this;
49:   }
50: };


Why we went through all this trouble becomes clear if we put this in a test (I am using doctest syntax as an example).

 1: TEST_CASE("testing saxpy") {
 2:   float a = 3.14;
 3:   const int n = 1024;
 4:   Saxpy s(n);
 5: 
 6:   s.fill().run(a).load();
 7: 
 8:   for (int i = 0; i < n; ++i) {
 9:     // we didn't keep the old y values so we recompute them here
10:     float y_i = i / 56.78;
11:     // the approx is because floating point comparison is wacky
12:     CHECK(s.y[i] == doctest::Approx(a * s.x[i] + y_i));
13:   }
14: }

That is a, for C++ standards, pretty concise test. And indeed, our tests succeed. Yay.

===============================================================================
[doctest] test cases:    1 |    1 passed |    0 failed |    0 skipped
[doctest] assertions: 1024 | 1024 passed |    0 failed |

A Final Improvement


Because this post is already too long I will conclude with one last really nice tip that I absolutely did not steal from here. Actually, the NVIDIA developer blogs contain a lot of really good CUDA tips.
Our current kernel is perfectly capable of adapting to a situation where we give it less data than the grid can support. However, if we give it more data, things will break. This is where gride-stride loops come in. It works by looping over the data one grid at a time while maintaining coalesced memory access (which is something I will write about next time).

Here's our new kernel using these kinds of loops.

1: __global__
2: void saxpy(int n, float a, float *x, float *y) {
3:   for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
4:     y[i] = a * x[i] + y[i];
5:   }
6: }

Conclusion


I hope this convinces you that GPU programming is actually pretty simple. The kernel here is pretty trivial, but as long as you understand that within the kernel you can basically write C/C++, you are going to do just fine.

If there is a next post I will write more about memory in GPUs, a very important topic if you want your code to actually run fast. If you want to skip ahead you should read about the different types of memory (global, local, shared, texture, etc.) and what memory coalescing entails.

Until next time.

Tuesday, March 14, 2017

What are lattices?

Do all the cool post-quantum cryptographers around you talk about lattices and all you can think about is a good salad?

Don't worry! You can be part of the hip crowd at the next cocktail party in just three easy steps!
 

First, I will tell you what a lattice is. It is actually really easy.
Second, we will prove a tiny fact about lattices so that those poor neurons that did all the work in the first step have some friends.
Third, nothing. There isn't even a third step. That's how easy it is!

So, let's tackle the first step. What, actually, is a lattice?
The answer could not be easier. It is simply a grid of points. Like this:



Easy, right?
The formal definition is just as simple:

Given $n$ linearly independent vectors $b_1, b_2, \ldots, b_n \in \mathbb{R}^m$, the lattice generated by them is defined as
\[
\mathcal{L}(b_1, b_2, \ldots, b_n) = \left\{ \sum x_i b_i \middle| x_i \in \mathbb{Z} \right\}.
\]

We can add those so called basis vectors $b_i$ to our diagram and will readily see how each point is constructed.



And now you already know what a lattice is! In the next step we will talk about some more definitions and show the proof of a theorem.

We will be discussing something related to the question of what the length of the shortest non-zero vector in a given lattice $\mathcal{L}$ is.

Definition:
$\lambda_1(\mathcal{L})$ is the length of the shortest non-zero vector in $\mathcal{L}$. We use the euclidean norm for the definition of length.
And we also need the volume of a lattice which we are not going to define formally we will just take it to be the volume of that $n$ dimensional parallelogram in the picture:



Now we can already prove Blichfeld's Theorem! It makes a statement about where we can find a lattice point and roughly goes as follows:
For a lattice $\mathcal{L}\subseteq \mathbb{R}^n$ and a set $S \subseteq \mathbb{R}^n$ where the volume of $S$ is bigger than the volume of the lattice there exist two nonequal points $z_1, z_2 \in S$ such that $z_1 - z_2 \in \mathcal{L}$.
And here is the idea of the proof in a graphical way! First we simply draw an example for the set $S$ in blue:


Now we cut up our set and move all the parts into our first parallelogram!

As you can see there is quite a bit of overlap and the fact that the full volume of the set is bigger than the volume of the lattice aka the parallelogram guarantees there will be at least some overlap somewhere. But if there is overlap then we have two points, that have been moved by multiples of the base vectors and which are now at the same position!


And therefore their difference must be a multiple of base vectors as well and we have found a difference which is a lattice point! Hooray!

Now we will use a little trick to expand this result and make a statement about an actual point from the set being on the lattice not just the difference of two points!

What we will show is called Minkowski's Convex Body Theorem and it states roughly
Let $\mathcal{L}$ be a lattice. Then any centrally-symmetric convex set $S$, with volume bigger than $2^n$ times the volume of the lattice contains a nonzero lattice point.
So after this we will know such a set it will contain a lattice point and using a simple sphere as the set allows us to put a bound on $\lambda_1(\mathcal{L})$.

Let's get to it then!
First we blow up our lattice to twice its size along every dimension!






Now we add our centrally-symmetric convex set $S$. Again in blue.
 

And because we picked the volume of $S$ to be bigger than $2^n$ times the volume of the lattice we still get the colliding points from the last theorem EVEN in the double size lattice!



But since our set $S$ is symmetric about the origin if $z_2 \in S$ it follows that $-z_2 \in S$ and because it is convex $z_1, -z_2 \in S$ implies $\frac{z_1 - z_2}{2} \in S$. And because we've double the size of the lattice and $z_1 - z_2$ is on the bigger lattice it follows that $\frac{z_1 - z_2}{2} \in \mathcal{L}$ and we have shown that a point in our set is also in the lattice.

You can see it quite clearly in the next picture.


As already stated above if we use a simple sphere for this set we can give a bound on $\lambda_1(\mathcal{L})$ based on the volume of $\mathcal{L}$. It is known as Minkowski's First Theorem and states:
 \[
\lambda_1(\mathcal{L}) \leq \sqrt{n}(vol(\mathcal{L})^{1/n}.
\]
Isn't that great?!

If you have now completely fallen in love with lattices but would appreciate real math instead of pictures, worry not!
You will find steps 3 - 14 on Oded Regev's course website! Enjoy!

And in case you accidentally run into him at a conference here is the picture from his website so you don't miss the opportunity to say hello.