# An introduction to ParallelAccelerator.jl

The High Performance Scripting team at Intel Labs recently released ParallelAccelerator.jl, a Julia package for high-performance, high-level array-style programming. The goal of ParallelAccelerator is to make high-level array-style programs run as efficiently as possible in Julia, with a minimum of extra effort required from the programmer. In this post, we’ll take a look at the ParallelAccelerator package and walk through some examples of how to use it to speed up some typical array-style programs in Julia.

## Introduction

Ideally, high-level array-style Julia programs should run as efficiently as possible on high-performance parallel hardware, with a minimum of extra programmer effort required, and with performance reasonably close to that of an expert implementation in C or C++. There are three main things that ParallelAccelerator does to move us toward this goal:

• First, we identify implicit parallel patterns in array-style code the user writes. We’ll say more about these parallel patterns shortly.
• Second, we compile these parallel patterns to explicit parallel loops.
• Third, we minimize runtime overheads incurred by things like array bounds checks and intermediate array allocations.

The key user-facing feature that the ParallelAccelerator package provides is a Julia macro called @acc, which is short for “accelerate”. Annotating functions or blocks of code with @acc lets you designate the parts of your Julia program that you want to compile to optimized native code. Here’s a toy example of using @acc to annotate a function:

Under the hood, ParallelAccelerator is essentially a compiler – itself implemented in Julia – that intercepts the usual Julia JIT compilation process for @acc-annotated functions. It compiles @acc-annotated code to C++ OpenMP code, which can then be compiled to a native library by an external C++ compiler such as GCC or ICC. (This intermediate C++ generation step isn’t essential to the design of ParallelAccelerator, though – instead, the compiler could target Julia’s own forthcoming native threading backend. [1]) On the Julia side, ParallelAccelerator generates a proxy function that calls into that native library, and replaces calls to @acc-annotated functions, like f in the above example, with calls to the appropriate proxy function.

We’ll say more shortly about the parallel patterns that ParallelAccelerator targets and about how the ParallelAccelerator compiler works, but before we do, let’s look at some code and some performance results.

## A quick preview of results: Black-Scholes option pricing benchmark

Let’s see how to use ParallelAccelerator to speed up a classic high-performance computing benchmark: an implementation of the Black-Scholes formula for option pricing. The following code is a Julia implementation of the Black-Scholes formula.

Here, the blackscholes function takes five arguments, each of which is an array of Float64s. The run function initializes these five arrays and passes them to blackscholes, which, along with the cndf2 (cumulative normal distribution) function that it calls, does several computations involving pointwise addition (.+), subtraction (.-), multiplication (.*), and division (./) on the arrays. It’s not necessary to understand the details of the Black-Scholes formula; the important thing to notice about the code is that we are doing lots of pointwise array arithmetic. Using Julia 0.4.4-pre on a 4-core Ubuntu 14.04 desktop machine with 8 GB of memory, the run function takes about 11 seconds to run when called with an argument of 40,000,000 (meaning that we are dealing with 40-million-element arrays):

Here, the 11.297714183 being returned from run is the number of seconds it takes the blackscholes call alone to return. The 12.885293 seconds reported by @time is a little longer, because it’s the running time of the entire run call.

The many pointwise array operations in this code make it a great candidate for speeding up with ParallelAccelerator (as we’ll discuss more shortly). Doing so requires only minor changes to the code: we import the ParallelAccelerator library with using ParallelAccelerator, then wrap the cndf2 and blackscholes functions in an @acc block, as follows:

The definition of run stays the same. With the addition of the @acc wrapper, we now have much better performance:

This time, blackscholes returns in about 3.5 seconds, and the entire run call finishes in about 4 seconds. This is already an improvement, but on subsequent calls to run, we do even better:

In subsequent calls, run finishes in about a second, with the entire call taking about 1.4 seconds. The reason for this additional improvement is that ParallelAccelerator has already compiled the blackscholes and cndf2 functions and doesn’t need to do so again on subsequent runs.

These results were collected on an ordinary desktop machine, but we can scale up further. The following figure reports the time it takes blackscholes to run on arrays of 100 million elements, this time on a 36-core machine with 128 GB of RAM [2]:

The first three bars of the above figure show performance results for ParallelAccelerator using different numbers of threads. Since ParallelAccelerator compiles Julia to OpenMP C++, we can use the OMP_NUM_THREADS environment variable to control the number of threads that the code runs with. Here, with OMP_NUM_THREADS set to 18, blackscholes runs in 0.27 seconds; with 36 threads (matching the number of cores on the machine), running time drops to 0.16 seconds. The third bar shows results for ParallelAccelerator with OMP_NUM_THREADS set to 1, which clocks in at about 3 seconds. For comparison, the rightmost bar show results for “plain Julia”, that is, a version of the code without @acc, which runs in about 21 seconds.

Because Julia doesn’t (yet) have native multithreading support, the plain Julia results shown in the rightmost bar are for one thread. But it is interesting to note that the ParallelAccelerator implementation of Black-Scholes outperforms plain Julia by a factor of about seven, even when running on just one core. The reason for this speedup is that ParallelAccelerator (despite its name!) does more than just parallelize code. The ParallelAccelerator compiler is able to do away with much of the runtime overhead incurred by array bounds checks and allocation of intermediate arrays. After that, with the addition of parallelism, we’re able to do even better, for a total speedup of more than 100x over plain Julia.

To see how ParallelAccelerator accomplishes this, we’ll discuss the parallel patterns that ParallelAccelerator handles in a bit more detail, and then we’ll take a closer look at the ParallelAccelerator compiler pipeline.

## Parallel patterns

ParallelAccelerator works by identifying implicit parallel patterns in source code and making the parallelism explicit. These patterns include map, reduce, array comprehension, and stencil.

### Map

As we saw in the Black-Scholes example above, the .+, .-, .*, and ./ operations in Julia are pointwise array operations that take input arrays as arguments and produce an output array. ParallelAccelerator translates these pointwise array operations into data-parallel map operations. (See the ParallelAccelerator documentation for a complete list of all the pointwise array operations that it knows how to parallelize.) Furthermore, ParallelAccelerator translates array assignments into in-place map operations. For instance, assigning a = a .* b where a and b are arrays would map .* over a and b and update a in place with the result. For both standard map and in-place map, it is possible for ParallelAccelerator to avoid any array bounds checking once we’ve established that the input arrays and the output arrays are the same size.

### Reduce

Reduce operations take an array argument and produce a scalar result by combining all the elements of an array with an associative and commutative operation. ParallelAccelerator translates the Julia functions minimum, maximum, sum, prod, any, and all into data-parallel reduce operations when they are called on arrays.

### Array comprehension

Julia supports array comprehensions, a convenient and concise way to construct arrays. For example, the expressions that initialize the five input arrays in the Black-Scholes example above are all array comprehensions. As a more sophisticated example, the following avg function, taken from the Julia manual, takes a one-dimensional input array x of length n and uses an array comprehension to construct an output array of length n-2, in which each element is a weighted average of the corresponding element in the original array and its two neighbors:

Comprehensions like this one can also be parallelized by ParallelAccelerator: in a nutshell, ParallelAccelerator can transform array comprehensions to code that first allocates an output array and then performs an in-place map that can write to each element of the output array in parallel.

Array comprehensions differ from map and reduce operations in that they involve explicit array indexing. But it is still possible to parallelize array comprehensions in Julia, as long as there are no side effects in the comprehension body (everything before the for). [3] ParallelAccelerator uses a conservative static analysis to try to identify and reject side-effecting operations in comprehensions.

### Stencil

In addition to map, reduce, and comprehension, ParallelAccelerator targets a fourth parallel pattern: stencil computations. A stencil computation updates the elements of an array according to a fixed pattern called a stencil. In fact, the avg comprehension example above could also be thought of as a stencil computation, because it updates the contents of an array based on each element’s neighbors. However, stencil computations differ from the other patterns that ParallelAccelerator targets, because there’s not a built-in, user-facing language feature in Julia that expresses stencil computations specifically. So, ParallelAccelerator introduces a new user-facing language construct called runStencil for expressing stencil computations in Julia. Next, we’ll look at an example that illustrates how runStencil works.

## Example: Blurring an image with runStencil

Let’s consider a stencil computation that blurs an image using a Gaussian blur. The image is represented as a two-dimensional array of pixels. To blur the image, we set the value of each output pixel to a particular weighted average of the corresponding input pixel’s value and the values of its neighboring input pixels. By repeating this process multiple times, we can get an increasingly blurred image. [4]

The following code implements a Gaussian blur in Julia. It operates on a 2D array of Float32s: the pixels of the source image. It’s easy to obtain such an array using, for instance, the load function from the Images.jl library, followed by a call to convert to get an array of type Array{Float32,2}. (For simplicity, we’re assuming that the input image is a grayscale image, so each pixel has just one value instead of red, green, and blue values. However, it would be straightforward to use the same approach for RGB pixels.)

Here, to compute the value of a pixel in the output image, we use the the corresponding input pixel as well as all its neighboring pixels, to a depth of two pixels out from the input pixel – so, twenty-four neighbors. In all, there are twenty-five pixel values to examine. We add all these pixel values together, each multiplied by a weight – in this case 0.0030 for the cornermost pixels, 0.1621 for the center pixel, and for all the other pixels, something in between – and the total is the value of the output pixel. At the borders of the image, we don’t have enough neighboring pixels to compute an output pixel value, so we simply skip those pixels and don’t assign to them. [5]

Notice that the blur function explicitly loops over the number of iterations, that is, times to apply the blur to the the image, but it does not explicitly loop over pixels in the image. Instead, the code is written in array style: it performs just one assignment to the array img, using the ranges 3:w-2 and 3:h-2 to avoid assigning to the borders of the image. On a large grayscale input image of 7095 by 5322 pixels, this code takes about 10 minutes to run for 100 iterations.

Using ParallelAccelerator, we can get much better performance. Let’s look at a version of blur that uses runStencil:

Here, we again have a function called blur – now annotated with @acc – that takes the same arguments as the original code. This version of blur allocates a new 2D array called buf that is the same size as the original img array. The allocation of buf is followed by a call to runStencil. Let’s take a closer look at the runStencil call.

runStencil has the following signature:

In blur, the call to runStencil uses Julia’s do-block syntax for function arguments, so the do b, a ... end block is actually the first argument to the runStencil call. The do block creates an anonymous function that binds the variables b and a. The arguments buffer1, buffer2, ... that are passed to runStencil become the arguments to the anonymous function. In this case, we are passing two buffers, buf and img, to runStencil, and so the anonymous function takes two arguments.

Aside from the anonymous function and the two buffers, runStencil takes two other arguments. The first of these is a number of iterations that we want to run the stencil computation for. In this case, we simply pass along the iterations argument that is passed to blur. Finally, the last argument to runStencil is a symbol indicating how stencil boundaries are to be handled. Here, we’re using the :oob_skip symbol, short for “out-of-bounds skip”. It means that when input indices are out of bounds – for instance, in the situation where the input pixel is one of those on the two-pixel border of the image, and there aren’t enough neighbor pixels to compute the output pixel value – then we simply skip writing to the output pixel. This has the same effect as the careful indexing in the original version of blur.

Finally, let’s look at the body of the do block that we’re passing to runStencil. It contains an assignment to b, using values computed from a. As we’ve said, b and a here are buf and img: our newly-allocated buffer, and the original image. The code here is similar to that of the original implementation of blur, but here we’re using relative rather than absolute indexing into arrays, The index 0,0 in b[0,0] doesn’t refer to any particular element of b, but instead to the current position of a cursor that can be thought of as traversing all the elements of b. On the right side of the assignment. a[-2,-1] refers to the element in a that is two elements to the left and one element up from the 0,0 element of a. In this way, we can express a stencil computation more concisely than the original version of blur did, and we don’t have to worry about getting the indices correct for boundary handling as we had to do before, because the :oob_skip argument tells runStencil everything it needs to no to handle boundaries correctly.

Finally, at the end of the do block, we return a, b. They were bound as b, a, but we return them in the opposite order so that for each iteration of the stencil, we’ll be using the already-blurred buffer as the input for another round of blurring. This continues for however many iterations we’ve specified. There’s therefore no need to write an explicit for loop for stencil iterations when using runStencil; one just passes an argument saying how many iterations should occur.

Therefore runStencil enables us to write more concise code than plain Julia, as we’d expect from a language extension. But where runStencil really shines is in the performance it enables. The following figure compares performance results for plain Julia and ParallelAccelerator implementations of blur, each running for 100 iterations on the aforementioned 7095x5322 source image, run using the same machine as for the previous Black-Scholes benchmark.

The rightmost column shows the results for plain Julia, using the first implementation of blur shown above. The three columns to the left show results for the ParallelAccelerator version that uses runStencil. As we can see, even when running on just one thread, ParallelAccelerator enables a speedup of about 15x: from about 600 seconds to about 40 seconds. Running on 36 threads provides a further parallel speedup of more than 26x, resulting in a total speedup of nearly 400x over plain single-threaded Julia.

## An overview of the ParallelAccelerator compiler architecture

Now that we’ve talked about the parallel patterns that ParallelAccelerator speeds up and seen some code examples, let’s take a look at how the ParallelAccelerator compiler works.

The standard Julia JIT compiler parses Julia source code into the Julia abstract syntax tree (AST) representation. It performs type inference on the AST, then transforms the AST to LLVM IR, and finally generates native assembly code. ParallelAccelerator intercepts this process at the level of the AST. It introduces new AST nodes for the parallel patterns we discussed above. It then does various optimizations on the resulting AST. Finally, it generates C++ code that can be compiled by an external C++ compiler. The following figure shows an overview of the ParallelAccelerator compilation process:

As many readers of this blog will know, Julia has good support for inspecting and manipulating its own ASTs. Its built-in code_typed function will return the AST of any function after Julia’s type inference has taken place. This is very convenient for ParallelAccelerator, which is able to use the output from code_typed as the input to the first pass of its compiler, which is called “Domain Transformations”. The Domain Transformations pass produces ParallelAccelerator’s Domain AST intermediate representation.

Domain AST is similar to Julia’s AST, except it introduces new AST nodes for parallel patterns that it identifies. We call these nodes “domain nodes”, collectively. The Domain Transformations pass replaces certain parts of the AST with domain nodes.

The Domain Transformations pass is followed by the Parallel Transformations pass, which replaces domain nodes with “parfor” nodes, each of which represents one or more nested parallel for loops. Loop fusion also takes place during the Parallel Transformations pass. We call the result of Parallel Transformations Parallel AST. [6]

The compiler hands off Parallel AST code to the last pass of the compiler, CGen, which generates C++ code and converts parfor nodes into OpenMP loops. Finally, an external C++ compiler creates an executable which is linked to OpenMP and to a small array runtime component written in C that manages the transfer of arrays back and forth between Julia and C++.

## Caveats

ParallelAccelerator is still a proof of concept at this stage. Users should be aware of two issues that can stand in the way of being able to make effective use of ParallelAccelerator. Those issues are, first, package load time, and second, limitations in what Julia programs ParallelAccelerator is able to handle. We discuss each of these issues in turn.

Because ParallelAccelerator is a large Julia package (it’s a compiler, after all), it takes a long time (perhaps 20 or 25 seconds on a 4-core desktop machine) for using ParallelAccelerator to run. This long pause is not the time that ParallelAccelerator is taking to compile your @acc-annotated code; it’s the time that Julia is taking to compile ParallelAccelerator itself. After this initial pause, the first call to an @acc-annotated function will incur a brief compilation pause (this time from the ParallelAccelerator compiler, not Julia itself) of perhaps a couple of seconds. Subsequent calls to the same function won’t incur the compilation pause.

Let’s see what these compilation pauses look like in practice. The ParallelAccelerator package comes with a collection of example programs that print timing information, including the Black-Scholes and Gaussian blur examples shown in this post. All the examples print timing information for two calls to an @acc-annotated function: first a “warm-up” call with trivial arguments to measure compilation time, and then a more realistic call. In the output printed by each example, timing information for the more realistic call is preceded by the string "SELFTIMED", while timing information for the warm-up call is preceded by "SELFPRIMED". Let’s run the Black-Scholes example and time it using the time shell command:

Here, we’re running Black-Scholes for 10,000,000 iterations on our 4-core desktop machine. The total wall-clock time of 26.454 seconds consists mostly of the time it takes for using ParallelAccelerator to run. Once that’s done, Julia reports a SELFPRIMED time of about 1.8 seconds, which is dominated by the time it takes for ParallelAccelerator to compile the @acc-annotated code, and finally the SELFTIMED time is about 0.05 seconds for this problem size.

As Julia’s compilation speed improves, we expect that package load time will be less of a problem for ParallelAccelerator.

### Compiler limitations

ParallelAccelerator is able to handle only a limited subset of Julia language features, and it only supports a limited subset of Julia’s Base library functions. In other words, you cannot yet put an @acc annotation on arbitrary Julia code and expect it to go faster out of the box. The examples in this post give an idea of what kinds of programs are supported currently; for more, check out the full collection of ParallelAccelerator examples. However, if ParallelAccelerator can’t compile some code in an @acc-annotated function, it will simply fall back to running the function under regular Julia. So your code will run, regardless of whether ParallelAccelerator can speed it up.

One reason why an @acc-annotated function might fail to compile is that ParallelAccelerator tries to transitively compile every Julia function that is called by the @acc-annotated function. So, if an @acc-annotated function makes several Julia library calls, ParallelAccelerator will attempt to compile those functions as well – and every Julia function that they call, and so on. If any of the code in the call chain contains a feature that ParallelAccelerator doesn’t currently support, ParallelAccelerator will fail to compile the original @acc-annotated function. It is therefore a good idea to begin by annotating small (but expensive) computational kernels with @acc, rather than wrapping an entire program in an @acc block. The ParallelAccelerator documentation has many more details on which Julia features we don’t support and why.

These limitations explain why the kind of performance improvements that ParallelAccelerator provides aren’t already the default in Julia. Supporting all of Julia would be a major undertaking; however, in many cases, there’s not a fundamental reason why ParallelAccelerator couldn’t support a particular Julia feature or a function in Base, and supporting it is a matter of realizing that it is a problem for users and putting in the necessary engineering effort to fix it. So, when you come across code that ParallelAccelerator can’t handle, please do file bugs!

## Conclusion

In this post, we’ve introduced ParallelAccelerator.jl, a package for speeding up array-style Julia programs. It works by identifying implicit parallel patterns in source code and compiling them to efficient, explicitly parallel executables, along the way getting rid of many of the usual overheads of high-level array-style programming.

ParallelAccelerator is an open source project in its early stages, and we enthusiastically encourage comments, questions, bug reports, and contributions from the Julia community. We welcome everyone’s participation, and we are especially interested in how ParallelAccelerator can be used to speed up real-world Julia programs.

[1] Starting with Julia 0.5, Julia will have its own native threading support, which means that ParallelAccelerator can target Julia’s own native threads instead of generating C++ OpenMP code for parallelism. We’ve begun work on implementing a native-threading-based backend for ParallelAccelerator, but we still target C++ by default.

[2] Detailed machine and benchmarking specifications: We use a machine with two Intel Xeon E5-2699 v3 processors (2.3 GHz) with 18 physical cores each and 128 GB RAM, running the CentOS 6.7 Linux distribution. We use the Intel C++ Compiler (ICC) v15.0.2 with “-O3” for compilation of the generated C++ code. The Julia version is 0.4.4-pre+26. The results shown are the average of three runs (we run each version of a benchmark five times and discard the first and last runs).

[3] In Julia, it is not possible to index into a comprehension’s output array in the body of the comprehension. (The avg example indexes only into the input array, not the output array.) Therefore, it’s not necessary to do any bounds checking on writes to the output array. However, we still need to bounds-check reads from the input array (for instance, in the avg example, if we’d written 0.25*x[i-2], that would be out of bounds), so we cannot avoid all array bounds checking for comprehensions in the way that we can for map operations.

[4] In practice, rather than applying successive Gaussian blurs to an image, we’d probably apply a single, larger Gaussian blur, which, as Wikipedia notes, is at least as efficient computationally. Nevertheless, we’ll use it here as an example of a stencil computation that can be iterated.

[5] A more sophisticated implementation of Gaussian blur might do a fancier form of border handling, using only the pixels it has available at the borders.

[6] The names “Domain AST” and “Parallel AST” are inspired by the Domain IR and Parallel IR of the Delite compiler framework.