Generalizing AbstractArrays: opportunities and challenges

27 March 2016 | Tim Holy

Introduction: generic algorithms with AbstractArrays

Somewhat unusually, this blog post is future-looking: it mostly focuses on things that don't yet exist. Its purpose is to lay out the background for community discussion about possible changes to the core API for AbstractArrays, and serves as background reading and reference material for a more focused "julep" (a julia enhancement proposal). Here, often I'll use the shorthand "array" to mean AbstractArray, and use Array if I explicitly mean julia's concrete Array type.

As the reader is likely aware, in julia it's possible to write algorithms for which one or more inputs are only assumed to be AbstractArrays. This is "generic" code, meaning it should work (i.e., produce a correct result) on any specific concrete array type. In an ideal world — which julia approaches rather well in many cases — generality of code should not have a negative impact on its performance: a generic implementation should be approximately as fast as one restricted to specific array type(s). This implies that generic algorithms should be written using lower-level operations that give good performance across a wide variety of array types.

Providing efficient low-level operations is a different kind of design challenge than one experiences with programming languages that "vectorize" everything. When successful, it promotes much greater reuse of code, because efficient, generic low-level parts allow you to write a wide variety of efficient, generic higher-level functions.

Naturally, as the diversity of array types grows, the more careful we have to be about our abstractions for these low-level operations.

Examples of arrays

In discussing general operations on arrays, it's useful to have a diverse collection of concrete arrays in mind.

In core julia, some types we support fairly well are:

Another important class of array types in Base are sparse arrays: SparseMatrixCSC and SparseVector, as well as other sparse representations like Diagonal, Bidiagonal, and Tridiagonal. These are good examples of array types where access patterns deserve careful thought. Notably, despite many commonalities in "strategy" among the 5 or so sparse parametrizations we have, implementations of core algorithms (e.g., matrix multiplication) are specialized for each sparse-like type — in other words, these mimic the "high level vectorized functions" strategy common to other languages. What we lack is a "sparse iteration API" that lets you write the main algorithms of sparse linear algebra efficiently in a generic way. Our current model is probably fine for SparseLike*Dense operations, but gets to be harder to manage if you want to efficiently compute, e.g., Bidiagonal*SparseMatrixCSC: the number of possible combinations you have to support grows rapidly with more sparse types, and thus represents a powerful incentive for developing efficient, generic low-level operations.

Outside of Base, there are some other mind-stretching examples of arrays, including:

SubArrays: a case study

For arrays of fixed dimension, one can write algorithms that index arrays as A[i,j,k,...] (good examples can be found in our linear algebra code, where everything is a vector or matrix). For algorithms that have to support arbitrary dimensionality, for a long time our fallback was linear indexing, A[i] for integer i. However, in general SubArrays cannot be efficiently accessed by a linear index because it results in call(s) to div, and div is slow. This is a CPU problem, not a Julia-specific problem. The slowness of div is still true despite the recent addition of infrastructure to make it much faster — now one can make it merely "really bad" rather than "Terrible, Horrible, No Good, and Very Bad".

The way we (largely) resolved this problem was to make it possible to do cartesian indexing, A[i,j,k,...], for arrays of arbitrary dimensionality (the CartesianIndex type). To leverage this in practical code, we also had to extend our iterators with the for I in eachindex(A) construct. This allows one to select an iterator that optimizes the efficiency of access to elements of A. In generic algorithms, the performance gains were not small, sometimes on the scale of ten- to fifty-fold. These types were described in a previous blog post.

To my knowledge, this approach has given Julia one of the most flexible yet efficient "array view" types in any programming language. Many languages base views on array strides, meaning situations in which the memory offset is regular along each dimension. Among other things, this requires that the underlying array is dense. In contrast, in Julia we can easily handle non-strided arrays (e.g., sampling at [1,3,17,428,...] along one dimension, or creating a view of a SparseMatrixCSC). We can also handle arrays for which there is no underlying storage (e.g., Ranges). Being able to do this with a common infrastructure is part of what makes different optimized array types useful in generic programming.

It's also worth pointing out some problems:

Unfortunately, I suspect that if we want to add support for certain new operations or types (specific examples below), it will force us to set the latter problem on fire.

Challenging examples

Some possible new AbstractArray types pose novel challenges.

ReshapedArrays (#15449)

These are the front-and-center motivation for this post. These are motivated by a desire to ensure that reshape(A, dims) always returns a "view" of A rather than allocating a copy of A. (Much of the urgency of this julep goes away if we decide to abandon this goal, in which case for consistency we should always return a copy of A.) It's worth noting that besides an explicit reshape, we have some mechanisms for reshaping that currently cause a copy to be created, notably A[:] or A[:, :] applied to a 3D array.

Similar to SubArrays, the main challenge for ReshapedArrays is getting good performance. If A is a 3D array, and you reshape it to a 2D array B, then B[i,j] must be expanded to A[k,l,m]. The problem is that computing the correct k,l,m might result in a call to div. So ReshapedArrays violate a crutch of our current ecosystem, in that indexing with N integers might not be the fastest way to access elements of B. From a performance perspective, this problem is substantial (see #15449, about five- to ten-fold).

In simple cases, there's an easy way to circumvent this performance problem: define a new iterator type that (internally) iterates over the parent A's indexes directly. In other words, create an iterator so that B[I] immediately expands to A[I'], and so that the latter has "ideal" performance.

Unfortunately, this strategy runs into a lot of trouble when you need to keep two arrays in sync: if you want to adopt this strategy, you simply can't write B[i,j]*v[j] for matrix-vector multiplication anymore. A potential way around this problem is to define a new class of iterators that operate on specific dimensions of an array (#15459), writing B[ii,jj]*v[j]. jj (whatever that is) and j need to be in-sync, but they don't necessarily need to both be integers. Using this kind of strategy, matrix-vector multiplication

for j = 1:size(B, 2)
    vj = v[j]
    for i = 1:size(B, 1)
        dest[i] += B[i,j] * vj
    end
end

might be written in a more performant manner like this:

for (jj, vj) in zip(eachindex(B, Dimension{2}), v)
    for (i, ii) in zip(eachindex(dest), eachindex(B, (:, jj)))
        dest[i] += B[ii,jj]*vj
    end
end

It's not too hard to figure out what eachindex(B, Dimension{2}) and eachindex(B, (:, jj)) should do: ii, for example, could be a CartesianInnerIndex (a type that does not yet exist) that for a particular column of B iterates from A[3,7,4] to A[5,8,4], where the dth index component wraps around at size(A, d). The big performance advantage of this strategy is that you only have to compute a div to set the bounds of the iterator on each column; the inner loop doesn't require a div on each element access. No doubt, given suitable definition of jj one could be even more clever and avoid calculating div altogether. To the author, this strategy seems promising as a way to resolve the majority of the performance concerns about ReshapedArrays — only if you needed "random access" would you require slow (integer-based) operations.

However, a big problem is that compared to the "naive" implementation, this is rather ugly.

Row-major matrices, PermutedDimensionArrays, and "taking transposes seriously"

Julia's Array type stores its entries in column-major order, meaning that A[i,j] and A[i+1,j] are in adjacent memory locations. For certain applications — or for interfacing with certain external code bases — it might be convenient to support row-major arrays, where instead A[i,j] and A[i,j+1] are in adjacent memory locations. More fundamentally, this is partially related to one of the most commented-on issues in all of julia's development history, known as "taking transposes seriously" aka #4774. There have been at least two attempts at implementation, #6837 and the mb/transpose branch, and for the latter a summary of benefits and challenges was posted.

One of the biggest challenges mentioned was the huge explosion of methods that one would need to support. Can generic code come to the rescue here? There are two related concerns. The first is linear indexing: oftentimes this is conflated with "storage order," i.e., given two linear indexes i and j for the same array, the offset in memory is proportional to i-j. For row-major arrays, this notion is not viable, because otherwise a loop

function copy!(dest, src)
    for i = 1:length(src)
        dest[i] = src[i]  # trouble if `i` means "memory offset"
    end
    dest
end

would end up taking a transpose if src and dest don't use the same storage order. Consequently, a linear index has to be defined in terms of the corresponding cartesian (full-dimensionality) index. This isn't much of a real problem, because it's one we know how to solve: use ind2sub (which is slow) when you have to, but for efficiency make row major arrays belong to the category (LinearSlow) of arrays that defaults to iteration with cartesian indexes. Doing so will ensure that if one uses generic constructs like eachindex(src) rather than 1:length(src), then the loop above can be fast.

The far more challenging problem concerns cache-efficiency: it's much slower to access elements of an array in anything other than storage-order. Some reasonably fast ways to write matrix-vector multiplication are

for j = 1:size(B, 2)
    vj = v[j]
    for i = 1:size(B, 1)
        dest[i] += B[i,j] * vj
    end
end

for a column-major matrix B, and

for i = 1:size(B, 1)
    for j = 1:size(B, 2)
        dest[i] += B[i,j] * v[j]
    end
end

for a row-major matrix. (One can do even better than this by using a scalar temporary accumulator, but let's not worry about that here.) The key point to note is that the order of the loops has been switched.

One could generalize this by defining a RowMajorRange iterator that's a lot like our CartesianRange iterator, but traverses the array in row-major order. eachindex claims to return an "efficient iterator," and without a doubt the RowMajorRange is a (much) more efficient iterator than a CartesianRange iterator for row-major arrays. So let's imagine that eachindex does what it says, and returns a RowMajorRange iterator. Using this strategy, the two algorithms above can be combined into a single generic implementation:

for I in eachindex(B)
    dest[I[1]] += B[I]*v[I[2]]
end

Yay! Score one for efficient generic implementations.

But our triumph is short-lived. Let's return to the example of copy! above, and realize that dest and src might be two different array types, and therefore might be most-efficiently indexed with different iterator types. We're tempted to write this as

function copy!(dest, src)
    for (idest, isrc) in zip(eachindex(dest), eachindex(src))
        dest[idest] = src[isrc]
    end
    dest
end

Up until we introduced our RowMajorRange return-type for eachindex, this implementation would have been fine. But we just broke it, because now this will incorrectly take a transpose in certain situations.

In other words, without careful design the goals of "maximally-efficient iteration" and "keeping accesses in-sync" are in conflict.

OffsetArrays and the meaning of AbstractArray

Julia's arrays are indexed starting at 1, whereas some other languages start numbering at 0. If you take comments on various blog posts at face value, there are vast armies of programmers out there eagerly poised to adopt julia, but who won't even try it because of this difference in indexing. Since recruiting those armies will lead to world domination, this is clearly a problem of the utmost urgency.

More seriously, there are algorithms which simplify if you can index outside of the range from 1:size(A,d). In my own lab's internal code, we've long been using a CenterIndexedArray type, in which such arrays (all of which have odd sizes) are indexed over the range -n:n and for which 0 refers to the "center" element. One package which generalizes this notion is OffsetArrays. Unfortunately, in practice both of these array types produce segfaults (due to built-in assumptions about when @inbounds is appropriate) for many of julia's core functions; over time my lab has had to write implementations specialized for CenterIndexedArrays for quite a few julia functions.

OffsetArrays illustrates another conceptual challenge, which can easily be demonstrated by copy!. When dest is a 1-dimensional OffsetArray and src is a standard Vector, what should copy! do? In particular, where does src[1] go? Does it go in the first element of dest, or does it get stored in dest[1] (which may not be the first element).

Such examples force us to think a little more deeply about what an array really is. There seem to be two potential conceptions. One is that arrays are lists, and multidimensional arrays are lists-of-lists-of-lists-of... In such a world view, the right thing to do is to put src[1] into the first slot of dest, because 1 is just a synonym for first. However, this world view doesn't really endow any kind of "meaning" to the index-tuple of an array, and in that sense doesn't even include the distinction conveyed by an OffsetArray. In other words, in this world an OffsetArray is simply nonsensical, and shouldn't exist.

If instead one thinks OffsetArrays should exist, this essentially forces one to adopt a different world view: arrays are effectively associative containers, where each index-tuple is the "key" by which one retrieves a value. With this mode of thinking, src[1] should be stored in dest[1].

Formalizing AbstractArray

These examples suggest a formalization of AbstractArray:

Finding a way forward

Resolving these conflicting demands is not easy. One approach might be to decree that some of these array types simply can't be supported with generic code. It is possible that this is the right strategy. Alternatively, one can attempt to devise an array API that handles all of these types (and hopefully more).

In GitHub issue #15648, we are discussing APIs that may resolve these challenges. Readers are encouraged to contribute to this discussion.