Tabular Data – Summer of Code

Parquet.jl enhancements

Efficient storage of tabular data is an important component of the data analysis story in the ecosystem. Julia has many options here – JLD, JuliaDB’s built-in serialization, CSV.write. These either suffer from lack of performance or lack of standardization. Parquet is a format for efficient storage of tabular data used in the Hadoop world. It has compression techniques which reduce disk usage as well as speed up reads. A well-rounded Parquet implementation in Julia will solve the current issues with storage formats and let Julia interoperate with software from the Hadoop world.

Parquet.jl currently contains a reader for Parquet files. This project involves implementing the writer for Parquet files, as well as some enhancements to the reading functionality.


Reader enhancements:

Read a file as a NamedTuple of vectors (using NamedTuples.jl on Julia 0.6). This is on similar lines, but different from the current cursor-based reader. Probably as an implementation of AbstractBuilder that returns NamedTuple of column vectors, combined with a new iterator/cursor that returns a bunch of records instead of individual records.

Writer support:

Mentors: Tanmay Mohapatra

GPU support in JuliaDB

JuliaDB is a distributed analytical database. It uses Julia’s multi-processing for parallelism at the moment. GPU implementations of some operations may allow relational algebra with low latency. In this project, you will be required to add basic GPU support in JuliaDB.

Mentors: Shashi Gowda, Mike Innes

A SQL backend for Query.jl

Query.jl is designed to work with multiple backends. This project would add a SQL backend, so that queries that are formulated with the query commands in Query.jl get translated into an equivalent SQL query that can be run within a SQL database engine. Both LINQ and dplyr support a similar feature set, and this project would enable the same scenario for julia. There is also a small academic literature on this topic that we need to understand and incorporate.

Recommended Skills: Very strong database and SQL skills, previous experience with compilers (this project is essentially a compiler that translates a query AST into SQL) and a strong familiarity with the julia data stack.

Expected Results: A new version of Query.jl that runs queries as SQL in a database.

Mentors: David Anthoff

Tabular file IO

The Queryverse has a large number of file IO packages: CSVFiles.jl, ExcelFiles.jl, FeatherFiles.jl, StatFiles.jl, ParquetFiles and FstFiles.jl. This project will a) do serious performance work across all of the existing packages and b) add write capabilities to a number of them.

Recommended Skills: Experience with file formats, writing performant julia code.

Expected Results: Write capabilities across the packages listed above, competitive performance for all the packages listed above.

Mentors: David Anthoff