Efficient storage of tabular data is an important part of the data analysis story in the Julia ecosystem. Julia has many options here, including JLD, JuliaDB’s built-in serialization, and CSV.write, but each suffers from either a lack of performance or a lack of standardization. Parquet is a format for efficient storage of tabular data used in the Hadoop world. Its compression techniques reduce disk usage and speed up reads. A well-rounded Parquet implementation in Julia would solve the current issues with storage formats and let Julia interoperate with software from the Hadoop world.
Parquet.jl currently contains a reader for Parquet files. This project involves implementing the writer for Parquet files, as well as some enhancements to the reading functionality.
Read a file as a NamedTuple of vectors (using NamedTuples.jl on Julia 0.6). This is along similar lines to the current cursor-based reader, but different from it. It would probably be implemented as an AbstractBuilder that returns a NamedTuple of column vectors, combined with a new iterator/cursor that returns a batch of records instead of individual records.
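The column-oriented result and the batched cursor described above can be sketched in plain Julia. This is only an illustration under stated assumptions: `columns` and `batches` are hypothetical helper names, not part of Parquet.jl, the records are ordinary in-memory NamedTuples rather than rows decoded from a Parquet file, and the built-in NamedTuple of Julia 1.x is used instead of NamedTuples.jl.

```julia
# Hypothetical helper (not Parquet.jl API): turn a vector of row records
# into a NamedTuple of column vectors.
function columns(rows::AbstractVector{<:NamedTuple})
    isempty(rows) && throw(ArgumentError("need at least one record"))
    colnames = keys(first(rows))
    # Build one vector per column name, in column order.
    cols = map(name -> [getfield(row, name) for row in rows], colnames)
    return NamedTuple{colnames}(cols)
end

# Hypothetical batched cursor: yields one NamedTuple of columns per batch
# of up to `batchsize` records, instead of one record at a time.
batches(rows, batchsize) =
    (columns(batch) for batch in Iterators.partition(rows, batchsize))

records = [(id = 1, name = "a"), (id = 2, name = "b"), (id = 3, name = "c")]
table = columns(records)              # whole input as column vectors
chunks = collect(batches(records, 2)) # two batches: 2 records, then 1
```

Each batch has the same shape as the full table, so downstream code can treat a batch and a whole file uniformly.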
Mentors: Tanmay Mohapatra
JuliaDB is a distributed analytical database. It uses Julia’s multi-processing for parallelism at the moment. GPU implementations of some operations may allow relational algebra with low latency. In this project, you will be required to add basic GPU support in JuliaDB.
filter operation: apply simple functions on a large table that is on the GPU
join operations may involve first implementing an efficient sortperm that utilizes the GPU, or an efficient hash table on the GPU
groupby kernel on GPU
join kernel on GPU (stretch goal)
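To make the role of sortperm in a join concrete, here is a CPU-only reference sketch of a sort-merge inner join on key vectors. It is not JuliaDB code and does not run on a GPU; `sortmerge_join` is a hypothetical name. A GPU version would presumably swap the two `sortperm` calls for a GPU sort and parallelize the merge step.

```julia
# CPU reference sketch (hypothetical, not JuliaDB code): inner join of two
# key vectors via sortperm. Returns (i, j) index pairs with left[i] == right[j].
function sortmerge_join(left::AbstractVector, right::AbstractVector)
    pl, pr = sortperm(left), sortperm(right)
    out = Tuple{Int,Int}[]
    i, j = 1, 1
    while i <= length(pl) && j <= length(pr)
        a, b = left[pl[i]], right[pr[j]]
        if isless(a, b)
            i += 1
        elseif isless(b, a)
            j += 1
        else
            # Find the run of equal keys on each side.
            iend = i
            while iend < length(pl) && isequal(left[pl[iend + 1]], a)
                iend += 1
            end
            jend = j
            while jend < length(pr) && isequal(right[pr[jend + 1]], b)
                jend += 1
            end
            # Emit the cross product of the two runs as original row indices.
            for x in i:iend, y in j:jend
                push!(out, (pl[x], pr[y]))
            end
            i, j = iend + 1, jend + 1
        end
    end
    return out
end

left_keys = [3, 1, 2, 3]
right_keys = [3, 4, 1]
matches = sortmerge_join(left_keys, right_keys)
```

The returned index pairs can then be used to gather the matching rows from each table's columns. A hash-table join would avoid the two sorts but needs a concurrent hash table on the device.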
Query.jl is designed to work with multiple backends. This project would add a SQL backend, so that queries formulated with the query commands in Query.jl are translated into an equivalent SQL query that can be run within a SQL database engine. Both LINQ and dplyr support a similar feature set, and this project would enable the same scenario for Julia. There is also a small body of academic literature on this topic that we need to understand and incorporate.
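The translation step can be pictured as a small compiler from a query tree to SQL text. The sketch below is an assumption-laden toy: the types `QueryNode`, `Column`, `Literal`, `BinaryOp` and `SelectQuery`, the operator table `SQL_OPS`, and the function `tosql` are all hypothetical and do not reflect Query.jl's internal representation.

```julia
# Hypothetical miniature query AST; NOT Query.jl's internal types.
abstract type QueryNode end

struct Column <: QueryNode
    name::Symbol
end

struct Literal <: QueryNode
    value::Union{Int,Float64,String}
end

struct BinaryOp <: QueryNode
    op::Symbol
    lhs::QueryNode
    rhs::QueryNode
end

struct SelectQuery
    table::Symbol
    columns::Vector{Symbol}
    predicate::Union{QueryNode,Nothing}
end

# Hypothetical operator names mapped to their SQL spelling.
const SQL_OPS = Dict(:eq => "=", :gt => ">", :lt => "<", :and => "AND", :or => "OR")

tosql(c::Column) = string(c.name)
# String literals are quoted, with embedded single quotes doubled.
tosql(l::Literal) =
    l.value isa String ? "'" * replace(l.value, "'" => "''") * "'" : string(l.value)
tosql(b::BinaryOp) = "(" * tosql(b.lhs) * " " * SQL_OPS[b.op] * " " * tosql(b.rhs) * ")"

function tosql(q::SelectQuery)
    sql = "SELECT " * join(string.(q.columns), ", ") * " FROM " * string(q.table)
    return q.predicate === nothing ? sql : sql * " WHERE " * tosql(q.predicate)
end

query = SelectQuery(:people, [:name, :age],
    BinaryOp(:and,
        BinaryOp(:gt, Column(:age), Literal(30)),
        BinaryOp(:eq, Column(:city), Literal("Paris"))))
sql = tosql(query)
```

A real backend would also need to handle joins, grouping and projections of computed expressions, and would normally pass literals as bound parameters rather than inlining them into the SQL string.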
Recommended Skills: Very strong database and SQL skills, previous experience with compilers (this project is essentially a compiler that translates a query AST into SQL), and strong familiarity with the Julia data stack.
Expected Results: A new version of Query.jl that runs queries as SQL in a database.
Mentors: David Anthoff
The Queryverse has a large number of file IO packages: CSVFiles.jl, ExcelFiles.jl, FeatherFiles.jl, StatFiles.jl, ParquetFiles.jl and FstFiles.jl. This project will a) do serious performance work across all of the existing packages and b) add write capabilities to a number of them.
Recommended Skills: Experience with file formats and with writing performant Julia code.
Expected Results: Write capabilities across the packages listed above, competitive performance for all the packages listed above.
Mentors: David Anthoff