Tabular Data – Summer of Code

Parquet.jl enhancements

Difficulty: Medium

Duration: 175 hours

Apache Parquet is a binary format for tabular data, with features for compression and memory-mapping of datasets on disk. A good implementation of Parquet in Julia is likely to be highly performant, and would be useful as a standard binary format for distributing tabular data. The existing Parquet.jl package provides a Parquet reader and writer, but it currently conforms to the Julia Tabular file IO interface only at a very basic level. More work is needed to support the critical elements that would make Parquet.jl usable for fast, large-scale parallel data processing. Each of these goals can be targeted as a single, short-duration (175 hours) project.

Resources:

Recommended skills: Good knowledge of the Julia language, the Julia data stack, and writing performant Julia code.

Expected Results: Depends on the specific projects we agree on.

Mentors: Tanmay Mohapatra

DataFrames.jl join enhancements

Difficulty: Hard

Duration: 175 hours

DataFrames.jl is one of the more popular implementations of a tabular data type for Julia. One of the features it supports is joining data frames. However, more work is needed to improve this functionality. The specific targets for this project are (a final list of targets included in the scope of the project can be decided later).

Resources:

Recommended skills: Good knowledge of the Julia language, the Julia data stack, and writing performant multi-threaded Julia code. Experience with benchmarking code and writing tests. Knowledge of join algorithms (e.g. as used in databases such as DuckDB, or in other tabular data manipulation ecosystems such as Polars or data.table).

Expected Results: Depends on the specific projects we agree on.

Mentors: Bogumił Kamiński