Tabular Data – Summer of Code

Implement FlashFill in Julia

Difficulty: Medium

FlashFill is a mechanism for creating data manipulation pipelines using programming by example (PBE): the user supplies a few input/output examples, and the system infers a transformation that it then applies to the remaining data. As an example, see this implementation in Microsoft Excel. We want a version of FlashFill that works against Julia tabular data structures, such as DataFrames and Tables.
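To make the idea of programming by example concrete, here is a minimal sketch of the approach. It is written in Python purely for illustration and is not the design expected of the project. The mini-DSL (split on a delimiter, pick a token, apply a transform) and the names `run`, `synthesize`, and `flash_fill` are all hypothetical. A real implementation would use a much richer DSL and a smarter search than brute-force enumeration, and in Julia it would operate on table columns rather than plain lists.

```python
from itertools import product

# Hypothetical mini-DSL: a program is (delimiter, token index, transform name).
DELIMITERS = [" ", ",", "-", "@", "."]
TRANSFORMS = {
    "identity": lambda s: s,
    "upper": str.upper,
    "lower": str.lower,
    "initial": lambda s: s[:1],
}


def run(program, text):
    """Apply one DSL program to a string; return None if it does not apply."""
    delimiter, index, transform = program
    tokens = text.split(delimiter)
    if index >= len(tokens):
        return None
    return TRANSFORMS[transform](tokens[index])


def synthesize(examples, max_index=4):
    """Return the first program consistent with every (input, output) example."""
    for program in product(DELIMITERS, range(max_index), TRANSFORMS):
        if all(run(program, inp) == out for inp, out in examples):
            return program
    return None


def flash_fill(column, examples):
    """Infer a program from the examples and apply it to the whole column."""
    program = synthesize(examples)
    if program is None:
        raise ValueError("no program in the DSL matches the examples")
    return [run(program, value) for value in column]


names = ["Ada Lovelace", "Alan Turing", "Grace Hopper"]
filled = flash_fill(names, [("Ada Lovelace", "Lovelace")])
print(filled)  # ['Lovelace', 'Turing', 'Hopper']
```

A single example is enough here because the DSL is tiny. With a larger DSL, many programs would fit one example, so ranking the candidates and asking for further examples become the hard parts of the problem.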

Resources:

Recommended Skills: Compiler techniques, DSL generation, Program synthesis

Expected Output: A practical FlashFill implementation that can be used on any tabular data structure in Julia.

Mentors: Avik Sengupta

Parquet.jl enhancements

Difficulty: Medium

Apache Parquet is a binary data format for tabular data, with features for compression and memory-mapping of datasets on disk. A well-built Parquet implementation in Julia is likely to be highly performant, and Parquet would be useful as a standard format for distributing tabular data in binary form. The existing Parquet.jl package has a Parquet reader and writer, but it currently conforms to the Julia tabular file IO interface only at a very basic level. It needs more work to support the critical elements that would make Parquet.jl usable for fast, large-scale parallel data processing. One or more of the following goals can be targeted:

Resources:

Recommended Skills: Good knowledge of the Julia language and the Julia data stack, and experience writing performant Julia code.

Expected Results: Depends on the specific projects we would agree on.

Mentors: Shashi Gowda, Tanmay Mohapatra