FlashFill is mechanism for creating data manipulation pipelines using programming by example (PBE). As an example see this implementation in Microsoft Excel. We want a version of Flashfill that can work against Julia tabular data structures, such as DataFrames and Tables.
Recommended Skills: Compiler techniques, DSL generation, Program synthesis
Expected Output: A practical flashfill implementation that can be used on any tabular data structure in Julia
Mentors: Avik Sengupta
Apache Parquet is a binary data format for tabular data. It has features for compression and memory-mapping of datasets on disk. A decent implementation of Parquet in Julia is likely to be highly performant. It will be useful as a standard format for distributing tabular data in a binary format. There exists a Parquet.jl package that has a Parquet reader and a writer. It currently conforms to the Julia Tabular file IO interface at a very basic level. It needs more work to add support for critical elements that would make Parquet.jl usable for fast large scale parallel data processing. One or more of the following goals can be targeted:
Lazy loading and support for out-of-core processing, with Arrow.jl and Tables.jl integration. Improved usability and performance of Parquet reader and writer for large files.
Reading from and writing data on to cloud data stores, including support for partitioned data.
Support for missing data types and encodings making the Julia implementation fully featured.
Recommended skills: Good knowledge of Julia language, Julia data stack and writing performant Julia code.
Expected Results: Depends on the specific projects we would agree on.