JSoC 2015 project: DataStreams.jl

25 October 2015 | Jacob Quinn

Data processing got ya down? Good news! The DataStreams.jl package, er, framework, has arrived!

The DataStreams processing framework provides a consistent interface for working with data, from source to sink and eventually every step in-between. It's really about putting forth an interface (specific types and methods) to go about ingesting and transferring data sources that hopefully makes for a consistent experience for users, no matter what kind of data they're working with.

How does it work?

DataStreams is all about creating "sources" (Julia types that represent true data sources; e.g. csv files, database backends, etc.), "sinks" or data destinations, and defining the appropriate Data.stream!(source, sink) methods to actually transfer data from source to sink. Let's look at a quick example.

Say I have a table of data in a CSV file on my local machine and need to do a little cleaning and aggregation on the data before building a model with the GLM.jl package. Let's see some code in action:

using CSV, SQLite, DataStreams, DataFrames

# let's create a Julia type that understands our data file
csv_source = CSV.Source("datafile.csv")

# let's also create an SQLite destination for our data
# according to its structure
db = SQLite.DB() # create an in-memory SQLite database

# creates an SQLite table
sqlite_sink = SQLite.Sink(Data.schema(csv_source), db, "mydata")

# parse the CSV data directly into our SQLite table
Data.stream!(csv_source, sqlite_sink)

# now I can do some data cleansing/aggregation
# ...various SQL statements on the "mydata" SQLite table...

# now I'm ready to get my data out and ready for model fitting
sqlite_source = SQLite.Source(sqlite_sink)

# stream our data into a Julia structure (Data.Table)
dt = Data.stream!(sqlite_source, Data.Table)

# convert to DataFrame (non-copying)
df = DataFrame(dt)

# do model-fitting
OLS = glm(Y~X,df,Normal(),IdentityLink())

Here we see it's quite simple to create a Source type by wrapping a true datasource (our CSV file), a destination for that data (an SQLite table), and to transfer the data. We can then turn our SQLite.Sink into an SQLite.Source for getting the data back out again.

So What Have You Really Been Working On?

Well, a lot actually. Even though the DataStreams framework is currently simple and minimalistic, it took a lot of back and forth on the design, including several discussions at this year's JuliaCon at MIT. Even with a tidy little framework, however, the bulk of the work still lies in actually implementing the interface in various packages. The two that are ready for release today are CSV.jl and SQLite.jl. They are currently available for julia 0.4+ only.

Quick rundown of each package:

So What's Next?

The work on DataStreams.jl was carried out as part of the Julia Summer of Code program, made possible thanks to the generous support of the Gordon and Betty Moore Foundation, and MIT.