Thanks for your interest in Summer of Code! This page outlines some ideas for student projects. Most project ideas come with a suggested set of skills and experience which will be helpful, but they aren’t hard requirements and there’s scope for learning and exploring before, during, and after the official Summer of Code period. Please contact us at email@example.com with your questions.
We encourage you to send us a proposal through the GSoC website; you’re welcome to flesh out one of these projects with your own ideas, or to come up with something completely different. We also have some application guidelines for more suggestions on writing a proposal.
Once you have an idea or area of interest, the best way to move forward is to get involved in relevant areas of the Julia ecosystem – look out for packages or organisations which are working on similar things, and see what features and issues they have. It’s great to start by solving some small issues, and making real patches will show us your coding skills, your ability to learn, and your commitment.
If there’s no directly relevant package, you could also start your own to begin prototyping your idea. It’s not an absolute requirement, but a few lines of real code will really bolster an application.
Talking to others in the community is also a good way to get introduced to potential mentors. If you’ve spoken to someone who has agreed to mentor, let us know, but it’s not a requirement for the application.
Table of Contents
Please also see the Juno project page for ideas related to Julia tooling.
Julia is emerging as a serious tool for technical computing and is ideally suited for the ever-growing needs of big data analytics. This set of proposed projects addresses specific areas for improvement in analytics algorithms and distributed data management.
Julia uses OpenBLAS for matrix algebra, but OpenBLAS is better suited for large matrices. For operations with small matrices and vectors, one can often obtain substantial speedups by implementing everything in Julia. At least two candidate implementations already exist, with the first more thoroughly developed but the second (currently just a sketch) having some features that are attractive for inclusion in
The project would be to flesh out operations with fixed-size arrays, and get them interoperating seamlessly with other types. It would also be desirable to implement operations for certain sizes using Julia’s up-and-coming SIMD support.
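The core idea can be sketched in a few lines. This is a minimal illustration, not either of the candidate packages: a tuple-backed fixed-size vector whose element operations unroll at compile time because the length is a type parameter. All names here (`FixedVec`, `dot`) are hypothetical.

```julia
# A minimal sketch of a stack-allocated fixed-size vector. Because the length
# N is a type parameter, ntuple(f, Val(N)) unrolls for small, known sizes,
# avoiding heap allocation entirely.
struct FixedVec{N,T}
    data::NTuple{N,T}
end

# Elementwise addition, unrolled at compile time.
Base.:+(a::FixedVec{N,T}, b::FixedVec{N,T}) where {N,T} =
    FixedVec{N,T}(ntuple(i -> a.data[i] + b.data[i], Val(N)))

# Dot product without any temporary arrays.
dot(a::FixedVec{N}, b::FixedVec{N}) where {N} =
    sum(ntuple(i -> a.data[i] * b.data[i], Val(N)))

v = FixedVec((1.0, 2.0, 3.0))
w = FixedVec((4.0, 5.0, 6.0))
(v + w).data  # (5.0, 7.0, 9.0)
dot(v, w)     # 32.0
```

A real package would additionally hook these types into the `AbstractArray` interface so they interoperate with generic code.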
The Images.jl package implements several algorithms that do not use, but would be well-suited for, multi-threading. This project would implement multithreaded versions of
imfilter_gaussian. While such kernels might be written by hand, it is also attractive to explore various “frameworks” that reduce the amount of boilerplate code required. One recommended approach would be to explore using ParallelAccelerator.jl; alternatively, one might leverage the KernelTools.jl package in conjunction with Julia 0.5’s native threading capabilities.
Expected Results: a multithreaded implementation of imfilter_gaussian.
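The general shape of such a kernel can be sketched with Julia's native threading. This is an illustrative separable row filter, not imfilter_gaussian itself (which uses IIR passes); the function name is hypothetical.

```julia
using Base.Threads

# Apply a 1-D kernel along each row, with rows distributed across threads.
# Boundary columns are left untouched for simplicity.
function rowfilter!(out, img, kern::Vector{Float64})
    h = length(kern) ÷ 2
    @threads for i in 1:size(img, 1)
        for j in 1+h:size(img, 2)-h
            s = 0.0
            for k in -h:h
                s += kern[k+h+1] * img[i, j+k]
            end
            out[i, j] = s
        end
    end
    return out
end

img = ones(4, 10)
out = zeros(4, 10)
rowfilter!(out, img, [0.25, 0.5, 0.25])
out[1, 5]  # 1.0: the normalized kernel preserves a constant image
```

The frameworks mentioned above aim to generate exactly this kind of loop nest from higher-level array expressions, removing the boilerplate.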
The LightGraphs.jl package provides a fast, robust set of graph analysis tools. This project would implement additions to LightGraphs to support parallel computation for a subset of graph algorithms. Examples of algorithms that would benefit from adaptation to parallelism would include centrality measures and traversals.
Expected Results: creation of LightGraphs-based data structures and algorithms that take advantage of large-scale parallel computing environments.
Julia employs several techniques that are novel in the field of high-performance computing, such as just-in-time compilation, first-class support for an interactive environment, and dynamically adding/removing worker processes. This clashes with the traditional model of ahead-of-time-compiled programs running in batch mode, but the advantages of Julia’s approach are clear. This project explores how “typical” Julia programs can be run efficiently on current large-scale systems such as Blue Waters or Cori.
Expected Results: Run a large, parallel Julia application on a high-end HPC system
Knowledge Prerequisites: High-performance computing, MPI
OpenLibm is a portable libm implementation used by Julia. It has fallen behind msun, from which it was forked a few years ago. This project seeks to update OpenLibm with all the latest bugfixes from msun. At the same time, the MUSL libm implementation will be considered as an alternative basis for OpenLibm. A significant testsuite based on various existing libm testsuites will be created.
As a technical computing language, Julia provides a huge number of special functions, both in Base as well as in packages such as StatsFuns.jl. At the moment, many of these are implemented in external libraries such as openspecfun. This project would involve implementing these functions in native Julia (possibly utilising existing work), seeking out opportunities for possible improvements along the way, such as supporting BigFloat, exploiting fused multiply-add operations, and improving error handling and boundary cases.
Matrix functions map matrices onto other matrices, and can often be interpreted as generalizations of ordinary functions like sine and exponential, which map numbers to numbers. Once considered a niche province of numerical algorithms, matrix functions now appear routinely in applications to cryptography, aircraft design, nonlinear dynamics, and finance.
This project proposes to implement state of the art algorithms that extend the currently available matrix functions in Julia, as outlined in issue #5840. In addition to matrix generalizations of standard functions such as real matrix powers, surds and logarithms, students will be challenged to design generic interfaces for lifting general scalar-valued functions to their matrix analogues for the efficient computation of arbitrary (well-behaved) matrix functions and their derivatives.
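For intuition, here is one simple (and limited) way to lift a scalar function to a matrix function, via the eigendecomposition. This is a sketch, not the generic interface the project would design: it only applies to diagonalizable matrices, and production algorithms (e.g. Schur–Parlett) handle the general case.

```julia
using LinearAlgebra

# Lift a scalar function f to a matrix function: if A = V * Diagonal(λ) / V,
# then f(A) = V * Diagonal(f.(λ)) / V. Valid only for diagonalizable A.
function matfun(f, A::AbstractMatrix)
    F = eigen(A)
    F.vectors * Diagonal(f.(F.values)) / F.vectors
end

A = [2.0 1.0; 1.0 2.0]   # symmetric, hence diagonalizable
matfun(exp, A) ≈ exp(A)  # true: agrees with the built-in matrix exponential
```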
Julia currently supports big integers, rationals and floats, making use of the GMP and MPFR libraries. But the current implementation is essentially only a prototype: there exists a very wide gap between the performance of GMP bignums from a C program and from Julia.
This performance gap is due to several factors: bignums in C can be reused, saving on initialisation and cleanup; Julia bignum objects are garbage collected; handwritten C programs can allocate a minimal number of temporaries; and C programs can also exploit the mutability of GMP bignum objects, avoiding copies. Further advantages exist for C implementations that store small integers as immediate words, rather than using a full multiple-precision structure.
There are many ways of potentially improving bignum performance in Julia. We could keep a cache of bignum objects around instead of allocating new ones all the time. We could inline operations involving single word operations and introduce a combined immediate word/mpz struct type for BigInts. Constant propagation could be done by introducing a type for numerical constant literals. Julia could wrap the low level mpn layer of GMP directly instead of the higher level mpz, mpq, mpf layers. A set of unsafe mutating operators for bignums could be introduced, for cases where the user knows they have control over the bignum and can safely mutate it. (Theoretically, term rewriting or lazy evaluation could also eliminate the need for creating a new bignum object for every subexpression, or even reduce the number of temporaries required for any given expression, in much the same way as the old C++ wrapper for GMP used template metaprogramming to speed things up. However, this is probably beyond the scope of a GSOC project.) The Julia Cxx package could even be used to inline calls to the GMP C++ wrapper whose semantics may be closer to the Julia semantics.
Not all of these options are either practical or agreeable. And it isn’t clear which options are going to lead to the best performance or to the nicest Julia code. A balance would need to be struck to use what’s actually available in Julia in a way that is performant but also flexible enough to support future developer needs. The absolutely fastest option may not be the best long term option and consultation on the Julia list will be important for this project. The project should implement prototypes of a number of options, and present a performance comparison to the Julia developers before settling on a final design.
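To make the “unsafe mutating operators” option concrete, here is a sketch of an in-place bignum add that calls GMP's mpz layer directly (the C symbol for mpz_add is __gmpz_add). The function name is hypothetical, and this is only safe when the caller controls the result object and no other reference to it exists.

```julia
# In-place r = a + b via GMP, reusing r's allocation instead of creating a
# fresh BigInt for the result. Unsafe: r must not be shared or interned.
function unsafe_add!(r::BigInt, a::BigInt, b::BigInt)
    ccall((:__gmpz_add, :libgmp), Cvoid,
          (Ref{BigInt}, Ref{BigInt}, Ref{BigInt}), r, a, b)
    return r
end

r = BigInt(0)
unsafe_add!(r, big(2)^100, big(3)^50)
r == big(2)^100 + big(3)^50   # true, with no temporary allocated for the sum
```

A caching layer or compiler support could then decide automatically when such mutation is legal, which is exactly the design question the project would explore.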
Monte Carlo methods are becoming increasingly important in large-scale numerical computations, requiring large quantities of random numbers. To ensure accuracy of the simulated systems, it is critical that the pseudorandom number generator is fast and reliable, avoiding problems with periodicity and dependence, and robust to statistical tests such as the Crush suite. Challenges are even greater in massively parallel computations, which require going beyond running many copies of serial algorithms for generating pseudorandom numbers, due to well-known synchronization effects which can compromise the quality and uniformity of random sampling. A previous Google Summer of Code project created RNG.jl, and there are more random number generation algorithms to explore.
Some possible aims of this project:
Scientific and technical computing often makes use of publicly available datasets. Obtaining this data often involves digging through horrifically designed government websites. Julia has a robust package manager, so storing a dataset on Github and making it available through the package manager can be a convenient means of distribution (see RDatasets.jl). At the same time, many datasets are too large to store reasonably in a git repository.
This project proposal is to develop a new Julia package that will implement a standard means to write data packages to make large external datasets accessible by downloading to a local machine. When a data package is installed, it will automatically download the dataset it wraps, validate it, e.g. with stored checksums, and make that data available as a
DataFrame or other Julia structure. In addition to a standard structure, the
Pkg.generate function should be extended to generate data packages.
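The validation step described above might look like the following sketch. The function name is hypothetical, and the local file stands in for a downloaded dataset; the SHA standard library provides the checksum.

```julia
using SHA  # Julia standard library

# Validate a downloaded dataset against a stored SHA-256 checksum before
# exposing it to the user; a real data package would store the expected
# checksum in its metadata and download the file first.
function fetch_validated(path::AbstractString, expected_sha256::AbstractString)
    actual = open(path) do io
        bytes2hex(sha256(io))
    end
    actual == expected_sha256 || error("checksum mismatch for $path")
    return path
end

path = tempname()
write(path, "example dataset\n")
fetch_validated(path, bytes2hex(sha256(read(path))))  # passes, returns path
```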
This project proposes to implement a very simple persistent storage mechanism for Julia variables so that data can be saved to and loaded from disk with a consistent interface that is agnostic of the underlying storage layer. Data will be tagged with a minimal amount of metadata by default to support type annotations, time-stamped versioning and other user-specifiable tags, not unlike the
git stash mechanism for storing blobs. The underlying engine for persistent storage should be generic and interoperable with any reasonable choice of binary blob storage mechanism, e.g. MongoDB, ODBC, or HDFS. Of particular interest will be persistent storage for distributed objects such as
DArrays, and making use of the underlying storage engine’s mechanisms for data movement and redundant storage for such data.
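A file-backed sketch of the proposed interface is below: a value is saved together with minimal metadata (type, timestamp, user tags), and the storage backend is abstracted away. The names `stash`/`unstash` are hypothetical, and a real package would dispatch on a backend type (MongoDB, HDFS, …) rather than hard-coding a local file.

```julia
using Serialization, Dates

# Save a value with minimal default metadata plus user-specified tags.
function stash(path, value; tags...)
    meta = Dict(:type => string(typeof(value)),
                :saved_at => now(),
                :tags => Dict(tags))
    open(path, "w") do io
        serialize(io, (meta, value))
    end
end

# Load the (metadata, value) pair back.
function unstash(path)
    open(path) do io
        deserialize(io)
    end
end

p = tempname()
stash(p, [1, 2, 3]; source = "demo")
meta, v = unstash(p)
v == [1, 2, 3]   # true; meta[:tags][:source] == "demo"
```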
Distributed computation frameworks like Hadoop/MapReduce have demonstrated the usefulness of an abstraction layer that takes care of low level concurrency concerns such as atomicity and fine-grained synchronization, thus allowing users to concentrate on task-level decomposition of extremely large problems such as massively distributed text processing. However, the tree-based scatter/gather design of MapReduce limits its usefulness for general purpose data parallelism, and in particular poses significant restrictions on the implementation of iterative algorithms.
This project proposal is to implement a native Julia framework for distributed execution for general purpose data parallelism, using dynamic, runtime-generated general task graphs which are flexible enough to describe multiple classes of parallel algorithms. Students will be expected to weave together native Julia parallelism constructs such as the
ClusterManager for massively parallel execution, and automate the handling of data dependencies using native Julia
RemoteRefs as remote data futures and handles. Students will also be encouraged to experiment with novel scheduling algorithms.
Methods for solving stiff differential equations and differential algebraic equations
(DAEs) are composed of similar calculus objects: gradients, Jacobians, etc. However,
not only are these objects common, but there are also very common use patterns.
For example, the quantity
(M - gamma*J)^(-1) (where
M is a mass matrix,
J is the Jacobian, and
gamma is a constant) is found in Rosenbrock methods,
Newton solvers for implicit Runge-Kutta methods, and update equations for
implicit multistep methods. However, the current setup requires that every solver
implements these functions on their own, leaving the chosen options and optimizations
as repeated work throughout the ecosystem.
The goal would be to create a library which exposes many different options for computing these quantities, allowing easier development of native solvers. Specifically, ParameterizedFunctions.jl symbolically computes explicit functions for these quantities in many cases, and the user can also specify the functions. But if no function is found, then the library must provide fallbacks, using ForwardDiff.jl and ReverseDiff.jl for autodifferentiation and Calculus.jl when numerical differentiation must be used. Using the traits provided by DiffEqBase.jl, this can all be implemented with zero runtime overhead, utilizing Julia’s JIT compiler to choose the correct method at compile time.
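The fallback tier can be illustrated with a dependency-free sketch: when no user-provided or symbolic Jacobian exists, build M - gamma*J from a simple finite-difference Jacobian (a stand-in here for the ForwardDiff.jl/Calculus.jl tiers; the function names are hypothetical).

```julia
using LinearAlgebra

# Forward finite-difference Jacobian of f at u.
function fd_jacobian(f, u; ϵ = sqrt(eps(eltype(u))))
    n = length(u)
    J = zeros(eltype(u), n, n)
    f0 = f(u)
    for j in 1:n
        du = copy(u); du[j] += ϵ
        J[:, j] = (f(du) .- f0) ./ ϵ
    end
    return J
end

# The operator common to Rosenbrock, implicit RK, and multistep solvers.
W(f, u, M, γ) = M - γ * fd_jacobian(f, u)

f(u) = [u[1]^2, u[2]]               # exact J at [1,2] is [2 0; 0 1]
Wmat = W(f, [1.0, 2.0], I(2), 0.5)  # ≈ [0.0 0.0; 0.0 0.5]
```

The library would select between the symbolic, autodiff, and finite-difference paths via traits, so this fallback only runs when nothing better is available.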
Students who take on this project will encounter and utilize many of the core tools for scientific computing in Julia, and the result will be a very useful library for all differential equation solvers.
For more details, see the following issue.
Expected Results: A high-performance backend library for native differential equation solvers.
Julia needs to have a full set of ordinary differential equations (ODE) and algebraic differential equation (DAE) solvers, as they are vital for numeric programming. There are many advantages to having a native Julia implementation, including the ability to use Julia-defined types (for things like arbitrary precision) and composability with other packages. A library of methods can be built for the common interface, seamlessly integrating with the other available methods. Possible families of methods to implement are:
Expected Results: A production-quality ODE/DAE solver package.
In recent years, the popular Sundials library has added a suite of implicit-explicit Runge-Kutta methods for efficient solving of discretizations which commonly arise from PDEs. However, these new methods are not accessible from Sundials.jl. The goal of this project would be to expose the ARKODE solvers in Sundials.jl to the common JuliaDiffEq interface.
Expected Results: An interface to the Sundials ARKODE methods in Sundials.jl.
Global Sensitivity Analysis is a popular tool to assess the effect that parameters have on a differential equation model. Global Sensitivity Analysis tools can be much more efficient than Local Sensitivity Analysis tools, and give a better view of how parameters affect the model in a more general sense. Julia currently has an implementation of Local Sensitivity Analysis, but no method for Global Sensitivity Analysis. The goal of this project would be to implement methods like the Morris method in DiffEqSensitivity.jl, which could be used with any differential equation solver on the common interface.
Expected Results: Efficient functions for performing Global Sensitivity Analysis.
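As a sense of scale for the Morris method mentioned above, its core is quite compact: sample random base points, perturb one parameter at a time, and average the absolute elementary effects. This sketch (hypothetical function name, simplified sampling compared to Morris trajectories) ranks parameters of a plain function; the project would apply the same idea to differential equation solutions.

```julia
using Random

# Morris-style screening: μ* (mean absolute elementary effect) per parameter.
function morris(f, lo, hi; Δ = 0.1, r = 50, rng = MersenneTwister(0))
    k = length(lo)
    μstar = zeros(k)
    for _ in 1:r
        x = lo .+ rand(rng, k) .* (hi .- lo .- Δ)   # keep x + Δ inside bounds
        fx = f(x)
        for i in 1:k
            xp = copy(x); xp[i] += Δ
            μstar[i] += abs(f(xp) - fx) / Δ
        end
    end
    return μstar ./ r
end

f(x) = 10x[1] + 0.1x[2]   # x₁ dominates by construction
μ = morris(f, [0.0, 0.0], [1.0, 1.0])
μ[1] > μ[2]               # true: x₁ is flagged as far more influential
```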
Machine learning has become a popular tool for understanding data, but scientists typically want to use this data to better understand their differential equation-based models. The intersection between these two fields is parameter estimation: using techniques from machine learning to identify the values of model parameters from data. Currently, DiffEqParamEstim.jl shows how to link the differential equation solvers with the optimization packages for parameter estimation, but no link to machine learning tools has been created. The tools are all in place for this pairing between JuliaDiffEq and JuliaML.
Expected Results: Modular tools for using JuliaML’s libraries for parameter estimation of differential equations.
Bayesian estimation of parameters for differential equations is a popular technique since this outputs probability distributions for the underlying parameters. Julia’s
ParameterizedFunction makes it easy to solve differential equations with explicit parameters, and holds enough information to be used with Stan.jl. The purpose for this project is to create a function in DiffEqParamEstim.jl which translates the saved information of the model definition in a
ParameterizedFunction to automatically write the input to Stan.jl, and tools for tweaking the inputs.
Expected Results: A function which takes in a
ParameterizedFunction and performs parameter estimation using Stan.jl
One of the major uses for differential equation solvers is for partial differential equations (PDEs). PDEs are solved by discretizing to create ODEs, which are then solved using ODE solvers. However, in many cases a good understanding of the PDE is required to perform this discretization and minimize the error. The purpose of this project is to produce a library with common PDE discretizations to make it easier for users to solve common PDEs.
Expected Results: A production-quality PDE solver package for some common PDEs.
Plotting is a fundamental tool for analyzing the numerical solutions to differential equations. The plotting functionality in JuliaDiffEq is provided by plot recipes to Plots.jl which create plots directly from the differential equation solution types. However, this functionality is currently limited. Some goals for this project would be:
Expected Results: Publication-quality plot recipes.
Iterative methods for solving numerical linear algebra problems are crucial for big data applications, which often involve matrices that are too large to store in memory or even to have their entries computed explicitly. Iterative Krylov methods such as conjugate gradients (CG) and the generalized minimal residual (GMRES) method have proven particularly valuable for a wide variety of applications, such as eigenvalue finding, convex optimization, and even systems control. This project proposes to implement a comprehensive suite of iterative solver algorithms in Julia’s native IterativeSolvers.jl package, as described in the implementation roadmap. Students will be encouraged to refactor the codebase to better expose the mathematical structure of the underlying Arnoldi and Lanczos iterations, thus promoting code composability without sacrificing performance.
While one normally thinks of solving the linear equation Ax=b with A being a matrix, this concept applies more generally to A being a linear map. In many domains of science, directly using a linear map instead of a matrix allows one to solve the equation more efficiently. Iterative methods for linear solving only require the ability to compute
A*x in order to solve the system, and thus these methods can be extended to use more general linear maps. By restructuring IterativeSolvers.jl to use
LinearMap types from LinearMaps.jl, these applications can be directly supported in the library.
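The point that iterative solvers only need A*x is easy to see in code. This sketch of conjugate gradients accepts any object supporting `*` with a vector, so a matrix-free LinearMap-style operator works as well as a dense matrix (the function name and keyword defaults are illustrative, not the IterativeSolvers.jl API).

```julia
using LinearAlgebra

# Conjugate gradients for symmetric positive definite A; the only operation
# required of A is multiplication by a vector.
function cg(A, b; tol = 1e-10, maxiter = length(b))
    x = zero(b)
    r = b - A * x
    p = copy(r)
    rs = dot(r, r)
    for _ in 1:maxiter
        Ap = A * p
        α = rs / dot(p, Ap)
        x .+= α .* p
        r .-= α .* Ap
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return x
end

A = [4.0 1.0; 1.0 3.0]   # symmetric positive definite
b = [1.0, 2.0]
x = cg(A, b)
A * x ≈ b   # true
```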
PETSc is a widely used framework of data structures and computational routines suitable for massively scalable scientific computations. Many of these algorithms are also ideally suited for big data applications, such as computing principal components of very large sparse matrices and solving complicated forecasting models with distributed methods for solving partial differential equations. This project proposal is to develop a new Julia package to interface with PETSc, thus allowing users access to state-of-the-art scalable algorithms for optimization, eigenproblem solvers, finite element mesh computations, and hyperbolic partial differential equation solvers. The more mathematically oriented student may choose to study the performance of these various algorithms as compared to other libraries and naïve implementations. Alternatively, students may also be interested in working on the LLVM BlueGene port for deploying Julia with PETSc integration in an actual supercomputing environment.
Expected Results: New wrappers for PETSc functions in the PETSc.jl package.
A large portion of big data analytics is predicated upon efficient linear algebraic operations on extremely large matrices. However, massively parallel linear algebra has traditionally focussed on supercomputer architectures, and comparatively little work has been done on efficient scaling on more heterogeneous architectures such as commodity clusters and cloud computing servers, where memory hierarchies and network topologies both introduce latency and bandwidth bottlenecks that differ significantly from those on supercomputers.
This project proposal is for implementing native Julia algorithms involving efficient, cache-conscious matrix operations on tiled matrices. Students will be expected to implement tiled algorithms and tune the performance of typical algorithms such as the singular value decomposition or linear solve.
Modern data-intensive computations, such as Google’s PageRank algorithm, can often be cast as operations involving sparse matrices of extremely large nominal dimensions. Unlike dense matrices, which decompose naturally into many homogeneous tiles, efficient algorithms for working with sparse matrices must be fully cognizant of the sparsity pattern of the specific matrices at hand, which oftentimes reduces to efficiently computing partitions of extremely large graphs.
This project proposal is for implementing native Julia algorithms for massively parallel sparse linear algebra routines. Unlike the dense linear algebra project above, efficient parallel algorithms for sparse linear algebra are comparatively less well studied and understood. Students will be expected to implement several algorithms for common tasks such as linear solvers or computing eigenvectors, and benchmark the performance of these algorithms on various real-world applications.
We’d like to make finding and viewing relevant documentation a core part of the Juno/Julia experience. As well as viewing docs for a particular function, it’d be great to take advantage of the extensive Markdown docs provided by packages for other purposes. For example, a basic doc search engine could allow users to find relevant functionality even when they don’t know the names of the functions involved. (This could even be extended to searching across all packages!)
Initially this project could be built as a package or as an extension to the Atom.jl package. Eventually, we’d also like to integrate the functionality with a nice UI inside of Juno, and this could serve as extension work for an enterprising student.
Juno could provide a simple way to browse available packages and view what’s installed on their system. To start with, this project could simply provide a GUI that reads in package data from
~/.julia, including available packages and installed versions, and buttons which call the relevant Pkg commands.
This could also be extended by having metadata about the package, such as a readme, github stars, activity and so on. To support this we probably need a pkg.julialang.org style API which provides the info in JSON format.
The swirl tutorial teaches R users through an interactive REPL experience. Something similar in Julia could provide a tutorial that takes advantage of Juno’s frontend integration, e.g. for getting input and displaying results. In particular, we’d expect this project to involve building a solid framework for building swirl-style tutorials, and allowing the Julia community to easily create new tutorials using Julia. Some research into how Swirl itself achieves this would be a good start.
The foundation for tools like refactoring, linting or autoformatting is a powerful framework for reading and writing Julia code while preserving various features of the source code; comments, original formatting, and perhaps even syntax errors. This could build on the work in JuliaParser.jl.
Related would be various tools for doing static or dynamic analysis of Julia code in order to find errors. A simple example would be linting indentation; combined with information from the parser, this could be a powerful way to reduce beginner frustration over “unexpected )”-style error messages.
Concepts relevant to Julia code’s performance, like ‘type-stability’, are often implicit in written code, making it hard for new users in particular to catch slow code early on. However, this also represents a challenge for static analysis, since in general concrete type information won’t be available until runtime.
A potential solution to this is to hook into function calls (users will most likely call a function with test inputs after writing it) and then use dynamically-available information, such as the output of
code_typed, to find performance issues such as non-concrete types, use of global variables etc, and present these as lint warnings in the IDE.
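A minimal version of this dynamic lint can be sketched with Julia's reflection facilities. This uses `Base.return_types` (an internal reflection utility, used here for brevity in place of inspecting `code_typed` output), and the function name is hypothetical.

```julia
# Flag functions whose inferred return type for given argument types is not
# concrete — the classic "type instability" a lint warning would surface.
function stable_return(f, argtypes::Tuple)
    rts = Base.return_types(f, argtypes)
    all(isconcretetype, rts)
end

unstable(x) = x > 0 ? 1 : 1.0   # inferred as Union{Float64, Int}
stable(x) = x + 1

stable_return(stable, (Int,))    # true
stable_return(unstable, (Int,))  # false: would trigger a lint warning
```

In the IDE, the argument types would come from hooking the user's actual test calls rather than being supplied by hand.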
While static analysis has long been used as a tool for understanding and finding problems in code, the use of dynamically available information is unexplored (with the exception of tracing JIT compilers, which demonstrate the power of the concept). This project has plenty of interesting extensions and could have the most interesting long-term implications.
The Ink console has some nice features, including the ability to display graphics and HTML inline. However, it could be useful to integrate some more terminal-esque features, like support for ANSI colour codes.
RStudio provides the option to save loaded packages and variables on shutdown, and set up the environment as before when restarting. This could be replicated in Juno using some serialisation format for Julia data (e.g. the built-in serialiser or HDF5.jl).
If there’s a piece of tooling you’d like to see for Julia, don’t hesitate to suggest it to us!
Julia’s errors are helpful for the most part but could be improved - see #4744.
Expected Results: Consistent printing of errors with qualified type and human-readable strings.
This project involves creating an autoformat tool, similar to Go’s
gofmt, which automatically rewrites Julia source files into a standard format. This is a fairly experimental project; there are plenty of challenges involved with formatting such a dynamic and flexible language well. However, a well-executed tool is likely to gain widespread adoption by the Julia community.
Expected Results: Autoformatting tool for Julia
Memory errors in Julia’s underlying C code are sometimes difficult to trace, and missing garbage-collector
“roots” (GC roots) can lead to segfaults and other problems. One potential way
to make it easier to find such errors would be to write a package that checks Julia’s
src/ directory for
missing GC roots. A Julia-based solution might leverage the Clang.jl
package to parse the C code, determine which call chains can trigger garbage collection, and then
look for objects that lack GC root protection. Alternatively, the same strategy might be implemented in C++ by writing a plugin for Clang’s static analyzer. Another attractive approach is to leverage coccinelle.
Expected Results: A tool that, when run against Julia’s
src/ directory, highlights lines that
need to have additional GC roots added.
Julia’s method cache is shared by all call sites of a generic function. Although we generate a direct call whenever single dispatch is provable, there are some cases where dynamic dispatch is inevitable. When the compiler can prove (using type inference) that the set of possible matches for a call site is small enough, it would be a huge performance win to generate a small cache specific to that call site. Those small caches would have to be updated when new method definitions are added (and even replaced by the global cache when the number of matches becomes too large).
This project has a large number of possible extensions once the basic feature is done, including: using minimal caching keys (e.g. when only one of the arguments determines the dispatch entirely), generating specific dispatch code in a separate function, sharing the small caches between call sites, and more.
It also has the future benefit of putting in place the infrastructure we will need to enable inline method caching when automatic recompilation is done.
Knowledge Prerequisites: Good understanding of C/C++. A priori knowledge of Julia internals is a plus but this project could also be a very good way to familiarize oneself with those.
Julia’s web and higher level networking framework is largely consolidated within the JuliaWeb github organization.
Knowledge Prerequisites: basic familiarity with HTTP
Implementation of mid-level features - specifically routing, load-balancing, cookie/session handling, and authentication - in Mux.jl. The implementation should be extensible enough to allow easy interfacing with different kinds of caching, databases or authentication backends. (See Clojure/Ring for inspiration)
Expected Results: Some experience with web development.
Julia’s functions for getting and setting the clipboard are currently limited to text; it would be useful to extend them to allow the transfer of other data (for example, spreadsheet data or images).
Expected Results: Extensible
clipboard(x) functions to get and set the richest possible representation of the system clipboard data, as well as methods for specific types.
Knowledge Prerequisites: Interfacing with C from Julia, clipboard APIs for different platforms.
Julia could be a great replacement for C in Python projects, where it can be used to speed up bottlenecks without sacrificing ease of use. However, while the basic functionality for calling Julia exists in IJulia and pyjulia, it needs to be separated out and maintained as a real Python package.
Expected Results: An easy-to-use Python package which allows Julia functions to be imported and called, with transparent conversion of data.
Knowledge Prerequisites: Python (especially C interop).
The Winston package can be used for plotting 2D graphs and images. The package is already very useful, but compared to the full-featured Matplotlib Python package there are still several things missing. This project could either go in the direction of improving the plotting itself (more graph types, more customization) or in the direction of increasing the interactivity of plotting windows (zooming, data picking, …). In the latter case, a close integration with Gtk.jl would be one way to go.
The Gtk.jl package is shaping up pretty well, but various corners remain unimplemented, and besides documentation it is very important to make installation of Gtk completely simple on all three major platforms. Furthermore, quite some manual tweaking is currently necessary to get Gtk looking good on OSX. These installation issues are crucial for serious integration of Gtk.jl into the Julia universe.
The Ruby bindings could be used as starting point for implementing Julia bindings: ruby-qml.
Another possible starting point is QML.jl, where there is a direct interfacing with C++, eliminating the need for libqmlbind.
A possible mentor can be contacted on the julia-users mailing list.
Plots.jl has become the preferred graphical interface for many users. It has the potential to become the standard Julia interface for data visualization, and there are many potential ways that a student could contribute:
LearnBase is an attempt to create a complete well-abstracted set of primitives and tools for machine learning frameworks. The goal is to find the overlap between research perspectives and create smart abstractions for common components such as loss functions, transformations, activation functions, etc. These abstractions can be the building blocks for more flexible learning frameworks, incorporating models (SVM, regression, decision trees, neural networks, etc) into a unified framework more powerful than alternatives (Scikit-learn, TensorFlow, etc).
A student could contribute to the core abstractions, or to dependent learning frameworks, depending on their knowledge and interests. It is expected that the student has prior knowledge of machine learning techniques which they would like to work with. In order to meaningfully contribute to the core abstractions, a broad knowledgebase would be expected. Specific experience with random forests or deep/recurrent neural networks would be a plus.