MLJ Projects – Summer of Code 2021

MLJ is a machine learning framework for Julia aiming to provide a convenient way to use and combine a multitude of tools and models available in the Julia ML/Stats ecosystem.

MLJ is released under the MIT license and sponsored by the Alan Turing Institute.

Particle swarm optimization of machine learning models

Bring particle swarm optimization to the MLJ machine learning platform to help users tune machine learning models.

Difficulty. Easy - moderate.

Description

Imagine your search for the optimal machine learning model as the meandering flight of a bee through hyper-parameter space, looking for a new home for the queen. Parallelize your search, and you've created a swarm of bees. Introduce communication between the bees about their success so far, and you introduce the possibility of the bees ultimately converging on a good candidate for the best model.

PSO (Particle Swarm Optimization) is a large, promising, and active area of research, but also one used in real data science practice. The method is based on a simple idea inspired by nature and makes essentially no assumptions about the nature of the cost function (unlike methods such as gradient descent, which require access to derivatives). It is simple to implement and applicable to a wide range of hyper-parameter optimization problems.
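
To make the idea concrete, below is a minimal sketch of the classic PSO update rule in plain Julia. All names and default parameters are invented for illustration; an actual contribution would instead plug into MLJ's tuning API (MLJTuning.jl).

```julia
using Random

# Minimal PSO sketch: each particle keeps a position x, velocity v, and
# personal best p; the swarm shares a global best g.
function pso(f, dim; n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5)
    x = [rand(dim) for _ in 1:n_particles]      # positions
    v = [zeros(dim) for _ in 1:n_particles]     # velocities
    p = deepcopy(x)                             # personal bests
    fp = map(f, p)                              # personal best costs
    g = p[argmin(fp)]                           # global best
    for _ in 1:iters
        for i in 1:n_particles
            r1, r2 = rand(dim), rand(dim)
            # inertia + cognitive (own best) + social (swarm best) terms
            v[i] = w .* v[i] .+ c1 .* r1 .* (p[i] .- x[i]) .+ c2 .* r2 .* (g .- x[i])
            x[i] = x[i] .+ v[i]
            fx = f(x[i])
            if fx < fp[i]                       # improved personal best?
                p[i], fp[i] = copy(x[i]), fx
            end
        end
        g = p[argmin(fp)]
    end
    return g
end

# Toy usage: minimize a quadratic over two "hyper-parameters".
best = pso(x -> sum((x .- 0.5) .^ 2), 2)
```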

Mentors. Anthony Blaom, Sebastian Vollmer

Prerequisites

Your contribution

The aim of this project is to implement one or more variants of the PSO algorithm for use in the MLJ machine learning platform, for the purpose of optimizing hyper-parameters. Integration with MLJ is crucial, so there will be ample opportunity to familiarize yourself with this popular tool.

Specifically, you will:

References

In-processing methods for fairness in machine learning

Mentors: Jiahao Chen, Moritz Schauer, and Sebastian Vollmer

Fairness.jl is a package to audit and mitigate bias, using the MLJ machine learning framework and other tools. It has implementations of some preprocessing and postprocessing methods for improving fairness in classification models, but could use implementations of other methods, especially in-processing algorithms like adversarial debiasing.

Difficulty. Hard.

Prerequisites

Description

Machine learning models are developed to support and make high-impact decisions, like who to hire or who to give a loan to. However, available training data can exhibit bias against race, age, gender, or other prohibited bases, reflecting a complex social and economic history of systemic injustice. For example, women in the United Kingdom, United States and other countries were only allowed to have their own bank accounts and lines of credit in the 1970s! Training a credit decisioning model on historical data would therefore encode the implicit bias that women are less credit-worthy, because few of them had lines of credit in the past. Surely we would want to be fair and not hinder an applicant's ability to get a loan on the basis of their race, gender, or age?

So how can we fix data and models that are unfair? A common first reaction is to remove the race, gender, and age attributes from the training data and declare the job done. But as the references describe in detail, other features, like a person's name or address, can encode such prohibited bases too. To mitigate bias and improve fairness in models, we can change the training data (pre-processing), the way we define and train the model (in-processing), and/or alter the predictions made (post-processing). Some algorithms for the first and third approaches, which have the advantage of treating the ML model as a black box, have already been implemented in Fairness.jl. However, our latest research (arXiv:2011.02407) shows that pure black box methods have fundamental limitations in their ability to mitigate bias.

Your contribution

This project is to implement more bias mitigation algorithms, and to invent new ones too. We will focus on in-processing algorithms, which alter the training process or the ML model itself. Some specific stages, illustrated by the sketch following this list, are to:

  1. Use Flux.jl or MLJFlux.jl to develop in-processing algorithms,

  2. Study research papers proposing in-processing algorithms and implement them, and

  3. Implement fairness algorithms and metrics for individual fairness as described in papers like arXiv:2006.11439.
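
To give a flavor of an in-processing method, here is a hedged Flux.jl sketch of adversarial debiasing in the style of Zhang et al. (2018): a predictor is trained to fit the labels while fooling an adversary that tries to recover the protected attribute from the predictor's output. The data, architectures, and the trade-off parameter λ are all invented for illustration.

```julia
using Flux, Random

Random.seed!(0)
X = rand(Float32, 10, 256)                 # toy features (10 features × 256 rows)
y = Float32.(rand(256) .> 0.5)'            # toy labels, 1×256
z = Float32.(rand(256) .> 0.5)'            # toy protected attribute, 1×256

predictor = Chain(Dense(10 => 16, relu), Dense(16 => 1, sigmoid))
adversary = Chain(Dense(1 => 8, relu), Dense(8 => 1, sigmoid))

opt_p = Flux.setup(Adam(1f-3), predictor)
opt_a = Flux.setup(Adam(1f-3), adversary)

λ = 1.0f0                                  # accuracy/fairness trade-off

for epoch in 1:100
    # adversary step: learn to recover z from the predictor's output
    ŷ = predictor(X)
    ga, = gradient(a -> Flux.binarycrossentropy(a(ŷ), z), adversary)
    Flux.update!(opt_a, adversary, ga)

    # predictor step: fit y while *fooling* the adversary about z
    gp, = gradient(predictor) do p
        out = p(X)
        Flux.binarycrossentropy(out, y) -
            λ * Flux.binarycrossentropy(adversary(out), z)
    end
    Flux.update!(opt_p, predictor, gp)
end
```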

References

  1. High-level overview: https://towardsdatascience.com/a-tutorial-on-fairness-in-machine-learning-3ff8ba1040cb

  2. https://nextjournal.com/ashryaagr/fairness

  3. IBM’s AIF360 resources: https://aif360.mybluemix.net/

    In particular, AIF360's in-processing algorithms.

  4. https://dssg.github.io/fairness_tutorial/

Causal and counterfactual methods for fairness in machine learning

Mentors: Jiahao Chen, Moritz Schauer, Zenna Tavares, and Sebastian Vollmer

Fairness.jl is a package to audit and mitigate bias, using the MLJ machine learning framework and other tools. This project is to add algorithms for counterfactual ("what if") reasoning and causal analysis to Fairness.jl and MLJ.jl, integrating and extending Julia packages for causal analysis.

Difficulty. Hard.

Prerequisites

Description

Machine learning models are developed to support and make high-impact decisions, like who to hire or who to give a loan to. However, available training data can exhibit bias against race, age, gender, or other prohibited bases, reflecting a complex social and economic history of systemic injustice. For example, women in the United Kingdom, United States and other countries were only allowed to have their own bank accounts and lines of credit in the 1970s! Training a credit decisioning model on historical data would therefore encode the implicit bias that women are less credit-worthy, because few of them had lines of credit in the past. Surely we would want to be fair and not hinder an applicant's ability to get a loan on the basis of their race, gender, or age?

So how can we fix unfairness in models? Arguably, we should first identify the underlying causes of bias; only then can we remediate it successfully. One major challenge is that a proper evaluation often requires data that we don't have. For this reason, we also need counterfactual analysis, to identify actions that can mitigate unfairness not just in our training data, but also in situations we haven't seen yet but could encounter in the future. Ideas for identifying and mitigating bias using such causal interventions have been proposed in papers such as Equality of Opportunity in Classification: A Causal Approach and the references below.
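
As a toy illustration of counterfactual reasoning, here is a plain-Julia sketch of a hand-written structural causal model: the protected attribute z drives a feature x, and we compare each individual's prediction with the one they would have received had z been flipped, holding exogenous noise fixed. The structural equations, the stand-in model, and all names are invented for illustration.

```julia
using Random, Statistics

Random.seed!(1)
n = 1000
z = rand(Bool, n)                    # protected attribute
u = randn(n)                         # exogenous noise, shared across worlds
x = 2.0 .* z .+ u                    # structural equation: x := 2z + u

predict(x) = 0.5 .* x .+ 1.0         # stand-in for a fitted model

# Counterfactual world: intervene do(z := ¬z), keep u fixed, regenerate x.
x_cf = 2.0 .* .!z .+ u

# A counterfactually fair predictor would show a gap of (near) zero.
gap = mean(abs.(predict(x) .- predict(x_cf)))
println("mean counterfactual prediction gap: ", gap)
```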

Your contribution

This project is to add algorithms for counterfactual ("what if") reasoning and causal analysis to Fairness.jl and MLJ.jl, integrating and extending Julia packages for causal analysis. Some specific stages are to:

  1. Implement interfaces in MLJ.jl for Julia packages for causal inference and probabilistic programming, such as Omega.jl and CausalInference.jl (https://github.com/mschauer/CausalInference.jl),

  2. Implement and benchmark causal and counterfactual definitions for measuring unfairness, and

  3. Implement and benchmark causal and counterfactual approaches to mitigate bias.

References

Time series forecasting at scale - speed up via Julia

Time series are ubiquitous - stocks, sensor readings, vital signs. This project aims to add time series forecasting to MLJ and to perform benchmark comparisons with other frameworks (e.g. sktime, tslearn, tsml).

Difficulty. Easy - moderate.

Prerequisites

Your contribution

MLJ is so far focused on tabular data and time series classification. This project is to add support for time series data in a modular, composable way.

Time series are everywhere in real-world applications and there has been an increase in interest in time series frameworks recently (see e.g. sktime, tslearn, tsml).

But there are still very few principled time-series libraries out there, so you would be working on something that could be very useful for a large number of people. To find out more, check out the sktime paper.
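
To suggest what "modular, composable" support might look like, here is a sketch of a naive forecaster written against the MLJ model interface. MLJ's supervised API is not currently designed for forecasting, so treat this as a thought experiment; the model name and its horizon field are invented.

```julia
import MLJModelInterface
const MMI = MLJModelInterface

# A forecaster that simply repeats the last observed value of the target.
MMI.@mlj_model mutable struct NaiveForecaster <: MMI.Deterministic
    horizon::Int = 1::(_ > 0)
end

function MMI.fit(::NaiveForecaster, verbosity, X, y)
    fitresult = last(y)          # remember the last observed value
    cache, report = nothing, nothing
    return fitresult, cache, report
end

# Ignore Xnew and repeat the last value over the forecast horizon.
MMI.predict(model::NaiveForecaster, fitresult, Xnew) =
    fill(fitresult, model.horizon)
```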

Mentors: Sebastian Vollmer, Markus Löning (sktime developer).

References

Interpretable Machine Learning in Julia

Interpreting and explaining black box models is crucial to establishing trust and improving performance.

Difficulty. Easy - moderate.

Description

It is important to have mechanisms in place to interpret the results of machine learning models, and to identify the factors relevant to a model's decision or score.

This project will implement methods for model and feature interpretability.
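
One classic example of such a method is permutation feature importance. Here is a minimal sketch in plain Julia, with all names and the default metric invented for illustration:

```julia
using Random, Statistics

# Score each feature by how much shuffling it degrades the model's predictions.
function permutation_importance(predict, X::Matrix, y::Vector;
                                metric = (ŷ, y) -> mean((ŷ .- y) .^ 2))
    base = metric(predict(X), y)
    importances = zeros(size(X, 2))
    for j in 1:size(X, 2)
        Xp = copy(X)
        shuffle!(view(Xp, :, j))      # break feature j's link to the target
        importances[j] = metric(predict(Xp), y) - base
    end
    return importances                # larger = more important
end
```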

Mentors. Diego Arenas, Sebastian Vollmer.

Prerequisites

Your contribution

The aim of this project is to implement multiple interpretability algorithms, such as:

Specifically, you will:

References

Tutorials

Model visualization in MLJ

Design and implement a data visualization module for MLJ.

Difficulty. Easy.

Description

Design and implement a data visualization module for MLJ to visualize numeric and categorical features (histograms, boxplots, correlations, frequencies), intermediate results, and metrics generated by MLJ machines, using a suitable Julia package for data visualization.

The idea is to implement a similar resource to what mlr3viz does for mlr3.

Prerequisites

Your contribution

So far, visualizing data or features in MLJ is an ad hoc task, defined by the user case by case. You will be implementing a standard way to visualize model performance, residuals, benchmarks, and predictions for MLJ users.

The structures and metrics will come from the results of the models or data sets used; your task will be to implement the right visualizations depending on the data type of the features.

A relevant part of this project is to visualize the target variable against the rest of the features.
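
Below is a sketch of the kind of type-driven dispatch such a module could build on, using Plots.jl and StatsBase.jl; the function name is invented:

```julia
using Plots, StatsBase

# Numeric features get a histogram...
feature_plot(v::AbstractVector{<:Real}) =
    histogram(v; legend=false, xlabel="value", ylabel="count")

# ...while categorical features get a frequency bar chart.
function feature_plot(v::AbstractVector)
    freqs = countmap(v)
    levels = collect(keys(freqs))
    bar(string.(levels), [freqs[k] for k in levels];
        legend=false, xlabel="level", ylabel="count")
end
```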

You will enhance your visualization skills as well as your ability to "debug" and understand models and their predictions visually.

References

Mentors: Sebastian Vollmer, Diego Arenas.

Deeper Bayesian Integration

Bayesian methods and probabilistic supervised learning provide uncertainty quantification. This project aims to increase the integration of Turing with MLJ, so that Bayesian and non-Bayesian methods can be combined.

Description

As an initial step, reproduce SossMLJ.jl in Turing. The bulk of the project is to implement methods that combine multiple predictive distributions.
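
For orientation, here is a minimal Turing.jl model of the kind such an integration could wrap: a Bayesian linear regression whose posterior induces a predictive distribution. The model and data are invented for illustration.

```julia
using Turing, LinearAlgebra, Random

@model function blr(X, y)
    σ ~ truncated(Normal(0, 1), 0, Inf)         # observation noise
    β ~ MvNormal(zeros(size(X, 2)), 4.0 * I)    # coefficient prior
    y ~ MvNormal(X * β, σ^2 * I)                # likelihood
end

Random.seed!(2)
X = randn(100, 2)
y = X * [1.0, -2.0] .+ 0.5 .* randn(100)

chain = sample(blr(X, y), NUTS(), 500)          # posterior over (β, σ)
```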

Your contribution

References

  1. Bayesian stacking

  2. SKpro

Difficulty. Medium - hard.

Mentors: Hong Ge, Sebastian Vollmer

MLJ and MLFlow integration

Integrate MLJ with MLFlow.

Difficulty. Easy.

Description

MLFlow is a flexible model management tool. The project consists of writing the necessary functions to integrate MLJ with the MLFlow REST API, so that models built using MLJ can keep track of their runs, evaluation metrics, and parameters, and can be registered and monitored using MLFlow.
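
To give a feel for the work involved, here is a hedged sketch of calling the MLFlow REST API from Julia using HTTP.jl and JSON3.jl. The endpoint paths follow MLFlow's documented 2.0 REST API; the tracking-server URL and helper names are invented.

```julia
using HTTP, JSON3

const TRACKING = "http://localhost:5000/api/2.0/mlflow"

# Start a new run under the given experiment and return its run_id.
function create_run(experiment_id)
    resp = HTTP.post("$TRACKING/runs/create",
        ["Content-Type" => "application/json"],
        JSON3.write((experiment_id = experiment_id,)))
    return JSON3.read(resp.body).run.info.run_id
end

# Log a single evaluation metric against an existing run.
function log_metric(run_id, key, value)
    HTTP.post("$TRACKING/runs/log-metric",
        ["Content-Type" => "application/json"],
        JSON3.write((run_id = run_id, key = key, value = value,
                     timestamp = round(Int, time() * 1000))))
end
```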

Prerequisites

Your contribution

References

Speed demons only need apply

Diagnose and exploit opportunities for speeding up common MLJ workflows.

Difficulty. Moderate.

Description

In addition to investigating a number of known performance bottlenecks, you will have some free rein in this project to identify opportunities to speed up common MLJ workflows, as well as to make better use of memory resources.
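
Work of this kind typically starts with micro-benchmarks. Here is a sketch of timing a cross-validated model evaluation with BenchmarkTools.jl; the choice of model and data is invented for illustration.

```julia
using MLJ, BenchmarkTools

X, y = make_regression(1_000, 10)        # synthetic regression data
Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0
mach = machine(Tree(), X, y)

# Time a 6-fold cross-validated evaluation end to end.
@btime evaluate!($mach; resampling=CV(nfolds=6), measure=rms, verbosity=0)
```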

Prerequisites

Your contribution

In this project you will:

References

Mentors. Anthony Blaom