MLJ.jl Projects – Summer of Code

MLJ is a machine learning framework for Julia aiming to provide a convenient way to use and combine a multitude of tools and models available in the Julia ML/Stats ecosystem.

List of projects

MLJ is released under the MIT license and sponsored by the Alan Turing Institute.

  1. Machine Learning in Predictive Survival Analysis
  2. Feature transformations
  3. Time series forecasting at scale - speed up via Julia
  4. Interpretable Machine Learning in Julia
  5. Model visualization in MLJ
  6. Deeper Bayesian Integration
  7. Tracking and sharing MLJ workflows using MLFlow
  8. Speed demons only need apply
  9. Correcting for class imbalance in classification problems

Machine Learning in Predictive Survival Analysis

Implement survival analysis models for use in the MLJ machine learning platform.

Difficulty. Moderate - hard. Duration. 350 hours

Description

Survival/time-to-event analysis is an important field of statistics concerned with understanding the distribution of events over time. Survival analysis presents a unique challenge because we are also interested in events that do not take place, which we refer to as 'censoring'. Survival analysis methods are important in many real-world settings, such as health care (disease prognosis), finance and economics (risk of default), commercial ventures (customer churn), engineering (component lifetime), and many more. This project aims to implement models for performing survival analysis with the MLJ machine learning framework.
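To make the censoring idea concrete, here is a minimal plain-Julia sketch of the Kaplan-Meier estimator, a canonical non-parametric estimator of the survival function for right-censored data. The function name and interface are illustrative, not part of MLJ or Survival.jl:

```julia
# Kaplan-Meier estimator sketch. `times` are observed times; `events[i]` is
# true if the event occurred at times[i], false if the observation was
# censored there (the subject dropped out before the event was seen).
function kaplan_meier(times::Vector{<:Real}, events::Vector{Bool})
    order = sortperm(times)
    t, e = times[order], events[order]
    n = length(t)
    surv_times = Float64[]
    surv_probs = Float64[]
    S = 1.0
    i = 1
    while i <= n
        ti = t[i]
        d = 0   # number of events at time ti
        m = 0   # total observations (events + censorings) at time ti
        while i <= n && t[i] == ti
            d += e[i]
            m += 1
            i += 1
        end
        at_risk = n - (i - 1 - m)   # subjects still at risk just before ti
        if d > 0
            S *= 1 - d / at_risk    # multiply in the conditional survival
            push!(surv_times, ti)
            push!(surv_probs, S)
        end
    end
    return surv_times, surv_probs
end
```

Censored observations reduce the risk set without contributing an event, which is exactly how the estimator accounts for "events that do not take place".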

Mentors. Sebastian Vollmer, Anthony Blaom.

Prerequisites

Julia language fluency essential. Git-workflow familiarity strongly preferred. Some prior exposure to survival analysis, statistics, or machine learning preferred.

Your contribution

Specifically, you will:

- interface existing Julia survival analysis packages, exposing their methods as machine learning models in MLJ.

- implement, and interface to MLJ, promising survival analysis models not currently implemented in Julia.

References

[Kvamme, H., Borgan, Ø., & Scheel, I. (2019). Time-to-event prediction with neural networks and Cox regression. Journal of Machine Learning Research, 20(129), 1-30.](https://arxiv.org/abs/1907.00825)

[Lee, C., Zame, W., Yoon, J., & van der Schaar, M. (2018). DeepHit: A deep learning approach to survival analysis with competing risks. In Thirty-Second AAAI Conference on Artificial Intelligence.](https://ojs.aaai.org/index.php/AAAI/article/view/11842/11701)

[Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang, T., & Kluger, Y. (2018). DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology, 18(1), 24.](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-018-0482-1) <https://doi.org/10.1186/s12874-018-0482-1>

[Gensheimer, M. F., & Narasimhan, B. (2019). A scalable discrete-time survival model for neural networks. PeerJ, 7, e6257.](https://peerj.com/articles/6257/)

[Survival.jl Documentation](https://juliastats.org/Survival.jl/latest/)

Feature transformations

Enhancing MLJ data-preprocessing capabilities by integrating TableTransforms into MLJ.

Difficulty. Easy. Duration. 350 hours

Description

TableTransforms.jl is a Julia package, heavily inspired by FeatureTransforms.jl, which aims to provide the feature engineering transforms that are vital in the statistics and machine learning domains. This project would implement the methods necessary to integrate TableTransforms with MLJ, making its transforms available for incorporation into sophisticated ML workflows.
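As a rough illustration of the fit/apply pattern such transformers follow, here is a z-score standardizer sketched in plain Julia. The names and interface are hypothetical, not the TableTransforms or MLJ API; the point is that statistics are learned once on training data and reused on new data:

```julia
using Statistics

# Learned state of the transform: per-column mean and standard deviation.
struct ZScoreFit
    mu::Vector{Float64}
    sigma::Vector{Float64}
end

# Learn per-column statistics from a matrix whose rows are observations.
fit_zscore(X::Matrix{Float64}) =
    ZScoreFit(vec(mean(X, dims=1)), vec(std(X, dims=1)))

# Apply the learned statistics; works unchanged on a held-out test set.
apply_zscore(f::ZScoreFit, X::Matrix{Float64}) =
    (X .- f.mu') ./ f.sigma'
```

Separating fit from apply is what lets a transform slot into an MLJ pipeline without leaking test-set information into preprocessing.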

Mentors. Anthony Blaom.

Prerequisites

Julia language fluency essential. Familiarity with data science workflows and the MLJ API preferred.

Your contribution

References

TableTransforms.jl repository.

MLJModels.jl repository, which houses the existing MLJ transformers.

Time series forecasting at scale - speed up via Julia

Time series are ubiquitous - stocks, sensor readings, vital signs. This project aims to add time series forecasting to MLJ and to benchmark it against established frameworks (sktime, tslearn, tsml).

Difficulty. Moderate - hard. Duration. 350 hours.

Prerequisites

Your contribution

MLJ is so far focused on tabular data and time series classification. This project is to add support for time series data in a modular, composable way.

Time series are everywhere in real-world applications and there has been an increase in interest in time series frameworks recently (see e.g. sktime, tslearn, tsml).

But there are still very few principled time-series libraries out there, so you would be working on something that could be very useful for a large number of people. To find out more, check out this paper on sktime.
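One standard reduction such a library supports, popularised by sktime, is turning forecasting into tabular regression via lagged features, after which any MLJ regressor applies. A plain-Julia sketch (the function name is illustrative):

```julia
# Build a design matrix of lagged values: row i holds the nlags observations
# preceding y[i + nlags], and target[i] is the value to forecast.
function lag_matrix(y::Vector{Float64}, nlags::Int)
    n = length(y) - nlags
    X = Matrix{Float64}(undef, n, nlags)
    target = Vector{Float64}(undef, n)
    for i in 1:n
        X[i, :] = y[i:i+nlags-1]
        target[i] = y[i+nlags]
    end
    return X, target
end
```

A composable MLJ time-series layer would wrap this kind of reduction so that lag construction, model fitting, and recursive forecasting compose like any other pipeline.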

Mentors: Sebastian Vollmer, Markus Löning (sktime developer).

References

Interpretable Machine Learning in Julia

Interpreting and explaining black-box models is crucial for establishing trust and improving performance.

Difficulty. Easy - moderate. Duration. 350 hours.

Description

It is important to have mechanisms in place to interpret the results of machine learning models, and to identify the factors relevant to a model's decision or score.

This project will implement methods for model and feature interpretability.
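One such method, permutation feature importance, measures how much a model's score drops when a single feature column is shuffled, breaking its relationship with the target. A plain-Julia sketch (the interface is illustrative; `score` is any user-supplied, higher-is-better metric):

```julia
using Random

# For each column j, shuffle that column and report the score lost relative
# to the unshuffled baseline. Large drops indicate important features.
function permutation_importance(score, X::Matrix{Float64}, y::Vector{Float64};
                                rng = Random.default_rng())
    base = score(X, y)
    map(1:size(X, 2)) do j
        Xp = copy(X)
        Xp[:, j] = shuffle(rng, Xp[:, j])
        base - score(Xp, y)   # importance of feature j
    end
end
```

Because it only needs predictions and a score, this technique is model-agnostic, which is what makes it a natural first target for an MLJ interpretability module.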

Mentors. Diego Arenas, Sebastian Vollmer.

Prerequisites

Your contribution

The aim of this project is to implement multiple variants of model-interpretability algorithms, such as feature importance measures and local explanation methods.

References

Tutorials

Model visualization in MLJ

Design and implement a data visualization module for MLJ.

Difficulty. Easy. Duration. 350 hours.

Description

Design and implement a data visualization module for MLJ to visualize numeric and categorical features (histograms, boxplots, correlations, frequencies), intermediate results, and metrics generated by MLJ machines.

You will use a suitable Julia package for data visualization.

The idea is to implement a similar resource to what mlr3viz does for mlr3.

Prerequisites

Your contribution

So far, visualizing data or features in MLJ is an ad hoc task, defined by the user case by case. You will implement a standard way to visualize model performance, residuals, benchmarks, and predictions for MLJ users.

The structures and metrics will be given by the results of the models or data sets used; your task will be to implement the right visualizations depending on the data types of the features.
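As a toy illustration of choosing a visualization from a feature's type (the dispatch and names here are hypothetical; real MLJ code would dispatch on scientific types such as `Continuous` and `Multiclass`):

```julia
# Pick a plot kind from the element type: numeric features get histograms,
# string-valued (categorical) features get bar plots of frequencies.
plotkind(::AbstractVector{<:Real}) = :histogram
plotkind(::AbstractVector{<:AbstractString}) = :barplot

# Frequency table for a categorical feature - the data behind a bar plot.
function frequencies(x::AbstractVector)
    counts = Dict{eltype(x), Int}()
    for v in x
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end
```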

A relevant part of this project is to visualize the target variable against the rest of the features.

You will enhance your visualisation skills as well as your ability to "debug" and understand models and their prediction visually.

References

Mentors: Sebastian Vollmer, Diego Arenas.

Deeper Bayesian Integration

Bayesian methods and probabilistic supervised learning provide uncertainty quantification. This project aims at deeper integration, combining Bayesian and non-Bayesian methods using Turing.

Difficulty. Difficult. Duration. 350 hours.

Description

As an initial step, reproduce SossMLJ.jl in Turing. The bulk of the project is to implement methods that combine multiple predictive distributions.
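One simple way to combine predictive distributions is linear opinion pooling: a weighted mixture of the individual predictions. The sketch below assumes Gaussian components summarised by mean and variance, with weights given (e.g. learned by stacking); the mixture moments follow the standard formulas:

```julia
# Linear opinion pool of predictive distributions with means `means`,
# variances `vars`, and mixture weights `w` (assumed to sum to 1).
# Returns the mean and variance of the pooled (mixture) distribution.
function pool(means::Vector{Float64}, vars::Vector{Float64}, w::Vector{Float64})
    m = sum(w .* means)                        # mixture mean
    v = sum(w .* (vars .+ means .^ 2)) - m^2   # mixture variance
    return m, v
end
```

Note the pooled variance exceeds the weighted average of the component variances whenever the component means disagree, which is how pooling propagates between-model uncertainty.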

Your contributions

References

Bayesian Stacking

SKpro


Mentors: Hong Ge, Sebastian Vollmer.

Tracking and sharing MLJ workflows using MLFlow

Help data scientists using MLJ track and share their machine learning experiments using MLFlow.

Difficulty. Moderate. Duration. 350 hours.

Description

MLFlow is an open source platform for the machine learning life cycle. It allows the data scientist to upload experiment metadata and outputs to the platform for reproducing and sharing purposes. This project aims to integrate the MLJ machine learning platform with MLFlow.

Prerequisites

Your contribution

References

Mentors. Deyan Dyankov, Anthony Blaom, Diego Arenas.

Speed demons only need apply

Diagnose and exploit opportunities for speeding up common MLJ workflows.

Difficulty. Moderate. Duration. 350 hours.

Description

In addition to investigating a number of known performance bottlenecks, you will have some free rein in identifying opportunities to speed up common MLJ workflows and to make better use of memory resources.
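A typical micro-optimization of the kind the project would hunt for is replacing an allocating computation with a fused, in-place one that writes into a preallocated buffer (function names here are illustrative):

```julia
# Allocating version: materializes a fresh result array on every call.
f(x, y) = x .* y .+ x

# In-place version: `@.` fuses the broadcast and writes into a
# caller-supplied buffer, so repeated calls in a hot loop allocate nothing.
function f!(out, x, y)
    @. out = x * y + x
    return out
end
```

In tight training loops (resampling, hyperparameter search), eliminating such per-iteration allocations reduces garbage-collection pressure, often one of the biggest wins available.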

Prerequisites

Your contribution

In this project you will:

References

Mentor. Anthony Blaom.

Correcting for class imbalance in classification problems

Improve and extend Julia's offering of algorithms for correcting class imbalance, with a view to integration into MLJ and elsewhere.

Difficulty. Easy - moderate. Duration. 350 hours

Description

Many classification algorithms do not perform well when there is a class imbalance in the target variable (for example, many more positives than negatives). There are a number of well-known data preprocessing algorithms, such as oversampling, for compensating for class imbalance. See, for instance, the Python package imbalanced-learn.
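Random oversampling, the simplest such correction, resamples minority-class rows with replacement until all classes reach the majority-class count. A plain-Julia sketch (the interface is illustrative):

```julia
using Random

# Balance classes by duplicating randomly chosen minority-class rows.
# X has one observation per row; y holds integer class labels.
function oversample(X::Matrix{Float64}, y::Vector{Int};
                    rng = Random.default_rng())
    classes = unique(y)
    nmax = maximum(count(==(c), y) for c in classes)
    rows = Int[]
    for c in classes
        idx = findall(==(c), y)
        append!(rows, idx)                     # keep all original rows
        append!(rows, rand(rng, idx, nmax - length(idx)))  # resample extras
    end
    return X[rows, :], y[rows]
end
```

More sophisticated schemes (e.g. SMOTE) interpolate synthetic minority samples rather than duplicating rows, but follow the same resampling interface.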

The Julia package ClassImbalance.jl provides some native Julia class imbalance algorithms. For wider adoption, it is proposed that these algorithms be reviewed, extended, and integrated into the MLJ framework.

Mentor. Anthony Blaom.

Prerequisites

Your contribution

References

ClassImbalance.jl repository.