MLJ is a machine learning framework for Julia aiming to provide a convenient way to use and combine a multitude of tools and models available in the Julia ML/Stats ecosystem.
MLJ is released under the MIT license and sponsored by the Alan Turing Institute.
Implement survival analysis models for use in the MLJ machine learning platform.
Difficulty. Moderate - hard. Duration. 350 hours
Survival/time-to-event analysis is an important field of Statistics concerned with understanding the distribution of events over time. Survival analysis presents a unique challenge as we are also interested in events that are not observed to take place, which we refer to as 'censoring'. Survival analysis methods are important in many real-world settings, such as health care (disease prognosis), finance and economics (risk of default), commercial ventures (customer churn), engineering (component lifetime), and many more. This project aims to implement models for performing survival analysis with the MLJ machine learning framework.
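To make the role of censoring concrete, here is a minimal sketch of the Kaplan-Meier estimator of the survival function. It is written in plain Python purely for illustration and is not part of MLJ or SurvivalAnalysis.jl; the function name `kaplan_meier` and the toy data are assumptions. Censored subjects contribute no event, but they remain in the risk set until the time at which they were censored.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times  -- observed time for each subject (event time or censoring time)
    events -- True if the event was observed, False if the subject was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)      # subjects leaving the risk set at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after t.
        at_risk -= leaving[t]
    return curve

# Hypothetical toy data: subjects 3 and 5 are censored.
times = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"S({t}) = {s:.2f}")
```

Simply dropping the censored subjects would shrink the risk set too early and bias the estimated survival probabilities downwards, which is why survival models need dedicated handling of censored observations.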
mlr3proba is currently the most complete survival analysis interface. Let's get SurvivalAnalysis.jl to the same standard, while learning from mistakes along the way.
Mentors. Sebastian Vollmer, Anthony Blaom.
Julia language fluency is essential.
Git-workflow familiarity is strongly preferred.
Some experience with survival analysis.
Familiarity with MLJ's API a plus.
A passing familiarity with machine learning goals and workflow is preferred.
You will work towards creating a survival analysis package with a range of metrics, capable of making distribution predictions for classical and ML models. You will bake in competing risks early, as well as prediction transformations, and include both left and interval censoring. You will code up basic models (Cox PH and AFT), as well as one ML model as a proof of concept (a decision tree is probably simplest, or alternatively Coxnet).
Specifically, you will:
Familiarize yourself with the training and evaluation of machine learning models in MLJ.
For SurvivalAnalysis.jl, implement the MLJ model interface.
Consider the explainability of survival analysis models through SurvSHAP(t).
Develop a proof of concept for newer advanced survival analysis models not currently implemented in Julia.
Krzyziński, M., et al. (2023). SurvSHAP(t): Time-dependent explanations of machine learning survival models. Knowledge-Based Systems, 262, 110234.
Kvamme, H., Borgan, Ø., & Scheel, I. (2019). Time-to-event prediction with neural networks and Cox regression. Journal of Machine Learning Research, 20(129), 1–30.
Lee, C., Zame, W. R., Yoon, J., & van der Schaar, M. (2018). DeepHit: A deep learning approach to survival analysis with competing risks. In Thirty-Second AAAI Conference on Artificial Intelligence.
Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang, T., & Kluger, Y. (2018). DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology, 18(1), 24.
Gensheimer, M. F., & Narasimhan, B. (2019). A scalable discrete-time survival model for neural networks. PeerJ, 7, e6257. https://peerj.com/articles/6257/
Bayesian methods and probabilistic supervised learning provide uncertainty quantification. This project aims to increase integration so that Bayesian and non-Bayesian methods can be combined using Turing.
Difficulty. Difficult. Duration. 350 hours.
As an initial step, reproduce SOSSMLJ in Turing. The bulk of the project is to implement methods that combine multiple predictive distributions.
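As one illustration of what combining predictive distributions can mean, the sketch below pools several Gaussian predictions for the same target into a weighted mixture (a "linear pool") and reports the mean and standard deviation of that mixture. This is only one simple baseline and is not claimed to be the method the project will adopt; the function name `pool_gaussians` and the numbers are hypothetical, and plain Python is used for illustration rather than Turing or MLJ.

```python
import math

def pool_gaussians(means, stds, weights=None):
    """Mean and standard deviation of a weighted mixture of Gaussians."""
    n = len(means)
    if weights is None:
        weights = [1.0 / n] * n          # equal-weight ensemble by default
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    mean = sum(w * m for w, m in zip(weights, means))
    # Second moment of the mixture: sum of w_i * (sigma_i^2 + mu_i^2).
    second_moment = sum(w * (s ** 2 + m ** 2)
                        for w, m, s in zip(weights, means, stds))
    variance = max(second_moment - mean ** 2, 0.0)  # guard against rounding
    return mean, math.sqrt(variance)

# Two hypothetical models predicting the same target.
mean, std = pool_gaussians([1.0, 3.0], [1.0, 1.0])
print(mean, std)
```

Note that the pooled standard deviation exceeds that of either component whenever the component means disagree, so disagreement between models shows up as extra predictive uncertainty. Stacking differs in that the weights are learned from held-out data instead of being fixed in advance.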
Interface between Turing and MLJ
Comparisons of ensembling and stacking of predictive distributions
Reproducible benchmarks across various settings.
Mentors. Hong Ge, Sebastian Vollmer.
Help data scientists using MLJ track and share their machine learning experiments using MLFlow.
Difficulty. Moderate. Duration. 350 hours.
MLFlow is an open source platform for the machine learning life cycle. It allows the data scientist to upload experiment metadata and outputs to the platform for reproducing and sharing purposes. This project aims to integrate the MLJ machine learning platform with MLFlow.
Julia language fluency essential.
Git-workflow familiarity strongly preferred.
General familiarity with data science workflows
You will familiarize yourself with MLJ, MLFlow and MLFlowClient.jl client APIs.
Implement functionality to upload machine learning model hyper-parameters, performance evaluations, and artifacts encapsulating the trained model to MLFlow.
Implement functionality allowing for the live tracking of learning for iterative models, such as neural networks, by hooking into MLJIteration.jl.
MLFlow website.
Mentors. Deyan Dyankov (to be confirmed), Anthony Blaom, Diego Arenas.
Diagnose and exploit opportunities for speeding up common MLJ workflows.
Difficulty. Moderate. Duration. 350 hours.
In addition to investigating a number of known performance bottlenecks, you will have some free rein to identify opportunities to speed up common MLJ workflows and to make better use of memory resources.
Julia language fluency essential.
Experience with multi-threading and multi-processor computing essential, preferably in Julia.
Git-workflow familiarity strongly preferred.
Familiarity with machine learning goals and workflow preferred
In this project you will:
familiarize yourself with the training, evaluation and tuning of machine learning models in MLJ
benchmark and profile common workflows to identify opportunities for further code optimizations, with a focus on the most popular models
work to address problems identified
roll out new data front-end for iterative models, to avoid unnecessary copying of data
experiment with adding multi-processor parallelism to the current learning networks scheduler
implement some of these optimizations
MLJ Roadmap. See, in particular, the "Scalability" section.
Data front end for MLJ models.
Mentors. Anthony Blaom, Okon Samuel.
Improve and extend Julia's offering of algorithms for correcting class imbalance, with a view to integration into MLJ and elsewhere.
Difficulty. Easy - moderate. Duration. 350 hours
Many classification algorithms do not perform well when there is a class imbalance in the target variable (for example, many more positives than negatives). There are a number of well-known data preprocessing algorithms, such as oversampling, for compensating for class imbalance. See, for instance, the Python package imbalanced-learn.
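For readers new to the topic, the simplest of these algorithms, random oversampling, can be sketched in a few lines. The example below is a language-agnostic illustration in plain Python and does not reflect the API of ClassImbalance.jl or imbalanced-learn; the function name `random_oversample` and the toy data are assumptions. It duplicates randomly chosen minority-class observations until every class is as frequent as the largest one.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Resample minority classes with replacement to match the majority count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)   # keep every original observation
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy data with many more positives than negatives.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["pos", "pos", "pos", "pos", "neg"]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

More sophisticated methods such as SMOTE synthesize new minority observations by interpolating between neighbours instead of duplicating existing rows.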
The Julia package ClassImbalance.jl provides some native Julia class imbalance algorithms. For wider adoption it is proposed that:
ClassImbalance.jl be made more data-generic by supporting the MLUtils.jl getobs interface (original documentation here), which now (mostly) includes tabular data implementing the Tables.jl API. Currently there is only support for an old version of DataFrames.jl.
ClassImbalance.jl implement one or more general transformer APIs, such as the ones provided by TableTransforms.jl, MLJ, and FeatureTransforms.jl (a longer term goal is for MLJ to support the TableTransforms.jl API)
Other Julia-native class imbalance algorithms be added
Mentor. Anthony Blaom.
Julia language fluency is essential.
An understanding of the class imbalance phenomenon is essential. A detailed understanding of at least one class imbalance algorithm is essential.
Git-workflow familiarity is strongly preferred.
A familiarity with machine learning goals and workflow preferred
Familiarize yourself with the existing ClassImbalance package, including known issues
Familiarize yourself with the Tables.jl interface
Assess the merits of different transformer API choices and choose one in consultation with your mentor
Implement the proposed improvements in parallel with testing and documentation additions to the package. Testing and documentation must be up-to-date before new algorithms are added.
repository.