MODY Code and Dataset
This repository contains the Python code used to train models and the pre-processed dataset ready to be loaded into the pipeline.
Installing Prerequisites
You’ll need Miniconda / conda to create the environment with the required dependencies:
conda create -n mody python=3.10 scikit-learn imbalanced-learn xlsxwriter openpyxl pandas xgboost enlighten conda-forge::shap ray-tune tensorflow-gpu keras tensorboard conda-forge::keras-tuner matplotlib
After installing the environment, activate it with:
conda activate mody
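Before running anything, it can help to confirm that the environment resolved every dependency. The snippet below is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the module list is the set of import names that correspond to the conda packages above (for example, scikit-learn is imported as `sklearn` and imbalanced-learn as `imblearn`).

```python
from importlib.util import find_spec

# Import names assumed to correspond to the conda packages listed above.
REQUIRED_MODULES = [
    "sklearn", "imblearn", "xlsxwriter", "openpyxl", "pandas", "xgboost",
    "enlighten", "shap", "ray", "tensorflow", "keras", "tensorboard",
    "keras_tuner", "matplotlib",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found (hypothetical helper)."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_modules()` inside the activated environment should return an empty list; any name it returns points to a package that did not install.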
Running the Pipeline
The file train.py contains an example of the pipeline. To run it, simply execute:
python train.py
Pipeline Structure
The load_data function reads the MODY_data.xlsx file, which contains the pre-processed dataset.
If it does not exist, a base dataset file named HC.xlsx is required to generate it.
df_mody1, df_mody2, df_mody3, df_mody5 = load_data()
This returns four dataframes, one for each of MODY1, MODY2, MODY3, and MODY5, each containing the full dataset.
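The fallback from MODY_data.xlsx to HC.xlsx described above can be sketched as follows. This is only an illustration of the decision, not the repository's `load_data`: the helper name `resolve_dataset_source` is hypothetical, it assumes both files live in the same directory, and it only reports which file would be read.

```python
from pathlib import Path


def resolve_dataset_source(data_dir="."):
    """Return which file a loader would start from (hypothetical helper).

    Prefers the pre-processed MODY_data.xlsx and falls back to the base
    HC.xlsx, from which the pre-processed file would be generated.
    """
    data_dir = Path(data_dir)
    processed = data_dir / "MODY_data.xlsx"
    base = data_dir / "HC.xlsx"

    if processed.exists():
        return "processed", processed
    if base.exists():
        return "base", base
    # Neither file is available, so there is nothing to load or generate from.
    raise FileNotFoundError(
        f"Neither {processed.name} nor {base.name} found in {data_dir}"
    )
```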
Next, we create a BinaryTuner object, which provides the abstractions to generate 10 datasets and test 10 different machine learning strategies.
It takes the dataframe as its first parameter, followed by the target column name, then either the number of seeds (n_seeds) or a list of specific seeds (seeds), and the test set proportion (drop_ratio).
mody2 = BinaryTuner(df_mody2, 'MODY2_label', seeds=[231964], drop_ratio=0.2)
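To make the roles of the seed and test-proportion parameters concrete, here is a minimal sketch under stated assumptions. The helpers `resolve_seeds` and `split_indices` are hypothetical and are not BinaryTuner's implementation; in particular, the plain shuffled hold-out shown here is an assumption, since this README does not describe how the split is performed.

```python
import random


def resolve_seeds(n_seeds=None, seeds=None):
    """Return an explicit list of seeds (hypothetical helper).

    Exactly one of n_seeds or seeds must be given.
    """
    if (n_seeds is None) == (seeds is None):
        raise ValueError("Pass exactly one of n_seeds or seeds")
    if seeds is not None:
        return list(seeds)
    # Draw fresh seeds when only a count was requested.
    rng = random.SystemRandom()
    return [rng.randrange(2**31) for _ in range(n_seeds)]


def split_indices(n_rows, drop_ratio, seed):
    """Shuffle row indices with the seed and hold out drop_ratio of them."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * drop_ratio)
    # First n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]
```

Fixing the seed, as in `seeds=[231964]`, makes the split reproducible: the same seed always yields the same held-out rows in this sketch.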
Constructing the tuner creates a directory named after the target column, which stores logs, models, and generated images. Note: Because of this, if you want to run experiments on other datasets with the same target column name, make sure the directory does not already exist before starting a new training run.
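The restriction on the output directory can be checked up front, before any training starts. The sketch below is an assumption-laden illustration: the helper `ensure_fresh_output_dir` is hypothetical, and it assumes the directory is created under the current working directory with exactly the target column's name.

```python
from pathlib import Path


def ensure_fresh_output_dir(target_column, root="."):
    """Fail early if the output directory for this target already exists.

    Hypothetical helper; assumes the directory name equals the target column.
    """
    out_dir = Path(root) / target_column
    if out_dir.exists():
        raise FileExistsError(
            f"{out_dir} already exists; move or rename it before training."
        )
    return out_dir
```

Calling `ensure_fresh_output_dir('MODY2_label')` before constructing the tuner would surface a leftover directory from an earlier run immediately.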
mody2.fit()
This trains all models for all test sets.
mody2.explain_model('GaussianNB', 'fulldataset-oversampled-mice', 231964)
This generates SHAP plots for the specified model/dataset/seed combination. A complete list of model names, datasets, and seeds can be found in the files within the output directory.
mody2.wrap_and_save()
This auxiliary method creates a compressed .zip file containing the directory’s contents.
The file name includes a timestamp to prevent name collisions in subsequent runs.
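For readers who want the same behavior outside the class, a timestamped archive can be produced with the standard library alone. This is a minimal sketch, not the code behind `wrap_and_save`: the function name `zip_with_timestamp` and the `name-YYYYMMDD-HHMMSS.zip` naming pattern are assumptions.

```python
import shutil
from datetime import datetime
from pathlib import Path


def zip_with_timestamp(directory, dest="."):
    """Zip a directory's contents into dest as <name>-<timestamp>.zip."""
    directory = Path(directory)
    # Second-level resolution keeps names distinct across separate runs.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    base_name = Path(dest) / f"{directory.name}-{stamp}"
    # make_archive appends the .zip suffix and returns the archive path.
    archive = shutil.make_archive(str(base_name), "zip", root_dir=directory)
    return Path(archive)
```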