ifiguero 94554dead4 patch english 2025-10-16 14:28:08 -03:00
.gitignore patch english 2025-10-16 14:28:08 -03:00
LICENSE Initial commit 2024-12-06 18:57:00 -03:00
MODY_data.xlsx training code 2024-12-06 19:08:51 -03:00
README-es.md patch english 2025-10-16 14:28:08 -03:00
README.md patch english 2025-10-16 14:28:08 -03:00
load_dataset.py fit 2025-10-14 00:08:57 -03:00
train.py typo 2025-10-14 09:38:07 -03:00
trainer.py typo 2025-10-14 09:38:07 -03:00

README.md

MODY Code and Dataset

This repository contains the Python code used to train models and the pre-processed dataset ready to be loaded into the pipeline.

Installing Prerequisites

You'll need Miniconda or conda to create the environment with the required dependencies:

conda create -n mody python=3.10 scikit-learn imbalanced-learn xlsxwriter openpyxl pandas xgboost enlighten conda-forge::shap ray-tune tensorflow-gpu keras tensorboard conda-forge::keras-tuner matplotlib

After installing the environment, activate it with:

conda activate mody

Running the Pipeline

The file train.py contains an example of the pipeline. To run it, simply execute:

python train.py

Pipeline Structure

The load_data function reads the MODY_data.xlsx file, which contains the pre-processed dataset. If it does not exist, a base dataset file named HC.xlsx is required to generate it.

df_mody1, df_mody2, df_mody3, df_mody5 = load_data()
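The fallback described above can be sketched as follows. This is only an illustration of the lookup order, not the code in load_dataset.py: the helper name resolve_dataset_source and its base_dir parameter are hypothetical, and the sketch assumes both files are looked up in the same directory.

```python
from pathlib import Path

PROCESSED_NAME = "MODY_data.xlsx"  # pre-processed dataset named in this README
RAW_NAME = "HC.xlsx"               # base dataset named in this README


def resolve_dataset_source(base_dir="."):
    """Hypothetical helper: report which file a loader would start from.

    Returns ("processed", path) when the pre-processed file exists,
    ("raw", path) when only the base file exists, and raises otherwise.
    """
    base = Path(base_dir)
    processed = base / PROCESSED_NAME
    if processed.exists():
        return "processed", processed
    raw = base / RAW_NAME
    if raw.exists():
        return "raw", raw
    raise FileNotFoundError(
        f"Neither {PROCESSED_NAME} nor {RAW_NAME} was found in {base.resolve()}"
    )
```

Running a check like this before calling load_data makes a missing input file fail early with a clear message.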

load_data returns four dataframes, each containing the full dataset. Next, we create a BinaryTuner object, which provides the abstractions to generate 10 datasets and test 10 different machine learning strategies. It takes multiple parameters: the dataframe, then the target column name, followed by either the number of seeds (n_seeds) or a vector of specific seeds (seeds), and the test set proportion (drop_ratio in the example below).

mody2 = BinaryTuner(df_mody2, 'MODY2_label', seeds=[231964], drop_ratio=0.2)
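To make the role of the seed arguments concrete, here is a minimal standard-library sketch of how a seed list and a test proportion could drive reproducible splits. It is not BinaryTuner's implementation: resolve_seeds, split_indices, and master_seed are hypothetical names, and treating drop_ratio as the held-out fraction is an assumption based on the parameter description above.

```python
import random


def resolve_seeds(n_seeds=None, seeds=None, master_seed=0):
    """Hypothetical helper: turn (n_seeds | seeds) into an explicit seed list."""
    if seeds is not None:
        return list(seeds)
    if n_seeds is None:
        raise ValueError("Provide either n_seeds or seeds")
    # Derive n_seeds reproducible seeds from one master seed (an assumption).
    rng = random.Random(master_seed)
    return [rng.randrange(2**31) for _ in range(n_seeds)]


def split_indices(n_rows, seed, drop_ratio):
    """Shuffle row indices with the seed and hold out drop_ratio of them."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * drop_ratio)
    return indices[n_test:], indices[:n_test]  # (train, test)


for seed in resolve_seeds(seeds=[231964]):
    train_idx, test_idx = split_indices(n_rows=100, seed=seed, drop_ratio=0.2)
    print(seed, len(train_idx), len(test_idx))
```

Because each split depends only on its seed, rerunning with the same seed reproduces the same train and test rows, which is what makes a specific seed such as 231964 useful for repeating an experiment.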

Creating the BinaryTuner object also creates a directory named after the target column, which stores logs, models, and generated images. Note: this introduces a restriction. If you want to run experiments on other datasets with the same target column name, ensure that the directory does not already exist before starting a new training run.
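One way to guard against reusing an existing directory is to check for it before constructing the tuner. The sketch below is an assumption-laden illustration: ensure_fresh_output_dir is a hypothetical helper, and it assumes the directory is created under the current working directory with exactly the target column's name.

```python
from pathlib import Path


def ensure_fresh_output_dir(target_column, base_dir="."):
    """Hypothetical guard: fail early if the output directory already exists.

    Assumes the directory is named after the target column and lives
    under base_dir. The directory itself is not created here.
    """
    out_dir = Path(base_dir) / target_column
    if out_dir.exists():
        raise FileExistsError(
            f"{out_dir} already exists; move or rename it before a new run"
        )
    return out_dir
```

Calling ensure_fresh_output_dir('MODY2_label') before building the tuner would stop a run before earlier results are mixed with new ones.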

mody2.fit()

This trains all models for all test sets.

mody2.explain_model('GaussianNB', 'fulldataset-oversampled-mice', 231964)

Generates SHAP plots for the specified model/dataset/seed combination. A complete list of model names, datasets, and seeds can be found in the files within the directory.

mody2.wrap_and_save()

This is an auxiliary method that creates a compressed .zip file containing the directory's contents. The file name includes a timestamp to prevent name collisions in subsequent runs.
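The same idea can be reproduced with the standard library if you need to archive a results directory yourself. This is a sketch rather than the method's actual code: zip_with_timestamp, its dest_dir parameter, and the timestamp format are all assumptions.

```python
import shutil
from datetime import datetime
from pathlib import Path


def zip_with_timestamp(directory, dest_dir="."):
    """Hypothetical helper: zip `directory` as <name>-<YYYYmmdd-HHMMSS>.zip."""
    directory = Path(directory)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive_base = Path(dest_dir) / f"{directory.name}-{stamp}"
    # make_archive appends ".zip" and returns the archive path as a string.
    return shutil.make_archive(str(archive_base), "zip", root_dir=directory)
```

A timestamp with one-second resolution avoids collisions between separate runs, but two archives created within the same second would still share a name.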