[![PyPI version](https://badge.fury.io/py/sambaflux.svg)](https://badge.fury.io/py/sambaflux)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8369624.svg)](https://doi.org/10.5281/zenodo.8369624)
[License](./LICENSE)
**SAMBA** (SAMpling Biomarker Analysis) is a toolset for running flux sampling on metabolic networks, predicting biomarkers
(or metabolic profiles) for specific metabolic conditions, and visualising the results.
It uses a Snakemake pipeline to manage the entire workflow, starting from a metabolic network and a set of scenarios of reactions or genes to disrupt, and ending with a change score (z-score) for each exchange metabolite in the network, for each disruption scenario.
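As a concrete illustration of the change score, the sketch below computes a z-score between wild-type (WT) and knockout (KO) flux sample distributions for a single exchange reaction, using synthetic numbers in place of real sampling output. This is only meant to convey the idea; SAMBA's exact z-score definition lives in its source code.

```python
# Illustrative only: z-score of a KO flux distribution against the WT
# distribution for one exchange reaction (synthetic data, not SAMBA output).
import random
import statistics

random.seed(42)
wt_fluxes = [random.gauss(5.0, 1.0) for _ in range(10_000)]  # wild-type samples
ko_fluxes = [random.gauss(3.0, 1.0) for _ in range(10_000)]  # knockout samples

# Standardise the shift in mean flux by the WT spread
z = (statistics.mean(ko_fluxes) - statistics.mean(wt_fluxes)) / statistics.pstdev(wt_fluxes)
print(f"z = {z:.2f}")  # strongly negative: this KO lowers the metabolite's flux
```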
### Tutorial
A tutorial on what SAMBA can do with a toy example is available on Google Colab:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Gb5b9AKIJ9pEpBhPshp6fyCxIlLqZz8D?usp=sharing)
Click the link above and save a copy of the notebook to your Google Drive to edit and run it.
Note that the tutorial uses a small metabolic model; this is not how SAMBA is designed to be used for larger models, which require a computer cluster. For full installation instructions, see below.
### Requirements
- Python 3.7.x
- [cobrapy](https://pypi.org/project/cobra/)
- [sambaflux](https://pypi.org/project/sambaflux/) (see `cluster_install.sh` for installation commands.)
- A solver (GLPK, CPLEX, or Gurobi); note that CPLEX 12.10 does not work with Python 3.8+
- Access to a computer cluster or powerful computer for large metabolic networks
### Installation and usage
0. Connect to a cluster using `ssh`
- *Example*: `ssh <username>@genobioinfo.toulouse.inrae.fr`
1. Run `git clone --depth 1 https://forgemia.inra.fr/metexplore/cbm/samba-project/samba.git` to download the necessary files in your directory of choice. `--depth 1` is important to avoid downloading any unnecessary files.
2. On the cluster, open `pipeline/cluster_install.sh` to define environments and requirements. Set the ENVPATH and WORKINGDIR directory paths, and make sure those directories exist: ENVPATH is where the Python environment will be created; WORKINGDIR is where the SAMBA project (this project) will be cloned to, and is where you will run cluster jobs from. To run this file once you've edited it:
- You can submit the file to a cluster job using `sbatch cluster_install.sh`
- Or connect interactively to a node using `srun --pty bash` and then run the commands manually.
>- You can also execute `WORKINGDIR=/path/to/folder` in a single line in the terminal to be able to use `$WORKINGDIR` in your cluster or local terminal in the next steps.
>- Make sure the Python module (imported in `cluster_install.sh`) you're using contains Snakemake pre-installed.
>- Once this step has been done once, future sampling runs with different parameters only require running `git clone --depth 1 https://forgemia.inra.fr/metexplore/cbm/samba-project/samba.git` into a different folder.
>IMPORTANT:
>If you're using CPLEX, make sure to set up CPLEX by adding CPLEX to the PYTHONPATH (see commented line in `cluster_install.sh`).
3. Using your preferred file copy method, send the metabolic network file and the file listing reactions/genes to knock out (KO) to the cluster, into `$WORKINGDIR/samba/pipeline/data/`.
>*Example from a local PC to the cluster*: `rsync -aP /path/to/local/folder/data/ <username>@genobioinfo.toulouse.inrae.fr:$WORKINGDIR/samba/pipeline/data/`
4. `cd` into your WORKINGDIR and edit `config.yaml` with the correct parameters. You can change these to use different models and KO files. See the parameters [below](#samba-parameters).
5. Edit `submit_slurm.sh` to set ENVPATH to the same path you used before. You can also change the job name, the error and output filenames, and add other SBATCH parameters. Alternatively, create a new submit file specific to your computer cluster; just make sure it includes `ENVPATH`, the Python environment, and the `snakemake --profile default` command.
6. Make sure you are in `$WORKINGDIR/samba/pipeline/`, and execute `sbatch submit_slurm.sh` in the terminal.
>TIP:
>You can watch the job submissions using `watch squeue -u <username>` (`Ctrl+C` to exit the watch).
7. Results can be found in `<model_name>_<KO_file_name>/output/`.
- `zscores.tsv` contains the z-scores for all exchange metabolites and all scenarios.
- `densities.json` contains a density version of the sampling distributions for all exchange metabolites and all scenarios. Used in R for plotting purposes.
- Raw sampling files are also provided, but they can get quite large with high numbers of samples.
8. You may also need the files in `<model_name>_<KO_file_name>/dict/` to convert from the unique scenario IDs back to the original scenarios, and to convert metabolite IDs to names for plotting/readability purposes.
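Once the run finishes, `zscores.tsv` can be loaded for downstream filtering. The sketch below assumes a long-format table with `metabolite`, `scenario`, and `zscore` columns (check your actual header; these column names are an assumption) and builds a tiny synthetic table so it runs standalone:

```python
# Sketch: load a z-score table and keep the strongest changes.
# Column names and IDs here are assumptions, not the guaranteed SAMBA output.
import csv
import io

tsv = (
    "metabolite\tscenario\tzscore\n"
    "EX_glc__D_e\tKO_PGI\t-3.2\n"
    "EX_lac__L_e\tKO_PGI\t2.7\n"
    "EX_glc__D_e\tKO_LDH\t0.1\n"
)

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
# Keep |z| >= 2 as candidate biomarkers
hits = [r for r in rows if abs(float(r["zscore"])) >= 2]
for r in hits:
    print(r["scenario"], r["metabolite"], r["zscore"])
```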
### SAMBA parameters
The parameters are set in `config.yaml` (step 4).
- `model`: model filename added in `data/`
- `samba_path`: path to the samba scripts folder. Since you will be running files from the `pipeline/` folder, this can be set to the relative path `../local/scripts` or the absolute path `$WORKINGDIR/samba/local/scripts`.
- `wt`: set to `"default"` or `"specific"`. `"default"` uses the base model as a WT for all WT-KO comparisons. `"specific"` creates a WT specific to each KO where the flux of the KO reactions is forced in the WT.
- `ids`: whether you are using an input gene/reaction KO file (`"simple"`) or a scenario-based KO (`"scenario"`).
- `ids_file`: file containing genes or reactions to KO if `ids: "simple"`. The first column, `Scenario`, contains descriptive scenario names (these do not have to be unique); the second column, `IDs`, contains the gene or reaction IDs to KO, with multiple IDs in one scenario separated by spaces. The third (optional) column, `Reduction`, contains a value between 0 and 1: the fraction of maximum flux to set that group of reactions or genes to (e.g. 0.3 means the corresponding reactions will be set to 30% of their maximum fluxes). This option is ignored if using `ids: "scenario"`.
- `scale`: if using `ids: "scenario"`, set the scale to `"pathway"` to KO within pathways or `"network"` to KO within the entire network. Ignored if using `ids: "simple"`.
- `scenario`: if using `ids: "scenario"`, set the scenario to `"singlerxn"` to KO one random reaction or `"allrxn"` to KO all reactions in `<scale>`. Ignored if using `ids: "simple"`.
- `each`: if using `ids: "scenario"`, set `each` to `"--each"` to loop over `<scale>` when generating `<scenario>` KO IDs. Ignored if using `ids: "simple"`.
- `nsamples`: number of samples to use. `100000` is recommended for large human metabolic networks.
- `biomass`: minimum amount of biomass to optimise for.
- `biomassfile`: tsv file containing a `Model` column with model names, and `Biomass` with the model's biomass reaction. You can add or replace a row to the existing file where needed. Only used if `biomass` != 0.
- `exchangemin`: value to set the exchange reaction lower bounds to (will be negative), e.g. `1` will result in exchange reactions being set to `[-1, 1000]`.
- `rxns_to_output`: reactions to output flux samples for: `"all"`, `"exchanges"`, or `"<filename>"` of a file containing reaction IDs.
- `fva`: `--fva` or `""` to also calculate FVA bounds in the same conditions as sampling.
- `onlyfva`: `--onlyfva` or `""` to only run FVA instead of sampling.
- `zscoresample`: fraction of the total samples to use when calculating z-scores, between 0 and 1. For example, setting `zscoresample` to 0.6 will make SAMBA use a randomly sampled 60% of all samples to calculate the z-scores.
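Putting the parameters together, a minimal `config.yaml` could look like the following. All filenames and values here are illustrative examples, not shipped defaults:

```yaml
model: "Human-GEM.xml"       # placed in data/
samba_path: "../local/scripts"
wt: "default"
ids: "simple"
ids_file: "ko_list.tsv"
scale: "pathway"             # ignored when ids: "simple"
scenario: "singlerxn"        # ignored when ids: "simple"
each: ""                     # ignored when ids: "simple"
nsamples: 100000
biomass: 0
biomassfile: "biomass.tsv"   # only used if biomass != 0
exchangemin: 1
rxns_to_output: "exchanges"
fva: ""
onlyfva: ""
zscoresample: 1
```

An `ids_file` for `ids: "simple"` could then look like this (tab-separated; the gene/reaction IDs are made up for illustration, and the optional `Reduction` value sets a fraction of maximum flux):

```
Scenario	IDs	Reduction
Glycolysis block	PGI PFK
Partial LDH	LDH_L	0.3
```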
### SAMBAR
SAMBAR is not required to run SAMBA; it is, however, useful for importing and plotting SAMBA results in R scripts.
- R (currently used version: 4.2.2)
- R.utils
- optparse
- ggh4x
- [sambar](https://forgemia.inra.fr/metexplore/cbm/samba-project/sambar/-/releases/permalink/latest/downloads/sambar)
Install via bash:
```bash
Rscript -e 'install.packages("https://forgemia.inra.fr/metexplore/cbm/samba-project/sambar/-/releases/permalink/latest/downloads/sambar", repos = NULL)'
```
Install via R:
```R
install.packages("https://forgemia.inra.fr/metexplore/cbm/samba-project/sambar/-/releases/permalink/latest/downloads/sambar", repos = NULL)
```
### Models tested
- Human1 (HumanGEM)
- Recon2
- Toy model (small4m) with added exchange reactions
### Roadmap
- [x] Add support for multiple KOs at once
- [x] Add the option to reduce flux rates in addition to KO