# SAMBA
## Description
SAMBA (SAMpling Biomarker Analysis) is a toolset for running flux sampling on metabolic networks, predicting biomarkers
(or metabolic profiles) for specific metabolic conditions, and representing the results visually.

It uses a Snakemake pipeline to manage the entire workflow, starting from a metabolic network and a set of scenarios of reactions or genes to disrupt, and ending with a change score (z-score) for each exchange metabolite in the network, for each disruption scenario.
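
To make the idea of a change score concrete, here is a minimal Python sketch of one common way to compute a z-score between two sets of flux samples for a single exchange reaction. It is an illustration under assumptions, not SAMBA's actual implementation: the function name `change_zscore`, the random pairing of samples, and the `fraction` argument (which mirrors the `zscoresample` option in `config.yaml`) are all hypothetical, and the input data is synthetic.

```python
import math
import random
import statistics


def change_zscore(wt_samples, ko_samples, fraction=1.0, seed=0):
    """Illustrative change score for one exchange reaction (not SAMBA's code).

    Draws `fraction` of the samples from each condition at random, pairs
    them, and returns mean(KO - WT) / stdev(KO - WT).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = int(min(len(wt_samples), len(ko_samples)) * fraction)
    if n < 2:
        raise ValueError("need at least two samples per condition")
    rng = random.Random(seed)
    wt = rng.sample(list(wt_samples), n)
    ko = rng.sample(list(ko_samples), n)
    diffs = [k - w for k, w in zip(ko, wt)]
    mean = statistics.mean(diffs)
    spread = statistics.stdev(diffs)
    if spread == 0:
        # Every paired difference is identical: no spread to scale by.
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / spread


# Synthetic data standing in for sampled exchange fluxes.
data_rng = random.Random(1)
wild_type = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
knockout = [data_rng.gauss(3.0, 1.0) for _ in range(1000)]
print(round(change_zscore(wild_type, knockout, fraction=0.6), 2))
```

A large positive score suggests the metabolite's exchange flux shifts upwards in the disrupted condition, a large negative score suggests it shifts downwards, and a score near zero suggests little change.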


## Getting started
### Requirements
- Python 3.7.x
- [cobrapy](https://pypi.org/project/cobra/)
- [sambaflux](https://pypi.org/project/sambaflux/) (see `cluster_install.sh` for installation commands.)
- A solver (GLPK, CPLEX or Gurobi). Note that CPLEX 12.10 does not work with Python 3.8+.
- Access to a computer cluster or powerful computer for large metabolic networks
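
As a quick sanity check of the requirements above, the following Python sketch reports the interpreter version and whether a given package can be imported. The helper name `check_environment` is hypothetical and not part of SAMBA. Only `cobra` is checked by default because that is cobrapy's import name; pass other module names if you know them.

```python
import importlib.util
import sys


def check_environment(modules=("cobra",)):
    """Return a small report on the current Python environment."""
    report = {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        # Per the requirements list, CPLEX 12.10 does not work with Python 3.8+.
        "cplex_12_10_compatible": sys.version_info[:2] < (3, 8),
    }
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for key, value in check_environment().items():
    print("{}: {}".format(key, value))
```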


### Installation and usage
0. Connect to a cluster using `ssh`
    - *Example*: `ssh <username>@genologin.toulouse.inra.fr`
1. On the cluster, use `pipeline/cluster_install.sh` to install environments and requirements. You can submit the file as a cluster job using `sbatch cluster_install.sh`, or connect interactively to a node using `srun --pty bash` and run the commands manually. Make sure to set the `ENVPATH` and `WORKINGDIR` paths beforehand, and make sure both folders exist. `ENVPATH` is where the Python environment will be created. `WORKINGDIR` is where the SAMBA project (this project) will be cloned, and where you will run the cluster jobs from.
    - You can also run `WORKINGDIR=/path/to/folder` to be able to use `$WORKINGDIR` in your cluster or local terminal in the next steps.
    - If you're using CPLEX, add CPLEX to the `PYTHONPATH` (see the commented line in `cluster_install.sh`).
    - Make sure the Python module you're using contains Snakemake.
    - Once everything is installed, launching a new sampling run with different parameters only requires cloning the project into a different folder with `git clone --depth 1 https://forgemia.inra.fr/metexplore/cbm/samba-project/samba.git`.

2. Using your preferred file copy method, copy the metabolic network file and the file listing the reactions/genes to knock out (KO) to `$WORKINGDIR/samba/pipeline/data/` on the cluster.
    - *Example from a local PC*: `rsync -aP /path/to/local/folder/data/ <username>@genologin.toulouse.inra.fr:$WORKINGDIR/samba/pipeline/data/`

2. `cd` into your WORKINGDIR and edit `config.yaml` with the correct parameters. You can change these to use different models and KO files.
    - *Example*: 
    ```bash
    cd $WORKINGDIR
    vim config.yaml
    ```
    In vim, `:q` quits and `:x` saves and quits.
    - **Parameters**:
        - `model`: model filename added in `data/`
        - `samba_path`: path to the samba scripts folder. Since you will be running files from the pipeline/ folder, this can be set to a relative path `../local/scripts`, or absolute path `$WORKINGDIR/samba/local/scripts`.
        - `ids`: whether you are using an input gene/reaction KO file (`"simple"`) or a scenario-based KO (`"scenario"`).
        - `ids_file`: file containing the genes or reactions to KO if `ids: "simple"`. The first column, `Scenario`, contains descriptive scenario names (these do not have to be unique). The second column, `IDs`, contains the gene or reaction IDs to KO; separate multiple IDs in one scenario with a space. The optional third column, `Reduction`, contains a value between 0 and 1 giving the fraction of maximum flux that the group of reactions or genes will be set to (e.g. 0.3 means the corresponding reactions will be set to 30% of their maximum fluxes). This option is ignored if using `ids: "scenario"`.
        - `scale`: if using `ids: "scenario"`, set the scale to `"pathway"` to KO within pathways or `"network"` to KO within the entire network. Ignored if using `ids: "simple"`.
        - `scenario`: if using `ids: "scenario"`, set the scenario to `"singlerxn"` to KO one random reaction or `"allrxn"` to KO all reactions in `<scale>`. Ignored if using `ids: "simple"`.
        - `each`: if using `ids: "scenario"`, set each to `"--each"` to enable looping over `<scale>` to generate `<scenario>` KO IDs. Ignored if using `ids: "simple"`.
        - `nsamples`: number of samples to use. `100000` is recommended for large human metabolic networks.
        - `biomass`: minimum amount of biomass to optimise for.
        - `biomassfile`: TSV file containing a `Model` column with model names and a `Biomass` column with each model's biomass reaction. You can add a row to the existing file, or replace an existing row, where needed. Only used if `biomass` is not 0.
        - `exchangemin`: magnitude of the exchange reaction lower bounds (the value is negated), e.g. `1` results in exchange reaction bounds of `[-1, 1000]`.
        - `rxns_to_output`: reactions to output flux samples for: `"all"`, `"exchanges"`, or `"<filename>"` for a file containing reaction IDs.
        - `fva`: `--fva` or `""` to also calculate FVA bounds in the same conditions as sampling.
        - `onlyfva`: `--onlyfva` or `""` to only run FVA instead of sampling.
        - `zscoresample`: fraction of the total samples used to calculate z-scores, between 0 and 1. For example, setting `zscoresample` to 0.6 makes SAMBA calculate the z-scores from a random 60% of all samples.


3. Edit `submit_slurm.sh` to set ENVPATH to the same path you used before. You can also change the job name, error and out filenames, and add other SBATCH parameters.
4. Make sure you are in `$WORKINGDIR/samba/pipeline/`, and run `sbatch submit_slurm.sh`.
    - You can watch the job submissions using `watch squeue -u <username>` (`Ctrl+C` to exit the watch).
5. Results can be found in `<model_name>_<KO_file_name>/output/`.
    - `zscores.tsv` contains the z-scores for all exchange metabolites and all scenarios.
    - `densities.json` contains a density version of the sampling distributions for all exchange metabolites and all scenarios. Used in R for plotting purposes.
6. You may also need the files in `<model_name>_<KO_file_name>/dict/` to convert from the unique scenario IDs back to the original scenarios, and to convert metabolite IDs to names for plotting/readability purposes.
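
The `ids_file` layout described in the parameter list can be checked before submitting a job. The sketch below is a hypothetical helper, not part of SAMBA: the name `parse_ko_table`, the tab delimiter, and the reaction IDs `R1` to `R3` are assumptions made for illustration. It reads the `Scenario`, `IDs` and optional `Reduction` columns, splits space-separated IDs, and rejects `Reduction` values outside 0 to 1.

```python
import csv
import io


def parse_ko_table(text, delimiter="\t"):
    """Parse a KO table into (scenario, ids, reduction) tuples.

    Assumes a header row with `Scenario`, `IDs` and an optional `Reduction`
    column. The tab delimiter is an assumption, not confirmed by SAMBA.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for line_no, row in enumerate(reader, start=2):
        ids = (row.get("IDs") or "").split()
        if not ids:
            raise ValueError("line {}: no IDs given".format(line_no))
        raw = (row.get("Reduction") or "").strip()
        # None means the row gives no Reduction value.
        reduction = float(raw) if raw else None
        if reduction is not None and not 0 <= reduction <= 1:
            raise ValueError(
                "line {}: Reduction must be between 0 and 1".format(line_no)
            )
        rows.append((row["Scenario"], ids, reduction))
    return rows


# Hypothetical table: a single KO, a double KO, and a 30% flux reduction.
example = (
    "Scenario\tIDs\tReduction\n"
    "single_ko\tR1\t\n"
    "double_ko\tR1 R2\t\n"
    "partial\tR3\t0.3\n"
)
for scenario, ids, reduction in parse_ko_table(example):
    print(scenario, ids, reduction)
```

To check a real file, read it with `open(path).read()` and pass the text to the function, adjusting `delimiter` if your file uses a different separator.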

### SAMBAR
SAMBAR is not required to run SAMBA, but it is useful for importing and plotting SAMBA results in R scripts.

Requirements for SAMBAR:
- R (currently used version: 4.2.2)
- R.utils
- optparse
- ggh4x
- [sambar](https://forgemia.inra.fr/metexplore/cbm/samba-project/sambar/-/releases/permalink/latest/downloads/sambar)

Install via bash:
```bash
Rscript -e 'install.packages("https://forgemia.inra.fr/metexplore/cbm/samba-project/sambar/-/releases/permalink/latest/downloads/sambar", repos = NULL)'
```
Install via R:
```R
install.packages("https://forgemia.inra.fr/metexplore/cbm/samba-project/sambar/-/releases/permalink/latest/downloads/sambar", repos = NULL)
```


## Confirmed working models
- Human1 (HumanGEM)
- Recon2
- Toy model (small4m) with added exchange reactions

## Roadmap
- [x] Add support for multiple KOs at once
- [x] Add the option to reduce flux rates in addition to KO


## Authors
Juliette Cooke

## Acknowledgments


## License
MIT License (see LICENSE file)

## Project status
Currently active on this project.