# SAMBA

SAMBA (SAMpling Biomarker Analysis) is a toolset for running flux sampling on metabolic networks, predicting biomarkers (or metabolic profiles), and representing the results visually.
It uses a Snakemake pipeline to manage the entire workflow, starting from a metabolic network and a set of disruption scenarios (reactions or genes to knock out), and ending with a change score (z-score) for each exchange metabolite in the network, for each disruption scenario.
### Requirements

#### SAMBA

- [cobrapy](https://pypi.org/project/cobra/)
- [sambaflux](https://pypi.org/project/sambaflux/) (see `cluster_install.sh` for the installation command)
- A solver (GLPK, CPLEX, or GUROBI; note that CPLEX 12.10 does not work with Python 3.8+)
- Access to a computer cluster or a powerful computer for large metabolic networks
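If you want to install the Python requirements by hand rather than through `cluster_install.sh`, a minimal sketch (the package names come from the PyPI links above; the environment path is a placeholder, and the solver and Snakemake still need to be available separately):

```bash
# Manual install sketch -- cluster_install.sh is the supported route.
# $ENVPATH is a placeholder for wherever you keep Python environments.
python -m venv "$ENVPATH"
source "$ENVPATH/bin/activate"
pip install cobra sambaflux
```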
#### SAMBAR
- R (currently used version: 4.2.1)
- R.utils
- optparse
- ggh4x
- [sambar](https://forgemia.inra.fr/metexplore/cbm/samba-project/sambar/-/releases/permalink/latest/downloads/sambar) (see `cluster_install.sh` for the installation command)
### Installation and usage
0. Connect to a cluster using `ssh`, e.g. `ssh <username>@genologin.toulouse.inra.fr`.
1. On the cluster, use `pipeline/cluster_install.sh` to install the environments and requirements. You can either submit the file as a cluster job using `sbatch cluster_install.sh`, or connect interactively to a node using `srun --pty bash` and run the commands manually. Set the `ENVPATH` and `WORKINGDIR` paths beforehand and make sure the folders exist: `ENVPATH` is where the Python environment will be created; `WORKINGDIR` is where the SAMBA project (this project) will be cloned, and where you will run cluster jobs from. A sketch of this step follows the list below.
- You can also run `WORKINGDIR=/path/to/folder` in your cluster or local terminal so that `$WORKINGDIR` can be used in the next steps.
- If you're using CPLEX, set it up by adding CPLEX to the `PYTHONPATH` (see the commented line in `cluster_install.sh`).
- Make sure the Python module you're using contains Snakemake.
- Once installed, starting a new sampling run with different parameters only requires running `git clone --depth 1 https://forgemia.inra.fr/metexplore/cbm/samba-project/samba.git` into a different folder.
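A minimal sketch of this installation step, assuming a SLURM cluster; the two paths are placeholders you must adapt:

```bash
# Placeholder paths -- adjust to your cluster, and make sure both exist.
export ENVPATH=/path/to/envs/samba-env   # where the Python environment will be created
export WORKINGDIR=/path/to/work          # where the project lives and jobs are launched
mkdir -p "$ENVPATH" "$WORKINGDIR"

# Fetch the project, then submit the install script as a cluster job
# (or run its commands manually inside `srun --pty bash`).
cd "$WORKINGDIR"
git clone --depth 1 https://forgemia.inra.fr/metexplore/cbm/samba-project/samba.git
sbatch samba/pipeline/cluster_install.sh
```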
2. Using your preferred file copy method, copy the metabolic network file and the file listing the reactions/genes to KO to `$WORKINGDIR/samba/pipeline/data/` on the cluster.
- *Example from a local PC*: `rsync -aP /path/to/local/folder/data/ <username>@genologin.toulouse.inra.fr:$WORKINGDIR/samba/pipeline/data/`
3. `cd` into your `WORKINGDIR` and edit `config.yml` with the correct parameters. You can change these to use different models and KO files; a hedged example configuration is sketched after the parameter list below.
- Run `cd $WORKINGDIR`, then `vim config.yml`. Use `:q` to quit vim, or `:x` to save and quit.
- **Parameters**:
- `model`: model filename added in `data/`
- `samba_path`: path to the samba scripts folder. Since you will be running files from the `pipeline/` folder, this can be a relative path (`../local/scripts`) or an absolute path (`$WORKINGDIR/samba/local/scripts`).
- `ids`: whether you are using an input gene/reaction KO file (`"simple"`) or a scenario-based KO (`"scenario"`).
- `ids_file`: file containing the genes or reactions to KO if `ids: "simple"`. The first column, `Scenario`, contains descriptive scenario names (these do not have to be unique); the second column, `IDs`, contains the gene or reaction IDs to KO. Multiple IDs in one scenario should be separated by spaces. Ignored if using `ids: "scenario"`.
- `scale`: if using `ids: "scenario"`, set the scale to `"pathway"` to KO within pathways or `"network"` to KO within the entire network. Ignored if using `ids: "simple"`.
- `scenario`: if using `ids: "scenario"`, set the scenario to `"singlerxn"` to KO one random reaction or `"allrxn"` to KO all reactions in <scale>. Ignored if using `ids: "simple"`.
- `each`: if using `ids: "scenario"`, set `each` to `"--each"` to loop over <scale> and generate <scenario> KO IDs. Ignored if using `ids: "simple"`.
- `nsamples`: number of samples to use. `100000` is recommended for large human metabolic networks.
- `biomass`: minimum amount of biomass to optimise for.
- `biomassfile`: TSV file containing a `Model` column with model names and a `Biomass` column with each model's biomass reaction. You can add or replace a row in the existing file where needed. Only used if `biomass` != 0.
- `exchangemin`: value to set the exchange reaction lower bounds to (will be made negative), e.g. `1` will result in exchange reactions being set to `[-1, 1000]`.
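As promised above, a hedged sketch of a complete `config.yml` for a simple KO run; the keys follow the parameter list, but every value (model name, filenames) is an illustrative assumption:

```bash
# Illustrative only -- substitute your own model and KO file names.
cat > config.yml <<'EOF'
model: "HumanGEM.xml"           # model file placed in data/
samba_path: "../local/scripts"  # relative path to the samba scripts folder
ids: "simple"                   # "simple" (KO file) or "scenario"
ids_file: "ko_ids.tsv"          # only read when ids: "simple"
scale: "pathway"                # only read when ids: "scenario"
scenario: "singlerxn"           # only read when ids: "scenario"
each: "--each"                  # only read when ids: "scenario"
nsamples: 100000                # recommended for large human networks
biomass: 0                      # minimum biomass to optimise for
biomassfile: "biomass.tsv"      # only read when biomass != 0
exchangemin: 1                  # exchange lower bounds set to [-1, 1000]
EOF
```

And a matching sketch of the two-column `ids_file` described above (the reaction IDs are made up, and tab-separated columns are an assumption; multiple IDs in one scenario are space-separated):

```bash
# Write an example KO file into data/ with tab-separated columns.
printf 'Scenario\tIDs\n'                  >  data/ko_ids.tsv
printf 'single_ko\tRXN_0001\n'            >> data/ko_ids.tsv
printf 'double_ko\tRXN_0002 RXN_0003\n'   >> data/ko_ids.tsv
```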
4. Edit `submit_slurm.sh` to set `ENVPATH` to the same path you used before. You can also change the job name and the error and output filenames. A sketch of the relevant lines follows below.
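The actual script ships with the repository; this is only a hedged sketch of the kind of lines you would edit (the SLURM directives and the pipeline invocation are illustrative assumptions):

```bash
#!/bin/bash
#SBATCH --job-name=samba        # job name shown in squeue
#SBATCH --output=samba_%j.out   # stdout file (%j expands to the job ID)
#SBATCH --error=samba_%j.err    # stderr file

# Must match the ENVPATH used during installation.
ENVPATH=/path/to/envs/samba-env
source "$ENVPATH/bin/activate"

# Illustrative pipeline launch; the real script may invoke Snakemake differently.
snakemake --cores "${SLURM_CPUS_PER_TASK:-1}"
```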
5. Make sure you are in `$WORKINGDIR/samba/pipeline/`, and run `sbatch submit_slurm.sh`.
- You can watch the job submissions using `watch squeue -u <username>` (`Ctrl+C` to exit the watch).
6. Results can be found in `<model_name>_<KO_file_name>/output/`.
- `zscores.tsv` contains the z-scores for all exchange metabolites and all scenarios.
- `densities.rds` contains a compressed version of the sampling distributions for all exchange metabolites and all scenarios. Used in R for plotting purposes.
7. You may also need the files in `<model_name>_<KO_file_name>/dict/` to convert the unique scenario IDs back to the original scenarios, and to convert metabolite IDs to names for plotting/readability purposes.
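For a quick command-line look at the results (a sketch; the run directory name depends on your model and KO file names):

```bash
# Hypothetical run directory -- substitute your own model and KO file names.
cd HumanGEM_ko_ids/output/

# Preview the z-scores for all exchange metabolites and scenarios.
head -n 5 zscores.tsv | column -t -s $'\t'
```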
### Tested models

- Human1 (HumanGEM)
- Recon2
- Toy model (small4m) with added exchange reactions
### TODO

- [x] Add support for multiple KOs at once
- [ ] Add the option to reduce flux rates in addition to KO