README.md

<p align="center"> 
                   <img src="logo/seqoccin.png" width="300">
                   <img src="logo/feder.jpg" width="300">
                   <img src="logo/genotoul_bioinfo.png" width="200">
</p>

<p align="center">
                   <img src="logo/inrae.jpg" width="200">
                   <img src="logo/genphyse.png" width="200">
</p>

# Axe2 - Epigenectics

## Pipeline for differential methylation analysis

The initial project is available here: [axis2-epigenectics](https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/tree/BSseq-DiffAnalysis)

### Running the pipeline inside a conda environment

**Load miniconda**

```sh
module load system/Miniconda3-4.7.10
```

**Create conda env from methdiffR-env.yml**

```sh
conda env create -f methdiffR-env.yml
```

**Load the environment**

```sh
conda activate methdiffR-env
```

### DSS

#### Preprocessing with *BS_preprocessing.R* 

```sh
Rscript ../bsseqmethdiffanalysis/DSS/BS_preprocessing.R \
 	-o "name of output directory with quote" \ 
	-m "tabulated metadata file" \
	-p "path to methylation calls with quote : (eg: nextflow/results/bismark_methylation_calls/stranded_CpG_report" \ 
	--vcf="Variant calling file" 
	--process_functions="/bsseqmethdiffanalysis/DSS/BS_process_functions.R" \
	-x 1000 \
	-y 2 \ 
	-z 0.4 
```

**Usage:**

```r
Options:
        -o, --directory, type=CHARACTER, 
              Path to the output directory,
              [default=NULL]

        -m, --metadata, type=CHARACTER,
              File of summarized data and experience,
              [default=NULL]

        -p, --pathToFiles, type=CHARACTER,
              Path to the bismark files to be processed,
              [default=NULL]

        --process_functions, type=CHARACTER,
              Complete path to the R functions for preprocessing,
              [default="./BS_process_functions.R"]

        --mergeStrands, type=LOGICAL,
              Set FALSE to ignore the merging of CpG counts per strand,
              [default=TRUE]

        --vcf, type=CHARACTER,
              SNP file : format VCF (vcf.gz accepted),
              [default=NULL]

        --is_norm, type=LOGICAL,
              Coverage normalisation,
              [default=FALSE]

        --replicate, type=LOGICAL,
              Are there replicates in study ?,
              [default=TRUE]

        -x, --maxCoverage, type=INTEGER,
              Maximum of coverage depth, 
              [default = 1000]

        -y, --minCoverage, type=INTEGER,
              Minimum of coverage depth, 
              [default = 5]

        -z, --sampleFraction, type=DOUBLE,
              Minimum fraction of sample per position, between 0 and 1 
              [default = NULL]

        -h, --help
              Show this help message and exit
```

#### DMC/DMR identification with *MethDiffDSS.R*

```sh
Rscript ../bsseqmethdiffanalysis/DSS/MethDiffDSS.R \
 	-d "path to preprocessed file (eg: outpout_dir/bismark_format)" \ 
 	-m "File of summarized data and experience" \
 	--smoothing TRUE \
 	--formula "line+sex" \
 	--term "line,sex" \
 	--ref "epi-,F" \
 	--dmr TRUE \
 	--dmr.type qvalue \ # dmr stats (pvalue or qvalue)
 	--dmr.numC 3 \ # minimum number of dmc for one dmr
 	--dmr.propDMC 0.5 \ # cutoff of the proportion of DMCs in each region to call DMR
 	--alpha 0.05 \ # significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC)
 	--correct BH \
 	--type 'gene,exon,five_prime_utr,three_prime_utr' \ # type of annotation to loock for in gff file
 	--tss tss.txt \ # List of TSS (files format: chr\ttss\tstrand)
 	--gff annotation_file.gtf \
 	--parallel TRUE \
 	--ncores 16 \
 	--plots TRUE \ # output plot boolean
 	--rda FALSE \
 	-o output_directory/ \ # Ouput directory for plots and Rfiles
 	--process_functions ../bsseqmethdiffanalysis/DSS/BS_DSS_analysis.R
```


**Usage:**

```r
Options:
        -d, --directory, type=CHARACTER, 
              Name of the directory where all files are stored.
              Files format: "chr pos N X freqC strand",
              [default=NULL]
        
        --multifactor, type=LOGICAL,
              TRUE if there is another factor in the metadata file. 
              To correct the potential effect of another factor, 
              [default=FALSE]
                
        --smoothing, type=LOGICAL,
              If TRUE smoothing is on (recommended for WGBS data). 
              Better estimations of mean methylation are obtained due to the spatial correlation should be provided,
              [default=TRUE]
        
        -m, --metadata, type=CHARACTER,
              File of summarized data and experience should be provided,
              [default = NULL]
              
        --alpha, type=DOUBLE, 
              Significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC),
              [default=0.05]
      
        --correct, type=CHARACTER, 
              Method used to adjust p-values for multiple testing ("BH" or "bonferroni"),
              [default="BH"]
              
        --dmr, type=LOGICAL, 
              If TRUE DMR are extracted,
              [default=FALSE]
              
        --dmr.numC, type=INTEGER,
              Cutoff of the number of CpGs (CHH or CHG) in each region to call DMR,
              [default=3]
              
        --dmr.propDMC, type=DOUBLE, 
              Cutoff of the proportion of DMCs in each region to call DMR,
              [default=1]
              
        --dmr.type, type=CHARACTER,
              Definition of DMC in DMR ("pvalue" or "qvalue"),
              [default="qvalue"]
              
        --min_cov, type=INTEGER, 
              Minimum of coverage depth.
              [default=5]
              
        --gff, type=CHARACTER, 
              Annotation file (gff ot gtf files),
              [default=NULL]
              
        --type, type=CHARACTER, 
              List of interested type of regions separeted by ',' (e.g. exon, intron, 5_prime_utr...),
              [default=NULL]
              
        --tss, type=CHARACTER,
              File with TSS (files format: chr  tss strand),
              [default=NULL]
              
        --parallel, type=LOGICAL,
              If TRUE some of the methods are performed on multiple cores,
              [default=FALSE]
              
        --ncores, type=INTEGER,
              Number of cores in parallel mode,
              [default=NULL]
              
        --plots, type=LOGICAL, 
              If TRUE plots are printed,
              [default=TRUE]
                    
        --rda, type=LOGICAL, 
              If TRUE rda are generated,
              [default=TRUE]

        -o, --out, type=CHARACTER, 
              Folder path where results are stored,
              [default=NULL]
                    
        --process_functions, type=CHARACTER,
              Complete path to the R functions for preprocessing,
              [default=NULL]
        
        -h, --help
              Show this help message and exit
              
Options single factor (multifactor=FALSE):

        --comp, type=CHARACTER, 
              The two groups to compare (ex: "M,F"). 
              Only if --multifactor is FALSE, 
              [default=NULL]

Options multifactor (multifactor=TRUE):

        --formula, type=CHARACTER,
              Linear model fitted. 
              Only when --multifactor is TRUE. 
              Exemple of formula: "tissue+treat",
              [default=NULL]
 
        --ref, type=CHARACTER, 
              The reference level(s) of each factor(s) used for the Wald test. 
              Separate each value with a comma (example:"Sang,TI").
              Only when --multifactor is TRUE,
              [default=NULL]

        --term, type=CHARACTER, 
              The effect(s) to be tested. 
              If --multifactor is TRUE and --term is NULL, all effects are tested,
              [default=NULL]
              
        --coef, type=CHARACTER,
              The specific effect(s) to be tested. 
              If --multifactor is TRUE, coefficient in the linear model to be tested.
              Separate each value with a comma (example:"TEMOIN,5AZA"),
              [default=NULL]
```


### edgeR

#### Preprocessing with *BS_preprocessing.R* 


```sh
Rscript ../bsseqmethdiffanalysis/edgeR/BS_preprocessing.R \
 	-o "name of output directory with quote" \ 
	-m "tabulated metadata file" \
	-p "path to methylation calls with quote : (eg: nextflow/results/bismark_methylation_calls/stranded_CpG_report" \ 
	--vcf="Variant calling file" 
	--process_functions="/bsseqmethdiffanalysis/DSS/BS_process_functions.R" \
	--group "treat,line" \
	-x 1000 \
	-y 2 \ 
	-z 0.4 
```

**Usage:**

```r
Options:
        -o, --directory, type=CHARACTER, 
              Path to the output directory,
              [default=NULL]

        -m, --metadata, type=CHARACTER,
              File of summarized data and experience,
              [default=NULL]

        -p, --pathToFiles, type=CHARACTER,
              Path to the bismark files to be processed,
              [default=NULL]

        --process_functions, type=CHARACTER,
              Complete path to the R functions for preprocessing,
              [default="./BS_process_functions.R"]

        --mergeStrands, type=LOGICAL,
              Set FALSE to ignore the merging of CpG counts per strand,
              [default=TRUE]

        --vcf, type=CHARACTER,
              SNP file : format VCF (vcf.gz accepted),
              [default=NULL]

        --is_norm, type=LOGICAL,
              Coverage normalisation,
              [default=FALSE]

        --replicate, type=LOGICAL,
              Are there replicates in study ?,
              [default=TRUE]
              
        --group, type=CHARACTER, 
              Filtering of potential SNPs by group.
              If mean_X by level of each variable = 0, CpGs are filtered from the bismark files. 
              Separate each value with a comma (example:"treat,line"),
              [default=NULL]

        -x, --maxCoverage, type=INTEGER,
              Maximum of coverage depth, 
              [default = 1000]

        -y, --minCoverage, type=INTEGER,
              Minimum of coverage depth, 
              [default = 5]

        -z, --sampleFraction, type=DOUBLE,
              Minimum fraction of sample per position, between 0 and 1 
              [default = NULL]

        -h, --help
              Show this help message and exit
```

#### DMC/DMR identification with *MethDiffedgeR.R*

```sh
Rscript ../bsseqmethdiffanalysis/edgeR/MethDiffDSS.R \
 	-d "path to preprocessed file (eg: outpout_dir/bismark_format)" \
 	-m "File of summarized data and experience" \
 	--multifactor TRUE \
 	--formula "treat+sex" \
 	--ref "T,F" \
 	--term "treat,sex" \
 	--parallel TRUE \
 	--ncores 16 \
 	-o output_directory/ \ # Ouput directory for plots and Rfiles
 	--dmr TRUE \ 
 	--dmr.numC 3 \ # minimum number of dmc for one dmr
 	--dmr.propDMC 0.5 \ # cutoff of the proportion of DMCs in each region to call DMR
 	--alpha 0.05 \ #significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC)
 	--plot TRUE # output plot boolean
```


**Usage:**

```r
Options:
        -d, --directory, type=CHARACTER, 
              Name of the directory where all files are stored.
              Files format: "chr start end freqC C T",
              [default=NULL]
        
        --multifactor, type=LOGICAL,
              TRUE if there is another factor in the metadata file. 
              To correct the potential effect of another factor, 
              [default=FALSE]
                
        -m, --metadata, type=CHARACTER,
              File of summarized data and experience should be provided,
              [default = NULL]
              
        --alpha, type=DOUBLE, 
              Significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC),
              [default=0.05]
      
        --dmr, type=LOGICAL, 
              If TRUE DMR are extracted,
              [default=FALSE]
              
        --dmr.numC, type=INTEGER,
              Cutoff of the number of CpGs (CHH or CHG) in each region to call DMR,
              [default=3]
              
        --dmr.propDMC, type=DOUBLE, 
              Cutoff of the proportion of DMCs in each region to call DMR,
              [default=0.5]
              
        --min_cov, type=INTEGER, 
              Minimum of coverage depth.
              [default=5]
              
        --parallel, type=LOGICAL,
              If TRUE some of the methods are performed on multiple cores,
              [default=FALSE]
              
        --ncores, type=INTEGER,
              Number of cores in parallel mode,
              [default=NULL]
              
        --plots, type=LOGICAL, 
              If TRUE plots are printed,
              [default=TRUE]
                    
        --rda, type=LOGICAL, 
              If TRUE rda are generated,
              [default=TRUE]

        -o, --out, type=CHARACTER, 
              Folder path where results are stored,
              [default=NULL]
                    
        -h, --help
              Show this help message and exit
              
Options single factor (multifactor=FALSE):

        --comp, type=CHARACTER, 
              The two groups to compare (ex: "M,F"). 
              Only if --multifactor is FALSE, 
              [default=NULL]

Options multifactor (multifactor=TRUE):

        --formula, type=CHARACTER,
              Linear model fitted. 
              Only when --multifactor is TRUE. 
              Exemple of formula: "tissue+treat",
              [default=NULL]
 
        --ref, type=CHARACTER, 
              The reference level(s) of each factor(s). 
              Separate each value with a comma (example:"Sang,TI").
              Only when --multifactor is TRUE,
              [default=NULL]

        --term, type=CHARACTER, 
              The effect(s) to be tested. 
              If --multifactor is TRUE and --term is NULL, all effects are tested,
              [default=NULL]
```

**Download a metadata file example [here](http://genoweb.toulouse.inra.fr/~pterzian/files/template_metadata.tsv)**