Axe2 - Epigenectics
Pipeline for differential methylation analysis
The initial project is available here: axis2-epigenectics
Running the pipeline inside a conda environment
Load miniconda
module load system/Miniconda3-4.7.10
Create conda env from methdiffR-env.yml
conda env create -f methdiffR-env.yml
Load the environment
conda activate methdiffR-env
DSS
Preprocessing with BS_preprocessing.R
Rscript ../bsseqmethdiffanalysis/DSS/BS_preprocessing.R \
-o "name of output directory with quote" \
-m "tabulated metadata file" \
-p "path to methylation calls with quote : (eg: nextflow/results/bismark_methylation_calls/stranded_CpG_report" \
--vcf="Variant calling file"
--process_functions="/bsseqmethdiffanalysis/DSS/BS_process_functions.R" \
-x 1000 \
-y 2 \
-z 0.4
Usage:
Options:
-o, --directory, type=CHARACTER,
Path to the output directory,
[default=NULL]
-m, --metadata, type=CHARACTER,
File of summarized data and experience,
[default=NULL]
-p, --pathToFiles, type=CHARACTER,
Path to the bismark files to be processed,
[default=NULL]
--process_functions, type=CHARACTER,
Complete path to the R functions for preprocessing,
[default="./BS_process_functions.R"]
--mergeStrands, type=LOGICAL,
Set FALSE to ignore the merging of CpG counts per strand,
[default=TRUE]
--vcf, type=CHARACTER,
SNP file : format VCF (vcf.gz accepted),
[default=NULL]
--is_norm, type=LOGICAL,
Coverage normalisation,
[default=FALSE]
--replicate, type=LOGICAL,
Are there replicates in study ?,
[default=TRUE]
-x, --maxCoverage, type=INTEGER,
Maximum of coverage depth,
[default = 1000]
-y, --minCoverage, type=INTEGER,
Minimum of coverage depth,
[default = 5]
-z, --sampleFraction, type=DOUBLE,
Minimum fraction of sample per position, between 0 and 1
[default = NULL]
-h, --help
Show this help message and exit
DMC/DMR identification with MethDiffDSS.R
Rscript ../bsseqmethdiffanalysis/DSS/MethDiffDSS.R \
-d "path to preprocessed file (eg: outpout_dir/bismark_format)" \
-m "File of summarized data and experience" \
--smoothing TRUE \
--formula "line+sex" \
--term "line,sex" \
--ref "epi-,F" \
--dmr TRUE \
--dmr.type qvalue \ # dmr stats (pvalue or qvalue)
--dmr.numC 3 \ # minimum number of dmc for one dmr
--dmr.propDMC 0.5 \ # cutoff of the proportion of DMCs in each region to call DMR
--alpha 0.05 \ # significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC)
--correct BH \
--type 'gene,exon,five_prime_utr,three_prime_utr' \ # type of annotation to loock for in gff file
--tss tss.txt \ # List of TSS (files format: chr\ttss\tstrand)
--gff annotation_file.gtf \
--parallel TRUE \
--ncores 16 \
--plots TRUE \ # output plot boolean
--rda FALSE \
-o output_directory/ \ # Ouput directory for plots and Rfiles
--process_functions ../bsseqmethdiffanalysis/DSS/BS_DSS_analysis.R
Usage:
Options:
-d, --directory, type=CHARACTER,
Name of the directory where all files are stored.
Files format: "chr pos N X freqC strand",
[default=NULL]
--multifactor, type=LOGICAL,
TRUE if there is another factor in the metadata file.
To correct the potential effect of another factor,
[default=FALSE]
--smoothing, type=LOGICAL,
If TRUE smoothing is on (recommended for WGBS data).
Better estimations of mean methylation are obtained due to the spatial correlation should be provided,
[default=TRUE]
-m, --metadata, type=CHARACTER,
File of summarized data and experience should be provided,
[default = NULL]
--alpha, type=DOUBLE,
Significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC),
[default=0.05]
--correct, type=CHARACTER,
Method used to adjust p-values for multiple testing ("BH" or "bonferroni"),
[default="BH"]
--dmr, type=LOGICAL,
If TRUE DMR are extracted,
[default=FALSE]
--dmr.numC, type=INTEGER,
Cutoff of the number of CpGs (CHH or CHG) in each region to call DMR,
[default=3]
--dmr.propDMC, type=DOUBLE,
Cutoff of the proportion of DMCs in each region to call DMR,
[default=1]
--dmr.type, type=CHARACTER,
Definition of DMC in DMR ("pvalue" or "qvalue"),
[default="qvalue"]
--min_cov, type=INTEGER,
Minimum of coverage depth.
[default=5]
--gff, type=CHARACTER,
Annotation file (gff ot gtf files),
[default=NULL]
--type, type=CHARACTER,
List of interested type of regions separeted by ',' (e.g. exon, intron, 5_prime_utr...),
[default=NULL]
--tss, type=CHARACTER,
File with TSS (files format: chr tss strand),
[default=NULL]
--parallel, type=LOGICAL,
If TRUE some of the methods are performed on multiple cores,
[default=FALSE]
--ncores, type=INTEGER,
Number of cores in parallel mode,
[default=NULL]
--plots, type=LOGICAL,
If TRUE plots are printed,
[default=TRUE]
--rda, type=LOGICAL,
If TRUE rda are generated,
[default=TRUE]
-o, --out, type=CHARACTER,
Folder path where results are stored,
[default=NULL]
--process_functions, type=CHARACTER,
Complete path to the R functions for preprocessing,
[default=NULL]
-h, --help
Show this help message and exit
Options single factor (multifactor=FALSE):
--comp, type=CHARACTER,
The two groups to compare (ex: "M,F").
Only if --multifactor is FALSE,
[default=NULL]
Options multifactor (multifactor=TRUE):
--formula, type=CHARACTER,
Linear model fitted.
Only when --multifactor is TRUE.
Exemple of formula: "tissue+treat",
[default=NULL]
--ref, type=CHARACTER,
The reference level(s) of each factor(s) used for the Wald test.
Separate each value with a comma (example:"Sang,TI").
Only when --multifactor is TRUE,
[default=NULL]
--term, type=CHARACTER,
The effect(s) to be tested.
If --multifactor is TRUE and --term is NULL, all effects are tested,
[default=NULL]
--coef, type=CHARACTER,
The specific effect(s) to be tested.
If --multifactor is TRUE, coefficient in the linear model to be tested.
Separate each value with a comma (example:"TEMOIN,5AZA"),
[default=NULL]
edgeR
Preprocessing with BS_preprocessing.R
Rscript ../bsseqmethdiffanalysis/edgeR/BS_preprocessing.R \
-o "name of output directory with quote" \
-m "tabulated metadata file" \
-p "path to methylation calls with quote : (eg: nextflow/results/bismark_methylation_calls/stranded_CpG_report" \
--vcf="Variant calling file"
--process_functions="/bsseqmethdiffanalysis/DSS/BS_process_functions.R" \
--group "treat,line" \
-x 1000 \
-y 2 \
-z 0.4
Usage:
Options:
-o, --directory, type=CHARACTER,
Path to the output directory,
[default=NULL]
-m, --metadata, type=CHARACTER,
File of summarized data and experience,
[default=NULL]
-p, --pathToFiles, type=CHARACTER,
Path to the bismark files to be processed,
[default=NULL]
--process_functions, type=CHARACTER,
Complete path to the R functions for preprocessing,
[default="./BS_process_functions.R"]
--mergeStrands, type=LOGICAL,
Set FALSE to ignore the merging of CpG counts per strand,
[default=TRUE]
--vcf, type=CHARACTER,
SNP file : format VCF (vcf.gz accepted),
[default=NULL]
--is_norm, type=LOGICAL,
Coverage normalisation,
[default=FALSE]
--replicate, type=LOGICAL,
Are there replicates in study ?,
[default=TRUE]
--group, type=CHARACTER,
Filtering of potential SNPs by group.
If mean_X by level of each variable = 0, CpGs are filtered from the bismark files.
Separate each value with a comma (example:"treat,line"),
[default=NULL]
-x, --maxCoverage, type=INTEGER,
Maximum of coverage depth,
[default = 1000]
-y, --minCoverage, type=INTEGER,
Minimum of coverage depth,
[default = 5]
-z, --sampleFraction, type=DOUBLE,
Minimum fraction of sample per position, between 0 and 1
[default = NULL]
-h, --help
Show this help message and exit
DMC/DMR identification with MethDiffedgeR.R
Rscript ../bsseqmethdiffanalysis/edgeR/MethDiffDSS.R \
-d "path to preprocessed file (eg: outpout_dir/bismark_format)" \
-m "File of summarized data and experience" \
--multifactor TRUE \
--formula "treat+sex" \
--ref "T,F" \
--term "treat,sex" \
--parallel TRUE \
--ncores 16 \
-o output_directory/ \ # Ouput directory for plots and Rfiles
--dmr TRUE \
--dmr.numC 3 \ # minimum number of dmc for one dmr
--dmr.propDMC 0.5 \ # cutoff of the proportion of DMCs in each region to call DMR
--alpha 0.05 \ #significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC)
--plot TRUE # output plot boolean
Usage:
Options:
-d, --directory, type=CHARACTER,
Name of the directory where all files are stored.
Files format: "chr start end freqC C T",
[default=NULL]
--multifactor, type=LOGICAL,
TRUE if there is another factor in the metadata file.
To correct the potential effect of another factor,
[default=FALSE]
-m, --metadata, type=CHARACTER,
File of summarized data and experience should be provided,
[default = NULL]
--alpha, type=DOUBLE,
Significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC),
[default=0.05]
--dmr, type=LOGICAL,
If TRUE DMR are extracted,
[default=FALSE]
--dmr.numC, type=INTEGER,
Cutoff of the number of CpGs (CHH or CHG) in each region to call DMR,
[default=3]
--dmr.propDMC, type=DOUBLE,
Cutoff of the proportion of DMCs in each region to call DMR,
[default=0.5]
--min_cov, type=INTEGER,
Minimum of coverage depth.
[default=5]
--parallel, type=LOGICAL,
If TRUE some of the methods are performed on multiple cores,
[default=FALSE]
--ncores, type=INTEGER,
Number of cores in parallel mode,
[default=NULL]
--plots, type=LOGICAL,
If TRUE plots are printed,
[default=TRUE]
--rda, type=LOGICAL,
If TRUE rda are generated,
[default=TRUE]
-o, --out, type=CHARACTER,
Folder path where results are stored,
[default=NULL]
-h, --help
Show this help message and exit
Options single factor (multifactor=FALSE):
--comp, type=CHARACTER,
The two groups to compare (ex: "M,F").
Only if --multifactor is FALSE,
[default=NULL]
Options multifactor (multifactor=TRUE):
--formula, type=CHARACTER,
Linear model fitted.
Only when --multifactor is TRUE.
Exemple of formula: "tissue+treat",
[default=NULL]
--ref, type=CHARACTER,
The reference level(s) of each factor(s).
Separate each value with a comma (example:"Sang,TI").
Only when --multifactor is TRUE,
[default=NULL]
--term, type=CHARACTER,
The effect(s) to be tested.
If --multifactor is TRUE and --term is NULL, all effects are tested,
[default=NULL]
Download a metadata file example here