Newer
Older
<img src="logo/seqoccin.png" width="300">
<img src="logo/feder.jpg" width="300">
<img src="logo/genotoul_bioinfo.png" width="200">
<img src="logo/inrae.jpg" width="200">
<img src="logo/genphyse.png" width="200">
</p>
# Axe2 - Epigenectics
## Pipeline for differential methylation analysis
The initial project is available here: [axis2-epigenectics](https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/tree/BSseq-DiffAnalysis)
-o "name of output directory with quote" \
-m "tabulated metadata file" \
-p "path to methylation calls with quote : (eg: nextflow/results/bismark_methylation_calls/stranded_CpG_report" \
--vcf="Variant calling file"
--process_functions="/bsseqmethdiffanalysis/DSS/BS_process_functions.R" \
-x 1000 \
-o, --directory, type=CHARACTER,
Path to the output directory,
[default=NULL]
-m, --metadata, type=CHARACTER,
File of summarized data and experience,
[default=NULL]
-p, --pathToFiles, type=CHARACTER,
Path to the bismark files to be processed,
[default=NULL]
--process_functions, type=CHARACTER,
Complete path to the R functions for preprocessing,
[default="./BS_process_functions.R"]
--mergeStrands, type=LOGICAL,
Set FALSE to ignore the merging of CpG counts per strand,
[default=TRUE]
--vcf, type=CHARACTER,
SNP file : format VCF (vcf.gz accepted),
[default=NULL]
--is_norm, type=LOGICAL,
Coverage normalisation,
[default=FALSE]
--replicate, type=LOGICAL,
Are there replicates in study ?,
[default=TRUE]
-x, --maxCoverage, type=INTEGER,
Maximum of coverage depth,
[default = 1000]
-y, --minCoverage, type=INTEGER,
Minimum of coverage depth,
[default = 5]
-z, --sampleFraction, type=DOUBLE,
Minimum fraction of sample per position, between 0 and 1
[default = NULL]
-d "path to preprocessed file (eg: outpout_dir/bismark_format)" \
-m "File of summarized data and experience" \
--smoothing TRUE \
--formula "line+sex" \
--term "line,sex" \
--ref "epi-,F" \
--dmr TRUE \
--dmr.type qvalue \ # dmr stats (pvalue or qvalue)
--dmr.numC 3 \ # minimum number of dmc for one dmr
--dmr.propDMC 0.5 \ # cutoff of the proportion of DMCs in each region to call DMR
--alpha 0.05 \ # significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC)
--correct BH \
--type 'gene,exon,five_prime_utr,three_prime_utr' \ # type of annotation to loock for in gff file
--tss tss.txt \ # List of TSS (files format: chr\ttss\tstrand)
--gff annotation_file.gtf \
--parallel TRUE \
--ncores 16 \
--plots TRUE \ # output plot boolean
--rda FALSE \
-o output_directory/ \ # Ouput directory for plots and Rfiles
--process_functions ../bsseqmethdiffanalysis/DSS/BS_DSS_analysis.R
-d, --directory, type=CHARACTER,
Name of the directory where all files are stored.
Files format: "chr pos N X freqC strand",
[default=NULL]
--multifactor, type=LOGICAL,
TRUE if there is another factor in the metadata file.
To correct the potential effect of another factor,
[default=FALSE]
--smoothing, type=LOGICAL,
If TRUE smoothing is on (recommended for WGBS data).
Better estimations of mean methylation are obtained due to the spatial correlation should be provided,
[default=TRUE]
-m, --metadata, type=CHARACTER,
File of summarized data and experience should be provided,
[default = NULL]
--alpha, type=DOUBLE,
Significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC),
[default=0.05]
--correct, type=CHARACTER,
Method used to adjust p-values for multiple testing ("BH" or "bonferroni"),
[default="BH"]
--dmr.numC, type=INTEGER,
Cutoff of the number of CpGs (CHH or CHG) in each region to call DMR,
[default=3]
--dmr.propDMC, type=DOUBLE,
Cutoff of the proportion of DMCs in each region to call DMR,
--dmr.type, type=CHARACTER,
Definition of DMC in DMR ("pvalue" or "qvalue"),
[default="qvalue"]
--min_cov, type=INTEGER,
Minimum of coverage depth.
[default=5]
--gff, type=CHARACTER,
Annotation file (gff ot gtf files),
[default=NULL]
--type, type=CHARACTER,
List of interested type of regions separeted by ',' (e.g. exon, intron, 5_prime_utr...),
[default=NULL]
--tss, type=CHARACTER,
File with TSS (files format: chr tss strand),
[default=NULL]
--parallel, type=LOGICAL,
If TRUE some of the methods are performed on multiple cores,
[default=FALSE]
--ncores, type=INTEGER,
Number of cores in parallel mode,
[default=NULL]
--plots, type=LOGICAL,
[default=TRUE]
--rda, type=LOGICAL,
If TRUE rda are generated,
[default=TRUE]
-o, --out, type=CHARACTER,
Folder path where results are stored,
[default=NULL]
--process_functions, type=CHARACTER,
Complete path to the R functions for preprocessing,
[default=NULL]
-h, --help
Show this help message and exit
--comp, type=CHARACTER,
The two groups to compare (ex: "M,F").
Only if --multifactor is FALSE,
[default=NULL]
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
--formula, type=CHARACTER,
Linear model fitted.
Only when --multifactor is TRUE.
Exemple of formula: "tissue+treat",
[default=NULL]
--ref, type=CHARACTER,
The reference level(s) of each factor(s) used for the Wald test.
Separate each value with a comma (example:"Sang,TI").
Only when --multifactor is TRUE,
[default=NULL]
--term, type=CHARACTER,
The effect(s) to be tested.
If --multifactor is TRUE and --term is NULL, all effects are tested,
[default=NULL]
--coef, type=CHARACTER,
The specific effect(s) to be tested.
If --multifactor is TRUE, coefficient in the linear model to be tested.
Separate each value with a comma (example:"TEMOIN,5AZA"),
[default=NULL]
```
### edgeR
#### Preprocessing with *BS_preprocessing.R*
```sh
Rscript ../bsseqmethdiffanalysis/edgeR/BS_preprocessing.R \
-o "name of output directory with quote" \
-m "tabulated metadata file" \
-p "path to methylation calls with quote : (eg: nextflow/results/bismark_methylation_calls/stranded_CpG_report" \
--vcf="Variant calling file"
--process_functions="/bsseqmethdiffanalysis/DSS/BS_process_functions.R" \
--group "treat,line" \
-x 1000 \
-y 2 \
-z 0.4
```
**Usage:**
```r
Options:
-o, --directory, type=CHARACTER,
Path to the output directory,
[default=NULL]
-m, --metadata, type=CHARACTER,
File of summarized data and experience,
[default=NULL]
-p, --pathToFiles, type=CHARACTER,
Path to the bismark files to be processed,
[default=NULL]
--process_functions, type=CHARACTER,
Complete path to the R functions for preprocessing,
[default="./BS_process_functions.R"]
--mergeStrands, type=LOGICAL,
Set FALSE to ignore the merging of CpG counts per strand,
[default=TRUE]
--vcf, type=CHARACTER,
SNP file : format VCF (vcf.gz accepted),
[default=NULL]
--is_norm, type=LOGICAL,
Coverage normalisation,
[default=FALSE]
--replicate, type=LOGICAL,
Are there replicates in study ?,
[default=TRUE]
--group, type=CHARACTER,
Filtering of potential SNPs by group.
If mean_X by level of each variable = 0, CpGs are filtered from the bismark files.
Separate each value with a comma (example:"treat,line"),
[default=NULL]
-x, --maxCoverage, type=INTEGER,
Maximum of coverage depth,
[default = 1000]
-y, --minCoverage, type=INTEGER,
Minimum of coverage depth,
[default = 5]
-z, --sampleFraction, type=DOUBLE,
Minimum fraction of sample per position, between 0 and 1
[default = NULL]
-h, --help
Show this help message and exit
```
```sh
Rscript ../bsseqmethdiffanalysis/edgeR/MethDiffDSS.R \
-d "path to preprocessed file (eg: outpout_dir/bismark_format)" \
-m "File of summarized data and experience" \
--multifactor TRUE \
--formula "treat+sex" \
--ref "T,F" \
--term "treat,sex" \
--parallel TRUE \
--ncores 16 \
-o output_directory/ \ # Ouput directory for plots and Rfiles
--dmr TRUE \
--dmr.numC 3 \ # minimum number of dmc for one dmr
--dmr.propDMC 0.5 \ # cutoff of the proportion of DMCs in each region to call DMR
--alpha 0.05 \ #significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC)
--plot TRUE # output plot boolean
```
**Usage:**
```r
Options:
-d, --directory, type=CHARACTER,
Name of the directory where all files are stored.
Files format: "chr start end freqC C T",
[default=NULL]
--multifactor, type=LOGICAL,
TRUE if there is another factor in the metadata file.
To correct the potential effect of another factor,
[default=FALSE]
-m, --metadata, type=CHARACTER,
File of summarized data and experience should be provided,
[default = NULL]
--alpha, type=DOUBLE,
Significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC),
[default=0.05]
[default=FALSE]
--dmr.numC, type=INTEGER,
Cutoff of the number of CpGs (CHH or CHG) in each region to call DMR,
[default=3]
--dmr.propDMC, type=DOUBLE,
Cutoff of the proportion of DMCs in each region to call DMR,
--min_cov, type=INTEGER,
Minimum of coverage depth.
[default=5]
--parallel, type=LOGICAL,
If TRUE some of the methods are performed on multiple cores,
[default=FALSE]
--ncores, type=INTEGER,
Number of cores in parallel mode,
[default=NULL]
--plots, type=LOGICAL,
[default=TRUE]
--rda, type=LOGICAL,
If TRUE rda are generated,
[default=TRUE]
-o, --out, type=CHARACTER,
Folder path where results are stored,
[default=NULL]
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
Show this help message and exit
Options single factor (multifactor=FALSE):
--comp, type=CHARACTER,
The two groups to compare (ex: "M,F").
Only if --multifactor is FALSE,
[default=NULL]
Options multifactor (multifactor=TRUE):
--formula, type=CHARACTER,
Linear model fitted.
Only when --multifactor is TRUE.
Exemple of formula: "tissue+treat",
[default=NULL]
--ref, type=CHARACTER,
The reference level(s) of each factor(s).
Separate each value with a comma (example:"Sang,TI").
Only when --multifactor is TRUE,
[default=NULL]
--term, type=CHARACTER,
The effect(s) to be tested.
If --multifactor is TRUE and --term is NULL, all effects are tested,
[default=NULL]
**Download a metadata file example [here](http://genoweb.toulouse.inra.fr/~pterzian/files/template_metadata.tsv)**