Newer
Older
<img src="logo/seqoccin.png" width="300">
<img src="logo/feder.jpg" width="300">
<img src="logo/genotoul_bioinfo.png" width="200">
<img src="logo/inrae.jpg" width="200">
<img src="logo/genphyse.png" width="200">
</p>
# Axe2 - Epigenectics
## Pipeline for differential methylation analysis
The initial project is available here: [axis2-epigenectics](https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/tree/BSseq-DiffAnalysis)
-o "name of output directory with quote" \
-m "tabulated metadata file" \
-p "path to methylation calls with quote : (eg: nextflow/results/bismark_methylation_calls/stranded_CpG_report" \
--vcf="Variant calling file"
--process_functions="/bsseqmethdiffanalysis/DSS/BS_process_functions.R" \
-x 1000 \
-o, --directory, type=CHARACTER,
Path to the output directory,
[default=NULL]
-m, --metadata, type=CHARACTER,
File of summarized data and experience,
[default=NULL]
-p, --pathToFiles, type=CHARACTER,
Path to the bismark files to be processed,
[default=NULL]
--process_functions, type=CHARACTER,
Complete path to the R functions for preprocessing,
[default="./BS_process_functions.R"]
--mergeStrands, type=LOGICAL,
Set FALSE to ignore the merging of CpG counts per strand,
[default=TRUE]
--vcf, type=CHARACTER,
SNP file : format VCF (vcf.gz accepted),
[default=NULL]
--is_norm, type=LOGICAL,
Coverage normalisation,
[default=FALSE]
--replicate, type=LOGICAL,
Are there replicates in study ?,
[default=TRUE]
-x, --maxCoverage, type=INTEGER,
Maximum of coverage depth,
[default = 1000]
-y, --minCoverage, type=INTEGER,
Minimum of coverage depth,
[default = 5]
-z, --sampleFraction, type=DOUBLE,
Minimum fraction of sample per position, between 0 and 1
[default = NULL]
-d "path to preprocessed file (eg: outpout_dir/bismark_format)" \
-m "File of summarized data and experience" \
--smoothing TRUE \
--formula "line+sex" \
--term "line,sex" \
--ref "epi-,F" \
--dmr TRUE \
--dmr.type qvalue \ # dmr stats (pvalue or qvalue)
--dmr.numC 3 \ # minimum number of dmc for one dmr
--dmr.propDMC 0.5 \ # cutoff of the proportion of DMCs in each region to call DMR
--alpha 0.05 \ # significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC)
--correct BH \
--type 'gene,exon,five_prime_utr,three_prime_utr' \ # type of annotation to loock for in gff file
--tss tss.txt \ # List of TSS (files format: chr\ttss\tstrand)
--gff annotation_file.gtf \
--parallel TRUE \
--ncores 16 \
--plots TRUE \ # output plot boolean
--rda FALSE \
-o output_directory/ \ # Ouput directory for plots and Rfiles
--process_functions ../bsseqmethdiffanalysis/DSS/BS_DSS_analysis.R
-d, --directory, type=CHARACTER,
Name of the directory where all files are stored.
Files format: "chr pos N X freqC strand",
[default=NULL]
--multifactor, type=LOGICAL,
TRUE if there is another factor in the metadata file.
To correct the potential effect of another factor,
[default=FALSE]
--smoothing, type=LOGICAL,
If TRUE smoothing is on (recommended for WGBS data).
Better estimations of mean methylation are obtained due to the spatial correlation should be provided,
[default=TRUE]
-m, --metadata, type=CHARACTER,
File of summarized data and experience should be provided,
[default = NULL]
--alpha, type=DOUBLE,
Significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC),
[default=0.05]
--correct, type=CHARACTER,
Method used to adjust p-values for multiple testing ("BH" or "bonferroni"),
[default="BH"]
--dmr, type=LOGICAL,
If TRUE DMR are extract,
[default=FALSE]
--dmr.numC, type=INTEGER,
Cutoff of the number of CpGs (CHH or CHG) in each region to call DMR,
[default=3]
--dmr.propDMC, type=DOUBLE,
Cutoff of the proportion of DMCs in each region to call DMR,
[default=1]
--dmr.type, type=CHARACTER,
Definition of DMC in DMR ("pvalue" or "qvalue"),
[default="qvalue"]
--min_cov, type=INTEGER,
Minimum of coverage depth.
[default=5]
--gff, type=CHARACTER,
Annotation file (gff ot gtf files),
[default=NULL]
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
--type, type=CHARACTER,
List of interested type of regions separeted by ',' (e.g. exon, intron, 5_prime_utr...),
[default=NULL]
--tss, type=CHARACTER,
File with TSS (files format: chr tss strand),
[default=NULL]
--parallel, type=LOGICAL,
If TRUE some of the methods are performed on multiple cores,
[default=FALSE]
--ncores, type=INTEGER,
Number of cores in parallel mode,
[default=NULL]
--plots, type=LOGICAL,
If TRUE plots are print,
[default=TRUE]
--rda, type=LOGICAL,
If TRUE rda are generated,
[default=TRUE]
-o, --out, type=CHARACTER,
Folder path where results are stored,
[default=NULL]
--process_functions, type=CHARACTER,
Complete path to the R functions for preprocessing,
[default=NULL]
-h, --help
Show this help message and exit
--comp, type=CHARACTER,
The two groups to compare (ex: "M,F").
Only if --multifactor is FALSE,
[default=NULL]
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
--formula, type=CHARACTER,
Linear model fitted.
Only when --multifactor is TRUE.
Exemple of formula: "tissue+treat",
[default=NULL]
--ref, type=CHARACTER,
The reference level(s) of each factor(s) used for the Wald test.
Separate each value with a comma (example:"Sang,TI").
Only when --multifactor is TRUE,
[default=NULL]
--term, type=CHARACTER,
The effect(s) to be tested.
If --multifactor is TRUE and --term is NULL, all effects are tested,
[default=NULL]
--coef, type=CHARACTER,
The specific effect(s) to be tested.
If --multifactor is TRUE, coefficient in the linear model to be tested.
Separate each value with a comma (example:"TEMOIN,5AZA"),
[default=NULL]
```
### edgeR
#### Preprocessing with *BS_preprocessing.R*
```sh
Rscript ../bsseqmethdiffanalysis/edgeR/BS_preprocessing.R \
-o "name of output directory with quote" \
-m "tabulated metadata file" \
-p "path to methylation calls with quote : (eg: nextflow/results/bismark_methylation_calls/stranded_CpG_report" \
--vcf="Variant calling file"
--process_functions="/bsseqmethdiffanalysis/DSS/BS_process_functions.R" \
--group "treat,line" \
-x 1000 \
-y 2 \
-z 0.4
```
**Usage:**
```r
Options:
-o, --directory, type=CHARACTER,
Path to the output directory,
[default=NULL]
-m, --metadata, type=CHARACTER,
File of summarized data and experience,
[default=NULL]
-p, --pathToFiles, type=CHARACTER,
Path to the bismark files to be processed,
[default=NULL]
--process_functions, type=CHARACTER,
Complete path to the R functions for preprocessing,
[default="./BS_process_functions.R"]
--mergeStrands, type=LOGICAL,
Set FALSE to ignore the merging of CpG counts per strand,
[default=TRUE]
--vcf, type=CHARACTER,
SNP file : format VCF (vcf.gz accepted),
[default=NULL]
--is_norm, type=LOGICAL,
Coverage normalisation,
[default=FALSE]
--replicate, type=LOGICAL,
Are there replicates in study ?,
[default=TRUE]
--group, type=CHARACTER,
Filtering of potential SNPs by group.
If mean_X by level of each variable = 0, CpGs are filtered from the bismark files.
Separate each value with a comma (example:"treat,line"),
[default=NULL]
-x, --maxCoverage, type=INTEGER,
Maximum of coverage depth,
[default = 1000]
-y, --minCoverage, type=INTEGER,
Minimum of coverage depth,
[default = 5]
-z, --sampleFraction, type=DOUBLE,
Minimum fraction of sample per position, between 0 and 1
[default = NULL]
-h, --help
Show this help message and exit
```
#### DMC/DMR finder and annotations with *MethDiffedgeR.R*
```sh
Rscript ../bsseqmethdiffanalysis/edgeR/MethDiffDSS.R \
-d "path to preprocessed file (eg: outpout_dir/bismark_format)" \
-m "File of summarized data and experience" \
--multifactor TRUE \
--formula "treat+sex" \
--ref "T,F" \
--term "treat,sex" \
--parallel TRUE \
--ncores 16 \
-o output_directory/ \ # Ouput directory for plots and Rfiles
--dmr TRUE \
--dmr.numC 3 \ # minimum number of dmc for one dmr
--dmr.propDMC 0.5 \ # cutoff of the proportion of DMCs in each region to call DMR
--alpha 0.05 \ #significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC)
--plot TRUE # output plot boolean
```
**Usage:**
```r
Options:
-d, --directory, type=CHARACTER,
Name of the directory where all files are stored.
Files format: "chr start end freqC C T",
[default=NULL]
--multifactor, type=LOGICAL,
TRUE if there is another factor in the metadata file.
To correct the potential effect of another factor,
[default=FALSE]
-m, --metadata, type=CHARACTER,
File of summarized data and experience should be provided,
[default = NULL]
--alpha, type=DOUBLE,
Significance level of the tests (i.e. acceptable rate of false-positive in the list of DMC),
[default=0.05]
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
--dmr, type=LOGICAL,
If TRUE DMR are extract,
[default=FALSE]
--dmr.numC, type=INTEGER,
Cutoff of the number of CpGs (CHH or CHG) in each region to call DMR,
[default=3]
--dmr.propDMC, type=DOUBLE,
Cutoff of the proportion of DMCs in each region to call DMR,
[default=1]
--min_cov, type=INTEGER,
Minimum of coverage depth.
[default=5]
--parallel, type=LOGICAL,
If TRUE some of the methods are performed on multiple cores,
[default=FALSE]
--ncores, type=INTEGER,
Number of cores in parallel mode,
[default=NULL]
--plots, type=LOGICAL,
If TRUE plots are print,
[default=TRUE]
--rda, type=LOGICAL,
If TRUE rda are generated,
[default=TRUE]
-o, --out, type=CHARACTER,
Folder path where results are stored,
[default=NULL]
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
Show this help message and exit
Options single factor (multifactor=FALSE):
--comp, type=CHARACTER,
The two groups to compare (ex: "M,F").
Only if --multifactor is FALSE,
[default=NULL]
Options multifactor (multifactor=TRUE):
--formula, type=CHARACTER,
Linear model fitted.
Only when --multifactor is TRUE.
Exemple of formula: "tissue+treat",
[default=NULL]
--ref, type=CHARACTER,
The reference level(s) of each factor(s).
Separate each value with a comma (example:"Sang,TI").
Only when --multifactor is TRUE,
[default=NULL]
--term, type=CHARACTER,
The effect(s) to be tested.
If --multifactor is TRUE and --term is NULL, all effects are tested,
[default=NULL]
**Download a metadata file example [here](http://genoweb.toulouse.inra.fr/~pterzian/files/template_metadata.tsv)**