
About

This project contains a workflow designed to extract microorganism taxa, their habitats and their phenotypes from texts, and to categorize the extracted information with taxa from the NCBI taxonomy and concepts from the OntoBiotope ontology. It uses texts downloaded from the CIRM, GenBank, DSMZ and PubMed databanks.

The workflow is composed of six pipelines, each a text-mining process relying on AlvisNLP plans and additional tools. It is designed to feed the Florilège database with rich text-mining knowledge about food microbe flora.

We use Snakemake pipelines to materialize the execution steps:

  1. Preprocess ontology to analyze the ontologies, cut the desired branches, produce the ToMap models and lexicons, and create the concept paths from the structure of the ontologies. These resources are used in the next steps.
  2. Process CIRM corpus to extract microorganisms and habitats from CIRM data.
  3. Process GenBank corpus to extract microorganisms and habitats from GenBank data.
  4. Process DSMZ corpus to extract microorganisms and habitats from DSMZ data.
  5. Process PubMed corpus to extract microorganisms and habitats from PubMed abstracts.
  6. Process BioNLP-OST 2019 test corpus to evaluate the results on a reference dataset.
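The concept-path resource built in step 1 can be sketched as follows. This is a minimal illustration, not the actual AlvisNLP/OntoBiotope processing: the mini-ontology and the `concept_paths` function are invented for the example, and real OntoBiotope concepts may have multiple parents.

```python
# Minimal sketch of deriving concept paths from an ontology's is_a structure.
# The mini-ontology below is invented; OntoBiotope itself is distributed as OBO.

def concept_paths(parents):
    """Map each concept to its root-to-concept path (single-inheritance case)."""
    paths = {}

    def path(cid):
        if cid not in paths:
            parent = parents.get(cid)
            paths[cid] = (path(parent) if parent else "") + "/" + cid
        return paths[cid]

    for cid in parents:
        path(cid)
    return paths

# is_a relations: child -> parent (None marks a root)
ontology = {
    "habitat": None,
    "food": "habitat",
    "dairy product": "food",
    "cheese": "dairy product",
}

print(concept_paths(ontology)["cheese"])
# → /habitat/food/dairy product/cheese
```

Such paths let downstream steps attach each predicted habitat to its full position in the ontology rather than to an isolated concept label.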

The workflow relies on the following folder structure to manage the resources:

    ├── README.md
    ├── LICENSE.md
    ├── config
    │   ├── config.yaml
    │   └── resources.yaml
    ├── *-snakefile
    ├── plans
    │   └── *.plan
    ├── corpora
    │   ├── cirm/
    │   ├── genbank/
    │   ├── dsmz/
    │   └── pubmed/
    ├── ancillaries
    │    ├── *.obo
    │    ├── *.txt
    │    └── *.tomap
    └── softwares
         ├── *.env
         ├── *.simg
         ├── *.python
         ├── *.perl
         ├── *.bash
         └── *.jar

The main directories/files are:

  • plans/ contains the AlvisNLP plans, i.e. the text-mining pipelines that extract the information
  • corpora/ contains the textual data to process
  • ancillaries/ contains ancillary data like ontologies, lexical data, grammars, language models
  • softwares/ contains additional scripts, environments and configuration files
  • *-snakefile files are the Snakemake pipelines that represent the execution entry points
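Each entry point can then be launched with Snakemake. The helper below illustrates the call: the snakefile names follow the `*-snakefile` pattern from the tree above, but the exact names and options are assumptions, not the project's documented interface.

```python
# Build the snakemake command line for one pipeline entry point.
# The pipeline name and config path follow the folder structure above;
# they are illustrative and not verified against the actual snakefiles.

def snakemake_command(pipeline, cores=4):
    return [
        "snakemake",
        "--snakefile", f"{pipeline}-snakefile",
        "--configfile", "config/config.yaml",
        "--cores", str(cores),
    ]

print(" ".join(snakemake_command("pubmed")))
# → snakemake --snakefile pubmed-snakefile --configfile config/config.yaml --cores 4
```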

How to install?

How to run?