Dagstuhl Seminar Proceedings, Volume 8131

Document

08131 Executive Summary – Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives

Authors: Michael Ashburner, Ulf Leser, and Dietrich Rebholz-Schuhmann

Abstract

Researchers in text mining and researchers developing ontological resources both provide solutions for preserving semantic information properly, i.e. in ontologies and/or fact databases. Researchers from the two fields tend to work independently of each other, but they share an interest in profiting from ongoing research in the complementary domain. The relatedness of the two domains led to the idea of organizing a workshop that brings together members of both research communities.

Cite as

Michael Ashburner, Ulf Leser, and Dietrich Rebholz-Schuhmann. 08131 Executive Summary – Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, pp. 1-5, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{ashburner_et_al:DagSemProc.08131.1,
  author =	{Ashburner, Michael and Leser, Ulf and Rebholz-Schuhmann, Dietrich},
  title =	{{08131 Executive Summary – Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--5},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.1},
  URN =		{urn:nbn:de:0030-drops-15234},
  doi =		{10.4230/DagSemProc.08131.1},
  annote =	{Keywords: Text Mining, natural language processing, ontologies, ontology design, machine learning, bioinformatics, medical informatics, knowledge management}
}

Document

DOI: 10.4230/DagSemProc.08131.2

Applications of semantic similarity measures

Authors: Andreas Schlicker, Fidel Ramírez, Jörg Rahnenführer, Carola Huthmacher, Alejandro Pironti, Francisco S. Domingues, Thomas Lengauer, and Mario Albrecht

Abstract

There has been much interest in uncovering protein-protein interactions and their underlying domain-domain interactions. Many experimental techniques have been developed, for example yeast-two-hybrid screening and tandem affinity purification. Since it is time-consuming and expensive to perform exhaustive experimental screens, in silico methods are used for predicting interactions. However, all experimental and computational methods have considerable false positive and false negative rates. Therefore, it is necessary to validate experimentally determined and predicted interactions. One possibility for the validation of interactions is the comparison of the functions of the proteins or domains. Gene Ontology (GO) is widely accepted as a standard vocabulary for functional terms, and is used for annotating proteins and protein families with biological processes and their molecular functions. This annotation can be used for a functional comparison of interacting proteins or domains using semantic similarity measures. Another application of semantic similarity measures is the prioritization of disease genes. It is known that functionally similar proteins are often involved in the same or similar diseases. Therefore, functional similarity is used for predicting disease associations of proteins. In the first part of my talk, I will introduce some semantic and functional similarity measures that can be used for comparison of GO terms and proteins or protein families. Then, I will show their application for determining a confidence threshold for domain-domain interaction predictions. Additionally, I will present FunSimMat (http://www.funsimmat.de/), a comprehensive resource of functional similarity values available on the web. In the last part, I will introduce the problem of comparing diseases, and a first attempt to apply functional similarity measures based on GO to this problem.

Cite as

Andreas Schlicker, Fidel Ramírez, Jörg Rahnenführer, Carola Huthmacher, Alejandro Pironti, Francisco S. Domingues, Thomas Lengauer, and Mario Albrecht. Applications of semantic similarity measures. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{schlicker_et_al:DagSemProc.08131.2,
  author =	{Schlicker, Andreas and Ram{\'\i}rez, Fidel and Rahnenf\"{u}hrer, J\"{o}rg and Huthmacher, Carola and Pironti, Alejandro and Domingues, Francisco S. and Lengauer, Thomas and Albrecht, Mario},
  title =	{{Applications of semantic similarity measures}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.2},
  URN =		{urn:nbn:de:0030-drops-15198},
  doi =		{10.4230/DagSemProc.08131.2},
  annote =	{Keywords: Semantic similarity, functional similarity, Gene Ontology, domain-domain interactions}
}

Document

DOI: 10.4230/DagSemProc.08131.3

Bootstrapping an interactive information extraction system for FlyBase curation

Authors: Ted Briscoe, Caroline Gasperin, Ian Lewin, and Andreas Vlachos

Abstract

We describe an adaptive information extraction (IE) system designed to aid the curation of papers about fruit fly genomics for incorporation into FlyBase. FlyBase employs a team of about eight curators who fill in prespecified IE templates (called proformas) for each gene and allele discussed in a given paper with curatable information associated with it. The normal approach to curation is to load the PDF of the paper into a tool such as Acroread and to use the 'Find' function to search for repeated mentions of an entity of interest. The relevant information is then typed into the appropriate template fields. Templates are then checked for consistency and automatically integrated into the database. We have developed PaperBrowser, a tool designed to make it easier for curators to locate relevant information. The tool takes the PDF version of the paper as input and rerenders it as SciXML, a standard developed at Cambridge for representing the logical structure of scientific articles in a fashion amenable to text mining. The basic SciXML is augmented by a gene name recogniser and anaphora resolution module so that PaperBrowser is able to highlight gene names in the paper and to provide a navigation bar which allows the curator to jump to specific mentions of a given gene in the various sections of the paper. Alternatively, the curator can select a specific gene mention and the browser will highlight all the noun phrases which are anaphorically linked to that gene mention. These anaphoric links can either be coreferential, or associative to the gene's products or components, such as proteins or RNA. User-based evaluation of PaperBrowser in comparison to the use of Acroread, with FlyBase curators undertaking the task of finding the set of genes and alleles for which templates should be constructed, has demonstrated that curation is 20% faster at no cost to accuracy when using PaperBrowser.
PaperBrowser uses a conditional random field model to perform gene name recognition bootstrapped from training data derived automatically via information in FlyBase. The anaphora resolution algorithm is unsupervised but uses information from the Sequence Ontology augmented with lexemes from UMLS to identify noun phrases referring to gene products and components. The PDF extraction tool uses a commercial OCR package augmented with a seed-based machine learning technique to learn the mapping from font and format information to the logical structure of the paper. Papers describing the complete processing pipeline, intrinsic evaluation of the individual components and user-based experiments, along with test datasets, are available from the FlySlip Project website.

Cite as

Ted Briscoe, Caroline Gasperin, Ian Lewin, and Andreas Vlachos. Bootstrapping an interactive information extraction system for FlyBase curation. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{briscoe_et_al:DagSemProc.08131.3,
  author =	{Briscoe, Ted and Gasperin, Caroline and Lewin, Ian and Vlachos, Andreas},
  title =	{{Bootstrapping an interactive information extraction system for FlyBase curation}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.3},
  URN =		{urn:nbn:de:0030-drops-15086},
  doi =		{10.4230/DagSemProc.08131.3},
  annote =	{Keywords: Biomedical Text Mining, Interactive Information Extraction, Natural Language Processing}
}

Document

DOI: 10.4230/DagSemProc.08131.4

Coreference Resolution in Biomedical Texts: a Machine Learning Approach

Authors: Jian Su, Xiaofeng Yang, Huaqing Hong, Yuka Tateisi, and Jun'ichi Tsujii

Abstract

Motivation: Coreference resolution, the process of identifying different mentions of an entity, is a very important component in a text-mining system. Compared with the work on news articles, the existing study of coreference resolution in biomedical texts is quite preliminary: it focuses only on specific types of anaphors such as pronouns or definite noun phrases, uses heuristic methods, and runs on small data sets. Therefore, there is a need for an in-depth exploration of this task in the biomedical domain. Results: In this article, we presented a learning-based approach to coreference resolution in the biomedical domain. We made three contributions in our study. Firstly, we annotated a large-scale coreference corpus, MedCo, which consists of 1,999 MEDLINE abstracts in the GENIA data set. Secondly, we proposed a detailed framework for the coreference resolution task, in which we augmented the traditional learning model by incorporating non-anaphors into training. Lastly, we explored various sources of knowledge for coreference resolution, particularly those that can deal with the complexity of biomedical texts. The evaluation on the MedCo corpus showed promising results. Our coreference resolution system achieved a high precision of 85.2% with a reasonable recall of 65.3%, obtaining an F-measure of 73.9%. The results also suggested that our augmented learning model significantly boosted precision (up to 24.0%) without much loss in recall (less than 5%), and brought a gain of over 8% in F-measure.
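
The reported F-measure can be checked against the reported precision and recall. The sketch below is not taken from the paper; it assumes the figure is the balanced F1 score, i.e. the harmonic mean of precision and recall, and the function name f_measure is chosen here for illustration only.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        # Avoid division by zero when nothing was predicted correctly.
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as stated in the abstract (85.2% and 65.3%).
reported_f = f_measure(0.852, 0.653)
print(f"{reported_f:.3f}")  # 0.739
```

Under that assumption the three numbers in the abstract are mutually consistent.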

Cite as

Jian Su, Xiaofeng Yang, Huaqing Hong, Yuka Tateisi, and Jun'ichi Tsujii. Coreference Resolution in Biomedical Texts: a Machine Learning Approach. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{su_et_al:DagSemProc.08131.4,
  author =	{Su, Jian and Yang, Xiaofeng and Hong, Huaqing and Tateisi, Yuka and Tsujii, Jun'ichi},
  title =	{{Coreference Resolution in Biomedical Texts: a Machine Learning Approach}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.4},
  URN =		{urn:nbn:de:0030-drops-15220},
  doi =		{10.4230/DagSemProc.08131.4},
  annote =	{Keywords: Coreference resolution, biomedical text}
}

Document

DOI: 10.4230/DagSemProc.08131.5

Facilitating the development of controlled vocabularies for metabolomics technologies with text mining

Authors: Irena Spasic, Daniel Schober, Susanna-Assunta Sansone, Dietrich Rebholz-Schuhmann, Douglas B. Kell, and Norman W. Paton

Abstract

Background. Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, it is time-consuming and non-trivial to construct these resources manually. Results. We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the need for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms. A total of 5,699 and 2,612 new terms were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections) are the major source of technology-specific terms as opposed to paper abstracts. Conclusions. We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.

Cite as

Irena Spasic, Daniel Schober, Susanna-Assunta Sansone, Dietrich Rebholz-Schuhmann, Douglas B. Kell, and Norman W. Paton. Facilitating the development of controlled vocabularies for metabolomics technologies with text mining. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{spasic_et_al:DagSemProc.08131.5,
  author =	{Spasic, Irena and Schober, Daniel and Sansone, Susanna-Assunta and Rebholz-Schuhmann, Dietrich and Kell, Douglas B. and Paton, Norman W.},
  title =	{{Facilitating the development of controlled vocabularies for metabolomics technologies with text mining}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.5},
  URN =		{urn:nbn:de:0030-drops-15503},
  doi =		{10.4230/DagSemProc.08131.5},
  annote =	{Keywords: Text mining, ontology, controlled vocabulary, metabolomics}
}

Document

DOI: 10.4230/DagSemProc.08131.6

GoPubMed: Exploring Pubmed with Ontological Background Knowledge

Authors: Heiko Dietze, Dimitra Alexopoulou, Michael R. Alvers, Bill Barrio-Alvers, Andreas Doms, Jörg Hakenberg, Jan Mönnich, Conrad Plake, Andreas Reischuck, Loic Royer, Thomas Wächter, Matthias Zschunke, and Michael Schroeder

Abstract

With the ever-increasing size of the scientific literature, finding relevant documents and answering questions has become even more of a challenge. Recently, ontologies - hierarchical, controlled vocabularies - have been introduced to annotate genomic data. They can also improve question answering and the selection of relevant documents in literature search. Search engines such as GoPubMed.org use ontological background knowledge to give an overview of large query results and to help answer questions. We review the problems and solutions underlying these next-generation intelligent search engines and give examples of the power of this new search paradigm.

Cite as

Heiko Dietze, Dimitra Alexopoulou, Michael R. Alvers, Bill Barrio-Alvers, Andreas Doms, Jörg Hakenberg, Jan Mönnich, Conrad Plake, Andreas Reischuck, Loic Royer, Thomas Wächter, Matthias Zschunke, and Michael Schroeder. GoPubMed: Exploring Pubmed with Ontological Background Knowledge. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{dietze_et_al:DagSemProc.08131.6,
  author =	{Dietze, Heiko and Alexopoulou, Dimitra and Alvers, Michael R. and Barrio-Alvers, Bill and Doms, Andreas and Hakenberg, J\"{o}rg and M\"{o}nnich, Jan and Plake, Conrad and Reischuck, Andreas and Royer, Loic and W\"{a}chter, Thomas and Zschunke, Matthias and Schroeder, Michael},
  title =	{{GoPubMed: Exploring Pubmed with Ontological Background Knowledge}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.6},
  URN =		{urn:nbn:de:0030-drops-15204},
  doi =		{10.4230/DagSemProc.08131.6},
  annote =	{Keywords: Text mining, literature search, Gene Ontology, NLP, ontology, thesaurus, PubMed}
}

Document

DOI: 10.4230/DagSemProc.08131.7

Mining associations and roles: role of feature extraction

Authors: Goran Nenadic

Abstract

One of the ultimate aims of biomedical text mining would be to extract both explicit and implicit associations between different types of entities. In addition, assigning roles that entities have or may have in biological processes is also of interest. In this talk I will discuss our experience in selecting and engineering textual features that can help in mining associations and roles from the literature. Depending on the tasks and entities involved, we have used four types of features: from simple words and terms, to words and semantic classes, to textual contexts, to contexts augmented with additional background attributes. The main conclusion is that both NLP-driven and domain-knowledge-driven feature engineering are needed for successful mining of associations and roles.

Cite as

Goran Nenadic. Mining associations and roles: role of feature extraction. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{nenadic:DagSemProc.08131.7,
  author =	{Nenadic, Goran},
  title =	{{Mining associations and roles: role of feature extraction}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.7},
  URN =		{urn:nbn:de:0030-drops-15497},
  doi =		{10.4230/DagSemProc.08131.7},
  annote =	{Keywords: Text mining, associations, roles, feature engineering, feature extraction}
}

Document

DOI: 10.4230/DagSemProc.08131.8

Mining Phenotypes for Protein Function Prediction

Authors: Ulf Leser, Philip Groth, Bertram Weiss, and Hans-Dieter Pohlenz

Abstract

Until very recently, phenotypes were only rarely studied in a systematic manner. While ontologies for describing gene functions now have a 10-year-long tradition, similar vocabularies for describing the phenotype of genes are only emerging now; similarly, the techniques for determining phenotypes on a large scale (especially RNAi) have been available for only a few years, while genomic sequencing and gene expression studies have been established for much longer. In this talk, we describe results from a study on exploiting phenotype descriptions for protein function prediction. We used the data from PhenomicsDB, a phenotype database integrated from several publicly available data sources. Due to the lack of standardization, phenotypes in PhenomicsDB can only be viewed as text (short statements, abstracts, singular terms, ...). We clustered these texts and analyzed the corresponding gene clusters in terms of their coherence in functional annotation and their interconnectedness by protein-protein interactions. We also devised a method for using the close similarity in their phenotype descriptions to predict the function of proteins. We show that this method yields a very good precision at acceptable coverage.

Cite as

Ulf Leser, Philip Groth, Bertram Weiss, and Hans-Dieter Pohlenz. Mining Phenotypes for Protein Function Prediction. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{leser_et_al:DagSemProc.08131.8,
  author =	{Leser, Ulf and Groth, Philip and Weiss, Bertram and Pohlenz, Hans-Dieter},
  title =	{{Mining Phenotypes for Protein Function Prediction}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.8},
  URN =		{urn:nbn:de:0030-drops-15133},
  doi =		{10.4230/DagSemProc.08131.8},
  annote =	{Keywords: Data mining, function prediction, bioinformatics, phenotypes, text mining}
}

Document

DOI: 10.4230/DagSemProc.08131.9

Named Entity or Entity Name?

Authors: Stefan Schulz

Abstract

The expression "named entity" is very fuzzy and its definitions partly contradictory. Semantic subtleties involving the words "entity", "name" and "term" are largely ignored. Based on formal ontology a more principled typology is introduced.

Cite as

Stefan Schulz. Named Entity or Entity Name? In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{schulz:DagSemProc.08131.9,
  author =	{Schulz, Stefan},
  title =	{{Named Entity or Entity Name?}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.9},
  URN =		{urn:nbn:de:0030-drops-15214},
  doi =		{10.4230/DagSemProc.08131.9},
  annote =	{Keywords: Ontology, Named Entity Recognition}
}

Document

DOI: 10.4230/DagSemProc.08131.10

NLP and Phenotypes: using Ontologies to link Human Diseases to Animal Models

Authors: N. Washington, M. Gibson, C.J. Mungall, Michael Ashburner, G. Gkoutos, M. Westerfield, M. Haendel, and S. E. Lewis

Abstract

The path to disease gene discovery in humans is often a lengthy one, but can be significantly shortened if links between human and model organism phenotypes are readily available. Collecting and storing these descriptions in a common resource, recorded with ontologies, as well as developing the tools for annotation, access, and analysis are among the goals of the National Center for Biomedical Ontology. The use of well-structured, expert-reviewed ontologies during curation allows biological data to be understandable by both humans and computers, and thereby increases the capacity for meaningful analysis. We have developed the EQ annotation model, which uses ontology terms to label and link together entities, such as anatomical structures, with the qualities describing them. Phenotypes are represented in our model using any combination of entity (such as anatomy) ontologies in combination with an ontology of qualities (PATO). Together with the model organism databases ZFIN and FlyBase, we are evaluating this model, using the Phenote Annotation Tool to capture the mutant phenotypes of 200 genes known to cause human disease (from OMIM records) that have corresponding fly and zebrafish mutant phenotypes. The phenotypic data modeled in this way are available from the NCBO Open Biomedical Database (OBD), which has the same underlying annotation data model, and can currently be accessed via a computational (REST) interface for use by other external applications or databases. This work is funded by the NIH.
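
As a rough illustration of the entity-quality (EQ) idea described above, an annotation can be pictured as a record that pairs an entity term with a quality term. The sketch below is an assumption made for illustration, not the data model used by Phenote or OBD: the class names, field names, the "ANATOMY" ontology label, the gene name and all term identifiers are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyTerm:
    ontology: str  # name of the source ontology, e.g. "PATO"
    term_id: str   # placeholder identifier, not a real accession
    label: str     # human-readable term name


@dataclass(frozen=True)
class EQAnnotation:
    gene: str              # gene whose mutant phenotype is described
    entity: OntologyTerm   # what is affected, e.g. an anatomical structure
    quality: OntologyTerm  # how it is affected; drawn from PATO

    def __post_init__(self) -> None:
        # The abstract states that qualities come from PATO, while entities
        # may come from any entity ontology, so only the quality is checked.
        if self.quality.ontology != "PATO":
            raise ValueError("quality term must come from PATO")


# Hypothetical example values; none of these identifiers are real.
example = EQAnnotation(
    gene="example-gene",
    entity=OntologyTerm("ANATOMY", "ANATOMY:placeholder", "eye"),
    quality=OntologyTerm("PATO", "PATO:placeholder", "small"),
)
print(f"{example.gene}: {example.entity.label} is {example.quality.label}")
```

Keeping the entity and the quality as separate ontology terms is what lets the same quality vocabulary be reused across anatomy ontologies of different organisms.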

Cite as

N. Washington, M. Gibson, C.J. Mungall, Michael Ashburner, G. Gkoutos, M. Westerfield, M. Haendel, and S. E. Lewis. NLP and Phenotypes: using Ontologies to link Human Diseases to Animal Models. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{washington_et_al:DagSemProc.08131.10,
  author =	{Washington, N. and Gibson, M. and Mungall, C.J. and Ashburner, Michael and Gkoutos, G. and Westerfield, M. and Haendel, M. and Lewis, S. E.},
  title =	{{NLP and Phenotypes: using Ontologies to link Human Diseases to Animal Models}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.10},
  URN =		{urn:nbn:de:0030-drops-15143},
  doi =		{10.4230/DagSemProc.08131.10},
  annote =	{Keywords: Phenotypes, ontologies, annotation}
}

Document

DOI: 10.4230/DagSemProc.08131.11

Ontologies & Text Mining (for Life Sciences)

Authors: Paul Buitelaar

Abstract

The talk will address several issues in the application and development of ontologies: the selection of appropriate ontologies for a task; the population of a selected ontology through information extraction from text; the semi-automatic development or extension of an ontology; the lexicalisation of ontologies for the purpose of ontology-based information extraction from text. Each of these issues will be addressed through a particular application: the OntoSelect ontology library and search engine (http://olp.dfki.de/ontoselect/); the OntoLT Protege PlugIn for ontology learning from text (http://olp.dfki.de/OntoLT/OntoLT.htm); the SOBA system for ontology-based information extraction from text; the LingInfo lexicon model for the integration of lexical/linguistic information in ontologies (http://olp.dfki.de/LingInfo/).

Cite as

Paul Buitelaar. Ontologies & Text Mining (for Life Sciences). In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{buitelaar:DagSemProc.08131.11,
  author =	{Buitelaar, Paul},
  title =	{{Ontologies \& Text Mining (for Life Sciences)}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.11},
  URN =		{urn:nbn:de:0030-drops-15095},
  doi =		{10.4230/DagSemProc.08131.11},
  annote =	{Keywords: Ontology Search; Ontology Population; Ontology Learning; Lexical Enrichment of Ontologies}
}

Document

DOI: 10.4230/DagSemProc.08131.12

Ontology learning with text mining: Two use cases in lipoprotein metabolism and toxicology

Authors: Dimitra Alexopoulou, Thomas Wächter, Laura Pickersgill, Cecilia Eyre, and Michael Schroeder

Abstract

Background: The engineering of ontologies, especially with a view to a text-mining use, is still a new research field. There does not yet exist a well-defined theory and technology for ontology construction. Many of the ontology design steps remain manual and are based on personal experience and intuition. However, there exist a few efforts on automatic construction of ontologies in the form of extracted lists of terms and relations between them. Results: We share experience acquired during the manual development of a lipoprotein metabolism ontology (LMO) to be used for text-mining. We compare the manually created ontology terms with the automatically derived terminology from four different automatic term recognition methods. The top 50 predicted terms contain up to 89% relevant terms. For the top 1000 terms the best method still generates 51% relevant terms. In a corpus of 3066 documents 53% of LMO terms are contained and 38% can be generated with one of the methods. Secondly, we present a use case for ontology-based search for toxicological methods. Conclusions: Given high precision, automatic methods can help decrease development time and provide significant support for the identification of domain-specific vocabulary. The coverage of the domain vocabulary depends strongly on the underlying documents. Ontology development for text mining should be performed in a semi-automatic way, taking automatic term recognition results as input. Availability: The automatic term recognition method is available as a web service, described at http://gopubmed4.biotec.tu-dresden.de/IdavollWebService/services/CandidateTermGeneratorService?wsdl

Cite as

Dimitra Alexopoulou, Thomas Wächter, Laura Pickersgill, Cecilia Eyre, and Michael Schroeder. Ontology learning with text mining: Two use cases in lipoprotein metabolism and toxicology. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{alexopoulou_et_al:DagSemProc.08131.12,
  author =	{Alexopoulou, Dimitra and W\"{a}chter, Thomas and Pickersgill, Laura and Eyre, Cecilia and Schroeder, Michael},
  title =	{{Ontology learning with text mining: Two use cases in lipoprotein metabolism and toxicology}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.12},
  URN =		{urn:nbn:de:0030-drops-15063},
  doi =		{10.4230/DagSemProc.08131.12},
  annote =	{Keywords: Automatic Term Recognition, Ontology Learning, Lipoprotein Metabolism}
}

Document

DOI: 10.4230/DagSemProc.08131.13

Ontology-based Extraction of Transcription Regulation Events

Authors: Jung-Jae Kim

Abstract

I present ongoing work on the extraction of transcription regulation events from text by using an ontology which plays a central role in integrating information from different sources. The events of transcription regulation are expressed in the literature with a high degree of compositeness. They have elements such as event types, participants, and attributes. These elements are associated with different keywords, which should be merged into a shared structure. I use the Gene Regulation Ontology (GRO) for this integration. It contains not only biological concepts related to transcription regulation, but also inference rules for deduction of specific event types and attributes from the semantics of sentences. It is also used to represent the semantics of linguistic patterns that are used to identify the semantics of sentences. The ontology provides the formality which is required for the extraction of specific and well-defined events such as those of transcription regulation.

Cite as

Jung-Jae Kim. Ontology-based Extraction of Transcription Regulation Events. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{kim:DagSemProc.08131.13,
  author =	{Kim, Jung-Jae},
  title =	{{Ontology-based Extraction of Transcription Regulation Events}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.13},
  URN =		{urn:nbn:de:0030-drops-15112},
  doi =		{10.4230/DagSemProc.08131.13},
  annote =	{Keywords: Information extraction, ontology, transcription regulation, inference, ontology semantics}
}

Document

DOI: 10.4230/DagSemProc.08131.14

Ontology-Based Interactive Information Extraction

Authors: David Milward

Abstract

Interactive Information Extraction brings together search and information extraction to provide fast, interactive text mining over large volumes of text such as Medline abstracts, full text scientific articles, patents etc. As well as covering the two ends of the spectrum: keyword search over documents, and detailed linguistic patterns within sentences, the Interactive Information Extraction System, I2E, also covers the points in between such as keywords within the same sentence, or co-occurrence of biological entities within sentences or documents. This talk briefly introduces the idea of Interactive Information Extraction, and describes how terminologies/ontologies are incorporated. We also show how I2E can be used to augment ontologies by finding potential synonyms or members of classes from the literature using linguistic patterns. Finally we discuss issues concerning how best to use ontologies for text mining.

Cite as

David Milward. Ontology-Based Interactive Information Extraction. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{milward:DagSemProc.08131.14,
  author =	{Milward, David},
  title =	{{Ontology-Based Interactive Information Extraction}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.14},
  URN =		{urn:nbn:de:0030-drops-15150},
  doi =		{10.4230/DagSemProc.08131.14},
  annote =	{Keywords: Information extraction, ontologies, text mining}
}

Document

DOI: 10.4230/DagSemProc.08131.15

Services for annotation of biomedical text

Authors: Jörg Hakenberg

Abstract

Motivation: Text mining in the biomedical domain in recent years has focused on the development of tools for recognizing named entities and extracting relations. Such research resulted from the need for such tools as basic components for more advanced solutions. Named entity recognition, entity mention normalization, and relationship extraction have now reached a stage where they perform comparably to human annotators (considering inter-annotator agreement, measured in many studies to be around 90%). Many tools have been made available, through web interfaces or as downloadable software, using non-standardized formats for input and output. To advance progress in text mining, solutions are needed to both provide and combine the results of 'basic' information retrieval and extraction tools. Results: Our groups at Technical University Dresden, Humboldt-Universität zu Berlin, and Arizona State University developed systems for named entity recognition, normalization, and relationship extraction. As evaluated during and after the BioCreative 2 challenge, recognition of proteins achieves 86% F-measure, normalization of gene mentions 85%, and extraction of protein-protein interactions including mapping to UniProt 25%. Conclusions: We consider the BioCreative meta-service an ideal framework for making information extraction tools available to a variety of users: researchers from the biomedical domain, database curators, and researchers in text mining who can use the services as input for subsequent analyses. At the time of writing this abstract, twelve groups provide their tools as services to the BCMS server. We currently participate with tools for recognizing names of genes/proteins and species, normalization of gene mentions to EntrezGene, protein mentions to UniProt, mentions of species to NCBI Taxonomy, as well as classifying abstracts for protein-protein interactions.
Availability: For more information, please refer to http://alibaba.informatik.hu-berlin.de/bcms/. BCMS is available at http://bcms.bioinfo.cnio.es/.

Cite as

Jörg Hakenberg. Services for annotation of biomedical text. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{hakenberg:DagSemProc.08131.15,
  author =	{Hakenberg, J\"{o}rg},
  title =	{{Services for annotation of biomedical text}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.15},
  URN =		{urn:nbn:de:0030-drops-15101},
  doi =		{10.4230/DagSemProc.08131.15},
  annote =	{Keywords: BioCreative, NER, EMN, GN, information extraction, web-service, AliBaba}
}

Document

DOI: 10.4230/DagSemProc.08131.16

Systems biology approaches for prioritizing therapeutic gene targets

Authors: Johannes Schuchhardt

Abstract

Rational approaches to therapy of complex diseases may be improved by predictive modelling of underlying disease mechanisms. Formulating and implementing such models requires the integration of heterogeneous information from different sources and usually entails considerable effort. We need new concepts and resources making knowledge on causal regulatory interactions of genes and physiological states in a disease context available. Dedicated ontologies and text mining methods can be of great use for guiding and supporting the process of model construction and model evaluation.

Cite as

Johannes Schuchhardt. Systems biology approaches for prioritizing therapeutic gene targets. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{schuchhardt:DagSemProc.08131.16,
  author =	{Schuchhardt, Johannes},
  title =	{{Systems biology approaches for prioritizing therapeutic gene targets}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.16},
  URN =		{urn:nbn:de:0030-drops-15053},
  doi =		{10.4230/DagSemProc.08131.16},
  annote =	{Keywords: Rational therapy, predictive modelling, data integration, ontologies, pathway databases}
}

Document

DOI: 10.4230/DagSemProc.08131.17

Term Mapping Using Matrix Operations

Authors: Michael Krauthammer and Thaibinh Luong

Abstract

We believe that gene name identification is a modular process involving term recognition, classification and mapping. This work's focus is on gene name mapping, and we assume that names are already recognized and classified. We use a combination of two methods to map recognized entities to their appropriate gene identifiers (Entrez GeneIDs): the Trigram Method, and the Network Method. Both methods require preprocessing, using resources from Entrez Gene, to construct a set of method-specific matrices. We first address lexical variation by transforming gene names into their unique "trigrams" (groups of three alphanumeric characters), and perform trigram matching against the preprocessed gene dictionary. For ambiguous gene names, we additionally perform a contextual analysis of the abstract that contains the recognized entity. We have formalized our method as a sequence of matrix manipulations, allowing for a fast and coherent implementation of the algorithm. In this talk, we also show how gene name identification, and text mining in general, can play a critical role in translational medicine. We demonstrate how term identification is useful for establishing a biobibliometric distance between genes and psychiatric disorders.
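The trigram step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the binary name-by-trigram matrix, the Dice-style score, the function names, and the identifiers `GENE_A` and `GENE_B` are all assumptions chosen for illustration, and the contextual Network Method is not covered.

```python
import re

# Toy dictionary of (gene identifier, name) pairs.
# The identifiers are hypothetical placeholders, not real Entrez GeneIDs.
ENTRIES = [
    ("GENE_A", "BRCA1"),
    ("GENE_A", "breast cancer 1"),
    ("GENE_B", "TP53"),
    ("GENE_B", "tumor protein p53"),
]

def trigrams(name):
    """Return the unique alphanumeric trigrams of a name."""
    s = re.sub(r"[^a-z0-9]", "", name.lower())
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(entries):
    """Preprocess the dictionary into a binary name-by-trigram matrix."""
    vocab = sorted({t for _, name in entries for t in trigrams(name)})
    col = {t: j for j, t in enumerate(vocab)}
    matrix = []
    for _, name in entries:
        row = [0] * len(vocab)
        for t in trigrams(name):
            row[col[t]] = 1
        matrix.append(row)
    return col, matrix

def map_name(query, entries, col, matrix):
    """Score every dictionary name against the query and return the best match."""
    q_tris = trigrams(query)
    q = [0] * len(col)
    for t in q_tris:
        if t in col:
            q[col[t]] = 1
    best_id, best_score = None, 0.0
    for (gene_id, _), row in zip(entries, matrix):
        shared = sum(a * b for a, b in zip(row, q))  # one row of the matrix-vector product
        denom = sum(row) + len(q_tris)
        score = 2 * shared / denom if denom else 0.0  # Dice-style overlap (an assumption)
        if score > best_score:
            best_id, best_score = gene_id, score
    return best_id, best_score

COL, MATRIX = build_index(ENTRIES)
print(map_name("BRCA-1", ENTRIES, COL, MATRIX))
```

Because punctuation and case are removed before the trigrams are formed, lexical variants such as "BRCA-1" and "brca1" produce the same query vector. A name that shares no trigram with the dictionary returns `None`.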

Cite as

Michael Krauthammer and Thaibinh Luong. Term Mapping Using Matrix Operations. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{krauthammer_et_al:DagSemProc.08131.17,
  author =	{Krauthammer, Michael and Luong, Thaibinh},
  title =	{{Term Mapping Using Matrix Operations}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.17},
  URN =		{urn:nbn:de:0030-drops-15126},
  doi =		{10.4230/DagSemProc.08131.17},
  annote =	{Keywords: Term Identification}
}

Document

DOI: 10.4230/DagSemProc.08131.18

Text Mining and Management Tools for Resource Construction and Validation in the Life Sciences

Authors: Jong C. Park

Abstract

In this talk, I am concerned with the question of what it really takes to move forward in the age of information, along the well-known progression of human understanding, or data-information-knowledge-truth. After an overview of the research directions of our group at KAIST, I present four text mining and management tools for resource construction and validation: automatic gene summary generation, e3db construction, pathway validation by logical inference, and a system that taps into the evolutionary aspect of the Gene Ontology with colored graphs.

Cite as

Jong C. Park. Text Mining and Management Tools for Resource Construction and Validation in the Life Sciences. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{park:DagSemProc.08131.18,
  author =	{Park, Jong C.},
  title =	{{Text Mining and Management Tools for Resource Construction and Validation in the Life Sciences}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.18},
  URN =		{urn:nbn:de:0030-drops-15181},
  doi =		{10.4230/DagSemProc.08131.18},
  annote =	{Keywords: Text Mining, Resource Construction, Resource Validation, Gene Ontology}
}

Document

DOI: 10.4230/DagSemProc.08131.19

Textpresso - an Information Retrieval and Extraction System for Biological Literature

Authors: Hans-Michael Mueller, Arun Rangarajan, Tracy K. Teal, Kimberly van Auken, Juancarlos Chan, and Paul W. Sternberg

Abstract

We developed an information retrieval and extraction system that processes the full text of biological papers. The system, called Textpresso, separates text into sentences, labels words and phrases according to an ontology (an organized lexicon), and allows queries to be performed on a database of labeled sentences. The current ontology comprises approximately one hundred categories of terms, such as "gene", "regulation", "human disease", "brain area" etc., and also contains main Gene Ontology (GO) categories. Extraction of particular biological facts, such as gene-gene interactions, or the curation of GO cellular components, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators in identifying sentences. We have established search engines for four literatures (C. elegans, Drosophila, Arabidopsis and Neuroscience), and thirteen systems for other literatures have been developed by other groups around the world. Currently, our four systems contain 112,000 papers with 40 million sentences; all systems worldwide contain 190,000 papers with approximately 65 million sentences.
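The pipeline outlined in this abstract (split text into sentences, label them by lexicon category, query the labeled sentences) can be sketched in a few lines. This is a toy illustration under stated assumptions, not Textpresso's code: the two-category lexicon, the gene-like names `abc-1` and `xyz-2`, the regular expressions, and the function names are all hypothetical.

```python
import re

# Hypothetical mini-lexicon: category -> terms. The gene-like names are made up.
LEXICON = {
    "gene": {"abc-1", "xyz-2"},
    "regulation": {"regulates", "represses", "activates"},
}

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def label(sentence):
    """Return the set of lexicon categories with at least one term in the sentence."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence.lower())
    return {cat for cat, terms in LEXICON.items() if any(t in terms for t in tokens)}

def query(text, required):
    """Return the sentences that carry every requested category label."""
    return [s for s in split_sentences(text) if set(required) <= label(s)]

TEXT = ("abc-1 activates xyz-2. The animals were grown at room temperature. "
        "Expression of abc-1 was observed.")
print(query(TEXT, ["gene", "regulation"]))
```

Requiring several categories to occur in the same sentence is what narrows a keyword search down to candidate fact sentences. A real system would need a far larger lexicon and a more robust sentence splitter.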

Cite as

Hans-Michael Mueller, Arun Rangarajan, Tracy K. Teal, Kimberly van Auken, Juancarlos Chan, and Paul W. Sternberg. Textpresso - an Information Retrieval and Extraction System for Biological Literature. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{mueller_et_al:DagSemProc.08131.19,
  author =	{Mueller, Hans-Michael and Rangarajan, Arun and Teal, Tracy K. and van Auken, Kimberly and Chan, Juancarlos and Sternberg, Paul W.},
  title =	{{Textpresso - an Information Retrieval and Extraction System for Biological Literature}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.19},
  URN =		{urn:nbn:de:0030-drops-15169},
  doi =		{10.4230/DagSemProc.08131.19},
  annote =	{Keywords: Information retrieval, literature search engine, information extraction, automated literature curation, semantic search, ontology}
}

Document

DOI: 10.4230/DagSemProc.08131.20

WordNet-Inspired Terminological Resources for Bio-NLP

Authors: Elena Beisswanger, Michael Poprat, and Udo Hahn

Abstract

WordNet is currently the most widely used lexical resource for general English. We here argue in favor of a similar lexical resource for biomedicine, BioWordNet, to extend the virtues of WordNet to this sublanguage domain. We present a simple approach to semi-automatically build up such a resource. It crucially builds on the conversion of structured domain knowledge taken from the Open Biomedical Ontologies (OBO) and the subsequent mapping of this library to WordNet's lexicographic file format. We report on the shortcomings of the original WordNet format that hampered this approach. Subsequently, we propose alternative strategies to construct a BioWordNet in a more up-to-date representation such as the Web Ontology Language (OWL). Finally, we point out the steps ahead when building up such a domain-specific resource from various sources and discuss the general problems that need to be solved when integrating the BioWordNet into the original WordNet.

Cite as

Elena Beisswanger, Michael Poprat, and Udo Hahn. WordNet-Inspired Terminological Resources for Bio-NLP. In Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives. Dagstuhl Seminar Proceedings, Volume 8131, p. 1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2008)

@InProceedings{beisswanger_et_al:DagSemProc.08131.20,
  author =	{Beisswanger, Elena and Poprat, Michael and Hahn, Udo},
  title =	{{WordNet-Inspired Terminological Resources for Bio-NLP}},
  booktitle =	{Ontologies and Text Mining for Life Sciences : Current Status and Future Perspectives},
  pages =	{1--1},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2008},
  volume =	{8131},
  editor =	{Michael Ashburner and Ulf Leser and Dietrich Rebholz-Schuhmann},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08131.20},
  URN =		{urn:nbn:de:0030-drops-15073},
  doi =		{10.4230/DagSemProc.08131.20},
  annote =	{Keywords: WordNet, Bio-Ontologies}
}
