Vocabularies and research data
This guide explains what vocabularies are and how they are useful for supporting research. A brief discussion of vocabulary services is included.
What is a vocabulary?
A vocabulary sets out the common language a discipline has agreed to use to refer to concepts of interest in that discipline. It is a kind of model of the concepts in a discipline, with labels applied to the concepts and some kind of structure relating the concepts to each other.
Vocabularies take many forms. They include authority files, glossaries, dictionaries, gazetteers, code lists, taxonomies, subject headings, thesauri, semantic networks and ontologies.
In short, a vocabulary is a set of terms or labels (words, codes, icons) that are used in a specific community to represent concepts.
Based on http://marinemetadata.org/guides/vocabs/vocdef
How do vocabularies support research?
Data specification and description
When sharing data or combining data from different sources, there is a need for an agreed language to make sure the meaning of data is clear and explicit.
Researchers planning observations or surveys need to define their data items clearly. In formal system development environments this is done using metadata registries, data dictionaries, or data modelling software to define the permissible values or codes for each data item.
An agreed vocabulary (a standard) makes a good starting point for translating concepts into other vocabularies so that collaboration can occur.
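The two ideas above, a list of permissible codes and translation from a local vocabulary into an agreed one, can be sketched in a few lines of Python. This is a minimal illustration only: the code list, the crosswalk and the function names are all hypothetical and are not taken from any of the registries mentioned in this guide.

```python
# Hypothetical code list: the permissible values for a "sex" data item.
SEX_CODES = {"1": "Male", "2": "Female", "9": "Not stated"}

# Hypothetical crosswalk from one project's local labels to the agreed codes.
LOCAL_TO_STANDARD = {"M": "1", "F": "2", "unknown": "9"}


def validate(value, code_list):
    """Return True if value is one of the permissible codes."""
    return value in code_list


def translate(local_value, crosswalk):
    """Map a local label to the agreed code, or None if no mapping exists."""
    return crosswalk.get(local_value)


# Translate a small batch of local values and report anything unmapped.
records = ["M", "F", "X"]
translated = [translate(r, LOCAL_TO_STANDARD) for r in records]
unmapped = [r for r, t in zip(records, translated) if t is None]
```

Values that cannot be translated are reported rather than guessed, so gaps in the crosswalk are visible before data from different sources are combined.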
Examples of vocabularies used to specify data values:
- Marine science vocabularies
- Darwin Core: An Evolving Community-Developed Biodiversity Data Standard
- Health and welfare statistics values are defined in AIHW's METeOR Metadata Online Registry
- ABS 2011 Census Data Dictionary Example
Ontology-mediated data integration
In this process scientists annotate data sets with semantically precise terms from an ontology, enabling reasoning across the data and transformations of the data for further analysis.
- Case study from genomics: Ontologies: Scientific Data Sharing Made Easy
- Case study from ecoinformatics: Ecoinformatics: supporting ecology as a data-intensive science
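As a rough sketch of the kind of reasoning that annotation makes possible, the Python below tags data sets with terms from a tiny hierarchy and then finds every data set that falls under a broader term. The terms, data set names and function names are invented for illustration; real ontologies and reasoners are far richer than this single "broader term" relation.

```python
# Hypothetical mini-ontology: each term points to its broader (parent) term.
BROADER = {
    "salmon": "fish",
    "fish": "vertebrate",
    "frog": "amphibian",
    "amphibian": "vertebrate",
    "vertebrate": "animal",
}

# Hypothetical data sets, each annotated with one ontology term.
ANNOTATIONS = {
    "river_survey_2010": "salmon",
    "pond_counts": "frog",
    "soil_samples": "mineral",
}


def ancestors(term, broader):
    """Return every term reached by following broader links from term."""
    found = []
    seen = {term}
    while term in broader:
        term = broader[term]
        if term in seen:  # guard against accidental cycles
            break
        seen.add(term)
        found.append(term)
    return found


def datasets_about(term, annotations, broader):
    """Return data sets tagged with term or with any narrower term."""
    return sorted(
        name
        for name, tag in annotations.items()
        if tag == term or term in ancestors(tag, broader)
    )
```

With these assumptions, a query for "vertebrate" returns both the salmon and the frog data sets, even though neither was tagged with that word directly.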
Statistical analysis involves aggregating data and applying statistical analytical techniques. Use of standard classification schemes (a kind of vocabulary) means that data from different sources can be compared. If standard classifications are not used, it is difficult to aggregate data from different sources with a high degree of confidence.
Examples of statistical vocabularies
- International Classification of Diseases (ICD), used for national mortality and morbidity statistics
- Australian and New Zealand Standard Research Classification (ANZSRC), used for measuring and analysing research and experimental development (R&D) in Australia and New Zealand
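The value of a shared classification for aggregation can be illustrated with a short Python sketch. Two sources use different local category names, and counts can only be combined once each local category has been mapped to a common code. The codes (X1, X2, X3), the mappings and the counts below are made up for the example and do not come from ICD or ANZSRC.

```python
from collections import Counter

# Hypothetical mappings from each source's local categories to shared codes.
MAP_A = {"heart attack": "X1", "flu": "X2"}
MAP_B = {"myocardial infarction": "X1", "influenza": "X2", "asthma": "X3"}

# Hypothetical counts reported by each source under its own category names.
COUNTS_A = {"heart attack": 12, "flu": 30}
COUNTS_B = {"myocardial infarction": 8, "influenza": 25, "asthma": 5, "other": 2}


def aggregate(sources):
    """Sum counts per shared code; keep unmapped categories separate."""
    totals = Counter()
    unclassified = Counter()
    for counts, mapping in sources:
        for local_name, n in counts.items():
            code = mapping.get(local_name)
            if code is None:
                unclassified[local_name] += n
            else:
                totals[code] += n
    return totals, unclassified


totals, unclassified = aggregate([(COUNTS_A, MAP_A), (COUNTS_B, MAP_B)])
```

Categories with no mapping are kept apart instead of being forced into a code, which mirrors the point above: without a standard classification, those counts cannot be aggregated with confidence.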
Indexing vocabularies are used to tag items in library catalogues and search portals and to provide keywords for academic journal articles. Without indexing vocabularies, search precision is reduced and valuable relevant research may not be retrieved. Indexing vocabularies are most effective when they mirror the searcher's terminology and conceptual perspective.
Examples of indexing vocabularies:
- Medical Subject Headings (MeSH) used in the PubMed biomedical literature portal
- Powerhouse Museum Object Name Thesaurus used for indexing museum collections
Example of journal article with keywords: Example
Traditionally most vocabularies were managed in custom software, and either printed or published as read-only web pages or downloadable documents (for example, see the APAIS Thesaurus).
A vocabulary service is a machine-to-machine service that can support activities such as creating, managing and querying vocabularies.
Examples of vocabulary services:
- ANDS is developing a prototype Controlled Vocabulary service.
Knowledge organisation systems such as thesauri or any other type of structured controlled vocabulary can be represented using SKOS (Simple Knowledge Organization System). SKOS provides a standard way to represent knowledge organisation systems using the Resource Description Framework (RDF). This means that vocabulary information can be passed between computer applications in an interoperable way.
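To give a feel for what a SKOS description looks like, the following Python sketch writes one concept out in Turtle, a common text syntax for RDF. The `skos:` namespace and the `skos:Concept`, `skos:prefLabel`, `skos:altLabel` and `skos:broader` terms are part of SKOS; the `ex:` namespace, the concept names and the helper function are assumptions made up for this example. In practice an RDF library would normally be used instead of string formatting.

```python
# Standard SKOS namespace, plus a hypothetical namespace for our own concepts.
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix ex: <http://example.org/vocab/> .\n"
)


def concept_to_turtle(concept_id, pref_label, alt_labels=(), broader=None, lang="en"):
    """Render one SKOS concept as Turtle.

    Labels are assumed to contain no double quotes or backslashes;
    this sketch does not escape them.
    """
    props = [f'skos:prefLabel "{pref_label}"@{lang}']
    props += [f'skos:altLabel "{alt}"@{lang}' for alt in alt_labels]
    if broader:
        props.append(f"skos:broader ex:{broader}")
    body = " ;\n    ".join(props)
    return f"ex:{concept_id} a skos:Concept ;\n    {body} ."


# A hypothetical concept with a synonym and a broader concept.
turtle = concept_to_turtle(
    "estuary", "Estuary", alt_labels=["River mouth"], broader="coastalWaters"
)
document = PREFIXES + "\n" + turtle + "\n"
```

The preferred label, the alternative label and the link to a broader concept correspond to the structure a thesaurus already has, which is why existing vocabularies can usually be expressed in SKOS without redesign.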
Find out more
Introduction to vocabularies:
- Marine Metadata Initiative (MMI) — a comprehensive explanation of vocabularies and their use
- ANSI/NISO Z39.19 - Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies 2005 (revised 2010)
- ISO 25964-1:2011 Information and documentation -- Thesauri and interoperability with other vocabularies -- Part 1: Thesauri for information retrieval
- SKOS (Simple Knowledge Organization System)
- Resource Description Framework (RDF)
- ANDS prototype Controlled Vocabulary service