Metadata¶
Overview¶
This document provides a comprehensive guide to metadata management in the ALTERNATIVE platform. Metadata is structured reference data that helps to sort and identify the information it describes. In the context of our platform, metadata plays a crucial role in organizing, finding, and understanding datasets. This guide covers the various metadata fields available, including standard fields, advanced fields for experimental data, and custom fields. It also explains how to effectively use metadata for searching and filtering datasets within the platform. Understanding and properly utilizing metadata is essential for researchers and data scientists to efficiently manage and access the wealth of information stored in the ALTERNATIVE platform.
Fields¶
Each dataset has metadata associated with it. The following fields are available for describing datasets:
- Title - this title will be unique across the platform, so make it brief but specific
- URL - automatically filled based on title, but can be edited
- Description - add a longer description of the dataset, including information such as what people need to know when using the data
- Tags - add tags that will help people find the data and link it with other related data
- License - it is important to include a license, so that people know how they can use the data
- Organization - choose which organization, that you're a member of, should own the dataset
- Visibility - a public dataset can be seen by anyone, a private one can only be seen by members of the organization owning the dataset or by collaborators of the dataset, and will not show up in searches by other users
- Source - where the data is from
- Version - version of the data
- Author - name of the person or organization responsible for producing the data
- Author e-mail - an e-mail address for the author, to which queries about the data should be sent
- Maintainer - name of the person or organization responsible for maintaining the data
- Maintainer e-mail - details for a second person responsible for the data
Advanced Metadata for Experiments¶
When creating/editing a dataset, you can mark the checkbox to be able to set extra fields, related to experiment data:
- Culture medium
- Number of replicates
- Number of cells/well
- Ratio hiPSC-CMs/HCAECs
- Date of experiment
- Toxin
- Age type
- Dimension
- Category
- Content
- Model
Custom Fields¶
If you want the dataset metadata to have more fields, you can add a name/key and value for it.
Find Data¶
You should be able to find a dataset by typing the title, or some relevant words from the description/metadata, into the search box on any page, containing datasets. On the left side of the Datasets
page there are also some filters, such as Organizations
, Groups
, Tags
, Formats
, Licenses
. Select any number of options to restrict the search. Under the search box there's also an Order by
field, to sort datasets in any given way. If you want to look for data owned by a particular organization/group, you can search within that from its homepage.
CKAN uses Apache Solr as its search engine. Note that not the whole functionality is offered through the simplified search interface in CKAN or it can differ. CKAN supports two search modes, both are used from the same search field. If the search terms entered into the search field contain no colon (:
) CKAN will perform a simple search. If the search expression does contain a colon CKAN will perform an advanced search.
Simple Search¶
CKAN defers most of the search to Solr and by default it uses the DisMax Query Parser that was primarily designed to be easy to use and to accept almost any input.
The search words typed by the user in the search box defines the main query constituting the essence of the search. Some characters are treated as mandatory (+
) and prohibited (-
) modifiers for terms. Text wrapped in balanced quote characters ("example text"
) is treated as a phrase. By default, all words or phrases specified by the user are treated as optional unless they are preceded by +
or -
. CKAN will search for the complete word and when doing simple search wildcards are not supported. Solr applies some preprocessing and stemming when searching. Stemmers remove morphological affixes from words, leaving only the word stem. This may cause, for example, that searching for testing
or tested
will show also results containing the word test
. If the name of the dataset contains words separated by -
it will consider each word independently in the search.
Examples:
census
-> will search for all the datasets containing the wordcensus
in the query fieldscensus +2019
-> will search for all the datasets contaning the wordcensus
and filter only those matching2019
toocensus -2019
-> will search for all the datasets containing the wordcensus
and will exclude the results featuring2019
"european census"
-> will search for all the datasets containing the phraseeuropean census
Testing
-> will search for all the datasets containing the wordTesting
and alsoTest
as it is the stem ofTesting
Advanced Search¶
This will be considered a fielded search and the query syntax of Solr will be used to search. This will allow us to use wildcards (*
), proximity matching (~
, looks for terms that are within a specific distance from one another) and general features described in Solr docs. The basic syntax is field:term
. Field names may differ from datasets' attributes, the mapping rules are defined in the schema.xml file. You can use title
to search by the dataset name and text
to look in a catch-all field. CKAN supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm, to do a fuzzy search use the ~
symbol at the end of a single-word term.
Examples:
title:european
-> will look for all the datasets containing in their title the wordeuropean
title:europ*
-> will look for all the datasets containing in their title a word that starts witheurop
, likeeurope
andeuropean
title:europe || title:australia
-> will look for datasets containingeurope
oraustralia
in their titletitle: "european census" ~ 4
-> will look for datasets, in which the title contains the wordseuropean
andcensus
within a distance of 4 wordsauthor:powell~
-> words likejowell
orpomell
will also be found