.. _Necessity:
###############################################
What are metadata and why are they important?
###############################################
In this chapter: metadata schema, XML and XSD, metadata optionality and possible values.
.. admonition:: definition
Metadata are *"data that provide information about other data"*.
The data we wish to describe are **language data** and the **tools/services** that process them. The basic **metadata** elements used to describe them are:
* ``corpora`` (i.e. collections of texts or other media),
* ``lexical/conceptual resources`` (i.e. collections of terms),
* ``language descriptions`` (i.e. grammars), and
* ``tools`` or ``services`` (i.e. software for natural language processing).
These metadata elements have multiple features and properties. For example, the ``corpus`` element has several *children* (hierarchically dependent elements), which are metadata themselves, as shown in the image below:
.. image:: CorpusXSD.png
:width: 1200px
.. _XMLXSD:
The image above shows the part of the :ref:`CLARIN:EL metadata schema ` dedicated to the ``corpus`` element. A **schema** is a detailed *map* in which all elements are located, defined, described and hierarchically related to each other. All this information is stored in an external document called `XSD `_: **XML Schema Definition**.
`XML `_ stands for **eXtensible Markup Language**. It is a language designed to label data by using **tags <>** [#]_. The tags represent the data structure and contain the **metadata**. The XSD also expresses a set of rules to which an XML document must conform in order to be considered *valid* (according to a specific schema).
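As a rough illustration of how tags carry both the structure and the values of a description, the following Python sketch parses a small XML fragment with the standard library. The element names (``resource``, ``resourceName``, ``keyword``) are hypothetical and are not taken from the CLARIN:EL schema. Note also that ``xml.etree.ElementTree`` only checks that a document is well formed; validating it against an XSD would require a separate schema-aware library.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the element names are invented for illustration only
# and do not come from the CLARIN:EL metadata schema.
RECORD = """
<resource>
  <resourceName>Example corpus</resourceName>
  <keyword>alignment</keyword>
  <keyword>parliament</keyword>
</resource>
"""


def list_elements(xml_text):
    """Return (tag, value) pairs for every direct child of the root element."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well formed
    return [(child.tag, (child.text or "").strip()) for child in root]


def is_well_formed(xml_text):
    """Well-formedness check only; this is NOT validation against an XSD."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for tag, value in list_elements(RECORD):
        print(f"{tag}: {value}")
```

The distinction matters: a document can be well formed (every tag properly opened and closed) and still be invalid with respect to a particular schema, for instance because a required element is missing.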
The schema is created to help different types of users **describe**, **organize**, **retrieve** and **reuse** resources (for more information see the :ref:`FAIR Principles ` section). For the resources found in `CLARIN:EL `_, the schema answers questions such as the following:
* **What** is the nature of the resources?
* **How** were the resources **created**?
* **Why** were they **created**?
* **When** were they **created**?
* **Who created** them?
* **What** were the **standards/tools/techniques** used, if any?
* What is their **size** (in various units)?
* What was their **source**?
The CLARIN:EL metadata schema also provides for the various media types, the different languages and other useful information on all types of resources, each expressed by the respective metadata elements.
.. _Optionality:
Each piece of information encoded as a metadata element is *more or less necessary* for the description of a resource. This is expressed by the various degrees of **optionality** as depicted in the following table:
+--------------------------------+-----------------------------------------------------------------------------------------------+
| If a metadata element is       | Then                                                                                          |
+================================+===============================================================================================+
| **mandatory**                  | it must always be provided                                                                    |
+--------------------------------+-----------------------------------------------------------------------------------------------+
| **recommended**                | it is still important, therefore should be provided                                           |
+--------------------------------+-----------------------------------------------------------------------------------------------+
| **mandatory upon condition**   | it becomes mandatory after a certain value of *another element* has been filled in            |
+--------------------------------+-----------------------------------------------------------------------------------------------+
| **recommended upon condition** | it becomes recommended after a certain value of *another element* has been filled in          |
+--------------------------------+-----------------------------------------------------------------------------------------------+
| **optional**                   | "you should never say ‘this metadata isn’t useful’; be generous and provide it anyway!"[#]_   |
+--------------------------------+-----------------------------------------------------------------------------------------------+
.. tip:: The mandatory metadata elements for CLARIN:EL are listed :ref:`here `.
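The optionality levels in the table above can be read as simple checks on a description. The Python sketch below is one possible way to express them, not the way CLARIN:EL implements them. All element names and the condition (``resourceType`` equal to ``corpus`` making ``size`` mandatory) are hypothetical and chosen only for illustration.

```python
# Hypothetical optionality rules; the element names are illustrative only
# and are not taken from the CLARIN:EL schema.
MANDATORY = {"resourceName", "description"}
RECOMMENDED = {"keyword"}
# (trigger element, trigger value) -> elements that then become mandatory
MANDATORY_IF = {("resourceType", "corpus"): {"size"}}


def check_record(record):
    """Return (errors, warnings) for a dict mapping element names to values."""
    errors, warnings = [], []

    # Start from the unconditional rules, then add the conditional ones
    # whose trigger element carries the trigger value.
    required = set(MANDATORY)
    for (element, value), extra in MANDATORY_IF.items():
        if record.get(element) == value:
            required |= extra

    for name in sorted(required):
        if not record.get(name):
            errors.append(f"missing mandatory element: {name}")
    for name in sorted(RECOMMENDED):
        if not record.get(name):
            warnings.append(f"missing recommended element: {name}")
    return errors, warnings
```

In this sketch a missing mandatory element is reported as an error, while a missing recommended element only produces a warning. Optional elements are never reported, although providing them still makes the description richer.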
Each element takes a value of a specific type. The value is the content enclosed between the element's tags, and it ranges from alphanumeric strings to floating-point numbers, URLs, etc. The following examples illustrate some of these value types (*click on the arrow to reveal the example*).
.. collapse:: a single word:
**** alignment ****
.. collapse:: a phrase:
**** Political Science ****
.. collapse:: multiple phrases/paragraphs:
**** This is a collection of the raw minutes of the Greek Parliament plenary sessions of the last 30 years (more than 1.000.000 speeches). The existing corpus has all raw data in txt format. In order to make the resource more processable, we have also split it into smaller subcorpora, with a maximum compressed folder size of 40 Mb per subcorpus. The created subcorpora are thematically organized per Greek parliamentary terms. ****
.. collapse:: a date:
**** 2005-10-01 ****
.. collapse:: a number:
**** 100000.0 ****
.. collapse:: a URL:
**** http://www.ilsp.gr/ ****
.. collapse:: an email:
**** name@athenarc.gr ****
.. collapse:: other metadata with their values:
****
**** name@athenarc.gr ****
****
You can see more examples :ref:`here `.
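To make the idea of value types more concrete, the following Python sketch guesses the kind of value from its textual form, using the sample values shown in the examples above. This is only an illustration under simplifying assumptions: in practice the expected type of each element is fixed by the schema rather than inferred, and the function name ``classify_value`` and the loose email pattern are inventions of this sketch.

```python
import re
from datetime import date
from urllib.parse import urlparse

# Deliberately loose pattern: something@something.something, no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify_value(text):
    """Guess the kind of value held by a metadata element (illustrative only)."""
    text = text.strip()

    # Numbers are tried first so that digit-only strings are not read as dates.
    try:
        float(text)
        return "number"
    except ValueError:
        pass

    # ISO 8601 calendar dates such as 2005-10-01.
    try:
        date.fromisoformat(text)
        return "date"
    except ValueError:
        pass

    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if EMAIL_RE.match(text):
        return "email"
    return "text"


if __name__ == "__main__":
    for sample in ["alignment", "2005-10-01", "100000.0",
                   "http://www.ilsp.gr/", "name@athenarc.gr"]:
        print(f"{sample!r}: {classify_value(sample)}")
```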
.. [#] You can export the description of a resource in XML by visiting its :ref:`view page `.
.. [#] `FAIR Principles > F2: Data are described with rich metadata `_.