Examples of metadata

In this chapter: metadata excerpts with their explanations, links to XSD and full XML metadata descriptions

The goal of this section is to familiarize users with the use of metadata. To do so, resource descriptions have been exported [1] from the CLARIN:EL infrastructure and excerpts of interest have been copied verbatim. Each metadata element is presented on its own and then briefly explained. There are also links to the full XML description of each resource, for anyone wishing to see the metadata in context, and to the XSD, for a detailed definition of the element.

resourceName

The first metadata element is the resourceName from the Greek Parliament Plenary Sessions (1989-2019), a collection of the raw minutes of the Greek Parliament plenary sessions of the last 30 years (more than 1,000,000 speeches).

XML

<ms:resourceName xml:lang="en">Greek Parliament Plenary Sessions (1989-2019)</ms:resourceName>
<ms:resourceName xml:lang="el">Πρακτικά της Ολομέλειας του Ελληνικού Κοινοβουλίου (1989-2019)</ms:resourceName>

As shown, the name can be provided in more than one language; the first language, by default, is English (xml:lang="en") while the second is free of choice. Here the language chosen is Greek (xml:lang="el").
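As a rough illustration of how the language-tagged names can be read programmatically, the sketch below parses the excerpt with Python's standard library. The `<record>` wrapper element is hypothetical (added only so the fragment parses on its own), and the namespace URI bound to the `ms:` prefix is an assumption that should be checked against the XSD.

```python
import xml.etree.ElementTree as ET

# Assumption: the ms: prefix is bound to this namespace URI (verify in the XSD).
MS = "http://w3id.org/meta-share/meta-share/"
# The xml: prefix is predefined by the XML specification.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The excerpt, wrapped in a hypothetical <record> root so it is well-formed.
excerpt = f"""<record xmlns:ms="{MS}">
  <ms:resourceName xml:lang="en">Greek Parliament Plenary Sessions (1989-2019)</ms:resourceName>
  <ms:resourceName xml:lang="el">Πρακτικά της Ολομέλειας του Ελληνικού Κοινοβουλίου (1989-2019)</ms:resourceName>
</record>"""

root = ET.fromstring(excerpt)

# Map each language code to the resource name given in that language.
names = {el.get(XML_LANG): el.text for el in root.findall(f"{{{MS}}}resourceName")}

print(names["en"])
print(names["el"])
```

Keying the names by their xml:lang value makes it easy to fall back to the English name when a translation is missing.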


resourceCreator

The second excerpt is taken from the KELLY word-list, a monolingual lexical conceptual resource. KELLY word-lists were created to facilitate the learning of a foreign/second language. The Greek part was created by the Institute for Language and Speech Processing which is an organization.

XML

<ms:resourceCreator>
        <ms:Organization>
                <ms:actorType>Organization</ms:actorType>
                <ms:organizationName xml:lang="el">Ινστιτούτο Επεξεργασίας του Λόγου</ms:organizationName>
                <ms:organizationName xml:lang="en">Institute for Language and Speech Processing</ms:organizationName>
                <ms:website>http://www.ilsp.gr/</ms:website>
        </ms:Organization>
</ms:resourceCreator>

The necessary information about the creator is enclosed between the resourceCreator tags. First, the type of the creator (actorType) is defined; a resource could have as creator a person, a group of people or an organization, as is the case for the KELLY word-list. Then the name of the organization is provided (in two languages, xml:lang="el" and xml:lang="en") as well as its website.


isPartOf

The next example is from the Golden Part of Speech Tagged Corpus, a monolingual annotated corpus in Greek with 100,000 words. This corpus is a subset of the Hellenic National Corpus, which contains more than 97 million words from a variety of sources and domains. The subset relationship is expressed through the isPartOf metadata element in the CLARIN:EL metadata schema.

XML

<ms:isPartOf>
        <ms:resourceName xml:lang="el">Ελληνικός Θησαυρός της Ελληνικής Γλώσσας</ms:resourceName>
        <ms:resourceName xml:lang="en">Hellenic National Corpus</ms:resourceName>
        <ms:LRIdentifier ms:LRIdentifierScheme="http://purl.org/spar/datacite/handle"
                >http://hdl.handle.net/11500/ATHENA-0000-0000-23E2-9</ms:LRIdentifier>
        <ms:version>3.0</ms:version>
</ms:isPartOf>

The isPartOf element includes the name of the resource (resourceName) from which the Golden Part has been derived, i.e. the Hellenic National Corpus, expressed in two languages (xml:lang="el" and xml:lang="en"), along with its identifier (LRIdentifier) and version (version).
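The sketch below shows one possible way to pull the identifier, its scheme and the version out of such an element. It uses an abridged copy of the excerpt; the `<record>` wrapper is hypothetical and the namespace URI for `ms:` is an assumption. Note that LRIdentifierScheme is itself a namespaced attribute, so it has to be looked up with its qualified name.

```python
import xml.etree.ElementTree as ET

# Assumption: the ms: prefix is bound to this namespace URI (verify in the XSD).
MS = "http://w3id.org/meta-share/meta-share/"
ns = {"ms": MS}

# Abridged excerpt, wrapped in a hypothetical <record> root.
excerpt = f"""<record xmlns:ms="{MS}">
<ms:isPartOf>
  <ms:resourceName xml:lang="en">Hellenic National Corpus</ms:resourceName>
  <ms:LRIdentifier ms:LRIdentifierScheme="http://purl.org/spar/datacite/handle"
      >http://hdl.handle.net/11500/ATHENA-0000-0000-23E2-9</ms:LRIdentifier>
  <ms:version>3.0</ms:version>
</ms:isPartOf>
</record>"""

part_of = ET.fromstring(excerpt).find("ms:isPartOf", ns)
identifier = part_of.find("ms:LRIdentifier", ns)

# The attribute carries the ms: prefix, so its key includes the namespace.
scheme = identifier.get(f"{{{MS}}}LRIdentifierScheme")
handle = identifier.text.strip()
version = part_of.findtext("ms:version", namespaces=ns)

print(scheme, handle, version)
```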


annotationType

Alignment is the process that establishes translational equivalences between structural units (words, sentences etc.) of a text in a given language and a text with similar meaning in other language(s). The Greek-Bulgarian Bul-TM parallel corpus is a bilingual corpus and, as the adjective parallel suggests, it has been aligned.

XML

<ms:annotation>
        <ms:annotationType>http://w3id.org/meta-share/omtd-share/Alignment1</ms:annotationType>
        <ms:segmentationLevel>http://w3id.org/meta-share/meta-share/sentence</ms:segmentationLevel>
        <ms:annotationStandoff>false</ms:annotationStandoff>
        <ms:annotationMode>http://w3id.org/meta-share/meta-share/automatic</ms:annotationMode>
        <ms:isAnnotatedBy>
                <ms:resourceName xml:lang="en">TrAid</ms:resourceName>
                <ms:version>unspecified</ms:version>
        </ms:isAnnotatedBy>
</ms:annotation>

Alignment is considered a type of annotation (annotationType). The two languages have been aligned at sentence level (segmentationLevel), and the annotation is not kept in a separate stand-off document (annotationStandoff is false). The alignment was performed automatically (annotationMode); the tool used (isAnnotatedBy) is called TrAid, but no version is available (unspecified).
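Several values in this excerpt are IRIs rather than plain words. A minimal sketch, assuming the last path segment of each IRI is an acceptable short label, of how the annotation block could be summarized follows. The `<record>` wrapper and the helper names `short_label` and `text_of` are hypothetical, and the namespace URI for `ms:` is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumption: the ms: prefix is bound to this namespace URI (verify in the XSD).
MS = "http://w3id.org/meta-share/meta-share/"
ns = {"ms": MS}

# Abridged excerpt, wrapped in a hypothetical <record> root.
excerpt = f"""<record xmlns:ms="{MS}">
<ms:annotation>
  <ms:annotationType>http://w3id.org/meta-share/omtd-share/Alignment1</ms:annotationType>
  <ms:segmentationLevel>http://w3id.org/meta-share/meta-share/sentence</ms:segmentationLevel>
  <ms:annotationStandoff>false</ms:annotationStandoff>
  <ms:annotationMode>http://w3id.org/meta-share/meta-share/automatic</ms:annotationMode>
</ms:annotation>
</record>"""

annotation = ET.fromstring(excerpt).find("ms:annotation", ns)


def short_label(iri):
    """Return the last path segment of an IRI, e.g. '.../sentence' -> 'sentence'."""
    return iri.rstrip("/").rsplit("/", 1)[-1]


def text_of(tag):
    """Return the text of a direct child of the annotation element."""
    return annotation.findtext(f"ms:{tag}", namespaces=ns)


summary = {
    "type": short_label(text_of("annotationType")),
    "level": short_label(text_of("segmentationLevel")),
    "mode": short_label(text_of("annotationMode")),
    # The element holds the literal string "true" or "false".
    "standoff": text_of("annotationStandoff") == "true",
}

print(summary)
```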


multilingualityType

The DICTA-SIGN corpus is a multimedia corpus, consisting of a video part and a text part, for four sign languages (British, French, German and Greek).

XML

<ms:multilingualityType>http://w3id.org/meta-share/meta-share/parallel</ms:multilingualityType>
<ms:language>
        <ms:languageTag>gss</ms:languageTag>
        <ms:languageId>gss</ms:languageId>
</ms:language>
<ms:language>
        <ms:languageTag>bfi</ms:languageTag>
        <ms:languageId>bfi</ms:languageId>
</ms:language>
<ms:language>
        <ms:languageTag>gsg</ms:languageTag>
        <ms:languageId>gsg</ms:languageId>
</ms:language>
<ms:language>
        <ms:languageTag>fsl</ms:languageTag>
        <ms:languageId>fsl</ms:languageId>
</ms:language>

Each corpus part is described separately. This excerpt describes the content of the video part of the resource. The languages in the video are sign languages and are aligned as indicated by the choice of the value parallel for the multilingualityType element. Then each language (language) is presented separately with its language tag (languageTag) and id (languageId): gss (Greek Sign Language), bfi (British Sign Language), gsg (German Sign Language) and fsl (French Sign Language).
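To make the mapping between codes and languages concrete, the following sketch lists the languageId values of the excerpt and looks up the names given in the paragraph above. The `<record>` wrapper and the `SIGN_LANGUAGE_NAMES` table are hypothetical helpers, and the namespace URI for `ms:` is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumption: the ms: prefix is bound to this namespace URI (verify in the XSD).
MS = "http://w3id.org/meta-share/meta-share/"
ns = {"ms": MS}

# Names as given in the explanation of the excerpt (hypothetical lookup table).
SIGN_LANGUAGE_NAMES = {
    "gss": "Greek Sign Language",
    "bfi": "British Sign Language",
    "gsg": "German Sign Language",
    "fsl": "French Sign Language",
}

codes_in_excerpt = ["gss", "bfi", "gsg", "fsl"]
language_blocks = "".join(
    f"<ms:language><ms:languageTag>{c}</ms:languageTag>"
    f"<ms:languageId>{c}</ms:languageId></ms:language>"
    for c in codes_in_excerpt
)

# The excerpt rebuilt compactly and wrapped in a hypothetical <record> root.
excerpt = (
    f'<record xmlns:ms="{MS}">'
    "<ms:multilingualityType>http://w3id.org/meta-share/meta-share/parallel"
    "</ms:multilingualityType>"
    f"{language_blocks}</record>"
)

root = ET.fromstring(excerpt)

# Collect the language identifiers in document order.
codes = [
    lang.findtext("ms:languageId", namespaces=ns)
    for lang in root.findall("ms:language", ns)
]
language_names = [SIGN_LANGUAGE_NAMES[code] for code in codes]

print(language_names)
```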


isDocumentedBy

Sometimes there is extra information about a resource in external documents such as papers and/or conference announcements. Such is the case with the Orossimo Terminological Resource - History, which is documented in "Collection of digital terminological resources: methodology and results". The isDocumentedBy element records the title (title) of that document, here in Greek and English.

XML

<ms:isDocumentedBy>
                <ms:title xml:lang="el">Συλλογή ηλεκτρονικών ορολογικών πόρων: μεθοδολογία και αποτελέσματα</ms:title>
                <ms:title xml:lang="en">Collection of digital terminological resources: methodology and results</ms:title>
</ms:isDocumentedBy>

fundingProject

The following example is more complex as it includes various metadata elements. It is taken from the Trilingual Terminological Dictionary, a lexical/conceptual resource with a threefold aim: to assist students in learning the subject areas of the curriculum, to improve their language skills in Greek and to familiarize them with information technology.

XML

<ms:fundingProject>
        <ms:projectName xml:lang="el">Τρίγλωσσο Ορολογικό Λεξικό</ms:projectName>
        <ms:projectName xml:lang="en">Trilingual Terminological Dictionary</ms:projectName>
        <ms:website>https://bit.ly/2V4hWLe</ms:website>
        <ms:website>https://www.ilsp.gr/projects/tol/</ms:website>
        <ms:fundingType>http://w3id.org/meta-share/meta-share/euFunds</ms:fundingType>
        <ms:fundingType>http://w3id.org/meta-share/meta-share/nationalFunds</ms:fundingType>
        <ms:funder>
                <ms:Organization>
                        <ms:actorType>Organization</ms:actorType>
                        <ms:organizationName xml:lang="en">Ministry of Education and Religious Affairs</ms:organizationName>
                </ms:Organization>
        </ms:funder>
        <ms:funder>
                <ms:Organization>
                        <ms:actorType>Organization</ms:actorType>
                        <ms:organizationName xml:lang="el">Ευρωπαϊκή Επιτροπή</ms:organizationName>
                        <ms:organizationName xml:lang="en">European Commission</ms:organizationName>
                        <ms:website>https://ec.europa.eu/info/index_en</ms:website>
                </ms:Organization>
        </ms:funder>
</ms:fundingProject>

The resource is the result of a project (fundingProject) bearing the same name (projectName), Trilingual Terminological Dictionary. The information provided for the project consists of its websites (website), the types of funding (fundingType) and the funders (funder). The project was financed with EU and national funds, and the funders were two organizations, the Ministry of Education and Religious Affairs and the European Commission.
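Because fundingType and funder are repeatable, code that reads this element has to collect lists rather than single values. A minimal sketch under stated assumptions: the excerpt is abridged, the `<record>` wrapper and the `english_name` helper are hypothetical, and the namespace URI for `ms:` is assumed.

```python
import xml.etree.ElementTree as ET

# Assumption: the ms: prefix is bound to this namespace URI (verify in the XSD).
MS = "http://w3id.org/meta-share/meta-share/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ns = {"ms": MS}

# Abridged excerpt, wrapped in a hypothetical <record> root.
excerpt = f"""<record xmlns:ms="{MS}">
<ms:fundingProject>
  <ms:projectName xml:lang="en">Trilingual Terminological Dictionary</ms:projectName>
  <ms:fundingType>http://w3id.org/meta-share/meta-share/euFunds</ms:fundingType>
  <ms:fundingType>http://w3id.org/meta-share/meta-share/nationalFunds</ms:fundingType>
  <ms:funder><ms:Organization>
    <ms:actorType>Organization</ms:actorType>
    <ms:organizationName xml:lang="en">Ministry of Education and Religious Affairs</ms:organizationName>
  </ms:Organization></ms:funder>
  <ms:funder><ms:Organization>
    <ms:actorType>Organization</ms:actorType>
    <ms:organizationName xml:lang="el">Ευρωπαϊκή Επιτροπή</ms:organizationName>
    <ms:organizationName xml:lang="en">European Commission</ms:organizationName>
  </ms:Organization></ms:funder>
</ms:fundingProject>
</record>"""

project = ET.fromstring(excerpt).find("ms:fundingProject", ns)


def english_name(org):
    """Return the English organizationName of an Organization element, if any."""
    for name in org.findall("ms:organizationName", ns):
        if name.get(XML_LANG) == "en":
            return name.text
    return None


# Keep only the last path segment of each funding type IRI.
funding_types = [
    el.text.rsplit("/", 1)[-1] for el in project.findall("ms:fundingType", ns)
]
funders = [
    english_name(org) for org in project.findall("ms:funder/ms:Organization", ns)
]

print(funding_types, funders)
```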


inputContentResource

The following XML excerpt provides information on the input of Voyant Tools, a web-based text reading and analysis environment.

XML

<ms:inputContentResource>
        <ms:processingResourceType>http://w3id.org/meta-share/meta-share/corpus</ms:processingResourceType>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
        <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Pdf</ms:dataFormat>
        <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Rtf</ms:dataFormat>
        <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Xml</ms:dataFormat>
        <ms:dataFormat>http://w3id.org/meta-share/omtd-share/ConllU</ms:dataFormat>
        <ms:dataFormat>http://w3id.org/meta-share/omtd-share/Html</ms:dataFormat>
</ms:inputContentResource>

Voyant Tools takes as input (inputContentResource) corpora (processingResourceType) of textual data (mediaType); the formats (dataFormat) listed in the excerpt are PDF, RTF, XML, CoNLL-U and HTML.


outputResource

The next excerpt presents the output of the ILSP Language Identification System.

XML

<ms:outputResource>
        <ms:processingResourceType>http://w3id.org/meta-share/meta-share/corpus</ms:processingResourceType>
        <ms:language>
            <ms:languageTag>el-Latn</ms:languageTag>
            <ms:languageId>el</ms:languageId>
            <ms:scriptId>Latn</ms:scriptId>
            <ms:languageVarietyName xml:lang="en">Greeklish</ms:languageVarietyName>
        </ms:language>
        <ms:language>
            <ms:languageTag>el-Grek</ms:languageTag>
            <ms:languageId>el</ms:languageId>
            <ms:scriptId>Grek</ms:scriptId>
        </ms:language>
        <ms:language>
            <ms:languageTag>fr</ms:languageTag>
            <ms:languageId>fr</ms:languageId>
        </ms:language>
        <ms:language>
            <ms:languageTag>en</ms:languageTag>
            <ms:languageId>en</ms:languageId>
        </ms:language>
        <ms:language>
            <ms:languageTag>de</ms:languageTag>
            <ms:languageId>de</ms:languageId>
        </ms:language>
        <ms:language>
            <ms:languageTag>nl</ms:languageTag>
            <ms:languageId>nl</ms:languageId>
        </ms:language>
        <ms:mediaType>http://w3id.org/meta-share/meta-share/text</ms:mediaType>
</ms:outputResource>

This tool performs language identification for Greeklish, Greek, English, German, Dutch and French. Greeklish, as seen in the excerpt above, is a variety (languageVarietyName) of the Greek language: the language (languageId) is defined as Greek (el) but the script (scriptId) is Latin (Latn).
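In the excerpt, each languageTag appears to be the languageId followed, where present, by the scriptId. The sketch below checks that relationship on an abridged copy of the excerpt. It is only an illustration of the pattern visible here, not a statement of how the schema defines languageTag; the `<record>` wrapper is hypothetical and the namespace URI for `ms:` is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumption: the ms: prefix is bound to this namespace URI (verify in the XSD).
MS = "http://w3id.org/meta-share/meta-share/"
ns = {"ms": MS}

# Abridged excerpt (three of the six languages), in a hypothetical <record> root.
excerpt = f"""<record xmlns:ms="{MS}">
<ms:outputResource>
  <ms:language>
    <ms:languageTag>el-Latn</ms:languageTag>
    <ms:languageId>el</ms:languageId>
    <ms:scriptId>Latn</ms:scriptId>
    <ms:languageVarietyName xml:lang="en">Greeklish</ms:languageVarietyName>
  </ms:language>
  <ms:language>
    <ms:languageTag>el-Grek</ms:languageTag>
    <ms:languageId>el</ms:languageId>
    <ms:scriptId>Grek</ms:scriptId>
  </ms:language>
  <ms:language>
    <ms:languageTag>fr</ms:languageTag>
    <ms:languageId>fr</ms:languageId>
  </ms:language>
</ms:outputResource>
</record>"""

root = ET.fromstring(excerpt)

pairs = []
for lang in root.findall("ms:outputResource/ms:language", ns):
    lang_id = lang.findtext("ms:languageId", namespaces=ns)
    script = lang.findtext("ms:scriptId", namespaces=ns)  # None when absent
    # Join language and script with a hyphen when a script is given.
    composed = lang_id if script is None else f"{lang_id}-{script}"
    pairs.append((composed, lang.findtext("ms:languageTag", namespaces=ns)))

for composed, declared in pairs:
    print(composed, declared, composed == declared)
```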


attributionText

The last example showcases the attributionText of a language description resource, the PANACEA Environment Corpus n-grams EL.

XML

<ms:attributionText xml:lang="el">PANACEA σώμα ελληνικών n-γραμμάτων (n-grams) περιβαλλοντικού τομέα. Δημιουργός:
Ινστιτούτο Επεξεργασίας του Λόγου - Ερευνητικό Κέντρο Αθηνά. Άδεια: Creative Commons Attribution Share Alike 4.0
International (https://creativecommons.org/licenses/by-sa/4.0/legalcode,
https://creativecommons.org/licenses/by-sa/4.0/). Πηγή: http://hdl.handle.net/11500/ATHENA-0000-0000-23DA-3
(CLARIN:EL)</ms:attributionText>
<ms:attributionText xml:lang="en">PANACEA Environment Corpus n-grams EL (Greek) by Institute for Language and Speech
Processing - Athena Research Center used under Creative Commons Attribution Share Alike 4.0 International
(https://creativecommons.org/licenses/by-sa/4.0/legalcode, https://creativecommons.org/licenses/by-sa/4.0/). Source:
http://hdl.handle.net/11500/ATHENA-0000-0000-23DA-3 (CLARIN:EL)</ms:attributionText>

The licence of the resource is Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA 4.0). “This license lets others remix, adapt, and build upon your work even for commercial purposes, as long as they credit you and license their new creations under the identical terms. This license is often compared to “copyleft” free and open source software licenses. All new works based on yours will carry the same license, so any derivatives will also allow commercial use.” [2] The attribution text serves this exact purpose: it provides ready-made text naming the resource, its creator (the Institute for Language and Speech Processing - Athena Research Center), the licence under which the resource and all its derivatives are to be distributed, and the source.

[1] These tags come in pairs; the opening and ending tags are identical except for the forward slash.

[2] More information on the Creative Commons website.