Specific guidelines on mandatory metadata

In this chapter: how to fill in the mandatory metadata.

This section explains how to fill in the metadata that are mandatory for a corpus, a lexical/conceptual resource, a tool and a language description. Metadata elements common to all resource types are presented first, followed by those specific to a resource type.

Each metadata element is briefly explained, with examples wherever possible. The examples cover both best practices and common mistakes that must be avoided (marked with an asterisk *). In addition, each metadata element has a link to the XSD with its full representation.

1. resourceName

~The official name or title of the language resource/technology~

The name must reflect the content (and the type) of the resource. It must give all the information needed to identify the resource without being too descriptive; detailed information belongs in the description. Do not use full phrases, punctuation marks (unless necessary) or abbreviations in the resource title. Provide the full name of the resource and enter the short name (if any) in the respective metadata field.

Examples

Do: Glossary of medical terms; Old and New Testament; Ellogon annotation tool

Don’t: *This is a glossary of medical terms; *Old and New Testament!; *Ellogon ann. tool
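The rules above can be approximated with a few string checks. The sketch below is only an illustration: the function name, the list of sentence openers and the regular expressions are assumptions made for this example, not part of the metadata schema or the editor, and they will miss many cases that a human reviewer would catch.

```python
import re

# Hypothetical heuristics; the guidelines do not define an algorithm.
SENTENCE_OPENERS = ("this is", "these are", "here is", "it is")


def check_resource_name(name: str) -> list[str]:
    """Return a list of warnings for a candidate resourceName."""
    warnings = []
    stripped = name.strip()
    # A title that starts like a sentence is probably a full phrase.
    if stripped.lower().startswith(SENTENCE_OPENERS):
        warnings.append("reads like a full phrase")
    # A word followed by a period usually signals an abbreviation.
    if re.search(r"\b\w+\.(\s|$)", stripped):
        warnings.append("possible abbreviation")
    # Exclamation and question marks are rarely necessary in a title.
    if re.search(r"[!?]", stripped):
        warnings.append("unnecessary punctuation")
    return warnings


for candidate in ("Glossary of medical terms", "Ellogon ann. tool"):
    print(candidate, "->", check_resource_name(candidate))
```

An empty list means that none of the heuristics fired; it does not guarantee that the name is a good one.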

  • See how the resourceName element is described in detail in XSD


2. description

~A short presentation of the language resource/technology~

The description must contain all the important information about the resource. Do not simply repeat (or rephrase) the resource title without adding anything. After reading it, and without seeing the rest of the metadata, a user should be able to understand what the resource is about. State the type of the resource and provide any useful information on how, when and by whom it was created, its language and size, and the purpose it serves, if any. Mention any particularities or limitations of the data or the tool that users should be aware of. The description must be free text of at least one paragraph. You can also use the functionalities of the metadata editor [1] (formatting, hyperlinking, bullets, etc.) to make the description easy to read.

Examples

Do: 1) Bilingual glossary (German / Greek) made in 2019/2020 by students of DFLTI (Ionian University) under the supervision of Mr. Olaf Immanuel Seel in the framework of the department’s cooperation with the EU TermCord.

2) Text corpus from the transcription of recorded children’s speech focused on narration. The corpus was collected from interviews conducted by undergraduate and postgraduate students of the Department of Mediterranean Studies of the University of the Aegean with children with whom they are related either by friendship or kinship. Files with both the questions and answers are provided, where K=girl and A=boy, as well as cleaned files containing only the children’s answers (clean).

Don’t: *Symposium Proceedings; *Bilingual lexicon on the Greek economy

  • See how the description element is described in detail in XSD


3. version

~A particular form of a resource differing in certain respects from an earlier form~

The recommended format for a version is: major_version.minor_version.patch [2].

Examples

Do: 1.0.0-alpha; 2.1.1

Don’t: *1.0.1-alpha; *0.0.2
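As an illustration of the major_version.minor_version.patch format, the following sketch parses a version string into its parts. The function name and the regular expression are assumptions made for this example. It checks only the shape of the string (three numbers and an optional label such as "-alpha"); a well-formed string can still be a poor choice for a given resource, so the versioning guidelines remain the reference.

```python
import re

# major.minor.patch with an optional pre-release label such as "-alpha".
VERSION_RE = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-([0-9A-Za-z.-]+))?$"
)


def parse_version(text):
    """Return (major, minor, patch, label), or None if the shape is wrong."""
    match = VERSION_RE.match(text.strip())
    if match is None:
        return None
    major, minor, patch, label = match.groups()
    return int(major), int(minor), int(patch), label


print(parse_version("2.1.1"))
print(parse_version("2.0"))
```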

The infrastructure automatically assigns the 1.0.0 version to all resources. If this is not the case with your resource, write the version number in the box (e.g. 2.0.0) and then click on the version date to reveal the calendar. Select the date when this version was released and click on OK.

../../_images/VersionDate.png

The editor also provides the possibility to automatically create a new version of an existing resource. See the guidelines on versioning before you proceed to do so.

  • See how the version element is described in detail in XSD


4. keyword

~A word or phrase characteristic of the language resource/technology that can be used at search~

Keywords are words or short phrases used to search for a resource. The more keywords you provide, the easier the resource is to retrieve. However, the keywords must highlight aspects of the resource that are not already covered by mandatory metadata. If, for example, you describe a monolingual annotated corpus created to enhance the learning process of non-native speakers, your keywords must not be exclusively or primarily “corpus”, “annotated” or “monolingual”; these are the values of the resourceType, corpusSubclass and linguality metadata elements respectively, which are also searched and retrieved. Instead, use the phrases “non native speaker” and “learning process” as keywords, since they emphasize the intended use of the resource; you can then add “corpus”, “annotated” and “monolingual” as well.

Examples

Do: non native speaker; learning process (corpus; annotated; monolingual)

Don’t: *corpus; *annotated; *monolingual
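The keyword rule can be sketched as a comparison between the keywords and the values already stored in other metadata elements. The function names below are hypothetical, and the caller is assumed to pass in the values of resourceType, corpusSubclass and linguality. The sketch tests only the "exclusively" part of the rule; whether redundant keywords dominate "primarily" is left to human judgement.

```python
def split_keywords(keywords, covered_values):
    """Split keywords into (informative, redundant) lists.

    covered_values: values already present in other metadata elements.
    """
    covered = {value.strip().lower() for value in covered_values}
    informative = [k for k in keywords if k.strip().lower() not in covered]
    redundant = [k for k in keywords if k.strip().lower() in covered]
    return informative, redundant


def keywords_acceptable(keywords, covered_values):
    """True if at least one keyword adds something the metadata lacks."""
    informative, _ = split_keywords(keywords, covered_values)
    return bool(informative)


covered = ["corpus", "annotated", "monolingual"]
print(keywords_acceptable(["corpus", "annotated", "monolingual"], covered))
print(keywords_acceptable(["non native speaker", "learning process", "corpus"], covered))
```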

After you have typed in the keyword you want, click on the prompt that appears under the box: Add “non native speaker”. Only then will the value be saved. If you omit this step, the keyword will not appear when you revisit this editor section.

../../_images/Keyword.png
  • See how the keyword element is described in detail in XSD


5. additionalInformation

~A URL (landing page) or email (e.g., support email) where the user can find or ask for more information~

This metadata element is either a web page with additional information on the language resource/technology (e.g., its contents or a link to the access location) or the email address of the person responsible for providing information. Make sure to enter a valid email address or URL.
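A rough way to check a value before entering it is sketched below. This is an assumption-laden illustration: the function name is hypothetical, the email pattern is deliberately loose, and the checks the infrastructure itself applies are not described here and may differ.

```python
import re
from urllib.parse import urlparse

# Loose pattern: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify_additional_info(value):
    """Return "url", "email", or None if the value looks like neither."""
    value = value.strip()
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if EMAIL_RE.match(value):
        return "email"
    return None


print(classify_additional_info("https://example.org/resource"))
print(classify_additional_info("support@example.org"))
```

A value that returns None, such as a bare domain without a scheme, is worth correcting before it is saved.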

  • See how the additionalInformation element is described in detail in XSD



8. data

~The content files of a resource~

Not all resources have content files. A metadata description may or may not be accompanied by content files (see here for more information). See also the detailed guidelines on how to prepare data, the recommended formats and how to upload them.

9. personalData, sensitiveData & anonymized

~Information about whether the resource contains personal and/or sensitive data~

Attention

This metadata element is mandatory for corpora, lexical/conceptual resources and language descriptions.

You must specify whether the resource contains personal data (e.g., names) and/or sensitive data (e.g., medical or health-related data) and thus requires special handling. If it does, new metadata fields are presented in which you can provide additional information on special requirements, if necessary.

../../_images/PersonalSensitiveData.png
  • See how the personalData element is described in detail in XSD

  • See how the sensitiveData element is described in detail in XSD

The existence of personal and/or sensitive data generates another metadata element, that of anonymization [3]. Here you can provide all the information on the anonymization/pseudo-anonymization: the tool used, whether specific code was written, any conventions adopted, etc.

../../_images/Anonymization.png
  • See how the anonymized element is described in detail in XSD
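The dependency between these elements can be sketched as a conditional check. The dictionary layout and the function name are assumptions made for this example; only the element names and the answer "yes" come from the guidelines, and the real record is an XML document validated against the XSD.

```python
def missing_conditional_elements(record):
    """Return the conditionally mandatory elements missing from a record.

    record: hypothetical dict mapping metadata element names to values.
    """
    has_personal = record.get("personalData") == "yes"
    has_sensitive = record.get("sensitiveData") == "yes"
    missing = []
    # anonymized becomes mandatory once either answer is "yes".
    if (has_personal or has_sensitive) and not record.get("anonymized"):
        missing.append("anonymized")
    return missing


print(missing_conditional_elements({"personalData": "yes", "sensitiveData": "no"}))
```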

11. encodingLevel

~Information on the contents of a resource as regards the linguistic level of analysis it caters for~

Attention

This metadata element is mandatory for lexical/conceptual resources and language descriptions.

The values for encoding refer to various linguistic levels of analysis. These levels are presented in alphabetical order below with their subject matters:

  • morphology: word formation (such as inflection, derivation and compounding);

  • other: value used when none of the recommended values of an element is appropriate for an item;

  • phonetics: speech sounds;

  • phonology: speech sounds that constitute the fundamental components of a language;

  • pragmatics: the relationship of sentences to the environment in which they occur;

  • semantics: the meaning of a word, phrase, etc.;

  • syntax: the structure of linguistic units (phrases, sentences);

  • unspecified: value used for mandatory elements whose value is unknown or cannot be specified.
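The list above is a controlled vocabulary, which can be sketched as a set of allowed values plus a check on when the element is required. The resource type labels, the function name and the assumption that several levels may be given at once are illustrative only; the XSD is the authoritative definition.

```python
# Values listed in the guidelines.
ENCODING_LEVELS = frozenset({
    "morphology", "other", "phonetics", "phonology",
    "pragmatics", "semantics", "syntax", "unspecified",
})

# Hypothetical labels for the resource types that require the element.
REQUIRES_ENCODING_LEVEL = frozenset({
    "lexical/conceptual resource", "language description",
})


def check_encoding_levels(resource_type, levels):
    """Return a list of problems; an empty list means nothing was found."""
    problems = []
    if resource_type in REQUIRES_ENCODING_LEVEL and not levels:
        problems.append("encodingLevel is mandatory for this resource type")
    for level in levels:
        if level not in ENCODING_LEVELS:
            problems.append(f"unknown encodingLevel value: {level}")
    return problems


print(check_encoding_levels("language description", ["morphology", "syntax"]))
```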

The metadata field is found in the LCR section (Technical tab) for lexical/conceptual resources, above the lcrSubclass, as shown in the image below.

../../_images/EncodingLevel.png

For language descriptions the metadata field is found in the Language Description section (Technical tab) below the chosen LanguageDescriptionSubclass.

../../_images/EncodingLevelGrammar.png
  • See how the encodingLevel element is described in detail in XSD

12. function

~The operation/function/task that a software object performs~

Attention

This metadata element is mandatory for tools/services only.

The dropdown list in the respective metadata field includes too many values to present here. If you start typing, though, the list is reduced to the values matching what you typed. If the function of your tool/service matches one of the suggested values, click on it and it will be added. If it does not match any of the suggested values, you must click on the prompt (missing…? add). Only then will the value be saved. If you omit this step, the function will not appear when you revisit this editor section.

../../_images/FunctionEvent.png

The metadata element is found in the editor Tool/Service section (Categories tab).

  • See how the function element is described in detail in XSD

13. inputContentResource

~The requirements set by a tool/service for the (content) resource that it processes~

Attention

This metadata element is mandatory for tools/services only.

This is a complex metadata element which requires four other metadata fields to be described: input resource type, media type, data format and annotation type. Together, these elements provide the necessary information on the resource that a tool/service processes.

../../_images/inputContentResource.png

For the resource used as input, a dropdown list provides the values shown in the following image. To choose one, click on the value.

../../_images/ProcessingResourceType.png

The next field to be filled in requires information on the medium of the resource used as input. Again, click on a value to add it.

../../_images/InputMediaType.png

For the data format that follows, you must type in the box to reveal the values that match what you typed and eliminate all the others from the dropdown list. Once you have located the appropriate value, click on it.

../../_images/InputFormat.png

Finally, if the resource provided as input is annotated, you must define the annotation type. Once more, start typing in the box to reveal the possible corresponding values. Choose one by clicking on it.

../../_images/InputAnnotationType.png

The inputContentResource element is found in the editor Tool/Service section (Technical tab).

../../_images/inputContentEditor.png
  • See how the inputContentResource element is described in detail in XSD
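The structure of this complex element can be sketched as a small record type. The class, the field names, the `annotated` flag and the sample values are assumptions made for this illustration; the real element is defined in the XSD and its values come from the dropdown lists in the editor. The sketch treats the annotation type as required only when the input is annotated, as described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputContentResource:
    """Hypothetical in-memory view of the inputContentResource element."""
    resource_type: str
    media_type: str
    data_format: str
    annotated: bool = False
    annotation_type: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """Return the names of fields that still need a value."""
        required = ("resource_type", "media_type", "data_format")
        missing = [name for name in required if not getattr(self, name)]
        # The annotation type is needed only for annotated input.
        if self.annotated and not self.annotation_type:
            missing.append("annotation_type")
        return missing


# Placeholder values, not taken from the actual dropdown lists.
spec = InputContentResource("corpus", "text", "", annotated=True)
print(spec.missing_fields())
```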

[1] Henceforth editor.

[2] See the semantic versioning guidelines for specific instructions.

[3] The anonymized element belongs to the metadata that are mandatory upon condition, that is, whose necessity depends on the values the user provides for other elements, such as the answer “yes” to the question about the existence of personal and/or sensitive data in a resource.

[4] The difference between typesystem and annotation scheme is based on whether they are used by tools or defined by users: the annotation scheme contains custom types while the typesystem is mostly used for built-in types.

[5] The difference between a typesystem and a tagset is that the typesystem will include only annotation types (e.g. an annotation type POS to represent part-of-speech annotations) while the tagset contains a list of the valid tag values (e.g. the Penn Treebank Tagset).

[6] The difference between typesystem and annotation scheme is based on whether they are used by tools or defined by users: the annotation scheme contains custom types while the typesystem is mostly used for built-in types.