How to prepare data before uploading

The CLARIN:EL infrastructure allows data uploading for two reasons:

  • to deposit the content of a resource (see section I), or

  • to use the data as input to a processing service (see section II).

Attention

The two situations differ in both the users involved and the requirements for the data. In both cases, signing in is a prerequisite. If you wish to upload data and you are not yet a CLARIN:EL user, you must first register.

I. Depositing data as content of a resource

1. TYPES

Not all resources in the CLARIN:EL infrastructure have content files (see for example metaresources and resources only “for info”) but they all have metadata descriptions. When they do have content, it varies depending on the resource type.

  • Corpora are collections of:

    • primary data of various media:

      • digital/digitised written texts (e.g. digitised books, web texts, newspapers, corpora etc.)

      • recordings of spoken language (e.g. interviews, radio broadcasts etc.)

      • video recordings (e.g. TV shows, facial expressions collections, gestures etc.)

      • images (e.g. digital/digitised photographs with their captions etc.)

    or

    • processed data

      • various types of annotations of texts, sound and multimedia data, automatically or manually created (e.g. morphosyntactically annotated texts, transcriptions of spoken data, video annotations etc.)

  • Lexical/conceptual resources and language descriptions are:

    • structured language data (e.g. word lists, lexica, thesauri, grammars etc.) used for processing and study of primary and processed data.

  • Tools are:

    • source code, or

    • software

of programs/applications performing various types of language processing (e.g. multilingual text alignment, morphological annotation, lemmatisation, parsing, knowledge extraction etc.).

2. USAGE

Uploading data to CLARIN:EL does not automatically guarantee that they are directly accessible to CLARIN:EL users or that they can be processed by the CLARIN:EL tools and services. Two factors must be taken into consideration: accessibility, which concerns all resources, and processability, which applies only to corpora (processable) and tools (processing services). To prepare the data in the most appropriate way, you should answer the following questions beforehand.

Important

  • Do I want my data to be accessible to the CLARIN:EL users?

    • If the answer is yes, please see the respective section before reading the following instructions.

  • Do I want my corpus data to be processable by the CLARIN:EL services?

    • If the answer is yes, please check the necessary metadata values along with all the other instructions.

  • Do I want my tool data to be converted into a CLARIN:EL processing service?

The following instructions are divided into two sections: general instructions apply to all types of resources while specific instructions apply only to corpora and tools, as indicated.

3. STEPS TO FOLLOW

3.1. General instructions

There are several legal documents which you need to consult before proceeding. Make sure you have read carefully the CLARIN:EL

as uploading data to the infrastructure entails that you have agreed to the aforementioned legal documents.

If you are affiliated with an organization that is a member of CLARIN:EL, make sure to contact your Scientific Responsible before depositing your data.

If you are not affiliated with an organization that is a member of CLARIN:EL, you need to sign the depositor’s agreement before depositing your data.

Also ensure that the data you provide have clear licence terms and that permission has been received from all right-holders involved. If the data have more than one distribution, you will need to indicate the licence terms for each one of them. In addition, the data can be made available under multiple licence terms depending on the type of user or the intended use (academic vs commercial).

Then you can proceed with the three stages of data preparation: collection, categorization and compression.

Step 1: Collection

Collect data around a specific idea (e.g. a glossary of feminist theory). Collect all and only the necessary data. If personal, sensitive or confidential data are included, please anonymize them or remove them before uploading.

Step 2: Categorization

Collected data may be the result of various processing stages: video recordings which have been transcribed, PDFs which have been cleaned (images and URLs removed) and converted to TXT files. In such cases, the raw and converted data form a unit involving multiple formats, media and languages which you might not want to break up. To keep them together and present everything in a single metadata record, you must organize your data in the most structured and easy-to-understand way. By grouping them coherently, you will not only help other users but also make the data compatible with the infrastructure services and workflows. The following guidelines aim to help you do so in such a way that no information is lost and the text part of your (corpus) data is processable.

Attention

These guidelines do not address categorization based on domain, time/geographic coverage etc.

Multiple formats

If the data are in various formats (e.g. XML, TXT, PDF, etc.), organize the files according to their format. Group all files of the same format in one folder (e.g. all XML files together). You can upload two different datasets (e.g. XML vs TXT) on the same metadata record by associating each one of them with a separate distribution.

Tip

See the list of recommended file formats for the CLARIN:EL infrastructure.
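The grouping by format described above can be sketched in a few lines of Python. This is only an illustration: the function name `group_by_format` and the catch-all folder name `no_extension` are assumptions made for the example, not part of the CLARIN:EL infrastructure.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_format(filenames):
    """Map each file extension (lowercased, without the dot) to its files."""
    groups = defaultdict(list)
    for name in filenames:
        suffix = PurePath(name).suffix.lower().lstrip(".")
        # Files without an extension go to a catch-all folder (hypothetical name).
        groups[suffix or "no_extension"].append(name)
    return {fmt: sorted(files) for fmt, files in groups.items()}


# Each key is a candidate folder name; each value lists the files to put in it.
files = ["news1_el.txt", "news1_el.xml", "news2_el.TXT", "report.pdf"]
for folder, members in sorted(group_by_format(files).items()):
    print(folder, members)
```

Each resulting group can then be placed in its own folder and associated with a separate distribution.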

Multiple media

If the data have various media parts (e.g. text, audio, etc.), organize the files according to the medium. Group all files of the same medium in one folder (e.g. all text files together in one folder and all audio files in another). You can upload two different datasets (e.g. audio recordings and transcripts) on the same metadata record by associating each one of them with a separate distribution.

Naming files and folders

Name both the files and the folders in a way that meaningfully and consistently reflects their content. Use the Latin alphabet and leave no spaces between the words. If you have files in various formats, media and/or languages, label them accordingly (e.g. news1_el.txt, news1_en.txt).
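A quick way to check names before uploading is a small validation function. The sketch below assumes that "Latin alphabet, no spaces" can be read as ASCII letters, digits, underscore, hyphen and dot; that reading is an assumption made for the example, and the infrastructure may accept a different character set.

```python
import re

# Assumption: allowed characters are ASCII letters, digits, ".", "_" and "-".
_VALID_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_name(name):
    """Return True if the whole name uses only the allowed characters."""
    return _VALID_NAME.fullmatch(name) is not None


for candidate in ["news1_el.txt", "news 1.txt", "ειδήσεις.txt"]:
    print(candidate, is_valid_name(candidate))
```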

Important

Any relevant documentation (e.g. manuals, questionnaires, codebooks, project reports, etc.) should be directly described and uploaded to the respective field in the metadata editor 1. Nevertheless, if you wish to include any documentation in the data folder, create a separate file and name it “README” (in TXT or PDF format). This file should contain all the necessary information on the methods used for collecting/generating the data and explanations about the structure, the naming of the files or any other kind of information that can help the user.

Consistency

The metadata used to describe your data should clearly reflect them. Make sure there are no inconsistencies (e.g. check that your files are indeed in PDF format and not just scanned images; if you provide information on an annotated corpus, indicate the annotation tool etc.) to avoid any problems. Check here the mandatory metadata for all resource types but also keep in mind that an LRT description is more complete if the recommended metadata are provided as well.

Step 3: Compression

The content files must be in a compressed folder in one of the following formats: .zip, .tgz, .gz, .tar. When naming the folder, you must use the Latin alphabet and leave no spaces between the words.

Attention

Do not compress the embedded files/folders since this makes it impossible for the CLARIN:EL services to handle them (i.e. do not include .zip files within a .zip file).
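As an illustration of the compression step, the following sketch builds a .zip archive from a data folder with Python's standard `zipfile` module and refuses to proceed if the folder already contains an archive. The function name `build_archive` and the folder names in the demo are placeholders chosen for the example.

```python
import tempfile
import zipfile
from pathlib import Path

# Extensions treated as archives, taken from the list of accepted formats.
ARCHIVE_SUFFIXES = (".zip", ".tgz", ".gz", ".tar")


def build_archive(source_dir, archive_path):
    """Zip every file under source_dir; raise if a nested archive is found."""
    source_dir = Path(source_dir)
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    nested = [p.name for p in files if p.name.lower().endswith(ARCHIVE_SUFFIXES)]
    if nested:
        raise ValueError(f"nested archives found: {nested}")
    members = [p.relative_to(source_dir).as_posix() for p in files]
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, member in zip(files, members):
            archive.write(path, member)
    return members


# Demo on a throwaway folder; the names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp, "my_corpus")
    (corpus / "txt").mkdir(parents=True)
    (corpus / "txt" / "news1_el.txt").write_text("καλημέρα", encoding="utf-8")
    print(build_archive(corpus, Path(tmp, "my_corpus.zip")))
```

Writing the archive outside the source folder, as in the demo, avoids any risk of the archive including itself.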

3.2. Specific instructions

Corpora

In order to become processable, a corpus must have the features described below:

  • multilinguality:

    • for monolingual corpora, the language must be Greek, English, German or Portuguese (currently these are the languages supported by the services),

    • for bilingual corpora, one language of the pair must be Greek and the other English, German or Portuguese.

  • medium: Text

  • format:

    • for monolingual corpora the formats are Plain Text and XCES,

    • for bilingual corpora the formats are TMX and MOSES.

  • encoding: UTF-8

  • size: < 60Mb

  • licence: a Creative Commons (CC) licence, from Creative Commons Zero (CC-0) through all possible combinations of the CC conditions on rights of use. See also the Recommended licensing scheme for Language Resources.

Corpora with these features are compatible with the workflows of the infrastructure and are indicated as processable. The processable corpora are grouped together as a subset of the total list in the inventory home page.
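Two of the features listed above, encoding and size, can be checked locally before uploading. The sketch below is a minimal example under stated assumptions: it reads the "60Mb" limit as 60 × 1024 × 1024 bytes applied to the total uncompressed size of the files, which the text does not specify, and the function name `check_processable` is made up for the example.

```python
import tempfile
from pathlib import Path

# Assumption: "60Mb" is read as 60 * 1024 * 1024 bytes of uncompressed data.
MAX_CORPUS_BYTES = 60 * 1024 * 1024


def check_processable(folder, max_bytes=MAX_CORPUS_BYTES):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    total = 0
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        total += len(data)
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            problems.append(f"{path.name}: not valid UTF-8")
    if total >= max_bytes:
        problems.append(f"total size {total} bytes is not under {max_bytes}")
    return problems


# Demo: one UTF-8 file and one file saved in a legacy Greek encoding.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "news1_el.txt").write_text("καλημέρα", encoding="utf-8")
    Path(tmp, "news2_el.txt").write_bytes("καλημέρα".encode("iso-8859-7"))
    print(check_processable(tmp))
```

The same function can be reused with a smaller `max_bytes` value for other size limits. Language, format and licence still have to be verified by hand against the metadata.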

Tools

If you would like to integrate a tool into the CLARIN:EL infrastructure as a compatible service, please indicate your choice when creating the resource and contact the CLARIN:EL technical team.

4. UPLOAD

When you have finished you can upload the data.

Attention

This action is available only to signed in curators.

As a curator you are provided with two options for uploading:

Once the upload is complete, you must associate the data with a distribution, i.e. the form or delivery channel through which the data are distributed, described here.

You can repeat the procedure (data upload, then association with a distribution) as many times as you need to, so that different sets of data are associated with different distributions. This functionality covers not only the various ways in which the same data are distributed (e.g. a CD-ROM, a link from which a dataset can be downloaded, etc.) but also the various data formats or media (e.g. PDF vs TXT, audio files vs transcripts, etc.), which can be treated separately.

Tip

If you encounter any problem during uploading, please contact the Technical helpdesk.

II. Data as input of a service

Attention

This action is available to all signed in users.

Neither the data uploaded for processing nor the data resulting from the processing are stored permanently in the infrastructure; the CLARIN:EL policy is to delete the annotated data 48 hours after processing has been completed.

Tip

If you want to, you can create a metadata record, where you can upload data, either by using the editor or by uploading an XML file. Keep in mind that you must be signed in.

CLARIN:EL services accept as input small datasets with the following features:

  • multilinguality: monolingual corpora in Greek, English, German or Portuguese,

  • medium: Text

  • format: Plain Text

  • encoding: UTF-8

  • size: < 2Mb

In addition, the data must be in a compressed folder in one of the following formats: .zip, .tgz, .gz, .tar. When naming the folder, you must use the Latin alphabet and leave no spaces between the words.

Attention

Do not compress the embedded files/folders since this makes it impossible for the CLARIN:EL services to handle them (i.e. do not include .zip files within a .zip file).

To find out more about processing, check:

  1. how to access a service,

  2. how to access a workflow.

1

Henceforth editor.