9 Musical Corpora

The term corpus may refer to any collection of musical data, although it is often reserved for collections which have been organized or collected with a particular end in view, generally to illustrate a particular characteristic of, or to demonstrate the variety found in, a group of related texts. The principal distinguishing characteristic of a corpus is that its components have been selected or structured according to some conscious set of design criteria.

In MEI, a corpus is regarded as a composite text because, although each discrete document in a corpus clearly has a claim to be considered as a text in its own right, it is also regarded as a subdivision of some larger object, if only for convenience of analysis. In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Corpora share a number of characteristics with other types of composite texts, including anthologies and collections. Most notably, different components of composite texts may exhibit different structural properties, thus potentially requiring elements from different MEI modules.

Aside from these high-level structural differences, and possibly differences of scale, the encoding of language corpora and the encoding of individual texts present identical sets of problems. Therefore, any of the encoding techniques and elements presented in other chapters of these Guidelines may therefore prove relevant to some aspect of corpus encoding and may be used in corpora.

9.1 Corpus Module Overview

The meiCorpus module defines a single element:

  • meiCorpus(MEI corpus) – A group of related MEI documents, consisting of a header for the group, and one or more <mei> elements, each with its own complete header.

The meiCorpus element is intended for the encoding of corpora, though it may also be useful in encoding any collection of disparate materials. The individual samples in the corpus are encoded as separate mei elements, and the entire corpus is enclosed in an meiCorpus element. Each sample has the usual structure for a mei document, comprising an meiHead followed by a music element. The corpus, too, has a corpus-level meiHead element, in which the corpus as a whole, and encoding practices common to multiple samples may be described. The overall structure of an MEI-conformant corpus is thus:

<meiCorpus>
<meiHead type= "corpus">
<!-- metadata for the corpus -->
</meiHead>
<mei>
<meiHead type= "text">
<!-- metadata for sample 1 -->
</meiHead>
<music>
<!-- the encoding of sample 1 -->
</music>
</mei>
<mei>
<meiHead type= "text">
<!-- metadata for sample 2 -->
</meiHead>
<music>
<!-- the encoding of sample 2 -->
</music>
</mei>
</meiCorpus>

This two-level structure allows for metadata to be specified at the corpus level, at the individual text level, or at both. However, metadata which relates to the whole corpus rather than to its individual components should be removed from the individual component metadata and included only in the meiHead element prefixed to the whole.

In some cases, the design of a corpus is reflected in its internal structure. For example, a corpus of musical incipits might be arranged to combine all compositions of one type (symphonies, songs, chamber music, etc.) into some higher-level grouping, possibly with sub-groups for date of publication, instrumentation, key, etc. The meiCorpus element provides no support for reflecting such internal structure in the markup: it treats the corpus as an undifferentiated series of components, each tagged with an mei element.

If it is essential to reflect the organization of a corpus into sub-components, then the members of the corpus should be encoded as composite texts instead, using the group element described section 1.1.2 Music Element. The mechanisms for corpus characterization described in this chapter, however, are designed to reduce the need to do this. Useful groupings of components may easily be expressed using the classification and identification elements described in section 2.3.12 Classification, and those for associating declarations with corpus components described in section 2.1.7.1 Associating Metadata and Data. These mechanisms also allow several different methods of text grouping to co-exist, each to be used as needed at different times. This helps minimize the danger of cross-classification and mis-classification of samples, and helps improve the flexibility with which parts of a corpus may be characterized for different applications.

All composite texts share the characteristic that their different component texts may be of structurally similar or dissimilar types. If all component texts may all be encoded using the same module, then no problem arises. If however they require different modules, then the various modules must all be included in the schema.

9.2 Combining Corpus and Text Headers

An MEI-conformant document may have more than one header only in the case of a TEI corpus, which must have a header in its own right, as well as the obligatory header for each text. Every element specified in a corpus-header is understood as if it appeared within every text header in the corpus. An element specified in a text header but not in the corpus header supplements the specification for that text alone. If any element is specified in both corpus and text headers, the corpus header element is over-ridden for that text alone.

The titleStmt for a corpus text is understood to be prefixed by the titleStmt given in the corpus header. All other optional elements of the fileDesc should be omitted from an individual corpus text header unless they differ from those specified in the corpus header. All other header elements behave identically, in the manner documented in chapter 2 The MEI Header. This makes it possible to state information which is common to the whole of the corpus in the corpus header, while still allowing for individual texts to vary from this common metadata.

For example, the following markup shows the structure of a corpus consisting of three texts, the first and last of which share the same encoding description. The second one has its own encoding description.

<meiCorpus>
<meiHead>
<fileDesc>
<!-- corpus file description-->
</fileDesc>
<encodingDesc>
<!-- default encoding description -->
</encodingDesc>
<revisionDesc>
<!-- corpus revision description -->
</revisionDesc>
</meiHead>
<mei>
<meiHead>
<fileDesc>
<!-- file description for this corpus text -->
</fileDesc>
</meiHead>
<music>
<!-- first corpus text -->
</music>
</mei>
<mei>
<meiHead>
<fileDesc>
<!-- file description for this corpus text -->
</fileDesc>
<encodingDesc>
<!-- encoding description for this corpus text, over-riding the default -->
</encodingDesc>
</meiHead>
<music>
<!-- second corpus text -->
</music>
</mei>
<mei>
<meiHead>
<fileDesc>
<!-- file description for third corpus text -->
</fileDesc>
</meiHead>
<music>
<!-- third corpus text -->
</music>
</mei>
</meiCorpus>

9.3 Recommendations for the Encoding of Large Corpora

These Guidelines include proposals for the identification and encoding of a far greater variety of textual features and characteristics than is likely to be either feasible or desirable in any one corpus, however large and ambitious. For most large-scale corpus projects, it will therefore be necessary to determine a subset of recommended elements appropriate to the anticipated needs of the project ; these mechanisms include the ability to exclude selected element types, add new element types, and change the names of existing elements.

Because of the high cost of identifying and encoding many textual features, and the difficulty in ensuring consistent practice across very large corpora, encoders may find it convenient to divide the set of elements to be encoded into the following four categories:

required
- texts included within the corpus will always encode textual features in this category, should they exist in the text
recommended
- textual features in this category will be encoded wherever economically and practically feasible; where present but not encoded, a note in the header should be made.
optional
- textual features in this category may or may not be encoded; no conclusion about the absence of such features can be inferred from the absence of the corresponding element in a given text.
proscribed
- textual features in this category are deliberately not encoded; they may be transcribed as unmarked up text, or represented as gap elements, or silently omitted, as appropriate.