This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Jakob Vogel, M.A. Digital Humanities, Institute for Digital Humanities, Faculty of Philosophy, Georg August University of Göttingen.
4. Annotation guidelines
Annotators read each article three times, focusing on a different annotation task in each pass. In the first pass, only read the text to get an overview of it; do not make any annotations yet. In the second pass, mark mentions with identity-relations, assign an entity to them, and link them to Wikidata. In the third pass, annotate near-identity and bridging relations between mentions.
4.1. First pass: get familiar with the text
Read the entire text carefully. Already pay attention to which entities are mentioned, but do not annotate them yet.
4.2. Second pass: annotate mentions with identity-relations
Read the text for a second time. Identify potential coreference candidates. Wherever a referent is referred to by at least two identical mentions, annotate these and all subsequent mentions of that referent. Do this as follows:
• First check if a candidate is markable:
– In general, only noun phrases (NPs) are markable. This includes nominal phrases (”the president”), proper names (”Mr. Biden”), and quantifier phrases (”all member states”).
– For reasons of efficiency, most pronominal NPs are excluded from annotation because they normally carry little variation with regard to how they are labelled (Zhukova et al., 2021). However, certain types of pronouns can be included not as the head, but as a modifier of another NP, e.g. demonstrative pronouns (”this man”) and reflexive pronouns (”the president himself”).
– Numbers like currency expressions (”€2.3 billion”) and percentages (”19% of the votes”) are included, but dates of any kind (”January 23”, ”1996”, ”this Sunday”) are excluded for now.
– Given coreferential conjunctions that mention several entities at once and, syntactically, cannot be split (”North and South Korea”), first mark everything that could be extracted as a single-entity mention separately (possible for ”South Korea”, but not for ”North”), then mark the entire conjunction. Use a MER-relation to connect mentioned entities with the conjunction (see description of the MER-relation in subsection 4.3).
• Then check if the candidate you want to annotate is truly identical to other mentions of the same referent. To do so, compare it to the referent’s most recent previous mention. In case no mention of the referent has been annotated so far, simply compare the two candidates triggering the annotation:
– Identity between two mentions means that both refer to the same entity in almost the same way. In comparison to the first mention, the second one may provide additional information about the referent or only highlight a subset of its attributes, but new and old attributes may not contradict each other (Recasens et al., 2010).
– When in doubt, ignore all modifiers and focus on the heads of both mentions to check if they are identical.
• If the candidate is markable and identical to previous mentions, start your annotation. First, mark the mention:
– We annotate mentions with a maximum span style. This means that for each candidate, the NP’s head and all of its pre- and post-modifiers are included in the annotation. More precisely, this includes articles (”a”, ”the”), adjectives (”a worried president”), other NPs (”US president Joe Biden”), appositives (”Joe Biden, president of the United States”), prepositional phrases (”demonstrators in front of the White House”), and relative clauses (”Biden, who was elected president in 2020”) (Hirschman and Chinchor, 1998). Any punctuation or white space at the very beginning or end of the span is excluded.
– In addition to maximum span style, we annotate with nested style, meaning a mention’s span may overlap with or contain another mention. But remember not to mark every mention you discover, only those that actually participate in coreference!
• After selecting the correct span, assign an entity-type to a marked mention by choosing from the layer’s respective drop-down list. We distinguish between the following entity-types: PER, ORG, GRP, GPE, LOC, OBJ.
– Person (PER): an individual actor.
– Organization (ORG): an official organization that is not government-related, e.g. ”the WHO”, ”Fox News”, ”the opposition”.
– Group (GRP): a group of individuals acting collectively or sharing the same properties, e.g. ”demonstrators”, ”unemployed beneficiaries”, ”the two leaders”.
– Geo-political entity (GPE): a state, country, province etc. that comprises a government, a population, a physical location, and a nation (Linguistic Data Consortium, 2008). This includes clusters of GPEs, e.g. ”Eastern Europe” or ”the Arab League”. Governmental organizations or locations that represent an entire GPE are also marked as GPE, e.g. ”the US government”, ”US officials”, ”the Biden administration”, ”Washington”, ”the White House”.
– Location (LOC): a physical location that is not a GPE, e.g. ”Los Angeles”. This includes mentions like ”Germany” or ”the White House” when referred to not in a political way, but with a focus on its geographic, cultural, architectural and other locality attributes. Be aware that two mentions with the same textual representation but different entity-types are not to be marked as identical! Instead, most of such cases would imply a MET-relation.
– Object (OBJ): an object or other concept that is mentioned, e.g. ”Biden’s hands”, ”a submarine”, ”the results”. However, objects are static concepts. Do not confuse them with NPs that express events or other changes of state (”election”, ”negotiations”, ”Biden’s statement”) which we do not annotate!
• Now it is time to assign the mention to an entity cluster. With this step, you create or extend a local coreference chain. At the same time, you link it with corresponding discourse entities across documents and globally with its actual referent.
– In case that, in the present document, you already have annotated previous mentions of the same entity, you will also already have created a local coreference cluster. The cluster will already be linked to a global discourse entity and to a referent. To assign the current mention to that cluster, select the global entity’s name from the respective drop-down list. The Wikidata field can be left empty.[3]
– If, on the other hand, no previous mentions have been annotated, you are faced with two identical mentions from which you want to create a new local cluster. To do this, first fill in the fields of the first mention.
∗ Begin with the Wikidata field and type in the referent’s name. Inception now looks for a suitable Wikidata entry and displays a drop-down list with the search results. Select the correct entry from that list. To improve search results, try to look for the entity’s most neutral name, ignoring articles. Sometimes it is easier to look for the entry on the Wikidata website itself and then copy its name into the field. If no Wikidata entry exists, leave the field empty.
∗ Assuming you have found a Wikidata entry, copy the text displayed in the Wikidata field into the Global entity-name field. By doing this, the name will automatically be added to the underlying tag set, meaning you will be able to select it from the drop-down list in subsequent annotations. However, if you have not found a Wikidata entry, copy the mention’s text, again with maximum span style, into the Global entity-name field. Use this text as the name for any following coreferential mentions. If the name has already been used for a semantically different entity in another document, add the document ID to the new name.[4]
– Now turn to the second mention and annotate it based on the previous one. That is, assign the Global entity-name while leaving the Wikidata field empty.
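The bookkeeping of the second pass can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the annotation tool: the `Mention` class, the functions `trim_span` and `annotate_cluster`, the set of characters treated as boundary punctuation, and the placeholder Wikidata value are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Entity-types distinguished by the guidelines.
ENTITY_TYPES = {"PER", "ORG", "GRP", "GPE", "LOC", "OBJ"}

# Assumption: characters treated as boundary punctuation or white space.
BOUNDARY_CHARS = set(" \t\n.,;:!?\"'()[]”“’‘")


@dataclass
class Mention:
    begin: int                    # character offset, inclusive
    end: int                      # character offset, exclusive
    entity_type: str
    global_entity_name: str = ""
    wikidata: str = ""            # only filled in for the first mention of a cluster


def trim_span(text: str, begin: int, end: int) -> tuple[int, int]:
    """Shrink [begin, end) so that it neither starts nor ends with
    punctuation or white space (maximum span style keeps everything else)."""
    while begin < end and text[begin] in BOUNDARY_CHARS:
        begin += 1
    while end > begin and text[end - 1] in BOUNDARY_CHARS:
        end -= 1
    return begin, end


def annotate_cluster(mentions: list[Mention], name: str, wikidata: str = "") -> list[Mention]:
    """Create a new local cluster from at least two identical mentions.
    Every mention receives the Global entity-name; only the first one
    carries the Wikidata link."""
    if len(mentions) < 2:
        raise ValueError("a cluster needs at least two identical mentions")
    types = {m.entity_type for m in mentions}
    if len(types) != 1 or not types <= ENTITY_TYPES:
        raise ValueError("identical mentions must share one valid entity-type")
    for i, mention in enumerate(mentions):
        mention.global_entity_name = name
        mention.wikidata = wikidata if i == 0 else ""
    return mentions


text = "They saw US president Joe Biden, then left. Biden waved."
b, e = trim_span(text, text.index(" US"), text.index(" then"))
first = Mention(b, e, "PER")
second = Mention(text.index("Biden waved"), text.index(" waved"), "PER")
# "wikidata:EXAMPLE" is a placeholder, not a real identifier.
cluster = annotate_cluster([first, second], "Joe Biden", wikidata="wikidata:EXAMPLE")
```

The type check mirrors the rule that two mentions with the same text but different entity-types are not to be marked as identical.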
4.3. Third pass: annotate mentions with different relations
Read the text for a third time. Wherever you see two mentions connected through a near-identity relation, make a respective annotation:
• For every new mention that has not been marked in the second pass already, check if it is markable and annotate it with its correct span and entity-type as described above. However, leave the Global entity-name and Wikidata field empty.
• When both mentions are marked with the correct span and entity-type, connect them with one of the following near-identity relation-types: MET, MER, CLS, STF, DEC, BRD (Recasens et al., 2010; Spala et al., 2019; Clark and Bangerter, 2004; Nedoluzhko et al., 2009).
– Metonymy (MET): In a MET-relation, in comparison to its antecedent, an anaphor highlights different facets of an entity. This includes facets like:
∗ a certain role or function performed by an entity. Consider example (5).
(5) ”Although Biden is head of the Democrats, he is also president of all Americans.”
Assuming ”Biden” has already been annotated as part of a respective cluster in the second pass, ”head of the Democrats” and ”president of all Americans” would now be connected to ”Biden” with a MET-relation. However, in this example, it is the juxtaposition of both roles in particular that makes this a case of metonymy. In a more regular context, naming one of these roles alone could be annotated in the second pass as identical mention, instead.
∗ a location’s name to refer to an associated entity, e.g. ”Washington” as metonym for ”the US government”, ”China” for ”the Chinese government”, ”Silicon Valley” for ”the Tech industry”.
∗ an organization’s name to refer to an associated place, e.g. a bank’s name like ”ECB” to refer to the building that contains that bank’s headquarters.
∗ different forms of realization of the same piece of information, like in example (6), where the same content is manifested once as audible speech and once as written text.
(6) ”Though it is questionable whether he had actually written the piece himself, Macron gave a truly brilliant speech this afternoon.”
∗ representation, where one mention is a picture or other representation of an entity, as already seen in example (4).
(4) “The AfD is circulating a photo of Angela Merkel with a Hijab, although Merkel never wore Muslim clothes.”
∗ other facets, since this is no exhaustive list and metonymy is a dynamic phenomenon.
∗ given two ID-clusters that are metonymous to each other (e.g. several mentions of ”the US president” and several mentions of ”the White House” which often participate in metonymy together), do not connect every single mention of the latter to a mention of the former, but only do this for the latter’s first truly coreferential mention.
– Meronymy (MER): A MER-relation between two mentions indicates that:
∗ one mention is a constituent part of the other in whatever direction, as in example (7).
(7) ”President Biden expressed his concern about the ongoing ... ’The US government will not ...’, he stated.”
∗ one mention refers to an object which is made of the stuff which the other mention refers to.
(8) ”The duty on tobacco has risen once again, making cigarettes as expensive as never before.”
∗ both mentions refer to overlapping sets.
(9) ”AfD supporters demonstrated in front of the Reichstag this morning. Among the crowd was ...”
∗ finally, a MER-relation can be used to specify entities mentioned in syntactically non-dividable conjunctions. Given such a conjunction, as ”North and South Korea” in example (10), mark ”South Korea” separately as it can be treated as an independent noun phrase. The adjective phrase ”North”, however, cannot be marked. Instead, mark the entire conjunction and connect ”South Korea” to it with a MER-relation. Do the same for the first full mention of ”North Korea” that follows in the text. If none follows, use a previous mention or, if there is none, ignore the ”North” mention.
(10) ”North and South Korea have resumed negotiations ... North Korea seems ...”
– Class (CLS): a CLS-relation indicates an ’is-a’ connection between two mentions. One mention thus belongs to a sub- or superclass of another.
(11) ”In a way, Trump only seized the opportunity. This is what skilled politicians do.”
– Spatio-temporal function (STF): a mention refers to an entity that deviates in place, time (3), number, or person (12).
(3) ”Even if the young Erdogan used to be pro-Western, Turkey’s president nowadays often acts against Western interests.”
(12) ”A historic meeting: a pope and a pope shaking hands.”
– Declarative (DEC): where two mentions X and Y are connected through verbal phrases like ”X seems like Y”, ”stated that X was Y”, ”declared X Y”, or other declarations as in (13), they can be connected with a DEC-relation.
(13) ”In his speech, he also spoke about North Korea and called it a fundamentally barbaric nation.”
The DEC-relation thus includes definitions and descriptions of entities. This is especially the case when declarative clauses are used within quotes. However, when value-free declarative clauses like ”X is Y” are used as quasi-objective specifications of an entity, they might indicate an identity relation, instead. The same structure might be used to assign a super-class to the entity, making it a CLS-relation.
– Bridging (BRD): for reasons of simplicity, we have included BRD in our subsumption of different relation-types under the term of near-identity. Despite that, BRD is actually a separate phenomenon from both identity and near-identity. BRD connects two entities that are mostly independent of each other while, nonetheless, the existence of one can be inferred from the existence of the other (Clark and Haviland, 1977). Technically, the BRD-relation could be used to mark all sorts of ontological connections between entities. This is not the purpose of this annotation scheme, though. Instead, we use BRD only where the mention of one entity influences the depiction of an associated entity or where one entity is modified by a possessive pronoun that refers to another entity. Example (14) illustrates both use cases:
(14) ”Unlike Queen Elizabeth, Charles has not been shy about promoting his political views.”
Here, the NP ”his political views” contains a modifying possessive pronoun, which is why it is to be annotated as bridging to ”Charles”. Additionally, the mention ”Charles” can only be interpreted correctly as referring to Charles III (and not any other Charles) by its juxtaposition with the NP ”Queen Elizabeth”. Hence ”Charles” is to be annotated as bridging to ”Queen Elizabeth”.
• Deciding on what relation-type to choose can be difficult. When in doubt, follow these general guidelines:
– use an identity relation rather than a near-identity relation (especially DEC).
– when having to choose between near-identity relations, use MET rather than MER.
– use MER rather than CLS.
– use CLS rather than DEC.
– use any near-identity relation that is not BRD rather than BRD.
• When annotating near-identity and bridging, always connect an anaphoric mention to the nearest possible antecedent. But remember that antecedents normally appear before an anaphor. Only if necessary may you connect a mention to a subsequent expression (making their relation cataphoric).
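The tie-breaking rules and the nearest-antecedent rule above can be read as a small decision procedure. The following Python sketch is one possible reading, not an implementation used by the authors; the function names and the label ”ID” for the identity relation are assumptions made for this example. Note that the guidelines do not rank STF against MET, MER, CLS, or DEC, so the sketch leaves such pairs undecided instead of inventing an order.

```python
# Order stated in the "when in doubt" rules: identity, then MET, MER, CLS, DEC.
CHAIN = ["ID", "MET", "MER", "CLS", "DEC"]


def dominates(a: str, b: str) -> bool:
    """True if the guidelines prefer relation-type a over relation-type b."""
    if a == b or a == "BRD":
        return False
    if b == "BRD":
        return True      # any relation that is not BRD is preferred to BRD
    if a == "ID":
        return True      # identity is preferred to every near-identity relation
    if a in CHAIN and b in CHAIN:
        return CHAIN.index(a) < CHAIN.index(b)
    return False         # STF is not ranked against MET, MER, CLS, or DEC


def preferred(candidates: set[str]) -> set[str]:
    """Return the candidate relation-types that no other candidate dominates."""
    return {c for c in candidates if not any(dominates(o, c) for o in candidates)}


def nearest_antecedent(anaphor_start: int, candidate_starts: list[int]):
    """Pick the closest preceding candidate; fall back to the closest
    following one (a cataphoric link) only if nothing precedes."""
    preceding = [c for c in candidate_starts if c < anaphor_start]
    if preceding:
        return max(preceding)
    following = [c for c in candidate_starts if c > anaphor_start]
    return min(following) if following else None
```

When `preferred` returns more than one relation-type, the rules give no further guidance and the annotator has to decide from context.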
[3] This is to save time. As the cluster will already be linked, assigning a Wikidata entry to every additional mention would be redundant work.
[4] The following example illustrates this: let us assume you have annotated several mentions with the name ”demonstrators” in a previous document. Now, while annotating document ”0_L”, you are faced with an entity that would also have to be given the Global entity-name ”demonstrators”, although it refers to a semantically different group of people. In this case, do not change your annotations of the previous document, but do use the Global entity-name ”demonstrators0_L” in the current document.
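The naming rule from footnote [4] can be sketched as follows. The function `unique_entity_name` and its parameters are hypothetical; the sketch assumes the annotator already knows which names have been given to semantically different entities in other documents.

```python
def unique_entity_name(name: str, doc_id: str, names_of_other_entities: set[str]) -> str:
    """Append the document ID when the name is already taken by a
    semantically different entity in another document."""
    if name in names_of_other_entities:
        return f"{name}{doc_id}"
    return name


# "demonstrators" already names a different group in a previous document.
new_name = unique_entity_name("demonstrators", "0_L", {"demonstrators"})
```

Earlier annotations stay untouched; only the name in the current document changes.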