"where P(A) is the proportion of time that the coders agree and P(E) is the proportion of times that we would expect them to agree by chance." ( Carletta 1996 : 4).

There is no doubt that annotation tends to be highly labour-intensive and time-consuming to carry out well. This is why it is appropriate to admit, as a final observation, that 'best practice' in corpus annotation is something we should all strive for — but which perhaps few of us will achieve.

9. Getting down to the practical task of annotation

To conclude, it is useful to say something about the practicalities of corpus annotation. Assume, say, that you have a text or a corpus you want to work on, and want to 'get the tags into the text'.

  • It is not necessary to have special software. You can annotate the text using a general-purpose text editor or word processor. But this means the job has to be done by hand, which risks being slow and prone to error.
  • For some purposes, particularly if the corpus is large and is to be made available for general use, it is important to have the annotation validated. That is, the vocabulary of annotation is controlled and is allowed to occur only in syntactically valid ways. A validating tool can be written from scratch, or can use macros for word processors or editors.
  • If you decide to use XML-compliant annotation, you have the option of using the increasingly available XML editors. An XML editor, in conjunction with a DTD or schema, can do the job of enforcing well-formedness or validity without any programming of the software (a minimal validation sketch follows this list), although a high degree of expertise with XML will come in useful.
  • Special tagging software has been developed for large projects — for example the CLAWS tagger and Template Tagger used for the Brown family of corpora and the BNC. Such programs or packages can be licensed for your own annotation work. (For CLAWS, see the UCREL website http://www.comp.lancs.ac.uk/ucrel/.)
  • There are tagsets which come with specific software — e.g. the C5, C7 and C8 tagsets for CLAWS, and CHAT for the CHILDES system, which is the de facto standard for language acquisition data.
  • There are more general architectures for handling texts, language data, and software systems for building and annotating corpora. The most prominent example of this is GATE ('general architecture for text engineering', http://gate.ac.uk), developed at the University of Sheffield.
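Returning to the XML option above, the following is a minimal sketch of what DTD-driven validation looks like in practice. It assumes the third-party Python library lxml, and the annotation scheme (a <w> element with a pos attribute) is invented for illustration rather than taken from any standard:

```python
# A minimal sketch of validated annotation: a tiny DTD constrains which
# elements and attribute values may occur, so the annotation vocabulary is
# controlled. Requires the third-party lxml library (pip install lxml);
# the <w pos="..."> scheme is invented for illustration.
from io import StringIO
from lxml import etree

dtd = etree.DTD(StringIO("""
<!ELEMENT s (w+)>
<!ELEMENT w (#PCDATA)>
<!ATTLIST w pos (PRON|VERB|NOUN) #REQUIRED>
"""))

good = etree.fromstring('<s><w pos="PRON">I</w><w pos="VERB">heard</w></s>')
bad = etree.fromstring('<s><w pos="ADJ">old</w></s>')

print(dtd.validate(good))   # True: structure and tag vocabulary are valid
print(dtd.validate(bad))    # False: "ADJ" is not in the controlled tagset
```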



Corpus Linguistics: Method, Theory and Practice, by Tony McEnery and Andrew Hardie; published by Cambridge University Press, 2012



Annotated versus unannotated corpora

[Figure: the tree diagram – a commonplace of (corpus) linguistics!]

What is corpus annotation?

Linguistic analyses encoded in the corpus data itself are usually called corpus annotation. For example, we may wish to annotate a corpus to show parts of speech, assigning to each word a grammatical category label. So when we see the word talk in the sentence I heard John's talk and it was the same old thing, we would assign it the category noun in that context. This would often be done using some mnemonic code or tag such as N.
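If you want to see such tagging in action programmatically, here is a minimal sketch using the NLTK toolkit (an assumption of this illustration: NLTK is not discussed here, and its default tagger assigns Penn Treebank tags such as NN rather than the bare mnemonic N above):

```python
# Minimal part-of-speech tagging sketch using NLTK (illustrative only;
# NLTK's default tagger uses Penn Treebank tags, e.g. NN for a singular
# common noun, rather than the bare mnemonic "N" in the running example).
import nltk

# resource names can vary slightly across NLTK versions
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I heard John's talk and it was the same old thing")
print(nltk.pos_tag(tokens))
# [('I', 'PRP'), ('heard', 'VBD'), ('John', 'NNP'), ("'s", 'POS'),
#  ('talk', 'NN'), ...]   <- "talk" is tagged as a noun in this context
```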

While the phrase corpus annotation may be unfamiliar, the basic operation it describes is not – it is just like the analyses of data that have been done by hand, eye, and pen for decades. For example, in Chomsky (1965), 24 invented sentences are analysed; in the parsed version of LOB, a million words are annotated with parse trees. So corpus annotation is largely the process of recording such analyses in a systematic and accessible form.

Annotating data: how to get started

If you are interested in experimenting with automatic annotation for yourself, there are online systems that will allow you to try this out without having to install any software on your own computer.

You can try out grammatical tagging of a small-to-medium-sized text using the web interface to the CLAWS tagger. This tagger, created by UCREL at Lancaster University, is the software that was used to tag the BNC. It can be set to use either of two tagsets: the standard C7 or the less complex C5.

A more complex form of grammatical annotation is parsing. One easy way to try out parsing is to use the online Stanford Parser. This program does two different types of parsing – dependency parsing and constituency parsing – and is also openly available to download and use on your own computer.
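If you would rather experiment from code than through a web interface, the sketch below produces a dependency parse with the spaCy library (an assumption of this illustration; the text above discusses the Stanford Parser, which has its own interfaces):

```python
# Dependency-parsing sketch using spaCy (illustrative; not the Stanford
# Parser discussed above).
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I heard John's talk.")
for token in doc:
    # each token is linked to its syntactic head by a labelled relation
    print(f"{token.text:<6} --{token.dep_}--> {token.head.text}")
```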

A combination tool that part-of-speech tags text but also dependency-parses and lemmatises it is the Constraint Grammar system. Constraint Grammar-based taggers and parsers for English can be tried out on the web.



The Oxford Handbook of Computational Linguistics


24 Corpus Linguistics

Tony McEnery is Distinguished Professor of English Language and Linguistics at Lancaster University. He is the author of many books and papers on corpus linguistics, including Corpus Linguistics: Method, Theory and Practice (with Andrew Hardie, CUP, 2011). He was founding Director of the ESRC Corpus Approaches to Social Science (CASS) Centre, which was awarded the Queen's Anniversary Prize for its work on corpus linguistics in 2015.

Published: 18 September 2012

Corpus data have emerged as the raw data/benchmark for several NLP applications. A corpus is described as a large body of linguistic evidence composed of attested language use. It may be contrasted with sentences constructed from metalinguistic reflection upon language use, rather than produced as a result of communication in context. Corpora can be both spoken and written, and can be categorized as follows: monolingual, representing one language; comparable, using multiple monolingual corpora to create a comparative framework; and parallel, wherein a corpus in one language is gathered and its data translated into one or more other languages. The choice of corpus depends on the research question or the chosen application. Adding linguistic information can enhance a corpus; this annotation is achieved by analysts, whether human, mechanical, or a combination of the two. The modern computerized corpus has been in vogue only since the late 1940s. Ever since, the volume of corpora has risen steadily, and corpus work has assumed an increasingly multilingual nature.

In this chapter the use of corpora in natural language processing is overviewed. After defining what a corpus is and briefly overviewing the history of corpus linguistics, the chapter focuses on corpus annotation. Following the review of corpus annotation, a brief survey of existing corpora is presented, taking into account the types of corpus annotation present in each corpus. The chapter concludes by considering the use of corpora, both annotated, and unannotated, in a range of natural language processing (NLP) systems.

24.1 Introduction

Corpus data are, for many applications, the raw fuel of NLP, and/or the testbed on which an NLP application is evaluated. In this chapter the history of corpus linguistics is briefly considered. Following on from this, corpus annotation is introduced as a prelude to a discussion of some of the uses of corpus data in NLP. But before any of this can be done, we need to ask: what is a corpus?

24.2 What is a Corpus?

A corpus (pl. corpora, though corpuses is perfectly acceptable) is simply described as a large body of linguistic evidence typically composed of attested language use. One may contrast this form of linguistic evidence with sentences created not as a result of communication in context, but rather upon the basis of metalinguistic reflection upon language use, a type of data common in the generative approach to linguistics. Corpus data is not composed of the ruminations of theorists. It is composed of such varied material as everyday conversations (e.g. the spoken section of the British National Corpus 1), radio news broadcasts (e.g. the IBM/Lancaster Spoken English Corpus), published writing (e.g. the majority of the written section of the British National Corpus) and the writing of young children (e.g. the Leverhulme Corpus of Children's Writing). Such data are collected together into corpora which may be used for a range of research purposes. Typically these corpora are machine readable—trying to exploit a paper-based linguistic resource or audio recording running into millions of words is impractical. So while corpora could be paper based, or even simply sound recordings, the view taken here is that corpora are machine readable.

In this chapter the focus will be upon the use of corpora in NLP. But it is worth noting that one of the immense benefits of corpus data is that they may be used for a wide range of purposes in a number of disciplines. Corpora have uses in both linguistics and NLP, and are of interest to researchers from other disciplines, such as literary stylistics (Short, Culpeper, and Semino 1996). Corpora are multifunctional resources.

With this stated, a slightly more refined definition of a corpus is needed than that which has been introduced so far. It has been established that a corpus is a collection of naturally occurring language data. But is any collection of language data, from three sentences to three million words of data, a corpus? The term corpus should properly only be applied to a well-organized collection of data, collected within the boundaries of a sampling frame designed to allow the exploration of a certain linguistic feature (or set of features) via the data collected. A sampling frame is of crucial importance in corpus design. Sampling is inescapable. Unless the object of study is a highly restricted sublanguage or a dead language, it is quite impossible to collect all of the utterances of a natural language together within one corpus. As a consequence, the corpus should aim for balance and representativeness within a specific sampling frame, in order to allow a particular variety of language to be studied or modelled. The best way to explain these terms is via an example. Imagine that a researcher has the task of developing a dialogue manager for a planned telephone ticket selling system and decides to construct a corpus to assist in this task. The sampling frame here is clear—the relevant data for the planned corpus would have to be drawn from telephone ticket sales. It would be quite inappropriate to sample the novels of Jane Austen or face-to-face spontaneous conversation in order to undertake the task of modelling telephone-based transactional dialogues. Within the domain of telephone ticket sales there may be a number of different types of tickets sold, each of which requires distinct questions to be asked. Consequently, we can argue that there are various linguistically distinct categories of ticket sales. So the corpus is balanced by including a wide range of types of telephone ticket sales conversations within it, with the types organized into coherent subparts (for example, train ticket sales, plane ticket sales, and theatre ticket sales). Finally, within each of these categories there may be little point in recording one conversation, or even the conversations of only one operator taking a call. If one records only one conversation it may be highly idiosyncratic. If one records only the calls taken by one operator, one cannot be sure that they are typical of all operators. Consequently, the corpus aims for representativeness by including within it a range of speakers in order that idiosyncrasies may be averaged out.

24.2.1 Monolingual, comparable, and parallel corpora

So, a corpus is a body of machine-readable linguistic evidence, which is collected with reference to a sampling frame. There are important variations on this theme, however. So far the focus has been upon monolingual corpora —corpora representing one language. Comparable corpora are corpora where a series of monolingual corpora are collected for a range of languages, preferably using the same sampling frame and with similar balance and representativeness, to enable the study of those languages in contrast. Parallel corpora take a slightly different approach to the study of languages in contrast, gathering a corpus in one language and then translations of that corpus data into one or more languages. Parallel and comparable corpora may appear rather similar when first encountered, but the data they are composed of are significantly different. If the main focus of a study is on contrastive linguistics, comparable corpora are preferable, as, for example, the process of translation may influence the form of a translation, with features of the source language carried over into the target language (Schmied and Fink 2000). If the interest in using the corpus is to gain translation examples for an application such as example-based machine translation (see Chapter 28), then the parallel corpus, used in conjunction with a range of alignment techniques (Botley, McEnery, and Wilson 2000; Véronis 2000), offers just such data.

24.2.2 Spoken corpora

Whether the corpus is monolingual, comparable, or parallel, the corpus may also be composed of written language, spoken language, or both. With spoken language some important variations in corpus design come into play. The spoken corpus could in principle exist as a set of audio recordings only (for example, the Survey of English Dialects existed in this form for many years). At the other extreme, the original sound recordings of the corpus may not be available at all, and an orthographic transcription of the corpus could be the sole source of data (as is the case with the spoken section of the British National Corpus 2 ). Both of these scenarios have drawbacks. If the corpus exists only as a sound recording, such data are difficult to exploit, even in digital form. It is currently problematic for a machine to search, say, for the word apple in a recording of spontaneous conversation in which a whole range of different speakers are represented. On the other hand, while an orthographic transcription is useful for retrieval purposes—retrieving word forms from a machine-readable corpus is typically a trivial computational task—many important acoustic features of the original data are lost, e.g. prosodic features, variations in pronunciation. 3 As a consequence of both of these problems, spoken corpora have been built which combine a transcription of the corpus data with the original sound recording, so that one is able to retrieve words from the transcription, but then also retrieve the original acoustic context of the production of the word via a process called time alignment (Roach and Arnfield 1995 ). Such corpora are now becoming increasingly common.
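As a minimal illustration of the idea (the data structure below is hypothetical, not the format of any named corpus), time alignment can be as simple as start/end offsets per transcribed token, which lets a hit in the transcription be mapped back to its acoustic context:

```python
# Sketch of time alignment: each transcribed token carries start and end
# offsets (in seconds) into the original recording. The tuple format is
# hypothetical, for illustration only.
transcript = [
    (0.00, 0.31, "I"),
    (0.31, 0.74, "like"),
    (0.74, 1.22, "apples"),
]

def audio_spans(word, alignment, context=0.5):
    """Return (start, end) spans, padded by `context` seconds, for a word."""
    return [(max(0.0, s - context), e + context)
            for s, e, w in alignment if w.lower() == word.lower()]

print(audio_spans("apples", transcript))   # [(0.24, 1.72)]
```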

24.2.3 Research questions and corpora

The choice of corpus to be used in a study depends upon the research questions being asked of the corpus, or the applications one wishes to base upon the corpus. Yet whether the corpus is monolingual, comparable or parallel, within the sampling frame specified for the corpus, the corpus should be designed to be balanced and representative. 4 With this stated, let us now move to a brief overview of the history of corpus linguistics before introducing a further refinement to our definition of a corpus—the annotated versus the unannotated corpus.

24.3 A History of Corpus Linguistics

Outlining a history of corpus linguistics is difficult. In its modern, computerized, form, the corpus has only existed since the late 1940s. The basic idea of using attested language use for the study of language clearly pre-dated this time, but the problem was that the gathering and use of large volumes of linguistic data in the pre-computer age was so difficult as to be almost impossible. There were notable examples of it being achieved via the deployment of vast workforces—Kaeding 1897 is a notable example of this. Yet in reality, corpus linguistics in the form that we know it today, where any PC user can, with relative ease, exploit corpora running into millions of words, is a very recent phenomenon.

The crucial link between computers and the manipulation of large bodies of linguistic evidence was forged by Busa (1980) in the late 1940s. During the 1950s the first large project in the construction of comparable corpora was undertaken by Juilland (see, for example, Juilland and Chang-Rodriguez 1964), who also articulated clearly the concepts behind the ideas of the sampling frame, balance, and representativeness. English corpus linguistics took off in the late 1950s, with work in America on the Brown corpus (Francis 1979) and work in Britain on the Survey of English Usage (Quirk 1960). Work in English corpus linguistics in particular grew throughout the 1960s, 1970s, and 1980s, with significant milestones such as a corpus of transcribed spoken language (Svartvik and Quirk 1980), a corpus with manual encodings of part-of-speech information (Francis 1979), and a corpus with reliable automated encodings of parts of speech (Garside, Leech, and Sampson 1987) being reached in this period. During the 1980s, the number of corpora available steadily grew, as did the size of those corpora. This trend became clear in the 1990s, with corpora such as the British National Corpus and the Bank of English reaching vast sizes (100,000,000 words and 300,000,000 words of modern British English respectively) which would have been for all practical purposes impossible in the pre-electronic age. The other trend that became noticeable during the 1990s was the increasingly multilingual nature of corpus linguistics, with monolingual corpora becoming available for a range of languages, and parallel corpora coming into widespread use (McEnery and Oakes 1996; Botley, McEnery, and Wilson 2000; Véronis 2000).

In conjunction with this growth in corpus data, fuelled in part by expanding computing power, came a range of technical innovations. For example, schemes for systematically encoding corpus data came into being (Sperberg-McQueen and Burnard 1994), programs were written to allow the manipulation of ever larger data sets (e.g. Sara), and work began in earnest to represent the audio recording of a transcribed spoken corpus text in tandem with its transcription. The range of future developments in corpus linguistics is too numerous to mention in detail here (see McEnery and Wilson 2001 for a fuller discussion). What can be said, however, is that as personal computing technology develops yet further, we can expect that research questions not addressable with corpus data at this point of time will become possible, as new types of corpora are developed, and new programs to exploit these new corpora are written.

One area which has only been touched upon here, but which has been a major area of innovation in corpus linguistics in the past and which will undoubtedly remain so in the future, is corpus annotation. In the next section corpus annotation will be discussed in some depth, as it is an area where corpus linguistics and NLP often interact, as will be shown in section 24.6 .

24.4 Corpus Annotation

24.4.1 What is corpus annotation?

McEnery and Wilson (1996: 24) describe annotated corpora as being ‘enhanced with various types of linguistic information'. This enhancement is achieved by analysts, whether they be humans, computers, or a mixture of both, imposing a linguistic interpretation upon a corpus. Typically this analysis is encoded by reference to a specified range of features represented by textual mnemonics which are introduced into the corpus. These mnemonics seek to link sections of the text to units of linguistic analysis. So, for example, in the case of introducing a part-of-speech analysis to a text, textual mnemonics are generally placed in a one-to-one relationship with the words in the text. 5
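As a concrete illustration, one widespread convention (one of several; the exact encoding varies from corpus to corpus) attaches the mnemonic to the word with an underscore, which makes the one-to-one relationship trivially recoverable by machine:

```python
# One common (but not universal) encoding: each word carries its tag after
# an underscore. Splitting on the last underscore recovers word/tag pairs.
line = "the_AT0 cat_NN1 sat_VVD"           # C5-style tags, for illustration
pairs = [tuple(tok.rsplit("_", 1)) for tok in line.split()]
print(pairs)   # [('the', 'AT0'), ('cat', 'NN1'), ('sat', 'VVD')]
```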

24.4.1.1 Enrichment, interpretation, and imposition

In essence corpus annotation is the enrichment of a corpus in order to aid the process of corpus exploitation. Note that enrichment of the corpus does not necessarily occur from the viewpoint of the expert human analyst—corpus annotation only makes explicit what is implicit; it does not introduce new information. For any level of linguistic information encoded explicitly in a corpus, the information that a linguist can extract from the corpus by means of a hand and eye analysis will hopefully differ little, except in terms of the speed of analysis, from that contained in the annotation. The enrichment is aimed at users who need linguistic analyses but are not in a position to provide them. This covers both humans who lack the metalinguistic ability to impose a meaningful linguistic analysis upon a text and computers, which may likewise lack the ability to impose such analyses.

Keywords in describing the process of corpus annotation are imposition and interpretation. Given any text, there is bound to be a plurality of analyses at any given level of interpretation that one may wish to undertake. This plurality arises from the fact that there is often an allowable degree of variation in linguistic analyses, arising at least in part from ambiguities in the data and fuzzy boundaries between categories of analysis in any given analytical scheme. Corpus annotation typically represents one of a variety of possible analyses, and imposes that analysis consistently upon the corpus text.

24.4.2 What are the advantages of corpus annotation?

In the preceding sections, some idea of why we may wish to annotate a corpus has already emerged. In this section I want to detail four specific advantages of corpus annotation as a prelude to discussing the process of corpus annotation in the context of criticisms put forward against it to date. Key advantages of corpus annotation are ease of exploitation, reusability, multi-functionality, and explicit analyses.

24.4.2.1 Ease of exploitation

This is a point which we have considered briefly already. With an annotated corpus, the range and speed of corpus exploitation increases. Considering the range of exploitation, an annotated corpus can be used by a wider range of users than an unannotated corpus. For example, even if I cannot speak French, given an appropriately annotated corpus of French, I am capable of retrieving all of the nouns in a corpus of French. Similarly, even if a computer is not capable of parsing a sentence, given a parsed treebank and appropriate retrieval software, it can retrieve noun phrases from that corpus. Corpus annotation enables humans and machines to exploit and retrieve analyses of which they are not themselves capable.
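A sketch of that retrieval point: given word_TAG annotation (the tags below are invented for illustration), nouns can be extracted with a simple pattern, with no knowledge of the language required:

```python
# Sketch: retrieving all nouns from a word_TAG annotated text with a
# regular expression, without knowing the language. The NN* tags and the
# French example are invented for illustration.
import re

tagged = "le_DET chat_NN1 noir_ADJ dort_VBZ"
nouns = re.findall(r"(\w+)_NN\w*", tagged)
print(nouns)   # ['chat']
```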

Moving to the speed of corpus exploitation, even where a user is capable of undertaking the range of analyses encoded within an annotated corpus, they are able to exploit an analysis encoded within a corpus 6 swiftly and reliably.

24.4.2.2 Reusability

Annotated corpora also have the merit of allowing analyses to be exploited over and over again, as noted by Leech (1997: 5). Rather than an analysis being performed for a specific purpose and discarded, corpus annotation records an analysis. This analysis is then available for reuse.

24.4.2.3 Multi-functionality

An analysis originally annotated within a corpus may have been undertaken with one specific purpose in mind. When reused, however, the purpose of the corpus exploitation may be quite different from that originally envisaged. So as well as being reusable, corpus analyses can also be put to a wide range of uses.

24.4.2.4 Explicit analyses

A final advantage I would outline for corpus annotation is that it is an excellent means by which to make an analysis explicit. As well as promoting reuse, a corpus analysis stands as a clear objective record of the analysis imposed upon the corpus by the analyst/analysts responsible for the annotation. As we will see shortly, this clear benefit has actually been miscast as a drawback of corpus annotation in the past.

24.4.3 How corpus annotation is achieved

Corpus annotation may be achieved entirely automatically, by a semi-automated process, or entirely manually. To cover each in turn, some NLP tools, such as lemmatizers and part-of-speech taggers, are now so reliable for languages such as English, French, and Spanish 7 that we may consider a wholly automated approach to their annotation (see Chapter 11 for a more detailed review of part-of-speech tagging). While using wholly automated procedures does inevitably mean a rump of errors in a corpus, the error rates associated with taggers such as CLAWS (Garside, Leech, and Sampson 1987 ) are low, typically being reported at around 3 per cent. Where such a rate of error in analysis is acceptable, corpus annotation may proceed without human intervention.

More typically, however, NLP tools are not sufficiently accurate to allow fully automated annotation. Yet they may be sufficiently accurate that correcting the annotations introduced by them is faster than undertaking the annotation entirely by hand. This was the case in the construction of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993), where the constituent structure of the corpus was first annotated by a computer and then corrected by human analysts. Another scenario where a mixture of machine and human effort occurs is where NLP tools which are usually sufficiently accurate, such as part-of-speech taggers, are not sufficient because highly accurate annotation is required. This is the case, for example, in the core corpus of the British National Corpus. Here the core corpus (one million words of written English and one million words of spoken) was first automatically part-of-speech annotated, and then hand corrected by expert human analysts.

Purely manual annotation occurs where no NLP application is available to a user, or where the accuracy of available systems is too low for manual correction of their output to take less time than annotation from scratch. An example of purely manual annotation is in the construction of corpora encoding anaphoric and cataphoric references (Botley and McEnery 2000; Mitkov 2002). It should be noted that, in real terms, considering the range of possible annotations we may want to introduce into corpus texts, most would have to be introduced manually or at best semi-manually.

24.4.4 Criticisms of corpus annotation

Two main criticisms of corpus annotation have surfaced over the past decade or so. I believe it is quite safe to dismiss both, but for purposes of clarity let us spell out the criticisms and counter them here.

24.4.4.1 Corpus annotations produce impure corpora

The first criticism to be levelled at corpus annotation was that it somehow sullied the unannotated corpus by the process of imposing an interpretation on the data. The points to be made against this are simple. First, in imposing one analysis, there is no constraint upon the user of the corpus to use that analysis—they may impose one of their own. The plurality of interpretations of a text is something that must be accepted from the outset. Secondly, declining to make a clear record of the interpretation we impose does not mean that no interpretation occurs: in using raw corpora, interpretations are imposed all the same. The interpretations imposed by corpus annotations have the advantage that they are objectively recorded and open to scrutiny. The interpretations of those who choose not to annotate corpus data remain fundamentally more obscure than those recorded clearly in a corpus. Bearing these two points in mind, it is plain to see the fundamental weakness of this criticism of corpus annotation.

24.4.4.2 Consistency versus accuracy

The second criticism, presented by Sinclair (1992), is not a criticism of corpus annotation as such. Rather it is a criticism of two of the practices of corpus annotation we have just examined—manual and semi-automatic corpus annotation. The argument is subtle, and worth considering seriously. It is centred upon two related notions: accuracy and consistency. When a part-of-speech tagger annotates a text and is 97 per cent accurate, its analysis should be 100 per cent consistent, i.e. given the same set of decision-making conditions in two different parts of the corpus, the answer given is the same. This consistency for the machine derives from its impartial and unswerving application of a program. Can we expect the same consistency of analysis from human annotators? As we have discussed already, there is a plurality of analyses possible for any given annotation. Consequently, when human beings are imposing an interpretation upon a text, can we assume that their analysis is 100 per cent consistent? May it not be the case that they may produce analyses which, when viewed from several points of view, are highly accurate, but which, when viewed from one analytical viewpoint, are not as accurate? It may be the case that the annotation of a corpus may be deemed to be accurate, but simultaneously be highly inconsistent.
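One standard way to quantify annotator consistency empirically is to compare observed agreement with chance agreement, as in Cohen's kappa. A minimal sketch over toy data (the tag names and annotations are invented for illustration):

```python
# Sketch: measuring consistency between two annotators over the same tokens
# via observed agreement P(A), chance agreement P(E), and Cohen's kappa.
from collections import Counter

a = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"]   # annotator 1 (toy data)
b = ["NOUN", "VERB", "VERB", "ADJ", "NOUN"]   # annotator 2 (toy data)

n = len(a)
p_obs = sum(x == y for x, y in zip(a, b)) / n
ca, cb = Counter(a), Counter(b)
# chance agreement: probability both pick tag t if choosing at their own rates
p_exp = sum(ca[t] * cb[t] for t in set(a) | set(b)) / (n * n)
kappa = (p_obs - p_exp) / (1 - p_exp)
print(f"P(A)={p_obs:.2f}  P(E)={p_exp:.2f}  kappa={kappa:.2f}")
# P(A)=0.80  P(E)=0.36  kappa=0.69
```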

This argument is potentially quite damaging to the practice of corpus annotation, especially when we consider that most hand analyses are carried out by teams of annotators, hence amplifying the possibility of inconsistency. As a result of such arguments, experiments have been carried out by annotation teams around the world (Marcus, Santorini, and Marcinkiewicz 1993 ; Voutilainen and Järvinen 1995 ; Baker 1997 ) in order to examine the validity of this criticism. No study to date has supported Sinclair's argument. Indeed every study has shown that while the introduction of a human element to corpus annotation does mean a modest decline in the consistency of annotation within the corpus, this decline is more than offset by a related rise in the accuracy of the annotation. There is one important rider to add to this observation, however. All of the studies above, especially the studies of Baker 1997 , have used teams of trained annotators—annotators who were well versed in the use of a particular annotation scheme, and who had long experience in working with lists of guidelines which helped their analyses to converge. It is almost certainly true, though as yet not validated experimentally, that, given a set of analysts with no guidelines to inform their annotation decisions and no experience of teamwork, Sinclair's criticism would undoubtedly be more relevant. As it is, there is no reason to assume that Sinclair's criticisms of human-aided annotation should colour one's view of corpora produced with the aid of human grammarians, such as the French, English, and Spanish CRATER corpora (McEnery et al. 1997 ).

24.5 What Corpora are in Existence?

An increasing variety of annotated corpora are currently in existence. It should come as no surprise to discover that the majority of annotated corpora involve part-of-speech annotation and lemmatization, as these are procedures which can be undertaken largely automatically. Nonetheless, a growing number of hand-annotated corpora are becoming available. Table 24.1 seeks to show something of the range of annotated corpora of written language in existence. For more detail on the range and use of corpus annotation, see McEnery and Wilson (1996), 8 and Garside, Leech, and McEnery (1997).

Having now established the philosophical and practical basis for corpus annotation, and having reviewed the range of annotations related to written corpora, I would like to conclude this chapter by reviewing the practical benefits related to the use of annotated corpora in one field, NLP.

24.6 The Exploitation of Corpora in NLP

NLP is a rapidly developing area of study, which is producing working solutions to specified natural-language processing problems. The application of annotated corpora within NLP to date has resulted in advances in language processing—part-of-speech taggers, such as CLAWS, are an early example of how annotated corpora enabled the development of better language processing systems (see Garside, Leech, and Sampson 1987 ). Annotated corpora have allowed such developments to occur as they are unparalleled sources of quantitative data. To return to CLAWS, because the tagged Brown corpus was available, accurate transition probabilities could be extracted for use in the development of CLAWS. The benefits of this data are apparent when we compare the accuracy rate of CLAWS—around 97 per cent—to that of TAGGIT, used to develop the Brown corpus—around 77 per cent. This massive improvement can be attributed to the existence of annotated corpus data which enabled CLAWS to disambiguate between multiple potential part-of-speech tag assignments in context.
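As an illustration of the kind of quantitative data involved, the sketch below estimates maximum-likelihood tag-transition probabilities from a toy tagged corpus; this is a great simplification of what CLAWS actually does, and the data and tags are invented:

```python
# Sketch: estimating tag-to-tag transition probabilities from a tagged
# corpus, the kind of quantitative data a probabilistic tagger draws on.
from collections import Counter

tagged_sentences = [
    [("the", "AT"), ("cat", "NN"), ("sat", "VB")],
    [("a", "AT"), ("dog", "NN"), ("barked", "VB")],
    [("dogs", "NN"), ("sleep", "VB")],
]

bigrams, unigrams = Counter(), Counter()
for sent in tagged_sentences:
    tags = [t for _, t in sent]
    unigrams.update(tags[:-1])            # count tags that have a successor
    bigrams.update(zip(tags, tags[1:]))   # count adjacent tag pairs

# P(next_tag | tag) via maximum-likelihood estimation
trans = {(t1, t2): c / unigrams[t1] for (t1, t2), c in bigrams.items()}
print(trans[("AT", "NN")])   # 1.0: articles in this toy corpus precede nouns
```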

It is not simply part-of-speech tagging where quantitative data are of prime importance to disambiguation. Disambiguation is a key problem in a variety of areas such as anaphor resolution, parsing, and machine translation. It is beyond doubt that annotated corpora will have an important role to play in the development of NLP systems in the future, as can be seen from the burgeoning corpus-based NLP literature (LREC 2000).

Beyond the use of quantitative data derived from annotations as the basis of disambiguation in NLP systems, annotated corpora may also provide the raw fuel for various terminology extraction programs. Work has been developed in the area of automated terminology extraction which relies upon annotated corpora for its results (Daille 1995; Gaussier 1995). So although disambiguation is an area where annotated corpora are having a key impact, there is ample scope for believing that they may be used in a wider variety of applications.

A further example of such an application may be called evidence-based learning. Until recently, language analysis programs almost exclusively relied on human intuition in the construction of their knowledge/rule base. Annotated corpora corrected or produced by humans, while still encoding human intuitions, situate those intuitions within a context where the computer can recover intuitions from use, and where humans can moderate their intuitions by application to real examples. Rather than having to rely on decontextualized intuitions, the computer can recover intuitions from practice. The difference between human experts producing opinions about what they do out of context and their practice in context has long been understood in artificial intelligence—humans tend to be better at showing what they know than at explaining what they know, so to speak. The construction of an annotated corpus, therefore, allows us to overcome this known problem in communicating expert knowledge to machines, while simultaneously providing testbeds against which intuitions about language may be tested. Where machine learning algorithms are the basis for an NLP application, it is fair to say that corpus data are essential. Without them, machine learning-based approaches to NLP simply will not work.

Another role which is emerging for the annotated corpus is as an evaluation testbed for NLP programs. Evaluation of language processing systems can be problematic when developers train systems on different analytical schemes and texts, and judge them against different target analyses. Using one annotated corpus as an agreed testbed for evaluation can greatly ease such problems, as it specifies the text type or types, the analytical scheme, and the results against which the performance of a program is to be judged. This approach to the evaluation of systems has been adopted in the past, as reported by Black, Garside, and Leech (1993), for instance, and in the Message Understanding Conferences in the United States (Aone and Bennett 1994). The benefits of the approach are so evident that the establishment of such testbed corpora is bound to become increasingly common in the very near future.

One final activity which annotated corpora allow is worthy of some coverage here. It is true that, at the moment, the range of annotations available is wider than the range of annotations which a computer can introduce with a high degree of accuracy. Yet by using the annotations present in a hand-annotated corpus, a resource is developed that permits a computer, over the scope of the annotated corpus only, to act as if it could perform the analysis in question. In short, if we have a manually produced treebank, a computer can read the treebank and discover where the marked constituents are, rather than having to work this out for itself. The advantages of this are limited yet clear. Such a use of an annotated corpus may provide an economical means of evaluating whether the development of a certain NLP application is worthwhile—if somebody posits that a parser of newspaper stories would be of use in some application, then with a treebank of newspaper stories they can test the worth of their claim without actually producing a parser.

There are further uses of annotated corpora in NLP beyond those covered here. The range of uses covered, however, is more than sufficient to illustrate that annotated corpora, though we can justify them on philosophical grounds, can be more than justified on practical grounds alone.

24.7 Conclusion

Corpora have played a useful role in the development of human language technology to date. In return, corpus linguistics has gained access to ever more sophisticated language processing systems. There is no reason to believe that this happy symbiosis will not continue—to the benefit of language engineers and corpus linguists alike—in the future.

Further Reading and Relevant Resources

There are now a number of introductions to corpus linguistics, each of which takes slightly different views on the topic. McEnery and Wilson (2001) take a view closest to that presented in this chapter. Kennedy 1999 is concerned largely with English corpus linguistics and the use of corpora in language pedagogy. Stubbs 1997 is written entirely from the viewpoint of neo-Firthian approaches to corpus linguistics, while Biber, Conrad, and Reppen (1998) is concerned mainly with the multi-feature multi-dimension approach to analysing corpus data established in Biber 1988 .

For those readers interested in corpus annotation, Garside, Leech, and McEnery 1997 provides a comprehensive overview of corpus annotation practices to date.

Many references in this chapter will lead to papers where specific corpora are discussed. The corpora listed here are simply those explicitly referenced in this chapter. For each corpus, a URL is given where further information can be found.

This list by no means represents the full range of corpora available. For a better idea of that range, visit the website of the European Language Resources Association ( http://www.icp.grenet.fr/ELRA/home.html ) or the Linguistic Data Consortium ( http://www.ldc.upenn.edu ).

British National Corpus: http://www.comp.lancs.ac.uk/computing/research/ucrel/bnc.html ; IBM/Lancaster Spoken English Corpus: http://midwich.reading.ac.uk/research/speechlab/marsec/marsec.html ; Leverhulme Corpus of Children's Writing: http://www.ling.lancs.ac.uk/monkey/lever/intro.htm ; Survey of English Dialects: http://www.xrefer.com/entry/444074 .

Aone, C. and S. W. Bennett. 1994 . ‘ Discourse tagging and discourse tagged multilingual corpora ’. Proceedings of the International Workshop on Sharable Natural Language Resources (Nara), 71–7.

Baker, J. P. 1995. The Evaluation of Multiple Posteditors: Inter-Rater Consistency in Correcting Automatically Tagged Data. Unit for Computer Research on the English Language Technical Papers 7, Lancaster University.

—— 1997 . ‘ Consistency and accuracy in correcting automatically-tagged corpora ’. In Garside, Leech, and McEnery (1997), 243–50.

Biber, D. 1988 . Variation across speech and writing . Cambridge: Cambridge University Press.

—— S. Conrad, and R. Reppen. 1998 . Corpus Linguistics: Investigating Language Structure and Use . Cambridge: Cambridge University Press.

Black, E., R. Garside, and G. Leech. 1993 . Statistically Driven Computer Grammars of English: The IBM/Lancaster Approach . Amsterdam: Rodopi.

Botley, S. and A. M. McEnery (eds.). 2000 . Discourse Anaphora and Resolution . Studies in Corpus Linguistics. Amsterdam: John Benjamins.

Botley, S., A. M. McEnery, and A. Wilson (eds.). 2000. Multilingual Corpora in Teaching and Research. Amsterdam: Rodopi.

Busa, R. 1980. ‘The annals of humanities computing: the Index Thomisticus’. Computers and the Humanities, 14, 83–90.

Church, K. 1988. ‘A stochastic parts program and noun phrase parser for unrestricted texts’. Proceedings of the 2nd Annual Conference on Applied Natural Language Processing (Austin, Tex.), 136–48.

Daille, B. 1995 . Combined Approach for Terminology Extraction: Lexical Statistics and Linguistic Filtering . Unit for Computer Research on the English Language Technical Papers 5, Lancaster University.

Francis, W. 1979. ‘Problems of assembling, describing and computerizing large corpora’. In H. Bergenholtz and B. Schaeder (eds.), Empirische Textwissenschaft: Aufbau und Auswertung von Text Corpora. Königstein: Scriptor Verlag, 110–23.

Gaizauskas, R., T. Wakao, K. Humphreys, H. Cunningham, and Y. Wilks. 1995. ‘Description of the LaSIE system as used for MUC-6’. Proceedings of the 6th Message Understanding Conference (MUC-6) (San Jose, Calif.), 207–20.

Garside, R., G. Leech, and A. M. McEnery. 1997 . Corpus Annotation . London: Longman.

—— and G. Sampson. 1987 . The Computational Analysis of English . London: Longman.

Gaussier, E. 1995. Modèles statistiques et patrons morphosyntaxiques pour l'extraction de lexiques bilingues. Ph.D. thesis, University of Paris VII.

Juilland, A. and E. Chang-Rodriguez. 1964 . Frequency Dictionary of Spanish Words . The Hague: Mouton.

Kaeding, J. 1897. Häufigkeitswörterbuch der deutschen Sprache. Steglitz: published by the author.

Kennedy, G. 1999 . Corpus Linguistics . London: Longman.

Leech, G. 1997 . ‘ Introducing corpus annotation ’. In Garside, Leech, and McEnery (1997), 1–18.

LREC 2000 . Proceedings of the 2nd International Conference on Language Resources and Evaluation (Athens).

McEnery, A. M. and M. P. Oakes. 1996 . ‘Sentence and word alignment in the CRATER project: methods and assessment’. In J. Thomas and M. Short (eds.), Using Corpora for Language Research . London: Longman, 211–31.

—— and A. Wilson. 1996 . Corpus Linguistics . Edinburgh: Edinburgh University Press.

—— 2001 . Corpus Linguistics , 2nd edn. Edinburgh: Edinburgh University Press.

—— F. Sanchez-Leon, and A. Nieto-Serano. 1997. ‘Multilingual resources for European languages: contributions of the CRATER project’. Literary and Linguistic Computing, 12(4), 219–26.

Marcus, M., B. Santorini, and M. Marcinkiewicz. 1993 . ‘ Building a large annotated corpus of English: the Penn Treebank ’. Computational Linguistics , 19(2), 313–30.

Mitkov, R. 2002 . Anaphora Resolution . London: Longman.

Nagao, M. 1984 . ‘A framework of a mechanical translation between Japanese and English by analogy principle’. In A. Elithorn and J. Banerji (eds.), Artificial and Human Translation . Brussels: Nato Publications, 173–80.

Ng, H. T. and H. B. Lee. 1996. ‘Integrating multiple knowledge sources to disambiguate word sense: an exemplar-based approach’. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (Santa Cruz, Calif.), 40–7.

Quirk, R. 1960. ‘Towards a description of English usage’. Transactions of the Philological Society, 40–61.

Roach, P . and S. Arnfield. 1995 . ‘Linking prosodic transcription to the time dimension’. In G. Leech, G. Myers, and J. Thomas (eds.), Spoken English on Computer: Transcription, Mark-up and Applications . London: Longman, 149–60.

Sampson, G. 1995 . English for the Computer: The SUSANNE Corpus and Analytic Scheme . Oxford: Clarendon Press.

Schmied, J. and B. Fink. 2000. ‘Corpus-based contrastive lexicology: the case of English with and its German translation equivalents’. In Botley, McEnery, and Wilson (eds.), 157–76.

Short, M. , J. Culpeper, and E. Semino. 1996 . ‘Using a corpus for stylistics research: speech presentation’. In M. Short and J. Thomas (eds.), Using Corpora for Language Research . London: Longman.

Sinclair, J. 1991 . Corpus, Concordance, Collocation . Oxford: Oxford University Press.

—— 1992 . ‘The automatic analysis of text corpora’. In J. Svartvik (ed.), Directions in Corpus Linguistics: Proceedings of the Nobel Symposium 82, Stockholm , The Hague: Mouton, 379–97.

Sperberg-McQueen, C. M. and L. Burnard. 1993. Guidelines for Electronic Text Encoding and Interchange. Chicago: Text Encoding Initiative.

Stiles, W. B. 1992 . Describing Talk . New York: Sage.

Stubbs, M. 1997 . Texts and Corpus Analysis . Oxford: Blackwell.

Svartvik, J. and R. Quirk. 1980 . The London-Lund Corpus of Spoken English . Lund: Lund University Press.

Véronis, J. 2000 . Parallel Text Processing . Dordrecht: Kluwer.

Voutilainen, A. and T. Järvinen. 1995. ‘Specifying a shallow grammatical representation for parsing purposes’. Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL'95) (Dublin), 210–14.

Wilson, A. and J. Thomas. 1997 . ‘ Semantic annotation ’. In Garside, Leech, and McEnery (1997), 53–65.

Details of all corpora mentioned in this chapter are given in ‘Further Reading and Relevant Resources’ below.

Some audio material for the BNC spoken corpus is available. Indeed, the entire set of recordings is lodged in the National Sound Archive in the UK. However, the recordings are not available for general use beyond the archive, and the sound files have not been time-aligned against their transcriptions.

One can, as will be seen later, transcribe speech using a phonemic transcription and annotate the transcription to show features such as stress, pitch, and intonation. Nonetheless, as the original data will almost certainly contain information lost in the process of transcription, and, crucially, the process of transcription and annotation also entails the imposition of an analysis, the need to consult the sound recording would still exist.

There is another organizing principle upon which some corpora have been constructed, which emphasizes continued text collection through time with less of a focus on the features of corpus design outlined here. These corpora, called monitor corpora, are not numerous, but have been influential and are useful for diachronic studies of linguistic features which may change rapidly, such as lexis. Some, such as the Bank of English, are very large and used for a range of purposes. Readers interested in exploring the monitor corpus further are referred to Sinclair 1991 .

Note that there are exceptions to this general description. Multi-word units may be placed in a many-to-one relationship with a morphosyntactic tag. Similarly, enclitics in a text may force certain words to be placed in a one-to-many relationship with morphosyntactic annotations.

Assuming that suitable, preferably annotation-aware, retrieval software is available.

See McEnery et al. (1997) for an account of a project which produced reliable English, French, and Spanish lemmatization and part-of-speech tagging.

Also see http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fral.htm for some on-line examples of corpus annotation.

Natural Language Annotation for Machine Learning, by James Pustejovsky and Amber Stubbs

Chapter 1. The Basics

It seems as though every day there are new and exciting problems that people have taught computers to solve, from how to win at chess or Jeopardy to determining shortest-path driving directions. But there are still many tasks that computers cannot perform, particularly in the realm of understanding human language. Statistical methods have proven to be an effective way to approach these problems, but machine learning (ML) techniques often work better when the algorithms are provided with pointers to what is relevant about a dataset, rather than just massive amounts of data. When discussing natural language, these pointers often come in the form of annotations—metadata that provides additional information about the text. However, in order to teach a computer effectively, it’s important to give it the right data, and for it to have enough data to learn from. The purpose of this book is to provide you with the tools to create good data for your own ML task. In this chapter we will cover:

Why annotation is an important tool for linguists and computer scientists alike

How corpus linguistics became the field that it is today

The different areas of linguistics and how they relate to annotation and ML tasks

What a corpus is, and what makes a corpus balanced

How some classic ML problems are represented with annotations

The basics of the annotation development cycle

The Importance of Language Annotation

Everyone knows that the Internet is an amazing resource for all sorts of information that can teach you just about anything: juggling, programming, playing an instrument, and so on. However, there is another layer of information that the Internet contains, and that is how all those lessons (and blogs, forums, tweets, etc.) are being communicated. The Web contains information in all forms of media—including texts, images, movies, and sounds—and language is the communication medium that allows people to understand the content, and to link the content to other media. However, while computers are excellent at delivering this information to interested users, they are much less adept at understanding language itself.

Theoretical and computational linguistics are focused on unraveling the deeper nature of language and capturing the computational properties of linguistic structures. Human language technologies (HLTs) attempt to adopt these insights and algorithms and turn them into functioning, high-performance programs that can impact the ways we interact with computers using language. With more and more people using the Internet every day, the amount of linguistic data available to researchers has increased significantly, allowing linguistic modeling problems to be viewed as ML tasks, rather than limited to the relatively small amounts of data that humans are able to process on their own.

However, it is not enough to simply provide a computer with a large amount of data and expect it to learn to speak—the data has to be prepared in such a way that the computer can more easily find patterns and inferences. This is usually done by adding relevant metadata to a dataset. Any metadata tag used to mark up elements of the dataset is called an annotation over the input. However, in order for the algorithms to learn efficiently and effectively, the annotation done on the data must be accurate, and relevant to the task the machine is being asked to perform. For this reason, the discipline of language annotation is a critical link in developing intelligent human language technologies.

Giving an ML algorithm too much information can slow it down and lead to inaccurate results, or result in the algorithm being so molded to the training data that it becomes “overfit” and provides less accurate results than it might otherwise on new data. It’s important to think carefully about what you are trying to accomplish, and what information is most relevant to that goal. Later in the book we will give examples of how to find that information, and how to determine how well your algorithm is performing at the task you’ve set for it.

Datasets of natural language are referred to as corpora , and a single set of data annotated with the same specification is called an annotated corpus . Annotated corpora can be used to train ML algorithms. In this chapter we will define what a corpus is, explain what is meant by an annotation, and describe the methodology used for enriching a linguistic data collection with annotations for machine learning.

The Layers of Linguistic Description

While it is not necessary to have formal linguistic training in order to create an annotated corpus, we will be drawing on examples of many different types of annotation tasks, and you will find this book more helpful if you have a basic understanding of the different aspects of language that are studied and used for annotations. Grammar is the name typically given to the mechanisms responsible for creating well-formed structures in language. Most linguists view grammar as itself consisting of distinct modules or systems, either by cognitive design or for descriptive convenience. These areas usually include syntax, semantics, morphology, phonology (and phonetics), and the lexicon. Areas beyond grammar that relate to how language is embedded in human activity include discourse, pragmatics, and text theory. The following list provides more detailed descriptions of these areas:

Syntax: The study of how words are combined to form sentences. This includes examining parts of speech and how they combine to make larger constructions.

Semantics: The study of meaning in language. Semantics examines the relations between words and what they are being used to represent.

Morphology: The study of units of meaning in a language. A morpheme is the smallest unit of language that has meaning or function, a definition that includes words, prefixes, affixes, and other word structures that impart meaning.

Phonology: The study of the sound patterns of a particular language. Aspects of study include determining which phones are significant and have meaning (i.e., the phonemes); how syllables are structured and combined; and what features are needed to describe the discrete units (segments) in the language, and how they are interpreted.

Phonetics: The study of the sounds of human speech, and how they are made and perceived. A phone is the term for an individual sound, and is essentially the smallest unit of human speech.

Lexicon: The study of the words and phrases used in a language, that is, a language’s vocabulary.

Discourse analysis: The study of exchanges of information, usually in the form of conversations, and particularly the flow of information across sentence boundaries.

Pragmatics: The study of how the context of a text affects the meaning of an expression, and what information is necessary to infer a hidden or presupposed meaning.

Text structure analysis: The study of how narratives and other textual styles are constructed to make larger textual compositions.

Throughout this book we will present examples of annotation projects that make use of various combinations of the different concepts outlined in the preceding list.

What Is Natural Language Processing?

Natural Language Processing (NLP) is a field of computer science and engineering that has developed from the study of language and computational linguistics within the field of Artificial Intelligence. The goals of NLP are to design and build applications that facilitate human interaction with machines and other devices through the use of natural language. Some of the major areas of NLP include:

Question Answering Systems (QAS): Imagine being able to actually ask your computer or your phone what time your favorite restaurant in New York stops serving dinner on Friday nights. Rather than typing in the (still) clumsy set of keywords into a search browser window, you could simply ask in plain, natural language—your own, whether it’s English, Mandarin, or Spanish. (While systems such as Siri for the iPhone are a good start to this process, it’s clear that Siri doesn’t fully understand all of natural language, just a subset of key phrases.)

Summarization: This area includes applications that can take a collection of documents or emails and produce a coherent summary of their content. Such programs also aim to provide snap “elevator summaries” of longer documents, and possibly even turn them into slide presentations.

Machine Translation: The holy grail of NLP applications, this was the first major area of research and engineering in the field. Programs such as Google Translate are getting better and better, but the real killer app will be the BabelFish that translates in real time when you’re looking for the right train to catch in Beijing.

Speech Recognition: This is one of the most difficult problems in NLP. There has been great progress in building models that can be used on your phone or computer to recognize spoken language utterances that are questions and commands. Unfortunately, while these Automatic Speech Recognition (ASR) systems are ubiquitous, they work best in narrowly defined domains and don’t allow the speaker to stray from the expected scripted input (“Please say or type your card number now”).

Document classification: This is one of the most successful areas of NLP, wherein the task is to identify in which category (or bin) a document should be placed. This has proved to be enormously useful for applications such as spam filtering, news article classification, and movie reviews, among others. One reason this has had such a big impact is the relative simplicity of the learning models needed for training the algorithms that do the classification.

As we mentioned in the Preface , the Natural Language Toolkit (NLTK), described in the O’Reilly book Natural Language Processing with Python , is a wonderful introduction to the techniques necessary to build many of the applications described in the preceding list. One of the goals of this book is to give you the knowledge to build specialized language corpora (i.e., training and test datasets) that are necessary for developing such applications.

A Brief History of Corpus Linguistics

In the mid-20th century, linguistics was practiced primarily as a descriptive field, used to study structural properties within a language and typological variations between languages. This work resulted in fairly sophisticated models of the different informational components comprising linguistic utterances. As in the other social sciences, the collection and analysis of data was also being subjected to quantitative techniques from statistics. In the 1940s, linguists such as Bloomfield were starting to think that language could be explained in probabilistic and behaviorist terms. Empirical and statistical methods became popular in the 1950s, and Shannon’s information-theoretic view of language analysis appeared to provide a solid quantitative approach for modeling qualitative descriptions of linguistic structure.

Unfortunately, the development of statistical and quantitative methods for linguistic analysis hit a brick wall in the 1950s. This was due primarily to two factors. First, there was the problem of data availability. One of the problems with applying statistical methods to the language data at the time was that the datasets were generally so small that it was not possible to make interesting statistical generalizations over large numbers of linguistic phenomena. Second, and perhaps more important, there was a general shift in the social sciences from data-oriented descriptions of human behavior to introspective modeling of cognitive functions.

As part of this new attitude toward human activity, the linguist Noam Chomsky focused on both a formal methodology and a theory of linguistics that not only ignored quantitative language data, but also claimed that it was misleading for formulating models of language behavior ( Chomsky 1957 ).

This view was very influential throughout the 1960s and 1970s, largely because the formal approach was able to develop extremely sophisticated rule-based language models using mostly introspective (or self-generated) data. This was a very attractive alternative to trying to create statistical language models on the basis of still relatively small datasets of linguistic utterances from the existing corpora in the field. Formal modeling and rule-based generalizations, in fact, have always been an integral step in theory formation, and in this respect, Chomsky’s approach on how to do linguistics has yielded rich and elaborate models of language.

Here’s a quick overview of some of the milestones in the field, leading up to where we are now.

1950s: Descriptive linguists compile collections of spoken and written utterances of various languages from field research. Literary researchers begin compiling systematic collections of the complete works of different authors. Key Word in Context (KWIC) is invented as a means of indexing documents and creating concordances.

1960s: Kucera and Francis publish A Standard Corpus of Present-Day American English (the Brown Corpus), the first broadly available large corpus of language texts. Work in Information Retrieval (IR) develops techniques for statistical similarity of document content.

1970s: Stochastic models developed from speech corpora make Speech Recognition systems possible. The vector space model is developed for document indexing. The London-Lund Corpus (LLC) is developed through the work of the Survey of English Usage.

1980s: The Lancaster-Oslo-Bergen (LOB) Corpus, designed to match the Brown Corpus in terms of size and genres, is compiled. The COBUILD (Collins Birmingham University International Language Database) dictionary is published, the first based on examining usage from a large English corpus, the Bank of English. The Survey of English Usage Corpus inspires the creation of a comprehensive corpus-based grammar, A Comprehensive Grammar of the English Language. The Child Language Data Exchange System (CHILDES) Corpus is released as a repository for first language acquisition data.

1990s: The Penn TreeBank is released. This is a corpus of tagged and parsed sentences of naturally occurring English (4.5 million words). The British National Corpus (BNC) is compiled and released as the largest corpus of English to date (100 million words). The Text Encoding Initiative (TEI) is established to develop and maintain a standard for the representation of texts in digital form.

2000s: As the World Wide Web grows, more data is available for statistical models for Machine Translation and other applications. The American National Corpus (ANC) project releases a 22-million-word subcorpus, and the Corpus of Contemporary American English (COCA) is released (400 million words). Google releases its Google N-gram Corpus of 1 trillion word tokens from public web pages; the corpus records n-grams of up to five words, along with their frequencies.

2010s: International standards organizations, such as ISO, begin to recognize and co-develop text encoding formats that are being used for corpus annotation efforts. The Web continues to make enough data available to build models for a whole new range of linguistic phenomena. Entirely new forms of text corpora, such as Twitter, Facebook, and blogs, become available as a resource.

Theory construction, however, also involves testing and evaluating your hypotheses against observed phenomena. As more linguistic data has gradually become available, something significant has changed in the way linguists look at data. The phenomena are now observable in millions of texts and billions of sentences over the Web, and this has left little doubt that quantitative techniques can be meaningfully applied to both test and create the language models correlated with the datasets. This has given rise to the modern age of corpus linguistics. As a result, the corpus is the entry point from which all linguistic analysis will be done in the future.

You gotta have data! As philosopher of science Thomas Kuhn said: “When measurement departs from theory, it is likely to yield mere numbers, and their very neutrality makes them particularly sterile as a source of remedial suggestions. But numbers register the departure from theory with an authority and finesse that no qualitative technique can duplicate, and that departure is often enough to start a search” ( Kuhn 1961 ).

The assembly and collection of texts into more coherent datasets that we can call corpora started in the 1960s.

Some of the most important corpora are listed in Table 1-1 .

What Is a Corpus?

A corpus is a collection of machine-readable texts that have been produced in a natural communicative setting. They have been sampled to be representative and balanced with respect to particular factors; for example, by genre—newspaper articles, literary fiction, spoken speech, blogs and diaries, and legal documents. A corpus is said to be “representative of a language variety” if the content of the corpus can be generalized to that variety ( Leech 1991 ).

This is not as circular as it may sound. Basically, if the content of the corpus, defined by specifications of linguistic phenomena examined or studied, reflects that of the larger population from which it is taken, then we can say that it “represents that language variety.”

The notion of a corpus being balanced is an idea that has been around since the 1980s, but it is still a rather fuzzy notion and difficult to define strictly. Atkins and Ostler (1992) propose a formulation of attributes that can be used to define the types of text, and thereby contribute to creating a balanced corpus.

Two well-known corpora can be compared for their effort to balance the content of the texts. The Penn TreeBank ( Marcus et al. 1993 ) is a 4.5-million-word corpus that contains texts from four sources: the Wall Street Journal , the Brown Corpus, ATIS, and the Switchboard Corpus. By contrast, the BNC is a 100-million-word corpus that contains texts from a broad range of genres, domains, and media.

The most diverse subcorpus within the Penn TreeBank is the Brown Corpus, which is a 1-million-word corpus consisting of 500 English text samples, each one approximately 2,000 words. It was collected and compiled by Henry Kucera and W. Nelson Francis of Brown University (hence its name) from a broad range of contemporary American English in 1961. In 1967, they released a fairly extensive statistical analysis of the word frequencies and behavior within the corpus, the first of its kind in print, as well as the Brown Corpus Manual ( Francis and Kucera 1964 ).

There has never been any doubt that all linguistic analysis must be grounded on specific datasets. What has recently emerged is the realization that all linguistics will be bound to corpus-oriented techniques, one way or the other. Corpora are becoming the standard data exchange format for discussing linguistic observations and theoretical generalizations, and certainly for evaluation of systems, both statistical and rule-based.

Table 1-2 shows how the Brown Corpus compares to other corpora that are also still in use.

Looking at the way the files of the Brown Corpus can be categorized gives us an idea of what sorts of data were used to represent the English language. The top two general data categories are informative, with 374 samples, and imaginative, with 126 samples.

These two domains are further distinguished into the following topic areas:

Press: reportage (44), Press: editorial (27), Press: reviews (17), Religion (17), Skills and Hobbies (36), Popular Lore (48), Belles Lettres, Biography, Memoirs (75), Miscellaneous (30), Natural Sciences (12), Medicine (5), Mathematics (4), Social and Behavioral Sciences (14), Political Science, Law, Education (15), Humanities (18), Technology and Engineering (12)

General Fiction (29), Mystery and Detective Fiction (24), Science Fiction (6), Adventure and Western Fiction (29), Romance and Love Story (29) Humor (9)

Similarly, the BNC can be categorized into informative and imaginative prose, and further into subdomains such as educational , public , business , and so on. A further discussion of how the BNC can be categorized can be found in Distributions Within Corpora .

As you can see from the numbers given for the Brown Corpus, not every category is equally represented, which seems to be a violation of the rule of “representative and balanced” that we discussed before. However, these corpora were not assembled with a specific task in mind; rather, they were meant to represent written and spoken language as a whole. Because of this, they attempt to embody a large cross section of existing texts, though whether they succeed in representing percentages of texts in the world is debatable (but also not terribly important).

For your own corpus, you may find yourself wanting to cover a wide variety of text, but it is likely that you will have a more specific task domain, and so your potential corpus will not need to include the full range of human expression. The Switchboard Corpus is an example of a corpus that was collected for a very specific purpose—Speech Recognition for phone operation—and so was balanced and representative of the different sexes and all different dialects in the United States.

Early Use of Corpora

One of the most common uses of corpora from the early days was the construction of concordances . These are alphabetical listings of the words in an article or text collection with references given to the passages in which they occur. Concordances position a word within its context, and thereby make it much easier to study how it is used in a language, both syntactically and semantically. In the 1950s and 1960s, programs were written to automatically create concordances for the contents of a collection, and the results of these automatically created indexes were called “Key Word in Context” indexes, or KWIC indexes . A KWIC index is an index created by sorting the words in an article or a larger collection such as a corpus, and aligning them in a format so that they can be searched alphabetically in the index. This was a relatively efficient means for searching a collection before full-text document search became available.

The way a KWIC index works is as follows. The input to a KWIC system is a file or collection structured as a sequence of lines. The output is a sequence of lines, circularly shifted and presented in alphabetical order of the first word. For an example, consider a short article of two sentences, shown in Figure 1-1 with the KWIC index output that is generated.
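
A minimal sketch of the circular-shift procedure just described (a toy version, not a historical implementation):

    def kwic_index(lines):
        # circularly shift each line so every word takes a turn at the
        # front, then sort the shifted lines alphabetically
        shifted = []
        for line in lines:
            words = line.split()
            for i in range(len(words)):
                shifted.append(" ".join(words[i:] + words[:i]))
        return sorted(shifted, key=str.lower)

    for entry in kwic_index(["The cat sat on the mat", "Dogs chase cats"]):
        print(entry)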

[Figure 1-1. Example of a KWIC index]

Another benefit of concordancing is that, by displaying the keyword in its context, you can visually inspect how the word is being used in a given sentence. To take a specific example, consider the different meanings of the English verb treat . Specifically, let’s look at the first two senses within sense (1) from the dictionary entry shown in Figure 1-2 .

[Figure 1-2. Senses of the word “treat”]

Now let’s look at the concordances compiled for this verb from the BNC, as differentiated by these two senses.

These concordances were compiled using the Word Sketch Engine , by the lexicographer Patrick Hanks, and are part of a large resource of sentence patterns using a technique called Corpus Pattern Analysis ( Pustejovsky et al. 2004 ; Hanks and Pustejovsky 2005 ).

What is striking when one examines the concordance entries for each of these senses is the fact that the contexts are so distinct. These are presented in Figures 1-3 and 1-4 .

[Figure 1-3. Sense (1a) for the verb “treat”]

The NLTK provides functionality for creating concordances. The easiest way to make a concordance is to simply load the preprocessed texts into the NLTK and then use the concordance function, like this:
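
A minimal sketch, using one of NLTK’s bundled sample texts:

    # a sketch; assumes the sample collection is installed: nltk.download('book')
    from nltk.book import text1    # text1 is Moby Dick in NLTK's samples

    text1.concordance("monstrous")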

If you have your own set of data for which you would like to create a concordance, then the process is a little more involved: you will need to read in your files and use the NLTK functions to process them before you can create your own concordance. Here is some sample code for a corpus of text files (replace the directory location with your own folder of text files):
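
A sketch under the assumption that your files are plain .txt documents in a single folder (the path below is a placeholder):

    from nltk.corpus import PlaintextCorpusReader

    corpus_root = '/path/to/your/text/files'      # placeholder path
    my_corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')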

You can see if the files were read by checking what file IDs are present:
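
Continuing the sketch above:

    print(my_corpus.fileids())    # lists the files the reader matched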

Next, process the words in the files and then use the concordance function to examine the data:
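
Continuing the sketch, NLTK’s Text class provides the concordance function:

    from nltk.text import Text

    my_text = Text(my_corpus.words())    # tokenized words from all files
    my_text.concordance("treat")         # substitute any keyword of interest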

Corpora Today

When did researchers start to actually use corpora for modeling language phenomena and training algorithms? Beginning in the 1980s, researchers in Speech Recognition began to compile enough spoken language data to create language models (from transcriptions, using n-grams and Hidden Markov Models [HMMs]) that worked well enough to recognize a limited vocabulary of words in a very narrow domain. In the 1990s, work in Machine Translation began to see the influence of larger and larger datasets, and with this, the rise of statistical language modeling for translation.

Eventually, both memory and computer hardware became sophisticated enough to collect and analyze increasingly larger datasets of language fragments. This entailed being able to create statistical language models that actually performed with some reasonable accuracy for different natural language tasks.

As one example of the increasing availability of data, Google has recently released the Google Ngram Corpus . The Google Ngram dataset allows users to search for single words (unigrams) or collocations of up to five words (5-grams). The dataset is available for download from the Linguistic Data Consortium, and directly from Google . It is also viewable online through the Google Ngram Viewer . The Ngram dataset consists of more than one trillion tokens (words, numbers, etc.) taken from publicly available websites and sorted by year, making it easy to view trends in language use. In addition to English, Google provides n-grams for Chinese, French, German, Hebrew, Russian, and Spanish, as well as subsets of the English corpus such as American English and English Fiction.

N-grams are sets of items (often words, but they can be letters, phonemes, etc.) that are part of a sequence. By examining how often the items occur together we can learn about their usage in a language, and predict what would likely follow a given sequence (using n-grams for this purpose is called n-gram modeling ).

N-grams are applied in a variety of ways every day, such as in websites that provide search suggestions once a few letters are typed in, and for determining likely substitutions for spelling errors. They are also used in speech disambiguation—if a person speaks unclearly but utters a sequence that does not commonly (or ever) occur in the language being spoken, an n-gram model can help recognize that problem and find the words that the speaker probably intended to say.
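
As a toy illustration of n-gram modeling (a bigram model over a tiny whitespace-tokenized string; real models are trained on far more data):

    from collections import Counter, defaultdict

    def train_bigram_model(tokens):
        # map each word to a Counter of the words that follow it
        model = defaultdict(Counter)
        for w1, w2 in zip(tokens, tokens[1:]):
            model[w1][w2] += 1
        return model

    tokens = "the cat sat on the mat and the cat slept".split()
    model = train_bigram_model(tokens)
    print(model["the"].most_common(1))    # most likely word after "the": [('cat', 2)]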

Another modern corpus is ClueWeb09 ( http://lemurproject.org/clueweb09.php/ ), a dataset “created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009.” This corpus is too large to use for an annotation project (it’s about 25 terabytes uncompressed), but some projects have taken parts of the dataset (such as a subset of the English websites) and used them for research ( Pomikálek et al. 2012 ). Data collection from the Internet is an increasingly common way to create corpora, as new and varied content is always being created.

Kinds of Annotation

Consider the different parts of a language’s syntax that can be annotated. These include part of speech (POS) , phrase structure , and dependency structure . Table 1-3 shows examples of each of these. There are many different tagsets for the parts of speech of a language that you can choose from.

The tagset in Figure 1-5 is taken from the Penn TreeBank, and is the basis for all subsequent annotation over that corpus.

[Figure 1-5. The Penn TreeBank tagset]

The POS tagging process involves assigning the right lexical class marker(s) to all the words in a sentence (or corpus). This is illustrated in a simple example, “The waiter cleared the plates from the table.” (See Figure 1-6 .)

[Figure 1-6. POS tagging sample]

POS tagging is a critical step in many NLP applications, since it is important to know what category a word is assigned to in order to perform subsequent analysis on it, such as the following:

Is the word a noun or a verb? Examples include object , overflow , insult , and suspect . Without context, each of these words could be either a noun or a verb.

You need POS tags in order to make larger syntactic units. For example, in the following sentences, is “clean dishes” a noun phrase or an imperative verb phrase?

Getting the POS tags and the subsequent parse right makes all the difference when translating the expressions in the preceding list item into another language, such as French: “Des assiettes propres” (Clean dishes) versus “Fais la vaisselle!” (Clean the dishes!).

Consider how these tags are used in the following sentence, from the Penn TreeBank ( Marcus et al. 1993 ):

Identifying the correct parts of speech in a sentence is a necessary step in building many natural language applications, such as parsers, Named Entity Recognizers, QAS, and Machine Translation systems. It is also an important step toward identifying larger structural units such as phrase structure.

Use the NLTK tagger to assign POS tags to the example sentence shown here, and then with other sentences that might be more ambiguous:
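
A sketch of this exercise with NLTK’s off-the-shelf tagger (the tokenizer and tagger models must be downloaded first):

    import nltk

    # prerequisites: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    tokens = nltk.word_tokenize("The waiter cleared the plates from the table.")
    print(nltk.pos_tag(tokens))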

Look for places where the tagger doesn’t work, and think about what rules might be causing these errors. For example, what happens when you try “Clean dishes are in the cabinet.” and “Clean dishes before going to work!”?

While words have labels associated with them (the POS tags mentioned earlier), specific sequences of words also have labels that can be associated with them. This is called syntactic bracketing (or labeling) and is the structure that organizes all the words we hear into coherent phrases. As mentioned earlier, syntax is the name given to the structure associated with a sentence. The Penn TreeBank is an annotated corpus with syntactic bracketing explicitly marked over the text. An example annotation is shown in Figure 1-7 .

[Figure 1-7. Syntactic bracketing]

This is a bracketed representation of the syntactic tree structure, which is shown in Figure 1-8 .

[Figure 1-8. Syntactic tree structure]

Notice that syntactic bracketing introduces two relations between the words in a sentence: order (precedence) and hierarchy (dominance). For example, the tree structure in Figure 1-8 encodes these relations by the very nature of a tree as a directed acyclic graph (DAG). In a very compact form, the tree captures the precedence and dominance relations given in the following list:

{Dom(NNP1,John), Dom(VPZ,loves), Dom(NNP2,Mary), Dom(NP1,NNP1), Dom(NP2,NNP2), Dom(S,NP1), Dom(VP,VPZ), Dom(VP,NP2), Dom(S,VP), Prec(NP1,VP), Prec(VPZ,NP2)}
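
The dominance relations can be recovered programmatically from the bracketed string; a sketch using nltk.Tree (without the numeric indices used above to distinguish the two NPs; precedence follows from the left-to-right order of sisters):

    from nltk import Tree

    tree = Tree.fromstring(
        "(S (NP (NNP John)) (VP (VPZ loves) (NP (NNP Mary))))")

    def dominance_relations(node):
        # yield immediate Dom(parent, child) pairs, depth first
        for child in node:
            if isinstance(child, Tree):
                yield ("Dom", node.label(), child.label())
                yield from dominance_relations(child)
            else:                          # a leaf, i.e. a word
                yield ("Dom", node.label(), child)

    for relation in dominance_relations(tree):
        print(relation)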

Any sophisticated natural language application requires some level of syntactic analysis, including Machine Translation. If the resources for full parsing (such as that shown earlier) are not available, then some sort of shallow parsing can be used. This is when partial syntactic bracketing is applied to sequences of words, without worrying about the details of the structure inside a phrase. We will return to this idea in later chapters.

In addition to POS tagging and syntactic bracketing, it is useful to annotate texts in a corpus for their semantic value, that is, what the words mean in the sentence. We can distinguish two kinds of annotation for semantic content within a sentence: what something is , and what role something plays. Here is a more detailed explanation of each:

Semantic typing: A word or phrase in the sentence is labeled with a type identifier (from a reserved vocabulary or ontology), indicating what it denotes.

Semantic role labeling: A word or phrase in the sentence is identified as playing a specific semantic role relative to a role assigner, such as a verb.

Let’s consider what annotation using these two strategies would look like, starting with semantic types. Types are commonly defined using an ontology, such as that shown in Figure 1-9 .

The word ontology has its roots in philosophy, but ontologies also have a place in computational linguistics, where they are used to create categorized hierarchies that group similar concepts and objects. By assigning words semantic types in an ontology, we can create relationships between different branches of the ontology, and determine whether linguistic rules hold true when applied to all the words in a category.

[Figure 1-9. A simple ontology]

The ontology in Figure 1-9 is rather simple, with a small set of categories. However, even this small ontology can be used to illustrate some interesting features of language. Consider the following example, with semantic types marked:

[Ms. Ramirez]_Person of [QBC Productions]_Organization visited [Boston]_Place on [Saturday]_Time, where she had lunch with [Mr. Harris]_Person of [STU Enterprises]_Organization at [1:15 pm]_Time.

From this small example, we can start to make observations about how these objects interact with one other. People can visit places, people have “of” relationships with organizations, and lunch can happen on Saturday at 1:15 p.m. Given a large enough corpus of similarly labeled sentences, we can start to detect patterns in usage that will tell us more about how these labels do and do not interact.

A corpus of these examples can also tell us where our categories might need to be expanded. There are two “times” in this sentence: Saturday and 1:15 p.m. We can see that events can occur “on” Saturday, but “at” 1:15 p.m. A larger corpus would show that this pattern remains true with other days of the week and hour designations—there is a difference in usage here that cannot be inferred from the semantic types. However, not all ontologies will capture all information—the applications of the ontology will determine whether it is important to capture the difference between Saturday and 1:15 p.m.

The annotation strategy we just described marks up what a linguistic expression refers to. But let’s say we want to know the basics for Question Answering , namely, the who , what , where , and when of a sentence. This involves identifying what are called the semantic role labels associated with a verb. What are semantic roles? Although there is no complete agreement on what roles exist in language (there rarely is with linguists), the following list is a fair representation of the kinds of semantic labels associated with different verbs:

Agent: The event participant that is doing or causing the event to occur.

Theme: The event participant who undergoes a change in position or state.

Experiencer: The event participant who experiences or perceives something.

Source: The location or place from which the motion begins; the person from whom the theme is given.

Goal: The location or place to which the motion is directed or terminates.

Recipient: The person who comes into possession of the theme.

Patient: The event participant who is affected by the event.

Instrument: The event participant used by the agent to do or cause the event.

Location: The location or place associated with the event itself.

The annotated data that results explicitly identifies entity extents and the target relations between the entities:

[The man]_agent painted [the wall]_patient with [a paint brush]_instrument.

[Mary]_figure walked to [the cafe]_goal from [her house]_source.

[John]_agent gave [his mother]_recipient [a necklace]_theme.

[My brother]_theme lives in [Milwaukee]_location.
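
Annotation projects store such labels in different ways; one common option, sketched here for the first example, is a standoff record of character offsets into the raw sentence:

    sentence = "The man painted the wall with a paint brush."
    annotations = [
        {"role": "agent",      "start": 0,  "end": 7},    # "The man"
        {"role": "patient",    "start": 16, "end": 24},   # "the wall"
        {"role": "instrument", "start": 30, "end": 43},   # "a paint brush"
    ]
    for a in annotations:
        print(a["role"], "->", sentence[a["start"]:a["end"]])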

Language Data and Machine Learning

Now that we have reviewed the methodology of language annotation along with some examples of annotation formats over linguistic data, we will describe the computational framework within which such annotated corpora are used, namely, that of machine learning. Machine learning is the name given to the area of Artificial Intelligence concerned with the development of algorithms that learn or improve their performance from experience or previous encounters with data. They are said to learn (or generate) a function that maps particular input data to the desired output. For our purposes, the “data” that an ML algorithm encounters is natural language, most often in the form of text, and typically annotated with tags that highlight the specific features that are relevant to the learning task. As we will see, the annotation schemas discussed earlier, for example, provide rich starting points as the input data source for the ML process (the training phase).

When working with annotated datasets in NLP, three major types of ML algorithms are typically used:

Supervised learning: Any technique that generates a function mapping from inputs to a fixed set of labels (the desired output). The labels are typically metadata tags provided by humans who annotate the corpus for training purposes.

Unsupervised learning: Any technique that tries to find structure from an input set of unlabeled data.

Semi-supervised learning: Any technique that generates a function mapping from inputs of both labeled data and unlabeled data; a combination of both supervised and unsupervised learning.

Table 1-4 shows a general overview of ML algorithms and some of the annotation tasks they are frequently used to emulate. We’ll talk more about why these algorithms are used for these different tasks in Chapter 7 .

You’ll notice that some of the tasks appear with more than one algorithm. That’s because different approaches have been tried successfully for different types of annotation tasks, and depending on the most relevant features of your own corpus, different algorithms may prove to be more or less effective. Just to give you an idea of what the algorithms listed in that table mean, the rest of this section gives an overview of the main types of ML algorithms.

Classification

Classification is the task of identifying the labeling for a single entity from a set of data. For example, in order to distinguish spam from not-spam in your email inbox, an algorithm called a classifier is trained on a set of labeled data, where individual emails have been assigned the label [+spam] or [-spam]. It is the presence of certain (known) words or phrases in an email that helps to identify an email as spam. These words are essentially treated as features that the classifier will use to model the positive instances of spam as compared to not-spam. Another example of a classification problem is patient diagnosis, from the presence of known symptoms and other attributes. Here we would identify a patient as having a particular disease, A, and label the patient record as [+disease-A] or [-disease-A], based on specific features from the record or text. This might include blood pressure, weight, gender, age, existence of symptoms, and so forth. The most common algorithms used in classification tasks are Maximum Entropy (MaxEnt), Naïve Bayes, decision trees, and Support Vector Machines (SVMs).
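
A toy sketch of such a spam classifier, using NLTK’s built-in Naïve Bayes implementation (the four training messages are purely illustrative):

    import nltk

    def features(text):
        # bag-of-words features: which words appear in the message
        return {word: True for word in text.lower().split()}

    train = [
        (features("win money now"), "spam"),
        (features("cheap pills offer"), "spam"),
        (features("meeting agenda attached"), "not-spam"),
        (features("lunch on friday?"), "not-spam"),
    ]
    classifier = nltk.NaiveBayesClassifier.train(train)
    print(classifier.classify(features("win a cheap offer now")))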

Clustering

Clustering is the name given to ML algorithms that find natural groupings and patterns from the input data, without any labeling or training at all. The problem is generally viewed as an unsupervised learning task, where either the dataset is unlabeled or the labels are ignored in the process of making clusters. The objects within a cluster are similar to one another in some respect, and dissimilar to the objects in other clusters. Some of the more common algorithms used for this task include k-means, hierarchical clustering, Kernel Principal Component Analysis, and Fuzzy C-Means (FCM).

Structured Pattern Induction

Structured pattern induction involves learning not only the label or category of a single entity, but rather learning a sequence of labels, or other structural dependencies between the labeled items. For example, a sequence of labels might be a stream of phonemes in a speech signal (in Speech Recognition); a sequence of POS tags in a sentence corresponding to a syntactic unit (phrase); a sequence of dialog moves in a phone conversation; or steps in a task such as parsing, coreference resolution, or grammar induction. Algorithms used for such problems include Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), and Maximum Entropy Markov Models (MEMMs).

We will return to these approaches in more detail when we discuss machine learning in greater depth in Chapter 7 .

The Annotation Development Cycle

The features we use for encoding a specific linguistic phenomenon must be rich enough to capture the desired behavior in the algorithm that we are training. These linguistic descriptions are typically distilled from extensive theoretical modeling of the phenomenon. The descriptions in turn form the basis for the annotation values of the specification language, which are themselves the features used in a development cycle for training and testing an identification or labeling algorithm over text. Finally, based on an analysis and evaluation of the performance of a system, the model of the phenomenon may be revised for retraining and testing.

We call this particular cycle of development the MATTER methodology, as detailed here and shown in Figure 1-10 ( Pustejovsky 2006 ):

Model: Structural descriptions provide theoretically informed attributes derived from empirical observations over the data.

Annotate: An annotation scheme assumes a feature set that encodes specific structural descriptions and properties of the input data.

Train: The algorithm is trained over a corpus annotated with the target feature set.

Test: The algorithm is tested against held-out data.

Evaluate: A standardized evaluation of results is conducted.

Revise: The model and the annotation specification are revisited in order to make the annotation more robust and reliable with use in the algorithm.

[Figure 1-10. The MATTER cycle]

We assume that some particular problem or phenomenon has sparked your interest, and that you will need to label natural language data to train a machine learning algorithm. Consider two kinds of problems. First, imagine a direct text classification task. It might be that you are interested in classifying your email according to its content, or with a particular interest in filtering out spam. Or perhaps you are interested in rating your incoming mail on a scale of what emotional content is being expressed in the message.

Now let’s consider a more involved task, performed over this same email corpus: identifying what are known as Named Entities (NEs) . These are references to everyday things in our world that have proper names associated with them; for example, people, countries, products, holidays, companies, sports, religions, and so on.

Finally, imagine an even more complicated task, that of identifying all the different events that have been mentioned in your mail (birthdays, parties, concerts, classes, airline reservations, upcoming meetings, etc.). Once this has been done, you will need to “timestamp” them and order them, that is, identify when they happened, if in fact they did happen. This is called the temporal awareness problem , and is one of the most difficult in the field.

We will use these different tasks throughout this section to help us clarify what is involved with the different steps in the annotation development cycle.

Model the Phenomenon

The first step in the MATTER development cycle is “Model the Phenomenon.” The steps involved in modeling, however, vary greatly, depending on the nature of the task you have defined for yourself. In this section, we will look at what modeling entails and how you know when you have an adequate first approximation of a model for your task.

The parameters associated with creating a model are quite diverse, and it is difficult to get different communities to agree on just what a model is. In this section we will be pragmatic and discuss a number of approaches to modeling, showing how they provide the basis from which to create annotated datasets. Briefly, a model is a characterization of a certain phenomenon in terms that are more abstract than the elements in the domain being modeled. For the following discussion, we will define a model as consisting of a vocabulary of terms, T , the relations between these terms, R , and their interpretation, I . So, a model, M , can be seen as a triple, M = <T,R,I> . To better understand this notion of a model, let us consider the scenarios introduced earlier. Spam detection can be treated as a binary text classification task, requiring the simplest model with the categories (terms) spam and not-spam associated with the entire email document. Hence, our model is simply:

T = {Document_type, Spam, Not-Spam}

R = {Document_type ::= Spam | Not-Spam}

I = {Spam = “something we don’t want!”, Not-Spam = “something we do want!”}
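To make the triple concrete, here is a minimal sketch of the spam model as a plain data structure. The Python layout and the helper function are our own illustration, not part of the MATTER methodology.

```python
# A minimal sketch of the M = <T, R, I> spam model as a Python structure.
# The field names (terms, relations, interpretation) are our own labels.
spam_model = {
    "terms": {"Document_type", "Spam", "Not-Spam"},
    "relations": {"Document_type": ["Spam", "Not-Spam"]},  # Document_type ::= Spam | Not-Spam
    "interpretation": {
        "Spam": "something we don't want!",
        "Not-Spam": "something we do want!",
    },
}

def label_is_valid(model, label):
    """Check that a document label is one of the allowed Document_type values."""
    return label in model["relations"]["Document_type"]

print(label_is_valid(spam_model, "Spam"))    # True
print(label_is_valid(spam_model, "Urgent"))  # False
```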

The document itself is labeled as being a member of one of these categories. This is called document annotation and is the simplest (and most coarse-grained) annotation possible. Now, when we say that the model contains only the label names for the categories (e.g., sports, finance, news, editorials, fashion, etc.), this means there is no other annotation involved. This does not mean the content of the files is not subject to further scrutiny, however. A document that is labeled as a category, A , for example, is actually analyzed as a large feature vector containing at least the words in the document. A more fine-grained annotation for the same task would be to identify specific words or phrases in the document and label them as also being associated with the category directly. We’ll return to this strategy in Chapter 4 . Essentially, a good model of the phenomenon (task) is the starting point for designing the features that go into your learning algorithm. The better the features, the better the performance of the ML algorithm!
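As a rough sketch of such a feature vector, a document can be reduced to a bag of word counts (real systems typically add tokenization, normalization, and richer features):

```python
# A minimal bag-of-words feature vector: the document becomes a mapping from
# each word to its count.
from collections import Counter

def bag_of_words(document):
    return Counter(document.lower().split())

print(bag_of_words("Cheap meds! Buy now, buy cheap"))
# Counter({'cheap': 2, 'buy': 2, 'meds!': 1, 'now,': 1})
```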

Preparing a corpus with annotations of NEs, as mentioned earlier, involves a richer model than the spam-filter application just discussed. We introduced a four-category ontology for NEs in the previous section, and this will be the basis for our model to identify NEs in text. The model is illustrated as follows:

T = {Named_Entity, Organization, Person, Place, Time}

R = {Named_Entity ::= Organization | Person | Place | Time}

I = {Organization = “list of organizations in a database”, Person = “list of people in a database”, Place = “list of countries, geographic locations, etc.”, Time = “all possible dates on the calendar”}

This model is necessarily more detailed, because we are actually annotating spans of natural language text, rather than simply labeling documents (e.g., emails) as spam or not-spam. That is, within the document, we are recognizing mentions of companies, actors, countries, and dates.
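A minimal sketch of what span-based NE annotation might look like as data, assuming character offsets into the raw text; the sentence, offsets, and class layout are invented for illustration:

```python
# Hypothetical span-based NE annotation over raw text; offsets are character
# indices into the string. Tag names follow the four-category model above.
from dataclasses import dataclass

@dataclass
class NETag:
    label: str   # Organization | Person | Place | Time
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive

text = "IBM hired John in Boston on Monday."
annotations = [
    NETag("Organization", 0, 3),
    NETag("Person", 10, 14),
    NETag("Place", 18, 24),
    NETag("Time", 28, 34),
]

for tag in annotations:
    print(tag.label, repr(text[tag.start:tag.end]))
```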

Finally, what about an even more involved task, that of recognizing all temporal information in a document? That is, questions such as the following:

When did that meeting take place?

How long was John on vacation?

Did Jill get promoted before or after she went on maternity leave?

We won’t go into the full model for this domain, but let’s see what is minimally necessary in order to create annotation features to understand such questions. First we need to distinguish between Time expressions (“yesterday,” “January 27,” “Monday”), Events (“promoted,” “meeting,” “vacation”), and Temporal relations (“before,” “after,” “during”). Because our model is so much more detailed, let’s divide the descriptive content by domain:

Time_Expression ::= TIME | DATE | DURATION | SET

TIME: 10:15 a.m., 3 o’clock, etc.

DATE: Monday, April 2011

DURATION: 30 minutes, two years, four days

SET: every hour, every other month

Event: Meeting, vacation, promotion, maternity leave, etc.

Temporal_Relations ::= BEFORE | AFTER | DURING | EQUAL | OVERLAP | ...
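The fragments of this model can be mirrored directly as vocabularies in code. The following sketch uses the category names from the grammar above; the data layout is purely illustrative:

```python
# The minimal temporal model above, mirrored as Python vocabularies. The
# pairing of examples with types comes from the text.
TIMEX_TYPES = {"TIME", "DATE", "DURATION", "SET"}
TEMPORAL_RELATIONS = {"BEFORE", "AFTER", "DURING", "EQUAL", "OVERLAP"}

examples = [
    ("TIME", "10:15 a.m."),
    ("DATE", "Monday, April 2011"),
    ("DURATION", "30 minutes"),
    ("SET", "every hour"),
]

for timex_type, expression in examples:
    assert timex_type in TIMEX_TYPES  # every example fits a declared type
    print(f"Time_Expression/{timex_type}: {expression}")
```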

We will come back to this problem in a later chapter, when we discuss the impact of the initial model on the subsequent performance of the algorithms you are trying to train over your labeled data.

In later chapters, we’ll see that there are actually several models that might be appropriate for describing a phenomenon, each providing a different view of the data. We’ll call this multimodel annotation of the phenomenon. A common scenario for multimodel annotation involves annotators who have domain expertise in an area (such as biomedical knowledge). They are told to identify specific entities, events, attributes, or facts from documents, given their knowledge and interpretation of a specific area. From this annotation, nonexperts can be used to mark up the structural (syntactic) aspects of these same phenomena, making it possible to gain domain expert understanding without forcing the domain experts to learn linguistic theory as well.

Once you have an initial model for the phenomena associated with the problem task you are trying to solve, you effectively have the first tag specification , or spec , for the annotation. This is the document from which you will create the blueprint for how to annotate the corpus with the features in the model. This is called the annotation guideline , and we talk about this in the next section.

Annotate with the Specification

Now that you have a model of the phenomenon encoded as a specification document, you will need to train human annotators to mark up the dataset according to the tags that are important to you. This is easier said than done, and in fact often requires multiple iterations of modeling and annotating, as shown in Figure 1-11 . This process is called the MAMA (Model-Annotate-Model-Annotate) cycle, or the “babeling” phase of MATTER. The annotation guideline helps direct the annotators in the task of identifying the elements and then associating the appropriate features with them, when they are identified.

Two kinds of tags will concern us when annotating natural language data: consuming tags and nonconsuming tags. A consuming tag refers to a metadata tag that has real content from the dataset associated with it (e.g., it “consumes” some text); a nonconsuming tag, on the other hand, is a metadata tag that is inserted into the file but is not associated with any actual part of the text. An example will help make this distinction clear. Say that we want to annotate text for temporal information, as discussed earlier. Namely, we want to annotate for three kinds of tags: times (called Timex tags), temporal relations (TempRels), and Events. In the first sentence in the following example, each tag is expressed directly as real text. That is, they are all consuming tags (“promoted” is marked as an Event, “before” is marked as a TempRel, and “the summer” is marked as a Timex). Notice, however, that in the second sentence, there is no explicit temporal relation in the text, even though we know that it’s something like “on”. So, we actually insert a TempRel with the value of “on” in our corpus, but the tag is flagged as a “nonconsuming” tag.

John was [promoted] Event [before] TempRel [the summer] Timex .

John was [promoted] Event [Monday] Timex .
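One way to represent this distinction in data is to make the text span optional: consuming tags carry a span, nonconsuming tags do not. The following sketch uses the running Event/TempRel/Timex example; the attribute layout is ours, not a published standard:

```python
# Sketch: consuming tags carry a text span; nonconsuming tags have span=None.
sentence2 = "John was promoted Monday."

tags = [
    {"type": "Event",   "span": (9, 17),  "text": "promoted"},  # consuming
    {"type": "Timex",   "span": (18, 24), "text": "Monday"},    # consuming
    {"type": "TempRel", "span": None,     "value": "on"},       # nonconsuming
]

for tag in tags:
    if tag["span"] is None:
        print(f"{tag['type']}: nonconsuming, value={tag['value']!r}")
    else:
        start, end = tag["span"]
        assert sentence2[start:end] == tag["text"]
        print(f"{tag['type']}: consumes {sentence2[start:end]!r}")
```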

An important factor when creating an annotated corpus of your text is, of course, consistency in the way the annotators mark up the text with the different tags. One of the most seemingly trivial problems is the most problematic when comparing annotations: namely, the extent or the span of the tag . Compare the three annotations that follow. In the first, the Organization tag spans “QBC Productions,” leaving out the company identifier “Inc.” and the location “of East Anglia,” while these are included in varying spans in the next two annotations.

[QBC Productions] Organization Inc. of East Anglia

[QBC Productions Inc.] Organization of East Anglia

[QBC Productions Inc. of East Anglia] Organization

Each of these might look correct to an annotator, but only one actually corresponds to the correct markup in the annotation guideline. How are these compared and resolved?
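In code, the difference between agreement criteria is easy to state: a strict criterion requires identical labels and boundaries, while a relaxed criterion may accept overlapping spans. A small sketch, with character offsets into the example phrase:

```python
# Comparing annotation spans: under a strict criterion all three bracketings
# above disagree; a relaxed overlap criterion counts them as partial matches.
# Offsets index into "QBC Productions Inc. of East Anglia".
def strict_match(a, b):
    return a == b                      # label, start, and end must all agree

def overlap(a, b):
    (_, s1, e1), (_, s2, e2) = a, b
    return max(s1, s2) < min(e1, e2)   # spans share at least one character

ann1 = ("Organization", 0, 15)  # [QBC Productions]
ann2 = ("Organization", 0, 20)  # [QBC Productions Inc.]
ann3 = ("Organization", 0, 35)  # [QBC Productions Inc. of East Anglia]

print(strict_match(ann1, ann2))  # False
print(overlap(ann1, ann3))       # True
```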

Figure 1-11. The inner workings of the MAMA portion of the MATTER cycle

In order to assess how well an annotation task is defined, we use Inter-Annotator Agreement (IAA) scores to show how individual annotators compare to one another. If an IAA score is high, that is an indication that the task is well defined and other annotators will be able to continue the work. This is typically defined using a statistical measure called a Kappa Statistic . For comparing two annotations against each other, the Cohen Kappa is usually used, while when comparing more than two annotations, a Fleiss Kappa measure is used. These will be defined in Chapter 8 .

Note that having a high IAA score doesn’t necessarily mean the annotations are correct; it simply means the annotators are all interpreting your instructions consistently in the same way. Your task may still need to be revised even if your IAA scores are high. This will be discussed further in Chapter 9 .
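For two annotators, Cohen Kappa is observed agreement corrected for the agreement expected by chance, with chance estimated from each annotator's label distribution. A minimal sketch, assuming both annotators labeled the same items in the same order:

```python
# Cohen's Kappa for two annotators' label sequences.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both annotators pick the same label
    # independently, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(
        (freq_a[lab] / n) * (freq_b[lab] / n)
        for lab in set(labels_a) | set(labels_b)
    )
    return (p_observed - p_expected) / (1 - p_expected)

a = ["spam", "spam", "ham", "ham", "spam", "ham"]
b = ["spam", "ham",  "ham", "ham", "spam", "ham"]
print(round(cohen_kappa(a, b), 3))  # 0.667
```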

Once you have your corpus annotated by at least two people (more is preferable, but not always practical), it’s time to create the gold standard corpus . The gold standard is the final version of your annotated data. It uses the most up-to-date specification that you created during the annotation process, and it has everything tagged correctly according to the most recent guidelines. This is the corpus that you will use for machine learning, and it is created through the process of adjudication . At this point in the process, you (or someone equally familiar with all the tasks) will compare the annotations and determine which tags in the annotations are correct and should be included in the gold standard.

Train and Test the Algorithms over the Corpus

Now that you have adjudicated your corpus, you can use your newly created gold standard for machine learning. The most common way to do this is to divide your corpus into two parts: the development corpus and the test corpus . The development corpus is then further divided into two parts: the training set and the development-test set . Figure 1-12 shows a standard breakdown of a corpus, though different distributions might be used for different tasks. The files are normally distributed randomly into the different sets.

Figure 1-12. Corpus divisions for machine learning

The training set is used to train the algorithm that you will use for your task. The development-test (dev-test) set is used for error analysis. Once the algorithm is trained, it is run on the dev-test set and a list of errors can be generated to find where the algorithm is failing to correctly label the corpus. Once sources of error are found, the algorithm can be adjusted and retrained, then tested against the dev-test set again. This procedure can be repeated until satisfactory results are obtained.

Once the training portion is completed, the algorithm is run against the held-out test corpus, which until this point has not been involved in training or dev-testing. By holding out the data, we can show how well the algorithm will perform on new data, which gives an expectation of how it would perform on data that someone else creates as well. Figure 1-13 shows the “TTER” portion of the MATTER cycle, with the different corpus divisions and steps.
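A sketch of such a split; the 80/10/10 proportions are illustrative, not the exact breakdown in Figure 1-12:

```python
# Random corpus split into train / dev-test / held-out test.
import random

def split_corpus(files, train=0.8, dev=0.1, seed=42):
    files = list(files)
    random.Random(seed).shuffle(files)          # distribute files randomly
    n_train = int(len(files) * train)
    n_dev = int(len(files) * dev)
    return (files[:n_train],                    # training set
            files[n_train:n_train + n_dev],     # dev-test set (error analysis)
            files[n_train + n_dev:])            # held-out test set

train_set, dev_test_set, test_set = split_corpus(f"doc{i}.xml" for i in range(100))
print(len(train_set), len(dev_test_set), len(test_set))  # 80 10 10
```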

Figure 1-13. The Training–Evaluation cycle

Evaluate the Results

The most common method for evaluating the performance of your algorithm is to calculate how accurately it labels your dataset. This can be done by measuring the fraction of the results from the dataset that are labeled correctly using a standard technique of “relevance judgment” called the Precision and Recall metric .

Here’s how it works. For each label you are using to identify elements in the data, the dataset is divided into two subsets: one that is labeled “relevant” to the label, and one that is not relevant. Precision is computed as the fraction of correct instances among those that the algorithm labeled as being in the relevant subset. Recall is computed as the fraction of correct instances among those that actually belong to the relevant subset. The following confusion matrix helps illustrate how this works:

                        Gold: relevant        Gold: not relevant
Labeled relevant        true positive (tp)    false positive (fp)
Labeled not relevant    false negative (fn)   true negative (tn)

Given this matrix, we can define both precision and recall, along with a conventional definition of accuracy:

Precision (P) = tp / (tp + fp)
Recall (R) = tp / (tp + fn)
Accuracy = (tp + tn) / (tp + fp + tn + fn)

The values of P and R are typically combined into a single metric called the F-measure , which is the harmonic mean of the two.

F = (2 × P × R) / (P + R)

This creates an overall score used for evaluation where precision and recall are measured equally, though depending on the purpose of your corpus and algorithm, a variation of this measure, such as one that rates precision higher than recall, may be more useful to you. We will give more detail about how these equations are used for evaluation in Chapter 8 .
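A short sketch of these computations; the β parameter of the generalized F-measure is one conventional way to rate precision above recall (β < 1) or below it (β > 1):

```python
# Precision, recall, and F-measure from confusion-matrix counts, following
# the definitions above.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of P and R; beta > 1 weights recall more,
    beta < 1 weights precision more."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p = precision(tp=40, fp=10)                 # 0.8
r = recall(tp=40, fn=20)                    # 0.667
print(round(f_measure(p, r), 3))            # 0.727 (balanced F1)
print(round(f_measure(p, r, beta=0.5), 3))  # weights precision more
```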

Revise the Model and Algorithms

Once you have evaluated the results of training and testing your algorithm on the data, you will want to do an error analysis to see where it performed well and where it made mistakes. This can be done with various packages and formulas, which we will discuss in Chapter 8 , including the creation of what are called confusion matrices. These will help you go back to the design of the model, in order to create better tags and features that will subsequently improve your gold standard, and consequently result in better performance of your learning algorithm.

A brief example of model revision will help make this point. Recall the model for NE extraction from the previous section, where we distinguished between four types of entities: Organization, Place, Time, and Person. Depending on the corpus you have assembled, it might be the case that you are missing a major category, or that you would be better off making some subclassifications within one of the existing tags. For example, you may find that the annotators are having a hard time knowing what to do with named occurrences or events, such as Easter, 9-11, or Thanksgiving. These denote more than simply Times, and suggest that perhaps a new category should be added to the model: Event. Additionally, it might be the case that there is reason to distinguish geopolitical Places from nongeopolitical Places. As with the “Model-Annotate” and “Train-Test” cycles, once such additions and modifications are made to the model, the MATTER cycle begins all over again, and revisions will typically bring improved performance.

In this chapter, we have provided an overview of the history of corpus and computational linguistics, and the general methodology for creating an annotated corpus. Specifically, we have covered the following points:

Natural language annotation is an important step in the process of training computers to understand human speech for tasks such as Question Answering, Machine Translation, and summarization.

All of the layers of linguistic research, from phonetics to semantics to discourse analysis, are used in different combinations for different ML tasks.

In order for annotation to provide statistically useful results, it must be done on a sufficiently large dataset, called a corpus . The study of language using corpora is corpus linguistics .

Corpus linguistics began in the 1940s, but did not become a feasible way to study language until decades later, when the technology caught up to the demands of the theory.

A corpus is a collection of machine-readable texts that are representative of natural human language. Good corpora are representative and balanced with respect to the genre or language that they seek to represent.

The uses of computers with corpora have developed over the years from simple key-word-in-context (KWIC) indexes and concordances that allowed full-text documents to be searched easily, to modern, statistically based ML techniques.

Annotation is the process of augmenting a corpus with higher-level information, such as part-of-speech tagging, syntactic bracketing, anaphora resolution, and word senses. Adding this information to a corpus allows the computer to find features that can make a defined task easier and more accurate.

Once a corpus is annotated, the data can be used in conjunction with ML algorithms that perform classification, clustering, and pattern induction tasks.

Having a good annotation scheme and accurate annotations is critical for machine learning that relies on data outside of the text itself. The process of developing the annotated corpus is often cyclical, with changes made to the tagsets and tasks as the data is studied further.

Here we refer to the annotation development cycle as the MATTER cycle—Model, Annotate, Train, Test, Evaluate, Revise.

Often before reaching the Test step of the process, the annotation scheme has already gone through several revisions of the Model and Annotate stages.

This book will show you how to create an accurate and effective annotation scheme for a task of your choosing, apply the scheme to your corpus, and then use ML techniques to train a computer to perform the task you designed.


Corpora Annotated with Negation: An Overview


Salud María Jiménez-Zafra, Roser Morante, María Teresa Martín-Valdivia, L. Alfonso Ureña-López. Corpora Annotated with Negation: An Overview. Computational Linguistics 2020; 46(1): 1–52. doi: https://doi.org/10.1162/coli_a_00371

Negation is a universal linguistic phenomenon with a great qualitative impact on natural language processing applications. The availability of corpora annotated with negation is essential to training negation processing systems. Currently, most corpora have been annotated for English, but the presence of languages other than English on the Internet, such as Chinese or Spanish, grows every day. In this study, we present a review of the corpora annotated with negation information in several languages with the goal of evaluating what aspects of negation have been annotated and how compatible the corpora are. We conclude that it is very difficult to merge the existing corpora because we found differences in the annotation schemes used, and most importantly, in the annotation guidelines: the way in which each corpus was tokenized and the negation elements that have been annotated. Unlike for other well-established tasks such as semantic role labeling or parsing, for negation there is no standard annotation scheme or guidelines, which hampers progress in its treatment.

1 Introduction

Negation is a key universal phenomenon in language. All languages possess different types of resources (morphological, lexical, syntactic) that allow speakers to speak about properties that people or things do not hold or events that do not happen. The presence of a negation in a sentence can have enormous consequences in many real world situations: A world in which Donald Trump was elected as president would be very different from a world in which Donald Trump was not elected as president, for example. Thus, the presence of a single particle modifying a proposition describes a completely different situation. Negation is a main linguistic phenomenon and the issue of its computational treatment has not been resolved yet due to its complexity, the multiple linguistic forms in which it can appear, and the different ways it can act on the words within its scope. If we want to develop systems that approach human understanding, it is necessary to incorporate the treatment of one of the main linguistic phenomena used by people in their daily communication.

Natural language processing (NLP) is a subfield of artificial intelligence that focuses on the processing and generation of human language in order for computers to learn, understand, and produce human language (Hirschberg and Manning 2015 ). Some linguistic phenomena such as negation, speculation, irony, or sarcasm pose challenges for computational natural language learning. One might think that, given the fact that negations are so crucial in language, most NLP pipelines incorporate negation modules and that the computational linguistics community has already addressed this phenomenon. However, this is not the case. Work on processing negation started relatively late compared to work on processing other linguistic phenomena and, as a matter of fact, there are no publicly available off-the-shelf tools that can be easily incorporated into applications to detect negations.

Work on negation started in 2001 with the aim of processing clinical records (Chapman et al. 2001a ; Mutalik, Deshpande, and Nadkarni 2001 ; Goldin and Chapman 2003 ). Some rule-based systems were developed based on lists of negations and stop words (Mitchell et al. 2004 ; Harkema et al. 2009 ; Mykowiecka, Marciniak, and Kupść 2009 ; Uzuner, Zhang, and Sibanda 2009 ; Sohn, Wu, and Chute 2012 ). With the surge of opinion mining, negation was studied as a marker of polarity change (Das and Chen 2001 ; Wilson, Wiebe, and Hoffmann 2005 ; Polanyi and Zaenen 2006 ; Taboada et al. 2011 ; Jiménez-Zafra et al. 2017 ). Only with the release of the BioScope corpus (Vincze et al. 2008 ) did the work on negation receive a boost. But even so, despite the existence of several publications that focus on negation, it is difficult to find a negation processor for languages other than English. For English, some systems are available for processing clinical documents (NegEx (Chapman et al. 2001b ), ConText (Harkema et al. 2009 ), Deepen (Mehrabi et al. 2015 )) and, recently, a tool for detecting negation cues and scopes in natural language texts has been published (Enger, Velldal, and Øvrelid 2017 ).

Four tasks are usually performed in relation to processing negation: (i) negation cue detection, in order to find the words that express negation; (ii) scope identification, in order to find which parts of the sentence are affected by the negation cues; (iii) negated event recognition, to determine which events are affected by the negation cues; and (iv) focus detection, in order to find the part of the scope that is most prominently negated. Most of the works have modeled these tasks as token-level classification tasks, where a token is classified as being at the beginning, inside, or outside a negation cue, scope, event, or focus. Scope, event, and focus identification tasks are more complex because they depend on negation cue detection. In this article we focus on reviewing existing corpora annotated with negation, without entering in the realm of reviewing negation processing systems.
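A minimal sketch of the token-level encoding these works use, with invented BIO tags for cue and scope; whether the cue itself belongs inside the scope is guideline-dependent, as discussed later:

```python
# Token-level BIO labeling for negation cue detection and scope
# identification, the first two of the four tasks above. The sentence and
# tags are invented for illustration; whether the cue is inside the scope
# depends on the corpus guidelines (cf. BioScope vs. ConanDoyle-neg).
tokens     = ["The", "drug", "did", "not",   "affect",  "glucose", "tolerance", "."]
cue_tags   = ["O",   "O",    "O",   "B-CUE", "O",       "O",       "O",         "O"]
scope_tags = ["O",   "O",    "O",   "O",     "B-SCOPE", "I-SCOPE", "I-SCOPE",   "O"]

def tagged_tokens(tags, tokens):
    """Recover the tokens covered by a BIO tag sequence."""
    return [tok for tok, tag in zip(tokens, tags) if tag != "O"]

print(tagged_tokens(cue_tags, tokens))    # ['not']
print(tagged_tokens(scope_tags, tokens))  # ['affect', 'glucose', 'tolerance']
```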

Most applications treat negation in an ad hoc manner by processing main negation constructions, but processing negation is not as easy as using a list of negation markers and applying look-up methods because negation cues do not always act as negators. For example, in the sentence “You bought the car to use it, didn’t you?” the cue “not” is not used as a negation but it is used to reinforce the first part of the sentence. We believe that there are three main reasons for which most applications treat negation in an ad hoc manner: One is that negation is a complex phenomenon, which has not been completely modeled yet. In this way it is similar to phenomena like factuality for which it is necessary to read large amounts of theoretical literature in order to put together a model, as shown by Saurí’s work on modeling factuality for its computational treatment (Saurí and Pustejovsky 2009 ). A second reason is that, although negation is a phenomenon of habitual use in language, it is difficult to measure its quantitative impact in some tasks such as anaphora resolution or text simplification. The number of sentences with negation in the English texts of the corpora analyzed is between 9.37% and 32.16%, whereas in Spanish texts it is between 10.67% and 34.22%, depending on the domain. In order to evaluate the improvement that processing negation produces, it would be necessary to focus only on those parts of the text in which negation is present and perform an evaluation before and after its treatment. However, from a qualitative perspective, its impact is very high—for example, when processing clinical records, because the health of patients is at stake. A third reason is that there are no large corpora exhaustively annotated with negation phenomena, which hinders the development of machine learning systems.

Processing negation is relevant for a wide range of applications, such as information retrieval (Liddy et al. 2000 ), information extraction (Savova et al. 2010 ), machine translation (Baker et al. 2012 ), or sentiment analysis (Liu 2015 ). Information retrieval systems aim to provide relevant documents from a collection, given a user query. Negation has an important role because a search ( “recipes with milk and cheese” ) is not the same as its negated version ( “recipes without milk and cheese” ). The information retrieval system must return completely different documents for both queries. In other tasks, such as information extraction, negation analysis is also beneficial. Clinical texts often refer to negative findings, that is, conditions that are not present in the patient. Processing negation in these documents is crucial because the health of patients is at stake. For example, a diagnosis of a patient will be totally different if negation is not detected in the sentence “No signs of DVT.” Translating a negative sentence from one language into another is also challenging because negation is not used in the same way. For example, the Spanish sentence “No tiene ninguna pretensión en la vida” is equivalent to the English sentence “He has no pretense in life” , but in the first case two negation cues are used whereas in the second only one is used. Sentiment analysis is another task in which the presence of negation has a great impact. A sentiment analysis system that does not process negation can extract a completely different opinion than the one expressed by the opinion holder. For example, the polarity of the sentence “A fascinating film, I would repeat” should be the opposite of its negation “A film nothing fascinating, I would not repeat.” Notwithstanding, negation does not always imply polarity reversal; it can also increment, reduce, or have no effect on sentiment expressions, which makes the task even more difficult.

However, as we can see in some of the systems we use regularly, this phenomenon is not being processed effectively. For example, if we do the Google search in Spanish “películas que no sean de aventuras” ( non-adventure movies ), we obtain adventure movies, which reflects that the engine is not taking into account negation. Other examples can be found in online systems for sentiment analysis. If we analyze the Spanish sentence “Jamás recomendaría comprar este producto.” ( I would never recommend buying this product. ) with Mr. Tuit system 1 , we can see that the output returned by the system is positive but the text clearly expresses a negative opinion. In the meaning cloud system 2 we can find another example. If we write the Spanish sentence “Este producto tiene fiabilidad cero.” ( This product has zero reliability. ), the system indicates that it is a positive text, although in fact it is negative.

One of the first steps when attempting to develop a machine learning negation processing system is to check whether there are training data and to decide whether their quality is good enough. Unlike for other well-established tasks such as semantic role labeling or parsing, for negation there is no reference corpus, only several small corpora, and, ideally, a training corpus needs to be large for a system to be able to learn. This motivates our main research questions: Is it possible to merge the existing negation corpora in order to create a larger training corpus? What are the problems that arise? In order to answer the questions we first review all existing corpora and characterize them in terms of several factors: type of information about negation that they contain, type of information about negation that is lacking, and type of application they would be suitable for. Available corpora that contain a representation of negation can be divided into two types (Fancellu et al. 2017 ): (i) those that represent negation in a logical form, using quantifiers, predicates, and relations (e.g., Groningen Meaning Bank (Basile et al. 2012 ), DeepBank (Flickinger, Zhang, and Kordoni 2012 )); and (ii) those that use a string-level representation, where the negation operator and the elements (scope, event, focus) are defined as spans of text (e.g., BioScope (Vincze et al. 2008 ), ConanDoyle-neg (Morante and Daelemans 2012 )). It should be noted that we focus on corpora that deal with string-level negation.

The rest of the article is organized as follows: In Section 2 previous overviews that focus on negation are presented; in Section 3 the criteria used to review the existing corpora annotated with negation are described; in Sections 4 , 5 , and 6 the existing corpora for English, Spanish, and other languages are reviewed; in Section 7 we briefly describe negation processing systems that have been developed using the corpora; in Sections 8 and 9 the corpora are analyzed showing features of interest, applications for which they can be used, and problems found for the development of negation processing systems; and finally, conclusions are drawn in Section 10 .

2 Related Work

To the best of our knowledge, there are currently no extensive reviews of corpora annotated with negation, but there are overviews that focus on the role of negation. An interesting overview on how modality and negation have been modeled in computational linguistics was presented by Morante and Sporleder ( 2012 ). The authors emphasize that most research in NLP has focused on propositional aspects of meaning, but extra-propositional aspects, such as negation and modality, are also important to understanding language. They also observe a growing interest in the computational treatment of these phenomena, evidenced by several annotation projects. In this overview, modality and negation are defined in detail with some examples. Moreover, details on the linguistic resources annotated with modality and negation until then are provided as well as an overview of automated methods for dealing with these phenomena. In addition, a summary of studies in the field of sentiment analysis that have modeled negation and modality is provided. Some of the conclusions drawn by Morante and Sporleder are that although work on the treatment of negation and modality has been carried out in recent years, there is still much to do. Most research has been carried out on the English language and on specific domains and genres (biomedical, reviews, newswire, etc.). At the time of this overview only corpora annotated with negation for English had been developed, with the exception of one Swedish corpus (Dalianis and Velupillai 2010 ). Therefore, the authors indicate that it would be interesting to look at different languages and also distinct domains and genres, due to the fact that extra-propositional meaning is susceptible to domain and genre effects. Another interesting conclusion drawn from this study is that it would be a good idea to study which aspects of extra-propositional meaning need to be modeled for which applications, and the appropriate modeling of modality and negation.

In relation to the modeling of negation, we can reference one survey about the role of negation in sentiment analysis (Wiegand et al. 2010 ). In this survey, several papers with novel approaches to modeling negation in sentiment analysis are presented. Sentiment analysis focuses on the automatic detection and classification of opinions expressed in texts; and negation can affect the polarity of a word (usually positive, negative, or neutral) because it can change, increment, or reduce the polarity value, hence the importance of dealing with this phenomenon in this area. The authors study the level of representation used for sentiment analysis, negation word detection, and scope of negation. In relation to the representation of negation, the usual way to incorporate negation in supervised machine learning is to use a bag-of-words model adding a new feature NOT_x . Thus, if a word x is preceded by a negation marker (e.g., not, never), it would be represented as NOT_x and as x in any other case. Pang, Lee, and Vaithyanathan ( 2002 ) followed a similar approach but they added the tag NOT to every word between a negation cue and the first punctuation mark. They found that the effect of adding negation was relatively small, probably because the introduction of the feature NOT_x increased the feature space. Later, negation was modeled as a polarity shifter and not only negation was considered, but also intensifiers and diminishers. Negation was incorporated into models including knowledge of polar expressions by changing the polarity of an expression (Polanyi and Zaenen 2004 ; Kennedy and Inkpen 2006 ) or encoding negation as features using polar expressions (negation features, shifter features, and polarity modification features) (Wilson, Wiebe, and Hoffmann 2005 ). The results obtained with these models led to a significant improvement over the bag-of-words model. The conclusion drawn by the authors of this survey is that negation is highly relevant to sentiment analysis and that for a negation model to be effective in this area, knowledge of polar expressions is required. Moreover, they state that negation markers do not always function as negators and, consequently, need to be disambiguated. Another interesting remark is that, despite the existence of several approaches to modeling negation for sentiment analysis, to make affirmations of the effectiveness of the methods it is necessary to carry out comparative analysis with regard to classification type, text granularity, target domain, language, and so forth. The papers presented in this study are the pioneering studies of negation modeling in sentiment analysis for English texts. In recent studies researchers have been developing rule-based systems using syntactic dependency trees (Jia, Yu, and Meng 2009 ), applying more complex calculations in order to obtain polarity (Taboada et al. 2011 ), using deep-learning (Socher et al. 2013 ), and using machine-learning with lexical and syntactic features (Cruz, Taboada, and Mitkov 2016a ).
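A sketch of the NOT_x transformation in the style of Pang, Lee, and Vaithyanathan (2002), prefixing every token between a negation cue and the next punctuation mark; the cue and punctuation lists are illustrative:

```python
# Mark tokens in the (crudely approximated) scope of a negation cue with a
# NOT_ prefix, so "like" and "NOT_like" become distinct features.
NEGATION_CUES = {"not", "never", "no", "n't"}
PUNCTUATION = {".", ",", ";", "!", "?"}

def mark_negation(tokens):
    out, in_scope = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            in_scope = False
            out.append(tok)
        elif tok.lower() in NEGATION_CUES:
            in_scope = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if in_scope else tok)
    return out

print(mark_negation("I did not like this movie at all .".split()))
# ['I', 'did', 'not', 'NOT_like', 'NOT_this', 'NOT_movie', 'NOT_at', 'NOT_all', '.']
```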

The studies analyzed above were carried out on English texts, but interest in processing negation in languages other than English has been increasing in recent years. Jiménez-Zafra et al. ( 2018a ) recently presented a review of Spanish corpora annotated with negation. The authors consulted the main catalogs and platforms that provide information about resources and/or access to them (LDC catalog, 3 ELRA catalog, 4 LRE Map, 5 META-SHARE, 6 and ReTeLe 7 ) with the aim of developing a negation processing system for Spanish. Because of the difficulty in finding corpora annotated with negation in Spanish, they conducted an exhaustive search of these resources. As a result, they provided a description of the corpora found as well as the direct links for accessing the data where possible. Moreover, the main features of the corpora were analyzed in order to determine whether the existing annotation schemes account for the complexity of negation in Spanish, that is, whether the typology of negation patterns in this language (Marti et al. 2016 ) was taken into account in the existing annotation guidelines. The conclusions drawn from this analysis were that the Spanish corpora are very different in several aspects: the genres, the annotation guidelines, and the aspects of negation that have been annotated. As a consequence, it would not be possible to merge all of them to train a negation processing system.

3 Criteria for Corpus Review

We review each corpus in terms of the following criteria:

Language : The language(s) of the texts included in the corpus. This characteristic should always be specified in the description of any corpus, as it conditions its use.

Domain : Field to which the texts belong. Although cross-domain methodologies are being used for many tasks (Li et al. 2012 ; Szarvas et al. 2012 ; Bollegala, Mu, and Goulermas 2016 ), the domain of a corpus partly determines its area of application since different areas have different vocabularies.

Availability : Accessibility of the corpora. We indicate whether the corpus is publicly available and we provide the links for obtaining the data when possible. Corpus annotation is time-consuming and expensive, so it is not only necessary that corpora exist, but also that they be publicly available for the research community to use.

Guidelines : We study the guidelines used for the annotation showing similarities and differences between corpora. The definition of guidelines for the annotation of any phenomenon is fundamental because the generation of quality data will depend on it. The goal of annotation guidelines can be formulated as follows: given a theoretically described phenomenon or concept, describe it as generically as possible but as precisely as necessary so that human annotators can annotate the concept or phenomenon in any text without running into problems or ambiguity issues (Ide 2017 ).

Sentences : Corpus size is measured in sentences. The number of sentences is the information that is usually provided in the statistics of a corpus to give an idea of its size, although what matters is not the number of sentences but the information contained in them.

Annotated elements : This aspect refers to the elements on which the annotation has been performed, such as sentences, events, relationships, and so forth.

Elements with negation : Total number of elements that have been annotated with negation. As has been mentioned before, the number of annotated sentences is not important, but rather the information annotated in them. The annotation should cover all the relevant cases that algorithms need to process in order to allow for a rich processing of negation.

Types of negation : Which of the three types of negation have been annotated:

Syntactic negation , if a syntactically independent negation marker is used to express negation (e.g., no [‘no/not’] , nunca [‘never’] ).

Lexical negation , if the cue is a word whose meaning has a negative component (e.g., negar [‘deny’] , desistir [‘desist’] ).

Morphological negation , if a morpheme is used to express negation (e.g., i- in ilegal [‘illegal’] , in in incoherente [‘incoherent’] ). It is also known as affixal negation.

Cues : lexical items that modify the truth value of the propositions that are within their scope (Morante 2010 ), that is, they are words that express negation. Negation cues can be adverbs (e.g., I have never been to Los Angeles ), pronouns (e.g., His decisions have nothing to do with me ), verbs (e.g., The magazine desisted from publishing false stories about the celebrity ), and words with negative prefixes (e.g., What you’ve done is illegal ). They may consist of a single token (e.g., I do not like the food of this restaurant ), a sequence of two or more contiguous tokens (e.g., He has not even tried it ), or two or more non-contiguous tokens (e.g., I am not going back at all ). The annotation of cues in corpora is very important because they are the elements that act as triggers of negation. The identification of negation cues is usually the first task that a negation processing system needs to perform, hence the importance of the annotation of corpora with this information.

Scope : part of the sentence affected by the negation cue (Vincze et al. 2008 ), that is, all elements whose individual falsity would make the negated statement strictly true (Blanco and Moldovan 2011b ). For example, consider the sentence (a) My children do not like meat and its positive counterpart (b) My children like meat . In order for (b) to be true the following conditions must be satisfied: (i) somebody likes something, (ii) my children are the ones who like it, and (iii) meat is what is liked. The falsity of any of them would make (a) true. Therefore, all these elements are the scope of negation: My children do not like meat . The words identified as scope are those on which the negation acts and on which it will be necessary to make certain decisions based on the objective of the final system. For example, in a sentiment analysis system, these words could see their polarity modified.

Negated event : the event that is directly negated by the negation cue, usually a verb, a noun, or an adjective (Kim, Ohta, and Tsujii 2008 ). The negated event or property is always within the scope of a cue, and it is usually the head of the phrase in which the negation cue appears. For example, in the sentence “Technical assistance did not arrive on time,” the event is the verbal form “arrive,” which is the head of the sentence. There are some domains in which the identification of the negated events is crucial. For example, in the clinical domain it is relevant for the correct processing of diagnoses and for the analysis of clinical records.

Focus : part of the scope that is most prominently or explicitly negated (Blanco and Moldovan 2011a ). It can also be defined as the part of the scope that is intended to be interpreted as false or whose intensity is modified. It is one of the most difficult aspects of negation to identify, especially without knowing the stress or intonation. For example, in the sentence “I’m not going to the concert with you,” the focus is “with you” because what is false is not the fact of going to the concert, but the fact of going with a specific person ( with you ). Detecting the focus of negation is useful for retrieving the numerous words that contribute to implicit positive meanings within a negation (Morante and Blanco 2012 ).

Example (1) shows a sentence with the last four elements, which have been explained above. The negation cue appears in bold , the event in italics , the focus underlined , and the scope between [brackets]. The adverb “no”/ no is the negation cue because it is used to change the meaning of the words that are within its scope. The negated event is the verbal form “tiene”/ has and the focus is the noun “límites”/ limits , because it is the part that is intended to be false, it is equivalent to saying “cero límites”/ zero limits . The scope goes from the negation cue 8 to the end of the verb phrase, although this is not always the case, or else it would be very easy to detect the words affected by the negation. In Example (2) we show a sentence in which the scope of negation is the whole sentence and, in Example (3), a sentence with two coordinated structures with independent negation cues and predicates in which a scope is annotated for each coordinated negation marker.

Es una persona que [ no tiene límites ], aunque a veces puede controlarse.

He is a person who has no limits, although sometimes he can control himself.

[El objetivo de la cámara nunca ha funcionado bien ].

The camera lens has never worked well.

[ No soy alta ] aunque [ tampoco soy un pitufo ].

I’m not tall, but I’m not a smurf either.
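Pulling the four elements together, a string-level annotation of a single sentence might be stored as labeled character spans. The layout below is illustrative rather than any corpus's actual format, and the focus shown is one plausible reading:

```python
# A sketch of a string-level negation annotation carrying the four elements
# defined above (cue, scope, event, focus), each stored as character spans.
sentence = "Technical assistance did not arrive on time."

annotation = {
    "cue":   [(25, 28)],   # "not"
    "scope": [(0, 43)],    # the whole clause (scope extent is guideline-dependent)
    "event": [(29, 35)],   # "arrive"
    "focus": [(36, 43)],   # "on time" (one plausible reading)
}

for element, spans in annotation.items():
    print(element, [sentence[s:e] for s, e in spans])
```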

In this section we have presented the aspects that we have described for each corpus. In Sections 4 , 5 , and 6 , we present the existing corpora annotated with negation grouped by language. In Section 9 we provide an analysis of all the factors and we summarize them in different tables that can be found in Appendix A .

As we already indicated, our analysis focuses on corpora with string-level annotations. We are aware of two corpora that do not follow this annotation approach: Groningen Meaning Bank (Basile et al. 2012 ) and DeepBank (Flickinger, Zhang, and Kordoni 2012 ). The Groningen Meaning Bank 9 corpus is a collection of semantically annotated English texts with formal meaning representations rather than shallow semantics. It is composed of newswire texts from Voice of America, country descriptions from the CIA Factbook, a collection of texts from the open ANC (Ide et al. 2010 ), and Aesop’s fables. It was automatically annotated using C&C tools and Boxer (Curran, Clark, and Bos 2007 ) and then manually corrected. The DeepBank corpus 10 contains rich syntactic and semantic annotations for the 25 Wall Street Journal sections included in the Penn Treebank (Taylor, Marcus, and Santorini 2003 ). The annotations are for the most part produced by manual disambiguation of parses licensed by the English Resource Grammar (Flickinger 2000 ). It is available in a variety of representation formats.

4 English Corpora

To the best of our knowledge, the following are the corpora that contain English texts with string-level annotations.

4.1 BioInfer

The first corpus annotated with negation was BioInfer (Pyysalo et al. 2007 ). It focuses on the development of Information Extraction systems for extracting relationships between genes, proteins, and RNAs. Therefore, only entities relevant to this focus were annotated. It consists of 1,100 sentences extracted from the abstracts of biomedical research articles that were annotated with named entities and their relationships, and with syntactic dependencies including negation predicates. Out of 2,662 relationships, 163 (6%) are negated using the predicate NOT. The predicate NOT was used to annotate any explicit statements of the non-existence of a relationship. For this purpose, the three types of negation were considered: syntactic, morphological, and lexical. The scope of negation was not annotated as such, but the absence of a relationship between entities, such as not affected by or unable to , was annotated with the predicate NOT:

Abundance of actin is not affected by calreticulin expression. (See Figure 1 .)

NOT(affected by:AFFECT(abundance of actin, calreticulin expression))

N-WASP mutant unable to interact with profilin. (See Figure 2 .)

NOT(interact with:BIND(N-WASP mutant, profilin))

Figure 1. Annotated example from the BioInfer corpus (not affected by).

Figure 2. Annotated example from the BioInfer corpus (unable to).

In relation to the annotation process, this was divided into two parts. On the one hand, the dependency annotations were created by six annotators who worked in rotating pairs to reduce variation and avoid systematic errors. Two of the annotators were biology experts and the other four had the possibility of consulting with an expert. On the other hand, the entity and relationship annotations were created based on a previously unpublished annotation of the corpus and were carried out by a biology expert, with difficult cases and annotation rules being discussed with two Information Extraction researchers. The inter-annotator agreement was not measured in this corpus because the authors considered that there were some difficulties in calculating the kappa statistic for many of the annotation types. They said that they intended to measure agreement separately for the different annotation types, applying the most informative measures for each type but, to the best of our knowledge, this information was not published. The annotation manual used for producing the annotation can be found at http://tucs.fi/publications/view/?pub_id=tGiPyBjHeSa07a .

The BioInfer corpus is in XML format, licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License and can be downloaded at http://mars.cs.utu.fi/BioInfer/ .

4.2 Genia Event

The Genia Event corpus (Kim, Ohta, and Tsujii 2008 ) is composed of 9,372 sentences from Medline abstracts that were annotated with biological events and with negation and uncertainty. It is an extension of the Genia corpus (Ohta, Tateisi, and Kim 2002 ; Kim et al. 2003 ), which was annotated with Part Of Speech (POS) tags, syntactic trees, and terms (biological entities).

As for negation, it was annotated whether events were explicitly negated or not, using the label non-exists or exists , respectively. The three types of negation were considered, but linguistic cues were not annotated.

This pathway involves the Rac1 and Cdc42 GTPases, two enzymes that are not required for NF-kappaB activation by IL-1beta in epithelial cells. (See Figure 3 .)

Figure 3. Annotated example from the Genia Event corpus.

Out of a total of 36,858 tagged events, 2,351 events were annotated as explicitly negated. The annotation process was carried out by a biologist and three graduate students in molecular biology following the annotation guidelines defined. 11 However, there is no information about inter-annotator agreement.

The corpus is provided as a set of XML files, and it can be downloaded at http://www.geniaproject.org/genia-corpus/event-corpus under the terms of the Creative Commons Public License.

4.3 BioScope

The BioScope corpus (Vincze et al. 2008 ) is one of the largest corpora and is the first in which negation and speculation markers have been annotated with their scopes. It contains 6,383 sentences from clinical free-texts (radiology reports), 11,871 sentences from full biological papers, and 2,670 sentences from biological paper abstracts from the GENIA corpus (Collier et al. 1999 ). In total, it has 20,924 sentences, out of which 2,720 contain negations.

Negation is understood as the implication of the non-existence of something. The strategy for annotating keywords was to mark the minimal unit possible (only lexical and syntactic negations were considered), while the largest syntactic unit possible should be annotated as scope (the minimal-cue, maximal-scope convention referred to later as the min-max strategy). Moreover, negation cues were also included within the scope.

PMA treatment, and not retinoic acid treatment of the U937 cells, acts in inducing NF-KB expression in the nuclei. (See Figure 4 .)

Figure 4. Annotated example from the BioScope corpus.

The corpus was annotated by two independent linguist annotators and a chief linguist following annotation guidelines. 12 The consistency level of the annotation was measured using the inter-annotator agreement rate, defined as the F β=1 measure of one annotation, considering the second one as the gold standard. The average agreement of negation keyword annotation was 93.69, 93.74, and 85.97 for clinical records, abstracts, and full articles, respectively, and the average agreement of scope identification for the three corpora was 83.65, 94.98, and 78.47, respectively.

The BioScope corpus is in XML format and is freely available for academic purposes at http://rgai.inf.u-szeged.hu/index.php?lang=en&page=bioscope . This corpus was also used in the CoNLL-2010 Shared Task: Learning to detect hedges and their scope in natural language text (Farkas et al. 2010 ).

4.4 Product Review Corpus

In 2010, the Product Review corpus was presented (Councill, McDonald, and Velikovich 2010b ). It is composed of 2,111 sentences from 268 product reviews extracted from Google Product Search. This corpus was annotated with the scope of syntactic negation cues and 679 sentences were found to contain negation. Each review was manually annotated with the scope of negation by a single person, after achieving inter-annotator agreement of 91% with a second person on a smaller subset of 20 reviews containing negation. Inter-annotator agreement was calculated using a strict exact span criterion, where both the existence and the left/right boundaries of a negation span were required to match. In this case, negation cues were not included within the scope. The guidelines used for the annotation are described in the work in which the corpus was presented.

The authors do not mention the format of the corpus, and it is not publicly available. However, we contacted the authors and they sent us the corpus. In this way we were able to see that it is in XML format, and we extracted an example from it:

I am a soft seller, If you don’t want or need the services offered that’s cool with me. (See Figure 5 .)

Figure 5. Annotated example from the Product Review corpus.

4.5 PropBank Focus (PB-FOC)

In 2011, the PropBank Focus (PB-FOC) corpus was presented. It introduced a new element for the annotation of negation, the focus. Blanco and Moldovan ( 2011a ) selected 3,993 verbal negations contained in 3,779 sentences from the WSJ section of the Penn TreeBank marked with MNEG in the PropBank corpus (Palmer, Gildea, and Kingsbury 2005 ), and performed annotations of negation focus. They reduced the task to selecting the semantic role most likely to be the focus.

Fifty percent of the instances were annotated twice by two graduate students in computational linguistics and an inter-annotator agreement of 72% was obtained (it was calculated as the percentage of annotations that were a perfect match). Later, disagreements were examined and resolved by giving annotators clearer instructions. Finally, the remaining instances were annotated once. The annotation guidelines defined are described in the paper in which the corpus was presented.

This corpus was used in Task 2, focus detection, at the *SEM 2012 Shared Task (Resolving the scope and focus of negation) (Morante and Blanco 2012 ). It is in CoNLL format (Farkas et al. 2010 ) and can be downloaded at http://www.clips.ua.ac.be/sem2012-st-neg/data.html . Figure 6 shows the annotations for Example (4.5). The columns provide the following information: token (1), token number (2), POS tag (3), named entities (4), chunk (5), parse tree (6), syntactic head (7), dependency relation (8), semantic roles (9 to previous to last, with one column per verb), negated predicates (previous to last), focus (last).

Figure 6. Annotated example from the PropBank Focus (PB-FOC) corpus.

PB-FOC is distributed as standalone annotations on top of the Penn TreeBank. The distribution must be completed with the actual words from the Penn TreeBank, which is subject to an LDC license.

Marketers believe most Americans won’t make the convenience trade-off. (See Figure 6 .)

4.6 ConanDoyle-neg

The ConanDoyle-neg (Morante and Daelemans 2012 ) is a corpus of Conan Doyle stories annotated with negation cues and their scopes, as well as the event or property that is negated. It is composed of 3,640 sentences from The Hound of the Baskervilles story, out of which 850 contain negations, and 783 sentences from The Adventure of Wisteria Lodge story, out of which 145 contain negations. In this case, the three types of negation cues (lexical, syntactic, and morphological) were taken into account.

The corpus was annotated by two annotators, a master’s student and a researcher, both with a background in linguistics. The inter-annotator agreement in terms of F1 was 94.88% and 92.77% for negation cues in The Hound of the Baskervilles story and The Adventure of Wisteria Lodge story, respectively, and 85.04% and 77.31% for scopes. The annotation guidelines 13 are based on those of the BioScope corpus, but there are some differences. The most important differences are that in the ConanDoyle-neg corpus the cue is not considered to be part of the scope, the scope can be discontinuous, and all the arguments of the event being negated are considered to be within the scope, including the subject, which is kept out of the scope in the BioScope corpus.

After his habit he said nothing, and after mine I asked no questions. (See Figure 7.)

Annotated example from the ConanDoyle-neg corpus.

No license is needed to download the corpus.

4.7 SFU Review EN

Konstantinova et al. (2012) annotated the SFU Review EN corpus (Taboada, Anthony, and Voll 2006) with information about negation and speculation. This corpus is composed of 400 reviews extracted from the Web site Epinions.com that belong to 8 different domains: books, cars, computers, cookware, hotels, films, music, and phones. It was annotated with negation and speculation markers and their scopes. Out of the total 17,263 sentences, 18% contain negation cues (3,017 sentences). In this corpus syntactic negation was annotated, but neither lexical nor morphological negation.

The annotation process was carried out by two linguists. The entire corpus was annotated by one of them, and 10% of the documents (randomly selected in a stratified way) were annotated by the second one in order to measure inter-annotator agreement. The kappa agreement was 0.927 for negation cues and 0.872 for scopes. The guidelines of the BioScope corpus were taken into consideration with some modifications: the min-max strategy of the BioScope corpus was used, but negation cues were not included within the scope. A complete description of the annotation guidelines can be found in Konstantinova and De Sousa (2011).
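Because kappa is the agreement measure reported for this corpus and several others below, the following self-contained Python sketch shows a standard computation of Cohen's kappa over two annotators' parallel label sequences; the toy token-level labels are invented for illustration and are unrelated to the actual SFU Review EN data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two parallel sequences of nominal labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the product of the annotators' marginals.
    expected = sum(freq_a[l] * freq_b.get(l, 0) for l in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Toy example: per-token "negation cue or not" decisions.
a = ["CUE", "O", "O", "CUE", "O", "O"]
b = ["CUE", "O", "O", "O", "O", "O"]
print(round(cohens_kappa(a, b), 3))  # 0.571
```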

This corpus is in XML format and publicly available at https://www.sfu.ca/~mtaboada/SFU_Review_Corpus.html , under the terms of the GNU General Public License as published by the Free Software Foundation. Figure 8 shows how Example (4.7) is annotated in the corpus:

Annotated example from the SFU Review EN corpus.

I have never liked the much taller instrument panel found in BMWs and Audis.

4.8 NEG-DrugDDI

In the biomedical domain, the DrugDDI 2011 corpus (Segura Bedmar, Martinez, and de Pablo Sánchez 2011) was also tagged with negation cues and their scopes, producing the NEG-DrugDDI corpus (Bokharaeian, Díaz Esteban, and Ballesteros Martínez 2013). It contains 579 documents extracted from the DrugBank database and is composed of 5,806 sentences, out of which 1,399 (24%) contain negation. Figure 9 shows the annotation of Example (12), a corpus sentence containing two negations.

Annotated example from the NEG-DrugDDI corpus.

Repeating the study with 6 healthy male volunteers in the absence of glibenclamide did not detect an effect of acitretin on glucose tolerance.

This corpus was automatically annotated with a subsequent manual revision. The first annotation was performed using a rule-based system (Ballesteros et al. 2012), which is publicly available and works on biomedical literature following the BioScope guidelines to annotate sentences with negation. After applying the system, a set of 1,340 sentences was annotated with negation. Then, the outcome was manually checked, correcting annotations when needed. To do so, the annotated corpus was divided into three sets that were assigned to three different evaluators. The evaluators checked all the sentences in each set and corrected the annotation errors. After this revision, a different evaluator revised all the annotations produced by the first three. Next, sentences were explored in order to annotate negation cues that had not been detected by the system, such as unaffected, unchanged, or non-significant. Finally, 1,399 sentences of the corpus were annotated with the scope of negation.

The NEG-DrugDDI corpus is in XML format and can be downloaded at http://nil.fdi.ucm.es/sites/default/files/NegDrugDDI.zip .

4.9 NegDDI-DrugBank

A new corpus, which included the DrugDDI 2011 corpus as well as Medline abstracts, was developed and named the DDI-DrugBank 2013 corpus (Herrero Zazo et al. 2013). This corpus was also annotated with negation markers and their scopes and is known as the NegDDI-DrugBank corpus (Bokharaeian et al. 2014). It consists of 6,648 sentences from 730 files, of which 1,448 sentences (21.78%) contain at least one negation scope. The same approach as the one used for the annotation of the NEG-DrugDDI corpus was followed.

This corpus is in XML format and is freely available at http://nil.fdi.ucm.es/sites/default/files/NegDDI_DrugBank.zip . In Figure 10 , we show the annotations for Example (13); it can be seen that the annotation scheme is the same as the one used in the NEG-DrugDDI corpus.

Annotated example from the NegDDI-DrugBank corpus.

Drug-Drug Interactions: The pharmacokinetic and pharmacodynamic interactions between UROXATRAL and other alpha-blockers have not been determined.

4.10 Deep Tutor Negation

The Deep Tutor Negation corpus (DT-Neg) (Banjade and Rus 2016 ) consists of texts extracted from tutorial dialogues where students interacted with an Intelligent Tutoring System to solve conceptual physics problems. It contains annotations about negation cues, and the scope and focus of negation. From a total of 27,785 student responses, 2,603 responses (9.36%) contain at least one explicit negation marker. In this corpus, syntactic and lexical negation were taken into account but not morphological negation.

In relation to the annotation process, the corpus was first automatically annotated based on a list of cue words that the authors compiled from different research reports (Morante, Schrauwen, and Daelemans 2011; Vincze et al. 2008). After this, annotators validated the automatically detected negation cues and annotated the corresponding negation scope and focus. The annotation was carried out by a total of five graduate students and researchers following an annotation manual inspired by the guidelines of Morante, Schrauwen, and Daelemans (2011). In order to measure inter-annotator agreement, a subset of 500 instances was randomly selected; it was divided into five equal subsets, and each was annotated by two annotators. The average agreement for scope and focus detection was 89.43% and 94.20%, respectively (the agreement for negation cue detection was not reported).

This corpus is in TXT format and it is available for research-only, non-commercial, and internal use at http://deeptutor.memphis.edu/resources.htm . Figure 11 is an example of an annotated response.

Annotated example from the Deep Tutor Negation corpus.

They will not hit the water at the same time. (See Figure 11.)

4.11 SFU Opinion and Comments Corpus (SOCC)

Finally, the last English corpus we are aware of is the SFU Opinion and Comments Corpus (SOCC) (Kolhatkar et al. 2019), which was presented at the beginning of 2018. The original corpus contains 10,339 opinion articles (editorials, columns, and op-eds) together with their 663,173 comments from 303,665 comment threads, from the main Canadian daily newspaper in English, The Globe and Mail , for a five-year period (from January 2012 to December 2016). The corpus is organized into three subcorpora: the articles corpus, the comments corpus, and the comment-threads corpus. The corpus description and download links are publicly available. 15

SOCC was collected to study different aspects of on-line comments, such as the connections between articles and comments; the connections of comments to each other; the types of topics discussed in comments; the nice (constructive) or mean (toxic) ways in which commenters respond to each other; and how language is used to convey very specific types of evaluation. However, the main focus of the annotation is the study of constructiveness and evaluation in the comments. Thus, a subset of SOCC with 1,043 comments was selected to be annotated with three different layers: constructiveness, appraisal, and negation.

The primary intention of the research and annotation was to examine the relationship between negation, negativity, and appraisal. Two individuals participated in the annotation process. Specific guidelines were developed to assist the annotators throughout the annotation process and to ensure that annotations were standardized. These guidelines are publicly available through the GitHub page for the corpus. 16 The 1,043 comments were annotated for negation using Webanno (de Castilho et al. 2016), and the elements considered were the negation cue or keyword, focus, and scope. Syntactic negation was taken into account, as well as some verbs and adjectives that indicate negation. The negation cue is excluded from the scope. In cases of elision or question and response, a special annotation label, xscope, was created to indicate the implied content of a non-explicit scope. For the 1,043 comments there were 1,397 negation cues, 1,349 instances of scope, 34 instances of xscope, and 1,480 instances of focus.

Regarding agreement: the annotation was performed by a graduate student in computer science and an expert in computational linguistics. The expert was in charge of overseeing the process and training the research assistant. The research assistant annotated the entire corpus, and the senior annotator then refined the annotations and resolved any disagreements. To calculate agreement, 50 comments from the beginning of the annotation process and 50 comments from its conclusion were compared. Agreement between the annotators was calculated individually for the label and the span of the keyword, scope, and focus. It was calculated using percentage agreement for nominal data, with annotations regarded as either agreeing or disagreeing. A percentage indicating agreement was measured for both label and span, and the two were combined to yield an average agreement for the tag. The agreement for the first 50 comments was 99.0% for keyword, 98.0% for scope, and 85.3% for focus. For the last 50 comments the agreement was 96.4% for keyword, 94.2% for scope, and 75.8% for focus.
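As an illustration of this label-and-span protocol, the following sketch computes a percentage agreement under our own simplifying assumptions: annotations are (label, start, end) triples, and the two annotators produced the same number of annotations, paired in order. The real procedure may pair annotations differently.

```python
def percentage_agreement(anns_a, anns_b):
    """Average of label agreement and span agreement over paired
    (label, start, end) annotations (a simplified reconstruction)."""
    pairs = list(zip(anns_a, anns_b))
    label_agree = sum(x[0] == y[0] for x, y in pairs) / len(pairs)
    span_agree = sum(x[1:] == y[1:] for x, y in pairs) / len(pairs)
    return (label_agree + span_agree) / 2

a = [("NEG_CUE", 10, 13), ("FOCUS", 40, 52)]
b = [("NEG_CUE", 10, 13), ("FOCUS", 40, 55)]
print(percentage_agreement(a, b))  # labels: 1.0, spans: 0.5 -> 0.75
```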

The annotated corpus is in TSV format and it can be downloaded at https://researchdata.sfu.ca/islandora/object/islandora%3A9109 under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Next, we show an annotated example in Figure 12 .

Annotated example from the SOCC corpus.

Because if nobody is suggesting that then this is just another murder where someone was at the WRONG PLACE at the WRONG TIME.

5 Spanish Corpora Annotated with Negation

In this section we present the Spanish corpora annotated with negation. To the best of our knowledge, five such corpora exist, from different domains, although the clinical domain is the predominant one.

5.1 UAM Spanish Treebank

The first Spanish corpus annotated with negation that we are aware of is the UAM Spanish Treebank (Moreno et al. 2003 ), which was enriched with the annotation of negation cues and their scopes (Sandoval and Salazar 2013 ).

The initial UAM Spanish Treebank consisted of 1,500 sentences extracted from newspaper articles ( El País Digital and Compra Maestra ) that were annotated syntactically. Trees were encoded in a nested structure, including syntactic category, syntactic and semantic features, and constituent nodes, following the Penn Treebank model. Later, this version of the corpus was extended with the annotation of negation and 10.67% of the sentences were found to contain negations (160 sentences).

In this corpus, syntactic negation was annotated but neither lexical nor morphological negation. It was annotated by two experts in corpus linguistics, who followed guidelines similar to those of the BioScope corpus (Vincze 2010; Szarvas et al. 2008). They included negation cues within the scope, as in BioScope and NegDDI-DrugBank (Bokharaeian et al. 2014). All the arguments of the negated events were also included in the scope of negation, including the subject (as in the ConanDoyle-neg corpus (Morante and Daelemans 2012)), which is excluded from the scope in active sentences in BioScope. There is no information about inter-annotator agreement.

The UAM Spanish Treebank corpus is freely available for research purposes at http://www.lllf.uam.es/ESP/Treebank.html , but it is necessary to accept the license agreement for non-commercial use and send it to the authors. It is in XML format: negation cues are tagged with the label Type=“NEG” , and the scope of negation is tagged with the label Neg=“YES” on the syntactic constituent on which the negation acts. If negation affects the complete sentence, the label is included as an attribute of the tag <Sentence>; if, by contrast, negation affects only part of the sentence, for example, an adjectival phrase represented as <Adjp>, the label Neg=“YES” is included in the corresponding tag. In Figure 13 , we present an example extracted from the corpus in which negation affects the complete sentence; after the example, we sketch how these attributes can be located programmatically.

Annotated example from the UAM Spanish Treebank corpus.

No juega a ser un magnate.

He doesn’t play at being a tycoon.
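The following minimal Python fragment walks an XML file of this kind with the standard library; the file name is hypothetical, and only the Type=“NEG” and Neg=“YES” attributes are taken from the description above.

```python
# Sketch: collecting negation cues and negated constituents from a
# UAM Spanish Treebank XML file (file name hypothetical).
import xml.etree.ElementTree as ET

tree = ET.parse("uam_treebank_sample.xml")
for element in tree.iter():
    if element.get("Type") == "NEG":   # negation cue
        print("cue:", element.tag, element.attrib)
    if element.get("Neg") == "YES":    # negated constituent, e.g.
        print("scope:", element.tag)   # <Sentence> or <Adjp>
```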

5.2 IxaMed-GS

The IxaMed-GS corpus (Oronoz et al. 2015) is composed of 75 real electronic health records from the outpatient consultations of the Galdakao-Usansolo Hospital in Biscay (Spain). It was annotated by two experts in pharmacology and pharmacovigilance with entities related to diseases and drugs, and with the relationships between entities indicating adverse drug reaction events. They defined their own annotation guidelines, following the recommendations of Ananiadou and McNaught (2006) on the issues to consider in the design of a corpus.

The objective of this corpus was not the annotation of negation but the identification of entities and events in clinical reports. However, negation and speculation were taken into account in the annotation process. In the corpus, four entity types were annotated: diseases, allergies, drugs, and procedures. For diseases and allergies, the annotators distinguished between negated entities, speculated entities, and plain entities (non-speculated and non-negated). On the one hand, 2,362 diseases were annotated, out of which 490 (20.75%) were tagged as negated and 40 (1.69%) as speculated. On the other hand, 404 allergy entities were identified, of which 273 (67.57%) were negated and 13 (3.22%) speculated. The quality of the annotation process was assessed by measuring the inter-annotator agreement, which was 90.53% for entities and 82.86% for events.

It may be possible to acquire the corpus via the EXTRECM project, 17 following a procedure whose conditions include a confidentiality agreement; the format of the corpus is not specified.

5.3 SFU ReviewSP-NEG

The SFU ReviewSP-NEG 18 (Jiménez-Zafra et al. 2018b) is the first Spanish corpus that includes the event in the annotation of negation and that takes into account discontinuous negation markers. Moreover, it is the first corpus in which it is annotated how negation affects the words within its scope, that is, whether there is a change in polarity or an increment or reduction of its value. It is an extension of the Spanish part of the SFU Review corpus (Taboada, Anthony, and Voll 2006), and it can be considered the counterpart of the SFU Review corpus with negation and speculation annotations 19 (Konstantinova et al. 2012).

The Spanish SFU Review corpus consists of 400 reviews extracted from the Web site Ciao.es that belong to 8 different domains: cars, hotels, washing machines, books, cell phones, music, computers, and movies. For each domain there are 50 positive and 50 negative reviews, defined as positive or negative based on the number of stars given by the reviewer (1–2 = negative; 4–5 = positive; 3-star reviews were not included). Later, it was extended to the SFU ReviewSP-NEG corpus, in which each review was automatically annotated at the token level with POS tags and lemmas using Freeling (Padró and Stanilovsky 2012), and manually annotated at the sentence level with negation cues and their corresponding scopes and events. It is composed of 9,455 sentences, out of which 3,022 (31.97%) contain at least one negation marker.

In this corpus, syntactic negation was annotated but neither lexical nor morphological negation, as in the UAM Spanish Treebank corpus. Unlike that corpus, annotations of the event and of how negation affects the polarity of the words within its scope were included. It was annotated by two senior researchers with in-depth experience in corpus annotation, who supervised the whole process, and two trained annotators, who carried out the annotation task. The kappa coefficient for inter-annotator agreement was 0.97 for negation cues, 0.95 for negated events, and 0.94 for scopes. 20 A detailed discussion of the main sources of disagreement can be found in Jiménez-Zafra et al. (2016).

The guidelines of the BioScope corpus were taken into account but, after a thorough analysis of negation in Spanish, a typology of Spanish negation patterns (Martí et al. 2016) was defined. As in BioScope, NegDDI-DrugBank, and the UAM Spanish Treebank, negation markers were included within the scope. Moreover, the subject was also included within the scope when the word directly affected by the negation was the verb of the sentence. The event was also included within the scope of negation, as in the ConanDoyle-neg corpus.

The SFU ReviewSP-NEG corpus is in XML format. It is publicly available and can be downloaded at http://sinai.ujaen.es/sfu-review-sp-neg-2/ under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. In Figure 14 , we present an example of a sentence containing negation annotated in this corpus:

Annotated example from the SFU ReviewSP-NEG corpus.

El 307 es muy bonito, pero no os lo recomiendo.

The 307 is very nice, but I don’t recommend it.

The annotations of this corpus were used in NEGES 2018: Workshop on Negation in Spanish (Jiménez-Zafra et al. 2019) for Task 2, “Negation cues detection” (Jiménez-Zafra et al. 2018). The corpus was converted to CoNLL format (Farkas et al. 2010), as in the *SEM 2012 Shared Task (Morante and Blanco 2012). This version of the corpus can be downloaded from the Web site of the workshop, http://www.sepln.org/workshops/neges/index.php?lang=en , or by sending an email to the organizers. In Figure 15 , we show an example of a sentence with two negations. In this version of the corpus, each line corresponds to a token, each annotation is provided in a column, and empty lines indicate the end of a sentence. The columns contain: domain_filename (1), sentence number within domain_filename (2), token number within sentence (3), word (4), lemma (5), part-of-speech (6), and part-of-speech type (7); if the sentence has no negations, column (8) has a “***” value and there are no more columns. If the sentence has negations, the annotation for each negation is provided in three columns. The first column contains the word that belongs to the negation cue; the second and third columns contain “-”, because the proposed task was only negation cue detection.
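The following sketch reads one token line of this layout into a Python dictionary. The column indices follow the description above; the tab separator and the use of “-” for tokens outside a cue are our assumptions and should be verified against the released files.

```python
# Sketch for one token line of the NEGES CoNLL layout described above.

def parse_neges_line(line: str) -> dict:
    cols = line.rstrip("\n").split("\t")
    record = {
        "source": cols[0],        # domain_filename
        "sentence_number": cols[1],
        "token_number": cols[2],
        "word": cols[3],
        "lemma": cols[4],
        "pos": cols[5],
        "pos_type": cols[6],
        "cue_parts": [],
    }
    if cols[7] != "***":                   # sentence contains negations
        for i in range(7, len(cols), 3):   # three columns per negation
            if cols[i] != "-":             # "-" assumed for non-cue tokens
                record["cue_parts"].append(cols[i])
    return record
```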

Annotated example from the SFU ReviewSP-NEG corpus for negation cue detection in CoNLL format.

Aquí estoy esperando que me carguen los puntos en mi tarjeta más, no sé dónde tienen la cabeza pero no la tienen donde deberían.

Here I am waiting for the points to be loaded on my card and I don’t know where they have their head but they don’t have it where they should.

5.4 UHU-HUVR

The UHU-HUVR (Cruz Díaz et al. 2017 ) is the first Spanish corpus in which affixal negation is annotated. It is composed of 604 clinical reports from the Virgen del Rocío Hospital in Seville (Spain). A total of 276 of these clinical documents correspond to radiology reports and 328 to the personal history of anamnesis reports written in free text.

In this corpus all types of negation were annotated: syntactic, morphological (affixal), and lexical. It was annotated with negation markers, their scopes, and the negated events by two domain-expert annotators closely following the Thyme corpus guidelines (Styler IV et al. 2014) with some adaptations. In the anamnesis reports, 1,079 out of 3,065 sentences (35.20%) were found to contain negations, while 1,219 out of 5,347 sentences (22.80%) were annotated with negations in the radiology reports. The Dice coefficient for inter-annotator agreement was higher than 0.94 for negation markers and higher than 0.72 for negated events. Most of the disagreements were the result of human errors, namely, the annotators missed a word or included a word that did not belong either to the event or to the marker. However, other cases of disagreement can be explained by the difficulty of the task and the lack of clear guidance. They encountered the same types of disagreement as Jiménez-Zafra et al. (2016) did when annotating the SFU ReviewSP-NEG corpus.

The format of the corpus is not specified. The authors state that the annotated corpus will be made publicly available, but it is not currently available, probably because of legal and ethical issues.

5.5 IULA Spanish Clinical Record

The IULA Spanish Clinical Record corpus (Marimon et al. 2017) contains 300 anonymized clinical records from several services of one of the main hospitals in Barcelona (Spain), annotated with negation markers and their scopes. It contains 3,194 sentences, out of which 1,093 (34.22%) were annotated with negation cues.

In this corpus, syntactic and lexical negation were annotated but not morphological negation. It was annotated with negation cues and their scopes by three computational linguists advised by a clinician. The inter-annotator kappa rates were 0.85 between annotators 1 and 2 and between annotators 1 and 3, and 0.88 between annotators 2 and 3. The authors defined their own annotation guidelines, taking into account the existing guidelines for English corpora (Mutalik, Deshpande, and Nadkarni 2001; Szarvas et al. 2008; Morante and Daelemans 2012). Differently from previous work, they included neither the negation cue nor the subject in the scope (except when the subject is located after the verb).

The corpus is publicly available under a CC-BY-SA 3.0 license and can be downloaded at http://eines.iula.upf.edu/brat//#/NegationOnCR_IULA/ . The annotations can be exported in ANN format and the raw text in TXT format; a sketch of how such standoff annotations can be read is given after the example. In Figure 16 , an example of the annotation of a sentence in this corpus is presented:

Annotated example from the IULA Spanish Clinical Record corpus.

AC: tonos cardíacos rítmicos sin soplos audibles.

CA: rhythmic heart tones without audible murmurs.
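Since the corpus is served through brat (as the download URL suggests), the exported ANN files presumably follow the usual brat standoff convention, with text-bound annotation lines of the form T1<TAB>Label start end<TAB>text; this is our assumption, not something stated by the authors, and the label name below is hypothetical.

```python
# Sketch: parsing a brat-style standoff (.ann) line. The layout and the
# label name are assumptions; inspect the exported files before relying
# on this. Discontinuous spans ("start end;start end") are not handled.

def parse_ann_line(line: str) -> dict:
    ann_id, type_and_offsets, text = line.rstrip("\n").split("\t")
    label, start, end = type_and_offsets.split()
    return {"id": ann_id, "label": label,
            "start": int(start), "end": int(end), "text": text}

print(parse_ann_line("T1\tNegMarker 29 32\tsin"))
```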

6 Corpora Annotated with Negation in Other Languages

Some corpora have been created for languages other than Spanish and English. We present them in this section.

6.1 Swedish Uncertainty, Speculation, and Negation Corpus

Dalianis and Velupillai (2010) annotated a subset of the Stockholm Electronic Patient Record corpus (Dalianis, Hassel, and Velupillai 2009) with certain and uncertain expressions as well as speculative and negation keywords. The Stockholm Electronic Patient Record corpus is a clinical corpus that contains patient records from the Stockholm area covering the years 2006 to 2008. From this corpus, 6,740 sentences were randomly extracted and annotated by three annotators: one senior level student, one undergraduate computer scientist, and one undergraduate language consultant. For the annotation, guidelines similar to those of the BioScope corpus (Vincze et al. 2008) were applied (Figure 17). The inter-annotator agreement was measured by pairwise F-measure. In relation to the annotation of negation cues, only syntactic negation was considered, and the agreement obtained was 0.80 in terms of F-measure. The corpus was annotated with a total of 6,996 expressions, out of which 1,008 were negative keywords.

Annotated example from the Stockholm Electronic Patient Record corpus.

The corpus is in XML format, according to the example provided by the authors, but there is no information about availability.

Statusmässigt inga säkra artriter. Lungrtg Huddinge ua. Leverprover ua.

Status-wise no certain arthritis. cxr Huddinge woco. Liver samples woco.

6.2 EMC Dutch Clinical Corpus

The EMC Dutch Clinical Corpus is composed of clinical documents of four types, in which medical terms were annotated as negated or not:

Out of a total of 3,626 medical terms from general practitioners, 12% were annotated as negated (435).

Out of a total of 2,748 medical terms from specialists’ letters, 15% were annotated as negated (412).

Out of a total of 3,684 medical terms from radiology reports, 16% were annotated as negated (589).

Out of a total of 2,830 medical terms from discharge letters, 13% were annotated as negated (368).

This is the first publicly available Dutch clinical corpus, although it cannot be downloaded online; it is necessary to send an email to the authors.

6.3 Japanese Negation Corpus

Matsuyoshi, Otsuki, and Fukumoto ( 2014 ) proposed an annotation scheme for the focus of negation in Japanese and annotated a corpus of reviews from “Rakuten Travel: User review data” 21 and the newspaper subcorpus of the “Balanced Corpus of Contemporary Written Japanese (BCCWJ)” 22 in order to develop a system for detecting the focus of negation in Japanese.

The Review and Newspaper Japanese corpus is composed of 5,178 sentences of facility reviews and 5,582 sentences from Groups “A” and “B” of the newspaper documents from BCCWJ. It was automatically tagged with POS tags using the MeCab analyzer 23 so that this information could be used to mark negation cue candidates. After a filtering process, 2,147 negation cues were annotated (1,246 from reviews and 901 from newspapers). Of the 10,760 sentences, 1,785 (16.59%) were found to contain a negation cue.

For the annotation of the focus of negation, two annotators marked the focus for Group “A” in the newspaper subcorpus. They obtained an agreement of 66% in terms of number of segments. Disagreement problems were discussed and solved. Then, one of the annotators annotated reviews and Group “B” and the other checked the annotations. After a discussion, a total of ten labels were corrected.

The format of the corpus is not specified, although the authors show some examples of annotated sentences in their work. In Example (6.3) we present one of them, corresponding to a hotel review; the negation cue is written in boldface and the focus is underlined. In relation to availability, the authors plan to freely distribute the corpus on their Web site, http://cl.cs.yamanashi.ac.jp/nldata/negation/ , although it is not available yet. 24

heya ni reizoko ga naku robi ni aru kyodo reizoko wo tsukatta.

The room where I stayed had no fridge, so I used a common one in the lobby.

6.4 Chinese Negation and Speculation Corpus

Zou, Zhou, and Zhu (2016) recently presented the Chinese Negation and Speculation (CNeSp) corpus, which consists of three types of documents annotated with negation and speculation cues and their linguistic scopes. The corpus includes 19 articles of scientific literature, 821 product reviews, and 311 financial articles. It is composed of 16,841 sentences, out of which 4,517 (26.82%) contain negations.

For the annotation, the guidelines of the BioScope corpus (Szarvas et al. 2008) were used, with some adaptations to fit the Chinese language. The minimal unit expressing negation or speculation was annotated, and the cues were included within the scope, as in the BioScope corpus. However, the following adaptations were made: (i) the existence of a cue depends on its actual semantics in context; (ii) a scope should contain the subject that contributes to the meaning of the content being negated or speculated, if possible; (iii) a scope should be a continuous fragment of the sentence; and (iv) a negative or speculative word may not be a cue (there are many double negatives in Chinese, used only for emphasis rather than to express negative meaning). The corpus was annotated by two annotators, and disagreements were resolved by an expert linguist who modified the guidelines accordingly. The inter-annotator agreement, measured in terms of kappa, was 0.96, 0.96, and 0.93 for negation cue detection, and 0.90, 0.91, and 0.88 for scope identification, for the scientific literature, financial articles, and product reviews, respectively. In this corpus, only lexical and syntactic negation were considered.

The corpus is in XML format and the authors state that it is publicly available for research purposes at http://nlp.suda.edu.cn/corpus/CNeSp/ . In Figure 18 we show an annotation example of a hotel review sentence.

Annotated example from the CNeSp corpus.

The standard room is too bad, the room is not as good as the 3 stars, and the facilities are very old.

6.5 German Negation and Speculation Corpus

The German negation and speculation corpus (Cotik et al. 2016a) consists of 8 anonymized German discharge summaries and 175 clinical notes from the nephrology domain. It was first automatically annotated using an annotation tool: medical terms were pre-annotated using data from the UMLS Metathesaurus, and later a human annotator corrected wrong annotations and included missing concepts. Furthermore, the annotator had to decide and annotate whether a given finding occurs in a positive, negative, or speculative context. Finally, the annotations were corrected by a second, more experienced annotator. There is no mention of annotation guidelines, and inter-annotator agreement is not reported. In relation to negation, out of 518 medical terms from discharge summaries, 106 were annotated as negated; out of 596 medical terms from clinical notes, 337 were annotated as negated.

The format of the corpus is not mentioned by the authors, and it is not publicly available.

6.6 Italian Negation Corpus

Altuna, Minard, and Speranza (2017) proposed an annotation framework for negation in Italian based on the guidelines proposed by Morante, Schrauwen, and Daelemans (2011) and Blanco and Moldovan (2011a), and they applied it to the annotation of news articles and tweets. They provided annotations for negation cues, negation scope, and focus, taking into account only syntactic negation. As a general rule, they do not include the negation cue inside the scope, except when negation has a richer semantic meaning (e.g., nessun / “no” (determiner), mai / “never”, nessuno / “nobody”, and nulla / “nothing”) (Figure 19).

Annotated example from the Fact-Ita Bank Negation corpus.

Pare che, concluso questo ciclo, il docente non si dedicherà solo all'insegnamento.

It seems that, at the end of this cycle, the teacher will not only devote himself to teaching.

The corpus is composed of 71 documents from the Fact-Ita Bank corpus (Minard, Marchetti, and Speranza 2014), which consists of news stories taken from Ita-TimeBank (Caselli et al. 2011), and 301 tweets that were used as the test set in the FactA task presented at the EVALITA 2016 evaluation campaign (Minard, Speranza, and Caselli 2016). On the one hand, the Fact-Ita Bank Negation corpus consists of 1,290 sentences, out of which 278 (21.55%) contain negations. On the other hand, the tweet corpus has 301 sentences, of which 59 (19.60%) were annotated as negated.

The annotation process was carried out by four annotators, whose background is not specified, and the inter-annotator agreement was measured using the average pairwise F-measure. The agreement on the identification of negation cues, scope, and focus was 0.98, 0.67, and 0.58, respectively.

The corpus is in XML format and can be downloaded at https://hlt-nlp.fbk.eu/technologies/fact-ita-bank under a Creative Commons Attribution-NonCommercial 4.0 International License. It should be mentioned that only the news annotations are available; the tweets come from another corpus that is subject to copyright. In Figure 19 , a negated sentence from the corpus is shown.

7 Processing Negation

Research on negation processing has addressed four main tasks:

Negation cue detection, which aims at finding the words that express negation.

Scope identification, which consists in determining which parts of the sentence are affected by the negation cues. The task was introduced in 2008, when the BioScope corpus was released, and it was modeled as a machine learning sequence labeling task (Morante, Liekens, and Daelemans 2008).

Negated event detection, which focuses on detecting whether events are affected by the negation cues; this task was motivated by the release of biomedical corpora annotated with negated events, such as BioInfer and Genia Event.

Focus detection, which consists of finding the part of the scope that is most prominently negated. This task was introduced by Blanco and Moldovan (2011b), who argued that the scope and focus of negation are crucial for a correct interpretation of negated statements. The authors released the PropBank Focus corpus, on which all focus detection systems have been trained. The corpus was used in the first edition of the *SEM Shared Task, which was dedicated to resolving the scope (Task 1) and focus (Task 2) of negation (Morante and Blanco 2012). Both rule-based (Rosenberg and Bergler 2012) and machine learning approaches (Blanco and Moldovan 2013; Zou, Zhu, and Guodong 2015) have been applied to solve this task.

Most of the works have modeled these tasks as token-level classification tasks, where a token is classified as being at the beginning, inside, or outside a negation cue, scope, event, or focus. Scope, event, and focus identification tasks are more complex because they depend on negation cue detection.
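As an illustration of this encoding, the following sketch assigns BIO labels to a cue and a scope over an invented example sentence; the span indices are token positions and have no connection to any of the corpora above.

```python
# Sketch: BIO encoding of negation elements at token level (invented data).

def bio_encode(tokens, spans, tag):
    """spans: list of (first_token_idx, last_token_idx), inclusive."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = f"B-{tag}"
        for i in range(start + 1, end + 1):
            labels[i] = f"I-{tag}"
    return labels

tokens = "I did not like the hotel at all".split()
cues = bio_encode(tokens, [(2, 2)], "CUE")      # "not"
scope = bio_encode(tokens, [(3, 7)], "SCOPE")   # "like the hotel at all"
for row in zip(tokens, cues, scope):
    print("\t".join(row))
```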

The interest in processing negation originated from the need to extract information from clinical records (Chapman et al. 2001a ; Goldin and Chapman 2003 ; Mutalik, Deshpande, and Nadkarni 2001 ). Despite the fact that many studies have focused on negation in clinical texts, the problem is not yet solved (Wu et al. 2014 ), due to several reasons, among which is the lack of consistent annotation guidelines.

Three main types of approaches have been applied to processing negation: (i) rule-based systems, developed based on lists of negation cues and stop words (Mitchell et al. 2004; Harkema et al. 2009; Mykowiecka, Marciniak, and Kupść 2009; Uzuner, Zhang, and Sibanda 2009; Sohn, Wu, and Chute 2012); the first such system was the NegEx algorithm (Chapman et al. 2001a), which was later improved, resulting in systems such as ConText (Harkema et al. 2009), DEEPEN (Mehrabi et al. 2015), and NegMiner (Elazhary 2017); (ii) machine learning techniques (Agarwal and Yu 2010; Li et al. 2010; Cruz Díaz et al. 2012; Velldal et al. 2012; Cotik et al. 2016b; Li and Lu 2018); and (iii) deep learning approaches (Fancellu, Lopez, and Webber 2016; Qian et al. 2016; Ren, Fei, and Peng 2018; Lazib et al. 2018). Although interest in processing negation has only increased, negation resolvers are not yet a standard component of the natural language processing pipeline. Recently, a tool for detecting negation cues and scopes in English natural language texts was released (Enger, Velldal, and Øvrelid 2017).
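To give an idea of how the rule-based family works, the following is a deliberately simplified, NegEx-inspired sketch, not the actual NegEx algorithm: a trigger word negates any target term found within a fixed token window after it. The trigger list, window size, and target terms are all invented for illustration.

```python
# Deliberately simplified, NegEx-inspired sketch (not the real NegEx):
# a trigger negates target terms within a fixed window after it.

TRIGGERS = {"no", "not", "without", "denies"}
WINDOW = 5  # tokens after a trigger treated as its "scope"

def negated_terms(tokens, targets):
    negated = set()
    for i, tok in enumerate(tokens):
        if tok.lower() in TRIGGERS:
            for t in tokens[i + 1 : i + 1 + WINDOW]:
                if t.lower() in targets:
                    negated.add(t.lower())
    return negated

tokens = "There is no evidence of pulmonary nodules".split()
print(negated_terms(tokens, {"nodules", "opacity"}))  # {'nodules'}
```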

Later on, with the developments in opinion mining, negation was studied as a marker of polarity change (Das and Chen 2001; Wilson, Wiebe, and Hoffmann 2005; Polanyi and Zaenen 2006; Taboada et al. 2011; Jiménez-Zafra et al. 2017) and was incorporated into sentiment analysis systems. Some systems use rules to detect negation, without evaluating their impact (Das and Chen 2001; Polanyi and Zaenen 2006; Kennedy and Inkpen 2006; Jia, Yu, and Meng 2009), whereas other systems use a lexicon of negation cues and predict the scope with machine learning algorithms (Councill, McDonald, and Velikovich 2010a; Lapponi, Read, and Øvrelid 2012; Cruz, Taboada, and Mitkov 2016b). Most systems are tested on the SFU Review corpus.

Several shared tasks have addressed negation processing for English: the BioNLP’09 Shared Task 3 (Kim et al. 2009 ), the i2b2 NLP Challenge (Uzuner et al. 2011 ), the *SEM 2012 Shared Task (Morante and Blanco 2012 ), and the ShARe/CLEF eHealth Evaluation Lab 2014 Task 2 (Mowery et al. 2014 ).

Although most of the work on processing negation has focused on English texts, negation in Spanish texts has recently attracted the attention of researchers. Costumero et al. (2014), Stricker, Iacobacci, and Cotik (2015), and Cotik et al. (2016b) developed systems for the identification of negation in clinical texts by adapting the NegEx algorithm (Chapman et al. 2001b). Regarding product reviews, there are some works that treat negation as a subtask of sentiment analysis (Taboada et al. 2011; Vilares, Alonso, and Gómez-Rodríguez 2013; Jiménez-Zafra et al. 2015; Amores, Arco, and Barrera 2016; Jiménez-Zafra et al. 2019; Miranda, Guzmán, and Salcedo 2016). The first systems that detect negation cues were developed in the framework of the NEGES 2018 workshop (Jiménez-Zafra et al. 2019) and were trained on the SFU corpus (Jiménez-Zafra et al. 2018). Fabregat, Martínez-Romo, and Araujo (2018) applied a deep learning model combining dense neural networks with a Bidirectional Long Short-Term Memory network, and Loharja, Padró, and Turmo (2018) used a CRF model. Additionally, there is also work for other languages, such as Swedish (Skeppstedt 2011), German (Cotik et al. 2016a), and Chinese (Kang et al. 2017).

8 Analysis of the Corpora

Negation is an important phenomenon to deal with in NLP tasks if we want to develop accurate systems. Work on processing negation started relatively late compared with work on other linguistic phenomena, and there are no publicly available off-the-shelf tools for detecting negation that can be easily incorporated into applications. In this overview, the corpora annotated with negation so far are presented with the aim of promoting the development of such tools. For the development of a negation processing system it is not only important that corpora exist, but also that they be publicly available, well documented, and annotated with quality. Moreover, to train robust machine learning systems it is necessary to have large enough data covering all possible cases of the phenomenon under study. Therefore, in this section we analyze the features of the corpora we have described, and in the next section we discuss the possibility of merging the existing negation corpora in order to create a larger training corpus. In Appendix A , the information analyzed is summarized in Tables 8 , 9 , and 10 .

8.1 Language and Year of Publication

The years of publication of the corpora ( Table 3 , Appendix A ) show that interest in the annotation of negation started in 2007 with English texts. Since then, a total of 11 English corpora have been presented. The next language for which annotations were made was Swedish, although we only have evidence of one corpus, presented in 2010. For other languages, the interest is more recent. The first Spanish corpus annotated with negation appeared in 2013, and since then five corpora have been compiled, three of them in the last two years. There are also corpora for Dutch, Japanese, Chinese, German, and Italian, although annotation appears to be an emerging task for these languages, because we only have evidence of one corpus annotated with negation in each of them; these corpora appeared in 2014, 2014, 2016, 2016, and 2017, respectively. From the analysis of the years of publication, it can be observed that it is a task of recent interest for Spanish, Dutch, Japanese, Chinese, German, and Italian, whereas for English it is more established, or at least more extensively studied. For Swedish, although annotation of negation started three years after the English annotation, no continuity is observed, as there is only one corpus annotated with negation.

8.2 Domain

If we look at Tables 8 – 10 (see Appendix A ), it can be seen that in the corpora annotated so far there is a special interest in the medical domain, followed by reviews. In English, out of 11 corpora, 5 focus on the biomedical domain, 3 on reviews or opinion articles, 1 on journal stories, 1 on tutorial dialogues, and 1 on the literary domain. In Spanish, 3 of the corpora are about clinical reports; 1 about movies, books, and product reviews; and 1 about newspaper articles. In the other languages, we have only found one corpus annotated with negation per language. For Swedish, Dutch, and German, the domain is clinical reports; for Japanese it is news articles and reviews; for Italian it is news articles; and the Chinese corpus covers scientific literature, product reviews, and financial articles. This information shows that in all languages there is a common interest in processing negation in clinical/biomedical texts. This is understandable because detecting negated concepts is crucial in this domain. If we want to develop information extraction systems, it is very important to process negation because clinical texts often refer to concepts that are explicitly not present in the patient, for example, to document the process of ruling out a diagnosis: “In clinical reports the presence of a term does not necessarily indicate the presence of the clinical condition represented by that term. In fact, many of the most frequently described findings and diseases in discharge summaries, radiology reports, history and physical exams, and other transcribed reports are denied in the patient.” (Chapman et al. 2001b , page 301).

Not recognizing these negated concepts can cause problems. For example, if the concept “pulmonary nodules” is recognized in the text “There is no evidence of pulmonary nodules” and negation is not detected, the diagnosis of a patient will be totally different.

Considering the corpora analyzed, another domain that has attracted the attention of researchers is opinion articles or reviews. The large amount of content that is published on the Internet has generated great interest in the opinions that are shared in this environment through social networks, blogs, sales portals, and other review sites. This user-generated content is useful for marketing strategies because it can be used to measure and monitor customer satisfaction: it is a quick way to find out what customers liked and what they did not like. Moreover, micro-blogging platforms such as Twitter are being used to measure voting intention and people’s moods, and even to predict the success of a film. The study of negation in this domain is very important because, if negation is present in a sentence and it is not taken into account, a system can extract a completely different opinion from the one published by the user. Example (24) shows a positive opinion that becomes negative when negation is present, as in Example (25); conversely, Example (26) is a positive opinion containing negation whose meaning changes when the negation is removed, as in Example (27).

The camera works well.

The camera does not work well.

I have not found a camera that works better.

I have found a camera that works better.

Other domains for which interest has also been shown, although to a lesser extent, are journal stories, tutorial dialogues, the literary domain, newspaper articles, scientific literature, and financial articles.

8.3 Availability

The extraction and annotation of corpora is time-consuming and expensive. Therefore, it is not enough that corpora exist; they must also be made available to the scientific community to allow progress in the study of the different phenomena. In this overview we focus on negation and, of the 22 corpora collected, 15 are publicly available. Of the seven non-available corpora, five contain clinical reports, and legal and ethical issues may be the reason for this. The links for obtaining the data of the different corpora (when possible) are shown in Table 4 ( Appendix A ).

8.4 Size

The size of a corpus is usually expressed in number of sentences and/or tokens. It is important to know the extent of the corpus, but what is really important is the number of elements of the phenomenon or concept that has been annotated. As we focus on negation, the relevant information is the total number of elements (sentences, events, relationships, etc.) that have been annotated and the total number of elements that have been annotated with negation. Both are very important because, for a rich processing of negation, algorithms need examples of elements with and without negation in order to cover all possible cases.

In Table 5 ( Appendix A ) we present information on the size of the corpora. The existing corpora are not very large and they do not contain many examples of negation. However, differences between languages are observed. According to the existing corpora, negation is used less frequently in English, Swedish, Dutch, and Japanese, whereas it appears more frequently in Spanish, Italian, Chinese, and German. The percentage of negated elements in English ranges from 6.12% to 32.16%; the first percentage corresponds to relations in the biomedical domain and the second to sentences in product reviews. In Swedish we are aware of only one corpus, the Stockholm Electronic Patient Record corpus, which consists of clinical reports and contains 10.67% negated expressions. The EMC Dutch corpus is also composed of clinical reports, and the percentage of negated medical terms is 14.04%. The Review and Newspaper Japanese corpus consists of reviews and newspaper articles, and 16.59% of its sentences contain negations. For Spanish, the frequency of negated sentences goes from 10.67% in newspaper articles to 34.22% in clinical reports. In Italian, the existing corpus is composed of news articles, and the percentage of negated sentences is 21.55%. The German negation and speculation corpus consists of clinical reports, and 39.77% of the medical terms annotated are negated. Finally, the Chinese corpus of scientific literature, product reviews, and financial articles contains 26.82% negated sentences.

The percentages of elements with negation do not always correspond to sentences; in some cases they refer to events, expressions, relationships, medical terms, or answers, depending on the level at which the annotation was made. Therefore, for a better comparison of the frequency of negation in sentences, we have also calculated the average per language, taking into account only those corpora that provide information at the sentence level. Thus, the average percentage of sentences with negation in English texts is 17.94% and in Japanese 16.59%, whereas for Spanish it is 29.13%, for Italian 21.55%, and for Chinese 26.82%. 25 On the other hand, if we look at the domain of the corpora, we can say that, in general, clinical reports are the type of text with the greatest presence of negation, followed by reviews/opinion articles and biomedical texts.

Although negation is an important phenomenon for NLP tasks, it is relatively infrequent compared with other phenomena. Therefore, in order to train a negation processing system properly, it would be necessary to merge several corpora. However, to do this, the annotations of the corpora must be consistent, a fact that we analyze in Section 8.5 .

8.5 Annotation Guidelines

In order to determine whether the annotations of the different corpora are compatible, we analyze the following aspects:

Existence and availability. Have annotation guidelines been defined? Are they available?

Negation. What types of negation have been taken into account (syntactic and/or lexical and/or morphological)?

Negation elements. What elements of negation have been annotated? Cue? Scope? Negated event? Focus?

Tokenization. What tokenizer has been used?

Annotation scheme and guidelines. What annotation scheme and guidelines have been used?

8.5.1 Existence and Availability

Ide (2017) indicates that the purpose of annotation guidelines is to define a phenomenon or concept in a generic but precise way, so that the annotators do not encounter problems or ambiguity during the annotation process. Therefore, it is very important to define annotation guidelines that annotators can consult whenever necessary. In addition, these guidelines should be available not only to the annotators of the ongoing project but also to other researchers. The definition of annotation guidelines involves a long process of study, and the time spent on it should serve to facilitate the annotation process for other researchers. In Table 6 ( Appendix A ), we show the link or reference to the annotation guidelines of the different corpora.

As Table 6 ( Appendix A ) shows, there is information about the annotation guidelines for most corpora, although some guidelines are incomplete. For one third of the corpora the guidelines are not available. In some cases, it is indicated that existing annotation guidelines were adopted with some modifications, but these modifications are not documented.

8.5.2 Negation Elements

Another important aspect to be analyzed from the corpora is what elements of negation have been annotated. As mentioned in Section 3 , negation is often represented using one or more of the following four elements: cue, scope, focus, and event.

The first task that a negation processing system should carry out is the identification of negation cues, because it is the one that allows us to identify the presence of this phenomenon in a sentence and because the rest of the elements are linked to it. Most of the existing corpora contain annotations of negation cues. However, some of the corpora of the biomedical and clinical domain take negation into account only to annotate whether an event or relationship is negated, but not to annotate the cue; they use a clinical perspective more than a linguistic one. This is the case with the BioInfer, Genia Event, IxaMed-GS, EMC Dutch, and German negation and speculation corpora.

Depending on the negation cue used, we can distinguish three main types of negation: syntactic, lexical, and morphological (see Section 3 ). Most annotation efforts focus on syntactic negation. It has been difficult to summarize the types of negation considered, because in some cases they are specified neither in the description of a corpus nor in the guidelines, and we had to manually review the annotations of the corpora and/or contact the annotators. In Table 7 ( Appendix A ), we indicate for each corpus whether it contains annotations of negation cues (✓) or not (-), and what types of negation have been considered. In the second column, we use CS , CM , and CL to indicate that all syntactic, morphological, and lexical negation cues have been taken into account, NA if the information is not available, and PS , PM , and PL if syntactic, morphological, or lexical negation has been considered only partially (e.g., because only negation that acts on certain events or relationships has been considered, or because a list of predefined markers has been used for the annotation).

Once the negation cue has been identified, we can proceed to the identification of the rest of the elements. The scope is the part of the sentence affected by the negation cue, that is, the set of words on which the negation acts and which a system should process further, depending on its final objective. In most of the corpora reviewed the scope has been annotated, except in the Genia Event, Stockholm Electronic Patient Record, PropBank Focus (PB-FOC), EMC Dutch, Review and Newspaper Japanese, IxaMed-GS, and German negation and speculation corpora. The two remaining elements, event and focus, have been annotated to a lesser extent. The negated event is the event or property that is directly negated by the negation cue, usually a verb, a noun, or an adjective. It has been annotated in two English corpora (Genia Event and ConanDoyle-neg), three Spanish corpora (IxaMed-GS, SFU ReviewSP-NEG, and UHU-HUVR), and the EMC Dutch, Fact-Ita Bank Negation, and German negation and speculation corpora. The focus, the part of the scope most prominently or explicitly negated, has only been annotated in three English corpora (PB-FOC, Deep Tutor Negation, and SOCC) and in the Review and Newspaper Japanese corpus, which shows that it is the least studied element. In the fourth, fifth, and sixth columns of Table 7 (Appendix A), this information is represented using ✓ if the corpus contains annotations of the scope, event, and focus, respectively, or – otherwise.

8.5.3 Tokenization

The way in which each corpus was tokenized is also important, and it is only mentioned in the description of the SFU ReviewSP-NEG corpus. Why is it important? The identification of negation cues and of the different elements (scope, event, focus) is usually carried out at the token level, that is, the system is trained to tell us whether a token is a cue or not and whether it is part of a scope or not. Tokenization is also important when we want to merge annotations: if the tokenization differs between several versions of a corpus or between different corpora, merging annotations will pose technical problems.
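A minimal sketch of the kind of sanity check that merging would require: two tokenized versions of the same sentence must agree both on the underlying text and on the token boundaries before token-level labels can be transferred. The example tokenizations are invented.

```python
# Sketch: checking that two resources tokenize the same text identically,
# a precondition for merging token-level negation annotations.

def compatible_tokenization(tokens_a, tokens_b):
    same_text = "".join(tokens_a) == "".join(tokens_b)
    same_boundaries = tokens_a == tokens_b
    return same_text and same_boundaries

a = ["I", "do", "n't", "like", "it"]   # clitic split off
b = ["I", "don't", "like", "it"]       # contraction kept whole
print(compatible_tokenization(a, b))   # False: labels cannot map 1:1
```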

8.5.4 Annotation Scheme and Guidelines

The corpora analyzed differ in several annotation decisions, including the following:

Inclusion or not of the subject within the scope. For example, in the UAM Spanish Treebank corpus all the arguments of the negated event, including the subject, are included within the scope of negation (Example (28)). On the contrary, in the IULA Spanish Clinical Record corpus the subject is excluded from the scope (Example (29)), unless it is located after the verb (Example (30)) or is the subject of an unaccusative verb (Example (31)).

Gobierno, patronal y cámaras tratan de demostrar [que Chile SUBJ no castiga a las empresas españolas].

Government, employers and chambers try to demonstrate that Chile does not punish Spanish companies.

MVC SUBJ sin [ruidos sobreañadidos].

NBS no additional sounds.

Se descarta [ enolismo SUBJ ].

Oenolism discarded.

[ El dolor ] SUBJ no [ha mejorado con nolotil].

Pain has not improved with nolotil.

Inclusion or not of the cue within the scope. For example, in the annotation of the SOCC corpus, the negation cue was not included within the scope (Example (32)), whereas in the BioScope corpus it was included (Example (33)).

I cannot [believe that one of the suicide bombers was deported back to Belgium.]

Mildly hyperinflated lungs [ without focal opacity].

Annotation of the largest or the shortest syntactic unit as the scope. For example, in the Product Review corpus the annotators decided to annotate the minimal span of a negation, covering only the portion of the text being negated semantically (Example (34)), whereas in the ConanDoyle-neg corpus the longest relevant scope of the negation cue was marked (Example (35)).

Long live ambitious filmmakers with no [talent]

[It was] suggested, but never [proved, that the deceased gentleman may have had valuables in the house, and that their abstraction was the motive of the crime].

Use of a set of predefined negation cues versus all the negation cues present in a text. For example, for scope annotation in the Product Review corpus, a lexicon of 35 explicit negation cues was defined and, for instance, the cue “not even” was not considered, while in the SFU ReviewSP-NEG corpus all syntactic negation cues were taken into account.

These differences mean that the annotations are not compatible, not even between corpora of the same language and domain.

9 Merging the Existing Corpora

The perspective that we have taken in this article when analyzing the corpora annotated with negation is computational, because our final goal is not to evaluate the quality of the annotations from a theoretical perspective, but to determine whether the corpora can be used to develop a negation processing system. In order to achieve this, we need a significant amount of training data, all the more so considering that negation is a relatively infrequent phenomenon compared with those addressed in tasks like semantic role labeling. Additionally, we need high-quality data that cover all possible cases of negation. Since the existing corpora are small, we have analyzed them in order to evaluate whether it is possible to merge them into a larger one. Two features that are relevant when considering merging corpora are the language, analyzed in Section 8.1 , and the domain, reviewed in Section 8.2 . Next, we discuss the possibility of merging corpora according to each of these aspects.

On the one hand, it may be necessary to merge corpora in order to process negation in a specific language. As we have mentioned before, there are four general tasks related to negation processing: negation cue detection, scope identification, negated event extraction, and focus detection. In Table 1 we show for which of these tasks each corpus could be used. Negation cue detection and scope identification are the tasks with the most corpora available. However, it is noteworthy that in some of the corpora (BioInfer, Genia Event, Product Review, EMC Dutch, IxaMed-GS, and the German negation and speculation corpus) negation cues have not been annotated, despite the fact that the cue is the element that signals the presence of negation in a sentence and the one to which the other elements (scope, event, and focus) are connected. The task with the fewest annotated corpora is focus detection, probably because annotating focus is difficult: it depends on stress and intonation. There are also few corpora for the event extraction task, most of them belonging to the biomedical and clinical domains.

On the other hand, it may be necessary to merge corpora in order to evaluate the impact of processing negation in specific tasks such as information extraction in the biomedical and clinical domain, drug–drug interaction extraction, clinical event detection, bio-molecular event extraction, sentiment analysis, and constructiveness and toxicity detection. Corpora can also be used to improve information retrieval and question–answering systems. In Table 2 we show, for each language, the specific tasks for which the corpora could be used. Most of the corpora analyzed are applicable to (i) information extraction in the biomedical and clinical domain and (ii) sentiment analysis. For the first task, the role of negation could be evaluated in English, Spanish, Swedish, Dutch, and German (5 of the 8 languages analyzed); for the second, in English, Spanish, Japanese, Chinese, and Italian (likewise 5 of the 8). Drug–drug interactions, bio-molecular event extraction, and constructiveness and toxicity detection could only be analyzed in English, and clinical event detection only in Spanish.

As we showed in Section 8.5.1, there are corpora for which the annotation guidelines are unavailable or incomplete. This is a problem because, in order to merge corpora, we need to know the criteria followed during annotation and whether the corpora are consistent with one another. For example, if negation cues are included within the scope of negation, this rule must be satisfied in all the corpora used to train a negation processing system; the sketch below illustrates the kind of audit this requires.
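A consistency audit of this kind can be sketched in a few lines. The data model (each negation instance storing token indices for its cue and its scope) is an assumption made for illustration, not the format of any particular corpus:

```python
# Sketch of a guideline-consistency audit before merging. The data model
# (each negation instance stores token indices for its cue and scope) is an
# assumption for illustration.

def cue_inside_scope(instance):
    """Check the rule 'the cue is included within the scope'."""
    return set(instance["cue"]) <= set(instance["scope"])

corpora = {
    "corpus_x": [{"cue": [1], "scope": [1, 2, 3]}],   # cue inside scope
    "corpus_y": [{"cue": [1], "scope": [2, 3]}],      # cue outside scope
}
for name, instances in corpora.items():
    ok = all(cue_inside_scope(i) for i in instances)
    print(name, "satisfies the rule" if ok else "violates the rule")
```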

As mentioned in Section 8.5.2, the corpora have been annotated for different purposes. Some have been annotated with the final application in mind, whereas others were annotated from a linguistic point of view. In some cases not all types of negation were considered, or they were only partially taken into account. Therefore, when merging corpora it is very important to take into consideration the types of negation (syntactic, morphological, lexical) and to merge only corpora completely annotated with the same types, so that the system is not trained on false negatives.

As indicated in Section 8.5.3, the way in which each corpus was tokenized is unspecified in most cases, even though annotations are carried out at token level. If we wanted to expand the corpora, we would need more technical information of this kind to make sure that the annotations are compatible. And if we want to run a negation processing system on new test data, the tokenization must be the same in the training and the test data.

As we have shown in Section 8.5.4, the annotation formats differ. This problem could be resolved by converting the corpus annotations, but the process is time-consuming: each corpus must be pre-processed in its own way in order to extract the information related to negation and represent it in the input format of the machine learning system. The sketch below shows one such conversion.
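For instance, a conversion from the bracketed annotation style used in the examples above to token-level BIO labels, one common input format for sequence-labelling systems, might look as follows (a naive sketch that assumes the brackets are separate tokens; the bracketing in the test sentence is invented for illustration):

```python
# Sketch: converting a bracket-annotated sentence (scope between [ and ], as
# in the examples above) into token-level BIO labels, one common input format
# for sequence-labelling systems.

def brackets_to_bio(text):
    tokens, labels, inside, first = [], [], False, False
    for tok in text.split():
        if tok == "[":
            inside, first = True, True
        elif tok == "]":
            inside = False
        else:
            tokens.append(tok)
            labels.append(("B-SCOPE" if first else "I-SCOPE") if inside else "O")
            first = False
    return tokens, labels

# Invented bracketing, for illustration only:
print(brackets_to_bio("Pain has not [ improved with nolotil ]"))
# (['Pain', 'has', 'not', 'improved', 'with', 'nolotil'],
#  ['O', 'O', 'O', 'B-SCOPE', 'I-SCOPE', 'I-SCOPE'])
```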

Finally, as also discussed in Section 8.5.4, the annotation guidelines differ. This is a major problem because it means that the criteria applied during annotation differ: some authors include the subject within the scope of negation, for example, while others leave it out. If the training examples are contradictory, the system will not be reliable.

As our analysis shows, the main problem is the absence of a common annotation scheme and common guidelines. Looking ahead, the annotation of negation should be standardized in the same way as other annotation tasks, such as semantic role labeling. Moreover, for several languages the corpora annotated with negation are limited (for example, Spanish, Swedish, Dutch, Japanese, Chinese, German, and Italian), and for others, such as Arabic, French, or Russian, no corpora have been annotated with this information at all. This is a sign that we must keep working to advance the study of this phenomenon, which is so important to the development of systems that approach human understanding.

We have analyzed whether it is possible to make these corpora compatible. First, we focus on overall negation processing tasks ( Table 1 ).

I don’t like meat.

El final del libro no te aporta nada, no añade nada nuevo, ¿no crees?

The end of the book doesn’t give you anything, it doesn’t add anything new, don’t you think?

He is a well-known author but he is not the best for me.

For scope identification , we would face the same problems as for cue detection, and additional aspects would also have to be resolved, such as unifying whether the subject and the cue are included within the scope, and unifying the length of the scope to the largest or the shortest syntactic unit. We would have to use the same syntactic analyzer to process the texts and convert the manual annotations into annotations that follow the new standards for subject inclusion and scope length; a sketch after this paragraph shows the easy direction of one such conversion. For event extraction , the main problem is that in most of the corpora events have been annotated only if they are clinically or biologically relevant, so not all negated events are annotated. Finally, for focus detection , we would be able to merge the PB-FOC, Deep Tutor Negation, and SOCC English corpora.
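The easy direction of the cue-inclusion conversion, dropping the cue tokens from the scope so that all corpora follow the cue-excluded convention, can be sketched as follows (the token indices are hypothetical):

```python
# Sketch: normalizing one convention mismatch before merging. Mapping from a
# cue-included convention (BioScope style) to a cue-excluded one (SOCC style)
# is the easy direction: drop the cue tokens from the scope. The reverse, or
# extending scopes to include subjects, would require syntactic analysis.

def exclude_cue_from_scope(instance):
    cue = set(instance["cue"])
    return {**instance, "scope": [i for i in instance["scope"] if i not in cue]}

bioscope_style = {"cue": [2], "scope": [2, 3, 4, 5]}   # hypothetical indices
print(exclude_cue_from_scope(bioscope_style))
# {'cue': [2], 'scope': [3, 4, 5]}
```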

Once the problems related to negation processing have been solved, it would be possible to merge corpora for specific tasks ( Table 2 ). This would require a study of the annotation schemes, the labels used, and their values. For sentiment analysis, for example, we would have to make sure that the corpora use the same polarity labels. If they do not, we would have to analyze the meaning of the labels, define a new tag set, and map the existing labels of these corpora onto the new tag set, as sketched below.
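A minimal sketch of such a mapping, with invented label names standing in for the real tag sets of the corpora:

```python
# Sketch (label names invented): mapping the polarity tag sets of two
# sentiment corpora onto one common tag set before merging.

COMMON_TAGS = {"POS", "NEG", "NEU"}
LABEL_MAP = {
    "corpus_a": {"positive": "POS", "negative": "NEG", "neutral": "NEU"},
    "corpus_b": {"+": "POS", "-": "NEG", "0": "NEU"},
}

def to_common(corpus, label):
    mapped = LABEL_MAP[corpus].get(label)
    if mapped not in COMMON_TAGS:
        raise ValueError(f"unmapped label {label!r} in {corpus}")
    return mapped

print(to_common("corpus_b", "-"))   # NEG
```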

In this article, we have reviewed the existing corpora annotated with negation information in several languages. Processing negation is a very important task in NLP because negation is a linguistic phenomenon that can change the truth value of a proposition, and so it is crucial in some tasks such as sentiment analysis, information extraction, summarization, machine translation, and question answering. Most corpora have been annotated for English, but it is also necessary to focus on other languages whose presence on the Internet is growing, such as Chinese or Spanish.

We have conducted an exhaustive search for corpora annotated with negation, finding corpora for the following languages: English, Spanish, Swedish, Dutch, Japanese, Chinese, German, and Italian. We have described the main features of the corpora according to the following criteria: language, year of publication, domain, availability, size, types of negation taken into account (syntactic and/or lexical and/or morphological), negation elements annotated (cue and/or scope and/or negated event and/or focus), tokenization, annotation guidelines, and annotation scheme used. In addition, we have included an appendix with tables summarizing all this information in order to facilitate analysis.

In sum, the language and year of publication of the corpora show that interest in the annotation of negation started in 2007 with English texts, followed by Swedish in 2010, whereas for the other languages (Spanish, Dutch, Chinese, German, and Italian) it is a task of more recent interest. Most of the corpora have been documented in the last 5 years, which shows that negation is a phenomenon whose processing has not yet been resolved and which continues to generate interest. Concerning the domains, those that have mainly attracted the attention of researchers are the medical domain and reviews/opinion articles. Another important fact is the availability of the corpora: most are publicly available, and most of the non-available corpora contain clinical reports, with legal and ethical issues probably affecting their status. The existing corpora are not very large, which hinders the development of machine learning systems, since the frequency of negation is low. Finally, in relation to the annotation guidelines, most annotators define guidelines, but some are incomplete and others are unavailable. In addition, we found differences in the annotation schemes used and, most importantly, in the annotation guidelines, in the way in which each corpus was tokenized, and in the negation elements annotated. The annotation formats differ from corpus to corpus; there is no standard annotation scheme. Moreover, the criteria used during the annotation process differ, especially with regard to three aspects: the inclusion or not of the subject and the cue in the scope; the annotation of the scope as the largest or the shortest syntactic unit; and the annotation of all negation cues or only a subset defined in advance. Another important finding is that most of the corpora do not specify how they were tokenized, which is essential for negation processing systems because the identification of the negation elements (cue, scope, event, and focus) is carried out at token level.

We conclude that the lack of a standard annotation scheme and guidelines, as well as the lack of large annotated corpora, makes it difficult to progress in the treatment of negation. As future work, the community should work on the standardization of negation annotation, as has been done for other well-established tasks like semantic role labeling and parsing. A robust and precise annotation scheme should be defined for the elements that represent the phenomenon of negation (cue, scope, negated event, and focus), and researchers should work together to define common annotation guidelines.

Note: The link to the Review and Japanese corpus is currently not available (accessed March 19, 2019). However, the authors state that they plan to distribute it freely at the link provided.

Note: PS, PM, and PL are used when syntactic, morphological, and lexical negation, respectively, has been annotated only partially. CS, CM, and CL indicate that all syntactic, morphological, and lexical negations have been annotated.

This work has been partially supported by a grant from the Ministerio de Educación, Cultura y Deporte (MECD - scholarship FPU014/00983), LIVING-LANG project (RTI2018-094653-B-C21), Fondo Europeo de Desarrollo Regional (FEDER), and REDES project (TIN2015-65136-C2-1-R) from the Spanish Government. R.M. was supported by the Netherlands Organization for Scientific Research (NWO) via the Spinoza-prize awarded to Piek Vossen (SPI 30-673, 2014-2019). We are thankful to the authors of the corpora who kindly answered our questions.

http://www.mrtuit.com/ .

https://www.meaningcloud.com/es/productos/analisis-de-sentimiento .

https://catalog.ldc.upenn.edu/ .

http://catalog.elra.info/en-us/ .

http://lremap.elra.info/ .

http://www.meta-share.org/ .

http://linguistic.linkeddata.es/retele-share/sparql-editor/ .

There are authors that do not include the negation cue within the scope.

The Groningen Meaning Bank is available at http://gmb.let.rug.nl .

DeepBank is available at http://moin.delph-in.net/DeepBank .

http://www.nactem.ac.uk/meta-knowledge/Annotation_Guidelines.pdf .

The annotation guidelines can be downloaded at http://rgai.inf.u-szeged.hu/project/nlp/bioscope/Annotation%20guidelines2.1.pdf and a discussion of them can be found in Vincze ( 2010 ).

The annotation guidelines are described in Morante, Schrauwen, and Daelemans ( 2011 ).

www.clips.ua.ac.be/sem2012-st-neg/ .

https://github.com/sfu-discourse-lab/SOCC .

https://github.com/sfu-discourse-lab/SOCC/tree/master/guidelines .

http://ixa.si.ehu.eus/extrecm .

First Online: 22 May 2017 https://doi.org/10.1007/s10579-017-9391-x .

https://www.sfu.ca/~mtaboada/SFU_Review_Corpus.html .

The inter-annotator agreement values have been corrected with respect to those published in Jiménez-Zafra et al. ( 2018b ) due to the detection of an error in the calculation thereof.

http://rit.rakuten.co.jp/rdr/index_en.html .

http://www.ninjal.ac.jp/english/products/bccwj/ .

http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html .

Accessed March 19, 2019.

The Italian and Chinese percentages correspond to the only existing corpus in each language. The percentages of sentences annotated with negation in Swedish and Dutch could not be calculated because the information provided by the authors corresponds to expressions and medical terms, respectively.

In the examples provided to clarify differences, we mark in bold negation cues and enclose negation scopes between [square brackets].


Definition and Examples of Corpus Linguistics


Corpus linguistics is the study of language based on large collections of "real life" language use stored in corpora (or corpuses)—computerized databases created for linguistic research. It is also known as corpus-based studies.

Corpus linguistics is viewed by some linguists as a research tool or methodology and by others as a discipline or theory in its own right. Sandra Kübler and Heike Zinsmeister state in their book, "Corpus Linguistics and Linguistically Annotated Corpora," that "the answer to the question whether corpus linguistics is a theory or a tool is simply that it can be both. It depends on how corpus linguistics is applied."

Although the methods used in corpus linguistics were first adopted in the early 1960s, the term itself didn't appear until the 1980s.

Examples and Observations

"[C]orpus linguistics is...a methodology, comprising a large number of related methods which can be used by scholars of many different theoretical leanings. On the other hand, it cannot be denied that corpus linguistics is also frequently associated with a certain outlook on language. At the centre of this outlook is that the rules of language are usage -based and that changes occur when speakers use language to communicate with each other. The argument is that if you are interested in the workings of a particular language, like English , it is a good idea to study language in use. One efficient way of doing this is to use corpus methodology...."

– Hans Lindquist, Corpus Linguistics and the Description of English . Edinburgh University Press, 2009

"Corpus studies boomed from 1980 onwards, as corpora, techniques and new arguments in favour of the use of corpora became more apparent. Currently this boom continues—and both of the 'schools' of corpus linguistics are growing....Corpus linguistics is maturing methodologically and the range of languages addressed by corpus linguists is growing annually."

– Tony McEnery and Andrew Wilson, Corpus Linguistics , Edinburgh University Press, 2001

Corpus Linguistics in the Classroom

"In the context of the classroom the methodology of corpus linguistics is congenial for students of all levels because it is a 'bottoms-up' study of the language requiring very little learned expertise to start with. Even the students that come to linguistic enquiry without a theoretical apparatus learn very quickly to advance their hypotheses on the basis of their observations rather than received knowledge, and test them against the evidence provided by the corpus."

– Elena Tognini-Bonelli,  Corpus Linguistics at Work . John Benjamins, 2001

"To make good use of corpus resources a teacher needs a modest orientation to the routines involved in retrieving information from the corpus, and—most importantly—training and experience in how to evaluate that information."

– John McHardy Sinclair, How to Use Corpora in Language Teaching , John Benjamins, 2004

Quantitative and Qualitative Analyses

"Quantitative techniques are essential for corpus-based studies. For example, if you wanted to compare the language use of patterns for the words big and large , you would need to know how many times each word occurs in the corpus, how many different words co-occur with each of these adjectives (the collocations ), and how common each of those collocations is. These are all quantitative measurements....

"A crucial part of the corpus-based approach is going beyond the quantitative patterns to propose functional interpretations explaining why the patterns exist. As a result, a large amount of effort in corpus-based studies is devoted to explaining and exemplifying quantitative patterns."

– Douglas Biber, Susan Conrad, and Randi Reppen, Corpus Linguistics: Investigating Language Structure and Use , Cambridge University Press, 2004
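The quantitative measurements Biber, Conrad, and Reppen describe can be sketched in a few lines of Python. The toy corpus and the restriction to immediately following words are simplifications for illustration, not the authors' method:

```python
# A minimal sketch of the counts described above: frequency of "big" and
# "large", and the words immediately following each (a crude stand-in for
# collocation counting), over a toy corpus.
from collections import Counter

corpus = "the big dog saw a large house near the big tree and a large dog".split()

freq = Counter(w for w in corpus if w in {"big", "large"})
collocates = Counter(
    (w, corpus[i + 1]) for i, w in enumerate(corpus[:-1]) if w in {"big", "large"}
)
print(freq)                      # Counter({'big': 2, 'large': 2})
print(collocates.most_common())  # each (adjective, following word) pair once
```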

"[I]n corpus linguistics quantitative and qualitative methods are extensively used in combination. It is also characteristic of corpus linguistics to begin with quantitative findings, and work toward qualitative ones. But...the procedure may have cyclic elements. Generally it is desirable to subject quantitative results to qualitative scrutiny—attempting to explain why a particular frequency pattern occurs, for example. But on the other hand, qualitative analysis (making use of the investigator's ability to interpret samples of language in context) may be the means for classifying examples in a particular corpus by their meanings; and this qualitative analysis may then be the input to a further quantitative analysis, one based on meaning...."

– Geoffrey Leech, Marianne Hundt, Christian Mair, and Nicholas Smith, Change in Contemporary English: A Grammatical Study . Cambridge University Press, 2012

  • Kübler, Sandra, and Zinsmeister, Heike.  Corpus Linguistics and Linguistically Annotated Corpora . Bloomsbury, 2015.


Manually Annotated Corpora

Manually annotated corpora are collections of texts containing manually validated or manually assigned linguistic information, such as morphosyntactic tags, lemmas, syntactic parses, named entities, etc. These corpora can be used to train new language annotation tools, as well as to test the accuracy of existing annotation tools, as in the sketch below.
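As a minimal illustration of the testing use, per-token accuracy of a tagger against a manually annotated gold standard can be computed as follows (the tag sequences are invented for the example):

```python
# Sketch: using a manually annotated corpus as a gold standard to test an
# automatic tagger. Accuracy here is simply the share of tokens on which the
# tool agrees with the manual annotation (data are hypothetical).

gold      = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]
predicted = ["DET", "NOUN", "VERB", "DET", "NOUN", "NOUN"]

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(f"tagging accuracy: {accuracy:.2%}")   # 83.33%
```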

There are more than 70 manually annotated training corpora and corpus collections in the CLARIN infrastructure. Among the multilingual corpora, there are 4 collections in the CLARIN infrastructure that were annotated under the following umbrella initiatives:  HamleDT 3.0 , Treebanks of INESS , Universal Dependencies , and Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.1) .  

The corpora and corpus collections are classified into six categories based on the type of manual annotation:

  • PoS/MSD tagging
  • Lemmatisation
  • Syntactic parsing
  • Named Entity recognition
  • Sentiment analysis
  • Other annotation layers

If a corpus is manually annotated for more than one type of linguistic information, it is listed under all the relevant sections. For instance, the xLiMe Twitter Corpus XTC 1.0.1 is manually annotated for PoS tags, Named Entities, and sentiment, so it is listed under all three relevant sections.

For comments, changes to the existing content, or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.



Terms & Disclaimer

what is annotated corpora

‘Automatic Collation for Diversifying Corpora: Commonly Copied Texts as Distant Supervision for Handwritten Text Recognition’

what is annotated corpora

“Handwritten text recognition (HTR) has enabled many researchers to gather textual evidence from the human record. … To build generalized models for Arabic-script manuscripts, perhaps one of the largest textual traditions in the pre-modern world, we need an approach that can improve its accuracy on unseen manuscripts and hands without linear growth in the amount of manually annotated data. We propose Automatic Collation for Diversifying Corpora (ACDC), taking advantage of the existence of multiple manuscripts of popular texts.”

Find the paper and full list of authors in the Computational Humanities Research Conference 2023 proceedings.

‘Hierarchical RL-Guided Large-Scale Navigation of a Snake Robot’

‘bergeron: combating adversarial attacks through a conscience-based alignment framework’, ‘more samples or more prompt inputs exploring effective in-context sampling for llm few-shot prompt engineering’, ‘multi-instance randomness extraction and security against bounded-storage mass surveillance’, ‘is a seat at the table enough engaging teachers and students in dataset specification for ml in education’, ‘”the wallpaper is ugly”: indoor localization using vision and language’, ‘human still wins over llm: an empirical study of active learning on domain-specific annotation tasks’, ‘beyond labels: empowering human annotators with natural language explanations through a novel active-learning architecture’, ‘icml 2023 topological deep learning challenge: design and results’.


More Types of Corpus Annotation


Niladri Sekhar Dash


In this chapter, we define the basic concepts of some non-conventional types of text annotation which, for several reasons, are not frequently applied to corpus texts. This discussion gives readers some preliminary ideas about the ways and means of annotating a text at various levels, beyond the grammatical and syntactic levels of annotation, to make a text useful in various domains of linguistics and language engineering. The availability of different types of annotated texts is a blessing for various academic and commercial applications. The primary goals and objectives of each type of corpus annotation are characteristically different. Goals vary depending on the type of text and its possible uses in customized, object-oriented applications. Because of these marked differences in goals, the processes of corpus annotation vary. Besides, the kind of text considered for annotation plays a decisive role in the selection of annotation type. Non-conventionally annotated texts are not always useful for all kinds of linguistic investigation and study; they are useful in contexts where non-standard analysis and interpretation of texts are required for a specific application. That is, applying non-conventional annotation to a text generates a kind of output that is not frequently used in traditional schemes of language description and analysis. However, such texts become significantly relevant in many advanced areas of applied linguistics and language data management. On the other hand, non-conventional annotation techniques require a different kind of capability in understanding a text, which in turn generates a new kind of expertise in text interpretation and information processing and management. Keeping all these aspects in view, in this chapter we focus on some of the non-typical and non-conventional text annotation types which are not so frequently applied in corpus annotation.

  • Orthography
  • Figure-of-speech


Archer, D., & Culpeper, J. (2003). Socio-pragmatic annotation: New directions and possibilities in historical corpus linguistics. In A. Wilson, P. Rayson, & A. McEnery (Eds.), Corpus linguistics by the Lune: A festschrift for Geoffrey Leech (pp. 37–58). Peter Lang.

Google Scholar  

Archer, D., McEnery, T., Rayson, P., & Hardie, A. (2003). Developing an automated semantic analysis system for Early Modern English. In Archer, D., Rayson, P., Wilson, A., and McEnery, T. (Eds.). Proceedings of the corpus linguistics 2003 conference. UCREL technical paper number 16 (pp. 22–31). UCREL, Lancaster University.

Aristotle. (1982). The art of rhetoric (Trans. John Henry Freese). Loeb Classical Library.

Atkins, S., Clear, J., & Ostler, N. (1992). Corpus design criteria. Literary and Linguistic Computing, 7 (1), 1–16.

Article   Google Scholar  

Boersma, P., & van Heuven, V. (2001). Speak and unSpeak with PRAAT. Glot International., 5 (9/10), 341–347.

Carlson, L., Marcu, D., & Okurowski, M. E. (2003). Building a discourse-tagged corpus in the framework of rhetorical structure theory. In Kuppevelt, J. V. & Smith, R. W. (Eds.) Current and new directions in discourse and dialogue (pp. 85–112). Springer.

Crowley, S., & Hawhee, D. (2004). Ancient rhetorics for contemporary students . Pearson Education.

Cuddon, J. A. (1998). The Penguin dictionary of literary terms and literary theory . Penguin Books.

Dash, N. S. (2009). Language corpora: Past, present, and future . Mittal Publications.

Dash, N. S., & Ramamoorthy, L. (2019). Utility and Application of Language Corpora . Springer Nature.

DuBois, J. W., Cumming, S., Schuetze-Coburn, S., & Paolino, D. (Eds.) (1992). Discourse transcription . Santa Barabara papers in linguistics (vol. 4). University of California.

Edwards, J. A., & Lampert, M. D. (Eds.). (1993). Talking data: Transcription and coding in discourse research . Erlbaum.

Fink, G. A., Johanntokrax, M., & Schaffranietz, B. (1995). A flexible formal language for the orthographic transcription of spontaneous spoken dialogues. In Proceedings of the 4th European conference on speech communication and speech technology (Eurospeech'95) (vol. 1, pp. 871–874). Madrid, Spain, 18–21 Sept 1995.

Garside, R., & Rayson, P. (1997). Higher-level annotation tools. In R. Garside, G. Leech, & A. McEnery (Eds.), Corpus annotation: Linguistic information from computer text corpora (pp. 179–193). Longman.

Chapter   Google Scholar  

Garside, R., Leech, G., & McEnery, A. (Eds.). (1997). Corpus annotation: Linguistic information from computer text corpora . Longman.

Grice, M., Leech, G., Weisser, M., & Wilson, A. (2000). Representation and annotation of dialogue. In: Dafydd, G., Mertins, I. and Moore, R.K. (eds.) Handbook of multimodal & spoken dialogue systems. Resources, terminology, and product evaluation (pp. 1–101) . Kluwer Academic Publishers.

Grover, C., Facrell, J., Vereecken, H., Martens, J. P., & Coile, B. V. (1998). Designing prosodic databases for automatic modelling in 6 languages. In Proceedings of the 3rd ESCA/COCOSDA workshop on speech synthesis (SSW3–1998) (pp. 93–98). Jenolan Caves House, Blue Mountains, Australia, 26–29 Nov 1998.

Gussenhoven, C., Rietveld, T., & Terken, J. (1999). ToDI: Transcription of Dutch intonation . http://todi.let.kun.nl/ToDI/home.htm

Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English (English language series 9) . Longman.

Halliday, M. A. K., & Hasan, R. (1989). Language, context, and text: Aspects of language in a social-semiotic perspective . Oxford University Press.

Harris, R. A., Marco, C. D., Ruan, S., & O’Reilly, C. (2018). An annotation scheme for rhetorical figures. Argument & Computation, 9 , 155–175.

Heinrichs, J. (2007). Thank you for arguing . Three Rivers Press.

Hepburn, A., & Bolden, G. B. (2013). The conversation analytic approach to transcription. In J. Sidnell & T. Stivers (Eds.), The handbook of conversation analysis (pp. 57–76). Blackwell.

Hepburn, A. D. (1875). Manual of English rhetoric . American Book Company.

Hymes, D. (1962). The ethnography of speaking. In T. Gladwin & W. C. Sturtevant (Eds.), Anthropology and human behavior (pp. 13–53). The Anthropology Society of Washington.

Hymes, D. (1964). Introduction: Toward ethnographies of communication. American Anthropologist, 66 (6), 1–34.

Jakobson, R. (1959). On linguistic aspects of translation. In R. A. Brower (Ed.), On translation (pp. 232–239). Harvard University Press.

Jakobson, R. (1960). Linguistics and poetics. In T. Sebeok (Ed.), Style in language (pp. 350–377). MIT Press.

Johansson, S. (1995). The encoding of spoken texts. Computers & the Humanities, 29 (1), 149–158.

Joos, M. (1962). The five clocks. International Journal of American Linguistics, 28 , 9–62.

Knowles, G. (1991). Prosodic labelling: The problem of tone group boundaries. In S. Johansson & A.-B. Stenström (Eds.), English computer corpora: Selected papers and research guides (pp. 149–163). Mouton de Gruyter.

Lanham, R. (1991). A handlist of rhetorical terms (2nd ed.). University of California Press.

Book   Google Scholar  

Lee, A., Prasad, R., Joshi, A., Dinesh, N., & Webber, B. (2006). Complexity of dependencies in discourse. In Proceedings of the 5th Workshop on Treebanks and Linguistic Theory (TLT’06) .

Leech, G., & Wilson, A. (1999). Guidelines & standards for tagging. In H. van Halteren (Ed.), Syntactic wordclass tagging (pp. 55–80). Kluwer.

Leech, G. (1993). Corpus annotation schemes. Literary and Linguistic Computing, 8 (4), 275–281.

Löfberg, L., Archer, D., Piao, S., Rayson, P., McEnery, A., Varantola, K., & Juntunen, J. P. (2003). Porting an English semantic tagger to the Finnish language. In: Archer, D., Rayson, P., Wilson, A., & McEnery, T. (Eds.) In Proceedings of the corpus linguistics 2003 conference. UCREL technical paper number 16 (pp. 457–464). UCREL, Lancaster University.

Löfberg, L., Juntunen, J. P., Nykanen, A., Varantola, K., Rayson, P., & Archer, D. (2004). Using a semantic tagger as a dictionary search tool. In: Williams, G., & Vessier, S. (Eds.) Proceedings of the 11th EURALEX (European association for lexicography) International congress (Euralex 2004) (vol. I, pp. 127–134). Université de Bretagne Sud, 6–10 July 2004.

Löfberg, L., Piao, S., Rayson, P., Juntunen, J.P., Nykänen, A., & Varantola, K. (2005). A semantic tagger for the Finnish language. In Proceedings of the corpus linguistics 2005 conference series online e-journal (vol. 1, no. 1.). 14–17 July 2005.

McArthur, T. (Ed.). (1981). Longman lexicon of contemporary English . Longman.

McEnery, T., & Wilson, A. (1996). Corpus linguistics . Edinburgh University Press.

Milde, J. T., & Gut, U. B. (2002). A prosodic corpus of non-native speech. In: Bel, B., & Marlien, I. (Eds.) Proceedings of the speech prosody 2002 conference (pp. 503–506). Laboratoire Parole et Language, 11–13 April 2002.

Miltsakaki, E., Prasad, R., Joshi, A., & Webber, B. (2004). Annotating discourse connectives and their arguments . In NAACL/HLT Workshop on Frontiers in Corpus Annotation.

O’Donnell, M. B. (1999). The use of annotated corpora for New Testament discourse analysis: A survey of current practice and future prospects. In S. E. Porter & J. T. Reed (Eds.), Discourse analysis and the new testament: Results and applications (pp. 71–117). Sheffield Academic Press.

Piao, S., Archer, D., Mudraya, O., Rayson, P., Garside, R., McEnery, A.M., & Wilson, A. (2006). A large semantic lexicon for corpus annotation. In Proceedings of the corpus linguistics 2005 conference series online e-journal (vol. 1, no. 1). July 14–17.

Piao, S., Bianchi, F., Dayrell, C., D'Egidio, A., & Rayson, P. (2015). Development of the multilingual semantic annotation system. In Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics—human language technologies (NAACL HLT 2015) (pp. 1268–1274).

Piao, S., Rayson, P., Archer, D., & McEnery, A. M. (2005). Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Journal of Computer Speech & Language., 19 (4), 378–397.

Piao, S., Rayson, P., Archer, D., Bianchi, F., Dayrell, C., El-Haj, M., Jiménez, R., Knight, D., Kren, M., Löfberg, L., Nawab, R. M. A., Shafi, J., Teh, P., & Mudraya, O. (2016). Lexical coverage evaluation of large-scale multilingual semantic lexicons for twelve languages. In Proceedings of the 10th International language resources and evaluation conference (LREC2016) (pp. 2614–2619).

Polakova, L., Mirovsky, J., Nedoluzhko, A., Jinova, P., Zikanova, S., & Hajicova, E. (2013). Introducing the Prague discourse treebank 1.0. In Proceedings of the Sixth International joint conference on natural language processing (IJCNLP) (pp. 91–99). 14–18 Oct 2013.

Portele, T., & Heuft, B. (1995). Two kinds of stress perception. In Proceedings of the 13th international congress of phonetic sciences (ICPhS 95) (pp. 126–129). 13–19 August 1995.

Prasad, R., Forbes-Riley, K., & Lee, A. (2017). Towards full-text shallow discourse relation annotation: Experiments with cross-paragraph implicit relations in the PDTB. In Proceedings of the 18th Annual SIGdial meeting on discourse and dialogue (pp. 7–16).

Quinn, A. (1995). Figures of speech: 60 ways to turn a phrase . Routledge.

Rastier, F. (ed.) (2001). A little glossary of semantics. Texts and cultures (electronic glossary) (Larry Marks Trans.). Retrieved on 26 June 2020.

Rayson, P., & Stevenson, M. (2008). Sense and semantic tagging. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook (pp. 564–579). Gruyter.

Rayson, P., & Wilson, A. (1996). The ACAMRIT semantic tagging system: progress report. In Evett, L. J., & Rose T. G. (Eds.) Language engineering for document analysis and recognition , LEDAR, AISB96 workshop proceedings (pp. 13–20).

Sinclair, J. M. (1994). Spoken language: Phonetic-phonemic and prosodic annotation. In Calzolari, N., Baker, M., & Kruyt, P. G. (Eds.) Towards a network of European reference corpora (pp. 129–132). Giardini.

Sperberg-McQueen, C. M., & Burnard, L. (Eds.) (1994). Guidelines for electronic text encoding and interchange. The Association for Computers and the Humanities/The Association for Literary and Linguistic Computing and The Association for Computational Linguistics.

Stenström, A.-B., & Andersen, G. (1996). More trends in the teenage talk: A corpus-based investigation of the discourse items ‘cos’ and ‘init.’ In C. Percy, C. F. Meyer, & I. Lancashire (Eds.), Synchronic corpus linguistics: Papers from the 16th International conference on english language research on computerized corpora (pp. 189–203). Rodopi.

Stenström, A.-B. (1984). Discourse tags. In J. Aarts & W. Meijs (Eds.), Corpus linguistics: Recent developments in the use of computer corpora in English language research (pp. 65–81). Rodopi.

Teufel, S., Carletta, J., & Moens, M. (1999). An annotation scheme for discourse-level argumentation in research articles. In Proceedings of the 9th European conference of the ACL (EACL-99) (pp. 110–117).

Webber, B., Stone, M., Joshi, A., & Knott, A. (2003). Anaphora and discourse structure. Computational Linguistics, 29 (4), 545–587.

Webber, B. (2005). A short introduction to the Penn discourse treebank. In Copenhagen working papers in language and speech processing .

Wilson, A., & Thomas, J. A. (1997). Semantic annotation. In R. Garside, G. Leech, & A. McEnery (Eds.), Corpus annotation: Linguistic information from computer text corpora (pp. 53–65). Longman.

Wolf, F., & Gibson, E. (2005). Representing discourse coherence: A corpus-based study. Computational Linguistics, 31 , 249–287.

http://lands.let.kun.nl/todi

http://lands.let.ru.nl/cgn/doc_English/topics/version_1.0/annot/prosody/info.htm

http://users.ox.ac.uk/~eets/Guidelines%20for%20Editors%2011.pdf .

http://www.fon.hum.uva.nl/praat/

http://www.helsinki.fi/varieng/series/volumes/10/diemer/

http://www.ling.upenn.edu/hist-corpora/annotation/index.html .

http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf .

http://www.helsinki.fi/varieng/CoRD/corpora/CEEM/MEMTindex.html .

http://www.uis.no/getfile.php/Forskning/Kultur/MEG/Corpus_manual_%202011_1.pdf .

http://www.uis.no/research-and-phd-studies/research-areas/history-languages-and-literature/the-middle-english-scribal-texts-programme/meg-c/ .

https://tei-c.org/release/doc/tei-p5-doc/en/html/CC.html

https://www.ling.upenn.edu/hist-corpora/annotation/index.html

http://phlox.lancs.ac.uk/ucrel/semtagger/chinese

http://phlox.lancs.ac.uk/ucrel/semtagger/dutch

http://phlox.lancs.ac.uk/ucrel/semtagger/italian

http://phlox.lancs.ac.uk/ucrel/semtagger/portuguese

http://phlox.lancs.ac.uk/ucrel/semtagger/spanish

http://ucrel.lancs.ac.uk/usas/

http://ucrel-api.lancaster.ac.uk/usas/tagger.html

http://www.revue-texto.net/Reperes/Glossaires/Glossaire_en.html

https://www.seas.upenn.edu/~pdtb/

http://www.wlv.ac.uk/˜le1825/anaphora_resolution_papers/state.ps


About this chapter

Dash, N.S. (2021). More Types of Corpus Annotation. In: Language Corpora Annotation and Processing. Springer, Singapore. https://doi.org/10.1007/978-981-16-2960-0_6



Computer Science > Computer Vision and Pattern Recognition

Title: MuLAn: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation

Abstract: Text-to-image generation has achieved astonishing results, yet precise spatial controllability and prompt fidelity remain highly challenging. This limitation is typically addressed through cumbersome prompt engineering, scene layout conditioning, or image editing techniques which often require hand drawn masks. Nonetheless, pre-existing works struggle to take advantage of the natural instance-level compositionality of scenes due to the typically flat nature of rasterized RGB output images. Towards addressing this challenge, we introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of RGB images as multilayer, instance-wise RGBA decompositions, and over 100K instance images. To build MuLAn, we developed a training free pipeline which decomposes a monocular RGB image into a stack of RGBA layers comprising a background and isolated instances. We achieve this through the use of pretrained general-purpose models, and by developing three modules: image decomposition for instance discovery and extraction, instance completion to reconstruct occluded areas, and image re-assembly. We use our pipeline to create MuLAn-COCO and MuLAn-LAION datasets, which contain a variety of image decompositions in terms of style, composition and complexity. With MuLAn, we provide the first photorealistic resource providing instance decomposition and occlusion information for high quality images, opening up new avenues for text-to-image generative AI research. With this, we aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions. MuLAn data resources are available at this https URL .
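Although the paper's own pipeline is more involved, the "image re-assembly" step the abstract describes corresponds to standard back-to-front alpha compositing of RGBA layers over the background. A minimal numpy sketch of that operation (not the authors' code):

```python
# Sketch (not the authors' pipeline): re-assembling an image from a stack of
# RGBA layers by standard back-to-front alpha compositing.
import numpy as np

def composite(layers):
    """layers: list of float RGBA arrays (H, W, 4) in [0, 1], background first."""
    out = layers[0][..., :3]
    for layer in layers[1:]:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out
    return out

background = np.ones((4, 4, 4)); background[..., :3] = 0.5        # opaque grey
instance = np.zeros((4, 4, 4)); instance[1:3, 1:3] = (1, 0, 0, 1)  # red square
print(composite([background, instance]).shape)                    # (4, 4, 3)
```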



COMMENTS

  1. Developing Linguistic Corpora: a Guide to Good Practice

    Corpus annotation is the practice of adding interpretative linguistic information to a corpus. For example, one common type of annotation is the addition of tags, or labels, indicating the word class to which words in a text belong. This is so-called part-of-speech tagging (or POS tagging), and can be useful, for example, in distinguishing ...

  2. Annotated versus unannotated corpora

    Linguistic analyses encoded in the corpus data itself are usually called corpus annotation. For example, we may wish to annotate a corpus to show parts of speech, assigning to each word a grammatical category label. So when we see the word talk in the sentence I heard John's talk and it was the same old thing, we would assign it the category ...

  3. Corpus Annotation

    As highlighted in [48], in annotated corpora for sentiment analysis this is especially challenging. Research in psychology outlines three main approaches to the modeling of emotions and sentiments: the categorical, the dimensional, and the appraisal-based approach. The most widespread are the categorical and the dimensional ones, which describe ...

  4. Annotated Corpora and Annotation Tools

    The Annotated Corpora (AnCora) Footnote 14 of Spanish and Catalan are the result of years of annotation at different linguistic levels . The corpora began as an initiative by the University of Barcelona, the Technical University of Catalonia, and the University of Alicante to create two half-million-word treebanks for Spanish and Catalan that ...

  5. Corpus Annotation

    Corpus Linguistics and Linguistically Annotated Corpora. London / New York: Bloomsbury. A recent introduction to working with annotated corpora, with particularly detailed discussion of different forms of annotation (ranging from the level of individual words to larger discourse features) and current software tools for querying them. ...

  6. Corpora

    A corpus (pl. corpora, though corpuses is perfectly acceptable) is simply described as a large body of linguistic evidence composed of attested language use. One may contrast this form of linguistic evidence with sentences created not as a result of communication in context, but rather upon the basis of metalinguistic reflection upon language use, a type of data common in the generative ...

  7. Developing Linguistic Theories Using Annotated Corpora

    Annotated corpora can be powerful tools for developing and evaluating linguistic theories. By providing large samples of naturalistic data, such resources complement native speaker intuitions and controlled psycholinguistic methods, thereby putting linguistic hypotheses on a sturdier empirical foundation. Corpus data and methods also open up ...

  8. Annotating a corpus (Chapter 4)

    For a corpus to be fully useful to potential users, it needs to be annotated. There are three types of annotation, or "markup," that can be inserted in a corpus: "structural" markup, "part-of-speech" markup, and "grammatical" markup. Structural markup provides descriptive information about the texts. For instance, general ...

  9. PDF Developing linguistic theories using annotated corpora

    Annotated corpora can be powerful tools for developing and evaluating linguistic theories. By providing large samples of naturalistic data, such resources comple-ment native speaker intuitions and controlled psycholinguistic methods, thereby putting linguistic hypotheses on a sturdier empirical foundation. Corpus data and methods also open up ...

  10. Text corpus

    Text corpus. In linguistics and natural language processing, a corpus ( pl.: corpora) or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated. Annotated, they have been used in corpus linguistics for statistical hypothesis testing, checking occurrences or validating ...

  11. Corpus Linguistics

    Following the review of corpus annotation, a brief survey of existing corpora is presented, taking into account the types of corpus annotation present in each corpus. The chapter concludes by considering the use of corpora, both annotated, and unannotated, in a range of natural language processing (NLP) systems.

  12. PDF Unit 4 Corpus annotation

    suitably annotated Chinese corpus, you are able to find out a great deal about Chinese using that corpus (see case study 6 in Section C). Speed of data extraction is another advantage of annotated corpora. Even if one is capable of undertaking the required linguistic analyses, one is quite unlikely to be able to explore a raw corpus as swiftly

  13. Natural Language Annotation for Machine Learning

    Datasets of natural language are referred to as corpora, and a single set of data annotated with the same specification is called an annotated corpus. Annotated corpora can be used to train ML algorithms. In this chapter we will define what a corpus is, explain what is meant by an annotation, and describe the methodology used for enriching a ...

  14. Corpora Annotated with Negation: An Overview

    Abstract. Negation is a universal linguistic phenomenon with a great qualitative impact on natural language processing applications. The availability of corpora annotated with negation is essential to training negation processing systems. Currently, most corpora have been annotated for English, but the presence of languages other than English on the Internet, such as Chinese or Spanish, is ...

  14. Definition and Examples of Corpus Linguistics

    Corpus linguistics is the study of language based on large collections of "real life" language use stored in corpora (or corpuses) — computerized databases created for linguistic research. It is also known as corpus-based studies. Corpus linguistics is viewed by some linguists as a research tool or methodology and by others as a discipline or ...

  15. Slate

    These richly annotated corpora are indispensable for advancing research, but they are also more difficult to manage and maintain as their complexity increases; what is needed is a way to manage the annotation project in its entirety. However, annotation project management has received little attention, with tools predominantly focusing on single ...

  16. Digital Corpora

    A corpus is annotated when it consists of more than the actual words of the document (a corpus containing only the words is generally known as plain text). Annotation is a multilayered term that includes (but is not limited to) markup, POS-tagging, and semantic tagging. Markup refers to the conventions used to annotate a corpus. ...
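
    A small sketch of what "multilayered" means in practice, assuming a token represented as a record with one field per layer. The tag values imitate CLAWS-style POS codes and USAS-style semantic field codes, but are chosen here purely for illustration.

```python
# One token, several annotation layers (all values illustrative).
token = {
    "form": "bank",
    "pos": "NN1",          # part-of-speech layer
    "lemma": "bank",       # lemmatisation layer
    "semantic": "I1.1",    # semantic-field layer (a money-related code)
}

# Layers are independent, so a query can combine them freely.
if token["pos"].startswith("NN") and token["semantic"].startswith("I1"):
    print(token["form"], "is a money-related noun here")
```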

  17. Corpus Linguistics and Linguistically Annotated Corpora

    This book discusses corpus linguistics using linguistically annotated corpora and linguistic annotation, as well as searching for semantic and discourse phenomena, and some of the techniques used in that process. Its contents include: Preface; Part I, Introduction (1. Corpus Linguistics; 2. Corpora and Linguistic Annotation); Part II, Linguistic Annotation (3. Linguistic Annotation on the Word Level; 4. ...).

  18. Manually Annotated Corpora

    Manually annotated corpora are collections of texts containing manually validated or manually assigned linguistic information, such as morphosyntactic tags, lemmas, syntactic parses, named entities, etc. These corpora can be used to train new language annotation tools, as well as to test the accuracy of existing annotation tools.
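
    Testing a tool against a manually validated corpus usually reduces to comparing its output with the gold-standard tags token by token. A minimal sketch, with invented tag sequences:

```python
def accuracy(gold, predicted):
    """Proportion of tokens whose predicted tag matches the gold standard."""
    assert len(gold) == len(predicted)
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold)

gold      = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
predicted = ["DET", "NOUN", "NOUN", "ADP", "DET", "NOUN"]
print(f"{accuracy(gold, predicted):.2%}")   # 83.33%
```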

  19. Morphologically and Syntactically Annotated Corpora of Many ...

    In natural language processing (NLP), syntactic parsing is an important preparatory step for many tasks such as question answering, data mining or machine translation; state-of-the-art parsers rely on human-annotated treebanks and apply machine learning algorithms to extract linguistic knowledge from the treebanks. Annotated corpora have become a standard resource for research in both ...
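
    For a concrete feel of what a treebank stores, the sketch below builds one Penn-Treebank-style bracketed parse using the nltk package's Tree class (nltk must be installed; the sentence and its analysis are invented for illustration).

```python
# Requires the nltk package (pip install nltk); no corpus downloads needed.
from nltk import Tree

# One bracketed constituency parse; a treebank contains thousands of
# these, produced or corrected by human annotators.
parse = "(S (NP (DT the) (NN cat)) (VP (VBZ sits) (PP (IN on) (NP (DT the) (NN mat)))))"
tree = Tree.fromstring(parse)

tree.pretty_print()                 # draw the constituency structure
print(tree.leaves())                # the words: ['the', 'cat', ...]
print([t.label() for t in tree.subtrees() if t.label() == "NP"])
```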

  20. Automatic Collation for Diversifying Corpora: Commonly Copied Texts as ...

    Handwritten text recognition (HTR) has enabled many researchers to gather textual evidence from the human record. ... To build generalized models for Arabic-script manuscripts, perhaps one of the largest textual traditions in the pre-modern world, we need an approach that can improve its accuracy on unseen manuscripts and hands without linear growth in the amount of manually annotated data.

  21. Linguistic Annotation in/for Corpus Linguistics

    This article surveys linguistic annotation in corpora and corpus linguistics. We first define the concept of 'corpus' as a radial category and then, in Sect. 2, discuss a variety of kinds of information for which corpora are annotated and that are exploited in contemporary corpus linguistics. Section 3 then exemplifies many current formats of annotation with an eye to highlighting both the ...
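
    Two formats one commonly meets are inline word_TAG text and a one-token-per-line, tab-separated column layout in the style of the CoNLL shared tasks; converting between them is mechanical, as this invented example shows.

```python
inline = "the_DET cat_NOUN sits_VERB"

# Convert inline annotation to a CoNLL-like column format:
# one token per line, tab-separated; sentences separated by blank lines.
rows = [pair.rsplit("_", 1) for pair in inline.split()]
print("\n".join(f"{i + 1}\t{word}\t{tag}"
                for i, (word, tag) in enumerate(rows)))
```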

  22. Corpuscle: a new corpus management platform for annotated corpora

    The main design goals were the ability to handle very large corpora, support for structured data (XML), seamless integration of manual corpus annotation and editing, and a technique for running finite state automata from edges with lowest corpus counts.
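
    One of the design goals mentioned, handling very large corpora, can be illustrated with streaming XML parsing: process one element at a time and free it immediately, so memory use stays flat regardless of corpus size. The file name and the w element with its pos attribute are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET

# Streaming pass over a large XML-annotated corpus: each element is
# processed and then cleared, so memory use stays flat.
# "corpus.xml" and the <w pos="..."> element are illustrative.
def count_tags(path):
    counts = {}
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "w":
            pos = elem.get("pos", "UNK")
            counts[pos] = counts.get(pos, 0) + 1
            elem.clear()            # release the element's memory
    return counts

# counts = count_tags("corpus.xml")   # hypothetical corpus file
```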

  23. More Types of Corpus Annotation

    Such corpora have multiple applications in language engineering, textual analysis of a language, cognitive linguistics, computational linguistics, and allied domains. This range of applications is an incentive to generate annotated corpora of various types and to make them available for use in academic and commercial ...
