Scientific Named Entity Recognition

CMU 11-711 Course: Advanced NLP

Objective

The goal is to build an end-to-end NLP system: collecting our own data and training a model on that data to identify specific entities, such as method names, task names, dataset names, metric names and their values, and hyperparameter names and their values, within scientific publications from recent NLP conferences (e.g., ACL, EMNLP, and NAACL).

  • Input: a text file with one paragraph per line. The text will already be tokenized using the spaCy tokenizer, and you should not change the tokenization.

  • Output: a CoNLL-formatted file, with one token per line, a tab, and then a corresponding tag.
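For illustration, the first few lines of an output file might look like the following; the tag names shown (BIO-prefixed variants of the entity types listed above) are only an example of the format, not the official tag set.

```
Each	O
model	O
is	O
fine-tuned	O
with	O
AdamW	B-MethodName
on	O
SQuAD	B-DatasetName
,	O
reaching	O
88.5	B-MetricValue
F1	B-MetricName
.	O
```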

Project website: https://github.com/cmu-anlp/nlp-from-scratch-assignment-2023

Importance of the Scientific Named Entity Recognition (NER) Task

Scientific NER in the NLP domain is crucial for extracting and understanding method names, task names, dataset names, metric names, and their corresponding values within scientific publications. Accurate identification of these entities enhances the accessibility and comprehension of research findings and literature reviews, facilitating the retrieval of relevant information for further studies.

STEP 1: Dataset Collection

In a team of three, we selected the ACL Anthology, which hosts 88,586 papers on computational linguistics and NLP, as our data source. Using a Python script, we parsed the BibTeX file from the ACL Anthology website into a CSV file containing each paper's title, publication year, venue details (NAACL, ACL, EMNLP), and PDF download URL. Using the SHA-256 hash of each URL as the file identifier, we downloaded and saved 87,587 PDFs (98.8%). We then used SciPDF to convert these PDFs into JSON files, a format well suited to our scientific-publication task.
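The sketch below shows the general shape of this collection pipeline, assuming a locally saved BibTeX dump, the third-party bibtexparser (v1 API) and requests packages, and illustrative file paths; it is a simplified stand-in for our actual script.

```python
import csv
import hashlib
import os

import bibtexparser  # third-party BibTeX parser (v1 API); any parser would do
import requests

# Parse the Anthology BibTeX dump into a list of entry dicts.
with open("anthology.bib") as f:  # file name/path is illustrative
    entries = bibtexparser.load(f).entries

# Keep the metadata we need: title, year, venue, and a PDF download URL.
rows = [{
    "title": e.get("title", ""),
    "year": e.get("year", ""),
    "venue": e.get("booktitle", ""),
    # ACL Anthology PDFs are typically reachable at <paper URL>.pdf
    "pdf_url": e["url"].rstrip("/") + ".pdf",
} for e in entries if e.get("url")]

# Save the metadata as a CSV file.
with open("papers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "year", "venue", "pdf_url"])
    writer.writeheader()
    writer.writerows(rows)

# Download each PDF, naming the file by the SHA-256 hash of its URL.
os.makedirs("pdfs", exist_ok=True)
for row in rows:
    fname = hashlib.sha256(row["pdf_url"].encode()).hexdigest() + ".pdf"
    resp = requests.get(row["pdf_url"], timeout=30)
    if resp.ok:
        with open(os.path.join("pdfs", fname), "wb") as out:
            out.write(resp.content)
```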

STEP 2: Data Annotation

One of the critical challenges in developing a robust scientific entity recognition system is the availability of annotated data. We categorized the 87,587 downloaded papers into three groups: manually annotated, automatically annotated, and unannotated. The manually annotated data comprises 35 papers from ACL, EMNLP, or NAACL in 2022 or 2023, distributed among the three of us for manual annotation in the Label Studio interface, yielding 1,542 paragraphs. Additionally, we obtained close to 86,000 automatically annotated paragraphs by training an NER model on the small manually labeled dataset and applying it to label paragraphs from the automatically annotated group.
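The confidence-based filtering used when applying that model (see Step 3) can be sketched as follows; the checkpoint path, input file, and threshold are illustrative, and the code assumes the Hugging Face transformers pipeline rather than our exact setup.

```python
from transformers import pipeline

# NER model fine-tuned on the manually annotated paragraphs.
# The checkpoint path and the threshold below are illustrative.
ner = pipeline("token-classification",
               model="./ner-model",
               aggregation_strategy="simple")

CONFIDENCE_THRESHOLD = 0.9

def auto_annotate(paragraph: str):
    """Return predicted entities, or None if any prediction is low-confidence."""
    entities = ner(paragraph)
    if any(ent["score"] < CONFIDENCE_THRESHOLD for ent in entities):
        return None  # reject the paragraph instead of keeping noisy labels
    return entities

# One paragraph per line, matching the input format of the task.
with open("unlabeled_paragraphs.txt") as f:
    paragraphs = [line.strip() for line in f if line.strip()]

auto_labeled = [(p, ents) for p in paragraphs
                if (ents := auto_annotate(p)) is not None]
print(f"Kept {len(auto_labeled)} of {len(paragraphs)} paragraphs")
```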

Label Studio interface

Here are some steps to use this platform:

  • Clone the repo into a folder of your choice: git clone https://github.com/HumanSignal/label-studio.git
  • From the cloned directory, run: docker-compose up
  • Open Label Studio in your web browser at http://localhost:8080/ and create an account.

We were given a few annotation instructions (see image below). The most difficult entities to annotate were MethodName and TaskName, due to their context-dependent nature within the papers we read. Task names often involve nuanced descriptions of the research objective, making it challenging to delineate the boundaries of the task precisely. Similarly, method names may include domain-specific terms, abbreviations, or multi-word expressions, which requires a deep understanding of the subject matter for accurate annotation.

Annotation guide table I made for my team, based on the instructions shared for the project.

More Rules

These were some additional exceptions calling out words or phrases that we should not annotate.

STEP 3: Model Training and Evaluation

In our proposed scientific NER system, we employ transfer learning to fine-tune pre-trained models with a limited amount of manually annotated data, addressing the challenge of data scarcity. This approach prioritizes creating a small but high-quality dataset, given constraints such as the limited number of annotators (three) and the time available. Training involves multiple iterations of fine-tuning BERT-based models, incorporating both manual and automatic annotations while rejecting low-confidence predictions. The model is trained for 20 epochs with specific hyperparameters and evaluated on a test set, and subsequent experiments are conducted based on the performance of the bert-large-cased model.
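Below is a minimal sketch of one such fine-tuning iteration using the Hugging Face Trainer. It assumes the annotated paragraphs have already been converted into token-classification datasets (train_dataset, eval_dataset) with labels aligned to the spaCy tokens; the label list and all hyperparameters other than the 20 epochs are illustrative.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

MODEL_NAME = "bert-large-cased"

# Illustrative BIO tag set based on the entity types in this project.
label_list = ["O",
              "B-MethodName", "I-MethodName",
              "B-TaskName", "I-TaskName",
              "B-DatasetName", "I-DatasetName",
              "B-MetricName", "I-MetricName",
              "B-MetricValue", "I-MetricValue",
              "B-HyperparameterName", "I-HyperparameterName",
              "B-HyperparameterValue", "I-HyperparameterValue"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(label_list))

args = TrainingArguments(
    output_dir="ner-model",
    num_train_epochs=20,             # as in the write-up
    learning_rate=2e-5,              # illustrative hyperparameters
    per_device_train_batch_size=8,
    weight_decay=0.01,
)

# train_dataset / eval_dataset are assumed to be prepared elsewhere:
# tokenized paragraphs with token-level label ids aligned to the spaCy tokens.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)

trainer.train()
print(trainer.evaluate())
```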

Significance test for bert-large-cased and dslim/bert-large-NER over 12 slices of evaluation data

Acknowledgement

Thank you to my teammates for the support and collaboration: Ashwin Pillay and Bharath Somayajula. I would like to acknowledge Professors Daniel Fried and Robert Frederking for teaching this course.
