Scientific Named Entity Recognition
CMU 11-711 Course: Advanced NLP
Objective
The goal is to build an end-to-end NLP system: we collect our own data and train a model on it to identify entities such as method names, task names, dataset names, metric names and their values, and hyperparameter names and their values in scientific publications from recent NLP conferences (e.g., ACL, EMNLP, and NAACL).
Input: a text file with one paragraph per line. The text will already be tokenized using the spaCy tokenizer, and you should not change the tokenization.
Output: a CoNLL-formatted file with one token per line: the token, a tab, then the corresponding tag.
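For illustration, a made-up input paragraph and its corresponding output might look like the following, assuming BIO tags over the entity types listed above (the exact tag spellings here are our assumption):

Input (one paragraph per line):
We fine-tune BERT on the SQuAD dataset and report F1 .

Output (token, tab, tag):
We	O
fine-tune	O
BERT	B-MethodName
on	O
the	O
SQuAD	B-DatasetName
dataset	O
and	O
report	O
F1	B-MetricName
.	O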
Project website: https://github.com/cmu-anlp/nlp-from-scratch-assignment-2023
Importance of the Scientific Named Entity Recognition (NER) Task
Scientific NER in the NLP domain is crucial for extracting and understanding method names, task names, dataset names, metric names, and their corresponding values within scientific publications. Accurately identifying these entities makes research findings more accessible and comprehensible, supports literature reviews, and facilitates retrieval of relevant information for further studies.
STEP 1: Dataset Collection
In a team of three, we selected the ACL Anthology, which hosts 88,586 papers on computational linguistics and NLP, as our data source. Using a Python script, we parsed the BibTeX file from the ACL Anthology website into a CSV file containing each paper's title, publication year, venue (NAACL, ACL, EMNLP, etc.), and PDF download URL. Using the SHA256 hash of each URL as the file identifier, we downloaded and saved 87,587 PDFs (98.8% of the collection). We then used SciPDF Parser to convert these PDFs into JSON files suited to our scientific-publication task.
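A minimal sketch of this pipeline is shown below. It assumes a locally saved anthology.bib, the bibtexparser and scipdf_parser packages (the latter needs a running GROBID server), and that appending ".pdf" to an Anthology URL yields the PDF; the file names and BibTeX fields are illustrative, not our exact script.

import csv
import hashlib

import bibtexparser  # pip install bibtexparser
import requests
import scipdf        # pip install scipdf_parser; requires a running GROBID server

# Parse the Anthology BibTeX dump into (title, year, venue, url) rows.
with open("anthology.bib") as f:
    bib = bibtexparser.load(f)

with open("papers.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["title", "year", "venue", "url"])
    for entry in bib.entries:
        url = entry.get("url", "")
        if not url:
            continue
        writer.writerow([entry.get("title", ""), entry.get("year", ""),
                         entry.get("booktitle", ""), url])
        # Name each PDF by the SHA256 hash of its URL so filenames are
        # unique and reproducible across runs.
        pdf_name = hashlib.sha256(url.encode()).hexdigest() + ".pdf"
        resp = requests.get(url + ".pdf", timeout=30)
        if resp.ok:
            with open(pdf_name, "wb") as pdf_file:
                pdf_file.write(resp.content)

# Convert a downloaded PDF into structured JSON (title, abstract, sections).
article = scipdf.parse_pdf_to_dict(pdf_name)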
STEP 2: Data Annotation
One of the critical challenges in developing a robust scientific entity recognition system is the availability of annotated data. We categorized the 87,587 downloaded papers into three groups: manually annotated, automatically annotated, and unannotated. The manually annotated set comprises 35 papers from ACL, EMNLP, or NAACL conferences in 2022 or 2023, split among the three of us for annotation in the Label Studio interface, yielding 1,542 paragraphs. In addition, we used close to 86,000 paragraphs from the automatically annotated group: we trained an NER model on the small manually labeled dataset and applied it to label these paragraphs automatically.
Here are the steps to set up this platform:
- Clone the repository into a folder of your choice:
git clone https://github.com/HumanSignal/label-studio.git
- From inside the cloned label-studio directory, start the services:
docker-compose up
- Open Label Studio in your web browser at http://localhost:8080/ and create an account.
We were given a few annotation instructions (see image below). The most difficult entities to annotate were MethodName and TaskName, owing to their context-dependent nature within the papers we read. Task names often involve nuanced descriptions of the research objective, making it challenging to delineate the boundaries of the task precisely. Similarly, method names may include domain-specific terms, abbreviations, or multi-word expressions, which require a deep understanding of the subject matter to annotate accurately.
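To turn the Label Studio annotations into the CoNLL token/tag format described earlier, a conversion script along the following lines can be used. This is a sketch under two assumptions: each task's text is the pre-tokenized paragraph with tokens separated by single spaces, and spans are character offsets as in Label Studio's default JSON export.

import json

def task_to_conll(task):
    text = task["data"]["text"]
    # Character-level tags, later collapsed to one tag per token.
    char_tags = ["O"] * len(text)
    for result in task["annotations"][0]["result"]:
        value = result["value"]
        label = value["labels"][0]
        start, end = value["start"], value["end"]
        char_tags[start] = "B-" + label
        for i in range(start + 1, end):
            char_tags[i] = "I-" + label
    # The tag of a token is the tag at its first character.
    lines, offset = [], 0
    for token in text.split(" "):
        lines.append(f"{token}\t{char_tags[offset]}")
        offset += len(token) + 1  # +1 for the separating space
    return lines

with open("export.json") as f:
    tasks = json.load(f)
for task in tasks:
    print("\n".join(task_to_conll(task)))
    print()  # blank line between paragraphs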
STEP 3: Model Training and Evaluation
In our proposed Scientific NER system, we employ transfer learning to fine-tune pre-trained models on a limited amount of manually annotated data, addressing the challenge of data scarcity. Given the constraints of three annotators and limited time, this approach prioritizes a small but high-quality dataset. Training proceeds over multiple iterations of fine-tuning BERT-based models, incorporating both manual and automatic annotations while rejecting low-confidence predictions. Each model is trained for 20 epochs with fixed hyperparameters and evaluated on a held-out test set; subsequent experiments build on the performance of the bert-large-cased model.
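The sketch below shows one such fine-tuning round using the HuggingFace Trainer. The base model (bert-large-cased) and epoch count (20) come from our setup; the file paths, exact label spellings, batch size, and learning rate are illustrative assumptions. For the auto-annotation rounds, the same model's softmax probabilities can be thresholded so that only high-confidence predicted tags are kept.

from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# BIO tags over the seven entity types in the task definition.
ENTITY_TYPES = ["MethodName", "TaskName", "DatasetName", "MetricName",
                "MetricValue", "HyperparameterName", "HyperparameterValue"]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}

def read_conll(path):
    """Read token<TAB>tag lines; blank lines separate paragraphs."""
    examples, tokens, tags = [], [], []
    for line in open(path):
        line = line.rstrip("\n")
        if not line:
            if tokens:
                examples.append({"tokens": tokens, "tags": tags})
                tokens, tags = [], []
        else:
            token, tag = line.split("\t")
            tokens.append(token)
            tags.append(label2id[tag])
    if tokens:
        examples.append({"tokens": tokens, "tags": tags})
    return Dataset.from_list(examples)

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-large-cased", num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)), label2id=label2id)

def tokenize_and_align(example):
    # BERT splits tokens into subwords; label only the first subword of
    # each original token and mask the rest with -100 (ignored by the loss).
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, max_length=512)
    labels, prev = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev:
            labels.append(-100)
        else:
            labels.append(example["tags"][word_id])
        prev = word_id
    enc["labels"] = labels
    return enc

train = read_conll("train.conll").map(
    tokenize_and_align, remove_columns=["tokens", "tags"])

args = TrainingArguments(output_dir="ner-model", num_train_epochs=20,
                         per_device_train_batch_size=8, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train,
        data_collator=DataCollatorForTokenClassification(tokenizer)).train()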
Acknowledgement
Thank you to my teammates for the support and collaboration: Ashwin Pillay and Bharath Somayajula. I would like to acknowledge Professors Daniel Fried and Robert Frederking for teaching this course.