Entity Extraction Annotation Guide

We need to extract skills and some other entities from job postings and other documents. In this project, we'll create a ground truth sample for this task. Many of the decisions we'll have to make are ambiguous, so I prepared this guide to align our expectations.

Entities we want to extract are:

  • SKILL
  • JOB_TITLE
  • EXPERIENCE_LEVEL
  • EDUCATION_LEVEL
  • EDUCATION_MAJOR
  • CERTIFICATION_LICENSE

For example, suppose we have:

We are looking for a Senior Software Engineer with experience in Python, Django, and React. Must have strong communication skills and leadership abilities. AWS certification required. PhD in computer science or statistics preferred. Remote position.

Here, we'd extract:

  • Senior Software Engineer as a title span,
  • Python, Django, React, communication skills, and leadership abilities as skill spans,
  • AWS certification as a Cert/License span,
  • PhD as the degree level span, and
  • computer science and statistics as our major spans.
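
One way to record these annotations is as labeled character-offset spans over the posting text. The sketch below is only an illustration: the Span type, the find_span helper, and the offset-based storage format are assumptions made for this example, not a format this guide prescribes.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # character offset of the first character (inclusive)
    end: int    # character offset just past the last character (exclusive)
    label: str


TEXT = (
    "We are looking for a Senior Software Engineer with experience in "
    "Python, Django, and React. Must have strong communication skills and "
    "leadership abilities. AWS certification required. PhD in computer "
    "science or statistics preferred. Remote position."
)

# (phrase, label) pairs taken from the worked example above.
EXPECTED = [
    ("Senior Software Engineer", "JOB_TITLE"),
    ("Python", "SKILL"),
    ("Django", "SKILL"),
    ("React", "SKILL"),
    ("communication skills", "SKILL"),
    ("leadership abilities", "SKILL"),
    ("AWS certification", "CERTIFICATION_LICENSE"),
    ("PhD", "EDUCATION_LEVEL"),
    ("computer science", "EDUCATION_MAJOR"),
    ("statistics", "EDUCATION_MAJOR"),
]


def find_span(text: str, phrase: str, label: str) -> Span:
    """Wrap the first occurrence of phrase in text as a labeled span."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Span(start, start + len(phrase), label)


spans = [find_span(TEXT, phrase, label) for phrase, label in EXPECTED]

for span in spans:
    print(f"{span.label:<22} {TEXT[span.start:span.end]}")
```

Note that nothing in Remote position is labeled, and that strong is left out of the communication skills span.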

Here is a more realistic example, annotated by one of our interns:

[Figure: example of entity annotation showing labeled skills, job titles, experience levels, education levels, majors, and certifications in a job posting]

Spans

  • Spans must be:
    • Contiguous (no gaps)
    • Aligned to whole words (no partial token selection)
    • The shortest phrase that still uniquely expresses the concept
  • If we must make a decision that trades off the quality of different entities in our selection, the priority should be: SKILL > CERTIFICATION_LICENSE > JOB_TITLE > EDUCATION_LEVEL > EDUCATION_MAJOR > EXPERIENCE_LEVEL
  • No overlapping entities: each token can be part of at most one entity.
  • Similarly, we assign each word a single label. Sometimes, a job description has two words fused together. In those cases, we have to decide what to prioritize based on what the fused "word" most plausibly means on its own. If we suspect that the fused word will not make much sense, it is okay to leave it unlabeled. If we have to make a judgement call, we use the priority order above. I understand that this is vague and difficult, so we can work through it together after looking at some examples.
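
The span rules above can be checked mechanically. The following is a minimal sketch, assuming spans are stored as (start, end, label) tuples of character offsets; the function names is_word_aligned and resolve_overlaps are hypothetical. The priority order is only the fallback for judgement calls, so the entity-specific overlap rules later in this guide still come first.

```python
# Fallback priority from this guide, highest priority first.
PRIORITY = [
    "SKILL",
    "CERTIFICATION_LICENSE",
    "JOB_TITLE",
    "EDUCATION_LEVEL",
    "EDUCATION_MAJOR",
    "EXPERIENCE_LEVEL",
]
RANK = {label: rank for rank, label in enumerate(PRIORITY)}


def is_word_aligned(text, start, end):
    """True if [start, end) does not cut through an alphanumeric run."""
    if not 0 <= start < end <= len(text):
        return False
    cuts_left = start > 0 and text[start - 1].isalnum() and text[start].isalnum()
    cuts_right = end < len(text) and text[end - 1].isalnum() and text[end].isalnum()
    return not (cuts_left or cuts_right)


def resolve_overlaps(spans):
    """Keep higher-priority spans first and drop any span that overlaps a kept one."""
    kept = []
    for span in sorted(spans, key=lambda s: (RANK[s[2]], s[0])):
        start, end, _ = span
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append(span)
    return sorted(kept)  # back in document order


text = "Senior Data Scientist with Python"
candidates = [
    (0, 6, "EXPERIENCE_LEVEL"),  # "Senior" on its own
    (0, 21, "JOB_TITLE"),        # "Senior Data Scientist"
    (27, 33, "SKILL"),           # "Python"
]
resolved = resolve_overlaps(candidates)
```

Here the EXPERIENCE_LEVEL candidate is dropped because it overlaps the higher-priority JOB_TITLE span, so each token ends up in at most one entity.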

Non-entities

  • We do not annotate:
    • Company description, values, or general marketing text
    • Benefits and compensation
    • Work authorization, location, or availability constraints

1. SKILL

Definition: A SKILL is any knowledge area, capability, competency, method, tool/technology, or soft skill that the candidate is expected to have or use in the role.

1.1 Include as SKILL

  • Technical / domain skills: data analysis, statistical modelling, customer service, cash handling, network troubleshooting
  • Software, tools, platforms: Python, R, SQL, Excel, Tableau, Salesforce, SAP, AWS, Azure
  • Methods, frameworks, techniques: A/B testing, agile methodologies, machine learning model deployment
  • Soft skills: team player, self-starter, problem-solving skills, communication skills, time management

1.2 Exclude from SKILL

We do not tag as SKILL:

  • Location / authorization / availability: US work authorization, must be willing to relocate, available weekends
  • Benefits: health insurance, 401(k), paid time off (unless it is something like an HR role where they actually manage these benefits or similar)
  • Pure environment / conditions (unless clearly framed as a skill): work in cold environments (but experience working in cold storage environments can be SKILL)
  • Bare adjectives with no clear skill: strong, excellent, proven, outstanding when they stand alone

1.3 Adjectives in SKILL spans

  • Include adjectives that are integral to the concept: statistical modelling, technical writing, financial analysis, mechanical design
  • Exclude pure intensity/quality adjectives: advanced Excel skills → SKILL = Excel skills
  • Ex: strong analytical skills → SKILL = analytical skills; excellent communication skills → SKILL = communication skills
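
As a rough illustration of the adjective rule, the sketch below trims leading intensity adjectives from a candidate phrase. The word list contains only the adjectives named in this guide and is certainly incomplete; the helper name trim_intensity_adjectives is hypothetical, and the real decision still depends on whether the adjective is integral to the concept.

```python
# Intensity/quality adjectives mentioned in this guide (not an exhaustive list).
INTENSITY_ADJECTIVES = {"strong", "excellent", "proven", "outstanding", "advanced"}


def trim_intensity_adjectives(phrase):
    """Drop leading intensity adjectives; keep everything else unchanged."""
    words = phrase.split()
    while words and words[0].lower() in INTENSITY_ADJECTIVES:
        words.pop(0)
    return " ".join(words)


examples = [
    "advanced Excel skills",
    "strong analytical skills",
    "excellent communication skills",
    "statistical modelling",  # integral adjective, so it stays
]
trimmed = [trim_intensity_adjectives(e) for e in examples]
```

A bare adjective such as strong trims down to an empty string, which matches the rule that it is not a SKILL on its own.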

1.4 Skill vs Task (verbs)

  • If the verb is generic, drop it.
  • If the verb is core to the competency, keep it.

Generic use of verbs: Generally we drop these and keep only the noun phrase, but context is important. We can mentally drop the verbs in the following examples and see that it doesn't really change the skill:

  • Common examples: conduct, perform, assist, support, participate in (of course, they can still be used in legitimate ways, e.g., conducting opera pieces or similar)
  • conducting data analysis → SKILL = data analysis
  • performing statistical modelling → SKILL = statistical modelling
  • assist with project management → SKILL = project management

Core verbs (keep verb + core object together as SKILL):

  • operate, drive, repair, install, troubleshoot, test, maintain
  • operate CDL class tractor/trailer and straight trucks → one SKILL span
  • repair hydraulic systems → one SKILL span
  • install and maintain HVAC systems → one SKILL span

We allow such task-like spans as SKILL when that is the only realistic way to capture the underlying competency.
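
The generic-versus-core distinction can be approximated with a small helper that strips a leading generic verb and leaves everything else alone. This is a sketch under assumptions: the verb stems come from the common examples above, the handled inflections are a guess, and the name drop_generic_verb is hypothetical. It cannot see context, so a legitimate use such as conducting opera pieces would be trimmed incorrectly and still needs a human decision.

```python
import re

# Generic verbs from this guide, with a few common inflections and an
# optional trailing preposition ("assist with", "participate in").
_GENERIC_VERB = re.compile(
    r"^(?:conduct|perform|assist|support|participat)(?:e|es|ed|ing|s)?"
    r"(?:\s+(?:with|in))?\s+",
    re.IGNORECASE,
)


def drop_generic_verb(phrase):
    """Remove one leading generic verb; core verbs are left untouched."""
    return _GENERIC_VERB.sub("", phrase.strip(), count=1)


generic = [
    "conducting data analysis",
    "performing statistical modelling",
    "assist with project management",
]
core = [
    "repair hydraulic systems",
    "install and maintain HVAC systems",
]
generic_result = [drop_generic_verb(p) for p in generic]
core_result = [drop_generic_verb(p) for p in core]
```

Core verbs such as repair or install are not in the pattern, so those phrases come back unchanged and stay as a single SKILL span.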

1.5 Span length and context

  • Annotate the full domain-specific term or phrase: lift gates (not just lift because the word "lift" can also mean lifting heavy weights etc.), machine learning model deployment, SQL query optimization
  • We do not include trailing conditions, environments, or purpose: troubleshoot network issues in a fast-paced environment → SKILL = troubleshoot network issues; perform cash handling in compliance with company policies → SKILL = cash handling
  • So, we want to extract the shortest span that conveys the main idea of the skill.
  • If a span could be both a skill and a major, and it is not clear which one is intended, we label it as a skill. However, when it is clear that it is about education/major, we should label it as a major. (E.g., molecular biology can be either a skill or a major depending on how it is mentioned.)

2. JOB_TITLE

Definition: JOB_TITLE is the title of a position or role.

2.1 Include as JOB_TITLE

Ex:

  • Data Scientist
  • Senior Software Engineer
  • Truck Driver
  • Registered Nurse
  • Junior Accountant
  • Project Manager

Important: Include seniority terms if they are part of the title: Senior Data Scientist, Lead Engineer, Entry-Level Analyst

2.2 Exclude from JOB_TITLE

  • Department / team names alone: Data Science team, Finance department, HR team
  • Generic role descriptions not used as a title: team lead when clearly used descriptively and not as a formal title (if ambiguous, prefer JOB_TITLE).

Overlap rule:

  • If a phrase could be both SKILL and JOB_TITLE and is clearly used as a title (e.g. As a Senior Data Scientist, you will…), label it as JOB_TITLE (do not label SKILL there).
  • In principle, a job posting can have several titles, which can be extracted as separate title spans. E.g. (a) We are looking for data scientists, data engineers, machine learning engineers; or (b) data scientists, senior data scientists, staff data scientists. In both cases, we can break the list apart into three title spans.

3. EXPERIENCE_LEVEL

Definition: EXPERIENCE_LEVEL captures required or described seniority and required years of experience.

3.1 Include as EXPERIENCE_LEVEL

  • Years of experience:
    • 1 year of similar experience
    • 2+ years of experience
    • at least five years of relevant experience
  • Level descriptors (when not embedded in a JOB_TITLE span):
    • entry-level, mid-level, senior-level experience
    • early-career, experienced professional

3.2 Span guidelines

  • For years of experience, annotate the full requirement phrase expressing the duration: 3+ years of experience, minimum of 5 years of experience
  • If segmentation errors create fused tokens (e.g. independently1 year of similar experience): start the EXPERIENCE_LEVEL span after the fused token, at year of similar experience (ignore the fused token independently1). If the fusion causes major ambiguity, we can leave it unlabeled.
  • Apply this rule consistently.
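
For pre-annotation or spot checks, the years-of-experience phrases listed above can be found with a regular expression. This is only a sketch: the pattern covers the phrasings shown in this guide plus spelled-out numbers up to ten, and the name find_experience_phrases is hypothetical. Because the pattern starts at a word boundary, a fused token such as independently1 is not matched at all, which leaves those cases to the annotator.

```python
import re

_NUMBER = r"(?:\d+\+?|one|two|three|four|five|six|seven|eight|nine|ten)"
_YEARS_OF_EXPERIENCE = re.compile(
    rf"\b(?:(?:at least|minimum of)\s+)?{_NUMBER}\s+years?\s+of\s+"
    r"(?:\w+\s+)?experience\b",
    re.IGNORECASE,
)


def find_experience_phrases(text):
    """Return (start, end, phrase) for each years-of-experience requirement."""
    return [(m.start(), m.end(), m.group(0)) for m in _YEARS_OF_EXPERIENCE.finditer(text)]


posting = (
    "Requires at least five years of relevant experience "
    "and 2+ years of experience with SQL."
)
phrases = [phrase for _, _, phrase in find_experience_phrases(posting)]

# A fused token produces no match and is left for manual review.
fused = find_experience_phrases("independently1 year of similar experience")
```

The optional middle word lets modifiers such as relevant or similar stay inside the span, in line with annotating the full requirement phrase.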

Overlap rule with JOB_TITLE:

  • If Senior, Junior, etc. appears inside a JOB_TITLE span (Senior Data Scientist), label only JOB_TITLE; we do not extract the Senior in Senior Data Scientist as EXPERIENCE_LEVEL.

4. EDUCATION_LEVEL

Definition: EDUCATION_LEVEL is the required or held level of formal education.

4.1 Include as EDUCATION_LEVEL

  • Degree levels:
    • Bachelor's degree, BA, BS
    • Master's degree, MS, MBA
    • PhD, Doctorate
    • associate's degree
    • high school diploma, GED
  • Level phrases:
    • Bachelor's degree or higher
    • Master's degree preferred

4.2 Span guidelines

  • Annotate the full degree expression: Bachelor's degree, Master's degree in a related field, high school diploma or equivalent → EDUCATION_LEVEL = high school diploma or equivalent
  • When combined with major: Bachelor's degree in Computer Science → EDUCATION_LEVEL = Bachelor's degree and EDUCATION_MAJOR = Computer Science

5. EDUCATION_MAJOR

Definition: EDUCATION_MAJOR is a field of study, subject, or discipline associated with a degree or educational requirement.

5.1 Include as EDUCATION_MAJOR

  • Fields of study:
    • Computer Science
    • Economics
    • Mechanical Engineering
    • Business Administration
    • Statistics
    • Nursing
    • Finance, Accounting

5.2 Span guidelines

  • We do not include the degree words in the major spans: Bachelor's degree in Computer Science or related field → EDUCATION_MAJOR = Computer Science (do not include Bachelor's degree in)
  • Multiple majors: Computer Science, Mathematics, or Statistics
    • Annotate the phrase as separate EDUCATION_MAJOR spans (we apply this consistently when multiple acceptable fields are listed together).
    • If we see A, B, and other quantitative fields, we can extract A, B, and quantitative fields as the major spans; while this is not precise, it can still be mapped to some level of CIP.
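
The multiple-majors rule can be sketched as a small splitting helper. It assumes the input is already the list portion of the requirement (the text after the degree words); the name split_majors and the filler list are hypothetical. Splitting on and/or is a simplification, so a single major that contains and (for example Electrical and Computer Engineering) would be split incorrectly and needs manual handling.

```python
import re

# Catch-all phrases that are not majors on their own (from the example above).
_FILLER = {"related field", "related fields", "a related field"}
# Split on commas (optionally followed by and/or) or on a bare and/or.
_SEPARATOR = re.compile(r",\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def split_majors(phrase):
    """Split a list of fields of study into separate EDUCATION_MAJOR candidates."""
    majors = []
    for part in _SEPARATOR.split(phrase):
        # "other quantitative fields" is kept as "quantitative fields".
        part = re.sub(r"^other\s+", "", part.strip(), flags=re.IGNORECASE)
        if part and part.lower() not in _FILLER:
            majors.append(part)
    return majors


listed = split_majors("Computer Science, Mathematics, or Statistics")
with_filler = split_majors("Computer Science or related field")
with_other = split_majors("Economics, Statistics, and other quantitative fields")
```

Each returned string corresponds to one EDUCATION_MAJOR span; the offsets still have to be located in the original text.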

6. CERTIFICATION_LICENSE

Definition: CERTIFICATION_LICENSE covers named professional certifications and licenses required or preferred for the role.

6.1 Include as CERTIFICATION_LICENSE

  • Certifications: CPA, Certified Public Accountant (CPA), CFA, PMP, Project Management Professional (PMP), Cisco CCNA, AWS Certified Solutions Architect
  • Licenses: valid driver's license, CDL Class A license, Registered Nurse license, Bar admission / licensed to practice law

6.2 Span guidelines

  • Annotate the full certification/license phrase: CDL Class A license, Certified Public Accountant (CPA), current CPA license
  • If the phrase clearly names the certification/license but omits the word license or certification, it is still fine: current CPA → CERTIFICATION_LICENSE = CPA
  • When certification-like terms appear inside a SKILL phrase: operate CDL class tractor/trailer and straight trucks
    • This is a SKILL span (task-like); we do not extract a separate CERTIFICATION_LICENSE span from CDL in this context.
  • Similarly, if we see registered nurse, this is a title. The role probably requires a license, but we do not infer that and label the phrase as a license. Here, we only focus on the explicitly listed certifications and licenses, and leave the inference for another model to handle.