Entity Extraction Annotation Guide
We need to extract skills and some other entities from job postings and other documents. In this project, we'll create a ground truth sample for this task. Many of the decisions we'll have to make are ambiguous so I prepared this guide to align our expectations.
Entities we want to extract are:
- SKILL
- JOB_TITLE
- EXPERIENCE_LEVEL
- EDUCATION_LEVEL
- EDUCATION_MAJOR
- CERTIFICATION_LICENSE
For example, suppose we have:
We are looking for a Senior Software Engineer with experience in Python, Django, and React. Must have strong communication skills and leadership abilities. AWS certification required. PhD in computer science or statistics preferred. Remote position.
Here, we'd extract:
Senior Software Engineer as a title span; Python, Django, React, communication skills, and leadership abilities as skill spans; AWS certification as a Cert/License span; PhD as the degree level span; and computer science and statistics as our major spans.
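As a rough illustration, the annotations for this example could be stored as character-offset spans. This is only a sketch: the `Span` class and the `make_span` helper are hypothetical names introduced here, not part of any annotation tool we use, and the helper assumes each phrase occurs exactly once in the text.

```python
from dataclasses import dataclass

TEXT = (
    "We are looking for a Senior Software Engineer with experience in "
    "Python, Django, and React. Must have strong communication skills and "
    "leadership abilities. AWS certification required. PhD in computer "
    "science or statistics preferred. Remote position."
)


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def make_span(text: str, phrase: str, label: str) -> Span:
    """Hypothetical helper: wrap the first occurrence of `phrase` as a Span."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Span(start, start + len(phrase), label)


# (phrase, label) pairs taken from the example above.
EXPECTED = [
    ("Senior Software Engineer", "JOB_TITLE"),
    ("Python", "SKILL"),
    ("Django", "SKILL"),
    ("React", "SKILL"),
    ("communication skills", "SKILL"),
    ("leadership abilities", "SKILL"),
    ("AWS certification", "CERTIFICATION_LICENSE"),
    ("PhD", "EDUCATION_LEVEL"),
    ("computer science", "EDUCATION_MAJOR"),
    ("statistics", "EDUCATION_MAJOR"),
]

spans = [make_span(TEXT, phrase, label) for phrase, label in EXPECTED]
```

Note that "strong" is left out of the communication skills span and "Remote position" is not annotated at all.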
Here is a more realistic example, annotated by one of our interns:

Spans
- Spans must be:
  - Contiguous (no gaps)
  - Aligned to whole words (no partial token selection)
  - The shortest phrase that still uniquely expresses the concept
- If we must make a decision that trades off the quality of different entities in our selection, the priority should be:
  SKILL > CERTIFICATION_LICENSE > JOB_TITLE > EDUCATION_LEVEL > EDUCATION_MAJOR > EXPERIENCE_LEVEL
- No overlapping entities: each token can be part of at most one entity.
- Similarly, we assign each word a single label. Sometimes a job description has two words concatenated together. In those cases, we have to decide what to prioritize and what the "word" means on its own. If we suspect that the concatenated word will not make much sense, it is okay to leave it unlabeled. If we have to make a judgement call, we can use the priority order above. I understand that this is vague and difficult, so we can think about it together after looking at some examples.
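The span rules above can be checked mechanically. The sketch below is one possible way to do it, assuming spans are stored as `(start, end, label)` tuples with character offsets; the function names are made up for illustration. It only applies the generic priority order and does not encode the section-specific overlap rules later in this guide, so it is a sanity check rather than a replacement for judgement.

```python
PRIORITY = [
    "SKILL",
    "CERTIFICATION_LICENSE",
    "JOB_TITLE",
    "EDUCATION_LEVEL",
    "EDUCATION_MAJOR",
    "EXPERIENCE_LEVEL",
]
RANK = {label: i for i, label in enumerate(PRIORITY)}


def is_word_aligned(text, start, end):
    """True if [start, end) does not cut through an alphanumeric run."""
    if not (0 <= start < end <= len(text)):
        return False
    span = text[start:end]
    if span != span.strip():  # no leading or trailing whitespace
        return False
    before_ok = start == 0 or not (text[start - 1].isalnum() and text[start].isalnum())
    after_ok = end == len(text) or not (text[end - 1].isalnum() and text[end].isalnum())
    return before_ok and after_ok


def resolve_overlaps(spans):
    """Keep higher-priority spans first; drop any span that overlaps a kept one."""
    kept = []
    for start, end, label in sorted(spans, key=lambda s: (RANK[s[2]], s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


text = "Senior Data Scientist with Python"
candidates = [
    (0, 21, "JOB_TITLE"),        # Senior Data Scientist
    (0, 6, "EXPERIENCE_LEVEL"),  # Senior (overlaps the title)
    (27, 33, "SKILL"),           # Python
]
resolved = resolve_overlaps(candidates)
```

Here the Senior span is dropped because it overlaps the higher-priority title span.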
Non-entities
- We do not annotate:
  - Company description, values, or general marketing text
  - Benefits and compensation
  - Work authorization, location, or availability constraints
1. SKILL
Definition: A SKILL is any knowledge area, capability, competency, method, tool/technology, or soft skill that the candidate is expected to have or use in the role.
1.1 Include as SKILL
- Technical / domain skills: data analysis, statistical modelling, customer service, cash handling, network troubleshooting
- Software, tools, platforms: Python, R, SQL, Excel, Tableau, Salesforce, SAP, AWS, Azure
- Methods, frameworks, techniques: A/B testing, agile methodologies, machine learning model deployment
- Soft skills: team player, self-starter, problem-solving skills, communication skills, time management
1.2 Exclude from SKILL
We do not tag as SKILL:
- Location / authorization / availability: US work authorization, must be willing to relocate, available weekends
- Benefits: health insurance, 401(k), paid time off (unless it is something like an HR role where they actually manage these benefits or similar)
- Pure environment / conditions (unless clearly framed as a skill): work in cold environments (but experience working in cold storage environments can be SKILL)
- Bare adjectives with no clear skill: strong, excellent, proven, outstanding when they stand alone
1.3 Adjectives in SKILL spans
- Include adjectives that are integral to the concept: statistical modelling, technical writing, financial analysis, mechanical design
- Exclude pure intensity/quality adjectives: advanced Excel skills → SKILL = Excel skills
- Ex: strong analytical skills → SKILL = analytical skills; excellent communication skills → SKILL = communication skills
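A minimal sketch of the adjective rule, useful for spot-checking our own spans. The word list contains only the intensity adjectives mentioned in this guide and is certainly incomplete; the function name is hypothetical.

```python
# Intensity/quality adjectives named in this guide; extend as needed.
INTENSITY_ADJECTIVES = {"strong", "excellent", "proven", "outstanding", "advanced"}


def trim_intensity_adjectives(phrase: str) -> str:
    """Drop leading intensity adjectives; integral adjectives are left alone."""
    words = phrase.split()
    while words and words[0].lower() in INTENSITY_ADJECTIVES:
        words.pop(0)
    return " ".join(words)
```

An empty result means the phrase was a bare adjective, which we do not tag as SKILL. Adjectives such as statistical or technical are not in the list, so they stay inside the span.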
1.4 Skill vs Task (verbs)
- If the verb is generic, drop it.
- If the verb is core to the competency, keep it.
Generic use of verbs: We generally drop these and keep only the noun phrase, but context is important. We can mentally drop the verbs in the following examples and see that the skill does not really change:
- Common examples: conduct, perform, assist, support, participate in (of course, they can still be used in legitimate ways, e.g., conducting opera pieces or similar)
- conducting data analysis → SKILL = data analysis
- performing statistical modelling → SKILL = statistical modelling
- assist with project management → SKILL = project management
Core verbs (keep verb + core object together as SKILL):
- operate, drive, repair, install, troubleshoot, test, maintain
- operate CDL class tractor/trailer and straight trucks → one SKILL span
- repair hydraulic systems → one SKILL span
- install and maintain HVAC systems → one SKILL span
We allow such task-like spans as SKILL when that is the only realistic way to capture the underlying competency.
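The generic-verb rule can be approximated with a small pattern, sketched below. The verb list comes from the examples above, and the inflection handling (-ing, -s) is our own assumption. The heuristic is blind to context: it would also strip the verb from conducting opera pieces, so it can only suggest a span, not decide one.

```python
import re

# Leading generic verbs from this guide, with simple inflections (assumed).
GENERIC_VERB_RE = re.compile(
    r"^(?:conduct(?:ing|s)?"
    r"|perform(?:ing|s)?"
    r"|assist(?:ing|s)?(?:\s+with)?"
    r"|support(?:ing|s)?"
    r"|participat(?:e|es|ing)\s+in)\s+",
    re.IGNORECASE,
)


def drop_generic_verb(phrase: str) -> str:
    """Remove one leading generic verb; core verbs such as repair are kept."""
    return GENERIC_VERB_RE.sub("", phrase, count=1)
```

Core verbs such as operate, repair, or install are not in the pattern, so phrases starting with them come back unchanged.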
1.5 Span length and context
- Annotate the full domain-specific term or phrase: lift gates (not just lift because the word "lift" can also mean lifting heavy weights etc.), machine learning model deployment, SQL query optimization
- We do not include trailing conditions, environments, or purpose: troubleshoot network issues in a fast-paced environment → SKILL = troubleshoot network issues; perform cash handling in compliance with company policies → SKILL = cash handling
- So, we want to extract the shortest span that conveys the main idea of the skill.
- If a span could be both skill and major, and it is not clear which one is intended, we can label it as a skill. However, when it is clear that it is about education/major, then we should label it as major. (E.g., molecular biology can be either a skill or major depending on how it is mentioned.)
2. JOB_TITLE
Definition: JOB_TITLE is the title of a position or role.
2.1 Include as JOB_TITLE
Ex:
- Data Scientist
- Senior Software Engineer
- Truck Driver
- Registered Nurse
- Junior Accountant
- Project Manager
Important: Include seniority terms if they are part of the title: Senior Data Scientist, Lead Engineer, Entry-Level Analyst
2.2 Exclude from JOB_TITLE
- Department / team names alone: Data Science team, Finance department, HR team
- Generic role descriptions not used as a title: team lead when clearly used descriptively and not as a formal title (if ambiguous, prefer JOB_TITLE).
Overlap rule:
- If a phrase could be both SKILL and JOB_TITLE and is clearly used as a title (e.g. "As a Senior Data Scientist, you will…"), label it as JOB_TITLE (do not label SKILL there).
- In principle, a job posting can have several titles, which can be extracted as separate title spans. E.g. "We are looking for data scientists, data engineers, machine learning engineers" or "data scientists, senior data scientists, staff data scientists". In both cases we can break these apart into three title spans.
3. EXPERIENCE_LEVEL
Definition: EXPERIENCE_LEVEL captures required or described seniority and required years of experience.
3.1 Include as EXPERIENCE_LEVEL
- Years of experience:
  - 1 year of similar experience
  - 2+ years of experience
  - at least five years of relevant experience
- Level descriptors (when not embedded in a JOB_TITLE span):
  - entry-level, mid-level, senior-level experience
  - early-career, experienced professional
3.2 Span guidelines
- For years of experience, annotate the full requirement phrase expressing the duration: 3+ years of experience, minimum of 5 years of experience
- If segmentation errors create concatenated tokens (e.g. independently1 year of similar experience): start the EXPERIENCE_LEVEL span at independently1 year of similar experience (ignore the fused token independently1). If fusing causes major ambiguity, we can leave it unlabeled.
- Apply this rule consistently.
Overlap rule with JOB_TITLE:
- If Senior, Junior, etc. appears inside a JOB_TITLE span (Senior Data Scientist), label only JOB_TITLE; we do not extract the Senior in Senior Data Scientist as EXPERIENCE_LEVEL.
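For years-of-experience phrases, a pattern like the one below could be used to pre-highlight candidates for annotators. It is a sketch that covers only the phrasings shown in this section; the qualifier list and the spelled-out numbers up to ten are assumptions. It deliberately requires a word boundary at the start, so fused tokens such as independently1 are not matched and stay a human decision.

```python
import re

YEARS_RE = re.compile(
    r"\b(?:(?:at least|minimum of)\s+)?"                               # optional qualifier
    r"(?:\d+\+?|one|two|three|four|five|six|seven|eight|nine|ten)\s+"  # amount
    r"years?\s+of\s+(?:\w+\s+)?experience",                            # e.g. "of relevant experience"
    re.IGNORECASE,
)


def find_experience_phrases(text: str) -> list:
    """Return candidate EXPERIENCE_LEVEL phrases in reading order."""
    return [m.group(0) for m in YEARS_RE.finditer(text)]


sample = (
    "Requires at least five years of relevant experience "
    "and 2+ years of experience in SQL."
)
candidates = find_experience_phrases(sample)
```

Level descriptors such as entry-level or mid-level are not covered by this pattern.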
4. EDUCATION_LEVEL
Definition: EDUCATION_LEVEL is the required or held level of formal education.
4.1 Include as EDUCATION_LEVEL
- Degree levels:
  - Bachelor's degree, BA, BS
  - Master's degree, MS, MBA
  - PhD, Doctorate
  - associate's degree
  - high school diploma, GED
- Level phrases:
  - Bachelor's degree or higher
  - Master's degree preferred
4.2 Span guidelines
- Annotate the full degree expression: Bachelor's degree, Master's degree in a related field, high school diploma or equivalent → EDUCATION_LEVEL = high school diploma or equivalent
- When combined with major: Bachelor's degree in Computer Science → EDUCATION_LEVEL = Bachelor's degree and EDUCATION_MAJOR = Computer Science
5. EDUCATION_MAJOR
Definition: EDUCATION_MAJOR is a field of study, subject, or discipline associated with a degree or educational requirement.
5.1 Include as EDUCATION_MAJOR
- Fields of study:
  - Computer Science
  - Economics
  - Mechanical Engineering
  - Business Administration
  - Statistics
  - Nursing
  - Finance, Accounting
5.2 Span guidelines
- We do not include the degree words in the major spans: Bachelor's degree in Computer Science or related field → EDUCATION_MAJOR = Computer Science (do not include Bachelor's degree in)
- Multiple majors: Computer Science, Mathematics, or Statistics
  - Annotate the phrase as separate EDUCATION_MAJOR spans (we apply this consistently when multiple acceptable fields are listed together).
  - If we see A, B, and other quantitative fields, we can extract A, B, quantitative fields as the major spans; while it is not precise, it can still be mapped to some level of CIP
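The multiple-majors rule could be sketched as a simple list splitter. This is an illustration only: the function name is hypothetical, it returns strings rather than character offsets, and it will wrongly split a single major whose name contains "and" or "or", so such cases still need a manual check.

```python
import re

# Split on commas (optionally followed by and/or) or on a bare and/or.
SEPARATOR_RE = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def split_majors(phrase: str) -> list:
    """Split a listed-majors phrase into separate EDUCATION_MAJOR candidates."""
    majors = []
    for part in SEPARATOR_RE.split(phrase):
        # "other quantitative fields" is kept as "quantitative fields".
        part = re.sub(r"^other\s+", "", part.strip(), flags=re.IGNORECASE)
        if part:
            majors.append(part)
    return majors
```

For example, `split_majors("Economics, Finance, and other quantitative fields")` yields Economics, Finance, and quantitative fields as three separate candidates.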
6. CERTIFICATION_LICENSE
Definition: CERTIFICATION_LICENSE covers named professional certifications and licenses required or preferred for the role.
6.1 Include as CERTIFICATION_LICENSE
- Certifications: CPA, Certified Public Accountant (CPA), CFA, PMP, Project Management Professional (PMP), Cisco CCNA, AWS Certified Solutions Architect
- Licenses: valid driver's license, CDL Class A license, Registered Nurse license, Bar admission / licensed to practice law
6.2 Span guidelines
- Annotate the full certification/license phrase: CDL Class A license, Certified Public Accountant (CPA), current CPA license
- If the phrase clearly names the certification/license but omits the word license or certification, it is still fine: current CPA → CERTIFICATION_LICENSE = CPA
- When certification-like terms appear inside a SKILL phrase: operate CDL class tractor/trailer and straight trucks
  - This is a SKILL span (task-like); we do not extract a separate CERTIFICATION_LICENSE span from CDL in this context.
- Similarly, if we see registered nurse, this is a title. It is true that this role probably requires a license, but we do not infer that and label it as a license. Here, we only focus on the explicitly listed certifications and licenses, and leave the inference for another model to handle.