spaCy: Industrial-Strength NLP in Python

spaCy is an open-source library for natural language processing (NLP) built for production use. Where large language models excel at open-ended generation, spaCy focuses on fast, deterministic, and structured text analysis: tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. This tutorial walks through spaCy from installation to training a custom model, using a coherent business-text example throughout.

What spaCy Is and Where It Fits

spaCy is a Python library that turns raw text into structured linguistic data. Pre-trained statistical models are available for many languages, and the library exposes a clean object model (Doc, Token, Span) for working with the results. The library is written with performance in mind, so it can process large volumes of text quickly on a CPU.

spaCy vs. Large Language Models

LLMs and spaCy solve different problems, and in practice they are often combined.

Determinism: spaCy produces the same output for the same input every time. This matters for pipelines that feed downstream systems or that must be auditable.
Speed and cost: spaCy runs on commodity CPUs. Throughput depends on the pipeline and on document length, but lightweight pipelines can process thousands of short documents per second. There is no per-token API cost.
Structure: spaCy returns typed, span-level annotations (entities, tokens, dependencies) rather than free text you have to parse again.
Privacy: text never has to leave your infrastructure.
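The determinism and batch-processing points can be illustrated with a small sketch. It assumes a blank English pipeline (tokenizer only) so that no model download is needed; the helper name tokenize_batch and the second sample sentence are made up for this illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), no trained model required.
nlp = spacy.blank("en")

texts = [
    "Acme Corp acquired Globex Ltd for $4.5 billion in March 2023.",
    "Globex Ltd reported record revenue last quarter.",
]

def tokenize_batch(batch):
    """Run texts through nlp.pipe, which processes them in batches."""
    return [[token.text for token in doc] for doc in nlp.pipe(batch, batch_size=64)]

first_run = tokenize_batch(texts)
second_run = tokenize_batch(texts)

print(first_run == second_run)   # True: same input, same output
print(first_run[1][:2])          # ['Globex', 'Ltd']
```

For large collections, nlp.pipe is generally preferable to calling nlp on each string in a loop, because it streams documents through the pipeline in batches.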

A common pattern is to use spaCy for high-volume extraction and routing, and reserve an LLM for the smaller subset of cases that genuinely need open-ended reasoning. spaCy also integrates LLM prompts as pipeline components through the spacy-llm package when you want both in one workflow.
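A minimal sketch of that routing pattern follows. It is an illustration under stated assumptions, not a recommended architecture: the entity patterns, the route function, and the "structured" and "needs_llm" labels are all hypothetical, and a rule-based entity_ruler on a blank pipeline stands in for a trained model.

```python
import spacy

# Assumption: a blank pipeline with a rule-based entity ruler stands in
# for a trained NER model, so the sketch runs without a model download.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Acme Corp"},
    {"label": "ORG", "pattern": "Globex Ltd"},
])

def route(text):
    """Return ('structured', orgs) when spaCy finds organizations,
    otherwise ('needs_llm', []) to flag the text for an LLM step."""
    doc = nlp(text)
    orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    if orgs:
        return ("structured", orgs)
    return ("needs_llm", [])

print(route("Acme Corp acquired Globex Ltd for $4.5 billion in March 2023."))
# ('structured', ['Acme Corp', 'Globex Ltd'])
print(route("Please summarize the attached memo."))
# ('needs_llm', [])
```

In a real system the second branch would hand the text to whatever LLM client you use; that call is deliberately left out here.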

Installation and Downloading Models

Install spaCy with pip. Using a virtual environment is recommended so that model versions stay pinned to your project.

python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install spacy

Models are distributed separately from the library. Download a small English model to get started:

python -m spacy download en_core_web_sm

The English models follow a naming convention: en (language), core (general-purpose pipeline), web (trained on web text), and a size suffix:

en_core_web_sm: small, fast, no word vectors.
en_core_web_md: medium, includes word vectors.
en_core_web_lg: large, more word vectors.
en_core_web_trf: transformer-based, highest accuracy, needs more compute.
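Written out in full, the package names join these four parts with underscores, as in en_core_web_sm. The helper below is a hypothetical illustration of the convention in plain Python, not part of spaCy, and it assumes a name with exactly four parts.

```python
def parse_model_name(name: str) -> dict:
    """Split a pipeline package name such as 'en_core_web_sm' into its parts.

    Hypothetical helper for illustration; assumes exactly four
    underscore-separated parts.
    """
    lang, kind, genre, size = name.split("_")
    return {"lang": lang, "type": kind, "genre": genre, "size": size}

print(parse_model_name("en_core_web_sm"))
# {'lang': 'en', 'type': 'core', 'genre': 'web', 'size': 'sm'}
```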

You can confirm what is installed and validate compatibility:

python -m spacy info
python -m spacy validate
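A similar check can be made from Python before loading a pipeline. The sketch below uses spacy.util.is_package to test whether a package is installed; the load_or_blank helper and its fallback to a blank pipeline are assumptions made for this example, and whether a silent fallback is appropriate depends on your application.

```python
import spacy
from spacy.util import is_package

def load_or_blank(name: str, lang: str = "en"):
    """Load an installed pipeline package, or fall back to a blank pipeline.

    Hypothetical helper: a blank pipeline only tokenizes, so downstream
    code must not assume tags, parses, or entities are present.
    """
    if is_package(name):
        return spacy.load(name)
    return spacy.blank(lang)

nlp = load_or_blank("en_core_web_sm")
print(spacy.__version__)
print(nlp.pipe_names)   # empty list if the fallback was used
```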

The nlp Pipeline and Core Objects

Loading a model gives you an nlp object. Calling it on a string runs the full pipeline and returns a Doc.

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Acme Corp acquired Globex Ltd for $4.5 billion in March 2023."
doc = nlp(text)

print(type(doc))            # <class 'spacy.tokens.doc.Doc'>
print(len(doc))             # number of tokens
print([token.text for token in doc])

A Doc is a sequence of Token objects. A contiguous slice of tokens is a Span. These three objects are the foundation for everything else.

# Token: a single word, punctuation mark, or symbol
first = doc[0]
print(first.text, first.idx)        # "Acme" 0

# Span: a slice of the Doc
span = doc[0:2]
print(span.text)                    # "Acme Corp"

# Sentences are Spans too
for sent in doc.sents:
    print(sent.text)
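Tokens and Spans also keep their position in the original string, which is useful when annotations have to be mapped back to the source text. The sketch below assumes a blank English pipeline (tokenizer only) so that it runs without a model download; the "ORG" label is assigned by hand purely for illustration.

```python
import spacy

# Assumption: tokenizer-only pipeline; offsets do not depend on a trained model.
nlp = spacy.blank("en")
text = "Acme Corp acquired Globex Ltd for $4.5 billion in March 2023."
doc = nlp(text)

span = doc[0:2]
print(span.start, span.end)             # token offsets: 0 2
print(span.start_char, span.end_char)   # character offsets: 0 9
print(text[span.start_char:span.end_char] == span.text)  # True

# Token.idx is the character offset of the token in the original text.
print(doc[3].text, doc[3].idx)          # Globex 19

# Build a Span from character offsets; the label is set manually here.
org = doc.char_span(19, 29, label="ORG")
print(org.text, org.label_)             # Globex Ltd ORG
```

Note that doc.char_span returns None when the character offsets do not line up with token boundaries, so real code should check for that case.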

Inspecting the Pipeline

The nlp object holds an ordered list of components. You can inspect and modify it.

print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Each component adds annotations to the Doc as it passes through.
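To make the idea of modifying the pipeline concrete, here is a minimal sketch that adds a built-in and a custom component to a blank English pipeline and then disables one temporarily. The component name token_counter, the n_tokens key, and the second sentence of the sample text are hypothetical and exist only for this example.

```python
import spacy
from spacy.language import Language

# "token_counter" is a hypothetical component name used only in this sketch.
@Language.component("token_counter")
def token_counter(doc):
    # Store a simple annotation on the Doc as it passes through.
    doc.user_data["n_tokens"] = len(doc)
    return doc

# Assumption: a blank English pipeline, so no trained model is required.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")      # built-in rule-based sentence splitter
nlp.add_pipe("token_counter")    # appended at the end by default
print(nlp.pipe_names)            # ['sentencizer', 'token_counter']

text = "Acme Corp acquired Globex Ltd for $4.5 billion. The deal closed in March 2023."
doc = nlp(text)
print(len(list(doc.sents)))      # 2
print(doc.user_data["n_tokens"] == len(doc))   # True

# Temporarily disable a component without removing it from the pipeline.
with nlp.select_pipes(disable=["token_counter"]):
    doc_without = nlp(text)
print("n_tokens" in doc_without.user_data)     # False
```

Disabling components you do not need is also a common way to speed up processing with a trained pipeline.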
