spaCy: Industrial-Strength NLP in Python
spaCy is an open-source library for natural language processing (NLP) built for production use. Where large language models excel at open-ended generation, spaCy focuses on fast, deterministic, and structured text analysis: tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. This tutorial walks through spaCy from installation to training a custom model, using a coherent business-text example throughout.
What spaCy Is and Where It Fits
spaCy is a Python library that turns raw text into structured linguistic data. It ships with pre-trained statistical models for many languages and exposes a clean object model (Doc, Token, Span) for working with the results. The library is written with performance in mind, so it can process large volumes of text quickly on a CPU.
spaCy vs. Large Language Models
LLMs and spaCy solve different problems, and in practice they are often combined.
- Determinism: spaCy produces the same output for the same input every time. This matters for pipelines that feed downstream systems or that must be auditable.
- Speed and cost: spaCy runs on commodity CPUs, and its CPU-optimized pipelines typically process thousands of words per second. There is no per-token API cost.
- Structure: spaCy returns typed, span-level annotations (entities, tokens, dependencies) rather than free text you have to parse again.
- Privacy: text never has to leave your infrastructure.
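The determinism and structure points above can be checked directly. The sketch below is a minimal illustration, assuming a blank English pipeline (tokenizer only, so no model download is needed) and two sample texts, the second of which is made up for this example. The helper name tokenize_batch is hypothetical.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

# Sample texts; the second one is invented for illustration.
texts = [
    "Acme Corp acquired Globex Ltd for $4.5 billion in March 2023.",
    "Globex Ltd reported record revenue.",
]


def tokenize_batch(batch):
    """Return (token text, character offset) pairs for every text in the batch."""
    # nlp.pipe processes texts as a stream, which is faster than calling nlp() in a loop.
    return [[(token.text, token.idx) for token in doc] for doc in nlp.pipe(batch)]


first_run = tokenize_batch(texts)
second_run = tokenize_batch(texts)

# Same input, same output: the two runs compare equal.
print(first_run == second_run)

# Structured output: each token carries its text and its offset in the raw string.
print(first_run[1][:2])
```

Because the result is plain Python data with offsets, it can be stored or diffed without re-parsing free text.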
A common pattern is to use spaCy for high-volume extraction and routing, and reserve an LLM for the smaller subset of cases that genuinely need open-ended reasoning. spaCy also integrates LLM prompts as pipeline components through the spacy-llm package when you want both in one workflow.
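The routing pattern can be sketched as follows. This is an illustration under stated assumptions, not a reference design: it uses a blank pipeline with a rule-based entity_ruler and two hand-written patterns in place of a trained model, and the route function and its "llm" branch are hypothetical placeholders (no LLM is actually called).

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
# Hypothetical patterns; a trained pipeline would predict entities instead.
ruler.add_patterns(
    [
        {"label": "ORG", "pattern": "Acme Corp"},
        {"label": "ORG", "pattern": "Globex Ltd"},
    ]
)


def route(text):
    """Return ("spacy", entities) when extraction finds something, else ("llm", [])."""
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    if entities:
        return "spacy", entities
    # Placeholder branch: a real system would hand the text to an LLM here.
    return "llm", []


print(route("Acme Corp acquired Globex Ltd for $4.5 billion in March 2023."))
print(route("Can you summarize the risks in this contract?"))
```

The routing rule here (any entity found) is deliberately simple; in practice the condition would depend on the task.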
Installation and Downloading Models
Install spaCy with pip. Using a virtual environment is recommended so that model versions stay pinned to your project.
python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate
pip install spacy
Models are distributed separately from the library. Download a small English model to get started:
python -m spacy download en_core_web_sm
The English models follow a naming convention: en (language), core (general-purpose pipeline), web (trained on web text), and a size suffix:
- en_core_web_sm: small, fast, no word vectors.
- en_core_web_md: medium, includes word vectors.
- en_core_web_lg: large, with a bigger word-vector table.
- en_core_web_trf: transformer-based, highest accuracy, needs more compute.
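To make the naming convention concrete, here is a small sketch that splits a package name into its parts. The parse_model_name helper is hypothetical (it is not part of spaCy) and assumes the four-part, underscore-separated names listed above.

```python
def parse_model_name(name):
    """Split a four-part pipeline package name into labeled components."""
    lang, pipeline_type, genre, size = name.split("_")
    return {"lang": lang, "type": pipeline_type, "genre": genre, "size": size}


for name in ["en_core_web_sm", "en_core_web_md", "en_core_web_lg", "en_core_web_trf"]:
    print(name, parse_model_name(name))
```

Names with a different number of parts would raise a ValueError in this sketch, which is acceptable for illustration but not for general use.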
You can confirm what is installed and validate compatibility:
python -m spacy info
python -m spacy validate
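The same check can be done from Python, which is handy at application startup. A minimal sketch, assuming spaCy v3 and its spacy.util helpers get_installed_models and is_package; the fallback message is just one possible way to handle a missing model.

```python
import spacy
from spacy.util import get_installed_models, is_package

print("spaCy version:", spacy.__version__)

# Pipeline packages currently installed in this environment.
installed = get_installed_models()
print("Installed pipelines:", installed)

model_name = "en_core_web_sm"
if is_package(model_name):
    nlp = spacy.load(model_name)
    # The pipeline's own version is stored in its metadata.
    print(model_name, "version:", nlp.meta["version"])
else:
    print(f"{model_name} is not installed; run: python -m spacy download {model_name}")
```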
The nlp Pipeline and Core Objects
Loading a model gives you an nlp object. Calling it on a string runs the full pipeline and returns a Doc.
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Acme Corp acquired Globex Ltd for $4.5 billion in March 2023."
doc = nlp(text)
print(type(doc)) # <class 'spacy.tokens.doc.Doc'>
print(len(doc)) # number of tokens
print([token.text for token in doc])
A Doc is a sequence of Token objects. A contiguous slice of tokens is a Span. These three objects are the foundation for everything else.
# Token: a single word, punctuation mark, or symbol
first = doc[0]
print(first.text, first.idx) # Acme 0
# Span: a slice of the Doc
span = doc[0:2]
print(span.text) # Acme Corp
# Sentences are Spans too
for sent in doc.sents:
    print(sent.text)
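Tokens and spans also keep character offsets into the original string, which is what lets you map annotations back to the raw text. The sketch below illustrates this under a few assumptions: it uses a blank pipeline with the rule-based sentencizer so that it runs without a downloaded model, and the second sentence of the text is made up for the example.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundaries, no parser needed

# The second sentence is invented for illustration.
text = "Acme Corp acquired Globex Ltd for $4.5 billion in March 2023. Regulators approved the deal."
doc = nlp(text)

# Token.idx is the character offset of the token in the original text.
globex = doc[3]
print(globex.text, globex.idx)

# Span.start_char and Span.end_char map a span back to the raw string.
span = doc[3:5]
print(span.text, span.start_char, span.end_char)
print(text[span.start_char:span.end_char] == span.text)

# Doc.char_span goes the other way: character offsets to a Span.
# It returns None if the offsets do not align with token boundaries.
same_span = doc.char_span(19, 29)
print(same_span.text)

# With sentence boundaries set, doc.sents yields one Span per sentence.
sentences = [sent.text for sent in doc.sents]
print(sentences)
```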
Inspecting the Pipeline
The nlp object holds an ordered list of components. You can inspect and modify it.
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Each component adds annotations to the Doc as it passes through.
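As an illustration of modifying the pipeline, the sketch below registers a small custom component and adds it to a blank pipeline. The component name alpha_counter and the extension attribute n_alpha are hypothetical names chosen for this example; the blank pipeline is used only so the code runs without a downloaded model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Custom attribute, read as doc._.n_alpha. force=True allows re-running this cell.
Doc.set_extension("n_alpha", default=0, force=True)


@Language.component("alpha_counter")
def alpha_counter(doc):
    """Store the number of alphabetic tokens on the Doc, then pass it on."""
    doc._.n_alpha = sum(token.is_alpha for token in doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("alpha_counter", last=True)
print(nlp.pipe_names)  # ['sentencizer', 'alpha_counter']

doc = nlp("Acme Corp acquired Globex Ltd for $4.5 billion in March 2023.")
print(doc._.n_alpha)

# Components can be switched off temporarily; they are restored on exit.
with nlp.select_pipes(disable=["alpha_counter"]):
    active_inside = list(nlp.pipe_names)
    print(active_inside)  # ['sentencizer']
```

A component is just a callable that receives a Doc and returns it, so the same approach works for adding steps to a trained pipeline with the before and after arguments of add_pipe.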