spaCy Tutorial: Industrial-Strength NLP in Python


By Ruby Abdullah · tutorial
Tags: spaCy, NLP, Named Entity Recognition, Text Processing, Python, Machine Learning

spaCy: Industrial-Strength NLP in Python

spaCy is an open-source library for natural language processing (NLP) built for production use. Where large language models excel at open-ended generation, spaCy focuses on fast, deterministic, and structured text analysis: tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. This tutorial walks through spaCy from installation to training a custom model, using a coherent business-text example throughout.

What spaCy Is and Where It Fits

spaCy is a Python library that turns raw text into structured linguistic data. It ships with pre-trained statistical models for many languages and exposes a clean object model (Doc, Token, Span) for working with the results. The library is implemented in Python and Cython with performance in mind, so it can process large volumes of text quickly on a CPU.

spaCy vs. Large Language Models

LLMs and spaCy solve different problems, and in practice they are often combined.

  • Determinism: spaCy produces the same output for the same input every time. This matters for pipelines that feed downstream systems or that must be auditable.
  • Speed and cost: spaCy runs on commodity CPUs and, for many tasks, can process thousands of short documents per second, depending on document length and which pipeline components are enabled. There is no per-token API cost.
  • Structure: spaCy returns typed, span-level annotations (entities, tokens, dependencies) rather than free text you have to parse again.
  • Privacy: text never has to leave your infrastructure.

A common pattern is to use spaCy for high-volume extraction and routing, and reserve an LLM for the smaller subset of cases that genuinely need open-ended reasoning. spaCy also integrates LLM prompts as pipeline components through the spacy-llm package when you want both in one workflow.
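The routing pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a production design: a blank English pipeline with a rule-based entity_ruler stands in for a trained model so that nothing has to be downloaded, and the `route` function and its "llm" fallback label are hypothetical names introduced here.

```python
import spacy

# Assumption: a blank pipeline plus a rule-based entity_ruler stands in for a
# trained NER model, so this sketch runs without downloading a model package.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Acme Corp"},
    {"label": "ORG", "pattern": "Globex Ltd"},
])


def route(text: str) -> dict:
    """Return structured entities when spaCy finds any, else flag the text for an LLM."""
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    if entities:
        return {"handler": "spacy", "entities": entities}
    # Hypothetical fallback: a real system would hand the text to an LLM here.
    return {"handler": "llm", "entities": []}


print(route("Acme Corp acquired Globex Ltd for $4.5 billion in March 2023."))
print(route("Can you summarise why the deal matters?"))
```

The point of the design is that the cheap, deterministic path handles everything it can, and only the remainder reaches the more expensive component.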

Installation and Downloading Models

Install spaCy with pip. Using a virtual environment is recommended so that model versions stay pinned to your project.

python -m venv .venv

source .venv/bin/activate # on Windows: .venv\Scripts\activate

pip install spacy

Models are distributed separately from the library. Download a small English model to get started:

python -m spacy download en_core_web_sm

The English models follow a naming convention: en (language), core (general-purpose pipeline), web (trained on web text), and a size suffix:

  • en_core_web_sm: small, fast, no word vectors.
  • en_core_web_md: medium, includes word vectors.
  • en_core_web_lg: large, with a bigger word-vector table.
  • en_core_web_trf: transformer-based, highest accuracy, needs more compute.
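Because package names are underscore-separated (for example `en_core_web_sm`), the convention is easy to take apart in code. The helper below is a small sketch using only the standard library; `ModelName` and `parse_model_name` are hypothetical names introduced for illustration and are not part of spaCy.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str   # language code, e.g. "en"
    kind: str   # pipeline type, e.g. "core"
    genre: str  # training-data genre, e.g. "web"
    size: str   # size suffix: "sm", "md", "lg" or "trf"


def parse_model_name(name: str) -> ModelName:
    """Split a four-part model package name into its components."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected lang_type_genre_size, got {name!r}")
    return ModelName(*parts)


for name in ("en_core_web_sm", "en_core_web_trf"):
    print(name, "->", parse_model_name(name))
```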

You can confirm what is installed and validate compatibility:

python -m spacy info

python -m spacy validate
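The same kind of check can be made from Python before calling `spacy.load`, which is handy in scripts that should fail with a clear message. The sketch below relies on the fact that downloaded models are installed as ordinary importable packages; `model_is_installed` is a hypothetical helper, and it only confirms that the package is present, not that its version is compatible with the installed spaCy.

```python
import importlib.util


def model_is_installed(name: str) -> bool:
    """Return True if a package with this name is importable in the current environment."""
    return importlib.util.find_spec(name) is not None


for name in ("en_core_web_sm", "en_core_web_trf"):
    status = "installed" if model_is_installed(name) else "missing"
    print(f"{name}: {status}")
```

For version compatibility, `python -m spacy validate` remains the tool to use.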

The nlp Pipeline and Core Objects

Loading a model gives you an nlp object. Calling it on a string runs the full pipeline and returns a Doc.

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Acme Corp acquired Globex Ltd for $4.5 billion in March 2023."

doc = nlp(text)

print(type(doc)) # <class 'spacy.tokens.doc.Doc'>

print(len(doc)) # number of tokens

print([token.text for token in doc])

A Doc is a sequence of Token objects. A contiguous slice of tokens is a Span. These three objects are the foundation for everything else.

# Token: a single word, punctuation mark, or symbol

first = doc[0]

print(first.text, first.idx) # "Acme" 0

# Span: a slice of the Doc

span = doc[0:2]

print(span.text) # "Acme Corp"

# Sentences are Spans too

for sent in doc.sents:
    print(sent.text)
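It can help to see how token positions and character offsets line up across the three objects. The sketch below is a minimal illustration that assumes a tokenizer-only pipeline from `spacy.blank("en")` with a rule-based sentencizer added, so it runs without a downloaded model; the second sentence in the text is made up for the example.

```python
import spacy

# Assumption: a blank pipeline (tokenizer only) plus the rule-based sentencizer
# is enough to demonstrate Doc, Token and Span without a trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = (
    "Acme Corp acquired Globex Ltd for $4.5 billion in March 2023. "
    "The deal closed quickly."
)
doc = nlp(text)

# Token.i is the token's position in the Doc; Token.idx is its character offset.
for token in doc[:3]:
    print(token.i, token.idx, repr(token.text))

# A Span is defined by token indices but also exposes character offsets.
span = doc[0:2]
print(span.start, span.end, span.start_char, span.end_char)

# Doc.char_span maps character offsets back to a Span; it returns None when
# the offsets do not align with token boundaries.
same_span = doc.char_span(0, 9)
print(same_span.text)

# Each sentence yielded by doc.sents is itself a Span.
sentences = list(doc.sents)
print([sent.text for sent in sentences])
```

Character offsets are what you usually store in a database or send to another system, while token indices are what you use when navigating the Doc.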

Inspecting the Pipeline

The nlp object holds an ordered list of components. You can inspect and modify it.

print(nlp.pipe_names)

# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Each component adds annotations to the Doc as it passes through.
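To make that concrete, here is a rough sketch of a custom component that writes one annotation to the Doc. It again assumes a blank English pipeline so that no model is required; the component name `alpha_counter` and the extension attribute `n_alpha` are hypothetical and exist only for this illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical extension attribute; force=True keeps re-runs from raising.
Doc.set_extension("n_alpha", default=0, force=True)


@Language.component("alpha_counter")
def alpha_counter(doc: Doc) -> Doc:
    """Store the number of alphabetic tokens on the Doc, then pass it along."""
    doc._.n_alpha = sum(token.is_alpha for token in doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("alpha_counter", last=True)
print(nlp.pipe_names)

doc = nlp("Acme Corp acquired Globex Ltd for $4.5 billion in March 2023.")
print(doc._.n_alpha)

# Components can be switched off temporarily; the attribute keeps its default.
with nlp.select_pipes(disable=["alpha_counter"]):
    skipped = nlp("Acme Corp acquired Globex Ltd for $4.5 billion.")
print(skipped._.n_alpha)
```

A component is just a callable that receives a Doc and returns it, which is why the order in `nlp.pipe_names` matters: later components can rely on annotations set by earlier ones.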
