NLP Text Cleaning Pipeline
Aug 14, 2025
9 min read
NLP · Text Processing · Python · Data Cleaning
Why a pipeline, not ad-hoc cleaning
In production NLP, ad-hoc regex hacks create silent bugs. I use a reproducible pipeline with measurable impact.
Steps
- Normalize unicode, whitespace, and quotes
- Strip boilerplate (headers/footers/TOCs)
- Fix broken sentences for better chunking
- Remove tracking junk and HTML remnants
- Log diffs to validate cleaning (see the sketch after the pipeline code)
from __future__ import annotations

import re
import unicodedata

def normalize(text: str) -> str:
    """Normalize unicode forms, smart quotes, and whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines
    return text.strip()

def strip_boilerplate(text: str) -> str:
    """Drop standalone page markers and confidentiality stamps."""
    text = re.sub(r"(?mi)^page \d+ of \d+\s*$", "", text)
    text = re.sub(r"(?mi)^confidential\s*$", "", text)
    return text

def fix_sentences(text: str) -> str:
    # A hard line break before a capitalized word is usually a lost
    # sentence boundary: insert a period and a space.
    text = re.sub(r"(\w)\n([A-Z])", r"\1. \2", text)
    return text

def clean_text(text: str) -> str:
    for step in (normalize, strip_boilerplate, fix_sentences):
        text = step(text)
    return text

if __name__ == "__main__":
    sample = (
        "PAGE 1 OF 3\nCONFIDENTIAL\n"
        "Return policy: Items within 30 days\n"
        "shipping in 3-5 days\nNew line Capitalized"
    )
    print(clean_text(sample))
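The last step in the list above, logging diffs, is easy to bolt on with the standard library. A minimal sketch (the helper name log_diff is my own):

import difflib

def log_diff(before: str, after: str, max_lines: int = 20) -> None:
    # Emit a unified diff of the cleaning changes so they can be audited.
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="before",
        tofile="after",
    )
    for i, line in enumerate(diff):
        if i >= max_lines:
            print("... (truncated)")
            break
        print(line, end="")

Run it as log_diff(raw, clean_text(raw)) on a sample of documents and skim the output before shipping new rules.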
Evaluate impact
Measure token count reduction and sentence boundary accuracy before/after cleaning. Track regressions in CI.
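Sentence-boundary accuracy strictly needs labeled data; as a cheap proxy, compare naive boundary counts before and after cleaning. A sketch with a deliberately simplistic splitter (my assumption, not a real segmenter):

import re

def naive_sentence_count(text: str) -> int:
    # Count terminal punctuation followed by whitespace (or end) as a boundary.
    return len(re.findall(r"[.!?](?:\s|$)", text))

def boundary_delta(before: str, after: str) -> float:
    # Relative change in detected sentence boundaries after cleaning.
    b, a = naive_sentence_count(before), naive_sentence_count(after)
    return (a - b) / max(b, 1)

A large negative delta suggests the rules are merging or deleting sentences.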
Related posts: outlier detection → [/blogs/detect-remove-outliers-python-iqr-zscore], missing values → [/blogs/handle-missing-values-pandas-without-losing-information].
CTA
Need robust text processing for NLP pipelines with monitoring and tests? Work with me →
Architecture and decision points
- Input sources: PDF OCR, HTML, Markdown, plain text
- Normalization: NFKC, smart quotes, non-breaking spaces
- Boilerplate: page headers/footers, legal footers, cookie banners
- Sentence repair: ensure boundary consistency for chunkers and sentence embeddings
- QA: token deltas, diff samples, failure alerts
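To make the NFKC point concrete, a small demonstration of what that normalization does to typical extraction artifacts:

import unicodedata

samples = {
    "non-breaking space": "price:\u00a0100",  # NBSP becomes a plain space
    "fullwidth digits": "２０２５",            # fullwidth forms become ASCII
    "ligature": "e\ufb03cient",               # the ffi ligature expands to "ffi"
}
for label, s in samples.items():
    print(label, "->", unicodedata.normalize("NFKC", s))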
Rule bank with unit tests (keep bugs out)
from __future__ import annotations

import re

# Each rule is a compiled pattern plus its replacement, applied in order.
RULES = [
    (re.compile(r"(?mi)^page\s+\d+\s+of\s+\d+\s*$"), ""),
    (re.compile(r"(?mi)^confidential\s*$"), ""),
    (re.compile(r"\s+"), " "),  # note: this also flattens newlines
]

def apply_rules(text: str) -> str:
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text.strip()

def test_page_marker_removed():
    assert apply_rules("Page 1 of 10\nHello") == "Hello"
Run the tests under pytest in CI to prevent regressions when adding new rules.
Language-specific pitfalls
- Accented characters: avoid lossy ASCII folding unless the index is search-only; for semantic models, preserve accents (see the sketch after this list).
- CJK spacing: avoid inserting spaces mid-words; use tokenizer-aware cleaning.
- Right-to-left scripts: ensure normalization keeps bidirectional markers.
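For the first pitfall, here is a sketch contrasting lossy ASCII folding (fine for keyword search) with accent-preserving normalization (what you want for embeddings), standard library only:

import unicodedata

def ascii_fold(text: str) -> str:
    # Lossy: decompose (NFKD), then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

word = "café résumé"
print(ascii_fold(word))                    # cafe resume  (search-only)
print(unicodedata.normalize("NFC", word))  # café résumé  (semantic models)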
Token and cost benchmarks
from tiktoken import get_encoding

from nlp_cleaning import clean_text  # the pipeline defined above

def token_count(text: str, enc: str = "cl100k_base") -> int:
    return len(get_encoding(enc).encode(text))

with open("./sample.txt") as f:
    before = f.read()
after = clean_text(before)
print({
    "before_tokens": token_count(before),
    "after_tokens": token_count(after),
    "delta_tokens": token_count(before) - token_count(after),
})
Target: 10–30% token reduction without losing semantics. If you exceed 40%, inspect for over-aggressive stripping.
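A small guard you can wire into CI to enforce those bounds; the thresholds mirror the targets above, and the function name is mine:

def check_token_budget(before_tokens: int, after_tokens: int,
                       min_cut: float = 0.10, max_cut: float = 0.40) -> None:
    # Fail loudly on over-stripping; warn when rules barely do anything.
    cut = (before_tokens - after_tokens) / max(before_tokens, 1)
    if cut > max_cut:
        raise ValueError(f"token cut {cut:.0%} exceeds {max_cut:.0%}: check for over-stripping")
    if cut < min_cut:
        print(f"warning: only {cut:.0%} token reduction; rules may be too conservative")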
Integrate with RAG pipelines
- Clean before chunking to avoid cross-chunk sentences.
- Prefer sentence-aware splitters after cleaning (e.g., RecursiveCharacterTextSplitter with paragraph hints); a sketch follows below.
- Validate with a retrieval eval: citation hit rate should improve or stay stable after cleaning.
See: /blogs/does-langchain-use-rag, /blogs/lightrag-fast-retrieval-augmented-generation.
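A minimal sketch of clean-then-chunk, assuming the langchain-text-splitters package and the nlp_cleaning module from the CLI section below; the input path and splitter parameters are illustrative:

from pathlib import Path

from langchain_text_splitters import RecursiveCharacterTextSplitter
from nlp_cleaning import clean_text  # the pipeline defined above

raw = Path("./docs/policy.txt").read_text()  # hypothetical input file
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", " "],  # paragraph breaks first, then sentences
)
chunks = splitter.split_text(clean_text(raw))  # clean first, then chunk
print(f"{len(chunks)} chunks")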
CLI and batching
python - <<'PY'
from pathlib import Path

from nlp_cleaning import clean_text

input_dir = Path("./docs")
output_dir = Path("./docs_clean")
output_dir.mkdir(exist_ok=True)

for p in input_dir.glob("**/*.txt"):
    out = output_dir / p.relative_to(input_dir)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(clean_text(p.read_text()))
print("done")
PY
QA checklist (print and use)
- Do citations, section headings, and bullets survive?
- Are table columns preserved (monospace code blocks if needed)?
- Did token count drop meaningfully without losing facts?
- Are sentence boundaries coherent for chunkers?
Business value
- Lower inference cost by 10–30% from token savings.
- Higher retrieval quality by reducing noise.
- Faster human review due to cleaner, consistent text.