NLP Text Cleaning Pipeline
Aug 14, 2025
9 min read
NLP · Text Processing · Python · Data Cleaning
Why a pipeline, not ad-hoc cleaning
In production NLP, ad-hoc regex hacks create silent bugs. I use a reproducible pipeline with measurable impact.
Steps
- Normalize unicode, whitespace, and quotes
- Strip boilerplate (headers/footers/TOCs)
- Fix broken sentences for better chunking
- Remove tracking junk and HTML remnants
- Log diffs to validate cleaning (see the sketch after the pipeline code)
from __future__ import annotations

import re
import unicodedata

def normalize(text: str) -> str:
    """Normalize unicode forms, smart quotes, and whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines
    return text.strip()

def strip_boilerplate(text: str) -> str:
    """Drop standalone page markers and confidentiality stamps."""
    text = re.sub(r"(?mi)^page \d+ of \d+\s*$", "", text)
    text = re.sub(r"(?mi)^confidential\s*$", "", text)
    return text

def fix_sentences(text: str) -> str:
    # A hard line break before a capitalized word is usually a lost
    # sentence boundary: insert a period and a space.
    text = re.sub(r"(\w)\n([A-Z])", r"\1. \2", text)
    return text

def clean_text(text: str) -> str:
    for step in (normalize, strip_boilerplate, fix_sentences):
        text = step(text)
    return text

if __name__ == "__main__":
    sample = (
        "PAGE 1 OF 3\nCONFIDENTIAL\n"
        "Return policy: Items within 30 days\n"
        "shipping in 3-5 days\nNew line Capitalized"
    )
    print(clean_text(sample))
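The last step in the list above, logging diffs, is easy to bolt on with the standard library. A minimal sketch (the helper name log_diff is my own):

import difflib

def log_diff(before: str, after: str, max_lines: int = 20) -> None:
    # Emit a unified diff of the cleaning changes so they can be audited.
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="before",
        tofile="after",
    )
    for i, line in enumerate(diff):
        if i >= max_lines:
            print("... (truncated)")
            break
        print(line, end="")

Run it as log_diff(raw, clean_text(raw)) on a sample of documents and skim the output before shipping new rules.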
Evaluate impact
Measure token count reduction and sentence boundary accuracy before/after cleaning. Track regressions in CI.
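Sentence-boundary accuracy strictly needs labeled data; as a cheap proxy, compare naive boundary counts before and after cleaning. A sketch with a deliberately simplistic splitter (my assumption, not a real segmenter):

import re

def naive_sentence_count(text: str) -> int:
    # Count terminal punctuation followed by whitespace (or end) as a boundary.
    return len(re.findall(r"[.!?](?:\s|$)", text))

def boundary_delta(before: str, after: str) -> float:
    # Relative change in detected sentence boundaries after cleaning.
    b, a = naive_sentence_count(before), naive_sentence_count(after)
    return (a - b) / max(b, 1)

A large negative delta suggests the rules are merging or deleting sentences.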
Related posts: outlier detection → [/blogs/detect-remove-outliers-python-iqr-zscore], missing values → [/blogs/handle-missing-values-pandas-without-losing-information].
CTA
Need robust text processing for NLP pipelines with monitoring and tests? Work with me →
Architecture and decision points
- Input sources: PDF OCR, HTML, Markdown, plain text
- Normalization: NFKC, smart quotes, non-breaking spaces
- Boilerplate: page headers/footers, legal footers, cookie banners
- Sentence repair: ensure boundary consistency for chunkers and sentence embeddings
- QA: token deltas, diff samples, failure alerts
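To make the NFKC point concrete, a small demonstration of what that normalization does to typical extraction artifacts:

import unicodedata

samples = {
    "non-breaking space": "price:\u00a0100",  # NBSP becomes a plain space
    "fullwidth digits": "２０２５",            # fullwidth forms become ASCII
    "ligature": "e\ufb03cient",               # the ffi ligature expands to "ffi"
}
for label, s in samples.items():
    print(label, "->", unicodedata.normalize("NFKC", s))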
Rule bank with unit tests (keep bugs out)
from __future__ import annotations

import re

# Each rule is a compiled pattern plus its replacement, applied in order.
RULES = [
    (re.compile(r"(?mi)^page\s+\d+\s+of\s+\d+\s*$"), ""),
    (re.compile(r"(?mi)^confidential\s*$"), ""),
    (re.compile(r"\s+"), " "),  # note: this also flattens newlines
]

def apply_rules(text: str) -> str:
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text.strip()

def test_page_marker_removed():
    assert apply_rules("Page 1 of 10\nHello") == "Hello"
Run the tests under pytest in CI to prevent regressions when adding new rules.
Language-specific pitfalls
- Accented characters: avoid lossy ASCII folding unless the index is search-only; for semantic models, preserve accents (see the sketch after this list).
- CJK spacing: avoid inserting spaces mid-words; use tokenizer-aware cleaning.
- Right-to-left scripts: ensure normalization keeps bidirectional markers.
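For the first pitfall, here is a sketch contrasting lossy ASCII folding (fine for keyword search) with accent-preserving normalization (what you want for embeddings), standard library only:

import unicodedata

def ascii_fold(text: str) -> str:
    # Lossy: decompose (NFKD), then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

word = "café résumé"
print(ascii_fold(word))                    # cafe resume  (search-only)
print(unicodedata.normalize("NFC", word))  # café résumé  (semantic models)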
Token and cost benchmarks
from tiktoken import get_encoding

from nlp_cleaning import clean_text  # the pipeline defined above

def token_count(text: str, enc: str = "cl100k_base") -> int:
    return len(get_encoding(enc).encode(text))

with open("./sample.txt") as f:
    before = f.read()
after = clean_text(before)
print({
    "before_tokens": token_count(before),
    "after_tokens": token_count(after),
    "delta_tokens": token_count(before) - token_count(after),
})
Target: 10–30% token reduction without losing semantics. If you exceed 40%, inspect for over-aggressive stripping.
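A small guard you can wire into CI to enforce those bounds; the thresholds mirror the targets above, and the function name is mine:

def check_token_budget(before_tokens: int, after_tokens: int,
                       min_cut: float = 0.10, max_cut: float = 0.40) -> None:
    # Fail loudly on over-stripping; warn when rules barely do anything.
    cut = (before_tokens - after_tokens) / max(before_tokens, 1)
    if cut > max_cut:
        raise ValueError(f"token cut {cut:.0%} exceeds {max_cut:.0%}: check for over-stripping")
    if cut < min_cut:
        print(f"warning: only {cut:.0%} token reduction; rules may be too conservative")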
Integrate with RAG pipelines
- Clean before chunking to avoid cross-chunk sentences.
- Prefer sentence-aware splitters after cleaning (e.g., RecursiveCharacterTextSplitter with paragraph hints); a sketch follows below.
- Validate with a retrieval eval: citation hit rate should improve or stay stable after cleaning.
See: /blogs/does-langchain-use-rag, /blogs/lightrag-fast-retrieval-augmented-generation.
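A minimal sketch of clean-then-chunk, assuming the langchain-text-splitters package and the nlp_cleaning module from the CLI section below; the input path and splitter parameters are illustrative:

from pathlib import Path

from langchain_text_splitters import RecursiveCharacterTextSplitter
from nlp_cleaning import clean_text  # the pipeline defined above

raw = Path("./docs/policy.txt").read_text()  # hypothetical input file
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", " "],  # paragraph breaks first, then sentences
)
chunks = splitter.split_text(clean_text(raw))  # clean first, then chunk
print(f"{len(chunks)} chunks")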
CLI and batching
python - <<'PY'
from pathlib import Path

from nlp_cleaning import clean_text

input_dir = Path("./docs")
output_dir = Path("./docs_clean")
output_dir.mkdir(exist_ok=True)

for p in input_dir.glob("**/*.txt"):
    out = output_dir / p.relative_to(input_dir)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(clean_text(p.read_text()))
print("done")
PY
QA checklist (print and use)
- Do citations, section headings, and bullets survive?
- Are table columns preserved (monospace code blocks if needed)?
- Did token count drop meaningfully without losing facts?
- Are sentence boundaries coherent for chunkers?
Business value
- Lower inference cost by 10–30% from token savings.
- Higher retrieval quality by reducing noise.
- Faster human review due to cleaner, consistent text.