NLP Text Cleaning Pipeline

Aug 14, 2025
9 min read
NLP · Text Processing · Python · Data Cleaning

Why a pipeline, not ad-hoc cleaning

In production NLP, ad-hoc regex hacks accumulate into silent bugs. I use a reproducible pipeline in which every step is testable and its impact is measurable.

Steps

  1. Normalize unicode, whitespace, and quotes
  2. Strip boilerplate (headers/footers/TOCs)
  3. Fix broken sentences for better chunking
  4. Remove tracking junk and HTML remnants
  5. Log diffs to validate cleaning
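
A minimal implementation of steps 1–3 follows; sketches for HTML cleanup (step 4) and diff logging (step 5) appear later in the post.
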
from __future__ import annotations
import re
import unicodedata
from typing import List, Callable


def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def strip_boilerplate(text: str) -> str:
    text = re.sub(r"(?mi)^page \d+ of \d+\s*$", "", text)
    text = re.sub(r"(?mi)^confidential\s*$", "", text)
    return text


def fix_sentences(text: str) -> str:
    # a line ending in a word character followed by a line starting with a
    # capital letter usually marks a lost sentence boundary; join with ". "
    text = re.sub(r"(\w)(\n)([A-Z])", r"\1. \3", text)
    return text


def clean_text(text: str) -> str:
    for step in (normalize, strip_boilerplate, fix_sentences):
        text = step(text)
    return text


if __name__ == "__main__":
    sample = """PAGE 1 OF 3\nCONFIDENTIAL\nReturn policy: Items within 30 days\nshipping in 3-5 days\nNew line Capitalized"""
    print(clean_text(sample))
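
Step 4 (tracking junk and HTML remnants) is not covered above. A minimal sketch for web-scraped input; the strip_html_remnants helper and the UTM/fbclid pattern are illustrative, not part of the module above:

import re
from html import unescape


def strip_html_remnants(text: str) -> str:
    # decode entities left over from HTML extraction (&amp;, &#8217;, ...)
    text = unescape(text)
    # drop stray tags that survived extraction
    text = re.sub(r"<[^>]{1,200}>", " ", text)
    # strip common tracking parameters from bare URLs
    text = re.sub(r"[?&](?:utm_[a-z]+|fbclid|gclid)=[^\s&]+", "", text)
    return text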

Evaluate impact

Measure token count reduction and sentence boundary accuracy before/after cleaning. Track regressions in CI.
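
For step 5 (diff logging), a minimal sketch using difflib from the standard library; the file path and the 40-line cap are arbitrary choices, and clean_text is assumed to live in nlp_cleaning.py as in the CLI example further down:

import difflib
from pathlib import Path

from nlp_cleaning import clean_text  # assumed module name, as in the CLI example


def log_diff(before: str, after: str, max_lines: int = 40) -> str:
    # unified diff of before/after, truncated so CI logs stay readable
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm="",
    )
    return "\n".join(list(diff)[:max_lines])


raw = Path("./sample.txt").read_text(encoding="utf-8")
print(log_diff(raw, clean_text(raw)))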

Related posts: RAG data quality → [/blogs/detect-remove-outliers-python-iqr-zscore], missing values → [/blogs/handle-missing-values-pandas-without-losing-information].

CTA

Need robust text processing for NLP pipelines with monitoring and tests? Work with me →

Architecture and decision points

NLP cleaning pipeline: ingest → normalize → boilerplate removal → sentence repair → HTML cleanup → QA → output

  • Input sources: PDF OCR, HTML, Markdown, plain text
  • Normalization: NFKC, smart quotes, non-breaking spaces
  • Boilerplate: page headers/footers, legal footers, cookie banners
  • Sentence repair: ensure boundary consistency for chunkers and sentence embeddings
  • QA: token deltas, diff samples, failure alerts
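
One way to wire these stages together while feeding the QA step; the per-stage character-delta metric and the nlp_cleaning module name are illustrative assumptions:

from typing import Callable, Dict, List, Tuple

# assumed module name, matching the CLI example later in the post
from nlp_cleaning import normalize, strip_boilerplate, fix_sentences

Stage = Tuple[str, Callable[[str], str]]


def run_pipeline(text: str, stages: List[Stage]) -> Tuple[str, Dict[str, int]]:
    # apply each stage in order and record how many characters it removed,
    # so QA can flag a stage that suddenly strips too much
    stats: Dict[str, int] = {}
    for name, fn in stages:
        cleaned = fn(text)
        stats[name] = len(text) - len(cleaned)
        text = cleaned
    return text, stats


STAGES: List[Stage] = [
    ("normalize", normalize),
    ("boilerplate", strip_boilerplate),
    ("sentence_repair", fix_sentences),
]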

Rule bank with unit tests (keep bugs out)

from __future__ import annotations
import re


RULES = [
    (re.compile(r"(?mi)^page\s+\d+\s+of\s+\d+\s*$"), ""),
    (re.compile(r"(?mi)^confidential\s*$"), ""),
    # collapses ALL whitespace, including newlines; use r"[ \t]+" instead
    # if paragraph breaks must survive for downstream chunking
    (re.compile(r"\s+"), " "),
]


def apply_rules(text: str) -> str:
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text.strip()


def test_page_marker_removed():
    assert apply_rules("Page 1 of 10\nHello") == "Hello"

Run tests in CI to prevent regressions when adding new rules.
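
As the rule bank grows, parametrized tests pin down one expectation per rule; the cases below are illustrative and assume apply_rules from the module above:

import pytest


@pytest.mark.parametrize(
    "raw, expected",
    [
        ("Page 2 of 7\nBody", "Body"),
        ("CONFIDENTIAL\nBody", "Body"),
        ("a  b\tc", "a b c"),
    ],
)
def test_rules(raw: str, expected: str):
    assert apply_rules(raw) == expected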

Language-specific pitfalls

  • Accented characters: avoid lossy ASCII folding unless the text is used for search only; for semantic models, preserve accents (see the sketch after this list).
  • CJK spacing: avoid inserting spaces mid-words; use tokenizer-aware cleaning.
  • Right-to-left scripts: ensure normalization keeps bidirectional markers.
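
A quick illustration of the accent trade-off using only the standard library; ASCII folding is shown purely to demonstrate what it loses:

import unicodedata


def fold_ascii(text: str) -> str:
    # lossy: decompose accents, then drop anything outside ASCII
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


print(fold_ascii("café naïve"))                      # cafe naive  (accents lost)
print(unicodedata.normalize("NFKC", "café naïve"))   # café naïve  (accents preserved)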

Token and cost benchmarks

from pathlib import Path

from tiktoken import get_encoding

from nlp_cleaning import clean_text  # assumed module name, as in the CLI example below


def token_count(text: str, enc: str = "cl100k_base") -> int:
    return len(get_encoding(enc).encode(text))


before = Path("./sample.txt").read_text(encoding="utf-8")
after = clean_text(before)
before_tokens, after_tokens = token_count(before), token_count(after)
print({
    "before_tokens": before_tokens,
    "after_tokens": after_tokens,
    "delta_tokens": before_tokens - after_tokens,
})

Target: 10–30% token reduction without losing semantics. If you exceed 40%, inspect for over-aggressive stripping.
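
A cheap CI guard built on the token_count helper above; the 40% threshold simply encodes the rule of thumb from this section:

def check_reduction(before: str, after: str) -> float:
    # fraction of tokens removed by cleaning; flag over-aggressive stripping
    ratio = 1 - token_count(after) / max(token_count(before), 1)
    if ratio > 0.40:
        raise ValueError(f"{ratio:.0%} of tokens removed; inspect rules for over-stripping")
    return ratio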

Integrate with RAG pipelines

  • Clean before chunking to avoid cross-chunk sentences.
  • Prefer sentence-aware splitters after cleaning (e.g., RecursiveCharacterTextSplitter with paragraph hints; see the sketch after this list).
  • Validate with retrieval eval: citation hit rate should improve or remain stable after cleaning.
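
A minimal clean-then-chunk sketch, assuming the langchain-text-splitters package; the chunk sizes and file path are placeholders to tune per embedding model:

from langchain_text_splitters import RecursiveCharacterTextSplitter

from nlp_cleaning import clean_text  # assumed module name, as in the CLI example below

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=80,
    # paragraph hints: prefer breaking on blank lines, then newlines, then spaces
    separators=["\n\n", "\n", " ", ""],
)

raw = open("./docs/policy.txt", encoding="utf-8").read()
chunks = splitter.split_text(clean_text(raw))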

See: /blogs/does-langchain-use-rag, /blogs/lightrag-fast-retrieval-augmented-generation.

CLI and batching

python - <<'PY'
from pathlib import Path
from nlp_cleaning import clean_text

input_dir = Path("./docs")
output_dir = Path("./docs_clean")
output_dir.mkdir(exist_ok=True)

for p in input_dir.glob("**/*.txt"):
    out = output_dir / p.relative_to(input_dir)
    out.parent.mkdir(parents=True, exist_ok=True)
    # read and write as UTF-8 so locale defaults don't corrupt non-ASCII text
    out.write_text(clean_text(p.read_text(encoding="utf-8")), encoding="utf-8")
print("done")
PY

QA checklist (print and use)

  • Do citations, section headings, and bullets survive?
  • Are table columns preserved (monospace code blocks if needed)?
  • Did token count drop meaningfully without losing facts?
  • Are sentence boundaries coherent for chunkers?

Business value

  • Lower inference cost by 10–30% from token savings.
  • Higher retrieval quality by reducing noise.
  • Faster human review due to cleaner, consistent text.
