Document Summarization: Eval First
Aug 14, 2025
11 min read
NLP · Summarization · Evaluation · Python
Why eval-first
When I built an extractive summarizer for finance reports, we shipped faster because we defined evaluation and acceptance criteria before touching models.
Workflow
- Curate a small, representative dataset (20–50 docs)
- Define extractive baseline + abstractive model
- Compute ROUGE/BERTScore, then apply a human checklist (coverage, faithfulness)
- Review failure modes and iterate on chunking/prompts
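A minimal ROUGE scorer using the evaluate library (the successor to the deprecated datasets.load_metric):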
from __future__ import annotations

import evaluate  # pip install evaluate rouge_score

def rouge(refs: list[str], hyps: list[str]) -> dict[str, float]:
    # evaluate's "rouge" returns aggregated F-measures keyed rouge1/rouge2/rougeL/rougeLsum
    metric = evaluate.load("rouge")
    return metric.compute(predictions=hyps, references=refs)

if __name__ == "__main__":
    refs = ["Revenue increased due to subscriptions and lower churn."]
    hyps = ["Revenue increased from new subscriptions; churn was lower."]
    print(rouge(refs, hyps))
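BERTScore can be computed the same way. A minimal sketch via evaluate (assumes the bert-score package is installed; the default English model downloads on first use):

import evaluate  # pip install evaluate bert-score

def bertscore_f1(refs: list[str], hyps: list[str]) -> float:
    # compute() returns per-pair precision/recall/f1 lists; average the F1s
    metric = evaluate.load("bertscore")
    out = metric.compute(predictions=hyps, references=refs, lang="en")
    return sum(out["f1"]) / len(out["f1"])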
Human checklist (print and use)
- Coverage: all key bullets present?
- Faithfulness: no invented numbers or facts?
- Specificity: numbers and entities preserved?
- Brevity: filler and boilerplate removed?
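To hit the pass-rate target in the acceptance criteria below, record each review as structured data rather than tallying by hand. A minimal sketch (the field names are my own):

from dataclasses import dataclass, fields

@dataclass
class ChecklistResult:
    coverage: bool
    faithfulness: bool
    specificity: bool
    brevity: bool

    @property
    def passed(self) -> bool:
        # A summary passes only if every checklist item passes
        return all(getattr(self, f.name) for f in fields(self))

def pass_rate(results: list[ChecklistResult]) -> float:
    return sum(r.passed for r in results) / len(results)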
Related: posts on RAG and data quality that can improve chunking and grounding: /blogs/does-langchain-use-rag, /blogs/lightrag-fast-retrieval-augmented-generation, and /blogs/detect-remove-outliers-python-iqr-zscore.
CTA
Need production summarization with scorecards and human-in-the-loop review? Work with me →
Architecture
- Ingest and clean text (see text-cleaning pipeline)
- Segment by sections to avoid cross-topic chunks (see the segmentation sketch after this list)
- Extractive baseline (TextRank or embedding-based key sentence selection)
- Abstractive refinement with constrained prompting
- Score with ROUGE/BERTScore + human checklist
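Section segmentation is worth sketching, since cross-topic chunks are a common source of muddled summaries. A minimal version, assuming headings are detectable with a simple pattern (layout-heavy PDFs need real structure-aware parsing):

import re

# Hypothetical heading pattern: numbered headings ("2.1 Results") or short ALL-CAPS lines
HEADING = re.compile(r"^(\d+(\.\d+)*\.?\s+\S.*|[A-Z][A-Z .,&-]{3,})$")

def segment_by_sections(lines: list[str]) -> list[list[str]]:
    # Start a new chunk at each heading so no chunk spans topics
    sections: list[list[str]] = []
    current: list[str] = []
    for line in lines:
        if HEADING.match(line.strip()) and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)
    return sections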
Extractive baseline example
from __future__ import annotations

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def textrank_sentences(sentences: list[str], top_k: int = 5) -> list[str]:
    # Degree-centrality shortcut to TextRank: score each sentence by its summed
    # TF-IDF cosine similarity to every other sentence (sklearn rows are L2-normalized)
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = (tfidf @ tfidf.T).toarray()
    scores = sim.sum(axis=1)
    idx = np.argsort(-scores)[:top_k]
    return [sentences[i] for i in sorted(idx)]  # keep original document order
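Usage on a pre-split section (sentence splitting itself is out of scope here):

section = [
    "Revenue grew 12% year over year.",
    "Subscription churn fell to 3.1%.",
    "Marketing spend was flat.",
    "The board approved a buyback.",
]
print(textrank_sentences(section, top_k=2))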
Abstractive refinement prompt (LLM)
You are summarizing a section for financial analysts.
Constraints:
- Keep numbers and entities accurate.
- No claims beyond the provided sentences.
- Max 120 words.
Sentences:
<paste extractive sentences>
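Wiring that prompt into a model call is straightforward. A sketch assuming an OpenAI-compatible client; the model name is a placeholder, so swap in whatever you deploy:

from openai import OpenAI

PROMPT = """You are summarizing a section for financial analysts.
Constraints:
- Keep numbers and entities accurate.
- No claims beyond the provided sentences.
- Max 120 words.
Sentences:
{sentences}"""

def refine(extractive_sentences: list[str], model: str = "gpt-4o-mini") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(sentences="\n".join(extractive_sentences))}],
    )
    return resp.choices[0].message.content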
Metrics and acceptance criteria
- ROUGE-L ≥ 0.35 on the validation set; BERTScore-F1 ≥ 0.86 on the domain corpus.
- Human checklist pass rate ≥ 0.9 (sampled 20 summaries weekly).
- Drift alerts if either metric drops ≥ 10% week-over-week.
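These thresholds are easy to encode as a release gate. A minimal sketch; the constants mirror the bullets above:

def acceptance_gate(rouge_l: float, bertscore_f1: float,
                    prev_rouge_l: float, prev_bertscore_f1: float,
                    checklist_pass_rate: float) -> list[str]:
    # Returns a list of violations; an empty list means the batch ships
    violations: list[str] = []
    if rouge_l < 0.35:
        violations.append(f"ROUGE-L {rouge_l:.3f} below 0.35")
    if bertscore_f1 < 0.86:
        violations.append(f"BERTScore-F1 {bertscore_f1:.3f} below 0.86")
    if checklist_pass_rate < 0.90:
        violations.append(f"checklist pass rate {checklist_pass_rate:.2f} below 0.90")
    # Drift alert: a >=10% week-over-week drop in either automatic metric
    for name, cur, prev in [("ROUGE-L", rouge_l, prev_rouge_l),
                            ("BERTScore-F1", bertscore_f1, prev_bertscore_f1)]:
        if prev > 0 and (prev - cur) / prev >= 0.10:
            violations.append(f"{name} dropped >=10% week-over-week")
    return violations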
Failure modes and fixes
- Missing critical bullet: increase the extractive top_k or re-segment by section headings.
- Fabricated numbers: add unit tests that scan for number changes vs the source (see the sketch after this list).
- Repetition/bloat: enforce word cap and remove boilerplate via cleaning.
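For the fabricated-numbers case, a number-consistency test is cheap: every number in the summary must already appear in the source. A minimal sketch (normalizing formats like "1,200" vs "1200" is left out for brevity):

import re

NUM = re.compile(r"\d+(?:[.,]\d+)*%?")

def novel_numbers(source: str, summary: str) -> set[str]:
    # Numbers present in the summary but absent from the source are suspect
    return set(NUM.findall(summary)) - set(NUM.findall(source))

def test_no_fabricated_numbers():
    source = "Revenue grew 12% to $4.2 million on subscriptions."
    summary = "Revenue grew 12% to $4.2 million."
    assert not novel_numbers(source, summary)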
Integration notes
- Store source sentence IDs alongside summaries for traceability.
- Log tokens, latency, and scores for each job; create dashboards.
- For long docs, summarize sections first, then synthesize an executive summary (sketched below).
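Putting the pieces together for long documents: summarize per section, then synthesize. A sketch reusing textrank_sentences and refine from above; a production version would also carry source sentence IDs through for traceability:

def summarize_long_doc(sections: list[list[str]], top_k: int = 3) -> str:
    # Map: pull key sentences from each section; reduce: one constrained rewrite
    pooled: list[str] = []
    for section in sections:
        pooled.extend(textrank_sentences(section, top_k=top_k))
    return refine(pooled)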