NLP Entity Matching with Fuzzy Search
Aug 14, 2025
10 min read
NLP · Entity Matching · Python · IR
Problem
Product catalogs from different sources rarely line up 1:1: the same product shows up under different titles, abbreviations, and noise tokens, so naive joins miss or mismatch records. I combine lexical and semantic similarity with score thresholds to minimize false matches.
Approach
- Candidate generation with TF-IDF cosine
- Re-ranking with Jaro-Winkler for surface similarity
- Final semantic tie-breaker with small embeddings (sketched after the listing below)
from __future__ import annotations

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def jaro_winkler(a: str, b: str) -> float:
    """Jaro-Winkler surface similarity; falls back to 0.0 if jellyfish is not installed."""
    try:
        import jellyfish
        return jellyfish.jaro_winkler_similarity(a, b)
    except Exception:
        return 0.0


def match_entities(left: list[str], right: list[str], top_k: int = 5) -> list[tuple[str, str, float]]:
    """Match each left title to its best right title: TF-IDF candidate generation, Jaro-Winkler re-rank."""
    tfidf = TfidfVectorizer(min_df=1, ngram_range=(1, 2))
    Xl = tfidf.fit_transform(left)
    Xr = tfidf.transform(right)
    sims = cosine_similarity(Xl, Xr)  # lexical similarity matrix, shape (len(left), len(right))
    matches = []
    for i, row in enumerate(sims):
        # Candidate generation: the top_k right-side titles by TF-IDF cosine.
        candidates = np.argsort(row)[::-1][:top_k]
        # Re-rank candidates with a blend of cosine and Jaro-Winkler surface similarity.
        best_idx, best_score = int(candidates[0]), -1.0
        for idx in candidates:
            score = 0.7 * row[idx] + 0.3 * jaro_winkler(left[i], right[idx])
            if score > best_score:
                best_idx, best_score = int(idx), float(score)
        matches.append((left[i], right[best_idx], best_score))
    return matches
if __name__ == "__main__":
    a = ["Apple iPhone 13 Pro", "Samsung Galaxy S22"]
    b = ["iPhone 13 Pro Max by Apple", "Galaxy S22 Ultra Samsung"]
    for m in match_entities(a, b):
        print(m)
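The listing above stops at the lexical + surface blend. For the optional semantic tie-breaker, here is a minimal sketch, assuming the OpenAI Python client (openai>=1.0) and the text-embedding-3-small model mentioned in the workflow below; embedding_similarity and break_tie are illustrative names, and any small embedder slots in the same way:

import numpy as np
from openai import OpenAI  # assumption: openai>=1.0 client; reads OPENAI_API_KEY from the environment

_client = OpenAI()


def embedding_similarity(a: str, b: str, model: str = "text-embedding-3-small") -> float:
    """Cosine similarity between two title embeddings, used only as a tie-breaker."""
    resp = _client.embeddings.create(model=model, input=[a, b])
    va = np.array(resp.data[0].embedding)
    vb = np.array(resp.data[1].embedding)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))


def break_tie(query: str, cand_a: str, cand_b: str) -> str:
    """Return whichever candidate is semantically closer to the query title."""
    return cand_a if embedding_similarity(query, cand_a) >= embedding_similarity(query, cand_b) else cand_b

Calling it only when the top two blended scores are nearly tied keeps the embedding cost negligible.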
Thresholds and QA
- Accept ≥ 0.8 as a confident match; 0.6–0.8 → manual review; < 0.6 → reject (routing sketch below).
- Evaluate with precision@1 and manual spot-checks.
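A minimal routing helper for those bands (route_match and the Decision alias are illustrative, not from any library):

from typing import Literal

Decision = Literal["accept", "review", "reject"]


def route_match(score: float, accept: float = 0.8, review: float = 0.6) -> Decision:
    """Map a blended match score onto the accept / manual-review / reject bands."""
    if score >= accept:
        return "accept"
    if score >= review:
        return "review"
    return "reject"

Feed it the third element of each match_entities tuple to split results into the three queues.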
CTA
Need entity resolution at scale with human-in-the-loop QA? Work with me →
Architecture & workflow
- Normalize product titles (case, unicode, punctuation); see the sketch after this list
- Generate candidates via TF-IDF cosine (top-10)
- Re-rank with Jaro-Winkler; compute blended score
- Optional: embed with text-embedding-3-small for semantic tie-breakers (sketched above)
- Threshold routing: accept/review/reject
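A minimal normalization pass for the first step; the exact punctuation and accent handling depends on the catalog, so treat normalize_title as a starting point rather than a fixed rule:

import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents, and collapse punctuation/whitespace in a product title."""
    t = unicodedata.normalize("NFKD", title)
    t = "".join(c for c in t if not unicodedata.combining(c))  # drop combining accents
    t = t.lower()
    t = re.sub(r"[^\w\s.]", " ", t)  # punctuation to space; keep dots for sizes like "0.5"
    return re.sub(r"\s+", " ", t).strip()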
Evaluation harness
from __future__ import annotations

# Assumes the matcher above is saved as matching.py next to this script (module name is illustrative).
from matching import match_entities


def precision_at_1(gold: list[tuple[str, str]], preds: list[tuple[str, str, float]]) -> float:
    """Fraction of predictions whose top match agrees with the gold pairing."""
    lookup = {a: b for a, b in gold}
    hits = 0
    for a, b, _ in preds:
        hits += int(lookup.get(a) == b)
    return hits / max(1, len(preds))


if __name__ == "__main__":
    gold = [("Apple iPhone 13 Pro", "iPhone 13 Pro Max by Apple"), ("Samsung Galaxy S22", "Galaxy S22 Ultra Samsung")]
    preds = match_entities([g[0] for g in gold], [g[1] for g in gold])
    print({"p@1": precision_at_1(gold, preds)})
Target: ≥ 0.9 p@1 on clean catalogs; route the ambiguous 0.6–0.8 band to a human review queue.
Edge cases and fixes
- Brand aliases (e.g., Google vs Alphabet) → maintain an alias map applied before matching (sketch after this list).
- Units and pack sizes ("500ml" vs "0.5 L") → normalize units to a canonical form.
- Noise tokens ("new", "sale") → remove stopwords tailored to the catalog domain.
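A minimal sketch of the alias map and unit normalization; BRAND_ALIASES and UNIT_FACTORS_ML are placeholder tables, not a complete ruleset:

import re

# Illustrative alias map: canonical brand on the right-hand side.
BRAND_ALIASES = {"alphabet": "google", "meta platforms": "meta"}

# Convert volume mentions to millilitres so "500ml" and "0.5 L" compare equal.
UNIT_FACTORS_ML = {"ml": 1.0, "l": 1000.0}


def apply_aliases(title: str) -> str:
    """Replace known brand aliases with their canonical form before matching."""
    t = title.lower()
    for alias, canonical in BRAND_ALIASES.items():
        t = t.replace(alias, canonical)
    return t


def normalize_volumes(title: str) -> str:
    """Rewrite volume mentions as '<n>ml' regardless of the original unit."""
    def repl(m: re.Match) -> str:
        value = float(m.group(1))
        unit = m.group(2).lower()
        return f"{int(value * UNIT_FACTORS_ML[unit])}ml"
    return re.sub(r"(\d+(?:\.\d+)?)\s*(ml|l)\b", repl, title, flags=re.IGNORECASE)

# normalize_volumes("Olive oil 0.5 L") -> "Olive oil 500ml"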
Integrations
- Push accepted matches to MDM with versioned lineage.
- Emit ambiguous matches to a human-review dashboard with audit logs.
- Schedule nightly diffs; alert on drift in p@1.
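For the nightly diff, a minimal drift check, assuming the evaluation harness above is re-run on a labeled sample each night; the 0.9 baseline mirrors the target above and the alerting path is a placeholder:

def check_drift(current_p_at_1: float, baseline: float = 0.9, tolerance: float = 0.02) -> bool:
    """Return True when p@1 has fallen more than `tolerance` below the baseline."""
    return current_p_at_1 < baseline - tolerance


if __name__ == "__main__":
    p_at_1 = 0.87  # placeholder: value from precision_at_1 on tonight's labeled sample
    if check_drift(p_at_1):
        # Placeholder alerting path: in practice, post to the review dashboard or a webhook.
        print("ALERT: p@1 drift detected; hold auto-accepts and route matches to manual review")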