Confidence Thresholds
Not all detections are equal. A credit card number validated by Luhn checksum is definitively a card number. A 9-digit number near the word "passport" might be a passport number, or it might be a reference code, order number, or ZIP+4.
Most PII detection tools return everything and leave you to sort out the noise. Enterprise tools let you configure custom detection policies, but charge accordingly. PIIGuard ships per-detector thresholds you control on every request, no enterprise contract required. Raise them for compliance workflows. Lower them for data discovery. Tune individual entity types while leaving everything else at defaults.
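The Luhn check mentioned above is plain arithmetic over the digits, which is why a passing card number can be scored with full confidence. A minimal sketch of the public algorithm follows; it is not PIIGuard's own code, and `luhn_valid` is an illustrative name.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep digits only, so spaces and dashes in formatted numbers are ignored.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True (a widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```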
Overriding Thresholds
Three ways to customize detection per request:
Global floor (min_confidence: 0.7): raise or lower the bar across all detectors at once.
Per-entity tuning (confidence_thresholds: {…}): set a different threshold for each entity type independently.
Bypass defaults (use_default_thresholds: false): disable PIIGuard defaults entirely and handle filtering yourself.
1. Raise the threshold — compliance / high-precision
Returns only high-confidence matches. Reduces false positives for PCI DSS, HIPAA audit workflows, or any pipeline where a wrong flag has a real cost.
from instructeer.guards import PIIGuard
# api_key = os.environ["INSTRUCTEER_API_KEY"]
pii = PIIGuard(api_key="rg_your_key")
result = pii.detect_all(
    "Patient John Smith, DOB 04/12/1980. Ref B12345678.",
    min_confidence=0.7,
)
# B12345678 (passport 0.4) is filtered out
# John Smith (person 1.0) and DOB (1.0) remain
2. Lower the threshold — discovery / breach assessment
Returns everything including uncertain matches. Use for scanning data stores, breach triage, or building your own confidence-based routing. Expect more false positives.
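Confidence-based routing of this kind can be prototyped offline before wiring it to real responses. The sketch below assumes each detection exposes `entity_type` and `confidence`; the `Detection` dataclass, the `route` helper, and the 0.7 / 0.4 cutoffs are illustrative choices, not part of the PIIGuard SDK.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    # Hypothetical stand-in for an SDK detection object.
    entity_type: str
    confidence: float


def route(d: Detection, block_at: float = 0.7, review_at: float = 0.4) -> str:
    """Map a detection to an action based on its confidence."""
    if d.confidence >= block_at:
        return "block"
    if d.confidence >= review_at:
        return "review"
    return "log"


sample = [
    Detection("PERSON", 1.0),
    Detection("US_PASSPORT", 0.4),
    Detection("US_DRIVER_LICENSE", 0.3),
]
for d in sample:
    print(d.entity_type, route(d))
```

Keeping the cutoffs as parameters makes it easy to tighten them for compliance pipelines and loosen them for discovery scans.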
from instructeer.guards import PIIGuard
pii = PIIGuard(api_key="rg_your_key")
result = pii.detect_all(
    "Patient John Smith, DOB 04/12/1980. Ref B12345678.",
    min_confidence=0.0,
)
# All matches are returned: B12345678 is flagged as US_PASSPORT (0.4)
# and US_DRIVER_LICENSE (0.3). Route by confidence in your code.
# block() and review() are your own handlers, not SDK functions.
for d in result.detections:
    if d.confidence >= 0.7:
        block(d)
    elif d.confidence >= 0.4:
        review(d)
3. Per-entity-type tuning
Set a different floor for each entity type. Everything not listed uses the PIIGuard default. Keys are the entity type names returned in each detection's entity_type field.
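Because the keys must match entity type names exactly, it can help to validate a threshold map before sending it. The sketch below copies the key set from this page's list of valid keys; `validate_thresholds` is a hypothetical helper, and the 0.0 to 1.0 range check is an assumption based on the confidence scale used here.

```python
VALID_ENTITY_TYPES = frozenset({
    "US_SSN", "CREDIT_CARD", "US_BANK_NUMBER", "US_BANK_ROUTING",
    "IBAN_CODE", "US_PASSPORT", "US_DRIVER_LICENSE", "DATE_TIME",
    "PHONE_NUMBER", "EMAIL_ADDRESS", "IP_ADDRESS", "URL",
    "PERSON", "LOCATION",
})


def validate_thresholds(thresholds: dict) -> dict:
    """Return a copy of `thresholds`, or raise ValueError on bad input."""
    unknown = sorted(set(thresholds) - VALID_ENTITY_TYPES)
    if unknown:
        raise ValueError("unknown entity types: " + ", ".join(unknown))
    for key, value in thresholds.items():
        # Assumed range: confidence scores on this page run from 0.0 to 1.0.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key}: threshold {value} is outside 0.0-1.0")
    return dict(thresholds)


print(validate_thresholds({"US_PASSPORT": 0.6, "URL": 0.8}))
```

A typo such as "PASSPORT" then fails loudly on the client instead of being silently ignored or rejected later.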
from instructeer.guards import PIIGuard
pii = PIIGuard(api_key="rg_your_key")
result = pii.detect_all(
    "Patient John Smith, DOB 04/12/1980. Ref B12345678.",
    confidence_thresholds={
        "US_DRIVER_LICENSE": 0.5,
        "US_PASSPORT": 0.6,
        "URL": 0.8,
    },
)
detections = result.detections
Valid confidence_thresholds keys
US_SSN, CREDIT_CARD, US_BANK_NUMBER, US_BANK_ROUTING, IBAN_CODE, US_PASSPORT, US_DRIVER_LICENSE, DATE_TIME, PHONE_NUMBER, EMAIL_ADDRESS, IP_ADDRESS, URL, PERSON, LOCATION
Default Thresholds
Applied automatically on /detect/all unless you set use_default_thresholds: false.
| Detector | Default min | What gets filtered |
|---|---|---|
| email | — | Nothing — always 1.0 |
| ssn | — | Nothing — always 1.0 |
| card | — | Nothing — Luhn validated |
| iban | — | Nothing — MOD-97 validated |
| bank_routing | — | Nothing — ABA validated |
| bank_account | 0.8 | No-context 4–17 digit sequences (0.5) |
| phone | — | Nothing — NANP validated |
| ip | — | Nothing — private IPs flagged at 0.7 |
| url | 0.6 | Bare domain without www or protocol (0.4) |
| dob | — | Nothing — already requires context |
| driver_license | 0.3 | Digit-only without context (0.01) |
| passport | 0.5 | NGP format without context (0.4) |
| person_name | — | Nothing — already 0.6+ minimum |
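
If you set use_default_thresholds: false and filter yourself, the table above can be approximated client-side. This is a sketch, not PIIGuard's implementation: the mapping from detector names to entity type keys (for example passport to US_PASSPORT), the inclusive `>=` comparison, and the `apply_thresholds` helper are all assumptions.

```python
# Default floors from the table; detectors with no default are omitted.
DEFAULT_MIN_CONFIDENCE = {
    "US_BANK_NUMBER": 0.8,     # bank_account
    "URL": 0.6,                # url
    "US_DRIVER_LICENSE": 0.3,  # driver_license
    "US_PASSPORT": 0.5,        # passport
}


def apply_thresholds(detections, overrides=None):
    """Keep (entity_type, confidence) pairs that meet their floor.

    Per-entity overrides win; anything not listed falls back to the
    default floor, and entity types without a default are always kept.
    """
    overrides = overrides or {}
    kept = []
    for entity_type, confidence in detections:
        floor = overrides.get(
            entity_type, DEFAULT_MIN_CONFIDENCE.get(entity_type, 0.0)
        )
        if confidence >= floor:  # assumed inclusive boundary
            kept.append((entity_type, confidence))
    return kept


sample = [
    ("US_DRIVER_LICENSE", 0.01),
    ("US_PASSPORT", 0.4),
    ("URL", 0.4),
    ("CREDIT_CARD", 1.0),
]
print(apply_thresholds(sample))                         # only CREDIT_CARD survives
print(apply_thresholds(sample, {"US_PASSPORT": 0.4}))   # passport match is kept too
```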
What This Looks Like in Practice
# Input
"Order 12345678 shipped. Reference B12345678. Call 555-0123."
# Without thresholds (what most tools return):
# - 12345678 → US_DRIVER_LICENSE (0.01) ← noise
# - B12345678 → US_PASSPORT (0.4) ← uncertain
# - B12345678 → US_DRIVER_LICENSE (0.3) ← uncertain
# - 555-0123 → US_PHONE_NUMBER (0.5) ← fictional
# With PIIGuard defaults:
# - 555-0123 → US_PHONE_NUMBER (0.5, fictional=true)
The noise is filtered. The real signal, a phone number flagged as fictional, remains.
Confidence Score Reference
| Confidence | Interpretation | Validation signal |
|---|---|---|
| 1.0 | Definitive match | Checksum valid (Luhn, MOD-97, ABA) |
| 0.8–0.9 | Very likely PII | Strong context + pattern match |
| 0.6–0.7 | Likely PII | Pattern match + some context |
| 0.4–0.5 | Possible PII | Pattern match, weak/no context |
| 0.1–0.3 | Uncertain | Pattern match only, high false positive risk |
| < 0.1 | Noise | Filter in production |
Checksum-validated (always 1.0): Credit card (Luhn), IBAN (MOD-97), Bank routing (ABA 3-7-1), SSN (range validation)
Context-boosted (varies): Driver's license, passport, URL, bank account — confidence rises when context keywords appear nearby.
Lexicon-matched: Person name — first+last pair = 1.0, single token with context = 0.6, common word overlap = 0.2.
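The ABA 3-7-1 and MOD-97 checks named above are public algorithms and can be reproduced in a few lines. This is a generic sketch rather than PIIGuard's own validator; `aba_valid` and `iban_valid` are illustrative names, and the IBAN check verifies only the checksum, not country-specific lengths or formats.

```python
def aba_valid(routing: str) -> bool:
    """ABA routing checksum: digits weighted 3, 7, 1 repeating, sum % 10 == 0."""
    if len(routing) != 9 or not (routing.isascii() and routing.isdigit()):
        return False
    d = [int(ch) for ch in routing]
    total = (
        3 * (d[0] + d[3] + d[6])
        + 7 * (d[1] + d[4] + d[7])
        + (d[2] + d[5] + d[8])
    )
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """IBAN MOD-97 check: rearranged, letter-expanded number % 97 == 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(aba_valid("021000021"))                    # True
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # True
```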
Severity Levels (NIST SP 800-122)
Severity follows NIST SP 800-122 impact levels for PII disclosure.
| Entity Type | Severity | Rationale |
|---|---|---|
| SSN | high | Direct identity theft vector |
| Credit Card | high | Direct financial fraud vector |
| Bank Account + Routing | high | ACH fraud, account takeover |
| IBAN | high | International wire fraud |
| Passport | high | Identity fraud, border issues |
| Driver's License | high | Identity fraud, physical access |
| Date of Birth | high | Identity verification bypass |
| Person Name | medium–high | PII when combined with other data |
| Phone Number | medium | Contact info, social engineering |
| Email Address | medium | Account enumeration, phishing |
| IP Address | medium | Location tracking, network attacks |
| URL | low | Indirect identifier |
| Generic Date | low | Rarely PII without context |