Confidence Thresholds

Not all detections are equal. A credit card number validated by Luhn checksum is definitively a card number. A 9-digit number near the word "passport" might be a passport number, or it might be a reference code, an order number, or a ZIP+4.
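To make the checksum idea concrete, here is a minimal sketch of a Luhn check in plain Python. It illustrates the general algorithm only. The function name luhn_valid is hypothetical and is not part of PIIGuard, whose internal validation may differ.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Hypothetical helper for illustration; not a PIIGuard function.
    """
    # Keep only ASCII digits so spaces and dashes in formatted numbers are ignored.
    digits = [int(ch) for ch in number if ch.isascii() and ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A pattern match with no checksum behind it, such as the 9-digit example above, has no equivalent test, which is why its confidence depends on surrounding context.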

Most PII detection tools return everything and leave you to sort out the noise. Enterprise tools let you configure custom detection policies, but charge accordingly. PIIGuard ships per-detector thresholds you control on every request, no enterprise contract required. Raise them for compliance workflows. Lower them for data discovery. Tune individual entity types while leaving everything else at defaults.

Overriding Thresholds

Three ways to customize detection per request:

Global floor

min_confidence: 0.7

Raise or lower the bar across all detectors at once.

Per-entity tuning

confidence_thresholds: {…}

Set a different threshold for each entity type independently.

Bypass defaults

use_default_thresholds: false

Disable PIIGuard defaults entirely and handle filtering yourself.
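If you bypass the defaults and filter on your side, the logic might look like the sketch below. It assumes each detection can be reduced to a dict with entity_type and confidence keys. It also assumes that a per-entity threshold takes precedence over the global floor, and that a score equal to the threshold is kept. None of these details are confirmed by this page, and filter_detections is a hypothetical helper, not a PIIGuard function.

```python
def filter_detections(detections, min_confidence=0.0, confidence_thresholds=None):
    """Keep detections at or above their threshold (client-side sketch).

    Assumptions: per-entity thresholds override the global floor, and the
    comparison is inclusive (>=).
    """
    thresholds = confidence_thresholds or {}
    kept = []
    for d in detections:
        # Use the entity-specific floor when one is given, else the global floor.
        floor = thresholds.get(d["entity_type"], min_confidence)
        if d["confidence"] >= floor:
            kept.append(d)
    return kept


# Illustrative data modeled on the examples in this page.
raw = [
    {"entity_type": "PERSON", "text": "John Smith", "confidence": 1.0},
    {"entity_type": "US_PASSPORT", "text": "B12345678", "confidence": 0.4},
    {"entity_type": "US_DRIVER_LICENSE", "text": "B12345678", "confidence": 0.3},
]

strict = filter_detections(raw, min_confidence=0.7)
tuned = filter_detections(raw, min_confidence=0.7,
                          confidence_thresholds={"US_PASSPORT": 0.3})
print([d["entity_type"] for d in strict])
print([d["entity_type"] for d in tuned])
```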

1. Raise the threshold — compliance / high-precision

Returns only high-confidence matches. This reduces false positives in PCI DSS and HIPAA audit workflows, or in any pipeline where a wrong flag has a real cost.

from instructeer.guards import PIIGuard

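# In production, read the key from the environment instead of hard-coding it,
# e.g. api_key=os.environ["INSTRUCTEER_API_KEY"] (requires `import os`).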
pii = PIIGuard(api_key="rg_your_key")
result = pii.detect_all(
    "Patient John Smith, DOB 04/12/1980. Ref B12345678.",
    min_confidence=0.7,
)
# B12345678 (passport 0.4) is filtered out
# John Smith (person 1.0) and DOB (1.0) remain

2. Lower the threshold — discovery / breach assessment

Returns everything, including uncertain matches. Use it for scanning data stores, breach triage, or building your own confidence-based routing. Expect more false positives.

from instructeer.guards import PIIGuard

pii = PIIGuard(api_key="rg_your_key")
result = pii.detect_all(
    "Patient John Smith, DOB 04/12/1980. Ref B12345678.",
    min_confidence=0.0,
)
# All matches returned: B12345678 is flagged as US_PASSPORT (0.4)
# and US_DRIVER_LICENSE (0.3). Route by confidence in your code.
# block() and review() are placeholders for your own handlers.
for d in result.detections:
    if d.confidence >= 0.7:
        block(d)
    elif d.confidence >= 0.4:
        review(d)

3. Per-entity-type tuning

Set a different floor for each entity type. Everything not listed uses the PIIGuard default. Keys are the entity type names returned in each detection's entity_type field.

from instructeer.guards import PIIGuard

pii = PIIGuard(api_key="rg_your_key")
result = pii.detect_all(
    "Patient John Smith, DOB 04/12/1980. Ref B12345678.",
    confidence_thresholds={
        "US_DRIVER_LICENSE": 0.5,
        "US_PASSPORT": 0.6,
        "URL": 0.8,
    },
)
detections = result.detections

Valid confidence_thresholds keys

US_SSN, CREDIT_CARD, US_BANK_NUMBER, US_BANK_ROUTING, IBAN_CODE, US_PASSPORT, US_DRIVER_LICENSE, DATE_TIME, PHONE_NUMBER, EMAIL_ADDRESS, IP_ADDRESS, URL, PERSON, LOCATION

Default Thresholds

Applied automatically on /detect/all unless you set use_default_thresholds: false.

Detector	Default min	What gets filtered
email	—	Nothing — always 1.0
ssn	—	Nothing — always 1.0
card	—	Nothing — Luhn validated
iban	—	Nothing — MOD-97 validated
bank_routing	—	Nothing — ABA validated
bank_account	0.8	No-context 4–17 digit sequences (0.5)
phone	—	Nothing — NANP validated
ip	—	Nothing — private IPs flagged at 0.7
url	0.6	Bare domain without www or protocol (0.4)
dob	—	Nothing — already requires context
driver_license	0.3	Digit-only without context (0.01)
passport	0.5	NGP format without context (0.4)
person_name	—	Nothing — already 0.6+ minimum

What This Looks Like in Practice

# Input
"Order 12345678 shipped. Reference B12345678. Call 555-0123."

# Without thresholds (what most tools return):
# - 12345678       → US_DRIVER_LICENSE  (0.01)  ← noise
# - B12345678      → US_PASSPORT        (0.4)   ← uncertain
# - B12345678      → US_DRIVER_LICENSE  (0.3)   ← uncertain
# - 555-0123       → PHONE_NUMBER       (0.5)   ← fictional

# With PIIGuard defaults:
# - 555-0123       → PHONE_NUMBER       (0.5, fictional=true)

The noise is filtered. The real signal — a phone number, flagged as fictional — remains.

Confidence Score Reference

Confidence	Interpretation	Validation signal
1.0	Definitive match	Checksum valid (Luhn, MOD-97, ABA)
0.8–0.9	Very likely PII	Strong context + pattern match
0.6–0.7	Likely PII	Pattern match + some context
0.4–0.5	Possible PII	Pattern match, weak/no context
0.1–0.3	Uncertain	Pattern match only, high false positive risk
< 0.1	Noise	Filter in production

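The bands in the table above can be turned into a small lookup for logging or routing. This is a sketch under one assumption: the table leaves gaps between bands (for example between 0.7 and 0.8), so the code treats each band's lower bound as a cutoff. The function name interpret is hypothetical.

```python
# Lower bound of each band, highest first, taken from the table above.
BANDS = [
    (1.0, "Definitive match"),
    (0.8, "Very likely PII"),
    (0.6, "Likely PII"),
    (0.4, "Possible PII"),
    (0.1, "Uncertain"),
]


def interpret(confidence: float) -> str:
    """Map a confidence score to its interpretation label.

    Assumption: scores that fall between two listed bands belong to the
    lower one, since the table does not say how they are classified.
    """
    for lower_bound, label in BANDS:
        if confidence >= lower_bound:
            return label
    return "Noise"


for score in (1.0, 0.4, 0.3, 0.01):
    print(score, interpret(score))
```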
Checksum- or rule-validated (always 1.0): credit card (Luhn), IBAN (MOD-97), bank routing (ABA 3-7-1), SSN (range validation, not a checksum)
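For reference, the ABA 3-7-1 and MOD-97 checks named above can be sketched as follows. These are the standard public algorithms written as hypothetical helpers (aba_valid, iban_valid). They are not PIIGuard's own code, and the IBAN check covers only the MOD-97 step, not country-specific lengths or formats.

```python
def aba_valid(routing: str) -> bool:
    """ABA routing check: digits weighted 3, 7, 1 (repeating) must sum to a multiple of 10."""
    if len(routing) != 9 or not (routing.isascii() and routing.isdigit()):
        return False
    weights = (3, 7, 1) * 3
    total = sum(int(ch) * w for ch, w in zip(routing, weights))
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """IBAN MOD-97 check only; country-specific length rules are not verified."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35, as the algorithm requires.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(aba_valid("021000021"))                    # passes the 3-7-1 check
print(aba_valid("021000022"))                    # last digit changed
print(iban_valid("GB82 WEST 1234 5698 7654 32")) # commonly cited example IBAN
print(iban_valid("GB83 WEST 1234 5698 7654 32")) # check digits changed
```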

Context-boosted (varies): Driver's license, passport, URL, bank account — confidence rises when context keywords appear nearby.

Lexicon-matched: Person name — first+last pair = 1.0, single token with context = 0.6, common word overlap = 0.2.

Severity Levels (NIST SP 800-122)

Severity follows NIST SP 800-122 impact levels for PII disclosure.

Entity Type	Severity	Rationale
SSN	high	Direct identity theft vector
Credit Card	high	Direct financial fraud vector
Bank Account + Routing	high	ACH fraud, account takeover
IBAN	high	International wire fraud
Passport	high	Identity fraud, border issues
Driver's License	high	Identity fraud, physical access
Date of Birth	high	Identity verification bypass
Person Name	medium–high	PII when combined with other data
Phone Number	medium	Contact info, social engineering
Email	medium	Account enumeration, phishing
IP Address	medium	Location tracking, network attacks
URL	low	Indirect identifier
Generic Date	low	Rarely PII without context

References: NIST SP 800-122 · GDPR Article 22 · PCI DSS