Confidence Thresholds

Not all detections are equal. A credit card number validated by Luhn checksum is definitively a card number. A 9-digit number near the word "passport" might be a passport number, or it might be a reference code, an order number, or a ZIP+4 code.
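For readers unfamiliar with the Luhn checksum mentioned above, here is a minimal sketch of the standard algorithm. It illustrates the check itself and is not PIIGuard's internal implementation; the function name `luhn_valid` is chosen here for illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep digits only, so spaces and dashes in the input are ignored.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True (a well-known test number)
print(luhn_valid("4111 1111 1111 1112"))  # False (last digit changed)
```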

Most PII detection tools return everything and leave you to sort out the noise. Enterprise tools let you configure custom detection policies, but charge accordingly. PIIGuard ships per-detector thresholds you control on every request, no enterprise contract required. Raise them for compliance workflows. Lower them for data discovery. Tune individual entity types while leaving everything else at defaults.

Overriding Thresholds

Three ways to customize detection per request:

Global floor

min_confidence: 0.7

Raise or lower the bar across all detectors at once.

Per-entity tuning

confidence_thresholds: {…}

Set a different threshold for each entity type independently.

Bypass defaults

use_default_thresholds: false

Disable PIIGuard defaults entirely and handle filtering yourself.
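If you call the API without the SDK, the three options can be assembled into one request body. The sketch below is an assumption-heavy illustration: the option names come from this page, but the `text` field name and the helper names `build_detect_request` and `_check_score` are hypothetical, so check the API reference for the exact schema.

```python
import json
from typing import Optional


def _check_score(name: str, value: float) -> None:
    """Reject thresholds outside the 0.0-1.0 confidence range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")


def build_detect_request(
    text: str,
    min_confidence: Optional[float] = None,
    confidence_thresholds: Optional[dict] = None,
    use_default_thresholds: bool = True,
) -> dict:
    """Assemble a request body; only options that were set are included."""
    body = {"text": text}  # "text" is an assumed field name
    if min_confidence is not None:
        _check_score("min_confidence", min_confidence)
        body["min_confidence"] = min_confidence
    if confidence_thresholds:
        for entity_type, floor in confidence_thresholds.items():
            _check_score(entity_type, floor)
        body["confidence_thresholds"] = dict(confidence_thresholds)
    if not use_default_thresholds:
        body["use_default_thresholds"] = False
    return body


body = build_detect_request(
    "Ref B12345678.",
    confidence_thresholds={"US_PASSPORT": 0.6},
)
print(json.dumps(body, indent=2))
```

Validating the range locally catches typos such as `70` instead of `0.7` before the request is sent.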

1. Raise the threshold — compliance / high-precision

Returns only high-confidence matches. Reduces false positives for PCI DSS, HIPAA audit workflows, or any pipeline where a wrong flag has a real cost.

from instructeer.guards import PIIGuard

# Placeholder key below. In real code, load it from the environment instead,
# e.g. os.environ["INSTRUCTEER_API_KEY"] (requires `import os`).
pii = PIIGuard(api_key="rg_your_key")
result = pii.detect_all(
    "Patient John Smith, DOB 04/12/1980. Ref B12345678.",
    min_confidence=0.7,
)
# B12345678 (passport 0.4) is filtered out
# John Smith (person 1.0) and DOB (1.0) remain

2. Lower the threshold — discovery / breach assessment

Returns everything including uncertain matches. Use for scanning data stores, breach triage, or building your own confidence-based routing. Expect more false positives.

from instructeer.guards import PIIGuard

pii = PIIGuard(api_key="rg_your_key")
result = pii.detect_all(
    "Patient John Smith, DOB 04/12/1980. Ref B12345678.",
    min_confidence=0.0,
)
# All matches returned — B12345678 flagged as US_PASSPORT (0.4)
# and US_DRIVER_LICENSE (0.3). Route by confidence in your code.
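# block() and review() stand in for your own handlers; they are not part of PIIGuard.
for d in result.detections:
    if d.confidence >= 0.7:
        block(d)      # high confidence: stop the data from leaving
    elif d.confidence >= 0.4:
        review(d)     # medium confidence: queue for human review
    # anything below 0.4 is ignored here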

3. Per-entity-type tuning

Set a different floor for each entity type. Everything not listed uses the PIIGuard default. Keys are the entity type names returned in each detection's entity_type field.

from instructeer.guards import PIIGuard

pii = PIIGuard(api_key="rg_your_key")
result = pii.detect_all(
    "Patient John Smith, DOB 04/12/1980. Ref B12345678.",
    confidence_thresholds={
        "US_DRIVER_LICENSE": 0.5,
        "US_PASSPORT": 0.6,
        "URL": 0.8,
    },
)
detections = result.detections
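To make the fallback rule concrete ("everything not listed uses the PIIGuard default"), here is a local sketch of how a threshold could be resolved for each detection. It runs on plain dictionaries and does not call the service. Several things are assumptions: the mapping of documented defaults onto entity type names, the rule that a detection is kept when its confidence is greater than or equal to the threshold, and the helper names `effective_threshold` and `filter_detections`.

```python
# Assumed mapping of a few documented defaults onto entity type names.
DEFAULT_THRESHOLDS = {
    "US_BANK_NUMBER": 0.8,
    "URL": 0.6,
    "US_PASSPORT": 0.5,
    "US_DRIVER_LICENSE": 0.3,
}


def effective_threshold(entity_type, overrides=None, use_defaults=True):
    """Per-entity override first, then the default, then no floor at all."""
    if overrides and entity_type in overrides:
        return overrides[entity_type]
    if use_defaults:
        return DEFAULT_THRESHOLDS.get(entity_type, 0.0)
    return 0.0


def filter_detections(detections, overrides=None, use_defaults=True):
    """Keep detections whose confidence meets their effective threshold."""
    return [
        d for d in detections
        if d["confidence"] >= effective_threshold(
            d["entity_type"], overrides, use_defaults
        )
    ]


# Hand-written sample mirroring the scores quoted in the examples above.
sample = [
    {"text": "John Smith", "entity_type": "PERSON", "confidence": 1.0},
    {"text": "B12345678", "entity_type": "US_PASSPORT", "confidence": 0.4},
    {"text": "B12345678", "entity_type": "US_DRIVER_LICENSE", "confidence": 0.3},
]
overrides = {"US_DRIVER_LICENSE": 0.5, "US_PASSPORT": 0.6, "URL": 0.8}

kept = filter_detections(sample, overrides)
print([d["entity_type"] for d in kept])  # ['PERSON']
```

With `use_defaults=False` and no overrides, the same function returns all three sample detections, which matches the "handle filtering yourself" mode.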

Valid confidence_thresholds keys

US_SSN, CREDIT_CARD, US_BANK_NUMBER, US_BANK_ROUTING, IBAN_CODE, US_PASSPORT, US_DRIVER_LICENSE, DATE_TIME, PHONE_NUMBER, EMAIL_ADDRESS, IP_ADDRESS, URL, PERSON, LOCATION

Default Thresholds

Applied automatically on /detect/all unless you set use_default_thresholds: false.

| Detector       | Default min | What gets filtered                           |
|----------------|-------------|----------------------------------------------|
| email          | none        | Nothing (always 1.0)                         |
| ssn            | none        | Nothing (always 1.0)                         |
| card           | none        | Nothing (Luhn validated)                     |
| iban           | none        | Nothing (MOD-97 validated)                   |
| bank_routing   | none        | Nothing (ABA validated)                      |
| bank_account   | 0.8         | No-context 4–17 digit sequences (0.5)        |
| phone          | none        | Nothing (NANP validated)                     |
| ip             | none        | Nothing (private IPs flagged at 0.7)         |
| url            | 0.6         | Bare domain without www or protocol (0.4)    |
| dob            | none        | Nothing (already requires context)           |
| driver_license | 0.3         | Digit-only without context (0.01)            |
| passport       | 0.5         | NGP format without context (0.4)             |
| person_name    | none        | Nothing (already 0.6+ minimum)               |

What This Looks Like in Practice

# Input
"Order 12345678 shipped. Reference B12345678. Call 555-0123."

# Without thresholds (what most tools return):
# - 12345678       → US_DRIVER_LICENSE  (0.01)  ← noise
# - B12345678      → US_PASSPORT        (0.4)   ← uncertain
# - B12345678      → US_DRIVER_LICENSE  (0.3)   ← uncertain
# - 555-0123       → US_PHONE_NUMBER    (0.5)   ← fictional

# With PIIGuard defaults:
# - 555-0123       → US_PHONE_NUMBER    (0.5, fictional=true)

The noise is filtered. The real signal — a phone number, flagged as fictional — remains.

Confidence Score Reference

| Confidence | Interpretation   | Validation signal                            |
|------------|------------------|----------------------------------------------|
| 1.0        | Definitive match | Checksum valid (Luhn, MOD-97, ABA)           |
| 0.8–0.9    | Very likely PII  | Strong context + pattern match               |
| 0.6–0.7    | Likely PII       | Pattern match + some context                 |
| 0.4–0.5    | Possible PII     | Pattern match, weak/no context               |
| 0.1–0.3    | Uncertain        | Pattern match only, high false positive risk |
| < 0.1      | Noise            | Filter in production                         |
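If you route detections yourself, the bands in the table can be turned into a small lookup. This is a sketch under one assumption: scores that fall between the listed bands (for example 0.95 or 0.75) are assigned to the band below them. The names `BANDS` and `interpret` are illustrative.

```python
# Lower bound of each band, highest first, with the table's interpretation.
BANDS = [
    (1.0, "Definitive match"),
    (0.8, "Very likely PII"),
    (0.6, "Likely PII"),
    (0.4, "Possible PII"),
    (0.1, "Uncertain"),
]


def interpret(confidence: float) -> str:
    """Map a confidence score to its interpretation band."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0.0, 1.0], got {confidence}")
    for lower_bound, label in BANDS:
        if confidence >= lower_bound:
            return label
    return "Noise"  # anything below 0.1


for score in (1.0, 0.8, 0.4, 0.3, 0.01):
    print(score, interpret(score))
```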

Validated (always 1.0): Credit card (Luhn), IBAN (MOD-97), bank routing (ABA checksum with 3-7-1 weights), SSN (range validation rather than a checksum)

Context-boosted (varies): Driver's license, passport, URL, bank account — confidence rises when context keywords appear nearby.

Lexicon-matched: Person name — first+last pair = 1.0, single token with context = 0.6, common word overlap = 0.2.
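The two bank checksums named above are standard and easy to reproduce. The sketch below shows the ABA routing check (weights 3, 7, 1 repeated across nine digits) and the IBAN MOD-97 check. It illustrates the public algorithms, not PIIGuard's code; the function names are chosen here, and the IBAN check does not verify country-specific lengths.

```python
def aba_valid(routing: str) -> bool:
    """ABA routing number check: weighted digit sum must be divisible by 10."""
    if len(routing) != 9 or not (routing.isascii() and routing.isdigit()):
        return False
    weights = (3, 7, 1) * 3  # 3-7-1 repeated for the nine digits
    total = sum(int(d) * w for d, w in zip(routing, weights))
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """IBAN MOD-97 check (ISO 13616); country length rules are not checked."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35, as the standard requires.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(aba_valid("021000021"))                    # True
print(aba_valid("021000022"))                    # False
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # True (standard example IBAN)
print(iban_valid("GB83 WEST 1234 5698 7654 32"))  # False (check digits changed)
```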

Severity Levels (NIST SP 800-122)

Severity follows NIST SP 800-122 impact levels for PII disclosure.

| Entity Type            | Severity    | Rationale                         |
|------------------------|-------------|-----------------------------------|
| SSN                    | high        | Direct identity theft vector      |
| Credit Card            | high        | Direct financial fraud vector     |
| Bank Account + Routing | high        | ACH fraud, account takeover       |
| IBAN                   | high        | International wire fraud          |
| Passport               | high        | Identity fraud, border issues     |
| Driver's License       | high        | Identity fraud, physical access   |
| Date of Birth          | high        | Identity verification bypass      |
| Person Name            | medium–high | PII when combined with other data |
| Phone Number           | medium      | Contact info, social engineering  |
| Email                  | medium      | Account enumeration, phishing     |
| IP Address             | medium      | Location tracking, network attacks |
| URL                    | low         | Indirect identifier               |
| Generic Date           | low         | Rarely PII without context        |
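One way to use the severity table is to compute the highest severity present in a document and act on that. The sketch below is an illustration only. It assumes the table's display names as keys, splits "Bank Account + Routing" into two entries, and reads "medium–high" for Person Name as medium on its own and high when any other entity is also present. The names `SEVERITY`, `RANK`, and `highest_severity` are hypothetical.

```python
SEVERITY = {
    "SSN": "high",
    "Credit Card": "high",
    "Bank Account": "high",
    "Bank Routing": "high",
    "IBAN": "high",
    "Passport": "high",
    "Driver's License": "high",
    "Date of Birth": "high",
    "Person Name": "medium",  # treated as high when combined, see below
    "Phone Number": "medium",
    "Email": "medium",
    "IP Address": "medium",
    "URL": "low",
    "Generic Date": "low",
}
RANK = {"low": 0, "medium": 1, "high": 2}


def highest_severity(entity_names):
    """Return the highest severity among the detected entity names, or None."""
    names = set(entity_names)
    if not names:
        return None
    # Assumed reading of "medium-high": a name plus any other data is high.
    if "Person Name" in names and len(names) > 1:
        return "high"
    # Unknown names raise KeyError so that typos are not silently ignored.
    return max((SEVERITY[n] for n in names), key=RANK.__getitem__)


print(highest_severity(["URL", "Email"]))        # medium
print(highest_severity(["Person Name"]))         # medium
print(highest_severity(["Person Name", "URL"]))  # high
```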