LLMs Enable Large-Scale Deanonymization
Large language models (LLMs) can automate deanonymization at scale, according to a recently published paper, by extracting identity-relevant signals from unstructured text, performing efficient candidate search over millions of profiles via embeddings, and reasoning about candidate matches. The paper presents a modular four-stage pipeline (Extract, Search, Reason, Calibrate) that augments classical deanonymization approaches (e.g., Narayanan & Shmatikov's Netflix Prize attack) and works directly on raw user content rather than structured micro-data. Evaluations across three datasets (Hacker News ↔ LinkedIn, cross-subreddit Reddit movie communities, and temporally split Reddit profiles) show that LLM-based methods substantially outperform classical baselines, achieving tens of percent recall at very high precision (for example, roughly 45% recall at 99% precision in one cross-platform setting) and scaling far better as candidate pools grow.
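The four stages can be pictured as a simple function composition. The sketch below is illustrative only: the stage names follow the paper, but every function body is a toy stand-in (keyword overlap instead of real embedding and LLM calls), and all identifiers are assumptions of this summary, not the paper's code.

```python
# Toy sketch of the Extract -> Search -> Reason -> Calibrate pipeline.
# Keyword overlap stands in for embeddings and LLM judgments.

def extract(profile_text: str) -> set[str]:
    """Extract: pull identity-relevant signals (here: lowercase tokens)."""
    return {tok.strip(".,").lower() for tok in profile_text.split()}

def search(signals: set[str], candidates: dict[str, set[str]], k: int = 3):
    """Search: shortlist candidates by signal overlap (embedding stand-in)."""
    scored = sorted(candidates.items(),
                    key=lambda kv: len(signals & kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

def reason(signals, candidates, shortlist):
    """Reason: pick the best shortlisted candidate (LLM stand-in)."""
    return max(shortlist, key=lambda name: len(signals & candidates[name]))

def calibrate(signals, candidates, best):
    """Calibrate: turn overlap into a rough confidence in [0, 1]."""
    return len(signals & candidates[best]) / max(len(signals), 1)

anon = "rust developer in berlin who writes about databases"
pool = {name: extract(text) for name, text in {
    "alice": "berlin rust engineer blogging on databases",
    "bob": "paris chef posting pastry recipes",
    "carol": "tokyo artist sharing sketches",
}.items()}

sig = extract(anon)
match = reason(sig, pool, search(sig, pool))
conf = calibrate(sig, pool, match)
```

A real system would replace `search` with approximate nearest-neighbour lookup over embedding vectors and `reason`/`calibrate` with LLM calls, but the data flow between stages is the same.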
The work develops practical evaluation methods that balance ethics and ground truth: (1) take publicly linked profiles, synthetically anonymize them, and test whether the pipeline can recover the link; and (2) split a single user's activity across communities or over time to create paired profiles with known ground truth. The pipeline uses off-the-shelf embedding models for efficient nearest-neighbour retrieval and stronger LLMs to select and verify matches, plus confidence calibration (either direct LLM scores or tournament-style pairwise sorting). Across settings, embedding retrieval sharply narrows candidate pools (the true match commonly lands in the top 15), and advanced reasoning and calibration substantially increase recall at high precision compared to embedding-only ranking.
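The tournament-style calibration mentioned above can be sketched as sorting by pairwise comparisons: rather than asking a model for an absolute confidence score, it is repeatedly asked which of two candidates matches better, and a comparison sort orders the pool. In this sketch the "judge" is a numeric stand-in for an LLM pairwise judgment; all names and the scoring rule are assumptions, not the paper's implementation.

```python
from functools import cmp_to_key

def pairwise_judge(query, cand_a, cand_b):
    """Stand-in for an LLM asked: 'which candidate matches the query better?'
    Returns -1 if cand_a wins, 1 if cand_b wins, 0 on a tie."""
    score_a = -abs(query - cand_a[1])  # closer numeric value = better match
    score_b = -abs(query - cand_b[1])
    return (score_b > score_a) - (score_b < score_a)

def tournament_rank(query, candidates):
    """Order candidates using only pairwise judgments (O(n log n) calls)."""
    return sorted(candidates,
                  key=cmp_to_key(lambda a, b: pairwise_judge(query, a, b)))

# Toy candidates: (name, feature value); the query is a target value.
candidates = [("x", 7.0), ("y", 5.2), ("z", 1.0)]
ranked = tournament_rank(5.0, candidates)
```

Pairwise comparisons are often easier for a model to answer consistently than absolute scores, which is one motivation for this style of calibration.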
Policy and privacy implications are severe: the practical cost of deanonymization drops drastically, undermining the long-held assumption that pseudonymity on public platforms offers meaningful protection. Platforms, policymakers, and users must reconsider data-access policies, export routines, and expectations about what can remain private. Possible mitigations (rate-limiting bulk access, detecting automated scraping, improved anonymization for text, and model-level guardrails) may reduce risk but are unlikely to eliminate it entirely, since the pipeline relies on many benign capabilities (summarization, embedding, search). The paper argues for an urgent reassessment of privacy protections for pseudonymous participation online.
Key takeaways:
- LLMs enable end-to-end deanonymization on unstructured text by combining extraction, semantic search, reasoning, and calibration.
- LLM-based methods outperform classical structured-data attacks and scale more gracefully as candidate pools grow.
- High-precision deanonymization at non-trivial recall is practical: examples include ~67% recall at 90% precision (Hacker News→LinkedIn) and substantial recall for Reddit splits.
- Advanced reasoning (stronger LLMs, higher compute) and calibration materially improve high-precision recall.
- Embedding search narrows candidates (the true match is often in the top 15), but similarity alone is a poor confidence measure.
- Ground-truth evaluation is achieved ethically via synthetic anonymization and temporal/community splits.
- The practical obscurity protecting pseudonymous users is weakened; users should not assume pseudonymity equals safety.
- Defensive options exist (access controls, scraping detection, better text anonymization, model guardrails) but are partial and costly.
- Policy and platform choices about data releases, APIs, and bulk access must be revisited in light of automated deanonymization risks.
- Further research should combine stylometric and semantic signals and explore robust, deployable anonymization for text.
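The headline results above are recall at a fixed precision target. A small sketch of how such a figure is computed: sweep a confidence threshold over scored candidate pairs and report the best recall among operating points whose precision meets the target. The threshold sweep is the standard procedure; the scored pairs below are made up for illustration, not the paper's data.

```python
def recall_at_precision(scored, target_precision):
    """scored: list of (confidence, is_true_match) pairs.
    Returns the highest recall achievable at precision >= target_precision."""
    total_pos = sum(y for _, y in scored)
    best = 0.0
    tp = fp = 0
    # Accept pairs from highest confidence down, checking precision as we go.
    for _, y in sorted(scored, reverse=True):
        tp += y
        fp += 1 - y
        if tp / (tp + fp) >= target_precision:
            best = max(best, tp / total_pos)
    return best

# Illustrative scores: (model confidence, ground-truth match flag).
scored_pairs = [(0.99, 1), (0.95, 1), (0.90, 0), (0.80, 1), (0.20, 0)]
r = recall_at_precision(scored_pairs, 0.90)
```

Here the sweep accepts the two highest-confidence pairs (both true matches, precision 1.0) before the first false positive drops precision below the 90% target, giving recall 2/3 on this toy data.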