Dense vector embeddings derived from text containing personal data may themselves constitute personal data if the original information can be recovered through inversion techniques. The DPO Liaison assesses whether the stored embeddings constitute personal data by applying the test in GDPR Recital 26: whether re-identification is achievable using means reasonably likely to be used, taking into account the available technology, the cost of identification, and the time required. The CJEU’s Breyer ruling (C-582/14) confirms that legal means of obtaining identifying information, including additional information held by a third party, are relevant to this assessment.
The assessment considers three factors: the embedding model’s dimensionality (higher-dimensional embeddings tend to preserve more of the source text and so can increase inversion risk), the availability of inversion techniques for the specific model architecture, and whether the embeddings are stored alongside metadata (document identifiers, timestamps, user identifiers) that could facilitate re-identification. Because embedding inversion techniques continue to advance, the assessment must reflect the capabilities available at the time it is made and be revisited as they change.
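The factors above could be captured in a structured assessment record. The following is a minimal sketch, not a prescribed schema: the class name, every field name, and the 180-day review interval are hypothetical choices made for illustration. The record stores the DPO Liaison's determination and rationale; it does not compute them.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class EmbeddingAssessment:
    """One DPO Liaison assessment of a stored embedding collection (hypothetical schema)."""

    model_name: str
    dimensionality: int
    inversion_technique_available: bool      # known technique for this architecture
    stored_with_identifying_metadata: bool   # document IDs, timestamps, user IDs
    is_personal_data: bool                   # the DPO Liaison's determination
    rationale: str
    assessed_on: date
    review_interval_days: int = 180          # assumed value, not taken from any policy

    def review_due(self, today: date) -> bool:
        """True once the assessment is old enough to need revisiting."""
        return today >= self.assessed_on + timedelta(days=self.review_interval_days)
```

Keeping the assessment date and review interval in the record makes it straightforward to list assessments that are overdue for reassessment.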
Where the DPO Liaison determines that embeddings constitute personal data, the full GDPR compliance framework applies. A lawful basis must be identified for storing the embeddings. The retention policy must specify a deletion schedule. Data subject access and erasure requests must be serviceable, which may require the ability to identify and delete specific embeddings from the vector store. The DPIA must address the embedding-specific risks.
The practical challenge is that vector databases are optimised for similarity search, not record-level deletion. The Technical SME assesses the vector database’s deletion capabilities at architecture design time and documents the approach for servicing erasure requests. Where the database does not support efficient single-record deletion, a mapping between embeddings and source documents enables erasure at the next scheduled re-indexing. The re-indexing schedule must then be frequent enough for erasure to complete within the response period set by Article 12(3) GDPR (one month from receipt of the request, extendable in limited circumstances).
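The mapping approach can be sketched as a small in-memory index. This is an illustration under stated assumptions, not a production design: `ErasureIndex` and its method names are hypothetical, a real system would persist the mapping durably, and the suppression check is an optional addition that assumes the query layer can filter results by embedding ID until re-indexing runs.

```python
from collections import defaultdict


class ErasureIndex:
    """Maps source documents to the embedding IDs derived from them (hypothetical)."""

    def __init__(self):
        self._by_document = defaultdict(set)  # document_id -> embedding IDs
        self._pending = set()                 # embedding IDs awaiting removal

    def register(self, document_id, embedding_id):
        """Record that an embedding was derived from a source document."""
        self._by_document[document_id].add(embedding_id)

    def request_erasure(self, document_id):
        """Queue every embedding derived from the document for removal."""
        ids = self._by_document.pop(document_id, set())
        self._pending |= ids
        return ids

    def is_suppressed(self, embedding_id):
        """True if the embedding is queued for removal and should be filtered out."""
        return embedding_id in self._pending

    def drain_for_reindex(self):
        """Return the IDs to exclude at the next re-index and clear the queue."""
        ids, self._pending = self._pending, set()
        return ids
```

At re-indexing time, the job would call `drain_for_reindex()` and omit the returned IDs when rebuilding the index. Where the database does support direct deletion, the same mapping identifies which embedding IDs to delete immediately.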
For systems where embeddings are determined to constitute personal data, the Technical SME implements monitoring for embedding inversion attacks: access logging for the vector database, anomaly detection on query patterns, and periodic reassessment of the inversion risk landscape.
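One simple form of anomaly detection on query patterns is a per-client sliding-window rate check, on the assumption that inversion attempts tend to involve unusually high query volumes. The sketch below is illustrative only: `QueryRateMonitor`, its parameters, and any threshold values are hypothetical, and it assumes timestamps arrive in non-decreasing order for each client.

```python
from collections import defaultdict, deque


class QueryRateMonitor:
    """Flags clients whose query count within a sliding window exceeds a threshold."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # client_id -> recent query timestamps

    def record(self, client_id, timestamp):
        """Log one query; return True if the client now exceeds the threshold."""
        events = self._events[client_id]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while events and timestamp - events[0] >= self.window_seconds:
            events.popleft()
        return len(events) > self.max_queries
```

A flagged client is a prompt for review against the access logs rather than proof of an attack. Appropriate thresholds depend on the system's normal traffic and would need tuning against observed baselines.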
Key outputs
- GDPR status determination for stored embeddings
- DPO Liaison assessment record
- Erasure request handling specification
- Inversion monitoring specification (where applicable)