Even large language models can't understand privacy policies: Towards building LLM-based solutions to make privacy policies more user-friendly

Wednesday 27 November, 1:30 pm - 2:10 pm

Speakers

Bhanuka Pinchahewage

Doctoral Student
The University of Sydney

Synopsis

Many online services collect vast volumes of personal information for first-party uses such as personalisation, and these data are often shared with third parties for advertising, analytics and other purposes, frequently without the direct knowledge of end users. Disclosures about data collection, storage and sharing practices are usually set out in privacy policies. However, it is widely understood that these documents are often too long, too complicated and written in dense legal jargon. Although regulatory and compliance efforts such as the European Union's GDPR have aimed to reduce the complexity of privacy policies, they have not given end users the required level of accessibility. For example, a recent study suggested that the GDPR caused a 4.9% increase in the length of already lengthy documents. As a result, many end users blindly consent to privacy policies without actually reading them.

Providing a user-friendly, scalable, efficient, and automated solution for interpreting or explaining privacy policies advances the fields of cybersecurity and risk compliance. Existing research that leverages classical natural language processing techniques to interpret privacy policies ranges from providing user-friendly labels and classifying policies according to data practices, through to designing chatbots that answer privacy-related questions. Recent advances in large language models (LLMs) such as the GPT and LLaMA variants have shown excellent capabilities in areas such as text summarisation and understanding, and have been adapted to diverse domains such as medicine and finance. They therefore offer significant potential for making privacy policies more accessible to users.

In this talk, I will present our findings on using LLMs to interpret privacy policies, highlighting challenges such as hallucinations. I will then introduce our entailment-driven LLM framework, which classifies privacy policy paragraphs into categories such as 'first party collection/use', 'third party sharing/collection' and 'data retention', making them easier for users to understand. The key components of our framework are an explained classifier that generates classification thoughts, a blank filler that re-evaluates those original thoughts, and an entailment verifier that makes the final entailment decision. This end-to-end pipeline is analogous to how a human would accept or reject a ChatGPT interpretation based on cognitive reasoning. We evaluate our method on a publicly available, legally annotated dataset and show that it outperforms vanilla LLM methods, achieving macro-average F1 scores 8.6%, 14.5%, and 10.5% higher than the original results of T5, GPT-4, and LLaMA2, respectively. We further show that our method has better explainability: for 57.9% of all predictions, it generates reasoning texts that overlap at least 50% with what a legal expert would have reasoned. I will also shed light on instances where popular AI assistants such as GPT-4 can make false or implicit decisions in the privacy policy domain, again emphasising the need for entailment verification before applying these models to user privacy and security applications.
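To make the shape of such a pipeline concrete, the sketch below mocks up the three stages in Python. It is an illustration under stated assumptions, not the framework's actual implementation: the prompts and the query_llm() and mask_key_terms() helpers are hypothetical placeholders for whichever LLM client and masking strategy are used.

```python
# Illustrative sketch only: prompts, query_llm() and mask_key_terms() are
# hypothetical stand-ins, not the framework presented in the talk.

CATEGORIES = [
    "First Party Collection/Use",
    "Third Party Sharing/Collection",
    "Data Retention",
    # ... remaining data-practice categories
]

def query_llm(prompt: str) -> str:
    """Stand-in for a call to an LLM such as a GPT or LLaMA variant."""
    raise NotImplementedError("plug in an LLM client here")

def mask_key_terms(thought: str) -> str:
    """Stand-in for masking salient terms in the reasoning so the model must refill them."""
    raise NotImplementedError

def explained_classifier(paragraph: str) -> tuple[str, str]:
    """Step 1: predict a category and produce a 'classification thought' justifying it."""
    response = query_llm(
        f"Classify this privacy policy paragraph into one of {CATEGORIES} "
        f"and explain your reasoning on a new line.\n\n{paragraph}"
    )
    label, _, thought = response.partition("\n")
    return label.strip(), thought.strip()

def blank_filler(paragraph: str, thought: str) -> str:
    """Step 2: re-evaluate the original thought by asking the model to refill masked terms."""
    return query_llm(
        "Fill in the blanks so the reasoning is consistent with the paragraph.\n\n"
        f"Paragraph: {paragraph}\nReasoning with blanks: {mask_key_terms(thought)}"
    )

def entailment_verifier(paragraph: str, label: str, refilled: str) -> bool:
    """Step 3: keep the label only if the paragraph entails the refilled reasoning."""
    answer = query_llm(
        f"Does the paragraph entail this reasoning for the label '{label}'? "
        f"Answer yes or no.\n\nParagraph: {paragraph}\nReasoning: {refilled}"
    )
    return answer.strip().lower().startswith("yes")

def classify_paragraph(paragraph: str) -> str | None:
    """End-to-end pipeline: accept the prediction only when entailment is verified."""
    label, thought = explained_classifier(paragraph)
    refilled = blank_filler(paragraph, thought)
    return label if entailment_verifier(paragraph, label, refilled) else None
```

In this sketch the final label is returned only when the verifier confirms entailment, mirroring the idea of a human accepting or rejecting a model's interpretation rather than trusting it blindly.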

Acknowledgement of Country

We acknowledge the traditional owners and custodians of country throughout Australia and acknowledge their continuing connection to land, waters and community. We pay our respects to the people, the cultures and the elders past, present and emerging.
