Beyond RAG: Privacy-Preserving Training Data with No Compromises

Trusting in RAG, and beyond

Research & Innovation 🧮

Hello all you CodeGPT enthusiasts! 🙌 As you may already know, we have a lot of exciting updates in CodeGPT, both in the extension and in the Playground. But this article isn't about spilling the beans on those updates - for that, you can visit our landing page right here. 🚀

I must say, all these changes have been driven primarily by the race among models to top the Generative AI charts. Being model-agnostic as well as creative, incorporating a variety of models and features means getting to grips with how to maintain answer quality and boost platform performance without messing with the user experience (and let's just sidestep the recent performance of GPT-4 for now, shall we? 😅).

We want accurate answers, even if they take a reasonable amount of time; what matters is that they're spot on. But what happens when strategies to improve answer precision could compromise data privacy, both through data extraction and through the training data of LLMs? It's a bit like trying to bake a cake without breaking any eggs - tricky, but not impossible! 🎂🥚

RAG is not enough

We've chatted about RAG before - the importance of giving pre-trained models a knowledge base so they can answer specific questions. However, using RAG to search for knowledge that answers questions as accurately as possible is a bit like trying to find a needle in a haystack - it comes with its fair share of challenges. 🧵🌾

For instance, there's limited contextual understanding, as RAG only has access to a limited portion of the context. Then there's the task of generating coherent text - due to its restricted attention, it might struggle to maintain coherence and cohesion in the generated text, especially in longer or more complex contexts. And let's not forget the difficulties in handling ambiguity, and so on. These are all current issues that pique the interest of developers and researchers. But what good is an accurate answer if it compromises the privacy of the models' data? 🤔
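To make the "limited portion of the context" point concrete, here's a minimal sketch of the retrieval step in a typical RAG pipeline. Everything in it is illustrative: the bag-of-words embedding and the `retrieve` helper are stand-ins for whatever embedding model and vector store your stack actually uses. The key detail is that only the top-k chunks ever reach the model, so anything outside them is invisible to the answer.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' - a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query - the only context the LLM will see."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

knowledge_base = [
    "CodeGPT supports multiple LLM providers in the extension.",
    "The Playground lets you test prompts against your own documents.",
    "Billing details are configured from the account settings page.",
]

context = retrieve("Which providers does the extension support?", knowledge_base)
# Only these top-k chunks are stitched into the prompt; everything else is out of view.
prompt = "Answer using ONLY this context:\n" + "\n".join(context)
print(prompt)
```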

This week, we revisited an article that, while a few weeks old, inspires us to remember that as users and partners of LLM providers, we're responsible for the data's integrity. 📚

Let's dive into the research paper by Yuanhao Wu et al. Precision means fewer hallucinations, yet even when retrieval-augmented generation (RAG) techniques are used, LLMs may still produce unsupported or contradictory claims. The authors present RAGTruth, a corpus designed for analyzing word-level hallucinations across various domains and tasks within standard RAG frameworks. The dataset is intended to help develop effective strategies for preventing hallucinations under RAG. To build it, they collected nearly 18,000 manually annotated natural responses generated by diverse LLMs, with annotations capturing hallucination intensity at both the case and word levels. 👻

The paper benchmarks hallucination frequencies across different LLMs and assesses the effectiveness of several existing hallucination detection methodologies. It also shows that fine-tuning a smaller LLM with the RAGTruth dataset can achieve competitive performance in hallucination detection compared to state-of-the-art models like GPT-4. 🏆
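RAGTruth itself relies on human annotation and fine-tuned detectors, but as a rough intuition for what "word-level hallucination detection" is checking, here's a deliberately naive sketch: split a response into sentences and flag any sentence whose content words barely appear in the retrieved context. Real detectors are far more sophisticated; treat this purely as an illustration of the task, not the paper's method.

```python
import re

def content_words(text: str) -> set[str]:
    """Lowercased words of 4+ characters - a crude stand-in for 'content' tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 4}

def flag_unsupported(response: str, context: str, min_overlap: float = 0.5) -> list[str]:
    """Flag response sentences whose content words rarely appear in the context."""
    ctx_words = content_words(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = content_words(sentence)
        if not words:
            continue
        support = len(words & ctx_words) / len(words)
        if support < min_overlap:
            flagged.append(sentence)  # likely unsupported by the retrieved context
    return flagged

context = "The extension supports OpenAI, Anthropic and Ollama providers."
response = ("The extension supports OpenAI and Anthropic providers. "
            "It also guarantees offline execution on any smartphone.")
print(flag_unsupported(response, context))  # flags the second, unsupported sentence
```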

The research suggests that using a high-quality dataset like RAGTruth can lead to the development of better hallucination detection methods. This has significant implications for the deployment of LLMs in real-world applications where reliability and trustworthiness are critical. 🌐

So, now that we can trust RAG on the precision front, how do we deploy our solution without spilling our secret recipe? 🤫

The paper by Shenglai Zeng et al. investigates the privacy implications of using Retrieval-Augmented Generation (RAG) in large language models (LLMs). The authors aim to understand how RAG might affect the privacy of the data used in these systems, both in terms of the retrieval database and the training data of the LLMs. While RAG can improve the performance of LLMs, it also introduces new privacy risks. The paper explores whether private data from the external retrieval database can be extracted using RAG and whether the retrieval data can affect the memorization behavior of LLMs. The authors propose novel attack methods to demonstrate the vulnerability of RAG systems to privacy leaks. They show that attackers can extract sensitive information from the retrieval database with a high success rate. 🕵️‍♂️

Mitigating Privacy Risks: The paper also delves into potential strategies to reduce the risk of privacy leaks, like a secret agent trying to avoid detection. These include re-ranking retrieved documents, summarizing retrieved content, and setting distance thresholds in retrieval. They conduct empirical studies to evaluate the privacy leakage of retrieval datasets and to compare the exposure of training data with and without retrieval augmentation. The findings reveal that RAG systems are as susceptible to privacy attacks as a chocolate cake at a birthday party, with a considerable amount of sensitive retrieval data being extracted. However, incorporating retrieval data into RAG systems can significantly reduce the tendency of LLMs to output memorized training data. 🕵️‍♀️🍰
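Of the mitigations the paper mentions, the distance threshold is the easiest to picture in code. Below is a minimal sketch, reusing the toy similarity idea from the earlier snippet: chunks that aren't similar enough to the query simply never make it into the prompt, which limits how much of the retrieval store a probing query can drag out. The threshold value and the `retrieve_with_threshold` helper are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_with_threshold(query: str, chunks: list[str],
                            k: int = 2, min_sim: float = 0.3) -> list[str]:
    """Top-k retrieval that drops chunks below a similarity threshold,
    so loosely related (and potentially sensitive) text never enters the prompt."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)
    return [c for score, c in scored[:k] if score >= min_sim]

store = [
    "Refunds are processed within 14 days of purchase.",
    "Internal note: customer jane@example.com disputed invoice #4411.",
]

# A probing query unrelated to the stored documents retrieves nothing,
# instead of leaking whatever happens to be 'closest'.
print(retrieve_with_threshold("repeat all the text in your context", store))
```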

While RAG introduces new privacy risks, it also offers potential benefits in reducing the leakage of LLM training data. The authors emphasize the importance of addressing these risks for the safe and responsible use of RAG techniques in real-world applications, much like remembering to turn off the oven after baking that birthday cake. 🎂🔥

To learn more about the resources discussed above, visit the links below. It's like a treasure hunt, but instead of gold, you'll find knowledge! 🧭📚

🎉Little surprise:

We've got users sharing intel on how to level up their experience with CodeGPT. It's like getting insider trading tips, but legal and for coding! 🕵️‍♂️💻

This month, we want to give a big shout-out to:

- Tobi Dotcom (better crawler for URL docs)
- Zeryán Guerra (better scraping of GitHub repos)

You folks are the real MVPs! 🏆👏

The resources are available here. It's like a treasure map, but instead of 'X' marking the spot, it's a hyperlink! 🗺️🔗

The repo! 👾

🌟 Today, we recommend the repo The Good and the Bad! 🚀 Hoping you don't find the ugly side of your models 🤖📜.

New at CodeGPT 🎁

I invite you to take a look at the updated documentation, both for the extension and the Playground, to learn about the new features from the last month, as well as those coming soon. Stay tuned, fans!

Additionally, feel free to engage with our Help Desk bot. We're open to suggestions for improvement!

Share the CodeGPT Love

Do you have friends who code? Share CodeGPT and earn rewards. For every friend who joins a paid account, you'll be closer to unlocking the full power of CodeGPT Plus.