
Episode 181 - Redacting your Secrets

Patricia Thaine, CEO of Private AI, talks to Max about using AI for redaction and preserving privacy.

About Patricia Thaine

Patricia Thaine is the Co-Founder and CEO of Private AI, a Toronto- and Berlin-based startup creating a suite of privacy tools that make it easy to comply with data protection regulations, mitigate cybersecurity threats, and maintain customer trust.

She is a Computer Science PhD Candidate at the University of Toronto and a Postgraduate Affiliate at the Vector Institute doing research on privacy-preserving natural language processing, with a focus on applied cryptography. Her research interests also include computational methods for lost language decipherment.
Patricia is a recipient of the NSERC Postgraduate Scholarship, the RBC Graduate Fellowship, the Beatrice “Trixie” Worsley Graduate Scholarship in Computer Science, and the Ontario Graduate Scholarship. She has nine years of research and software development experience, including at the McGill Language Development Lab, the University of Toronto's Computational Linguistics Lab, the University of Toronto's Department of Linguistics, and the Public Health Agency of Canada.

She is also a member of the Board of Directors of Equity Showcase, one of Canada's oldest not-for-profit charitable organizations.

Private AI Website
Private AI Demo

Privacy Enhancing Technology (PET) decision tree
(Use this flowchart to understand privacy implications of your data work)

Patricia's Twitter
Private AI's Twitter
Private AI's LinkedIn

Slate: Korean Chat Bot “Science of Love” Spills User Secrets

Related Episodes

Episode 2 on Difficulties of Internationalization
Episode 3 on Foursquare’s Rating System
Episode 18 on the Manipulation of Public Chatbots
Episode 104 on Preserving Privacy in the Digital Age

Transcript

Max Sklar: You're listening to The Local Maximum, Episode 181.

Time to expand your perspective. Welcome to The Local Maximum. Now here's your host, Max Sklar.

Max: Welcome, everyone. Welcome. You have reached another Local Maximum. What do you think about when you think about redaction? I think about a redacted document. I don't know. I'm imagining the government is finally saying, “Oh, we're releasing the JFK files or something.” And when they come out, half of it is blocked out because they don't want to tell us who this person is, or what happened in this paragraph, or whatever. Today we ask the question: when do we get to redact our own documents? When do we get to protect our own privacy? As you'll see, it's important because you have a huge digital footprint, and if someone wants to find out a lot about you, then they can. And organizations can accidentally leak information about you without knowing it. So even if you're giving your information, or allowing your information to be recorded, by a somewhat trustworthy organization, there's so much data out there that a lot of new techniques, and a lot of new ways of thinking, have to be used in order to keep certain things private. So I found someone working on this problem with a solution that uses AI to redact information in our private conversations and in private data being stored in databases around the world. It's called Private AI. My guest today is a Ph.D. candidate in Computer Science at the University of Toronto, and the co-founder and CEO of Private AI. Patricia Thaine, you've reached The Local Maximum. Welcome to the show.

Patricia Thaine: Thank you so much, Max. Great to be here. 

Max: I'm so glad to have you here today. I like to hear about the latest AI projects. And I know my audience does, too. And I know we all care about our personal privacy and data these days on the internet and all over. And so, your research, your entrepreneurial work speaks to all of that. So maybe before we begin, you could tell us a little bit about what you're working on and what led you to that point. What are your research interests?

Patricia: Yeah, sounds good. So at Private AI, what we're doing is making it really easy for developers to integrate privacy into their software pipelines. We noticed a couple of years ago that if you were a developer who wanted to integrate privacy, you needed to have a background in machine learning and privacy: in secure multi-party computation, homomorphic encryption, differential privacy. These are very rare skill sets to have, and it's also very difficult to hire for them, given how rare and expensive they are. So what we're setting out to do is essentially be the go-to privacy technology solution for developers.

Max: Right. So when people think about privacy technology, there's all sorts of different things you could think about; certainly encryption. Could you tell me a little bit about what you're doing in that area? My understanding is you are working more on redaction. Is that true, or is there an encryption component?

Patricia: Yeah. For now, redaction and de-identification of text, images, and video. Really, text is where we focus the most, because most of the data produced tends to be text.

Max: So yeah, let's actually get back into... So what do you mean by redaction? Give people an example of why that's important, what that is.

Patricia: Yeah, sounds good. So you can think about it this way: your calls are often recorded when you call call centers, and these call centers would like to analyze those calls in order to improve your experience. But they don't need your personally identifiable information, like your name, your address, your credit card number, your social security number, and so on, to be part of those calls in order to perform inferences on the transcripts: does a particular customer service representative need training? How well did this conversation go? And so on. So what we do is make it really easy for a developer who's working with transcripts to remove that personally identifiable information.

Max: So you're on the phone, or anyone out there is on the phone with some customer service representative. Maybe, if you're privacy-minded, you try not to say things that could be too intrusive to yourself, but you never know. You're having a long conversation, and you could say all sorts of things that are then being recorded. And so, what you're saying is your software goes in and somehow removes those bits that could be personally identifiable information. Some examples are your social security number and your credit card number; is there anything else that's maybe less obvious?
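To make the idea concrete, here is a minimal sketch of the naive, pattern-based redaction a developer might reach for first. The patterns and labels are illustrative assumptions, not Private AI's approach; and as Patricia explains next, rigid patterns catch rigidly formatted identifiers but miss the quasi-identifiers that actually drive re-identification.

```python
import re

# A naive baseline: redact only rigidly formatted, US-style identifiers.
# These patterns are illustrative assumptions, not Private AI's method.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[.-]\d{3}[.-]\d{4}\b"),
}

def redact(text: str) -> str:
    # Replace each matched span with its bracketed label.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("My card is 4111 1111 1111 1111 and my SSN is 123-45-6789."))
# -> My card is [CREDIT_CARD] and my SSN is [SSN].
```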

Patricia: Yeah. A lot of the time, when you see headlines saying, for example, that anonymization doesn't work because such-and-such company tried anonymization and individuals were still re-identified; if you look more deeply into those headlines, it tends to be companies that weren't actually anonymizing properly. What they're missing are things like quasi-identifiers: age, religion, ethnicity, approximate locations; things that, when combined together, exponentially increase the likelihood of re-identifying an individual.
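A hedged illustration of Patricia's point, using made-up records: each quasi-identifier alone looks harmless, but grouping on the combination can shrink the anonymity set to a single person. This is the intuition behind k-anonymity; a group size of k = 1 means the record is unique and potentially re-identifiable.

```python
import pandas as pd

# Made-up records; each column alone is vague, the combination is not.
df = pd.DataFrame({
    "age_bracket": ["30-39", "30-39", "30-39", "40-49"],
    "religion":    ["A", "A", "B", "B"],
    "city":        ["Toronto", "Toronto", "Toronto", "Berlin"],
})

quasi_identifiers = ["age_bracket", "religion", "city"]
group_sizes = df.groupby(quasi_identifiers).size()

# Combinations shared by only one record single a person out (k = 1).
print(group_sizes[group_sizes == 1])
```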

Max: Basically, everyone's gonna know all the things that you could use to try to find someone on Facebook or LinkedIn when it's not obvious.

Patricia: Exactly, exactly.

Max: So how did you get into this field? Why did you see this as an important area to work on? Do you have a personal story or something that you read?

Patricia: Yeah. So back when I started my Ph.D., we were doing research on acoustic forensics. Acoustic forensics allows you to determine who's speaking in a recording, what kind of educational background they have, what socio-economic background they have, things like that. And two things really popped out. One: some of the interest we were getting in that work was from heads of military from less-than-humanitarian countries, so I didn't feel comfortable creating things that would compromise people's identities in those situations. And two: a lot of these tools, which can really enhance the performance of automatic speech recognition, lack the data to be trained properly. So privacy really combines two things. One, it protects the privacy and human rights of the individuals whose data are being trained with or inferred upon. But two, it also increases the likelihood that the technology we build is going to be better in real-world environments.

Max: So maybe you could help give an example of a conversation that I might have, and what it looks like before it's run. So okay, the company now has this conversation. If they don't run any privacy software on it, you're just trusting that company. Am I correct? But then, if they use the redaction software, they are protecting the end user a little bit. Is that how it goes?

Patricia: Yeah. So for that specific example, what you're saying is: a lot of data protection regulations require you to prevent certain Personally Identifiable Information from being shared broadly across an organization, or to keep track of where that PII is stored. There's a whole bunch of rules that you have to follow whenever you're collecting Personally Identifiable Information, with good reason too, because of all of the cybersecurity threats that are out there. If there's a data leak, for example, and you didn't collect Personally Identifiable Information, that's much less bad for your organization than if you're leaking a bunch of conversations that contain everybody's details and what they spoke about during those calls. Does that answer your question?

Max: Yeah, yeah. So tell me about the text software a little bit. I had a little bit of a look at it. It'll take some text and replace words with an identifier in brackets, which will be something like name 1, or credit card number, or religion 1, which is interesting if people mention religion. So tell me a little bit more about how it works. Well, first, tell me, am I right in describing how it works? And then we can get into what's going on under the hood a little bit.

Patricia: Sounds good. Yeah. So on the surface, that's one thing that you can see. What we do is send a Docker image to developers for them to integrate directly within their pipeline, so no personal information is ever shared with us; it never touches our data centers or servers. They can then call this Docker image, which is set up as a REST API, using POST requests. In goes the text that they want to de-identify, or in which they want to determine where the PII is and what kinds of PII are in there, and out comes the de-identified or pseudonymized source. We can also replace names with fake names, locations with fake locations, and so on, to make the text more natural. You also get back a dictionary of Personally Identifiable Information: all of the PII that was found, what entities it's associated with, and what segments of the text you can find it in.

And the cool thing about the pseudonymization is that if you're using the data to then train machine learning models, it prevents degraded accuracy down the line. So you can train sentiment analysis, named entity recognition, and so on, without worrying about degrading your accuracy because redaction labels aren't natural and don't match your pre-training data. And in addition to that, you also add an extra level of privacy, because in case anything's missed, it's very difficult to tell the original data from the fake data.
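As a rough sketch of the developer-facing flow Patricia describes: a POST request to the locally running container, with pseudonymized text and a PII dictionary coming back. The host, port, endpoint path, and field names here are all hypothetical placeholders, not Private AI's documented API.

```python
import requests

# Hypothetical call to a redaction container running on the developer's
# own infrastructure, so text never leaves their pipeline. The endpoint
# and fields ("text", "pseudonymize", "processed_text", "pii") are assumed.
response = requests.post(
    "http://localhost:8080/deidentify",
    json={
        "text": "Hi, this is Jane Doe. My card is 4111 1111 1111 1111.",
        "pseudonymize": True,  # fake names/numbers instead of [LABEL] tags
    },
)
result = response.json()
print(result["processed_text"])  # text with fake but natural-looking values
print(result["pii"])             # dictionary of detected entities and spans
```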

Max: So I'm a little confused here, because you're talking about someone who might want to run named entity recognition on their text. And I've done that, let's say. That's like trying to pick out key people that you're talking about, or key locations, or things like that. But if you've already changed those, doesn't that screw up the algorithm? Or are you saying run it through before then? Or–

Patricia: It depends on what kind of named entity recognition you want to run, right? So you might want to pick up names, in which case you could just use our software to do that. But you might have your own named entities, specific to your organization, that you want to look at.

Max: They might not be redacted by your software? 

Patricia: Yeah. 

Max: Oh, okay. Okay, I see. So in other words, you can then take the new text, run it through, and run it through algorithms that can analyze other portions of the text without worrying about all of the Personally Identifiable Information that is in there. 

Patricia: Right. 

Max: So okay, cool.

Patricia: So, there's this Korean love bot that was in the news a few weeks ago, maybe a couple of months ago, that had been trained directly on chats from their customers. And it was outputting people's names, people's addresses, usernames, and so on. So if you're training a generative model, especially on text that contains PII, you're very much risking the privacy of the individuals that are in the training data. A solution like ours prevents that from happening.

Max: Yeah, yeah, that's a good point. Let me try to relate this to some of the stuff that I've worked on. I've worked on a bot before called Marsbot, and you type to it, and largely, it's getting yes-no answers. So I'm not too worried, although it does contain a lot of personal information, because it uses the locations that people are at. Basically, what it does is, when you walk into a cafe, it tells you exactly what to order. But it's not using a lot of the text that you type to it. On Foursquare, for example, we go through tons of reviews to pull out key terms and things like that. And we also pull out key terms from things that people write privately, which you have to be careful about. Usually, what we do is–and this is super smart, I'm actually proud of ourselves for doing this–if somebody on Foursquare types a private check-in and they're like, “Oh, I'm sitting out on the patio,” then we'll mark that venue as likely having outdoor seating. But you have to be very careful in case somebody is typing something that is personal to them and they don't want out there.

Patricia: Right. Absolutely. Yeah. 

Max: And also, I could see that anyone who's designing a chatbot wants it to have memory. That's one of the biggest problems, I think, with current personal assistants and stuff: my Amazon Echo or Siri has no idea what I just asked it. And a lot of times, when I want information or want something done, it requires a back-and-forth conversation. Right now there's maybe one or two steps, but not much more than that, because it's too complicated. They would have to store a lot of the stuff you say, and repeat back a lot of the stuff that you said. So you want to be cognizant of what you're storing and what you know, if you're building one of these applications.

Patricia: Absolutely. And in so many cases, you don't need the Personally Identifiable Information to get the data that you really want to perform the task.

Max: Gotcha. Yeah, for sure. And it's fun to read some of your examples, because it almost looks like a reverse Mad Libs situation: pick a name, pick a color. So have you thought about–and I'm sure you have–how would you attack your system? If I wanted to reverse engineer what was written... Have you thought about ways that people might try to do that? And how do you mitigate that?

Patricia: Yeah, we're constantly thinking about that. And that's one of the reasons why we don't say it's de-identification right off the bat; it's redaction. And then if you want to do de-identification, you do need human eyes on the data if you want to publish the data, for example. And that's because language is just so complex and complicated. If you wanted to have–

Max: –euphemisms or something that could get around the automated censors.

Patricia: Yeah, yeah. Things like that, or vaguer things like, “I was visiting a cottage by the lake north of where I live last week.” You could piece together different pieces of information to try to determine who this person might be. One thing that we do recommend our customers do, to keep that person from being re-identified, is to not link conversations together, if possible. But also, the way that we're looking at it is that one of the biggest problems is AI being trained to re-identify individuals, because you've got massive amounts of data. If you've got the data mostly de-identified, redacted to a very large extent, it's very unlikely for an AI to be able to figure out these ambiguities and then pinpoint you. A malicious actor who specifically wants to identify one individual that they know is in a data set; that is trickier, and you need human eyes to be aware of that. But it does mitigate a huge amount of the risk when you redact the Personally Identifiable Information, including the direct identifiers and quasi-identifiers.

Max: Right.

Patricia: And it's about risk mitigation. Privacy tech is never going to be perfect, just like security tech; you don't expect your antivirus to be perfect. But it does help a huge amount.

Max: Yeah, so I'm actually thinking about your cottage by the lake example. If that were mentioned here in New Hampshire, there are a lot of lakes up here, so any automated system would have no shot at identifying the person. But if you were crazy, you could be like, “Okay, let me get out a map, look at all of the lake cottages, look at the likelihood of each one, then correlate it with some other databases, and start narrowing it down.” It would be crazy expensive. But if an individual is targeted, it feels like it maybe could be done.

Patricia: Yes, yeah. Which is why, if you want to call something truly anonymized and you want to publish the data, you do need human eyes. But human eyes alone often aren't enough, because, for example, we're constantly seeing that our technology outperforms human accuracy in redaction. So a combination of AI and human feedback–human expert feedback–is required in order to make sure that a database is actually publishable.

Max: And you said also that the texts are kept separate. So it's not like New York is always changed to the same identifier or something, where you could work backwards like, “Well, why is this small town mentioned so much, or this one identifier mentioned so much? That must actually be a big city.”

Patricia: Yeah, yeah. It's non-deterministic. Yeah. 

Max: Okay. But if you mention the same… If you're doing one paragraph at a time, and somebody mentions the same city twice in the same paragraph, will it recognize that they mentioned the same one twice?

Patricia: Yes. 

Max: Okay. So you can get the context of what the person is saying, but you can't do it on a corpus-wide scale. And oftentimes… I've worked with corpora, a lot of text, and there's a lot of information you can glean when you look at the corpus-wide scale; something we want to do sometimes, and sometimes don't want to do. Hopefully what you're doing is making sure that we're not doing the things we don't want to do, and we do the things we want to do.

Patricia: Yeah, yeah, absolutely. We try to make it as flexible as possible for developers. So if they want to send it a paragraph at a time and get it pseudonymized or redacted a paragraph at a time, that's the context that we're taking. If they want longer than that, they can send longer texts through the API.
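Here is a minimal sketch of the consistency behavior just described, under the assumption that entity detection has already happened elsewhere: within one request, repeated mentions of the same entity map to the same placeholder, but the mapping is rebuilt for each request, so the same city is not linkable across a corpus.

```python
# Per-request pseudonymization: the mapping lives only for one call, so
# placeholders are consistent within a paragraph but not across a corpus.
def pseudonymize(text: str, entities: dict[str, str]) -> str:
    """entities maps surface form -> entity type, e.g. {"Boston": "CITY"}."""
    mapping: dict[str, str] = {}
    counters: dict[str, int] = {}
    for surface, etype in entities.items():
        counters[etype] = counters.get(etype, 0) + 1
        mapping[surface] = f"[{etype}_{counters[etype]}]"
    for surface, placeholder in mapping.items():
        text = text.replace(surface, placeholder)
    return text

para = "I flew from Boston to Denver, then back to Boston."
print(pseudonymize(para, {"Boston": "CITY", "Denver": "CITY"}))
# -> I flew from [CITY_1] to [CITY_2], then back to [CITY_1].
```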

Max: Cool. So... I don't know how much you can tell me, but can you give me a high-level overview? And keep in mind, this is not a 100% technical audience. But how does it work? What's going on under the hood when it's looking at the text and trying to redact it?

Patricia: Yeah, so we are using NLP models. I guess it's no secret that for redacting something like this, you do need named entity recognition to work. But getting the named entity recognition models to work to the extent that we have is quite tricky; none of the out-of-the-box models work as well as ours at the moment. You need a lot of data as well, and you need to optimize for speed. So we spent quite a bit of time optimizing our models for speed and to run directly on CPU. We can run 100 words in 42 milliseconds on one CPU core, and we're constantly improving those speeds.
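For readers who want to experiment, here is a hedged sketch of entity-based redaction using an off-the-shelf spaCy pipeline as a stand-in; Private AI's actual models are proprietary, and per this discussion, generic NER will not match their accuracy or speed. The timing harness just shows how a words-per-millisecond figure like the one discussed below could be measured.

```python
import time
import spacy

# Off-the-shelf NER as a stand-in for a production redaction model.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def redact(text: str) -> str:
    """Replace every detected entity span with its label in brackets."""
    doc = nlp(text)
    pieces, last = [], 0
    for ent in doc.ents:
        pieces.append(text[last:ent.start_char])
        pieces.append(f"[{ent.label_}]")
        last = ent.end_char
    pieces.append(text[last:])
    return "".join(pieces)

sample = "Patricia Thaine of Private AI spoke with Max Sklar in Toronto. " * 10
start = time.perf_counter()
redacted = redact(sample)
elapsed_ms = (time.perf_counter() - start) * 1000
print(redacted[:70])
print(f"{len(sample.split())} words in {elapsed_ms:.1f} ms on CPU")
```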

Max: I'll repeat that: 100 words in 42 milliseconds?

Patricia: Yeah, in one CPU core. 

Max: Milliseconds is pretty short. So that's a lot.

Patricia: Yeah, yeah. It's very fast, especially for CPU processing. And we collect a huge amount of data as well, across a variety of different languages. We currently support seven different languages: English, French, Spanish, Portuguese, Italian, and German, with Korean out in beta as well. And it takes a lot of very meticulous annotation. You'd be surprised how much time and money goes into the annotation process.

Max: Do these languages each have their own problems? From personal experience, each of these languages has problems that are not solved in the other languages. You might have one thing that's specific to Korean that you wouldn't have solved by working on the other six languages.

Patricia: Yeah, absolutely. For sure. 

Max: What was one of the most difficult parts about expanding to some of these languages? Which one is giving you trouble?

Patricia: A lot of it is gathering data and finding the right people to help us annotate the data. It's a lot of finding the right people who can follow instructions; it's building up the right instruction manual; it's constantly giving people feedback. It's just so much work. But I think all of them give us about the same amount of trouble. The ones that we don't speak directly give us a bit more, so we need to get more experts involved who do speak those languages when we're working on them. Korean, for example; I don't speak it, but we had to hire people who do to make sure that everything is tip-top.

Max: Yeah, yeah. Is that how you found out about the Korean love bot, or was that separate?

Patricia: No, that was separate. Yeah, I think I found that out through looking at privacy news.

Max: Okay, cool. Yeah. This really makes me think back to when I really lucked out doing sentiment analysis on Foursquare tips, where the training data was literally Foursquare users liking and disliking venues, and assuming that matched the sentiment of whatever tip they left. So I had training data for every language; there are like 97 models there, 97 different languages. Now, I can only do a few things with it: sentiment analysis, and spam analysis, because people mark things as spam. Sentiment is a little more balanced in terms of positive and negative examples, so it's a little bit easier. But it must be a lot harder when you have to get the data yourself.

Patricia: Yeah, yeah. Because it's also about getting the right balance of data. So we generalize really well across multiple different kinds of data and types of in–

Max: So you have different types of texts, too. I was dealing with a lot of text that was the same kind of stuff, like “I had the burger. It was great.” But you don't know what you're getting.

Patricia: Yeah. What we do, personally, is work off of the data that our customers share with us about what they're dealing with.

Max: Sure. But you have different types of customers.

Patricia: We do. Yeah, yeah. And we also need to plan ahead for how well our model can work out of the box in any given situation, and we're constantly adapting it. I think we've currently trained on over 60 different data sets, and we're constantly expanding that, because there are just so many different kinds. There's semi-structured data, and structured data that we can help with as well. In addition, there's conversational data, data with banking information, data with healthcare information. And how is that healthcare information structured? Is it in the form of clinical notes, or doctor's notes, or a conversation between patient and doctor? It just varies a lot. And there are so many intricacies between the different transcription systems. How are the digits actually written out? What kind of idiosyncrasies does each transcription system have that would make it difficult for our current model to pick up, and that we have to adapt for? Things like that.

Max: Can people use this if they're doing research or just want to play around? Because obviously, you're gonna have big clients who are gonna pay for it. Do you have a sandbox-type version?

Patricia: Yeah. They can try out our web demo on our website, which will give a pretty good idea of how it works. We don't currently provide sandbox access to the web demo model through REST API calls, but you can see how it works on the website.

Max: Okay, okay. Yeah. I was just curious about that. So you guys also do a little bit with photos, too. Is that correct?

Patricia: We do. Yeah. So I guess our main differentiator there is that we can determine what kind of PII is within the text in images.

Max: So okay. So it's like if I take a picture of my driver's license, which they ask for a lot now. A lot of different places ask for a picture of your driver's license. And if someone wants to run models off of those, you can try to redact the relevant information.

Patricia: Yeah. Or if you take a picture of a pill bottle, and you want to know what the medication is, but not what the patient's name is.

Max: And there could be stuff in the background of any casual photo. You don't know. 

Patricia: Yeah, exactly. 

Max: That's why every time I do the podcast, I've got to worry about what's behind me. When I do the video podcast, it's like, “What do I have out?” I know some people use fake backgrounds. I like to show where I live, but I also don't want to have too much. I'm also worried about what I'm putting in the background.

Patricia: Exactly. A lot of people put their university degrees in the background, for example. 

Max: Sure, sure. Which might work for the pod– I don't have my… I'm sure I have my university degree somewhere; it's not here. It might work for the podcast, but wouldn't work for any old call. So, all right, very cool. I'm going to keep this on my list of things to use, and I'm sure it'll come up. Whenever I hear of someone who needs to redact their data, I am going to point them to your software. So thanks a lot for coming on the show today and sharing this with us. Before we go, do you have any last thoughts on this? And where can people go to learn more?

Patricia: Yeah. I think there are a lot of misconceptions about privacy technology: that it's one versus another, and that either anonymization will solve all problems or homomorphic encryption will solve all problems. But it's not one versus the other. There are very specific use cases for which each technology is the right tech. And before you start thinking about which privacy technology to use, you have to really understand the problem that you're dealing with, what kind of outcomes you want, and what kind of privacy you want to guarantee for your individual users. So we do have a privacy-enhancing technologies decision tree on our website, but I'm also happy to talk to anybody who is having a hard time figuring out which one is the right one for them.

Max: Okay, great. So I will put all of the links that you send me, to your website and how to get in touch with you, on the show notes page for this episode. That would be localmaxradio.com/181. Patricia, thank you so much for coming on the show today.

Patricia: Thank you so much, Max. Such a pleasure to speak with you.

Max: All right. That was really fascinating. This is Episode 181, so the show notes page for this will be at localmaxradio.com/181. As we move along this summer, I have some new and bizarre-type topics coming up. I think next week, I want to talk about, for lack of a better word: setting up rules. No, that sounds boring. Nobody likes rules. How am I gonna describe this? I'm going to talk about... When you're setting up a system, whether it's social media and you want to figure out how people flag posts, or you're setting up a contract among many people, or a constitution; the rules that you set up affect how things are going to play out, and affect whether people cheat or not.

I want to talk a little bit about that from what I've learned, both from a technical perspective and from a historical perspective; some thoughts on that. Man, I don't know how to market that one. We'll see what I come up with. Another kind of interesting one that I want to talk to Aaron about is a philosophical view of digits. Yes, digits, like the number of digits. There are a lot of things about digits that you might not have considered, so definitely tune in for that one. And then we'll get back to the news and stuff. But every summer (not just last summer, but I'm thinking of last summer, when we talked about topology a lot) we have some interesting miscellaneous topics, and I hope to continue that. All right, so have a great week, everyone.

That's the show. To support The Local Maximum, sign up for exclusive content and our online community at maximum.locals.com. The Local Maximum is available wherever podcasts are found. If you want to keep up, remember to subscribe on your podcast app. Also, check out the website with show notes and additional materials at localmaxradio.com. If you want to contact me, the host, send an email to localmaxradio@gmail.com. Have a great week.
