
In-Ear Insights: Generative AI Data Management Best Practices

In this episode of In-Ear Insights, the Trust Insights podcast, Katie and Chris discuss generative AI data management best practices and how to leverage qualitative data for content marketing. You will learn how to extract valuable customer questions from various sources like webinars and forums using AI. Discover effective methods to prioritize this wealth of information based on your ideal customer profile and keyword strategy. Finally, understand how to use AI tools to quickly generate content by referencing your existing materials and repurposing previous work.

Watch the video here:


Can’t see anything? Watch it on YouTube here.

Listen to the audio here:

Download the MP3 audio here.


Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for listening to the episode.

Christopher Penn – 00:00
In this week’s In-Ear Insights, let’s talk about the stuff you’ve got laying around, specifically data. Katie, one of the things you’ve said time and again, which is 100% accurate, is that data isn’t just numbers. Data is any information you have laying around. One of the things we try to do for our own marketing is make use of the data we have, because we have so much of it. I’ve been collecting questions from webinars and things that we’ve done. Every time I’m speaking at a virtual event, I copy and paste the chat log, because there are a lot of people who have really good questions. As much as I would like to say every question gets followed up afterwards, that doesn’t always happen, particularly if it’s not our event.

Christopher Penn – 00:46
I’ve been looking at our back catalog, and our back catalog of questions left over from events is enormous. Hundreds and hundreds of questions. So, Katie, when you think about content marketing strategy and using AI to make use of the stuff we have laying around, why aren’t more companies, including us to some degree, doing more with this enormous goldmine of data?

Katie Robbert – 01:24
In my experience, the qualitative data feels daunting to try and analyze. With quantitative data, the numeric data, at least the way I’ve understood it from working with other data teams, it’s quote-unquote easier to put all of the numbers in a spreadsheet and get them to make sense. It’s not the same process for feedback and sentences and unstructured data. That’s the stuff we say we should totally collect, but we don’t know how to analyze, because it’s harder. Basically, math is a language, but math is a language that isn’t open to interpretation.

There’s no slang, there are no regional sayings within math. Two plus two equals four. That’s universal. It’s a universal language that anybody can put into an analysis program and get two plus two equals four. When you’re looking at qualitative data, it’s different. You could tell me something and I say, “That’s okay.” But that’s open to interpretation. I could be saying “That’s okay” as in, “It’s fine. It’s mediocre.” Or I could be saying “That’s okay” as in, “It’s amazing. It’s the greatest thing I’ve ever seen.” Or I could be saying “That’s okay” just to move on because I haven’t been paying attention. It’s really hard to find one single program to interpret language like that.

You and I could be saying the same thing, but we’re going to use different language to explain it. Or we could be using the same language but mean different things because of our own personal experience. I don’t know if I’m fully answering your question, but you’re right.

Katie Robbert – 04:40
We’re all sitting on a lot of qualitative data from social media, from emails, from forums, from blog responses, from other content. How do we put all of that together in a way that we can understand it?

Christopher Penn – 05:00
I think part of the issue, too, is just the sheer scale of processing, of what you do with the data. Here’s what I mean. I did a one-hour event with our colleagues at Libsyn, which was a lot of fun, and we got a bunch of questions during that event about AI and podcasting. John and I have done three episodes of the “So What?” livestream trying to tackle some of these questions, and we’ve still got about two-thirds of them unanswered. That’s because there are just so many. We can process and extract the data, but then we have to do something with it. This is the part where I certainly feel stuck. How do we create all this content at scale? They’re all good questions.

Christopher Penn – 05:48
They’re all questions like, “How can I protect my podcast content when using AI?” Or, “I’m suspicious of AI; how can I approach AI and podcasting safely and effectively?” There are obviously very tactical questions like, “What are some of the tools to do this, that, or the other thing?” This is one of now ten different files that I’ve extracted from questions we have laying around. I don’t think we have enough time in the day to create all this content.

Katie Robbert – 06:15
Well, it sounds like there are two different tracks here. Question one is: How do you get from “I have feedback all over the place on ten different channels” to a clean, single list of topics to tackle? Question two is: Great, now I have a list of topics; how do I create all that content so that I’m coming full circle on “You asked this question, I answered this question”? So that’s two different tracks.

Let’s start with the first one. Sure, you did an event, so you had the questions. But what if someone doesn’t have that file, and what they have is Reddit forums or social media conversations or Slack conversations, and they’re trying to extract that Voice of the Customer information to say, “What are the questions that my audience is asking?” How do they get that information into a clean list?

Christopher Penn – 07:28
I love Reddit forums. Reddit forums are the best.

Katie Robbert – 07:32
They scare the crap out of me.

Christopher Penn – 07:35
You’ve got to be able to download the data. If you have a Reddit developer API key, which is free to apply for, you can go and extract that data. You have to have some technical skills to write the code, or to ask ChatGPT to write the code, to extract that data. But let’s say you’ve got that part solved, more or less. You then have to take that data file and give it to a language model that can handle the size of that file, because these files are big. They’re really big. We’re getting ready to do our generative AI real estate webinar in a couple weeks, and I pulled two Reddit forums, and these things are each a million and a half words, and that’s just 90 days’ worth of discussion topics. A million and a half words.
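As a point of reference, here is a minimal sketch of what that extraction code might look like in Python using the PRAW library, one common wrapper for the Reddit API (Chris doesn’t name a specific tool). The credentials, subreddit, and limits are placeholders you would swap for your own.

```python
# Minimal sketch: dump a subreddit's recent posts and comments to a text file.
# Assumes you've registered a free app at reddit.com/prefs/apps;
# the credentials and subreddit below are placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="question-miner/0.1 by YOUR_REDDIT_USERNAME",
)

with open("subreddit_dump.txt", "w", encoding="utf-8") as out:
    # PRAW's time filters are coarse; "month" is the closest built-in window.
    for post in reddit.subreddit("realestate").top(time_filter="month", limit=200):
        out.write(f"POST: {post.title}\n{post.selftext}\n")
        post.comments.replace_more(limit=0)  # expand "load more comments" stubs
        for comment in post.comments.list():
            out.write(f"COMMENT: {comment.body}\n")
        out.write("---\n")
```

From there, the resulting text file goes to a long-context model, as Chris describes next.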

So a model like Google’s Gemini can handle that much content. Then you need to have a prompt, some way of processing that. I can show you a quick example of the one that I built for dealing with just webinars. I said, “I want you to build some system instructions for handling these questions.” As the user, I have to provide you the questions. I also have to provide the context so that the machine understands ambiguous questions. Like, “What do we do with it?” That’s literally a question somebody asked, “What do we do with it?” Within the context of AI and podcasting, it was then able to interpret that.

Once you have that, you give it some potential questions, and you give it a bunch of rules: how do you recognize a question, particularly when people think punctuation is optional these days? How do you deal with all that? Then you extract the information: “Here are the rules for what I want you to pull out.” You give it some rules for handling thematic duplicates, so you don’t end up with a list of 18 questions that are all pretty much the same thing. Then it spits out an output. That process gets you something that looks like this, which is much nicer than the raw file. The raw file is almost unintelligible. So that’s step one.
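To make that concrete, here is a minimal sketch of the extraction step using Google’s Gemini API, which Chris mentions for its long context window. The system instructions here are an illustrative condensation, not the actual ones Chris built, and the model and file names are placeholders.

```python
# Minimal sketch: extract and de-duplicate audience questions from a chat log.
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")

SYSTEM_INSTRUCTIONS = """You extract audience questions from raw webinar chat logs.
- Treat any request for information as a question, even without a question mark.
- Use the event context the user supplies to resolve ambiguous questions.
- Merge thematic duplicates into one representative question.
Output a numbered list of cleaned-up questions."""

model = genai.GenerativeModel("gemini-1.5-pro", system_instruction=SYSTEM_INSTRUCTIONS)

with open("webinar_chat_log.txt", encoding="utf-8") as f:
    chat_log = f.read()

response = model.generate_content(
    "Context: a one-hour webinar about AI and podcasting.\n\nChat log:\n" + chat_log
)
print(response.text)
```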

Katie Robbert – 09:41
Yeah, and I think it’s getting through step one that holds a lot of people up. Aside from having the resources to create the content, it’s “How do I get to the point of knowing what content to create?” Let me ask you this. Social listening tools pull data from a variety of platforms, and a lot of times what they give you back are word clouds. Essentially, very high level: here are the words that are most prominent. I feel like that might be where a lot of companies are looking to figure out what people are talking about. Obviously, the better option is the one you just walked through, but for companies that are using social listening tools to gather this information, how can they better use that data?

Katie Robbert – 10:39
Because that’s what we’re talking about. We’re talking about that qualitative, conversational feedback, mining that to create your content strategy. Like, what can they do better if that’s the tool they have?

Christopher Penn – 10:55
Any social listening tool worth its salt should have an export button, to be able to say, “Okay, export all of this as a data file.” Then you can open that up in Excel, you can open that up in spreadsheet software. Just look for the column that contains the unstructured words, copy that column to a text file, and that is good enough to start doing the analysis, to start processing it with a tool like Gemini or ChatGPT or Anthropic Claude. That’s where I would start. Same for your customer service inbox. If you have an email program like Thunderbird, for example, that can export a mailbox as just a big old text file, you push that button, dump it out, and there you go.

I mean, that’s—we actually do that for one of our clients. I will take my inbox just for that client, dump it in a secure LLM and say, “Just tell me what I did this past month for this client” so that we can put it in the report, rather than me having to go back through every email and say, “Okay, well, this is what we did.” The machines can do that. The export button is probably the most valuable button there, rather than using the built-in analysis tools, which are never going to be as good.
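If your tool exports to CSV, that copy-and-paste step can also be a few lines of Python. A minimal sketch, assuming a column named message_text holds the posts (your export’s file and column names will differ):

```python
# Minimal sketch: pull the unstructured-text column out of a social listening
# export for feeding to an LLM. File and column names are placeholders.
import pandas as pd

df = pd.read_csv("social_listening_export.csv")
df["message_text"].dropna().to_csv("mentions.txt", index=False, header=False)
```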

Katie Robbert – 12:14
All right, so let’s say I have my file, I have my questions, I have my topics. Now what you’re saying is there aren’t enough hours in the day to create this content. When you say that to me, my first thought is: Okay, so we need to prioritize it. How do we prioritize it? Well, we need to do some work on our side to say, what do we know best? What things do we actually want to be known for? Because you’re right, we could answer every question, but that takes a lot of resources. Do we need to answer every question? Probably not. There are probably some questions that can be combined. So how do we start to prioritize and tackle this, from your viewpoint?

Christopher Penn – 13:04
Again, I would personally use generative AI for this—huge surprise. I would take something like our ideal customer profile and I would say, “Here’s our ideal customer profile. Here’s a list of questions that a general audience has asked. Rank and sort this list of questions by which ones would be most valuable to our ideal customer profile.” That’s how I would approach it. How about you?

Katie Robbert – 13:25
I would add to that our keyword list of what we want to be known for. We can’t forget that we need to be doing basic SEO work. Topics and keywords and all of that good stuff. I would say starting with our ideal customer profile is definitely a good place to start, and then also including—since we’re talking about our content strategy—also including anything we have in terms of our SEO strategy. Here are the topics that we want to be known for, here are services that we provide, and sort of starting to do all of that matching to say: “Great.”

Katie Robbert – 14:10
“Based on these top three keywords that you want, based on your services, based on the pain points of your ICP, here are the six questions you need to answer this month.” That to me is a useful output because it cuts down the list of, like, 30 questions to just a few. It’s like, okay, I can answer those six questions probably in 500 words or less and put that up on our website and then on social.
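Put together, the prioritization pass Katie and Chris describe might look something like this sketch; the file names are placeholders and the prompt is a paraphrase of their approach, not a tested template.

```python
# Minimal sketch: rank extracted questions against an ICP and keyword list.
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

icp = open("ideal_customer_profile.txt", encoding="utf-8").read()
keywords = open("target_keywords.txt", encoding="utf-8").read()
questions = open("extracted_questions.txt", encoding="utf-8").read()

prompt = f"""Here is our ideal customer profile:
{icp}

Here are the keywords, topics, and services we want to be known for:
{keywords}

Here is a list of questions a general audience has asked:
{questions}

Rank and sort the questions by how valuable an answer would be to our ideal
customer profile and how well it matches our keywords. Return the top six
questions we should answer this month, with a one-line rationale for each."""

print(model.generate_content(prompt).text)
```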

Christopher Penn – 14:39
Another option, and this is something you have to be very careful with, is that if you’ve already got a large corpus of content you’ve created, you can put that into a language model and say, “Here’s what we’ve already done. For these questions that we haven’t answered yet, infer the answers.” If you have three or four years of content, hundreds of YouTube shows and podcasts and things, there’s a good chance there’s enough knowledge there, from your unique point of view, that a language model could infer a correct answer. Then it’s just up to you to proofread it to make sure it is in fact correct.

Katie Robbert – 15:19
Well, yeah, that’s the big caveat. You, the human, still have to edit. I’ve found, and maybe it’s just a matter of how much information the large language models have, that even with the ChatGPT version, it still needs heavy editing because there’s a lot of sameness. Same with our ICP GPT model. I tend to get the same things over and over again. To me, that’s just not overly helpful, because I, the human Katie, would not necessarily say the exact same thing over and over again to answer ten different questions. That, to me, is the sticking point of using generative AI to answer the questions.

I think it’s appropriate to use generative AI up until the point you’re generating the content, but I personally feel like then the human should be the one generating the responses. You can then give that information back to the model to increase its knowledge base, but you’re still going to run into the sameness issue.

Christopher Penn – 16:34
We may have a fix for that. If you’re curious about that, you’re going to want to tune into our livestream on the Trust Insights AI YouTube channel, Thursdays at 1:00 PM Eastern. Of course, the replays are always available on our YouTube channel. We’re going to be talking about rebuilding KatieGPT to be better, faster, smarter, and all those things. I will say that I’ve been doing a lot of experimentation with the generation of text from existing knowledge, and those are instances where it does a much better job.

I’m currently writing the second version of my woefully incomplete guide to generative AI, and what I’ve done with that is taken two years of YouTube videos, a year and a half of newsletters, etcetera, and I’ve told the model, “Instead of trying to write net new stuff, I want you to literally plagiarize my exact words from my keynotes and things. Just steal literally word for word wherever possible and just use your language capabilities to sew together the grammar.” It’s doing a fantastic job because all it has to do is essentially copy and paste as opposed to try and create something net new. It definitely sounds more like me. It uses the exact case studies I use in talks and things. So I think this is another angle when it comes to the data you have laying around. If you’ve got a lot of data laying around, models may be able to very easily assemble that. I think I told you about this last week, but you were on holiday.

Christopher Penn – 18:14
I built a version of NotebookLM that just has all your newsletter stuff in it, and so it will only spit back things you’ve said. It will not write anything new. There’s a historical Katie in that tool that you can ask, “Well, what happened with this?” or “How is this?” Anytime you want to ask about the five Ps, it has a lot to say. Anytime you want to ask about something you’ve never written about, like account-based experience, it says, “I know nothing about that. There’s nothing in the training data you gave me.” So I think that’s another great option for using the data you already have.

Katie Robbert – 18:47
This is all assuming that teams know how to build these things, that they have the technical skills to put these language models together. Is this a technical project or is this something that someone with my technical skillset could handle?

Christopher Penn – 19:10
For using a tool like NotebookLM, you would definitely have no trouble doing that. The hard part for you would be getting the data together. Data governance matters a lot. Knowing where your data is matters a lot. Knowing what format it’s in matters a lot. If you can get over those hurdles, NotebookLM is just drop the files and talk to it.

Katie Robbert – 19:32
It’s funny how new tools don’t solve old problems. I heard that somewhere sometime. You’re absolutely right. It all comes back to data governance. None of these cool, shiny objects that can help expedite your analysis matter if you don’t have good data governance to begin with. If you don’t know where your data lives, who has access to it, what format it’s in, what it’s collecting, the questions that it’s answering, how often it’s collected, the last time it was maintained—any of those things—none of these tools are going to help you.

Christopher Penn – 20:15
Exactly.

Katie Robbert – 20:15
That is the not-so-secret to being successful with any analysis tool, generative AI or something that’s more manual. It doesn’t matter if you don’t have good data governance.

Christopher Penn – 20:33
Exactly. This is a simple example of NotebookLM. This is the Trust Insights newsletter, just 2024. I literally just dropped PDFs in, and I can say, “What has Katie Robbert said about the five Ps?” It will only reference things that are in the dataset we provided. The advantage of this particular tool is that it is much less prone to hallucination, because it’s been given very specific rules that it may only rely on the data it has in here. It says, “Katie, who created the five P framework, encourages people to use the framework to get organized.” What’s really nice is you can then tap on any of these things and ask, “Where did this come from?”

It’ll show you the exact newsletter it pulled that answer from. It’s a very simple tool, no sophisticated prompting needed, because it’s just a Q&A tool. It doesn’t do a great job of generating new content, because it’s literally just parroting everything you said. But it does a good job of looking up what you’ve done, and if you wanted to create things like study guides, you could certainly do that. It will create a study guide for you, and there’s a briefing document of what the Trust Insights newsletter says about using data.

Katie Robbert – 21:52
I find it unironic that I’ve written about data governance 14 times this year.

Christopher Penn – 21:59
Huh.

Katie Robbert – 22:00
It’s almost like it’s important.

Christopher Penn – 22:03
Exactly, exactly. But that to me is a really good use of the data you have. If you’re unsure how to get started, even use a highly restricted tool like this and say, “Okay, what have we already done?”

Katie Robbert – 22:23
To go back to your question, I feel like this isn’t writing for you; this is really just mining the data that you have. When you get into the “Well, we have hundreds of questions to answer; I don’t think we have the resources,” rather than thinking about it like, “Let me ask generative AI what to write for me,” if you have all of those assets, like the YouTube transcripts, the blog posts, all of those pieces of content, what you just showed with NotebookLM is a really good option, because it’s telling you what you’ve already written about those things and you’re just repurposing the existing content. Someone says at a very high level, “Katie, what are the five Ps?”

Katie Robbert – 23:18
I could say, “Here’s everything you need to know about the five Ps,” but I’m not having generative AI write it for me. It’s literally just extracting the information that I, myself, the human, have already written. That sits better with me than asking generative AI to write it, because it’s literally just pulling what I’ve already written. There’s a nuance there, but I think it’s an important one.

Christopher Penn – 23:46
Yes, it’s a very important nuance from a legal perspective because—and again, we’re not lawyers, so please contact an attorney if you have questions about your specific situation or need actual legal advice. We cannot give legal advice. My understanding is that when you use generative AI tools with existing data and it’s making a derivative work—you can very clearly see, “This chapter of my book is literally right from my keynotes”—because it’s a derivative work and not a transformative work, the copyright is retained. Even though a machine created that chapter, because it is so clearly and obviously derived from my original words, it would pass a sniff test in a court of law to say, “Yeah, that’s clearly Chris’s writing,” so it retains its copyright.

If you are using your existing content and your existing data to create new stuff derived from your existing stuff, you retain the copyright for it.

Katie Robbert – 24:58
My brain is trying to figure out where to go with that because I have a lot of questions, but I also feel like that in and of itself could be a different episode. Okay, so for marketers who are wanting to put together a more focused content strategy based on what their audience actually cares about—that’s everybody’s goal, it’s just a matter of how do you get from that goal to actual execution—the plan that we’re outlining is saying: “You can use all of that unstructured data that you collect, use generative AI to structure it and give you a list of questions to answer, and then answer them. Or, bring it into something like NotebookLM to say, ‘This is all the content I’ve written previously on this. What have I said?'”

Katie Robbert – 25:49
“How do I answer these questions?” and start making those matchups. I could see giving that to your writing team, your content team, saying, “All right, here are the questions. Here are the responses. Make them more robust.” You’re not asking them to do the research to figure it out, because that’s the problem at a lot of companies: they’re trying to have this diverse expertise on everything, and it’s the research that takes a lot of time. If your content team is working on, “Okay, we just landed a brand new client and they do medical devices. I know we’ve never done medical devices before, but we owe them ten blogs by Friday,” the research is what takes a lot of time. If you can help expedite that part of it, then you can create more thoughtful content more efficiently, still with that human touch.

Christopher Penn – 26:50
Exactly. This is a case where you’re using generative AI in two spots. One, to structure the data itself, which is one of the best use cases for it, because you’re just taking data and organizing it. I would say that if you use generative AI for nothing else, use it for that, because it’s a slam dunk. Two, using generative AI to enhance, clean up, or reference the data you have so that you can provide that content quickly, especially after something like a webinar or an event where you want people to stay engaged. You can have these tools pull from your existing knowledge, because very few of these questions are questions that have never been asked before.

“How do I use AI to improve my SEO keyword list?” That question has been answered a gazillion times. You can very quickly respond. I think those are good use cases for this to deal with the unstructured data and to ultimately create more satisfying outcomes for the audience that invested the time showing up for your thing.

Katie Robbert – 27:50
All right, so we’re going to start doing this, Chris?

Christopher Penn – 27:53
We will start doing this. I’m going to have AI do it all.

Katie Robbert – 27:58
So you weren’t listening at all during this episode?

Christopher Penn – 28:02
Do I ever?

Katie Robbert – 28:03
No.

Christopher Penn – 28:06
If you’ve got some stories of your own to share about how you’re using generative AI to process all the data you already have laying around, pop on over to our free Slack group. Go to trustinsights.ai/analyticsformarketers, where you and over 3,500 other marketers are asking and answering each other’s questions every single day on analytics, data science, and AI. Wherever it is you watch or listen to this show, if there’s a channel you’d rather have it on instead, go to trustinsights.ai/tipodcast, where you can find us in most places podcasts are served. Thanks for tuning in, and we’ll talk to you next time.


Need help with your marketing AI and analytics?

You might also enjoy:

Get unique data, analysis, and perspectives on analytics, insights, machine learning, marketing, and AI in the weekly Trust Insights newsletter, INBOX INSIGHTS. Subscribe now for free; new issues every Wednesday!

Click here to subscribe now »

Want to learn more about data, analytics, and insights? Subscribe to In-Ear Insights, the Trust Insights podcast, with new episodes every Wednesday.
