So What? Marketing Analytics and Insights Live
airs every Thursday at 1 pm EST.
You can watch on YouTube Live. Be sure to subscribe and follow so you never miss an episode!
In this episode of So What? The Trust Insights weekly livestream, you’ll learn how to prepare your data for AI so that it can be properly understood by large language models (LLMs). You’ll discover the importance of structuring your data effectively and why simply handing a spreadsheet to an LLM is unlikely to yield meaningful results. You’ll also explore practical tips and tools for transforming raw data into a format suitable for AI processing, including converting CSV files to markdown tables, and why that is important. This episode offers valuable insights for anyone looking to leverage AI for data analysis and reporting, setting you up for success with AI-driven insights.
Watch the video here:
Can’t see anything? Watch it on YouTube here.
In this episode you’ll learn:
- What formats generative AI tools prefer for your data
- What free tools you can use to prepare your data for AI
- Best practices for preparing your data for AI
Transcript:
What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for listening to the episode.
Katie Robbert – 00:26
Well, hey everyone! Happy Thursday. Welcome to So What? The Marketing Analytics and Insights Live show. I’m Katie, joined by Chris and John. Good luck with your high five this week. Will you guys be waking up to look at the eclipse over the blood moon tonight?
Christopher Penn – 00:45
Didn’t even know there was one.
John Wall – 00:47
Sure as my name is Vishwanath McKinnon.
Katie Robbert – 00:51
Join our free Slack group, Analytics for Marketers, to get that joke—you can go to TrustInsights.ai/analytics-for-marketers if you would like to know what that joke is about. This week we are talking about prepping your data for AI. So Chris, this comes up a lot, usually in the context of, “Well, I gave AI my data, but it didn’t work. Why not?” We were actually just talking about it separately for one of our clients. They have a spreadsheet of information that they were hoping a custom model could look at and pull information from, to give recommendations. You said it’s not language. I kept looking at you like you were nuts. I was like, “I don’t know what that means.” So Chris, where would you like to start this week for prepping your data for AI?
Christopher Penn – 01:47
Let’s talk about oatmeal. Believe it or not, this is relevant, and we’re going to talk about why. I think we’re going to start off by talking about that very peculiar thing that I just said, which is that something is not language. Some forms of data are not language. They look like language, they’re shaped like language, but they’re not actual language. Let’s take a look at an example. This is from a clinical study on oatmeal porridge’s impact on microflora-associated characteristics in healthy subjects. We’ve got a lovely table of data here on SCFA excretion in millimoles per 72 hours, and proportions, in healthy subjects before and after eating oatmeal porridge daily for a week. Essentially, this is all about how oatmeal is good for you. This looks like language. There are words, there are numbers, and stuff like that.
Christopher Penn – 02:49
You can read it. But what we forget is that, as humans working with data, we have to do a lot of inference that is not obvious to an AI model. You can see at the top here, there are these two column headings: before and after. Then there’s the statistical mean, the average, then the standard deviation, and the P value. It looks like language, but it’s not. This is all numbers. This is all math. Even though the conclusions are baked in here, this is not data that generative AI could work with in its current form. Because look at each of the rows—here is acetic acid, and then the total SCFA concentration, propionic, and so on and so forth. You can see how you, as a human, can read this.
Christopher Penn – 03:36
You’re like, “Okay, I get it. The first row is the chemical, and the second row is the concentration.” We think about that as humans. That’s just how we read, just like we can see this is the before and the after. “Okay, there’s four columns, two columns before, two columns after.” But we are making symbolic assumptions about how the data is laid out. Generative AI can’t do that. When generative AI looks at this, it does not see what’s on the screen. It sees this all kind of mashed together. As a result, what you end up with is it can make bad assumptions. It might say, “Hey, it looks like you’ve got a bunch of duplicate labels. That seems bad.”
Christopher Penn – 04:20
So maybe it will just interpret one of those rows. Maybe you gave it a short prompt saying, “Summarize this data.” It’ll just say, “Okay, well, I see a lot of duplicates, so I’m just going to pick one.” You can see how, even though this looks like language, it’s really math. Math is not written language. Mathematics does not work in the way that written language works. We can’t hand this to AI and assume it’s going to read it correctly. We have to prepare it for AI so that it is obvious—in a line by line, column by column table—what it is that we’re talking about, what it is that this actually represents.
Katie Robbert – 05:01
This reminds me of the classic problem we all have with reporting and charts and graphs. It’s the… what I mean by that is the struggle that we run into is we’re asked, at the end of every month, to put together a report of how things performed. We take all the data, we put it into tables and pie charts and line graphs and all that stuff, and we say, “Here you go, here’s what happened.” The person you give it to says, “I don’t know what this means. This is just a bunch of graphs. This is just a bunch of numbers. What am I supposed to do with this?” It sounds like this is, for all intents and purposes, pretty much the exact same problem. New tech doesn’t solve old problems.
Katie Robbert – 05:50
We have to be better about communicating what the data actually means. It’s not enough to hand somebody a table and say, “This is what happened.”
Christopher Penn – 05:59
Exactly right. Exactly right.
Katie Robbert – 06:02
I’m taking the win.
Christopher Penn – 06:04
Exactly. If you take the win, here’s an easy way to determine how generative AI is going to do something. If I were to select this general region of text from this PDF and paste it into a text file, what do I see? A mess. This is a hot mess in here. There’s stuff all over the place. Things are not where they’re supposed to be. This doesn’t make a whole lot of sense. This is what generative AI sees if you were to load that PDF in because, what it does behind the scenes, it transforms everything into tokens and then tries to predict those tokens. That’s fundamentally what a language model does. So to work with this data well, and to be able to have AI—and frankly, some humans—work with it well, we need to prepare it. We need to get it ready.
Christopher Penn – 06:51
You can use generative AI to do that, particularly today’s state-of-the-art foundation models. They’re very good at this sort of thing. By the way, if you are interested, join the Analytics for Marketers Slack because we’re going to be announcing… what we’re talking about today is part of our next mini-course on generative AI use cases. This is an extraction use case because we want to get this data out of this PDF and make it available to us. It’s funny we mentioned this because I was literally doing this before lunch today. Here’s how we would go about taking this table out of here. There’s a variety of different ways you can export, or you can just hand generative AI the PDF itself.
Christopher Penn – 07:34
I’m going to put the PDF… I’m using Google’s Gemini, but you can use the AI model of your choice. It does not matter which one you use, as long as it can read documents. I’m going to put the PDF there, and I’m going to give it a four-part prompt. This is going to be using a modified version of the Trust Insights RAPPEL framework, which you can find at TrustInsights.ai/RAPPEL. We’re going to first start off by saying you’re a data analyst and data management expert skilled at tabular data and rectangular data. Rectangular data is sort of the secret. It’s not a secret. The catchphrase there, rectangular data, means that the data is in the form of a rectangle.
Christopher Penn – 08:21
So it has rows and columns that are consistent throughout—the same number of headers, the same number of columns throughout the table. Same number of rows throughout the table. What you see in this PDF is not that. If we go back to this, this PDF here, this is not rectangular. Stuff is indented, there’s headers sticking out where there’s no other headers up here. That’s wrong, that’s incorrect. That is not a rectangular-shaped piece of data. Second step, we’re going to say the action you’re going to take. You’re going to extract. We give it an overview of the task: you’re going to extract Table One. Here’s another little hint. This goes back to what you were saying, Katie—good communication. I’m not saying extract the table from the PDF. That’s a disaster waiting to happen.
Christopher Penn – 09:10
I’m saying Table One is the SCFA excretion data. I’m telling it by name and number where in the document to find this.
Katie Robbert – 09:19
Just good practices for communication. We often equate it to delegating to a person. In an academic paper, you can make the assumption that there’s more than one table. If you say to the person you’re delegating to, “extract the table,” they will likely get it wrong if you don’t specify which one. So this is very much the same. Just telling the machine exactly what you want makes you less likely to be disappointed in the outcome.
Christopher Penn – 09:45
Exactly. Let’s continue on. The priming step here is we’re going to say, “Refer to the PDF”—that is, by name. Again, Katie, if I upload two, three, or four PDFs, it’s like handing it to the intern saying, “Intern, get this table out of the PDF.” They’re like, “Which one?”
Katie Robbert – 10:04
Right.
Christopher Penn – 10:05
There’s a lot of Table Ones throughout this pile. Finally, because the document source itself is the priming data, we don’t need to provide extra stuff. Normally, with the priming step in generative AI, we would say things like, “What do you know about best practices for this? What are things people commonly do wrong?” and so forth. That’s what’s called generated knowledge. In this case, because this is an extraction task, we don’t need to do that because we’re giving it the information. Instead, we’re going to give it a prompt that says “extract Table One.” Here’s how: validate the column names to make sure you know what they are and what the row names are; denormalize the data—another one of those little buzzwords—so that it’s a rectangular data frame.
Christopher Penn – 10:48
Produce the data in a markdown table, which is what we would want to use for generative AI. Or you could say, “Produce it in a comma-separated table,” if you want to take it out of here and do something with it in Excel. If I hit run, what the model is going to do is it’s going to read through everything. You’ll note that part of the process is I say, “You got to read the document first. Don’t just do the task, review it and tell me that you understand what I’m asking of you.” Again, good delegation.
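For readers who want to adapt this, here is a minimal sketch of that four-part extraction prompt as a reusable Python string. The wording is a paraphrase of what is described above, not the exact prompt from the episode, and the file name `oatmeal_study.pdf` is a hypothetical placeholder.

```python
# A minimal sketch of the extraction prompt described in the episode, stored as a
# Python string so it can be reused or templated. Wording and the file name
# "oatmeal_study.pdf" are illustrative, not the exact prompt used on the show.
EXTRACTION_PROMPT = """
Role: You are a data analyst and data management expert skilled at tabular,
rectangular data.

Action: You are going to extract Table 1, the SCFA excretion data, from the
attached document.

Prime: Refer to the attached PDF, oatmeal_study.pdf. Read the document first and
confirm that you understand the task before doing anything else.

Prompt: Extract Table 1. Validate the column names and the row names, denormalize
the data so it forms a rectangular data frame, and produce the result as a
Markdown table.
"""

print(EXTRACTION_PROMPT)
```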
Katie Robbert – 11:19
Good delegation. You, as the human, should also know what’s in the document and not just be feeding it. My question for you, Chris, is: so those last five instructions—”validate the column names, validate the row names”—that seems to make sense to me. You wouldn’t need to be an analyst to know to ask that. The last three, I feel like you do need to have a good handle on data analytics as a skill set in order to know to ask those questions. “Produce the data in a comma-separated table. Denormalize the data so that it’s a rectangular data frame.” Those are not common phrases that we think of when we think of data analytics, or “Hey, Katie, can you pull the data and put it into a report?”
Katie Robbert – 12:14
My brain doesn’t instantly go to, “Okay, first I need to produce the data in a comma-separated file.” When I think back to my SPSS days, yes, I understand all of that, but not everybody has that background. So where does one… what non-AI skill set does this kind of fall under?
Christopher Penn – 12:34
This is all data analysis. Data analysis skills to say, “Yeah, when you open up a spreadsheet in Excel and you look at it, can you accurately tell… ‘yeah, this is not good. This is not going to help.’ Or can you say, ‘Okay, here’s what I need to do. Oh, there is, there’s stuff that’s happening here.'”
Katie Robbert – 12:56
The reason I asked that question is one of the things we covered in the newsletter this week was the question that always comes up: “Will AI take my job?” The answer is, it’s going to take certain aspects and tasks of the job. But I look at this, and I’m like, “Well, a data analyst is safe because not everybody knows that these are the commands of analyzing data in a specific way.” The advice that we gave was, “You really still need to understand the foundational pieces of the role of how these things work.” That’s not going away. You can’t just check out and say the machine is going to do it. You’re going to get terrible results.
Katie Robbert – 13:41
But you still need to understand what goes into data analytics in order to do this kind of processing that you’re talking about. I just wanted to sort of sidestep and comment on that. That’s the question we get a lot, and this is demonstrating why. No, you still have to be a subject matter expert.
Christopher Penn – 14:00
In fact, subject matter experts benefit most from generative AI because subject matter experts know what’s going to go wrong. When you look at that PDF, if you are not an expert in data analysis, you might look at that table and go, “Yeah, that’s fine.” But if you’re an expert in data analysis, you look at that going, “That’s not fine.” When the model processes it, look what it’s done. It says “before mean, before standard deviation, after mean, after standard deviation.” Now it’s a rectangle. There’s your P value, and in each row, there’s no indent anymore. Now it says “acetic acid, excretion” and “acetic acid, total SCFA.” This is now a rectangular table. This is now prepared for generative AI. What this allows you to do is I can take this markdown out of here. Let me put it into a text document.
Christopher Penn – 14:50
You can see in this section of the text document, there is the table. Generative AI models can read this. They can read this and they can understand what it is that they are looking at. From a prompting perspective, they don’t have to guess what this peculiar column means, or what this row means. If I were to take from the original document… let’s just put it back on screen here. If I were to take just one line like this, copy and paste it, do you know the context of that line?
Katie Robbert – 15:27
Oh, absolutely not. Because there’s no headers, there’s no column names.
Christopher Penn – 15:32
There’s nothing there. If I take this here and I paste this, you don’t have the header names for the columns, but you at least know what the row is: it’s valeric acid, and it’s an SCFA. So it is more descriptive. We’ve put more language in the table itself to help the AI model navigate it. Now, if I wanted to use this in a prompt, I could take this whole thing and say, “Here is the table.” In a new prompt, I could say, “What conclusions could you draw from this table?” If I did that, I’d say, “You’re a health expert with a focus on the benefits of oatmeal. Think through what you know about this topic. What are common myths and misconceptions about the health benefits of oatmeal?”
Christopher Penn – 16:46
We did a quick role and a bit of priming, and we’re going to let the model come up with conclusions from this clinical study data about SCFA and oatmeal: “What can you conclude just from this table?” Now, this is not enough data to work with. However, there’s enough of the proper language in it that at least it should be able to infer short-chain fatty acid, see the numbers, and see the statistical relevance. It was able to read this markdown version of the table and understand it. So part of preparing data for AI is understanding what formats AI works best with. Generative language models—like ChatGPT’s model family, Gemini, and Claude—generally speaking, can read plaintext really well. They can read markdown, which is a markup language, really well.
Christopher Penn – 18:07
It’s one of the best languages they can read. They can read YAML, an abbreviation for “Yet Another Markup Language,” really well. They can read PDFs okay, but not great, because there’s a lot of extra junk inside PDFs that can obscure what they understand about them, especially if it’s a badly-made PDF. They can read Office documents very poorly. They can read spreadsheets extremely poorly. Then there’s a whole bunch of document types they just can’t read. They can read code very well. They can read JSON—JavaScript Object Notation—very well. When we’re talking about preparing your data for AI, you want to try to get it out of whatever random format it’s in and into one of those preferred formats.
Katie Robbert – 19:01
It’s interesting because I remember a specific example. We were looking at our Google Analytics 4 data. Rather than export the CSV of the data, you said, “Just take a screenshot of it and upload that.” Is that because if we had uploaded the CSV file, the model would have said, “I don’t really know what to do with all these numbers,” but with a screenshot it is more understandable to the model?
Christopher Penn – 19:29
Depending on the model you’re using, yes. Because the Vision Language model, the multimodal part of the model, has seen a lot of those examples and can, in some cases, infer better than from a CSV, especially ChatGPT. ChatGPT tends to… when you give it a CSV file, it tends to say, “Hey, this looks like math. It’s time for me to start writing code.” It starts writing code, and you’re like, “No, I didn’t ask you to write code. Just tell me what’s what.” It is very prone to that because OpenAI recognizes a lot of the time people will give it a spreadsheet and ask it to do math, and generative AI models cannot do math. Fundamentally, it’s beyond what their architecture is capable of doing. So the hack that a lot of companies use—they write code in the background.
Christopher Penn – 20:20
That can go pretty badly sometimes. It’s much better to have it write the code publicly in front of you, and then you go run it.
Katie Robbert – 20:33
It’s still one of those things… I understand logically what you’re saying, but it still kind of baffles me that numbers and math are not considered a language, and that when we give it a spreadsheet, we’re not actually asking it to do math. We’re looking for it to just pull some insights out of it. But the mechanics of it are like, “No, you’re asking… I have to do math in order to do what you’re asking.” John, as our chief statistician and the person who is most well-versed in two or three standard deviations of where Chris and I sit, what is your take on this? Is this surprising information to you, or is this like, “Oh yeah, that’s how AI works.”
John Wall – 21:16
We’ve talked about this stuff because when we were first rolling out some stuff, I was pestering Chris, being like, “Look, why doesn’t it do this? Why doesn’t it do that?” He’s like, “Yeah, that’s just not how this works. You’ve got this wrong.” It is interesting to me, though, that I think the code hack of writing code to solve things… but I don’t know… Chris, what’s your gut on this? I feel like we’re just a year out from somebody doing some kind of multimodal thing that can do all kinds of math. It’s not like computers can’t do math, so it’s just a matter of crossing over.
Christopher Penn – 21:47
It’s not that it’s crossing over. The Transformers architecture that powers Generative AI will never be able to do math in the same way that, no matter how much money and how much time and how much effort you spend, a blender will never ever make a good steak. It’s just fundamentally, that’s not what it’s for. You will never, ever make as finely blended a soup with a frying pan. It’s just not what it does. Understanding the Transformers architecture… it just can’t do that.
Christopher Penn – 22:15
This means that you will… if you stop trying to coerce it to do something it’s bad at and instead say, “How can we get a series of pieces and an assembly system where generative AI—a language model—is part of it, but so is a code part, so is a tool handling part where the model knows… ‘Huh, I should search the web for that. Don’t write a search engine.'” Say, “Someone, the user, has provided me an API to a search engine. I can go and talk to that and say, ‘Search engine thing, give me this thing,’ and it will come back.” There was a fantastic story this past week about a new Chinese AI agent system called Manuscript. It was being positioned as the next hottest thing since sliced bread.
Christopher Penn – 23:00
The developers, the open-source community, pulled it apart and said, “This is nothing but a set of Claude prompts and 29 different tools,” like a web search tool, a calculator, and stuff like that. “Why… what’s so special about this?” Well, what’s special about it is, instead of having one thing try to do everything and be everything to everyone, they said, “Let’s assemble a toolkit of all the things that AI can’t do, or doesn’t do well.” Instead, just tell the model, “Hey, this tool is available, this tool’s available, this is available. Why don’t, when you’re asked to do this task, think through what tools should I use and then come up with a recipe. Okay, that’s all language. Come up with a plan. Okay, that’s all language.”
Christopher Penn – 23:43
“Start asking each tool in the plan for its things,” and it comes back with a result. What came back were absolutely incredible results because the model wasn’t trying to do everything. The model—at that point, Claude—was almost like the manager delegating to 29 different individual employees and then synthesizing the employees’ outputs to hand to the user. That’s how… to your point, John… that’s how this is going to evolve. The language model may, in fact, become the manager of the system and not the doer for a lot of it.
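To make the “model as manager of tools” idea concrete, here is a toy Python sketch of a tool registry and a dispatch loop. The tool names and functions are hypothetical illustrations, not how Manus or Claude actually implement tool handling.

```python
# A toy sketch of the "model as manager" pattern: the language model plans in language,
# then each step of the plan is delegated to a registered tool. Tool names and bodies
# here are hypothetical placeholders, not any real agent's implementation.
from typing import Callable, Dict, List, Tuple

def web_search(query: str) -> str:
    # Placeholder: a real system would call a search API the user has provided.
    return f"[search results for: {query}]"

def calculator(expression: str) -> str:
    # Placeholder: a real system would use a proper, safe math evaluator.
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "calculator": calculator,
}

def run_plan(plan: List[Tuple[str, str]]) -> List[str]:
    """Execute a plan produced by the model as (tool_name, tool_input) steps."""
    return [TOOLS[tool_name](tool_input) for tool_name, tool_input in plan]

# The plan would normally come back from the language model as language;
# it is hard-coded here purely for illustration.
print(run_plan([("web_search", "SCFA oatmeal study"), ("calculator", "72 * 3")]))
```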
Katie Robbert – 24:22
So the language model got a promotion.
Christopher Penn – 24:25
Exactly. That’s what a reasoning model is. We have said for years now, in our keynotes and our workshops, that AI is like the world’s smartest, most forgetful intern. The reasoning model now can do a lot of the how. It still needs to be told the what and the why. It’s now like the world’s smartest, most forgetful junior manager. It just got a promotion to a junior manager, and it’s like, “I’m eager, I’m excited. What am I doing?”
Katie Robbert – 24:52
So when we think about preparing data for AI… you were just talking through some of the formats. Where do we go next? You just sort of gave an overview of a “simple” table from a paper: “Here’s how you prepare it.” But it wasn’t a simple preparation. There was a lot about statistics and data analysis you have to understand in order to use the model. What if I am your below-average marketer, which I am, and I want to give the model some of my channel data and say, “Here’s my channel data, make me a plan.”
Christopher Penn – 25:36
That would be a typical ask, and the typical answer, and the correct answer is, “I kind of need you to do this part, Katie.”
Katie Robbert – 25:50
Unbelievable. I did not see that coming. Yes, I did. So for those who are not familiar, this is the Trust Insights 5P framework: Purpose, People, Process, Platform, Performance. Purpose: what is the question you’re trying to answer? What is your goal? People: who’s involved internally and externally? Your audience, your stakeholders? Think of all the people. Process: how are you going to do it? Platform: what tools do you need? Performance: how do you measure success as it relates to your original purpose? Let’s say I’m your mediocre marketer, which I am, and I said, “I want to use ChatGPT to understand the performance of my website, my SEO performance, my organic search. I want to show up on page one.”
Katie Robbert – 26:41
So my purpose, the question I’m trying to answer, is, “How well is my SEO performing? What do I need to do to make my SEO best in class to make sure I show up on page one?” Of all the things, the people involved… well, me, and you, Chris, as the analyst. Then I would likely need to understand my audience so that I know we’re giving them the right stuff. That comes from our ideal customer profile, which you can find in older episodes of the So What? playlist on our YouTube channel. The process… well, what I was going to do was extract my data from Ahrefs, which is the SEO tool we use, give it to ChatGPT, and say, “What am I supposed to do?” That’s my process.
Katie Robbert – 27:36
It’s probably not deep enough. Platform: Ahrefs, ChatGPT, or whatever model is appropriate. Then performance. The performance in this case is I now have a plan to fix my SEO so that I show up on page one.
Christopher Penn – 27:54
Most of that is good. However, it’s still not decomposed enough. Process decomposition is one of the most important things you can do as part of preparing your data for AI. Say, “What is this? What does data look like? Is it usable in the format that it’s in, or do we have to process it and slice it up and do more stuff with it?” Inside of… for example, inside the Ahrefs SEO software, if you were to say, “Export my backlinks, export the number of links that I’m getting to my website,” you would end up with this. It’s a very nice spreadsheet. It’s gigantic. This is 30,000 rows of data. I can tell you with 100% confidence, ChatGPT is just going to implode if you give it this information because it is unprocessed.
Christopher Penn – 28:43
This is like a pile of flour, a bottle of water, and some yeast. You’re like, “I want bread.” This is not bread. You cannot eat this. The next step would be to look at this and go, “Okay, in the 5Ps, who is this report for?” Well, it’s for Katie. She’s a CEO. She wants to know what’s going on with the website. Is this the best way to do this? The answer is clearly not, because there are no conclusions here. This is a key part of preparing your data for AI. We never, ever, to the maximum extent possible, want to have AI trying to do the math, or crunching the data, or processing the data. We want to hand it a finished product and say, “Here is the data that is completely processed now.”
Christopher Penn – 29:32
It’s more language-shaped and more like real language. You should now be able to work with this. The way that we would process this is we would say, “What are the…” You could do this in a tool like… you could do it in Excel. It would take you a very long time. You could do it in a tool like Python, you could do it in a tool like R. There’s paid vendors—Tableau, Alteryx, Power BI, and all those things—that can help you. You slice and dice and process the data. For Trust Insights, we do this in the R programming language for the most part. We say, “I want to turn this data into a set of conclusions. What is this data?”
Christopher Penn – 30:17
“Can you give me just a summary table of where your links came from this month? Who do they come from? Where did they go to on your website?” That processing is then what comes out the other end, so we can say, “Okay, now we can hand that useful summary data to generative AI, and then generative AI will know what to do with it.” After you process it… this is an example of what the prompt would look like, where you would say, “Here is… I’m going to give you the data.” If I scroll down here, you can see the number of referring domains. There is no math involved here. The work is done. Here are the pages that got the most links. Here is the quality of the links. Here are your weekly trends.
Christopher Penn – 31:07
Then, if you put this cleaned data that a third-party system has prepared into generative AI, going back to the 5Ps, we are preparing this SEO report for our CEO, Katie. She is an excellent strategist, but a relatively non-technical marketer and doesn’t have a ton of SEO experience, as her focus is on strategy. We’ve got the people in the… we get the people. The prompt has the intended process, and now, with the prepared data, it’s going to create a report. The conclusions are already in the prepared data. This is now assembling the language around it.
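Here is a minimal sketch of that “summarize before you prompt” step in Python with pandas (the episode notes Trust Insights typically does this in R, and pandas’ Markdown output assumes the tabulate package is installed). The column names are hypothetical stand-ins for an Ahrefs backlink export, since the actual schema isn’t shown on screen.

```python
# A minimal sketch of summarizing a raw backlink export before handing it to AI.
# Column names ("Referring domain", "Target URL", "First seen") are hypothetical
# stand-ins for an Ahrefs export, not the tool's exact schema.
import pandas as pd

links = pd.read_csv("backlinks_export.csv", parse_dates=["First seen"])

# Who links to us, and how often?
top_domains = links["Referring domain"].value_counts().head(20)

# Which pages on our site earned the most links?
top_pages = links["Target URL"].value_counts().head(20)

# How many new links arrived each week?
weekly_trend = links.set_index("First seen").resample("W").size()

# Write the summaries out as Markdown tables, ready to paste into a prompt.
with open("backlink_summary.md", "w", encoding="utf-8") as f:
    f.write("## Top referring domains\n\n" + top_domains.to_markdown() + "\n\n")
    f.write("## Most-linked pages\n\n" + top_pages.to_markdown() + "\n\n")
    f.write("## New links per week\n\n" + weekly_trend.to_markdown() + "\n")
```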
Katie Robbert – 32:11
This is important because this might be the most disappointing thing that marketers are realizing about generative AI, especially when… we were talking earlier about, “Will AI take my job?” No, it’s going to change it. You still have to do the foundational work to make sure that, if you’re extracting data from your SEO tool, that you’re analyzing it. The tool’s not going to do that for you. You have to analyze it and put it in a form that it can then do something with and sort of help you do those deeper insights. You don’t get to skip over the actual analysis part, unfortunately. I think that’s probably been a huge wake-up call for a lot of companies that have misunderstood what AI can do.
Katie Robbert – 33:00
“Let me get rid of my whole data team. AI is going to do it for me.”
Christopher Penn – 33:05
Exactly. Now there are a couple of tools that I personally find helpful for prepping your data. There is one called csv2md—a bit of a mouthful—which is a command line tool. You would install it in the Windows command line or the Mac terminal; you do need to have Python installed. Then you run the command, you give it a CSV, and it will just spit out a markdown table. If you don’t have access to that, you can—as we saw earlier—just ask generative AI, “Here’s a CSV. Can you turn this into a markdown table for me?” It obviously can do that as well, but that’s a super helpful utility.
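If csv2md isn’t an option, a few lines of standard-library Python can do the same CSV-to-Markdown conversion. This is a minimal sketch; `table.csv` is just a placeholder file name.

```python
# Convert a CSV file into a Markdown table using only the standard library.
# "table.csv" is a placeholder; point it at your own export.
import csv

def csv_to_markdown(path: str) -> str:
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

print(csv_to_markdown("table.csv"))
```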
Christopher Penn – 33:55
Another utility that is super helpful is any advanced text editor—UltraEdit on Windows, BBEdit on the Mac—that can do things like remove duplicate lines or take a pattern and remove data that matches it. A good example is if you were to break apart a CSV file that had email addresses in it and say, “I want to remove these because I don’t want to hand personally identifying information to a generative model.” You can have a text tool yank them out, and then you can move on with handling your data safely.
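As a sketch of that PII-scrubbing step, here is a small Python example that redacts email addresses from a file before it goes anywhere near a generative model. The regex is deliberately simple and is an illustration, not a complete PII scrubber; the file names are placeholders.

```python
# Strip email addresses out of a text or CSV export before sharing it with an AI model.
# The pattern is intentionally simple; a production scrubber would cover more PII types.
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text: str, placeholder: str = "[REDACTED EMAIL]") -> str:
    return EMAIL_PATTERN.sub(placeholder, text)

with open("export.csv", encoding="utf-8") as f:
    cleaned = redact_emails(f.read())

with open("export_redacted.csv", "w", encoding="utf-8") as f:
    f.write(cleaned)
```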
Katie Robbert – 34:30
I think, again, it sort of goes back to the old Katie adage of “new tech doesn’t solve old problems.” Generative AI is not this magic wand tool that sort of does all the things. John, this was to your question earlier of, “How close are we to that?” The answer is we’re not, because it’s not going to do all the things. It is just now another part of your probably already massive tech stack. Going through something like the 5P framework is going to help you figure out where exactly it fits. If you want your own copy of the 5P framework so that you can work through the steps, go to TrustInsights.ai/5P-framework to figure out where AI fits in.
Katie Robbert – 35:16
It’s not a catch-all that’s suddenly going to do all the things. We’re seeing, in very specific, discrete examples, where it can’t figure out what is theoretically a very simple table in an academic paper because there’s not enough language around it for the model. It’s just numbers, and those numbers don’t mean anything.
Christopher Penn – 35:40
Two other utilities that are going to be very useful for you: some tool that can glue together documents, because very often you will have—for example—50 PDFs. They’re all like five pages long, but you’ve got a lot of them. A tool like pdfunite, which is a free tool, would be the way to glue them all together into one PDF. Then you could load that into a tool like NotebookLM instead of dropping all 50 PDFs and waiting for it to churn endlessly. Another example would be… I have a tool that I wrote in Python—I didn’t write it; a generative AI wrote it—that glues together text files.
Christopher Penn – 36:22
For example, we were recently doing a project where we had call transcripts, and I needed to glue a bunch of them together into one file that I could then hand to generative AI. What the tool does is it says, “Here’s the file name, here’s the start of the file, here’s the end of the file.” It just glues it all together. As we were seeing earlier, having that utility allows generative AI to recognize, “Oh, you’re referencing File 15.” It can look through the text you provide, find the words “File 15,” and know that’s where File 15 starts. It’s good delegation. The other tool that I would recommend is a good file splitter, something that can take a large file and chop it up into intelligible pieces.
Christopher Penn – 37:04
There are utilities like GSplit, or split on the Mac—a command line tool that I believe can also run on Windows. Give it, say, a spreadsheet or a document that has 100,000 lines. It will not fit into generative AI in one piece. NotebookLM is notorious for this. It will not accept a text file over 1.8 megabytes. If you get a bunch of Reddit conversations, you might have to split them into 1.8 megabyte chunks. With a good file splitter, you split the file into pieces, you load the individual pieces in, and then you can have your whole big, huge honking file accessible to NotebookLM, but in those individual pieces.
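Here is a hedged Python sketch of both utilities: a gluer that concatenates text files with clear start and end markers (mirroring the behavior Chris describes, though this is not his actual tool) and a splitter that chops a large file into chunks under the roughly 1.8 MB per-file limit mentioned for NotebookLM. File names are placeholders.

```python
# glue_files() concatenates text files with explicit markers so a model can tell where
# each source begins and ends; split_file() chops a large text file into chunks under a
# size limit (default ~1.8 MB, the NotebookLM limit mentioned in the episode).
from pathlib import Path

def glue_files(paths: list, output: str = "combined.txt") -> None:
    with open(output, "w", encoding="utf-8") as out:
        for p in paths:
            name = Path(p).name
            out.write(f"===== START OF FILE: {name} =====\n")
            out.write(Path(p).read_text(encoding="utf-8"))
            out.write(f"\n===== END OF FILE: {name} =====\n\n")

def split_file(path: str, max_bytes: int = 1_800_000) -> None:
    text = Path(path).read_text(encoding="utf-8")
    part, chunk, size = 1, [], 0
    for line in text.splitlines(keepends=True):
        line_bytes = len(line.encode("utf-8"))
        if size + line_bytes > max_bytes and chunk:
            Path(f"{path}.part{part}.txt").write_text("".join(chunk), encoding="utf-8")
            part, chunk, size = part + 1, [], 0
        chunk.append(line)
        size += line_bytes
    if chunk:
        Path(f"{path}.part{part}.txt").write_text("".join(chunk), encoding="utf-8")

# Placeholder file names for illustration.
glue_files(["call_transcript_01.txt", "call_transcript_02.txt"])
split_file("combined.txt")
```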
Katie Robbert – 37:47
John, I owe you an apology. I had just given you 500 files and told you to manually copy and paste them into one text document. You said, “I’m on it, boss.”
John Wall – 37:58
There’s actually a tool.
Katie Robbert – 38:00
For that, which isn’t surprising. That’s again, that’s not AI. Those are the types of tasks… I don’t want to get too far off the topic of today’s livestream, but those are the types of tasks that, unfortunately, people are looking to AI to do. They’re not AI tasks. Those are just general technical or automated tasks that a different tool is going to do before you even get to the point of AI, which, again, we keep going back to it because this year is all about foundations. Do your 5Ps and figure out what you’re trying to do before you pick the platform.
Katie Robbert – 38:37
If you go in saying, “This is an AI task,” you’re going to end up like John Wall, copying and pasting text from 500 different documents into one single Word file. Then that one Word file is going to be way too big. Forget compression, that’s not even going to help anymore. You’re just going to have to figure it out from there, and it’s such a waste of time. Do the research first. Do your requirements gathering first. Prepare your data. When we say data, we’re not just talking about numbers; we’re talking about anything you’re giving to generative AI. Everything you’re giving—whether it’s photos, videos, or audio—that’s all data for AI. You have to do that work up front to prepare it.
Christopher Penn – 39:20
One other thing that you can do is generative AI can write the tool for you if you go through a three-step process. Number one, establish what your best practices are for what you’re going to be assembling. Two, with those best practices, build a product requirements document—a PRD—with the help of generative AI, saying, “Here’s what I want to do. Walk me through how to build this. What are the requirements?” It will say, “What does it do? What should it not do? What about this? What about this?” After a good conversation, you’ll have a PRD. Three, you say, “Build me a file by file work plan with the detailed requirements for each file.”
Christopher Penn – 40:01
When you do that, then generative AI can start creating the code for you, building the app for you, and you’ll have it ready to go. In preparation for this livestream, I was messing around, saying, “Could I make a CSV to markdown tool that I could put on the Trust Insights website?” The answer is yes. I just ran out of time before the show started. But if you don’t have a tool, generative AI can make it for you, as long as you follow that process.
John Wall – 40:35
How about… do you have a hierarchy for yourself? We’ve seen all the different models take different file types, and we’ve talked about here how markdown is superior to CSV most of the time. Is there a list that you go through? “Here’s what we would prefer to have: these.” Going down the list, they’re less and less likely to get loaded.
Christopher Penn – 40:55
It’s not that they’re less likely to get loaded. They’re less and less likely to be interpreted properly. Markdown, YAML, any structured file format is going to be the most likely to get loaded cleanly. Even HTML works, as long as you’re doing things like H1 tags and H2 tags within your text—you can see it in the code itself—because the AI models recognize that and go, “Okay, this is the title of the page, this is the top heading,” and they can understand the hierarchy of the text. XML is a good example of that. JSON—JavaScript Object Notation—language models love JSON. It’s one of their favorite data formats because, again, it says, “Here’s the key, here’s the value. Here’s the key, here’s the value.” Key: employee first name, value: Katie. Key: employee last name, value: Robbert, and so on and so forth.
Christopher Penn – 41:43
That structure removes ambiguity. At the end of the day, when we’re talking about accuracy with generative AI, it is all about reducing ambiguity, leaving less and less for the model to guess at.
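To see why that structure reduces ambiguity, compare the same record as a bare CSV line versus JSON, where every value travels with its key. The values are the illustrative example from the conversation.

```python
# The same record two ways: a bare CSV line the model has to guess about, versus JSON,
# where each value carries its key. Values come from the example in the conversation.
import json

csv_line = "Katie,Robbert"

record = {
    "employee_first_name": "Katie",
    "employee_last_name": "Robbert",
}

print(csv_line)
print(json.dumps(record, indent=2))
```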
Katie Robbert – 41:57
Where does that information live? This is my naive question. If I were to open up Gemini, for example, is there somewhere within Gemini that says, “The following five file types are accepted or preferred”? How does one who isn’t Chris Penn learn that information? I know you probably learned a lot of it through trial and error and just doing your own research. How does someone who isn’t you find out, “If only I had markdown”? Do I even know what markdown is? How do you get to that point so that you watch…
Christopher Penn – 42:38
The Trust Insights livestream?
Katie Robbert – 42:42
Duh.
Christopher Penn – 42:44
Kidding aside, one of the biggest complaints… Ethan Mollick points this out all the time. One of the biggest complaints about generative AI is that the garden hose you buy on Amazon has more documentation than a generative AI model does. That is largely the fault of the fact that the technology is moving so fast, and the people who are making it care so little about how people actually use it, that there is no documentation of, “Here’s general best practices.” That’s one of the reasons why I wanted to have this episode of “what are the basic best practices for preparing your data: getting it into markdown, getting it into YAML, using generative AI to convert your data into those formats?” Because it isn’t written down as an obvious… there is no “Generative AI for Dummies.”
Christopher Penn – 43:29
There may be, but we didn’t write it.
Katie Robbert – 43:32
To your point, it’s changing so quickly that by the time something is available and published, it’s already changing. This is why I was asking the naive question: if I were to open up Gemini, or Claude, or something, and look at the settings, does it say… because I know it doesn’t say, “These are the file types.” I know it does include, for the platforms where you have the ability to attach things, it gives you a couple of options, but those seem like standard. You can attach a Google Drive, you can attach an image, you can attach a file. That’s not helpful because I’m still like, “I’m trying to upload a movie, or I’m trying to upload this.”
Katie Robbert – 44:17
I guess that was really my question: how does someone find out how you’re supposed to use these tools if you don’t even know what file format is accepted?
Christopher Penn – 44:29
A lot of that… where I get that information from is when model makers publish where they got their training data from. That is enormously helpful for understanding the formats that the models have seen and seen the most of. When a model maker goes and scrapes all of GitHub, for example, everything on GitHub is markdown. GitHub is a huge site. Most of the stuff on Stack Overflow and Stack Exchange is either Markdown or HTML. Knowing that those are the data sources, you can then logically conclude it has seen the most of that. So that is the format that is most preferred. Just like the reason why we do so much in Python now is because that is the language that generative AI has seen the most of.
Christopher Penn – 45:17
When you use lesser languages, like Go, or Swift, or Flutter, or Dart, it screws up a lot more because it has seen fewer examples of it. It has seen pretty much anything you could ever want to do in Python at some point. So it is the training data. Its knowledge is going to be the most reliable for that. Part of understanding how to get the most out of AI is knowing what the AI has seen the most of. That’s one of the reasons why PDF works surprisingly well. Even though PDF, objectively, is a terrible document format—if you look at the actual standard, it sucks; people were trying to get rid of it for years—
Christopher Penn – 45:53
because AI has seen the most of it online, with all the academic papers, it’s really good at reading even crappy PDFs.
Katie Robbert – 46:02
So what you’re saying is, go to TrustInsights.ai/YouTube and catch older episodes of our So What? playlist to stay up to date on what kinds of things a generative model will accept. This is not me picking on you, Chris, but you look at it very differently. I would not immediately think to look at the documentation from the developers of these models to say, “What has it read the most of?” I don’t come to that natural conclusion. I would take a wild guess and say, John, you’re more aligned with me in this particular case, where, if you’re wondering what works in generative AI, you’re not going that route to try to find it. There are reasons why people subscribe to the Trust Insights newsletter.
Katie Robbert – 46:58
Subscribe to our personal newsletter to keep up with that information. I’m already feeling overwhelmed, and I’m part of the team that writes the newsletter!
Christopher Penn – 47:08
Yep. We had this in Analytics for Marketers the other week: someone was trying to use Gemini to transcribe the contents of a video, and they could not get it to work. I asked them a couple of questions, and it turns out the video file they were using was using a codec—which is a method that video is compressed in—that Gemini couldn’t read. Why? Because Gemini has been trained on YouTube. Once the person translated that video, using a free tool like FFmpeg, into the standard format that YouTube uses, suddenly Gemini’s like, “I got it. I can work with it.” So even in that case, it’s knowing: where did Gemini get this information from? How does it know how to process this? It’s because it’s a Google product.
Christopher Penn – 47:57
It came from YouTube, and we know the YouTube video formats. There are seven different video formats that YouTube’s really good at, and then a bunch that it’s not really good at. Of course, having been trained on the YouTube corpus, Gemini is going to… you’ll see that style of video format work the best. It always comes down to: what has the system seen the most of, and can you get your data into those containers?
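As a sketch of the fix described above, here is FFmpeg called from Python to re-encode a video into a YouTube-style container (H.264 video, AAC audio, MP4). It assumes FFmpeg is installed; the file names are placeholders, and the episode doesn’t say which codec the original file used.

```python
# Re-encode a video into a YouTube-friendly container (H.264 video, AAC audio, MP4)
# with FFmpeg before handing it to a model like Gemini. Assumes FFmpeg is installed;
# file names are placeholders.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-y",                        # overwrite the output file if it already exists
        "-i", "original_video.mov",  # whatever format the source file was in
        "-c:v", "libx264",           # H.264 video
        "-c:a", "aac",               # AAC audio
        "output_for_gemini.mp4",
    ],
    check=True,
)
```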
Katie Robbert – 48:26
This was a bigger episode than I thought it was going to be, but in a good way because I personally feel like I learned a lot. Like a lot of people, I just sort of made the assumption that if you give generative AI a document with a table in it, it’s going to know what to do with it, or if you give it a CSV, it’s going to know what to do with it. That is the opposite of what’s true. Even that little bit of information is helpful. What about you, John?
John Wall – 49:02
For me, it was just the opposite. I knew this would be important because pretty much every time I’m running a query, it’s like, “Oh yeah, it barfs on some kind of file,” and we can’t do anything with that.
Christopher Penn – 49:13
This is one of the reasons why people are not getting as much value out of AI as you would think. This is the foundation stuff that you need to know to build things like vector databases and retrieval augmented generation systems, and fully private AI that knows all your knowledge and all your data. This is the prerequisite. If you’ve got a Microsoft SharePoint instance with a gazillion different Word files, that is almost totally useless to generative AI. If you look at the DOCX format, it is just ugly. It’s filled with so much extra stuff that it’s very hard for AI to read. If you aspire to unlocking the value of generative AI, you’ve got to get these foundational pieces done first.
Katie Robbert – 50:02
I agree, and I think that’s a good place to end this week’s conversation. Big takeaways: you can’t skip over knowing how any individual role works. You still have to be a subject matter expert. AI can’t do math, unfortunately. Do your requirements gathering up front. Make sure you know what you’re asking so that you can prompt the systems correctly.
Christopher Penn – 50:26
Exactly. I’m going to give away the file splitter that I use for Generative AI. It will be posted in Analytics for Marketers. If you’re not a member of the Analytics for Marketers Slack group, you should go and join. We’ll give that away later today. That’s going to do it for this week’s episode. Thank you everyone for tuning in, and we will talk to you on the next one. Thanks for watching today. Be sure to subscribe to our show wherever you’re watching it. For more resources and to learn more, check out the Trust Insights podcast at TrustInsights.ai/ti-podcast and our weekly email newsletter at TrustInsights.ai/newsletter. Got questions about what you saw in today’s episode? Join our free Analytics for Marketers Slack group at TrustInsights.ai/analytics-for-marketers.
Christopher Penn – 51:15
See you next time.
Get unique data, analysis, and perspectives on analytics, insights, machine learning, marketing, and AI in the weekly Trust Insights newsletter, INBOX INSIGHTS. Subscribe now for free; new issues every Wednesday!
Want to learn more about data, analytics, and insights? Subscribe to In-Ear Insights, the Trust Insights podcast, with new episodes every Wednesday.
Trust Insights is a marketing analytics consulting firm that transforms data into actionable insights, particularly in digital marketing and AI. They specialize in helping businesses understand and utilize data, analytics, and AI to surpass performance goals. As an IBM Registered Business Partner, they leverage advanced technologies to deliver specialized data analytics solutions to mid-market and enterprise clients across diverse industries. Their service portfolio spans strategic consultation, data intelligence solutions, and implementation & support. Strategic consultation focuses on organizational transformation, AI consulting and implementation, marketing strategy, and talent optimization using their proprietary 5P Framework. Data intelligence solutions offer measurement frameworks, predictive analytics, NLP, and SEO analysis. Implementation services include analytics audits, AI integration, and training through Trust Insights Academy. Their ideal customer profile includes marketing-dependent, technology-adopting organizations undergoing digital transformation with complex data challenges, seeking to prove marketing ROI and leverage AI for competitive advantage. Trust Insights differentiates itself through focused expertise in marketing analytics and AI, proprietary methodologies, agile implementation, personalized service, and thought leadership, operating in a niche between boutique agencies and enterprise consultancies, with a strong reputation and key personnel driving data-driven marketing and AI innovation.