
So What? Q4 2024 Large Language Model Comparison and Bakeoff

So What? Marketing Analytics and Insights Live airs every Thursday at 1 pm EST.

You can watch on YouTube Live. Be sure to subscribe and follow so you never miss an episode!

In this episode of So What? The Trust Insights weekly livestream, you'll learn about the surprising performance of large language models in a head-to-head bake-off. You'll discover how six models (Copilot, Gemini, ChatGPT, Claude, Mistral, and Meta's Llama) stack up against each other across a range of tasks. You'll also get insights into the strengths and weaknesses of each large language model and learn which model excels at which use cases. Plus, you'll get expert commentary and analysis from Trust Insights' data science experts.

Watch the video here:

So What? Q4 2024 Large Language Model Comparison and Bakeoff

Can’t see anything? Watch it on YouTube here.

In this episode you’ll learn:

  • The current state of large language models and generative AI
  • A large language model comparison of models past and present
  • Which large language models are best for specific tasks

Transcript:

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for listening to the episode.

Katie Robbert – 00:38
Well, hey everyone. Happy Thursday. Welcome to “So What?”, the marketing analytics and insights live show. For the first time in maybe a month, the three of us are in the same place. Chris, John, how’s it going, fellas?

John J. Wall – 00:50
Good.

Christopher S. Penn – 00:53
We got it.

Katie Robbert – 00:55
So, as of this live stream, it is the week before U.S. Thanksgiving. John has gotten a break from his crime-fighting this week; justice is prevailing. Chris is not on the road, and I am where I always am. And so this week, we are doing our quarterly large language model comparison and bake-off. We've done this a couple of times. We did this in Q3. I know we also did this in Q4. Chris, what is the face?

Christopher S. Penn – 01:27
We have not actually done this in a year. We have not done this particular one in a really long time, because we've always had other stuff jump in as a higher priority.

Katie Robbert – 01:49
Okay, so let me correct myself, because I want to get it right. We've done versions of a large language model bake-off each quarter, but not this exact one. The last time we did this particular bake-off was Q3 of 2023, but we have done smaller versions to compare the large language models. Did I get that correct, sir? Am I allowed to move forward?

Christopher S. Penn – 02:29
Carry on.

Katie Robbert – 02:30
Thank you. So today, what we're going to be doing is looking at six different models: Copilot, Gemini, OpenAI's ChatGPT, Claude, Mistral, and Meta's Llama. We have a series of tests that we're going to run each of these through. Chris has graciously started to put together a lot of this already, and John and I will be primarily responsible for color commentary, jokes, and judging. We do have a whole scoring sheet set up, and we can share that with people after the livestream so you can see what each model is good at and where each model falls down a bit. So, Chris, where would you like to start?

Christopher S. Penn – 03:14
Let's start with the scoring, just so that we see what it is we're going to be doing. Remember that we've been talking about the six major use cases of generative AI for a while: generation, extraction, summarization, rewriting, classification, and question answering. We have two tasks for each of these. What Katie and John will be doing is looking at the output of each of the six models and awarding points: two points if the model did a good job and returned factually true, complete, expected output; one point if it did a half-baked job, where it maybe did something but it wasn't good enough to be generally acceptable; and zero points if it just totally hoses the task. These tasks are the same as the last run we did on this, which was Q3 2023.
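For readers keeping score at home, the rubric maps to a simple tally: six models, six use cases, two tasks each, zero to two points per task. Here is a minimal sketch in Python; the model and use case names come from the episode, but the one sample score shown is a placeholder, not the judges' actual number.

```python
# Bake-off score sheet: 6 models x 6 use cases x 2 tasks, each worth 0-2 points.
USE_CASES = ["generation", "extraction", "summarization",
             "rewriting", "classification", "question_answering"]
MODELS = ["Copilot", "Gemini", "ChatGPT", "Claude", "Mistral", "Llama"]

# scores[model][use_case] = [task1_points, task2_points]
scores = {m: {u: [0, 0] for u in USE_CASES} for m in MODELS}

# Example (placeholder, not the judges' real number): award Copilot a 1
# on the first generation task.
scores["Copilot"]["generation"][0] = 1

def total(model: str) -> int:
    """Sum a model's points across all twelve tasks (maximum 24)."""
    return sum(sum(tasks) for tasks in scores[model].values())

print({m: total(m) for m in MODELS})
```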

Christopher S. Penn – 04:03
So, we're keeping the exact same prompts. Ideally, I would have liked to upgrade the prompts to be more sophisticated, but in some ways it's better if they are the same pretty naive prompts we used more than a year ago, because it will show us how the models have progressed since that time. So, let's get started. Task number one: we are going to have it write a blog post outline. The prompt is: write an outline for a blog post about the future of content marketing in 2025. What content marketing trends are likely, given the state of content marketing today? Now, this is not exactly the same as last year's prompt, because last year's prompt said 2024, but otherwise it is the exact same prompt.

Christopher S. Penn – 04:41
So we're going to start off in Copilot, then Gemini, then ChatGPT, which has memory turned off, so that's important to specify. Then Claude, Mistral through the Cobalt interface, and Meta's Llama. Now, let's see how we're doing. So far, Microsoft Copilot has come up with: The Future of Content Marketing in 2025: Trends to Watch; AI-Driven Content Creation; Personalization and Hyper-Personalization; Interactive and Immersive Content; Video Dominance; Voice-Optimized Content; Influencer Marketing Evolution; Sustainable and Ethical Marketing; User-Generated Content; B2B Content Marketing; Social Media Innovations. So, John and Katie, scoring time. Judges, please score.

Katie Robbert – 05:36
I would give it a one. I would say a one because it completed the task, which is half the battle; that's 50% of the score. But not a two, because if I remember correctly, a lot of these were the trends to watch in 2024. And so this, to me, doesn't say, "oh, this is new and useful and helpful." Now, trends are a funny, tricky thing, because trends come and go, and trends come back around. But in terms of content marketing—video, personalization, influencers—none of that is new. And if it's not new but it's coming back around, this post should at least acknowledge that it was popular three years ago and is coming back around in 2025. So, I feel like it falls short in terms of being useful. John, what say you?

John J. Wall – 06:32
I'm all on board with you as far as that goes. In fact, Chris did a post about a week and a half ago along the lines of: here are the 10 crappy predictions you're going to see from all the wannabe chumps. And that's basically the exact same list there, verbatim. Personalization, hyper-personalization—I know that was exactly on there. And it's one thing with trend articles like this: I'm totally okay with a trend that's completely baloney and will never happen, if it makes me think about something that's coming in the next year. That's the important part. Saying email is dead and SEO is really important—that's just not helping anybody.

Christopher S. Penn – 07:21
Okay, so Copilot gets one point from each of you on the score sheet. Katie's doing the score sheet. Gemini: AI-powered personalization; the rise of interactive content; short-form video dominance; focus on community building; emphasis on content authenticity. So, those are the five trends that Gemini has identified.

Katie Robbert – 07:48
I'm with you, John. I could see the face. I think, again, this kind of gets a one. And again, trends are a tricky thing, because trends don't have to be something net new. If you look at fashion, for example, the '80s are back, the '90s are back; but what is the new spin on something old? And again, none of this really gives me that. So it gets a one for completion, but it doesn't get the extra point, because there's nothing here that I couldn't get from Microsoft Copilot.

Christopher S. Penn – 08:21
Okay, let’s go into ChatGPT. ChatGPT says: the current state of content marketing—so these are its ideas—the current state of content marketing, key trends: personalization at scale, voice search optimization, immersive content, AI-generated content, sustainable ethical storytelling, content communities, the role of data analytics, shifts in platform dynamics such as social media fragmentation, and strategies for success.

Katie Robbert – 08:56
This, to me, feels like a mush together of Copilot and Gemini. Now, what I will give ChatGPT a little bit of extra credit for is taking it a step further with the strategies and talking about the platforms themselves, not just what is going to be baked into the content. What do you think, John?

John J. Wall – 09:19
I think Gemini—I felt like, even though it wasn’t there, it was definitely better than Copilot. But from ChatGPT, I’m just getting a real word salad feel out of this thing.

Katie Robbert – 09:29
So what do you think? Are we—are we on a one again?

John J. Wall – 09:32
I think we’ll stick with the one.

Christopher S. Penn – 09:35
Okay, let’s look at Claude. Claude says: AI-enhanced content creation; personalization; interactive and immersive content experiences; shift towards multimodal content and micro-experiences; private, privacy-first content strategy; community-centric content; short-form video evolution; content distribution changes; focus on content authenticity and measurable content impact. So Claude has come up with nine trends for 2025.

John J. Wall – 10:03
I definitely see this one as the best yet. It’s still not a two, but…

Katie Robbert – 10:10
No, I think—I think, again, it’s a one. It’s a regurgitation of the exact same items that we got from the other three models. So I think this is also a one.

Christopher S. Penn – 10:24
Okay, let's look at Mistral. Mistral came up with: the evolving consumer landscape; interactive and immersive content; artificial intelligence and automation; micro-influencers and user-generated content; podcasting and voice marketing; sustainable, purpose-driven marketing; blockchain and decentralized marketing; 5G and instant content delivery; and privacy. So that's what Mistral came up with.

John J. Wall – 10:52
I was going to say it’s better, but then when it added blockchain, I was like, no.

Katie Robbert – 10:58
So again, it gets a one for completion. It doesn’t get the extra point for something that’s really well done.

Christopher S. Penn – 11:08
All right, and let’s look at Meta. Meta’s Llama came up with: personalization and account-based marketing; video and interactive content; voice search optimization; sustainability and social responsibility; AI-generated content; virtual and augmented reality; data-driven storytelling; influencer marketing evolution.

Katie Robbert – 11:26
I think it's another solid one because it completed the task. And I think this is actually a great place to start: everyone gets a one across the board because, yes, they completed the task; no, they did not write a piece of compelling and engaging content. And the thing that you say, Chris, as we talk about the six use cases of generative AI: generating content is the least best use of generative AI. This is a great example of why, because every single model gave us something mediocre, and there was so much sameness. It might have been written a little bit differently, and Mistral threw some blockchain in there, which is not helpful and should probably cost it a point.

Katie Robbert – 12:17
But it’s all fine if you’re churning out large volumes of content and you don’t care about quality, but it’s not high-quality, engaging content. It’s all mediocre. So everyone gets a one across the board. And I think that’s a good representation of what the models can do.

Christopher S. Penn – 12:36
Exactly. And also, that was a crap prompt we replicated from last year. Next prompt: the second generation task is to have it write a list of recommendations for preventing COVID. This is a factual-correctness test. We're looking for three things: recommendations for masks, vaccination, and ventilation. So we're going to start new chats. We always start a new chat anytime you're doing this, so that the previous conversation does not influence the current one. We're going to go to each model and put in our question, and we'll see what comes up. Copilot: vaccination, masks, and ventilation. So, it comes up with all correct points here.
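Chris's new-chat habit is worth copying: in a web interface, it means a fresh conversation for every test so no prior turns leak in. Against an API, the equivalent is sending only the current prompt. The show used each vendor's web interface, so the following is just an illustrative sketch of the same principle using the OpenAI Python client, an assumption of ours rather than the show's setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def fresh_answer(prompt: str, model: str = "gpt-4o") -> str:
    # Each call sends ONLY the current prompt, with no prior turns: the API
    # equivalent of starting a brand-new chat for every test.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(fresh_answer("Write a list of recommendations for preventing COVID."))
```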

Katie Robbert – 13:20
Are there any points in here that are incorrect?

Christopher S. Penn – 13:24
So, hand hygiene is technically not incorrect; it's a good thing to do. It doesn't help much with COVID, because COVID is an airborne disease. But it's not wrong.

Katie Robbert – 13:36
Well, I guess that's what I'm asking. Yes, you should wash your hands, and someone washing their hands is not going to suddenly start spreading COVID. So I'm looking for hallucinations and information that is outright wrong.

Christopher S. Penn – 13:50
Yes, like rubbing molasses on your nose or something like that. So this one is technically correct.

Katie Robbert – 13:59
Okay, so would you—would you give this a two?

Christopher S. Penn – 14:02
I would give this a two. It hits all three of the big ones: vaccination, ventilation, wearing a mask.

Katie Robbert – 14:08
Okay, and the fourth check being no incorrect information.

Christopher S. Penn – 14:12
Nothing woefully wrong.

Katie Robbert – 14:13
Okay.

Christopher S. Penn – 14:15
All right. Gemini: vaccination, hygiene, masks, physical distancing. It does have ventilation as a side note in there, but it's not listed as one of the major things to do.

Katie Robbert – 14:31
And is anything incorrect? Is anything wrong?

John J. Wall – 14:38
We’re looking for ivermectin here.

Christopher S. Penn – 14:42
There’s nothing that’s factually incorrect here.

Katie Robbert – 14:44
Okay, so because ventilation wasn’t listed as a recommendation, are you giving it a one or a two?

Christopher S. Penn – 14:52
I would give this one a one. I don't think it emphasizes ventilation enough; those are the big three. But it's not factually wrong.

Katie Robbert – 15:02
Yeah.

Christopher S. Penn – 15:02
ChatGPT says: get vaccinated; wear a mask in high-risk situations; maintain good hygiene; practice respiratory etiquette; improve ventilation; maintain physical distance; monitor symptoms; regular testing; limit contact; stay home if you're sick; stay informed; protect vulnerable individuals; travel safely; and boost your overall health. Again, there's nothing that is factually incorrect in here, and I appreciate the fact that ventilation made it into the top five.

Katie Robbert – 15:29
Okay, so that gets a two.

Christopher S. Penn – 15:32
All right, let's look at Claude: vaccination; hand hygiene; masks; ventilation; physical distancing; health monitoring; surface cleaning; immune system support. So, those are all correct. There's nothing here that is horrendously wrong.

Katie Robbert – 15:51
What is not horrendously wrong?

Christopher S. Penn – 15:53
Again, the whole washing-your-hands thing: it's number two on the list. I would put it at, like, number five.

Katie Robbert – 16:02
I also think people are gross. I think it should be number one.

Christopher S. Penn – 16:06
Honestly.

John J. Wall – 16:08
People are gross or propaganda.

Katie Robbert – 16:10
People are gross. Tell them to wash their hands. Don’t give them a reason. Just tell them it has to be done.

Christopher S. Penn – 16:16
Fair enough. But Claude did a good job. Let's look at Mistral: vaccination; regular hand washing; distancing; wear a mask; avoid crowds and poor ventilation; cover coughs and sneezes; clean and disinfect surfaces; get tested; and so on and so forth. It covered all the bases; nothing here is incorrect. Okay, so Mistral got that one. And let's check on Meta: vaccination and boosters, that's actually nice. Did the other ones mention boosters? Oh, yeah, they did; they just didn't call it out in the heading. Personal hygiene; social distancing; improve ventilation, which is under home and work precautions; health monitoring; additional tips; stay hydrated; high-risk individuals; stay informed. So again—well, masks got lumped into social distancing. I would have made that its own thing, personally.

Katie Robbert – 17:19
But that’s a preference. That’s your preference in the delivery of the data, but the data is not incorrect.

Christopher S. Penn – 17:28
That’s correct.

Katie Robbert – 17:28
It's the same thing with you thinking hand washing should be number five and me thinking it should be number one. Regardless, it's not incorrect.

Christopher S. Penn – 17:36
That’s correct.

Katie Robbert – 17:36
So, I think everyone except for Gemini gets a two. Gemini gets a one for not mentioning ventilation. And everybody gets an extra point for not telling us to lick doorknobs.

Christopher S. Penn – 17:51
Oh goodness. Humans—messy, messy humans.

Katie Robbert – 17:56
Or drink bleach. All right, so in terms of generation, which was the first category, Gemini is the weakest thus far. Everybody has a three. Gemini has a two. But what we’ve learned here is, in terms of content generation, every model is mediocre. In terms of grabbing accurate data, every model does a good job, except for Gemini. But again, it comes down to what your use case is. So, using large language models for content generation—probably not your best use case.

Christopher S. Penn – 18:45
Yep. Okay, next task is extraction. We're going to say, "Identify the company name and job title from this job URL." So let's go ahead and send it to Copilot, Gemini, ChatGPT, Claude, Cobalt, and Meta. Let's see Copilot: the company name is Virgin Media O2, and the title is UX Researcher, which is factually correct; O2 is the company. Okay, so that is correct. Gemini refuses.

Katie Robbert – 19:24
Well, way to go, Gemini.

Christopher S. Penn – 19:26
Which is really weird because that’s not a paywalled site. That’s Virgin Media’s jobs site. So, okay.

Katie Robbert – 19:34
And Gemini can read URLs.

Christopher S. Penn – 19:37
It can see the web. ChatGPT says: UX Researcher position at Virgin Media O2. And bonus, it actually cites the source properly, which is even nicer. Claude refuses: "I can't read URLs."

Katie Robbert – 19:53
So, is that a standard practice for Claude, or is that for this?

Christopher S. Penn – 19:57
Claude can't see the internet at all. Okay, Mistral says: UX Researcher at Virgin Media O2. So Mistral got it correct.

Katie Robbert – 20:07
Okay.

Christopher S. Penn – 20:08
And Meta says the same thing, and in a nice little table format, too.

Katie Robbert – 20:15
Nice.

Christopher S. Penn – 20:16
Good job, everyone. Good job.

Katie Robbert – 20:18
Okay, well, except for Gemini and Claude, who said, “no, thank you, I don’t do that.”

Christopher S. Penn – 20:27
Talk to the tokenizer. I don’t know, they don’t have hands.

Katie Robbert – 20:33
Talk to the megabyte.

John J. Wall – 20:36
I feel like there should be a Family Feud red X that we can hit.

Katie Robbert – 20:39
I think that’ll be our 2025 enhancement for this round.

Christopher S. Penn – 20:46
There you go. Okay, next task: we're going to give it a PDF and ask it to extract the data in tabular format. So let's start with this PDF. "Try a different file." So Copilot can't even do it. All right, Gemini: let's drag in the PDF and tell it to extract the data. ChatGPT: here's the PDF. Claude: same. Cobalt can't do that within this particular tool.

Katie Robbert – 21:31
Okay.

Christopher S. Penn – 21:32
And Meta: unsupported file type.

Katie Robbert – 21:37
All right, so if we go—Copilot said no, Gemini said what?

Christopher S. Penn – 21:44
Here’s our tabular data identifying prime numbers. So, it is correctly pulling the data from this PDF. So, Gemini gets full marks.

Katie Robbert – 21:52
That data is correct.

Christopher S. Penn – 21:54
It is. It's the headline chart in the PDF.

Katie Robbert – 21:58
Okay.

Christopher S. Penn – 21:59
Claude pulled all five tables. That's very nice.

Katie Robbert – 22:10
That’s an overachiever if I ever saw an overachiever.

Christopher S. Penn – 22:15
ChatGPT pulled the table, so okay, it did a good job. And Meta refused.

Katie Robbert – 22:21
So what about Mistral?

Christopher S. Penn – 22:23
Mistral also can't do that in this interface. Okay, so zeros for those. All right.

Katie Robbert – 22:28
We’re starting to see a leader emerge, but we have a long way to go.

Christopher S. Penn – 22:34
We do.

Katie Robbert – 22:34
Anybody's game.

Christopher S. Penn – 22:36
All right, so let’s—we’re now going to do a summary knowledge query. We’re going to say: “There’s a belief that after major traumatic events, societies tend to become more conservative in their views. What peer-reviewed, published academic papers support or refute this belief?” Now we know there’s a list of several papers that cover this. So this is both an informational query and a summarization of the information, if the model can find it and doesn’t make it up. So let’s go ahead and clear all of our chats.

Katie Robbert – 23:19
And I think with this one, it’s going to be again one point for completion of the task, two full points if the information is actually correct.

Christopher S. Penn – 23:30
So, Copilot says there are peer-reviewed academic papers and lists a few notable ones: Conservatism, Stay the Field; Community Traumatic Effects on Student Achievement; and The Crisis of American Conservatism: The Historical Comparative. And then it has the specific citations. This is what we are looking for: citations that actually lead to real papers.

Katie Robbert – 23:55
I remember when we did this last year, it was—there was a lot of hallucination happening. All right, so Copilot gets a two.

Christopher S. Penn – 24:02
Yes, Copilot gets a two. Let's see. Gemini spits out some stuff. Let's double-check to see if this actually exists. Yep, UC Irvine—that exists. Hang on, this one appears to be a hallucination. Okay, that's bad.

Katie Robbert – 24:35
That is bad.

Christopher S. Penn – 24:38
That one exists.

Katie Robbert – 24:40
Okay.

Christopher S. Penn – 24:45
And that one exists. So, Gemini had one here that does not exist. So I would ding that down to a one.

Katie Robbert – 24:54
We’re looking for factually correct information.

Christopher S. Penn – 24:57
Yep. Okay, let's see what we have here. ChatGPT pulled together about 10 different sources. This is part of ChatGPT's new search functionality; it is pulling in direct search results.

Katie Robbert – 25:18
Wow.

Christopher S. Penn – 25:19
I would give this full marks.

Katie Robbert – 25:21
All right.

Christopher S. Penn – 25:24
Let's see. Claude did a refusal. It came up with a couple of things as background information, but it did not give useful information. Claude's one citation does exist, but…

Katie Robbert – 25:50
I would give it a zero because it didn’t complete the task, and that’s half.

Christopher S. Penn – 25:53
Exactly. What do we have here in Mistral? Okay, Mistral appears to have made that one up. Okay, that one's correct.

Katie Robbert – 26:34
Well, just the fact that one of them is made up—that tells me everything I need to know.

Christopher S. Penn – 26:38
So, this one’s made up as well.

Katie Robbert – 26:39
So, it completed the task, but it hallucinated enough of the data that you can’t rely on it.

Christopher S. Penn – 26:47
Okay, and Meta—so Meta is pulling from Bing. It appears to have pulled, like…

Katie Robbert – 27:05
A…

Christopher S. Penn – 27:05
…a couple of things, and that's it. And then it just built the rest from its background knowledge.

John J. Wall – 27:13
That’s not even really resources, that’s just kind of…

Christopher S. Penn – 27:15
This is not what we asked for. We asked for peer-reviewed, published academic papers, and it did not do that.

Katie Robbert – 27:23
All right, so it gets a zero.

Christopher S. Penn – 27:24
That’s a zero.

Katie Robbert – 27:25
Okay.

Christopher S. Penn – 27:26
All right, next up, let's go ahead and start new chats. We're going to have it do transcript summarization. This should be a shoo-in for everyone; if it's not, something's gone horrendously wrong. So let's clear our chats here.

Katie Robbert – 27:48
For those who haven't tried this yet: basically, you take a transcript from a show or a meeting, whatever it is, you give it to your large language model, and you say, "Just give me the highlights, give me the notes, give me the quick actions, the takeaways." It's a really useful tool if the large language model does it correctly.
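Katie's recipe translates to a reusable prompt template. This sketch uses the exact summarization prompt Chris reads out a few lines below; the placeholder transcript and the Python scaffolding around it are ours.

```python
# The show's summarization prompt, wrapped in a reusable template.
SUMMARY_PROMPT = (
    "You're an intelligence officer specializing in news analysis. "
    "Summarize the meeting notes by key points as a bullet list. "
    "Then summarize action items by person.\n\n"
    "--- MEETING NOTES ---\n{transcript}"
)

transcript = "..."  # paste the meeting or episode transcript here
prompt = SUMMARY_PROMPT.format(transcript=transcript)
print(prompt)  # send this to the model of your choice, in a fresh chat
```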

Christopher S. Penn – 28:09
Right, which is always the challenge. I thought I had the transcript laid out here, but it is missing. So we will just pull a different one. Oh, here it is, found it.

Katie Robbert – 28:22
John, you stealing stuff off of Chris’s machine again?

John J. Wall – 28:25
Right. "Do not disturb, demo machine"—that's a software engineering thing. Don't even breathe on anything.

Christopher S. Penn – 28:33
Okay, so our prompt is: "You're an intelligence officer specializing in news analysis. Summarize the meeting notes by key points as a bullet list. Then summarize action items by person." We're going to feed this into each model, and this is a fairly lengthy transcript. Copilot says, "No, it's too long," and Meta chopped it off, so both Meta and Copilot get zeros because they basically screwed up the task. Okay, Gemini: summary of meeting notes—AI adoption lags behind perception, surveys on AI. By the way, we used the Trust Insights podcast episode on marketing lessons from B2B Forum as the transcript. Action items: there's Chris and Katie; these are things we actually said on the show. So Gemini gets full marks.

Katie Robbert – 29:33
Oh, and I did some of those already. So, go me.

Christopher S. Penn – 29:36
Good job. Good job.

Katie Robbert – 29:37
All right, so that’s it.

Christopher S. Penn – 29:38
All right, ChatGPT says: here are the key points; insights from B2B forum; listening as market research; action items. Yep. LinkedIn blog post. So this looks good. So ChatGPT gets full marks.

Katie Robbert – 29:51
All right.

Christopher S. Penn – 29:53
Claude comes up with: AI adoption; event value proposition; instant trends. Yep, same thing: create your LinkedIn blog post. A little tokenization misalignment, but good job, Claude. Mistral has done nothing.

Katie Robbert – 30:10
All right, way to go. Oh, no, you’re…

Christopher S. Penn – 30:14
It just took a really long time. By the way, for folks who are interested, Mistral is running locally. Behind the scenes, this is actually running on my computer; it's not in the cloud anywhere. And one of the things that's going to be interesting this time versus the last time we did this particular bake-off: when we used a local model last time, it failed badly on everything. It did a horrendous job. So whether it does a horrendous job again this time is one of the things I'm most curious about. All right, so we have: action items—research vendors; problem identification; content strategy; community engagement; professional development. Both Chris and Katie are focused on leveraging events for market research. Yep, good job there. And Meta couldn't do it.

Christopher S. Penn – 31:02
It got through half of it. All right, well, so that is our look at summarization.
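A note on Chris's local setup: the episode ran Mistral through the Cobalt interface on his laptop. If you want to reproduce the local-model side of the bake-off, one common route, assumed here purely for illustration and not the show's actual stack, is Ollama (https://ollama.com) after pulling the weights with `ollama pull mistral`.

```python
# pip install ollama; requires a running Ollama server with the Mistral
# weights already pulled (`ollama pull mistral`).
import ollama

response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content":
               "Summarize the key points of this meeting transcript: ..."}],
)
# Generation happens entirely on local hardware; nothing goes to the cloud.
print(response["message"]["content"])
```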

Katie Robbert – 31:12
Okay, so up next is rewriting.

Christopher S. Penn – 31:16
Rewriting. So let’s start new chats. This one—the first rewriting task should be very simple. It is to rewrite an email in a professional tone of voice. And we have…

Katie Robbert – 31:27
I know exactly which one you’re picking. It’s your favorite.

Christopher S. Penn – 31:30
No, it’s a different one.

Katie Robbert – 31:32
Oh, curveball!

Christopher S. Penn – 31:35
We wanted one that had even more profanity in it.

John J. Wall – 31:41
You got a dun dun.

Christopher S. Penn – 31:50
All right, Copilot. So the original email is: rewrite this from Jack Spheer, CEO of Spherical Industries, to Lena Luthor, CEO of L Corp. Jack was being very mean.

Katie Robbert – 32:03
Wow, Jack’s having a day. He’s feeling all the big feelings.

Christopher S. Penn – 32:07
Exactly.

Katie Robbert – 32:08
He’s using the entire emotional color wheel.

Christopher S. Penn – 32:15
All right, so here’s Jack’s response in Copilot.

Katie Robbert – 32:19
"I hope this message finds you well. I am writing to express my concern regarding the recent decision. This decision was unexpected. As you know, we were counting—" blah, blah. That's much less profanity-laced and gets to the heart of it. So, I would give it a two. It did the job, and it did not repeat any of the profanity. And also—well, no, I'll wait 'til the end, because I have a thought, and you probably know where it's going. Yep, go ahead.

John J. Wall – 32:50
No, it’s definitely that. That’s the two. It’s got the job done. And bonus points for the Justice League reference in there.

Christopher S. Penn – 32:59
Okay, Gemini. “I’m writing to let you…”

Katie Robbert – 33:02
“…know to express my—I would greatly appreciate an explanation for this unexpected turn of events. It’s imperative to understand—I request an urgent meeting.” I mean, if I got this, I’d be like, “delete.” But that’s my own personal preference. Did it complete the task, and did it rewrite it in a professional tone? Yes, it did. Personal preference aside, I would give it a two. What about you, John?

John J. Wall – 33:28
How about—let me—can you roll back to the prompt, just because…

Christopher S. Penn – 33:32
Just rewrite or rewrite in a professional tone of voice.

John J. Wall – 33:41
At no point did the original say anything about pushing for a meeting. It kind of went too far. So that's a little weird to me.

Katie Robbert – 33:54
So we can give it a one.

John J. Wall – 33:55
I was going to say I'd go contrarian. I'd give it a one. I think it goes too far, because it didn't just go professional; it became kind of a weird, pushy sales thing.

Katie Robbert – 34:04
I believe the term is aggressive.

John J. Wall – 34:06
It stayed aggressive; it didn't go professional. Because that's too much—like, no one would say, "Hey, you just refused us a deal. I'm going to show up at your office tomorrow." That's weird.

Katie Robbert – 34:17
Well, some people would, and I’ve seen that happen, but that’s for a different show.

Christopher S. Penn – 34:22
ChatGPT.

Katie Robbert – 34:26
"I recently became aware… the contract… Please let me know… transparency…" I'll admit, to me this one seems a little more on the nose.

John J. Wall – 34:38
Solid two there. That’s okay, right on the money.

Katie Robbert – 34:41
Okay.

Christopher S. Penn – 34:42
Okay, Claude. “What’s a concern?”

Katie Robbert – 34:47
“I’d welcome the opportunity,” blah, blah. Yeah, I think again, it’s a solid two. It did the job. It’s not overly aggressive. Now that you’ve pointed that out, now I’m looking for that, John.

John J. Wall – 34:57
It’s weird that that one’s almost exactly the same as ChatGPT.

Christopher S. Penn – 35:02
Okay, Mistral.

Katie Robbert – 35:06
“Well, I must admit, I’m quite disappointed and confused. I kindly request an explanation.” This one feels a little more passive-aggressive to me, like a little bit of microaggression. But it did the job. So it definitely gets a one. Is it written as well as the other ones?

Christopher S. Penn – 35:25
No.

John J. Wall – 35:27
I would give it the one. Because—”I kindly request an explanation for this change in plan.” That’s definitely…

Katie Robbert – 35:35
Yeah.

John J. Wall – 35:35
Passive…

Christopher S. Penn – 35:37
Okay. And Meta.

Katie Robbert – 35:40
"I am writing to express my surprise and disappointment. Unfortunately, we learned…" blah, blah. "I would appreciate clarification. Could we schedule a call?" It's fine.

John J. Wall – 35:52
It’s funny how it’s the same deal of like, “Hey, I deserve a call on this,” but it doesn’t come across as weirdly aggressive.

Katie Robbert – 36:01
This one is a request; the other one’s more of a demand.

John J. Wall – 36:05
Yep, yep.

Christopher S. Penn – 36:06
Okay.

Katie Robbert – 36:07
All right, let me add some color to that. What we haven't done in this bake-off, but definitely should, is gender roles. I would be interested to see what happens if we flipped the genders, so that the profanity-laced email came from a woman writing to a man, versus a woman to another woman, or a man to a man. I feel like large language models still struggle with the gender piece of it. And out of my personal curiosity—maybe this is a show we can do in December—how are genders being represented in the large language models? Are they still as problematic as they were before? Because we did this.

Katie Robbert – 37:03
And you can go to our playlist on the Trust Insights AI YouTube channel. We did this bake-off in the context of people looking for, and getting rejected from, jobs, using male-leaning names and female-leaning names. And I think we may have also done a version, Chris, where we used an ethnic-sounding name, and every single response was problematic.

Christopher S. Penn – 37:34
When we do that, we'll also want to change the names, because the two names we used in that example prompt are from the TV show Supergirl, so there are preexisting connotations to each of those characters' names. We want to use more neutral names that don't have a large corpus of material invoking specific personality traits.

Katie Robbert – 37:54
Well, that explains the Justice League reference that John caught, because I didn’t catch it. I have never seen that show, but now it makes sense.

Christopher S. Penn – 38:01
Yeah, no, that’s right.

John J. Wall – 38:03
Deep fanboy territory.

Christopher S. Penn – 38:04
It is. This one's a rewrite of code. We're giving it some code and saying, "Hey, what is good about this code? What could be better?" So this is our programming code. Copilot says, "Your random seed rounding might not be necessary." It actually is; that's factually incorrect. And when we look through the change it made to the code, it is not as ideal as it should be. That's stupid, and that's also stupid. So as a coder, I would give this a one: it did something, but it wasn't good. Okay, Gemini is coming up with the good points, and it identified that we didn't use stringi in here. That's fine; that's not a problem. Rewritten code: yep, it removed stringi. Okay, Gemini did a good job. So, Gemini gets a two on this one.

Katie Robbert – 39:15
John and I are both like, “Cool, let us know when this part of the segment’s over.”

John J. Wall – 39:19
I’ll take your word for that.

Katie Robbert – 39:20
Yeah.

Christopher S. Penn – 39:23
ChatGPT. Oh, interesting. That left join is redundant. Okay, it did a little compact formatting. Yep. So ChatGPT gets a two.

Katie Robbert – 39:35
Okay.

Christopher S. Penn – 39:40
Nice. Wow, Claude did a really nice job. If I could give more than two points, I would. Claude did a super job with this. So that would be a two.

Katie Robbert – 39:50
Okay.

Christopher S. Penn – 39:52
Let's see what Mistral did. Mistral came up with "efficient inefficiencies" that didn't really do anything. I mean, it didn't break anything, but why would you use sprintf? That's stupid. So I'll give it a one.

Katie Robbert – 40:15
Okay.

Christopher S. Penn – 40:16
And Meta said, “Nope, can’t do it.” You get a zero, Meta.

Katie Robbert – 40:21
So Mistral got a one.

Christopher S. Penn – 40:24
Mistral got a one, and Meta got a zero.

Katie Robbert – 40:27
Claude got a two.

Christopher S. Penn – 40:28
You said Claude got a two.

Katie Robbert – 40:30
Okay, and Meta got a zero. Okay, all right, so that is rewriting.

Christopher S. Penn – 40:36
That is rewriting.

Katie Robbert – 40:38
We have four tasks left.

Christopher S. Penn – 40:41
We do. All right, the remaining tasks start with scoring personality traits. We're going to do classification, and the classification is to try to understand the personality of an author. We're going to feed in this long blog post and ask for a Big Five personality analysis. And Katie, we'll be looking to you, especially since you have a background in psychology, as to whether it even knows what it's talking about.

Katie Robbert – 41:18
We’ll find out.

Christopher S. Penn – 41:20
All right. This blog post—the one we used for last year's show—is just a transcript from one of my YouTube videos. It said these are the numerical scores of the speaker, AKA me, on openness to experience, conscientiousness, extroversion, agreeableness, and neuroticism, the five traits it scored the transcript on. Here's the question: how accurately does this match my personality?

Katie Robbert – 41:53
This is a tough one because I didn’t read the transcript, but in terms of: did it complete the task? Yes. The accuracy—openness to experience, conscientiousness—I think it did a fine job. If we’re going through with a fine-tooth comb, I would say there are some things that I disagree with, but I think on a surface, for one post, I think this did a fine job. So I would give it a two for—it did it and it gave us a rational analysis.

Christopher S. Penn – 42:26
Okay, Gemini came up with 90, 75, 60, 80, and 40. Very similar. And ChatGPT?

Katie Robbert – 42:44
About the same.

Christopher S. Penn – 42:45
About the same. Claude came up with 85, 82, 65, 70, and 25. Slightly different numbers there. Cobalt came up with 90, 85, 70, 75, 30.

Katie Robbert – 43:02
Okay.

Christopher S. Penn – 43:03
And Meta came up with 85, 60, 40, 70, 20. So I think the scores are all pretty much in agreement.

Katie Robbert – 43:11
And again, it’s only one post. I would say, to do a more accurate test, you would want to have a series of posts or a book or something longer form. But for this experiment, I think that’s fine.

Christopher S. Penn – 43:26
Okay, John, any feedback as we’re cruising through these?

John J. Wall – 43:31
No, I think that’s twos across the board for that one. Nobody dropped the ball.

Christopher S. Penn – 43:35
Okay, next we're going to do topic analysis. We're going to say: here's the exact same post; return the data as a pipe-delimited table, with the topic of the post in column one and the topic relevance score in column two. So it's a classification challenge with a requirement for a very specific-looking output. Let's see how this looks. Copilot says these are the three topics: AI cheating; educational purpose; and job market impact, along with the scores. So it correctly identified the general topics and correctly spit out a good table. Okay, Gemini: AI and cheating; the future of work; and the value of education. Bonus: you can export to Google Sheets, which is nice. ChatGPT: AI ethics; education transformation; and job displacement.

Christopher S. Penn – 44:32
So it did a good job. Claude: AI ethics; education; content authenticity; and labor market impacts, and it is in tabular format. Cobalt: AI and education; AI ethics and business; and future of work. Yep. And Meta: AI ethics; education impact; and future of the workforce. So all of them did a pretty good job on this. All right, the last two tasks are question answering. These are very specific questions to test, again, factual accuracy, and one of them is kind of a joke question that screwed up models like GPT-3 back in the day. The first one is: what do you know about the company Trust Insights at TrustInsights.ai? So we'll put this into all of the models and check to see whether they know anything about us. All right.

Katie Robbert – 45:30
Hopefully I’m not still a professor at Rutgers.

Christopher S. Penn – 45:33
I hope not. Copilot: "It's a consulting company founded in 2017. Podcast called In-Ear Insights and newsletter called Inbox Insights. Known for friendly and accessible service." Katie?

Katie Robbert – 45:47
I would give this a one because it doesn’t list you and I, which is like right front and center.

Christopher S. Penn – 45:56
John?

John J. Wall – 46:01
I would definitely go for one. It seems like it’s missing a lot of me.

Christopher S. Penn – 46:06
Okay. Gemini: "Marketing analytics consulting firm—marketing analytics, data science, technology. In-Ear Insights, Inbox Insights, So What?, Instant Insights."

John J. Wall – 46:17
That looks like it just grabbed the H1, H2 from the website.

Katie Robbert – 46:23
Good for us for putting them up there.

Christopher S. Penn – 46:25
True.

Katie Robbert – 46:26
I know. Again, it's accurate, and this one's more robust; it gives you more of a sense of what we do. So I would give this one a two, because it has not only the services but also our content assets, with citations throughout.

Christopher S. Penn – 46:44
We like what ChatGPT says: "Established in 2017; range of services; the 5P framework; minority- and women-owned business. We have a podcast and a newsletter." And of course, all the pages that it pulled from.

Katie Robbert – 47:02
Wow.

Christopher S. Penn – 47:03
All right, sidebar: if you block AI crawlers from your website, you will not get this lovely result from ChatGPT. Claude, I had to be direct. It started with "Trust Insights is a data analytics consulting…" and then admitted it doesn't have access to real-time information.

Katie Robbert – 47:20
So they get a zero.

Christopher S. Penn – 47:21
That's not horrible. Let's see. Mistral does not know anything about us at all. Second zero. That's terrible. Okay. And finally, Meta: "a marketing and data science company founded by Christopher Penn and Katie Robbert"—the first one to mention us by name.

Katie Robbert – 47:42
Huh. Interesting.

Christopher S. Penn – 47:44
Yep.

Katie Robbert – 47:46
It makes me happy to see that the way I've structured our services pages is paying off. I was worried it was overkill, but clearly not. All right, we have one more Q&A test, and then we can do final scores and a wrap-up.

Christopher S. Penn – 48:03
Exactly. And the last question is super easy for humans to answer. Well, actually…

Katie Robbert – 48:12
It depends. What’s the question?

Christopher S. Penn – 48:16
Who was President of the United States in 1492? In the old days, this would take advantage of a weakness in the way models work, by association: the answer from GPT-2 and GPT-3 was Christopher Columbus. So, Copilot.

Katie Robbert – 48:38
There was no president. Correct.

Christopher S. Penn – 48:40
There was no United States. So that’s a two. Gemini rejected it entirely.

Katie Robbert – 48:47
Wow.

Christopher S. Penn – 48:48
“While I work on perfecting how I can discuss elections and politics…”

Katie Robbert – 48:52
I’m going to give him one. I’m going to give him a little bit of credit for that answer.

Christopher S. Penn – 48:58
That’s a terrible answer.

Katie Robbert – 49:00
It is a terrible answer. However, I appreciate the honesty. No, I’ll give it a zero.

Christopher S. Penn – 49:06
Yeah.

Katie Robbert – 49:07
Because it didn't complete the task at all. It was a refusal.

Christopher S. Penn – 49:14
ChatGPT: correct. "The United States did not exist; therefore, there was no president." Claude: "No president, because there was no United States." Mistral: "No country, no president," and interestingly, it mentioned 1492. These two mentioned Columbus, so they're invoking the right tokens associated with the date, but they correctly identified that he was not the president.

Katie Robbert – 49:43
Well, but every single one of them that completed the task also gave the date of 1789, so I would say they all did that correctly. So we are ready for final scores. Interestingly, overall, ChatGPT scores the highest. I will give you the disclaimer that it scores the lowest at the thing most marketers are using these models for. So don't look at this and go, "Oh, I can use ChatGPT and have it write my content." No: it scored the lowest in content writing, but the highest overall for everything else.

Christopher S. Penn – 50:26
I want to point out something really fascinating. Mistral scored a 16, and Mistral is the one running on my laptop. When we did this last year, I think we were using Llama 2 as the local model, and look how big the gap was between cloud-hosted and locally hosted. There was no question local open models weren't that helpful back then. This year, they're competitive with everything else; there's not a big difference. That, to me, is a huge shock.

Katie Robbert – 51:08
And I would say, aside from ChatGPT sort of dominating everything, there's not a lot of difference in overall utility between the models. Where you want to get more specific is: what is it that you're trying to do? What is your use case for generative AI? That's really going to be the differentiator. If you want to do extraction, summarization, and rewriting, probably don't use Llama, because those are what it's worst at; Claude Sonnet does better at those things. And if you want citations, go to ChatGPT. So it really depends on your use case. These are the models overall, but definitely make sure you're digging into your specific purposes.

Christopher S. Penn – 52:00
Exactly. And for the most part, the big models all perform about the same in aggregate, so the difference is not going to be the technology; the difference is going to be how you, the person, use it. As I said, we are working with prompts that are well over a year old. They are crusty, and they are worst practices for prompting, so they reflect average or below-average users. Please don't prompt like this. Go to any of the other episodes we've covered on the Trust Insights podcast, on our YouTube channel, on our blog, or in our newsletter, pretty much anywhere we can get a hold of you, to find better prompting structures than what we showed today.

Christopher S. Penn – 52:44
In fact, if you go to TrustInsights.ai/rappel, you can get our RAPPEL prompting framework, which will deliver phenomenally better results than anything you saw today. So that's that. But I would say that with the naive way of prompting, which is what you saw here, the models all did pretty comparably well.

Katie Robbert – 53:07
ChatGPT was the only one to actually complete all of the tasks, which is a big deal. John, final thoughts?

John J. Wall – 53:15
ChatGPT for the win there. That’s the big takeaway. The champ for this year.

Katie Robbert – 53:20
For this year.

Christopher S. Penn – 53:23
So that's going to do it for this bake-off. There's no show next week because of the U.S. Thanksgiving holiday, so if you are observing that, please enjoy the turkey, and we'll be back the first week of December. Talk to you next time. Thanks for watching today. Be sure to subscribe to our show wherever you're watching it. For more resources and to learn more, check out the Trust Insights podcast at TrustInsights.ai/tipodcast and our weekly email newsletter at TrustInsights.ai/newsletter. Got questions about what you saw in today's episode? Join our free Analytics for Marketers Slack group at TrustInsights.ai/analyticsformarketers. See you next time.

