Two weeks ago, I tested 5 AI models to find the best one for account research. ChatGPT's o3 model won.
Since then, OpenAI and Anthropic have released new models:
OpenAI: GPT-5, GPT-5 Thinking, and GPT-5 Pro
Anthropic: Claude Opus 4.1
Time for a rematch.
What I'm testing today: 6 models total:
GPT-5 Thinking (NEW)
Claude Opus 4.1 (NEW)
ChatGPT o3 (Previous winner)
Gemini 2.5 Pro
Grok 4
Perplexity Deep Research
We're not testing the deep-research modes that take 10-30 minutes to run (Perplexity's Deep Research finishes in 2-3 minutes, so it stays in), or the Pro models, because we want models that balance both power and speed.
Btw why am I doing this and sharing this with you? Because:
Each model is great for specific use cases.
I measure the quality of the model before scaling it with Clay or n8n via the API (see the sketch below).
Why account research first? (before writing) Because your message quality equals the quality of the data about your accounts.
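To make that scaling step concrete: once a model passes the manual test, the same research prompt runs via the API from n8n or a Clay HTTP column. Here's a minimal sketch using the OpenAI Python SDK; the prompt wording and the `research_account` helper are my own illustration, not my exact setup:

```python
# Minimal sketch of the "scale via API" step, assuming the OpenAI Python SDK.
# The prompt text and helper are illustrative, not my exact Clay/n8n setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def research_account(company: str) -> str:
    prompt = (
        f"Research {company} for outbound: tech stack (with sources), "
        "account map (champion + economic buyer), red flags, and a "
        "problem-solution alignment map for our product."
    )
    response = client.chat.completions.create(
        model="o3",  # swap in whichever model won your manual test
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(research_account("Windsurf"))
```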
We're going to evaluate them based on:
Web access (can it find current information?)
Reasoning (can it connect the dots?)
Additional insights (does it surface non-obvious intel?)
Note: To access o3 in ChatGPT, enable legacy models in your settings.
Let's run this new test on Windsurf
Perfect timing for this test. Windsurf just went through massive changes:
Google paid $2.4B to license their AI coding technology
Leadership team (CEO Varun Mohan, co-founder Douglas Chen) joined Google DeepMind
Company acquired by Cognition
Let's see which models catch these critical updates.
Response Times:
ChatGPT o3: 1 min 10 sec ⚡
Gemini 2.5 Pro: 1 min 3 sec ⚡
Claude Opus 4.1: ~2 min
Grok 4: 2 min 6 sec
Perplexity: 2-3 min
GPT-5 Thinking: 3 min 50 sec 🐢
1. GPT-5 Thinking
Pros:
Returned the complete tech stack with sources. (Tip: use words from the RevOps job description to find pain points in their tech stack.)
Account mapping
Good account mapping, identifying RevOps as the champion and Graham as the economic buyer
Red flags
Strong reasoning on red flags.
The only model that found the hiring pause
Problem-Solution Alignment Map
Strongest problem-solution alignment map, with 5 specific challenges addressing the leadership change and reorg after the Google deal and Cognition acquisition.
Cons:
Outbound Angle
Outbound angle was okay, but it didn't mention the Google deal or the recent acquisition as the main POV for August 2025
Score: 9/10
Strongest model in this test. I was surprised by the output quality: much more consistent than at launch. Best web access, reasoning, and strategic insights.
Only model that found the hiring pause and leadership changes post-acquisition
Strongest problem-solution mapping with 5 specific challenges
Best account mapping (identified RevOps champion + economic buyer correctly)
Trade-off: Slowest at 3 min 50 sec
2. Claude Opus 4.1
Pros:
Found the 2 biggest red flags, with interesting reasoning on the budget freeze
Outbound Angle
Good competitive market understanding in outbound angle
Cons:
Tech Stack
Same as last time, it failed at tech stack discovery: couldn't return any tools
Account mapping
Claude did an okay job here. It found Graham, but he's not the VP of worldwide sales anymore. No RevOps leadership identified.
Outbound angles
Made assumptions without online data for the outbound angles
Problem-Solution map
Sources without links in the problem-solution map.
The last 2 problems aren't super relevant for Clay.
Score: 5/10
No improvement from Opus 4.0.
Same limitations with web access: Failed at tech stack discovery (consistent weakness)
Decent reasoning on budget freeze but missed key executive exits
3. ChatGPT o3 (previous winner)
Pros:
Tech stack
Returned tech stack with sources + identified gaps in the stack
Outbound Angle
Found strong problems/challenges for Windsurf and included them in the outreach angle
Problem-Solution Map
Found strong problems/challenges for Windsurf
Cons:
Account mapping
Account mapping not as good as GPT-5 Thinking's; it couldn't detect Graham's job change
Red flags
Found only one red flag (acquisition) but missed the Google deal
Score: 7/10
Still solid for API calls with n8n/Clay. Good web access but less comprehensive reasoning than GPT-5 Thinking.
Strong tech stack analysis with sources and gap identification
Good problem detection but missed recent Google deal implications
Fastest quality output at 1 min 10 sec
Still solid for API calls via n8n/Clay
4. Gemini 2.5 Pro
Pros:
None this time
Cons:
Tech Stack
Tech stack without source links, so it can't be verified. It found Salesforce. The reasoning piece is interesting, but not relevant if it's not using the right data.
Account Mapping
Major hallucination: Created fake CRO "Elena Rodriguez" who doesn't exist
Red Flags
No red flags found.
Outbound Angle
Referenced earnings calls even though the company isn't public
Problem Solution Map
Hallucinated a CFO ("David Chen") and a fake Q2 earnings call
Score: 0/10
Complete failure. Cannot be trusted without source verification. Major quality drop from previous testing.
3 critical hallucinations in one test (fake CRO, fake CFO, fake earnings call)
Inconsistent performance vs previous testing
Cannot be trusted without source verification
5. Grok 4
Pros:
Found 1 red flag: acquisition by Cognition
Problem-solution alignment map
Did an okay job here: found 4 problems, but the identification was generic, not specific enough.
Cons:
Tech Stack
Failed at the tech stack; it assumes instead of finding actual data
Account mapping
Same as last time. Can't return relevant prospects for account mapping
Outbound Angles
Poor reasoning (suggested SDR hiring during hiring freeze)
Score: 2/10
Same as last time.
Appears to lack proper web access.
Can't find tech stack, assumes instead of researching
Poor reasoning (suggested SDR hiring during hiring freeze)
Generic problem identification
6. Perplexity Deep Research
Pros:
Account mapping
Good account mapping, understanding the leadership transition
Outbound Angle
Strong outbound angle using current business priorities
Cons:
Tech Stack
Same as last time, it failed at the tech stack task: no sources, irrelevant data
Red flags
Didn't return any red flags
Problem-Solution Map
Problem-solution map not relevant to Clay's value prop
Score: 5/10
Limited web access capabilities. Better at reasoning than data discovery.
Good at understanding leadership transitions
Strong outbound angle using current business priorities
Consistently fails at tech stack discovery
The results
Last test results:
Here are the 2 models that I'm currently using.
🥇 Winner: GPT-5 Thinking (9/10)
Only model that caught the hiring pause - critical intel for outreach timing
Connected leadership changes to business challenges
Best reasoning capability linking Google deal → Cognition acquisition → organizational impact
Most comprehensive web access with verified sources
The standout: when analyzing red flags, it didn't just list them; it explained how the leadership exodus (Varun and Douglas leaving) creates both risk and opportunity for vendors.
🥈 2nd: ChatGPT o3 (7/10)
Fastest quality output at 1 min 10 sec (vs 3 min 50 sec for GPT-5 Thinking)
Solid tech stack discovery with gap analysis
Still reliable for API automation workflows
Missed the critical context about the company's current state (Google deal, leadership changes) that makes outreach timely.
Use GPT-5 Thinking for high-value account research where accuracy matters. Keep o3 for volume plays and API workflows where speed is critical. Avoid Gemini 2.5 Pro completely until hallucination issues are resolved.
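If you're encoding that recommendation into an automation, it boils down to a simple routing rule. A hedged sketch: the function name, thresholds, and model labels below are my own assumptions, not exact API identifiers:

```python
# Illustrative routing rule based on the scores above. Thresholds and
# model labels are assumptions, not exact API identifiers or a spec.
def pick_model(account_value: int, speed_critical: bool) -> str:
    if account_value >= 50_000 and not speed_critical:
        return "gpt-5-thinking"  # high-value research where accuracy matters
    return "o3"  # volume plays and API workflows where speed wins
    # Gemini 2.5 Pro is deliberately not an option here until the
    # hallucination issues are resolved.

print(pick_model(account_value=80_000, speed_critical=False))  # gpt-5-thinking
print(pick_model(account_value=5_000, speed_critical=True))    # o3
```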
What's your go-to model for account research right now?
Want the prompt I used for this test?
And the 30 other prompts that I already created? Upgrade now: