GPT vs Gemini vs Claude vs ClaraT: Which AI Model Performs Best for Lead Grading in Home Services?

A 6,000-Transcript Benchmark Study (2026)

As AI adoption accelerates across sales and marketing teams, many home service operators are evaluating large language models (LLMs) for automated lead grading: determining whether a lead was bookable, why unbookable leads couldn't be booked, and why bookable leads didn't book at the time of contact.

But which AI model actually performs best for structured lead classification?

To answer that question, we evaluated six AI models across 6,000 real home service lead transcripts:

    • GPT-4o-mini

    • GPT-5.2

    • Gemini 3 Flash

    • Claude Opus 4.6

    • Claude Sonnet 4.6

    • ClaraT™ (SearchLight’s domain-trained model)

This benchmark measures structured classification accuracy, not summarization quality, sentiment analysis, or general reasoning ability.

Why does this matter for you, the contractor? 

For many home service companies, adopting off-the-shelf LLMs is a powerful first step toward AI-driven operations. But as lead grading becomes embedded into attribution models and budget decisions, the remaining accuracy gap between general and domain-optimized systems becomes increasingly material, a point we examine quantitatively in a later section of this article.


What Is Lead Grading in Home Services?

Lead grading is the process of evaluating inbound customer inquiries (phone calls, web forms, and chats) to determine:

    • Whether the lead was bookable

    • Why it was unbookable (if applicable)

    • Why a bookable lead did not convert

    • Which service or business unit the inquiry belonged to

Until recently, this evaluation was usually done by humans, which was expensive and hard to scale. LLM advancements have cut the cost of grading leads continuously, which helps in both marketing and operational contexts:

    • Extract negative keywords and confirm service areas for paid ad campaigns (by measuring out-of-service-area and service-not-offered contacts)

    • Enhance book-rate analysis by channel: is a channel's low book rate caused by low-quality leads or by poor lead handling?

    • Surface CSR performance and coaching opportunities

    • Improve the bookable-to-booked rate by monitoring bookable conversions that slip through the cracks

    • Plan and allocate budget: knowing the percentage of bookable conversions helps you plan marketing spend against the appointments needed to hit your business goals (see the sketch after this list)

    • Track missed calls

    • Feed bookable conversions back to ad platforms so their algorithms can target higher-intent individuals for your business
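To make the budget-planning point concrete, here is a minimal sketch of working backward from a revenue goal to required conversion volume. Every number in it is an illustrative assumption except the $35.88 cost per conversion, which is the advertising figure cited later in this article.

```python
# Minimal sketch: working backward from a revenue goal to required lead volume.
# All values are illustrative assumptions except cost_per_conversion.

monthly_revenue_goal = 300_000   # assumed revenue target ($)
average_ticket = 5_000           # assumed revenue per closed job ($)
close_rate = 0.60                # assumed in-home close rate
booking_rate = 0.80              # assumed bookable-to-booked rate
bookable_rate = 0.65             # assumed share of conversions that are bookable
cost_per_conversion = 35.88      # advertising cost per conversion ($)

closed_jobs = monthly_revenue_goal / average_ticket         # 60 jobs
appointments = closed_jobs / close_rate                     # 100 appointments
bookable_leads = appointments / booking_rate                # 125 bookable leads
conversions_needed = bookable_leads / bookable_rate         # ~192 conversions
budget_estimate = conversions_needed * cost_per_conversion  # ~$6,900

print(f"Conversions needed per month: {conversions_needed:.0f}")
print(f"Estimated ad budget: ${budget_estimate:,.0f}")
```

Note how an inaccurate bookable_rate propagates directly into both the conversion target and the budget estimate, which is why grading accuracy matters for planning.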

As AI becomes more embedded into these workflows, grading accuracy directly influences how contractors measure performance, allocate marketing dollars, and prioritize operational changes needed to improve business outcomes.

Artificial intelligence is already transforming how contractors evaluate inbound leads, and general-purpose LLMs represent a meaningful operational upgrade for many teams. 

However, as grading systems begin influencing marketing spend and revenue forecasting, even small accuracy differences can compound, a dynamic we quantify later in this report.


Executive Summary

This study evaluates structured lead grading performance across six AI models (LLMs) using 6,000 real home service transcripts (phone, form, and chat).

The evaluation measured:

    1. Detailed conversion label accuracy (bookable vs. unbookable)
    2. Unbookable reason classification accuracy (e.g. out of service area)
    3. Lost (bookable but not booked) reason classification accuracy (e.g. scheduling/availability)

To ensure a fair, apples-to-apples comparison, all models were evaluated using:

    • A frozen, standardized taxonomy

    • Identical schema-constrained prompts

    • Exact-match scoring

Key Findings

    • General-purpose LLM accuracy across the three categories ranged from 68.69% to 89.75%

    • The highest-performing general LLM plateaued just under 90% in determining bookable vs. not bookable

    • ClaraT™ achieved 98.05% structured classification accuracy

    • ClaraT™ ranked #1 across all evaluated metrics

    • The performance gap between domain-trained and general models ranged from roughly 8 percentage points (bookable determination) to more than 20 points (unbookable reason classification)


What Was Measured?

The accuracy of structured classification across:

    • Bookable vs. unbookable determination

    • Root cause identification for unbookable conversions

    • Lost reason classification for bookable but unbooked conversions

    • Multi-business-unit intent disambiguation

Accuracy was defined as an exact match against a standardized taxonomy.

The models had to select the correct predefined category.
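The study's scoring code isn't published; the following is a minimal sketch of exact-match scoring as described, using hypothetical labels. A near-miss prediction scores zero, the same as a wild guess.

```python
# Exact-match scoring: a prediction counts only if it equals the
# ground-truth label exactly; there is no partial credit.
def exact_match_accuracy(predictions: list[str], labels: list[str]) -> float:
    assert len(predictions) == len(labels), "one prediction per labeled transcript"
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Hypothetical three-transcript example:
preds = ["Bookable", "Outside service area", "Bookable"]
truth = ["Bookable", "Service Not Offered",  "Bookable"]
print(exact_match_accuracy(preds, truth))  # 0.666... - the near-miss scores zero
```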


Dataset Overview

    • 6,000 real home service lead transcripts

    • Channels: inbound phone, web forms, live chat

    • Trades: HVAC, plumbing, electrical, roofing, multi-trade

    • U.S. market

Ground Truth Methodology

    • 1,500 transcripts manually labeled by human domain experts with HVAC/plumbing operational experience (“Gold Set”)

    • Remaining transcripts were AI-assisted labeled using the validated taxonomy, with human review and auditing

    • Taxonomy frozen prior to benchmarking

    • All models were evaluated against identical classification rules


Standardized Classification Schema

To eliminate semantic drift, all model outputs were constrained to fixed reason sets:

    • 30 standardized unbookable reasons

    • 11 standardized lost reasons

Examples include:

Unbookable:

    • Outside service area

    • Service Not Offered

    • Existing / Confirming Appointment

    • Cancelling Appointment

Lost:

    • Planned Follow-up

    • Scheduling Availability

    • Pricing Concerns

    • AI to Human Transfer Required

All models, including ClaraT™, were constrained to this identical schema.
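The study's actual prompt and schema aren't published. The sketch below illustrates what a schema-constrained grading output can look like: the reason lists are truncated to the examples above, and the field names are assumptions for illustration, not the study's schema.

```python
# Minimal sketch of schema-constrained grading output validation.
# Reason lists are truncated; the real taxonomy has 30 unbookable and 11
# lost reasons. Field names ("bookable", etc.) are illustrative assumptions.
import json

UNBOOKABLE_REASONS = [
    "Outside service area", "Service Not Offered",
    "Existing / Confirming Appointment", "Cancelling Appointment",  # ... 26 more
]
LOST_REASONS = [
    "Planned Follow-up", "Scheduling Availability",
    "Pricing Concerns", "AI to Human Transfer Required",  # ... 7 more
]

def validate_grade(raw_model_output: str) -> dict:
    """Parse a model's JSON response; reject any label outside the frozen taxonomy."""
    grade = json.loads(raw_model_output)
    if grade["bookable"]:
        # None is allowed: a bookable lead that actually booked has no lost reason.
        if grade.get("lost_reason") not in LOST_REASONS + [None]:
            raise ValueError(f"off-taxonomy lost reason: {grade['lost_reason']}")
    elif grade.get("unbookable_reason") not in UNBOOKABLE_REASONS:
        raise ValueError(f"off-taxonomy unbookable reason: {grade['unbookable_reason']}")
    return grade

# A compliant response passes; anything off-taxonomy raises:
validate_grade('{"bookable": false, "unbookable_reason": "Outside service area"}')
```

Constraining every model to the same closed label set is what makes exact-match scoring a fair comparison: a model cannot earn credit by inventing a plausible-sounding category of its own.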


GPT vs Gemini vs Claude vs ClaraT: Accuracy Comparison

1. Detailed Conversion Label Accuracy (Bookable vs. Unbookable)

Rank  Model              Accuracy
1     ClaraT™            98.05%
2     Gemini 3 Flash     89.75%
3     Claude Sonnet 4.6  89.53%
4     Claude Opus 4.6    89.37%
5     GPT-5.2            89.25%
6     GPT-4o-mini        73.04%


2. Unbookable Reason Accuracy

Rank  Model              Accuracy
1     ClaraT™            97.24%
2     Claude Sonnet 4.6  76.57%
3     Claude Opus 4.6    75.83%
4     Gemini 3 Flash     74.15%
5     GPT-5.2            73.80%
6     GPT-4o-mini        68.69%


3. Lost Reason Accuracy

Rank  Model              Accuracy
1     ClaraT™            97.61%
2     Gemini 3 Flash     84.15%
3     GPT-5.2            82.41%
4     Claude Sonnet 4.6  80.40%
5     Claude Opus 4.6    79.28%
6     GPT-4o-mini        75.81%


Composite Ranking

    1. ClaraT™
    2. Gemini 3 Flash
    3. Claude Sonnet 4.6
    4. Claude Opus 4.6
    5. GPT-5.2
    6. GPT-4o-mini

Among the general-purpose LLMs, Gemini 3 Flash performed best overall in the composite ranking.

However, ClaraT™ outperformed all general models by a substantial margin.


Statistical Considerations

With a sample size of 6,000 transcripts:

    • At ~89% observed accuracy, the 95% confidence interval is approximately ±0.8%.

    • At ~98% observed accuracy, the 95% confidence interval is approximately ±0.35%.

The observed performance gap, ranging from roughly 8 points on bookable determination to more than 20 points on unbookable reason classification, is statistically significant and consistent across the evaluated dataset.

Exact-match scoring ensured deterministic evaluation under a fixed taxonomy.
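For readers who want to reproduce the interval math, here is a minimal sketch using the standard Wald approximation, plus a pooled two-proportion z-test on the bookable-determination gap (89.75% vs. 98.05%). It assumes two independent samples of 6,000; since both models were scored on the same transcripts, a paired test such as McNemar's would be the stricter choice, so treat this as a rough check.

```python
# 95% Wald intervals at n = 6,000, and an unpaired two-proportion z-test.
from math import sqrt

def wald_ci_95(p: float, n: int) -> float:
    return 1.96 * sqrt(p * (1 - p) / n)

n = 6_000
print(f"±{wald_ci_95(0.89, n):.2%}")  # ~±0.79% at 89% accuracy
print(f"±{wald_ci_95(0.98, n):.2%}")  # ~±0.35% at 98% accuracy

# Pooled two-proportion z-test for 89.75% vs. 98.05%:
p1, p2 = 0.8975, 0.9805
pooled = (p1 + p2) / 2
se = sqrt(pooled * (1 - pooled) * (2 / n))
print(f"z = {(p2 - p1) / se:.1f}")  # ~19, far beyond the 1.96 cutoff for p < 0.05
```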


Why This Matters for Contractors

AI has the potential to help transform home service businesses in many different ways. 

General-purpose LLMs like GPT, Gemini, and Claude are powerful tools. They can summarize calls, extract intent signals, and surface patterns that previously required manual review.

For many contractors, adopting off-the-shelf AI models represents a major operational leap forward.

However, structured lead grading introduces a subtle challenge:

Small accuracy gaps compound at scale.


Are Off-the-Shelf LLMs Good Enough for Contractors?

For many contractors, yes.

Moving from manual review to an 85–90% accurate AI model represents a significant improvement in efficiency and insight generation.

General-purpose LLMs can:

    • Classify basic booking intent

    • Identify common objections

    • Summarize lead outcomes

    • Reduce manual QA workload

For small-to-mid-sized operations, this level of accuracy may be sufficient.

However, in higher-volume environments where lead grading directly influences marketing attribution, channel investment, and revenue forecasting, the remaining accuracy gap of 8 to 20+ percentage points becomes increasingly important.

The appropriate solution depends on operational scale and how grading data is used.


Real Conversion Volume at Industry Scale

To understand why accuracy matters, consider recent aggregated performance data (the cost-per-conversion figures below feed the worked examples in the sections that follow).

Source: SearchLight data, January–February 2026 across 1,000+ contractors.

Advertising:

    • 992,050 conversions

    • $35.88 per conversion

Organic:

    • 725,092 conversions

    • $6.52 per conversion

Total conversions:

    • 1,717,142

At this scale, classification precision affects a substantial volume of revenue-influencing events, as seen below. 


What Does a 10% Accuracy Gap Mean in Practice?

Scenario 1: 20% Inaccuracy on 6,000 Leads

If a model operates at 80% accuracy:

6,000 × 20% = 1,200 misclassified leads

If those are advertising conversions:

1,200 × $35.88 = $43,056 in incorrectly categorized leads.

This does not imply $43,000 in direct revenue loss.

It represents conversion events that may be misattributed, misrouted, or misinterpreted in reporting systems.


Scenario 2: 89% Accuracy at Advertising Scale

If a model operates at 89% accuracy:

Error rate = 11%

992,050 × 11% = 109,126 potentially misclassified advertising conversions

109,126 × $35.88 ≈ $3.9M worth of lead value influenced by grading inaccuracy.

Again, this does not represent direct loss.

It reflects the scale of revenue events shaped by classification precision.
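Both scenarios reduce to the same arithmetic: conversions × error rate × cost per conversion. A minimal sketch:

```python
# Exposure math for the two scenarios above.
def misclassified_exposure(conversions: int, accuracy: float,
                           cost_per_conversion: float) -> tuple[float, float]:
    errors = conversions * (1 - accuracy)
    return errors, errors * cost_per_conversion

# Scenario 1: 6,000 leads at 80% accuracy, advertising cost per conversion.
errors, dollars = misclassified_exposure(6_000, 0.80, 35.88)
print(f"{errors:,.0f} misclassified -> ${dollars:,.0f} of lead value")  # 1,200 -> $43,056

# Scenario 2: advertising scale at 89% accuracy.
errors, dollars = misclassified_exposure(992_050, 0.89, 35.88)
print(f"{errors:,.0f} misclassified -> ${dollars:,.0f} of lead value")  # ~109,126 -> ~$3.9M
```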


Where Accuracy Gaps Influence Operations

Misclassification does not typically eliminate revenue.

It influences decision quality.

Accuracy gaps can affect:

    • Marketing attribution models

    • Channel ROI rankings

    • Budget reallocation decisions

    • CSR team performance evaluation

    • Multi-trade routing accuracy

    • Revenue forecasting precision

At smaller scales, these discrepancies may be negligible.

At enterprise scale, they compound.


The Importance of Industry Context

This benchmark is not an argument against general-purpose LLMs.

Off-the-shelf AI models are:

    • Accessible

    • Affordable

    • Powerful

    • Transformative compared to manual review

For many contractors, adopting general AI tools is an important and valuable first step.

The key takeaway is this:

As AI becomes embedded into revenue decision systems, structured accuracy and domain context become increasingly important.

The difference between:

    • “AI that sounds plausible”
      and

    • “AI that produces decision-grade classifications tied to revenue”

becomes meaningful when marketing dollars and forecasting depend on it.

Operationalizing data not only demands a lower tolerance for inaccuracy; the output is also far less actionable on its own, without the context of the marketing channel that drove the lead, whether it turned into a booking, and whether revenue was generated.

If you leverage an LLM (off-the-shelf or domain-trained) and you learn that bookability across all of your conversions is just 65%, the next action you’ll want to take is to improve that bookability. 

But what should your goal bookability be? Should it be channel-specific or apply to all your inbound leads?

This is where context can make or break your goal setting, decisions, and investments (time or money). 

All of you (hopefully) have a monthly revenue goal/target. 

You know your average ticket across your different services and how well you convert to revenue once you're in the home. (If you don't, that's another reason context in data matters: it's hard to set goals, or take action to influence them, without those data points.)

Armed with that information, you probably have a rough idea of how many daily conversions (contacts to the business) you need in order to run enough appointments, and close enough of them, to hit your goals.

To keep it easy, let’s just say you need 50 conversions for your business per day.

Out of the 50 conversions, 20 of those come from existing customers/outbound, but you’re relying on at least 30 to come from your digital marketing spend (organic and/or paid).

The problem is that it’s a stretch for your budget to get 30 conversions from organic and paid channels daily. 

Sometimes you hit that number, sometimes you don’t. 

You don’t have the budget to pay your SEO provider for more content, and Google Ads is expensive in your market, so you’re told to make do with your existing budget. 

It’s the very reason you looked into lead grading in the first place, and how you figured out that a third of the conversions you pay for aren’t even bookable.

But what can you do with that information? 

If your lead grading data is integrated with your conversion → revenue marketing attribution, you can now go identify specific channels that may be lagging in bookability in order to make adjustments. 

That’s actionability. 

You sort through your data and realize that conversions from Google Ads have a bookability of just 40%. 

To add insult to injury, Google Ads drives 70% of your conversions across all paid and organic channels (thanks, integrated data!).

So now you’ve gone from “I need to improve bookability” to “I need to specifically improve bookability from Google Ads because it’s my most expensive channel, but it drives my highest volume of conversions, and ultimately, revenue”.

It’s much easier to now spend the required time to diagnose what’s happening and not feel like you’re on a wild goose chase. 

After digging into the data, you realize that there are two primary drivers of expensive, unbookable conversions from Google Ads: 

    1. Wrong Number Phone Calls 
    2. Individuals looking for services you don’t offer

Armed with this information, you email your agency and ask them to review your negative keyword list against the services these individuals are contacting you about.

While they’re at it, you ask them to make sure you aren’t bidding on competitor names, because you noticed a spike in conversions from people looking for an HVAC company, just not yours. 

After a few hours, your agency comes back and confirms they were in fact bidding on competitor names (a miscommunication) and did not have negative keywords for two of the services individuals were contacting your business about (another miscommunication).

Over the next few days, the bookability of your conversions from Google Ads increases by 20 percentage points!

Before, you were getting 21 daily conversions from Google Ads, but only ~8 were bookable. 

With the improved negative keyword list and budget no longer spent on competitor names, your bookability jumped, and now you're getting ~12 bookable conversions from Google Ads per day, a 50% increase!
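The arithmetic behind this hypothetical walkthrough, for anyone who wants to check it:

```python
# Before/after math for the Google Ads example above (illustrative numbers).
daily_google_ads_conversions = 21
bookable_before = 8   # ~40% of 21 daily conversions
bookable_after = 12   # ~57% after the fixes (roughly +20 points)
lift = (bookable_after - bookable_before) / bookable_before
print(f"Bookable/day: {bookable_before} -> {bookable_after} (+{lift:.0%})")  # +50%
```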

Pumped with your win, you continue to leverage context-driven data to find and correct inefficiencies that matter to your business’s bottom line. 

But over time, you’re chasing smaller and smaller amounts of that opportunity. 

So much so that now just a 1% improvement is a win. 

That’s where the accuracy of domain-specific models plays an important role in getting you into the long tail of efficiencies that matter to your business. 

If an off-the-shelf LLM properly classified 9 unbookable conversions from Google, that’s good enough to get you started. 

But what if the other 4 unbookable conversions were improperly classified, and you didn’t realize they all originated from the same zip code? 

Had you known that, you could have also asked your agency to confirm they were not targeting that zip code, and might’ve found out they were because of a miscommunication. 

These are real-world examples we see day-to-day that have a meaningful impact on business outcomes. 

Operationalizing data to make measurable impacts on those business outcomes requires both accuracy and specificity. 

The level of accuracy and specificity comes with trade-offs. 80% accuracy may be good enough today, but you may need the remaining 20% after cleaning up the low-hanging fruit. 


Methodological Considerations & Common Questions

Prompt Engineering

All models were provided with identical optimized schema prompts.

While few-shot prompting can improve performance, general LLMs experience long-context degradation when managing:

    • 30+ classification options

    • Multi-dimensional bookability signals

    • Long transcripts

ClaraT™ does not rely solely on prompting.

It uses a domain-weighted ensemble architecture trained on structured conversion outcomes.


Cost-Benefit Argument

General LLMs may be inexpensive per grade.

However, at scale, an 11% misclassification rate exposes substantial amounts of marketing-influenced revenue to inaccurate grading.

For mid-sized enterprises, the operational cost of inaccuracy may exceed the infrastructure savings of using a general-purpose model.


Model Version Selection

Extended reasoning (“Thinking”) models were not used because:

    • They introduce higher latency

    • They increase inference cost

    • They are a poor fit for real-time, high-throughput lead grading workflows

This benchmark reflects production-grade deployment conditions.


Why ClaraT™ Performed Differently

ClaraT™ is a domain-trained lead-grading system optimized for structured home-service conversion/lead classification.

It differs from general LLMs in three structural ways:

    1. Outcome-based training on real booking data
    2. Iterative hand-labeling done by human domain experts
    3. Domain-weighted ensemble architecture designed for taxonomy stability

ClaraT™ is not a single-release model.

It has undergone over 12 months of iterative development, including:

    • Multiple supervised training cycles

    • Taxonomy refinements based on real booking outcomes

    • Human-in-the-loop validation passes

    • Benchmark testing across evolving datasets

Each iteration incorporated new edge cases, routing complexities, and failure patterns observed in production environments.

The results presented in this benchmark reflect a mature, production-grade system. 

The results suggest that structured operational classification benefits from domain-specific optimization beyond general language modeling.


Conclusion

In a 6,000-transcript benchmark evaluating structured lead grading accuracy:

    • General-purpose LLMs achieved 68.69–89.75% accuracy.

    • ClaraT™ achieved 97.24–98.05% accuracy (~98%) across the three metrics.

    • The observed performance gap ranged from roughly 8 to more than 20 percentage points.

    • The difference is statistically and operationally significant.

For structured lead grading in home services, domain-trained models demonstrated materially higher classification reliability than general-purpose LLMs.

Last Updated: March 2026
