How do email spam filters work?

Spam filters score incoming messages across multiple dimensions: sender reputation, authentication results, content patterns, link analysis, and recipient engagement. Each factor contributes points to a total spam score. Above a threshold, the message goes to spam.
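As a rough illustration of the additive scoring idea, here is a minimal sketch. The weights, the threshold, and the names `Signals`, `spam_score`, and `placement` are all hypothetical; real filters use many more signals and do not publish their weights.

```python
from dataclasses import dataclass

# Hypothetical weights and threshold, chosen only to illustrate the idea.
WEIGHTS = {
    "poor_sender_reputation": 3.0,
    "failed_authentication": 2.5,
    "spam_content_patterns": 1.5,
    "suspicious_links": 2.0,
    "low_engagement": 1.0,
}
SPAM_THRESHOLD = 5.0


@dataclass
class Signals:
    poor_sender_reputation: bool = False
    failed_authentication: bool = False
    spam_content_patterns: bool = False
    suspicious_links: bool = False
    low_engagement: bool = False


def spam_score(signals: Signals) -> float:
    """Add up the weight of every signal that fired."""
    return sum(w for name, w in WEIGHTS.items() if getattr(signals, name))


def placement(signals: Signals) -> str:
    """Messages scoring above the threshold go to spam."""
    return "spam" if spam_score(signals) > SPAM_THRESHOLD else "inbox"
```

In this toy model a message with weak content and low engagement still reaches the inbox, while bad reputation combined with failed authentication does not.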

What is a Bayesian spam filter?

A Bayesian filter is a statistical classifier that learns from examples of spam and legitimate email. It calculates the probability that a message is spam based on word frequencies and patterns. SpamAssassin uses Bayesian classification as one of its scoring methods.

Does Gmail use SpamAssassin?

No. Gmail uses its own proprietary filtering system that heavily weights sender reputation and user engagement signals. SpamAssassin is used primarily by hosting providers, corporate mail servers, and open-source mail setups.

What triggers spam filters?

The most common triggers are: poor sender reputation, failed authentication (SPF/DKIM/DMARC), known spam content patterns, suspicious link patterns (URL shorteners, blacklisted domains), and low recipient engagement.

Can I test my email against spam filters?

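Yes. Mail-Tester.com checks against SpamAssassin. GlockApps tests inbox placement across providers. Litmus includes spam filter testing. Treat the results as a partial picture, though: content scoring and seed-list tests can't reproduce the sender reputation and per-recipient engagement history that decide placement for your real list.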

Why does the same email go to inbox for some recipients and spam for others?

Per-user behavioral models. Gmail and Microsoft both score sender-recipient relationships individually. A user who has opened your last 10 messages will most likely see the next one in the inbox; a user who deleted your last 10 without opening will most likely find it in spam. Aggregate reputation sets the baseline; per-user behavior fine-tunes from there.

Where does spam email come from?

Most spam originates from compromised hosts, botnets, snowshoe networks (many low-volume IPs to evade reputation), and legitimate bulk senders whose hygiene has collapsed. Phishing is a distinct subcategory: typically targeted, often using lookalike domains. Authentication (SPF, DKIM, DMARC) lets receivers verify which domain is responsible for a message and attach reputation to it. It does not prove a sender is legitimate on its own, since spammers can authenticate domains they control.

Spam Filter Technologies: How Bayesian, Reputation, and Content Filters Work

The Layered Filter Model

No spam filter uses a single technique. Modern filtering stacks multiple technologies, each catching different types of spam. Understanding each layer helps you diagnose which one is filtering your mail.

Layer 1: Connection-Level Filtering

Before the receiving server even looks at your message content, it evaluates the connection itself:

IP reputation checks — Is the sending IP on any blacklists? What's its historical spam ratio? Services like Spamhaus SBL, Barracuda Reputation System, and Cisco Talos maintain real-time IP reputation data.

PTR/rDNS validation — Does the sending IP have a reverse DNS record? Does it match the sending hostname? Missing rDNS is a strong spam indicator.

Connection rate limiting — Too many connections per minute from one IP triggers automatic throttling or blocking.

This layer is close to binary: the connection is accepted, throttled, or refused before any content is read. No content optimization fixes a blacklisted IP.
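To make the connection-level checks concrete, here is a minimal sketch of a DNS blocklist lookup and a forward-confirmed reverse DNS check using only the standard library. The function names and the zone `dnsbl.example.org` are placeholders. Real lists publish their own zones, usage terms, and return codes, and some answers signal a query error instead of a listing, so treat `is_listed` as a simplification.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name for an IPv4 blocklist lookup: reversed octets + zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Simplified check: any A record for the query name counts as listed.

    Real lists encode meaning in the returned 127.0.0.x address, which
    this sketch ignores.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False


def has_matching_rdns(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the PTR name must resolve back to the IP."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in addresses
```

For example, `dnsbl_query_name("192.0.2.1", "dnsbl.example.org")` yields `1.2.0.192.dnsbl.example.org`, which is the name the receiving server would query.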

Practitioner note: Connection-level filtering catches the most spam by volume. A huge percentage of global spam comes from compromised machines with terrible IP reputation. If your IP is clean and authenticated, you've already passed the hardest filter.

Layer 2: Authentication Checks

The server verifies your authentication protocols:

SPF: Is the sending IP authorized to send for the envelope sender (Return-Path) domain?
DKIM: Is the cryptographic signature valid?
DMARC: Does at least one of SPF or DKIM pass with a domain that aligns with the From domain? What's the published policy?

Authentication doesn't directly determine spam/inbox placement, but failing authentication is a strong negative signal. In 2026, unauthenticated email from bulk senders is increasingly rejected outright.
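The DMARC alignment logic is easier to see in code. This is a minimal sketch, not a validator: it assumes SPF and DKIM have already been evaluated elsewhere, and `org_domain` naively takes the last two labels, whereas real implementations consult the Public Suffix List. All function and parameter names are illustrative.

```python
from typing import Optional


def org_domain(domain: str) -> str:
    """Naive organizational domain: the last two labels.

    Assumption for illustration only. This is wrong for suffixes such as
    co.uk; real implementations use the Public Suffix List.
    """
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def aligned(from_domain: str, auth_domain: str, strict: bool = False) -> bool:
    """Strict alignment needs an exact match; relaxed compares org domains."""
    if strict:
        return from_domain.lower() == auth_domain.lower()
    return org_domain(from_domain) == org_domain(auth_domain)


def dmarc_pass(
    from_domain: str,
    spf_pass: bool,
    spf_domain: Optional[str],
    dkim_pass: bool,
    dkim_domain: Optional[str],
    strict: bool = False,
) -> bool:
    """DMARC passes if SPF or DKIM passes AND that domain aligns with From."""
    spf_ok = spf_pass and spf_domain is not None and aligned(from_domain, spf_domain, strict)
    dkim_ok = dkim_pass and dkim_domain is not None and aligned(from_domain, dkim_domain, strict)
    return spf_ok or dkim_ok
```

The common failure this exposes: mail sent through a third-party platform can pass SPF and DKIM for the platform's own domain and still fail DMARC, because neither result aligns with your From domain.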

See our email authentication guide for complete setup.

Layer 3: Content Analysis

This is where Bayesian classifiers and pattern matching come in.

Bayesian Classification

Bayesian filters learn from labeled examples. They build a probability model: given the words and patterns in this message, how likely is it spam?

How it works:

Train on thousands of known spam and legitimate emails
Calculate the probability that each word/phrase appears in spam vs legitimate mail
For a new message, combine the probabilities of all its words
Output a spam probability score
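The four steps above can be sketched as a small naive Bayes classifier. This is a teaching example with Laplace smoothing and a crude tokenizer, not SpamAssassin's implementation; the class and method names are my own.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Steps 1-2: count how often each word appears under each label."""
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text: str) -> float:
        """Steps 3-4: combine per-word evidence into one probability."""
        if 0 in self.message_counts.values():
            raise ValueError("train on at least one spam and one ham message")
        total = sum(self.message_counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        log_score = {}
        for label in ("spam", "ham"):
            score = math.log(self.message_counts[label] / total)  # class prior
            n_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Laplace smoothing keeps unseen words from zeroing the product.
                score += math.log((self.word_counts[label][word] + 1) / (n_words + vocab))
            log_score[label] = score
        diff = log_score["ham"] - log_score["spam"]
        if diff > 700:  # avoid overflow in math.exp
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))
```

Working in log space avoids multiplying many tiny probabilities together, which would underflow on long messages.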

SpamAssassin's Bayesian classifier is one of the most widely deployed open-source implementations, and Gmail, Outlook, and Yahoo all use similar (more sophisticated) statistical models.

Pattern Matching

Rule-based filters check for specific patterns:

Known spam phrases and word combinations
Suspicious formatting (all caps, excessive punctuation, colored text)
Image-to-text ratio anomalies
Hidden text (white text on white background)
Deceptive subject lines
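Rule-based checks like these are usually small tests with a score attached. The sketch below is a simplified illustration: the rule names, scores, phrase list, and regexes are hypothetical and far cruder than production rule sets.

```python
import re

# Hypothetical phrase list, for illustration only.
SPAM_PHRASES = ("act now", "risk free", "claim your prize")


def all_caps_subject(subject: str, body: str) -> bool:
    letters = [c for c in subject if c.isalpha()]
    return len(letters) >= 8 and all(c.isupper() for c in letters)


def excessive_punctuation(subject: str, body: str) -> bool:
    return re.search(r"[!?]{3,}", subject + " " + body) is not None


def spam_phrase(subject: str, body: str) -> bool:
    text = (subject + " " + body).lower()
    return any(phrase in text for phrase in SPAM_PHRASES)


def white_text(subject: str, body: str) -> bool:
    # Crude hidden-text check: an inline style setting the text color to
    # white. It ignores the background, so it can misfire on dark themes.
    pattern = r"(?<![-\w])color\s*:\s*(?:#fff(?:fff)?|white)\b"
    return re.search(pattern, body, re.IGNORECASE) is not None


# (rule name, score, test function); names and scores are made up.
RULES = [
    ("ALL_CAPS_SUBJECT", 1.5, all_caps_subject),
    ("EXCESSIVE_PUNCTUATION", 1.0, excessive_punctuation),
    ("SPAM_PHRASE", 2.0, spam_phrase),
    ("WHITE_TEXT", 2.5, white_text),
]


def evaluate(subject: str, body: str):
    """Return the names of the rules that fired and their total score."""
    hits = [(name, score) for name, score, test in RULES if test(subject, body)]
    return [name for name, _ in hits], sum(score for _, score in hits)
```

Each rule is weak on its own, which is why rule engines add scores together instead of rejecting on a single hit.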

URL Analysis

Every link in the message is checked against:

Domain blacklists (URIBL, SURBL)
Known phishing URL patterns
URL shortener usage
Redirect chain analysis
Safe Browsing databases (Google, Microsoft)
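A minimal sketch of the first step, extracting link hosts and comparing them against lists, looks like this. The shortener and blocked-domain sets are tiny hypothetical stand-ins: real filters query services such as URIBL and SURBL over DNS and also follow redirect chains, which this example does not attempt.

```python
import re
from urllib.parse import urlparse

# Hypothetical local lists standing in for real reputation services.
URL_SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}
BLOCKED_DOMAINS = {"malicious.example"}

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_hosts(body: str) -> list:
    """Pull the hostname out of every http(s) URL found in the body."""
    hosts = []
    for url in URL_PATTERN.findall(body):
        host = urlparse(url).hostname
        if host:
            hosts.append(host)
    return hosts


def link_findings(body: str) -> list:
    """Return (host, reason) pairs for every link that looks risky."""
    findings = []
    for host in extract_hosts(body):
        if host in URL_SHORTENERS:
            findings.append((host, "url_shortener"))
        # Match the blocked domain itself and any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
            findings.append((host, "blocklisted_domain"))
    return findings
```

Note that the check runs on the host a link points to, not on the visible link text, so a friendly label does not hide a listed domain.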

Practitioner note: Content filtering gets outsized attention, but it's actually the weakest layer for legitimate senders. If your reputation and authentication are solid, content analysis is rarely what puts you in spam. The exception is if you're using known spam templates or linking to blacklisted domains.

Layer 4: Engagement-Based Filtering

This is Gmail's secret weapon and the most powerful filter for bulk senders.

Positive signals: Opens, clicks, replies, moving from spam to inbox, adding to contacts, starring/labeling

Negative signals: Spam reports, deleting without reading, consistently ignoring messages

Gmail tracks engagement at the individual recipient level. If most of your recipients ignore your email, Gmail progressively filters more of your mail to spam — even if your content and authentication are perfect.

A simplified version of the per-user model:

User A: opens every newsletter, occasionally clicks. Future mail → Inbox.
User B: never opens, deletes within seconds. Future mail → Spam after 5-10 messages.
User C: marked sender as spam once. All future mail → Spam until the user marks a message as not spam or whitelists the sender.

This means the same email from the same sender lands in inbox for engaged recipients and spam for disengaged ones — and it's why engagement-based sending matters so much for Gmail deliverability.
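The three users above can be modeled with a toy per-recipient tracker. This is only a sketch of the idea: the event names, the 10-message window, and the 5-message ignore limit are assumptions drawn from the simplified example, and nothing here reflects how Gmail's actual models are built.

```python
from collections import defaultdict, deque

IGNORE_EVENTS = {"ignore", "delete_unread"}


class EngagementModel:
    """Toy per-recipient engagement tracker; all thresholds are hypothetical."""

    def __init__(self, window: int = 10, ignore_limit: int = 5) -> None:
        self.ignore_limit = ignore_limit
        # Keep only the most recent `window` events per (sender, recipient).
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.reported = set()

    def record(self, sender: str, recipient: str, event: str) -> None:
        key = (sender, recipient)
        if event == "spam_report":
            self.reported.add(key)
        elif event == "not_spam":
            self.reported.discard(key)
        self.history[key].append(event)

    def placement(self, sender: str, recipient: str) -> str:
        key = (sender, recipient)
        if key in self.reported:
            return "spam"
        # Count the current run of ignored messages, newest first.
        streak = 0
        for event in reversed(self.history.get(key, ())):
            if event not in IGNORE_EVENTS:
                break
            streak += 1
        return "spam" if streak >= self.ignore_limit else "inbox"
```

Any positive event resets the ignore streak, which mirrors the practical advice: a single open or reply from a dormant subscriber is worth more than another unopened send.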

List Hygiene as a Filter Input

ISPs read list quality signals as part of the engagement layer, regardless of sender intent:

Bounce rate > 2% on a send → flag
Complaint rate > 0.3% → throttling at Gmail/Yahoo
Hit on a recycled spam trap → reputation hit
Hit on a pristine spam trap → Spamhaus listing risk
Role address volume > 5% of list → flag

Poor hygiene reads as either incompetence or bad acquisition (purchased lists, scraping). Either way, the response is reduced inbox placement.
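If you want to watch these thresholds yourself, a per-send check can be as simple as the sketch below. The `SendStats` fields and flag names are hypothetical, rates are computed against messages sent for simplicity, and trap hits are included only as inputs: in practice you rarely see them directly and have to infer them from blocklist or reputation data.

```python
from dataclasses import dataclass


@dataclass
class SendStats:
    sent: int
    bounced: int
    complaints: int
    list_size: int
    role_addresses: int  # e.g. info@, sales@, admin@
    recycled_trap_hits: int = 0
    pristine_trap_hits: int = 0


def hygiene_flags(stats: SendStats) -> list:
    """Return the hygiene thresholds from the list above that a send crossed."""
    flags = []
    if stats.sent and stats.bounced / stats.sent > 0.02:
        flags.append("bounce_rate")
    if stats.sent and stats.complaints / stats.sent > 0.003:
        flags.append("complaint_rate")
    if stats.list_size and stats.role_addresses / stats.list_size > 0.05:
        flags.append("role_addresses")
    if stats.recycled_trap_hits:
        flags.append("recycled_trap")
    if stats.pristine_trap_hits:
        flags.append("pristine_trap")
    return flags
```

Running this after every campaign catches a bad list import on the first send instead of after reputation has already dropped.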

Layer 5: Machine Learning Models

Gmail, Outlook, and Yahoo all use deep learning models that consider hundreds of signals simultaneously:

Sender behavior patterns over time
Similarity to known spam campaigns
Network analysis (which other senders share your infrastructure)
Temporal patterns (sending time, frequency changes)
Cross-user signals (if many users mark similar messages as spam)

These models are proprietary and constantly evolving. You can't game them — you can only send legitimate, wanted email and let the models classify you correctly over time.

What Filters Are Protecting Against

It helps to know what these layers are calibrated to catch:

Botnets and compromised hosts: automated sending from infected machines. High volume, low IP reputation, often fails authentication entirely.
Snowshoe spam: sending distributed across many low-volume IPs and domains to evade per-source reputation, often on freshly registered domains.
Phishing: targeted impersonation of legitimate brands. DMARC at p=reject is the primary defense against spoofing of the exact domain; it does not stop lookalike domains, which reputation and URL analysis have to catch.
Unsolicited bulk mail from "legitimate" senders: purchased lists, scraped contacts, dormant subscribers reactivated without permission. This is the category most well-meaning marketers accidentally fall into.

The reason legitimate marketing mail gets caught is that its pattern (bulk, commercial, low engagement) overlaps with the patterns spam uses.

How Major Providers Differ

Filter Aspect	Gmail	Outlook/Microsoft	Yahoo
Primary weight	Engagement + domain reputation	IP reputation + content	Reputation + authentication
Content analysis	ML-heavy	Microsoft Defender + SmartScreen	SpamAssassin-like + proprietary
Engagement impact	Very high	Moderate	Moderate
Blacklist reliance	Low (own data)	Moderate	Moderate
Authentication strictness	Very high (2024 requirements)	Moderate	High (2024 requirements)

Practitioner note: The biggest misconception I fight is that spam filtering is about content. For any sender doing real volume, reputation is 80% of the game. I've seen perfectly written emails land in spam because of bad IP reputation, and terribly written emails land in the inbox because the sender had excellent engagement metrics.

What This Means for You

Fix reputation first — no content change overcomes bad reputation
Authenticate everything — SPF, DKIM, DMARC are table stakes
Monitor engagement — especially for Gmail
Clean your links — avoid shorteners, check domain reputation
Test content last — use Mail-Tester and GlockApps for content scoring

If you're getting filtered and can't figure out which layer is catching you, schedule a deliverability audit — I'll trace your messages through each filter stage and identify exactly where they're being caught.

v1.0 · April 2026