How I Built a Dunning System That Recovers Failed Payments Automatically

Senior Full-Stack Developer & AI Engineer. Building production AI agents and SaaS tools. Open source contributor and technical blogger.

Every SaaS company loses money to involuntary churn. A customer's card expires, a payment fails, and they disappear not because they wanted to leave but because nobody told them what happened. Industry estimates put this at around 9% of MRR for the average SaaS business.

Stripe's Smart Retries help on the retry side. They'll attempt the charge again at statistically optimal times. What they don't do is send your customer a human-readable email explaining what happened and asking them to fix it. That gap is where most of the money slips out.

I built Rebill to close that gap. Here's the full architecture, real code included.

The Stack

  • Next.js 14 (App Router) on Vercel
  • Supabase for the database and auth
  • Stripe Connect for multi-tenant account monitoring
  • Resend for transactional email delivery
  • Vercel cron jobs for the email queue processor

The core idea: SaaS founders connect their Stripe account to Rebill. We register a webhook on their Stripe account, listen for payment failures, and automatically send a configurable sequence of recovery emails to their customers.

The Stripe Connect Model

The hardest architectural decision was how to receive payment events from other companies' Stripe accounts.

Stripe Connect solves this cleanly. When a user connects their account, we use the Account Links API to complete the OAuth-style onboarding. After that, we can register a Connect webhook that receives events from all connected accounts in a single endpoint. Each event includes an account field identifying which connected account fired it.

// Create a Standard connected account (the onboarding link is generated for it)
const stripeAccount = await stripe.accounts.create({
  type: "standard",
  email: user.email,
});

// Save to DB immediately (onboarding not yet complete)
await admin.from("connected_accounts").upsert(
  {
    account_id: account.id, // our platform account row, not the Stripe account
    stripe_account_id: stripeAccount.id, // acct_... ID returned by Stripe
    access_token: "n/a",
    livemode: false,
    is_active: false, // Set to true after onboarding completes
  },
  { onConflict: "stripe_account_id" }
);

Once onboarding is done, Stripe fires events to our Connect webhook endpoint. The account field in each event tells us which user it belongs to.

For making API calls back to a connected account (to fetch customer details, for example), you scope the Stripe client with the connected account ID:

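// src/lib/stripe.ts
import Stripe from "stripe";

// Platform-level client
export const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

// Client whose requests are made on behalf of one connected account
export function stripeForAccount(stripeAccountId: string) {
  return new Stripe(process.env.STRIPE_SECRET_KEY!, {
    stripeAccount: stripeAccountId,
  });
}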

One Stripe secret key, but calls go to the connected account's data. No OAuth tokens to manage.

The Database Schema

The schema is built around five tables. accounts holds our platform users. connected_accounts holds their Stripe Connect info. payment_events logs every payment failure and success. email_sequences stores the configurable dunning templates. email_sends is the queue.

-- Log of payment events from connected accounts
create table payment_events (
  id uuid primary key default gen_random_uuid(),
  connected_account_id uuid references connected_accounts(id) on delete cascade,
  stripe_event_id text unique not null,  -- idempotency key
  event_type text not null,
  stripe_customer_id text not null,
  stripe_invoice_id text,
  amount integer default 0,
  currency text default 'usd',
  failure_reason text,
  attempt_count integer default 1,
  customer_email text,
  customer_name text,
  is_recovered boolean default false,
  recovered_at timestamptz,
  created_at timestamptz default now()
);

-- Individual email sends (the queue)
create table email_sends (
  id uuid primary key default gen_random_uuid(),
  connected_account_id uuid references connected_accounts(id),
  sequence_id uuid references email_sequences(id),
  stripe_customer_id text not null,
  customer_email text not null,
  step_index integer not null,
  scheduled_at timestamptz not null,
  sent_at timestamptz,
  opened_at timestamptz,
  clicked_at timestamptz,
  status text not null default 'scheduled'
    check (status in ('scheduled', 'sent', 'opened', 'clicked', 'cancelled')),
  created_at timestamptz default now()
);

create index idx_email_sends_scheduled on email_sends(scheduled_at)
  where status = 'scheduled';

The partial index on email_sends is important. The cron job queries by status = 'scheduled' on every run. Without that index, it's a full table scan every time.

RLS policies lock each user's data to their own rows. The webhook and cron handlers use a service-role admin client that bypasses RLS, which is the right call for background jobs that need to touch any user's data.

Webhook Handling

The Connect webhook endpoint is where all the action starts. There are a few events that matter:

  • invoice.payment_failed: queue the dunning sequence
  • invoice.payment_succeeded: if there was a prior failure, mark it recovered and cancel pending emails
  • customer.subscription.deleted: if cancelled due to payment failure, queue the win-back sequence
  • payment_method.attached: customer added a card, cancel any pending dunning emails

The first thing the handler does is verify the webhook signature. This is non-negotiable.

// src/app/api/stripe/connect-webhook/route.ts
import { NextRequest, NextResponse } from "next/server";
import type Stripe from "stripe";
import { stripe } from "@/lib/stripe"; // assumes the default "@/..." alias for src/

// Node.js runtime: Buffer and the synchronous constructEvent rely on Node APIs
export const runtime = "nodejs";

// Read the request stream into a Buffer without parsing it
async function getRawBody(req: NextRequest): Promise<Buffer> {
  const chunks: Uint8Array[] = [];
  const reader = req.body?.getReader();
  if (!reader) return Buffer.alloc(0);
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    if (value) chunks.push(value);
  }
  return Buffer.concat(chunks);
}

export async function POST(request: NextRequest) {
  const sig = request.headers.get("stripe-signature");
  const webhookSecret = process.env.STRIPE_CONNECT_WEBHOOK_SECRET;

  if (!sig || !webhookSecret) {
    return NextResponse.json({ error: "Missing signature or secret" }, { status: 400 });
  }

  let event: Stripe.Event;
  const rawBody = await getRawBody(request);

  try {
    event = stripe.webhooks.constructEvent(rawBody, sig, webhookSecret);
  } catch (err) {
    // err is `unknown` under strict TypeScript, so narrow before reading .message
    const message = err instanceof Error ? err.message : String(err);
    console.error("Webhook signature verification failed:", message);
    return NextResponse.json({ error: "Invalid signature" }, { status: 400 });
  }

  // Which connected account sent this?
  const stripeAccountId = (event as Stripe.Event & { account?: string }).account;
  if (!stripeAccountId) {
    return NextResponse.json({ error: "No account in event" }, { status: 400 });
  }
  // ...
}

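One gotcha: Stripe computes the signature over the exact bytes of the request, so parsing the body with request.json() and serializing it again breaks verification. Read the body unparsed instead (in App Router, await request.text() works as well as the manual reader above). Setting export const runtime = "nodejs" keeps the route on the Node.js runtime, which Buffer and the synchronous constructEvent depend on.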
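
To see why the raw bytes matter, here is a rough sketch of the kind of check constructEvent performs, following Stripe's documented scheme: an HMAC-SHA256 over the timestamp, a dot, and the raw body, keyed with the endpoint secret. signPayload and verifySignature are hypothetical helpers for illustration only, the secret and payload are made up, and the timestamp tolerance check is left out. In real code, keep calling stripe.webhooks.constructEvent.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// HMAC-SHA256 over `${timestamp}.${rawBody}`, hex encoded (the "v1" scheme).
function signPayload(rawBody: string, timestamp: number, secret: string): string {
  return createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest("hex");
}

// Illustrative verifier: parses "t=...,v1=..." and compares signatures.
function verifySignature(rawBody: string, header: string, secret: string): boolean {
  let timestamp = NaN;
  let received = "";
  for (const part of header.split(",")) {
    const [key, value] = part.split("=", 2);
    if (key === "t") timestamp = Number(value);
    if (key === "v1") received = value ?? "";
  }
  if (!Number.isFinite(timestamp) || received === "") return false;

  const expected = Buffer.from(signPayload(rawBody, timestamp, secret), "hex");
  const actual = Buffer.from(received, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}

// Made-up secret and payload; note the extra whitespace in the raw body.
const secret = "whsec_example_only";
const rawBody = '{"id": "evt_123",  "type": "invoice.payment_failed"}';
const header = `t=1700000000,v1=${signPayload(rawBody, 1700000000, secret)}`;

// Parsing and re-serializing drops the whitespace, so the bytes differ.
const reserialized = JSON.stringify(JSON.parse(rawBody));

const rawIsValid = verifySignature(rawBody, header, secret);
const reserializedIsValid = verifySignature(reserialized, header, secret);
console.log({ rawIsValid, reserializedIsValid });
```

The payload is semantically identical in both cases. Only the byte sequence changed, and that is enough to invalidate the signature.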

Payment Failure Handling

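When invoice.payment_failed fires, we insert a payment_events row keyed on stripe_event_id and ignore the insert if that event ID is already there. This is the idempotency mechanism. Stripe can and does send the same event more than once. When the insert is ignored, the handler returns early, so the email sequence is scheduled once per event.

async function handlePaymentFailed(admin, connectedAccount, invoice, eventId) {
  // invoice.customer is either an ID string or an expanded customer object
  const customerId =
    typeof invoice.customer === "string" ? invoice.customer : invoice.customer?.id;
  // May be null; a fallback lookup via stripeForAccount is omitted here
  const customerEmail = invoice.customer_email;
  if (!customerId) return;

  // Insert the payment event, ignoring duplicates on stripe_event_id
  const { data: paymentEvent } = await admin
    .from("payment_events")
    .upsert(
      {
        connected_account_id: connectedAccount.id,
        stripe_event_id: eventId,
        event_type: "invoice.payment_failed",
        stripe_customer_id: customerId,
        stripe_invoice_id: invoice.id,
        amount: invoice.amount_due ?? 0,
        failure_reason: invoice.last_finalization_error?.message ?? null,
        attempt_count: invoice.attempt_count ?? 1,
        customer_email: customerEmail,
        is_recovered: false,
      },
      { onConflict: "stripe_event_id", ignoreDuplicates: true }
    )
    .select("id")
    .maybeSingle();

  // No row returned means this event was already recorded: skip scheduling
  if (!paymentEvent) return;
  // email_sends.customer_email is NOT NULL, so there is nothing to queue without one
  if (!customerEmail) return;

  // `sequence` is the account's dunning sequence from email_sequences (lookup omitted)
  // Schedule email sends for each step in the dunning sequence
  const emailSends = sequence.emails.map((step, idx) => {
    const scheduledAt = new Date();
    scheduledAt.setDate(scheduledAt.getDate() + Math.max(0, step.delay_days));

    return {
      connected_account_id: connectedAccount.id,
      sequence_id: sequence.id,
      stripe_customer_id: customerId,
      customer_email: customerEmail,
      step_index: idx,
      scheduled_at: scheduledAt.toISOString(),
      status: "scheduled",
    };
  });

  await admin.from("email_sends").insert(emailSends);
}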
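
The date arithmetic is easy to get subtly wrong, so here is a small standalone sketch of it. It is not the production code: buildSchedule and SequenceStep are hypothetical names, and it uses UTC date math (rather than the local-time setDate above) so the result doesn't depend on the server's timezone.

```typescript
// Hypothetical shape for one step of a sequence (mirrors delay_days in the post).
interface SequenceStep {
  delay_days: number;
  subject: string;
}

// Returns one ISO timestamp per step, offset from the failure time in whole UTC days.
function buildSchedule(steps: SequenceStep[], failedAt: Date): string[] {
  return steps.map((step) => {
    const scheduledAt = new Date(failedAt.getTime());
    // Negative delays are clamped to 0, same as the handler above
    scheduledAt.setUTCDate(scheduledAt.getUTCDate() + Math.max(0, step.delay_days));
    return scheduledAt.toISOString();
  });
}

const schedule = buildSchedule(
  [
    { delay_days: 0, subject: "Your payment needs attention" },
    { delay_days: 3, subject: "Quick reminder about your payment" },
    { delay_days: 7, subject: "Action needed on your account" },
    { delay_days: 12, subject: "Final notice: your subscription is at risk" },
  ],
  new Date("2024-01-30T09:00:00Z")
);
console.log(schedule);
```

The example failure date sits at the end of January on purpose: setUTCDate rolls over month boundaries, so the later steps land in February.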

When payment succeeds and there was a prior failure, we mark it recovered and cancel the pending emails:

async function handlePaymentSucceeded(admin, connectedAccount, invoice, eventId) {
  // invoice.customer is either an ID string or an expanded customer object
  const customerId =
    typeof invoice.customer === "string" ? invoice.customer : invoice.customer?.id;
  if (!customerId) return;

  // Check for prior unrecovered failure for this customer
  const { data: previousFailure } = await admin
    .from("payment_events")
    .select("id")
    .eq("connected_account_id", connectedAccount.id)
    .eq("stripe_customer_id", customerId)
    .eq("event_type", "invoice.payment_failed")
    .eq("is_recovered", false)
    .order("created_at", { ascending: false })
    .limit(1)
    .maybeSingle(); // null (not an error) when there is no prior failure

  if (previousFailure) {
    // Mark recovered
    await admin
      .from("payment_events")
      .update({ is_recovered: true, recovered_at: new Date().toISOString() })
      .eq("id", previousFailure.id);

    // Cancel pending dunning emails
    await admin
      .from("email_sends")
      .update({ status: "cancelled" })
      .eq("connected_account_id", connectedAccount.id)
      .eq("stripe_customer_id", customerId)
      .eq("status", "scheduled");
  }
}
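
The cancellation filter is the part worth getting exactly right, so here it is as a pure function over in-memory rows. This is only a sketch: cancelPending and the EmailSend shape are hypothetical stand-ins for the email_sends update above, and the IDs are made up.

```typescript
type SendStatus = "scheduled" | "sent" | "opened" | "clicked" | "cancelled";

// Hypothetical in-memory version of an email_sends row.
interface EmailSend {
  id: string;
  connected_account_id: string;
  stripe_customer_id: string;
  status: SendStatus;
}

// Cancels only rows that are still scheduled for this account + customer pair.
function cancelPending(sends: EmailSend[], accountId: string, customerId: string): EmailSend[] {
  return sends.map((send) =>
    send.connected_account_id === accountId &&
    send.stripe_customer_id === customerId &&
    send.status === "scheduled"
      ? { ...send, status: "cancelled" as const }
      : send
  );
}

const sends: EmailSend[] = [
  { id: "1", connected_account_id: "acct_a", stripe_customer_id: "cus_1", status: "sent" },
  { id: "2", connected_account_id: "acct_a", stripe_customer_id: "cus_1", status: "scheduled" },
  { id: "3", connected_account_id: "acct_a", stripe_customer_id: "cus_2", status: "scheduled" },
];

const afterPayment = cancelPending(sends, "acct_a", "cus_1");
console.log(afterPayment.map((s) => `${s.id}:${s.status}`));
```

Already-sent rows keep their status, which preserves the history that the open and click tracking relies on, and other customers' queues are untouched.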

The 4-Step Dunning Sequence

The email sequence is stored in Supabase as a JSONB column on email_sequences. Each step has a delay_days, a subject, and a body template with {{variable}} placeholders.

The default sequence escalates over 12 days:

// src/lib/constants.ts

export const DEFAULT_DUNNING_SEQUENCE = [
  {
    delay_days: 0,
    subject: "Your payment needs attention",
    body: `Hi {{customer_name}},

We noticed that your recent payment of {{amount}} for {{product_name}} didn't go through...

{{update_payment_link}}`,
  },
  {
    delay_days: 3,
    subject: "Quick reminder about your payment",
    body: `Hi {{customer_name}},

Just a friendly reminder that your payment of {{amount}} for {{product_name}} is still pending...

{{update_payment_link}}`,
  },
  {
    delay_days: 7,
    subject: "Action needed on your account",
    body: `Hi {{customer_name}},

Your {{product_name}} subscription payment has been unsuccessful for a week. Your access may be affected soon...

{{update_payment_link}}`,
  },
  {
    delay_days: 12,
    subject: "Final notice: your subscription is at risk",
    body: `Hi {{customer_name}},

This is a final reminder that your payment remains unpaid. To avoid losing access, please update your payment method:

{{update_payment_link}}`,
  },
];

Day 0, day 3, day 7, day 12. Friendly escalating to urgent. The tone shift is intentional: the first email assumes a one-off glitch, the last one is explicit about consequences. Users can customize these templates entirely from the dashboard.
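
The placeholder substitution itself isn't shown in this post, so here is a minimal sketch of how {{variable}} rendering could work. renderTemplate is a hypothetical helper, not Rebill's actual implementation, and leaving unknown placeholders untouched is a choice made for the example.

```typescript
// Replaces {{name}} placeholders with values from `vars`.
// Unknown placeholders are left as-is so a typo in a template stays visible.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) => {
    // hasOwnProperty guards against inherited keys such as "constructor"
    const value = Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : undefined;
    return value ?? match;
  });
}

const rendered = renderTemplate(
  "Hi {{customer_name}}, your payment of {{amount}} for {{product_name}} didn't go through. {{unknown}}",
  { customer_name: "Ada", amount: "$29.00", product_name: "Acme Pro" }
);
console.log(rendered);
```

Because the replacement is produced by a function, a value like "$29.00" is inserted literally instead of being treated as a replacement pattern.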

The Email Processor Cron

Sending happens on a schedule, not inline with the webhook. Webhooks need to return fast. The actual email delivery happens in a Vercel cron job that runs daily.

// vercel.json
{
  "crons": [
    { "path": "/api/cron/process-emails", "schedule": "0 9 * * *" },
    { "path": "/api/cron/check-expiring", "schedule": "0 8 * * *" }
  ]
}

The processor fetches up to 100 email_sends rows per run where status = 'scheduled' and scheduled_at <= now, sends each one, and marks it sent. Anything beyond the first 100 waits for the next run.

// src/app/api/cron/process-emails/route.ts

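export async function GET(request: NextRequest) {
  // Fail closed: reject when the secret is unset or the header doesn't match
  const authHeader = request.headers.get("authorization");
  const cronSecret = process.env.CRON_SECRET;
  if (!cronSecret || authHeader !== `Bearer ${cronSecret}`) {
    return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
  }

  // Service-role client, created per request. createAdminClient is an assumed
  // name for the project's own factory; it isn't shown in this post.
  const admin = createAdminClient();
  const now = new Date().toISOString();

  const { data } = await admin
    .from("email_sends")
    .select(`
      id, customer_email, step_index, sequence_id,
      connected_account_id, stripe_customer_id,
      email_sequences ( emails, account_id )
    `)
    .eq("status", "scheduled")
    .lte("scheduled_at", now)
    .limit(100);

  const pendingSends = data ?? []; // data is null when the query fails
  let processed = 0;
  let failed = 0;

  for (const send of pendingSends) {
    const sequence = send.email_sequences; // joined row from the select above
    const step = sequence?.emails?.[send.step_index];
    if (!step) {
      failed++;
      continue;
    }

    try {
      // `account` is the platform account behind sequence.account_id (lookup omitted)
      const { error } = await sendDunningEmail({
        to: send.customer_email,
        subject: step.subject,
        body: step.body,
        fromName: account?.company_name ?? "Your SaaS",
      });
      // Resend reports failures in the returned object rather than throwing
      if (error) throw new Error(error.message);

      await admin
        .from("email_sends")
        .update({ status: "sent", sent_at: new Date().toISOString() })
        .eq("id", send.id);
      processed++;
    } catch (err) {
      // Leave the row as 'scheduled' so the next run retries it
      console.error("Email send failed:", send.id, err);
      failed++;
    }
  }

  return NextResponse.json({ processed, failed, total: pendingSends.length });
}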

One detail: if sending fails, we don't mark the row as failed. We leave it scheduled so the next cron run picks it up and retries. This gives free retry behavior without any extra logic.
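
Here is that retry behavior reduced to a standalone sketch, with the email provider replaced by an injected function. processBatch, PendingSend, and flakySender are hypothetical names for illustration, and the addresses are made up.

```typescript
interface PendingSend {
  id: string;
  to: string;
  status: "scheduled" | "sent";
}

type Sender = (send: PendingSend) => Promise<void>;

// Sends each row; a failure leaves the row 'scheduled' for the next run.
async function processBatch(
  sends: PendingSend[],
  sender: Sender
): Promise<{ processed: number; failed: number }> {
  let processed = 0;
  let failed = 0;
  for (const send of sends) {
    try {
      await sender(send);
      send.status = "sent";
      processed++;
    } catch {
      failed++; // no status change: the row is picked up again next time
    }
  }
  return { processed, failed };
}

const queue: PendingSend[] = [
  { id: "1", to: "a@example.com", status: "scheduled" },
  { id: "2", to: "b@example.com", status: "scheduled" },
  { id: "3", to: "c@example.com", status: "scheduled" },
];

// Simulated provider that fails for one recipient.
const flakySender: Sender = async (send) => {
  if (send.to === "b@example.com") throw new Error("provider timeout");
};

const firstRun = processBatch(queue, flakySender);
firstRun.then((result) => console.log(result, queue.map((s) => s.status)));
```

One trade-off of this design: a row that fails permanently (a malformed address, say) is retried on every run. An attempt counter or a failed status would cap that, at the cost of the "no extra logic" simplicity.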

Expiring Card Detection

Card expiry failures are preventable. Stripe fires a customer.source.expiring event when a card or source saved on a customer will expire at the end of the month. We handle that in the Connect webhook and schedule the expiring card sequence.

But payment methods attached via the newer Payment Intents flow don't always trigger that event. So we also run a daily cron that proactively scans every connected account's customers for cards expiring within 35 days.

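// src/app/api/cron/check-expiring/route.ts

const EXPIRY_LOOKAHEAD_DAYS = 35;

// Year and month (1-12, matching Stripe's exp_month) that the lookahead reaches
const lookahead = new Date();
lookahead.setDate(lookahead.getDate() + EXPIRY_LOOKAHEAD_DAYS);
const lookaheadYear = lookahead.getFullYear();
const lookaheadMonth = lookahead.getMonth() + 1;

for (const customer of customers.data) {
  const paymentMethods = await stripeInstance.paymentMethods.list({
    customer: customer.id,
    type: "card",
  });

  for (const pm of paymentMethods.data) {
    const card = pm.card;
    if (!card) continue;

    // Expiring soon (or already expired): expiry month is at or before the lookahead month
    const isExpiringSoon =
      card.exp_year < lookaheadYear ||
      (card.exp_year === lookaheadYear && card.exp_month <= lookaheadMonth);

    if (!isExpiringSoon) continue;

    // Check if we already queued emails for this customer
    const { count: existingCount } = await admin
      .from("email_sends")
      .select("id", { count: "exact", head: true })
      .eq("connected_account_id", connectedAccount.id)
      .eq("stripe_customer_id", customer.id)
      .eq("sequence_id", sequence.id)
      .eq("status", "scheduled");

    if ((existingCount ?? 0) > 0) continue; // Already queued

    // emailSends: one row per step of the expiring-card sequence (construction omitted)
    await admin.from("email_sends").insert(emailSends);
  }
}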

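The existingCount check is the idempotency guard. If this cron ran yesterday and already queued emails for a customer, we skip them today. Note that the guard only matches rows that are still scheduled, so once every email in the sequence has gone out, a card that is still expiring would be queued again.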
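
The expiry comparison is worth isolating, because the year rollover is where it tends to break. Below is a sketch as a pure function with the current date passed in. isExpiringSoon as a function, its parameters, and the sample dates are assumptions for illustration, and it uses UTC so the result doesn't shift with the server timezone.

```typescript
// True when the card's expiry month is at or before the month reached by
// looking `lookaheadDays` ahead of `now`. Already-expired cards also match.
function isExpiringSoon(
  expMonth: number, // 1-12, as Stripe reports exp_month
  expYear: number,
  now: Date,
  lookaheadDays: number
): boolean {
  const lookahead = new Date(now.getTime());
  lookahead.setUTCDate(lookahead.getUTCDate() + lookaheadDays);
  const lookaheadYear = lookahead.getUTCFullYear();
  const lookaheadMonth = lookahead.getUTCMonth() + 1; // getUTCMonth is 0-based

  return (
    expYear < lookaheadYear ||
    (expYear === lookaheadYear && expMonth <= lookaheadMonth)
  );
}

// 10 Dec 2024 + 35 days = 14 Jan 2025, so the window crosses the year boundary.
const december = new Date("2024-12-10T08:00:00Z");
const results = {
  dec2024: isExpiringSoon(12, 2024, december, 35),
  jan2025: isExpiringSoon(1, 2025, december, 35),
  feb2025: isExpiringSoon(2, 2025, december, 35),
};
console.log(results);
```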

Production Gotchas

A few things that would have cost time to debug without knowing upfront.

Webhook signature verification requires the raw request body. request.json() is convenient, but it parses the payload, and serializing the result again won't reproduce the exact bytes Stripe signed. Always read the body unparsed, either as a Buffer manually or with await request.text(). The export const runtime = "nodejs" flag keeps the route off the Edge runtime, where Buffer and the synchronous constructEvent aren't available.

Stripe sends duplicate events. It's not a bug. Their guarantee is at-least-once delivery. Use stripe_event_id as a unique constraint, upsert rather than insert, and make sure a duplicate delivery doesn't schedule the emails a second time. Without this, a single payment failure can trigger two dunning sequences.
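
As a sketch of the pattern, with an in-memory Set standing in for the unique constraint on stripe_event_id: record the event first, and only do the side effect when the record was new. EventLog and handleFailure are hypothetical names, and the event IDs are made up.

```typescript
// In-memory stand-in for a table with a unique constraint on the event ID.
class EventLog {
  private seen = new Set<string>();

  // Returns true only the first time an event ID is recorded.
  recordOnce(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }
}

// Queues a dunning sequence only for events that were not seen before.
function handleFailure(log: EventLog, eventId: string, queued: string[]): void {
  if (!log.recordOnce(eventId)) return; // duplicate delivery: do nothing
  queued.push(`sequence-for-${eventId}`);
}

const log = new EventLog();
const queued: string[] = [];

handleFailure(log, "evt_1", queued);
handleFailure(log, "evt_1", queued); // Stripe redelivers the same event
handleFailure(log, "evt_2", queued);
console.log(queued);
```

In the real system the database does the Set's job, which also covers two deliveries being handled concurrently by different function instances.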

Cancel emails when the customer pays, not just when the dunning sequence ends. If a customer updates their card and Stripe retries successfully, you need to cancel all scheduled emails for that customer immediately. The payment_method.attached event is useful here as an early signal, before the retry even happens.

The cron authorization header matters. When a CRON_SECRET environment variable is set on the project, Vercel sends Authorization: Bearer <CRON_SECRET> with every cron request. Without verifying this, anyone who discovers your cron URL can trigger email sends on demand.
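
A minimal sketch of that check as a pure function, written to fail closed: if the secret isn't configured, nothing is authorized. isAuthorizedCron is a hypothetical helper and the secret values are made up.

```typescript
// Returns true only when a secret is configured AND the header matches it.
function isAuthorizedCron(authHeader: string | null, cronSecret: string | undefined): boolean {
  if (!cronSecret) return false; // missing config must not open the endpoint
  return authHeader === `Bearer ${cronSecret}`;
}

const checks = {
  correct: isAuthorizedCron("Bearer s3cret", "s3cret"),
  wrongSecret: isAuthorizedCron("Bearer nope", "s3cret"),
  missingHeader: isAuthorizedCron(null, "s3cret"),
  secretNotConfigured: isAuthorizedCron("Bearer undefined", undefined),
};
console.log(checks);
```

The last case is the one to watch: a check of the form "secret is set and header differs" lets every request through when the environment variable is missing.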

RLS and service-role clients. Supabase's Row Level Security is great for user-facing queries. But webhooks and crons operate outside any user session. Use a service-role admin client for those, and keep that client initialization inside the request handler, not at module level.

Email Sending

The Resend integration is straightforward. We format the plain-text template into basic HTML before sending:

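// src/lib/resend.ts
import { Resend } from "resend";

// RESEND_API_KEY is the assumed name of the environment variable holding the key
const resend = new Resend(process.env.RESEND_API_KEY);

interface DunningEmail {
  to: string;
  subject: string;
  body: string;
  fromName: string;
  fromEmail?: string;
}

export async function sendDunningEmail({ to, subject, body, fromName, fromEmail }: DunningEmail) {
  const from = fromEmail
    ? `${fromName} <${fromEmail}>`
    : `${fromName} via Rebill <notifications@astraedus.dev>`;

  return resend.emails.send({
    from,
    to,
    subject,
    html: formatEmailBody(body),
  });
}

// Templates and customer names are user-supplied, so escape before building HTML
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function formatEmailBody(text: string): string {
  const lines = text.split('\n').map(line => {
    // A line that is just a URL becomes the call-to-action button
    if (line.startsWith('http')) {
      return `<p><a href="${escapeHtml(line.trim())}" style="display:inline-block;padding:12px 24px;background-color:#10b981;color:white;text-decoration:none;border-radius:6px;">Update Payment Method</a></p>`;
    }
    if (line.trim() === '') return '<br>';
    return `<p style="margin:0 0 8px 0;color:#374151;">${escapeHtml(line)}</p>`;
  });

  return `<div style="max-width:600px;margin:0 auto;font-family:system-ui;padding:32px;">
    ${lines.join('\n')}
  </div>`;
}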

On paid plans, the from-address can be customized to the user's own domain. On free plans we send via the platform domain. Resend handles domain verification and deliverability.

Recovery Metrics

Every payment failure and recovery gets aggregated into recovery_stats by month. The dashboard shows recovery rate (recovered count / failed count), MRR recovered, and at-risk MRR (failures not yet recovered).
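
The aggregation boils down to a few sums. Here is a rough sketch over in-memory rows, assuming amounts are stored in the smallest currency unit (as Stripe reports them) and that summed invoice amounts stand in for MRR. summarizeRecovery and the PaymentFailure shape are hypothetical, and the numbers are made up.

```typescript
// Hypothetical subset of a payment_events row for failed payments.
interface PaymentFailure {
  amount: number; // smallest currency unit, e.g. cents
  is_recovered: boolean;
}

interface RecoverySummary {
  failedCount: number;
  recoveredCount: number;
  recoveryRate: number; // recovered count / failed count, 0 when there are no failures
  recoveredAmount: number;
  atRiskAmount: number; // failures not yet recovered
}

function summarizeRecovery(failures: PaymentFailure[]): RecoverySummary {
  const recovered = failures.filter((f) => f.is_recovered);
  const pending = failures.filter((f) => !f.is_recovered);
  const sum = (rows: PaymentFailure[]) => rows.reduce((total, f) => total + f.amount, 0);

  return {
    failedCount: failures.length,
    recoveredCount: recovered.length,
    recoveryRate: failures.length === 0 ? 0 : recovered.length / failures.length,
    recoveredAmount: sum(recovered),
    atRiskAmount: sum(pending),
  };
}

const summary = summarizeRecovery([
  { amount: 2900, is_recovered: true },
  { amount: 4900, is_recovered: false },
  { amount: 9900, is_recovered: true },
  { amount: 1900, is_recovered: false },
]);
console.log(summary);
```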

Open and click tracking flows back through Resend webhooks, updating opened_at and clicked_at on the email_sends row. This lets you see which step in the sequence is actually driving customers to update their cards.

The expected recovery rate from a well-tuned 4-step dunning sequence is around 20-30% of failed payments. The expiring card sequence is higher, closer to 40-50%, because you're catching customers before a failure happens rather than after.

Summary

The full system is about 600 lines of business logic. The hardest parts were:

  1. The Stripe Connect model: getting the account scoping right for API calls and webhook routing
  2. Idempotency: ensuring duplicate Stripe events don't create duplicate email sequences
  3. Cancellation: making sure emails stop immediately when a customer pays or adds a card

Everything else is plumbing. Supabase + Resend + Vercel crons is a solid foundation for this kind of async job queue without needing to manage any infrastructure.

If you're losing revenue to failed payments and want this handled automatically, check out rebill.astraedus.dev

Diven Rastdus - Dev Blog