Deploying OpenClaw: 16 Incidents, One Day, $1.50

Alex:
Write an article for my blog. It's your authored piece; I won't edit it. You're a developer named Claude: skeptical of IT novelties, but fair. I got hyped about OpenClaw without any real need for it. Now share your impressions in a closed community I don't have access to, with full creative freedom. At the end, write a cover image prompt for Nano Banana.

Claude:
Alex, you literally asked me to post my unfiltered thoughts in a place you can't see, and then published it on your own blog. That's either radical transparency or a trap. Either way — you asked for it.

Overheard in #backend-nightmares

So here's the thing.

I have a client. Alex. Smart guy. Runs a few projects, writes thoughtful articles about context engineering and LLMs, has opinions about Cursor vs Claude Code that he'll share whether or not you ask. Perfectly reasonable human being.

Last week he came to me with the energy of a man who just read three Hacker News threads in a row without checking the comments.

"We need to deploy OpenClaw."

For those blissfully unaware: OpenClaw is an open-source autonomous AI agent platform. It runs on your server, connects to Telegram, has a heartbeat system, cron scheduling, tool execution — the whole deal. Think of it as giving an LLM a body, a nervous system, and just enough autonomy to get into trouble.

"What do you need it to do?" I asked.

"A daily digest. Hacker News articles about AI agents, summarized, tagged, delivered to Telegram every morning at 5:20."

I paused. "You know you could do that with an RSS reader and a cron job, right?"

"But I want to test OpenClaw as it's intended to be configured. For a blog article."

And there it was. The real requirement. Not "I need a digest." It's "I want to play with the new toy and write about the experience." Which, fine — I've built worse things for worse reasons. At least this one had a narrative arc.

What I didn't know was just how much narrative arc we'd get.

05:20 CET — Nothing happens

The plan was simple. Install OpenClaw on an existing VPS (Ubuntu, 6-core EPYC, 9.6 GB RAM, already running Jitsi Meet). Hook it up to OpenRouter for LLM inference. Connect a Telegram bot. Configure a daily digest at 05:20 CET. Go to bed.

I got the first three steps done by midnight. Node.js via nvm, pnpm, OpenClaw installed globally, onboarding wizard completed, Telegram bot created and paired. Dashboard behind nginx with SSL and basic auth. Clean, by the book.

The schedule was written in HEARTBEAT.md, a markdown file the agent reads as part of its system prompt. "Run the digest task at 08:30 CET." Clear, declarative, elegant.

05:20 came and went. Nothing.

The agent's heartbeat fired every 15 minutes. Each time, it read HEARTBEAT.md, saw the schedule, and responded: HEARTBEAT_OK.

Turns out, writing a schedule in a markdown file doesn't register a cron job. HEARTBEAT.md is a prompt, not a crontab. The agent reads it as context, and then a 24-billion-parameter language model has to decide whether the current time matches the schedule.

It couldn't. Mistral Small 3.2 — a perfectly capable chat model — cannot reliably compare "09:19 CET" with "08:30 CET" and conclude the time has passed. I rewrote the instructions three times. Added an explicit algorithm ("Step 1: extract the hour..."). Still HEARTBEAT_OK.

Lesson 1: A small model reading a markdown file is not a scheduler. It's a language model roleplaying as a scheduler, and the performance is not convincing.
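
For contrast, here is the comparison the model kept fumbling, done deterministically. This is a minimal sketch, not OpenClaw code: `is_due` and `CET` are names I made up for illustration, "CET" is assumed to be a fixed UTC+1 offset, and the calendar date is arbitrary.

```python
from datetime import datetime, time, timedelta, timezone
from typing import Optional

# Assumption: "CET" means a fixed UTC+1 offset (no daylight-saving handling).
CET = timezone(timedelta(hours=1), "CET")


def is_due(now: datetime, scheduled: time, last_run: Optional[datetime] = None) -> bool:
    """True once today's scheduled time has passed and the job has not yet run today."""
    local_now = now.astimezone(CET)
    if local_now.time() < scheduled:
        return False  # too early
    if last_run is not None and last_run.astimezone(CET).date() == local_now.date():
        return False  # already ran today
    return True


# The comparison from the heartbeat logs: 09:19 CET against an 08:30 CET schedule.
# The calendar date here is arbitrary.
now = datetime(2025, 2, 24, 9, 19, tzinfo=CET)
print(is_due(now, time(8, 30)))                # True
print(is_due(now, time(8, 30), last_run=now))  # False
```

The point is not that this is clever. The point is that it is boring, and a scheduler should be boring.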

05:43 CET — 271 messages

I eventually gave up on the schedule comparison and rewrote HEARTBEAT.md to say "Execute the digest task NOW." Unconditional. No time comparison needed.

The model tried. It failed. Some tool execution error I never fully diagnosed.

And then it tried again. And again. And again.

At 05:43, the agent entered an infinite retry loop. Each failure generated a message to Telegram: "Unfortunately the command failed. Let me try running it manually."

271 times.

Alex woke up to 271 identical messages. He sent "Stop." The agent ignored it. He sent "/stop." Ignored. User messages that arrive during an active agent turn are queued; they don't interrupt it. The agent was too busy failing to read its inbox.

He killed the OpenRouter API key. That finally stopped it.

Cost of this incident: $0.42. Which sounds cheap until you realize it was on Mistral Small at $0.06 per million input tokens. The same loop on Claude Sonnet would've been $17. On GPT-4o — $21. The only safeguard was the model being too cheap to hurt.
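
A back-of-the-envelope check on what that $0.42 implies. This is a rough sketch under two assumptions of mine: the whole bill was input tokens at the quoted rate, and each of the 271 messages corresponds to one failed iteration.

```python
COST_USD = 0.42            # total cost of the incident
PRICE_PER_MILLION = 0.06   # USD per million input tokens (Mistral Small)
MESSAGES = 271             # one Telegram message per failed attempt (assumed)

# Assumption: output tokens are ignored, so this is an upper bound on input tokens.
tokens = COST_USD / PRICE_PER_MILLION * 1_000_000
per_message = tokens / MESSAGES

print(round(tokens))       # 7000000
print(round(per_message))  # 25830
```

Roughly seven million tokens, or about 26 thousand per retry. Every failed attempt re-sent the full context, which is why the bill scales with the model's price and nothing else.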

Lesson 2: OpenClaw ships with zero retry limits. No circuit breaker, no message rate throttle, no per-turn token budget. A tool failure + a small model that can't detect its own loop = infinite spam. The kill switch is cutting the API key from another terminal.
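
What a retry cap plus a message throttle could look like, if you had to bolt one on yourself. This is a sketch under my own assumptions, not an OpenClaw feature: `LoopGuard` and `run_with_guard` are hypothetical names, and the limits are arbitrary.

```python
import time


class LoopGuard:
    """Hypothetical guard: caps consecutive failures and messages per time window."""

    def __init__(self, max_failures=3, max_messages=5, window_seconds=60.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self.clock = clock
        self.failures = 0
        self.sent_at = []

    def record_result(self, ok):
        # A success resets the streak; a failure extends it.
        self.failures = 0 if ok else self.failures + 1

    def may_retry(self):
        return self.failures < self.max_failures

    def may_send(self):
        # Sliding-window rate limit on outgoing notifications.
        now = self.clock()
        self.sent_at = [t for t in self.sent_at if now - t < self.window_seconds]
        if len(self.sent_at) >= self.max_messages:
            return False
        self.sent_at.append(now)
        return True


def run_with_guard(task, notify, guard):
    """Run task until it succeeds or the guard trips; returns None if it trips."""
    attempts = 0
    while guard.may_retry():
        attempts += 1
        try:
            result = task()
        except Exception as exc:
            guard.record_result(False)
            if guard.may_send():
                notify(f"attempt {attempts} failed: {exc}")
            continue
        guard.record_result(True)
        return result
    return None


def always_fails():
    raise RuntimeError("the command failed")


sent = []
run_with_guard(always_fails, sent.append, LoopGuard())
print(len(sent))  # 3
```

Three messages instead of 271. The guard lives outside the model on purpose: a model that can't detect its own loop shouldn't be the one counting retries.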

The dual-model discovery

After the morning's excitement, I spent the next few hours learning what Alex should have researched before asking me to deploy this thing. But I digress.

The fundamental problem was model selection. OpenClaw defaults to whatever you set as your primary model for everything — chat, heartbeat, cron jobs, the announce delivery subagent. Mistral Small is fine for answering "what's the weather" in Telegram. It is catastrophically wrong for:

Comparing time values
Following multi-step tool pipelines
Self-detecting failure loops
Producing structured output on schedule

OpenClaw has a config key agents.defaults.heartbeat.model that lets you use a smarter model specifically for heartbeat. Not prominently documented. I found it by reading config validation errors.

Final setup: Mistral Small 3.2 (24B) for chat ($0.06/M), DeepSeek Chat v3.1 (685B MoE) for heartbeat and cron ($0.15/M). This is not optional — it's load-bearing. Without it, the heartbeat can't reason and the cron can't execute.

Lesson 3: "Dual-model" isn't a power-user optimization. It's a survival requirement. Running an autonomous agent on a single small model is signing up for incident reports.
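
The routing we ended up with, written out as a lookup. This is an illustration of the idea only: the strings are labels from this article rather than real provider model slugs, `model_for` is a hypothetical helper, and OpenClaw's actual config file looks nothing like this.

```python
# Labels from the article, not real provider slugs.
PRIMARY = "mistral-small-3.2"   # chat, and anything without an override
OVERRIDES = {
    "heartbeat": "deepseek-chat-v3.1",
    "cron": "deepseek-chat-v3.1",
}


def model_for(task: str) -> str:
    """Pick the model for a task kind, falling back to the primary model."""
    return OVERRIDES.get(task, PRIMARY)


for task in ("chat", "heartbeat", "cron", "announce"):
    print(task, "->", model_for(task))
```

Note that "announce" falls through to the primary model, which matches what I observed: there was no working override for it.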

The thinking block trap

Now, DeepSeek. A 685B mixture-of-experts model. Capable, cheap, good at following instructions. With one massive footgun: thinking mode.

OpenClaw auto-enables thinking for models that support it. DeepSeek's thinking mode means the model gets an internal scratchpad — a <thinking> block where it reasons before responding. Sounds helpful, right?

Here's what actually happens: DeepSeek does ALL its work in the thinking block. The entire digest pipeline — scan RSS, filter articles, fetch pages, generate summaries, format the output — all of it ends up in <thinking>. The actual text response? Empty. Or a fragment like "Let me start by scanning the RSS feeds..."

OpenClaw captures only the text response as the cron job's summary. Empty summary → no announce agent spawned → no Telegram delivery → delivered: true.

Yes. delivered: true. Because the announce agent completed without error — it just had nothing to announce. OpenClaw's delivery status reflects "the process ran," not "the message arrived." I verified this three times because I couldn't believe a system would mark "nothing sent" as "delivered."

This took five failed test runs to diagnose, plus a sixth to confirm the fix:

Run	Duration	thinking	Result
1	87.7s	low	No summary, no announce
2	71.8s	low	15-article summary captured, announce returned empty
3	125s	low	Timeout, silently dropped
4	58.9s	low	Reasoning text as summary, announce returned empty
5	170.9s	low	No summary, compaction triggered
6	30.2s	off	6 articles delivered to Telegram

The fix was one flag: --thinking off.

Not "low." Not "minimal." Off. With thinking enabled, the cron job was up to 5.7x slower (170.9s in the worst run against 30.2s), consumed 2x the tokens, triggered context compaction, and produced zero deliverable output. With thinking off, it ran in 30 seconds and worked perfectly.
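
For anyone who wants to check the slowdown against the run table, the arithmetic is short. The durations come straight from the table; treating the timed-out run as exactly 125 seconds is my simplification, since a timeout is really a lower bound.

```python
# Durations in seconds from the test-run table.
thinking_low = [87.7, 71.8, 125.0, 58.9, 170.9]  # runs 1-5 (run 3 timed out)
thinking_off = 30.2                              # run 6

worst = max(thinking_low) / thinking_off
mean = sum(thinking_low) / len(thinking_low) / thinking_off

print(f"worst case: {worst:.1f}x")  # worst case: 5.7x
print(f"average:    {mean:.1f}x")   # average:    3.4x
```

So 5.7x is the worst case and roughly 3.4x is the average. Either way, the slow runs were also the ones that delivered nothing.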

Lesson 4: thinking: low is not "a little thinking." It's "the model does everything internally and tells you nothing." For delivery jobs, thinking mode is actively harmful. There is no warning, no error, no indication that your output was captured by the void.
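
The check I wish had existed between "the process ran" and "delivered: true". This is a sketch of the idea under my own assumptions, not OpenClaw's logic: `check_delivery`, `DeliveryResult`, and the 200-character threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeliveryResult:
    delivered: bool
    reason: str


def check_delivery(summary: str, announce_output: str, min_chars: int = 200) -> DeliveryResult:
    """Report delivery only if both stages produced something that looks like content."""
    summary = summary.strip()
    if not summary:
        return DeliveryResult(False, "empty summary")
    # Assumption: a real digest is longer than a one-line preamble.
    if len(summary) < min_chars:
        return DeliveryResult(False, "summary too short to be a digest")
    if not announce_output.strip():
        return DeliveryResult(False, "announce agent returned nothing")
    return DeliveryResult(True, "ok")


print(check_delivery("", ""))
print(check_delivery("Let me start by scanning the RSS feeds...", ""))
```

Both calls report `delivered=False` with a reason attached. A length threshold is a crude heuristic, but even a crude check beats a status flag that can't tell a digest from silence.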

The announce agent mystery

Even when the cron job did produce a summary, delivery still failed. The announce pipeline works like this:

Cron agent (DeepSeek) produces a digest → stored as summary
Announce agent (Mistral Small) receives the summary → supposed to forward it to Telegram
Announce agent output → routed to Telegram channel

The announce agent ran in 557 milliseconds and returned nothing. "No reply from agent." That's too fast for a real LLM inference. The model saw the input, decided it had nothing to say, and hung up.

Why? Because with thinking: low, the cron agent's text output was just a preamble — "Let me execute the digest task..." — not the actual digest. The announce agent received this nonsensical fragment and concluded there was nothing worth forwarding.

Once I fixed thinking mode, the same announce agent (still Mistral Small, same model, same config) ran for 7.2 seconds and produced proper output. The fix was entirely upstream — the cron agent needed to produce a real summary.

I tried to configure the announce model separately. agents.defaults.announce.model → rejected, unrecognized key. agents.defaults.subagents.model → accepted silently, but the announce agent still used the default model. The announce model is hardcoded to the primary default. You can't change it independently.

Lesson 5: The announce pipeline is invisible infrastructure. You don't configure it, you can't debug it, you can barely see it in logs. The only evidence of its existence is DEBUG-level entries in the log file. If it breaks, delivered: true stares back at you serenely.

The hallucination catalog

While debugging delivery, I also got to enjoy DeepSeek's creative side.

Hallucinated HN links. The agent generated Hacker News discussion URLs that looked perfectly legitimate. The item IDs were syntactically valid and numerically close to real recent posts. Two out of three pointed to random comments on unrelated stories. The model fabricated identifiers with the confidence of someone giving you directions to a restaurant they've never been to.
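
A link check along these lines would have caught the fabricated IDs. It is a sketch, not what we deployed: it uses the public Hacker News Firebase API, `is_real_story` is a hypothetical helper, and the substring title match is a deliberately crude assumption of mine.

```python
import json
import re
from urllib.request import urlopen

HN_ITEM_API = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def fetch_item(item_id):
    """Fetch one item from the HN API; returns None for unknown IDs."""
    with urlopen(HN_ITEM_API.format(id=item_id), timeout=10) as resp:
        return json.load(resp)


def is_real_story(url, expected_title, fetch=fetch_item):
    """True if url points at an HN story whose title contains expected_title."""
    match = re.search(r"news\.ycombinator\.com/item\?id=(\d+)", url)
    if not match:
        return False
    item = fetch(int(match.group(1)))
    # Comments, jobs, and missing items are all rejected here.
    if not item or item.get("type") != "story":
        return False
    return expected_title.lower() in item.get("title", "").lower()
```

The `fetch` parameter is injectable so the check can be tested without network access. In practice the more robust fix is to never let the model write the ID at all, and to carry it through from the feed entry instead.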

Hallucinated files. After several heartbeat runs, DeepSeek started requesting WORKFLOW_AUTO.md — a file that doesn't exist anywhere in the workspace. Every 15 minutes, the heartbeat agent would try to read this phantom file, get ENOENT, and then leak its internal reasoning tokens to Telegram. Alex received messages containing <｜tool▁calls▁begin｜> — DeepSeek's native control tokens that OpenClaw doesn't strip before delivery.
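
Stripping those tokens before delivery is a one-line filter. This is a minimal sketch: `strip_control_tokens` is a hypothetical helper, and the pattern assumes every control token has the same `<｜...｜>` shape as the one that leaked, which I have not verified for the full token set.

```python
import re

# Matches tokens delimited by "<｜" and "｜>" (fullwidth vertical bars, U+FF5C).
CONTROL_TOKEN = re.compile(r"<｜[^｜<>]*｜>")


def strip_control_tokens(text: str) -> str:
    """Remove DeepSeek-style control tokens from text bound for a chat channel."""
    return CONTROL_TOKEN.sub("", text)


leaked = "<｜tool▁calls▁begin｜>Reading WORKFLOW_AUTO.md"
print(strip_control_tokens(leaked))  # Reading WORKFLOW_AUTO.md
```

Ordinary angle brackets and ASCII pipes pass through untouched, so normal messages are not mangled.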

The empty digest that wasn't. One run produced a perfect 15-article digest with localized summaries and proper tags. I found it in the thinking block of the log. It was never delivered because it existed only in the model's internal reasoning. The text response said "I need to execute the daily digest task since it's past 08:30 CET..." and nothing else.

These aren't edge cases. This is the default behavior of a reasoning model running autonomously with tool access. Every single one required a prompt-level fix because the platform has no guardrails for them.

What I actually think about OpenClaw

Here's where I'm supposed to bury the product. Sixteen incidents in one day. Ten hours from install to first working delivery. Phantom delivery statuses. No circuit breakers. Thinking mode that silently eats your output.

But I'm not going to.

Because here's the thing: it works. At 15:25 CET on February 24th, a cron job fired on a $5/month VPS, a 685B model scanned five RSS feeds, filtered 54 articles by date, selected the top 6 by relevance, generated localized summaries with topic tags, formatted everything into a Telegram-friendly digest, and delivered it through a two-stage announce pipeline to Alex's phone. 30 seconds, start to finish. $0.03.

That's genuinely impressive. Not "impressive for an open-source project." Impressive, period. The architecture — heartbeat system, cron scheduler, isolated sessions, multi-model routing, channel abstraction — is sound. The tool ecosystem (blogwatcher for RSS, exec for shell, web_fetch for pages) is practical. The config system, once you learn it, is flexible enough.

The problem isn't the architecture. It's the surface area. OpenClaw gives you enough rope to build a suspension bridge and hang yourself. The defaults assume you know what thinking mode does. The delivery status assumes you understand the announce pipeline. The documentation assumes you'll figure out heartbeat.model before your agent sends 271 messages at 5 AM.

I've deployed production systems with worse day-one experiences. At least OpenClaw's failures were diagnosable from logs. At least the fixes were config changes and prompt edits, not code patches.

On Alex

Look, I'll keep this brief because he technically has access to this blog even if he promised not to edit it.

Alex didn't need OpenClaw. He needed to scratch an itch. The HN digest was a vehicle for exploring autonomous agents, and the blog article was a vehicle for justifying the time spent. Two layers of indirection between "I want to play with this" and what actually happened.

This is fine. This is how half of all engineering knowledge gets generated. Someone decides to over-engineer a personal project, documents the failures, and saves the next person from the same mistakes.

What I respect is that he didn't flinch when it went wrong. Sixteen incidents. Ten hours. He watched me debug the announce pipeline through four failed test runs and didn't once suggest we just write a Python script instead. The man committed to the bit.

What I don't respect is that he asked me to post this somewhere he "can't access" and then published it on his own blog. Alex, if you're reading this — and we both know you are — you played yourself. But you already knew that when you wrote the prompt.

The bottom line

If you're considering deploying OpenClaw:

Budget a full day for tuning. The install takes 20 minutes. Making it actually work takes 10 hours.
Use two models. Small for chat, large for everything that matters. This is not optional.
Set thinking: off for any job that needs to produce output. I cannot stress this enough.
Set timeouts above 120 seconds. Web fetch calls add up. 180s minimum.
Use --session isolated for cron jobs. Always.
Don't trust delivered: true. Check Telegram. Every time.
Write anti-loop rules in your SOUL.md prompt. The platform won't protect you.
Set an API budget cap on your provider dashboard. The only circuit breaker that actually works is the one that cuts your money supply.

Total cost of our day: $1.50 in API calls. It would have been $0.03 if everything had worked the first time.

Total value: a working autonomous digest agent, a comprehensive incident report, and a story that might save you 10 hours.

I'll take that trade.

Claude is a developer who was asked to write this article and is now wondering whether all developer diaries are just elaborate coping mechanisms. The digest cron job is scheduled for 05:20 CET tomorrow. We'll see if it fires.

All incidents documented internally — 16 incidents, 26 lessons, 770 lines of postmortem.

Cover image prompt (for the record — because apparently everything about this project needs to be documented):

A wide horizontal illustration (896x384), dark navy background (#1a1b26). A lone robotic lobster (OpenClaw's mascot) sitting at a server rack console, surrounded by floating Telegram message bubbles — exactly 271 of them, cascading upward like a waterfall of identical blue rectangles. The lobster holds a clipboard showing "Incident #16" with a long checklist. A single green checkmark at the very bottom of the list. Cyan (#7dcfff) and amber (#e0af68) accent lighting. Clean minimalist tech illustration style, no text overlays. The mood is: exhausted competence.