Why my cron job ran twice (and the one line that fixed it)
A backup job firing twice every morning, two identical hosts that shouldn't both have existed, and ninety minutes I'd rather not have spent inside /var/log/syslog.
Last Tuesday I got paged at 07:14. Same backup job, two emails, ninety seconds apart. The script — which I had written, debugged, and forgotten about three months ago — was running twice. Then it stopped. Then it ran twice again the next morning.
If you have ever spent half a Saturday inside /var/log/syslog wondering why the moon is conspiring against your crontab, this one is for you.
What I checked first, and was wrong about
The obvious guess: someone added a duplicate entry. I ran crontab -l on the box. One line. The same line I had written in March:
14 7 * * * /opt/backups/run.sh
Not the bug.
Next guess: maybe run.sh was being invoked by something else. I checked the system timers:
$ systemctl list-timers --all | grep backup
(no output)
Nothing in /etc/cron.d/, nothing in /etc/cron.daily/, no at jobs, no anacron. The script was only referenced by the user crontab, which had exactly one line.
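The same sweep can be scripted so nothing gets skipped under pressure. This is a minimal sketch, assuming the usual Debian/Ubuntu cron locations; the function name find_schedulers and the optional root prefix are made up for illustration, and at jobs and systemd timers still need their own checks.

```shell
# Print every cron file that mentions the given script path.
# The directory list covers common Debian/Ubuntu defaults; other
# distributions may keep cron files elsewhere.
find_schedulers() {
    needle="$1"
    root="${2:-}"   # optional path prefix, handy for checking a copied filesystem
    for path in "$root/etc/crontab" "$root/etc/cron.d" \
                "$root/etc/cron.hourly" "$root/etc/cron.daily" \
                "$root/etc/cron.weekly" "$root/etc/cron.monthly" \
                "$root/var/spool/cron"; do
        [ -e "$path" ] || continue
        # -r recurses into directories, -l prints only matching file names
        grep -rl -- "$needle" "$path" 2>/dev/null
    done
    return 0
}
```

Run as root, find_schedulers /opt/backups/run.sh should list the user crontab file and nothing else if the entry really exists only once.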
Then I noticed the thing I should have noticed in the first thirty seconds. The morning's syslog had two matching CMD entries:
Jun 2 07:14:01 backup-01 CRON[28412]: (root) CMD (/opt/backups/run.sh)
Jun 2 07:14:02 backup-01 CRON[14501]: (root) CMD (/opt/backups/run.sh)
Different PIDs. One second apart. Same hostname. Which is normal, except this server only had one cron daemon.
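Spotting the pair by eye took me far too long. A rough way to surface it is to group the CMD lines by minute, host, and command and print anything that shows up more than once. The sketch below assumes the traditional syslog timestamp format shown above; the function name duplicate_cron_runs is hypothetical.

```shell
# Read syslog lines on stdin and report every (minute, host, command)
# combination that cron started more than once.
duplicate_cron_runs() {
    grep 'CRON\[[0-9]*\]: .*CMD' |
    awk '{
        minute = $1 " " $2 " " substr($3, 1, 5)   # keep HH:MM, drop the seconds
        cmd = $0
        sub(/^.*CMD /, "", cmd)                    # keep only the "(command)" part
        count[minute " " $4 " " cmd]++
    }
    END {
        for (k in count) if (count[k] > 1) print count[k] "x " k
    }'
}
```

Something like duplicate_cron_runs < /var/log/syslog would flag the 07:14 pair. It cannot say whether the two entries came from one machine or two, only that the same job started twice under the same name.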
The actual cause
The host had been migrated to a new instance the week before. The old instance, which everyone (including me) believed had been destroyed, was still running. Both machines were named backup-01. Both had the same crontab, deployed from the same Ansible role. Both pointed at the same S3 bucket. So at 07:14 every morning, both ran the same script, both uploaded to the same key, and both emailed the same address. The PIDs were different because they were different processes on different boxes.
This is not really a cron bug. It's a "we forgot to terminate the old VM" bug. But cron made it look like a cron bug, which is why I spent ninety minutes inside the wrong file.
The local fix, and why it wasn't enough
The fix on the cron side took ten seconds. I wrapped the script in flock so two copies on the same host couldn't run concurrently:
14 7 * * * flock -n /tmp/backup.lock /opt/backups/run.sh
flock -n tries to grab the lock and exits immediately if it can't. Useful if your cron entry sometimes overlaps itself: a slow job that hasn't finished by the time the next minute ticks. It is the right reflex.
It was also useless for my problem. Two hosts, two local /tmp locks, no coordination between them. Both still ran. Both still emailed.
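To see what flock does and does not cover, here is a small single-host demonstration. It assumes util-linux flock is installed and uses a throwaway lock file rather than /tmp/backup.lock. While the first process holds the lock, the second attempt gives up at once.

```shell
# Single-host demo of flock -n: the second attempt is refused while
# the first process still holds the lock.
lock=$(mktemp)                  # throwaway lock file for the demo

flock -n "$lock" sleep 3 &      # first "job" holds the lock for 3 seconds
sleep 1                         # give it time to acquire the lock

if flock -n "$lock" true; then
    second="ran"
else
    second="skipped"            # the lock is held, so flock -n exits non-zero
fi

wait                            # let the first job finish and release the lock
echo "second job: $second"
rm -f "$lock"
```

The lock is just a file on the local filesystem. A second machine has its own copy of that path and its own lock, so neither side ever ends up in the skipped branch.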
What actually fixed it
I needed a lock that lived somewhere both hosts could see. The cheap version is an object in the bucket they both write to. At the top of run.sh:
#!/bin/bash
set -e
BUCKET="acme-backups"
LOCK_KEY="locks/backup-$(date +%F).flag"
# Try to claim today's slot. If the object already exists, bail.
if aws s3api head-object --bucket "$BUCKET" --key "$LOCK_KEY" >/dev/null 2>&1; then
# Log to syslog rather than stdout: cron mails any output the job prints.
logger -t backup "$LOCK_KEY exists, another host is running. Exiting."
exit 0
fi
# Claim the slot. If two hosts race here, last writer wins — the
# backup itself is idempotent, so we accepted the risk.
echo "$(hostname -f) pid=$$" | aws s3 cp - "s3://$BUCKET/$LOCK_KEY"
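# Do the actual work. Cron starts jobs in the user's home directory,
# so move to the script's directory before using relative paths.
cd "$(dirname "$0")"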
./backup-database
./backup-files
Whichever host writes the flag first wins. The other one sees the key, prints a line, and exits clean. No second email.
There is a small race window — both hosts could call head-object before either writes — and in that case both will still run. In our case the backup is idempotent and overwriting the same S3 key with the same data is fine, so we shrugged at the race. If your work isn't idempotent, use a real coordination service (a DynamoDB conditional PutItem, Consul, etcd, Postgres advisory lock) instead of trusting an HTTP HEAD and a hope.
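For the non-idempotent case, a DynamoDB conditional write closes the race because the existence check and the write happen in one request. This is only a sketch under assumptions: the table name job-locks, its string partition key lock_id, the holder attribute, and the function name claim_slot are all hypothetical, and it is not what we ran in production.

```shell
# Atomically claim today's slot with a DynamoDB conditional PutItem.
# Hypothetical setup: a table "job-locks" with a string partition key
# "lock_id", plus AWS CLI credentials allowed to write to it.
claim_slot() {
    lock_id="backup-$(date +%F)"
    holder="$(hostname -f) pid=$$"
    # The write succeeds only if no item with this lock_id exists yet.
    # Any failure, including network or permission errors, returns
    # non-zero; stderr is left visible so cron still reports real errors.
    aws dynamodb put-item \
        --table-name "job-locks" \
        --item "{\"lock_id\": {\"S\": \"$lock_id\"}, \"holder\": {\"S\": \"$holder\"}}" \
        --condition-expression "attribute_not_exists(lock_id)" \
        >/dev/null
}
```

In run.sh this would replace the head-object check and the flag upload with a single line such as claim_slot || exit 0. Exactly one host can win the write, so there is no window in which both proceed.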
Two small things I wish I had done from day one
I would not have named two machines the same string. The migration runbook had a step that said rename old host to backup-01-old before bringing up new host. That step was checked off in the ticket. It was not actually done. Hostnames are cheap. Identical hostnames across an environment are not.
I would also have logged the hostname inside the script itself. The subject line of cron's mail carries the user@host of the sending machine, but when both senders look identical to your inbox, you have no signal at all. Two lines at the top of run.sh:
echo "Host: $(hostname -f)"
echo "PID: $$"
would have told me within thirty seconds what took ninety minutes.
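One caveat: when two machines share a name, the hostname alone may not separate them. A variation, assuming systemd-based hosts where /etc/machine-id exists, logs that ID as well. The function name host_identity is made up, and machines cloned from the same image can share a machine ID too, so treat this as one more signal rather than proof.

```shell
# Print an identity line that can stay distinct even when two
# machines share a hostname. /etc/machine-id is present on
# systemd-based systems; "unknown" is a placeholder for the rest.
host_identity() {
    id_file="${1:-/etc/machine-id}"
    if [ -r "$id_file" ]; then
        machine_id=$(cat "$id_file")
    else
        machine_id="unknown"
    fi
    echo "Host: $(hostname) machine-id: $machine_id PID: $$"
}
```

Calling host_identity at the top of run.sh puts all three values in every email the job sends.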
The thing to keep, if you only keep one thing
When a scheduled job misbehaves, check whether your scheduled job is even the thing misbehaving. Most of the time the answer is upstream of cron. Cron is dumb. It runs exactly what you told it, on every host you installed it on, including the ones you forgot.
