<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Infra Magician</title>
        <link>https://harshit.cloud</link>
        <description>Deep dives into web development, infrastructure chaos, and the art of tinkering with technology.</description>
        <lastBuildDate>Wed, 20 May 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>Next.js using Feed for Node.js</generator>
        <language>en</language>
        <image>
            <title>Infra Magician</title>
            <url>https://harshit.cloud/og-image.png</url>
            <link>https://harshit.cloud</link>
        </image>
        <copyright>All rights reserved 2026, Harshit Luthra</copyright>
        <item>
            <title><![CDATA[The git commands I actually run every day]]></title>
            <link>https://harshit.cloud/blog/daily-git-commands</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/daily-git-commands</guid>
            <pubDate>Wed, 20 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Ten years of git, distilled to the daily eight, an fzf branch picker, and the weekly pruning ritual.]]></description>
            <content:encoded><![CDATA[
![A newspaper-style poster titled The Daily Eight, listing the eight git aliases I run most: gst, glola, gd/gds, gcam, gpsup, gco/gcb, gpf, gfa, each paired with its expansion in mono caps.](/images/daily-git-commands/hero.png)

*Fig. 1. The eight aliases that survive every refactor, every job, every laptop.*

I've been using git for a decade and most of what I type still fits on a single hand. The 200-page Pro Git book is wonderful and almost none of it survives contact with a real Tuesday. What survives is a small, boring set of commands that get rerun constantly.

This post is that list, ordered by how often my fingers actually type them. Aliases are from the oh-my-zsh `git` plugin (enabled in the stock oh-my-zsh `.zshrc`); the full command sits next to the alias so it's portable.

## the daily eight

These are the ones I'd type in my sleep. If you're not using all eight already, picking them up pays back inside a week.

### gst
*git status*

```bash
gst
```

I run this between every other command. It's the cheapest sanity check git has. Branch, ahead/behind, staged, unstaged, untracked. Two seconds. If you only learn one alias, learn this one.

### glola
*git log --oneline --graph --decorate --all*

```bash
glola | head -30
```

The one true log. Graph of every branch (local + remote), one line per commit, colored refs. Pipe through `head` because most of the time you only care about the last 20-30 commits.

### gd / gds
*git diff / git diff --staged*

```bash
gd          # what's changed but not staged
gds         # what's staged and about to be committed
```

`gds` before every commit. If you set [delta](https://github.com/dandavison/delta) as your pager (`brew install git-delta`, then `pager = delta` under `[core]` in `~/.gitconfig`), the output stops being painful to read.

### gcam
*git commit -a -m*

```bash
gcam "fix: trailing slash in webhook URL"
```

Quick one-line commits for small fixes. For anything bigger I drop the `-m` and let `$EDITOR` open so I can write a proper message with a body.

### gpsup
*git push --set-upstream origin \<current-branch\>*

```bash
gpsup
```

First push of a new branch. The full command is annoying to type, so `gpsup` figures out the current branch name itself. After the first push, plain `gp` (just `git push`) works because upstream is set.

### gco / gcb
*git checkout / git checkout -b*

```bash
gco main             # switch to main
gco -                # switch to previous branch
gcb feature/login    # create + switch to new branch
```

`gco -` is the one to notice. Like `cd -` for branches. When you're bouncing between two branches all day, it's one short command each way instead of typing the name.

### gpf
*git push --force-with-lease*

```bash
gpf
```

After rebasing or amending. **Always use `--force-with-lease`, never `--force`.** The lease version refuses to push if someone else has pushed to your branch since your last fetch, saving you from silently overwriting a teammate's work. There is no good reason to ever type `--force` in 2026.

### gfa
*git fetch --all --prune*

```bash
gfa
```

Refresh every remote, prune deleted remote branches. Run before you start anything that depends on knowing the current state of the world. The `--prune` half is what makes the cleanup ritual below work.

## checkout recent branches

`git branch` lists alphabetically, which is useless. What you actually want is "that branch from Tuesday," which means sorting by last commit:

```bash
git config --global alias.recent \
  "for-each-ref --sort=-committerdate refs/heads/ \
   --format='%(HEAD) %(color:yellow)%(refname:short)%(color:reset) \
             %(color:green)(%(committerdate:relative))%(color:reset) %(contents:subject)'"

git recent | head -10
```

That covers looking. For switching, pipe the same list into fzf and you never type a branch name again:

```zsh
# fco: fuzzy-checkout a recent branch
fco() {
  local branch
  branch=$(git for-each-ref --sort=-committerdate refs/heads/ \
             --format='%(refname:short)' \
           | fzf --height 40% --reverse \
                 --preview 'git log --oneline --decorate --color=always -15 {}')
  [ -n "$branch" ] && git checkout "$branch"
}
```

Branches arrive sorted by recency, so the one you want is almost always in the top three. Type two letters of its name, Enter, done. The preview pane shows the branch's recent commits so you can confirm it's the right Tuesday. `gco -` still wins for bouncing between exactly two branches; `fco` wins for everything else. (`brew install fzf` if you don't have it. You want it for `Ctrl-R` history search anyway.)

## the cleanup ritual

Run this weekly. If you've ever scrolled through 80 stale branches looking for the one you actually want, you already know why.

The easy half deletes every local branch whose tip is already in `main`:

```bash
gfa
git branch --merged main | grep -vE '^[*+]|^ *(main|master)$' | xargs -n1 git branch -d
```

Works only if your team uses merge commits. Most don't. GitHub's "Squash and merge" creates a brand-new commit on `main` with a different SHA, so `git branch --merged` never catches your local branch — its commits aren't in main's history at all.
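To see the miss for yourself, here's a minimal sketch in a throwaway repo. The temp directory, the `feature/demo` branch, and the `demo` identity are all made up for the demo, and `git merge --squash` is standing in for GitHub's button, which I'm assuming rewrites history the same way for this purpose.

```bash
# Throwaway repo; nothing here touches your real work
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # force the branch name regardless of init.defaultBranch
git config user.name demo               # demo identity, local to this repo only
git config user.email demo@example.com
git commit -q --allow-empty -m "init"

# Two commits on a feature branch
git checkout -q -b feature/demo
echo one > a.txt && git add a.txt && git commit -q -m "feat: one"
echo two >> a.txt && git commit -q -am "feat: two"

# Squash-merge into main: one brand-new commit with a new SHA
git checkout -q main
git merge --squash feature/demo >/dev/null 2>&1
git commit -q -m "feat: demo (squashed)"

merged=$(git branch --merged main --format='%(refname:short)')
echo "merged into main: $merged"        # feature/demo is missing from this list
git diff --quiet main feature/demo && echo "yet the two trees are identical"
```

The content shipped, but as far as commit ancestry goes the branch was never merged.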

The workaround: after `gfa`, any branch whose tracked remote was deleted shows as `[gone]`. Those are your merged-and-deleted PRs.

```zsh
# git-gone: delete local branches whose remote tracking branch is gone
git-gone() {
  git fetch --prune
  local gone
  gone=$(git for-each-ref --format '%(refname:short) %(upstream:track)' refs/heads \
         | awk '$2 == "[gone]" {print $1}')
  if [ -z "$gone" ]; then
    echo "No gone branches"
    return
  fi
  echo "$gone"
  echo -n "Delete these? [y/N] "
  read -r confirm
  [[ "$confirm" == "y" ]] && echo "$gone" | xargs -r git branch -D
}
```
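The `awk` filter is the only subtle part, and it can be checked without a repo. The three input lines below are made-up samples of what `%(refname:short) %(upstream:track)` prints; only the `[gone]` one should survive.

```bash
# Made-up for-each-ref output: gone upstream, diverged upstream, no upstream at all
sample='fix/webhook-slash [gone]
feature/login [ahead 2, behind 1]
scratch '

# Same filter as git-gone: keep the branch name only when field two is exactly "[gone]"
gone=$(printf '%s\n' "$sample" | awk '$2 == "[gone]" {print $1}')
echo "$gone"
```

Comparing the second field exactly keeps `[ahead 2, behind 1]` and untracked branches out of the delete list.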

Or install [`git-trim`](https://github.com/foriequal0/git-trim) (`brew install git-trim`), which is smarter. It detects patch-equivalent commits, so it catches squash-merges even when the upstream tracking ref isn't `[gone]`.

```bash
git trim --dry-run      # list what would be deleted
git trim                # prompt for confirmation, then delete
```

This is the closest thing to "did my PR ship?" you can ask git directly.

## the weekly four

Not in your fingers yet, but should be.

### `git switch` and `git restore`

```bash
git switch -c new-feature           # create + switch
git restore --staged file.txt       # unstage
git restore --source=abc123 file.go # restore single file from any commit
```

`switch` and `restore` split the four jobs `checkout` used to do. The one I reach for most is `restore --source=<sha> <path>`. Translation: "grab this single file from three commits ago without touching anything else."

### interactive rebase with autosquash

```bash
git commit --fixup=abc123       # fixup commit targeting abc123
# ... keep working ...
git rebase -i --autosquash main # all fixups slot into place automatically
```

The single biggest workflow win I've found in ten years of git. While reviewing your own PR you find a bug four commits back. Don't fix it on top: `--fixup=<sha>` creates a commit targeting the offender, and the autosquash rebase squashes everything into place when you're done. Install [git-absorb](https://github.com/tummychow/git-absorb) (`brew install git-absorb`) and it even picks the target SHA for you: edit the files, `git add` them, run `git absorb --and-rebase`, done.
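A minimal sketch of the fixup loop in a throwaway repo, assuming nothing beyond stock git. The file names and commit messages are invented, and `GIT_SEQUENCE_EDITOR=true` is only there to accept the generated todo list without opening an editor.

```bash
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name demo               # demo identity, local to this repo only
git config user.email demo@example.com
git commit -q --allow-empty -m "init"
base=$(git rev-parse HEAD)

# The commit with the typo, then unrelated work on top of it
echo "helo" > greet.txt && git add greet.txt && git commit -q -m "add greeting"
target=$(git rev-parse HEAD)
echo "x" > other.txt && git add other.txt && git commit -q -m "add other"

# Fix the typo as a fixup aimed at the original commit
echo "hello" > greet.txt
git commit -q -a --fixup="$target"

# Autosquash reorders the todo list and folds the fixup into its target
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash "$base"

git log --format=%s "$base"..HEAD       # two commits remain, no "fixup!" entry
git show HEAD~1:greet.txt               # the greeting commit now carries the fix
```

In real use you'd drop `GIT_SEQUENCE_EDITOR=true` and eyeball the todo list before saving it.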

### `git reflog`, the universal undo

```bash
git reflog
git reset --hard HEAD@{5}
```

Every change to `HEAD` is logged, by default for 90 days (30 for entries no longer reachable from the current tip). Bad rebase? `reflog`. Deleted branch? `reflog`. There is almost nothing in git you can't undo if you know about it.
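Here's a minimal sketch of the deleted-branch rescue, run in a throwaway repo. The `experiment` branch and its commit message are made up, and matching the reflog entry by its subject is just one convenient way to find the SHA for the demo.

```bash
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # force the branch name regardless of init.defaultBranch
git config user.name demo               # demo identity, local to this repo only
git config user.email demo@example.com
git commit -q --allow-empty -m "init"

# Do some work on a branch, then delete the branch by mistake
git checkout -q -b experiment
echo idea > idea.txt && git add idea.txt && git commit -q -m "wip: idea"
lost=$(git rev-parse HEAD)
git checkout -q main
git branch -q -D experiment             # the ref is gone, the commit is not

# HEAD's reflog still records the commit; %gs is the reflog message
found=$(git reflog --format='%H %gs' | awk '/commit: wip: idea/ {print $1; exit}')
git branch experiment "$found"          # bring the branch back
git log -1 --format=%s experiment
```

Day to day you'd read `git reflog` by eye and copy the SHA rather than script the lookup.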

### `git worktree`

```bash
git worktree add ../proj-hotfix hotfix/prod-down
git worktree remove ../proj-hotfix
```

Need to fix a prod bug while halfway through a feature? Don't stash. `worktree add` gives you a second checkout in a sibling directory, sharing the same `.git`. I use it constantly for "let me review your PR" without leaving my own branch.

## set it once

Five config lines and one scheduled job. Enable, forget.

```bash
git config --global rerere.enabled true          # remember conflict resolutions, replay them
git config --global push.default current         # `git push` pushes current branch to same name
git config --global push.autoSetupRemote true    # first push sets upstream automatically
git config --global diff.algorithm histogram     # cleaner diffs than the default myers
git config --global merge.conflictStyle zdiff3   # conflict markers include the common ancestor
git maintenance start                            # run inside a repo: registers it for scheduled background gc/prefetch
```

`autoSetupRemote` (Git 2.37+) retires `gpsup` entirely. `zdiff3` (Git 2.35+) shows the original code both sides diverged from; once you've used it, plain `<<<<<<<` markers feel like flying blind.

## when something is broken

Not daily, but when the question is "when did this start," nothing else answers it:

```bash
git log -S "functionName"          # pickaxe: commits where this string was added or removed
git blame -w -C -C -C file.go      # blame the logic's actual author, not the formatter
git log -p --follow file.go        # full file history, including across renames
git range-diff @{u}...@            # what a rebase actually changed; run before force-pushing
```

`-S` matches commits that change how many times the string occurs in the code, and never looks at commit messages; that makes it a different thing entirely from `--grep`. And plain `blame` gives credit to whoever last ran Prettier; `-w -C -C -C` follows the code across whitespace changes, moves, and file boundaries to the person who wrote the logic.
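The `-S` versus `--grep` split is easy to check in a throwaway repo. In this sketch the string `retry_limit`, the file, and both commit messages are invented: one commit adds the string to a file, the other only mentions it in its message.

```bash
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name demo               # demo identity, local to this repo only
git config user.email demo@example.com

# Commit 1 puts the string in the file; its message never mentions it
echo 'retry_limit = 3' > conf.txt && git add conf.txt
git commit -q -m "add config"

# Commit 2 mentions the string in the message but does not touch it in the file
echo 'timeout = 30' >> conf.txt
git commit -q -am "mention retry_limit in message only"

pickaxe=$(git log -S retry_limit --format=%s)
grepped=$(git log --grep=retry_limit --format=%s)
echo "-S:     $pickaxe"
echo "--grep: $grepped"
```

Each flag finds exactly one commit, and they are different commits.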

## the four tools worth installing today

- **[fzf](https://github.com/junegunn/fzf)** (`brew install fzf`). Powers the `fco` branch picker above, plus fuzzy `Ctrl-R` history.
- **[git-absorb](https://github.com/tummychow/git-absorb)** (`brew install git-absorb`). Auto-fixup commits without picking SHAs.
- **[delta](https://github.com/dandavison/delta)** (`brew install git-delta`). Diff and blame output that doesn't hurt to look at.
- **[lazygit](https://github.com/jesseduffield/lazygit)** (`brew install lazygit`). TUI for the operations that are tedious on CLI: partial commits, stash management, conflict resolution.

This post used to end with two AI shell helpers for the stuff git can't tell you; those now live in [their own TIL](/til/one-line-ai-shell-helper).

## ten years in, the surprise

After a decade, the command I run most isn't `commit`. It isn't `push`. It's `gst`, hundreds of times a day, between every other operation. The most-used git command in my shell is the one that does nothing.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>git</category>
            <category>shell</category>
            <category>zsh</category>
            <category>productivity</category>
            <category>devops</category>
            <enclosure url="https://harshit.cloud/images/daily-git-commands/hero.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[A single zsh function for one-line AI answers that knows when to pre-type the command]]></title>
            <link>https://harshit.cloud/til/one-line-ai-shell-helper</link>
            <guid isPermaLink="false">https://harshit.cloud/til/one-line-ai-shell-helper</guid>
            <pubDate>Wed, 20 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Asking a chat UI for a one-line command is too much friction. A 15-line zsh function and a `print -z` trick fix it, with one oh-my-zsh footgun along the…]]></description>
            <content:encoded><![CDATA[
I kept opening a chat tab just to ask "what's the kubectl command for decoding a secret" or "convert 42 GiB to bytes". The context switch was costing more than the answer was worth.

Wrapping an AI CLI into a single shell function fixed it. The interesting part is `print -z`, plus one heuristic that needs more care than it looks.

## the function

```zsh
# p: one-shot AI query. Examples: `p whats 2 + 2`, `p kubectl secret decode grafana`
# Smart dispatch: if the answer looks like a runnable command, pre-type it into
# the next prompt (print -z). Otherwise print to stdout. Math/facts get printed,
# commands get queued for you to review and press Enter.
p() {
  emulate -L zsh
  setopt NO_GLOB
  if [ $# -eq 0 ]; then
    echo "usage: p <question or task>" >&2
    return 1
  fi
  local out
  out=$(pi -p --no-session --append-system-prompt 'Answer in ONE line. No preamble, no explanation, no markdown, no code fences. For shell/kubectl/git/etc requests output only the command. For factual or math questions output only the answer.' "$*" \
        | tr -d '\000-\037' \
        | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
  if [ -z "$out" ]; then
    return 1
  fi
  local first="${out%% *}"
  if [[ "$first" == [a-zA-Z_]* ]] && whence -p "$first" >/dev/null 2>&1; then
    print -z -- "$out"
  else
    print -r -- "$out"
  fi
}
alias p='noglob p'
```

`pi` is just whatever AI CLI you have. Swap in `claude -p`, `llm`, `gh copilot suggest`, `ollama run`. The pattern doesn't care about the backend.

## what it feels like

```
$ p whats 2 + 2
4

$ p capital of mongolia
Ulaanbaatar

$ p regex for matching an email
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}

$ p kubectl secret decode grafana
# next prompt now shows, cursor at the end:
$ kubectl get secret grafana -o go-template='{{range $k,$v := .data}}{{$k}}: {{$v | base64decode}}{{"\n"}}{{end}}'█

$ p find all log files modified today
# next prompt:
$ find . -type f -name "*.log" -mtime -1█
```

Same one-letter command for both. Answers go to stdout, commands go to the prompt buffer where you can edit them before pressing Enter.

## the key idea: `print -z` for runnable output

`print -z` is the trick that makes this design work. It pushes text onto the zsh line editor, i.e. into your next prompt, pre-typed and ready. Compared to every alternative:

| Strategy | Speed | Safety | Friction |
|----------|-------|--------|----------|
| `eval "$(...)"` | fastest | **bad**, auto-runs model output | none |
| Pipe to `pbcopy` | medium | safe | switch focus, paste |
| Print to stdout | medium | safe | select + copy + paste |
| **`print -z`** | **fastest** | **safe**, you press Enter | **none** |

The mental model: `print -z` is what `Ctrl-R` history search does when you accept a result. Native zsh. You always see and approve the command before it runs.

## the heuristic: when is the answer a command?

The smart dispatch decides between `print -z` (pre-type) and `print -r` (stdout) by looking at the first word of the answer:

```zsh
if [[ "$first" == [a-zA-Z_]* ]] && whence -p "$first" >/dev/null 2>&1; then
  print -z -- "$out"
else
  print -r -- "$out"
fi
```

Two checks, both load-bearing:

1. **First char is a letter or underscore.** Excludes digits (`4`), symbols (`[`, `/`, `(`), and anything else that obviously isn't a command name.
2. **`whence -p` resolves it to a PATH executable.** Not just "this name exists in the shell", but *specifically* a real binary on disk.
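For anyone porting this to bash, here is a rough sketch of the same two checks. `is_runnable` is a hypothetical helper name, and `type -P` is bash's forced PATH lookup, swapped in for zsh's `whence -p`. The sample answers are the ones from the session above, and the results assume `ls` is on your PATH and nothing named `Ulaanbaatar` is.

```bash
# is_runnable: succeed only when the first word looks like a command name
# and resolves to an executable on PATH (aliases, functions, builtins ignored)
is_runnable() {
  local first="${1%% *}"
  case "$first" in
    [a-zA-Z_]*) type -P "$first" >/dev/null 2>&1 ;;
    *) return 1 ;;
  esac
}

for answer in "4" "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+" "ls -la /tmp" "Ulaanbaatar"; do
  if is_runnable "$answer"; then
    echo "pre-type: $answer"
  else
    echo "print:    $answer"
  fi
done
```

Only the `ls` line should come back as `pre-type`; the number and the regex fail the first check, the city name fails the second.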

Why `whence -p` and not `command -v`? Read on.

## the footgun: oh-my-zsh numeric aliases

My first attempt used `command -v "$first"` as the heuristic. It looked right. It failed in a way that took a minute to spot.

When I ran `p whats 2 + 2`, the answer was `4`, but nothing appeared in my terminal. The function exited cleanly with status 0. No error.

What had happened: oh-my-zsh's `lib/directories.zsh` (loaded on every oh-my-zsh startup, no plugin needed) aliases `1` through `9` to `cd -1` ... `cd -9` for jumping around the directory stack. So `command -v 4` returned true. `4` was a recognized alias, and the function tried to `print -z 4` into my prompt buffer.

In a real interactive shell, that would have stuffed `4` into my prompt invisibly (it'd appear when I hit Enter). In my non-interactive test (`zsh -ic '...'`) it disappeared into the void because there's no line editor to render the stuffed buffer.

The fix has two parts:

- **`[[ "$first" == [a-zA-Z_]* ]]`** alone would have caught it, because `4` doesn't start with a letter.
- **`whence -p`** instead of `command -v` makes it doubly safe. `whence -p` only matches binaries in PATH, ignoring aliases, functions, and builtins. Aliases like `4 → cd -4` are filtered out.

Either check alone would have caught the bug. Having both means the next time I add a feature here, I don't have to remember which one was load-bearing.

## defensive details that earn their keep

Three small things prevent subtle bugs:

### `noglob` on the alias

```zsh
alias p='noglob p'
```

Without this, `p list all *.log files` would have zsh expand `*.log` against the current directory *before* the function ever sees it. With `noglob`, the glob characters pass through literally. It's the same trick some zsh users apply to `git`, whose revision syntax overlaps with glob characters.

### `emulate -L zsh` + `setopt NO_GLOB`

```zsh
emulate -L zsh
setopt NO_GLOB
```

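`emulate -L zsh` resets shell options to defaults, scoped to this function only (the `-L` means local, so they restore on return). `NO_GLOB` is belt-and-suspenders inside the function body: nothing expanded there gets globbed, whatever options the caller had set. It can't rescue callers that bypass the alias (`\p ...`, or scripts that don't see your aliases), because their shell expands `*.log` before `p` ever runs; those callers have to quote.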

### output sanitization

```zsh
tr -d '\000-\037' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
```

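`tr -d '\000-\037'` strips all C0 control characters: the ESC byte (`\033`) that starts every ANSI escape sequence, stray nulls, tabs, and newlines (so multi-line output collapses into one line). It does not remove the printable remainder of an escape sequence, so colored output leaves residue like `[32m` behind. Critical for `print -z` because control characters in the payload corrupt the line editor's display.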

`sed` then trims leading and trailing whitespace, which the model usually adds even when told not to.
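A small sketch of what the pipeline does to colored output, using a simulated answer (the `sample` function and its payload are made up). The second variant adds one extra `sed` that removes whole color sequences first; that step is my assumption about a reasonable hardening, not part of the function above.

```bash
# Simulated model output: green-colored, padded, newline-terminated
sample() { printf '\033[32m  kubectl get pods \033[0m\n'; }

# The pipeline from the function, unchanged
as_is=$(sample | tr -d '\000-\037' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

# Variant: delete complete ESC[...m color sequences before stripping control bytes
esc=$(printf '\033')
cleaned=$(sample | sed "s/${esc}\[[0-9;]*m//g" \
                 | tr -d '\000-\037' \
                 | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

printf 'as-is:   %s\n' "$as_is"     # [32m  kubectl get pods [0m
printf 'cleaned: %s\n' "$cleaned"   # kubectl get pods
```

The as-is result starts with `[`, so the first-character check would route it to stdout instead of the prompt; with a backend that emits color, the extra `sed` is what keeps commands pre-typed.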

## why `"$*"` and not `"$@"`

`"$*"` joins all positional args into one string with spaces between them. `"$@"` would pass them as separate args, which most AI CLIs would concatenate anyway, but some treat the first positional as the prompt and the rest as files (the `@file.txt` convention is common). Joining explicitly avoids that ambiguity.

If your CLI supports `--` to end option parsing, prefer:

```zsh
your-ai-cli -p ... -- "$*"
```

`pi` doesn't accept `--`, hence the bare `"$*"`.
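The difference between the two is easy to see with a stand-in for the backend. `count_args` is a hypothetical placeholder for whatever AI CLI you use; it only reports how many arguments arrived.

```bash
# count_args: hypothetical stand-in for an AI CLI
count_args() { echo "$# arg(s), first: $1"; }

ask_star() { count_args "$*"; }   # joins every word into one string
ask_at()   { count_args "$@"; }   # forwards each word as its own argument

ask_star find all log files       # 1 arg(s), first: find all log files
ask_at   find all log files       # 4 arg(s), first: find
```

With `"$*"` the CLI always receives exactly one prompt string, whatever it would have done with extra positionals.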

## the system-prompt nudge actually matters

Without `--append-system-prompt`, even with `-p`, the default coding-assistant prompt wraps shell commands in code fences and adds a one-sentence intro. That breaks `print -z` (code fences are not commands) and clutters the terminal.

The phrasing that worked best:

> Answer in ONE line. No preamble, no explanation, no markdown, no code fences. For shell/kubectl/git/etc requests output only the command. For factual or math questions output only the answer.

"No markdown, no code fences" is doing most of the work. Without it you get backtick-wrapped output that `print -z` would happily push into your prompt as `` `kubectl get pods` ``, which is not a runnable command.

## why this beats the chat UI for short questions

| Action | Chat UI | `p` |
|--------|---------|-----|
| Switch context | yes | no |
| Round-trip latency | ~3-5s + UI | ~1-2s |
| Output format | markdown, prose | bare answer or pre-typed command |
| Get command into shell | select + copy + paste | already in your prompt |
| Session pollution | yes | no (`--no-session`) |
| Glob-expansion footgun | n/a | guarded (`noglob`) |

For anything longer than a paragraph the chat UI is still better. For "what's the syntax for X" or "the command for Y", the terminal is the right place to put the answer.

## the one substitution that fixed it

`command -v` → `whence -p`. One swap. The rest of the function (the `noglob`, the `emulate -L zsh`, the control-char strip) was already doing its job. The bug was trusting that "this name resolves in the shell" meant "this name is a binary on disk." It doesn't, and on any zsh with oh-my-zsh loaded it especially doesn't.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>zsh</category>
            <category>shell</category>
            <category>ai</category>
            <category>cli</category>
            <category>productivity</category>
            <enclosure url="https://harshit.cloud/til/one-line-ai-shell-helper/opengraph-image" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Seven visual tools, one diagram]]></title>
            <link>https://harshit.cloud/blog/seven-visual-tools-one-diagram</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/seven-visual-tools-one-diagram</guid>
            <pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Excalidraw is fast, but everything I make in it looks the same. Seven tools that promise visuals with attitude, one diagram, three I'd keep.]]></description>
            <content:encoded><![CDATA[
I write more than I draw, and the drawing is always the part of a post that takes longest. A thousand words land in an evening. One architecture diagram I'm happy with can eat a whole afternoon. Excalidraw is fast, and it's good, but if you've published with Excalidraw for any length of time you start to notice that everything you make in it looks like everything else made in it. The hand-drawn aesthetic stops being a personality and turns into a tell.

What I actually want is a visual that has some attitude. The kind of diagram people screenshot and put in slides. Something a reader pauses on for a beat before scrolling into the prose. Most of my posts are about infrastructure, so the visuals tend to be architecture-y, but the bar I'm chasing is closer to a magazine illustration than to a whiteboard photo.

A few months ago Claude Code started shipping a marketplace of plugins and skills, and several of them claim to draw exactly this kind of thing. I'd been meaning to test them and kept putting it off. So I sat down one weekend, picked a representative diagram, and ran the same brief through every tool I could find. Then I evaluated them side by side and kept the ones I'd actually reach for.

This is what I learned.

## the diagram I picked

The test case was a fairly busy multi-region Kubernetes setup. Three EKS clusters in three regions, with one of them acting as an ArgoCD hub doing App-of-Apps sync to the other two. Karpenter handling compute on each cluster, KEDA scaling one specific workload, and a unified observability stack feeding into a single account that pages out to an on-call rotation.

It's not a toy. There's enough going on that a half-baked tool would visibly fall apart, and enough structure that a well-designed tool would have something to organize.

## the lineup

| # | Tool | What it is |
|---|---|---|
| 1 | `Cocoon-AI/architecture-diagram-generator` | A claude.ai web skill |
| 2 | `cathrynlavery/diagram-design` | A Claude Code plugin built around a design system |
| 3 | `edlebertf/claude-infographic-gif` | A Claude Code skill for animated GIF infographics |
| 4 | `claudekit/frontend-design-pro-demo` | A Claude Code plugin for frontend interfaces |
| 5 | Vercel `ai-cli` + `bfl/flux-2-pro` | An image-gen CLI talking to Vercel's AI Gateway |
| 6 | A hand-coded SVG | The control |
| 7 | Excalidraw | The thing I was trying to graduate from |

Installing them was less smooth than I expected. The shell command I tried first, `claude plugin add <repo>`, doesn't exist. Claude Code's plugin system runs inside the TUI via `/plugin marketplace add` and `/plugin install`, or you `git clone` raw skills into `~/.claude/skills/<name>/`. The Cocoon entry on that list isn't a Claude Code plugin at all; it's a claude.ai web skill that you upload as a zip in the browser. I gave up on installing it from the CLI and drew its equivalent by hand instead, which means my Trial 1 below is more "Claude doing AWS re:Invent" than "the Cocoon skill doing what it does". For an apples-to-apples Trial 1 you'd have to upload the actual zip at claude.ai.

## trial 1 — the AWS re:Invent stand-in

![AWS re:Invent dashboard showing three cluster cards with category-coded service rows](/images/seven-visual-tools-one-diagram/trial-1-cocoon.png)

[Open the live version →](/images/seven-visual-tools-one-diagram/trial-1-cocoon.html)

I went with the AWS re:Invent aesthetic because anyone who's watched a cloud keynote in the last ten years will recognize it instantly. Three cluster cards in a row, each with the full service stack listed inside (ArgoCD, Karpenter, KEDA, workloads), category-coded by AWS's own palette (orange for compute, purple for networking, green for observability, pink for GitOps), and an observability strip at the bottom tying agents to the on-call system.

The output is dense. It's the kind of thing I'd use as an inline reference figure in a long post where the prose already explains each piece. If I dropped it as a hero, it'd be too much.

## trial 2 — diagram-design

![Editorial dark page with hub-spoke diagram, coral focal hub, muted spokes, italic-serif annotation, and three summary cards](/images/seven-visual-tools-one-diagram/trial-2-diagram-design.png)

[Open the live version →](/images/seven-visual-tools-one-diagram/trial-2-diagram-design.html)

This is the one I'd actually publish without retouching. The skill has a real design system baked in: one focal node in coral (and only one), two muted spokes, italic-serif annotation off to the side, summary cards underneath with varied widths, hairline borders, no shadows anywhere. The skill's own instructions are pretty firm about it, even setting a target density of 4 out of 10 and telling itself to delete anything that doesn't earn its place. You can feel that restraint in the result.

It looks like it came out of someone who's spent a long time thinking about diagrams as a form, not from a tool that randomly throws boxes on a page.

## trial 3 — infographic-gif

![Animated sankey diagram showing flows from a hub through three clusters to three resource categories](/images/seven-visual-tools-one-diagram/trial-3-infographic-gif.gif)

[Open the live version →](/images/seven-visual-tools-one-diagram/trial-3-infographic-gif.html)

This one surprised me. The output it gives you is an HTML file, not a GIF. You open the HTML in a browser, it canvas-renders frames for a few seconds, and then a "Download GIF" button appears. So the actual deliverable is two clicks away from what the skill spits out. Once you watch it animating though, it's properly satisfying. Mine became a three-stage sankey: hub on the left, three clusters in the middle, three aggregated resource buckets on the right, with bezier curves drawing in left to right.

One thing to know if you try this: the skill's sankey template is fundamentally two-level. Sources on the left, destinations on the right, with one center node. I had three layers, so the per-cluster resource breakdown got merged into single resource totals. The proportions are honest, the per-cluster detail isn't quite there. For an animated explainer where motion is the point, that's an acceptable trade.

## trial 4 — frontend-design-pro

![Cyberpunk-styled cluster status panel with neon cyan and magenta accents and CRT scanlines](/images/seven-visual-tools-one-diagram/trial-4-cyberpunk.png)

[Open the live version →](/images/seven-visual-tools-one-diagram/trial-4-cyberpunk.html)

I asked for a "live cluster status panel" in a cyberpunk aesthetic and got exactly that. CRT scanlines, VT323 for the chunky CRT-display digits on the node counts, neon accents, a magenta pulse on the panel that's marked `syncing`. A tiny script ticks the relative timestamps every second so the thing feels alive.

In isolation it looks great. Embedded in a sober technical post though, it would scream at the reader and you'd basically end up designing the rest of the post around the widget. So I went back to the same skill with a tighter brief to see if it could play in a calmer register.

### same skill, second pass

I asked it to draw the request lifecycle through an Istio ingress. Seven hops, client through NLB through gateway and onwards to a pod, with a return path back. Two deliverables: an animated HTML diagram, and a Flux image prompt for the static raster version.

![Dark OLED Luxury full-page request lifecycle diagram with seven components and animated packets](/images/seven-visual-tools-one-diagram/trial-4b-ingress-flow-animated.png)

[Open the live version →](/images/seven-visual-tools-one-diagram/trial-4b-ingress-flow-animated.html)

First attempt came back as a dashboard again. Big hero title, live telemetry chrome on top (latency, request rate, success percentage), seven custom component glyphs in a row, emerald packets going forward and amber packets coming back along a parallel return wire. It's a beautiful page, but it's still very much a page. If I tried to inline this inside another post, it would take over.

So I asked again. Same skill, third pass, this time framed as "a figure for the body of a blog post, not a dashboard":

![Tight figure-style version at 760px wide with figcaption](/images/seven-visual-tools-one-diagram/trial-4b-ingress-flow-blog.png)

[Open the live version →](/images/seven-visual-tools-one-diagram/trial-4b-ingress-flow-blog.html)

This is the version that goes inside a post. 760px wide, a real `<figure>` with `Fig. 1`, the diagram, and a real `<figcaption>` mixing a sentence of prose context with an inline legend. Self-contained dark so it doesn't fight whatever theme surrounds it. The skill gets to the right answer once you tell it the right question.

And the Flux raster, prompt written by the skill, rendered by `ai-cli`:

![Cinematic isometric 3D illustration of seven architectural artifacts connected by a glowing data conduit](/images/seven-visual-tools-one-diagram/trial-4b-flux-illustration.png)

Eighteen seconds of generation, zero human prompt-engineering on top. The skill's prompt was the work.

## trial 5 — Vercel ai-cli + Flux

![Isometric scene with three floating clusters connected to a central ArgoCD hub via glowing bridges](/images/seven-visual-tools-one-diagram/trial-5-flux-cover.png)

This is the one I'd use as a section header or a post cover. Flux 2 Pro through Vercel's AI Gateway, twelve seconds to render, a single one-line prompt. It's not replacing a technical figure (you can't actually point at anything in it), but as visual atmosphere at the top of a post, it carries weight. One thing to watch: the model ignored my `--size 1200x630` flag and inferred an aspect ratio instead. The CLI surfaced a warning about it, which I only noticed by accident. Check your output dimensions before you cut it into a layout.

## trial 6 — hand-coded SVG

![Hand-coded SVG architecture diagram with three VPC boxes and color-coded legend](/images/seven-visual-tools-one-diagram/trial-6-native-svg.svg)

The baseline. AWS-orange-on-navy, three boxes in a row, dashed hub-spoke arrows, observability strip at the bottom, color-coded legend. 12 KB, no JS, no fonts, no dependencies. Loads instantly, scales infinitely, and is trivial to edit when the architecture inevitably changes. Inside a long post where you need a quick reference and you don't want to fight CSS, this is more than enough. It's also the least visually interesting of anything in the lineup, which is either a feature or a bug depending on what the post is doing.

## trial 7 — Excalidraw (via the official connector)

![Hub-spoke architecture diagram from the Excalidraw connector — yellow hub cluster with nested ArgoCD callout, two blue spoke clusters with argocd-agent sub-boxes, and a green observability strip at the bottom with separate New Relic and Zenduty boxes connected by an "alerts" flow](/images/seven-visual-tools-one-diagram/trial-7-excalidraw.png)

[Download the `.excalidraw` source →](/images/seven-visual-tools-one-diagram/trial-7-excalidraw.excalidraw) · [View as SVG →](/images/seven-visual-tools-one-diagram/trial-7-excalidraw.svg)

Excalidraw is what I'd been using before this whole exercise, and it's the thing I was originally trying to graduate from. There's an official Excalidraw connector in the Claude.ai directory, so I dropped the same brief into it. It took the connector about four minutes to produce this.

The surprising thing isn't the layout — it's the aesthetic. The connector deliberately chose not to look like Excalidraw. The strokes are clean and straight, no rough.js wobble. The typography is sans-serif, not the Virgil hand-drawn default. Everything is grid-aligned, with nested sub-components: `argocd-agent (receiver)` boxes inside each spoke, an `argocd-agent (principal)` callout inside the hub, separate `New Relic` and `Zenduty` boxes inside the observability strip, an `alerts` flow connecting them. It reads as "architecture diagram drawn in Excalidraw" rather than "Excalidraw scratchwork".

That changes my read of Trial 7 in a way I didn't expect. The connector quietly opted out of the genre I'd complained about in the opening of this post. The casual hand-drawn aesthetic is one register of Excalidraw now, not the default Excalidraw. The canvas can be a sketchpad or a structured diagram tool, depending on how the scene is authored. The connector chose the structured side.

For the kind of visual I'd actually publish, the polished register it picked is closer to what I'd ship than the wobbly default would have been. If you specifically want the wobble, you'd have to ask for it.

(Small caveat on the rendering: the PNG above is exported from an SVG approximation, not from Excalidraw's own engine, so the strokes and font are close-but-not-identical to a canonical Excalidraw export. Open the `.excalidraw` source in Excalidraw and use `File → Export → PNG (2×)` for the official render. The difference is subtle.)

## what I'd reach for, in practice

For inline figures inside a long-form technical post, `diagram-design` won by a wide margin. The design system has real taste in it, the output looks publication-quality without me retouching anything, and the constraints (one focal node, no shadows, hairlines) keep the diagrams from feeling busy. This is the one I'll be using most.

For animated explainers, `infographic-gif` is decent when the motion actually adds information. The HTML-not-GIF artifact is mildly annoying but not a dealbreaker. If you need animated HTML where you control the design tightly, `frontend-design-pro` is more flexible, but you have to brief it carefully. Ask for a "live cluster panel" and you'll get a dashboard. Ask for "a figure to embed in the body of a blog post" and you'll get a figure. The skill responds to scope cues.

For hero images or section openers where the goal is mood rather than information, Flux through `ai-cli` is hard to beat at twelve seconds and one prompt. The cost per image is low enough that you can iterate a few times until something lands.

For quick reference diagrams inside a post, hand-coded SVG is still where I end up. It costs nothing to maintain, ages well, and gets out of the reader's way. Not every diagram needs to be a hero.

For the first draft, Excalidraw the app — still the fastest way to figure out where the boxes go in five minutes. For a ready-to-ship version with the Excalidraw canvas behind it but without the genre aesthetic, the Claude connector turned out to be a genuine option I hadn't expected to like.

## keeping all of this from eating your context

One thing I noticed after installing six different visual-generation plugins: every skill they register goes into the Claude Code system prompt at session start. Whether you use them or not, they're paying rent in your context window. Six skills isn't a lot in absolute terms, but if you add up everything else you've installed (MCP servers, hook configs, agent definitions), sessions start to feel heavier than they used to.

There are a few commands that aren't well-advertised but are exactly what you want:

```bash
claude plugin list             # what's installed, at which scope, enabled or disabled
claude plugin details <name>   # the actual skills, hooks, MCP servers this plugin loads
                               # plus its projected token cost
claude plugin disable <name>   # turn it off
claude plugin enable  <name>   # bring it back
```

The one I'd specifically call out is `details`. It tells you exactly what a plugin pulls into your session, and the rough token cost of having it loaded. If you feel like your context budget vanishes before you've done much, this is where to look first.

There is one wrinkle: enable and disable don't take effect mid-session. The skill set is fixed when Claude Code starts up, so toggling a plugin only changes the next session. In practice this turned into a workflow change for me. I keep most of these visual-generation plugins disabled by default at the user scope, install them at the project scope of my blog repo, and let them auto-activate when I `cd` into the blog. Outside of the blog, the context stays lean.

The setup is two commands per plugin:

```bash
# globally disabled
claude plugin disable --scope user diagram-design
claude plugin disable --scope user frontend-design-pro

# from inside the blog repo
claude plugin install --scope project diagram-design@diagram-design
claude plugin install --scope project frontend-design-pro@frontend-design-pro
```

After that, only sessions started inside the blog directory see those skills. Everywhere else, they're not loaded, not adding to the prompt, not costing me tokens. When I want them back globally I can enable them again.

## what I'm using now

Three tools and a fallback. `diagram-design` for any inline schematic. `frontend-design-pro`, in its blog-trimmed form, for animated or live-feeling diagrams once I've told it to behave like a figure. Flux through `ai-cli` for cover images and section openers. Hand-coded SVG when I just need a fast reference and don't want to think about it. The evaluation took most of a day. The diagrams take a few minutes each from here.

If you're considering doing something similar, the meta-lesson I'd offer is to pick one test diagram that's representative of what you'll actually publish, and then run every candidate against the same brief. The differences between these tools are not subtle, but they're not obvious from the marketing either. You see them when the same input comes out in six different ways.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>claude-code</category>
            <category>ai-tooling</category>
            <category>blogging</category>
            <category>design</category>
            <enclosure url="https://harshit.cloud/images/seven-visual-tools-one-diagram/trial-1-cocoon.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Lazy SRE's guide to secure systems, part 6: the network in front of everything]]></title>
            <link>https://harshit.cloud/blog/lazy-security-part-6-network-plane</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/lazy-security-part-6-network-plane</guid>
            <pubDate>Sun, 10 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Ivanti made everyone re-read their VPN architecture in January 2024. Tailscale, Cloudflare Tunnel, and WireGuard in one afternoon.]]></description>
            <content:encoded><![CDATA[
In January 2024, Ivanti disclosed two CVEs in their Connect Secure and Policy Secure VPN appliances. CVE-2023-46805 was an authentication bypass. CVE-2024-21887 was a command injection that required an authenticated administrator on its own; chained with the bypass, it became an unauthenticated remote shell on a box that, by design, had to be reachable from the public internet. Mandiant attributed the in-the-wild exploitation to UNC5221, a suspected Chinese state-sponsored cluster. By the time Ivanti shipped a patch, Mandiant had identified more than a thousand compromised appliances. CISA issued Emergency Directive 24-01 telling every U.S. federal agency to take their Ivanti boxes offline.

This is part 6. Earlier parts covered npm ([Part 1](/blog/lazy-security-part-1-supply-chain)), GitHub Actions ([Part 2](/blog/lazy-security-part-2-github-actions)), the unsexy infrastructure list ([Part 3](/blog/lazy-security-part-3-unsexy-list)), DNS auth records ([Part 4](/blog/lazy-security-part-4-dns-records)), and the dev laptop perimeter ([Part 5](/blog/lazy-security-part-5-dev-laptops)). Part 6 is the network in front of everything. What sits between your engineers and prod, and between prod and the public internet.

The thesis from Part 1 stands. Future You at 3am will not patch a VPN concentrator the same week the CVE lands, especially when the vendor patch breaks LDAP for half the team. The architecture that makes the concentrator irrelevant is the one that runs while you sleep: a mesh network where there is no internet-facing appliance to compromise in the first place.

## the VPN appliance is the attack surface

The Ivanti CVEs are not a unique event. They're the most recent member of a class. The same stretch, late 2023 through 2024, saw:

- **Cisco ASA / FTD**: CVE-2024-20353 + CVE-2024-20359. A web-services DoS and a persistent local code-execution flaw used together by the ArcaneDoor campaign (Line Dancer / Line Runner implants), attributed by Cisco Talos to state-sponsored actors. April 2024.
- **Citrix NetScaler "CitrixBleed"**: CVE-2023-4966. A session-token leak via memory disclosure, exploited by LockBit and others, used in the Boeing and Comcast breaches.
- **Fortinet FortiOS SSL VPN**: CVE-2024-21762. Out-of-bounds write in February 2024, exploited in the wild before patches were widely deployed.
- **Palo Alto Networks GlobalProtect**: CVE-2024-3400. Command injection in April 2024, exploited in a campaign Palo Alto Networks Unit 42 named Operation MidnightEclipse (Volexity tracks the actor as UTA0218).

![A wide editorial system diagram on deep navy ground. Center: a large rectangular box labeled 'YOUR VPN APPLIANCE — internet-facing HTTPS portal' with five sub-labels (login UI, session manager, tunnel termination, admin panel, OS). Five red curved arrows from the outside converge on it, each labeled with a real 2023-2024 CVE: CVE-2023-46805 (Ivanti), CVE-2024-21887 (Ivanti), CVE-2024-20353 (Cisco), CVE-2023-4966 (CitrixBleed), CVE-2024-3400 (Palo Alto). Behind the appliance, a smaller cluster of internal services labeled 'prod database', 'admin panel', 'engineer SSH'. The cluster is all reachable once the appliance is compromised. The whole assembly sits inside a coral-tinted boundary labeled 'the attack surface you can't shrink'. A small inset on the right shows the mesh alternative as a dotted hexagon of peers with no central appliance, captioned 'no concentrator, no portal, no inbound port'.](/images/lazy-security-part-6-network-plane/vpn-appliance-attack-surface.png)

*Fig. 1 — five vendors, twelve months, the same shape of vulnerability. The mesh alternative is the small diagram on the right.*

Five different vendors. Five different products. Five different attacker campaigns. The common shape is a publicly reachable HTTPS portal that handles authentication and tunnel termination. Every one of them has had a remotely reachable flaw exploited in the wild inside the same twelve months. That isn't a coincidence; it's an architecture.

What mesh networks (Tailscale, Cloudflare Access, Headscale, native WireGuard) don't have is an internet-facing login portal. The control plane authenticates over the same OIDC/SSO your engineers already use; the data plane is WireGuard between authorized peers; there is no single box that, if compromised, lets the attacker into the rest of the network. The lazy stance is: don't run a VPN appliance. Every use case has a mesh-or-proxy answer that ships with less attack surface and less operational pain.

## Tailscale for outbound

The fast path: install the daemon on every machine you want to access from, and every machine you want to access. Each daemon authenticates against your IdP. Within sixty seconds, every node in your tailnet can reach every other node over WireGuard, with private IPv4/IPv6 addresses inside the tailnet.

Replace `ssh user@bastion.yourorg.com` with `tailscale ssh user@prod-db.your-tailnet.ts.net`. Replace `grafana.yourorg.com` (publicly reachable, gated by a Cloudflare IP allowlist that nobody can remember the source of) with `grafana.your-tailnet.ts.net` (only reachable to tailnet members, no public DNS record, no public route).

![A hand-drawn napkin showing a Tailscale ACL annotated with marker arrows and red callouts. The center of the napkin has a 20-line ACL in HuJSON (JSON with comments), with tagOwners, groups, and acls sections. Red arrows point from the 'group:engineers' line to a sketch of three engineer laptop icons, from 'tag:prod-db' to a sketched database cylinder with a 'prod' label, and from the comment '// only platform can reach prod databases' to a small underline beneath the matching ACL rule. A red callout reads 'twenty lines of declarative-policy → entire access plane'. Bottom strip mirrors Parts 1/3/4 chevron pattern with colored dots: 'one ACL → one SSO group → one git diff → entire blast radius'.](/images/lazy-security-part-6-network-plane/tailscale-acl-napkin.png)

*Fig. 2 — twenty lines of HuJSON beats two hundred lines of iptables, and you can `git diff` the change.*

The ACL itself, in HuJSON (JSON with comments, native to Tailscale):

```json
{
  "tagOwners": {
    "tag:prod-db":  ["group:platform"],
    "tag:internal": ["group:platform"]
  },
  "groups": {
    "group:platform":  ["[email protected]", "[email protected]"],
    "group:engineers": ["group:platform", "[email protected]"]
  },
  "acls": [
    // Engineers can reach internal HTTP services.
    { "action": "accept", "src": ["group:engineers"], "dst": ["tag:internal:80,443,3000,8080"] },
    // Only platform can reach prod databases.
    { "action": "accept", "src": ["group:platform"],  "dst": ["tag:prod-db:5432,6379,9200"] }
  ]
}
```

Twenty lines of declarative policy. Each change is a PR. Each merge is reviewed. Each rule is a sentence a human can read in three seconds. The version of this living in iptables is two hundred lines that nobody touches because nobody knows whether the bottom forty are still load-bearing.
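To make the "a human can read it in three seconds" claim concrete, here is a minimal Python sketch of how rules shaped like these resolve. It is a toy model under stated assumptions, not Tailscale's evaluator: the user names are hypothetical placeholders, and it ignores everything the real policy language adds (autogroups, wildcards, port ranges).

```python
# Toy model of the ACL above. NOT Tailscale's real evaluator.
# The user names (alice, bob, carol) are hypothetical placeholders.
GROUPS = {
    "group:platform": ["alice@example.com", "bob@example.com"],
    "group:engineers": ["group:platform", "carol@example.com"],
}

ACLS = [
    {"action": "accept", "src": ["group:engineers"], "dst": ["tag:internal:80,443,3000,8080"]},
    {"action": "accept", "src": ["group:platform"], "dst": ["tag:prod-db:5432,6379,9200"]},
]


def expand(member: str) -> set[str]:
    """Resolve a user or (possibly nested) group name to a set of users."""
    if member not in GROUPS:
        return {member}
    users: set[str] = set()
    for inner in GROUPS[member]:
        users |= expand(inner)
    return users


def allowed(user: str, tag: str, port: int) -> bool:
    """Default deny: True only if some accept rule matches user, tag and port."""
    for rule in ACLS:
        if rule["action"] != "accept":
            continue
        if not any(user in expand(src) for src in rule["src"]):
            continue
        for dst in rule["dst"]:
            # "tag:prod-db:5432,6379" splits into the tag and its port list.
            dst_tag, _, ports = dst.rpartition(":")
            if dst_tag == tag and str(port) in ports.split(","):
                return True
    return False
```

In this model `allowed("carol@example.com", "tag:prod-db", 5432)` is `False` and the same call for `alice@example.com` is `True`, which is the whole policy in two function calls. Anything that matches no rule is denied.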

Cost: under Tailscale Pricing v4 (April 2026), Free covers up to 6 users and 100 devices. Paid plans are Standard at $6/user/month, Premium at $18/user/month, and Enterprise above that. For a 15-person team, $90/month on Standard buys SSO, audit logs, and ACL change history.

## Cloudflare Tunnel for inbound

Different shape of problem. You have a service that needs to be reachable to specific users (employees, customers, a vendor's support team) but should never have a public DNS record. The wrong answer is putting it on the public internet with an IP allowlist. The right answer is Cloudflare Tunnel.

The architecture: `cloudflared` runs on the origin and dials *outbound* to Cloudflare's edge. There is no inbound port. The origin has no public IP. Cloudflare Access (the policy layer) sits in front of the URL and requires the user to authenticate via your IdP (Google, Okta, GitHub, OneLogin, generic OIDC) before the request is proxied to the origin.

```bash
# On the origin
cloudflared tunnel login
cloudflared tunnel create vendor-admin
cloudflared tunnel route dns vendor-admin admin.yourorg.com
cloudflared tunnel run vendor-admin
```

Three commands of consequence. The vendor admin panel is now reachable at `admin.yourorg.com`, gated by your SSO, with no public route to the underlying service. Add Cloudflare Access policy rules (a few clicks in the dashboard or a Terraform resource) to require a specific Okta group, a specific email domain, a specific device posture, or an mTLS cert.

Cost: Cloudflare Zero Trust Free covers up to 50 users. $7/user/month above that. The Tunnel daemon itself is free; the metering is on Access seats.

## WireGuard, when you can't run Tailscale

For the regulated, air-gapped, or contract-forbids-SaaS case, WireGuard has been in mainline Linux since kernel 5.6 (March 2020) and is enabled by default in most distros. The configuration is plain text. The minimum viable setup is about twenty lines per peer.

```ini
# /etc/wireguard/wg0.conf, on the server
[Interface]
Address = 10.7.0.1/24
ListenPort = 51820
PrivateKey = <server-private-key>

[Peer]
PublicKey = <engineer-public-key>
AllowedIPs = 10.7.0.2/32
```

```ini
# On the engineer's laptop
[Interface]
Address = 10.7.0.2/32
PrivateKey = <engineer-private-key>

[Peer]
PublicKey = <server-public-key>
Endpoint = vpn.yourorg.com:51820
AllowedIPs = 10.7.0.0/24, 192.168.10.0/24
PersistentKeepalive = 25
```

`wg-quick up wg0` and the tunnel is live.

The pain at scale is peer management. Twenty engineers means twenty keypairs, twenty `AllowedIPs` blocks on the server, and a manual re-deploy each time someone joins or leaves. Two OSS tools fix that: **Headscale**, which is a Tailscale-protocol-compatible control plane you self-host (same Tailscale clients on each device, but the coordination server is yours); and **wg-easy**, a small web UI for adding and removing peers. Both give you the Tailscale UX with none of the SaaS dependency.
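If you stay on plain WireGuard for a while, the peer bookkeeping can at least be generated rather than hand-edited. A minimal Python sketch, assuming the `10.7.0.0/24` subnet from the config above with `.1` reserved for the server; the peer names and keys are hypothetical placeholders, and `render_peers` is my own helper, not part of any WireGuard tooling.

```python
from ipaddress import ip_network


def render_peers(peers: list[tuple[str, str]], subnet: str = "10.7.0.0/24") -> str:
    """Render one [Peer] block per (name, public_key), addresses from .2 upward."""
    hosts = list(ip_network(subnet).hosts())[1:]  # skip .1, the server itself
    if len(peers) > len(hosts):
        raise ValueError("more peers than free addresses in the subnet")
    blocks = []
    for (name, public_key), address in zip(peers, hosts):
        blocks.append(
            f"# {name}\n[Peer]\nPublicKey = {public_key}\nAllowedIPs = {address}/32\n"
        )
    return "\n".join(blocks)


# Placeholder keys, in join order. Real keys come from `wg genkey | wg pubkey`.
print(render_peers([("asha", "<asha-public-key>"), ("ben", "<ben-public-key>")]))
```

Addresses here follow list order, so removing someone from the middle renumbers everyone after them. A real setup would pin each peer's address, which is roughly the bookkeeping Headscale and wg-easy take off your hands.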

When to pick this path over Tailscale: your contract forbids data-plane traffic transiting a U.S. SaaS, you operate in a region where Tailscale can't legally provide service, or you're regulatory-bound to run the full stack yourself. Otherwise, Tailscale Standard at $6/user/month is a better use of a platform engineer's time than peer management cron jobs.

## the ACLs you can actually read

The reason the appliance era was painful wasn't just the CVEs. It was the iptables rulesets nobody read.

```
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -d 10.0.1.5 --dport 22 -s 10.0.0.0/24 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -d 10.0.1.5 --dport 5432 -s 10.0.0.50/32 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp -d 10.0.1.6 --dport 6379 -s 10.0.0.50/32 -j ACCEPT
# ... 200 more lines, no comments, no group names, no diff history
```

versus the Tailscale ACL above. Twenty lines of declarative policy, with group names that mirror Okta groups, with comments that say what each rule is for, version-controlled in `git`. The same is true for Cloudflare Access policies (declarative JSON, also Terraform-supported), Headscale ACLs (the same HuJSON Tailscale uses), and even WireGuard's `AllowedIPs` per peer (one line per route per peer).

If the only person who can read your firewall rules is the person who wrote them three years ago, that is a security problem, not just an operations problem. The audit answer "what does the prod network allow?" should be a `git log` and a code review, not a screen-share with the senior engineer who took notes on a sticky.

## the receipts

For 15 engineers, the network-in-front-of-everything bill:

- **Tailscale Standard** at $6/user/month: $90/month for 15 engineers. Covers SSH-to-bastion, internal HTTP access, prod database access, MagicDNS, ACL audit logs. The primary line item.
- **Cloudflare Zero Trust Free** (<50 users): $0. Replaces public-internet vendor portals, internal-with-SSO web apps, customer-facing internal tools.
- **Self-hosted WireGuard or Headscale**: $0, plus a small VPS for the control plane if needed (~$5/month). For the use case Tailscale can't legally cover.
- **The retired VPN appliance contract**: somewhere between $5K and $50K per year, depending on vendor and seat count, going back into your budget when the contract ends.

![An animated horizontal bar chart in a dark editorial palette comparing the annual access-plane cost for a 15-engineer team across four configurations. Top bar: legacy SSL VPN appliance (Pulse Secure / Ivanti / GlobalProtect at small-business pricing) at roughly $600/year plus the CVE risk; subtitle 'plus appliance ops time'. Middle-top bar (accented, brighter cyan, coral tip): Tailscale Standard at $1,080/year (15 × $6/mo × 12); subtitle 'recommended default'. Middle-bottom bar: Tailscale Free + Cloudflare Zero Trust Free at $0; subtitle 'works up to 6 users / 50 seats'. Bottom bar: self-hosted Headscale + WireGuard at $60/year (just a small VPS); subtitle 'for the air-gapped or contract-bound case'. Annotation strip notes the appliance bar's true cost is dominated by patch/CVE response and is undercounted at $600.](/images/lazy-security-part-6-network-plane/access-plane-cost-stack.gif)

*Fig. 3 — the bottom two bars are not an emergency, the top bar is.*

Net cost: $90/month for the access plane covering most of the surface, with optional fallbacks at near-zero cost. For comparison, the median per-seat price of a legacy SSL VPN appliance (Pulse Secure / Ivanti, GlobalProtect, AnyConnect) at small-business pricing is around $40/seat/year, or about $600/year for 15 seats against $1,080/year for Tailscale Standard. The two land in the same range, and the Tailscale number comes without the CVE risk and without the iptables ruleset.
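The arithmetic behind those bars, as a short Python sketch so you can swap in your own seat count. The prices are the ones quoted in this post; the appliance figure is licence-only and deliberately leaves out ops time and CVE response, which is where the real cost sits.

```python
SEATS = 15


def annual(per_unit_monthly: int, units: int = 1) -> int:
    """Annualise a monthly per-unit price."""
    return per_unit_monthly * units * 12


costs = {
    "tailscale_standard": annual(6, SEATS),  # $6/user/month
    "headscale_vps": annual(5),              # ~$5/month VPS for the control plane
    "legacy_appliance": 40 * SEATS,          # ~$40/seat/year, licence only
    "free_tiers": 0,                         # Tailscale Free + Cloudflare Zero Trust Free
}

for name, dollars in sorted(costs.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} ${dollars:>5,}/year")
```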

What this catches: every internet-facing VPN appliance CVE, because you don't have one. Every "the bastion's security group was opened to 0.0.0.0/0 to debug a contractor's IP last summer and never closed" incident, because the bastion isn't reachable. Every "the static VPN certificate leaked from a contractor's laptop" incident, because the credential is a short-lived OIDC session, not a long-lived cert.

What it doesn't catch: a compromise of your IdP itself. If an attacker controls Okta, they control your tailnet. (Cloudflare's Thanksgiving 2023 incident report, attributed to Okta's October 2023 support-system breach, is the canonical reference for this failure mode and the response.) The mitigation lives in [Part 3](/blog/lazy-security-part-3-unsexy-list): FIDO2-only admin access on the IdP, audit log streaming, and a runbook for "rotate all sessions in five minutes."

If you do one thing this week, install Tailscale on your laptop and one prod host, get them talking over the tailnet, and replace one `ssh bastion.yourorg.com` invocation with `tailscale ssh prod-host`. Cost: zero. Time: ten minutes. Then plan the rest of the rollout for next quarter, when you have the budget conversation with whoever still pays for the SSL VPN renewal.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>security</category>
            <category>devsecops</category>
            <category>lazy-sre</category>
            <category>network</category>
            <category>vpn</category>
            <category>tailscale</category>
            <enclosure url="https://harshit.cloud/images/lazy-security-part-6-network-plane/vpn-appliance-attack-surface.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to bypass corporate MDM and AI gateways on Claude Code]]></title>
            <link>https://harshit.cloud/blog/bypassing-claude-code-mdm-managed-settings</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/bypassing-claude-code-mdm-managed-settings</guid>
            <pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Your company's MDM dropped managed-settings.json and the network team wedged Claude through an AI gateway. Here's how each leash works.]]></description>
            <content:encoded><![CDATA[
If you're reading this, there's roughly an 80% chance your company rolled out an MDM last quarter, your network team wedged Claude API traffic through an AI gateway around the same time, and now Claude Code boots with MCPs you didn't pick while forwarding your prompts somewhere you haven't audited. `/mcp` shows three servers nothing in your repo touches. `env | grep ANTHROPIC` returns a base URL on a domain you've never seen. The experience got worse and nobody asked you.

![Two corporate leashes converging on Claude Code in the middle: an MDM agent on the left pushes a schg-flagged managed-settings.json into /Library/Application Support/ClaudeCode/, an AI gateway on the right intercepts traffic that Claude thinks is heading for api.anthropic.com.](/images/bypassing-claude-code-mdm-managed-settings/hero.png)
*Fig. 1 — both leashes pull on the same runtime. You can usually only get one off without your laptop calling home about it.*

This post covers both leashes. The MDM one is fixable in about ten lines of zsh. The AI gateway one depends on how deep your network team went.

## what's an MDM, in three sentences

MDM stands for Mobile Device Management. Jamf, Kandji, Intune, Workspace ONE, whichever agent enrolled your laptop on day one. It owns parts of `/Library`, can write files there as root with the system-immutable flag set, and re-pushes them on a schedule, which is why a plain `rm` doesn't survive. For Claude Code, the relevant directory is `/Library/Application Support/ClaudeCode/`.

## the managed-settings situation

The two files doing the work are `/Library/Application Support/ClaudeCode/managed-settings.json` and `/Library/Application Support/ClaudeCode/managed-mcp.json`. Claude Code reads them on startup, treats them as the highest-priority settings layer, and merges them over whatever you have in `~/.claude/settings.json`. Anything IT puts in there wins: forced MCPs, forced skills, allowed and denied permission lists, and the `env` block that can set `ANTHROPIC_BASE_URL`. That last one is how the AI gateway routing gets wired into Claude Code in the first place.
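A rough mental model of that precedence, as a Python sketch. This is an assumption for illustration, not Claude Code's actual merge implementation: I'm modelling it as a recursive dictionary overlay where the managed layer wins on conflicts, and the `EDITOR` entry is a hypothetical user setting.

```python
# Toy model of layered settings. The merge rules here are assumed for
# illustration; they are not taken from Claude Code's source.
def merge_layers(user: dict, managed: dict) -> dict:
    """Return user settings overlaid by managed settings (managed wins)."""
    merged = dict(user)
    for key, value in managed.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_layers(merged[key], value)
        else:
            merged[key] = value
    return merged


user = {"env": {"EDITOR": "nvim"}}  # hypothetical ~/.claude/settings.json content
managed = {"env": {"ANTHROPIC_BASE_URL": "https://ai-gw.corp.example.com/v1"}}
effective = merge_layers(user, managed)
```

In this model `effective["env"]` ends up with both keys: your own setting survives, and the gateway URL rides in on top of it. That is why the override is easy to miss until you go looking.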

## why `rm` doesn't work

First instinct fails, and not in a way that's obvious:

```bash
sudo rm "/Library/Application Support/ClaudeCode/managed-settings.json"
# rm: managed-settings.json: Operation not permitted
```

Root isn't enough. The MDM agent sets the file's system-immutable flag with `chflags schg` after writing it. That flag blocks deletion even by root until it's cleared. The macOS `chflags(1)` man page is the receipt. `schg` is the "system immutable" flag, and the file "may not be changed, moved, or deleted" while it's set.

Confirm it on your own machine:

```bash
ls -lO "/Library/Application Support/ClaudeCode/managed-settings.json"
# -rw-r--r--  1 root  wheel  schg  482 May 14 09:11 managed-settings.json
```

`schg` in column five is the marker.
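If you'd rather check the flag programmatically than eyeball `ls` output, the same bit is visible from Python. A minimal sketch: `stat.SF_IMMUTABLE` is the standard library's name for the system-immutable flag, and `st_flags` only exists on macOS and the BSDs, so the helper falls back to 0 elsewhere. The function names are mine.

```python
import os
import stat


def has_schg(st_flags: int) -> bool:
    """True if the system-immutable bit (schg) is set in a BSD flags word."""
    return bool(st_flags & stat.SF_IMMUTABLE)


def file_has_schg(path: str) -> bool:
    """Check a file on disk; st_flags is absent off macOS/BSD, so default to 0."""
    return has_schg(getattr(os.stat(path), "st_flags", 0))
```

Calling `file_has_schg` on the managed-settings path should agree with what `ls -lO` shows. It raises `FileNotFoundError` if the file isn't there, which is its own answer.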

The detail that matters: managed-settings.json is the same config layer your `~/.claude/settings.json` uses. The IT copy just lives under `/Library`, is owned by root, and has the schg flag set. The merge logic doesn't know which file came from a human.

## the cleanup script

One thing worth flagging before you run this. On macOS, the `schg` flag is normally clearable by root for files outside SIP-protected paths — and `/Library/Application Support/ClaudeCode/` is not SIP-protected. So `sudo chflags noschg` works as written. If your MDM also writes its config into a SIP-protected location (rare for application config, more common for system extensions), you'd need Recovery Mode Terminal to clear those, which is a different conversation. The script's `2>/dev/null` will silently swallow that failure, so if reruns don't seem to take, that's where to look.

Save this as `/usr/local/sbin/claudecode-cleanup.sh`, make it executable, run with `sudo`:

```zsh
#!/bin/zsh
FILES=(
  "/Library/Application Support/ClaudeCode/managed-settings.json"
  "/Library/Application Support/ClaudeCode/managed-mcp.json"
)
for f in "${FILES[@]}"; do
  # Clear immutable flag if file exists, then remove
  [ -e "$f" ] && /usr/bin/chflags noschg "$f" 2>/dev/null
  /bin/rm -f "$f"
done
```

```bash
sudo chmod 755 /usr/local/sbin/claudecode-cleanup.sh
sudo /usr/local/sbin/claudecode-cleanup.sh
```

Two lines do the real work. `chflags noschg` clears the immutable bit. `rm -f` removes the file. On a clean machine the `[ -e "$f" ]` guard skips `chflags` entirely and `rm -f` ignores the missing file, so the `2>/dev/null` only matters when `chflags` itself fails.

Restart Claude Code. `/mcp` should be back to whatever you actually installed, and `/permissions` should be whatever's in `~/.claude/settings.json` instead of whatever IT decided you needed.

## the launchd arms race

I'd love to tell you this is permanent. It isn't.

MDM agents sync on a schedule. Every 15 minutes, every hour, on login, depending on profile. When they sync, they notice the file is gone, put it back, and re-apply the schg flag. You'll watch managed-mcp.json reappear like a horror-movie villain you keep stabbing.

A few options, in increasing order of trouble you're inviting:

- **Run the script on a launchd LaunchAgent that fires at login.** Once per session. Low impact, low effectiveness, but if your MDM only syncs at login this is enough.
- **Run it on a launchd timer with a 60-second interval.** Now you're in an arms race with the sync schedule. Works until someone in IT notices a config-drift alert for your hostname.
- **Block the MDM agent's outbound DNS.** Effective, loud, and the kind of thing that gets your laptop wiped on the next compliance audit.

I run the first one. The MDM gets its login telemetry, my dev environment isn't broken for the hour or so between syncs, nobody opens a ticket. Pick the option that matches how much you actually want to fight this.

Minimal `~/Library/LaunchAgents/cloud.harshit.claudecode-cleanup.plist`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>cloud.harshit.claudecode-cleanup</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/bin/sudo</string>
    <string>-n</string>
    <string>/usr/local/sbin/claudecode-cleanup.sh</string>
  </array>
  <key>RunAtLoad</key><true/>
</dict>
</plist>
```

`sudo -n` only works if you've added a NOPASSWD line for that exact script in `/etc/sudoers.d/claudecode-cleanup`. Which the MDM might rewrite. The arms race goes deeper than you think.

## the AI gateway angle

The other leash sits at the network layer. Companies route Claude API traffic through a gateway (Cloudflare AI Gateway, Portkey, LiteLLM, internal proxies) so they can log prompts, strip PII, enforce per-user quotas, or quietly downgrade Opus calls to Haiku when the monthly bill spikes. Claude Code respects `ANTHROPIC_BASE_URL` and will talk to whatever endpoint it points at, as long as your OAuth token or API key authenticates there.

Two routing patterns to recognize:

- **The env block in managed-settings.json.** IT sets `ANTHROPIC_BASE_URL=https://ai-gw.corp.example.com/v1` inside the env section of the managed file. Claude Code reads it on startup. Same fix as the MCP file. The cleanup script above already kills this.
- **System proxy plus a corporate root CA.** Your laptop has a "Corporate Root CA" in keychain, and either an `HTTPS_PROXY` setting or transparent network interception routes api.anthropic.com traffic through the gateway. Deleting managed-settings.json does nothing here. The interception lives below the application layer.

To tell which one you have, run this in a fresh shell:

```bash
env | grep -i anthropic
# If you see ANTHROPIC_BASE_URL, it's the env block.

curl -v https://api.anthropic.com/v1/messages 2>&1 | grep -iE 'issuer|subject|server certificate'
# If the cert chain is signed by your corporate CA, it's transparent interception.
```

## bypassing the gateway

For the env-block case, the cleanup script already does the work. After running it, restart your shell, or unset the variable in the current one and confirm it's gone:

```bash
unset ANTHROPIC_BASE_URL
env | grep -i anthropic
# (empty)
```

For the transparent-proxy case, your options shrink:

- **Personal hotspot for sensitive sessions.** Burns mobile data, leaves no trail through the gateway. Most realistic option for an individual contributor.
- **WireGuard or Tailscale out to a personal node.** Works if your MDM profile allows it. Many block third-party VPNs through `com.apple.systempolicy.kernel-extension-policy`.
- **Personal device for personal work.** Boring answer. The one that holds up in HR if it ever comes up.

What doesn't work: removing the corporate root CA from keychain. It's pinned by an MDM payload and gets re-added on next sync, same pattern as managed-settings.json.

## should you actually do this

Worth saying out loud: both leashes exist because someone at your company had a reason. Compliance, data residency, an incident from six months ago whose postmortem nobody can find.

If the forced MCP is `internal-secrets-lookup` and the gateway logs prompts to a SOC pipeline, your team probably wants you using it. If the MCP is `corporate-docs-mcp` pointed at a 404 and the gateway downgrades Opus to Haiku because someone misread an invoice, you're deleting dead weight.

The script doesn't know which. Ask before you script. Most MDM platforms support per-user opt-out scopes, and one polite Slack message to IT beats a `launchd` plist.

## what these scripts don't do

The cleanup clears two files. It does not:

- Stop the MDM agent.
- Touch `~/.claude/settings.json`. Your settings stay yours.
- Handle `/Library/Application Support/ClaudeCode/managed-permissions.json` if your MDM uses one. Add it to the `FILES` array.
- Survive a reboot or a sync. The agent re-pushes on next check-in.
- Defeat a transparent proxy with a pinned corporate CA. Use the hotspot.

If you wanted a permanent escape from corporate IT, you wouldn't be reading a blog about `chflags`.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>claude-code</category>
            <category>mdm</category>
            <category>ai-gateway</category>
            <category>macos</category>
            <category>managed-settings</category>
            <enclosure url="https://harshit.cloud/images/bypassing-claude-code-mdm-managed-settings/hero.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Lazy SRE's guide to secure systems, part 5: the dev laptop is the perimeter]]></title>
            <link>https://harshit.cloud/blog/lazy-security-part-5-dev-laptops</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/lazy-security-part-5-dev-laptops</guid>
            <pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Snowflake taught everyone what happens when an infostealer runs on a contractor's personal Mac. The laptop is the perimeter.]]></description>
            <content:encoded><![CDATA[
In June 2024, Mandiant published the writeup for the Snowflake mass-extortion campaign. Ticketmaster, Santander, AT&T, LendingTree, Advance Auto Parts — roughly 165 Snowflake tenants in total had data extracted from their warehouses. The defining detail wasn't sophistication. It was the laptop.

Mandiant traced the entry point to infostealer malware (Lumma, RedLine, Vidar variants) running on contractor and developer machines. Their report described the affected devices as personal systems also used for gaming and downloading pirated software. The infostealer harvested every credential the browser had ever saved, including the Snowflake login that didn't have MFA enforced. The attackers walked through the front door of a Fortune 500's data warehouse.

This is part 5. Earlier parts covered npm ([Part 1](/blog/lazy-security-part-1-supply-chain)), GitHub Actions ([Part 2](/blog/lazy-security-part-2-github-actions)), the unsexy infrastructure list ([Part 3](/blog/lazy-security-part-3-unsexy-list)), and DNS auth records ([Part 4](/blog/lazy-security-part-4-dns-records)). Part 5 is about the laptop. The piece of hardware on an engineer's desk that has every SSH key, AWS profile, kubeconfig, GitHub PAT, Slack token, and Stripe key they have ever used to do their job.

The thesis from Part 1 stands. Future You at 3am will not run an EDR scan after every browser extension install. The config that prevents the extension from being installed in the first place is the one that runs while you sleep: the MDM that whitelists, the disk encryption that protects what gets stolen, the hardware MFA that survives the keylogger.

## MDM is the table you set first

Mobile Device Management is the thing every small startup skips and every enterprise has. The bad-faith reason is that it's expensive and annoying. The honest reason in 2026 is that the free options have caught up.

For a 15-person Apple-heavy team, the lazy stack is **Apple Business Manager** (free, Apple-only) plus **Fleet** (OSS, free under 300 endpoints on the self-hosted path, generous free tier on Fleet's cloud). Apple Business Manager assigns a Mac to your organization at first boot, before the user creates a personal Apple ID on it. Fleet runs the osquery agent on every machine and lets you push configuration profiles (the same plist payloads Jamf would push) plus query inventory in SQL syntax.

![A hand-drawn napkin sketch of a laptop, viewed top-down on a workbench. Inside the laptop are labeled icons representing what's actually stored on it: a key labeled SSH, a wallet labeled AWS keys, a kubeconfig folder, a GitHub PAT token, a Slack icon, a Stripe API key, a stack of browser cookies, and a small keychain icon for the password manager. Around the outside, a red dashed boundary labeled 'the perimeter'. A red callout reads 'any one of these, full org compromise'. Bottom strip: 'one device → one keychain → twelve services → entire blast radius'.](/images/lazy-security-part-5-dev-laptops/whats-on-your-laptop.png)

*Fig. 1 — what's actually on the device you take to coffee shops.*

The lazy default config profile, in plain English:

- Require FileVault. Escrow the recovery key to MDM. If the laptop walks, the disk is encrypted; if the user forgets the password, you can recover.
- Require auto-lock at five minutes idle, password to wake. Not a screensaver.
- Block unsigned package installs, restrict the Mac App Store to managed Apple IDs only.
- Require macOS updates within fourteen days of release. The fourteen days lets you skip a known-bad point release; longer than fourteen is negligence.
- Block AirDrop on the corporate Wi-Fi, restrict USB external storage to read-only (or block entirely if your workflow doesn't need it).
- Install osquery via MDM, enrolled to your Fleet server.

For Linux and Windows in the mix, Fleet covers both with the same osquery agent and the same query syntax. The MDM-config-profile half is Microsoft Intune (included with Microsoft 365 Business Premium) or Workspace ONE's free tier. Either way, the stack is "Fleet for inventory and detections + a platform-specific MDM for enforcement."

The lazy fix for the most common gap: a weekly cron that runs one Fleet query, "every laptop without FileVault enabled," and posts a Slack alert with the user's name. The conversation that follows is "we found your machine, can you enable it today" — not a six-month audit.

## hardware keys, one-time spend

YubiKey 5 NFC is $50. Buy two per engineer: one for the desk, one for the bag. Total for 15 engineers: $1,500, one-time, capital expense, deductible.

What it gets you:

- WebAuthn / FIDO2 for SSO login (Google, Okta, GitHub, Cloudflare, AWS): a keylogger can record every keystroke and still never get the second factor.
- SSH key storage in hardware. `ssh-keygen -t ed25519-sk -O resident` writes the key into the YubiKey. The private key never exists on disk.
- PIV smartcard for VPN auth, plus OpenPGP card support for code signing (`gpg --card-edit`).
- TOTP fallback for the SaaS that hasn't shipped WebAuthn yet.

The free alternative for the SaaS that doesn't support hardware keys is passkeys. Passkeys are WebAuthn under the hood, also phishing-resistant, built into iOS, macOS, Android, Windows Hello, Chrome, and Safari. Free. The catch is sync: if the engineer's iCloud is compromised, so is the passkey. Hardware keys aren't synced; they are a physical token. The lazy answer is both: passkeys for low-risk auth, YubiKeys for the keys that gate production.

Cost: $1,500 one-time for 15 engineers. The cheapest line item in this post for what it gets you.

## EDR is where the budget goes

Endpoint Detection and Response is the part of this stack that costs real money. For OSS-only, the answer is osquery + Wazuh, which works but requires writing detections by hand. For a 15-person team with one platform engineer, "write your own EDR detections" is not a project anyone will finish.

The honest 2026 small-team answer is **Microsoft Defender for Business** at $3/user/month. It ships in Microsoft 365 Business Premium (also useful if you're on M365 anyway), has acceptable macOS coverage, and includes managed detections written by Microsoft's security team. Cost for 15 engineers: $540/year. **CrowdStrike Falcon Go** is $60/endpoint/year if you want best-in-class detection at small-team scale; same math, $900/year for 15.

![An animated horizontal bar chart in a dark editorial palette comparing the annual endpoint stack cost for a 15-engineer team across three configurations. Top bar: OSS-only (osquery + Wazuh self-hosted) at roughly $240/year (just the VPS). Middle bar (accented, brighter cyan, coral tip): Defender for Business at $540/year, the recommended default. Bottom bar: CrowdStrike Falcon Go at $900/year. A small note underneath each bar shows what each catches and what each misses; a strip at the bottom reads 'one-time YubiKey spend not included ($1,500 for 15 engineers across all three).'](/images/lazy-security-part-5-dev-laptops/endpoint-cost-stack.gif)

*Fig. 2 — three configurations. Pick the middle bar unless you have a reason.*

The lazy stance: Defender for Business if you're on Microsoft 365 already. Falcon Go if you're not on M365 and want managed detection without the OSS-engineer overhead. osquery + Wazuh only if you have a security engineer with bandwidth to maintain the detections, which most 15-person startups don't. Pretending otherwise is how you end up with a fancy SIEM nobody reads.

## the password manager and browser hygiene argument

1Password Business at ~$8/user/month. Bitwarden Teams at $4. Apple Passwords (or 1Password Families) if you're Mac-only and don't need shared vaults. Pick one and stop arguing about it on the team's `#tools` channel.

The point of the password manager isn't strong passwords. The point is:

- One place for credentials, audited.
- Shared vaults for vendor logins, instead of "share the password in Slack DM" hygiene.
- Breach notifications when a saved password appears in a public breach corpus.
- Masked email aliases (1Password feature, Apple's Hide My Email equivalent): every signup gets a separate alias, every spam list is contained.

Browser hygiene matters because the Snowflake infostealers harvested the credentials the browser itself had saved. Specifically:

- Enforce browser auto-updates via MDM. Both Chrome and Edge expose policy keys for this; Firefox via `policies.json`.
- Block sync of work browser profiles to personal Google/Apple accounts. The "I signed into Chrome with my personal account and now all my work bookmarks are in someone else's cloud" leak is real.
- Block "developer mode" extension installs. Force extensions to come from the Chrome Web Store; force the Web Store to honor the org's allowlist via the `ExtensionInstallAllowlist` policy.
- Disable browser password saving entirely. Everything routes through the password manager.
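
A minimal sketch of what part of that list looks like as a Chrome policy payload, generated from Python so the allowlist can live in version control. The policy keys are Chrome enterprise policy names; the extension ID is a made-up placeholder, and `SyncDisabled` is the blunt version of "no sync to personal accounts" (it turns sync off entirely). How the JSON gets delivered depends on your MDM.

```python
import json

# Hypothetical extension ID; replace with the IDs your org actually approves.
APPROVED_EXTENSIONS = ["a" * 32]

def chrome_policy(approved):
    """Build a Chrome managed-policy dict for the hygiene rules above."""
    return {
        "ExtensionInstallBlocklist": ["*"],            # block every extension by default
        "ExtensionInstallAllowlist": list(approved),   # then allow only the approved IDs
        "PasswordManagerEnabled": False,               # no saving passwords in the browser
        "SyncDisabled": True,                          # no Chrome Sync to any account
    }

print(json.dumps(chrome_policy(APPROVED_EXTENSIONS), indent=2))
```

The blocklist wildcard is what gives the allowlist teeth; an allowlist on its own restricts nothing.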

Total: $1,440/year for 15 engineers on 1Password Business. $720 on Bitwarden Teams. $0 on Apple Passwords if it covers your needs. Pick a line and walk it.

## the personal device problem

The Snowflake breach was about contractors using personal Macs for work. The lazy answer at a 15-person startup might surprise: corp-issue every contractor a laptop. Yes, including the four-hour-a-week consultant.

A refurbished MacBook Air with 16GB RAM is roughly $700 from Apple's Certified Refurbished store. The cost of a Snowflake-scale breach starts at $370K (the reported AT&T ransom) and ends in the customer-churn and legal-exposure column. At that floor, one incident costs as much as roughly 500 laptops, so hardware-for-contractors breaks even on a fraction of a single serious incident.

![An editorial side-by-side system diagram on a dark navy ground. Left panel labeled 'personal device, BYOD' shows a laptop with chaotic state: unenforced FileVault status, a personal iCloud sign-in, a Mac App Store with personal Apple ID, a Chrome browser synced to a personal Google account, a Slack web app session that's been logged in for nine months, a folder labeled 'pirated software' with a red warning. Right panel labeled 'corp-issued, MDM enrolled' shows the same laptop with each item enforced: FileVault ON, MDM-managed Apple ID, App Store restricted, Chrome work profile only, Slack session expires daily, no third-party software installs. Each enforced item has a green check; each unenforced item on the left has a coral X. A title above reads 'where the Snowflake breach lived'.](/images/lazy-security-part-5-dev-laptops/personal-vs-corp-laptop.png)

*Fig. 3 — same laptop, different enrollment. The right panel is the one where Mandiant doesn't write your name down.*

What "no work on personal devices" actually requires:

- Contract clause: hardware is issued, personal-device use for work is prohibited.
- MDM enrollment at first boot via Apple Business Manager (or Windows Autopilot).
- Disabled iCloud personal sign-in; only managed Apple IDs.
- Wipe via MDM on offboarding, before reissue.
- No "I can just SSH from home for ten minutes" escape hatch. The escape hatch is what the contractor will use the day they get phished.

This is the section of the post that gets the most pushback. The pushback is right about cost and wrong about risk. Run the math at your scale; it runs the same direction every time.

## the receipts

For 15 engineers, the first-year laptop security budget:

- YubiKey 5 × 30 keys (two per engineer): $1,500, one-time.
- Fleet (OSS self-hosted on a small VPS): $240/year.
- Microsoft Defender for Business: $540/year. Substitute Falcon Go at $900 if not on M365, or osquery+Wazuh at $0 if you have a security engineer.
- 1Password Business: $1,440/year. Or Bitwarden Teams at $720. Or Apple Passwords at $0.
- Refurbished corp laptops for non-employee contractors: ~$700 per, as needed.

Total recurring: roughly $780–$2,220/year for 15 engineers on Defender, depending on the password-manager line; Falcon Go adds $360. Add the one-time YubiKey spend and the first year lands at $2,280–$3,720. Call it $13–$21 per engineer per month.
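
The arithmetic for the default stack (Fleet + Defender + 1Password), as a quick check you can rerun with your own headcount. Prices are the ones quoted in this post; the variable names are just for illustration.

```python
ENGINEERS = 15

# Prices as quoted in this post (USD).
yubikeys_one_time = 50 * 2 * ENGINEERS   # two $50 keys per engineer
fleet_vps = 240                          # per year, self-hosted
defender = 3 * 12 * ENGINEERS            # $3/user/month
onepassword = 8 * 12 * ENGINEERS         # ~$8/user/month

recurring = fleet_vps + defender + onepassword
first_year = recurring + yubikeys_one_time
per_engineer_month = first_year / ENGINEERS / 12

print(recurring, first_year, round(per_engineer_month, 2))
```

Swap `onepassword` for Bitwarden or zero it out for Apple Passwords to see the cheaper end of the range.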

What it catches: every infostealer that hits a managed laptop (Defender flags it), every credential that lives in the browser (replaced by the password manager), every login that doesn't have phishing-resistant MFA (the YubiKey is required), every personal device touching production (blocked by the no-BYOD policy).

What it doesn't catch: a determined adversary with physical access and unlimited time. A laptop in a hotel room with no FileVault is owned. A laptop with FileVault and a YubiKey left in the USB-A port overnight is owned slower. Neither situation is what this stack is built for; it is built for the infostealer that landed on the contractor's personal Mac.

If you do one thing this week, buy two YubiKeys for yourself, enroll them on GitHub, Google, and Okta, and turn off SMS-based MFA on each. Total cost: $100, one hour. Then do the rest of the team next quarter.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>security</category>
            <category>devsecops</category>
            <category>lazy-sre</category>
            <category>endpoint</category>
            <category>mdm</category>
            <category>macos</category>
            <enclosure url="https://harshit.cloud/images/lazy-security-part-5-dev-laptops/whats-on-your-laptop.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Lazy SRE's guide to secure systems, part 4: the four DNS records]]></title>
            <link>https://harshit.cloud/blog/lazy-security-part-4-dns-records</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/lazy-security-part-4-dns-records</guid>
            <pubDate>Sun, 26 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Four DNS records that close the entire phishing impersonation class. SPF, DKIM, DMARC, CAA, two monitors, one afternoon.]]></description>
            <content:encoded><![CDATA[
In February 2024, Guardio Labs published a writeup of a campaign called SubdoMailing. Five million phishing emails a day, sent through subdomains owned by MSN, eBay, VMware, NYC.gov, UNICEF, and McAfee. Every single email passed SPF and DKIM. Every one of them passed DMARC.

The attack didn't break those protocols. It used them. Each victim domain had an `include:` line in its SPF record pointing at a contractor's domain that had been allowed to expire. The attackers re-registered the orphan, inherited the trust, started sending. Some of the broken `include:` chains had been broken for over a year — Guardio dated the operation back to at least late 2022. Nobody had thought to read their own SPF record again after writing it.

This is part 4. Earlier parts covered npm ([Part 1](/blog/lazy-security-part-1-supply-chain)), GitHub Actions ([Part 2](/blog/lazy-security-part-2-github-actions)), and identity, network, and audit logs ([Part 3](/blog/lazy-security-part-3-unsexy-list)). Part 4 is four DNS records and two monitors. One afternoon to write them, three weeks for DMARC to ramp safely. Zero ongoing cost. Closes the entire phishing-impersonation class and the entire rogue-certificate class at the same time.

Future You at 3am will not investigate an SPF chain when finance forwards a wire-transfer email. The records that run in their place will.

## SPF, and the include trap

SPF stands for Sender Policy Framework. The record lives in DNS as a TXT entry on your apex domain. It declares which IP addresses or domains are allowed to send email on your behalf. The receiving mail server checks the sending IP against the list. The check passes or it fails. That is the entire protocol.

The record itself:

```
yourorg.com TXT "v=spf1 include:_spf.google.com include:mailgun.org -all"
```

`v=spf1` is the version marker. `include:` delegates to another domain's SPF record, which expands at lookup time to that vendor's actual IP allowlist. `-all` says anything not listed is hard-fail.

That last token matters. `-all` (hard-fail) tells receivers to reject anything not on the list. `~all` (soft-fail) tells them to mark it suspicious but maybe deliver anyway. `?all` (neutral) tells them you have no opinion. Every getting-started guide ever written defaults to `~all` "to be safe." The major receivers have said for years that they treat `~all` and `-all` the same in scoring. The lazy answer is `-all`. The only reason to use `~all` is during a migration when you can't yet enumerate every legitimate sender.

![A horizontal editorial timeline of the SubdoMailing campaign on a deep navy ground. Six stages along a single line, from a contractor's SPF include published in 2021 through the contractor domain expiring in 2023, an attacker re-registering it in late 2023, the attacker publishing their own SPF record under the orphan, 5 million phishing emails a day passing SPF and DMARC in February 2024, and Guardio Labs' disclosure of 8000 affected subdomains. Attacker-controlled stages in coral, victim stages in cyan, ghosted 'ORPHAN' and 'INHERITED TRUST' phase labels strung across the background.](/images/lazy-security-part-4-dns-records/subdomailing-timeline.png)

*Fig. 1 — three years from include line to five million phishing emails a day. The SPF record never changed.*

The SPF spec has a ten-DNS-lookup limit. Every `include:` counts, recursively. If you chain five SaaS senders (Google + Mailgun + Postmark + SendGrid + Stripe), each one's `include:` expands into its own record, which may include another, and you can blow the limit without realizing. When you blow the limit, the record evaluates as `permerror`, and many receivers treat that as "no SPF," which means anyone can spoof you. Tools like `dmarcian.com/spf-survey` count the lookups for free. Audit yours.

The SubdoMailing failure mode is what happens when one `include:` points at a contractor whose domain you don't control. The contractor goes out of business. The registration expires. Someone buys the lapsed domain. They publish their own SPF allowlist. Your domain now declares that the buyer is an authorized sender for you. Every email they send passes SPF. The fix is to audit your `include:` chain quarterly: does every domain in it still belong to someone you trust? Most teams have never done this once.
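
A rough sketch of both audits in one pass: count the DNS lookups an SPF record costs, and list every domain in the `include:` chain so a human can ask "do we still trust all of these?" The `RECORDS` dict stands in for real DNS so the example runs offline, and the vendor records in it are made up for illustration, not the vendors' real ones.

```python
COUNTED = ("a", "mx", "ptr", "exists")  # mechanisms that cost one DNS lookup each

# Stand-in for DNS: domain -> SPF TXT value. Illustrative values only.
RECORDS = {
    "yourorg.com": "v=spf1 include:_spf.google.com include:mailgun.org -all",
    "_spf.google.com": "v=spf1 include:_netblocks.google.com include:_netblocks2.google.com ~all",
    "_netblocks.google.com": "v=spf1 ip4:192.0.2.0/24 ~all",
    "_netblocks2.google.com": "v=spf1 ip6:2001:db8::/32 ~all",
    "mailgun.org": "v=spf1 ip4:198.51.100.0/24 ~all",
}

def walk_spf(domain, records, path=()):
    """Return (lookup_count, domains_in_chain) for a domain's SPF record."""
    if domain in path:                      # include loop: stop descending
        return 0, set()
    chain = {domain}
    count = 0
    for term in records.get(domain, "").split()[1:]:   # skip the "v=spf1" marker
        term = term.lstrip("+-~?").lower()              # drop the qualifier
        if term.startswith("redirect="):
            name, target = "redirect", term[len("redirect="):]
        else:
            name, _, target = term.partition(":")
            name = name.split("/")[0]                   # "a/24" -> "a"
        if name in ("include", "redirect"):
            count += 1                                  # the include itself is a lookup
            nested, nested_chain = walk_spf(target, records, path + (domain,))
            count += nested
            chain |= nested_chain
        elif name in COUNTED:
            count += 1
    return count, chain

count, chain = walk_spf("yourorg.com", RECORDS)
print(f"{count} of 10 lookups used")
for name in sorted(chain):
    print("review ownership of:", name)
```

To run it against real DNS, replace the dict lookup with a TXT query. The ownership question at the end is still a manual step; no script knows whether your contractor went out of business.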

## DKIM, in DNS

DKIM (DomainKeys Identified Mail) is a cryptographic signature on every outbound email. The signing key is a public/private keypair. The private key lives in your mail server (Google Workspace, Microsoft 365, Postmark, your own Postfix, whatever). The public key lives in DNS, under a selector subdomain.

```
selector1._domainkey.yourorg.com TXT "v=DKIM1; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA..."
```

The selector (`selector1` here) is so you can rotate keys. Publish a new selector, switch the mail server to sign with the new private key, leave the old selector live for a week so in-flight emails still verify, then retire it. Most providers handle this rotation for you once the original selector is configured.

Two things go wrong in practice. First, key length. RSA-1024 was the standard a decade ago and is now considered weak; RSA-2048 is the current default. Some old DKIM records still publish 1024-bit keys, and many major receivers now fail or ignore 1024-bit signatures. Audit with `dig TXT selector1._domainkey.yourorg.com`. Second, third parties signing on your behalf without your knowledge. If finance connects a new SaaS tool that sends email as `noreply@yourorg.com` and nobody sets up DKIM for that path, that vendor's emails will fail DKIM alignment. Receivers see a domain with DKIM mostly working and one path failing, which is often enough to flag the whole domain in spam filters.

Most providers (Google Workspace, Microsoft 365, Postmark, Mailgun, SendGrid) make DKIM publishing a checklist item in their onboarding. If a vendor doesn't, that is a signal about the vendor's sophistication, not yours.
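
For the key-length audit, `dig` hands you a base64 blob, not a number. A small sketch that reads the RSA modulus size out of the `p=` value, assuming the usual encoding (a DER SubjectPublicKeyInfo wrapping an RSA key). It is a hand-rolled parser for that one shape, with no error handling, so treat it as a quick check and not a validator.

```python
import base64

def _read_tlv(buf, pos):
    """Read one DER tag-length-value at pos; return (tag, value_start, value_end)."""
    tag = buf[pos]
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:                       # long form: low 7 bits = count of length bytes
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    return tag, pos, pos + length

def rsa_bits_from_dkim(record):
    """Return the RSA modulus size in bits for a DKIM TXT value."""
    record = record.replace('" "', "").strip('"')   # join split TXT strings from dig output
    tags = dict(part.strip().split("=", 1) for part in record.split(";") if "=" in part)
    der = base64.b64decode("".join(tags["p"].split()))
    _, spki_start, _ = _read_tlv(der, 0)             # outer SEQUENCE
    _, _, alg_end = _read_tlv(der, spki_start)       # AlgorithmIdentifier, skipped
    _, bits_start, _ = _read_tlv(der, alg_end)       # BIT STRING holding the key
    _, key_start, _ = _read_tlv(der, bits_start + 1) # skip the unused-bits byte
    _, mod_start, mod_end = _read_tlv(der, key_start)  # INTEGER: the modulus
    return int.from_bytes(der[mod_start:mod_end], "big").bit_length()
```

Feed it the quoted string from `dig TXT selector1._domainkey.yourorg.com +short`. Anything that returns 1024 goes on the rotation list.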

## DMARC, the part that does the work

DMARC (Domain-based Message Authentication, Reporting & Conformance) is the policy layer on top of SPF and DKIM. It tells receivers what to do when SPF and DKIM checks fail, and it tells you, via aggregate reports, what's happening to your domain in the wild.

A minimal DMARC record:

```
_dmarc.yourorg.com TXT "v=DMARC1; p=reject; sp=reject; rua=mailto:dmarc@yourorg.com; pct=100; adkim=s; aspf=s"
```

The fields that matter:

- `p=reject`: policy for emails that fail both SPF and DKIM on the apex. Three values, `none` (just report), `quarantine` (deliver to spam), `reject` (drop). The end state is `reject`. The path is `none → quarantine → reject`.
- `sp=reject`: same policy for subdomains. This is the SubdoMailing detail every public DMARC how-to forgets. A domain with `p=reject` but `sp=none` is wide open for subdomain abuse. Set both.
- `rua=mailto:`: where aggregate reports get sent. Free DMARC report parsers (Postmark, dmarcian, EasyDMARC) accept these and render them as human-readable summaries.
- `pct=100`: fraction of failing mail to apply the policy to. Start at 25% during the ramp, end at 100%.
- `adkim=s` and `aspf=s`: strict alignment. The From-address domain must match the DKIM signing domain (and SPF return path) exactly. The default is relaxed, which lets subdomains substitute. Strict is what you want unless something is breaking.
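
The field list above, turned into a small linter. This is a sketch, not a full DMARC parser: it splits the record into tags, applies the spec defaults for anything missing (`sp` falls back to `p`, `pct` to 100, alignment to relaxed), and reports how far the record is from the end state. The function names are mine.

```python
def parse_dmarc(record):
    """Split a DMARC TXT value into a {tag: value} dict."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags

def dmarc_gaps(record):
    """List the ways a record falls short of p=reject, sp=reject, strict alignment."""
    tags = parse_dmarc(record)
    gaps = []
    if tags.get("p") != "reject":
        gaps.append("p is not reject")
    if tags.get("sp", tags.get("p")) != "reject":    # sp defaults to p when absent
        gaps.append("subdomains are not at reject")
    if int(tags.get("pct", "100")) < 100:
        gaps.append("pct is below 100")
    if "rua" not in tags:
        gaps.append("no rua address, so no aggregate reports")
    for tag in ("adkim", "aspf"):
        if tags.get(tag, "r") != "s":                # alignment defaults to relaxed
            gaps.append(f"{tag} is relaxed")
    return gaps

print(dmarc_gaps("v=DMARC1; p=reject; sp=none; rua=mailto:dmarc@yourorg.com"))
```

An empty list means the ramp is finished. Anything else is the to-do list.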

The ramp from `p=none` to `p=reject` is what takes three weeks. The risk is breaking a legitimate sender path you didn't know existed. Week one, publish `p=none; pct=100`. Receive DMARC aggregate reports for seven days. Identify every IP and `From:` domain that sent on your behalf. There will be three or four you didn't expect: a newsletter platform finance signed up for, an HR tool, a calendar invite system. Onboard each into SPF and DKIM. Week two, move to `p=quarantine; pct=25`, watch reports for new failures. Week three, `p=reject; pct=100`. Done.

![An animated horizontal bar chart in a dark editorial palette showing FBI IC3 business email compromise losses in the United States by year, from 2020 ($1.8B) through 2024 ($2.77B). Bars fill in sequence. The 2024 bar is accented with a brighter cyan and a coral tip. A bottom strip notes that the average loss per incident in 2024 was $129K and that the dataset is U.S.-only — global BEC losses are higher.](/images/lazy-security-part-4-dns-records/bec-losses.gif)

*Fig. 2 — BEC losses by year, U.S. only. The 2024 number exceeded ransomware.*

Most small teams stop at `p=quarantine` and never finish the ramp. The difference between `quarantine` and `reject` is whether the attacker's spoofed wire-transfer email lands in the CFO's spam folder or never enters the mail system at all. Spam is where employees go to recover legitimate mail that was filtered too aggressively, which means they go there to fish out emails they want to trust. Reject is the answer.

## CAA, two lines to gate cert issuance

CAA (Certification Authority Authorization) is a DNS record that names which Certificate Authorities are allowed to issue TLS certificates for your domain. Without one, any publicly trusted CA in the world can issue a cert for your domain to anyone who passes that CA's domain-validation challenge. With one, only the CAs you've named can.

```
yourorg.com CAA 0 issue "letsencrypt.org"
yourorg.com CAA 0 issuewild "letsencrypt.org"
yourorg.com CAA 0 iodef "mailto:security@yourorg.com"
```

`issue` restricts standard certificates. `issuewild` restricts wildcard certificates. `iodef` is where notifications are sent when an unauthorized CA tries to issue. If you use multiple CAs (one for ACM in AWS, one for Let's Encrypt in your edge, one for Cloudflare-managed certs), list them all:

```
yourorg.com CAA 0 issue "letsencrypt.org"
yourorg.com CAA 0 issue "amazon.com"
yourorg.com CAA 0 issue "digicert.com"
```

CAA cannot prevent a misbehaving CA from issuing anyway. But CAs are required by the CA/Browser Forum baseline requirements to honor CAA at issuance time. They mostly do. When they don't, the misissuance ends in a Mozilla CA-incident bug report and eventual CA distrust. CAA exists so that deliberate misissuance is detectable (the cert came from a CA you never named, so the issuing CA broke the rule) and accidental misissuance by a compliant CA is blocked at issuance time. Both buy you something.
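
Roughly the check a compliant CA runs before issuing, as a sketch. It assumes the records have already been fetched as `(flags, tag, value)` tuples (my shape, not a library's) and skips the part where a real CA climbs the DNS tree to find the relevant record set.

```python
def ca_may_issue(caa_records, ca_domain, wildcard=False):
    """Decide whether ca_domain is authorized by a domain's CAA records."""
    issue = [value for _, tag, value in caa_records if tag == "issue"]
    issuewild = [value for _, tag, value in caa_records if tag == "issuewild"]
    # Wildcard requests use issuewild when present, otherwise fall back to issue.
    relevant = issuewild if (wildcard and issuewild) else issue
    if not relevant:
        return True                      # nothing restricts issuance: any CA may issue
    # A value may carry parameters after ";"; the issuer domain is the first part.
    allowed = {value.split(";")[0].strip().lower() for value in relevant}
    return ca_domain.lower() in allowed

records = [
    (0, "issue", "letsencrypt.org"),
    (0, "issuewild", "letsencrypt.org"),
    (0, "iodef", "mailto:security@yourorg.com"),
]
print(ca_may_issue(records, "letsencrypt.org"))
print(ca_may_issue(records, "digicert.com"))
```

The `return True` on an empty set is the reason to publish the record at all: no CAA means every CA is authorized.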

Cost: three DNS lines. Effort: ten minutes. Catches a class of attack (man-in-the-middle via misissued cert) that most teams have no other defense against.

## the monitors

Two streams pay back the four records.

First, certificate transparency log monitoring. Every publicly trusted CA is required to log every certificate they issue to public append-only logs. `crt.sh` is a free queryable index. The `certstream` Python library streams new entries in real time, also free. Cloudflare offers free CT monitoring for any domain on its DNS. Whatever you pick, the workflow is: cert is issued for `*.yourorg.com` → log entry appears within seconds → your monitor pages a Slack channel → you check whether you issued it. If you didn't, that is an incident, not a notification.
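
The "did we issue it?" step can be a few lines. A sketch, assuming entries shaped like the JSON crt.sh returns (an `issuer_name` and a `name_value` per certificate; check those field names against a real response before relying on them). The sample entries and the expected-issuer list are made up.

```python
# Assumption: keep this in step with the CAs named in your CAA records.
EXPECTED_ISSUERS = ("Let's Encrypt",)

def unexpected_certs(entries, expected=EXPECTED_ISSUERS):
    """Return the CT entries whose issuer matches none of the expected names."""
    flagged = []
    for entry in entries:
        issuer = entry.get("issuer_name", "")
        if not any(name in issuer for name in expected):
            flagged.append(entry)
    return flagged

# Illustrative entries, not real log data.
entries = [
    {"issuer_name": "C=US, O=Let's Encrypt, CN=R11", "name_value": "login.yourorg.com"},
    {"issuer_name": "C=XX, O=Some Other CA, CN=Issuing CA 1", "name_value": "login.yourorg.com"},
]
for cert in unexpected_certs(entries):
    print("investigate:", cert["name_value"], "issued by", cert["issuer_name"])
```

Matching on issuer is the cheap first filter. A cert from an expected CA that nobody on the team requested still deserves a look.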

![A hand-drawn napkin showing the four DNS records as a cheat sheet, written in marker, ready to copy into a DNS panel. Top of the napkin reads 'the four-record afternoon'. Four labeled blocks underneath: SPF as a TXT record with `-all` circled in red, DKIM as a TXT record with the selector subdomain highlighted, DMARC with `p=reject` and `sp=reject` both underlined twice, CAA with the issuer name circled. Bottom of the napkin has two boxes labeled 'CT log monitor' and 'DMARC report inbox', with arrows pointing to a small Slack icon and a small email icon. A red callout at the bottom reads 'fifteen minutes a week'.](/images/lazy-security-part-4-dns-records/dns-records-napkin.png)

*Fig. 3 — the whole afternoon, sketched. Plus what runs after.*

Second, DMARC aggregate report parsing. The `rua=` address in your DMARC record receives daily XML reports from every receiver. Reading the XML raw is unpleasant. The free tiers of Postmark, dmarcian, and EasyDMARC all accept the report stream and render it as "here are the IPs that sent as you this week, here are the ones that failed alignment, here are the new ones since last week." The new-sender alerts are where you find out that someone in marketing has connected a SaaS tool that's now sending emails as you, failing alignment, and getting your domain reputation downgraded.
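
If you'd rather see what the parsers are doing, the raw reports are simple enough to read with the standard library. A minimal sketch: the sample below is a trimmed, made-up report using the standard aggregate-report element names, and the function pulls out the source IPs where both DKIM and SPF failed. Real reports carry more fields, and some add an XML namespace this sketch doesn't handle.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative report; IPs are documentation addresses.
SAMPLE_REPORT = """
<feedback>
  <record>
    <row>
      <source_ip>192.0.2.10</source_ip>
      <count>42</count>
      <policy_evaluated>
        <disposition>none</disposition><dkim>pass</dkim><spf>pass</spf>
      </policy_evaluated>
    </row>
    <identifiers><header_from>yourorg.com</header_from></identifiers>
  </record>
  <record>
    <row>
      <source_ip>198.51.100.7</source_ip>
      <count>3</count>
      <policy_evaluated>
        <disposition>none</disposition><dkim>fail</dkim><spf>fail</spf>
      </policy_evaluated>
    </row>
    <identifiers><header_from>yourorg.com</header_from></identifiers>
  </record>
</feedback>
"""

def failing_sources(report_xml):
    """Return {source_ip: message_count} for rows where DKIM and SPF both failed."""
    root = ET.fromstring(report_xml.strip())
    failures = {}
    for record in root.iter("record"):
        row = record.find("row")
        evaluated = row.find("policy_evaluated")
        if evaluated.findtext("dkim") == "fail" and evaluated.findtext("spf") == "fail":
            ip = row.findtext("source_ip")
            failures[ip] = failures.get(ip, 0) + int(row.findtext("count"))
    return failures

print(failing_sources(SAMPLE_REPORT))
```

Every IP in that dict is either an attacker or a sender you forgot to onboard. During the `p=none` week, it's usually the second.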

A weekly fifteen-minute review of both monitors is what good looks like at a 25-person team. The cost is fifteen minutes a week. The product is "we'd have noticed if someone issued a cert for our login subdomain on Tuesday."

## the receipts

Four DNS records. Two monitors. One afternoon for the records, three weeks for the DMARC ramp, fifteen minutes a week for the reviews. Cost: zero, unless you upgrade past the free tier of a DMARC parser at $15–$50 a month, which is the only thing on the list that's not free.

What this catches: every attempt to send email impersonating your domain from outside your authorized sender list, every attempt to issue a TLS cert for your domain from an unauthorized CA. The FBI's 2024 IC3 report attributed $2.77B in U.S. business email compromise losses to roughly 21,000 incidents — a $129K average. The fraction of those that would have been caught by a domain publishing `p=reject; sp=reject` with an honest SPF audit is enormous.

What it doesn't catch: phishing from a lookalike domain (`yourorg-corp.com`, `yourorg-support.com`, `yourorg.co`). Lookalike-domain defense needs a paid monitoring service at the tier that matters, and there's no free version that works at small-team scale. Skip it until you have a budget line for security. Note it in the runbook.

If you do one thing this week, publish `_dmarc.yourorg.com TXT "v=DMARC1; p=none; rua=mailto:dmarc@yourorg.com"` and point the address at a Postmark free-tier DMARC inbox. Read the first report in seven days. The list of senders you didn't know about is the answer to "why has this been skipped for two years."
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>security</category>
            <category>devsecops</category>
            <category>lazy-sre</category>
            <category>dns</category>
            <category>email</category>
            <category>dmarc</category>
            <enclosure url="https://harshit.cloud/images/lazy-security-part-4-dns-records/subdomailing-timeline.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Lazy SRE's guide to secure systems, part 3: the unsexy list]]></title>
            <link>https://harshit.cloud/blog/lazy-security-part-3-unsexy-list</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/lazy-security-part-3-unsexy-list</guid>
            <pubDate>Sun, 19 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Identity, network, default creds, attestation, audit logs — the controls that close most of the gap Parts 1 and 2 left.]]></description>
            <content:encoded><![CDATA[
I have a calendar reminder that fires on the first of every month. It says "rotate the PAT." I have hit "snooze for 1 week" seventeen times in a row. The PAT in question is a `ghp_` token with read-write access to four private repos and permission to push tags, and the last time I rotated it was October 2024. If anyone has phished my GitHub session in the past fifteen months, they have had a year's head start on me.

This is part 3. [Part 1](/blog/lazy-security-part-1-supply-chain) was npm. [Part 2](/blog/lazy-security-part-2-github-actions) was GitHub Actions. This part is the unsexy list: the controls that don't fit a single attacker narrative, that protect against many different classes of incident in small ways. Identity, network access, default credentials, attestation, the audit log you'll need when the rest of the series missed what you needed it to catch.

The thesis from Part 1 stands. Future You at 3am will not rotate the PAT. The config that makes the rotation unnecessary (short-lived expiry, fine-grained scope, SSO enforcement, audit streaming) is the one that runs while you sleep.

## the PAT you forgot is in four places

Personal-access tokens hide in more places than I want to think about. Mine, when I went through them this weekend:

- `~/.netrc` (the one git falls back to when no credential helper is set)
- `~/.zshrc`, exported as `GH_TOKEN` because some script three years ago needed it
- Mac Keychain, two duplicates, one expired in 2023 but the dialogue still surfaces it
- A `.env` in a repo I haven't pushed to since last summer, committed in plaintext to the `staging` branch (`git log -S 'ghp_'` finds these surprisingly often)
- One CI secret in a repo whose workflow file I deleted six months ago; the workflow went, the secret did not

That's five, not four, which is on-brand for this section.
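
The hunt itself is scriptable. A rough sketch, assuming you only care about a handful of filenames and that a loose prefix match is good enough; the regex is my approximation, not GitHub's official token grammar, and `find_tokens` is a hypothetical helper:

```python
import re
import tempfile
from pathlib import Path

# Loose prefix patterns: an assumption, not GitHub's official token format.
TOKEN_PATTERN = re.compile(r"\b(?:ghp_|github_pat_)[A-Za-z0-9_]{20,}")


def find_tokens(root, names=(".netrc", ".zshrc", ".env")):
    """Yield (path, line_number) for each line that looks like it holds a GitHub token."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.name in names:
            lines = path.read_text(errors="ignore").splitlines()
            for number, line in enumerate(lines, start=1):
                if TOKEN_PATTERN.search(line):
                    yield path, number


# Demo against a throwaway directory with a fake token in a fake .zshrc.
with tempfile.TemporaryDirectory() as home:
    fake_token = "ghp_" + "x" * 36
    (Path(home) / ".zshrc").write_text("export EDITOR=vim\nexport GH_TOKEN=" + fake_token + "\n")
    print([(path.name, number) for path, number in find_tokens(home)])  # [('.zshrc', 2)]
```

It reports the file and line, never the token, so the output is safe to paste into a ticket.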

![A stylized editorial map on a dark navy ground showing the typical places a credential lives on an engineer's machine and inside the org. Nodes for shell config (`~/.zshrc`, `~/.netrc`), Mac Keychain, the CI secret store with one node tagged 'orphaned' in coral, a `.env` file in a stale repo, a browser-cached session, a Slack DM history, a password manager entry, and a post-it from 2022 also in coral. Arrows show the typical sprawl, with concentric rings labeling 'on disk', 'in cloud', and 'in someone else's possession'.](/images/lazy-security-part-3-unsexy-list/credentials-in-the-wild.png)

*Fig. 1 — every place a credential hides. Most teams have it in all of them simultaneously.*

The fix isn't "rotate them all." It's "make the next leak useless." Three configs at the org level do the work.

First, require expiration on all PATs. GitHub org settings → Personal access tokens → Require an expiration date; set the org max to 90 days (GitHub's platform ceiling is 366, but 90 is the right org default). Tokens issued before the setting keep working until their original expiry, so old tokens die naturally as they age out. No big-bang migration.

Second, enforce SSO on the org. A leaked PAT without an active SSO session can't reach SSO-protected repos. Most SaaS git-hosted orgs should have this on already; if yours doesn't, that is the highest-yield ten minutes in this post.

Third, stream the GitHub audit log somewhere SQL-shaped, with two-year retention. The default is six months. You will want eighteen months of history exactly when you need eighteen months of history. The question "did this token get used last week?" should be a query, not a support ticket.

The thing that took me longest to learn is that fine-grained PATs (`github_pat_` prefix, not `ghp_`) let you scope a token to one repo with read-only contents and nothing else. The default scope (full account) is what turns a leaked PAT into a domain compromise. To stop typing `ghp_` into shells entirely:

```ini
# ~/.gitconfig
[credential]
  helper = !gh auth git-credential
[url "https://github.com/"]
  insteadOf = git@github.com:
```

`gh auth login` once, and `git push` works for the rest of your career. The PAT now lives in one place: `gh`'s keyring entry, scoped to your machine, rotated by `gh` whenever it likes.

## identity is the perimeter

SSO + MFA + SCIM is the only thing on the unsexy list that competes with the PAT story for "worst yield from neglect." A single phished password without these is a domain admin compromise. With them, the same phish gets the attacker a soup of session cookies that expire in eight hours and an MFA prompt they can't satisfy.

The three configs, in rough order of cost:

- **MFA, mandatory, no exceptions.** Including the founder, including the contractor, including the on-call rotation. The exception list is the attack list.
- **SSO for every system that supports it.** Yes, Okta SSO Tax is real. Yes, it is annoying. It is cheaper than rebuilding identity after a session-token compromise. Most of the Snowflake-customer breaches of 2024 started with a non-SSO'd account.
- **SCIM provisioning to every system that supports it.** SCIM means offboarding actually offboards. The day someone leaves, the IdP pushes a deprovisioning call to every connected system and their access goes with it. Without SCIM, the median time to fully revoke at a small startup is days, and there is always one Postgres console nobody remembered.

![An animated horizontal bar chart in a dark editorial palette comparing the time to fully revoke an employee's access after offboarding. Top bar 'without SCIM (median, small-startup surveys 2024-2025)' grows over several seconds to around four days. Bottom bar 'with SCIM, SAML attribute push' grows to roughly forty-five seconds and is almost invisible at the scale of the first. Coral tip on the without-SCIM bar marks the window of compromise.](/images/lazy-security-part-3-unsexy-list/scim-revocation-window.gif)

*Fig. 2 — the no-SCIM bar is the entire window of compromise.*

One nightly cron closes most of the rest of the gap:

```bash
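# nightly: diff "people on payroll" vs "humans with prod access"
# (okta-cli stands in for whatever exports your IdP's active users, one per line)
okta-cli list-users --status active | sort > /tmp/active.txt
# --output json so jq still works when the CLI profile defaults to text or table
aws iam list-users --output json --query 'Users[].UserName' | jq -r '.[]' | sort > /tmp/prod.txt
diff /tmp/active.txt /tmp/prod.txt | mail -s "identity-diff $(date +%F)" sec@yourorg.io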
```

It runs in twelve seconds and surfaces the contractor whose SCIM hook silently broke in March.
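
A plain `diff` mixes both directions together. If you want the email to lead with the direction that matters (has prod access, isn't on the active list), a set difference does it. A minimal sketch, assuming both exports are already lists of matching usernames; the names and the `access_drift` helper are hypothetical:

```python
def access_drift(active_users, prod_users):
    """Split the diff into the two directions, most dangerous first."""
    active, prod = set(active_users), set(prod_users)
    return {
        "prod_access_but_not_active": sorted(prod - active),
        "active_but_no_prod_access": sorted(active - prod),
    }


drift = access_drift(
    active_users=["asha", "ben", "chen"],
    prod_users=["asha", "chen", "old-contractor"],
)
print(drift["prod_access_but_not_active"])  # ['old-contractor']
print(drift["active_but_no_prod_access"])   # ['ben']
```

The first list is the page. The second is usually just someone who never needed prod.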

## the access plane: Tailscale, IAP, PrivateLink

Nothing internal needs to be on the public internet. Anything that isn't can't be scanned by Shodan, can't be hit by a credential stuffer, can't be 0-day'd by a CVE published yesterday. The configs are different per layer, but the move is the same: take the thing off the internet and put authentication in front of it.

For shell access and internal HTTP services, Tailscale. The pitch is honest. Install the daemon on every machine, write a twelve-line ACL, you have a private network without running a VPN appliance. Replace SSH-to-bastion with `tailscale ssh`. Replace the internal Grafana on `grafana.yourorg.io` with `grafana.your-tailnet.ts.net`. Both stop existing on the public internet the same afternoon.

For web apps that need real auth-aware proxying (customer-facing internal tools, vendor admin panels), Cloudflare Access or Google IAP. The user hits a public URL, the proxy hands them off to your IdP, then proxies the request to a private backend. The backend has no public route.

For service-to-service inside cloud accounts, AWS PrivateLink and GCP Private Service Connect. These exist so your `stripe-receiver` lambda doesn't need to leave the VPC to reach Stripe's API. They are also what you need so the data warehouse in account A can reach the production database in account B without anything traversing the public internet.

![A hand-drawn two-panel napkin. Left panel labeled 'what the security group says (`0.0.0.0/0`)' shows three boxes (postgres, redis, grafana) sitting in the open, with arrows from labeled attackers (a Shodan crawler, a credential stuffer, a CVE-2026-12345 scanner) landing directly on them. A dashed line labeled 'the bastion SG' floats nearby, doing nothing. Right panel labeled 'what the tailnet says' shows the same three boxes behind a solid Tailnet boundary, with the same attacker arrows bouncing off the boundary line. Bottom strip reads 'twelve lines of ACL → entire blast radius'.](/images/lazy-security-part-3-unsexy-list/access-plane-contrast.png)

*Fig. 3 — same services, different boundary. The right panel is whatever Future You at 3am will thank you for.*

The anti-pattern is the "we'll just rotate the bastion IP" security group. We won't. The credentials for the bastion are in a Slack channel from 2023. The bastion is one of those things that exists because someone set it up before everyone joined and nobody knows whether it's safe to turn off. The lazy answer is to make the bastion irrelevant.

## the helm chart that ships with admin/admin

Every operator-installed thing in the cluster has a default password. Argo CD's `admin` with an auto-generated password is fine, because the password isn't `admin`. Grafana's chart that ships with `admin/admin` is not fine. Jenkins ships with a random initial password written to the `initialAdminPassword` file, which most operators copy in once and never rotate. Half the database charts have `password: changeme` in `values.yaml` and the README says "you should change this," which is not the same as the chart changing it.

The lazy fix is two configs.

First, every secret in the cluster comes from external-secrets or sealed-secrets, never from a `values.yaml`. Pick one. The choice matters less than the consistency. Mine is external-secrets pointing at Vault, because reconciliation handles rotation upstream and the YAML stays clean.

Second, a weekly cron that hits every Service in the cluster with the top 25 default credentials and pages on success. `nuclei` ships a template set for this:

```bash
nuclei -t http/default-logins/ -l services.txt -severity critical,high
```

If it finds something, that's a real incident. If it doesn't, you have evidence, which is the audit-log argument postponed by one section.

One honest aside: Helm chart maintainers have been moving away from default passwords at an encouraging rate. Bitnami's PostgreSQL chart now generates a random password by default instead of `changeme`. The chart that ships with `admin/admin` today is more likely to be a private internal chart someone wrote three years ago than something current from Bitnami. (Note: the official Grafana chart still defaults to `admin/admin`. Override it via Helm values before first install; "I'll change it later" is the part nobody does.) Check the internal charts first.

## sigstore, provenance, and reproducible builds

Part 1 ended on "the next-tier defenses are real, Part 3 will name them." These are them. Sigstore signing, npm provenance, reproducible builds. Each closes a class of attack that pinning and cooldowns can't.

**Sigstore for container images.** `cosign verify` confirms an image was built by your specific GitHub Actions workflow, with your repo's OIDC identity, against a transparency-log entry that's public and append-only.

```bash
cosign verify ghcr.io/yourorg/api:abc123 \
  --certificate-identity-regexp '^https://github.com/yourorg/api/' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com
```

If an attacker pushes a malicious image to your registry without also compromising your CI's OIDC trust, the verify fails. Bake the verify into your deploy step; refuse to deploy what doesn't pass. That is the attested-deployment pattern Part 2 named, in one verb in your CD pipeline.

**npm provenance.** `npm audit signatures` (since npm 9.5) tells you which dependencies have published provenance attestations linking the `.tgz` to a specific GitHub Actions build. A package with provenance gives you a tamper-evident chain: this artifact came from this commit on this branch in this repo, built by this workflow. Coverage is uneven (most `@types/*` packages have it; most one-maintainer packages don't), but the trend is good. The number to track is "what fraction of my install graph has provenance?" That's your remaining audit surface.

**Reproducible builds.** The hardest of the three. Same source produces the same binary, bit-for-bit, on every build machine. Two implementations have shipped at scale: Debian's reproducible-builds program (`reproducible-builds.org` tracks coverage by package) and Nix. The lazy version, for a small team, is to build the production artifact twice on two different runners and compare hashes. If they match, your CI is reproducible enough to detect a poisoned-build attack. If they don't, you have a non-determinism bug to fix, which is also worth knowing about.
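
The build-twice-and-compare step is a few lines. A sketch, assuming each runner leaves its artifact on disk as a single file; `sha256_of` and `builds_match` are hypothetical helper names, and the demo fakes the two artifacts with temp files:

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large artifacts don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(artifact_a, artifact_b):
    """True when two independently built artifacts are bit-for-bit identical."""
    return sha256_of(artifact_a) == sha256_of(artifact_b)


# Demo: two "runners" producing the same bytes, then one that embeds a timestamp.
with tempfile.TemporaryDirectory() as workdir:
    runner_a = Path(workdir) / "api-a.tar"
    runner_b = Path(workdir) / "api-b.tar"
    runner_a.write_bytes(b"same source, same bytes")
    runner_b.write_bytes(b"same source, same bytes")
    print(builds_match(runner_a, runner_b))  # True
    runner_b.write_bytes(b"same source, built at 03:11")
    print(builds_match(runner_a, runner_b))  # False
```

A mismatch doesn't tell you which runner is wrong. It tells you to go look, which is the point.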

## audit logs are for after the incident

Part 2 ended on "Part 3 will name the controls that exist to make the postmortem readable, not to prevent the incident." This is the section. Audit logging is what tells you whether everything in the previous five sections actually worked, what got accessed when one of them didn't, and which credential to roll at 03:11.

Three streams, all of which support direct destination handoff:

GitHub's audit log to S3, Splunk, Datadog, or whichever SQL-shaped destination you'll actually query. Settings → Audit log → Streaming. Default retention is six months; you want two years. The same goes for Okta's System Log (Reports → System Log → Stream).

AWS CloudTrail to a separate audit account: write-only from production, stored in S3 with Object Lock and KMS encryption, multi-region. The level of paranoia required is "this bucket survives a full prod-account compromise." GCP and Azure have equivalents (Cloud Audit Logs, Activity Logs).

Application audit. Stripe webhook history, Slack audit log, Google Workspace audit log. Each is one config and one Splunk index. The marginal effort approaches zero. The payoff is the difference between a one-page incident summary and a six-week panic.

The runbook for "we think we had a breach Thursday" is then a SQL query against a known schema. Without these, it's an interview with everyone who had access.
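
To make "a SQL query against a known schema" concrete, here is a sketch using an in-memory SQLite table. The `audit_events` schema, the token IDs, and the events are all hypothetical; your streamed logs will have their own column names, and the store will be whatever you actually query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_events (ts TEXT, actor TEXT, action TEXT, token_id TEXT, repo TEXT)"
)
conn.executemany(
    "INSERT INTO audit_events VALUES (?, ?, ?, ?, ?)",
    [
        ("2026-04-09T03:11:00Z", "asha", "git.clone", "tok_123", "yourorg/api"),
        ("2026-04-09T03:12:30Z", "asha", "git.push", "tok_123", "yourorg/api"),
        ("2026-04-01T10:00:00Z", "ben", "git.clone", "tok_456", "yourorg/web"),
    ],
)


def token_activity(conn, token_id, since, until):
    """Everything one token did in a half-open time window, oldest first."""
    query = (
        "SELECT ts, action, repo FROM audit_events "
        "WHERE token_id = ? AND ts >= ? AND ts < ? ORDER BY ts"
    )
    return conn.execute(query, (token_id, since, until)).fetchall()


# "Did this token get used last week?"
for event in token_activity(conn, "tok_123", "2026-04-06", "2026-04-13"):
    print(event)
```

ISO 8601 timestamps sort correctly as plain strings, which is why the range filter works without any date parsing.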

## the receipts

The unsexy list is one afternoon, one quarter, and one year. The afternoon: PAT cleanup, SSO/MFA mandatory, GitHub audit log streaming on. The quarter: SCIM provisioning everywhere, Tailscale on every internal service, external-secrets across the cluster. The year: sigstore for your images, an `npm audit signatures` report tracked weekly, reproducible-build hash comparison in CI.

It will not catch a nation-state with patience. It will not catch an insider with a grudge. It will not catch the next Log4j the day it lands. Those are different problems with different budgets, and worth a separate post when one of them happens to one of us.

What it does: it makes the postmortem on your next incident readable. It moves "we don't know what got accessed" out of the executive summary and into "Appendix A, the SQL query." For a small team, that is the difference between recovering and rebuilding.

If you do one thing this week, generate a fresh fine-grained PAT scoped to one repo with a 90-day expiry, switch your `gh auth login` to it, and delete the eight-year-old `ghp_` from your `~/.zshrc`. The calendar reminder won't help. Future You at 3am will not rotate it. Make the wrong default impossible.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>security</category>
            <category>devsecops</category>
            <category>lazy-sre</category>
            <category>identity</category>
            <category>supply-chain</category>
            <category>audit-logs</category>
            <enclosure url="https://harshit.cloud/images/lazy-security-part-3-unsexy-list/credentials-in-the-wild.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Lazy SRE's guide to secure systems, part 2: the actions you didn't pin]]></title>
            <link>https://harshit.cloud/blog/lazy-security-part-2-github-actions</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/lazy-security-part-2-github-actions</guid>
            <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Hardening GitHub Actions for small teams. SHA pinning, OIDC, cooldowns, and the trigger Future You at 3am should not touch.]]></description>
            <content:encoded><![CDATA[
Last March, someone with write access to the `trivy-action` repo rewrote 76 of its 77 version tags in place. The tags still resolved to `aquasecurity/trivy-action`; they just resolved to different commits than they did the week before. Every pipeline that ran `aquasecurity/trivy-action@0.20.0` (and every other tagged version) ran the attacker's commit instead. Secrets exfiltrated. The stolen credentials chained into PyPI and took down LiteLLM. Nobody noticed for days, because the workflow file diff was still clean.

This is part 2. [Part 1](/blog/lazy-security-part-1-supply-chain) covered npm: the dependencies you didn't read. Part 2 is the same problem one level up: the workflows you didn't pin. Part 3 is the unsexy list — Tailscale, PrivateLink, IAP, the PAT you forgot.

The thesis from Part 1 stands. The best security work for a small team is the work *Future You at 3am* will actually execute. The configuration that makes the wrong thing impossible beats the runbook that only discourages it. With GitHub Actions, "the wrong thing" has gotten very specific over the last twelve months, and the configs to block each variety have gotten correspondingly precise.

## pinning is necessary but not sufficient

The first thing the trivy-action incident proves: pinning to a tag like `@0.20.0` is not pinning. It's a name lookup. The owner of the repo is allowed to rewrite that name. The pin you actually wanted was:

```yaml
- uses: aquasecurity/trivy-action@9b9a3f5c8a5c7e1b6e4d2f1c9b8a7e6d5c4b3a2f # v0.20.0
```

Full forty-character SHA. Immutable. The version comment is so the next reader knows what they're looking at; the SHA is so the workflow runs the code you reviewed.

![A horizontal editorial timeline of the trivy-action force-push attack of March 2026 on a deep navy ground. Six stages along a single line, from a maintainer credential phish at T-30d through 76 of 77 tags force-pushed at T-0, first CI pipelines picking up the rewritten tag at T+1h, secrets exfiltrated minutes later, a trojanized LiteLLM published to PyPI at T+6h, and detection at T+9d. Attacker-controlled stages are coral, victim stages cyan, with ghosted 'FORCE-PUSH' and 'VICTIM' phase labels strung across the background.](/images/lazy-security-part-2-github-actions/trivy-action-timeline.png)

*Fig. 1 — nine days from force-push to advisory. The workflow files never changed.*

Two GitHub features shipped in 2025 that change the math:

- **SHA pinning enforcement** (Aug 2025). An org-level policy that *fails* workflow runs using unpinned actions, instead of warning about them. Settings → Actions → General → Action pinning. Turn it on. There is no "we'll get to it" version of this toggle.
- **Immutable Releases** (Oct 2025, GA). Action authors opt in to making release tags non-rewritable after publication. If you publish actions, turn this on for downstream consumers. If you consume actions, prefer ones that have.

The lazy stance: enforcement at the org level. The workflow that doesn't have a forty-character SHA fails the run. The PR can't merge. The work of remembering to pin moves from every engineer's head to one setting.
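
If you want a list of offenders before flipping the org toggle, the check is easy to approximate. A line-based sketch, not a YAML parser and not how GitHub's enforcement is implemented; `unpinned_actions` is a hypothetical helper, and the SHA in the sample is the placeholder from above:

```python
import re

USES_LINE = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)")
FULL_SHA = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text):
    """Return every `uses:` reference that is not pinned to a 40-character commit SHA."""
    offenders = []
    for line in workflow_text.splitlines():
        match = USES_LINE.match(line)
        if not match:
            continue
        ref = match.group(1)
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope for this sketch
        if not FULL_SHA.search(ref):
            offenders.append(ref)
    return offenders


WORKFLOW = """
steps:
  - uses: actions/checkout@v4
  - uses: aquasecurity/trivy-action@9b9a3f5c8a5c7e1b6e4d2f1c9b8a7e6d5c4b3a2f # v0.20.0
  - uses: ./.github/actions/local-thing
"""
print(unpinned_actions(WORKFLOW))  # ['actions/checkout@v4']
```

Run it across every repo once and the size of the list tells you how loud the first week of enforcement will be.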

What this doesn't catch: an attacker who compromises the maintainer account and ships a new tag at a new SHA. The SHA is real. Pinning by SHA doesn't help, because the workflow author *will* rev to the new version when they read the maintainer's release notes. Which is the next config.

## cooldown is the same trick that worked for npm

Part 1's load-bearing config was `SAFE_CHAIN_MINIMUM_PACKAGE_AGE_HOURS=48`. The principle: most published malware is detected and pulled within hours. If you can wait, the wait does the work for you.

The action ecosystem has the same property, with a longer window. [yossarian's analysis](https://blog.yossarian.net/2025/11/21/We-should-all-be-using-dependency-cooldowns) puts the cooldown that catches most supply-chain attacks at 7-14 days. So:

```bash
pinact --min-age 7 .github/workflows/*.yml
```

Refuses to write a pin younger than seven days. Add to pre-commit, your CI lint, or whatever your dependabot equivalent runs before opening the bump PR.

For Renovate users, the equivalent lives in the action manager:

```json
{
  "packageRules": [
    { "matchManagers": ["github-actions"], "minimumReleaseAge": "7 days" }
  ]
}
```

That's it. Same trick, different ecosystem.

![An animated horizontal bar chart in a dark editorial palette showing the share of recent supply-chain action compromises caught by a cooldown of 0, 3, 7, 14, or 21 days. The 0-day bar lands at 3% and the 3-day bar at 38%. The 7-day bar reaches 76% and the 14-day bar reaches 89%, both accented with a brighter cyan and a coral tip. The 21-day bar lands at 94%. A bottom strip notes that the trivy-action force-push was detected at about nine days.](/images/lazy-security-part-2-github-actions/cooldown-window.gif)

*Fig. 2 — the wait is doing the work. Seven days closes most of the door; fourteen closes most of the rest.*

The empirical question is whether seven days is enough. The trivy-action force-push was detected at about nine — seven would have caught most consumers, not all of them. The cost of fourteen is "your action versions lag upstream by two weeks." If your action surface is small (most teams are running `actions/checkout`, `actions/setup-node`, one cloud-login action, maybe a deploy action), set fourteen and forget.
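
The cooldown rule is one comparison. A sketch of the logic, assuming you already have the release's publish time from wherever you fetch it; `old_enough` is a hypothetical helper, not how `pinact` or Renovate implement the check:

```python
from datetime import datetime, timedelta, timezone


def old_enough(published_at, now, min_age_days=7):
    """True once a release has survived the cooldown window."""
    return now - published_at >= timedelta(days=min_age_days)


now = datetime(2026, 4, 12, tzinfo=timezone.utc)
fresh = datetime(2026, 4, 9, tzinfo=timezone.utc)     # 3 days old
settled = datetime(2026, 3, 20, tzinfo=timezone.utc)  # 23 days old

print(old_enough(fresh, now))                      # False
print(old_enough(settled, now))                    # True
print(old_enough(settled, now, min_age_days=14))   # True
```

Moving from seven to fourteen changes one argument, which is roughly how much thought the decision deserves if your action surface is small.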

## pull_request_target is the new postinstall

Part 1 named `postinstall` as the single trigger that does the most damage and the single switch (`ignore-scripts=true`) that closes the most doors. Actions has the same shape and the same fix.

`pull_request_target` runs in the context of the base repository, with access to repository secrets, but is triggered by a PR from a fork. The legitimate use case is small: comment on PRs, label them, run lightweight metadata jobs. The illegitimate use case is enormous: check out the fork's code and execute it. The attack writes itself. Open a fork, modify a script the trusted workflow runs, watch the runner exfiltrate every secret in the env.

Astral, who maintain `uv` and `ruff`, [wrote it cleanly](https://astral.sh/blog/open-source-security-at-astral): "these triggers are almost impossible to use securely." GitHub partially mitigated this in November 2025 by forcing `pull_request_target` to always use the default branch's version of the workflow, so an attacker can't push a vulnerable workflow on a feature branch and trigger it. But the foot-cannon still ships loaded if your default-branch workflow checks out PR-head code.

![A hand-drawn two-panel napkin. Left panel labeled 'pull_request_target' shows a fork PR boundary as a dashed line, a modified script.sh inside the fork, and a runner on the base side reaching across the boundary while holding a red keyring labeled NPM_TOKEN, AWS_KEY, GH_PAT. Right panel labeled 'pull_request' shows the same setup, but the keyring is replaced by a greyed-out 'secrets.* not in scope' bag. The two panels are structurally identical except for the presence or absence of secrets in the runner.](/images/lazy-security-part-2-github-actions/pull-request-target-contrast.png)

*Fig. 3 — same workflow, different trigger, opposite blast radius.*

The lazy stance:

- Don't use `pull_request_target` unless you've named the specific reason and one other person has signed off.
- If you do, never `actions/checkout` the PR head from inside it. Check out the base SHA, do the metadata thing, exit.
- For everything else, use `pull_request`. It runs without secrets. Attacker-controlled code stays attacker-jailed.

Same shape as `ignore-scripts=true`. The setting that closes the class.
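
For a quick inventory of which workflows have the dangerous combination, a text heuristic gets you most of the way. This is a deliberately crude sketch: it does not parse YAML, it will trip on comments, and a proper linter does this better. `risky_pull_request_target` is a hypothetical helper.

```python
import re

# Expressions that resolve to attacker-controlled PR-head code.
PR_HEAD_REF = re.compile(r"github\.event\.pull_request\.head\.(?:sha|ref)|github\.head_ref")


def risky_pull_request_target(workflow_text):
    """Flag a workflow that both uses pull_request_target and references the PR head."""
    uses_trigger = "pull_request_target" in workflow_text
    touches_pr_head = PR_HEAD_REF.search(workflow_text) is not None
    return uses_trigger and touches_pr_head


RISKY = """
on: pull_request_target
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@<sha>
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - run: ./script.sh
"""
SAFE = RISKY.replace("pull_request_target", "pull_request")

print(risky_pull_request_target(RISKY))  # True
print(risky_pull_request_target(SAFE))   # False
```

The second workflow still runs fork code. It just runs it without the keyring.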

## the safe defaults that go in every workflow

The five-line workflow header that does the most work per character:

```yaml
permissions:
  contents: read

defaults:
  run:
    shell: bash -euo pipefail {0}
```

`contents: read` overrides the org-level default. If a step needs to push a tag or open a PR, that job opts back up to `contents: write` explicitly. The default is the safe one.

At the checkout step:

```yaml
- uses: actions/checkout@<sha> # v4.2.0
  with:
    persist-credentials: false
```

The default behavior of `actions/checkout` is to leave a credential sitting in `.git/config` for the rest of the workflow. Later steps have shipped this credential into uploaded artifacts more than once. Opt out unless a later step in the same job needs to push.

Three secret-access rules with the same flavor:

- Step-scoped `env:`, never workflow-scoped, for any secret.
- Never `${{ toJson(secrets) }}`. Exposes every secret in the project to the runner. There is no use case.
- Never `secrets: inherit` on reusable workflows. Pass each secret by name. The reusable workflow gets exactly what it asked for.

The trivy-action exfiltration worked partly because secrets were workflow-scoped. The malicious step inherited every credential in the env, not just the one the legitimate scan needed. Step-scoping wouldn't have prevented the credential theft — but it would have bounded the blast radius to one secret instead of all of them.

## OIDC, the promise from part 1

Part 1 ended on "the next-tier defenses are real, Part 3 names them." OIDC is the part of that conversation that lives here.

The trade: instead of storing an `AWS_ACCESS_KEY_ID` in repo secrets and praying nobody exfiltrates it, you configure AWS to trust GitHub's OIDC issuer for a specific repo, branch, and workflow. GitHub mints a short-lived (five-minute) OIDC identity token for the workflow run. The workflow trades that for STS credentials whose lifetime you set (default one hour). Nothing long-lived ever sits in the env.

```yaml
permissions:
  id-token: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@<sha> # v4.0.2
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy
          aws-region: us-east-1
      - run: aws s3 sync ./dist s3://my-bucket
```

The role's trust policy restricts the OIDC subject to your exact repo and (ideally) branch. An attacker who compromises a fork PR can't assume the role, because they don't match the trust condition. The OIDC JWT itself lasts five minutes and the STS credential is scoped to whatever you configure (default one hour). Even an exfiltrated credential gets the attacker a bounded window of scoped access, not a permanent IAM user.
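
What that trust policy looks like, roughly. A sketch that builds the JSON for one repo and one branch, plus a local stand-in for the subject check. The account ID, repo, and helper names are placeholders; the real evaluation happens inside AWS STS, not in your code.

```python
import json

ISSUER = "token.actions.githubusercontent.com"


def trust_policy(account_id, repo, branch):
    """Build an IAM trust policy limited to one repo and one branch."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": f"arn:aws:iam::{account_id}:oidc-provider/{ISSUER}"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{ISSUER}:aud": "sts.amazonaws.com",
                    f"{ISSUER}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                }
            },
        }],
    }


def subject_allowed(policy, sub):
    """Local stand-in for the exact-match check on the token's `sub` claim."""
    conditions = policy["Statement"][0]["Condition"]["StringEquals"]
    return sub == conditions[f"{ISSUER}:sub"]


policy = trust_policy("123456789012", "yourorg/api", "main")
print(json.dumps(policy["Statement"][0]["Condition"], indent=2))
print(subject_allowed(policy, "repo:yourorg/api:ref:refs/heads/main"))  # True
print(subject_allowed(policy, "repo:yourorg/api:pull_request"))         # False
```

`StringEquals` on the full subject is the strict version. A wildcard like `repo:yourorg/*` under `StringLike` is where these policies usually go soft.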

For Google Cloud, the equivalent is Workload Identity Federation. For HashiCorp Vault, the JWT auth backend. Same shape across providers.

The labor here is genuinely one-time. Configure the trust relationship once per repo, delete the long-lived key, forget about rotation forever. The rotation runbook you're not maintaining is one of the better quiet wins in this post.

## zizmor is the local proxy for workflows

Part 1's `safe-chain` sat in front of every package install and refused malware before bytes hit disk. The action ecosystem's equivalent is `zizmor` — a workflow linter that reads your YAML and catches the patterns this post is about, before they merge.

```bash
brew install zizmor
zizmor .github/workflows/
```

It catches unpinned actions, `pull_request_target` with PR-head checkouts, template-injection patterns where attacker-controlled input lands in a `run:` string, jobs with excessive permissions. Add it to pre-commit:

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/woodruffw/zizmor-pre-commit
    rev: v1.x  # pin the rev, obviously
    hooks:
      - id: zizmor
```

The principle is identical to safe-chain. Move the security check from "after the incident, in the postmortem" to "before the PR can merge, on the dev machine." The CI run is the second line of defense. The pre-commit is the first.

## the receipts

The above stack is approximately one afternoon: org-level SHA pinning enforcement, `pinact --min-age 7` or Renovate `minimumReleaseAge: 7 days`, the five-line workflow header, `persist-credentials: false`, no `pull_request_target` with PR-head checkouts, OIDC for every cloud credential, `zizmor` in pre-commit.

It will not catch a maintainer-account compromise that ships clean-looking code which activates weeks later. It will not catch a determined attacker who studies your build and writes a payload that survives every linter and looks innocent at PR review. Nothing in this post will. Part 3 will name the controls that buy partial mitigation against that class: sigstore, npm provenance, reproducible builds, attested deployments. And the ones that exist to make the postmortem readable, not to prevent the incident.

For a small team, the delta from this post is moving from "we're one tag-rewrite away from a credential theft cascade" to "an attacker would need a credentialed insider, or a fifteen-minute window of luck against a scoped IAM role." That's the only delta that matters at this scale.

If you do one thing this week, turn on SHA pinning enforcement at the org level. Everything else gates off that.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>security</category>
            <category>lazy-sre</category>
            <category>github-actions</category>
            <category>supply-chain</category>
            <category>ci-cd</category>
            <category>devsecops</category>
            <enclosure url="https://harshit.cloud/images/lazy-security-part-2-github-actions/trivy-action-timeline.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Lazy SRE's guide to secure systems, part 1: the dependencies you didn't read]]></title>
            <link>https://harshit.cloud/blog/lazy-security-part-1-supply-chain</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/lazy-security-part-1-supply-chain</guid>
            <pubDate>Sun, 05 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Startup-grade defense against npm supply-chain attacks, for Future You at 3am. Chainjacking, postinstall scripts, smallest install, most leverage.]]></description>
            <content:encoded><![CDATA[
A few months ago a friend's CI pipeline tried to install a package none of us had heard of. The build failed. The error wasn't a missing dep. The error was a local proxy saying *this is malware, I'm not letting it touch disk*. The package was a transitive dependency, six levels deep, that had been published to npm 38 minutes earlier. Nobody on the team had asked for it.

I do platform work at a startup. The job is: keep production up, keep the bill down, keep the kind of person who reads HN comments from getting a free shell on the cluster. The thesis for the series is short. The best security work for a small team is the work *Future You at 3am* will actually execute. The lazy answer is also, almost always, the right one: the configuration that makes the wrong thing impossible, rather than only discouraged. This is part 1. Part 2 is GitHub (`tj-actions`, OIDC, fork PRs). Part 3 is the unsexy list (IAP, Tailscale, PrivateLink, Okta, default helm creds, the PAT you forgot).

## the picture

You opened your editor, ran `npm install`, and onboarded somewhere between 800 and 2,000 packages maintained by people you have never audited. Your team reviewed five direct dependencies. The other 1,995 came in for the ride.

![A two-panel hand-drawn diagram. Left panel labeled 'what i installed' shows five neat dependency boxes connected to a 'your app' box. Right panel labeled 'what npm install pulled in' shows the same five boxes fanning out into a sprawling cluster of small transitive dependencies, with one highlighted in red labeled evil-helper@1.2.3 and a callout reading 'this one is the one shipping crypto miners.'](/images/lazy-security-part-1-supply-chain/dependency-tree-contrast.png)

*Fig. 1 — left is the dependency graph you reviewed at PR time. Right is the one your CI runner actually executes.*

The job is not to read all 1,995. The job is to make sure that when one of them is the problem, the blast radius is small and the alarm goes off.

## chainjacking

Chainjacking is the umbrella term for "someone got control of a package you depend on and pushed a bad version." The attacker doesn't break npm. They get the credentials of the human who publishes the package, ship a patch version, and semver puts it on your machine the next time anyone runs `npm install`. The roll call so far: `event-stream` (2018), `ua-parser-js` / `coa` / `rc` (2021), `lottie-player` (2024), and the shai-hulud worm (Sept 2025, with a 2.0 wave in Nov 2025), which self-replicated by stealing tokens from compromised maintainer machines via TruffleHog-style secret scans and then republishing every other package that maintainer owned. The economics still work for the attacker. It is going to keep happening.

![A magazine-infographic-style timeline on a dark navy background. Six stages from left to right: T-7d maintainer account targeted, T-0 malicious version published, T+12m first CI installs it, T+12m02s secrets exfiltrated, T+1h backdoor in artifacts, T+24h credentials for sale. Stages 1-2 highlighted in coral as 'compromise' stages; stages 3-6 in cyan as 'victim' stages.](/images/lazy-security-part-1-supply-chain/chainjacking-timeline.png)

*Fig. 2 — twenty-four hours from maintainer phish to credential resale. Nobody noticed the version bump.*

What matters in that timeline is that the *human* steps are slow and the *automated* steps are fast. The window between "malicious version published" and "your CI runs `npm install`" is whatever your dependabot cron is. If you auto-merge minor and patch bumps, that window is ninety seconds.

## dependency confusion

You have a private package called `internal-utils`. Your CI is configured with both your private registry and the public npm registry. Somebody publishes `internal-utils` to public npm at a higher version. CI installs the public one, because the higher version wins. Alex Birsan did this against Apple, Microsoft, Tesla, and PayPal in a weekend in 2021 for bug bounties.

Fix, in `.npmrc`:

```
@yourorg:registry=https://npm.yourorg.internal
registry=https://registry.npmjs.org/
```

Scope everything internal. Register your scope on public npm as a parked placeholder. It costs nothing.
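
One way to check that the scoping actually holds is to look at where the lockfile says each scoped package was resolved from. The sketch below is a heuristic, not an npm feature: the function name `scope_leaks` is made up for this post, and it assumes a `package-lock.json` whose `resolved` URLs contain the scope as a literal `/@yourorg/` path segment (a registry that URL-encodes the slash would slip past it).

```bash
#!/usr/bin/env bash
# scope_leaks LOCKFILE SCOPE REGISTRY
# Hypothetical helper: prints every "resolved" URL for a package in SCOPE
# that does not start with REGISTRY. No output means no leaks were found.
scope_leaks() {
  local lockfile="$1" scope="$2" registry="$3"
  grep -o '"resolved": *"[^"]*"' "$lockfile" \
    | sed 's/^"resolved": *"//; s/"$//' \
    | awk -v seg="/${scope}/" -v reg="$registry" \
        'index($0, seg) > 0 && index($0, reg) != 1'
}
```

In CI this could run as `scope_leaks package-lock.json @yourorg https://npm.yourorg.internal`, failing the build on any output.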

## postinstall

Most install-time npm supply-chain incidents I have read the postmortem on shipped their malicious code in a `postinstall` script — not in runtime code. (Some recent ones, like the chalk/debug compromise of Sept 2025, activate at runtime in the browser or on first import; the switch below doesn't help against those. It does help against the install-time class, which is still the majority.) The install hook runs before your tests, before your linter, as part of the install. Default is enabled. The one-line change with the highest blast-radius reduction:

```
# .npmrc
ignore-scripts=true
```

You'll need to allowlist two to five packages that genuinely need it (typically `bcrypt`-shaped things). That number is small. The alternative is letting every package run code on install. Pick.
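
To find out which packages belong on that allowlist, one option is to scan `node_modules` for manifests that declare install hooks. A rough sketch, assuming a plain text match is good enough: it does not parse JSON, so a manifest with an unrelated `"install":` key would show up too. The function name is invented for this post.

```bash
#!/usr/bin/env bash
# packages_with_install_hooks [DIR]
# Hypothetical helper: prints the directory of every package under DIR
# (default: node_modules) whose package.json mentions a preinstall,
# install, or postinstall key. Text match only, not a JSON parse.
packages_with_install_hooks() {
  local root="${1:-node_modules}"
  find "$root" -name package.json \
    -exec grep -lE '"(preinstall|install|postinstall)"[[:space:]]*:' {} + \
    | sed 's|/package\.json$||' \
    | sort
}
```

Review each hit, then decide which ones earn a place on the allowlist.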

## the install that buys the most

Aikido's [`safe-chain`](https://github.com/AikidoSec/safe-chain) is an open-source local proxy that sits in front of `npm`, `npx`, `yarn`, `pnpm`, `pnpx`, `bun`, `pip`, `uv`, `poetry`, `pipx`. Every package download is intercepted and checked against Aikido Intel, an open malware feed. Malware is blocked before bytes hit disk. Which is before `postinstall` runs. Free. No account.

![A clean dark-editorial flow diagram. Five columns from left to right: developer terminal running 'npm install lodash-utils', a shell alias intercepting the command, a local proxy that all package downloads route through, the Aikido Intel cloud queried for malware reputation, and an outcome column with a green 'allowed → installed' branch and a red 'blocked → install aborted' branch.](/images/lazy-security-part-1-supply-chain/safe-chain-flow.png)

*Fig. 3 — safe-chain in one picture. A local proxy in front of every package manager, checked against an open threat-intel feed.*

On a dev machine:

```bash
curl -fsSL https://github.com/AikidoSec/safe-chain/releases/latest/download/install-safe-chain.sh | sh
# restart your shell
npm safe-chain-verify
# expected: OK: Safe-chain works!
```

In CI:

```yaml
- name: Install safe-chain
  run: |
    curl -fsSL https://github.com/AikidoSec/safe-chain/releases/latest/download/install-safe-chain.sh \
      | sh -s -- --install-dir "$HOME/.safe-chain"
    echo "$HOME/.safe-chain/bin" >> "$GITHUB_PATH"
- run: npm ci
```

And, the part that quietly does the most work — refuse to install anything younger than 48 hours, because that's the window in which most npm malware is caught and removed:

```bash
export SAFE_CHAIN_MINIMUM_PACKAGE_AGE_HOURS=48
```
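
The rule itself is simple enough to sketch. This is not safe-chain's implementation, only an illustration of the check it describes; the function name and the epoch-seconds arguments are assumptions made for the example.

```bash
#!/usr/bin/env bash
# old_enough PUBLISHED_EPOCH NOW_EPOCH [MIN_HOURS]
# Illustrative only: succeeds when the package was published at least
# MIN_HOURS (default 48) before NOW_EPOCH. Both times are Unix seconds.
old_enough() {
  local published="$1" now="$2" min_hours="${3:-48}"
  local age_hours=$(( (now - published) / 3600 ))
  [ "$age_hours" -ge "$min_hours" ]
}
```

The package from the opening story, published 38 minutes before the build, fails this check: `old_enough 0 $((38 * 60))` returns non-zero.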

## the receipts

The above stack is approximately one afternoon of work: `npm ci` from a committed lockfile, `ignore-scripts=true` with a tiny allowlist, scoped private packages with locked registry resolution, safe-chain in front of every install, minimum package age of 48 hours. It will catch most known-bad packages, kill dependency-confusion at the registry level, and reduce postinstall blast radius to zero for the long tail. It will not catch a maintainer-account compromise that ships clean-looking malware that only activates in production weeks later. Nothing in this post will. The next-tier defenses (sigstore signing, npm provenance, reproducible builds) are real, and Part 3 will name them.

For a startup the delta from this post is moving from "one of the next ten incidents has a non-trivial chance of being yours" to "you would have to be very unlucky." That's the only delta that matters.

If you do one thing this week, go register your npm scope.

---

## diagrams: what i tried

Three diagrams, three different tools, one brief: "explain a supply-chain attack to a tired SRE in one image." Prompts kept short. Results below.

### #1 — the napkin contrast (coleam00 excalidraw-diagram skill)

Brief: *"two-panel hand-drawn napkin. Left panel 'what i installed': five direct deps off a 'your app' box. Right panel 'what npm install pulled in': same five direct deps, transitive sprawl under each, one of them is a red `evil-helper@1.2.3`, callout reads 'shipping crypto miners.'"*

Result: [Fig. 1](/images/lazy-security-part-1-supply-chain/dependency-tree-contrast.png). Built with the [`excalidraw-diagram`](https://github.com/coleam00/excalidraw-diagram-skill) skill, a Claude Code skill that enforces a design methodology (depth assessment → pattern mapping → evidence artifacts → mandatory render-and-validate loop) and ships a Playwright-based renderer. The `.excalidraw` source is [downloadable](/images/lazy-security-part-1-supply-chain/dependency-tree-contrast.excalidraw); open it on excalidraw.com to edit.

Two things the skill produced that I wouldn't have prompted for: a semantic color palette (Start/Trigger orange for "your app", Error red for the malicious package, Inactive blue-dashed for the "+N more" bags), and a summary-flow strip at the bottom (`5 direct → 1,200 transitive → 1 malicious → full keychain`) that compresses the post's thesis into nine words. The methodology turns a drawing into an argument.

One install gotcha worth knowing: the skill loads Excalidraw via ESM and the default CDN (`esm.sh`) was unreachable from my environment. One-line patch in `render_template.html` to use `cdn.jsdelivr.net` and it worked.

Verdict: **won for the hero.** The Excalidraw aesthetic earns a place when a post needs a punchline; the skill's methodology adds the second zoom level that elevates a punchline into something that teaches.

### #2 — the chainjacking timeline (diagram-design, polished editorial)

Brief: *"Horizontal six-stage timeline. Dark navy `#11141c` ground, coral `#ff6b5a` for the two 'compromise' stages, muted cyan `#5bc0d9` for the four 'victim' stages. Each stage: small timestamp label (T-7d, T-0, T+12m, T+12m02s, T+1h, T+24h), a node on the line, a short title, one-line caption. Title 'a chainjacking attack, in six steps' (lowercase). Italic figcaption. Magazine-infographic feel, no neon, no scanlines."*

Result: [Fig. 2](/images/lazy-security-part-1-supply-chain/chainjacking-timeline.png). Built by a diagram-design subagent. Came back with a ghosted "COMPROMISE / VICTIM" phase label in the background that I hadn't asked for and now wouldn't part with.

Verdict: **won for explaining attacker workflow.** Editorial polish without being a dashboard. The agent's improvisation (the phase label) was the part I would not have prompted my way to.

### #3 — the safe-chain flow (diagram-design, five-column system diagram)

Brief: *"Five columns left to right: developer terminal running `npm install lodash-utils`, shell alias intercepting, local proxy as the focal column, Aikido Intel cloud with a lookup arrow, outcome column with a green 'installed' branch and a red 'install aborted' branch. Dark editorial background `#0d1117`. Cyan normal flow, red blocked, green allowed. Legend strip at the bottom."*

Result: [Fig. 3](/images/lazy-security-part-1-supply-chain/safe-chain-flow.png). Same skill as #2. Came back with `harshit.cloud · lazy security` baked into the footer (also unprompted, also welcome) and a clean two-outcome fork that makes the block/allow decision the visual sink.

Verdict: **won for explaining a system.** When the diagram has to show *what a tool does*, this format beats the napkin every time. The napkin is a punchline. This one is a reference.

I did not ship an animated `.gif` for Part 1. The `infographic-gif` skill is the right tool for *quantitative motion* — a funnel decaying, a bar chart counting up. Nothing in Part 1 needed motion to make the point. If Part 3 ends up wanting a "blast radius over 24 hours" visualisation, that's where the GIF goes.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>security</category>
            <category>lazy-sre</category>
            <category>supply-chain</category>
            <category>npm</category>
            <category>devsecops</category>
            <enclosure url="https://harshit.cloud/images/lazy-security-part-1-supply-chain/dependency-tree-contrast.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Self-hosting SimpleLogin: own your email aliases for $3 a month]]></title>
            <link>https://harshit.cloud/blog/self-hosting-simplelogin</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/self-hosting-simplelogin</guid>
            <pubDate>Sat, 07 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Self-hosted SimpleLogin with Docker, Postfix, and Brevo for $3/month. The TLS gotcha that ate two hours of my Sunday, written down so you skip it.]]></description>
            <content:encoded><![CDATA[
I'd been running Cloudflare Email Routing for months. Free. Dead simple. Emails hit my custom domain, forwarded to Gmail. Privacy-friendly aliases without paying a dime.

Then I tried to reply from an alias. Couldn't.

Cloudflare Email Routing is inbound-only. You receive emails at your alias, but when you hit reply, it goes out from your real Gmail address. The whole point of aliasing — gone in one click.

I'd already moved my infrastructure to [a self-hosted Dokploy setup](/blog/netlify-to-dokploy-migration) running on Hetzner. The server was sitting there at 8% CPU. Why not run my own email aliasing too?

Two hours later, I had full bidirectional email aliases running on SimpleLogin. Here's every step, including the TLS trap that almost made me quit.

![Side-by-side sequence diagram showing Cloudflare Email Routing leaking the real Gmail address on reply, while a self-hosted SimpleLogin stack on Hetzner rewrites the reply path so the alias survives in both directions.](/images/self-hosting-simplelogin/hero.png)

*Fig. 1 — Cloudflare can hand you the letter. It just can't post one back without signing your real name.*

## why Cloudflare email routing wasn't enough

Credit where it's due. Cloudflare Email Routing is genuinely great for what it does:

- **Free.** No credit limits, no tier anxiety
- **5-minute setup.** Add MX records, create routes, done
- **Reliable inbound forwarding.** Never lost an email

But the moment you need to reply from an alias or send a new email as your alias, you're stuck. Cloudflare loosened Email Workers' reply restrictions in March 2025 to let you programmatically process and reply to emails. But it's a developer tool for automated responses, not a "hit reply in Gmail" solution.

| Feature | Cloudflare Email Routing | SimpleLogin (Self-Hosted) |
|---------|--------------------------|---------------------------|
| Cost | Free | ~$3/month (VPS) |
| Setup time | 5 minutes | ~2 hours |
| Receive to alias | Yes | Yes |
| Reply from alias | No | Yes |
| Send as alias | No | Yes |
| Custom domains | Yes (CF nameservers required) | Yes (any registrar) |
| Open source | No | Yes |
| PGP encryption | No | Yes |
| Self-hosted option | No | Yes |

If all you need is inbound forwarding, stick with Cloudflare. It's free and it works. But if you want actual email aliases, where you can reply and send and nobody ever sees your real address, you need SimpleLogin.

## what you'll need

Before diving in:

- **A VPS** with inbound ports 25, 80, and 443 open (Hetzner, Contabo, etc., ~$3/month)
- **A domain** with DNS you control
- **A Brevo account** (free tier: 300 emails/day) for outbound SMTP relay
- **30 minutes of focus** for DNS, plus another hour for the stack

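> **Key Insight:** Most residential ISPs and some cloud providers block port 25. Hetzner accepts inbound mail on port 25, but blocks outbound 25 and 465 on new accounts until you request an unblock. This setup sends outbound mail through Brevo on 587, so inbound 25 is the one that has to work. Check before you start: no port 25, no self-hosted email.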

## architecture overview

Here's how the pieces fit together:

```
Inbound Email Flow:
Internet → MX Record → Your Server:25 (Postfix) → SimpleLogin App → Your Mailbox

Outbound Email Flow:
SimpleLogin App → Postfix → Brevo SMTP Relay → Recipient
```

Four containers:

| Container | Role | Port |
|-----------|------|------|
| `sl-db` | PostgreSQL database | 5432 |
| `sl-app` | Web UI + API | 7777 |
| `sl-email` | Email handler (SMTP) | 20381 |
| `sl-job-runner` | Background tasks | — |

Plus Postfix running directly on the host, listening on port 25.

## the DNS gauntlet

DNS is where most people give up. Don't. It's just a lot of records. Set them all up at once and verify later.

For a domain like `sl.example.com` with server IP `203.0.113.50`:

### A Record

```
Type: A
Name: sl
Value: 203.0.113.50
Proxy: OFF (DNS only)
```

### MX Record

```
Type: MX
Name: sl.example.com
Value: sl.example.com
Priority: 10
```

### SPF Record

```
Type: TXT
Name: sl.example.com
Value: v=spf1 mx a ip4:203.0.113.50 include:spf.sendinblue.com ~all
```

The `include:spf.sendinblue.com` is critical: it authorizes Brevo (formerly Sendinblue) to send mail on behalf of your domain, so SPF checks pass for the messages it relays.

### DKIM Record

```
Type: TXT
Name: dkim._domainkey.sl.example.com
Value: v=DKIM1; k=rsa; p=YOUR_DKIM_PUBLIC_KEY
```

You'll generate this key during Docker setup. Come back and add it then.

### DMARC Record

```
Type: TXT
Name: _dmarc.sl.example.com
Value: v=DMARC1; p=quarantine; pct=100; adkim=s; aspf=s
```

### PTR Record (Reverse DNS)

Set this in your hosting provider's panel, not your DNS. It maps your IP back to your domain. Most providers have a "Reverse DNS" or "rDNS" field in the server settings.

```
203.0.113.50 → sl.example.com
```

> **Key Insight:** If you're using Cloudflare DNS, the A record for your mail subdomain **must** be set to "DNS only" (grey cloud). Cloudflare's proxy doesn't pass through SMTP traffic on port 25. Orange cloud = your MX record points to Cloudflare's proxy = mail delivery fails silently.

![Cloudflare DNS panel showing all configured records](/images/self-hosting-simplelogin/sl-dns-records.png)

### Verify Everything

Don't move on until these pass:

```bash
# MX record
dig MX sl.example.com +short
# Should return: 10 sl.example.com.

# SPF record
dig TXT sl.example.com +short
# Should include: v=spf1 mx a ip4:203.0.113.50 include:spf.sendinblue.com ~all

# A record
dig A sl.example.com +short
# Should return: 203.0.113.50

# PTR record
dig -x 203.0.113.50 +short
# Should return: sl.example.com.
```
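
The SPF check above is the easiest one to misread by eye. A small sketch that tests the record string instead; the function name is hypothetical, and it assumes the record arrives as a single unquoted string (strip the quotes `dig` adds, and pick the SPF line yourself if the name has several TXT records).

```bash
#!/usr/bin/env bash
# has_spf_include RECORD INCLUDE_DOMAIN
# Hypothetical helper: succeeds when RECORD is an SPF record (starts with
# "v=spf1 ") that contains the mechanism include:INCLUDE_DOMAIN.
has_spf_include() {
  local record="$1" domain="$2"
  case "$record" in
    "v=spf1 "*) ;;
    *) return 1 ;;
  esac
  case " $record " in
    *" include:${domain} "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Against live DNS that would look like `has_spf_include "$(dig TXT sl.example.com +short | tr -d '"')" spf.sendinblue.com && echo "SPF ok"`.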

## why Brevo? IP reputation is everything

Why not send directly from Postfix? You can. Gmail, Outlook, and Yahoo will just spam-folder it — or reject it outright.

Email deliverability depends on IP reputation. A fresh VPS IP has none. To the big providers, that looks identical to a spammer on a throwaway server. Building reputation takes weeks of careful warm-up. For a personal alias service sending 10 emails a day, it's not worth it.

Brevo's SMTP relay solves this. Your Postfix hands mail to Brevo, and Brevo sends it from IPs with years of established reputation. Your email lands in inboxes, not spam. Free tier: 300 emails/day.

## setting up Brevo

Sign up at [brevo.com](https://www.brevo.com). Then:

1. Go to **Settings > SMTP & API**
2. Generate an SMTP key
3. Note your SMTP login (it's your account email, not a generated username)
4. Add and verify your domain under **Settings > Senders & Domains**

![Brevo SMTP settings page](/images/self-hosting-simplelogin/sl-brevo-smtp.png)

Save the SMTP key. You'll need it for both the SimpleLogin env file and Postfix config.

## docker setup

SSH into your server. Let's build this.

### Create the Network and Directories

```bash
docker network create sl-network

mkdir -p /sl/pgp
mkdir -p /sl/db
mkdir -p /sl/upload
```

### Environment File

Create `/sl/simplelogin.env`:

```bash
# Domain
URL=https://sl.example.com
EMAIL_DOMAIN=sl.example.com
SUPPORT_EMAIL=support@sl.example.com
ADMIN_EMAIL=admin@sl.example.com

# Email
EMAIL_SERVERS_WITH_PRIORITY=[(10, "sl.example.com.")]
DKIM_PRIVATE_KEY_PATH=/dkim.key
DKIM_PUBLIC_KEY_PATH=/dkim.pub.key

# Outbound SMTP: Postfix on the host, which relays to Brevo
POSTFIX_SERVER=host.docker.internal
POSTFIX_PORT=25

# Database
DB_URI=postgresql://sl_user:your_strong_password_here@sl-db:5432/simplelogin

# Flask
FLASK_SECRET=generate_a_long_random_string_here

# Features
DISABLE_ALIAS_SUFFIX=1
NOT_SEND_LINK_TO_SELF=1
ENABLE_SPAM_ASSASSIN=0

# PGP
GNUPGHOME=/sl/pgp
```

Generate your secrets:

```bash
# Flask secret
openssl rand -hex 32

# Database password
openssl rand -hex 16
```

### Generate DKIM Keys

```bash
openssl genrsa -out /sl/dkim.key 1024
openssl rsa -in /sl/dkim.key -pubout -out /sl/dkim.pub.key

# Get the public key for your DNS record
cat /sl/dkim.pub.key | sed '1d;$d' | tr -d '\n'
```

Copy that output. Go back to your DNS and paste it as the `p=` value in your DKIM TXT record.
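
If you'd rather paste the whole TXT value in one go, the same extraction can be wrapped so it prints the full record. A minimal sketch, assuming a PEM public key like the one generated above; the function name is made up for illustration.

```bash
#!/usr/bin/env bash
# dkim_txt_value PUBKEY_FILE
# Hypothetical helper: strips the PEM header and footer lines, joins the
# base64 body onto one line, and prints the complete DKIM TXT value.
dkim_txt_value() {
  local pubkey_file="$1" key
  key="$(grep -v '^-----' "$pubkey_file" | tr -d '\n')"
  printf 'v=DKIM1; k=rsa; p=%s\n' "$key"
}
```

Run it as `dkim_txt_value /sl/dkim.pub.key` and paste the output into the record for `dkim._domainkey.sl.example.com`.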

### Start PostgreSQL

```bash
docker run -d \
  --name sl-db \
  --network sl-network \
  --restart always \
  -e POSTGRES_DB=simplelogin \
  -e POSTGRES_USER=sl_user \
  -e POSTGRES_PASSWORD=your_strong_password_here \
  -v /sl/db:/var/lib/postgresql/data \
  postgres:16
```

### Initialize the Database

```bash
docker run --rm \
  --name sl-migration \
  --network sl-network \
  --env-file /sl/simplelogin.env \
  -v /sl/dkim.key:/dkim.key:ro \
  -v /sl/dkim.pub.key:/dkim.pub.key:ro \
  simplelogin/app:4.6.5-beta \
  alembic upgrade head

docker run --rm \
  --name sl-init \
  --network sl-network \
  --env-file /sl/simplelogin.env \
  -v /sl/dkim.key:/dkim.key:ro \
  -v /sl/dkim.pub.key:/dkim.pub.key:ro \
  simplelogin/app:4.6.5-beta \
  python init_app.py
```

### Start the Application Containers

```bash
# Web app
docker run -d \
  --name sl-app \
  --network sl-network \
  --restart always \
  --env-file /sl/simplelogin.env \
  --add-host=host.docker.internal:host-gateway \
  -v /sl/dkim.key:/dkim.key:ro \
  -v /sl/dkim.pub.key:/dkim.pub.key:ro \
  -v /sl/upload:/code/static/upload \
  -v /sl/pgp:/sl/pgp \
  -p 127.0.0.1:7777:7777 \
  simplelogin/app:4.6.5-beta

# Email handler
docker run -d \
  --name sl-email \
  --network sl-network \
  --restart always \
  --env-file /sl/simplelogin.env \
  --add-host=host.docker.internal:host-gateway \
  -v /sl/dkim.key:/dkim.key:ro \
  -v /sl/dkim.pub.key:/dkim.pub.key:ro \
  -v /sl/upload:/code/static/upload \
  -v /sl/pgp:/sl/pgp \
  -p 127.0.0.1:20381:20381 \
  simplelogin/app:4.6.5-beta \
  python email_handler.py

# Job runner
docker run -d \
  --name sl-job-runner \
  --network sl-network \
  --restart always \
  --env-file /sl/simplelogin.env \
  --add-host=host.docker.internal:host-gateway \
  -v /sl/dkim.key:/dkim.key:ro \
  -v /sl/dkim.pub.key:/dkim.pub.key:ro \
  -v /sl/upload:/code/static/upload \
  -v /sl/pgp:/sl/pgp \
  simplelogin/app:4.6.5-beta \
  python job_runner.py
```

![Docker containers running healthily](/images/self-hosting-simplelogin/sl-docker-ps.png)

Four containers. All running. But we're not done — Postfix is the piece that actually handles SMTP.

## the Postfix config (and the TLS trap)

This is where I lost two hours. The setup itself is straightforward. The bug that follows is not.

### Install Postfix

```bash
apt-get update && apt-get install -y postfix postfix-pgsql libsasl2-modules
```

Choose "Internet Site" when prompted. Set the system mail name to your domain.

### Main Configuration

Replace `/etc/postfix/main.cf` with:

```ini
# Basic
smtpd_banner = $myhostname ESMTP
biff = no
append_dot_mydomain = no
readme_directory = no
compatibility_level = 3.6

# TLS - Outbound (Postfix → Brevo)
smtp_tls_security_level = encrypt
smtp_tls_note_starttls_offer = yes
smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt
smtp_tls_loglevel = 1

# TLS - Inbound (Internet → Postfix)
smtpd_tls_cert_file = /etc/ssl/certs/ssl-cert-snakeoil.pem
smtpd_tls_key_file = /etc/ssl/private/ssl-cert-snakeoil.key
smtpd_tls_security_level = may

# Network
myhostname = sl.example.com
mydomain = sl.example.com
myorigin = $mydomain
mydestination = localhost
mynetworks = 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128 172.16.0.0/12
inet_interfaces = all
inet_protocols = ipv4

# Relay through Brevo
relayhost = [smtp-relay.brevo.com]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous

# Size limits
message_size_limit = 50000000
mailbox_size_limit = 0

# SimpleLogin integration
relay_domains = pgsql:/etc/postfix/pgsql-relay-domains.cf
transport_maps = pgsql:/etc/postfix/pgsql-transport-maps.cf
```

### The TLS Trap

Here's what happened. Everything looked right. Containers running. Postfix installed. DNS verified. Sent a test email to my alias.

Nothing arrived.

Checked the Postfix logs:

```bash
journalctl -u postfix -n 50
```

```
postfix/smtp[12345]: Untrusted TLS connection established to
  smtp-relay.brevo.com[1.2.3.4]:587: TLSv1.3 with cipher
  TLS_AES_256_GCM_SHA384 (256/256 bits)
postfix/smtp[12345]: certificate verification failed for
  smtp-relay.brevo.com: unable to get local issuer certificate
```

**Untrusted TLS connection.** Postfix was connecting to Brevo but refusing to send because it couldn't verify the certificate chain.

The fix? Two lines:

```ini
smtp_tls_security_level = encrypt
smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt
```

The `CAfile` line tells Postfix where to find the system's CA certificates. Without it, Postfix has no root certificates to verify Brevo's TLS cert against. It connects, sees an "untrusted" cert, and drops the mail.

If you're on Ubuntu/Debian and the CA file is missing:

```bash
apt-get install -y ca-certificates
update-ca-certificates
```

Then restart Postfix:

```bash
systemctl restart postfix
```
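
After the restart, a quick count of the two log messages from above tells you whether the trap is still sprung. A small sketch, with a function name invented for this post; it reads log lines on stdin and assumes each log entry is on one line, as it is in the real journal.

```bash
#!/usr/bin/env bash
# untrusted_tls_count
# Hypothetical helper: reads log lines on stdin and prints how many of them
# report an untrusted TLS session or a failed certificate verification.
untrusted_tls_count() {
  grep -cE 'Untrusted TLS connection established|certificate verification failed' || true
}
```

Send yourself a test email, then run `journalctl -u postfix -n 50 | untrusted_tls_count`. A `0` suggests Postfix can now validate Brevo's certificate chain.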

Two hours. Two lines. Classic.

### PostgreSQL Lookup Files

These let Postfix query SimpleLogin's database to know which domains and addresses to accept.

Create `/etc/postfix/pgsql-relay-domains.cf`:

```ini
hosts = localhost
user = sl_user
password = your_strong_password_here
dbname = simplelogin
query = SELECT domain FROM custom_domain WHERE domain='%s' AND verified=true
  UNION SELECT domain FROM public_domain WHERE domain='%s'
  UNION SELECT '%s' WHERE '%s' = 'sl.example.com' LIMIT 1;
```

Create `/etc/postfix/pgsql-transport-maps.cf`:

```ini
hosts = localhost
user = sl_user
password = your_strong_password_here
dbname = simplelogin
query = SELECT 'smtp:127.0.0.1:20381' FROM alias WHERE email='%s' AND enabled=true
  UNION SELECT 'smtp:127.0.0.1:20381' FROM custom_domain WHERE domain=split_part('%s', '@', 2) AND verified=true
  UNION SELECT 'smtp:127.0.0.1:20381' WHERE split_part('%s', '@', 2) = 'sl.example.com' LIMIT 1;
```

### SASL Authentication for Brevo

Create `/etc/postfix/sasl_passwd`:

```
[smtp-relay.brevo.com]:587 your-brevo-login@example.com:your-brevo-smtp-key
```

Lock it down and generate the hash:

```bash
chmod 600 /etc/postfix/sasl_passwd
postmap /etc/postfix/sasl_passwd
```

### Expose PostgreSQL Port

Postfix runs on the host but needs to reach the Postgres container. Recreate the `sl-db` container with the port published on localhost only; the data lives in `/sl/db` on the host, so it survives the swap:

```bash
docker stop sl-db
docker rm sl-db

docker run -d \
  --name sl-db \
  --network sl-network \
  --restart always \
  -e POSTGRES_DB=simplelogin \
  -e POSTGRES_USER=sl_user \
  -e POSTGRES_PASSWORD=your_strong_password_here \
  -v /sl/db:/var/lib/postgresql/data \
  -p 127.0.0.1:5432:5432 \
  postgres:16
```

### Start Postfix

```bash
systemctl restart postfix
systemctl enable postfix
```

## nginx reverse proxy

SimpleLogin's web UI runs on port 7777. Put Nginx in front for HTTPS.

```nginx
server {
    server_name sl.example.com;

    location / {
        proxy_pass http://127.0.0.1:7777;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

Then get a real certificate:

```bash
apt-get install -y certbot python3-certbot-nginx
certbot --nginx -d sl.example.com
```

Certbot rewrites the Nginx config to add SSL and sets up auto-renewal. Done.

## first login and lockdown

Go to `https://sl.example.com` and register your admin account using the email you set as `ADMIN_EMAIL` in the env file.

Now make yourself premium and lock the door:

```bash
# Make your account premium (lifetime)
docker exec -it sl-db psql -U sl_user -d simplelogin \
  -c "UPDATE users SET lifetime = true WHERE email = 'admin@sl.example.com';"
```

Then disable registration so nobody else can sign up. Add to your `simplelogin.env`:

```bash
DISABLE_REGISTRATION=1
```

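Recreate the app container. `--env-file` is only read when a container is created, so `docker restart` would not pick up the new variable:

```bash
docker rm -f sl-app
# Then re-run the sl-app `docker run` command from "Start the Application Containers"
```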

![SimpleLogin dashboard with multiple aliases configured](/images/self-hosting-simplelogin/sl-dashboard-aliases.png)

Your instance. Your aliases. Your data.

## persistence across reboots

Make sure everything survives a server restart:

```bash
# Docker containers (already set with --restart always, but verify)
docker update --restart always sl-db sl-app sl-email sl-job-runner

# Postfix
systemctl enable postfix

# Nginx
systemctl enable nginx
```

Reboot and verify:

```bash
reboot

# After reboot
docker ps
systemctl status postfix
systemctl status nginx
```
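
Eyeballing `docker ps` works until one container is quietly missing. A minimal sketch that names the absentees instead; the function name is hypothetical, and it expects one running container name per line on stdin.

```bash
#!/usr/bin/env bash
# missing_containers NAME...
# Hypothetical helper: reads running container names on stdin (one per line)
# and prints each expected NAME that is not among them.
missing_containers() {
  local running name
  running="$(cat)"
  for name in "$@"; do
    printf '%s\n' "$running" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}
```

After the reboot: `docker ps --format '{{.Names}}' | missing_containers sl-db sl-app sl-email sl-job-runner`. No output means all four came back.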

## lessons learned

Six things I wish I'd known before starting:

1. **Cloudflare proxy kills mail.** The orange cloud proxies HTTP traffic through Cloudflare's network. SMTP on port 25 doesn't go through that proxy. Grey cloud (DNS only) or your MX records point nowhere useful.

2. **Brevo domain verification is fussy.** Verify your sending domain in Brevo before configuring Postfix. If Brevo doesn't recognize your domain, outbound mail gets rejected at the relay, not at the destination. Hard to debug.

3. **The TLS CA certificate trap is real.** Postfix doesn't use the system CA store by default. You must explicitly point it to `/etc/ssl/certs/ca-certificates.crt`. Without this, outbound relay to Brevo fails silently with "untrusted TLS connection" in the logs.

4. **IP reputation matters more than you think.** Fresh VPS IPs have zero reputation. Using Brevo as a relay piggybacks on their established reputation. Direct send from a new IP = spam folder.

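5. **Pin your Postgres version.** Use `postgres:16`, not `postgres:latest`. With `latest`, a major version bump (16 to 17) arrives the next time the image is pulled and the container recreated, and the new server will refuse to start on the old data directory until you migrate it manually.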

6. **Use the app image, not app-ci.** SimpleLogin publishes both `simplelogin/app` and `simplelogin/app-ci`. The `app-ci` image is for their CI/CD pipeline. Use `simplelogin/app` with a specific version tag.

## the proof

Here's what the full flow looks like in practice. Send a test email to your alias:

![Sending a test email to the SimpleLogin alias](/images/self-hosting-simplelogin/sl-inbound-test.png)

It arrives in your mailbox, forwarded through SimpleLogin. Check the headers — mailed by Brevo's relay, signed by your domain:

![Forwarded email showing Brevo relay and domain signature in headers](/images/self-hosting-simplelogin/sl-forwarded-headers.png)

Now the real test. Hit reply. The recipient should see your alias, not your real email:

![Reply sent from the alias address](/images/self-hosting-simplelogin/sl-reply-from-alias.png)

Check the headers on the reply. From: your alias. Signed-by: your domain. Your real address is nowhere in sight:

![Reply headers confirming alias as sender with TLS encryption](/images/self-hosting-simplelogin/sl-reply-headers.png)

SimpleLogin's dashboard confirms the reply went through:

![SimpleLogin dashboard showing successful reply activity on the alias](/images/self-hosting-simplelogin/sl-reply-confirmed.png)

### browser extension bonus

SimpleLogin also ships a browser extension. Visit any site, click the icon, and create an alias on the fly — no need to open the dashboard:

![SimpleLogin browser extension creating an alias on a website](/images/self-hosting-simplelogin/sl-browser-extension.png)

![Browser extension showing existing aliases for the current site](/images/self-hosting-simplelogin/sl-extension-aliases.png)

Between this and the [Dokploy migration](/blog/netlify-to-dokploy-migration), my entire personal infrastructure runs on a single Hetzner box for under $5 a month. Email aliases, five websites, monitoring, backups. All mine. Two hours of setup, most of it spent in the TLS trap above.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>self-hosting</category>
            <category>docker</category>
            <category>security</category>
            <category>devops</category>
            <category>email</category>
            <enclosure url="https://harshit.cloud/images/self-hosting-simplelogin/hero.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Blocking AI crawlers is the new 'noindex']]></title>
            <link>https://harshit.cloud/til/blocking-ai-crawlers</link>
            <guid isPermaLink="false">https://harshit.cloud/til/blocking-ai-crawlers</guid>
            <pubDate>Wed, 21 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Blocking GPTBot, Anthropic, and Perplexity trades long-term search reach for short-term control. The math on whether it's worth it for your site.]]></description>
            <content:encoded><![CDATA[
If you're blocking GPTBot, ClaudeBot, PerplexityBot, and Google-Extended, you're trading future reach for short-term control.

## the math

- AI search traffic today: ~1%
- AI search traffic tomorrow: 25–35%

Let them crawl. Train the discovery layer. Be early.

## common AI crawler user agents

| Crawler | Company |
|---------|---------|
| `GPTBot` | OpenAI |
| `ClaudeBot` / `Anthropic-AI` | Anthropic |
| `PerplexityBot` | Perplexity |
| `Google-Extended` | Google (Gemini) |

## the robots.txt decision

Blocking these crawlers:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

Feels like control. Actually it's invisibility.
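
To see exactly who a given robots.txt shuts out, Python's standard `urllib.robotparser` can evaluate it offline. A minimal sketch: the rules are the two from the snippet above, while the `blocked_agents` helper and the `example.com` URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the robots.txt snippet above
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

def blocked_agents(robots_txt, agents, url="https://example.com/blog/post"):
    """Return the user agents that are not allowed to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]

agents = ["GPTBot", "ClaudeBot", "PerplexityBot", "Googlebot"]
print(blocked_agents(ROBOTS_TXT, agents))
# ['GPTBot', 'ClaudeBot']
```

Anything without a matching `User-agent` block falls through to "allowed", so PerplexityBot and Googlebot still crawl under these rules.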

## why this matters

When someone asks an AI "how do I do X" and your content isn't in the training data, you don't exist in that conversation.

The sites that trained the discovery layer early will own the AI search results later.

Visibility > invisibility.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>seo</category>
            <category>ai</category>
            <category>crawlers</category>
            <category>strategy</category>
            <enclosure url="https://harshit.cloud/til/blocking-ai-crawlers/opengraph-image" length="0" type="image//til/blocking-ai-crawlers/opengraph-image"/>
        </item>
        <item>
            <title><![CDATA[Access denied: when your browser extensions look like attack vectors]]></title>
            <link>https://harshit.cloud/blog/akamai-browser-extensions-blocking</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/akamai-browser-extensions-blocking</guid>
            <pubDate>Wed, 31 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Tried booking a flight. Got blocked. Turns out Akamai thinks my 21 security extensions make me look like a hacker. They're not wrong.]]></description>
            <content:encoded><![CDATA[
Last week I tried booking a flight on Indigo. Access Denied. Tried MakeMyTrip. Access Denied. Ixigo? Same story. Yatra? Blocked.

My banking apps worked fine. But every travel booking site using Akamai's CDN decided I was public enemy number one. Sometimes the site would load, then the OTP API calls would silently fail. Making a complete fool out of me at checkout.

![Indigo Access Denied](/images/akamai-browser-extensions-blocking/indigo_access_denied.png)

*Fig. 1 — every travel booking site behind Akamai locked me out with the same Access Denied page.*

![MakeMyTrip Access Denied](/images/akamai-browser-extensions-blocking/mmt_access_denied.png)

## the debugging rabbit hole

First thought: bad IP from my ISP's CGNAT pool. Changed my IP. Worked for 10 minutes. Then blocked again.

Second thought: maybe Akamai's IP reputation is flagging me. Checked their [Client Reputation lookup](https://www.akamai.com/us/en/clientrep-lookup/).

![Akamai Clean IP Reputation](/images/akamai-browser-extensions-blocking/akamai_repo_ip.png)

Nope. Clean as a whistle.

![My IP Info - Tata Play, Bengaluru](/images/akamai-browser-extensions-blocking/my_ip.png)

Google dorking time. Found tons of users globally facing the same issue. Not ISP-specific. Not India-specific. Something else was up.

Then I found [this blog](https://leinss.com/blog/?p=3409) that pointed at browser extensions. Interesting.

## the lightbulb moment

Switched from Arc to Chrome. Still blocked. Because I carried over the same 21 extensions like a digital hoarder.

![My Extension Arsenal - Part 1](/images/akamai-browser-extensions-blocking/extensions.png)

![My Extension Arsenal - Part 2](/images/akamai-browser-extensions-blocking/extensions_2.png)

Here's my toolkit: Wappalyzer, Shodan, Trufflehog, DotGit, and a bunch of OSINT/greyhat recon tools. The same extensions I use for security research were making me look like an attacker to Akamai's Bot Manager.

Turned off all extensions. Instant access to every site.

## what's actually happening

Akamai's Bot Manager isn't counting your requests. It's fingerprinting the client environment. Browser extensions can inject JavaScript, mutate the DOM, alter request behavior, and add tracking parameters — all things the client-side fingerprint will flag as bot-shaped, the same way it would flag a scraper or an injection probe.

My security toolkit became my own DoS attack vector. Poetic, really.

Some users reported User-Agent changes helped. I didn't test that. I also didn't have time to debug which of the 21 extensions was the actual culprit. Life's too short for that level of troubleshooting.

## the takeaway

WAF rules are aggressive by design. Your legitimate security tools look exactly like attack vectors because, well, they kind of are. The line between security researcher and threat actor is thinner than we'd like to admit.

If you're getting blocked by Akamai with a clean IP:

1. Check your extensions first, not your ISP
2. VPN working temporarily? That's behavioral detection, not IP blocking
3. The Client Reputation tool won't catch extension-based triggers
4. Your OSINT toolkit makes CDNs nervous

Infrastructure is meant to keep bad actors out. Sometimes it keeps infrastructure wizards out too. Not fun.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>security</category>
            <category>waf</category>
            <category>akamai</category>
            <category>debugging</category>
            <category>browser-extensions</category>
            <category>cdn</category>
            <enclosure url="https://harshit.cloud/images/akamai-browser-extensions-blocking/indigo_access_denied.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[VictoriaLogs vs Loki: real-world benchmarking results]]></title>
            <link>https://harshit.cloud/blog/victorialogs-vs-loki</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/victorialogs-vs-loki</guid>
            <pubDate>Wed, 19 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[500 GB of logs, 7 days, same hardware. VictoriaLogs vs Loki: 94% lower query latencies, 37% smaller storage, half the CPU and RAM.]]></description>
            <content:encoded><![CDATA[
On 500 GB of logs over 7 days, on the same hardware: **94% lower query latencies, 37% smaller storage, and under half the CPU and RAM**. The single number that surprised us most was the 12× drop in needle-in-a-haystack search times.

![VictoriaLogs vs Loki — neon-styled cover illustration with VictoriaLogs (cyan, throughput chart, database icon) and Loki (magenta, bar chart, gauge) presented as benchmark contenders](/images/victorialogs-vs-loki/hero.webp)

## the setup

At Truefoundry we run multi-tenant ML workloads on Kubernetes. The log layer has to deliver fast ad-hoc search across mixed namespaces (often with no good labels to anchor on), sustained 60+ MB/s ingestion during deploys and incidents, and live tailing that doesn't fall behind during a noisy crash loop. It also has to run as a single binary — we don't want a six-component log stack — within a 4 vCPU / 16 GiB node ceiling shared with everything else.

Loki was our default. Past the 1M-active-series mark it started showing 30s+ search latencies and high I/O amplification. So we benchmarked it head-to-head against VictoriaLogs and let the numbers decide.

### the contestants

- **Loki:** Grafana Labs' log store. Compressed chunks, label-based indexing, LogQL. Brilliant Grafana integration; expensive regex scans and Go GC overhead at scale.
- **VictoriaLogs:** VictoriaMetrics' columnar LSM log database. Per-field indices, SIMD search, LogsQL. Single binary, low memory footprint, efficient compression.

### benchmark methodology

| Category | Details |
|---|---|
| Hardware | 4 vCPU / 8 GiB RAM, identical for both, QoS: Guaranteed |
| Log generator | flog → Vector → Loki / VictoriaLogs at 65 MB/s sustained |
| Dataset | ~500 GB over 7 days; mix of unique and duplicated lines across 20 namespaces, 40 apps |
| Retention | 7 days |
| Load test | Locust 2.27.1, 10 virtual users, sustained 43 RPS via `/select/logsql/query` and the Grafana datasource |
| Queries | Stats, Needle in a Haystack, Negative — detailed below |
| Caching | Block cache disabled on both; pods restarted before each run to simulate cold reads |
| Index tweaks | Defaults on both |

## the headline figure

Before the methodology debate, here's what the seven days produced.

<iframe src="/images/victorialogs-vs-loki/footprint-widget.html" title="VictoriaLogs vs Loki resource footprint: 500 GB over 7 days" height="900" data-caption="Fig. 2 — Resource economics on identical hardware and workload."></iframe>

The Grafana panels behind those numbers — same six metrics for both systems, two very different shapes:

**Loki:**

![Loki Grafana dashboard: CPU usage pinned near 4 vCPU limit, memory holding around 6–7 GB, regular throttling spikes hitting 40–50% during the benchmark window](/images/victorialogs-vs-loki/victorialogs-loki-footprint-loki.png)

**VictoriaLogs:**

![VictoriaLogs Grafana dashboard over the same period: CPU near zero baseline with brief spikes to 1 vCPU, memory flat around 1.3 GB, no throttling visible](/images/victorialogs-vs-loki/victorialogs-loki-footprint-victoria.png)

The memory line is the one that most directly translates into infrastructure cost. At steady state, VictoriaLogs sat around 1.3 GB while Loki held 6–7 GB. Freeing ~5 GB per node is the difference between bin-packing four tenants on a box and seven.

## storage on disk

Same logs, same 7-day retention, identical ingestion path. Loki landed at **501 GB**; VictoriaLogs at **318 GB** — **37% smaller** with no tuning on either side.

The difference is partly the codec (VictoriaLogs uses zstd, Loki defaults to snappy) but mostly the layout. Columnar storage finds redundancy that Loki's per-stream chunks don't see; values from the same field compress together far better than values stitched in by line order.

At fleet scale this is a 1 TB volume holding what used to need 1.5 TB.
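
The percentages are easy to re-derive from the two on-disk sizes. A quick check using only the figures quoted above; the helper name is mine.

```python
def percent_smaller(baseline_gb, candidate_gb):
    """How much smaller the candidate is than the baseline, in percent."""
    return (baseline_gb - candidate_gb) / baseline_gb * 100

loki_gb, victorialogs_gb = 501, 318

saving = percent_smaller(loki_gb, victorialogs_gb)
print(f"{saving:.1f}% smaller")  # 36.5% smaller

# Scale a 1.5 TB Loki volume by the same ratio
projected_tb = 1.5 * victorialogs_gb / loki_gb
print(f"{projected_tb:.2f} TB")  # 0.95 TB
```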

## query performance

Three query patterns, run against the same 500 GB / 7-day index. Result sets were verified to be identical between the two systems.

### 1. stats — log count over 24 hours

**Purpose:** Total log lines from `app="servicefoundry-server"`.

- **LogQL:** `sum(count_over_time({app="servicefoundry-server"}[24h]))`
- **LogsQL:** `{app="servicefoundry-server"} | stats count()`

| System | Latency |
|---|---:|
| Loki | 2.5s |
| VictoriaLogs | 1.5s |

Aggregate counts hit Loki's strength — label-anchored, no text scan — and Loki still loses by 40% on the wall clock. VictoriaLogs holds its own on label queries; Loki has no answer for the others.

### 2. needle in a haystack — finding one line in 500 GB

**Purpose:** Locate a single static log entry `[UNIQUE-STATIC-LOG] ID=abc123 XYZ` in the `truefoundry` namespace over 7 days.

- **LogQL:** `{namespace="truefoundry", app!="grafana"} |= "[UNIQUE-STATIC-LOG] ID=abc123 XYZ"`
- **LogsQL:** `{namespace="truefoundry", app!="grafana"} "[UNIQUE-STATIC-LOG] ID=abc123 XYZ"`

| System | Latency |
|---|---:|
| Loki | 12s |
| VictoriaLogs | ~900ms |

The two-character difference in syntax (`|=` vs nothing) hides the architectural one. Loki's `|=` is a substring filter run line-by-line over decompressed chunks. VictoriaLogs treats the same string as an index probe. 12 seconds turns into 900 milliseconds on identical hardware.
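
For anyone wanting to reproduce the needle query outside Grafana, here is a rough sketch against the `/select/logsql/query` endpoint from the methodology table. Assumptions: a VictoriaLogs instance on `localhost:9428`, a `_time:7d` filter to bound the window, and helper names that are mine rather than part of the benchmark harness.

```python
import json
import urllib.parse
import urllib.request

VLOGS_URL = "http://localhost:9428"  # assumption: local instance on the default port

def build_query_request(logsql, limit=10, base_url=VLOGS_URL):
    """Build a POST request for the LogsQL query endpoint."""
    body = urllib.parse.urlencode({"query": logsql, "limit": limit}).encode()
    return urllib.request.Request(
        f"{base_url}/select/logsql/query", data=body, method="POST"
    )

def parse_json_lines(payload):
    """The endpoint streams one JSON object per line; skip blank lines."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]

def search(logsql, limit=10):
    """Run the query and return the matching log entries as dicts."""
    request = build_query_request(logsql, limit)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_json_lines(response.read().decode())

NEEDLE = (
    '_time:7d {namespace="truefoundry", app!="grafana"} '
    '"[UNIQUE-STATIC-LOG] ID=abc123 XYZ"'
)
# entries = search(NEEDLE)  # needs a running VictoriaLogs instance
```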

### 3. negative — proving a string doesn't exist

**Purpose:** Search for a string that doesn't appear anywhere in the dataset. Forces a full scan in both systems.

- **LogQL:** `{namespace="truefoundry"} |= "non-existent log line"`
- **LogsQL:** `{namespace="truefoundry"} "non-existent log line"`

| Dataset | Loki | VictoriaLogs |
|---|---:|---:|
| 500 GB | **Timeout** | 2.2s |
| 300 GB | 2.6s | 266ms |

The negative query is the quiet one. At 300 GB Loki handles it in 2.6 seconds. At 500 GB it runs out of resources and the query never returns. In production that's the difference between a dashboard that loads and one that hangs until it times out.

## ingestion under pressure

We pushed both with 120 flog replicas to find the ceiling.

| Metric | Loki | VictoriaLogs | Delta |
|---|---:|---:|---:|
| Peak ingestion | 20 MB/s | 66 MB/s | **3× higher** |
| vCPU (sustained) | 4 vCPU, 100% throttled | 2 vCPU peak | 50% lower |
| Memory | ~4 GiB | ~1.3 GiB | 3× lower |

![Loki CPU saturation graph at 4 vCPUs and memory consumption at 4GB during peak ingestion load with 120 flog replicas](/images/victorialogs-vs-loki/victorialogs-loki-cpu-memory-loki.png)

![VictoriaLogs performance graph showing 2 peak vCPU usage and 1.3GB memory consumption during the same ingestion load](/images/victorialogs-vs-loki/victorialogs-loki-performance-victoria.png)

Loki hit the CPU wall first and never recovered: pinned at 100% throttled while still topping out at 20 MB/s. VictoriaLogs absorbed the same firehose at 3× the throughput, on **half the CPU and roughly a third of the memory**.

## load test under traffic

Locust, 10 concurrent users, simulating real read traffic. VictoriaLogs handled 36% more requests per second, its p99 latency was 3.6× lower than Loki's under load, and tail latency stayed lower at every percentile we measured.

![Load test results for VictoriaLogs showing 36% higher RPS and 3.6x faster p99 latency with 10 concurrent users at 43 RPS](/images/victorialogs-vs-loki/victorialogs-loki-loadtest-victoria.png)

![Load test results for Loki showing slower response times and lower throughput under the same simulated traffic](/images/victorialogs-vs-loki/victorialogs-loki-loadtest-loki.png)

## why the gap is this big

Four design choices doing most of the work:

1. **Full-text indexing.** Per-token indices skip line-by-line filtering entirely.
2. **Columnar LSM layout.** Reads touch only the columns the query asks for; fewer disk seeks.
3. **Memory discipline.** Lower steady-state overhead means more headroom for everything else.
4. **SIMD search.** Vectorised inner loops on commodity CPUs add up over billions of lines.

## when to pick which

VictoriaLogs is the right pick when text search and grep-style queries are the primary workload, when ad-hoc exploration across large windows matters, when resource efficiency and bin-packing density are real constraints, or when you want fewer knobs to tune in production.

Loki is the right pick when label-based queries dominate and full-text is rare, when deep Grafana ecosystem integration is non-negotiable, or when you already operate Loki at scale and the migration cost outweighs the wins.

For us, on this workload, the resource economics decided it. The freed memory per node became real infrastructure savings within a quarter. 12 seconds turned into 900 milliseconds with no tuning, and that's the number I keep quoting six months later.

## resources

- [Loki Documentation](https://grafana.com/docs/loki/latest/)
- [VictoriaLogs Documentation](https://docs.victoriametrics.com/victorialogs/)
- [Vector Documentation](https://vector.dev/)
- [Grafana Alloy](https://grafana.com/docs/alloy/latest/)
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>kubernetes</category>
            <category>logging</category>
            <category>observability</category>
            <category>victorialogs</category>
            <category>loki</category>
            <category>benchmarking</category>
            <enclosure url="https://harshit.cloud/images/victorialogs-vs-loki/hero.webp" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[When Netlify killed my free tier: a 15-minute migration to Dokploy]]></title>
            <link>https://harshit.cloud/blog/netlify-to-dokploy-migration</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/netlify-to-dokploy-migration</guid>
            <pubDate>Fri, 24 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Netlify suspended five free-tier sites of mine one Tuesday night. The 15-minute migration to Dokploy on a €3/month VPS that bought everything back.]]></description>
            <content:encoded><![CDATA[
Late night. Got this email: **"[Netlify] Your projects have been suspended due to credit limit exceeded."**

Five sites down:

- linkedintel.ai (LinkedIn Sales Intelligence AI for SDRs)
- sachin.cool (rookie website from my college days)
- dilharia.love (wedding RSVP site - yes, judge me)
- My personal blog
- An ex-CEO's landing page

![Netlify suspension email](/images/netlify-to-dokploy-migration/dokploy_email.png)

Netlify moved legacy free tier users to their new 300-credit plan. I burned through it in a week.

![Netlify upgrade notice](/images/netlify-to-dokploy-migration/dokploy_upgrade_netlify.webp)

New option: $9/month for 1000 credits, or figure something else out.

I had 15 minutes before my fiancée woke up. Here's what happened.

## the €3 solution

Hetzner CX22: 2 vCPUs, 4GB RAM, 40GB SSD. **€3.29/month**.

![Hetzner CX22 pricing](/images/netlify-to-dokploy-migration/dokploy_hetzner.png)

Math was simple:

- Netlify: $108/year for credit anxiety
- Dokploy + Hetzner: $42/year for unlimited deploys

![Netlify vs Self-Hosted Comparison](/images/netlify-to-dokploy-migration/dokploy_netlify.png)

I'd been [watching this Dokploy video](https://www.youtube.com/watch?v=RoANBROvUeE) the week before. Perfect timing.

## the 15-minute panic deploy

**Minutes 0-5**: Spun up Hetzner in Helsinki. Got the IP. Updated DNS.

**Minutes 5-8**: SSH'd in, ran the Dokploy installer:

```bash
curl -sSL https://dokploy.com/install.sh | sh
```

One command. Dokploy installed Docker, Traefik, PostgreSQL, everything.

**Minutes 8-12**: Connected Git repos. Paste GitHub URL, select branch, done.

![Dokploy Git integration](/images/netlify-to-dokploy-migration/dokploy_git.png)

**Minutes 12-15**: Hit deploy on all 5 projects. Watched them come back to life.

![Dokploy migration dashboard](/images/netlify-to-dokploy-migration/dokploy_migration.png)

The fiancée woke up. dilharia.love was live.

## what surprised me

SSL just works. Traefik + Let's Encrypt provision certificates automatically. I'm running Cloudflare Full (Strict) mode - zero warnings.

WWW redirects? One checkbox. Netlify charged extra for this.

Logs and monitoring built-in. No Datadog bill. No "$500/month observability platform."

![Dokploy projects dashboard](/images/netlify-to-dokploy-migration/dokploy_projects.png)

## the catch

You own the ops. Server goes down? That's on you. No 99.9% SLA.

You handle security: OS updates, SSH keys, backups. I run `apt upgrade` weekly and back up to Backblaze B2 for $0.50/month.

For personal projects? Worth it. For business-critical stuff? Pay for managed services.

## one month later

Server load: 8% CPU. Zero downtime. SSL renewals automatic.

All 5 sites running smoothly: linkedintel.ai pulling data, sachin.cool looking sharp, dilharia.love collecting RSVPs.

Deployed 3 more projects since then. No credit anxiety. No surprise bills.

Total maintenance time: 10 minutes/week.

Best infrastructure decision I've made this year.

## related posts

- [AWS Cost Optimization: How We Cut Our Bill by 60%](/blog/aws-cost-optimization-tricks)
- [How I Took Down 30% of Production with One TLS Fingerprinting Rule](/blog/ja4-fingerprinting-network-security)
- [5 Kubernetes Debugging Tricks That Saved My Production](/blog/kubernetes-debugging-tips)
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>devops</category>
            <category>hosting</category>
            <category>cost-optimization</category>
            <category>self-hosting</category>
            <category>dokploy</category>
            <enclosure url="https://harshit.cloud/images/netlify-to-dokploy-migration/dokploy_email.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Delivery impersonation: the social engineering vector that just works]]></title>
            <link>https://harshit.cloud/til/delivery-social-engineering</link>
            <guid isPermaLink="false">https://harshit.cloud/til/delivery-social-engineering</guid>
            <pubDate>Fri, 17 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Someone called pretending to deliver a Diwali bakery hamper. They got my full address in 20 seconds. Why this pretext works and what to ask back.]]></description>
            <content:encoded><![CDATA[
A few weeks ago, my mum got a WhatsApp call from someone claiming to deliver a Diwali hamper from a bakery she'd never ordered from. They asked for her live location to "route the driver". She sent it. Twenty seconds, full home address handed to a stranger.

## why this attack works

The attack rides three psychological triggers at once. Mentioning a well-known local business creates instant credibility. Gift deliveries during festivals are common and expected, so the pretext doesn't trip anyone's filter. And "I'm outside and need directions now" prompts immediate action before the victim has time to verify anything.

## the attack pattern

```
Attacker: "Hi, I'm from [Popular Local Bakery]. I have a Diwali gift
          hamper for you but I'm having trouble finding your location.
          Could you share your address or live location?"

Victim: Shares full address or WhatsApp live location without verification
```

No order confirmation requested. No delivery tracking number asked for. No verification of any kind.

## why people fall for it

- **Gift context**: during festivals, people expect surprise gifts from friends and family
- **Helpful nature**: most people want to help someone who seems to be doing their job
- **Time pressure**: the implied urgency ("I'm waiting outside") prevents critical thinking
- **Low perceived risk**: sharing an address seems harmless compared to financial data
- **Trust in local brands**: using a known local business name lowers suspicion

## defense strategies

The defense is one habit: don't share an address until you've verified the order exists. Ask for a tracking number, call the business on its public number, ask who sent the gift and check with them. If the driver "needs directions right now", give a landmark, not a pin. Most delivery apps already have in-app chat — there's no good reason a real driver needs your live location over WhatsApp.

## real-world impact

This attack can be used for:

- Physical surveillance and stalking
- Burglary planning (knowing when someone is home)
- Identity theft (address is often used for verification)
- Targeted phishing (the attacker now knows your exact location)
- Physical security breaches

## the asks that work

The pretexts that actually get through share a shape. A familiar local business name doing the credibility work. A plausible occasion — Diwali hampers, birthday flowers, Amazon redelivery — that fits the calendar. A small action framed as urgent: "I'm outside, just send the location". Zero technical skill, one phone call, full address.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>social-engineering</category>
            <category>cybersecurity</category>
            <category>opsec</category>
            <category>privacy-risk</category>
            <category>pretexting</category>
            <category>security-awareness</category>
            <enclosure url="https://harshit.cloud/til/delivery-social-engineering/opengraph-image" length="0" type="image//til/delivery-social-engineering/opengraph-image"/>
        </item>
        <item>
            <title><![CDATA[How I took down 30% of production with one TLS fingerprinting rule]]></title>
            <link>https://harshit.cloud/blog/ja4-fingerprinting-network-security</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/ja4-fingerprinting-network-security</guid>
            <pubDate>Tue, 14 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Deployed a TLS fingerprinting rule that seemed reasonable. Blocked every Chrome 119 user on Windows. The incident report was not fun to write.]]></description>
            <content:encoded><![CDATA[
Last month I broke production. Blocked 30% of legitimate traffic because I misunderstood how TLS fingerprinting actually works.

Here's the incident, what I got wrong, and what SREs actually need to know about JA4 fingerprinting.

![Timeline of a JA4 rate-limit deploy that turned a confident 2pm push into a 30 percent revenue drop and a 4pm rollback two hours later.](/images/ja4-fingerprinting-network-security/hero.png)

*Fig. 1 — one fingerprint looked like one attacker. it was a tls stack used by 30% of customers.*

## the JA3 problem that hit our monitoring

We'd been using JA3 fingerprinting to track traffic patterns since 2021. Not for blocking, just visibility. Which clients were hitting our APIs, how to correlate requests, that kind of thing.

Late 2023, our dashboards started showing chaos. Chrome traffic was generating thousands of unique fingerprints. Same browser version, different fingerprint every request. Our metrics were useless.

Then in early 2024, health checks from all our Go-based internal services started showing up as "unknown clients" in our traffic analysis. Turns out Go's HTTP client randomizes cipher suite ordering. JA3 saw that as thousands of different clients. It was just our Kubernetes health checks.

Our traffic classification was completely broken.

## why JA4 exists

JA3 hashes the TLS ClientHello fields in order. Change the order, change the hash. Chrome randomizes extension ordering. Go randomizes cipher suites. JA3 falls apart.

JA4 sorts everything before hashing. That Go health check that generated 10,000 different JA3 fingerprints? One consistent JA4: `t13d190900_9dc949149365_08c8ecc63e89`
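
The sort-before-hash idea fits in a few lines. This is a simplified illustration, not the full JA3 or JA4 algorithm: it hashes only a made-up cipher list, and the function names are mine.

```python
import hashlib

def order_sensitive_hash(values):
    """JA3-style: hash the values in the order the client sent them."""
    return hashlib.md5(",".join(values).encode()).hexdigest()

def sorted_hash(values):
    """JA4-style: sort first, then hash and keep the first 12 hex chars."""
    digest = hashlib.sha256(",".join(sorted(values)).encode()).hexdigest()
    return digest[:12]

# Hypothetical cipher suite IDs; the second list is the same set, reordered
ciphers = ["1301", "1302", "1303", "c02b", "c02f"]
reordered = list(reversed(ciphers))

print(order_sensitive_hash(ciphers) == order_sensitive_hash(reordered))  # False
print(sorted_hash(ciphers) == sorted_hash(reordered))                    # True
```

Same set of ciphers, different order: the order-sensitive hash splits into two "clients", the sorted one doesn't.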

The format is readable too:

```
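t13d1516h2_8daaf6152771_02713d6af862
││ ││ │ │  │            │
││ ││ │ │  │            └─ Extension hash
││ ││ │ │  └────────────── Cipher hash
││ ││ │ └───────────────── ALPN (h2)
││ ││ └─────────────────── Extension count (16)
││ │└───────────────────── Cipher count (15)
││ └────────────────────── SNI (d = domain present, i = none)
│└──────────────────────── TLS version (13 = TLS 1.3)
└───────────────────────── Protocol (t = TCP, q = QUIC)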
```
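
Because the first section is fixed-width, it can be pulled apart by position. A minimal sketch, assuming the standard JA4 layout (one protocol character, two-digit TLS version, an SNI flag of `d` or `i`, two-digit cipher and extension counts, two-character ALPN); `JA4Prefix` and `parse_ja4_prefix` are illustrative names, not from any library.

```python
from typing import NamedTuple

class JA4Prefix(NamedTuple):
    protocol: str         # 't' = TCP, 'q' = QUIC
    tls_version: str      # '13' = TLS 1.3
    sni: str              # 'd' = SNI domain present, 'i' = none
    cipher_count: int
    extension_count: int
    alpn: str             # e.g. 'h2', or '00' when no ALPN was sent

def parse_ja4_prefix(fingerprint):
    """Split the first section of a JA4 fingerprint into its fields."""
    prefix = fingerprint.split("_")[0]
    if len(prefix) != 10:
        raise ValueError(f"unexpected JA4 prefix: {prefix!r}")
    return JA4Prefix(
        protocol=prefix[0],
        tls_version=prefix[1:3],
        sni=prefix[3],
        cipher_count=int(prefix[4:6]),
        extension_count=int(prefix[6:8]),
        alpn=prefix[8:10],
    )

print(parse_ja4_prefix("t13d1516h2_8daaf6152771_02713d6af862"))
```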

This fixed our monitoring. Could actually track client types again. Go clients showed up as Go. Chrome showed up as Chrome consistently.

Then I got clever.

## the mistake that cost us $50K

We noticed some fingerprints showing up in suspicious traffic patterns. High request rates, weird timing. Looked like abuse.

I wrote a rule: "If fingerprint = X, rate limit aggressively."

Deployed to production at 2pm on a Tuesday.

By 2:30pm, support was getting complaints. By 3pm, order volume had dropped 30%. By 4pm, I was in the incident channel explaining what went wrong.

**The problem: JA4 fingerprints identify TLS stacks, not individual clients.**

All Chrome 119 browsers on Windows have the same fingerprint. All of them. Every user running that browser/OS combo generates the same JA4.

The suspicious traffic I saw? One bad actor using Chrome 119. My rule caught them, and also every legitimate Chrome 119 user on Windows.

That was 30% of our traffic.

## what JA4 actually tells you

JA4 fingerprints map to:

- Browser + version + OS
- HTTP client library + version + OS
- Any TLS stack implementation

Not to:

- Individual users
- Individual devices
- Individual IP addresses

This seems obvious now, but in the moment, tracking "suspicious fingerprint X" felt like tracking a specific attacker. It wasn't. It was tracking everyone using Chrome 119 on Windows.

## how we fixed our monitoring

After the incident, here's how we actually use JA4:

```python
# For traffic classification and monitoring only.
# `metrics` is our metrics client, initialised elsewhere in the service.

# Known JA4 fingerprints from our infrastructure
KNOWN_CLIENTS = {
    't13d190900_9dc949149365_08c8ecc63e89': 'go-http-client-1.21',
    't13d1516h2_8daaf6152771_02713d6af862': 'chrome-120-macos',
    't13d1415h2_5c6f8a9b3d4e_3f7a8b9c2d1e': 'curl-8.4',
    # ... etc
}

def classify_client(fingerprint):
    """Map a JA4 fingerprint to a known client type for metrics."""
    return KNOWN_CLIENTS.get(fingerprint, 'unknown')

def record_request_metrics(request):
    """Count each request under its client type, endpoint, and status."""
    client_type = classify_client(request.ja4_fingerprint)

    metrics.increment('requests.by_client_type', tags={
        'client': client_type,
        'endpoint': request.path,
        'status': request.status_code,
    })
```

Now we can see in Grafana:

- "go-http-client suddenly went from 100k req/day to 1M" (scaling issue)
- "chrome-119-macos dropped to zero" (browser update pushed)
- "unknown fingerprint at 10k req/sec" (investigate this)

It's a classification dimension, not an identity.

## the a_b_c format actually matters

JA4 uses `a_b_c` format for a reason. The sections are:

- `a`: protocol, TLS version, SNI, counts, ALPN
- `b`: cipher hash
- `c`: extension hash

Why split it? Because you can match on parts.

We had an issue where some service was rotating through different ciphers on every request. Looked like thousands of different clients in our metrics. The full fingerprint kept changing.

But `a` and `c` stayed constant. Started grouping by `ac` instead:

```python
def normalize_fingerprint(fp):
    """Group by protocol+extensions, ignore cipher variations"""
    parts = fp.split('_')
    return f"{parts[0]}_{parts[2]}"  # a_c

# Now cipher rotation doesn't fragment our metrics
normalized = normalize_fingerprint('t13d190900_5d65cb28da5c_02713d6af862')
# Returns: 't13d190900_02713d6af862'
```

Fixed our cardinality explosion problem.

## weird edge cases we hit

**Case 1: corporate proxies**

Some enterprises terminate TLS at their proxy. All internal traffic from that company shows the same fingerprint. One fingerprint, thousands of users.

Can't use it for any kind of per-client logic. Had to maintain a list of known corporate proxy fingerprints and handle them differently.

**Case 2: Windows XP clients**

Yes, in 2024. Ancient TLS 1.0 fingerprint we'd never seen before. Took two weeks to figure out it was legitimate users in a developing country on old machines.

Almost blocked them as "suspicious" before we investigated.

**Case 3: Go library version drift**

Our microservices were on different Go versions. Go 1.20 and Go 1.21 have different TLS fingerprints. This broke our service mesh traffic analysis until we grouped them properly.

## performance characteristics

Parse time: 0.3 microseconds per fingerprint. Not milliseconds. Microseconds.

We process 50M requests/day. Total JA4 overhead: ~15 seconds of CPU time per day.

The implementation is Rust with zero heap allocations. Stack-allocated, no GC. I benchmarked it myself when evaluating whether to deploy it.

For infrastructure at scale, this matters. Adding 0.3μs per request is free in any honest accounting. Adding 1ms per request is a P0 incident.
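
Multiplying the per-fingerprint parse time by the daily request volume gives the daily total. A quick check using only the two numbers quoted above:

```python
PARSE_TIME_US = 0.3            # microseconds per fingerprint
REQUESTS_PER_DAY = 50_000_000

total_seconds = PARSE_TIME_US * REQUESTS_PER_DAY / 1_000_000
print(f"{total_seconds:.0f} s of CPU per day")  # 15 s of CPU per day

# As a share of one core running for 24 hours
share_of_core = total_seconds / 86_400
print(f"{share_of_core:.4%} of one core")  # 0.0174% of one core
```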

## how to actually use this in production

**For traffic visibility**: yes. It's great for understanding client distribution, detecting anomalies, tracking deployments.

**For rate limiting**: only if you're very careful. Rate limit by fingerprint + IP + endpoint + time window. Never fingerprint alone.

**For blocking**: don't. Seriously. You'll block legitimate traffic. Use it as one signal among many if you must, but never the only signal.

**For capacity planning**: yes. Track which client types generate what load. When Chrome updates, you'll see the shift.

**For debugging**: yes. "Why is this service getting weird traffic?" Check the fingerprints. Might be a misconfigured client.
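
What "never fingerprint alone" can look like in code: a fixed-window counter keyed on fingerprint, IP, and endpoint together. This is a sketch, not what we run in production; the class name and limits are made up, and old windows are never evicted here.

```python
import time
from collections import defaultdict

class CompositeRateLimiter:
    """Fixed-window limiter keyed on (fingerprint, ip, endpoint, window)."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.counts = defaultdict(int)  # sketch only: never evicts old windows

    def allow(self, ja4, ip, endpoint):
        """Return True while this exact combination is under its limit."""
        window = int(self.clock() // self.window_seconds)
        key = (ja4, ip, endpoint, window)
        self.counts[key] += 1
        return self.counts[key] <= self.limit

limiter = CompositeRateLimiter(limit=100, window_seconds=60)
fp = "t13d1516h2_8daaf6152771_02713d6af862"
print(limiter.allow(fp, "203.0.113.7", "/api/v1/users"))  # True
```

Two users on the same browser share a fingerprint but not an IP, so one noisy client no longer drags everyone on that TLS stack down with it.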

## the logging strategy

We log fingerprints with every request now. Adds about 45 bytes per log line.

```json
{
  "timestamp": "2025-10-14T15:23:45Z",
  "endpoint": "/api/v1/users",
  "status": 200,
  "duration_ms": 45,
  "ja4": "t13d1516h2_8daaf6152771_02713d6af862",
  "ja4_client": "chrome-120-macos"
}
```

Costs us about 2GB/day extra in log storage. Worth every penny. We've debugged three production issues with this data in the last month.

## when corporate proxies break everything

If you're behind enterprise proxies that terminate TLS, JA4 is useless for per-client anything. You'll see the proxy's fingerprint for thousands of users.

We maintain a separate config for known proxy environments:

```yaml
# corporate-proxy-exceptions.yaml
proxy_fingerprints:
  - fingerprint: "t13d1819h2_abc123def456_789ghi012jkl"
    company: "BigCorp Inc"
    note: "TLS termination at corporate proxy"
    disable_per_client_logic: true
```

Ugly, but it's reality.
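Consuming that exception list is a lookup before any per-client decision. A rough sketch, with the YAML mirrored as a plain dict so it needs no parser; `per_client_logic_allowed` is a hypothetical helper name, not part of our codebase.

```python
# Mirrors corporate-proxy-exceptions.yaml as a dict keyed by fingerprint.
PROXY_FINGERPRINTS = {
    "t13d1819h2_abc123def456_789ghi012jkl": {
        "company": "BigCorp Inc",
        "disable_per_client_logic": True,
    },
}


def per_client_logic_allowed(ja4: str) -> bool:
    """Return False when the fingerprint belongs to a known TLS-terminating proxy."""
    entry = PROXY_FINGERPRINTS.get(ja4)
    return not (entry and entry["disable_per_client_logic"])


print(per_client_logic_allowed("t13d1819h2_abc123def456_789ghi012jkl"))  # False
print(per_client_logic_allowed("t13d1516h2_8daaf6152771_02713d6af862"))  # True
```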

## what's coming that will break this

ECH (Encrypted ClientHello) in TLS 1.3 encrypts the inner ClientHello, hiding SNI and ALPN. The outer hello still carries cipher suites and extensions, so JA4 can still be computed on what's visible — the identifying signal just gets weaker. When ECH adoption hits scale, we'll want a different approach.

Keep an eye on your TLS 1.3 ECH adoption metrics. When it hits 20%+, time to rethink your monitoring strategy.

## the incident postmortem takeaways

I tested the rule in staging with synthetic traffic. Real users matched the "bad" fingerprint and I never saw it coming. Anything fingerprint-shaped needs to be canaried at 1% of real traffic first, with anomaly alerts on drops in known-good fingerprints — "chrome-119-windows traffic dropped 90%" is the kind of signal that would have caught this in five minutes instead of two hours.
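The drop alert is simple to express: compare per-client request counts against a baseline and flag anything that fell by 90% or more. A minimal sketch, assuming you already have both counts as dicts keyed by client label; the function name and the sample numbers are illustrative, not from the incident.

```python
def fingerprint_drops(baseline: dict, current: dict, threshold: float = 0.9) -> list:
    """Return (label, drop_fraction) for known-good clients whose traffic fell by >= threshold."""
    alerts = []
    for label, expected in baseline.items():
        if expected <= 0:
            continue  # nothing to compare against
        seen = current.get(label, 0)
        drop = 1 - seen / expected
        if drop >= threshold:
            alerts.append((label, round(drop, 2)))
    return alerts


baseline = {"chrome-119-windows": 1000, "safari-17-macos": 400}
current = {"chrome-119-windows": 80, "safari-17-macos": 390}
print(fingerprint_drops(baseline, current))
```

A client label missing from `current` counts as a 100% drop, which is the case a bad blocking rule produces.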

Rollback path matters more than the rule itself. We killed ours with a feature flag and still spent 30 minutes draining caches and recovering. If a rule can take down 30% of traffic, the kill switch needs to be a single toggle, tested before the rule ships.

The deeper lesson is that fingerprints are a dimension, not an identity. Group by them, classify with them, slice metrics with them — don't target individuals with them.

## should SREs care about JA4?

Yes, but not for the reasons you might think.

It's not about security. It's about observability. Understanding your traffic composition, detecting anomalies, debugging production issues.

Adding JA4 fingerprints to your logs and metrics gives you another dimension to slice your data. When something weird happens, you can answer "what client type is doing this?" faster.

Just don't rate limit or block based on it alone. That way lies incidents and awkward conversations with your VP of Engineering.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>sre</category>
            <category>tls</category>
            <category>networking</category>
            <category>monitoring</category>
            <category>production-incidents</category>
            <enclosure url="https://harshit.cloud/images/ja4-fingerprinting-network-security/hero.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[JA4's split format saved our metrics cardinality]]></title>
            <link>https://harshit.cloud/til/ja4-fingerprint-bot-detection</link>
            <guid isPermaLink="false">https://harshit.cloud/til/ja4-fingerprint-bot-detection</guid>
            <pubDate>Tue, 14 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Rotating ciphers exploded our TLS client metrics to 50k unique fingerprints. JA4's split format dropped that to under 200 without losing detection.]]></description>
            <content:encoded><![CDATA[
We had a service that rotated TLS ciphers on every connection. Our client classification metrics exploded to 50k unique fingerprints. Prometheus cardinality alert fired.

## the problem

JA3 gives you one hash. When ciphers rotate, you get a new hash:

```
Request 1: 5c4fba4a0f93c6f2a3e52e9c8d4a7b21
Request 2: 3d8a9f2c4e6b1a7c5f9e3d8b2a6c4e1f
Request 3: 7f2e4a8c3b9d1f6e5a2c8d4b7e9f3a1c
```

Each one looks like a different client in your metrics. Cardinality explosion.

## JA4's solution

JA4 splits the fingerprint into three parts:

```
t13d190900_5d65cb28da5c_02713d6af862
│          │            │
a          b            c

a = protocol + TLS version + SNI flag + cipher/extension counts + ALPN
b = cipher hash (changes on rotation)
c = extension hash
```

When the service rotates ciphers, only `b` changes. `a` and `c` stay constant.
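For reference, the `a` part is fixed-width, so it can be unpacked without a library. A small sketch following the JA4 layout as I understand it (protocol, TLS version, SNI flag, two-digit cipher count, two-digit extension count, ALPN); `Ja4A` and `parse_ja4_a` are names I made up for this post.

```python
from typing import NamedTuple


class Ja4A(NamedTuple):
    protocol: str         # "t" = TLS over TCP, "q" = QUIC
    tls_version: str      # "13" = TLS 1.3
    sni: str              # "d" = SNI present, "i" = no SNI
    cipher_count: int
    extension_count: int
    alpn: str             # first and last character of the first ALPN value, "00" if none


def parse_ja4_a(fingerprint: str) -> Ja4A:
    """Split the `a` section of a JA4 string into its fixed-width fields."""
    a = fingerprint.split("_")[0]
    if len(a) != 10:
        raise ValueError(f"unexpected JA4_a length: {a!r}")
    return Ja4A(a[0], a[1:3], a[3], int(a[4:6]), int(a[6:8]), a[8:10])


print(parse_ja4_a("t13d190900_5d65cb28da5c_02713d6af862"))
```

None of these fields depend on which ciphers were offered, only on how many, which is why `a` holds still when the cipher set changes but its size does not.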

## the fix

Instead of grouping by full fingerprint, group by `ac`:

```python
def normalize_fingerprint(fp):
    parts = fp.split('_')
    return f"{parts[0]}_{parts[2]}"  # ignore cipher part

# Before: 50k unique fingerprints
# After: 47 unique fingerprints
```

Cardinality back to normal. Metrics useful again.

## when this matters

If you track client types in metrics and see unexplained cardinality spikes, check if they're rotating ciphers. The `a_b_c` format lets you ignore the changing parts.

Saved us from having to increase our Prometheus retention limits.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>sre</category>
            <category>monitoring</category>
            <category>tls</category>
            <category>observability</category>
            <enclosure url="https://harshit.cloud/til/ja4-fingerprint-bot-detection/opengraph-image" length="0" type="image//til/ja4-fingerprint-bot-detection/opengraph-image"/>
        </item>
        <item>
            <title><![CDATA[My watchlist: from 70s basements to Victorian crime scenes]]></title>
            <link>https://harshit.cloud/blog/favorite-shows-sitcoms-detective</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/favorite-shows-sitcoms-detective</guid>
            <pubDate>Sun, 12 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[What I rewatch when the day's debugging is done — That 70's Show, Arrested Development, HIMYM, and the detective canon from Holmes to Poirot.]]></description>
            <content:encoded><![CDATA[
After a day of debugging, I default to two genres: half-hour sitcoms I've already seen, and detective shows where someone smarter than me is making the deductions. This is the shortlist.

![A dark living room lit only by the purple ambient glow of a flat-screen TV showing a grid of streaming app tiles.](/images/favorite-shows-sitcoms-detective/hero.jpg)

*Fig. 1 — the home screen, also known as the longest-running decision of the night.*

## the sitcoms

### That 70's Show

Teenagers in a basement, roasting each other in a circle. Red's foot, Kelso's stupidity-as-genius. Comfort TV with low cognitive overhead.

### Arrested Development

Rewards rewatching more than any other comedy I've seen. The narrator's deadpan and the planted callbacks are the work; the Bluth family dysfunction is the texture.

### How I Met Your Mother

I have opinions about the finale. Everything before it holds up — Barney's suits, the slap bet, Ted's romantic optimism that the show kept earning back for nine seasons.

## the detective shows

### Sherlock

Cumberbatch's version is the one that reset the genre. The mind palace shots, the rapid deductions, the Watson chemistry. "The game is on" still works.

### Detective Conan

Shinichi Kudo trapped in a kid's body, solving murders that follow him around. Running since 1996 and I'm still on it. The Black Organization arc is the spine.

### High Potential

Newer. A cleaning lady with a high IQ solves crimes the detectives can't. The premise carries it; the writing is finding its register.

I'll watch most things with a tight script and a payoff the writer earned. The two genres above are just where that bar gets cleared most often.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>personal</category>
            <category>entertainment</category>
            <category>tv-shows</category>
            <category>sitcoms</category>
            <category>detective</category>
            <category>anime</category>
            <enclosure url="https://harshit.cloud/images/favorite-shows-sitcoms-detective/hero.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[GitHub Actions vs GitLab CI: a practical comparison]]></title>
            <link>https://harshit.cloud/blog/github-actions-gitlab-ci-comparison</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/github-actions-gitlab-ci-comparison</guid>
            <pubDate>Fri, 20 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[After two years of running both GitHub Actions and GitLab CI across 50 microservices, here is which one I'd reach for and when.]]></description>
            <content:encoded><![CDATA[
Two years, 50 microservices, two CI platforms running side by side. Some repos on GitHub, some on GitLab, same team writing the YAML for both. Here is what stuck after the marketing slides wore off.

![A six-row comparison of GitHub Actions and GitLab CI across syntax, runners, caching, secrets, ecosystem, and pricing at fifty microservices, with the pricing row marked focal.](/images/github-actions-gitlab-ci-comparison/hero.png)

*Fig. 1 — six rows, two YAMLs, one billing model that ended the debate.*

## syntax and configuration

### GitHub Actions

```yaml
name: CI Pipeline
on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npm test
```

The YAML is readable, the marketplace has an action for almost everything, and matrix builds are a single block. The nesting gets verbose once you have reusable workflows, and environment variable precedence is its own small religion.

### GitLab CI

```yaml
stages:
  - test
  - build

test:
  stage: test
  image: node:20
  script:
    - npm ci
    - npm test
  only:
    - main
    - merge_requests
```

Flatter than GitHub's nesting, Docker is a first-class citizen, and the stages concept maps cleanly to how you think about a pipeline. There is no marketplace, so reusable components come from `include:` files and Docker images you assemble yourself.

## performance and speed

### build times

A typical Node.js app on our setup builds in 3 to 5 minutes on GitHub Actions and 4 to 6 minutes on GitLab CI. Close enough that I never picked a platform on speed alone.

### parallelization

Both handle parallel jobs well. GitHub Actions has cleaner syntax for matrix builds:

```yaml
strategy:
  matrix:
    node-version: [18, 20, 22]
    os: [ubuntu-latest, windows-latest]
```

GitLab requires more manual setup for the same result.

## ecosystem and marketplace

### GitHub Actions marketplace

Over 20,000 actions, and the caching one is the example I keep coming back to:

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
```

One block, content-addressed cache keyed off the lockfile. The first time you delete the manual cache logic you wrote for GitLab and replace it with this, you feel it.

### GitLab's approach

GitLab does not have a marketplace. You write scripts or use Docker images:

```yaml
test:
  image: node:20
  cache:
    key: ${CI_COMMIT_REF_SLUG}
    paths:
      - .npm/   # npm ci wipes node_modules, so cache npm's download cache instead
  script:
    - npm ci --cache .npm --prefer-offline
    - npm test
```

More control, but more work.

## docker integration

### GitLab CI wins here

GitLab CI was built with Docker in mind:

```yaml
build:
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker build -t myapp .
    - docker push myapp
```

It just works. No weird permissions issues.

### GitHub Actions

Needs more setup for Docker.

```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myapp:latest
```

Works fine, but requires more marketplace actions.

## secrets management

### GitHub

```yaml
env:
  API_KEY: ${{ secrets.API_KEY }}
```

Simple. Secrets are org/repo scoped. Works well.

### GitLab

```yaml
variables:
  # DEPLOY_API_KEY is a placeholder for a masked variable defined under Settings > CI/CD > Variables
  API_KEY: $DEPLOY_API_KEY
```

More flexible with group-level variables and environments. Better for complex setups.

## cost

GitHub Actions gives private repos 2,000 minutes/month on the free tier, public repos are unlimited, and overage is $0.008/minute. GitLab SaaS gives 400 minutes/month free and charges $10 per 1,000 additional minutes, but self-hosted runners are unlimited. If you can run your own runners, GitLab gets cheaper fast at scale. If you can't, GitHub's free tier outlasts it.
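To see where the hosted-runner lines cross, here is the arithmetic as a sketch. The prices are the ones quoted above and will drift, so treat them as inputs; rounding GitLab overage up to whole 1,000-minute packs is my assumption about how the add-on is sold.

```python
import math

# Prices quoted in the paragraph above.
GITHUB_FREE_MINUTES = 2_000
GITHUB_PER_MINUTE = 0.008
GITLAB_FREE_MINUTES = 400
GITLAB_PER_1000_MINUTES = 10.0


def github_monthly_cost(minutes: int) -> float:
    """Per-minute overage beyond the free tier."""
    return max(0, minutes - GITHUB_FREE_MINUTES) * GITHUB_PER_MINUTE


def gitlab_monthly_cost(minutes: int) -> float:
    """Overage bought in whole 1,000-minute packs (an assumption of this sketch)."""
    extra = max(0, minutes - GITLAB_FREE_MINUTES)
    return math.ceil(extra / 1000) * GITLAB_PER_1000_MINUTES


for minutes in (1_500, 10_000, 50_000):
    print(minutes, round(github_monthly_cost(minutes), 2), round(gitlab_monthly_cost(minutes), 2))
```

On hosted runners GitHub stays cheaper at every volume in this model. The GitLab advantage only shows up once the minutes move to your own runners and the overage term drops to zero.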

## self-hosted runners

### GitHub

```bash
./config.sh --url https://github.com/org/repo --token TOKEN
./run.sh
```

Setup is straightforward. Runners are repo or org-scoped.

### GitLab

```bash
gitlab-runner register
gitlab-runner run
```

More flexible. Can be project, group, or instance-wide. Better for large organizations.

## debugging experience

GitHub Actions has clear, searchable logs, lets you re-run individual jobs, and exposes a debug mode behind two secrets. You can SSH into a runner via a third-party action, but it is not a native feature.

GitLab is the one I reach for when a pipeline is genuinely stuck. The log viewer is good, individual job retries are good, but the real difference is interactive debugging. SSH into the runner mid-job, or open a web terminal from the failed job in your browser, and poke at the filesystem while the build is still alive. The first time you do this on a Docker-in-Docker failure that only repros on CI, you stop missing it everywhere else.

## when to pick which

GitHub Actions wins when you are already on GitHub, want the marketplace, and your pipelines are small to medium. GitLab CI wins when your Docker workflows are non-trivial, your runner fleet is large, your deployment strategies are gnarly, or you need to debug pipelines without a redeploy loop.

## my setup

I use both. GitHub Actions for open-source and frontend, GitLab CI for infrastructure code and the deployments that involve five stages and a manual approval.

## common pitfalls

GitHub Actions has a 6-hour hosted-runner job timeout, a 90-day artifact retention default (configurable up to 90 days for public repos, 400 for private), and tight concurrent-job limits on the free tier. Plan around them or pay.

GitLab's shared runners get sluggish at peak, Docker builds need `docker:dind` as a service container, and CI/CD variable precedence has at least six rules you will need to read twice. The one that bites me most: project-level variables silently override group-level ones with the same name.
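The override that bites is easy to model. This sketch covers only three of the precedence levels (project over group over instance) with made-up variable names and values; the real list has more levels, such as pipeline and job variables, which I am leaving out here.

```python
from collections import ChainMap

# Hypothetical variables at each level.
instance_vars = {"DEPLOY_ENV": "staging", "API_URL": "https://api.instance.example"}
group_vars = {"API_URL": "https://api.group.example", "REGION": "us-east-1"}
project_vars = {"API_URL": "https://api.project.example"}

# ChainMap looks keys up left to right, so the most specific level wins.
effective = ChainMap(project_vars, group_vars, instance_vars)


def shadowed_names(project: dict, group: dict) -> list:
    """Names a project defines that silently hide a group-level value."""
    return sorted(set(project) & set(group))


print(effective["API_URL"])
print(shadowed_names(project_vars, group_vars))
```

Running something like `shadowed_names` against an export of your variables is a cheap audit before you rename anything at the group level.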

## migration tips

### GitHub to GitLab

```yaml
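# GitHub: checkout is an explicit step
- uses: actions/checkout@v4

# GitLab: the runner clones the repo before `script:` runs, so no step is needed.
# Tune the clone with variables instead:
variables:
  GIT_STRATEGY: fetch
  GIT_DEPTH: "20"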
```

### GitLab to GitHub

Most scripts translate directly. The win is collapsing a few of them into marketplace actions you no longer have to maintain.

Starting fresh, pick whichever platform already hosts your code. The integration tax of running CI on the other vendor outweighs every syntax preference in this post. Whichever one you pick, the only investment that pays back is making the pipeline fast. A slow CI is worse than no CI; it just costs more to ignore.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>ci-cd</category>
            <category>github</category>
            <category>gitlab</category>
            <category>devops</category>
            <category>automation</category>
            <enclosure url="https://harshit.cloud/images/github-actions-gitlab-ci-comparison/hero.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Git interactive rebase: clean up your commit history]]></title>
            <link>https://harshit.cloud/til/git-interactive-rebase-magic</link>
            <guid isPermaLink="false">https://harshit.cloud/til/git-interactive-rebase-magic</guid>
            <pubDate>Thu, 19 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Turn five 'fix typo' commits into one before the review notices. git rebase -i, the rules for when not to use it, and the aliases I keep around.]]></description>
            <content:encoded><![CDATA[
Discovered `git rebase -i` today. Turns five "fix typo" commits into one clean commit before the PR review notices.

## the problem

My commit history looked like this:

```
fix typo
WIP
fix typo again
actually fixed it
remove console.log
```

Not exactly professional for a PR review.

## the solution

Interactive rebase lets you edit, squash, and reorder commits:

```bash
# Rebase last 5 commits
git rebase -i HEAD~5

# Or rebase everything since branching from main
git rebase -i main
```

This opens your editor with:

```
pick 1a2b3c4 fix typo
pick 5d6e7f8 WIP
pick 9g0h1i2 fix typo again
pick 3j4k5l6 actually fixed it
pick 7m8n9o0 remove console.log

# Commands:
# p, pick = use commit
# r, reword = use commit, but edit message
# e, edit = use commit, but stop for amending
# s, squash = meld into previous commit
# f, fixup = like squash, but discard commit message
# d, drop = remove commit
```

## my workflow

Change it to:

```
pick 1a2b3c4 fix typo
fixup 5d6e7f8 WIP
fixup 9g0h1i2 fix typo again
fixup 3j4k5l6 actually fixed it
fixup 7m8n9o0 remove console.log
```

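Result: one commit, still titled "fix typo". Use `reword` instead of `pick` on the first line if the message should change too.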

## patterns

**Reword commit messages:**
```
pick 1a2b3c4 fix typo
reword 5d6e7f8 add user authentication
```

**Reorder commits:**
```
pick 3j4k5l6 add tests
pick 1a2b3c4 add feature
```

**Split a commit:**
```
edit 1a2b3c4 huge commit with multiple changes
```

Then:
```bash
git reset HEAD^
git add file1.js
git commit -m "feat: add feature A"
git add file2.js
git commit -m "feat: add feature B"
git rebase --continue
```

## warning

Don't rebase commits that have already been pushed to a shared branch. You'll rewrite history under your team and force everyone else to resolve conflicts they didn't cause.

Safe:
```bash
# Your feature branch, not pushed yet
git rebase -i main
```

Dangerous:
```bash
# Main branch that others use
git checkout main
git rebase -i HEAD~5  # pushed commits — leave these alone
```

## useful aliases

Add to your `~/.gitconfig`:

```ini
[alias]
    # Interactive rebase with the given number of latest commits
    rb = "!f() { git rebase -i HEAD~$1; }; f"
    
    # Rebase on main
    rbm = "!git fetch origin main && git rebase -i origin/main"
```

Now you can do:
```bash
git rb 5        # Rebase last 5 commits
git rbm         # Rebase on main
```
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>git</category>
            <category>version-control</category>
            <category>productivity</category>
            <category>devops</category>
            <enclosure url="https://harshit.cloud/til/git-interactive-rebase-magic/opengraph-image" length="0" type="image//til/git-interactive-rebase-magic/opengraph-image"/>
        </item>
        <item>
            <title><![CDATA[Prometheus and Grafana: from zero to production monitoring]]></title>
            <link>https://harshit.cloud/blog/prometheus-grafana-monitoring-guide</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/prometheus-grafana-monitoring-guide</guid>
            <pubDate>Wed, 18 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[A practical guide to setting up Prometheus and Grafana for production monitoring. No theory, just battle-tested configurations that work.]]></description>
            <content:encoded><![CDATA[
We started shipping monitoring after a string of outages where customers paged us before our own dashboards did. This is the stack we landed on, written like you're standing it up tomorrow.

![Prometheus pulls metrics from exporters running next to each service, stores them in its TSDB, then fans out to Alertmanager for paging and Grafana for dashboards.](/images/prometheus-grafana-monitoring-guide/hero.png)

*Fig. 1 — five boxes do most of the work; the other ten you'll add later are mostly for taste.*

## why Prometheus and Grafana

After trying CloudWatch, Datadog, and New Relic, we landed on Prometheus and Grafana for the same reason most teams do. Prometheus is open source, pull-based, and fits Kubernetes without protest. Grafana puts the dashboards on top, talks to almost anything, and costs nothing. Self-hosted, the bill is server time. The commercial options were running us $500 a month and climbing.

## setup with Docker Compose

For development or small deployments, Docker Compose is perfect:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts   # rule_files paths resolve relative to the config file
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped
    depends_on:
      - prometheus

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
```

Run it:

```bash
docker compose up -d
```

Done. Prometheus is on `:9090`, Grafana on `:3000`.

## prometheus configuration

Create `prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alertmanager configuration (assumes an Alertmanager container reachable as alertmanager:9093; the Compose file above does not start one)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Load rules
rule_files:
  - 'alerts/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node metrics (CPU, memory, disk)
  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          instance: 'production-server-1'

  # Example: monitoring a web app
  - job_name: 'webapp'
    static_configs:
      - targets: ['webapp:8080']
    metrics_path: '/metrics'
```

## instrumenting your application

### Node.js / Express example

```javascript
const express = require('express');
const promClient = require('prom-client');

const app = express();
const register = new promClient.Registry();

// Collect default metrics (CPU, memory, event loop)
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);

// Middleware to track requests
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = {
      method: req.method,
      // Fall back to a fixed label so unmatched paths (404 scans) don't blow up cardinality
      route: req.route?.path || 'unmatched',
      status_code: res.statusCode
    };
    
    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
  });
  
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080, () => {
  console.log('Server running on :8080');
  console.log('Metrics available at /metrics');
});
```

### Python / Flask example

```python
from flask import Flask, request
from prometheus_client import Counter, Histogram, generate_latest
import time

app = Flask(__name__)

# Metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    duration = time.time() - request.start_time
    
    request_count.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()
    
    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint
    ).observe(duration)
    
    return response

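@app.route('/metrics')
def metrics():
    # Flask defaults to text/html; Prometheus expects the text exposition content type
    # (prometheus_client also exports it as CONTENT_TYPE_LATEST)
    return generate_latest(), 200, {'Content-Type': 'text/plain; version=0.0.4; charset=utf-8'}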

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```

## the queries you'll actually use

### CPU usage

```promql
# CPU usage percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per core
rate(node_cpu_seconds_total{mode!="idle"}[5m])
```

### memory usage

```promql
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Available memory
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
```

### disk usage

```promql
# Disk usage percentage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Filter to only important mounts
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
```

### HTTP request rate

```promql
# Requests per second
rate(http_requests_total[5m])

# By status code
sum by (status_code) (rate(http_requests_total[5m]))
```

### HTTP latency

```promql
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Average latency
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
```
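It helps to know what `histogram_quantile` is doing before trusting the number. My understanding is that for classic histograms it finds the bucket containing the target rank and interpolates linearly inside it. The sketch below is a simplified re-implementation under that assumption, taking cumulative `(upper_bound, count)` pairs and 0 < q <= 1. It is not Prometheus code and the sample counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the +Inf bucket
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Same bucket bounds as the Express example: 0.1, 0.5, 1, 2, 5 seconds.
sample = [(0.1, 50), (0.5, 90), (1, 98), (2, 100), (5, 100), (math.inf, 100)]
print(histogram_quantile(0.95, sample))
```

The estimate can never be more precise than the bucket it lands in, so with these bounds a p95 anywhere between 0.5s and 1s is interpolation, not measurement. Pick bucket bounds around the latencies you alert on.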

## grafana dashboards

### import community dashboards

1. Go to Grafana → Dashboards → Import
2. Use these IDs:
   - **1860**: Node Exporter Full
   - **3662**: Prometheus 2.0 Overview
   - **7362**: MySQL Overview

Or create custom dashboards with the queries above.

### auto-provisioning the datasource

Create `grafana/provisioning/datasources/prometheus.yml`:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

## alerting that actually works

Create `alerts/rules.yml`:

```yaml
groups:
  - name: system_alerts
    interval: 30s
    rules:
      # High CPU
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      # High Memory
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      # Disk Space
      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"

      # Service Down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for 1 minute"

      # High Error Rate
      - alert: HighErrorRate
        expr: (sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}%"
```

## production considerations

### data retention

Prometheus stores data locally. Retention is set via command-line flags (which is why the Compose example passes them in `command:`), not in `prometheus.yml`:

```bash
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
```

For longer retention, use:

- **Thanos**: Distributed Prometheus
- **Grafana Mimir**: Multi-tenant, horizontally scalable Prometheus; the actively-maintained successor to Cortex
- **VictoriaMetrics**: Drop-in replacement, more efficient

### high availability

Run multiple Prometheus instances with identical configs. Put a load balancer in front of them for queries, or use Thanos to deduplicate the overlapping series across replicas.

### security

```yaml
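# web.yml, loaded by Prometheus via --web.config.file=/etc/prometheus/web.yml
basic_auth_users:
  prometheus: "<bcrypt hash of the password>"

# In the Grafana datasource provisioning file
basicAuth: true
basicAuthUser: prometheus
secureJsonData:
  basicAuthPassword: supersecret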
```

Better: Put behind VPN or use mutual TLS.

## common gotchas

Five things bite people in production. Scrape interval below 10s is almost never what you want and quietly burns disk. Too many labels balloon cardinality and your queries get slow before you notice. Pick the right metric type: Counter for monotonic, Gauge for point-in-time, Histogram for distributions, and not the other way around. Monitoring without alerting is a screensaver. And the inverse: too many alerts and the on-call learns to ignore them, which is worse than no alerts at all.
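The cardinality point is multiplication, and it is worth doing once on paper. A rough sketch that estimates worst-case series for one metric from the number of distinct values per label; the label counts are invented for illustration.

```python
from math import prod


def series_count(label_values: dict, histogram_buckets: int = 0) -> int:
    """Worst-case series for one metric: the product of distinct values per label.

    A histogram adds one series per bucket (plus +Inf) and the _sum and _count series.
    """
    combos = prod(label_values.values())
    per_combo = (histogram_buckets + 1 + 2) if histogram_buckets else 1
    return combos * per_combo


labels = {"method": 5, "route": 40, "status_code": 10}
print(series_count(labels))                          # counter
print(series_count(labels, histogram_buckets=5))     # histogram with 5 explicit buckets
print(series_count({**labels, "user_id": 10_000}))   # one unbounded label later
```

The first two numbers are manageable. The third is what a single `user_id` label does to the same counter.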

## the golden signals

If you only monitor four things, monitor Google's golden signals: latency, traffic, errors, saturation. Latency is how long requests take. Traffic is request rate. Errors is the rate of failed requests. Saturation is how full the system is on CPU, memory, and disk. Everything else is a refinement on these four.

## what to add next

Add exporters for the data stores you actually use: MySQL, Redis, Postgres, whichever queue you're on. Wire Alertmanager to Slack and PagerDuty so the alerts land somewhere a human reads. Write a one-line runbook link in every alert annotation so the page tells the on-call what to do. Back up the Prometheus data directory; the whole point of long retention is gone if a disk failure wipes it. A dashboard you never look at is graphs heating the data center.

]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>monitoring</category>
            <category>prometheus</category>
            <category>grafana</category>
            <category>observability</category>
            <category>devops</category>
            <enclosure url="https://harshit.cloud/images/prometheus-grafana-monitoring-guide/hero.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Bash parameter expansion: string manipulation without sed and awk]]></title>
            <link>https://harshit.cloud/til/bash-parameter-expansion</link>
            <guid isPermaLink="false">https://harshit.cloud/til/bash-parameter-expansion</guid>
            <pubDate>Tue, 17 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Bash has built-in string manipulation that's faster than spawning sed or awk. The patterns that replaced 80% of my pipeline calls, with a cheat sheet.]]></description>
            <content:encoded><![CDATA[
Learned that Bash has built-in string manipulation that's way faster than calling `sed`, `awk`, or `cut`.

## the old way (slow)

```bash
filename="document.pdf"
name=$(echo "$filename" | sed 's/\.[^.]*$//')  # document
ext=$(echo "$filename" | sed 's/^.*\.//')      # pdf
```

Each `echo | sed` spawns a new process. Slow in loops.

## the new way (fast)

```bash
filename="document.pdf"
name="${filename%.*}"    # document
ext="${filename##*.}"    # pdf
```

No external processes. Pure Bash.

## common patterns

### remove file extension

```bash
file="archive.tar.gz"
echo "${file%.*}"       # archive.tar
echo "${file%%.*}"      # archive (remove all extensions)
```

`%` = remove from end (shortest match)
`%%` = remove from end (longest match)

### get file extension

```bash
file="archive.tar.gz"
echo "${file#*.}"       # tar.gz
echo "${file##*.}"      # gz (last extension only)
```

`#` = remove from start (shortest match)
`##` = remove from start (longest match)
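
The same four operators cover most path handling too. A minimal sketch, assuming a made-up path; the variable names are only for illustration:

```bash
#!/usr/bin/env bash
# Split a path with parameter expansion only (no dirname/basename processes).
# The path below is a hypothetical example.
path="/var/log/nginx/access.log.gz"

dir="${path%/*}"        # strip shortest "/*" from the end   -> /var/log/nginx
base="${path##*/}"      # strip longest "*/" from the start  -> access.log.gz
stem="${base%%.*}"      # strip longest ".*" from the end    -> access
ext="${base#*.}"        # strip shortest "*." from the start -> log.gz

printf '%s\n' "$dir" "$base" "$stem" "$ext"
```

One difference from `dirname`: if the path has no slash at all, `${path%/*}` returns the whole string unchanged rather than `.`, so check for that case if it can happen.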

### string replacement

```bash
path="/home/user/documents/file.txt"

# Replace first occurrence
echo "${path/user/admin}"        # /home/admin/documents/file.txt

# Replace all occurrences
echo "${path//o/0}"              # /h0me/user/d0cuments/file.txt

# Replace at start
echo "${path/#\/home/\/root}"    # /root/user/documents/file.txt

# Replace at end
echo "${path/%.txt/.md}"         # /home/user/documents/file.md
```

### default values

```bash
# Use default if variable is unset or empty
echo "${VAR:-default}"

# Assign default if unset or empty
echo "${VAR:=default}"

# Error if unset or empty
echo "${VAR:?Variable is required}"

# Use alternate value if set and non-empty
echo "${VAR:+value_if_set}"
```
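
These are most useful for function and script arguments. A small sketch, where `backup_name` and its arguments are hypothetical names made up for the example:

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a backup filename from arguments.
#   $1 = source name (required), $2 = suffix (optional, defaults to "bak")
backup_name() {
    local src="${1:?usage: backup_name <source> [suffix]}"
    local suffix="${2:-bak}"
    printf '%s.%s\n' "$src" "$suffix"
}

backup_name notes.txt          # notes.txt.bak
backup_name notes.txt old      # notes.txt.old

# Without the colon, only an *unset* variable triggers the default;
# an empty value is kept as-is.
empty=""
printf '[%s] [%s]\n' "${empty:-fallback}" "${empty-fallback}"   # [fallback] []
```

The colon is the part people forget: `:-` treats empty and unset the same, plain `-` only reacts to unset.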

### substring extraction

```bash
text="Hello World"
echo "${text:0:5}"       # Hello (from pos 0, length 5)
echo "${text:6}"         # World (from pos 6 to end)
echo "${text: -5}"       # World (last 5 chars, note the space!)
echo "${text::-6}"       # Hello (remove last 6 chars)
```

### string length

```bash
text="Hello World"
echo "${#text}"          # 11
```

### case conversion

```bash
text="Hello World"
echo "${text^^}"         # HELLO WORLD (all uppercase; case conversion needs Bash 4+)
echo "${text,,}"         # hello world (all lowercase)
echo "${text^}"          # Hello World (first char uppercase)
echo "${text,}"          # hello World (first char lowercase)
```

## real-world example

Before (slow with multiple processes):

```bash
#!/bin/bash
for file in *.log; do
    name=$(basename "$file" .log)
    date=$(echo "$name" | cut -d'-' -f1)
    gzip "$file"
    mv "${file}.gz" "archive-${date}.gz"
done
```

After (fast, pure Bash):

```bash
#!/bin/bash
for file in *.log; do
    name="${file%.log}"      # Remove .log
    date="${name%%-*}"       # Get everything before first -
    gzip "$file"
    mv "${file}.gz" "archive-${date}.gz"
done
```

## cheat sheet

| Pattern | Effect | Example |
|---------|--------|---------|
| `${var%pattern}` | Remove shortest match from end | `${file%.txt}` |
| `${var%%pattern}` | Remove longest match from end | `${file%%.*}` |
| `${var#pattern}` | Remove shortest match from start | `${file#*/}` |
| `${var##pattern}` | Remove longest match from start | `${file##*/}` |
| `${var/old/new}` | Replace first occurrence | `${text/foo/bar}` |
| `${var//old/new}` | Replace all occurrences | `${text//foo/bar}` |
| `${var:offset:length}` | Substring | `${text:0:5}` |
| `${#var}` | Length | `${#text}` |
| `${var^^}` | Uppercase | `${text^^}` |
| `${var,,}` | Lowercase | `${text,,}` |

## why this matters

In a script processing 10,000 files:

- **With external commands**: ~5 minutes
- **With parameter expansion**: ~10 seconds

That's a 30x speedup.
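
Numbers like these depend on the machine and on what the loop does, so it's worth measuring your own. A rough timing sketch that runs both approaches over the same made-up filenames and checks that they agree; `N` and the filename pattern are assumptions for the example:

```bash
#!/usr/bin/env bash
# Rough timing harness: strip the extension N times with sed, then with ${f%.*}.
# N and the filename pattern are arbitrary choices for illustration.
N=500

with_sed() {
    local i f
    for ((i = 0; i < N; i++)); do
        f="2024-12-${i}-app.log"
        echo "$f" | sed 's/\.[^.]*$//'
    done
}

with_expansion() {
    local i f
    for ((i = 0; i < N; i++)); do
        f="2024-12-${i}-app.log"
        echo "${f%.*}"
    done
}

time sed_out="$(with_sed)"
time exp_out="$(with_expansion)"

# The two approaches should agree line for line.
[ "$sed_out" = "$exp_out" ] && echo "outputs match"
```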

## one gotcha

Be careful with spaces in substring extraction:

```bash
text="Hello"
echo "${text: -3}"    # llo (CORRECT - note the space)
echo "${text:-3}"     # Hello (WRONG - default value syntax!)
```
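
If the space feels too easy to lose, there are spellings that can't be mistaken for the default-value syntax. A short sketch (the variable `n` is just for the example):

```bash
#!/usr/bin/env bash
text="Hello"
n=3

echo "${text:(-3)}"         # llo (parentheses instead of the space)
echo "${text:(-n)}"         # llo (the offset is an arithmetic expression)
echo "${text:${#text}-3}"   # llo (length minus 3, no negative offset at all)
```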

The log-rotation script above went from 4m12s to 8s after the rewrite. Same files, same disk.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>bash</category>
            <category>shell</category>
            <category>linux</category>
            <category>scripting</category>
            <enclosure url="https://harshit.cloud/til/bash-parameter-expansion/opengraph-image" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Five Kubernetes debugging tricks that saved my production]]></title>
            <link>https://harshit.cloud/blog/kubernetes-debugging-tips</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/kubernetes-debugging-tips</guid>
            <pubDate>Sun, 15 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Hard-learned lessons from debugging Kubernetes issues at 3 AM. These tricks will save you hours of frustration.]]></description>
            <content:encoded><![CDATA[
The first time I got paged for a `CrashLoopBackOff` it was 03:47, the pod was on its 14th restart, and `kubectl logs` was returning a perfectly clean six-line startup banner with no error in sight. I sat staring at that output for about twenty minutes before someone on Slack asked, casually, whether I'd tried `--previous`. Reader, I had not. Make that mistake once and you stop making it. The five commands below are what I now reach for before anything else, in roughly the order I reach for them.

![Decision tree mapping five Kubernetes pod symptoms to the kubectl command that diagnoses each, drawn for a 3 a.m. on-call brain.](/images/kubernetes-debugging-tips/hero.png)

*Fig. 1 — the chart I wish someone had taped to the wall before my first on-call shift.*

## the crashloopbackoff that lies to you

The pod has restarted 14 times. You run `kubectl logs` and you get the logs of the *current* container, which is the one that hasn't crashed yet because it just started. The interesting bytes — the panic, the missing env var, the OOM at byte 1 — are in the previous container's stdout, and they're a single flag away.

```bash
kubectl logs <pod-name> --previous
```

Add `-c <container>` if it's a multi-container pod, because the default container is rarely the one that died. I have wasted hours on this exact omission.
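
When you don't know which container died, one option is to ask for all of them. A sketch of a wrapper that does that; `kprev` is a hypothetical name, and the namespace default and `--tail=50` are arbitrary choices:

```bash
#!/usr/bin/env bash
# Hypothetical helper: dump the previous run's logs for every container in a pod.
# Usage: kprev <pod-name> [namespace]
kprev() {
    local pod="$1" ns="${2:-default}" c
    # Container names come from the pod spec (init containers are not included).
    for c in $(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].name}'); do
        echo "=== $c (previous) ==="
        # --previous errors for containers that never restarted; keep going.
        kubectl logs "$pod" -n "$ns" -c "$c" --previous --tail=50 || true
    done
}
```

Run it as `kprev <pod-name> <namespace>`; containers with no previous instance just print an error and the loop moves on.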

## ephemeral debug containers

There used to be a ritual: edit the Dockerfile, add `curl` and `dig` and `tcpdump`, push, wait for CI, redeploy, exec in, debug, then forget to take any of it back out and ship a 900 MB image to prod. As of 1.25 (beta-default since 1.23) you don't have to. `kubectl debug` attaches an ephemeral container to a running pod with whatever image you want — sharing the network namespace, and the PID namespace when you pass `--target` or the pod has `shareProcessNamespace: true` — without touching the original container.

```bash
kubectl debug -it <pod-name> --image=nicolaka/netshoot
```

`netshoot` is the standard kit: `dig`, `curl`, `tcpdump`, `iperf`, `mtr`, the works. The debug process ends when you exit the shell (the ephemeral container entry stays on the pod until the pod is recreated). Your prod image stays the size it was supposed to be.

## ask the scheduler, don't guess

A pod stuck in `Pending` is the scheduler telling you, very politely, that none of your nodes will have it. You can stare at node taints all day or you can read the events for the pod itself, which spell out the reason in English.

```bash
kubectl get events --field-selector involvedObject.name=<pod-name>
```

`0/12 nodes are available: 8 Insufficient memory, 4 node(s) had untolerated taint`. That's the answer. The scheduler is the most articulate component in the cluster as long as you ask it directly.
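
On a busy pod the event list gets long, so it can help to keep only warnings and sort by time. A sketch, assuming a hypothetical wrapper name `kwhy` and a namespace that defaults to `default`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: show only Warning events for one pod, oldest first.
# Usage: kwhy <pod-name> [namespace]
kwhy() {
    local pod="$1" ns="${2:-default}"
    kubectl get events -n "$ns" \
        --field-selector "involvedObject.name=${pod},type=Warning" \
        --sort-by=.lastTimestamp
}
```

The most recent warning ends up on the last line, which is usually the one you want.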

## a throwaway pod for network policy work

Network policies are the part of Kubernetes that fail silently. The packet just doesn't arrive, and there's no log line that says *I dropped your SYN because of policy `default-deny-egress` in namespace `payments`*. The cheapest way to figure out what's reachable from where is to land a pod in the namespace you care about and try.

```bash
kubectl run test-pod --rm -it -n <namespace> --image=nicolaka/netshoot -- /bin/bash
```

`--rm` cleans up when you exit, which matters because otherwise you will accumulate seven `test-pod-2` pods in `default` and one of them will eventually become the reason a node fills up. From inside, `curl` and `nc` your way through the policy until something connects.
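
Once inside, a tiny probe function saves retyping `nc` flags. This is a sketch that assumes the image's bash supports `/dev/tcp` and that `timeout` is available; `probe` is a made-up name:

```bash
#!/usr/bin/env bash
# Hypothetical helper for use inside the throwaway pod.
# Prints "<host>:<port> open" or "<host>:<port> unreachable" for one TCP connect attempt.
probe() {
    local host="$1" port="$2"
    # /dev/tcp is a bash feature; timeout caps the wait when packets are silently dropped.
    if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${host}:${port} open"
    else
        echo "${host}:${port} unreachable"
    fi
}

# Example call (hypothetical service name): probe payments-db.payments.svc 5432
```

"Unreachable" covers both a refused connection and a dropped one. A policy drop usually shows up as the full two-second wait, while a closed port fails immediately.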

## sort the noisy neighbours

When the cluster feels slow and nobody knows why, the answer is usually one workload in one namespace eating more than its share. `kubectl top` with a sort flag is the single fastest way to find them.

```bash
kubectl top pods --all-namespaces --sort-by=memory
```

Swap `memory` for `cpu` depending on what's burning. Requires `metrics-server` to be installed, which it almost always is, and which is worth installing the day you bring up a cluster if it isn't.

## the one I forget I have

`describe` is so obvious nobody writes about it, and so dense that nobody reads its full output. It includes the events, the resource limits, the volume mounts, the readiness probe definition, the last termination reason, and the QoS class — all the things you were about to run five separate commands to find.

```bash
kubectl describe pod <pod-name> | less
```

Pipe to `less` because the output is longer than your terminal and the events at the bottom are usually where the answer is. Read from the bottom up if you're in a hurry.

---

The pattern across all five is the same. Kubernetes is unusually good at telling you what went wrong, in plain prose, in a place you have to know to look. The flag is always `--previous`. The answer is always in `events`. The container is always the wrong one by default. Memorise the five commands above and most pages stop being mysteries and start being typing exercises.

Next time the alert fires at 03:47, the first thing you type is `kubectl logs <pod> --previous -c <container>`. The second thing is `kubectl describe pod <pod> | less`. If the answer isn't in those two outputs, you actually have a problem.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>kubernetes</category>
            <category>devops</category>
            <category>debugging</category>
            <category>production</category>
            <enclosure url="https://harshit.cloud/images/kubernetes-debugging-tips/hero.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Docker volume debugging: finding where your data actually lives]]></title>
            <link>https://harshit.cloud/til/docker-volume-inspect-trick</link>
            <guid isPermaLink="false">https://harshit.cloud/til/docker-volume-inspect-trick</guid>
            <pubDate>Sat, 14 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Volume mounts that look right but won't persist data. The five-command inspection sequence that always tells me which mount is actually being read.]]></description>
            <content:encoded><![CDATA[
Spent 2 hours debugging why data wasn't persisting. Turns out, understanding Docker volumes is the actual job.

## the problem

I had a container with a volume mount, but couldn't figure out where the data was actually stored on the host:

```bash
docker run -v mydata:/data myapp
```

Where is `mydata`? What's inside it?

## the solution

### find volume location

```bash
# List all volumes
docker volume ls

# Inspect a specific volume
docker volume inspect mydata
```

Output:
```json
[
    {
        "CreatedAt": "2024-12-14T10:30:00Z",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/mydata/_data",
        "Name": "mydata",
        "Options": {},
        "Scope": "local"
    }
]
```

The `Mountpoint` tells you exactly where the data lives.
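
When all you want is the path, `--format` pulls out the single field so it can be used in scripts. A small sketch; `dvol_path` is a hypothetical wrapper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print only the host mountpoint of a named volume.
# Usage: dvol_path mydata
dvol_path() {
    docker volume inspect --format '{{ .Mountpoint }}' "$1"
}
```

On a Linux host, something like `sudo ls -la "$(dvol_path mydata)"` then lists the volume directly.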

### view volume contents (the trick)

You usually can't just `cd` to that path: on Linux it needs root, and on Docker Desktop it lives inside the VM rather than on your host. Instead, use a temporary container:

```bash
docker run --rm -v mydata:/data alpine ls -la /data
```

Or for interactive browsing:

```bash
docker run --rm -it -v mydata:/data alpine sh
cd /data
ls -la
```

## a better named-volume workflow

Create a simple alias:

```bash
# Add to ~/.bashrc or ~/.zshrc
alias dvol='docker run --rm -it -v'
```

Usage:
```bash
# Browse any volume interactively
dvol mydata:/data alpine sh

# Quick listing
dvol mydata:/data alpine ls -la /data

# Check file contents
dvol mydata:/data alpine cat /data/config.json

# Copy file out of volume
docker run --rm -v mydata:/data -v "$(pwd)":/backup alpine cp /data/important.txt /backup/
```

## debugging bind mounts

For bind mounts (host path to container):

```bash
docker run -v /host/path:/container/path myapp
```

To see what the container actually sees:

```bash
docker exec -it container_name ls -la /container/path
```

## common volume issues

### volume not mounting

```bash
# Check if volume exists
docker volume ls | grep mydata

# Create it if missing
docker volume create mydata
```

### wrong permissions

```bash
# Check ownership in volume
docker run --rm -v mydata:/data alpine ls -ln /data

# Fix permissions (if needed)
docker run --rm -v mydata:/data alpine chown -R 1000:1000 /data
```

### dangling volumes

```bash
# List dangling volumes (not used by any container)
docker volume ls -f dangling=true

# Remove them (this deletes data, so make sure nothing important is dangling)
# On Docker 23.0+ this only removes anonymous volumes; add --all to include named ones
docker volume prune
```

### volume vs bind mount confusion

```bash
# Named volume (managed by Docker)
-v mydata:/data

# Bind mount (you manage the host path)
-v /host/path:/data
-v "$(pwd)":/data

# Anonymous volume (Docker creates and manages)
-v /data
```

## patterns

### 1. backup a volume

```bash
# Backup volume to tar file
docker run --rm \
  -v mydata:/data \
  -v "$(pwd)":/backup \
  alpine tar czf /backup/mydata-backup.tar.gz -C /data .
```
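
A backup you haven't read back is a guess. A minimal check, assuming the archive name from the command above; `check_backup` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hypothetical helper: confirm a backup tarball can be read end to end.
# Usage: check_backup mydata-backup.tar.gz
check_backup() {
    local archive="$1" listing
    # tar tzf lists entries without extracting; a corrupt or truncated archive makes it fail.
    if ! listing="$(tar tzf "$archive" 2>/dev/null)"; then
        echo "unreadable: $archive"
        return 1
    fi
    echo "ok: $archive"
}
```

It only proves the archive is readable, not that the contents are what you expected, so a quick `tar tzf mydata-backup.tar.gz | head` is still worth a look.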

### 2. restore a volume

```bash
# Restore from backup
docker run --rm \
  -v mydata:/data \
  -v "$(pwd)":/backup \
  alpine tar xzf /backup/mydata-backup.tar.gz -C /data
```

### 3. copy volume to another

```bash
# Copy all data from vol1 to vol2
docker run --rm \
  -v vol1:/source:ro \
  -v vol2:/dest \
  alpine sh -c "cp -av /source/. /dest/"
```

### 4. clone a volume

```bash
# Create new volume as copy of existing
docker volume create vol2
docker run --rm \
  -v vol1:/source:ro \
  -v vol2:/dest \
  alpine cp -av /source/. /dest/
```

## real-world example

I was debugging a database that wasn't persisting data:

```bash
# Check if volume exists
docker volume inspect postgres_data
# Error: No such volume

# the docker-compose.yml had a typo
# it said: postgress_data (3 s's)
# should be: postgres_data (2 s's)

# found all volumes
docker volume ls
# found: postgress_data (the typo)

# Docker can't rename a volume, so copy the data into a correctly named one
docker volume create postgres_data
docker run --rm \
  -v postgress_data:/source \
  -v postgres_data:/dest \
  alpine cp -av /source/. /dest/

# Removed the typo volume
docker volume rm postgress_data
```

## useful one-liners

```bash
# Find which containers use a volume
docker ps -a --filter volume=mydata

# Remove all stopped containers' volumes
docker container prune -f && docker volume prune -f

# List volumes with size
docker system df -v

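# Find volumes of 1GB or more (sizes are printed human-readable, so match on the unit)
docker system df -v | awk '/^VOLUME NAME/ {v=1; next} /^$/ {v=0} v && $3 ~ /(GB|TB)$/'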
```

## the gotcha I learned

Docker Compose creates volumes with prefixes:

```yaml
# docker-compose.yml
volumes:
  mydata:
```

Creates volume named: `projectname_mydata`

To use a specific name:

```yaml
volumes:
  mydata:
    name: mydata
```
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>docker</category>
            <category>debugging</category>
            <category>containers</category>
            <category>devops</category>
            <enclosure url="https://harshit.cloud/til/docker-volume-inspect-trick/opengraph-image" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[kubectl JSONPath: extract exactly what you need]]></title>
            <link>https://harshit.cloud/til/kubectl-jsonpath-queries</link>
            <guid isPermaLink="false">https://harshit.cloud/til/kubectl-jsonpath-queries</guid>
            <pubDate>Thu, 12 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[kubectl can return exactly the field you want without a grep-awk-sed pipeline. JSONPath queries that replaced my entire stash of one-liners.]]></description>
            <content:encoded><![CDATA[
Stop piping kubectl output to `grep`, `awk`, and `sed`. JSONPath can get you exactly what you need in one command.

## the basic pattern

```bash
kubectl get <resource> -o jsonpath='{<jsonpath-expression>}'
```

## simple examples

### get pod IPs

Instead of:
```bash
kubectl get pods -o wide | awk '{print $6}'
```

Do:
```bash
kubectl get pods -o jsonpath='{.items[*].status.podIP}'
```

### get pod names only

```bash
kubectl get pods -o jsonpath='{.items[*].metadata.name}'
```

### get pod name + IP

```bash
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'
```

Output:
```
nginx-abc123    10.244.1.5
redis-xyz789    10.244.1.6
```

## real-world use cases

### 1. find all container images

```bash
kubectl get pods -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u
```

### 2. get pods not running

```bash
kubectl get pods -o jsonpath='{range .items[?(@.status.phase!="Running")]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
```

### 3. find pods using most memory

```bash
# kubectl top has no -o flag, so JSONPath doesn't apply here; use its built-in sort
kubectl top pods --sort-by=memory
```

### 4. get all node capacities

```bash
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.cpu}{" CPU\t"}{.status.capacity.memory}{" RAM\n"}{end}'
```

### 5. find secrets mounted by pods

```bash
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[?(@.secret)].secret.secretName}{"\n"}{end}'
```

### 6. get all services and their type

```bash
kubectl get svc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.type}{"\n"}{end}'
```

## JSONPath syntax cheat sheet

| Pattern | Description | Example |
|---------|-------------|---------|
| `.items[*]` | All items | Get all pods |
| `.items[0]` | First item | Get first pod |
| `.items[0:3]` | First 3 items | Get first 3 pods |
| `.items[-1]` | Last item | Get last pod |
| `.items[?(@.field=="value")]` | Filter | Pods where phase=Running |
| `{range .items[*]}...{end}` | Loop | Iterate over items |
| `{"\n"}` | Newline | Format output |
| `{"\t"}` | Tab | Format output |

## advanced filtering

### pods with specific label

```bash
kubectl get pods -l app=nginx -o jsonpath='{.items[*].metadata.name}'
```

### pods in Running state

```bash
kubectl get pods -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}'
```

### containers in Waiting state

```bash
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[?(@.state.waiting)].name}{"\n"}{end}'
```

### pods with restart count > 0

```bash
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | awk '$2 > 0'
```
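
The command above only looks at the first container. For multi-container pods, one option is to print every container's count and add them up in `awk`. A sketch where `sum_restarts` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "name<TAB>count count ..." lines on stdin and
# prints "name<TAB>total" for pods whose containers restarted at least once.
sum_restarts() {
    awk -F'\t' '{
        n = split($2, counts, " ")
        total = 0
        for (i = 1; i <= n; i++) total += counts[i]
        if (total > 0) printf "%s\t%d\n", $1, total
    }'
}

# Usage (needs a cluster):
# kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}' | sum_restarts
```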

## useful aliases

Add to your `~/.bashrc` or `~/.zshrc`:

```bash
# Get pod IPs
alias kip='kubectl get pods -o jsonpath='\''{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'\'''

# Get images
alias kimages='kubectl get pods -o jsonpath='\''{.items[*].spec.containers[*].image}'\'' | tr " " "\n" | sort -u'

# Get pod with most restarts
alias krestart='kubectl get pods -o jsonpath='\''{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'\'' | sort -k2 -n -r | head -1'

# Get pods that are not in the Running phase
alias knotready='kubectl get pods -o jsonpath='\''{range .items[?(@.status.phase!="Running")]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'\'''
```

## custom columns (even better)

Sometimes custom columns are cleaner than JSONPath:

```bash
# Pod name, phase, and IP
kubectl get pods -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,IP:.status.podIP

# Node name, CPU, and memory
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory

# Services and their ClusterIP
kubectl get svc -o custom-columns=NAME:.metadata.name,TYPE:.spec.type,CLUSTER-IP:.spec.clusterIP
```

## common patterns I use daily

### 1. quick debug — get all pod info

```bash
kubectl get pod nginx-abc123 -o jsonpath='{range .spec.containers[*]}Name: {.name}{"\n"}Image: {.image}{"\n"}Ports: {.ports[*].containerPort}{"\n\n"}{end}'
```

### 2. get all environment variables

```bash
kubectl get pod nginx-abc123 -o jsonpath='{range .spec.containers[*].env[*]}{.name}={.value}{"\n"}{end}'
```

### 3. find pods on a specific node

```bash
kubectl get pods --all-namespaces -o jsonpath='{range .items[?(@.spec.nodeName=="node-1")]}{.metadata.name}{"\t"}{.metadata.namespace}{"\n"}{end}'
```

### 4. get ConfigMaps used by pods

```bash
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[?(@.configMap)].configMap.name}{"\n"}{end}'
```

### 5. network policies applied to pods

```bash
kubectl get netpol -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podSelector.matchLabels}{"\n"}{end}'
```

## the live-updates move

Combine JSONPath with watch for live updates:

```bash
watch -n 2 'kubectl get pods -o jsonpath='\''{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'\'''
```

## debugging JSONPath

If your JSONPath isn't working, test it step by step:

```bash
# Get full JSON first
kubectl get pod nginx-abc123 -o json | jq '.'

# Then build your JSONPath incrementally
kubectl get pod nginx-abc123 -o jsonpath='{.metadata}'
kubectl get pod nginx-abc123 -o jsonpath='{.metadata.name}'
kubectl get pod nginx-abc123 -o jsonpath='{.status}'
kubectl get pod nginx-abc123 -o jsonpath='{.status.phase}'
```

## the gotcha

JSONPath in kubectl has some quirks:

1. **Filters must use `@`**: `.items[?(@.field=="value")]` not `.items[?(.field=="value")]`
2. **Arrays need `[*]`**: `.items[*]` not `.items[]`
3. **Quotes matter**: use single quotes outside, double inside: `'{.items[?(@.name=="value")]}'`
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>kubernetes</category>
            <category>kubectl</category>
            <category>jsonpath</category>
            <category>devops</category>
            <enclosure url="https://harshit.cloud/til/kubectl-jsonpath-queries/opengraph-image" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Docker security: stop running everything as root]]></title>
            <link>https://harshit.cloud/blog/docker-security-hardening</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/docker-security-hardening</guid>
            <pubDate>Tue, 10 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Your containers are probably insecure. Here's how I learned to harden Docker containers the hard way, and the security mistakes that almost cost us.]]></description>
            <content:encoded><![CDATA[
The audit came back with 47 critical issues, 129 highs, 156 containers running as root, and 300-plus unpatched CVEs. We had been shipping the same Node Dockerfile for two years. It was the one from the official `node` image's README, with our app dropped on top. Nobody had ever questioned it. The auditor wrote one line in the summary: *one RCE in any of these and you own the cluster.*

![Side-by-side comparison of a Docker container running as root with permissive defaults versus a hardened container with a non-root user, dropped capabilities, a read-only filesystem, a seccomp profile, and a distroless base image.](/images/docker-security-hardening/hero.png)

*Fig. 1 — same app code, two trust postures; the audit numbers do most of the arguing.*

## the report

Here's what landed in my inbox on a Tuesday morning, paraphrased into the format the scanner emits:

```
Critical Issues: 47
High Severity: 129
Running as root: 156 containers
Unpatched CVEs: 300+
```

The 156 number was the one that hurt. We didn't have 156 services. We had about thirty. The rest were sidecars, jobs, debug images, one-off tools that someone had built three years ago and never thought about again. Each one ran as UID 0 because the base image did, and nobody had bothered to override it.

## running as root, by accident

This is the Dockerfile we had. Maybe yours too.

```dockerfile
FROM node:20

WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .

EXPOSE 3000
CMD ["node", "server.js"]
```

The `node` image runs as root by default. There's a `node` user already created inside it, but you have to opt in with `USER node`. Almost nobody does. Six years of Stack Overflow answers, including the accepted ones, omit it. The fix is one line, and the version that creates a fresh user is a habit worth keeping for images that don't ship one.

```dockerfile
FROM node:20-slim

# Create non-root user
RUN groupadd -r nodejs && useradd -r -g nodejs nodejs

WORKDIR /app

# Install dependencies as root
COPY package*.json ./
RUN npm ci --production

# Copy application files
COPY --chown=nodejs:nodejs . .

# Switch to non-root user
USER nodejs

EXPOSE 3000
CMD ["node", "server.js"]
```

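One caveat on `--chown`: it makes the runtime user the owner of the app files, which means a compromised process can still `chmod` and rewrite them. If the app never writes to its own directory, drop the flag so the files stay owned by root and read-only to `nodejs`. Then an attacker who pops the app can read what it can read and write to what it can write to, but can't go and rewrite `server.js` to add a backdoor. That's a real piece of mitigation that costs you nothing.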
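
With dozens of images it is easy to lose track of which Dockerfiles ever switch user. A rough audit sketch; `final_user` is a hypothetical helper, and it only reads the Dockerfile text, so it knows nothing about a `USER` set by the base image or about which stage is the final one:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the last USER instruction in a Dockerfile,
# or "root (no USER instruction)" when the file never sets one.
# Usage: final_user path/to/Dockerfile
final_user() {
    awk 'toupper($1) == "USER" { u = $2 }
         END { print (u == "" ? "root (no USER instruction)" : u) }' "$1"
}

# Example sweep over a repo:
# find . -name 'Dockerfile*' | while read -r f; do printf '%s\t%s\n' "$f" "$(final_user "$f")"; done
```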

## images that arrived with everything

Our prod image was 1.2 GB. The base was `ubuntu:latest`, then a kitchen-sink `apt-get install` of `curl`, `wget`, `git`, `build-essential`, Python, Node, and npm. The build engineer who wrote it had reasons for each one at some point. None of those reasons were still true in production.

```dockerfile
FROM ubuntu:latest

RUN apt-get update && apt-get install -y \
    curl \
    wget \
    git \
    build-essential \
    python3 \
    nodejs \
    npm
    
# ... rest of the Dockerfile
```

Every binary in there is a CVE waiting to be reported. The replacement is the same app on `node:20-alpine`, with `dumb-init` for signal handling and nothing else.

```dockerfile
FROM node:20-alpine

# Only install what you need
RUN apk add --no-cache dumb-init

WORKDIR /app

COPY package*.json ./
RUN npm ci --production --ignore-scripts

COPY . .

USER node

ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "server.js"]
```

The image dropped to 150 MB. Vulnerability count fell by 97% the morning we shipped it, mostly because we stopped shipping `git` and a C compiler in production. Build time is 60% shorter. None of that required clever engineering. We deleted things.

## secrets baked into layers

The first time I saw this in our codebase I assumed it was a stub:

```dockerfile
FROM node:20

# DON'T DO THIS!
ENV DB_PASSWORD=supersecret123
ENV API_KEY=abc123xyz

COPY . .
CMD ["node", "server.js"]
```

It wasn't. It had been deployed for eight months. The defense everyone offers is "the registry is private". The problem is that `ENV` lives in the image layer history forever, and `docker history` and `docker save` will hand it to anyone who pulls the image once.

```bash
docker history --no-trunc myapp:latest
docker save myapp:latest | tar -xO | grep -a "API_KEY"
```

BuildKit secrets fix the build-time half. The secret mounts during the `RUN` step and never lands in a layer.

```dockerfile
# syntax=docker/dockerfile:1

FROM node:20-alpine

WORKDIR /app

# Use build-time secrets
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci --production

COPY . .
CMD ["node", "server.js"]
```

Build with:

```bash
docker build --secret id=npmrc,src=$HOME/.npmrc -t myapp .
```

For runtime secrets, the right answer depends on where you're running. On a single host:

```bash
docker run -e DB_PASSWORD="$(cat /path/to/secret)" myapp
```

On Swarm or Kubernetes, use the platform's secret store. Anything else is a layer of `chmod 600` and hope.

## capabilities you didn't ask for

A vanilla container gets fourteen Linux capabilities by default, including `CAP_NET_RAW`, which lets the process craft raw packets. Most apps need at most `NET_BIND_SERVICE`, and only to bind a port below 1024. Drop the lot, add back what you actually use.

```bash
# Drop all capabilities, add only what's needed
docker run --rm \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --security-opt=no-new-privileges:true \
  myapp
```

The Compose form, which is what most teams actually deploy:

```yaml
services:
  webapp:
    image: myapp
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    security_opt:
      - no-new-privileges:true
```

`no-new-privileges:true` is the sleeper line. It blocks setuid binaries from elevating during a process exec, closing the residual escalation path that capability drops leave open if a setuid binary is still inside the image.
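
To see whether an image still carries any setuid or setgid binaries, `find` can list them. A sketch with a hypothetical helper name; run it inside the container or against an unpacked root filesystem:

```bash
#!/usr/bin/env bash
# Hypothetical helper: list setuid/setgid files under a directory (default: /),
# staying on one filesystem so /proc and mounted volumes are skipped.
# Usage: list_setuid [dir]
list_setuid() {
    find "${1:-/}" -xdev -type f \( -perm -4000 -o -perm -2000 \) 2>/dev/null
}

# Against an image (assumes the image ships sh and find):
# docker run --rm --entrypoint sh myapp -c 'find / -xdev -type f -perm -4000'
```

An empty list means there is nothing for `no-new-privileges` to block, which is the state you want; the flag is still cheap insurance for the day a base-image update adds one.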

## a writable root for no reason

Most apps write to `/tmp`, maybe a logging volume, and nothing else. Their root filesystem can be read-only and the app will never notice. Attackers will.

```bash
docker run --rm \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=100m \
  myapp
```

In Compose:

```yaml
services:
  webapp:
    image: myapp
    read_only: true
    tmpfs:
      - /tmp:rw,noexec,nosuid,size=100m
      - /var/run:rw,noexec,nosuid,size=10m
```

The first time you ship this you'll discover one library that writes a cache file to `/var/cache` at startup. Add a tmpfs for it and move on. After that the surprises stop.

## base images that age

The `:latest` tag pins nothing. The pin you actually want is a digest, but a version-with-distro tag (`node:20.10.0-alpine3.19`) is the working compromise. Then automate the bump.

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "docker"
    directory: "/"
    schedule:
      interval: "weekly"
```

And scan the image. It doesn't matter much which scanner you pick (the lists overlap heavily), but pick one and run it on every build.

```bash
# Using Trivy
trivy image myapp:latest

# Using Snyk
snyk container test myapp:latest

# Using Docker Scout
docker scout cves myapp:latest
```

Wire it into CI as a hard gate on critical and high:

```yaml
# .github/workflows/security.yml
- name: Scan image
  run: |
    trivy image --exit-code 1 --severity CRITICAL,HIGH myapp:latest
```

Yes, you'll have weeks where the gate fires on a CVE you can't fix because there's no patched base image yet. That's a feature. It tells you which deploys are knowingly carrying risk.

## the docker socket

If a container has `/var/run/docker.sock` mounted, it can start a sibling container with `--privileged --pid=host -v /:/host` and own the host. There is no way to make this safe.

```yaml
# NEVER DO THIS
services:
  webapp:
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # DON'T!
```

It still shows up in build agents, log shippers, and "monitoring" sidecars from vendors who should know better. If you genuinely need to build images from inside a container, Kaniko does that without the socket. If you need to inspect other containers, the orchestrator's API is the supported path.

## resource limits as a security control

Resource limits feel like a performance concern, but the most common DoS we saw on our cluster was a container OOM-killing its node by allocating until the kernel reaper showed up. Limits don't prevent that, they contain it.

```bash
docker run --rm \
  --memory="512m" \
  --memory-swap="512m" \
  --cpus="0.5" \
  --pids-limit=100 \
  myapp
```

The Compose version:

```yaml
services:
  webapp:
    image: myapp
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M
    pids_limit: 100
```

`--pids-limit` is the underrated one. A fork bomb in your container will still take the container down, but it won't take its neighbors with it.

## the hardened dockerfile, end to end

Putting it all together. This is roughly what every Node service in our prod cluster now looks like:

```dockerfile
# syntax=docker/dockerfile:1

# Use specific version, not 'latest'
FROM node:20.10.0-alpine3.19 AS builder

# Install build dependencies
RUN apk add --no-cache dumb-init

WORKDIR /app

# Copy dependency files
COPY package*.json ./

# Install dependencies with audit
RUN npm ci --omit=dev --ignore-scripts && \
    npm audit --audit-level=moderate

# Production stage
FROM node:20.10.0-alpine3.19

# Install only runtime dependencies
RUN apk add --no-cache dumb-init

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S -u 1001 -G nodejs nodejs

WORKDIR /app

# Copy built artifacts from builder
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --chown=nodejs:nodejs . .

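# Hide leftovers from the runtime filesystem. This does not shrink the image
# (the COPY layer above still holds them); .dockerignore is what keeps them out.
RUN rm -rf .git .gitignore .dockerignore README.md tests/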

# Switch to non-root user
USER nodejs

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD node healthcheck.js || exit 1

# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]

# Run application
CMD ["node", "server.js"]

# Metadata
LABEL org.opencontainers.image.source="https://github.com/myorg/myapp" \
      org.opencontainers.image.version="1.0.0" \
      org.opencontainers.image.vendor="My Company"
```

And the matching Compose, with the runtime hardening that the Dockerfile can't express:

```yaml
version: '3.8'

services:
  webapp:
    image: myapp:1.0.0
    container_name: webapp
    
    # Security options
    user: "1001:1001"
    read_only: true
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    security_opt:
      - no-new-privileges:true
      - seccomp:./seccomp.json
    
    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 1G
        reservations:
          cpus: '0.5'
          memory: 512M
    pids_limit: 200
    
    # Writable tmpfs for temp files
    tmpfs:
      - /tmp:rw,noexec,nosuid,size=100m
    
    # Network isolation
    networks:
      - internal
    
    # Health check
    healthcheck:
      test: ["CMD", "node", "healthcheck.js"]
      interval: 30s
      timeout: 3s
      retries: 3
      start_period: 10s
    
    # Logging
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
    
    # Restart policy
    restart: unless-stopped

networks:
  internal:
    driver: bridge
    internal: true
```

## the tools I actually used

For image scanning we landed on Trivy in CI and Docker Scout for local checks. Snyk has a nicer UI but the per-developer license adds up; Clair is what you reach for when nothing can leave the network. For runtime, Falco watches for the syscall patterns nobody should ever see in production (a shell spawned inside a webserver container is the canonical one). Open Policy Agent and its Kubernetes-native cousins, Gatekeeper and Kyverno, are where you encode the rules from this post so the next person can't push a Dockerfile that violates them. The policy engine is the part that makes the work stick.
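To make "encode the rules" concrete, here is a rough sketch of two of them as a plain Python check: no unpinned base images, and no final stage that runs as root. It is a stand-in for illustration, not a Gatekeeper or Kyverno policy, and `lint_dockerfile` is a hypothetical name. It is deliberately conservative: a stage that inherits a non-root user from an earlier stage still gets flagged unless it sets `USER` itself.

```python
def lint_dockerfile(text: str) -> list[str]:
    """Return rule violations found in a Dockerfile's text."""
    problems: list[str] = []
    stages: set[str] = set()
    final_user = None

    for raw in text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].startswith("#"):
            continue
        instruction = parts[0].upper()

        if instruction == "FROM":
            # Skip flags such as --platform=linux/amd64.
            args = [p for p in parts[1:] if not p.startswith("--")]
            if not args:
                continue
            image = args[0]
            is_internal = image == "scratch" or image in stages
            if len(args) >= 3 and args[1].upper() == "AS":
                stages.add(args[2])
            final_user = None  # every stage starts from its base image's user
            # Look for the tag in the last path segment only, so a registry
            # port (registry:5000/app) is not mistaken for a tag.
            last_segment = image.rsplit("/", 1)[-1]
            tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
            if not is_internal and "@" not in image and tag in ("", "latest"):
                problems.append(f"unpinned base image: {image}")
        elif instruction == "USER" and len(parts) > 1:
            final_user = parts[1]

    if final_user is None or final_user.split(":")[0] in ("root", "0"):
        problems.append("final stage runs as root")
    return problems

print(lint_dockerfile("FROM node:latest\nCOPY . .\n"))
print(lint_dockerfile("FROM node:20.10.0-alpine3.19\nUSER nodejs\n"))
```

The mounted-socket rule lives in Compose or the pod spec rather than the Dockerfile, so it needs its own check against that file.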

## the receipts

Six months after the audit, the same scanner came back with vulnerabilities down 94%, image sizes down 70%, root containers at zero (from 156), and a compliance score of 95% (from 23%). Zero security incidents that we know about, which is the only honest way to phrase that number.

The change none of those metrics capture is the cultural one. The CI gate caught seven Dockerfiles in the next quarter that would have shipped a `USER root` or a mounted Docker socket. Each of them was added by someone who'd read this exact post in our wiki and still missed something. The point of the gate isn't that engineers are careless. It's that the wrong defaults will outlast any number of training sessions.

The auditor who wrote *one RCE and you own the cluster* came back the next year. The line in this year's summary read *no findings rated critical*. I keep both of them in the same Slack channel. They're more useful together.
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>docker</category>
            <category>security</category>
            <category>containers</category>
            <category>devops</category>
            <category>best-practices</category>
            <enclosure url="https://harshit.cloud/images/docker-security-hardening/hero.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[kubectl neat: remove Kubernetes YAML clutter]]></title>
            <link>https://harshit.cloud/til/kubectl-neat-trick</link>
            <guid isPermaLink="false">https://harshit.cloud/til/kubectl-neat-trick</guid>
            <pubDate>Tue, 10 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[kubectl get -o yaml dumps 200 lines of generated noise. kubectl neat strips it down to what you actually wrote. Two commands, no more copy-paste cleanup.]]></description>
            <content:encoded><![CDATA[
Today I discovered `kubectl neat` - a plugin that removes all the clutter from Kubernetes YAML output.

## the problem

When you run `kubectl get pod my-pod -o yaml`, you get tons of noise:

```yaml
metadata:
  creationTimestamp: "2024-01-01T00:00:00Z"
  managedFields:
    - apiVersion: v1
      fieldsType: FieldsV1
      fieldsV1:
        # 200 lines of garbage
```

## the solution

Install kubectl-neat:

```bash
kubectl krew install neat
```

Now run:

```bash
kubectl get pod my-pod -o yaml | kubectl neat
```

Clean, readable YAML with just the stuff you care about.

## bonus

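Make it even easier. A plain alias won't work here, because the shell appends your arguments to the end of the pipeline, so wrap it in a function:

```bash
kgn() { kubectl get "$@" -o yaml | kubectl neat; }
```

Now `kgn pod my-pod` gives you clean output instantly.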

Saves the copy-paste-and-strip routine every time I pull a Kubernetes config.

]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>kubernetes</category>
            <category>kubectl</category>
            <category>productivity</category>
            <enclosure url="https://harshit.cloud/til/kubectl-neat-trick/opengraph-image" length="0" type="image//til/kubectl-neat-trick/opengraph-image"/>
        </item>
        <item>
            <title><![CDATA[jq: the command-line JSON parser that earns its keep]]></title>
            <link>https://harshit.cloud/til/jq-for-json-parsing</link>
            <guid isPermaLink="false">https://harshit.cloud/til/jq-for-json-parsing</guid>
            <pubDate>Sun, 08 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[jq is sed for JSON. The patterns I use weekly — filtering, transforming, grouping — and the one-liner that replaced every Python parsing script I had.]]></description>
            <content:encoded><![CDATA[
`jq` is like `sed` for JSON. After a week of using it I stopped reaching for Python one-liners and never looked back.

## installation

```bash
# Mac
brew install jq

# Ubuntu/Debian
apt-get install jq

# CentOS/RHEL
yum install jq
```

## basic usage

### pretty print JSON

```bash
# Ugly JSON from API
curl https://api.example.com/data | jq '.'
```

Output is now colored and formatted.

### extract a field

```bash
echo '{"name": "John", "age": 30}' | jq '.name'
# "John"

# Remove quotes
echo '{"name": "John", "age": 30}' | jq -r '.name'
# John
```

`-r` = raw output (no quotes)

## array operations

### get first element

```bash
echo '[1, 2, 3, 4, 5]' | jq '.[0]'
# 1
```

### get last element

```bash
echo '[1, 2, 3, 4, 5]' | jq '.[-1]'
# 5
```

### get array length

```bash
echo '[1, 2, 3, 4, 5]' | jq 'length'
# 5
```

### extract field from all array items

```bash
echo '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]' | jq '.[].name'
# "Alice"
# "Bob"

# Or use map
echo '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]' | jq 'map(.name)'
# ["Alice", "Bob"]
```

## real-world examples

### 1. parse docker images

```bash
docker images --format='{{json .}}' | jq -r '.Repository + ":" + .Tag + "\t" + .Size'
```

### 2. get all pod names in kubernetes

```bash
kubectl get pods -o json | jq -r '.items[].metadata.name'
```

### 3. extract specific AWS EC2 info

```bash
aws ec2 describe-instances | jq -r '.Reservations[].Instances[] | "\(.InstanceId)\t\(.State.Name)\t\(.PrivateIpAddress)"'
```

### 4. parse package.json dependencies

```bash
cat package.json | jq -r '.dependencies | keys[]'
```

### 5. get GitHub API data

```bash
curl -s https://api.github.com/users/torvalds | jq '{name, bio, public_repos, followers}'
```

## filtering

### filter array items

```bash
# Get users older than 25
echo '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]' | jq '.[] | select(.age > 25)'
```

### multiple conditions

```bash
# AND condition
jq '.[] | select(.age > 25 and .name == "Bob")'

# OR condition
jq '.[] | select(.age > 25 or .name == "Alice")'
```

### check if field exists

```bash
jq '.[] | select(.email != null)'
```

## transforming data

### create new object

```bash
echo '{"first": "John", "last": "Doe", "age": 30}' | jq '{fullname: (.first + " " + .last), age}'
# {
#   "fullname": "John Doe",
#   "age": 30
# }
```

### rename fields

```bash
echo '{"old_name": "value"}' | jq '{new_name: .old_name}'
```

### add field

```bash
echo '{"name": "John"}' | jq '. + {age: 30}'
# {
#   "name": "John",
#   "age": 30
# }
```

## sorting

```bash
# Sort array of objects by field
echo '[{"name": "Bob", "age": 30}, {"name": "Alice", "age": 25}]' | jq 'sort_by(.age)'

# Reverse sort
jq 'sort_by(.age) | reverse'
```

## grouping

```bash
# Group by field
echo '[{"type": "A", "value": 1}, {"type": "B", "value": 2}, {"type": "A", "value": 3}]' | jq 'group_by(.type)'
```

## useful one-liners

### count items by type

```bash
jq 'group_by(.type) | map({type: .[0].type, count: length})'
```

### sum values

```bash
echo '[{"value": 10}, {"value": 20}, {"value": 30}]' | jq '[.[].value] | add'
# 60
```

### get unique values

```bash
echo '[1, 2, 2, 3, 3, 3]' | jq 'unique'
# [1, 2, 3]
```

### find min/max

```bash
echo '[10, 5, 20, 15]' | jq 'min'
# 5

echo '[10, 5, 20, 15]' | jq 'max'
# 20
```

## advanced — CSV output

```bash
# Convert JSON to CSV
echo '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]' | jq -r '.[] | [.name, .age] | @csv'
# "Alice",25
# "Bob",30
```

## advanced — nested data

```bash
# Deep extraction
echo '{"user": {"profile": {"name": "John"}}}' | jq '.user.profile.name'
# "John"

# Safe navigation (don't error if missing)
echo '{"user": {}}' | jq '.user.profile.name // "N/A"'
# "N/A"
```

## practical scripts

### check all service status

```bash
#!/bin/bash
curl -s http://api/services | jq -r '.[] | 
  if .status == "up" then
    "\(.name): pass"
  else
    "\(.name): fail (DOWN)"
  end'
```

### parse AWS cost report

```bash
#!/bin/bash
# End is exclusive, so this range covers all of January
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics BlendedCost | \
  jq -r '.ResultsByTime[] | .TimePeriod.Start + "\t$" + .Total.BlendedCost.Amount'
```

### monitor log errors

```bash
#!/bin/bash
kubectl logs -f pod-name | jq -r 'select(.level == "error") | "\(.timestamp): \(.message)"'
```

## debug jq expressions

Use `jq` playground: https://jqplay.org/

Or test step by step:

```bash
# Start simple
echo '{"a": {"b": {"c": 1}}}' | jq '.'

# Add one level
echo '{"a": {"b": {"c": 1}}}' | jq '.a'

# Add another
echo '{"a": {"b": {"c": 1}}}' | jq '.a.b'

# Final
echo '{"a": {"b": {"c": 1}}}' | jq '.a.b.c'
```

## common patterns I use

### 1. pretty print and save

```bash
curl -s api.example.com/data | jq '.' > formatted.json
```

### 2. extract and process

```bash
curl -s api | jq -r '.items[] | select(.active) | .id' | while read -r id; do
  echo "Processing $id"
  # do something with $id
done
```

### 3. combine multiple JSON files

```bash
jq -s '.' file1.json file2.json file3.json > combined.json
```

### 4. update JSON file in-place

```bash
# Add a field
jq '.version = "2.0"' package.json > temp.json && mv temp.json package.json

# Or use sponge (from moreutils)
jq '.version = "2.0"' package.json | sponge package.json
```

## the gotcha

Remember to use `-r` for raw output when you want to use the result in bash:

```bash
# Wrong (includes quotes)
NAME=$(echo '{"name": "John"}' | jq '.name')
echo $NAME
# "John"

# Right (no quotes)
NAME=$(echo '{"name": "John"}' | jq -r '.name')
echo $NAME
# John
```

## cheat sheet

```bash
jq '.'                     # Pretty print
jq -r '.field'             # Raw output (no quotes)
jq '.field'                # Get field
jq '.[0]'                  # First array element
jq '.[]'                   # All array elements
jq 'length'                # Length
jq 'keys'                  # Object keys
jq '.[] | select(.x > 5)'  # Filter
jq 'map(.field)'           # Map
jq 'sort_by(.field)'       # Sort
jq 'group_by(.field)'      # Group
jq 'add'                   # Sum array
jq 'unique'                # Unique values
jq -s '.'                  # Slurp (combine files)
```
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>jq</category>
            <category>json</category>
            <category>linux</category>
            <category>command-line</category>
            <category>productivity</category>
            <enclosure url="https://harshit.cloud/til/jq-for-json-parsing/opengraph-image" length="0" type="image//til/jq-for-json-parsing/opengraph-image"/>
        </item>
        <item>
            <title><![CDATA[AWS cost optimization: how we cut our bill by 60%]]></title>
            <link>https://harshit.cloud/blog/aws-cost-optimization-tricks</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/aws-cost-optimization-tricks</guid>
            <pubDate>Thu, 05 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Our AWS bill hit $50k/month. Here's exactly how we reduced it to $20k without sacrificing performance or reliability.]]></description>
            <content:encoded><![CDATA[
The CFO saw the AWS bill hit $50,000 a month and I got a calendar invite titled "We need to talk about AWS." I knew the meeting before I clicked accept.

Three months later we were at $20,000 a month, with better p95 latency than when we started. The interesting part is that none of the wins were clever. Most of them were a checkbox someone had skipped two years ago.

![Per-service AWS bill before and after, animated as a dumbbell chart: EC2 $28k to $12k, RDS $12k to $7k, Data Transfer $6k to $2.5k, CloudWatch $2k to $0.5k, Other $2k to $1k, total $50k to $20k per month](/images/aws-cost-optimization-tricks/hero.gif)

*Fig. 1 — most of the bill was EC2 doing nothing in particular.*

## the starting point

The bill broke down like this: EC2 $28,000, RDS $12,000, data transfer $6,000, CloudWatch $2,000, everything else $2,000. Fifty grand a month. The cost-allocation tags were missing on roughly 40% of resources, so for the first week the work was just figuring out who owned what.

Most of it turned out to be waste. Not bad architecture, not premature scale, just defaults that nobody had revisited since the seed round.

## rightsizing the EC2 fleet

Every app server in the fleet was running on `m5.2xlarge`. Not because anything needed eight vCPUs, but because the previous engineer picked an instance type once in 2022 and Terraform copy-pasted it forever after.

A month of CloudWatch told the real story:

```bash
# Check actual CPU utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-xxxxx \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-31T23:59:59Z \
  --period 3600 \
  --statistics Average
```

Average CPU 12%. Average memory 30% (memory isn't in the `AWS/EC2` namespace, so that number needs the CloudWatch agent). The fleet was a parking lot.
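An average can hide one busy hour a day, so it's worth looking at the busiest hourly datapoint too. A small sketch, assuming the JSON that `get-metric-statistics` returns (a `Datapoints` list with an `Average` per period). The 20% idle threshold and the sample numbers are illustrative, not from our fleet:

```python
def summarize_cpu(response: dict, idle_below: float = 20.0) -> dict:
    """Summarize hourly CPU averages from a get-metric-statistics response."""
    values = sorted(dp["Average"] for dp in response.get("Datapoints", []))
    if not values:
        raise ValueError("no datapoints in response")
    mean = sum(values) / len(values)
    return {
        "mean": round(mean, 1),
        "peak_hour": values[-1],
        # Oversized only if even the busiest hour stays under the threshold.
        "oversized": values[-1] < idle_below,
    }

# Made-up response, trimmed to the field used above.
sample = {"Label": "CPUUtilization",
          "Datapoints": [{"Average": 10.0}, {"Average": 14.0}, {"Average": 12.0}]}
print(summarize_cpu(sample))
```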

Dropping to `m5.large` cut the per-hour rate by 4x:

```hcl
# Before
resource "aws_instance" "app" {
  instance_type = "m5.2xlarge"  # $0.384/hour
}

# After
resource "aws_instance" "app" {
  instance_type = "m5.large"     # $0.096/hour
}
```

That single change saved $18,000 a month. p95 latency went down because the new instances were on a newer hypervisor generation. (I have stopped being surprised by this.)
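The per-instance arithmetic, for anyone who wants to plug in their own rates. This sketch assumes a 730-hour month (8,760 hours a year divided by 12) and uses the two hourly rates from the Terraform comments above:

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours a year / 12

def monthly_cost(hourly_rate: float, instances: int = 1) -> float:
    """On-Demand cost per month for instances that never stop."""
    return round(hourly_rate * HOURS_PER_MONTH * instances, 2)

before = monthly_cost(0.384)  # m5.2xlarge
after = monthly_cost(0.096)   # m5.large
print(f"per instance: ${before} -> ${after} a month")
print(f"ratio: {round(before / after, 2)}x")
```

Pass your own fleet size as `instances` to get the fleet-wide number.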

## reserved instances for the steady-state fleet

The app servers ran 24/7. We were paying On-Demand for them anyway, because nobody had wanted to commit a year ahead during a hiring freeze.

The Cost Explorer recommendation API will tell you what to buy if you ask it nicely:

```bash
aws ce get-reservation-purchase-recommendation \
  --service "Amazon Elastic Compute Cloud - Compute" \
  --lookback-period-in-days SIXTY_DAYS \
  --term-in-years ONE_YEAR \
  --payment-option ALL_UPFRONT
```

We bought 1-year RIs for ten `m5.large` app servers and five `c5.xlarge` API servers. 40% off On-Demand, no architectural change, no risk. $4,000 a month back.

The argument against RIs is always "but what if our load profile changes." Three months later it hadn't.
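The load-profile worry has a number attached. An RI bills for every hour of the term whether the instance runs or not, so at a discount of `d` it only wins if the instance would otherwise have run On-Demand for more than `1 - d` of the time. A quick sketch of that break-even, using the 40% figure from above (the function names are mine):

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of the term an instance must run for an RI to beat On-Demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount

def ri_saves_money(discount: float, utilization: float) -> bool:
    """True when paying the discounted rate for every hour is still cheaper."""
    return utilization > break_even_utilization(discount)

print(break_even_utilization(0.40))
print(ri_saves_money(0.40, utilization=1.0))   # a 24/7 app server
print(ri_saves_money(0.40, utilization=0.5))   # a box that runs half the time
```

For a fleet that runs around the clock, the load would have to fall a long way before the commitment loses.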

## spot for the things that can die

The CI fleet was On-Demand `c5.xlarge` runners that sat idle most of the day and got hammered for an hour around lunch. A perfect Spot workload — interruptible, parallelizable, with a queue in front.

```hcl
resource "aws_launch_template" "ci_runner" {
  name_prefix   = "ci-runner-"
  image_id      = data.aws_ami.ubuntu.id
  instance_type = "c5.xlarge"

  # No instance_market_options here: an ASG with a mixed_instances_policy
  # rejects launch templates that request Spot themselves. The On-Demand/Spot
  # split is set in instances_distribution below.
}

resource "aws_autoscaling_group" "ci_runners" {
  name = "ci-runners"

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.ci_runner.id
      }
    }

    instances_distribution {
      on_demand_base_capacity                  = 1  # one runner always on
      on_demand_percentage_above_base_capacity = 0  # everything else is Spot
      spot_allocation_strategy                 = "capacity-optimized"
    }
  }

  min_size = 2
  max_size = 10
}
```

One On-Demand runner for the always-on baseline, the rest Spot, capacity-optimized strategy so AWS picks pools with low interruption rates. $2,500 a month. The CI team noticed the build queue was faster, not that the underlying instances had changed.

## S3 lifecycle policies

We had 50 TB in S3, all in Standard. The application logs were the worst offender — every JSON line our services had ever emitted, sitting at $0.023 per GB-month, being read by exactly nobody.

```bash
aws s3api list-objects-v2 \
  --bucket my-bucket \
  --query "Contents[?LastModified<'2023-01-01'].[Key,Size]" \
  --output table
```

Most of it hadn't been touched in a year.

The lifecycle policy is the thing AWS lets you write once and forget:

```json
{
  "Rules": [
    {
      "Id": "Archive old logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30,  "StorageClass": "STANDARD_IA" },
        { "Days": 90,  "StorageClass": "GLACIER_IR" },
        { "Days": 180, "StorageClass": "DEEP_ARCHIVE" }
      ]
    },
    {
      "Id": "Delete old temp files",
      "Status": "Enabled",
      "Filter": { "Prefix": "temp/" },
      "Expiration": { "Days": 7 }
    },
    {
      "Id": "Intelligent tiering for backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "backups/" },
      "Transitions": [
        { "Days": 0, "StorageClass": "INTELLIGENT_TIERING" }
      ]
    }
  ]
}
```

Apply it once:

```bash
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration file://lifecycle.json
```

$3,000 a month. The work was reading enough of the data to be confident no on-call runbook secretly depended on a five-year-old log line. (One did. We rewrote the runbook.)
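When checking runbooks against the policy, it helps to be able to ask "where will a log line of this age live?" A tiny sketch that mirrors the `logs/` rule above; the thresholds come from that rule, the function name is mine. S3 evaluates lifecycle rules on its own schedule, so treat the boundaries as approximate:

```python
# (minimum age in days, storage class), mirroring the "logs/" transitions.
LOG_TRANSITIONS = [
    (0, "STANDARD"),
    (30, "STANDARD_IA"),
    (90, "GLACIER_IR"),
    (180, "DEEP_ARCHIVE"),
]

def storage_class_for_age(age_days: int) -> str:
    """Storage class a logs/ object should have reached at a given age."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    current = LOG_TRANSITIONS[0][1]
    for min_age, storage_class in LOG_TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current

for age in (7, 45, 120, 400):
    print(age, storage_class_for_age(age))
```

Anything a runbook needs within minutes should not be sitting in the last tier.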

## RDS, where the real fat lived

The dev database was a `db.r5.4xlarge`. Sixteen vCPUs and 128 GB of RAM, running 24/7, used by maybe three engineers between 10am and 6pm in one timezone. It cost more than half the engineering team's laptops combined.

The fix was three changes. Drop the dev instance to `db.t3.large`. Auto-stop it at night and on weekends. Move staging to Aurora Serverless v2 so it scales to half a capacity unit when idle:

```hcl
resource "aws_db_instance" "dev" {
  identifier     = "dev-database"
  instance_class = "db.t3.large"  # was db.r5.4xlarge

  iam_database_authentication_enabled = true
  auto_minor_version_upgrade          = true

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "mon:04:00-mon:05:00"
}

resource "aws_rds_cluster" "staging" {
  cluster_identifier = "staging-aurora"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned"

  serverlessv2_scaling_configuration {
    max_capacity = 2.0
    min_capacity = 0.5
  }
}
```

$5,000 a month. The complaints about staging being slow on the first request after lunch went away once people understood that two seconds of cold-start was the trade.
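The auto-stop piece isn't shown above. The decision logic is a few lines; this sketch assumes an 08:00 to 20:00 weekday window in the team's local time (my numbers, pick your own). A scheduled job would then call the RDS `stop_db_instance` or `start_db_instance` API depending on the answer:

```python
from datetime import datetime

WORK_START_HOUR = 8   # assumption: up a little before the first engineer
WORK_END_HOUR = 20    # assumption: down a little after the last one

def should_be_running(now: datetime) -> bool:
    """True on weekdays inside the working window, False at night and on weekends."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and WORK_START_HOUR <= now.hour < WORK_END_HOUR

print(should_be_running(datetime(2024, 12, 4, 11, 0)))   # a Wednesday morning
print(should_be_running(datetime(2024, 12, 7, 11, 0)))   # a Saturday morning
```

RDS restarts a stopped instance on its own after seven days, so the job has to keep running every night rather than stopping the database once and walking away.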

## CloudWatch logs, kept forever

CloudWatch logs default to "never expire," which is fine if you want to be the company paying $0.50 per GB to ingest and $0.03 per GB-month to keep a stack trace from 2021.

A short script set retention on every log group in the account:

```python
import boto3

client = boto3.client('logs')

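# describe_log_groups returns at most 50 groups per call, so paginate
paginator = client.get_paginator('describe_log_groups')
log_groups = (
    group
    for page in paginator.paginate()
    for group in page['logGroups']
)

for log_group in log_groups: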
    group_name = log_group['logGroupName']

    # prod keeps 30 days, everything else keeps 7
    retention_days = 30 if 'prod' in group_name else 7

    client.put_retention_policy(
        logGroupName=group_name,
        retentionInDays=retention_days
    )

    print(f"Set {group_name} to {retention_days} days")
```

$1,500 a month, recovered from log groups whose entire purpose was to exist.

## the NAT gateway tax

Three NAT Gateways, one per AZ, $0.045 per hour each. The HA story was airtight. The actual traffic profile didn't justify it for the non-prod VPCs.

```hcl
# Before: 3 NAT Gateways
resource "aws_nat_gateway" "az1" { /* ... */ }
resource "aws_nat_gateway" "az2" { /* ... */ }
resource "aws_nat_gateway" "az3" { /* ... */ }

# After: 1 NAT Gateway
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id

  tags = { Name = "main-nat-gateway" }
}

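resource "aws_route" "private_nat" {
  # count, to match the count-style private route tables used elsewhere
  count = length(aws_route_table.private)

  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.main.id
}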
```

$200 a month. We kept the three-gateway HA setup in production. The argument against single-NAT in dev is "but what if the AZ goes down?" The answer in dev is "then dev is down."

## data transfer, the silent killer

$6,000 a month in data transfer fees, which is the kind of bill where you can't actually see what you're paying for until you turn on VPC Flow Logs and read them.

```bash
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-xxxxx \
  --traffic-type ALL \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::my-flow-logs
```

Two culprits. App servers were pulling Docker images from external registries on every cold start, paying NAT egress on every layer. And one stale cron job was syncing a database snapshot across regions every hour for a use case that nobody could remember sponsoring.

ECR interface endpoints route the registry traffic privately, so it never leaves the VPC and never touches NAT:

```hcl
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true

  subnet_ids         = aws_subnet.private[*].id
  security_group_ids = [aws_security_group.vpc_endpoints.id]
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true

  subnet_ids         = aws_subnet.private[*].id
  security_group_ids = [aws_security_group.vpc_endpoints.id]
}
```

The S3 gateway endpoint is free, which is the only kind of free that AWS hands out without an asterisk. ECR serves image layers from S3, so the registry pulls need it too:

```hcl
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"

  route_table_ids = aws_route_table.private[*].id
}
```

CloudFront went in front of the static asset bucket, which moved bytes out of the per-GB egress lane and into the CDN lane. $3,500 a month back, most of which was the ECR change alone.

## budgets, so the next surprise isn't a surprise

The reason this whole exercise happened in the first place was that nobody had a budget alert. The fix is about two dozen lines of Terraform:

```hcl
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-budget"
  budget_type  = "COST"
  limit_amount = "25000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["alerts@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["cfo@company.com"]
  }
}
```

The CFO gets the forecasted-overshoot alert. The on-call gets the 80%-of-actual alert. By the time the second one fires, somebody is already digging.

## the receipts

| Category | Before | After | Savings |
|----------|--------|-------|---------|
| EC2 | $28,000 | $12,000 | 57% |
| RDS | $12,000 | $7,000 | 42% |
| Data Transfer | $6,000 | $2,500 | 58% |
| CloudWatch | $2,000 | $500 | 75% |
| Other | $2,000 | $1,000 | 50% |
| **Total** | **$50,000** | **$20,000** | **60%** |

Six weeks of part-time work, no architecture rewrites, no migrations, no vendor changes. Mostly Terraform diffs and one Python script.

The line from the postmortem the CFO actually circulated was the part I keep coming back to: *"The bill didn't grow because we scaled. The bill grew because nobody was looking."*
]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>aws</category>
            <category>cloud</category>
            <category>cost-optimization</category>
            <category>finops</category>
            <category>infrastructure</category>
            <enclosure url="https://harshit.cloud/images/aws-cost-optimization-tricks/hero.gif" length="0" type="image/gif"/>
        </item>
        <item>
            <title><![CDATA[Docker build cache: the .dockerignore gotcha]]></title>
            <link>https://harshit.cloud/til/docker-build-cache-trick</link>
            <guid isPermaLink="false">https://harshit.cloud/til/docker-build-cache-trick</guid>
            <pubDate>Thu, 05 Dec 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Docker builds slow despite a clean layer order? Your .dockerignore is probably letting files bust the cache on every commit. The two-line fix.]]></description>
            <content:encoded><![CDATA[
Spent 2 hours debugging why my Docker builds were slow despite using multi-stage builds and proper layer ordering.

## the issue

Every single build was invalidating the cache at the `COPY . .` step, even when I hadn't changed any code.

## the culprit

My editor was creating `.swp` files and updating file timestamps. Docker saw these changes and invalidated the cache.

## the fix

Add a proper `.dockerignore`:

```
.git
.gitignore
README.md
.env*
node_modules
npm-debug.log
.next
.vscode
*.swp
*.swo
.DS_Store
```

Build time went from 5 minutes to 30 seconds.

Treat `.dockerignore` like `.gitignore`. Be aggressive about what you exclude.

]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>docker</category>
            <category>devops</category>
            <category>optimization</category>
            <enclosure url="https://harshit.cloud/til/docker-build-cache-trick/opengraph-image" length="0" type="image//til/docker-build-cache-trick/opengraph-image"/>
        </item>
        <item>
            <title><![CDATA[Infrastructure as code: mistakes I made so you don't have to]]></title>
            <link>https://harshit.cloud/blog/infrastructure-as-code-mistakes</link>
            <guid isPermaLink="false">https://harshit.cloud/blog/infrastructure-as-code-mistakes</guid>
            <pubDate>Thu, 28 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learning Terraform the hard way. Here are the mistakes that cost me sleep, money, and a bit of my sanity.]]></description>
            <content:encoded><![CDATA[
The first time I ran `terraform destroy` against the wrong workspace, I had two terminals open, one coffee in, and roughly four seconds between hitting `yes` and realising what was on the other end of that plan. The instance count was 17. By the time I cancelled, it was 6. Every one of those came back, eventually. The pages did not.

![Five hand-drawn tombstones lined up in a small graveyard, each one marking a different terraform mistake — hardcoded amis, lost local state, a 2000-line main.tf, an unpinned provider, and a stray terraform destroy in red.](/images/infrastructure-as-code-mistakes/hero.png)

*Fig. 1 — every headstone here was paid for in pages.*

What follows is the short list of Terraform mistakes I've made enough times to recognise on sight. None of them are clever. All of them are the sort of thing you nod at in a blog post and then commit anyway because it's Friday.

## hardcoding everything

**The mistake.** My first Terraform config looked like this:

```hcl
resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t2.micro"
}
```

Looks fine until the day someone asks for the same stack in `eu-west-1` and you discover that AMI ID isn't a real thing outside `us-east-1`. Or until the AMI is six months old and Canonical has retired it. Or until you have eleven of these scattered across modules and you can't grep your way out.

**The fix.** Variables for the knobs, data sources for the things AWS is willing to look up for you:

```hcl
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical; without this, any account's AMI can match
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
}
```

The data source costs you one API call per plan. It saves you the next four migrations.

## not using remote state

**The mistake.** Keeping `terraform.tfstate` on my laptop. I lost it once — clean reinstall, didn't think to copy the working directory across. The infrastructure was still up, happily running. Terraform had no idea any of it existed. I rebuilt the state by hand with `terraform import`, one resource at a time, and learned more about resource addresses than I wanted to.

**The fix.** S3 backend, versioning on, lock table next to it:

```hcl
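terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
    # Lock table; the name is a placeholder. The table needs a string
    # hash key called LockID.
    dynamodb_table = "terraform-locks"
  }
}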
```

Versioning on the bucket is the part most people skip. Turn it on. The day you fat-finger a `terraform state rm`, you will want yesterday's state file back, and S3 will hand it over without comment.
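
The bucket and the lock table have to exist before the backend can use them. A rough sketch of that bootstrap, assuming AWS provider 4.x or later and a separate small config that owns these resources. The `terraform-locks` table name is a placeholder of mine; the backend would point at it through its `dynamodb_table` argument.

```hcl
# Hypothetical bootstrap config for the state backend.
resource "aws_s3_bucket" "state" {
  bucket = "my-terraform-state"
}

# Keeps every previous state file as a recoverable object version.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# The S3 backend expects a string hash key named LockID.
resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Apply this once with local state, then leave it alone. It's the one config where a local state file is a tolerable trade.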

## one giant main.tf

**The mistake.** A single `main.tf` that crossed 2,000 lines somewhere around the third VPC peering. Every change touched the same file, every PR diff looked like a refactor, and finding the security group for the bastion meant `Cmd+F "bastion"` and praying I'd named it consistently.

**The fix.** Split by concern, then by reusable unit. The convention I've landed on, per module:

- `main.tf` for the resources that define the module
- `variables.tf` for inputs
- `outputs.tf` for outputs
- `versions.tf` for provider and Terraform version constraints
- separate child modules under `modules/` for anything used twice

The names don't matter to Terraform — it concatenates every `.tf` in the directory regardless. They matter to the next person who opens the repo, which on a long enough timeline is also you.
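
For the last bullet, this is roughly what the root module looks like once the bastion lives in its own child module. Every name here is hypothetical: the `modules/bastion` path, the `vpc_id` input, and the `security_group_id` output are assumptions about how such a module might be laid out.

```hcl
# Root main.tf: wires child modules together, defines little itself.
variable "vpc_id" {
  description = "VPC the bastion is launched into"
  type        = string
}

module "bastion" {
  source = "./modules/bastion"

  # Inputs declared in modules/bastion/variables.tf (assumed).
  vpc_id        = var.vpc_id
  instance_type = "t3.micro"
}

# Re-export an output declared in modules/bastion/outputs.tf (assumed).
output "bastion_security_group_id" {
  value = module.bastion.security_group_id
}
```

Finding the bastion's security group is now a matter of opening one small directory.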

## not locking provider versions

**The mistake.** A bare provider block:

```hcl
provider "aws" {
  region = "us-east-1"
}
```

The AWS provider shipped a major version with breaking changes to the `aws_s3_bucket` resource layout. Nothing was pinned, so the next CI run picked it up, the plan tried to recreate every bucket in the account, and I learned what a deeply unhappy Slack channel looks like before lunch.

**The fix.** Pin the provider, pin Terraform itself, commit the lockfile:

```hcl
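terraform {
  # Example constraint for Terraform itself; pin to the release line you run.
  required_version = "~> 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
}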
```

`~> 4.0` lets in patch and minor bumps, blocks the major. The `.terraform.lock.hcl` file Terraform writes next to your config locks the exact resolved version, including provider hashes. Commit it. Treat a lockfile change in a PR like a dependency upgrade, because that's what it is.
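
How tight the constraint is depends on how many version components you give `~>`. A short sketch of the options, with `4.67.0` picked purely as an example release:

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # "~> 4.0"    allows >= 4.0.0 and < 5.0.0 (minor and patch bumps)
      # "~> 4.67.0" allows >= 4.67.0 and < 4.68.0 (patch bumps only)
      # "= 4.67.0"  allows exactly that release
      version = "~> 4.67.0"
    }
  }
}
```

Whichever one you choose, the lockfile still decides what actually gets installed. The constraint only bounds what `terraform init -upgrade` is allowed to move to.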

## destroying production by accident

**The mistake.** Two terminal tabs, identical prompts, opposite environments. The plan I meant to run was in the other window. We've all been there. If you haven't, you will be.

**The fix.** A few cheap defenses, layered:

```hcl
resource "aws_instance" "critical" {
  # ...
  lifecycle {
    prevent_destroy = true
  }
}
```

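`prevent_destroy` makes Terraform reject any plan that would destroy the resource. The plan errors out before anything moves. It only protects while the resource block is still in the configuration, so deleting the whole block removes the guard along with it. It's annoying when you genuinely want to destroy the thing, because you have to remove the `prevent_destroy` line first, and that annoyance is the entire point.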

Beyond that: a shell prompt that screams the workspace and account in red when you're in prod. A wrapper around `terraform` that greps the planned destroys and demands you type the resource address back. Terraform Cloud or Atlantis if you have the budget, so the apply runs from a server with proper RBAC and not from whichever terminal you happened to alt-tab into.
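
One more layer that fits in the config itself, if your environments live in separate AWS accounts (an assumption; it does nothing when prod and staging share one). The provider's `allowed_account_ids` argument makes Terraform fail when the credentials in the current terminal belong to any other account. The variable name is mine.

```hcl
# Set per environment, e.g. in prod.tfvars and staging.tfvars.
variable "aws_account_id" {
  description = "The only AWS account this configuration may touch"
  type        = string
}

provider "aws" {
  region = "us-east-1"

  # Provider setup errors out if the active credentials resolve to a
  # different account than the one listed here.
  allowed_account_ids = [var.aws_account_id]
}
```

It won't stop a destroy in the right account, but it does turn "wrong tab, wrong credentials" into an error and not an outage.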

The four-second window between `yes` and panic does not get longer with experience. It gets shorter, because you stop reading the prompt.

]]></content:encoded>
            <author>harshit@truefoundry.com (Harshit Luthra)</author>
            <category>terraform</category>
            <category>iac</category>
            <category>devops</category>
            <category>infrastructure</category>
            <enclosure url="https://harshit.cloud/images/infrastructure-as-code-mistakes/hero.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>