Directory Bear — A 946K-page web directory, fully static with Hugo

URL: dirbear.com


I built Directory Bear, a web directory with nearly a million pre-computed site profile pages, all generated and served as a fully static Hugo site on Bunny CDN.

What it is

Directory Bear aims to be the world’s largest static web directory. Every listed site gets its own profile page with a proprietary “Bear Rank” score (a composite of popularity, authority, longevity, and safety), AI-generated descriptions, favicon, and categorization across 49 family-friendly categories.
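
The actual Bear Rank formula is proprietary, but a composite of normalized signals is typically just a weighted sum. Here's a purely illustrative sketch; the weights and the 0-100 scaling are my assumptions, not the real formula:

```python
def bear_rank(popularity, authority, longevity, safety,
              weights=(0.4, 0.3, 0.2, 0.1)):
    """Illustrative composite only -- the real Bear Rank formula isn't public.
    Inputs are assumed normalized to 0..1; output is scaled to 0..100."""
    signals = (popularity, authority, longevity, safety)
    return round(100 * sum(w * s for w, s in zip(weights, signals)))
```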

There are two listing tiers — free (nofollow link) and verified ($20 one-time, dofollow link + badge) — with submissions handled through a Bunny Edge Script and a static admin panel.

The stack

  • Hugo for the entire site build
  • Python data pipeline merging Tranco, Majestic Million, and OpenPageRank datasets
  • GPT-4o-mini for AI enrichment (category classification, one-line descriptions, overviews, tags, FAQs)
  • Bunny CDN for hosting, favicon storage, form submissions via Edge Scripts — everything

How I made Hugo work at ~1 million pages

This was the real engineering challenge. A few things that made it possible:

Hash bucketing. Every domain is bucketed by MD5(domain)[:2], giving 256 content directories with ~3,900 files each. This keeps Hugo from choking on a single massive directory. URL structure: /w/{hash}/{domain}/.
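
The bucketing described above is a few lines of Python; the exact file layout below (index.md inside a per-domain directory) is my assumption:

```python
import hashlib

def bucket_path(domain: str) -> str:
    """Map a domain to its content path via the first two hex chars of its MD5,
    spreading ~946K pages across 256 directories."""
    bucket = hashlib.md5(domain.encode("utf-8")).hexdigest()[:2]
    return f"content/w/{bucket}/{domain}/index.md"
```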

All data in front matter. Each .md file carries everything — BR score, tier, category, favicon path, AI overview, tags, FAQs — all in YAML front matter. Hugo templates just render it. No large JSON lookups, no filtering at build time.
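
A minimal sketch of emitting such a page, with the body left empty so templates read front matter only. The field names and values are illustrative, not the actual schema, and a real pipeline would use a YAML library to handle quoting:

```python
def render_page(site: dict) -> str:
    """Render a site profile .md whose body is empty: all data lives in
    YAML front matter. Naive emitter -- no quoting/escaping of values."""
    fm_lines = ["---"]
    for key, value in site.items():
        if isinstance(value, list):
            fm_lines.append(f"{key}:")
            fm_lines.extend(f"  - {v}" for v in value)
        else:
            fm_lines.append(f"{key}: {value}")
    fm_lines.append("---")
    return "\n".join(fm_lines) + "\n"

page = render_page({
    "title": "example.com",
    "br_score": 87,
    "tier": "verified",
    "category": "technology",
    "tags": ["search", "reference"],
})
```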

Pre-computed everything. Category pages are pre-paginated (50 per page) by a Python script. Search uses progressive JSON prefix files (type “goo” → fetch goo.json), not a monolithic index. Featured sites, top rankings, new sites — all pre-built as Hugo data files.
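
Generating those prefix files is a straightforward sharding pass. A sketch under my own assumptions about prefix length, per-file cap, and output paths:

```python
import json
import os
from collections import defaultdict

def build_prefix_index(domains, out_dir="search", prefix_len=3, cap=50):
    """Shard domains into tiny per-prefix JSON files so the client can fetch
    e.g. search/goo.json as the user types, instead of one huge index."""
    buckets = defaultdict(list)
    for d in sorted(domains):
        buckets[d[:prefix_len].lower()].append(d)
    os.makedirs(out_dir, exist_ok=True)
    for prefix, hits in buckets.items():
        with open(os.path.join(out_dir, f"{prefix}.json"), "w") as f:
            json.dump(hits[:cap], f)  # the cap keeps every file small
    return len(buckets)
```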

Segment builds. For updating non-site pages (submit form, about, homepage), I move content/w/, content/categories/, and static/favicons/ to /tmp/, run Hugo (takes seconds), then move them back. This avoids rebuilding 946K pages just to update a CSS file. A shell script (hugo_segment.sh) handles this.
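
The same move-build-restore dance in Python, as a sketch of what hugo_segment.sh does (directory names are from the post; everything else is my approximation):

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

HEAVY = ["content/w", "content/categories", "static/favicons"]

def segment_build(site_root=".", build_cmd=("hugo", "--minify")):
    """Move the heavy directories aside, run a fast Hugo build of the light
    pages, then restore them -- even if the build fails."""
    root = Path(site_root)
    stash = Path(tempfile.mkdtemp(prefix="hugo_stash_"))  # the post uses /tmp/
    moved = []
    try:
        for rel in HEAVY:
            src = root / rel
            if src.exists():
                dst = stash / rel.replace("/", "_")
                shutil.move(str(src), str(dst))
                moved.append((src, dst))
        subprocess.run(list(build_cmd), cwd=root, check=True)
    finally:
        for src, dst in moved:  # always restore the heavy directories
            shutil.move(str(dst), str(src))
```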

Favicons out of static/. With 900K+ favicon PNGs in static/, Hugo would try to copy all of them to public/ on every build. Moving them out during builds and uploading them separately to CDN was essential. Client-side SVG letter avatars (deterministic color per domain) handle any missing favicons gracefully.
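
The deterministic-avatar idea is easy to show. The live site does this client-side in JS; here's the same logic sketched in Python, with sizing and styling choices that are mine:

```python
import hashlib

def letter_avatar_svg(domain: str, size: int = 64) -> str:
    """Fallback avatar: hash the domain to a stable hue, draw its first
    letter. Same domain always yields the same SVG."""
    hue = int(hashlib.md5(domain.encode()).hexdigest()[:4], 16) % 360
    letter = domain[0].upper()
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<rect width="100%" height="100%" rx="8" fill="hsl({hue}, 60%, 45%)"/>'
        f'<text x="50%" y="50%" dy=".35em" text-anchor="middle" '
        f'fill="#fff" font-family="sans-serif" font-size="{size // 2}">{letter}</text>'
        f"</svg>"
    )
```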

Individual site additions. New submissions go through add_site.py which creates the page, downloads the favicon, computes the BR score, updates data files, and optionally runs AI enrichment — no need to touch the full pipeline.

Build numbers

  • 946,000+ site profile pages
  • 900,000+ favicon images (Google S2 + DuckDuckGo fallback)
  • 49 pre-paginated category sections
  • Progressive search across nearly a million domains via tiny JSON prefix files
  • Full build requires ulimit -n 65536 on macOS

Lessons learned

  1. Hugo can absolutely handle near-million-page sites, but you need to be deliberate about directory structure and what goes into static/.
  2. Put everything in front matter. The less work Hugo templates do, the faster your builds.
  3. Pre-compute aggressively. If something can be a static JSON file instead of a template computation, make it a static JSON file.
  4. Segment your builds. Don’t rebuild a million pages to fix a typo on your about page.

Happy to answer any questions about the build or the approach!
