Directory Bear — A 946K-page web directory, fully static with Hugo
URL: dirbear.com
I built Directory Bear, a web directory with nearly a million pre-computed site profile pages, all generated and served as a fully static Hugo site on Bunny CDN.
What it is
Directory Bear aims to be the world’s largest static web directory. Every listed site gets its own profile page with a proprietary “Bear Rank” score (a composite of popularity, authority, longevity, and safety), AI-generated descriptions, favicon, and categorization across 49 family-friendly categories.
There are two listing tiers — free (nofollow link) and verified ($20 one-time, dofollow link + badge) — with submissions handled through a Bunny Edge Script and a static admin panel.
The stack
- Hugo for the entire site build
- Python data pipeline merging Tranco, Majestic Million, and OpenPageRank datasets
- GPT-4o-mini for AI enrichment (category classification, one-line descriptions, overviews, tags, FAQs)
- Bunny CDN for hosting, favicon storage, form submissions via Edge Scripts — everything
How I made Hugo work at ~1 million pages
This was the real engineering challenge. A few things that made it possible:
Hash bucketing. Every domain is bucketed by MD5(domain)[:2], giving 256 content directories with ~3,900 files each. This keeps Hugo from choking on a single massive directory. URL structure: /w/{hash}/{domain}/.
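The bucketing scheme can be sketched in a few lines of Python (the function names here are mine, not from the actual pipeline):

```python
import hashlib

def bucket_for(domain: str) -> str:
    """First two hex chars of the MD5 digest -> one of 256 buckets."""
    return hashlib.md5(domain.encode("utf-8")).hexdigest()[:2]

def page_path(domain: str) -> str:
    # Mirrors the /w/{hash}/{domain}/ URL structure described above.
    return f"content/w/{bucket_for(domain)}/{domain}/index.md"
```

With ~946K domains spread over 256 buckets, each content directory stays in the low thousands of files, which Hugo handles comfortably.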
All data in front matter. Each .md file carries everything — BR score, tier, category, favicon path, AI overview, tags, FAQs — all in YAML front matter. Hugo templates just render it. No large JSON lookups, no filtering at build time.
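A generator for such pages might look like the sketch below; field names like br_score are illustrative assumptions, not the real schema:

```python
def render_page(domain: str, meta: dict) -> str:
    """Emit a Hugo page whose body is empty: templates render the front matter."""
    lines = ["---"]
    for key, value in meta.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"  - {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"

page = render_page("example.com", {
    "title": "example.com",
    "br_score": 87,          # hypothetical field names
    "tier": "free",
    "tags": ["reference", "documentation"],
})
```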
Pre-computed everything. Category pages are pre-paginated (50 per page) by a Python script. Search uses progressive JSON prefix files (type “goo” → fetch goo.json), not a monolithic index. Featured sites, top rankings, new sites — all pre-built as Hugo data files.
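The progressive-search idea can be sketched as follows; the prefix length and per-file cap are assumptions on my part:

```python
import json
from collections import defaultdict

def build_prefix_files(domains, prefix_len=3, limit=50):
    """Group domains by their first `prefix_len` characters so the client
    can fetch e.g. goo.json as the user types 'goo'. Domains shorter than
    the prefix are skipped in this sketch."""
    buckets = defaultdict(list)
    for d in sorted(domains):
        if len(d) >= prefix_len:
            buckets[d[:prefix_len]].append(d)
    # Cap entries per prefix so every file stays tiny.
    return {f"{p}.json": json.dumps(names[:limit]) for p, names in buckets.items()}

files = build_prefix_files(["google.com", "goodreads.com", "github.com"])
```

The client never downloads a monolithic index; it fetches only the one small JSON file matching what the user has typed so far.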
Segment builds. For updating non-site pages (submit form, about, homepage), I move content/w/, content/categories/, and static/favicons/ to /tmp/, run Hugo (takes seconds), then move them back. This avoids rebuilding 946K pages just to update a CSS file. A shell script (hugo_segment.sh) handles this.
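The hugo_segment.sh script itself isn't shown here, but the stash-build-restore flow might look like this sketch (written in Python for consistency with the rest of the pipeline; the build step is injectable so the logic can be exercised without Hugo installed):

```python
import shutil
import subprocess
from pathlib import Path

HEAVY = ["content/w", "content/categories", "static/favicons"]  # paths from the post

def segment_build(site_dir=".", stash_dir="/tmp/db_stash", build=None):
    """Stash the huge directories, run a fast Hugo build, then restore them."""
    build = build or (lambda: subprocess.run(["hugo"], cwd=site_dir, check=True))
    stash = Path(stash_dir)
    stash.mkdir(parents=True, exist_ok=True)
    moved = []
    for rel in HEAVY:
        src = Path(site_dir) / rel
        if src.exists():
            dst = stash / rel.replace("/", "_")
            shutil.move(str(src), str(dst))
            moved.append((src, dst))
    try:
        build()  # seconds instead of a full 946K-page rebuild
    finally:
        for src, dst in moved:  # always restore, even if the build fails
            shutil.move(str(dst), str(src))
```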
Favicons out of static/. With 900K+ favicon PNGs in static/, Hugo would try to copy all of them to public/ on every build. Moving them out during builds and uploading them separately to CDN was essential. Client-side SVG letter avatars (deterministic color per domain) handle any missing favicons gracefully.
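The letter-avatar fallback runs client-side in practice, but the deterministic-color idea can be illustrated in Python; the exact color derivation here is my assumption:

```python
import hashlib

def letter_avatar_svg(domain: str, size: int = 32) -> str:
    """Fallback avatar: the domain's first letter on a background color
    derived deterministically from the domain's hash (hue in 0-359)."""
    hue = int(hashlib.md5(domain.encode("utf-8")).hexdigest(), 16) % 360
    letter = domain[0].upper()
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<rect width="100%" height="100%" fill="hsl({hue},60%,45%)"/>'
        f'<text x="50%" y="50%" dy=".35em" text-anchor="middle" '
        f'fill="#fff" font-family="sans-serif" font-size="{size // 2}">{letter}</text>'
        f'</svg>'
    )
```

Because the color is a pure function of the domain, a returning visitor always sees the same avatar, with no server-side image generation at all.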
Individual site additions. New submissions go through add_site.py which creates the page, downloads the favicon, computes the BR score, updates data files, and optionally runs AI enrichment — no need to touch the full pipeline.
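A minimal sketch of what such a per-submission script might do; the real add_site.py isn't shown in this post, and all helper hooks below are hypothetical stubs:

```python
import hashlib

def add_site(domain, tier="free",
             score=lambda d: 50,     # placeholder for the real BR computation
             favicon=lambda d: None, # placeholder for the favicon download
             enrich=None):           # optional AI-enrichment callable
    """Build one page's metadata without touching the full pipeline."""
    bucket = hashlib.md5(domain.encode("utf-8")).hexdigest()[:2]
    meta = {
        "title": domain,
        "tier": tier,  # "free" or "verified"
        "br_score": score(domain),
        "favicon": favicon(domain) or f"/favicons/{domain}.png",
        "path": f"content/w/{bucket}/{domain}/index.md",
    }
    if enrich:
        meta.update(enrich(domain))  # category, description, tags, FAQs
    return meta
```

The key property is that one submission touches one page, one favicon, and a handful of data files, never the 946K-page corpus.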
Build numbers
- 946,000+ site profile pages
- 900,000+ favicon images (Google S2 + DuckDuckGo fallback)
- 49 pre-paginated category sections
- Progressive search across nearly a million domains via tiny JSON prefix files
- Full build requires ulimit -n 65536 on macOS
Lessons learned
- Hugo can absolutely handle near-million-page sites, but you need to be deliberate about directory structure and what goes into static/.
- Put everything in front matter. The less work Hugo templates do, the faster your builds.
- Pre-compute aggressively. If something can be a static JSON file instead of a template computation, make it a static JSON file.
- Segment your builds. Don’t rebuild a million pages to fix a typo on your about page.
Happy to answer any questions about the build or the approach!
