JSON/YAML representation of Hugo site (open-ended discussion)

One thing that I really like about Sphinx is that it creates a representation of an entire Sphinx project in an objects.inv file, which allows for interoperability between different documentation projects (inter-linking, creating cross-project search indexes, etc.). I previously worked within a large organization that had hundreds of Sphinx-based documentation projects, and access to the objects.inv file was absolutely crucial to our efforts.

I would love it if Hugo offered something like this and I honestly don’t think it’d be all that difficult given how the project is structured, with variables like .Site.Pages and .Site.Taxonomies readily available. I’d be happy to step up and do the work if (a) it’s agreed that this would be a useful feature and project decision makers are on board and (b) I have a concrete sense of what others would like to see.

So I’d like to start an open-ended discussion here. If there were a Hugo command (hugo map, hugo siteinfo, something like that) that could create a JSON and/or YAML representation of a Hugo project, what would you like it to contain? What configurable parameters would you like it to provide?

And perhaps most importantly, what would you use it for?

If I’m understanding what you’re after, it’s quite easy to output Hugo content in any text format, including JSON with custom output types: https://gohugo.io/templates/output-formats/

@budparr True, but I’m talking about creating a JSON/YAML/whatever representation of an entire project, not a particular page or bit of content. Imagine that you could run hugo siteinfo (or something along those lines) and get a single JSON object that represents that project. It could include a listing of all page metadata, maybe metadata plus the actual content of those pages, a listing of all taxonomies, etc. My question is about what that project-level representation would entail.

That representation could then be used by other systems for a variety of purposes, like building cross-project search indexes.
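To make that a bit more concrete, here is a rough Python sketch of how a consumer might fold several project-level files into one cross-project search index. Everything about the JSON shape (`project`, `base_url`, `pages`, `url`, `title`, `tags`) is made up for illustration; Hugo doesn't emit anything like this natively today, and the real format is exactly what this thread is asking about.

```python
import json
from collections import defaultdict

# Hypothetical project-level documents, one per Hugo site.
# The field names are assumptions, not an existing Hugo format.
SITE_A = json.loads("""
{"project": "api-docs", "base_url": "https://example.org/api/",
 "pages": [{"url": "auth/", "title": "Authentication Guide", "tags": ["security"]}]}
""")
SITE_B = json.loads("""
{"project": "ops-docs", "base_url": "https://example.org/ops/",
 "pages": [{"url": "tls/", "title": "TLS Setup", "tags": ["security", "network"]}]}
""")


def build_index(sites):
    """Map each lowercased title word or tag to the absolute URLs that use it."""
    index = defaultdict(set)
    for site in sites:
        for page in site["pages"]:
            full_url = site["base_url"] + page["url"]
            terms = page["title"].split() + page.get("tags", [])
            for term in terms:
                index[term.lower()].add(full_url)
    return index


index = build_index([SITE_A, SITE_B])
print(sorted(index["security"]))
```

A real search index would tokenize page content too, but the point is that the consumer only needs one predictable file per project.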

@budparr is right.

What I would do in this case is to have a slimmed-down config-something.toml that defines, say, only a JSON output format for the home page, and dump everything into that. I have done something similar to create temporary JSON files for Lunr search indexing. I.e. I don’t need to serve that file from /public, I just need that JSON file to do some “outside of Hugo processing”.

So, a template that creates JSON from .Site should get you very close to your goal.

@bep I absolutely do agree that there are currently ways to achieve things like this in Hugo (and I’ve indeed done things along the lines of what you’re suggesting for Lunr search).

What I’m suggesting, though, is that this could be added as a native capability, which would, IMO, expand Hugo’s usefulness in multi-project settings without requiring projects (or themes) to provide a config like the one you’re describing.

A bit of background: I used to work on Twitter’s TechDocs team, which provided in-house documentation infrastructure for many, many teams and hundreds of documentation projects. We needed to provide readers with a single interface for things like project discovery, search, a unified landing page for all projects, etc. We used Sphinx because of the objects.inv file that I described above. Whenever a project was created or updated, we were able to parse that single file and update the global information we had about the totality of all projects (that meant a stateful MySQL registry of all page content for all projects, all page metadata, and so on).
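For a sense of what "parse that single file and update the registry" looks like, here is a minimal sketch. It is not the code we ran internally: it uses SQLite from the Python standard library as a stand-in for MySQL, and the table layout and the `sync_project` helper are hypothetical.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (or create) the registry. SQLite stands in for MySQL here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " project TEXT NOT NULL, url TEXT NOT NULL, title TEXT,"
        " PRIMARY KEY (project, url))"
    )
    return conn


def sync_project(conn, project, pages):
    """Replace everything the registry knows about one project."""
    with conn:  # one transaction: drop stale rows, then insert the new set
        conn.execute("DELETE FROM pages WHERE project = ?", (project,))
        conn.executemany(
            "INSERT INTO pages (project, url, title) VALUES (?, ?, ?)",
            [(project, p["url"], p.get("title")) for p in pages],
        )


conn = open_registry()
# First build of a hypothetical project, then a rebuild with one page removed.
sync_project(conn, "api-docs", [{"url": "auth/", "title": "Auth"},
                                {"url": "old/", "title": "Old page"}])
sync_project(conn, "api-docs", [{"url": "auth/", "title": "Authentication"}])
print(conn.execute("SELECT url, title FROM pages").fetchall())
```

Replacing a project's rows wholesale on every build keeps the registry honest about deleted pages, which is only practical when each project ships a single complete artifact.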

The problem: Sphinx is slow as a dog and we had to somewhat hackishly add Markdown support to it. Right now, I’m working on an open source version of what we had internally at Twitter. Hugo is the natural choice for my documentation site generator for a billion reasons that I shouldn’t have to rehash here (:grin:) but it would be even better if it had a native capability to generate a project-level artifact that could be used in the way I’m describing.

I’d like to see a theme do what you are proposing. Because then we’d have a really great example of how to do something interesting with custom output formats.

That might be a good interim solution. I’ll play around and let you know if I come up with anything compelling.

I think that

  1. This is a “big company problem”, and I only think it should be native if it is a complete serialization format (i.e. serialize project to disk, create project from serialized format on disk)
  2. We could make it “simpler” and easier to create metadata, but the use cases are a little vague
  3. In my head it would make sense for Twitter et al. to define some schema and put that in the “Company Doc Theme”, which then outputs /docs-meta.json or whatever
  4. This discovery service would have to know about the sites it should discover anyway, so it can easily crawl the live sites for this metadata. Much cleaner than delivering some inventory files as part of some build.
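The crawling approach in the last point could be sketched roughly like this in Python. The `docs-meta.json` path comes from the point above; the `crawl` and `fetch_json` helpers, the example URLs, and the decision to skip unreachable sites are all assumptions for illustration.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Default fetcher: HTTP GET the URL and decode the body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def crawl(site_urls, fetch=fetch_json, meta_path="docs-meta.json"):
    """Return {site_url: metadata} for every site that could be fetched."""
    found = {}
    for site in site_urls:
        try:
            found[site] = fetch(urljoin(site, meta_path))
        except (OSError, ValueError):
            # Unreachable site or invalid JSON: skip it and keep crawling.
            continue
    return found


# Offline demo with a stand-in fetcher; the hosts are hypothetical.
def fake_fetch(url):
    if "down" in url:
        raise OSError("connection refused")
    return {"source": url}


print(crawl(["https://docs-a.example/", "https://down.example/"], fetch=fake_fetch))
```

Passing the fetcher in as a parameter keeps the crawl loop testable without touching the network.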