BuiltWith data shows that 270K+ websites were built with Hugo, and 150K+ of them are currently live. Is this data correct?
Is there any way to know how many websites are being built with Hugo on a regular basis?
No. They don’t phone home.
I don’t think so, as those numbers sound too high to me. I have tweeted those stats before from the Hugo account, but I have been careful to say “According to BuiltWith …”
haha. Makes sense.
Hey all, first post.
I was just thinking after seeing this, that it would be a neat feature to be able to opt-in to “phone home” to contribute to such stats.
Perhaps this is a feature that could be added?
They already have it: `disableHugoGeneratorInject = false` (the default) injects a meta tag with your current version of Hugo, e.g. `<meta name="generator" content="Hugo 0.115.4">`
or something like that.
1. Using Search Engine APIs
Instead of manually collecting a list of URLs, you could use a search engine API to get a list programmatically. For example, you can use Google’s Custom Search JSON API to search for websites that may be built with Hugo. The query string can include specific keywords, meta tags, or any text usually found in Hugo websites to narrow down the search.
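For example, here is a minimal sketch using Google's Custom Search JSON API. The API key and search engine ID are placeholders you would create yourself in Google's consoles, and the query string is just one possible Hugo-related search:

```python
# Minimal sketch: collect candidate URLs via Google's Custom Search JSON API.
# API_KEY and SEARCH_ENGINE_ID are placeholders for credentials you create yourself.
import requests

API_KEY = "your-api-key"
SEARCH_ENGINE_ID = "your-cx-id"

def search_candidates(query, start=1):
    # The Custom Search JSON API returns at most 10 results per request;
    # "start" pages through them.
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query, "start": start},
        timeout=10,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

urls = search_candidates('"Powered by Hugo"')
```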
2. Distributed Scraping
When dealing with a large number of URLs, a single-threaded scraper will be slow. You can use distributed scraping where multiple machines are involved in scraping, each taking a chunk of URLs to process. Python libraries like Celery can help with distributing tasks.
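A rough sketch of what a Celery worker for this could look like, assuming a RabbitMQ broker running locally (the generator-tag check is just a simple placeholder heuristic):

```python
# Minimal sketch: fan scraping work out to many Celery workers.
# Assumes a RabbitMQ broker reachable at the URL below.
import requests
from celery import Celery

app = Celery("hugo_scraper", broker="amqp://guest:guest@localhost//")

@app.task(rate_limit="10/m")
def check_url(url):
    # Each worker fetches one page and reports whether it looks like Hugo.
    html = requests.get(url, timeout=10).text
    return {"url": url, "is_hugo": 'name="generator" content="Hugo' in html}

# Producer side: enqueue a chunk of URLs for the workers to process.
# for url in urls:
#     check_url.delay(url)
```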
3. Intelligent Rate Limiting and Error Handling
You’d want to be respectful to web servers, which means keeping an eye on the rate at which you’re making requests. Intelligent rate limiting could include random intervals between requests or exponentially increasing intervals when encountering errors. Some websites might block or limit your IP address if you make requests too quickly, and you may want to consider rotating IP addresses (while still respecting terms of service).
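Something like this works as a starting point; the delay and backoff values are arbitrary examples:

```python
# Minimal sketch: polite fetching with a random delay between requests and
# exponential backoff on errors.
import random
import time
import requests

def polite_get(url, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            # Random pause so requests are not sent at a fixed rhythm.
            time.sleep(random.uniform(1.0, 3.0))
            return resp
        except requests.RequestException:
            # Back off exponentially: 2s, 4s, 8s, ...
            time.sleep(2 ** (attempt + 1))
    return None
```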
4. Advanced HTML Parsing
Instead of just looking for specific text or tags, you could use advanced HTML parsing techniques to identify more subtle signs that a website is using Hugo. For example, you could look for a combination of HTML structure, meta tags, and specific JavaScript or CSS files.
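One way to do this is a simple scoring approach. In the sketch below, only the generator meta tag is a known Hugo marker; the other asset names are illustrative guesses, not definitive Hugo signatures:

```python
# Minimal sketch: combine several weak signals instead of relying on a single marker.
from bs4 import BeautifulSoup

def looks_like_hugo(html):
    soup = BeautifulSoup(html, "html.parser")
    score = 0

    # Strongest signal: the generator meta tag Hugo injects by default.
    meta = soup.find("meta", attrs={"name": "generator"})
    if meta and "hugo" in meta.get("content", "").lower():
        score += 3

    # Weaker signals: RSS links and asset paths seen in some Hugo themes.
    if soup.find("link", attrs={"type": "application/rss+xml"}):
        score += 1
    for tag in soup.find_all(["script", "link"]):
        src = tag.get("src") or tag.get("href") or ""
        if "/livereload.js" in src or src.startswith("/css/style.min."):
            score += 1

    return score >= 3
```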
5. Machine Learning
For very large-scale scraping, you could use machine learning models trained on a set of Hugo and non-Hugo websites. Features could include HTML structure, presence of specific tags or text, or even more advanced features like the use of specific JavaScript libraries. This model could then predict the likelihood that a given website is built using Hugo.
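As a rough illustration with scikit-learn, assuming you already have a labelled set of Hugo and non-Hugo pages; the feature choices here are just examples:

```python
# Minimal sketch: a classifier over hand-crafted page features, trained on a
# labelled corpus of Hugo / non-Hugo pages (X and y are placeholders).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def extract_features(html):
    lower = html.lower()
    return [
        int('name="generator" content="hugo' in lower),  # explicit Hugo marker
        int("application/rss+xml" in lower),             # RSS feed link present
        lower.count("<article"),                         # structural hint
        len(html),                                       # crude page-size feature
    ]

# X: feature vectors for your labelled pages, y: 1 for Hugo, 0 otherwise.
# X = [extract_features(h) for h in pages]
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
# print("held-out accuracy:", clf.score(X_test, y_test))
```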
6. Data Storage and Analysis
You might need a database to store your findings, especially if you’re dealing with a large set of URLs. SQL databases like MySQL or NoSQL databases like MongoDB could be suitable depending on your needs. You could then run analyses on your database to answer various questions, like the prevalence of Hugo sites in different domains, countries, etc.
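For instance, with MongoDB via pymongo; the database name, field names, and the per-TLD aggregation below are made up for the example:

```python
# Minimal sketch: store scan results in MongoDB and run a simple aggregation.
# Assumes a local MongoDB instance; "hugo_scan" and its fields are example names.
from urllib.parse import urlparse
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
sites = client["hugo_scan"]["sites"]

def save_result(url, is_hugo, generator=None):
    # Upsert so re-scanning a URL just refreshes its record.
    sites.update_one(
        {"url": url},
        {"$set": {
            "is_hugo": is_hugo,
            "generator": generator,
            "tld": urlparse(url).hostname.rsplit(".", 1)[-1],
        }},
        upsert=True,
    )

# Example analysis: confirmed Hugo sites per top-level domain.
pipeline = [
    {"$match": {"is_hugo": True}},
    {"$group": {"_id": "$tld", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
# for row in sites.aggregate(pipeline):
#     print(row)
```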
7. Legal and Ethical Considerations
Make sure to respect robots.txt files, terms of service, and any other rules set by website owners. Also, make your scraper as gentle as possible, to avoid putting any undue strain on the websites you’re scraping.
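Checking robots.txt before each fetch is straightforward with Python's standard library; the user-agent string below is a made-up example:

```python
# Minimal sketch: consult robots.txt before fetching a URL.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "hugo-census-bot"  # hypothetical user-agent for your scraper

def allowed_by_robots(url):
    parts = urlparse(url)
    robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
    parser = RobotFileParser(robots_url)
    try:
        parser.read()
    except OSError:
        # If robots.txt is unreachable, err on the side of caution.
        return False
    return parser.can_fetch(USER_AGENT, url)
```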
Example Tech Stack:
- URL Collection: Google Search API, Bing Search API
- Task Distribution: Celery with a message broker like RabbitMQ
- Web Scraping: Python libraries like Requests, Selenium for dynamic content
- HTML Parsing: BeautifulSoup, lxml
- Machine Learning: scikit-learn for building predictive models
- Data Storage: MySQL or MongoDB for storing results
- Rate Limiting and Proxy Rotation: Custom logic in Python, or libraries like Scrapy with middlewares
Combining these techniques can result in a very efficient, accurate, and large-scale scraper to identify Hugo-based websites.
Neat, I will probably be flipping that on next time I update my site.
I use BuiltWith and its data is mostly reliable. When you are on a paid plan you can actually export the list of all 150k+ websites and see for yourself.
So, in short, this number is roughly correct.
Thanks