BuiltWith data shows that 270K+ websites were built with Hugo, and 150K+ of them are currently live. Is this data correct?
Is there any way to know how many websites are being built with Hugo on a regular basis?
No. They don’t phone home.
I don’t think so, as those numbers sound too high to me. I have tweeted those stats before from the Hugo account, but I have been careful to say “According to BuiltWith …”
haha. Makes sense.
Hey all, first post.
I was just thinking after seeing this, that it would be a neat feature to be able to opt-in to “phone home” to contribute to such stats.
Perhaps this is a feature that could be added?
They already have it: `disableHugoGeneratorInject = false` (the default) injects a meta tag with your current version of Hugo, e.g. `<meta name="generator" content="Hugo 0.115.4">`
or something like that.
1. Using Search Engine APIs
Instead of manually collecting a list of URLs, you could use a search engine API to get a list programmatically. For example, you can use Google’s Custom Search JSON API to search for websites that may be built with Hugo. The query string can include specific keywords, meta tags, or any text usually found in Hugo websites to narrow down the search.
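For example, here is a minimal sketch using Google's Custom Search JSON API. The API key and search engine ID are placeholders you would create yourself in Google's consoles, and the query string is just one possible Hugo-related search:

```python
# Minimal sketch: collect candidate URLs via Google's Custom Search JSON API.
# API_KEY and SEARCH_ENGINE_ID are placeholders for credentials you create yourself.
import requests

API_KEY = "your-api-key"
SEARCH_ENGINE_ID = "your-cx-id"

def search_candidates(query, start=1):
    # The Custom Search JSON API returns at most 10 results per request;
    # "start" pages through them.
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query, "start": start},
        timeout=10,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

urls = search_candidates('"Powered by Hugo"')
```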
2. Distributed Scraping
When dealing with a large number of URLs, a single-threaded scraper will be slow. You can use distributed scraping where multiple machines are involved in scraping, each taking a chunk of URLs to process. Python libraries like Celery can help with distributing tasks.
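A rough sketch of what a Celery worker for this could look like, assuming a RabbitMQ broker running locally (the generator-tag check is just a simple placeholder heuristic):

```python
# Minimal sketch: fan scraping work out to many Celery workers.
# Assumes a RabbitMQ broker reachable at the URL below.
import requests
from celery import Celery

app = Celery("hugo_scraper", broker="amqp://guest:guest@localhost//")

@app.task(rate_limit="10/m")
def check_url(url):
    # Each worker fetches one page and reports whether it looks like Hugo.
    html = requests.get(url, timeout=10).text
    return {"url": url, "is_hugo": 'name="generator" content="Hugo' in html}

# Producer side: enqueue a chunk of URLs for the workers to process.
# for url in urls:
#     check_url.delay(url)
```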
3. Intelligent Rate Limiting and Error Handling
You’d want to be respectful to web servers, which means keeping an eye on the rate at which you’re making requests. Intelligent rate limiting could include random intervals between requests or exponentially increasing intervals when encountering errors. Some websites might block or limit your IP address if you make requests too quickly, and you may want to consider rotating IP addresses (while still respecting terms of service).
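Something like this works as a starting point; the delay and backoff values are arbitrary examples:

```python
# Minimal sketch: polite fetching with a random delay between requests and
# exponential backoff on errors.
import random
import time
import requests

def polite_get(url, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            # Random pause so requests are not sent at a fixed rhythm.
            time.sleep(random.uniform(1.0, 3.0))
            return resp
        except requests.RequestException:
            # Back off exponentially: 2s, 4s, 8s, ...
            time.sleep(2 ** (attempt + 1))
    return None
```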
4. Advanced HTML Parsing
Instead of just looking for specific text or tags, you could use advanced HTML parsing techniques to identify more subtle signs that a website is using Hugo. For example, you could look for a combination of HTML structure, meta tags, and specific JavaScript or CSS files.
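One way to do this is a simple scoring approach. In the sketch below, only the generator meta tag is a known Hugo marker; the other asset names are illustrative guesses, not definitive Hugo signatures:

```python
# Minimal sketch: combine several weak signals instead of relying on a single marker.
from bs4 import BeautifulSoup

def looks_like_hugo(html):
    soup = BeautifulSoup(html, "html.parser")
    score = 0

    # Strongest signal: the generator meta tag Hugo injects by default.
    meta = soup.find("meta", attrs={"name": "generator"})
    if meta and "hugo" in meta.get("content", "").lower():
        score += 3

    # Weaker signals: RSS links and asset paths seen in some Hugo themes.
    if soup.find("link", attrs={"type": "application/rss+xml"}):
        score += 1
    for tag in soup.find_all(["script", "link"]):
        src = tag.get("src") or tag.get("href") or ""
        if "/livereload.js" in src or src.startswith("/css/style.min."):
            score += 1

    return score >= 3
```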
5. Machine Learning
For very large-scale scraping, you could use machine learning models trained on a set of Hugo and non-Hugo websites. Features could include HTML structure, presence of specific tags or text, or even more advanced features like the use of specific JavaScript libraries. This model could then predict the likelihood that a given website is built using Hugo.
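As a rough illustration with scikit-learn, assuming you already have a labelled set of Hugo and non-Hugo pages; the feature choices here are just examples:

```python
# Minimal sketch: a classifier over hand-crafted page features, trained on a
# labelled corpus of Hugo / non-Hugo pages (X and y are placeholders).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def extract_features(html):
    lower = html.lower()
    return [
        int('name="generator" content="hugo' in lower),  # explicit Hugo marker
        int("application/rss+xml" in lower),             # RSS feed link present
        lower.count("<article"),                         # structural hint
        len(html),                                       # crude page-size feature
    ]

# X: feature vectors for your labelled pages, y: 1 for Hugo, 0 otherwise.
# X = [extract_features(h) for h in pages]
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
# print("held-out accuracy:", clf.score(X_test, y_test))
```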
6. Data Storage and Analysis
You might need a database to store your findings, especially if you’re dealing with a large set of URLs. SQL databases like MySQL or NoSQL databases like MongoDB could be suitable depending on your needs. You could then run analyses on your database to answer various questions, like the prevalence of Hugo sites in different domains, countries, etc.
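For instance, with MongoDB via pymongo; the database name, field names, and the per-TLD aggregation below are made up for the example:

```python
# Minimal sketch: store scan results in MongoDB and run a simple aggregation.
# Assumes a local MongoDB instance; "hugo_scan" and its fields are example names.
from urllib.parse import urlparse
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
sites = client["hugo_scan"]["sites"]

def save_result(url, is_hugo, generator=None):
    # Upsert so re-scanning a URL just refreshes its record.
    sites.update_one(
        {"url": url},
        {"$set": {
            "is_hugo": is_hugo,
            "generator": generator,
            "tld": urlparse(url).hostname.rsplit(".", 1)[-1],
        }},
        upsert=True,
    )

# Example analysis: confirmed Hugo sites per top-level domain.
pipeline = [
    {"$match": {"is_hugo": True}},
    {"$group": {"_id": "$tld", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
# for row in sites.aggregate(pipeline):
#     print(row)
```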
7. Legal and Ethical Considerations
Make sure to respect robots.txt files, terms of service, and any other rules set by website owners. Also, make your scraper as gentle as possible, to avoid putting any undue strain on the websites you’re scraping.
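Checking robots.txt before each fetch is straightforward with Python's standard library; the user-agent string below is a made-up example:

```python
# Minimal sketch: consult robots.txt before fetching a URL.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "hugo-census-bot"  # hypothetical user-agent for your scraper

def allowed_by_robots(url):
    parts = urlparse(url)
    robots_url = urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt")
    parser = RobotFileParser(robots_url)
    try:
        parser.read()
    except OSError:
        # If robots.txt is unreachable, err on the side of caution.
        return False
    return parser.can_fetch(USER_AGENT, url)
```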
Example Tech Stack:
- URL Collection: Google Search API, Bing Search API
- Task Distribution: Celery with a message broker like RabbitMQ
- Web Scraping: Python libraries like Requests, Selenium for dynamic content
- HTML Parsing: BeautifulSoup, lxml
- Machine Learning: scikit-learn for building predictive models
- Data Storage: MySQL or MongoDB for storing results
- Rate Limiting and Proxy Rotation: Custom logic in Python, or libraries like Scrapy with middlewares
Combining these techniques can result in a very efficient, accurate, and large-scale scraper to identify Hugo-based websites.
Neat, I will probably be flipping that on next time I update my site.
I use BuiltWith and its data is mostly reliable. When you are on a paid plan you can actually export the list of all 150k+ websites and see for yourself.
So, in short, this number is roughly correct.
Thanks