@kaushalmodi I could see Colly enabling a function like getPage (à la getJSON and getCSV) that fetches a page’s raw HTML and transforms it into a map. Imagine if you could do something like this:
{{ $page := getPage "https://google.com" }}
{{ with (index $page "title") }}
<h1>{{ . }}</h1>
{{ end }}
{{ with (index $page "body") }}
<p>{{ . }}</p>
{{ end }}
{{ with (index $page "meta") }}
<ul>
{{ range . }}
<li>{{ .key }} | {{ .value }}</li>
{{ end }}
</ul>
{{ end }}
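For illustration, here is a minimal Go sketch of the transformation such a getPage function might perform: raw HTML in, template-friendly map out. It uses naive string searching rather than Colly, and both the function name and the map shape are assumptions, not an existing Hugo API:

```go
package main

import (
	"fmt"
	"strings"
)

// between returns the substring between the first occurrence of
// open and the following end marker, or "" if either is missing.
func between(s, open, end string) string {
	i := strings.Index(s, open)
	if i < 0 {
		return ""
	}
	s = s[i+len(open):]
	j := strings.Index(s, end)
	if j < 0 {
		return ""
	}
	return s[:j]
}

// parsePage turns raw HTML into the kind of map a hypothetical
// getPage could hand to templates. A real implementation would use
// Colly or a proper HTML parser instead of string searches.
func parsePage(html string) map[string]interface{} {
	return map[string]interface{}{
		"title": between(html, "<title>", "</title>"),
		"body":  strings.TrimSpace(between(html, "<body>", "</body>")),
	}
}

func main() {
	html := "<html><head><title>Example</title></head><body>Hello</body></html>"
	page := parsePage(html)
	fmt.Println(page["title"]) // Example
	fmt.Println(page["body"])  // Hello
}
```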
That could open up some pretty broad and interesting horizons.
I would propose that the user be allowed to specify the filters for Colly, i.e., specify which tags, etc. should be parsed into Go template maps/slices.
@Jura Sure, but you can also create a web scraper/parser using, say, Python that “steals” other people’s web content. That some people can abuse an otherwise useful thing is not, in my estimation, a reason to simply not create it.
A scraper is a tool, and as such its value depends on the goal. That is why I suggested using it to read Open Graph tags: those exist only to index a page or populate metadata about it. A .GetOpenGraph would not infringe copyright.
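To make that concrete, here is a rough Go sketch of what a .GetOpenGraph-style helper could do. The name is hypothetical; a real implementation would use a proper HTML parser rather than a regexp, and would handle attribute order and quoting:

```go
package main

import (
	"fmt"
	"regexp"
)

// getOpenGraph extracts og:* meta tags from raw HTML into a map.
// This is a naive regexp sketch of a hypothetical .GetOpenGraph;
// it only matches property-before-content, double-quoted tags.
func getOpenGraph(html string) map[string]string {
	re := regexp.MustCompile(`<meta\s+property="og:([^"]+)"\s+content="([^"]*)"`)
	og := map[string]string{}
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		og[m[1]] = m[2] // e.g. og["title"], og["image"]
	}
	return og
}

func main() {
	html := `<meta property="og:title" content="My Post">` +
		`<meta property="og:image" content="https://example.com/a.jpg">`
	og := getOpenGraph(html)
	fmt.Println(og["title"]) // My Post
	fmt.Println(og["image"]) // https://example.com/a.jpg
}
```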
It might however break if the site is taken down for some reason, so maybe it also requires some caching of data.
With respect, this may depend on your experience in different areas of information management.
In my world, the need to collect and reprocess data is paramount. Not all of that data is presented in a reusable form, however, even though the people who need to process it can access it perfectly legally and morally. In these instances, scraping is sometimes the only cost-effective way to reuse important data.
A relatively simple example of this was helping someone from India reprocess some flood risk data. This data is published on a public web site but in a form that is of limited use. Scraping enabled this data to be repurposed into a warning system at minimal cost.
That was related to an open source project. In my professional world, this kind of thing also crops up from time to time. Often where the cost and timescales for altering a source system outweighs the benefits but where the benefits are still significant.
I kind of agree, but… since I am currently trying to disable third-party tracking from social media embeds, I think that scraping is not that black and white.
I’ll talk about Instagram because this social network has a set of particularly annoying challenges.
Their API is a moving target: they disable parts of it as they please, with no warning and no replacement. The URLs they serve expire after 24 hours, so requests to their API must be made with JS or cron jobs.
In its TOS, Instagram treats user content as if it were Instagram’s own asset.
But the user content on their platform is public content owned by the users who created it, and it should be freely accessible on the World Wide Web.
Instead, Instagram has locked that user content down to profit from it, and whenever it does present it to the world (via its iframe embeds) it does so through a slew of privacy-invading tech (like cookies with a 20-year expiry date).
Next week, this Wild West data-mining mentality will be outlawed in Europe.
A tool like Colly combined with Hugo could help bring those damned Social Media Silos down once and for all.
This service requires signup to generate an API key.
The free tier allows 5,000 requests per month and 20 requests per hour, which is more than enough for a static site.
It’s brilliant.
I just tested it with Instagram URLs, and the JSON response exposes more useful information from the Open Graph meta tags than the official Instagram oEmbed endpoint does.