What cool stuff can Hugo + Colly do?

Continuing the discussion from Tools/libraries that can drive new Hugo features:


My comment from that original thread:

If I understood correctly, this project can be used to get data even from websites that don’t expose a JSON metadata API?

I can think of using this to extract metadata like the summary, author, etc. from certain pages where I don’t want to just paste a link…

… or possibly mirror entire Reddit, HN, etc. comment threads on the associated blog post!


@brunoamaral @lucperkins @alexandros I have created this separate thread so that we don’t pollute the “tools/libraries” thread with discussion about Colly :slight_smile:

@lucperkins You are seeing exactly what I am: the possibility that Colly enables a lot more than getJSON does at the moment. :slight_smile:

1 Like

Colly would be interesting for enabling a shortcode that shows a thumbnail link, like the ones we see here on Discourse.

@kaushalmodi I could see Colly enabling a function like getPage (à la getJSON and getCSV) that fetches a page’s raw HTML and transforms it into a map. Imagine if you could do something like this:

{{ $page := getPage "https://google.com" }}
{{ with (index $page "title") }}
<h1>{{ . }}</h1>
{{ end }}
{{ with (index $page "body") }}
<p>{{ . }}</p>
{{ end }}
{{ with (index $page "meta") }}
<ul>
  {{ range . }}
  <li>{{ .key }} | {{ .value }}</li>
  {{ end }}
</ul>
{{ end }}
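
For reference, here is a minimal sketch of how such a getPage might be built on top of Colly. Nothing like this exists in Hugo today; the function name, the map keys, and the selectors are all assumptions chosen to match the template example above.

// Hypothetical getPage built on Colly; names and map keys are illustrative only.
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// getPage fetches a URL and returns a map holding the page's title,
// body text, and <meta> tags, roughly matching the template example.
func getPage(url string) (map[string]interface{}, error) {
    page := map[string]interface{}{}
    meta := []map[string]string{}

    c := colly.NewCollector()
    c.OnHTML("title", func(e *colly.HTMLElement) {
        page["title"] = e.Text
    })
    c.OnHTML("body", func(e *colly.HTMLElement) {
        page["body"] = e.Text
    })
    c.OnHTML("meta", func(e *colly.HTMLElement) {
        meta = append(meta, map[string]string{
            "key":   e.Attr("name"),
            "value": e.Attr("content"),
        })
    })
    if err := c.Visit(url); err != nil { // Visit is synchronous by default
        return nil, err
    }
    page["meta"] = meta
    return page, nil
}

func main() {
    page, err := getPage("https://google.com")
    if err != nil {
        panic(err)
    }
    fmt.Println(page["title"])
}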

That could open up some pretty broad and interesting horizons.

3 Likes

GetPage already exists and it’s for internal use.

But I agree with you that something like getURL would be awesome.

And I have moved the replies to this thread, since you made a separate topic, @kaushalmodi.

2 Likes

The question is, what sort of data could GetURL retrieve?

I would suggest starting small, with just the Open Graph tags.

I would propose that the user be allowed to specify the filters for Colly, i.e. specify which tags, etc. should be parsed into Go template maps/slices.
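
To illustrate, a rough sketch of what that could look like: the caller supplies a map of names to CSS selectors and gets back a map of scraped values. The scrapeWith function and its shape are hypothetical, not a real Hugo or Colly API.

// Sketch of the "user-specified filters" idea; all names are hypothetical.
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// scrapeWith collects, for each user-supplied CSS selector, the matching
// values into a map of name -> scraped strings.
func scrapeWith(url string, filters map[string]string) (map[string][]string, error) {
    out := map[string][]string{}
    c := colly.NewCollector()
    for name, selector := range filters {
        name := name // capture the loop variable for the callback below
        c.OnHTML(selector, func(e *colly.HTMLElement) {
            // <meta> tags carry their value in the content attribute;
            // fall back to the element's text for everything else.
            val := e.Attr("content")
            if val == "" {
                val = e.Text
            }
            out[name] = append(out[name], val)
        })
    }
    if err := c.Visit(url); err != nil {
        return nil, err
    }
    return out, nil
}

func main() {
    // For example, scrape only the Open Graph tags, per the suggestion above.
    data, err := scrapeWith("https://example.org", map[string]string{
        "opengraph": `meta[property^="og:"]`,
    })
    if err != nil {
        panic(err)
    }
    fmt.Println(data["opengraph"])
}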

As a template func this is mildly interesting.

If you look at it in combination with “dynamic page generation”, or whatever I called that issue, it gets more interesting.

1 Like

(Never mind, not relevant. :slight_smile: )

@Jura Sure, but you could also use, say, Python to create a web scraper/parser that “steals” other people’s web content. That some people can abuse an otherwise useful thing is not, in my estimation, a reason to simply not create it.

1 Like

(Revoked; never mind. :slight_smile: )

To me, scraping isn’t grey at all; it’s black, except if you are using it to scrape your own content to get an archive.

2 Likes

A scraper is a tool, and as such its use depends on our goal. That is why I suggested using it to read Open Graph tags; those are meant only for indexing or for populating metadata about the page. A .GetOpenGraph would not infringe copyright.

It might, however, break if the site is taken down for some reason, so maybe it also requires some caching of the data.
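
A rough sketch of that caching, assuming a simple file cache keyed by a hash of the URL (the directory layout and function names below are made up for illustration):

// Sketch of caching scraped data so a site still builds when the source
// page disappears; file layout and names are illustrative only.
package main

import (
    "crypto/sha256"
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
)

// cachePath derives a stable cache file name from the URL.
func cachePath(url string) string {
    sum := sha256.Sum256([]byte(url))
    return filepath.Join("resources", "_opengraph", fmt.Sprintf("%x.json", sum[:8]))
}

// cachedOpenGraph returns cached data if present; otherwise it calls
// fetch (e.g. a Colly-based scraper) and stores the result on disk.
func cachedOpenGraph(url string, fetch func(string) (map[string]string, error)) (map[string]string, error) {
    path := cachePath(url)
    if b, err := os.ReadFile(path); err == nil {
        var cached map[string]string
        if err := json.Unmarshal(b, &cached); err == nil {
            return cached, nil // serve from cache even if the site is gone
        }
    }
    data, err := fetch(url)
    if err != nil {
        return nil, err
    }
    b, err := json.Marshal(data)
    if err != nil {
        return nil, err
    }
    if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
        return nil, err
    }
    if err := os.WriteFile(path, b, 0o644); err != nil {
        return nil, err
    }
    return data, nil
}

func main() {
    og, err := cachedOpenGraph("https://example.org", func(u string) (map[string]string, error) {
        return map[string]string{"og:title": "Example"}, nil // stand-in scraper
    })
    if err != nil {
        panic(err)
    }
    fmt.Println(og["og:title"])
}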

@RickCogley is right.

With respect, this may depend on your experience in different areas of information management.

In my world, the need to collect and reprocess data is paramount. Not all of that data is presented in a reusable form, however, even though it is perfectly legally and morally accessible to the people who need to process it. In these instances, scraping is sometimes the only cost-effective method of reusing important data.

A relatively simple example of this was helping someone from India reprocess some flood risk data. This data is published on a public web site but in a form that is of limited use. Scraping enabled this data to be repurposed into a warning system at minimal cost.

That was related to an open source project. In my professional world, this kind of thing also crops up from time to time, often where the cost and timescales of altering a source system outweigh the benefits, even though the benefits are still significant.

2 Likes

I kind of agree, but… since I am currently trying to disable third-party tracking from social media embeds, I think that scraping is not that black and white.

I’ll talk about Instagram because this social network has a set of particularly annoying challenges.

Their API is a moving target. They disable parts of it as they please, with no warning and no replacement. The URLs they offer expire after 24 hours, so requests to their API must be remade with JS or cron jobs.

In its ToS, Instagram treats user content as if it were Instagram’s own asset.

But the user content on their platform is public content owned by the users who created it, and it should be freely accessible on the World Wide Web.

But no… Instagram has locked that user content down to profit from it, and whenever it presents it to the world (with its iframe embeds) it does so through a slew of privacy-invading tech (like cookies with a 20-year expiry date).

Now, next week this Wild West data-mining mentality will be outlawed in Europe.

A tool like Colly combined with Hugo could help bring those damned Social Media Silos down once and for all.

1 Like

@brunoamaral This can be done with:

https://opengraph.io

This service requires signing up to generate an API key.
The free tier has 5000 monthly requests and 20 requests per hour (that’s more than enough for a static site).

It’s brilliant.

I just tested it with Instagram URLs, and the JSON response exposes more useful information from the Open Graph meta tags than the official Instagram oEmbed endpoint does. :tada:
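
For anyone who wants to try it, here is a rough sketch of querying the service from Go. The /api/1.1/site endpoint path and the hybridGraph response field are assumptions based on opengraph.io’s docs as I remember them, so check the current docs before relying on them. (From a Hugo template, the same request could be made with getJSON.)

// Sketch of querying opengraph.io; the endpoint path and the hybridGraph
// response field are assumptions. Verify against the current docs.
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "os"
)

func main() {
    target := "https://www.instagram.com/instagram/"
    appID := os.Getenv("OPENGRAPH_IO_APP_ID") // the key you get after signup

    endpoint := fmt.Sprintf("https://opengraph.io/api/1.1/site/%s?app_id=%s",
        url.QueryEscape(target), url.QueryEscape(appID))

    resp, err := http.Get(endpoint)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Decode only the fields we care about from the JSON response.
    var result struct {
        HybridGraph struct {
            Title       string `json:"title"`
            Description string `json:"description"`
            Image       string `json:"image"`
        } `json:"hybridGraph"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        panic(err)
    }
    fmt.Println(result.HybridGraph.Title, result.HybridGraph.Image)
}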

2 Likes

Didn’t know about this, thank you @alexandros!

Yahoo Pipes was brilliant for this kind of stuff (i.e. scraping a page and transforming it into XML or whatever), but they put a lid on that one…

2 Likes

@TotallyInformation, that seems a totally legit usage, so thanks for pointing it out.

1 Like