@kaushalmodi I could see Colly enabling a function like getPage (à la getJSON and getCSV) that fetches a page’s raw HTML and transforms it into a map. Imagine if you could do something like this:
{{ $page := getPage "https://google.com" }}
{{ with (index $page "title") }}
<h1>{{ . }}</h1>
{{ end }}
{{ with (index $page "body") }}
<p>{{ . }}</p>
{{ end }}
{{ with (index $page "meta") }}
<ul>
{{ range . }}
<li>{{ .key }} | {{ .value }}</li>
{{ end }}
</ul>
{{ end }}
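For illustration, here is a minimal Go sketch of the transformation such a getPage function might perform: raw HTML in, template-friendly map out. It uses naive string searching rather than Colly, and both the function name and the map shape are assumptions, not an existing Hugo API:

```go
package main

import (
	"fmt"
	"strings"
)

// between returns the substring between the first occurrence of
// open and the following end marker, or "" if either is missing.
func between(s, open, end string) string {
	i := strings.Index(s, open)
	if i < 0 {
		return ""
	}
	s = s[i+len(open):]
	j := strings.Index(s, end)
	if j < 0 {
		return ""
	}
	return s[:j]
}

// parsePage turns raw HTML into the kind of map a hypothetical
// getPage could hand to templates. A real implementation would use
// Colly or a proper HTML parser instead of string searches.
func parsePage(html string) map[string]interface{} {
	return map[string]interface{}{
		"title": between(html, "<title>", "</title>"),
		"body":  strings.TrimSpace(between(html, "<body>", "</body>")),
	}
}

func main() {
	html := "<html><head><title>Example</title></head><body>Hello</body></html>"
	page := parsePage(html)
	fmt.Println(page["title"]) // Example
	fmt.Println(page["body"])  // Hello
}
```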
That could open up some pretty broad and interesting horizons.
I would propose that the user be allowed to specify the filters for Colly, i.e., specify which tags, etc. should be parsed into Go template maps/slices.
@Jura Sure, but you can also create a web scraper/parser using, say, Python that “steals” other people’s web content. That some people can abuse an otherwise useful thing is not, in my estimation, a reason to simply not create it.
A scraper is a tool, and as such its value depends on the goal. That is why I suggested using it to read Open Graph tags: those exist only to index a page or populate metadata about it. A .GetOpenGraph would not infringe copyright.
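To make that concrete, here is a rough Go sketch of what a .GetOpenGraph-style helper could do. The name is hypothetical; a real implementation would use a proper HTML parser rather than a regexp, and would handle attribute order and quoting:

```go
package main

import (
	"fmt"
	"regexp"
)

// getOpenGraph extracts og:* meta tags from raw HTML into a map.
// This is a naive regexp sketch of a hypothetical .GetOpenGraph;
// it only matches property-before-content, double-quoted tags.
func getOpenGraph(html string) map[string]string {
	re := regexp.MustCompile(`<meta\s+property="og:([^"]+)"\s+content="([^"]*)"`)
	og := map[string]string{}
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		og[m[1]] = m[2] // e.g. og["title"], og["image"]
	}
	return og
}

func main() {
	html := `<meta property="og:title" content="My Post">` +
		`<meta property="og:image" content="https://example.com/a.jpg">`
	og := getOpenGraph(html)
	fmt.Println(og["title"]) // My Post
	fmt.Println(og["image"]) // https://example.com/a.jpg
}
```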
It might however break if the site is taken down for some reason, so maybe it also requires some caching of data.
With respect, this may depend on your experience in different areas of information management.
In my world, the need to collect and reprocess data is paramount. Not all of that data is presented in a reusable form, however, even though the people who need to process it can access it perfectly legally and morally. In these instances, scraping is sometimes the only cost-effective way to reuse important data.
A relatively simple example of this was helping someone from India reprocess some flood risk data. This data is published on a public web site but in a form that is of limited use. Scraping enabled this data to be repurposed into a warning system at minimal cost.
That was related to an open source project. In my professional world, this kind of thing also crops up from time to time. Often where the cost and timescales for altering a source system outweighs the benefits but where the benefits are still significant.
I kind of agree, but… since I am currently trying to disable third-party tracking from social media embeds, I think that scraping is not that black and white.
I’ll talk about Instagram because this social network has a set of particularly annoying challenges.
Their API is a moving target: they disable parts of it as they please, with no warning and no replacement. The URLs they serve expire after 24 hours, so requests to their API must be made with JS or cron jobs.
In its TOS, Instagram treats user content as if it were Instagram’s own asset.
But the user content on their platform is public content owned by the users who created it, and it should be freely accessible on the World Wide Web.
Instead, Instagram has locked that user content down to profit from it, and whenever it does present it to the world (via its iframe embeds) it does so through a slew of privacy-invading tech (like cookies with a 20-year expiry date).
Next week, this Wild West data-mining mentality will be outlawed in Europe.
A tool like Colly combined with Hugo could help bring those damned Social Media Silos down once and for all.
This service requires signup to generate an API key.
The free tier allows 5,000 requests per month and 20 requests per hour, which is more than enough for a static site.
It’s brilliant.
I just tested it with Instagram URLs, and the JSON response exposes more useful information from the Open Graph meta tags than the official Instagram oEmbed endpoint does.