Expression to extract urls with findRE

Tom_Durand · April 18, 2023, 9:55am

Hi, what Go regular expression should I use to extract the URL from each of these expressions?

[aol](www.aol.fr)
[aol](http://www.aol.fr)
[aol](https://www.aol.fr)

I thought of using (?:\[[^]]*]\()([^]]*)(?:\)) but neither non-capturing groups nor look-ahead expressions work.

chrillek · April 18, 2023, 10:35am

Go (and consequently Hugo) supports only a limited subset of regular expression syntax; look-ahead and look-behind, for instance, are not part of it.

This \[.*?\]\((.*?)[\) ] does the trick: no non-capturing group (and those are supposed to work!), just a non-greedy match for the [...]( part, followed by a capturing group matching .* non-greedily, followed by a closing parenthesis or a space (for image URLs). Beware, though, that this expression will also catch image URLs; you might want to take care of that.

With REs, as with everything else, simpler is better. In your example, there’s no need whatsoever for non-capturing groups. Nor for Look-ahead/behind expressions. And yes, non-capturing groups do work. It might be worth it to double-check before stating that a product is at fault.
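
For anyone who wants to check this outside Hugo, here is a minimal Go sketch. It assumes Hugo hands the pattern to Go's standard regexp package unchanged; the helper names linkTarget and compiles are made up for illustration and are not part of Hugo.

```go
package main

import (
	"fmt"
	"regexp"
)

// linkPattern is the expression from the reply above: link text in brackets,
// then the target captured up to the first closing parenthesis or space.
var linkPattern = regexp.MustCompile(`\[.*?\]\((.*?)[\) ]`)

// linkTarget (hypothetical helper) returns the captured target, or "" if the
// input contains no Markdown-style link.
func linkTarget(s string) string {
	m := linkPattern.FindStringSubmatch(s)
	if m == nil {
		return ""
	}
	return m[1] // m[0] is the whole match, m[1] the first capture group
}

// compiles reports whether Go's regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	for _, s := range []string{
		"[aol](www.aol.fr)",
		"[aol](http://www.aol.fr)",
		"[aol](https://www.aol.fr)",
	} {
		fmt.Println(linkTarget(s))
	}
	// Non-capturing groups are accepted by Go's regexp package.
	fmt.Println(compiles(`(?:\[[^]]*]\()([^]]*)(?:\))`))
	// Look-ahead is not supported, so compilation fails.
	fmt.Println(compiles(`\[[^]]*]\((?=www)`))
}
```

The capture group is only reachable through a submatch call; a function that returns whole matches will never expose it on its own.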

Tom_Durand · April 18, 2023, 10:49am

I know the expression is valid; I checked it on https://regex101.com/. The question is: how do I use it with findRE?
Currently this:

{{ with $cite }}
{{ $inter := index (findRE `(?:\[[^]]*]\()([^]]*)(?:\))` .) 1 }}
{{with $inter}}{{ printf "cite=\"%s\"" . | safeHTMLAttr }}
{{ end }}{{ end }}

does not work with this

```quote{cite=“[2021, Sociosexual behaviour in wild chimpanzees occurs in variable contexts and is frequent between same-sex partners](Sociosexual behaviour in wild chimpanzees occurs in variable contexts and is frequent between same-sex partners in: Behaviour Volume 158 Issue 3-4 (2021))”}
```

because $inter is empty, while the regexp is correct.
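
A likely explanation, sketched in Go under the assumption that findRE behaves like Go's FindAllString (it returns a slice of whole matches, not of capture groups). With a single link in the input there is exactly one match, so asking for index 1 finds nothing, and index 0 would hold the entire link rather than the URL. The function name wholeMatches is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// The pattern from the template above.
var re = regexp.MustCompile(`(?:\[[^]]*]\()([^]]*)(?:\))`)

// wholeMatches mimics what findRE is assumed to do: it returns every
// complete match of the pattern and ignores the capture groups.
func wholeMatches(s string) []string {
	return re.FindAllString(s, -1)
}

func main() {
	matches := wholeMatches("[moteur](www.aol.fr)")
	fmt.Println(len(matches)) // number of matches, not number of groups
	fmt.Println(matches[0])   // the whole link, brackets and parentheses included
}
```

Non-capturing groups change which groups are numbered, not what counts as the match, so they cannot trim the result of a whole-match function.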

chrillek · April 18, 2023, 11:03am

Well, if everything is ok … I guess you’ll have to check again.

Tom_Durand · April 18, 2023, 11:11am

What I meant is that I don’t see the point of your advice: why tell me to use expressions in Hugo that do not work with Hugo? Non-capturing groups are recognized (even though I can’t make use of them with findRE), but look-aheads aren’t.

chrillek · April 18, 2023, 11:18am

I didn’t say to use look-ahead/behind. I said you don’t need it here.

Tom_Durand · April 18, 2023, 11:44am

Beyond findRE giving a slice of all matches and this precise regexp working, the code below still produces empty cite expressions. To be precise, it doesn’t enter the {{with $inter}}{{end}} part. I would appreciate it if someone could correct this; I didn’t find the documentation of findRE very helpful, so this is beyond my understanding.

{{ $author := .Attributes.author  }}
{{ $class   := .Attributes.class }}
{{ $complex   := or (or $class (or $author (or .Attributes.cite .Attributes.id))) (not (in $class "simple")) }}
{{ if $complex }}<figure {{with .Attributes.id}}id={{.}} {{end}}class="non-picture {{with $class}}{{.}}{{end}}">{{end}}
<blockquote

{{ with .Attributes.cite }}
{{ $inter := index (findRE `\[.*?\]\((.*?)[\) ]` .) 0 }}
{{with $inter}}{{ printf "cite=\"%s\"" . | safeHTMLAttr }}
{{ end }}{{ end }}

>{{ .Inner|$.Page.RenderString }}</blockquote>
{{if $complex }}{{ if or $author .Attributes.cite }}<figcaption class=cite_class>{{with $author}}{{.|$.Page.RenderString }}{{end}}{{with .Attributes.cite }}<cite> in {{.|$.Page.RenderString}}</cite>{{end}}</figcaption>{{end}}
</figure>{{end}}

What should I change to successfully extract URLs and put them in cite="..."? An example:

```quote{cite="[moteur](www.aol.fr)"}
random junk
```

This should produce a cite="www.aol.fr" HTML attribute.

chrillek · April 18, 2023, 12:12pm

Have a look at the Hugo documentation around findRE. I’m fairly confident that you’ll find an answer to your “royally ignored” question.
For me, your attitude makes trying to help you not very fun. Therefore, I’m out here.

Tom_Durand · April 18, 2023, 12:34pm

Best I could come up with:

{{ with .Attributes.cite }}
{{ $inter := replaceRE `(?:\()([^]]*)(?:\))` "$1" (index (findRE `(?:\[[^]]*]\()([^]]*)(?:\))` .) 0) }}
{{with $inter}}{{ printf "cite=\"%s\"" . | safeHTMLAttr }}
{{ end }}{{ end }}

This still gives .Attributes.cite without modification if it contains a link.
In my opinion there should be more and simpler functions, something akin to findRESubmatch, which would produce a slice of the matching groups. Use of regexps should be as straightforward as possible, because as it is, it’s not for normal folks, even courageous ones.

andrewd72 · April 19, 2023, 11:44am

What do you want as the result in your example?
You want www.aol.fr with the scheme trimmed?

If that is it I’d probably just use trim

I haven’t tested to see if split can split on more than one character, but you could try a split on "//".

chrillek · April 19, 2023, 12:15pm

trim does not make sense when the prefix is not fixed (no pun intended). The OP does need an RE if their code is supposed to work with varying strings.

findRE will not help them there, findRESubmatch would.
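
To illustrate the difference, here is a short Go sketch. It assumes findRESubmatch mirrors Go's FindAllStringSubmatch, where each match is itself a slice: element 0 is the whole match and element 1 the first capture group. The simplified pattern and the function name targets are illustrative choices, not taken from the thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// linkRE captures everything between "](" and the next ")".
var linkRE = regexp.MustCompile(`\[[^]]*]\(([^)]*)\)`)

// targets (hypothetical helper) returns the first capture group of every
// Markdown-style link found in s.
func targets(s string) []string {
	var out []string
	for _, m := range linkRE.FindAllStringSubmatch(s, -1) {
		out = append(out, m[1]) // m[0] would be the whole "[text](url)" match
	}
	return out
}

func main() {
	fmt.Println(targets("[moteur](www.aol.fr) and [other](https://www.aol.fr)"))
}
```

In a template, the equivalent would presumably be two nested index calls on the result: the first to pick the match, the second to pick the group.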

andrewd72 · April 19, 2023, 12:54pm

Given the examples of needing to trim http:// and https:// it seems simple enough.
Unless the examples are not complete I don’t see the issue?

Tom_Durand · April 19, 2023, 12:57pm

The prefix is fixed, thanks to my links hook: it always starts with either http:// or https://.
I had not thought of those functions! It’s as the saying goes: if your only tool is a hammer, every problem becomes a nail.
How about this?

{{ $cite := strings.TrimPrefix (index (findRE "!?\[.*\]\((https?://)?(www\.)?" .Attributes.cite) 0) .Attributes.cite }}
{{ $cite = strings.TrimSuffix ")" $cite }}

That way I could even put in an image or a detail shortcode call and extract the URL, like I always wanted. Pretty funky, but as long as the cite attribute is valid, who cares?
But for now it says “syntax error” for the first line, even after rearranging. I don’t see what’s wrong.

Error: add site dependencies: load resources: loading templates: “/home/drm/WEBSITE/themes/hugo-book/layouts/_default/_markup/render-codeblock-quote.html:3:1”: parse failed: template: _default/_markup/render-codeblock-quote.html:3: invalid syntax

Usually they’re more verbose. So I can assume it’s not a matter of a missing parenthesis?
And yes, in blabka I want to extract the URI, so the link (or image or whatever, no matter) can still appear while the cite attribute of the blockquote element is still a URL.
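
One plausible cause of the terse "invalid syntax" message, assuming the pattern was written in a double-quoted template string with backslash escapes such as \[ or \(: Go templates unquote double-quoted strings with Go's string-literal rules, where those escapes are not valid, whereas backquoted (raw) strings pass backslashes through. The sketch below uses Go's text/template directly rather than Hugo, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// parseError parses src as a Go template and returns the error text,
// or "" when parsing succeeds.
func parseError(src string) string {
	if _, err := template.New("demo").Parse(src); err != nil {
		return err.Error()
	}
	return ""
}

// failsWithInvalidSyntax reports whether parsing src fails with the terse
// "invalid syntax" message.
func failsWithInvalidSyntax(src string) bool {
	return strings.Contains(parseError(src), "invalid syntax")
}

func main() {
	// Double-quoted template string: \[ is not a valid escape sequence.
	quoted := `{{ "!?\[.*\]\(" }}`
	// Backquoted template string: backslashes are kept as they are.
	raw := "{{ `!?\\[.*\\]\\(` }}"

	fmt.Println(failsWithInvalidSyntax(quoted))
	fmt.Println(parseError(raw) == "")
}
```

If that is indeed what happened here, wrapping the pattern in backquotes instead of double quotes should get past the parser; whether the expression then does what is wanted is a separate question.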

andrewd72 · April 19, 2023, 1:09pm

I don’t see the need for a regexp at all if it is just those two:
if the string contains the prefix, then trim it.

This also works as far as I can tell; maybe there is some weird URL that could break it, but it works for the examples.

{{ last 1 (split "https://www.test.com" "://") }}

Tom_Durand · April 19, 2023, 1:51pm

Thanks, it works perfectly:

<blockquote {{if in .Attributes.cite "]("}}{{safeHTMLAttr (print "cite=\"" (strings.TrimSuffix ")" (index (last 1 (split .Attributes.cite "](")) 0)) "\"") }}{{end}}>

I wonder, though: isn’t there a simpler way than index (last 1 (split .Attributes.cite "](")) 0)? It looks stupidly convoluted. The function string can fuse all strings of an array into a string, as merge does for maps.

andrewd72 · April 19, 2023, 2:28pm

Sorry, I assumed the separation of text and link, as in [text](link), was handled by Markdown and you just wanted to trim the scheme.
Should still be able to do it this way but you need to account for the case with no scheme present.

Tom_Durand · April 19, 2023, 3:04pm

Here you go:

<blockquote {{if in .Attributes.cite "]("}}{{safeHTMLAttr (print "cite=\"" (strings.TrimSuffix ")" (index (last 1 (split .Attributes.cite "](")) 0)) "\"") }}{{else if or (in .Attributes.cite "www") (in .Attributes.cite "http") }}{{safeHTMLAttr (print "cite=\"" .Attributes.cite "\"") }}{{end}}>

If the attribute contains a Markdown link → extract its target; else if it looks like a bare URL → use it as is; else no cite attribute.
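
The same three branches can be written out in Go to make the logic easier to follow. This is only a sketch of the template above, not Hugo code; the function name citeURL is made up, and it assumes the template's in test is a plain substring check.

```go
package main

import (
	"fmt"
	"strings"
)

// citeURL (hypothetical name) mirrors the three branches of the template:
// a Markdown link, a bare URL, or nothing usable.
func citeURL(attr string) (string, bool) {
	if strings.Contains(attr, "](") {
		// Keep what follows the last "](" and drop the closing parenthesis.
		parts := strings.Split(attr, "](")
		return strings.TrimSuffix(parts[len(parts)-1], ")"), true
	}
	if strings.Contains(attr, "www") || strings.Contains(attr, "http") {
		return attr, true // already a bare URL, use it unchanged
	}
	return "", false // no cite attribute should be emitted
}

func main() {
	for _, attr := range []string{
		"[moteur](www.aol.fr)",
		"https://www.aol.fr",
		"Some printed book",
	} {
		url, ok := citeURL(attr)
		fmt.Println(url, ok)
	}
}
```

Note that the second branch is loose: any text that merely contains "www" or "http" is treated as a URL, so a stricter check may be worth adding if free-form citations are expected.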

system · April 21, 2023, 3:04pm

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.
