How to use findRESubmatch

chrillek · April 19, 2023, 9:01am

Hugo 0.110 brought us findRESubmatch, which would be quite useful if it were documented correctly and understandably. First, the function should be called findRESubmatches, since it finds all matches, not only one – there’s no flag like in other RE implementations to ask for a single match or all matches. Second, it does not return “a slice of strings”, but rather a “slice of slices of strings”. Here’s what I found out (which may be wrong, incomplete, etc.):

findRESubmatch is Go’s FindAllSubmatch (again this weird singular, but so be it).
If the RE doesn’t match at all, the function returns nil.
If it matches, the function returns a slice of slice of strings.
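The behaviour described in these three points can be checked in plain Go. This is a minimal sketch, assuming that findRESubmatch forwards to `FindAllStringSubmatch` with no limit on the number of matches; the helper name `findAll` is made up for illustration and is not part of Hugo.

```go
package main

import (
	"fmt"
	"regexp"
)

// findAll is a hypothetical stand-in for findRESubmatch: it returns every
// match together with its capturing groups (-1 means "no limit").
func findAll(pattern, input string) [][]string {
	re := regexp.MustCompile(pattern)
	return re.FindAllStringSubmatch(input, -1)
}

func main() {
	fmt.Println(findAll(`x`, "ab") == nil)    // true: no match yields nil
	fmt.Printf("%q\n", findAll(`b`, "ab"))    // [["b"]]
	fmt.Printf("%q\n", findAll(`a(.)`, "ab")) // [["ab" "b"]]
}
```

Even a single match comes back wrapped in an outer slice, which is why two index steps are needed to reach a capturing group.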

Example 1, no capturing groups

findRESubmatch(`b`, "ab") returns [["b"]]: one match, and its inner slice contains only the whole match because there are no capturing groups.

Example 2, one capturing group, one occurrence

findRESubmatch(`a(.)`, "ab") returns [["ab" "b"]]. Access the content of the capturing group with index (index $result 0) 1: the inner index gives you ["ab" "b"], and the outer one retrieves the content of the first capturing group from that, which is b in this case.

Example 3, one capturing group, two occurrences

findRESubmatch(`a(.)`, "abac") returns [["ab" "b"] ["ac" "c"]]. You’d use index (index $result 0) 1 to access the first capturing group of the first match, index (index $result 1) 1 for the first capturing group of the second match, etc.

Example 4, named capturing group

findRESubmatch(`a(?P<letter>.)`, "abac") behaves exactly like an unnamed capturing group, i.e. the name is not available in the match. Interestingly, the Go documentation keeps mum about that, too. Using a dict in that case would’ve been nice.

x matches, y capturing groups

That results in a slice of x slices, each containing y+1 strings. The first string is always the current match, i.e. what the whole RE matches. The rest are the subgroup matches.

Nested capturing groups

findRESubmatch(`a(.(.)(d))`, "abcd") returns [["abcd" "bcd" "c" "d"]], thus the nested capturing groups from the outside to the inside. That’s consistent with their numbering.
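The numbering follows the position of each opening parenthesis, which a short Go check makes visible. `firstMatch` is a hypothetical helper used only for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch returns the whole match followed by its capturing groups,
// or nil if the pattern does not match.
func firstMatch(pattern, input string) []string {
	return regexp.MustCompile(pattern).FindStringSubmatch(input)
}

func main() {
	// Group 1 opens first and spans "bcd"; groups 2 and 3 are nested inside it.
	fmt.Printf("%q\n", firstMatch(`a(.(.)(d))`, "abcd")) // ["abcd" "bcd" "c" "d"]
}
```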

Left to right – really?

Hugo’s as well as Go’s documentation use the terms “leftmost” (for the first match) and “left to right” (for the order of matches) in their description of the RE functions. In my opinion, this is misleading, as “leftmost match” makes sense only for left-to-right writing systems. In a right-to-left writing system (Arabic, Hebrew, at least), the first match should be the _right_most, and the order in which matches are returned should be right to left.

Either the wording in the documentation is correct, then the RE matching behaves strangely in certain locales. Or the RE matching works ok, then the wording is wrong.

Feel free to use that in the documentation. I raised an issue about its current state here:

github.com/gohugoio/hugoDocs

Improve documentation of findRE and findRESubmatch

opened 04:08PM - 18 Apr 23 UTC

chrillek

**findRE**: `The syntax of the regular expression is the same general syntax used by Perl, Python, and other languages.` This is at best misleading, at worst plain wrong. (Hu)Go's RE language is missing a lot of features other languages offer. At the very least, the sentence should be amended by `, but not all of the features from those languages are implemented`.

**findRESubmatch**: The example code is probably broken – what are the `§` signs supposed to stand for? If they mean anything, _what_? And what does this sentence mean?

> In Hugo 0.110.0 we added a variant of findRE that returns a slice of strings holding the text of the leftmost match of the regular expression in s and the matches, if any, of its subexpressions.

Where is `s`? What is `s`? What is the "leftmost match of the regular expression" – is that the _first_ match? Is it the first capturing group? What are "subexpressions" – capturing groups? In my opinion, introducing new terminology like "subexpression" does not help to clarify things.

Simple question: If I have the string `"ab"` and use `` `a+(.*)` `` to match everything but the leading a, how can I access the content of the capturing group? What seems to work is something like ``index (index (findRESubmatch `a+(.*)` "ab") 0) 1``, which seems to indicate that the return value of `findRESubmatch` is _not_ a "slice of strings" but a "slice containing a single element which is a slice of strings", something like [["ab" "b"]] in the example case. The first element of the inner slice would be the complete match, followed by the contents of the matched capturing groups.

Perhaps it's technically necessary to return a slice of slices, but it certainly looks a bit awkward. In any case, the text should be amended to clarify how this return value is structured and how users can access the capturing group matches. Preferably with a _simple_ example that doesn't require deeper knowledge of regular expressions.

Tom_Durand · April 19, 2023, 4:55pm

I appreciate that someone is laying out how impractical REs are at the moment.
At this point, allowing calls to external commands (sed) within Hugo should be considered, as it would produce much more readable code, with much more straightforward expressions than index (index $result 0) 1!
We rarely need all the matches and submatches at once. Usually people want to extract subgroups or rearrange them, and in that case the number one useful feature, named groups, is missing.
Thank you chrillek, I’ll find findRESubmatch very useful from now on.

bep · April 19, 2023, 6:54pm

This maps directly to Go’s FindReSubMatch, which is in line with other template funcs that are just very shallow wrappers.

chrillek · April 20, 2023, 8:16am

You’re probably referring to FindAllSubmatch. And yes, I noticed that Hugo uses only a very shallow wrapper.
What about the other points:

FindRESubmatch returning a slice of slice of strings, not a slice of strings?
It working from left to right only in left-to-right locales, which makes the usage of “left-most” and “from left to right” incorrect?

Tom_Durand · April 20, 2023, 11:17am

From a logical standpoint it makes perfect sense: a slice of all matches, each represented by a slice of its submatches.
But the end user simply needs more than just a shallow wrapper.
We need something like function FindRe (source: string; regexp: string; MatchNumber: Positive; MatchingGroupNamed: string; MatchingGroupNumbered: Positive) return Table_of_result
with a few wrappers in case we need to return a single string, or input a slice of groups instead of a string or number, etc.
It’s Ada/pseudocode but understandable enough. All of these would be wrappers around the current findRE or FindAllSubmatch.
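To make the proposal concrete, here is a rough Go sketch of one such wrapper. It is not an existing Hugo or Go function: the name `findGroup`, its parameters, and the 1-based match number (mirroring `Positive` in the pseudocode) are all assumptions for illustration. It relies on `SubexpIndex`, which the standard `regexp` package has offered since Go 1.15.

```go
package main

import (
	"fmt"
	"regexp"
)

// findGroup returns the text captured by the named group in the
// matchNumber-th match (1-based). The boolean is false if the pattern is
// invalid, the group name is unknown, or there are too few matches.
func findGroup(source, pattern string, matchNumber int, groupName string) (string, bool) {
	re, err := regexp.Compile(pattern)
	if err != nil || matchNumber < 1 {
		return "", false
	}
	idx := re.SubexpIndex(groupName) // -1 if there is no group with that name
	if idx < 0 {
		return "", false
	}
	// Ask for at most matchNumber matches; later ones are not needed.
	matches := re.FindAllStringSubmatch(source, matchNumber)
	if len(matches) < matchNumber {
		return "", false
	}
	return matches[matchNumber-1][idx], true
}

func main() {
	v, ok := findGroup("abac", `a(?P<letter>.)`, 2, "letter")
	fmt.Println(v, ok) // c true
}
```

A variant taking a group number instead of a name would only differ in how `idx` is obtained.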

chrillek · April 20, 2023, 11:20am

Hugo is open source. You can add what you’re missing.

Btw: Hugo’s approach to find regular expressions is very similar to JavaScript’s. Which seems to work for a lot of people.

Tom_Durand · April 20, 2023, 12:16pm

Well, not really… A surface-level understanding of one language is the limit of my capability.
I didn’t know about JavaScript. So a lot of people don’t think the way I do, no surprise. I just gave some user feedback; if most people are pleased, fine by me.

jmooring · April 20, 2023, 4:02pm

I will revisit this in an attempt to combine descriptions from both FindStringSubmatch and FindAllStringSubmatch.

The frame of reference is the string (slice of bytes), not how the string is displayed (LTR or RTL). This is true for all^[1] programming languages. For example, Go’s strings.TrimLeft operates on the left side of the slice of bytes, regardless of LTR/RTL display mode. The same is true with Python’s str.lstrip function.

You could petition the Go team to rename their functions (e.g., strings.TrimStart instead of strings.TrimLeft ^[2]) and descriptions (e.g., “leading” instead of “left”), but I wouldn’t hold my breath.

From a documentation standpoint, unless there’s a gross error, we follow Go’s lead when wrapping their functions.

[1] Maybe there’s an oddball exception out there, something akin to Brainf___.
[2] Precedent: JS aliases trimLeft() to trimStart().

jmooring · April 20, 2023, 5:37pm

See https://gohugo.io/functions/findresubmatch/.

chrillek · April 20, 2023, 5:40pm

Thanks a lot! That’s really a huge improvement.
