Hugo 0.110 brought us `findRESubmatch`, which would be quite useful if it were documented correctly and understandably. Firstly, the function should be called `findRESubmatches`, since it finds all matches, not only one – there’s no flag like in other RE implementations to ask for a single match or all matches. Secondly, it does not return “a slice of strings”, but rather a “slice of slices of strings”. Here’s what I found out (which may be wrong, incomplete, etc.):
- `findRESubmatch` is Go’s `FindAllStringSubmatch` (again this weird singular, but so be it).
- If the RE doesn’t match at all, the function returns nil.
- If it matches, the function returns a slice of slices of strings.
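A minimal Go sketch of the points above, assuming that `findRESubmatch` simply hands the work to Go’s `FindAllStringSubmatch` without a limit. `findAll` is a name I made up for illustration; it is not part of Hugo.

```go
package main

import (
	"fmt"
	"regexp"
)

// findAll is a hypothetical stand-in for Hugo's findRESubmatch: it compiles
// the pattern and returns every match together with its submatches.
func findAll(pattern, input string) [][]string {
	return regexp.MustCompile(pattern).FindAllStringSubmatch(input, -1)
}

func main() {
	// No match at all: the result is a nil slice.
	fmt.Println(findAll(`x`, "ab") == nil) // true

	// One match: an outer slice holding one inner slice of strings.
	fmt.Printf("%q\n", findAll(`a(.)`, "ab")) // [["ab" "b"]]
}
```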
Example 1, no capturing groups
`` findRESubmatch `b` "ab" `` returns `[["b"]]`. Without capturing groups, each inner slice contains only the whole match.
Example 2, one capturing group, one occurrence
`` findRESubmatch `a(.)` "ab" `` returns `[["ab" "b"]]`. Access the content of the capturing group with `index (index $result 0) 1` (or, shorter, `index $result 0 1`): the inner `index` gives you `["ab" "b"]`, and the outer one retrieves the content of the first capturing group from that, which is `b` in this case.
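Since Hugo’s templates are built on Go’s template packages, the indexing can be tried out in plain Go with `text/template`. This is only a sketch: the `findRESubmatch` registered here is my own stand-in, not Hugo’s implementation, and the pattern is written with double quotes because backticks cannot appear inside a Go raw string. Note that `index` takes the collection first and the indexes after it.

```go
package main

import (
	"os"
	"regexp"
	"strings"
	"text/template"
)

// findRESubmatch is a local stand-in for the Hugo function of the same name.
func findRESubmatch(pattern, input string) [][]string {
	return regexp.MustCompile(pattern).FindAllStringSubmatch(input, -1)
}

// tmplSrc reads the first capturing group of the first match in two ways.
const tmplSrc = `{{ $result := findRESubmatch "a(.)" "ab" -}}
nested: {{ index (index $result 0) 1 }}
short: {{ index $result 0 1 }}
`

// render executes the template and returns its output as a string.
func render() (string, error) {
	t, err := template.New("demo").
		Funcs(template.FuncMap{"findRESubmatch": findRESubmatch}).
		Parse(tmplSrc)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, nil); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render()
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(out) // prints "nested: b" and "short: b"
}
```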
Example 3, one capturing group, two occurrences
`` findRESubmatch `a(.)` "abac" `` returns `[["ab" "b"] ["ac" "c"]]`. You’d use `index (index $result 0) 1` to access the first capturing group of the first match, `index (index $result 1) 1` for the first capturing group of the second match, etc.
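With several occurrences you would normally loop over the outer slice instead of indexing each match by hand. A rough Go equivalent, assuming the pattern has at least one capturing group (`firstGroups` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"regexp"
)

// firstGroups collects the content of capturing group 1 from every match.
// It assumes the pattern defines at least one capturing group; otherwise
// m[1] would be out of range.
func firstGroups(pattern, input string) []string {
	var groups []string
	for _, m := range regexp.MustCompile(pattern).FindAllStringSubmatch(input, -1) {
		groups = append(groups, m[1]) // m[0] is the whole match
	}
	return groups
}

func main() {
	fmt.Printf("%q\n", firstGroups(`a(.)`, "abac")) // ["b" "c"]
}
```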
Example 4, named capturing group
`` findRESubmatch `a(?P<letter>.)` "abac" `` behaves exactly like an unnamed capturing group, i.e. the name is not available in the match. Interestingly, the Go documentation keeps mum about that, too. Returning a `dict` in that case would’ve been nice.
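For what it’s worth, Go’s `regexp` package offers `SubexpNames`, which returns the group names separately from the matches. The sketch below combines the two into the dict-like result wished for above. It is plain Go, not something Hugo exposes as far as I can tell, and `namedMatches` is a made-up helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// namedMatches returns one map per match, keyed by capturing-group name.
// Unnamed groups (and the whole match at index 0) are skipped.
func namedMatches(pattern, input string) []map[string]string {
	re := regexp.MustCompile(pattern)
	names := re.SubexpNames() // names[0] is always the empty string
	var out []map[string]string
	for _, m := range re.FindAllStringSubmatch(input, -1) {
		named := make(map[string]string)
		for i, name := range names {
			if name != "" {
				named[name] = m[i]
			}
		}
		out = append(out, named)
	}
	return out
}

func main() {
	fmt.Println(namedMatches(`a(?P<letter>.)`, "abac")) // [map[letter:b] map[letter:c]]
}
```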
x matches, y capturing groups
That results in a slice of x slices, each containing y+1 strings. The first string is always the current match, i.e. what the whole RE matches. The rest are the subgroup matches.
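The x and y+1 rule can be checked mechanically in Go, where `NumSubexp` reports the number of capturing groups in a pattern. Again just a sketch against Go’s `regexp`, with `checkShape` as a hypothetical helper and a sample pattern of my own:

```go
package main

import (
	"fmt"
	"regexp"
)

// checkShape returns the number of matches (x), the number of capturing
// groups (y), and whether every inner slice has exactly y+1 strings.
func checkShape(pattern, input string) (x, y int, ok bool) {
	re := regexp.MustCompile(pattern)
	matches := re.FindAllStringSubmatch(input, -1)
	y = re.NumSubexp()
	ok = true
	for _, m := range matches {
		if len(m) != y+1 {
			ok = false
		}
	}
	return len(matches), y, ok
}

func main() {
	// Three matches ("ab", "ac", "ad"), two capturing groups each.
	fmt.Println(checkShape(`(a)(.)`, "abacad")) // 3 2 true
}
```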
Nested capturing groups
`` findRESubmatch `a(.(.)(d))` "abcd" `` returns `[["abcd" "bcd" "c" "d"]]`, thus the nested capturing groups appear from the outside to the inside, i.e. in the order of their opening parentheses. That’s consistent with their numbering.
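The nested case is easy to verify directly against Go’s `regexp`, again assuming Hugo passes the pattern through unchanged (`findAll` is my own helper name):

```go
package main

import (
	"fmt"
	"regexp"
)

// findAll returns all matches of pattern in input, each with its submatches.
func findAll(pattern, input string) [][]string {
	return regexp.MustCompile(pattern).FindAllStringSubmatch(input, -1)
}

func main() {
	// Group 1 is the outer group, groups 2 and 3 are nested inside it,
	// numbered by the position of their opening parenthesis.
	fmt.Printf("%q\n", findAll(`a(.(.)(d))`, "abcd")) // [["abcd" "bcd" "c" "d"]]
}
```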
Left to right – really?
Hugo’s as well as Go’s documentation use the terms “leftmost” (for the first match) and “left to right” (for the order of matches) in their descriptions of the RE functions. In my opinion, this is misleading, as “leftmost match” makes sense only for left-to-right writing systems. In a right-to-left writing system (Arabic and Hebrew, at least), the first match should be the _right_most, and the order in which matches are returned should be right to left.
Either the wording in the documentation is correct, in which case the RE matching behaves strangely for right-to-left text; or the RE matching works OK, in which case the wording is wrong.
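One way to find out which of the two it is, at least for Go itself: look at the byte offsets of the matches in a right-to-left string. The sketch below uses a Hebrew sample of my own (alef, bet, alef, gimel) and tests Go’s `regexp` only, not Hugo. `matchStarts` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchStarts returns the byte offset at which each match begins.
func matchStarts(pattern, input string) []int {
	var starts []int
	for _, loc := range regexp.MustCompile(pattern).FindAllStringIndex(input, -1) {
		starts = append(starts, loc[0])
	}
	return starts
}

func main() {
	// Alef-bet-alef-gimel, written with escapes so the source code is not
	// affected by bidirectional rendering. Each letter is 2 bytes in UTF-8.
	input := "\u05d0\u05d1\u05d0\u05d2"
	pattern := "\u05d0(.)" // alef followed by any single character

	fmt.Println(matchStarts(pattern, input)) // [0 4]
}
```

The first match starts at byte 0, i.e. at the beginning of the string as stored, which a right-to-left renderer displays on the right. So in Go, “leftmost” appears to mean “earliest in the string”, independent of the display direction.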
Feel free to use that in the documentation. I raised an issue about its current state here