Automatic content summary splitting

I’ve been using Automatic Content Summaries, and they don’t seem to work quite right, so I looked at the code and found a few issues. I’ll discuss the most serious one here.

For non-CJK languages, a function called TruncateWordsToWholeSentence (in helpers/content.go) is used to split the page content at, or soon after, the number of words defined by summaryLength. It locates the end of a sentence simply by looking for one of these characters: . ! ? " or a newline (\n).
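To show why this kind of check goes wrong, here is a minimal sketch of the idea. It is not the actual Hugo code: the function name naiveTruncateToSentence is made up for illustration, and because it splits on whitespace it does not treat newlines as sentence ends.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// naiveTruncateToSentence is a hypothetical illustration, not Hugo's code.
// It keeps at least `limit` words, then stops at the first word whose last
// character is one of . ! ? "
func naiveTruncateToSentence(text string, limit int) string {
	words := strings.Fields(text)
	if limit < 1 {
		limit = 1
	}
	if len(words) <= limit {
		return strings.Join(words, " ")
	}
	for i := limit - 1; i < len(words); i++ {
		last, _ := utf8.DecodeLastRuneInString(words[i])
		if strings.ContainsRune(".!?\"", last) {
			return strings.Join(words[:i+1], " ")
		}
	}
	// No sentence-ending character found: return everything.
	return strings.Join(words, " ")
}

func main() {
	// The full stop in the abbreviation "Dr." is mistaken for a sentence end.
	fmt.Println(naiveTruncateToSentence("We met Dr. Smith today. He was late.", 3))
	// prints: We met Dr.
}
```

With a limit of 3 words, the summary is cut at "Dr." even though the sentence carries on.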

In fact, locating the end of a sentence (in English at least) is notoriously difficult (see, for example, https://stackoverflow.com/questions/4576077/how-can-i-split-a-text-into-sentences) and effectively impossible for this application.

I suggest instead that the function should cut off the text after a fixed number of words and append an ellipsis (…, HTML entity &#8230;) to indicate that it has done so. This is what other well-known website generation systems do.
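Roughly, the suggestion amounts to something like the sketch below. This is only an illustration, not the code I mentioned and not a patch against helpers/content.go; the name truncateWords and its parameters are made up, and it assumes plain text with no HTML tags.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateWords is a hypothetical helper. It returns the first maxWords
// whitespace-separated words of text followed by an ellipsis. Text that is
// already short enough is returned unchanged.
func truncateWords(text string, maxWords int) string {
	if maxWords <= 0 {
		return ""
	}
	words := strings.Fields(text)
	if len(words) <= maxWords {
		return text
	}
	// Runs of whitespace in the kept part collapse to single spaces.
	return strings.Join(words[:maxWords], " ") + "…"
}

func main() {
	fmt.Println(truncateWords("The quick brown fox jumps over the lazy dog", 4))
	// prints: The quick brown fox…
}
```

The sketch does not tidy up punctuation before the ellipsis, so a cut that lands on a word ending in a comma or full stop would produce something like "word,…". A real implementation would probably want to trim that.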

This is simple and reliable (and I’ve written some code if anyone else thinks this is a good idea).

It would be a ‘breaking’ change in the sense that it would slightly change the appearance of many websites, but I think it would be a significant improvement.

This doesn’t apply to CJK languages – they are handled by a different function, and I’m not qualified to tell if the function works correctly or not. I also don’t know if an ellipsis is suitable for use with other non-Latin scripts.

What does the panel think?
