[Users] How to filter utf8 messages

Fri Jul 28 11:34:02 UTC 2023

On Thu, 27 Jul 2023 16:08:58 -0400
Leon Fisk <lfiskgr at gmail.com> wrote:

> On Thu, 27 Jul 2023 07:22:16 -0400
> Pierre Fortin <pf at pfortin.com> wrote:
> 
> >On Thu, 27 Jul 2023 08:38:18 +0000 Colin Leroy-Mira via Users wrote:
> >  
> >>July 27, 2023 at 9:50 AM, "Slavko" <linux at slavino.sk> wrote:
> >>
> >>    
> >>> And another question: I use this regex to score Chinese etc.
> >>> chars (scripts) in the subject in rspamd (perhaps useful for the
> >>> OP): [\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}\p{Arabic}]+
> >>> 
> >>> Will that work in CM filter regexes?      
> >>
> >>I'm unsure; you can test regexps in the QuickSearch in extended
> >>mode: subject regexpcase "..."
> >>    
> >
> >Wow!  I've never seen that regex syntax and my ancient O'Reilly
> >(7/1997) "Mastering Regular Expressions" book does not cover it.
> >Therein, {min,max} is the "interval" quantifier, and there is no mention of
> >"\p". I tried the suggested regex in QuickSearch; but no hits.  The
> >messages I'm trying to filter are mostly in Chinese & Japanese; are
> >these covered by the above suggestion?  
> 
> This works, but needs more testing and probably a few more characters
> for other languages.
> 
> subject regexp "[えの理]+"
> 
> I was testing it with Quick Search in my Claws List directory with:
> 
> body_part regexp "[えの理]+"
> 
> "\p{Han}" seems to be Java syntax. Some info on this Wiki page:
> 
> https://en.wikipedia.org/wiki/Regular_expression#Character_classes

On that page it says: "In Perl and the java.util.regex library,
properties of the form \p{InX} or \p{Block=X} match characters in block
X ... \p{Armenian} ...". So it is somewhat wider than just Java, methinks.
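
For engines that lack \p{...} support, a rough equivalent can be written
with explicit code point ranges. Below is a minimal Python sketch
(Python's standard "re" module does not understand \p{Han}). The ranges
are Unicode blocks picked as an approximation of the scripts in the
regex quoted above, not an exact match for them, and the names
SCRIPT_RANGES, SCRIPT_RE and find_script_runs are made up for
illustration. Whether Claws Mail's filter regexes accept such ranges is
not something I have tested.

```python
import re

# Assumption: these Unicode blocks approximate the scripts named in the
# \p{...} regex above. They are not exhaustive (e.g. CJK extension
# blocks and Hangul Jamo are left out).
SCRIPT_RANGES = (
    "\u4e00-\u9fff"  # CJK Unified Ideographs (most common Han characters)
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\uac00-\ud7af"  # Hangul Syllables
    "\u0600-\u06ff"  # Arabic
)

# One or more consecutive characters from any of the ranges above.
SCRIPT_RE = re.compile("[" + SCRIPT_RANGES + "]+")


def find_script_runs(subject):
    """Return every run of characters in `subject` that falls in the ranges."""
    return SCRIPT_RE.findall(subject)


if __name__ == "__main__":
    for subject in ("Re: えの理 test", "How to filter utf8 messages"):
        print(repr(subject), find_script_runs(subject))
```

A subject with any hit could then be scored or filtered; a plain ASCII
subject gives an empty list.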