[Users] How to filter utf8 messages

Leon Fisk lfiskgr at gmail.com
Thu Jul 27 20:08:58 UTC 2023


On Thu, 27 Jul 2023 07:22:16 -0400
Pierre Fortin <pf at pfortin.com> wrote:

>On Thu, 27 Jul 2023 08:38:18 +0000 Colin Leroy-Mira via Users wrote:
>
>>July 27, 2023 at 9:50 AM, "Slavko" <linux at slavino.sk> wrote:
>>
>>  
>>> And another question: I use this regex to score Chinese etc. chars
>>> (scripts) in subject in rspamd (perhaps can be useful for OP):
>>>  [\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}\p{Arabic}]+
>>> 
>>> Will that work in CM filter regexes?    
>>
>>I'm unsure, you can test regexps in the QuickSearch in extended mode:
>>subject regexpcase "..."
>>  
>
>Wow!  I've never seen that regex syntax and my ancient O'Reilly (7/1997)
>"Mastering Regular Expressions" book does not cover it.  Therein,
>{min,max} is the "interval" quantifier, and no mention of "\p".
>I tried the suggested regex in QuickSearch; but no hits.  The messages
>I'm trying to filter are mostly in Chinese & Japanese; are these covered
>by the above suggestion?

This works, but needs more testing and probably a few more characters
for other languages.

subject regexp "[えの理]+"

I was testing it with Quick Search in my Claws List directory with:

body_part regexp "[えの理]+"

"\p{Han}" seems to be Unicode property syntax, as supported by Perl,
PCRE and Java (among others). Some info on this Wiki page:

https://en.wikipedia.org/wiki/Regular_expression#Character_classes
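
If the "\p{...}" form is not accepted, one possible fallback is a bracket
expression built from Unicode block ranges instead of single characters.
Below is a rough sketch in Python (not Claws Mail filter syntax) just to
show the idea. The ranges are the standard Unicode blocks for each script
and only approximate what the "\p{...}" properties match; the names
SCRIPT_RANGES, SCRIPT_RE and has_listed_script are made up for this
example. Whether the regex engine behind Claws Mail accepts such ranges
is an assumption I have not verified.

```python
import re

# Approximate Unicode block ranges for the scripts named in the rspamd regex.
# Assumption: these blocks cover the common characters of each script, not
# every code point that \p{Han}, \p{Hiragana}, etc. would match.
SCRIPT_RANGES = (
    "\u4e00-\u9fff"   # CJK Unified Ideographs (Han)
    "\u3040-\u309f"   # Hiragana
    "\u30a0-\u30ff"   # Katakana
    "\uac00-\ud7af"   # Hangul Syllables
    "\u0600-\u06ff"   # Arabic
)

# One or more consecutive characters from any of the ranges above.
SCRIPT_RE = re.compile("[" + SCRIPT_RANGES + "]+")


def has_listed_script(subject: str) -> bool:
    """Return True if the subject contains a character from the listed blocks."""
    return SCRIPT_RE.search(subject) is not None


if __name__ == "__main__":
    for subject in ("Re: [Users] How to filter utf8 messages", "えの理"):
        print(repr(subject), has_listed_script(subject))
```

The explicit class "[えの理]+" from above is just a small special case of
this: it lists three characters where the sketch lists whole ranges.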

-- 
Leon
Claws 3.19.0, Debian

