[Users] How to filter utf8 messages
Dave Howorth
dave at howorth.org.uk
Fri Jul 28 11:34:02 UTC 2023
On Thu, 27 Jul 2023 16:08:58 -0400
Leon Fisk <lfiskgr at gmail.com> wrote:
> On Thu, 27 Jul 2023 07:22:16 -0400
> Pierre Fortin <pf at pfortin.com> wrote:
>
> >On Thu, 27 Jul 2023 08:38:18 +0000 Colin Leroy-Mira via Users wrote:
> >
> >>July 27, 2023 at 9:50 AM, "Slavko" <linux at slavino.sk> wrote:
> >>
> >>
> >>> And another question: I use this regex to score Chinese etc.
> >>> chars (scripts) in subject in rspamd (perhaps can be useful for
> >>> OP): [\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}\p{Arabic}]+
> >>>
> >>> Will that work in CM filter regexes?
> >>
> >>I'm unsure, you can test regexps in the QuickSearch in extended
> >>mode: subject regexpcase "..."
> >>
> >
> >Wow! I've never seen that regex syntax and my ancient O'Reilly
> >(7/1997) "Mastering Regular Expressions" book does not cover it.
> >Therein, {min,max} is the "interval" quantifier, and there is no mention of
> >"\p". I tried the suggested regex in QuickSearch; but no hits. The
> >messages I'm trying to filter are mostly in Chinese & Japanese; are
> >these covered by the above suggestion?
>
> This works, but needs more testing and probably a few more characters
> for other languages.
>
> subject regexp "[えの理]+"
>
> I was testing it with Quick Search in my Claws List directory with:
>
> body_part regexp "[えの理]+"
>
> "\p{Han}" seems to be Java syntax. Some info on this Wiki page:
>
> https://en.wikipedia.org/wiki/Regular_expression#Character_classes
On that page it says: "In Perl and the java.util.regex library,
properties of the form \p{InX} or \p{Block=X} match characters in block
X ... \p{Armenian} ...". So somewhat wider than Java methinks.
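
If the regex engine in use turns out not to understand \p{...} at all, one fallback is to spell out code point ranges by hand. Below is a minimal Python sketch of that idea, not a Claws Mail filter rule. The ranges are my own approximation of the main Unicode blocks for the scripts named in the rspamd regex; they are not exhaustive (extension blocks are left out), and the names CJK_ARABIC and has_cjk_or_arabic are made up for illustration.

```python
import re

# Assumed, approximate block ranges; each script has further blocks
# (extensions, compatibility forms) that are deliberately left out.
CJK_ARABIC = re.compile(
    "["
    "\u4e00-\u9fff"   # CJK Unified Ideographs (most Han characters)
    "\u3040-\u309f"   # Hiragana
    "\u30a0-\u30ff"   # Katakana
    "\uac00-\ud7af"   # Hangul Syllables
    "\u0600-\u06ff"   # Arabic
    "]+"
)

def has_cjk_or_arabic(subject: str) -> bool:
    """Return True if the subject contains at least one character
    from the ranges listed above."""
    return CJK_ARABIC.search(subject) is not None

if __name__ == "__main__":
    for s in ["えの理", "Re: [Users] How to filter utf8 messages"]:
        print(repr(s), has_cjk_or_arabic(s))
```

The same bracket expression with literal ranges may or may not be accepted by other engines, so it would still need testing in QuickSearch as Colin suggested.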