[Rspamd-Users] Rules for different separated spam words such as pharmac=euticals or pharmaceutic*als

Mon Mar 11 11:28:26 UTC 2024

Hi there,

On Mon, 11 Mar 2024, Knut Krüger via Users wrote:

> Can I add a rule that eliminates special characters before checking for spam 
> words?
> It would be even better if it only does this when it finds unseparated 
> keywords.
> Pharmaceutical spam actually always contains unseparated words such as 
> pharmacy, so that users immediately recognize what it is about.

You can do things with regular expressions such as

p.?h.?a.?r.?m.?a.?c.?y

or

p\W?h\W?a\W?r\W?m\W?a\W?c\W?y

or even

p[\._-]?h[\._-]?a[\._-]?r[\._-]?m[\._-]?a[\._-]?c[\._-]?y

but it's a bit clumsy, and you need to be careful not to catch things
unintentionally: the first form lets any character sit between the
letters, while the others allow only a non-word character or one of a
few punctuation marks.  If you really do need to remove all 'special'
characters before scanning then you probably need to code something.
My feeling is that this kind of thing rapidly leads to diminishing
returns, but I admit I do have quite a few rules which look for some of
the more common junk.
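
To illustrate how such patterns can be generated rather than typed by
hand, here is a minimal sketch in Python.  It is only an illustration
of the regular expressions above, not Rspamd configuration; the helper
name spaced_pattern and the sample strings are made up for the example.

```python
import re

def spaced_pattern(word, sep=r"\W?"):
    """Build a regex for `word` allowing at most one `sep` between letters.

    `sep` defaults to the \\W? form shown above; pass ".?" or "[._-]?"
    to get the other two variants.
    """
    return re.compile(sep.join(re.escape(ch) for ch in word), re.IGNORECASE)

# The \W? variant catches separators such as '=' and '*'.
strict = spaced_pattern("pharmaceuticals")
print(bool(strict.search("cheap pharmac=euticals")))   # separator '='
print(bool(strict.search("cheap pharmaceutic*als")))   # separator '*'

# The .? variant is looser and can match ordinary text by accident.
loose_pill = spaced_pattern("pill", sep=".?")
strict_pill = spaced_pattern("pill")
print(bool(loose_pill.search("spoil later")))    # matches "poil l"
print(bool(strict_pill.search("spoil later")))   # no match
```

The last two lines show the kind of unintentional catch mentioned
above: with ".?" between the letters, harmless text can satisfy the
pattern, so the narrower separator classes are usually the safer
choice.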

> I think eliminating special characters from every message first would result 
> in a high server load.

Not necessarily an excessive one; stripping characters from text isn't
a difficult operation.  Regular expressions, however, can bite you if
you're careless, e.g. with dot-asterisk (.*).  It depends on your
situation: your spam profile, server performance, normal load, ...
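
As a rough sketch of the "strip first, then scan" idea, assuming plain
Python rather than anything Rspamd provides: the function names
strip_specials and find_keywords are hypothetical, and the choice to
clean each whitespace-separated token on its own (so that neighbouring
words are not glued together) is an assumption of this example.

```python
import re

# Any single non-alphanumeric character; '_' is listed because \W excludes it.
_SPECIAL = re.compile(r"[\W_]")

def strip_specials(text):
    """Remove special characters inside each whitespace-separated token."""
    cleaned = (_SPECIAL.sub("", token) for token in text.split())
    # Drop tokens that consisted only of special characters.
    return " ".join(token.lower() for token in cleaned if token)

def find_keywords(text, keywords):
    """Return the keywords that appear as whole words after stripping."""
    words = set(strip_specials(text).split())
    return sorted(k for k in keywords if k in words)

print(strip_specials("Cheap ph*arm=acy, now!"))
print(find_keywords("Buy pharmac=euticals at our p.h.a.r.m.a.c.y",
                    {"pharmacy", "pharmaceuticals"}))
```

Each message is walked a small, fixed number of times, so the cost
grows roughly with the message size.  Note that this variant does not
catch letters separated by spaces ("p h a r m a c y"); removing
whitespace as well would catch those, at the price of more accidental
matches across word boundaries.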

> Or is there another way to recognize SPAM from botnets with constantly 
> changing spellings of keywords?

You might be better off looking for indications of the sources of the
spam rather than the content.  Try looking at the headers to see if
there are any common characteristics which help you identify the
unwanted messages.  I find blocking by ASN fairly effective, but if
you really are up against a world-wide botnet of hijacked boxes it's
going to be difficult to identify them all.  I use p0f to try to
identify compromised Windows boxes but it isn't especially reliable.
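
The ASN idea can be sketched like this, again in Python and only as an
illustration.  The prefix table and blocklist are placeholder data
(documentation prefixes from RFC 5737 and documentation AS numbers from
RFC 5398); in practice the mapping would come from whatever ASN data
source you already use, and the names asn_for and is_blocked are
hypothetical.

```python
import ipaddress

# Placeholder data only: documentation prefixes and documentation ASNs.
PREFIX_TO_ASN = {
    ipaddress.ip_network("192.0.2.0/24"): 64496,
    ipaddress.ip_network("198.51.100.0/24"): 64497,
}
BLOCKED_ASNS = {64496}

def asn_for(ip):
    """Return the ASN of the longest matching prefix, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, asn in PREFIX_TO_ASN.items():
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None

def is_blocked(ip):
    """True if the sending address belongs to a blocked ASN."""
    return asn_for(ip) in BLOCKED_ASNS

print(is_blocked("192.0.2.25"))     # address inside the blocked placeholder ASN
print(is_blocked("198.51.100.7"))   # known ASN, not on the blocklist
print(is_blocked("203.0.113.9"))    # no matching prefix in the table
```

A linear scan over the table is fine for a handful of prefixes; a real
table would want a proper longest-prefix lookup structure.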

-- 

73,
Ged.