[Rspamd-Users] Where might I find rspamd spam/ham corpus used to test the rules?

Tue May 21 20:01:14 UTC 2019

On 22/05/2019 6:51 am, Sophie Loewenthal wrote:
> Hi Tim,
> 
> I beg to differ. Rules ought to be tested against a sizable spam/ham
> corpus before being released into production. We need this, or an
> equivalent method, to make sure rules work as intended, such as not
> giving false positives, not just with one specific rule but also in
> combination with other rules. E.g. SpamAssassin tests rules this way.
> 
> Without a sizable corpus we don't know of any unintended rules.
> When new rules have been tested, these are released, and the user base
> can test against their own corpus or not.

The problem is the golden rule of email: One person's spam is another 
person's ham.

If you need to do rule testing like this, you should probably compile 
your own corpus, so you're happy that every message in ham is ham and 
every message in spam is really what you (and/or your users) consider spam.
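
As a rough sketch of what checking your own corpus could look like: once 
you have a score for every message you have labelled yourself, you can 
count how often the labels and the scores disagree. Everything here is an 
assumption for illustration: the function name, the (label, score) input 
shape, and the threshold are hypothetical, and how you get the scores out 
of your scanner is left to you.

```python
from typing import Dict, Iterable, Tuple


def corpus_error_rates(
    results: Iterable[Tuple[str, float]], threshold: float
) -> Dict[str, float]:
    """Compute error rates for a hand-labelled corpus.

    `results` holds (label, score) pairs, where label is "ham" or "spam"
    as judged by you, and score is whatever your scanner reported.
    A message counts as flagged when score >= threshold (assumed cutoff).
    """
    ham_total = spam_total = false_pos = false_neg = 0
    for label, score in results:
        flagged = score >= threshold
        if label == "ham":
            ham_total += 1
            if flagged:
                false_pos += 1  # ham that would have been treated as spam
        elif label == "spam":
            spam_total += 1
            if not flagged:
                false_neg += 1  # spam that slipped through
        else:
            raise ValueError(f"unknown label: {label!r}")
    return {
        "false_positive_rate": false_pos / ham_total if ham_total else 0.0,
        "false_negative_rate": false_neg / spam_total if spam_total else 0.0,
    }


if __name__ == "__main__":
    # Made-up scores, purely to show the call; not real scan results.
    sample = [("ham", 1.0), ("ham", 7.5), ("spam", 12.0), ("spam", 3.0)]
    print(corpus_error_rates(sample, threshold=6.0))
```

The numbers you get are only valid for the moment you ran the scan.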

Even then, testing against rspamd today is going to be different to the 
tests you run tomorrow.  A host might be listed in Spamhaus today, but 
the sysadmin of that host fixes the problem and gets it removed.  The 
same email that triggered a zen.spamhaus.org rule today won't trigger a 
zen.spamhaus.org hit tomorrow.  A lot of rspamd is dynamic like this (the 
RBLs, the SURBLs) and even your Bayes database is going to affect what 
rspamd considers ham/spam.
Neural is the same.  It's not (easily) fully reproducible.

Are you planning on nuking your Redis database every time before running 
your corpus?  How will you ensure that the Neural symbols learnt are 
exactly the same?  Will you disable all dynamic/DNS checks?

rspamd gives you great defaults out of the box, but I still think it's 
on you to tune/test to your setup, not some static pile of email that 
yesterday was considered spam/ham but today could well be considered 
ham/spam.

Tim