[Rspamd-Users] Where might I find rspamd spam/ham corpus used to test the rules?
Tim Harman
tim at muppetz.com
Tue May 21 20:01:14 UTC 2019
On 22/05/2019 6:51 am, Sophie Loewenthal wrote:
> Hi Tim,
>
> I beg to differ. Rules ought to be tested against a sizable spam/ham
> corpus before being released into production. We need this, or an
> equivalent method, to make sure rules work as intended, such as not
> giving false positives, not just with one specific rule but also in
> combination with other rules. E.g. SpamAssassin tests rules this way.
>
> Without a sizable corpus we don't know of any unintended rule effects.
> When new rules have been tested, these are released, and the user base
> can test against their own corpus or not.
The problem is the golden rule of email: One person's spam is another
person's ham.
If you need to do rule testing like this, you should probably compile
your own corpus, so you're happy that every message in ham is ham and
every message in spam is really what you (and/or your users) consider
spam.
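As a rough illustration of what testing against your own corpus could
look like, here is a minimal sketch. It assumes you have already scanned
each message and recorded its score next to the label you gave it. The
names Result and error_rates, and the idea of a single score threshold,
are illustrative assumptions and not part of rspamd.

```python
from typing import Dict, Iterable, NamedTuple


class Result(NamedTuple):
    """One scanned message from your own corpus (hypothetical record)."""
    label: str    # "ham" or "spam", as decided by you and/or your users
    score: float  # score the scanner reported for this message


def error_rates(results: Iterable[Result], threshold: float) -> Dict[str, float]:
    """Return false-positive and false-negative rates at a score threshold.

    A message counts as flagged when its score is >= threshold.
    """
    results = list(results)
    ham = [r for r in results if r.label == "ham"]
    spam = [r for r in results if r.label == "spam"]

    # Ham that was flagged is a false positive; spam that was not is a miss.
    false_pos = sum(1 for r in ham if r.score >= threshold)
    false_neg = sum(1 for r in spam if r.score < threshold)

    return {
        "false_positive_rate": false_pos / len(ham) if ham else 0.0,
        "false_negative_rate": false_neg / len(spam) if spam else 0.0,
    }
```

The threshold is whatever score you treat as spam on your own setup, and
the labels are only as good as your own sorting of the corpus.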
Even then, the results of testing against rspamd today are going to
differ from the tests you run tomorrow. A host might be listed in
Spamhaus today, but the sysadmin of that host fixes the problem and gets
it removed. The same email that triggered a zen.spamhaus.org rule today
won't trigger a zen.spamhaus.org hit tomorrow. A lot of rspamd is
dynamic like this (the RBLs, the SURBLs), and even your bayes database
is going to affect what rspamd considers ham/spam.
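To make that drift visible, one option is to record which symbols fired
for each message on two different days and compare them. The sketch
below assumes you have already collected those symbol names yourself.
The function name symbol_drift and the message IDs are hypothetical, and
nothing here calls rspamd.

```python
from typing import Dict, Set


def symbol_drift(
    before: Dict[str, Set[str]],
    after: Dict[str, Set[str]],
) -> Dict[str, Dict[str, Set[str]]]:
    """Report, per message, which symbols appeared or disappeared.

    `before` and `after` map a message ID to the set of symbol names that
    fired in each run. Only messages present in both runs are compared,
    and messages with identical symbol sets are left out of the report.
    """
    drift: Dict[str, Dict[str, Set[str]]] = {}
    for msg_id in before.keys() & after.keys():
        gained = after[msg_id] - before[msg_id]
        lost = before[msg_id] - after[msg_id]
        if gained or lost:
            drift[msg_id] = {"gained": gained, "lost": lost}
    return drift
```

A message that loses a blocklist symbol between runs is exactly the
delisted-host case: the email has not changed, only the outside world.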
Neural is the same. It's not (easily) fully reproducible.
Are you planning on nuking your Redis database every time before running
your corpus? How will you ensure that the Neural symbols learnt are
exactly the same? Will you disable all dynamic/DNS checks?
rspamd gives you great defaults out of the box, but I still think it's
on you to tune/test against your own setup, not against some static pile
of email that yesterday was considered spam/ham but today could well be
considered ham/spam.
Tim