[Rspamd-Users] Where might I find rspamd spam/ham corpus used to test the rules?

Tue May 21 20:19:18 UTC 2019

> On 21 May 2019, at 10:01 pm, Tim Harman via Users <users at lists.rspamd.com> wrote:
> 
>> On 22/05/2019 6:51 am, Sophie Loewenthal wrote:
>> Hi Tim,
>> I beg to differ. Rules ought to be tested against a sizable spam/ham
>> corpus before being released into production. We need this, or an
>> equivalent method, to make sure rules work as intended, such as not
>> giving false positives, not just with one specific rule but also in
>> combination with other rules. E.g. SpamAssassin tests rules this way.
>> Without a sizable corpus we don't know of any unintended rule effects.
>> When new rules have been tested, these are released, and the user base
>> can test against their own corpus or not.
> 
> The problem is the golden rule of email: One person's spam is another person's ham.
> 
> If you need to do rule testing like this, you should probably compile your own corpus so you're happy every message in ham is ham and every message in spam is really what you (and/or your users) consider spam.
> 
> Even then, testing against rspamd today is going to be different to the tests you run tomorrow.  A host might be listed in Spamhaus today, but the sysadmin of that host fixes the problem and gets it removed.  The same email that triggered a zen.spamhaus.org rule today won't trigger a zen.spamhaus.org hit tomorrow.  A lot of rspamd is dynamic like this (the RBLs, the SURBLs) and even your bayes database is going to affect what rspamd considers ham/spam.
> Neural is the same.  It's not (easily) fully reproducible.
> 
> Are you planning on nuking your Redis database every time before running your corpus?  How will you ensure that the Neural symbols learnt are exactly the same?  Will you disable all dynamic/DNS checks?
> 
> rspamd gives you great defaults out of the box, but I still think it's on you to tune/test to your setup, not some static pile of email that yesterday was considered spam/ham but today could well be considered ham/spam.
> 
> Tim

Hi Tim,  

I agree Rspamd rules work well out of the box.  I was concerned with some email being tagged spam, e.g. Politico and a few others. I also wanted to know how the rules were tested, and against what.

Rspamd's ham might be your spam. I have a corpus of 10,000 active messages to test against.  Not that I've had to really do this since a move from SA to Rspamd, because I have less need to write rules ;-)
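
For anyone testing against their own hand-sorted corpus, here is a minimal sketch of the tally such a run could produce. It assumes every message has already been scanned and its score recorded somewhere; `Result`, `error_rates`, and the threshold value are hypothetical names chosen for illustration and are not part of Rspamd.

```python
from dataclasses import dataclass


@dataclass
class Result:
    label: str    # "ham" or "spam", as sorted by hand (assumption)
    score: float  # score the filter assigned to the message


def error_rates(results, threshold):
    """Return (false_positive_rate, false_negative_rate) for a labeled corpus.

    A message counts as flagged when its score is >= threshold.
    """
    ham = [r for r in results if r.label == "ham"]
    spam = [r for r in results if r.label == "spam"]
    false_pos = sum(1 for r in ham if r.score >= threshold)   # ham flagged as spam
    false_neg = sum(1 for r in spam if r.score < threshold)   # spam that got through
    fp_rate = false_pos / len(ham) if ham else 0.0
    fn_rate = false_neg / len(spam) if spam else 0.0
    return fp_rate, fn_rate


if __name__ == "__main__":
    # Made-up scores, only to show the call.
    sample = [
        Result("ham", 1.0),
        Result("ham", 16.0),
        Result("spam", 20.0),
        Result("spam", 3.0),
    ]
    print(error_rates(sample, threshold=15.0))
```

Given Tim's point about RBLs, Bayes, and Neural changing over time, numbers from two runs on different days are only comparable if the dynamic checks were held constant between them.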