[Rspamd-Users] Bayes questions and observations

Fri Mar 15 12:14:52 UTC 2024

On 15/03/2024 09:55, christian via Users wrote:
> Am 14.03.2024 um 18:51 schrieb Vsevolod Stakhov:
> 
>> This looks like an XY problem to me: why do you need SA for Bayes 
>> counting when it uses a much more stupid algorithm for it? In any case, 
>> your whole problem looks very weird to me. The *only* reasons the SA 
>> integration exists are testing and legacy concerns (not Bayes or 
>> regexps, where Rspamd can do a much better job).
> 
> I still get a lot of spam that isn't recognized. There are batches of 
> spam campaigns that come from different senders from different 
> countries, with the same appearance but different words on the same 
> topic (financial, "hoonky" kitchen knife), which I can currently only 
> block with multimap and regex. But after 2 days the new wave comes.
> The statistical function (BAYES_SPAM) is of no help because its results 
> are not correct. An email reaches a score of 20 through ASN, RBL, Neural 
> and Reputation. Then BAYES_SPAM comes and says the email is OK (-2). 
> Learning doesn't help. I now re-learn every spam email using rspamc 
> learn_spam, but the results do not improve.
> 
> How do you solve this?
> Christian

That's very interesting and I would like to investigate further. In fact, 
both SA and Rspamd use more or less the same Bayes algorithm, with some 
slight differences in tokenisation logic.

If you have samples of misclassification, could you please do the 
following things:

1) Enable "bayes" debugging (add "bayes" to the `debug_modules` array in 
local.d/logging.inc)
2) Collect all log lines tagged "bayes" that are produced when you scan 
those messages and send them to me (preferably via private email if they 
contain confidential data or are large)
3) Send me both the samples and your Redis dump so I can experiment with 
them
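
For (1), a minimal sketch of what local.d/logging.inc could look like, 
assuming you have no other debug modules enabled yet (if the file already 
contains a `debug_modules` list, add "bayes" to that list instead of 
defining a second one):

```
# local.d/logging.inc
# Enable debug-level logging for the Bayes classifier only.
debug_modules = ["bayes"];
```

Rspamd typically needs a reload or restart before the change takes effect.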

Step (3) may well be overkill in terms of privacy and the amount of data, 
so I would appreciate it if you could at least do (1) and (2).

Thanks in advance!