[Rspamd-Users] Bayes questions and observations

Sat Mar 16 14:20:55 UTC 2024

Hello Vsevolod,
thank you for your feedback.
First of all: I'm an Rspamd beginner and still have a lot to learn. After 
a few weeks, the filter results from Rspamd are already better than with 
my old spam filter, ASSP, which I used for a few years.

I'm asking about the SpamAssassin integration because I still can't 
handle a few waves of spam.
Since SpamAssassin also allows external filter sources (Heinlein, 
schaal-it), I thought it would give me a better handle on local 
(German) spam. I don't know if that's the case.
Unfortunately, I haven't been able to get spamd/SpamAssassin working 
with my Rspamd setup yet, so I can't offer you any comparisons yet.
I'm currently training the statistical classifier (BAYES_SPAM) and making 
sure I keep it clean, but the results are still not very good. I don't 
really know what data the results are based on. For example, here is the 
result for an email that has already gone through several checks in Rspamd:

X-Spamd-Result: default: False [20.03 / 30.00];
PH_SURBL_MULTI(7.50)[dennisberrien.com:url];
NEURAL_SPAM_SHORT(3.00)[1,000];
HFILTER_HOSTNAME_UNKNOWN(2.50)[];
MISSING_MID(2.50)[];
IP_REPUTATION_SPAM(1.39)[asn: 47674(0.23), country: MO(0.01), ip: 185.236.231.93(0.00)];
R_BAD_CTE_7BIT(1.05)[7bit,utf8];
R_NO_SPACE_IN_FROM(1.00)[];
MV_CASE(0.50)[];
FORGED_SENDER(0.30)[no-reply at ehtakoskelo.fi,return at ehtakoskelo.fi];
MIME_HTML_ONLY(0.20)[];
ONCE_RECEIVED(0.10)[];
MX_GOOD(-0.01)[];
BAYES_SPAM(-5.00)[99.99%];
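
To make a result like the one above easier to read, a small script can split it into symbols and scores. This is only a sketch: the function names and the regular expression are my own assumptions about the header layout, not something Rspamd provides. It also flags entries whose sign looks unexpected for their name, such as a *_SPAM symbol with a negative score.

```python
import re

# Assumed layout of one entry: SYMBOL(score)[options], e.g. BAYES_SPAM(-5.00)[99.99%]
SYMBOL_RE = re.compile(r"([A-Z][A-Z0-9_]*)\((-?\d+(?:\.\d+)?)\)\[([^\]]*)\]")


def parse_spamd_result(header):
    """Return a list of (symbol, score, options) tuples from an X-Spamd-Result value."""
    flat = " ".join(header.split())  # undo line folding
    return [(m.group(1), float(m.group(2)), m.group(3))
            for m in SYMBOL_RE.finditer(flat)]


def unexpected_signs(symbols):
    """Names ending in _SPAM with a negative score, or in _HAM with a positive one."""
    flagged = []
    for name, score, _options in symbols:
        if name.endswith("_SPAM") and score < 0:
            flagged.append(name)
        elif name.endswith("_HAM") and score > 0:
            flagged.append(name)
    return flagged


# A shortened copy of the result shown above.
SAMPLE = """default: False [20.03 / 30.00];
PH_SURBL_MULTI(7.50)[dennisberrien.com:url];
NEURAL_SPAM_SHORT(3.00)[1,000];
IP_REPUTATION_SPAM(1.39)[asn: 47674(0.23), country: MO(0.01), ip:
185.236.231.93(0.00)];
MX_GOOD(-0.01)[];
BAYES_SPAM(-5.00)[99.99%];"""

symbols = parse_spamd_result(SAMPLE)
for name, score, options in sorted(symbols, key=lambda s: s[1], reverse=True):
    print(f"{score:7.2f}  {name}  [{options}]")
print("unexpected sign:", unexpected_signs(symbols))
```

On this sample the check flags BAYES_SPAM: the symbol that fired is the spam one, with 99.99% in its options, yet it contributes -5.00. That might point at the weight configured for the symbol rather than at the learned data, but this is only a guess from the header.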

But I have already learned such emails using rspamc learn_spam, and 
BAYES_SPAM still rates them as ham. For this example email it doesn't 
matter, because the other symbols clearly indicate spam, but I have some 
borderline cases where the Bayes score is decisive.

I have now
BAYES_SPAM redis 5378 1
BAYES_HAM redis 5283 1

and still there are spam emails that get a ham Bayes score.
Is only the content of the emails learned, i.e. words and terms, or also 
header data such as From, envelope From, country, and IP?
At the moment I mostly get good results with multimap and hand-made 
spam-word and domain blacklists. But I always have to stay up to date, 
find the spam terms with every new wave of spam, and enter them 
into my maps.

If I have learned 500 emails containing terms like
Bitcoin trading
Bitcoin\sAdvice
BlackRock
Blockchain
Blockchain assets
Cyber coins
Cyber transactions
Cyber currency
Digital\scurrencies
Japanese kitchen knives

then Rspamd Bayes should recognize these emails as spam. But 
the value is still -3.
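
The expectation in the paragraph above can be illustrated with a toy word-based naive Bayes classifier. This is a deliberately simplified sketch and not how Rspamd implements its classifier: the tokenizer, the add-one smoothing, and all names are assumptions made for illustration, and the training messages are made up.

```python
import math
import re
from collections import Counter


class ToyBayes:
    """Minimal word-based naive Bayes with add-one smoothing (illustration only)."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.learned = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def learn(self, label, text):
        self.tokens[label].update(self.tokenize(text))
        self.learned[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"]))
        total_msgs = sum(self.learned.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_tokens = sum(self.tokens[label].values())
            # Smoothed prior from the share of learned messages with this label.
            score = math.log((self.learned[label] + 1) / (total_msgs + 2))
            for tok in self.tokenize(text):
                count = self.tokens[label][tok]
                score += math.log((count + 1) / (total_tokens + vocab + 1))
            log_score[label] = score
        # Turn the two log scores into P(spam | text); clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


clf = ToyBayes()
for _ in range(5):
    clf.learn("spam", "Bitcoin trading advice: blockchain assets and cyber currency")
    clf.learn("ham", "Meeting notes attached, see you at the office on Monday")

p_spam = clf.spam_probability("New blockchain assets, bitcoin advice inside")
p_ham = clf.spam_probability("Notes from the Monday meeting")
print(f"spam-like text: {p_spam:.4f}, ham-like text: {p_ham:.4f}")
```

In this toy model only the words of the message count, and a text that reuses the learned spam terms ends up close to 1. Whether Rspamd also takes header data into account is exactly the open question above.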

So the question is whether my setup is correct.
I integrated Rspamd into Postfix via the milter interface:

milter_default_action = accept
milter_protocol = 6
smtpd_milters = inet:localhost:11332
non_smtpd_milters = inet:127.0.0.1:11332

I also still need
always_bcc=mailarchive at meineDomain.de
for an archive and, for now, to check the filter results.

I manually check the emails copied via always_bcc and feed them 
to the classifier with rspamc learn_spam or learn_ham.
The individual users (approx. 300) cannot yet train spam/ham 
themselves (Sieve to rspamc, etc.).
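
The manual learning step could be scripted roughly as follows. This is a sketch under assumptions: it expects the checked messages to be saved as *.eml files in separate spam and ham folders (the folder paths in the comment are hypothetical), and it assumes rspamc reports failure through its exit status. Only the rspamc learn_spam and learn_ham commands themselves come from my current workflow.

```python
import subprocess
from pathlib import Path


def build_learn_command(kind, message_path):
    """Build the rspamc command line for one message; kind is 'spam' or 'ham'."""
    if kind not in ("spam", "ham"):
        raise ValueError(f"unknown kind: {kind!r}")
    return ["rspamc", f"learn_{kind}", str(message_path)]


def learn_directory(kind, directory, runner=subprocess.run):
    """Feed every *.eml file in `directory` to rspamc; return the number of successes.

    `runner` defaults to subprocess.run and can be replaced for a dry run.
    """
    learned = 0
    for path in sorted(Path(directory).glob("*.eml")):
        result = runner(build_learn_command(kind, path),
                        capture_output=True, text=True)
        # Assumption: a zero exit status means the message was accepted.
        if result.returncode == 0:
            learned += 1
        else:
            print(f"failed: {path}: {result.stderr.strip()}")
    return learned


# Example call with hypothetical folders:
#   learn_directory("spam", "/var/tmp/sorted/spam")
#   learn_directory("ham", "/var/tmp/sorted/ham")
```

Keeping spam and ham in separate folders makes it harder to feed a message to the wrong class by accident, which is one way wrong values could end up in the Bayes database.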

I'm not sure whether I might be getting incorrect values into the Bayes 
database. In the last few weeks I have deleted the Redis DB 2-3 times 
and started learning again, making a conscious effort to keep 
everything clean. But the results still aren't right.

That's why I tried to get SpamAssassin to work, but so far 
with little success.
I will keep observing how my results with Rspamd Bayes develop and 
keep training it.
But I'm still very happy with Rspamd because the results are much better 
than with my old ASSP environment.

Thank you for your efforts.
Best regards
Christian

On 15.03.2024 at 13:14, Vsevolod Stakhov wrote:
> On 15/03/2024 09:55, christian via Users wrote:
>> On 14.03.2024 at 18:51, Vsevolod Stakhov wrote:
>>
>>> This looks like an XY problem to me: why do you need SA for Bayes 
>>> when it uses a much more stupid algorithm for it? Of course, your whole 
>>> problem looks very weird to me. The *only* reasons SA integration 
>>> exists are testing and legacy concerns (not Bayes or regexps, where 
>>> Rspamd can do a much better job).
>>
>> I still get a lot of spam that isn't recognized. There are batches of 
>> spam campaigns that come from different senders in different 
>> countries, with the same appearance but different words on the same 
>> topic (financial, "hoonky" kitchen knife), which I can currently only 
>> block with multimap and regex. But after 2 days the next wave comes.
>> The statistical function (BAYES_SPAM) is of no help because the 
>> results are not correct. The email has a score of 20, through ASN, 
>> RBL, Neural and Reputation. Then BAYES_SPAM comes along and says the 
>> email is OK (-2). Learning doesn't help. I now learn every spam email 
>> again using rspamc learn_spam. The results do not improve.
>>
>> How do you solve this?
>> Christian
> 
> 
> That's very interesting and I would like to investigate more. In fact, 
> both SA and Rspamd use more or less the same Bayes algorithm, with 
> some slight differences in tokenisation logic.
> 
> If you have samples of misclassification, could you please do the 
> following things:
> 
> 1) Enable "bayes" debugging (add "bayes" to the `debug_modules` 
> array in local.d/logging.inc)
> 2) Check all logs with tag "bayes" when you scan those messages and send 
> them to me (probably via private email if there's some confidential data 
> or large attachment)
> 3) Send me both samples and your Redis dump so I can try to experiment 
> with that
> 
> Maybe (3) would be overkill in terms of privacy and amount of 
> data, so I would appreciate it if you could do 1-2.
> 
> Thanks in advance!