[Rspamd-Users] Autolearn BAYES does not want to work

Wed Jun 26 22:36:12 UTC 2024

> yes, I've always had problems with BAYES autolearn.
> 
> It "worked" for a while, but I'm not quite sure what it does. The weirdest things happen.
> 
> I tried your suggestion
> autolearn = [-0.5, 4];
> That type of entry doesn't work for me at all.

/etc/rspamd/local.d/classifier-bayes.conf could look like this:

backend = "redis";
servers = "127.0.0.1:6378";
# autolearn = true;
# adjust the scores to your setup:
autolearn = [-0.5, 4];
new_schema = true;
expire = 2144448000;

In this example there is a separate redis instance running on port 6378 solely for Bayes.

From redis-6378.conf:

maxmemory 500mb
maxmemory-policy volatile-ttl

Once the redis instance uses 500mb of memory, redis automatically evicts records, starting with those whose TTL is closest to running out.
Rspamd itself will not expire records in practice (it would after 2144448000 seconds, about 68 years); redis handles the cleanup.

https://rspamd.com/doc/modules/bayes_expiry.html#limiting-memory-usage-to-a-fixed-amount
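
For reference, a minimal sketch of what the redis-6378.conf of such a dedicated instance might contain. The bind address, data directory and file name are assumptions for illustration; only the port and the two maxmemory lines come from the setup described above.

  # redis-6378.conf (sketch, paths are hypothetical)
  port 6378
  bind 127.0.0.1
  # assumed location, adjust to your distribution
  dir /var/lib/redis-bayes
  dbfilename bayes.rdb
  maxmemory 500mb
  maxmemory-policy volatile-ttl

volatile-ttl only evicts keys that have an expiry set, which is why the expire value in classifier-bayes.conf stays in place even though it is far in the future.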

> This is how I've got it working now:
> 
> learn_condition = 'return require("lua_bayes_learn").can_learn';
> autolearn = true;
> autolearn {
>  spam_threshold = 6.0;
>  junk_threshold = 4.0;
>  ham_threshold = -0.5;
>  check_balance = true;
>  min_balance = 0.95;
> }

That's what I tried to point out in my response in March. You still have check_balance and min_balance set.
These settings tell rspamd to learn roughly the same number of spam and ham mails. So if you get twice as many ham mails as spam mails, you can't complain that not all spam mails are learned.
Back then I suggested enabling debug logging to get possible hints like "skip learning spam, balance is not satisfied" ...
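
To see such hints, debug logging can be limited to the Bayes-related modules instead of switching the whole log to debug. A sketch for /etc/rspamd/local.d/logging.inc follows; the module names "bayes" and "lua_bayes" are assumptions about how the relevant code labels its debug messages, so verify them against your rspamd version:

  # /etc/rspamd/local.d/logging.inc (sketch)
  level = "info";
  # assumed module names, verify for your rspamd version
  debug_modules = ["bayes", "lua_bayes"];

After a restart, the skipped learns should show up in the rspamd log next to the affected messages, if the module names match.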

> 1. But it's still not clear when an incoming email is automatically classified as spam or ham.

All tests like bayes, multimaps, rbls, ... add to the final score. A mail is classified as spam if the score is high enough (e.g. reject score in /etc/rspamd/local.d/actions.conf).
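
As an illustration, the thresholds in /etc/rspamd/local.d/actions.conf could look like this. The numbers are example values only, not a recommendation for your setup; a mail whose total score reaches a threshold gets the corresponding action:

  # /etc/rspamd/local.d/actions.conf (example values)
  reject = 15;
  add_header = 6;
  greylist = 4;

With these example values, a mail that collects a total score of 15 or more from all tests is rejected.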

> Does it matter whether pre-filter is used in multimap?

Yes. A prefilter stops further tests; it usually accepts or rejects an email outright, according to your configuration.

Maybe start without a prefilter so that the other tests can still run.
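
A sketch of a multimap rule in /etc/rspamd/local.d/multimap.conf that only adds a score instead of acting as a prefilter. The symbol name, map path and score are hypothetical and need to be adapted to your setup:

  # /etc/rspamd/local.d/multimap.conf (sketch)
  WHITELIST_DOMAIN {
    type = "from";
    filter = "email:domain";
    map = "/etc/rspamd/local.d/whitelist_domain.map";
    # negative score instead of an immediate accept
    score = -5.0;
    # prefilter = true;
    # action = "accept";
  }

This way a whitelisted domain lowers the final score, while bayes, rbls and the other tests still run and contribute.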

> Can you also set the learning somehow by hand, for example if you set autolearn = True in a multimap in Whitelist_Domain. So everything that is white listed is automatically learned.

I would not advise such a combination.

See https://rspamd.com/doc/configuration/statistic.html#autolearning

  autolearn = true: autolearning is performing as spam if a message has reject action and as ham if a message has negative score
  autolearn = [-5, 5]: autolearn as ham if the score is less than -5 and as spam if the score is more than 5

Try to start with a simple configuration and remove all custom autolearn settings, e.g.:

  autolearn {
   spam_threshold = 6.0;
   junk_threshold = 4.0;
   ham_threshold = -0.5;
   check_balance = true;
   min_balance = 0.95;
  }

Instead just set

  autolearn = true;

As the documentation a few lines above states, this will autolearn mails as spam if they are rejected and as ham if they have a negative score.

> 2. It is not possible to control what is contained in the redis database.

A standard setup uses redis for lots of things, not just bayes (e.g. reply tracking, ...).
You do not directly control what's inside redis; otherwise you would interfere with the modules using redis without knowing anything about their internals.

> Is there a way to edit and adjust this?

Don't try to meddle with redis directly. The way you adjust things in rspamd is by setting scores, e.g. in multimaps.

> Wouldn't MySQL be better? I don't have that many emails.

No. MySQL is a relational database, whereas redis is a key-value store (a kind of specialised database which is better suited for this type of workload).

> 3. I set the lifetime of BAYES entries in classifier-bayes.conf to 20 days and waited until the 20 days had passed. Now I have reduced the time to 10 days, but no older entries are deleted from the database.

Don't do that. Create a separate redis instance for bayes and configure it to autoexpire insignificant data once the database is full (maxmemory-policy volatile-ttl ...).
You want to keep as many significant spam/ham signs as possible, not delete them after 10 days.

> Repeated spam that has RBL, Fuzzy, Neural, Multimap, SPF, DKIM entries and a score of +40 is still marked as BAYES ham.

In this case autolearning is not working yet. Try the suggestions above.

> It looks as if only the content of an email, without headers, is used for the BAYES evaluation. Or can this still be adjusted?

https://rspamd.com/doc/configuration/statistic.html#classifier-and-headers

Some headers are taken into account and this can be adjusted, but don't change the defaults without knowing the implications. It won't help with your problem.

Best regards,
Gerald