[Rspamd-Users] Which data is used for learning bayes or fuzzy storage?

Sat Nov 23 06:02:44 UTC 2024

> Many thanks for the information. On this basis I assume that the hashing in auto_learn takes place before the subject is rewritten.
> So if I want to train the filter with messages manually afterwards via the web interface, should I at least delete the string, for example '*** SPAM [8.57/12] ***', beforehand in order to generate correct hashes?

I don't know for sure, but I guess rspamd does not remove a spam tag from the subject in this case. Doing so would be difficult because the tag format is customizable (***SPAM, [SPAM], ...).
You can delete the string but I don't think it's necessary. Bayes is a statistical approach based on tokens (words) and probabilities.
Some tokens are statistically more relevant for spam mails and will be retained; insignificant ones expire over time.
A valid spam mail tagged with ***SPAM in the subject might initially increase the spam score, another valid mail that is about anti spam software might decrease it.
That way the token "spam" alone won't result in a higher spam probability.
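
If you do want to delete the tag before learning manually, something like the following sketch could work. The regular expression and the function name strip_spam_tag are my own assumptions, not anything rspamd provides, and the pattern only covers the tag styles mentioned above; adjust it to whatever subject rewrite format you have configured.

```python
import re

# Hypothetical pattern (an assumption, not part of rspamd): matches a leading
# "*** SPAM [8.57/12] ***", "***SPAM" or "[SPAM]" tag in a subject line.
SPAM_TAG = re.compile(
    r"^\s*(?:\*{3}\s*SPAM(?:\s*\[[\d.]+/[\d.]+\])?\s*(?:\*{3})?|\[SPAM\])\s*",
    re.IGNORECASE,
)


def strip_spam_tag(subject: str) -> str:
    """Return the subject with one leading spam tag removed, if present."""
    return SPAM_TAG.sub("", subject, count=1)


if __name__ == "__main__":
    print(strip_spam_tag("*** SPAM [8.57/12] *** Cheap watches"))
    print(strip_spam_tag("Re: anti spam software"))
```

Subjects without a leading tag are returned unchanged, so a legitimate mail that merely talks about spam is not touched.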

See https://rspamd.com/doc/configuration/statistic.html#statistics-architecture
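
To illustrate why a token that shows up in both spam and ham ends up close to neutral, here is a toy calculation. It is a deliberately simplified per-token estimate, not rspamd's actual tokenizer or classifier, and the sample messages and the function name token_spam_probability are made up for the example.

```python
def token_spam_probability(token, spam_docs, ham_docs):
    """Toy estimate of how strongly a single token points to spam.

    Compares the fraction of spam messages containing the token with the
    fraction of ham messages containing it. 0.5 means the token is neutral.
    """
    spam_rate = sum(token in doc.lower().split() for doc in spam_docs) / len(spam_docs)
    ham_rate = sum(token in doc.lower().split() for doc in ham_docs) / len(ham_docs)
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Invented sample data: tagged spam plus legitimate mail about anti-spam software.
spam_docs = ["*** spam *** cheap pills", "*** spam *** win money now"]
ham_docs = ["review of anti spam software", "spam filter release notes"]

if __name__ == "__main__":
    print(token_spam_probability("spam", spam_docs, ham_docs))
    print(token_spam_probability("pills", spam_docs, ham_docs))
```

With this sample data the token "spam" appears equally often on both sides and comes out neutral, while "pills" only appears in spam and is a much stronger indicator.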

> This is also a problem with user-initiated training via the mail store when a message is moved into or out of the junk folder. In general, however, user training is given its own flag and given less weight in scoring.

An alternative would be to collect users' spam mails, check if they really are spam and only learn those globally.
My personal opinion is that per-user training often does more harm than good. Most users don't want to train anything; they just want spam filtered.
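
The "collect, review, then learn globally" idea could be scripted roughly like this. It is only a sketch: it assumes the reviewed messages have been saved as individual files and that rspamc is on the PATH with access to the controller; the helper names learn_commands and learn_reviewed_spam are hypothetical.

```python
import subprocess


def learn_commands(paths, rspamc="rspamc"):
    """Build one 'rspamc learn_spam <file>' command per reviewed message file."""
    return [[rspamc, "learn_spam", str(path)] for path in paths]


def learn_reviewed_spam(paths):
    """Feed manually reviewed spam to the global Bayes classifier.

    Assumes every file in `paths` was checked by a human beforehand.
    Raises CalledProcessError if rspamc reports a failure.
    """
    for cmd in learn_commands(paths):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the execution makes it easy to print and inspect the commands before anything is actually learned.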

Best regards,
Gerald