[Rspamd-Users] How to handle MIME encoded headers?
Gerald Galster
list+rspamd at gcore.biz
Mon Jun 17 19:28:41 UTC 2024
>> Have you tried using the /u flag in your regexes?
>
> I saw that listed, but it wasn't clear (to me at least) what that means. Lemme take a shot at that...
/u is a character set modifier: it makes the regex engine interpret pattern and input as unicode (utf-8) instead of single-byte ascii/latin1.
All bytestreams read from disk or the network must be interpreted somehow.
Historically this has been ASCII (or ascii-extended / iso-8859-1 / latin1 / ...).
See for example the uppercase character 'A', which is at hex position 41 in the iso-8859-1 code page:
https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
It is also at hex position 41 in the UTF-8 code page:
https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
So these two codepages share some characters but not all, as iso-8859-1 is strictly one byte per character
and UTF-8 may take up to four bytes to define a single character (smileys, diacritics, grapheme clusters, ...).
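
To make the byte counts concrete, here is a small Python sketch (Python is only used for illustration here, it is not part of rspamd). It encodes 'A', 'Ä' and a smiley and prints the resulting bytes in hex:

```python
# 'A' encodes to the same single byte in both encodings.
a_latin1 = "A".encode("iso-8859-1")
a_utf8 = "A".encode("utf-8")

# 'Ä' is one byte in iso-8859-1 but two bytes in UTF-8.
ae_latin1 = "Ä".encode("iso-8859-1")
ae_utf8 = "Ä".encode("utf-8")

# A smiley has no iso-8859-1 representation; in UTF-8 it takes four bytes.
smiley_utf8 = "😀".encode("utf-8")

print(a_latin1.hex(), a_utf8.hex())    # 41 41
print(ae_latin1.hex(), ae_utf8.hex())  # c4 c384
print(len(smiley_utf8))                # 4
```
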
Historically mail headers are limited to 7-bit ASCII, and since iso-8859-1 and UTF-8 share this part
of the code page layout, there is no need to process anything differently.
Consequently headers like Subject must be encoded somehow if they contain characters outside that range.
That's where MIME encoded-words (RFC 2047) come into play: any non-ascii characters are represented by a
sequence of pure ascii characters, using either base64 (B) or a quoted-printable variant (Q).
This is what you get when processing raw email headers (=?UTF-8?B?...).
With certain configurations rspamd decodes those encoded-words to utf-8 characters like 'Ä' or 😀 .
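
As a rough illustration of what such decoding does, the following sketch uses Python's standard email.header module. This only shows the encoding itself, not how rspamd implements it internally, and the two sample subjects are made up for the example:

```python
from email.header import decode_header, make_header

# Two hypothetical raw Subject values, both encoding the single character 'Ä':
raw_b = "=?UTF-8?B?w4Q=?="    # base64 form
raw_q = "=?UTF-8?Q?=C3=84?="  # quoted-printable-like form

# decode_header() splits the header into (bytes, charset) pairs;
# make_header() joins them back into a unicode string.
decoded_b = str(make_header(decode_header(raw_b)))
decoded_q = str(make_header(decode_header(raw_q)))

print(decoded_b, decoded_q)  # Ä Ä
```
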
To match utf-8 / multibyte characters inside a regular expression, you need the /u modifier.
Otherwise you would need to match the individual bytes, for example in hex notation.
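
The difference can be sketched with Python's re module, which is an assumption-laden stand-in here: rspamd uses its own regex engine and the /u flag, but matching a str (character-aware) versus matching bytes (byte-wise) in Python behaves analogously:

```python
import re

subject = "Ärger"
raw = subject.encode("utf-8")  # b'\xc3\x84rger'

# Character-aware matching: '.' consumes the whole character 'Ä'.
unicode_match = re.fullmatch(r".rger", subject)

# Byte-wise matching: '.' consumes one byte, but 'Ä' is two bytes in UTF-8,
# so the same pattern no longer matches.
byte_match = re.fullmatch(rb".rger", raw)

# Without character awareness the bytes have to be spelled out in hex.
hex_match = re.fullmatch(rb"\xc3\x84rger", raw)

print(unicode_match is not None, byte_match is not None, hex_match is not None)
# True False True
```
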
In summary, subjects containing only 7-bit characters (a-z, 0-9, space, +, ...) don't need any encoding.
If they are encoded nevertheless, this might be a spammer trying to evade common string or regular
expression checks.
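
A minimal sketch of that heuristic in Python, just to show the idea: the function name is_needlessly_encoded and the simplified encoded-word pattern are assumptions for this example, not something rspamd provides, and malformed headers are not handled:

```python
import re
from email.header import decode_header, make_header

# Simplified pattern for an encoded-word: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")

def is_needlessly_encoded(raw_subject: str) -> bool:
    """Return True if the subject is encoded but decodes to pure ASCII."""
    if not ENCODED_WORD.search(raw_subject):
        return False  # plain header, nothing was encoded
    decoded = str(make_header(decode_header(raw_subject)))
    return decoded.isascii()

print(is_needlessly_encoded("=?UTF-8?B?SGVsbG8=?="))    # True  ("Hello")
print(is_needlessly_encoded("=?UTF-8?Q?=C3=84rger?="))  # False ("Ärger")
print(is_needlessly_encoded("Hello"))                   # False (not encoded)
```

A hit from such a check is only a hint, since some legitimate mail clients also encode plain ASCII subjects.
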
Best regards,
Gerald