[Rspamd-Users] How to handle MIME encoded headers?
Dan Swartzendruber
dswartz at druber.com
Mon Jun 17 22:23:41 UTC 2024
Got it thanks!
Sent from my iPhone
> On Jun 17, 2024, at 5:36 PM, Gerald Galster <list+rspamd at gcore.biz> wrote:
>
>
>>
>>> Have you tried using the /u flag in your regexes?
>>
>> I saw that listed, but it wasn't clear (to me at least) what that means. Lemme take a shot at that...
>
> /u is a character set modifier that sets the character set to unicode (utf-8) instead of ascii/latin1.
>
> All bytestreams read from disk or the network must be interpreted somehow.
> Historically this has been ASCII (or ascii-extended / iso-8859-1 / latin1 / ...).
>
> See for example the uppercase character 'A' which is at hex position 41 in iso-8859-1 code page:
> https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
>
> It is also at hex position 41 in UTF-8 code page:
> https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
>
> So these two code pages share the 7-bit ASCII range but not the rest: iso-8859-1 is strictly one byte
> per character, while UTF-8 may take up to four bytes to encode a single code point (diacritics, smileys, ...).
>
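To make the one-byte versus multi-byte point concrete, here is a minimal Python sketch. The helper name byte_forms is made up for illustration; it only shows how the same character is stored in each encoding.

```python
def byte_forms(ch):
    """Return (iso-8859-1 hex or None, utf-8 hex) for a single character."""
    try:
        latin1 = ch.encode("iso-8859-1").hex()
    except UnicodeEncodeError:
        latin1 = None  # character does not exist in iso-8859-1
    return latin1, ch.encode("utf-8").hex()


if __name__ == "__main__":
    for ch in ("A", "Ä", "😀"):
        print(repr(ch), byte_forms(ch))
```

'A' is the single byte 41 in both encodings, 'Ä' is one byte (c4) in iso-8859-1 but two bytes (c3 84) in UTF-8, and the smiley has no iso-8859-1 form at all and needs four bytes in UTF-8.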
> Historically mail headers are limited to 7-bit ASCII, and since iso-8859-1 and UTF-8 share this part
> of the code page layout, pure ASCII headers do not need to be processed any differently.
>
> Consequently headers like Subject must be encoded somehow if they contain characters outside that range.
> That's where MIME encoded-words (RFC 2047) come into play: any non-ascii characters are represented by a
> sequence of pure ascii characters, using either base64 ("B") or a quoted-printable variant ("Q").
> This is what you get when processing raw email headers (=?UTF-8?B?...).
>
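As a rough illustration of what such an encoded header looks like before and after decoding, here is a small Python sketch using the standard library's email.header module. The function name decode_subject and the sample subjects are assumptions for the example; this is not how rspamd itself does the decoding.

```python
from email.header import decode_header, make_header


def decode_subject(raw):
    """Decode RFC 2047 encoded-words in a raw header value to a str."""
    return str(make_header(decode_header(raw)))


if __name__ == "__main__":
    # The same word once in base64 ("B") and once in the "Q" encoding.
    print(decode_subject("=?UTF-8?B?w4RwZmVs?="))
    print(decode_subject("=?UTF-8?Q?=C3=84pfel?="))
```

Both raw forms consist of ASCII characters only, and both decode to the same text.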
> With certain configurations rspamd decodes those encoded-words to utf-8 characters like 'Ä' or 😀 .
> To match utf-8 / multibyte characters inside a regular expression, you need the /u modifier.
> Otherwise you would need to match the respective bytes, for example in hex notation.
>
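Python's re module has no /u modifier, but the difference between byte-wise and character-wise matching can be sketched by comparing a bytes pattern with a str pattern. This is only an analogy to what /u changes, and the sample subject is made up.

```python
import re

subject = "Ä sale"              # decoded text
raw = subject.encode("utf-8")   # the same text as UTF-8 bytes

# Byte-wise: '.' consumes a single byte, so it cannot cover the two-byte 'Ä'.
bytewise_dot = re.match(rb"^. ", raw)

# Byte-wise matching works only if the bytes are spelled out in hex.
bytewise_hex = re.match(rb"^\xc3\x84 ", raw)

# Character-wise: '.' consumes the whole character, and 'Ä' can be written directly.
charwise_dot = re.match(r"^. ", subject)
charwise_lit = re.match(r"^Ä ", subject)

if __name__ == "__main__":
    print(bytewise_dot, bytewise_hex, charwise_dot, charwise_lit)
```

Only the first attempt fails to match: without character awareness the pattern sees c3 and 84 as two separate units.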
> In summary, subjects containing only 7-bit characters (a-z, 0-9, space, +, ...) don't need any encoding.
> If they are encoded nevertheless, this might be a spammer trying to evade common string or regular
> expression checks.
>
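The heuristic in that last paragraph could be sketched in Python roughly as follows. The function name needlessly_encoded and the regular expression are assumptions for illustration, not rspamd's own rule, and str.isascii needs Python 3.7 or later.

```python
import re
from email.header import decode_header, make_header

# Loose pattern for an RFC 2047 encoded-word: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def needlessly_encoded(raw_subject):
    """True if the header uses encoded-words although the text is plain ASCII."""
    if not ENCODED_WORD.search(raw_subject):
        return False  # nothing is encoded at all
    decoded = str(make_header(decode_header(raw_subject)))
    return decoded.isascii()


if __name__ == "__main__":
    print(needlessly_encoded("=?UTF-8?B?SGVsbG8=?="))   # encodes "Hello"
    print(needlessly_encoded("=?UTF-8?B?w4RwZmVs?="))   # encodes "Äpfel"
    print(needlessly_encoded("Hello"))
```

A hit is only a hint, since some legitimate mail clients encode ASCII-only subjects as well.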
> Best regards,
> Gerald
> --
> Users mailing list
> Users at lists.rspamd.com
> https://lists.rspamd.com/mailman/listinfo/users