[Rspamd-Users] How to handle MIME encoded headers?

Dan Swartzendruber dswartz at druber.com
Mon Jun 17 22:23:41 UTC 2024


Got it thanks!

> On Jun 17, 2024, at 5:36 PM, Gerald Galster <list+rspamd at gcore.biz> wrote:
> 
> 
>> 
>>> Have you tried using the /u flag in your regexes?
>> 
>> I saw that listed, but it wasn't clear (to me at least) what that means.  Lemme take a shot at that...
> 
> /u is a character set modifier that sets the character set to unicode (utf-8) instead of ascii/latin1.
> 
> All bytestreams read from disk or the network must be interpreted somehow.
> Historically this has been ASCII (or ascii-extended / iso-8859-1 / latin1 / ...).
> 
> See for example the uppercase character 'A', which is at hex position 41 in the iso-8859-1 code page:
> https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout
> 
> It is also at hex position 41 in the UTF-8 code page:
> https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
> 
> So these two codepages do share some characters but not all as iso-8859-1 is strictly one byte
> and UTF-8 may take up to four bytes to define a single character (smileys, diacritics, grapheme clusters, ...).
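
To make the byte-level difference concrete, here is a minimal Python sketch (not rspamd code; the helper name `encoded_hex` is made up for illustration) that prints how a few characters are stored in each encoding:

```python
# Compare how the same characters are stored in ISO-8859-1 and in UTF-8.
# encoded_hex is a hypothetical helper, not part of any library.

def encoded_hex(ch, codec):
    """Return the bytes of ch in codec as a hex string, or None if not encodable."""
    try:
        return ch.encode(codec).hex()
    except UnicodeEncodeError:
        return None

for ch in ["A", "Ä", "😀"]:
    # 'A' is one identical byte in both; 'Ä' is one byte vs. two bytes;
    # the smiley does not exist in ISO-8859-1 and needs four bytes in UTF-8.
    print(ch, encoded_hex(ch, "iso-8859-1"), encoded_hex(ch, "utf-8"))
```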
> 
> Historically mail headers are limited to 7-bit ASCII and since iso-8859-1 and UTF-8 share this part
> of the code page layout, there is no need to process anything differently.
> 
> Consequently headers like Subject must be encoded somehow if they contain characters outside that range.
> That's where MIME encoded words (RFC 2047) come into play: any non-ascii characters are represented by a
> sequence of pure ascii characters, using either base64 (B) or a quoted-printable variant (Q).
> This is what you get when processing raw email headers (=?UTF-8?B?...).
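
As an illustration of what such an encoded header looks like before and after decoding, here is a small sketch using Python's standard `email.header` module (this is not how rspamd does it internally; the sample subjects and the `decode_subject` helper are made up):

```python
from email.header import decode_header, make_header

# Two made-up raw Subject values carrying the same text, "Äpfel":
raw_b = "=?UTF-8?B?w4RwZmVs?="    # B = base64 of the UTF-8 bytes
raw_q = "=?UTF-8?Q?=C3=84pfel?="  # Q = quoted-printable-like hex escapes

def decode_subject(raw):
    """Decode the encoded words in a raw header value into a Unicode string."""
    return str(make_header(decode_header(raw)))

print(decode_subject(raw_b))
print(decode_subject(raw_q))
```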
> 
> With certain configurations rspamd decodes those encoded headers to utf-8 characters like 'Ä' or 😀 .
> To match utf-8 / multibyte characters inside a regular expression, you need the /u modifier.
> Otherwise you would need to match the respective hex notations for example.
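
The difference between matching characters and matching raw bytes can be sketched in Python. This is only an analogy: Python's `re` module is not the regex engine rspamd uses, and it has no /u flag; matching a `str` stands in for "unicode mode" and matching `bytes` for the byte-oriented mode.

```python
import re

subject = "Äpfel"               # decoded text: 5 characters
raw = subject.encode("utf-8")   # the same text as 6 bytes (Ä is c3 84)

# Character-aware matching: '.' consumes the whole 'Ä'.
char_match = re.fullmatch(r".pfel", subject) is not None

# Byte-wise matching: '.' consumes a single byte, so the pattern fails.
byte_match = re.fullmatch(rb".pfel", raw) is not None

# Byte-wise matching works if the bytes are spelled out in hex notation.
hex_match = re.fullmatch(rb"\xc3\x84pfel", raw) is not None

print(char_match, byte_match, hex_match)
```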
> 
> In summary, subjects containing only 7-bit characters (a-z, 0-9, space, +, ...) don't need any encoding.
> If they are encoded nevertheless, this might be a spammer trying to evade common string or regular
> expression checks.
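
A rough sketch of that heuristic in Python, assuming the raw Subject value is available as a string (the function name `needlessly_encoded` is hypothetical, and this is not an rspamd rule). A hit is only a hint: some mail clients encode plain ASCII subjects for harmless reasons.

```python
from email.header import decode_header, make_header

def needlessly_encoded(raw_subject):
    """True if the header uses encoded words although the text is pure ASCII."""
    parts = decode_header(raw_subject)
    # decode_header reports a charset only for parts that were encoded words.
    if not any(charset is not None for _, charset in parts):
        return False
    decoded = str(make_header(parts))
    return decoded.isascii()

print(needlessly_encoded("=?UTF-8?B?SGVsbG8gV29ybGQ=?="))  # "Hello World", base64
print(needlessly_encoded("=?UTF-8?B?w4RwZmVs?="))          # "Äpfel", needs encoding
print(needlessly_encoded("Hello World"))                   # not encoded at all
```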
> 
> Best regards,
> Gerald

