[Rspamd-Users] URL Parsing error(s)

Steve Sturges (ststurge) ststurge at cisco.com
Mon Mar 28 19:23:20 UTC 2022

On Mar 27, 2022, at 7:53 AM, Vsevolod Stakhov <vsevolod at rspamd.com> wrote:

On 24/03/2022 17:19, Steve Sturges (ststurge) via Users wrote:
Hi all—
In a test with rspamd 3.1, I think I’ve identified a parsing error when a URL is extracted from an email message body, but the hostname is malformed.
First, consider a few simple URLs (which may be trying to fake a domain, such as linkedin.com), where the hostname is actually URL encoded:
From a lua callback, when invoking task:get_urls(), it returns both URLs unexpectedly with the %3D decoded as a =.
url list {[1] = http://www.li=kedin.com, [2] = http://www.linke=in.com}

But that's exactly how my Thunderbird highlights these urls.

Other mail clients (e.g., Apple Mail) I’m using see two links with the = (non-%3D encoded version): www.linke and in.com in the first example, www.li and kedin.com in the 2nd. Outlook/O365 web does not make either of them clickable.

However, when the %3D is replaced with the actual = sign representation,
the first URL is not even parsed and the URL list just includes this:
url list {[1] = http://www.li}

That might be an issue when parsing plain-text URLs, specifically for the `=` sign.

That second URL is expected, and what appears after the = is just treated as part of the text of the message.
I see two potential errors here:
1) URL decoding for the host name portion of a URL should not occur — only the data that should be URL encoded

I'm not sure why you think so, considering that email clients perform such decoding.

With the = in the middle of the hostname, it is not valid per RFCs 952 & 1123.

RFC 952:

<hname> ::= <name>*["."<name>]
<name>  ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]

RFC 1123 relaxes that to allow a digit as the leading character. But in no scenario is an = allowed. Maybe that would just trigger an internal symbol such as malformed hostname in URL or similar, and add a flag for the URL object, so it's accessible from C and Lua.
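
To make the hostname rule above concrete, here is a minimal Python sketch, not Rspamd code, that percent-decodes a host and then checks each label against the RFC 952/1123 grammar. The function names and the "malformed_host" key are hypothetical, and the %3D-encoded inputs are an assumption reconstructed from the decoded output quoted earlier in the thread.

```python
import re
from urllib.parse import unquote

# One label per RFC 952 as relaxed by RFC 1123: letters, digits and hyphens,
# starting and ending with a letter or digit, at most 63 characters.
_LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname(host: str) -> bool:
    """Return True if every dot-separated label matches the rules above."""
    if host.endswith("."):
        host = host[:-1]  # tolerate a single trailing root dot
    if not host or len(host) > 253:
        return False
    return all(_LABEL_RE.fullmatch(label) for label in host.split("."))


def check_url_host(raw_host: str) -> dict:
    """Percent-decode a host and report whether the result is well formed.

    "malformed_host" is a hypothetical flag name, not an Rspamd API.
    """
    decoded = unquote(raw_host)
    return {"host": decoded, "malformed_host": not is_valid_hostname(decoded)}


if __name__ == "__main__":
    # Assumed inputs: the encoded forms are inferred, not taken from the message.
    for raw in ("www.li%3Dkedin.com", "www.linke%3Din.com", "www.linkedin.com"):
        print(raw, "->", check_url_host(raw))
```

Decoding first and validating afterwards keeps the behaviour mail clients show (the = is visible in the host) while still letting a rule react to the fact that the decoded host is not a legal name.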

2) In the second example, the first URL, in theory should be decoded as http://www.linke.

I believe that Rspamd should do the same as clients do.
Agreed.  Unfortunately, different mail clients do different things.

I will try to look through the Lua plugin code to see if there is an obvious fix next week, unless someone beats me to it.

It's not in the Lua part, I'm afraid; that is deep in the internals of Rspamd's URL parser: https://github.com/rspamd/rspamd/blob/master/src/libserver/url.c#L1056

Users mailing list
Users at lists.rspamd.com
