[Rspamd-Users] URL Parsing error(s)

Tue Mar 29 17:05:01 UTC 2022

On Mar 28, 2022, at 4:00 PM, Vsevolod Stakhov <vsevolod at rspamd.com> wrote:
On 28/03/2022 20:23, Steve Sturges (ststurge) via Users wrote:
On Mar 27, 2022, at 7:53 AM, Vsevolod Stakhov <vsevolod at rspamd.com> wrote:
On 24/03/2022 17:19, Steve Sturges (ststurge) via Users wrote:
Hi all—
In a test with rspamd 3.1, I think I’ve identified a parsing error when a URL is extracted from an email message body, but the hostname is malformed.
First, consider a few simple URLs (which may be trying to fake a domain, such as linkedin.com), where the hostname is actually URL-encoded:
http://www.linke%3Din.com
http://www.li%3Dkedin.com
From a Lua callback, task:get_urls() unexpectedly returns both URLs with the %3D decoded as =.
url list {[1] = http://www.li=kedin.com, [2] = http://www.linke=in.com}
But that's exactly how my Thunderbird highlights these urls.
Other mail clients (e.g., Apple Mail) I’m using see two links with the = (non-%3D encoded version) (www.linke and in.com in the first example, www.li and kedin.com in the 2nd). Outlook/O365 web does not make either of them clickable.
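The decoding step under discussion can be reproduced outside rspamd. The following is a minimal sketch using only the Python standard library; it is not rspamd's parser, and the helper name `decoded_host` is made up for illustration. It only shows that percent-decoding the authority part of the two sample URLs yields the `=` forms reported above.

```python
from urllib.parse import unquote, urlsplit


def decoded_host(url: str) -> str:
    """Return the authority part of `url` with percent-escapes decoded.

    Hypothetical helper for illustration only: it uses the raw netloc,
    so userinfo and port (if present) are not stripped.
    """
    return unquote(urlsplit(url).netloc)


if __name__ == "__main__":
    for url in ("http://www.linke%3Din.com", "http://www.li%3Dkedin.com"):
        # Prints the host with %3D turned into a literal "=".
        print(url, "->", decoded_host(url))
```

Whether a parser should apply this decoding to the host at all is exactly the open question in this thread; the sketch takes no position on it.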
However, when the %3D is replaced with an actual = sign,
http://www.linke=in.com
http://www.li=kedin.com
the first URL is not even parsed and the URL list just includes this:
url list {[1] = http://www.li}
That might be an issue when parsing plain-text URLs, specifically for the `=` sign.
That second URL is expected, and what appears after the = is just treated as part of the text of the message.
I see two potential errors here:
1) URL decoding should not occur for the hostname portion of a URL, only for the parts that are supposed to be URL-encoded
I'm not sure why you think so, given that email clients perform such decoding.
With the = in the middle of the hostname, it is not valid per RFCs 952 and 1123.
RFC 952:
<hname> ::= <name>*["."<name>]
<name>  ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]
RFC 1123 updates that to allow a digit as the leading character. But in no scenario is an = allowed. Maybe that would just trigger an internal symbol such as "malformed hostname in URL" or similar, and add a flag to the URL object so it's accessible from C and Lua.
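As a rough illustration of the check being proposed, here is a minimal Python sketch of the RFC 952 grammar with the RFC 1123 relaxation (a label may start with a digit). It is not rspamd code: the function names `is_rfc1123_hostname` and `has_malformed_host` are hypothetical, the length limits (63 per label, 253 overall) come from the DNS rules rather than from this thread, and the host is percent-decoded first only to mirror the behaviour described earlier.

```python
import re
from urllib.parse import unquote, urlsplit

# One label: letters, digits and hyphens only; first and last character
# must be a letter or digit; at most 63 characters in total.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_rfc1123_hostname(host: str) -> bool:
    """Return True if every dot-separated label of `host` is well formed."""
    if not host or len(host) > 253:
        return False
    if host.endswith("."):
        host = host[:-1]  # tolerate a single trailing root dot
    return all(_LABEL.fullmatch(label) for label in host.split("."))


def has_malformed_host(url: str) -> bool:
    """Hypothetical stand-in for the proposed 'malformed hostname' flag.

    Uses the raw netloc, so this sketch assumes no userinfo or port.
    """
    host = unquote(urlsplit(url).netloc)
    return not is_rfc1123_hostname(host)


if __name__ == "__main__":
    for url in ("http://www.linkedin.com",
                "http://www.linke%3Din.com",
                "http://www.li=kedin.com"):
        print(url, "malformed" if has_malformed_host(url) else "ok")
```

A check of this shape would flag any host containing `=`, whether it arrived literally or as `%3D`, without changing how the URL itself is extracted.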

Yes, this looks like a most feasible option indeed. I need to check some other details though.

URL extraction from plain text is just painful, tbh...

Thanks, Vsevolod.

We’ve seen a few more examples where there is an odd character in the hostname, e.g. http://sharepoint\\.com, where there is a \ before the .com. Following the RFC definitions, that would be covered by an internal symbol and a flag being set.
