[Rspamd-Users] URL Parsing error(s)

Vsevolod Stakhov vsevolod at rspamd.com
Mon Mar 28 20:00:53 UTC 2022

On 28/03/2022 20:23, Steve Sturges (ststurge) via Users wrote:
> On Mar 27, 2022, at 7:53 AM, Vsevolod Stakhov <vsevolod at rspamd.com> wrote:
> On 24/03/2022 17:19, Steve Sturges (ststurge) via Users wrote:
> Hi all—
> In a test with rspamd 3.1, I think I’ve identified a parsing error when a URL is extracted from an email message body, but the hostname is malformed.
> First, consider a few simple URLs (which may be trying to fake a domain, such as linkedin.com), where the hostname is actually URL encoded:
> http://www.linke%3Din.com
> http://www.li%3Dkedin.com
> From a Lua callback, task:get_urls() unexpectedly returns both URLs with the %3D decoded as a =.
> url list {[1] = http://www.li=kedin.com, [2] = http://www.linke=in.com}
> But that's exactly how my Thunderbird highlights these urls.
> Other mail clients I’m using (e.g., Apple Mail) see two links split at the = (the non-%3D-encoded version): www.linke and in.com in the first example, www.li and kedin.com in the second. Outlook/O365 web does not make either of them clickable.
> However, when the %3D is replaced with the actual = sign representation,
> http://www.linke=in.com
> http://www.li=kedin.com
> the first URL is not even parsed and the URL list just includes this:
> url list {[1] = http://www.li}
> That might be an issue when parsing plain-text URLs, specifically with the `=` sign.
> That second URL is expected, and what appears after the = is just treated as part of the text of the message.
> I see two potential errors here:
> 1) URL decoding should not be applied to the hostname portion of a URL, only to the parts that may legitimately be URL encoded
> I'm not sure why you think so, given that email clients perform such decoding.
> With the = in the middle of the hostname, it is not valid per RFCs 952 & 1123.
> RFC 952:
> <hname> ::= <name>*["."<name>]
> <name>  ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]
> RFC 1123 updates that to allow a digit as the leading character.  But in no scenario is an = allowed.  Maybe that would just trigger an internal symbol such as "malformed hostname in URL" or similar, and add a flag to the URL object so it's accessible from C and Lua.

Yes, this looks like the most feasible option indeed. I need to check some
other details though.
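
For illustration only, here is a minimal sketch of the check discussed
above: decode the host the way a mail client might, then test it against
the RFC 952 / RFC 1123 label grammar and set a flag instead of rejecting
the URL. It is written in Python rather than against rspamd's C or Lua
API, and the names host_is_wellformed, check_url and malformed_host are
hypothetical, not anything that exists in rspamd. It assumes ASCII
hostnames only (no IDN handling).

```python
import re
from urllib.parse import unquote, urlsplit

# One hostname label per RFC 952 as relaxed by RFC 1123: letters, digits and
# hyphens, starting and ending with a letter or digit, at most 63 characters.
LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def host_is_wellformed(host: str) -> bool:
    """Return True if every dot-separated label matches the grammar above."""
    if not host or len(host) > 253:
        return False
    return all(LABEL_RE.fullmatch(label) for label in host.split("."))


def check_url(url: str) -> dict:
    """Percent-decode the host and report whether it is malformed.

    The 'malformed_host' key is a hypothetical flag name used only here.
    """
    raw_host = urlsplit(url).hostname or ""
    decoded = unquote(raw_host)
    return {"host": decoded, "malformed_host": not host_is_wellformed(decoded)}


if __name__ == "__main__":
    for u in ("http://www.linke%3Din.com", "http://www.linkedin.com"):
        print(u, check_url(u))
```

With this approach the decoded host (www.linke=in.com) is still reported,
matching what the mail clients display, but it carries a flag that a rule
could act on.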

URL extraction from plain text is just painful, tbh...
