[Rspamd-Users] URL Parsing error(s)

Sun Mar 27 11:53:58 UTC 2022

On 24/03/2022 17:19, Steve Sturges (ststurge) via Users wrote:
> Hi all—
> 
> In a test with rspamd 3.1, I think I’ve identified a parsing error when a URL is extracted from an email message body, but the hostname is malformed.
> 
> First, consider a few simple URLs (which may be trying to fake a domain, such as linkedin.com), where the hostname is actually URL encoded:
> 
> http://www.linke%3Din.com
> http://www.li%3Dkedin.com
> 
> From a Lua callback, task:get_urls() unexpectedly returns both URLs with the %3D decoded as =.
> 
> url list {[1] = http://www.li=kedin.com, [2] = http://www.linke=in.com}

But that's exactly how my Thunderbird highlights these URLs.
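
For illustration, the decoding described above can be reproduced outside
Rspamd. This is a minimal Python sketch using only the standard library's
urllib.parse; it is not Rspamd's code, and decoded_host is a hypothetical
helper name. It only approximates what a client that percent-decodes the
host part would end up displaying:

```python
from urllib.parse import urlsplit, unquote


def decoded_host(url: str) -> str:
    """Return the host part of `url` with percent-escapes decoded.

    Hypothetical helper for illustration; not part of Rspamd.
    """
    host = urlsplit(url).netloc  # e.g. "www.linke%3Din.com"
    return unquote(host)         # "%3D" becomes "="


for u in ("http://www.linke%3Din.com", "http://www.li%3Dkedin.com"):
    print(decoded_host(u))
```

Both hosts come back with a literal = in them, which matches the list
returned by task:get_urls() in the report.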

> However, when the %3D is replaced with a literal = sign,
> 
> http://www.linke=in.com
> http://www.li=kedin.com
> 
> the first URL is not even parsed and the URL list just includes this:
> 
> url list {[1] = http://www.li}

That might be an issue when parsing plain-text URLs, specifically with
the `=` sign.
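
One way this result could arise, as a guess that has not been checked
against url.c: a plain-text scanner ends the host at the first character
that is not allowed in a hostname, then keeps the candidate only if its last
label is a known TLD. "li" is a real ccTLD, while "linke" is not a TLD, so
only the second URL would survive. The Python sketch below is a toy model of
that idea; KNOWN_TLDS, URL_RE and extract_urls are hypothetical names, and
the two-entry TLD set stands in for a real suffix list:

```python
import re

# Assumption: a tiny stand-in for a real TLD / public-suffix list.
KNOWN_TLDS = {"com", "li"}

# Scheme, then the longest run of characters permitted in a hostname.
# "=" is not in the class, so the host ends right before it.
URL_RE = re.compile(r"(https?://)([A-Za-z0-9.-]+)")


def extract_urls(text: str) -> list[str]:
    """Toy plain-text URL extractor; not Rspamd's algorithm."""
    found = []
    for m in URL_RE.finditer(text):
        scheme, host = m.group(1), m.group(2).rstrip(".")
        tld = host.rsplit(".", 1)[-1].lower()
        # Keep the candidate only if it has a dot and a known TLD.
        if "." in host and tld in KNOWN_TLDS:
            found.append(scheme + host)
    return found


body = "http://www.linke=in.com\nhttp://www.li=kedin.com"
print(extract_urls(body))
```

Under these assumptions the first candidate (host "www.linke") is rejected
and the second is cut to http://www.li, with the rest left as message text.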

> 
> That second URL is expected, and what appears after the = is just treated as part of the text of the message.
> 
> I see two potential errors here:
> 
> 1) URL decoding for the host name portion of a URL should not occur — only the data that should be URL encoded

I'm not sure why you think so, given that email clients perform
such decoding.

> 2) In the second example, the first URL, in theory should be decoded as http://www.linke.

I believe that Rspamd should do the same as clients do.

> I will try to look through the Lua plugin code next week to see if there is an obvious fix, unless someone beats me to it.
> 

It's not in the Lua part, I'm afraid; this is deep in the internals of
Rspamd's URL parser:
https://github.com/rspamd/rspamd/blob/master/src/libserver/url.c#L1056