[Rspamd-Users] URL Parsing error(s)

Vsevolod Stakhov vsevolod at rspamd.com
Sun Mar 27 11:53:58 UTC 2022

On 24/03/2022 17:19, Steve Sturges (ststurge) via Users wrote:
> Hi all—
> In a test with rspamd 3.1, I think I’ve identified a parsing error when a URL is extracted from an email message body, but the hostname is malformed.
> First, consider a few simple URLs (which may be trying to fake a domain, such as linkedin.com), where the hostname is actually URL encoded:
> http://www.linke%3Din.com
> http://www.li%3Dkedin.com
> From a Lua callback, invoking task:get_urls() unexpectedly returns both URLs with the %3D decoded as =.
> url list {[1] = http://www.li=kedin.com, [2] = http://www.linke=in.com}

But that's exactly how my Thunderbird highlights these urls.
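
To make the decoding step concrete, here is a minimal Python sketch of 
percent-decoding the host part of a URL. It is only an illustration using 
the standard library, not Rspamd's parser; the helper name `decoded_host` 
is made up for this example.

```python
from urllib.parse import unquote, urlsplit


def decoded_host(url: str) -> str:
    """Return the host of `url` with percent-escapes decoded.

    Hypothetical helper for illustration only; Rspamd's real URL
    parser is a separate implementation and is not shown here.
    """
    # urlsplit() leaves percent-escapes in the host untouched
    # (it only lowercases the hostname).
    host = urlsplit(url).hostname or ""
    # unquote() turns %3D (or %3d) into a literal "=".
    return unquote(host)


for url in ("http://www.linke%3Din.com", "http://www.li%3Dkedin.com"):
    print(url, "->", decoded_host(url))
```

With this approach both hosts come out with a literal `=` in them, which 
is the same shape as the url list quoted above.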

> However, when the %3D is replaced with the actual = sign representation,
> http://www.linke=in.com
> http://www.li=kedin.com
> the first URL is not even parsed and the URL list just includes this:
> url list {[1] = http://www.li}

That might be an issue when parsing plain-text URLs, specifically with 
the `=` sign.
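
As a rough illustration of how a plain-text scanner could end up with this 
result, here is a toy Python sketch. Everything in it is an assumption for 
the sake of the example: the allowed host characters, the tiny `KNOWN_TLDS` 
set, and the function name `extract_plain_text_urls` are all hypothetical, 
and none of it is taken from Rspamd's actual parser.

```python
import re

# Assumption: the toy scanner stops the host at the first character that
# is not a letter, digit, dot or hyphen, so "=" ends the host.
HOST_RE = re.compile(r"https?://([A-Za-z0-9.-]+)")

# Assumption: a tiny stand-in for a real public suffix / TLD list.
KNOWN_TLDS = {"com", "li"}


def extract_plain_text_urls(text: str) -> list[str]:
    """Toy plain-text URL extractor (hypothetical, not Rspamd's code)."""
    urls = []
    for match in HOST_RE.finditer(text):
        host = match.group(1).rstrip(".")
        tld = host.rsplit(".", 1)[-1].lower()
        # Keep the candidate only if the truncated host still ends
        # in something that looks like a known TLD.
        if "." in host and tld in KNOWN_TLDS:
            urls.append(match.group(0).rstrip("."))
    return urls


body = "http://www.linke=in.com\nhttp://www.li=kedin.com"
print(extract_plain_text_urls(body))
```

In this sketch the first candidate is cut to `www.linke`, which has no 
known TLD and is dropped, while the second is cut to `www.li`, which passes 
the check. Whether Rspamd behaves this way for the same reason is not 
confirmed here.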

> That second URL is expected, and what appears after the = is just treated as part of the text of the message.
> I see two potential errors here:
> 1) URL decoding should not occur for the hostname portion of a URL, only for the data that is meant to be URL encoded

I'm not sure why you think so, considering that email clients perform 
such decoding.

> 2) In the second example, the first URL should, in theory, be decoded as http://www.linke.

I believe that Rspamd should do the same as clients do.

> I will try to look through the Lua plugin code next week to see if there is an obvious fix, unless someone beats me to it.

It's not in the Lua part, I'm afraid; those are deep internals of the URL 
parser of Rspamd: 
