[Rspamd-Users] URL Parsing error(s)
Vsevolod Stakhov
vsevolod at rspamd.com
Sun Mar 27 11:53:58 UTC 2022
On 24/03/2022 17:19, Steve Sturges (ststurge) via Users wrote:
> Hi all—
>
> In a test with rspamd 3.1, I think I’ve identified a parsing error when a URL is extracted from an email message body, but the hostname is malformed.
>
> First, consider a few simple URLs (which may be trying to fake a domain, such as linkedin.com), where the hostname is actually URL encoded:
>
> http://www.linke%3Din.com
> http://www.li%3Dkedin.com
>
> From a lua callback, when invoking task:get_urls(), it returns both URLs unexpectedly with the %3D decoded as a =.
>
> url list {[1] = http://www.li=kedin.com, [2] = http://www.linke=in.com}
But that's exactly how my Thunderbird highlights these URLs.
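
To illustrate the decoding step being discussed, here is a minimal sketch in Python using only the standard library. It is not Rspamd's parser, and `decoded_host` is a hypothetical helper name chosen for this example; it only shows that percent-decoding the host part turns `%3D` into `=`.

```python
from urllib.parse import urlsplit, unquote

def decoded_host(url: str) -> str:
    """Return the host part of `url` with percent-escapes decoded.

    Hypothetical helper for illustration only; it does not
    reproduce Rspamd's URL parser.
    """
    # urlsplit() leaves percent-escapes in the host untouched
    # (it only lowercases the hostname).
    host = urlsplit(url).hostname or ""
    # unquote() performs the percent-decoding, so %3D becomes "=".
    return unquote(host)

for url in ("http://www.linke%3Din.com", "http://www.li%3Dkedin.com"):
    print(url, "->", decoded_host(url))
```

Whether a filter should apply this decoding to the host is exactly the question raised in the thread; the sketch only demonstrates what the decoded form looks like.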
> However, when the %3D is replaced with the actual = sign representation,
>
> http://www.linke=in.com
> http://www.li=kedin.com
>
> the first URL is not even parsed and the URL list just includes this:
>
> url list {[1] = http://www.li}
That might be an issue when parsing plain-text URLs, specifically with
the `=` sign.
>
> That second URL is expected, and what appears after the = is just treated as part of the text of the message.
>
> I see two potential errors here:
>
> 1) URL decoding for the host name portion of a URL should not occur — only the data that should be URL encoded
I'm not sure why you think so, given that email clients perform
such decoding.
> 2) In the second example, the first URL, in theory should be decoded as http://www.linke.
I believe that Rspamd should do the same as clients do.
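
For reference, the behaviour expected in point 2 (cutting the host at the first character that cannot appear in a hostname) can be sketched as follows. This is a toy extractor written for this example, not how Rspamd's URL parser works; the pattern and the name `extract_urls` are assumptions made for illustration.

```python
import re

# Assumption for this sketch: a host consists only of ASCII letters,
# digits, dots and hyphens, so scanning stops at "=" or any other
# character outside that set.
URL_PREFIX = re.compile(r"https?://[A-Za-z0-9.-]+")

def extract_urls(text: str) -> list[str]:
    """Return scheme-plus-host prefixes found in plain text (toy example)."""
    return URL_PREFIX.findall(text)

body = "see http://www.linke=in.com and http://www.li=kedin.com"
print(extract_urls(body))
```

With this naive rule both URLs are truncated at the `=` sign, which is what point 2 describes as the theoretical result; real mail clients and Rspamd may apply different rules.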
> I will try to look through the Lua plugin code to see if there is an obvious fix next week, unless someone beats me to it.
>
It's not in the Lua part, I'm afraid; this is deep in the internals of
Rspamd's URL parser:
https://github.com/rspamd/rspamd/blob/master/src/libserver/url.c#L1056