[patch] HTML parser bugfix

Discussion:

123

2012-06-03 18:29:51 UTC

The following HTML fragment is shown as
BUG<">abc
by Dillo. It is valid (checked with validator.w3.org) HTML. In Firefox
it is shown as
abc
with '>BUG<' tooltip.

<div title=">BUG<">abc</div>

What happens is that Dillo see < character inside quotes, shows bug
"attribute lacks closing quote" and breaks out of tag parsing loop. In
fact it is not possible to say if there is a bug in HTML at that
point.

In WWW you can see this bug on http://reddit.com. Each entry on this
page contains
votehash', null, event)" >
and bug meter shows bugs for them.

Attached patch removes logic that tries to detect unterminated quoted
attributes and fixes this bug.

Jeremy Henty

2012-06-03 20:00:55 UTC

Permalink

Attached patch removes logic that tries to detect unterminated
quoted attributes and fixes this bug.

I would be wary of applying this patch without further thought. I run
Dillo with a very similar patch and it breaks *many* more pages than
it fixes. It is unfortunately necessary to detect and recover from
quoting errors in attributes. For instance, the web is *full* of
pages with tags containing this:

foo="bar""

If you don't detect and ignore that extra double quote you will break
many pages that every other browser renders perfectly well.

It's not even that uncommon to see monstrosities like this:

foo=""bar"""

It is also common to see short attributes (often CSS lengths)
delimited with a mixture of single and double quotes, eg:

size="12pt'

For these reasons, sadly, it is IMHO not realistic to just drop the
misquote-detection logic.

Regards,

Jeremy Henty

123

2012-06-03 20:45:29 UTC

Permalink

Post by Jeremy Henty
I would be wary of applying this patch without further thought. I run
Dillo with a very similar patch and it breaks *many* more pages than
it fixes. It is unfortunately necessary to detect and recover from
quoting errors in attributes. For instance, the web is *full* of
foo="bar""
If you don't detect and ignore that extra double quote you will break
many pages that every other browser renders perfectly well.

Then there should be some logic for detecting double quotes. Searching
for < inside quotes is not the right way as it breaks valid pages like
reddit main page.

Can you give examples of real web pages with double quotes?

Post by Jeremy Henty
foo=""bar"""

<div title=""abc""">def</div>

does not show tip "abc" for def in Firefox. It silently displays
nothing in Dillo after patch, however. I will try to fix it.

Post by Jeremy Henty
It is also common to see short attributes (often CSS lengths)
size="12pt'

<div title="abc'>def</div>

displays nothing in Firefox. What browser can render it?

123

2012-06-04 12:19:13 UTC

Permalink

According to WHATWG [1] fragment

<div title=""abc""">def</div>

should be parsed without errors until 'a' character. abc""" should be
parsed as attribute name with parse errors because of lack of space
character and quotes inside attribute name.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html

Jeremy Henty

2012-06-06 23:14:06 UTC

Permalink

Post by 123

If you don't detect and ignore that extra double quote you will
break many pages that every other browser renders perfectly well.

Then there should be some logic for detecting double quotes.

I agree. The hard question is: what logic?

Post by 123
Searching for < inside quotes is not the right way as it breaks
valid pages like reddit main page.

Again, I agree. In fact, I decided to experiment with removing
Dillo's misquote-detection just because it broke reddit.
Unfortunately it is very hard to come up with an algorithm that
correctly handles the various quoting horrors that you see all the
time, yet also correctly parses embedded javascript fragments like:

onclick='alert("clicked")'

(IIRC Dillo mis-renders reddit because of embedded javascript like
this.)

Post by 123
Can you give examples of real web pages with double quotes?

I have attached a bunch of links that I collected after spending a lot
of time debugging pages that broke my locally-patched Dillo. (Ignore
the file:///... URLs as they obviously won't work for you.) I have
also attached the two patches I use. I think the first does the same
thing that you propose, although I haven't checked that in detail.
The second attempts to copy Firefox's misquote-detection algorithm.

I agree that Dillo's misquote-detection isn't good enough, but just
taking it out is a step backwards, and it is a mistake to propose
changes to it based solely on particular pages that Dillo mis-renders,
because any such change needs to be tested against the many other
broken pages that Dillo currently renders more or less correctly.

Regards,

Jeremy Henty

123

2012-06-07 12:45:42 UTC

Permalink

Post by Jeremy Henty

Post by 123

If you don't detect and ignore that extra double quote you will
break many pages that every other browser renders perfectly well.

Then there should be some logic for detecting double quotes.

I agree. The hard question is: what logic?

IMO the right way to do it is to implement HTML Standard [1]. When
Html_write_raw is called, parser is in the Data state. When it returns
to Data state again, it is the end of token.

I have implemented standard comment/DOCTYPE parsing. It is incomplete,
EOF is not handled and DOCTYPE parsing is not changed. Patch is
attached. Next step is to rewrite tag parsing in standard way.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenization

123

2012-06-07 15:18:11 UTC

Permalink

Forgot to attach.

Johannes Hofmann

2012-06-12 20:39:18 UTC

Permalink

Post by 123

Post by Jeremy Henty

Post by 123

If you don't detect and ignore that extra double quote you will
break many pages that every other browser renders perfectly well.

Then there should be some logic for detecting double quotes.

I agree. The hard question is: what logic?

IMO the right way to do it is to implement HTML Standard [1]. When
Html_write_raw is called, parser is in the Data state. When it returns
to Data state again, it is the end of token.
I have implemented standard comment/DOCTYPE parsing. It is incomplete,
EOF is not handled and DOCTYPE parsing is not changed. Patch is
attached. Next step is to rewrite tag parsing in standard way.
[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenization

With your patch text disappears e.g. on
http://www.cnas.org/blogs/abumuqawama/2011/04/quote-day.html-0
which was mentioned by Jeremy while it renders ok with current dillo and firefox.

I would rather put together a test page that includes all the cases
Jeremy brought up plus the reddit one.

Also looking into firefox or other browser sources might be a good
start.

Cheers,
Johannes

Jorge Arellano Cid

2012-06-04 15:35:42 UTC

Permalink

Post by Jeremy Henty

Attached patch removes logic that tries to detect unterminated
quoted attributes and fixes this bug.

I would be wary of applying this patch without further thought. I run
Dillo with a very similar patch and it breaks *many* more pages than
it fixes. It is unfortunately necessary to detect and recover from
quoting errors in attributes. For instance, the web is *full* of
foo="bar""
If you don't detect and ignore that extra double quote you will break
many pages that every other browser renders perfectly well.
foo=""bar"""
It is also common to see short attributes (often CSS lengths)
size="12pt'
For these reasons, sadly, it is IMHO not realistic to just drop the
misquote-detection logic.

Ditto.

(FWIW, maybe this is the fourth time this patch is received;
In the past the submitter has always desisted after some more
testing).

--
Cheers
Jorge.-