Discussion:
character references, trailing ';', urls
eocene
2014-05-04 00:34:24 UTC
I was looking at how badly dillo handles something like:

<a href="http://www.dillo.org?asdf&copy=3&micro=zxcv">link</a>

It becomes a much more common problem with html5, which has a
_lot_ more character references.


I could perhaps stick an argument on the Html_parse_entity() in
Html_get_attr2(), telling it to insist upon finding a ';'.

If we still had cvs.auriga, I could dig through prehistory and
try to see whether not demanding ';' termination was initially
done with the strong belief that it was for the best overall
(or maybe it was even inherited from gzilla), but we don't have
cvs.auriga, and we don't have mailing list search working (not
that that's generally very fun to dig through in any case).
After all, maybe we should always insist upon proper termination.
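
To make the proposal concrete, here is a minimal sketch in Python (not Dillo's actual code) of a decoder with a flag that insists on the ';'. The names `decode_refs`, `require_semicolon` and the four-entry `ENTITIES` table are hypothetical and exist only for illustration; Html_parse_entity() itself is not reproduced here.

```python
import re

# Tiny illustrative table (an assumption); a real browser carries far more names.
ENTITIES = {"amp": "&", "copy": "\u00a9", "micro": "\u00b5", "nbsp": "\u00a0"}

# '&', a name, and an optional terminating ';'.
_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def decode_refs(text, require_semicolon):
    """Replace known named character references in text.

    With require_semicolon=False a known name is decoded even when the
    ';' is missing; with True an unterminated reference is left as-is.
    """
    def repl(match):
        name, semi = match.group(1), match.group(2)
        if name not in ENTITIES:
            return match.group(0)      # unknown name: leave untouched
        if require_semicolon and not semi:
            return match.group(0)      # known name but no ';': leave untouched
        return ENTITIES[name]
    return _REF.sub(repl, text)


url = "http://www.dillo.org?asdf&copy=3&micro=zxcv"
print(decode_refs(url, require_semicolon=False))  # query string gets mangled
print(decode_refs(url, require_semicolon=True))   # URL survives unchanged
```

In lenient mode the `&copy` and `&micro` in the query string are decoded, which is presumably what goes wrong with the link above; in strict mode the attribute value comes through intact, while properly terminated references such as `&amp;` are still decoded.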
Jorge Arellano Cid
2014-05-04 13:20:07 UTC
Hi,
Post by eocene
<a href="http://www.dillo.org?asdf&copy=3&micro=zxcv">link</a>
It becomes a much more common problem with html5, which has a
_lot_ more character references.
I could perhaps stick an argument on the Html_parse_entity() in
Html_get_attr2(), telling it to insist upon finding a ';'.
If we still had cvs.auriga, I could dig through prehistory and
try to see whether not demanding ';' termination was initially
done with the strong belief that it was for the best overall
(or maybe it was even inherited from gzilla), but we don't have
cvs.auriga, and we don't have mailing list search working (not
that that's generally very fun to dig through in any case).
After all, maybe we should always insist upon proper termination.
These heuristics are not simple.

AFAIR the original routine was written to require the trailing ';'
and it worked well for some time. Then more pages started to show
unterminated entities inside, and it got so annoying we decided to
make it more flexible and not to require the ';' when the entity
name was found (IIRC).

It'd be good to find the reason for the change before reverting it.
I don't remember it now, but I do remember it was because the other way
started to be perceived as worse in some sense.

Maybe GMANE has the mailing list archives...

(A similar situation arises with the question of, e.g., allowing H1
inside the A element.)

A bit of history: in the very beginning Dillo had strict
parsing. The motto was not to try to fix bad HTML. After a few
years dillo became more and more annoying (tag soup or HTML
violations were not fixed), and the "Tag soup" pages looked
really bad in it (hence the bug meter). At some point we had to
change the policy because it was a lost war and dillo was
becoming more and more unusable/irrelevant. At this point our
policy is more or less: we try to render tag soup and use
heuristics to do a good job of correcting the usual problems, but we
haven't given up on informing the user/author of all the HTML
errors we find in the page.
--
Cheers
Jorge.-
eocene
2014-05-04 17:19:04 UTC
Post by Jorge Arellano Cid
Post by eocene
<a href="http://www.dillo.org?asdf&copy=3&micro=zxcv">link</a>
It becomes a much more common problem with html5, which has a
_lot_ more character references.
I could perhaps stick an argument on the Html_parse_entity() in
Html_get_attr2(), telling it to insist upon finding a ';'.
If we still had cvs.auriga, I could dig through prehistory and
try to see whether not demanding ';' termination was initially
done with the strong belief that it was for the best overall
(or maybe it was even inherited from gzilla), but we don't have
cvs.auriga, and we don't have mailing list search working (not
that that's generally very fun to dig through in any case).
After all, maybe we should always insist upon proper termination.
These heuristics are not simple.
AFAIR the original routine was written to require the trailing ';'
and it worked well for some time. Then more pages started to show
unterminated entities inside, and it got so annoying we decided to
make it more flexible and not to require the ';' when the entity
name was found (IIRC).
Yeah, this is why I was considering just changing the get_attr case,
but of course I don't want to make the code messy and complicated
unless I need to.
Post by Jorge Arellano Cid
It'd be good to find the reason for the change before reverting it.
I don't remember it now, but I do remember it was because the other way
started to be perceived as worse in some sense.
Maybe GMANE has the mailing list archives...
I guess I'll put some time into digging around.

Maybe it happens out there and I just don't hear about it, but I
wonder why projects don't tend to keep track -- in some organized
fashion by topic, like in a wiki or group of static web pages or
something -- of all the decisions made on various issues and the
reasoning behind them, since it's hard to remember details
for years, people come and go, etc.
eocene
2014-05-04 21:09:48 UTC
Post by eocene
Post by Jorge Arellano Cid
AFAIR the original routine was written to require the trailing ';'
and it worked well for some time. Then more pages started to show
unterminated entities inside, and it got so annoying we decided to
make it more flexible and not to require the ';' when the entity
name was found (IIRC).
Yeah, this is why I was considering just changing the get_attr case,
but of course I don't want to make the code messy and complicated
unless I need to.
Post by Jorge Arellano Cid
It'd be good to find the reason for the change before reverting it.
I don't remember it now, but I do remember it was because the other way
started to be perceived as worse in some sense.
Maybe GMANE has the mailing list archives...
I guess I'll put some time into digging around.
http://lists.dillo.org/pipermail/dillo-dev/2005-January/002502.html

where we get the end of a conversation between Jorge and Matthias Franz.

This msg says that it was changed because it wasn't required under
certain conditions. HTML4 spec gives it as:

Note. In SGML, it is possible to eliminate the final ";" after a
character reference in some cases (e.g., at a line break or
immediately before a tag). In other circumstances it may not be
eliminated (e.g., in the middle of a word). We strongly suggest
using the ";" in all cases to avoid problems with user agents that
require this character to be present.


...and there's an "IIRC" in the msg that XHTML requires it.

The HTML5 spec requires a terminating ';' in all cases.
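
For comparison, here is a rough Python sketch of a rule modelled on the HTML4 note quoted above. It takes a deliberately narrow reading of the note's two examples: an unterminated reference is accepted only at the end of the input, at a line break, or immediately before a '<'. The function name `decode_refs_html4ish` and the small `ENTITIES` table are hypothetical, and this is not a claim about what SGML parsers or Dillo actually do.

```python
import re

# Tiny illustrative table (an assumption); real tables are much larger.
ENTITIES = {"amp": "&", "copy": "\u00a9", "micro": "\u00b5", "nbsp": "\u00a0"}

# '&', a name, and an optional terminating ';'.
_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def decode_refs_html4ish(text):
    """Decode terminated references always; decode unterminated ones only
    at end of input, at a line break, or right before a tag ('<')."""
    def repl(match):
        name, semi = match.group(1), match.group(2)
        if name not in ENTITIES:
            return match.group(0)          # unknown name: leave untouched
        if semi:
            return ENTITIES[name]          # properly terminated
        following = text[match.end():match.end() + 1]
        if following in ("", "\n", "\r", "<"):
            return ENTITIES[name]          # boundary cases from the note
        return match.group(0)              # e.g. followed by '=': leave as-is
    return _REF.sub(repl, text)


print(repr(decode_refs_html4ish("a&nbsp<b>bold</b>")))
print(decode_refs_html4ish("http://www.dillo.org?asdf&copy=3&micro=zxcv"))
```

Under this reading the unterminated `&nbsp` before a tag is still decoded, while `&copy=3` and `&micro=zxcv` in the URL are left alone because the reference is followed by '='.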
Jorge Arellano Cid
2014-05-05 11:37:53 UTC
Post by eocene
This msg says that it was changed because it wasn't required under
Note. In SGML, it is possible to eliminate the final ";" after a
character reference in some cases (e.g., at a line break or
immediately before a tag). In other circumstances it may not be
eliminated (e.g., in the middle of a word). We strongly suggest
using the ";" in all cases to avoid problems with user agents that
require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
Then, it looks like requiring it again in this case may be
the way to go (I seem to recall there were lots of unterminated NBSP).
Are you saying always for html5, (probably) always for xhtml, and for
attributes with html4?
I'm saying we should find a simple heuristic that copes with
the current situation.
If you want simple, I can just require it unconditionally and
find out what happens.
Your first suggestion looks quite reasonable. Please try it and
make some field tests. I'm currently working on the double imgbuf
problem...
--
Cheers
Jorge.-
Jorge Arellano Cid
2014-05-05 00:32:57 UTC
Post by eocene
Post by eocene
Post by Jorge Arellano Cid
AFAIR the original routine was written to require the trailing ';'
and it worked well for some time. Then more pages started to show
unterminated entities inside, and it got so annoying we decided to
make it more flexible and not to require the ';' when the entity
name was found (IIRC).
Yeah, this is why I was considering just changing the get_attr case,
but of course I don't want to make the code messy and complicated
unless I need to.
Post by Jorge Arellano Cid
It'd be good to find the reason for the change before reverting it.
I don't remember it now, but I do remember it was because the other way
started to be perceived as worse in some sense.
Maybe GMANE has the mailing list archives...
I guess I'll put some time into digging around.
http://lists.dillo.org/pipermail/dillo-dev/2005-January/002502.html
where we get the end of a conversation between Jorge and Matthias Franz.
This msg says that it was changed because it wasn't required under
Note. In SGML, it is possible to eliminate the final ";" after a
character reference in some cases (e.g., at a line break or
immediately before a tag). In other circumstances it may not be
eliminated (e.g., in the middle of a word). We strongly suggest
using the ";" in all cases to avoid problems with user agents that
require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
Then, it looks like requiring it again in this case may be
the way to go (I seem to recall there were lots of unterminated NBSP).

A long long time ago people thought that SGML was the final
solution, then XML, then HTML5, now they're looking for an
alternative technology to base the web upon...
--
Cheers
Jorge.-
eocene
2014-05-05 02:25:01 UTC
Post by eocene
This msg says that it was changed because it wasn't required under
Note. In SGML, it is possible to eliminate the final ";" after a
character reference in some cases (e.g., at a line break or
immediately before a tag). In other circumstances it may not be
eliminated (e.g., in the middle of a word). We strongly suggest
using the ";" in all cases to avoid problems with user agents that
require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
Then, it looks like requiring it again in this case may be
the way to go (I seem to recall there were lots of unterminated NBSP).
Are you saying always for html5, (probably) always for xhtml, and for
attributes with html4?
I'm saying we should find a simple heuristic that copes with
the current situation.
If you want simple, I can just require it unconditionally and
find out what happens.
A long long time ago people thought that SGML was the final
solution, then XML, then HTML5, now they're looking for an
alternative technology to base the web upon...
Where have they been talking about an alternative technology?
I remember reading somewhere in the news, not long ago, that there
were funds and a call for people with expertise to work on
designing an alternative technology for the web (to try to tackle
the enormous complexity that full-blown browsers have taken on,
not to mention the disparate user experience this creates).
I wish them luck. HTML5 is the most ridiculous possible document.
eocene
2014-05-05 00:54:06 UTC
Post by eocene
This msg says that it was changed because it wasn't required under
Note. In SGML, it is possible to eliminate the final ";" after a
character reference in some cases (e.g., at a line break or
immediately before a tag). In other circumstances it may not be
eliminated (e.g., in the middle of a word). We strongly suggest
using the ";" in all cases to avoid problems with user agents that
require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
Then, it looks like requiring it again in this case may be
the way to go (I seem to recall there were lots of unterminated NBSP).
Are you saying always for html5, (probably) always for xhtml, and for
attributes with html4?
A long long time ago people thought that SGML was the final
solution, then XML, then HTML5, now they're looking for an
alternative technology to base the web upon...
Where have they been talking about an alternative technology?
Jorge Arellano Cid
2014-05-05 01:47:17 UTC
Post by eocene
This msg says that it was changed because it wasn't required under
Note. In SGML, it is possible to eliminate the final ";" after a
character reference in some cases (e.g., at a line break or
immediately before a tag). In other circumstances it may not be
eliminated (e.g., in the middle of a word). We strongly suggest
using the ";" in all cases to avoid problems with user agents that
require this character to be present.
...and there's an "IIRC" in the msg that XHTML requires it.
The HTML5 spec requires a terminating ';' in all cases.
Then, it looks like requiring it again in this case may be
the way to go (I seem to recall there were lots of unterminated NBSP).
Are you saying always for html5, (probably) always for xhtml, and for
attributes with html4?
I'm saying we should find a simple heuristic that copes with
the current situation.
A long long time ago people thought that SGML was the final
solution, then XML, then HTML5, now they're looking for an
alternative technology to base the web upon...
Where have they been talking about an alternative technology?
I remember reading somewhere in the news, not long ago, that there
were funds and a call for people with expertise to work on
designing an alternative technology for the web (to try to tackle
the enormous complexity that full-blown browsers have taken on,
not to mention the disparate user experience this creates).
--
Cheers
Jorge.-