Discussion:
UnicodeDecodeError when calling render()
anatoly techtonik
2012-12-24 01:29:43 UTC
The following code fails with a UnicodeDecodeError, and I am completely
puzzled about what it wants.

import genshi

iURL = "http://bugs.farmanager.com/view.php?id=1736"

import urllib2
mbt_file = urllib2.urlopen(iURL)
mbt_genshi = genshi.input.HTMLParser(mbt_file)
parsed = mbt_genshi.parse()
parsed.select("head").render()


The full traceback:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    parsed.select("head").render()
  File "/usr/lib/pymodules/python2.6/genshi/core.py", line 183, in render
    return encode(generator, method=method, encoding=encoding, out=out)
  File "/usr/lib/pymodules/python2.6/genshi/output.py", line 57, in encode
    return _encode(''.join(list(iterator)))
  File "/usr/lib/pymodules/python2.6/genshi/output.py", line 223, in __call__
    for kind, data, pos in stream:
  File "/usr/lib/pymodules/python2.6/genshi/output.py", line 670, in __call__
    for kind, data, pos in stream:
  File "/usr/lib/pymodules/python2.6/genshi/output.py", line 771, in __call__
    for kind, data, pos in chain(stream, [(None, None, None)]):
  File "/usr/lib/pymodules/python2.6/genshi/output.py", line 586, in __call__
    for ev in stream:
  File "/usr/lib/pymodules/python2.6/genshi/core.py", line 288, in _ensure
    for event in stream:
  File "/usr/lib/pymodules/python2.6/genshi/path.py", line 581, in _generate
    for event in stream:
  File "/usr/lib/pymodules/python2.6/genshi/core.py", line 288, in _ensure
    for event in stream:
  File "/usr/lib/pymodules/python2.6/genshi/input.py", line 432, in _coalesce
    for kind, data, pos in chain(stream, [(None, None, None)]):
  File "/usr/lib/pymodules/python2.6/genshi/input.py", line 327, in _generate
    self.feed(data)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.6/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 9: ordinal not in range(128)


What is this error about?
--
You received this message because you are subscribed to the Google Groups "Genshi" group.
To view this discussion on the web visit https://groups.google.com/d/msg/genshi/-/KQ02wqbLtGMJ.
To post to this group, send email to ***@googlegroups.com.
To unsubscribe from this group, send email to genshi+***@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/genshi?hl=en.
Simon Cross
2012-12-24 08:14:04 UTC
Hi Anatoly

Could you try to construct a minimal test case?

Schiavo
Simon
anatoly techtonik
2012-12-24 08:37:27 UTC
Actually, the script I pasted is a minimal test case. =)
Simon Cross
2012-12-24 08:43:51 UTC
Post by anatoly techtonik
Actually, the code script pasted is a minimal test case. =)
It references a giant blob of HTML.
anatoly techtonik
2012-12-24 09:45:41 UTC
Post by Simon Cross
Post by anatoly techtonik
Actually, the code script pasted is a minimal test case. =)
It references a giant blob of HTML.
I experimented with encoding a bit and it boiled down to
http://genshi.edgewall.org/ticket/375 so I think it is more important.
--
anatoly t.
Simon Cross
2012-12-24 09:58:50 UTC
Post by anatoly techtonik
I experimented with encoding a bit and it boiled down to
http://genshi.edgewall.org/ticket/375 so I think it is more important.
I closed that ticket as wontfix -- cleaning up HTML seems outside of
Genshi's scope and in any case it's not clear why Genshi would do a
better job than a dedicated tool.

Schiavo
Simon
anatoly techtonik
2012-12-24 10:57:39 UTC
Post by Simon Cross
Post by anatoly techtonik
I experimented with encoding a bit and it boiled down to
http://genshi.edgewall.org/ticket/375 so I think it is more important.
I closed that ticket as wontfix -- cleaning up HTML seems outside of
Genshi's scope and in any case it's not clear why Genshi would do a
better job than a dedicated tool.
Genshi has an HTML parser, so if that parser cannot handle HTML that is
accepted and correctly rendered by at least three top browsers, it is of
little use. I used it because Genshi comes bundled with Trac, and it is
used in a plugin that replaces text like "issue #423" for specific
repositories with a link to an external tracker.

If I can't use Genshi for parsing HTML, then I can't see the benefit of the
XML-based complications in Trac's templating layer over the familiar Django
and Jinja style.
Eli Stevens (Gmail)
2012-12-24 18:01:14 UTC
Post by anatoly techtonik
Genshi has an HTML parser, so if parser can not handle HTML that is
accepted and correctly rendered by at least three top browsers,
Just to chime in, I've had to deal with the difference between correct HTML
and the HTML that will be rendered "correctly" by browsers previously, and
the difference between the two is huge. The amount of "garbage in, what
you probably wanted out" is staggering. I don't know if this is still
true, but at the time even tools like Beautiful Soup couldn't properly
parse a Google search result page, much less a tool that expected properly
formed markup. Expecting Genshi to replicate all of the cleanup code
present in a browser doesn't make sense, IMO.

What we ended up doing was using the browser to parse the pages we were
interested in, and then using it to save an HTML version of the DOM. Since
the browser was just serializing the in-memory DOM, the output was
syntactically correct. This was before the days of tools like PhantomJS,
so it would probably be even easier now.

Eli
anatoly techtonik
2012-12-25 04:35:26 UTC
On Mon, Dec 24, 2012 at 9:01 PM, Eli Stevens (Gmail)
Post by Eli Stevens (Gmail)
Post by anatoly techtonik
Genshi has an HTML parser, so if parser can not handle HTML that is
accepted and correctly rendered by at least three top browsers,
Just to chime in, I've had to deal with the difference between correct
HTML and the HTML that will be rendered "correctly" by browsers previously,
and the difference between the two is huge. The amount of "garbage in,
what you probably wanted out" is staggering. I don't know if this is still
true, but at the time even tools like Beautiful Soup couldn't properly
parse a Google search result page, much less a tool that expected properly
formed markup. Expecting Genshi to replicate all of the cleanup code
present in a browser doesn't make sense, IMO.
The HTML5 standard actually describes all the cleanup procedures
(http://ejohn.org/blog/html-5-parsing/), so maybe Genshi should implement
an HTML5Parser using http://code.google.com/p/html5lib/ and patch its
existing HTMLParser to have an optional fallback mechanism?
Post by Eli Stevens (Gmail)
What we ended up doing was use the browser to parse the pages we were
interested in, then use them to save an HTML version of the DOM. Since the
browser was just serializing the in-memory DOM, it
was syntactically correct. This was before the days of tools like
PhantomJS, so it would probably be even easier now.
Yes, tools have evolved. =)
Simon Cross
2012-12-25 06:36:33 UTC
Post by anatoly techtonik
I experimented with encoding a bit and it boiled down to
http://genshi.edgewall.org/ticket/375 so I think it is more important.
Your original example doesn't boil down to #375. It's an encoding
issue. Genshi trunk raises:

UnicodeError: source returned bytes, but no encoding specified

and setting "encoding='latin-1'" in the construction of HTMLParser
causes your example to work.
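
To illustrate the kind of failure in the traceback, here is a minimal sketch that uses only the standard library (no Genshi involved). The sample string is hypothetical; it simply contains the byte 0xd0, which is the lead byte of many UTF-8 encoded Cyrillic letters and is the byte named in the error message. The helper name try_decode is also just an illustration. It should run on both Python 2 and Python 3.

```python
# Hypothetical sample data: an entity plus one Cyrillic letter (U+041F),
# encoded as UTF-8 the way a web server might send it.
text = u'a &amp; \u041f'
raw = text.encode('utf-8')  # the Cyrillic letter becomes the bytes 0xd0 0x9f


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)


# ASCII cannot represent 0xd0, so this yields an error message like the one
# at the bottom of the traceback.
ascii_result = try_decode(raw, 'ascii')

# Decoding with the encoding the bytes were actually written in round-trips.
utf8_result = try_decode(raw, 'utf-8')

print(ascii_result)
print(utf8_result == text)
```

The point is only that byte strings containing non-ASCII data need an explicit encoding before they are mixed with unicode text.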

The attached patch to Genshi 0.6.x makes the behaviour there similar.
I haven't applied it to the 0.6.x branch yet because I still need to
think through all the ramifications.
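
The stricter behaviour described above (raise instead of guessing when bytes arrive without an encoding) can be pictured with a small sketch. This is not Genshi's actual code and not the attached patch; the function name to_text is hypothetical, and only the error message is taken from the post. It is written to run on both Python 2 and Python 3.

```python
def to_text(data, encoding=None):
    """Return unicode text; refuse to guess if bytes come without an encoding.

    Hypothetical helper for illustration only, not part of Genshi.
    """
    if isinstance(data, bytes):  # on Python 2, bytes is the plain str type
        if encoding is None:
            raise UnicodeError(
                "source returned bytes, but no encoding specified")
        return data.decode(encoding)
    return data  # already unicode text, nothing to decode


# With an explicit encoding the bytes are decoded.
decoded = to_text(b'\xd0\x9f', encoding='utf-8')

# Without one, the caller gets a clear error instead of an implicit
# ASCII decode somewhere deep inside the parser.
try:
    to_text(b'\xd0\x9f')
    message = None
except UnicodeError as exc:
    message = str(exc)

print(message)
```

Failing early like this puts the error next to the place where the encoding was forgotten, rather than in a regular expression substitution many frames later.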

Schiavo
Simon
anatoly techtonik
2012-12-25 08:23:04 UTC
Post by Simon Cross
Post by anatoly techtonik
I experimented with encoding a bit and it boiled down to
http://genshi.edgewall.org/ticket/375 so I think it is more important.
Your original example doesn't boil down to #375. It's an encoding
issue. Genshi trunk raises:
UnicodeError: source returned bytes, but no encoding specified
and setting "encoding='latin-1'" in the construction of HTMLParser
causes your example to work.
The attached patch to Genshi 0.6.x makes the behaviour there similar.
I haven't applied it to the 0.6.x branch yet because I still need to
think through all the ramifications.
But the downloaded content is 'utf-8', the page meta specifies 'utf-8', and
the server header specifies 'utf-8' as well. The undocumented encoding
parameter of the HTMLParser constructor
(http://genshi.edgewall.org/wiki/ApiDocs/genshi.input#genshi.input:HTMLParser)
seems to be 'utf-8' as well. Is it a problem with urlopen autoconverting to
'latin-1'?
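
One thing worth keeping in mind here: decoding as 'latin-1' never raises, because every byte value maps to some character, so "it works with latin-1" does not by itself show that the data is latin-1. A minimal stdlib-only sketch of that, together with one way to read the charset out of a Content-Type header value, is below. The header value and sample text are hypothetical, the helper name charset_from_content_type is made up for illustration, and no network access or Genshi call is involved. It should run on both Python 2 and Python 3.

```python
from email.message import Message


def charset_from_content_type(value, default='utf-8'):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when the header carries no charset.
    """
    msg = Message()
    msg['Content-Type'] = value
    return msg.get_content_charset() or default


# Hypothetical body: six Cyrillic letters encoded as UTF-8 (12 bytes).
original = u'\u041f\u0440\u0438\u0432\u0435\u0442'
raw = original.encode('utf-8')

charset = charset_from_content_type('text/html; charset=utf-8')
good = raw.decode(charset)

# latin-1 accepts any byte sequence, so this succeeds but produces
# one garbled character per byte instead of the original text.
garbled = raw.decode('latin-1')

print(charset)
print(good == original)
print(len(good), len(garbled))
```

If the declared charset is used for decoding, the text round-trips; with 'latin-1' the call succeeds but the non-ASCII characters come out garbled.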
anatoly techtonik
2012-12-25 08:24:29 UTC
I am using Python 2. No bytes.