HTML Parsing of <script> Tags with Etree

I have a simple template format that generates HTML. An issue I face is that HTML treats the content between <script> and </script> tags differently from normal markup.

For example, this is not valid HTML because you can't use unescaped > characters in markup:

<p>
alert(4 > 3)
</p>

The > should be written as &gt; to be correct:

<p>
alert(4 &gt; 3)
</p>

But within the <script> tag, the unescaped > is perfectly fine:

<script>
alert(4 > 3)
</script>
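The same distinction can be observed from Python. The sketch below uses the standard library's html.parser rather than the pyquery/lxml stack, purely as a stand-in; the TextByTag class and its attribute names are made up for illustration. It feeds the parser an escaped &gt; in both a <p> and a <script>, to show that the entity is decoded in normal markup but left untouched inside <script>, which is why script content should not be escaped in the first place.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Collect the text the parser reports for each element (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []   # stack of currently open tag names
        self.text = {}        # tag name -> text seen inside that tag

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Accumulate, since the parser may deliver text in more than one chunk.
        if self.open_tags:
            tag = self.open_tags[-1]
            self.text[tag] = self.text.get(tag, "") + data


parser = TextByTag()
parser.feed("<p>4 &gt; 3</p><script>4 &gt; 3</script>")
parser.close()

print(parser.text["p"])       # entity decoded in normal markup
print(parser.text["script"])  # left as raw text inside <script>
```

Inside <script> the parser hands back the literal characters &gt;, so JavaScript that had been escaped this way would no longer mean "greater than".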

This means that this code is perfectly fine:

<button id="click">Is 4 &gt; 3?</button>
<script>
document.getElementById('click').addEventListener('click', (e) => alert(4 > 3));
</script>

Running this and clicking the button shows an alert that says true, because the > is interpreted as a greater-than sign and the expression is evaluated.

This means that my template system needs to know that <script> tags should be parsed differently from other HTML tags.
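As a rough sketch of what that rule could look like in a template system (not the actual implementation behind this post), the hypothetical render_element helper below escapes text for ordinary elements and passes it through unchanged for <script>. Treating <style> the same way is an added assumption, based on it being the other raw-text element in HTML.

```python
from html import escape

# Elements whose content is raw text in HTML, so it must not be entity-escaped.
# Including "style" is an assumption beyond what the post discusses.
RAW_TEXT_TAGS = {"script", "style"}


def render_element(tag, text):
    """Render <tag>text</tag>, escaping text unless tag is a raw-text element.

    Hypothetical helper for illustration; attributes and nesting are ignored.
    """
    if tag.lower() in RAW_TEXT_TAGS:
        body = text
    else:
        # quote=False: only &, < and > are escaped, which is enough for text nodes.
        body = escape(text, quote=False)
    return f"<{tag}>{body}</{tag}>"


print(render_element("p", "alert(4 > 3)"))
print(render_element("script", "alert(4 > 3)"))
```

One caveat with passing script text through untouched: the text must not itself contain a closing </script> sequence, since that would end the element early. The sketch does not guard against that.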

So far, I've been using pyquery, which is built on lxml's etree.

Let's set up a tree:

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
>>> from pyquery import PyQuery
>>> tree = PyQuery('<p>4 &gt; 3</p><script>4 > 3</script>')

Now let's use the default HTML serialisation:

>>> tree.html()
'<p>4 &gt; 3</p><script>4 &gt; 3</script>'

As you can see, it has correctly preserved the escaping in the <p> tag, but it has also escaped the content of the <script> tag, so the JavaScript inside it is no longer valid.

The solution is to pass the (not very well documented) method argument to html(), which selects HTML serialisation rather than the XML-style output used by default:

>>> tree.html(method="html")
'<p>4 &gt; 3</p><script>4 > 3</script>'

This time the JavaScript is preserved as written, while the content of the <p> tag stays escaped.
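The same XML-versus-HTML serialisation difference can be reproduced without pyquery or lxml. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in, on the assumption that its method argument plays the same role as the one passed to html() above. The tree is built by hand, with a wrapping <div> added only so there is a single root element.

```python
import xml.etree.ElementTree as ET

# Build <div><p>4 > 3</p><script>4 > 3</script></div> in memory.
# The text is stored unescaped; escaping happens only at serialisation time.
root = ET.Element("div")
ET.SubElement(root, "p").text = "4 > 3"
ET.SubElement(root, "script").text = "4 > 3"

# Default serialisation (method="xml") escapes every text node.
as_xml = ET.tostring(root, encoding="unicode")

# HTML serialisation leaves <script> content untouched.
as_html = ET.tostring(root, encoding="unicode", method="html")

print(as_xml)
print(as_html)
```

In both cases the <p> text comes out as 4 &gt; 3; only the HTML method writes the <script> text back out as 4 > 3.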