Today someone was discussing the goal of linting HTML, specifically of detecting unclosed attributes. Consider the following snippet:
<p class="important><img src="alert.png">This is important!</p>
It’s clear that a mistake led to a missing double-quote on the class attribute of the opening <p> tag. While WordPress’ HTML API doesn’t directly report this (because “unclosed attribute” isn’t particularly an HTML concept), it can be used to roughly detect it.
Here’s how to use the public functionality of the HTML API to detect unclosed attributes.
To do this, we have to define what an unclosed attribute means. For the sake of brevity we will assume that if an attribute value contains HTML-like syntax it is probably unclosed. We might be tempted to start with something like this:
foreach ( $processor->get_attribute_names_with_prefix( '' ) as $name ) {
    $value = $processor->get_attribute( $name );
    // Boolean attributes return true; they have no value to inspect.
    if ( ! is_string( $value ) ) {
        continue;
    }

    $checker = new WP_HTML_Tag_Processor( $value );
    if ( $checker->next_tag() ) {
        throw new Exception( 'Found tag syntax within attribute: is it unclosed?' );
    }
}
This approach does get pretty far, but it suffers from the fact that it’s checking decoded attribute values, meaning it will report false positives on any attribute which discusses tags, such as alt="the &lt;img&gt; tag is a void element", where the character references decode into <img>. It’s better to review the raw attribute value instead of the decoded attribute value.
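To make the difference between raw and decoded values concrete, here is a minimal sketch in plain PHP. It does not use the HTML API: html_entity_decode() stands in for the decoding that happens when an attribute value is read, and looks_tag_like() is a hypothetical helper that stands in for the tag check.

```php
<?php
// Hypothetical stand-in for the naive check: does the text contain a "<"?
function looks_tag_like( string $text ): bool {
    return false !== strpos( $text, '<' );
}

// Raw attribute value as it might be written in the source document.
$raw_value = 'the &lt;img&gt; tag is a void element';

// Approximates decoding an attribute value; this is an assumption for
// illustration and is not the HTML API's own decoder.
$decoded_value = html_entity_decode( $raw_value, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

var_dump( looks_tag_like( $decoded_value ) ); // bool(true): a false positive
var_dump( looks_tag_like( $raw_value ) );     // bool(false)
```

The decoded text contains a literal <img>, so any check that runs on it flags an attribute that was written correctly.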
A sneaky trick hidden in attribute removal
The Tag Processor tracks attribute offsets but doesn’t expose them, even to subclasses. The HTML API tries really hard to avoid exposing string offsets, and it does this for good reason: string offsets are unclear, finicky, and easy to misuse.
However, the Tag Processor does allow subclasses to access its lexical_updates property, which is an array of string replacements to perform after semantic-level requests have been converted to text. If we request the removal of an attribute and then inspect these updates, we learn every place where that attribute, and any ignored duplicates, appeared in the source document.
This approach also leans on the fact that static methods of subclasses have access to protected properties of the parent class.
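The visibility rule is easier to trust once you see it in isolation. The following sketch uses two hypothetical classes, Base_Processor and Peeking_Subclass, which are not part of WordPress; they only mirror the relationship between the Tag Processor and a subclass of it.

```php
<?php
// Hypothetical parent class with a protected property.
class Base_Processor {
    protected $pending_updates = array( 'remove class' );
}

// Hypothetical subclass whose static method works on a parent instance.
class Peeking_Subclass extends Base_Processor {
    public static function peek() {
        // Note: this is an instance of the parent, not of the subclass.
        $p = new Base_Processor();

        // Allowed: the property is declared in an ancestor of this class.
        $seen = $p->pending_updates;

        // Writing is allowed too, which is how queued updates get discarded.
        $p->pending_updates = array();

        return array( $seen, $p->pending_updates );
    }
}

list( $seen, $after ) = Peeking_Subclass::peek();
var_dump( $seen );  // the one queued entry
var_dump( $after ); // an empty array
```

PHP checks protected access against the class that declares the property, not against the specific object, so the static method can both read and reset the property.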
This is risky code and should be used with extreme caution, code review, and shared understanding among those who will be asked to maintain it.
class WP_Attribute_Walker extends WP_HTML_Tag_Processor {
    public static function walk( $html ) {
        $p = new WP_HTML_Tag_Processor( $html );
        while ( $p->next_tag() ) {
            $names = $p->get_attribute_names_with_prefix( '' );
            foreach ( $names as $name ) {
                // Removing the attribute queues one replacement for the
                // attribute itself and one for each ignored duplicate.
                $p->remove_attribute( $name );
                $updates = $p->lexical_updates;
                // Discard the queued removals: we only wanted their offsets.
                $p->lexical_updates = array();

                $i = 0;
                foreach ( $updates as $update ) {
                    $raw_attr          = substr( $html, $update->start, $update->length );
                    $quote_at          = strcspn( $raw_attr, '\'"' );
                    $might_be_unclosed = false;

                    if ( $quote_at < strlen( $raw_attr ) ) {
                        // The raw value sits between the first quote and the
                        // last occurrence of that same quote character.
                        $raw_value = substr(
                            $raw_attr,
                            $quote_at + 1,
                            strrpos( $raw_attr, $raw_attr[ $quote_at ] ) - $quote_at - 1
                        );

                        $checker           = new WP_HTML_Tag_Processor( $raw_value );
                        $might_be_unclosed = $checker->next_tag() || $checker->paused_at_incomplete_token();
                    }

                    yield $p->get_token_name() => array(
                        $name,
                        array( $update->start, $update->length ),
                        0 === $i++ ? 'non-duplicate' : 'duplicate',
                        $might_be_unclosed ? 'contains-tag-like-content' : 'does-not-contain-tag-like-content',
                        $raw_attr,
                    );
                }
            }
        }
    }
}
This WP_Attribute_Walker::walk( $html ) method steps through each tag in the given document and returns a generator which reports each attribute on the tag, as well as some meta information about it.
$meta === array(
    'class',                       // parsed name of attribute
    array( 3, 27 ),                // (offset, length) of full attribute span in HTML
    'non-duplicate',               // whether this is the actual attribute or an ignored duplicate
    'contains-tag-like-content',   // likelihood of being unclosed
    'class="important><img src="', // full span of attribute in HTML
);
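The quote-scanning step inside the walker is dense, so here it is pulled out into a standalone sketch. The function name extract_raw_value() is hypothetical, and the handling of a lone quote is an assumption about what is most useful for linting.

```php
<?php
// Hypothetical helper: returns the text between the first quote in an
// attribute's source span and the last occurrence of that same quote.
// Returns null when the span contains no quote character at all.
function extract_raw_value( string $raw_attr ): ?string {
    // Length of the leading run that contains neither ' nor ".
    $quote_at = strcspn( $raw_attr, '\'"' );
    if ( $quote_at >= strlen( $raw_attr ) ) {
        return null;
    }

    $last_quote_at = strrpos( $raw_attr, $raw_attr[ $quote_at ] );
    if ( $last_quote_at === $quote_at ) {
        // Only one quote: nothing is enclosed (assumed to mean "empty").
        return '';
    }

    return substr( $raw_attr, $quote_at + 1, $last_quote_at - $quote_at - 1 );
}

var_dump( extract_raw_value( 'class="important"' ) );           // string(9) "important"
var_dump( extract_raw_value( 'class="important><img src="' ) ); // string(19) "important><img src="
var_dump( extract_raw_value( 'disabled' ) );                    // NULL
```

The second call shows why the heuristic works: the span swallowed by the unclosed attribute carries the start of the next tag inside its raw value.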
How to use this walker
$html = '<p class="important><img src="alert.png">This is important!</p>';

foreach ( WP_Attribute_Walker::walk( $html ) as $tag_name => $meta ) {
    echo "Found in <{$tag_name}> an attribute named '{$meta[0]}'\n";
    echo "  @ byte offset {$meta[1][0]} extending {$meta[1][1]} bytes\n";
    echo "  it is a {$meta[2]} attribute on the tag\n";
    echo "  its value {$meta[3]}\n";
    echo "  `{$meta[4]}`\n";
}
The output here tells us what we want to know:
Found in <P> an attribute named 'class'
  @ byte offset 3 extending 27 bytes
  it is a non-duplicate attribute on the tag
  its value contains-tag-like-content
  `class="important><img src="`
Found in <P> an attribute named 'alert.png"'
  @ byte offset 30 extending 10 bytes
  it is a non-duplicate attribute on the tag
  its value does-not-contain-tag-like-content
  `alert.png"`
For normative HTML the values are not as surprising. In this case, the missing " has been added to the class attribute.
$html = '<p class="important"><img src="alert.png">This is important!</p>';
Found in <P> an attribute named 'class'
  @ byte offset 3 extending 17 bytes
  it is a non-duplicate attribute on the tag
  its value does-not-contain-tag-like-content
  `class="important"`
Found in <IMG> an attribute named 'src'
  @ byte offset 26 extending 15 bytes
  it is a non-duplicate attribute on the tag
  its value does-not-contain-tag-like-content
  `src="alert.png"`
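For a linting tool, the interesting records are the ones flagged as containing tag-like content. The sketch below filters them out of any iterable shaped like the walker's output. Both find_suspect_attributes() and sample_records() are hypothetical names; the sample generator replays the two records reported for the broken snippet so the example runs without WordPress.

```php
<?php
// Hypothetical helper: collect records flagged as possibly unclosed.
function find_suspect_attributes( iterable $records ): array {
    $suspects = array();
    // Iterate directly: the tag name is the key and repeats for every
    // attribute on a tag, so converting to an array would drop records.
    foreach ( $records as $tag_name => $meta ) {
        if ( 'contains-tag-like-content' === $meta[3] ) {
            $suspects[] = array(
                'tag'    => $tag_name,
                'name'   => $meta[0],
                'offset' => $meta[1][0],
                'source' => $meta[4],
            );
        }
    }
    return $suspects;
}

// Stand-in for WP_Attribute_Walker::walk(), replaying the records above.
function sample_records() {
    yield 'P' => array( 'class', array( 3, 27 ), 'non-duplicate', 'contains-tag-like-content', 'class="important><img src="' );
    yield 'P' => array( 'alert.png"', array( 30, 10 ), 'non-duplicate', 'does-not-contain-tag-like-content', 'alert.png"' );
}

$suspects = find_suspect_attributes( sample_records() );
foreach ( $suspects as $suspect ) {
    echo "Possible unclosed attribute '{$suspect['name']}' in <{$suspect['tag']}> at byte {$suspect['offset']}\n";
}
```

With the real walker, the call would presumably be find_suspect_attributes( WP_Attribute_Walker::walk( $html ) ).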
Summary
This code is not meant to be normative; it’s probably missing important details. It’s here to demonstrate one way we can take advantage of the already-available aspects of the HTML API to perform more interesting work.
In this case, we can tug at some of its internals to build linting and reporting tools which investigate aspects not exposed in the public interface: duplicate attributes and raw attribute values.
Checking whether an attribute is closed is a tricky problem to solve. We can only approach it with a set of heuristics that estimate the likelihood that an attribute isn’t closed, because an HTML parser interprets any given string in one specific way and, regardless of errors, will produce tags and attributes from it.
Before we reach for custom regular expressions (PCRE), we can look into the HTML API and consider the sliding scale of safety it presents to us; we can take advantage of the parsing it’s already performing to remove the need to replicate all of HTML’s complicated parsing rules in our custom code.