Solving Simple Problems With Simple Apps

As mentioned in the previous rant article, we have been diligently working toward a Node-free dev life. Little things, like using basic Unix commands and the just task runner, go a long way toward that goal, but plenty of gaps remain.

This year, we've finally started plugging those gaps.

Finding the Gaps

First and foremost, every build task needs to be something that can be handled from the command line so that it can be automated. The simpler, the better, but this gives us a starting point.

From there, two common issues arise when setting up complex web projects:

  1. While native Unix commands and dedicated binaries can do a lot, the syntax is often obscure, and when multiple programs are chained together, the code quickly gets unruly and performance may suffer.
  2. Many specialized tasks, like HTML minification, require specialized apps, and unfortunately, in many cases the only viable tools in existence are Node-based.

The second issue is only an issue if you want to avoid using Node. Most Node apps can be installed globally and run from the command line, so it might be fine.

At any rate, let's ignore the second issue for now, and focus on the first one.

Ergonomics

In looking at our own active projects, one immediate example of unnecessary logical complexity that jumps out is the large number of ugly echo statements announcing task errors, successes, etc. While echo is about as simple as it can get, the ANSI formatting required to print a bold, red Error: prefix is a bit obscure:

echo -e "\033[1;91mError:\033[0m Something broke!"
exit 1

ANSI markup starts with \033[ and terminates with an m. In between, reserved numbers separated by ; instruct the terminal to make things bold, red, blinky, etc.
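For example, here are a few of the more common codes in action. (printf is used below because escape handling varies between shells' echo implementations; the codes themselves are standard: 1 for bold, 91 for bright red, 0 for reset.)

# 1 = bold, 91 = bright red, 0 = reset everything.
printf '\033[1mBold\033[0m \033[91mRed\033[0m \033[1;91mBold Red\033[0m\n'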

If you skip the ANSI and just output messages in the plain, they'll all blend together when the script runs. And if you mark them up, you'll have to look up the codes, be careful to type them correctly, and remember to reset them afterwards to prevent all subsequent output from inheriting the custom styles.

Not exactly user friendly.

But that said, it isn't just obscure codes that impede the ergonomics of CLI commands. More often, it is the need to chain several long statements together that gets in the way.

Chaining

Most CLI programs are designed to be run against a single input, but most web projects need to perform the same operation against many different files. For example, if your project requires HTML to be minified, you probably have more than one HTML file.

The solution is typically to chain the single-input command to the results of a find command, like:

for i in $( find . -name "*.html" -type f ! -size 0); do
	html-minifier \
		--collapse-boolean-attributes \
		--collapse-whitespace \
		--decode-entities \
		--remove-attribute-quotes \
		--remove-comments \
		--remove-empty-attributes \
		--remove-optional-tags \
		--remove-redundant-attributes \
		--remove-script-type-attributes \
		--remove-style-link-type-attributes \
		-o "$i" \
		"$i" >/dev/null 2>&1
done

That gets the job done, but it is less than ergonomic, and the performance — particularly with this example — is often relatively terrible.

It isn't necessarily the fault of any of the apps involved in the chain — though in this example, Node certainly doesn't help; it's just not the use case they were designed for.

Just Make an App!

It is easy to write off minor annoyances because, well, they're minor. And the statements might be ugly, but they work. And copy-and-paste is a thing, so carrying them from one project to another doesn't require memorization. And you have better things to do with your afternoon.

But it feels wrong, right?

This year, I started looking across our active projects for simple tasks requiring long chains or obscure syntax, and began writing simple, dedicated apps to handle them more easily.

It was the best development-related decision I've ever made.

Each one became instantly indispensable, a true productivity game-changer applicable to dozens of current projects, and infinitely many future ones.

While some of these apps are already featured on our stuff page, I wanted to draw attention to them in context in case they might be of use to you and your projects.

They're all free and open source. Each GitHub release includes pre-built .deb packages for easy installation on Debian and Ubuntu systems. While we have only focused on x86-64 Linux platforms, they will probably work as-is on Windows via WSL, and should be cross-compilable to macOS with little or no retooling.

FYI

First up, FYI, a simple, formatted status message printer.

This solves the aforementioned problem with ANSI formatting, while also adding a simple confirmation prompt option, mooting the need to try to tame something like whiptail for "Are you sure?"-type messages.
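A guarded task might look something like the following. (Hypothetical sketch; confirm the exact subcommand and behavior against the repo, but the gist is a yes/no prompt whose answer is reflected in the exit status.)

# Hypothetical sketch: abort unless the user answers yes.
fyi confirm "Wipe the build directory and start over?" || exit 1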

# The manual way.
echo -e "\033[1;91mError:\033[0m Something broke!"

# Using FYI:
fyi error "Something broke!"

Both would print something like:
Error: Something broke!

There are a number of built-in prefixes covering "success", "warning", etc., but you can also specify totally arbitrary prefixes, or go prefix-free, though at that point you might as well just use echo.
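The other built-ins follow the same pattern as the error example above:

fyi success "All the files were crunched!"
fyi warning "The cache is getting full."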

For more information or to download FYI, visit the repo.

ChannelZ

Most web servers are configured to dynamically compress certain text responses because the resulting savings in network transfer times outweigh the time spent encoding and decoding the content.

That same encoding, however, can actually be done ahead of time, reducing load time even further by removing the server-side overhead entirely, while also allowing for the use of stronger, more time-consuming compression settings because, why not? Compress once and call it a day!

This task is easy enough to accomplish by combining find and gzip or find and brotli commands — both binaries are designed to work on just one input at a time — but the performance is pretty bad, and the list of conditions required for the find statement goes on for miles.
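Something like the following, except that in practice the extension list runs much, much longer. (The path and extensions here are just a hypothetical sample.)

# Keep the originals; write maximum-effort .gz and .br copies alongside.
for i in $( find ./dist -type f \( -name "*.html" -o -name "*.css" -o -name "*.js" \) ); do
	gzip -k -9 "$i"
	brotli -q 11 "$i"
done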

We wrote ChannelZ to greatly simplify the process. It accepts any number of file and directory paths — optionally supplied from a text file — and will recursively crawl and crunch each one using as many parallel threads as the system can handle. It also knows which types of files benefit from Gzipping and Brotlifying, so you don't have to worry about accidentally encoding JPEGs or anything like that. Point it at the web root and you're done.
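With ChannelZ, the whole dance collapses to a single call. (The path is hypothetical; substitute your own web root.)

# Recursively encode every compressible file under the web root.
channelz /var/www/my-site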

Aside from massive improvements in both UX and performance, ChannelZ also delivers better Gzip compression — almost on par with Brotli! — than would be possible using gzip -9 on its own, thanks to algorithmic differences (libdeflater rocks!) and various optimizations.

Also of note: ChannelZ implements the gzip and brotli functionality directly; you do not need to have either standalone binary installed.

For more information or to download ChannelZ, visit the repo.

CheckSame

Most build systems start with a watcher task that will then execute a subject-specific task based on the type of file that just changed. For example, when a .js file changes, it might call "Do All JavaScript Stuff", when an .scss file changes, it might call "Do All CSS Stuff", etc.

As long as the tasks are fast, this sort of blunt approach works just fine. But if you're having to wait for dozens of scripts to needlessly rebuild just so you can preview the one script you actually changed, all those wasted seconds add up quickly.

Smart watchers maintain caches to minimize this sort of thing, but often fail to account for arbitrary, inter-related dependencies, such as a script that depends on dozens of ES modules and non-script content like SCSS.

The manual solution for this kind of problem typically involves calculating checksums for each file in a group of dependencies, and checking whether or not any of those checksums have changed before each run.

But while this is technically achievable with md5sum or similar, the chain of commands that has to be piped together to get a simple yes/no answer is prohibitive, and MD5 is well past its prime at this point anyway.
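For reference, the classic incantation looks something like this: hash every file, then hash the sorted list of hashes to reduce everything to one comparable value. (The path is hypothetical.)

# One combined checksum for everything under ./src.
find ./src -type f -exec md5sum {} + | sort | md5sum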

Enter CheckSame!

Like ChannelZ, CheckSame accepts any number of file and directory paths, and will crunch them all in one go. File paths can also be loaded from a text file, making the command even shorter.

Unlike md5sum and kin, it computes a single hash for the lot (rather than one checksum per file), which in and of itself makes yes/no comparison easier. Instead of MD5, it uses BLAKE3, which is both faster and less collision-prone.

CheckSame also has a simple --cache mode that causes it to store the result and compare it on subsequent runs, printing either -1, 0, or 1, indicating NO PREVIOUS HASH, UNCHANGED, or SOMETHING CHANGED respectively.

With that, avoiding an expensive build task is as simple as this one-liner:

[ "$( checksame -c -l /path/list.txt )" = "0" ] || ./expensive-task

As long as the checksumming is faster than the build task would be — and the checksums do not change all that frequently — this sort of bypass is an instant performance win.

CheckSame's insane speed makes it suitable for all sorts of other tasks too, such as detecting file changes nested deep inside directory trees, or verifying/logging the integrity of mounted volumes.
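For example, assuming the default mode simply prints the combined hash to stdout (as opposed to the --cache mode above), a periodic volume check might look like this. (The paths are hypothetical.)

# Record a baseline hash of the mounted volume.
checksame /mnt/archive > baseline.txt

# Later: recompute and compare.
[ "$( checksame /mnt/archive )" = "$( cat baseline.txt )" ] || fyi error "The volume has changed!"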

For more information or to download CheckSame, visit the repo.

What About Those More Complex Apps?

We conveniently ignored the second issue mentioned under Finding the Gaps because trying to rewrite a complex app from scratch is often, well, complex.

In our own workflows, there remain a number of Node apps that would be a complete pain in the ass to try to recreate.

I had thought html-minifier would be another one, but as it turns out, Mozilla has already done all the heavy lifting for us!

HTMinL

While analyzing the build times of a recent project, I discovered that about 95% of the crunch time was going toward a single task: HTML minification.

For various reasons, each build pass had to regenerate each HTML file, and because HTML minification has a nasty habit of breaking things in unexpected ways, we couldn't just save that step to production/release builds.

Like so many other examples in this article, html-minifier is simply not designed to account for this project's particular use case: in-place HTML minification of files spanning myriad nested directories, many of which contain non-HTML content.

While plenty of alternative minifiers exist — including some proper binary options — every last one suffered one or more show-stopping issues. Some were unable to handle XHTML or XML markup, choking on things like inline SVGs. Others mangled embedded scripts. Others still choked on Vue's weird @event and :attr bindings. Some were even thrown off by attribute values lacking double quotes.

Diving deeper into the problem, I noticed a common pattern. Almost every single one relies on complex regular expression magic to naively tease apart tags, attributes, and text nodes.

Basically, they have a lot of code like this:

// Regular Expressions for parsing tags and attributes
var singleAttrIdentifier = /([^\s"'<>/=]+)/,
	singleAttrAssigns = [/=/],
	singleAttrValues = [
	  // attr value double quotes
	  /"([^"]*)"+/.source,
	  // attr value, single quotes
	  /'([^']*)'+/.source,
	  // attr value, no quotes
	  /([^ \t\n\f\r"'`=<>]+)/.source
	],
	…
	startTagOpen = new RegExp('^<' + qnameCapture),
	startTagClose = /^\s*(\/?)>/,
	endTag = new RegExp('^<\\/' + qnameCapture + '[^>]*>'),
	doctype = /^<!DOCTYPE\s?[^>]+>/i;

There is nothing wrong with using regular expressions to make sense of large bodies of text, provided that text is formatted exactly as expected, but HTML is anything but predictable.

HTML is a very forgiving standard. You can omit quotes around attribute values if they contain certain characters. You can use single or double quotes based on your mood. Self-closing elements can end with a / or not. Some tags, like <p>, can be closed or left open. Display types can be overridden with CSS. Whitespace is largely, but not entirely, ignored.

And that's just the valid stuff.

There are very few web sites in the wild that contain 100% valid markup. Try running amazon.com or nytimes.com through the W3 validator and you'll likely discover missed closing tags, invalid child elements, and all sorts of other shenanigans.

Web browsers understand this better than anyone, and dedicate huge portions of their overall codebases to random fixes and workarounds and syntax normalization just to get documents to a renderable state.

With that in mind, I wrote HTMinL differently.

Instead of relying on Regex, it uses Mozilla's Servo engine to first build a complete DOM tree representation of the document. Right off the bat, this fixes all sorts of issues, from missing DOCTYPEs to unclosed tags, but of equal importance, it properly tokenizes every last tag, attribute, value, etc.

From there, HTMinL is able to make node-by-node modifications with full knowledge of context. After that, it simply converts the DOM tree back into valid HTML source.

This approach is not only significantly more robust than the regular expressions used by other apps, it is also significantly faster. Compared to html-minifier, for example, the difference is literally orders of magnitude.

(Node sucks, man. Haha.)
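Usage follows the same pattern as ChannelZ: point it at one or more paths and it recursively minifies the HTML it finds, in place. (The path is hypothetical.)

# Recursively minify every .html file under the build directory, in place.
htminl ./dist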

For more information or to download HTMinL, visit the repo.

Build Helpers For Build Helpers

Of course, one consequence of writing so many build-helper apps is that their own development opens up new opportunities for build helpers of their own.

Oops.

Two things all CLI apps can benefit from — but don't strictly need — are BASH completions and MAN pages.

These can be written by hand, but that quickly gets old, as each new release requires that various changes be pushed to that many more places.

Crates like clap can lend a hand, but clap is a rather large dependency that adds delay to both compile time and runtime. If you're not already using it to handle CLI argument parsing, it is not worth adding solely for BASH completions and MAN pages.

Inspired by the approach taken by Cargo Deb, we wrote a Cargo plugin of our own called Cargo BashMan.

Because BashMan is a Cargo plugin rather than a build dependency, it has no effect on either a project's build or runtime. It pulls all the information it needs from the project's Cargo.toml configuration.

When compiling a new release, all you need to do is add a call between the project build and the package build, like:

# Build the binary.
cargo build --release

# Build the BASH and MAN files.
cargo bashman

# Package it up!
cargo-deb --no-build

For more information or to download Cargo Bashman, visit the repo.

More On the Way!

There are a number of additional apps we're working on that are still baking. (The best way to fine-tune usability is by using them!) As soon as we feel good about how they work, we'll push the code to GitHub and make them available for all and sundry!

What sort of things?

Annoying things, mostly.

But don't wait on us!

If you have annoying tasks holding your own builds back, why not take an afternoon to write something better? You won't regret it!

Josh Stoik
29 December 2020