Faster Path Canonicalization in Rust

I recently spent some time optimizing a recursive path traversal routine written in Rust, and noticed something strange:

Evidently calling std::fs::canonicalize is very expensive!

Now, this may well be file system-specific (my system uses Btrfs), but still, a 4.5x performance penalty isn't great. Haha.

The application needs to know where paths actually are, so it can't not canonicalize them, but maybe there's a better way?

Conditional Canonicalization

Rust must intuitively understand that canonicalization is expensive, because its std::fs::ReadDir iterators build paths naively.

// A relative seed directory produces relative DirEntry paths:
let mut set1 = std::fs::read_dir("./dir").unwrap();
let path1 = set1.next().and_then(Result::ok).map(|e| e.path()); // ./dir/file1

// An absolute seed directory produces absolute DirEntry paths:
let mut set2 = std::fs::read_dir("/absolute/dir").unwrap();
let path2 = set2.next().and_then(Result::ok).map(|e| e.path()); // /absolute/dir/file1

With that in mind, so long as the seed directory is properly canonicalized, all of the non-symlink child paths it produces will also be canonical, since each one is just the seed path with a file name joined onto the end.
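To make that concrete, here is a minimal sketch of the idea: canonicalize the seed once, then let `read_dir` build the child paths. The helper name `canonical_children` is made up for illustration and is not part of the original routine. Symlinked children are deliberately left alone here; they still need resolving separately.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Canonicalize the seed once, then list its immediate children.
///
/// `canonical_children` is a hypothetical helper used only for this sketch.
pub fn canonical_children(seed: &Path) -> io::Result<Vec<PathBuf>> {
    // One (expensive) canonicalize call for the seed directory.
    let seed = fs::canonicalize(seed)?;

    // `DirEntry::path` joins the entry's file name onto the path that was
    // handed to `read_dir`, so every child inherits the canonical prefix.
    let mut out = Vec::new();
    for entry in fs::read_dir(&seed)? {
        out.push(entry?.path());
    }
    Ok(out)
}
```

For a regular (non-symlink) child, running `std::fs::canonicalize` on the returned path should hand back the same path unchanged.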

Which brings us to…

Smarter Path Resolving

Here is what we ended up with:

use std::path::PathBuf;

/// # Resolve Path
///
/// This will take an arbitrary path and canonicalize it if it needs
/// canonicalizing. For `trusted` paths — paths with a known canonical parent —
/// this is only done for symlinks. For untrusted or unknown paths, this is
/// always done.
///
/// The method also returns a unique `u128` made up of the path's device and
/// inode, which can be stored (in e.g. a `HashSet`) and checked during
/// traversal to prevent crawling the same path over and over again.
///
/// And it returns a `bool` signifying whether or not the path is a directory,
/// since recursion will require that knowledge anyway. Might as well grab it
/// while the metadata is fresh!
///
/// Like [`std::fs::canonicalize`], this will fail in cases where the path
/// ain't real, except as an `Option` rather than a `Result` for friendlier
/// `filter_map()` integration.
pub fn resolve_path(path: PathBuf, trusted: bool) -> Option<(u128, bool, PathBuf)> {
    use std::os::unix::fs::MetadataExt;

    // Pull the basic meta.
    let meta = std::fs::metadata(&path).ok()?;
    // Pack the device and inode into a single u128 with safe integer arithmetic.
    let hash: u128 = (u128::from(meta.dev()) << 64) | u128::from(meta.ino());
    let dir: bool = meta.is_dir();

    // Handle "trusted" paths.
    if trusted {
        let meta = std::fs::symlink_metadata(&path).ok()?;
        if ! meta.file_type().is_symlink() {
            return Some((hash, dir, path));
        }
    }

    // Fallback to full canonicalization.
    let path = std::fs::canonicalize(path).ok()?;
    Some((hash, dir, path))
}
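
For context, this is roughly how such a function might slot into a traversal. It is only a sketch under a few assumptions: the `walk` function and its queue-based loop are hypothetical (the real crawler is not shown in this post), and `resolve_path` is repeated in condensed form, with the device and inode packed arithmetically, purely so the example compiles on its own. Unix only.

```rust
use std::collections::HashSet;
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::path::PathBuf;

// Condensed copy of `resolve_path`, repeated so the sketch stands alone.
fn resolve_path(path: PathBuf, trusted: bool) -> Option<(u128, bool, PathBuf)> {
    let meta = fs::metadata(&path).ok()?;
    let hash = (u128::from(meta.dev()) << 64) | u128::from(meta.ino());
    let dir = meta.is_dir();
    if trusted && !fs::symlink_metadata(&path).ok()?.file_type().is_symlink() {
        return Some((hash, dir, path));
    }
    let path = fs::canonicalize(path).ok()?;
    Some((hash, dir, path))
}

/// Hypothetical traversal: collect every unique file beneath `seed`.
pub fn walk(seed: PathBuf) -> Vec<PathBuf> {
    let mut seen: HashSet<u128> = HashSet::new();
    let mut files = Vec::new();
    let mut queue = Vec::new();

    // The seed is untrusted, so it always gets the full canonicalization.
    if let Some((hash, dir, path)) = resolve_path(seed, false) {
        seen.insert(hash);
        if dir { queue.push(path); } else { files.push(path); }
    }

    while let Some(dir) = queue.pop() {
        let entries = match fs::read_dir(&dir) {
            Ok(entries) => entries,
            Err(_) => continue, // Unreadable directory; skip it.
        };

        // Every queued directory is canonical, so its children are trusted.
        for (hash, is_dir, path) in entries
            .filter_map(Result::ok)
            .filter_map(|e| resolve_path(e.path(), true))
        {
            // `insert` returns false for a device/inode pair already seen,
            // which is what stops symlink loops and repeat visits.
            if !seen.insert(hash) { continue; }
            if is_dir { queue.push(path); } else { files.push(path); }
        }
    }

    files
}
```

One side effect worth knowing: because the key is device plus inode, hard links to the same file are also collapsed into a single entry.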

Plugging this in, the timings were much more acceptable:

Isn't it nice when everything works out?

Josh Stoik
29 January 2021