Faster Path Canonicalization in Rust
I recently spent some time optimizing a recursive path traversal routine written in Rust, and noticed something strange:
Evidently calling std::fs::canonicalize is very expensive!
Now, this may well be file system-specific — my system uses BTRFS — but still, a 4.5x performance penalty isn't great. Haha.
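For anyone wanting to check how this plays out on their own file system, here is a rough timing sketch. The helper names (`time_calls`, `compare_lookups`) are made up for illustration, and this is not the benchmark behind the numbers above. It just times plain metadata lookups against full canonicalization over the same set of paths.

```rust
use std::fs;
use std::hint::black_box;
use std::path::PathBuf;
use std::time::{Duration, Instant};

/// Hypothetical helper: total time spent calling `f` on every path, repeated
/// `rounds` times.
fn time_calls<F: Fn(&PathBuf)>(paths: &[PathBuf], rounds: u32, f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..rounds {
        for path in paths {
            f(path);
        }
    }
    start.elapsed()
}

/// Hypothetical helper: returns (time using `metadata` only, time using
/// `canonicalize`) for the same paths and number of rounds.
pub fn compare_lookups(paths: &[PathBuf], rounds: u32) -> (Duration, Duration) {
    // `black_box` keeps the compiler from discarding the unused results.
    let meta = time_calls(paths, rounds, |p| {
        black_box(fs::metadata(p).ok());
    });
    let canon = time_calls(paths, rounds, |p| {
        black_box(fs::canonicalize(p).ok());
    });
    (meta, canon)
}
```

The ratio between the two durations will vary with the file system, path depth, and cache state, so treat any single run as a ballpark figure.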
The application needs to know where paths actually are, so it can't not canonicalize them, but maybe there's a better way?
Conditional Canonicalization
Rust must intuitively understand that canonicalization is expensive, because its std::fs::ReadDir iterators build paths naively.
// Relative seed directory produces relative DirEntry paths:
let mut set1 = std::fs::read_dir("./dir").unwrap();
let path1 = set1.next().and_then(Result::ok).map(|e| e.path()); // ./dir/file1

// Absolute seed directory produces absolute DirEntry paths:
let mut set2 = std::fs::read_dir("/absolute/dir").unwrap();
let path2 = set2.next().and_then(Result::ok).map(|e| e.path()); // /absolute/dir/file1
With that in mind, so long as the seed directory is properly canonicalized, all of the non-symlink child paths it produces will also be canonicalized.
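That claim is easy enough to sanity-check. The sketch below uses a hypothetical helper, `non_symlink_children_are_canonical`, that canonicalizes the seed once and then compares each non-symlink child path from `read_dir` against its fully canonicalized form. It assumes nothing renames or replaces entries while the check is running.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: canonicalize `seed` once, then check that every
/// non-symlink child yielded by `read_dir` already matches its canonical form.
pub fn non_symlink_children_are_canonical(seed: &Path) -> io::Result<bool> {
    // The one unavoidable canonicalization: the seed directory itself.
    let seed = fs::canonicalize(seed)?;

    for entry in fs::read_dir(&seed)? {
        let entry = entry?;

        // `DirEntry::file_type` does not follow symlinks, so this spots them.
        if entry.file_type()?.is_symlink() {
            continue;
        }

        // `DirEntry::path` is just the seed path joined with the file name.
        let naive = entry.path();
        if naive != fs::canonicalize(&naive)? {
            return Ok(false);
        }
    }

    Ok(true)
}
```

The full canonicalization inside the loop is only there to verify the claim. A real traversal would skip it for non-symlink children, which is the whole point.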
Which brings us to…
Smarter Path Resolving
Here is what we ended up with:
use std::path::PathBuf;

/// # Resolve Path
///
/// This will take an arbitrary path and canonicalize it if it needs
/// canonicalizing. For `trusted` paths — paths with a known canonical parent —
/// this is only done for symlinks. For untrusted or unknown paths, this is
/// always done.
///
/// The method also returns a unique `u128` made up of the path's device and
/// inode, which can be stored (in e.g. a `HashSet`) and checked during
/// traversal to prevent crawling the same path over and over again.
///
/// And it returns a `bool` signifying whether or not the path is a directory,
/// since recursion will require that knowledge anyway. Might as well grab it
/// while the metadata is fresh!
///
/// Like [`std::fs::canonicalize`], this will fail in cases where the path
/// ain't real, except as an `Option` rather than a `Result` for friendlier
/// `filter_map()` integration.
pub fn resolve_path(path: PathBuf, trusted: bool) -> Option<(u128, bool, PathBuf)> {
    use std::os::unix::fs::MetadataExt;

    // Pull the basic meta. (This follows symlinks, so it describes the target.)
    let meta = std::fs::metadata(&path).ok()?;

    // Pack the device into the high 64 bits and the inode into the low 64.
    // (Casting a `[u64; 2]` pointer to `*const u128` and dereferencing it can
    // be a misaligned read, which is undefined behavior.)
    let hash: u128 = (u128::from(meta.dev()) << 64) | u128::from(meta.ino());
    let dir: bool = meta.is_dir();

    // Handle "trusted" paths.
    if trusted {
        let meta = std::fs::symlink_metadata(&path).ok()?;
        if ! meta.file_type().is_symlink() {
            return Some((hash, dir, path));
        }
    }

    // Fallback to full canonicalization.
    let path = std::fs::canonicalize(path).ok()?;
    Some((hash, dir, path))
}
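To show how the pieces might fit together, here is a minimal sketch of a traversal built around `resolve_path`. The `crawl` function and its queue-based design are assumptions for illustration, not the application's actual crawler. A compact, Unix-only copy of `resolve_path` is included so the block stands on its own.

```rust
use std::collections::HashSet;
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::path::PathBuf;

/// Compact copy of `resolve_path` (Unix only).
fn resolve_path(path: PathBuf, trusted: bool) -> Option<(u128, bool, PathBuf)> {
    let meta = fs::metadata(&path).ok()?; // Follows symlinks.
    let hash = (u128::from(meta.dev()) << 64) | u128::from(meta.ino());
    let dir = meta.is_dir();

    // Trusted non-symlinks are already canonical; return them untouched.
    if trusted && !fs::symlink_metadata(&path).ok()?.file_type().is_symlink() {
        return Some((hash, dir, path));
    }

    let path = fs::canonicalize(path).ok()?;
    Some((hash, dir, path))
}

/// Hypothetical crawler: returns every unique, canonical file under `seed`.
pub fn crawl(seed: PathBuf) -> Vec<PathBuf> {
    let mut seen: HashSet<u128> = HashSet::new();
    let mut files = Vec::new();

    // Each queued path carries its own "trusted" flag; the seed is untrusted.
    let mut queue = vec![(seed, false)];

    while let Some((path, trusted)) = queue.pop() {
        let (hash, dir, path) = match resolve_path(path, trusted) {
            Some(resolved) => resolved,
            None => continue, // Missing or unreadable; skip it.
        };

        // Skip anything whose device/inode pair was already visited. This is
        // also what stops symlink loops from recursing forever.
        if !seen.insert(hash) {
            continue;
        }

        if dir {
            if let Ok(entries) = fs::read_dir(&path) {
                // `path` is canonical here, so its children can be trusted.
                queue.extend(entries.filter_map(Result::ok).map(|e| (e.path(), true)));
            }
        } else {
            files.push(path);
        }
    }

    files
}
```

Only the seed and any symlinks encountered along the way pay for a full `canonicalize` call. Every other path costs two metadata lookups.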
Plugging this in, the timings were much more acceptable:
Isn't it nice when everything works out?