HyperDeduplicator
2024 · software
Built HyperDeduplicator as an internal tool for identifying and removing duplicate files across large local file archives. The tool hashes files and caches results in SQLite, so repeated runs against the same directory don't re-hash files that haven't changed. On large archives with thousands of files, the cache makes the difference between a tool you'll use and one you'll avoid.
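The core duplicate-finding idea can be sketched in a few lines: hash every file's contents and group paths by digest. This is an illustrative sketch, not HyperDeduplicator's actual code; the write-up doesn't name the hash function, so SHA-256 is assumed here, and `hash_file`/`find_duplicates` are hypothetical names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large files never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group every file under root by content hash; groups of two or more are duplicates."""
    groups: defaultdict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            groups[hash_file(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Without a cache, every run pays the full hashing cost again, which is the problem the next section addresses.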
The SQLite caching layer was the interesting engineering decision. A naive deduplication tool re-hashes everything on every run, which is fine for small directories and unusable for large ones. By keying cached hashes on file path, size, and modification timestamp, the tool can skip files it's already processed and only hash what's new or changed. The cache also persists across sessions, so the tool gets faster the more you use it on a stable archive.
HyperDeduplicator is not a product. It solves one specific problem in one specific context: my own archive management workflow. That constraint is a feature. Internal tools don't need interfaces for users who don't exist; they need to work reliably for the person who built them.
What this proved: The tools you build for yourself have a quality standard that client work sometimes doesn't. You live with the failures directly.