This guide explains how to detect duplicate files based on content, not just file names or sizes, using standard Linux tools.

Why Deduplication?

Duplicate files waste disk space and clutter your directories. By hashing file contents (for example, with md5sum), we can find files whose bytes are identical, regardless of their names or locations. MD5 is no longer safe against deliberately crafted collisions, but it is fast and adequate for spotting accidental duplicates; if you substitute sha256sum, change uniq -w32 to -w64 in the commands below to match the longer digest.
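To see why hashing works, note that md5sum depends only on a file's bytes, never on its name or path (the /tmp file names below are purely illustrative):

```shell
# md5sum hashes file contents only, so renamed copies still match.
# (paths and names under /tmp are illustrative)
printf 'hello world\n' > /tmp/dedup_demo_a.txt
printf 'hello world\n' > /tmp/dedup_demo_b.renamed

# Both lines show the same 32-character checksum
md5sum /tmp/dedup_demo_a.txt /tmp/dedup_demo_b.renamed
```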


Method 1: Efficient (Filter by Size First)

find . -not -empty -type f -printf "%s\n" | \
sort -rn | \
uniq -d | \
xargs -I{} find . -type f -size {}c -print0 | \
xargs -0 md5sum | \
sort | uniq -w32 --all-repeated=separate

How It Works:

  1. Finds all non-empty files.
  2. Keeps only the sizes that occur more than once.
  3. Computes MD5 checksums only for files with matching sizes.
  4. Groups and displays files with identical contents.
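The steps above can be wrapped into a small reusable function; the finddups name and the directory argument are illustrative additions, and the pipeline relies on GNU find and uniq just like the one-liner:

```shell
#!/bin/sh
# Sketch of Method 1 as a function; "finddups" is an illustrative name.
# Relies on GNU extensions (find -printf, -size ...c, uniq -w).
finddups() {
    dir=${1:-.}                                        # default: current dir
    find "$dir" -not -empty -type f -printf '%s\n' |
        sort -rn | uniq -d |                           # sizes seen 2+ times
        xargs -I{} find "$dir" -type f -size {}c -print0 |
        xargs -0 md5sum |                              # hash candidates only
        sort | uniq -w32 --all-repeated=separate       # group by digest
}
```

Running, say, finddups ~/Downloads would print each group of identical files, with groups separated by a blank line.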

Method 2: Simple (Hash All Files)

find . ! -empty -type f -exec md5sum {} + | \
sort | uniq -w32 -D

How It Works:

  1. Calculates the MD5 checksum of every non-empty file.
  2. Sorts by checksum and prints every file whose contents match at least one other file.
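To act on Method 2's output, one possible sketch (assuming GNU uniq and file paths without embedded newlines) keeps the first file of each group and lists the rest as candidates for removal:

```shell
# For each group of identical files, print every path except the first;
# review this list manually before deleting anything.
find . ! -empty -type f -exec md5sum {} + |
    sort | uniq -w32 -D |
    awk '{
        hash = substr($0, 1, 32)                 # first 32 chars: MD5 digest
        if (hash == prev) print substr($0, 35)   # path starts at column 35
        prev = hash
    }'
```

Piping paths to rm directly is left out deliberately; inspect the list first.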