Problem statement
- Oh no, my disk is full.
- Old idea: Reduce multiple copies of the same files.
rdfind -outputname /dev/stdout ~
- Maybe this also works for /usr?
- Yes, but work on .deb files instead.
- Just report issues and let maintainers fix them. ⇒ QA
Architecture: package import
- Currently processing sid main amd64.
- Save metadata such as version and dependencies.
- For each regular file, store filename and size.
- Compute hashes of files and store them.
  - sha512
  - gzip_sha512: Decompress. Then hash. Failure ⇒ no hash produced.
  - png_sha512: Convert PNG to 8-bit RGBA, then hash. Ignore non-PNGs.
  - gif_sha512: Like png_sha512. Consider first frame only.
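
A minimal sketch of the first two hash functions, assuming the file content is already in memory as bytes. The function names sha512_hex and gzip_sha512_hex are hypothetical and not taken from the actual importer. The point is that two differently compressed copies of the same data get the same gzip_sha512 value, and that data which does not decompress yields no hash at all.

```python
import gzip
import hashlib
import zlib


def sha512_hex(data: bytes) -> str:
    """Plain sha512 over the raw file content."""
    return hashlib.sha512(data).hexdigest()


def gzip_sha512_hex(data: bytes):
    """Decompress, then hash. Returns None when decompression fails.

    Assumption: any decompression error simply means "no hash produced".
    """
    try:
        plain = gzip.decompress(data)
    except (OSError, EOFError, zlib.error):
        # Not gzip, truncated, or corrupted: produce no hash.
        return None
    return hashlib.sha512(plain).hexdigest()


if __name__ == "__main__":
    text = b"GNU GENERAL PUBLIC LICENSE"
    fast = gzip.compress(text, compresslevel=1)
    best = gzip.compress(text, compresslevel=9)
    # Both compressed variants map to the hash of the uncompressed data.
    print(gzip_sha512_hex(fast) == gzip_sha512_hex(best) == sha512_hex(text))
    print(gzip_sha512_hex(b"not gzip"))
```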
Architecture: precomputation
- For each combination of packages and hash functions, compute the "sharing".
- All but one copy of a file in a single package are considered redundant.
- All copies also present in other packages are considered redundant.
- Differently compressed PNG files considered equal. GPL-3 can be shared as GPL-3.gz.
- Issues with individual files. Example: broken .gz or PNGs not named .png
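
The sharing rules above could be sketched roughly as follows. This is an illustration under assumptions, not the actual precomputation: each package is modelled as a list of (filename, size, hash value) tuples, the function name sharing is hypothetical, and within a single package the first copy seen is arbitrarily treated as the one to keep. The caller decides which hash function produced the values on each side, so comparing sha512 values of one package against gzip_sha512 values of another covers the GPL-3 versus GPL-3.gz case.

```python
from collections import defaultdict


def sharing(files_a, files_b, same_package=False):
    """Count files of package A that are redundant with respect to package B.

    files_a, files_b: iterables of (filename, size, hashvalue) tuples.
    Returns (number of redundant files, bytes they occupy).
    """
    hashes_b = {h for _, _, h in files_b}

    # Group the sizes of A's files by hash, keeping only hashes B also has.
    by_hash = defaultdict(list)
    for _name, size, h in files_a:
        if h in hashes_b:
            by_hash[h].append(size)

    count = 0
    saved = 0
    for sizes in by_hash.values():
        if same_package:
            # Within one package, all but one copy are redundant.
            # Assumption: the first copy seen is the one that stays.
            sizes = sizes[1:]
        count += len(sizes)
        saved += sum(sizes)
    return count, saved


if __name__ == "__main__":
    pkg_a = [("a/COPYING", 100, "h1"), ("a/LICENSE", 100, "h1"), ("a/bin", 50, "h2")]
    pkg_b = [("b/COPYING", 100, "h1")]
    print(sharing(pkg_a, pkg_a, same_package=True))
    print(sharing(pkg_a, pkg_b))
```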
Metrics
- 2 GB sqlite database file (800 MB indices), 400 MB .sql.gz
- 40k packages, 4M files, 5M hash values
- full import takes about 2 CPU days
Thanks
Questions