
Merging & Deduping WARC files

+1 vote
775 views
Is there any programme or script that will take a group of WARC files and merge them, removing exact duplicate responses?

I realise this probably goes somewhat against good practice, but for reasons of space I would like to remove the approximately 90% replication of content (e.g. unchanged images) but retain the varying parts.
asked May 1, 2018 by richard (190 points)

1 Answer

+1 vote

For merging, warcat's concat command does the job.

For cleaning/dedup: using JWAT-Tools you can unpack a WARC file:

jwattools.sh unpack your-archive.warc.gz

The result is a set of single files, each containing one WARC record. You can then remove duplicates or unwanted records and reconstruct the WARC with plain cat:

cat your-archive.warc.{1,6,10} >your-new-new.warc

update: https://github.com/tari/warcdedupe

answered Jun 8, 2018 by atomotic (160 points)
edited Jun 8, 2018 by atomotic
But you would have to write a script to go through all the individual WARC files building up a list of duplicates to remove? That's the bit I was hoping to automate (laziness I know). I suppose it's relatively easy to compare the checksums.
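The checksum comparison mentioned here could be sketched roughly as follows. This is only a minimal illustration, not a tested tool: it assumes each unpacked file holds exactly one complete, uncompressed WARC record, and the names dedup_key and merge_unique are made up for this example. It uses the record's WARC-Payload-Digest header when present and otherwise hashes the HTTP body itself, so that differing HTTP Date headers do not hide a duplicate.

```python
import hashlib
from pathlib import Path


def dedup_key(record):
    """Return (target URI, payload digest) for a response record, else None."""
    # The WARC header block ends at the first blank line.
    head, _, block = record.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    if headers.get(b"warc-type") != b"response":
        return None  # requests, metadata, warcinfo etc. are never dropped
    uri = headers.get(b"warc-target-uri", b"")
    digest = headers.get(b"warc-payload-digest")
    if digest is None:
        # Assumption: the block is an HTTP response; hash only its body.
        length = int(headers.get(b"content-length", len(block)))
        http_head, sep, body = block[:length].partition(b"\r\n\r\n")
        digest = hashlib.sha1(body if sep else http_head).hexdigest().encode()
    return uri, digest


def merge_unique(record_files, out_path):
    """Concatenate record files, skipping responses already seen."""
    seen = set()
    kept = 0
    with open(out_path, "wb") as out:
        for path in record_files:
            record = Path(path).read_bytes()
            key = dedup_key(record)
            if key is not None:
                if key in seen:
                    continue  # same URI and same payload: duplicate
                seen.add(key)
            out.write(record)
            kept += 1
    return kept
```

Called with the unpacked record files in capture order and an output path, merge_unique returns the number of records written. Note that this simply drops repeated responses; the request records that belonged to them stay behind, and the WARC format's own way of recording an unchanged capture is a revisit record, which this sketch does not write.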
I spent a few minutes trying https://github.com/tari/warcdedupe
but it seems it is not ready yet, so I don't know of any other complete tools at the moment.

https://gist.github.com/atomotic/445c3996727ad77db30e15259304a15c
...