Backup Tumblr Blog

wget -H -N -k -p -r -Dmedia.tumblr.com,TARGET.tumblr.com http://TARGET.tumblr.com

Source: http://techstreak.tumblr.com/post/17433669978/tumblr-backup-windows-mac-linux

-H : Span hosts.
-N : Download only if the content is newer than already present.
-k : Convert links for local viewing.
-p : Download all content (Images/CSS/JS) necessary for optimal viewing.
-r : Turn on recursive retrieval.
-D : Domains to be crawled. Note: media.tumblr.com stores the images you uploaded.
-U : Use the specified user agent (optional; not used in the command above).
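If you back up more than one blog, the command above can be generated by a small POSIX-shell helper so the blog name isn't hard-coded. This is a sketch, not part of the original post; the function name build_backup_cmd and the example blog name are mine. It only prints the command, so you can inspect it (or pipe it to sh) before anything is fetched.

```shell
#!/bin/sh
# Print the wget mirror command for a given Tumblr blog name.
# Only the blog name (the part before .tumblr.com) is passed in.
build_backup_cmd() {
  target="$1"
  # Same flags as above: span hosts, fetch only newer files, convert
  # links, grab page requisites, recurse, and restrict the crawl to
  # the blog plus media.tumblr.com.
  printf 'wget -H -N -k -p -r -Dmedia.tumblr.com,%s.tumblr.com http://%s.tumblr.com\n' \
    "$target" "$target"
}

# Example: print the command for a hypothetical blog "example".
build_backup_cmd example
```

Running `build_backup_cmd example | sh` would then actually execute the mirror.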

Download Images

wget -rH -Dmedia.tumblr.com,TARGET.tumblr.com -R "*avatar*" -A "[0-9]" -A "*index*" -A jpeg,jpg,gif,png --level=10 -nd -nc http://TARGET.tumblr.com/

Source: http://blog.dcxn.com/2011/11/06/wget-all-recent-images-from-a-tumblr/

Explanation of the options

--quiet tells wget not to output what it's doing (not shown in the commands above). It's useful because this wget runs as part of a cron job; I know it works, so I don't need the output. If you're debugging or experimenting, leave this off.
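Since the source runs this from cron, a crontab entry along these lines would do a nightly backup. The 03:30 schedule and the TARGET placeholder are assumptions, not from the original post.

```shell
# crontab entry: minute hour day-of-month month day-of-week command
# Mirror the blog at 03:30 every night; --quiet keeps cron from
# mailing wget's progress output on every run.
30 3 * * * wget --quiet -H -N -k -p -r -Dmedia.tumblr.com,TARGET.tumblr.com http://TARGET.tumblr.com
```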

-rH tells wget to download recursively (-r) and to span hosts (-H). This means wget can wander into hosts other than http://TARGET.tumblr.com/. That is risky on its own, since it's easy to end up downloading half the internet, which brings us to the next flag.

-Dmedia.tumblr.com,TARGET.tumblr.com tells wget to visit only domains within media.tumblr.com and TARGET.tumblr.com. Usefully, subdomains count as matching, so 29.media.tumblr.com is fine too.

-R "*avatar*". This tells wget to reject (skip) any files with avatar in their names, keeping avatar thumbnails out of the download.

-A "[0-9]" -A "*index*" -A jpeg,jpg,gif,png. -A is the counterpart to -R: it tells wget to accept files whose names contain jpeg, jpg, gif, or png, files that contain index, and files that contain a digit 0-9. Why the list? We need the index pages to keep spidering, the numbered pages to walk the archive, and the extensions to download those image types.

--level=10 tells wget to go 10 levels deep. This would be dangerous except that we’ve restricted domains and limited the file names that can be downloaded.

-nd tells wget not to recreate the directory structure and to instead download all of the files into the current working directory.

-nc tells wget not to re-download files that already exist. This keeps Tumblr happy with us and prevents files from being needlessly redownloaded.
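The image-only command can be wrapped the same way. As with the backup helper, the function name and example blog are illustrative; the helper only prints the command, so the accept/reject filters can be reviewed before running it.

```shell
#!/bin/sh
# Print the images-only wget command for a given Tumblr blog name.
build_image_cmd() {
  target="$1"
  # Same flags as above: recurse across hosts, restrict the domains,
  # reject avatars, accept index pages / numbered archive pages /
  # image extensions, go 10 levels deep, flatten directories, and
  # skip files that already exist.
  printf 'wget -rH -Dmedia.tumblr.com,%s.tumblr.com -R "*avatar*" -A "[0-9]" -A "*index*" -A jpeg,jpg,gif,png --level=10 -nd -nc http://%s.tumblr.com/\n' \
    "$target" "$target"
}

# Example: print the command for a hypothetical blog "example".
build_image_cmd example
```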

See also:
http://kiodane.tumblr.com/post/27508318036/wget-mirror-a-tumblr-site

