External/wayback-machine-downloader

mirror of https://github.com/StrawberryMaster/wayback-machine-downloader.git synced 2025-12-29 16:16:06 +00:00

Latest commit: hartator 7885c822fa, Bump gem version (2016-07-30 14:10:20 -05:00)

Files (name | last commit | date):

bin | Make the wording consistent between filters | 2016-07-28 17:57:17 -05:00
lib | Bump gem version | 2016-07-30 14:10:20 -05:00
test | Add test for to and from timestamp parameter | 2016-07-30 14:09:46 -05:00
.gitignore | Add gitignore file | 2015-07-25 19:07:51 -05:00
.travis.yml | Opt out of new Travis architecture | 2015-11-12 10:43:57 -06:00
Dockerfile | Bump Ruby version to 2.3 | 2016-07-29 10:06:21 -05:00
Gemfile | Add Bundler to ease development and allow Travis to build correctly | 2015-11-04 13:18:32 -06:00
MIT-LICENSE.txt | Add MIT License | 2016-03-26 16:30:18 -05:00
Rakefile | Add placeholder test files | 2015-07-26 00:02:35 -05:00
README.md | Update Readme to reflect new parameters | 2016-07-30 14:09:18 -05:00
wayback_machine_downloader.gemspec | Add to_regex library to gem file list | 2015-11-19 15:29:07 -06:00

README.md

Wayback Machine Downloader

Download any website from the Internet Archive Wayback Machine.

Installation

You need Ruby (>= 1.9.2) installed on your system. Then run:

gem install wayback_machine_downloader

Tip: If you run into permission errors, you may need to prefix this command with sudo.

Basic Usage

Run wayback_machine_downloader with the base URL of the website you want to retrieve as a parameter (e.g., http://example.com):

wayback_machine_downloader http://example.com

How it works

It will download the latest version of every file present on the Wayback Machine to ./websites/example.com/. It will also re-create the directory structure and auto-create index.html pages so that the result works seamlessly with Apache and Nginx. All downloaded files are the original ones, not the versions rewritten by the Wayback Machine. This way, URLs and link structure are the same as before.
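The two steps described above (keeping the latest version of each file, then mapping its URL to a local path) can be sketched as follows. This is only an illustration, not the gem's actual code: the helper names latest_snapshots and local_path_for, the [timestamp, url] pair shape, and the rule "no file extension means a directory" are all assumptions made for the example.

```ruby
require "uri"

# Hypothetical helper: given [timestamp, original_url] pairs, keep only the
# most recent snapshot of each URL. 14-digit timestamps sort correctly as strings.
def latest_snapshots(snapshots)
  snapshots
    .group_by { |_timestamp, url| url }
    .map { |_url, versions| versions.max_by { |timestamp, _| timestamp } }
end

# Hypothetical helper: map an original URL to a path under ./websites/<host>/.
# Assumption: a path ending in "/" or lacking a file extension is treated as a
# directory and stored as <directory>/index.html.
def local_path_for(original_url, base_dir = "websites")
  uri = URI.parse(original_url)
  path = uri.path.to_s
  path = "/" if path.empty?
  if path.end_with?("/") || File.extname(path).empty?
    path = File.join(path, "index.html")
  end
  File.join(base_dir, uri.host, path)
end

snapshots = [
  ["20060716231334", "http://example.com/about/"],
  ["20100916231334", "http://example.com/about/"],
  ["20080101000000", "http://example.com/css/site.css"]
]

latest_snapshots(snapshots).each do |timestamp, url|
  puts "#{timestamp} #{url} -> #{local_path_for(url)}"
end
```

Writing directory-like URLs to index.html is what would let a static web server such as Apache or Nginx serve the downloaded tree with the original URLs.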

From Timestamp

Optional. You may want to supply a from timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the URLs of the regular Wayback Machine website (e.g., http://web.archive.org/web/20060716231334/http://example.com). You can also use a year (2006), a year and month (200607), etc. It can be used in combination with To Timestamp. Wayback Machine Downloader will then fetch only file versions on or after the timestamp specified:

wayback_machine_downloader http://example.com --from 20060716231334

To Timestamp

Optional. You may want to supply a to timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the URLs of the regular Wayback Machine website (e.g., http://web.archive.org/web/20100916231334/http://example.com). You can also use a year (2010), a year and month (201009), etc. It can be used in combination with From Timestamp. Wayback Machine Downloader will then fetch only file versions on or before the timestamp specified:

wayback_machine_downloader http://example.com --to 20100916231334
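To make the interaction between partial timestamps and the from/to bounds concrete, here is a minimal sketch. It assumes snapshots carry 14-digit YYYYMMDDhhmmss timestamps and that a partial value such as 2006 or 201009 covers the whole period it names; the helpers pad_timestamp and in_range? are hypothetical and do not describe how the gem or the Wayback Machine implements the check.

```ruby
# Hypothetical helper: expand a partial timestamp ("2006", "200607", ...) to
# 14 characters by padding on the right with the given fill character.
def pad_timestamp(timestamp, fill)
  timestamp.to_s.ljust(14, fill)
end

# Hypothetical helper: true when a 14-digit snapshot timestamp lies within the
# optional bounds. Padding "from" with 0s and "to" with 9s yields the lowest
# and highest strings sharing the given prefix, so plain string comparison works.
def in_range?(snapshot_timestamp, from: nil, to: nil)
  return false if from && snapshot_timestamp < pad_timestamp(from, "0")
  return false if to && snapshot_timestamp > pad_timestamp(to, "9")
  true
end

puts in_range?("20060716231334", from: "2006")
puts in_range?("20101001000000", to: "201009")
puts in_range?("20080101000000", from: "200607", to: "2010")
```

String comparison is enough here because fixed-width numeric timestamps sort lexicographically in chronological order.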

Only URL Filter

Optional. You may want to retrieve only files of a certain type (e.g., .pdf, .jpg, .wrd...) or files in a specific directory. To do so, you can supply the --only flag with a string or a regex (using the '/regex/' notation) to limit which files Wayback Machine Downloader will download.

For example, if you only want to download files inside a specific my_directory:

wayback_machine_downloader http://example.com --only my_directory

Or if you want to download only images and nothing else:

wayback_machine_downloader http://example.com --only "/\.(gif|jpg|jpeg)$/i"
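As an illustration of the "string or '/regex/' notation" distinction, the sketch below parses a filter argument using only the Ruby standard library. The helpers parse_filter and matches_filter? are hypothetical, and treating a plain string as a case-sensitive substring match is an assumption of this example rather than a statement about the gem's behavior.

```ruby
# Hypothetical helper: turn "/pattern/flags" into a Regexp and leave any
# other value as a plain string.
def parse_filter(filter)
  match = filter.match(%r{\A/(.*)/([imx]*)\z}m)
  return filter unless match

  options = 0
  options |= Regexp::IGNORECASE if match[2].include?("i")
  options |= Regexp::MULTILINE if match[2].include?("m")
  options |= Regexp::EXTENDED if match[2].include?("x")
  Regexp.new(match[1], options)
end

# Hypothetical helper: regex filters are matched against the URL, plain
# strings are treated as substrings (an assumption of this sketch).
def matches_filter?(url, filter)
  pattern = parse_filter(filter)
  pattern.is_a?(Regexp) ? !!(url =~ pattern) : url.include?(pattern)
end

puts matches_filter?("http://example.com/my_directory/page.html", "my_directory")
puts matches_filter?("http://example.com/img/photo.JPG", "/\\.(gif|jpg|jpeg)$/i")
puts matches_filter?("http://example.com/index.html", "/\\.(gif|jpg|jpeg)$/i")
```

The trailing i flag is what lets the image pattern match upper-case extensions such as .JPG.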

Exclude URL Filter

Optional. You may want to skip files of a certain type (e.g., .pdf, .jpg, .wrd...) or files in a specific directory. To do so, you can supply the --exclude flag with a string or a regex (using the '/regex/' notation) to limit which files Wayback Machine Downloader will download.

For example, if you want to avoid downloading files inside my_directory:

wayback_machine_downloader http://example.com --exclude my_directory

Or if you want to download everything except images:

wayback_machine_downloader http://example.com --exclude "/\.(gif|jpg|jpeg)$/i"
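The exclude filter is the mirror image of the only filter: one keeps matching URLs, the other drops them. The sketch below shows that relationship on a small list of URLs. The helpers hit? and keep? are hypothetical, and applying both filters together in one call is an assumption of this example, not something the text above states.

```ruby
# Hypothetical helper: a filter is either a Regexp or a plain substring.
def hit?(url, filter)
  filter.is_a?(Regexp) ? !!(url =~ filter) : url.include?(filter)
end

# Hypothetical helper: keep a URL if it matches the only filter (when given)
# and does not match the exclude filter (when given).
def keep?(url, only: nil, exclude: nil)
  return false if only && !hit?(url, only)
  return false if exclude && hit?(url, exclude)
  true
end

urls = [
  "http://example.com/index.html",
  "http://example.com/my_directory/report.pdf",
  "http://example.com/my_directory/logo.gif",
  "http://example.com/img/photo.JPG"
]

images = /\.(gif|jpg|jpeg)$/i

puts urls.select { |url| keep?(url, exclude: images) }
puts urls.select { |url| keep?(url, only: "my_directory", exclude: images) }
```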

Using the Docker image

As an alternative way to install, we have a Docker image! Retrieve the wayback-machine-downloader Docker image this way:

docker pull hartator/wayback-machine-downloader

Then, you should be able to use the Docker image to download websites. For example:

docker run --rm -it -v $PWD/websites:/websites hartator/wayback-machine-downloader http://example.com

Contributing

Contributions are welcome! Just submit a pull request via GitHub.

To run the tests:

bundle install
bundle exec rake test