# Wayback Machine Downloader

Download an entire website from the Internet Archive Wayback Machine.

## Installation

You need Ruby (>= 1.9.2) installed on your system.
Then run:

    gem install wayback_machine_downloader

**Tip:** If you run into permission errors, you might have to add `sudo` in front of this command.

## Basic Usage

Run `wayback_machine_downloader` with the base URL of the website you want to retrieve as a parameter (e.g., http://example.com):

    wayback_machine_downloader http://example.com

## How it works

It will download the latest version of every file present on the Wayback Machine to `./websites/example.com/`. It will also re-create the directory structure and auto-create `index.html` pages to work seamlessly with Apache and Nginx. All downloaded files are the original ones, not the versions rewritten by the Wayback Machine. This way, URLs and link structure are the same as before.
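
As an illustration of the layout described above, here is a minimal shell sketch that maps an archived URL to a local path. The `url_to_path` function and its rule (a URL whose last segment has no file extension is stored as `index.html` inside a directory of that name) are assumptions made for this example, not the tool's actual code; the real mapping may differ in its details.

```shell
# Hypothetical helper: map an archived URL to a local path under ./websites/.
url_to_path() {
  rest=${1#*://}     # drop the scheme, e.g. "http://"
  rest=${rest%/}     # drop one trailing slash, if any
  last=${rest##*/}   # last path segment
  case "$rest" in
    */*) ;;          # the URL has a path component
    *) last= ;;      # bare domain: treat as a directory
  esac
  case "$last" in
    *.*) printf './websites/%s\n' "$rest" ;;            # looks like a file
    *)   printf './websites/%s/index.html\n' "$rest" ;; # looks like a directory
  esac
}

url_to_path http://example.com/about/
url_to_path http://example.com/img/logo.gif
```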

## Advanced Usage

    Usage: wayback_machine_downloader http://example.com

    Download an entire website from the Wayback Machine.

    Optional options:
        -d, --directory PATH             Directory to save the downloaded files to. Default is ./websites/ plus the domain name.
        -f, --from TIMESTAMP             Only files on or after timestamp supplied (ie. 20060716231334)
        -t, --to TIMESTAMP               Only files on or before timestamp supplied (ie. 20100916231334)
        -o, --only ONLY_FILTER           Restrict downloading to urls that match this filter (use // notation for the filter to be treated as a regex)
        -x, --exclude EXCLUDE_FILTER     Skip downloading of urls that match this filter (use // notation for the filter to be treated as a regex)
        -a, --all                        Expand downloading to error files (40x and 50x) and redirections (30x)
        -c, --concurrency NUMBER         Number of multiple files to download at a time. Default is one file at a time. (ie. 20)
        -l, --list                       Only list file urls in a JSON format with the archived timestamps. Won't download anything.
        -v, --version                    Display version

## Specify directory to save files to

    -d, --directory PATH

Optional. By default, Wayback Machine Downloader will download files to `./websites/` followed by the domain name of the website. You may want to save files in a specific directory using this option.

Example:

    wayback_machine_downloader http://example.com --directory downloaded-backup/

## From Timestamp

    -f, --from TIMESTAMP

Optional. You may want to supply a from timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the URLs of the regular Wayback Machine website (e.g., http://web.archive.org/web/20060716231334/http://example.com). You can also use a year (2006), a year plus month (200607), etc. It can be used in combination with the To Timestamp option.
Wayback Machine Downloader will then fetch only file versions on or after the timestamp specified.

Example:

    wayback_machine_downloader http://example.com --from 20060716231334
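
If you already have a snapshot URL open in your browser, the timestamp can be cut out of it with plain shell parameter expansion. The sketch below is only an illustration: the variable names are made up for this example, and the final command is printed with `echo` rather than executed.

```shell
# Example snapshot URL copied from the Wayback Machine website.
snapshot_url='http://web.archive.org/web/20060716231334/http://example.com'

rest=${snapshot_url#*/web/}   # drop everything up to and including "/web/"
timestamp=${rest%%/*}         # keep the digits before the next "/"
year=$(printf '%s' "$timestamp" | cut -c1-4)   # a year-only prefix

echo "full timestamp: $timestamp"
echo "year only:      $year"

# Print the resulting command; drop the leading "echo" to run it for real.
echo wayback_machine_downloader http://example.com --from "$timestamp"
```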

## To Timestamp

    -t, --to TIMESTAMP

Optional. You may want to supply a to timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the URLs of the regular Wayback Machine website (e.g., http://web.archive.org/web/20100916231334/http://example.com). You can also use a year (2010), a year plus month (201009), etc. It can be used in combination with the From Timestamp option.
Wayback Machine Downloader will then fetch only file versions on or before the timestamp specified.

Example:

    wayback_machine_downloader http://example.com --to 20100916231334
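
Both timestamp options can be combined to bound a time window. The sketch below checks the two values before building the command. The `is_timestamp` helper is hypothetical, and its rule (4 to 14 digits, i.e. a full timestamp or a prefix of one) is an assumption based on the formats mentioned above, not a check the tool is known to perform.

```shell
# Hypothetical helper: accept only 4 to 14 digits.
is_timestamp() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
  esac
  [ "${#1}" -ge 4 ] && [ "${#1}" -le 14 ]
}

from_ts=2006      # year only
to_ts=201009      # year plus month

if is_timestamp "$from_ts" && is_timestamp "$to_ts"; then
  # Print the command; drop the leading "echo" to run it for real.
  echo wayback_machine_downloader http://example.com --from "$from_ts" --to "$to_ts"
else
  echo "invalid timestamp" >&2
fi
```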

## Only URL Filter

     -o, --only ONLY_FILTER

Optional. You may want to retrieve files which are of a certain type (e.g., .pdf, .jpg, .wrd...) or are in a specific directory. To do so, you can supply the `--only` flag with a string or a regex (using the '/regex/' notation) to limit which files Wayback Machine Downloader will download.

For example, if you only want to download files inside a specific `my_directory`:

    wayback_machine_downloader http://example.com --only my_directory

Or if you want to download only images and nothing else:

    wayback_machine_downloader http://example.com --only "/\.(gif|jpg|jpeg)$/i"
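
To get a feel for what the image regex above selects, you can try it on a few URLs with `grep` before starting a download. This is only an approximation: the sample URLs are hypothetical, and `grep -E` is not the regex engine the tool uses, although simple patterns like this one should behave the same way.

```shell
# Hypothetical sample URLs, for illustration only.
urls='http://example.com/logo.gif
http://example.com/photos/cat.JPG
http://example.com/index.html
http://example.com/gif-archive/page.html'

# -E enables extended regular expressions; -i mirrors the trailing "i" flag.
matched=$(printf '%s\n' "$urls" | grep -Ei '\.(gif|jpg|jpeg)$')
printf '%s\n' "$matched"
```

Only the first two URLs are printed: the pattern is anchored to the end of the URL, so `gif-archive/page.html` is not selected.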

## Exclude URL Filter

     -x, --exclude EXCLUDE_FILTER

Optional. You may want to retrieve files which aren't of a certain type (e.g., .pdf, .jpg, .wrd...) or aren't in a specific directory. To do so, you can supply the `--exclude` flag with a string or a regex (using the '/regex/' notation) to limit which files Wayback Machine Downloader will download.

For example, if you want to avoid downloading files inside `my_directory`:

    wayback_machine_downloader http://example.com --exclude my_directory

Or if you want to download everything except images:

    wayback_machine_downloader http://example.com --exclude "/\.(gif|jpg|jpeg)$/i"
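
The plain-string form can be previewed in the same spirit. The sketch below assumes that a plain-string filter matches anywhere in the URL, which is how the `my_directory` example above reads; the sample URLs are hypothetical, and `grep -vF` only stands in for the tool's own matching.

```shell
# Hypothetical sample URLs, for illustration only.
urls='http://example.com/index.html
http://example.com/my_directory/a.html
http://example.com/other/my_directory/b.pdf
http://example.com/docs/c.pdf'

# -F treats the filter as a fixed string; -v keeps the lines that do NOT match.
kept=$(printf '%s\n' "$urls" | grep -vF 'my_directory')
printf '%s\n' "$kept"
```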

## Expand downloading to all file types

     -a, --all

Optional. By default, Wayback Machine Downloader limits itself to files that responded with a 200 OK code. If you also need error files (40x and 50x codes) or redirection files (30x codes), you can use the `--all` or `-a` flag and Wayback Machine Downloader will download them in addition to the 200 OK files. It will also keep empty files, which are removed by default.

Example:

    wayback_machine_downloader http://example.com --all

## Only list files without downloading

     -l, --list

Optional. Displays the files that would be downloaded, with their snapshot timestamps and URLs, in JSON format. It won't download anything. It's useful for debugging or for feeding the list to another application.

Example:

    wayback_machine_downloader http://example.com --list

## Download multiple files at a time

    -c, --concurrency NUMBER  

Optional. Specify the number of files you want to download at the same time. This can speed up the download of a website significantly. Default is to download one file at a time.

Example:

    wayback_machine_downloader http://example.com --concurrency 20

## Using the Docker image

As an alternative to installing the gem, you can use the Docker image. Pull the wayback-machine-downloader image this way:

    docker pull hartator/wayback-machine-downloader

Then, you should be able to use the Docker image to download websites. For example:

    docker run --rm -it -v $PWD/websites:/websites hartator/wayback-machine-downloader http://example.com

## Contributing

Contributions are welcome! Just submit a pull request via GitHub.

To run the tests:

    bundle install
    bundle exec rake test

## Donation

Wayback Machine Downloader is free and open source.

If you want to donate: [![Gratipay Team](https://img.shields.io/gratipay/team/hartator.svg)](https://gratipay.com/hartator/)

You can also donate to the Internet Archive: https://archive.org/donate/