Add more details to Readme

hartator 2016-07-31 10:08:32 -05:00 committed by GitHub
parent 811a61e852
commit 1dc9279ca4


@ -1,6 +1,6 @@
# Wayback Machine Downloader
Download any website from the Internet Archive Wayback Machine.
Download an entire website from the Internet Archive Wayback Machine.
## Installation
@ -21,22 +21,46 @@ Run wayback_machine_downloader with the base url of the website you want to retr
It will download the latest version of every file present on Wayback Machine to `./websites/example.com/`. It will also re-create the directory structure and auto-create `index.html` pages to work seamlessly with Apache and Nginx. All files downloaded are the original ones, not the Wayback Machine rewritten versions. This way, the URL and link structure stays the same as before.
## Advanced Usage
Usage: wayback_machine_downloader http://example.com
Download an entire website from the Wayback Machine.
Optional options:
-f, --from TIMESTAMP Only files on or after timestamp supplied (ie. 20060716231334)
-t, --to TIMESTAMP Only files on or before timestamp supplied (ie. 20100916231334)
-o, --only ONLY_FILTER Restrict downloading to urls that match this filter (use // notation for the filter to be treated as a regex)
-x, --exclude EXCLUDE_FILTER Skip downloading of urls that match this filter (use // notation for the filter to be treated as a regex)
-a, --all Expand downloading to error files (40x and 50x) and redirections (30x)
-v, --version Display version
## From Timestamp
Optional. You may want to supply a from timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the URLs of the regular Wayback Machine website (e.g., http://web.archive.org/web/*20060716231334*/http://example.com). You can also use a year (2006), a year and month (200607), etc. It can be used in combination with *To Timestamp*.
Wayback Machine Downloader will then fetch only file versions on or after the timestamp specified:
-f, --from TIMESTAMP Only files on or after timestamp supplied (ie. 20060716231334)
Optional. You may want to supply a from timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the URLs of the regular Wayback Machine website (e.g., http://web.archive.org/web/20060716231334/http://example.com). You can also use a year (2006), a year and month (200607), etc. It can be used in combination with To Timestamp.
Wayback Machine Downloader will then fetch only file versions on or after the timestamp specified.
Example:
wayback_machine_downloader http://example.com --from 20060716231334
## To Timestamp
Optional. You may want to supply a to timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the URLs of the regular Wayback Machine website (e.g., http://web.archive.org/web/*20100916231334*/http://example.com). You can also use a year (2010), a year and month (201009), etc. It can be used in combination with *From Timestamp*.
Wayback Machine Downloader will then fetch only file versions on or before the timestamp specified:
-t, --to TIMESTAMP Only files on or before timestamp supplied (ie. 20100916231334)
Optional. You may want to supply a to timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the URLs of the regular Wayback Machine website (e.g., http://web.archive.org/web/20100916231334/http://example.com). You can also use a year (2010), a year and month (201009), etc. It can be used in combination with From Timestamp.
Wayback Machine Downloader will then fetch only file versions on or before the timestamp specified.
Example:
wayback_machine_downloader http://example.com --to 20100916231334
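To make the "on or after" and "on or before" semantics concrete, here is a minimal Ruby sketch of an inclusive timestamp window. It is not the downloader's actual code: the helper names `lower_bound`, `upper_bound`, and `in_window?` are hypothetical, and the padding of partial timestamps (such as `2010`) to a full 14-digit value is an assumption made for illustration.

```ruby
# Hypothetical helpers, not part of wayback_machine_downloader.
# Assumption: a partial timestamp such as "2006" or "200607" is a prefix
# of the full 14-digit form YYYYMMDDhhmmss.

# Pad a partial "from" value down to the earliest instant it could mean.
def lower_bound(timestamp)
  timestamp.to_s.ljust(14, "0")
end

# Pad a partial "to" value up to the latest instant it could mean.
def upper_bound(timestamp)
  timestamp.to_s.ljust(14, "9")
end

# True when a 14-digit snapshot timestamp lies inside the inclusive window.
# Fixed-width digit strings compare the same way as the numbers they spell.
def in_window?(snapshot, from: nil, to: nil)
  return false if from && snapshot < lower_bound(from)
  return false if to && snapshot > upper_bound(to)
  true
end

# Made-up snapshot timestamps, for illustration only.
snapshots = %w[
  20051231235959
  20060716231334
  20080101000000
  20100916231334
  20110301120000
]

selected = snapshots.select do |s|
  in_window?(s, from: "20060716231334", to: "2010")
end
puts selected
```

With these sample values, the 2005 and 2011 snapshots fall outside the window and the other three are kept, which mirrors what combining `--from` and `--to` is described as doing.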
## Only URL Filter
-o, --only ONLY_FILTER Restrict downloading to urls that match this filter (use // notation for the filter to be treated as a regex)
Optional. You may want to retrieve files which are of a certain type (e.g., .pdf, .jpg, .wrd...) or are in a specific directory. To do so, you can supply the `--only` flag with a string or a regex (using the '/regex/' notation) to limit which files Wayback Machine Downloader will download.
For example, if you only want to download files inside a specific `my_directory`:
@ -49,6 +73,8 @@ Or if you want to download every images without anything else:
## Exclude URL Filter
-x, --exclude EXCLUDE_FILTER Skip downloading of urls that match this filter (use // notation for the filter to be treated as a regex)
Optional. You may want to retrieve files which aren't of a certain type (e.g., .pdf, .jpg, .wrd...) or aren't in a specific directory. To do so, you can supply the `--exclude` flag with a string or a regex (using the '/regex/' notation) to limit which files Wayback Machine Downloader will download.
For example, if you want to avoid downloading files inside `my_directory`:
@ -58,6 +84,16 @@ For example, if you want to avoid downloading files inside `my_directory`:
Or if you want to download everything except images:
wayback_machine_downloader http://example.com --exclude "/\.(gif|jpg|jpeg)$/i"
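The difference between a plain string filter and a `/regex/` filter can be sketched in Ruby as follows. This is an illustration under assumptions, not the tool's implementation: `to_matcher` and `matches?` are hypothetical names, and treating a non-regex filter as a simple substring match is a guess at the behavior.

```ruby
# Hypothetical helpers, not part of wayback_machine_downloader.
# Assumption: a filter written as /pattern/ (optionally followed by the
# flags i, m, or x) is a regex; any other filter is a plain substring.
def to_matcher(filter)
  if (m = filter.match(%r{\A/(.*)/([imx]*)\z}))
    # Only the "i" flag is honored in this sketch.
    options = m[2].include?("i") ? Regexp::IGNORECASE : 0
    Regexp.new(m[1], options)
  else
    filter
  end
end

# True when the URL matches the filter, regex or substring.
def matches?(url, filter)
  matcher = to_matcher(filter)
  matcher.is_a?(Regexp) ? !(url =~ matcher).nil? : url.include?(matcher)
end

# Made-up URLs, for illustration only.
urls = %w[
  http://example.com/index.html
  http://example.com/logo.GIF
  http://example.com/my_directory/a.pdf
]

# Behaves like an exclude filter: drop URLs that look like images.
kept = urls.reject { |u| matches?(u, '/\.(gif|jpg|jpeg)$/i') }

# Behaves like an only filter: keep URLs inside my_directory.
only = urls.select { |u| matches?(u, "my_directory/") }

puts kept
puts only
```

The trailing `i` makes the image pattern case-insensitive, so `logo.GIF` is excluded along with lowercase extensions.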
## Expand downloading to all file types
-a, --all Expand downloading to error files (40x and 50x) and redirections (30x)
Optional. By default, Wayback Machine Downloader limits itself to files that responded with a 200 OK code. If you also need error files (40x and 50x codes) or redirection files (30x codes), you can use the `--all` or `-a` flag and Wayback Machine Downloader will download them in addition to the 200 OK files.
Example:
wayback_machine_downloader http://example.com --all
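As a rough illustration of the selection rule described above, the Ruby sketch below keeps only 200 responses by default and widens the set when an `all` switch is on. The `download?` helper is hypothetical, and treating every 3xx, 4xx, and 5xx code as covered by "30x, 40x and 50x" is an assumption.

```ruby
# Hypothetical helper, not part of wayback_machine_downloader.
# Assumption: "30x", "40x" and "50x" cover every code in the 3xx, 4xx and
# 5xx ranges.
def download?(status, all: false)
  return true if status == 200
  all && [3, 4, 5].include?(status / 100)
end

# Made-up status codes, for illustration only.
statuses = [200, 301, 404, 500]

default_set = statuses.select { |s| download?(s) }
expanded_set = statuses.select { |s| download?(s, all: true) }

puts default_set.inspect
puts expanded_set.inspect
```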
## Using the Docker image