# Wayback Machine Downloader
[![Gem Version](https://badge.fury.io/rb/wayback_machine_downloader.svg)](https://rubygems.org/gems/wayback_machine_downloader/)

This is a fork of the [Wayback Machine Downloader](https://github.com/hartator/wayback-machine-downloader). With it, you can download a website from the Internet Archive Wayback Machine.

It also includes partial content from other forks, namely those by [ShiftaDeband](https://github.com/ShiftaDeband/wayback-machine-downloader) and [matthid](https://github.com/matthid/wayback-machine-downloader), as well as a few additional (future) features. Attributions are in the code and go to the original authors.
## Installation
Note: You need Ruby (>= 2.3) installed on your system to run this program.

1. Clone/download this repository
2. In your terminal (e.g. Command Prompt, PowerShell, Windows Terminal), navigate to the directory where you cloned/downloaded this repository
3. Navigate to `wayback_machine_downloader\bin` (psst, Windows users: open this directory in File Explorer, then press Shift + Right Click → "Open Terminal here")
4. Run:
```bash
ruby wayback_machine_downloader [options] URL
```
### Using Docker
We have a Docker image! Sorta. It's not on Docker Hub yet, but you can build it yourself. Here's how:
```bash
docker build -t wayback_machine_downloader .
docker run -it --rm wayback_machine_downloader [options] URL
```
## Constants
There are a few constants that can be edited in the `wayback_machine_downloader.rb` file for your convenience. The default values may be conservative, so you can adjust them to your needs. They are:
- `DEFAULT_TIMEOUT` - The default timeout (in seconds) for HTTP requests. Default is 30 seconds.
- `MAX_RETRIES` - The maximum number of retries for HTTP requests. Default is 3.
- `RETRY_DELAY` - The delay (in seconds) between retries for HTTP requests. Default is 2 seconds.
- `RATE_LIMIT` - The delay (in seconds) between successive HTTP requests, used to rate-limit them. Default is 0.25 seconds.
- `CONNECTION_POOL_SIZE` - The size of the HTTP connection pool. Default is 10 connections.
- `MEMORY_BUFFER_SIZE` - The size of the memory buffer (in bytes) for downloads. Default is 16KB.
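
As a rough illustration of how `MAX_RETRIES` and `RETRY_DELAY` might work together, here is a minimal sketch. Only the constant names and default values come from the list above; the `with_retries` helper is hypothetical and is not taken from `wayback_machine_downloader.rb`.

```ruby
# Defaults as listed above; the helper below is a hypothetical illustration.
MAX_RETRIES = 3
RETRY_DELAY = 2 # seconds

# Runs the block, retrying on StandardError up to max_retries times
# and sleeping `delay` seconds between attempts.
def with_retries(max_retries: MAX_RETRIES, delay: RETRY_DELAY)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    # Give up once the initial attempt plus all retries have failed.
    raise if attempts > max_retries
    sleep delay
    retry
  end
end

# Simulate a request that fails twice before succeeding (delay set to 0
# so the example runs instantly).
result = with_retries(delay: 0) do |attempt|
  raise IOError, "simulated failure" if attempt < 3
  "ok after #{attempt} attempts"
end
puts result
```

With the defaults, a request would be attempted at most four times (one initial attempt plus three retries) before the error is re-raised.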
---
## Instructions
### Basic usage
Run `wayback_machine_downloader` with the base URL of the website you want to retrieve as a parameter (e.g., https://example.com):
```bash
ruby wayback_machine_downloader https://example.com
```
## How it works
It will download the latest version of every file present on the Wayback Machine to `./websites/example.com/`. It will also re-create the directory structure and auto-create `index.html` pages so that the result works seamlessly with Apache and Nginx. All downloaded files are the original ones, not the versions rewritten by the Wayback Machine, so URLs and link structure stay the same as on the original site.
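
To make the `index.html` behaviour more concrete, the sketch below maps an archived URL to a local path. `local_path_for` is a hypothetical helper written for this README, and its rule (a path with no file extension is treated as a directory) is an assumption, not necessarily what the downloader does internally.

```ruby
require "uri"

# Hypothetical helper: map an archived URL to a path under ./websites/.
# Assumption: a path that ends in "/" or has no file extension is treated
# as a directory and gets an index.html inside it.
def local_path_for(url, base_dir: "websites")
  uri = URI.parse(url)
  path = uri.path.to_s
  path = "/" if path.empty?
  path += "/" unless path.end_with?("/") || File.extname(path) != ""
  path += "index.html" if path.end_with?("/")
  File.join(base_dir, uri.host, path)
end

puts local_path_for("https://example.com")
puts local_path_for("https://example.com/about")
puts local_path_for("https://example.com/img/logo.png")
```

Under these assumptions, the three calls print `websites/example.com/index.html`, `websites/example.com/about/index.html`, and `websites/example.com/img/logo.png`.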
## Advanced usage
```
Usage: ruby wayback_machine_downloader https://example.com

Download an entire website from the Wayback Machine.

Optional options:
    -d, --directory PATH             Directory to save the downloaded files into
                                     Default is ./websites/ plus the domain name
    -s, --all-timestamps             Download all snapshots/timestamps for a given website
    -f, --from TIMESTAMP             Only files on or after timestamp supplied (ie. 20060716231334)
    -t, --to TIMESTAMP               Only files on or before timestamp supplied (ie. 20100916231334)
    -e, --exact-url                  Download only the url provided and not the full site
    -o, --only ONLY_FILTER           Restrict downloading to urls that match this filter
                                     (use // notation for the filter to be treated as a regex)
    -x, --exclude EXCLUDE_FILTER     Skip downloading of urls that match this filter
                                     (use // notation for the filter to be treated as a regex)
    -a, --all                        Expand downloading to error files (40x and 50x) and redirections (30x)
    -c, --concurrency NUMBER         Number of multiple files to download at a time
                                     Default is one file at a time (ie. 20)
    -p, --maximum-snapshot NUMBER    Maximum snapshot pages to consider (Default is 100)
                                     Count an average of 150,000 snapshots per page
    -l, --list                       Only list file urls in a JSON format with the archived timestamps, won't download anything
```
### Specify directory to save files to
`-d, --directory PATH`

Optional. By default, Wayback Machine Downloader saves files to `./websites/` followed by the domain name of the website. Use this option to save them in a different directory.

Example:

```bash
ruby wayback_machine_downloader https://example.com --directory downloaded-backup/
```
### All timestamps
`-s, --all-timestamps`

Optional. This option downloads all timestamps/snapshots for a given website. It uses the timestamp of each snapshot as the directory name.

Example:

```bash
ruby wayback_machine_downloader https://example.com --all-timestamps
# Will download:
#   websites/example.com/20060715085250/index.html
#   websites/example.com/20051120005053/index.html
#   websites/example.com/20060111095815/img/logo.png
#   ...
```
### From timestamp
`-f, --from TIMESTAMP`

Optional. Supply a from timestamp to restrict your backup to versions of the website on or after a given point in time. Timestamps can be found in the URLs of the regular Wayback Machine website (e.g., https://web.archive.org/web/20060716231334/http://example.com). You can also use a year (2006), a year and month (200607), and so on. It can be combined with the To timestamp option. Wayback Machine Downloader will then fetch only file versions on or after the specified timestamp.

Example:

```bash
ruby wayback_machine_downloader https://example.com --from 20060716231334
```
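
Wayback Machine timestamps are fixed-width digit strings (`YYYYMMDDhhmmss`), so comparing them as strings orders them chronologically. The sketch below shows one way a `--from`/`--to` range could be applied to a list of snapshot timestamps. `in_range?` is a hypothetical helper, and the padding of partial timestamps (zeros for the lower bound, nines for the upper bound) is an assumption made for illustration, not a description of the downloader's internals.

```ruby
# Hypothetical sketch: select snapshot timestamps inside a from/to range.
# Assumption: partial timestamps are padded to 14 digits, with zeros for the
# lower bound and nines for the upper bound.
def in_range?(timestamp, from: nil, to: nil)
  lower = from ? from.to_s.ljust(14, "0") : "0" * 14
  upper = to ? to.to_s.ljust(14, "9") : "9" * 14
  # Fixed-width digit strings sort chronologically when compared as strings.
  timestamp >= lower && timestamp <= upper
end

snapshots = %w[20051120005053 20060716231334 20100916231334]
p snapshots.select { |ts| in_range?(ts, from: "2006", to: "201009") }
```

Here the 2005 snapshot is dropped, while the 2006 and September 2010 snapshots are kept.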
### To timestamp
`-t, --to TIMESTAMP`

Optional. Supply a to timestamp to restrict your backup to versions of the website on or before a given point in time. Timestamps can be found in the URLs of the regular Wayback Machine website (e.g., https://web.archive.org/web/20100916231334/http://example.com). You can also use a year (2010), a year and month (201009), and so on. It can be combined with the From timestamp option. Wayback Machine Downloader will then fetch only file versions on or before the specified timestamp.

Example:

    ruby wayback_machine_downloader https://example.com --to 20100916231334

### Exact url
`-e, --exact-url`

Optional. Use this flag to retrieve only the file that exactly matches the URL provided. Nothing else will be downloaded.

For example, to download only the HTML homepage file of example.com:

```bash
ruby wayback_machine_downloader https://example.com --exact-url
```
### Only URL filter
`-o, --only ONLY_FILTER`

Optional. You may want to retrieve only files of a certain type (e.g., .pdf, .jpg, .wrd...) or files in a specific directory. To do so, supply the `--only` flag with a string or a regex (using the '/regex/' notation) to limit which files Wayback Machine Downloader will download.

For example, to download only files inside a specific `my_directory`:

```bash
ruby wayback_machine_downloader https://example.com --only my_directory
```

Or, to download only images and nothing else:

```bash
ruby wayback_machine_downloader https://example.com --only "/\.(gif|jpg|jpeg)$/i"
```
### Exclude URL filter
`-x, --exclude EXCLUDE_FILTER`

Optional. You may want to skip files of a certain type (e.g., .pdf, .jpg, .wrd...) or files in a specific directory. To do so, supply the `--exclude` flag with a string or a regex (using the '/regex/' notation) to limit which files Wayback Machine Downloader will download.

For example, to avoid downloading files inside `my_directory`:

```bash
ruby wayback_machine_downloader https://example.com --exclude my_directory
```

Or, to download everything except images:

```bash
ruby wayback_machine_downloader https://example.com --exclude "/\.(gif|jpg|jpeg)$/i"
```
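
The sketch below illustrates how `--only` and `--exclude` filters of this kind could be evaluated. All three helpers are hypothetical. The sketch assumes that a filter written as `/pattern/flags` is compiled to a regex, that any other filter is matched as a case-insensitive substring, and that both filters may be applied together; check the source before relying on any of these details.

```ruby
# Hypothetical helpers illustrating the filter rules described above.
# Assumption: a filter written as /pattern/flags is compiled to a Regexp;
# anything else is matched as a case-insensitive substring.
def filter_to_regex(filter)
  match = filter.match(%r{\A/(.*)/([imx]*)\z}m)
  return nil unless match

  options = 0
  options |= Regexp::IGNORECASE if match[2].include?("i")
  options |= Regexp::MULTILINE if match[2].include?("m")
  options |= Regexp::EXTENDED if match[2].include?("x")
  Regexp.new(match[1], options)
end

def matches_filter?(url, filter)
  regex = filter_to_regex(filter)
  regex ? !!(regex =~ url) : url.downcase.include?(filter.downcase)
end

# A URL is kept when it matches --only (if given) and does not match
# --exclude (if given).
def keep_url?(url, only: nil, exclude: nil)
  return false if only && !matches_filter?(url, only)
  return false if exclude && matches_filter?(url, exclude)
  true
end

urls = %w[
  http://example.com/index.html
  http://example.com/img/logo.PNG
  http://example.com/img/photo.jpg
  http://example.com/my_directory/report.pdf
]
p urls.select { |u| keep_url?(u, only: '/\.(gif|jpg|jpeg)$/i') }
p urls.select { |u| keep_url?(u, exclude: "my_directory") }
```

The first call keeps only `photo.jpg` (the pattern does not list `png`), and the second keeps everything except the file under `my_directory`.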
### Expand downloading to all file types
`-a, --all`

Optional. By default, Wayback Machine Downloader limits itself to files that responded with a 200 OK code. If you also need error files (40x and 50x codes) or redirection files (30x codes), use the `--all` or `-a` flag and Wayback Machine Downloader will download them in addition to the 200 OK files. It will also keep empty files, which are removed by default.

Example:

```bash
ruby wayback_machine_downloader https://example.com --all
```
### Only list files without downloading
`-l, --list`

Lists the files that would be downloaded, with their snapshot timestamps and URLs, in JSON format. Nothing is downloaded. This is useful for debugging or for feeding the list to another application.

Example:

```bash
ruby wayback_machine_downloader https://example.com --list
```
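
As a sketch of the "connect to another application" use case, the snippet below parses JSON of this kind and groups the URLs by snapshot year. The sample data is made up for illustration, and the field names `file_url` and `timestamp` are an assumption about the `--list` output; check them against a real run before relying on them. `urls_by_year` is a hypothetical helper.

```ruby
require "json"

# Illustrative sample only: the field names (file_url, timestamp) are an
# assumption about the --list output and should be checked against a real run.
sample_output = <<~JSON
  [
    {"file_url": "http://example.com/", "timestamp": 20060716231334},
    {"file_url": "http://example.com/img/logo.png", "timestamp": 20060111095815}
  ]
JSON

# Hypothetical consumer: group archived URLs by the year of their snapshot.
def urls_by_year(json_text)
  JSON.parse(json_text).each_with_object({}) do |entry, years|
    year = entry["timestamp"].to_s[0, 4]
    (years[year] ||= []) << entry["file_url"]
  end
end

p urls_by_year(sample_output)
```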
### Maximum number of snapshot pages to consider
`-p, --maximum-snapshot NUMBER`

Optional. Specify the maximum number of snapshot pages to consider. Count an average of 150,000 snapshots per page. The default maximum is 100 snapshot pages, which should be sufficient for most websites. Use a bigger number if you want to download a very large website.

Example:

```bash
ruby wayback_machine_downloader https://example.com -p 300
```
2025-01-01 12:36:52 +00:00
### Download multiple files at a time
`-c, --concurrency NUMBER`

Optional. Specify how many files to download at the same time. This can speed up the download of a website significantly. The default is one file at a time.

Example:

```bash
ruby wayback_machine_downloader https://example.com --concurrency 20
```
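
For readers curious about what concurrent downloading can look like in Ruby, here is a minimal worker-pool sketch built on `Thread` and `Queue` from the standard library. `process_concurrently` and its `fetch` block are hypothetical stand-ins, not the downloader's actual implementation.

```ruby
require "thread"

# Hypothetical sketch of downloading with a fixed number of worker threads.
# `fetch` stands in for the real download step and is supplied by the caller.
def process_concurrently(items, concurrency: 1, &fetch)
  queue = Queue.new
  items.each { |item| queue << item }
  results = Queue.new

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        item = begin
          queue.pop(true) # non-blocking; raises ThreadError when empty
        rescue ThreadError
          break # no work left, so this worker stops
        end
        results << [item, fetch.call(item)]
      end
    end
  end
  workers.each(&:join)

  # Completion order is not guaranteed, so callers should not rely on it.
  Array.new(results.size) { results.pop }
end

files = %w[index.html about.html img/logo.png]
done = process_concurrently(files, concurrency: 2) { |file| file.length }
p done.sort
```

Each worker pulls the next item from a shared queue until it is empty, so `concurrency` caps how many items are in flight at once.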
## Contributing
Contributions are welcome! Just submit a pull request via GitHub.

To run the tests:

    bundle install
    bundle exec rake test