Mirror of https://github.com/StrawberryMaster/wayback-machine-downloader.git, synced 2025-12-29 16:16:06 +00:00.

## Compare commits (48 commits)

61e22cfe25, 183ed61104, e6ecf32a43, 375c6314ad, 6e2739f5a8, caba6a665f,
ab4324c0eb, e28d7d578b, a7a25574cf, 23cc3d69b1, 01fa1f8c9f, d2f98d9428,
c7a5381eaf, 9709834e20, 77998372cb, 2c789b7df6, 1ef8c14c48, 780e45343f,
42e6d62284, 543161d7fb, 99a6de981e, d85c880d23, 917f4f8798, 787bc2e535,
4db13a7792, 31d51728af, febffe5de4, 27dd619aa4, 576298dca8, dc71d1d167,
13e88ce04a, c7fc7c7b58, 5aebf83fca, b1080f0219, dde36ea840, acec026ce1,
ec3fd2dcaa, 6518ecf215, f5572d6129, fc4ccf62e2, 84bf76363c, 0c701ee890,
c953d038e2, b726e94947, f86302e7aa, 791068e9bd, 456e08e745, 90069fad41
### .dockerignore (new file, 5 lines)

```diff
@@ -0,0 +1,5 @@
+*.md
+*.yml
+
+.github
+websites
```
### .env.example (new file, 4 lines)

```diff
@@ -0,0 +1,4 @@
+DB_HOST="db"
+DB_USER="root"
+DB_PASSWORD="example1234"
+DB_NAME="wayback"
```
### .gitignore (vendored, 5 additions)

```diff
@@ -18,6 +18,11 @@ Gemfile.lock
 .ruby-version
 .rbenv*
+
+## ENV
+*.env*
+!.env*.example
+
 ## RCOV
 coverage.data
```
### Dockerfile (12 changes)

```diff
@@ -1,7 +1,15 @@
-FROM ruby:2.3-alpine
+FROM ruby:3.4.4-alpine
+USER root
 WORKDIR /build
 
+COPY Gemfile /build/
+COPY *.gemspec /build/
+
+RUN bundle config set jobs "$(nproc)" \
+    && bundle config set without 'development test' \
+    && bundle install
+
 COPY . /build
 
 WORKDIR /
 ENTRYPOINT [ "/build/bin/wayback_machine_downloader" ]
```
### README.md (104 changes)

@@ -1,6 +1,5 @@
# Wayback Machine Downloader



[](https://rubygems.org/gems/wayback_machine_downloader_straw)

This is a fork of the [Wayback Machine Downloader](https://github.com/hartator/wayback-machine-downloader). With this, you can download a website from the Internet Archive Wayback Machine.
@@ -19,6 +18,16 @@ Your files will save to `./websites/example.com/` with their original structure

- Ruby 2.3+ ([download Ruby here](https://www.ruby-lang.org/en/downloads/))
- Bundler gem (`gem install bundler`)

### Quick install
It took a while, but we have a gem for this! Install it with:
```bash
gem install wayback_machine_downloader_straw
```
To run most commands, just like in the original WMD, you can use:
```bash
wayback_machine_downloader https://example.com
```

### Step-by-step setup
1. **Install Ruby**:
```bash

@@ -31,6 +40,11 @@ Your files will save to `./websites/example.com/` with their original structure
bundle install
```

If you encounter an error like `cannot load such file -- concurrent-ruby`, manually install the missing gem:
```bash
gem install concurrent-ruby
```

3. **Run it**:
```bash
cd path/to/wayback-machine-downloader/bin
```
@@ -48,16 +62,67 @@ docker build -t wayback_machine_downloader .
docker run -it --rm wayback_machine_downloader [options] URL
```

Or, without cloning the repo, an example fetching smallrockets.com up to the year 2013:

```bash
docker run -v .:/websites ghcr.io/strawberrymaster/wayback-machine-downloader:master wayback_machine_downloader --to 20130101 smallrockets.com
```

### 🐳 Using Docker Compose

We can also use it with Docker Compose, which makes it easier to extend functionality (such as storing previous downloads in a database):
```yaml
# docker-compose.yml
services:
  wayback_machine_downloader:
    build:
      context: .
    tty: true
    image: wayback_machine_downloader:latest
    container_name: wayback_machine_downloader
    environment:
      - ENVIRONMENT=${ENVIRONMENT:-development}
      - OPTIONS=${OPTIONS:-""}
      - TARGET_URL=${TARGET_URL}
    volumes:
      - .:/build:rw
      - ./websites:/build/websites:rw
    command: --directory /build/websites ${OPTIONS} ${TARGET_URL}
```
#### Usage
Now you can build a Docker image named "wayback_machine_downloader" with the following command:
```bash
docker compose up -d --build
```

After that, set the `TARGET_URL` environment variable:
```bash
export TARGET_URL="https://example.com/"
```

The **OPTIONS** environment variable is optional; it may include additional settings described in the "**Advanced usage**" section below.

Example:
```bash
export OPTIONS="--list -f 20060121"
```

After that you can run the existing container with the following command:
```bash
docker compose run --rm wayback_machine_downloader https://example.com
```

## ⚙️ Configuration
There are a few constants that can be edited in the `wayback_machine_downloader.rb` file for your convenience. The default values may be conservative, so you can adjust them to your needs. They are:

```ruby
DEFAULT_TIMEOUT = 30                   # HTTP timeout (in seconds)
MAX_RETRIES = 3                        # Number of times to retry failed requests
RETRY_DELAY = 2                        # Wait time between retries (seconds)
RATE_LIMIT = 0.25                      # Throttle between requests (seconds)
CONNECTION_POOL_SIZE = 10              # Maximum simultaneous connections
MEMORY_BUFFER_SIZE = 16384             # Download buffer size (bytes)
STATE_CDX_FILENAME = '.cdx.json'       # Stores snapshot listing
STATE_DB_FILENAME = '.downloaded.txt'  # Tracks completed downloads
```
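These retry constants drive a standard retry-with-delay loop. As a minimal sketch of the pattern (illustrative only — `with_retries` is a hypothetical helper, not the gem's actual code):

```ruby
# Hypothetical helper showing the retry pattern the constants above drive:
# retry a block up to max_retries times, sleeping delay seconds between attempts.
def with_retries(max_retries: 3, delay: 2)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    raise if attempts > max_retries
    sleep(delay)
    retry
  end
end

calls = 0
with_retries(delay: 0) do
  calls += 1
  raise "transient error" if calls < 3
end

puts calls # 3 (succeeded on the third attempt)
```

Raising `MAX_RETRIES` or `RETRY_DELAY` trades longer runs for more resilience against transient Wayback Machine errors.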

## 🛠️ Advanced usage
@@ -186,6 +251,29 @@ ruby wayback_machine_downloader https://example.com --list
```
It will just display the files to be downloaded, with their snapshot timestamps and URLs. The output format is JSON. It won't download anything; it's useful for debugging or for connecting to another application.

### Job management
The downloader automatically saves its progress (`.cdx.json` for the snapshot list, `.downloaded.txt` for completed files) in the output directory. If you run the same command again pointing at the same output directory, it will resume where it left off, skipping already downloaded files.

> [!NOTE]
> Automatic resumption can be affected by changing the URL, mode selection (like `--all-timestamps`), filtering selections, or other options. If you want to ensure a clean start, use the `--reset` option.

| Option | Description |
|--------|-------------|
| `--reset` | Delete state files (`.cdx.json`, `.downloaded.txt`) and restart the download from scratch. Does not delete already downloaded website files. |
| `--keep` | Keep state files (`.cdx.json`, `.downloaded.txt`) even after a successful download. By default, these are deleted upon successful completion. |

**Example** - Restart a download job from the beginning:
```bash
ruby wayback_machine_downloader https://example.com --reset
```
This is useful if you suspect the state files are corrupted or want to ensure a completely fresh download process without deleting the files you already have.

**Example 2** - Keep state files after download:
```bash
ruby wayback_machine_downloader https://example.com --keep
```
This can be useful for debugging or if you plan to extend the download later with different parameters (e.g., adding a `--to` timestamp) while leveraging the existing snapshot list.
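The resume behavior described above boils down to filtering the work list against `.downloaded.txt`, which holds one file ID per line. A minimal Ruby sketch (the state-file name matches the gem; the file IDs are made up):

```ruby
require 'set'
require 'tmpdir'

# Minimal sketch of the resume mechanism: IDs listed in the state file are
# skipped; everything else is still to be downloaded.
remaining = nil
Dir.mktmpdir do |dir|
  db_path = File.join(dir, '.downloaded.txt')
  File.write(db_path, "a.html\nb.css\n")

  downloaded = Set.new
  File.foreach(db_path) { |line| downloaded.add(line.strip) }

  all_files = [{ file_id: 'a.html' }, { file_id: 'b.css' }, { file_id: 'c.js' }]
  remaining = all_files.reject { |f| downloaded.include?(f[:file_id]) }
end

puts remaining.map { |f| f[:file_id] }.inspect # ["c.js"]
```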
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
```diff
@@ -59,7 +59,19 @@ option_parser = OptionParser.new do |opts|
   end
 
   opts.on("-r", "--rewritten", "Downloads the rewritten Wayback Machine files instead of the original files") do |t|
-    options[:rewritten] = t
+    options[:rewritten] = true
  end
 
+  opts.on("--local", "Rewrite URLs to make them relative for local browsing") do |t|
+    options[:rewrite] = true
+  end
+
+  opts.on("--reset", "Delete state files (.cdx.json, .downloaded.txt) and restart the download from scratch") do |t|
+    options[:reset] = true
+  end
+
+  opts.on("--keep", "Keep state files (.cdx.json, .downloaded.txt) after a successful download") do |t|
+    options[:keep] = true
+  end
+
   opts.on("-v", "--version", "Display version") do |t|
```
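A self-contained sketch of how the new flags parse (illustrative; it mirrors the option definitions above using stdlib `OptionParser`):

```ruby
require 'optparse'

# Re-creation of the new boolean flags added in this diff.
options = {}
parser = OptionParser.new do |opts|
  opts.on("--local", "Rewrite URLs to make them relative for local browsing") { options[:rewrite] = true }
  opts.on("--reset", "Delete state files and restart the download from scratch") { options[:reset] = true }
  opts.on("--keep", "Keep state files after a successful download") { options[:keep] = true }
end

# parse returns the arguments left over after options are consumed.
rest = parser.parse(["--reset", "--keep", "https://example.com"])

puts options[:reset] # true
puts options[:keep]  # true
puts rest.inspect    # ["https://example.com"]
```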
### docker-compose.yml (new file, 15 lines)

```diff
@@ -0,0 +1,15 @@
+services:
+  wayback_machine_downloader:
+    build:
+      context: .
+    tty: true
+    image: wayback_machine_downloader:latest
+    container_name: wayback_machine_downloader
+    environment:
+      - ENVIRONMENT=${DEVELOPMENT:-production}
+      - OPTIONS=${OPTIONS:-""}
+      - TARGET_URL=${TARGET_URL}
+    volumes:
+      - .:/build:rw
+      - ./websites:/websites:rw
+    command: /build/bin/wayback_machine_downloader ${TARGET_URL} ${OPTIONS}
```
### entrypoint.sh (new file, 9 lines)

```diff
@@ -0,0 +1,9 @@
+#!/bin/bash
+
+if [ "$ENVIRONMENT" == "development" ]; then
+    echo "Running in development mode. Starting rerun..."
+    exec rerun --dir /build --ignore "websites/*" -- /build/bin/wayback_machine_downloader "$@"
+else
+    echo "Not in development mode. Skipping rerun."
+    exec /build/bin/wayback_machine_downloader "$@"
+fi
```
```diff
@@ -9,6 +9,8 @@ require 'json'
 require 'time'
 require 'concurrent-ruby'
 require 'logger'
+require 'zlib'
+require 'stringio'
 require_relative 'wayback_machine_downloader/tidy_bytes'
 require_relative 'wayback_machine_downloader/to_regex'
 require_relative 'wayback_machine_downloader/archive_api'
```
```diff
@@ -111,17 +113,19 @@ class WaybackMachineDownloader
 
   include ArchiveAPI
 
-  VERSION = "2.3.3"
+  VERSION = "2.3.7"
   DEFAULT_TIMEOUT = 30
   MAX_RETRIES = 3
   RETRY_DELAY = 2
   RATE_LIMIT = 0.25 # Delay between requests in seconds
   CONNECTION_POOL_SIZE = 10
   MEMORY_BUFFER_SIZE = 16384 # 16KB chunks
+  STATE_CDX_FILENAME = ".cdx.json"
+  STATE_DB_FILENAME = ".downloaded.txt"
 
   attr_accessor :base_url, :exact_url, :directory, :all_timestamps,
                 :from_timestamp, :to_timestamp, :only_filter, :exclude_filter,
-                :all, :maximum_pages, :threads_count, :logger
+                :all, :maximum_pages, :threads_count, :logger, :reset, :keep, :rewrite
 
   def initialize params
     validate_params(params)
```
```diff
@@ -137,10 +141,16 @@ class WaybackMachineDownloader
     @maximum_pages = params[:maximum_pages] ? params[:maximum_pages].to_i : 100
     @threads_count = [params[:threads_count].to_i, 1].max
     @rewritten = params[:rewritten]
+    @reset = params[:reset]
+    @keep = params[:keep]
     @timeout = params[:timeout] || DEFAULT_TIMEOUT
     @logger = setup_logger
     @failed_downloads = Concurrent::Array.new
     @connection_pool = ConnectionPool.new(CONNECTION_POOL_SIZE)
     @db_mutex = Mutex.new
+    @rewrite = params[:rewrite] || false
+
+    handle_reset
   end
 
   def backup_name
```
```diff
@@ -163,6 +173,23 @@ class WaybackMachineDownloader
     end
   end
 
+  def cdx_path
+    File.join(backup_path, STATE_CDX_FILENAME)
+  end
+
+  def db_path
+    File.join(backup_path, STATE_DB_FILENAME)
+  end
+
+  def handle_reset
+    if @reset
+      puts "Resetting download state..."
+      FileUtils.rm_f(cdx_path)
+      FileUtils.rm_f(db_path)
+      puts "Removed state files: #{cdx_path}, #{db_path}"
+    end
+  end
+
   def match_only_filter file_url
     if @only_filter
       only_filter_regex = @only_filter.to_regex
```
```diff
@@ -190,28 +217,100 @@ class WaybackMachineDownloader
     end
   end
 
   def get_all_snapshots_to_consider
-    snapshot_list_to_consider = []
-
-    @connection_pool.with_connection do |connection|
-      puts "Getting snapshot pages"
-
-      # Fetch the initial set of snapshots
-      snapshot_list_to_consider += get_raw_list_from_api(@base_url, nil, connection)
-      print "."
-
-      # Fetch additional pages if the exact URL flag is not set
-      unless @exact_url
-        @maximum_pages.times do |page_index|
-          snapshot_list = get_raw_list_from_api("#{@base_url}/*", page_index, connection)
-          break if snapshot_list.empty?
-
-          snapshot_list_to_consider += snapshot_list
-          print "."
-        end
-      end
-    end
-
-    puts " found #{snapshot_list_to_consider.length} snapshots to consider."
+    if File.exist?(cdx_path) && !@reset
+      puts "Loading snapshot list from #{cdx_path}"
+      begin
+        snapshot_list_to_consider = JSON.parse(File.read(cdx_path))
+        puts "Loaded #{snapshot_list_to_consider.length} snapshots from cache."
+        puts
+        return Concurrent::Array.new(snapshot_list_to_consider)
+      rescue JSON::ParserError => e
+        puts "Error reading snapshot cache file #{cdx_path}: #{e.message}. Refetching..."
+        FileUtils.rm_f(cdx_path)
+      rescue => e
+        puts "Error loading snapshot cache #{cdx_path}: #{e.message}. Refetching..."
+        FileUtils.rm_f(cdx_path)
+      end
+    end
+
+    snapshot_list_to_consider = Concurrent::Array.new
+    mutex = Mutex.new
+
+    puts "Getting snapshot pages from Wayback Machine API..."
+
+    # Fetch the initial set of snapshots, sequentially
+    @connection_pool.with_connection do |connection|
+      initial_list = get_raw_list_from_api(@base_url, nil, connection)
+      mutex.synchronize do
+        snapshot_list_to_consider.concat(initial_list)
+        print "."
+      end
+    end
+
+    # Fetch additional pages if the exact URL flag is not set
+    unless @exact_url
+      page_index = 0
+      batch_size = [@threads_count, 5].min
+      continue_fetching = true
+
+      while continue_fetching && page_index < @maximum_pages
+        # Determine the range of pages to fetch in this batch
+        end_index = [page_index + batch_size, @maximum_pages].min
+        current_batch = (page_index...end_index).to_a
+
+        # Create futures for concurrent API calls
+        futures = current_batch.map do |page|
+          Concurrent::Future.execute do
+            result = nil
+            @connection_pool.with_connection do |connection|
+              result = get_raw_list_from_api("#{@base_url}/*", page, connection)
+            end
+            [page, result]
+          end
+        end
+
+        results = []
+
+        futures.each do |future|
+          begin
+            results << future.value
+          rescue => e
+            puts "\nError fetching page #{future}: #{e.message}"
+          end
+        end
+
+        # Sort results by page number to maintain order
+        results.sort_by! { |page, _| page }
+
+        # Process results and check for empty pages
+        results.each do |page, result|
+          if result.empty?
+            continue_fetching = false
+            break
+          else
+            mutex.synchronize do
+              snapshot_list_to_consider.concat(result)
+              print "."
+            end
+          end
+        end
+
+        page_index = end_index
+
+        sleep(RATE_LIMIT) if continue_fetching
+      end
+    end
+
+    puts " found #{snapshot_list_to_consider.length} snapshots."
+
+    # Save the fetched list to the cache file
+    begin
+      FileUtils.mkdir_p(File.dirname(cdx_path))
+      File.write(cdx_path, JSON.pretty_generate(snapshot_list_to_consider.to_a)) # Convert Concurrent::Array back to Array for JSON
+      puts "Saved snapshot list to #{cdx_path}"
+    rescue => e
+      puts "Error saving snapshot cache to #{cdx_path}: #{e.message}"
+    end
     puts
 
     snapshot_list_to_consider
```
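The `.cdx.json` cache is a plain JSON array, so the save/load round trip is straightforward. A sketch using the same `JSON.pretty_generate` / `JSON.parse` calls (the snapshot entries here are made-up placeholders; the real gem stores whatever the CDX API returned):

```ruby
require 'json'
require 'tmpdir'
require 'fileutils'

# Round trip of the .cdx.json snapshot cache.
loaded = nil
Dir.mktmpdir do |dir|
  cdx_path = File.join(dir, '.cdx.json')
  snapshots = [["20130101000000", "http://example.com/"], ["20130102000000", "http://example.com/a"]]

  # Save: same pretty-printed JSON the downloader writes.
  FileUtils.mkdir_p(File.dirname(cdx_path))
  File.write(cdx_path, JSON.pretty_generate(snapshots))

  # Load: JSON.parse restores the array for the resume path.
  loaded = JSON.parse(File.read(cdx_path))
end

puts loaded.length # 2
```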
```diff
@@ -301,32 +400,103 @@ class WaybackMachineDownloader
     puts "]"
   end
 
+  def load_downloaded_ids
+    downloaded_ids = Set.new
+    if File.exist?(db_path) && !@reset
+      puts "Loading list of already downloaded files from #{db_path}"
+      begin
+        File.foreach(db_path) { |line| downloaded_ids.add(line.strip) }
+      rescue => e
+        puts "Error reading downloaded files list #{db_path}: #{e.message}. Assuming no files downloaded."
+        downloaded_ids.clear
+      end
+    end
+    downloaded_ids
+  end
+
+  def append_to_db(file_id)
+    @db_mutex.synchronize do
+      begin
+        FileUtils.mkdir_p(File.dirname(db_path))
+        File.open(db_path, 'a') { |f| f.puts(file_id) }
+      rescue => e
+        @logger.error("Failed to append downloaded file ID #{file_id} to #{db_path}: #{e.message}")
+      end
+    end
+  end
+
   def download_files
     start_time = Time.now
     puts "Downloading #{@base_url} to #{backup_path} from Wayback Machine archives."
 
-    if file_list_by_timestamp.empty?
-      puts "No files to download."
+    FileUtils.mkdir_p(backup_path)
+
+    # Load the list of files to potentially download
+    files_to_download = file_list_by_timestamp
+
+    if files_to_download.empty?
+      puts "No files found matching criteria."
+      cleanup
+      return
+    end
 
-    total_files = file_list_by_timestamp.count
-    puts "#{total_files} files to download:"
+    total_files = files_to_download.count
+    puts "#{total_files} files found matching criteria."
+
+    # Load IDs of already downloaded files
+    downloaded_ids = load_downloaded_ids
+    files_to_process = files_to_download.reject do |file_info|
+      downloaded_ids.include?(file_info[:file_id])
+    end
+
+    remaining_count = files_to_process.count
+    skipped_count = total_files - remaining_count
+
+    if skipped_count > 0
+      puts "Found #{skipped_count} previously downloaded files, skipping them."
+    end
+
+    if remaining_count == 0
+      puts "All matching files have already been downloaded."
+      cleanup
+      return
+    end
+
+    puts "#{remaining_count} files to download:"
 
     @processed_file_count = 0
+    @total_to_download = remaining_count
     @download_mutex = Mutex.new
 
     thread_count = [@threads_count, CONNECTION_POOL_SIZE].min
     pool = Concurrent::FixedThreadPool.new(thread_count)
 
-    file_list_by_timestamp.each do |file_remote_info|
+    files_to_process.each do |file_remote_info|
       pool.post do
-        @connection_pool.with_connection do |connection|
-          result = download_file(file_remote_info, connection)
-          @download_mutex.synchronize do
-            @processed_file_count += 1
-            puts result if result
-          end
-        end
+        download_success = false
+        begin
+          @connection_pool.with_connection do |connection|
+            result_message = download_file(file_remote_info, connection)
+            # assume download success if the result message contains ' -> '
+            if result_message && result_message.include?(' -> ')
+              download_success = true
+            end
+            @download_mutex.synchronize do
+              @processed_file_count += 1
+              # adjust progress message to reflect remaining files
+              progress_message = result_message.sub(/\(#{@processed_file_count}\/\d+\)/, "(#{@processed_file_count}/#{@total_to_download})") if result_message
+              puts progress_message if progress_message
+            end
+          end
+          # append to DB only after successful download, outside the connection block
+          if download_success
+            append_to_db(file_remote_info[:file_id])
+          end
+        rescue => e
+          @logger.error("Error processing file #{file_remote_info[:file_url]}: #{e.message}")
+          @download_mutex.synchronize do
+            @processed_file_count += 1
+          end
+        end
         sleep(RATE_LIMIT)
       end
```
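The progress-message adjustment in that hunk is a targeted `String#sub`: it swaps the total that `download_file` computed for a total that counts only the remaining (not yet downloaded) files. In isolation:

```ruby
# Same substitution the pool worker applies to each result message.
processed_file_count = 3
total_to_download = 7

result_message = "http://example.com/a.html -> websites/example.com/a.html (3/120)"
progress_message = result_message.sub(/\(#{processed_file_count}\/\d+\)/,
                                      "(#{processed_file_count}/#{total_to_download})")

puts progress_message # http://example.com/a.html -> websites/example.com/a.html (3/7)
```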
```diff
@@ -336,7 +506,8 @@ class WaybackMachineDownloader
     pool.wait_for_termination
 
     end_time = Time.now
-    puts "\nDownload completed in #{(end_time - start_time).round(2)}s, saved in #{backup_path}"
+    puts "\nDownload finished in #{(end_time - start_time).round(2)}s."
+    puts "Results saved in #{backup_path}"
     cleanup
   end
```
```diff
@@ -363,6 +534,101 @@ class WaybackMachineDownloader
     end
   end
 
+  def rewrite_urls_to_relative(file_path)
+    return unless File.exist?(file_path)
+
+    file_ext = File.extname(file_path).downcase
+
+    begin
+      content = File.binread(file_path)
+
+      if file_ext == '.html' || file_ext == '.htm'
+        encoding = content.match(/<meta\s+charset=["']?([^"'>]+)/i)&.captures&.first || 'UTF-8'
+        content.force_encoding(encoding) rescue content.force_encoding('UTF-8')
+      else
+        content.force_encoding('UTF-8')
+      end
+
+      # URLs in HTML attributes
+      content.gsub!(/(\s(?:href|src|action|data-src|data-url)=["'])https?:\/\/web\.archive\.org\/web\/[0-9]+(?:id_)?\/([^"']+)(["'])/i) do
+        prefix, url, suffix = $1, $2, $3
+
+        if url.start_with?('http')
+          begin
+            uri = URI.parse(url)
+            path = uri.path
+            path = path[1..-1] if path.start_with?('/')
+            "#{prefix}#{path}#{suffix}"
+          rescue
+            "#{prefix}#{url}#{suffix}"
+          end
+        elsif url.start_with?('/')
+          "#{prefix}./#{url[1..-1]}#{suffix}"
+        else
+          "#{prefix}#{url}#{suffix}"
+        end
+      end
+
+      # URLs in CSS
+      content.gsub!(/url\(\s*["']?https?:\/\/web\.archive\.org\/web\/[0-9]+(?:id_)?\/([^"'\)]+)["']?\s*\)/i) do
+        url = $1
+
+        if url.start_with?('http')
+          begin
+            uri = URI.parse(url)
+            path = uri.path
+            path = path[1..-1] if path.start_with?('/')
+            "url(\"#{path}\")"
+          rescue
+            "url(\"#{url}\")"
+          end
+        elsif url.start_with?('/')
+          "url(\"./#{url[1..-1]}\")"
+        else
+          "url(\"#{url}\")"
+        end
+      end
+
+      # URLs in JavaScript
+      content.gsub!(/(["'])https?:\/\/web\.archive\.org\/web\/[0-9]+(?:id_)?\/([^"']+)(["'])/i) do
+        quote_start, url, quote_end = $1, $2, $3
+
+        if url.start_with?('http')
+          begin
+            uri = URI.parse(url)
+            path = uri.path
+            path = path[1..-1] if path.start_with?('/')
+            "#{quote_start}#{path}#{quote_end}"
+          rescue
+            "#{quote_start}#{url}#{quote_end}"
+          end
+        elsif url.start_with?('/')
+          "#{quote_start}./#{url[1..-1]}#{quote_end}"
+        else
+          "#{quote_start}#{url}#{quote_end}"
+        end
+      end
+
+      # for URLs in HTML attributes that start with a single slash
+      content.gsub!(/(\s(?:href|src|action|data-src|data-url)=["'])\/([^"'\/][^"']*)(["'])/i) do
+        prefix, path, suffix = $1, $2, $3
+        "#{prefix}./#{path}#{suffix}"
+      end
+
+      # for URLs in CSS that start with a single slash
+      content.gsub!(/url\(\s*["']?\/([^"'\)\/][^"'\)]*?)["']?\s*\)/i) do
+        path = $1
+        "url(\"./#{path}\")"
+      end
+
+      # save the modified content back to the file
+      File.binwrite(file_path, content)
+      puts "Rewrote URLs in #{file_path} to be relative."
+    rescue Errno::ENOENT => e
+      @logger.warn("Error reading file #{file_path}: #{e.message}")
+    end
+  end
+
   def download_file (file_remote_info, http)
     current_encoding = "".encoding
     file_url = file_remote_info[:file_url].encode(current_encoding)
```
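The HTML-attribute rewrite above can be exercised on a one-line sample. This sketch applies the same regex and replacement logic to an archived link (the sample URL is made up):

```ruby
require 'uri'

# The HTML-attribute rewrite from rewrite_urls_to_relative, in isolation.
html = %(<a href="https://web.archive.org/web/20130101000000id_/http://example.com/about.html">About</a>)

rewritten = html.gsub(/(\s(?:href|src|action|data-src|data-url)=["'])https?:\/\/web\.archive\.org\/web\/[0-9]+(?:id_)?\/([^"']+)(["'])/i) do
  prefix, url, suffix = $1, $2, $3
  if url.start_with?('http')
    begin
      uri = URI.parse(url)
      path = uri.path
      path = path[1..-1] if path.start_with?('/')
      "#{prefix}#{path}#{suffix}"
    rescue
      "#{prefix}#{url}#{suffix}"
    end
  else
    "#{prefix}#{url}#{suffix}"
  end
end

puts rewritten # <a href="about.html">About</a>
```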
```diff
@@ -384,21 +650,37 @@ class WaybackMachineDownloader
       dir_path = dir_path.gsub(/[:*?&=<>\\|]/) {|s| '%' + s.ord.to_s(16) }
       file_path = file_path.gsub(/[:*?&=<>\\|]/) {|s| '%' + s.ord.to_s(16) }
     end
-    unless File.exist? file_path
-      begin
-        structure_dir_path dir_path
-        download_with_retry(file_path, file_url, file_timestamp, http)
-        "#{file_url} -> #{file_path} (#{@processed_file_count + 1}/#{file_list_by_timestamp.size})"
-      rescue StandardError => e
-        msg = "#{file_url} # #{e}"
-        if not @all and File.exist?(file_path) and File.size(file_path) == 0
-          File.delete(file_path)
-          msg += "\n#{file_path} was empty and was removed."
-        end
-        msg
-      end
-    else
-      "#{file_url} # #{file_path} already exists. (#{@processed_file_count + 1}/#{file_list_by_timestamp.size})"
+
+    # check existence *before* download attempt
+    # this handles cases where a file was created manually or by a previous partial run without a .db entry
+    if File.exist? file_path
+      return "#{file_url} # #{file_path} already exists. (#{@processed_file_count + 1}/#{@total_to_download})"
+    end
+
+    begin
+      structure_dir_path dir_path
+      status = download_with_retry(file_path, file_url, file_timestamp, http)
+
+      case status
+      when :saved
+        if @rewrite && File.extname(file_path) =~ /\.(html?|css|js)$/i
+          rewrite_urls_to_relative(file_path)
+        end
+        "#{file_url} -> #{file_path} (#{@processed_file_count + 1}/#{@total_to_download})"
+      when :skipped_not_found
+        "Skipped (not found): #{file_url} (#{@processed_file_count + 1}/#{@total_to_download})"
+      else
+        # ideally, this case should not be reached if download_with_retry behaves as expected.
+        @logger.warn("Unknown status from download_with_retry for #{file_url}: #{status}")
+        "Unknown status for #{file_url}: #{status} (#{@processed_file_count + 1}/#{@total_to_download})"
+      end
+    rescue StandardError => e
+      msg = "Failed: #{file_url} # #{e} (#{@processed_file_count + 1}/#{@total_to_download})"
+      if File.exist?(file_path) and File.size(file_path) == 0
+        File.delete(file_path)
+        msg += "\n#{file_path} was empty and was removed."
+      end
+      msg
     end
   end
```
```diff
@@ -431,40 +713,71 @@ class WaybackMachineDownloader
     begin
       wayback_url = if @rewritten
         "https://web.archive.org/web/#{file_timestamp}/#{file_url}"
       else
         "https://web.archive.org/web/#{file_timestamp}id_/#{file_url}"
       end
 
       request = Net::HTTP::Get.new(URI(wayback_url))
       request["Connection"] = "keep-alive"
       request["User-Agent"] = "WaybackMachineDownloader/#{VERSION}"
+      request["Accept-Encoding"] = "gzip, deflate"
 
       response = connection.request(request)
 
-      case response
-      when Net::HTTPSuccess
-        File.open(file_path, "wb") do |file|
-          if block_given?
-            yield(response, file)
-          else
-            file.write(response.body)
-          end
-        end
-      when Net::HTTPRedirection
-        raise "Too many redirects for #{file_url}" if redirect_count >= 2
-        location = response['location']
-        @logger.warn("Redirect found for #{file_url} -> #{location}")
-        return download_with_retry(file_path, location, file_timestamp, connection, redirect_count + 1)
-      when Net::HTTPTooManyRequests
-        sleep(RATE_LIMIT * 2)
-        raise "Rate limited, retrying..."
-      when Net::HTTPNotFound
-        @logger.warn("File not found, skipping: #{file_url}")
-        return
-      else
-        raise "HTTP Error: #{response.code} #{response.message}"
-      end
+      save_response_body = lambda do
+        File.open(file_path, "wb") do |file|
+          if block_given?
+            yield(response, file)
+          else
+            body = response.body
+            if response['content-encoding'] == 'gzip' && body && !body.empty?
+              begin
+                gz = Zlib::GzipReader.new(StringIO.new(body))
+                decompressed_body = gz.read
+                gz.close
+                file.write(decompressed_body)
+              rescue Zlib::GzipFile::Error => e
+                @logger.warn("Failure decompressing gzip file #{file_url}: #{e.message}. Writing raw body.")
+                file.write(body)
+              end
+            else
+              file.write(body) if body
+            end
+          end
+        end
+      end
+
+      if @all
+        case response
+        when Net::HTTPSuccess, Net::HTTPRedirection, Net::HTTPClientError, Net::HTTPServerError
+          save_response_body.call
+          if response.is_a?(Net::HTTPRedirection)
+            @logger.info("Saved redirect page for #{file_url} (status #{response.code}).")
+          elsif response.is_a?(Net::HTTPClientError) || response.is_a?(Net::HTTPServerError)
+            @logger.info("Saved error page for #{file_url} (status #{response.code}).")
+          end
+          return :saved
+        else
+          # for any other response type when --all is true, treat as an error to be retried or failed
+          raise "Unhandled HTTP response: #{response.code} #{response.message}"
+        end
+      else # not @all (our default behavior)
+        case response
+        when Net::HTTPSuccess
+          save_response_body.call
+          return :saved
+        when Net::HTTPRedirection
+          raise "Too many redirects for #{file_url}" if redirect_count >= 2
+          location = response['location']
+          @logger.warn("Redirect found for #{file_url} -> #{location}")
+          return download_with_retry(file_path, location, file_timestamp, connection, redirect_count + 1)
+        when Net::HTTPTooManyRequests
+          sleep(RATE_LIMIT * 2)
+          raise "Rate limited, retrying..."
+        when Net::HTTPNotFound
+          @logger.warn("File not found, skipping: #{file_url}")
+          return :skipped_not_found
+        else
+          raise "HTTP Error: #{response.code} #{response.message}"
+        end
+      end
 
     rescue StandardError => e
       if retries < MAX_RETRIES
         retries += 1
```
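The gzip branch above can be exercised in isolation. This sketch compresses a sample body the way a gzip-encoded HTTP response would arrive, then decompresses it with the same `Zlib::GzipReader`/`StringIO` pairing:

```ruby
require 'zlib'
require 'stringio'

# Round trip: gzip a body, then decompress it as the download path does.
original = "<html><body>hello</body></html>"

gz_io = StringIO.new
gz = Zlib::GzipWriter.new(gz_io)
gz.write(original)
gz.close
compressed = gz_io.string

reader = Zlib::GzipReader.new(StringIO.new(compressed))
decompressed = reader.read
reader.close

puts decompressed == original # true
```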
```diff
@@ -480,12 +793,25 @@ class WaybackMachineDownloader
 
   def cleanup
     @connection_pool.shutdown
 
     if @failed_downloads.any?
       @logger.error("Download completed with errors.")
       @logger.error("Failed downloads summary:")
       @failed_downloads.each do |failure|
         @logger.error("  #{failure[:url]} - #{failure[:error]}")
       end
+      unless @reset
+        puts "State files kept due to download errors: #{cdx_path}, #{db_path}"
+        return
+      end
     end
+
+    if !@keep || @reset
+      puts "Cleaning up state files..." unless @keep && !@reset
+      FileUtils.rm_f(cdx_path)
+      FileUtils.rm_f(db_path)
+    elsif @keep
+      puts "Keeping state files as requested: #{cdx_path}, #{db_path}"
+    end
   end
 end
```
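The keep/reset decision in that cleanup reduces to a small truth table. A sketch (the helper name is illustrative, not part of the gem):

```ruby
# Mirror of the cleanup decision: state files are deleted unless --keep was
# given, and --reset always forces deletion.
def delete_state_files?(keep:, reset:)
  !keep || reset
end

puts delete_state_files?(keep: false, reset: false) # true  (default: clean up)
puts delete_state_files?(keep: true,  reset: false) # false (--keep preserves them)
puts delete_state_files?(keep: true,  reset: true)  # true  (--reset wins)
```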
```diff
@@ -4,7 +4,7 @@ require 'uri'
 module ArchiveAPI
 
   def get_raw_list_from_api(url, page_index, http)
-    request_url = URI("https://web.archive.org/cdx/search/xd")
+    request_url = URI("https://web.archive.org/cdx/search/cdx")
     params = [["output", "json"], ["url", url]] + parameters_for_api(page_index)
     request_url.query = URI.encode_www_form(params)
```
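With the corrected endpoint, the request URL is built like this (sketch: `parameters_for_api` belongs to the gem and is replaced here with an inline page parameter for illustration):

```ruby
require 'uri'

# Build the CDX request URL the same way get_raw_list_from_api does.
request_url = URI("https://web.archive.org/cdx/search/cdx")
params = [["output", "json"], ["url", "example.com/*"], ["page", "0"]]
request_url.query = URI.encode_www_form(params)

puts request_url.to_s
```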
```diff
@@ -1,17 +1,15 @@
 require './lib/wayback_machine_downloader'
 
 Gem::Specification.new do |s|
-  s.name = "wayback_machine_downloader"
-  s.version = WaybackMachineDownloader::VERSION
+  s.name = "wayback_machine_downloader_straw"
+  s.version = "2.3.7"
   s.executables << "wayback_machine_downloader"
   s.summary = "Download an entire website from the Wayback Machine."
-  s.description = "Download an entire website from the Wayback Machine. Wayback Machine by Internet Archive (archive.org) is an awesome tool to view any website at any point of time but lacks an export feature. Wayback Machine Downloader brings exactly this."
-  s.authors = ["hartator"]
-  s.email = "hartator@gmail.com"
+  s.description = "Download complete websites from the Internet Archive's Wayback Machine. While the Wayback Machine (archive.org) excellently preserves web history, it lacks a built-in export functionality; this gem does just that, allowing you to download entire archived websites. (This is a significant rewrite of the original wayback_machine_downloader gem by hartator, with enhanced features and performance improvements.)"
+  s.authors = ["strawberrymaster"]
+  s.email = "strawberrymaster@vivaldi.net"
   s.files = ["lib/wayback_machine_downloader.rb", "lib/wayback_machine_downloader/tidy_bytes.rb", "lib/wayback_machine_downloader/to_regex.rb", "lib/wayback_machine_downloader/archive_api.rb"]
-  s.homepage = "https://github.com/hartator/wayback-machine-downloader"
+  s.homepage = "https://github.com/StrawberryMaster/wayback-machine-downloader"
   s.license = "MIT"
-  s.required_ruby_version = ">= 1.9.2"
+  s.required_ruby_version = ">= 3.4.3"
   s.add_runtime_dependency "concurrent-ruby", "~> 1.3", ">= 1.3.4"
   s.add_development_dependency "rake", "~> 12.2"
   s.add_development_dependency "minitest", "~> 5.2"
```