Add multi-threaded downloading

Šarūnas Kūjalis
2016-09-04 23:38:38 +03:00
parent 0aabbc9a3b
commit ee5c87378d
3 changed files with 112 additions and 63 deletions


@@ -24,9 +24,9 @@ It will download the last version of every file present on Wayback Machine to `.
## Advanced Usage ## Advanced Usage
Usage: wayback_machine_downloader http://example.com Usage: wayback_machine_downloader http://example.com
Download an entire website from the Wayback Machine. Download an entire website from the Wayback Machine.
Optional options: Optional options:
-f, --from TIMESTAMP Only files on or after timestamp supplied (ie. 20060716231334) -f, --from TIMESTAMP Only files on or after timestamp supplied (ie. 20060716231334)
-t, --to TIMESTAMP Only files on or before timestamp supplied (ie. 20100916231334) -t, --to TIMESTAMP Only files on or before timestamp supplied (ie. 20100916231334)
@@ -34,12 +34,13 @@ It will download the last version of every file present on Wayback Machine to `.
-x, --exclude EXCLUDE_FILTER Skip downloading of urls that match this filter (use // notation for the filter to be treated as a regex) -x, --exclude EXCLUDE_FILTER Skip downloading of urls that match this filter (use // notation for the filter to be treated as a regex)
-a, --all Expand downloading to error files (40x and 50x) and redirections (30x) -a, --all Expand downloading to error files (40x and 50x) and redirections (30x)
-l, --list Only list file urls in a JSON format with the archived timestamps. Won't download anything. -l, --list Only list file urls in a JSON format with the archived timestamps. Won't download anything.
--threads NUMBER Number of threads to use while downloading website (ie. 20)
-v, --version Display version -v, --version Display version
## From Timestamp ## From Timestamp
-f, --from TIMESTAMP -f, --from TIMESTAMP
Optional. You may want to supply a from timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., http://web.archive.org/web/20060716231334/http://example.com). You can also use years (2006), years + month (200607), etc. It can be used in combination of To Timestamp. Optional. You may want to supply a from timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., http://web.archive.org/web/20060716231334/http://example.com). You can also use years (2006), years + month (200607), etc. It can be used in combination of To Timestamp.
Wayback Machine Downloader will then fetch only file versions on or after the timestamp specified. Wayback Machine Downloader will then fetch only file versions on or after the timestamp specified.
@@ -50,7 +51,7 @@ Example:
## To Timestamp ## To Timestamp
-t, --to TIMESTAMP -t, --to TIMESTAMP
Optional. You may want to supply a to timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., http://web.archive.org/web/20100916231334/http://example.com). You can also use years (2010), years + month (201009), etc. It can be used in combination of From Timestamp. Optional. You may want to supply a to timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., http://web.archive.org/web/20100916231334/http://example.com). You can also use years (2010), years + month (201009), etc. It can be used in combination of From Timestamp.
Wayback Machine Downloader will then fetch only file versions on or before the timestamp specified. Wayback Machine Downloader will then fetch only file versions on or before the timestamp specified.
@@ -67,9 +68,9 @@ Optional. You may want to retrieve files which are of a certain type (e.g., .pdf
For example, if you only want to download files inside a specific `my_directory`: For example, if you only want to download files inside a specific `my_directory`:
wayback_machine_downloader http://example.com --only my_directory wayback_machine_downloader http://example.com --only my_directory
Or if you want to download only images: Or if you want to download only images:
wayback_machine_downloader http://example.com --only "/\.(gif|jpg|jpeg)$/i" wayback_machine_downloader http://example.com --only "/\.(gif|jpg|jpeg)$/i"
## Exclude URL Filter ## Exclude URL Filter
@@ -81,11 +82,11 @@ Optional. You may want to retrieve files which aren't of a certain type (e.g., .
For example, if you want to avoid downloading files inside `my_directory`: For example, if you want to avoid downloading files inside `my_directory`:
wayback_machine_downloader http://example.com --exclude my_directory wayback_machine_downloader http://example.com --exclude my_directory
Or if you want to download everything except images: Or if you want to download everything except images:
wayback_machine_downloader http://example.com --exclude "/\.(gif|jpg|jpeg)$/i" wayback_machine_downloader http://example.com --exclude "/\.(gif|jpg|jpeg)$/i"
## Expand downloading to all file types ## Expand downloading to all file types
-a, --all -a, --all
@@ -106,6 +107,16 @@ Example:
wayback_machine_downloader http://example.com --list wayback_machine_downloader http://example.com --list
## Download using multiple threads
--threads NUMBER
Optional. Default is 1. Number of threads to use while downloading website.
Example:
wayback_machine_downloader http://example.com --threads 20
## Using the Docker image ## Using the Docker image
As an alternative installation way, we have a Docker image! Retrieve the wayback-machine-downloader Docker image this way: As an alternative installation way, we have a Docker image! Retrieve the wayback-machine-downloader Docker image this way:
@@ -124,7 +135,7 @@ To run the tests:
bundle install bundle install
bundle exec rake test bundle exec rake test
## Donation ## Donation
Wayback Machine Downloader is free and open source. Wayback Machine Downloader is free and open source.


@@ -38,6 +38,10 @@ option_parser = OptionParser.new do |opts|
options[:list] = true options[:list] = true
end end
opts.on("--threads NUMBER", Integer, "Number of threads to use while downloading website (ie. 20)") do |t|
options[:threads_count] = t
end
opts.on("-v", "--version", "Display version") do |t| opts.on("-v", "--version", "Display version") do |t|
options[:version] = t options[:version] = t
end end
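
The hunk above registers `--threads` with an `Integer` type, so OptionParser converts and validates the value before the block runs. A minimal sketch of that behaviour, assuming a hypothetical stand-alone helper `parse_threads` that is not part of the gem:

```ruby
require 'optparse'

# Hypothetical helper for illustration only; the gem parses ARGV directly.
def parse_threads(argv)
  options = {}
  parser = OptionParser.new do |opts|
    # Passing Integer makes OptionParser coerce the argument and reject
    # values that are not integers.
    opts.on("--threads NUMBER", Integer, "Number of threads to use") do |t|
      options[:threads_count] = t
    end
  end
  # parse (without !) leaves argv untouched and returns the non-option args.
  rest = parser.parse(argv)
  [options, rest]
end

options, rest = parse_threads(["http://example.com", "--threads", "20"])
puts options[:threads_count]   # 20
puts rest.inspect              # ["http://example.com"]

# When the flag is omitted the key is missing; nil.to_i is 0, and clamping
# with [count, 1].max yields a single worker.
options, _rest = parse_threads(["http://example.com"])
puts [options[:threads_count].to_i, 1].max   # 1

begin
  parse_threads(["--threads", "many"])
rescue OptionParser::InvalidArgument => e
  puts "rejected: #{e.message}"
end
```

The `nil.to_i` and `[count, 1].max` steps mirror how the class in this commit treats a missing `--threads` value.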


@@ -11,7 +11,7 @@ class WaybackMachineDownloader
VERSION = "0.4.9" VERSION = "0.4.9"
attr_accessor :base_url, :from_timestamp, :to_timestamp, :only_filter, :exclude_filter, :all, :list attr_accessor :base_url, :from_timestamp, :to_timestamp, :only_filter, :exclude_filter, :all, :list, :threads_count
def initialize params def initialize params
@base_url = params[:base_url] @base_url = params[:base_url]
@@ -21,6 +21,7 @@ class WaybackMachineDownloader
@exclude_filter = params[:exclude_filter] @exclude_filter = params[:exclude_filter]
@all = params[:all] @all = params[:all]
@list = params[:list] @list = params[:list]
@threads_count = params[:threads_count].to_i
end end
def backup_name def backup_name
@@ -121,72 +122,35 @@ class WaybackMachineDownloader
end end
def download_files def download_files
start_time = Time.now
puts "Downloading #{@base_url} to #{backup_path} from Wayback Machine..." puts "Downloading #{@base_url} to #{backup_path} from Wayback Machine..."
puts puts
file_list_by_timestamp = get_file_list_by_timestamp
if file_list_by_timestamp.count == 0 if file_list_by_timestamp.count == 0
puts "No files to download." puts "No files to download."
puts "Possible reasons:" puts "Possible reasons:"
puts "\t* Site is not in Wayback Machine Archive." puts "\t* Site is not in Wayback Machine Archive."
puts "\t* From timestamp too much in the future." if @from_timestamp and @from_timestamp != 0 puts "\t* From timestamp too much in the future." if @from_timestamp and @from_timestamp != 0
puts "\t* To timestamp too much in the past." if @to_timestamp and @to_timestamp != 0 puts "\t* To timestamp too much in the past." if @to_timestamp and @to_timestamp != 0
puts "\t* Only filter too restrictive (#{only_filter.to_s})" if @only_filter puts "\t* Only filter too restrictive (#{only_filter.to_s})" if @only_filter
puts "\t* Exclude filter too wide (#{exclude_filter.to_s})" if @exclude_filter puts "\t* Exclude filter too wide (#{exclude_filter.to_s})" if @exclude_filter
return return
end end
count = 0
file_list_by_timestamp.each do |file_remote_info| threads = []
count += 1 [@threads_count, 1].max.times do
file_url = file_remote_info[:file_url] threads << Thread.new do
file_id = file_remote_info[:file_id] until file_queue.empty?
file_timestamp = file_remote_info[:timestamp] file_remote_info = file_queue.pop(true) rescue nil
file_path_elements = file_id.split('/') download_file(file_remote_info) if file_remote_info
if file_id == ""
dir_path = backup_path
file_path = backup_path + 'index.html'
elsif file_url[-1] == '/' or not file_path_elements[-1].include? '.'
dir_path = backup_path + file_path_elements[0..-1].join('/')
file_path = backup_path + file_path_elements[0..-1].join('/') + '/index.html'
else
dir_path = backup_path + file_path_elements[0..-2].join('/')
file_path = backup_path + file_path_elements[0..-1].join('/')
end
if Gem.win_platform?
file_path = file_path.gsub(/[:*?&=<>\\|]/) {|s| '%' + s.ord.to_s(16) }
end
unless File.exists? file_path
begin
structure_dir_path dir_path
open(file_path, "wb") do |file|
begin
open("http://web.archive.org/web/#{file_timestamp}id_/#{file_url}", "Accept-Encoding" => "plain") do |uri|
file.write(uri.read)
end
rescue OpenURI::HTTPError => e
puts "#{file_url} # #{e}"
if @all
file.write(e.io.read)
puts "#{file_path} saved anyway."
end
rescue StandardError => e
puts "#{file_url} # #{e}"
end
end
rescue StandardError => e
puts "#{file_url} # #{e}"
ensure
if not @all and File.exists?(file_path) and File.size(file_path) == 0
File.delete(file_path)
puts "#{file_path} was empty and was removed."
end
end end
puts "#{file_url} -> #{file_path} (#{count}/#{file_list_by_timestamp.size})"
else
puts "#{file_url} # #{file_path} already exists. (#{count}/#{file_list_by_timestamp.size})"
end end
end end
threads.each(&:join)
end_time = Time.now
puts puts
puts "Download complete, saved in #{backup_path} (#{file_list_by_timestamp.size} files)" puts "Download complete in #{end_time - start_time}s, saved in #{backup_path} (#{file_list_by_timestamp.size} files)"
end end
def structure_dir_path dir_path def structure_dir_path dir_path
@@ -212,4 +176,74 @@ class WaybackMachineDownloader
end end
end end
private
def download_file file_remote_info
@processed_file_count ||= 0
file_url = file_remote_info[:file_url]
file_id = file_remote_info[:file_id]
file_timestamp = file_remote_info[:timestamp]
file_path_elements = file_id.split('/')
if file_id == ""
dir_path = backup_path
file_path = backup_path + 'index.html'
elsif file_url[-1] == '/' or not file_path_elements[-1].include? '.'
dir_path = backup_path + file_path_elements[0..-1].join('/')
file_path = backup_path + file_path_elements[0..-1].join('/') + '/index.html'
else
dir_path = backup_path + file_path_elements[0..-2].join('/')
file_path = backup_path + file_path_elements[0..-1].join('/')
end
if Gem.win_platform?
file_path = file_path.gsub(/[:*?&=<>\\|]/) {|s| '%' + s.ord.to_s(16) }
end
unless File.exist? file_path
begin
structure_dir_path dir_path
open(file_path, "wb") do |file|
begin
open("http://web.archive.org/web/#{file_timestamp}id_/#{file_url}", "Accept-Encoding" => "plain") do |uri|
file.write(uri.read)
end
rescue OpenURI::HTTPError => e
puts "#{file_url} # #{e}"
if @all
file.write(e.io.read)
puts "#{file_path} saved anyway."
end
rescue StandardError => e
puts "#{file_url} # #{e}"
end
end
rescue StandardError => e
puts "#{file_url} # #{e}"
ensure
if not @all and File.exist?(file_path) and File.size(file_path) == 0
File.delete(file_path)
puts "#{file_path} was empty and was removed."
end
end
semaphore.synchronize do
@processed_file_count += 1
puts "#{file_url} -> #{file_path} (#{@processed_file_count}/#{file_list_by_timestamp.size})"
end
else
semaphore.synchronize do
@processed_file_count += 1
puts "#{file_url} # #{file_path} already exists. (#{@processed_file_count}/#{file_list_by_timestamp.size})"
end
end
end
def file_queue
@file_queue ||= file_list_by_timestamp.each_with_object(Queue.new) { |file_info, q| q << file_info }
end
def file_list_by_timestamp
@file_list_by_timestamp ||= get_file_list_by_timestamp
end
def semaphore
@semaphore ||= Mutex.new
end
end end
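
The download loop in this commit is a queue-backed worker pool: every thread pops items from a shared `Queue` until it is empty, and a `Mutex` guards the shared counter. A minimal stand-alone sketch of that pattern, assuming a hypothetical helper `process_all` (not part of the gem) and a caller-supplied block in place of `download_file`. Unlike the lazily memoized `file_queue` and `semaphore` above, the sketch builds the queue and mutex before any thread starts, which is one way to avoid two threads initializing them at the same time:

```ruby
require 'thread'

# Hypothetical helper for illustration; names are not taken from the gem.
# Runs the block once per item using up to threads_count worker threads.
def process_all(items, threads_count, &block)
  queue = Queue.new
  items.each { |item| queue << item }   # fill the queue before starting workers
  mutex = Mutex.new                     # created up front, shared by all workers
  processed = 0
  results = []

  workers = Array.new([threads_count.to_i, 1].max) do
    Thread.new do
      loop do
        item = begin
          queue.pop(true)               # non-blocking pop, raises ThreadError when empty
        rescue ThreadError
          break                         # queue drained, this worker is done
        end
        value = block.call(item)
        mutex.synchronize do            # serialize updates to shared state
          processed += 1
          results << value
        end
      end
    end
  end
  workers.each(&:join)
  [processed, results]
end

count, squares = process_all((1..10).to_a, 4) { |n| n * n }
puts count                 # 10
puts squares.sort.inspect  # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

Results arrive in completion order, so the usage example sorts them before printing.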