Merge branch 'ksarunas-feature/add_downloading_in_multi_threads'

hartator 2016-09-15 20:03:08 -05:00
commit 013445b9b0
5 changed files with 121 additions and 71 deletions

.gitignore

@@ -5,6 +5,7 @@ doc
 rdoc
 log
 websites
+.DS_Store

 ## BUNDLER
 *.gem

.travis.yml

@@ -6,3 +6,4 @@ rvm:
 - 2.1
 - 2.2
 - jruby
+- rbx-2

README.md

@@ -24,22 +24,23 @@ It will download the last version of every file present on Wayback Machine to `.
 ## Advanced Usage

     Usage: wayback_machine_downloader http://example.com

     Download an entire website from the Wayback Machine.

     Optional options:
     -f, --from TIMESTAMP            Only files on or after timestamp supplied (ie. 20060716231334)
     -t, --to TIMESTAMP              Only files on or before timestamp supplied (ie. 20100916231334)
     -o, --only ONLY_FILTER          Restrict downloading to urls that match this filter (use // notation for the filter to be treated as a regex)
     -x, --exclude EXCLUDE_FILTER    Skip downloading of urls that match this filter (use // notation for the filter to be treated as a regex)
     -a, --all                       Expand downloading to error files (40x and 50x) and redirections (30x)
     -l, --list                      Only list file urls in a JSON format with the archived timestamps. Won't download anything.
+    -c, --concurrency NUMBER        Number of files to download at a time. Default is one file at a time. (ie. 20)
     -v, --version                   Display version

 ## From Timestamp

     -f, --from TIMESTAMP

 Optional. You may want to supply a from timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., http://web.archive.org/web/20060716231334/http://example.com). You can also use years (2006), years + month (200607), etc. It can be used in combination with To Timestamp.

 Wayback Machine Downloader will then fetch only file versions on or after the timestamp specified.
@@ -50,7 +51,7 @@ Example:
 ## To Timestamp

     -t, --to TIMESTAMP

 Optional. You may want to supply a to timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., http://web.archive.org/web/20100916231334/http://example.com). You can also use years (2010), years + month (201009), etc. It can be used in combination with From Timestamp.

 Wayback Machine Downloader will then fetch only file versions on or before the timestamp specified.
@@ -67,9 +68,9 @@ Optional. You may want to retrieve files which are of a certain type (e.g., .pdf
 For example, if you only want to download files inside a specific `my_directory`:

     wayback_machine_downloader http://example.com --only my_directory

 Or if you want to download all images and nothing else:

     wayback_machine_downloader http://example.com --only "/\.(gif|jpg|jpeg)$/i"

 ## Exclude URL Filter
@@ -81,11 +82,11 @@ Optional. You may want to retrieve files which aren't of a certain type (e.g., .
 For example, if you want to avoid downloading files inside `my_directory`:

     wayback_machine_downloader http://example.com --exclude my_directory

 Or if you want to download everything except images:

     wayback_machine_downloader http://example.com --exclude "/\.(gif|jpg|jpeg)$/i"

 ## Expand downloading to all file types

     -a, --all
@@ -106,6 +107,16 @@ Example:
     wayback_machine_downloader http://example.com --list

+## Download multiple files at a time
+
+    -c, --concurrency NUMBER
+
+Optional. Specify the number of files to download at the same time. This can significantly speed up the download of a website. The default is to download one file at a time.
+
+Example:
+
+    wayback_machine_downloader http://example.com --concurrency 20
+
 ## Using the Docker image

 As an alternative installation method, we have a Docker image! Retrieve the wayback-machine-downloader Docker image this way:
@@ -124,7 +135,7 @@ To run the tests:
     bundle install
     bundle exec rake test

 ## Donation

 Wayback Machine Downloader is free and open source.
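
Since the new flag is just another switch, it composes with the existing filters. A hypothetical invocation (not from the README) that fetches only images, twenty files at a time:

    wayback_machine_downloader http://example.com --only "/\.(gif|jpg|jpeg)$/i" --concurrency 20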

bin/wayback_machine_downloader

@@ -38,6 +38,10 @@ option_parser = OptionParser.new do |opts|
     options[:list] = true
   end

+  opts.on("-c", "--concurrency NUMBER", Integer, "Number of files to download at a time. Default is one file at a time. (ie. 20)") do |t|
+    options[:threads_count] = t
+  end
+
   opts.on("-v", "--version", "Display version") do |t|
     options[:version] = t
   end
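
The tail of the executable, where the collected options reach the library, sits outside this hunk. Assuming the options hash is forwarded more or less as-is (which the matching attr_accessor names in the library suggest), the new flag would travel roughly like this hypothetical sketch:

    # Hypothetical wiring for illustration; these lines are not part of this diff.
    option_parser.parse!
    options[:base_url] = ARGV[0]                       # e.g. "http://example.com"
    downloader = WaybackMachineDownloader.new options  # :threads_count rides along
    downloader.download_files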

lib/wayback_machine_downloader.rb

@@ -9,9 +9,9 @@ require_relative 'wayback_machine_downloader/to_regex'
 class WaybackMachineDownloader

-  VERSION = "0.4.9"
+  VERSION = "0.5.0"

-  attr_accessor :base_url, :from_timestamp, :to_timestamp, :only_filter, :exclude_filter, :all, :list
+  attr_accessor :base_url, :from_timestamp, :to_timestamp, :only_filter, :exclude_filter, :all, :list, :threads_count

   def initialize params
     @base_url = params[:base_url]

@@ -21,6 +21,7 @@ class WaybackMachineDownloader
     @exclude_filter = params[:exclude_filter]
     @all = params[:all]
     @list = params[:list]
+    @threads_count = params[:threads_count].to_i
   end

   def backup_name
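
Worth noting about the `.to_i` on the added line: when `--concurrency` is omitted, `params[:threads_count]` is `nil`, and in Ruby `nil.to_i` is `0`:

    {}[:threads_count].to_i # => 0

That zero is what the `@threads_count = 1 unless @threads_count != 0` guard in the next hunk normalizes back to a single thread, which is what keeps the option optional.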
@@ -121,72 +122,37 @@ class WaybackMachineDownloader
   end

   def download_files
+    start_time = Time.now
     puts "Downloading #{@base_url} to #{backup_path} from Wayback Machine..."
     puts
-    file_list_by_timestamp = get_file_list_by_timestamp
     if file_list_by_timestamp.count == 0
       puts "No files to download."
       puts "Possible reasons:"
       puts "\t* Site is not in Wayback Machine Archive."
       puts "\t* From timestamp too much in the future." if @from_timestamp and @from_timestamp != 0
       puts "\t* To timestamp too much in the past." if @to_timestamp and @to_timestamp != 0
       puts "\t* Only filter too restrictive (#{only_filter.to_s})" if @only_filter
       puts "\t* Exclude filter too wide (#{exclude_filter.to_s})" if @exclude_filter
       return
     end
-    count = 0
-    file_list_by_timestamp.each do |file_remote_info|
-      count += 1
-      file_url = file_remote_info[:file_url]
-      file_id = file_remote_info[:file_id]
-      file_timestamp = file_remote_info[:timestamp]
-      file_path_elements = file_id.split('/')
-      if file_id == ""
-        dir_path = backup_path
-        file_path = backup_path + 'index.html'
-      elsif file_url[-1] == '/' or not file_path_elements[-1].include? '.'
-        dir_path = backup_path + file_path_elements[0..-1].join('/')
-        file_path = backup_path + file_path_elements[0..-1].join('/') + '/index.html'
-      else
-        dir_path = backup_path + file_path_elements[0..-2].join('/')
-        file_path = backup_path + file_path_elements[0..-1].join('/')
-      end
-      if Gem.win_platform?
-        file_path = file_path.gsub(/[:*?&=<>\\|]/) {|s| '%' + s.ord.to_s(16) }
-      end
-      unless File.exists? file_path
-        begin
-          structure_dir_path dir_path
-          open(file_path, "wb") do |file|
-            begin
-              open("http://web.archive.org/web/#{file_timestamp}id_/#{file_url}", "Accept-Encoding" => "plain") do |uri|
-                file.write(uri.read)
-              end
-            rescue OpenURI::HTTPError => e
-              puts "#{file_url} # #{e}"
-              if @all
-                file.write(e.io.read)
-                puts "#{file_path} saved anyway."
-              end
-            rescue StandardError => e
-              puts "#{file_url} # #{e}"
-            end
-          end
-        rescue StandardError => e
-          puts "#{file_url} # #{e}"
-        ensure
-          if not @all and File.exists?(file_path) and File.size(file_path) == 0
-            File.delete(file_path)
-            puts "#{file_path} was empty and was removed."
-          end
-        end
-        puts "#{file_url} -> #{file_path} (#{count}/#{file_list_by_timestamp.size})"
-      else
-        puts "#{file_url} # #{file_path} already exists. (#{count}/#{file_list_by_timestamp.size})"
-      end
-    end
+    threads = []
+    @processed_file_count = 0
+    @threads_count = 1 unless @threads_count != 0
+    @threads_count.times do
+      threads << Thread.new do
+        until file_queue.empty?
+          file_remote_info = file_queue.pop(true) rescue nil
+          download_file(file_remote_info) if file_remote_info
+        end
+      end
+    end
+    threads.each(&:join)
+    end_time = Time.now
     puts
-    puts "Download complete, saved in #{backup_path} (#{file_list_by_timestamp.size} files)"
+    puts "Download completed in #{(end_time - start_time).round(2)}s, saved in #{backup_path} (#{file_list_by_timestamp.size} files)"
   end

   def structure_dir_path dir_path
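
The new download_files is a classic fixed-size worker pool draining a shared queue. A minimal, self-contained sketch of the same pattern, runnable outside the gem (everything here, including fake_download, is illustrative rather than the gem's code):

    require 'thread' # Queue and Mutex ship with Ruby; the require is explicit for clarity

    queue = Queue.new
    (1..10).each { |n| queue << "file_#{n}" }

    def fake_download item
      sleep 0.05 # stand-in for the real HTTP fetch
      puts "downloaded #{item}"
    end

    threads = 4.times.map do
      Thread.new do
        until queue.empty?
          # Non-blocking pop: another worker may empty the queue between
          # the empty? check and the pop, in which case pop(true) raises
          # ThreadError and the rescue turns that race into a nil.
          item = queue.pop(true) rescue nil
          fake_download(item) if item
        end
      end
    end
    threads.each(&:join)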
@@ -212,4 +178,71 @@ class WaybackMachineDownloader
     end
   end

+  def download_file file_remote_info
+    file_url = file_remote_info[:file_url]
+    file_id = file_remote_info[:file_id]
+    file_timestamp = file_remote_info[:timestamp]
+    file_path_elements = file_id.split('/')
+    if file_id == ""
+      dir_path = backup_path
+      file_path = backup_path + 'index.html'
+    elsif file_url[-1] == '/' or not file_path_elements[-1].include? '.'
+      dir_path = backup_path + file_path_elements[0..-1].join('/')
+      file_path = backup_path + file_path_elements[0..-1].join('/') + '/index.html'
+    else
+      dir_path = backup_path + file_path_elements[0..-2].join('/')
+      file_path = backup_path + file_path_elements[0..-1].join('/')
+    end
+    if Gem.win_platform?
+      file_path = file_path.gsub(/[:*?&=<>\\|]/) {|s| '%' + s.ord.to_s(16) }
+    end
+    unless File.exists? file_path
+      begin
+        structure_dir_path dir_path
+        open(file_path, "wb") do |file|
+          begin
+            open("http://web.archive.org/web/#{file_timestamp}id_/#{file_url}", "Accept-Encoding" => "plain") do |uri|
+              file.write(uri.read)
+            end
+          rescue OpenURI::HTTPError => e
+            puts "#{file_url} # #{e}"
+            if @all
+              file.write(e.io.read)
+              puts "#{file_path} saved anyway."
+            end
+          rescue StandardError => e
+            puts "#{file_url} # #{e}"
+          end
+        end
+      rescue StandardError => e
+        puts "#{file_url} # #{e}"
+      ensure
+        if not @all and File.exists?(file_path) and File.size(file_path) == 0
+          File.delete(file_path)
+          puts "#{file_path} was empty and was removed."
+        end
+      end
+      semaphore.synchronize do
+        @processed_file_count += 1
+        puts "#{file_url} -> #{file_path} (#{@processed_file_count}/#{file_list_by_timestamp.size})"
+      end
+    else
+      semaphore.synchronize do
+        @processed_file_count += 1
+        puts "#{file_url} # #{file_path} already exists. (#{@processed_file_count}/#{file_list_by_timestamp.size})"
+      end
+    end
+  end
+
+  def file_queue
+    @file_queue ||= file_list_by_timestamp.each_with_object(Queue.new) { |file_info, q| q << file_info }
+  end
+
+  def file_list_by_timestamp
+    @file_list_by_timestamp ||= get_file_list_by_timestamp
+  end
+
+  def semaphore
+    @semaphore ||= Mutex.new
+  end
+
 end
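
The semaphore method exists because @processed_file_count is now shared across workers: wrapping the increment and the progress line in Mutex#synchronize makes them atomic, so the counter stays accurate and output lines do not interleave.

One more piece worth isolating is the path derivation at the top of download_file, which maps an archived file id onto the local backup tree: the site root and directory-like urls get an index.html, everything else keeps its file name. A hypothetical standalone version of the same branch, for illustration only:

    # Simplified extraction of download_file's path logic; not the gem's actual method.
    def local_paths backup_path, file_id, file_url
      elements = file_id.split('/')
      if file_id == ""
        # The site root becomes index.html at the top of the backup.
        [backup_path, backup_path + 'index.html']
      elsif file_url[-1] == '/' or not elements[-1].include? '.'
        # Directory-like urls get their own index.html.
        dir = backup_path + elements.join('/')
        [dir, dir + '/index.html']
      else
        # Regular files keep their name; the directory is everything above it.
        [backup_path + elements[0..-2].join('/'), backup_path + elements.join('/')]
      end
    end

    local_paths 'websites/example.com/', '', 'http://example.com'
    # => ["websites/example.com/", "websites/example.com/index.html"]
    local_paths 'websites/example.com/', 'blog/', 'http://example.com/blog/'
    # => ["websites/example.com/blog", "websites/example.com/blog/index.html"]
    local_paths 'websites/example.com/', 'assets/logo.png', 'http://example.com/assets/logo.png'
    # => ["websites/example.com/assets", "websites/example.com/assets/logo.png"]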