From 4d3965f2edd1b8a20b1fca9bf5e1337358aa4aa8 Mon Sep 17 00:00:00 2001
From: hashir omer
Date: Fri, 16 Sep 2022 09:19:09 +0500
Subject: [PATCH] Added README

---
 README.md    | 19 +++++++++++++++++++
 main.go      |  4 ++--
 transform.go |  1 -
 3 files changed, 21 insertions(+), 3 deletions(-)
 create mode 100644 README.md
 delete mode 100644 transform.go

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..3fa6b46
--- /dev/null
+++ b/README.md
@@ -0,0 +1,19 @@
+# Upwork-Jobs-scraper
+
+The code uses Upwork's internal API to scrape new jobs posted on Upwork. I am not using a headless browser for two reasons.
+
+1. Headless browsers are far more resource intensive than calling an API.
+
+2. I don't have to deal with HTML parsing; the API returns JSON, which can be passed directly to downstream systems.
+
+# Note
+
+I had to write the script in Golang instead of Python because Upwork filters bots by checking the TLS signatures of incoming requests. Unfortunately, I could not
+find a way to do it in pure Python because Python is compiled against OpenSSL, which popular browsers do not use. Chrome uses BoringSSL and Firefox uses NSS.
+These SSL libraries use different extensions and cipher suites, which makes inspecting the TLS-level configuration a more robust way to detect bot traffic.
+
+Golang is a lower-level language than Python, so it allows changing network-level configuration. I am using the `cycletls` package, which makes spoofing TLS/JA3 fingerprints an easy task.
+
+# How can you contribute?
+
+The code as of now is a throwaway script because I wrote it to scrape some data once. If you intend to include it in your workflow or integrate it into a larger codebase, you can contribute back by refactoring the code; I don't know Golang at all and had to Google-fu everything.
diff --git a/main.go b/main.go
index faaa968..d75cecc 100644
--- a/main.go
+++ b/main.go
@@ -74,9 +74,9 @@ func main() {
 		"x-odesk-user-agent": "oDesk LM",
 		"x-requested-with": "XMLHttpRequest",
 	}
-
+	//Upwork limits pagination to 100 pages
 	total_iterations := 100
-	//Query to serach for on Upwork
+	//Query to search for on Upwork (here, jobs matching the shopify keyword)
 	query := "shopify"
 	//Number of results per page
 	per_page := 100
diff --git a/transform.go b/transform.go
deleted file mode 100644
index 5796f91..0000000
--- a/transform.go
+++ /dev/null
@@ -1 +0,0 @@
-package transform
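
Note on the README's TLS/JA3 remark: a JA3 fingerprint is the MD5 hash of five ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats), written as decimal values joined by `-` within a field and `,` between fields. The sketch below is not code from this patch and does not use the `cycletls` API; it only illustrates, with the standard library, why two clients that offer the same cipher suites in a different order end up with different fingerprints. The `ClientHello` type, its methods, and the sample values are hypothetical and chosen for illustration.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// ClientHello is a hypothetical container for the five fields JA3 looks at.
// It is not a type from crypto/tls or cycletls.
type ClientHello struct {
	Version      uint16
	Ciphers      []uint16
	Extensions   []uint16
	Curves       []uint16
	PointFormats []uint16
}

// joinDec renders values as decimal numbers separated by "-".
func joinDec(vals []uint16) string {
	parts := make([]string, len(vals))
	for i, v := range vals {
		parts[i] = strconv.Itoa(int(v))
	}
	return strings.Join(parts, "-")
}

// JA3String builds the comma-separated string that gets hashed.
func (h ClientHello) JA3String() string {
	return strings.Join([]string{
		strconv.Itoa(int(h.Version)),
		joinDec(h.Ciphers),
		joinDec(h.Extensions),
		joinDec(h.Curves),
		joinDec(h.PointFormats),
	}, ",")
}

// JA3Hash returns the hex-encoded MD5 of the JA3 string.
func (h ClientHello) JA3Hash() string {
	sum := md5.Sum([]byte(h.JA3String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Illustrative values only, not captured from a real browser.
	a := ClientHello{
		Version:      771, // 0x0303
		Ciphers:      []uint16{4865, 4866, 4867},
		Extensions:   []uint16{0, 10, 11},
		Curves:       []uint16{29, 23},
		PointFormats: []uint16{0},
	}

	// Same cipher suites as a, offered in a different order.
	b := a
	b.Ciphers = []uint16{4866, 4865, 4867}

	fmt.Println(a.JA3String())              // 771,4865-4866-4867,0-10-11,29-23,0
	fmt.Println(a.JA3Hash() != b.JA3Hash()) // true
}
```

Because the ordering is part of the hashed string, a client has to reproduce a browser's exact ClientHello layout, not just its set of ciphers, to match that browser's fingerprint.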