# Upwork-Jobs-scraper
The code uses Upwork's internal API to scrape new jobs posted on Upwork. I am not using a headless browser for two reasons:
1. Headless browsers are far more resource-intensive than calling an API directly.
2. There is no HTML to parse: the API returns JSON, which can be passed directly to downstream systems.
# How to use
1. Create a `.env` file and add the variables.
2. Run `go run main.go`.
# Note
The code uses Golang instead of Python because Upwork filters bots by checking the TLS signatures of incoming requests. Unfortunately, I could not find a way to do this in pure Python, because Python is compiled against OpenSSL and popular browsers do not use it: Chrome uses BoringSSL and Firefox uses NSS.
These SSL libraries use different extensions and cipher suites, which makes TLS-level fingerprinting a more robust method of detecting bot traffic.
Golang is a lower-level language than Python, so it allows changing network-level configuration. I am using the `cycletls` package, which makes spoofing TLS/JA3 fingerprints an easy task.
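
To make the fingerprinting idea concrete, here is a minimal sketch of how a JA3 fingerprint is derived from a TLS ClientHello: the TLS version, cipher suites, extensions, elliptic curves, and point formats are written as decimal values, joined with `-` inside each field and `,` between fields, and the resulting string is MD5-hashed. The `clientHello` type and the numeric values are made up for illustration and do not represent any real browser; the sketch also skips the GREASE filtering that real JA3 implementations perform. It does not use `cycletls`.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// clientHello is a hypothetical container for the ClientHello
// fields that JA3 takes into account.
type clientHello struct {
	Version      uint16
	Ciphers      []uint16
	Extensions   []uint16
	Curves       []uint16
	PointFormats []uint16
}

// joinDec renders values as decimal numbers separated by '-'.
func joinDec(vals []uint16) string {
	parts := make([]string, len(vals))
	for i, v := range vals {
		parts[i] = strconv.Itoa(int(v))
	}
	return strings.Join(parts, "-")
}

// ja3String builds the comma-separated JA3 string.
func (h clientHello) ja3String() string {
	return strings.Join([]string{
		strconv.Itoa(int(h.Version)),
		joinDec(h.Ciphers),
		joinDec(h.Extensions),
		joinDec(h.Curves),
		joinDec(h.PointFormats),
	}, ",")
}

// ja3Hash returns the MD5 hex digest of the JA3 string.
func (h clientHello) ja3Hash() string {
	sum := md5.Sum([]byte(h.ja3String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Illustrative values only.
	a := clientHello{
		Version:      771,
		Ciphers:      []uint16{4865, 4866, 4867},
		Extensions:   []uint16{0, 23, 65281, 10, 11},
		Curves:       []uint16{29, 23, 24},
		PointFormats: []uint16{0},
	}
	// Same client, but with two cipher suites offered in a different order.
	b := a
	b.Ciphers = []uint16{4866, 4865, 4867}

	fmt.Println(a.ja3String(), a.ja3Hash())
	fmt.Println(b.ja3String(), b.ja3Hash())
}
```

Because the order of cipher suites and extensions is part of the string, two clients that offer the same options in a different order hash to different fingerprints. That is why a client built on OpenSSL is distinguishable from one built on BoringSSL or NSS even when the HTTP headers are identical.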
# How can you contribute?
These are some of the features I think could be useful.
- Better error handling with channels
- Add support for automatic proxy rotation. It can be extremely effective when used in conjunction with goroutines.
- Add an API schema for the Upwork API.
- Add more scrapers. Much of the logic is platform-agnostic and could be reused to build scrapers for other platforms.