Update README.md

Hashir Omer
2025-06-27 20:16:11 +05:00
committed by GitHub
parent c7c2998595
commit 63386cfd75

# Upwork Jobs Scraper
This project uses **Upwork's internal API** to scrape newly posted jobs. It avoids using a headless browser for two key reasons:
1. **Efficiency**: API requests are far less resource-intensive than browser automation.
2. **Simplicity**: The API returns clean JSON, eliminating the need for HTML parsing and simplifying downstream processing.
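Because the API already returns clean JSON, downstream handling can be a few lines of decoding plus JSON Lines output. A minimal Go sketch of that idea (the `Job` fields here are illustrative, not Upwork's actual schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Job is an illustrative subset of fields a job posting might carry.
type Job struct {
	Title  string `json:"title"`
	Budget int    `json:"budget"`
}

// toJSONL marshals each job onto its own line (JSON Lines format).
func toJSONL(jobs []Job) (string, error) {
	out := ""
	for _, j := range jobs {
		b, err := json.Marshal(j)
		if err != nil {
			return "", err
		}
		out += string(b) + "\n"
	}
	return out, nil
}

func main() {
	// Pretend this came straight from the API response body.
	raw := `[{"title":"Build a scraper","budget":500}]`
	var jobs []Job
	if err := json.Unmarshal([]byte(raw), &jobs); err != nil {
		panic(err)
	}
	lines, _ := toJSONL(jobs)
	fmt.Print(lines)
}
```

No HTML parsing step is needed anywhere; the decoded structs can be passed directly to whatever consumes the `.jsonl` output.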
---
## Why Golang?
This scraper is written in **Go** instead of Python due to **Upwork's bot detection techniques**, which rely on analyzing TLS signatures of incoming requests ([explained here](https://scrapfly.io/blog/how-to-avoid-web-scraping-blocking-tls/)).
Go is a lower-level language than Python, giving it direct control over network-level configuration. This project uses the `cycletls` package, which makes spoofing TLS/JA3 fingerprints straightforward.
> At the time of development (3 years ago), I could not find an HTTP client in Python that could accurately mimic browser-like TLS signatures. This has since changed as modern Python libraries now support lower-level TLS emulation through bindings to native clients written in Go/C++.
---
## Usage
1. **Add Credentials**
   Create a `.env` file inside the `upwork/` directory. Add your **authorization** and **cookie** headers.
   > 💡 The easiest way to obtain these is by logging into [Upwork.com](https://www.upwork.com), inspecting **network traffic** in DevTools after passing bot checks, and copying the `Authorization` and `Cookie` headers from any authenticated request.
2. **Keep Credentials Fresh**
   These credentials usually expire daily, so you'll need to refresh them if you scrape regularly.
3. **Run the Scraper**
   In `main.go`, call:
   ```go
   p.Run("<keyword>")
   ```
   * Replace `<keyword>` with the term you want to search for.
   * The scraper fetches the **5000 most recent job postings** matching the keyword and saves them to a `.jsonl` file.
   * If you pass an empty string (`p.Run("")`), it fetches the **most recent jobs** regardless of keyword.

> ⚠️ **Important**: Never use your **personal Upwork account** to extract credentials. Doing so **will result in account suspension**.
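How the scraper reads the `.env` file is internal to the project; as a sketch, a dotenv file of this shape can be parsed with nothing but the standard library. The `AUTHORIZATION`/`COOKIE` key names below are assumptions for illustration, not necessarily the names the code expects:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseEnv reads KEY=VALUE lines, skipping blanks and # comments.
// Values may themselves contain "=" (common in Cookie headers),
// so only the first "=" is treated as the separator.
func parseEnv(content string) map[string]string {
	env := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if k, v, ok := strings.Cut(line, "="); ok {
			env[strings.TrimSpace(k)] = strings.TrimSpace(v)
		}
	}
	return env
}

func main() {
	env := parseEnv("AUTHORIZATION=Bearer abc123\nCOOKIE=session=xyz")
	fmt.Println(env["AUTHORIZATION"])
}
```

The parsed values would then be attached as the `Authorization` and `Cookie` headers on each outgoing request.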
---
## Contributing
This project is no longer actively maintained, but I occasionally check if it still works (as of **June 27, 2025**, it does).
**Potential contributions:**
* Automating the refresh of `Authorization` and `Cookie` headers.
* Adding support for advanced filters or scraping job details.