mirror of
https://github.com/hashiromer/Upwork-Jobs-scraper-.git
synced 2025-12-29 16:16:01 +00:00
Update README.md
58
README.md
@@ -1,24 +1,50 @@
# Upwork-Jobs-scraper
# Upwork Jobs Scraper
The code uses Upwork's internal API to scrape new jobs posted on Upwork. I am not using a headless browser, for two reasons.
This project uses **Upwork's internal API** to scrape newly posted jobs. It avoids using a headless browser for two key reasons:
1. Using headless browsers is way more resource intensive compared to using an API.
1. **Efficiency**: API requests are far less resource-intensive than browser automation.
2. **Simplicity**: The API returns clean JSON, eliminating the need for HTML parsing and simplifying downstream processing.
2. I don't have to deal with HTML parsing; the API returns JSON, which can be passed directly to downstream systems.
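
To illustrate the JSON point, here is a minimal sketch that decodes an API-style payload directly into Go structs with the standard library. The `Job` type, its field names, and the sample payload are hypothetical; Upwork's real response schema is not documented here and will differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Job is a hypothetical record; the real API fields may differ.
type Job struct {
	Title    string  `json:"title"`
	Budget   float64 `json:"budget"`
	PostedOn string  `json:"postedOn"`
}

// decodeJobs parses a JSON array of jobs with no HTML parsing step.
func decodeJobs(payload string) ([]Job, error) {
	var jobs []Job
	if err := json.Unmarshal([]byte(payload), &jobs); err != nil {
		return nil, err
	}
	return jobs, nil
}

func main() {
	// Made-up sample payload, for illustration only.
	payload := `[{"title":"Go developer","budget":500,"postedOn":"2025-06-27"}]`

	jobs, err := decodeJobs(payload)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, j := range jobs {
		fmt.Printf("%s (budget %.0f, posted %s)\n", j.Title, j.Budget, j.PostedOn)
	}
}
```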
---
# Note
## Why Golang?
The code uses Golang instead of Python because Upwork filters bots by checking TLS signatures of incoming requests. Unfortunately, I could not
This scraper is written in **Go** instead of Python due to **Upwork's bot detection techniques**, which rely on analyzing TLS signatures of incoming requests ([explained here](https://scrapfly.io/blog/how-to-avoid-web-scraping-blocking-tls/)).
find a way to do it in pure Python because Python is compiled with OpenSSL and popular browsers do not use it. Chrome uses BoringSSL and Firefox uses NSS.
These SSL libraries use different extensions and cipher suites which makes detection of TLS level configurations a more robust method to detect bot traffic.
Golang is a lower-level language than Python, so it allows changing network-level configurations. I am using the `cycletls` package in Golang, which makes spoofing TLS/JA3 fingerprints an easy task.
> At the time of development (3 years ago), I could not find an HTTP client in Python that could accurately mimic browser-like TLS signatures. This has since changed as modern Python libraries now support lower-level TLS emulation through bindings to native clients written in Go/C++.
# How can you contribute?
---
These are some of the features I think could be useful.
## Usage
- Better error handling with channels
1. **Add Credentials**
- Add support for automatic proxy rotation. It can be extremely effective when used in conjunction with goroutines.
Create a `.env` file inside the `upwork/` directory. Add your **authorization** and **cookie** headers.
- Add an API schema for the Upwork API.
- Add more scrapers, a lot of logic is platform agnostic which could be used to build scrapers for more platforms.
> 💡 The easiest way to obtain these is by logging into [Upwork.com](https://www.upwork.com), inspecting **network traffic** in DevTools after passing bot checks, and copying the `Authorization` and `Cookie` headers from any authenticated request.

2. **Keep Credentials Fresh**

These credentials usually expire daily, so you'll need to refresh them if scraping on a regular basis.

3. **Run the Scraper**

In the `main.go` file, call:

```go
p.Run("<keyword>")
```

* Replace `<keyword>` with the term you want to search.
* It will fetch the **top 5000 most recent job postings** that match the keyword and save them in a `.jsonl` file.
* If you pass an empty string (`p.Run("")`), it will fetch the **most recent jobs** regardless of keyword.
> ⚠️ **Important**: Never use your **personal Upwork account** to extract credentials. Doing so **will result in account suspension**.
---
## Contributing
This project is no longer actively maintained, but I occasionally check if it still works (as of **June 27, 2025**, it does).
**Potential contributions:**
* Automating the refresh of `Authorization` and `Cookie` headers.
* Adding support for advanced filters or scraping job details.