This repository contains several ETL (Extract, Transform, Load) jobs for updating trading data and news for the OSRS Trading App. The pipeline ingests data from multiple sources (RuneScape Wiki APIs and an RSS feed), processes and transforms it into structured Parquet files, and then loads the results into Google Cloud Storage. The jobs are deployed on Cloud Run and scheduled via Cloud Scheduler.
The repository includes multiple job variants tailored for different data update frequencies and sources:
- **1-Hour Job:** Processes hourly trading data by fetching new timestamps, retrieving pricing data from the RuneScape Wiki API (1-hour interval), and saving the results as Parquet files in GCS (see the fetch sketch after this list).
- **5-Minute Job:** Processes data at 5-minute intervals. Like the hourly job, it detects new timestamps, fetches 5-minute pricing data, and saves it as Parquet files.
- **24-Hour Job:** Processes daily data using 24-hour timestamps. This job fetches and processes data over a longer historical period, then consolidates and saves it.
- **RSS Feed Job:** Processes news data by fetching the latest RSS feed from RuneScape News, transforming the XML feed into structured data, deduplicating entries, and updating a Parquet file in GCS.
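For context, the fetch step of a pricing job might look roughly like the sketch below. It targets the 1-hour endpoint of the RuneScape Wiki real-time prices API; the exact endpoint, parameters, and User-Agent string used by the jobs in this repository are assumptions for illustration.

```python
import requests
import pandas as pd

# Illustrative only: the endpoint, params, and User-Agent are assumptions,
# not necessarily what the 1-hour job in this repo actually uses.
API_URL = "https://prices.runescape.wiki/api/v1/osrs/1h"
HEADERS = {"User-Agent": "osrs-trading-app-etl (seer@runetick.com)"}

def fetch_1h_prices(timestamp=None):
    """Fetch one 1-hour price snapshot, optionally for a specific timestamp."""
    params = {"timestamp": timestamp} if timestamp is not None else {}
    resp = requests.get(API_URL, headers=HEADERS, params=params, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    # The response maps item IDs to price/volume fields; flatten them to rows.
    df = pd.DataFrame.from_dict(payload["data"], orient="index")
    df.index.name = "item_id"
    return df.reset_index()
```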
Each job variant follows a similar pattern:
- **Extract:** Determine the relevant time intervals (timestamps) for which data is needed.
- **Transform:** Fetch data from the API or RSS feed, clean and standardize it (e.g., filling missing values, zeroing nulls), and combine multiple data sources if needed.
- **Load:** Save the processed data as Parquet files to a designated GCS bucket (e.g., `osrs-trading-app.appspot.com`); see the sketch below.
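As a sketch of the Transform and Load steps, the snippet below zeroes out nulls and writes a snapshot straight to a `gs://` path. It assumes the `gcsfs` package is installed so pandas/pyarrow can write to GCS directly, and the object layout inside the bucket is a guess.

```python
import pandas as pd

def save_snapshot(df: pd.DataFrame, timestamp: int) -> str:
    """Clean a snapshot and write it to GCS as Parquet (illustrative layout)."""
    df = df.fillna(0)  # zero out nulls, as described above
    # Assumed object naming; the real jobs may organise files differently.
    path = f"gs://osrs-trading-app.appspot.com/1h/{timestamp}.parquet"
    df.to_parquet(path, index=False)  # pyarrow engine, written via gcsfs
    return path
```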
- **Python & Flask:** All ETL jobs are written in Python and use Flask to expose a `/run-job` endpoint for triggering the pipeline (a minimal sketch follows this list).
- **Pandas & PyArrow:** Used for data manipulation and conversion to Parquet format.
- **Google Cloud Storage:** Stores the processed Parquet files.
- **Google Cloud Run & Cloud Scheduler:** The jobs are containerized and deployed on Cloud Run, and Cloud Scheduler triggers them at the desired intervals.
- **Cloud Build:** Deployment is automated via a `cloudbuild.yaml` file that builds, pushes, and deploys the container image.
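A minimal version of the Flask entry point might look like the following; the handler body here is a stub where the real jobs run their extract/transform/load steps.

```python
import os
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/run-job", methods=["POST"])
def run_job():
    # Placeholder for the actual extract/transform/load logic.
    rows_written = 0
    return jsonify({"status": "ok", "rows_written": rows_written}), 200

if __name__ == "__main__":
    # Cloud Run supplies the listening port via the PORT environment variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```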
- **Cloud Build:** Use the following command to build and deploy the 1-hour job (similar steps apply to the other job variants):

  ```bash
  gcloud builds submit --config=cloudbuild.yaml
  ```

  The build process will:
  - Build the container image.
  - Push the image to Google Container Registry.
  - Deploy the image to Cloud Run as `update-1h-data` in the `us-central1` region.

  A sketch of a matching `cloudbuild.yaml` appears after this list.
- **Cloud Scheduler:** Set up Cloud Scheduler jobs to trigger the `/run-job` endpoint of the respective Cloud Run services at the desired intervals (e.g., every hour, every 5 minutes, or daily); an example command also follows this list.
- **IAM Policy:** The provided `policy.yaml` file configures public invocation of the Cloud Run service (`roles/run.invoker`) so that Cloud Scheduler can trigger it.
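For reference, a `cloudbuild.yaml` matching the build/push/deploy steps above might look roughly like this; the image name and exact step configuration in the repository's file may differ:

```yaml
steps:
  # Build and push the container image, then deploy it to Cloud Run.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/update-1h-data', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/update-1h-data']
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args:
      - run
      - deploy
      - update-1h-data
      - --image=gcr.io/$PROJECT_ID/update-1h-data
      - --region=us-central1
images:
  - 'gcr.io/$PROJECT_ID/update-1h-data'
```

Similarly, a Cloud Scheduler job for the hourly variant could be created along these lines (the service URL is a placeholder for your deployed Cloud Run URL):

```bash
gcloud scheduler jobs create http update-1h-data-trigger \
  --schedule="0 * * * *" \
  --uri="https://<your-cloud-run-service-url>/run-job" \
  --http-method=POST \
  --location=us-central1
```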
- **Install Dependencies:**

  ```bash
  pip install -r requirements.txt
  ```

- **Run Locally:**
  Start the Flask application locally to test the job endpoints:

  ```bash
  python job.py  # or any of the job variant scripts
  ```

  Then trigger the job by sending a POST request to `http://localhost:8080/run-job`.
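For example, assuming the default port of 8080 used above:

```bash
curl -X POST http://localhost:8080/run-job
```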
Contributions are welcome! If you have suggestions or improvements, please open an issue or submit a pull request.
This project is open source and available under the MIT License. Please note that the ETL pipeline is provided for educational purposes and should not be used to violate any data usage policies.
📧 Email: seer@runetick.com