
Conversation


@MohdShoaib-18169 MohdShoaib-18169 commented Oct 14, 2025

Description

Testing

Additional Notes

Summary by CodeRabbit

  • New Features

    • Added scripts to export, encrypt, upload Vespa data and to receive, decrypt, and feed it back into Vespa.
    • Introduced a Docker Compose setup for a multi-node Vespa cluster with defined services, volumes, and networking.
    • Expanded Vespa configuration: increased redundancy, added a second node, and enabled additional document types.
  • Documentation

    • Added a step-by-step Vespa migration workflow guide covering export, transfer, encryption, and ingestion.
  • Style

    • Minor formatting and indentation cleanups; no functional changes.

@gemini-code-assist

Summary of Changes

Hello @MohdShoaib-18169, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust framework for migrating Vespa data, primarily utilizing AWS S3 for secure and scalable data transfer. It provides a set of executable shell scripts for automating the entire migration lifecycle, from data export and encryption to S3 upload and subsequent import into a new Vespa instance. Complementing these scripts is a detailed guide for the migration process. Furthermore, the PR includes Docker Compose configurations to easily set up and manage a multi-node Vespa cluster, enhancing data redundancy and operational flexibility.

Highlights

  • Vespa Data Migration Scripts: Introduction of shell scripts (vespaDataSend.sh, vespaDataReceive.sh) to facilitate the export, compression, encryption, S3 upload, download, decryption, decompression, and import of Vespa data.
  • Vespa Migration Guide: A new JavaScript file (vespaMigration.js) serves as a detailed, commented guide outlining the steps for Vespa data migration, including options for S3 and GPG encryption.
  • Multi-Node Vespa Cluster Setup: Addition of a docker-compose.cluster.yml file to define and orchestrate a multi-node Vespa cluster, including a config server and two content nodes.
  • Vespa Cluster Configuration: Updates to services.xml and hosts.xml to configure the new multi-node Vespa cluster, including defining host aliases, setting up admin/config servers, and increasing data redundancy.
  • Minor Code Formatting: Small indentation changes were applied to update-workflow-agents.ts for improved readability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


coderabbitai bot commented Oct 14, 2025

Walkthrough

Adds Vespa data migration utilities (send/receive scripts and a migration guide), introduces Vespa multi-node cluster configs (compose, hosts, services), and makes minor formatting-only adjustments to an existing workflow agents update script. No exported/public API changes.

Changes

Cohort / File(s) Summary
Workflow agents script formatting
server/scripts/update-workflow-agents.ts
Reordered/normalized formatting and indentation without changing logic; preserves count-then-early-return-then-update flow.
Vespa data migration scripts
server/scripts/vespaDataSend.sh, server/scripts/vespaDataReceive.sh, server/scripts/vespaMigration.js
New scripts to export, compress, encrypt, transfer, decrypt, and feed Vespa data. Includes strict error handling; optional paths via comments; migration steps documented.
Vespa cluster configuration
server/vespa/docker-compose.cluster.yml, server/vespa/hosts.xml, server/vespa/services.xml
New multi-node Vespa setup: compose file with config server and two nodes; hosts mapping; services with admin block, redundancy=2, added node2, expanded documents list.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Op as Operator
  participant V as Vespa Cluster (Source)
  participant S as Script: vespaDataSend.sh
  participant ENC as OpenSSL
  participant A as AWS S3

  Op->>S: Run send script (STEP 0–6)
  S->>V: Export data (vespa visit -> dump.json)
  S->>S: Compress (pigz/gzip -> dump.json.gz)
  S->>ENC: Encrypt AES-256-CBC (PBKDF2)
  ENC-->>S: dump.json.gz.enc
  S->>A: Upload to s3://.../dumps/
  S-->>Op: Success message
sequenceDiagram
  autonumber
  actor Op as Operator
  participant R as Script: vespaDataReceive.sh
  participant A as AWS S3
  participant DEC as OpenSSL
  participant V as Vespa Cluster (Dest)

  Op->>R: Run receive script
  R->>A: aws s3 cp dump.json.gz.enc
  R->>DEC: Decrypt AES-256-CBC (PBKDF2)
  DEC-->>R: dump.json.gz
  R->>R: Decompress (gunzip -> dump.json)
  R->>V: Feed (vespa-feed-client)
  R-->>Op: Completion notice

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • zereraz
  • kalpadhwaryu
  • devesh-juspay
  • junaid-shirur

Poem

A carrot of bytes in a cloud-bound pack,
I hop through dumps with a thump-thump knack.
Zip, then a whisper of ciphered breeze—
Off to S3 through digital trees.
Two Vespa burrows hum in tune—
Feed restored, I nibble the moon. 🥕✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly identifies the introduction of a feature for migrating Vespa data via S3 and follows a concise, conventional commit style, matching a central aspect of the changeset.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch vespa-data-migration

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces scripts for Vespa data migration using AWS S3 and configuration for a multi-node Vespa cluster. The changes are a good step towards a more robust data management and deployment strategy. My review focuses on improving the security and reliability of the new scripts and correcting some critical configuration issues in the Vespa cluster setup that would likely prevent it from working as intended. Key feedback includes removing hardcoded credentials from scripts, making them more configurable, and aligning the Vespa service and host configurations with the Docker Compose setup.

Comment on lines +8 to +12
# ⚠️ Replace with your real credentials
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_DEFAULT_REGION="ap-south-1"
export AWS_DEFAULT_OUTPUT="json"

critical

Hardcoding AWS credentials in a script, even as examples, is a significant security risk. It encourages a bad practice that can lead to accidentally committing real credentials. The script should rely on the standard AWS CLI credential chain (e.g., IAM roles, environment variables, or the ~/.aws/credentials file). Please remove these export statements.

Suggested change
# ⚠️ Replace with your real credentials
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_DEFAULT_REGION="ap-south-1"
export AWS_DEFAULT_OUTPUT="json"
# ⚠️ Ensure your AWS credentials are configured in your environment
# (e.g., via `aws configure` or environment variables).
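
Building on this suggestion, here is a minimal preflight sketch (the function name check_aws_credentials is illustrative, not part of the PR) that fails fast when the standard credential chain has nothing to resolve. Note it only probes the two most common sources, environment variables and the shared credentials file; IAM roles and SSO profiles would still be resolved by the CLI itself.

```shell
#!/bin/bash
# Illustrative sketch: fail fast when no credentials are discoverable,
# without ever embedding them in the script. Actual resolution is left
# to the standard AWS credential chain.
check_aws_credentials() {
  # Source 1: environment variables
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "env"
    return 0
  fi
  # Source 2: shared credentials file (default ~/.aws/credentials)
  if [ -f "${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}" ]; then
    echo "file"
    return 0
  fi
  echo "none"
  return 1
}
```

In the send/receive scripts this could replace the removed export block with a guard such as `check_aws_credentials >/dev/null || { echo "Run 'aws configure' first"; exit 1; }`.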

Comment on lines +4 to +11
<host name="localhost">
<alias>node1</alias>
</host>

<!-- Secondary node - vespa-testing container -->
<host name="localhost:8181">
<alias>node2</alias>
</host>

critical

The host configuration is incorrect for the multi-node setup defined in docker-compose.cluster.yml.

  1. The name attribute of a <host> tag must be a valid hostname. localhost:8181 is invalid because it includes a port.
  2. In a Docker network, localhost refers to the container itself. To allow containers to communicate, you must use their service names as hostnames (e.g., vespa-node1, vespa-node2).

This file should be updated to map the aliases used in services.xml to the correct service hostnames from your Docker Compose file.

Suggested change
<host name="localhost">
<alias>node1</alias>
</host>
<!-- Secondary node - vespa-testing container -->
<host name="localhost:8181">
<alias>node2</alias>
</host>
<host name="vespa-node1">
<alias>node1</alias>
</host>
<!-- Secondary node - vespa-testing container -->
<host name="vespa-node2">
<alias>node2</alias>
</host>

Comment on lines +2 to +7
<admin version="2.0">
<adminserver hostalias="node1" />
<configservers>
<configserver hostalias="node1" />
</configservers>
</admin>

critical

There's an inconsistency between your docker-compose.cluster.yml and this services.xml. The Docker Compose file defines a dedicated service vespa-config for the config server. However, this services.xml file places the adminserver and configserver on hostalias="node1", which corresponds to the vespa-node1 service.

The vespa-node1 service is started with vespa-start-services and configured to use vespa-config as its config server. This creates a conflict. The admin and config server definitions should point to an alias that maps to the vespa-config service in hosts.xml.
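
One way to make the three files agree is sketched below. The alias config0 is illustrative, and the hostnames assume the compose service names vespa-config, vespa-node1, and vespa-node2 from this PR's Docker Compose file.

```xml
<!-- hosts.xml (sketch): aliases map to the Docker Compose service names -->
<hosts>
  <host name="vespa-config">
    <alias>config0</alias>
  </host>
  <host name="vespa-node1">
    <alias>node1</alias>
  </host>
  <host name="vespa-node2">
    <alias>node2</alias>
  </host>
</hosts>

<!-- services.xml (sketch): admin and config server live on the dedicated
     config host, not on a content node -->
<admin version="2.0">
  <adminserver hostalias="config0" />
  <configservers>
    <configserver hostalias="config0" />
  </configservers>
</admin>
```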

Comment on lines +14 to +20
# AWS performance tuning (optional)
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 64MB
aws configure set default.s3.max_queue_size 100
aws configure set default.s3.multipart_upload_threshold 64MB
aws configure set default.s3.multipart_max_attempts 5

high

Modifying the user's global AWS configuration with aws configure set can have unintended side effects on other operations outside of this script. It's safer to use environment variables for these settings to scope them only to the current script execution (e.g., export AWS_MAX_CONCURRENT_REQUESTS=20).

Note that some of these settings, like max_queue_size and multipart_max_attempts, cannot be set via environment variables and must be in the AWS config file. Also, multipart_upload_threshold appears to be a duplicate of multipart_threshold.
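
A way to get script-local scoping for all of these settings, including the config-file-only ones, is to point AWS_CONFIG_FILE (a documented AWS CLI environment variable) at a throwaway file. The helper name below is illustrative, and only documented s3 keys are included.

```shell
#!/bin/bash
# Illustrative sketch: write the S3 transfer tuning to a temporary config
# file and export AWS_CONFIG_FILE so only this script's aws invocations
# see it. Credentials are still resolved via the normal chain.
make_scoped_aws_config() {
  local cfg
  cfg=$(mktemp)
  # Nested s3 settings use the AWS CLI config-file syntax
  cat > "$cfg" <<'EOF'
[default]
s3 =
    max_concurrent_requests = 20
    multipart_threshold = 64MB
    multipart_chunksize = 64MB
    max_queue_size = 100
EOF
  echo "$cfg"
}

# Usage (sketch; bucket name is a placeholder):
#   export AWS_CONFIG_FILE="$(make_scoped_aws_config)"
#   aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/
```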

Comment on lines +1 to +157
/* #!/bin/bash
// ------------------------------------------------------------
// STEP 1: Start Vespa container for dump creation
// ------------------------------------------------------------
//docker run -d --name vespa-testing \
//-e VESPA_IGNORE_NOT_ENOUGH_MEMORY=true \
//-p 8181:8080 \
//-p 19171:19071 \
//-p 2224:22 \
//vespaengine/vespa:latest
// ------------------------------------------------------------
// STEP 2: Export Vespa data
// ------------------------------------------------------------
"vespa visit --content-cluster my_content --make-feed > dump.json"
// ------------------------------------------------------------
// STEP 3: Compress dump file
// ------------------------------------------------------------
"apt install -y pigz"
//# or yum install pigz
// pigz is parallel gzip (much faster)
// pigz -9 (1.15 hr, ~280 GB) or -7 (1 hr, ~320 GB)
"pigz -9 dump.json"
// creates dump.json.gz
// if pigz is not available, fallback to gzip
//gzip -9 dump.json
//(gzip -9 -c dump.json > dump.json.gz)
// ------------------------------------------------------------
// STEP 4: Encrypt dump file (OpenSSL password-based)
// ------------------------------------------------------------
"openssl enc -aes-256-cbc -pbkdf2 -salt \
-in dump.json.gz \
-out dump.json.gz.enc"
// Strong AES-256 encryption, password will be prompted
// dump.json.gz.enc → safe to transfer/upload
// ------------------------------------------------------------
// OPTIONAL: GPG-based encryption (if using keypair)
// ------------------------------------------------------------
//gpg --full-generate-key
//gpg --list-keys
//gpg --output dump.json.gz.gpg --encrypt --recipient <A1B2C3D4E5F6G7H8> dump.json.gz
// ------------------------------------------------------------
// STEP 5: Set up SSH access for container-to-host transfer
// ------------------------------------------------------------
// ssh-keygen -t rsa -b 4096 -C "mohd.shoaib@juspay.in"
// cat ~/.ssh/id_rsa.pub
// On your **remote machine (container)**:
//docker exec -it --user root vespa-testing /bin/bash
//apt-get or yum update
//apt-get or yum install -y openssh-client (openssh first)
//ssh-keygen -A
//yum install -y openssh-server
//mkdir -p /var/run/sshd
///usr/sbin/sshd
//mkdir -p ~/.ssh
//chmod 700 ~/.ssh
//echo "PASTE_YOUR_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys
//chmod 600 ~/.ssh/authorized_keys
//yum install -y rsync
// ------------------------------------------------------------
// STEP 6: Test SSH + Transfer dump or key
// ------------------------------------------------------------
//ssh -p 2224 root@192.168.1.6 - testing
//yum install -y rsync
//gpg --export-secret-keys --armor BF4AF7E7E3955EF3A436A4ED7C59556BFC58DFAF > my-private-key.asc
//rsync -avzP --inplace --partial --append -e "ssh -p 2224" my-private-key.asc root@192.168.1.6:/home/
"brew install awscli"
"aws configure"
"AWS Access Key ID [None]: ****************"
"AWS Secret Access Key [None]: ********************"
"Default region name [None]: ap-south-1"
"Default output format [None]: json"
//AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
//AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
//Default region name [None]: ap-south-1
//Default output format [None]: json
// For fast file transfer
"aws configure set default.s3.max_concurrent_requests 20"
"aws configure set default.s3.multipart_threshold 64MB"
//Check your identity:
"aws sts get-caller-identity"
// for Making transfers faster (optional)
"aws configure set default.s3.multipart_chunksize 64MB"
"aws configure set default.s3.max_queue_size 100"
"aws configure set default.s3.multipart_upload_threshold 64MB"
"aws configure set default.s3.multipart_max_attempts 5"
"aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/"
//Optional (show progress bar):
"aws s3 cp dump.json.gz.enc s3://xyne-vespa-backups/2025-10-13/ --expected-size $(stat -c%s dump.json.gz.enc"
//rsync -avzP --inplace --partial --append -e "ssh -p 2224" dump.json.gz.gpg root@192.168.1.6:/home/root/
// ------------------------------------------------------------
// STEP 7: On the new machine
// ------------------------------------------------------------
// Option 1 — using AWS S3
"aws s3 cp s3://your-bucket-name/dumps/dump.json.gz.enc "
"openssl enc -d -aes-256-cbc -pbkdf2 -salt \
-in dump.json.gz.enc \
-out dump.json.gz"
// Option 2 — if using GPG
//yum install -y pinentry
//gpgconf --kill gpg-agent
//export GPG_TTY=$(tty)
//echo $GPG_TTY
//gpg --import my-private-key.asc
//gpg --list-secret-keys
//gpg --output dump.json.gz --decrypt dump.json.gz.gpg
// ------------------------------------------------------------
// STEP 8: Decompress and feed into Vespa
// ------------------------------------------------------------
"gunzip dump.json.gz"
"vespa-feed-client dump.json"
// ------------------------------------------------------------
// Done 🎉
// ------------------------------------------------------------
*/

high

This file appears to be a collection of notes and scratchpad commands, and it's entirely commented out. It also contains example AWS credentials (lines 100-101), which is a security risk even when commented, as they can be flagged by security scanners and promote bad practices. It's best to remove this file from the pull request. If these are important notes, they should be moved to a more appropriate place like a README or a wiki, with any sensitive examples removed.

# ------------------------------------------------------------
# Done 🎉
# ------------------------------------------------------------
echo "✅ Vespa data restored successfully!"

medium

The script leaves behind the downloaded and decompressed dump files (dump.json.gz.enc and dump.json). These files can be very large. It's a good practice to clean them up to conserve disk space. You could add a cleanup step at the end of the script, or use a trap to ensure cleanup happens even on failure.

Suggested change
echo "✅ Vespa data restored successfully!"
echo "✅ Vespa data restored successfully!"
echo "🧹 Cleaning up temporary files..."
rm -f dump.json.gz.enc dump.json

# ------------------------------------------------------------
# STEP 5: Upload to AWS S3
# ------------------------------------------------------------
aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/

medium

The S3 bucket name is hardcoded. It's better to use a variable defined at the top of the script for better configurability and maintainability.

- "19071:19071" # Config server port
- "19050:19050" # Admin port
volumes:
- ./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/

medium

Mounting the application package directly into the config server's internal session directory is fragile and not a standard practice. This path is an implementation detail of Vespa and could change in future versions, breaking your setup. A more robust approach is to mount the application package to a neutral location (e.g., /app) and then use vespa deploy to deploy it.
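
A sketch of that alternative follows. The service name vespa-config is assumed from this PR's compose file, and /app is an arbitrary mount point; the deploy commands are examples, not part of the PR.

```yaml
# docker-compose.cluster.yml (sketch): mount the app package at a
# neutral path instead of the config server's internal session dir
services:
  vespa-config:
    image: vespaengine/vespa:latest
    volumes:
      - ./:/app:ro
# Then deploy explicitly once the config server is healthy, e.g.:
#   docker exec vespa-config /opt/vespa/bin/vespa-deploy prepare /app
#   docker exec vespa-config /opt/vespa/bin/vespa-deploy activate
```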


networks:
vespa-cluster:
driver: bridge No newline at end of file

medium

The file is missing a final newline. It's a POSIX standard for text files to end with a newline character, and some tools might not process the last line correctly without it.

    driver: bridge

<host name="localhost:8181">
<alias>node2</alias>
</host>
</hosts> No newline at end of file

medium

The file is missing a final newline. It's a POSIX standard for text files to end with a newline character, and some tools might not process the last line correctly without it.

Suggested change
</hosts>
</hosts>


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (2)
server/vespa/docker-compose.cluster.yml (2)

29-31: Same volume mount issue as config server.

The volume mount on line 30 has the same issue as the config server (line 13). See previous comment for details.


48-50: Same volume mount issue as other services.

The volume mount on line 49 has the same issue. See previous comments for details.

🧹 Nitpick comments (9)
server/scripts/vespaDataReceive.sh (3)

14-16: Automate password handling for decryption.

The interactive password prompt will block automated execution. For production use, consider using environment variables or AWS Secrets Manager for the encryption password.

Apply this diff:

-openssl enc -d -aes-256-cbc -pbkdf2 -salt \
+# Use password from environment variable to enable automation
+ENCRYPTION_PASSWORD="${VESPA_ENCRYPTION_PASSWORD:?Error: VESPA_ENCRYPTION_PASSWORD not set}"
+openssl enc -d -aes-256-cbc -pbkdf2 -salt \
+  -pass "pass:${ENCRYPTION_PASSWORD}" \
   -in dump.json.gz.enc \
   -out dump.json.gz

34-35: Add validation before decompression and feeding.

The script should verify that the compressed file exists before decompressing and that Vespa is accessible before attempting to feed data.

Apply this diff:

+# Validate file exists
+if [ ! -f "dump.json.gz" ]; then
+  echo "Error: dump.json.gz not found"
+  exit 1
+fi
+
 gunzip dump.json.gz
+
+# Validate Vespa is accessible
+if ! command -v vespa-feed-client &> /dev/null; then
+  echo "Error: vespa-feed-client command not found"
+  exit 1
+fi
+
+# Optionally check Vespa health
+# curl -sf http://localhost:8080/state/v1/health || { echo "Error: Vespa not healthy"; exit 1; }
+
 vespa-feed-client dump.json

1-40: Add cleanup of intermediate files.

The script leaves intermediate files (dump.json.gz.enc, dump.json.gz) after execution. Consider adding a cleanup trap to remove these files on exit.

Add cleanup at the beginning of the script:

 #!/bin/bash
 set -e
 set -o pipefail
+
+# Cleanup function
+cleanup() {
+  echo "Cleaning up intermediate files..."
+  rm -f dump.json.gz.enc dump.json.gz
+}
+trap cleanup EXIT
server/scripts/vespaDataSend.sh (3)

37-37: Make content cluster name configurable.

The content cluster name my_content is hardcoded. Make it configurable via environment variable for flexibility across different deployments.

Apply this diff:

+# Allow overriding the content cluster name
+CONTENT_CLUSTER="${VESPA_CONTENT_CLUSTER:-my_content}"
+
+# Verify Vespa is accessible
+if ! curl -sf http://localhost:8080/state/v1/health > /dev/null; then
+  echo "Error: Vespa is not accessible at localhost:8080"
+  exit 1
+fi
+
-vespa visit --content-cluster my_content --make-feed > dump.json
+vespa visit --content-cluster "$CONTENT_CLUSTER" --make-feed > dump.json

42-43: Improve package installation and command verification.

The script attempts to install pigz using both apt and yum sequentially, which will fail on one or the other. Additionally, it doesn't verify that pigz is available before using it.

Apply this diff:

-apt install -y pigz || yum install -y pigz
-pigz -9 dump.json   # creates dump.json.gz
+# Install pigz with proper OS detection
+if command -v apt-get &> /dev/null; then
+  apt-get update && apt-get install -y pigz
+elif command -v yum &> /dev/null; then
+  yum install -y pigz
+else
+  echo "Error: Neither apt-get nor yum found"
+  exit 1
+fi
+
+# Verify pigz is available, fallback to gzip
+if command -v pigz &> /dev/null; then
+  echo "Using pigz for parallel compression..."
+  pigz -9 dump.json
+else
+  echo "pigz not available, falling back to gzip..."
+  gzip -9 dump.json
+fi

49-51: Automate password handling for encryption.

The interactive password prompt blocks automated execution. Use an environment variable for the encryption password in production workflows.

Apply this diff:

-# ⚠️ You'll be prompted for password — can automate with -pass if needed
-openssl enc -aes-256-cbc -pbkdf2 -salt \
+# Use password from environment variable
+ENCRYPTION_PASSWORD="${VESPA_ENCRYPTION_PASSWORD:?Error: VESPA_ENCRYPTION_PASSWORD not set}"
+openssl enc -aes-256-cbc -pbkdf2 -salt \
+  -pass "pass:${ENCRYPTION_PASSWORD}" \
   -in dump.json.gz \
   -out dump.json.gz.enc
server/vespa/services.xml (1)

38-38: Fix inconsistent indentation.

Line 38 has an extra leading space compared to other document declarations.

Apply this diff:

-             <document type="chat_attachment" mode="index" global="true" />
+            <document type="chat_attachment" mode="index" global="true" />
server/scripts/vespaMigration.js (2)

100-103: Clearly mark example credentials or remove them.

Lines 100-103 contain AWS example credentials. While they're commented and appear to be from AWS documentation, it's better to either remove them or clearly mark them as non-functional examples to avoid confusion.

Consider removing these lines or adding a clear warning:

+// ⚠️ EXAMPLE ONLY - DO NOT USE THESE CREDENTIALS
+// These are placeholder values from AWS documentation
 //AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
 //AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
-//Default region name [None]: ap-south-1
-//Default output format [None]: json

1-157: Avoid duplication with executable scripts.

This documentation file duplicates content from vespaDataSend.sh and vespaDataReceive.sh. This creates a maintenance burden where changes need to be synchronized across multiple files.

Consider converting this to a high-level guide that references the executable scripts:

# Vespa Data Migration Guide

## Overview
This guide explains how to migrate Vespa data using the provided automation scripts.

## Export Data
Use `vespaDataSend.sh` to export and upload Vespa data:
\`\`\`bash
export VESPA_BACKUP_BUCKET="your-bucket-name"
export VESPA_ENCRYPTION_PASSWORD="your-password"
./server/scripts/vespaDataSend.sh
\`\`\`

## Import Data  
Use `vespaDataReceive.sh` to download and import Vespa data:
\`\`\`bash
export VESPA_BACKUP_BUCKET="your-bucket-name"
export VESPA_ENCRYPTION_PASSWORD="your-password"
./server/scripts/vespaDataReceive.sh
\`\`\`

## Prerequisites
- AWS CLI configured
- Vespa CLI tools installed
- ...
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a45bb2 and 3094bb9.

⛔ Files ignored due to path filters (1)
  • server/vespa/app.zip is excluded by !**/*.zip
📒 Files selected for processing (7)
  • server/scripts/update-workflow-agents.ts (1 hunks)
  • server/scripts/vespaDataReceive.sh (1 hunks)
  • server/scripts/vespaDataSend.sh (1 hunks)
  • server/scripts/vespaMigration.js (1 hunks)
  • server/vespa/docker-compose.cluster.yml (1 hunks)
  • server/vespa/hosts.xml (1 hunks)
  • server/vespa/services.xml (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/scripts/update-workflow-agents.ts (2)
server/scripts/count-workflow-agents.ts (1)
  • countWorkflowAgents (7-27)
server/db/schema/agents.ts (1)
  • agents (29-68)
🔇 Additional comments (7)
server/scripts/update-workflow-agents.ts (1)

8-48: LGTM!

The formatting improvements enhance code readability without altering any logic or behavior.

server/scripts/vespaDataReceive.sh (1)

1-3: LGTM!

Proper error handling with set -e and set -o pipefail ensures the script fails fast on errors.

server/vespa/services.xml (3)

2-7: LGTM!

The admin configuration correctly designates node1 as both the admin server and config server, which aligns with the cluster setup in docker-compose.cluster.yml.


26-26: LGTM! Redundancy configuration is appropriate.

Setting redundancy to 2 with reply-after="1" is appropriate for the two-node cluster. This ensures data is replicated across both nodes while allowing queries to return after receiving a response from the first available replica.

Also applies to: 44-45


28-42: All schema definitions found for newly added document types.

server/vespa/docker-compose.cluster.yml (2)

45-47: LGTM! Port remapping avoids conflicts.

The port mappings correctly remap node2's ports to avoid conflicts with node1 (8081:8080, 19113:19112), allowing both nodes to run simultaneously on the same host.


6-6: Verify Docker image tag exists. Confirm that vespaengine/vespa:8.431.26 is published on Docker Hub (e.g., run docker pull vespaengine/vespa:8.431.26 or check https://hub.docker.com/r/vespaengine/vespa/tags). Applies to lines 6, 23, 42.


# ---------- Option 1 — using AWS S3 ----------
# ⚠️ Replace with your actual bucket name and path
aws s3 cp s3://your-bucket-name/dumps/dump.json.gz.enc .

⚠️ Potential issue | 🟠 Major

Replace placeholder bucket name.

The S3 bucket name your-bucket-name is a placeholder that must be replaced with the actual bucket name before running the script.

Consider using environment variables for configuration:

-aws s3 cp s3://your-bucket-name/dumps/dump.json.gz.enc .
+# Read bucket name from environment variable
+BUCKET_NAME="${VESPA_BACKUP_BUCKET:?Error: VESPA_BACKUP_BUCKET environment variable not set}"
+aws s3 cp "s3://${BUCKET_NAME}/dumps/dump.json.gz.enc" .
🤖 Prompt for AI Agents
In server/scripts/vespaDataReceive.sh around line 11, the aws s3 cp command uses
the placeholder bucket name "your-bucket-name"; replace this with the real S3
bucket or, better, read the bucket name from an environment variable (e.g.,
S3_BUCKET) and use that variable in the command, adding a guard to fail with a
clear message if the env var is not set.

Comment on lines +9 to +10
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

⚠️ Potential issue | 🔴 Critical

CRITICAL: Remove hardcoded AWS credentials.

Hardcoded AWS credentials in version control is a critical security vulnerability, even if these appear to be example values. Credentials should never be committed to the repository.

Remove the hardcoded credentials and use environment variables or AWS credential profiles instead:

-# ⚠️ Replace with your real credentials
-export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
-export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
-export AWS_DEFAULT_REGION="ap-south-1"
-export AWS_DEFAULT_OUTPUT="json"
+# Use AWS credentials from environment or AWS config
+# Ensure AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set
+if [ -z "$AWS_ACCESS_KEY_ID" ] || [ -z "$AWS_SECRET_ACCESS_KEY" ]; then
+  echo "Error: AWS credentials not configured. Please set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY"
+  echo "Or configure AWS CLI with: aws configure"
+  exit 1
+fi
+
+export AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-ap-south-1}"
+export AWS_DEFAULT_OUTPUT="${AWS_DEFAULT_OUTPUT:-json}"

Additionally, scan the repository for any committed credentials using tools like git-secrets or trufflehog.

🤖 Prompt for AI Agents
In server/scripts/vespaDataSend.sh around lines 9-10 the script contains
hardcoded AWS credentials; remove these two export lines immediately and replace
them with references to external configuration (e.g., expect AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY to be provided via environment variables, AWS CLI
named profiles, or mounted credential files/EC2/ECS/IAM role credentials),
update any documentation or CI pipeline to set those environment variables
securely, and run a repository secrets scan (git-secrets, trufflehog, or
similar) to detect and purge any other committed credentials.
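A minimal sketch of that check, assuming credentials arrive via environment variables or the standard `~/.aws/credentials` file (the function name is illustrative; IAM-role credentials are out of scope here):

```shell
#!/bin/bash
# Report where AWS credentials would come from, failing fast when none exist.
# Only the two standard local sources are checked in this sketch.
aws_credential_source() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "environment"
  elif [ -f "${HOME}/.aws/credentials" ]; then
    echo "profile"
  else
    echo "Error: no AWS credentials; export keys or run 'aws configure'" >&2
    return 1
  fi
}
```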

# ------------------------------------------------------------
# STEP 5: Upload to AWS S3
# ------------------------------------------------------------
aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/
⚠️ Potential issue | 🟠 Major

Replace placeholder S3 bucket name.

The S3 bucket name your-bucket-name is a placeholder that must be replaced. Use an environment variable for configuration.

Apply this diff:

-aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/
+BUCKET_NAME="${VESPA_BACKUP_BUCKET:?Error: VESPA_BACKUP_BUCKET not set}"
+TIMESTAMP=$(date +%Y-%m-%d-%H%M%S)
+aws s3 cp dump.json.gz.enc "s3://${BUCKET_NAME}/dumps/dump-${TIMESTAMP}.json.gz.enc"
🤖 Prompt for AI Agents
In server/scripts/vespaDataSend.sh around line 56, the S3 bucket name literal
"your-bucket-name" is a placeholder and should be replaced with a configurable
environment variable; update the aws s3 cp command to use an environment
variable (e.g. "$S3_BUCKET") instead of the hardcoded name, and add a brief
check near the top of the script to ensure S3_BUCKET is set (exit with an error
message if not) so the script fails fast when configuration is missing.
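The timestamped-key idea from the suggested diff can be isolated into a small helper so repeated uploads never overwrite each other (names are illustrative):

```shell
#!/bin/bash
# Build a timestamped S3 object key for the encrypted dump.
dump_key() {
  local ts
  ts=$(date +%Y-%m-%d-%H%M%S)
  echo "dumps/dump-${ts}.json.gz.enc"
}

# Usage: aws s3 cp dump.json.gz.enc "s3://${S3_BUCKET}/$(dump_key)"
```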

Comment on lines +1 to +157
/* #!/bin/bash
// ------------------------------------------------------------
// STEP 1: Start Vespa container for dump creation
// ------------------------------------------------------------
//docker run -d --name vespa-testing \
//-e VESPA_IGNORE_NOT_ENOUGH_MEMORY=true \
//-p 8181:8080 \
//-p 19171:19071 \
//-p 2224:22 \
//vespaengine/vespa:latest
// ------------------------------------------------------------
// STEP 2: Export Vespa data
// ------------------------------------------------------------
"vespa visit --content-cluster my_content --make-feed > dump.json"
// ------------------------------------------------------------
// STEP 3: Compress dump file
// ------------------------------------------------------------
"apt install -y pigz"
//# or yum install pigz
// pigz is parallel gzip (much faster)
// pigz -9 (1.15 hr, ~280 GB) or -7 (1 hr, ~320 GB)
"pigz -9 dump.json"
// creates dump.json.gz
// if pigz is not available, fallback to gzip
//gzip -9 dump.json
//(gzip -9 -c dump.json > dump.json.gz)
// ------------------------------------------------------------
// STEP 4: Encrypt dump file (OpenSSL password-based)
// ------------------------------------------------------------
"openssl enc -aes-256-cbc -pbkdf2 -salt \
-in dump.json.gz \
-out dump.json.gz.enc"
// Strong AES-256 encryption, password will be prompted
// dump.json.gz.enc → safe to transfer/upload
// ------------------------------------------------------------
// OPTIONAL: GPG-based encryption (if using keypair)
// ------------------------------------------------------------
//gpg --full-generate-key
//gpg --list-keys
//gpg --output dump.json.gz.gpg --encrypt --recipient <A1B2C3D4E5F6G7H8> dump.json.gz
// ------------------------------------------------------------
// STEP 5: Set up SSH access for container-to-host transfer
// ------------------------------------------------------------
// ssh-keygen -t rsa -b 4096 -C "mohd.shoaib@juspay.in"
// cat ~/.ssh/id_rsa.pub
// On your **remote machine (container)**:
//docker exec -it --user root vespa-testing /bin/bash
//apt-get or yum update
//apt-get or yum install -y openssh-client (openssh first)
//ssh-keygen -A
//yum install -y openssh-server
//mkdir -p /var/run/sshd
///usr/sbin/sshd
//mkdir -p ~/.ssh
//chmod 700 ~/.ssh
//echo "PASTE_YOUR_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys
//chmod 600 ~/.ssh/authorized_keys
//yum install -y rsync
// ------------------------------------------------------------
// STEP 6: Test SSH + Transfer dump or key
// ------------------------------------------------------------
//ssh -p 2224 root@192.168.1.6 - testing
//yum install -y rsync
//gpg --export-secret-keys --armor BF4AF7E7E3955EF3A436A4ED7C59556BFC58DFAF > my-private-key.asc
//rsync -avzP --inplace --partial --append -e "ssh -p 2224" my-private-key.asc root@192.168.1.6:/home/
"brew install awscli"
"aws configure"
"AWS Access Key ID [None]: ****************"
"AWS Secret Access Key [None]: ********************"
"Default region name [None]: ap-south-1"
"Default output format [None]: json"
//AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
//AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
//Default region name [None]: ap-south-1
//Default output format [None]: json
// For fast file transfer
"aws configure set default.s3.max_concurrent_requests 20"
"aws configure set default.s3.multipart_threshold 64MB"
//Check your identity:
"aws sts get-caller-identity"
// for Making transfers faster (optional)
"aws configure set default.s3.multipart_chunksize 64MB"
"aws configure set default.s3.max_queue_size 100"
"aws configure set default.s3.multipart_upload_threshold 64MB"
"aws configure set default.s3.multipart_max_attempts 5"
"aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/"
//Optional (show progress bar):
"aws s3 cp dump.json.gz.enc s3://xyne-vespa-backups/2025-10-13/ --expected-size $(stat -c%s dump.json.gz.enc"
//rsync -avzP --inplace --partial --append -e "ssh -p 2224" dump.json.gz.gpg root@192.168.1.6:/home/root/
// ------------------------------------------------------------
// STEP 7: On the new machine
// ------------------------------------------------------------
// Option 1 — using AWS S3
"aws s3 cp s3://your-bucket-name/dumps/dump.json.gz.enc "
"openssl enc -d -aes-256-cbc -pbkdf2 -salt \
-in dump.json.gz.enc \
-out dump.json.gz"
// Option 2 — if using GPG
//yum install -y pinentry
//gpgconf --kill gpg-agent
//export GPG_TTY=$(tty)
//echo $GPG_TTY
//gpg --import my-private-key.asc
//gpg --list-secret-keys
//gpg --output dump.json.gz --decrypt dump.json.gz.gpg
// ------------------------------------------------------------
// STEP 8: Decompress and feed into Vespa
// ------------------------------------------------------------
"gunzip dump.json.gz"
"vespa-feed-client dump.json"
// ------------------------------------------------------------
// Done 🎉
// ------------------------------------------------------------
*/
🛠️ Refactor suggestion | 🟠 Major

Convert to appropriate file format for documentation.

This file is a .js file but contains shell script documentation wrapped in multi-line comments. The mixed syntax (JavaScript comments, shell comments, quoted strings) makes it confusing and non-executable.

Consider one of these options:

Option 1 (Recommended): Convert to Markdown

Rename to vespaMigration.md and format as proper documentation:

# Vespa Data Migration Guide

## Step 1: Start Vespa Container
```bash
docker run -d --name vespa-testing \
  -e VESPA_IGNORE_NOT_ENOUGH_MEMORY=true \
  -p 8181:8080 \
  ...
```

## Step 2: Export Vespa Data
```bash
vespa visit --content-cluster my_content --make-feed > dump.json
```
...

Option 2: Convert to executable shell script

Rename to vespaMigration.sh and format as a proper bash script with functions or clear sections.

🤖 Prompt for AI Agents
In server/scripts/vespaMigration.js lines 1-157: the file contains shell
commands wrapped in JavaScript block comments and quoted strings, making it
non-executable and confusing; convert this to a proper documentation or script
file — either rename and reformat as Markdown (vespaMigration.md) with fenced
bash code blocks and headings for each step, or rename to a shell script
(vespaMigration.sh) and remove JS comment markers/quotes, add a shebang, make
commands valid bash (uncomment, group into functions or sections), and ensure
executable permissions; pick one option and update filenames, references, and
git commit accordingly.

Comment on lines +12 to +14
volumes:
- ./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/
- vespa-config-data:/opt/vespa/var/db/vespa
⚠️ Potential issue | 🟠 Major

Verify volume mount path for application package.

The volume mount on line 13 maps the current directory to a very specific Vespa internal path:
./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/

This path appears to be an internal Vespa session directory, not the standard location for application packages. The correct path for deploying a Vespa application package is typically /app or a deployment-specific directory.

Verify the intended mount path. If deploying an application package, consider:

     volumes:
-      - ./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/
+      - ./:/app  # Mount application package to standard location
       - vespa-config-data:/opt/vespa/var/db/vespa

Then use the proper Vespa deployment command to deploy the application from /app.

🤖 Prompt for AI Agents
In server/vespa/docker-compose.cluster.yml around lines 12 to 14, the bind mount
maps the repo root into Vespa's internal session directory
(./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/),
which is incorrect for deploying an application package; change the mount to
expose the application package at a sensible path inside the container (for
example /app or a dedicated deploy dir) and remove the direct mapping into
Vespa's session store, then update any startup/deploy scripts to run the Vespa
deployment command (e.g., deploy from /app) so the container places sessions
into Vespa's own data volume rather than overwriting internal session
directories.

Comment on lines +1 to +12
<?xml version="1.0" encoding="utf-8" ?>
<hosts>
<!-- Primary node - current Vespa instance -->
<host name="localhost">
<alias>node1</alias>
</host>

<!-- Secondary node - vespa-testing container -->
<host name="localhost:8181">
<alias>node2</alias>
</host>
</hosts>
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the host configuration approach with port in hostname.

The configuration uses localhost:8181 as a hostname on line 9, which is unconventional for Vespa hosts.xml files. Typically, hosts are defined without ports, and port mappings are handled in services.xml and docker-compose configuration.

While this may function, ensure it aligns with Vespa best practices and doesn't cause issues with the cluster coordination or health checks.




🌐 Web query:

What is the correct format for Vespa hosts.xml configuration? Should ports be included in host names?

💡 Result:

Use FQDNs (hostnames) only — do not include ports in hosts.xml host name attributes. hosts.xml lists hosts by name (e.g. myserver.example.com) and any aliases; ports are configured separately in services.xml (via baseport/port attributes for services). (docs.vespa.ai)



Remove port notation from host names in hosts.xml
Replace <host name="localhost:8181"> with <host name="localhost"> and configure port 8181 in services.xml or docker-compose.
File: server/vespa/hosts.xml line 9

🤖 Prompt for AI Agents
In server/vespa/hosts.xml around lines 1 to 12, the secondary host incorrectly
includes a port in the host name ("localhost:8181"); remove the port so the
entry is <host name="localhost"> and keep the alias as node2, then configure the
port 8181 externally (e.g., in services.xml or docker-compose) so hosts.xml
contains only hostnames as required by Vespa.
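Applied to this file, the fix would look roughly like the following, assuming the two nodes are reachable under compose service names such as `vespa-node1` and `vespa-node2` (hypothetical names):

```xml
<?xml version="1.0" encoding="utf-8" ?>
<hosts>
  <!-- Hostnames only; ports live in services.xml / docker-compose -->
  <host name="vespa-node1">
    <alias>node1</alias>
  </host>
  <host name="vespa-node2">
    <alias>node2</alias>
  </host>
</hosts>
```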
