feat(vespa-data-migration): migrating vespa using s3 #1101
base: main
Conversation
Summary of Changes

Hello @MohdShoaib-18169, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a robust framework for migrating Vespa data, primarily utilizing AWS S3 for secure and scalable data transfer. It provides a set of executable shell scripts for automating the entire migration lifecycle, from data export and encryption to S3 upload and subsequent import into a new Vespa instance. Complementing these scripts is a detailed guide for the migration process. Furthermore, the PR includes Docker Compose configurations to easily set up and manage a multi-node Vespa cluster, enhancing data redundancy and operational flexibility.
Walkthrough

Adds Vespa data migration utilities (send/receive scripts and a migration guide), introduces Vespa multi-node cluster configs (compose, hosts, services), and makes minor formatting-only adjustments to an existing workflow agents update script. No exported/public API changes.
Sequence Diagram(s)sequenceDiagram
autonumber
actor Op as Operator
participant V as Vespa Cluster (Source)
participant S as Script: vespaDataSend.sh
participant ENC as OpenSSL
participant A as AWS S3
Op->>S: Run send script (STEP 0–6)
S->>V: Export data (vespa visit -> dump.json)
S->>S: Compress (pigz/gzip -> dump.json.gz)
S->>ENC: Encrypt AES-256-CBC (PBKDF2)
ENC-->>S: dump.json.gz.enc
S->>A: Upload to s3://.../dumps/
S-->>Op: Success message
sequenceDiagram
autonumber
actor Op as Operator
participant R as Script: vespaDataReceive.sh
participant A as AWS S3
participant DEC as OpenSSL
participant V as Vespa Cluster (Dest)
Op->>R: Run receive script
R->>A: aws s3 cp dump.json.gz.enc
R->>DEC: Decrypt AES-256-CBC (PBKDF2)
DEC-->>R: dump.json.gz
R->>R: Decompress (gunzip -> dump.json)
R->>V: Feed (vespa-feed-client)
R-->>Op: Completion notice
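For reference, the round trip the two diagrams describe condenses to the following shell sketch; the bucket, cluster name, and interactive password handling are the placeholders used throughout this PR, not production values:

```bash
#!/bin/bash
# Condensed sketch of the send/receive flow shown in the diagrams above.
# BUCKET and CLUSTER are illustrative placeholders.
set -euo pipefail
BUCKET="s3://your-bucket-name/dumps"
CLUSTER="my_content"

# --- Send side (vespaDataSend.sh) ---
vespa visit --content-cluster "$CLUSTER" --make-feed > dump.json
pigz -9 dump.json                                   # -> dump.json.gz
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -in dump.json.gz -out dump.json.gz.enc            # password prompted
aws s3 cp dump.json.gz.enc "$BUCKET/"

# --- Receive side (vespaDataReceive.sh) ---
aws s3 cp "$BUCKET/dump.json.gz.enc" .
openssl enc -d -aes-256-cbc -pbkdf2 -salt \
  -in dump.json.gz.enc -out dump.json.gz
gunzip dump.json.gz                                 # -> dump.json
vespa-feed-client dump.json                         # as invoked in the PR scripts
```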
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Code Review
This pull request introduces scripts for Vespa data migration using AWS S3 and configuration for a multi-node Vespa cluster. The changes are a good step towards a more robust data management and deployment strategy. My review focuses on improving the security and reliability of the new scripts and correcting some critical configuration issues in the Vespa cluster setup that would likely prevent it from working as intended. Key feedback includes removing hardcoded credentials from scripts, making them more configurable, and aligning the Vespa service and host configurations with the Docker Compose setup.
# ⚠️ Replace with your real credentials
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_DEFAULT_REGION="ap-south-1"
export AWS_DEFAULT_OUTPUT="json"
Hardcoding AWS credentials in a script, even as examples, is a significant security risk. It encourages a bad practice that can lead to accidentally committing real credentials. The script should rely on the standard AWS CLI credential chain (e.g., IAM roles, environment variables, or the ~/.aws/credentials file). Please remove these export statements.
- # ⚠️ Replace with your real credentials
- export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
- export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
- export AWS_DEFAULT_REGION="ap-south-1"
- export AWS_DEFAULT_OUTPUT="json"
+ # ⚠️ Ensure your AWS credentials are configured in your environment
+ # (e.g., via `aws configure` or environment variables).
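A minimal pre-flight check in that spirit (a sketch, not part of the PR) fails fast when the credential chain has nothing to offer, regardless of where the credentials come from:

```bash
# Verify the default AWS credential chain (IAM role, env vars, or
# ~/.aws/credentials) can authenticate before doing any real work.
if ! aws sts get-caller-identity > /dev/null 2>&1; then
  echo "Error: no usable AWS credentials found; configure a profile, env vars, or an IAM role" >&2
  exit 1
fi
```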
<host name="localhost">
  <alias>node1</alias>
</host>

<!-- Secondary node - vespa-testing container -->
<host name="localhost:8181">
  <alias>node2</alias>
</host>
The host configuration is incorrect for the multi-node setup defined in docker-compose.cluster.yml.
- The name attribute of a <host> tag must be a valid hostname. localhost:8181 is invalid because it includes a port.
- In a Docker network, localhost refers to the container itself. To allow containers to communicate, you must use their service names as hostnames (e.g., vespa-node1, vespa-node2).

This file should be updated to map the aliases used in services.xml to the correct service hostnames from your Docker Compose file.
- <host name="localhost">
-   <alias>node1</alias>
- </host>
- <!-- Secondary node - vespa-testing container -->
- <host name="localhost:8181">
-   <alias>node2</alias>
- </host>
+ <host name="vespa-node1">
+   <alias>node1</alias>
+ </host>
+ <!-- Secondary node - vespa-testing container -->
+ <host name="vespa-node2">
+   <alias>node2</alias>
+ </host>
<admin version="2.0">
  <adminserver hostalias="node1" />
  <configservers>
    <configserver hostalias="node1" />
  </configservers>
</admin>
There's an inconsistency between your docker-compose.cluster.yml and this services.xml. The Docker Compose file defines a dedicated service vespa-config for the config server. However, this services.xml file places the adminserver and configserver on hostalias="node1", which corresponds to the vespa-node1 service.
The vespa-node1 service is started with vespa-start-services and configured to use vespa-config as its config server. This creates a conflict. The admin and config server definitions should point to an alias that maps to the vespa-config service in hosts.xml.
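One quick way to confirm which config server each container actually points at is to inspect VESPA_CONFIGSERVERS, the environment variable the official Vespa images use for this (assuming the compose file sets it, as the review implies):

```bash
# Each node should report the same config server (vespa-config), and the
# services.xml admin section should map to that same host alias.
docker exec vespa-node1 printenv VESPA_CONFIGSERVERS
docker exec vespa-config printenv VESPA_CONFIGSERVERS
```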
# AWS performance tuning (optional)
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 64MB
aws configure set default.s3.max_queue_size 100
aws configure set default.s3.multipart_upload_threshold 64MB
aws configure set default.s3.multipart_max_attempts 5
Modifying the user's global AWS configuration with aws configure set can have unintended side effects on other operations outside of this script. It's safer to use environment variables for these settings to scope them only to the current script execution (e.g., export AWS_MAX_CONCURRENT_REQUESTS=20).
Note that some of these settings, like max_queue_size and multipart_max_attempts, cannot be set via environment variables and must be in the AWS config file. Also, multipart_upload_threshold appears to be a duplicate of multipart_threshold.
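One way to scope all of these settings to a single run, without mutating ~/.aws/config, is a throwaway config file selected via AWS_CONFIG_FILE (a sketch; the tuning values mirror the script's):

```bash
# Write S3 transfer tuning to a temporary config file and point the CLI at it.
# ~/.aws/config stays untouched; credentials still load from the usual chain.
TMP_AWS_CONFIG="$(mktemp)"
trap 'rm -f "$TMP_AWS_CONFIG"' EXIT
cat > "$TMP_AWS_CONFIG" <<'EOF'
[default]
s3 =
    max_concurrent_requests = 20
    multipart_threshold = 64MB
    multipart_chunksize = 64MB
    max_queue_size = 100
EOF
export AWS_CONFIG_FILE="$TMP_AWS_CONFIG"
aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/
```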
/* #!/bin/bash
// ------------------------------------------------------------
// STEP 1: Start Vespa container for dump creation
// ------------------------------------------------------------
//docker run -d --name vespa-testing \
//-e VESPA_IGNORE_NOT_ENOUGH_MEMORY=true \
//-p 8181:8080 \
//-p 19171:19071 \
//-p 2224:22 \
//vespaengine/vespa:latest
// ------------------------------------------------------------
// STEP 2: Export Vespa data
// ------------------------------------------------------------
"vespa visit --content-cluster my_content --make-feed > dump.json"
// ------------------------------------------------------------
// STEP 3: Compress dump file
// ------------------------------------------------------------
"apt install -y pigz"
//# or yum install pigz
// pigz is parallel gzip (much faster)
// pigz -9 (1.15 hr, ~280 GB) or -7 (1 hr, ~320 GB)
"pigz -9 dump.json"
// creates dump.json.gz
// if pigz is not available, fallback to gzip
//gzip -9 dump.json
//(gzip -9 -c dump.json > dump.json.gz)
// ------------------------------------------------------------
// STEP 4: Encrypt dump file (OpenSSL password-based)
// ------------------------------------------------------------
"openssl enc -aes-256-cbc -pbkdf2 -salt \
-in dump.json.gz \
-out dump.json.gz.enc"
// Strong AES-256 encryption, password will be prompted
// dump.json.gz.enc → safe to transfer/upload
// ------------------------------------------------------------
// OPTIONAL: GPG-based encryption (if using keypair)
// ------------------------------------------------------------
//gpg --full-generate-key
//gpg --list-keys
//gpg --output dump.json.gz.gpg --encrypt --recipient <A1B2C3D4E5F6G7H8> dump.json.gz
// ------------------------------------------------------------
// STEP 5: Set up SSH access for container-to-host transfer
// ------------------------------------------------------------
// ssh-keygen -t rsa -b 4096 -C "mohd.shoaib@juspay.in"
// cat ~/.ssh/id_rsa.pub
// On your **remote machine (container)**:
//docker exec -it --user root vespa-testing /bin/bash
//apt-get or yum update
//apt-get or yum install -y openssh-client (openssh first)
//ssh-keygen -A
//yum install -y openssh-server
//mkdir -p /var/run/sshd
///usr/sbin/sshd
//mkdir -p ~/.ssh
//chmod 700 ~/.ssh
//echo "PASTE_YOUR_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys
//chmod 600 ~/.ssh/authorized_keys
//yum install -y rsync
// ------------------------------------------------------------
// STEP 6: Test SSH + Transfer dump or key
// ------------------------------------------------------------
//ssh -p 2224 root@192.168.1.6 - testing
//yum install -y rsync
//gpg --export-secret-keys --armor BF4AF7E7E3955EF3A436A4ED7C59556BFC58DFAF > my-private-key.asc
//rsync -avzP --inplace --partial --append -e "ssh -p 2224" my-private-key.asc root@192.168.1.6:/home/
"brew install awscli"
"aws configure"
"AWS Access Key ID [None]: ****************"
"AWS Secret Access Key [None]: ********************"
"Default region name [None]: ap-south-1"
"Default output format [None]: json"
//AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
//AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
//Default region name [None]: ap-south-1
//Default output format [None]: json
// For fast file transfer
"aws configure set default.s3.max_concurrent_requests 20"
"aws configure set default.s3.multipart_threshold 64MB"
//Check your identity:
"aws sts get-caller-identity"
// for Making transfers faster (optional)
"aws configure set default.s3.multipart_chunksize 64MB"
"aws configure set default.s3.max_queue_size 100"
"aws configure set default.s3.multipart_upload_threshold 64MB"
"aws configure set default.s3.multipart_max_attempts 5"
"aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/"
//Optional (show progress bar):
"aws s3 cp dump.json.gz.enc s3://xyne-vespa-backups/2025-10-13/ --expected-size $(stat -c%s dump.json.gz.enc"
//rsync -avzP --inplace --partial --append -e "ssh -p 2224" dump.json.gz.gpg root@192.168.1.6:/home/root/
// ------------------------------------------------------------
// STEP 7: On the new machine
// ------------------------------------------------------------
// Option 1 — using AWS S3
"aws s3 cp s3://your-bucket-name/dumps/dump.json.gz.enc "
"openssl enc -d -aes-256-cbc -pbkdf2 -salt \
-in dump.json.gz.enc \
-out dump.json.gz"
// Option 2 — if using GPG
//yum install -y pinentry
//gpgconf --kill gpg-agent
//export GPG_TTY=$(tty)
//echo $GPG_TTY
//gpg --import my-private-key.asc
//gpg --list-secret-keys
//gpg --output dump.json.gz --decrypt dump.json.gz.gpg
// ------------------------------------------------------------
// STEP 8: Decompress and feed into Vespa
// ------------------------------------------------------------
"gunzip dump.json.gz"
"vespa-feed-client dump.json"
// ------------------------------------------------------------
// Done 🎉
// ------------------------------------------------------------
*/
This file appears to be a collection of notes and scratchpad commands, and it's entirely commented out. It also contains example AWS credentials (lines 100-101), which is a security risk even when commented, as they can be flagged by security scanners and promote bad practices. It's best to remove this file from the pull request. If these are important notes, they should be moved to a more appropriate place like a README or a wiki, with any sensitive examples removed.
# ------------------------------------------------------------
# Done 🎉
# ------------------------------------------------------------
echo "✅ Vespa data restored successfully!"
The script leaves behind the downloaded and decompressed dump files (dump.json.gz.enc and dump.json). These files can be very large. It's a good practice to clean them up to conserve disk space. You could add a cleanup step at the end of the script, or use a trap to ensure cleanup happens even on failure.
  echo "✅ Vespa data restored successfully!"
+ echo "🧹 Cleaning up temporary files..."
+ rm -f dump.json.gz.enc dump.json
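If cleanup should also run when the script aborts partway, a trap-based variant of the same idea (a sketch; file names follow the script) covers both success and failure paths:

```bash
# Remove downloaded/intermediate files on any exit, success or failure.
# dump.json is intentionally kept so a failed feed can be inspected or retried.
cleanup() {
  rm -f dump.json.gz.enc dump.json.gz
}
trap cleanup EXIT
```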
# ------------------------------------------------------------
# STEP 5: Upload to AWS S3
# ------------------------------------------------------------
aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/
- "19071:19071" # Config server port
- "19050:19050" # Admin port
volumes:
  - ./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/
Mounting the application package directly into the config server's internal session directory is fragile and not a standard practice. This path is an implementation detail of Vespa and could change in future versions, breaking your setup. A more robust approach is to mount the application package to a neutral location (e.g., /app) and then use vespa deploy to deploy it.
networks:
  vespa-cluster:
    driver: bridge
<host name="localhost:8181">
  <alias>node2</alias>
</host>
</hosts>
Actionable comments posted: 6
♻️ Duplicate comments (2)
server/vespa/docker-compose.cluster.yml (2)
29-31: Same volume mount issue as config server. The volume mount on line 30 has the same issue as the config server (line 13). See previous comment for details.

48-50: Same volume mount issue as other services. The volume mount on line 49 has the same issue. See previous comments for details.
🧹 Nitpick comments (9)
server/scripts/vespaDataReceive.sh (3)
14-16: Automate password handling for decryption. The interactive password prompt will block automated execution. For production use, consider using environment variables or AWS Secrets Manager for the encryption password.

Apply this diff:

-openssl enc -d -aes-256-cbc -pbkdf2 -salt \
+# Use password from environment variable to enable automation
+ENCRYPTION_PASSWORD="${VESPA_ENCRYPTION_PASSWORD:?Error: VESPA_ENCRYPTION_PASSWORD not set}"
+openssl enc -d -aes-256-cbc -pbkdf2 -salt \
+  -pass "pass:${ENCRYPTION_PASSWORD}" \
   -in dump.json.gz.enc \
   -out dump.json.gz
34-35: Add validation before decompression and feeding. The script should verify that the compressed file exists before decompressing and that Vespa is accessible before attempting to feed data.

Apply this diff:

+# Validate file exists
+if [ ! -f "dump.json.gz" ]; then
+  echo "Error: dump.json.gz not found"
+  exit 1
+fi
+
 gunzip dump.json.gz
+
+# Validate Vespa is accessible
+if ! command -v vespa-feed-client &> /dev/null; then
+  echo "Error: vespa-feed-client command not found"
+  exit 1
+fi
+
+# Optionally check Vespa health
+# curl -sf http://localhost:8080/state/v1/health || { echo "Error: Vespa not healthy"; exit 1; }
+
 vespa-feed-client dump.json
1-40: Add cleanup of intermediate files. The script leaves intermediate files (dump.json.gz.enc, dump.json.gz) after execution. Consider adding a cleanup trap to remove these files on exit.

Add cleanup at the beginning of the script:

 #!/bin/bash
 set -e
 set -o pipefail
+
+# Cleanup function
+cleanup() {
+  echo "Cleaning up intermediate files..."
+  rm -f dump.json.gz.enc dump.json.gz
+}
+trap cleanup EXIT

server/scripts/vespaDataSend.sh (3)
37-37: Make content cluster name configurable. The content cluster name my_content is hardcoded. Make it configurable via an environment variable for flexibility across different deployments.

Apply this diff:

+# Allow overriding the content cluster name
+CONTENT_CLUSTER="${VESPA_CONTENT_CLUSTER:-my_content}"
+
+# Verify Vespa is accessible
+if ! curl -sf http://localhost:8080/state/v1/health > /dev/null; then
+  echo "Error: Vespa is not accessible at localhost:8080"
+  exit 1
+fi
+
-vespa visit --content-cluster my_content --make-feed > dump.json
+vespa visit --content-cluster "$CONTENT_CLUSTER" --make-feed > dump.json
42-43: Improve package installation and command verification. The script attempts to install pigz using both apt and yum sequentially, which will fail on one or the other. Additionally, it doesn't verify that pigz is available before using it.

Apply this diff:

-apt install -y pigz || yum install -y pigz
-pigz -9 dump.json # creates dump.json.gz
+# Install pigz with proper OS detection
+if command -v apt-get &> /dev/null; then
+  apt-get update && apt-get install -y pigz
+elif command -v yum &> /dev/null; then
+  yum install -y pigz
+else
+  echo "Error: Neither apt-get nor yum found"
+  exit 1
+fi
+
+# Verify pigz is available, fallback to gzip
+if command -v pigz &> /dev/null; then
+  echo "Using pigz for parallel compression..."
+  pigz -9 dump.json
+else
+  echo "pigz not available, falling back to gzip..."
+  gzip -9 dump.json
+fi
49-51: Automate password handling for encryption. The interactive password prompt blocks automated execution. Use an environment variable for the encryption password in production workflows.

Apply this diff:

-# ⚠️ You'll be prompted for password — can automate with -pass if needed
-openssl enc -aes-256-cbc -pbkdf2 -salt \
+# Use password from environment variable
+ENCRYPTION_PASSWORD="${VESPA_ENCRYPTION_PASSWORD:?Error: VESPA_ENCRYPTION_PASSWORD not set}"
+openssl enc -aes-256-cbc -pbkdf2 -salt \
+  -pass "pass:${ENCRYPTION_PASSWORD}" \
   -in dump.json.gz \
   -out dump.json.gz.enc

server/vespa/services.xml (1)
38-38: Fix inconsistent indentation. Line 38 has an extra leading space compared to other document declarations.

Apply this diff:

-  <document type="chat_attachment" mode="index" global="true" />
+ <document type="chat_attachment" mode="index" global="true" />

server/scripts/vespaMigration.js (2)
100-103: Clearly mark example credentials or remove them. Lines 100-103 contain AWS example credentials. While they're commented and appear to be from AWS documentation, it's better to either remove them or clearly mark them as non-functional examples to avoid confusion.

Consider removing these lines or adding a clear warning:

+// ⚠️ EXAMPLE ONLY - DO NOT USE THESE CREDENTIALS
+// These are placeholder values from AWS documentation
 //AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
 //AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
-//Default region name [None]: ap-south-1
-//Default output format [None]: json
1-157: Avoid duplication with executable scripts. This documentation file duplicates content from vespaDataSend.sh and vespaDataReceive.sh. This creates a maintenance burden where changes need to be synchronized across multiple files.

Consider converting this to a high-level guide that references the executable scripts:

# Vespa Data Migration Guide

## Overview

This guide explains how to migrate Vespa data using the provided automation scripts.

## Export Data

Use `vespaDataSend.sh` to export and upload Vespa data:

```bash
export VESPA_BACKUP_BUCKET="your-bucket-name"
export VESPA_ENCRYPTION_PASSWORD="your-password"
./server/scripts/vespaDataSend.sh
```

## Import Data

Use `vespaDataReceive.sh` to download and import Vespa data:

```bash
export VESPA_BACKUP_BUCKET="your-bucket-name"
export VESPA_ENCRYPTION_PASSWORD="your-password"
./server/scripts/vespaDataReceive.sh
```

## Prerequisites

- AWS CLI configured
- Vespa CLI tools installed
- ...
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
server/vespa/app.zip is excluded by !**/*.zip
📒 Files selected for processing (7)
- server/scripts/update-workflow-agents.ts (1 hunks)
- server/scripts/vespaDataReceive.sh (1 hunks)
- server/scripts/vespaDataSend.sh (1 hunks)
- server/scripts/vespaMigration.js (1 hunks)
- server/vespa/docker-compose.cluster.yml (1 hunks)
- server/vespa/hosts.xml (1 hunks)
- server/vespa/services.xml (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/scripts/update-workflow-agents.ts (2)
server/scripts/count-workflow-agents.ts (1)
countWorkflowAgents (7-27)

server/db/schema/agents.ts (1)

agents (29-68)
🔇 Additional comments (7)
server/scripts/update-workflow-agents.ts (1)
8-48: LGTM! The formatting improvements enhance code readability without altering any logic or behavior.
server/scripts/vespaDataReceive.sh (1)
1-3: LGTM! Proper error handling with set -e and set -o pipefail ensures the script fails fast on errors.

server/vespa/services.xml (3)

2-7: LGTM! The admin configuration correctly designates node1 as both the admin server and config server, which aligns with the cluster setup in docker-compose.cluster.yml.

26-26: LGTM! Redundancy configuration is appropriate. Setting redundancy to 2 with reply-after="1" is appropriate for the two-node cluster. This ensures data is replicated across both nodes while allowing queries to return after receiving a response from the first available replica. Also applies to: 44-45.

28-42: All schema definitions found for newly added document types.

server/vespa/docker-compose.cluster.yml (2)

45-47: LGTM! Port remapping avoids conflicts. The port mappings correctly remap node2's ports to avoid conflicts with node1 (8081:8080, 19113:19112), allowing both nodes to run simultaneously on the same host.

6-6: Verify Docker image tag exists. Confirm that vespaengine/vespa:8.431.26 is published on Docker Hub (e.g., run docker pull vespaengine/vespa:8.431.26 or check https://hub.docker.com/r/vespaengine/vespa/tags). Applies to lines 6, 23, 42.
# ---------- Option 1 — using AWS S3 ----------
# ⚠️ Replace with your actual bucket name and path
aws s3 cp s3://your-bucket-name/dumps/dump.json.gz.enc .
Replace placeholder bucket name.
The S3 bucket name your-bucket-name is a placeholder that must be replaced with the actual bucket name before running the script.
Consider using environment variables for configuration:
-aws s3 cp s3://your-bucket-name/dumps/dump.json.gz.enc .
+# Read bucket name from environment variable
+BUCKET_NAME="${VESPA_BACKUP_BUCKET:?Error: VESPA_BACKUP_BUCKET environment variable not set}"
+aws s3 cp "s3://${BUCKET_NAME}/dumps/dump.json.gz.enc" .

🤖 Prompt for AI Agents
In server/scripts/vespaDataReceive.sh around line 11, the aws s3 cp command uses
the placeholder bucket name "your-bucket-name"; replace this with the real S3
bucket or, better, read the bucket name from an environment variable (e.g.,
S3_BUCKET) and use that variable in the command, adding a guard to fail with a
clear message if the env var is not set.
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
CRITICAL: Remove hardcoded AWS credentials.
Hardcoded AWS credentials in version control is a critical security vulnerability, even if these appear to be example values. Credentials should never be committed to the repository.
Remove the hardcoded credentials and use environment variables or AWS credential profiles instead:
-# ⚠️ Replace with your real credentials
-export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
-export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
-export AWS_DEFAULT_REGION="ap-south-1"
-export AWS_DEFAULT_OUTPUT="json"
+# Use AWS credentials from environment or AWS config
+# Ensure AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set
+if [ -z "$AWS_ACCESS_KEY_ID" ] || [ -z "$AWS_SECRET_ACCESS_KEY" ]; then
+ echo "Error: AWS credentials not configured. Please set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY"
+ echo "Or configure AWS CLI with: aws configure"
+ exit 1
+fi
+
+export AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-ap-south-1}"
+export AWS_DEFAULT_OUTPUT="${AWS_DEFAULT_OUTPUT:-json}"

Additionally, scan the repository for any committed credentials using tools like git-secrets or trufflehog.
🤖 Prompt for AI Agents
In server/scripts/vespaDataSend.sh around lines 9-10 the script contains
hardcoded AWS credentials; remove these two export lines immediately and replace
them with references to external configuration (e.g., expect AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY to be provided via environment variables, AWS CLI
named profiles, or mounted credential files/EC2/ECS/IAM role credentials),
update any documentation or CI pipeline to set those environment variables
securely, and run a repository secrets scan (git-secrets, trufflehog, or
similar) to detect and purge any other committed credentials.
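For the suggested history scan, git-secrets (one of the tools named above) covers both the working tree and past commits; a typical invocation looks like:

```bash
# One-time setup: install the hooks and register the stock AWS patterns.
git secrets --install
git secrets --register-aws

# Scan tracked files, then the full commit history, for credential patterns.
git secrets --scan
git secrets --scan-history
```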
# ------------------------------------------------------------
# STEP 5: Upload to AWS S3
# ------------------------------------------------------------
aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/
Replace placeholder S3 bucket name.
The S3 bucket name your-bucket-name is a placeholder that must be replaced. Use an environment variable for configuration.
Apply this diff:
-aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/
+BUCKET_NAME="${VESPA_BACKUP_BUCKET:?Error: VESPA_BACKUP_BUCKET not set}"
+TIMESTAMP=$(date +%Y-%m-%d-%H%M%S)
+aws s3 cp dump.json.gz.enc "s3://${BUCKET_NAME}/dumps/dump-${TIMESTAMP}.json.gz.enc"📝 Committable suggestion
- aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/
+ BUCKET_NAME="${VESPA_BACKUP_BUCKET:?Error: VESPA_BACKUP_BUCKET not set}"
+ TIMESTAMP=$(date +%Y-%m-%d-%H%M%S)
+ aws s3 cp dump.json.gz.enc "s3://${BUCKET_NAME}/dumps/dump-${TIMESTAMP}.json.gz.enc"
🤖 Prompt for AI Agents
In server/scripts/vespaDataSend.sh around line 56, the S3 bucket name literal
"your-bucket-name" is a placeholder and should be replaced with a configurable
environment variable; update the aws s3 cp command to use an environment
variable (e.g. "$S3_BUCKET") instead of the hardcoded name, and add a brief
check near the top of the script to ensure S3_BUCKET is set (exit with an error
message if not) so the script fails fast when configuration is missing.
🛠️ Refactor suggestion | 🟠 Major
Convert to appropriate file format for documentation.
This file is a .js file but contains shell script documentation wrapped in multi-line comments. The mixed syntax (JavaScript comments, shell comments, quoted strings) makes it confusing and non-executable.
Consider one of these options:
Option 1 (Recommended): Convert to Markdown
Rename to vespaMigration.md and format as proper documentation:
# Vespa Data Migration Guide

## Step 1: Start Vespa Container

```bash
docker run -d --name vespa-testing \
  -e VESPA_IGNORE_NOT_ENOUGH_MEMORY=true \
  -p 8181:8080 \
  ...
```

## Step 2: Export Vespa Data

```bash
vespa visit --content-cluster my_content --make-feed > dump.json
```

...

Option 2: Convert to executable shell script
Rename to vespaMigration.sh and format as a proper bash script with functions or clear sections.
🤖 Prompt for AI Agents
In server/scripts/vespaMigration.js lines 1-157: the file contains shell
commands wrapped in JavaScript block comments and quoted strings, making it
non-executable and confusing; convert this to a proper documentation or script
file — either rename and reformat as Markdown (vespaMigration.md) with fenced
bash code blocks and headings for each step, or rename to a shell script
(vespaMigration.sh) and remove JS comment markers/quotes, add a shebang, make
commands valid bash (uncomment, group into functions or sections), and ensure
executable permissions; pick one option and update filenames, references, and
git commit accordingly.
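If Option 2 is chosen, the skeleton might look like this (function names are illustrative; the step bodies are condensed from the notes above):

```bash
#!/bin/bash
# Sketch of a vespaMigration.sh rewrite: one function per step of the notes.
set -euo pipefail

export_data() {
  vespa visit --content-cluster my_content --make-feed > dump.json
}

compress() {
  # Prefer pigz; fall back to gzip if pigz is unavailable.
  if command -v pigz > /dev/null; then
    pigz -9 dump.json
  else
    gzip -9 dump.json
  fi
}

encrypt() {
  openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in dump.json.gz -out dump.json.gz.enc
}

upload() {
  aws s3 cp dump.json.gz.enc "s3://your-bucket-name/dumps/"
}

export_data
compress
encrypt
upload
```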
volumes:
  - ./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/
  - vespa-config-data:/opt/vespa/var/db/vespa
Verify volume mount path for application package.
The volume mount on line 13 maps the current directory to a very specific Vespa internal path:
./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/
This path appears to be an internal Vespa session directory, not the standard location for application packages. The correct path for deploying a Vespa application package is typically /app or a deployment-specific directory.
Verify the intended mount path. If deploying an application package, consider:
volumes:
- - ./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/
+ - ./:/app # Mount application package to standard location
  - vespa-config-data:/opt/vespa/var/db/vespa

Then use the proper Vespa deployment command to deploy the application from /app.
📝 Committable suggestion
- volumes:
-   - ./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/
-   - vespa-config-data:/opt/vespa/var/db/vespa
+ volumes:
+   - ./:/app # Mount application package to standard location
+   - vespa-config-data:/opt/vespa/var/db/vespa
🤖 Prompt for AI Agents
In server/vespa/docker-compose.cluster.yml around lines 12 to 14, the bind mount
maps the repo root into Vespa's internal session directory
(./:/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/1/),
which is incorrect for deploying an application package; change the mount to
expose the application package at a sensible path inside the container (for
example /app or a dedicated deploy dir) and remove the direct mapping into
Vespa's session store, then update any startup/deploy scripts to run the Vespa
deployment command (e.g., deploy from /app) so the container places sessions
into Vespa's own data volume rather than overwriting internal session
directories.
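A sketch of that flow, assuming the compose change above and that the vespa CLI is available in the config-server container (vespa-deploy prepare/activate is the older in-container equivalent):

```bash
# Bring the cluster up, then deploy the package mounted at /app through the
# config server instead of writing into Vespa's internal session store.
docker compose -f docker-compose.cluster.yml up -d
docker exec vespa-config vespa deploy /app
```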
<?xml version="1.0" encoding="utf-8" ?>
<hosts>
  <!-- Primary node - current Vespa instance -->
  <host name="localhost">
    <alias>node1</alias>
  </host>

  <!-- Secondary node - vespa-testing container -->
  <host name="localhost:8181">
    <alias>node2</alias>
  </host>
</hosts>
🧩 Analysis chain
Verify the host configuration approach with port in hostname.
The configuration uses localhost:8181 as a hostname on line 9, which is unconventional for Vespa hosts.xml files. Typically, hosts are defined without ports, and port mappings are handled in services.xml and docker-compose configuration.
While this may function, ensure it aligns with Vespa best practices and doesn't cause issues with the cluster coordination or health checks.
🏁 Script executed:
#!/bin/bash
# Check Vespa documentation for hosts.xml best practices
# and verify if port notation in host names is supported

Length of output: 59
🌐 Web query:
What is the correct format for Vespa hosts.xml configuration? Should ports be included in host names?
💡 Result:
Use FQDNs (hostnames) only — do not include ports in hosts.xml host name attributes. hosts.xml lists hosts by name (e.g. myserver.example.com) and any aliases; ports are configured separately in services.xml (via baseport/port attributes for services). (docs.vespa.ai)
Remove port notation from host names in hosts.xml
Replace <host name="localhost:8181"> with <host name="localhost"> and configure port 8181 in services.xml or docker-compose.
File: server/vespa/hosts.xml line 9
🤖 Prompt for AI Agents
In server/vespa/hosts.xml around lines 1 to 12, the secondary host incorrectly
includes a port in the host name ("localhost:8181"); remove the port so the
entry is <host name="localhost"> and keep the alias as node2, then configure the
port 8181 externally (e.g., in services.xml or docker-compose) so hosts.xml
contains only hostnames as required by Vespa.
Description
Testing
Additional Notes
Summary by CodeRabbit
New Features
Documentation
Style