# Cassandra Cluster Setup (3-Node)

**Prerequisites**: Complete [01_init_docker_swarm.md](01_init_docker_swarm.md) first
**Time to Complete**: 60-90 minutes

**What You'll Build**:
- 3 new DigitalOcean droplets (workers 2, 3, 4)
- 3-node Cassandra cluster using Docker Swarm
- Replication factor 3 for high availability
- Private network communication only

---

## Table of Contents

1. [Overview](#overview)
2. [Create Cassandra Worker Droplets](#create-cassandra-worker-droplets)
3. [Configure Workers and Join Swarm](#configure-workers-and-join-swarm)
4. [Deploy Cassandra Cluster](#deploy-cassandra-cluster)
5. [Initialize Keyspaces](#initialize-keyspaces)
6. [Verify Cluster Health](#verify-cluster-health)
7. [Cluster Management](#cluster-management)
8. [Troubleshooting](#troubleshooting)

---

## Overview

### Architecture

```
Swarm Manager (existing):
├── mapleopentech-swarm-manager-1-prod (10.116.0.2)
└── Controls cluster, no Cassandra

Existing Worker:
└── mapleopentech-swarm-worker-1-prod (10.116.0.3)
    └── Available for other services

Cassandra Cluster (NEW):
├── mapleopentech-swarm-worker-2-prod (10.116.0.4)
│   └── Cassandra Node 1
├── mapleopentech-swarm-worker-3-prod (10.116.0.5)
│   └── Cassandra Node 2
└── mapleopentech-swarm-worker-4-prod (10.116.0.6)
    └── Cassandra Node 3
```

### Cassandra Configuration

- **Version**: Cassandra 5.0.4
- **Cluster Name**: mapleopentech-private-prod-cluster
- **Replication Factor**: 3 (every piece of data stored on all 3 nodes)
- **Data Center**: datacenter1
- **Heap Size**: 512MB (reduced for 2GB RAM constraint)
- **Communication**: Private network only (secure)

**⚠️ IMPORTANT - Memory Constraints:**

This configuration uses minimal 2GB RAM droplets with 512MB heap size. This is **NOT recommended for production** use.
Expect: - Limited performance (max ~1,000 writes/sec vs 10,000 with proper sizing) - Potential stability issues under load - Frequent garbage collection pauses - Limited concurrent connection capacity **For production use**, upgrade to 8GB RAM droplets with 2GB heap size. ### Why 3 Nodes? - **High Availability**: Cluster survives 1 node failure - **Replication Factor 3**: Every piece of data stored on all 3 nodes - **Read Performance**: Queries can hit any node - **Write Performance**: Writes distributed across cluster - **Production Standard**: Minimum for HA Cassandra --- ## Create Cassandra Worker Droplets ### Step 1: Create Worker 2 (Cassandra Node 1) **From DigitalOcean Dashboard:** 1. Go to https://cloud.digitalocean.com/ 2. Click **Create** → **Droplets** **Droplet Configuration:** | Setting | Value | |---------|-------| | **Region** | Toronto 1 (TOR1) - SAME as existing | | **Image** | Ubuntu 24.04 LTS x64 | | **Droplet Type** | Regular Intel | | **CPU Options** | 1 vCPU, 2 GB RAM ($12/month) | | **Storage** | 50 GB SSD | | **VPC** | default-tor1 (auto-selected) | | **SSH Key** | Select your key | | **Hostname** | `mapleopentech-swarm-worker-2-prod` | | **Tags** | `production`, `cassandra`, `database` | Click **Create Droplet** and wait 60 seconds. 
**✅ Checkpoint - Save to `.env`:** ```bash # On your local machine: SWARM_WORKER_2_HOSTNAME=mapleopentech-swarm-worker-2-prod SWARM_WORKER_2_PUBLIC_IP=159.65.123.47 # Your public IP SWARM_WORKER_2_PRIVATE_IP=10.116.0.4 # Your private IP CASSANDRA_NODE_1_IP=10.116.0.4 # Same as private IP ``` ### Step 2: Create Worker 3 (Cassandra Node 2) Repeat with these values: | Setting | Value | |---------|-------| | **Hostname** | `mapleopentech-swarm-worker-3-prod` | | All other settings | Same as Worker 2 | **✅ Checkpoint - Save to `.env`:** ```bash SWARM_WORKER_3_HOSTNAME=mapleopentech-swarm-worker-3-prod SWARM_WORKER_3_PUBLIC_IP=159.65.123.48 # Your public IP SWARM_WORKER_3_PRIVATE_IP=10.116.0.5 # Your private IP CASSANDRA_NODE_2_IP=10.116.0.5 # Same as private IP ``` ### Step 3: Create Worker 4 (Cassandra Node 3) Repeat with these values: | Setting | Value | |---------|-------| | **Hostname** | `mapleopentech-swarm-worker-4-prod` | | All other settings | Same as Worker 2 | **✅ Checkpoint - Save to `.env`:** ```bash SWARM_WORKER_4_HOSTNAME=mapleopentech-swarm-worker-4-prod SWARM_WORKER_4_PUBLIC_IP=159.65.123.49 # Your public IP SWARM_WORKER_4_PRIVATE_IP=10.116.0.6 # Your private IP CASSANDRA_NODE_3_IP=10.116.0.6 # Same as private IP ``` ### Step 4: Verify All Droplets in Same VPC 1. Go to **Networking** → **VPC** → Click `default-tor1` 2. Should see 5 droplets total: - mapleopentech-swarm-manager-1-prod (10.116.0.2) - mapleopentech-swarm-worker-1-prod (10.116.0.3) - mapleopentech-swarm-worker-2-prod (10.116.0.4) - mapleopentech-swarm-worker-3-prod (10.116.0.5) - mapleopentech-swarm-worker-4-prod (10.116.0.6) --- ## Configure Workers and Join Swarm Follow these steps for **EACH** of the 3 new workers (workers 2, 3, 4). 
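Before touching the workers, it can help to confirm the `.env` checkpoints from the droplet-creation steps were actually saved. The sketch below is an optional helper, not part of the guide's required flow; the here-doc sample stands in for your real `.env` file, so substitute the real path in practice.

```shell
#!/usr/bin/env bash
# Optional sanity check: verify the .env checkpoints captured all three
# Cassandra node IPs. The here-doc sample below stands in for your real .env.
set -u

ENV_FILE=$(mktemp)
cat > "$ENV_FILE" <<'EOF'
CASSANDRA_NODE_1_IP=10.116.0.4
CASSANDRA_NODE_2_IP=10.116.0.5
CASSANDRA_NODE_3_IP=10.116.0.6
EOF

missing=0
for n in 1 2 3; do
  if grep -q "^CASSANDRA_NODE_${n}_IP=" "$ENV_FILE"; then
    echo "node $n IP recorded"
  else
    echo "node $n IP MISSING"
    missing=1
  fi
done
rm -f "$ENV_FILE"

if [ "$missing" -eq 0 ]; then
  echo "all three Cassandra node IPs saved"
fi
```

If any entry reports MISSING, go back to the matching checkpoint above before proceeding — the later deployment and connection steps all reference these variables.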
### Worker 2 Setup #### Step 1: Initial SSH as Root ```bash # SSH to Worker 2 ssh root@159.65.123.47 # Replace with YOUR worker 2 public IP # You should see: root@mapleopentech-swarm-worker-2-prod:~# ``` #### Step 2: System Updates and Create Admin User ```bash # Update system apt update && apt upgrade -y # Install essentials apt install -y curl wget apt-transport-https ca-certificates gnupg lsb-release # Create dockeradmin user adduser dockeradmin # Use the SAME password as other nodes # Add to sudo group usermod -aG sudo dockeradmin # Copy SSH keys rsync --archive --chown=dockeradmin:dockeradmin ~/.ssh /home/dockeradmin ``` #### Step 3: Secure SSH Configuration ```bash # Edit SSH config vi /etc/ssh/sshd_config # Update these lines: PermitRootLogin no PasswordAuthentication no PubkeyAuthentication yes MaxAuthTries 3 LoginGraceTime 60 # Save and restart SSH systemctl restart ssh ``` #### Step 4: Reconnect as dockeradmin ```bash # Exit root session exit # SSH back as dockeradmin ssh dockeradmin@159.65.123.47 # Replace with YOUR worker 2 public IP # You should see: dockeradmin@mapleopentech-swarm-worker-2-prod:~$ ``` #### Step 5: Install Docker ```bash # Install Docker curl -fsSL https://get.docker.com -o get-docker.sh sudo sh get-docker.sh # Add dockeradmin to docker group sudo usermod -aG docker dockeradmin newgrp docker # Verify docker --version # Enable Docker sudo systemctl enable docker sudo systemctl status docker # Press 'q' to exit ``` #### Step 6: Configure Firewall ```bash # Install UFW sudo apt install ufw -y # Allow SSH sudo ufw allow 22/tcp # Allow Docker Swarm ports (replace with YOUR VPC subnet from .env) sudo ufw allow from 10.116.0.0/16 to any port 2377 proto tcp sudo ufw allow from 10.116.0.0/16 to any port 7946 sudo ufw allow from 10.116.0.0/16 to any port 4789 proto udp # Allow Cassandra ports (private network only) # 7000: Inter-node communication # 7001: Inter-node communication (TLS) # 9042: CQL native transport (client connections) sudo ufw 
allow from 10.116.0.0/16 to any port 7000 proto tcp sudo ufw allow from 10.116.0.0/16 to any port 7001 proto tcp sudo ufw allow from 10.116.0.0/16 to any port 9042 proto tcp # Enable firewall sudo ufw --force enable # Check status sudo ufw status verbose ``` #### Step 7: Join Docker Swarm ```bash # Use the join command from Step 8 of 01_init_docker_swarm.md # Replace with YOUR actual token and manager private IP: docker swarm join --token SWMTKN-1-4abc123xyz789verylongtoken 10.116.0.2:2377 # Expected output: # This node joined a swarm as a worker. ``` ✅ **Worker 2 complete!** Repeat Steps 1-7 for Workers 3 and 4. ### Worker 3 Setup Repeat Steps 1-7 above, replacing: - Public IP: Use Worker 3's public IP (159.65.123.48 example) - Hostname: `mapleopentech-swarm-worker-3-prod` ### Worker 4 Setup Repeat Steps 1-7 above, replacing: - Public IP: Use Worker 4's public IP (159.65.123.49 example) - Hostname: `mapleopentech-swarm-worker-4-prod` --- ## Deploy Cassandra Cluster ### Step 1: Verify All Workers Joined **From your manager node:** ```bash # SSH to manager ssh dockeradmin@159.65.123.45 # Your manager's public IP # List all swarm nodes docker node ls # Expected output (5 nodes total): # ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS # abc123... * mapleopentech-swarm-manager-1-prod Ready Active Leader # def456... mapleopentech-swarm-worker-1-prod Ready Active # ghi789... mapleopentech-swarm-worker-2-prod Ready Active # jkl012... mapleopentech-swarm-worker-3-prod Ready Active # mno345... 
mapleopentech-swarm-worker-4-prod Ready Active ``` ### Step 2: Label Cassandra Nodes Apply labels so Cassandra services deploy to correct nodes: ```bash # Label Worker 2 as Cassandra Node 1 docker node update --label-add cassandra=node1 mapleopentech-swarm-worker-2-prod # Label Worker 3 as Cassandra Node 2 docker node update --label-add cassandra=node2 mapleopentech-swarm-worker-3-prod # Label Worker 4 as Cassandra Node 3 docker node update --label-add cassandra=node3 mapleopentech-swarm-worker-4-prod # Verify labels docker node inspect mapleopentech-swarm-worker-2-prod --format '{{.Spec.Labels}}' # Should show: map[cassandra:node1] ``` ### Step 3: Create Docker Stack File **On your manager**, create the Cassandra stack: ```bash # Create directory for stack files mkdir -p ~/stacks cd ~/stacks # Create Cassandra stack file vi cassandra-stack.yml ``` Copy and paste the following: ```yaml version: '3.8' networks: mapleopentech-private-prod: external: true volumes: cassandra-1-data: cassandra-2-data: cassandra-3-data: services: cassandra-1: image: cassandra:5.0.4 hostname: cassandra-1 networks: - mapleopentech-private-prod environment: - CASSANDRA_CLUSTER_NAME=mapleopentech-private-prod-cluster - CASSANDRA_DC=datacenter1 - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch - CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3 - MAX_HEAP_SIZE=512M - HEAP_NEWSIZE=128M volumes: - cassandra-1-data:/var/lib/cassandra deploy: replicas: 1 placement: constraints: - node.labels.cassandra == node1 restart_policy: condition: on-failure delay: 10s max_attempts: 3 healthcheck: test: ["CMD-SHELL", "cqlsh -e 'describe cluster' || exit 1"] interval: 30s timeout: 10s retries: 5 start_period: 120s cassandra-2: image: cassandra:5.0.4 hostname: cassandra-2 networks: - mapleopentech-private-prod environment: - CASSANDRA_CLUSTER_NAME=mapleopentech-private-prod-cluster - CASSANDRA_DC=datacenter1 - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch - 
CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3 - MAX_HEAP_SIZE=512M - HEAP_NEWSIZE=128M volumes: - cassandra-2-data:/var/lib/cassandra deploy: replicas: 1 placement: constraints: - node.labels.cassandra == node2 restart_policy: condition: on-failure delay: 10s max_attempts: 3 healthcheck: test: ["CMD-SHELL", "cqlsh -e 'describe cluster' || exit 1"] interval: 30s timeout: 10s retries: 5 start_period: 120s cassandra-3: image: cassandra:5.0.4 hostname: cassandra-3 networks: - mapleopentech-private-prod environment: - CASSANDRA_CLUSTER_NAME=mapleopentech-private-prod-cluster - CASSANDRA_DC=datacenter1 - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch - CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3 - MAX_HEAP_SIZE=512M - HEAP_NEWSIZE=128M volumes: - cassandra-3-data:/var/lib/cassandra deploy: replicas: 1 placement: constraints: - node.labels.cassandra == node3 restart_policy: condition: on-failure delay: 10s max_attempts: 3 healthcheck: test: ["CMD-SHELL", "cqlsh -e 'describe cluster' || exit 1"] interval: 30s timeout: 10s retries: 5 start_period: 120s ``` ### Step 4: Create Shared Overlay Network Before deploying any services, create the shared `mapleopentech-private-prod` network that all services will use: ```bash # Create the mapleopentech-private-prod overlay network docker network create \ --driver overlay \ --attachable \ mapleopentech-private-prod # Verify it was created docker network ls | grep mapleopentech-private-prod # Should show: # abc123... mapleopentech-private-prod overlay swarm ``` **What is this network for?** - Shared by all Maple services (Cassandra, Redis, Go backend, etc.) 
- Enables private communication between services - Services can reach each other by service name (e.g., `redis`, `cassandra-1`) - No public internet exposure ### Step 5: Create Deployment Script Create the sequential deployment script to avoid race conditions: ```bash # Create the deployment script vi deploy-cassandra.sh ``` Copy and paste the following script: ```bash #!/bin/bash # # Cassandra Cluster Sequential Deployment Script # This script deploys Cassandra nodes sequentially to avoid race conditions # during cluster formation. # set -e STACK_NAME="cassandra" STACK_FILE="cassandra-stack.yml" echo "=== Cassandra Cluster Sequential Deployment ===" echo "" # Check if stack file exists if [ ! -f "$STACK_FILE" ]; then echo "ERROR: $STACK_FILE not found in current directory" exit 1 fi echo "Step 1: Deploying cassandra-1 (seed node)..." docker stack deploy -c "$STACK_FILE" "$STACK_NAME" # Scale down cassandra-2 and cassandra-3 temporarily docker service scale "${STACK_NAME}_cassandra-2=0" > /dev/null 2>&1 docker service scale "${STACK_NAME}_cassandra-3=0" > /dev/null 2>&1 echo "Waiting for cassandra-1 to become healthy (this takes ~5-8 minutes)..." echo "Checking every 30 seconds..." # Wait for cassandra-1 to be running COUNTER=0 MAX_WAIT=20 # 20 * 30 seconds = 10 minutes max while [ $COUNTER -lt $MAX_WAIT ]; do REPLICAS=$(docker service ls --filter "name=${STACK_NAME}_cassandra-1" --format "{{.Replicas}}") if [ "$REPLICAS" = "1/1" ]; then echo "✓ cassandra-1 is running" # Give it extra time to fully initialize echo "Waiting additional 2 minutes for cassandra-1 to fully initialize..." sleep 120 break fi echo " cassandra-1 status: $REPLICAS (waiting...)" sleep 30 COUNTER=$((COUNTER + 1)) done if [ $COUNTER -eq $MAX_WAIT ]; then echo "ERROR: cassandra-1 failed to start within 10 minutes" echo "Check logs with: docker service logs ${STACK_NAME}_cassandra-1" exit 1 fi echo "" echo "Step 2: Starting cassandra-2..." 
docker service scale "${STACK_NAME}_cassandra-2=1" echo "Waiting for cassandra-2 to become healthy (this takes ~5-8 minutes)..." COUNTER=0 while [ $COUNTER -lt $MAX_WAIT ]; do REPLICAS=$(docker service ls --filter "name=${STACK_NAME}_cassandra-2" --format "{{.Replicas}}") if [ "$REPLICAS" = "1/1" ]; then echo "✓ cassandra-2 is running" echo "Waiting additional 2 minutes for cassandra-2 to join cluster..." sleep 120 break fi echo " cassandra-2 status: $REPLICAS (waiting...)" sleep 30 COUNTER=$((COUNTER + 1)) done if [ $COUNTER -eq $MAX_WAIT ]; then echo "ERROR: cassandra-2 failed to start within 10 minutes" echo "Check logs with: docker service logs ${STACK_NAME}_cassandra-2" exit 1 fi echo "" echo "Step 3: Starting cassandra-3..." docker service scale "${STACK_NAME}_cassandra-3=1" echo "Waiting for cassandra-3 to become healthy (this takes ~5-8 minutes)..." COUNTER=0 while [ $COUNTER -lt $MAX_WAIT ]; do REPLICAS=$(docker service ls --filter "name=${STACK_NAME}_cassandra-3" --format "{{.Replicas}}") if [ "$REPLICAS" = "1/1" ]; then echo "✓ cassandra-3 is running" echo "Waiting additional 2 minutes for cassandra-3 to join cluster..." sleep 120 break fi echo " cassandra-3 status: $REPLICAS (waiting...)" sleep 30 COUNTER=$((COUNTER + 1)) done if [ $COUNTER -eq $MAX_WAIT ]; then echo "ERROR: cassandra-3 failed to start within 10 minutes" echo "Check logs with: docker service logs ${STACK_NAME}_cassandra-3" exit 1 fi echo "" echo "=== Deployment Complete ===" echo "" echo "All 3 Cassandra nodes should now be running and forming a cluster." echo "" echo "Verify cluster status by SSH'ing to any worker node and running:" echo " docker exec -it \$(docker ps -q --filter \"name=cassandra\") nodetool status" echo "" echo "You should see 3 nodes with status 'UN' (Up Normal)." 
echo ""
```

Make it executable:

```bash
chmod +x deploy-cassandra.sh
```

### Step 6: Deploy Cassandra Cluster Sequentially

**⚠️ CRITICAL - READ THIS BEFORE DEPLOYING ⚠️**

**DO NOT use `docker stack deploy -c cassandra-stack.yml cassandra` directly!**

**Why?** This creates a **race condition**: all 3 nodes start simultaneously, try to connect to each other before they're ready, give up, and form separate single-node clusters instead of one 3-node cluster. This is a classic distributed systems problem.

**What happens if you do?** Each node will run independently. Running `nodetool status` on any node will show only 1 node instead of 3. The cluster will appear broken.

**The fix?** Use the sequential deployment script, which starts nodes one at a time.

**ALWAYS use the deployment script:**

```bash
# Run the sequential deployment script
./deploy-cassandra.sh
```

**What this script does:**
1. Deploys cassandra-1 first and waits for it to be fully healthy (~5-8 minutes)
2. Starts cassandra-2 and waits for it to join the cluster (~5-8 minutes)
3. Starts cassandra-3 and waits for it to join the cluster (~5-8 minutes)
4. Total deployment time: **15-25 minutes**

**Expected output** (the `mapleopentech-private-prod` network already exists, so no network is created here):

```
=== Cassandra Cluster Sequential Deployment ===

Step 1: Deploying cassandra-1 (seed node)...
Creating service cassandra_cassandra-1
Creating service cassandra_cassandra-2
Creating service cassandra_cassandra-3
cassandra_cassandra-2 scaled to 0
cassandra_cassandra-3 scaled to 0
Waiting for cassandra-1 to become healthy (this takes ~5-8 minutes)...
Checking every 30 seconds...
 cassandra-1 status: 0/1 (waiting...)
✓ cassandra-1 is running
Waiting additional 2 minutes for cassandra-1 to fully initialize...

Step 2: Starting cassandra-2...
cassandra_cassandra-2 scaled to 1
Waiting for cassandra-2 to become healthy (this takes ~5-8 minutes)...
 cassandra-2 status: 0/1 (waiting...)
✓ cassandra-2 is running
Waiting additional 2 minutes for cassandra-2 to join cluster...

Step 3: Starting cassandra-3...
cassandra_cassandra-3 scaled to 1
Waiting for cassandra-3 to become healthy (this takes ~5-8 minutes)...
 cassandra-3 status: 0/1 (waiting...)
✓ cassandra-3 is running
Waiting additional 2 minutes for cassandra-3 to join cluster...

=== Deployment Complete ===

All 3 Cassandra nodes should now be running and forming a cluster.
```

**If the script fails**, check the service logs:

```bash
docker service logs cassandra_cassandra-1
docker service logs cassandra_cassandra-2
docker service logs cassandra_cassandra-3
```

---

## Initialize Keyspaces

### Step 1: Connect to Cassandra Node 1

```bash
# Get the node where cassandra-1 is running
docker service ps cassandra_cassandra-1 --format "{{.Node}}"
# Output: mapleopentech-swarm-worker-2-prod

# SSH to that worker
ssh dockeradmin@10.116.0.4  # Private IP of worker 2

# Find container ID
CONTAINER_ID=$(docker ps --filter "name=cassandra_cassandra-1" --format "{{.ID}}")

# Open CQL shell
docker exec -it $CONTAINER_ID cqlsh
```

### Step 2: Create Keyspaces

```sql
-- MaplePress Backend
CREATE KEYSPACE IF NOT EXISTS maplepress
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
}
AND DURABLE_WRITES = true;

-- MapleFile Backend
CREATE KEYSPACE IF NOT EXISTS maplefile
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
}
AND DURABLE_WRITES = true;

-- mapleopentech Backend
CREATE KEYSPACE IF NOT EXISTS mapleopentech
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
}
AND DURABLE_WRITES = true;

-- Verify
DESCRIBE KEYSPACES;

-- Exit CQL shell
exit
```

Expected output should show your keyspaces:

```
maplepress  maplefile  mapleopentech  system  system_auth
system_distributed  system_schema  system_traces  system_views
system_virtual_schema
```

---

## Verify Cluster Health

### Step 1: Check
Cluster Status

**From inside cassandra-1 container:**

```bash
# If not already in container:
CONTAINER_ID=$(docker ps --filter "name=cassandra_cassandra-1" --format "{{.ID}}")
docker exec -it $CONTAINER_ID bash

# Check cluster status
nodetool status

# Expected output:
# Datacenter: datacenter1
# =======================
# Status=Up/Down
# |/ State=Normal/Leaving/Joining/Moving
# --  Address     Load     Tokens  Owns    Host ID    Rack
# UN  10.116.0.4  125 KiB  16      100.0%  abc123...  rack1
# UN  10.116.0.5  120 KiB  16      100.0%  def456...  rack1
# UN  10.116.0.6  118 KiB  16      100.0%  ghi789...  rack1
```

**What to verify:**
- ✅ All 3 nodes show `UN` (Up and Normal)
- ✅ Each node has an IP from your private network (10.116.0.x)
- ✅ Load is distributed
- ✅ Owns shows roughly 100% (data is replicated everywhere with RF=3)

### Step 2: Test Write/Read

**Still in cassandra-1 container:**

```bash
# Open CQL shell
cqlsh
```

Inside cqlsh, run the following (note the `--` comment syntax — `#` is not a valid CQL comment):

```sql
-- Create test keyspace
CREATE KEYSPACE IF NOT EXISTS test
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

USE test;

-- Create test table
CREATE TABLE IF NOT EXISTS users (
  user_id UUID PRIMARY KEY,
  username TEXT,
  email TEXT
);

-- Insert test data
INSERT INTO users (user_id, username, email)
VALUES (uuid(), 'testuser', 'test@example.com');

-- Read data
SELECT * FROM users;

-- Expected output:
-- user_id           | email            | username
-- ------------------+------------------+----------
-- abc123-def456-... | test@example.com | testuser

-- Exit cqlsh
exit
```

Then exit the container shell too:

```bash
exit  # Exit container
```

### Step 3: Verify Replication

Connect to Node 2 and verify data is there:

```bash
# SSH to worker 3 (Node 2)
ssh dockeradmin@10.116.0.5

# Find cassandra-2 container
CONTAINER_ID=$(docker ps --filter "name=cassandra_cassandra-2" --format "{{.ID}}")

# Connect and query
docker exec -it $CONTAINER_ID cqlsh -e "SELECT * FROM test.users;"

# Should see the same test data!
# This proves replication is working.
``` ### Step 4: Save Connection Details **✅ Final Checkpoint - Update `.env`:** ```bash # On your local machine, add: CASSANDRA_CLUSTER_NAME=mapleopentech-private-prod-cluster CASSANDRA_DC=datacenter1 CASSANDRA_REPLICATION_FACTOR=3 # Connection endpoints (any node can be used) CASSANDRA_CONTACT_POINTS=10.116.0.4,10.116.0.5,10.116.0.6 CASSANDRA_CQL_PORT=9042 # For application connections (use private IPs) CASSANDRA_NODE_1_IP=10.116.0.4 CASSANDRA_NODE_2_IP=10.116.0.5 CASSANDRA_NODE_3_IP=10.116.0.6 ``` --- ## Cluster Management ### Restarting the Cassandra Cluster **To restart all Cassandra nodes:** ```bash # On manager node docker service update --force cassandra_cassandra-1 docker service update --force cassandra_cassandra-2 docker service update --force cassandra_cassandra-3 # Wait 5-8 minutes for all nodes to restart # Then verify cluster health docker exec -it $(docker ps -q --filter "name=cassandra") nodetool status ``` **To restart a single node:** ```bash # Restart just one service docker service update --force cassandra_cassandra-1 # Wait for it to rejoin the cluster # Check status from any worker docker exec -it $(docker ps -q --filter "name=cassandra") nodetool status ``` ### Shutting Down the Cassandra Cluster **To stop the entire stack (keeps data):** ```bash # On manager node docker stack rm cassandra # Services will be removed but volumes persist # Data is safe and can be restored later ``` **To verify shutdown:** ```bash # On manager node - check that services are gone docker stack ls # cassandra should not appear # Volumes are on worker nodes, not manager # SSH to each worker to verify volumes still exist (data is safe): # On worker-2: ssh dockeradmin@ docker volume ls | grep cassandra # Should show: cassandra_cassandra-1-data exit # On worker-3: ssh dockeradmin@ docker volume ls | grep cassandra # Should show: cassandra_cassandra-2-data exit # On worker-4: ssh dockeradmin@ docker volume ls | grep cassandra # Should show: cassandra_cassandra-3-data 
exit ``` **To restart after shutdown:** ```bash # Use the deployment script again cd ~/stacks ./deploy-cassandra.sh # Your data will be intact ``` ### Removing All Cassandra Data (Fresh Start) **⚠️ WARNING: This PERMANENTLY deletes all data. Use only when starting from scratch.** **IMPORTANT:** Volumes are stored on the **worker nodes**, not the manager node. You must SSH to each worker to delete them. ```bash # Step 1: Remove the stack (from manager node) docker stack rm cassandra # Step 2: Wait for services to stop (30-60 seconds) watch docker service ls # Press Ctrl+C when cassandra services are gone # Step 3: SSH to EACH worker and remove volumes (THIS DELETES ALL DATA!) # On worker-2 (cassandra-1 node) ssh dockeradmin@ docker volume ls | grep cassandra # Verify volume exists docker volume rm cassandra_cassandra-1-data exit # On worker-3 (cassandra-2 node) ssh dockeradmin@ docker volume ls | grep cassandra # Verify volume exists docker volume rm cassandra_cassandra-2-data exit # On worker-4 (cassandra-3 node) ssh dockeradmin@ docker volume ls | grep cassandra # Verify volume exists docker volume rm cassandra_cassandra-3-data exit # Step 4: Deploy fresh cluster (from manager node) cd ~/stacks ./deploy-cassandra.sh # You now have a fresh cluster with no data # You'll need to recreate keyspaces and tables ``` **Why volumes are on worker nodes:** - Docker Swarm creates volumes on the nodes where containers actually run - Manager node only orchestrates - it doesn't store data - Each worker node has its own volume for the Cassandra container running on it **When to use this:** - Testing deployment from scratch - Recovering from corrupted data - Major version upgrades requiring fresh install - Development/staging environments **When NOT to use this:** - Production environments (use backups and restore instead) - When you just need to restart nodes - When troubleshooting connectivity issues ### Scaling Considerations **Can you scale to more than 3 nodes?** Yes, but 
you'll need to: 1. Create additional worker droplets 2. Update `cassandra-stack.yml` to add `cassandra-4`, `cassandra-5`, etc. 3. Update the deployment script 4. Run `nodetool rebuild` on new nodes **Recommended minimum: 3 nodes** **Recommended maximum with 2GB RAM: 3-5 nodes** For production with proper 8GB RAM droplets, 5-7 nodes is common for large deployments. --- ## Troubleshooting ### Problem: Nodes Not Joining Cluster (Race Condition) **Symptom**: Each node shows only itself when running `nodetool status` - no 3-node cluster formed. **Root Cause**: If you deployed using `docker stack deploy` directly instead of the deployment script, all 3 nodes started simultaneously. They each tried to connect to the seed nodes before the others were ready, gave up, and formed separate single-node clusters. **Solution - Force Rolling Restart:** ```bash # On manager node, force update all services (triggers restart) docker service update --force cassandra_cassandra-1 docker service update --force cassandra_cassandra-2 docker service update --force cassandra_cassandra-3 # Wait 5-8 minutes for each to restart and discover each other # Then verify cluster from any worker: docker exec -it $(docker ps -q --filter "name=cassandra") nodetool status # You should now see all 3 nodes with UN status ``` **Prevention**: Always use the `deploy-cassandra.sh` script for initial deployment to avoid this race condition. ### Problem: Nodes Not Joining Cluster (Other Causes) **Symptom**: `nodetool status` shows only 1 node, or nodes show `DN` (Down) **Solutions:** 1. **Check firewall allows Cassandra ports:** ```bash # On each worker: sudo ufw status verbose | grep 7000 sudo ufw status verbose | grep 9042 # Should see rules allowing from 10.116.0.0/16 (your VPC subnet) ``` 2. 
**Verify seeds configuration:** ```bash # Check service environment docker service inspect cassandra_cassandra-1 --format '{{.Spec.TaskTemplate.ContainerSpec.Env}}' # Should see: CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3 ``` 3. **Check inter-node connectivity:** ```bash # From cassandra-1 container (install tools first): apt-get update && apt-get install -y dnsutils netcat-openbsd # Test DNS resolution: nslookup cassandra-2 nslookup cassandra-3 # Test port connectivity: nc -zv cassandra-2 7000 nc -zv cassandra-3 7000 # Should all succeed ``` 4. **Check service placement:** ```bash # Verify services are on correct nodes docker service ps cassandra_cassandra-1 docker service ps cassandra_cassandra-2 docker service ps cassandra_cassandra-3 # Each should be on its labeled node ``` ### Problem: Slow Startup **Symptom**: Services stuck at 0/1 replicas for > 8 minutes **Solutions:** 1. **Check logs for errors:** ```bash docker service logs cassandra_cassandra-1 --tail 50 ``` 2. **Verify memory constraints:** ```bash # With 2GB RAM, 512MB heap is configured # This is already minimal - slower startup is expected # Be patient and wait up to 10 minutes ``` 3. **Check available memory on worker nodes:** ```bash # SSH to a worker and check memory free -h # Should show at least 1.5GB available after OS overhead ``` 4. **Check disk space:** ```bash df -h # Should have plenty of free space ``` ### Problem: Can't Connect from Application **Symptom**: Application can't reach Cassandra on port 9042 **Solutions:** 1. **Ensure application is on same overlay network:** ```yaml # In your application stack file: networks: mapleopentech-private-prod: external: true ``` 2. **Test connectivity from application container:** ```bash # From app container: nc -zv cassandra-1 9042 # Should connect ``` 3. 
**Use service names in application config:**

```bash
# Use Docker Swarm service names (recommended):
CASSANDRA_CONTACT_POINTS=cassandra-1,cassandra-2,cassandra-3
# These resolve automatically on the overlay network
```

### Problem: Node Shows UJ (Up, Joining)

**Symptom**: Node stuck in joining state

**Solution:**

```bash
# This is normal for the first 5-10 minutes with reduced memory
# Wait longer and check again

# If stuck > 15 minutes, restart that service:
docker service update --force cassandra_cassandra-2
```

### Problem: Out of Memory Errors

**Symptom**: Container keeps restarting, logs show "Out of memory" or "Cannot allocate memory"

**Solution:**

This means 2GB RAM is insufficient. You have two options:

1. **Upgrade droplets to 4GB RAM minimum** (recommended):
   - Resize each worker droplet in DigitalOcean
   - Update the stack file to use `MAX_HEAP_SIZE=1G` and `HEAP_NEWSIZE=256M`
   - Redeploy sequentially: `docker stack rm cassandra && ./deploy-cassandra.sh`

2. **Further reduce heap** (not recommended):

   ```yaml
   # In cassandra-stack.yml, change to:
   - MAX_HEAP_SIZE=384M
   - HEAP_NEWSIZE=96M
   ```

   This will severely limit functionality and is not viable for any real workload.

### Problem: Keyspace Already Exists Error

**Symptom**: `AlreadyExists` error when creating keyspaces

**Solution:**

This is normal if you've run the script before. The `IF NOT EXISTS` clause prevents actual errors. Your keyspaces are already created.

### Installing Debugging Tools

When troubleshooting, you'll often need diagnostic tools inside the Cassandra containers.
Here's how to install them: **Quick install of all useful debugging tools:** ```bash # SSH to any worker node, then run: docker exec -it $(docker ps -q --filter "name=cassandra") bash -c "apt-get update && apt-get install -y dnsutils netcat-openbsd iputils-ping curl vim" ``` **What this installs:** - `dnsutils` - DNS tools (`nslookup`, `dig`) - `netcat-openbsd` - Network connectivity testing (`nc`) - `iputils-ping` - Ping utility - `curl` - HTTP testing - `vim` - Text editor **Example debugging workflow:** ```bash # Get into a Cassandra container docker exec -it $(docker ps -q --filter "name=cassandra") bash # Install tools (only needed once per container) apt-get update && apt-get install -y dnsutils netcat-openbsd # Test DNS resolution nslookup cassandra-1 nslookup cassandra-2 nslookup cassandra-3 # Test port connectivity nc -zv cassandra-1 7000 # Gossip port nc -zv cassandra-2 9042 # CQL port nc -zv cassandra-3 7000 # Gossip port # Check cluster status nodetool status # Exit container exit ``` **Note:** These tools are NOT persistent. If a container restarts, you'll need to reinstall them. For permanent installation, you would need to create a custom Docker image. 
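As the note above says, baking the tools into a custom image is the persistent alternative. Below is a minimal sketch: it writes a Dockerfile extending the official image. The `Dockerfile.cassandra-debug` filename and the `cassandra-debug:5.0.4` tag are example names, not project conventions.

```shell
#!/usr/bin/env bash
# Sketch: persist the debugging tools by extending the official image.
# After writing the Dockerfile, build it with (run where docker is available):
#   docker build -t cassandra-debug:5.0.4 -f Dockerfile.cassandra-debug .
cat > Dockerfile.cassandra-debug <<'EOF'
FROM cassandra:5.0.4
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      dnsutils netcat-openbsd iputils-ping curl && \
    rm -rf /var/lib/apt/lists/*
EOF

echo "wrote Dockerfile.cassandra-debug"
```

To use it, you would replace `image: cassandra:5.0.4` with your custom tag in all three services of `cassandra-stack.yml` and redeploy. Keep in mind that with Swarm the image must be available on every worker node — either build it on each node or push it to a registry the workers can reach.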
--- ## Next Steps ✅ **You now have:** - 3-node Cassandra cluster with replication factor 3 - High availability (survives 1 node failure) - Keyspaces ready for application data - Swarm-managed containers with auto-restart **Next guides:** - **Redis Setup** - Cache layer for applications - **Application Deployment** - Deploy backend services - **Monitoring** - Set up cluster monitoring --- ## Performance Notes ### Hardware Sizing **Current setup (1 vCPU, 2GB RAM per node):** - **NOT suitable for production** - development/testing only - Handles: ~500-1,000 writes/sec, ~5,000 reads/sec - Storage: 50GB per node (150GB total raw, 50GB with RF=3) - Expected issues: slow queries, GC pauses, limited connections - **Total cost**: 3 nodes × $12 = **$36/month** **Recommended production setup (4 vCPU, 8GB RAM per node):** - Good for: Staging, small-to-medium production - Handles: ~10,000 writes/sec, ~50,000 reads/sec - Storage: 160GB per node (480GB total raw, 160GB with RF=3) - **Total cost**: 3 nodes × $48 = **$144/month** **For larger production:** - Scale to 8 vCPU, 16GB RAM - Add more workers (5-node, 7-node cluster) - Use dedicated CPU droplets ### Heap Size Tuning **Current: 512MB heap (with 2GB RAM total)** - Absolute minimum for Cassandra to run - Expect frequent garbage collection - Limited cache effectiveness - **Not recommended for production** **Recommended configurations:** - **2GB RAM**: 512MB heap (current - minimal) - **4GB RAM**: 1GB heap (small production) - **8GB RAM**: 2GB heap (recommended production) - **16GB RAM**: 4GB heap (high-traffic production) ### Replication Factor Current: RF=3 (recommended for production) Options: - **RF=1**: No redundancy, not recommended for production - **RF=2**: Can tolerate 1 failure, less storage overhead - **RF=3**: Best for production, tolerates 1 failure safely - **RF=5**: For mission-critical data (requires 5+ nodes) --- ## Upgrading to Production-Ready Configuration If you started with 2GB RAM droplets and need to 
upgrade:

### Step 1: Resize Droplets in DigitalOcean

1. Go to each worker droplet (workers 2, 3, 4)
2. Click **Resize**
3. Select **8GB RAM / 4 vCPU** plan
4. Complete resize (droplets will reboot)

### Step 2: Update Stack Configuration

SSH to manager and update the stack file:

```bash
ssh dockeradmin@
cd ~/stacks

# Edit cassandra-stack.yml
vi cassandra-stack.yml

# Change these lines in ALL THREE services:
# FROM:
#   - MAX_HEAP_SIZE=512M
#   - HEAP_NEWSIZE=128M
# TO:
#   - MAX_HEAP_SIZE=2G
#   - HEAP_NEWSIZE=512M
```

### Step 3: Redeploy

```bash
# Remove old stack
docker stack rm cassandra

# Wait for cleanup
sleep 30

# Redeploy sequentially with the new configuration
# (avoids the race condition described in Step 6)
./deploy-cassandra.sh

# Monitor startup
watch -n 2 'docker stack services cassandra'
```

---

**Document Version**: 1.1
**Last Updated**: November 3, 2025
**Maintained By**: Infrastructure Team

**Changelog**:
- v1.1 (Nov 3, 2025): Updated for 2GB RAM droplets with reduced heap (512MB) - NOT production ready
- v1.0 (Nov 3, 2025): Initial version with 8GB RAM droplets