monorepo/cloud/infrastructure/production/setup/02_cassandra.md


Cassandra Cluster Setup (3-Node)

Prerequisites: Complete 01_init_docker_swarm.md first

Time to Complete: 60-90 minutes

What You'll Build:

  • 3 new DigitalOcean droplets (workers 2, 3, 4)
  • 3-node Cassandra cluster using Docker Swarm
  • Replication factor 3 for high availability
  • Private network communication only

Table of Contents

  1. Overview
  2. Create Cassandra Worker Droplets
  3. Configure Workers and Join Swarm
  4. Deploy Cassandra Cluster
  5. Initialize Keyspaces
  6. Verify Cluster Health
  7. Cluster Management
  8. Troubleshooting

Overview

Architecture

Swarm Manager (existing):
├── mapleopentech-swarm-manager-1-prod (10.116.0.2)
└── Controls cluster, no Cassandra

Existing Worker:
└── mapleopentech-swarm-worker-1-prod (10.116.0.3)
    └── Available for other services

Cassandra Cluster (NEW):
├── mapleopentech-swarm-worker-2-prod (10.116.0.4)
│   └── Cassandra Node 1
├── mapleopentech-swarm-worker-3-prod (10.116.0.5)
│   └── Cassandra Node 2
└── mapleopentech-swarm-worker-4-prod (10.116.0.6)
    └── Cassandra Node 3

Cassandra Configuration

  • Version: Cassandra 5.0.4
  • Cluster Name: maple-private-prod-cluster
  • Replication Factor: 3 (each piece of data is stored on all 3 nodes)
  • Data Center: datacenter1
  • Heap Size: 512MB (reduced for 2GB RAM constraint)
  • Communication: Private network only (secure)

⚠️ IMPORTANT - Memory Constraints: This configuration uses minimal 2GB RAM droplets with 512MB heap size. This is NOT recommended for production use. Expect:

  • Limited performance (max ~1,000 writes/sec vs 10,000 with proper sizing)
  • Potential stability issues under load
  • Frequent garbage collection pauses
  • Limited concurrent connection capacity

For production use, upgrade to 8GB RAM droplets with 2GB heap size.

Why 3 Nodes?

  • High Availability: Cluster survives 1 node failure
  • Replication Factor 3: Every piece of data stored on all 3 nodes
  • Read Performance: Queries can hit any node
  • Write Performance: Writes distributed across cluster
  • Production Standard: Minimum for HA Cassandra

Create Cassandra Worker Droplets

Step 1: Create Worker 2 (Cassandra Node 1)

From DigitalOcean Dashboard:

  1. Go to https://cloud.digitalocean.com/
  2. Click Create → Droplets

Droplet Configuration:

Setting Value
Region Toronto 1 (TOR1) - SAME as existing
Image Ubuntu 24.04 LTS x64
Droplet Type Regular Intel
CPU Options 1 vCPU, 2 GB RAM ($12/month)
Storage 50 GB SSD
VPC default-tor1 (auto-selected)
SSH Key Select your key
Hostname mapleopentech-swarm-worker-2-prod
Tags production, cassandra, database

Click Create Droplet and wait ~60 seconds for provisioning to finish.

Checkpoint - Save to .env:

# On your local machine:
SWARM_WORKER_2_HOSTNAME=mapleopentech-swarm-worker-2-prod
SWARM_WORKER_2_PUBLIC_IP=159.65.123.47      # Your public IP
SWARM_WORKER_2_PRIVATE_IP=10.116.0.4        # Your private IP
CASSANDRA_NODE_1_IP=10.116.0.4              # Same as private IP

Step 2: Create Worker 3 (Cassandra Node 2)

Repeat with these values:

Setting Value
Hostname mapleopentech-swarm-worker-3-prod
All other settings Same as Worker 2

Checkpoint - Save to .env:

SWARM_WORKER_3_HOSTNAME=mapleopentech-swarm-worker-3-prod
SWARM_WORKER_3_PUBLIC_IP=159.65.123.48      # Your public IP
SWARM_WORKER_3_PRIVATE_IP=10.116.0.5        # Your private IP
CASSANDRA_NODE_2_IP=10.116.0.5              # Same as private IP

Step 3: Create Worker 4 (Cassandra Node 3)

Repeat with these values:

Setting Value
Hostname mapleopentech-swarm-worker-4-prod
All other settings Same as Worker 2

Checkpoint - Save to .env:

SWARM_WORKER_4_HOSTNAME=mapleopentech-swarm-worker-4-prod
SWARM_WORKER_4_PUBLIC_IP=159.65.123.49      # Your public IP
SWARM_WORKER_4_PRIVATE_IP=10.116.0.6        # Your private IP
CASSANDRA_NODE_3_IP=10.116.0.6              # Same as private IP
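With all three checkpoints recorded, a quick sanity check catches copy-paste mistakes before you start configuring workers. A minimal sketch, assuming the example private IPs above (substitute your own values):

```shell
#!/usr/bin/env bash
# Sanity-check the Cassandra node IPs from .env: each should sit inside the
# 10.116.0.0/16 VPC subnet, and there should be no duplicates.
NODE_IPS="10.116.0.4 10.116.0.5 10.116.0.6"   # your CASSANDRA_NODE_*_IP values

for ip in $NODE_IPS; do
  case "$ip" in
    10.116.*) echo "ok: $ip is in the VPC subnet" ;;
    *)        echo "WARNING: $ip is outside 10.116.0.0/16" ;;
  esac
done

# Duplicate detection: the unique count must match the total count (3).
[ "$(echo "$NODE_IPS" | tr ' ' '\n' | sort -u | wc -l)" -eq 3 ] && echo "no duplicate IPs"
```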

Step 4: Verify All Droplets in Same VPC

  1. Go to Networking → VPC → Click default-tor1
  2. You should see 5 droplets total:
    • mapleopentech-swarm-manager-1-prod (10.116.0.2)
    • mapleopentech-swarm-worker-1-prod (10.116.0.3)
    • mapleopentech-swarm-worker-2-prod (10.116.0.4)
    • mapleopentech-swarm-worker-3-prod (10.116.0.5)
    • mapleopentech-swarm-worker-4-prod (10.116.0.6)

Configure Workers and Join Swarm

Follow these steps for EACH of the 3 new workers (workers 2, 3, 4).

Worker 2 Setup

Step 1: Initial SSH as Root

# SSH to Worker 2
ssh root@159.65.123.47  # Replace with YOUR worker 2 public IP

# You should see: root@mapleopentech-swarm-worker-2-prod:~#

Step 2: System Updates and Create Admin User

# Update system
apt update && apt upgrade -y

# Install essentials
apt install -y curl wget apt-transport-https ca-certificates gnupg lsb-release

# Create dockeradmin user
adduser dockeradmin
# Use the SAME password as other nodes

# Add to sudo group
usermod -aG sudo dockeradmin

# Copy SSH keys
rsync --archive --chown=dockeradmin:dockeradmin ~/.ssh /home/dockeradmin

Step 3: Secure SSH Configuration

# Edit SSH config
vi /etc/ssh/sshd_config

# Update these lines:
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
LoginGraceTime 60

# Save and restart SSH
systemctl restart ssh

Step 4: Reconnect as dockeradmin

# Exit root session
exit

# SSH back as dockeradmin
ssh dockeradmin@159.65.123.47  # Replace with YOUR worker 2 public IP

# You should see: dockeradmin@mapleopentech-swarm-worker-2-prod:~$

Step 5: Install Docker

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Add dockeradmin to docker group
sudo usermod -aG docker dockeradmin
newgrp docker

# Verify
docker --version

# Enable Docker
sudo systemctl enable docker
sudo systemctl status docker
# Press 'q' to exit

Step 6: Configure Firewall

# Install UFW
sudo apt install ufw -y

# Allow SSH
sudo ufw allow 22/tcp

# Allow Docker Swarm ports (replace with YOUR VPC subnet from .env)
sudo ufw allow from 10.116.0.0/16 to any port 2377 proto tcp
sudo ufw allow from 10.116.0.0/16 to any port 7946
sudo ufw allow from 10.116.0.0/16 to any port 4789 proto udp

# Allow Cassandra ports (private network only)
# 7000: Inter-node communication
# 7001: Inter-node communication (TLS)
# 9042: CQL native transport (client connections)
sudo ufw allow from 10.116.0.0/16 to any port 7000 proto tcp
sudo ufw allow from 10.116.0.0/16 to any port 7001 proto tcp
sudo ufw allow from 10.116.0.0/16 to any port 9042 proto tcp

# Enable firewall
sudo ufw --force enable

# Check status
sudo ufw status verbose
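Since this firewall block must be repeated verbatim on Workers 3 and 4, the port list can be driven from one table. A sketch that only prints the rules (review the output, then pipe to `sh` if it matches the commands above); the subnet is your VPC subnet from .env:

```shell
#!/usr/bin/env bash
# Generate the Swarm + Cassandra UFW rules from a single port/proto table.
# Swarm: 2377 (management), 7946 tcp+udp (node discovery), 4789 udp (overlay).
# Cassandra: 7000/7001 (inter-node), 9042 (CQL clients).
SUBNET="10.116.0.0/16"   # your VPC subnet

for rule in 2377/tcp 7946/tcp 7946/udp 4789/udp 7000/tcp 7001/tcp 9042/tcp; do
  port=${rule%/*}
  proto=${rule#*/}
  echo "sudo ufw allow from $SUBNET to any port $port proto $proto"
done
```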

Step 7: Join Docker Swarm

# Use the join command from Step 8 of 01_init_docker_swarm.md
# Replace with YOUR actual token and manager private IP:
docker swarm join --token SWMTKN-1-4abc123xyz789verylongtoken 10.116.0.2:2377

# Expected output:
# This node joined a swarm as a worker.

Worker 2 complete! Repeat Steps 1-7 for Workers 3 and 4.

Worker 3 Setup

Repeat Steps 1-7 above, replacing:

  • Public IP: Use Worker 3's public IP (159.65.123.48 example)
  • Hostname: mapleopentech-swarm-worker-3-prod

Worker 4 Setup

Repeat Steps 1-7 above, replacing:

  • Public IP: Use Worker 4's public IP (159.65.123.49 example)
  • Hostname: mapleopentech-swarm-worker-4-prod

Deploy Cassandra Cluster

Step 1: Verify All Workers Joined

From your manager node:

# SSH to manager
ssh dockeradmin@159.65.123.45  # Your manager's public IP

# List all swarm nodes
docker node ls

# Expected output (5 nodes total):
# ID              HOSTNAME                          STATUS   AVAILABILITY   MANAGER STATUS
# abc123... *     mapleopentech-swarm-manager-1-prod    Ready    Active         Leader
# def456...       mapleopentech-swarm-worker-1-prod     Ready    Active
# ghi789...       mapleopentech-swarm-worker-2-prod     Ready    Active
# jkl012...       mapleopentech-swarm-worker-3-prod     Ready    Active
# mno345...       mapleopentech-swarm-worker-4-prod     Ready    Active

Step 2: Label Cassandra Nodes

Apply labels so Cassandra services deploy to correct nodes:

# Label Worker 2 as Cassandra Node 1
docker node update --label-add cassandra=node1 mapleopentech-swarm-worker-2-prod

# Label Worker 3 as Cassandra Node 2
docker node update --label-add cassandra=node2 mapleopentech-swarm-worker-3-prod

# Label Worker 4 as Cassandra Node 3
docker node update --label-add cassandra=node3 mapleopentech-swarm-worker-4-prod

# Verify labels
docker node inspect mapleopentech-swarm-worker-2-prod --format '{{.Spec.Labels}}'
# Should show: map[cassandra:node1]
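The three label commands follow one pattern: worker N+1 receives label cassandra=nodeN. A loop sketch that prints the equivalent commands (drop the `echo` to execute them directly):

```shell
#!/usr/bin/env bash
# Worker 2 -> cassandra=node1, Worker 3 -> node2, Worker 4 -> node3.
for n in 1 2 3; do
  worker="mapleopentech-swarm-worker-$((n + 1))-prod"
  echo docker node update --label-add "cassandra=node$n" "$worker"
done
```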

Step 3: Create Docker Stack File

On your manager, create the Cassandra stack:

# Create directory for stack files
mkdir -p ~/stacks
cd ~/stacks

# Create Cassandra stack file
vi cassandra-stack.yml

Copy and paste the following:

version: '3.8'

networks:
  maple-private-prod:
    external: true

volumes:
  cassandra-1-data:
  cassandra-2-data:
  cassandra-3-data:

services:
  cassandra-1:
    image: cassandra:5.0.4
    hostname: cassandra-1
    networks:
      - maple-private-prod
    environment:
      - CASSANDRA_CLUSTER_NAME=maple-private-prod-cluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3
      - MAX_HEAP_SIZE=512M
      - HEAP_NEWSIZE=128M
    volumes:
      - cassandra-1-data:/var/lib/cassandra
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.cassandra == node1
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 3
    healthcheck:
      test: ["CMD-SHELL", "cqlsh -e 'describe cluster' || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

  cassandra-2:
    image: cassandra:5.0.4
    hostname: cassandra-2
    networks:
      - maple-private-prod
    environment:
      - CASSANDRA_CLUSTER_NAME=maple-private-prod-cluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3
      - MAX_HEAP_SIZE=512M
      - HEAP_NEWSIZE=128M
    volumes:
      - cassandra-2-data:/var/lib/cassandra
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.cassandra == node2
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 3
    healthcheck:
      test: ["CMD-SHELL", "cqlsh -e 'describe cluster' || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

  cassandra-3:
    image: cassandra:5.0.4
    hostname: cassandra-3
    networks:
      - maple-private-prod
    environment:
      - CASSANDRA_CLUSTER_NAME=maple-private-prod-cluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3
      - MAX_HEAP_SIZE=512M
      - HEAP_NEWSIZE=128M
    volumes:
      - cassandra-3-data:/var/lib/cassandra
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.cassandra == node3
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 3
    healthcheck:
      test: ["CMD-SHELL", "cqlsh -e 'describe cluster' || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

Step 4: Create Shared Overlay Network

Before deploying any services, create the shared maple-private-prod network that all services will use:

# Create the maple-private-prod overlay network
docker network create \
  --driver overlay \
  --attachable \
  maple-private-prod

# Verify it was created
docker network ls | grep maple-private-prod
# Should show:
# abc123...      maple-private-prod   overlay   swarm

What is this network for?

  • Shared by all Maple services (Cassandra, Redis, Go backend, etc.)
  • Enables private communication between services
  • Services can reach each other by service name (e.g., redis, cassandra-1)
  • No public internet exposure

Step 5: Create Deployment Script

Create the sequential deployment script to avoid race conditions:

# Create the deployment script
vi deploy-cassandra.sh

Copy and paste the following script:

#!/bin/bash
#
# Cassandra Cluster Sequential Deployment Script
# This script deploys Cassandra nodes sequentially to avoid race conditions
# during cluster formation.
#

set -e

STACK_NAME="cassandra"
STACK_FILE="cassandra-stack.yml"

echo "=== Cassandra Cluster Sequential Deployment ==="
echo ""

# Check if stack file exists
if [ ! -f "$STACK_FILE" ]; then
    echo "ERROR: $STACK_FILE not found in current directory"
    exit 1
fi

echo "Step 1: Deploying cassandra-1 (seed node)..."
docker stack deploy -c "$STACK_FILE" "$STACK_NAME"

# Scale down cassandra-2 and cassandra-3 temporarily
docker service scale "${STACK_NAME}_cassandra-2=0" > /dev/null 2>&1
docker service scale "${STACK_NAME}_cassandra-3=0" > /dev/null 2>&1

echo "Waiting for cassandra-1 to become healthy (this takes ~5-8 minutes)..."
echo "Checking every 30 seconds..."

# Wait for cassandra-1 to be running
COUNTER=0
MAX_WAIT=20  # 20 * 30 seconds = 10 minutes max
while [ $COUNTER -lt $MAX_WAIT ]; do
    REPLICAS=$(docker service ls --filter "name=${STACK_NAME}_cassandra-1" --format "{{.Replicas}}")
    if [ "$REPLICAS" = "1/1" ]; then
        echo "✓ cassandra-1 is running"
        # Give it extra time to fully initialize
        echo "Waiting additional 2 minutes for cassandra-1 to fully initialize..."
        sleep 120
        break
    fi
    echo "  cassandra-1 status: $REPLICAS (waiting...)"
    sleep 30
    COUNTER=$((COUNTER + 1))
done

if [ $COUNTER -eq $MAX_WAIT ]; then
    echo "ERROR: cassandra-1 failed to start within 10 minutes"
    echo "Check logs with: docker service logs ${STACK_NAME}_cassandra-1"
    exit 1
fi

echo ""
echo "Step 2: Starting cassandra-2..."
docker service scale "${STACK_NAME}_cassandra-2=1"

echo "Waiting for cassandra-2 to become healthy (this takes ~5-8 minutes)..."
COUNTER=0
while [ $COUNTER -lt $MAX_WAIT ]; do
    REPLICAS=$(docker service ls --filter "name=${STACK_NAME}_cassandra-2" --format "{{.Replicas}}")
    if [ "$REPLICAS" = "1/1" ]; then
        echo "✓ cassandra-2 is running"
        echo "Waiting additional 2 minutes for cassandra-2 to join cluster..."
        sleep 120
        break
    fi
    echo "  cassandra-2 status: $REPLICAS (waiting...)"
    sleep 30
    COUNTER=$((COUNTER + 1))
done

if [ $COUNTER -eq $MAX_WAIT ]; then
    echo "ERROR: cassandra-2 failed to start within 10 minutes"
    echo "Check logs with: docker service logs ${STACK_NAME}_cassandra-2"
    exit 1
fi

echo ""
echo "Step 3: Starting cassandra-3..."
docker service scale "${STACK_NAME}_cassandra-3=1"

echo "Waiting for cassandra-3 to become healthy (this takes ~5-8 minutes)..."
COUNTER=0
while [ $COUNTER -lt $MAX_WAIT ]; do
    REPLICAS=$(docker service ls --filter "name=${STACK_NAME}_cassandra-3" --format "{{.Replicas}}")
    if [ "$REPLICAS" = "1/1" ]; then
        echo "✓ cassandra-3 is running"
        echo "Waiting additional 2 minutes for cassandra-3 to join cluster..."
        sleep 120
        break
    fi
    echo "  cassandra-3 status: $REPLICAS (waiting...)"
    sleep 30
    COUNTER=$((COUNTER + 1))
done

if [ $COUNTER -eq $MAX_WAIT ]; then
    echo "ERROR: cassandra-3 failed to start within 10 minutes"
    echo "Check logs with: docker service logs ${STACK_NAME}_cassandra-3"
    exit 1
fi

echo ""
echo "=== Deployment Complete ==="
echo ""
echo "All 3 Cassandra nodes should now be running and forming a cluster."
echo ""
echo "Verify cluster status by SSH'ing to any worker node and running:"
echo "  docker exec -it \$(docker ps -q --filter \"name=cassandra\") nodetool status"
echo ""
echo "You should see 3 nodes with status 'UN' (Up Normal)."
echo ""

Make it executable:

chmod +x deploy-cassandra.sh
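The script repeats the same polling loop three times; if you extend it for more nodes, the loop factors naturally into one function. A sketch only, under the same 1/1-replica convention the script uses (the name wait_for_replicas is illustrative):

```shell
#!/usr/bin/env bash
# Poll a swarm service until it reports 1/1 replicas, or give up after
# max_wait iterations of 30 seconds (default 20 iterations = 10 minutes).
wait_for_replicas() {
  local service="$1" max_wait="${2:-20}" counter=0
  while [ "$counter" -lt "$max_wait" ]; do
    local replicas
    replicas=$(docker service ls --filter "name=$service" --format "{{.Replicas}}")
    if [ "$replicas" = "1/1" ]; then
      echo "ok: $service is running"
      return 0
    fi
    echo "  $service status: $replicas (waiting...)"
    sleep 30
    counter=$((counter + 1))
  done
  echo "ERROR: $service failed to start" >&2
  return 1
}

# Usage (as in the script): wait_for_replicas cassandra_cassandra-2 && sleep 120
```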

Step 6: Deploy Cassandra Cluster Sequentially

⚠️ CRITICAL - READ THIS BEFORE DEPLOYING ⚠️

DO NOT use docker stack deploy -c cassandra-stack.yml cassandra directly!

Why? This creates a race condition: all 3 nodes start simultaneously, try to connect to each other before they're ready, give up, and form separate single-node clusters instead of one 3-node cluster. This is a classic distributed systems problem.

What happens if you do? Each node will run independently. Running nodetool status on any node will show only 1 node instead of 3. The cluster will appear broken.

The fix? Use the sequential deployment script below, which starts nodes one at a time:

ALWAYS use the deployment script:

# Run the sequential deployment script
./deploy-cassandra.sh

What this script does:

  1. Deploys cassandra-1 first and waits for it to be fully healthy (~5-8 minutes)
  2. Starts cassandra-2 and waits for it to join the cluster (~5-8 minutes)
  3. Starts cassandra-3 and waits for it to join the cluster (~5-8 minutes)
  4. Total deployment time: 15-25 minutes

Expected output:

=== Cassandra Cluster Sequential Deployment ===

Step 1: Deploying cassandra-1 (seed node)...
Creating service cassandra_cassandra-1
Creating service cassandra_cassandra-2
Creating service cassandra_cassandra-3
Waiting for cassandra-1 to become healthy (this takes ~5-8 minutes)...
Checking every 30 seconds...
  cassandra-1 status: 0/1 (waiting...)
  cassandra-1 status: 1/1 (waiting...)
✓ cassandra-1 is running
Waiting additional 2 minutes for cassandra-1 to fully initialize...

Step 2: Starting cassandra-2...
cassandra_cassandra-2 scaled to 1
Waiting for cassandra-2 to become healthy (this takes ~5-8 minutes)...
  cassandra-2 status: 0/1 (waiting...)
  cassandra-2 status: 1/1 (waiting...)
✓ cassandra-2 is running
Waiting additional 2 minutes for cassandra-2 to join cluster...

Step 3: Starting cassandra-3...
cassandra_cassandra-3 scaled to 1
Waiting for cassandra-3 to become healthy (this takes ~5-8 minutes)...
  cassandra-3 status: 0/1 (waiting...)
  cassandra-3 status: 1/1 (waiting...)
✓ cassandra-3 is running
Waiting additional 2 minutes for cassandra-3 to join cluster...

=== Deployment Complete ===

All 3 Cassandra nodes should now be running and forming a cluster.

If the script fails, check the service logs:

docker service logs cassandra_cassandra-1
docker service logs cassandra_cassandra-2
docker service logs cassandra_cassandra-3

Initialize Keyspaces

Step 1: Connect to Cassandra Node 1

# Get the node where cassandra-1 is running
docker service ps cassandra_cassandra-1 --format "{{.Node}}"
# Output: mapleopentech-swarm-worker-2-prod

# SSH to that worker
ssh dockeradmin@10.116.0.4  # Private IP of worker 2

# Find container ID
CONTAINER_ID=$(docker ps --filter "name=cassandra_cassandra-1" --format "{{.ID}}")

# Open CQL shell
docker exec -it $CONTAINER_ID cqlsh

Step 2: Create Keyspaces

-- MaplePress Backend
CREATE KEYSPACE IF NOT EXISTS maplepress
WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
}
AND DURABLE_WRITES = true;

-- MapleFile Backend
CREATE KEYSPACE IF NOT EXISTS maplefile
WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
}
AND DURABLE_WRITES = true;

-- mapleopentech Backend
CREATE KEYSPACE IF NOT EXISTS mapleopentech
WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
}
AND DURABLE_WRITES = true;

-- Verify
DESCRIBE KEYSPACES;

-- Exit CQL shell
exit

Expected output should show your keyspaces:

maplepress  maplefile  mapleopentech  system  system_auth  system_distributed  system_schema  system_traces  system_views  system_virtual_schema
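The keyspaces above use SimpleStrategy, which ignores datacenter and rack information. Because the stack file already configures GossipingPropertyFileSnitch with CASSANDRA_DC=datacenter1, NetworkTopologyStrategy is generally the more future-proof choice if you ever add a second datacenter; a sketch of the equivalent definition (same RF=3):

```sql
-- Equivalent keyspace using NetworkTopologyStrategy (maps DC name -> replica count).
-- 'datacenter1' must match CASSANDRA_DC in cassandra-stack.yml.
CREATE KEYSPACE IF NOT EXISTS maplepress
WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3
}
AND DURABLE_WRITES = true;
```

For a single-DC cluster the two strategies behave the same, so either works here.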

Verify Cluster Health

Step 1: Check Cluster Status

From inside cassandra-1 container:

# If not already in container:
CONTAINER_ID=$(docker ps --filter "name=cassandra_cassandra-1" --format "{{.ID}}")
docker exec -it $CONTAINER_ID bash

# Check cluster status
nodetool status

# Expected output:
# Datacenter: datacenter1
# =======================
# Status=Up/Down
# |/ State=Normal/Leaving/Joining/Moving
# --  Address      Load       Tokens  Owns   Host ID                               Rack
# UN  10.116.0.4   125 KiB    16      100.0% abc123...                             rack1
# UN  10.116.0.5   120 KiB    16      100.0% def456...                             rack1
# UN  10.116.0.6   118 KiB    16      100.0% ghi789...                             rack1

What to verify:

  • All 3 nodes show UN (Up and Normal)
  • Each node has an IP from your private network (10.116.0.x)
  • Load is distributed
  • Owns shows roughly 100% (data is replicated everywhere with RF=3)

Step 2: Test Write/Read

Still in cassandra-1 container:

# Open CQL shell
cqlsh

-- Create test keyspace (inside cqlsh, comments use --, not #)
CREATE KEYSPACE IF NOT EXISTS test
WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
};

USE test;

-- Create test table
CREATE TABLE IF NOT EXISTS users (
    user_id UUID PRIMARY KEY,
    username TEXT,
    email TEXT
);

-- Insert test data
INSERT INTO users (user_id, username, email)
VALUES (uuid(), 'testuser', 'test@example.com');

-- Read data
SELECT * FROM users;

-- Expected output:
--  user_id                              | email            | username
-- --------------------------------------+------------------+-----------
--  abc123-def456-...                    | test@example.com | testuser

-- Exit cqlsh
exit

# Back in the container shell; exit again to leave the container
exit

Step 3: Verify Replication

Connect to Node 2 and verify data is there:

# SSH to worker 3 (Node 2)
ssh dockeradmin@10.116.0.5

# Find cassandra-2 container
CONTAINER_ID=$(docker ps --filter "name=cassandra_cassandra-2" --format "{{.ID}}")

# Connect and query
docker exec -it $CONTAINER_ID cqlsh -e "SELECT * FROM test.users;"

# Should see the same test data!
# This proves replication is working.
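Replication can also be observed through consistency levels: with RF=3, a QUORUM query succeeds as long as 2 of the 3 replicas respond. Inside any cqlsh session (CONSISTENCY is a cqlsh shell command, not CQL):

```sql
-- Require 2 of 3 replicas to acknowledge each query
CONSISTENCY QUORUM;
SELECT * FROM test.users;

-- CONSISTENCY ONE needs only a single replica (the cqlsh default)
CONSISTENCY ONE;
```

If you later stop one Cassandra node, QUORUM queries should still succeed, which demonstrates the "survives 1 node failure" property directly.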

Step 4: Save Connection Details

Final Checkpoint - Update .env:

# On your local machine, add:
CASSANDRA_CLUSTER_NAME=maple-private-prod-cluster
CASSANDRA_DC=datacenter1
CASSANDRA_REPLICATION_FACTOR=3

# Connection endpoints (any node can be used)
CASSANDRA_CONTACT_POINTS=10.116.0.4,10.116.0.5,10.116.0.6
CASSANDRA_CQL_PORT=9042

# For application connections (use private IPs)
CASSANDRA_NODE_1_IP=10.116.0.4
CASSANDRA_NODE_2_IP=10.116.0.5
CASSANDRA_NODE_3_IP=10.116.0.6
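Client drivers take the contact points as a list, and the comma-separated CASSANDRA_CONTACT_POINTS value also splits cleanly in shell, e.g. for scripted health probes. A sketch assuming the .env values above:

```shell
#!/usr/bin/env bash
# Split CASSANDRA_CONTACT_POINTS into an array, one address per node.
CASSANDRA_CONTACT_POINTS="10.116.0.4,10.116.0.5,10.116.0.6"
CASSANDRA_CQL_PORT=9042

IFS=',' read -r -a points <<< "$CASSANDRA_CONTACT_POINTS"
for host in "${points[@]}"; do
  echo "would probe $host:$CASSANDRA_CQL_PORT"   # e.g. nc -zv "$host" 9042
done
```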

Cluster Management

Restarting the Cassandra Cluster

To restart all Cassandra nodes:

# On manager node
docker service update --force cassandra_cassandra-1
docker service update --force cassandra_cassandra-2
docker service update --force cassandra_cassandra-3

# Wait 5-8 minutes for all nodes to restart
# Then verify cluster health from any worker node (containers run on workers, not the manager):
docker exec -it $(docker ps -q --filter "name=cassandra") nodetool status

To restart a single node:

# Restart just one service
docker service update --force cassandra_cassandra-1

# Wait for it to rejoin the cluster
# Check status from any worker
docker exec -it $(docker ps -q --filter "name=cassandra") nodetool status

Shutting Down the Cassandra Cluster

To stop the entire stack (keeps data):

# On manager node
docker stack rm cassandra

# Services will be removed but volumes persist
# Data is safe and can be restored later

To verify shutdown:

# On manager node - check that services are gone
docker stack ls
# cassandra should not appear

# Volumes are on worker nodes, not manager
# SSH to each worker to verify volumes still exist (data is safe):

# On worker-2:
ssh dockeradmin@<worker-2-ip>
docker volume ls | grep cassandra
# Should show: cassandra_cassandra-1-data
exit

# On worker-3:
ssh dockeradmin@<worker-3-ip>
docker volume ls | grep cassandra
# Should show: cassandra_cassandra-2-data
exit

# On worker-4:
ssh dockeradmin@<worker-4-ip>
docker volume ls | grep cassandra
# Should show: cassandra_cassandra-3-data
exit

To restart after shutdown:

# Use the deployment script again
cd ~/stacks
./deploy-cassandra.sh

# Your data will be intact

Removing All Cassandra Data (Fresh Start)

⚠️ WARNING: This PERMANENTLY deletes all data. Use only when starting from scratch.

IMPORTANT: Volumes are stored on the worker nodes, not the manager node. You must SSH to each worker to delete them.

# Step 1: Remove the stack (from manager node)
docker stack rm cassandra

# Step 2: Wait for services to stop (30-60 seconds)
watch docker service ls
# Press Ctrl+C when cassandra services are gone

# Step 3: SSH to EACH worker and remove volumes (THIS DELETES ALL DATA!)

# On worker-2 (cassandra-1 node)
ssh dockeradmin@<worker-2-ip>
docker volume ls | grep cassandra  # Verify volume exists
docker volume rm cassandra_cassandra-1-data
exit

# On worker-3 (cassandra-2 node)
ssh dockeradmin@<worker-3-ip>
docker volume ls | grep cassandra  # Verify volume exists
docker volume rm cassandra_cassandra-2-data
exit

# On worker-4 (cassandra-3 node)
ssh dockeradmin@<worker-4-ip>
docker volume ls | grep cassandra  # Verify volume exists
docker volume rm cassandra_cassandra-3-data
exit

# Step 4: Deploy fresh cluster (from manager node)
cd ~/stacks
./deploy-cassandra.sh

# You now have a fresh cluster with no data
# You'll need to recreate keyspaces and tables

Why volumes are on worker nodes:

  • Docker Swarm creates volumes on the nodes where containers actually run
  • Manager node only orchestrates - it doesn't store data
  • Each worker node has its own volume for the Cassandra container running on it

When to use this:

  • Testing deployment from scratch
  • Recovering from corrupted data
  • Major version upgrades requiring fresh install
  • Development/staging environments

When NOT to use this:

  • Production environments (use backups and restore instead)
  • When you just need to restart nodes
  • When troubleshooting connectivity issues

Scaling Considerations

Can you scale to more than 3 nodes?

Yes, but you'll need to:

  1. Create additional worker droplets
  2. Update cassandra-stack.yml to add cassandra-4, cassandra-5, etc.
  3. Update the deployment script
  4. Run nodetool rebuild on new nodes

Recommended minimum: 3 nodes

Recommended maximum with 2GB RAM: 3-5 nodes

For production with proper 8GB RAM droplets, 5-7 nodes is common for large deployments.


Troubleshooting

Problem: Nodes Not Joining Cluster (Race Condition)

Symptom: Each node shows only itself when running nodetool status - no 3-node cluster formed.

Root Cause: If you deployed using docker stack deploy directly instead of the deployment script, all 3 nodes started simultaneously. They each tried to connect to the seed nodes before the others were ready, gave up, and formed separate single-node clusters.

Solution - Force Rolling Restart:

# On manager node, force update all services (triggers restart)
docker service update --force cassandra_cassandra-1
docker service update --force cassandra_cassandra-2
docker service update --force cassandra_cassandra-3

# Wait 5-8 minutes for each to restart and discover each other
# Then verify cluster from any worker:
docker exec -it $(docker ps -q --filter "name=cassandra") nodetool status

# You should now see all 3 nodes with UN status

Prevention: Always use the deploy-cassandra.sh script for initial deployment to avoid this race condition.

Problem: Nodes Not Joining Cluster (Other Causes)

Symptom: nodetool status shows only 1 node, or nodes show DN (Down)

Solutions:

  1. Check firewall allows Cassandra ports:

    # On each worker:
    sudo ufw status verbose | grep 7000
    sudo ufw status verbose | grep 9042
    
    # Should see rules allowing from 10.116.0.0/16 (your VPC subnet)
    
  2. Verify seeds configuration:

    # Check service environment
    docker service inspect cassandra_cassandra-1 --format '{{.Spec.TaskTemplate.ContainerSpec.Env}}'
    
    # Should see: CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3
    
  3. Check inter-node connectivity:

    # From cassandra-1 container (install tools first):
    apt-get update && apt-get install -y dnsutils netcat-openbsd
    
    # Test DNS resolution:
    nslookup cassandra-2
    nslookup cassandra-3
    
    # Test port connectivity:
    nc -zv cassandra-2 7000
    nc -zv cassandra-3 7000
    
    # Should all succeed
    
  4. Check service placement:

    # Verify services are on correct nodes
    docker service ps cassandra_cassandra-1
    docker service ps cassandra_cassandra-2
    docker service ps cassandra_cassandra-3
    
    # Each should be on its labeled node
    

Problem: Slow Startup

Symptom: Services stuck at 0/1 replicas for > 8 minutes

Solutions:

  1. Check logs for errors:

    docker service logs cassandra_cassandra-1 --tail 50
    
  2. Verify memory constraints:

    # With 2GB RAM, 512MB heap is configured
    # This is already minimal - slower startup is expected
    # Be patient and wait up to 10 minutes
    
  3. Check available memory on worker nodes:

    # SSH to a worker and check memory
    free -h
    # Should show at least 1.5GB available after OS overhead
    
  4. Check disk space:

    df -h
    # Should have plenty of free space
    

Problem: Can't Connect from Application

Symptom: Application can't reach Cassandra on port 9042

Solutions:

  1. Ensure application is on same overlay network:

    # In your application stack file:
    networks:
      maple-private-prod:
        external: true
    
  2. Test connectivity from application container:

    # From app container:
    nc -zv cassandra-1 9042
    # Should connect
    
  3. Use service names in application config:

    # Use Docker Swarm service names (recommended):
    CASSANDRA_CONTACT_POINTS=cassandra-1,cassandra-2,cassandra-3
    # These resolve automatically on the overlay network
    

Problem: Node Shows UJ (Up, Joining)

Symptom: Node stuck in joining state

Solution:

# This is normal for first 5-10 minutes with reduced memory
# Wait longer and check again

# If stuck > 15 minutes, restart that service:
docker service update --force cassandra_cassandra-2

Problem: Out of Memory Errors

Symptom: Container keeps restarting, logs show "Out of memory" or "Cannot allocate memory"

Solution:

This means 2GB RAM is insufficient. You have two options:

  1. Upgrade droplets to 4GB RAM minimum (recommended):

    • Resize each worker droplet in DigitalOcean
    • Update stack file to use MAX_HEAP_SIZE=1G and HEAP_NEWSIZE=256M
    • Redeploy: docker stack rm cassandra && docker stack deploy -c cassandra-stack.yml cassandra
  2. Further reduce heap (not recommended):

    # In cassandra-stack.yml, change to:
    - MAX_HEAP_SIZE=384M
    - HEAP_NEWSIZE=96M
    

    This will severely limit functionality and is not viable for any real workload.

Problem: Keyspace Already Exists Error

Symptom: AlreadyExists error when creating keyspaces

Solution:

This is normal if you've run the script before. The IF NOT EXISTS clause prevents actual errors. Your keyspaces are already created.

Installing Debugging Tools

When troubleshooting, you'll often need diagnostic tools inside the Cassandra containers. Here's how to install them:

Quick install of all useful debugging tools:

# SSH to any worker node, then run:
docker exec -it $(docker ps -q --filter "name=cassandra") bash -c "apt-get update && apt-get install -y dnsutils netcat-openbsd iputils-ping curl vim"

What this installs:

  • dnsutils - DNS tools (nslookup, dig)
  • netcat-openbsd - Network connectivity testing (nc)
  • iputils-ping - Ping utility
  • curl - HTTP testing
  • vim - Text editor

Example debugging workflow:

# Get into a Cassandra container
docker exec -it $(docker ps -q --filter "name=cassandra") bash

# Install tools (only needed once per container)
apt-get update && apt-get install -y dnsutils netcat-openbsd

# Test DNS resolution
nslookup cassandra-1
nslookup cassandra-2
nslookup cassandra-3

# Test port connectivity
nc -zv cassandra-1 7000  # Gossip port
nc -zv cassandra-2 9042  # CQL port
nc -zv cassandra-3 7000  # Gossip port

# Check cluster status
nodetool status

# Exit container
exit

Note: These tools are NOT persistent. If a container restarts, you'll need to reinstall them. For permanent installation, you would need to create a custom Docker image.


Next Steps

You now have:

  • 3-node Cassandra cluster with replication factor 3
  • High availability (survives 1 node failure)
  • Keyspaces ready for application data
  • Swarm-managed containers with auto-restart

Next guides:

  • Redis Setup - Cache layer for applications
  • Application Deployment - Deploy backend services
  • Monitoring - Set up cluster monitoring

Performance Notes

Hardware Sizing

Current setup (1 vCPU, 2GB RAM per node):

  • NOT suitable for production - development/testing only
  • Handles: ~500-1,000 writes/sec, ~5,000 reads/sec
  • Storage: 50GB per node (150GB raw total; 50GB usable with RF=3)
  • Expected issues: slow queries, GC pauses, limited connections
  • Total cost: 3 nodes × $12 = $36/month

Recommended production setup (4 vCPU, 8GB RAM per node):

  • Good for: Staging, small-to-medium production
  • Handles: ~10,000 writes/sec, ~50,000 reads/sec
  • Storage: 160GB per node (480GB raw total; 160GB usable with RF=3)
  • Total cost: 3 nodes × $48 = $144/month

For larger production:

  • Scale to 8 vCPU, 16GB RAM
  • Add more workers (5-node, 7-node cluster)
  • Use dedicated CPU droplets

Heap Size Tuning

Current: 512MB heap (with 2GB RAM total)

  • Absolute minimum for Cassandra to run
  • Expect frequent garbage collection
  • Limited cache effectiveness
  • Not recommended for production

Recommended configurations:

  • 2GB RAM: 512MB heap (current - minimal)
  • 4GB RAM: 1GB heap (small production)
  • 8GB RAM: 2GB heap (recommended production)
  • 16GB RAM: 4GB heap (high-traffic production)
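The table above follows a simple quarter-of-RAM rule of thumb, with HEAP_NEWSIZE at roughly a quarter of the heap (Cassandra's own auto-sizing is more nuanced and caps heap around 8GB). As arithmetic:

```shell
#!/usr/bin/env bash
# Heap sizing rule of thumb behind the table: heap = RAM/4, HEAP_NEWSIZE = heap/4.
# 2048MB RAM yields the 512M/128M values used in cassandra-stack.yml.
for ram_mb in 2048 4096 8192 16384; do
  heap=$(( ram_mb / 4 ))
  newsize=$(( heap / 4 ))
  echo "${ram_mb}MB RAM -> MAX_HEAP_SIZE=${heap}M HEAP_NEWSIZE=${newsize}M"
done
```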

Replication Factor

Current: RF=3 (recommended for production)

Options:

  • RF=1: No redundancy, not recommended for production
  • RF=2: Can tolerate 1 failure, less storage overhead
  • RF=3: Best for production, tolerates 1 failure safely
  • RF=5: For mission-critical data (requires 5+ nodes)
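Failure tolerance depends on the consistency level as well as RF: a QUORUM operation needs floor(RF/2)+1 replicas, so the surviving replicas must still reach that count. A quick calculation for the options above:

```shell
#!/usr/bin/env bash
# Node failures tolerated at QUORUM consistency for each replication factor.
for rf in 1 2 3 5; do
  quorum=$(( rf / 2 + 1 ))
  tolerated=$(( rf - quorum ))
  echo "RF=$rf quorum=$quorum tolerates $tolerated down node(s) at QUORUM"
done
```

Note that RF=2 tolerates a failure only at consistency ONE; at QUORUM it tolerates none, which is part of why RF=3 is the production standard.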

Upgrading to Production-Ready Configuration

If you started with 2GB RAM droplets and need to upgrade:

Step 1: Resize Droplets in DigitalOcean

  1. Go to each worker droplet (workers 2, 3, 4)
  2. Click Resize
  3. Select 8GB RAM / 4 vCPU plan
  4. Complete resize (droplets will reboot)

Step 2: Update Stack Configuration

SSH to manager and update the stack file:

ssh dockeradmin@<MANAGER_PUBLIC_IP>
cd ~/stacks

# Edit cassandra-stack.yml
vi cassandra-stack.yml

# Change these lines in ALL THREE services:
# FROM:
- MAX_HEAP_SIZE=512M
- HEAP_NEWSIZE=128M

# TO:
- MAX_HEAP_SIZE=2G
- HEAP_NEWSIZE=512M

Step 3: Redeploy

# Remove old stack
docker stack rm cassandra

# Wait for cleanup
sleep 30

# Deploy with new configuration
docker stack deploy -c cassandra-stack.yml cassandra

# Monitor startup
watch -n 2 'docker stack services cassandra'

Document Version: 1.1
Last Updated: November 3, 2025
Maintained By: Infrastructure Team

Changelog:

  • v1.1 (Nov 3, 2025): Updated for 2GB RAM droplets with reduced heap (512MB) - NOT production ready
  • v1.0 (Nov 3, 2025): Initial version with 8GB RAM droplets
