# Cassandra Cluster Setup (3-Node)

**Prerequisites**: Complete [01_init_docker_swarm.md](01_init_docker_swarm.md) first
**Time to Complete**: 60-90 minutes

**What You'll Build**:
- 3 new DigitalOcean droplets (workers 2, 3, 4)
- 3-node Cassandra cluster using Docker Swarm
- Replication factor 3 for high availability
- Private network communication only

---

## Table of Contents

1. [Overview](#overview)
2. [Create Cassandra Worker Droplets](#create-cassandra-worker-droplets)
3. [Configure Workers and Join Swarm](#configure-workers-and-join-swarm)
4. [Deploy Cassandra Cluster](#deploy-cassandra-cluster)
5. [Initialize Keyspaces](#initialize-keyspaces)
6. [Verify Cluster Health](#verify-cluster-health)
7. [Cluster Management](#cluster-management)
8. [Troubleshooting](#troubleshooting)

---

## Overview

### Architecture

```
Swarm Manager (existing):
├── mapleopentech-swarm-manager-1-prod (10.116.0.2)
└── Controls cluster, no Cassandra

Existing Worker:
└── mapleopentech-swarm-worker-1-prod (10.116.0.3)
    └── Available for other services

Cassandra Cluster (NEW):
├── mapleopentech-swarm-worker-2-prod (10.116.0.4)
│   └── Cassandra Node 1
├── mapleopentech-swarm-worker-3-prod (10.116.0.5)
│   └── Cassandra Node 2
└── mapleopentech-swarm-worker-4-prod (10.116.0.6)
    └── Cassandra Node 3
```

### Cassandra Configuration

- **Version**: Cassandra 5.0.4
- **Cluster Name**: mapleopentech-private-prod-cluster
- **Replication Factor**: 3 (every piece of data stored on all 3 nodes)
- **Data Center**: datacenter1
- **Heap Size**: 512MB (reduced for 2GB RAM constraint)
- **Communication**: Private network only (secure)

**⚠️ IMPORTANT - Memory Constraints:**

This configuration uses minimal 2GB RAM droplets with 512MB heap size. This is **NOT recommended for production** use.
Expect: - Limited performance (max ~1,000 writes/sec vs 10,000 with proper sizing) - Potential stability issues under load - Frequent garbage collection pauses - Limited concurrent connection capacity **For production use**, upgrade to 8GB RAM droplets with 2GB heap size. ### Why 3 Nodes? - **High Availability**: Cluster survives 1 node failure - **Replication Factor 3**: Every piece of data stored on all 3 nodes - **Read Performance**: Queries can hit any node - **Write Performance**: Writes distributed across cluster - **Production Standard**: Minimum for HA Cassandra --- ## Create Cassandra Worker Droplets ### Step 1: Create Worker 2 (Cassandra Node 1) **From DigitalOcean Dashboard:** 1. Go to https://cloud.digitalocean.com/ 2. Click **Create** → **Droplets** **Droplet Configuration:** | Setting | Value | |---------|-------| | **Region** | Toronto 1 (TOR1) - SAME as existing | | **Image** | Ubuntu 24.04 LTS x64 | | **Droplet Type** | Regular Intel | | **CPU Options** | 1 vCPU, 2 GB RAM ($12/month) | | **Storage** | 50 GB SSD | | **VPC** | default-tor1 (auto-selected) | | **SSH Key** | Select your key | | **Hostname** | `mapleopentech-swarm-worker-2-prod` | | **Tags** | `production`, `cassandra`, `database` | Click **Create Droplet** and wait 60 seconds. 
**✅ Checkpoint - Save to `.env`:** ```bash # On your local machine: SWARM_WORKER_2_HOSTNAME=mapleopentech-swarm-worker-2-prod SWARM_WORKER_2_PUBLIC_IP=159.65.123.47 # Your public IP SWARM_WORKER_2_PRIVATE_IP=10.116.0.4 # Your private IP CASSANDRA_NODE_1_IP=10.116.0.4 # Same as private IP ``` ### Step 2: Create Worker 3 (Cassandra Node 2) Repeat with these values: | Setting | Value | |---------|-------| | **Hostname** | `mapleopentech-swarm-worker-3-prod` | | All other settings | Same as Worker 2 | **✅ Checkpoint - Save to `.env`:** ```bash SWARM_WORKER_3_HOSTNAME=mapleopentech-swarm-worker-3-prod SWARM_WORKER_3_PUBLIC_IP=159.65.123.48 # Your public IP SWARM_WORKER_3_PRIVATE_IP=10.116.0.5 # Your private IP CASSANDRA_NODE_2_IP=10.116.0.5 # Same as private IP ``` ### Step 3: Create Worker 4 (Cassandra Node 3) Repeat with these values: | Setting | Value | |---------|-------| | **Hostname** | `mapleopentech-swarm-worker-4-prod` | | All other settings | Same as Worker 2 | **✅ Checkpoint - Save to `.env`:** ```bash SWARM_WORKER_4_HOSTNAME=mapleopentech-swarm-worker-4-prod SWARM_WORKER_4_PUBLIC_IP=159.65.123.49 # Your public IP SWARM_WORKER_4_PRIVATE_IP=10.116.0.6 # Your private IP CASSANDRA_NODE_3_IP=10.116.0.6 # Same as private IP ``` ### Step 4: Verify All Droplets in Same VPC 1. Go to **Networking** → **VPC** → Click `default-tor1` 2. Should see 5 droplets total: - mapleopentech-swarm-manager-1-prod (10.116.0.2) - mapleopentech-swarm-worker-1-prod (10.116.0.3) - mapleopentech-swarm-worker-2-prod (10.116.0.4) - mapleopentech-swarm-worker-3-prod (10.116.0.5) - mapleopentech-swarm-worker-4-prod (10.116.0.6) --- ## Configure Workers and Join Swarm Follow these steps for **EACH** of the 3 new workers (workers 2, 3, 4). 
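Before touching the workers, it can help to confirm the `.env` checkpoints from the droplet-creation steps were actually saved. The sketch below is an optional helper, not part of the guide's required flow; the here-doc sample stands in for your real `.env` file, so substitute the real path in practice.

```shell
#!/usr/bin/env bash
# Optional sanity check: verify the .env checkpoints captured all three
# Cassandra node IPs. The here-doc sample below stands in for your real .env.
set -u

ENV_FILE=$(mktemp)
cat > "$ENV_FILE" <<'EOF'
CASSANDRA_NODE_1_IP=10.116.0.4
CASSANDRA_NODE_2_IP=10.116.0.5
CASSANDRA_NODE_3_IP=10.116.0.6
EOF

missing=0
for n in 1 2 3; do
  if grep -q "^CASSANDRA_NODE_${n}_IP=" "$ENV_FILE"; then
    echo "node $n IP recorded"
  else
    echo "node $n IP MISSING"
    missing=1
  fi
done
rm -f "$ENV_FILE"

if [ "$missing" -eq 0 ]; then
  echo "all three Cassandra node IPs saved"
fi
```

If any entry reports MISSING, go back to the matching checkpoint above before proceeding — the later deployment and connection steps all reference these variables.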
### Worker 2 Setup #### Step 1: Initial SSH as Root ```bash # SSH to Worker 2 ssh root@159.65.123.47 # Replace with YOUR worker 2 public IP # You should see: root@mapleopentech-swarm-worker-2-prod:~# ``` #### Step 2: System Updates and Create Admin User ```bash # Update system apt update && apt upgrade -y # Install essentials apt install -y curl wget apt-transport-https ca-certificates gnupg lsb-release # Create dockeradmin user adduser dockeradmin # Use the SAME password as other nodes # Add to sudo group usermod -aG sudo dockeradmin # Copy SSH keys rsync --archive --chown=dockeradmin:dockeradmin ~/.ssh /home/dockeradmin ``` #### Step 3: Secure SSH Configuration ```bash # Edit SSH config vi /etc/ssh/sshd_config # Update these lines: PermitRootLogin no PasswordAuthentication no PubkeyAuthentication yes MaxAuthTries 3 LoginGraceTime 60 # Save and restart SSH systemctl restart ssh ``` #### Step 4: Reconnect as dockeradmin ```bash # Exit root session exit # SSH back as dockeradmin ssh dockeradmin@159.65.123.47 # Replace with YOUR worker 2 public IP # You should see: dockeradmin@mapleopentech-swarm-worker-2-prod:~$ ``` #### Step 5: Install Docker ```bash # Install Docker curl -fsSL https://get.docker.com -o get-docker.sh sudo sh get-docker.sh # Add dockeradmin to docker group sudo usermod -aG docker dockeradmin newgrp docker # Verify docker --version # Enable Docker sudo systemctl enable docker sudo systemctl status docker # Press 'q' to exit ``` #### Step 6: Configure Firewall ```bash # Install UFW sudo apt install ufw -y # Allow SSH sudo ufw allow 22/tcp # Allow Docker Swarm ports (replace with YOUR VPC subnet from .env) sudo ufw allow from 10.116.0.0/16 to any port 2377 proto tcp sudo ufw allow from 10.116.0.0/16 to any port 7946 sudo ufw allow from 10.116.0.0/16 to any port 4789 proto udp # Allow Cassandra ports (private network only) # 7000: Inter-node communication # 7001: Inter-node communication (TLS) # 9042: CQL native transport (client connections) sudo ufw 
allow from 10.116.0.0/16 to any port 7000 proto tcp sudo ufw allow from 10.116.0.0/16 to any port 7001 proto tcp sudo ufw allow from 10.116.0.0/16 to any port 9042 proto tcp # Enable firewall sudo ufw --force enable # Check status sudo ufw status verbose ``` #### Step 7: Join Docker Swarm ```bash # Use the join command from Step 8 of 01_init_docker_swarm.md # Replace with YOUR actual token and manager private IP: docker swarm join --token SWMTKN-1-4abc123xyz789verylongtoken 10.116.0.2:2377 # Expected output: # This node joined a swarm as a worker. ``` ✅ **Worker 2 complete!** Repeat Steps 1-7 for Workers 3 and 4. ### Worker 3 Setup Repeat Steps 1-7 above, replacing: - Public IP: Use Worker 3's public IP (159.65.123.48 example) - Hostname: `mapleopentech-swarm-worker-3-prod` ### Worker 4 Setup Repeat Steps 1-7 above, replacing: - Public IP: Use Worker 4's public IP (159.65.123.49 example) - Hostname: `mapleopentech-swarm-worker-4-prod` --- ## Deploy Cassandra Cluster ### Step 1: Verify All Workers Joined **From your manager node:** ```bash # SSH to manager ssh dockeradmin@159.65.123.45 # Your manager's public IP # List all swarm nodes docker node ls # Expected output (5 nodes total): # ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS # abc123... * mapleopentech-swarm-manager-1-prod Ready Active Leader # def456... mapleopentech-swarm-worker-1-prod Ready Active # ghi789... mapleopentech-swarm-worker-2-prod Ready Active # jkl012... mapleopentech-swarm-worker-3-prod Ready Active # mno345... 
mapleopentech-swarm-worker-4-prod Ready Active ``` ### Step 2: Label Cassandra Nodes Apply labels so Cassandra services deploy to correct nodes: ```bash # Label Worker 2 as Cassandra Node 1 docker node update --label-add cassandra=node1 mapleopentech-swarm-worker-2-prod # Label Worker 3 as Cassandra Node 2 docker node update --label-add cassandra=node2 mapleopentech-swarm-worker-3-prod # Label Worker 4 as Cassandra Node 3 docker node update --label-add cassandra=node3 mapleopentech-swarm-worker-4-prod # Verify labels docker node inspect mapleopentech-swarm-worker-2-prod --format '{{.Spec.Labels}}' # Should show: map[cassandra:node1] ``` ### Step 3: Create Docker Stack File **On your manager**, create the Cassandra stack: ```bash # Create directory for stack files mkdir -p ~/stacks cd ~/stacks # Create Cassandra stack file vi cassandra-stack.yml ``` Copy and paste the following: ```yaml version: '3.8' networks: mapleopentech-private-prod: external: true volumes: cassandra-1-data: cassandra-2-data: cassandra-3-data: services: cassandra-1: image: cassandra:5.0.4 hostname: cassandra-1 networks: - mapleopentech-private-prod environment: - CASSANDRA_CLUSTER_NAME=mapleopentech-private-prod-cluster - CASSANDRA_DC=datacenter1 - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch - CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3 - MAX_HEAP_SIZE=512M - HEAP_NEWSIZE=128M volumes: - cassandra-1-data:/var/lib/cassandra deploy: replicas: 1 placement: constraints: - node.labels.cassandra == node1 restart_policy: condition: on-failure delay: 10s max_attempts: 3 healthcheck: test: ["CMD-SHELL", "cqlsh -e 'describe cluster' || exit 1"] interval: 30s timeout: 10s retries: 5 start_period: 120s cassandra-2: image: cassandra:5.0.4 hostname: cassandra-2 networks: - mapleopentech-private-prod environment: - CASSANDRA_CLUSTER_NAME=mapleopentech-private-prod-cluster - CASSANDRA_DC=datacenter1 - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch - 
CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3 - MAX_HEAP_SIZE=512M - HEAP_NEWSIZE=128M volumes: - cassandra-2-data:/var/lib/cassandra deploy: replicas: 1 placement: constraints: - node.labels.cassandra == node2 restart_policy: condition: on-failure delay: 10s max_attempts: 3 healthcheck: test: ["CMD-SHELL", "cqlsh -e 'describe cluster' || exit 1"] interval: 30s timeout: 10s retries: 5 start_period: 120s cassandra-3: image: cassandra:5.0.4 hostname: cassandra-3 networks: - mapleopentech-private-prod environment: - CASSANDRA_CLUSTER_NAME=mapleopentech-private-prod-cluster - CASSANDRA_DC=datacenter1 - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch - CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3 - MAX_HEAP_SIZE=512M - HEAP_NEWSIZE=128M volumes: - cassandra-3-data:/var/lib/cassandra deploy: replicas: 1 placement: constraints: - node.labels.cassandra == node3 restart_policy: condition: on-failure delay: 10s max_attempts: 3 healthcheck: test: ["CMD-SHELL", "cqlsh -e 'describe cluster' || exit 1"] interval: 30s timeout: 10s retries: 5 start_period: 120s ``` ### Step 4: Create Shared Overlay Network Before deploying any services, create the shared `mapleopentech-private-prod` network that all services will use: ```bash # Create the mapleopentech-private-prod overlay network docker network create \ --driver overlay \ --attachable \ mapleopentech-private-prod # Verify it was created docker network ls | grep mapleopentech-private-prod # Should show: # abc123... mapleopentech-private-prod overlay swarm ``` **What is this network for?** - Shared by all Maple services (Cassandra, Redis, Go backend, etc.) 
- Enables private communication between services - Services can reach each other by service name (e.g., `redis`, `cassandra-1`) - No public internet exposure ### Step 5: Create Deployment Script Create the sequential deployment script to avoid race conditions: ```bash # Create the deployment script vi deploy-cassandra.sh ``` Copy and paste the following script: ```bash #!/bin/bash # # Cassandra Cluster Sequential Deployment Script # This script deploys Cassandra nodes sequentially to avoid race conditions # during cluster formation. # set -e STACK_NAME="cassandra" STACK_FILE="cassandra-stack.yml" echo "=== Cassandra Cluster Sequential Deployment ===" echo "" # Check if stack file exists if [ ! -f "$STACK_FILE" ]; then echo "ERROR: $STACK_FILE not found in current directory" exit 1 fi echo "Step 1: Deploying cassandra-1 (seed node)..." docker stack deploy -c "$STACK_FILE" "$STACK_NAME" # Scale down cassandra-2 and cassandra-3 temporarily docker service scale "${STACK_NAME}_cassandra-2=0" > /dev/null 2>&1 docker service scale "${STACK_NAME}_cassandra-3=0" > /dev/null 2>&1 echo "Waiting for cassandra-1 to become healthy (this takes ~5-8 minutes)..." echo "Checking every 30 seconds..." # Wait for cassandra-1 to be running COUNTER=0 MAX_WAIT=20 # 20 * 30 seconds = 10 minutes max while [ $COUNTER -lt $MAX_WAIT ]; do REPLICAS=$(docker service ls --filter "name=${STACK_NAME}_cassandra-1" --format "{{.Replicas}}") if [ "$REPLICAS" = "1/1" ]; then echo "✓ cassandra-1 is running" # Give it extra time to fully initialize echo "Waiting additional 2 minutes for cassandra-1 to fully initialize..." sleep 120 break fi echo " cassandra-1 status: $REPLICAS (waiting...)" sleep 30 COUNTER=$((COUNTER + 1)) done if [ $COUNTER -eq $MAX_WAIT ]; then echo "ERROR: cassandra-1 failed to start within 10 minutes" echo "Check logs with: docker service logs ${STACK_NAME}_cassandra-1" exit 1 fi echo "" echo "Step 2: Starting cassandra-2..." 
docker service scale "${STACK_NAME}_cassandra-2=1" echo "Waiting for cassandra-2 to become healthy (this takes ~5-8 minutes)..." COUNTER=0 while [ $COUNTER -lt $MAX_WAIT ]; do REPLICAS=$(docker service ls --filter "name=${STACK_NAME}_cassandra-2" --format "{{.Replicas}}") if [ "$REPLICAS" = "1/1" ]; then echo "✓ cassandra-2 is running" echo "Waiting additional 2 minutes for cassandra-2 to join cluster..." sleep 120 break fi echo " cassandra-2 status: $REPLICAS (waiting...)" sleep 30 COUNTER=$((COUNTER + 1)) done if [ $COUNTER -eq $MAX_WAIT ]; then echo "ERROR: cassandra-2 failed to start within 10 minutes" echo "Check logs with: docker service logs ${STACK_NAME}_cassandra-2" exit 1 fi echo "" echo "Step 3: Starting cassandra-3..." docker service scale "${STACK_NAME}_cassandra-3=1" echo "Waiting for cassandra-3 to become healthy (this takes ~5-8 minutes)..." COUNTER=0 while [ $COUNTER -lt $MAX_WAIT ]; do REPLICAS=$(docker service ls --filter "name=${STACK_NAME}_cassandra-3" --format "{{.Replicas}}") if [ "$REPLICAS" = "1/1" ]; then echo "✓ cassandra-3 is running" echo "Waiting additional 2 minutes for cassandra-3 to join cluster..." sleep 120 break fi echo " cassandra-3 status: $REPLICAS (waiting...)" sleep 30 COUNTER=$((COUNTER + 1)) done if [ $COUNTER -eq $MAX_WAIT ]; then echo "ERROR: cassandra-3 failed to start within 10 minutes" echo "Check logs with: docker service logs ${STACK_NAME}_cassandra-3" exit 1 fi echo "" echo "=== Deployment Complete ===" echo "" echo "All 3 Cassandra nodes should now be running and forming a cluster." echo "" echo "Verify cluster status by SSH'ing to any worker node and running:" echo " docker exec -it \$(docker ps -q --filter \"name=cassandra\") nodetool status" echo "" echo "You should see 3 nodes with status 'UN' (Up Normal)." 
echo ""
```

Make it executable:

```bash
chmod +x deploy-cassandra.sh
```

### Step 6: Deploy Cassandra Cluster Sequentially

**⚠️ CRITICAL - READ THIS BEFORE DEPLOYING ⚠️**

**DO NOT use `docker stack deploy -c cassandra-stack.yml cassandra` directly!**

**Why?** This creates a **race condition**: all 3 nodes start simultaneously, try to connect to each other before they're ready, give up, and form separate single-node clusters instead of one 3-node cluster. This is a classic distributed systems problem.

**What happens if you do?** Each node will run independently. Running `nodetool status` on any node will show only 1 node instead of 3. The cluster will appear broken.

**The fix?** Use the sequential deployment script, which starts nodes one at a time.

**ALWAYS use the deployment script:**

```bash
# Run the sequential deployment script
./deploy-cassandra.sh
```

**What this script does:**
1. Deploys cassandra-1 first and waits for it to be fully healthy (~5-8 minutes)
2. Starts cassandra-2 and waits for it to join the cluster (~5-8 minutes)
3. Starts cassandra-3 and waits for it to join the cluster (~5-8 minutes)
4. Total deployment time: **15-25 minutes**

**Expected output** (the `mapleopentech-private-prod` network already exists, so no network is created here):

```
=== Cassandra Cluster Sequential Deployment ===

Step 1: Deploying cassandra-1 (seed node)...
Creating service cassandra_cassandra-1
Creating service cassandra_cassandra-2
Creating service cassandra_cassandra-3
cassandra_cassandra-2 scaled to 0
cassandra_cassandra-3 scaled to 0
Waiting for cassandra-1 to become healthy (this takes ~5-8 minutes)...
Checking every 30 seconds...
 cassandra-1 status: 0/1 (waiting...)
✓ cassandra-1 is running
Waiting additional 2 minutes for cassandra-1 to fully initialize...

Step 2: Starting cassandra-2...
cassandra_cassandra-2 scaled to 1
Waiting for cassandra-2 to become healthy (this takes ~5-8 minutes)...
 cassandra-2 status: 0/1 (waiting...)
✓ cassandra-2 is running
Waiting additional 2 minutes for cassandra-2 to join cluster...

Step 3: Starting cassandra-3...
cassandra_cassandra-3 scaled to 1
Waiting for cassandra-3 to become healthy (this takes ~5-8 minutes)...
 cassandra-3 status: 0/1 (waiting...)
✓ cassandra-3 is running
Waiting additional 2 minutes for cassandra-3 to join cluster...

=== Deployment Complete ===

All 3 Cassandra nodes should now be running and forming a cluster.
```

**If the script fails**, check the service logs:

```bash
docker service logs cassandra_cassandra-1
docker service logs cassandra_cassandra-2
docker service logs cassandra_cassandra-3
```

---

## Initialize Keyspaces

### Step 1: Connect to Cassandra Node 1

```bash
# Get the node where cassandra-1 is running
docker service ps cassandra_cassandra-1 --format "{{.Node}}"
# Output: mapleopentech-swarm-worker-2-prod

# SSH to that worker
ssh dockeradmin@10.116.0.4  # Private IP of worker 2

# Find container ID
CONTAINER_ID=$(docker ps --filter "name=cassandra_cassandra-1" --format "{{.ID}}")

# Open CQL shell
docker exec -it $CONTAINER_ID cqlsh
```

### Step 2: Create Keyspaces

```sql
-- MaplePress Backend
CREATE KEYSPACE IF NOT EXISTS maplepress
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
}
AND DURABLE_WRITES = true;

-- MapleFile Backend
CREATE KEYSPACE IF NOT EXISTS maplefile
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
}
AND DURABLE_WRITES = true;

-- mapleopentech Backend
CREATE KEYSPACE IF NOT EXISTS mapleopentech
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
}
AND DURABLE_WRITES = true;

-- Verify
DESCRIBE KEYSPACES;

-- Exit CQL shell
exit
```

Expected output should show your keyspaces:

```
maplepress  maplefile  mapleopentech  system  system_auth
system_distributed  system_schema  system_traces  system_views
system_virtual_schema
```

---

## Verify Cluster Health

### Step 1: Check
Cluster Status

**From inside cassandra-1 container:**

```bash
# If not already in container:
CONTAINER_ID=$(docker ps --filter "name=cassandra_cassandra-1" --format "{{.ID}}")
docker exec -it $CONTAINER_ID bash

# Check cluster status
nodetool status

# Expected output:
# Datacenter: datacenter1
# =======================
# Status=Up/Down
# |/ State=Normal/Leaving/Joining/Moving
# --  Address     Load     Tokens  Owns    Host ID    Rack
# UN  10.116.0.4  125 KiB  16      100.0%  abc123...  rack1
# UN  10.116.0.5  120 KiB  16      100.0%  def456...  rack1
# UN  10.116.0.6  118 KiB  16      100.0%  ghi789...  rack1
```

**What to verify:**
- ✅ All 3 nodes show `UN` (Up and Normal)
- ✅ Each node has an IP from your private network (10.116.0.x)
- ✅ Load is distributed
- ✅ Owns shows roughly 100% (data is replicated everywhere with RF=3)

### Step 2: Test Write/Read

**Still in cassandra-1 container:**

```bash
# Open CQL shell
cqlsh
```

Inside cqlsh, run the following (note the `--` comment syntax — `#` is not a valid CQL comment):

```sql
-- Create test keyspace
CREATE KEYSPACE IF NOT EXISTS test
WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 3
};

USE test;

-- Create test table
CREATE TABLE IF NOT EXISTS users (
  user_id UUID PRIMARY KEY,
  username TEXT,
  email TEXT
);

-- Insert test data
INSERT INTO users (user_id, username, email)
VALUES (uuid(), 'testuser', 'test@example.com');

-- Read data
SELECT * FROM users;

-- Expected output:
-- user_id           | email            | username
-- ------------------+------------------+----------
-- abc123-def456-... | test@example.com | testuser

-- Exit cqlsh
exit
```

Then exit the container shell too:

```bash
exit  # Exit container
```

### Step 3: Verify Replication

Connect to Node 2 and verify data is there:

```bash
# SSH to worker 3 (Node 2)
ssh dockeradmin@10.116.0.5

# Find cassandra-2 container
CONTAINER_ID=$(docker ps --filter "name=cassandra_cassandra-2" --format "{{.ID}}")

# Connect and query
docker exec -it $CONTAINER_ID cqlsh -e "SELECT * FROM test.users;"

# Should see the same test data!
# This proves replication is working.
``` ### Step 4: Save Connection Details **✅ Final Checkpoint - Update `.env`:** ```bash # On your local machine, add: CASSANDRA_CLUSTER_NAME=mapleopentech-private-prod-cluster CASSANDRA_DC=datacenter1 CASSANDRA_REPLICATION_FACTOR=3 # Connection endpoints (any node can be used) CASSANDRA_CONTACT_POINTS=10.116.0.4,10.116.0.5,10.116.0.6 CASSANDRA_CQL_PORT=9042 # For application connections (use private IPs) CASSANDRA_NODE_1_IP=10.116.0.4 CASSANDRA_NODE_2_IP=10.116.0.5 CASSANDRA_NODE_3_IP=10.116.0.6 ``` --- ## Cluster Management ### Restarting the Cassandra Cluster **To restart all Cassandra nodes:** ```bash # On manager node docker service update --force cassandra_cassandra-1 docker service update --force cassandra_cassandra-2 docker service update --force cassandra_cassandra-3 # Wait 5-8 minutes for all nodes to restart # Then verify cluster health docker exec -it $(docker ps -q --filter "name=cassandra") nodetool status ``` **To restart a single node:** ```bash # Restart just one service docker service update --force cassandra_cassandra-1 # Wait for it to rejoin the cluster # Check status from any worker docker exec -it $(docker ps -q --filter "name=cassandra") nodetool status ``` ### Shutting Down the Cassandra Cluster **To stop the entire stack (keeps data):** ```bash # On manager node docker stack rm cassandra # Services will be removed but volumes persist # Data is safe and can be restored later ``` **To verify shutdown:** ```bash # On manager node - check that services are gone docker stack ls # cassandra should not appear # Volumes are on worker nodes, not manager # SSH to each worker to verify volumes still exist (data is safe): # On worker-2: ssh dockeradmin@ docker volume ls | grep cassandra # Should show: cassandra_cassandra-1-data exit # On worker-3: ssh dockeradmin@ docker volume ls | grep cassandra # Should show: cassandra_cassandra-2-data exit # On worker-4: ssh dockeradmin@ docker volume ls | grep cassandra # Should show: cassandra_cassandra-3-data 
exit ``` **To restart after shutdown:** ```bash # Use the deployment script again cd ~/stacks ./deploy-cassandra.sh # Your data will be intact ``` ### Removing All Cassandra Data (Fresh Start) **⚠️ WARNING: This PERMANENTLY deletes all data. Use only when starting from scratch.** **IMPORTANT:** Volumes are stored on the **worker nodes**, not the manager node. You must SSH to each worker to delete them. ```bash # Step 1: Remove the stack (from manager node) docker stack rm cassandra # Step 2: Wait for services to stop (30-60 seconds) watch docker service ls # Press Ctrl+C when cassandra services are gone # Step 3: SSH to EACH worker and remove volumes (THIS DELETES ALL DATA!) # On worker-2 (cassandra-1 node) ssh dockeradmin@ docker volume ls | grep cassandra # Verify volume exists docker volume rm cassandra_cassandra-1-data exit # On worker-3 (cassandra-2 node) ssh dockeradmin@ docker volume ls | grep cassandra # Verify volume exists docker volume rm cassandra_cassandra-2-data exit # On worker-4 (cassandra-3 node) ssh dockeradmin@ docker volume ls | grep cassandra # Verify volume exists docker volume rm cassandra_cassandra-3-data exit # Step 4: Deploy fresh cluster (from manager node) cd ~/stacks ./deploy-cassandra.sh # You now have a fresh cluster with no data # You'll need to recreate keyspaces and tables ``` **Why volumes are on worker nodes:** - Docker Swarm creates volumes on the nodes where containers actually run - Manager node only orchestrates - it doesn't store data - Each worker node has its own volume for the Cassandra container running on it **When to use this:** - Testing deployment from scratch - Recovering from corrupted data - Major version upgrades requiring fresh install - Development/staging environments **When NOT to use this:** - Production environments (use backups and restore instead) - When you just need to restart nodes - When troubleshooting connectivity issues ### Scaling Considerations **Can you scale to more than 3 nodes?** Yes, but 
you'll need to: 1. Create additional worker droplets 2. Update `cassandra-stack.yml` to add `cassandra-4`, `cassandra-5`, etc. 3. Update the deployment script 4. Run `nodetool rebuild` on new nodes **Recommended minimum: 3 nodes** **Recommended maximum with 2GB RAM: 3-5 nodes** For production with proper 8GB RAM droplets, 5-7 nodes is common for large deployments. --- ## Troubleshooting ### Problem: Nodes Not Joining Cluster (Race Condition) **Symptom**: Each node shows only itself when running `nodetool status` - no 3-node cluster formed. **Root Cause**: If you deployed using `docker stack deploy` directly instead of the deployment script, all 3 nodes started simultaneously. They each tried to connect to the seed nodes before the others were ready, gave up, and formed separate single-node clusters. **Solution - Force Rolling Restart:** ```bash # On manager node, force update all services (triggers restart) docker service update --force cassandra_cassandra-1 docker service update --force cassandra_cassandra-2 docker service update --force cassandra_cassandra-3 # Wait 5-8 minutes for each to restart and discover each other # Then verify cluster from any worker: docker exec -it $(docker ps -q --filter "name=cassandra") nodetool status # You should now see all 3 nodes with UN status ``` **Prevention**: Always use the `deploy-cassandra.sh` script for initial deployment to avoid this race condition. ### Problem: Nodes Not Joining Cluster (Other Causes) **Symptom**: `nodetool status` shows only 1 node, or nodes show `DN` (Down) **Solutions:** 1. **Check firewall allows Cassandra ports:** ```bash # On each worker: sudo ufw status verbose | grep 7000 sudo ufw status verbose | grep 9042 # Should see rules allowing from 10.116.0.0/16 (your VPC subnet) ``` 2. 
**Verify seeds configuration:** ```bash # Check service environment docker service inspect cassandra_cassandra-1 --format '{{.Spec.TaskTemplate.ContainerSpec.Env}}' # Should see: CASSANDRA_SEEDS=cassandra-1,cassandra-2,cassandra-3 ``` 3. **Check inter-node connectivity:** ```bash # From cassandra-1 container (install tools first): apt-get update && apt-get install -y dnsutils netcat-openbsd # Test DNS resolution: nslookup cassandra-2 nslookup cassandra-3 # Test port connectivity: nc -zv cassandra-2 7000 nc -zv cassandra-3 7000 # Should all succeed ``` 4. **Check service placement:** ```bash # Verify services are on correct nodes docker service ps cassandra_cassandra-1 docker service ps cassandra_cassandra-2 docker service ps cassandra_cassandra-3 # Each should be on its labeled node ``` ### Problem: Slow Startup **Symptom**: Services stuck at 0/1 replicas for > 8 minutes **Solutions:** 1. **Check logs for errors:** ```bash docker service logs cassandra_cassandra-1 --tail 50 ``` 2. **Verify memory constraints:** ```bash # With 2GB RAM, 512MB heap is configured # This is already minimal - slower startup is expected # Be patient and wait up to 10 minutes ``` 3. **Check available memory on worker nodes:** ```bash # SSH to a worker and check memory free -h # Should show at least 1.5GB available after OS overhead ``` 4. **Check disk space:** ```bash df -h # Should have plenty of free space ``` ### Problem: Can't Connect from Application **Symptom**: Application can't reach Cassandra on port 9042 **Solutions:** 1. **Ensure application is on same overlay network:** ```yaml # In your application stack file: networks: mapleopentech-private-prod: external: true ``` 2. **Test connectivity from application container:** ```bash # From app container: nc -zv cassandra-1 9042 # Should connect ``` 3. 
**Use service names in application config:**

```bash
# Use Docker Swarm service names (recommended):
CASSANDRA_CONTACT_POINTS=cassandra-1,cassandra-2,cassandra-3
# These resolve automatically on the overlay network
```

### Problem: Node Shows UJ (Up, Joining)

**Symptom**: Node stuck in joining state

**Solution:**

```bash
# This is normal for the first 5-10 minutes with reduced memory
# Wait longer and check again

# If stuck > 15 minutes, restart that service:
docker service update --force cassandra_cassandra-2
```

### Problem: Out of Memory Errors

**Symptom**: Container keeps restarting, logs show "Out of memory" or "Cannot allocate memory"

**Solution:**

This means 2GB RAM is insufficient. You have two options:

1. **Upgrade droplets to 4GB RAM minimum** (recommended):
   - Resize each worker droplet in DigitalOcean
   - Update the stack file to use `MAX_HEAP_SIZE=1G` and `HEAP_NEWSIZE=256M`
   - Redeploy sequentially: `docker stack rm cassandra && ./deploy-cassandra.sh`

2. **Further reduce heap** (not recommended):

   ```yaml
   # In cassandra-stack.yml, change to:
   - MAX_HEAP_SIZE=384M
   - HEAP_NEWSIZE=96M
   ```

   This will severely limit functionality and is not viable for any real workload.

### Problem: Keyspace Already Exists Error

**Symptom**: `AlreadyExists` error when creating keyspaces

**Solution:**

This is normal if you've run the script before. The `IF NOT EXISTS` clause prevents actual errors. Your keyspaces are already created.

### Installing Debugging Tools

When troubleshooting, you'll often need diagnostic tools inside the Cassandra containers.
Here's how to install them: **Quick install of all useful debugging tools:** ```bash # SSH to any worker node, then run: docker exec -it $(docker ps -q --filter "name=cassandra") bash -c "apt-get update && apt-get install -y dnsutils netcat-openbsd iputils-ping curl vim" ``` **What this installs:** - `dnsutils` - DNS tools (`nslookup`, `dig`) - `netcat-openbsd` - Network connectivity testing (`nc`) - `iputils-ping` - Ping utility - `curl` - HTTP testing - `vim` - Text editor **Example debugging workflow:** ```bash # Get into a Cassandra container docker exec -it $(docker ps -q --filter "name=cassandra") bash # Install tools (only needed once per container) apt-get update && apt-get install -y dnsutils netcat-openbsd # Test DNS resolution nslookup cassandra-1 nslookup cassandra-2 nslookup cassandra-3 # Test port connectivity nc -zv cassandra-1 7000 # Gossip port nc -zv cassandra-2 9042 # CQL port nc -zv cassandra-3 7000 # Gossip port # Check cluster status nodetool status # Exit container exit ``` **Note:** These tools are NOT persistent. If a container restarts, you'll need to reinstall them. For permanent installation, you would need to create a custom Docker image. 
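As the note above says, baking the tools into a custom image is the persistent alternative. Below is a minimal sketch: it writes a Dockerfile extending the official image. The `Dockerfile.cassandra-debug` filename and the `cassandra-debug:5.0.4` tag are example names, not project conventions.

```shell
#!/usr/bin/env bash
# Sketch: persist the debugging tools by extending the official image.
# After writing the Dockerfile, build it with (run where docker is available):
#   docker build -t cassandra-debug:5.0.4 -f Dockerfile.cassandra-debug .
cat > Dockerfile.cassandra-debug <<'EOF'
FROM cassandra:5.0.4
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      dnsutils netcat-openbsd iputils-ping curl && \
    rm -rf /var/lib/apt/lists/*
EOF

echo "wrote Dockerfile.cassandra-debug"
```

To use it, you would replace `image: cassandra:5.0.4` with your custom tag in all three services of `cassandra-stack.yml` and redeploy. Keep in mind that with Swarm the image must be available on every worker node — either build it on each node or push it to a registry the workers can reach.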
--- ## Next Steps ✅ **You now have:** - 3-node Cassandra cluster with replication factor 3 - High availability (survives 1 node failure) - Keyspaces ready for application data - Swarm-managed containers with auto-restart **Next guides:** - **Redis Setup** - Cache layer for applications - **Application Deployment** - Deploy backend services - **Monitoring** - Set up cluster monitoring --- ## Performance Notes ### Hardware Sizing **Current setup (1 vCPU, 2GB RAM per node):** - **NOT suitable for production** - development/testing only - Handles: ~500-1,000 writes/sec, ~5,000 reads/sec - Storage: 50GB per node (150GB total raw, 50GB with RF=3) - Expected issues: slow queries, GC pauses, limited connections - **Total cost**: 3 nodes × $12 = **$36/month** **Recommended production setup (4 vCPU, 8GB RAM per node):** - Good for: Staging, small-to-medium production - Handles: ~10,000 writes/sec, ~50,000 reads/sec - Storage: 160GB per node (480GB total raw, 160GB with RF=3) - **Total cost**: 3 nodes × $48 = **$144/month** **For larger production:** - Scale to 8 vCPU, 16GB RAM - Add more workers (5-node, 7-node cluster) - Use dedicated CPU droplets ### Heap Size Tuning **Current: 512MB heap (with 2GB RAM total)** - Absolute minimum for Cassandra to run - Expect frequent garbage collection - Limited cache effectiveness - **Not recommended for production** **Recommended configurations:** - **2GB RAM**: 512MB heap (current - minimal) - **4GB RAM**: 1GB heap (small production) - **8GB RAM**: 2GB heap (recommended production) - **16GB RAM**: 4GB heap (high-traffic production) ### Replication Factor Current: RF=3 (recommended for production) Options: - **RF=1**: No redundancy, not recommended for production - **RF=2**: Can tolerate 1 failure, less storage overhead - **RF=3**: Best for production, tolerates 1 failure safely - **RF=5**: For mission-critical data (requires 5+ nodes) --- ## Upgrading to Production-Ready Configuration If you started with 2GB RAM droplets and need to 
upgrade:

### Step 1: Resize Droplets in DigitalOcean

1. Go to each worker droplet (workers 2, 3, 4)
2. Click **Resize**
3. Select **8GB RAM / 4 vCPU** plan
4. Complete resize (droplets will reboot)

### Step 2: Update Stack Configuration

SSH to manager and update the stack file:

```bash
ssh dockeradmin@
cd ~/stacks

# Edit cassandra-stack.yml
vi cassandra-stack.yml

# Change these lines in ALL THREE services:
# FROM:
#   - MAX_HEAP_SIZE=512M
#   - HEAP_NEWSIZE=128M
# TO:
#   - MAX_HEAP_SIZE=2G
#   - HEAP_NEWSIZE=512M
```

### Step 3: Redeploy

```bash
# Remove old stack
docker stack rm cassandra

# Wait for cleanup
sleep 30

# Redeploy sequentially with the new configuration
# (avoids the race condition described in Step 6)
./deploy-cassandra.sh

# Monitor startup
watch -n 2 'docker stack services cassandra'
```

---

**Document Version**: 1.1
**Last Updated**: November 3, 2025
**Maintained By**: Infrastructure Team

**Changelog**:
- v1.1 (Nov 3, 2025): Updated for 2GB RAM droplets with reduced heap (512MB) - NOT production ready
- v1.0 (Nov 3, 2025): Initial version with 8GB RAM droplets