# Reference Documentation

**Audience**: All infrastructure team members, architects, management
**Purpose**: High-level architecture, capacity planning, cost analysis, and strategic documentation
**Prerequisites**: Familiarity with deployed infrastructure

---

## Overview

This directory contains reference materials that provide the "big picture" view of your infrastructure. Unlike operational procedures (setup, operations, automation), these documents focus on **why** decisions were made, **what** the architecture looks like, and **how** to plan for the future.

**Contents:**
- Architecture diagrams and decision records
- Capacity planning and performance baselines
- Cost analysis and optimization strategies
- Security compliance documentation
- Technology choices and trade-offs
- Glossary of terms

---

## Directory Contents

### Architecture Documentation

**`architecture-overview.md`** - High-level system architecture
- Infrastructure topology
- Component interactions
- Data flow diagrams
- Network architecture
- Security boundaries
- Design principles and rationale

**`architecture-decisions.md`** - Architecture Decision Records (ADRs)
- Why Docker Swarm over Kubernetes?
- Why Cassandra over PostgreSQL?
- Why Caddy over NGINX?
- Multi-application architecture rationale
- Network segmentation strategy
- Service discovery approach

### Capacity Planning

**`capacity-planning.md`** - Growth planning and scaling strategies
- Current capacity baseline
- Performance benchmarks
- Growth projections
- Scaling thresholds
- Bottleneck analysis
- Future infrastructure needs

**`performance-baselines.md`** - Performance metrics and SLOs
- Response time percentiles
- Throughput measurements
- Database performance
- Resource utilization baselines
- Service Level Objectives (SLOs)
- Service Level Indicators (SLIs)

### Financial Planning

**`cost-analysis.md`** - Infrastructure costs and optimization
- Monthly cost breakdown
- Cost per service/application
- Cost trends and projections
- Optimization opportunities
- Reserved capacity vs on-demand
- TCO (Total Cost of Ownership)

**`cost-optimization.md`** - Strategies to reduce costs
- Right-sizing recommendations
- Idle resource identification
- Reserved instances opportunities
- Storage optimization
- Bandwidth optimization
- Alternative architecture considerations

### Security & Compliance

**`security-architecture.md`** - Security design and controls
- Defense-in-depth layers
- Authentication and authorization
- Secrets management approach
- Network security controls
- Data encryption (at rest and in transit)
- Security monitoring and logging

**`security-checklist.md`** - Security verification checklist
- Infrastructure hardening checklist
- Compliance requirements (GDPR, SOC2, etc.)
- Security audit procedures
- Vulnerability management
- Incident response readiness

**`compliance.md`** - Regulatory compliance documentation
- GDPR compliance measures
- Data residency requirements
- Audit trail procedures
- Privacy by design implementation
- Data retention policies
- Right to be forgotten procedures

### Technology Stack

**`technology-stack.md`** - Complete technology inventory
- Software versions and update policy
- Third-party services and dependencies
- Library and framework choices
- Language and runtime versions
- Tooling and development environment

**`technology-decisions.md`** - Why we chose each technology
- Database selection rationale
- Programming language choices
- Cloud provider selection
- Deployment tooling decisions
- Monitoring stack selection

### Operational Reference

**`runbook-index.md`** - Quick reference to all runbooks
- Emergency procedures quick links
- Common tasks reference
- Escalation contacts
- Critical command cheat sheet

**`glossary.md`** - Terms and definitions
- Docker Swarm terminology
- Database concepts (Cassandra RF, QUORUM, etc.)
- Network terms (overlay, ingress, etc.)
- Monitoring terminology
- Infrastructure jargon decoder

---

## Quick Reference Materials

### Architecture At-a-Glance

**Current Infrastructure (January 2025):**

```
Production Environment: maplefile-prod
Region: DigitalOcean Toronto (tor1)
Nodes: 7 nodes (1 manager + 6 workers)
Applications: MaplePress (deployed), MapleFile (deployed)

Orchestration: Docker Swarm
Container Registry: DigitalOcean Container Registry (registry.digitalocean.com/ssp)
Object Storage: DigitalOcean Spaces (nyc3)
DNS: [Your DNS provider]
SSL: Let's Encrypt (automatic via Caddy)

Networks:
  - maple-private-prod: Databases and internal services
  - maple-public-prod: Public-facing services (Caddy + backends)

Databases:
  - Cassandra: 3-node cluster, RF=3, QUORUM consistency
  - Redis: Single instance, RDB + AOF persistence
  - Meilisearch: Single instance

Applications:
  - MaplePress Backend: Go 1.21+, Port 8000, Domain: getmaplepress.ca
  - MaplePress Frontend: React 19 + Vite, Domain: getmaplepress.com
```

### Key Metrics Baseline (Example)

**As of [Date]:**

| Metric | Value | Threshold |
|--------|-------|-----------|
| Backend p95 Response Time | 150ms | < 500ms |
| Frontend Load Time | 1.2s | < 3s |
| Backend Throughput | 500 req/min | 5000 req/min capacity |
| Database Read Latency | 5ms | < 20ms |
| Database Write Latency | 10ms | < 50ms |
| Redis Hit Rate | 95% | > 90% |
| CPU Utilization (avg) | 35% | Alert at 80% |
| Memory Utilization (avg) | 50% | Alert at 85% |
| Disk Usage (avg) | 40% | Alert at 75% |
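
For readers unfamiliar with percentile metrics such as the p95 response time above, the following is a minimal sketch of the nearest-rank method. The function name and the sample latencies are hypothetical, and the monitoring stack may compute percentiles differently (for example, by interpolating from histogram buckets), so treat this as an illustration of the concept rather than the method behind the table.

```python
def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (integer pct, 1-100) by nearest rank."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids float rounding issues.
    rank = max((pct * len(ordered) + 99) // 100, 1)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: 10, 20, ..., 200.
latencies_ms = [10 * i for i in range(1, 21)]
print(percentile_nearest_rank(latencies_ms, 95))  # 190
```

With 20 samples, the 95th percentile is the 19th value in sorted order, so 95% of requests completed in 190 ms or less.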

### Monthly Cost Breakdown (Example)

| Service | Monthly Cost | Notes |
|---------|--------------|-------|
| Droplets (7x) | $204 | See breakdown in cost-analysis.md |
| Spaces Storage | $5 | 250GB included |
| Additional Bandwidth | $0 | Within free tier |
| Container Registry | $0 | Included |
| DNS | $0 | Using [provider] |
| Monitoring (optional) | $0 | Self-hosted Prometheus |
| **Total** | **~$209/mo** | Can scale to ~$300/mo with growth |
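
The total in the table is a plain sum of the line items. A minimal sketch of that arithmetic follows, using the example figures above; the variable and function names are illustrative only.

```python
# Monthly costs in USD, copied from the example table above.
monthly_costs = {
    "Droplets (7x)": 204,
    "Spaces Storage": 5,
    "Additional Bandwidth": 0,
    "Container Registry": 0,
    "DNS": 0,
    "Monitoring (optional)": 0,
}


def total_monthly_cost(costs):
    """Sum all line items into a single monthly figure."""
    return sum(costs.values())


def cost_share(costs):
    """Return each line item's share of the total, as a percentage."""
    total = total_monthly_cost(costs)
    return {name: round(100 * amount / total, 1) for name, amount in costs.items()}


total = total_monthly_cost(monthly_costs)
print(f"Total: ${total}/mo")  # Total: $209/mo
```

In this example the droplets account for nearly all of the spend, which is why right-sizing them is the first place to look for savings.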

### Technology Stack Summary

| Layer | Technology | Version | Purpose |
|-------|------------|---------|---------|
| **OS** | Ubuntu | 24.04 LTS | Base operating system |
| **Orchestration** | Docker Swarm | Built-in | Container orchestration |
| **Container Runtime** | Docker | 27.x+ | Container execution |
| **Database** | Cassandra | 4.1.x | Distributed database |
| **Cache** | Redis | 7.x | In-memory cache/sessions |
| **Search** | Meilisearch | v1.5+ | Full-text search |
| **Reverse Proxy** | Caddy | 2-alpine | HTTPS termination |
| **Backend** | Go | 1.21+ | Application runtime |
| **Frontend** | React + Vite | 19 + 5.x | Web UI |
| **Object Storage** | Spaces | S3-compatible | File storage |
| **Monitoring** | Prometheus + Grafana | Latest | Metrics & dashboards |
| **CI/CD** | TBD (GitHub Actions / GitLab CI) | - | Build and deploy automation |

---

## Architecture Decision Records (ADRs)

### ADR-001: Docker Swarm vs Kubernetes

**Decision**: Use Docker Swarm for orchestration

**Context**: Need container orchestration for production deployment

**Rationale**:
- Simpler to set up and maintain (< 1 hour vs days for k8s)
- Built into Docker (no additional components)
- Sufficient for our scale (< 100 services)
- Lower operational overhead
- Easier to troubleshoot
- Team familiarity with Docker

**Trade-offs**:
- Less ecosystem tooling than Kubernetes
- Limited advanced scheduling features
- Smaller community
- May need to migrate to k8s if we scale dramatically (> 50 nodes)

**Status**: Accepted

---

### ADR-002: Cassandra for Distributed Database

**Decision**: Use Cassandra for primary datastore

**Context**: Need highly available, distributed database with linear scalability

**Rationale**:
- Write-heavy workload (user-generated content)
- Geographic distribution possible (multi-region)
- Proven at scale (Instagram, Netflix)
- No single point of failure (RF=3, QUORUM)
- Linear scalability (add nodes for capacity)
- Excellent write performance

**Trade-offs**:
- Higher complexity than PostgreSQL
- Eventually consistent (tunable)
- Schema migrations more complex
- Higher resource usage (3 nodes minimum)
- Steeper learning curve

**Alternatives Considered**:
- PostgreSQL + Patroni: Simpler but less scalable
- MongoDB: Similar, but prefer Cassandra's consistency model
- MySQL Cluster: Oracle licensing concerns

**Status**: Accepted

---

### ADR-003: Caddy for Reverse Proxy

**Decision**: Use Caddy instead of NGINX

**Context**: Need HTTPS termination and reverse proxy

**Rationale**:
- Automatic HTTPS with Let's Encrypt (zero configuration)
- Automatic certificate renewal (no cron jobs)
- Simpler configuration (10 lines vs 200+)
- Built-in HTTP/2 and HTTP/3
- Security by default
- Active development

**Trade-offs**:
- Less mature than NGINX (but production-ready)
- Smaller community
- Fewer third-party modules
- Slightly higher memory usage (negligible)

**Performance**: Equivalent for our use case (< 10k req/sec)

**Status**: Accepted

---

### ADR-004: Multi-Application Shared Infrastructure

**Decision**: Share database infrastructure across multiple applications

**Context**: Planning to deploy multiple applications (MaplePress, MapleFile)

**Rationale**:
- Cost efficiency (one 3-node Cassandra cluster vs a separate cluster per application)
- Operational efficiency (one set of database procedures)
- Resource utilization (databases rarely at capacity)
- Simplified backups (one backup process)
- Consistent data layer

**Isolation Strategy**:
- Separate keyspaces per application
- Separate workers for application backends
- Independent scaling per application
- Separate deployment pipelines

**Trade-offs**:
- Blast radius: One database failure affects all apps
- Resource contention possible (mitigated by capacity planning)
- Schema migration coordination needed

**Status**: Accepted

---

## Capacity Planning Guidelines

### Current Capacity

**Node specifications:**
- Manager + Redis: 2 vCPU, 2 GB RAM
- Cassandra nodes (3x): 2 vCPU, 4 GB RAM each
- Meilisearch: 2 vCPU, 2 GB RAM
- Backend: 2 vCPU, 2 GB RAM
- Frontend: 1 vCPU, 1 GB RAM

**Total:** 13 vCPUs, 19 GB RAM
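
The totals above can be re-derived from the per-node specifications. The sketch below does that; the `NodeSpec` type and role names are illustrative, and the figures are the ones listed in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    role: str
    count: int
    vcpus: int
    ram_gb: int


# Specifications as listed above.
NODES = [
    NodeSpec("manager + redis", 1, 2, 2),
    NodeSpec("cassandra", 3, 2, 4),
    NodeSpec("meilisearch", 1, 2, 2),
    NodeSpec("backend", 1, 2, 2),
    NodeSpec("frontend", 1, 1, 1),
]


def totals(nodes):
    """Return (node count, total vCPUs, total RAM in GB)."""
    node_count = sum(n.count for n in nodes)
    vcpus = sum(n.count * n.vcpus for n in nodes)
    ram_gb = sum(n.count * n.ram_gb for n in nodes)
    return node_count, vcpus, ram_gb


print(totals(NODES))  # (7, 13, 19)
```

Keeping the list in one place makes it easy to re-check the totals whenever a droplet is resized or a node is added.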

### Scaling Triggers

**When to scale:**

| Condition | Duration / Scope | Action |
|--------|-----------|--------|
| CPU > 80% sustained | 5 minutes | Add worker or scale vertically |
| Memory > 85% sustained | 5 minutes | Increase droplet RAM |
| Disk > 75% full | Any node | Clear space or increase disk |
| Backend p95 > 1s | Consistent | Scale backend horizontally |
| Database latency > 50ms | Consistent | Add Cassandra node or tune |
| Request rate approaching capacity | 80% of max | Scale backend replicas |
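
To make "sustained" concrete, here is a minimal sketch of how the CPU and request-rate triggers in the table could be evaluated. The function names and sample data are hypothetical; in practice this logic would more likely live in the alerting rules of the monitoring stack.

```python
def sustained_above(samples, threshold, min_duration_s=300):
    """Return True if the latest unbroken run of samples above `threshold`
    spans at least `min_duration_s` seconds.

    `samples` is a time-ordered list of (unix_seconds, value) pairs.
    """
    run_start = None
    for timestamp, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = timestamp  # first sample of the current run
        else:
            run_start = None  # a dip below the threshold resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= min_duration_s


def near_capacity(current_rate, max_rate, trigger_pct=80):
    """Return True once the request rate reaches `trigger_pct` of capacity."""
    return current_rate * 100 >= max_rate * trigger_pct


# Hypothetical CPU samples, one per minute, all above 80%.
cpu = [(minute * 60, 85) for minute in range(6)]
print(sustained_above(cpu, threshold=80))  # True: the run spans 300 s
print(near_capacity(500, 5000))            # False: 10% of capacity
```

Requiring an unbroken run avoids scaling on short spikes, at the cost of reacting a few minutes later to a genuine load increase.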

### Scaling Options

**Horizontal Scaling (preferred):**
- Backend: Add replicas (`docker service scale maplepress_backend=3`)
- Cassandra: Add fourth node (increases capacity + resilience)
- Frontend: Add CDN or edge caching

**Vertical Scaling:**
- Resize droplets (requires brief restart)
- Increase memory limits in stack files
- Optimize application code first

**Cost vs Performance:**
- Horizontal: More resilient, linear cost increase
- Vertical: Simpler, better price/performance up to a point

---

## Cost Optimization Strategies

### Quick Wins

1. **Reserved Instances**: DigitalOcean doesn't offer reserved pricing, but consider annual contracts for discounts
2. **Right-sizing**: Monitor actual usage, downsize oversized droplets
3. **Cleanup**: Run `docker system prune` regularly and delete old snapshots
4. **Compression**: Enable gzip in Caddy (already done)
5. **Caching**: Maximize cache hit rates (Redis, CDN)

### Medium-term Optimizations

1. **CDN for static assets**: Offload frontend static files to CDN
2. **Object storage lifecycle**: Auto-delete old backups
3. **Database tuning**: Optimize queries to reduce hardware needs
4. **Spot instances**: Not available on DigitalOcean, but worth considering on another provider for batch jobs

### Alternative Architectures

**If cost becomes primary concern:**
- Single-node PostgreSQL instead of Cassandra cluster (-$96/mo)
- Collocate services on fewer droplets (-$50-100/mo)
- Use managed databases (different cost model)

**Trade-off**: Lower cost, higher operational risk

---

## Security Architecture

### Defense in Depth Layers

1. **Network**: VPC, firewalls, private overlay networks
2. **Transport**: TLS 1.3 for all external connections
3. **Application**: Authentication, authorization, input validation
4. **Data**: Encryption at rest (object storage), encryption in transit
5. **Monitoring**: Audit logs, security alerts, intrusion detection

### Key Security Controls

**Implemented:**
- ✅ SSH key-based authentication (no passwords)
- ✅ UFW firewall on all nodes
- ✅ Docker secrets for sensitive values
- ✅ Network segmentation (private vs public)
- ✅ Automatic HTTPS with perfect forward secrecy
- ✅ Security headers (HSTS, X-Frame-Options, etc.)
- ✅ Database authentication (passwords, API keys)
- ✅ Minimal attack surface (only ports 22, 80, 443 exposed)

**Planned:**
- [ ] fail2ban for SSH brute-force protection
- [ ] Intrusion detection system (IDS)
- [ ] Regular security scanning (Trivy for containers)
- [ ] Secret rotation automation
- [ ] Audit logging aggregation
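
The "only ports 22, 80, 443 exposed" control lends itself to an automated check. The following is a rough sketch, assuming a list of open ports is obtained from a scan run outside the cluster; the helper names are hypothetical and this is not a replacement for a proper port scanner.

```python
import socket

ALLOWED_PORTS = frozenset({22, 80, 443})  # SSH, HTTP, HTTPS


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unexpected_ports(open_ports, allowed=ALLOWED_PORTS):
    """Return the open ports that are not on the allowlist, sorted."""
    return sorted(set(open_ports) - set(allowed))


# Hypothetical scan result: 9042 is the default Cassandra CQL port.
print(unexpected_ports([22, 80, 443, 9042]))  # [9042]
```

A non-empty result means the firewall rules or published ports have drifted from the documented attack surface and should be reviewed.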

---

## Compliance Considerations

### GDPR

**If processing EU user data:**
- Data residency: Deploy EU region workers
- Right to deletion: Implement user data purge
- Data portability: Export user data functionality
- Privacy by design: Minimal data collection
- Audit trail: Log all data access

### SOC2

**If pursuing SOC2 compliance:**
- Access controls: Role-based access, MFA
- Change management: All changes via git, reviewed
- Monitoring: Comprehensive logging and alerting
- Incident response: Documented procedures
- Business continuity: Backup and disaster recovery tested

**Document in**: `compliance.md`

---

## Glossary

### Docker Swarm Terms

- **Manager node**: Swarm orchestrator, schedules tasks, maintains cluster state
- **Worker node**: Executes tasks (containers) assigned by manager
- **Service**: Definition of containers to run (image, replicas, network)
- **Task**: Single container instance of a service
- **Stack**: Group of related services deployed together
- **Overlay network**: Virtual network spanning all swarm nodes
- **Ingress network**: Built-in load balancing for published ports
- **Node label**: Key-value tag for task placement constraints

### Cassandra Terms

- **RF (Replication Factor)**: Number of copies of data (RF=3 = 3 copies)
- **QUORUM**: Majority of replicas (2 out of 3 for RF=3)
- **Consistency Level**: How many replicas must respond (ONE, QUORUM, ALL)
- **Keyspace**: Database namespace (like database in SQL)
- **SSTable**: Immutable data file on disk
- **Compaction**: Merging SSTables to reclaim space
- **Repair**: Synchronize data across replicas
- **Nodetool**: Command-line tool for Cassandra administration
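
The relationship between RF, QUORUM, and consistency level comes down to simple arithmetic. The sketch below shows it for a single datacenter; the function names are illustrative.

```python
def quorum(rf):
    """Replicas that must respond at QUORUM: a majority of RF."""
    return rf // 2 + 1


def tolerated_failures(rf):
    """Replicas that can be down while QUORUM operations still succeed."""
    return rf - quorum(rf)


def reads_see_latest_write(read_replicas, write_replicas, rf):
    """Reads overlap the latest write when R + W > RF."""
    return read_replicas + write_replicas > rf


rf = 3
print(quorum(rf))                                          # 2
print(tolerated_failures(rf))                              # 1
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # True
print(reads_see_latest_write(1, 1, rf))                    # False (ONE/ONE)
```

With RF=3, QUORUM reads and writes survive the loss of one node and still overlap on at least one replica, which is the basis for the "no single point of failure" rationale in ADR-002.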

### Monitoring Terms

- **Prometheus**: Time-series database and metrics collection
- **Grafana**: Visualization and dashboarding
- **Alertmanager**: Alert routing and notification
- **Exporter**: Metrics collection agent (node_exporter, etc.)
- **Scrape**: Prometheus collecting metrics from target
- **Time series**: Sequence of data points over time
- **PromQL**: Prometheus query language

---

## Related Documentation

**For initial deployment:**
- `../setup/` - Step-by-step infrastructure deployment

**For day-to-day operations:**
- `../operations/` - Backup, monitoring, incident response

**For automation:**
- `../automation/` - Scripts, CI/CD, monitoring configs

**External resources:**
- Docker Swarm: https://docs.docker.com/engine/swarm/
- Cassandra: https://cassandra.apache.org/doc/latest/
- DigitalOcean: https://docs.digitalocean.com/

---

## Contributing to Reference Docs

**When to update reference documentation:**

- Major architecture changes
- New technology adoption
- Significant cost changes
- Security incidents (document lessons learned)
- Compliance requirements change
- Quarterly review cycles

**Document format:**
- Use Markdown
- Include decision date
- Link to related ADRs
- Update index/glossary as needed

---

## Document Maintenance

**Review schedule:**
- **Architecture docs**: Quarterly or when major changes occur
- **Capacity planning**: Monthly (update with metrics)
- **Cost analysis**: Monthly (track trends)
- **Security checklist**: Quarterly or after incidents
- **Technology stack**: When versions change
- **Glossary**: As needed when new terms are introduced

**Responsibility**: The infrastructure lead reviews quarterly; the team contributes ongoing updates.

---

**Last Updated**: January 2025
**Maintained By**: Infrastructure Team
**Next Review**: April 2025

**Purpose**: These documents answer "why" and "what if" questions. They provide context for decisions and guidance for future planning.