# Reference Documentation

**Audience**: All infrastructure team members, architects, management
**Purpose**: High-level architecture, capacity planning, cost analysis, and strategic documentation
**Prerequisites**: Familiarity with deployed infrastructure

---

## Overview

This directory contains reference materials that provide the "big picture" view of your infrastructure. Unlike operational procedures (setup, operations, automation), these documents focus on **why** decisions were made, **what** the architecture looks like, and **how** to plan for the future.

**Contents:**
- Architecture diagrams and decision records
- Capacity planning and performance baselines
- Cost analysis and optimization strategies
- Security compliance documentation
- Technology choices and trade-offs
- Glossary of terms

---

## Directory Contents

### Architecture Documentation

**`architecture-overview.md`** - High-level system architecture
- Infrastructure topology
- Component interactions
- Data flow diagrams
- Network architecture
- Security boundaries
- Design principles and rationale

**`architecture-decisions.md`** - Architecture Decision Records (ADRs)
- Why Docker Swarm over Kubernetes?
- Why Cassandra over PostgreSQL?
- Why Caddy over NGINX?
- Multi-application architecture rationale
- Network segmentation strategy
- Service discovery approach

### Capacity Planning

**`capacity-planning.md`** - Growth planning and scaling strategies
- Current capacity baseline
- Performance benchmarks
- Growth projections
- Scaling thresholds
- Bottleneck analysis
- Future infrastructure needs

**`performance-baselines.md`** - Performance metrics and SLOs
- Response time percentiles
- Throughput measurements
- Database performance
- Resource utilization baselines
- Service Level Objectives (SLOs)
- Service Level Indicators (SLIs)

### Financial Planning

**`cost-analysis.md`** - Infrastructure costs and optimization
- Monthly cost breakdown
- Cost per service/application
- Cost trends and projections
- Optimization opportunities
- Reserved capacity vs on-demand
- TCO (Total Cost of Ownership)

**`cost-optimization.md`** - Strategies to reduce costs
- Right-sizing recommendations
- Idle resource identification
- Reserved instances opportunities
- Storage optimization
- Bandwidth optimization
- Alternative architecture considerations

### Security & Compliance

**`security-architecture.md`** - Security design and controls
- Defense-in-depth layers
- Authentication and authorization
- Secrets management approach
- Network security controls
- Data encryption (at rest and in transit)
- Security monitoring and logging

**`security-checklist.md`** - Security verification checklist
- Infrastructure hardening checklist
- Compliance requirements (GDPR, SOC2, etc.)
- Security audit procedures
- Vulnerability management
- Incident response readiness

**`compliance.md`** - Regulatory compliance documentation
- GDPR compliance measures
- Data residency requirements
- Audit trail procedures
- Privacy by design implementation
- Data retention policies
- Right to be forgotten procedures

### Technology Stack

**`technology-stack.md`** - Complete technology inventory
- Software versions and update policy
- Third-party services and dependencies
- Library and framework choices
- Language and runtime versions
- Tooling and development environment

**`technology-decisions.md`** - Why we chose each technology
- Database selection rationale
- Programming language choices
- Cloud provider selection
- Deployment tooling decisions
- Monitoring stack selection

### Operational Reference

**`runbook-index.md`** - Quick reference to all runbooks
- Emergency procedures quick links
- Common tasks reference
- Escalation contacts
- Critical command cheat sheet

**`glossary.md`** - Terms and definitions
- Docker Swarm terminology
- Database concepts (Cassandra RF, QUORUM, etc.)
- Network terms (overlay, ingress, etc.)
- Monitoring terminology
- Infrastructure jargon decoder

---

## Quick Reference Materials

### Architecture At-a-Glance

**Current Infrastructure (January 2025):**

```
Production Environment: maplefile-prod
Region: DigitalOcean Toronto (tor1)
Nodes: 7 (1 manager + 6 workers)
Applications: MaplePress (deployed), MapleFile (deployed)
Orchestration: Docker Swarm
Container Registry: DigitalOcean Container Registry (registry.digitalocean.com/ssp)
Object Storage: DigitalOcean Spaces (nyc3)
DNS: [Your DNS provider]
SSL: Let's Encrypt (automatic via Caddy)

Networks:
- maple-private-prod: Databases and internal services
- maple-public-prod: Public-facing services (Caddy + backends)

Databases:
- Cassandra: 3-node cluster, RF=3, QUORUM consistency
- Redis: Single instance, RDB + AOF persistence
- Meilisearch: Single instance

Applications:
- MaplePress Backend: Go 1.21+, Port 8000, Domain: getmaplepress.ca
- MaplePress Frontend: React 19 + Vite, Domain: getmaplepress.com
```

### Key Metrics Baseline (Example)

**As of [Date]:**

| Metric | Value | Threshold |
|--------|-------|-----------|
| Backend p95 Response Time | 150ms | < 500ms |
| Frontend Load Time | 1.2s | < 3s |
| Backend Throughput | 500 req/min | 5000 req/min capacity |
| Database Read Latency | 5ms | < 20ms |
| Database Write Latency | 10ms | < 50ms |
| Redis Hit Rate | 95% | > 90% |
| CPU Utilization (avg) | 35% | Alert at 80% |
| Memory Utilization (avg) | 50% | Alert at 85% |
| Disk Usage (avg) | 40% | Alert at 75% |

### Monthly Cost Breakdown (Example)

| Service | Monthly Cost | Notes |
|---------|--------------|-------|
| Droplets (7x) | $204 | See breakdown in cost-analysis.md |
| Spaces Storage | $5 | 250GB included |
| Additional Bandwidth | $0 | Within free tier |
| Container Registry | $0 | Included |
| DNS | $0 | Using [provider] |
| Monitoring (optional) | $0 | Self-hosted Prometheus |
| **Total** | **~$209/mo** | Can scale to ~$300/mo with growth |

### Technology Stack Summary

| Layer | Technology | Version | Purpose |
|-------|------------|---------|---------|
| **OS** | Ubuntu | 24.04 LTS | Base operating system |
| **Orchestration** | Docker Swarm | Built-in | Container orchestration |
| **Container Runtime** | Docker | 27.x+ | Container execution |
| **Database** | Cassandra | 4.1.x | Distributed database |
| **Cache** | Redis | 7.x | In-memory cache/sessions |
| **Search** | Meilisearch | v1.5+ | Full-text search |
| **Reverse Proxy** | Caddy | 2-alpine | HTTPS termination |
| **Backend** | Go | 1.21+ | Application runtime |
| **Frontend** | React + Vite | 19 + 5.x | Web UI |
| **Object Storage** | Spaces | S3-compatible | File storage |
| **Monitoring** | Prometheus + Grafana | Latest | Metrics & dashboards |
| **CI/CD** | TBD | - | GitHub Actions / GitLab CI |

---

## Architecture Decision Records (ADRs)

### ADR-001: Docker Swarm vs Kubernetes

**Decision**: Use Docker Swarm for orchestration

**Context**: Need container orchestration for production deployment

**Rationale**:
- Simpler to set up and maintain (< 1 hour vs days for k8s)
- Built into Docker (no additional components)
- Sufficient for our scale (< 100 services)
- Lower operational overhead
- Easier to troubleshoot
- Team familiarity with Docker

**Trade-offs**:
- Less ecosystem tooling than Kubernetes
- Limited advanced scheduling features
- Smaller community
- May need migration to k8s if we scale dramatically (> 50 nodes)

**Status**: Accepted

---

### ADR-002: Cassandra for Distributed Database

**Decision**: Use Cassandra for primary datastore

**Context**: Need highly available, distributed database with linear scalability

**Rationale**:
- Write-heavy workload (user-generated content)
- Geographic distribution possible (multi-region)
- Proven at scale (Instagram, Netflix)
- No single point of failure (RF=3, QUORUM)
- Linear scalability (add nodes for capacity)
- Excellent write performance

**Trade-offs**:
- Higher complexity than PostgreSQL
- Eventually consistent (tunable)
- Schema migrations more complex
- Higher resource usage (3 nodes minimum)
- Steeper learning curve

**Alternatives Considered**:
- PostgreSQL + Patroni: Simpler but less scalable
- MongoDB: Similar, but prefer Cassandra's consistency model
- MySQL Cluster: Oracle licensing concerns

**Status**: Accepted

---

### ADR-003: Caddy for Reverse Proxy

**Decision**: Use Caddy instead of NGINX

**Context**: Need HTTPS termination and reverse proxy

**Rationale**:
- Automatic HTTPS with Let's Encrypt (zero configuration)
- Automatic certificate renewal (no cron jobs)
- Simpler configuration (10 lines vs 200+)
- Built-in HTTP/2 and HTTP/3
- Security by default
- Active development

**Trade-offs**:
- Less mature than NGINX (but production-ready)
- Smaller community
- Fewer third-party modules
- Slightly higher memory usage (negligible)

**Performance**: Equivalent for our use case (< 10k req/sec)

**Status**: Accepted

---

### ADR-004: Multi-Application Shared Infrastructure

**Decision**: Share database infrastructure across multiple applications

**Context**: Planning to deploy multiple applications (MaplePress, MapleFile)

**Rationale**:
- Cost efficiency (one 3-node Cassandra cluster vs 3 separate clusters)
- Operational efficiency (one set of database procedures)
- Resource utilization (databases rarely at capacity)
- Simplified backups (one backup process)
- Consistent data layer

**Isolation Strategy**:
- Separate keyspaces per application
- Separate workers for application backends
- Independent scaling per application
- Separate deployment pipelines

**Trade-offs**:
- Blast radius: One database failure affects all apps
- Resource contention possible (mitigated by capacity planning)
- Schema migration coordination needed

**Status**: Accepted

---

## Capacity Planning Guidelines

### Current Capacity

**Worker specifications:**
- Manager + Redis: 2 vCPU, 2 GB RAM
- Cassandra nodes (3x): 2 vCPU, 4 GB RAM each
- Meilisearch: 2 vCPU, 2 GB RAM
- Backend: 2 vCPU, 2 GB RAM
- Frontend: 1 vCPU, 1 GB RAM

**Total:** 13 vCPUs, 19 GB RAM

### Scaling Triggers

**When to scale:**

| Metric | Threshold | Action |
|--------|-----------|--------|
| CPU > 80% sustained | 5 minutes | Add worker or scale vertically |
| Memory > 85% sustained | 5 minutes | Increase droplet RAM |
| Disk > 75% full | Any node | Clear space or increase disk |
| Backend p95 > 1s | Consistent | Scale backend horizontally |
| Database latency > 50ms | Consistent | Add Cassandra node or tune |
| Request rate approaching capacity | 80% of max | Scale backend replicas |

### Scaling Options

**Horizontal Scaling (preferred):**
- Backend: Add replicas (`docker service scale maplepress_backend=3`)
- Cassandra: Add fourth node (increases capacity + resilience)
- Frontend: Add CDN or edge caching

**Vertical Scaling:**
- Resize droplets (requires brief restart)
- Increase memory limits in stack files
- Optimize application code first

**Cost vs Performance:**
- Horizontal: More resilient, linear cost increase
- Vertical: Simpler, better price/performance up to a point

---

## Cost Optimization Strategies

### Quick Wins

1. **Reserved Instances**: DigitalOcean doesn't offer reserved pricing, but consider annual contracts for discounts
2. **Right-sizing**: Monitor actual usage, downsize oversized droplets
3. **Cleanup**: Run `docker system prune` regularly, clear old snapshots
4. **Compression**: Enable gzip in Caddy (already done)
5. **Caching**: Maximize cache hit rates (Redis, CDN)

### Medium-term Optimizations

1. **CDN for static assets**: Offload frontend static files to CDN
2. **Object storage lifecycle**: Auto-delete old backups
3. **Database tuning**: Optimize queries to reduce hardware needs
4. **Spot instances**: Not available on DigitalOcean, but worth considering on other providers for batch jobs

### Alternative Architectures

**If cost becomes the primary concern:**
- Single-node PostgreSQL instead of Cassandra cluster (-$96/mo)
- Co-locate services on fewer droplets (-$50-100/mo)
- Use managed databases (different cost model)

**Trade-off**: Lower cost, higher operational risk

---

## Security Architecture

### Defense in Depth Layers

1. **Network**: VPC, firewalls, private overlay networks
2. **Transport**: TLS 1.3 for all external connections
3. **Application**: Authentication, authorization, input validation
4. **Data**: Encryption at rest (object storage), encryption in transit
5. **Monitoring**: Audit logs, security alerts, intrusion detection

### Key Security Controls

**Implemented:**
- ✅ SSH key-based authentication (no passwords)
- ✅ UFW firewall on all nodes
- ✅ Docker secrets for sensitive values
- ✅ Network segmentation (private vs public)
- ✅ Automatic HTTPS with perfect forward secrecy
- ✅ Security headers (HSTS, X-Frame-Options, etc.)
- ✅ Database authentication (passwords, API keys)
- ✅ Minimal attack surface (only ports 22, 80, 443 exposed)

**Planned:**
- [ ] fail2ban for SSH brute-force protection
- [ ] Intrusion detection system (IDS)
- [ ] Regular security scanning (Trivy for containers)
- [ ] Secret rotation automation
- [ ] Audit logging aggregation

---

## Compliance Considerations

### GDPR

**If processing EU user data:**
- Data residency: Deploy EU region workers
- Right to deletion: Implement user data purge
- Data portability: Export user data functionality
- Privacy by design: Minimal data collection
- Audit trail: Log all data access

### SOC2

**If pursuing SOC2 compliance:**
- Access controls: Role-based access, MFA
- Change management: All changes via git, reviewed
- Monitoring: Comprehensive logging and alerting
- Incident response: Documented procedures
- Business continuity: Backup and disaster recovery tested

**Document in**: `compliance.md`

---

## Glossary

### Docker Swarm Terms

**Manager node**: Swarm orchestrator, schedules tasks, maintains cluster state

**Worker node**: Executes tasks (containers) assigned by manager

**Service**: Definition of containers to run (image, replicas, network)

**Task**: Single container instance of a service

**Stack**: Group of related services deployed together

**Overlay network**: Virtual network spanning all swarm nodes

**Ingress network**: Built-in load balancing for published ports

**Node label**: Key-value tag for task placement constraints

### Cassandra Terms

**RF (Replication Factor)**: Number of copies of data (RF=3 = 3 copies)

**QUORUM**: Majority of replicas (2 out of 3 for RF=3)

**Consistency Level**: How many replicas must respond (ONE, QUORUM, ALL)

**Keyspace**: Database namespace (like database in SQL)

**SSTable**: Immutable data file on disk

**Compaction**: Merging SSTables to reclaim space

**Repair**: Synchronize data across replicas

**Nodetool**: Command-line tool for Cassandra administration

### Monitoring Terms

**Prometheus**: Time-series database and metrics collection

**Grafana**: Visualization and dashboarding

**Alertmanager**: Alert routing and notification

**Exporter**: Metrics collection agent (node_exporter, etc.)

**Scrape**: Prometheus collecting metrics from target

**Time series**: Sequence of data points over time

**PromQL**: Prometheus query language

---

## Related Documentation

**For initial deployment:**
- `../setup/` - Step-by-step infrastructure deployment

**For day-to-day operations:**
- `../operations/` - Backup, monitoring, incident response

**For automation:**
- `../automation/` - Scripts, CI/CD, monitoring configs

**External resources:**
- Docker Swarm: https://docs.docker.com/engine/swarm/
- Cassandra: https://cassandra.apache.org/doc/latest/
- DigitalOcean: https://docs.digitalocean.com/

---

## Contributing to Reference Docs

**When to update reference documentation:**
- Major architecture changes
- New technology adoption
- Significant cost changes
- Security incidents (document lessons learned)
- Compliance requirements change
- Quarterly review cycles

**Document format:**
- Use Markdown
- Include decision date
- Link to related ADRs
- Update index/glossary as needed

---

## Document Maintenance

**Review schedule:**
- **Architecture docs**: Quarterly or when major changes occur
- **Capacity planning**: Monthly (update with metrics)
- **Cost analysis**: Monthly (track trends)
- **Security checklist**: Quarterly or after incidents
- **Technology stack**: When versions change
- **Glossary**: As needed when new terms are introduced

**Responsibility**: Infrastructure lead reviews quarterly, team contributes ongoing updates.

---

**Last Updated**: January 2025
**Maintained By**: Infrastructure Team
**Next Review**: April 2025

**Purpose**: These documents answer "why" and "what if" questions. They provide context for decisions and guidance for future planning.
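
---

As a worked check of the figures under Capacity Planning Guidelines and the Cassandra glossary entries, the sketch below recomputes the node count, vCPU total, and RAM total from the worker specifications, and the QUORUM size for RF=3. The `Node` class, the `NODES` list, and the `totals` and `quorum` helpers are hypothetical names introduced for illustration; only the specifications themselves are taken from the Current Capacity list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One droplet type (hypothetical helper, not part of any tooling)."""
    role: str
    vcpus: int
    ram_gb: int
    count: int = 1


# Worker specifications copied from the "Current Capacity" list.
NODES = [
    Node("manager + redis", 2, 2),
    Node("cassandra", 2, 4, count=3),
    Node("meilisearch", 2, 2),
    Node("backend", 2, 2),
    Node("frontend", 1, 1),
]


def totals(nodes):
    """Return (node count, total vCPUs, total RAM in GB)."""
    return (
        sum(n.count for n in nodes),
        sum(n.vcpus * n.count for n in nodes),
        sum(n.ram_gb * n.count for n in nodes),
    )


def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


if __name__ == "__main__":
    count, vcpus, ram = totals(NODES)
    print(f"{count} nodes, {vcpus} vCPUs, {ram} GB RAM")
    print(f"QUORUM for RF=3: {quorum(3)} replicas")
```

With RF=3, a QUORUM read or write needs 2 replicas, so requests at that consistency level can still succeed while one Cassandra node is down.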