Reference Documentation
Audience: All infrastructure team members, architects, management
Purpose: High-level architecture, capacity planning, cost analysis, and strategic documentation
Prerequisites: Familiarity with deployed infrastructure
Overview
This directory contains reference materials that provide the "big picture" view of your infrastructure. Unlike operational procedures (setup, operations, automation), these documents focus on why decisions were made, what the architecture looks like, and how to plan for the future.
Contents:
- Architecture diagrams and decision records
- Capacity planning and performance baselines
- Cost analysis and optimization strategies
- Security compliance documentation
- Technology choices and trade-offs
- Glossary of terms
Directory Contents
Architecture Documentation
architecture-overview.md - High-level system architecture
- Infrastructure topology
- Component interactions
- Data flow diagrams
- Network architecture
- Security boundaries
- Design principles and rationale
architecture-decisions.md - Architecture Decision Records (ADRs)
- Why Docker Swarm over Kubernetes?
- Why Cassandra over PostgreSQL?
- Why Caddy over NGINX?
- Multi-application architecture rationale
- Network segmentation strategy
- Service discovery approach
Capacity Planning
capacity-planning.md - Growth planning and scaling strategies
- Current capacity baseline
- Performance benchmarks
- Growth projections
- Scaling thresholds
- Bottleneck analysis
- Future infrastructure needs
performance-baselines.md - Performance metrics and SLOs
- Response time percentiles
- Throughput measurements
- Database performance
- Resource utilization baselines
- Service Level Objectives (SLOs)
- Service Level Indicators (SLIs)
Financial Planning
cost-analysis.md - Infrastructure costs and optimization
- Monthly cost breakdown
- Cost per service/application
- Cost trends and projections
- Optimization opportunities
- Reserved capacity vs on-demand
- TCO (Total Cost of Ownership)
cost-optimization.md - Strategies to reduce costs
- Right-sizing recommendations
- Idle resource identification
- Reserved instances opportunities
- Storage optimization
- Bandwidth optimization
- Alternative architecture considerations
Security & Compliance
security-architecture.md - Security design and controls
- Defense-in-depth layers
- Authentication and authorization
- Secrets management approach
- Network security controls
- Data encryption (at rest and in transit)
- Security monitoring and logging
security-checklist.md - Security verification checklist
- Infrastructure hardening checklist
- Compliance requirements (GDPR, SOC2, etc.)
- Security audit procedures
- Vulnerability management
- Incident response readiness
compliance.md - Regulatory compliance documentation
- GDPR compliance measures
- Data residency requirements
- Audit trail procedures
- Privacy by design implementation
- Data retention policies
- Right to be forgotten procedures
Technology Stack
technology-stack.md - Complete technology inventory
- Software versions and update policy
- Third-party services and dependencies
- Library and framework choices
- Language and runtime versions
- Tooling and development environment
technology-decisions.md - Why we chose each technology
- Database selection rationale
- Programming language choices
- Cloud provider selection
- Deployment tooling decisions
- Monitoring stack selection
Operational Reference
runbook-index.md - Quick reference to all runbooks
- Emergency procedures quick links
- Common tasks reference
- Escalation contacts
- Critical command cheat sheet
glossary.md - Terms and definitions
- Docker Swarm terminology
- Database concepts (Cassandra RF, QUORUM, etc.)
- Network terms (overlay, ingress, etc.)
- Monitoring terminology
- Infrastructure jargon decoder
Quick Reference Materials
Architecture At-a-Glance
Current Infrastructure (January 2025):
Production Environment: maplefile-prod
Region: DigitalOcean Toronto (tor1)
Nodes: 7 (1 manager + 6 workers)
Applications: MaplePress (deployed), MapleFile (deployed)
Orchestration: Docker Swarm
Container Registry: DigitalOcean Container Registry (registry.digitalocean.com/ssp)
Object Storage: DigitalOcean Spaces (nyc3)
DNS: [Your DNS provider]
SSL: Let's Encrypt (automatic via Caddy)
Networks:
- maple-private-prod: Databases and internal services
- maple-public-prod: Public-facing services (Caddy + backends)
Databases:
- Cassandra: 3-node cluster, RF=3, QUORUM consistency
- Redis: Single instance, RDB + AOF persistence
- Meilisearch: Single instance
Applications:
- MaplePress Backend: Go 1.21+, Port 8000, Domain: getmaplepress.ca
- MaplePress Frontend: React 19 + Vite, Domain: getmaplepress.com
Key Metrics Baseline (Example)
As of [Date]:
| Metric | Value | Threshold |
|---|---|---|
| Backend p95 Response Time | 150ms | < 500ms |
| Frontend Load Time | 1.2s | < 3s |
| Backend Throughput | 500 req/min | 5000 req/min capacity |
| Database Read Latency | 5ms | < 20ms |
| Database Write Latency | 10ms | < 50ms |
| Redis Hit Rate | 95% | > 90% |
| CPU Utilization (avg) | 35% | Alert at 80% |
| Memory Utilization (avg) | 50% | Alert at 85% |
| Disk Usage (avg) | 40% | Alert at 75% |
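To make the throughput row concrete, here is a minimal sketch that computes how much of the stated capacity is in use and estimates how long compound growth would take to reach a scaling threshold. The 500 and 5000 req/min figures are the example values from the table, and the 80% threshold is the one used in the scaling triggers later in this document. The 10% monthly growth rate is a made-up placeholder, not a measured figure.

```python
import math


def utilization(current: float, capacity: float) -> float:
    """Return the fraction of capacity currently in use."""
    return current / capacity


def months_until(current: float, limit: float, monthly_growth: float) -> int:
    """Whole months of compound growth before `current` reaches `limit`.

    `monthly_growth` must be positive (0.10 means 10% per month).
    """
    if current >= limit:
        return 0
    return math.ceil(math.log(limit / current) / math.log(1 + monthly_growth))


current_rpm = 500                       # example baseline from the table
capacity_rpm = 5000                     # example capacity from the table
scale_at_rpm = capacity_rpm * 80 / 100  # 80% of max
assumed_monthly_growth = 0.10           # hypothetical placeholder, replace with real data

print(f"utilization: {utilization(current_rpm, capacity_rpm):.0%}")  # utilization: 10%
print("months until scaling threshold:",
      months_until(current_rpm, scale_at_rpm, assumed_monthly_growth))
```

Swap in measured values from performance-baselines.md before using the result for planning.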
Monthly Cost Breakdown (Example)
| Service | Monthly Cost | Notes |
|---|---|---|
| Droplets (7x) | $204 | See breakdown in cost-analysis.md |
| Spaces Storage | $5 | 250GB included |
| Additional Bandwidth | $0 | Within free tier |
| Container Registry | $0 | Included |
| DNS | $0 | Using [provider] |
| Monitoring (optional) | $0 | Self-hosted Prometheus |
| Total | ~$209/mo | Can scale to ~$300/mo with growth |
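The total can be rechecked from the line items. The sketch below sums the example figures from the table and shows the headroom before the stated ~$300/mo growth ceiling. The dictionary keys are just labels chosen for illustration.

```python
# Example monthly costs in USD, copied from the table above.
monthly_costs_usd = {
    "Droplets (7x)": 204,
    "Spaces Storage": 5,
    "Additional Bandwidth": 0,
    "Container Registry": 0,
    "DNS": 0,
    "Monitoring (optional)": 0,
}

total_usd = sum(monthly_costs_usd.values())
growth_ceiling_usd = 300  # "~$300/mo with growth" from the table
headroom_usd = growth_ceiling_usd - total_usd

print(f"total: ${total_usd}/mo")                 # total: $209/mo
print(f"headroom to ceiling: ${headroom_usd}/mo")  # headroom to ceiling: $91/mo
```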
Technology Stack Summary
| Layer | Technology | Version | Purpose |
|---|---|---|---|
| OS | Ubuntu | 24.04 LTS | Base operating system |
| Orchestration | Docker Swarm | Built-in | Container orchestration |
| Container Runtime | Docker | 27.x+ | Container execution |
| Database | Cassandra | 4.1.x | Distributed database |
| Cache | Redis | 7.x | In-memory cache/sessions |
| Search | Meilisearch | v1.5+ | Full-text search |
| Reverse Proxy | Caddy | 2-alpine | HTTPS termination |
| Backend | Go | 1.21+ | Application runtime |
| Frontend | React + Vite | 19 + 5.x | Web UI |
| Object Storage | Spaces | S3-compatible | File storage |
| Monitoring | Prometheus + Grafana | Latest | Metrics & dashboards |
| CI/CD | TBD | - | Candidates: GitHub Actions / GitLab CI |
Architecture Decision Records (ADRs)
ADR-001: Docker Swarm vs Kubernetes
Decision: Use Docker Swarm for orchestration
Context: Need container orchestration for production deployment
Rationale:
- Simpler to set up and maintain (< 1 hour vs days for k8s)
- Built into Docker (no additional components)
- Sufficient for our scale (< 100 services)
- Lower operational overhead
- Easier to troubleshoot
- Team familiarity with Docker
Trade-offs:
- Less ecosystem tooling than Kubernetes
- Limited advanced scheduling features
- Smaller community
- May need to migrate to k8s if we scale dramatically (> 50 nodes)
Status: Accepted
ADR-002: Cassandra for Distributed Database
Decision: Use Cassandra for primary datastore
Context: Need highly available, distributed database with linear scalability
Rationale:
- Write-heavy workload (user-generated content)
- Geographic distribution possible (multi-region)
- Proven at scale (Instagram, Netflix)
- No single point of failure (RF=3, QUORUM)
- Linear scalability (add nodes for capacity)
- Excellent write performance
Trade-offs:
- Higher complexity than PostgreSQL
- Eventually consistent (tunable)
- Schema migrations more complex
- Higher resource usage (3 nodes minimum)
- Steeper learning curve
Alternatives Considered:
- PostgreSQL + Patroni: Simpler but less scalable
- MongoDB: Similar, but prefer Cassandra's consistency model
- MySQL Cluster: Oracle licensing concerns
Status: Accepted
ADR-003: Caddy for Reverse Proxy
Decision: Use Caddy instead of NGINX
Context: Need HTTPS termination and reverse proxy
Rationale:
- Automatic HTTPS with Let's Encrypt (zero configuration)
- Automatic certificate renewal (no cron jobs)
- Simpler configuration (10 lines vs 200+)
- Built-in HTTP/2 and HTTP/3
- Security by default
- Active development
Trade-offs:
- Less mature than NGINX (but production-ready)
- Smaller community
- Fewer third-party modules
- Slightly higher memory usage (negligible)
Performance: Equivalent for our use case (< 10k req/sec)
Status: Accepted
ADR-004: Multi-Application Shared Infrastructure
Decision: Share database infrastructure across multiple applications
Context: Planning to deploy multiple applications (MaplePress, MapleFile)
Rationale:
- Cost efficiency (one shared 3-node Cassandra cluster vs a separate cluster per application)
- Operational efficiency (one set of database procedures)
- Resource utilization (databases rarely at capacity)
- Simplified backups (one backup process)
- Consistent data layer
Isolation Strategy:
- Separate keyspaces per application
- Separate workers for application backends
- Independent scaling per application
- Separate deployment pipelines
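As an illustration of the keyspace-per-application idea, the sketch below builds the CQL statement that would create one keyspace per application with RF=3. The keyspace names `maplepress` and `maplefile` and the datacenter name `datacenter1` are assumptions for the example. Check the real datacenter name with `nodetool status` before running anything like this.

```python
def create_keyspace_cql(keyspace: str, datacenter: str, replication_factor: int = 3) -> str:
    """Build a CREATE KEYSPACE statement for one application's keyspace."""
    if not keyspace.isidentifier():
        raise ValueError(f"unsafe keyspace name: {keyspace!r}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', "
        f"'{datacenter}': {replication_factor}}};"
    )


# Hypothetical keyspace and datacenter names, one keyspace per application.
for app_keyspace in ("maplepress", "maplefile"):
    print(create_keyspace_cql(app_keyspace, datacenter="datacenter1"))
```

Each application then connects only to its own keyspace, so schema changes in one do not touch the other's tables.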
Trade-offs:
- Blast radius: One database failure affects all apps
- Resource contention possible (mitigated by capacity planning)
- Schema migration coordination needed
Status: Accepted
Capacity Planning Guidelines
Current Capacity
Worker specifications:
- Manager + Redis: 2 vCPU, 2 GB RAM
- Cassandra nodes (3x): 2 vCPU, 4 GB RAM each
- Meilisearch: 2 vCPU, 2 GB RAM
- Backend: 2 vCPU, 2 GB RAM
- Frontend: 1 vCPU, 1 GB RAM
Total: 13 vCPUs, 19 GB RAM
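The totals follow directly from the per-role specifications. A small sketch that recomputes them, so the figure can be regenerated whenever a droplet is resized (the `NodeSpec` type is introduced here only for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    role: str
    count: int
    vcpus: int   # per node
    ram_gb: int  # per node


# Values copied from the worker specifications above.
specs = [
    NodeSpec("Manager + Redis", 1, 2, 2),
    NodeSpec("Cassandra", 3, 2, 4),
    NodeSpec("Meilisearch", 1, 2, 2),
    NodeSpec("Backend", 1, 2, 2),
    NodeSpec("Frontend", 1, 1, 1),
]

total_nodes = sum(s.count for s in specs)
total_vcpus = sum(s.count * s.vcpus for s in specs)
total_ram_gb = sum(s.count * s.ram_gb for s in specs)

print(total_nodes, total_vcpus, total_ram_gb)  # 7 13 19
```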
Scaling Triggers
When to scale:
| Trigger | Duration / Scope | Action |
|---|---|---|
| CPU > 80% sustained | 5 minutes | Add worker or scale vertically |
| Memory > 85% sustained | 5 minutes | Increase droplet RAM |
| Disk > 75% full | Any node | Clear space or increase disk |
| Backend p95 > 1s | Consistent | Scale backend horizontally |
| Database latency > 50ms | Consistent | Add Cassandra node or tune |
| Request rate approaching capacity | 80% of max | Scale backend replicas |
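The "sustained" condition in the CPU and memory rows means every sample in the window is above the limit, not just the latest one. A minimal sketch of that check, assuming one sample per minute so that a window of 5 corresponds to 5 minutes (the sample values are invented):

```python
def sustained_breach(samples: list[float], threshold: float, window: int) -> bool:
    """True if the most recent `window` samples all exceed `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        return False  # not enough data to call it sustained
    return all(value > threshold for value in samples[-window:])


# Hypothetical CPU utilization samples (%), oldest first, one per minute.
cpu_percent = [72, 81, 84, 88, 91, 86]
print(sustained_breach(cpu_percent, threshold=80, window=5))  # True

# A single dip below the threshold resets the condition.
print(sustained_breach([85, 90, 79, 88, 92], threshold=80, window=5))  # False
```

In practice this logic would usually live in an alerting rule rather than application code; the sketch only shows the semantics.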
Scaling Options
Horizontal Scaling (preferred):
- Backend: Add replicas (`docker service scale maplepress_backend=3`)
- Cassandra: Add fourth node (increases capacity + resilience)
- Frontend: Add CDN or edge caching
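If backend scaling is ever scripted, the scale command above can be assembled as an argument list instead of a shell string. The sketch below only builds and prints the command. The helper name `scale_command` is hypothetical, and actually running it (for example with `subprocess.run`) would have to happen on a manager node.

```python
import shlex


def scale_command(service: str, replicas: int) -> list[str]:
    """Build the argv for `docker service scale SERVICE=REPLICAS`."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return ["docker", "service", "scale", f"{service}={replicas}"]


argv = scale_command("maplepress_backend", 3)
print(shlex.join(argv))  # docker service scale maplepress_backend=3
```

Passing a list avoids shell quoting problems if a service name ever comes from user input or a config file.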
Vertical Scaling:
- Resize droplets (requires brief restart)
- Increase memory limits in stack files
- Optimize application code first
Cost vs Performance:
- Horizontal: More resilient, linear cost increase
- Vertical: Simpler, better price/performance up to a point
Cost Optimization Strategies
Quick Wins
- Reserved Instances: DigitalOcean doesn't offer reserved pricing, but consider annual contracts for discounts
- Right-sizing: Monitor actual usage, downsize oversized droplets
- Cleanup: Regular docker system prune, clear old snapshots
- Compression: Enable gzip in Caddy (already done)
- Caching: Maximize cache hit rates (Redis, CDN)
Medium-term Optimizations
- CDN for static assets: Offload frontend static files to CDN
- Object storage lifecycle: Auto-delete old backups
- Database tuning: Optimize queries to reduce hardware needs
- Spot instances: Not available on DigitalOcean; only relevant if batch jobs are moved to a provider that offers spot pricing
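For the backup lifecycle item, the core of any auto-delete policy is choosing which objects are older than the retention window. A minimal sketch of that selection step, assuming a 30-day retention period and invented backup names. It does not talk to object storage; it only decides what would be deleted.

```python
from datetime import datetime, timedelta, timezone


def expired_backups(
    backups: dict[str, datetime], now: datetime, retention_days: int
) -> list[str]:
    """Return the names of backups created before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, created in backups.items() if created < cutoff)


# Hypothetical backup objects and creation times (UTC).
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
backups = {
    "cassandra-2024-12-01.tar.gz": datetime(2024, 12, 1, tzinfo=timezone.utc),
    "cassandra-2025-01-15.tar.gz": datetime(2025, 1, 15, tzinfo=timezone.utc),
    "cassandra-2025-01-30.tar.gz": datetime(2025, 1, 30, tzinfo=timezone.utc),
}

print(expired_backups(backups, now, retention_days=30))
# ['cassandra-2024-12-01.tar.gz']
```

Align the retention period with the data retention policy in compliance.md before enabling any automatic deletion.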
Alternative Architectures
If cost becomes primary concern:
- Single-node PostgreSQL instead of Cassandra cluster (-$96/mo)
- Collocate services on fewer droplets (-$50-100/mo)
- Use managed databases (different cost model)
Trade-off: Lower cost, higher operational risk
Security Architecture
Defense in Depth Layers
- Network: VPC, firewalls, private overlay networks
- Transport: TLS 1.3 for all external connections
- Application: Authentication, authorization, input validation
- Data: Encryption at rest (object storage), encryption in transit
- Monitoring: Audit logs, security alerts, intrusion detection
Key Security Controls
Implemented:
- ✅ SSH key-based authentication (no passwords)
- ✅ UFW firewall on all nodes
- ✅ Docker secrets for sensitive values
- ✅ Network segmentation (private vs public)
- ✅ Automatic HTTPS with perfect forward secrecy
- ✅ Security headers (HSTS, X-Frame-Options, etc.)
- ✅ Database authentication (passwords, API keys)
- ✅ Minimal attack surface (only ports 22, 80, 443 exposed)
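The "only ports 22, 80, 443 exposed" control is easy to verify mechanically once a list of publicly reachable ports is available, for instance from an external port scan. The sketch below shows only the comparison step; how the port list is gathered is left open, and the sample scan result is invented.

```python
ALLOWED_PUBLIC_PORTS = frozenset({22, 80, 443})  # from the control above


def unexpected_ports(open_ports: list[int]) -> list[int]:
    """Return publicly open ports that are not on the allow-list."""
    return sorted(set(open_ports) - ALLOWED_PUBLIC_PORTS)


# Hypothetical scan result for one node.
scan_result = [22, 80, 443, 9042]
print(unexpected_ports(scan_result))  # [9042]
```

An empty list means the node matches the documented attack surface. Anything else should be investigated and either closed or documented.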
Planned:
- fail2ban for SSH brute-force protection
- Intrusion detection system (IDS)
- Regular security scanning (Trivy for containers)
- Secret rotation automation
- Audit logging aggregation
Compliance Considerations
GDPR
If processing EU user data:
- Data residency: Deploy EU region workers
- Right to deletion: Implement user data purge
- Data portability: Export user data functionality
- Privacy by design: Minimal data collection
- Audit trail: Log all data access
SOC2
If pursuing SOC2 compliance:
- Access controls: Role-based access, MFA
- Change management: All changes via git, reviewed
- Monitoring: Comprehensive logging and alerting
- Incident response: Documented procedures
- Business continuity: Backup and disaster recovery tested
Document in: compliance.md
Glossary
Docker Swarm Terms
- Manager node: Swarm orchestrator, schedules tasks, maintains cluster state
- Worker node: Executes tasks (containers) assigned by manager
- Service: Definition of containers to run (image, replicas, network)
- Task: Single container instance of a service
- Stack: Group of related services deployed together
- Overlay network: Virtual network spanning all swarm nodes
- Ingress network: Built-in load balancing for published ports
- Node label: Key-value tag for task placement constraints
Cassandra Terms
- RF (Replication Factor): Number of copies of data (RF=3 = 3 copies)
- QUORUM: Majority of replicas (2 out of 3 for RF=3)
- Consistency Level: How many replicas must respond (ONE, QUORUM, ALL)
- Keyspace: Database namespace (like database in SQL)
- SSTable: Immutable data file on disk
- Compaction: Merging SSTables to reclaim space
- Repair: Synchronize data across replicas
- Nodetool: Command-line tool for Cassandra administration
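The relationship between RF and QUORUM is simple arithmetic: a quorum is a strict majority of the replicas. A short sketch that shows why RF=3 with QUORUM reads and writes tolerates one replica being down while still giving read-your-writes behaviour (the function names are illustrative):

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM operations still succeed."""
    return replication_factor - quorum(replication_factor)


def reads_see_latest_write(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """Read and write replica sets must overlap: R + W > RF."""
    return read_replicas + write_replicas > rf


rf = 3
print(quorum(rf))              # 2
print(tolerated_failures(rf))  # 1
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # True
print(reads_see_latest_write(1, 1, rf))                    # False (ONE/ONE)
```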
Monitoring Terms
- Prometheus: Time-series database and metrics collection
- Grafana: Visualization and dashboarding
- Alertmanager: Alert routing and notification
- Exporter: Metrics collection agent (node_exporter, etc.)
- Scrape: Prometheus collecting metrics from target
- Time series: Sequence of data points over time
- PromQL: Prometheus query language
Related Documentation
For initial deployment:
../setup/ - Step-by-step infrastructure deployment
For day-to-day operations:
../operations/ - Backup, monitoring, incident response
For automation:
../automation/ - Scripts, CI/CD, monitoring configs
External resources:
- Docker Swarm: https://docs.docker.com/engine/swarm/
- Cassandra: https://cassandra.apache.org/doc/latest/
- DigitalOcean: https://docs.digitalocean.com/
Contributing to Reference Docs
When to update reference documentation:
- Major architecture changes
- New technology adoption
- Significant cost changes
- Security incidents (document lessons learned)
- Compliance requirements change
- Quarterly review cycles
Document format:
- Use Markdown
- Include decision date
- Link to related ADRs
- Update index/glossary as needed
Document Maintenance
Review schedule:
- Architecture docs: Quarterly or after major changes
- Capacity planning: Monthly (update with metrics)
- Cost analysis: Monthly (track trends)
- Security checklist: Quarterly or after incidents
- Technology stack: When versions change
- Glossary: As needed when new terms are introduced
Responsibility: Infrastructure lead reviews quarterly, team contributes ongoing updates.
Last Updated: January 2025
Maintained By: Infrastructure Team
Next Review: April 2025
Purpose: These documents answer "why" and "what if" questions. They provide context for decisions and guidance for future planning.