Initial commit: Open sourcing all of the Maple Open Technologies code.

2025-12-02 14:33:08 -05:00 · commit 755d54a99d
2010 changed files with 448675 additions and 0 deletions
--- a/cloud/infrastructure/production/reference/README.md
+++ b/cloud/infrastructure/production/reference/README.md
@ -0,0 +1,544 @@
+# Reference Documentation
+
+**Audience**: All infrastructure team members, architects, management
+**Purpose**: High-level architecture, capacity planning, cost analysis, and strategic documentation
+**Prerequisites**: Familiarity with deployed infrastructure
+
+---
+
+## Overview
+
+This directory contains reference materials that provide the "big picture" view of your infrastructure. Unlike operational procedures (setup, operations, automation), these documents focus on **why** decisions were made, **what** the architecture looks like, and **how** to plan for the future.
+
+**Contents:**
+- Architecture diagrams and decision records
+- Capacity planning and performance baselines
+- Cost analysis and optimization strategies
+- Security compliance documentation
+- Technology choices and trade-offs
+- Glossary of terms
+
+---
+
+## Directory Contents
+
+### Architecture Documentation
+
+**`architecture-overview.md`** - High-level system architecture
+- Infrastructure topology
+- Component interactions
+- Data flow diagrams
+- Network architecture
+- Security boundaries
+- Design principles and rationale
+
+**`architecture-decisions.md`** - Architecture Decision Records (ADRs)
+- Why Docker Swarm over Kubernetes?
+- Why Cassandra over PostgreSQL?
+- Why Caddy over NGINX?
+- Multi-application architecture rationale
+- Network segmentation strategy
+- Service discovery approach
+
+### Capacity Planning
+
+**`capacity-planning.md`** - Growth planning and scaling strategies
+- Current capacity baseline
+- Performance benchmarks
+- Growth projections
+- Scaling thresholds
+- Bottleneck analysis
+- Future infrastructure needs
+
+**`performance-baselines.md`** - Performance metrics and SLOs
+- Response time percentiles
+- Throughput measurements
+- Database performance
+- Resource utilization baselines
+- Service Level Objectives (SLOs)
+- Service Level Indicators (SLIs)
+
+### Financial Planning
+
+**`cost-analysis.md`** - Infrastructure costs and optimization
+- Monthly cost breakdown
+- Cost per service/application
+- Cost trends and projections
+- Optimization opportunities
+- Reserved capacity vs on-demand
+- TCO (Total Cost of Ownership)
+
+**`cost-optimization.md`** - Strategies to reduce costs
+- Right-sizing recommendations
+- Idle resource identification
+- Reserved instances opportunities
+- Storage optimization
+- Bandwidth optimization
+- Alternative architecture considerations
+
+### Security & Compliance
+
+**`security-architecture.md`** - Security design and controls
+- Defense-in-depth layers
+- Authentication and authorization
+- Secrets management approach
+- Network security controls
+- Data encryption (at rest and in transit)
+- Security monitoring and logging
+
+**`security-checklist.md`** - Security verification checklist
+- Infrastructure hardening checklist
+- Compliance requirements (GDPR, SOC2, etc.)
+- Security audit procedures
+- Vulnerability management
+- Incident response readiness
+
+**`compliance.md`** - Regulatory compliance documentation
+- GDPR compliance measures
+- Data residency requirements
+- Audit trail procedures
+- Privacy by design implementation
+- Data retention policies
+- Right to be forgotten procedures
+
+### Technology Stack
+
+**`technology-stack.md`** - Complete technology inventory
+- Software versions and update policy
+- Third-party services and dependencies
+- Library and framework choices
+- Language and runtime versions
+- Tooling and development environment
+
+**`technology-decisions.md`** - Why we chose each technology
+- Database selection rationale
+- Programming language choices
+- Cloud provider selection
+- Deployment tooling decisions
+- Monitoring stack selection
+
+### Operational Reference
+
+**`runbook-index.md`** - Quick reference to all runbooks
+- Emergency procedures quick links
+- Common tasks reference
+- Escalation contacts
+- Critical command cheat sheet
+
+**`glossary.md`** - Terms and definitions
+- Docker Swarm terminology
+- Database concepts (Cassandra RF, QUORUM, etc.)
+- Network terms (overlay, ingress, etc.)
+- Monitoring terminology
+- Infrastructure jargon decoder
+
+---
+
+## Quick Reference Materials
+
+### Architecture At-a-Glance
+
+**Current Infrastructure (January 2025):**
+
+```
+Production Environment: maplefile-prod
+Region: DigitalOcean Toronto (tor1)
+Nodes: 7 (1 manager + 6 workers)
+Applications: MaplePress (deployed), MapleFile (deployed)
+
+Orchestration: Docker Swarm
+Container Registry: DigitalOcean Container Registry (registry.digitalocean.com/ssp)
+Object Storage: DigitalOcean Spaces (nyc3)
+DNS: [Your DNS provider]
+SSL: Let's Encrypt (automatic via Caddy)
+
+Networks:
+  - maple-private-prod: Databases and internal services
+  - maple-public-prod: Public-facing services (Caddy + backends)
+
+Databases:
+  - Cassandra: 3-node cluster, RF=3, QUORUM consistency
+  - Redis: Single instance, RDB + AOF persistence
+  - Meilisearch: Single instance
+
+Applications:
+  - MaplePress Backend: Go 1.21+, Port 8000, Domain: getmaplepress.ca
+  - MaplePress Frontend: React 19 + Vite, Domain: getmaplepress.com
+```
+
+### Key Metrics Baseline (Example)
+
+**As of [Date]:**
+
+| Metric | Value | Threshold |
+|--------|-------|-----------|
+| Backend p95 Response Time | 150ms | < 500ms |
+| Frontend Load Time | 1.2s | < 3s |
+| Backend Throughput | 500 req/min | 5000 req/min capacity |
+| Database Read Latency | 5ms | < 20ms |
+| Database Write Latency | 10ms | < 50ms |
+| Redis Hit Rate | 95% | > 90% |
+| CPU Utilization (avg) | 35% | Alert at 80% |
+| Memory Utilization (avg) | 50% | Alert at 85% |
+| Disk Usage (avg) | 40% | Alert at 75% |
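
The throughput row implies roughly 10% utilization (500 of 5,000 req/min). As a rough sketch of how that headroom might be projected forward, the snippet below assumes compound monthly growth and the 80%-of-capacity scaling trigger listed under Scaling Triggers. The 10% growth rate and the function name `months_until_trigger` are illustrative assumptions, not measured values or existing tooling.

```python
def months_until_trigger(current_rpm: float, capacity_rpm: float,
                         trigger_fraction: float, monthly_growth: float) -> int:
    """Count whole months until traffic reaches trigger_fraction of capacity.

    Assumes compound growth at a constant monthly rate, which is an
    illustrative simplification; real traffic is rarely this regular.
    """
    if current_rpm <= 0 or monthly_growth <= 0:
        raise ValueError("current_rpm and monthly_growth must be positive")
    target = capacity_rpm * trigger_fraction
    months = 0
    rate = current_rpm
    while rate < target:
        rate *= 1 + monthly_growth
        months += 1
    return months


# Example values from the baseline table: 500 req/min against 5000 req/min capacity.
utilization = 500 / 5000
months = months_until_trigger(500, 5000, trigger_fraction=0.8, monthly_growth=0.10)
print(f"utilization={utilization:.0%}, months until 80% trigger: {months}")
```

With these example inputs the trigger is reached after 22 months; a different growth assumption changes that figure substantially, so treat it as a planning aid only.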
+
+### Monthly Cost Breakdown (Example)
+
+| Service | Monthly Cost | Notes |
+|---------|--------------|-------|
+| Droplets (7x) | $204 | See breakdown in cost-analysis.md |
+| Spaces Storage | $5 | 250GB included |
+| Additional Bandwidth | $0 | Within free tier |
+| Container Registry | $0 | Included |
+| DNS | $0 | Using [provider] |
+| Monitoring (optional) | $0 | Self-hosted Prometheus |
+| **Total** | **~$209/mo** | Can scale to ~$300/mo with growth |
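
As a quick cross-check of the example table, the line items can be summed directly. This is only a sketch over the example figures shown above; the per-droplet average is a derived number, not a quoted DigitalOcean price.

```python
# Monthly line items from the example cost table, in USD.
monthly_costs = {
    "Droplets (7x)": 204,
    "Spaces Storage": 5,
    "Additional Bandwidth": 0,
    "Container Registry": 0,
    "DNS": 0,
    "Monitoring (optional)": 0,
}

total = sum(monthly_costs.values())
average_per_droplet = monthly_costs["Droplets (7x)"] / 7
growth_headroom = 300 - total  # gap to the ~$300/mo growth estimate

print(f"total=${total}/mo, "
      f"avg droplet=${average_per_droplet:.2f}/mo, "
      f"headroom=${growth_headroom}/mo")
```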
+
+### Technology Stack Summary
+
+| Layer | Technology | Version | Purpose |
+|-------|------------|---------|---------|
+| **OS** | Ubuntu | 24.04 LTS | Base operating system |
+| **Orchestration** | Docker Swarm | Built-in | Container orchestration |
+| **Container Runtime** | Docker | 27.x+ | Container execution |
+| **Database** | Cassandra | 4.1.x | Distributed database |
+| **Cache** | Redis | 7.x | In-memory cache/sessions |
+| **Search** | Meilisearch | v1.5+ | Full-text search |
+| **Reverse Proxy** | Caddy | 2-alpine | HTTPS termination |
+| **Backend** | Go | 1.21+ | Application runtime |
+| **Frontend** | React + Vite | 19 + 5.x | Web UI |
+| **Object Storage** | Spaces | S3-compatible | File storage |
+| **Monitoring** | Prometheus + Grafana | Latest | Metrics & dashboards |
+| **CI/CD** | TBD | - | Candidates: GitHub Actions, GitLab CI |
+
+---
+
+## Architecture Decision Records (ADRs)
+
+### ADR-001: Docker Swarm vs Kubernetes
+
+**Decision**: Use Docker Swarm for orchestration
+
+**Context**: Need container orchestration for production deployment
+
+**Rationale**:
+- Simpler to set up and maintain (< 1 hour vs days for k8s)
+- Built into Docker (no additional components)
+- Sufficient for our scale (< 100 services)
+- Lower operational overhead
+- Easier to troubleshoot
+- Team familiarity with Docker
+
+**Trade-offs**:
+- Less ecosystem tooling than Kubernetes
+- Limited advanced scheduling features
+- Smaller community
+- May need to migrate to k8s if we scale dramatically (> 50 nodes)
+
+**Status**: Accepted
+
+---
+
+### ADR-002: Cassandra for Distributed Database
+
+**Decision**: Use Cassandra for primary datastore
+
+**Context**: Need highly available, distributed database with linear scalability
+
+**Rationale**:
+- Write-heavy workload (user-generated content)
+- Geographic distribution possible (multi-region)
+- Proven at scale (Instagram, Netflix)
+- No single point of failure (RF=3, QUORUM)
+- Linear scalability (add nodes for capacity)
+- Excellent write performance
+
+**Trade-offs**:
+- Higher complexity than PostgreSQL
+- Eventually consistent (tunable)
+- Schema migrations more complex
+- Higher resource usage (3 nodes minimum)
+- Steeper learning curve
+
+**Alternatives Considered**:
+- PostgreSQL + Patroni: Simpler but less scalable
+- MongoDB: Similar, but prefer Cassandra's consistency model
+- MySQL Cluster: Oracle licensing concerns
+
+**Status**: Accepted
+
+---
+
+### ADR-003: Caddy for Reverse Proxy
+
+**Decision**: Use Caddy instead of NGINX
+
+**Context**: Need HTTPS termination and reverse proxy
+
+**Rationale**:
+- Automatic HTTPS with Let's Encrypt (zero configuration)
+- Automatic certificate renewal (no cron jobs)
+- Simpler configuration (10 lines vs 200+)
+- Built-in HTTP/2 and HTTP/3
+- Security by default
+- Active development
+
+**Trade-offs**:
+- Less mature than NGINX (but production-ready)
+- Smaller community
+- Fewer third-party modules
+- Slightly higher memory usage (negligible)
+
+**Performance**: Equivalent for our use case (< 10k req/sec)
+
+**Status**: Accepted
+
+---
+
+### ADR-004: Multi-Application Shared Infrastructure
+
+**Decision**: Share database infrastructure across multiple applications
+
+**Context**: Planning to deploy multiple applications (MaplePress, MapleFile)
+
+**Rationale**:
+- Cost efficiency (one shared 3-node Cassandra cluster vs a separate cluster per application)
+- Operational efficiency (one set of database procedures)
+- Resource utilization (databases rarely at capacity)
+- Simplified backups (one backup process)
+- Consistent data layer
+
+**Isolation Strategy**:
+- Separate keyspaces per application
+- Separate workers for application backends
+- Independent scaling per application
+- Separate deployment pipelines
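
The keyspace-per-application isolation above could be expressed as one `CREATE KEYSPACE` statement per application. The sketch below only builds the CQL text. The keyspace name `maplepress` and the datacenter name `datacenter1` are assumptions for illustration, so substitute the names your cluster actually reports.

```python
def create_keyspace_cql(keyspace: str, datacenter: str,
                        replication_factor: int = 3) -> str:
    """Build a CQL statement that creates one keyspace for one application."""
    # Reject names that are not plain identifiers, since the value is
    # interpolated into the statement text.
    if not keyspace.isidentifier() or not datacenter.isidentifier():
        raise ValueError("keyspace and datacenter must be plain identifiers")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', "
        f"'{datacenter}': {replication_factor}}};"
    )


# Hypothetical names; RF=3 matches the cluster described in this document.
statement = create_keyspace_cql("maplepress", "datacenter1")
print(statement)
```

Keeping the replication factor as a parameter means each application can share RF=3 today while leaving room to differ later.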
+
+**Trade-offs**:
+- Blast radius: One database failure affects all apps
+- Resource contention possible (mitigated by capacity planning)
+- Schema migration coordination needed
+
+**Status**: Accepted
+
+---
+
+## Capacity Planning Guidelines
+
+### Current Capacity
+
+**Node specifications:**
+- Manager + Redis: 2 vCPU, 2 GB RAM
+- Cassandra nodes (3x): 2 vCPU, 4 GB RAM each
+- Meilisearch: 2 vCPU, 2 GB RAM
+- Backend: 2 vCPU, 2 GB RAM
+- Frontend: 1 vCPU, 1 GB RAM
+
+**Total:** 13 vCPUs, 19 GB RAM
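
The totals follow from the per-node figures: the three Cassandra nodes contribute 6 vCPUs and 12 GB, and the remaining four nodes contribute 7 vCPUs and 7 GB. A small sketch that reproduces the arithmetic (the `NodeSpec` type is a hypothetical helper, not part of any existing tooling):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    role: str
    count: int
    vcpus: int    # per node
    ram_gb: int   # per node


# Figures copied from the specification list above.
NODES = [
    NodeSpec("manager + redis", 1, 2, 2),
    NodeSpec("cassandra", 3, 2, 4),
    NodeSpec("meilisearch", 1, 2, 2),
    NodeSpec("backend", 1, 2, 2),
    NodeSpec("frontend", 1, 1, 1),
]


def totals(nodes):
    """Return (node count, total vCPUs, total RAM in GB)."""
    return (
        sum(n.count for n in nodes),
        sum(n.count * n.vcpus for n in nodes),
        sum(n.count * n.ram_gb for n in nodes),
    )


print(totals(NODES))
```

Rerunning this after any resize keeps the stated total in step with the per-node list.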
+
+### Scaling Triggers
+
+**When to scale:**
+
+| Metric | Threshold | Action |
+|--------|-----------|--------|
+| CPU utilization | > 80% sustained for 5 minutes | Add worker or scale vertically |
+| Memory utilization | > 85% sustained for 5 minutes | Increase droplet RAM |
+| Disk usage | > 75% full on any node | Clear space or increase disk |
+| Backend p95 response time | > 1s consistently | Scale backend horizontally |
+| Database latency | > 50ms consistently | Add Cassandra node or tune |
+| Request rate | Reaches 80% of max capacity | Scale backend replicas |
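
The triggers in this table can be sketched as a single check over already-aggregated metrics. This is a minimal illustration, assuming the "sustained" and "consistent" windows have been applied upstream (for example by the monitoring stack); the `Metrics` type and `scaling_actions` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    cpu_pct: float            # CPU utilization sustained over 5 minutes
    memory_pct: float         # memory utilization sustained over 5 minutes
    max_disk_pct: float       # fullest disk on any node
    backend_p95_s: float      # backend p95 response time, seconds
    db_latency_ms: float      # database latency, milliseconds
    request_rate: float       # current requests per minute
    max_request_rate: float   # measured capacity, requests per minute


def scaling_actions(m: Metrics) -> list[str]:
    """Return the actions from the trigger table whose thresholds are exceeded."""
    actions = []
    if m.cpu_pct > 80:
        actions.append("Add worker or scale vertically")
    if m.memory_pct > 85:
        actions.append("Increase droplet RAM")
    if m.max_disk_pct > 75:
        actions.append("Clear space or increase disk")
    if m.backend_p95_s > 1.0:
        actions.append("Scale backend horizontally")
    if m.db_latency_ms > 50:
        actions.append("Add Cassandra node or tune")
    if m.request_rate >= 0.8 * m.max_request_rate:
        actions.append("Scale backend replicas")
    return actions


# Baseline values from the example metrics table trigger nothing.
baseline = Metrics(35, 50, 40, 0.15, 10, 500, 5000)
# A hypothetical busy period trips three triggers.
busy = Metrics(85, 50, 40, 1.2, 10, 4200, 5000)
print(scaling_actions(baseline))
print(scaling_actions(busy))
```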
+
+### Scaling Options
+
+**Horizontal Scaling (preferred):**
+- Backend: Add replicas (`docker service scale maplepress_backend=3`)
+- Cassandra: Add fourth node (increases capacity + resilience)
+- Frontend: Add CDN or edge caching
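
For the backend case, the replica count passed to `docker service scale` can be derived from the request rate. The sketch below assumes each replica handles about 5,000 req/min (the example capacity figure, treated here as per-replica, which is an assumption) and targets 80% utilization. It only builds the command as an argument list and does not run it.

```python
import math


def replicas_needed(request_rate: float, per_replica_capacity: float,
                    target_utilization: float = 0.8) -> int:
    """Smallest replica count that keeps each replica at or below the target."""
    usable = per_replica_capacity * target_utilization
    return max(1, math.ceil(request_rate / usable))


def scale_command(service: str, replicas: int) -> list[str]:
    """Build the `docker service scale` invocation as an argument list."""
    return ["docker", "service", "scale", f"{service}={replicas}"]


# Hypothetical load of 9000 req/min against 5000 req/min per replica.
replicas = replicas_needed(9000, 5000)
command = scale_command("maplepress_backend", replicas)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` on a manager node once the numbers have been reviewed.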
+
+**Vertical Scaling:**
+- Resize droplets (requires brief restart)
+- Increase memory limits in stack files
+- Optimize application code first
+
+**Cost vs Performance:**
+- Horizontal: More resilient, linear cost increase
+- Vertical: Simpler, better price/performance up to a point
+
+---
+
+## Cost Optimization Strategies
+
+### Quick Wins
+
+1. **Reserved Instances**: DigitalOcean doesn't offer reserved pricing, but consider annual contracts for discounts
+2. **Right-sizing**: Monitor actual usage, downsize oversized droplets
+3. **Cleanup**: Run `docker system prune` regularly and delete old snapshots
+4. **Compression**: Enable gzip in Caddy (already done)
+5. **Caching**: Maximize cache hit rates (Redis, CDN)
+
+### Medium-term Optimizations
+
+1. **CDN for static assets**: Offload frontend static files to CDN
+2. **Object storage lifecycle**: Auto-delete old backups
+3. **Database tuning**: Optimize queries to reduce hardware needs
+4. **Spot instances**: Not available on DigitalOcean; relevant only if batch jobs move to a provider that offers them
+
+### Alternative Architectures
+
+**If cost becomes primary concern:**
+- Single-node PostgreSQL instead of Cassandra cluster (-$96/mo)
+- Collocate services on fewer droplets (-$50-100/mo)
+- Use managed databases (different cost model)
+
+**Trade-off**: Lower cost, higher operational risk
+
+---
+
+## Security Architecture
+
+### Defense in Depth Layers
+
+1. **Network**: VPC, firewalls, private overlay networks
+2. **Transport**: TLS 1.3 for all external connections
+3. **Application**: Authentication, authorization, input validation
+4. **Data**: Encryption at rest (object storage), encryption in transit
+5. **Monitoring**: Audit logs, security alerts, intrusion detection
+
+### Key Security Controls
+
+**Implemented:**
+- ✅ SSH key-based authentication (no passwords)
+- ✅ UFW firewall on all nodes
+- ✅ Docker secrets for sensitive values
+- ✅ Network segmentation (private vs public)
+- ✅ Automatic HTTPS with perfect forward secrecy
+- ✅ Security headers (HSTS, X-Frame-Options, etc.)
+- ✅ Database authentication (passwords, API keys)
+- ✅ Minimal attack surface (only ports 22, 80, 443 exposed)
+
+**Planned:**
+- [ ] fail2ban for SSH brute-force protection
+- [ ] Intrusion detection system (IDS)
+- [ ] Regular security scanning (Trivy for containers)
+- [ ] Secret rotation automation
+- [ ] Audit logging aggregation
+
+---
+
+## Compliance Considerations
+
+### GDPR
+
+**If processing EU user data:**
+- Data residency: Deploy EU region workers
+- Right to deletion: Implement user data purge
+- Data portability: Export user data functionality
+- Privacy by design: Minimal data collection
+- Audit trail: Log all data access
+
+### SOC2
+
+**If pursuing SOC2 compliance:**
+- Access controls: Role-based access, MFA
+- Change management: All changes via git, reviewed
+- Monitoring: Comprehensive logging and alerting
+- Incident response: Documented procedures
+- Business continuity: Backup and disaster recovery tested
+
+**Document in**: `compliance.md`
+
+---
+
+## Glossary
+
+### Docker Swarm Terms
+
+- **Manager node**: Swarm orchestrator, schedules tasks, maintains cluster state
+- **Worker node**: Executes tasks (containers) assigned by manager
+- **Service**: Definition of containers to run (image, replicas, network)
+- **Task**: Single container instance of a service
+- **Stack**: Group of related services deployed together
+- **Overlay network**: Virtual network spanning all swarm nodes
+- **Ingress network**: Built-in load balancing for published ports
+- **Node label**: Key-value tag for task placement constraints
+
+### Cassandra Terms
+
+- **RF (Replication Factor)**: Number of copies of data (RF=3 = 3 copies)
+- **QUORUM**: Majority of replicas (2 out of 3 for RF=3)
+- **Consistency Level**: How many replicas must respond (ONE, QUORUM, ALL)
+- **Keyspace**: Database namespace (like database in SQL)
+- **SSTable**: Immutable data file on disk
+- **Compaction**: Merging SSTables to reclaim space
+- **Repair**: Synchronize data across replicas
+- **Nodetool**: Command-line tool for Cassandra administration
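
The RF and QUORUM entries are linked by simple arithmetic: a quorum is a majority of replicas, and reads are guaranteed to see the latest acknowledged write when the read and write replica counts together exceed RF. A short sketch of that relationship (function names are illustrative):

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM operations still succeed."""
    return replication_factor - quorum(replication_factor)


def overlaps(read_replicas: int, write_replicas: int,
             replication_factor: int) -> bool:
    """True when every read set intersects every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(f"RF={rf}: quorum={q}, tolerated failures={tolerated_replica_failures(rf)}")
print("QUORUM reads + QUORUM writes overlap:", overlaps(q, q, rf))
print("ONE reads + ONE writes overlap:", overlaps(1, 1, rf))
```

For the RF=3 cluster described here, QUORUM needs 2 replicas and tolerates one node being down.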
+
+### Monitoring Terms
+
+- **Prometheus**: Time-series database and metrics collection
+- **Grafana**: Visualization and dashboarding
+- **Alertmanager**: Alert routing and notification
+- **Exporter**: Metrics collection agent (node_exporter, etc.)
+- **Scrape**: Prometheus collecting metrics from target
+- **Time series**: Sequence of data points over time
+- **PromQL**: Prometheus query language
+
+---
+
+## Related Documentation
+
+**For initial deployment:**
+- `../setup/` - Step-by-step infrastructure deployment
+
+**For day-to-day operations:**
+- `../operations/` - Backup, monitoring, incident response
+
+**For automation:**
+- `../automation/` - Scripts, CI/CD, monitoring configs
+
+**External resources:**
+- Docker Swarm: https://docs.docker.com/engine/swarm/
+- Cassandra: https://cassandra.apache.org/doc/latest/
+- DigitalOcean: https://docs.digitalocean.com/
+
+---
+
+## Contributing to Reference Docs
+
+**When to update reference documentation:**
+
+- Major architecture changes
+- New technology adoption
+- Significant cost changes
+- Security incidents (document lessons learned)
+- Compliance requirements change
+- Quarterly review cycles
+
+**Document format:**
+- Use Markdown
+- Include decision date
+- Link to related ADRs
+- Update index/glossary as needed
+
+---
+
+## Document Maintenance
+
+**Review schedule:**
+- **Architecture docs**: Quarterly, or whenever major changes occur
+- **Capacity planning**: Monthly (update with metrics)
+- **Cost analysis**: Monthly (track trends)
+- **Security checklist**: Quarterly or after incidents
+- **Technology stack**: When versions change
+- **Glossary**: As needed, when new terms are introduced
+
+**Responsibility**: The infrastructure lead reviews quarterly; the team contributes ongoing updates.
+
+---
+
+**Last Updated**: January 2025
+**Maintained By**: Infrastructure Team
+**Next Review**: April 2025
+
+**Purpose**: These documents answer "why" and "what if" questions. They provide context for decisions and guidance for future planning.