Initial commit: Open sourcing all of the Maple Open Technologies code.
This commit is contained in:
commit
755d54a99d
2010 changed files with 448675 additions and 0 deletions
544
cloud/infrastructure/production/reference/README.md
Normal file
# Reference Documentation

**Audience**: All infrastructure team members, architects, management
**Purpose**: High-level architecture, capacity planning, cost analysis, and strategic documentation
**Prerequisites**: Familiarity with deployed infrastructure

---

## Overview

This directory contains reference materials that provide the "big picture" view of your infrastructure. Unlike operational procedures (setup, operations, automation), these documents focus on **why** decisions were made, **what** the architecture looks like, and **how** to plan for the future.

**Contents:**
- Architecture diagrams and decision records
- Capacity planning and performance baselines
- Cost analysis and optimization strategies
- Security compliance documentation
- Technology choices and trade-offs
- Glossary of terms

---

## Directory Contents

### Architecture Documentation

**`architecture-overview.md`** - High-level system architecture
- Infrastructure topology
- Component interactions
- Data flow diagrams
- Network architecture
- Security boundaries
- Design principles and rationale

**`architecture-decisions.md`** - Architecture Decision Records (ADRs)
- Why Docker Swarm over Kubernetes?
- Why Cassandra over PostgreSQL?
- Why Caddy over NGINX?
- Multi-application architecture rationale
- Network segmentation strategy
- Service discovery approach

### Capacity Planning

**`capacity-planning.md`** - Growth planning and scaling strategies
- Current capacity baseline
- Performance benchmarks
- Growth projections
- Scaling thresholds
- Bottleneck analysis
- Future infrastructure needs

**`performance-baselines.md`** - Performance metrics and SLOs
- Response time percentiles
- Throughput measurements
- Database performance
- Resource utilization baselines
- Service Level Objectives (SLOs)
- Service Level Indicators (SLIs)

### Financial Planning

**`cost-analysis.md`** - Infrastructure costs and optimization
- Monthly cost breakdown
- Cost per service/application
- Cost trends and projections
- Optimization opportunities
- Reserved capacity vs on-demand
- TCO (Total Cost of Ownership)

**`cost-optimization.md`** - Strategies to reduce costs
- Right-sizing recommendations
- Idle resource identification
- Reserved instance opportunities
- Storage optimization
- Bandwidth optimization
- Alternative architecture considerations

### Security & Compliance

**`security-architecture.md`** - Security design and controls
- Defense-in-depth layers
- Authentication and authorization
- Secrets management approach
- Network security controls
- Data encryption (at rest and in transit)
- Security monitoring and logging

**`security-checklist.md`** - Security verification checklist
- Infrastructure hardening checklist
- Compliance requirements (GDPR, SOC2, etc.)
- Security audit procedures
- Vulnerability management
- Incident response readiness

**`compliance.md`** - Regulatory compliance documentation
- GDPR compliance measures
- Data residency requirements
- Audit trail procedures
- Privacy by design implementation
- Data retention policies
- Right to be forgotten procedures

### Technology Stack

**`technology-stack.md`** - Complete technology inventory
- Software versions and update policy
- Third-party services and dependencies
- Library and framework choices
- Language and runtime versions
- Tooling and development environment

**`technology-decisions.md`** - Why we chose each technology
- Database selection rationale
- Programming language choices
- Cloud provider selection
- Deployment tooling decisions
- Monitoring stack selection

### Operational Reference

**`runbook-index.md`** - Quick reference to all runbooks
- Emergency procedures quick links
- Common tasks reference
- Escalation contacts
- Critical command cheat sheet

**`glossary.md`** - Terms and definitions
- Docker Swarm terminology
- Database concepts (Cassandra RF, QUORUM, etc.)
- Network terms (overlay, ingress, etc.)
- Monitoring terminology
- Infrastructure jargon decoder

---

## Quick Reference Materials

### Architecture At-a-Glance

**Current Infrastructure (January 2025):**

```
Production Environment: maplefile-prod
Region: DigitalOcean Toronto (tor1)
Nodes: 7 (1 manager + 6 workers)
Applications: MaplePress (deployed), MapleFile (deployed)

Orchestration: Docker Swarm
Container Registry: DigitalOcean Container Registry (registry.digitalocean.com/ssp)
Object Storage: DigitalOcean Spaces (nyc3)
DNS: [Your DNS provider]
SSL: Let's Encrypt (automatic via Caddy)

Networks:
- maple-private-prod: Databases and internal services
- maple-public-prod: Public-facing services (Caddy + backends)

Databases:
- Cassandra: 3-node cluster, RF=3, QUORUM consistency
- Redis: Single instance, RDB + AOF persistence
- Meilisearch: Single instance

Applications:
- MaplePress Backend: Go 1.21+, Port 8000, Domain: getmaplepress.ca
- MaplePress Frontend: React 19 + Vite, Domain: getmaplepress.com
```

### Key Metrics Baseline (Example)

**As of [Date]:**

| Metric | Value | Threshold |
|--------|-------|-----------|
| Backend p95 Response Time | 150ms | < 500ms |
| Frontend Load Time | 1.2s | < 3s |
| Backend Throughput | 500 req/min | 5000 req/min capacity |
| Database Read Latency | 5ms | < 20ms |
| Database Write Latency | 10ms | < 50ms |
| Redis Hit Rate | 95% | > 90% |
| CPU Utilization (avg) | 35% | Alert at 80% |
| Memory Utilization (avg) | 50% | Alert at 85% |
| Disk Usage (avg) | 40% | Alert at 75% |

### Monthly Cost Breakdown (Example)

| Service | Monthly Cost | Notes |
|---------|--------------|-------|
| Droplets (7x) | $204 | See breakdown in cost-analysis.md |
| Spaces Storage | $5 | 250GB included |
| Additional Bandwidth | $0 | Within free tier |
| Container Registry | $0 | Included |
| DNS | $0 | Using [provider] |
| Monitoring (optional) | $0 | Self-hosted Prometheus |
| **Total** | **~$209/mo** | Can scale to ~$300/mo with growth |

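The total in the example cost table can be re-derived from its line items. The following is a minimal sketch, assuming the example figures above; the names `MONTHLY_COSTS_USD`, `monthly_total`, and `cost_share` are hypothetical and not part of any existing tooling in this repository.

```python
# Line items copied from the example cost table (USD per month).
MONTHLY_COSTS_USD = {
    "Droplets (7x)": 204,
    "Spaces Storage": 5,
    "Additional Bandwidth": 0,
    "Container Registry": 0,
    "DNS": 0,
    "Monitoring (optional)": 0,
}


def monthly_total(costs: dict[str, int]) -> int:
    """Sum all line items into a single monthly figure."""
    return sum(costs.values())


def cost_share(costs: dict[str, int]) -> dict[str, float]:
    """Return each line item's share of the total, in percent."""
    total = monthly_total(costs)
    if total == 0:
        return {name: 0.0 for name in costs}
    return {name: round(100 * amount / total, 1) for name, amount in costs.items()}


if __name__ == "__main__":
    print(f"Total: ${monthly_total(MONTHLY_COSTS_USD)}/mo")
    for name, pct in cost_share(MONTHLY_COSTS_USD).items():
        print(f"{name}: {pct}%")
```

With the example figures, droplets account for nearly all of the spend, so right-sizing droplets is likely the first place to look when costs need to come down.
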
### Technology Stack Summary

| Layer | Technology | Version | Purpose |
|-------|------------|---------|---------|
| **OS** | Ubuntu | 24.04 LTS | Base operating system |
| **Orchestration** | Docker Swarm | Built-in | Container orchestration |
| **Container Runtime** | Docker | 27.x+ | Container execution |
| **Database** | Cassandra | 4.1.x | Distributed database |
| **Cache** | Redis | 7.x | In-memory cache/sessions |
| **Search** | Meilisearch | v1.5+ | Full-text search |
| **Reverse Proxy** | Caddy | 2-alpine | HTTPS termination |
| **Backend** | Go | 1.21+ | Application runtime |
| **Frontend** | React + Vite | 19 + 5.x | Web UI |
| **Object Storage** | Spaces | S3-compatible | File storage |
| **Monitoring** | Prometheus + Grafana | Latest | Metrics & dashboards |
| **CI/CD** | TBD | - | GitHub Actions / GitLab CI |

---

## Architecture Decision Records (ADRs)

### ADR-001: Docker Swarm vs Kubernetes

**Decision**: Use Docker Swarm for orchestration

**Context**: Need container orchestration for production deployment

**Rationale**:
- Simpler to set up and maintain (< 1 hour vs days for k8s)
- Built into Docker (no additional components)
- Sufficient for our scale (< 100 services)
- Lower operational overhead
- Easier to troubleshoot
- Team familiarity with Docker

**Trade-offs**:
- Less ecosystem tooling than Kubernetes
- Limited advanced scheduling features
- Smaller community
- May need to migrate to k8s if we scale dramatically (> 50 nodes)

**Status**: Accepted

---

### ADR-002: Cassandra for Distributed Database

**Decision**: Use Cassandra for primary datastore

**Context**: Need highly available, distributed database with linear scalability

**Rationale**:
- Write-heavy workload (user-generated content)
- Geographic distribution possible (multi-region)
- Proven at scale (Instagram, Netflix)
- No single point of failure (RF=3, QUORUM)
- Linear scalability (add nodes for capacity)
- Excellent write performance

**Trade-offs**:
- Higher complexity than PostgreSQL
- Eventually consistent (tunable)
- Schema migrations more complex
- Higher resource usage (3 nodes minimum)
- Steeper learning curve

**Alternatives Considered**:
- PostgreSQL + Patroni: Simpler but less scalable
- MongoDB: Similar, but prefer Cassandra's consistency model
- MySQL Cluster: Oracle licensing concerns

**Status**: Accepted

---

### ADR-003: Caddy for Reverse Proxy

**Decision**: Use Caddy instead of NGINX

**Context**: Need HTTPS termination and reverse proxy

**Rationale**:
- Automatic HTTPS with Let's Encrypt (zero configuration)
- Automatic certificate renewal (no cron jobs)
- Simpler configuration (10 lines vs 200+)
- Built-in HTTP/2 and HTTP/3
- Security by default
- Active development

**Trade-offs**:
- Less mature than NGINX (but production-ready)
- Smaller community
- Fewer third-party modules
- Slightly higher memory usage (negligible)

**Performance**: Equivalent for our use case (< 10k req/sec)

**Status**: Accepted

---

### ADR-004: Multi-Application Shared Infrastructure

**Decision**: Share database infrastructure across multiple applications

**Context**: Planning to deploy multiple applications (MaplePress, MapleFile)

**Rationale**:
- Cost efficiency (one shared 3-node Cassandra cluster vs a separate cluster per application)
- Operational efficiency (one set of database procedures)
- Resource utilization (databases rarely at capacity)
- Simplified backups (one backup process)
- Consistent data layer

**Isolation Strategy**:
- Separate keyspaces per application
- Separate workers for application backends
- Independent scaling per application
- Separate deployment pipelines

**Trade-offs**:
- Blast radius: One database failure affects all apps
- Resource contention possible (mitigated by capacity planning)
- Schema migration coordination needed

**Status**: Accepted

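The "separate keyspaces per application" isolation strategy in ADR-004 could look roughly like the sketch below, which builds one `CREATE KEYSPACE` statement per application. This is an illustration under stated assumptions, not the project's actual schema: the keyspace names `maplepress` and `maplefile` and the helper `create_keyspace_cql` are hypothetical, and `SimpleStrategy` is used only to keep the sketch free of datacenter names (production clusters commonly use `NetworkTopologyStrategy`).

```python
def create_keyspace_cql(keyspace: str, replication_factor: int = 3) -> str:
    """Build a CQL statement that creates one keyspace for one application."""
    # Reject names that are not plain identifiers, since the name is
    # interpolated directly into the statement.
    if not keyspace.isidentifier():
        raise ValueError(f"invalid keyspace name: {keyspace!r}")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}};"
    )


if __name__ == "__main__":
    # Hypothetical keyspace names, one per application sharing the cluster.
    for app in ("maplepress", "maplefile"):
        print(create_keyspace_cql(app))
```

Each application then reads and writes only inside its own keyspace, while both share the same three Cassandra nodes and the same backup and repair procedures.
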
---

## Capacity Planning Guidelines

### Current Capacity

**Worker specifications:**
- Manager + Redis: 2 vCPU, 2 GB RAM
- Cassandra nodes (3x): 2 vCPU, 4 GB RAM each
- Meilisearch: 2 vCPU, 2 GB RAM
- Backend: 2 vCPU, 2 GB RAM
- Frontend: 1 vCPU, 1 GB RAM

**Total:** 13 vCPUs, 19 GB RAM

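The capacity total follows directly from the per-node specifications. A minimal sketch of that arithmetic, assuming the specifications listed above (the `NodeSpec` type and `totals` function are hypothetical names used only for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    role: str
    count: int   # number of droplets with this role
    vcpus: int   # vCPUs per droplet
    ram_gb: int  # RAM per droplet, in GB


# Values copied from the worker specifications list.
NODES = [
    NodeSpec("Manager + Redis", 1, 2, 2),
    NodeSpec("Cassandra", 3, 2, 4),
    NodeSpec("Meilisearch", 1, 2, 2),
    NodeSpec("Backend", 1, 2, 2),
    NodeSpec("Frontend", 1, 1, 1),
]


def totals(nodes: list[NodeSpec]) -> tuple[int, int, int]:
    """Return (node count, total vCPUs, total RAM in GB)."""
    node_count = sum(n.count for n in nodes)
    vcpus = sum(n.count * n.vcpus for n in nodes)
    ram_gb = sum(n.count * n.ram_gb for n in nodes)
    return node_count, vcpus, ram_gb


if __name__ == "__main__":
    count, vcpus, ram_gb = totals(NODES)
    print(f"{count} nodes, {vcpus} vCPUs, {ram_gb} GB RAM")
```

Keeping the specifications in one structure like this makes it easy to re-check the total whenever a droplet is resized or a node is added.
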
### Scaling Triggers

**When to scale:**

| Trigger | Qualifier | Action |
|---------|-----------|--------|
| CPU > 80% sustained | 5 minutes | Add worker or scale vertically |
| Memory > 85% sustained | 5 minutes | Increase droplet RAM |
| Disk > 75% full | Any node | Clear space or increase disk |
| Backend p95 > 1s | Consistent | Scale backend horizontally |
| Database latency > 50ms | Consistent | Add Cassandra node or tune |
| Request rate approaching capacity | 80% of max | Scale backend replicas |

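The scaling triggers can be expressed as a simple rule check. The sketch below is one possible encoding, not an existing alerting configuration: `Snapshot` and `scaling_actions` are hypothetical names, and the sketch assumes each value passed in is already a sustained measurement (for example a 5-minute average), since the "sustained" and "consistent" qualifiers are not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    cpu_pct: float           # CPU utilization, percent
    memory_pct: float        # memory utilization, percent
    disk_pct: float          # fullest disk on any node, percent
    backend_p95_s: float     # backend p95 response time, seconds
    db_latency_ms: float     # database latency, milliseconds
    request_rate: float      # current request rate, req/min
    max_request_rate: float  # measured capacity, req/min


def scaling_actions(s: Snapshot) -> list[str]:
    """Return the actions from the scaling table whose triggers fire."""
    actions = []
    if s.cpu_pct > 80:
        actions.append("Add worker or scale vertically")
    if s.memory_pct > 85:
        actions.append("Increase droplet RAM")
    if s.disk_pct > 75:
        actions.append("Clear space or increase disk")
    if s.backend_p95_s > 1.0:
        actions.append("Scale backend horizontally")
    if s.db_latency_ms > 50:
        actions.append("Add Cassandra node or tune")
    if s.request_rate >= 0.8 * s.max_request_rate:
        actions.append("Scale backend replicas")
    return actions


if __name__ == "__main__":
    # Example baseline values from this document: no trigger fires.
    baseline = Snapshot(35, 50, 40, 0.15, 10, 500, 5000)
    print(scaling_actions(baseline))
```
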
### Scaling Options

**Horizontal Scaling (preferred):**
- Backend: Add replicas (`docker service scale maplepress_backend=3`)
- Cassandra: Add fourth node (increases capacity + resilience)
- Frontend: Add CDN or edge caching

**Vertical Scaling:**
- Resize droplets (requires brief restart)
- Increase memory limits in stack files
- Optimize application code first

**Cost vs Performance:**
- Horizontal: More resilient, linear cost increase
- Vertical: Simpler, better price/performance up to a point

---

## Cost Optimization Strategies

### Quick Wins

1. **Reserved Instances**: DigitalOcean doesn't offer reserved pricing, but consider annual contracts for discounts
2. **Right-sizing**: Monitor actual usage, downsize oversized droplets
3. **Cleanup**: Run `docker system prune` regularly, clear old snapshots
4. **Compression**: Enable gzip in Caddy (already done)
5. **Caching**: Maximize cache hit rates (Redis, CDN)

### Medium-term Optimizations

1. **CDN for static assets**: Offload frontend static files to CDN
2. **Object storage lifecycle**: Auto-delete old backups
3. **Database tuning**: Optimize queries to reduce hardware needs
4. **Spot instances**: Not available on DigitalOcean, but worth considering for batch jobs on providers that offer them

### Alternative Architectures

**If cost becomes primary concern:**
- Single-node PostgreSQL instead of Cassandra cluster (-$96/mo)
- Co-locate services on fewer droplets (-$50-100/mo)
- Use managed databases (different cost model)

**Trade-off**: Lower cost, higher operational risk

---

## Security Architecture

### Defense in Depth Layers

1. **Network**: VPC, firewalls, private overlay networks
2. **Transport**: TLS 1.3 for all external connections
3. **Application**: Authentication, authorization, input validation
4. **Data**: Encryption at rest (object storage), encryption in transit
5. **Monitoring**: Audit logs, security alerts, intrusion detection

### Key Security Controls

**Implemented:**
- ✅ SSH key-based authentication (no passwords)
- ✅ UFW firewall on all nodes
- ✅ Docker secrets for sensitive values
- ✅ Network segmentation (private vs public)
- ✅ Automatic HTTPS with perfect forward secrecy
- ✅ Security headers (HSTS, X-Frame-Options, etc.)
- ✅ Database authentication (passwords, API keys)
- ✅ Minimal attack surface (only ports 22, 80, 443 exposed)

**Planned:**
- [ ] fail2ban for SSH brute-force protection
- [ ] Intrusion detection system (IDS)
- [ ] Regular security scanning (Trivy for containers)
- [ ] Secret rotation automation
- [ ] Audit logging aggregation

---

## Compliance Considerations

### GDPR

**If processing EU user data:**
- Data residency: Deploy EU region workers
- Right to deletion: Implement user data purge
- Data portability: Export user data functionality
- Privacy by design: Minimal data collection
- Audit trail: Log all data access

### SOC2

**If pursuing SOC2 compliance:**
- Access controls: Role-based access, MFA
- Change management: All changes via git, reviewed
- Monitoring: Comprehensive logging and alerting
- Incident response: Documented procedures
- Business continuity: Backup and disaster recovery tested

**Document in**: `compliance.md`

---

## Glossary

### Docker Swarm Terms

**Manager node**: Swarm orchestrator, schedules tasks, maintains cluster state
**Worker node**: Executes tasks (containers) assigned by manager
**Service**: Definition of containers to run (image, replicas, network)
**Task**: Single container instance of a service
**Stack**: Group of related services deployed together
**Overlay network**: Virtual network spanning all swarm nodes
**Ingress network**: Built-in load balancing for published ports
**Node label**: Key-value tag for task placement constraints

### Cassandra Terms

**RF (Replication Factor)**: Number of copies of data (RF=3 = 3 copies)
**QUORUM**: Majority of replicas (2 out of 3 for RF=3)
**Consistency Level**: How many replicas must respond (ONE, QUORUM, ALL)
**Keyspace**: Database namespace (like database in SQL)
**SSTable**: Immutable data file on disk
**Compaction**: Merging SSTables to reclaim space
**Repair**: Synchronize data across replicas
**Nodetool**: Command-line tool for Cassandra administration

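The relationship between RF and QUORUM in the Cassandra terms can be made concrete with a little arithmetic. In this minimal sketch (function names are hypothetical), a quorum is a strict majority of replicas, and a read is guaranteed to overlap the latest acknowledged write when the read and write replica counts together exceed RF.

```python
def quorum(rf: int) -> int:
    """Replicas that must respond at QUORUM: a strict majority of RF."""
    return rf // 2 + 1


def tolerated_failures(rf: int) -> int:
    """Replicas that can be down while QUORUM requests still succeed."""
    return rf - quorum(rf)


def is_strongly_consistent(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > rf


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(f"RF={rf}: QUORUM={q}, tolerates {tolerated_failures(rf)} replica down")
    print("QUORUM reads + QUORUM writes overlap:", is_strongly_consistent(q, q, rf))
```

For RF=3 this gives a quorum of 2, which matches the glossary entry and means the 3-node cluster keeps serving QUORUM requests with one replica unavailable.
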
### Monitoring Terms

**Prometheus**: Time-series database and metrics collection
**Grafana**: Visualization and dashboarding
**Alertmanager**: Alert routing and notification
**Exporter**: Metrics collection agent (node_exporter, etc.)
**Scrape**: Prometheus collecting metrics from target
**Time series**: Sequence of data points over time
**PromQL**: Prometheus query language

---

## Related Documentation

**For initial deployment:**
- `../setup/` - Step-by-step infrastructure deployment

**For day-to-day operations:**
- `../operations/` - Backup, monitoring, incident response

**For automation:**
- `../automation/` - Scripts, CI/CD, monitoring configs

**External resources:**
- Docker Swarm: https://docs.docker.com/engine/swarm/
- Cassandra: https://cassandra.apache.org/doc/latest/
- DigitalOcean: https://docs.digitalocean.com/

---

## Contributing to Reference Docs

**When to update reference documentation:**

- Major architecture changes
- New technology adoption
- Significant cost changes
- Security incidents (document lessons learned)
- Compliance requirements change
- Quarterly review cycles

**Document format:**
- Use Markdown
- Include decision date
- Link to related ADRs
- Update index/glossary as needed

---

## Document Maintenance

**Review schedule:**
- **Architecture docs**: Quarterly or after major changes
- **Capacity planning**: Monthly (update with metrics)
- **Cost analysis**: Monthly (track trends)
- **Security checklist**: Quarterly or after incidents
- **Technology stack**: When versions change
- **Glossary**: As needed when new terms are introduced

**Responsibility**: Infrastructure lead reviews quarterly, team contributes ongoing updates.

---

**Last Updated**: January 2025
**Maintained By**: Infrastructure Team
**Next Review**: April 2025

**Purpose**: These documents answer "why" and "what if" questions. They provide context for decisions and guidance for future planning.