
Reference Documentation

Audience: All infrastructure team members, architects, management
Purpose: High-level architecture, capacity planning, cost analysis, and strategic documentation
Prerequisites: Familiarity with deployed infrastructure


Overview

This directory contains reference materials that provide the "big picture" view of our infrastructure. Unlike the operational procedures (setup, operations, automation), these documents focus on why decisions were made, what the architecture looks like, and how to plan for the future.

Contents:

  • Architecture diagrams and decision records
  • Capacity planning and performance baselines
  • Cost analysis and optimization strategies
  • Security compliance documentation
  • Technology choices and trade-offs
  • Glossary of terms

Directory Contents

Architecture Documentation

architecture-overview.md - High-level system architecture

  • Infrastructure topology
  • Component interactions
  • Data flow diagrams
  • Network architecture
  • Security boundaries
  • Design principles and rationale

architecture-decisions.md - Architecture Decision Records (ADRs)

  • Why Docker Swarm over Kubernetes?
  • Why Cassandra over PostgreSQL?
  • Why Caddy over NGINX?
  • Multi-application architecture rationale
  • Network segmentation strategy
  • Service discovery approach

Capacity Planning

capacity-planning.md - Growth planning and scaling strategies

  • Current capacity baseline
  • Performance benchmarks
  • Growth projections
  • Scaling thresholds
  • Bottleneck analysis
  • Future infrastructure needs

performance-baselines.md - Performance metrics and SLOs

  • Response time percentiles
  • Throughput measurements
  • Database performance
  • Resource utilization baselines
  • Service Level Objectives (SLOs)
  • Service Level Indicators (SLIs)

Financial Planning

cost-analysis.md - Infrastructure costs and optimization

  • Monthly cost breakdown
  • Cost per service/application
  • Cost trends and projections
  • Optimization opportunities
  • Reserved capacity vs on-demand
  • TCO (Total Cost of Ownership)

cost-optimization.md - Strategies to reduce costs

  • Right-sizing recommendations
  • Idle resource identification
  • Reserved instances opportunities
  • Storage optimization
  • Bandwidth optimization
  • Alternative architecture considerations

Security & Compliance

security-architecture.md - Security design and controls

  • Defense-in-depth layers
  • Authentication and authorization
  • Secrets management approach
  • Network security controls
  • Data encryption (at rest and in transit)
  • Security monitoring and logging

security-checklist.md - Security verification checklist

  • Infrastructure hardening checklist
  • Compliance requirements (GDPR, SOC2, etc.)
  • Security audit procedures
  • Vulnerability management
  • Incident response readiness

compliance.md - Regulatory compliance documentation

  • GDPR compliance measures
  • Data residency requirements
  • Audit trail procedures
  • Privacy by design implementation
  • Data retention policies
  • Right to be forgotten procedures

Technology Stack

technology-stack.md - Complete technology inventory

  • Software versions and update policy
  • Third-party services and dependencies
  • Library and framework choices
  • Language and runtime versions
  • Tooling and development environment

technology-decisions.md - Why we chose each technology

  • Database selection rationale
  • Programming language choices
  • Cloud provider selection
  • Deployment tooling decisions
  • Monitoring stack selection

Operational Reference

runbook-index.md - Quick reference to all runbooks

  • Emergency procedures quick links
  • Common tasks reference
  • Escalation contacts
  • Critical command cheat sheet

glossary.md - Terms and definitions

  • Docker Swarm terminology
  • Database concepts (Cassandra RF, QUORUM, etc.)
  • Network terms (overlay, ingress, etc.)
  • Monitoring terminology
  • Infrastructure jargon decoder

Quick Reference Materials

Architecture At-a-Glance

Current Infrastructure (January 2025):

Production Environment: maplefile-prod
Region: DigitalOcean Toronto (tor1)
Nodes: 7 total (1 manager + 6 workers)
Applications: MaplePress (deployed), MapleFile (deployed)

Orchestration: Docker Swarm
Container Registry: DigitalOcean Container Registry (registry.digitalocean.com/ssp)
Object Storage: DigitalOcean Spaces (nyc3)
DNS: [Your DNS provider]
SSL: Let's Encrypt (automatic via Caddy)

Networks:
  - maple-private-prod: Databases and internal services
  - maple-public-prod: Public-facing services (Caddy + backends)

Databases:
  - Cassandra: 3-node cluster, RF=3, QUORUM consistency
  - Redis: Single instance, RDB + AOF persistence
  - Meilisearch: Single instance

Applications:
  - MaplePress Backend: Go 1.21+, Port 8000, Domain: getmaplepress.ca
  - MaplePress Frontend: React 19 + Vite, Domain: getmaplepress.com

Key Metrics Baseline (Example)

As of [Date]:

| Metric | Value | Threshold |
|---|---|---|
| Backend p95 Response Time | 150ms | < 500ms |
| Frontend Load Time | 1.2s | < 3s |
| Backend Throughput | 500 req/min | 5000 req/min capacity |
| Database Read Latency | 5ms | < 20ms |
| Database Write Latency | 10ms | < 50ms |
| Redis Hit Rate | 95% | > 90% |
| CPU Utilization (avg) | 35% | Alert at 80% |
| Memory Utilization (avg) | 50% | Alert at 85% |
| Disk Usage (avg) | 40% | Alert at 75% |
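
To make the throughput row concrete, the sketch below computes current utilization and a rough estimate of how long the example load (500 req/min against 5000 req/min capacity) could grow before reaching 80% of capacity. The function names and the 10% monthly growth rate are illustrative assumptions, not measured values.

```python
import math

def utilization(current: float, capacity: float) -> float:
    """Fraction of capacity currently in use."""
    return current / capacity

def months_until(current: float, capacity: float,
                 trigger: float, monthly_growth: float) -> int:
    """Whole months of compound growth needed for load to reach trigger * capacity."""
    target = capacity * trigger
    if current >= target:
        return 0
    return math.ceil(math.log(target / current) / math.log(1 + monthly_growth))

# Example figures from the table; the growth rate is a hypothetical assumption.
print(f"{utilization(500, 5000):.0%}")                              # 10%
print(months_until(500, 5000, trigger=0.80, monthly_growth=0.10))   # 22
```

Replacing the assumed growth rate with an observed one turns this into a simple input for the growth projections in capacity-planning.md.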

Monthly Cost Breakdown (Example)

| Service | Monthly Cost | Notes |
|---|---|---|
| Droplets (7x) | $204 | See breakdown in cost-analysis.md |
| Spaces Storage | $5 | 250GB included |
| Additional Bandwidth | $0 | Within free tier |
| Container Registry | $0 | Included |
| DNS | $0 | Using [provider] |
| Monitoring (optional) | $0 | Self-hosted Prometheus |
| Total | ~$209/mo | Can scale to ~$300/mo with growth |
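
The total above is a plain sum of the line items. A minimal sketch of that arithmetic, using the example figures from this table (the dictionary layout and variable names are assumptions made for illustration):

```python
# Example monthly costs in USD, copied from the table above.
monthly_costs = {
    "Droplets (7x)": 204,
    "Spaces Storage": 5,
    "Additional Bandwidth": 0,
    "Container Registry": 0,
    "DNS": 0,
    "Monitoring (optional)": 0,
}

total = sum(monthly_costs.values())
droplet_share = monthly_costs["Droplets (7x)"] / total

print(f"Total: ${total}/mo")                      # Total: $209/mo
print(f"Droplets: {droplet_share:.1%} of spend")  # Droplets: 97.6% of spend
```

Because droplets dominate the bill in this example, right-sizing them is likely to matter more than any other line item.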

Technology Stack Summary

| Layer | Technology | Version | Purpose |
|---|---|---|---|
| OS | Ubuntu | 24.04 LTS | Base operating system |
| Orchestration | Docker Swarm | Built-in | Container orchestration |
| Container Runtime | Docker | 27.x+ | Container execution |
| Database | Cassandra | 4.1.x | Distributed database |
| Cache | Redis | 7.x | In-memory cache/sessions |
| Search | Meilisearch | v1.5+ | Full-text search |
| Reverse Proxy | Caddy | 2-alpine | HTTPS termination |
| Backend | Go | 1.21+ | Application runtime |
| Frontend | React + Vite | 19 + 5.x | Web UI |
| Object Storage | Spaces | S3-compatible | File storage |
| Monitoring | Prometheus + Grafana | Latest | Metrics & dashboards |
| CI/CD | TBD | - | GitHub Actions / GitLab CI |

Architecture Decision Records (ADRs)

ADR-001: Docker Swarm vs Kubernetes

Decision: Use Docker Swarm for orchestration

Context: Need container orchestration for production deployment

Rationale:

  • Simpler to set up and maintain (< 1 hour vs days for k8s)
  • Built into Docker (no additional components)
  • Sufficient for our scale (< 100 services)
  • Lower operational overhead
  • Easier to troubleshoot
  • Team familiarity with Docker

Trade-offs:

  • Less ecosystem tooling than Kubernetes
  • Limited advanced scheduling features
  • Smaller community
  • May need to migrate to k8s if we scale dramatically (> 50 nodes)

Status: Accepted


ADR-002: Cassandra for Distributed Database

Decision: Use Cassandra for primary datastore

Context: Need highly available, distributed database with linear scalability

Rationale:

  • Write-heavy workload (user-generated content)
  • Geographic distribution possible (multi-region)
  • Proven at scale (Instagram, Netflix)
  • No single point of failure (RF=3, QUORUM)
  • Linear scalability (add nodes for capacity)
  • Excellent write performance

Trade-offs:

  • Higher complexity than PostgreSQL
  • Eventually consistent (tunable)
  • Schema migrations more complex
  • Higher resource usage (3 nodes minimum)
  • Steeper learning curve

Alternatives Considered:

  • PostgreSQL + Patroni: Simpler but less scalable
  • MongoDB: Similar, but prefer Cassandra's consistency model
  • MySQL Cluster: Oracle licensing concerns

Status: Accepted


ADR-003: Caddy for Reverse Proxy

Decision: Use Caddy instead of NGINX

Context: Need HTTPS termination and reverse proxy

Rationale:

  • Automatic HTTPS with Let's Encrypt (zero configuration)
  • Automatic certificate renewal (no cron jobs)
  • Simpler configuration (10 lines vs 200+)
  • Built-in HTTP/2 and HTTP/3
  • Security by default
  • Active development

Trade-offs:

  • Less mature than NGINX (but production-ready)
  • Smaller community
  • Fewer third-party modules
  • Slightly higher memory usage (negligible)

Performance: Equivalent for our use case (< 10k req/sec)

Status: Accepted


ADR-004: Multi-Application Shared Infrastructure

Decision: Share database infrastructure across multiple applications

Context: Planning to deploy multiple applications (MaplePress, MapleFile)

Rationale:

  • Cost efficiency (one shared 3-node Cassandra cluster vs a separate cluster per application)
  • Operational efficiency (one set of database procedures)
  • Resource utilization (databases rarely at capacity)
  • Simplified backups (one backup process)
  • Consistent data layer

Isolation Strategy:

  • Separate keyspaces per application
  • Separate workers for application backends
  • Independent scaling per application
  • Separate deployment pipelines

Trade-offs:

  • Blast radius: One database failure affects all apps
  • Resource contention possible (mitigated by capacity planning)
  • Schema migration coordination needed

Status: Accepted


Capacity Planning Guidelines

Current Capacity

Node specifications:

  • Manager + Redis: 2 vCPU, 2 GB RAM
  • Cassandra nodes (3x): 2 vCPU, 4 GB RAM each
  • Meilisearch: 2 vCPU, 2 GB RAM
  • Backend: 2 vCPU, 2 GB RAM
  • Frontend: 1 vCPU, 1 GB RAM

Total: 13 vCPUs, 19 GB RAM
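
The totals can be re-derived from the per-node specifications listed above. The sketch below does that; the `NodeSpec` type and the role names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeSpec:
    role: str
    count: int
    vcpus: int
    ram_gb: int

# Figures copied from the specification list above.
NODES = [
    NodeSpec("manager + redis", 1, 2, 2),
    NodeSpec("cassandra", 3, 2, 4),
    NodeSpec("meilisearch", 1, 2, 2),
    NodeSpec("backend", 1, 2, 2),
    NodeSpec("frontend", 1, 1, 1),
]

total_nodes = sum(n.count for n in NODES)
total_vcpus = sum(n.count * n.vcpus for n in NODES)
total_ram_gb = sum(n.count * n.ram_gb for n in NODES)

print(total_nodes, total_vcpus, total_ram_gb)  # 7 13 19
```

Keeping the inventory in one structure like this makes it easy to recheck the totals whenever a droplet is resized or added.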

Scaling Triggers

When to scale:

| Metric | Threshold | Action |
|---|---|---|
| CPU > 80% sustained | 5 minutes | Add worker or scale vertically |
| Memory > 85% sustained | 5 minutes | Increase droplet RAM |
| Disk > 75% full | Any node | Clear space or increase disk |
| Backend p95 > 1s | Consistent | Scale backend horizontally |
| Database latency > 50ms | Consistent | Add Cassandra node or tune |
| Request rate approaching capacity | 80% of max | Scale backend replicas |
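
The "sustained" qualifier matters: a single spike should not trigger scaling. One way to express it, assuming one reading per minute, is to require that every reading in the most recent window exceeds the threshold. The function name and the sample values below are hypothetical; in practice this kind of rule would more likely live in the monitoring stack's alerting configuration.

```python
def sustained_above(samples, threshold, min_samples):
    """True if the most recent `min_samples` readings all exceed `threshold`."""
    if min_samples < 1:
        raise ValueError("min_samples must be at least 1")
    if len(samples) < min_samples:
        return False  # not enough history to call it sustained
    return all(s > threshold for s in samples[-min_samples:])

# One CPU reading per minute (hypothetical values), newest last.
cpu_percent = [72, 81, 84, 90, 88, 86]
print(sustained_above(cpu_percent, threshold=80, min_samples=5))  # True

# A single dip inside the window resets the condition.
cpu_with_dip = [72, 81, 84, 79, 88, 86]
print(sustained_above(cpu_with_dip, threshold=80, min_samples=5))  # False
```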

Scaling Options

Horizontal Scaling (preferred):

  • Backend: Add replicas (docker service scale maplepress_backend=3)
  • Cassandra: Add fourth node (increases capacity + resilience)
  • Frontend: Add CDN or edge caching

Vertical Scaling:

  • Resize droplets (requires brief restart)
  • Increase memory limits in stack files
  • Optimize application code first

Cost vs Performance:

  • Horizontal: More resilient, linear cost increase
  • Vertical: Simpler, better price/performance up to a point

Cost Optimization Strategies

Quick Wins

  1. Reserved Instances: DigitalOcean doesn't offer reserved pricing, but consider annual contracts for discounts
  2. Right-sizing: Monitor actual usage, downsize oversized droplets
  3. Cleanup: Regular docker system prune, clear old snapshots
  4. Compression: Enable gzip in Caddy (already done)
  5. Caching: Maximize cache hit rates (Redis, CDN)

Medium-term Optimizations

  1. CDN for static assets: Offload frontend static files to CDN
  2. Object storage lifecycle: Auto-delete old backups
  3. Database tuning: Optimize queries to reduce hardware needs
  4. Spot instances: Not available on DigitalOcean; worth considering on another provider for batch jobs

Alternative Architectures

If cost becomes the primary concern:

  • Single-node PostgreSQL instead of Cassandra cluster (-$96/mo)
  • Collocate services on fewer droplets (-$50-100/mo)
  • Use managed databases (different cost model)

Trade-off: Lower cost, higher operational risk


Security Architecture

Defense in Depth Layers

  1. Network: VPC, firewalls, private overlay networks
  2. Transport: TLS 1.3 for all external connections
  3. Application: Authentication, authorization, input validation
  4. Data: Encryption at rest (object storage), encryption in transit
  5. Monitoring: Audit logs, security alerts, intrusion detection
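
The transport layer requirement can be spot-checked from the client side. The sketch below builds a Python `ssl` context that verifies certificates and refuses to negotiate anything older than TLS 1.3; the helper name is an assumption, and the example does not open a network connection.

```python
import ssl

def tls13_only_context() -> ssl.SSLContext:
    """Client context that verifies certificates and rejects TLS < 1.3."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx

ctx = tls13_only_context()
print(ctx.minimum_version.name)  # TLSv1_3
```

Wrapping a socket with this context against a public endpoint would fail the handshake if the server only offered older protocol versions, which makes it a cheap regression check.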

Key Security Controls

Implemented:

  • SSH key-based authentication (no passwords)
  • UFW firewall on all nodes
  • Docker secrets for sensitive values
  • Network segmentation (private vs public)
  • Automatic HTTPS with perfect forward secrecy
  • Security headers (HSTS, X-Frame-Options, etc.)
  • Database authentication (passwords, API keys)
  • Minimal attack surface (only ports 22, 80, 443 exposed)

Planned:

  • fail2ban for SSH brute-force protection
  • Intrusion detection system (IDS)
  • Regular security scanning (Trivy for containers)
  • Secret rotation automation
  • Audit logging aggregation
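
The "only ports 22, 80, 443 exposed" control lends itself to an automated check. A minimal sketch, assuming the list of open ports comes from an external scan (the scan itself is not shown, and the function name is hypothetical):

```python
# Ports that the security controls above allow to be publicly reachable.
ALLOWED_PORTS = frozenset({22, 80, 443})

def unexpected_ports(open_ports):
    """Return open ports that are not on the allowlist, sorted."""
    return sorted(set(open_ports) - ALLOWED_PORTS)

print(unexpected_ports([22, 80, 443]))        # []
print(unexpected_ports([22, 80, 443, 9042]))  # [9042] (Cassandra's CQL port)
```

Any non-empty result would indicate that a service meant for the private network is reachable from outside.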

Compliance Considerations

GDPR

If processing EU user data:

  • Data residency: Deploy EU region workers
  • Right to deletion: Implement user data purge
  • Data portability: Export user data functionality
  • Privacy by design: Minimal data collection
  • Audit trail: Log all data access

SOC2

If pursuing SOC2 compliance:

  • Access controls: Role-based access, MFA
  • Change management: All changes via git, reviewed
  • Monitoring: Comprehensive logging and alerting
  • Incident response: Documented procedures
  • Business continuity: Backup and disaster recovery tested

Document in: compliance.md


Glossary

Docker Swarm Terms

  • Manager node: Swarm orchestrator, schedules tasks, maintains cluster state
  • Worker node: Executes tasks (containers) assigned by manager
  • Service: Definition of containers to run (image, replicas, network)
  • Task: Single container instance of a service
  • Stack: Group of related services deployed together
  • Overlay network: Virtual network spanning all swarm nodes
  • Ingress network: Built-in load balancing for published ports
  • Node label: Key-value tag for task placement constraints

Cassandra Terms

  • RF (Replication Factor): Number of copies of data (RF=3 = 3 copies)
  • QUORUM: Majority of replicas (2 out of 3 for RF=3)
  • Consistency Level: How many replicas must respond (ONE, QUORUM, ALL)
  • Keyspace: Database namespace (like database in SQL)
  • SSTable: Immutable data file on disk
  • Compaction: Merging SSTables to reclaim space
  • Repair: Synchronize data across replicas
  • Nodetool: Command-line tool for Cassandra administration
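
The relationship between RF, QUORUM, and consistency can be written out as arithmetic. The sketch below uses the standard majority formula and the read/write overlap rule (R + W > RF); the function names are chosen for this example only.

```python
def quorum(rf: int) -> int:
    """Majority of replicas for a given replication factor."""
    return rf // 2 + 1

def tolerated_failures(rf: int) -> int:
    """Replicas that can be down while QUORUM operations still succeed."""
    return rf - quorum(rf)

def reads_see_latest_write(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """Read and write replica sets must overlap: R + W > RF."""
    return read_replicas + write_replicas > rf

rf = 3
print(quorum(rf))                                          # 2
print(tolerated_failures(rf))                              # 1
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))  # True  (QUORUM + QUORUM)
print(reads_see_latest_write(rf, 1, 1))                    # False (ONE + ONE)
```

With RF=3 and QUORUM for both reads and writes, the cluster can lose one node and still serve consistent reads, which is the configuration described in this document.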

Monitoring Terms

  • Prometheus: Time-series database and metrics collection
  • Grafana: Visualization and dashboarding
  • Alertmanager: Alert routing and notification
  • Exporter: Metrics collection agent (node_exporter, etc.)
  • Scrape: Prometheus collecting metrics from target
  • Time series: Sequence of data points over time
  • PromQL: Prometheus query language


Related Documentation

For initial deployment:

  • ../setup/ - Step-by-step infrastructure deployment

For day-to-day operations:

  • ../operations/ - Backup, monitoring, incident response

For automation:

  • ../automation/ - Scripts, CI/CD, monitoring configs


Contributing to Reference Docs

When to update reference documentation:

  • Major architecture changes
  • New technology adoption
  • Significant cost changes
  • Security incidents (document lessons learned)
  • Compliance requirements change
  • Quarterly review cycles

Document format:

  • Use Markdown
  • Include decision date
  • Link to related ADRs
  • Update index/glossary as needed

Document Maintenance

Review schedule:

  • Architecture docs: Quarterly, or when major changes occur
  • Capacity planning: Monthly (update with metrics)
  • Cost analysis: Monthly (track trends)
  • Security checklist: Quarterly or after incidents
  • Technology stack: When versions change
  • Glossary: As needed when new terms are introduced

Responsibility: The infrastructure lead reviews quarterly; the team contributes ongoing updates.


Last Updated: January 2025
Maintained By: Infrastructure Team
Next Review: April 2025

Purpose: These documents answer "why" and "what if" questions. They provide context for decisions and guidance for future planning.