Bartlomiej Mika 755d54a99d Initial commit: Open sourcing all of the Maple Open Technologies code.		2025-12-02 14:33:08 -05:00
Reference Documentation

Audience: All infrastructure team members, architects, management
Purpose: High-level architecture, capacity planning, cost analysis, and strategic documentation
Prerequisites: Familiarity with deployed infrastructure

Overview

This directory contains reference materials that provide the "big picture" view of your infrastructure. Unlike operational procedures (setup, operations, automation), these documents focus on why decisions were made, what the architecture looks like, and how to plan for the future.

Contents:

Architecture diagrams and decision records
Capacity planning and performance baselines
Cost analysis and optimization strategies
Security compliance documentation
Technology choices and trade-offs
Glossary of terms

Directory Contents

Architecture Documentation

architecture-overview.md - High-level system architecture

Infrastructure topology
Component interactions
Data flow diagrams
Network architecture
Security boundaries
Design principles and rationale

architecture-decisions.md - Architecture Decision Records (ADRs)

Why Docker Swarm over Kubernetes?
Why Cassandra over PostgreSQL?
Why Caddy over NGINX?
Multi-application architecture rationale
Network segmentation strategy
Service discovery approach

Capacity Planning

capacity-planning.md - Growth planning and scaling strategies

Current capacity baseline
Performance benchmarks
Growth projections
Scaling thresholds
Bottleneck analysis
Future infrastructure needs

performance-baselines.md - Performance metrics and SLOs

Response time percentiles
Throughput measurements
Database performance
Resource utilization baselines
Service Level Objectives (SLOs)
Service Level Indicators (SLIs)

Financial Planning

cost-analysis.md - Infrastructure costs and optimization

Monthly cost breakdown
Cost per service/application
Cost trends and projections
Optimization opportunities
Reserved capacity vs on-demand
TCO (Total Cost of Ownership)

cost-optimization.md - Strategies to reduce costs

Right-sizing recommendations
Idle resource identification
Reserved instances opportunities
Storage optimization
Bandwidth optimization
Alternative architecture considerations

Security & Compliance

security-architecture.md - Security design and controls

Defense-in-depth layers
Authentication and authorization
Secrets management approach
Network security controls
Data encryption (at rest and in transit)
Security monitoring and logging

security-checklist.md - Security verification checklist

Infrastructure hardening checklist
Compliance requirements (GDPR, SOC2, etc.)
Security audit procedures
Vulnerability management
Incident response readiness

compliance.md - Regulatory compliance documentation

GDPR compliance measures
Data residency requirements
Audit trail procedures
Privacy by design implementation
Data retention policies
Right to be forgotten procedures

Technology Stack

technology-stack.md - Complete technology inventory

Software versions and update policy
Third-party services and dependencies
Library and framework choices
Language and runtime versions
Tooling and development environment

technology-decisions.md - Why we chose each technology

Database selection rationale
Programming language choices
Cloud provider selection
Deployment tooling decisions
Monitoring stack selection

Operational Reference

runbook-index.md - Quick reference to all runbooks

Emergency procedures quick links
Common tasks reference
Escalation contacts
Critical command cheat sheet

glossary.md - Terms and definitions

Docker Swarm terminology
Database concepts (Cassandra RF, QUORUM, etc.)
Network terms (overlay, ingress, etc.)
Monitoring terminology
Infrastructure jargon decoder

Quick Reference Materials

Architecture At-a-Glance

Current Infrastructure (January 2025):

Production Environment: maplefile-prod
Region: DigitalOcean Toronto (tor1)
Nodes: 7 (1 manager + 6 workers)
Applications: MaplePress (deployed), MapleFile (deployed)

Orchestration: Docker Swarm
Container Registry: DigitalOcean Container Registry (registry.digitalocean.com/ssp)
Object Storage: DigitalOcean Spaces (nyc3)
DNS: [Your DNS provider]
SSL: Let's Encrypt (automatic via Caddy)

Networks:
  - maple-private-prod: Databases and internal services
  - maple-public-prod: Public-facing services (Caddy + backends)

Databases:
  - Cassandra: 3-node cluster, RF=3, QUORUM consistency
  - Redis: Single instance, RDB + AOF persistence
  - Meilisearch: Single instance

Applications:
  - MaplePress Backend: Go 1.21+, Port 8000, Domain: getmaplepress.ca
  - MaplePress Frontend: React 19 + Vite, Domain: getmaplepress.com

Key Metrics Baseline (Example)

As of [Date]:

Metric	Value	Threshold
Backend p95 Response Time	150ms	< 500ms
Frontend Load Time	1.2s	< 3s
Backend Throughput	500 req/min	5000 req/min capacity
Database Read Latency	5ms	< 20ms
Database Write Latency	10ms	< 50ms
Redis Hit Rate	95%	> 90%
CPU Utilization (avg)	35%	Alert at 80%
Memory Utilization (avg)	50%	Alert at 85%
Disk Usage (avg)	40%	Alert at 75%
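
A p95 value like the one in the table can be derived from raw response-time samples. The following is a minimal sketch using the nearest-rank method; the `percentile` helper and the sample latencies are illustrative assumptions, not measurements from this infrastructure, and a monitoring stack such as Prometheus may use a different estimation method.

```python
import math

def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical response times in milliseconds (illustrative only).
latencies_ms = [
    80, 85, 90, 95, 100, 100, 105, 110, 110, 115,
    120, 120, 125, 130, 130, 135, 140, 145, 150, 900,
]

p95 = percentile(latencies_ms, 95)
within_threshold = p95 < 500  # threshold taken from the table above
print(f"p95 = {p95} ms, within threshold: {within_threshold}")
```

Note that the single 900 ms outlier does not move the p95 in this sample, which is why percentiles are a more stable baseline than the maximum.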

Monthly Cost Breakdown (Example)

Service	Monthly Cost	Notes
Droplets (7x)	$204	See breakdown in cost-analysis.md
Spaces Storage	$5	250GB included
Additional Bandwidth	$0	Within free tier
Container Registry	$0	Included
DNS	$0	Using [provider]
Monitoring (optional)	$0	Self-hosted Prometheus
Total	~$209/mo	Can scale to ~$300/mo with growth
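
The total in the table is a plain sum of the line items. As a small sketch, the arithmetic below recomputes it and derives the headroom to the ~$300/mo growth estimate; the variable names are assumptions, and the figures are the example values from the table, not billing data.

```python
# Monthly line items in USD, copied from the example table above.
monthly_costs = {
    "Droplets (7x)": 204,
    "Spaces Storage": 5,
    "Additional Bandwidth": 0,
    "Container Registry": 0,
    "DNS": 0,
    "Monitoring (optional)": 0,
}

total = sum(monthly_costs.values())
growth_ceiling = 300                  # "~$300/mo with growth" from the table
headroom = growth_ceiling - total     # budget left before reaching that estimate
droplet_share = monthly_costs["Droplets (7x)"] / total

print(f"Total: ${total}/mo, headroom: ${headroom}/mo")
print(f"Droplets account for {droplet_share:.1%} of spend")
```

Because droplets dominate the total in this example, right-sizing them is likely the most effective cost lever.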

Technology Stack Summary

Layer	Technology	Version	Purpose
OS	Ubuntu	24.04 LTS	Base operating system
Orchestration	Docker Swarm	Built-in	Container orchestration
Container Runtime	Docker	27.x+	Container execution
Database	Cassandra	4.1.x	Distributed database
Cache	Redis	7.x	In-memory cache/sessions
Search	Meilisearch	v1.5+	Full-text search
Reverse Proxy	Caddy	2-alpine	HTTPS termination
Backend	Go	1.21+	Application runtime
Frontend	React + Vite	19 + 5.x	Web UI
Object Storage	Spaces	S3-compatible	File storage
Monitoring	Prometheus + Grafana	Latest	Metrics & dashboards
CI/CD	TBD	-	Build and deploy automation (candidates: GitHub Actions, GitLab CI)

Architecture Decision Records (ADRs)

ADR-001: Docker Swarm vs Kubernetes

Decision: Use Docker Swarm for orchestration

Context: Need container orchestration for production deployment

Rationale:

Simpler to set up and maintain (< 1 hour vs days for k8s)
Built into Docker (no additional components)
Sufficient for our scale (< 100 services)
Lower operational overhead
Easier to troubleshoot
Team familiarity with Docker

Trade-offs:

Less ecosystem tooling than Kubernetes
Limited advanced scheduling features
Smaller community
May need migration to k8s if we scale dramatically (> 50 nodes)

Status: Accepted

ADR-002: Cassandra for Distributed Database

Decision: Use Cassandra for primary datastore

Context: Need highly available, distributed database with linear scalability

Rationale:

Write-heavy workload (user-generated content)
Geographic distribution possible (multi-region)
Proven at scale (Instagram, Netflix)
No single point of failure (RF=3, QUORUM)
Linear scalability (add nodes for capacity)
Excellent write performance

Trade-offs:

Higher complexity than PostgreSQL
Eventually consistent (tunable)
Schema migrations more complex
Higher resource usage (3 nodes minimum)
Steeper learning curve
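
To make the "RF=3, QUORUM" and "eventually consistent (tunable)" points concrete, here is a minimal sketch of the arithmetic behind them. The function names are hypothetical; the rule that a read overlaps the latest write when read replicas plus write replicas exceed the replication factor is standard Cassandra behaviour.

```python
def quorum(rf):
    """Replicas that must respond at QUORUM: a strict majority of rf."""
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    return rf // 2 + 1

def tolerated_failures(rf):
    """Replicas that can be down while QUORUM requests still succeed."""
    return rf - quorum(rf)

def reads_see_latest_write(rf, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > rf

rf = 3
q = quorum(rf)
print(f"RF={rf}: QUORUM={q}, tolerates {tolerated_failures(rf)} node down")
print("QUORUM writes + QUORUM reads consistent:", reads_see_latest_write(rf, q, q))
print("ONE writes + ONE reads consistent:", reads_see_latest_write(rf, 1, 1))
```

With RF=3, QUORUM reads and writes stay available with one node down and still overlap on at least one replica, which is the trade-off this ADR relies on.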

Alternatives Considered:

PostgreSQL + Patroni: Simpler but less scalable
MongoDB: Similar, but prefer Cassandra's consistency model
MySQL Cluster: Oracle licensing concerns

Status: Accepted

ADR-003: Caddy for Reverse Proxy

Decision: Use Caddy instead of NGINX

Context: Need HTTPS termination and reverse proxy

Rationale:

Automatic HTTPS with Let's Encrypt (zero configuration)
Automatic certificate renewal (no cron jobs)
Simpler configuration (10 lines vs 200+)
Built-in HTTP/2 and HTTP/3
Security by default
Active development

Trade-offs:

Less mature than NGINX (but production-ready)
Smaller community
Fewer third-party modules
Slightly higher memory usage (negligible)

Performance: Equivalent for our use case (< 10k req/sec)

Status: Accepted

ADR-004: Multi-Application Shared Infrastructure

Decision: Share database infrastructure across multiple applications

Context: Planning to deploy multiple applications (MaplePress, MapleFile)

Rationale:

Cost efficiency (one 3-node Cassandra cluster vs 3 separate clusters)
Operational efficiency (one set of database procedures)
Resource utilization (databases rarely at capacity)
Simplified backups (one backup process)
Consistent data layer

Isolation Strategy:

Separate keyspaces per application
Separate workers for application backends
Independent scaling per application
Separate deployment pipelines
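
As an illustration of "separate keyspaces per application", the sketch below generates one `CREATE KEYSPACE` statement per application. The keyspace names, the `keyspace_cql` helper, and the datacenter name `datacenter1` are assumptions for illustration; the replication factor of 3 comes from this document.

```python
import re

def keyspace_cql(name, datacenter="datacenter1", rf=3):
    """Build a CREATE KEYSPACE statement for one application.

    `datacenter` is a placeholder; use the name your cluster reports.
    """
    # Only allow simple lowercase identifiers, since the name is interpolated.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", name):
        raise ValueError(f"unsafe keyspace name: {name!r}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{'class': 'NetworkTopologyStrategy', "
        f"'{datacenter}': {rf}}};"
    )

# Hypothetical keyspace names, one per application.
statements = [keyspace_cql(app) for app in ("maplepress", "maplefile")]
for statement in statements:
    print(statement)
```

Keeping each application in its own keyspace lets replication settings and access grants differ per application while sharing one cluster.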

Trade-offs:

Blast radius: One database failure affects all apps
Resource contention possible (mitigated by capacity planning)
Schema migration coordination needed

Status: Accepted

Capacity Planning Guidelines

Current Capacity

Worker specifications:

Manager + Redis: 2 vCPU, 2 GB RAM
Cassandra nodes (3x): 2 vCPU, 4 GB RAM each
Meilisearch: 2 vCPU, 2 GB RAM
Backend: 2 vCPU, 2 GB RAM
Frontend: 1 vCPU, 1 GB RAM

Total: 13 vCPUs, 19 GB RAM
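
The totals above follow directly from the worker specifications. A short sketch of the sum, useful for re-checking the figures whenever a droplet is resized (the tuple layout is an assumption for illustration):

```python
# (role, node count, vCPUs per node, RAM in GB per node), from the list above.
nodes = [
    ("Manager + Redis", 1, 2, 2),
    ("Cassandra", 3, 2, 4),
    ("Meilisearch", 1, 2, 2),
    ("Backend", 1, 2, 2),
    ("Frontend", 1, 1, 1),
]

total_nodes = sum(count for _, count, _, _ in nodes)
total_vcpus = sum(count * vcpus for _, count, vcpus, _ in nodes)
total_ram_gb = sum(count * ram for _, count, _, ram in nodes)

print(f"{total_nodes} nodes, {total_vcpus} vCPUs, {total_ram_gb} GB RAM")
```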

Scaling Triggers

When to scale:

Metric	Threshold	Action
CPU > 80% sustained	5 minutes	Add worker or scale vertically
Memory > 85% sustained	5 minutes	Increase droplet RAM
Disk > 75% full	Any node	Clear space or increase disk
Backend p95 > 1s	Consistent	Scale backend horizontally
Database latency > 50ms	Consistent	Add Cassandra node or tune
Request rate approaching capacity	80% of max	Scale backend replicas
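
The trigger table can be expressed as a simple rule check. This is a minimal sketch, not the project's alerting configuration: the metric key names and the `scaling_actions` function are hypothetical, and the inputs are assumed to already be sustained averages (the 5-minute window is not modelled here).

```python
def scaling_actions(m):
    """Return the actions from the trigger table that apply to metrics `m`."""
    actions = []
    if m["cpu_pct"] > 80:
        actions.append("Add worker or scale vertically")
    if m["memory_pct"] > 85:
        actions.append("Increase droplet RAM")
    if m["disk_pct"] > 75:
        actions.append("Clear space or increase disk")
    if m["backend_p95_ms"] > 1000:
        actions.append("Scale backend horizontally")
    if m["db_latency_ms"] > 50:
        actions.append("Add Cassandra node or tune")
    # "Approaching capacity" is taken as 80% of the maximum request rate.
    if m["request_rate"] >= 0.8 * m["max_request_rate"]:
        actions.append("Scale backend replicas")
    return actions

# Example baseline values from the Key Metrics Baseline table.
baseline = {
    "cpu_pct": 35, "memory_pct": 50, "disk_pct": 40,
    "backend_p95_ms": 150, "db_latency_ms": 10,
    "request_rate": 500, "max_request_rate": 5000,
}
# Hypothetical busy period: high CPU and request rate near capacity.
busy = dict(baseline, cpu_pct=85, request_rate=4200)

print("baseline:", scaling_actions(baseline))
print("busy:", scaling_actions(busy))
```

In practice these rules would more likely live in Prometheus alert rules; the sketch only shows how the thresholds combine.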

Scaling Options

Horizontal Scaling (preferred):

Backend: Add replicas (docker service scale maplepress_backend=3)
Cassandra: Add fourth node (increases capacity + resilience)
Frontend: Add CDN or edge caching

Vertical Scaling:

Resize droplets (requires brief restart)
Increase memory limits in stack files
Optimize application code first

Cost vs Performance:

Horizontal: More resilient, linear cost increase
Vertical: Simpler, better price/performance up to a point

Cost Optimization Strategies

Quick Wins

Reserved Instances: DigitalOcean doesn't offer reserved pricing, but consider annual contracts for discounts
Right-sizing: Monitor actual usage, downsize oversized droplets
Cleanup: Regular docker system prune, clear old snapshots
Compression: Enable gzip in Caddy (already done)
Caching: Maximize cache hit rates (Redis, CDN)

Medium-term Optimizations

CDN for static assets: Offload frontend static files to CDN
Object storage lifecycle: Auto-delete old backups
Database tuning: Optimize queries to reduce hardware needs
Spot instances: Not available on DigitalOcean; for batch jobs, consider a provider that offers them

Alternative Architectures

If cost becomes primary concern:

Single-node PostgreSQL instead of Cassandra cluster (-$96/mo)
Collocate services on fewer droplets (-$50-100/mo)
Use managed databases (different cost model)

Trade-off: Lower cost, higher operational risk

Security Architecture

Defense in Depth Layers

Network: VPC, firewalls, private overlay networks
Transport: TLS 1.3 for all external connections
Application: Authentication, authorization, input validation
Data: Encryption at rest (object storage), encryption in transit
Monitoring: Audit logs, security alerts, intrusion detection

Key Security Controls

Implemented:

✅ SSH key-based authentication (no passwords)
✅ UFW firewall on all nodes
✅ Docker secrets for sensitive values
✅ Network segmentation (private vs public)
✅ Automatic HTTPS with perfect forward secrecy
✅ Security headers (HSTS, X-Frame-Options, etc.)
✅ Database authentication (passwords, API keys)
✅ Minimal attack surface (only ports 22, 80, 443 exposed)

Planned:

fail2ban for SSH brute-force protection
Intrusion detection system (IDS)
Regular security scanning (Trivy for containers)
Secret rotation automation
Audit logging aggregation

Compliance Considerations

GDPR

If processing EU user data:

Data residency: Deploy EU region workers
Right to deletion: Implement user data purge
Data portability: Export user data functionality
Privacy by design: Minimal data collection
Audit trail: Log all data access

SOC2

If pursuing SOC2 compliance:

Access controls: Role-based access, MFA
Change management: All changes via git, reviewed
Monitoring: Comprehensive logging and alerting
Incident response: Documented procedures
Business continuity: Backup and disaster recovery tested

Document in: compliance.md

Glossary

Docker Swarm Terms

Manager node: Swarm orchestrator, schedules tasks, maintains cluster state
Worker node: Executes tasks (containers) assigned by manager
Service: Definition of containers to run (image, replicas, network)
Task: Single container instance of a service
Stack: Group of related services deployed together
Overlay network: Virtual network spanning all swarm nodes
Ingress network: Built-in load balancing for published ports
Node label: Key-value tag for task placement constraints

Cassandra Terms

RF (Replication Factor): Number of copies of data (RF=3 = 3 copies)
QUORUM: Majority of replicas (2 out of 3 for RF=3)
Consistency Level: How many replicas must respond (ONE, QUORUM, ALL)
Keyspace: Database namespace (like database in SQL)
SSTable: Immutable data file on disk
Compaction: Merging SSTables to reclaim space
Repair: Synchronize data across replicas
Nodetool: Command-line tool for Cassandra administration

Monitoring Terms

Prometheus: Time-series database and metrics collection
Grafana: Visualization and dashboarding
Alertmanager: Alert routing and notification
Exporter: Metrics collection agent (node_exporter, etc.)
Scrape: Prometheus collecting metrics from target
Time series: Sequence of data points over time
PromQL: Prometheus query language

Related Documentation

For initial deployment:

../setup/ - Step-by-step infrastructure deployment

For day-to-day operations:

../operations/ - Backup, monitoring, incident response

For automation:

../automation/ - Scripts, CI/CD, monitoring configs

External resources:

Docker Swarm: https://docs.docker.com/engine/swarm/
Cassandra: https://cassandra.apache.org/doc/latest/
DigitalOcean: https://docs.digitalocean.com/

Contributing to Reference Docs

When to update reference documentation:

Major architecture changes
New technology adoption
Significant cost changes
Security incidents (document lessons learned)
Compliance requirements change
Quarterly review cycles

Document format:

Use Markdown
Include decision date
Link to related ADRs
Update index/glossary as needed

Document Maintenance

Review schedule:

Architecture docs: Quarterly, or when major changes occur
Capacity planning: Monthly (update with metrics)
Cost analysis: Monthly (track trends)
Security checklist: Quarterly or after incidents
Technology stack: When versions change
Glossary: As needed, when new terms are introduced

Responsibility: Infrastructure lead reviews quarterly, team contributes ongoing updates.

Last Updated: January 2025
Maintained By: Infrastructure Team
Next Review: April 2025

Purpose: These documents answer "why" and "what if" questions. They provide context for decisions and guidance for future planning.
