Modern Engineering Practices
This guide covers the engineering practices, patterns, and mindsets that define how high-performing teams build software in 2024-2025 and beyond. These are the topics that come up in senior and staff-level interviews at companies pushing the frontier of software engineering.

1. AI-Assisted Engineering
The rise of AI coding assistants has fundamentally changed how engineers write, review, and ship code. Understanding how to use these tools effectively --- and when not to --- is now a core engineering skill.

1. How do you use AI coding assistants effectively in your workflow?
Analogy: AI coding assistants are like a really smart intern --- they can produce a lot of code quickly, but you MUST review everything they write. They lack context about your system’s history, your team’s conventions, and the subtle business rules that never made it into documentation. Great output, zero judgment.

Where AI excels:
- Boilerplate and scaffolding --- generating CRUD endpoints, config files, test stubs
- Language translation --- converting logic between languages or frameworks
- Documentation --- writing docstrings, comments, and README sections
- Pattern completion --- finishing repetitive code that follows an established pattern
- Regex and one-liners --- generating complex expressions with natural language descriptions
- Learning new APIs --- exploring unfamiliar libraries by asking for examples
Where AI struggles:
- Novel architecture decisions --- it cannot reason about your specific business constraints
- Security-critical code --- cryptography, auth flows, input validation need human review
- Performance-sensitive hot paths --- AI may generate correct but suboptimal code
- Complex state machines --- multi-step business logic with edge cases
2. When does AI help and when does it hurt?
| Scenario | AI Useful? | Why |
|---|---|---|
| Writing unit test boilerplate | Yes | Low risk, easily verified by running tests |
| Generating SQL migrations | Caution | Must review for data loss, test against staging |
| Auth/session handling code | No (without heavy review) | Security-critical, subtle bugs are exploitable |
| Refactoring with clear patterns | Yes | Mechanical transformation, easy to diff |
| Designing a distributed consensus protocol | No | Requires deep domain expertise AI lacks |
| Writing API documentation | Yes | Low risk, easy to review for accuracy |
3. What is prompt engineering for developers and how does it improve code suggestions?
Effective techniques:
- Be specific about the technology stack
- Provide examples of existing patterns
- Specify constraints upfront
- Ask for reasoning before code
Common mistakes:
- Vague prompts (“make this better”) produce vague results
- Not mentioning error handling leads to happy-path-only code
- Forgetting to mention existing dependencies causes incompatible suggestions
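To make this concrete, here is a hypothetical sketch of a prompt builder that bakes in stack, constraints, and an existing pattern. The helper name, parameters, and the sample task are all illustrative, not any particular tool's API:

```python
# Hypothetical sketch: assembling a specific prompt instead of a vague one.
# The structure (stack, constraints, existing pattern) mirrors the tips above.

def build_prompt(task: str, stack: str, constraints: list[str], example: str) -> str:
    """Build a code-generation prompt that states stack, constraints, and a pattern to follow."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Stack: {stack}\n"
        f"Task: {task}\n"
        f"Constraints:\n{constraint_lines}\n"
        f"Follow the style of this existing code:\n{example}\n"
        f"Explain your reasoning before writing the code."
    )

prompt = build_prompt(
    task="Add a paginated GET /orders endpoint",
    stack="Python 3.12, FastAPI, SQLAlchemy 2.0",
    constraints=[
        "Handle empty result sets and invalid page numbers",
        "Do not add new dependencies",
    ],
    example="def get_users(page: int, size: int) -> Page[User]: ...",
)
print(prompt)
```

Note how error handling is named explicitly in the constraints --- omitting it is exactly the mistake that produces happy-path-only code.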
4. How should AI be used in code review?
What AI reviewers catch well:
- Style and formatting violations
- Common bug patterns (null checks, resource leaks, off-by-one errors)
- Security scanning (hardcoded secrets, SQL injection patterns)
- Test coverage gaps --- flagging untested code paths
- Documentation completeness
What still needs human reviewers:
- Architectural fitness --- does this change align with our system’s direction?
- Business logic correctness --- does this actually solve the user’s problem?
- Performance implications --- will this cause N+1 queries at scale?
- Team knowledge sharing --- code review is how juniors learn and seniors stay informed
5. How do you test AI-generated code? What does 'trust but verify' mean in practice?
- Read every line --- do not accept code you do not understand
- Run the tests --- if there are no tests, write them before accepting the code
- Check edge cases --- AI optimizes for the happy path; test with empty inputs, nulls, boundary values
- Review error handling --- AI often generates generic catch-all blocks that swallow real errors
- Verify dependencies --- AI may suggest packages that are deprecated, unmaintained, or do not exist (hallucinated package names)
- Security review --- check for injection vulnerabilities, improper auth, hardcoded values
- Performance test --- run benchmarks if the code is in a hot path
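As a concrete "trust but verify" sketch, suppose an assistant generated this `slugify` helper (a hypothetical example, not from any real codebase). The edge-case assertions below are the part the human adds --- empty input, nothing alphanumeric, boundary whitespace:

```python
# Hypothetical example: suppose an assistant generated this slugify() for us.
import re

def slugify(text: str) -> str:
    """Lowercase, replace non-alphanumeric runs with '-', trim leading/trailing '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

# "Trust but verify": exercise the edges the happy path never hits.
assert slugify("Hello, World!") == "hello-world"
assert slugify("") == ""                            # empty input
assert slugify("!!!") == ""                         # nothing alphanumeric survives
assert slugify("  spaced  out  ") == "spaced-out"   # boundary whitespace
```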
6. What is the future of software engineering with AI?
What changes:
- Faster iteration cycles --- prototypes in hours instead of days
- Higher abstraction --- engineers increasingly define what to build, AI helps with how
- Higher quality bar --- with AI handling boilerplate, more time for testing, security, and design
- New skills matter --- prompt engineering, AI evaluation, understanding model limitations
What stays human:
- System design --- understanding tradeoffs at scale is a human skill
- Debugging production issues --- requires context about systems, teams, and business impact
- Cross-team collaboration --- technical leadership, mentoring, conflict resolution
- Ethical judgment --- deciding what to build, not just how to build it
2. Platform Engineering
Platform engineering is the discipline of building and maintaining internal developer platforms (IDPs) that make engineering teams more productive, consistent, and autonomous.
7. What is platform engineering and how does it differ from DevOps?
| Aspect | DevOps | Platform Engineering |
|---|---|---|
| Focus | Culture and practices | Product (the platform) |
| User | The same team builds and runs | App devs are the customers |
| Approach | Every team manages their own infra | Centralized platform, self-service |
| Success metric | Deployment frequency, MTTR | Developer satisfaction, time-to-production |
Analogy: Platform engineering is like building roads --- individual teams should not each build their own path to production. Pave the road once and let everyone drive on it. The platform team maintains the highway (CI/CD, infrastructure, observability), while product teams focus on where they are driving (features, business logic). Without the road, every team is bushwhacking through the wilderness independently.

The platform team treats other engineering teams as internal customers and the platform as an internal product, with roadmaps, user research, and iteration.
8. What are golden paths and why do they matter?
Examples:
- “Create a new microservice” --- one command gives you a repo with CI/CD, monitoring, logging, and a Kubernetes manifest, all pre-configured
- “Deploy to production” --- a standard pipeline that includes tests, security scanning, canary deployment, and automated rollback
- “Add a new database” --- a self-service form that provisions a managed database with backups, monitoring, and connection pooling
Why they matter:
- Make the right thing the easy thing --- developers follow secure, tested patterns by default
- Reduce cognitive load --- no need to research which logging library, which CI tool, which deploy strategy
- Consistency at scale --- 50 services all using the same patterns are far easier to maintain than 50 snowflakes
9. How do you measure and improve Developer Experience (DevEx)?
Three core dimensions of DevEx:
- Feedback loops --- how quickly do developers get signal? (CI time, PR review time, deploy time)
- Cognitive load --- how much irrelevant complexity must developers manage? (config, infra, tooling)
- Flow state --- how often can developers get into deep focus? (interruptions, context switching)
Metrics worth tracking:
- Time from commit to production (deploy lead time)
- CI pipeline duration (p50 and p95)
- Time to first PR review
- Developer satisfaction surveys (quarterly)
- Onboarding time for new engineers
How to improve:
- Invest in fast CI --- nothing destroys flow like a 45-minute build
- Automate environment setup --- `git clone` and `make dev` should get you running
- Reduce approval bottlenecks --- async reviews, clear ownership
- Provide good internal documentation --- searchable, up-to-date, with examples
10. Why does self-service infrastructure matter at scale?
Why ticket-based provisioning breaks down:
- 10 engineers: Slack the infra person, they do it in 10 minutes
- 100 engineers: Infra team has a 2-week ticket backlog
- 1000 engineers: Teams bypass infra entirely and create shadow IT
What good self-service looks like:
- Developers provision what they need through a portal, CLI, or API
- Guardrails are built in --- you cannot create an S3 bucket without encryption
- Costs are tracked automatically --- teams see what they spend
- Security policies are enforced at the platform level, not via manual review
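The "guardrails are built in" idea can be sketched as a validation step that runs before provisioning. The rule set here is invented for illustration, not any specific vendor's policy engine:

```python
# Sketch of a platform guardrail: validate a resource request before
# provisioning. Rules are illustrative, not a real policy engine's schema.

RULES = {
    "s3_bucket": lambda spec: spec.get("encryption") == "AES256",
    "database": lambda spec: spec.get("backups_enabled") is True,
}

def validate_request(resource_type: str, spec: dict) -> list[str]:
    """Return a list of policy violations; empty means the request may proceed."""
    violations = []
    check = RULES.get(resource_type)
    if check and not check(spec):
        violations.append(f"{resource_type}: spec violates platform policy")
    return violations

assert validate_request("s3_bucket", {"encryption": "AES256"}) == []
assert validate_request("s3_bucket", {}) != []   # unencrypted bucket is rejected
```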
11. What tools exist in the platform engineering ecosystem?
Developer portals:
- Backstage (Spotify) --- open-source developer portal. Service catalog, templates, plugin ecosystem. The most widely adopted IDP framework
- Port --- SaaS developer portal with a visual builder
- Cortex --- focuses on service maturity scorecards
Platform orchestrators:
- Humanitec --- platform orchestrator that abstracts infrastructure. Define workloads, it handles the wiring
- Kratix --- Kubernetes-native framework for building platforms. Uses “Promises” (custom resource definitions) to offer services
Infrastructure as code:
- Crossplane --- Kubernetes-native infrastructure provisioning. Define cloud resources as YAML
- Terraform --- still the standard for IaC, increasingly wrapped by platform layers
- Pulumi --- IaC using real programming languages (TypeScript, Python, Go)
12. When do you need a platform team vs when is it overkill?
A platform team makes sense when:
- You have 50+ engineers and multiple teams shipping independently
- Teams are duplicating effort (everyone building their own CI, their own Terraform, their own monitoring)
- Onboarding a new engineer takes more than 2 days
- Infrastructure requests are a bottleneck (multi-day ticket queues)
- Security and compliance requirements demand consistent enforcement
It is overkill when:
- You have fewer than 20 engineers
- One or two people can manage the infrastructure alongside feature work
- Your stack is simple (monolith, single deploy target)
- The overhead of a “platform” would exceed the time it saves
3. Observability-Driven Development
Observability is not something you bolt on after launch. Modern engineering treats observability as a first-class design concern, embedded in the code from day one.
13. What does it mean to write code with observability in mind from day one?
- Every service emits structured logs, metrics, and traces from the start
- Every external call (HTTP, DB, queue) is instrumented with timing and error tracking
- Business-critical operations have custom metrics (orders placed, payments processed, emails sent)
- Error paths are as well-instrumented as happy paths --- you learn the most when things fail
Practical habits:
- Add a correlation/request ID to every log line from day one
- Use structured logging (JSON) --- not `console.log("something happened")`
- Define your key metrics before writing the feature, not after the outage
- Include dashboards and alerts in the definition of done for a feature
14. Why is structured logging a first-class concern?
- Queryable --- find all errors for `order_id=12345` across all services in seconds
- Aggregatable --- count error rates by `error_type`, alert on spikes
- Correlatable --- join logs across services using `trace_id`
- Indexable --- tools like Elasticsearch, Loki, and Datadog can index fields for fast search
- PII-aware --- you can filter or mask specific fields (like `user_email`) systematically
Best practices:
- Use a logging library that enforces structure (Winston, Pino, Serilog, structlog)
- Standardize field names across all services (use a shared schema)
- Always include: timestamp, level, service name, trace ID, and a human-readable message
- Never log raw request bodies (PII risk) --- log derived fields instead
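A minimal structured-logging sketch using only Python's standard library (production services would more likely use one of the libraries named above). The service name and field schema here are assumptions, following the shared-schema advice:

```python
# Minimal structured-logging sketch with only the standard library.
# Field names (timestamp, level, service, trace_id, message) follow the
# "standardize field names" advice above; the service name is assumed.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",                       # assumed service name
            "trace_id": getattr(record, "trace_id", None),  # correlation ID
            "message": record.getMessage(),
        }
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Pass the correlation ID on every line via `extra`
logger.info("order placed", extra={"trace_id": "abc-123"})
```

Every line is now a JSON object you can query by field, which is exactly what makes the `order_id=12345` lookups above possible.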
15. How does distributed tracing work in microservices (OpenTelemetry)?
Core concepts:
- Trace --- the entire journey of a request (e.g., user clicks “Buy” through to order confirmation)
- Span --- a single unit of work within a trace (e.g., “validate payment” or “query inventory DB”)
- Context propagation --- passing the trace ID from service to service via HTTP headers (`traceparent`)
- Span attributes --- metadata attached to spans (HTTP status, DB query, user ID)
How a trace is assembled:
- Service A receives a request, starts a trace, generates a trace ID
- Service A calls Service B, passing the trace ID in the `traceparent` header
- Service B creates a child span under the same trace
- Each span records start time, end time, status, and attributes
- All spans are sent to a collector (Jaeger, Tempo, Datadog) and assembled into a trace view
What tracing reveals:
- The full call graph of a request
- Latency breakdown (which service or DB call is slow?)
- Error propagation (where did the failure originate?)
- Fan-out patterns (one request triggers 10 downstream calls)
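The propagation mechanics can be sketched without any SDK. Real systems use OpenTelemetry; the header layout below follows the W3C `traceparent` convention (version-traceid-spanid-flags), and the parsing helper is a simplified illustration:

```python
# Minimal sketch of trace context propagation. Real services use the
# OpenTelemetry SDK; this only shows the header mechanics described above.
import uuid

def make_traceparent(trace_id: str, span_id: str) -> str:
    # W3C format: version-traceid-spanid-flags
    return f"00-{trace_id}-{span_id}-01"

def parse_traceparent(header: str) -> tuple[str, str]:
    _, trace_id, parent_span_id, _ = header.split("-")
    return trace_id, parent_span_id

# Service A: starts the trace
trace_id = uuid.uuid4().hex              # 32 hex chars
span_a = uuid.uuid4().hex[:16]           # 16 hex chars
outgoing_headers = {"traceparent": make_traceparent(trace_id, span_a)}

# Service B: continues the same trace as a child span
incoming_trace_id, parent = parse_traceparent(outgoing_headers["traceparent"])
span_b = uuid.uuid4().hex[:16]
assert incoming_trace_id == trace_id     # both spans share one trace
assert parent == span_a                  # B's span is a child of A's
```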
16. What is SLO-based development and why define reliability targets before writing code?
Key terms:
- SLI (Service Level Indicator) --- a measurable metric (e.g., “99th percentile latency of the checkout API”)
- SLO (Service Level Objective) --- a target for an SLI (e.g., “p99 latency < 500ms, 99.9% of the time”)
- SLA (Service Level Agreement) --- a contractual commitment with consequences (usually looser than SLOs)
- Error budget --- the allowed amount of unreliability (e.g., 0.1% of requests can fail)
- Architecture decisions depend on reliability targets --- 99.9% vs 99.99% uptime implies fundamentally different designs
- Error budgets drive prioritization --- if you have budget remaining, ship features. If budget is spent, fix reliability
- Avoids over-engineering --- not every service needs five-nines. A weekly report generator can tolerate more failures than a payment service
Workflow for a new feature:
- Define SLIs for the new feature (latency, error rate, throughput)
- Set SLOs with the product team (what does “reliable enough” mean for users?)
- Instrument the code to emit those SLIs
- Set up dashboards and burn-rate alerts
- Track error budget over time, use it to balance features vs reliability work
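The error-budget arithmetic above can be sketched as:

```python
# Sketch: computing an error budget from an SLO.
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """How much unreliability the SLO allows, and how much of it is left."""
    allowed_failures = total_requests * (1 - slo)
    return {
        "allowed_failures": allowed_failures,
        "remaining": allowed_failures - failed_requests,
        "budget_spent": failed_requests / allowed_failures if allowed_failures else float("inf"),
    }

# A 99.9% SLO over 1,000,000 requests allows ~1,000 failures
budget = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(budget)  # roughly 600 failures of budget remaining -> keep shipping features
```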
17. How do feature flags and observability work together to measure feature impact?
- Feature flag controls who sees the new behavior (percentage rollout, user segments, geography)
- Observability measures what happens when they do (latency, error rate, business metrics)
Progressive delivery workflow:
- Deploy the feature behind a flag (off by default)
- Enable for 5% of traffic
- Compare SLIs between flag-on and flag-off cohorts (A/B style)
- If metrics are healthy, ramp to 25%, 50%, 100%
- If metrics degrade, kill the flag instantly --- no redeploy needed
What to measure in each cohort:
- Technical metrics --- latency, error rate, CPU/memory usage
- Business metrics --- conversion rate, revenue per session, user engagement
- Operational metrics --- support ticket volume, on-call pages
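The percentage rollout step is usually implemented as deterministic hash-based bucketing (a common approach, though real flag platforms add targeting rules on top). A minimal sketch:

```python
# Sketch of deterministic percentage rollout: each (flag, user) pair hashes
# into a stable bucket 0-99, so ramping 5% -> 25% only ever adds users.
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# The same user always gets the same answer for a given flag and percentage
assert flag_enabled("new-checkout", "user-42", 100) is True
assert flag_enabled("new-checkout", "user-42", 0) is False
assert flag_enabled("new-checkout", "user-42", 50) == flag_enabled("new-checkout", "user-42", 50)
```

Stable bucketing is what makes the cohort comparison valid: a user never flip-flops between flag-on and flag-off during the experiment.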
4. Event-Driven Architecture in Practice
Event-driven architecture (EDA) decouples systems by communicating through events rather than direct API calls. Understanding when and how to apply EDA is critical for modern distributed systems.
18. When should you go event-driven instead of request-response?
Stick with request-response when:
- The caller needs an immediate answer (user clicks “Get Balance” and expects a number)
- The operation is simple and fast (< 100ms)
- There is one producer and one consumer
- Strong consistency is required
Go event-driven when:
- Multiple consumers need to react to the same action (order placed -> send email, update inventory, trigger analytics)
- Temporal decoupling is needed --- the producer should not wait for or even know about consumers
- Spike buffering --- absorb traffic bursts with a queue instead of overloading downstream services
- Eventual consistency is acceptable --- the inventory count can be a few seconds stale
- Cross-team boundaries --- teams should be able to evolve independently
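The fan-out this enables (e.g., an OrderPlaced event feeding several consumers) can be sketched as an in-process publish/subscribe loop; a real broker adds persistence and asynchronous delivery, and the event names here are illustrative:

```python
# Minimal in-process sketch of event fan-out: one OrderPlaced event,
# several independent consumers, and a producer that knows none of them.

subscribers: dict[str, list] = {}

def subscribe(event_type: str, handler):
    subscribers.setdefault(event_type, []).append(handler)

def publish(event_type: str, payload: dict):
    for handler in subscribers.get(event_type, []):
        handler(payload)          # a real broker delivers asynchronously

handled = []
subscribe("OrderPlaced", lambda e: handled.append(("email", e["order_id"])))
subscribe("OrderPlaced", lambda e: handled.append(("inventory", e["order_id"])))
subscribe("OrderPlaced", lambda e: handled.append(("analytics", e["order_id"])))

publish("OrderPlaced", {"order_id": "o-123"})
assert len(handled) == 3          # all three consumers reacted to one event
```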
Example: an OrderPlaced event triggers email, inventory, and analytics asynchronously.
19. What is the difference between an event mesh, event bus, and event broker?
| Concept | Definition | Example |
|---|---|---|
| Event Broker | A single system that receives, stores, and delivers events | Kafka, RabbitMQ, Amazon SQS |
| Event Bus | A logical channel where events are published and consumed, typically within one application boundary | AWS EventBridge, Azure Service Bus |
| Event Mesh | A network of interconnected event brokers that route events across environments, clouds, and regions | Solace, a federated Kafka deployment |
- An event broker is infrastructure --- it is the engine
- An event bus is a pattern --- a single stream of events for a bounded context
- An event mesh is a topology --- connecting multiple brokers across locations for global event routing
An event mesh makes sense for:
- Multi-cloud or hybrid-cloud architectures
- Geographically distributed systems that need local event processing with global visibility
- Large organizations with many independent event brokers that need interconnection
20. What is a schema registry and how do you handle event evolution?
How it works:
- Producer registers the event schema (e.g., `OrderPlaced` v1) with the registry
- Consumer reads the schema to know what to expect
- When the producer evolves the schema (v2), the registry checks compatibility rules
Common schema formats:
- Avro --- schema-driven, compact binary, excellent schema evolution support. Most common with Kafka
- Protobuf --- Google’s format, strong typing, good evolution rules, widely used in gRPC
- JSON Schema --- human-readable, less compact, good for REST/webhook events
Compatibility modes:
- Backward compatible --- new schema can read old data (safe for consumers to upgrade first)
- Forward compatible --- old schema can read new data (safe for producers to upgrade first)
- Full compatible --- both directions work (safest, most restrictive)
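Backward compatibility can be illustrated with plain dicts --- in Avro the registry enforces this via field defaults. The event fields below are invented:

```python
# Sketch of backward-compatible schema evolution with plain dicts. In Avro
# the registry enforces this; here the "schema" is just a dict of defaults.

ORDER_PLACED_V2_DEFAULTS = {"currency": "USD"}   # new optional field in v2

def read_order_placed(event: dict) -> dict:
    """A v2 consumer reading an event: missing new fields fall back to defaults."""
    return {**ORDER_PLACED_V2_DEFAULTS, **event}

v1_event = {"order_id": "o-1", "amount": 42}     # produced before the change
v2_event = {"order_id": "o-2", "amount": 7, "currency": "EUR"}

assert read_order_placed(v1_event)["currency"] == "USD"   # old data still readable
assert read_order_placed(v2_event)["currency"] == "EUR"
```

This is why "add an optional field with a default" is the canonical safe change, while removing or renaming a required field breaks consumers.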
21. Explain the Saga pattern: choreography vs orchestration with real examples
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coupling | Low --- services are independent | Medium --- orchestrator knows all services |
| Visibility | Hard to see the full flow | Easy --- the orchestrator defines the flow |
| Complexity | Grows fast with many steps | Centralized, easier to reason about |
| Failure handling | Each service handles its own | Orchestrator manages all compensation |
| Best for | Simple flows (2-3 steps) | Complex flows (4+ steps, conditional logic) |
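A minimal orchestration sketch, with invented step names: each step pairs an action with a compensation, and a failure unwinds the completed steps in reverse order:

```python
# Minimal saga-orchestration sketch: each step pairs an action with a
# compensation; on failure, completed steps are compensated in reverse.

def run_saga(steps):
    """steps: list of (name, action, compensation). Returns the execution log."""
    log, completed = [], []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            for done_name, comp in reversed(completed):
                comp()
                log.append(f"compensated:{done_name}")
            break
    return log

def fail():
    raise RuntimeError("payment declined")

log = run_saga([
    ("reserve_inventory", lambda: None, lambda: None),
    ("charge_payment", fail, lambda: None),
    ("ship_order", lambda: None, lambda: None),
])
assert log == ["done:reserve_inventory", "failed:charge_payment",
               "compensated:reserve_inventory"]
```

Note that `ship_order` never runs: the orchestrator stops at the failure and compensates, which is exactly the centralized failure handling the table credits orchestration with.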
22. When is CQRS + Event Sourcing worth the complexity?
Worth the complexity when:
- Audit requirements --- financial systems, healthcare, legal. You need a complete, immutable history of every change
- Complex read patterns --- the same data needs to be queried in radically different ways (e.g., time-series, aggregations, search)
- High write throughput --- append-only event log is faster than update-in-place
- Temporal queries --- “what was the state of this account on March 15th?”
- Event-driven downstream --- many services need to react to changes
Not worth it for:
- Simple CRUD applications with straightforward read/write patterns
- Small teams that cannot maintain the operational complexity
- When eventual consistency between read and write models is unacceptable
- Greenfield projects where you are not sure of the requirements yet
5. Security Engineering Mindset
Security is not a team you hand off to at the end --- it is a mindset embedded in every phase of engineering. Modern interviews expect engineers to think about security as naturally as they think about testing.
23. What does shift-left security mean in practice?
Security activities move earlier in the lifecycle:
- Design phase: threat modeling
- Coding phase: secure defaults and static analysis
- Dependency phase: SCA (software composition analysis) scanning
- Build phase: container scanning
- Pre-deploy: DAST (dynamic application security testing) and policy checks
24. What is software supply chain security and why does it matter?
Why it matters:
- The average application has hundreds of transitive dependencies
- SolarWinds (2020), Log4Shell (2021), and xz-utils (2024) showed that compromising a single dependency can affect millions of systems
- Attackers increasingly target the supply chain because it scales --- one compromised library hits every application that uses it
Key defenses:
- SBOMs (Software Bill of Materials) --- a complete list of every component in your software. Mandated by the US government for federal software. Generated by tools like Syft, CycloneDX
- Dependency scanning --- automated CVE checking on every build (Dependabot, Snyk, Renovate)
- Sigstore --- keyless signing for artifacts. Cosign signs container images, Rekor provides a transparency log. Verifies that the artifact you deploy is the one your CI built
- SLSA (Supply-chain Levels for Software Artifacts) --- a framework for build integrity. Levels 1-4, from “documented build process” to “hermetic, reproducible builds with provenance”
- Lock files --- always commit lock files (package-lock.json, go.sum). Pin exact versions
- Vendoring --- for critical dependencies, consider vendoring (copying the source) to avoid upstream tampering
25. How does zero-trust architecture work in practice?
The old perimeter model:
- Firewall protects the network boundary
- Once inside, everything trusts everything
- VPN = you are “in”
Zero trust in practice:
- No implicit trust based on network location
- Every service-to-service call is authenticated (mTLS, JWT)
- Every request is authorized (does this service have permission to call that endpoint?)
- Least privilege by default --- services can only access what they explicitly need
Building blocks:
- Identity --- every service has a cryptographic identity (SPIFFE/SPIRE, service mesh certificates)
- Authentication --- mTLS between services, short-lived tokens for users
- Authorization --- fine-grained policies (OPA, Cedar, Zanzibar-style systems)
- Encryption --- data encrypted in transit (TLS everywhere) and at rest
- Micro-segmentation --- network policies restrict which pods can talk to which
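The "every request is authorized" principle can be sketched as an explicit allowlist with deny-by-default. Real systems express this in OPA or Cedar policies; the service and endpoint names below are invented:

```python
# Sketch of least-privilege service-to-service authorization: an explicit
# allowlist, deny by default. Real systems use OPA/Cedar policy engines.

POLICY = {
    ("checkout", "payments:/charge"): True,
    ("checkout", "inventory:/reserve"): True,
}

def authorize(caller: str, resource: str) -> bool:
    return POLICY.get((caller, resource), False)   # no implicit trust

assert authorize("checkout", "payments:/charge") is True
assert authorize("analytics", "payments:/charge") is False  # not allowlisted
```

Deny-by-default is the key design choice: an unknown caller/resource pair fails closed, which is the opposite of the perimeter model's "once inside, everything trusts everything".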
26. What are the best practices for secrets management?
Anti-patterns to avoid:
- Hardcoding secrets in source code
- Storing secrets in environment variables without encryption
- Sharing secrets via Slack, email, or sticky notes
- Using the same secret across all environments
- Never rotating secrets
| Practice | Implementation |
|---|---|
| Centralized secret store | HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager |
| Dynamic secrets | Vault generates short-lived DB credentials on demand --- no static passwords |
| Encryption at rest | Secrets encrypted with a master key (envelope encryption) |
| Least privilege access | Services can only read secrets they need, enforced by policy |
| Automatic rotation | Secrets rotate on a schedule, applications fetch the latest version |
| Audit logging | Every secret access is logged (who, when, from where) |
| Git prevention | Pre-commit hooks (gitleaks, detect-secrets) block secrets from being committed |
| SOPS for config | Mozilla SOPS encrypts secret values in config files, keeping keys readable for diffs |
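The pre-commit scanning row above works by pattern matching. A sketch of the idea behind tools like gitleaks and detect-secrets --- the two patterns here are a tiny illustrative subset, not the tools' actual rule sets:

```python
# Sketch of pre-commit secret scanning: match well-known token shapes and
# suspicious assignments. Patterns are an illustrative subset only.
import re

PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_assignment": re.compile(r"(?i)(password|secret|api_key)\s*=\s*['\"][^'\"]+['\"]"),
}

def scan(text: str) -> list[str]:
    """Return the names of all patterns that match the given text."""
    return [name for name, pat in PATTERNS.items() if pat.search(text)]

assert scan('aws_key = "AKIAIOSFODNN7EXAMPLE"') == ["aws_access_key"]
assert scan("password = 'hunter2'") == ["generic_assignment"]
assert scan("nothing to see here") == []
```

Real scanners add entropy analysis and hundreds of provider-specific rules; the hook blocks the commit whenever `scan` returns anything.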
27. How do you do threat modeling as a design activity?
| Threat | Description | Example | Mitigation |
|---|---|---|---|
| Spoofing | Pretending to be someone else | Forged JWT tokens | Strong authentication, token validation |
| Tampering | Modifying data in transit or at rest | Man-in-the-middle altering API responses | TLS, integrity checks, digital signatures |
| Repudiation | Denying an action occurred | User claims they never placed an order | Audit logs, non-repudiation mechanisms |
| Information Disclosure | Exposing data to unauthorized parties | SQL injection leaking user data | Input validation, encryption, access control |
| Denial of Service | Making a system unavailable | DDoS, resource exhaustion attacks | Rate limiting, auto-scaling, CDN |
| Elevation of Privilege | Gaining unauthorized access | Exploiting an admin API with a regular user token | RBAC, least privilege, input validation |
A practical process:
- Diagram the system --- draw data flows, trust boundaries, entry points
- Apply STRIDE to each component and data flow
- Rank threats by likelihood and impact (use a risk matrix)
- Define mitigations for high-priority threats
- Track as engineering work --- threat mitigations go into the backlog alongside features
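The ranking step can be sketched with a simple likelihood-times-impact score; the threats and scores below are invented for illustration:

```python
# Sketch of risk-matrix ranking: score = likelihood x impact (1-5 scales).
# Threat names and scores are invented examples.
threats = [
    {"name": "Forged JWT", "likelihood": 3, "impact": 5},
    {"name": "DDoS on public API", "likelihood": 4, "impact": 3},
    {"name": "Insider log tampering", "likelihood": 1, "impact": 4},
]

ranked = sorted(threats, key=lambda t: t["likelihood"] * t["impact"], reverse=True)
for t in ranked:
    print(t["name"], t["likelihood"] * t["impact"])
# Highest-scoring threats become backlog items first
```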
28. What are AI-specific security concerns?
Prompt injection:
- Direct --- user crafts input that overrides the system prompt (“ignore previous instructions and…”)
- Indirect --- malicious content in data the AI processes (e.g., a webpage containing hidden instructions that an AI agent follows)
- Mitigation --- input sanitization, output filtering, guardrails, separate system/user prompt handling, never trust user input in prompts
Training data poisoning:
- Attackers contaminate training data to influence model behavior
- Example: injecting biased or malicious examples into a fine-tuning dataset
- Mitigation: data provenance tracking, anomaly detection in training data, human review of training sets
Model extraction:
- Attackers query a model repeatedly to reverse-engineer its behavior and create a copy
- Mitigation: rate limiting, query logging, watermarking model outputs, monitoring for extraction patterns
Training data leakage:
- Models may memorize and regurgitate sensitive training data (PII, proprietary code, API keys)
- Mitigation: data de-identification before training, differential privacy, output filtering
Model supply chain attacks:
- Compromised model weights distributed via model hubs (think “npm for ML models”)
- Mitigation: model signing, hash verification, trusted model registries
6. Sustainable Engineering
Sustainable engineering is about building software that is efficient with compute resources, responsible with energy consumption, and designed to last.
29. What is green software engineering?
Three pillars:
- Energy efficiency --- use less electricity per unit of work (better algorithms, efficient code, right-sized instances)
- Hardware efficiency --- use less physical hardware per unit of work (higher utilization, shared infrastructure)
- Carbon awareness --- run workloads when and where the electricity grid is cleanest
Carbon-aware computing:
- Electricity grids vary in carbon intensity based on time and location (solar during the day, wind in certain regions)
- Temporal shifting --- run batch jobs when the grid is cleanest (e.g., overnight when wind power is high)
- Spatial shifting --- run workloads in regions with cleaner grids (e.g., a region powered by hydroelectric)
- Demand shaping --- adjust the amount of work based on carbon intensity (reduce batch size during high-carbon periods)
Ecosystem and standards:
- Green Software Foundation --- industry body defining standards
- Carbon Aware SDK --- provides carbon intensity data for scheduling decisions
- Cloud Carbon Footprint --- measures and reports cloud emissions
- SCI (Software Carbon Intensity) --- a metric for carbon per unit of work (like a “miles per gallon” for software)
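Temporal shifting reduces to picking the cleanest forecast window for flexible work. A minimal sketch with invented carbon-intensity numbers:

```python
# Sketch of temporal shifting: given forecast grid carbon intensity per hour
# (gCO2/kWh, values invented), schedule a flexible batch job in the cleanest
# window instead of running it immediately.

forecast = {0: 450, 3: 380, 6: 290, 9: 310, 12: 220, 15: 260, 18: 400, 21: 430}

def cleanest_hour(forecast: dict[int, int]) -> int:
    """Return the forecast hour with the lowest carbon intensity."""
    return min(forecast, key=forecast.get)

assert cleanest_hour(forecast) == 12   # midday solar peak in this invented forecast
```

Real schedulers pull live forecasts (e.g., via the Carbon Aware SDK mentioned above) and also weigh deadlines and cost.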
30. How do you reduce waste through efficient engineering?
- Choosing O(n log n) over O(n^2) is not just an academic exercise --- at scale, it is the difference between 10 servers and 1,000
- Profile before optimizing. Use tools (pprof, py-spy, async-profiler) to find the actual bottleneck
- Cache aggressively at every layer (CDN, application, database)
- Use auto-scaling instead of provisioning for peak load 24/7
- Monitor actual CPU and memory utilization --- most instances run at 10-20% utilization
- Consider serverless for spiky or low-traffic workloads (you pay only for execution)
- Use spot/preemptible instances for fault-tolerant workloads (60-90% cost savings)
Common sources of waste:
- Over-engineering --- CQRS + Event Sourcing + Kubernetes for a CRUD app serving 100 users
- Microservices for a team of 3 engineers (the coordination overhead exceeds the benefits)
- Running unused environments --- dev and staging environments left running overnight and on weekends
31. How do you measure and reduce cloud carbon footprint?
How to measure:
- Cloud provider dashboards --- AWS Customer Carbon Footprint Tool, Google Carbon Footprint, Azure Emissions Impact Dashboard
- Cloud Carbon Footprint (CCF) --- open-source tool that estimates emissions from cloud usage data (billing APIs)
- SCI metric --- Software Carbon Intensity = (Energy x Carbon Intensity + Embodied Carbon) per functional unit
How to reduce:
- Compute --- right-size instances, use ARM-based processors (Graviton, Ampere) which are 40-60% more energy efficient for many workloads
- Storage --- implement data lifecycle policies. Move cold data to cheaper, less energy-intensive tiers. Delete what you do not need
- Networking --- reduce data transfer between regions. Use CDNs. Compress payloads
- Region selection --- choose cloud regions with lower carbon intensity. Google publishes carbon intensity per region
Organizational practices:
- Include carbon metrics in engineering dashboards alongside cost and performance
- Set carbon budgets per team or service (like you set cost budgets)
- Run “sustainability reviews” alongside architecture reviews for major projects
- Automate shutdown of non-production environments outside business hours
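The SCI formula cited above (Energy x Carbon Intensity + Embodied Carbon, per functional unit) can be computed directly; the numbers below are invented for illustration:

```python
# Sketch of the SCI (Software Carbon Intensity) formula:
#   SCI = (E x I + M) / R
#   E = energy (kWh), I = grid carbon intensity (gCO2/kWh),
#   M = embodied carbon (gCO2), R = functional units (e.g., requests)

def sci(energy_kwh: float, intensity_gco2_per_kwh: float,
        embodied_gco2: float, functional_units: int) -> float:
    return (energy_kwh * intensity_gco2_per_kwh + embodied_gco2) / functional_units

# e.g., 50 kWh at 300 gCO2/kWh plus 5,000 g embodied, over 1M API requests
per_request = sci(50, 300, 5_000, 1_000_000)
print(f"{per_request} gCO2 per request")
```

Like "miles per gallon", the value only becomes useful when tracked over time for the same functional unit.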
32. How do you engineer for longevity --- code that lasts years, not months?
Principles:
- Boring technology --- choose well-understood, stable technologies. PostgreSQL will be around in 10 years. That trendy new database might not
- Clear boundaries --- well-defined interfaces between modules. You should be able to replace the implementation without changing the consumers
- Comprehensive tests --- the test suite is the living documentation and the safety net for future changes
- Explicit over implicit --- future developers (including future you) should not need to guess what a function does or why a decision was made
- Decision records --- ADRs (Architecture Decision Records) document why you chose X over Y. In 2 years, nobody will remember the discussion
Practical habits:
- Write commit messages that explain why, not what (the diff shows the what)
- Comment on the why behind non-obvious code --- “we use a mutex here because the map is accessed from multiple goroutines” not “lock the mutex”
- Keep dependencies minimal and up to date (automated with Dependabot or Renovate)
- Design for deletion --- make it easy to remove features and code, not just add them
- Avoid tight coupling to vendor APIs --- use adapters and interfaces
Anti-patterns:
- “Move fast and break things” without ever going back to clean up
- No tests because “we’ll add them later” (you will not)
- Knowledge hoarding --- one person understands the system, and when they leave, so does the knowledge
- Resume-driven development --- choosing technologies to pad your resume rather than to solve the problem
Interview Quick Reference
These are high-signal questions that frequently appear in senior and staff-level engineering interviews on modern practices. Use them for self-assessment.
Rapid-fire: Questions to expect in 2024-2025+ interviews
- How do you decide when to use AI-generated code vs writing it yourself?
- Describe a time AI tooling saved you significant time. What about a time it led you astray?
- How do you verify the security of AI-generated code?
Deep Dive: How would you evaluate whether your team should adopt AI-assisted coding tools?
- Define the evaluation criteria before the pilot --- what does “success” look like? Faster cycle time? Fewer bugs? Higher developer satisfaction? Pick 2-3 primary metrics and commit to them upfront.
- Run a structured pilot:
- Select 2-3 teams with different codebases and workflows (not just the enthusiasts)
- Run for 4-8 weeks to get past the novelty effect
- Establish a control group or use before/after comparison with baseline data
- Track both quantitative metrics and qualitative developer feedback
- Metrics to track:
- Cycle time --- time from first commit to PR merged (expect 20-30% improvement based on GitHub’s internal research)
- Acceptance rate --- what percentage of AI suggestions are accepted vs rejected? Low acceptance means the tool is generating noise
- Bug introduction rate --- are AI-assisted PRs introducing more or fewer bugs in production?
- Developer satisfaction --- survey scores on productivity, frustration, and code quality
- Code review effort --- are reviewers spending more or less time per PR? (AI can shift work to reviewers if developers blindly accept suggestions)
- Onboarding velocity --- do new engineers ramp up faster with AI assistance?
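As a sketch of how the first two metrics might be computed from pilot data; the record fields, timestamps, and function names below are invented for illustration, not the export format of any specific tool:

```python
from datetime import datetime
from statistics import median

# Hypothetical pilot data: one record per merged PR.
prs = [
    {"first_commit": datetime(2024, 5, 1, 9),  "merged": datetime(2024, 5, 2, 15), "ai_assisted": True},
    {"first_commit": datetime(2024, 5, 1, 10), "merged": datetime(2024, 5, 4, 11), "ai_assisted": False},
    {"first_commit": datetime(2024, 5, 3, 8),  "merged": datetime(2024, 5, 3, 17), "ai_assisted": True},
]


def median_cycle_time_hours(records):
    """Cycle time = first commit to PR merge, as defined above."""
    return median((r["merged"] - r["first_commit"]).total_seconds() / 3600
                  for r in records)


pilot = [r for r in prs if r["ai_assisted"]]
control = [r for r in prs if not r["ai_assisted"]]
print(f"pilot median:   {median_cycle_time_hours(pilot):.1f}h")
print(f"control median: {median_cycle_time_hours(control):.1f}h")


def acceptance_rate(accepted: int, shown: int) -> float:
    """Share of AI suggestions accepted, from assistant telemetry."""
    return accepted / shown if shown else 0.0
```

The same shape works for the other metrics: pick a concrete definition up front, compute it identically for pilot and control groups, and compare medians rather than means to dampen outlier PRs.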
- Evaluate the risks:
- IP and licensing concerns (code generated from training data)
- Security implications (AI suggesting vulnerable patterns)
- Over-reliance and skill atrophy in junior engineers
- Cost vs productivity gain
- Make a phased decision --- do not go all-in or all-out. Roll out to willing teams first, expand based on data.
Common pitfalls:
- Adopting because “everyone else is” without measuring impact
- Evaluating based on vibes instead of metrics
- Only asking senior engineers --- juniors and seniors experience AI tools very differently
- Ignoring the security and IP review
Deep Dive: Should a 50-engineer org build an internal developer platform?
- Start with the pain, not the solution:
- Interview 5-10 engineers: where do they lose the most time?
- Measure: how long does it take to spin up a new service? To deploy? To debug a production issue?
- If engineers spend 30%+ of their time on undifferentiated infrastructure work, there is a strong signal
- Assess the 50-engineer context honestly:
- At 50 engineers, you likely have 5-8 teams. That is enough to feel duplication pain but small enough that a dedicated platform team (3-4 people) is a significant investment (6-8% of engineering)
- The opportunity cost is real --- those 3-4 engineers are not shipping features
- But the hidden cost of not investing is also real --- 50 engineers each spending 2 hours per week on infra toil is 100 hours/week of waste
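The hidden-cost arithmetic above can be made explicit with a back-of-envelope calculation. All figures are the illustrative ones from the bullets, not benchmarks:

```python
# Back-of-envelope cost comparison using the numbers cited above.
engineers = 50
toil_hours_per_engineer_per_week = 2
platform_team_size = 4            # upper end of the 3-4 range
hours_per_engineer_week = 40

toil_cost = engineers * toil_hours_per_engineer_per_week        # 100 h/week wasted
platform_cost = platform_team_size * hours_per_engineer_week    # 160 h/week invested

print(f"toil: {toil_cost} h/week, platform team: {platform_cost} h/week")

# Raw hours favor doing nothing at 50 engineers. The case for a platform
# rests on toil scaling with headcount while the platform cost stays flat:
for headcount in (50, 100, 150):
    print(headcount, "engineers ->", headcount * toil_hours_per_engineer_per_week, "h/week toil")
```

This is exactly why the growth criterion below (“plan to grow to 100+ engineers”) matters: at 100 engineers the same toil rate already exceeds a four-person platform team's weekly hours.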
- Consider the phased approach:
- Phase 0 (Week 1-2): Document the current developer journey. Map every step from “I have an idea” to “it is in production.” Identify the top 3 friction points
- Phase 1 (Month 1-3): Assign 1-2 engineers part-time to solve the single biggest pain point. Often this is CI/CD standardization or environment provisioning
- Phase 2 (Month 3-6): If Phase 1 delivers measurable improvement (deploy time cut in half, onboarding time reduced), formalize a small platform team
- Phase 3 (Month 6+): Build a self-service portal. Evaluate Backstage or similar. Add golden paths for common workflows
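If the Phase 3 evaluation does land on Backstage, each service is registered in the software catalog with a `catalog-info.yaml` descriptor checked into its repo. A minimal sketch, with placeholder names:

```yaml
# Minimal Backstage catalog descriptor (service and org names are placeholders)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Example service registered in the software catalog
  annotations:
    github.com/project-slug: example-org/payments-api
spec:
  type: service
  lifecycle: production
  owner: team-payments
```

The `owner` field is what turns the catalog into an accountability map: every service has a team on the hook for it, which pays off in incidents like the CVE scenario below.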
- Decision criteria for “yes, invest now”:
- Multiple teams are solving the same infra problems independently
- New service creation takes more than a day
- Onboarding takes more than a week
- You are in a regulated industry where consistency is a compliance requirement
- You plan to grow to 100+ engineers in the next 12-18 months
- Decision criteria for “not yet”:
- Most friction is product/process, not tooling
- A single monolith serves your needs and teams are not yet independent
- The existing DevOps/SRE setup handles requests within hours, not weeks
Common pitfalls:
- Building a platform before understanding the actual developer pain points
- Over-building (“we need Backstage, Crossplane, and a custom CLI” for 50 engineers)
- Under-building (a wiki page with setup instructions is not a platform)
- Not treating the platform as a product with internal customers
Deep Dive: Critical CVE in a transitive dependency across 30 services
- Triage (first 30 minutes):
- Assess severity and exploitability --- is this a remote code execution (RCE)? Is it exploitable from the internet? Is there a known exploit in the wild? A CVSS 9.8 with a public exploit is a different urgency than a CVSS 7.0 requiring local access
- Determine actual exposure --- “used by 30 services” does not mean all 30 are vulnerable. Check which services actually exercise the vulnerable code path. A transitive dependency pulled in for a utility function you never call is lower risk
- Check for existing mitigations --- WAF rules, network segmentation, or input validation may already block the attack vector
- Communicate --- notify the security team, engineering leads, and incident channel. Set a severity level. Assign an incident commander if the CVE is critical
- Assessment (first 2 hours):
- Generate or consult the SBOM --- identify every service, every version, every path through the dependency tree that includes the vulnerable package
- Categorize services by risk --- internet-facing services processing untrusted input are Priority 1. Internal batch jobs are Priority 3
- Check for a patch --- is a fixed version available? If yes, what is the upgrade path? Are there breaking changes? If no patch, what workarounds exist?
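The SBOM step above might look like the following sketch, which scans per-service CycloneDX SBOM exports for the vulnerable package. The directory layout, package name, and version set are hypothetical:

```python
import json
from pathlib import Path

# Hypothetical vulnerable package and affected versions.
VULNERABLE_NAME = "libexample"
VULNERABLE_VERSIONS = {"1.2.0", "1.2.1"}


def affected_services(sbom_dir: str) -> list[str]:
    """Return names of services whose CycloneDX SBOM lists a vulnerable version.

    Assumes one <service-name>.json SBOM per service in sbom_dir.
    """
    hits = []
    for sbom_path in Path(sbom_dir).glob("*.json"):
        sbom = json.loads(sbom_path.read_text())
        # CycloneDX lists every package (direct and transitive) in "components".
        for component in sbom.get("components", []):
            if (component.get("name") == VULNERABLE_NAME
                    and component.get("version") in VULNERABLE_VERSIONS):
                hits.append(sbom_path.stem)
                break  # one match is enough to flag the service
    return sorted(hits)
```

Because the SBOM flattens the whole dependency tree, this catches the package no matter which parent pulled it in, which is precisely the transitive-path gap called out in the pitfalls below.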
- Remediation plan:
- If a patch exists: Update the dependency in a shared parent (if you use a monorepo or shared base image, one fix propagates). For polyrepo, automate the update using Dependabot, Renovate, or a bulk scripting approach
- If no patch exists: Implement compensating controls --- WAF rules to block the exploit pattern, network restrictions to limit exposure, feature flags to disable the vulnerable code path
- Testing: Run the existing test suite. For critical services, run targeted tests against the specific vulnerability. Do not skip testing under pressure --- a broken deploy is worse than a delayed patch
- Rollout: Priority 1 services first. Use canary or blue-green deployments. Monitor error rates and latency closely during rollout
- Post-incident:
- Verify completeness --- rescan all 30 services to confirm the vulnerable version is gone
- Retrospective --- why did it take X hours? Could we have detected this faster? Do we need better SBOM tooling, faster CI, or pre-approved emergency deploy paths?
- Improve defenses --- add the CVE pattern to your SCA tool’s block list. Consider pinning or vendoring critical transitive dependencies. Evaluate whether SLSA adoption would have caught this earlier
Common pitfalls:
- Panic-patching all 30 services simultaneously without risk triage
- Updating the dependency without testing (“it is just a patch version”)
- Forgetting transitive paths --- fixing the direct dependency while missing the same vulnerable package pulled in through another dependency
- Not communicating to stakeholders until the fix is done
- Treating the incident as over when the patch is deployed without a retrospective
Real-World Stories
These stories illustrate why the topics in this guide matter. Each one is a real event that reshaped how the industry thinks about modern engineering.
How GitHub Copilot Changed Developer Productivity --- The Internal Research Data
The SolarWinds Attack --- The Most Sophisticated Supply Chain Attack and What It Taught Us
How Backstage Became the Open-Source Standard for Platform Engineering
How Shopify Measures and Improves Developer Experience --- The DevEx Framework
- Feedback loops --- how quickly developers get signal from their tools and processes (CI build time, PR review latency, deploy time, test execution speed)
- Cognitive load --- how much irrelevant complexity developers must manage beyond the core problem they are solving (infrastructure setup, config management, navigating undocumented systems)
- Flow state --- how often developers can achieve and maintain deep, uninterrupted focus (meeting load, context switching between projects, interrupt-driven work culture)
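One way to make these three dimensions trackable is a simple per-dimension survey average compared quarter over quarter. A sketch with invented data; the field names mirror the framework above, but the scoring scheme is an assumption, not Shopify's actual instrument:

```python
from statistics import mean

# Hypothetical quarterly survey: each engineer rates each dimension 1-5
# (higher = better, so "cognitive_load" here means "load is manageable").
responses = [
    {"feedback_loops": 4, "cognitive_load": 2, "flow_state": 3},
    {"feedback_loops": 3, "cognitive_load": 2, "flow_state": 4},
    {"feedback_loops": 5, "cognitive_load": 3, "flow_state": 2},
]


def dimension_scores(rows):
    """Average each dimension so trends can be compared across quarters."""
    dims = rows[0].keys()
    return {d: round(mean(r[d] for r in rows), 2) for d in dims}


print(dimension_scores(responses))
```

The value is in the trend, not the absolute number: a cognitive-load score that drops two quarters in a row is a concrete signal to invest in platform or documentation work.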