Chapter 2: Identity and Access Management (IAM)

In the cloud, Identity is the new perimeter. Gone are the days when a firewall was enough to protect your data. In GCP, Identity and Access Management (IAM) is the central system that manages permissions across every single service. Understanding IAM is crucial for maintaining security and compliance in your cloud environment.

0. From Scratch: What is IAM?

Before diving into the specifics, let’s establish what IAM is and why it’s fundamentally different from traditional security models.

Traditional Security Model vs. Cloud IAM

In on-premises environments, security was primarily perimeter-based:

Physical security controlled access to buildings
Network firewalls protected internal resources
Active Directory managed user identities
Local admin accounts had privileged access to systems

In contrast, cloud IAM is identity-centric and API-driven:

There is no traditional network perimeter
Every resource access is authenticated and authorized through APIs
Identities can be humans, services, or external systems
Permissions are granted explicitly through policies
Security is enforced at the resource level, not the network level

This shift means that IAM becomes the primary security control in the cloud, making it essential to understand deeply.

1. The Core Philosophy: Who, What, and Where

The IAM model in GCP is a simple but powerful relationship: Who (Identity) + What (Role) + Where (Resource). This relationship forms the basis of every access decision in GCP, so understanding this triangle is fundamental to implementing proper security.

The IAM Triangle Explained

WHO (Identity Principal):
- The entity requesting access (human or service)
- Must be authenticated before authorization occurs
- Examples: users, service accounts, groups
WHAT (Role/Permissions):
- The specific actions the principal is allowed to perform
- Granular to the API call level (e.g., compute.instances.start)
- Roles bundle related permissions together
WHERE (Resource Hierarchy):
- The specific GCP resource the action applies to
- Follows GCP’s resource hierarchy (Organization → Folder → Project → Resource)
- Policies are attached to resources and inherited downward

Identities: The “Who”

Identities represent the entities that can access GCP resources. Each identity type serves different purposes:

1. Google Accounts (Individual Users)

Personal Gmail accounts (addresses ending in @gmail.com)
Cloud Identity or Google Workspace accounts (addresses on your organization’s own domain)
Used for human users accessing GCP resources
Can authenticate via password, 2-factor authentication, or SSO

2. Service Accounts (Non-human Identities)

Identities for applications, virtual machines, and services
Use cryptographic keys instead of passwords
Represent the “application” or “service” itself
Critical for automation and programmatic access

3. Google Groups

Collections of users, service accounts, or other groups
Best Practice: Always assign roles to groups, not individual users
Simplifies management and enables team-based access control
Can be Google Workspace groups or Cloud Identity groups

4. Cloud Identity / Google Workspace

Corporate directory integration
Enables Single Sign-On (SSO) with existing corporate credentials
Maintains separation between personal and corporate accounts

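5. Domain Principals

The domain:DOMAIN principal grants access to every account in a Cloud Identity or Google Workspace domain
Used for enterprise applications that need broad access
Requires careful consideration due to scope
Not to be confused with domain-wide delegation, which lets a service account access Google Workspace user data on users’ behalf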

6. allAuthenticatedUsers and allUsers

allAuthenticatedUsers: Any authenticated Google user
allUsers: Anyone on the internet (including unauthenticated)
Critical Warning: Use these with extreme caution
Typically reserved for public resources like Cloud Storage buckets serving static websites

Roles: The “What”

Roles define what actions an identity can perform on resources. GCP offers multiple role types with different scopes and management approaches:

Basic Roles (Formerly Primitive Roles; Legacy - Do Not Use in Production)

Owner: Full access to all resources and permissions to grant access to others
Editor: Can modify resources but cannot grant access to others
Viewer: Can view resources but cannot modify them
Warning: These roles grant extremely broad permissions
Never use in production - they violate the principle of least privilege

Predefined Roles (Recommended)

Created and maintained by Google
Regularly updated to include new permissions as services evolve
Granular and specific to service functions
Examples: roles/compute.instanceAdmin.v1, roles/storage.objectViewer
Well-tested and security-reviewed by Google

Custom Roles (Production Use Cases)

Created by administrators for specific organizational needs
Allow precise permission control beyond predefined roles
Must be manually maintained as services evolve
Useful for compliance requirements or unique business needs
Cannot be updated automatically when new permissions are added

The Permission Hierarchy

Understanding the relationship between permissions, roles, and policies is crucial:

Permissions: Individual actions (e.g., compute.instances.create)
Roles: Collections of related permissions (e.g., roles/compute.instanceAdmin)
Policies: Bindings between identities and roles for specific resources

Each permission corresponds to a specific API call, ensuring granular control over cloud resources.
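
The relationship between permissions, roles, and policy bindings can be sketched in a few lines of Python. This is a simplified model for illustration, not how GCP evaluates access internally: the `ROLES` catalogue lists only a couple of permissions per role, and `Binding`, `is_allowed`, the group address, and the project names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical, abbreviated role catalogue: role name -> permissions it bundles.
ROLES = {
    "roles/storage.objectViewer": {"storage.objects.get", "storage.objects.list"},
    "roles/compute.instanceAdmin.v1": {"compute.instances.create", "compute.instances.start"},
}


@dataclass(frozen=True)
class Binding:
    member: str    # who, e.g. "group:devs@example.com"
    role: str      # what
    resource: str  # where, e.g. "projects/demo"


def is_allowed(bindings, member, permission, resource):
    """True if some binding on this resource gives the member a role containing the permission."""
    return any(
        b.member == member
        and b.resource == resource
        and permission in ROLES.get(b.role, set())
        for b in bindings
    )


policy = [Binding("group:devs@example.com", "roles/storage.objectViewer", "projects/demo")]
print(is_allowed(policy, "group:devs@example.com", "storage.objects.get", "projects/demo"))
```

The check only succeeds when all three sides of the triangle line up: the same member, a role that contains the permission, and the same resource.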

2. Service Accounts: The Powerhouse of Automation

Service accounts are identities for applications and services, not humans. They are fundamental to cloud security and automation, requiring special attention and protection.

Understanding Service Accounts Deeply

What Makes Service Accounts Different from User Accounts?

Authentication Method: Use private keys/certificates instead of passwords
Multi-Factor Authentication: Do not support MFA/2FA (by design)
Identity Type: Represent applications/services, not humans
Management: Controlled programmatically, not through user interfaces
Scope: Can act on behalf of the application across various resources

Service Account Structure

A service account has multiple identifiers:

Email Address: SA_NAME@PROJECT_ID.iam.gserviceaccount.com
Unique ID: Numeric identifier (e.g., 123456789012345678901)
Display Name: Human-readable name for identification
Description: Additional context about the service account’s purpose

Service Account Keys (Credentials)

JSON Key Files: Contain private keys for authenticating service accounts
High Risk: If compromised, provide full access to the service account’s permissions
Best Practice: Minimize use and rotate regularly
Alternative: Use Workload Identity Federation to eliminate key files

The Service Account User vs Actor Distinction

This is one of the most commonly misunderstood aspects of GCP IAM:

Service Account Identity

The service account itself (e.g., SA_NAME@PROJECT_ID.iam.gserviceaccount.com)
Has specific roles and permissions granted to it
Represents the application’s identity in GCP

Service Account Actor

The human or service that wants to “act as” the service account
Must have the iam.serviceAccounts.actAs permission on the service account
This creates a two-step authorization process

Example Scenario:

Service Account SA_NAME@PROJECT_ID.iam.gserviceaccount.com has the Storage Object Viewer role
A user wants to deploy an application that uses this service account
That user must have the iam.serviceAccounts.actAs permission on the service account
Without this permission, the user cannot deploy the application with the service account
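
The two-step authorization in this scenario can be sketched as follows. It is a toy model under stated assumptions: the user addresses, the `app-sa` name, the two lookup tables, and the choice of `run.services.create` as the deploy permission are all hypothetical illustration data.

```python
# Permissions each user holds on the project (hypothetical data).
PROJECT_PERMISSIONS = {
    "user:deployer@example.com": {"run.services.create"},
    "user:intern@example.com": {"run.services.create"},
}

# Permissions each user holds on a specific service account (hypothetical data).
SERVICE_ACCOUNT_PERMISSIONS = {
    ("user:deployer@example.com", "app-sa"): {"iam.serviceAccounts.actAs"},
}


def can_deploy_with(user, service_account, deploy_permission="run.services.create"):
    """Both checks must pass: the deploy action itself, and acting as the service account."""
    may_deploy = deploy_permission in PROJECT_PERMISSIONS.get(user, set())
    may_act_as = "iam.serviceAccounts.actAs" in SERVICE_ACCOUNT_PERMISSIONS.get(
        (user, service_account), set()
    )
    return may_deploy and may_act_as


print(can_deploy_with("user:deployer@example.com", "app-sa"))
print(can_deploy_with("user:intern@example.com", "app-sa"))
```

Both users may deploy in general, but only the one holding iam.serviceAccounts.actAs on the service account can attach it to a deployment.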

Workload Identity Federation (Modern Authentication)

Workload Identity Federation eliminates the need for service account key files, addressing a major security concern.

The Problem with Service Account Keys

Storage Risk: Keys stored in files can be accidentally committed to repositories
Distribution Risk: Keys must be securely distributed to all environments
Rotation Risk: Manual rotation is complex and often neglected
Compromise Impact: Stolen keys provide persistent access to resources

How Workload Identity Federation Works

External Identity Provider: GitHub Actions, AWS, Azure, or any other OIDC or SAML 2.0 provider
Credential Issuance: The external provider issues a time-limited credential to the workload (for example, an OIDC token)
GCP Trust Relationship: Configure trust between the external provider and a GCP service account
Token Validation: GCP validates the external credential against the trust relationship
Temporary Credentials: GCP provides temporary credentials for the service account

Benefits of Workload Identity Federation

No Key Files: Eliminates the risk of key file compromise
Automatic Rotation: OIDC tokens are short-lived and automatically rotated
External Control: Leverage existing identity providers for access management
Compliance: Meets regulatory requirements for credential management

Service Account Best Practices

1. Principle of Least Privilege

Grant minimal required permissions for each service account
Use predefined roles when possible instead of custom roles
Regularly audit and remove unnecessary permissions
Use IAM Recommender to identify unused permissions

2. Service Account Naming Conventions

# Good naming conventions
[email protected]
[email protected]
[email protected]

3. Service Account Separation

Use separate service accounts for different applications
Use separate service accounts for different environments (dev/staging/prod)
Use separate service accounts for different functions within an application
Avoid sharing service accounts across unrelated applications

4. Key Management

Minimize use of service account key files
Rotate keys regularly (monthly recommended)
Audit key usage and remove unused keys
Use Workload Identity Federation instead of key files when possible

3. IAM Conditions: Contextual Security

IAM Conditions add contextual restrictions to IAM policies, enabling more granular and dynamic access control. They allow you to specify when a policy binding is effective based on various attributes.

Understanding IAM Conditions

Traditional IAM policies are static: “Grant this role to this identity on this resource.” IAM Conditions add a temporal or contextual dimension: “Grant this role to this identity on this resource WHEN certain conditions are met.”

Common Condition Use Cases

1. Time-Based Access

# Grant temporary admin access
# (conditions are not supported on basic roles such as roles/owner, so a predefined role is used)
- members:
  - user:USER_EMAIL
  role: roles/compute.admin
  condition:
    title: "Valid until March 31"
    expression: "request.time < timestamp('2024-04-01T00:00:00Z')"

2. IP-Based Restrictions

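# Restrict access to the corporate IP range
# IAM Conditions cannot inspect the caller's IP directly. Define the range (e.g., 192.168.1.0/24)
# in an Access Context Manager access level and reference it here. This attribute is
# available for resources protected by Identity-Aware Proxy.
- members:
  - user:USER_EMAIL
  role: roles/iap.httpsResourceAccessor
  condition:
    title: "Corporate IP Only"
    expression: "'accessPolicies/POLICY_NUMBER/accessLevels/ACCESS_LEVEL_NAME' in request.auth.access_levels"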

3. Resource Attributes

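# Allow object deletion only in temporary buckets
# (resource.name is the full resource path, so the prefix includes projects/_/buckets/)
- members:
  - group:GROUP_EMAIL
  role: roles/storage.objectAdmin
  condition:
    title: "Temporary Buckets Only"
    expression: "resource.name.startsWith('projects/_/buckets/temp-')"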

Condition Expression Language

IAM Conditions use CEL (Common Expression Language) for expressing conditions. Key elements include:

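1. Request Attributes

request.time: The time of the request
request.auth.access_levels: Access Context Manager access levels that the request satisfies (Identity-Aware Proxy only)

2. Resource Attributes

resource.name: The full resource name
resource.type: The resource type
resource.service: The service that owns the resource
resource.matchTag(): Check for a tag attached to the resource (labels cannot be referenced in conditions)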

3. Common Functions

timestamp(): Convert string to timestamp
string.startsWith(): Check if string starts with prefix
string.contains(): Check if string contains substring
in: Check membership in a list

Advanced Condition Scenarios

1. Business Hours Access

condition:
  title: "Business Hours Only"
  expression: |
    request.time.getHours() >= 9 && 
    request.time.getHours() < 17 && 
    request.time.getDayOfWeek() > 0 && 
    request.time.getDayOfWeek() < 6
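
To make the semantics of the expression above concrete, here is a rough Python equivalent. It is a sketch for reasoning about the rule, not something IAM executes; `within_business_hours` is a hypothetical helper. Two details are easy to miss: without a time zone argument the CEL time functions work in UTC, and getDayOfWeek() counts Sunday as 0.

```python
from datetime import datetime, timezone


def within_business_hours(request_time: datetime) -> bool:
    """Mirror the CEL expression: 09:00 to 16:59 UTC, Monday through Friday."""
    t = request_time.astimezone(timezone.utc)
    # CEL getDayOfWeek() counts Sunday as 0; Python's isoweekday() counts Sunday as 7.
    cel_day_of_week = t.isoweekday() % 7
    return 9 <= t.hour < 17 and 0 < cel_day_of_week < 6


monday_morning = datetime(2024, 3, 4, 10, 0, tzinfo=timezone.utc)
sunday_morning = datetime(2024, 3, 3, 10, 0, tzinfo=timezone.utc)
print(within_business_hours(monday_morning), within_business_hours(sunday_morning))
```

Because the window is evaluated in UTC, a team outside that time zone would need to shift the hour bounds or pass an explicit time zone to the CEL functions.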

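2. Device Compliance

IAM Conditions cannot read arbitrary authentication claims. Device posture is evaluated by an Access Context Manager access level, which the condition then references (supported for resources protected by Identity-Aware Proxy).

condition:
  title: "Compliant Devices Only"
  expression: |
    'accessPolicies/POLICY_NUMBER/accessLevels/COMPLIANT_DEVICES_LEVEL' in request.auth.access_levels

3. Risk-Based Access

# Illustrative only: risk_level and anomaly_score are not attributes that IAM Conditions exposes.
# Signals like these would have to be produced and enforced by an external system.
condition:
  title: "Low Risk Access Only"
  expression: |
    request.auth.claims['risk_level'] != 'HIGH' &&
    request.auth.claims['anomaly_score'] < 0.7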

Limitations and Considerations

1. Performance Impact

Conditions add slight latency to authorization decisions
Complex conditions may impact performance
Monitor for any performance degradation

2. Debugging Challenges

Condition failures may be harder to troubleshoot
Use IAM Policy Troubleshooter to debug conditions
Test conditions thoroughly before production deployment

3. Service Support

Not all GCP services support IAM Conditions
Check service documentation for condition support
Some services have limitations on condition complexity

4. Policy Inheritance and the “Additive” Rule

GCP’s resource hierarchy creates a complex inheritance model for IAM policies. Understanding this model is crucial for effective access management.

The Resource Hierarchy

GCP organizes resources in a hierarchical structure:

Organization
├── Folder(s)
│   ├── Folder(s)
│   └── Project(s)
│       └── Resources (VMs, Buckets, etc.)
└── Project(s)
    └── Resources (VMs, Buckets, etc.)

Policy Inheritance Rules

1. Downward Inheritance

Policies set at higher levels (organization, folder) apply to lower levels (projects, resources)
Lower levels inherit all permissions from higher levels
This creates a cumulative effect of permissions

2. Additive Nature

CRITICAL: Allow policies are additive, not subtractive
If a user has a role at the organization level, they retain those permissions at all child resources
An allow policy at a lower level cannot remove a permission that was granted at a higher level

3. No Deny in Allow Policies

Allow policies cannot express an explicit deny
The only way to block an inherited permission is a separate IAM deny policy, which takes precedence over allow policies
Without a deny policy, a permission granted at a higher level stays in effect on every descendant
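
The additive behaviour of allow policies amounts to a set union along the path from a resource up to the organization. The sketch below models that under simplifying assumptions: the hierarchy, the grants, and the `effective_permissions` helper are hypothetical, and roles are flattened directly into permissions.

```python
# Hypothetical hierarchy: each node maps to its parent (None marks the root).
PARENT = {
    "organizations/acme": None,
    "folders/prod": "organizations/acme",
    "projects/shop": "folders/prod",
}

# Hypothetical grants: (node, member) -> permissions granted at that node.
GRANTS = {
    ("organizations/acme", "user:alice@example.com"): {"compute.instances.list"},
    ("projects/shop", "user:alice@example.com"): {"storage.objects.get"},
}


def effective_permissions(member, node):
    """Union the member's grants on the node and on every ancestor."""
    permissions = set()
    while node is not None:
        permissions |= GRANTS.get((node, member), set())
        node = PARENT[node]
    return permissions


print(sorted(effective_permissions("user:alice@example.com", "projects/shop")))
```

Nothing in the loop ever removes a permission, which is why a grant made at the organization level shows up on every descendant.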

Practical Implications

Scenario 1: Organization-Level Owner

Organization: alice@example.com → roles/owner
Project A: No specific policies
Project B: alice@example.com → roles/viewer

Result: Alice has Owner access to both Project A and Project B through inheritance. The Viewer role on Project B is redundant because Owner already includes every Viewer permission.

Scenario 2: Proper Role Assignment

Organization: alice@example.com → No roles
Project A: alice@example.com → roles/editor
Project B: alice@example.com → roles/viewer

Result: Alice has editor access to Project A and viewer access to Project B. No inheritance conflicts.

Managing Hierarchical Policies Effectively

1. Organization-Level Policies

Apply to all resources in the organization
Use for enterprise-wide roles (security teams, auditors)
Be very careful with permissions granted here
Consider using groups rather than individual users

2. Folder-Level Policies

Apply to all projects within the folder
Useful for departmental or business unit access
Good for shared services and cross-functional teams
Can only add to the access inherited from the organization, never restrict it, because allow policies are additive

3. Project-Level Policies

Most common level for role assignments
Use for team-based access control
Combine with groups for easier management
Monitor for inheritance conflicts

4. Resource-Level Policies

Most granular level of control
Use sparingly to avoid complexity
Effective for sensitive resources
Combine with conditions for contextual access

Policy Troubleshooting for Hierarchies

1. Identifying Inheritance Issues

Use IAM Policy Troubleshooter to trace permission sources
Check all levels of the hierarchy for conflicting policies
Understand which policies contribute to effective permissions

2. Planning Policy Changes

Consider impact across the entire hierarchy
Test changes in non-production environments
Document the intended inheritance behavior
Communicate changes to affected stakeholders

4. IAM Deny Policies: The Explicit Guardrails

While standard IAM policies are allow-only, IAM Deny Policies allow you to explicitly block permissions.

Precedence: Deny policies always override allow policies. If a user is granted Owner at the project level but is Denied at the organization level, they cannot perform the action.
Use Case: “Prevent anyone (even admins) from deleting production storage buckets” or “Block access to all GCP services for a contractor after their contract ends, regardless of project-level permissions.”
Inheritance: Deny policies follow the same inheritance rules as allow policies (Organization -> Folder -> Project).
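
The precedence rule can be sketched as a two-stage check: deny rules are consulted first, and allow grants only matter if no deny applies. This is a simplified model; the `evaluate` function and its return labels are hypothetical, and permission names are plain strings rather than the format real deny policies use.

```python
def evaluate(permission, allowed, denied):
    """Check deny rules first; consult allow grants only if no deny applies."""
    if permission in denied:
        return "DENIED_BY_DENY_POLICY"
    if permission in allowed:
        return "ALLOWED"
    return "NOT_GRANTED"


# Hypothetical inputs: an owner-like allow set on the project, one deny rule at the organization.
project_allow = {"storage.buckets.delete", "storage.buckets.get"}
org_deny = {"storage.buckets.delete"}

print(evaluate("storage.buckets.delete", project_allow, org_deny))
print(evaluate("storage.buckets.get", project_allow, org_deny))
```

The first call is blocked even though the allow set contains the permission, which matches the "even admins" use case described above.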

5. Workload Identity Federation: Multi-Cloud Identity

Workload Identity Federation is the modern way to connect AWS, Azure, or GitHub Actions to GCP without service account keys.

The OIDC Bridge: GCP trusts the external provider (e.g., AWS STS or GitHub OIDC).
Attribute Mapping: You map external attributes (like aws_role_arn or github_repo) to GCP principal identifiers.
Security: This eliminates the “Key Rotation” problem. The external workload receives a short-lived GCP token automatically.

AWS to GCP Example

Pool: Create a Workload Identity Pool for AWS.
Provider: Configure the AWS account ID as a trusted provider.
Binding: Allow an AWS IAM Role to impersonate a GCP Service Account.
CLI: Authenticate with the generated credential configuration file: gcloud auth login --cred-file=aws-credentials.json

6. Advanced Troubleshooting

Effective IAM troubleshooting requires understanding both the current state and potential impacts of changes. GCP provides powerful tools to help with both reactive troubleshooting and proactive validation.

5.1 The Policy Troubleshooter

The Policy Troubleshooter is the primary tool for diagnosing access issues. It analyzes the entire policy hierarchy to determine why access was granted or denied.

How Policy Troubleshooter Works

Input Requirements:
- Principal: The user or service account experiencing the issue
- Permission: The specific permission being checked (e.g., compute.instances.start)
- Resource: The specific resource where access is needed
Analysis Process:
- Examines all IAM policies in the resource hierarchy
- Evaluates direct bindings and conditional bindings
- Identifies which policies grant or deny the requested permission
- Provides detailed explanation of the access decision
Output Information:
- Whether access is granted or denied
- Which policies contributed to the decision
- Path through the resource hierarchy
- Any conditions that affected the outcome

Using Policy Troubleshooter Effectively

Command Line Interface

# Troubleshoot a specific access issue (the resource is given as a full resource name)
gcloud policy-troubleshoot iam \
    //compute.googleapis.com/projects/my-project/zones/us-central1-a/instances/my-instance \
    --principal-email="USER_EMAIL" \
    --permission="compute.instances.start"

Console Interface

Navigate to IAM section in Google Cloud Console
Select “Policy Troubleshooter” from the left navigation
Enter the principal, permission, and resource details
Review the detailed analysis and recommendations

Common Troubleshooting Scenarios

Scenario 1: Unexpected Access Granted

Problem: A user has access to a resource they shouldn’t have access to.
Solution: Use Policy Troubleshooter to identify where the permission was granted in the hierarchy.

Scenario 2: Expected Access Denied

Problem: A user cannot access a resource they should have access to.
Solution: Use Policy Troubleshooter to identify missing permissions or conflicting policies.

Scenario 3: Conditional Access Issues

Problem: Access works sometimes but not consistently.
Solution: Use Policy Troubleshooter to examine conditional policies and their evaluation.

5.2 The IAM Policy Simulator

The IAM Policy Simulator allows you to test policy changes before implementing them, preventing unintended access modifications.

Key Features of Policy Simulator

Predictive Analysis: Shows how proposed policies would affect existing access
Historical Replay: Analyzes past access attempts against proposed policies
Impact Assessment: Identifies which users/services would be affected by changes
Safety Net: Prevents disruptive policy changes

How Policy Simulator Works

Proposed Policy: Define the IAM policy changes you want to make
Historical Data: Simulator replays the last 90 days of access attempts
Comparison: Shows which access attempts would succeed or fail under the new policy
Reporting: Generates detailed reports of the policy change impact

Practical Use Case: Removing Editor Role

Situation: You need to remove roles/editor from a developer group but are concerned about breaking CI/CD pipelines.

Solution:

Prepare the new policy without the editor role
Run the policy through the simulator
Analyze the results to identify any blocked API calls
Adjust the policy or prepare for the impact before implementation
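
The replay idea behind these steps can be sketched as a comparison of past access attempts against the current and proposed grants. This is a toy model, not the simulator's actual algorithm: `would_break`, the log entries, and the permission sets are hypothetical illustration data.

```python
def would_break(access_log, current, proposed):
    """Return logged attempts that succeed today but would fail under the proposed grants."""
    broken = []
    for member, permission in access_log:
        allowed_now = permission in current.get(member, set())
        allowed_after = permission in proposed.get(member, set())
        if allowed_now and not allowed_after:
            broken.append((member, permission))
    return broken


# Hypothetical recent activity for a developer group that currently holds broad access.
ACCESS_LOG = [
    ("group:devs@example.com", "storage.objects.get"),
    ("group:devs@example.com", "run.services.update"),
]
CURRENT = {"group:devs@example.com": {"storage.objects.get", "run.services.update"}}
PROPOSED = {"group:devs@example.com": {"storage.objects.get"}}

print(would_break(ACCESS_LOG, CURRENT, PROPOSED))
```

Any entry in the result is a call the pipeline made recently that the new policy would block, so it is a candidate for an additional narrowly scoped role before the broad one is removed.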

Simulator Limitations

Only analyzes the last 90 days of access data
May not catch all edge cases or new access patterns
Requires sufficient historical data to be meaningful
Does not account for future access needs

5.3 IAM Recommender: The SRE’s Intelligence

The IAM Recommender is an automated tool that uses machine learning to help you apply the Principle of Least Privilege at scale. It produces recommendations; applying them remains your decision.

How it Works

Observation: Google monitors the actual permissions used by each principal over the last 90 days.
Comparison: It compares the granted roles with the used permissions.
Recommendation: If a user has roles/editor but only ever reads from GCS buckets, the Recommender suggests downgrading them to roles/storage.objectViewer.
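
The comparison step can be approximated with plain set arithmetic. The real Recommender uses machine learning and the full role catalogue, so treat this only as a sketch of the idea: `ROLE_PERMISSIONS` is a hypothetical, heavily abbreviated catalogue and `recommend` is a hypothetical helper.

```python
# Hypothetical, abbreviated catalogue: role -> permissions it contains.
ROLE_PERMISSIONS = {
    "roles/editor": {
        "storage.objects.get",
        "storage.objects.list",
        "storage.objects.create",
        "compute.instances.start",
    },
    "roles/storage.objectViewer": {"storage.objects.get", "storage.objects.list"},
    "roles/storage.objectCreator": {"storage.objects.create"},
}


def recommend(granted_role, used_permissions):
    """Suggest the smallest catalogued role that still covers everything the principal used."""
    granted_size = len(ROLE_PERMISSIONS[granted_role])
    candidates = [
        role
        for role, permissions in ROLE_PERMISSIONS.items()
        if used_permissions <= permissions and len(permissions) < granted_size
    ]
    return min(candidates, key=lambda role: len(ROLE_PERMISSIONS[role]), default=None)


print(recommend("roles/editor", {"storage.objects.get"}))
```

When no smaller role covers the observed usage, the function returns None, which corresponds to the Recommender making no suggestion.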

The Policy Analyzer

For complex scenarios, the Policy Analyzer helps you answer “Who can access this resource and how?” by performing a recursive expansion of all groups, service accounts, and conditional bindings across the hierarchy.
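
The recursive group expansion mentioned above can be sketched as a depth-first walk that guards against cycles. The membership table and the `expand` helper are hypothetical, and real analysis also has to consider service account impersonation and conditions, which this sketch ignores.

```python
# Hypothetical group membership, including a deliberate cycle between the two groups.
GROUP_MEMBERS = {
    "group:eng@example.com": ["group:backend@example.com", "user:alice@example.com"],
    "group:backend@example.com": ["user:bob@example.com", "group:eng@example.com"],
}


def expand(member, seen=None):
    """Recursively flatten nested groups into the set of non-group principals."""
    seen = set() if seen is None else seen
    if member in seen:
        return set()  # already visited: avoids infinite recursion on cycles
    seen.add(member)
    if member not in GROUP_MEMBERS:
        return {member}
    principals = set()
    for child in GROUP_MEMBERS[member]:
        principals |= expand(child, seen)
    return principals


print(sorted(expand("group:eng@example.com")))
```

Granting a role to the top-level group therefore reaches every principal in the flattened set, which is the question "who can access this resource?" is really asking.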

9. Organization Policies: The Guardrails of the Cloud

While IAM defines Identity-based access, Organization Policies provide Resource-based constraints. They are the “Constitutional Law” of your GCP environment.

9.1 IAM vs. Org Policy

IAM: “Can user Alice start a VM?” (Identity-centric)
Org Policy: “Can anyone create a VM with a Public IP in this folder?” (Resource-centric)

9.2 Critical Organization Constraints

A Principal Engineer should implement these foundational constraints to prevent security drift:

Constraint	Effect	Why it matters
`iam.disableServiceAccountKeyCreation`	Blocks JSON key generation.	Forces usage of Workload Identity.
`compute.vmExternalIpAccess`	Restricts which VMs may have external (public) IPs; denying all values blocks them entirely.	Prevents accidental internet exposure.
`gcp.resourceLocations`	Restricts resource creation to specific regions.	Ensures data sovereignty and compliance.
`iam.allowedPolicyMemberDomains`	Restricts IAM to only your corporate domain.	Prevents sharing data with personal Gmail accounts.

9.3 Enforcement and Inheritance

Dry Run Mode: You can test an Org Policy in “Dry Run” mode to see what would be blocked without actually stopping developers.
Hierarchical Overrides: Policies set at the Org level apply to everyone, but you can “exempt” specific folders or projects if necessary (use with caution).

6. Deep Dive: Custom Roles and Advanced Permission Management

While predefined roles are suitable for most use cases, production environments often require custom roles for specific compliance, security, or operational requirements. Understanding custom roles deeply is essential for advanced IAM management.

6.1 Understanding Permissions at the API Level

GCP permissions correspond directly to API methods, providing granular control over cloud resources.

Permission Structure

service.resource.action

Examples:

compute.instances.create - Create Compute Engine instances
storage.buckets.delete - Delete Cloud Storage buckets
pubsub.topics.publish - Publish messages to Pub/Sub topics
bigquery.datasets.update - Update BigQuery datasets
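
A small parser makes the three-part structure explicit. This sketch assumes every permission follows the service.resource.action form shown above; `Permission` and `parse_permission` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Permission(NamedTuple):
    service: str
    resource: str
    action: str


def parse_permission(name: str) -> Permission:
    """Split a permission such as 'compute.instances.create' into its three parts."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected service.resource.action, got {name!r}")
    return Permission(*parts)


parsed = parse_permission("storage.buckets.delete")
print(parsed.service, parsed.resource, parsed.action)
```

Splitting permissions this way is handy when reviewing a role, for example to group its permissions by service or to list every delete action it contains.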

Permission Categories

Read Permissions: get, list, getIamPolicy
Write Permissions: create, update, patch
Delete Permissions: delete
Special Permissions: setIamPolicy, testIamPermissions

6.2 Custom Role Creation and Management

Creating custom roles requires careful planning and ongoing maintenance to ensure they remain effective and secure.

Custom Role Components

1. Role Metadata

Title: Human-readable name for the role
Description: Explanation of the role’s purpose
Stage: Development stage (ALPHA, BETA, GA, DEPRECATED)

2. Permission Selection

includedPermissions: List of permissions granted by the role
Must be valid GCP permissions
Should follow least-privilege principle

3. Role Limits

Maximum of 3,000 permissions per custom role
Maximum of 300 custom roles per organization and 300 per project
Consider using predefined roles when possible

Custom Role YAML Template with Advanced Options

title: "Database Migration Operator"
description: "Can migrate data between databases but cannot access production data directly"
stage: "GA"
includedPermissions:
- storage.objects.get
- storage.objects.create
- bigquery.jobs.create
- bigquery.tables.getData
- cloudsql.instances.connect
- cloudsql.databases.get
etag: "AA=="

6.3 Advanced Custom Role Patterns

Pattern 1: Operational Roles

Roles designed for specific operational tasks without data access:

title: "Compute Maintenance Operator"
description: "Can manage Compute Engine instances for maintenance but cannot access data"
stage: "GA"
includedPermissions:
- compute.instances.start
- compute.instances.stop
- compute.instances.reset
- compute.instances.setMetadata
- compute.instances.setLabels
- compute.instances.addAccessConfig

Pattern 2: Deployment Roles

Roles for CI/CD systems with limited scope:

title: "Deployment Service Account"
description: "Can deploy applications but cannot modify IAM or billing"
stage: "GA"
includedPermissions:
- run.services.update
- run.services.get
- cloudfunctions.functions.sourceCodeSet
- cloudfunctions.functions.update
- storage.objects.get
- storage.objects.create

Pattern 3: Auditing Roles

Roles for compliance and auditing without operational capabilities:

title: "Security Auditor"
description: "Can audit security configurations but cannot modify resources"
stage: "GA"
includedPermissions:
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
- iam.roles.get
- iam.serviceAccounts.get
- logging.logs.list
- monitoring.metricDescriptors.list

6.4 Custom Role Maintenance and Evolution

1. Permission Drift Management

Challenge: New GCP features introduce new permissions
Solution: Regular review of custom roles against service updates
Best Practice: Subscribe to GCP release notes and feature announcements

2. Version Control for Custom Roles

Store custom role definitions in version control
Track changes and approvals for role modifications
Maintain backup versions for rollback capability
Document the business justification for each permission

3. Automated Role Validation

Implement CI/CD pipelines for custom role management
Test role permissions in non-production environments
Validate role effectiveness before production deployment
Monitor for unused or overly broad permissions

6.5 Custom Role Security Considerations

1. Privilege Escalation Prevention

Review permissions for potential escalation paths
Avoid granting both read and write permissions when only one is needed
Consider the combination of permissions and their potential misuse
Regular security reviews of custom role assignments
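
One way to support these reviews is a small lint step that flags permissions commonly associated with privilege escalation. The list below is illustrative and far from exhaustive, and `risky_permissions` plus the sample role are hypothetical; the role dictionary mirrors the YAML layout used in this chapter.

```python
# Illustrative, non-exhaustive set of permissions that can enable privilege escalation.
RISKY_EXACT = {
    "iam.serviceAccounts.actAs",
    "iam.serviceAccounts.getAccessToken",
    "iam.serviceAccountKeys.create",
    "iam.roles.update",
}


def risky_permissions(role):
    """Return the role's permissions that deserve a closer security review, sorted by name."""
    return sorted(
        permission
        for permission in role["includedPermissions"]
        if permission in RISKY_EXACT or permission.endswith(".setIamPolicy")
    )


deploy_role = {
    "title": "Deployment Service Account",
    "includedPermissions": [
        "run.services.update",
        "iam.serviceAccounts.actAs",
        "resourcemanager.projects.setIamPolicy",
    ],
}

print(risky_permissions(deploy_role))
```

A check like this could run in the same pipeline that validates role definitions, failing the build or requesting an extra approval when the list is not empty.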

2. Principle of Least Privilege

Grant only the minimum permissions required for the task
Regularly audit and remove unused permissions
Use predefined roles when possible instead of custom roles
Monitor access patterns and adjust permissions accordingly

3. Separation of Duties

Create separate roles for different functional responsibilities
Prevent any single role from having excessive authority
Implement checks and balances through role separation
Document role responsibilities and access patterns

7. Workload Identity: Securing Service-to-Service Communication

Workload Identity is GCP’s recommended approach for authenticating workloads running on Kubernetes Engine (GKE) or other platforms to Google Cloud services. It eliminates the need for service account key files, addressing a significant security concern.

7.1 Understanding Workload Identity

The Traditional Problem

Applications in containers needed service account key files to authenticate to GCP services
Key files were stored in containers, creating security vulnerabilities
Key rotation was manual and often neglected
Compromised containers could expose key files and provide unauthorized access

Workload Identity Solution

Establishes trust relationship between Kubernetes service accounts and GCP service accounts
Uses Kubernetes native authentication mechanisms
Eliminates need for service account key files
Provides automatic credential exchange and refresh

7.2 Workload Identity Architecture

Components

Kubernetes Service Account (KSA): Identity for pods within a GKE cluster
GCP Service Account (GSA): Identity for accessing GCP resources
Workload Identity Pool: Container for the trust relationship
Workload Identity Provider: Authenticates Kubernetes workloads

Trust Flow

Pod authenticates to GKE using Kubernetes service account
GKE validates the workload against the trust relationship
Temporary credentials are exchanged for the GCP service account
Application accesses GCP resources using these credentials

7.3 Workload Identity Configuration

Step 1: Enable Workload Identity on the Cluster

gcloud container clusters update CLUSTER_NAME \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --zone=COMPUTE_ZONE \
    --project=PROJECT_ID

Step 2: Create or Identify GCP Service Account

gcloud iam service-accounts create my-workload-sa \
    --project=PROJECT_ID

Step 3: Bind GCP Service Account to Kubernetes Service Account

gcloud iam service-accounts add-iam-policy-binding \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[K8S_NAMESPACE/K8S_SERVICE_ACCOUNT_NAME]" \
    --project=PROJECT_ID \
    my-workload-sa@PROJECT_ID.iam.gserviceaccount.com

Step 4: Annotate Kubernetes Service Account

kubectl annotate serviceaccount \
    --namespace K8S_NAMESPACE \
    K8S_SERVICE_ACCOUNT_NAME \
    iam.gke.io/gcp-service-account=my-workload-sa@PROJECT_ID.iam.gserviceaccount.com
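
The identifiers used in Steps 3 and 4 follow fixed string formats, and typos in them are a common cause of failed bindings. The helpers below simply assemble those strings from their parts; the function names and the sample values are hypothetical.

```python
def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """Build the --member value used in Step 3 for a Kubernetes service account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def gsa_email(name: str, project_id: str) -> str:
    """Build the GCP service account email used in Steps 3 and 4."""
    return f"{name}@{project_id}.iam.gserviceaccount.com"


# Hypothetical values for illustration.
print(workload_identity_member("demo-project", "payments", "payments-ksa"))
print(gsa_email("my-workload-sa", "demo-project"))
```

Generating both values from the same project ID, namespace, and account names helps keep the IAM binding and the Kubernetes annotation consistent with each other.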

7.4 Workload Identity Federation (Beyond GKE)

Workload Identity Federation extends the concept beyond GKE to external identity providers.

Supported Providers

GitHub Actions
AWS
Azure
SAML 2.0 identity providers
OIDC providers

Federation Configuration

Create a workload identity pool
Create a workload identity provider
Configure trust relationship with external provider
Create or identify target GCP service account
Grant necessary IAM permissions to the service account

7.5 Best Practices for Workload Identity

1. Namespace and Service Account Design

Use dedicated namespaces for different applications/environments
Follow consistent naming conventions for service accounts
Implement least-privilege access for each workload
Separate production and non-production workloads

2. Monitoring and Auditing

Enable audit logging for workload identity exchanges
Monitor for unusual authentication patterns
Alert on failed authentication attempts
Track credential usage and rotation

3. Security Considerations

Regularly review and update trust relationships
Monitor for unauthorized workload identity usage
Implement proper namespace isolation
Secure Kubernetes cluster configuration

8. IAM Design Patterns and Anti-Patterns

Effective IAM implementation follows proven design patterns while avoiding common anti-patterns that lead to security vulnerabilities and management complexity.

8.1 IAM Design Patterns

Pattern 1: Role-Based Access Control (RBAC) with Groups

Problem: Managing permissions for individual users becomes unwieldy.
Solution: Create groups based on job functions and assign roles to groups.
Implementation:

Create one group per job function
Assign roles to groups rather than individual users
Add/remove users from groups as their roles change

Pattern 2: Environment-Based Access Control

Problem: Same users need different access levels across dev/staging/prod.
Solution: Use folders to organize environments and apply different policies.
Implementation:

Organization
├── Development Folder
│   └── Development Projects
├── Staging Folder  
│   └── Staging Projects
└── Production Folder
    └── Production Projects
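
One way to apply different policies per environment is to bind the same group to a broader role on the Development folder and a read-only role on the Production folder. The sketch below assumes hypothetical folder IDs and a hypothetical group address (dev-team@example.com); the two roles are examples only. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder values; replace with your own.
DEV_FOLDER_ID="111111111111"
PROD_FOLDER_ID="222222222222"
TEAM_GROUP="dev-team@example.com"

# Print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

bind_environment_access() {
  # Development folder: the team can manage instances in every project below it.
  run gcloud resource-manager folders add-iam-policy-binding "$DEV_FOLDER_ID" \
    --member="group:${TEAM_GROUP}" \
    --role="roles/compute.instanceAdmin.v1"

  # Production folder: the same team can only view Compute Engine resources.
  run gcloud resource-manager folders add-iam-policy-binding "$PROD_FOLDER_ID" \
    --member="group:${TEAM_GROUP}" \
    --role="roles/compute.viewer"
}

bind_environment_access
```

Because bindings on a folder are inherited by every project under it, new projects pick up the right access as soon as they are created in the correct folder.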

Pattern 3: Resource Tagging for Access Control

Problem: Need to control access based on resource characteristics.
Solution: Use Resource Manager tags combined with IAM Conditions. (IAM Conditions can evaluate tags, but not resource labels.)
Implementation:

Tag resources: environment=prod, team=payment, confidentiality=high
Use conditions to enforce access based on tags
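
A conditional binding of this kind can be sketched as follows. IAM Conditions match Resource Manager tags through resource.matchTag, so the sketch assumes the environment attribute has been created as a tag key under a hypothetical organization ID (123456789012) with a prod value, and that the group address and project ID are placeholders. The condition is written to a file because the expression contains a comma, which is awkward to pass inline. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder values; replace with your own.
PROJECT_ID="my-project"
ORG_ID="123456789012"
TEAM_GROUP="payment-team@example.com"

# Print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Write the IAM condition (YAML) to the given path.
write_condition_file() {
  local path="$1"
  cat > "$path" <<EOF
title: prod-tagged-only
description: Applies only to resources tagged environment=prod
expression: "resource.matchTag('${ORG_ID}/environment', 'prod')"
EOF
}

grant_tagged_access() {
  local condition_file
  condition_file="$(mktemp)"
  write_condition_file "$condition_file"

  # The role only takes effect on resources carrying the matching tag.
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="group:${TEAM_GROUP}" \
    --role="roles/compute.instanceAdmin.v1" \
    --condition-from-file="$condition_file"
}

grant_tagged_access
```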

Pattern 4: Segmented Administration

Problem: Preventing any single administrator from having complete control.
Solution: Split administrative responsibilities across multiple roles.
Implementation:

Network administrator: roles/compute.networkAdmin
Security administrator: roles/iam.securityAdmin
Billing administrator: roles/billing.admin

8.2 IAM Anti-Patterns to Avoid

Anti-Pattern 1: Overuse of Primitive Roles

Problem: Using Owner/Editor/Viewer roles in production.
Risk: Excessive permissions violate the least-privilege principle.
Solution: Use predefined roles or create custom roles with minimal permissions.

Anti-Pattern 2: Direct User-to-Resource Binding

Problem: Granting roles directly to individual users instead of groups.
Risk: Difficult to manage and maintain as the team grows.
Solution: Always use groups for role assignments.

Anti-Pattern 3: Overlapping Administrative Boundaries

Problem: Same person has both development and security administration.
Risk: Potential for security bypass or conflict of interest.
Solution: Separate duties and implement checks and balances.

Anti-Pattern 4: Inadequate Key Management

Problem: Poor management of service account keys.
Risk: Compromised keys provide persistent access to resources.
Solution: Minimize key use, implement rotation, use Workload Identity.
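
A first step toward minimizing key use is finding which service accounts still have user-managed keys. The sketch below lists them for a set of accounts; the project ID and the two service account names are hypothetical placeholders. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder value; replace with your own.
PROJECT_ID="my-project"

# Print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# List user-managed (exported) keys for each service account passed in.
audit_user_managed_keys() {
  local sa
  for sa in "$@"; do
    run gcloud iam service-accounts keys list \
      --iam-account="$sa" \
      --managed-by="user" \
      --format="table(name,validAfterTime,validBeforeTime)"
  done
}

audit_user_managed_keys \
  "app-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  "ci-sa@${PROJECT_ID}.iam.gserviceaccount.com"
```

In a real audit the account list would typically come from gcloud iam service-accounts list --format="value(email)" rather than being hard-coded.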

8.3 Common Implementation Scenarios

Scenario 1: Multi-Team Development Environment

Requirements:

Multiple development teams working on different projects
Shared services (CI/CD, monitoring, security)
Isolated development environments
Controlled production access

Implementation:

Organize projects by team/product
Create team-specific groups with appropriate access
Implement shared service accounts for common infrastructure
Use folders to group related projects
Implement production access controls and approvals

Scenario 2: Regulatory Compliance Environment

Requirements:

Segregation of duties
Detailed audit trails
Restricted data access
Regular access reviews

Implementation:

Implement role separation for different functions
Use audit logging for all access attempts
Implement data classification and access controls
Schedule regular access certification reviews
Use automated tools for compliance monitoring

Lab: Comprehensive IAM Implementation Exercise

This lab will walk you through implementing a complete IAM strategy for a fictional company with multiple teams and environments.

Scenario: Acme Corp GCP Implementation

Acme Corp is migrating their applications to GCP with the following requirements:

3 development teams (Frontend, Backend, Data)
3 environments (Development, Staging, Production)
Security team that monitors all access
Separate billing for each team/environment
Need to implement least-privilege access

Step 1: Organize the Resource Hierarchy

First, create the folder structure to organize resources:

# Create environment folders
gcloud resource-manager folders create \
    --display-name="Development" \
    --organization=ORGANIZATION_ID

gcloud resource-manager folders create \
    --display-name="Staging" \
    --organization=ORGANIZATION_ID

gcloud resource-manager folders create \
    --display-name="Production" \
    --organization=ORGANIZATION_ID

Step 2: Create Groups for Teams

Create Google Groups for each team and environment:

# Create team groups (assuming you have domain admin access)
# [email protected]
# [email protected]
# [email protected]
# [email protected]

Step 3: Create Service Accounts for Applications

Create service accounts for each application:

# Frontend application service account
gcloud iam service-accounts create frontend-app \
    --description="Frontend application service account" \
    --display-name="Frontend App SA"

# Backend application service account
gcloud iam service-accounts create backend-app \
    --description="Backend application service account" \
    --display-name="Backend App SA"

# Data processing service account
gcloud iam service-accounts create data-processor \
    --description="Data processing service account" \
    --display-name="Data Processor SA"

Step 4: Implement Least-Privilege Access

Assign roles based on the principle of least privilege:

# Frontend team gets access to their resources only
gcloud projects add-iam-policy-binding dev-frontend-project \
    --member="group:[email protected]" \
    --role="roles/compute.instanceAdmin.v1"

# Backend team gets access to their resources
gcloud projects add-iam-policy-binding dev-backend-project \
    --member="group:[email protected]" \
    --role="roles/cloudsql.editor"

# Data team gets access to their resources
gcloud projects add-iam-policy-binding dev-data-project \
    --member="group:[email protected]" \
    --role="roles/bigquery.dataEditor"

Step 5: Configure Service Account Access

Set up service account access for applications:

# Frontend app needs to write logs
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:frontend-app@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/logging.logWriter"

# Backend app needs database access
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:backend-app@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/cloudsql.client"

# Data processor needs to read/write data
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:data-processor@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

Step 6: Implement Security Monitoring

Configure security team access for monitoring:

# Security team gets broad monitoring access
gcloud organizations add-iam-policy-binding ORGANIZATION_ID \
    --member="group:[email protected]" \
    --role="roles/logging.privateLogViewer"

gcloud organizations add-iam-policy-binding ORGANIZATION_ID \
    --member="group:[email protected]" \
    --role="roles/iam.securityReviewer"

Step 7: Verify the Implementation

Test the implementation to ensure it works as expected:

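# Verify which roles a team group holds on its project
# (GROUP_EMAIL is a placeholder for the group's address)
gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:group:GROUP_EMAIL" \
    --format="table(bindings.role)"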

Step 8: Set Up Monitoring and Alerting

Enable audit logging and set up monitoring:

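# Route audit log entries at WARNING severity or higher to a Pub/Sub topic
# (the topic must already exist; Data Access logs are enabled separately in the audit configuration)
gcloud logging sinks create security-alerts \
    pubsub.googleapis.com/projects/PROJECT_ID/topics/security-notifications \
    --log-filter='logName:"cloudaudit.googleapis.com" AND severity>=WARNING'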
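
A log sink writes through its own writer identity, which needs permission to publish to the destination topic before any entries arrive. The sketch below grants that permission; it assumes the sink and topic names used in this step and a hypothetical project ID. In dry-run mode (the default) a placeholder identity is printed instead of querying the real sink.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder value; replace with your own.
PROJECT_ID="my-project"
SINK_NAME="security-alerts"
TOPIC="security-notifications"

# Print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Grant the sink's writer identity permission to publish to the topic.
# The identity is passed in already prefixed with "serviceAccount:".
grant_sink_publisher() {
  local writer_identity="$1"
  run gcloud pubsub topics add-iam-policy-binding "$TOPIC" \
    --project="$PROJECT_ID" \
    --member="$writer_identity" \
    --role="roles/pubsub.publisher"
}

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Placeholder so the dry run has something to print.
  grant_sink_publisher "serviceAccount:SINK_WRITER_IDENTITY"
else
  # Look up the real writer identity from the sink.
  grant_sink_publisher "$(gcloud logging sinks describe "$SINK_NAME" \
    --project="$PROJECT_ID" --format='value(writerIdentity)')"
fi
```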

SRE Best Practices Checklist

Security Posture

Audit Logs: Enable “Data Access” logs for critical services to see exactly who accessed your data
IAM Recommender: Check this weekly. Google will suggest removing unused permissions based on actual usage over the last 90 days
Groups over Users: Never grant a role directly to an individual user's email address. Use Google Groups
Service Account Keys: Hunt them down and delete them. Use Workload Identity instead
Conditional Access: Implement time-based and IP-based conditions for sensitive access
Separation of Duties: Ensure no single user has complete control over critical systems
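
The weekly IAM Recommender check can be scripted. The sketch below lists the current IAM role recommendations for one project; the project ID is a hypothetical placeholder and the output columns are just one reasonable choice. The command is printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder value; replace with your own.
PROJECT_ID="my-project"

# Print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# List IAM role recommendations (for example, unused roles that could be removed).
list_iam_recommendations() {
  local project="$1"
  run gcloud recommender recommendations list \
    --project="$project" \
    --location="global" \
    --recommender="google.iam.policy.Recommender" \
    --format="table(description,stateInfo.state,lastRefreshTime)"
}

list_iam_recommendations "$PROJECT_ID"
```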

Operational Excellence

Policy Simulations: Test policy changes before applying them using the IAM Policy Simulator
Resource Tags: Use consistent tagging to enable conditional access based on resource attributes (IAM Conditions evaluate tags, not labels)
Regular Reviews: Conduct quarterly access reviews to remove unnecessary permissions
Documentation: Maintain clear documentation of roles, responsibilities, and access patterns
Training: Ensure team members understand IAM concepts and best practices

Performance and Efficiency

Role Optimization: Use predefined roles when possible; create custom roles only when necessary
Hierarchical Design: Organize resources to minimize the number of policies needed
Condition Complexity: Balance security needs with performance impact of complex conditions
Monitoring Overhead: Configure monitoring appropriately without excessive noise

In the next chapter, we dive into the “Vines” of the cloud—GCP Networking and the Global VPC.

Interview Preparation

Q1: What is the Principle of Least Privilege (PoLP) and how do you implement it in GCP IAM?

Answer: PoLP is the security concept of providing a user or service only the minimum permissions required to perform their job.
Implementation in GCP:

Custom Roles: Instead of roles/editor, create a custom role with only the specific permissions needed (e.g., compute.instances.start).
Granular Scopes: Use predefined roles that are service-specific (e.g., roles/storage.objectViewer instead of roles/storage.admin).
IAM Recommender: Use this tool to identify over-privileged principals based on actual usage, then review and apply its suggestions to replace broad roles with narrower ones.
IAM Conditions: Grant permissions that are limited by time, resource name, or request context (like IP address).

Q2: Explain the difference between a Google Service Account (GSA) and a Kubernetes Service Account (KSA). How does Workload Identity bridge them?

Answer:

GSA: An IAM identity in GCP used by services (VMs, Cloud Run) to access other GCP resources. It uses JSON keys or the metadata server.
KSA: An identity inside a Kubernetes cluster used by pods to talk to the K8s API. It has no inherent permissions in GCP.
Workload Identity: It maps a KSA to a GSA. When a pod uses that KSA, the GKE metadata server returns a short-lived token for the linked GSA, allowing the pod to access GCP resources (like Cloud Storage) without an exported JSON key that could leak.

Q3: You are asked to investigate an 'Access Denied' error for a user trying to start a VM. How do you troubleshoot this?

Answer: I follow a systematic process:

Policy Troubleshooter: Input the user’s email, the permission (compute.instances.start), and the resource URL. This tool checks the entire hierarchy (Org, Folder, Project) to see which policy is blocking access.
Check Inheritance: Remember that allow policies are additive, so a role granted at a high level cannot be taken away by an allow policy at a lower level. Also check whether an IAM deny policy or an unmet IAM Condition applies to the request.
Verify Identity: Ensure the user is authenticated with the correct account (especially if using multiple accounts in the same browser).
Cloud Audit Logs: Check the logs to see the raw denied event, which often includes details about the specific resource or condition that failed.
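
The Policy Troubleshooter step can also be run from the command line. The sketch below checks whether a user holds compute.instances.start on one instance; the project, zone, instance name, and user address are hypothetical placeholders. The command is printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder values; replace with your own.
PROJECT_ID="my-project"
ZONE="us-central1-a"
INSTANCE="web-1"
USER_EMAIL="alice@example.com"

# Print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Ask Policy Troubleshooter why a principal does or does not hold a permission
# on a specific resource, identified by its full resource name.
troubleshoot_vm_start() {
  local user="$1"
  run gcloud policy-troubleshoot iam \
    "//compute.googleapis.com/projects/${PROJECT_ID}/zones/${ZONE}/instances/${INSTANCE}" \
    --principal-email="$user" \
    --permission="compute.instances.start"
}

troubleshoot_vm_start "$USER_EMAIL"
```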

Q4: Why are Service Account Keys considered a security risk, and what are the alternatives?

Answer: JSON keys are long-lived (by default they do not expire) and portable. If pushed to GitHub or stolen from a dev machine, they give an attacker access until the key is manually revoked.
Alternatives:

Workload Identity Federation: For GitHub Actions, GitLab, or AWS, use OIDC tokens to exchange for temporary GCP tokens.
Instance Service Accounts: For VMs, use the attached service account and the metadata server.
Short-lived Tokens: Use service account impersonation, for example gcloud auth print-access-token --impersonate-service-account=SA_EMAIL, to obtain temporary credentials.
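
A short-lived token obtained through impersonation can be sketched as follows. The caller needs the Service Account Token Creator role on the target account first; the project ID, service account name, and user address are hypothetical placeholders. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder values; replace with your own.
PROJECT_ID="my-project"
TARGET_SA="deploy-sa@${PROJECT_ID}.iam.gserviceaccount.com"
CALLER="user:alice@example.com"

# Print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

impersonate_for_token() {
  # One-time setup: allow the caller to mint tokens for the service account.
  run gcloud iam service-accounts add-iam-policy-binding "$TARGET_SA" \
    --project="$PROJECT_ID" \
    --member="$CALLER" \
    --role="roles/iam.serviceAccountTokenCreator"

  # Request a short-lived access token for the service account; no key file is involved.
  run gcloud auth print-access-token \
    --impersonate-service-account="$TARGET_SA"
}

impersonate_for_token
```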

Q5: What is the 'Additive' nature of IAM permissions in GCP? Can you deny a permission inherited from a folder?

Answer: Not with an allow policy. Allow policies are additive: if a role is granted at a higher level (e.g., Folder), it is inherited at all lower levels (e.g., Project), and an allow policy at a lower level cannot remove it.
Workaround: To block an inherited permission, attach an IAM deny policy, which is evaluated before allow policies and overrides them. IAM Conditions can also narrow a grant based on attributes, and VPC Service Controls can block access to services from outside a perimeter regardless of IAM permissions.
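
For cases where an inherited grant must be blocked outright, IAM deny policies are one option. The sketch below attaches a deny policy to a folder that blocks project deletion for everyone except an administrators group. The folder ID, policy ID, and group address are hypothetical placeholders. The command is printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholder values; replace with your own.
FOLDER_ID="222222222222"
POLICY_ID="block-project-deletion"
ADMIN_GROUP="platform-admins@example.com"

# Print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Write the deny policy (JSON) to the given path.
write_deny_policy() {
  local path="$1"
  cat > "$path" <<EOF
{
  "displayName": "Block project deletion",
  "rules": [
    {
      "denyRule": {
        "deniedPrincipals": ["principalSet://goog/public:all"],
        "exceptionPrincipals": ["principalSet://goog/group/${ADMIN_GROUP}"],
        "deniedPermissions": ["cloudresourcemanager.googleapis.com/projects.delete"]
      }
    }
  ]
}
EOF
}

create_deny_policy() {
  local policy_file
  policy_file="$(mktemp)"
  write_deny_policy "$policy_file"

  # Attach the deny policy to the folder; it applies to every project below it.
  run gcloud iam policies create "$POLICY_ID" \
    --attachment-point="cloudresourcemanager.googleapis.com/folders/${FOLDER_ID}" \
    --kind="denypolicies" \
    --policy-file="$policy_file"
}

create_deny_policy
```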
