Chapter 2: Identity and Access Management (IAM)

In the cloud, Identity is the new perimeter. Gone are the days when a firewall was enough to protect your data. In GCP, Identity and Access Management (IAM) is the central system that manages permissions across every single service. Understanding IAM is crucial for maintaining security and compliance in your cloud environment.

0. From Scratch: What is IAM?

Before diving into the specifics, let’s establish what IAM is and why it’s fundamentally different from traditional security models.

Traditional Security Model vs. Cloud IAM

In on-premises environments, security was primarily perimeter-based:
  • Physical security controlled access to buildings
  • Network firewalls protected internal resources
  • Active Directory managed user identities
  • Local admin accounts had privileged access to systems
In contrast, cloud IAM is identity-centric and API-driven:
  • There is no traditional network perimeter
  • Every resource access is authenticated and authorized through APIs
  • Identities can be humans, services, or external systems
  • Permissions are granted explicitly through policies
  • Security is enforced at the resource level, not the network level
This shift means that IAM becomes the primary security control in the cloud, making it essential to understand deeply.

1. The Core Philosophy: Who, What, and Where

The IAM model in GCP is a simple but powerful relationship: Who (Identity) + What (Role) + Where (Resource). This relationship forms the basis of every access decision in GCP. Understanding this triangle is fundamental to implementing proper security.

The IAM Triangle Explained

  1. WHO (Identity Principal):
    • The entity requesting access (human or service)
    • Must be authenticated before authorization occurs
    • Examples: users, service accounts, groups
  2. WHAT (Role/Permissions):
    • The specific actions the principal is allowed to perform
    • Granular to the API call level (e.g., compute.instances.start)
    • Roles bundle related permissions together
  3. WHERE (Resource Hierarchy):
    • The specific GCP resource the action applies to
    • Follows GCP’s resource hierarchy (Organization → Folder → Project → Resource)
    • Policies are attached to resources and inherited downward

Identities: The “Who”

Identities represent the entities that can access GCP resources. Each identity type serves different purposes:

1. Google Accounts (Individual Users)

  • Personal Gmail accounts (addresses ending in @gmail.com)
  • Cloud Identity or Google Workspace accounts (addresses on your organization’s domain)
  • Used for human users accessing GCP resources
  • Can authenticate via password, 2-factor authentication, or SSO

2. Service Accounts (Non-human Identities)

  • Identities for applications, virtual machines, and services
  • Use cryptographic keys instead of passwords
  • Represent the “application” or “service” itself
  • Critical for automation and programmatic access

3. Google Groups

  • Collections of users, service accounts, or other groups
  • Best Practice: Always assign roles to groups, not individual users
  • Simplifies management and enables team-based access control
  • Can be Google Workspace groups or Cloud Identity groups

4. Cloud Identity / Google Workspace

  • Corporate directory integration
  • Enables Single Sign-On (SSO) with existing corporate credentials
  • Maintains separation between personal and corporate accounts

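5. Google Workspace or Cloud Identity Domains

  • The domain: principal type grants access to every account in a Cloud Identity or Google Workspace domain
  • Used for enterprise applications that need broad access
  • Requires careful consideration due to scope
  • Not to be confused with domain-wide delegation, a Google Workspace feature that lets a service account act on behalf of users in the domain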

6. allAuthenticatedUsers and allUsers

  • allAuthenticatedUsers: Any authenticated Google user
  • allUsers: Anyone on the internet (including unauthenticated)
  • Critical Warning: Use these with extreme caution
  • Typically reserved for public resources like Cloud Storage buckets serving static websites

Roles: The “What”

Roles define what actions an identity can perform on resources. GCP offers multiple role types with different scopes and management approaches:

Primitive Roles (Legacy - Do Not Use in Production)

  • Owner: Full access to all resources and permissions to grant access to others
  • Editor: Can modify resources but cannot grant access to others
  • Viewer: Can view resources but cannot modify them
  • Warning: These roles (called basic roles in Google’s current documentation) grant extremely broad permissions
  • Never use in production - they violate the principle of least privilege

Predefined Roles (Recommended)

  • Created and maintained by Google
  • Regularly updated to include new permissions as services evolve
  • Granular and specific to service functions
  • Examples: roles/compute.instanceAdmin.v1, roles/storage.objectViewer
  • Well-tested and security-reviewed by Google

Custom Roles (Production Use Cases)

  • Created by administrators for specific organizational needs
  • Allow precise permission control beyond predefined roles
  • Must be manually maintained as services evolve
  • Useful for compliance requirements or unique business needs
  • Cannot be updated automatically when new permissions are added

The Permission Hierarchy

Understanding the relationship between permissions, roles, and policies is crucial:
  • Permissions: Individual actions (e.g., compute.instances.create)
  • Roles: Collections of related permissions (e.g., roles/compute.instanceAdmin)
  • Policies: Bindings between identities and roles for specific resources
Each permission corresponds to a specific API call, ensuring granular control over cloud resources.
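
The relationship between permissions, roles, and policy bindings can be sketched in a few lines of Python. This is a simplified model rather than Google's implementation: the ROLES table is a hand-trimmed subset of two predefined roles, and the Binding class, the is_allowed helper, the group address, and the project name are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Hypothetical, trimmed-down role definitions: role name -> permissions.
ROLES = {
    "roles/storage.objectViewer": {"storage.objects.get", "storage.objects.list"},
    "roles/compute.viewer": {"compute.instances.get", "compute.instances.list"},
}


@dataclass(frozen=True)
class Binding:
    member: str    # WHO
    role: str      # WHAT
    resource: str  # WHERE


def is_allowed(bindings, member, permission, resource):
    """True if any binding on this resource grants a role containing the permission."""
    return any(
        b.member == member
        and b.resource == resource
        and permission in ROLES.get(b.role, set())
        for b in bindings
    )


policy = [
    Binding("group:data-readers@example.com", "roles/storage.objectViewer", "projects/demo"),
]

# Reading is covered by the role; deleting is not.
print(is_allowed(policy, "group:data-readers@example.com", "storage.objects.get", "projects/demo"))
print(is_allowed(policy, "group:data-readers@example.com", "storage.objects.delete", "projects/demo"))
```

The sketch ignores inheritance and conditions on purpose; it only shows that an access check reduces to "is this permission inside a role that is bound to this principal on this resource".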

2. Service Accounts: The Powerhouse of Automation

Service accounts are identities for applications and services, not humans. They are fundamental to cloud security and automation, requiring special attention and protection.

Understanding Service Accounts Deeply

What Makes Service Accounts Different from User Accounts?

  • Authentication Method: Use private keys/certificates instead of passwords
  • Multi-Factor Authentication: Do not support MFA/2FA (by design)
  • Identity Type: Represent applications/services, not humans
  • Management: Controlled programmatically, not through user interfaces
  • Scope: Can act on behalf of the application across various resources

Service Account Structure

A service account has multiple identifiers:
  • Email Address: SA_NAME@PROJECT_ID.iam.gserviceaccount.com
  • Unique ID: Numeric identifier (e.g., 123456789012345678901)
  • Display Name: Human-readable name for identification
  • Description: Additional context about the service account’s purpose

Service Account Keys (Credentials)

  • JSON Key Files: Contain private keys for authenticating service accounts
  • High Risk: If compromised, provide full access to the service account’s permissions
  • Best Practice: Minimize use and rotate regularly
  • Alternative: Use Workload Identity Federation to eliminate key files

The Service Account User vs Actor Distinction

This is one of the most commonly misunderstood aspects of GCP IAM:

Service Account Identity

  • The service account itself, identified by its email address (SA_NAME@PROJECT_ID.iam.gserviceaccount.com)
  • Has specific roles and permissions granted to it
  • Represents the application’s identity in GCP

Service Account Actor

  • The human or service that wants to “act as” the service account
  • Must have the iam.serviceAccounts.actAs permission on the service account
  • This creates a two-step authorization process
Example Scenario:
  • A service account (the application’s identity) has the Storage Object Viewer role
  • A developer wants to deploy an application that uses this service account
  • The developer must have the iam.serviceAccounts.actAs permission on the service account (this permission is included in roles/iam.serviceAccountUser)
  • Without this permission, the developer cannot deploy the application with the service account
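
The two-step authorization in this scenario can be expressed as a small check. The sketch below is a toy model under stated assumptions: the function name and the permission sets are hypothetical, and run.services.create stands in for whatever deploy permission the target service actually requires.

```python
ACT_AS = "iam.serviceAccounts.actAs"


def can_deploy_with_service_account(project_permissions, sa_permissions, deploy_permission):
    """Both checks must pass: the deploy permission on the project
    and actAs on the service account being attached."""
    return deploy_permission in project_permissions and ACT_AS in sa_permissions


# Hypothetical developer who may create Cloud Run services in the project.
project_perms = {"run.services.create"}

# Step 1 passes, step 2 fails: no actAs on the service account.
print(can_deploy_with_service_account(project_perms, set(), "run.services.create"))

# Both steps pass once actAs is granted on the service account.
print(can_deploy_with_service_account(project_perms, {ACT_AS}, "run.services.create"))
```

Keeping the two permission sets separate mirrors the real model: the deploy permission is granted on the project, while actAs is granted on the service account resource itself.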

Workload Identity Federation (Modern Authentication)

Workload Identity Federation eliminates the need for service account key files, addressing a major security concern.

The Problem with Service Account Keys

  • Storage Risk: Keys stored in files can be accidentally committed to repositories
  • Distribution Risk: Keys must be securely distributed to all environments
  • Rotation Risk: Manual rotation is complex and often neglected
  • Compromise Impact: Stolen keys provide persistent access to resources

How Workload Identity Federation Works

  1. External Identity Provider: GitHub Actions, AWS, Azure, or other OIDC providers
  2. OIDC Token Exchange: External provider issues time-limited OIDC tokens
  3. GCP Trust Relationship: Configure trust between external provider and GCP service account
  4. Token Validation: GCP validates the OIDC token against the trust relationship
  5. Temporary Credentials: GCP provides temporary credentials for the service account

Benefits of Workload Identity Federation

  • No Key Files: Eliminates the risk of key file compromise
  • Automatic Rotation: OIDC tokens are short-lived and automatically rotated
  • External Control: Leverage existing identity providers for access management
  • Compliance: Meets regulatory requirements for credential management

Service Account Best Practices

1. Principle of Least Privilege

  • Grant minimal required permissions for each service account
  • Use predefined roles when possible instead of custom roles
  • Regularly audit and remove unnecessary permissions
  • Use IAM Recommender to identify unused permissions

2. Service Account Naming Conventions

3. Service Account Separation

  • Use separate service accounts for different applications
  • Use separate service accounts for different environments (dev/staging/prod)
  • Use separate service accounts for different functions within an application
  • Avoid sharing service accounts across unrelated applications

4. Key Management

  • Minimize use of service account key files
  • Rotate keys regularly (monthly recommended)
  • Audit key usage and remove unused keys
  • Use Workload Identity Federation instead of key files when possible

3. IAM Conditions: Contextual Security

IAM Conditions add contextual restrictions to IAM policies, enabling more granular and dynamic access control. They allow you to specify when a policy binding is effective based on various attributes.

Understanding IAM Conditions

Traditional IAM policies are static: “Grant this role to this identity on this resource.” IAM Conditions add a temporal or contextual dimension: “Grant this role to this identity on this resource WHEN certain conditions are met.”

Common Condition Use Cases

1. Time-Based Access

# Grant temporary admin access
# (basic roles such as roles/owner cannot carry conditions, so a predefined role is used)
- members:
  - user:USER_EMAIL
  role: roles/resourcemanager.projectIamAdmin
  condition:
    title: "Valid until March 31"
    expression: "request.time < timestamp('2024-04-01T00:00:00Z')"

2. IP-Based Restrictions

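# Restrict access to corporate IP range
# IAM Conditions have no raw client-IP attribute. Define the IP range in an
# Access Context Manager access level and reference it here; access levels are
# evaluated for Identity-Aware Proxy resources, hence the IAP role.
# POLICY_NUMBER and corp_network are placeholders.
- members:
  - user:USER_EMAIL
  role: roles/iap.httpsResourceAccessor
  condition:
    title: "Corporate IP Only"
    expression: "'accessPolicies/POLICY_NUMBER/accessLevels/corp_network' in request.auth.access_levels"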

3. Resource Attributes

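# Allow deletion only for temporary resources (buckets whose names start with temp-)
# Bucket resource names take the form projects/_/buckets/BUCKET_NAME
- members:
  - group:GROUP_EMAIL
  role: roles/storage.objectAdmin
  condition:
    title: "Temporary Buckets Only"
    expression: "resource.name.startsWith('projects/_/buckets/temp-')"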

Condition Expression Language

IAM Conditions use CEL (Common Expression Language) for expressing conditions. Key elements include:

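1. Request Attributes

  • request.time: The time of the request
  • request.auth.access_levels: Access Context Manager access levels the request satisfies (Identity-Aware Proxy only)
  • request.host and request.path: The requested host and URL path (Identity-Aware Proxy only)

2. Resource Attributes

  • resource.name: The resource name
  • resource.type: The resource type
  • resource.service: The service that owns the resource
  • resource.matchTag(): Check a tag attached to the resource (labels are not available in conditions)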

3. Common Functions

  • timestamp(): Convert string to timestamp
  • string.startsWith(): Check if string starts with prefix
  • string.contains(): Check if string contains substring
  • in: Check membership in a list

Advanced Condition Scenarios

1. Business Hours Access

condition:
  title: "Business Hours Only"
  expression: |
    request.time.getHours() >= 9 && 
    request.time.getHours() < 17 && 
    request.time.getDayOfWeek() > 0 && 
    request.time.getDayOfWeek() < 6
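
To make the logic of the expression above easier to test locally, here is a rough Python equivalent. It is a sketch, not a CEL evaluator: the function names are hypothetical, and it assumes the expression is evaluated in UTC, which is what getHours() and getDayOfWeek() use when no time zone argument is passed.

```python
from datetime import datetime, timezone


def cel_day_of_week(ts):
    """CEL's getDayOfWeek() counts Sunday as 0; Python's weekday() counts Monday as 0."""
    return (ts.weekday() + 1) % 7


def business_hours_only(ts):
    """Mirror of the condition: 09:00-16:59 UTC, Monday through Friday."""
    ts = ts.astimezone(timezone.utc)
    return 9 <= ts.hour < 17 and 0 < cel_day_of_week(ts) < 6


monday_morning = datetime(2024, 4, 1, 10, 0, tzinfo=timezone.utc)
sunday_morning = datetime(2024, 3, 31, 10, 0, tzinfo=timezone.utc)
print(business_hours_only(monday_morning))
print(business_hours_only(sunday_morning))
```

Because the check runs in UTC, teams in other time zones would normally pass a zone name to the CEL functions, for example getHours("Europe/Berlin"), rather than shift the hour bounds by hand.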

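2. Device Compliance

# Device signals are not available as token claims in IAM Conditions.
# Define the device requirements (for example, encrypted storage) in an
# Access Context Manager access level and reference that level instead.
# POLICY_NUMBER and compliant_device are placeholders.
condition:
  title: "Compliant Devices Only"
  expression: |
    'accessPolicies/POLICY_NUMBER/accessLevels/compliant_device' in request.auth.access_levels

3. Risk-Based Access

# Conceptual only: IAM Conditions expose no risk or anomaly-score attributes,
# so the claim names below are illustrative and will not validate as written.
condition:
  title: "Low Risk Access Only"
  expression: |
    request.auth.claims['risk_level'] != 'HIGH' &&
    request.auth.claims['anomaly_score'] < 0.7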

Limitations and Considerations

1. Performance Impact

  • Conditions add slight latency to authorization decisions
  • Complex conditions may impact performance
  • Monitor for any performance degradation

2. Debugging Challenges

  • Condition failures may be harder to troubleshoot
  • Use IAM Policy Troubleshooter to debug conditions
  • Test conditions thoroughly before production deployment

3. Service Support

  • Not all GCP services support IAM Conditions
  • Check service documentation for condition support
  • Some services have limitations on condition complexity

4. Policy Inheritance and the “Additive” Rule

GCP’s resource hierarchy creates a complex inheritance model for IAM policies. Understanding this model is crucial for effective access management.

The Resource Hierarchy

GCP organizes resources in a hierarchical structure:
Organization
├── Folder(s)
│   ├── Folder(s)
│   └── Project(s)
│       └── Resources (VMs, Buckets, etc.)
└── Project(s)
    └── Resources (VMs, Buckets, etc.)

Policy Inheritance Rules

1. Downward Inheritance

  • Policies set at higher levels (organization, folder) apply to lower levels (projects, resources)
  • Lower levels inherit all permissions from higher levels
  • This creates a cumulative effect of permissions

2. Additive Nature

  • CRITICAL: Permissions granted by allow policies are additive, not subtractive
  • If a user has a role at the organization level, they retain those permissions at all child resources
  • An allow policy at a lower level cannot remove a permission that was granted at a higher level

3. No Deny in Allow Policies

  • Allow policies contain only grants; they have no “deny” binding
  • Once a permission is granted at a higher level, it cannot be revoked by editing a lower-level allow policy
  • To block a permission explicitly, use a separate IAM deny policy, which is evaluated before allow policies

Practical Implications

Scenario 1: Organization-Level Owner

Organization: alice@example.com → roles/owner
Project A: No specific policies
Project B: alice@example.com → roles/viewer
Result: Alice has Owner access to both Project A and Project B through inheritance. The viewer role on Project B is redundant because Owner already includes every Viewer permission.

Scenario 2: Proper Role Assignment

Organization: alice@example.com → No roles
Project A: alice@example.com → roles/editor
Project B: alice@example.com → roles/viewer
Result: Alice has editor access to Project A and viewer access to Project B. No inheritance conflicts.
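
The additive rule can be modelled as a union of role grants collected while walking from a resource up to the organization. The following is a minimal sketch with a made-up hierarchy: the PARENT and BINDINGS tables, the resource names, and the effective_roles helper are all hypothetical.

```python
# Hypothetical hierarchy: child resource -> parent resource.
PARENT = {
    "projects/project-a": "folders/engineering",
    "projects/project-b": "folders/engineering",
    "folders/engineering": "organizations/example",
}

# Hypothetical allow-policy bindings: resource -> {member: roles}.
BINDINGS = {
    "organizations/example": {"alice": {"roles/viewer"}},
    "projects/project-a": {"alice": {"roles/editor"}},
}


def effective_roles(member, resource):
    """Union of every role granted on the resource or any of its ancestors."""
    roles = set()
    node = resource
    while node is not None:
        roles |= BINDINGS.get(node, {}).get(member, set())
        node = PARENT.get(node)
    return roles


print(sorted(effective_roles("alice", "projects/project-a")))
print(sorted(effective_roles("alice", "projects/project-b")))
```

Nothing in the walk can remove a role once it has been added, which is the code-level reason a lower-level allow policy can never narrow an inherited grant.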

Managing Hierarchical Policies Effectively

1. Organization-Level Policies

  • Apply to all resources in the organization
  • Use for enterprise-wide roles (security teams, auditors)
  • Be very careful with permissions granted here
  • Consider using groups rather than individual users

2. Folder-Level Policies

  • Apply to all projects within the folder
  • Useful for departmental or business unit access
  • Good for shared services and cross-functional teams
  • Can add to, but never reduce, the access inherited from the organization (allow policies are additive)

3. Project-Level Policies

  • Most common level for role assignments
  • Use for team-based access control
  • Combine with groups for easier management
  • Monitor for inheritance conflicts

4. Resource-Level Policies

  • Most granular level of control
  • Use sparingly to avoid complexity
  • Effective for sensitive resources
  • Combine with conditions for contextual access

Policy Troubleshooting for Hierarchies

1. Identifying Inheritance Issues

  • Use IAM Policy Troubleshooter to trace permission sources
  • Check all levels of the hierarchy for conflicting policies
  • Understand which policies contribute to effective permissions

2. Planning Policy Changes

  • Consider impact across the entire hierarchy
  • Test changes in non-production environments
  • Document the intended inheritance behavior
  • Communicate changes to affected stakeholders

4. IAM Deny Policies: The Explicit Guardrails

While standard IAM policies are allow-only, IAM Deny Policies allow you to explicitly block permissions.
  • Precedence: Deny policies always override allow policies. If a user is granted Owner at the project level but is Denied at the organization level, they cannot perform the action.
  • Use Case: “Prevent anyone (even admins) from deleting production storage buckets” or “Block access to all GCP services for a contractor after their contract ends, regardless of project-level permissions.”
  • Inheritance: Deny policies follow the same inheritance rules as allow policies (Organization -> Folder -> Project).
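
The precedence rule can be summarized as "check denies first, then allows, then fall back to no access". The sketch below assumes the inherited deny and allow rules have already been flattened into two sets; the evaluate function and its return strings are hypothetical, and it uses the dotted permission names from the rest of this chapter, whereas real deny policies name permissions in a service-qualified form.

```python
def evaluate(permission, denied, allowed):
    """Deny rules are checked first; an allow grant only matters if no deny matches."""
    if permission in denied:
        return "denied by deny policy"
    if permission in allowed:
        return "allowed"
    return "denied (no grant)"


# Hypothetical admin whose allow policy includes bucket deletion,
# while an organization-level deny policy blocks it.
allowed = {"storage.buckets.get", "storage.buckets.delete"}
denied = {"storage.buckets.delete"}

print(evaluate("storage.buckets.delete", denied, allowed))
print(evaluate("storage.buckets.get", denied, allowed))
print(evaluate("storage.buckets.create", denied, allowed))
```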

5. Workload Identity Federation: Multi-Cloud Identity

Workload Identity Federation is the modern way to connect AWS, Azure, or GitHub Actions to GCP without service account keys.
  • The OIDC Bridge: GCP trusts the external provider (e.g., AWS STS or GitHub OIDC).
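  • Attribute Mapping: You map attributes from the external credential (such as assertion.arn for AWS or assertion.repository for GitHub) to Google attributes (google.subject and custom attribute.* values) that identify the principal.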
  • Security: This eliminates the “Key Rotation” problem. The external workload receives a short-lived GCP token automatically.

AWS to GCP Example

  1. Pool: Create a Workload Identity Pool for AWS.
  2. Provider: Configure the AWS account ID as a trusted provider.
  3. Binding: Allow an AWS IAM Role to impersonate a GCP Service Account.
  4. CLI: gcloud auth login --cred-file=aws-credentials.json.

6. Advanced Troubleshooting

Effective IAM troubleshooting requires understanding both the current state and potential impacts of changes. GCP provides powerful tools to help with both reactive troubleshooting and proactive validation.

5.1 The Policy Troubleshooter

The Policy Troubleshooter is the primary tool for diagnosing access issues. It analyzes the entire policy hierarchy to determine why access was granted or denied.

How Policy Troubleshooter Works

  1. Input Requirements:
    • Principal: The user or service account experiencing the issue
    • Permission: The specific permission being checked (e.g., compute.instances.start)
    • Resource: The specific resource where access is needed
  2. Analysis Process:
    • Examines all IAM policies in the resource hierarchy
    • Evaluates direct bindings and conditional bindings
    • Identifies which policies grant or deny the requested permission
    • Provides detailed explanation of the access decision
  3. Output Information:
    • Whether access is granted or denied
    • Which policies contributed to the decision
    • Path through the resource hierarchy
    • Any conditions that affected the outcome

Using Policy Troubleshooter Effectively

Command Line Interface
# Troubleshoot a specific access issue
# (the resource is given as a full resource name)
gcloud policy-troubleshoot iam \
    //compute.googleapis.com/projects/my-project/zones/us-central1-a/instances/my-instance \
    --principal-email="USER_EMAIL" \
    --permission="compute.instances.start"
Console Interface
  1. Navigate to IAM section in Google Cloud Console
  2. Select “Policy Troubleshooter” from the left navigation
  3. Enter the principal, permission, and resource details
  4. Review the detailed analysis and recommendations

Common Troubleshooting Scenarios

Scenario 1: Unexpected Access Granted
Problem: A user has access to a resource they shouldn’t have access to.
Solution: Use Policy Troubleshooter to identify where the permission was granted in the hierarchy.

Scenario 2: Expected Access Denied
Problem: A user cannot access a resource they should have access to.
Solution: Use Policy Troubleshooter to identify missing permissions or conflicting policies.

Scenario 3: Conditional Access Issues
Problem: Access works sometimes but not consistently.
Solution: Use Policy Troubleshooter to examine conditional policies and their evaluation.

5.2 The IAM Policy Simulator

The IAM Policy Simulator allows you to test policy changes before implementing them, preventing unintended access modifications.

Key Features of Policy Simulator

  • Predictive Analysis: Shows how proposed policies would affect existing access
  • Historical Replay: Analyzes past access attempts against proposed policies
  • Impact Assessment: Identifies which users/services would be affected by changes
  • Safety Net: Prevents disruptive policy changes

How Policy Simulator Works

  1. Proposed Policy: Define the IAM policy changes you want to make
  2. Historical Data: Simulator replays the last 90 days of access attempts
  3. Comparison: Shows which access attempts would succeed or fail under the new policy
  4. Reporting: Generates detailed reports of the policy change impact

Practical Use Case: Removing Editor Role

Situation: Need to remove roles/editor from a developer group but concerned about breaking CI/CD pipelines.
Solution:
  1. Prepare the new policy without the editor role
  2. Run the policy through the simulator
  3. Analyze the results to identify any blocked API calls
  4. Adjust the policy or prepare for the impact before implementation
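
The replay idea behind these steps can be illustrated with a toy model. This is not the Policy Simulator's algorithm, only a sketch of the comparison it performs: the simulate function, the log format, and the member and permission values are hypothetical.

```python
def simulate(access_log, current, proposed):
    """Return log entries whose outcome would change under the proposed policy.

    access_log: list of (member, permission) pairs seen in the past
    current / proposed: member -> set of permissions the policy grants
    """
    changes = []
    for member, permission in access_log:
        before = permission in current.get(member, set())
        after = permission in proposed.get(member, set())
        if before != after:
            changes.append((member, permission, before, after))
    return changes


# Hypothetical CI/CD identity that used two permissions recently.
log = [("ci-bot", "run.services.update"), ("ci-bot", "storage.objects.get")]
current = {"ci-bot": {"run.services.update", "storage.objects.get", "compute.instances.delete"}}
proposed = {"ci-bot": {"storage.objects.get"}}

for change in simulate(log, current, proposed):
    print(change)
```

Here the replay flags run.services.update as a call that would start failing, while the unused compute.instances.delete never appears because nothing in the log exercised it.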

Simulator Limitations

  • Only analyzes the last 90 days of access data
  • May not catch all edge cases or new access patterns
  • Requires sufficient historical data to be meaningful
  • Does not account for future access needs

5.3 IAM Recommender: The SRE’s Intelligence

The IAM Recommender is an automated tool that uses machine learning to help apply the Principle of Least Privilege at scale. It recommends role changes; applying them remains your decision.

How it Works

  1. Observation: Google monitors the actual permissions used by each principal over the last 90 days.
  2. Comparison: It compares the granted roles with the used permissions.
  3. Recommendation: If a user has roles/editor but only ever reads from GCS buckets, the Recommender suggests downgrading them to roles/storage.objectViewer.
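
The comparison step boils down to set arithmetic. The sketch below is a deliberately naive stand-in for the Recommender, which relies on machine learning rather than a simple subset test: the helper names and the trimmed permission sets are hypothetical.

```python
def unused_permissions(granted, used):
    """Permissions the principal holds but never exercised."""
    return granted - used


def suggest_role(used, candidate_roles):
    """Pick the smallest candidate role that still covers every used permission."""
    covering = [
        (len(perms), name)
        for name, perms in candidate_roles.items()
        if used <= perms
    ]
    return min(covering)[1] if covering else None


# Hypothetical, heavily trimmed permission sets.
granted = {"storage.objects.get", "storage.objects.list", "storage.objects.delete",
           "compute.instances.delete"}
used = {"storage.objects.get", "storage.objects.list"}
candidates = {
    "roles/storage.objectViewer": {"storage.objects.get", "storage.objects.list"},
    "roles/storage.objectAdmin": {"storage.objects.get", "storage.objects.list",
                                  "storage.objects.create", "storage.objects.delete"},
}

print(sorted(unused_permissions(granted, used)))
print(suggest_role(used, candidates))
```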

The Policy Analyzer

For complex scenarios, the Policy Analyzer helps you answer “Who can access this resource and how?” by performing a recursive expansion of all groups, service accounts, and conditional bindings across the hierarchy.
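
The recursive expansion of groups can be sketched as a depth-first walk with a cycle guard. This is an illustrative model only: the expand function and the group membership table are hypothetical, and real analysis also has to account for conditions and service account impersonation.

```python
def expand(member, groups, seen=None):
    """Resolve a member to the set of non-group principals it contains.

    groups: group name -> list of direct members (users, service accounts, or groups)
    seen:   members already visited, which protects against nesting cycles
    """
    seen = set() if seen is None else seen
    if member in seen:
        return set()
    seen.add(member)
    if member not in groups:
        return {member}  # a leaf principal
    result = set()
    for child in groups[member]:
        result |= expand(child, groups, seen)
    return result


# Hypothetical nesting, including a cycle between the two groups.
groups = {
    "group:eng": ["user:alice", "group:sre"],
    "group:sre": ["user:bob", "group:eng"],
}

print(sorted(expand("group:eng", groups)))
```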

9. Organization Policies: The Guardrails of the Cloud

While IAM defines Identity-based access, Organization Policies provide Resource-based constraints. They are the “Constitutional Law” of your GCP environment.

9.1 IAM vs. Org Policy

  • IAM: “Can user Alice start a VM?” (Identity-centric)
  • Org Policy: “Can anyone create a VM with a Public IP in this folder?” (Resource-centric)

9.2 Critical Organization Constraints

A Principal Engineer should implement these foundational constraints to prevent security drift:
Constraint | Effect | Why it matters
iam.disableServiceAccountKeyCreation | Blocks JSON key generation. | Forces usage of Workload Identity.
compute.vmExternalIpAccess | Blocks VMs from having Public IPs (when set to deny all). | Prevents accidental internet exposure.
gcp.resourceLocations | Restricts resource creation to specific regions. | Ensures data sovereignty and compliance.
iam.allowedPolicyMemberDomains | Restricts IAM to only your corporate domain. | Prevents sharing data with personal Gmail accounts.

9.3 Enforcement and Inheritance

  • Dry Run Mode: You can test an Org Policy in “Dry Run” mode to see what would be blocked without actually stopping developers.
  • Hierarchical Overrides: Policies set at the Org level apply to everyone, but you can “exempt” specific folders or projects if necessary (use with caution).

6. Deep Dive: Custom Roles and Advanced Permission Management

While predefined roles are suitable for most use cases, production environments often require custom roles for specific compliance, security, or operational requirements. Understanding custom roles deeply is essential for advanced IAM management.

6.1 Understanding Permissions at the API Level

GCP permissions correspond directly to API methods, providing granular control over cloud resources.

Permission Structure

service.resource.action

Examples:

  • compute.instances.create - Create Compute Engine instances
  • storage.buckets.delete - Delete Cloud Storage buckets
  • pubsub.topics.publish - Publish messages to Pub/Sub topics
  • bigquery.datasets.update - Update BigQuery datasets

Permission Categories

  1. Read Permissions: get, list, getIamPolicy
  2. Write Permissions: create, update, patch
  3. Delete Permissions: delete
  4. Special Permissions: setIamPolicy, testIamPermissions
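
As a small illustration of the service.resource.action structure and the categories above, the following sketch splits a permission name and buckets it by its action. The classify function and the "other" fallback category are hypothetical conveniences, not part of any Google API.

```python
CATEGORIES = {
    "read": {"get", "list", "getIamPolicy"},
    "write": {"create", "update", "patch"},
    "delete": {"delete"},
    "special": {"setIamPolicy", "testIamPermissions"},
}


def classify(permission):
    """Split 'service.resource.action' and label the action's category."""
    parts = permission.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected service.resource.action, got {permission!r}")
    service, resource, action = parts
    for category, actions in CATEGORIES.items():
        if action in actions:
            return service, resource, action, category
    return service, resource, action, "other"


for name in ["compute.instances.create", "storage.buckets.delete", "pubsub.topics.publish"]:
    print(classify(name))
```

Actions such as publish fall outside the four listed categories, which is why the sketch keeps an explicit fallback instead of guessing.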

6.2 Custom Role Creation and Management

Creating custom roles requires careful planning and ongoing maintenance to ensure they remain effective and secure.

Custom Role Components

1. Role Metadata
  • Title: Human-readable name for the role
  • Description: Explanation of the role’s purpose
  • Stage: Development stage (ALPHA, BETA, GA, DEPRECATED)
2. Permission Selection
  • includedPermissions: List of permissions granted by the role
  • Must be valid GCP permissions
  • Should follow least-privilege principle
3. Role Limits
  • Maximum of 3,000 permissions per custom role
  • Default limit of 300 custom roles per organization and 300 per project
  • Consider using predefined roles when possible

Custom Role YAML Template with Advanced Options

title: "Database Migration Operator"
description: "Can migrate data between databases but cannot access production data directly"
stage: "GA"
includedPermissions:
- storage.objects.get
- storage.objects.create
- bigquery.jobs.create
- bigquery.tables.getData
- cloudsql.instances.connect
- cloudsql.databases.get
etag: "AA=="

6.3 Advanced Custom Role Patterns

Pattern 1: Operational Roles

Roles designed for specific operational tasks without data access:
title: "Compute Maintenance Operator"
description: "Can manage Compute Engine instances for maintenance but cannot access data"
stage: "GA"
includedPermissions:
- compute.instances.start
- compute.instances.stop
- compute.instances.reset
- compute.instances.setMetadata
- compute.instances.setLabels
- compute.instances.addAccessConfig

Pattern 2: Deployment Roles

Roles for CI/CD systems with limited scope:
title: "Deployment Service Account"
description: "Can deploy applications but cannot modify IAM or billing"
stage: "GA"
includedPermissions:
- run.services.update
- run.services.get
- cloudfunctions.functions.sourceCodeSet
- cloudfunctions.functions.update
- storage.objects.get
- storage.objects.create

Pattern 3: Auditing Roles

Roles for compliance and auditing without operational capabilities:
title: "Security Auditor"
description: "Can audit security configurations but cannot modify resources"
stage: "GA"
includedPermissions:
- resourcemanager.projects.get
- resourcemanager.projects.testIamPermissions
- iam.roles.get
- iam.serviceAccounts.get
- logging.logs.list
- monitoring.metricDescriptors.list

6.4 Custom Role Maintenance and Evolution

1. Permission Drift Management

  • Challenge: New GCP features introduce new permissions
  • Solution: Regular review of custom roles against service updates
  • Best Practice: Subscribe to GCP release notes and feature announcements

2. Version Control for Custom Roles

  • Store custom role definitions in version control
  • Track changes and approvals for role modifications
  • Maintain backup versions for rollback capability
  • Document the business justification for each permission

3. Automated Role Validation

  • Implement CI/CD pipelines for custom role management
  • Test role permissions in non-production environments
  • Validate role effectiveness before production deployment
  • Monitor for unused or overly broad permissions

6.5 Custom Role Security Considerations

1. Privilege Escalation Prevention

  • Review permissions for potential escalation paths
  • Avoid granting both read and write permissions when only one is needed
  • Consider the combination of permissions and their potential misuse
  • Regular security reviews of custom role assignments
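
A review for escalation paths can be partly automated by scanning role definitions for permissions that are commonly abused. The sketch below assumes a role has been loaded into a dictionary with an includedPermissions list, as in the YAML templates in this chapter; the watchlist is a short illustrative sample, not an exhaustive or official list, and the function name is hypothetical.

```python
# Illustrative watchlist only: permissions that let a principal borrow
# another identity or run code as one.
SENSITIVE = {
    "iam.serviceAccounts.actAs",
    "iam.serviceAccountKeys.create",
    "compute.instances.setMetadata",
}


def risky_permissions(role):
    """Return permissions in the role that deserve a manual escalation review."""
    perms = role.get("includedPermissions", [])
    return sorted(
        p for p in perms
        if p in SENSITIVE or p.endswith(".setIamPolicy")
    )


maintenance_role = {
    "title": "Compute Maintenance Operator",
    "includedPermissions": [
        "compute.instances.start",
        "compute.instances.stop",
        "compute.instances.setMetadata",
    ],
}

print(risky_permissions(maintenance_role))
```

A check like this fits naturally into a CI pipeline for role definitions: it does not block a permission outright, it only forces a human to justify it.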

2. Principle of Least Privilege

  • Grant only the minimum permissions required for the task
  • Regularly audit and remove unused permissions
  • Use predefined roles when possible instead of custom roles
  • Monitor access patterns and adjust permissions accordingly

3. Separation of Duties

  • Create separate roles for different functional responsibilities
  • Prevent any single role from having excessive authority
  • Implement checks and balances through role separation
  • Document role responsibilities and access patterns

7. Workload Identity: Securing Service-to-Service Communication

Workload Identity is GCP’s recommended approach for authenticating workloads running on Kubernetes Engine (GKE) or other platforms to Google Cloud services. It eliminates the need for service account key files, addressing a significant security concern.

7.1 Understanding Workload Identity

The Traditional Problem

  • Applications in containers needed service account key files to authenticate to GCP services
  • Key files were stored in containers, creating security vulnerabilities
  • Key rotation was manual and often neglected
  • Compromised containers could expose key files and provide unauthorized access

Workload Identity Solution

  • Establishes trust relationship between Kubernetes service accounts and GCP service accounts
  • Uses Kubernetes native authentication mechanisms
  • Eliminates need for service account key files
  • Provides automatic credential exchange and refresh

7.2 Workload Identity Architecture

Components

  1. Kubernetes Service Account (KSA): Identity for pods within a GKE cluster
  2. GCP Service Account (GSA): Identity for accessing GCP resources
  3. Workload Identity Pool: Container for the trust relationship
  4. Workload Identity Provider: Authenticates Kubernetes workloads

Trust Flow

  1. Pod authenticates to GKE using Kubernetes service account
  2. GKE validates the workload against the trust relationship
  3. Temporary credentials are exchanged for the GCP service account
  4. Application accesses GCP resources using these credentials

7.3 Workload Identity Configuration

Step 1: Enable Workload Identity on the Cluster

gcloud container clusters update CLUSTER_NAME \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --zone=COMPUTE_ZONE \
    --project=PROJECT_ID

Step 2: Create or Identify GCP Service Account

gcloud iam service-accounts create my-workload-sa \
    --project=PROJECT_ID

Step 3: Bind GCP Service Account to Kubernetes Service Account

gcloud iam service-accounts add-iam-policy-binding \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[K8S_NAMESPACE/K8S_SERVICE_ACCOUNT_NAME]" \
    --project=PROJECT_ID \
    my-workload-sa@PROJECT_ID.iam.gserviceaccount.com

Step 4: Annotate Kubernetes Service Account

kubectl annotate serviceaccount \
    --namespace K8S_NAMESPACE \
    K8S_SERVICE_ACCOUNT_NAME \
    iam.gke.io/gcp-service-account=my-workload-sa@PROJECT_ID.iam.gserviceaccount.com
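
Steps 3 and 4 depend on two strings being formed exactly right, and a typo in either silently breaks the trust relationship. The helpers below are a hypothetical convenience for generating them from the same inputs; they only build the strings shown in the commands above and do not call any Google or Kubernetes API.

```python
def workload_identity_member(project_id, namespace, ksa_name):
    """IAM member string used in the roles/iam.workloadIdentityUser binding (Step 3)."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def gsa_annotation(gsa_name, project_id):
    """Annotation key and value placed on the Kubernetes service account (Step 4)."""
    return {
        "iam.gke.io/gcp-service-account": f"{gsa_name}@{project_id}.iam.gserviceaccount.com"
    }


# Hypothetical project, namespace, and account names.
print(workload_identity_member("my-project", "payments", "payments-ksa"))
print(gsa_annotation("my-workload-sa", "my-project"))
```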

7.4 Workload Identity Federation (Beyond GKE)

Workload Identity Federation extends the concept beyond GKE to external identity providers.

Supported Providers

  • GitHub Actions
  • AWS
  • Azure
  • SAML 2.0 identity providers
  • OIDC providers

Federation Configuration

  1. Create a workload identity pool
  2. Create a workload identity provider
  3. Configure trust relationship with external provider
  4. Create or identify target GCP service account
  5. Grant necessary IAM permissions to the service account

7.5 Best Practices for Workload Identity

1. Namespace and Service Account Design

  • Use dedicated namespaces for different applications/environments
  • Follow consistent naming conventions for service accounts
  • Implement least-privilege access for each workload
  • Separate production and non-production workloads

2. Monitoring and Auditing

  • Enable audit logging for workload identity exchanges
  • Monitor for unusual authentication patterns
  • Alert on failed authentication attempts
  • Track credential usage and rotation

3. Security Considerations

  • Regularly review and update trust relationships
  • Monitor for unauthorized workload identity usage
  • Implement proper namespace isolation
  • Secure Kubernetes cluster configuration

8. IAM Design Patterns and Anti-Patterns

Effective IAM implementation follows proven design patterns while avoiding common anti-patterns that lead to security vulnerabilities and management complexity.

8.1 IAM Design Patterns

Pattern 1: Role-Based Access Control (RBAC) with Groups

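Problem: Managing permissions for individual users becomes unwieldy.
Solution: Create groups based on job functions and assign roles to groups.
Implementation:
  • Create one group per job function (for example, developers, SREs, security reviewers)
  • Bind roles to the groups at the appropriate level of the resource hierarchy
  • Grant and revoke access by changing group membership rather than editing IAM policies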
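
A small Python model of this pattern, as a sketch only: roles are bound to groups, and a user's access follows from group membership. The group names, users, and role choices are hypothetical.

```python
# Group membership is managed in one place (for example, Google Groups).
GROUP_MEMBERS = {
    "group:developers": {"user:alice", "user:bob"},
    "group:sre": {"user:carol"},
}

# Roles are bound to groups, never to individual users.
BINDINGS = {
    "roles/container.developer": {"group:developers"},
    "roles/monitoring.viewer": {"group:developers", "group:sre"},
    "roles/container.admin": {"group:sre"},
}


def roles_for(user):
    """Return the roles a user holds through the groups they belong to."""
    groups = {g for g, members in GROUP_MEMBERS.items() if user in members}
    return {role for role, principals in BINDINGS.items() if principals & groups}


print(sorted(roles_for("user:alice")))
print(sorted(roles_for("user:carol")))
```

Onboarding or offboarding someone only touches GROUP_MEMBERS; the bindings stay unchanged.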

Pattern 2: Environment-Based Access Control

Problem: Same users need different access levels across dev/staging/prod.
Solution: Use folders to organize environments and apply different policies.
Implementation:
Organization
├── Development Folder
│   └── Development Projects
├── Staging Folder  
│   └── Staging Projects
└── Production Folder
    └── Production Projects
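
Because allow policies are inherited down the hierarchy, a project's effective access is the union of its own bindings and those of every ancestor. The Python sketch below models that with a hypothetical hierarchy; the resource names, groups, and role assignments are illustrative assumptions.

```python
# Child -> parent links for a hypothetical hierarchy.
PARENT = {
    "dev-project": "folders/development",
    "prod-project": "folders/production",
    "folders/development": "organization",
    "folders/production": "organization",
    "organization": None,
}

# (member, role) pairs set directly on each resource.
POLICIES = {
    "organization": {("group:security", "roles/iam.securityReviewer")},
    "folders/development": {("group:developers", "roles/compute.admin")},
    "folders/production": {("group:developers", "roles/compute.viewer")},
    "prod-project": {("group:sre", "roles/compute.instanceAdmin.v1")},
}


def effective_bindings(resource):
    """Union of the bindings on a resource and on all of its ancestors."""
    bindings = set()
    node = resource
    while node is not None:
        bindings |= POLICIES.get(node, set())
        node = PARENT[node]
    return bindings


for member, role in sorted(effective_bindings("prod-project")):
    print(member, role)
```

In this model, developers get broad access under the Development folder but read-only access under Production, without any per-project policy for them.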

Pattern 3: Resource Tagging for Access Control

Problem: Need to control access based on resource characteristics.
Solution: Use resource tags combined with IAM Conditions. Conditions can match tags (through resource.matchTag) but cannot read labels, which serve billing and filtering only.
Implementation:
  • Tag resources: environment=prod, team=payment, confidentiality=high
  • Use conditions to enforce access based on those tags
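
An IAM Condition is a CEL expression attached to a role binding. Condition expressions match Resource Manager tags through resource.matchTag rather than labels, so the sketch below assumes the environment key exists as a tag under a hypothetical organization ID. The group address, organization ID, and helper name are illustrative assumptions. The binding layout is the JSON form of an allow policy, which must be version 3 when it contains conditions.

```python
import json


def conditional_binding(role, member, title, expression):
    """Build one allow-policy binding that carries an IAM Condition."""
    return {
        "role": role,
        "members": [member],
        "condition": {"title": title, "expression": expression},
    }


ORG_ID = "123456789012"  # hypothetical organization ID that owns the tag key

binding = conditional_binding(
    role="roles/compute.instanceAdmin.v1",
    member="group:payment-team@example.com",  # hypothetical group
    title="prod-only",
    expression=f"resource.matchTag('{ORG_ID}/environment', 'prod')",
)

# Policies that contain conditional bindings use version 3.
policy = {"version": 3, "bindings": [binding]}
print(json.dumps(policy, indent=2))
```

A time-limited grant follows the same shape with an expression such as request.time < timestamp("2030-01-01T00:00:00Z").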

Pattern 4: Segmented Administration

Problem: Preventing any single administrator from having complete control.
Solution: Split administrative responsibilities across multiple roles.
Implementation:
  • Network administrator: roles/compute.networkAdmin
  • Security administrator: roles/iam.securityAdmin
  • Billing administrator: roles/billing.admin
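
Segmented administration only holds if no principal accumulates several of these roles over time. As a rough sketch, the Python check below flags any member that holds more than one of the three administrative roles listed above; the member names and the shape of the input (role mapped to a list of members) are assumptions for illustration.

```python
ADMIN_ROLES = {
    "roles/compute.networkAdmin",
    "roles/iam.securityAdmin",
    "roles/billing.admin",
}


def duty_conflicts(bindings):
    """Return members that hold more than one administrative role."""
    held = {}
    for role, members in bindings.items():
        if role in ADMIN_ROLES:
            for member in members:
                held.setdefault(member, set()).add(role)
    return {member: roles for member, roles in held.items() if len(roles) > 1}


# Hypothetical bindings: root-admin holds two administrative roles.
bindings = {
    "roles/compute.networkAdmin": ["group:network-admins", "user:root-admin"],
    "roles/iam.securityAdmin": ["group:security-admins", "user:root-admin"],
    "roles/billing.admin": ["group:finance"],
    "roles/logging.viewer": ["user:root-admin"],
}

conflicts = duty_conflicts(bindings)
print(conflicts)
```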

8.2 IAM Anti-Patterns to Avoid

Anti-Pattern 1: Overuse of Primitive Roles

Problem: Using the basic (formerly primitive) Owner/Editor/Viewer roles in production.
Risk: Excessive permissions violate the least-privilege principle.
Solution: Use predefined roles or create custom roles with minimal permissions.

Anti-Pattern 2: Direct User-to-Resource Binding

Problem: Granting roles directly to individual users instead of groups.
Risk: Difficult to manage and maintain as the team grows.
Solution: Always use groups for role assignments.

Anti-Pattern 3: Overlapping Administrative Boundaries

Problem: Same person has both development and security administration.
Risk: Potential for security bypass or conflict of interest.
Solution: Separate duties and implement checks and balances.

Anti-Pattern 4: Inadequate Key Management

Problem: Poor management of service account keys.
Risk: Compromised keys provide persistent access to resources.
Solution: Minimize key use, implement rotation, use Workload Identity.
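
Anti-Patterns 1 and 2 can be detected mechanically from an exported allow policy. The Python sketch below scans a policy in the JSON shape returned by gcloud projects get-iam-policy --format=json and reports basic roles and direct user bindings. The sample policy and the function name are hypothetical.

```python
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def lint_policy(policy):
    """Return human-readable findings for basic roles and direct user bindings."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding["role"]
        for member in binding.get("members", []):
            if role in BASIC_ROLES:
                findings.append(f"basic role {role} granted to {member}")
            if member.startswith("user:"):
                findings.append(f"direct user binding: {member} has {role}")
    return findings


# Hypothetical exported policy.
sample_policy = {
    "bindings": [
        {"role": "roles/editor", "members": ["user:alice@example.com"]},
        {"role": "roles/storage.objectViewer", "members": ["group:data@example.com"]},
    ]
}

findings = lint_policy(sample_policy)
for finding in findings:
    print(finding)
```

A check like this could run in CI against exported policies so regressions surface before an access review.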

8.3 Common Implementation Scenarios

Scenario 1: Multi-Team Development Environment

Requirements:
  • Multiple development teams working on different projects
  • Shared services (CI/CD, monitoring, security)
  • Isolated development environments
  • Controlled production access
Implementation:
  1. Organize projects by team/product
  2. Create team-specific groups with appropriate access
  3. Implement shared service accounts for common infrastructure
  4. Use folders to group related projects
  5. Implement production access controls and approvals

Scenario 2: Regulatory Compliance Environment

Requirements:
  • Segregation of duties
  • Detailed audit trails
  • Restricted data access
  • Regular access reviews
Implementation:
  1. Implement role separation for different functions
  2. Use audit logging for all access attempts
  3. Implement data classification and access controls
  4. Schedule regular access certification reviews
  5. Use automated tools for compliance monitoring

Lab: Comprehensive IAM Implementation Exercise

This lab will walk you through implementing a complete IAM strategy for a fictional company with multiple teams and environments.

Scenario: Acme Corp GCP Implementation

Acme Corp is migrating their applications to GCP with the following requirements:
  • 3 development teams (Frontend, Backend, Data)
  • 3 environments (Development, Staging, Production)
  • Security team that monitors all access
  • Separate billing for each team/environment
  • Need to implement least-privilege access

Step 1: Organize the Resource Hierarchy

First, create the folder structure to organize resources:
# Create environment folders
gcloud resource-manager folders create \
    --display-name="Development" \
    --organization=ORGANIZATION_ID

gcloud resource-manager folders create \
    --display-name="Staging" \
    --organization=ORGANIZATION_ID

gcloud resource-manager folders create \
    --display-name="Production" \
    --organization=ORGANIZATION_ID

Step 2: Create Groups for Teams

Create Google Groups for each team and environment:
# Create team groups (assuming you have domain admin access)
# [email protected]
# [email protected]
# [email protected]
# [email protected]

Step 3: Create Service Accounts for Applications

Create service accounts for each application:
# Frontend application service account
gcloud iam service-accounts create frontend-app \
    --description="Frontend application service account" \
    --display-name="Frontend App SA"

# Backend application service account
gcloud iam service-accounts create backend-app \
    --description="Backend application service account" \
    --display-name="Backend App SA"

# Data processing service account
gcloud iam service-accounts create data-processor \
    --description="Data processing service account" \
    --display-name="Data Processor SA"

Step 4: Implement Least-Privilege Access

Assign roles based on the principle of least privilege:
# Frontend team gets access to their resources only
gcloud projects add-iam-policy-binding dev-frontend-project \
    --member="group:[email protected]" \
    --role="roles/compute.instanceAdmin.v1"

# Backend team gets access to their resources
gcloud projects add-iam-policy-binding dev-backend-project \
    --member="group:[email protected]" \
    --role="roles/cloudsql.editor"

# Data team gets access to their resources
gcloud projects add-iam-policy-binding dev-data-project \
    --member="group:[email protected]" \
    --role="roles/bigquery.dataEditor"
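
The three commands above cover only the development projects. As a sketch, the Python helper below generates the same command for every team across several environments so the bindings stay consistent. The staging project names follow the dev-TEAM-project pattern by assumption, the group addresses are hypothetical placeholders, and production is left out on purpose because it normally gets narrower roles.

```python
TEAM_ROLES = {
    "frontend": "roles/compute.instanceAdmin.v1",
    "backend": "roles/cloudsql.editor",
    "data": "roles/bigquery.dataEditor",
}


def binding_commands(group_emails, environments=("dev", "staging")):
    """Return one gcloud command string per environment and team."""
    commands = []
    for env in environments:
        for team, role in TEAM_ROLES.items():
            commands.append(
                f"gcloud projects add-iam-policy-binding {env}-{team}-project "
                f'--member="group:{group_emails[team]}" --role="{role}"'
            )
    return commands


# Hypothetical group addresses; substitute your own.
groups = {
    "frontend": "frontend-team@example.com",
    "backend": "backend-team@example.com",
    "data": "data-team@example.com",
}

commands = binding_commands(groups)
for command in commands:
    print(command)
```

The function only prints commands; review the output before running any of it.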

Step 5: Configure Service Account Access

Set up service account access for applications:
# Frontend app needs to write logs
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:frontend-app@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/logging.logWriter"

# Backend app needs database access
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:backend-app@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/cloudsql.client"

# Data processor needs to read/write data
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:data-processor@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

Step 6: Implement Security Monitoring

Configure security team access for monitoring:
# Security team gets broad monitoring access
gcloud organizations add-iam-policy-binding ORGANIZATION_ID \
    --member="group:[email protected]" \
    --role="roles/logging.privateLogViewer"

gcloud organizations add-iam-policy-binding ORGANIZATION_ID \
    --member="group:[email protected]" \
    --role="roles/iam.securityReviewer"

Step 7: Verify the Implementation

Test the implementation to ensure it works as expected:
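# Check whether a principal holds a given permission on the project
gcloud policy-intelligence troubleshoot-policy iam \
    //cloudresourcemanager.googleapis.com/projects/PROJECT_ID \
    --principal-email=PRINCIPAL_EMAIL \
    --permission=compute.instances.create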

Step 8: Set Up Monitoring and Alerting

Route audit log entries to a destination where alerts can be raised:
# Send audit log entries at WARNING or above to a Pub/Sub topic
# (the sink's writer identity needs permission to publish to the topic)
gcloud logging sinks create security-alerts \
    pubsub.googleapis.com/projects/PROJECT_ID/topics/security-notifications \
    --log-filter='logName:"cloudaudit.googleapis.com" AND severity>=WARNING'

SRE Best Practices Checklist

Security Posture

  • Audit Logs: Enable “Data Access” logs for critical services to see exactly who accessed your data
  • IAM Recommender: Check this weekly. Google will suggest removing unused permissions based on actual usage over the last 90 days
  • Groups over Users: Never grant a role directly to an individual user account. Use Google Groups
  • Service Account Keys: Hunt them down and delete them. Use Workload Identity instead
  • Conditional Access: Implement time-based and IP-based conditions for sensitive access
  • Separation of Duties: Ensure no single user has complete control over critical systems

Operational Excellence

  • Policy Simulations: Test policy changes before applying them using the IAM Policy Simulator
  • Resource Tags: Use consistent tagging to enable conditional access based on resource attributes (IAM Conditions match tags, not labels)
  • Regular Reviews: Conduct quarterly access reviews to remove unnecessary permissions
  • Documentation: Maintain clear documentation of roles, responsibilities, and access patterns
  • Training: Ensure team members understand IAM concepts and best practices

Performance and Efficiency

  • Role Optimization: Use predefined roles when possible; create custom roles only when necessary
  • Hierarchical Design: Organize resources to minimize the number of policies needed
  • Condition Complexity: Balance security needs with performance impact of complex conditions
  • Monitoring Overhead: Configure monitoring appropriately without excessive noise

In the next chapter, we dive into the “Vines” of the cloud: GCP Networking and the Global VPC.

Interview Preparation

Answer: The Principle of Least Privilege (PoLP) is the security concept of giving a user or service only the minimum permissions required to do its job.
Implementation in GCP:
  1. Custom Roles: Instead of roles/editor, create a custom role with only the specific permissions needed (e.g., compute.instances.start).
  2. Granular Scopes: Use predefined roles that are service-specific (e.g., roles/storage.objectViewer instead of roles/storage.admin).
  3. IAM Recommender: Use this tool to identify over-privileged accounts based on actual usage, then review and apply its role recommendations.
  4. IAM Conditions: Grant permissions that are limited by time, resource name, or request context (like IP address).
Answer:
  • GSA: An IAM identity in GCP used by services (VMs, Cloud Run) to access other GCP resources. It uses JSON keys or the metadata server.
  • KSA: An identity inside a Kubernetes cluster used by pods to talk to the K8s API. It has no inherent permissions in GCP.
  • Workload Identity: It maps a KSA to a GSA. When a pod uses that KSA, the GKE metadata server returns a short-lived token for the linked GSA, allowing the pod to access GCP resources (like Cloud Storage) without an exported JSON key.
Answer: I follow a systematic process:
  1. Policy Troubleshooter: Input the user’s email, the permission (compute.instances.start), and the full resource name. This tool checks the entire hierarchy (Org, Folder, Project) and shows which policies grant the permission and which do not.
  2. Check Inheritance: Confirm that nobody expects a role granted at a higher level to be removed at a lower level; allow policies are additive.
  3. Verify Identity: Ensure the user is authenticated with the correct account (especially if using multiple accounts in the same browser).
  4. Cloud Audit Logs: Check the logs to see the raw denied event, which often includes details about the specific resource or condition that failed.
Answer: User-managed JSON keys are long-lived (they do not expire by default) and portable. If pushed to GitHub or stolen from a dev machine, they give an attacker access that lasts until the key is deleted.
Alternatives:
  • Workload Identity Federation: For GitHub Actions, GitLab, or AWS, use OIDC tokens to exchange for temporary GCP tokens.
  • Instance Service Accounts: For VMs, use the attached service account and the metadata server.
  • Short-lived Tokens: Use service account impersonation, for example gcloud auth print-access-token --impersonate-service-account=SA_EMAIL, for temporary sessions.
Answer: Not through allow policies. Allow policies are additive: a role granted at a higher level (e.g., Folder) is inherited by every resource below it (e.g., Project), and a lower-level allow policy cannot take it away.
Workaround: To block access that is granted higher up, use an IAM deny policy, which is evaluated before allow policies and takes precedence over them. IAM Conditions can narrow a grant based on attributes, and VPC Service Controls can block access to services from outside a perimeter regardless of IAM permissions.