The hidden trap of IAM Role recreation in Terraform: Understanding the AWS Unique Principal ID
Maialen | October 19, 2025As Cloud Engineers, we often manage our infrastructure with the confidence that comes from Infrastructure as Code (IaC). When we define an AWS IAM Role in Terraform, we rely on its name and ARN as a stable identifier.
This leads to a common and dangerous assumption: If we delete a role and immediately recreate a new one with the exact same name, policies and ARN, everything should just pick up where it left off, right?
Wrong.
This assumption can lead to a production nightmare, especially with critical resources like AWS Glue. The reason lies in a security feature that is often invisible until it breaks things: the Unique Principal ID. This post dives deep into why this happens and, most importantly, how to prevent it.
The core technical deep dive: ARN vs. Unique Principal ID
Every IAM Identity (User, Role, or federated identity) in AWS has three key identifiers:
- Role Names: What we define in Terraform (e.g.,
glue-service-role). - ARN (Amazon Resource Name): The global unique identifier (e.g.,
arn:aws:iam::1234567890:role/glue-service-role). - Unique Principal ID: An immutable, opaque ID assigned by AWS at creation (e.g.,
AROAJND24IEWKTX6BABJI).
When you reference an IAM Role, whether in a policy or a service configuration, AWS almost always translates that reference and stores the Unique Principal ID behind the scenes.
Here’s where the problem splits into two subtly different failure scenarios:
Failure 1: Resource-Based Policies (S3, KMS, Trust Policies)
When you apply a policy (like an S3 Bucket Policy) that grants access to your role’s ARN (arn:aws:iam::...:role/glue-service-role), the IAM service does two things:
- It validates that the ARN exists.
- It resolves that ARN to its
Unique Principal ID(e.g.,ID-A) and stores that ID in the policy.
If you later delete the role, the S3 Bucket Policy still exists, but it now points to an orphaned ID-A. In the AWS console, you’ll often see a “Principal not found” error or just the raw AROA... ID instead of the role name.
If you recreate the role (getting a new ID-B), the S3 policy does not fix itself. It still trusts the original ID-A. The policy is only repaired when you re-apply (terraform apply) it. That “update” action (even if the policy text didn’t change) forces AWS to re-validate the ARN, find the new ID-B, and update its internal reference.
Failure 2: Direct Service Associations (Glue Jobs, EC2 Instance Profiles)
This is the Glue scenario. Here, it’s not an S3 policy that fails, but the Glue Job itself. The Glue Job is configured to “Use Role X”. Internally, the Glue service registers that “X” corresponds to the Unique Principal ID ID-A.
When the role is deleted, the Glue Job’s configuration still points to ID-A, which no longer exists. When you recreate the role with the new ID-B, the Glue Job has no knowledge of this new role. It’s still looking for the original ID-A.
The symptom is different: the job fails on startup with an “Access Denied” error when trying to assume the role it was assigned, because the role it thinks it should assume (ID-A) is gone.
The “Why”: This is a Critical Security Feature, Not a Bug
This frustrating behavior is actually one of IAM’s most important security features. It’s designed to protect you from a dangerous “confused deputy” or privilege escalation attack.
The Attack Scenario: Imagine if permissions were tied only to the role name.
- You delete a highly-privileged role (e.g.,
prod-admin).- A malicious actor (or a simple accident) creates a new, unprivileged role with the exact same name.
- That new role would instantly and automatically inherit all the old
prod-adminpermissions to your secret S3 buckets and KMS keys.
The Unique Principal ID is the solution to this.
It acts as a unique, un-forgeable “serial number.” When you grant permissions, you aren’t trusting the name prod-admin - you are trusting the specific instances of that role (ID-A).
When you delete that role, ID-A is destroyed forever. The new role (ID-B) is a completely separate entity that must earn trust on its own. The “Access Denied” error is AWS correctly protecting you from an unknown principal.
Mitigation Strategies
Strategy 1: Prevention Using lifecycle
First of all, it is necessary to assess whether role recreation is necessary or can be avoided. For a mission-critical resource like an IAM Role, the number one best practice is to prevent its accidental destruction entirely. Terraform allows you to enforce this using the prevent_destroy = true lifecycle meta-argument:
resource "aws_iam_role" "glue_execution_role" {
name = "glue-service-role"
// ... other configuration ...
lifecycle {
prevent_destroy = true
}
}
With this in place, any change that would normally force a replacement will fail the terraform plan with a clear error, giving you a chance to re-evaluate the change and decide whether manual intervention is truly necessary.
Strategy 2: The Recovery Plan (If recreation is inevitable)
Sometimes, the recreation of the role is a necessary evil (e.g., moving the role to a new state file, or fixing an underlying structural issue). When you know a new Unique Principal ID is coming, your only option is to update all dependent resources.
This is where the engineering discipline comes in. We will consider two ways to resolve this, which must be considered on a case-by-case basis.
The Terraform-Native Approach (Leveraging the Dependency Graph)
You are right to ask: “How does this work if the role name - and therefore the ARN string - remains the same?”
The trick is not that the value of the ARN changes, but that we must force an explicit dependency in Terraform’s resource graph. This forces Terraform to update the dependent resources, which in turn forces AWS to re-validate the principal.
Here’s how to do it correctly.
Avoid this (Brittle) Using a hardcoded ARN string. Terraform sees no relationship between the S3 policy and the IAM role.
// AVOID THIS (Brittle):
data "aws_iam_policy_document" "s3_policy_doc" {
statement {
principals {
type = "AWS"
// Hardcoded ARN string - Terraform sees no dependency on the role
identifiers = ["arn:aws:iam::1234567890:role/glue-service-role"]
}
// ...
}
}
Do this (Robust) Use a direct reference to the aws_iam_role resource. This tells Terraform that the policy depends on the role.
// DO THIS (Robust):
resource "aws_iam_role" "glue_execution_role" {
name = "glue-service-role"
// ...
}
data "aws_iam_policy_document" "s3_policy_doc" {
statement {
principals {
type = "AWS"
// Direct reference - Terraform now sees the dependency
identifiers = [aws_iam_role.glue_execution_role.arn]
}
// ...
}
}
resource "aws_s3_bucket_policy" "my_policy" {
bucket = "my-bucket"
policy = data.aws_iam_policy_document.s3_policy_doc.json
}
By creating this hard dependency, Terraform understands the relationship. When the aws_iam_role is replaced (“Destroy / Create”), Terraform’s plan will automatically include an update for aws_s3_bucket_policy.my_policy.
This update action (a PutBucketPolicy API call) is the magic. Even though the policy text is identical, the API call forces AWS to re-validate the (unchanged) ARN string, discover the new Unique Principal ID (ID-B), and repair the trust relationship.
The Scripted Solution
If we encounter many difficulties doing so using Terraform’s native method, another more direct way, though less “Terraform-native,” is to implement a recreation script as part of your deployment process.
-
Get the New ARN: After the new role is created, capture its new ARN (which includes its new Unique Principal ID).
-
Scripted Updates: Use the AWS CLI or an SDK (e.g., Python Boto3) to iterate through all known dependent resources (Glue Jobs, S3 Bucket Policies, KMS Key Policies, etc.).
-
Force the Re-validation (The Two-Step Apply): This is the most important part. You cannot simply “apply” the new role’s ARN, because the ARN string is identical to the old one. The AWS API (e.g., for Glue or Lambda) will likely detect no change and do nothing, failing to register the new
Unique Principal ID.To force the update, your script must perform a “two-step jiggle”:
-
Step 3a: Apply a Temporary Role: First, update dependent resources (like a Glue Job) to point to a different, temporary, valid IAM role (e.g.,
dummy-update-role). This forces a real change in the service’s configuration. -
Step 3b: Re-apply the Correct Role: Immediately after the first update succeeds, run a second update to set the resource back to the newly-created role (e.g.,
glue-service-role).
This two-step process guarantees that the AWS service is forced to re-evaluate the ARN string and, in doing so, discover and store the new
Unique Principal IDof the recreated role. -
This approach ensures you can quickly roll through all dependencies.
Key Takeaway
The core message to remember is: The stability of an IAM Role is not defined by its Name or ARN, but by its immutable Unique Principal ID.
As advanced practitioners, we must be aware of this hidden mechanism to prevent unexpected outages. Always consider using lifecycle { prevent_destroy = true } on critical roles to turn a silent, destructive action into an audible, safe failure in your CI/CD pipeline.