Talos Vulnerability Report

TALOS-2021-1347

Microsoft Azure Sphere Pluton concurrent syscalls denial of service vulnerability

November 9, 2021

Summary

A denial of service vulnerability exists in the Pluton syscalls functionality of Microsoft Azure Sphere 21.01, 21.06 and 21.07. A specially-crafted set of syscalls executed in parallel by an unprivileged process can lead to the crash of Pluton, resulting in a device reboot (denial of service).

Tested Versions

Microsoft Azure Sphere 21.01
Microsoft Azure Sphere 21.06
Microsoft Azure Sphere 21.07

Product URLs

https://azure.microsoft.com/en-us/services/azure-sphere/

CVSSv3 Score

6.2 - CVSS:3.0/AV:L/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H

CWE

CWE-362 - Concurrent Execution using Shared Resource with Improper Synchronization (‘Race Condition’)

Details

Microsoft’s Azure Sphere is a platform for the development of internet-of-things applications. It features a custom SoC consisting of a set of cores that run both high-level and real-time applications, enforce security and manage encryption (among other functions). The high-level applications execute on a custom Linux-based OS, with several modifications to make it smaller and more secure, specifically for IoT applications.

To facilitate communication between the normal Linux kernel running on the Cortex A7 and the Pluton subsystem on the Cortex M4, there exists the /dev/pluton kernel driver, which is accessible to any user on the system. It implements only a few file operations (open, read, poll, close and ioctl), but exposes a fair number of ioctl requests for interacting with Pluton:

#define PLUTON_GET_SECURITY_STATE _IOWR('p', 0x01, struct azure_sphere_get_security_state_result)
#define PLUTON_GENERATE_CLIENT_AUTH_KEY _IOWR('p', 0x06, uint32_t)
#define PLUTON_COMMIT_CLIENT_AUTH_KEY _IOWR('p', 0x07, uint32_t)
#define PLUTON_GET_TENANT_PUBLIC_KEY _IOWR('p', 0x08, struct azure_sphere_ecc256_public_key)
#define PLUTON_PROCESS_ATTESTATION _IOWR('p', 0x09, struct azure_sphere_attestation_command)
#define PLUTON_SIGN_WITH_TENANT_ATTESTATION_KEY _IOWR('p', 0x0A, struct azure_sphere_ecdsa256_signature)
#define PLUTON_SET_POSTCODE _IOWR('p', 0x0B, uint32_t)
#define PLUTON_GET_BOOT_MODE_FLAGS _IOWR('p', 0x0C, struct azure_sphere_boot_mode_flags)
#define PLUTON_IS_CAPABILITY_ENABLED _IOWR('p', 0x0D, struct azure_sphere_is_capability_enabled)
#define PLUTON_GET_ENABLED_CAPABILITIES _IOR('p', 0x0E, struct azure_sphere_get_enabled_capabilities)
#define PLUTON_SET_MANUFACTURING_STATE _IOW('p', 0x0F, struct azure_sphere_manufacturing_state)
#define PLUTON_GET_MANUFACTURING_STATE _IOR('p', 0x10, struct azure_sphere_manufacturing_state)
#define PLUTON_DECODE_CAPABILITIES _IOR('p', 0x11, struct azure_sphere_decode_capabilities_command)
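
The request numbers above are ordinary Linux ioctl encodings: the _IOWR/_IOR/_IOW macros pack the transfer direction, the argument size, the magic character 'p' and the command number into a single integer. As an illustration, the following sketch decodes two of the requests whose argument type (uint32_t) is fully known from the list. The names decode_ioctl and struct ioctl_fields are hypothetical helpers introduced here, not part of the driver.

```c
#include <linux/ioctl.h>
#include <stdint.h>

/* Two requests copied from the list above; both carry a uint32_t payload. */
#define PLUTON_GENERATE_CLIENT_AUTH_KEY _IOWR('p', 0x06, uint32_t)
#define PLUTON_SET_POSTCODE             _IOWR('p', 0x0B, uint32_t)

/* Hypothetical helper type: the fields packed into an ioctl request number. */
struct ioctl_fields {
    char type;         /* driver "magic" character */
    unsigned int nr;   /* command number within the driver */
    unsigned int size; /* sizeof() of the argument type */
    int reads;         /* nonzero if userland reads data back from the kernel */
    int writes;        /* nonzero if userland passes data to the kernel */
};

/* Split a request number into its components using the kernel's macros. */
static struct ioctl_fields decode_ioctl(unsigned long request)
{
    struct ioctl_fields f;

    f.type = (char)_IOC_TYPE(request);
    f.nr = _IOC_NR(request);
    f.size = _IOC_SIZE(request);
    f.reads = (_IOC_DIR(request) & _IOC_READ) != 0;
    f.writes = (_IOC_DIR(request) & _IOC_WRITE) != 0;
    return f;
}
```

For example, decode_ioctl(PLUTON_SET_POSTCODE) yields the type 'p', the command number 0x0B and a 4-byte argument that is transferred in both directions.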

Out of these pluton ioctls, we just need to pick one that we have the capabilities for, since most of these ioctls are protected by specific AZURE_SPHERE_CAP_* capabilities.

Let’s take for example the PLUTON_SIGN_WITH_TENANT_ATTESTATION_KEY ioctl:

///
/// PLUTON_SIGN_WITH_TENANT_ATTESTATION_KEY message handler
///
/// @arg - ioctl buffer
/// @data - file data for FD
/// @async - is the FD in async mode
/// @returns - 0 for success
int pluton_sign_with_tenant_attestation_key(void __user *arg) {
    u32 ret = 0;
    struct azure_sphere_syscall args = {0};
    struct azure_sphere_task_cred *tsec;
    struct azure_sphere_tenant_id tenant_id;
    struct azure_sphere_digest digest;
    struct azure_sphere_ecdsa_signature signature;

    // no runtime permission check

    ret = copy_from_user(&digest, arg, sizeof(digest));
    if (unlikely(ret)) {
        return ret;
    }

    // copy out the tenant id
    tsec = current->cred->security;
    memcpy(&tenant_id, tsec->daa_tenant_id, sizeof(tenant_id));

    args.number = PlutonSyscallSignWithTenantKey;
    args.flags = MakeFlagsForArg(0, Input | Reference)
        | MakeFlagsForArg(1, Input)
        | MakeFlagsForArg(2, Input | Reference)
        | MakeFlagsForArg(3, Input)
        | MakeFlagsForArg(4, Output | Reference)
        | MakeFlagsForArg(5, Input);
    args.args[0] = (uintptr_t)&tenant_id;
    args.args[1] = sizeof(tenant_id);
    args.args[2] = (uintptr_t)&digest;
    args.args[3] = sizeof(digest);
    args.args[4] = (uintptr_t)&signature;
    args.args[5] = sizeof(signature);

    ret = azure_sphere_pluton_execute_syscall(&args, false);  // [1]

    // no data sent back on err
    if (!ret) {
        ret = copy_to_user(arg, &signature, sizeof(signature));
    }

    return ret;
}

Generally all the Pluton ioctls follow a specific pattern like this, so it’s not really worth diving too deeply. The main part we actually care about (which is still common to each pluton ioctl) is the azure_sphere_pluton_execute_syscall function at [1]. This function ends up sending a formatted message to the Pluton core over a shared ring buffer. When Pluton is done processing, it utilizes the same shared buffer to send a message back to the Linux kernel, as one might expect.

Eventually, the function pluton_remote_api_send_impl will be called via pluton_remote_api_send:

///
/// Sends a packet of data to Pluton and blocks on a response
///
/// @message - message pointer
/// @returns - 0 for success
int pluton_remote_api_send(uintptr_t message)
{
   struct completion c;
   int ret;
   u32 time_remaining = 0;

   struct pluton_relay_completion_data response_data = {
       .completion = &c
   };

   init_completion(&c);

   // Send command
   ret = pluton_remote_api_send_impl(message, response_data);
   if (ret != SUCCESS) {
       return ret;
   }

   // Requests should get responses in a small amount of time (usec or msec)
   // We put a timeout here to catch cases where a response never comes.
   // This is always unexpected but we WARN and return so we can recover.
   time_remaining =
       wait_for_completion_timeout(&c, msecs_to_jiffies(COMPLETION_TIMEOUT_IN_MS));

   // The wait completion routine can internally return a negative number if we receive a fatal/cancel signal.  Converted
   // to an unsigned value, it's large.  Warn on it here but don't specifically try to cancel a response request since there
   // would be a larger scope clean-up happening anyways.
   WARN_ON(time_remaining > msecs_to_jiffies(COMPLETION_TIMEOUT_IN_MS));

   // Warn if a time-out occurs.
   WARN_ON(time_remaining == 0);

   if (time_remaining == 0) {
       // Cancel the transfer response since it may not come.  If we fail to cancel the transfer, we can't exit otherwise
       // we risk a memory fault should the completion handler fire after we've returned.
       if (pluton_remoteapi_cancel_response_from_pluton(message) == SUCCESS)
       {
           dev_dbg(g_state.provider->dev,
               "Successfully cancelled M4 response request following a wait timeout.\n");

           return -ETIMEDOUT;
       }

       dev_dbg(g_state.provider->dev,
               "Failed to cancel M4 response request following a wait timeout, waiting for completion...\n");

       // Wait without time-out.
       wait_for_completion(&c);
   }

   return SUCCESS;
}

///
/// Sends a message to Pluton
///
/// @message - pointer to message to send
/// @returns -  0 for success
static int pluton_remote_api_send_impl(uintptr_t message, 
   struct pluton_relay_completion_data response_data)
{
   int ret = SUCCESS;
   struct pluton_relay_management_info *mgmt_info = NULL;

   // Validate handle
   if (g_state.provider == NULL) {
       printk(KERN_ERR "RemoteAPI provider not available\n");
       return -EINVAL;
   }

   // Get a free relay info structure to set up
   mgmt_info = pluton_remote_api_get_free_relay_management_info();
   if (mgmt_info == NULL) {
       dev_err(g_state.provider->dev, "Could not allocate relay info in "
                        "pluton_remote_api_send");
       return -ENOMEM;
   }

   // build out relay structure
   mgmt_info->message = message;
   mgmt_info->response_data = response_data;

   // Send the message
   ret = g_state.provider->send_message(message); // [2]

   if (ret != SUCCESS) {
       dev_err(g_state.provider->dev,
           "Failed to send Pluton command %p with error: %d", (void *)message, ret);

       pluton_remote_api_free_management_info(mgmt_info);
   }

   return ret;
}

We can see that function pluton_remote_api_send relays the message to pluton_remote_api_send_impl and implements timeouts to make sure Pluton calls don’t stall and that the ioctl returns after the call has been completed.
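
The timeout handling in pluton_remote_api_send has three outcomes: the response arrives in time, the wait times out and the pending response is cancelled, or the cancel fails and the caller blocks until the response finally arrives. The following userland sketch mirrors only that control flow. It is an analogy built on a pipe and poll(), not kernel code: struct completion_pipe, completion_signal, completion_wait and wait_for_response are hypothetical names, and the cancel_ok parameter stands in for the result of the cancel call.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical userland stand-in for the kernel's struct completion. */
struct completion_pipe {
    int fds[2]; /* fds[0]: waiter side, fds[1]: signaller side */
};

static int completion_init(struct completion_pipe *c)
{
    return pipe(c->fds);
}

/* Analogue of complete(): record that the response has arrived. */
static int completion_signal(struct completion_pipe *c)
{
    char byte = 1;

    return write(c->fds[1], &byte, 1) == 1 ? 0 : -1;
}

/*
 * Analogue of wait_for_completion_timeout(): returns 1 if a signal was
 * consumed, 0 on timeout. A negative timeout_ms waits indefinitely.
 * Note: an interrupted poll() is simply restarted with the full timeout.
 */
static int completion_wait(struct completion_pipe *c, int timeout_ms)
{
    struct pollfd pfd = { .fd = c->fds[0], .events = POLLIN };
    char byte;
    int r;

    do {
        r = poll(&pfd, 1, timeout_ms);
    } while (r < 0 && errno == EINTR);

    if (r > 0 && read(c->fds[0], &byte, 1) == 1)
        return 1;
    return 0;
}

/* Same decision structure as the code that follows the send call. */
static int wait_for_response(struct completion_pipe *c, int timeout_ms,
                             int cancel_ok)
{
    if (completion_wait(c, timeout_ms))
        return 0;            /* response arrived in time */
    if (cancel_ok)
        return -ETIMEDOUT;   /* cancelled: safe to return to the caller */
    completion_wait(c, -1);  /* cancel failed: wait without a timeout */
    return 0;
}
```

With no signal pending, wait_for_response(&c, 10, 1) returns -ETIMEDOUT after roughly 10 ms. After a call to completion_signal(&c), the same call returns 0.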

At [2] the message is actually sent via the send_message function. When this code is running on the device, the provider is mt3620_pluton_provider, which means it’s sending messages via the mt3620_send_message function [3]:

static struct pluton_remoteapi_provider mt3620_pluton_provider = {
   .send_message = mt3620_send_message                            // [3]
};

static int mt3620_send_message(uintptr_t message)
{
   int ret = SUCCESS;
   struct mt3620_mailbox_data *mailboxData = NULL;

   // Wrap it all into our final mailbox structure
   mailboxData = kmalloc(sizeof(*mailboxData), GFP_KERNEL);
   if (mailboxData == NULL) {
       ret = -ENOMEM;
       goto exit;
   }

   mailboxData->data = 0;
   mailboxData->cmd = message;

   // Send the message
   ret = mbox_send_message(g_pluton->event_channel, mailboxData); // [4]
   if (ret > 0) {
       // Returns > 0 for success, map it back to 0 so the caller isn't
       // confused
       ret = SUCCESS;
   }

exit:
   if (ret != SUCCESS) {
       if (mailboxData != NULL) {
           kfree(mailboxData);
       }
   }

   return ret;
}

Inside this function, the message is actually sent via mbox_send_message. This is a function defined in the Linux kernel in drivers/mailbox/mailbox.c, without any modification from Azure Sphere, meaning that it’s using Linux mailboxes.

The same functionality, however, must also be implemented on the Pluton side so that it can talk to the Linux side.

Despite the timeout handling in pluton_remote_api_send, we noticed that it is possible to race Pluton ioctls and cause a denial of service from a userland process. This only happens when certain Pluton syscalls are called in parallel. For example, if two processes call GetTenantPublicKey at the same time, the device will crash. This won’t happen if one process calls GetTenantPublicKey and the other one calls GetSecurityState.
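
As a rough illustration of this access pattern, the following sketch forks several processes that each open the device and issue the same ioctl in a loop. It is a minimal sketch under stated assumptions, not the proof of concept used for this report: run_parallel_ioctl is a hypothetical helper, the O_RDWR open mode and the iteration count are assumptions, and the request number and buffer size are left as parameters because the argument struct layouts are not shown here.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { MAX_ARG_LEN = 256 }; /* assumed upper bound on the ioctl argument */

/*
 * Fork `nprocs` children that each open `path` and issue `request`
 * `iterations` times, so that the calls overlap in time. Returns the
 * number of children that managed to open the device, or -1 if the
 * arguments are invalid or fork() failed.
 */
static int run_parallel_ioctl(const char *path, unsigned long request,
                              size_t buf_len, int nprocs, int iterations)
{
    int forked = 0;
    int opened = 0;

    if (buf_len > MAX_ARG_LEN)
        return -1;

    for (int i = 0; i < nprocs; i++) {
        pid_t pid = fork();

        if (pid < 0)
            break;
        if (pid == 0) {
            unsigned char buf[MAX_ARG_LEN];
            int fd = open(path, O_RDWR); /* open mode is an assumption */

            if (fd < 0)
                _exit(1);
            for (int n = 0; n < iterations; n++) {
                memset(buf, 0, sizeof(buf));
                /* Individual failures are ignored: only the overlap matters. */
                if (ioctl(fd, request, buf) < 0)
                    continue;
            }
            close(fd);
            _exit(0);
        }
        forked++;
    }

    /* Reap every child and count those that reached the ioctl loop. */
    for (int i = 0; i < forked; i++) {
        int status;

        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            opened++;
    }
    return forked == nprocs ? opened : -1;
}
```

On a device, one would pass "/dev/pluton", the PLUTON_GET_TENANT_PUBLIC_KEY request and sizeof(struct azure_sphere_ecc256_public_key), with the struct definition taken from the Azure Sphere kernel headers and nprocs set to 2. If the device node does not exist, every child exits early and the function returns 0.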

If we collect the telemetry info after a crash, we can see an empty log:

00000000  00 00 12 00 00 00 00 00  00 00 00 00 ff ff ff ff  |................|
00000010  ff ff ff ff e3 04 01 00  00 00 05 00 00 90 70 04  |..............p.|
00000020  c0 03 01 00 00 00                                 |......          |

Moreover, we will see this log via UART:

[1BL] BOOT: 70e00000/00000008/01020000
[PLUTON] Logging initialized
[PLUTON] Booting HLOS core

Normally, if the crash happened on the Linux or Security Monitor side, we’d see a “[PLUTON] HLOS Watchdog Reset” message. Since there’s no such message, this means the reboot is initiated by Pluton.
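
This check is easy to automate when many reboots have to be triaged. The sketch below searches a captured UART log for the watchdog message quoted above. The function and enum names are hypothetical, and the absence of the message is only treated as a hint that Pluton initiated the reset, not as proof.

```c
#include <string.h>

/* Hypothetical classification of a reboot based on the UART boot log. */
enum reset_origin {
    RESET_BY_HLOS_WATCHDOG,    /* crash on the Linux or Security Monitor side */
    RESET_BY_PLUTON_OR_UNKNOWN /* no watchdog message: likely Pluton itself */
};

/* Look for the watchdog marker that Pluton prints after an HLOS crash. */
static enum reset_origin classify_boot_log(const char *uart_log)
{
    if (strstr(uart_log, "[PLUTON] HLOS Watchdog Reset") != NULL)
        return RESET_BY_HLOS_WATCHDOG;
    return RESET_BY_PLUTON_OR_UNKNOWN;
}
```

Applied to the three log lines shown above, classify_boot_log returns RESET_BY_PLUTON_OR_UNKNOWN, which matches the conclusion drawn here.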

After we discussed the issue with Microsoft, their investigation concluded that we’re hitting a rate limit, which is handled with a reboot. Even though this is expected behavior, an app that needs the ability to reboot the device should declare a PowerControls section in its app_manifest. Since this issue allows an attacker to trigger a reboot without such permission (in fact, it can be reproduced by an unprivileged app without any special manifest option), we think this still counts as a denial of service.

Timeline

2021-07-21 - Vendor Disclosure
2021-11-09 - Public Release

Credit

Discovered by Claudio Bozzato and Lilith >_> of Cisco Talos.
