Application Insights (AI) is a useful way of analyzing your application’s telemetry. Its lightning-fast queries make it ideal for analyzing historical data, but what happens when you start to bump into its limits? The default retention for data is 90 days, and that can be increased (for a fee) to 2 years, but what happens when that’s not enough? If you query too much, or too often, you may get throttled. When you start to bump into these limits, where can you go?
The answer lies in the fact that Application Insights is backed by Azure Data Explorer (ADX or Kusto). Moving your AI data to a full ADX cluster will allow you to continue using AI to collect data, and even to analyze recent data, but the ADX cluster can be sized appropriately and used when the AI instance won’t scale. The fact that it is using the same engine and query language as AI means that your queries can continue to work. This article describes a pattern for doing this.
Requirements
We’ll be working with several Azure components to create this solution. In addition to your AI instance, these components are:
- Azure Data Explorer cluster
- Azure Storage Account
- Azure Event Hubs namespace with at least one Event Hub
- Azure Event Grid
The procedure can be broken down into a series of steps:
- Enable Continuous Export from AI
- Create an Event Grid subscription in the storage account
- Create an ADX database and ingestion table
- Create an Ingestion rule in ADX
- Create relevant query tables and update policies in the ADX database
Enable Continuous Export from Application Insights
AI will retain data for up to 2 years, but for archival purposes, it provides a feature called “Continuous Export”. When this feature is configured, AI will write out any data it receives to Azure Blob storage in JSON format.
To enable this, open your AI instance, and scroll down to “Continuous Export” in the “Configure” section. Any existing exports will show here, along with the last time data was written. To add a new destination, select the “Add” button.

You will then need to select which AI data types to export. For this example, we will only be using Page Views, although multiple types can be selected.

Next, you need to select your storage account. First select the subscription (if different from your AI instance), and then select the storage account and container. You will need to know what data region the account is in. Once selected, save the settings.
Initially, the “Last Export” column will display “Never”, but once AI has collected some data, it will be written out to your storage container, and the “Last Export” column will display when that occurred. At that point, you should be able to open your storage account using Storage Explorer and browse to the container to see the output. In the root of the container selected above, you’ll see a folder named with the AI instance name and the AI instance GUID.
Opening that folder, you’ll find a folder for each data type selected above (if there has been data for them). Each data type is further organized into folders named for the day and the hour. Within these are multiple files with the .blob extension. These are multiline JSON files and can be downloaded and opened with a simple text editor.
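As an illustration, a single exported Page Views blob might end up at a path like the one below; the instance name, GUID, dates, and file name here are purely hypothetical and will differ for your account.
myappinsights_00000000-0000-0000-0000-000000000000/PageViews/2023-05-14/09/00000000-0000-0000-0000-000000000000_0.blob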
The next step is to raise an event whenever new content is added to this storage container.
Create an Event Grid subscription in the storage account
Prior to this step, ensure that you have created, or have available, an Event Hubs namespace and an Event Hub. You will connect to this hub in this step.
From the Azure portal, open the storage account and then select the “Events” node. Then click the “Event Subscription” button at the top.

On the following screen, you’ll need to provide a name and schema for the subscription. The name can be whatever you wish, and the schema should be “Event Grid Schema”. In the Topic Details section, you provide a topic name, which will pertain to all subscriptions for this storage account. In the “Event Types” section, you select the types of actions that will fire an event; for our purposes, all we want is “Blob Created”. With this selection, the event will fire every time a new blob is added to the container. Finally, under “Endpoint Details”, select “Event Hubs” from the dropdown, then click “Select an endpoint” and choose your Event Hub.

Once created, an event will fire any time a blob is created in this storage account. If you wish to restrict this to specific folders or containers, you can select the Filters tab and create a subject filter to restrict it to specific file types, containers, etc. More information can be found in the Event Grid filters documentation. In our case, we do not need a filter.
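As a hypothetical example, setting “Subject Begins With” to /blobServices/default/containers/<your-container-name>/ would limit events to blobs created in a single container, and setting “Subject Ends With” to .blob would limit them to the exported files; substitute your own container name if you go this route.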
When ready, click the “Create” button, and the Event subscription will be created. It can be monitored from the storage account, and also from the Event Hub. As new blobs are added to the storage account, more events will fire.
Create an ADX database and ingestion table
From the Azure portal, navigate to your ADX cluster and either select a database or create a new one. Once the database has been created, you need to create at least one table to store the data. Ultimately, Kusto will ingest data from the blobs added above whenever they are added, and you need to do some mapping to get that to work properly. For debugging purposes, I find it useful to create an intermediate ADX table to receive data from the blobs, and then transform the data afterward.
In this case, the intermediate table will have a single column, Body, that will contain the entirety of each ingested record. To create this table, run the following KQL command on your database:
.create-merge table Ingestion (Body: dynamic)
The dynamic data type in ADX can work with JSON content, and each record will go there. For this to work, you also need to add a mapping to the table. The mapping can be very complex, but in our case, we’re doing a simple load in, so we’re matching the entire JSON record to the Body column in our database. To add this mapping, run the following KQL command:
.create table Ingestion ingestion json mapping "RawInput" '[{"column": "Body", "Properties": {"Path": "$"}}]'
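As an optional sanity check, you can list the ingestion mappings on the table and confirm that “RawInput” appears:
.show table Ingestion ingestion json mappings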
At this point, we are ready for an ingestion rule.
Create an Ingestion rule in ADX
From the Azure portal, open your ADX cluster, and select the “Databases” node in the “Data” section, then click on your database.

The setting that we need is “Data ingestion” in the resulting window. Selecting that takes you to the ingestion rules. Now you want to create a new connection by selecting the “Add data connection” button.
The first selection is the data connection type. The options are Event Hub, Blob storage, or IoT Hub; we need to select Blob storage. Both it and Event Hub will connect to an Event Hub, but the difference is that with “Blob storage”, the contents of the blobs will be delivered, while “Event Hub” only delivers the metadata of the blob being added.
Once the type is selected, give the connection a name, and choose the Event Grid subscription to connect to (the one that you created above) and the event type. Next, select “Manual” in the Resources creation section. Selecting “Automatic” will create a new Event Hubs namespace, hub, and Event Grid subscription, and you won’t have any control over the naming of these resources; selecting “Manual” keeps the naming under your control. Select your Event Grid subscription here.

Next, select the “Ingest properties” tab, and provide the table (“Ingestion”) and mapping (“RawInput”) that you created above. Also, you need to select “MULTILINE JSON” as the data format.

Once these values are complete, press the Create button and the automatic ingestion will commence. Adding a new blob to the storage account will fire an event, which will cause ADX to load the contents of the blob into the Body column of the Ingestion table. This process can take up to 5 minutes after the event fires.
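If data doesn’t appear after that, a useful troubleshooting step is to ask the cluster for recent ingestion failures, which should include details about any blob it could not load:
.show ingestion failures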
Create relevant query tables and update policies in the ADX database
Once ingestion happens, your “Ingestion” table should have records in it. Running a simple query in ADX using the table name should show several records with data in the “Body” column. Opening a record will show the full structure of the JSON contained within. If records with different schemas are being imported, a query filter can be employed to limit the results to only the relevant records.
For example, the pageViews table in AI will always contain a JSON node named “view”. The query below will return only pageView data from the Ingestion table:
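Ingestion | where isnull(Body.view) == false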

This Ingestion table can be queried in this manner going forward, but for performance and usability reasons, it is better to “materialize” the views of this table. To do this, we create another table and set an update policy on it that will add relevant rows to it whenever the Ingestion table is updated.
The first step is to create the table. In our case, we want to replicate the schema of the pageViews table in Application Insights, because we want to be able to reuse any queries that we have already built against AI. All that should be necessary is to change the source of those queries to the ADX cluster/database. To create a table with (mostly) the same schema as the AI pageViews table, the following command can be executed in ADX:
.create table pageViews (
    timestamp: datetime,
    ['id']: string,
    name: string,
    url: string,
    duration: real,
    performanceBucket: string,
    customDimensions: dynamic,
    customMeasurements: dynamic,
    operation_Name: string,
    operation_Id: string,
    operation_ParentId: string,
    operation_SyntheticSource: string,
    session_Id: string,
    user_Id: string,
    user_AuthenticatedId: string,
    user_AccountId: string,
    application_Version: string,
    client_Type: string,
    client_Model: string,
    client_OS: string,
    client_IP: string,
    client_City: string,
    client_StateOrProvince: string,
    client_CountryOrRegion: string,
    client_Browser: string
)
Once the table is created, we need to create a query against the Ingestion table that will return pageViews records in the schema of the new table. Without getting deep into the nuances of the KQL language, a query that will do this is below:
Ingestion
| where isnull(Body.view) == false
| extend view = Body.view
| mvexpand view
| extend performancems = view.durationMetric.value / 10000
| extend a = trim_end("\\]", trim_start("\\[", tostring(Body.context.custom.dimensions)))
| extend b = replace('"}', '"', replace('{"', '"', a))
| extend c = todynamic(strcat('{', b, '}'))
| extend d = trim_end("\\]", trim_start("\\[", tostring(Body.context.custom.metrics)))
| extend e = replace('"}', '"', replace('{"', '"', d))
| extend f = todynamic(strcat('{', e, '}'))
| project
    timestamp = todatetime(Body.context.data.eventTime),
    id = tostring(Body.internal.data.id),
    name = tostring(view.name),
    url = tostring(view.url),
    duration = toreal(performancems),
    performanceBucket = case(
        performancems < 250, "<250ms",
        performancems < 500, "250ms-500ms",
        performancems < 1000, "500ms-1sec",
        performancems < 3000, "1sec-3sec",
        performancems < 7000, "3sec-7sec",
        performancems < 15000, "7sec-15sec",
        performancems < 30000, "15sec-30sec",
        performancems < 60000, "30sec-1min",
        performancems < 120000, "1min-2min",
        performancems < 300000, "2min-5min",
        ">=5min"),
    customDimensions = todynamic(c),
    customMeasurements = todynamic(f),
    operation_Name = tostring(Body.context.operation.name),
    operation_Id = tostring(Body.context.operation.id),
    operation_ParentId = tostring(Body.context.operation.parentId),
    operation_SyntheticSource = tostring(Body.context.data.isSynthetic),
    session_Id = tostring(Body.context.session.id),
    user_Id = tostring(Body.context.user.anonId),
    user_AuthenticatedId = tostring(Body.context.user.authId),
    user_AccountId = tostring(Body.context.user.accountId),
    application_Version = tostring(Body.internal.data.documentVersion),
    client_Type = tostring(Body.context.device.type),
    client_Model = tostring(Body.context.device.deviceModel),
    client_OS = tostring(Body.context.device.osVersion),
    client_IP = tostring(Body.context.location.clientip),
    client_City = tostring(Body.context.location.city),
    client_StateOrProvince = tostring(Body.context.location.province),
    client_CountryOrRegion = tostring(Body.context.location.country),
    client_Browser = tostring(Body.context.device.browserVersion)
The “where isnull(Body.view) == false” statement above uniquely identifies records from the pageViews table. This is useful if multiple tables use the same Ingestion table.
Next, we need to create a function to encapsulate this query. When we add an update policy to the pageViews table, this function will run this query on any new records in the Ingestion table as they arrive. The output will be added to the pageViews table. To create the function, it’s a simple matter of wrapping the query from above in the code below and running the command:
.create-or-alter function pageViews_Expand { Query to run }
This creates a new function named pageViews_Expand. Now that the function has been created, we modify the update policy of the pageViews table to run it whenever new records are added to the Ingestion table; its output will be added to the pageViews table. The command to do this can be seen below:
.alter table pageViews policy update @'[{"Source": "Ingestion", "Query": "pageViews_Expand()", "IsEnabled": true, "IsTransactional": true}]'
After the next ingestion run, not only will you see records in the Ingestion table, but if there were page views, you should see the results show up in the pageViews table as well.
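A quick sanity check at this point is to look at the row count and latest timestamp in the new table, for example with a simple summarize:
pageViews | summarize count(), max(timestamp)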

If you have data already in the Ingestion table that you want to bring into the pageViews table, whether for testing or for historical purposes, you can use the .append command to load rows into the table from the function:
.append pageViews <| pageViews_Expand()
Finally, if you don’t want to retain data in the Ingestion table for very long, or at all, you can set a retention policy on it. Data will be automatically purged at the end of the retention period. Setting the value to zero will purge the data almost immediately; in that case, the Ingestion table simply becomes a conduit. To set the retention policy on the Ingestion table to 0, you can run the following command:
.alter-merge table Ingestion policy retention softdelete = 0d recoverability = disabled
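To confirm the policy took effect, you can display the table’s retention policy:
.show table Ingestion policy retention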
There are several steps involved, but once everything is wired up, data should flow from Application Insights to Azure Data Explorer within a few minutes. This example only worked with the pageViews table, but any of the AI tables can be used, although of course their schemas will be different.
“The default retention for data is 90 days, but that can be increased (for a fee) to 2 years.” – Can you please tell how to increase retention to 2 years? Where is the option available?
@Jignesh: https://docs.microsoft.com/en-us/azure/azure-monitor/app/pricing#change-the-data-retention-period