Tag: Event Hub

Continuous export for Azure Application Insights using Azure Data Explorer (Kusto)

Published July 6, 2022

If you use Azure, chances are that you’ve used Application Insights. Application Insights collects telemetry data for web applications, and allow that telemetry to be queries, analyzed or used to alert for any anomalies. It’s backed by the Kusto engine, which makes it possible to query and aggregate substantial amounts of data very quickly. It also relatively inexpensive. Depending on the nature of your data, however, you may find yourself bumping into one of its limitations. In most cases these limitations can be overcome by “upgrading” to AI’s big brother, Azure Data Explorer (ADX) which is more commonly known as Kusto.

This article outlines a pattern for continuously streaming data from Application Insights data to Kusto.

Application Insight Limitations

Retention – Data in application Insights is retained for 90 days at no charge and can be retained beyond that for a cost of $0.155 per GB per month. While the price is reasonable, there is a hard cap on retention of 2 years. Data older that 2 years is purged, so if longer retention is required, another solution is required.

Backup – At present, there is no way to backup and restore Application Insights data. It is certainly possible to export this data to a variety of media, but there is no way to restore it.

Data limits – Application Insights can struggle when a large amount of data is requested. It is not possible in any circumstance to query more than 500,000 rows or 64 MB of compressed data. It is possible to implement paged queries to work around this limitation, but this can be problematic. Query timeouts are also limited to 100 seconds, and unlike the underlying Kusto engine itself, these limits are absolute.

Scale – Application Insights is a “one size fits all” service. It cannot be scaled either up or down. It is therefore not possible to overcome issues with query performance of service limits by adding power.

Schema – At present, Application Insights collects data into 10 different tables. The schema of these tables is fixed and cannot be changed. It is possible to use custom data in these tables, in fact many have columns of dynamic type for just this purpose. JSON data can be added into these columns and queried by the engine. This makes Application Insights highly flexible.

The downside of this flexibility is performance. Querying custom data requires the engine to parse data at runtime. The engine is incredibly efficient at doing this, but it cannot compare to more structured columns, particularly when querying massive amounts of data. The fixed nature of Application Insights precludes other approaches for improving query performance like materialized views, etc.

Moving to Kusto

If any of these limitations are an issue, you may wish to consider moving your Application Insights data into Azure Data Explorer, otherwise known as “Kusto”. Kusto is the engine behind all of Azure Monitor (which includes Application Insights and Log Analytics), and it employs the same query language.

When you use your own Kusto cluster, you have complete control over your data. A Kusto cluster contains one or more nodes and can be scaled automatically. Specifically, it solves the limitations inherent to Application Insights while maintaining familiarity with the same data types and query language (KQL). It addresses the AI limits in the following ways:

Retention – Kusto has advanced data retention and caching settings that can be set at both the database level and the table level. Retention can be set to unlimited if necessary.

Backup – Kusto can connect to external tables that are connected to Azure storage accounts or to SQL tables. Continuous export can be added to any Kusto tables so that the externalized data is always up to date. Data can be restored from these externalized sources, or by reingesting directly from them. Alternatively, AI data can be simultaneously streamed into Azure storage accounts, and this data can be ingested into Kusto for restoration.

Data limits – The default query limits in Kusto are the same as those found in Application Insights, but here they are soft limits. They can be overridden, and asynchronous operations can be used to circumvent them when necessary. In most cases however, by using data optimization strategies available to Kusto, these limits should be less important.

Scale – Kusto clusters can be as small as 1 node (for development – a single node cluster has no SLA), and as large as 1,000. Each node can be as small as 2 CPUs/14 GB RAM, and as large as 32 CPUs/128 GB RAM. There is no limit to the quantity of data that can be ingested.

Schema – This is where Kusto really shines. Data can be transformed natively at the time of ingestion using update policies. Custom Application Insights data can be extracted from the dynamic columns into more structured tables. This has the benefit of greatly enhancing performance. In addition, materialized views can be created to further enhance query performance, create pre-aggregated query targets, etc. These strategies can greatly improve query performance.

By streaming Application Insights data into Kusto, you can continue to take advantage of the rich data collection capabilities of Application Insights, without being constrained by its storage limitations. In this scenario, AI acts as your telemetry collector, and Kusto your storage engine. The remainder of this article outlines how to do this.

Setting it all up

In our scenario, we are collecting data from 3 Application Insights tables, pageViews, customMetrics, and customEvents. To capture this data, we will use Diagnostic Settings, which transforms the table names to AppPageViews, AppMetics, and AppEvents respectively. The entire process is shown below for reference:

Azure Monitor collects data from application Insights as it arrives through Diagnostic settings. The data is then sent to an Azure Event Hub, as well as to an Azure Data Lake Gen2 (ADLG2) account for long term storage and recoverability. Azure Data Explorer (Kusto) ingests data directly from the Event Hub in near real time. Event Hub data is transformed and delivered to three staging tables through update policies and functions. In addition, External tables are connected to three containers in the storage account for diagnostic purposes or re-ingestion on demand.

Create an Event Hub and (optionally) a storage account

Data will be streamed continuously to an Event Hub and to and Azure Data Lake Gen 2 (ADLG2) account.

The Application Insights instance, the ADLG2 account, and the Event Hub namespace must all exist within the same Azure region. This is a limitation of the Azure Monitor service. The Kusto cluster can exist anywhere.

When creating the storage account, be sure to select the option for “Enable hierarchical namespace” from the Advanced page. This is what distinguishes an ordinary storage account from an ADLG2 account.

Configure Application Insights diagnostic settings

Many Azure services can stream usage data through their “Diagnostic Settings” option. In the case of Application Insights, all the collected data can be streamed. It should be noted however that the table names do not match those within the Application Insights logs, they are the same as those found in the Log Analytics workspace that backs the AI instance. In the example below, we are collecting data from the AppEvents, AppMetrics, and AppPageViews tables (customEvents, customMetrics, and pageViews in AI).

In this case we are sending data to an Event Hub and to an ADLG2 storage account. Each table will store its data in a separate container, and it is not possible to change that container.

Create the Kusto ingestion table and set up ingestion

The data stream to the Event Hub contains records from three different tables with different schemas. To accommodate this, we will create a temporary holding table, set up a policy to automatically distribute data in this table to three tables with different schemas, and then a retention policy to purge the holding table after distribution.

The holding table to receive Event Hub data will be named Pages_EventHub, and can be created from a Kusto query window using the following command:

.create table Pages_EventHub (records: dynamic)

This will create a table with one column named records which is of the dynamic data type. Event Hub data will land here.

Next, we create an ingestion mapping to match the incoming Event Hub JSON data to the holding table. This can be done from a query window using the following command:

.create table Pages_EventHub ingestion json mapping "RawRecordsMapping"
'['
' {"column": "records", "Properties":{"Path":"$.records"}}'
']'

When we define an ingestion, we will refer to this mapping by the name RawRecordsMapping. This mapping is a property of the holding table, and it will return the records path from the incoming JSON data and place it in the records column of the Pages_EventHub table.

Next, we define the data connection for the ingestion. To define a connection, navigate to your Kusto cluster, and open the Databases node, and then open the database that will receive the data. Finally, select Data connections, then Add data connection, and select Event Hub.

Give the connection a name, select the Event Hub namespace and hub, the default consumer group, and no compression. Use the table name and mapping created above and select JSON as the data format. When finished save the data connection.

If data is flowing into the Event Hub, it should begin to appear in the ingestion table within a few minutes, a typical time lag is 5 minutes. Once confirmed, it’s time to create the destination tables and update policies.

Create destination tables and update policies

We want to take data from the Event Hub and “reconstitute” it in Kusto. To do that, we will closely copy the data structure from the Log Analytics workspace that is connected to our Application Insights instance, leaving out some unnecessary system data. In our case, we will create three tables using the following Kusto commands (one at a time) in the Query window:

.create table pages_Staging_PageViews (TenantId: string, TimeGenerated: datetime, Id: string, Name: string, Url: string, DurationMs: real, PerformanceBucket: string, Properties: dynamic, Measurements: dynamic, 
OperationName: string, OperationId: string, ParentId: string, SyntheticSource: string, SessionId: string, UserId: string, UserAuthenticatedId: string, UserAccountId: string, AppVersion: string, AppRoleName: string, AppRoleInstance: string, ClientType: string, ClientModel: string, ClientOS: string, ClientIP: string, ClientCity: string, ClientStateOrProvince: string, ClientCountryOrRegion: string, ClientBrowser: string, ResourceGUID: string) 

.create table pages_Staging_Events (TenantId: string, TimeGenerated: datetime, Name: string, Properties: dynamic, Measurements: dynamic, OperationName: string, OperationId: string, ParentId: string, SyntheticSource: string, SessionId: string, UserId: string, UserAuthenticatedId: string, UserAccountId: string, AppVersion: string, AppRoleName: string, AppRoleInstance: string, ClientType: string, ClientModel: string, ClientOS: string, ClientIP: string, ClientCity: string, ClientStateOrProvince: string, ClientCountryOrRegion: string, ClientBrowser: string, ResourceGUID: string) 

.create table pages_Staging_Metrics (TenantId: string, TimeGenerated: datetime, Name: string, ItemCount: int, Sum: real, Min: real, Max: real, Properties: dynamic, OperationName: string, OperationId: string, ParentId: string, SyntheticSource: string, SessionId: string, UserId: string, UserAuthenticatedId: string, UserAccountId: string, AppVersion: string, AppRoleName: string, AppRoleInstance: string, ClientType: string, ClientModel: string, ClientOS: string, ClientIP: string, ClientCity: string, ClientStateOrProvince: string, ClientCountryOrRegion: string, ClientBrowser: string, ResourceGUID: string)

Next, we construct queries that will fit the schemas for these three and filter the result for the appropriate type. These queries will then be used to create Kusto functions for each of the three tables. The commands to create the three functions, which contain our queries can be found below.

.create-or-alter function fn_Pages_PageViewsIngest {
Pages_EventHub
| mv-expand records
| where records.Type == "AppPageViews"
| project 
TenantId = tostring(records.Properties.TenantId),
TimeGenerated = todatetime(records.['time']),
Id = tostring(records.Id),
Name = tostring(records.Name),
Url = tostring(records.Url),
DurationMs = toreal(records.DurationMs),
PerformanceBucket = tostring(records.PerformanceBucket),
Properties = todynamic(records.Properties),
Measurements = todynamic(records.Measurements),
OperationName = tostring(records.OperationName),
OperationId = tostring(records.OperationId),
ParentId = tostring(records.ParentId),
SyntheticSource = tostring(records.SyntheticSource),
SessionId = tostring(records.SessionId),
UserId = tostring(records.UserId),
UserAuthenticatedId = tostring(records.UserAuthenticatedId),
UserAccountId = tostring(records.UserAccountId),
AppVersion = tostring(records.AppVersion),
AppRoleName = tostring(records.AppRoleName),
AppRoleInstance = tostring(records.AppRoleInstance),
ClientType = tostring(records.ClientType),
ClientModel  = tostring(records.ClientModel), 
ClientOS  = tostring(records.ClientOS), 
ClientIP  = tostring(records.ClientIP),
ClientCity  = tostring(records.ClientCity), 
ClientStateOrProvince  = tostring(records.ClientStateOrProvince), 
ClientCountryOrRegion  = tostring(records.ClientCountryOrRegion), 
ClientBrowser  = tostring(records.ClientBrowser), 
ResourceGUID  = tostring(records.ResourceGUID)
}

.create-or-alter function fn_Pages_EventsIngest {
Pages_EventHub
| mv-expand records
| where records.Type == "AppEvents"
| project 
TenantId = tostring(records.Properties.TenantId),
TimeGenerated = todatetime(records.['time']),
Name = tostring(records.Name),
Properties = todynamic(records.Properties),
Measurements = todynamic(records.Measurements),
OperationName = tostring(records.OperationName),
OperationId = tostring(records.OperationId),
ParentId = tostring(records.ParentId),
SyntheticSource = tostring(records.SyntheticSource),
SessionId = tostring(records.SessionId),
UserId = tostring(records.UserId),
UserAuthenticatedId = tostring(records.UserAuthenticatedId),
UserAccountId = tostring(records.UserAccountId),
AppVersion = tostring(records.AppVersion),
AppRoleName = tostring(records.AppRoleName),
AppRoleInstance = tostring(records.AppRoleInstance),
ClientType = tostring(records.ClientType),
ClientModel  = tostring(records.ClientModel), 
ClientOS  = tostring(records.ClientOS), 
ClientIP  = tostring(records.ClientIP),
ClientCity  = tostring(records.ClientCity), 
ClientStateOrProvince  = tostring(records.ClientStateOrProvince), 
ClientCountryOrRegion  = tostring(records.ClientCountryOrRegion), 
ClientBrowser  = tostring(records.ClientBrowser), 
ResourceGUID  = tostring(records.ResourceGUID)
}

.create-or-alter function fn_Pages_MetricsIngest {
Pages_EventHub
| mv-expand records
| where records.Type == "AppMetrics"
| project 
TenantId = tostring(records.Properties.TenantId),
TimeGenerated = todatetime(records.['time']),
Name = tostring(records.Name),
ItemCount = toint(records.ItemCount),
Sum = toreal(records.sum),
Min = toreal(records.Min),
Max = toreal(records.Max),
Properties = todynamic(records.Properties),
OperationName = tostring(records.OperationName),
OperationId = tostring(records.OperationId),
ParentId = tostring(records.ParentId),
SyntheticSource = tostring(records.SyntheticSource),
SessionId = tostring(records.SessionId),
UserId = tostring(records.UserId),
UserAuthenticatedId = tostring(records.UserAuthenticatedId),
UserAccountId = tostring(records.UserAccountId),
AppVersion = tostring(records.AppVersion),
AppRoleName = tostring(records.AppRoleName),
AppRoleInstance = tostring(records.AppRoleInstance),
ClientType = tostring(records.ClientType),
ClientModel  = tostring(records.ClientModel), 
ClientOS  = tostring(records.ClientOS), 
ClientIP  = tostring(records.ClientIP),
ClientCity  = tostring(records.ClientCity), 
ClientStateOrProvince  = tostring(records.ClientStateOrProvince), 
ClientCountryOrRegion  = tostring(records.ClientCountryOrRegion), 
ClientBrowser  = tostring(records.ClientBrowser), 
ResourceGUID  = tostring(records.ResourceGUID)
}

With the three functions in place, we need to create an update policy that will use the results of a function to load a table whenever data is added to the holding table. For our pages_Staging_PageViews table, we run the following command to create the policy.

.alter table [@"pages_Staging_PageViews"] policy update @'[{"Source": "Pages_EventHub", "Query": "fn_Pages_PageViewsIngest()", "IsEnabled": "True", "IsTransactional": true}]'

This command adds an update policy to the pages_Staging_PageViews table. This update policy will be invoked whenever data is added to the Pages_EventHub table. It will execute the fn_Pages_PageViewsIngest function created above against this new data and load the result into the pages_Staging_PageViews table. The function itself filters out all data that did not originate from the original AppPageViews table and transform it to match the destination schema.

The commands for creating the policies on the other two tables are below:

.alter table [@"pages_Staging_Events"] policy update @'[{"Source": "Pages_EventHub", "Query": "fn_Pages_EventsIngest()", "IsEnabled": "True", "IsTransactional": true}]'

.alter table [@"pages_Staging_Metrics"] policy update @'[{"Source": "Pages_EventHub", "Query": "fn_Pages_MetricsIngest()", "IsEnabled": "True", "IsTransactional": true}]'

The last step is to add a retention policy to the Pages_EventHub table that will remove data automatically after it has been processed. This is an optional step and can be done at any point to conserve resources. A retention policy will remove ingested data after a defined time. Setting the period to 0 will delete the data shortly after all update policies have completed.

In this case the policy is added to the holding table by running the following command:

.alter-merge table Pages_EventHub policy retention softdelete = 0d recoverability = disabled

At this point, data should be flowing into the three destination tables shortly after it arrives through the event hub.

Connect external tables to the ADLG2 data (optional)

Earlier, we selected both an event hub and a storage account to receive data from Application Insights. The reason for the storage container is to provide an authoritative source of persisted data. Data in Application Insights expires by default after 90 days and cannot be retained any longer than 2 years. Data in Kusto can be persisted for an unlimited period, but it too can be configured to expire after a period of time. Storing the data in a storage account ensures permanency, and provides a location to re-ingest from should any disaster befall the Kusto data.

Kusto can be connected to external data sources as an external table. These sources can be a storage account, or SQL databases. While not strictly required, it is a good idea to create external tables connected to this data so that this data can be queried, and re-ingested with relative ease whenever necessary.

Connecting Kusto to ADLG2 storage is a two-step process. First you create a shared access signature, and then you create an external table in Kusto using that signature. A shared access signature can be created for the entire account, a container, or even a folder. Since we will be connecting to three different containers, we will create the signature at the account level. To do this navigate to the storage account in Azure, and the select Shared access signature in the Security + networking section. Select Blob and File from Allowed services, and then Container and Object from Allowed resource types. Set an expiry date applicable to your situation. The external table will stop working once your expiry date is exceeded.

When ready, click the Generate SAS and connection string button, and the screen will appear as follows:

Make note of the Blob service SAS URL – it will be needed in the next step. It’s also a good idea to record these settings, as it’s not possible to go back and retrieve them later.

Capturing the three tables above to ADLG2 creates the following three containers in the storage account:

insights-logs-appevents
insights-logs-appmetrics
insights-logs-apppageviews

When creating the external tables below, the Blob service SAS URL values need to be modified to include these containers by adding them before the token in the URL. Therefore:

https://mystorageaccount.blob.core.windows.net/?sv=2021-06-08&……. becomes

https://mystorageaccount.blob.core.windows.net/insights-logs-appevents?sv=2021-06-08&……. and so on.

To create the external table in Kusto, navigate to a Kusto query window that is connected to the appropriate database. The following commands can be used to create the table, substituting the sample url with the ones from above:

.create-or-alter external table Pages_AppEvents_Ext (['time']:datetime,resourceId:string,ResourceGUID:guid,Type:string,ClientBrowser:string,ClientCity:string,ClientCountryOrRegion:string,ClientIP:string,ClientModel:string,ClientOS:string,ClientStateOrProvince:string,ClientType:string,IKey:guid,_BilledSize:int,OperationName:string,OperationId:guid,ParentId:guid,SDKVersion:string,SessionId:string,UserAccountId:string,UserAuthenticatedId:string,UserId:string,Properties:dynamic,Name:string,ItemCount:int) 
kind=blob 
dataformat=json
( 
   h@'https://mystorageaccount.blob.core.windows.net/insights-logs-appevents?******' 
)

.create-or-alter external table Pages_AppMetrics_Ext (['time']:datetime,resourceId:string,ResourceGUID:guid,Type:string,ClientBrowser:string,ClientCity:string,ClientCountryOrRegion:string,ClientIP:string,ClientModel:string,ClientOS:string,ClientStateOrProvince:string,ClientType:string,IKey:guid,_BilledSize:int,OperationName:string,OperationId:guid,ParentId:guid,SDKVersion:string,SessionId:string,UserAccountId:string,UserAuthenticatedId:string,UserId:string,Properties:dynamic,Name:string,Sum:int,Min:int,Max:int,ItemCount:int) 
kind=blob 
dataformat=json 
( 
    h@'https://mystorageaccount.blob.core.windows.net/insights-logs-appmetrics?******'
)

.create-or-alter external table Pages_AppPageViews_Ext (['time']:datetime,resourceId:string,ResourceGUID:guid,Type:string,ClientBrowser:string,ClientCity:string,ClientCountryOrRegion:string,ClientIP:string,ClientModel:string,ClientOS:string,ClientStateOrProvince:string,ClientType:string,IKey:guid,_BilledSize:int,OperationName:string,OperationId:guid,ParentId:guid,SDKVersion:string,SessionId:string,UserAccountId:string,UserAuthenticatedId:string,UserId:string,Properties:dynamic,Measurements:dynamic,Id:guid,Name:string,Url:string,DurationMs:int,PerformanceBucket:string,ItemCount:int) 
kind=blob 
dataformat=json 
( 
    h@'https://mystorageaccount.blob.core.windows.net/insights-logs-apppageviews?******'
)

Once created, the external tables can be queried like any other table. They can be used for data validation or reingestion as appropriate.

In Conclusion

Once the data is flowing, subsequent tables and update policies can be set up to further transform the data, and materialized views can be created to further optimize query performance. Moving Application Insights data into Kusto gives you the best of both worlds, the telemetry collection capabilities of Application Insights, and the big data power of Kusto. This approach is not limited to Application Insights either – it can be used with and Azure services that support Azure Monitor with Diagnostic Settings.

Be aware however that this migration is a one way street. Once the data is in Kusto, it can be retained for as long as you like, but it can’t be put back into the source.

Automatically move your Application Insights data into Azure Data Explorer

Published November 30, 2020

Application Insights (AI) is a useful way of analyzing your application’s telemetry. Its lightning-fast queries make it ideal for analyzing historical data, but what happens when you start to bump into the limits? The default retention for data is 90 days, but that can be increased (for a fee) to 2 years. However, what happens when that’s not enough? If you query too much, or too often you may get throttled. When you start to bump into these limits, where can you go?

The answer lies in the fact that Application Insights is backed by Azure Data Explorer (ADX or Kusto). Moving your AI data to a full ADX cluster will allow you to continue using AI to collect data, and even to analyze recent data, but the ADX cluster can be sized appropriately and used when the AI instance won’t scale. The fact that it is using the same engine and query language as AI means that your queries can continue to work. This article describes a pattern for doing this.

Requirements

We’ll be working with several Azure components to create this solution. In addition to your AI instance, these components are:

Azure Data Explorer cluster
Azure Storage Account
Azure Event Namespace and at least one Event hub
Azure Event Grid

The procedure can be broken down into a series of steps:

Enable Continuous Export from AI
Create an Event Grid subscription in the storage account
Create an ADX database and ingestion table
Create an Ingestion rule in ADX
Create relevant query tables and update policies in the ADX database

Enable Continuous Export from Application Analytics

AI will retain data for up to 2 years, but for archival purposes, it provides a feature called “Continuous Export”. When this feature is configures, AI will write out any data it receives to Azure blob storage in JSON format.

To enable this, open your AI instance, and scroll down to “Continuous Export” in the “Configure” section. Any existing exports will show here, along with the last time data was written. To add a new destination, select the “Add” button.

You will then need to select which AI data types to export. For this example, we will only be using Page Views, although multiple types can be selected.

Next, you need to select your storage account. First select the subscription (if different from your AI instance), and then select the storage account and container. You will need to know what data region the account is in. Once selected, save the settings.

Initially, the “Last Export” column will display “Never”, but once AI has collected some data, it will be written out to your storage container, and the “Last Export” column will display when that occurred. Once it has occurred, you should be able to open your storage account using Storage Explorer, and then the container to see the output. In the root of the container selected above, you’ll see a folder that is named with the AI Instance name, and the AI instance GUID.

Opening that folder, you’ll find a folder for each data type selected above (if there has been data for them). Each data types will be further organized into folders names for the day, and the hour. Multiple files will be contained withing with the .blob extension. These are multiline json files and can be downloaded and opened with a simple text editor.

The next step is to raise an event whenever new content is added to this storage container.

Create an Event Grid subscription in the storage account

Prior to this step, ensure that you have created, or have available an Event namespace, and an Event hub. You will connect to this hub in this step.

From the Azure portal, open the storage account and then select the “Events” node. Then click the “Event Subscription” button at the top.

On the following screen, you’ll need to provide a name and schema for the subscription. The name can be whatever you wish, and the schema should be “Event Grid Schema”. In the Topic Details section, you will provide a topic name which will pertain to all subscriptions for this storage account. In the “Event Types” section, you select the types of actions that will fire an event. For our purposes, all we want is “Blob Created”. With this selection, the event will fire every time a new blob is added to the container. Finally, under “Endpoint Details”, you will select “Event Hubs” from the dropdown, and then you click on “Select an endpoint” to select your Event Hub.

Once created an event will fire anytime a blob is created in this storage account. If you wish to restrict this to specific folders or containers, you can select the Filters tab, and create a subject filter to restrict it to specific file types, containers, etc. More information on Event Grid filters can be found here. In our case, we do not need a filter.

When ready, click the “Create” button, and the Event subscription will be created. It can be monitored from the storage account and can also be monitored in the Event hub. As new blobs are added to the storage account, more events will fire.

Create an ADX database and ingestion table

From Azure portal, navigate to your ADX cluster and either select a database or create a new one. Once the database has been created, you need to create at least one table to store the data. Ultimately, Kusto will ingest data from the blobs added above whenever they are added, and you need to do some mapping to get that to work properly. For debugging purposes, I find it useful to create an intermediate ADX table to receive data from the blobs, and them transform the data afterward.

In this case, the intermediate table will have a single column, Body that will contain the entirety of each ingested record. To create this table, run the following KQL query on your database:

.create-merge table Ingestion (Body: dynamic)

The dynamic data type in ADX can work with JSON content, and each record will go there. For this to work, you also need to add a mapping to the table. The mapping can be very complex, but in our case, we’re doing a simple load in, so we’re matching the entire JSON record to the Body column in our database. To add this mapping, run the following KQL command:

.create table TweetIngest ingestion json mapping "RawInput"
'['
' {"column": "Body", "Properties":{"Path":"$"}}'
']'

At this point, we are ready for an ingestion rule.

Create an Ingestion rule in ADX

From the Azure portal, open your ADX cluster, and select the “Databases” node in the “Data” section, then click on your database.

The setting that we need is “Data ingestion” in the resulting window. Selecting that takes you to the ingestion rules. Now you want to create a new connection by selecting the “Add data connection” button.

The first selection is the data connection type. The options are Event Hub, Blob storage, or Iot Hub. We need to select Blob storage. Both it, and Event hub will connect to an Event hub, but the difference it that using “Blob storage”, the contents of the blobs will be delivered, and selecting “Event Hub” will only deliver the metadata of the blob being added.

Once the type is selected, you give it a name, and choose the event grid to connect to (the one that you created above) and the event type. Next, you select “Manual” in the Resources creation section. Selecting “Automatic” will create a new Event Hub Namespace, Hub, and Event grid and you won’t have any control of the naming of these resources. Selecting Manual allows you to keep it under control. Select your event grid here.

Next, select the “Ingest properties” tab, and provide the table and mapping that you created above (which in our case was “RawInput”). Also, you need to select “MULTILINE JSON” as the data format.

Once these values are complete, press the Create button and the automatic ingestion will commence. Adding a new blob to the storage account will fire an event, which will cause ADX to load the contents of the blob into the Body column of the Ingestion table. This process can take up to 5 minutes after the event fires.

Create relevant query tables and update policies in the ADX database

Once ingestion happens, your “Ingestion” table should have records in it. Running a simple query in ADX using the table name should show several records with data in the “Body” column. Opening a record will show the full structure of the JSON contained within. If records with different schema are being imported, a query filter can be employed to limit the results to only those records.

For example, the pageViews table in AI will always contain a JSON none named “view”. The query below will return only pageView data from the ingestion table:

This ingestion table can be queried in this matter moving forward, but for performance and usability reasons, it is better to “materialize” the views of this table. To do this, we create another table, and set an update policy on it that will add relevant rows to it whenever the ingestion table is updated.

The first step is to create the table. In our case, we want to replicate the schema of the pageViews table in Application Insights. This is because we want to be able to reuse any queries that we have already built against AI. All that should be necessary is to change the source of those queries to the ADV cluster/database. To create a table with the same schema of the AI pageViews table (mostly), the following command can be executed in ADX:

 .create table pageViews (
    timestamp: datetime, 
    ['id']: string, 
    name: string, 
    url: string, 
    duration: real,
    performanceBucket: string, 
    customDimensions: dynamic, 
    customMeasurements: dynamic, 
    operation_Name: string, 
    operation_Id: string, 
    operation_ParentId: string, 
    operation_SyntheticSource: string,
    session_Id: string, 
    user_Id: string, 
    user_AuthenticatedId: string, 
    user_AccountId: string, 
    application_Version: string, 
    client_Type: string, 
    client_Model: string, 
    client_OS: string, 
    client_IP: string, 
    client_City: string, 
    client_StateOrProvince: string, 
    client_CountryOrRegion: string, 
    client_Browser: string
)

Once the table is created, we need to create a query against the Ingestion table that will return pageViews records in the schema of the new table. Without getting deep into the nuances of the KQL language, a query that will do this is below:

    Ingestion
    | where isnull(Body.view) == false
    | extend view = Body.view
    | mvexpand view
    | extend performancems = view.durationMetric.value /10000
    | extend a = trim_end("\\]",trim_start("\\[",tostring(Body.context.custom.dimensions)))
    | extend b = replace('"}','"',replace('{"','"',a))
    | extend c = todynamic(strcat('{',b,'}'))
    | extend d = trim_end("\\]",trim_start("\\[",tostring(Body.context.custom.metrics)))
    | extend e = replace('"}','"',replace('{"','"',d))
    | extend f = todynamic(strcat('{',e,'}'))
    | project
        timestamp = todatetime(Body.context.data.eventTime),
        id = tostring(Body.internal.data.id),
        name = tostring(view.name),
        url = tostring(view.url),
        duration = toreal(performancems),
        performanceBucket = case(
                performancems < 250, "<250ms",
                performancems < 500, "250ms-500ms",
                performancems < 1000, "500ms-1sec",
                performancems < 3000, "1sec-3sec",
                performancems < 7000, "3sec-7sec",
                performancems < 15000, "7sec-15sec",
                performancems < 30000, "15sec-30sec",
                performancems < 60000, "30sec-1min",
                performancems < 120000, "1min-2min",
                performancems < 300000, "2min-5min",
                ">=5min"
            ),
        customDimensions = todynamic(c),
        customMeasurements = todynamic(f),
        operation_Name = tostring(Body.context.operation.name),
        operation_Id = tostring(Body.context.operation.id),
        operation_ParentId = tostring(Body.context.operation.parentId),
        operation_syntheticSource = tostring(Body.context.data.isSynthetic),
        session_Id = tostring(Body.context.session.id),
        user_Id = tostring(Body.context.user.anonId),
        user_AuthenticatedId = tostring(Body.context.user.authId),
        user_AccountId = tostring(Body.context.user.accountId),
        application_Version = tostring(Body.internal.data.documentVersion),
        client_Type = tostring(Body.context.device.type),
        client_Model = tostring(Body.context.device.deviceModel),
        client_OS = tostring(Body.context.device.osVersion),
        client_IP = tostring(Body.context.location.clientip),
        client_City = tostring(Body.context.location.city),
        client_StateOrProvince = tostring(Body.context.location.province),
        client_CountryOrRegion = tostring(Body.context.location.country),
        client_Browser = tostring(Body.context.device.browserVersion)
}

The “where isnull(Body.view) == false” statement above uniquely identifies records from the pageViews table. This is useful if multiple tables use the same Ingestion table.

Next, we need to create a function to encapsulate this query. When we add an update policy to the pageViews table, this function will run this query on any new records in the Ingestion table as they arrive. The output will be added to the pageViews table. To create the function, it’s a simple matter of wrapping the query from above in the code below and running the command:

 .create-or-alter function pageViews_Expand {
   Query to run
}

This creates a new function named pageViews_Expand. Now that the function has been created, we modify the update policy of the pageViews to run it whenever new records are added to the Ingestion table, and its output will be added to the pageViews table. The command to do this can be seen below:

.alter table Pages_pageViews policy update @'[{"Source": "PagesIngestion", "Query": "Pages_pageViews_Expand()", "IsEnabled": "True", "IsTransactional": true}]'

After the next ingestion run, not only will you see records in the Ingestion table, but if there were page views, you should see the results show up in the pageViews table as well.

If you have data already in the Ingestion table that you want to bring in to the pageViews table, whether for testing or for historical purposes, you can use the .append command to load rows into the table from the function:

.append pageViews <| pageViews_Expand

Finally, if you don’t want to maintain data in the Ingestion table for very long, or not at all, you can set the retention policy on it. Data will be automatically purged from it at the end of the time limit. Setting the value to zero will purge the data immediately, and in that case, the Ingestion table simply becomes a conduit. To set the retention policy on the Ingestion table to 0, you can run the following command:

.alter-merge table Ingestion policy retention softdelete = 0d recoverability = disabled

There are several steps involved, but once everything is wired up, data should flow from Application Insights to Azure Data Explorer within a few minutes. This example only worked with the pageViews table, but any of the AI tables can be used although of course their schemas will be different.

2 Comments