
Tag: SSIS

The New Power BI – Now With Enterprise!

Yesterday Microsoft announced the next step in the evolution of Power BI. It’s getting quite a bit of attention, and rightly so for its aim of bringing Business Intelligence closer to users. Democratizing BI has always proved a challenge – it’s the realm of the gurus in the white coats who hold the keys to the data. Microsoft is aiming to accomplish this democratization through a combination of user focus, and as of yesterday, a drastic change in its pricing model. Power BI just went from about $40 per user per month to free, or $9.99/user/month for advanced capabilities. That’s quite a drop, and arguably the biggest announcement from yesterday – it will have a massive impact. The detailed price breakdown can be found here.

However, all of the focus around personal BI is, in my opinion, missing a key component. Power BI and its components have always focused squarely on both personal and team BI solutions. That is to say the ability for a power user to model data, visualize it quickly and easily and to share it out with fellow team members. While that capability is certainly retained in the new Power BI, this new version contains the first appearance of enterprise grade BI in the cloud for Microsoft.

To fully understand this, it’s necessary to touch on the Microsoft BI stack as it stands today.

Microsoft BI On Premises

The On-Premises BI story from Microsoft may be confusing and occasionally difficult to follow, but it is very powerful, and relatively complete. In a nutshell, the story is good from a personal, team and enterprise perspective.

On the enterprise side, there are products from both the SQL Server team and the Office team. Data warehousing is served by SQL Server, and ETL duties fall to SQL Server Integration Services (SSIS). Analytical storage is served by SQL Server Analysis Services (SSAS) in both multidimensional (OLAP) and Tabular modes, and reporting is performed by SQL Server Reporting Services (SSRS). The SQL product line doesn’t have much on the client side for analysis apart from SSRS, but this slack is taken up by the analysis tools available in Excel, and through PerformancePoint Services in SharePoint.

Indeed, SharePoint also provides a platform for SSRS via SSRS SharePoint mode, and for Excel based analytical workbooks connected to SQL Server and to SSAS through Excel Services.

On the personal BI side, that role has traditionally fallen to Excel. The pitfalls of importing data into Excel workbooks for analysis are well documented and don’t need to be discussed here, but the bulk of those issues were addressed with the introduction of PowerPivot several years ago. PowerPivot allows for massive amounts of data to be cached within the Excel file for analysis without any data integrity concerns. The addition in recent years of analytic visuals (Power View, Power Map) and ETL capabilities (Power Query) has further rounded out the offering.

Taking that Excel workbook and sharing it brings us into the realm of Team BI. This is to say that the analyses are relatively modest in size, and of interest to a targeted group. These models may not require the rigour or reliability associated with enterprise BI models. Once again, the technology involved here is SharePoint. A user can take a workbook with an embedded PowerPivot model, share it through a SharePoint library, and other users can interact with that embedded model using only a browser. This capability requires PowerPivot for SharePoint, which is really a specialized version of SSAS, along with a SharePoint service application.

One thing to note about these seemingly disparate approaches is that a power user can build a Power Pivot data model with Excel, share it to a team via SharePoint, and, when it requires more rigour or management, have it “upgraded” into SSAS in Tabular mode. This common model approach is powerful, and is key to understanding Microsoft’s entire BI strategy. You can also see here that SharePoint straddles the two worlds of team and enterprise BI.

Moving to the cloud

The BI workload is one of the last Microsoft workloads to move to the cloud, and with good reason. Massive amounts of data present problems of scale, and security or data sovereignty concerns tend to keep data on premises. However, there is a very real need to provide BI to users outside of the firewall.

SharePoint is the hub of BI on prem, so it’s logical to assume that with SharePoint Online, it could continue to perform that function in the cloud. The big catch here is that on-prem, SharePoint is simply the display platform. In the enterprise scenario, users connect through SharePoint to the back end servers. This isn’t an option in the cloud, so enterprise BI was left off the table.

With the personal and team BI scenarios, data is cached in a Power Pivot data model, which could be supported in the cloud. When Office 365 moved to the SharePoint 2013 code base for SharePoint Online, rudimentary support for embedded Power Pivot models was indeed added – essentially a “light” version of PowerPivot for SharePoint. I call it light for two major reasons. Firstly, data models could be no larger than 10 MB. Secondly, there was no way to update the data contained within the Power Pivot cache, outside of re-uploading the Excel workbook. This is still true without a Power BI license. The inability to refresh the data renders team BI almost useless, except in static data scenarios.

The first generation of Power BI changed all of that. With a Power BI license, it was possible to install a Data Management Gateway on premises that would connect to team BI workbooks in Office 365 and update them on a scheduled basis. Yes, the gateway had many limitations (many of which have been removed over time), but finally, the on-prem refresh story was solved. In addition, the model size limit was increased to 250 MB. However, we were still left with a number of problems or limitations.

  1. Daily data refresh schedule. Automatic data refreshes could be daily at their most frequent. Manual refreshes could be done anytime.
  2. Capacity. The maximum size of a data model was increased to 250 MB, which is relatively small for enterprise scenarios. In addition, refreshes aren’t differential, which means that the entire model is re-uploaded on every refresh.
  3. Data sensitivity/sovereignty. The refresh problem was solved, but because the data is still cached in the workbooks, there can be reluctance to send it outside of the corporate firewall.
  4. Per user security. Power Pivot data models have no concept of user security in a workbook (tabular models in SSAS do). Security is at the workbook level.
  5. Cost. The initial cost of Power BI was $40 per user per month. A Power BI license was required to interact with any workbook that had a data model larger than 10 MB. Considering that a full Office 365 E3 license was around $25 per user per month, this price tended to limit the audience for sharing.

All of this is to say that Power BI in its first (and as yet current) incarnation is suitable for personal and team BI only. There has been no enterprise cloud BI story.

Power BI V2

The announcements yesterday outlined the next generation of Power BI. Going forward, Power BI will be available as a standalone offering, at the price points outlined above. Office 365 users will continue to be able to use it from Office 365, but Office 365 will no longer be required to use it. In its early days, Power BI was a SharePoint app, but a careful examination of URLs in the current offering quickly reveals that it’s actually two apps, both running on Azure (not in SharePoint).

If you’ve signed up for the new Power BI preview, you may notice that the URL is http://app.powerbi.com/…… so this move isn’t a big surprise.

With the new model, Excel is no longer the central container. Users connect to data and publish it directly to Power BI. Behind the scenes, the service does something very similar to what it does with Power Pivot models – it stores them in SSAS. In fact, the same limits still apply – 250 MB per model (at least for now). Excel can still be used, but now as a data source.

Visualizations are performed through Power View, and data is acquired through Power Query. These are no longer add-ons, but available on their own through the Power BI Designer. This decoupling is good for those who have not made an investment in SharePoint Online or Excel.

These changes to the architecture and the cost are great news for adoption, but don’t address the needs of the enterprise. Except for one thing – The SSAS Connector.

image

One of the data sources available to the new Power BI is the SSAS data connector. This connector is a piece of code that runs on premises (it actually includes the Data Management Gateway). It acts as a bridge between the Power BI service, and an on prem SSAS server.

The biggest distinction worth noting is that with the gateway, data is NOT being uploaded to the service; it remains on prem. The way that it works is that when a user interacts with a visualization from the cloud, a query is sent to the SSAS server through the gateway. That query is run, its results are sent back to the user’s visualization, and the data is never persisted in the cloud.

In addition, when the query is sent to the SSAS server, it is run with the permissions of the user making the request. This is accomplished through the EFFECTIVEUSERNAME feature in SSAS. This provides for full user level security, and since tabular models in SSAS can utilize per user security, we no longer need to rely on proxy accounts or document level security.
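To make the mechanism concrete, the sketch below shows roughly what a query issued with EFFECTIVEUSERNAME looks like in code, using the ADOMD.NET client. This is purely illustrative – the connector handles all of this for you – and the server, catalog, table and user names are placeholders.

Imports System
Imports Microsoft.AnalysisServices.AdomdClient

Module EffectiveUserNameSketch
    Sub Main()
        ' The account opening the connection must be an SSAS administrator;
        ' the query itself runs with the permissions of the impersonated user.
        Dim connStr As String = "Data Source=ssasserver;Catalog=SalesTabular;EffectiveUserName=CONTOSO\asmith"
        Using conn As New AdomdConnection(connStr)
            conn.Open()
            Using cmd As New AdomdCommand("EVALUATE TOPN(10, Sales)", conn)
                Using rdr As AdomdDataReader = cmd.ExecuteReader()
                    While rdr.Read()
                        Console.WriteLine(rdr(0))
                    End While
                End Using
            End Using
        End Using
    End Sub
End Module

Per user security then comes from the role and row level security defined in the tabular model itself, which is what makes the proxy account approach unnecessary.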

Finally, because the data is being stored in an on prem SSAS server, it can be refreshed automatically as often as desired. For the same reason, we have no capacity limits – you can grow your own SSAS servers as large as you like.

The SSAS connector removes most of the limitations that prevent cloud based enterprise Business Intelligence, and the new pricing model removes the rest. Certainly there are going to be feature limits in the near term, but it appears to me at least that the back of this thorny problem has finally been cracked.


Using the SSIS OData Source Connector With SharePoint Online Authentication

Last week, Microsoft released the OData Source for Microsoft SQL Server 2012. What is it? It allows SQL Server Integration Services (SSIS) to use an OData feed as a first class data source, in the same manner as SQL Server, Oracle, etc. Until now, it was necessary to code OData connections using a script component.

This matters to those of us in the SharePoint world because any SharePoint list data can be expressed as OData.

I’ve written before about how SharePoint data can be extracted into a data warehouse using SSIS and the SharePoint List Source and Destination Adapters, available from CodePlex. These adapters plug in to SSIS and wrap the SharePoint SOAP web services, and therefore do not need to be installed on a SharePoint server. We have used them for years, and they work very well; however, they are a CodePlex project, and therefore not fully supported.

These CodePlex adapters have more recently been bumping into another limitation. While SOAP web services are supported in Office 365, the adapters don’t support the Office 365 authentication mechanism, which effectively renders them useless there. The OData services require the same authentication, but the new OData Source supports it. It’s also an official Microsoft product, and is fully supported.

There is, however, a trick to getting it working. Once you install the OData Source, you open up SQL Server Data Tools, open an SSIS project, and add or edit a data flow task. In the SSIS Toolbox, you should see the OData Source.

image

Drag the tool onto the design surface, and double click to configure it. You’ll first need to configure an OData Connection Manager, and you’ll do that by clicking the New button.

image

Give the connection a name. The connection will be common to all lists and libraries within a site, so something based on the name of the site is likely appropriate. The Service document location is the OData endpoint. It takes the form of the URL of the site, along with the suffix /_vti_bin/listdata.svc. If the connection is on premises, you can use Windows Authentication, but if it is Office 365, you must use a stored name and password.
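For example, for a hypothetical Office 365 site at https://contoso.sharepoint.com/sites/sales, the Service document location would be:

https://contoso.sharepoint.com/sites/sales/_vti_bin/listdata.svc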

image

If you are using Office 365, and you click Test Connection at this point, you’ll receive an error “Test connection failed --> The remote server returned an error: (400) Bad Request.”

image

This is due to Office 365’s “unique” authentication mechanism. In order to authenticate to Office 365, you must first select the “All” button in the toolbar, and set the value of “Microsoft Online Services Authentication” to true.

image

This option may not be available to you. If the Online Services Authentication option is disabled, or greyed out, as it was for me when I first tried to use it, it’s because a prerequisite is missing. In order to authenticate to Office 365, the machine must have the SharePoint Server 2013 Client Components SDK installed on it.
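Incidentally, the client components are also what provide the SharePointOnlineCredentials class that makes this authentication possible. If you ever want to confirm that the endpoint itself is reachable outside of SSIS, a small console sketch along the following lines will fetch the service document. The tenant URL, account and password are placeholders, and the Microsoft.SharePoint.Client assemblies from the SDK must be referenced.

Imports System
Imports System.Net
Imports System.Security
Imports Microsoft.SharePoint.Client

Module ODataEndpointCheck
    Sub Main()
        ' Build a SecureString for the (placeholder) password
        Dim pwd As New SecureString()
        For Each c As Char In "placeholder-password"
            pwd.AppendChar(c)
        Next
        Dim creds As New SharePointOnlineCredentials("someone@contoso.onmicrosoft.com", pwd)
        Dim wc As New WebClient()
        wc.Credentials = creds
        ' Without this header, SharePoint Online answers with a forms login page instead of the feed
        wc.Headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "f")
        Console.WriteLine(wc.DownloadString("https://contoso.sharepoint.com/sites/sales/_vti_bin/listdata.svc"))
    End Sub
End Module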

Once the client components are in place, and the option is selected, the data source should be able to connect to the source, and the connection manager can be closed. Lists are exposed as Collections, so if you want to work with list data, you can then select the list from the list of Collections.

image

At this point, the data source will act like any other SSIS data source; you can select and transform columns at will. More importantly, this will help you get SharePoint data, both on-prem and in the cloud, into a central data warehouse.


How To Automate SharePoint Report Creation With SSIS and SSRS

If you’ve used SQL Server Reporting Services to any great extent, you’ve likely encountered the need to generate reports automatically. This requirement may be for delivery purposes, archival purposes, or simply to reduce report rendering wait times for the end users. SSRS supports this requirement out of the box – a report administrator can set up a subscription, enter the required parameters, and the report will be generated and delivered on that schedule.

This approach is highly declarative, and puts the onus of subscription creation on the report administrator. To this end SSRS also supports data driven subscriptions, which allow the subscriptions to be looked up from a SQL table. How that table is maintained is up to the individual organization, but it does allow a measure of dynamism. With SQL Server Reporting Services 2012, this feature is made much more user friendly through the use of User Driven Subscriptions.

The downside to any of this dynamic behaviour is that in every case, it requires the Enterprise edition of SQL Server (with SQL Server 2012, the BI SKU also has this capability). In addition, with SQL Server versions prior to 2012, the capability is somewhat less than user friendly.

In this post, I will outline an approach that allows users to subscribe to published reports through a SharePoint list, and have the rendered reports delivered to a SharePoint document library. The approach is not restricted to SharePoint – indeed it could be used with native mode to read through a list of possible parameter values, and email the resulting reports, or store them in a file system – but the SharePoint example is the one that I will be using below.

I should also point out that although the examples below use SQL Server 2012, the approach should work with versions back to SQL Server 2005.

The primary components of this solution are a SharePoint list that will be read to determine what reports to render (the subscription list), a SQL Server Integration Services (SSIS) package that will read through the subscription list and use the values therein to render the report, and finally, a SharePoint document library that will house the reports. Of course, we also need a report to be rendered, and in our case, this report is also stored in a SharePoint document library, as Reporting Services is running in SharePoint Integrated mode.

The good news is that all of the constituent portions of this solution are either downloadable for free, or come with SQL Server in any edition other than Express. Therefore, the chances are that if you have SharePoint, then you already have all of the tools that you need.

Step 1 – Obtain the SharePoint List Source and Destination Project

Out of the box, SSIS doesn’t know how to talk to SharePoint data. Fortunately, there’s an excellent Codeplex project that adds the required capability. If you haven’t already done so, download the SharePoint List Source and Destination project from Codeplex. You will find a good blog post on working with this tool here. Once installed, you will be ready to build the solution. Of course, this step is only necessary if you want to use a SharePoint list as a subscription source.

Step 2 – Create your subscription and report library

In this solution, we will allow a user to enter a subscription request in our subscription list. The user can specify the URL of the report to be run, the parameters for the report, the file type that is to be produced, and the library where the report is to be stored. In order to support this, we’ll need at least one document library where the produced reports will be stored, and one custom list.

Create your document library, and note its URL. In our case, our report library will be at http://home.nautilusinc.local/sites/nmarine/IT/Sandbox/ExpenseReportOutput. This URL will be used below. In this library, we don’t need to add any custom metadata properties, but you certainly may, should you wish to do so.

Next, create a custom list. In our case, the list will be named “Subscriptions” and will be created in the “http://home.nautilusinc.local/sites/nmarine/IT/Sandbox” site. Where you create this list is not important, but what is important is the display name of the list, and the URL of its parent site.

For our use case, we want the user to be able to specify the Report to be rendered, the destination to place the rendered reports, the parameters to use for the rendered report, and the file type of the rendered report. To that end, we will add 4 additional columns to the list, as shown below.

image

You will also note that the “Title:” field has been renamed to “Subscription” on this list. This is purely for cosmetic purposes. Three of the new fields are simply single line text fields, while the Format field is a choice field. In our example, the options available for the choice field are WORDOPENXML, PDF, EXCELOPENXML, IMAGE, and NULL. You can allow any of the possible output types that Reporting Services supports. I have outlined these types previously in another post here.

While it is outside the scope of this article, you will likely want to modify the form to display more user friendly names for the options than “WORDOPENXML”, etc, and automatically calculate the value for the subscription field. InfoPath would be an excellent tool to do this with, and there are other alternatives as well. For our purposes, we will work with the form as is.

Once done, you will want to add a couple of subscriptions. In our case, we’re working on a very simple report as shown below:
image

The report takes a single parameter, employee name, and renders the report filtered by that parameter. The subscription list item that we’ll create will look something like below:

image

The value for ReportURL is the full URL path to the report definition, in our case it is http://home.nautilusinc.local/sites/nmarine/finance/ReportsMarch22/ExpenseReports.rdl (you should be able to enter the URL into a browser and see the report), and the destination library is the full URL path to the destination library, in our case http://home.nautilusinc.local/sites/nmarine/IT/Sandbox/ExpenseReportOutput.

After adding two subscriptions, our subscription list appears as follows:

image

When our job runs (defined below) it will iterate through this list and create a corresponding PDF file and Word file in the destination library. Next, we create the SSIS package that will actually do the work.

Step 3 – Create a Reporting Services Web Service Proxy Class

In order to render the Reporting Services reports, we will need to call the Reporting Services web service from an SSIS Script task. In order to do that, we’ll need to use a proxy class. Luckily, we can just generate one using the WSDL.EXE generation tool available from the .Net 3.5 SDK. You run the tool with the following options:

wsdl.exe /language:[language choice] /out:ReportService.[language choice] http://[SPSiteURL]/_vti_bin/ReportServer/ReportExecution2005.asmx?WSDL

where:

  • [language choice] = VB or CS
  • [SPSiteURL] = URL of the SharePoint Site Collection

In our case, the precise command is:

wsdl.exe /language:VB /out:ReportService.vb http://home.nautilusinc.local/sites/nmarine/_vti_bin/ReportServer/ReportExecution2005.asmx?WSDL

If you don’t want to build your own, you can download the one that I created for this project (it’s Visual Basic). It was built using SSRS 2012, but should be backward compatible. Also, don’t forget to change the embedded server URLs.

Once you have the output file, make note of its location – we’ll use it below when creating the script task in SSIS.

Step 4 – Build the SSIS Package

I’m going to assume that most people reading this have little or no exposure to SSIS, so I’ll try to be as detailed as possible. You’ll need to start SQL Server Data Tools (if you’re using SQL Server 2012) or Business Intelligence Development Studio (for SQL versions prior to 2012).

image

You may notice that it has a striking resemblance to Visual Studio 2010. That’s because it is VS2010. Select “New Project”, then in the “Business Intelligence” section, select “Integration Services Project”. Give the new project a name and location and click OK.

image

Once created, we’ll need to create a SharePoint List connection manager. From the Solution explorer, right click on “Connection Managers” and select “New Connection Manager”. Scroll down on the window, select “SPCRED” and click Add.

image

You will only see SPCRED if you completed Step 1 above. The Connection Manager will then prompt for a name and a set of credentials. Provide the name, and also provide it with an account that has access to the subscription list. If the SSIS service account has access, you can select “Use Credentials of Executing Process”, otherwise provide a service account with access.

image

We’ll be working within a Data flow task, so drag a Data Flow Task onto the design canvas.

image

Next, double click on the data flow task, or click on the Data Flow tab to bring up the Data Flow Task Editor. From there, drag a “SharePoint List Source” action onto the canvas. (Note: if the SharePoint List Source does not appear, there may have been a problem installing it. Consult the documentation for the SharePoint List Source and Destination project for troubleshooting steps.) Double click on the List Source action to configure it. The first item to configure is the Connection Manager. Simply select it from the (hidden!!!) drop down list. Click on the area beside “SharePoint Credential Conn…” to reveal the dropdown.

image

Next, click on the “Component Properties” tab. Here, you perform the bulk of the action configuration. There are many options to choose from, but the ones that we’re concerned with here are SiteUrl and SiteListName. SiteUrl is the absolute URL of the site that contains our list, and SiteListName is the display name of the list. I stress display name, as this is different from most other APIs for SharePoint, which tend to use the internal name. Also, it’s relatively easy for users to change the display name of the list; doing so will break the package until it is reconfigured.

image

Next, drag a Script Component onto the canvas, below the data source. If prompted, choose “Transformation” for the script type. Then, connect the two actions by dragging the arrow from the SharePoint List Source to the Script Component.

image

Next, double click on the script component to bring up the script component editor. From the left, select Input Columns and select all of the columns to use in this script. In our case, we’ll be working with the columns shown below:

image

Next, click on the Script section, choose the language that you want to work with, then click the “Edit Script” button.

image

Without getting into too much detail of how the script action works, what we are going to do is to add code that will run for each row of data that flows through the transformation. In our case, that will be for each configuration item. We’re going to use the values of the columns of each configuration item to render the reports. Therefore, the code that we will write will go into the “Input0_ProcessInputRow” sub.

Before we can do that however, we need to add some supporting items. Firstly, since we’ll be working with web services, we’ll need to reference the .Net System.Web.Services library. Right click on the project name in solution explorer, and select Add Reference. From the .Net tab, select System.Web.Services, and click OK.

image

Next, expand the “Imports” section and import the System.IO and System.Net namespaces.
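The Imports region of the script should then include, in addition to the defaults, the following two lines:

Imports System.IO
Imports System.Net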

image

We now need to add our Reporting Services proxy class. The best way to do this is to first create a new class. Right click on the project in solution explorer, and select Add – Class.

image

Next, give the class a name. I like to match the name to the main class embedded, so the new name is ReportExecutionService.vb. Next, using Notepad, open the file that you created or downloaded in Step 3 above. Select all text, copy it into the clipboard, and then paste it into the newly created class, overwriting anything already there. Once done, save and close the class.

Next, I add a helper function to the script that helps to deal with URLs missing an ending slash. You can add it immediately above the “Input0_ProcessInputRow” sub. The code is below:

 Private Function CheckSlash(ByVal input As String) As String
        ' The destination is a URL, so always use a forward slash
        If input.EndsWith("/") Then
            Return input
        Else
            Return input & "/"
        End If
    End Function

As we saw above, the output format parameters aren’t the friendliest, and we will need to specify the extension for the output file. To allow this, I also wrote a small helper function to turn output format values into file extensions, and include it below. This also needs to be added to the script.

Private Function GetExt(format As String) As String
        Select Case format
            Case Is = "XML"
                Return "xml"
            Case Is = "Null"
                Return Nothing
            Case Is = "CSV"
                Return "csv"
            Case Is = "ATOM"
                Return "atom"
            Case Is = "PDF"
                Return "pdf"
            Case Is = "HTML4.0"
                Return "htm"
            Case Is = "RGDI"
                Return "gdi"
            Case Is = "MHTML"
                Return "mhtml"
            Case Is = "EXCEL"
                Return "xls"
            Case Is = "EXCELOPENXML"
                Return "xlsx"
            Case Is = "RPL"
                Return "rpl"
            Case Is = "IMAGE"
                Return "tiff"
            Case Is = "WORD"
                Return "doc"
            Case Is = "WORDOPENXML"
                Return "docx"
            Case Else
                Return Nothing
        End Select
    End Function

 

Finally, we’re ready to add code to the “Input0_ProcessInputRow” sub. The complete code listing is below:

  1. Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
  2.       '
  3.       Dim rs As New ReportExecutionService
  4.       rs.Url = "http://home.nautilusinc.local/sites/nmarine/_vti_bin/ReportServer/ReportExecution2005.asmx"
  5.       rs.Credentials = System.Net.CredentialCache.DefaultCredentials
  6.       Dim report As Byte() = Nothing
  7.       Dim deviceinfo As String = Nothing
  8.  
  9.       Dim ParameterPairs As String() = Row.Parameters.Split(";")
  10.       Dim parameters As ParameterValue() = New ParameterValue(ParameterPairs.Length - 1) {}
  11.       Dim CurrentPair As String()
  12.       For i As Integer = 0 To ParameterPairs.Length - 1
  13.           CurrentPair = ParameterPairs(i).Split("=")
  14.           parameters(i) = New ParameterValue
  15.           parameters(i).Name = CurrentPair(0)
  16.           parameters(i).Value = CurrentPair(1)
  17.       Next
  18.  
  19.       Dim historyID As String = Nothing
  20.       Dim credentials As DataSourceCredentials() = Nothing
  21.       Dim showHideToggle As String = Nothing
  22.       Dim extension As [String] = String.Empty
  23.       Dim encoding As [String] = String.Empty
  24.       Dim mimeType As [String] = String.Empty
  25.       Dim warnings As Warning() = Nothing
  26.       Dim reportHistoryParameters As ParameterValue() = Nothing
  27.  
  28.       Dim streamIDs As String() = Nothing
  29.       Dim execInfo As New ExecutionInfo()
  30.       Dim execHeader As New ExecutionHeader()
  31.  
  32.       rs.ExecutionHeaderValue = execHeader
  33.       execInfo = rs.LoadReport(Row.ReportURL, historyID)
  34.       rs.SetExecutionParameters(parameters, "en-us")
  35.       Dim destUrl As String = Row.DestinationLibrary
  36.       Dim destinationUrl As String = CheckSlash(destUrl) + Row.SubscriptionTitle + "." + GetExt(Row.Format)
  37.       Dim r As Byte()
  38.  
  39.       Try
  40.           report = rs.Render(Row.Format, deviceinfo, extension, mimeType, encoding, warnings, streamIDs)
  41.           Dim m_WC As WebClient = New WebClient
  42.           m_WC.Credentials = System.Net.CredentialCache.DefaultCredentials
  43.           r = m_WC.UploadData(destinationUrl, "PUT", report)
  44.       Catch ex As Exception
  45.  
  46.       End Try
  47.   End Sub

 

Again, without getting into too much detail, some explanation of the above code is in order.

Lines 3-5 initialize the web service, assign it a URL (Don’t forget to change this for your environment!!) and assign it the credentials to use when calling the web service.

When this sub is called by SSIS, it is passed a row object. The row object contains column objects for each column that is used by the script (this was configured above). Therefore, to get the value for any given column, you simply need to refer to it as Row.ColumnName. In our case, to get the value of the Parameters column, you use Row.Parameters. Lines 9 through 17 get the value of the parameters column, split the value into an array of string objects using a semicolon as a value delimiter, then for each of these objects, separate them into name/value pairs using the equals sign as a delimiter, and then finally assign them to a Reporting Services parameter collection.

Using this approach, we can use a single field to store all of the parameters for a report, and any report can have any number of parameters.
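To make the expected format concrete, a Parameters value such as "EmployeeName=John Smith;Department=Finance" (purely illustrative names) yields two name/value pairs. The little console sketch below mirrors the splitting logic of lines 9-17, should you want to test a parameter string outside of SSIS:

Imports System

Module ParameterSplitSketch
    Sub Main()
        Dim raw As String = "EmployeeName=John Smith;Department=Finance"   ' illustrative value
        For Each pair As String In raw.Split(";"c)
            Dim parts As String() = pair.Split("="c)
            Console.WriteLine("Name: {0}   Value: {1}", parts(0), parts(1))
        Next
        ' Output:
        '   Name: EmployeeName   Value: John Smith
        '   Name: Department   Value: Finance
    End Sub
End Module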

Lines 19-32 are primarily used for initialization. Line 33 loads the report specified in the subscription (by calling Row.ReportURL), line 34 sets the parameters, and lines 35-36 set the destination variables.

Finally, Line 40 calls the web service to actually render the report into a byte stream, and line 43 uses the .Net WebClient object to upload the file directly into SharePoint. In this example, we don’t actually add any metadata to the SharePoint library, but if this was required, you could use the techniques outlined in this post. We are now ready to test the process.

Step 5 – Run the Package

Close the Script editor window and click the OK button. If all is well, your Script Component action should show no errors. When ready, click the run button to test your package. After a short compilation period, you should see that 2 records were successfully read from the subscription list, and both steps should show green. If things don’t go well, the error messages are pretty good….

image

Navigating to the destination library, we see the two requested reports.

image

Next Steps

Obviously, every time that this package runs, the reports will be overwritten with newly rendered copies. This may be desired behaviour, but if not, you may want to turn on versioning in the destination library (each run will then be stored as a version), or modify the script to change the file name on each run (date stamping is a common technique), as sketched below.
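As a sketch of the date stamping option (a hypothetical helper that reuses the GetExt function already in the script), the destination file name could be built like this:

    ' Hypothetical helper: appends a timestamp so that each run produces a new file,
    ' e.g. "Expense Report" becomes "Expense Report-20150322-0130.pdf"
    Private Function StampedFileName(ByVal subscriptionTitle As String, ByVal format As String) As String
        Return subscriptionTitle & "-" & DateTime.Now.ToString("yyyyMMdd-HHmm") & "." & GetExt(format)
    End Function

Line 36 of the listing would then use CheckSlash(destUrl) + StampedFileName(Row.SubscriptionTitle, Row.Format) in place of the static name.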

In addition, you will want the package to run automatically without human intervention. To do this, you’ll want to deploy it to a SQL Server running SSIS, and schedule it to run as an agent job. There is a wealth of information online on how to do that.

Conclusion

The example provided above covers a single use case, but with minor adjustment could be used to automate all sorts of reporting tasks. A common one would be to use the NULL renderer to refresh report caches on a server. If you find any unique uses of this approach, I would love to hear about it. Please post a comment!

Credits

In preparing this post, I found the following articles to be useful:

SSIS and Reporting Services Web Services

Uploading documents to WSS (Windows Sharepoint Services) using SSIS

SharePoint reporting services SOAP endpoint in CTP3

Upload document from Local Machine to SharePoint Library using WebService

Uploading files to the SharePoint Document Library and updating any metadata columns


How to Get SharePoint Data Into a Data Warehouse (And Back Again)

With apologies to J.R.R. Tolkien……

Yesterday, as part of the ongoing Panellist Spotlight series for SharePoint Shoptalk, I presented a session on how to move your data from SharePoint into a data warehouse, and then consume the warehoused data directly within SharePoint. These presentations are recorded and posted online, and if you’re interested, you can view the entire presentation below:

In it, I demonstrate how SQL Server Integration Services (SSIS) can be used to extract the data from SharePoint, and to store it in a SQL Server data warehouse. Then I walk through the creation of an external content type, and an external list using SharePoint Designer and SharePoint Business Connectivity Services (BCS). Finally, I create a report using SQL Server Reporting Services (SSRS) in SharePoint Integrated mode.

Thanks to Jamees Wright for organizing the session, and to my fellow panellist, Laura Rogers.


Archiving Your SharePoint Workflow History Lists

If you’ve worked at all with SharePoint declarative workflows (the ones that you create with SharePoint Designer), or others, such as Nintex workflows that are based on them, you are undoubtedly aware of their ability to log items to the history list. These items are those that appear in the Workflow History section of the workflow status page.

image

What may be less commonly known is how this works. That history section is really just a view of items contained in a hidden list on the site; whenever a workflow logs an item, it gets created in that hidden list. For regular SharePoint workflows, this list is located at http://yoursiteurl/Lists/Workflow History, and for Nintex workflows, you can find it at http://yoursiteurl/Lists/NintexWorkflowHistory. So, why does this matter? Well, if you need to audit what has happened with your workflows, this is where the information is contained.

There is, however, a catch. By default, SharePoint runs a timer job named “Workflow Auto Cleanup” daily that removes all of the tasks associated with a workflow, and all of the history links, for workflows that are over 60 days old. This is done for performance reasons. Well, unless your audit requirements only go back two months, this isn’t going to work for you.

Try doing a search for “Workflow History” and you’ll see that this has caused a number of issues (especially for those that have found it out after the fact). The good news for those people is that the workflow history list isn’t actually purged (which is also bad news, as we’ll see shortly), and those links can be recreated through reporting. However, the most common guidance found on this topic is to simply disable the automatic cleanup job, as outlined in this very poorly named Technet article.

The problem with disabling the job is that performance will suffer, potentially badly. Assume that we have an approval workflow that runs on a list that will receive 2,500 approvals annually. This is a reasonably sized list (for SharePoint 2010). Now let’s also assume that during the life of the workflow, 10 items get logged to the history list. This means that in a given year, 25,000 items are being logged to the history list, which is beyond the default list view threshold for administrators, and would be considered a very large list.

What is needed is a way to balance the auditing requirements with the list size constraints of SharePoint. 25,000 items may be a large SharePoint list, but it’s trivial for a relational database like SQL Server. The remainder of this article discusses how to use Microsoft’s Business Intelligence tools to extract workflow history data into a data warehouse, and then safely purge it from the workflow history list. This will be a lengthy one.

Step 1 – Extract And Load

In my opinion, one of the most underutilized tools in Microsoft’s arsenal is SQL Server Integration Services (SSIS). Almost every SharePoint environment has it, and very few know about it. It is Microsoft’s ETL (Extract, Transform, and Load) tool, and it is used for taking data from source systems, performing operations on it, and loading it into a destination system, which is typically a data warehouse in the form of one or more SQL databases. This is precisely what we need to do with our Workflow History data. You can read more about SSIS here.

The problem however is that SSIS does not support SharePoint list data as a data source. Yes, ultimately all SharePoint data is stored in SQL content databases, but we all know that we’re supposed to stay out of there. SharePoint data should only be accessed via UI constructs, the SharePoint APIs, or the SharePoint web services.

Happily, a Codeplex project was created several years ago that adds both source and destination SSIS adapters for SharePoint list data, and yes, it works well with SharePoint 2007 data. What this project does is to encapsulate calls to the SharePoint web services into SSIS data adapters. Because it uses the SharePoint web services (not the API), there is no requirement for the SharePoint bits to be installed where it is being run.

You can find an excellent tutorial on how to set this up here, so I’m not going to go into any details on that. I will, however, cover the basic steps below. First, though, I want to outline the logic involved.

What we want to do is maintain a complete log of workflow history. We also want to keep the history in SharePoint for a period of time (60 days by default), and then be able to purge it, knowing that it’s secure in the data warehouse. Therefore, we need to take an initial dump of the data, and then add only new items to it on subsequent runs. The design of the data warehouse will also support multiple site history lists.

The solution will consist of 2 tables in a SQL data warehouse (a staging table and the actual archive table). The SSIS package will perform the following steps:

  1. Empty the staging table
  2. Extract the entire Workflow History List (SP) into the staging table (SQL)
  3. Query the archive table for the most recent entry
  4. Extract all items from the staging table more recent than the entry in step 3 into the archive table, and add in a site identifier.

First, open up Business Intelligence Development Studio (BIDS). BIDS is really just Visual Studio with all of the SQL BI project types added, and is normally installed when SQL Server is installed. If not, you can install it from the SQL media. You do not need SQL Server installed on the same machine to use it, but having it there does have some advantages.

From the Business Intelligence Projects section, select “Integration Services Project”, and give it a solution and project name. You’ll then be presented with the SSIS design canvas. The first thing that you’ll want to do is to create two connection managers – one for SharePoint, and one for SQL. In the Connection Managers pane, right click anywhere in the window and select “New Connection”.

image

Scroll down, and select SPCRED – Connection manager for SharePoint connections, give it a name, and select the credentials. If you use the credentials of the executing process, it will use your credentials when you test it, but the credentials of the SQL Server Agent process if you schedule it to run automatically. Alternatively, you can enter the credentials of a proxy account, which is what I typically do. Repeat this process, only this time select OLEDB and configure the connection to your SQL Data Warehouse database (if you haven’t already done so, you’ll need to create a SQL database to house the archive).

Next, from the Toolbox, drag a Data Flow Task onto the Design surface. Your surface should look something like below:

image

Double click on the Data Flow task, and the Data Flow window will open (you can also click on the Data Flow Tab). Here, from the toolbox, drag a SharePoint List Source, and an OLE DB Destination task onto the surface. Double click on the SharePoint List source, then click in the area to the right of the SharePoint Credential Connection, and set the Connection manager to the manager that you created above.

image

Next, click on the Component properties tab, and enter valid values for the SharePoint source site URL, and the list name. The List name will either be WorkflowHistory for standard SharePoint workflows, or NintexWorkflowHistory, for Nintex workflows.

image

Click OK. Next, grab the green arrow at the bottom of the SharePoint List source, and connect it to the OLE DB Destination. Double click on the OLE DB Destination, and select the New button beside the Name of the table field. What this allows us to do is to create our staging table in the data warehouse with the appropriate schema for our Workflow History list. Once the create table window is open, simply change the name of the table to what you want (in this case wfhStaging).

image

As soon as you click OK, the table is created in SQL. Next, click the Mappings tab on the left, and confirm that all of the fields are mapped correctly from the SharePoint list to the SQL table. No changes should be required. When complete, click OK, and the data flow is ready for testing. From the BIDS debug menu, select Start Debugging. After a pause, the process will run, and the boxes will turn yellow and green as the process executes. If all works properly, you will see something like the screen below:

image

Both boxes turning green indicates that the process completed successfully, and there will be an indicator showing the number of rows that were transferred. You can confirm this by opening up SQL Server Management Studio, selecting your data warehouse database and running the following query:

SELECT Count(id) From wfhStaging

At this point, we need to stop our debugging process and switch back to the Control Flow tab. Given that we want to repopulate the staging table whenever we run this package, we need to first clear it at the beginning of the run. Drag an “Execute SQL Task” from the toolbox onto the design surface above or to the left of the data flow task. Use its arrow to connect it to the Data Flow task, and then double click on it. Select your OLE DB connection as its connection property and enter the following SQL (substituting your table name) as its SQL Statement:

TRUNCATE TABLE wfhStaging

image

Next, we will need to create and populate our actual archive table. To do this, drag another Data Flow task onto the design surface. Connect the output from the first data flow task to it and then double click on it. Drag an OLEDB Source, a derived column, and an OLEDB destination onto the design surface.

We want to be able to store the workflow history for multiple sites in the same data warehouse table. To do this, we need to add an extra identifier column to the data as it flows through, one that uniquely identifies the source site. In our case we will use the relative site URL. The derived column action will add this column to each row as it is processed.

Configure the OLEDB Source to read from the staging table. Then, connect the OLEDB Source to the Derived Column action with the green arrow. Double click on the derived column action. Under the Derived Column Name, enter the name of the new column. Leave the Derived Column action as “add as new column”, and for the expression we will simply use a literal string containing the relative site URL. When complete, the action should appear as below.
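For example, if the source site is at /SalespersonChangeRequest (as in the queries further below), the expression is just the quoted literal (SSIS expressions use double quotes for strings):

"/SalespersonChangeRequest"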

image

Click OK to close the dialog, and then connect the derived column action to the OLEDB Destination action with the green arrow. Double click the OLEDB Destination action and repeat the steps taken above to create the staging table, only this time, you’ll create the actual archive table. This time, when you click on the Mappings tab, note that the SiteURL column has been added at the bottom. Don’t run debug at this point, as it will run the entire package. Click back on the Control Flow tab, right click on the new Data Flow action, and select Execute Task. Just that task will run, and if you move back to the Data Flow tab, you should see that the same number of rows have been added to the archive table.

Now we need to ensure that only the new rows from the staging table are moved into the archive table. To do this, we will change the select statement in the OLEDB source of the second data flow task. First, we’ll need to know the date/time of the latest record in the archive table for this site. The SQL statement for this looks like:

SELECT MAX(Modified) From wfhArchive 
Where SiteURL='/SalespersonChangeRequest'

Therefore, we can embed that statement in the select statement for our staging table. However, we still need to accommodate the case where there are no records in the archive table, in which case the above statement returns a NULL value. To deal with this, we can use the ISNULL T-SQL function, and our complete staging table select statement will be:

SELECT * FROM wfhStaging 
Where Modified > 
  ISNULL(
      (SELECT MAX(Modified) From wfhArchive 
       Where SiteURL='/SalespersonChangeRequest'
       ),
       '1900-01-01')

Translated into English, this basically says “Find the Modified value of the most recent record of any items with SiteURL set to /SalespersonChangeRequest. If you don’t find any, use 1900-01-01. Then, get me everything from the staging table with a more recent modified date.”

Now that we have our SQL, we need to modify our OLEDB Source action. Double click on it, and then change the Data access mode from “Table or view” to “SQL command”. Then, add the select statement to the SQL command text window. At completion, the window should appear as follows:

image

Once done, click back to the Control Flow tab, and then start debugging (you can also just press the F5 key). The first Data Flow task should write all of the source records to the table, and the second should write none (assuming nothing has happened to the source since you did the initial extract). You can try deleting some of the records from the archive table and rerunning the package – they should get replaced. That was step 1.

Step 2 – Schedule the package

Now that we have our package, we want it to run periodically (usually nightly). We do this by deploying the package to the server, and then scheduling it to run with the SQL Server Agent.

To deploy the package, we first need to create a deployment utility for it. To do this, select the project in the Solution Explorer pane, and then select Project – Properties from the menu. The Configuration Properties window opens. In the left pane, select the Deployment Utility tab, and ensure that CreateDeploymentUtility is set to True.

image

Also – take note of the DeploymentOutputPath value.

Next, build the project to generate the deployment utility, and then run it. The deployment utility is stored in a subfolder of the package project. You can find the folder of the project by selecting the project in the Solution Explorer pane, and then examining its FullPath property in the Properties pane. Open the project path in Windows Explorer, and then navigate to the DeploymentOutputPath as noted above. In that folder, you’ll find a file named yourprojectname.SSISDeploymentManifest. When ready to deploy, double click on it, and the Deployment wizard will start.

The deployment wizard is straightforward and self-explanatory. You’ll want to select “SQL Server deployment” on the first screen, then the SQL Server that you wish to deploy to (usually (local)), and select a location for the Package Path (the root is likely fine). Once the wizard is complete, you are ready to schedule the package.

Open up SQL Server Management Studio, and connect to the destination server. If the SQL Server Agent service is not running, start it (you’ll want to make sure that it is set to start automatically). Expand the Agent node, and then expand the Jobs node. Right click on Jobs, and select New Job.

image

Give the Job a name, then click on the Steps tab. Click the New button to create a new step, and give it a name. In the Type dropdown, select “SQL Server Integration Services Package”. In the General section, select the SQL server that holds the package, and then use the ellipsis in the Package field to select the package you deployed above.

image

Next, click the Schedules tab, click the new button, give the schedule a name, and select when you want the job to run. Save the schedule, and then save the job (click OK). Your job should now appear in the Jobs folder. To test it, right click on it and select “Start Job at step”. The job will run and you will see its progress in a dialog.

There are many options for scheduling SSIS jobs, and for error handling, and I would strongly recommend investigating them.

Step 3 – Purge the Workflow History Data

As mentioned above, the workflow cleanup job removes workflow history associations, but does not actually delete the items from the list, allowing that list to grow large. If you use Nintex, there’s a Nintex command that will take care of this for you:

NWAdmin.exe -o PurgeHistoryListData -siteUrl urlToSite 
 [-workflowName workflowName] [-days daysSinceLastActivity] 
 [-lastActivityBefore datetime DateFormat)] [-state All|Running|Completed|Cancelled|Error]
 [-deletedLists] [-clearAll [-workflowItemId id -workflowListName "list name"]] [-verbose]
 [-reportOnly] [-batchSize numberDefaultIs500]
 [-pauseAfterBatch] [-maxItemsToDelete number]
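
As an illustration, a scheduled run that purges only completed workflow history older than 120 days from a single site might look something like this (the site URL and retention window are placeholders):

NWAdmin.exe -o PurgeHistoryListData -siteUrl http://yoursiteurl -days 120 -state Completed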

This command is run on a front end server. To keep things up to date, it would need to be scheduled. However, if you’re using out of the box workflows, there is no equivalent command. You could just access the history list and remove old data, but since SharePoint has built in tools for this, I recommend using them. These features are contained in the Information management policy settings of any list.

Open the Workflow History list (your site URL + Lists/WorkflowHistory or NintexWorkflowHistory). Open the List settings page, and select “Information management policy settings” in the Permissions and Management section. If you don’t see this option, you may need to enable the relevant features. In the Content type policies, select the Workflow History content type, and then select “Enable Retention”. Once enabled, you will be able to select “Add a retention stage”.

The retention stage is what we will use to delete the workflow history items (which, given the name, is somewhat counter-intuitive, don’t you think?). Date Occurred is when the event occurred, so it is likely our best time indicator, and I would suggest a period at least double that of the automatic cleanup window, which is 60 days. Finally, we want the item to be deleted at this point, so we select “Permanently Delete” from the Action dropdown. When complete, the stage will appear as follows:

image

Once we save our policy, the expired items will be deleted the next time the timer jobs run.

And that’s all there is to it!

Now that we’ve taken the data out of SharePoint, it’s no longer obviously available to end users. If this is important, we will need to build some Reporting Services reports, and integrate them back into the appropriate locations in SharePoint. This will (hopefully) be the subject of an upcoming post.
