THE SQL Server Blog Spot on the Web


Jamie Thomson

This is the blog of Jamie Thomson, a data mangler in London working for Dunnhumby

  • Creating a Spark dataframe containing only one column

    I’ve been doing lots of Apache Spark development using Python (aka PySpark) recently, specifically Spark SQL, and one thing I’ve found very useful to be able to do for testing purposes is create a Spark SQL dataframe from literal values. The documentation at pyspark.sql.SQLContext.createDataFrame covers this pretty well however the code there describes how to create a dataframe containing more than one column like so:

    l = [('Alice', 1)]
    # returns [Row(_1=u'Alice', _2=1)]
    sqlContext.createDataFrame(l, ['name', 'age']).collect()
    # returns [Row(name=u'Alice', age=1)]

    For simple testing purposes I wanted to create a dataframe that has only one column so you might think that the above code could be amended simply like so:

    l = [('Alice')]
    sqlContext.createDataFrame(l, ['name']).collect()

    but unfortunately that throws an error:

    TypeError: Can not infer schema for type: <type 'str'>

    The reason is simple: ('Alice', 1) is a tuple whereas ('Alice') is just a string, because parentheses around a single expression do not create a tuple.

    print type(('Alice',1))
    # returns <type 'tuple'>
    print type(('Alice'))
    # returns <type 'str'>

    The latter causes an error because createDataFrame() only creates a dataframe from an RDD of tuples, not an RDD of strings.

    There is a very easy fix which will be obvious to any half-decent Python developer; unfortunately that’s not me, so I didn’t stumble on the answer immediately. It’s possible to create a one-element tuple by including a trailing comma, like so:

    print type(('Alice',))
    # returns <type 'tuple'>

    hence the earlier failing code can be adapted to this:

    l = [('Alice',)]
    # returns [Row(_1=u'Alice')]
    sqlContext.createDataFrame(l, ['name']).collect()
    # returns [Row(name=u'Alice')]

    It took me far longer than it should have done to figure that out. :)


    Here is another snippet that creates a dataframe from literal values without letting Spark infer the schema (behaviour which, I believe, is deprecated anyway):

    from pyspark.sql.types import *
    schema = StructType([StructField("foo", StringType(), True)])
    l = [('bar1',),('bar2',),('bar3',)]
    sqlContext.createDataFrame(l, schema).collect()
    # returns: [Row(foo=u'bar1'), Row(foo=u'bar2'), Row(foo=u'bar3')]

    or, if you don’t want to use the one-element tuple workaround that I outlined above and would rather just pass a list of strings:

    from pyspark.sql.types import *
    from pyspark.sql import Row
    schema = StructType([StructField("foo", StringType(), True)])
    l = ['bar1','bar2','bar3']
    rdd = sc.parallelize(l).map (lambda x: Row(x))
    sqlContext.createDataFrame(rdd, schema).collect()
    # returns [Row(foo=u'bar1'), Row(foo=u'bar2'), Row(foo=u'bar3')]

    Happy sparking!
  • Using Visual Studio Code and the PowerShell extension behind a proxy

    I’m putting this here solely for myself so I’ve got it for later. If it’s useful for anyone else, great.

    I’m a big fan of Visual Studio Code (VSCode) for editing code scripts, mainly because of the Git integration, the built-in Explorer and the fact that it’s so lightweight. I have even used VSCode for editing PowerShell scripts in preference to PowerShell ISE, even though it didn’t actually support or comprehend PowerShell. That changed however when Microsoft released a PowerShell extension for VSCode (read Announcing PowerShell language support for Visual Studio Code and more!), which brings fantastic PowerShell features such as IntelliSense, syntax highlighting and code navigation.

    Unfortunately I was unable to install the PowerShell extension at work because we have a proxy server sitting in the way, which also affected VSCode’s ability to auto-update itself. It took a while to figure out how to configure VSCode to use the proxy server, so I’m posting the solution here in case I ever need it again.

    Open up File->Preferences->User Settings


    Add the following into the settings.json file that appears:
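    The snippet itself hasn’t survived the page archive; VS Code’s documented proxy settings take this shape (the proxy address below is a placeholder, and http.proxyStrictSSL may or may not be needed depending on your proxy):

    ```json
    {
        "http.proxy": "http://myproxy.example.com:8080",
        "http.proxyStrictSSL": false
    }
    ```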



    (Enter the address of your own proxy server of course.)

    Boom. You’re done! After that VSCode will auto-update and you’ll be able to install extensions.


  • Sorting then formatting du output on Linux

    This is my first blog post in eons and, don’t laugh, it’s about using Bash/Linux (nothing to do with the recent announcement of Bash on Windows; I’ve been using Bash on and off for about a year now). As I’m now a Hadoop monkey the Linux command line is where I spend a lot of my time, and today I discovered awk for the first time and the cool stuff one can do with it.

    My challenge was to get the size of a bunch of HDFS folders within a given folder, sort the results, then format the output to be human-readable (i.e. use K, M, G or T depending on whether the size should be measured in KB, MB etc…). Hadoop has a command to get the size of a bunch of folders and format the numbers to be human-readable:

    hadoop fs -du -h /path/to/folder

    but once the output is formatted it can’t be sorted so that was no good. That’s when I discovered what awk can do for you. Rather than try and explain it I’ll just put this here:

    $ hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }'
    28T /foo/bar/card_dim_h_tobedeleted
    20T /foo/bar/transaction_item_fct_tobedeleted
    2T /foo/bar/card_dim_h_new_tobedeleted
    2T /foo/bar/hshd_loyalty_seg_tobedeleted
    1T /foo/bar/prod_dim_h_tobedeleted
    607G /foo/bar/promo_item_fct_tobedeleted
    456G /foo/bar/card_dim_c_tobedeleted
    340G /foo/bar/ch_contact_offer_alc_fct_tobedeleted
    203G /foo/bar/prod_dim_h_new_tobedeleted
    184G /foo/bar/card_dim_h_test_tobedeleted
    166G /foo/bar/offer_dim_h_tobedeleted
    115G /foo/bar/promo_dim_h_tobedeleted
    87G /foo/bar/offer_tier_dtl_h_tobedeleted
    84G /foo/bar/ch_contact_offer_dlv_fct_tobedeleted
    50G /foo/bar/ch_contact_event_dlv_fct_tobedeleted

    All sorted in descending order and nicely formatted to be human-readable. Cool stuff. I’m mainly putting this here so I can find it later when I need it but thought it might be interesting for others also.
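    Incidentally, the awk program’s suffix logic (divide by 1024 until the number is small enough, then pick the matching letter) can be sketched in Python like so:

    ```python
    def human_readable(n):
        """Mimic the awk loop: repeatedly divide by 1024, tracking the suffix."""
        suffixes = "KMGT"
        i = 0
        while n > 1024 and i < len(suffixes):
            n /= 1024.0
            i += 1
        return "%d%s" % (int(n), suffixes[i - 1] if i else "")
    ```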


    P.S. Yes, I deleted about 55TB of data today :)

  • Calling Cronacle web service QueryJob1 from PowerShell

    2 blog posts in 2 days. Hell must have frozen over…

    On my current project we are using Cronacle as our job scheduler and we have a need to programmatically query Cronacle to ask it the status of a job. The following script accomplishes that using PowerShell:

    function Ignore-SSLCertificates
    {
        $Provider = New-Object Microsoft.CSharp.CSharpCodeProvider
        $Compiler = $Provider.CreateCompiler()
        $Params = New-Object System.CodeDom.Compiler.CompilerParameters
        $Params.GenerateExecutable = $false
        $Params.GenerateInMemory = $true
        $Params.IncludeDebugInformation = $false
        $Params.ReferencedAssemblies.Add("System.DLL") > $null
        $TASource=@'
            namespace Local.ToolkitExtensions.Net.CertificatePolicy
            {
                public class TrustAll : System.Net.ICertificatePolicy
                {
                    public bool CheckValidationResult(System.Net.ServicePoint sp, System.Security.Cryptography.X509Certificates.X509Certificate cert, System.Net.WebRequest req, int problem)
                    {
                        return true;
                    }
                }
            }
    '@
        $TAResults = $Provider.CompileAssemblyFromSource($Params, $TASource)
        $TAAssembly = $TAResults.CompiledAssembly
        ## We create an instance of TrustAll and attach it to the ServicePointManager
        $TrustAll = $TAAssembly.CreateInstance("Local.ToolkitExtensions.Net.CertificatePolicy.TrustAll")
        [System.Net.ServicePointManager]::CertificatePolicy = $TrustAll
    }

    function Get-CronaclePassword {
        $securePassword = Read-Host -Prompt "Please supply Cronacle password" -AsSecureString
        $creds = New-Object System.Management.Automation.PSCredential -ArgumentList "dummyvalue", $securePassword
        return $creds.GetNetworkCredential().Password
    }

    function Get-CronacleJobStatus {
        Param(
            [Parameter(Mandatory=$true)][string]$WSUri,
            [Parameter(Mandatory=$true)][string]$CronacleUser,
            [Parameter(Mandatory=$true)][string]$CronaclePassword,
            [Parameter(Mandatory=$true)][string]$JobId
        )
        $secureCronaclePassword = ConvertTo-SecureString $CronaclePassword -AsPlainText -Force
        $creds = New-Object System.Management.Automation.PSCredential ($CronacleUser, $secureCronaclePassword)
        $userpasswordbase64 = [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes("${CronacleUser}:${CronaclePassword}"))
        $payload = @"
    <soapenv:Envelope xmlns:soapenv="" xmlns:sch="">
       <soapenv:Header/>
       <soapenv:Body>
          <sch:QueryJobRequest>
             <sch:JobId>${JobId}</sch:JobId>
          </sch:QueryJobRequest>
       </soapenv:Body>
    </soapenv:Envelope>
    "@
        Ignore-SSLCertificates
        $headers = @{"Content-Type"="application/xml";"charset"="utf-8";"Authorization"="Basic $userpasswordbase64";"SOAPAction"="query/Job/QueryJob1"}
        $response = Invoke-WebRequest -Headers $headers -Uri $WSUri -Method Post -Body $payload #-Credential $creds
        return ([xml]($response.Content)).GetElementsByTagName("job:Status")[0].InnerText
    }

    $cronacleServer = Read-Host "Cronacle server (e.g. https://servername:10180)"
    $uri = "$cronacleServer/scheduler/api-soap/query/Job/QueryJob1"
    $user = Read-Host "Cronacle username"
    $password = Get-CronaclePassword
    $jobid = Read-Host "JobId?"
    Get-CronacleJobStatus -WSUri $uri -CronacleUser $user -CronaclePassword $password -JobId $jobid
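    For what it’s worth, the Authorization header constructed in the script above is just base64 of "user:password"; the same construction in Python, for illustration:

    ```python
    import base64

    def basic_auth_header(user, password):
        # Base64-encode "user:password", as the PowerShell does with
        # [System.Convert]::ToBase64String over the UTF8 bytes
        token = base64.b64encode("{0}:{1}".format(user, password).encode("utf-8")).decode("ascii")
        return "Basic " + token
    ```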
  • Issuing Linux commands with embedded quotes from PowerShell

    Here’s a little thing I figured out today and I’m sticking it on my blog because I know I’ll need this again at some point.

    I’ve been writing some PowerShell code to connect to Linux and issue bash commands (don’t ask why). We use plink.exe, a free utility that is bundled with PuTTY, to issue the commands, hence our code is something like this:


    Here’s the result:


    Simple enough. The difficulty comes though if the string that you want to echo contains embedded quotes, like this:


    It still runs without error but the same message gets output. The embedded quotes have been lost. I discovered (thanks to “Duncan” on Stack Overflow) that you need to (a) use a backslash to escape the quotes so that bash understands them and (b) do some fancy replacing using regex which I wrapped up in a function Set-EmbeddedQuotesInLinuxCommand. Hence my code now looks like this:


    And then we get the result we’re after:


    Quite frankly I don’t fully understand the intricacies of why this works but work it does and that’s all I care about. Here’s the code so you can copy and paste:

    function Set-EmbeddedQuotesInLinuxCommand ($Command) {
        return ($Command -replace '(\\*)"','$1$1\"')
    }

    $hostname = "<redacted>"
    $user = "<redacted>"
    $password = "<redacted>"

    $plinkPath = gci . -Recurse | `
                    Where-Object {$_.Name -eq "plink.exe"} | `
                    Select-Object -First 1 | `
                    ForEach-Object {$_.FullName}  #assumes that plink.exe exists somewhere under your current folder
    $greeting  = '\"Hello world!\"'   ##########Embedded quotes############
    $greeting = (Set-EmbeddedQuotesInLinuxCommand $greeting )
    $arguments = @()
    $arguments += "$User@$hostname","-pw",$password,"echo",$greeting

    & $plinkPath $arguments

    Obviously this is a bit of a contrived example (echoing strings isn’t particularly useful); in reality our use case was a lot more complicated, as we were passing connection strings and SQL statements to be issued, but the principle is the same. Hope this helps.
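    For anyone else puzzling over that regex: it doubles any run of backslashes preceding a quote and then escapes the quote itself. A Python rendering of the same replacement, for illustration:

    ```python
    import re

    def escape_embedded_quotes(command):
        # Double any run of backslashes that precedes a double-quote ($1$1 in
        # the PowerShell version), then emit an escaped quote in its place.
        return re.sub(r'(\\*)"', r'\1\1\\"', command)
    ```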


  • Derive fiscal week using Excel

    I had to do something in Excel today that might be useful to other folks so sharing it here.

    I've got an A4 sheet sellotaped to my desk that tells me the start and end date of each fiscal week in our fiscal year. It's useful but it's only for 2015 and, moreover, I could do with the information in digital form rather than dead-tree form. I figured I could achieve the same using Excel; the resultant workbook can be seen at: (Annoyingly, I have configured the pivot table to display the values as dates not numbers, but it seems Excel Online doesn't honour that, so you will have to download the document and simply refresh the pivot table in order to view it properly.)


    It allows you to pick the month in which the fiscal year starts (ours is March so that’s what is currently set in the workbook):


    and assumes that the first Monday of that month is the first day of the fiscal year.

    Doubtless there are easier ways to do this (there's probably a function in Excel that I don't know about) but nevertheless this works for me and it was kinda fun figuring it out; there are lots of formulae involved here if you care to dig into it. Thought I'd share it in case it was useful for anyone else. (Probably best not to trust that it's correct either; you might want to check it.)
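    For illustration only (this is not the workbook's formula, just the "first Monday of the chosen month starts the fiscal year" rule described above, expressed in Python):

    ```python
    import datetime

    def first_monday(year, month):
        # Date of the first Monday in the given month (Monday has weekday() == 0)
        d = datetime.date(year, month, 1)
        return d + datetime.timedelta(days=(7 - d.weekday()) % 7)

    def fiscal_week(date, start_month=3):
        # The fiscal year starts on the first Monday of start_month; dates
        # before that belong to the previous fiscal year.
        start = first_monday(date.year, start_month)
        if date < start:
            start = first_monday(date.year - 1, start_month)
        return (date - start).days // 7 + 1
    ```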


  • Iterate over a collection in Cronacle

    In my new life as a Hadoop data monkey I have been using a tool called Redwood Cronacle as a workflow/scheduler engine. One thing that has shocked me after years of working in the Microsoft ecosystem is the utter dearth of useful community content around Cronacle. There simply isn’t anything out there about it: nobody blogs about Cronacle (top link when googling for “cronacle blog” is, no forums, precisely zero questions (at the time of writing) on Stack Overflow tagged Cronacle… there’s just nothing. It’s almost as if nobody else out there is using it, and that’s infuriating when you’re trying to learn it.

    In a small effort to change this situation I’ve already posted one Cronacle-related blog post, Implementing a build and deploy pipeline for Cronacle, and in this one I’m going to cover a technique that I think is intrinsic to any workflow engine: iterating over a collection and carrying out some operation on each iterated value (you might call it a cursor). There’s a wealth of blog posts on how to do this using SSIS’s ForEach Loop container because it’s a very common requirement (here is one I wrote 10 years ago) but I couldn’t find one pertaining to Cronacle. Here we go…

    We have identified a need to be able to iterate over a dataset within Cronacle and carry out some operation (e.g. execute a job) for each iterated value. This article explains one technique to do it.

    My method for doing this has two distinct steps:

    1. Build a dataset and return that dataset to Cronacle
    2. Iterate over the dataset

    There are many ways to build a dataset (in the example herein I execute a query on Hadoop using Impala), hence the second of these two steps is the real meat of this article. That second step is, however, pointless without the first, hence both steps will be explained in detail.

    Here's my Cronacle Job Chain that I built to demo this:


    Step 1 - Build the collection

    To emphasize a point made above, I could have used one of many techniques to build a collection to be iterated over, in this case I issued an Impala query:


    N.B. The beeline argument --showheader has no effect when used with the -e option (it only has an effect when a file is specified using the -f option). This is an important point, as you will see below.

    When this JobDefinition gets executed we can observe that the collection is assigned to the outParam parameter:


    outParam is of type String, not a String array; the values therein are delimited by a newline character ("\n").

    Step 2 - Iterate over the collection

    The output parameter from the first job in the Job Chain is fed into an input parameter of the second job in the Job Chain:


    From there we write Redwood Script (basically Java code) to split the string literal into an array and then execute an arbitrary Job Definition "JD_EchoInParameterValue_jamie_test" for each iterated value.


    Thus, the code shown above is the important part here: it takes a collection that has been crowbarred into a string literal, splits it by "\n" into a string array, then passes each element of that array to another job as a parameter. I’ve made the code available in a gist:

    When executed, observe that "JD_EchoInParameterValue_jamie_test" gets called three times, once for each value in the array ("col", "1", "2")
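    Stripped of the Redwood specifics, the logic amounts to the following Python sketch; submit_job here is a hypothetical stand-in for Redwood’s job-submission API (the real code is Redwood Script, i.e. Java):

    ```python
    def iterate_collection(out_param, submit_job):
        # Split the newline-delimited string and dispatch one job per element
        for value in out_param.split("\n"):
            submit_job("JD_EchoInParameterValue_jamie_test", value)
    ```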




    I’m still a Cronacle beginner so it’s quite likely that there is an easier way to do this. The method I’ve described here feels like a bit of a hack, however that’s probably more down to my extensive experience with SSIS, which has built-in support for doing this (i.e. the ForEach Loop container).

    Comments are welcome.

    You can read all of my blog posts relating to Cronacle at


    * SSIS is the tool that I used to use to do this sort of stuff in the Microsoft ecosystem

  • Fetch once, write many times

    As I’ve said before, these days I’m spending time working on Hadoop rather than SQL Server, and one of my colleagues today alerted me to a rather cool feature of Hive (in a nutshell, Hive is the technology that enables us to write SQL statements against Hadoop). Check this out:


    Pretty cool, no? From a single table scan I can summarise and insert into two destinations. It’s the capability of an SSIS dataflow, but in SQL. If I ever make it back to the SQL Server world I’ll be requesting this feature in T-SQL pronto.
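    The snippet itself hasn’t survived, but the Hive feature in question is the multi-table insert, which HiveQL expresses roughly like this (table and column names here are invented):

    ```sql
    FROM transactions t
    INSERT OVERWRITE TABLE sales_by_store
      SELECT t.store_id, SUM(t.amount) GROUP BY t.store_id
    INSERT OVERWRITE TABLE sales_by_product
      SELECT t.product_id, SUM(t.amount) GROUP BY t.product_id;
    ```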


  • Tabular Data Packages and a BIML challenge

    Update: the challenge set below has been responded to by Scott Currie, who also happens to be the inventor of BIML. He has made his work available at

    In February 2004 I joined a company called Conchango and had the pleasure of working with some very very smart people (Mick Horne, Dan Perrin, Pete Spencer, Paul Mcmillan, Steve Wright amongst others) who schooled me in the ways and means of building data integration solutions. One piece of IP that Conchango had revolved around a means of transporting a collection of data files in a well-known format. We called it the "Generic Feed Interface" and it broke down like this:

    • Data was kept in CSV files. There was no limit to how many files were in a collection.
    • Each collection had two extra files, an inventory file and a metadata file. Both of these were also CSV files.
    • The inventory file listed all of the data files
    • The metadata file defined the data type of each column in each data file

    Here’s a very simple example of what this looked like, a collection consisting of one datafile:




    As well as this format of defining data our tangible IP consisted of two DTS packages that could:

    • be pointed at a database and produce such a collection
    • be given a collection and import it into a specified database, creating target tables if required

    This was used to extract data from systems of record which could then later be imported elsewhere for additional processing (e.g. as a source for a data warehouse). We liked this approach because we were loosely-coupling our data warehouse from the systems-of-record, this brought benefits such as the ability to extract data from source at a different time from which it was loaded into the target. It was a great piece of IP and was used at a time which heralded a very successful period for us in the Microsoft BI space. Doubtless many of you reading this have invented and/or used similar mechanisms for moving data.

    It was with interest then that earlier this week I was alerted (by Matt Smith) to an effort to achieve the same called Tabular Data Package.


    Tabular Data Packages use a single JSON file to serve the same purpose as our inventory and metadata files, but it’s the same basic concept: a well-known format for transporting data in human-readable files. Tabular Data Packages are advocated by the Open Data Institute; see Jeni Tennison's blog post on the subject, 2014: The Year of CSV.

    It occurs to me that it would be really useful to have a couple of SSIS packages that work with a Tabular Data Package in the same manner that our DTS packages worked with our data file collections all those years ago. Then again, using the SSIS object model to dynamically generate packages based on some known metadata (which is what would be required here) is notoriously difficult; better would be if there existed a code-generation tool for SSIS packages. Luckily such a thing exists: it’s called BIML, it is free, and it is becoming a very very popular means of developing SSIS solutions.

    Therefore I’m issuing a friendly challenge. Can anyone out there build a BIML script that can, given a database connection string, generate a Tabular Data Package containing all the data in that database? Furthermore, can someone also build a BIML script that can consume a Tabular Data Package and push all the data therein into a SQL Server database?

    The challenge is set. Is anyone game?


  • Why won’t you implement my little feature request, Microsoft?

    I’ve been an active member of the Microsoft developer community* for many years now and in that time one of my frustrations has been numerous futile attempts to get new features into SQL Server and SQL Server Integration Services. As of today, 27th March 2015, I’ve submitted 109 bugs and 289 suggestions


    to Microsoft Connect and, while there are occasional successes (the TOKEN function in SSIS sticks in my mind as one feature request that led directly to inclusion in the product**), the most common response is “Closed as won’t fix” (which is almost a meme in itself these days). I’m as guilty as anyone of getting irate at such responses.

    In the cold light of day I like to think I’m an empathetic kind of guy and can accept that getting a feature implemented in a behemoth product such as SQL Server is perhaps not as simple as a developer bashing out a few lines of code and committing to source control. With that in mind, today (thanks to Paul Stovell) I stumbled upon a very old blog post from Eric Lippert, How many Microsoft employees does it take to change a lightbulb?, which explains some of the contortions one has to go through to get a feature implemented in a Microsoft product; only one of them (the first one) actually involves writing the code.


    Now granted, there is an argument to say they’re making it more difficult than it needs to be, and 12 years on from that blog post shipping software is arguably easier than it was back then, but it’s worth remembering that implementing features for millions of customers is likely a different category of problem to any that those of us who use these products have to deal with (the product I work on arguably counts its customers in tens rather than millions). Call me a Microsoft apologist if you like, you might be right, but I do think that some people should perhaps weigh up what their “5 minute fix” actually entails before they start ranting about “Closed as won’t fix” on social media. I include myself in that.


    * assuming you consider SQL server developers to be members of the Microsoft developer community. That is up for debate of course.

    **I remember Matt Masson telling me at the time that he implemented this on a plane journey home to Canada one day

  • Analysing SQLBlog using Power Query

    In December 2011 I wrote a blog post entitled Querying RSS feed subscriber count on Google Reader using Data Explorer‏ in which I used a product called Data Explorer, that was in beta at the time, to find out how many people subscribed on Google Reader to various bloggers on Here were the results:

    It was a nice demo of what was possible using Data Explorer however it did have some limitations, mainly that I had to type in the list of bloggers and their URLs. It would have been much more elegant to have Data Explorer request a list of bloggers from and iterate over that list, querying for information about each in turn.

    Times move on and today, 3 and a bit years later, Google Reader is defunct and Data Explorer has morphed into Power Query. It looks different and has lots of new features which make it much more conducive to garnering information from the web. I thought it might be time well spent (and kinda fun) to revisit my previous blog post and see if Power Query can collate more useful information regarding than it could back then.


    If you can’t be bothered reading the detail of this post then just take a look at this screenshot that I shared on Twitter earlier today of a Power BI dashboard that shows information pertaining to


    This dashboard shows:

    That screenshot is taken from which is a place where Power BI reports (that are based on Power Query queries) can be stored and shared. Unfortunately they can only be shared with people whose email domain is the same as yours so I can’t share a link directly to the page which is a real shame because Power BI reports are interactive and the benefit is much greater when one is able to interact with them. If you would like this situation to change (and you should) then go and vote for these requests:

    The real work here goes on in the Power Query query that retrieves the data that underpins these reports. If you’d like to know more about that, read on!

    The Power Query part

    Power Query is, to all intents and purposes, a draggy-droppy-pointy-clicky UI over a functional language called M. The M queries that I built to obtain the data depicted above I have shared in this github gist: or you can download them as a Power BI Designer file: sqlblog.pbix. Whether you copy-and-paste the M code or download the .pbix, you will need to download and install Power BI Designer in order to run the queries and see the resultant data.

    One caveat to running the queries, ensure you have created a Data Source setting on your machine for, it should look just like this:


    Retrieving data from a website using Power Query is essentially a web scraping exercise. Here are the steps the M query basically goes through to retrieve this information:


      • For each link in the list of monthly archives
        • retrieve the monthly archive
        • for each blog post summary in the monthly archive
          • retrieve the title and number of comments

    That’s pretty much it. The code looks a bit gnarly at first glance but all it’s essentially doing is navigating the DOM and following a few hyperlinks.
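    The nested-loop structure above, expressed in Python terms with a hypothetical fetch function standing in for the M code’s HTML retrieval and parsing:

    ```python
    def scrape_blog(archive_urls, fetch):
        # For each monthly archive, collect (title, comment_count) for every
        # blog post summary the page contains. fetch(url) is assumed to return
        # a list of already-parsed post summaries.
        results = []
        for url in archive_urls:
            for post in fetch(url):
                results.append((post["title"], post["comments"]))
        return results
    ```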

    The reports

    The reports are Power View reports. Power View is fairly intuitive so I don’t think it’s worth talking about it here too much. I do however think it is worth showing some examples of the insights that Power View can provide on top of this data.

    How many blog posts and how many comments on SQLBlog


    Slicing by blogger

    Charting all of the blog posts and ordering by number of comments is useful, that’s how we see that Marco Russo’s post DateTool dimension: an alternative Time Intelligence implementation is the most popular:


    However if we introduce a slicer we can analyse a single blogger’s blog posts, let’s take Aaron Bertrand for example seeing as he has posted the most:


    Aaron’s most popular blog post (going by the number of comments) is his ditty from October 2010 Fun with software : uninstalling SQL Server 2008 R2 Evaluation Edition.

    How many bloggers ARE there on



    Taking it further

    There is a heck of a lot of other data available on should one want to go and fetch it and Power Query makes it really easy to do that. It would be interesting, for example, to chart the tally of blog posts per month and see when “peak blog month” occurred. I’ll leave that as an exercise for the reader.


  • Implementing a build and deploy pipeline for Cronacle

    In my new role I am working on a system that makes heavy use of Redwood Software’s Cronacle. If you hit that link you’ll find a lot of marketing mumbo jumbo about Cronacle such as:

    • Cronacle gives you greater process consistency and quality
    • speed up your mission critical IT and business processes by as much as 90%

    Read into that what you will; to me Cronacle is a scheduler and orchestration tool. That might come across as slightly belittling but it’s not intended to be: in my (limited) experience Cronacle does a couple of things and does them very well (certainly in comparison to the tool I’m more familiar with, SSIS; that’s a comparison I hope to explain more fully in a future blog post).

    In the aforementioned post New year, new beginnings I said

    My aim is to instil my beliefs about continuous integration, unit testing, failing fast, on-demand provisioning into my new team

    which, to use a buzzword du jour, might well have been written as “My aim is to instil a devops culture into my team”. First step has been to integrate Cronacle into our build/deploy pipeline and in this post I’d like to describe how one goes about doing that.

    Scene setting

    We are using Cronacle v9. We are developing Job Chains and Job Definitions (aka Process Chains and Process Definitions) and it is those objects that we want to:

    1. store in a version control system (aka source control)
    2. compose into a single deployable artefact as part of a Continuous Integration (CI) build
    3. deploy to our various dev, test and prod environments in a repeatable fashion

    Storing in VCS

    The first pre-requisite to implementing a successful devops culture (in my opinion) is to use a version control system (VCS). I’m not fussy about which VCS one is using, as long as one uses one of them. We are using Subversion. My only stipulation is that no code in our build and deployment pipeline relies on us using Subversion because I don’t want to be tightly coupled to a particular technology.

    Typically one stores human-readable files in a VCS and I was keen that we did the same for our Cronacle Job Chains and Job Definitions. Cronacle enables one to export objects:


    when selecting that option a binary file with a .car extension is downloaded:


    .car file is shorthand for “Cronacle Jar”. Jar files are (as I understand them) used in the Java ecosystem as a mechanism for distributing stuff, they’re basically just .zip files with a different extension. Hence, get hold of some compression software that understands the .zip format (I recommend 7zip*), unzip your .car file and in the unzipped folder you’ll find a JobDefinition folder containing an XML file that defines the object that was exported:


    That XML file is the human-readable file that we check in to our VCS.


    We use TeamCity to run our CI builds (it’s my first time using TeamCity in anger and I have become an unabashed fan) in combination with an internally-built build automation framework called PSP.Build that is built atop PowerShell and Psake. PSP.Build is loosely based on Lloyd Holman’s OneBuild and follows the principles that Lloyd sets out at Run the same build process everywhere. We also use Pester for unit testing our PowerShell code - I’m loving me some Pester.

    Building for Cronacle means reconstituting those XML files that we have in Subversion into a deployable .car file. The easiest way to do that is simply to zip them up for which, again, we are using 7zip. In addition there are some other files that need to be bundled into the .car file, one can see those files in the aforementioned .car file that is used for exporting from Cronacle.


    You can simply take a copy of those files; they don’t need changing. (We bundle them inside PSP.Build so that we always have them available to our build process.)

    So in summary, in order to build a .car file you need three things:

    • XML files that define the Cronacle objects
    • A zip utility
    • com and META-INF folders

    Once you have them it just takes a bit of PowerShell to bundle them up. Here’s a direct copy of our code that does that:

    Remove-CarFiles -sourcePath $sourcePath #Simply deletes all the .car files it finds underneath $sourcePath

    Get-ChildItem -path "$basePath" -Recurse | Where-Object {$_.Name -eq "CronacleCarIncludes"} |
           Get-ChildItem | Copy-Item -Destination "$sourcePath\$nameofFolderToBeZippedUp" -Recurse -Force #fetches "com" & "META-INF" folders

    $zipperExe = Get-ChildItem -Path $basePath -Include "7za.exe" -Recurse |
           Select-Object -First 1 | foreach {$_.FullName} #find 7zip, which is bundled in our build framework, PSP.Build
    if ($zipperExe -eq $null) {throw "7za.exe could not be found in $basepath"}

    "Creating .car for sourcePath: $sourcePath\$nameofFolderToBeZippedUp" | Write-Verbose
    Push-Location "$sourcePath\$nameofFolderToBeZippedUp"
    & $zipperExe a -tzip "$" * #zip it all up!

    [I’m slightly reluctant to show my PowerShell code in public as it nearly always gets criticised, so be gentle, OK? 🙂]


    So, the output from building all our Cronacle objects is a .car file. How do we deploy that .car file to our Cronacle cluster? It’s actually quite simple: we execute our .car file using java.exe and pass in a few parameters. The Cronacle documentation informs us of the generic form of the command and gives an example:


    (This documentation does not appear online anywhere so I can’t link to it, I’m afraid)

    The smarts that say what to do with all the Cronacle object definitions live in that com folder mentioned above, hence it’s imperative that the com folder is included in the .car file.
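    To give a feel for it, the command line that our deployment script assembles ends up looking roughly like this (every value here is a placeholder):

```powershell
# Placeholders throughout; the flags mirror the ones our script passes
& "C:\tools\java\bin\java.exe" -jar "OurJobChains.car" -server "our-cronacle-server" -txt -ruleset "DEV" -username "deploy_user" -password "********"
```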

    Again, we wrote a load of PowerShell code to automate the process of building up the command-line and executing it, here’s that code if you’re interested:

    function Invoke-DeployCronacleCarFile () {
        [CmdletBinding()]
        param (
            [System.IO.FileInfo]$carFilePath,
            [string]$javaExeFilePath,
            [string]$server,
            [string]$importRuleset,
            [string]$username,
            [string]$password
        )
        try {
            if ($carFilePath.Extension -ne ".car") {throw "File: '$carFilePath' is invalid. File must have a .car extension."}
            if (-not (Test-Path $carFilePath)) {throw "File: '$carFilePath' does not exist."}

            $cmd =  "$javaExeFilePath"
            $arguments = '-jar',"$carFilePath", '-server', "$server", '-txt','-ruleset',"$importRuleset"

            if (-not ($username -eq $null -or $username -eq "")){$arguments += '-username', "$username"}
            "cmd: $cmd" | Write-Verbose
            "arguments: $arguments" | Write-Verbose
            if (-not ($password -eq $null -or $password -eq "")){$arguments += "-password","$password"}
            Invoke-CommandLine -cmd $cmd -arguments $arguments
        }
        catch {
            throw #rethrow so the calling build step fails loudly
        }
    }

    function Invoke-CommandLine ([string]$cmd, [string[]]$arguments, [string]$errorMessage = "Error executing command: " + $cmd) {
        & $cmd @arguments #| write-host
        if ($LastExitCode -ne 0) {
            throw $errorMessage
        }
    }

    That may look slightly daunting but there’s not too much going on here. It takes the path to a .car file, the path to java.exe, a server, a username, a password and an ImportRuleSet**, concatenates them into a list of arguments, then hands that to Invoke-CommandLine to be executed.


    That’s pretty much it. The nuts and bolts of this are relatively simple: you’re just zipping your definitions up and deploying that zip file using java.exe.

    Hope this helps.


    * We bundle 7zip inside PSP.Build so we don’t have to rely on the actor who kicks off the build (which would be TeamCity or one of our team members) having it installed - I’m pretty fierce about reducing external dependencies for our build/deploy pipeline. If it’s a pre-requisite then it gets bundled into PSP.Build.

    ** An ImportRuleSet defines environment-specific settings. A discussion of ImportRuleSets is outside the scope of this article however if you’ve come this far you’ll probably already know whether you need to be using them or not.

  • Test for a warning message using Pester

    Here’s a little ditty that’s worth throwing out (if nothing else so that I can find it later). If you’re not using Pester to test your PowerShell code then this post probably isn’t for you; if you’re not using PowerShell, stop reading now.

    I wanted to write a Pester test that checked whether or not a piece of code threw a warning. I discovered from Understanding Streams, Redirection, and Write-Host in PowerShell that it is possible to redirect stuff in the PowerShell Warning (or Verbose or Debug) stream to the output stream. Here’s an example of doing just that:

    "Here is a message that I will send to the warning stream then redirect back to the output stream" | Write-Warning 3>&1 | %{$_.message}


    The “3>&1” part says “take the contents of stream #3 (i.e. the Warning stream) and send it to stream #1 (i.e. the Output stream)”. Once it’s in the output stream you can operate upon it as normal.

    In my Pester test I wanted to know whether or not a particular piece of code returned a warning matching a given string. Using the above technique it’s simply:

    (Some-Code-That-Is-Getting-tested 3>&1) -match "Desired warning message" | Should Be $true

    Cool stuff! I love Pester.
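    To make that concrete, here’s a minimal, self-contained sketch; the Remove-Widget function is made up purely for illustration:

```powershell
# Hypothetical function under test: warns when asked to remove a widget
# that does not exist
function Remove-Widget ([int]$Id) {
    Write-Warning "Widget $Id could not be found"
}

# Inside a Pester It block the assertion would read:
#   (Remove-Widget -Id 42 3>&1) -match "could not be found" | Should Be $true
# The underlying mechanic, runnable without Pester:
$warning = (Remove-Widget -Id 42 3>&1)
$warning -match "could not be found"
```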


  • New OneDrive API, possibly unified with OneDrive for Business

    Microsoft offers two similar, but separate, services containing the OneDrive moniker. There’s OneDrive (which is free) and OneDrive for Business (which isn’t). Strangely (as my ex-colleague Mark Wilson points out at OneDrive for Business: lots of cloud storage; terrible sync client) the free offering is much better than the paid-for one.

    Microsoft state in their blog post Taking the Next Step in Sync for OneDrive on 7th January 2015:

    It was clear that the right approach was to converge to a single sync engine and experience that would be able to provide all of the benefits of the consumer and business service to all customers faster. We decided to start with the consumer sync engine foundation from Windows 7 and Windows 8 and add the right capabilities from the other two engines.

    That’s Microsoft-speak for “Yes, OneDrive for Business is shit so we’re dumping it in favour of the OneDrive stuff”. Good news.

    Back in October 2014 Microsoft announced the availability of a Files API that allowed developers to access files stored in Office 365:

    we’re enhancing the opportunity for developers with Office 365, with new APIs for mail, files, calendar and contacts. These new robust REST-based APIs empower all developers to leverage the more than 400 petabytes of data (as of March 2014) and popular services across Office 365 in any application.

    New Office 365 extensibility for Windows, iOS, Android and web developers

    I asked a couple of the guys that worked on the API whether the Files API would work against OneDrive and/or OneDrive for Business. The answer from Chris Johnson was “working on it”:


    That sounded like good news too.

    Yesterday Microsoft announced the availability of The new OneDrive API, which is a RESTful API. Given that they are attempting to unify OneDrive and OneDrive for Business, and also given Chris Johnson’s response to my question in October, I wondered whether this API would cover OneDrive for Business as well, so I decided to check out the new OneDrive API using the API console and found something quite promising. A request that GETs a user’s default drive returns:


    Note the mention of

    "driveType": "consumer"

    That says to me that there will be other driveTypes, and so I assume that this API will also be used to access OneDrive for Business stuff. More good news if that’s the case. I tweeted that question but at the time of writing haven’t had an answer.
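    If the API does grow other driveTypes then client code could branch on that field. Here’s a speculative sketch against a made-up, abridged response (this is not the API’s actual full payload):

```powershell
# Hypothetical, abridged JSON of the shape the default-drive resource returns
$json = '{ "id": "0123456789abc", "driveType": "consumer", "quota": { "total": 1099511627776 } }'

$drive = $json | ConvertFrom-Json
switch ($drive.driveType) {
    "consumer" { "Free OneDrive drive" }
    default    { "Some other drive type: $($drive.driveType)" }
}
```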

    Some questions still remain in my mind:

    • What’s the difference between this new OneDrive API and the Files API announced back in October? Do we have two teams building essentially the same thing? That sounds like the bad Microsoft of old, so I do hope not.
    • Are we ever going to get a RESTful API that will enable us to push data into an Excel workbook that is stored on OneDrive? Chris Webb and I have been asking for this for five years now. Again I asked the question on Twitter and am still hoping for a response. I do have high hopes that the new Power BI REST API will sate my needs here (right now I wouldn’t know because if one isn’t in the US then one cannot access it).


  • Response to “SSDT limitations”

    A commenter on my blog post Considerations when starting a new SSDT database project asked me to comment on his blog post SSDT Limitations. I wrote a response but when I tried to post it I got told it was too long:


    so I’m posting my response here instead. I’d suggest taking a read of the commenter’s post before reading on.

    "SSDT does not handle multiple file groups"
    I haven't experienced this. I'm pretty sure I've used SSDT to deploy multiple filegroups. Can you clarify what is meant by "does not handle"?

    "SSDT wants to control everything"
    I'd restate that as "SSDT gives you the option to control everything". If you don't want to put all your database objects into your SSDT project, then don't.

    "Production will not look like UAT. For SSDT, everything looks the same."
    Yes, this is a problem. SSDT espouses a mantra of "build once, deploy anywhere" but this breaks down if you want something to be different on each environment - security is the obvious thing here. My approach is to specify roles in the database project that have permissions assigned to them and then later (perhaps in a post-deployment script) add users into those roles. I describe this more fully here: A strategy for managing security for different environments using the Database Development Tools in Visual Studio 2010

    "This means storing in version control production:
    -Logins & passwords
    -app role passwords"
    If you're storing passwords in source control then that must mean you're using SQL authentication instead of Windows authentication - that's your decision and I don't see how that is a shortcoming of SSDT. If you don't want to store them in source control - don't. SSDT has mechanisms that allow you to specify values at deployment time (search for SSDT sqlcmd variables). Bottom line, if you're storing passwords in source control then more fool you - nothing in SSDT forces you to do this.

    "It is generally preferred to at least have the option to inspect what will be deployed"
    Correct. And that is why (as you pointed out) SSDT has "the option of either publishing directly to the database or generating a script to run for the deployment."

    "Unfortunately for SSDT the upgrade generates a single file. In practice this means is that if the database upgrade fails for any reason recovery is rather complex. You'll need to work out where the upgrade script failed and then what action you will take from there."
    True. There are things you can do to mitigate this:
    -Test your deployments on a copy of your production database first
    -Do small, frequent releases rather than the big bang approach of infrequently deploying large, monolithic sets of lots of changes

    "The product feels rather like a beta release, bringing even a simple project into production will expose some pretty serious limitation."
    Beta release (to me) implies that it doesn't behave the way its been designed to. I put it to you that it behaves exactly the way its been designed to, it just so happens that that you don't like that behaviour. Fair enough if that's the case, you're not alone in that regard. There are other options for doing deployments if you don't like the way that SSDT espouses. Perhaps look at migrations ( or the old-fashioned method of manually writing change scripts.

    "However the largest issue is more fundamental: the approach where you are expected to trust the magic box to generate a valid upgrade script."
    Yes, you do have to put your trust in SSDT. If you're not happy doing so, don't do it, or just use the option to generate the deployment script and not run it. I personally am happier putting my trust in a repeatable algorithm rather than human beings who tend to be error prone. That's just me, your mileage may vary.

    "It is very easy to generate scripts that will fail to upgrade the database successfully."
    There are certain changes that SSDT is not easily able to cater for (e.g. adding a non-nullable column to a table that contains data) however I'm not sure that constitutes "very easy to generate failing scripts". Personally I think SSDT is better at generating working deployment scripts than a human being is because its quicker and less error prone. As I've alluded, its imperative that you test your deployments before pushing to a production database.

    You asked me to comment and I have done. Clearly we have a difference of opinion about SSDT and that's OK. I hope you can find a way around your complaints because I honestly believe SSDT is the best tool out there for managing SQL Server deployments - I hope you come to the same conclusion one day.

