Spinach and Database Development - SQLSatExeter Keynote

Last weekend, we held our SQL Saturday event in Exeter. It was a brilliant event for many reasons, but we were delighted to have a world exclusive keynote video by Phil Factor about Spinach and Database Development. With many thanks to those who made it possible, and particularly to Phil Factor, I have linked to the video here and also transcribed it. Please watch, read and understand the message.

 

 

What has spinach got to do with Database Development?

 

 

Generations of children were fed spinach in preference to more nutritious things, such as cardboard, because of the persistence of bad data.

 

It wasn’t in fact the decimal point error of legend, but confusion over the way that iron was measured in the late 19th-century data. As a result, nutritionists persisted in believing for generations that it was a rich source of the iron that the body needs in order to create blood cells. In fact, the very little iron that there is in spinach isn’t in a form that can be readily absorbed by the body anyway.

 

The consequences of bad data can be dire

 

Guarding the quality of your data is about the most important thing that you as a data professional can do. You may think that performance is important, but it would just deliver you the wrong answer faster. Resilience? It would just make it more likely that you’d be able to deliver the wrong answer. Delivery? Yep, you got it, the wrong answer quicker.

 

The spinach example is a good one because bad data is hard to detect and can go unnoticed for generations. This is probably because people don’t inspect and challenge data as much as they should. You would have thought it strange that a vegetable like spinach should have fifty times as much iron as any other vegetable, but the fact came from a very reputable source, so people just shrugged and accepted it.

 

We have a touching faith in data.

 

We, as a culture, assume it’s correct and complete. We like to believe that it’s impossible that prejudice, bias, criminality or foolishness could affect the result. Worse, we think that valuable truth can be sifted from any data, no matter the source. If there’s enough of it then there must be value in it. It’s like panning for gold dust from a river. The sad truth is that this is a delusion, but a very common one in our society. We are, in our mass culture, in the bronze age rather than the information age, struggling with silvery toys imbued with mystical magical powers.

 

A good database professional must be unequivocal.

 

Bad data cannot be cleaned in the same way that one can clean mud off a diamond. If data contains bad data then the entire data set must be rejected.

 

There’s no such thing as data cleansing.

 

You as a DBA may be asked to take out data that seems absurd, such as ages that are negative or ages that are so great that the person couldn’t possibly be alive, but that still leaves you, in the same dataset, with data that is plausible but wrong.

 

Only in very exceptional circumstances, when you know precisely why a minority of your data is wrong, would you be justified in correcting it.

 

Statistics can help us to make very confident assertions about large datasets if they conform to one of the common distributions, but they cannot tell us anything about individual items of data. You can of course remove outliers, but in fact outliers are just items of data that don’t conform to your assumptions about the data, and the whole point of data analysis is to test your assumptions. By cleaning data, by removing outliers, you can prove almost anything scientifically.

 

A well designed database is defended in depth at every possible opportunity.

 

Depth is actually an interesting analogy because experience tells us that bad data seems to leak in under pressure, through every crack when the database is working hard. Like you will see in a World War 2 submarine movie, in a well-used OLTP database, we are like the crew, swivelling our eyes in terror savouring the futility of any remediation as ghastly drips run down the walls of our database and wishing we had put in more constraints.

 

In terms of the defence of data, check constraints and foreign key constraints are excellent, of course, and triggers are good, but there are other ways of getting warnings of errors in data, such as sudden changes in the distribution of data and other anomalies. One check I like to do is the tourism check, where you check your data all the way back to source. This technique once famously picked up the fact that a famous motor manufacturer was reporting its deceleration figures in yards per second when it should have been metres per second.

 

When you start putting in check constraints you say to yourself, “this couldn’t possibly happen”. This is the voice of superstition. A famous programmer of the 1970s took to putting a message in his code saying “this error could never happen”, and he put it in places where it couldn’t possibly ever be executed. The funny thing was that the more he tested the programme, the more that error appeared on the screen. It is the same with constraints: the more traps you set, the more critters you catch, and you’re left wondering how on earth all that bad data was getting in.

 

It’s misleading to go on about the value of the great flood of big data. There’s a strong superstition that data has some sort of intrinsic mystical value all of its own.

 

Unless you can prove that data is correct, it’s valueless, because if you trust it you can end up with generations of children compelled to eat spinach for no good reason at all.


Instances and Ports with PowerShell

Just a quick post, and a day late, for #SQLNewBlogger. There are some excellent posts on that hashtag and I recommend that you read them.

When you know a server name but not the names of the instances or the ports that they are using, this function will be of use.

Checking SQL Server User Role Membership with PowerShell

Please go and check the New SQL Bloggers posting here: https://twitter.com/search?q=%23sqlnewblogger. There are some brilliant new and older bloggers adding great value to the SQL Community.

This is my most viewed post, so I thought it made a good candidate to be updated and reblogged.

PowerShelling SQL Saturday Sessions to the Guidebook app

Following on from my previous post about parsing XML, where I used the information from Steve Jones’ blog post to get information from the SQL Saturday web site, I thought that this information and script may be useful for others performing the same task.

Edit: This post was written prior to the updates to the SQL Saturday website over the weekend. When it came back up the script worked perfectly, but the website is unavailable again at the moment, so I will check and update as needed once it is back.

We are looking at using the Guidebook app to provide an app for our attendees with all the session details for SQL Saturday Exeter.

The Guidebook admin website requires the data for the sessions in a certain format. You can choose CSV or XLS.

In the admin portal you can download the template, which gives an Excel file.

 

So now all we need to do is to fill it with data.

I have an Excel Object Snippet which I use to create new Excel Objects when using PowerShell to manipulate Excel. Here it is for you. Once you have run the code you will be able to press CTRL + J and choose the New Excel Object Snippet any time.

I needed to change this to open the existing file by using

In the more help tab of the Excel workbook it says

2. Make sure that your dates are in the following format: MM/DD/YYYY (i.e. 4/21/2011). If the dates are in any other format, such as “April 21, 2011” or “3-Mar-2012”, Gears will not be able to import the data and you will receive an error message.
3. Make sure that your times are in the following format: HH:MM AM/PM (i.e. 2:30 PM, or 11:15 AM). If the times are in any other format, such as “3:00 p.m.” or “3:00:00 PM”, Gears will not be able to import the data and you will receive an error message.

So we need to do some manipulation of the data we gather. As before I selected the information from the XML as follows

I then looped through the $Talks array and wrote each line to Excel like this

I know that I converted the String to DateTime and then back to a String again, but that was the easiest (quickest) way to obtain the correct format for the Excel file.
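For anyone who wants to see the shape of that String to DateTime to String round trip on its own, here is a minimal sketch. It is not the original script: the sample `$Talks` data and its property names (`Title`, `StartTime`, `EndTime`) are made up for illustration, and the format strings `M/d/yyyy` and `h:mm tt` are an assumption based on the examples quoted from the template help text (4/21/2011 and 2:30 PM).

```powershell
# Hypothetical sample data standing in for the $Talks array built from the XML.
# The property names and values are assumptions made for this sketch only.
$Talks = @(
    [pscustomobject]@{ Title = 'Session A'; StartTime = '2015-04-25T09:30:00'; EndTime = '2015-04-25T10:20:00' }
    [pscustomobject]@{ Title = 'Session B'; StartTime = '2015-04-25T14:05:00'; EndTime = '2015-04-25T14:55:00' }
)

# The invariant culture keeps the "/" separator and the AM/PM designators
# the same whatever the regional settings of the machine are.
$Culture = [System.Globalization.CultureInfo]::InvariantCulture

$Rows = foreach ($Talk in $Talks) {
    # String -> DateTime, so that the value can be reformatted
    $Start = [datetime]::Parse($Talk.StartTime, $Culture)
    $End   = [datetime]::Parse($Talk.EndTime, $Culture)

    # DateTime -> String, in the shapes shown in the template's examples
    [pscustomobject]@{
        Title     = $Talk.Title
        Date      = $Start.ToString('M/d/yyyy', $Culture)
        StartTime = $Start.ToString('h:mm tt', $Culture)
        EndTime   = $End.ToString('h:mm tt', $Culture)
    }
}

$Rows | Format-Table -AutoSize
```

With the sample values above, the first row comes out with a date of 4/25/2015, a start time of 9:30 AM and an end time of 10:20 AM. If Guidebook turns out to want leading zeros, `MM/dd/yyyy` and `hh:mm tt` would be the format strings to try instead.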

Then, to finish, save the file and quit Excel.

Then you upload the file in the Guidebook admin area, wait for the email confirmation, and all your sessions are available in the Guidebook.

I hope that is useful to others. The full script is below.