T-SQL Tuesday #016 – Aggregation Functions

March’s T-SQL Tuesday is brought to us by Jes Schultz Borland (Blog|Twitter).  In addition to hosting this month’s blog event, Jes is presenting a session later this month at SQL Saturday #67 in Chicago, so I’m hoping to finally have a chance to meet her.  It comes as no surprise that such a busy person is interested in rolling up voluminous data into meaningful information for analysis.  Jes wants “to hear how you solved business problems with aggregate functions.”

A fair amount of my work involves loading data along the path from source systems to data warehouses to reporting databases or data marts.  The source systems are typically external, so there’s no control over the flow or contents being delivered from them.  That’s interesting, because if someone says they re-sent yesterday’s feed, you don’t know whether there was a change in the feed’s contents that requires a reload of yesterday’s data.  Also, the development team makes occasional changes to the components that move data from the source systems to the data warehouses to the reports.  Did those changes necessitate a reload of data?  Did those changes produce the desired results?  Or, whoops, did a source code modification accidentally change something that wasn’t supposed to be touched?

So for various reasons, detecting changes in data is a normal task for me.  No problem, just do a query that compares the before-and-after values linked on a natural key, right?  Yes, but…
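Concretely, “a query” there means something along the lines of the sketch below.  The table and column names are hypothetical, but it shows the shape of a before-and-after comparison linked on a natural key:

-- Hypothetical sketch: compare before and after rows linked on a natural key.
-- BeforeTable, AfterTable, NaturalKey, and SomeValue are made-up names.
SELECT COALESCE(b.NaturalKey, a.NaturalKey) AS NaturalKey
     , b.SomeValue AS OldValue
     , a.SomeValue AS NewValue
FROM dbo.BeforeTable b
FULL OUTER JOIN dbo.AfterTable a
  ON a.NaturalKey = b.NaturalKey
WHERE b.NaturalKey IS NULL          -- row added
   OR a.NaturalKey IS NULL         -- row removed
   OR a.SomeValue <> b.SomeValue ; -- value changed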

What if there isn’t a natural key?  If a surrogate key exists then perhaps that can be used for the link, but that only works if a given record receives the same surrogate key value in both the before and after loads.  That’s going to depend on how the data is processed and how the relevant surrogate key is created.  Let’s say you can use the surrogate key for linking.  But guess what?  Now you are depending on a process.  Processes can change and possibly break your ability to link on the surrogate key.

What to do?  As you ponder this, you realize that you could SUM the numeric columns and compare the before and after SUM values.  Of course, there could be offsetting changes, but let’s not get picky.  Whoops, if there are a lot of rows, the value could overflow, so better to use AVG.  Oh no, that doesn’t help with character columns.  How about calculating the LEN of each character value?  It won’t catch everything, but whatever, you just want this done, so you can take the AVG of that, right?  Wait, what about date columns?  Phew.  Time to rethink this approach.
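Just to make the clumsiness concrete before moving on, here’s a sketch of where that line of thinking ends up (again with hypothetical table and column names): one expression per data type, and dates still unhandled.

-- Hypothetical sketch of the type-by-type approach.
SELECT 'Before' AS WhichLoad
     , AVG(NumericCol)   AS AvgNumeric -- numeric columns: AVG to dodge overflow
     , AVG(LEN(CharCol)) AS AvgCharLen -- character columns: length only, misses plenty
FROM dbo.BeforeTable
UNION ALL
SELECT 'After'
     , AVG(NumericCol)
     , AVG(LEN(CharCol))
FROM dbo.AfterTable ; -- ...and date columns are still not covered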

No problem, this was on the right track, but there’s a simpler approach than trying to treat each data type differently.  Checksums are often used to detect errors by comparing before and after values, and they can be used to solve our problem as well.  The SQL Server T-SQL functions CHECKSUM and BINARY_CHECKSUM can be applied to one or more columns in a table to return a checksum value.  Fine, so how does that help?  And what’s that got to do with today’s topic, aggregate functions?
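Both functions take one or more expressions and boil them down to a single integer.  A quick illustration (the actual integer values returned aren’t important here):

-- CHECKSUM reduces one or more expressions to a single INT.
-- Note that CHECKSUM honors collation, so under a case-insensitive
-- collation two strings differing only in case produce the same value;
-- BINARY_CHECKSUM works over the binary representation instead.
SELECT CHECKSUM('Suffolk')           AS OneColumn
     , CHECKSUM('Suffolk', 'M', 175) AS SeveralColumns
     , BINARY_CHECKSUM('Suffolk')    AS BinaryVariant ;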

The aggregate function I’ve been building up to is CHECKSUM_AGG.  This function allows you to find a checksum over groupings.  I’ve found this helpful in situations where there is no natural key for linking before and after results, but there are groupings that can be linked and compared.  This means you still won’t necessarily find the exact record that changed.  However, the groupings you use should let you know if changes occurred and might let you narrow your search down to just a few records to find changes.
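The shape of the call is simple.  Here’s a minimal self-contained example with made-up values (the VALUES derived table, like the constructor used later in this post, requires SQL Server 2008):

-- One aggregated checksum per group.
SELECT Grp
     , CHECKSUM_AGG(CHECKSUM(Val)) AS GroupChecksum
FROM (VALUES ('A', 'x'), ('A', 'y'), ('B', 'z')) t(Grp, Val)
GROUP BY Grp ;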

Of course, the problem with checksum values is that they can collide.  In other words, different values can end up producing the same checksum.  This means that while a different before-and-after checksum can be equated with a change in the underlying data, an identical before-and-after checksum doesn’t guarantee that the underlying data stayed the same.  For this reason, I like calculating different levels of grouping when using the CHECKSUM_AGG function… I’ve sometimes found that one grouping will detect a change that another missed.
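Collisions are easy to manufacture, because as far as I can tell CHECKSUM_AGG behaves like a bitwise XOR of its inputs: a value that appears an even number of times cancels itself out.  A quick illustration of that behavior:

-- Both of these should return 5, since 7 XOR 7 cancels to 0.
SELECT CHECKSUM_AGG(x) FROM (VALUES (5)) t(x) ;
SELECT CHECKSUM_AGG(x) FROM (VALUES (5), (7), (7)) t(x) ;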

Enough talk, it’s time to demonstrate this concept.  As usual, I want to focus on the concept, so I’ll keep things simple rather than worrying about proper table design.  And let’s face it, you don’t always get to work with properly designed databases!  So please don’t look to the table created below as an example of good database design or practice.  The standard disclaimer applies: there are no guarantees or warranties, don’t blindly run any source code that follows on a production system, and so on.  The source code below was used with SQL Server 2008 R2 Developer Edition.

Let’s say we’ve inherited a data warehouse of farm animals.  A data feed (not to be confused with animal feed) populates this data warehouse each day with information on the animals.  Reports are generated from this data warehouse periodically.

CREATE DATABASE [TestFarm] ;
GO

USE [TestFarm] ;

CREATE TABLE [dbo].[TestAnimal]
( [RecordID]    INT IDENTITY(1,1) NOT NULL
, [MeasureDate] SMALLDATETIME NOT NULL
, [FarmName]    VARCHAR(30)   NOT NULL
, [Animal]      VARCHAR(30)   NOT NULL
, [Breed]       VARCHAR(30)   NOT NULL
, [Gender]      CHAR(1)       NOT NULL
, [Weight]      INT           NOT NULL
, CONSTRAINT [PK_TestAnimal] PRIMARY KEY CLUSTERED ([RecordID]) ) ;
GO

Next, we’ll simulate the loading of the first day’s data feed by using an INSERT statement (in reality, the data feed might be, for example, a CSV file that’s loaded with SSIS).  Note that this “table value constructor” INSERT syntax is not supported prior to SQL Server 2008, so on older versions you’ll need to break this out into separate INSERT statements (a per-row equivalent is sketched after the load below).  Sorry, but when I pay for new features I like to use them!

INSERT INTO [dbo].[TestAnimal] VALUES
('2011-01-01','Happy Trails','Sheep','Suffolk','M',175),
('2011-01-01','Happy Trails','Cattle','Angus','M',900),
('2011-01-01','Happy Trails','Cattle','Angus','F',850),
('2011-01-01','Happy Trails','Cattle','Jersey','F',850),
('2011-01-01','Happy Trails','Sheep','Montadale','F',150),
('2011-01-01','Happy Trails','Sheep','Montadale','F',180),
('2011-01-01','Happy Trails','Sheep','Suffolk','F',165),
('2011-01-01','Happy Trails','Sheep','Montadale','M',220),
('2011-01-01','Oak Run','Cattle','Angus','F',840),
('2011-01-01','Oak Run','Cattle','Angus','F',850),
('2011-01-01','Oak Run','Cattle','Jersey','F',850),
('2011-01-01','Oak Run','Cattle','Jersey','F',840),
('2011-01-01','Oak Run','Cattle','Jersey','F',875),
('2011-01-01','Oak Run','Sheep','Montadale','F',160),
('2011-01-01','Oak Run','Sheep','Montadale','F',185),
('2011-01-01','Oak Run','Sheep','Suffolk','M',225),
('2011-01-01','Oak Run','Sheep','Suffolk','F',165),
('2011-01-01','Oak Run','Sheep','Montadale','M',210),
('2011-01-01','Oak Run','Sheep','Southdown','M',225) ;
GO
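
And for anyone stuck on an older version, the per-row equivalent promised above is simply one INSERT per row; the first two rows would look like this (shown for illustration only, don’t run both versions of the load):

INSERT INTO [dbo].[TestAnimal] VALUES ('2011-01-01','Happy Trails','Sheep','Suffolk','M',175) ;
INSERT INTO [dbo].[TestAnimal] VALUES ('2011-01-01','Happy Trails','Cattle','Angus','M',900) ;
GO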

With this, we are able to generate reports for 2011-01-01 and send them off to Happy Trails and Oak Run farms.  Not long after, the data feed provider lets you know they’ve re-transmitted the feed.  You ask what changed, and they don’t know.  You ask if anything changed, and they don’t know.  You ask if they know anything, and, well, the conversation goes downhill after that.  You’ve got to figure out for yourself whether or not the reports need to be regenerated.  Of course, this example is just 19 rows, so you could just eyeball the data to see if it changed, but let’s pretend it’s thousands or millions of rows.  Below we’ll simulate archiving the prior load, removing the prior load, and loading the new transmission into the data warehouse.

--archive prior load.
SELECT * INTO [dbo].[TestAnimal_PRIOR] FROM [dbo].[TestAnimal] ;
GO

--remove prior load from data warehouse.
DELETE [dbo].[TestAnimal] WHERE [MeasureDate] = '2011-01-01' ;
GO

--load retransmitted feed.
INSERT INTO [dbo].[TestAnimal] VALUES
('2011-01-01','Happy Trails','Cattle','Angus','M',900),
('2011-01-01','Happy Trails','Cattle','Angus','F',850),
('2011-01-01','Happy Trails','Cattle','Jersey','F',850),
('2011-01-01','Happy Trails','Sheep','Suffolk','M',175),
('2011-01-01','Happy Trails','Sheep','Suffolk','F',165),
('2011-01-01','Happy Trails','Sheep','Montadale','F',150),
('2011-01-01','Happy Trails','Sheep','Montadale','F',180),
('2011-01-01','Happy Trails','Sheep','Montadale','M',220),
('2011-01-01','Oak Run','Sheep','Montadale','F',160),
('2011-01-01','Oak Run','Sheep','Montadale','F',185),
('2011-01-01','Oak Run','Sheep','Suffolk','M',225),
('2011-01-01','Oak Run','Sheep','Suffolk','F',165),
('2011-01-01','Oak Run','Sheep','Montadale','M',210),
('2011-01-01','Oak Run','Sheep','Suffolk','M',225),
('2011-01-01','Oak Run','Cattle','Angus','F',840),
('2011-01-01','Oak Run','Cattle','Angus','F',850),
('2011-01-01','Oak Run','Cattle','Jersey','F',850),
('2011-01-01','Oak Run','Cattle','Jersey','F',840),
('2011-01-01','Oak Run','Cattle','Jersey','F',875);
GO

Now we have the prior results available in the “TestAnimal_PRIOR” table to compare with the new values.  Notice that while there is a surrogate key, it’s determined by the order of insert, and the records in the new transmission are in a different order than the old one… just look at the first record in the feed: previously it was a sheep, now it’s a cow.  Did something change at Happy Trails farm, or did just the order of the rows in the feed change?  Let’s find out.

SELECT CASE WHEN ChecksumOld = ChecksumNew
            THEN 'same' ELSE 'DIFFERENT'
       END AS "Changed?", *
FROM (
    SELECT s.FarmName, s.Animal, s.[Column]
         , CHECKSUM_AGG([s].[Old]) AS "ChecksumOld"
         , CHECKSUM_AGG([s].[New]) AS "ChecksumNew"
    FROM (
        SELECT FarmName, Animal, 'Breed' AS "Column",
               CHECKSUM(Breed) AS "New", 0 AS "Old"
        FROM dbo.TestAnimal WHERE MeasureDate = '2011-01-01'
        UNION ALL
        SELECT FarmName, Animal, 'Gender' AS "Column",
               CHECKSUM(Gender) AS "New", 0 AS "Old"
        FROM dbo.TestAnimal WHERE MeasureDate = '2011-01-01'
        UNION ALL
        SELECT FarmName, Animal, 'Weight' AS "Column",
               CHECKSUM(Weight) AS "New", 0 AS "Old"
        FROM dbo.TestAnimal WHERE MeasureDate = '2011-01-01'
        UNION ALL
        SELECT FarmName, Animal, 'Breed' AS "Column", 0 AS "New",
               CHECKSUM(Breed) AS "Old" FROM dbo.TestAnimal_PRIOR
        UNION ALL
        SELECT FarmName, Animal, 'Gender' AS "Column", 0 AS "New",
               CHECKSUM(Gender) AS "Old" FROM dbo.TestAnimal_PRIOR
        UNION ALL
        SELECT FarmName, Animal, 'Weight' AS "Column", 0 AS "New",
               CHECKSUM(Weight) AS "Old" FROM dbo.TestAnimal_PRIOR
    ) [s] GROUP BY s.FarmName, s.Animal, s.[Column]
) [q] ORDER BY q.FarmName, q.Animal, q.[Column] ;

The above statement could still be improved, by specifying columns individually instead of using SELECT * and so on.  Still, it gives us the results below, which show that there were no changes in the data feed for Happy Trails, so you do not need to regenerate and re-send their reports.  However, Oak Run farm did have a change related to the Breed column of one or more Sheep rows.

Changed?  FarmName     Animal  Column ChecksumOld ChecksumNew
--------- ------------ ------- ------ ----------- -----------
same      Happy Trails Cattle  Breed  166418822   166418822
same      Happy Trails Cattle  Gender 154         154
same      Happy Trails Cattle  Weight 900         900
same      Happy Trails Sheep   Breed  1368889233  1368889233
same      Happy Trails Sheep   Gender 147         147
same      Happy Trails Sheep   Weight 244         244
same      Oak Run      Cattle  Breed  166418822   166418822
same      Oak Run      Cattle  Gender 147         147
same      Oak Run      Cattle  Weight 875         875
DIFFERENT Oak Run      Sheep   Breed  958075363   -80924007
same      Oak Run      Sheep   Gender 9           9
same      Oak Run      Sheep   Weight 110         110

(12 row(s) affected)

If you look at the old feed, you’ll see that the last row had a Southdown breed Sheep for Oak Run farm, but there are no Southdown sheep in the new feed.  It was replaced with a Suffolk sheep, so Southdown must have been an error that was corrected with the retransmitted feed.  Thus, you only need to regenerate and re-send reports for Oak Run farm.  While the good folks at Oak Run farm might be irritated that the original set of reports was wrong, at least you didn’t unnecessarily bother Happy Trails farm.  The CHECKSUM_AGG function empowered you to narrow the issue down and identify where changes took place.  Oh, and are we done?  Well, not really, because as I mentioned earlier, checksum values can collide.  In our farm example, I’d probably calculate checksums on some additional groupings as well.  This approach has worked well for me in practice over the years.
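For instance, here’s a sketch of one extra cross-check I might run against the same tables, grouping by Gender instead of Animal and checksumming the remaining columns together (the column choices here are just an illustration):

SELECT s.FarmName, s.Gender
     , CHECKSUM_AGG(s.Old) AS ChecksumOld
     , CHECKSUM_AGG(s.New) AS ChecksumNew
FROM (
    SELECT FarmName, Gender
         , CHECKSUM(Animal, Breed, Weight) AS New, 0 AS Old
    FROM dbo.TestAnimal WHERE MeasureDate = '2011-01-01'
    UNION ALL
    SELECT FarmName, Gender
         , 0 AS New, CHECKSUM(Animal, Breed, Weight) AS Old
    FROM dbo.TestAnimal_PRIOR
) [s]
GROUP BY s.FarmName, s.Gender ;

A change that happens to collide at one grouping level will often show up at another.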

As always, I’m looking forward to reading the contributions to this month’s T-SQL Tuesday and finding out how everyone else has been using aggregate functions.  Thank you to Jes Schultz Borland for hosting T-SQL Tuesday #16, and thanks much to Adam Machanic (Blog|Twitter) for creating this monthly blog event!

About Noel McKinney

Noel McKinney is the administrator of noelmckinney.com

4 Responses to T-SQL Tuesday #016 – Aggregation Functions

  1. Claire says:

    Nice one, though I wish there was a way to prevent collisions altogether 🙂

    • Thanks Claire! Fortunately the collisions have been rare for me, but come up often enough that CHECKSUM alone isn’t sufficient to detect changes.

  2. Thanks for participating Noel!
