T-SQL Tuesday #022 – Data Presentation

September’s T-SQL Tuesday is brought to us by Robert Pearl (Blog|Twitter), and he has chosen Data Presentation as the topic for this month’s T-SQL Tuesday. I shuddered after seeing this topic because it brought to mind an experience where the separation of data and presentation was violated.

I received an email from the boss saying he had been querying a particular database. He wanted the leading zeros removed from all numeric values.

I said that’s no problem, we’ll just modify the query tool to trim leading zeros. The previous boss wanted to see data exactly as it was stored in the database, thus he didn’t want leading zeros trimmed. But no problem Mr. Boss Du Jour, we can change that. Or better yet, we can offer it as an option in a checkbox. Because who knows what the next boss will want (okay, I didn’t say this last part, I just thought it to myself).

The boss countered “No, don’t modify the query tool. I also want to see the data as it is stored in the database. And the way I want it stored in the database is with the leading zeros trimmed.”

I explained that there were specifications written years before detailing that numbers were to be stored in that database exactly as they were received from data providers. There were a number of reasons for this, the most crucial being that part of our service offering was the ability to rate data from our various sources by a range of metrics. This particular database was used for that product offering. Modifying the numeric values by removing leading zeros would corrupt that process.

In the end, none of my arguments mattered. I could tell it had become a point of pride for the boss to win this no matter what. I asked why this was so important. For example, did he have storage or performance concerns? But his answer was just “It looks better without leading zeros.”

After that, there were questions I thought about asking but didn’t. Like if he thought it would look nicer if he could see the database at the byte level. Or if he had an opinion on whether big-endian or little-endian looks prettier. Or if he’d be happier with a mauve database.

Or, how we were supposed to put the zeros back when the next boss asks for them.

Anyway, I know I mentioned this in a prior post, but it’s probably worth repeating… when asked to do certain tasks (such as those regarding presentation), at least consider whether it’s something that belongs in the database or at another level.
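To make the point concrete, here is a minimal sketch of what "another level" could look like: the stored values stay exactly as received, and trimming is an opt-in display setting (the checkbox idea from earlier). The function name `display_value`, the flag `trim_leading_zeros`, and the sample values are all made up for illustration; the real query tool was not written this way as far as this post tells us.

```python
def display_value(stored: str, trim_leading_zeros: bool = False) -> str:
    """Format a stored numeric string for display without changing what is stored."""
    if not trim_leading_zeros:
        return stored
    sign = "-" if stored.startswith("-") else ""
    digits = stored[len(sign):]
    trimmed = digits.lstrip("0")
    # Keep one zero when nothing is left ("000") or when the value
    # would otherwise start with a decimal point ("00.5").
    if trimmed == "" or trimmed.startswith("."):
        trimmed = "0" + trimmed
    return sign + trimmed


# Hypothetical values, kept exactly as received from a data provider.
stored_values = ["000123", "0042.50", "000", "7"]

# One boss sees the data as stored, the next sees it trimmed.
as_stored = [display_value(v) for v in stored_values]
trimmed = [display_value(v, trim_leading_zeros=True) for v in stored_values]
```

Either boss gets the view he wants, and nobody has to work out how to put the zeros back, because they never left the database.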

Thanks to Robert Pearl for hosting T-SQL Tuesday #22, especially when he was asked to host earlier than planned! And thanks to Adam Machanic (Blog|Twitter) for creating this monthly blog event and keeping it going!

Posted in T-SQL Tuesday | 4 Comments

Labor for the Data

Happy Labor Day! My last post on Google Correlate is still on my mind. So is Buck Woody’s (Blog|Twitter) post on being a Data Professional … yes, this is at least the third time I’ve mentioned Buck’s post, but I think the message is that important to grasp. Anyway, what’s on my mind is two encounters I had that, at least for me, indicated I should not get so involved in technology that I forget about the data.

The first encounter was seventeen years ago, the second was ten years ago. So I have two stories, and since it’s a holiday you probably feel like reading just one. Okay, I’ll go with the more recent story for now and save the older one for a follow-up blog post.

It was my first day at a new job. The previous day I graduated with a computer science degree (I know, I should have taken some time off in between, but I was excited about the new job). My boss came over to my desk to welcome me, and the exchange went something like this:

  • Boss: Hi Noel, we’re glad you’re finally here.
  • Me: Same here, I received my degree yesterday, so I’m ready to get to work!
  • Boss: You were the one who wanted to wait until you graduated to start work. I was ready for you to start working here months ago. Quite frankly, I don’t care about your computer science degree, it’s the skills from your previous graduate study that are interesting.

Ouch, talk about a backhanded compliment. I’d just spent over two years of time and boatloads of SQLCruises in dollars to study algorithms, programming languages, relational algebra, software engineering techniques, etc. How could he not value that? How could he find the five years of grad school I did beforehand more interesting? Especially when half of those five years didn’t even result in a degree.

After a few years, I not only understood his viewpoint but even agreed with it. Well, mostly agreed with it. I’d never want to give up the computer science study (especially the algorithms material), and if I try to think of what parts I could have cut out, I don’t come up with much.

So what was it about the five years of non-computer science grad school that made my boss interested in me?

The answer: data. I spent all that time rolling around in data.

The first graduate degree was in social and applied economics. The focus was on applying data, statistics and economic theory to public policy and business issues. I was also a research assistant. So days and nights revolved around loading data tapes on mainframes, crunching away at them with SAS, then taking the resulting greenbar printouts to a professor’s office. That’s where your real education started: pore over scatter plots and regression lines, dig into data rows to find points that didn’t fit, figure out what was missed, what the data told us that we didn’t know, adjust models, then head back to the computer center and repeat.

After that, I taught economics for a few years, then I headed back to graduate school to work on a doctorate (I became bored with being in a small college town, plus my interests changed, so I left without a degree during my third year). Once again I was a research assistant while taking courses in quantitative methods, economics and accounting, so my experience was similar to the above. Similar but not the same; by that time, PCs had become powerful enough that you no longer needed a mainframe to run SAS on large data sets. Which meant you spent more time with data… it was right there on your desktop, so you didn’t have to run back and forth to the computer center!

That’s the end of the story. My take-away: it occurs to me that in recent years I’ve spent disproportionately more time learning about the technology side than the data side. So I’m consciously going to try to balance that out in the coming months. With that, I guess it’s time to find my copy of Hogg and Craig. Oh there it is, underneath my monitor stand :-)

Posted in Professional Development | Comments Off

Drawing with Google Correlate

This week, Nick Hatch (twitter) showed me the Search by Drawing feature in Google Correlate. My reaction was “That’s going to consume several hours of my weekend” and sure enough, it already has :-)

Getting started is easy. Go to Google Correlate (you’ll need to sign in with your Google account) then click on the Search by Drawing link on the left side of the page under the Correlate Labs section. With that, you’ll be presented with a blank chart where you can draw a time series of search activity… just draw a line, click the Correlate! button and see what happens. The tool will display your line with a line of activity for search terms. The most correlated search term is displayed initially, but other search terms are presented as well (ranked by decreasing correlation) and you can click on those terms to display their lines.
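If you are curious what "ranked by decreasing correlation" might mean in practice, here is a rough sketch. It assumes the ranking uses the ordinary Pearson correlation coefficient between your drawn line and each term's time series; the function names and all of the numbers are invented for illustration, and the real tool surely works on far more data with a smarter search.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series.

    Assumes neither series is constant (a flat line has zero variance,
    which would make the denominator zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(drawn, candidates):
    """Return (term, correlation) pairs, most correlated first."""
    scored = [(term, pearson(drawn, series)) for term, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up data: a hand-drawn curve and three invented search-activity series.
drawn = [1, 2, 4, 8, 16, 32]
candidates = {
    "rising term": [2, 4, 8, 16, 32, 64],
    "flat-ish term": [5, 6, 5, 6, 5, 6],
    "falling term": [32, 16, 8, 4, 2, 1],
}

ranking = rank_by_correlation(drawn, candidates)
```

The first entry in `ranking` plays the role of the term the tool shows initially, and the rest are the ones you can click through.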

You might be wondering how I could spend hours drawing lines. Fair enough.

I started thinking about search terms and imagining what a time series for that term would look like, then drew it, then looked at what my line actually matched. For example, try to think of the number of searches over the last several years for Lady Gaga or Charlie the Unicorn, then draw that time series and see the results.

Some of my more interesting attempts:

  • The line I drew that I thought might look like the time series for MusicMatch had a 0.9756 correlation with AltaVista. This was interesting because Yahoo ended up acquiring both of these products.
  • The line I drew trying to guess search activity for FarmVille correlated most highly with Dropbox. This wasn’t too interesting, but the data had something unexpected. As I looked at the time series of search activity for Dropbox, there wasn’t a surprise in the general trend. Dropbox was founded in 2007, so sure enough in 2007 the time series shot up exponentially. Before that, the search activity was perfectly level, except for little squiggles around 2005. Hmmm…
  • I drew my imagined time series for searches on the mortgage banking industry, and my line’s highest correlation was with a bank that was purchased after the 2008 banking crisis. However, well before the banking crisis, there was a sudden and dramatic peak in search activity for that bank. With a little searching, I read that the bank encountered regulatory activity during the period of that peak. So looking at the data prompted me to do a little digging and learn something.

Anyway, if you’re a data geek and haven’t poked around with Google Correlate yet, then you might want to check it out. Enjoy!

Posted in Data | 1 Comment

Netbook – Part 3

Thirteen months ago I wrote Part 1 about my experience with choosing a netbook to take on the very first SQLCruise. Part 2 continued with setting up the netbook for SQL Server development and education use.

Thoughts After a Year of Use

So after more than a year of use, what’s the verdict? Easy answer: success, no regrets.

The netbook not only went on the very first SQLCruise from Miami to Cozumel, but also the most recent SQLCruise to Alaska. On both trips I used it in the classroom as well as to VPN into the office to do work. Same for last year’s PASS Summit and some SQL Saturday events. For a year I carried it to the office so that I could work away from my desk for an hour or so each day. The netbook cost around $300, throws off very little heat compared to a laptop, and weighs only 2.8 pounds. Most surprising: the estimate of 14 hours of use on battery power was not much of an overstatement (turning on wi-fi zaps it down a bit, turning on bluetooth zaps it down a lot).

So is that the end of this post? Am I going to end with “All is well, tune in next year when I will let you know if the netbook is still running” or something like that?

No.

Let us push on and take the netbook to a new place. A wonderful place with everyone’s favorite creature of the arctic waters, the narwhal. In this particular case, a natty narwhal.

It’s Ubuntu Time

Ubuntu is a Linux distribution. Unlike the olden days, you can avoid the disk partitioning stuff, boot loader configurations, command lines, having to do yet another operating system install where you have to babysit the machine in case you need to type in information and hit the enter key, etc. More on that later.

But first, why would I want Linux on my netbook if it already works fine? Because sometimes I just want to grab my netbook and do some web browsing, but I don’t want to wait for Windows 7 to boot up. So I wondered if a minimal Ubuntu install would boot up faster on my netbook. No need for suspense… the boot up time is about the same, but the feature I like is that the Ubuntu user interface is very nice for smaller, netbook-size screens. This interface seems a bit awkward at first, but it doesn’t take long before you begin to appreciate it.

One way to get Ubuntu running would have been to use a virtualization solution such as VMware Player. But that’s not going to work in this case because then I’d have to boot to Windows first, then Ubuntu. That’s not much of a time-saver. Also, my netbook doesn’t have a lot of memory or CPU horsepower for pushing virtual machines. So it was looking like I would be setting up a dual-boot configuration.

At this point, I decided to try the option of running Ubuntu within Windows. This turned out to be very simple, and so far I’ve been quite satisfied with the result. With this installation option, Ubuntu installs into a folder in your Windows file system, and the experience is similar to installing a Windows application. So you don’t have to deal with partitioning your hard disk for a separate operating system. The Ubuntu installer uses your existing Windows installation to figure out the settings to use, so your interaction during the installation process is minimized.

If you want to try out Ubuntu then you can follow the instructions here, plus this page has more detailed instructions on installation as well as how to uninstall (you uninstall from the Control Panel just like you’d uninstall a typical Windows application).

Once you have Ubuntu installed, fire up the web browser and do some reading. A couple of links I’d recommend would be

So my netbook has become even more useful now. When I start it up, I can choose to boot into Windows to do some SQL Server work, or boot into Ubuntu for some Linux goodness.

Posted in Hardware | Comments Off

T-SQL Tuesday #021 – Inelegant Yet Educational

August’s T-SQL Tuesday is brought to us by Adam Machanic (Blog|Twitter). Actually, we have Adam to thank for every T-SQL Tuesday because he created it in the first place, but this month he’s hosting it as well. He’s also decided to have the event take place on a Wednesday rather than Tuesday, which is appropriate given this month’s topic, which is to reveal our “crap code.”

This seems easy enough because I’ve got plenty of examples. The first one that came to mind was when I was told to replace all cursors with temp tables and WHILE loops. However, I already discussed this in item #4 of last month’s post. No need to revisit that. So here are two other crap code examples: one that died quickly and another that lives on.

Quickly Archived But Not Forgotten

I was directed to implement a soft-coded values solution that shoehorned generic objects into a relational model, with GUIDs as clustered keys and values in sql_variant columns. By the way, the data architect in this situation admitted not knowing anything about 1) the solution’s industry and 2) SQL Server.

I implemented the design. Now, you might ask, why write this code even though I knew it would be crap? Well, because I had already been labeled a naysayer by the new management team for raising objections on two other issues. I had also been shown negative emails that managers were circulating about me. So I decided to be a team player while preparing to get a pink slip, and along the way I’d learn about this type of design.

Guess what? I did learn a lot about soft-coded values design (e.g. when it’s appropriate, when it’s not, how far you can optimize it, and non-relational approaches). But in the end, a daily load of just one feed took 22 hours (and almost 100 feeds needed to be processed a day). Each consultant brought in to review my code said it was perfect. According to them, the solution was that we needed to buy much, much more expensive hardware to fix the situation.

New hardware was not possible, and the situation was desperate enough to ask this “naysayer” for ideas. I demonstrated a redesigned solution that could process a feed in several minutes. With that, the soft-coded values solution was replaced and sent off to the archive repository. Years later, my solution is still used to process large volumes of data. Even though the soft-coded values solution was only used for several months many years ago, I continue to be asked questions related to that experience.

Lesson: Data matters. Relying on technology and/or technical skills to allow abstraction or over-generalization of your data might not be a viable solution. I was able to design a superior solution in this situation because of my experience with data in this particular industry. This is why I like Buck Woody’s (Blog|Twitter) job title of Data Professional – it keeps me focused on what I consider to be the most important part of my work.

It Lives On

It was over eight years ago when I was asked for this particular hack. A new solution had been developed without considering that certain identifier values from the old solution needed to be maintained.

The manager of development asked if I could hack something together quickly to maintain these identifiers. It went something like “You can? Two days? Great! And, oh yeah, also create new identifiers that won’t collide with the old solution. This will be a temporary hack, don’t worry. In a couple of months a permanent solution will be built. Two, three months. Okay, maybe four or five since some people might go on vacation. That’s it. Probably. Well, maybe six months. Tops.”

So I created my hack solution. Driven by nothing but the finest .bat scripts. Plus several tables, stored procedures and user-defined functions. Yes, I usually avoid user-defined functions, but this was just a temporary hack that I needed to throw together quickly, right? Since this was a temporary solution, I decided to stick these objects in their own database that could just be dropped when the hack was no longer needed.

I asked the DBA for a database, which he named DB_CorpCo (replace “CorpCo” with client name). I said this was a terrible name because it’s not doing something necessarily related to a specific client, rather it’s performing a certain function. The DBA said for now only one client, CorpCo, will use it, so that’s a fine name. I said that sometimes temporary hacks become permanent, so this database might end up being used for other clients such as AirCo, FruitCo, StoreCo, etc. The name will be confusing. It should not contain a client name. However, my argument was going nowhere, and I had a lot of hacking to do, so DB_CorpCo was created. Over eight years ago.

DB_CorpCo worked great. Six months came and went. Then most of the development team was laid off. DB_CorpCo was working fine and there were no resources to replace it, so it kept going. Pretty soon, its use was expanded to include clients other than just CorpCo. I got to explain over and over that, yes, it’s called DB_CorpCo, but all clients use it.

I left the company for a while, and when I returned I found that even with all the comments and documentation warning that no new functionality should be added to DB_CorpCo, its tables had been altered and stored procedures had been modified to add new, expanded duties that belonged elsewhere. Like a beast in a horror film, DB_CorpCo, with all its temporary, hacked-together abominations, was now intertwined with new and old solutions serving multiple clients. And at the top of each source code file is the comment “Original version created by Noel McKinney.”

To this day, in a data center far, far away, on a lonely server in a rack, a database named “DB_CorpCo” is being updated and queried daily. It Lives (cue the horror music now).

Lesson: Temporary hacks can become permanent, unless you want to argue that eight years is not “permanent” when it comes to computer solutions.

I’m looking forward to reading the contributions to this month’s T-SQL Tuesday Wednesday and reading the examples that others were generous enough to share. Thanks to Adam Machanic for creating this monthly blog event and hosting it this month!

Posted in T-SQL Tuesday | Comments Off