Labor for the Data

Happy Labor Day! My last post on Google Correlate is still on my mind. So is Buck Woody’s (Blog|Twitter) post on being a Data Professional … yes, this is at least the third time I’ve mentioned Buck’s post, but I think the message is that important to grasp. Anyway, what’s on my mind is two encounters I had that, at least for me, indicated I should not get so involved in technology that I forget about the data.

The first encounter was seventeen years ago; the second was ten years ago. So I have two stories, and since it’s a holiday you probably feel like reading just one. Okay, I’ll go with the more recent story for now and save the older one for a follow-up blog post.

It was my first day at a new job. The previous day I graduated with a computer science degree (I know, I should have taken some time off in between, but I was excited about the new job). My boss came over to my desk to welcome me, and the exchange went something like this:

  • Boss: Hi Noel, we’re glad you’re finally here.
  • Me: Same here, I received my degree yesterday, so I’m ready to get to work!
  • Boss: You were the one who wanted to wait until you graduated to start work. I was ready for you to start working here months ago. Quite frankly, I don’t care about your computer science degree, it’s the skills from your previous graduate study that are interesting.

Ouch, talk about a backhanded compliment. I’d just spent over two years and boatloads of dollars (several SQLCruises’ worth) to study algorithms, programming languages, relational algebra, software engineering techniques, etc. How could he not value that? How could he find the five years of grad school I did beforehand more interesting? Especially when half of those five years didn’t even result in a degree.

After a few years, I not only understood his viewpoint but even agreed with it. Well, mostly agreed with it. I’d never want to give up the computer science study (especially the algorithms material), and when I try to think of what parts I could have cut out, I don’t come up with much.

So what was it about the five years of non-computer science grad school that made my boss interested in me?

The answer: data. I spent all that time rolling around in data.

The first graduate degree was in social and applied economics. The focus was on applying data, statistics and economic theory to public policy and business issues. I was also a research assistant, so days and nights revolved around loading data tapes on mainframes, crunching away at them with SAS, then taking the resulting greenbar printouts to a professor’s office. That’s where the real education started: pore over scatter plots and regression lines, dig into data rows to find points that didn’t fit, figure out what was missed and what the data told us that we didn’t know, adjust models, then head back to the computer center and repeat.

After that, I taught economics for a few years, then I headed back to graduate school to work on a doctorate (I became bored with being in a small college town, plus my interests changed, so I left without a degree during my third year). Once again I was a research assistant while taking courses in quantitative methods, economics and accounting, so my experience was similar to the above. Similar but not the same; by that time, PCs had become powerful enough that you no longer needed a mainframe to run SAS on large data sets. Which meant you spent more time with data… it was right there on your desktop, so you didn’t have to run back and forth to the computer center!

That’s the end of the story. My takeaway: it occurs to me that in recent years I’ve spent disproportionately more time learning about the technology side than the data side, so I’m consciously going to try to balance that out in the coming months. With that, I guess it’s time to find my copy of Hogg and Craig. Oh, there it is, underneath my monitor stand 🙂


Drawing with Google Correlate

This week, Nick Hatch (Twitter) showed me the Search by Drawing feature in Google Correlate. My reaction was “That’s going to consume several hours of my weekend,” and sure enough, it already has 🙂

Getting started is easy. Go to Google Correlate (you’ll need to sign in with your Google account), then click on the Search by Drawing link on the left side of the page under the Correlate Labs section. You’ll be presented with a blank chart where you can draw a time series of search activity… just draw a line, click the Correlate! button and see what happens. The tool displays your line overlaid with the activity line for the most correlated search term; other terms are listed as well (ranked by decreasing correlation), and you can click on those terms to display their lines.
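
By the way, the correlation numbers the tool reports are Pearson correlation coefficients (as I recall from the tool’s documentation). If you’re curious what that computation looks like in T-SQL, here’s a minimal sketch against a made-up table of weekly activity for two terms; the table, columns, and data are purely for illustration.

-- A sketch only: Pearson's r = covariance(x,y) / (stdev(x) * stdev(y)).
-- FLOAT columns so the averages don't truncate; multi-row VALUES needs SQL Server 2008+.
DECLARE @Activity TABLE ([WeekNo] INT, [TermA] FLOAT, [TermB] FLOAT) ;
INSERT INTO @Activity VALUES (1, 10, 12), (2, 20, 19), (3, 30, 33), (4, 40, 41) ;

SELECT ( AVG([TermA] * [TermB]) - AVG([TermA]) * AVG([TermB]) )
     / ( STDEVP([TermA]) * STDEVP([TermB]) ) AS [PearsonR]
  FROM @Activity ;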

You might be wondering how I could spend hours drawing lines. Fair enough.

I started thinking of a search term, imagining what its time series would look like, drawing that, then looking at what my line actually matched. For example, try to think of the number of searches over the last several years for Lady Gaga or Charlie the Unicorn, then draw that time series and see the results.

Some of my more interesting attempts:

  • The line I drew that I thought might look like the time series for MusicMatch had a 0.9756 correlation with AltaVista. This was interesting because Yahoo ended up acquiring both of these products.
  • The line I drew trying to guess search activity for FarmVille correlated most highly with Dropbox. This wasn’t too interesting, but the data had something unexpected. As I looked at the time series of search activity for Dropbox, there wasn’t a surprise in the general trend. Dropbox was founded in 2007, so sure enough in 2007 the time series shot up exponentially. Before that, the search activity was perfectly level, except for little squiggles around 2005. Hmmm…
  • I drew my imagined time series for searches on the mortgage banking industry, and my line’s highest correlation was with a bank that was purchased after the 2008 banking crisis. However, well before the banking crisis, there was a sudden and dramatic peak in search activity for that bank. With a little searching, I read that the bank encountered regulatory activity during the period of that peak. So looking at the data prompted me to do a little digging and learn something.

Anyway, if you’re a data geek and haven’t poked around with Google Correlate yet, then you might want to check it out. Enjoy!


Netbook – Part 3

Thirteen months ago I wrote Part 1 about my experience with choosing a netbook to take on the very first SQLCruise. Part 2 continued with setting up the netbook for SQL Server development and education use.

Thoughts After a Year of Use

So after more than a year of use, what’s the verdict? Easy answer: success, no regrets.

The netbook not only went on the very first SQLCruise from Miami to Cozumel, but also the most recent SQLCruise to Alaska. On both trips I used it in the classroom as well as to VPN into the office to do work. Same for last year’s PASS Summit and some SQL Saturday events. For a year I carried it to the office so that I could work away from my desk for an hour or so each day. The netbook cost around $300, throws off very little heat compared to a laptop, and weighs only 2.8 pounds. Most surprising: the estimate of 14 hours of use on battery power was not much of an overstatement (turning on Wi-Fi zaps it down a bit; turning on Bluetooth zaps it down a lot).

So is that the end of this post? Am I going to end with “All is well, tune in next year when I will let you know if the netbook is still running” or something like that?

No.

Let us push on and take the netbook to a new place. A wonderful place with everyone’s favorite creature of the Arctic waters, the narwhal. In this particular case, a Natty Narwhal.

It’s Ubuntu Time

Ubuntu is a Linux distribution. Unlike in the olden days, you can avoid the disk partitioning, the boot loader configuration, the command lines, and yet another operating system install where you have to babysit the machine in case you need to type in information and hit the Enter key. More on that later.

But first, why would I want Linux on my netbook if it already works fine? Because sometimes I just want to grab my netbook and do some web browsing, but I don’t want to wait for Windows 7 to boot up. So I wondered if a minimal Ubuntu install would boot faster on my netbook. No need for suspense… the boot time is about the same, but the feature I like is that the Ubuntu user interface is very nice for smaller, netbook-size screens. The interface seems a bit awkward at first, but it doesn’t take long before you begin to appreciate it.

One way to get Ubuntu running would have been to use a virtualization solution such as VMware Player. But that’s not going to work in this case because then I’d have to boot to Windows first, then Ubuntu. That’s not much of a time-saver. Also, my netbook doesn’t have a lot of memory or CPU horsepower for pushing virtual machines. So it was looking like I would be setting up a dual-boot configuration.

At this point, I decided to try the option of running Ubuntu within Windows (the Wubi installer). This turned out to be very simple, and so far I’ve been quite satisfied with the result. With this installation option, Ubuntu installs into a folder in your Windows file system, and the experience is similar to installing a Windows application, so you don’t have to deal with partitioning your hard disk for a separate operating system. The installer uses your existing Windows installation to figure out the settings to use, so your interaction during the installation process is minimized.

If you want to try out Ubuntu then you can follow the instructions here, plus this page has more detailed instructions on installation as well as how to uninstall (you uninstall from the Control Panel just like you’d uninstall a typical Windows application).

Once you have Ubuntu installed, fire up the web browser and do some reading. A couple of links I’d recommend would be

So my netbook has become even more useful now. When I start it up, I can choose to boot into Windows to do some SQL Server work, or boot into Ubuntu for some linux goodness.


T-SQL Tuesday #020 – T-SQL Best Practices

July’s T-SQL Tuesday is brought to us by Amit Banerjee (Blog|Twitter), and he has chosen T-SQL Best Practices as the topic for this month’s T-SQL Tuesday. Amit discusses best practices in light of their impact on performance and states that “Sometimes, the most obvious things are the easiest to overlook!”

I’m not going to try for a comprehensive list. Instead I’ll put down the first three best practices that came to mind when I saw this topic. Just for fun, I added a fourth issue that might stir things up a bit.

  1. Put your T-SQL scripts in a version control solution. I’ve mentioned this before but I’ll repeat it. If you’re a DBA and don’t know where to start with this, talk with your developers. It will be a bonding moment with them. I know that’s a frightening thought, but it could save your bacon someday.
  2. Go out of your way to make T-SQL scripts re-runnable in an elegant and graceful manner. Seriously, those are the words I use when discussing this. Ask yourself how you can make scripts more like a ballet dance and less like a freeway pileup. Consider what can happen when your script runs. Does it drop a database? Check for users in that database and handle them appropriately, e.g. exit the script and output a message that the database was not dropped because it had users. Going to create a schema? Check for its existence, and if it already exists then don’t create it; output a message saying the schema already existed and didn’t have to be recreated. You’ll need to use dynamic SQL to do this. Fine, do that. Your output messages should include the object names and a timestamp so that you’ll know when the script was run. If you use the OBJECT_ID function to check for existence, include the optional object_type parameter so that you aren’t relying on the name alone (even if you use naming conventions based on object type, you can’t assume that everyone always followed that convention everywhere). Take foreign keys and other dependency relationships into account, and never assume that your scripts will be executed in the proper order. Sure, you could say “It’s not my job to make sure others use my scripts correctly.” But you know what? You could make things easier for someone else, even if they don’t notice it. And if someone does notice, they will think highly of you. Yes, it’s extra work. But you will learn things. And it will make you a better database professional. I guarantee it (disclaimer: this is not an actual guarantee). The first sketch below shows the kind of script I have in mind.
  3. Ask if the functionality even belongs in T-SQL. Sure, I like stored procedures and the SQL language. But there are other layers to most solutions than just the database. Don’t be afraid to ponder the possibility that something belongs in another layer. For example, when I encounter a user-defined function that’s used to format a number to appear a certain way, my first question is “where the heck is this being used?” Same thing with stored procedures that are doing a heavy amount of aggregation. Maybe that’s not a bad thing, but it could be. Situations are different all the time, so I think the easiest general rule is to be open to the possibility that a thing might belong elsewhere, even if it has already been implemented in T-SQL.
  4. I’m going to step on some toes and say that not all cursors are evil abominations to be completely and totally disallowed. Some are different from others. While they shouldn’t be the first solution considered in SQL Server, they can work just fine. Once upon a time, I was told to go through all of a company’s stored procedures and remove all the cursors. When I mentioned how long this could take, the DBA said “It’s easy. Just do what the cursor does, but put everything in a temp table, then iterate through the temp table with a WHILE loop.” I demonstrated that declaring the cursors as LOCAL and FAST_FORWARD consistently beat the performance of his temp-table-and-loop solution (the second sketch below shows that declaration). But it didn’t matter. He explained that some day he would be a manager and would fire anyone who used cursors. Interesting. There might be absolutes in the world, but I don’t think making “never use cursors” a best practice is one of them.
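
As promised in item 2, here’s a minimal sketch of a graceful, re-runnable script. All of the object names are hypothetical, chosen just for illustration; the points to notice are the existence checks, the dynamic SQL for CREATE SCHEMA (it has to be the first statement in its batch), the timestamped messages, and the object_type parameter on OBJECT_ID.

-- Re-runnable sketch with made-up names: create a schema and table
-- only if they don't already exist, and report what happened either way.
IF SCHEMA_ID(N'Staging') IS NULL
BEGIN
    -- CREATE SCHEMA must be the first statement in a batch, hence the dynamic SQL.
    EXEC (N'CREATE SCHEMA [Staging] AUTHORIZATION [dbo] ;') ;
    PRINT CONVERT(VARCHAR(27), SYSDATETIME(), 121) + N' Created schema [Staging].' ;
END
ELSE
    PRINT CONVERT(VARCHAR(27), SYSDATETIME(), 121) + N' Schema [Staging] already exists; skipped.' ;

-- The N'U' object_type parameter means we only match a user table, not just any object with this name.
IF OBJECT_ID(N'[Staging].[LoadLog]', N'U') IS NULL
BEGIN
    CREATE TABLE [Staging].[LoadLog]
      ( [LoadID]   INT IDENTITY(1,1) PRIMARY KEY
      , [LoadedAt] DATETIME2 NOT NULL DEFAULT SYSDATETIME() ) ;
    PRINT CONVERT(VARCHAR(27), SYSDATETIME(), 121) + N' Created table [Staging].[LoadLog].' ;
END
ELSE
    PRINT CONVERT(VARCHAR(27), SYSDATETIME(), 121) + N' Table [Staging].[LoadLog] already exists; skipped.' ;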
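
And from item 4, here’s the shape of a LOCAL FAST_FORWARD cursor declaration, run against a throwaway table variable; the loop body is just a stand-in for whatever per-row work is actually needed.

-- Sketch of a LOCAL FAST_FORWARD cursor over a hypothetical table variable.
DECLARE @People TABLE ([FirstName] NVARCHAR(50)) ;
INSERT INTO @People VALUES (N'Bob'), (N'Brent'), (N'Buck') ;

DECLARE @Name NVARCHAR(50) ;
DECLARE people_cur CURSOR LOCAL FAST_FORWARD
    FOR SELECT [FirstName] FROM @People ;
OPEN people_cur ;
FETCH NEXT FROM people_cur INTO @Name ;
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @Name ;  -- the per-row work goes here
    FETCH NEXT FROM people_cur INTO @Name ;
END ;
CLOSE people_cur ;
DEALLOCATE people_cur ;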

I’m looking forward to reading the contributions to this month’s T-SQL Tuesday and seeing everyone’s best practices. Thanks to Amit Banerjee for hosting T-SQL Tuesday #20, and thanks to Adam Machanic (Blog|Twitter) for creating this monthly blog event!


T-SQL Tuesday #018 – CTE with OPENROWSET

May’s T-SQL Tuesday is brought to us by Bob Pusateri (Blog|Twitter). Bob is very active in the SQL Server community here in Chicago and was one of the first people I met when I started attending user group meetings. So there was no way I was going to miss out on this month’s T-SQL Tuesday event! For this month, Bob is asking “Have you ever solved or created a problem by using CTEs? Got a tip, trick, or something nifty to share?” Sure, I think I can fill that order.

CTE is the abbreviation for Common Table Expression, a feature that was new in SQL Server 2005. Much of the excitement with this feature’s arrival was the capability to implement recursion in T-SQL code via CTEs. I’ve mostly used CTEs to create more readable code in situations that might have involved complicated subqueries as well as some (but not all) circumstances that might utilize temp tables or table variables.
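
For anyone who hasn’t seen it, here’s about the smallest possible recursive CTE, just to show the shape of the recursion everyone was excited about (this one simply counts from 1 to 5):

-- Anchor member returns 1; the recursive member adds 1 until we hit 5.
; WITH [Numbers] ([n])
  AS ( SELECT 1
       UNION ALL
       SELECT [n] + 1 FROM [Numbers] WHERE [n] < 5 )
SELECT [n] FROM [Numbers] ;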

One of my favorite ways to use CTEs is when I want to access a file using the T-SQL OPENROWSET feature. There’s nothing wrong with using OPENROWSET in the FROM clause. However, it seems cleaner to break it out into a CTE so that all that awkward syntax is up and away from the FROM clause. That way, once you have your OPENROWSET syntax done, you can fiddle around with the FROM clause and not worry about accidentally breaking your OPENROWSET work.

Let’s say you have a tab-delimited file of some folks you know and their favorite words. The file has three columns: a line number, the person’s first name, and their favorite word. So this file might look like the following.

1	Bob	coefficient of thermal expansion
2	Brent	gaga
3	Buck	cloud
4	Harper	awesome
5	Jes	run
6	Karen	data
7	Norman	hammer
8	Paul	sheep
9	Yanni	hike

Note that this tab-delimited data file named “FavoriteWords.txt” is available in a compressed zip file that can be downloaded here. Also available in this zip file is the format file named “FavoriteWords.xml” that is utilized in the queries that follow. So first let’s see if we can query this file in a CTE utilizing OPENROWSET with the BULK provider. As usual, nothing here comes with any guarantees that your server won’t explode, data won’t be wiped out permanently, etc. Like any source code or files you find on the web, don’t run anything here untested on a production machine or any other system that you care about.
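
In case that download link ever stops working, here’s a sketch of what a format file for these three tab-delimited columns could look like. This is my illustration of the general shape of a BCP XML format file, not necessarily identical to the “FavoriteWords.xml” in the zip.

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="12"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="50"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="100"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="LineNumber" xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="FirstName" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="3" NAME="FavoriteWord" xsi:type="SQLNVARCHAR"/>
  </ROW>
</BCPFORMAT>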

; WITH [FavoriteWords] ([LineNumber],[FirstName],[FavoriteWord])
 AS ( SELECT [LineNumber],[FirstName],[FavoriteWord]
        FROM OPENROWSET( BULK 'C:\temp\FavoriteWords.txt'
                       , FORMATFILE = 'C:\temp\FavoriteWords.xml')
      AS [fav] )
SELECT [LineNumber],[FirstName],[FavoriteWord]
  FROM [FavoriteWords] ;

That’s not too bad, although you may have to read up on OPENROWSET in Books Online in case you run into permissions issues, etc. Perhaps you can already see why I like pushing all that OPENROWSET syntax up into a CTE. We could now add more complexity to our SELECT, or even write an INSERT/UPDATE/DELETE, and not have to touch the OPENROWSET syntax in the CTE. For example, if you have the AdventureWorks sample database available, you can write a query like this.

; WITH [FavoriteWords] ([LineNumber],[FirstName],[FavoriteWord])
AS
 (SELECT [fav].[LineNumber],[fav].[FirstName],[fav].[FavoriteWord]
    FROM OPENROWSET( BULK 'C:\temp\FavoriteWords.txt'
           , FORMATFILE = 'C:\temp\FavoriteWords.xml') AS [fav] )
SELECT [e].[Title],[f].[FirstName],[f].[FavoriteWord]
  FROM [AdventureWorks].[HumanResources].[Employee] [e]
 INNER JOIN [FavoriteWords] [f]
    ON [e].[EmployeeID] = [f].[LineNumber] ;

Of course, if you don’t have AdventureWorks, you could use some other table and join on an integer primary key to get similar results.  Running the above query in SQL Server Management Studio gave the results seen in this screenshot snippet.

BONUS: Did you notice that the file path includes the drive letter in the above queries? Yuck. Plus, what if the folder containing the file being queried is on C:\ in some environments but E:\ in others? Well, the nice thing is that you can also specify the share name if the folder is shared.  So if the C:\temp folder is shared as “temp” then you can use that to write queries such as the one below.

; WITH [FavoriteWords] ([LineNumber],[FirstName],[FavoriteWord])
AS ( SELECT [LineNumber],[FirstName],[FavoriteWord]
     FROM OPENROWSET( BULK '\\LOCALHOST\temp\FavoriteWords.txt'
 , FORMATFILE = '\\LOCALHOST\temp\FavoriteWords.xml') AS [fav] )
SELECT [LineNumber],[FirstName],[FavoriteWord]
  FROM [FavoriteWords] ;

With that, I’m looking forward to reading the contributions to this month’s T-SQL Tuesday and seeing how everyone is using CTEs. Thank you to Bob Pusateri for hosting T-SQL Tuesday #18, and thanks again to Adam Machanic (Blog|Twitter) for creating this monthly blog event!