Friday, 16 November 2012

Population Stability Index (PSI) in SQL

So dealing with probability is quite a fun thing to play around with in SQL if you're into that sort of thing.
 
It's easy when you're building a data warehouse and cubes and you need to compare values to a report vs the values that you're getting from a Transactional system like Syspro or SAP. (think ERP, POS systems, but you get the point.)
 
Recently I've had to build a report that compares values of data that might change, or might not change, but it's not financial figures, it's Population figures. So for example in the Credit Risk industry the percentage of people from last month that now have credit cards paid, to the percentage this month.
 
Do this calculation there's a formula called a Population Stability Index (PSI).
 
The formula looks like this:
 
All fancy and complicated, but once you break it down and you realize that you can substitute the Sigma sign actually just means SUM() in SQL, then your life becomes a whole lot easier.

So let's put together some Sample Data and create the formula in SQL.


 All of a sudden the mathematical formula is less intimmidating, and you're able to write the SQL code with ease.

The PSI will show you a shift in the population from 1 month to next if any, then you can create alerts based on the value that you get back.

I hope this has been informative, and I've been asked to blog a little more about writing mathematical algorithms into SQL code, so I'll see if I can find some good and applicable examples to use, and post them.

If you want to learn more about the PSI, then read this SAS document that I found online. It gives you quite a good explination.
http://www.google.co.za/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=1&cad=rja&ved=0CB4QFjAA&url=http%3A%2F%2Fsupport.sas.com%2Fresources%2Fpapers%2Fproceedings10%2F288-2010.pdf&ei=c0CmUJW-B82XhQfTwIGIDA&usg=AFQjCNENupc_KhO2paGvRnXvXYYyI9Nm9g&sig2=tNhiT7zQpr_5aUUNJB2LlQ

Friday, 31 August 2012

Retrieve SQL Data from Stored Procedure using OPENROWSET

So recently I was busy with a performance monitoring and SQL space monitoring quest.

I decided that due to all the irritated managers and people in the office, that I'll put together a report that gets the drive space from all the hard drives of all the servers, and then report on free space, used space, database space usage vs. file usage etc. This should highlight the issues, and then put people in the spotlight so that they can start cleaning their servers up that they're responsible for.

The funny thing was, that the server they game me that had access to email couldn't create linked servers.

Create Linked Server Error.

Unfortunately I don't have the authority or the time to repair the instance, or bother with troubleshooting this error right now (I'm a bad person, I know.), and also because no one else really uses this sever for ANYTHING, I thought i'd play with the OPENROWSET commands.

To use the OPENROWSET clause, you need to make sure that you have 2 options enabled.

Ad hoc distributed queries, and Ole Automation Procedures.










Once you've enabled the procedures, you can then perform OPENROWSET queries. The reason I do this is because I want to consolidate the data form all my servers, and then union the results into one query that I can execute when I need to, or if you've got some reporting knowledge, you can build it into an SSRS report or Crystal Report etc.

To retrieve the required data from my server, and test that it works. I did a simple select from my master.sys.tables.








Now that works fine until you try to execute a stored procedure inside the OPENROWSET.






All I did was replace the SELECT * FROM sys.tables with a simple EXEC Xp_fixeddrives, surely this should have worked, I mean it's also just selecting data. Oh well, lets keep hacking at it :-)

Now there's this nice little function called FMTONLY, which Returns only meta data to the client. You can test it, by turning it ON, it will only return the Columns, and No Rows, so I suspect that It's actually battling to obtain the meta data, so when I force the option to be turned OFF, which is actually it's default setting, and try to run my stored procedure again, from inside the OPENROWSET.









By setting the value OFF explicitly, it returns results when we execute the proc from inside the OPENROWSET.

I hope this post helps other's that's battled with a similar problem.

If you want more information about the FMTONLY statement, then read up about it on BOL
http://msdn.microsoft.com/en-us/library/aa259200(v=sql.80).aspx

Keep checking my posts, as I generate more traffic, I'll start posting more tips, and tricks.

Kind Regards
CG


Thursday, 24 May 2012

SQL Leap Year Time Dimension (Parallel Period)


Hi Guys,

Seen as the internet is so Unhelpful when it comes to working with leap years in SQL Server.


I’ve written a script that will allow you to do parallel period comparisons down to the day. So if today is Thursday and you want to compare your data to last year Thursday e.g. Sales. Then you will need to create a script that can’t just take you back a year, and add or remove a few days. You need to take into consideration that in every leap year there’s 53 weeks and an extra day. So just to save you all the hassle here’s the script. If you don’t know how to check for leap years, you can get the logic from the script too.

You guys might want to think about adding this to your Date Dimensions in the future if you’re creating Sales data marts.

/******************Parallel Peroid Comparison to day****************************/
/********This is if you want to compare today to last year what today is********/
-- Replace FullDate with your own Current date time in your Time Dimension; Format = GETDATE()
-- Replace DayOfYear with your own Current day of year in your Time Dimension; Format = DATENAME(DAYOFYEAR, FullDate)
SELECT *,CASE WHEN (YEAR(FullDate) % 4 <> 0) THEN 0 ELSE 1 END IsLeapYear
, CASE WHEN ((YEAR(FullDate) % 4 = 0) AND [DayOfYear] > 59) THEN DATENAME(WEEKDAY,DATEADD(dd, +2,DATEADD(yy,-1,FullDate)))
WHEN ((YEAR(FullDate) % 4 = 1) AND [DayOfYear] < 60) THEN DATENAME(WEEKDAY,DATEADD(dd, +2,DATEADD(yy,-1,FullDate)))
WHEN ((YEAR(FullDate) % 4 <> 0) AND [DayOfYear] > 59) THEN DATENAME(WEEKDAY,DATEADD(dd, 1,DATEADD(yy,-1,FullDate)))
WHEN ((YEAR(FullDate) % 4 <> 0) AND [DayOfYear] < 60) THEN DATENAME(WEEKDAY,DATEADD(dd, 1,DATEADD(yy,-1,FullDate))) ELSE DATENAME(WEEKDAY,DATEADD(dd, +1,DATEADD(yy,-1,FullDate))) END LeapYearDay
FROM DimDate

What irritates me is how simple it would be to do this in MDX.
ParallelPeriod( [ Level_Expression [ ,Index [ , Member_Expression ] ] ] )

SELECT ParallelPeriod ([Date].[Calendar].[Calendar Quarter]
   , 3
   , [Date].[Calendar].[Month].[October 2003])
   ON 0
   FROM [Adventure Works]

Friday, 11 May 2012

How does Hyper-Threading work and when to use it on Microsoft SQL Server

Hi Guys,

So the question is when to use Hyper-Threading on a SQL box, and what impact it has. To do this you first need to understand the CPU architectures, so to do this I compared the Intel Core i5 processors with the Intel i7's because the i7's have hyper-threading whereas the i5's don't. That's pretty much the only difference between the 2 if you didn't know.

Most people automatically assume that bigger is better, but welcome to the SQL world where software actually makes a difference, and can save a company millions on hardware costs if you just knew how to configure your software in the right way.

So let's start with the CPU. Fist off, download CPU-Z for yourself to get the information about your processor that you're running. http://www.cpuid.com/softwares/cpu-z.html, the link should be on the right-hand side.


So if you look at the above screen shots of my laptop, you'll see that I have 2 cores, and with hyper-threading enabled, it gives me 4 threads (Logical CPU's), if hyper-threading was disabled it would only give me 2 threads. When hyper-threading is enabled, and it increases your Threads, then it assigns some of your L3 cache to those threads, so if I have 4Mb of L3 Cache, it will distribute it among each of the threads, so every thread will get 1Mb of L3 Cache to use, but with hyper-threading disabled, it will give me 2 Threads, with 2Mb of Cache each.See my diagram below.


The reason why this is important to know is that when you compare the differences in OLTP (Online Transaction Processing) databases, and data warehouses / marts in the BI space, you want more threads to process your real-time incoming data by having Hyper-Threading enabled. BI systems require larger data volumes to be processed less frequently, so in that case, it would make sense to have less threads, but give those threads more L3 Cache to process the larger volumes, in which case it will make sense to disable Hyper-Threading, but only if you're doing less frequent loads, you will have to weigh it up and do some tests for yourself when it comes to Real-Time data warehouses.

Hope this will help to clear things up for people that are still debating about which one is better to use when you're developing or working with BI systems, or OLTP systems.

Kind Regards
Carel Greaves

T-SQL Analytic Functions SUM

Let's try this blog thing out :-)

I often get asked by friends and colleagues to help them out with some really troublesome SQL problems that they run into, or for better ways of writing their SQL code. I have what I consider to be the best way, I know that at any given time there's probably 11 or more ways of solving the same business problems with SQL code.

I Love getting other people's feedback so please feel free to criticize my code at will so that I can also learn.

"My brain has frozen up & I can’t solve a simple issue … hope you can quickly help. See attached.
 It’s a lot more complex than the example, but for some reason I’m sitting here & just not getting the simplest example right. "

So, the test data is as follows.











Set A
DimA DimB DimC Value
A001 B001 C001 10
A002 B002 C002 15
A003 B003 C003 20
Set B
DimA DimC DimD Value
A001 C001 D001 100
A001 C001 D002 100
A003 C003 D003 50

-- and the desired outcome is:
DimA DimB DimC DimD Value Comments
A001 B001 C001 D0015 (10/200*100)
A001 B001 C001 D002 5   (10/200*100)
A002 B002 C002 Null 15  As is In A
A003 B003 C003 D003 20  As is in A

Additional Information that was provided was:

Join on DimA and DimC
All Combinations must exist
Value must total Set A's values, with Set B's Ratios
If Set B doesn't match, return Set A (Left Outer), with Set A Value

First steps to troubleshoot.
I created 2 table variables and replicated the result sets.

DECLARE @SetA TABLE (DimA VARCHAR(10), DimB VARCHAR(10), DimC VARCHAR(10), Value INT)
INSERT INTO @SetA VALUES ('A001',   'B001',     'C001',     10)
INSERT INTO @SetA VALUES ('A002',   'B002',     'C002',     15)
INSERT INTO @SetA VALUES ('A003',   'B003',     'C003',     20)

DECLARE @SetB TABLE (DimA VARCHAR(10), DimC VARCHAR(10), DimD VARCHAR(10), Value INT)
INSERT INTO @SetB VALUES ('A001',   'C001',     'D001',     100)
INSERT INTO @SetB VALUES ('A001',   'C001',     'D002',     100)
INSERT INTO @SetB VALUES ('A003',   'C003',     'D003',     50)

SELECT * FROM @SetA
SELECT * FROM @SetB

I then checked the result sets on using an inner join and a left join, to see which one closely matches the required result sets. (Small overhead for additional testing which might provide another solution to the problem.)

SELECT * FROM @SetA A INNER JOIN @SetB B ON A.DimA = B.DimA AND A.DimC = B.DimC
SELECT * FROM @SetA A LEFT JOIN @SetB B ON A.DimA = B.DimA AND A.DimC = B.DimC

What I then needed to do was add up all the values for each grouping, and then divide those values into the totals for those groups. (This is where the analytic function comes in handy.)

SELECT A.DimA, A.DimB, A.DimC, B.DimD, A.Value AS ValueA, B.Value AS ValueB, SUM(B.Value) OVER (PARTITION BY A.DimA, A.DimB, A.DimC) AS TotalPart
FROM @SetA A LEFT JOIN @SetB B ON A.DimA = B.DimA AND A.DimC = B.DimC

Additional logic needed to be added for "Value must total Set A's values, with Set B's Ratios".

SELECT *, CASE WHEN TotalPart IS NULL THEN ValueA ELSE (CAST(ValueA AS DECIMAL(18,2)) / CAST(TotalPart AS DECIMAL(18,2)) ) * CAST(ValueB AS DECIMAL(18,2)) END Results FROM (
SELECT A.DimA, A.DimB, A.DimC, B.DimD, A.Value AS ValueA, B.Value AS ValueB, SUM(B.Value) OVER (PARTITION BY A.DimA, A.DimB, A.DimC) AS TotalPart
FROM @SetA A LEFT JOIN @SetB B ON A.DimA = B.DimA AND A.DimC = B.DimC
) TB

That will give you the correct result set that's required based on the logic.

I personally feel that for a first try its not too bad, but I'll try to improve and make more sense as I go along :-)

If you have any questions or scenarios that you want me to try out, and post, then please let me know... looking forward to hearing from you guys.


Cheers for now.