What Are The Differences Between SQL Analytical Functions RANK, DENSE_RANK And ROW_NUMBER?

We read about applications of the analytical function ROW_NUMBER in a prior blog. We will now focus on the related analytical functions RANK and DENSE_RANK and how they differ from each other and from ROW_NUMBER.

Assume there is an existing emp table with the following data.

Oracle employee table

For each department, we want to retrieve salaries (sal) in descending order along with the values of RANK, DENSE_RANK and ROW_NUMBER. After reading the prior blog, it would be reasonable to assume ROW_NUMBER returns a value of 1 for the highest salary in each department. RANK and DENSE_RANK behave the same way for that top row.

The differences, however, emerge when salaries in a given department are the same for two or more employees. As shown in the illustration below, two employees (empno 7788 and 7902) earn $3000 each, which is the highest salary in department (deptno) 20. Hence, both DENSE_RANK and RANK assign a value of 1 to both. For the employee earning the next highest salary (empno 7566), DENSE_RANK displays 2, whereas RANK skips 2 and displays 3 because two prior employees (7788 and 7902) earn the highest salary in department 20.

ROW_NUMBER, on the other hand, takes no account of multiple employees earning the same salary. It simply assigns sequential numbers to the rows within each partition.

Rank, Dense_Rank, Row_Number

The actual code that produces these results is provided below.

SELECT empno, deptno, sal,
       DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC NULLS LAST) DENSE_RANK,
       RANK() OVER (PARTITION BY deptno ORDER BY sal DESC NULLS LAST) RANK,
       ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY sal DESC NULLS LAST) ROW_NUMBER
FROM emp
WHERE deptno IN (10, 20)
ORDER BY deptno, RANK;
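
Because DENSE_RANK never skips a value, it is the natural choice when, for example, the second-highest salary in each department is wanted regardless of ties. A minimal sketch against the same emp table (the inline view and the alias dr are mine, not part of the original example):

SELECT empno, deptno, sal
FROM (SELECT empno, deptno, sal,
             DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC NULLS LAST) dr
      FROM emp)
WHERE dr = 2;

With RANK in place of DENSE_RANK, the tie on $3000 in department 20 means no row carries rank 2, so the same filter would return nothing for that department.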

 

What Are CLI Commands In Netezza?

CLI stands for command-line interface. All Netezza CLI commands can be run on a Netezza host, while certain CLI commands can also be run on a client host. The hosts always run Linux/UNIX operating systems. For this article, I will be running CLI commands on a Netezza host.

As a first example, nzrev provides Netezza's release number; in my case, I am on NPS 7.0. nzstate tells me my NPS is online, i.e. up and running. I can stop and start NPS by using nzstop and nzstart respectively.


I will illustrate the purpose of another CLI command called nzstats. This command shows statistics for a group (in a database) or a table. Running nzstats -h provides a description of the command.


Next is nzds. This CLI command provides details on the data slices in an NPS server. The -h option again provides a short description of the command.


What Is A Three-Level Naming Notation In Netezza?

In Oracle, unless a database table is global in scope, the table is usually accessed in the following notation: schema_name.table_name. For example:

SELECT * FROM schema_name.table_name;

A global table, such as DUAL, can be directly accessed without a schema, for example: SELECT * FROM DUAL;

With this short introduction in mind, Netezza has a similar naming construct in older versions of Netezza Performance Server, popularly known as NPS. Since Netezza allows cross-database references, it has a three-level naming convention which is database_name.schema_name.object_name.

I created a database, "DivyeshDB". Within this database, I created a table called "DivyeshT" and inserted two values.
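
A minimal sketch of that setup in SQL (the column definition and the two inserted values are assumptions for illustration):

create table DivyeshT (id integer);  -- column name and type are assumptions
insert into DivyeshT values (1);     -- the two inserted values are hypothetical
insert into DivyeshT values (2);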


My next task is to use the three-level convention to query this table from the SYSTEM database. Since SYSTEM and DivyeshDB are two different databases, I have to let Netezza know where the table resides by adding a prefix for the database name, “DivyeshDB”.

I can also let Netezza know the table resides on the DivyeshDB database without having to specify the schema, “admin”. If “admin” is omitted, two periods between database name and object name are required. I can’t, however, utilize owner.object notation if I am not on the database where the table resides.
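
Putting this together, a sketch of the two query forms run from the SYSTEM database, using the database, schema and table names described above:

select * from DivyeshDB.admin.DivyeshT;  -- full three-level notation
select * from DivyeshDB..DivyeshT;       -- schema omitted; two periods are required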


What is the difference between in-database analytics and traditional analytics?

It is a common practice to extract data from a database so that analytics can be performed on it. The word "extract" implies that there is a movement of data from the database to a data set which resides on disk. This practice worked quite well until Big Data came along.

Assuming a database table has ten terabytes of data, it is not practical to extract this data given the disk space required and the time taken for the extraction to complete. It would be best to prevent the data movement in the first place and perform analytics directly in the database, hence the phrase in-database analytics. Modern data warehouse appliances such as Netezza, Teradata and Greenplum provide in-database analytics functionality.

So what are the advantages of in-database analytics? Obviously, insights are derived faster, and this in turn helps C-suite executives make business decisions, hopefully before their competitors do. In the retail industry, where today's "hot" item can be tomorrow's "has-been", faster insights are crucial.

The second advantage is total cost of ownership (TCO). Due to elimination of data movement, servers don’t have to be upgraded to account for extra space and processors. Since data is analyzed typically by advanced SQL, license fees for expensive analytics software can be avoided.

What is the purpose of the analytical function, “ROW_NUMBER”, In Oracle?

SQL-related analytical functions are very common in the world of Big Data. For instance, there is always a need to retrieve n-tier levels of data from a database. A good example is the top half of a data slice versus the bottom half of the same data slice (top n-reporting versus bottom n-reporting). It is even possible to compute inner n-reporting.

Taking Oracle as a database, it is best to illustrate this concept via examples.

First, create a table called myEmployees.

create table myEmployees
( department_id number,
  last_name varchar2(20),
  employee_id number
);

Second, insert data into the table.

insert into myEmployees values ( 10, 'Dave', 100);
insert into myEmployees values ( 20, 'Patel', 300);
insert into myEmployees values ( 10, 'Divyesh', 200);
insert into myEmployees values ( 20, 'Kassam', 400);
insert into myEmployees values ( 10, 'Asher', 500);
insert into myEmployees values ( 30, 'Teh', 600);

Now, let us explain the concept of ROW_NUMBER.

In Oracle, ROW_NUMBER is considered an analytic function. It assigns a unique value to each row returned within a partition. Conceptually, one can think of a "partition by" as a "group of". So for the query below, ROW_NUMBER assigns a unique value to each row within a partition of (group of) department ids, with rows ordered by employee id.

Bear in mind, values need not be unique across partitions.

select department_id, last_name, employee_id,
       row_number() over (partition by department_id order by employee_id) as emp_id
from myEmployees;

Its output assigns emp_id values 1, 2 and 3 to the three rows in department 10 (employee ids 100, 200 and 500), 1 and 2 to the two rows in department 20 (employee ids 300 and 400), and 1 to the single row in department 30 (employee id 600).

 

The query below is a good example of inner n-reporting.

select last_name from
  (select last_name, row_number() over (order by last_name) rank from myEmployees)
where rank between 2 and 3;

The nested "select" query in the SQL statement above numbers the last names in alphabetical order: Asher is 1, Dave is 2, Divyesh is 3, Kassam is 4, Patel is 5 and Teh is 6.

 

The overall query then keeps rank between 2 and 3 inclusive. For rank=2, we have "Dave" and for rank=3, we have "Divyesh", so those are the only two rows returned.
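
The same pattern covers the top n-reporting and bottom n-reporting cases mentioned at the start of this section. For example, a sketch that returns the first three last names alphabetically (the alias rn is mine):

select last_name from
  (select last_name, row_number() over (order by last_name) rn from myEmployees)
where rn <= 3;

Against the data above this returns Asher, Dave and Divyesh; changing the inner clause to order by last_name desc gives the bottom-n equivalent.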

What Is A Julian Day In Oracle?

To begin with, the Oracle date range runs from January 1, 4712 BC to December 31, 9999 AD. A Julian day number is the number of days since January 1, 4712 BC. The intent behind this concept is to provide one common reference point for comparison purposes. This is very similar to the date concept in SAS, where the base date was chosen to be January 1, 1960.

 

To find the Julian equivalent of the current date, one would run the following SQL in Oracle. Please note the use of ‘J’ in the SQL to compute the Julian number.

 

select to_char(sysdate, 'J') from dual;

2456528
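
The conversion can also be reversed with TO_DATE and the same 'J' format mask; a quick sketch using the value returned above:

select to_date('2456528', 'J') from dual;

This should return the calendar date on which the original query was run (23 August 2013, assuming the default date format).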

 

What is the purpose of Impact Analysis in SAS Data Integration Studio?


If A has an impact on B and B has an impact on C, it is logical to deduce that A has an impact on C. Before modifying A, surely one would want to know how C would be affected. Enter impact analysis. In SAS Data Integration Studio, impact analysis identifies the jobs, columns or transformations (our C) that would be impacted if a table or a column (our A) were to be modified.

Translating A into a real table, let us consider an Employees table shown below.

Employees table

By using the “Splitter” transformation in DI Studio, we split Employees into 2 tables based on the gender of the employees.

Split Job

After the Split Job is run, the FemaleEmployees table will have only the 2 rows where Gender='F'.

Female Employees

 

Now, let us create another SAS DI Studio job where the FemaleEmployees table becomes an input to the job and the output is a table sorted by salary in ascending order.

Sorted Female Employees by Salary Ascending

This is how SortedFemaleEmployees looks after the job is run.

Sorted Female Employees

So far, we have only described the flow of the job processes. Now imagine that, in a large corporation, you have an Employees table and you are not sure exactly where this table is being used and what end results come out of it. Using SAS Data Integration Studio, you can simply right-click on the table name and click on "Analyze".

You will get a graphical representation of the process flows from the Employees table to all its appropriate direct and indirect end tables.

 

It is possible to run impact analysis on a more granular level such as a specific column name in Employees. It is also possible to run reverse impact analysis on the end table, SortedFemaleEmployees, to figure out its sources. But that is a topic for another day!

 

What is a SAS OLAP Cube?

Definition:

An OLAP cube is a multi-dimensional data store that contains pre-summarized data. OLAP cubes support ad-hoc business reporting since they are optimized for on-the-fly querying.

Components of an OLAP cube:

SAS OLAP cubes are made up of:

  • Dimensions: variables organized in logical groups
  • Hierarchies: data’s logical order
  • Levels: Units that make up the hierarchy
  • Members: Level’s specific values
  • Measures: statistics that are produced

Taking time as an example, time itself can be considered as a dimension.

A hierarchy of time, i.e. time’s logical order, can be a year.

Levels of a year can be a quarter or a month.

Members are the actual values, e.g. 2012 for the year level and 3 for the third quarter.

 

 

What are the steps to benchmark a SAS program?

From a pure software development standpoint, efficiency of a SAS program is usually determined by:

  • Volume of data to be processed
  • Length of program, also known as lines of code (LOC).
  • Number of times of program execution

There is no one-size-fits-all solution when it comes to the criteria mentioned above. Case in point: if a program is meant to be run only once, then it probably doesn't make sense to invest time in decreasing its LOC.

SAS programs consume the following machine resources:

  • Memory
  • Input/output
  • Network bandwidth
  • CPU time
  • Data storage

If you need to determine how your machine resources are being impacted, you can benchmark your program using the following criteria:

  • Run the program as it would be run in a live environment.
  • Run the program multiple times so that an average can be obtained. Obviously, an average will give a more accurate picture.
  • While benchmarking, test only one change per execution. The reason is that multiple changes in a single execution make it impossible to tell which change caused the difference in resource usage.
  • Turn on SAS system options (such as FULLSTIMER) so that resource usage statistics can be obtained.

What is Master Data Management?

Sometimes, it is not possible to even comprehend the complexity  of big companies like Exxon Mobil. Just imagine how data is managed in these companies for their products and tens of thousands of customers. There could easily be many data silos for each department with continuous changes such as product name changes and additions and deletions to data.

To illustrate one example, imagine a sales representative taking an order from a customer. The representative incorrectly enters “7280 Park Street” instead of the correct address, “7280 Park Avenue”. As the order gets fulfilled, this dirty data is propagated to various systems. However, the billing department detects the incorrect address and rectifies it. This rectification, though, is only local to the billing system and other departments may not be aware of it.

We just discussed one scenario of inconsistent data across different data silos. But there could be thousands of such scenarios occurring everyday, hence creating havoc among all corporate departments.

Enter “Master Data Management“.

The objective of Master Data Management is to provide consistency among critical corporate data silos. In doing so, no matter what department an employee works for, he/she would see a “single version of truth“. In our illustration, the employee would only see the correct address.

How is it possible to retrieve SAS observations within a certain ordinal number range of observations?

number

As indicated on the Wikipedia page, ordinal numbers denote the rank or position of an item with respect to some order, i.e. first, second, third, fourth.

Now assume there is a SAS data set, tempData, comprising:

ID Age
1 20
2 10
3 40
4 30
5 50

If we want to find the sum of ages from the second to the fourth observation, we can use the code below.

proc sql;
  select sum(age) as SumAge
  from tempData
  where 2 <= id <= 4;
quit;

The code works correctly because the ids happen to be in sequential order, but what if they are not? Say, for example, tempData comprises:

ID Age
4 20
3 10
7 40
8 30
1 50

Obviously, we don't have the luxury of using ids in our code any more. This is where the monotonic() function comes in handy. The monotonic function basically provides a row/observation number: it returns 1 for the first observation, 2 for the second and so on.

With this in mind, the code below will now work correctly even if ids are not sequential.

proc sql;
  select sum(age) as sumAge, monotonic() as x_a
  from tempData
  where calculated x_a between 2 and 4;
quit;

What are the different approaches to show the nth highest observation in a SAS data set?

Assume you have SAS code to produce a data set called "myData" as shown below. You want to retrieve the 2nd highest observation, with "highest" defined in terms of age. That would mean the 2nd highest observation is "10 Babo". We will look at 3 approaches to derive the 2nd highest observation.

data myData;
input age name $;
datalines;
10 Babo
9  Vivian
9  Tina
11 Michael
8  Mungu
;
run;
Approach 1: using plain-old SQL

proc sql;
select * from myData
where age=(select max(age) from myData
where age not in (select max(age) from myData));
quit;
The SQL code above is conceptually easy to understand for retrieving the 2nd highest observation. But what if you had to retrieve the 4th highest observation? As you can imagine, the SQL code can get really ugly. So, I personally would avoid SQL queries.
Approach 2: using “Rank”

proc rank data=myData
descending
ties=dense
out=want (where=(rank=2)) ;
var age;
ranks rank;
run;
Compared to the SQL code in approach #1, the proc rank code is much more flexible. If we have to find the 4th highest observation, all we need to do is replace 2 with 4 in the code.
Approach 3: using “largest” function

proc sql;
select age into :num separated by ','
from myData;
quit;

data se_score;
set myData;
where age = largest(2,&num);
run;
Compared to the Rank code in approach #2, the code above involves two operations: the first creates the macro variable num and the second determines the actual 2nd highest observation. Obviously, approach #3 has more code than approach #2. Nevertheless, the code is still flexible because 2 can be replaced with any number.

Conclusion: Personally I think approach #2 is the best approach in terms of code maintenance, flexibility and performance. Approach #1 provides no flexibility at all in terms of deriving any nth observation. It only works if n=2.

What are the reasons to use SAS Enterprise Guide over Base SAS?

  • Reduction in development time: since Enterprise Guide has point-and-click interfaces, there is generally no need to write native Base SAS code. If customization is needed, it is always possible to modify the Enterprise Guide-generated code.
  • Ease of creating reports: there are tasks available for various charts such as bar, pie and line plots. C-suite leaders are usually more interested in charts than raw data reports. It can literally take 5 minutes to create a report when the boss needs it.
  • Program flows show the progression of logic from start to finish in a pictorial representation. There is no need to dig through lines of code just to understand the logic flow.
  • Interaction with other SAS BI tools: it is easier to communicate with SAS OLAP Cube Studio and Web Report Studio from Enterprise Guide.
  • Free with Base SAS: Enterprise Guide is provided free of charge with the Base SAS license for the Windows operating system.

What is Data Integration?


As McKinsey's report on Big Data indicates, the volume of data has nowhere to go but up. A corporation often has disparate data silos which need to be consolidated. For instance, the accounting database of a company may have to be consolidated with the billing database to provide enterprise-wide data to the company.

Data integration is the process of consolidating data from heterogeneous sources in order to produce a unified view of the data. Taking the example from the paragraph above, the accounting database can be consolidated with the billing database to provide one set of data for a given corporation.

Characteristics of data integration include:

  • Data federation: ability to use data across different systems such as ERP applications, flat files and databases without the movement of data.
  • Master data management: create a single "version of truth" from multiple sources, i.e. create a unified view of enterprise data.
  • Extraction, transformation, and loading (ETL): extracts data from source(s), transforms the data according to business rules and loads the data into target system(s).
  • Data cleansing: the process of cleaning the data. For instance, the accounting database could have "Divy Dave" for an individual while the billing database could have "Divyesh Dave" for the same individual.

Data integration is a career branch by itself in the Big Data universe. There are many third-party tools to assist in data integration, such as Informatica, DataStage and SAS Data Integration Studio. With the abundance of data in corporate data silos and on social media websites such as Twitter and Facebook, demand for this career is anticipated to keep going up.
