Statistical sampling in SAS

In my prior blog, I discussed the concept of a sample as it applies to statistics. In this blog, I will discuss sampling in SAS.

In SAS, there are three kinds of sampling, namely:

  • Systematic sampling
  • Simple random sampling with replacement
  • Simple random sampling without replacement

In systematic sampling, data in a table is selected in fixed intervals.

In random sampling, rows are selected randomly from a table. There is a chance, however, that the very same row could be selected again. This is called random sampling with replacement.

Random sampling without replacement occurs when a row, if already selected once, is not selected again.

Let us now focus on the actual SAS code that derives samples for these three kinds.


In the SAS procedure "surveyselect", the value of the method option determines the kind of sample we want, and n determines the size of the sample.

  • sys creates a systematic sample
  • urs creates a sample with replacement (unrestricted random sampling)
  • srs creates a sample without replacement (simple random sampling)
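Putting this together, here is a minimal sketch of the call; the input data set work.customers and its variables are assumptions for illustration:

```sas
/* Draw a systematic sample of 3 rows from a hypothetical
   input data set work.customers into work.sample. */
proc surveyselect data=work.customers out=work.sample
                  method=sys n=3;
run;

proc print data=work.sample;
run;
```

Swapping in method=urs or method=srs yields the other two kinds of samples.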

For illustration purposes, taking n=3 and method=sys, I get the following results:

Obs    custID    empID
  1        63    121039
  2        45    123456
  3        35    654321

Descriptive versus inferential statistics

I am back from hibernation, and this time I would like to indulge in another topic where SAS is extensively used: statistics.

Whether we are statisticians or not, we have all dealt with simple numerical measures such as percentages, averages, minimums and maximums, and with graphical depictions such as the price of Apple stock over time. All of these describe data: its anomalies, its distinct features, its shape. The branch of statistics that deals with such descriptions is known as descriptive statistics.

Now assume for a second that these descriptions were derived from a small sample of data, but we want to extrapolate them to a large population (more on sample and population in my next blog). Extrapolating from a sample of data to a population is called inferential statistics.


What Is SAS University Edition?

SAS, until May of 2014, had not offered free use of its software outside of academia. That has now changed. For the first time ever, an individual not associated with an academic institution can develop and run Base SAS code for free. Instead of the usual point-and-click interface, SAS Enterprise Guide, Base SAS code can now be developed in an interface that runs in a browser such as Google Chrome or Internet Explorer. This interface is called SAS Studio.

The details of the installation are beyond the scope of this article, but you may click on the SAS Studio link to get details.

I ran the code below to find out the SAS version and all the installed components that comprise SAS Studio. One product that is not in the list is SAS/GRAPH; hence, it is not possible to create graphs using SAS/GRAPH. However, procedures such as SGPLOT, SGPANEL and SGSCATTER may be used instead. There are some graph examples on this SAS blog. From a personal standpoint, I have found SAS Studio immensely useful for learning purposes.

proc setinit;
run;

NOTE: PROCEDURE SETINIT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
Original site validation data
Site name: 'SAS University Edition'.
Site number: 70068118.
CPU A: Model name='' model number='' serial='+2'.
Expiration: 16JUN2015.
Grace Period: 0 days (ending 16JUN2015).
Warning Period: 45 days (ending 31JUL2015).
System birthday: 05FEB2014.
Operating System: LIN X64 .
Product expiration dates:
—Base SAS Software 16JUN2015 (CPU A)
—SAS/IML 16JUN2015 (CPU A)
—SAS/Secure 168-bit 16JUN2015 (CPU A)
—SAS/ACCESS Interface to PC Files 16JUN2015 (CPU A)
—SAS/ACCESS Interface to ODBC 16JUN2015 (CPU A)
—SAS/IML Studio 16JUN2015 (CPU A)
—SAS Workspace Server for Local Access 16JUN2015 (CPU A)
—SAS Workspace Server for Enterprise Access 16JUN2015 (CPU A)
—High Performance Suite 16JUN2015 (CPU A)
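As a minimal sketch of the SGPLOT alternative mentioned above, using the sashelp.class data set that ships with SAS:

```sas
/* Scatter plot of weight against height using PROC SGPLOT,
   which is part of Base SAS and works in SAS Studio. */
proc sgplot data=sashelp.class;
    scatter x=height y=weight;
run;
```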


What Is A Materialized View In Netezza?

Disclaimer: Although materialized views are a part of many databases, Netezza has been used as an illustration for this article.

A view is a virtual table that contains the output of a SQL query. A materialized view is slightly different from a view in the sense that it stores an ordered projection of a base table on disk. Behind the scenes, this allows the query optimizer to use the materialized view, if one exists, thereby improving performance since the view is sorted.

Let us walk through an example. In the image below, table "dd" contains 3 rows. A materialized view, "dd_mv", was created from this table.

Creating a materialized view


There are certain restrictions on the query used to create such a view:

  • There can only be one base table after “FROM”.
  • “WHERE” is not allowed in the query.
  • The columns after “SELECT” must be from the base table. No aggregation or “DISTINCT” is allowed.
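A minimal sketch of such a statement, honoring the restrictions above; the base table dd is taken from the example, while its column names are assumptions:

```sql
-- Create a materialized view over base table dd, ordered by col1.
-- Single base table, no WHERE clause, plain base-table columns only.
CREATE MATERIALIZED VIEW dd_mv AS
    SELECT col1, col2
    FROM dd
    ORDER BY col1;
```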

After the view has been created, if a row is inserted into the base table ("dd"), the view gets updated automatically. See the image below. A Netezza user is thus assured that a fetch from the view will always pull the most current data.

View gets updated


Last but not least, if the base table itself gets dropped, the materialized view is converted to a regular view. Of course, it will be an 'orphan' view, since the table it was associated with no longer exists.

Base table gets dropped


What Are The Differences Between SQL Analytical Functions RANK, DENSE_RANK And ROW_NUMBER?

We read about applications of the analytical function ROW_NUMBER in a prior blog. We will now focus on the related analytical functions RANK and DENSE_RANK and how the three differ from one another.

Assume there is an existing emp table with the following data.

Oracle employee table


For each department, we want to retrieve salaries (sal) in descending order along with the values of RANK, DENSE_RANK and ROW_NUMBER. After reading the prior blog, it would be reasonable to assume that ROW_NUMBER returns a value of 1 for the highest salary in each department. RANK and DENSE_RANK function in the same manner.

The differences, however, emerge when, in a given department, salaries are the same for two or more employees. As shown in the illustration below, two employees (empno) 7788 and 7902 earn $3000 each, which is the highest salary in department (deptno) 20. Hence, both DENSE_RANK and RANK provide a value of 1. For the employee earning the next highest salary (7566), DENSE_RANK displays 2. RANK skips 2 and displays 3, because there were two prior employees (7788, 7902) earning the highest salary in department 20.

ROW_NUMBER, on the other hand, pays no attention to multiple employees earning the same salary. It simply assigns sequential numbers from one row to the next.

Rank, Dense_Rank, Row_Number


The actual code that produces these results is provided below.

-- The column aliases rnk, drnk and rn are assumed for illustration.
SELECT empno, deptno, sal,
       RANK()       OVER (PARTITION BY deptno ORDER BY sal DESC) AS rnk,
       DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS drnk,
       ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY sal DESC) AS rn
FROM emp
WHERE deptno IN (10, 20);


What Are CLI Commands In Netezza?

CLI stands for command line interface. All Netezza CLI commands can be run on a Netezza host, while certain CLI commands can also be run on a client host. The hosts always run Linux/UNIX operating systems. For this article, I will be running CLI commands on a Netezza host.

As you can see from the image below, nzrev provides Netezza's release number. In my case, I am on NPS 7.0. nzstate tells me that my NPS is online, i.e. up and running. I can stop and start NPS using nzstop and nzstart respectively.


I will illustrate the purpose of another CLI command called nzstats. This command shows statistics for a group (in a database) or for a table. By the way, running nzstats -h provides a description of the command.


Next is nzds. This CLI command provides details on the data slices in an NPS server; its -h option provides a short description of the command.
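In summary, a short sketch of the commands covered in this article, all run on the Netezza host:

```shell
nzrev        # show the NPS release number
nzstate      # report whether NPS is online
nzstop       # stop NPS
nzstart      # start NPS
nzstats -h   # describe the nzstats command and its options
nzds -h      # describe the nzds (data slice) command
```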


What Is A Three-Level Naming Notation In Netezza?

In Oracle, unless a database table is global in scope, the table is usually accessed with the following notation: schema_name.table_name. For example:

SELECT * FROM schema_name.table_name;

A global table, such as DUAL, can be accessed directly without a schema, for example: SELECT * FROM DUAL;

With this short introduction in mind, Netezza has a similar naming construct going back to older versions of Netezza Performance Server, popularly known as NPS. Since Netezza allows cross-database references, it has a three-level naming convention: database_name.schema_name.object_name.

The image below shows a database, “DivyeshDB”, that I created. Within this database, I created a table called “DivyeshT” and inserted two values.

Netezza Cross-Database Query

My next task is to use the three-level convention to query this table from the SYSTEM database. Since SYSTEM and DivyeshDB are two different databases, I have to let Netezza know where the table resides by adding a prefix for the database name, “DivyeshDB”.

I can also let Netezza know the table resides on the DivyeshDB database without having to specify the schema, “admin”. If “admin” is omitted, two periods between database name and object name are required. I can’t, however, utilize owner.object notation if I am not on the database where the table resides.
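Putting the two variants side by side, using the names from the example above and assuming we are connected to the SYSTEM database:

```sql
-- Full three-level reference: database.schema.object.
SELECT * FROM DivyeshDB.admin.DivyeshT;

-- Schema omitted: note the two consecutive periods.
SELECT * FROM DivyeshDB..DivyeshT;
```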

Netezza Cross-Database Reference

What is the difference between in-database analytics and traditional analytics?

It is a common practice to extract data from a database so that analytics can be performed on it. The word "extract" implies a movement of data from the database to a data set residing on disk. This practice worked quite well until Big Data came along.

Assuming a database table holds ten terabytes of data, it is not practical to extract it, considering the disk space required and the time taken for the extraction to complete. It would be best to prevent data movement in the first place and perform the analytics directly in the database, hence the phrase in-database analytics. Modern data warehouse appliances such as Netezza, Teradata and Greenplum provide in-database analytics functionality.

So what are the advantages of in-database analytics? Obviously, insights are derived faster, which in turn helps C-suite executives make business decisions, hopefully before their competitors do. In a retail industry where today's "hot" item can be tomorrow's "has-been", faster insights are crucial.

The second advantage is a lower total cost of ownership (TCO). Due to the elimination of data movement, servers don't have to be upgraded with extra space and processors. And since the data is typically analyzed with advanced SQL, license fees for expensive analytics software can be avoided.

What is the purpose of the analytical function, “ROW_NUMBER”, In Oracle?

SQL analytical functions are very common in the world of Big Data. For instance, there is always a need to retrieve n-tier levels of data from a database. A good example is the top half of a data slice versus the bottom half of the same slice (top-n reporting versus bottom-n reporting). It is even possible to compute inner-n reporting.

Taking Oracle as a database, it is best to illustrate this concept via examples.

First, create a table called myEmployees.

create table myEmployees
( department_id number,
  last_name     varchar2(20),
  employee_id   number
);

Second, insert data into the table.

insert into myEmployees values (10, 'Dave', 100);
insert into myEmployees values (20, 'Patel', 300);
insert into myEmployees values (10, 'Divyesh', 200);
insert into myEmployees values (20, 'Kassam', 400);
insert into myEmployees values (10, 'Asher', 500);
insert into myEmployees values (30, 'Teh', 600);

Now, let us explain the concept of ROW_NUMBER.

In Oracle, ROW_NUMBER is an analytic function. It assigns a unique value to each row returned within a partition. Conceptually, one can think of "partition by" as "group of". So for the query below, ROW_NUMBER assigns a unique value to each row within a partition of (group of) department ids, ordered by employee id.

Bear in mind that these values need not be unique across partitions.

select department_id, last_name, employee_id,
       row_number() over (partition by department_id order by employee_id) as emp_id
from myEmployees;

Its output is shown below.



The query below is a good example of inner n-reporting.

select last_name from
  (select last_name, row_number() over (order by last_name) rank from myEmployees)
where rank between 2 and 3;

The inner "select" query in the SQL statement above returns the data below.



The outer query specifies rank between 2 and 3 inclusive. For rank=2, we have "Dave" and for rank=3, we have "Divyesh". This is what we get in the output below.


What Is A Julian Day In Oracle?

To begin with, the Oracle date range runs from January 1st, 4712 BC to December 31st, 9999 AD. A Julian day number is the number of days since January 1st, 4712 BC. The intent behind this concept is to provide one common reference point for comparison purposes. This is very similar to the date concept in SAS, where the base date was chosen to be January 1st, 1960.


To find the Julian equivalent of the current date, one would run the following SQL in Oracle. Please note the use of the 'J' format mask to compute the Julian number.


select to_char(sysdate, 'J') from dual;
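The conversion also works in the other direction with TO_DATE; as a sketch, where the Julian day number below is an arbitrary value chosen for illustration:

```sql
-- Convert a Julian day number back to a calendar date.
select to_date('2457000', 'J') from dual;
```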



What is the purpose of Impact Analysis in SAS Data Integration Studio?


If A has an impact on B and B has an impact on C, it is logical to deduce that A has an impact on C. Before modifying A, surely one would want to know what changes C would see. Enter impact analysis. In SAS Data Integration Studio, impact analysis identifies the jobs, columns or transformations (our C) that would be affected if a table or a column (our A) were modified.

Translating A into a real table, let us consider an Employees table shown below.

Employees table

By using the "Splitter" transformation in DI Studio, we split Employees into two tables based on the gender of the employees.

Split Job

After the Split Job is run, the FemaleEmployees table will have only the two rows where Gender='F'.

Female Employees


Now, let us create another SAS DI Studio job where the FemaleEmployees table becomes the input and the output is a table sorted by salary in ascending order.

Sorted Female Employees by Salary Ascending

This is how SortedFemaleEmployees looks after the job is run.

Sorted Female Employees

So far, we have only described the flow of the job processes. Now imagine that in a large corporation you have an Employees table, and you are not sure where exactly this table is being used and what end results come out of it. Using SAS Data Integration Studio, you can simply right-click on the table name and click "Analyze".

You will get a graphical representation of the process flows from the Employees table to all its appropriate direct and indirect end tables.


It is possible to run impact analysis on a more granular level such as a specific column name in Employees. It is also possible to run reverse impact analysis on the end table, SortedFemaleEmployees, to figure out its sources. But that is a topic for another day!


What is a SAS OLAP Cube?


An OLAP cube is a multi-dimensional data store that contains pre-summarized data. OLAP cubes support ad-hoc business reporting since they are optimized for on-the-fly querying.

Components of an OLAP cube:

SAS OLAP cubes are made up of:

  • Dimensions: variables organized in logical groups
  • Hierarchies: the data's logical order
  • Levels: the units that make up a hierarchy
  • Members: a level's specific values
  • Measures: the statistics that are produced

Taking time as an example, time itself can be considered a dimension.

A hierarchy of time, i.e. time's logical order, can be a year.

The levels of a year can be a quarter or a month.

Members are the actual values, e.g. 2012 for the year and 3 for the third quarter.



What are the steps to benchmark a SAS program?

From a pure software development standpoint, the efficiency of a SAS program is usually determined by:

  • Volume of data to be processed
  • Length of program, also known as lines of code (LOC).
  • Number of times of program execution

There is no one-size-fits-all solution when it comes to the criteria mentioned above. Case in point: if a program is meant to be run only once, it probably doesn't make sense to invest time in decreasing its LOC.

SAS programs consume the following machine resources:

  • Memory
  • Input/output
  • Network bandwidth
  • CPU time
  • Data storage

If you need to determine how your machine resources are being impacted, you can benchmark your program using the following criteria:

  • Run the program as it would be run in a live environment.
  • Run the program multiple times so that an average can be obtained; an average gives a more accurate picture.
  • While benchmarking, test only one change per execution. With multiple changes in a single execution, you cannot tell which change caused a difference.
  • Turn on SAS system options so that resource usage is reported.
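As a sketch of that last step, the FULLSTIMER system option makes SAS write detailed resource statistics (real time, CPU time, memory) to the log for every step; the data step being timed here is a hypothetical stand-in:

```sas
options fullstimer;   /* report detailed resource usage in the log */

/* A hypothetical step to benchmark. */
data _null_;
    set sashelp.class;
run;

options nofullstimer; /* restore the default, abbreviated statistics */
```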

What is Master Data Management?

Sometimes it is not even possible to comprehend the complexity of big companies like Exxon Mobil. Just imagine how data is managed in such companies for their products and tens of thousands of customers. There could easily be many data silos, one for each department, with continuous changes such as product renames and additions and deletions to the data.

To illustrate one example, imagine a sales representative taking an order from a customer. The representative incorrectly enters “7280 Park Street” instead of the correct address, “7280 Park Avenue”. As the order gets fulfilled, this dirty data is propagated to various systems. However, the billing department detects the incorrect address and rectifies it. This rectification, though, is only local to the billing system and other departments may not be aware of it.

We just discussed one scenario of inconsistent data across data silos. But thousands of such scenarios could occur every day, creating havoc across all corporate departments.

Enter “Master Data Management“.

The objective of Master Data Management is to provide consistency among critical corporate data silos. In doing so, no matter what department an employee works in, he or she sees a "single version of the truth". In our illustration, every employee would see only the correct address.
