What is the purpose of Impact Analysis in SAS Data Integration Studio?

If A has an impact on B and B has an impact on C, it is logical to deduce that A has an impact on C. Before modifying A, surely one would be interested to know what changes C would get. Enter impact analysis. In SAS Data Integration Studio, impact analysis identifies jobs, columns or transformation (our C) that would be impacted if a table or a column were to be modified (our A).

Translating A into a real table, let us consider an Employees table shown below.

Employees table

By using the “Splitter” transformation in DI Studio, we split Employees into 2 tables based on the gender of the employees.

Split Job

After the Split Job is ran, the FemaleEmployees table  will have only 2 rows where Gender=’F’.

Female Employees

 

Now, let us create another SAS DI Studio job where FemaleEmployees table becomes an input to the job and the output is a sorted table by salary ascending.

Sorted Female Employees by Salary Ascending

This is how SortFemaleEmployees looks like after the job is run.

Sorted Female Employees

So far, we have only described the flow of the job processes.  Now imagine in a large corporation, you have an Employees table and you are not sure where exactly this table is being used and what end results come out of it. Using SAS Data Integration Studio, you can easily right-click on this table name and click on “Analyze“.

You will get a graphical representation of the process flows from the Employees table to all its appropriate direct and indirect end tables.

 

It is possible to run impact analysis on a more granular level such as a specific column name in Employees. It is also possible to run reverse impact analysis on the end table, SortedFemaleEmployees, to figure out its sources. But that is a topic for another day!

 

What is a SAS OLAP Cube?

Definition:

An OLAP cube is a multi-dimensional data store that contains pre-summarized data. OLAP cubes support ad-hoc business reporting since they are optimized for on-the-fly querying.

Components of an OLAP cube:

SAS OLAP cubes are made up of:

  • Dimensions: variables organized in logical groups
  • Hierarchies: data’s logical order
  • Levels: Units that make up the hierarchy
  • Members: Level’s specific values
  • Measures: statistics that are produced

Taking time as an example, time itself can be considered as a dimension.

A hierarchy of time, i.e. time’s logical order, can be a year.

Levels of a year can be a quarter or a month.

Members are the actual values i.e. 2012 for year and 3 for third quarter.

 

 

What are the steps to benchmark a SAS program?

From a pure software development standpoint, efficiency of a SAS program is usually determined by:

  • Volume of data to be processed
  • Length of program, also known as lines of code (LOC).
  • Number of times of program execution

There is no one-size-fits-all solution when it comes to determining the criteria mentioned above. Case in point: suppose if a program is meant to be run only once, then it probably doesn’t make sense to invest time and have LOC decreased.

SAS programs consume the following machine resources:

  • Memory
  • Input/output
  • Network bandwidth
  • CPU time
  • Data storage

If you need to determine how your machine resources are being impacted, you can benchmark your program using the following criteria:

  • Run  the program as it would be run in a live environment.
  • Run the program multiple times so that an average can be obtained. Obviously, an average will give a more accurate picture.
  • While benchmarking, test only one change at any single execution.  The reason is that multiple changes in a single execution can have an adverse effect.
  • Turn on SAS system options so that resource usage can be obtained.

What is Master Data Management?

Sometimes, it is not possible to even comprehend the complexity  of big companies like Exxon Mobil. Just imagine how data is managed in these companies for their products and tens of thousands of customers. There could easily be many data silos for each department with continuous changes such as product name changes and additions and deletions to data.

To illustrate one example, imagine a sales representative taking an order from a customer. The representative incorrectly enters “7280 Park Street” instead of the correct address, “7280 Park Avenue”. As the order gets fulfilled, this dirty data is propagated to various systems. However, the billing department detects the incorrect address and rectifies it. This rectification, though, is only local to the billing system and other departments may not be aware of it.

We just discussed one scenario of inconsistent data across different data silos. But there could be thousands of such scenarios occurring everyday, hence creating havoc among all corporate departments.

Enter “Master Data Management“.

The objective of Master Data Management is to provide consistency among critical corporate data silos. In doing so, no matter what department an employee works for, he/she would see a “single version of truth“. In our illustration, the employee would only see the correct address.

How is it possible to retrieve SAS observations within a certain ordinal number range of observations?

As indicated in the Wikipedia page, ordinal numbers denote the rank or position of a number with respect to some order i.e. first, second, third, fourth.

Now assume, there is a SAS data set, tempData, comprising of :

ID Age
1 20
2 10
3 40
4 30
5 50

If we want to find the sum of ages from the second to the fourth observation, we can use the code below.

proc sql;

 select sum(age) as SumAge

from tempData

where 2 <= id <= 4;

 quit;

The code works correctly because the ids happen to be in sequential order, but what if they are not?  Say for example, tempData comprises of:

ID Age
4 20
3 10
7 40
8 30
1 50

Obviously, we don’t have the luxury of using ids in our code any more.  This is where monotonic() function would come in handy. Monotonic function basically provides a row / observation number. For instance, the function always returns 1 for the first observation, 2 for the second and so on.

With this in mind, the code below will now work correctly even if ids are not sequential.

proc sql;

select SUM(age) as sumAge, monotonic() as x_a

from tempData where calculated x_a between 2 and 4;

quit;

What are the different approaches to show the nth highest observation in a SAS data set?

Assume you have SAS code to produce a data set called “myData” as shown below. You want to retrieve the 2nd highest observation, highest defined in terms of age. That would mean the 2nd highest observation would be “10 Babo”. We will look at 3 approaches to derive the 2nd highest observation.

data myData ;

input age name $;
 
datalines;
10 Babo
9  Vivian
9  Tina
11 Michael
8  Mungu
;
run;
 
Approach 1: using plain-old SQL

proc sql;
select * from myData
where age=(select max(age) from myData
where age not in (select max(age) from myData));
quit;
 
The SQL code above is conceptually easy to understand for retrieving the 2nd highest observation. But what if you had to retrieve the 4th highest observation? As you can imagine, the SQL code can get really ugly. So, I personally would avoid SQL queries.
 
 
Approach 2: using “Rank”

proc rank data=myData
descending
ties=dense
out=want (where=(rank=2)) ;
var age;
ranks rank;
run;
 
Compared to the SQL code approach #1, the proc rank code seems to be very flexible. If we have to find the 4th highest observation, all we would need is to replace 2 with 4 in the code.
Approach 3: using “largest” function

proc sql;
select age into :num separated by ”,”
from myData;
run;

data se_score;
set myData;
where age = largest(2,&num);
run;
 
Compared to the Rank code in approach #2, the code above has two code operations, first is to create the macro variable num and the second is to determine the actual 2nd highest observation. Obviously, approach #3 has more code than approach #2. Nevertheless, the code is still flexible because 2 can be replaced with any number.

Conclusion: Personally I think approach #2 is the best approach in terms of code maintenance, flexibility and performance. Approach #1 provides no flexibility at all in terms of deriving any nth observation. It only works if n=2.

What are the reasons to use SAS Enterprise Guide over Base SAS?

  • Reduction in development time: Since Enterprise Guide has point and click interfaces, there is no need to write native Base SAS code in general .  If at all, customization is needed, it is always possible to modify the Enterprise Guide generated code.
  • Ease of creating reports:  there are tasks available for various charts such as bar, pie and line plot. C-suite leaders are usually more interested in charts than raw data reports. It can take literally take 5 minutes to create a report when the boss needs it.
  • Program flows show the progression of logic from start to finish in a pictorial representation. There is no need to dig through lines of code just to understand the logic flow.
  • Interaction with other SAS BI tools :  It is easier to communicate with SAS OLAP Cube Studio and Web Report Studio from Enterprise Guide.
  • Free with Base SAS: Enterprise Guide is provided free of charge with the Base SAS license for the Windows operating system.

What is Data Integration?

As McKinsey’s report on Big Data indicates, the volume of data has nowhere, but to go up. A corporation often has disparate data silos which often need to be consolidated together. For instance, accounting database of a company may have to be consolidated with billing database to provide enterprise-wide data to the company.

Data integration is the process of consolidating data from heterogeneous sources in order to produce a unified view of the data. Taking the example from the paragraph above, accounting database can be consolidated with billing database to provide one set of data for a given corporation.

Characteristics of data integration include

  • Data federation: ability to use data across different systems such as ERP applications, flat files and databases without the movement of data.
  • Master data management: create a single “version of truth” from multiple sources i.e. create unified view of enterprise data
  • Extraction, transformation, and loading (ETL): extracts data from source(s), transforms the data according to business rules and loads the data into target system(s).
  • Data cleansing: process of cleaning the data. For instance, accounting database could have “Divy Dave” for an individual while the billing department could have “Divyesh Dave” for the same individual.

Data integration is a career branch by itself in the Big Data universe. There are many third-party tools to assist in data integration such as Informatica, DataStage and SAS Data Integration Studio. There is an abundance of data in corporate data silos, and social media websites such as Twitter and Facebook. It has been anticipated that demand in this career is likely to go up.

What are SAS macro variables?

If you have hundreds of lines of code with a reference to a single value,
like 2012 in the code snippet below, it would be a time-consuming and an
error-prone process to make a change to another value.

Title “Employees for the year 2012”
Title “Salaries for the year 2012”

This is where SAS macros would come in handy.

With a macro, it is possible to make appropriate changes and have SAS echo
these changes in a given program. From this statement alone, we can deduce
that macros greatly enhance code reusability. Not only that, macros can created
in a data-driven fashion, allowing SAS to decide on various execution paths depending
on actual data values.

A macro variable  always stores values in the character format.

I have shown below an example of SAS macro variables.

– Assigning values to macro variables iterations and person –
%LET iterations = 5;
%LET person = Divyesh Dave;

– To access the value in a variable, use ampersand (&) –
DO i = 1 to &iterations;
TITLE “Name: &person”

What are the three types of Dimension tables?

 

Fact tables store numeric business measures such as Sales. Dimension tables
store data that describe such business measures, for example,
Customer, Customer Address and Store locations.

Dimension tables can be broken down into the three types mentioned below.
To illustrate an example, assume a customer moved from
123 Minor Street USA to 321 Major Street USA.

  • Type 1 : No history is maintained. Old data is overwritten with new data. 123 Minor Street is overwritten with 321 Major Street.
  • Type 2 : Full history is maintained. There will be two rows in the table : 123 Minor Street and 321 Major Street.
  • Type 3: Partial history is maintained. There will be one row in the table that contains some pieces of information from both old and new addresses : 123 Minor Street 321 Major Street.