Use Factor Analysis to better understand your data

Surveys are used for a wide range of applications within marketing. It might be to understand consumers' political choices. It might be to understand your brand preferences. They may be used in the design of new products. They might be used to figure out the right attribute to focus on in marketing communications. Think about the last time that you received a survey to fill out. It may have been 10 or 20 questions. Other surveys might run 50 to 100 questions. Surveys can be long, and for each participant that might be 100 individual items that they're responding to. As marketers, what we're trying to do is derive insights from those surveys.
Moreover, I don't care so much about how you responded to an individual item. What I care about is what's driving you, what your beliefs are. The idea is that the individual items on a survey are manifestations of those underlying beliefs.

Using Factor Analysis to Identify Underlying Constructs

So what we're going to do first is look at a tool called factor analysis, which is intended to let us go from a large number of survey items, narrow that down while retaining as much information as possible, and identify the underlying preferences and beliefs that consumers have. Now, once we've done that, we can go about forming market segments using cluster analysis. We can also look to identify individuals that belong to different segments using discriminant analysis. And lastly we're going to look at perceptual mapping as a means of understanding how our brand is seen relative to other brands.
So to start out we’re going to look at how we identify those underlying constructs using factor analysis.

Suppose that we’re interested in understanding consumer preferences for local retailers versus large national chains. And in this case we’ve got five survey items that were included. 

  1. The first asks whether respondents agreed with the statement that local retailers have more variety compared to retail chains. 
  2. The second asks whether you agree with the statement that the associates at retail chains tend to be less knowledgeable than the associates at local businesses. 
  3. The last three questions, questions three through five, get into the courtesy and the level of personal attention that you might expect when you patronize local retailers versus when you patronize national chains. 

Now suppose we collected responses to these five questions from a sample of respondents; in this case we have 15 responses. What we might begin to do is look for patterns among the responses. That is, when people respond above average to question one, how do they tend to respond to question two? When people respond above average to question three, do they tend to respond above or below average to questions four and five? The technique that we might default to using is correlation analysis. 

Correlation Analysis

What correlation analysis lets us look at is whether there is a pairwise linear relationship. That is, when one item goes up, does the other tend to go up? When one goes down, does the other tend to go down? That would be indicative of a positive relationship. A negative relationship would be when one goes up, the other tends to go down, and vice versa. And if we're dealing with a small number of survey items, as is the case here, that might be all right. So what we could look at first is the correlation matrix. We can see ones along the diagonal; that's to be expected, because we are taking the correlation between, let's say, item one and itself, so that's why we get the ones along the diagonal.

Correlation matrix from customer surveys

 Then, looking below the diagonal, we see 0.61, a fairly strong positive relationship between items one and two. If we look for other high or very low correlation values, we might see that question three is correlated with question four.

Now, in this case we might say, let’s identify those items that tend to move together. And it looks like items three, four and five tend to move together and Items one and two tend to move together. Now in this case we happen to get lucky with the correlation matrix. The items that are correlated with each other are directly adjacent to each other. We’re dealing with a small enough number of items that we can just stare at the correlation matrix and see which items tend to move together. 
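As a quick illustration, here is a minimal Python sketch of that first step, assuming the fifteen responses are stored in a pandas DataFrame (the column names and values below are hypothetical, not the actual survey data):

import pandas as pd

# Hypothetical responses from 15 people to the five items (1-7 agreement scale)
responses = pd.DataFrame({
    'q1_variety':       [5, 4, 6, 3, 5, 2, 6, 4, 5, 3, 6, 2, 4, 5, 3],
    'q2_knowledgeable': [4, 4, 5, 3, 5, 2, 6, 3, 5, 2, 6, 3, 4, 4, 3],
    'q3_courtesy':      [2, 5, 3, 6, 2, 5, 3, 6, 4, 5, 2, 6, 3, 4, 5],
    'q4_attention':     [3, 5, 2, 6, 3, 5, 2, 6, 4, 6, 2, 5, 3, 4, 6],
    'q5_personal':      [2, 6, 3, 5, 2, 6, 3, 5, 4, 5, 3, 6, 2, 4, 5],
})

# Pairwise Pearson correlations; the diagonal is 1 by construction
print(responses.corr().round(2))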

Factor analysis

But what about a lengthier survey? What about a survey that’s several pages long if we’re dealing with 20, 50, 100 items? 
Staring at that matrix to identify the patterns that exist is going to be very difficult. That's where factor analysis comes into play: it allows us to draw those boxes around items that tend to move together without us having to do that work by eye. What factor analysis takes as input is all of the survey responses. It doesn't matter whether you have ten survey items, or 50, 100, or 200 items; factor analysis doesn't care about that. What it does is take the responses from all of the individuals on those items and identify which sets of items tend to move together. So think of this as correlation analysis on steroids.

Example 2

Let's say we were looking at young urban professionals, and we want to design branding and target consumers with a message that's ultimately going to resonate with them. One way we might go about trying to understand our consumers is to administer a survey. So let's take a look at the survey that we might administer.

Automotive Example

Based on the automotive survey items, what could we ultimately do with them? Well, we could identify those people who are likely to buy the car, or who express interest in it, and ask which perceptions of themselves, of society, and of their finances are associated with people who are likely to buy this car. And so we might be tempted to say, let's run one regression: let's take all of these survey responses as inputs, and let our outcome variable, or y variable, be purchase intention. Conceptually that makes sense. That's what we're trying to do: relate the individual survey items to the outcome of interest. The problem is that some of the survey items are going to be highly correlated with each other, and we may run into problems of multicollinearity if we run that large regression. The other problem is, suppose we are able to run the regression: what do we ultimately do with it? Suppose that "the government should restrict imports of products from Japan" is a significant driver of purchase intentions. How do we act on that? That's different from saying that somebody who is likely to buy this car is highly patriotic.

Saying that we're going after consumers who are patriotic is something we can design a marketing campaign around. Saying that we're going after people who are against imports is not as clear.

So what can factor analysis do for us? 

What we ultimately want to do is group together those variables, those survey items, that are highly correlated with each other, the ones that tend to move together. Now, that movement may be in the same direction, or it may be in opposite directions. But the assumption we're going to make is that when items tend to move together, there is some underlying construct, some higher-order belief or set of preferences that consumers have, that causes all of those survey items to move together. And if we can identify those underlying beliefs, those constructs, those are what we're going to put into our regression analysis, as well as the subsequent analyses that we might conduct. While we're doing that, we want to make sure that we retain as much information as possible. 

Exploratory Factor Analysis

Factor analysis is a method for investigating whether a number of variables of interest X1, X2, …, Xl are linearly related to a smaller number of unobservable factors F1, F2, …, Fk.
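In equation form (a sketch of the standard linear factor model, using the notation above), each observed item is written as a weighted sum of the common factors plus an item-specific error term:

Xi = λi1·F1 + λi2·F2 + … + λik·Fk + εi,   for i = 1, …, l

Here the weights λij are the factor loadings, and εi is the unique part of item Xi that the common factors do not explain.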

Let's say we've got 50 survey items that we're looking at. We want to make that a more manageable number. We want to cut it down to identify what's really driving those responses; maybe it's ultimately five constructs that are driving those 50 responses. Those five constructs are far fewer than the 50 survey items we began with. Any time we engage in dimension reduction we are going to be throwing away information, so our goal is to retain as much information as possible.
What we're going to ask factor analysis to do for us is two things:

1. Reveal to us how many constructs are appropriate. What is the appropriate number K?

2. Reveal which constructs and which survey items are ultimately related to each other.

One of the ways that factor analysis is commonly used when analyzing survey data, as I mentioned, is to group together similar items, the items that tend to move together. 

So maybe I can go from 150 survey items down to 50 survey items after the first pass. Factor analysis will help us identify which items tend to move together and, as such, which ones are potentially redundant. I can eliminate those redundancies, administer my survey in a second wave, and continue to refine it until I have a number of survey items that I'm comfortable with. The other way that factor analysis gets used is to produce measures that are uncorrelated with each other, since multicollinearity is a big problem when it comes to regression analysis.

Steps for Factor Analysis

  1. Decide how many factors are necessary. 
  2. Conduct the analysis and derive the factor solution. 
  3. Rotate the factor solution. 
  4. Interpret, or name, the factors – this is where a person needs to be involved.
  5. Evaluate the quality of the fit.
  6. Save the factor scores for use in subsequent analyses.


Introduction to Factor Analysis in Python

In this tutorial, you'll learn the basics of factor analysis and how to implement it in Python.

Factor Analysis (FA) is an exploratory data analysis method used to search for influential underlying factors, or latent variables, in a set of observed variables. It helps in data interpretation by reducing the number of variables. It extracts the maximum common variance from all variables and puts it into a common score.

Factor analysis is widely used in market research, advertising, psychology, finance, and operations research. Market researchers use factor analysis to identify price-sensitive customers, identify brand features that influence consumer choice, and understand channel selection criteria for distribution channels.

In this tutorial, you are going to cover the following topics:

  • Factor Analysis
  • Types of Factor Analysis
  • Determine Number of Factors
  • Factor Analysis Vs. Principal Component Analysis
  • Factor Analysis in Python
  • Adequacy Test
  • Interpreting the results
  • Pros and Cons of Factor Analysis
  • Conclusion

Factor Analysis

Factor analysis is a linear statistical model. It is used to explain the variance among observed variables and to condense a set of observed variables into unobserved variables called factors. Observed variables are modeled as linear combinations of factors and error terms. A factor, or latent variable, is associated with multiple observed variables that have common patterns of responses. Each factor explains a particular amount of variance in the observed variables. Factor analysis helps in data interpretation by reducing the number of variables.

Factor analysis is a method for investigating whether a number of variables of interest X1, X2, …, Xl are linearly related to a smaller number of unobservable factors F1, F2, …, Fk.


Assumptions:

  1. There are no outliers in the data.
  2. The sample size should be greater than the number of factors.
  3. There should not be perfect multicollinearity.
  4. Homoscedasticity between the variables is not assumed.

Types of Factor Analysis

  • Exploratory Factor Analysis: It is the most popular factor analysis approach among social and management researchers. Its basic assumption is that any observed variable may be directly associated with any factor.
  • Confirmatory Factor Analysis (CFA): Its basic assumption is that each factor is associated with a particular set of observed variables. CFA confirms what is expected on the basis of pre-established theory.

How does factor analysis work?

The primary objective of factor analysis is to reduce the number of observed variables and find unobservable variables. These unobserved variables help the market researcher draw conclusions from the survey. This conversion of observed variables into unobserved variables can be achieved in two steps:

  • Factor Extraction: In this step, the number of factors and the extraction approach are selected using variance-partitioning methods such as principal component analysis and common factor analysis.
  • Factor Rotation: In this step, rotation transforms the factor solution into a simpler structure that is easier to interpret; the main goal of this step is to improve overall interpretability. Many rotation methods are available, such as the varimax, quartimax, and promax rotation methods.

Terminology

What is a factor?

A factor is a latent variable which describes the association among a number of observed variables. The maximum number of factors is equal to the number of observed variables. Every factor explains a certain amount of variance in the observed variables. The factors with the lowest amounts of variance are dropped. Factors are also known as latent variables, hidden variables, unobserved variables, or hypothetical variables.

What are the factor loadings?

The factor loadings form a matrix that shows the relationship of each observed variable to each underlying factor. Each loading is the correlation coefficient between an observed variable and a factor, and its square gives the share of that variable's variance explained by the factor.

What are Eigenvalues?

Eigenvalues represent the variance explained by each factor out of the total variance. They are also known as characteristic roots.

What are Communalities?

Communalities are the sums of the squared loadings for each variable across the factors. A communality represents the common variance, i.e., the part of a variable's variance explained by the factors. It ranges from 0 to 1, and a value close to 1 means the factors explain more of that variable's variance.
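In symbols, using the loading notation from the factor model earlier, the communality of observed variable i is the sum of its squared loadings across the k factors:

hi² = λi1² + λi2² + … + λik²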

What is Factor Rotation?

Rotation is a tool for better interpretation of factor analysis. Rotation can be orthogonal or oblique. It redistributes the explained variance across the factors to give a clearer pattern of loadings (the communalities themselves are unchanged).

How many factors do we need to include in our analysis? 

There are a couple of different criteria that can be used. 
One criterion is to say we want to retain at least a given percentage of the original variation in the survey. So we might say, okay, I want to retain at least 50% of the variation in the survey.
Another criterion is to include as many factors as are necessary such that each factor we include does its fair share of explaining variation. Mathematically, this maps onto requiring that every factor we retain has an eigenvalue greater than 1.
Equivalently, the share of variation a given factor explains has to be greater than 1/j, where j is the number of survey items. So if I have 20 survey items, we include as many factors as necessary until a factor falls below the 5% threshold, that is, the 1-over-20 threshold.
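A minimal Python sketch of these criteria, assuming `survey` is a respondents-by-items DataFrame of survey answers (the data here is randomly generated purely for illustration):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
survey = pd.DataFrame(rng.integers(1, 8, size=(200, 20)))   # 20 hypothetical Likert-type items

eigenvalues = np.linalg.eigvalsh(survey.corr().values)[::-1]   # largest first
share = eigenvalues / eigenvalues.sum()                        # variance share per factor

k_variance = int(np.searchsorted(np.cumsum(share), 0.50) + 1)  # retain at least 50% of the variation
k_kaiser = int((eigenvalues > 1).sum())                        # eigenvalue-greater-than-1 rule
k_fair_share = int((share > 1 / survey.shape[1]).sum())        # the 1-over-j "fair share" rule
print(k_variance, k_kaiser, k_fair_share)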

Kaiser criterion

The Kaiser criterion is an analytical approach based on the proportion of variance explained by each factor: the eigenvalue is a good criterion for determining the number of factors, and generally a factor with an eigenvalue greater than 1 is selected.

The graphical approach is based on a visual representation of the factors' eigenvalues, called a scree plot. The scree plot helps us determine the number of factors by looking for the point where the curve makes an elbow.


Factor Analysis Vs. PCA

  • PCA components explain the maximum amount of variance, while factor analysis explains the covariance in the data.
  • PCA components are fully orthogonal to each other, whereas factor analysis does not require the factors to be orthogonal.
  • A PCA component is a linear combination of the observed variables, while in FA the observed variables are linear combinations of the unobserved variables, or factors.
  • PCA components are not directly interpretable, whereas in FA the underlying factors are labelable and interpretable.
  • PCA is a kind of dimensionality reduction method, whereas factor analysis is a latent variable method.
  • PCA is a type of factor analysis. PCA is observational whereas FA is a modeling technique.

Factor Analysis in Python using the factor_analyzer package

import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt

# The bfi dataset (25 personality items plus demographics) can be downloaded from
# https://vincentarelbundock.github.io/Rdatasets/datasets.html

df = pd.read_csv('bfi.csv')

Dropping unnecessary columns

df.drop(['gender', 'education', 'age'], axis=1, inplace=True)

Dropping missing values rows

df.dropna(inplace=True)
df.info()

Adequacy Test

Before you perform factor analysis, you need to evaluate the "factorability" of the dataset. Factorability means: can we find factors in the dataset? There are two methods to check factorability, or sampling adequacy:

  • Bartlett’s Test
  • Kaiser-Meyer-Olkin Test

Bartlett's test of sphericity checks whether or not the observed variables are intercorrelated at all, by testing the observed correlation matrix against the identity matrix. If the test is statistically insignificant, you should not employ factor analysis.

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(df)
chi_square_value, p_value

# In this Bartlett's test, the p-value is 0. The test was statistically significant, indicating that the observed correlation matrix is not an identity matrix.

Kaiser-Meyer-Olkin (KMO) Test measures the suitability of data for factor analysis.

from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model=calculate_kmo(df)
kmo_model

If the Kaiser-Meyer-Olkin test gives a value over 0.6, we can proceed with the factor analysis.

Create factor analysis object and perform factor analysis

# Recent versions of factor_analyzer use a scikit-learn style API (n_factors + fit)
fa = FactorAnalyzer(n_factors=25, rotation=None)
fa.fit(df)

Check Eigenvalues

ev, v = fa.get_eigenvalues()

From here, we pick the number of factors whose eigenvalues are greater than 1; in this case that is six, which is why six factors are used below.
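As a sketch of the graphical check, the eigenvalues obtained above can be plotted as a scree plot (this reuses the `ev` array and the matplotlib import from the earlier code):

# Scree plot: look for the elbow and for eigenvalues above 1
plt.scatter(range(1, df.shape[1] + 1), ev)
plt.plot(range(1, df.shape[1] + 1), ev)
plt.axhline(y=1, color='r', linestyle='--')   # Kaiser criterion reference line
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()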

Now we create the factor analysis object again and perform factor analysis with six factors. Note that varimax rotation is used, under the assumption that the factors are uncorrelated.

fa = FactorAnalyzer(n_factors=6, rotation="varimax")
fa.fit(df)
fa.loadings_

Naming the Factors

After establishing the adequacy of the factors, it’s time for us to name the factors. This is the theoretical side of the analysis where we form the factors depending on the variable loadings. In this case, here is how the factors can be created:

Looking at the description of data: https://vincentarelbundock.github.io/Rdatasets/doc/psych/bfi.dictionary.html, we can come up with the names:

  • Factor 1 has high factor loadings for E1,E2,E3,E4, and E5 (Extraversion)
  • Factor 2 has high factor loadings for N1,N2,N3,N4, and N5 (Neuroticism)
  • Factor 3 has high factor loadings for C1,C2,C3,C4, and C5 (Conscientiousness)
  • Factor 4 has high factor loadings for O1,O2,O3,O4, and O5 (Openness)
  • Factor 5 has high factor loadings for A1,A2,A3,A4, and A5 (Agreeableness)
  • Factor 6 has no high loadings for any variable and is not easily interpretable, so it is better to keep only five factors.
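As a follow-up sketch, we can refit with five factors and check how much variance they capture; in factor_analyzer, get_factor_variance() returns the variance, proportional variance, and cumulative variance for each factor:

fa = FactorAnalyzer(n_factors=5, rotation="varimax")
fa.fit(df)
# Rows: SS loadings (variance), proportion of variance, cumulative variance
fa.get_factor_variance()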

What is next?

Now that we have seen how to perform factor analysis, we can use the same technique to analyze other data.

A possible application is reducing dimensionality when building a predictive model on high-dimensional data. Reducing dimensionality can significantly improve model performance, as I write in this article: Improve performance on Machine learning models

It can also improve data exploration, as mentioned here:

Machine learning in practice – Quick data analysis
Starting with Data Science


Why do logs keep on growing?

A Shorter Answer:
You probably either have a long-running transaction (index maintenance? big batch delete or update?) or you are in the "default" (more below on what is meant by default) recovery model of Full and have not taken a log backup (or aren't taking them frequently enough).
There could be other reasons, but these are the most common. This answer begins to dive into the most common two reasons and gives you some background information on the why and how behind the reasons as well as exploring some other reasons.
A Longer Answer:

What Scenarios can Cause the Log to Keep Growing?

There are many reasons, but usually these reasons are of the following two patterns:


1.) Not Understanding Recovery Models

(Being in the Full recovery model and not taking log backups – this is the most common reason; the vast majority of those experiencing this issue are in this situation.)
This answer is not a deep dive into SQL Server recovery models, but the topic of recovery models is critical to this problem. In SQL Server, there are three recovery models: Full, Bulk-Logged, and Simple. We'll ignore Bulk-Logged for now; it is sort of a hybrid model, and most people who are in this model are there for a reason and understand recovery models. The two we care about, and whose confusion causes the majority of the cases of people having this issue, are Simple and Full.
Before we talk about Recovery Models – Let’s talk about recovery in general.
1.    The transaction log file is there for crash/restart recovery: for rolling forward work that was done before a crash or restart (redo) and rolling back work that was started but not finished (undo). It is the job of the transaction log to see that a transaction was started but never finished (rolled back, or a crash/restart happened before the transaction committed). In that situation it is the log's job to say "hey, this never really finished, let's roll it back" during recovery. It is also the log's job to see that you did finish something and that your client application was told it was finished (even if it hadn't yet hardened to your data file) and to say "hey, this really happened, let's roll it forward, let's make it like the application thinks it was" after a restart. There is more, but that is the main purpose.
2.    The other purpose for a transaction log file is to be able to give us the ability to recover to a point in time due to an “oops” in a database or to guarantee a recovery point in the event of a hardware failure involving the data and/or log files of a database. If this transaction log contains the records of transactions that have been started and finished for recovery, SQL Server can and does then use this information to get a database to where it was before an issue happened. But that isn’t always an available option for us. For that to work we have to have our database in the right recovery model, and we have to take log backups.
Onto the recovery models:
Simple Recovery Model – With the above introduction, it is easiest to talk about the Simple recovery model first. In this model, you are telling SQL Server: I am fine with you using the transaction log file for crash and restart recovery (you really have no choice there; look up the ACID properties and that should make sense quickly), but once you no longer need it for that crash/restart recovery purpose, go ahead and reuse the log file.
SQL Server listens to this request in Simple Recovery and it only keeps the information it needs to do crash/restart recovery. Once SQL Server is sure it can recover because data is hardened to the data file (more or less), the data that has been hardened is no longer necessary in the log and is marked for truncation – which means it gets re-used.
Full Recovery Model – With Full recovery, you are telling SQL Server that you want to be able to recover to a specific point in time, as long as your log file is available, or to a specific point in time that is covered by a log backup. In this case, when SQL Server reaches the point where it would be safe to truncate the log file under the Simple recovery model, it will not do that. Instead it lets the log file continue to grow, and will allow it to keep growing, until you take a log backup (or run out of space on your log file drive) under normal circumstances.
There are rules and exceptions here. We'll talk about long-running transactions below, but one to keep in mind for Full mode: if you just switch into the Full recovery model but never take an initial full backup, SQL Server will not honor your request to be in the Full recovery model. Your transaction log will continue to operate as it does in Simple until you switch to the Full recovery model AND take your first full backup.
So, that’s the most common reason for uncontrolled log growth: Being in Full Recovery mode without having any log backups. This happens all the time to people.
(Why does it happen all the time? Because each new database gets its initial recovery model setting by looking at the model database. Model’s initial recovery model setting is always Full Recovery Model – until and unless someone changes that. So you could say the “default Recovery Model” is Full. Many people are not aware of this and have their databases running in Full Recovery Model with no log backups, and therefore a transaction log file much larger than necessary. This is why it is important to change defaults when they don’t work for your organization and its needs)
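A minimal T-SQL sketch for checking where you stand and taking a log backup (the database name and backup path below are placeholders):

-- Check the recovery model of every database on the instance
SELECT name, recovery_model_desc
FROM sys.databases;

-- In Full (or Bulk-Logged) recovery, take log backups regularly so the log can be reused
BACKUP LOG [YourDatabase] TO DISK = N'X:\SQLBackups\YourDatabase.trn';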
You can also get yourself in trouble here by not taking log backups frequently enough. Taking one log backup a day may sound fine, and it makes a restore require fewer restore commands, but keeping in mind the discussion above, the log file will continue to grow and grow until you take log backups. You need to consider your log backup frequency with two things in mind:
1.    Recovery Needs – This should hopefully come first. In the event that the drive housing your transaction log goes bad, or you get serious corruption that affects your log backups, how much data can be lost? If that number is no more than 10-15 minutes, then you need to be taking a log backup every 10-15 minutes, end of discussion.
2.    Log Growth – If your organization is fine with losing more data because of the ability to easily recreate it, you may be fine with a log backup taken much less frequently than every 15 minutes. Maybe your organization is fine with every 4 hours. But you have to look at how many transactions you generate in 4 hours. Will allowing the log to keep growing for those four hours make too large a log file? Will that mean your log backups take too long?


2.) Long Running Transactions

("My recovery model is fine! The log is still growing!")
This can also be a cause of uncontrolled and unrestrained log growth, no matter the recovery model, but it often comes up as "But I'm in Simple Recovery Model – why is my log still growing?!"
The reason here is simple: if SQL Server is using the transaction log for recovery purposes as described above, then it has to be able to see back to the start of the transaction. If you have a transaction that takes a long time or does a lot of changes, the log cannot truncate on checkpoint for any of the changes that are still in open transactions or that have started since that transaction started.
This means that a big delete, deleting millions of rows in one delete statement is one transaction and the log cannot do any truncating until that whole delete is done. In Full Recovery Model, this delete is logged and that could be a lot of log records. Same thing with Index optimization work during maintenance windows. It also means that poor transaction management and not watching for and closing open transactions can really hurt you and your log file.
You can save yourself here by: properly sizing your log file to account for the worst-case scenario, such as your maintenance or known large operations (and when you grow your log file you should look to this guidance, and the two links she sends you to, by Kimberly Tripp; right-sizing is super critical here); watching your usage of transactions, so you don't start a transaction in your application server, have a long conversation with SQL Server, and risk leaving one open too long; and watching the implied transactions in your DML statements. UPDATE TableName SET Col1 = 'New Value' is a transaction: I didn't put a BEGIN TRAN there and I don't have to, it is one transaction that just automatically commits when done. So if you are doing operations on large numbers of rows, consider batching those operations up into more manageable chunks and giving the log time to recover. Or consider the right size to deal with that, or perhaps look into changing recovery models during a bulk load window.
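For example, a hypothetical sketch of breaking one huge delete into smaller batches so the log gets a chance to truncate (or be backed up) between chunks; the table name and filter below are made up:

WHILE 1 = 1
BEGIN
    -- Delete a manageable chunk per transaction instead of millions of rows at once
    DELETE TOP (5000) FROM dbo.BigArchiveTable
    WHERE CreatedDate < '20100101';

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete
END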


What about Log Shipping?

“I’m using log shipping, so my log backups are automated… Why am I still seeing transaction log growth?”
Log shipping is just what it sounds like – you are shipping your transaction log backups to another server for DR purposes. The process is fairly simple – after the initialization – a job to backup the log on one server, a job to copy that log backup and a job to restore it without recovery (either NORECOVERY or STANDBY) on the destination server. There are also some jobs to monitor and alert if things don’t go as you have them planned.
In some cases, you may only want to do the log shipping restore once a day or every 3rd day or once a week. That is fine. But if you make this change on all of the jobs (including the log backup and copy jobs) that means you are waiting all that time to take a log backup. That means you will have a lot of log growth – because you are in full recovery mode without log backups, and it also likely means a large log file to copy across. You should only modify the restore job’s schedule and let the log backups and copies happen on a more frequent basis, otherwise you will suffer from the first issue described in this answer.

Getting Info on Your Cause

There are reasons other than these two, but these are the most common. Regardless of the cause, there is a way you can analyze the reason for this unexplained log growth/lack of truncation and see what it is:
By querying the sys.databases catalog view, you can see information describing the reason your log file may be waiting on truncate/reuse (a quick query sketch follows below). There is a column called log_reuse_wait with a lookup ID of the reason code, and a log_reuse_wait_desc column with a description of the wait reason. From the referenced Books Online article, here are the majority of the reasons (the ones you are likely to see, and the ones we can explain reasons for; the missing ones are either out of use or for internal use), with a few notes about each wait:
  • 0 = Nothing – What it sounds like: the log shouldn't be waiting.
  • 1 = Checkpoint – Waiting for a checkpoint to occur. This should happen and you should be fine, but there are some cases to look for here for later answers or edits.
  • 2 = Log backup – You are waiting for a log backup to occur. Either you have them scheduled and it will happen soon, or you have the first problem described here and you now know how to fix it.
  • 3 = Active backup or restore – A backup or restore operation is running on the database.
  • 4 = Active transaction – There is an active transaction that needs to complete (either way, ROLLBACK or COMMIT) before the log can be backed up. This is the second reason described in this answer.
  • 5 = Database mirroring – Either a mirror is getting behind or is under some latency in a high-performance mirroring situation, or mirroring is paused for some reason.
  • 6 = Replication – There can be issues with replication that would cause this, like a log reader agent not running, or a database thinking it is marked for replication that no longer is, and various other reasons. You can also see this reason when it is perfectly normal, because you are looking at just the right time, just as transactions are being consumed by the log reader.
  • 7 = Database snapshot creation – You are creating a database snapshot; you'll see this if you look at just the right moment as a snapshot is being created.
  • 8 = Log Scan – I have yet to encounter an issue with this running along forever. If you look long enough and frequently enough you can see this happen, but it shouldn't be a cause of excessive transaction log growth, that I've seen.
  • 9 = An AlwaysOn Availability Groups secondary replica is applying transaction log records of this database to a corresponding secondary database. About the clearest description yet.
Originally written by Mike Walsh
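As referenced above, here is a minimal query sketch against the catalog view (filter on your own database name as needed):

-- See why each database's log is waiting to be truncated/reused
SELECT name, log_reuse_wait, log_reuse_wait_desc
FROM sys.databases;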

Indexing Myths

I just found this article about indexing myths by Kurt Survance and I thought it would be really useful to share.

Myth 1: The Primary Key index must be clustered and unique.

Truth: The index supporting the Primary Key must be unique, but not necessarily clustered.
Explanation: There are often much better places for the clustered indexes than on the PK. A unique integer column is not always the best place for the clustered index from a performance point of view.
Clustered indexes are very good at returning a large range of records efficiently. You don’t often (if ever) query for a range of arbitrary ID numbers. A non clustered index can return an identity value as efficiently as can a clustered index. Why waste your only clustered index on a column that can be served by a non clustered index?
While the optimal place for the clustered index is determined by how the table is accessed, we often find that a sequentially increasing date field is a good place for it. Range searches on dates are common (…where invoicedate between ‘November 1, 2010’ and ‘November 30, 2010’). And sequentially increasing date fields do not incur page splitting or rapid fragmentation.

Myth 2: Stay away from clustered indexes because they cause page splitting and fragmentation.

Truth: There is no way you can get proper performance from most database applications without using clustered indexes.
Explanation: It is true that in some cases clustered indexes will cause page splitting and rapid fragmentation but that is not a reason to avoid them altogether. Rather it is a reason to learn how to use them effectively without causing page-splitting and fragmentation. Here are a couple of tips:
Never place a clustered index on a column with a GUID data type. A GUID is a 16-byte randomly generated value. Its randomness will promote a high level of page splitting and fragmentation of the table. And since the clustered index key of a row becomes part of every nonclustered index created on the table, this 16-byte value added to every index row will bloat the size of your indexes and slow down performance.
Try to find a naturally sequentially-increasing column ( a date, for example) that is used to filter ranges of data. This can make an important performance difference in many queries.
If you cannot find a column or set of columns where the clustered index would provide a performance boost, place the Primary Key on an IDENTITY column using a clustered index. This will physically organize your table on disk without incurring page splits or rapid fragmentation.

Myth 3: Multiple Column indexes can be used to search on any of the fields contained in the index.

Truth: In a multi-column index, only the first column is truly indexed.
Explanation: If you take a minute and think about how a multi-column index must work, you will realize that the values in the additional columns are grouped by the order of the first column and are not sequential. To find values in the second column without having a filter on the first column, you would have to scan the entire table.
For example, let’s assume this index exists on the ORDERDETAIL table:
  • CREATE INDEX IX_TEST ON ORDERDETAIL(CUSTOMERID, PRODUCTID)
The following query can use this index to perform a very efficient seek:
  • SELECT * FROM ORDERDETAIL
  • WHERE CUSTOMERID = 1234
  • AND PRODUCTID = 34
This query might use a few more page reads but it will also be executed efficiently with an index seek:
  • SELECT * FROM ORDERDETAIL
  • WHERE CUSTOMERID = 1234
However, the following query will cause a table scan.
  • SELECT * FROM ORDERDETAIL
  • WHERE PRODUCTID = 34
Even though PRODUCTID is included in the index, the index will not be used for searches on PRODUCTID alone.
This does not mean that there is no benefit in multi-column indexes. They are useful in many ways, particularly in multi-column joins and in queries that filter on multiple columns from the same table (as in our first example).

Multiply Selectivity Using Multi-Column Indexes

Multiple column indexes allow you to combine the selectivity of two or more columns.
Often a query will have several conditions in a WHERE clause, but none of them alone are selective enough for the query optimizer to use them. The result will be a scan of the table.
However, by using two or more non-selective columns in the index you might increase the selectivity of the index to the degree that it can be used effectively.
For example, your business process dictates that an order should not be closed before it is shipped. You have a process to search for orders that may have fallen through the cracks and not been shipped.
  • SELECT ORDERID
  • FROM ORDERS
  • WHERE status = 'closed'
  • AND shipped = 'no'
It is likely that you have very many closed orders in your database, and you might have very many unshipped items. Individual indexes on either of those columns may not be selective enough to be used by the optimizer.
However, it is not likely that you have a lot of closed orders that have not been shipped, since they would be exceptions to the business rule. An index combining both of these columns might be very selective and could turn a table scan into a fast and efficient index seek.
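For instance, a sketch of such a combined index on the hypothetical ORDERS table from the query above:

  • CREATE NONCLUSTERED INDEX IX_ORDERS_STATUS_SHIPPED ON ORDERS(STATUS, SHIPPED)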

Covering Indexes

Covering indexes are a special type of multi-column index and they can sometimes work magic on a poorly performing query. Covering indexes contain every column referenced from a specific table by a specific query in any part of the query.
Since all necessary data is contained in the index, the query never has to access the table itself. This is a bit like having another clustered index on the table and the performance improvement can be amazing.
It is important to use restraint in creating covering indexes or you will over-index your table. Here are some criteria to consider:
  • The query or queries to be optimized should be a serious enough performance problem to justify the existence of what might be a big index.
  • The number and size of the columns required should be relatively small.
  • The table should be relatively large.

Myth 4: Create Indexes in descending order if that is how they will be accessed.
Truth: The query engine can read an index backwards as easily as it can read it forward. There is seldom any reason to specify the order for a single column index.

Explanation: In multi-column indexes there are times when it might make sense to sort columns in different orders. For example if the results of queries are ordered or grouped on a set of ascending and descending columns. A clustered index on those columns using the same sort order might improve performance. However, that applies only to clustered indexes.
In my experience, it is extremely rare that you would use a clustered index to support the ORDER BY clause of a query which normally has a minimal cost. There are usually much better places for clustered indexes.

Creating clustered index on view with COUNT in it

If you are in a situation where you have a view, but your tables have grown so much (let's say above 5 million records), you will soon realize that the queries used in your view consume too much I/O. The solution is to create an index on that view.

If you are using COUNT in your view you will run into a problem, something like:

Cannot create index on view “dbo.View” because it uses the aggregate COUNT. Use COUNT_BIG instead.

The solution of this is using COUNT_BIG instead:

ALTER VIEW [dbo].[View]
WITH SCHEMABINDING
AS
SELECT COUNT_BIG(*) AS [Count], NameOfProduct
FROM [dbo].[Table]
GROUP BY NameOfProduct
GO


CREATE UNIQUE CLUSTERED INDEX [IX_View] ON [dbo].[View] (NameOfProduct ASC)
GO

Filtered Index with variable

Filtered Index is a great way to improve your SQL performance and save a vast amount of server resources while processing your big data query.
Whenever you have a filtered index, you just might find that it's not used when its columns are compared to a variable. The thing is, SQL Server will always build an execution plan that would work with any value of that variable (including values that don't match the index's filter).
Thus SQL Server will not use the filtered index.
There's an easy way to fix that: adding the filtering clauses to your statement.
For instance, if you have an index on column ID filtered on a condition such as ID > 0, then instead of using a clause like this: WHERE ID = @pID, use a clause like this: WHERE ID = @pID AND ID > 0. SQL Server will then know it can use the filter.
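A hypothetical sketch (the table, index name, and the ID > 0 filter are just for illustration):

-- Filtered index that only covers rows matching the predicate
CREATE NONCLUSTERED INDEX IX_Orders_PositiveID
ON dbo.Orders (ID)
WHERE ID > 0;

DECLARE @pID int = 42;

-- Repeating the filter predicate alongside the variable lets SQL Server use the filtered index
SELECT *
FROM dbo.Orders
WHERE ID = @pID
  AND ID > 0;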

How to force query to execute on a single processor core without affecting other queries?

For example, consider a database server with eight quad-core processors.
The database executes complex queries to generate reports.
We need to force a query to execute on a single processor core without affecting other queries.

In this case the best option is to use the query hint OPTION (MAXDOP 1).

The OPTION (MAXDOP) Transact-SQL query hint overrides the max degree of parallelism sp_configure option, but only for the query that specifies the hint.

 If MAXDOP exceeds the value that was configured by using the Resource Governor, the Database Engine uses the Resource Governor MAXDOP value

Guidelines when you configure the MAXDOP value:

  • For servers that use more than eight processors, use the following configuration: MAXDOP=8.
  • For servers that have eight or fewer processors, use the following configuration where N equals the number of processors: MAXDOP=0 to N.
  • For servers that have NUMA configured, MAXDOP should not exceed the number of CPUs that are assigned to each NUMA node.
  • For servers that have hyper-threading enabled, the MAXDOP value should not exceed the number of physical processors.

These are general guidelines only. There may be some exceptions to these guidelines. For example, do not use the guidelines under the following conditions:

  • If the server is not an online transaction processing (OLTP)-based server.
  • If the server is running large batch processing when there is no or minimal user load.
  • If the server is running resource-intensive operations.

Example:


SELECT ProductID, OrderQty, SUM(LineTotal) AS Total
FROM Sales.SalesOrderDetail
WHERE UnitPrice < $5.00
GROUP BY ProductID, OrderQty
ORDER BY ProductID, OrderQty
OPTION (MAXDOP 1);
GO

Recompilation of Stored Procedures

Stored procedures can provide performance benefits due to the cached query execution plan,
allowing SQL Server to reuse an existing plan instead of generating a new one.

However, Stored Procedures are not the solution for best database performance.

We still need to account for the performance of the individual statements within the body of the stored procedure, and to make sure that the tables are indexed properly and that the database is designed efficiently.

A recompilation occurs when a stored procedure’s plan is re-created.
Re-compilations occur automatically during stored procedure execution when underlying table
or other object changes occur to objects that are referenced within a stored procedure. Automatic recompilations can also occur with changes to indexes used by the plan or after a large number of updates to table keys referenced by the stored procedure.
This is done to make sure the SQL Server execution plan is using the most current information and not using out-of-date assumptions about the schema and data.

Although recompilations are costly and should be avoided most of the time, there may sometimes
be reasons why you would want to force a recompilation.

For example, suppose we have a stored procedure that produces widely different results depending on the parameters passed in: executed with one parameter the procedure returns one row, while executed with another parameter it returns millions of rows.
In that case SQL Server may end up caching a plan that is optimized only for the single row and not for the millions of rows, causing poor performance.

Example:

Recompilation is achieved by placing the WITH RECOMPILE option in the stored procedure definition, after the parameter list, as sketched below.
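A hypothetical sketch (the procedure and table names are made up):

-- Forces a fresh execution plan to be compiled on every execution
CREATE PROCEDURE dbo.GetOrdersByCustomer
    @CustomerID int
WITH RECOMPILE
AS
BEGIN
    SELECT OrderID, OrderDate, TotalDue
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID;
END
GO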

 Statement-level recompilation

SQL Server also uses statement-level recompiles within the stored procedure, instead of
recompiling the entire stored procedure. Since recompiles cause extra overhead in generating new
plans, statement-level recompiles help decrease this overhead by correcting only what needs to be
corrected.

Example:
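A hypothetical sketch of recompiling only the parameter-sensitive statement instead of the whole procedure:

CREATE PROCEDURE dbo.GetOrdersByStatus
    @Status varchar(20)
AS
BEGIN
    SELECT OrderID, OrderDate
    FROM dbo.Orders
    WHERE Status = @Status
    OPTION (RECOMPILE);   -- only this statement gets a fresh plan each time
END
GO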

We should use recompilation when we want the other benefits of stored procedures, like security and modularization, but we don't want SQL Server to cache an inefficient execution plan and thereby cause bad performance.

Covering Indexes with Included Columns

When creating a non-clustered index there is an option to include columns to extend its functionality:

By including other columns we can create non-clustered indexes that cover more queries.

Included columns can be:

  • data types not allowed as an index key
  • not considered by the Database Engine when calculating the number of index key columns or index key size
Included columns are useful when the columns do not appear in the WHERE/JOIN/GROUP BY/ORDER BY clauses, but only in the column list of the SELECT clause.

How to create it in Management Studio:

1. In Object Explorer, choose the table where you want to create the covering index
2. Right-click the Indexes folder -> New Index


3. The New Index window opens.
4. Go to the Included Columns tab and click the Add… button to choose the included columns


Real life Example

Let's use the Person.Contact table from the AdventureWorks database.

I will run a simple SELECT query:

SELECT FirstName, LastName, EmailAddress, Phone, Title
FROM Person.Contact
WHERE FirstName = 'Gustavo' AND LastName = 'Achong'

At first the table has only a clustered index, and the execution plan clearly shows that a Clustered Index Scan was performed.
Next I will add a non-clustered index covering only FirstName and LastName.


After this, the execution plan changes: the non-clustered index is used, but a Key Lookup operation is also performed.


Finally, after adding the EmailAddress, Phone and Title columns as included columns, I got an improvement: the Key Lookup operation is no longer present, and the Index Scan has changed into an Index Seek. 
With this, the query is potentially much faster.
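For reference, a T-SQL sketch that is roughly equivalent to the index built through Management Studio above (the index name is chosen for illustration):

CREATE NONCLUSTERED INDEX IX_Contact_FirstName_LastName
ON Person.Contact (FirstName, LastName)
INCLUDE (EmailAddress, Phone, Title);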



Downsides of Included Columns

  • The downside of INCLUDE columns is the increased disk space needed by non-clustered indexes.
  • Larger indexes mean fewer rows fit on a page, which can lead to increased disk I/O.
  • Index maintenance is increased for data modifications, potentially hurting performance if the non-key columns are large and the database experiences a high level of data modifications.