Use Factor Analysis to better understand your data

Surveys are used for a wide range of applications within marketing. A survey might be used to understand consumers’ political choices. It might be used to understand their brand choices. It may be used in the design of new products. It might be used to figure out which attributes to focus on in marketing communications. Well, think about the last time that you received a survey to fill out. It may have been 10 or 20 questions. Other surveys might be 50 to 100 questions. Surveys can be long, and for each participant that might be 100 individual items that they’re responding to. Well, as marketers, what we’re trying to do is derive insights from those surveys.
Moreover, I do not care about how you responded to an individual item. What I care about is what’s driving you, what your beliefs are. The idea is that the individual items on a survey are manifestations of those underlying beliefs.

Using Factor Analysis to Identify Underlying Constructs

So what we’re going to be doing is first looking at a tool called factor analysis, which is intended to let us go from a large number of survey items, narrow that down, and retain as much information as possible, in order to identify the underlying preferences and beliefs that consumers have. Now, once we’ve done that, we can go about forming market segments using cluster analysis. We can also look to identify individuals that belong to different segments using discriminant analysis. And lastly, we’re going to look at perceptual mapping as a means of understanding how our brand is seen relative to other brands.
So to start out we’re going to look at how we identify those underlying constructs using factor analysis.

Suppose that we’re interested in understanding consumer preferences for local retailers versus large national chains. And in this case we’ve got five survey items that were included. 

  1. The first asks whether or not respondents agree with the statement that local retailers have more variety compared to retail chains.
  2. The second asks whether or not they agree with the statement that the associates at retail chains tend to be less knowledgeable than the associates at local businesses.
  3. The last three questions, questions three through five, get into the courtesy and the level of personal attention that you might expect when you patronize local retailers versus national chains.

Now if we collected responses to these five questions from a sample of respondents (in this case we have 15 responses), what we might begin to do is look for patterns among the responses. That is, when people respond above average to question one, how do they tend to respond to question two? When people respond above average to question three, do they tend to respond above or below average to questions four or five? And so the technique that we might default to using is correlation analysis.

Correlation Analysis

What correlation analysis lets us look at is whether there is a pairwise linear relationship. That is, for two items, when one goes up does the other tend to go up? When one goes down does the other tend to go down? That would be indicative of a positive relationship. A negative relationship would be when one goes up the other tends to go down, and vice versa. And if we’re dealing with a small number of survey items, as is the case here, that might be all right. So what we could look at first is the correlation matrix. Along the diagonal we have ones. That’s to be expected, because each of those entries is the correlation between an item, let’s say item one, and itself.

Correlation matrix from customer surveys

Then we look below the diagonal: 0.61, a fairly strong positive relationship between items one and two. If we look for other high or very low correlation values, we might see that question three is correlated with question four.

Now, in this case we might say, let’s identify those items that tend to move together. It looks like items three, four and five tend to move together, and items one and two tend to move together. In this case we happen to get lucky with the correlation matrix: the items that are correlated with each other are directly adjacent to each other. We’re dealing with a small enough number of items that we can just stare at the correlation matrix and see which items tend to move together.
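To make the idea concrete, here is a minimal sketch of computing such a correlation matrix with NumPy. The original 15 survey responses are not reproduced in this article, so the data below are synthetic: the two latent drivers (`selection` and `service`), the noise level, and the random seed are all assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_respondents = 15

# Two hypothetical latent drivers: one behind items 1-2, one behind items 3-5.
selection = rng.normal(size=n_respondents)
service = rng.normal(size=n_respondents)


def noisy(driver):
    """Return a survey item that follows the driver plus respondent-level noise."""
    return driver + 0.5 * rng.normal(size=n_respondents)


# Columns are survey items, rows are respondents.
items = np.column_stack([
    noisy(selection), noisy(selection),
    noisy(service), noisy(service), noisy(service),
])

# Pairwise correlations between the five items (5 x 5, ones on the diagonal).
corr = np.corrcoef(items, rowvar=False)
print(np.round(corr, 2))
```

With data built this way, the larger off-diagonal values will tend to appear inside the block for items one and two and inside the block for items three to five, which is the pattern we were picking out by eye.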

Factor analysis

But what about a lengthier survey? What about a survey that’s several pages long, where we’re dealing with 20, 50, or 100 items?
Staring at that matrix, it is going to be very difficult to identify the patterns that exist. That’s where factor analysis is going to come into play for us. It’s going to allow us to draw these boxes around items that tend to move together without us having to do that work by hand. What factor analysis takes as an input is all of the survey responses. It doesn’t matter if you have 10 survey items or 50, 100, or 200; factor analysis doesn’t care about that. What it’s going to do is take the responses from all of the individuals on those items and identify which sets of items tend to move together. So think of this as correlation analysis on steroids.

Example 2

Let’s say we were looking at young urban professionals. How do you go about designing branding and targeting consumers with a message that’s ultimately going to resonate with them? One way we might go about trying to understand our consumers is to administer a survey. So let’s take a look at the survey that we might administer.

Automotive Example

Based on the automotive survey items, what could we ultimately do? Well, we could identify those people who are likely to buy a car or who express interest in this car. And what perceptions of themselves, of society, and of their finances are associated with people who are likely to buy this car? We might be tempted to say, let’s run one regression. Let’s take all of these survey responses as inputs, and our outcome variable, our y variable, can be the purchase intention. Conceptually that makes sense. That’s what we’re trying to do: relate the individual survey items to the outcome of interest. The problem is that some of the survey items are going to be highly correlated with each other, and we may run into problems of multicollinearity if we were to run that large regression. The other problem is this: suppose that we are able to run the regression. What do we ultimately do with it? Suppose that agreement with the statement that the government should restrict imports of products from Japan is a significant driver of purchase intentions. How do we act on that? That’s different from saying that somebody who is likely to buy this car has a lot of patriotism.

Saying that we’re going after consumers who are patriotic, that’s something that we can design a marketing campaign around. Saying that we’re going after people who are against imports is not as clear.

So what can factor analysis do for us? 

What we ultimately want to do is group together those survey items that are highly correlated with each other, the ones that tend to move together. Now that movement may be in the same direction, or it may be in opposite directions. But the assumption that we’re going to make is that when items tend to move together, there’s some underlying construct: some higher-order belief that consumers have, or some set of preferences, that causes all of those survey items to move together. And if we can identify those underlying beliefs, those constructs, those are what we’re going to put into our regression analysis as well as any subsequent analyses that we might conduct. While we’re doing that, we want to make sure that we retain as much information as possible.

Exploratory Factor Analysis

Factor analysis is a method for investigating whether a number of variables of interest X1, X2, ..., Xl are linearly related to a smaller number of unobservable factors F1, F2, ..., Fk.
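Written out, the linear relationship in that definition usually takes the following form. The loading symbols (lambda) and the error terms (e) are notation introduced here for illustration; they do not appear in the text above. The second line additionally assumes that the factors are uncorrelated with each other and with the error terms, and that each factor has unit variance.

```latex
\begin{aligned}
X_i &= \lambda_{i1} F_1 + \lambda_{i2} F_2 + \dots + \lambda_{ik} F_k + e_i,
      \qquad i = 1, \dots, l, \\
\operatorname{Var}(X_i) &= \underbrace{\lambda_{i1}^2 + \lambda_{i2}^2 + \dots + \lambda_{ik}^2}_{\text{explained by the factors}}
      + \operatorname{Var}(e_i).
\end{aligned}
```

Each survey item is a weighted sum of the k factors plus a part that is unique to the item, so k much smaller than l is what makes this a dimension reduction.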

Let’s say we’ve got our 50 survey items that we’re looking at. We want to make that a more manageable number. We want to cut that down to identify what’s really driving those responses, and maybe it’s ultimately five constructs that are driving those 50 responses. Well, five constructs is a lot smaller than the 50 survey items that we began with. Any time that we engage in dimension reduction we are going to be throwing away information. Our goal is to retain as much information as possible.
What we’re going to ask factor analysis to do for us is two things:

1. Reveal to us how many constructs are appropriate. What is the appropriate number k?

2. Reveal which constructs and which survey items are ultimately related to each other.

One of the ways that factor analysis is commonly used when it comes to analyzing survey data, as I mentioned, is to group together similar items, the items that tend to move together.

So maybe I can go from 150 survey items down to 50 survey items after the first pass. Factor analysis will help us identify which items tend to move together and, as such, which ones are potentially redundant. I can eliminate those redundancies, administer my survey in a second wave, and continue to refine it until I have a number of survey items that I’m comfortable with. The other way that factor analysis gets used is to produce measures that are uncorrelated with each other. Multicollinearity is a big problem when it comes to regression analysis, and using uncorrelated factors as the regression inputs sidesteps it.

Steps for Factor Analysis

  1. Decide how many factors are necessary.
  2. Conduct the analysis and derive the solution.
  3. Rotate the factor solution.
  4. Interpret and name the factors. This is where a person needs to be involved.
  5. Evaluate the quality of the fit.
  6. Save the factor scores for use in subsequent analyses.
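The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it uses principal-component extraction on the correlation matrix (a simple stand-in, not necessarily what any particular package does), it skips rotation and naming (steps 3 and 4), and the function name `principal_factor_sketch`, the eigenvalue cutoff, and the synthetic data are all hypothetical.

```python
import numpy as np


def principal_factor_sketch(X, min_eigenvalue=1.0):
    """Return unrotated loadings, factor scores and retained variance share."""
    # Standardize each item so the analysis runs on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)

    # Step 1: decide how many factors (eigenvalues above the cutoff).
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    k = max(1, int(np.sum(eigenvalues > min_eigenvalue)))

    # Step 2: derive the solution; loadings are scaled eigenvectors.
    loadings = eigenvectors[:, :k] * np.sqrt(eigenvalues[:k])

    # Step 5: one simple fit measure, the share of total variance retained.
    retained = eigenvalues[:k].sum() / eigenvalues.sum()

    # Step 6: factor scores via the regression method, Z R^-1 L.
    scores = Z @ np.linalg.solve(R, loadings)
    return loadings, scores, retained


# Synthetic example: 200 respondents, 6 items driven by 2 hidden factors.
rng = np.random.default_rng(1)
hidden = rng.normal(size=(200, 2))
true_loadings = np.array([
    [0.8, 0.0], [0.7, 0.0], [0.9, 0.0],
    [0.0, 0.8], [0.0, 0.7], [0.0, 0.9],
])
X = hidden @ true_loadings.T + 0.5 * rng.normal(size=(200, 6))

loadings, scores, retained = principal_factor_sketch(X)
print(loadings.shape, scores.shape, round(float(retained), 2))
```

The factor scores produced this way are uncorrelated with each other, which is what makes them convenient inputs for a later regression.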

Introduction to Factor Analysis in Python

In this tutorial, you’ll learn the basics of factor analysis and how to implement it in Python.

Factor Analysis (FA) is an exploratory data analysis method used to search for influential underlying factors, or latent variables, in a set of observed variables. It helps in data interpretation by reducing the number of variables. It extracts the maximum common variance from all variables and puts it into a common score.

Factor analysis is widely used in market research, advertising, psychology, finance, and operations research. Market researchers use factor analysis to identify price-sensitive customers, to identify brand features that influence consumer choice, and to understand channel selection criteria for the distribution channel.

In this tutorial, you are going to cover the following topics:

  • Factor Analysis
  • Types of Factor Analysis
  • Determine Number of Factors
  • Factor Analysis Vs. Principal Component Analysis
  • Factor Analysis in Python
  • Adequacy Test
  • Interpreting the results
  • Pros and Cons of Factor Analysis
  • Conclusion

Factor Analysis

Factor analysis is a linear statistical model. It is used to explain the variance among the observed variables and to condense a set of observed variables into a smaller set of unobserved variables called factors. Observed variables are modeled as a linear combination of factors and error terms. Each factor, or latent variable, is associated with multiple observed variables that have common patterns of responses. Each factor explains a particular amount of variance in the observed variables.

Assumptions:

  1. There are no outliers in the data.
  2. The sample size should be greater than the number of factors.
  3. There should not be perfect multicollinearity.
  4. Homoscedasticity between the variables is not required.

Types of Factor Analysis

  • Exploratory Factor Analysis (EFA): It is the most popular factor analysis approach among social and management researchers. Its basic assumption is that any observed variable may be directly associated with any factor.
  • Confirmatory Factor Analysis (CFA): Its basic assumption is that each factor is associated with a particular set of observed variables. CFA tests whether the data confirm that expected structure.

How does factor analysis work?

The primary objective of factor analysis is to reduce the number of observed variables and find the unobservable variables behind them. These unobserved variables help the market researcher draw conclusions from the survey. This conversion of observed variables to unobserved variables can be achieved in two steps:

  • Factor Extraction: In this step, the number of factors and the extraction approach are selected, using variance partitioning methods such as principal components analysis and common factor analysis.
  • Factor Rotation: In this step, the extracted factors are rotated to produce a simpler pattern of loadings. The main goal of this step is to improve overall interpretability. Many rotation methods are available, such as Varimax and Quartimax (orthogonal, which keep the factors uncorrelated) and Promax (oblique, which allows the factors to correlate).
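To show what the rotation step does mechanically, here is a minimal sketch of Varimax rotation using the widely circulated SVD-based iteration. It is an illustration, not the implementation used by any specific package; the function name `varimax`, its parameters, and the small loading matrix are assumptions made up for this example.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a (variables x factors) loading matrix with orthogonal Varimax."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = L @ rotation
        # Gradient-like term of the Varimax criterion.
        column_ss = np.diag(np.sum(rotated**2, axis=0))
        target = L.T @ (rotated**3 - rotated @ column_ss / p)
        u, s, vt = np.linalg.svd(target)
        rotation = u @ vt  # nearest orthogonal matrix
        criterion = s.sum()
        if previous != 0 and criterion / previous < 1 + tol:
            break
        previous = criterion
    return L @ rotation, rotation


# Hypothetical example: a clean two-factor structure, deliberately mixed by 45 degrees.
theta = np.pi / 4
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.9], [0.0, 0.6]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation matrix is orthogonal, each variable keeps the same sum of squared loadings before and after rotation; only the way that variance is split across the factors changes.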

Terminology

What is a factor?

A factor is a latent variable that describes the association among a number of observed variables. The maximum number of factors is equal to the number of observed variables. Every factor explains a certain amount of variance in the observed variables. The factors that explain the least variance are dropped. Factors are also known as latent variables, hidden variables, unobserved variables, or hypothetical variables.

What are the factor loadings?

The factor loadings form a matrix that shows the relationship of each variable to the underlying factors. Each loading is the correlation coefficient between an observed variable and a factor, and its square is the share of that variable’s variance explained by the factor.

What are Eigenvalues?

Eigenvalues represent the variance explained by each factor out of the total variance. They are also known as characteristic roots.

What are Communalities?

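Communalities are the sum of the squared loadings for each variable. A communality represents the common variance, the part of a variable’s variance that the factors explain. It ranges from 0 to 1, and a value close to 1 means the factors explain more of that variable’s variance.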
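These definitions translate directly into row and column sums over the loading matrix. The sketch below uses a small hypothetical loading matrix (the numbers are invented for illustration) and assumes standardized variables and uncorrelated factors, so that each variable has a total variance of 1.

```python
import numpy as np

# Hypothetical loadings: 4 observed variables (rows) on 2 factors (columns).
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
    [0.2, 0.6],
])

# Communality: sum of squared loadings across factors, one value per variable.
communalities = (loadings**2).sum(axis=1)

# Variance explained by each factor: sum of squared loadings down each column.
variance_per_factor = (loadings**2).sum(axis=0)

# Share of the total variance (one unit per standardized variable) per factor.
proportion_per_factor = variance_per_factor / loadings.shape[0]

# Uniqueness: the part of each variable the factors do not explain.
uniqueness = 1 - communalities

print(communalities, variance_per_factor, proportion_per_factor)
```

For an unrotated principal-component solution the column sums coincide with the eigenvalues; after rotation they are usually reported as the sum of squared loadings per factor.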

What is Factor Rotation?

Rotation is a tool for better interpretation of factor analysis. Rotation can be orthogonal or oblique. It redistributes the explained variance across the factors, without changing the communalities, to give a clearer pattern of loadings.

How many factors do we need to include in our analysis? 

There are a couple of different criteria that can be used. 
One criterion is to say that we want to retain at least a given percentage of the original variation in the survey. So we might say, okay, I want to retain at least 50% of the variation in the survey.
Another criterion that we could use is to say, well, let’s include as many factors as are necessary such that each factor that we include is doing its fair share of explaining variation. Mathematically, what this maps on to is saying that every factor we keep has to have an eigenvalue greater than 1.
That is equivalent to saying that the amount of variation a given factor explains has to be greater than 1 over j, where j is the number of survey items that we have. So if I have 20 survey items, we keep including factors until a factor falls below the 5% threshold, the 1 over 20 threshold.
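Both criteria are easy to express in code. The following is a minimal sketch in plain Python; the function names and the list of eigenvalues are hypothetical and chosen only to illustrate the two rules for a five-item survey.

```python
def n_factors_by_eigenvalue(eigenvalues, cutoff=1.0):
    """Count the factors whose eigenvalue exceeds the cutoff."""
    return sum(1 for value in eigenvalues if value > cutoff)


def n_factors_by_variance(eigenvalues, target=0.5):
    """Smallest number of factors retaining at least `target` of the variance."""
    total = sum(eigenvalues)
    running = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= target:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues for 5 standardized items (they sum to 5).
eigenvalues = [2.4, 1.3, 0.6, 0.4, 0.3]

print(n_factors_by_eigenvalue(eigenvalues))           # eigenvalue > 1 rule
print(n_factors_by_variance(eigenvalues, target=0.5))  # retain at least 50%
print(n_factors_by_variance(eigenvalues, target=0.8))  # retain at least 80%

# The eigenvalue rule and the "1 over j" rule pick the same factors:
shares = [value / len(eigenvalues) for value in eigenvalues]
print([share > 1 / len(eigenvalues) for share in shares])
```

The two criteria can disagree, so in practice the choice is a judgment call that also depends on whether the resulting factors can be interpreted.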

Kaiser criterion

The Kaiser criterion is an analytical approach: factors that explain a larger proportion of the variance are selected. The eigenvalue is a good criterion for determining the number of factors. Generally, an eigenvalue greater than 1 is used as the selection criterion for a factor.

The graphical approach is based on a visual representation of the factors’ eigenvalues, called a scree plot. The scree plot helps us determine the number of factors by showing where the curve makes an elbow.
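A scree plot can be drawn with a few lines of matplotlib. This is a minimal sketch: the helper name `scree_plot` and the eigenvalues passed to it are hypothetical, and the off-screen "Agg" backend is selected only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt


def scree_plot(eigenvalues):
    """Plot eigenvalues against factor number with the eigenvalue = 1 line."""
    fig, ax = plt.subplots()
    factor_numbers = list(range(1, len(eigenvalues) + 1))
    ax.plot(factor_numbers, eigenvalues, marker="o")
    ax.axhline(1.0, linestyle="--", color="grey")  # Kaiser cutoff for reference
    ax.set_xlabel("Factor number")
    ax.set_ylabel("Eigenvalue")
    ax.set_title("Scree plot")
    return fig, ax


# Hypothetical eigenvalues; in practice these come from the fitted analysis.
fig, ax = scree_plot([2.4, 1.3, 0.6, 0.4, 0.3])
```

Calling `fig.savefig(...)` or displaying the figure then lets you look for the point where the curve flattens out.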

Factor Analysis Vs. PCA

  • PCA components explain the maximum amount of variance, while factor analysis explains the covariance in the data.
  • PCA components are fully orthogonal to each other, whereas factor analysis does not require factors to be orthogonal.
  • A PCA component is a linear combination of the observed variables, while in FA the observed variables are linear combinations of the unobserved variables, or factors.
  • PCA components are often hard to interpret. In FA, the underlying factors are labelable and interpretable.
  • PCA is a dimensionality reduction method, whereas factor analysis is a latent variable method.
  • PCA is sometimes described as a type of factor analysis, but PCA is observational whereas FA is a modeling technique.

Factor Analysis in python using factor_analyzer package

import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt

The bfi dataset used here is listed at https://vincentarelbundock.github.io/Rdatasets/datasets.html

data = 'bfi.csv'
df = pd.read_csv(data)

Dropping unnecessary columns

df.drop(['gender', 'education', 'age'], axis=1, inplace=True)

Dropping missing values rows

df.dropna(inplace=True)
df.info()

Adequacy Test

Before you perform factor analysis, you need to evaluate the “factorability” of your dataset. Factorability means “can we find factors in the dataset?”. There are two methods to check factorability or sampling adequacy:

  • Bartlett’s Test
  • Kaiser-Meyer-Olkin Test

Bartlett’s test of sphericity checks whether or not the observed variables are intercorrelated at all by comparing the observed correlation matrix against the identity matrix. If the test is statistically insignificant, you should not employ factor analysis.

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value, p_value = calculate_bartlett_sphericity(df)
chi_square_value, p_value

# In this Bartlett’s test, the p-value is reported as 0. The test is statistically significant, indicating that the observed correlation matrix is not an identity matrix.
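For intuition about what the test computes, here is a minimal NumPy sketch of the textbook Bartlett statistic, which is built from the log-determinant of the correlation matrix. It is an illustration and not the `factor_analyzer` source; the function name `bartlett_statistic` and the synthetic data are assumptions. The p-value would come from a chi-square distribution with the returned degrees of freedom, which is left out here to keep the example dependency-free.

```python
import numpy as np


def bartlett_statistic(data):
    """Return Bartlett's chi-square statistic and its degrees of freedom."""
    X = np.asarray(data, dtype=float)
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # log|R| is 0 for an identity matrix and negative when items correlate.
    _, log_det = np.linalg.slogdet(R)
    chi_square = -(n - 1 - (2 * p + 5) / 6) * log_det
    degrees_of_freedom = p * (p - 1) // 2
    return chi_square, degrees_of_freedom


# Synthetic example: 100 respondents, 3 items sharing one hidden driver.
rng = np.random.default_rng(0)
driver = rng.normal(size=(100, 1))
items = driver + 0.7 * rng.normal(size=(100, 3))

chi_square, degrees_of_freedom = bartlett_statistic(items)
print(round(float(chi_square), 1), degrees_of_freedom)
```

The more strongly the items are correlated, the further the determinant falls below 1 and the larger the statistic becomes.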

Kaiser-Meyer-Olkin (KMO) Test measures the suitability of data for factor analysis.

from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(df)
kmo_model

If the Kaiser-Meyer-Olkin test gives a value over 0.6, we can proceed with the factor analysis.
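The overall KMO value compares ordinary correlations with partial correlations. Below is a minimal NumPy sketch of that textbook formula, given as an illustration rather than the `factor_analyzer` implementation; the function name `kmo_overall` and the synthetic data are assumptions.

```python
import numpy as np


def kmo_overall(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    inverse = np.linalg.inv(R)
    # Partial correlations between each pair, controlling for all other items.
    scale = np.sqrt(np.diag(inverse))
    partial = -inverse / np.outer(scale, scale)
    off_diagonal = ~np.eye(R.shape[0], dtype=bool)
    sum_r2 = np.sum(R[off_diagonal] ** 2)
    sum_partial2 = np.sum(partial[off_diagonal] ** 2)
    return sum_r2 / (sum_r2 + sum_partial2)


# Synthetic example: 200 respondents, 4 items sharing one hidden driver.
rng = np.random.default_rng(0)
driver = rng.normal(size=(200, 1))
items = driver + 0.7 * rng.normal(size=(200, 4))

kmo = kmo_overall(items)
print(round(float(kmo), 2))
```

When the items share common factors, the partial correlations are small relative to the raw correlations and the value moves toward 1.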

Create factor analysis object and perform factor analysis

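# factor_analyzer 0.3.0 and later use a scikit-learn style API (fit / loadings_);
# older releases used fa.analyze(df, 25, rotation=None) instead.
fa = FactorAnalyzer(n_factors=25, rotation=None)
fa.fit(df)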

Check Eigenvalues

ev, v = fa.get_eigenvalues()

From here, we pick the number of factors whose eigenvalues are greater than 1.

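Create the factor analysis object and perform factor analysis. Note that Varimax rotation is used under the assumption that the factors are completely uncorrelated.

# fit / loadings_ is the factor_analyzer >= 0.3.0 API; older releases used fa.analyze(...) and fa.loadings.
fa = FactorAnalyzer(n_factors=6, rotation="varimax")
fa.fit(df)
fa.loadings_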

Naming the Factors

After establishing the adequacy of the factors, it’s time for us to name the factors. This is the theoretical side of the analysis, where we interpret the factors depending on the variable loadings.

Looking at the description of the data at https://vincentarelbundock.github.io/Rdatasets/doc/psych/bfi.dictionary.html, we can come up with the following names:

  • Factor 1 has high factor loadings for E1,E2,E3,E4, and E5 (Extraversion)
  • Factor 2 has high factor loadings for N1,N2,N3,N4, and N5 (Neuroticism)
  • Factor 3 has high factor loadings for C1,C2,C3,C4, and C5 (Conscientiousness)
  • Factor 4 has high factor loadings for O1,O2,O3,O4, and O5 (Openness)
  • Factor 5 has high factor loadings for A1,A2,A3,A4, and A5 (Agreeableness)
  • Factor 6 has no high loadings for any variable and is not easily interpretable. It’s better to keep only five factors.
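Reading a loading table by eye gets tedious as the number of items grows, so the grouping step can be partly automated. The sketch below assigns each item to the factor on which it loads most strongly and flags items with no strong loading. The function name, the 0.4 cutoff (a common rule of thumb, not something fixed by the method), and the loading values are all hypothetical; they are not the loadings from the bfi analysis above.

```python
def group_items_by_factor(loadings, threshold=0.4):
    """Map factor index -> items whose largest absolute loading is on it.

    Items with no loading at or above the threshold are grouped under None.
    """
    groups = {}
    for item, row in loadings.items():
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        key = best if abs(row[best]) >= threshold else None
        groups.setdefault(key, []).append(item)
    return groups


# Hypothetical loadings for five items on three factors.
example_loadings = {
    "item_1": [0.62, 0.05, 0.10],
    "item_2": [0.71, 0.12, 0.03],
    "item_3": [0.08, 0.80, 0.02],
    "item_4": [0.10, -0.75, 0.05],  # strong negative loading still counts
    "item_5": [0.20, 0.15, 0.25],   # no strong loading anywhere
}

groups = group_items_by_factor(example_loadings)
print(groups)
```

The naming itself still needs a person: the code only says which items belong together, not what belief or trait they have in common.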

What is next?

Now that we have seen how to perform factor analysis, we can use the same technique to analyze other data.

Possible applications include reducing the dimensionality when building a predictive model on high-dimensional data, which can improve the model’s performance, as I describe in the article “Improve performance on Machine learning models”.

Another application is improving data exploration, as mentioned in “Machine learning in practice – Quick data analysis” and “Starting with Data Science”.
