
# Validation of Categorical Variables

A variable is categorical if its values fall into a distinct set of categories that do not overlap. In other words, each individual value of a categorical variable can be placed into exactly one of several groups (categories).

## Examples

An example of a common categorical variable is "gender." Every individual will be grouped into either "male" or "female", and possibly "unknown" if you have missing data. Other categorical variables may include:

• Race (White, Black, Asian, Other)
• Cause of Injury
• Pain Scale (0,2,4,6,8,10)
• Likert Scales such as 1=Strongly Disagree, 2=Disagree, 3=Neutral, 4=Agree, 5=Strongly Agree

## Frequency Checks

The first data validation step for categorical variables is a frequency check (counting the number of records in each group). Let’s look at some examples.

Example: Suppose you mailed a survey to hospital administrators in your state, asking them the following question:

• Do you agree with the following statement?

"Implementing inter-facility transfer agreements will improve outcomes for children."

When you get the surveys back, enter the data into your database, and run a frequency check, you get the following results:

| Response | Frequency | Percent |
|---|---|---|
| Agree | 66 | 48.9 |
| Disagree | 69 | 51.1 |

Does this make sense?  It seems like a reasonable result.  It is possible the administrators are almost evenly split about their opinion on inter-facility transfer agreements.
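If your analysis software does not have a built-in frequency procedure, a frequency check is easy to sketch yourself. The following is a minimal illustration in Python using only the standard library. The function name `frequency_table` and the `responses` list are assumptions made for this example; the list is simply rebuilt from the counts shown above.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent)} for a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (count, round(100 * count / total, 1))
        for category, count in counts.items()
    }

# Hypothetical responses, rebuilt from the counts in the example above.
responses = ["Agree"] * 66 + ["Disagree"] * 69

table = frequency_table(responses)
for category, (count, percent) in table.items():
    # One row per category: name, frequency, percent.
    print(f"{category:<10}{count:>6}{percent:>8.1f}")
```

The printed rows should match the frequency and percent columns of the table above, which is a quick way to confirm that the data were entered as expected.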

Let’s look at another example: Suppose your state had done a survey, and you are particularly interested in the results for one of the questions:

• Do you agree or disagree with the following statement?

"On-line medical direction at the scene of an emergency will help ensure optimal pediatric emergency care."

So, you download the data from the website where it is available, read the data into your analysis software, and run a frequency check. These are the results of the frequency check:

| Response | Frequency | Percent |
|---|---|---|
| Agree | 9595 | 70.72 |
| Disagree | 3696 | 27.24 |
| 7 | 36 | 0.27 |
| 8 | 236 | 1.74 |
| 9 | 5 | 0.04 |

Frequency Missing = 2

Does this make sense?  You might immediately be concerned about getting the results "7," "8," and "9" in your data since the answers should have only been "Agree" or "Disagree." You would want to go back and:

1. check that the data was downloaded properly from the website, and
2. check that the data was imported properly into the format for your analysis software.

These are both steps where data mistakes can easily happen.
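A frequency check like this can also be turned into an automatic test by comparing every value against the list of answers you expect. The sketch below assumes that the stray codes are stored as the text values "7", "8", and "9" and that missing answers are stored as `None`; the names `ALLOWED` and `unexpected_values` are made up for this example, and the data are rebuilt from the counts above.

```python
from collections import Counter

# Valid answers, taken from the wording of the survey question.
ALLOWED = {"Agree", "Disagree"}

def unexpected_values(values, allowed):
    """Count every value that is not in the allowed set (None marks missing)."""
    return Counter(v for v in values if v not in allowed)

# Hypothetical data, rebuilt from the counts in the frequency check above.
responses = (
    ["Agree"] * 9595 + ["Disagree"] * 3696
    + ["7"] * 36 + ["8"] * 236 + ["9"] * 5
    + [None] * 2
)

problems = unexpected_values(responses, ALLOWED)
for value, count in sorted(problems.items(), key=lambda item: str(item[0])):
    print(f"unexpected value {value!r}: {count} record(s)")
```

Any value that is printed here is one you would want to trace back through the download and import steps.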

## Cross Tabulations

Another useful data validation method is running cross tabulations (counting the number in each group of combinations of variables).  This is useful because there are often combinations of variables that should not occur.  Take, for example, the variables "sex" and "pregnancy status."  One obvious combination that should never occur is a pregnant male.  Let’s do an example.

Example: Suppose you have collected data on elementary school students. You decide to run cross tabulations on the variables "AgeGroup" and "Grade." You know that elementary students should belong to grades K-6 and fall into one of the age groups "5-6", "7-8", "9-10", or "11-12."

1. First you run frequency checks to make sure the variables have the appropriate values, as we discussed above. When you look at the frequency checks, everything seems to make sense.
2. Next, you decide to run the cross tabulation:

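| Grade | Age Group 5-6 | Age Group 7-8 | Age Group 9-10 | Age Group 11-12 |
|---|---|---|---|---|
| K | 315 | 0 | 0 | 0 |
| 1 | 107 | 168 | 0 | 0 |
| 2 | 0 | 105 | 0 | 0 |
| 3 | 0 | 135 | 130 | 0 |
| 4 | 0 | 0 | 243 | 0 |
| 5 | 0 | 0 | 124 | 128 |
| 6 | 0 | 1 | 0 | 269 |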

Does everything make sense? You might be concerned about the one sixth grader in the "7-8" age group. You would want to go back and check the original data. This could be due to a data entry mistake, or there might really be a child in the school who has skipped a few grades.

It will be up to you to find out what is going on with the strange data point, and then either correct it or confirm that it is correct.
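The cross tabulation check can be sketched in the same way. The example below rebuilds one record per student from the counts in the cross tabulation above, counts each combination of grade and age group, and lists the combinations that fall outside a set of expected pairings. The `EXPECTED` pairings are an assumption made for illustration (the well-populated cells of the table), not an official rule, and the function names are hypothetical.

```python
from collections import Counter

AGE_GROUPS = ["5-6", "7-8", "9-10", "11-12"]

# Counts copied from the cross tabulation above: grade -> count per age group.
TABLE = {
    "K": [315, 0, 0, 0],
    "1": [107, 168, 0, 0],
    "2": [0, 105, 0, 0],
    "3": [0, 135, 130, 0],
    "4": [0, 0, 243, 0],
    "5": [0, 0, 124, 128],
    "6": [0, 1, 0, 269],
}

# Rebuild one (grade, age_group) record per student, as raw data would look.
records = [
    (grade, age_group)
    for grade, row in TABLE.items()
    for age_group, n in zip(AGE_GROUPS, row)
    for _ in range(n)
]

# Assumption for illustration: age groups considered plausible for each grade.
EXPECTED = {
    "K": {"5-6"},
    "1": {"5-6", "7-8"},
    "2": {"7-8"},
    "3": {"7-8", "9-10"},
    "4": {"9-10"},
    "5": {"9-10", "11-12"},
    "6": {"11-12"},
}

def cross_tabulate(records):
    """Count how many records fall into each (grade, age_group) combination."""
    return Counter(records)

def implausible_cells(crosstab, expected):
    """Return the combinations whose age group is not expected for the grade."""
    return {
        (grade, age_group): count
        for (grade, age_group), count in crosstab.items()
        if age_group not in expected[grade]
    }

crosstab = cross_tabulate(records)
flagged = implausible_cells(crosstab, EXPECTED)
print(flagged)  # {('6', '7-8'): 1}
```

A flagged combination is not automatically an error. As with the sixth grader here, it only tells you which records to look up in the original data.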

rev. 29-Aug-2016