National EMSC Data Analysis Resource Center
A variable is categorical if its values fall into a distinct set of categories that do not overlap. In other words, each individual value of a categorical variable can be placed into exactly one of several groups (categories).
An example of a common categorical variable is "gender." Every individual will be grouped into either "male" or "female", and possibly "unknown" if you have missing data. Other categorical variables may include:
The first data validation step for categorical variables is a frequency check (counting the number of records in each group). Let’s look at some examples.
Example: Suppose you mailed a survey to hospital administrators in your state, asking them the following question:
 Do you agree with the following statement?
"Implementing interfacility transfer agreements will improve outcomes for children."
When you get the surveys back, enter the data into your database, and run a frequency check, you get the following results:

            Frequency    Percent
Agree              66       48.9
Disagree           69       51.1
Does this make sense? It seems like a reasonable result. It is possible the administrators are almost evenly split about their opinion on interfacility transfer agreements.
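A frequency check like the one in this example can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name `frequency_table` and the list `responses` are hypothetical, and the responses are rebuilt from the counts reported above rather than read from a real survey file.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent)} for a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}

# Hypothetical data rebuilt from the counts in the example above.
responses = ["Agree"] * 66 + ["Disagree"] * 69

table = frequency_table(responses)
for category, (count, percent) in table.items():
    # Print category, frequency, and percent in aligned columns.
    print(f"{category:<10}{count:>6}{percent:>8.1f}")
```

Most statistical packages provide an equivalent built-in command, so in practice you would likely use that instead of writing your own.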
Let’s look at another example: Suppose your state had done a survey, and you are particularly interested in the results for one of the questions:
 Do you agree or disagree with the following statement?
"Online medical direction at the scene of an emergency will help ensure optimal pediatric emergency care."
So, you download the data from the website where it is available, read it into your analysis software, and run a frequency check. These are the results:

            Frequency    Percent
Agree            9595      70.72
Disagree         3696      27.24
7                  36       0.27
8                 236       1.74
9                   5       0.04
Frequency Missing = 2
Does this make sense? You might immediately be concerned about getting the results "7," "8," and "9" in your data since the answers should have only been "Agree" or "Disagree." You would want to go back and:
 double check that the data was downloaded properly
 check that the data was imported properly into the format for your analysis software.
These are both steps where data mistakes can easily happen.
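One way to catch this kind of problem automatically is to compare every value against the list of answers you expect. The sketch below assumes the only valid answers are "Agree" and "Disagree", and that missing data appears as `None` or an empty string. The names `VALID_ANSWERS` and `validate_categories` are hypothetical, and the data is rebuilt from the counts in the example above.

```python
from collections import Counter

# Assumption: these are the only legitimate answers to the survey question.
VALID_ANSWERS = {"Agree", "Disagree"}

def is_missing(value):
    """Treat None and blank strings as missing (an assumption about the file)."""
    return value is None or str(value).strip() == ""

def validate_categories(values, valid):
    """Return (number of missing values, Counter of unexpected values)."""
    missing = sum(1 for v in values if is_missing(v))
    unexpected = Counter(
        v for v in values if not is_missing(v) and v not in valid
    )
    return missing, unexpected

# Hypothetical data rebuilt from the frequency check in the example.
answers = (["Agree"] * 9595 + ["Disagree"] * 3696
           + ["7"] * 36 + ["8"] * 236 + ["9"] * 5 + [None] * 2)

missing, unexpected = validate_categories(answers, VALID_ANSWERS)
print("Missing:", missing)
for value, count in sorted(unexpected.items()):
    print(f"Unexpected value {value!r}: {count}")
```

A check like this only tells you that unexpected values exist. You still need to go back to the download and import steps to find out where they came from.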
Another useful data validation method is running cross tabulations (counting the number in each group of combinations of variables). This is useful because there are often combinations of variables that should not occur. Take, for example, the variables "sex" and "pregnancy status." One obvious combination that should never occur is a pregnant male. Let’s do an example.
Example: Suppose you have collected data on elementary school students. You decide to run cross tabulations on the variables "AgeGroup" and "Grade." You know that elementary students should belong to grades K-6 and fall into one of the age groups "5-6", "7-8", "9-10", or "11-12."
 First you run frequency checks to make sure the variables have the appropriate values, as we discussed above. When you look at the frequency checks, everything seems to make sense.
 Next, you decide to run the cross tabulation:
                Age Group
Grade    "5-6"    "7-8"    "9-10"    "11-12"
K          315        0         0          0
1          107      168         0          0
2            0      105         0          0
3            0      135       130          0
4            0        0       243          0
5            0        0       124        128
6            0        1         0        269
Does everything make sense? You might be concerned about the one sixth grader in the "7-8" year-old age group. You would want to go back and check the original data. This could be due to a data entry mistake, or there might simply be a child in the school who has skipped a few grades.
It will be up to you to find out what is going on with the strange data point, and then either correct it or confirm that it is correct.
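The cross tabulation and the search for unlikely combinations can also be sketched in Python. Everything named here is an assumption made for illustration: the age group labels are written with hyphens, the mapping `EXPECTED_AGE_GROUPS` is one plausible rule for which ages belong to which grade, and the records are rebuilt from the counts in the example rather than taken from a real data file.

```python
from collections import Counter

# Assumption: age groups considered plausible for each grade.
# Adjust these rules to match your own data.
EXPECTED_AGE_GROUPS = {
    "K": {"5-6"},
    "1": {"5-6", "7-8"},
    "2": {"7-8"},
    "3": {"7-8", "9-10"},
    "4": {"9-10"},
    "5": {"9-10", "11-12"},
    "6": {"11-12"},
}

def cross_tabulate(records):
    """Count how many records fall into each (Grade, AgeGroup) combination."""
    return Counter((r["Grade"], r["AgeGroup"]) for r in records)

def flag_unexpected(crosstab, expected):
    """Return the combinations that the plausibility rules do not allow."""
    return {
        (grade, age): n
        for (grade, age), n in crosstab.items()
        if age not in expected.get(grade, set())
    }

# Hypothetical records rebuilt from the nonzero cells of the example.
cell_counts = {
    ("K", "5-6"): 315, ("1", "5-6"): 107, ("1", "7-8"): 168,
    ("2", "7-8"): 105, ("3", "7-8"): 135, ("3", "9-10"): 130,
    ("4", "9-10"): 243, ("5", "9-10"): 124, ("5", "11-12"): 128,
    ("6", "7-8"): 1, ("6", "11-12"): 269,
}
records = [
    {"Grade": grade, "AgeGroup": age}
    for (grade, age), n in cell_counts.items()
    for _ in range(n)
]

crosstab = cross_tabulate(records)
flagged = flag_unexpected(crosstab, EXPECTED_AGE_GROUPS)
for (grade, age), n in flagged.items():
    print(f"Check grade {grade}, age group {age}: {n} record(s)")
```

A flagged combination is not automatically an error. It is a prompt to review the original records, as with the sixth grader above.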
rev. 04Aug2022