Quantitative: Data Management and Cleaning
Topic 3: Data Analysis
Before embarking on the analysis of your data, it is very important to select a data management system, set up the properties for your data file and variables, enter your data accurately, and check for any data entry errors.
Data Management
Data management involves creating a codebook that lists all your variable names, labels, and attributes. Datasets are often combined or merged, so a codebook that records which dataset each variable originally belonged to can be very useful. For example:
EXAMPLE 1
Dataset: Demographics
Variable Name: GENDER
Variable Label: Gender
Variable Type: Nominal
Values: 1, 2
Value Labels:
1 = female
2 = male
EXAMPLE 2
Dataset: Grade 6 Grades
Variable Name: MATH6
Variable Label: 6th grade math GPA
Variable Type: Continuous
Values: 0-5
Value Labels: None
For more on how to construct a codebook, see pages 135-137 in Fink, A. (2013). How to conduct surveys (5th ed.). Thousand Oaks, CA: Sage Publications.
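A codebook can also be kept in machine-readable form so that entered values can be checked against it. The following is a minimal sketch in Python, not a prescribed format: the `CODEBOOK` structure and the `is_valid` helper are hypothetical names introduced for illustration, and the two entries simply mirror Examples 1 and 2 above.

```python
# Hypothetical codebook structure; the entries mirror Examples 1 and 2 above.
CODEBOOK = {
    "GENDER": {
        "dataset": "Demographics",
        "label": "Gender",
        "type": "nominal",
        "values": {1: "female", 2: "male"},
    },
    "MATH6": {
        "dataset": "Grade 6 Grades",
        "label": "6th grade math GPA",
        "type": "continuous",
        "range": (0, 5),
    },
}


def is_valid(variable, value):
    """Return True if the value is allowed for the variable by the codebook."""
    entry = CODEBOOK[variable]
    if entry["type"] == "nominal":
        # Nominal variables must use one of the listed codes.
        return value in entry["values"]
    # Continuous variables must fall inside the documented range.
    low, high = entry["range"]
    return low <= value <= high


print(is_valid("GENDER", 2))   # True
print(is_valid("MATH6", 6.4))  # False
```

Keeping the dataset name in each entry preserves the link back to the original file after datasets are merged.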
Data Entry
Choose where you will enter your data. It will be helpful to know the programs that you will use to analyze your data (SPSS, Excel, etc.). You could choose to enter your data directly into one of those programs or into a text file that is later imported.
If you are using an online survey program such as Qualtrics, you will probably download your file for further analysis. You can export data in various formats; consider which format will be the easiest for you to work with based on where you will analyze the data.
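As an illustration of importing a text file, the sketch below reads comma-separated data with Python's standard `csv` module. The column names, the `EXPORT` sample text, and the `load_records` helper are assumptions made for this example; a real export from a survey tool will have its own layout.

```python
import csv
import io

# Hypothetical export; in practice this text would come from a downloaded file.
EXPORT = "ID,GENDER,MATH6\n101,1,3.4\n102,2,2.8\n"


def load_records(text):
    """Parse CSV text into a list of dicts, converting columns to numeric types."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "ID": row["ID"],               # keep IDs as text
            "GENDER": int(row["GENDER"]),  # nominal code
            "MATH6": float(row["MATH6"]),  # continuous value
        })
    return records


records = load_records(EXPORT)
print(len(records), records[0])
```

Converting each column to its intended type at import time makes later range and frequency checks more reliable.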
Data Cleaning
Data cleaning refers to the process of improving the quality of your data by checking that your dataset does not contain data entry errors and that it is set up appropriately for analysis. The data cleaning step should not be skipped and should be done before conducting any analysis. Running descriptive statistics, including frequency tables for each variable, helps to spot most errors. Some of the most common errors include:
- Inconsistent data entry. For example, data for gender might be entered as “F”, “f”, “fem”, “female”, or “1”. A frequency table will list every value that was entered for gender.
- Misspellings. Frequency tables will allow you to audit all the text that respondents typed in.
- Out of range values. For example, one respondent’s value for GPA may have been entered as 6.4 rather than 3.4. The codebook established that the range of GPA should be between 0 and 5. A frequency table would provide all the possible values that were entered so that this error could be corrected.
- Errors introduced when transferring data between software packages, for example from Qualtrics to SPSS.
- Redundancy. Perhaps two records were entered for one survey. A frequency table including ID would identify such an error.
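The checks in the list above can be sketched with simple frequency tables. This example uses Python's standard `collections.Counter`; the sample records and the `frequency_table` helper are hypothetical, and the GPA range of 0 to 5 follows the codebook example.

```python
from collections import Counter

# Hypothetical raw records containing the kinds of errors listed above.
records = [
    {"ID": "101", "GENDER": "F", "GPA": 3.4},
    {"ID": "102", "GENDER": "female", "GPA": 6.4},  # out-of-range GPA
    {"ID": "103", "GENDER": "1", "GPA": 2.8},
    {"ID": "103", "GENDER": "1", "GPA": 2.8},       # record entered twice
    {"ID": "104", "GENDER": "f", "GPA": 3.9},
]


def frequency_table(rows, variable):
    """Count how often each distinct value of a variable occurs."""
    return Counter(row[variable] for row in rows)


# Inconsistent entry: more distinct values than the codebook allows.
gender_counts = frequency_table(records, "GENDER")

# Redundancy: any ID that appears more than once.
duplicate_ids = [i for i, n in frequency_table(records, "ID").items() if n > 1]

# Out-of-range values: GPA outside the codebook range of 0 to 5.
out_of_range = [row["ID"] for row in records if not 0 <= row["GPA"] <= 5]

print(gender_counts)
print(duplicate_ids)  # ['103']
print(out_of_range)   # ['102']
```

Statistical packages such as SPSS produce the same information through their built-in frequency procedures; the sketch only shows what those tables are used to detect.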
Depending on the analysis you are conducting, some of your variables may need to meet certain requirements, and you will have to recode your data. For example, you may want to examine how three groups differ on one variable, but the groups have to be constructed from a continuous variable. Assume you are interested in how high-performing, average-performing, and low-performing students vary on some attribute. You will need to recode GPA into a new categorical variable based on performance and assign your respondents accordingly (e.g., high performing > 2.99; average performing 2.00-2.99; low performing < 2.00).
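A minimal sketch of this recoding in Python follows. The function name `performance_group` and the group labels are hypothetical; the cutoffs are the ones given in the example above.

```python
def performance_group(gpa):
    """Recode a continuous GPA into three performance groups."""
    if gpa > 2.99:
        return "high"
    if gpa >= 2.00:
        return "average"
    return "low"


# Hypothetical GPAs; the original values are kept alongside the new variable.
gpas = [3.4, 2.99, 2.0, 1.5]
groups = [performance_group(g) for g in gpas]
print(groups)  # ['high', 'average', 'average', 'low']
```

Storing the result in a new variable, rather than overwriting GPA, keeps the continuous values available for other analyses.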
Decide in advance how you will handle missing data, so that you can distinguish a respondent's non-response to a specific question from a data entry error.
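One common approach, sketched here under stated assumptions, is to reserve an explicit code for non-response so that it cannot be confused with a cell left blank during entry. The code `-9`, the `classify` helper, and the category names are all hypothetical; the valid range of 0 to 5 follows the GPA example.

```python
NO_RESPONSE = -9  # hypothetical code; pick a value outside the valid range


def classify(value, low=0, high=5):
    """Label a value as valid, a coded non-response, a blank, or an entry error."""
    if value is None:
        return "blank"        # nothing was entered; check the source form
    if value == NO_RESPONSE:
        return "no response"  # respondent skipped the question
    if low <= value <= high:
        return "valid"
    return "entry error"      # outside the range defined in the codebook


values = [3.4, -9, None, 6.4]
print([classify(v) for v in values])
# ['valid', 'no response', 'blank', 'entry error']
```

Whatever code is chosen should be recorded in the codebook and declared as missing in the analysis software so that it is not treated as a real value.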
You may also have to transform your data, for example by reverse-scoring responses or calculating total scores. For established measures, make sure to consult the scoring manual.
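The two transformations can be sketched as follows. This assumes a hypothetical three-item scale answered on a 1-5 response format in which item Q2 is reverse-worded; the function names and item labels are illustrative only, and the scoring rules of a real measure should come from its manual.

```python
def reverse_score(response, low=1, high=5):
    """Reverse a response on a low-to-high scale (on 1-5, a 2 becomes a 4)."""
    return low + high - response


def total_score(responses, reversed_items=()):
    """Sum item responses, reversing the items named in reversed_items first."""
    total = 0
    for item, response in responses.items():
        if item in reversed_items:
            response = reverse_score(response)
        total += response
    return total


# Hypothetical answers from one respondent.
answers = {"Q1": 4, "Q2": 2, "Q3": 5}
print(total_score(answers, reversed_items={"Q2"}))  # 4 + 4 + 5 = 13
```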
Consult the various statistical resources included here to determine how you need to construct your final data set.
Resources:
Pallant, J. (2013). SPSS Survival Manual (5th ed.). New York, NY: McGraw-Hill. (For a discussion of data management and setup, consult chapter 5.)
William Trochim’s Data Preparation webpage: http://www.socialresearchmethods.net/kb/statprep.php