Obtaining Your Data
The first step in data analysis is getting to know your data: learning the variable names, the variable definitions, and which values are valid in each field.
Data From an Outside Party
If you discover a database that is already available for you to use, great! Just don’t forget that using existing databases will still require extra work from you. Some sources provide data online and only require you to specify which data you need. Other sources require a personal request and a confidentiality agreement before the data will be released to you.
Sometimes the process of obtaining data from outside sources can be a bit daunting. Don’t forget a few key items:
- Be sure you request the data in a format you will be able to use. If the provider cannot supply the data in a format you can read on your computer, check whether other resources are available to you to convert the data into the format you need. Some common formats include:
- csv - Comma separated values text file format (ASCII)
- mdb - Microsoft® Access database file format
- sas - SAS file format
- sav - SPSS file format
- txt - Text file format (ASCII)
- xml - XML file format
- xls - Microsoft Excel® file format
- Requesting the data dictionary, a document containing variable definitions, is an absolute must. Before using the data, you should understand the definitions well. A data dictionary should describe:
- the definition of each variable
- the valid values of the variable
- how missing values are represented
- the data type (numeric or character)
- and the data format (the length for character fields; integer, long, float, etc. for numeric fields).
- Don’t forget to address permissions issues, such as ensuring that only the individuals who were granted access under the confidentiality agreement actually have access to the data.
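The data dictionary items listed above (valid values, missing-value codes, and data types) can be checked against the file you receive. The following is a minimal sketch, assuming the data arrives as a csv file. The variable names (`id`, `age`, `sex`), the valid ranges, and the missing-value code `-9` are hypothetical and would come from the provider's own data dictionary.

```python
import csv
import io

# Hypothetical data dictionary: one entry per variable.
# "type" is the expected data type, "valid" the allowed values (None = no check),
# and "missing" the code the provider uses for a missing value.
DATA_DICTIONARY = {
    "id": {"type": int, "valid": None, "missing": ""},
    "age": {"type": int, "valid": range(0, 121), "missing": "-9"},
    "sex": {"type": str, "valid": {"F", "M"}, "missing": ""},
}


def check_rows(lines, dictionary):
    """Return a list of (row_number, variable, problem) tuples."""
    problems = []
    reader = csv.DictReader(lines)
    # Report variables that the dictionary describes but the file lacks.
    for name in dictionary:
        if name not in (reader.fieldnames or []):
            problems.append((0, name, "column not in file"))
    for row_number, row in enumerate(reader, start=1):
        for name, spec in dictionary.items():
            raw = row.get(name)
            if raw is None or raw == spec["missing"]:
                continue  # no value to check, or a coded missing value
            try:
                value = spec["type"](raw)
            except ValueError:
                problems.append((row_number, name, f"not {spec['type'].__name__}: {raw!r}"))
                continue
            if spec["valid"] is not None and value not in spec["valid"]:
                problems.append((row_number, name, f"invalid value: {raw!r}"))
    return problems


# Made-up sample data standing in for a delivered file.
SAMPLE = "id,age,sex\n1,34,F\n2,-9,M\n3,140,X\n4,abc,F\n"
for problem in check_rows(io.StringIO(SAMPLE), DATA_DICTIONARY):
    print(problem)
```

For a real file, pass an open file object (for example `open("data.csv", newline="")`, where the file name is a placeholder) in place of the `io.StringIO` sample. Each reported problem is a question to take back to the data provider rather than something to correct silently.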
If you will be using new data that you or your office collected, make sure you understand the database and how the data is stored. Again, you will need to assess the usability of its current format, the variable definitions, and any confidentiality issues.
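A quick profile of each variable can help you get a first look at how a file is actually stored. The sketch below assumes a csv file with a header row and treats empty strings as blanks. The column names `site` and `visits` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def profile(lines, max_distinct=10):
    """Summarize each column: filled count, blank count, and most common values."""
    reader = csv.DictReader(lines)
    counters = {name: Counter() for name in reader.fieldnames or []}
    for row in reader:
        for name in counters:
            # Short rows yield None; count those as blanks too.
            counters[name][row.get(name) or ""] += 1
    summary = {}
    for name, counter in counters.items():
        blanks = counter.pop("", 0)
        summary[name] = {
            "filled": sum(counter.values()),
            "blank": blanks,
            "distinct": len(counter),
            "top_values": counter.most_common(max_distinct),
        }
    return summary


# Made-up sample data; the second row has a blank "visits" field.
SAMPLE_ROWS = "site,visits\nA,3\nB,\nA,5\n"
for name, stats in profile(io.StringIO(SAMPLE_ROWS)).items():
    print(name, stats)
```

Comparing this summary with the variable definitions is a simple way to spot undocumented codes or unexpected blanks before any analysis begins.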