Summary and Schedule
A part of the data workflow is preparing the data for analysis. Some of this involves data cleaning, where errors in the data are identified and corrected or formatting is made consistent. This step must be taken with the same care and attention to reproducibility as the analysis.
OpenRefine (formerly Google Refine) is a powerful free and open source tool for working with messy data: cleaning it and transforming it from one format into another.
This lesson will teach you to use OpenRefine to effectively clean and format data and automatically track any changes that you make. Many people comment that this tool saves them literally months of work they would otherwise spend making these edits by hand.
Getting Started
Data Carpentry’s teaching is hands-on, so participants are encouraged to use their own computers to ensure the proper setup of tools for an efficient workflow.
These lessons assume no prior knowledge of the skills or tools.
To get started, follow the directions in the “Setup” tab to download data to your computer and follow any installation instructions.
For Instructors
If you are teaching this lesson in a workshop, please see the Instructor notes.
Setup Instructions | Download files required for the lesson | |
Duration: 00h 00m | 1. Introduction | What is OpenRefine useful for? |
Duration: 00h 10m | 2. Working with OpenRefine | How can we bring our data into OpenRefine? How can we sort and summarize our data? How can we find and correct errors in our raw data? |
Duration: 00h 45m | 3. Filtering and Sorting with OpenRefine | How can we select only a subset of our data to work with? How can we sort our data? |
Duration: 01h 05m | 4. Examining Numbers in OpenRefine | How can we convert a column from one data type to another? How can we find non-numeric values in a column that should contain numbers? |
Duration: 01h 25m | 5. Using scripts | How can we document the data-cleaning steps we’ve applied to our data? How can we apply these steps to additional data sets? |
Duration: 01h 45m | 6. Exporting and Saving Data from OpenRefine | How can we save and export our cleaned data from OpenRefine? |
Duration: 02h 00m | 7. Other Resources in OpenRefine | What other resources are available for working with OpenRefine? |
Duration: 02h 10m | Finish |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Data
The data for this lesson is a part of the Data Carpentry Social Sciences workshop. It is a teaching version of the Studying African Farmer-Led Irrigation (SAFI) database. The SAFI dataset represents interviews of farmers in two countries in eastern sub-Saharan Africa (Mozambique and Tanzania). These interviews were conducted between November 2016 and June 2017 and probed household features (e.g. construction materials used, number of household members), agricultural practices (e.g. water usage), and assets (e.g. number and types of livestock).
The data used in this lesson is a subset of the teaching version that has been intentionally ‘messed up’ for this lesson.
Download the data file to your computer.
Software
For this lesson you will need OpenRefine (formerly Google Refine) and a web browser. Basic installation steps are provided on this page. The OpenRefine installation manual provides more details about installation, upgrades and configuration.
Note: OpenRefine is a Java program that runs on your own machine (not in the cloud). Its interface opens in your browser, but no internet connection is needed for this lesson.
Callout
You do not need administrative rights on the computer to install OpenRefine. However, if anti-malware software blocks OpenRefine when you try to start it, you may need administrative rights to allow OpenRefine to run. OpenRefine is safe to run.
Windows
- Check that you have Firefox, Edge, Opera or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser. It will not run correctly in Internet Explorer.
- Download the software from openrefine.org.
- Unzip the downloaded file into a directory by right-clicking and selecting “Extract…”. Name that directory something like OpenRefine.
Callout
The path to the directory you extract the application files into should be short, because some of OpenRefine’s files have very long names. If the path is too long, OpenRefine cannot start.
- Go to your newly created OpenRefine directory.
- Launch OpenRefine by opening openrefine.exe. This will launch a command prompt window, but you can ignore that and wait for the browser to launch.
- If you see Internet Explorer start, or OpenRefine does not automatically open for you, point one of the supported browsers at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
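The note about short paths can be checked before launching. The following is a minimal sketch and not part of the OpenRefine documentation: it walks a directory and reports the longest full path it finds. The function name longest_path, the example folder C:\OpenRefine, and the 260-character threshold (the classic Windows path limit) are assumptions made for illustration; this lesson does not state the exact length at which OpenRefine fails to start.

```python
import os

# Classic Windows path limit, used here only as a rough, assumed threshold.
ASSUMED_LIMIT = 260


def longest_path(root):
    """Return the longest absolute path found under root (root itself if nothing is found)."""
    root = os.path.abspath(root)
    longest = root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            candidate = os.path.join(dirpath, name)
            if len(candidate) > len(longest):
                longest = candidate
    return longest


if __name__ == "__main__":
    # Hypothetical extraction folder; replace with wherever you unzipped OpenRefine.
    path = longest_path(r"C:\OpenRefine")
    print(len(path), path)
    if len(path) >= ASSUMED_LIMIT:
        print("Path may be too long; consider extracting to a shorter directory.")
```

If the reported length is close to the assumed limit, extracting the files into a directory nearer the drive root is the simplest remedy.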
Mac
- Check that you have Firefox, Edge, Opera or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser. It will not run correctly in Internet Explorer.
- Download the software from openrefine.org.
- Unzip the downloaded file into a directory by double-clicking it. Name that directory something like OpenRefine.
- Go to your newly created OpenRefine directory.
- Drag the OpenRefine app into the Applications folder.
- Launch OpenRefine: Control-click the app icon, then choose “Open” from the shortcut menu. For troubleshooting help, see the Apple support page.
- If you are using a different browser than listed above, or if OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
Linux
- Check that you have Firefox or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser.
- Download the software from openrefine.org.
- Unzip the downloaded file into a directory. Name that directory something like OpenRefine.
- Go to your newly created OpenRefine directory.
- Launch OpenRefine by typing ./refine into the terminal within the OpenRefine directory.
- If you are using a different browser than listed above, or if OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
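Because OpenRefine runs as a local server, the "point your browser at" step on any platform can also be scripted. This is a minimal sketch using only the Python standard library; the function name openrefine_is_up and the two-second timeout are illustrative choices, and the address is the default one given in this lesson.

```python
import urllib.error
import urllib.request
import webbrowser

# Default local address given in this lesson.
OPENREFINE_URL = "http://127.0.0.1:3333/"


def openrefine_is_up(url=OPENREFINE_URL, timeout=2.0):
    """Return True if something answers HTTP requests at url."""
    # Bypass any configured proxy, since the server is on this machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except OSError:
        # Connection refused, timeout, or similar: nothing is listening.
        return False


if __name__ == "__main__":
    if openrefine_is_up():
        webbrowser.open(OPENREFINE_URL)  # opens in the default browser
    else:
        print("OpenRefine does not appear to be running; start it first.")
```

If the check fails, start OpenRefine using the steps for your platform and run the script again.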