Introduction

Last updated on 2023-05-09

Estimated time: 10 minutes

Overview

Questions

  • How is OpenRefine useful?

Objectives

  • Describe OpenRefine’s uses and applications.
  • Differentiate data cleaning from data organization.
  • Experiment with OpenRefine’s user interface.
  • Locate helpful resources to learn more about OpenRefine.

Lesson


Motivations for the OpenRefine Lesson


  • Data is often very messy, and OpenRefine saves a lot of time when cleaning it.

  • Data cleaning steps often need repeating across multiple files, so it is important to know what you did to your data. A record of those steps makes it easy to repeat them on similarly structured data. OpenRefine speeds up repetitive tasks by replaying previous actions on multiple datasets.

  • Additionally, journals, granting agencies, and other institutions are requiring documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them with your publication as supplemental material.

  • Any operation that changes the data in OpenRefine can easily be undone.

  • Some concepts such as clustering algorithms are quite complex, but OpenRefine makes it easy to introduce them, use them, and show their power.

    Note: You must export your modified dataset to a new file: OpenRefine does not save over the original source file. All changes are stored in the OpenRefine project.

Before we get started


Some setup is necessary before we can get started (see the setup instructions).

What is OpenRefine?


  • OpenRefine is a Java program that runs on your machine (not in the cloud): it is a desktop application that uses your web browser as a graphical interface. No internet connection is needed, and none of the data or commands you enter in OpenRefine are sent to a remote server.
  • OpenRefine does not modify your original dataset. All actions are easily reversed, and you can capture every action applied to your data and share this record with your publication as supplemental material.
  • OpenRefine saves as you go. You can return to the project at any time to pick up where you left off or export your data to a new file.
  • OpenRefine can be used to standardise and clean data across your file.

It can also help you:

  • Get an overview of a data set
  • Resolve inconsistencies in a data set
  • Split data up into more granular parts
  • Match local data up to other data sets
  • Enhance a data set with data from other sources
  • Save a set of data cleaning steps to replay on multiple files
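
As a rough illustration of the last point, a saved set of cleaning steps is a small structured file that can be read back, reviewed, or shared. The sketch below is not OpenRefine itself: it assumes the steps were exported as a JSON list in which each step carries an "op" and a "description" field, and the column names (scientificName, species) are invented for the example. Check a history exported from your own project before relying on these field names.

```python
import json

# Hypothetical excerpt of an exported list of cleaning steps.
# ASSUMPTION: the field names ("op", "description", ...) reflect what an
# exported history may look like; the column names are invented.
HISTORY_JSON = """
[
  {
    "op": "core/text-transform",
    "columnName": "scientificName",
    "expression": "value.trim()",
    "description": "Trim whitespace in column scientificName"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "scientificName",
    "newColumnName": "species",
    "description": "Rename column scientificName to species"
  }
]
"""


def summarize_steps(history_text):
    """Return one numbered, human-readable line per recorded step."""
    operations = json.loads(history_text)
    return [
        # Fall back to the operation id if a step has no description.
        f"{number}. {step.get('description', step.get('op', 'unknown step'))}"
        for number, step in enumerate(operations, start=1)
    ]


if __name__ == "__main__":
    for line in summarize_steps(HISTORY_JSON):
        print(line)
```

A plain-language summary like this could accompany a publication as supplemental documentation of what was done to the raw data.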

OpenRefine is a powerful, free, and open source tool with a large and growing community of practice. More help can be found at https://openrefine.org.

Features

  • Open source (source on GitHub).
  • A large and growing community, from novice to expert, ready to help.

More Information on OpenRefine

You can find out a lot more about OpenRefine in the official user manual at docs.openrefine.org. There is also a user forum where many beginner questions and problems are answered. Recipes, scripts, projects, and extensions are available to add functionality to OpenRefine; these can be copied into your OpenRefine instance to run on your dataset.

Key Points

  • OpenRefine is a powerful, free and open source tool that can be used for data cleaning.
  • OpenRefine will automatically track any steps you take in working with your data.