Macquarie University

20-21 February 2018

9 am - 4:30 pm

Instructors: Adela Sobotkova, Ingrid Tarr, Tünde Szalay, Vince Polito

Helpers: Richard Miller, Iryna Levchenko, Petra Janouchova, Peter Humburg, Peter Ha

General Information

Thank you for your interest in this Data Carpentry workshop.

The ever-increasing digital nature of research requires researchers, postgraduate students, and research-support staff to equip themselves with the skills to create, manipulate and manage data in digital format. This can involve complex research data management techniques.

Today, researchers and students can carry out data management tasks, from simple to complex, with open-source tools and techniques that do not require highly specialist skills.

This workshop will help researchers, postgraduate students, and research-support staff learn more about such tools and techniques.

The workshop is organised and funded as part of the Macquarie eResearch Programme in cooperation with Software Carpentry.

Workshop Aims

This workshop aims to provide a broad introduction to data organisation in spreadsheets, data cleaning with OpenRefine, and data analysis and visualisation in R.

Who: The course is aimed at graduate students and other researchers. You don't need to have any previous knowledge of the tools that will be presented at the workshop.

Where: Active Learning Spaces in C5A (430-435), Macquarie University. Get directions with OpenStreetMap or Google Maps.

When: 20-21 February 2018. Add to your Google Calendar.

Requirements: Participants must bring a laptop with a Mac, Linux, or Windows operating system (not a tablet, Chromebook, etc.) that they have administrative privileges on. They should have a few specific software packages installed (listed below). They are also required to abide by Data Carpentry's Code of Conduct.

Accessibility: We are committed to making this workshop accessible to everybody. The workshop organisers have checked that:

Materials will be provided in advance of the workshop, and large-print handouts are available on request if you notify the organisers in advance. If we can make learning easier for you (e.g. sign-language interpreters, lactation facilities), please get in touch (using the contact details below) and we will attempt to provide them.

Contact: Please email for more information.



Please be sure to complete these surveys before and after the workshop.

Pre-workshop Survey

Post-workshop Survey

20 February

09:00 Data organization in spreadsheets
10:30 Coffee
12:30 Lunch (catered)
13:00 OpenRefine for data cleaning
14:30 Coffee
16:30 Wrap-up

21 February

09:00 Introduction to R
10:30 Coffee
12:30 Lunch (catered)
13:00 Data analysis and visualization in R
14:30 Coffee
16:30 Wrap-up

We will use this collaborative document for chatting, taking notes, and sharing URLs and bits of code.


Data Carpentry

Data Organisation in Spreadsheets

  • Data organisation and management
  • Good data formatting practices
  • Avoiding formatting mistakes
  • Quality control and data manipulation in spreadsheets

Data Cleaning with OpenRefine

  • Introduction to OpenRefine
  • Importing data
  • Basic functions
  • Advanced functions
  • Reference...

Introduction to R

  • R and RStudio
  • Reproducibility in R

Starting with data in R

  • Describe what a data frame is.
  • Load external data from a .csv file into a data frame in R.
  • Summarize the contents of a data frame in R.
  • Manipulate categorical data in R.
  • Change how character strings are handled in a data frame.
  • Format dates in R.
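
As a taste of what these objectives look like in practice, here is a minimal sketch in base R. The surveys data frame, its column names, and its values are invented for illustration and are not the workshop dataset; the .csv file is written to a temporary location only so that there is something to load.

```r
# Hypothetical survey records, invented purely for illustration.
surveys <- data.frame(
  species = c("DM", "DO", "DM", "PE"),
  weight  = c(40, 48, 44, 21),
  date    = c("2018-02-20", "2018-02-20", "2018-02-21", "2018-02-21"),
  stringsAsFactors = FALSE
)

# Write the records to a temporary .csv file, then load them back
# into a data frame, keeping text columns as character strings.
csv_path <- tempfile(fileext = ".csv")
write.csv(surveys, csv_path, row.names = FALSE)
surveys <- read.csv(csv_path, stringsAsFactors = FALSE)

# Summarize the contents of the data frame.
str(surveys)      # column names, types, and first values
summary(surveys)  # per-column summaries

# Treat species as categorical data (a factor) and list its levels.
surveys$species <- factor(surveys$species)
levels(surveys$species)

# Convert the text dates to the Date class, then print them in another format.
surveys$date <- as.Date(surveys$date, format = "%Y-%m-%d")
format(surveys$date, "%d %B %Y")
```

Setting stringsAsFactors = FALSE leaves the decision about which columns are categorical to you, which is why species is converted to a factor explicitly.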

Data aggregation with dplyr

  • Select certain columns in a data frame with the dplyr function select.
  • Select certain rows in a data frame according to filtering conditions with the dplyr function filter.
  • Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.
  • Add new columns to a data frame that are functions of existing columns with mutate.
  • Use the split-apply-combine concept for data analysis. Use summarize, group_by, and tally to split a data frame into groups of observations, apply a summary statistic to each group, and then combine the results.
  • Reshape a data frame from long to wide format and back with the spread and gather commands from the tidyr package.
  • Export a data frame to a .csv file.
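
Several of the objectives above can be sketched in one short pipeline. This is only an illustrative sketch: it assumes the dplyr package is installed, and the surveys data frame, its columns, and the 30 g threshold are hypothetical rather than taken from the workshop materials. Reshaping with spread and gather is not shown here.

```r
library(dplyr)

# Hypothetical survey records, invented purely for illustration.
surveys <- data.frame(
  species = c("DM", "DO", "DM", "PE", "DO"),
  sex     = c("F", "M", "M", "F", "F"),
  weight  = c(40, 48, 44, 21, 52),  # grams
  stringsAsFactors = FALSE
)

mean_weights <- surveys %>%
  filter(weight > 30) %>%                   # keep rows that meet a condition
  select(species, weight) %>%               # keep only the columns we need
  mutate(weight_kg = weight / 1000) %>%     # add a column derived from another
  group_by(species) %>%                     # split into groups
  summarize(mean_weight_kg = mean(weight_kg),  # apply a summary per group
            n = n())                           # and count rows per group

# Export the combined result to a .csv file (a temporary path here).
out_path <- tempfile(fileext = ".csv")
write.csv(mean_weights, out_path, row.names = FALSE)
```

Each step receives the output of the previous one through %>%, so the pipeline reads top to bottom in the same order as split-apply-combine.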

Data visualization with ggplot2

  • Produce scatter plots, boxplots, and time series plots using ggplot.
  • Set universal plot settings.
  • Describe what faceting is and apply faceting in ggplot.
  • Modify the aesthetics of an existing ggplot plot (including axis labels and color).
  • Build complex and customized plots from data in a data frame.

Geospatial Data with R - Optional

  • Introduction to Maps and Projections.
  • Intro to Raster Data in R.
  • Plot Raster Data in R.
  • Open and Plot Shapefiles in R.
  • Plot by Shapefile Attributes.


To participate in a Data Carpentry workshop, you will need access to the software described below. In addition, you will need an up-to-date web browser.

We maintain a list of common installation issues on the Configuration Problems and Solutions wiki page. It is written as a reference for instructors, but you may find it useful too.


R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis. To interact with R, we use RStudio.


Windows

Video Tutorial

Install R by downloading and running this .exe file from CRAN. Also, please install the RStudio IDE. Note that if you have separate user and admin accounts, you should run the installers as administrator (right-click on the .exe file and select "Run as administrator" instead of double-clicking). Otherwise, problems may occur later, for example when installing R packages.

Mac OS X

Video Tutorial

Install R by downloading and running this .pkg file from CRAN. Also, please install the RStudio IDE.


Linux

You can download the binary files for your distribution from CRAN. Alternatively, you can use your package manager (e.g. for Debian/Ubuntu run sudo apt-get install r-base, and for Fedora run sudo yum install R). Also, please install the RStudio IDE.