Concept of Tidy Data, Vectors & Pivoting

CVEN 5837 - Summer 2023

Lars Schöbitz

Exam Update

  • The exam in Week 8 is cancelled
  • The due date for the Capstone Project is extended to 28th July
  • The Capstone Project now contributes 40% to the grade instead of 20%
Type                  Percent
Homework assignments  40
Learning reflections  20
Capstone project      40

Learning Objectives (for this week)

  1. Learners can apply functions from the tidyr R Package to transform their data from a wide to a long format and vice versa.
  2. Learners can list the four main atomic vector types in R.
  3. Learners can explain the three characteristics of tidy data.
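
As a minimal sketch for the second objective, the four main atomic vector types can be inspected with typeof(). The vector names and values below are made up for illustration and are not part of the course data.

```r
# The four main atomic vector types, each built with c()
lgl <- c(TRUE, FALSE, NA)
int <- c(1L, 5L, 10L)
dbl <- c(0, 7.5, 10)
chr <- c("0", "5 to 10", "See comment")

typeof(lgl)  # "logical"
typeof(int)  # "integer"
typeof(dbl)  # "double"
typeof(chr)  # "character"

# Mixing types in one vector coerces everything to the most flexible type,
# so a single text entry turns a whole column of numbers into text.
mixed <- c(0, 10, "See comment")
typeof(mixed)  # "character"
```

The last lines hint at what happens in the survey example: one free-text answer is enough to make the whole column a character vector.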

Part 1: Data types and vectors

Why care about data types?


Example: survey data

id  job       price_glass
1   Student   0
2   Retired   0
3   Other     0
4   Employed  10
5   Employed  See comment
6   Student   05-Oct
7   Student   0
8   Retired   0
9   Student   10
10  Employed  0
11  Employed  20 (2chf per person with 10 people in the WG)
12  Student   10
13  Student   10
14  Employed  0
15  Student   10
16  Student   0
17  Employed  5 to 10
18  Other     0
19  Student   0
20  Employed  10
21  Employed  0
22  Employed  5

Oh why won’t you work?!

survey_data_small |> 
  summarise(mean_price_glass = mean(price_glass))
# A tibble: 1 × 1
  mean_price_glass
             <dbl>
1               NA

Oh why won’t you still work??!!

survey_data_small |> 
  summarise(mean_price_glass = mean(price_glass, na.rm = TRUE))
# A tibble: 1 × 1
  mean_price_glass
             <dbl>
1               NA
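
A small sketch of why na.rm = TRUE makes no difference here. It uses a hypothetical four-element vector as a stand-in for the price_glass column, not the survey data itself.

```r
# Hypothetical stand-in for the price_glass column: text, not numbers
price_glass <- c("0", "10", "See comment", "5 to 10")

typeof(price_glass)  # "character"

# mean() cannot average text: it returns NA and emits a warning
# ("argument is not numeric or logical: returning NA").
# na.rm = TRUE only drops missing values, and none of these entries is NA.
result <- suppressWarnings(mean(price_glass, na.rm = TRUE))
result              # NA
anyNA(price_glass)  # FALSE
```

The NA in the summary is therefore a symptom of the column type, not of missing values.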

Take a breath and look at your data

Very common data tidying step!

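survey_data_small |> 
  mutate(price_glass_new = case_when(
    # a range: use the midpoint
    price_glass == "5 to 10" ~ "7.5",
    # treated the same as the "5 to 10" range
    price_glass == "05-Oct" ~ "7.5",
    # free-text answer that adds up to 20
    str_detect(price_glass, pattern = "2chf") == TRUE ~ "20",
    # no usable value: set to missing
    str_detect(price_glass, pattern = "See comment") == TRUE ~ NA_character_,
    # everything else stays as it is
    TRUE ~ price_glass
  ))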

Very common data tidying step!

id  job       price_glass_new  price_glass
1   Student   0                0
2   Retired   0                0
3   Other     0                0
4   Employed  10               10
5   Employed  NA               See comment
6   Student   7.5              05-Oct
7   Student   0                0
8   Retired   0                0
9   Student   10               10
10  Employed  0                0
11  Employed  20               20 (2chf per person with 10 people in the WG)
12  Student   10               10
13  Student   10               10
14  Employed  0                0
15  Student   10               10
16  Student   0                0
17  Employed  7.5              5 to 10
18  Other     0                0
19  Student   0                0
20  Employed  10               10
21  Employed  0                0
22  Employed  5                5

Summarise? Argh!!!!

survey_data_small |> 
  mutate(price_glass_new = case_when(
    price_glass == "5 to 10" ~ "7.5",
    price_glass == "05-Oct" ~ "7.5",
    str_detect(price_glass, pattern = "20") == TRUE ~ "20",
    str_detect(price_glass, pattern = "See comment") == TRUE ~ NA_character_,
    TRUE ~ price_glass
  )) |> 
  summarise(mean_price_glass = mean(price_glass_new, na.rm = TRUE))
# A tibble: 1 × 1
  mean_price_glass
             <dbl>
1               NA

Always respect your data types!

Taking the mean of a vector of type “character” is not possible: mean() returns NA with a warning, whatever the value of na.rm.

# A tibble: 22 × 4
      id job      price_glass price_glass_new
   <int> <chr>    <chr>       <chr>          
 1     1 Student  0           0              
 2     2 Retired  0           0              
 3     3 Other    0           0              
 4     4 Employed 10          10             
 5     5 Employed See comment <NA>           
 6     6 Student  05-Oct      7.5            
 7     7 Student  0           0              
 8     8 Retired  0           0              
 9     9 Student  10          10             
10    10 Employed 0           0              
# ℹ 12 more rows

Always respect your data types!

survey_data_small |> 
  mutate(price_glass_new = case_when(
    price_glass == "5 to 10" ~ "7.5",
    price_glass == "05-Oct" ~ "7.5",
    str_detect(price_glass, pattern = "20") == TRUE ~ "20",
    str_detect(price_glass, pattern = "See comment") == TRUE ~ NA_character_,
    TRUE ~ price_glass
  )) |> 
  mutate(price_glass_new = as.numeric(price_glass_new)) |> 
  summarise(mean_price_glass = mean(price_glass_new, na.rm = TRUE))
# A tibble: 1 × 1
  mean_price_glass
             <dbl>
1             4.76
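
The conversion step can also be tried in isolation. The sketch below assumes a hypothetical vector of already cleaned values that are still stored as text. It shows that as.numeric() changes the type and that text which does not look like a number becomes NA.

```r
# Hypothetical cleaned values, still stored as text
price_glass_new <- c("0", "7.5", "20", NA)

price_numeric <- as.numeric(price_glass_new)
typeof(price_numeric)  # "double"

# Now the mean can be computed; the missing value is dropped by na.rm = TRUE
mean_price <- mean(price_numeric, na.rm = TRUE)

# Text that does not look like a number becomes NA (with a warning),
# which is why the free-text answers are recoded before converting.
not_a_number <- suppressWarnings(as.numeric("5 to 10"))
is.na(not_a_number)  # TRUE
```

Recoding first and converting second keeps control over which answers end up as NA.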

Live Coding Exercise: Vectors


  1. Head over to
  2. Open the workspace for the course (cven5837-ss23)
  3. Open your “course-materials” project
  4. Follow along with me

Break One


Part 2: tidyr - long and wide formats




A grammar of data tidying

The goal of tidyr is to help you tidy your data via

  • pivoting for going between wide and long data
  • splitting and combining character columns
  • nesting and unnesting columns
  • clarifying how NAs should be treated

Pivoting data

Waste characterisation data

Three variables -> three aesthetics

ggplot(data = waste_data_tidy,
       mapping = aes(x = objid, 
                     y = weight, 
                     fill = waste_category)) +
  geom_col() + 
  scale_fill_brewer(type = "qual")

How to

waste_category_levels <- c("glass", "metal_alu", "paper", "pet", "other")

waste_data_untidy |> 
  pivot_longer(cols = pet:other,
               names_to = "waste_category",
               values_to = "weight") |> 
  mutate(waste_category = factor(waste_category, levels = waste_category_levels))

Three variables -> three aesthetics

ggplot(data = waste_data_tidy,
       mapping = aes(x = objid, 
                     y = weight, 
                     fill = waste_category)) +
  geom_col() + 
  scale_fill_brewer(type = "qual")
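
To try pivoting in both directions without the course data, here is a minimal sketch on a made-up wide table. It assumes the tidyr package is installed. The column names mirror the slides, but the table name waste_wide and all weights are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one row per bin, one column per waste category
waste_wide <- data.frame(
  objid = c(900, 899),
  pet   = c(0.06, 0.16),
  glass = c(0.00, 0.38),
  other = c(1.22, 2.40)
)

# Wide -> long: category names go into one column, weights into another
waste_long <- waste_wide |>
  pivot_longer(cols = pet:other,
               names_to = "waste_category",
               values_to = "weight")

# Long -> wide: the inverse step recovers the original layout
waste_wide_again <- waste_long |>
  pivot_wider(names_from = waste_category,
              values_from = weight)
```

The long version has one row per bin and category (2 bins x 3 categories = 6 rows), which is the shape that maps directly onto the three aesthetics in the plot.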

Live Coding Exercise: Pivoting


  1. Head over to
  2. Open the workspace for the course (cven5837-ss23)
  3. Open your “course-materials” project
  4. Follow along with me

Homework week 5

Homework due dates

  • All material on course website
  • Homework assignment & learning reflection due: Friday, 7th July

Thanks! 🌻

All material is licensed under Creative Commons Attribution Share Alike 4.0 International.


Ben Aleya, Ali, Daniel Biek, Lin Boynton, Julia Jaeggi, Sebastian Camilo Loos, Chiara Meyer-Piening, Jonathan Olal Ogwang, et al. 2022. “Research Beyond the Lab, Spring Term 2022, Global Health Engineering, ETH Zurich. Raw Data and Analysis-Ready Derived Data on Waste Management in Public Spaces in Zurich, Switzerland.” Zenodo.