Welcome to Data Analytics for Development

CVEN 5837 - Summer 2023

Lars Schöbitz

Welcome! 👋

Meet the lecturer

Lars Schöbitz (he/him)

Headshot of Lars Schöbitz

  • Environmental Engineer
  • Open Science Specialist at ETH Zurich
  • Independent Instructor for Data Science with R
  • Twitter: @larnsce

Learning Goals (for the course)

  1. Be able to use a common set of data science tools (R, RStudio IDE, Git, GitHub, tidyverse, Quarto) to illustrate and communicate the utility of solutions for water, sanitation, air quality, and global health.

  2. Learn to use the Quarto file format and the RStudio IDE visual editing mode to produce scholarly documents with citations, footnotes, cross-references, figures, and tables.

Why are you here?

Pick an item

Take notes for 2 minutes:
What does the item you have picked have to do with the reason for you being here?

Why are you here?

In break-out rooms

Take 2 minutes each to share with your room partner:
What does the item you have picked have to do with the reason for you being here?

From which country are you joining us?

In the Zoom chat

Share with us from which country you are joining us.

Topics

  • The data science life-cycle
  • Data organization in spreadsheets
  • Exploratory data analysis using visualization
  • Concept of tidy data and data tidying
  • Data transformation and descriptive statistics
  • Data communication using the Quarto open-source scientific and technical publishing system

Learning Objectives (for this week)

  1. Learners can navigate the platforms (Posit Cloud, GitHub, Course Website) that are used for the course.
  2. Learners can render a Quarto file to an output file in HTML, PDF and DOCX format.
  3. Learners can list the six elements of the data science lifecycle.
  4. Learners can identify four components of a Quarto file (YAML, code chunk, R code, markdown).

Classroom tools

Live Coding Exercises

  • Instructor writes and narrates code out loud
  • Instructor explains elements and principles that are relevant
  • Code is displayed on second screen / split screen
  • Learners join by writing and executing the same code
  • Learners “code-along” with the instructor

Pair Programming Exercises

  • Two learners work together in a break-out session
  • One person (the driver) shares the screen and does the typing
  • The other person (the navigator) offers comments and suggestions
  • Roles get switched

Platforms and Tools

  • R
  • Posit Cloud
  • RStudio IDE
  • tidyverse R Packages
  • Quarto publishing system

cven5873-ss23.github.io/website/ 🔖

Posit Cloud

Screen setup

Who uses a setup with one screen?

Type “One screen” in the Zoom chat

Screen setup

Who uses a setup with two screens?

Type “Two screens” in the Zoom chat

Live Coding Exercise

live-01a-setup - Posit Cloud Setup

Follow along on the screen

  1. Open the GitHub organisation for the course: https://github.com/cven5873-ss23
  2. You will find a repository titled: course-material-USERNAME (with your GitHub Username)
  3. You will “clone” this repository to Posit Cloud

Break

GitHub PAT from week 1

Do you have your GitHub Personal Access Token readily accessible?

10:00

Version Control

Version Control with Git and GitHub

A way to share files with others, so they can:

  • download
  • re-use
  • contribute

You can view the history of files, and jump back in time to any point.

Why is it useful?

Git and GitHub

  • Git is software for version control
  • Released in 2005
  • Popular among programmers collaboratively developing code
  • Tracks changes in a set of files (directory/folder/repository)

  • GitHub is a hosting platform for version control using Git
  • Launched in 2008, acquired by Microsoft in 2018 for US$ 7.5 billion
  • 73 million users (February 2022)
  • Facebook for Software Developers

Version Control - Terminology

Data Science Lifecycle

Deep End

Live Coding Exercise

live-01b-data-science-lifecycle - Data Science Lifecycle

  1. Head over to posit.cloud
  2. Open the workspace for the course (cven5837-ss23)
  3. Open “Projects”
  4. Open the “course-materials-USERNAME” project
  5. Follow along with me

Break

05:00

R

Packages

base R

sqrt(49)
sum(1, 2)
  • Functions come with R

R Packages

library(dplyr)
  • Installed once in the Console: install.packages("dplyr")
  • Loaded per script

Functions & Arguments

library(dplyr)
library(gapminder)  # assumption: the gapminder data frame comes from the gapminder package

filter(.data = gapminder, 
       year == 2007)
  • Function: filter()
  • Argument: .data =
  • Arguments following: year == 2007 (what to do with the data)
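
The same split between a function and its arguments applies to any R function. As a minimal sketch, the example below uses base R's round(), which is chosen here only for illustration and is not part of the original slides. It shows that arguments can be matched by name or by position.

```r
# Function: round(); arguments: x (the number) and digits (decimal places)
by_name     <- round(x = 3.14159, digits = 2)
by_position <- round(3.14159, 2)

# Both calls pass the same arguments, so the results are identical
identical(by_name, by_position)
```

Naming arguments, as in .data = gapminder, is more to type but makes it clearer what each value is for.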

Objects

library(dplyr)
library(gapminder)  # assumption: the gapminder data frame comes from the gapminder package

gapminder_yr_2007 <- filter(.data = gapminder, 
                            year == 2007)
  • Function: filter()
  • Argument: .data =
  • Arguments following: year == 2007 (what to do with the data)
  • Object: gapminder_yr_2007

Operators

library(dplyr)
library(gapminder)  # assumption: the gapminder data frame comes from the gapminder package

gapminder_yr_2007 <- gapminder |> 
  filter(year == 2007) 
  • Function: filter()
  • Argument: .data = (supplied by the pipe, so it is not written out)
  • Arguments following: year == 2007 (what to do with the data)
  • Object: gapminder_yr_2007
  • Assignment operator: <-
  • Pipe operator: |>
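
As a rough sketch of what the pipe does: R rewrites x |> f(y) as f(x, y), so the object on the left becomes the first argument of the function on the right. The example below uses only base R functions, so it runs without loading any packages. It assumes R 4.1 or later, the version that introduced the native pipe.

```r
# Nested call: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped call: read from left to right, one step per line
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

nested  # 6
piped   # 6
```

Both versions compute the same value; the piped form tends to be easier to read once several steps are chained.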

Rules

Rules of dplyr functions:

  • First argument is always a data frame
  • Subsequent arguments say what to do with that data frame
  • Always return a data frame
  • Don’t modify in place
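
These rules can be checked with a small example. The sketch below uses a hypothetical toy data frame invented for illustration (it is not the gapminder data) and assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, made up for this example
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  year    = c(2002, 2007, 2002, 2007)
)

# First argument is the data frame; the second says what to do with it
toy_2007 <- filter(toy, year == 2007)

is.data.frame(toy_2007)  # TRUE: the result is again a data frame
nrow(toy_2007)           # 2: only the rows from 2007 are kept
nrow(toy)                # 4: the original object is unchanged
```

Because the original is never modified in place, the result has to be assigned to an object (here toy_2007) to be kept.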

Course information

Weekly Structure

Monday Lecture
Tuesday
Wednesday Feedback (grading) on assignments from previous week
Thursday Student hours on Zoom (10 am to 12 pm CEST)
Friday Homework assignment and learning reflection are due

Homework assignments

  • Weekly programming assignments
  • Graded as pass/fail (100 pts)
  • Submitted as rendered Quarto documents on GitHub
  • Weighted at 40% of the total grade

Learning reflections

  • Reflections on the different class elements (lecture, homework assignment, readings)
  • Graded as pass/fail (100 pts)
  • Minimum 100 words
  • Submitted as rendered Quarto documents on GitHub
  • Weighted at 20% of the total grade

Capstone Project

  • Data analysis project report with a data set of your choice
  • Graded as number of points out of 100 pts for pre-defined graded elements
  • Submitted as rendered Quarto document on GitHub
  • Weighted at 20% of the total grade

Exam

  • 2-hour exam assessing the technical skills taught during the course
  • Graded as number of points out of 100 pts for pre-defined graded elements
  • Submitted as rendered Quarto document, but not on GitHub
  • Weighted at 20% of the total grade
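
Taken together, the four weights stated in these slides (40%, 20%, 20%, 20%) add up to 100%. The sketch below shows how a total could be computed from them. The scores are hypothetical and invented for the example, and the slides do not state how weekly scores are averaged within a component, so each component is treated here as a single score out of 100.

```r
# Weights as stated in the slides
weights <- c(homework = 0.40, reflections = 0.20, capstone = 0.20, exam = 0.20)

# Hypothetical scores out of 100, invented for this example
scores <- c(homework = 100, reflections = 100, capstone = 80, exam = 90)

# Weighted total in percent: 40 + 20 + 16 + 18
total <- sum(scores * weights)
total  # 94
```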

Grading

Conversion from percent to grades (each value is the minimum percent for that grade).
Grade  Percent
A+     97
A      93
A-     90
B+     87
B      83
B-     80
C+     77
C      73
C-     70
D+     67
D      63
D-     60
F      0
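
The conversion can be written as a small lookup in base R. This is a sketch that assumes each value in the table is the lower bound for its grade; the function name percent_to_grade() is hypothetical and not part of the course material.

```r
# Lower bounds copied from the table, sorted in ascending order
cutoffs <- c(0, 60, 63, 67, 70, 73, 77, 80, 83, 87, 90, 93, 97)
grades  <- c("F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+")

# findInterval() returns the position of the highest cutoff
# that is less than or equal to each percent value
percent_to_grade <- function(percent) {
  grades[findInterval(percent, cutoffs)]
}

percent_to_grade(c(59, 60, 85, 93, 100))
# "F" "D-" "B" "A" "A+"
```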

Late work policy

  • Due dates are set and all work is due on the stated date
  • Work not submitted by the due date will receive 0 pts
  • The lowest score for each of the assignments or learning reflections is dropped

Homework week 1

Homework due dates

  • All material on course website
  • Homework assignment & learning reflection due: Friday, June 16th

Thanks! 🌻

Slides created via revealjs and Quarto: https://quarto.org/docs/presentations/revealjs/
Access slides as PDF on GitHub

All material is licensed under Creative Commons Attribution Share Alike 4.0 International.