Getting and Cleaning Data Course Syllabus
Full curriculum breakdown: modules, lessons, estimated time, and outcomes.
This course introduces acquiring, cleaning, and transforming real-world data using R. Designed for beginners in data science, it emphasizes practical skills for handling diverse data formats and sources such as APIs, databases, and web pages. You'll learn how to convert raw data into tidy, analysis-ready datasets while applying principles of reproducibility and documentation. With approximately 19 hours of content, the course combines hands-on exercises with real-world examples to build foundational data preparation skills.
Module 1: Introduction and Getting Raw Data
Estimated time: 2 hours
- Understanding the difference between raw and tidy data
- Downloading and reading data from local and online sources
- Introduction to data.table for fast data manipulation
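A minimal sketch of the download-and-read step from this module, in base R. The helper name `fetch_if_missing`, the placeholder URL, and the sample data are assumptions made for illustration, not course material. The file is created locally first, so the download branch is skipped and nothing is fetched over the network.

```r
# Hypothetical helper: download a file only when no local copy exists,
# so re-running the script does not fetch the data again.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

# Stand-in for a previously downloaded file (illustrative data).
local_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, city = c("Oslo", "Lima", "Pune")),
  local_path,
  row.names = FALSE
)

# The URL is a placeholder; it is never contacted because the file exists.
path <- fetch_if_missing("https://example.com/data.csv", local_path)
cities <- read.csv(path, stringsAsFactors = FALSE)
str(cities)
```

Checking for an existing copy before downloading keeps repeated runs of a script fast and avoids silently replacing data that an analysis already depends on.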
Module 2: Reading and Cleaning Data
Estimated time: 1 hour
- Accessing data from MySQL databases and web APIs
- Importing and handling data in multiple formats (Excel, XML, JSON)
- Preprocessing steps, including trimming, renaming, and filtering
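The preprocessing steps listed above (trimming, renaming, filtering) might look like the following in base R. This is a sketch on made-up data; the data frame `survey` and its columns are assumptions for illustration only.

```r
# Illustrative raw data with untidy column names and stray whitespace.
survey <- data.frame(
  "Subject ID" = c(1, 2, 3, 4),
  "Group " = c(" control", "treated ", "control", " treated "),
  "Score" = c(10.5, NA, 7.2, 9.9),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Renaming: lower-case, trimmed, underscore-separated column names.
names(survey) <- gsub(" ", "_", tolower(trimws(names(survey))))

# Trimming: strip leading and trailing whitespace from text values.
survey$group <- trimws(survey$group)

# Filtering: keep treated subjects that have a recorded score.
treated <- survey[survey$group == "treated" & !is.na(survey$score), ]
print(treated)
```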
Module 3: Data Tidying and Transformation
Estimated time: 10 hours
- Reshaping data using functions like melt and dcast
- Merging datasets using merge
- Dealing with missing values and inconsistent formatting
- Practical cleaning and transformation with real-world datasets
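As a rough illustration of the reshaping and merging topics in this module, the sketch below uses `melt`, `dcast`, and `merge` from the data.table package (assumed to be installed). The tables `wide` and `subjects` and their values are invented for the example.

```r
library(data.table)

# Illustrative wide data: one row per subject, one column per visit.
wide <- data.table(
  id = c(1, 2, 3),
  visit1 = c(5.1, 6.0, NA),
  visit2 = c(5.4, 6.3, 7.0)
)

# melt: wide to long, one row per subject-visit pair.
long <- melt(
  wide,
  id.vars = "id",
  measure.vars = c("visit1", "visit2"),
  variable.name = "visit",
  value.name = "score"
)

# Missing values: drop measurements that were never recorded.
long <- long[!is.na(score)]

# dcast: long back to wide, one column per visit.
wide_again <- dcast(long, id ~ visit, value.var = "score")

# merge: attach subject attributes by the shared key column.
subjects <- data.table(id = c(1, 2, 3), group = c("a", "b", "a"))
merged <- merge(long, subjects, by = "id")
print(merged)
```

The long form is usually the easier one to filter and join; `dcast` restores a wide layout when a report or model needs one column per measurement.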
Module 4: Reproducible Research and Final Project
Estimated time: 6 hours
- Writing clean, reproducible code for data workflows
- Creating R scripts and markdown documentation for analysis
- Final project to demonstrate cleaning, transforming, and documenting data
Prerequisites
- Basic knowledge of R programming
- Familiarity with fundamental programming concepts
- Access to an R and RStudio environment
What You'll Be Able to Do After
- Acquire data from sources such as web pages, APIs, databases, and flat files
- Clean and reshape datasets into tidy formats ready for analysis
- Perform data manipulation using R and essential libraries like data.table
- Work with different file formats: CSV, XML, JSON, Excel, HDF5
- Apply principles of reproducible research in data processing workflows