diff --git a/R/chapter2.qmd b/R/chapter2.qmd
index 9353faf..bec3c12 100644
--- a/R/chapter2.qmd
+++ b/R/chapter2.qmd
@@ -2,7 +2,7 @@
 title: "Real-World Machine Learning"
 subtitle: "Chapter 2"
 author: "Paul Adamson"
-date: "December 7, 2016"
+date: "August 21, 2022"
 format:
   html:
     toc: true
@@ -33,7 +33,10 @@ the categorical `maritalstatus` variable is of class `factor` with levels `singl
 `married`. Then, a new dataframe is created using the `model.matrix` function to convert the `factor` variable to dummy variables. This approach isn't really needed, though, since [R uses factor vectors to represent dummy and categorical data](https://bookdown.org/carillitony/bailey/chp6.html#dummy-variables-in-r). For more discussion on the topic of factor variables in R, see Amelia McNamara and Nicholas Horton’s paper, [Wrangling categorical data in R](https://peerj.com/preprints/3163/). We will avoid the explicit use of dummy variables in the remainder of the rwml-R project.
-```{r listing2.1}
+The [`kable`](https://bookdown.org/yihui/rmarkdown-cookbook/kable.html) function is built into the `knitr`
+package for generating simple, yet elegant tables of data.
+
+```{r figure2.4}
 personData <- data.frame(
   person = 1:2,
   name = c("Jane Doe", "John Smith"),
@@ -43,14 +46,17 @@ personData <- data.frame(
 )
 
 kable(personData)
-str(personData)
+```
 
+```{r listing2.1}
 personDataNew <- data.frame(personData[,1:4],
-  model.matrix(~ maritalstatus - 1,
-  data = personData))
+  model.matrix(~ maritalstatus - 1,
+  data = personData))
+
+```
 
+```{r figure2.6}
 kable(personDataNew)
-str(personDataNew)
 ```
 
 In the call to `model.matrix`, the −1 in the model formula
diff --git a/R/chapter3.qmd b/R/chapter3.qmd
index 40fda0e..2b97219 100644
--- a/R/chapter3.qmd
+++ b/R/chapter3.qmd
@@ -2,7 +2,7 @@
 title: "Real-World Machine Learning"
 subtitle: "Chapter 3"
 author: "Paul Adamson"
-date: "December 7, 2016"
+date: "August 21, 2022"
 format:
   html:
     toc: true
@@ -13,19 +13,20 @@ This file contains R code to accompany Chapter 3 of the book
 by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. The code was contributed by [Paul Adamson](http://github.com/padamson).
 
-*NOTE: working directory should be set to this file's location.*
+*REMINDER: update `project_dir` below to execute code as interactive code cells*
 
 ```{r setup, include=FALSE}
 library(knitr)
 knitr::opts_chunk$set(echo = TRUE)
-library(plyr)
-library(dplyr)
+project_dir <- file.path(Sys.getenv("HOME"), "projects/github-padamson/rwml-R")
+setwd(file.path(project_dir, "R"))
+library(tidyverse)
+library(gridExtra)
 library(vcd)
 library(AppliedPredictiveModeling)
 library(caret)
 library(ellipse)
 library(kknn)
-library(gridExtra)
 library(grid)
 library(randomForest)
 set.seed(3456)
@@ -34,21 +35,30 @@ set.seed(3456)
 
 ## Figure 3.4 A subset of the Titanic Passengers dataset
 
-We are going to be interested in predicting survival, so it is useful to specify
-the `Survived` variable to be of type `factor`. For visualizing the data,
-it is also useful to use the `revalue` function to specify the `no` and `yes`
-levels for the `factor` variable. The `kable` function is built into the `knitr`
-package.
+We use the same steps as in Chapter 2 to read in and tidy the
+titanic data.
 
 ```{r figure3.4, cache=TRUE}
 titanic <- read.csv("../data/titanic.csv",
                     colClasses = c(
                       Survived = "factor",
+                      Sex = "factor",
                       Name = "character",
                       Ticket = "character",
-                      Cabin = "character"))
-titanic$Survived <- revalue(titanic$Survived, c("0"="no", "1"="yes"))
-kable(head(titanic, 6), digits=2)
+                      Cabin = "character")) |>
+  mutate(
+    Survived = fct_recode(Survived,
+      "no" = "0",
+      "yes" = "1"
+    )
+  ) |>
+  separate(Cabin, into = "firstCabin", sep = " ", extra = "drop", remove = FALSE) |>
+  separate(firstCabin, into = c("cabinChar", "cabinNum"), sep = 1) |>
+  rowwise() |>
+  mutate(numCabins = length(unlist(strsplit(Cabin, " ")))) |>
+  ungroup()
+
+kable(head(titanic[c("PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked")], 6), digits=4)
 ```
 
 ## Figure 3.5 Mosaic plot for Titanic data: Gender vs. survival
diff --git a/README.md b/README.md
index 79bd3ee..d4ab1a6 100644
--- a/README.md
+++ b/README.md
@@ -13,8 +13,8 @@ The [renv](https://rstudio.github.io/renv/) package is used to manage a **r**epr
 
 ## Tidyverse workflow
 
-The [`tidyverse`](https://www.tidyverse.org) collection of packages is used for (``a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy'')[https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/]. The main exceptions to this philosophy are the use of [data.table](https://github.com/Rdatatable/data.table) for larger data (> 2 Gb), ["Visualizing Categorical Data" (`vcd`)](https://cran.r-project.org/web/packages/vcd/vcd.pdf)
-for exploring categorical data (including mosaic plots), and (`gridExtra`)[https://cran.r-project.org/package=gridExtra]
+The [`tidyverse`](https://www.tidyverse.org) collection of packages is used for ["a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy"](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/). The main exceptions to this philosophy are the use of [data.table](https://github.com/Rdatatable/data.table) for larger data (> 2 Gb), ["Visualizing Categorical Data" (`vcd`)](https://cran.r-project.org/web/packages/vcd/vcd.pdf)
+for exploring categorical data (including mosaic plots), and [`gridExtra`](https://cran.r-project.org/package=gridExtra)
 for combining and organizing plots.
 
 ## Quarto publishing system