padamson · Aug 22, 2022 · Aug 23, 2022
diff --git a/R/chapter2.qmd b/R/chapter2.qmd
@@ -2,7 +2,7 @@
 title: "Real-World Machine Learning"
 subtitle: "Chapter 2"
 author: "Paul Adamson"
-date: "December 7, 2016"
+date: "August 21, 2022"
 format:
   html:
     toc: true
@@ -33,7 +33,10 @@ the categorical `maritalstatus` variable is of class `factor` with levels `singl
 `married`. Then, a new dataframe is created using the `model.matrix` function to convert the `factor` variable to dummy variables. This approach isn't really needed, though, since [R uses factor vectors to represent dummy and categorical data](https://bookdown.org/carillitony/bailey/chp6.html#dummy-variables-in-r). For more discussion on the topic of factor variables in R, see Amelia McNamara and Nicholas Horton’s paper, [Wrangling categorical data in R](https://peerj.com/preprints/3163/). We will avoid the explicit
 use of dummy variables in the remainder of the rwml-R project.
 
-```{r listing2.1}
+The [`kable`](https://bookdown.org/yihui/rmarkdown-cookbook/kable.html) function, provided by the `knitr`
+package, generates simple yet elegant tables of data.
+
+```{r figure2.4}
 personData <- data.frame(
   person = 1:2, 
   name = c("Jane Doe", "John Smith"),
@@ -43,14 +46,17 @@ personData <- data.frame(
 )
 
 kable(personData)
-str(personData)
+```
 
+```{r listing2.1}
 personDataNew <- data.frame(personData[,1:4], 
-                         model.matrix(~ maritalstatus - 1, 
-                                      data = personData)) 
+                            model.matrix(~ maritalstatus - 1, 
+                                         data = personData)) 
+
+```
 
+```{r figure2.6}
 kable(personDataNew)
-str(personDataNew)
 ```
 
 In the call to `model.matrix`, the −1 in the model formula 

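As a rough illustration of what the `- 1` in the model formula changes, the sketch below compares the columns `model.matrix` produces with and without it. The toy data frame (`toy`) and the result names (`no_intercept`, `with_intercept`) are hypothetical stand-ins, not the chapter's `personData`.

```r
# Hypothetical toy data; only the factor column matters here.
toy <- data.frame(
  person = 1:2,
  maritalstatus = factor(c("single", "married"),
                         levels = c("single", "married"))
)

# Without an intercept (- 1): one indicator column per factor level.
no_intercept <- model.matrix(~ maritalstatus - 1, data = toy)

# With the default intercept: the first level acts as the baseline
# and is absorbed into the "(Intercept)" column.
with_intercept <- model.matrix(~ maritalstatus, data = toy)

colnames(no_intercept)
colnames(with_intercept)
```

Dropping the intercept is what yields a full set of dummy columns, one per level, which is presumably why the listing uses it.
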
diff --git a/R/chapter3.qmd b/R/chapter3.qmd
@@ -2,7 +2,7 @@
 title: "Real-World Machine Learning" 
 subtitle: "Chapter 3"
 author: "Paul Adamson"
-date: "December 7, 2016"
+date: "August 21, 2022"
 format:
   html:
     toc: true
@@ -13,19 +13,20 @@ This file contains R code to accompany Chapter 3 of the book
 by Henrik Brink, Joseph W. Richards, and Mark Fetherolf.  The code was contributed by
 [Paul Adamson](http://github.com/padamson). 
 
-*NOTE: working directory should be set to this file's location.*
+*REMINDER: update `project_dir` below before running the code cells interactively.*
 
 ```{r setup, include=FALSE}
 library(knitr)
 knitr::opts_chunk$set(echo = TRUE)
-library(plyr)
-library(dplyr)
+project_dir <- file.path(Sys.getenv("HOME"), "projects/github-padamson/rwml-R")
+setwd(file.path(project_dir, "R"))
+library(tidyverse)
+library(gridExtra)
 library(vcd)
 library(AppliedPredictiveModeling)
 library(caret)
 library(ellipse)
 library(kknn)
-library(gridExtra)
 library(grid)
 library(randomForest)
 set.seed(3456)
@@ -34,21 +35,30 @@ set.seed(3456)
 
 ## Figure 3.4 A subset of the Titanic Passengers dataset
 
-We are going to be interested in predicting survival, so it is useful to specify 
-the `Survived` variable to be of type `factor`. For visualizing the data, 
-it is also useful to use the `revalue` function to specify the `no` and `yes`
-levels for the `factor` variable. The `kable` function is built into the `knitr`
-package.
+We use the same steps as in Chapter 2 to read in and tidy the
+Titanic data.
 
 ```{r figure3.4, cache=TRUE}
 titanic <- read.csv("../data/titanic.csv", 
                     colClasses = c(
                       Survived = "factor",
+                      Sex = "factor",
                       Name = "character",
                       Ticket = "character",
-                      Cabin = "character"))
-titanic$Survived <- revalue(titanic$Survived, c("0"="no", "1"="yes"))
-kable(head(titanic, 6), digits=2)
+                      Cabin = "character")) |>
+  mutate(
+    Survived = fct_recode(Survived,
+      "no"  = "0",
+      "yes" = "1"  
+    )
+  ) |>
+  separate(Cabin, into = "firstCabin", sep = " ", extra = "drop", remove = FALSE) |>
+  separate(firstCabin, into = c("cabinChar", "cabinNum"), sep = 1) |>
+  rowwise() |>
+  mutate(numCabins = length(unlist(strsplit(Cabin, " ")))) |>
+  ungroup()
+
+kable(head(titanic[c("PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked")], 6), digits=4)
 ```
 
 ## Figure 3.5 Mosaic plot for Titanic data: Gender vs. survival

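To make the recoding and cabin-parsing steps in the `figure3.4` chunk easier to follow, here is a minimal sketch on two made-up rows. The data frame `toy` and the result `parsed` are hypothetical, and `lengths()` is swapped in for the `rowwise()` / `length(unlist(...))` combination used in the diff; it is an alternative that should give the same counts for non-empty cabin strings, not the author's code.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Hypothetical rows mimicking the Survived and Cabin columns.
toy <- data.frame(
  Survived = factor(c("0", "1")),
  Cabin = c("C85", "C23 C25 C27"),
  stringsAsFactors = FALSE
)

parsed <- toy |>
  # Rename the factor levels: new name on the left, old on the right.
  mutate(Survived = fct_recode(Survived, "no" = "0", "yes" = "1")) |>
  # Keep only the first cabin listed; extra entries are dropped.
  separate(Cabin, into = "firstCabin", sep = " ",
           extra = "drop", remove = FALSE) |>
  # Split after the first character: deck letter vs. cabin number.
  separate(firstCabin, into = c("cabinChar", "cabinNum"), sep = 1) |>
  # lengths() counts the space-separated pieces per row, vectorized.
  mutate(numCabins = lengths(strsplit(Cabin, " ")))

parsed
```

Because `lengths()` is vectorized over the list returned by `strsplit()`, the row-wise grouping and the later `ungroup()` are not needed in this variant.
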
diff --git a/README.md b/README.md
@@ -13,8 +13,8 @@ The [renv](https://rstudio.github.io/renv/) package is used to manage a **r**epr
 
 ## Tidyverse workflow
 
-The [`tidyverse`](https://www.tidyverse.org) collection of packages is used for (``a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy'')[https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/]. The main exceptions to this philosophy are the use of [data.table](https://github.com/Rdatatable/data.table) for larger data (> 2 Gb), ["Visualizing Categorical Data" (`vcd`)](https://cran.r-project.org/web/packages/vcd/vcd.pdf) 
-for exploring categorical data (including mosaic plots), and (`gridExtra`)[https://cran.r-project.org/package=gridExtra]
+The [`tidyverse`](https://www.tidyverse.org) collection of packages is used for ["a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy"](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/). The main exceptions to this philosophy are the use of [data.table](https://github.com/Rdatatable/data.table) for larger data (> 2 GB), ["Visualizing Categorical Data" (`vcd`)](https://cran.r-project.org/web/packages/vcd/vcd.pdf) 
+for exploring categorical data (including mosaic plots), and [`gridExtra`](https://cran.r-project.org/package=gridExtra)
 for combining and organizing plots.
 
 ## Quarto publishing system