18 changes: 12 additions & 6 deletions R/chapter2.qmd
@@ -2,7 +2,7 @@
title: "Real-World Machine Learning"
subtitle: "Chapter 2"
author: "Paul Adamson"
date: "December 7, 2016"
date: "August 21, 2022"
format:
html:
toc: true
@@ -33,7 +33,10 @@ the categorical `maritalstatus` variable is of class `factor` with levels `singl
`married`. Then, a new dataframe is created using the `model.matrix` function to convert the `factor` variable to dummy variables. This approach isn't really needed, though, since [R uses factor vectors to represent dummy and categorical data](https://bookdown.org/carillitony/bailey/chp6.html#dummy-variables-in-r). For more discussion on the topic of factor variables in R, see Amelia McNamara and Nicholas Horton’s paper, [Wrangling categorical data in R](https://peerj.com/preprints/3163/). We will avoid the explicit
use of dummy variables in the remainder of the rwml-R project.

```{r listing2.1}
The [`kable`](https://bookdown.org/yihui/rmarkdown-cookbook/kable.html) function from the `knitr`
package generates simple yet elegant tables of data.

```{r figure2.4}
personData <- data.frame(
person = 1:2,
name = c("Jane Doe", "John Smith"),
@@ -43,14 +46,17 @@ personData <- data.frame(
)

kable(personData)
str(personData)
```

```{r listing2.1}
personDataNew <- data.frame(personData[,1:4],
model.matrix(~ maritalstatus - 1,
data = personData))
model.matrix(~ maritalstatus - 1,
data = personData))

```

```{r figure2.6}
kable(personDataNew)
str(personDataNew)
```
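
As a minimal sketch of how `model.matrix` expands a factor into dummy columns, the following uses a hypothetical three-row data frame (`toy` is made up for illustration and is not the chapter's `personData`). It relies only on base R and compares the formula with and without the intercept term.

```r
# Hypothetical toy data (an assumption for illustration, not the book's data)
toy <- data.frame(
  maritalstatus = factor(c("single", "married", "single"))
)

# With "- 1" the intercept is dropped, so every factor level gets its own
# indicator column.
noIntercept <- model.matrix(~ maritalstatus - 1, data = toy)
print(colnames(noIntercept))

# With the default intercept, the first level (alphabetically "married" here)
# becomes the baseline and only the remaining level gets a column.
withIntercept <- model.matrix(~ maritalstatus, data = toy)
print(colnames(withIntercept))
```

Modelling functions that take a formula typically perform this expansion internally, which is consistent with the note above that explicit dummy variables are rarely needed.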

In the call to `model.matrix`, the −1 in the model formula
36 changes: 23 additions & 13 deletions R/chapter3.qmd
@@ -2,7 +2,7 @@
title: "Real-World Machine Learning"
subtitle: "Chapter 3"
author: "Paul Adamson"
date: "December 7, 2016"
date: "August 21, 2022"
format:
html:
toc: true
@@ -13,19 +13,20 @@ This file contains R code to accompany Chapter 3 of the book
by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. The code was contributed by
[Paul Adamson](http://github.com/padamson).

*NOTE: working directory should be set to this file's location.*
*REMINDER: update `project_dir` below to execute code as interactive code cells*

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
library(plyr)
library(dplyr)
project_dir <- file.path(Sys.getenv("HOME"), "projects/github-padamson/rwml-R")
setwd(file.path(project_dir, "R"))
library(tidyverse)
library(gridExtra)
library(vcd)
library(AppliedPredictiveModeling)
library(caret)
library(ellipse)
library(kknn)
library(gridExtra)
library(grid)
library(randomForest)
set.seed(3456)
@@ -34,21 +35,30 @@ set.seed(3456)

## Figure 3.4 A subset of the Titanic Passengers dataset

We are going to be interested in predicting survival, so it is useful to specify
the `Survived` variable to be of type `factor`. For visualizing the data,
it is also useful to use the `revalue` function to specify the `no` and `yes`
levels for the `factor` variable. The `kable` function is built into the `knitr`
package.
We use the same steps as in Chapter 2 to read in and tidy the
Titanic data.

```{r figure3.4, cache=TRUE}
titanic <- read.csv("../data/titanic.csv",
colClasses = c(
Survived = "factor",
Sex = "factor",
Name = "character",
Ticket = "character",
Cabin = "character"))
titanic$Survived <- revalue(titanic$Survived, c("0"="no", "1"="yes"))
kable(head(titanic, 6), digits=2)
Cabin = "character")) |>
mutate(
Survived = fct_recode(Survived,
"no" = "0",
"yes" = "1"
)
) |>
separate(Cabin, into = "firstCabin", sep = " ", extra = "drop", remove = FALSE) |>
separate(firstCabin, into = c("cabinChar", "cabinNum"), sep = 1) |>
rowwise() |>
mutate(numCabins = length(unlist(strsplit(Cabin, " ")))) |>
ungroup()

kable(head(titanic[c("PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked")], 6), digits=4)
```
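
The tidying steps in the chunk above can be tried without `titanic.csv`. The sketch below applies the same verbs to a hypothetical two-row data frame (`toy` and its values are made up for illustration). It assumes the `dplyr`, `tidyr`, and `forcats` packages are installed and that R is version 4.1 or later for the native pipe.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Hypothetical stand-in for two rows of the Titanic data (an assumption)
toy <- data.frame(
  Survived = factor(c("0", "1")),
  Cabin = c("C85", "C23 C25 C27"),
  stringsAsFactors = FALSE
)

tidied <- toy |>
  # Relabel the factor levels "0"/"1" as "no"/"yes"
  mutate(Survived = fct_recode(Survived, "no" = "0", "yes" = "1")) |>
  # Keep only the first cabin listed; remove = FALSE retains the Cabin column
  separate(Cabin, into = "firstCabin", sep = " ",
           extra = "drop", remove = FALSE) |>
  # A numeric sep splits by position: deck letter, then the remaining digits
  separate(firstCabin, into = c("cabinChar", "cabinNum"), sep = 1) |>
  # Count the space-separated cabins one row at a time
  rowwise() |>
  mutate(numCabins = length(unlist(strsplit(Cabin, " ")))) |>
  ungroup()

print(tidied)
```

A vectorised `lengths(strsplit(Cabin, " "))` should give the same counts without `rowwise()`; the row-wise form is used here only to mirror the chunk above.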

## Figure 3.5 Mosaic plot for Titanic data: Gender vs. survival
4 changes: 2 additions & 2 deletions README.md
@@ -13,8 +13,8 @@ The [renv](https://rstudio.github.io/renv/) package is used to manage a **r**epr

## Tidyverse workflow

The [`tidyverse`](https://www.tidyverse.org) collection of packages is used for (``a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy'')[https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/]. The main exceptions to this philosophy are the use of [data.table](https://github.com/Rdatatable/data.table) for larger data (> 2 Gb), ["Visualizing Categorical Data" (`vcd`)](https://cran.r-project.org/web/packages/vcd/vcd.pdf)
for exploring categorical data (including mosaic plots), and (`gridExtra`)[https://cran.r-project.org/package=gridExtra]
The [`tidyverse`](https://www.tidyverse.org) collection of packages is used for ["a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy"](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/). The main exceptions to this philosophy are the use of [data.table](https://github.com/Rdatatable/data.table) for larger data (> 2 Gb), ["Visualizing Categorical Data" (`vcd`)](https://cran.r-project.org/web/packages/vcd/vcd.pdf)
for exploring categorical data (including mosaic plots), and [`gridExtra`](https://cran.r-project.org/package=gridExtra)
for combining and organizing plots.

## Quarto publishing system