Machine Learning | StatCityPro

Immigration - Where do they live? - Part 2

Sun, 16 Aug 2020 00:00:00 +0000

This publication follows on from a previous post and builds a machine learning classification model to predict whether an immigrant who arrived in Santiago de Chile in 2019 lived in the Eastern Sector of the city upon arrival.

Packages

The following packages are used in this publication.

library(dplyr)
library(caret)
library(modelr)
library(forcats) 
library(caTools)
library(readr)

2) The Eastern Sector

As previously explained, the Eastern Sector contains the comunas of Providencia, Las Condes, Vitacura, and Lo Barnechea and is located to the north east of the city. These comunas are considered the wealthiest in the city and are identified in the map below.

3) Distribution of Immigrants

The map below shows the distribution of all immigrants who arrived in Santiago in 2019. It must be noted that the data only records the comuna of residence at the time an immigrant applied for their visa. Therefore, it is possible that they have since moved to a different sector of the city.

maplabels <- read_csv("lables.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   nombre_comuna = col_character(),
##   labels = col_double()
## )

maplabels

## # A tibble: 32 x 3
##       X1 nombre_comuna    labels
##    <dbl> <chr>             <dbl>
##  1     1 Santiago              1
##  2     2 Cerrillos             2
##  3     3 Cerro Navia           3
##  4     4 Conchalí              4
##  5     5 El Bosque             5
##  6     6 Estación Central      6
##  7     7 Huechuraba            7
##  8     8 Independencia         8
##  9     9 La Cisterna           9
## 10    10 La Florida           10
## # … with 22 more rows

4) Data

This section creates a classification model to predict whether immigrants lived in the Eastern Sector when they arrived in Santiago. The data used is the visas2019STG data frame that was prepared in part 1.

visas2019STG <- read.csv("visas2019STG.csv")

5) Model Preparation

The following processes are conducted to prepare the data for modeling.

5.1) Selecting Variables and Eastern Sector Variable

First, the relevant variables are chosen.

visas2019STGfilter <- visas2019STG %>% select(SEXO, PAÍS, ACTIVIDAD, PROFESIÓN, ESTUDIOS, nombre_comuna, AÑO, MES, Age)

Second, a new binary variable is created to indicate whether an immigrant lived in the Eastern Sector. Within this variable 1 represents an immigrant who lived in the Eastern Sector, and 0 represents an immigrant who lived in a different sector of the city.

# 1 = lived in one of the four Eastern Sector comunas, 0 = lived elsewhere
visas2019STGfilter$SECTOR_ORRIENTE <- if_else(
  visas2019STGfilter$nombre_comuna %in% c('Providencia', 'Las Condes', 'Vitacura', 'Lo Barnechea'),
  1, 0)

5.2) One Hot Encoding

Third, One Hot Encoding is conducted. This process converts categorical variables into a structure that is easier to compute with. When One Hot Encoding is applied to a particular variable, a new column is created for each class of the original variable. In each class column, a one is recorded for every observation that belongs to that class and a zero for every observation that does not.

One Hot Encoding is carried out for the variables of Sex, Activity and Studies.
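As a small illustration of the idea, the sketch below applies base R's model.matrix(), which modelr's model_matrix() wraps, to a made-up three-row data frame. The toy values are assumptions for demonstration only and are not taken from the visa data.

```r
# Toy data; the values are illustrative only.
toy <- data.frame(SEXO = factor(c("Femenino", "Masculino", "Femenino")))

# "- 1" removes the intercept, so every class gets its own 0/1 column.
one_hot <- model.matrix(~ SEXO - 1, data = toy)
one_hot
```

Each row contains exactly one 1, in the column that matches the original class of that observation.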

Sex

model_matrix(visas2019STGfilter, SECTOR_ORRIENTE~SEXO-1)

visas2019STGfilter <- cbind(visas2019STGfilter, model_matrix(visas2019STGfilter, SECTOR_ORRIENTE~SEXO-1))

Activity

model_matrix(visas2019STGfilter, SECTOR_ORRIENTE~ACTIVIDAD-1)

visas2019STGfilter <- cbind(visas2019STGfilter, model_matrix(visas2019STGfilter, SECTOR_ORRIENTE~ACTIVIDAD-1))

Studies

model_matrix(visas2019STGfilter, SECTOR_ORRIENTE~ESTUDIOS-1)

visas2019STGfilter <- cbind(visas2019STGfilter, model_matrix(visas2019STGfilter, SECTOR_ORRIENTE~ESTUDIOS-1))

5.3) Countries to Continents

In total there are 75 nationalities in the data. To make these easier to compute a new variable is created, grouping the nationalities into continents. One Hot Encoding is then carried out for this variable.

visas2019STGfilter$continente <- fct_collapse(visas2019STGfilter$PAÍS, Europe = c('Alemania', 'Austria', 'Bélgica', 'Bulgaria', 'Croacia', 'Dinamarca', 'Eslovaquia', 'España', 'Finlandia',
                                                 'Francia', 'Grecia', 'Holanda', 'Hungría', 'Inglaterra', 'Irlanda', 'Italia', 'Lituania', 'Noruega', 
                                                 'Polonia', 'Portugal', 'República Checa', 'República De Bielorrusia', 'República De Serbia',
                                                 'Rumanía', 'Rusia', 'Suecia', 'Suiza', 'Ucrania'), Africa = c('Angola', 'Camerún', 'Egipto', 'Marruecos', 'República de Congo', 'Sudráfica'), 
             Asia = c('Bangladesh', 'Corea del Sur', 'China', 'Filipinas', 'India', 'Indonesia', 'Irán', 'Israel', 'Japón',
                      'Jordania', 'Líbano', 'Malasia', 'Nepal', 'Pakistán', 'Palestina', 'Siria', 'Tailandia', 'Taiwan', 'Turquía'),
             SouthAmerica = c('Argentina', 'Bolivia', 'Brasil', 'Colombia', 'Ecuador', 'Paraguay', 'Perú', 'Uruguay', 'Venezuela'), 
             CentralAmerica = c('Costa Rica', 'Cuba', 'El Salvador', 'Guatemala', 'Haití', 'Honduras', 'México', 'Nicaragua', 'Panamá','República Dominicana'), 
             NorthAmerica = c('Canadá', 'Estados Unidos'), Other = c('Otro país'), 
             Oceania = c('Australia', 'Nueva Zelanda'))

visas2019STGfilter <- cbind(visas2019STGfilter, model_matrix(visas2019STGfilter, SECTOR_ORRIENTE~continente-1))

6) Creating the Model

The data is divided into two groups. One is the training group with 75% of the observations. The second is the test group with 25% of the observations.

set.seed(1234)
split <- sample.split(visas2019STGfilter$SECTOR_ORRIENTE, SplitRatio = 0.75)
training_set <- subset(visas2019STGfilter, split == TRUE)
test_set <- subset(visas2019STGfilter, split == FALSE)

training_set_cut <- training_set[,c(-1, -2, -3, -4, -5, -6, -7, -8, -9, -34)]

test_set_cut <- test_set[,c(-1, -2, -3, -4, -5, -6, -7, -8, -9, -34)]
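sample.split() from caTools keeps the ratio of the two classes the same in both parts of the split. The base R sketch below shows what that means on a simulated outcome. The simulated data, the 8% positive rate and the loop are assumptions made for illustration, not the package's internal code.

```r
set.seed(1234)
# Simulated binary outcome with roughly 8% positives (not the visa data).
y <- rbinom(1000, size = 1, prob = 0.08)

# Stratified 75/25 split: sample 75% of the rows within each class.
in_train <- logical(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  in_train[sample(idx, size = round(0.75 * length(idx)))] <- TRUE
}

# The share of positives should be almost identical in both parts.
prop.table(table(y[in_train]))
prop.table(table(y[!in_train]))
```

Keeping the class ratio equal matters here because the Eastern Sector class is small, and an unstratified split could leave the test set with too few positive cases.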

6.2) Training the Model

A Generalized Linear Model with a binomial family (logistic regression) is used and trained with the below syntax.

set.seed(1234)
classifier = glm(formula = SECTOR_ORRIENTE ~.,
                  family = binomial,
                  data = training_set_cut)

The model is used to make predictions on the test data. A prediction of 1 represents an immigrant in the Eastern Sector and a prediction of 0 represents an immigrant in another area of the city. A Confusion Matrix is used to assess the accuracy of the model.

set.seed(1234)
prob_pred <- predict(classifier, type = 'response', newdata = test_set_cut[,c(-1)])

y_pred <- ifelse(prob_pred >= 0.5, 1, 0)
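The 0.5 cut-off decides how many observations are labelled as Eastern Sector. The sketch below uses invented probabilities, not output from the model, and an arbitrary alternative cut-off of 0.3 to show how the rule behaves.

```r
# Illustrative probabilities only; not output from the model above.
prob_toy <- c(0.05, 0.12, 0.31, 0.48, 0.52, 0.77)

# Same rule as above: probabilities of 0.5 or more are classified as 1.
pred_050 <- ifelse(prob_toy >= 0.5, 1, 0)

# A lower cut-off (0.3 is an arbitrary choice) labels more cases as 1.
pred_030 <- ifelse(prob_toy >= 0.3, 1, 0)

sum(pred_050)
sum(pred_030)
```

With a rare positive class, a lower cut-off generally trades some precision for higher recall, so the choice of threshold is a modelling decision rather than a fixed rule.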

cm <- confusionMatrix(factor(test_set_cut$SECTOR_ORRIENTE), factor(y_pred), positive = "1")

cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 22616   169
##          1  1617   262
##                                           
##                Accuracy : 0.9276          
##                  95% CI : (0.9243, 0.9308)
##     No Information Rate : 0.9825          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2042          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.60789         
##             Specificity : 0.93327         
##          Pos Pred Value : 0.13944         
##          Neg Pred Value : 0.99258         
##              Prevalence : 0.01747         
##          Detection Rate : 0.01062         
##    Detection Prevalence : 0.07618         
##       Balanced Accuracy : 0.77058         
##                                           
##        'Positive' Class : 1               
##

7) Analysis

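The confusion matrix compares the observed data with the predicted data to show how accurate the model is. In caret's confusionMatrix() the first argument is the predicted classes and the second is the observed classes. The call above passes them the other way round, so in the printed table the rows labelled Prediction hold the observed sector and the columns labelled Reference hold the model's prediction. For the same reason, the value printed as Sensitivity is the model's Precision and the value printed as Pos Pred Value is its Recall. The matrix provides four values:

EasternSectorCorrect = 262      # lived in the Eastern Sector, predicted Eastern Sector
EasternSectorFalse = 1617       # lived in the Eastern Sector, predicted another sector
OtherSectorCorrect = 22616      # lived in another sector, predicted another sector
OtherSectorIncorrect = 169      # lived in another sector, predicted Eastern Sector

TotalObservations = EasternSectorCorrect + EasternSectorFalse + OtherSectorCorrect + OtherSectorIncorrect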

The above values can be used to calculate the following four measures of model accuracy:

Accuracy
Kappa
Precision
Recall

These values are discussed below.

7.1) Accuracy

The model had an accuracy of 92.76%. This means that the model classified 92.76% of the test data correctly.

Accuracy <- ((EasternSectorCorrect + OtherSectorCorrect)/TotalObservations)

7.2) Precision

The model has a Precision of 60.79%. This means that of the 431 immigrants (262 + 169) that the model predicted as living in the Eastern Sector, 60.79% actually lived there.

Precision <- EasternSectorCorrect/(EasternSectorCorrect+OtherSectorIncorrect)

7.3) Recall

The Recall value of 13.94% means that out of the 1,879 immigrants (262 + 1,617) in the test data that actually lived in the Eastern Sector, the model correctly identified 13.94% of them.

Recall <- EasternSectorCorrect / (EasternSectorCorrect + EasternSectorFalse)

7.4) Kappa

One of the problems with the data is that it is not balanced. Only 1,879 immigrants in the test data lived in the Eastern Sector, accounting for 7.62%. Consequently, a model could obtain a high accuracy simply by predicting that no observation lives in the Eastern Sector: it would still be correct 92.38% of the time, only slightly below the 92.76% achieved here.

Therefore, the Kappa value can be used to measure how well the model performed. It compares the model's agreement with the observed data against the agreement that would be expected by chance. Kappa can range from -1 to 1: a value of 0 means the model does no better than chance, and the closer the value is to 1 the better the agreement.

The model has a Kappa of 0.2042. This means that the model closes 20.42% of the gap between chance agreement and perfect agreement. A Kappa value of between 0.21 and 0.40 is commonly considered fair, so the model falls just short of that band. More can be read about how to calculate the Kappa value
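The Kappa value can be reproduced by hand from the four cell counts of the confusion matrix. The sketch below follows the standard Cohen's Kappa formula, (observed agreement - chance agreement) / (1 - chance agreement). The variable names are chosen here for illustration and are not part of the original analysis.

```r
# The four cell counts reported by the confusion matrix above.
both_eastern <- 262     # observed and predicted agree on class 1
both_other   <- 22616   # observed and predicted agree on class 0
off_diag_a   <- 1617    # the two off-diagonal cells, where they disagree
off_diag_b   <- 169
n <- both_eastern + both_other + off_diag_a + off_diag_b

# Observed agreement: share of cases on the diagonal.
p_observed <- (both_eastern + both_other) / n

# Chance agreement: product of the row and column totals for each class.
p_chance <- ((both_other + off_diag_b) * (both_other + off_diag_a) +
             (both_eastern + off_diag_a) * (both_eastern + off_diag_b)) / n^2

kappa_value <- (p_observed - p_chance) / (1 - p_chance)
round(kappa_value, 4)   # 0.2042, matching the caret output
```

Because the formula treats rows and columns symmetrically, the result does not depend on which of the two off-diagonal counts belongs to the predictions and which to the observations.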

8) Conclusion

In conclusion, this publication has created a classification model to classify whether an immigrant who arrived in Santiago in 2019 lived in the Eastern Sector. It has followed on from the part 1 publication, which prepared the data and looked at the distribution of some nationalities throughout the city. When the model was used on the test data, an accuracy of 92.76% was obtained with a Kappa value of 0.2042, suggesting that the model had some success. However, the Precision of 60.79% and the Recall of 13.94% show that the model identified only a small share of the immigrants who lived in the Eastern Sector. The model could be improved with more balanced data and different types of classification models. These options will be explored in future publications. Many thanks for reading this publication.

Immigration - Where do they live? - Santiago

Sun, 21 Jun 2020 00:00:00 +0000

A few months ago the author of StatCityPro carried out an investigation into where immigrants live in Santiago de Chile. Various points of interest were identified regarding the number of immigrants and where they lived upon arrival in Chile.

This publication looks to build on this previous work by looking at more current data from 2019 and also by using machine learning methods of classification to build a model to predict if immigrants live in the Eastern Sector of Santiago. More information can be read about the Eastern Sector.

2) Packages

The following packages will be used in this publication.

library(dplyr)
library(lubridate)
library(chilemapas)
library(ggplot2)
library(sf)
library(ggspatial)
library(caret)
library(modelr)
library(forcats) 
library(caTools)

3) Data

The data used in this publication can be downloaded from the following link.

setwd("~/Documents/Machine Learning/4. Proyectos/Migration/Data Sets")

visas2019 <- read.csv("visas_otorgadas_2019.csv")

The below syntax can be used to reveal the data’s variables. In total there are 14 variables with 328,115 observations.

str(visas2019)

## 'data.frame':    328115 obs. of  14 variables:
##  $ SEXO              : Factor w/ 2 levels "Femenino","Masculino": 2 1 1 2 2 2 2 2 2 1 ...
##  $ PAÍS              : Factor w/ 77 levels "Alemania","Angola",..: 58 65 65 18 65 31 18 14 14 14 ...
##  $ NACIMIENTO        : Factor w/ 26521 levels "","1900-01-01",..: 15106 16136 16461 14048 2870 17043 21235 16788 14639 15764 ...
##  $ ACTIVIDAD         : Factor w/ 14 levels "Dueña De Casa",..: 8 7 7 8 7 9 7 2 13 9 ...
##  $ PROFESIÓN         : Factor w/ 606 levels "A Bodega","A Planificac",..: 351 426 426 432 158 423 399 245 569 84 ...
##  $ ESTUDIOS          : Factor w/ 7 levels "Básico","Medio",..: 4 1 7 2 4 2 4 2 4 2 ...
##  $ COMUNA            : Factor w/ 340 levels "Algarrobo","Alhué",..: 304 113 262 327 127 308 304 10 91 113 ...
##  $ PROVINCIA         : Factor w/ 56 levels "Antártica Chilena",..: 49 49 49 56 25 51 49 2 49 49 ...
##  $ REGIÓN            : Factor w/ 16 levels "Antofagasta",..: 13 13 13 16 6 12 13 1 13 13 ...
##  $ TIT_DEP           : Factor w/ 3 levels "","D","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AUTORIDAD         : Factor w/ 55 levels "Dem","Gobernación Antártica Chilena",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BENEFICIO_AGRUPADO: Factor w/ 7 levels "Estudiante","Inversionista",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ AÑO               : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
##  $ MES               : int  4 7 7 7 5 2 2 5 6 5 ...

4) Feature Engineering

Feature Engineering will be conducted to prepare the data for further analysis.

4.1) Filter Data

Firstly, the data is filtered to keep only the province of Santiago. From now on this area will be referred to as Santiago. It is important to note that this area does not include the comunas of San Bernardo or Puente Alto, as they are part of different provinces.

visas2019STG <- visas2019 %>% filter(REGIÓN == "Metropolitana de Santiago") 

visas2019STG <- visas2019STG %>% filter(PROVINCIA == 'Santiago')

4.2) Missing Values

Some variables have a class of ‘No Informa’ (not reported). However, this is not the case for TIT_DEP, which has 15,902 observations without a value. For consistency with the other variables, these missing values are assigned the class ‘No Informa’. All observations that have a value of ‘No Informa’ for any of these variables are then removed from the data frame, as they can reduce the precision of the model.

table(visas2019STG$ACTIVIDAD) 
      
table(visas2019STG$PROFESIÓN) 
            
table(visas2019STG$ESTUDIOS) 

table(visas2019STG$TIT_DEP)

levels(visas2019STG$TIT_DEP)

levels(visas2019STG$TIT_DEP)[1] <- "No Informa"

table(visas2019STG$TIT_DEP)

# Drop every observation with 'No Informa' in any of the four variables
visas2019STG <- visas2019STG %>%
  filter(ACTIVIDAD != "No Informa",
         PROFESIÓN != "No Informa",
         ESTUDIOS != "No Informa",
         TIT_DEP != "No Informa")
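The recoding step above renames the first factor level, which works because the empty string sorts first. A slightly more defensive variant, sketched below in base R on a made-up factor, renames the level by value instead of by position. The toy values are assumptions for illustration.

```r
# Toy factor with an empty-string level, mimicking TIT_DEP (values are illustrative).
tit_dep <- factor(c("", "T", "D", "", "T"))

# Rename the empty level by value rather than by position.
levels(tit_dep)[levels(tit_dep) == ""] <- "No Informa"

# Keep only the observations that carry information.
kept <- droplevels(tit_dep[tit_dep != "No Informa"])
table(kept)
```

Renaming by value keeps working even if the order of the levels changes, for example after the data is re-read with different settings.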

4.3) Immigrant Age

The date of birth is converted to age, calculated as the difference between 2020 and the year of birth.

visas2019STG$NACIMIENTO <- as.Date(visas2019STG$NACIMIENTO)

year <- 2020

Birth_year <- year(visas2019STG$NACIMIENTO)

visas2019STG <- visas2019STG %>% mutate(Age = year - Birth_year)
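The same calculation can be sketched in base R without lubridate. The birth dates below are invented, and reference_year is a name chosen here for illustration.

```r
# Illustrative birth dates; the reference year 2020 follows the code above.
nacimiento <- as.Date(c("1985-03-14", "1999-11-02"))
reference_year <- 2020

# Extract the calendar year and subtract it from the reference year.
birth_year <- as.integer(format(nacimiento, "%Y"))
age <- reference_year - birth_year
age
```

Because only the year is used, the result can differ from a person's exact age by up to one year, which is usually acceptable for a model feature.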

4.4) Geographic Coordinates

The package chilemapas is used to create a base map for the Province of Santiago.

Chile <- chilemapas::codigos_territoriales
STG <- Chile %>% filter(nombre_provincia == 'Santiago')
Comunas <- chilemapas::mapa_comunas
STGgeo <- left_join(STG, Comunas)

Additionally, accents are added to the names of each of the comunas, so that they can be combined with other data bases that use accents in their spelling of the comunas.

STGgeo[4, 2] = "Conchalí"
STGgeo[6, 2] = "Estación Central"
STGgeo[19, 2] = "Maipú"
STGgeo[20, 2] = "Ñuñoa"
STGgeo[22, 2] = "Peñalolén"
STGgeo[29, 2] = "San Joaquín"
STGgeo[31, 2] = "San Ramón"
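The assignments above depend on the row order of STGgeo. As an alternative sketch, the names can be corrected through a lookup by value. The unaccented spellings and the accent_fix table below are assumptions made for illustration, and only three of the seven comunas are shown.

```r
# Toy stand-in for the comuna name column (unaccented spellings are assumed).
nombre_comuna <- c("Santiago", "Conchali", "Maipu", "Nunoa", "Vitacura")

# Hypothetical lookup table: unaccented spelling -> accented spelling.
accent_fix <- c(Conchali = "Conchalí", Maipu = "Maipú", Nunoa = "Ñuñoa")

# Replace only the names that appear in the lookup table.
needs_fix <- nombre_comuna %in% names(accent_fix)
nombre_comuna[needs_fix] <- unname(accent_fix[nombre_comuna[needs_fix]])
nombre_comuna
```

A lookup by value still gives the right result if rows are added, removed or reordered in the source table.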

The two data bases are combined.

visas2019STG <- visas2019STG %>% rename(nombre_comuna = COMUNA)

visas2019STG$nombre_comuna <- as.factor(visas2019STG$nombre_comuna)

visas2019STG <- left_join(visas2019STG, STGgeo)

5) Initial Analysis

In this section the data is explored.

5.1) Nationalities

In 2019, 156,260 immigrants arrived in Santiago, with a total of 76 nationalities. However, after following the feature engineering steps outlined in Section 4, this number reduces to 98,655 with 75 nationalities. Of this amount, Venezuelans are the most prominent, representing 58.80% with 58,009 people. An additional point of interest is that of the ten most prominent nationalities eight are from South or Central America, with China and the United States the only exceptions. It is also observed that six of these ten nationalities speak Spanish as a first language.

visas2019STG %>% group_by(PAÍS) %>% count() %>% arrange(-n)

## # A tibble: 75 x 2
## # Groups:   PAÍS [75]
##    PAÍS               n
##    <fct>          <int>
##  1 Venezuela      58009
##  2 Perú           11474
##  3 Colombia        8465
##  4 Haití           7135
##  5 Bolivia         2424
##  6 Ecuador         2119
##  7 Argentina       1988
##  8 Brasil          1335
##  9 China            923
## 10 Estados Unidos   564
## # … with 65 more rows
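The percentage quoted for Venezuelans can be checked from the counts in the table. The sketch below copies four of the counts and the filtered total of 98,655 from the text above. The vector names are chosen here for illustration.

```r
# Counts copied from the table above; 98,655 is the total after filtering.
top_counts <- c("Venezuela" = 58009, "Perú" = 11474, "Colombia" = 8465, "Haití" = 7135)
total_arrivals <- 98655

# Share of all arrivals, in percent.
share <- round(100 * top_counts / total_arrivals, 2)
share
```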

5.2) Where do they live?

The most popular comuna for immigrants in 2019 was Santiago Centro with 30,207 arrivals. This is not surprising as Santiago Centro is the center of the city where there is more access to services, employment opportunities and housing. However, it must be noted that the data used only refers to the comuna of residence when an immigrant applied for their visa. Therefore, it is possible that they have since moved to a different sector of the city as they have become used to the city and have developed a support network.

comuna_count <- visas2019STG %>% group_by(nombre_comuna) %>% count() %>% arrange(-n)

comuna_count

## # A tibble: 32 x 2
## # Groups:   nombre_comuna [32]
##    nombre_comuna        n
##    <chr>            <int>
##  1 Santiago         30207
##  2 Estación Central  8080
##  3 Independencia     7291
##  4 Quinta Normal     4343
##  5 San Miguel        4124
##  6 Recoleta          3871
##  7 Ñuñoa             3736
##  8 Las Condes        3591
##  9 La Florida        3571
## 10 Maipú             2780
## # … with 22 more rows

The total number of immigrants in each comuna is added to the STGgeo data frame so that it can be mapped below. Each comuna is labelled, with details of which comunas correspond to each number provided in the table below the map.

STGgeo <- left_join(STGgeo, comuna_count, by = "nombre_comuna")

STGgeo <- STGgeo %>% rename(number_inmigrantes = n)

STGgeo <- cbind(STGgeo, st_coordinates(st_centroid(STGgeo$geometry)))

labels <- seq(1,32)

ggplot() + geom_sf(data = STGgeo$geometry, aes(fill = STGgeo$number_inmigrantes)) + 
scale_fill_viridis_c(option = "inferno",trans = 'sqrt') +
geom_text(data = STGgeo, aes(X, Y, label = labels), size = 3, color = "white") +
geom_text(data = STGgeo %>% filter(nombre_comuna == "Santiago"), aes(X, Y, label = "1"), size = 3, color = "black") +
annotation_north_arrow(aes(which_north = "true", location = "br"), pad_y = unit(0.8, "cm")) +
  annotation_scale(aes(location = "br", style = "bar")) +
  theme(panel.grid.major = element_line(color = gray(0.5), linetype = "dashed")) +
  theme (panel.background = element_rect(fill = "light grey")) +
  ggtitle("5.1) Location of immigrants that arrived in 2019") + xlab("Longitude") + ylab("Latitude") +
  labs(fill = "Number of Immigrants")

cbind(STGgeo, labels) %>% select(nombre_comuna, labels)

##          nombre_comuna labels
## 1             Santiago      1
## 2            Cerrillos      2
## 3          Cerro Navia      3
## 4             Conchalí      4
## 5            El Bosque      5
## 6     Estación Central      6
## 7           Huechuraba      7
## 8        Independencia      8
## 9          La Cisterna      9
## 10          La Florida     10
## 11           La Granja     11
## 12          La Pintana     12
## 13            La Reina     13
## 14          Las Condes     14
## 15        Lo Barnechea     15
## 16           Lo Espejo     16
## 17            Lo Prado     17
## 18               Macul     18
## 19               Maipú     19
## 20               Ñuñoa     20
## 21 Pedro Aguirre Cerda     21
## 22           Peñalolén     22
## 23         Providencia     23
## 24            Pudahuel     24
## 25           Quilicura     25
## 26       Quinta Normal     26
## 27            Recoleta     27
## 28               Renca     28
## 29         San Joaquín     29
## 30          San Miguel     30
## 31           San Ramón     31
## 32            Vitacura     32

5.3) The Eastern Sector

This publication and its part two counterpart aim to build a classification model to predict whether an immigrant lives in the Eastern Sector. This sector contains the comunas of Providencia, Las Condes, Vitacura, and Lo Barnechea and is located to the north east of the city. These comunas are considered the wealthiest in the city and are identified in the map below.

SectorOriente <- STGgeo %>% filter(nombre_comuna == 'Providencia' | nombre_comuna == 'Las Condes' | nombre_comuna == 'Vitacura' | nombre_comuna == 'Lo Barnechea')

ggplot() + geom_sf(data = STGgeo$geometry, fill = "white") + 
  geom_sf(data = SectorOriente$geometry, fill = "purple") +
geom_text(data = STGgeo, aes(X, Y, label = labels), size = 3, color = "white") +
annotation_north_arrow(aes(which_north = "true", location = "br"), pad_y = unit(0.8, "cm")) +
  annotation_scale(aes(location = "br", style = "bar")) +
  theme(panel.grid.major = element_line(color = gray(0.5), linetype = "dashed")) +
  theme (panel.background = element_rect(fill = "light grey")) +
  ggtitle("5.2 Eastern Sector of Santiago") + xlab("Longitude") + ylab("Latitude")

6) Further Mapping

In this section four maps are presented.

Map 6.1 shows the distribution of Venezuelan immigrants.

Map 6.2 shows the distribution of Haitian immigrants.

Map 6.3 shows the distribution of immigrants from the USA.

Map 6.4 shows the distribution of Peruvian immigrants.

These four nationalities were chosen as they play an important role in the immigration trends in Santiago. There has been a big increase in the number of Venezuelans in the last few years due to the political situation in their own country. The number of Haitians has also increased dramatically since 2015 due to the lower quality of life in their own country. The GDP per capita in Haiti is $868. This value is the lowest among the ten most prominent nationalities of immigrants who arrived in Santiago in 2019. Similarly, it is interesting to explore the distribution of immigrants from the United States, as it is the country with the highest GDP per capita. Finally, Peruvians have historically been the biggest contributor of immigrants to Chile. The GDP per capita (World Bank, 2018) of the main nationalities is shown in US Dollars below.

USA = $62,887
Peru = $6,941
Colombia = $6,668
Haiti = $868
Bolivia = $3,549
Ecuador = $6,345
Argentina = $11,684
Brazil = $9,001
China = $9,771

For comparison the GDP per capita of Chile is $15,923.

6.1) Venezuelans

This map shows that Venezuelans were concentrated in Santiago Centro with 21,387 people, corresponding to 36.87% of the Venezuelans that arrived in 2019. Estación Central and Independencia were the second and third most populated comunas. In the Eastern Sector there were 1,998 Venezuelans.

venezuela_count <- visas2019STG %>% filter(PAÍS == 'Venezuela') %>% group_by(nombre_comuna) %>% count() %>% arrange(-n)

venezuela_count

## # A tibble: 32 x 2
## # Groups:   nombre_comuna [32]
##    nombre_comuna        n
##    <chr>            <int>
##  1 Santiago         21387
##  2 Estación Central  5628
##  3 Independencia     4710
##  4 San Miguel        3318
##  5 Quinta Normal     2805
##  6 Ñuñoa             2622
##  7 La Florida        2339
##  8 Macul             1575
##  9 Maipú             1537
## 10 La Cisterna       1237
## # … with 22 more rows

visas2019STG %>% filter(PAÍS == 'Venezuela' & nombre_comuna %in% c('Providencia', "Las Condes", 'Vitacura', 'Lo Barnechea')) %>% count()

##      n
## 1 1998

STGgeo <- left_join(STGgeo, venezuela_count, by = "nombre_comuna")

STGgeo <- STGgeo %>% rename(numero_venezuelanos = n)

ggplot() + geom_sf(data = STGgeo$geometry, aes(fill = STGgeo$numero_venezuelanos)) + 
scale_fill_viridis_c(option = "inferno",trans = 'sqrt') +
geom_text(data = STGgeo, aes(X, Y, label = labels), size = 3, color = "white") +
annotation_north_arrow(aes(which_north = "true", location = "br"), pad_y = unit(0.8, "cm")) +
  annotation_scale(aes(location = "br", style = "bar")) +
  theme(panel.grid.major = element_line(color = gray(0.5), linetype = "dashed")) +
  theme (panel.background = element_rect(fill = "light grey")) +
  ggtitle("6.1 Location of Venezuelans that arrived in 2019") + xlab("Longitude") + ylab("Latitude") +
  labs(fill = "Number of immigrants")

6.2) Haitians

The map below highlights that the most popular comuna for Haitians was Quilicura, to the north of Santiago, with 984 arrivals, accounting for 13.79% of the 7,135 Haitians that arrived in 2019. Estación Central also had a high number of Haitians with 758 arriving (10.62%). Likewise, Santiago Centro had 523 (7.33%) arrivals. It is also interesting to note the lack of Haitians in the Eastern Sector of the city, with only 25 Haitians arriving there in 2019.

haitiano_count <- visas2019STG %>% filter(PAÍS == 'Haití') %>% group_by(nombre_comuna) %>% count() %>% arrange(-n)

visas2019STG %>% filter(PAÍS == 'Haití') %>% count()

##      n
## 1 7135

haitiano_count

## # A tibble: 32 x 2
## # Groups:   nombre_comuna [32]
##    nombre_comuna           n
##    <chr>               <int>
##  1 Quilicura             984
##  2 Estación Central      758
##  3 Santiago              523
##  4 Lo Espejo             426
##  5 Recoleta              375
##  6 Pedro Aguirre Cerda   367
##  7 Cerro Navia           363
##  8 Conchalí              280
##  9 Quinta Normal         252
## 10 El Bosque             248
## # … with 22 more rows

visas2019STG %>% filter(PAÍS == 'Haití' & nombre_comuna %in% c('Providencia', "Las Condes", 'Vitacura', 'Lo Barnechea')) %>% count()

##    n
## 1 25

STGgeo <- left_join(STGgeo, haitiano_count, by = "nombre_comuna")

STGgeo <- STGgeo %>% rename(numero_haitianos = n)

ggplot() + geom_sf(data = STGgeo$geometry, aes(fill = STGgeo$numero_haitianos)) + 
scale_fill_viridis_c(option = "inferno",trans = 'sqrt') +
geom_text(data = STGgeo, aes(X, Y, label = labels), size = 3, color = "white") +
annotation_north_arrow(aes(which_north = "true", location = "br"), pad_y = unit(0.8, "cm")) +
  annotation_scale(aes(location = "br", style = "bar")) +
  theme(panel.grid.major = element_line(color = gray(0.5), linetype = "dashed")) +
  theme (panel.background = element_rect(fill = "light grey")) +
  ggtitle("6.2 Location of Haitians that arrived in 2019") + xlab("Longitude") + ylab("Latitude") +
  labs(fill = "Number of immigrants")

6.3) United States of America

Of the 564 US-Americans that arrived in 2019, 352 (62.41%) lived in the Eastern Sector. As was the case for Venezuelans and Haitians, Santiago Centro again received a high percentage of the arrivals with 80 people (14.18%). It is also interesting that there were several comunas without US-American arrivals in 2019. This was not the case for the other two nationalities analysed so far, as Venezuelans and Haitians were present in each of Santiago’s comunas.

eeuu_count <- visas2019STG %>% filter(PAÍS == 'Estados Unidos') %>% group_by(nombre_comuna) %>% count() %>% arrange(-n)

eeuu_count

## # A tibble: 23 x 2
## # Groups:   nombre_comuna [23]
##    nombre_comuna        n
##    <chr>            <int>
##  1 Providencia        208
##  2 Las Condes         110
##  3 Santiago            80
##  4 Maipú               57
##  5 Ñuñoa               30
##  6 Vitacura            21
##  7 Lo Barnechea        13
##  8 Estación Central    11
##  9 Macul                5
## 10 Independencia        4
## # … with 13 more rows

visas2019STG %>% filter(PAÍS == 'Estados Unidos') %>% count()

##     n
## 1 564

visas2019STG %>% filter(PAÍS == 'Estados Unidos' & nombre_comuna %in% c('Providencia', "Las Condes", 'Vitacura', 'Lo Barnechea')) %>% count()

##     n
## 1 352

STGgeo <- left_join(STGgeo, eeuu_count, by = "nombre_comuna")

STGgeo <- STGgeo %>% rename(numero_eeuu = n)

ggplot() + geom_sf(data = STGgeo$geometry, aes(fill = STGgeo$numero_eeuu)) + 
scale_fill_viridis_c(option = "inferno",trans = 'sqrt') +
geom_text(data = STGgeo, aes(X, Y, label = labels), size = 3, color = "white") +
annotation_north_arrow(aes(which_north = "true", location = "br"), pad_y = unit(0.8, "cm")) +
  annotation_scale(aes(location = "br", style = "bar")) +
  theme(panel.grid.major = element_line(color = gray(0.5), linetype = "dashed")) +
  theme (panel.background = element_rect(fill = "light grey")) +
  ggtitle("6.3 Location of US-Americans that arrived in 2019") + xlab("Longitude") + ylab("Latitude") +
  labs(fill = "Number of Immigrants")

6.4) Peruvians

Santiago Centro, Recoleta, and Independencia were the three comunas with the most Peruvian arrivals in 2019 with 2,785 (24.27%), 1,272 (11.09%), and 1,092 (9.52%) respectively. In the Eastern Sector there were 855 (7.45%) Peruvians.

peruano_count <- visas2019STG %>% filter(PAÍS == 'Perú') %>% group_by(nombre_comuna) %>% count() %>% arrange(-n)

visas2019STG %>% filter(PAÍS == 'Perú') %>% count()

##       n
## 1 11474

visas2019STG %>% filter(PAÍS == 'Perú' & nombre_comuna %in% c('Providencia', "Las Condes", 'Vitacura', 'Lo Barnechea')) %>% count()

##     n
## 1 855

peruano_count

## # A tibble: 32 x 2
## # Groups:   nombre_comuna [32]
##    nombre_comuna        n
##    <chr>            <int>
##  1 Santiago          2785
##  2 Recoleta          1272
##  3 Independencia     1092
##  4 Estación Central   634
##  5 Quinta Normal      585
##  6 Conchalí           475
##  7 Peñalolén          415
##  8 Las Condes         373
##  9 La Florida         335
## 10 Lo Prado           320
## # … with 22 more rows

STGgeo <- left_join(STGgeo, peruano_count, by = "nombre_comuna")

STGgeo <- STGgeo %>% rename(numero_peruanos = n)

ggplot() + geom_sf(data = STGgeo$geometry, aes(fill = STGgeo$numero_peruanos)) + 
scale_fill_viridis_c(option = "inferno",trans = 'sqrt') +
geom_text(data = STGgeo, aes(X, Y, label = labels), size = 3, color = "white") +
annotation_north_arrow(aes(which_north = "true", location = "br"), pad_y = unit(0.8, "cm")) +
  annotation_scale(aes(location = "br", style = "bar")) +
  theme(panel.grid.major = element_line(color = gray(0.5), linetype = "dashed")) +
  theme (panel.background = element_rect(fill = "light grey")) +
  ggtitle("6.4 Location of Peruvians that arrived in 2019") + xlab("Longitude") + ylab("Latitude") +
  labs(fill = "Number of immigrants")

6.5) Analysis Summary

The following conclusions can be taken from the above analysis:

There were more immigrants from Central America and South America.
Speaking Spanish appears to be an important factor in whether many immigrants arrive from a given country.
US-Americans have the highest GDP per capita and were the only analysed nationality with the majority of their population living in the Eastern Sector.
Haitians have the lowest GDP per capita and also had the lowest percentage of people living in the Eastern Sector.
Haitians were more widely dispersed with Quilicura, to the north of Santiago having the most Haitians. In comparison Venezuelans, US-Americans, and Peruvians were more concentrated around the center of the city.
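The two points about the Eastern Sector in the list above can be checked from the counts reported in Section 6. The sketch below copies those counts into two vectors, whose names are chosen here for illustration, and computes the percentage of each nationality that lived in the Eastern Sector.

```r
# Counts copied from the outputs above: Eastern Sector arrivals and total arrivals.
eastern <- c(Venezuela = 1998, Haiti = 25, USA = 352, Peru = 855)
arrivals <- c(Venezuela = 58009, Haiti = 7135, USA = 564, Peru = 11474)

# Percentage of each nationality that lived in the Eastern Sector.
eastern_share <- round(100 * eastern / arrivals, 2)
eastern_share
```

US-Americans are the only group of the four above 50%, and Haitians have the smallest share.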

7) Conclusion

In this part 1 publication immigration data from 2019 for Santiago has been explored with maps created for the distribution of Venezuelan, Haitian, US-American, and Peruvian immigrants, with some conclusions drawn. A part 2 publication will follow where a classification model will be created to try and classify if an immigrant lives in the Eastern Sector of the city. Thank you for reading this publication.

Titanic - Random Forest

Tue, 02 Jun 2020 00:00:00 +0000

In the last publication from StatCityPro a decision tree classification model was created with the aim of predicting if a passenger on the Titanic survived or died. Cross validation was used with an accuracy of 80.79% achieved.

In this new publication a Random Forest model is used to try and improve the accuracy. The data used is the same as the data at the end of the previous publication (total, train_val, train_test_val). Therefore, feature engineering has already been carried out and the data is ready for model creation.

2) Packages

The following packages are used in this publication.

library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(DT)
library(purrr)
library(corrplot)
library(randomForest)
library(caret)
library(rpart)
library(rpart.plot)

3) Loading the Data

setwd("~/Documents/Machine Learning/15. Hugo/academic-kickstart-master/content/en/post/Titanic-RF")

total <- read.csv("total2.csv")

total <- total[,-1]

train_val <- read.csv("train_val.csv")

train_test_val <- read.csv("train_test_val.csv")

The variables Title, Age.Group, and Survived are converted to factors.

train_val$Title <- as.factor(train_val$Title)
train_val$Age.Group <- as.factor(train_val$Age.Group)
train_val$Survived <- as.factor(train_val$Survived)

4) Variables and Data

The data set total has 1,309 observations and 16 variables. Of these 1,309 observations, 891 are training data and 418 are testing data.

To train the model the 891 training observations are used, divided into two groups. The first group is train_val, which has 714 observations and is used to train the model. The model is then tested using the second group, train_test_val, which has the remaining 177 observations and acts as a preliminary testing group. After the feature engineering that took place in the previous publications these data sets only have seven variables.

Variable	Description
Survived	Survived (1) or Died (0)
Pclass	Social class of passenger
Title	Title of passenger
Sex	Sex of passenger
Age.Group	Age group of passenger
Family_size	Number of family members on Titanic
Embarked	Port of embarkation

5) What is a Random Forest Model?

A Random Forest is a collection of decision trees that are joined together in a forest. This collection of trees is what makes a Random Forest Model more reliable than a Decision Tree Model. Each tree gives a classification of survived or deceased for each observation. For each observation the final result is the most frequent classification. For example in a model with 500 trees, if 300 predict that an observation survived and 200 predict that they died, the observation is classified as survived.
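The majority vote described above can be sketched in a few lines of base R. This is only an illustration of the voting step, not of how randomForest is implemented; the vector tree_votes is a hypothetical stand-in for the 500 individual tree predictions for one passenger.

```r
# Hypothetical votes from 500 trees for a single passenger:
# 300 trees predict "survived" (1) and 200 predict "died" (0).
tree_votes <- c(rep("1", 300), rep("0", 200))

# Count the votes per class and keep the most frequent class.
vote_counts <- table(tree_votes)
final_class <- names(vote_counts)[which.max(vote_counts)]

vote_counts
final_class
```

With these assumed votes the passenger is classified as 1 (survived), matching the example in the text.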

6) Creating the Model

The model is created using Survived as the dependent variable and the other six variables as independent variables. The model is run against the train_val data and obtains an out-of-bag (OOB) accuracy of 83.47% (100% - 16.53% OOB error rate). The model uses 500 trees with two variables tested at each split. Further on, these settings are checked to confirm that the model cannot be improved with different settings.

set.seed(1234)
rf_model <- randomForest(Survived ~ Pclass + Title + Sex + Embarked + Family_size + Age.Group, data = train_val, ntree = 500)


rf_model

## 
## Call:
##  randomForest(formula = Survived ~ Pclass + Title + Sex + Embarked +      Family_size + Age.Group, data = train_val, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 16.53%
## Confusion matrix:
##     0   1 class.error
## 0 409  31  0.07045455
## 1  87 187  0.31751825
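The error rate and the class errors in this output can be reproduced by hand from the confusion matrix. The sketch below simply retypes the four counts printed above (rows are the observed classes, columns the predicted classes); the object names are chosen here for illustration.

```r
# OOB confusion matrix as printed for rf_model
# (rows = observed class, columns = predicted class).
oob_confusion <- matrix(c(409,  31,
                           87, 187),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(observed  = c("0", "1"),
                                        predicted = c("0", "1")))

n_total      <- sum(oob_confusion)                  # 714 observations
n_wrong      <- n_total - sum(diag(oob_confusion))  # off-diagonal counts
oob_error    <- n_wrong / n_total
oob_accuracy <- 1 - oob_error

# Per-class error: share of each observed class that was misclassified.
class_error <- 1 - diag(oob_confusion) / rowSums(oob_confusion)

round(100 * oob_error, 2)     # 16.53
round(100 * oob_accuracy, 2)  # 83.47
class_error
```

The class errors show that the model is considerably better at identifying passengers who died than passengers who survived.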

6.1) Number of Trees

In this section the number of trees in the model is reviewed to see if it needs to be increased. A table is created to show the cumulative error rate after each of the 500 trees is added. At each point there are three error rates:

for values of 0 (somebody who died)
for values of 1 (somebody who survived)
for all values (people who died and survived)

These error rates are graphed. It is hoped that before the 500th tree the error rate will have stabilised with a flat line present on the graph.

oob.error.data <- data.frame(
  Trees=rep(1:nrow(rf_model$err.rate), times=3),
  Type=rep(c("OOB", "0", "1"), each=nrow(rf_model$err.rate)),
  Error=c(rf_model$err.rate[,"OOB"],
          rf_model$err.rate[,"0"],
          rf_model$err.rate[,"1"]))

ggplot(data=oob.error.data, aes(x=Trees, y=Error)) + 
  geom_line(aes(color=Type))

In the above graph it is clear that the error rate stabilises before the 500th tree. This means that it is not necessary to add more trees to the model, as they would not reduce the error rate any more.

6.2) Number of Variables

In this section a test is run to see how many variables should be tested at each node in the model. Currently, two variables are tested.

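In oob.values it is shown that mtry=2, with two tested variables, gives the lowest OOB error value (0.1666667) and is therefore the value used in the model. As the model only has six independent variables, values of mtry above six cannot test more variables: randomForest resets them to six with a warning, so entries 7 to 10 are repeated runs of mtry=6 that differ only through random variation.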

oob.values <- vector(length = 10)
for(i in 1:10) {
  temp.model <- randomForest(Survived ~ Pclass + Title + Sex + Embarked + Family_size + Age.Group, data = train_val, mtry=i, ntree=500)
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate),1]
  }

oob.values

##  [1] 0.1918768 0.1666667 0.1778711 0.1806723 0.1764706 0.1778711 0.1806723
##  [8] 0.1806723 0.1792717 0.1736695
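Rather than reading the smallest value off the printed vector, the best setting can be picked programmatically. The sketch below retypes the ten values printed above into a new vector (named oob_values_printed here so that it does not overwrite oob.values) and uses which.min to locate the minimum.

```r
# OOB error rates for mtry = 1, ..., 10, copied from the output above.
oob_values_printed <- c(0.1918768, 0.1666667, 0.1778711, 0.1806723, 0.1764706,
                        0.1778711, 0.1806723, 0.1806723, 0.1792717, 0.1736695)

# Position of the smallest error; the position equals the mtry value tried.
best_mtry <- which.min(oob_values_printed)

best_mtry                 # 2
min(oob_values_printed)   # 0.1666667
```

In the original session the same result would come from which.min(oob.values).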

6.3) Variable Importance

The mean decrease in Gini can be used to determine which variables are the most and least important in the model. The table and graph below show that the variables Age.Group and Embarked have a low importance in the model. Therefore, a new model is tried without these two variables.

importance(rf_model)

##             MeanDecreaseGini
## Pclass              33.40132
## Title               62.14757
## Sex                 42.51139
## Embarked            11.08408
## Family_size         22.26955
## Age.Group           11.74166

varImpPlot(rf_model)

7) New Model

A new model is created without the variables for Age.Group and Embarked. The settings of 500 trees and two variables at each node are used.

However, when the model is run using the train_val data it seems that the two removed variables are in fact needed in the model as without them the accuracy is reduced to 82.63% (100 - 17.37% (error rate)). Therefore, the first model with all of the independent variables is used as the final model.

train_val1=train_val[,c(-4,-6)]

train_test_val1=train_test_val[,c(-4,-6)]
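The index-based selection c(-4,-6) relies on Embarked and Age.Group being the fourth and sixth columns of the CSV files, which is not shown in this publication. A possibly safer alternative is to drop the columns by name. The sketch below uses a small made-up data frame (toy_train, with hypothetical rows) that only borrows the column names from the model formula.

```r
# Toy stand-in for train_val: hypothetical rows, real column names.
toy_train <- data.frame(
  Pclass      = c(3, 1, 2),
  Title       = c("Mr", "Mrs", "Miss"),
  Sex         = c("male", "female", "female"),
  Embarked    = c("S", "C", "Q"),
  Family_size = c(1, 2, 1),
  Age.Group   = c("Age.1839", "Age.4059", "Age.1317"),
  Survived    = c(0, 1, 1)
)

# Drop the two low-importance variables by name instead of by position.
drop_cols   <- c("Embarked", "Age.Group")
toy_reduced <- toy_train[, !(names(toy_train) %in% drop_cols)]

names(toy_reduced)
```

Selecting by name keeps working if the column order of the input file changes.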

set.seed(1234)
rf_model2 <- randomForest(Survived ~ Pclass + Title + Sex + Family_size, data = train_val1, ntree=500)

rf_model2

## 
## Call:
##  randomForest(formula = Survived ~ Pclass + Title + Sex + Family_size,      data = train_val1, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 17.37%
## Confusion matrix:
##     0   1 class.error
## 0 387  53   0.1204545
## 1  71 203   0.2591241

varImpPlot(rf_model2)

8) Testing the Model

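In this section the model is tested using the train_test_val data. In order to prepare this data some variables are converted to factors. The model is then run with the dependent variable hidden to predict which people survived or died. The predictions are then compared to the real data in a confusion matrix, with the model achieving an accuracy of 83.05% and a kappa value of 0.6316. Note that the confusionMatrix call below passes the observed values first and the predictions second, so the "Prediction" and "Reference" labels in the output are swapped. Accuracy and kappa are not affected by this, but sensitivity, specificity, and the predictive values are.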

train_test_val$Title <- as.factor(train_test_val$Title)
train_test_val$Age.Group <- as.factor(train_test_val$Age.Group)
train_test_val$Survived <- as.factor(train_test_val$Survived)

set.seed(1234)
rf_predictions <- predict(rf_model, train_test_val)
confusionMatrix(train_test_val$Survived, rf_predictions)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 99 10
##          1 20 48
##                                          
##                Accuracy : 0.8305         
##                  95% CI : (0.767, 0.8826)
##     No Information Rate : 0.6723         
##     P-Value [Acc > NIR] : 1.683e-06      
##                                          
##                   Kappa : 0.6316         
##                                          
##  Mcnemar's Test P-Value : 0.1003         
##                                          
##             Sensitivity : 0.8319         
##             Specificity : 0.8276         
##          Pos Pred Value : 0.9083         
##          Neg Pred Value : 0.7059         
##              Prevalence : 0.6723         
##          Detection Rate : 0.5593         
##    Detection Prevalence : 0.6158         
##       Balanced Accuracy : 0.8298         
##                                          
##        'Positive' Class : 0              
##
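The accuracy and kappa reported above can be checked by hand from the four cell counts. The following base R sketch retypes the printed matrix; the object names are illustrative. Kappa compares the observed agreement with the agreement expected by chance from the row and column totals.

```r
# Cell counts copied from the confusion matrix above.
cm <- matrix(c(99, 10,
               20, 48),
             nrow = 2, byrow = TRUE)

n <- sum(cm)  # 177 observations in train_test_val

# Observed agreement (accuracy): share of cases on the diagonal.
observed_agreement <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, rescaled to a maximum of 1.
cohen_kappa <- (observed_agreement - expected_agreement) /
  (1 - expected_agreement)

round(observed_agreement, 4)  # 0.8305
round(cohen_kappa, 4)         # 0.6316
```

Both values are unchanged if the matrix is transposed, which is why the order of the two arguments does not matter for these two statistics.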

9) Conclusion

In this publication a Random Forest Model has been created to predict if somebody survived or died on board the Titanic. It was hoped that the model would improve on the accuracy achieved using a decision tree model in a previous publication. The accuracy was improved, with the Random Forest Model achieving an accuracy of 83.05% in comparison to the 80.79% of the Decision Tree Model. 500 trees were used with two variables tested at each split. Thank you for taking the time to read this publication and hopefully it has been of use.

Titanic - Who Survived? - Part 2

Tue, 26 May 2020 00:00:00 +0000

This publication will follow on from Part one on the Titanic to conduct data analysis for the passengers onboard and to create a classification decision tree model to predict if a passenger survived or died. Part one covered the data preparation process.

2) Packages

The following packages are used in this publication.

library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(knitr)
library(DT)
library(purrr)
library(corrplot)
library(randomForest)
library(caret)
library(rpart)
library(rpart.plot)

3) Data Exploration

With the data already prepared in part 1, the next step is to carry out some data exploration to better understand the data’s characteristics. Only the train data is used in this data exploration section.

setwd("~/Documents/Machine Learning/15. Hugo/academic-kickstart-master/content/en/post/Titanc-part2")


total <- read_csv("total2.csv")

total <- total[,-1]

test <- read.csv("test.csv")

train <- read.csv("train.csv")

5.1) Pclass

The below graph shows the survival of passengers in each of the three classes. Red represents passengers that died, and blue passengers that survived. The white line is the average survival rate for passengers in the training data (38.38%).

First class passengers had a survival rate higher than the average. Second class passengers followed this trend, albeit to a lesser degree, with a survival rate of close to 50%. In comparison, third class passengers had a lower survival rate of 25%.

ggplot() + 
  
  geom_bar(data = total %>% filter(group == 'train'), aes(Pclass, fill = as.factor(Survived)), position = 'fill') + 
  
  geom_hline(yintercept = 0.3838, col = "white", lty=2, size=2) +
  
    scale_fill_brewer(palette="Set1") +

  ylab("Survival Rate") +
  
  xlab("Class") + 

  ggtitle("Survival Rate by Class") + 
  
  labs(fill = "Survived") +
  
  theme_minimal()

5.2) Title

The following graph highlights how passengers with the title ‘Mr’ had a very low survival rate. In comparison, a child with the title ‘Master’ or an adult woman with the title ‘Miss’ had a greater chance of survival, with survival rates of close to 70%.

ggplot() + 
  
  geom_bar(data = total %>% filter(group == 'train'), aes(Title, fill = as.factor(Survived)), position = 'fill') + 
  
  geom_hline(yintercept = 0.3838, col = "white", lty=2, size=2) +
  
    scale_fill_brewer(palette="Set1") +

  ylab("Survival Rate") +

  ggtitle("Survival Rate by Title") + 
  
  labs(fill = "Survived") +
  
  theme_minimal()

5.3) Port

There were three ports of embarkation: Southampton, Great Britain; Cherbourg, France; and Queenstown, Ireland. There is not much variation in the survival rates of passengers from different ports; however, passengers from Cherbourg had the highest survival rate.

ggplot(total %>% filter(group=="train"), aes(Embarked, fill=as.factor(Survived))) +

  geom_bar(position = "fill") +

  scale_fill_brewer(palette="Set1") +

  ylab("Survival Rate") +

  geom_hline(yintercept=0.38, col="white", lty=2, size=2) +

  ggtitle("Survival Rate by Embarking Point") +
  
  labs(fill = "Survived") +

  theme_minimal()

5.4) Fare and Class

The graph below shows the relationship between survival and fare. The graph suggests that passengers with a more expensive ticket were more likely to survive. For example, there is a higher concentration of tickets costing more than £50 among the passengers who survived.

ggplot(total %>% filter(group=="train"), aes(Fare, Survived)) +
  
  geom_point() +

  ylab("Survival Rate") +

  ggtitle("Survival vs. Fare") + 
  
  theme_minimal()

Class can also be plotted by fare with this graph suggesting that there is potentially a strong correlation between class and fare.

ggplot(total %>% filter(group=="train"), aes(Pclass, Fare)) +
  
  geom_point() +

  ylab("Fare (£)") +
  
  xlab("Class") +

  ggtitle("Class vs. Fare") + 
  
  theme_minimal()

5.4.1) Over Correlation

Following on from the above suggestion of a correlation between class and fare, the correlation for these two variables is calculated. The graphic below shows a correlation of -0.55 between these two variables. Because class and fare carry overlapping information about the dependent variable, including both could introduce multicollinearity into the model. Therefore, fare is not included in the model.

tbl_corr <- total %>%

  filter(group=="train") %>%

  select(-PassengerId, -SibSp, -Parch) %>%

  select_if(is.numeric) %>%

  cor(use="complete.obs") %>%

  corrplot.mixed(tl.cex=0.85)
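For a single pair of variables the same kind of figure can be obtained directly with cor. The sketch below uses a handful of made-up passengers (toy_class and toy_fare are hypothetical values, not Titanic data) purely to show the call and the sign convention.

```r
# Hypothetical passengers: class (1 = first, 3 = third) and fare in pounds.
toy_class <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
toy_fare  <- c(80, 52, 30, 26, 13, 21, 8, 7, 15, 9)

# Pearson correlation; a negative value means that fares fall
# as the class number rises.
class_fare_cor <- cor(toy_class, toy_fare)

round(class_fare_cor, 2)
```

On the real data the equivalent call would use the Pclass and Fare columns of the training rows, with use = "complete.obs" if any fares are missing.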

5.5) Sex

This graph shows that females had a higher rate of survival than males. Almost 75% of the females survived in comparison with close to 23% of the males.

ggplot(total %>% filter(group=="train"), aes(Sex, fill=as.factor(Survived))) +

  geom_bar(position = "fill") +

  scale_fill_brewer(palette="Set1") +

  ylab("Survival Rate") +

  geom_hline(yintercept=0.38, col="white", lty=2, size=2) +

  ggtitle("Survival Rate by Sex") + 
  
  labs(fill = "Survived") +

  theme_minimal()

5.5.1) Sex and Class

The following graph shows how strongly sex affected the chances of survival. Almost all females from first or second class survived, and almost 50% of the females in third class survived. In comparison, the survival rate for males was never higher than the average survival rate for all passengers, regardless of their class.

ggplot(total %>% filter(group=="train"), aes(Pclass, fill=as.factor(Survived))) +
  
  facet_wrap(~Sex, scales = "free") +

  geom_bar(position = "fill") +

  scale_fill_brewer(palette="Set1") +

  ylab("Survival Rate") +

  geom_hline(yintercept=0.38, col="white", lty=2, size=2) +

  ggtitle("Survival Rate by Sex and Class") + 
  
  labs(fill = "Survived") +

  theme_minimal()

5.5.2) Cherbourg, Sex and Class

The graph below splits the variables of port, sex and class to see how these variables impacted survival. Some interesting trends are identified.

Firstly, almost all the females from first and second class survived, regardless of the port where they started their journey. Females from third class did not have such luck, with lower survival rates that depended on the port where they boarded the ship. Females from Southampton had a particularly low survival rate compared with the other females.

In relation to the male passengers, there were higher survival rates for those who had boarded the boat in Southampton and Cherbourg. Additionally, Cherbourg had the highest male survival rate for each of the three classes. Finally, almost all of the males from Queenstown died, with third class males having the highest survival rate among the male Queenstown passengers.

ggplot(total %>% filter(group=="train"), aes(Embarked, fill = as.factor(Survived))) + 
  
  facet_wrap(~Sex~Pclass, scales = "free") +

  geom_bar(position = "fill") + 
  
  scale_fill_brewer(palette="Set1") +

  ylab("Survival Rate") +

  ggtitle("Survival Rate by Embarking Point") +
  
  labs(fill = "Survived") +
  
    geom_hline(yintercept=0.38, col="white", lty=2, size=2) +

  theme_minimal()

5.6) Family Size

In the below graph it is noted that families of two, three, or four people on the Titanic had a higher rate of survival.

ggplot(total %>% filter(group=="train"), aes(Family_size, fill=as.factor(Survived))) +

  geom_bar(position = "fill") +
  
  scale_fill_brewer(palette="Set1") +

  ylab("Survival Rate") +

  geom_hline(yintercept=0.38, col="white", lty=2, size=2) +

  ggtitle("Survival Rate by Family Size") +
  
  labs(fill = "Survived") +

  theme_minimal()

6) Creating the Model

6.1) Preparing the Data

Firstly, the variables used in the model need to be chosen. As previously explained, Fare and Cabin are not used. The train observations are put in a separate group and then divided into two separate groups:

train_val = 80% of the training observations

train_test_val = 20% of the training observations

The train_val is used to train the model. Then the model is tested firstly using the train_test_val. The results of this testing are used to adjust the model so that the best settings can be used when testing the model with the real test data.

Finally with the model adjusted the model is tested using the real test data with the dependent variable hidden.

feauter1 <-total[1:891, c("Pclass", "Title","Sex","Embarked","Family_size","Age Group", "Survived")]

feauter1$Survived <- as.factor(feauter1$Survived)

set.seed(500)

ind <- createDataPartition(feauter1$Survived,times=1,p=0.8,list=FALSE)

train_val <- feauter1[as.vector(ind),]

train_test_val <- feauter1[-ind,]
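The idea behind a stratified 80/20 split can be sketched in base R. This is a simplified stand-in, not caret's implementation of createDataPartition: it samples 80% of the row indices within each class separately so that both groups keep the same class balance. The outcome vector is a hypothetical one built from the class counts implied by the confusion matrices in these publications (549 deaths and 342 survivors among the 891 training rows).

```r
set.seed(500)

# Hypothetical outcome vector with the class counts of the training data.
outcome <- factor(c(rep(0, 549), rep(1, 342)))

# Sample 80% of the row indices within each class (rounded up).
train_idx <- unlist(lapply(split(seq_along(outcome), outcome), function(idx) {
  sample(idx, size = ceiling(0.8 * length(idx)))
}))

# Class counts in the 80% group and in the remaining 20% group.
table(outcome[train_idx])
table(outcome[-train_idx])
```

Under these assumptions the larger group has 440 deaths and 274 survivors and the smaller group has 109 and 68, which are the class totals seen in the confusion matrices for train_val and train_test_val.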

6.2) The Distribution of the Dependent Variable

The distribution of passengers that survived and died in the train_val and train_test_val groups is checked below to ensure that both groups have the same class balance. In each group, as in the full training data, the ratio is roughly 6 deaths to 4 survivors.

# prop.table() already returns proportions, so no scaling of the counts is needed
round(prop.table(table(train$Survived)),digits = 1)

## 
##   0   1 
## 0.6 0.4

round(prop.table(table(train_val$Survived)),digits = 1)

## 
##   0   1 
## 0.6 0.4

round(prop.table(table(train_test_val$Survived)),digits = 1)

## 
##   0   1 
## 0.6 0.4
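Rounding the proportions to one decimal hides how close the groups really are. As a rough check, the class counts can be taken from the confusion matrices shown in these publications (440 and 274 for train_val, 109 and 68 for train_test_val) and expressed as percentages; the vector names below are illustrative.

```r
# Class counts taken from the confusion matrices for each group.
train_val_counts      <- c("0" = 440, "1" = 274)
train_test_val_counts <- c("0" = 109, "1" = 68)

# Percentages to one decimal place.
train_val_pct      <- round(100 * prop.table(train_val_counts), 1)
train_test_val_pct <- round(100 * prop.table(train_test_val_counts), 1)

train_val_pct       # 61.6 38.4
train_test_val_pct  # 61.6 38.4
```

Both groups have about 61.6% deaths and 38.4% survivors, so the split preserved the class balance.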

6.3) Decision Tree Model

In this section a decision tree model is created using the train_val group, which represents roughly 55% of the total data (714 of 1,309 observations). Then the train_test_val group is used to test and make adjustments to the model.

Having created the model, the confusion matrix can be used to see its accuracy. On the train_val data the model has an accuracy of 0.8389 with a kappa of 0.642. Cross validation is then used to verify the model and to check that over fitting has not occurred.

set.seed(1234)

Model_DT <- rpart(Survived~.,data=train_val,method="class")

rpart.plot(Model_DT,extra =  3,fallen.leaves = T)

PRE_TDT=predict(Model_DT,newdata=train_val,type="class")

confusionMatrix(PRE_TDT,train_val$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 418  93
##          1  22 181
##                                           
##                Accuracy : 0.8389          
##                  95% CI : (0.8099, 0.8652)
##     No Information Rate : 0.6162          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.642           
##                                           
##  Mcnemar's Test P-Value : 6.686e-11       
##                                           
##             Sensitivity : 0.9500          
##             Specificity : 0.6606          
##          Pos Pred Value : 0.8180          
##          Neg Pred Value : 0.8916          
##              Prevalence : 0.6162          
##          Detection Rate : 0.5854          
##    Detection Prevalence : 0.7157          
##       Balanced Accuracy : 0.8053          
##                                           
##        'Positive' Class : 0               
##

6.4) Cross Validation

Ten-fold cross validation, repeated ten times, is used to ensure that there is sufficient data being used to create the model and that it represents the full range of the complete data. With this method the train_val group is divided into ten parts. Each part is used as the test group once, with the other nine parts being used as the training data, and the whole procedure is repeated ten times with different splits. In this way it is less probable that over fitting occurs.
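The fold mechanics can be sketched in base R. This is a simplified illustration of a single round of ten-fold cross validation: unlike caret's createMultiFolds it does not stratify by the outcome, and no model is fitted. The row count of 714 is the size of train_val; everything else is named here for illustration.

```r
set.seed(1234)

n_obs <- 714   # number of rows in train_val
k     <- 10    # number of folds

# Assign every row to one of k folds at random (no stratification here).
fold_id <- sample(rep(seq_len(k), length.out = n_obs))

# For each fold, the held-out rows form the test part and the rest train.
fold_sizes <- integer(k)
for (fold in seq_len(k)) {
  test_rows  <- which(fold_id == fold)
  train_rows <- which(fold_id != fold)
  # A model would be fitted on train_rows and scored on test_rows here.
  fold_sizes[fold] <- length(test_rows)
}

fold_sizes
```

Every row is held out exactly once, so the fold sizes add up to 714. Repeating the whole procedure with new random assignments gives the repeated version used below.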

set.seed(1234)

cv.10 <- createMultiFolds(train_val$Survived, k = 10, times = 10)

# Control

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,

                       index = cv.10)

                     
train_val <- as.data.frame(train_val)

##Train the data

Model_CDT <- train(x = train_val[,-7], y = train_val[,7], method = "rpart", tuneLength = 30,

                   trControl = ctrl)



rpart.plot(Model_CDT$finalModel, type=4, clip.right.labs=FALSE, branch=.7)

6.4.1) Cross Validation Predictions

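The confusion matrix below shows that the cross validation model has an accuracy of 0.8079 on the train_test_val data. This is less than the accuracy of the first model, which did not use cross validation (accuracy of 0.8389). However, that first figure was measured on the same train_val data used to fit the model, so it is likely to be optimistic. This suggests that over fitting did occur in the first model.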

set.seed(1234)
PRE_VDTS=predict(Model_CDT$finalModel,newdata=train_test_val,type="class")

confusionMatrix(PRE_VDTS,train_test_val$Survived)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 94 19
##          1 15 49
##                                           
##                Accuracy : 0.8079          
##                  95% CI : (0.7421, 0.8632)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 2.854e-08       
##                                           
##                   Kappa : 0.5895          
##                                           
##  Mcnemar's Test P-Value : 0.6069          
##                                           
##             Sensitivity : 0.8624          
##             Specificity : 0.7206          
##          Pos Pred Value : 0.8319          
##          Neg Pred Value : 0.7656          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5311          
##    Detection Prevalence : 0.6384          
##       Balanced Accuracy : 0.7915          
##                                           
##        'Positive' Class : 0               
##

rpart.rules(Model_CDT$finalModel)

##  .outcome                                                                                                                                          
##      0.08 when Title is Master or Miss or Mrs & Pclass >= 3 & Family_size >= 5                                                                     
##      0.10 when Title is                    Mr & Pclass <  2                    & Age Group is                         Age.60Ov                     
##      0.11 when Title is      Mr or Rare Title & Pclass >= 2                                                                                        
##      0.29 when Title is                    Mr & Pclass <  2 & Family_size >= 2 & Age Group is Age.1317 or Age.1839 or Age.4059 & Embarked is      S
##      0.32 when Title is                    Mr & Pclass <  2 & Family_size <  2 & Age Group is Age.1317 or Age.1839 or Age.4059                     
##      0.33 when Title is                  Miss & Pclass >= 3 & Family_size <  5 & Age Group is Age.1317 or Age.1839 or Age.4059 & Embarked is      S
##      0.53 when Title is                   Mrs & Pclass >= 3 & Family_size <  5 & Age Group is Age.1317 or Age.1839 or Age.4059 & Embarked is      S
##      0.58 when Title is                    Mr & Pclass <  2 & Family_size >= 2 & Age Group is Age.1317 or Age.1839 or Age.4059 & Embarked is      C
##      0.67 when Title is            Rare Title & Pclass <  2                                                                                        
##      0.77 when Title is Master or Miss or Mrs & Pclass >= 3 & Family_size <  5                                                 & Embarked is C or Q
##      0.78 when Title is Master or Miss or Mrs & Pclass >= 3 & Family_size <  5 & Age Group is             Age.0012 or Age.60Ov & Embarked is      S
##      0.94 when Title is Master or Miss or Mrs & Pclass <  3

6.5) Important Variables

The importance of the variables in both models is shown below. In both models, title and gender are the most important variables in determining whether someone survived or not.

This trend is reflected in the analysis carried out in Section 5 of this publication with males and passengers with the title of ‘Mr’ having a very low rate of survival.

# Get importance
Model_DT$variable.importance

##       Title         Sex Family_size      Pclass   Age Group    Embarked 
##   111.41503    96.41472    54.63255    31.71858    28.20113    11.29332

Model_CDT$finalModel$variable.importance

##       Title         Sex Family_size      Pclass   Age Group    Embarked 
##   114.23647    97.02466    55.44293    40.43166    29.94150    12.49217

6.6) Final Testing

In this section the model is tested using the original testing data with its hidden results for the dependent variable. The model with cross validation is used.

Running this model against the test data, it is predicted that out of 418 passengers 258 died and 160 survived, giving a survival rate of 38.28%, which is very close to the survival rate of 38.38% for the training data.

set.seed(1234)
PRE_TEST=predict(Model_CDT$finalModel,newdata=total[892:1309,],type="class")

PRE_TEST

##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   0   1   0   0   1   0   1   0   1   0   0   0   1   0   1   1   0   0   0   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   1   1   1   1   0   1   0   0   0   0   0   1   1   1   0   0   0   0   0 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   0   0   0   1   1   0   0   0   1   1   0   0   1   1   0   0   0   0   0   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   0   0   0   1   1   1   1   0   0   1   1   0   0   0   1   0   0   1   0   1 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   1   0   0   0   0   0   1   0   1   1   1   0   1   0   0   0   1   0   0   0 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   0   0   0   1   0   0   0   0   0   0   1   1   1   1   0   0   1   0   1 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   1   0   1   0   0   0   0   1   0   0   0   1   0   0   0   0   0   0   0   0 
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
##   0   1   0   0   0   0   0   0   0   0   1   0   0   1   0   0   1   0   0   1 
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   1   1   0   0   1   0   0   1   0   0   0   0   0   0   1   1   1   1   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
##   0   1   1   0   1   0   1   0   0   0   0   0   1   0   1   0   1   0   0   1 
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 
##   1   1   1   1   0   0   1   0   1   0   0   0   0   1   0   0   1   0   1   0 
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##   1   0   1   0   1   1   0   1   0   0   0   1   0   0   1   0   0   0   1   1 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
##   1   1   1   0   1   0   1   0   1   1   1   0   1   0   0   0   0   0   1   0 
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 
##   0   0   1   1   0   0   0   0   0   0   0   0   1   1   0   1   0   0   0   0 
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##   0   1   1   1   1   0   0   0   0   0   0   1   0   1   0   0   1   0   0   0 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 
##   0   0   0   0   1   1   0   1   0   1   0   0   0   1   1   1   1   0   0   0 
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 
##   0   0   0   0   1   0   1   0   0   0   1   0   0   1   0   0   0   0   0   1 
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   0   0   0   1   1   0   0   1   0   1   1   0   0   0   1   0   1   0   0   1 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 
##   0   1   1   0   1   0   0   0   1   0   0   1   0   0   1   1   0   0   0   0 
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 
##   0   0   1   1   0   1   0   0   0   0   0   1   1   0   0   1   0   1   0   0 
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 
##   1   0   1   0   1   0   0   1   1   1   1   1   0   0   1   0   0   1 
## Levels: 0 1

TEST_Results <- cbind(test, PRE_TEST)

summary(PRE_TEST)

##   0   1 
## 258 160
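The predicted survival rate quoted above follows directly from these two counts. A minimal sketch of the arithmetic, with the counts retyped from the summary output:

```r
# Predicted class counts for the 418 test passengers.
predicted_counts <- c(died = 258, survived = 160)

# Share of each predicted class, as a percentage.
predicted_pct <- round(100 * predicted_counts / sum(predicted_counts), 2)

predicted_pct  # died 61.72, survived 38.28
```

In the original session the same figures would come from prop.table(table(PRE_TEST)).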

7) Conclusion

In conclusion, in this publication a decision tree model has been created to classify if passengers survived or died on the Titanic. The final model used cross validation in order to avoid over fitting.

When the model was tested against the train_test_val an accuracy of 80.79% was achieved.

Finally, when the model was tested using the test data it was predicted that out of the 418 test group passengers 258 died and 160 survived with a survival rate of 38.28%.

It would be interesting to extend this analysis in the future using a random forest model to see if the model accuracy could be improved.

Thank you for reading this two part publication. Hopefully it has been informative.

Titanic - Who Survived? - Part 1

Wed, 20 May 2020 00:00:00 +0000

The Titanic sank on the 15th of April 1912 after hitting an iceberg during its maiden voyage. Unfortunately there were not sufficient lifeboats on board, resulting in the death of 1502 of the 2224 passengers and crew. When the Titanic started its journey it was considered the best ship in the world and unsinkable.

As time has passed and analysis has been carried out of the passengers who died and survived it has been discovered that people with certain characteristics had a higher chance of survival. This publication responds to these findings with the intention of creating a decision tree classification model to predict if someone onboard survived or died. Therefore survival is the dependent variable. Independent variables including title, age, sex, social class, and port of embarkation are used. The publication is split into two parts. Part one focuses on the data preparation, with part two focusing on data analysis and model creation.

2) Packages

The following packages are used in this publication.

library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(DT)
library(purrr)
library(corrplot)
library(randomForest)
library(caret)
library(rpart)
library(rpart.plot)

3) Loading the Data

Two data bases are used - the first to train the model, and the second to test the model.

The general idea of machine learning classification is to use the training data to look for relationships between the independent variables and the dependent variable, with a model being created.

The model is then tested against the second data set which contains unseen data. This is to say that it contains all of the independent variables but with the dependent variable hidden. The model looks for similar patterns between the independent variables that existed in the training data and uses them to predict the dependent variable.

Both data sets can be downloaded from the following link.

setwd("~/Documents/Machine Learning/15. Hugo/academic-kickstart-master/content/en/post/Titanic")

test <- read.csv("test.csv")

train <- read.csv("train.csv")

3.1) Combining Test and Train

To allow preliminary analysis of all passengers, the two data sets are combined. For this to be possible, a Survived variable is first added to the test data set so that both data sets have the same 12 variables. Combined, the data sets contain 1309 observations of 12 variables. The variables are:

Variable	Description
PassengerId	Id number
Survived	Survived (1) or died (0)
Pclass	Social class of passenger
Name	Name of passenger
Sex	Sex of passenger
Age	Age of passenger
SibSp	Number of siblings or spouses on the Titanic
Parch	Number of parents or children on the Titanic
Ticket	Ticket number
Fare	Cost of ticket
Cabin	Cabin number
Embarked	Port of embarkation

test$Survived <- NA

test <- test[,c(1,12,2,3,4,5,6,7,8,9,10,11)]

total <- rbind(train, test)

str(total)

## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
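
As a small sketch of the same combining step, the columns of the test data can be ordered by name rather than by position, which avoids having to count column indices. The two toy data frames below are made-up stand-ins for train.csv and test.csv with only three columns.

```r
# Toy stand-ins for the two files (values are made up)
toy_train <- data.frame(PassengerId = 1:3,
                        Survived = c(0L, 1L, 1L),
                        Pclass = c(3L, 1L, 3L))
toy_test  <- data.frame(PassengerId = 4:5,
                        Pclass = c(2L, 3L))

# Add the hidden dependent variable as missing values
toy_test$Survived <- NA_integer_

# Put the test columns in the same order as the training columns
toy_test <- toy_test[, names(toy_train)]

# Stack the two data sets
toy_total <- rbind(toy_train, toy_test)
```

Selecting with names(toy_train) keeps working if a column is later added to or removed from both files, whereas a hard-coded index vector would need to be rewritten.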

4) Feature Engineering

In the following section the data is cleaned so that it can be analysed further on.

4.1) Missing Values

There are five variables with missing values: Cabin, Survived, Age, Embarked, Fare. It is necessary to treat these missing values as they can impact the effectiveness of a classification model. The only variable that is not treated for missing values is Survived as this is the dependent variable and the missing values correspond to the hidden test data.

checkAllCols(total)

##            col   class  num numMissing numInfinite      avgVal minVal maxVal
## 1  PassengerId integer 1309          0           0 655.0000000      1   1309
## 2     Survived integer  891        418           0   0.3838384      0      1
## 3       Pclass integer 1309          0           0   2.2948816      1      3
## 4         Name  factor 1309          0          NA          NA     NA     NA
## 5          Sex  factor 1309          0          NA          NA     NA     NA
## 6          Age numeric 1046        263           0  29.8811377      0     80
## 7        SibSp integer 1309          0           0   0.4988541      0      8
## 8        Parch integer 1309          0           0   0.3850267      0      9
## 9       Ticket  factor 1309          0          NA          NA     NA     NA
## 10        Fare numeric 1308          1           0  33.2954793      0    512
## 11       Cabin  factor  295       1014          NA          NA     NA     NA
## 12    Embarked  factor 1307          2          NA          NA     NA     NA
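
The source of checkAllCols() is not shown in this post. As a rough base R sketch of the missing-value count only, the helper below (a hypothetical name, count_missing) treats both NA and empty strings as missing, which matches how Cabin and Embarked appear to be counted in the summary above. The toy data frame is made up.

```r
# Toy data frame mixing NA values and empty strings (values are made up)
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = TRUE
)

# Count entries in one column that are NA or an empty string
count_missing <- function(x) {
  sum(is.na(x) | as.character(x) == "", na.rm = TRUE)
}

# Apply the helper to every column
missing_counts <- sapply(toy, count_missing)
```

Applied to the combined Titanic data, a count of this kind would be expected to flag the same five variables listed above.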

4.2) Age

There are 263 missing values for the age variable. The average age of passengers is inserted for these missing values.

4.2.1) Average Age

The code below calculates the average age of passengers and inserts this value wherever Age is missing. Additionally, a new variable is created which groups the passengers by age into the following classes:

Age < 13
Age >= 13 & Age < 18
Age >= 18 & Age < 40
Age >= 40 & Age < 60
Age >= 60

total <- total %>% mutate(Age = ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age),
                          `Age Group` = case_when(Age < 13 ~ "Age.0012",
                                                  Age >= 13 & Age < 18 ~ "Age.1317",
                                                  Age >= 18 & Age < 40 ~ "Age.1839",
                                                  Age >= 40 & Age < 60 ~ "Age.4059",
                                                  Age >= 60 ~ "Age.60Ov"))
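
The same imputation and grouping can also be sketched in base R with cut(). This is an alternative to the approach above, not the code used in this publication, and the toy ages are made up. Setting right = FALSE makes each interval closed on the left, so an age of exactly 13 falls in the 13 to 17 group, as in the class definitions listed earlier.

```r
# Toy ages with two missing values (values are made up)
toy_age <- c(5, 13, NA, 17.5, 40, 60, NA, 39)

# Replace missing ages with the mean of the observed ages
toy_age[is.na(toy_age)] <- mean(toy_age, na.rm = TRUE)

# Bin the ages; intervals are [0,13), [13,18), [18,40), [40,60), [60,Inf)
age_group <- cut(
  toy_age,
  breaks = c(0, 13, 18, 40, 60, Inf),
  labels = c("Age.0012", "Age.1317", "Age.1839", "Age.4059", "Age.60Ov"),
  right = FALSE
)
```

Note that mean imputation places every previously missing age in a single group, which is worth keeping in mind when the age groups are analysed.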

4.3) Port

As there are only two observations with a missing port of embarkation (Embarked), Southampton (S) is inserted for both, as it is the most frequent port in the data.

levels(total$Embarked)

## [1] ""  "C" "Q" "S"

table(total$Embarked)

## 
##       C   Q   S 
##   2 270 123 914

levels(total$Embarked)[1] <- c("S")
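
Renaming a factor level to a name that already exists merges the two levels, which is why the single assignment above is enough. The toy factor below (made-up values) shows the effect, selecting the empty level by name instead of by position so that the result does not depend on how the levels happen to be sorted.

```r
# Toy factor with two empty entries (values are made up)
port <- factor(c("S", "", "C", "Q", "S", ""))

# Rename the empty level to "S"; it is merged with the existing "S" level
levels(port)[levels(port) == ""] <- "S"

# Tabulate the ports after the merge
port_counts <- table(port)
```

This relies on Embarked being a factor, as it is in the str() output earlier; with a character column a plain replacement of the "" values would be needed instead.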

4.4) Fare

There is one missing value for this variable, located in row 1044. It is likely that class and port of embarkation affected the fare. Passenger 1044, ‘Mr Thomas Storey’, was a third class passenger who boarded the ship in Southampton. Therefore, the missing value is replaced with the average fare for third class passengers who boarded in Southampton. The result is a value of £14.44.

mean_fare_calculation <- total %>% filter(Pclass == 3, Embarked == "S", PassengerId != 1044)

mean(mean_fare_calculation$Fare)

## [1] 14.43542

total[1044, 10] <- 14.43542
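
A variation on this step is to assign the computed mean directly instead of copying the printed number by hand. The sketch below uses a made-up toy data frame and base R only; the column names mirror the Titanic data but the values do not.

```r
# Toy data with one missing fare in the third class / Southampton group
toy <- data.frame(
  PassengerId = 1:5,
  Pclass   = c(3, 3, 1, 3, 3),
  Embarked = c("S", "S", "S", "C", "S"),
  Fare     = c(8, 10, 80, 7, NA)
)

# Rows in the same class and port as the passenger with the missing fare
same_group <- toy$Pclass == 3 & toy$Embarked == "S"

# Mean fare of that group, ignoring the missing value itself
group_mean <- mean(toy$Fare[same_group], na.rm = TRUE)

# Fill only the missing fares that belong to that group
toy$Fare[is.na(toy$Fare) & same_group] <- group_mean
```

Using the stored mean avoids rounding differences and does not depend on Fare being the tenth column.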

4.5) Name

The title of each passenger is extracted from Name into a new variable, Title. The following table shows that the most common titles for passengers were Master, Miss, Mr, and Mrs, which together represent 97.40% of the passengers.

Some of the less common titles are grouped together in a new class called rare_title. Additionally, the titles for Mlle and Ms are added to the class of Miss. The title of Mme is added to the class of Mrs.

total$Title <- gsub('(.*, )|(\\..*)', '', total$Name)

table_titles_total <- table(total$Sex, total$Title)

table_titles_total

##         
##          Capt Col Don Dona  Dr Jonkheer Lady Major Master Miss Mlle Mme  Mr Mrs
##   female    0   0   0    1   1        0    1     0      0  260    2   1   0 197
##   male      1   4   1    0   7        1    0     2     61    0    0   0 757   0
##         
##           Ms Rev Sir the Countess
##   female   2   0   0            1
##   male     0   8   1            0

rare_title <- c('Dona', 'Lady', 'the Countess','Capt', 'Col', 'Don', 
                'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')

total$Title[total$Title %in% rare_title] <- 'Rare Title'
total$Title[total$Title == 'Mlle'] <- 'Miss'
total$Title[total$Title == 'Ms'] <- 'Miss'
total$Title[total$Title == 'Mme'] <- 'Mrs'

table(total$Sex, total$Title)

##         
##          Master Miss  Mr Mrs Rare Title
##   female      0  264   0 198          4
##   male       61    0 757   0         25
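
To see how the regular expression works, it helps to run it on a few names. The pattern removes everything up to and including the comma and space, and everything from the first full stop onwards, leaving only the title. The names below are illustrative strings in the same "Surname, Title. Forenames" layout as the Name variable.

```r
# Illustrative names in the "Surname, Title. Forenames" layout
toy_names <- c("Braund, Mr. Owen Harris",
               "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
               "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
               "Sagesser, Mlle. Emma")

# Strip the surname part and everything after the title
titles <- gsub("(.*, )|(\\..*)", "", toy_names)

# Apply the same kind of regrouping as above
titles[titles %in% c("the Countess")] <- "Rare Title"
titles[titles == "Mlle"] <- "Miss"
```

Because the regrouping is done on a character vector, each replacement is a simple assignment and no factor levels need to be managed.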

4.6) Family

With the variables SibSp and Parch it is possible to know if a passenger had family on the Titanic. SibSp counts siblings and spouses, and Parch counts parents and children. A new variable is created to count family size, including the passenger themselves.

total$Family_size <- total$SibSp + total$Parch + 1

4.7) Survival

As already explained, the Survived variable has missing values because it is the dependent variable and the missing values correspond to the observations from the test data. It is therefore not necessary to treat these missing values. However, the survival rate for the available data is analysed.

In the below table it can be seen that in the train data 61.62% of the passengers died and 38.38% survived.

total$group <- ifelse(total$PassengerId <= 891, "train", "test")

total %>% filter(group == "train") %>% group_by(Survived) %>% count() %>% mutate(percentage_all = (n/1309) * 100) %>% mutate(percentage_train = (n/891) * 100)

## # A tibble: 2 x 4
## # Groups:   Survived [2]
##   Survived     n percentage_all percentage_train
##      <int> <int>          <dbl>            <dbl>
## 1        0   549           41.9             61.6
## 2        1   342           26.1             38.4
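
The percentages quoted above can be checked with a short base R sketch. The counts are an inference rather than read from the data directly: the missing-value summary in section 4.1 reports 891 non-missing Survived values with a mean of 0.3838384, which implies 342 survivors and 549 deaths.

```r
# Rebuild the training outcome from the implied counts (0 = died, 1 = survived)
survived <- rep(c(0, 1), times = c(549, 342))

# Percentage of each outcome, rounded to two decimal places
rates <- round(prop.table(table(survived)) * 100, 2)
```

Here rates holds the share of passengers who died and who survived in the training data, and can be compared with the 61.62% and 38.38% stated in the text.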

4.8) Cabin

Cabin is the variable with the most missing values with 1014 missing in total. Therefore, this variable is not included in the model.

Conclusion

This publication has introduced this data analysis project which aims to create a decision tree classification model to predict if someone onboard the Titanic survived or died. Part one has focused on the data preparation, with the data being prepared for subsequent analysis and model creation. The data set is saved for use in further publications.

write.csv(total, file = "total2.csv")