Monday, 3 October 2016

Do you know Sparse Principal Components Analysis, Mr Trump?

While writing a paper on sparse principal component analysis I came across an old dataset containing 1990s socio-economic data and rate of violent crime for 1994 communities in the US. I am not a sociologist, so my analysis may be superficial, but I found the results interesting with respect to Mr Trump's political views. Looking at the results, it turns out that the traditional approach of considering only the largest loadings of the PCs seems to support the view that immigrants are a major cause of violent crime  Instead, applying SPCA gives an entirely different view of the problem identifying mainly socio-economical characteristics, rather than being immigrants or speaking poor English, as drivers for crime. Naturally, the two things are correlated but the causal inference may be different.

The dataset is called  Communities and Crime and can be downloaded from the UCI Machiine Learning Repository. I deleted 26 variables with missing values, ending up with 99 explanatory variaables and ran Principal Components Analysis on their correlation matrix. The the first and second PC respectively explain 25.3% and 17% of the variance of the data, Their ordered contributions (loadings scaled to unit sum of the absolute values) are plotted below.

Clearly it is difficult to choose a threshold to isolate the most important contributions. I selected the ten largest contributions, corresponding to a threshold of about 2%. Then I ran a sparse PCA algorithm of my invention* requiring that each sparse component explained at least 95% of the variance explained by the corresponding PC,  Both sets of loadings are shown in Table 1 below. Rj-sq is the R squared resulting from regressing a variable on all other in the set (the one used to compute the variance inflation factor in regression), the closer the value to one the more the variable is (multiply) correlated with the others.

-2.2%median family income (differs from household income) 0.98
-2.2%median household income 0.98
-2.1%percentage of kids in family housing with two parents 0.98
-2.1%percentage of households with investment / rent income in 1989 0.83
2.1%percentage of people under the poverty level 0.84
-2.1%percentage of families (with kids) that are headed by two parents 0.98
-2.0%percent of kids 4 and under in two parent households 0.90
-2.0%per capita income 0.92
2.0%percentage of households with public assistance income in 1989 0.75
2.0%percent of occupied housing units without phone (in 1990, this was rare!) 0.76
-51%median family income (differs from household income) 0.51
-37%percentage of kids in family housing with two parents 0.52
12%percent of family households that are large (6 or more) 0.09

It turns out that the ten most influential variables in the first PC are highly correlated and are either indicators of income or of families with two parents. They all have more or less the same weight. The sparse component with only three variables explains 96.5% percent of the variance explained by the PC. The contributions, like for the PC, indicate the Income (51%) and  family with two parents (37%) are influential but also add a 12% contribution from percentage of large families. This variable is hardly correlated with the others and its contribution to the PC ranks 46th in absolute value (so would not have been picked as influential in any analysis).

OK, Nothing really new here. I only managed to pick up that living conditions have a little importance. But let's look at the second components. The ten largest PC's contributions and those of the sparse component are given in Table 2.

2.7%percent of population who have immigrated within the last 10 years 1.00
2.7%percent of population who have immigrated within the last 8 years 1.00
2.6%percent of population who have immigrated within the last 5 years 1.00
2.6%percent of population who have immigrated within the last 3 years 0.98
2.6%percent of people foreign born 0.96
-2.3%percent of people who speak only English 0.95
2.3%percent of people who do not speak English well 0.94
2.1%percent of persons in dense housing (more than 1 person per room) 0.86
2.0%percentage of population that is of asian heritage 0.63
2.0%percentage of population that is of hispanic heritage 0.90
42%percent of population who have immigrated within the last 10 years 0.11
-15%percentage of population that is 65 and over in age 0.06
15%owner occupied housing - upper quartile value 0.55
14%percent of family households that are large (6 or more) 0.47
13%number of people living in areas classified as urban 0.32

Wow! 9 out of 10 variables with large contributions to the second PC are percentages of immigrants! These are highly correlated (the first 7 with Rj-sq over 94%). That's right Mr Trump, immigrants do make a difference, the relative weight of their contributions to the ten most influential variables is about 91%. 

But let's look at the sparse contributions: the story is quite different. The percent of population recently immigrated has a contribution of 42% but other socio-economical aspects are now influential. We find again large families (which obviously is correlated with living in dense housing) but also percentage of elderly and urbanised residents. Surprisingly, owner occupied housing has positive contribution, This may be due to families for which the cost of the mortgage is a relevant part of their income. Note how the variables in the sparse component are not very correlated.

PCA and SPCA are not methods designed for the prediction of an external variable but for summarising the variation in a set of variables with few components. However, if we regress violent crime rate on the components, it turns out that the sparse components are somewhat better at predicting crime rate than the full cardinality PCs, as shown by the R-squared coefficients in Table 3, where also other summary statistics for the SPCA results are given.

cumulative vexp24.4%40.8%49.8%57.2%62.7%
relative cumulative vexp (sPC/PC)96.5%96.5%96.5%96.6%96.6%
Correlation with PC0.970.980.9420.8330.92
R-sq log(crime rate) on PCs44.0%49.0%50.9%51.3%53.3%
R-sq log(crime rate) on sparse comp.45.7%51.9%56.4%57.2%57.2%

In conclusion, SPCA does a better job at identifying uncorrelated socio-economical characteristics that make a difference in a community, while PCA tends to identify clusters of highly correlated variables sharing those characteristics. The sparse components are also  (slightly) better at predicting crime rate. After all also non-immigrants may have low income and live crammed in a house.

I imagine that seeing the PCA results Mr Trump would immediately conclude that recent immigrants who don't have good command of English and maybe live in dense housing are criminals.   

The data used are quite dated. If anybody has more recent data, please let me know.

*a similar algorithm can be found in Merola, G., 2015. Least squares sparse principal component analysis: a backward elimination approach to attain large loadings. Australia & New
Zealand Journal of Statistics 57, 391–429.  My R package for SPCA is available on GitHub

Sunday, 28 August 2016

Relative time microbenchmark

Microbenchmark is a very useful function for timing the execution time of one or more programs. I often use it to compare different programs. So I created a function that produces the output relative to the time of one of the programs being tested.

Snippet available at GitHub
Download zip

For example, it is known that the singular value decomposition of a matrix is slower than the eigen-decomposition of its covariance matrix, but what  about the time needed to compute the covariance? Of course it depends on the size of the matrix. The example below shows the use of mbm_rel to compare these times.

M = matrix(runif(4000),200)
S = crossprod(M)
# [1] 200  20

eigenS computes the covariance

eigenS <- function(X){

mbm_out = microbenchmark(svd = svd(M), eigen = eigen(S), eigenS = eigenS(X))

# Unit: microseconds
# expr         min       lq               mean      median       uq        max       neval
# svd        704.470 722.0075  911.6752  812.900  878.556  4163.942   100
# eigen     365.281 374.0500  477.9450  425.163  515.628   978.217   100
# eigenS    916.197 941.0050 1746.5251 1085.363 1245.119 50295.797   100

As expected, eigen is faster than svd but eigenS is slower. How much slower? Let's call mbm_rel

# Unit: microseconds
#              min      lq   mean median   uq        max          neval
# svd time         705.754 716.233 806.809 735.267 753.873 3459.045   100
# svd/eigen  1.943  1.936  2.009  1.965  1.919      3.694    100
# svd/eigenS  0.637  0.636  0.414  0.640  0.642     0.054     100

The time units can be changed, just like in the original function. For milliseconds, set unit = "ms"

mbm_rel(mbm_out, unit = "ms")   
# Unit: milliseconds
#                        min          lq     mean      median    uq      max    neval
# svd time         0.706    0.716    0.807     0.735    0.754    3.459  100
# svd/eigen       1.943    1.936    2.009    1.965    1.919    3.694   100
# svd/eigenS     0.637    0.636    0.414    0.640    0.642    0.054   100

The first row gives the time for svd and the following the svd time relative to the other functions. So svd takes almost twice the time of eigen but 60% of the time of eigenS. A more informative output would be the time of the slowest. easily done, set relpos= 3 (the third function)

mbm_rel(mbm_out, relpos = 3, unit = "ms")
# Unit: milliseconds
#                   min              lq mean median     uq max       neval
# eigenS time      1.109 1.126 1.948 1.149 1.174 64.490 100
# eigenS/svd       1.571 1.572 2.415 1.563 1.557 18.644 100
# eigenS/eigen     3.053 3.044 4.852 3.070 2.987 68.877 100

Easier to interpret.

It's also possible to reverse the relativity, that is have the proportion of time of the methods with respect to the target method by setting inverse = TRUE

mbm_rel(mbm_out, relpos = 3, inverse = TRUE, unit = "ms")   
# Unit: milliseconds
#                    min     lq         mean median   uq           max neval
# eigenS time     1.109 1.126 1.948 1.149 1.174 64.490 100
# svd/eigenS      0.637 0.636 0.414 0.640 0.642 0.054 100
# eigen/eigenS    0.328 0.329 0.206 0.326 0.335 0.015 100

One can retrieve the time units with


# [1] "microseconds"

The argument rel_name assigns a different name to the target method.

mbm_rel returns the relative times table, setting reltime = FALSE, returns ones instead of the target method time summaries.

mbm_r <- mbm_rel(mbm_out, relpos = 3, reltime = FALSE,  unit = "ms")
#                           min        lq        mean    median    uq      max    neval
# eigenS             1.000    1.000    1.000    1.000    1.000     1.000    100
# eigenS/svd      1.571    1.572    2.415    1.563    1.557    18.644    100
# eigenS/eigen   3.053    3.044    4.852    3.070    2.987    68.877    100

A possible plot

barplot(mbm_r$median, names = rownames(mbm_r), main = "relative execution times")

Saturday, 2 July 2016

Computing the mode in R

In R there isn't a function for computing the mode. This statistic is not often used but it is very useful for categorical and discrete data.

The mode is defined as "the most common value occurring in a set of observations." Mathematically, for numerical data, the mode is the centre of order zero mode = arg min_m sum [x_i - m]^0, where 0^0 is defined as equal to 0.

This definition is not complete because in a set of data there can be one or many or no mode. For example: in a set with 10 apples, 5 pears and 2 bananas the mode is apple, in a set with 5 apples, 5 pears and 2 bananas the modes are apple and pear, in a set with 5 apples, 5 pears and 5 bananas there is no mode. This is shown in the figure below.

Hence a function that computes the mode needs to return one or many or no value. Furthermore, the values returned can be either characters or numbers (in case of discrete data). My solution is this:

Mode = function(x){
    ta = table(x)
    tam = max(ta)
    if (all(ta == tam))
         mod = NA
    mod = as.numeric(names(ta)[ta == tam])
         mod = names(ta)[ta == tam]

Let's see how it works for nominal data:

One mode
fruit = c(rep("apple", 10), rep("pear", 5), rep("banana", 2))
# [1] "apple"

Two modes
fruit2 = c(rep("apple", 5), rep("pear", 5), rep("banana", 2))
# [1] "apple" "pear" 

No mode
fruit3 = c(rep("apple", 5), rep("pear", 5), rep("banana", 5))
# [1] NA

Works fine for nominal data. Let's check for numerical data:

One mode
count1 = c(rep(1, 10), rep(2, 5), rep(3, 2))
# [1] 1

Two modes
count2 = c(rep(1, 5), rep(2, 5), rep(3, 2))
# [1] 1 2

No mode
count3 = c(rep(1, 5), rep(2, 5), rep(3, 5))
# [1] NA