Formulas

Thursday, February 6, 2014

Random Forest Almighty

Random Forests are awesome. They do not overfit, they are easy to tune, they tell you about important variables, they can be used for classification and regression, they are implemented in many programming languages, and they are faster than their competitors (neural nets, boosting, support vector machines, ...).

Let us take a moment to appreciate them:




The Random Forest is my shepherd; I shall not want.
     He makes me watch the mean squared error decrease rapidly.
He leads me beside classification problems.
    He restores my soul.
He leads me in paths of the power of ensembles
    for his name's sake. 

Even though I walk through the valley of the curse of dimensionality,
    I will fear no overfitting,
for you are with me;
    your bootstrap and your randomness,
    they comfort me.

 You prepare a prediction before me
    in the presence of complex interactions;
you anoint me data scientist;
    my wallet overflows.
 Surely goodness of fit and money shall follow me
    all the days of my life,
and I shall use Random Forests 
    forever.


Joke aside: Random Forests have proved to give very stable and good predictions in many settings (like the See Click Predict Fix kaggle challenge, in which the winner used the Random Forest implementation in R), but they are not the solution to all problems. In a very simple setting, where the true relationship between the response and the covariates is linear, a linear model will perform better than a random forest. You can find a good explanation of why this happens on Cross Validated.
One thing I learned the hard way is that you should not get too attached to one algorithm for prediction. This probably applies to other areas as well. When I participated in the Observing Dark Worlds challenge, I fell into this trap by sticking to Random Forests. My model performed poorly, but instead of thinking about another algorithm I only thought about better features. The winner of that competition used a Bayesian approach.

You can find implementations in R (randomForest package) or in Python (scikit-learn library).
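In R, a minimal classification sketch with the randomForest package might look like this (everything below uses default parameters and the built-in iris data, just for illustration):

```r
library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)
print(fit)                     # confusion matrix and OOB error estimate
importance(fit)                # variable importance (mean decrease in Gini)
predict(fit, newdata = head(iris))
```

The out-of-bag (OOB) error printed by `print(fit)` is the built-in estimate of generalization error, so no separate validation set is strictly needed for a first impression.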

Monday, January 20, 2014

Statistical modeling: two ways to see the world.

This is a machine-learning-vs-traditional-statistics kind of blog post, inspired by Leo Breiman's "Statistical Modeling: The Two Cultures". If you're thinking "I've had enough of this machine learning vs. statistics discussion, BUT I would love to see beautiful beamer slides with an awesome font", then jump to the bottom of the post for my slides on this subject plus source code.

I prepared presentation slides about the paper for a university course. Leo Breiman basically argued that there are two cultures of statistical modeling:
  • Data modeling culture: You assume you know the underlying data-generating process and model your data accordingly. For example, if you choose a linear regression model, you assume that the outcome y is normally distributed given the covariates x. This is the typical procedure in traditional statistics. 
  • Algorithmic modeling culture: You treat the true data-generating process as unknown and try to find a model that is good at predicting the outcome y given the covariates x. So you basically try to find a function of your covariates x that minimizes the loss between prediction and true outcome y. This culture is associated with machine learning. 
Breiman was a supporter of the algorithmic modeling culture and argued that in many cases it is superior to the other. I recommend reading the paper when you have a quiet hour. I think it is compulsory reading for everyone who seriously analyzes data. 

The paper is 13 years old and I guess most people in the statistics community still live the data modeling culture (at least at the statistics department where I study). But the world is not black and white: there are also professors and research associates working in the field of the algorithmic modeling culture: extension of boosting algorithms to a more flexible model class, introduction of permutation tests to random forests, study of the effect of different cross-validation schemes, and so on. Not only can traditional statistics be enhanced by machine learning, but there is also a need to bring more statistics into machine learning. I think there is a lot of room for mutual benefit.

My personal opinion on this subject is very pragmatic. I now use predictive accuracy more often as an evaluation of model goodness and have added Random Forests, (tree-) boosting and others to my toolkit. But sometimes it is okay to pay some MSE, AUC or Gini to replace a complex random forest with a glm that has a nice interpretation. Even if you assumed the wrong data-generating process, the conclusions from your fitted model are in most cases not fatally wrong. 
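To make that trade-off concrete, here is a small sketch on my own toy data (not from the paper): when the true relationship really is linear, a plain lm predicts at least as well as a random forest while staying fully interpretable.

```r
library(randomForest)

set.seed(1)
n <- 300
x <- rnorm(n)
y <- 2 * x + rnorm(n)                  # truly linear relationship, noise sd = 1
d <- data.frame(x = x, y = y)
train <- 1:200; test <- 201:n

mse <- function(pred, obs) mean((pred - obs)^2)

lm_fit <- lm(y ~ x, data = d[train, ])
rf_fit <- randomForest(y ~ x, data = d[train, ])

mse(predict(lm_fit, d[test, ]), d$y[test])   # close to the noise variance of 1
mse(predict(rf_fit, d[test, ]), d$y[test])   # somewhat larger
```

On top of the lower test MSE, the lm hands you an estimated slope with a standard error, which the forest cannot give you.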

So here are the slides. They contain notes, so they should be easy to follow:
If you want to know how to reproduce them, please visit my GitHub account.








Friday, March 8, 2013

From OpenOffice noob to control freak: A love story with R, LaTeX and knitr

Lately I had to write a seminar paper for a class and I decided to overdo it.
But let's start at the very beginning. Here is the evolution of how I write documents, and how I got from this:


to that:



School: 
OpenOffice - I guess everyone has some youthful indiscretions.
I remember how much time I spent trying to position each figure correctly and to make every line in the table of contents start at the same position.


1. Semester: 
I heard of this "LaTeX" thing, and there was a rumor that it might be useful throughout the whole time at university, at the latest for the bachelor thesis. So I decided to learn LaTeX from the very beginning and chose to write a formula collection for descriptive statistics as a first project.
The result was very neat. So proud of myself.
Started to missionize other students.

2.- 5. Semester:
Then I continued to use LaTeX for almost every document. Mostly I did a lot of presentations with the beamer class. Most of the creation time was spent choosing the best themes and colors and creating the perfect title slide.
I discovered Sweave, but did not want to use it and still copied R output into my documents and included figures manually.
Also switched from gedit to emacs about that time.

6. Semester:
Bachelor thesis time. The first time I needed to embed more R code into a document. I wanted my R code to look good. So I searched and found: "Hm ... what's knitr? ... want to meet my bachelor thesis?"
At the same time I learned Subversion. It felt odd to use, but I sensed there was a deeper power in revision tools.

1. Master Semester: 
Revision Control: Level Up! Now working with git.
Facebook update: Now in a bigamous relationship with knitr and emacs.


But now back to my latest paper. I used different tools to create it. Here are the ingredients plus why I used them:


LaTeX

In my opinion, there is no alternative to LaTeX for a scientific paper that includes at least one number or one figure. There have been many discussions about LaTeX vs. Word (or Word-alikes). I like the personal summary in this blog post.


Tufte book documentclass

A very fancy document class for LaTeX that I tried for the first time. Most LaTeX adversaries I know complain about every slide show or paper looking the same, and I have to admit that's true. Of course you can build your own document classes and styles from scratch, but I really don't feel like it. Fortunately there are already some other classes around, like the fancy Tufte package, which I like very much. Edward Tufte (statistician) has done a lot in the fields of information design and data visualization.
The Tufte book document class is based on the style Tufte wrote his books in.
Its most distinctive feature is the broad margin, which can be used for notes, references and pictures.


R

Any analysis we do at university (I am studying statistics) is done with R. There are of course alternatives like Julia, Matlab, Python, ...
R might not be the fastest language, and subjectively it feels a little messy.
But I definitely love the huge ecosystem of packages, which makes R a pleasure to use. Also the plotting functionality is amazing (I am kind of a ggplot2 fanboy).


knitr

Speaking of the ecosystem, knitr is one of the best examples for useful add-on packages. knitr makes integrating R code and output in documents simple and even fun. It is based on Sweave, but eliminates some problems and extends the functionality. I also like the fact that you are not tied to LaTeX documents, but you can, for example, also write markdown files and convert them to html.
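For readers who have never seen it, a minimal knitr document could look like this (the file name and chunk label are made up for illustration):

```latex
% minimal.Rnw -- an illustrative knitr/LaTeX document
\documentclass{article}
\begin{document}

<<sepal-summary, echo=TRUE>>=
summary(iris$Sepal.Length)
@

\end{document}
```

Running knitr::knit("minimal.Rnw") produces a .tex file with the R code and its output embedded, which pdflatex then compiles as usual.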


emacs

Personally I am using emacs a lot, though I think RStudio has evolved very nicely as well. Its knitr integration is very neat, better than in emacs. But I feel very comfortable in emacs, which is why I haven't switched to RStudio yet.


git

A revision control system. Mostly used for coding, but this time I used it for my seminar paper. It is not necessary to use revision control for a paper, but it definitely has some benefits. First of all it is a good feeling to be able to revert back to older revisions, for example in case you accidentally deleted something. One cool side effect is, that you automatically begin to split bigger tasks in small steps, because every time you commit something, you are encouraged to write a short text about what changed.


That's basically the status quo of the tools I use for papers and presentations.
My repository for the paper is public, just visit my GitHub account. Here is the PDF.
By the way, the paper I wrote is about conditional inference trees, on which I also gave a presentation.







Monday, December 31, 2012

Misusage of the new shiny package: A nerdy drink tracker for your next party

Currently a lot of people are talking about the new shiny package. So I got curious and built my own, more or less useful, app: a drink tracker.


This app can be used to track how much each person drank, which makes it very useful for every party, especially when you plan to play some drinking games.


The usage is very simple:

  1. Start an R session
  2. Run the following script (uncomment to install the packages) and change the names in the persons vector
  3. The game should start in your default browser
  4. Every person who had a shot / a sip of beer / whatever can be chosen from the drop-down list. If the same person has to drink again, simply push "Drink again!". You can switch between the timeline and the leaderboard by clicking on the tabs
  5. Have fun!
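Since the original script is not reproduced here, the following is only a rough sketch of what such a shiny app could look like; the persons vector, the tab names and the layout are my assumptions, not the original code:

```r
library(shiny)

persons <- c("Alice", "Bob", "Carol")   # replace with your guests

ui <- fluidPage(
  titlePanel("Drink tracker"),
  selectInput("who", "Who drank?", choices = persons),
  actionButton("again", "Drink again!"),
  tabsetPanel(
    tabPanel("Timeline", tableOutput("timeline")),
    tabPanel("Leaderboard", tableOutput("leaderboard"))
  )
)

server <- function(input, output, session) {
  # one row per drink: who and when
  drink_log <- reactiveVal(data.frame(person = character(), time = character()))
  observeEvent(input$again, {
    drink_log(rbind(drink_log(),
                    data.frame(person = input$who,
                               time = format(Sys.time(), "%H:%M:%S"))))
  })
  output$timeline <- renderTable(drink_log())
  output$leaderboard <- renderTable({
    as.data.frame(sort(table(drink_log()$person), decreasing = TRUE))
  })
}

# shinyApp(ui, server)   # uncomment to launch in the browser
```

Every click on "Drink again!" appends a row to the reactive log, and both tabs re-render automatically.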


Saturday, December 22, 2012

Get the party started

Have you already used trees or random forests to model the relationship between a response and some covariates? Then you might like conditional trees, which are implemented in the party package.

In contrast to the CART (Classification and Regression Trees) algorithm, the conditional trees algorithm uses statistical hypothesis tests to determine the next split. At each splitting step, every variable is tested for association with the response. The variable with the lowest p-value is taken for the next split. This is repeated until the global null hypothesis of independence between the response and all covariates can no longer be rejected.
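A minimal example, assuming the party package is installed:

```r
library(party)

# conditional inference tree on the iris data; splits are chosen by
# permutation-test p-values instead of an impurity measure
ct <- ctree(Species ~ ., data = iris)
ct         # prints each split with its test statistic and p-value
plot(ct)   # the terminal nodes show the class distribution
```

Note that no pruning step is needed: the tree stops growing on its own once no test rejects the null hypothesis of independence.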

Conditional trees are my subject in a university seminar this semester. Here are my slides explaining how conditional trees work, which I wanted to share with you. They include the theory and two short examples in R.


Tuesday, November 13, 2012

Trees with the rpart package


What are trees?

Trees (also called decision trees or recursive partitioning) are a simple yet powerful tool in predictive statistics. The idea is to split the covariable space into many partitions and to fit a constant model of the response variable in each partition. In the case of regression, the mean of the response variable in a node is assigned to that node. The structure is similar to a real tree (from the bottom up): there is a root, where the first split happens. After each split, two new nodes are created (assuming we only make binary splits). Each node contains only a subset of the observations. The partitions of the data which are not split any more are called terminal nodes or leaves. This simple mechanism makes the interpretation of the model pretty easy.

An interpretation looks like: "If \(x_1 > 4\) and \(x_2 < 0.5\), then \(y = 12\)." This is much easier to explain to a non-statistician than a linear model. Therefore trees are a powerful tool not only for prediction, but also for explaining the relationship between your response \(Y\) and your covariables \(X\) in an easily understandable way.

Different algorithms implement this kind of tree. They differ in the criterion that decides how to split a variable, the number of splits per step and other details. Another difference is how pruning takes place. Pruning means shortening the tree, which makes trees more compact and avoids overfitting to the training data. The algorithms have in common that they all use some criterion to decide on the next split. In the case of regression, the split criterion is the sum of squares within each partition. The split is made at the variable and split point where the best value of the criterion is achieved (for regression trees: the minimal sum of squares).
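The split search can be sketched in a few lines of base R: for a single covariate, evaluate every candidate cut point and keep the one with the minimal total within-node sum of squares. This is a simplified illustration, not a full tree implementation:

```r
# simplified illustration of the regression split criterion (base R only)
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # candidate cut points: midpoints between consecutive unique values
  cuts <- head(xs, -1) + diff(xs) / 2
  sse <- vapply(cuts, function(cut) {
    left <- y[x <= cut]; right <- y[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  list(cut = cuts[which.min(sse)], sse = min(sse))
}

set.seed(1)
x <- runif(100)
y <- ifelse(x > 0.5, 2, 0) + rnorm(100, sd = 0.1)
best_split(x, y)$cut   # should land close to the true step at 0.5
```

A real tree algorithm repeats this search over all covariates at every node and recurses into the two resulting child nodes.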

To avoid too large trees, there are two possible methods:

1. Avoid growing large trees: This is also called early stopping. A stopping criterion might be that the number of observations in a node falls below some minimum. If the criterion is fulfilled, the current node is not split any further. Early stopping yields smaller trees and saves computational time.

 2. Grow a large tree, cut afterwards: Also known as pruning. The full tree is grown (early stopping might additionally be used), and each split is examined to see whether it brings a reliable improvement. This can be done top-down, starting from the first split made, or bottom-up, starting at the splits above the terminal nodes. Bottom-up is more common, because top-down has the problem that whole sub-trees can be discarded even though a lot of good splits can follow a "bad" split. Pruning takes into account the weighted split criterion of all splits and the complexity of the tree, which is weighted by some \(\alpha\). Normally the complexity parameter \(\alpha\) is chosen data-driven by cross-validation.
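With the rpart package (used below), this grow-then-prune step might look like the following sketch; the control settings are only illustrative:

```r
library(rpart)

set.seed(1)
# grow a deliberately large tree by disabling early stopping
big <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 2))
printcp(big)   # cross-validated error (xerror) for each complexity value

# prune back at the complexity value with the smallest cross-validated error
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned <- prune(big, cp = best_cp)
```

Here rpart's complexity parameter cp plays the role of \(\alpha\): it is chosen data-driven from the cross-validation column of the cp table.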


How can I use those trees?

The R package rpart implements recursive partitioning. It is very easy to use. The following example uses the iris data set. I'm trying to find a tree which can tell me if an Iris flower is of the species setosa, versicolor or virginica, using some measurements as covariables. As the response variable is categorical, the resulting tree is called a classification tree. The default split criterion is the Gini index; each split maximizes the decrease in Gini impurity. The model fit to each node is simply the mode of the flower species, i.e. the species which appears most often in this node.

The result is a very short tree: if Petal.Length is smaller than 2.45, we label the flower setosa. Otherwise we look at the covariable Petal.Width: is it smaller than 1.75? If so, we label the flower versicolor, else virginica.

I personally think the plots from the rpart package are very ugly, so I use the plot function rpart.plot from the rpart.plot package. The results show that all of the Iris flowers in the left node are correctly labeled setosa; no other flower is in this terminal node of the tree. The other terminal nodes are also very pure: the versicolor node contains 54 flowers, of which 5 are wrongly assigned, and the virginica node has about the same purity (46 flowers, 1 incorrectly assigned).

library("rpart")
library("rpart.plot")
data("iris")

tree <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(tree)
[Figure: classification tree fitted to the iris data]
tree
## n= 150 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 150 100 setosa (0.33333 0.33333 0.33333)  
##   2) Petal.Length< 2.45 50   0 setosa (1.00000 0.00000 0.00000) *
##   3) Petal.Length>=2.45 100  50 versicolor (0.00000 0.50000 0.50000)  
##     6) Petal.Width< 1.75 54   5 versicolor (0.00000 0.90741 0.09259) *
##     7) Petal.Width>=1.75 46   1 virginica (0.00000 0.02174 0.97826) *
The method argument can be switched according to the type of the response variable: "class" for categorical, "anova" for numerical, "poisson" for count data and "exp" for survival data.
All in all, the package is pretty easy to use. Thanks to the formula interface, it can be used like most other regression models (lm(), glm() and so on). I personally think the utility of trees as regression models is underestimated. They are super easy to understand, and if you have to work with non-statisticians it might be a benefit to use trees.
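For illustration, the same interface fits a regression tree with method = "anova"; the choice of covariables here is arbitrary:

```r
library(rpart)

# regression tree: predict a numerical response with the same formula interface
reg_tree <- rpart(Petal.Length ~ Sepal.Length + Sepal.Width,
                  data = iris, method = "anova")
reg_tree   # each terminal node predicts the mean Petal.Length in that node
```

The printed output has the same node structure as the classification tree above, only with node means instead of class labels.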



Friday, August 31, 2012

PCA or Polluting your Clever Analysis


When I learned about principal component analysis (PCA), I thought it would be really useful in big data analysis, but that's not true if you want to do prediction. I tried PCA in my first competition at kaggle, but it delivered bad results. This post illustrates how PCA can pollute good predictors.
When I started examining this problem, I always had the following picture in mind, so I couldn't resist including it. The left picture shows how I feel about my raw data before PCA. The right side represents how I see my data after applying PCA to it. 





Yes it can be that bad …



The basic idea behind PCA is pretty compelling:
Look for a (linear) combination of your variables which explains most of the variance of the data. This combination becomes your first component. Then search for the combination which explains the second-most variance and is orthogonal to the first component.
I don't want to go into details and assume that you are familiar with the idea.
At the end of the procedure you have uncorrelated variables which are linear combinations of your old variables. They can be ordered by how much variance they explain. The idea for machine learning / predictive analysis is now to use only the ones with high variance, because variance equals information, right?
So we can reduce dimensions by dropping the components which do not explain much of the variance. Now you have fewer variables, the dataset becomes manageable, your algorithms run faster and you have the feeling that you have done something useful to your data, that you have aggregated it in a very compact but effective way.
It's not as simple as that.
Well let's try it out with some toy data.
First we build a toy data set. We create some random "good" x values, which are simply drawn from a normal distribution. The response variable y is a linear transformation of x plus a random error, so we should be able to predict y well with the help of good_x.
The second covariable is a "noisy" x, which is highly correlated with good_x but has more variance itself.
To build a bigger dataset I included variables which are useless for predicting y, because they are drawn from a normal distribution with mean zero, independently of y. Five of these variables are highly correlated with each other, and so are the other five, which the PCA will detect later.
library("MASS")
set.seed(123456)


### building a toy data set ###

## number of observations
n <- 500

## good predictor
good_x <- rnorm(n = n, mean = 0.5, sd = 1)

## noisy variable, which is highly correlated with good predictor
noisy_x <- good_x + rnorm(n, mean = 0, sd = 1.2)

## response variable linear to good_x plus random error
y <- 0.7 * good_x + rnorm(n, mean = 0, sd = 0.11)

## variables with no relation to response 10 variables, 5 are always
## correlated
Sigma_diag <- matrix(0.6, nrow = 5, ncol = 5)
zeros <- matrix(0, nrow = 5, ncol = 5)
Sigma <- rbind(cbind(Sigma_diag, zeros), cbind(zeros, Sigma_diag))
diag(Sigma) <- 1
random_X <- mvrnorm(n = n, mu = rep(0, 10), Sigma = Sigma)
colnames(random_X) <- paste("useless", 1:10, sep = "_")

## binding all together
dat <- data.frame(y = y, good_x, noisy_x, random_X)
Let's look at the relationship between good_x and y, and noisy_x and y:

## relationship between y and good_x and noisy_x
par(mfrow = c(1, 2))
plot(y ~ good_x, data = dat)
plot(y ~ noisy_x, data = dat)
[Figure: scatterplots of y against good_x (left) and noisy_x (right)]
Obviously, good_x is a much better predictor for y than noisy_x.
Now I run the principal component analysis. The first three components explain a lot of the variance, which is due to the way I constructed the data. The variables good_x and noisy_x are highly correlated (component 3), the useless variables one to five are correlated, and so are the useless variables six to ten (components 1 and 2).
## pca
prin <- princomp(dat[-1], cor = TRUE)

## results visualized
par(mfrow = c(1, 2))
screeplot(prin, type = "l")
[Figure: screeplot of the principal components]
loadings(prin)
## 
## Loadings:
##            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## good_x                    0.700  0.148                0.459        -0.441
## noisy_x                   0.700         0.148        -0.424         0.474
## useless_1  -0.183 -0.408        -0.290        -0.464  0.382 -0.184  0.180
## useless_2  -0.204 -0.390         0.170  0.490  0.549  0.114        -0.116
## useless_3  -0.228 -0.393         0.242        -0.434         0.176       
## useless_4  -0.201 -0.394        -0.446 -0.278  0.309 -0.103  0.599       
## useless_5  -0.172 -0.414         0.365 -0.271  0.105 -0.306 -0.484       
## useless_6   0.413 -0.172         0.209 -0.527        -0.125              
## useless_7   0.411 -0.171        -0.305         0.273  0.398 -0.311  0.413
## useless_8   0.368 -0.241        -0.424  0.340 -0.252 -0.403 -0.209 -0.447
## useless_9   0.398 -0.191         0.366  0.386 -0.166         0.427  0.262
## useless_10  0.405 -0.215         0.137 -0.157  0.122               -0.297
##            Comp.10 Comp.11 Comp.12
## good_x     -0.234           0.146 
## noisy_x     0.239                 
## useless_1   0.144  -0.440  -0.252 
## useless_2   0.272  -0.284   0.196 
## useless_3   0.264   0.561   0.347 
## useless_4  -0.237                 
## useless_5  -0.443          -0.220 
## useless_6   0.186  -0.392   0.515 
## useless_7           0.412   0.165 
## useless_8  -0.135           0.150 
## useless_9  -0.416  -0.166  -0.171 
## useless_10  0.484   0.212  -0.596 
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.083  0.083  0.083  0.083  0.083  0.083  0.083  0.083
## Cumulative Var  0.083  0.167  0.250  0.333  0.417  0.500  0.583  0.667
##                Comp.9 Comp.10 Comp.11 Comp.12
## SS loadings     1.000   1.000   1.000   1.000
## Proportion Var  0.083   0.083   0.083   0.083
## Cumulative Var  0.750   0.833   0.917   1.000
Let's look at the relationship between the response y and the component that mixes good_x and noisy_x. Component 3 is the only one which contains only the good and the noisy x, but none of the useless variables. You can see here that the relationship is still preserved, but by mixing the noise into the good predictor we now have a worse predictor than before.
dat_pca <- as.data.frame(prin$scores)
dat_pca$y <- y

plot(y ~ Comp.3, data = dat_pca)
[Figure: scatterplot of y against Comp.3]
Now we can compare the prediction of y with the raw data and with the data after the PCA. The first method is a linear model. Since the true relationship between good_x and y is linear, this should be a good approach. First we take the raw data and include all variables: the good and the noisy x and the useless variables.
As expected, the linear model performs very well, with an adjusted R-squared of 0.975. The estimated coefficient of good_x is also very close to the true value. The linear model also did well at identifying the only covariable that matters, as indicated by the p-values.

## linear model detects good_x rightfully as only good significant
## predictor
mod <- lm(y ~ ., dat)
summary(mod)
## 
## Call:
## lm(formula = y ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3205 -0.0750 -0.0048  0.0784  0.3946 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.001377   0.005681   -0.24     0.81    
## good_x       0.705448   0.006411  110.03   <2e-16 ***
## noisy_x     -0.002980   0.004288   -0.70     0.49    
## useless_1   -0.010783   0.006901   -1.56     0.12    
## useless_2    0.009326   0.006887    1.35     0.18    
## useless_3   -0.001307   0.007572   -0.17     0.86    
## useless_4   -0.005631   0.007037   -0.80     0.42    
## useless_5    0.002154   0.007116    0.30     0.76    
## useless_6   -0.001407   0.007719   -0.18     0.86    
## useless_7   -0.003830   0.007513   -0.51     0.61    
## useless_8    0.004877   0.007215    0.68     0.50    
## useless_9    0.000769   0.007565    0.10     0.92    
## useless_10   0.005247   0.007639    0.69     0.49    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.112 on 487 degrees of freedom
## Multiple R-squared: 0.976,   Adjusted R-squared: 0.975 
## F-statistic: 1.62e+03 on 12 and 487 DF,  p-value: <2e-16 
## 
And now let's look at the mess the PCA made. If we included all components in our linear model, we would get the same R-squared value, because the new components are just linear combinations of the old variables. Only the p-values would be a mess, because a lot of components would have a significant influence on the outcome y.
But we wanted to use PCA for dimensionality reduction. The screeplot above suggests that we should take the first four components, because they explain most of the variance. Applying the linear model to the reduced data gives us a worse model: the adjusted R-squared drops to 0.787.
dat_pca_reduced <- dat_pca[c("y", "Comp.1", "Comp.2", "Comp.3", "Comp.4")]
mod_pca_reduced <- lm(y ~ ., data = dat_pca_reduced)

summary(mod_pca_reduced)
## 
## Call:
## lm(formula = y ~ ., data = dat_pca_reduced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9634 -0.2188 -0.0147  0.2084  0.9651 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.35828    0.01463   24.50  < 2e-16 ***
## Comp.1       0.03817    0.00786    4.86  1.6e-06 ***
## Comp.2      -0.00778    0.00795   -0.98     0.33    
## Comp.3       0.48733    0.01149   42.41  < 2e-16 ***
## Comp.4       0.11190    0.02138    5.24  2.4e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.327 on 495 degrees of freedom
## Multiple R-squared: 0.789,   Adjusted R-squared: 0.787 
## F-statistic:  463 on 4 and 495 DF,  p-value: <2e-16 
## 
The same thing happens when we use other methods; I chose random forests as an example.
The first random forest, fitted on the raw data, gives the best result with 93.15% of the variance explained. This is not as good as the linear model, because the true relationship is linear, but it is still a reasonable result.
library("randomForest")

## raw data
set.seed(1234)
forest <- randomForest(y ~ ., data = dat)
forest
## 
## Call:
##  randomForest(formula = y ~ ., data = dat) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 0.03436
##                     % Var explained: 93.15
The next random forest uses all components from the PCA, which means all the information is still in the data, because we only built linear combinations of the old variables. But the result is worse, with only 85.47% of the variance explained by the random forest.

## pca data
set.seed(1234)
forest_pca <- randomForest(y ~ ., data = dat_pca)
forest_pca
## 
## Call:
##  randomForest(formula = y ~ ., data = dat_pca) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 0.07293
##                     % Var explained: 85.47
Let's see what happens if we take only the first four components, which explain most of the variance of the covariables.
As suspected, we again lose predictive power and the explained variance drops to 70.98%.
That's a difference of about 22 percentage points of explained variance compared to the random forest on the raw data.
## reduced pca data
set.seed(1234)
forest_pca <- randomForest(y ~ ., data = dat_pca_reduced)
forest_pca
## 
## Call:
##  randomForest(formula = y ~ ., data = dat_pca_reduced) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.1456
##                     % Var explained: 70.98
PCA can do bad things to your data when it is used for prediction. This was of course a simple example, but my intuition tells me that the problem persists for other relationships between y and the covariables and for other covariance structures of the covariables.
Of course PCA is still a useful statistical tool. It can help as a descriptive tool, or if you are looking for some latent variable behind your observed features (which is very common in surveys) it does its job. But don't use it blindly in predictive models.
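As a descriptive example with base R (using the iris measurements, just for illustration):

```r
# standardize the four iris measurements, then inspect the components
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)   # proportion of variance explained per component
biplot(pca)    # observations and variable loadings in one picture
```

Used like this, for looking at the structure of the covariables rather than for feeding a predictive model, PCA is perfectly fine.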
This blog entry was inspired by this one:
http://blog.explainmydata.com/2012/07/should-you-apply-pca-to-your-data.html
I am interested in your experience with PCA. Feel free to comment.