ggplotHelper: easy and beautiful plots in R

Visual expression is key in transmitting information between producers and users of research regardless of the scientific field. After many painful hours of re-writing code for plots in R based on the extensive ggplot2 package we finally created an easy-to-use overlay for making beautiful plots in R.

Following our two latest posts on multivariate stochastic volatility (part 1 and part 2) we bring an intermezzo, not related to trading, on how to make beautiful plots in R.

The gold standard in the R community is to create plots based on the extensive ggplot2 package built on the principles of the grammar of graphics in R. It has many settings and can create practically any plot you desire. However, as other R users may have noticed it often gets cumbersome to rewrite code for your next ggplot because of all the code needed to create the customization you want. Or you keep forgetting the functions for creating a specific plot.

After many painful hours in this trap we finally liberated ourselves by creating a package called ggplotHelper, hosted on GitHub, which is essentially a series of overlay functions for the ggplot function. It reduces the time spent on creating plots significantly, but it comes with certain constraints. We have restricted the colour palette to grey and a few easy-to-read colours. However, these can easily be modified in the colour functions. In addition, our package does not have an overlay for every plot function in ggplot2.

If you like our design template you can now easily create similar charts yourselves. Before you enter the land of ggplotHelper, we have to warn that the package has not yet entered a stable version 1.0.0, so we will make many commits as the package evolves. The package does not come with unit tests at this stage so sometimes new code may create bugs elsewhere in the code.

How to get ggplotHelper

Our ggplotHelper package is published as open source and hosted on GitHub. With the devtools package it is very easy to install ggplotHelper using the install_github function.


library(devtools)

install_github("pgarnry/ggplotHelper")

Now ggplotHelper is installed. Today we will only discuss the three functions bar_chart, line_chart and save_png. For the full package explanation you will have to wait for the vignette or consult the function documentation, which we try to make self-explanatory.

If you experience issues with our functions you are welcome to raise tickets in our GitHub repository. You can also contribute code to make it better.

Colour palette

Colours are controlled through two functions, grey_theme and chart_colours. The grey_theme function allows customization of plot margin and legend position (see ggplot documentation for available options). These features can be set through the ellipsis, which will become clear very soon.

Below is a small snippet of the grey_theme function showing the only available options.

grey_theme <- function(legend.position = "bottom",
plot.margin = c(0.7, 1.2, 0.5, 0.5)) {

If you want to change the colour scheme you simply clone the repository and change the colours in the grey_theme function.

Bar plots

The bar plot is a toolbox staple for any researcher. With ggplotHelper it only takes one line of code to create a bar plot.

data(mtcars)
mtcars$name <- rownames(mtcars)

# not a pretty bar plot because of overlapping x-axis names
bar_chart(mtcars, y = "mpg", x = "name")

bar1

This bar plot would require around 50 lines of code to create the design, colours and data handling with the ggplot2 functions. To get it to one line of code, we have simplified and made choices for the user.

The x-axis names are overlapping because of the large number of bars. This problem is easily fixed by flipping the data.

# the flip option makes the plot prettier
bar_chart(mtcars, y = "mpg", x = "name", flip = TRUE)

bar2

Flipping the data looks easy but behind the lines we are handling the dirtiness. Next we add a title.

# add a title
bar_chart(mtcars, y = "mpg", x = "name", flip = TRUE,
title = "Miles per gallon across different car models")

bar3

Again we make choices for the user by adding a line break in the title to get the proper distance to the plot window. Next we change the y-axis scale.

# change the y scale
bar_chart(mtcars, y = "mpg", x = "name", flip = TRUE,
title = "Miles per gallon across different car models",
scale.y = c(0, 40, 5))

bar4

Unordered data can be difficult to interpret so we change the order of our data from highest to lowest values (decreasing).

# now we want to order the values decreasing
bar_chart(mtcars, y = "mpg", x = "name", flip = TRUE,
title = "Miles per gallon across different car models",
scale.y = c(0, 40, 5),
decreasing = TRUE)

bar5

Finally, we can highlight a specific data point (bar) using the bar.colour.name variable.

# finally we highlight a data point
bar_chart(mtcars, y = "mpg", x = "name", flip = TRUE,
title = "Miles per gallon across different car models",
scale.y = c(0, 40, 5),
decreasing = TRUE,
bar.colour.name = "Merc 280C")

bar6

We hope this has given you an idea of the powerful functions available in ggplotHelper.

Line plots

The trickiest part of ggplot is making time series plots, but with ggplotHelper we aimed to make it very easy through the line_chart function. Line plots can either be time-series (often the case) or not. The line_chart function provides options for both, and for time-series plots the function supports four different date/time classes: POSIXct, Date, yearmon and yearqtr.

Let us start with creating two random processes.

set.seed(5)

# create two random processes
rand.ts <- data.frame(rp = c(cumprod(rnorm(500, 0.0004, 0.0016) + 1),
cumprod(rnorm(500, 0.0002, 0.0016) + 1)),
date = rep(seq.Date(Sys.Date() - 499, Sys.Date(), by = "days"), 2),
name = rep(c("rp1", "rp2"), each = 500))

# quick line plot
line_chart(rand.ts, y = "rp", x = "date", group = "name")

line1

In order to create a line plot with multiple lines the grouping variable has to be specified. However, the method for multiple lines will likely change in the future. As you can see in the data.frame object, the date column contains duplicates. Ideally we want to get rid of this input design and allow for only unique dates and multiple y columns.
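For readers whose data sits in a wide layout (one date column and one column per series), a minimal base R sketch of stacking it into the long layout that line_chart currently expects might look as follows. The object names wide.ts and long.ts and the fixed start date are ours, chosen for illustration only.

```r
set.seed(5)

# hypothetical wide data: one row per date, one column per series
wide.ts <- data.frame(date = seq.Date(as.Date("2016-01-01"), by = "days", length.out = 500),
                      rp1 = cumprod(rnorm(500, 0.0004, 0.0016) + 1),
                      rp2 = cumprod(rnorm(500, 0.0002, 0.0016) + 1))

# stack the series columns into the long layout (each date repeats once per series)
series <- c("rp1", "rp2")
long.ts <- data.frame(rp = unlist(wide.ts[series], use.names = FALSE),
                      date = rep(wide.ts$date, times = length(series)),
                      name = rep(series, each = nrow(wide.ts)))
```

The resulting long.ts has the same three columns as rand.ts, so it could be passed as line_chart(long.ts, y = "rp", x = "date", group = "name").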

In the current version the grouping names are used as default legend names. The function automatically detects that the x variable is of class Date and enables certain options for manipulating dates (date format and scaling).

Now we will change the y-axis scale and change the legend names. We will also add a title.

# add title, change the y-axis scale and add custom legend names
line_chart(rand.ts, y = "rp", x = "date",
group = "name", title = "Random processes",
legend.names = c("Random process 1", "Random process 2"),
y.min = .95, y.max = 1.25)

line2

As one quickly realizes, the two legend names stand next to each other. Can we add some space and make it prettier?

# add some extra space between legend names
line_chart(rand.ts, y = "rp", x = "date",
group = "name", title = "Random processes",
legend.names = c("Random process 1 ", "Random process 2"),
y.min = .95, y.max = 1.25)

line3

Extra space added. Sometimes you want a different date format than the ISO standard. This can easily be changed through the date.format variable, while the x.interval setting sets the increment of the sequence of x values. See the example below.

# change the date format and interval
line_chart(rand.ts, y = "rp", x = "date",
group = "name", title = "Random processes",
legend.names = c("Random process 1 ", "Random process 2"),
y.min = .95, y.max = 1.25, date.format = "%Y %b %d",
x.interval = 60)

line4

The date.format takes all the available conversion rules in the strptime function. Finally we want to show how to send arguments to the grey_theme function through the line_chart function.

# remove legend
line_chart(rand.ts, y = "rp", x = "date",
group = "name", title = "Random processes",
legend.names = c("Random process 1 ", "Random process 2"),
y.min = .95, y.max = 1.25, date.format = "%Y %b %d",
x.interval = 60, legend.position = "none")

line5

The legend.position variable is set in the grey_theme function but is passed on through the ellipsis. The only other available variable is plot.margin which provides the functionality to change the plot margins in centimeters.
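For readers unfamiliar with the ellipsis, the sketch below shows the general R mechanism with two toy functions. Note that toy_theme and toy_chart are hypothetical stand-ins and not the actual ggplotHelper code; only the two default values are taken from the grey_theme snippet shown earlier.

```r
# hypothetical stand-in for a theme function with two settings
toy_theme <- function(legend.position = "bottom",
                      plot.margin = c(0.7, 1.2, 0.5, 0.5)) {
  list(legend.position = legend.position, plot.margin = plot.margin)
}

# hypothetical chart wrapper: arguments it does not name itself
# are collected in ... and forwarded untouched to toy_theme
toy_chart <- function(data, title = NULL, ...) {
  list(title = title, theme = toy_theme(...))
}

default.plot <- toy_chart(mtcars, title = "Defaults")
no.legend.plot <- toy_chart(mtcars, title = "No legend", legend.position = "none")
```

Here no.legend.plot picks up legend.position = "none" while plot.margin keeps its default, which is the same behaviour the legend.position argument shows when passed to line_chart.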

The line_chart function also supports ribbons as well as vertical and horizontal lines (used to highlight a critical level in the data). The ribbon in particular is a more advanced feature and currently works only with one time series.

Save plots

Often we want our plots saved outside our R environment. The ggplotHelper has a quick save_png function that takes the ggplot object names (plot names) as input and stores the plot objects as .png files with a height of 480 pixels and width scaled by the golden ratio to 777. A future feature will allow the user to specify the height and width.

line1 <- line_chart(rand.ts, y = "rp", x = "date", group = "name")

line3 <- line_chart(rand.ts, y = "rp", x = "date", group = "name",
title = "Random processes",
legend.names = c("Random process 1 ", "Random process 2"),
y.min = .95, y.max = 1.25)

# save the two ggplots as .png files
save_png(line1, line3)

It is that easy to save ggplots as .png files.
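The 777 pixel width follows directly from the golden ratio. The arithmetic can be checked in a couple of lines (a sketch of the calculation only, not the internals of save_png):

```r
# golden ratio
phi <- (1 + sqrt(5)) / 2

height <- 480                  # pixels, as stated above
width <- round(height * phi)   # 480 * 1.618... rounds to 777 pixels
```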

Other functions

The ggplotHelper package also supports density, box and scatter plots and more will likely be implemented in the future. Density plot is very powerful and is one of our favourite plots for showing results of bootstrapped trading strategies.


Trading strategies: No need for the holy grail

We demonstrate that weak trading signals, which do not offer high risk-adjusted returns on their own, can be combined into a powerful portfolio. In other words, no need for holy grails when researching signals.

We start our experiment with some key assumptions. We have 20 signals with annualized log returns of 8% and annualized Sharpe Ratios of 0.6 – not exactly stellar signals. The signals make daily bets. The strategies in this experiment run for 10 years (on a daily basis), but we will later show how the statistics change when the number of observations declines. The experiment is repeated 500 times to get a sense of the distributions of relevant statistics, such as Sharpe Ratios and annualized returns.

An important input variable in trading is the correlation between signals, and our experiment is carried out across a sequence of cross-signal correlations from zero to 0.9. Disregarding trading costs (because we are simply interested in relative performances) and using daily rebalancing, the distributions of annualized portfolio returns across correlations look more or less identical. Clearly, having more than one strategy does not improve the annualized return – regardless of the correlations between the strategies.

ann.ret.chart

Blending multiple signals with lower correlation does not enhance returns, but the chart above does hint at the benefit to having more strategies – and especially if those strategies are relatively uncorrelated. The left-most distributions are much narrower and none of the 500 trials have returns below zero (up to and including strategies with correlations of 0.4).

The result becomes clearer when we move to risk-adjusted returns measured by the Sharpe Ratio. Here 20 strategies with zero correlation and low individual annualized Sharpe Ratios of 0.6 turn into a portfolio with an annualized Sharpe Ratio of 3 compared to 0.64 for a portfolio with average correlations of 0.9 between the trading strategies – this translates into a 370% improvement.

ann.sr.chart

What is also fascinating about the chart above is how fast the improvement in Sharpe Ratio declines as the signals become more correlated. Increasing the correlations to 0.2 from 0 results in a decline of 56% in the Sharpe Ratio.
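As a rough cross-check of the simulated numbers, the portfolio Sharpe Ratio can be approximated analytically. The sketch below is our own back-of-the-envelope approximation and not part of the simulation code: it assumes equal volatilities and a common pairwise correlation, and it converts between log and simple returns with the usual half-variance adjustment.

```r
strats <- 20                # number of signals
mu <- 0.08                  # annualized log return of each signal
sigma <- mu / 0.6           # annualized volatility implied by a Sharpe Ratio of 0.6
cors <- seq(0, 0.9, 0.1)    # common cross-signal correlations

# variance of an equal-weighted portfolio of equally volatile signals
port.var <- sigma^2 * (1 + (strats - 1) * cors) / strats

# approximate mean simple return of the signals, converted to a portfolio log return
port.mu <- (mu + sigma^2 / 2) - port.var / 2

approx.sr <- port.mu / sqrt(port.var)
round(setNames(approx.sr, cors), 2)
```

Under these assumptions the approximation gives roughly 2.97 at zero correlation, 1.33 at a correlation of 0.2 and 0.64 at 0.9, which is in line with the simulated figures quoted above.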

Despite a high Sharpe Ratio and around 50,000 bets across the signals (i.e. trading strategies) the variance of the Sharpe Ratios for the zero cross-signal correlation portfolios is still staggering. One investor might get lucky and produce a Sharpe Ratio of 3.5 (probably turning the person into a multi-billionaire) while another investor with the same types of strategies may be less fortunate, resulting in a Sharpe Ratio of 2.5. Luck does play a role in trading even for high Sharpe Ratio portfolios.

Obviously, an edge becomes clearer with more observations. What happens if our investor only has one year of observations rather than ten years? The chart below shows the explosion in the variances of the Sharpe Ratios across correlations. Despite 5,000 trades most portfolios cannot be separated from random luck. It is clear why data-driven hedge funds prefer higher frequencies (intraday trading). It validates the signals faster.

ann.sr.one.yr.chart

If we simulate 10,000 time-series with the above properties what is the percentage of them with a p-value lower than 5%? The answer is close to 48%, which could lead most researchers to discard such daily strategies (with an annualized Sharpe Ratio of 0.6). However, blending such weak signals can result in magic – if the correlations are low enough – whereby a portfolio’s combined return stream becomes highly significant. Among the zero correlation portfolios all of them have a p-value of less than 5%.

pv.chart
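The single-signal figure of roughly 48% can be sanity-checked without simulation. Under a normal approximation (our assumption; the post itself relies on simulated t-tests), the t-statistic of a strategy's mean daily return is centred on the annualized Sharpe Ratio times the square root of the number of years.

```r
sr <- 0.6      # annualized Sharpe Ratio of a single signal
years <- 10    # length of the track record

# expected t-statistic of the mean daily return
t.stat <- sr * sqrt(years)

# probability that a two-sided 5% test rejects a zero mean
z.crit <- qnorm(0.975)
power <- pnorm(t.stat - z.crit) + pnorm(-t.stat - z.crit)
round(power, 3)
```

The expected t-statistic of about 1.9 sits just below the critical value of 1.96, which is why slightly less than half of such signals come out as significant.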

A daily strategy with an annualized Sharpe Ratio of 0.6 would likely be discarded on its own by a researcher as insufficient to produce anything attractive in trading. But with the right (i.e. low) correlations to existing signals, it could well add value to the portfolio.

This post does not break new ground as the effects of diversification are well-known in the investment community, but it does serve as a reminder that instead of discarding that 0.6 Sharpe Ratio strategy of yours perhaps you can add it to your existing portfolio of strategies, thereby lowering your portfolio volatility and hence allowing for more leverage to be used, enhancing the total return.

Note: This post will be the final post (for a while) on random returns and fundamental concepts. Upcoming blog posts will instead focus on specific trading strategies in various asset classes.


# define variables
N <- 252 * 10 # number of observations
M <- 500 # number of trials
cors <- seq(0, .9, .1) # sequence of cross-signal correlations
strats <- 20 # number of strategies
er <- 8 / 252 # expected daily log return in percent
esr <- 0.6 / sqrt(252) # expected daily sharpe ratio

# pre-allocate arrays to store portfolio returns and p-values
port.rets <- array(NA, dim = c(N, M, length(cors)))
port.pv <- array(NA, dim = c(M, length(cors)))

for (i in 1:length(cors)) {

# correlation structure
R <- matrix(1, strats, strats)
R[lower.tri(R)] <- cors[i]
R[upper.tri(R)] <- cors[i]

# transpose Choleski decomposition
W <- t(chol(R))

for (m in 1:M) {

# generate random strategy return streams with zero mean and the daily volatility implied by the expected return and Sharpe ratio
rets <- matrix(rnorm(N*strats, 0, er / esr / 100), ncol = strats)

# multiply the Cholesky decomposition matrix with our random returns and add drift
rets <- t(W %*% t(rets)) + er / 100

# calculate equal-weighted portfolio return
port.rets[, m, i] <- log(apply(exp(rets) - 1, 1, mean) + 1)

# insert p-values for portfolios
port.pv[m, i] <- t.test(port.rets[, m, i])$p.value

}

cat(paste("Calculations done for correlation", cors[i]), "\n")

}

# calculate annualized Sharpe ratio and returns
ann.sr <- apply(port.rets, c(2, 3), mean) / apply(port.rets, c(2, 3), sd) * sqrt(252)
ann.ret <- apply(port.rets, c(2, 3), mean) * 252

# calculate percent of portfolios with p-value below 5% across correlations
pv <- apply(port.pv < 0.05, 2, mean)

# calculate theoretical percentage of time-series with p-value below 5% for the specified strategy properties
# daily strategy with annualized Sharpe Ratio of 0.6
t.pv <- sum(replicate(10000, ifelse(t.test(rnorm(252 * 10, 8 / 252 / 100, (8/252/100) / (0.6/sqrt(252))))[["p.value"]] < 0.05, 1, 0))) / 10000

# create data.frame for ggplot2 charts of p-values
pv.df <- data.frame(P.value = c(t.pv, sort(pv)) * 100,
Name = c("Single signal p-value", paste("Correlation (", sort(cors, decreasing = TRUE), ")", sep = "")))

# create data.frame for ggplot2 charts of annualized return and Sharpe Ratio
stat.df <- data.frame(Sharpe = c(ann.sr),
Ann.Returns = c(ann.ret),
Correlation = rep(cors,each = M))
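
The correlation step in the listing above relies on the Cholesky factor of the target correlation matrix. A minimal standalone check of that technique, using a smaller example of our own choosing (4 series with a pairwise correlation of 0.5), could look like this:

```r
set.seed(1)

k <- 4         # number of series (illustrative)
rho <- 0.5     # target pairwise correlation (illustrative)
n <- 100000    # number of observations

# equicorrelation matrix and its lower-triangular Cholesky factor
R <- matrix(rho, k, k)
diag(R) <- 1
W <- t(chol(R))

# independent standard normal draws, one column per series
z <- matrix(rnorm(n * k), ncol = k)

# correlated draws: equivalent to t(W %*% t(z))
x <- z %*% t(W)

round(cor(x), 2)
```

The sample correlation matrix of x should be close to R, since the covariance of the transformed draws is W multiplied by its transpose.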


			

Bootstrapping avoids seductive backtest results

Nothing gets the adrenaline rushing like strong backtesting results for your latest equity trading idea. Often, however, it is a mirage created by a subset of equities that have performed particularly well or poorly, thereby inflating the results beyond what seems reasonable to expect going forward.

The investment community has come a long way in terms of becoming more statistically sound, but it is still surprising how few research papers on cross-sectional equity factors mention bootstrapping. Without bootstrapping researchers are simply presenting results from the full sample, implying that the same type of – potentially spectacular – returns will happen again in the future and will be captured satisfactorily by the model. In other words, the backtesting results may be heavily skewed by outliers. In our latest post on equity factor models we mentioned bootstrapping, but we postponed any real discussion of the topic. In today’s blog post we return to the topic of bootstrapping and specifically how outliers influence the results of the aforementioned factor model.

Outlier sample bias

In our equity factor model research our backtest is based on 1,000 bootstrapped samples where each sample is a subset of the constituents in the S&P 1200 Global index. Because bootstrapped samples are drawn with replacement, the same stock can appear twice or more in any given holding period; subsampling (drawing without replacement) could be used instead.
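
The difference between the two resampling schemes can be illustrated in a few lines of base R. This is a sketch only: the integer universe is a stand-in for index constituents and the 80% subsample share is an arbitrary choice of ours.

```r
set.seed(42)

universe <- 1:1200    # stand-in for index constituents

# bootstrap: same size as the universe, drawn with replacement
boot.sample <- sample(universe, length(universe), replace = TRUE)

# subsample: a smaller draw without replacement
sub.sample <- sample(universe, round(0.8 * length(universe)), replace = FALSE)

# share of distinct stocks in the bootstrap sample
length(unique(boot.sample)) / length(universe)

# the subsample never repeats a stock
anyDuplicated(sub.sample) == 0
```

On average a bootstrap sample contains about 63% of the distinct constituents (1 - 1/e), with the remaining slots filled by repeats.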

What is particularly interesting in our backtest is that the historical sample (running the backtest simply on the available historical data set) delivers impressive results with an annualized return of 11% (vertical orange line in the chart below). This is 2.2 percentage points better than the bootstrapped mean of 8.8%.

Bootstrapping vs historical sample

Only 7.2% of the bootstrapped results have a higher annualized return than the historical sample, putting it firmly in the right tail of the distribution of annualized returns. Without bootstrapping the historical sample backtest could lead to inflated expectations due to outliers. While it is possible that the model will be able to capture such outliers in the future as well, we prefer to err on the side of caution and therefore prefer to use the bootstrapped mean as our expected annualized return rather than that achieved with the full list of historical constituents.

Another advantage of the resampling methodology is that it creates a confidence band around our expectation. We expect an 8.8% annualized return from this model, but if it delivers 6.5% it would fit well within the distribution of annualized returns and hence not surprise us too much. Below, the confidence band (5th and 95th percentiles) is shown for the cumulative return, with a mean total return of 363% compared with 540% for the historical sample, which is close to the upper band of 558%.
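
The summary statistics discussed here are simple to compute once the bootstrapped results are available. The sketch below uses simulated placeholder data rather than our backtest output: the mean and the historical figure are taken from the text, while the spread of the distribution is made up, so the numbers it prints are not our results.

```r
set.seed(7)

# placeholder data: 1,000 simulated annualized returns standing in for
# bootstrapped backtest results (the standard deviation is made up)
boot.ann.ret <- rnorm(1000, mean = 0.088, sd = 0.015)
hist.ann.ret <- 0.11    # historical sample result from the text

# expected return and 5%/95% band from the bootstrap distribution
boot.mean <- mean(boot.ann.ret)
boot.band <- quantile(boot.ann.ret, probs = c(0.05, 0.95))

# share of bootstrapped results beating the historical sample
share.above <- mean(boot.ann.ret > hist.ann.ret)
```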

cum_chart

It is our wish that the investment community steps up its use of bootstrapping in cross-sectional equity research, or in other ways incorporates the uncertainty of the results more explicitly, thereby painting a more appropriate picture of the expected performance of a trading strategy.

Factor-based equity investing: is the magic gone?

Factor-based equity investing has shown remarkable results against passive buy-and-hold strategies. However, our research shows that the magic may have diminished over the years.

Equity factor models are used by many successful hedge funds and asset management firms. Their ability to create rather consistent alpha has been the driving force behind their widespread adoption. As we will show the magic seems to be under pressure.

Our four-factor model is based on the well-researched equity factors value, quality, momentum and low volatility with S&P 1200 Global as the universe. Value is defined by 12-month trailing EV/EBITDA and price-to-book (if the former is not available, which is often the case for financials). Quality is defined as return on invested capital (ROIC) and return on equity (ROE, if the former is not available, which is the case for financials). Momentum is defined as the three-month change in price (the window has not been optimized). The stocks’ betas are estimated by linear regression against the market return based on 18 months of monthly observations (the estimation window has not been optimized). All factors have been lagged properly to avoid look-ahead bias.

To avoid concentration risk, as certain industries often dominate a factor at a given period, the factor model spreads its exposure systematically across the equity market. A natural choice is to use industry classifications such as GICS, but from a risk management perspective we are more interested in correlation clusters. We find these clusters by applying hierarchical clustering on excess returns (the stocks’ returns minus the market return) looking back 24 months (again, this window has not been optimized). We require at least 50 stocks in each cluster; otherwise the algorithm stops.

The median number of clusters across 1,000 bootstraps is robust around 6 over time compared to the 10 GICS sectors that represent the highest level of segmentation, even – surprisingly – during the financial crisis of 2008 despite a dramatic increase in cross-sectional correlation over that period. It is only the period from June 2000 to June 2003 that sees a big change in the market structure with fewer distinct clusters.

Clusters across time

For each cluster at any given point in time in the backtesting period we rescale the raw equity factors to the range [0, 1]. Another approach would be to normalize the factors, but that approach would be more sensitive to outliers, which are quite pronounced in our equity factors. The average of the rescaled factors is then ranked and those stocks with the highest combined rank (high earnings yield, high return on capital, strong price momentum and low beta) are chosen.
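
A stripped-down illustration of the rescale-average-rank step, using made-up factor values for five stocks and only two factors, might look like this. The helper rescale01 is our own name for the min-max rescaling; the full listing at the end of the post spells the same calculation out factor by factor.

```r
# rescale a numeric vector to [0, 1], ignoring missing values
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# made-up factor values for five stocks in one cluster
value <- c(0.05, 0.12, 0.08, NA, 0.10)        # e.g. earnings yield
quality <- c(0.15, 0.09, 0.22, 0.11, 0.18)    # e.g. return on capital

# average the rescaled factors per stock, then rank (higher is better)
ranks <- cbind(rescale01(value), rescale01(quality))
comb.rank <- rank(rowMeans(ranks), na.last = "keep")
```

In this toy example the fourth stock has a missing factor and therefore receives no rank, while the third stock gets the highest combined rank.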

This is probably the simplest method in factor investing, but why increase complexity if the increase in risk-adjusted returns is not significant? Another approach is to use the factors in regressions and compute the expected returns based on the stocks’ factors.

Our data goes back to 1996 but with 24 months reserved for the estimation window for calculating clusters the actual results start in 1998 and end in November 2015. We have specified the portfolio size to 40 stocks based on research showing that after this point the marginal reduction in risk is not meaningful (in a long-only framework). Increasing the portfolio size reduces monthly turnover and hence cost, but we have not optimized this parameter. We have set trading costs (one-way average bid-ask spread and commission) to 0.15%, which is probably on the high side in today’s market but on the low side in 1998, and is likely a fair average trade cost for investors investing in global equities with no access to prime brokerage services.
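
To make the cost assumption concrete, the sketch below works through one month with made-up holdings and a made-up gross return. It mirrors the cost adjustment used in the backtest code at the end of this post, where every replaced position is charged the one-way cost twice (one sale and one purchase).

```r
tc <- 0.15    # one-way trade cost in %

# made-up holdings in two consecutive months
prev.tickers <- c("AAA", "BBB", "CCC", "DDD")
curr.tickers <- c("AAA", "BBB", "EEE", "FFF")

# share of last month's positions that were sold
turnover <- 1 - sum(prev.tickers %in% curr.tickers) / length(prev.tickers)

# each replaced position costs one sale and one purchase
cost <- tc * turnover * 2 / 100

gross.ret <- 0.01              # made-up gross monthly return
net.ret <- gross.ret - cost
```

With half of the positions replaced, the cost drag is 0.15% for the month, turning the 1% gross return into 0.85% net.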

The backtest results are based on 1,000 bootstraps in order to get more robust estimates of performance measures such as annualized return. In a follow-up post we will explain the importance of bootstrapping when testing cross-sectional equity strategies. One may object to the replacement requirement because it creates situations where the portfolio will select the same stock more than once, which increases concentration risk and departs from an equal-weight framework. However, sampling without replacement would reduce the universe from 1,200 securities.

Despite high trading costs and high turnover our four-factor model delivers alpha against random portfolios, with both the Kolmogorov-Smirnov test (comparing the two distributions) and the t-test (on excess annualized returns across bootstraps) showing it to be highly significant. The Kolmogorov-Smirnov statistic (D) is 0.82 and the t-statistic on excess returns is 60. The four-factor model delivers on average 8.8% annualized return over the 1,000 bootstraps compared to 5.8% annualized return for S&P 1200 Global (orange vertical line) and an average 5.3% annualized return for random portfolios.
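
Both tests are available in base R. The sketch below shows the calls on simulated placeholder data: the two means are taken from the text, while the 2% spread is made up, so the statistics it produces will not match the ones reported above.

```r
set.seed(11)

# placeholder data: simulated annualized returns across 1,000 bootstraps
# (the standard deviation is made up)
model.ret <- rnorm(1000, mean = 0.088, sd = 0.02)
random.ret <- rnorm(1000, mean = 0.053, sd = 0.02)

# two-sample Kolmogorov-Smirnov test comparing the two distributions
ks <- ks.test(model.ret, random.ret)

# one-sample t-test on excess annualized returns across bootstraps
tt <- t.test(model.ret - random.ret)

ks$statistic    # D statistic
tt$statistic    # t statistic
```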

Four-factor model annualised returns vs random portfolios

In our previous post we showed that random portfolios beat buy-and-hold for the S&P 500 index, but in this case they do not. The reason is the small portfolio size being roughly three percent of the universe leading to excessive turnover and hence costs.

The Sharpe ratio is also decent at 0.56 on average across the 1,000 bootstraps compared to 0.28 for random portfolios – hence a very significant improvement. A natural extension would be to explore ways to improve the risk-adjusted return further, for example by shorting the stocks with the worst total score, thereby creating a market-neutral portfolio.

Sharpe Ratio annualized vs random portfolio

However, our research shows that a market-neutral version does not deliver consistent alpha. In general we find that our equity models do not capture alpha equally well on the long and short side, indicating that drivers of equity returns are not symmetric. So far, we have found market-neutral equity strategies to have more merit on shorter time-frames (intraday or daily), but encourage readers to share any findings on longer time-frames (weekly or monthly frequency).

Cumulative excess performance vs S&P 1200 Global

Interestingly, the cumulative excess performance chart shows exactly what Cliff Asness, co-founder of AQR Capital Management, has explained on multiple occasions about the firm’s early start, namely persistent underperformance as the momentum effect dominated the late 1990s and catapulted the U.S. equity market into a historical bubble. Value investing, together with Warren Buffett, was ridiculed. Basically, stocks that were richly valued and had low or negative return on capital, high beta and high momentum performed very well.

However, starting in 2000 our four-factor model enjoys a long streak of outperformance similar to the fortunes of AQR and it continued until summer 2009. Since then excess return against S&P 1200 Global (we choose this as benchmark here because it beats random portfolios) has been more or less flat. In other words, it performs similarly to the market, but does not generate alpha for our active approach.

Why are traditional equity factor models not producing alpha to the same degree as the period 2000-2009? Two possible explanations come to mind. Competition in financial markets has gone up and with cheap access to computers, widespread adoption of open-source code and equity factors well-researched, the alpha has been competed away. Alternatively, the standard way of creating a four-factor model has run out of juice and factors can still work, but have to be applied in different ways. Maybe the factors should not be blended into a combined score, but instead the best stocks from each factor should be selected. There are endless ways to construct an equity factor model.

### risk.factors is an array with dimensions 239, 2439 and 7 (months, unique stocks in S&P 1200 Global over the whole period, equity factors).
### variables such as cluster.window etc. are specified in our data handling script
### hist.tickers is a matrix with dimensions 239, 2439 (months, unique stocks in S&P 1200 Global over the whole period) - basically a matrix with ones or NAs indicating whether a ticker was part of the index or not at a given point in time
### tr is a matrix containing historical total returns (including reinvesting of dividends and adjustments for corporate actions) with same dimensions as hist.tickers 

# variables
no.pos <- 40 # number of positions in portfolio
strategy <- "long" # long or long-short?
min.stock.cluster <- 50 # minimum stocks per cluster
B <- 1000 # number of bootstraps
tc <- 0.15 # one-way trade cost in % (including bid-ask and commission)

# pre-allocate list with length of dates to contain portfolio info over dates
strat <- vector("list", length(dates))
names(strat) <- dates

# pre-allocate xts object for portfolio returns
port.ret <- xts(matrix(NA, N, B), order.by = dates)

# pre-allocate xts object for random portfolio returns
rand.port.ret <- xts(matrix(NA, N, B), order.by = dates)

# number of clusters over time
no.clusters <- xts(matrix(NA, N, B), order.by = dates)

# initialise text progress bar
pb <- progress::progress_bar$new(format = "calculating [:bar] :percent eta: :eta",
 total = B, clear = FALSE, width = 60)

# loop of bootstraps
for(b in 1:B) {
 
 # loop over dates
 for(n in (cluster.window+2):N) {
 
 # rolling dates window
 dw <- (n - cluster.window):(n - 1)
 
 # index members at n time
 indx.memb <- which(hist.tickers[n - 1, ] == 1)
 
 # if number of bootstraps is above one
 if(b > 1) {
 
 # sample with replacement of index members at period n
 indx.memb <- sample(indx.memb, length(indx.memb), replace = T)
 
 }
 
 complete.obs <- which(apply(is.na(tr[dw, indx.memb]), 2, sum) == 0)
 
 # update index members at n time by complete observations for correlation
 indx.memb <- indx.memb[complete.obs]
 
 # temporary total returns
 temp.tr <- tr[dw, indx.memb]
 
 # normalised returns
 norm.ret <- scale(temp.tr)
 
 # fit PCA on normalised returns
 pca.fit <- prcomp(norm.ret)
 
 # estimate market returns from first PCA component
 x <- (norm.ret %*% pca.fit$rotation)[, 1]
 
 # estimate beta
 betas <- as.numeric(solve(t(x) %*% x) %*% t(x) %*% norm.ret)
 
 # estimate residuals (normalised return minus market)
 res <- norm.ret - tcrossprod(x, betas)
 
 # correlation matrix
 cm <- cor(res)
 
 # distance matrix
 dm <- as.dist((1 - cm) / 2)
 
 # fit a hierarchical agglomerative clustering
 fit <- hclust(dm, method = "average")
 
 for(i in 2:20) {
 
 # assign tickers into clusters
 groups <- cutree(fit, k = i)
 
 # minimum number of tickers in a cluster
 group.min <- min(table(groups))
 
 # if smallest cluster has less than minimum required number of stocks break loop
 if(group.min < min.stock.cluster) {
 
 groups <- cutree(fit, k = i - 1)
 
 break
 
 }
 
 }
 
 # number of clusters
 G <- length(unique(groups))
 
 # insert number of clusters
 no.clusters[n, b] <- G
 
 # stocks per cluster
 cluster.size <- table(groups)
 
 # number of positions per cluster
 risk.allocation <- round(table(groups) / sum(table(groups)) * no.pos)
 
 # pre-allocate list for containing all trade info on each cluster
 cluster.info <- vector("list", G)
 
 # loop over clusters
 for(g in 1:G) {
 
 # find the ticker positions in the specific cluster
 cluster.pos <- indx.memb[which(groups == g)]
 
 # which tickers have total returns for period n
 has.ret <- which(!is.na(tr[n, cluster.pos]))
 
 # adjust stock's position for g cluster based on available forward return
 cluster.pos <- cluster.pos[has.ret]
 
 # rescale quality risk factor
 quality.1 <- risk.factors[n - 1, cluster.pos, "quality.1"]
 quality.2 <- risk.factors[n - 1, cluster.pos, "quality.2"]
 quality.1.rank <- (quality.1 - min(quality.1, na.rm = T)) /
 (max(quality.1, na.rm = T) - min(quality.1, na.rm = T))
 quality.2.rank <- (quality.2 - min(quality.2, na.rm = T)) /
 (max(quality.2, na.rm = T) - min(quality.2, na.rm = T))
 
 quality.rank <- ifelse(!is.na(quality.2.rank), quality.2.rank, quality.1.rank)
 
 # rescale value risk factor
 value.1 <- risk.factors[n - 1, cluster.pos, "value.1"]
 value.2 <- risk.factors[n - 1, cluster.pos, "value.2"]
 value.1.rank <- (value.1 - min(value.1, na.rm = T)) /
 (max(value.1, na.rm = T) - min(value.1, na.rm = T))
 value.2.rank <- (value.2 - min(value.2, na.rm = T)) /
 (max(value.2, na.rm = T) - min(value.2, na.rm = T))
 
 value.rank <- ifelse(!is.na(value.2.rank), value.2.rank, value.1.rank)
 
 # rescale momentum risk factor
 mom <- risk.factors[n - 1, cluster.pos, "mom"]
 mom.rank <- (mom - min(mom, na.rm = T)) /
 (max(mom, na.rm = T) - min(mom, na.rm = T))
 
 # rescale beta risk factor
 beta <- risk.factors[n - 1, cluster.pos, "beta"] * -1
 beta.rank <- (beta - min(beta, na.rm = T)) /
 (max(beta, na.rm = T) - min(beta, na.rm = T))
 
 # rescale reversal risk factor
 reversal <- risk.factors[n - 1, cluster.pos, "reversal"] * -1
 reversal.rank <- (reversal - min(reversal, na.rm = T)) /
 (max(reversal, na.rm = T) - min(reversal, na.rm = T))
 
 # combine all normalised risk factor ranks into one matrix
 ranks <- cbind(quality.rank, value.rank, mom.rank, beta.rank)#, reversal.rank)
 
 if(sum(complete.cases(ranks)) < risk.allocation[g]) {
 
 col.obs <- apply(!is.na(ranks), 2, sum)
 
 col.comp <- which(col.obs > (cluster.size[g] / 2))
 
 comb.rank <- rank(apply(ranks[, col.comp], 1, mean), na.last = "keep")
 
 } else {
 
 comb.rank <- rank(apply(ranks, 1, mean), na.last = "keep")
 
 }
 
 if(strategy == "long") {
 
 
 long.pos <- cluster.pos[which(comb.rank > max(comb.rank, na.rm = T) - risk.allocation[g])]
 
 cluster.info[[g]] <- data.frame(Ticker = tickers[long.pos],
 Ret = as.numeric(tr[n, long.pos]),
 stringsAsFactors = FALSE)
 
 }
 
 if(strategy == "long-short") {
 
 long.pos <- cluster.pos[which(comb.rank > max(comb.rank, na.rm = T) - risk.allocation[g])]
 short.pos <- cluster.pos[which(comb.rank < risk.allocation[g] + 1)]
 
 long.data <- data.frame(Ticker = tickers[long.pos],
 Sign = rep("Long", risk.allocation[g]),
 Ret = as.numeric(tr[n, long.pos]),
 stringsAsFactors = FALSE)
 short.data <- data.frame(Ticker = tickers[short.pos],
 Sign = rep("Short", risk.allocation[g]),
 Ret = as.numeric(tr[n, short.pos]) * -1,
 stringsAsFactors = FALSE)
 
 cluster.info[[g]] <- rbind(long.data, short.data)
 
 }
 
 }
 
 # rbind data.frames across clusters and insert into strat list
 strat[[n]] <- do.call("rbind", cluster.info)
 
 # insert portfolio return
 if(n == cluster.window + 2) {
 
 port.ret[n, b] <- mean(strat[[n]][,"Ret"]) - tc * 2 / 100
 
 } else {
 
 # turnover in % (only selling)
 strat.turnover <- 1 - sum(!is.na(match(strat[[n]][, "Ticker"], strat[[n-1]][, "Ticker"]))) /
 length(strat[[n-1]][, "Ticker"])
 
 port.ret[n, b] <- mean(strat[[n]][,"Ret"]) - tc * strat.turnover * 2 / 100
 
 }
 
 # insert random portfolio return
 rand.pos <- sample(indx.memb[which(!is.na(tr[n, indx.memb]))],
 size = no.pos, replace = T)
 
 if(n == cluster.window + 2) {
 
 rand.port.ret[n, b] <- mean(tr[n, rand.pos]) - tc * 2 / 100
 
 prev.rand.pos <- rand.pos
 
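 } else {
 
 rand.turnover <- 1 - sum(!is.na(match(prev.rand.pos, rand.pos))) / length(prev.rand.pos)
 
 rand.port.ret[n, b] <- mean(tr[n, rand.pos]) - tc * rand.turnover * 2 / 100
 
 # carry the current random positions forward so turnover is measured month over month
 prev.rand.pos <- rand.pos
 
 }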
 
 }
 
 # update progress bar
 pb$tick()
 
 Sys.sleep(1 / 100)
 
}

Welcome to Predictive Alpha

The Predictive Alpha blog is about quantitative trading and investing using the free statistical programming language R.

The Predictive Alpha blog favors complete transparency so all relevant R code is freely available. This allows visitors of the blog to recreate our results and hopefully also promotes discussion and feedback.