Advanced Regression Analysis: How To Print All Best Models ?

Output Models.txt  – Contains the summary, vif,  model fit, forecast of holdout observations and deviations of lm() and rlm(). This is the main output file because it prints out 10 best models for models of all possible sizes, meaning, the first 10 models will be 1 variable models, the second 10 will have 2 predictor variables, the 3rd 10 models will have 3 variables and so on. Within each group these models are arranged based on highest ‘adj Rsq’ values. Quite a convenience eh?

Actual vs Fit/Forecast Charts – The  Actual vs Fit charts for all the best models are stored in PNG format.

Should you wish to make changes and  incorporate this in your work, you will have to change the read.table and colnames(dat) statements to suit your needs, and do quote this blog ‘rprogrammingblog.wordpress.com’ at the top in your source code. I would like to know how it works out for you, in case you tried it out.

How to use this code?

Step1: Install the “leaps”, “car”, “MASS” libraries using install.packages(c(“leaps”,”car”,”MASS”) command.

Step 2: If you have your own response-predictors data, update dat <- read.table() and colnames(dat) sections in below code to point to the correct location in your local disk. If your dataset has headers, comment out the colnames(dat) part. Finally, the first column in your dataset is assumed as a time column like month-year. If your analysis does not have time series data like demand or sales, fill up the first column in your data as row indices. You are now ready to run the code.

If you do not have your own dataset but want to just see how this code works, just install the packages as explained above and run the rest of it as it is in your R console.

library(leaps)
library(car)
library(MASS)
cat(paste("Outputs will be collected in this location :",getwd()))
store.graphs = TRUE
holdout.obs = 6	# No of observations you want to holdout in the tail of input data. These values will be predicted using lm() and rlm().
 
# Change the following 2 statements as per your need.
dat <- read.table("http://www.stat.ufl.edu/~winner/data/winepop.dat",header=T)	#Load the input data. The last few observations will be in hold-out for prediction, as directed by 'holdout.obs' variable.
colnames(dat) <- c("Year", "Total.Population.Thousands", "years.5", "years.5to14","years.15to24","years.25to34","years.35to44","years.45to54","years.55to64","sixtyfiveyears","Wine.Consumption.Millions.of.gallons")
 
predictors.df <- dat[,c(3:ncol(dat))]						# The predictor variables in your input dataset (dat)
target.df <- dat[,2]								# Point to the response variable in 'dat'
 
 
LPS <- leaps(x = predictors.df ,y = target.df,  names = colnames(predictors.df), nbest = 10 ,method =  "adjr2")
sink("Output Models.txt")
 
for(i in (1:nrow(LPS$which))){
			# Create a formula using those variables
			preds <- paste("I(",names(which(LPS$which[i,]!= "FALSE")),"^",1,")",sep="", collapse = " + ")
			fm <- paste("(",colnames(dat)[2]," ~ ",preds,")",sep = "")
 
			# Create a linear model (LM) and a robust linear model (RLM)
			rg <- lm(as.formula(fm), data= as.data.frame(dat[c(1:(nrow(dat)-holdout.obs)),]))  				 					
			rrg <- rlm(as.formula(fm), data= as.data.frame(dat[c(1:(nrow(dat)-holdout.obs)),]))  		
 
			# Store the Actual and Predicted 
		      	out <- data.frame(actuals = dat[,2] ,predicted.robust = round(predict(rrg, dat)), predicted.lm = round(predict(rg,  dat)))	
 
			# Get the percent deviations of the linear model and robust linear model
			robdev <- as.numeric(as.matrix(out[1:nrow(dat),2])) - as.numeric(as.matrix(out[1:nrow(dat),1]))  	
			lmdev <- as.numeric(as.matrix(out[1:nrow(dat),3]))- as.numeric(as.matrix(out[1:nrow(dat),1]))    	
			deviation <- cbind(robdev, lmdev)  										 		
			devperc.rob <- sprintf("%1.2f%%" , deviation[,1]/as.numeric(as.matrix(out[1:nrow(dat),1])) * 100) 	
			devperc.lm <- sprintf("%1.2f%%" , deviation[,2]/as.numeric(as.matrix(out[1:nrow(dat),1])) * 100)  	
 
			# Add the deviations to the output
			devperc <- cbind(rob.lm.dev = c(devperc.rob, rep("-", (nrow(out) - length(devperc.rob)))), lm.dev = c (devperc.lm, rep("-", (nrow(out) - length (devperc.lm)))))
			out <- cbind(dat$Year, out, devperc)
 
			# If we chose to store the actual vs. predicted graphs, generate the graphs and store them. The graphs will be stored in the working directory.
			if(store.graphs){
				# Set the margins of the graph
				par(mar = c(4,4,4,1))
				# Set up output to PNG.
				png(file=paste("Model ",i,".png",sep=""),width=900,height=550)
				# Plot the actual vs predicted and set parameters of the graph
				pic <- plot(c(1:nrow(dat)), out[,2], type = "b", cex.axis = 0.5, lwd = 2, xlab = "Year-Month", ylab =  "Demand", sub = paste("Plot from: ",dat[1,1] ," to ",dat [nrow(dat),1] ,sep = ""), main = paste("Model ",i,", 
 
No. Forecasts: ", (nrow (dat)-nrow(dat)),sep=""))
				text(12,((max(out[,c(2:4)])-min(out[,c(2:4)]))/2)+10,print(paste(names(which(LPS$which[i,]!=  "FALSE")),sep="\n", collapse = " ")),  adj = 1)
				box("figure")
				lines(out[,3], col = "red", lwd = 2)
				lines(out[,4], col = "blue", lwd = 2)
				legend("bottomright", inset=.01, c("Actual","Robust", "linear"), fill=c(1,4,2), horiz=TRUE, cex=0.75  )
				# End output to PNG
				dev.off()
			}
 
			# PRINT THE OUTPUT
			cat("Model",i,"\n")
			print(summary(rg))
			cat("vif")
			if(length(rg$coefficients)>2){print(vif(rg))}			
			print(out)
			cat("------------------------------------------------------------------- \n")
		}
 
		# close the output file
		sink()

Author: Selva Prabhakaran Sanjeevi Julian

Selva Prabhakaran

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s