
May 08, 2004

Welcome

Welcome to Loyalty Matrix's weblog devoted to R. Here is a brief introduction to R and how we use it.

Overview
The statistical computing, data analysis & graphics software package R has become an important part of our analytic toolkit. The purpose of this memo is to introduce R to those of you who are not familiar with this powerful resource.

R Basics
R is a free, open-source implementation of S, a statistical language developed by John Chambers and colleagues at Bell Labs starting in the 1970s. The ACM Software System Award that John Chambers received in 1998 stated that S “… has forever altered the way people analyze, visualize and manipulate data…”

R is an open, cross-platform system. Pre-compiled binaries are available for easy installation on Windows, MacOS and various Linux distributions. The sources can be easily compiled, I’m told, on various flavors of UNIX. Since R is (mostly) written in R, it is easily extended.

R source, executables, documentation, newsletters, and more are available at www.r-project.org/ and at various mirror sites around the world. Here in the USA, the preferred site is cran.us.r-project.org/.

Loyalty Matrix and R
Loyalty Matrix discovered R early in 2003 while doing analysis work for several clients who were looking for a high-value solution with low upfront costs. Because their analytical needs were specific to customer data analysis, a large budget for one of the conventional commercial packages like SAS or SPSS was not feasible. We were therefore tasked with finding a viable alternative, and R fit the bill, literally.

While R has been around since the early 1990s (and has been released under the GNU GPL since 1995), it has become really popular only in the last couple of years. Version 1.0.0 was officially released in February 2000. Major revisions generally happen twice a year. The current version is 1.9.0.

Just like SAS and SPSS, R is essentially a programming language fine-tuned for statistical analysis, data mining, and data visualization. However, because it is open source, it improves extremely quickly, thanks to the size and caliber of its contributor community. All the functionality required for customer data analysis (scatter plots, histograms, cluster analysis, decision trees, time series analysis, CHAID, CART and ANOVA, to name a few) is available in R. Furthermore, because of its advanced data visualization capabilities, we found many of our traditional SAS and SPSS analysts opting for R.
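
For readers who have not seen R code, here is a minimal sketch of a few of the methods named above: a histogram, a scatter plot, a cluster analysis and an ANOVA. It is only an illustration, not one of our client analyses. It assumes a reasonably current R installation and uses the built-in iris data set as a stand-in for customer data; the variable names (plot_file, clusters, fit) are made up for this example.

```r
# Illustrative only: the built-in iris data stands in for customer data.
data(iris)

# Draw a histogram and a scatter plot into a temporary PDF file,
# so the script also works when no graphics window is available.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
invisible(dev.off())

# Cluster analysis: k-means with three clusters on the four numeric columns.
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability
clusters <- kmeans(iris[, 1:4], centers = 3)
print(table(clusters$cluster, iris$Species))

# One-way ANOVA: does mean sepal length differ by species?
fit <- aov(Sepal.Length ~ Species, data = iris)
print(summary(fit))
```

Everything used here ships with the standard R installation; methods such as decision trees live in add-on packages from CRAN.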

Since R is essentially S-language compatible, the extensive S code base, which provides all the basic analytic and graphic methods, gave the R project a significant head start. Nowadays, many new packages are released first for R and then picked up by the S community. We were immediately attracted to R by its strong and flexible graphic tools, its wealth of exploratory data analysis methods, and the general activity around the R project.

The R movement is continuing to expand. The R Foundation was created in April, 2003 to give a formal focal point for the project (modeled after the Apache and GNOME Foundations). Founding organizations include the Institute of Mathematical Statistics (IMS), MedAnalytics, Baxter AG, Johns Hopkins University and the University of Wisconsin. Loyalty Matrix became a “Supporting Institution” of the R Foundation in 2003.

Cutting-edge institutions, such as Stanford, now use R in their data mining courses rather than traditional commercial packages such as SAS or SPSS.

The first useR! (annual user conference) will be held in Vienna on May 20-22, 2004. “Data Mining and Large Databases” will be one featured topic.

R Resources
The CRAN link is the place to start. In particular, the Windows install file is there. Note that PDFs of the extensive standard documentation set are part of the standard installs and are also available separately on CRAN. PDFs of all issues of R News (the R newsletter), various lecture notes and handouts, and documentation for the nearly 300 add-on packages are also on CRAN. The R-Help mailing list is extremely active, with 50+ messages on many days.

Conclusion
Loyalty Matrix uses R as a core element of our analytic toolkit, leveraging its wide range of methods for exploratory data analysis (EDA), presentation plots, and rigorous statistical algorithms.

Over the next few months, we will be coding our models for customer intelligence analysis in a set of formal R packages. These will become part of our MatrixOptimizer™ suite, which is our product framework to derive marketing insights from customer data. The overall goal is to utilize R to improve the efficiency and effectiveness of our processes and insight development techniques. We will also release these models into R’s open-source community so that we can invite leading world-class statisticians and data mining experts to critique and, hopefully, contribute to these customer intelligence packages. Our goal is to share our expertise and spearhead a cooperative “Open Source Customer Intelligence” project within the R community.

Posted by Jim Porzak on May 08, 2004 at 02:22 AM in General
