After Microsoft acquired Revolution Analytics in 2015, it released a range of R implementations, applications and extensions of existing tools. With so many options, users may struggle to select and adopt the most appropriate R distribution. This latest article in our Talking about R… series will help you make the right choice for your use case.
Why should I use the Microsoft R Open distribution?
First, let's look at the differences between the R distributions. Microsoft has its own repository called the Microsoft R Application Network (MRAN). It hosts previous and current versions of Microsoft R Open, as well as a set of packages along with their dated snapshots, provided by the CRAN Time Machine.
Microsoft R Open is an enhanced distribution of open source R. In addition to the functionality available in the classic R distribution, it achieves higher efficiency thanks to the option of installing Intel's Math Kernel Library (Intel MKL).
MKL optimizes matrix computations in R and introduces multithreaded processing, which increases efficiency by up to 45 times depending on the application. You can read more about the differences in efficiency on the official website.
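To get a feel for the kind of operation that benefits from a multithreaded math library, a dense matrix multiplication can be timed as in the minimal sketch below. The matrix size is an arbitrary assumption, and the measured time depends entirely on the machine and on the BLAS library R is linked against. The optional thread query assumes the RevoUtilsMath package that accompanies Microsoft R Open and is skipped when that package is not installed.

```r
# Time a dense matrix multiplication, the kind of operation that a
# multithreaded BLAS library such as Intel MKL is designed to accelerate.
set.seed(42)                          # reproducible random input
n <- 500                              # matrix dimension (arbitrary, kept small)
a <- matrix(rnorm(n * n), nrow = n)
b <- matrix(rnorm(n * n), nrow = n)

timing <- system.time(product <- a %*% b)
print(timing[["elapsed"]])            # wall-clock seconds; varies by machine

# Assumption: RevoUtilsMath is only present in Microsoft R Open installs.
if (requireNamespace("RevoUtilsMath", quietly = TRUE)) {
  print(RevoUtilsMath::getMKLthreads())   # threads MKL is allowed to use
} else {
  message("RevoUtilsMath not installed; using the default BLAS")
}
```

Running the same script under classic R and under Microsoft R Open on the same machine is a simple way to check how much of a speed-up applies to your own workload.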
An important benefit of the MRAN repo is the ability to restore an environment from snapshots of R and package versions, which are maintained using the CRAN Time Machine. Because Microsoft R Open is officially distributed by Microsoft, it is also supported by Microsoft.
What you should remember is that anything you can do in R can also be done in Microsoft R Open, but not the other way round. By installing this distribution, you gain more efficient data processing and a wider set of functionalities.
About Microsoft R Server
For multithreaded processing and computing, Microsoft R Server uses the same functions and libraries as Microsoft R Open, i.e. the Intel Math Kernel Library. What sets Microsoft R Server apart from R Open is the ability to process data on several nodes, e.g. across a number of computers.
A key feature of R Server is that it can be installed on many platforms, including Linux, Windows, Hadoop and Teradata DB. This makes it possible to deliver the most accurate analysis possible with the data received.
Additionally, Microsoft R Server makes it possible to carry out a range of operationalization tasks thanks to the DeployR package. It supports deploying code according to best practices and managing that code in a clear and transparent way.
The package also enables management of the data processing environment across more than one server. R Server ships with the Intel Math Kernel Library already included, so no additional installation is required.
Another important element of R Server is the dedicated RevoScaleR library. It is a set of enhanced functions for importing, transforming and analyzing data at a larger scale, optimized through multithreaded processing.
The entry barrier for using these functions is quite low, as their names mirror the classic functions with an rx prefix. For instance, for the kmeans function we use rxKmeans, for correlation rxCor, etc.
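The naming parallel can be illustrated with a small sketch that runs the base R functions and, only when RevoScaleR happens to be installed, their rx counterparts. The data frame and its column names are made up for illustration. Note that the rx functions take a formula and a data argument, so the calls are similar but not identical.

```r
# Made-up sample data: two numeric columns, 100 rows.
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# Classic R functions.
base_fit <- kmeans(df, centers = 3)   # 3 clusters
base_cor <- cor(df)                   # 2 x 2 correlation matrix
print(base_cor)

# RevoScaleR counterparts, run only if the package is available
# (it ships with Microsoft R Client and R Server, not with classic R).
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  rx_fit <- RevoScaleR::rxKmeans(~ x + y, data = df, numClusters = 3)
  rx_cor <- RevoScaleR::rxCor(~ x + y, data = df)
  print(rx_cor)
} else {
  message("RevoScaleR not installed; showing base R results only")
}
```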
So why do I need… R Client?
If you are an analyst working on tasks involving machine learning, processing and cleansing data, and want to work locally with the functions available in RevoScaleR, you can do so with Microsoft R Client. It gives access to the whole range of functions in this package and to multithreaded data processing, limited to a maximum of 2 threads.
However, if we need to take the processing to a higher level, we can switch the compute context from within R Client to an R Server instance, built e.g. on a Hadoop cluster, and make use of all the R Server tools.
R Client is best used for local data processing with the RevoScaleR package (with processing limited to two threads).
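A rough sketch of what working with a compute context looks like is shown below. It only selects the local context, and only when RevoScaleR is installed. The helper name use_local_context is hypothetical, and the remote context constructors mentioned in the comments need cluster-specific arguments that are deliberately left out.

```r
# Select the local compute context and report what happened.
# use_local_context is a hypothetical helper written for this example.
use_local_context <- function() {
  if (!requireNamespace("RevoScaleR", quietly = TRUE)) {
    # Classic R has no compute contexts: everything runs locally.
    return("RevoScaleR not available; running in plain local R")
  }
  # "local" selects the default local compute context.
  RevoScaleR::rxSetComputeContext("local")
  # To move processing to R Server, a remote context object would be
  # passed instead, e.g. one created with RevoScaleR::RxHadoopMR() or
  # RevoScaleR::RxSpark(). Their connection arguments depend on the
  # cluster and are not shown here.
  "RevoScaleR compute context set to local"
}

status <- use_local_context()
print(status)
```

The point of this design is that the analysis code using rx functions stays the same, while only the compute context decides where it is executed.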
And it’s time for R Services
R Services is simply a SQL Server feature that allows R scripts to be run inside SQL Server stored procedures. Thanks to this functionality it is possible to analyze data more accurately.
It also improves efficiency, because it is no longer necessary to move the whole data source into R memory (which is where the data used to be analyzed) using the RODBC library to connect to the database. That was often a considerable limitation when memory resources were limited and the amount of data to analyze was large.
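As an illustration, the R part of such a procedure could look like the sketch below. In SQL Server the script is passed to the sp_execute_external_script stored procedure, which by default exposes the query result to R as a data frame named InputDataSet and returns the data frame named OutputDataSet. Here InputDataSet is filled with made-up rows so the script can also be run outside the database. The column names are assumptions.

```r
# Stand-in for the rows SQL Server would hand to the script.
# Inside SQL Server this data frame is created for you from the input query.
InputDataSet <- data.frame(
  customer_id = c(1, 1, 2, 2, 2),
  amount      = c(10, 20, 5, 15, 40)
)

# Body of the script: average amount per customer.
OutputDataSet <- aggregate(amount ~ customer_id, data = InputDataSet, FUN = mean)
names(OutputDataSet)[2] <- "mean_amount"

# Inside SQL Server, OutputDataSet is returned to the caller as a result set.
print(OutputDataSet)
```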
It is worth noting that Microsoft recently renamed some of its services, so R Services and R Server now have new names:
- R Services is now called Machine Learning Services
- R Server is now called Machine Learning Server
This is because these tools were extended with support for Python scripts.
When can these tools be used and for what purposes?
- R Open – a must-have tool for an R programmer who wants to process data efficiently using Intel libraries, whether for building visualizations, cleansing data or building data processing models, in non-production settings or for their own use.
- R Client – the right tool if we want to enhance the solution with RevoScaleR package capabilities, which let the user move the compute context to machines other than the local one. It is essential for anyone who prepares a production-ready solution on their local machine and wants to publish it to production.
- R Server (Machine Learning Server) – an efficient data processing engine for production solutions requiring multithreaded processing. If you're wondering which product to publish your solution on, or which one to use to run your R code on Hadoop clusters, R Server is the perfect choice.
- R Services (Machine Learning Services) – if you're using a SQL Server database and wish to enrich the processing of its data with advanced machine learning algorithms, so that the results are loaded into the database right away, it is worth using this component to embed R code in stored procedures in the SQL Server database.
This article has hopefully shed some light on the different R distributions available on the market. If you’d like further information or advice, or are thinking of doing a project utilizing R, don’t hesitate to contact me!