In our previous installment on this tutorial, we demonstrated how to load data into R from existing sources like an Excel spreadsheet by saving it as a Comma Separated Value file and using R’s utility functions to load and check the data. The same approach also works fairly well when you’re trying to get a file of data from your IT department; most business intelligence and querying tools can dump out their results as a CSV and these files can be easily emailed between departments. I used this approach for years before seeking better methods.
However, there are certain types of projects where the “download and save to CSV trick” gets either tedious or impractical.
So what if we could… perhaps… connect R directly to the corporate database in question and pull the data directly?
It turns out this sort of process is indeed “a thing” and, depending on how liberal your corporate system administrators are, can significantly simplify your work. It’s also fairly simple to implement using R – for this next example, we will share how you can use two protocols called ODBC and SQL to pull data directly R from a larger corporate system.
ODBC is a common protocol that allows computer software to talk to many common databases (Oracle, SQL Server, etc). SQL stands for Structured Query Language and is the primary language used to extract information from relational databases; it’s fairly simple (good intro here, including a code playground) and a “good thing” for analysts to know.
In this example, we will use a R library called RODBC to connect to a database and run a simple query. The results of this query will be pulled back into R and available for us to use as a standard data frame.
Incidentally, your local installation of R likely doesn’t already have the RODBC package installed; so we will need to download and install it. This is quite simple.
# r odbc - odbc package installation install.packages('RODBC')
The system will take it from there (it may prompt you to pick a download site from a list); congratulations, you’ve likely just installed your first R package. These little libraries can handle a wide range of tasks and are one of the best things about the R community. No matter what you’re trying to do, someone else has probably already attempted a similar problem and published a helpful library for the technique.
Once installed, we will load the library into the R environment and connect it to a database.
# r odbc example - loading library library(RODBC) # r odbc connect - odbcconnect myconn
Boom, we’re connected. And now for a little query….
# r sql query customers
That particularly epic example of an SQL query just aggregated sales data from my invoice table by customer, returning the result as a neat little table of customers and sale amounts which we can use for future analysis. For example, we could examine the distribution of sales by customer size to understand if our business was driven by a handful of accounts or broadly distributed across a base of small customers.
SQL is an incredibly flexible language – you can join different tables together, summarize results, and filter down to the items you want. This approach also has the advantage of allowing you to run heavy processing on a remote machine (usually more powerful that your computer) and only bring back the summarized results to your own machine.
Resources to help you simplify data collection and analysis using R. Automate all the things!