Two of the most important data science tools for data wrangling and manipulation are R and Python. Of these two, I personally prefer R for data manipulation, since it was designed by statisticians specifically for working with data. One downside of R is that its documentation is quite poor, so it can be helpful to keep your own list of useful code snippets that you can refer to as and when needed. In this article, I will be posting a collection of really useful R snippets which I’ve found very handy over the course of my PhD and data science work.
- Install and/or load multiple R packages at once: This is quite handy, as you will often need to either install new packages or load previously installed ones. A good exercise would be to rewrite the code below as a function which takes one or more R package names as arguments. The code below was obtained from http://diggdata.in/.
# List of packages to be installed/loaded
packageList <- c("ggplot2", "nlme")
check <- packageList %in% rownames(installed.packages())
if (any(!check)) install.packages(packageList[!check])
lapply(packageList, library, character.only = TRUE)
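One possible answer to the exercise above is sketched here. The function name `load_packages` is my own hypothetical choice, not something from the original source, and the sketch assumes a CRAN mirror is configured if anything actually needs installing. It uses `require()` rather than `library()` so that it can report which packages loaded successfully.

```r
# Hypothetical helper (not from the original source): install any missing
# packages, then load them all and report which ones loaded successfully.
load_packages <- function(...) {
  pkgs <- unique(c(...))
  # Packages that are not yet installed
  to_install <- setdiff(pkgs, rownames(installed.packages()))
  if (length(to_install) > 0) install.packages(to_install)
  # require() returns TRUE/FALSE instead of stopping on failure
  loaded <- vapply(pkgs, require, logical(1),
                   character.only = TRUE, quietly = TRUE)
  invisible(loaded)
}

# Usage: pass package names as separate arguments or as one vector
res <- load_packages("stats", "utils")
```

The returned logical vector is named by package, so `res[!res]` would list any packages that failed to load.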
- Create new folders in R: It is possible to create a new folder in your current directory using R.
# get current working directory
getwd()
# create "folder_1" in current working directory
dir.create("folder_1")
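Calling `dir.create()` on a folder that already exists only gives a warning, and nested folders need `recursive = TRUE`. A minimal sketch that handles both cases is shown here. The path under `tempdir()` and the sub-folder name "plots" are assumptions for illustration only; substitute your own location.

```r
# Hypothetical location: a nested folder inside the session's temporary directory
new_dir <- file.path(tempdir(), "folder_1", "plots")

# Create the folder (and any missing parent folders) only if it is not there yet
if (!dir.exists(new_dir)) {
  dir.create(new_dir, recursive = TRUE)
}

# Confirm that the folder now exists
dir.exists(new_dir)
```

Building the path with `file.path()` keeps the code portable across operating systems.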
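- Plotting in ggplot2 using the pipe operator: The pipe operator (%>%) is very handy for passing a data frame straight into ggplot(). Note that the plot layers themselves are still added with +, not with the pipe. The pipe also allows you to manipulate the data prior to plotting; see the next item.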
library(ggplot2)
library(dplyr)
# Plot in ggplot2 using pipe
CO2 %>%
  ggplot(aes(x = conc, y = uptake, group = Plant, col = Plant, shape = Type)) +
  geom_point() +
  geom_line()
- Manipulate data prior to plotting in ggplot
library(ggplot2)
library(dplyr)
# convert uptake by group into z scores: (uptake - mean(uptake)) / sd(uptake)
CO2 %>%
  group_by(Plant) %>%
  mutate(mean_uptake = mean(uptake),
         z_uptake = (uptake - mean_uptake) / sd(uptake)) %>%
  ggplot(aes(x = Plant, y = z_uptake, fill = Plant)) +
  geom_boxplot()
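Before plotting, it can be worth checking that a grouped transformation did what you expected. The sketch below is my own addition. It computes the conventional z score (value minus group mean, divided by group standard deviation) and then summarises each plant's z scores, which should have a mean of about 0 and a standard deviation of 1. The `as.data.frame()` call is a precaution that strips the extra grouped-data classes attached to the built-in CO2 dataset.

```r
library(dplyr)

# Standardise uptake within each plant, then summarise the result per plant
z_check <- as.data.frame(CO2) %>%
  group_by(Plant) %>%
  mutate(z_uptake = (uptake - mean(uptake)) / sd(uptake)) %>%
  summarise(mean_z = mean(z_uptake), sd_z = sd(z_uptake))

# One row per plant; mean_z should be ~0 and sd_z should be 1
z_check
```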