02 Data Sources and Getting Data into R

Variable assignment

Note: It is recommended to use <- rather than = for assignment.

> x = 1
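As a small sketch of the recommendation above, the same assignment can be written with <-. One practical difference is that = inside a function call names an argument instead of assigning a variable. The variable m is only for illustration.

```r
# Preferred assignment operator
x <- 1

# Inside a function call, `=` names an argument; it does not assign a variable.
# Here `x` is mean()'s own argument, so the x defined above is left unchanged.
m <- mean(x = c(2, 4, 6))

print(x)   # 1
print(m)   # 4
```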

Assign x a vector of integers from 1 to 10

> x = 1:10 

Assign x a vector of integers from 10 to 1

> x = 10:1

To get a different scale, multiply the sequence by a constant (here, 0.01 through 0.10 in steps of 0.01)

> x = 1:10 * 0.01

Use the seq function to generate the numbers 0 through 100 in steps of 10

> x = seq(0, 100, by = 10)

Create a vector from an arbitrary sequence of numbers

> x = c(1, 5, 8, 5, 10)
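The vectors above look alike but are not all stored the same way. A minimal sketch, using illustrative names a, b, s and v, to check what type each construction produces:

```r
a <- 1:10                    # colon operator on whole numbers gives an integer vector
b <- 1:10 * 0.01             # multiplying by a double gives a double vector
s <- seq(0, 100, by = 10)    # by = 10 is a double, so the result is a double vector
v <- c(1, 5, 8, 5, 10)       # numeric literals are doubles unless written as 1L, 5L, ...

print(class(a))    # "integer"
print(class(s))    # "numeric"
print(length(s))   # 11
```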

Prompt the user to enter numbers at the console (input ends at the first empty line)

> x = scan()
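Since scan() with no arguments waits for console input, it is hard to try in a script. A minimal sketch that supplies the input through scan's text argument instead; the numbers are made up for illustration:

```r
# `text` stands in for what the user would type; quiet = TRUE hides the "Read N items" message
x <- scan(text = "3 7 11", quiet = TRUE)

print(x)        # 3 7 11
print(sum(x))   # 21
```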

Reading from local files

Packages are available to read data from Excel spreadsheets and similar formats. However, whenever possible, export your data to .csv format and then use the read.csv function

> credit = read.csv("data/credit.csv")

Also take a look at the related functions read.table and read.csv2.
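To see how these functions differ, here is a minimal sketch that writes two small temporary files and reads them back. The file contents and column names are hypothetical, chosen only so the example is self-contained.

```r
# Comma-separated file with "." as the decimal mark
path <- tempfile(fileext = ".csv")
writeLines(c("id,amount,status", "1,250.5,good", "2,99.0,bad"), path)

d1 <- read.csv(path)                               # header and "," separator are the defaults
d2 <- read.table(path, header = TRUE, sep = ",")   # general form: state both explicitly

# Semicolon-separated file with "," as the decimal mark, which is what read.csv2 expects
path2 <- tempfile(fileext = ".csv")
writeLines(c("id;amount", "1;250,5", "2;99,0"), path2)
d3 <- read.csv2(path2)

str(d1)
print(d3$amount)   # 250.5 99.0
```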

Reading from web

Option 1: Reading well-formatted data

> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
> housing = read.csv(url, header = FALSE, sep = "")

Option 2: More granular control of the data (getURL comes from the RCurl package)

> library(RCurl)
> raw = getURL("https://raw.githubusercontent.com/einext/data/master/Olympics2016.csv")
> olympics = read.csv(textConnection(raw), header=TRUE) 
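Once the file contents are held in a single character string, read.csv can parse that string directly. A minimal sketch, using an inline stand-in string so it runs without a network connection; the column names and values are hypothetical, not the contents of the file above:

```r
# Stand-in for the string a download would return (hypothetical contents)
raw <- "Country,Gold,Silver\nA,3,1\nB,0,2\n"

# Option A: wrap the string in a text connection, then close it after reading
con <- textConnection(raw)
medals <- read.csv(con, header = TRUE)
close(con)

# Option B: pass the string through read.csv's `text` argument
medals2 <- read.csv(text = raw, header = TRUE)

print(medals)
```

Holding the raw text first also makes it possible to inspect or clean it before parsing.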

Built-in Datasets

Descriptions of the datasets built into R are available at the following link

https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

Within R, you can list them with

> data()

Load a dataset into the R session

> data("iris")

View help on a dataset

> ?iris

Remove all objects from the R session (this clears everything in the workspace, not only datasets)

> rm(list = ls())
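Because rm(list = ls()) clears the whole workspace, it is often safer to remove objects by name. A minimal sketch with two hypothetical objects:

```r
scratch_values <- 1:3        # hypothetical object we no longer need
keep_me <- "still here"      # hypothetical object we want to keep

rm(scratch_values)           # removes only the named object

print("scratch_values" %in% ls())   # FALSE
print("keep_me" %in% ls())          # TRUE
```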

External Data Sources

Other data formats and sources

  • To load JSON data, you can use the rjson package
  • To load data from an RDBMS, you can use the RODBC package
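As a rough sketch of the JSON case, the rjson function fromJSON parses a JSON string into an R list. The JSON text below is made up for illustration, and the code assumes rjson has been installed; it skips the parsing step otherwise.

```r
# Hypothetical JSON document held in a string
json_text <- '{"name": "iris", "rows": 150, "columns": ["Sepal.Length", "Species"]}'

if (requireNamespace("rjson", quietly = TRUE)) {
  parsed <- rjson::fromJSON(json_text)   # returns a named list
  print(parsed$rows)                     # 150
  print(parsed$columns)                  # the two column names
} else {
  message("rjson is not installed; run install.packages('rjson') first")
}
```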

Common Data Exploration Tasks

Once you have data in the R session, a few common exploration steps are shown below

View the number of records, the number of columns, the column types, and sample values from each column

> str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Summarize each column. For a numeric column, summary shows the minimum, maximum, quartiles, and mean; for a categorical column, it shows the frequency of each level.

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

Summarize a single numeric column

> summary(iris$Petal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.600   4.350   3.758   5.100   6.900 
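Each figure reported by summary can also be computed on its own. A minimal sketch, using pl as an illustrative shorthand for the column:

```r
pl <- iris$Petal.Length

print(range(pl))                     # minimum and maximum: 1.0 6.9
print(median(pl))                    # 4.35
print(mean(pl))                      # 3.758
print(quantile(pl, c(0.25, 0.75)))   # first and third quartiles: 1.6 and 5.1
print(sd(pl))                        # standard deviation, which summary() does not report
```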

Find the unique values of a categorical column

> unique(iris$Species)
[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica

Find the frequency of each value in a categorical column

> table(iris$Species)
    setosa versicolor  virginica 
        50         50         50 

Find the proportion of each categorical value

> prop.table(table(iris$Species))
    setosa versicolor  virginica 
 0.3333333  0.3333333  0.3333333 
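Proportions are often easier to read as percentages. A minimal sketch that scales and rounds the same table; the names counts and pct are illustrative:

```r
counts <- table(iris$Species)               # frequency of each species
pct <- round(100 * prop.table(counts), 1)   # convert proportions to percentages, one decimal

print(pct)   # 33.3 for each of the three species
```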

Find the number of records

> nrow(iris)
[1] 150

Find the number of columns

> ncol(iris)
[1] 5
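Both counts are also available in one call with dim, and head gives a quick look at the first records. A minimal sketch:

```r
print(dim(iris))       # 150 5 (rows first, then columns)
print(names(iris))     # column names
print(head(iris, 3))   # first three records
```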