02 Data Sources and Getting Data into R

Variable assignment

Note: It is recommended to use <- rather than = for assignment. The examples below use =, which also works at the prompt.

> x = 1

Assign x a vector of integers from 1 to 10

> x = 1:10

Assign x a vector of integers from 10 to 1

> x = 10:1

To work on a different scale, multiply the sequence by a constant. This gives 0.01, 0.02, ..., 0.10

> x = 1:10 * 0.01

Use the seq function to generate a vector of numbers from 0 through 100 in steps of 10

> x = seq(0, 100, by = 10)

Create a vector from an arbitrary sequence of numbers

> x = c(1, 5, 8, 5, 10)
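The assignments above can be checked with a short sketch. It uses <- as the note recommends, and the variable names (ints, down, scaled, steps, picks) are illustrative rather than taken from the original, which reuses x throughout.

```r
# Illustrative names; the original reuses x for every example.
ints   <- 1:10                  # integer vector 1, 2, ..., 10
down   <- 10:1                  # integer vector 10, 9, ..., 1
scaled <- 1:10 * 0.01           # ':' binds tighter than '*', so this is (1:10) * 0.01
steps  <- seq(0, 100, by = 10)  # 0, 10, ..., 100
picks  <- c(1, 5, 8, 5, 10)     # arbitrary values combined with c()

class(ints)     # "integer"
class(steps)    # "numeric" (doubles, even though the values are whole numbers)
length(steps)   # 11
range(scaled)   # 0.01 0.10
```

Checking class() and length() after creating a vector is a quick way to confirm that a sequence has the type and size you expect.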

Prompt the user to enter numbers at the console (scan keeps reading until a blank line is entered)

> x = scan()
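Because scan() with no arguments waits for console input, it is awkward to use in a script. As a minimal sketch, the text argument of scan can supply the same input as a string; the values here are made up for illustration.

```r
# The text argument stands in for what a user would type at the prompt.
# quiet = TRUE suppresses the "Read N items" message.
x <- scan(text = "3 7 11", quiet = TRUE)

x         # 3 7 11
class(x)  # "numeric", the default type scan expects
sum(x)    # 21
```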

Reading from local files

Packages are available for reading data from Excel spreadsheets and similar formats. However, whenever possible, export your data to .csv format and then use the read.csv function.

> credit = read.csv("data/credit.csv")

See also the related functions read.table and read.csv2. read.csv2 expects a semicolon as the field separator and a comma as the decimal mark.
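The following sketch shows read.csv and read.table reading the same file. It writes a small made-up data frame to a temporary file first, so it does not depend on data/credit.csv being present; the column names and values are hypothetical.

```r
# Hypothetical sample data written to a temporary file.
path <- tempfile(fileext = ".csv")
sample_df <- data.frame(id = 1:3,
                        amount = c(10.5, 20, 7.25),
                        status = c("good", "bad", "good"))
write.csv(sample_df, path, row.names = FALSE)

# read.csv assumes a header row and a comma separator by default.
from_csv <- read.csv(path)

# read.table needs both stated explicitly to read the same file.
from_table <- read.table(path, header = TRUE, sep = ",")

str(from_csv)
unlink(path)  # remove the temporary file
```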

Reading from web

Option 1: Reading well-formatted data

> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"

> housing = read.csv(url, header = FALSE, sep = "")

Option 2: More granular control of the data (getURL comes from the RCurl package, which must be installed and loaded first)

> library(RCurl)

> raw = getURL("https://raw.githubusercontent.com/einext/data/master/Olympics2016.csv")

> olympics = read.csv(textConnection(raw), header=TRUE)
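The second step of Option 2, parsing CSV text that is already held in a string, can be tried without a network connection. In this sketch the string raw_text is made-up data standing in for a downloaded file, and the column names are hypothetical.

```r
# Hypothetical CSV text standing in for the string returned by a download.
raw_text <- "country,gold,silver\nA,3,1\nB,0,2\n"

# Approach A: wrap the string in a text connection, as in Option 2.
con <- textConnection(raw_text)
medals_a <- read.csv(con, header = TRUE)
close(con)  # the connection was opened by textConnection, so close it here

# Approach B: pass the string through the text argument instead.
medals_b <- read.csv(text = raw_text, header = TRUE)

medals_a
```

Both approaches parse the same text; the text argument avoids having to manage the connection yourself.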

Built-in Datasets

Descriptions of the built-in datasets in R are available at the following link

https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

In R, you can list them with

> data()

Load a dataset into the R session

> data("iris")

View help on a dataset

> ?iris

Remove all objects from the R session (this clears the whole workspace, not only the loaded datasets)

> rm(list = ls())
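Clearing the whole workspace is not always what you want. As a minimal sketch, rm can also remove selected objects; the object names below are illustrative only.

```r
# Illustrative scratch objects.
x <- 1:10
y <- seq(0, 1, by = 0.5)
label <- "scratch"

rm(x)                        # remove one object by name
rm(list = c("y", "label"))   # remove several objects named in a character vector

c("x", "y", "label") %in% ls()   # FALSE FALSE FALSE

# Built-in datasets live in the datasets package, so they stay
# reachable even after the workspace has been cleared.
nrow(datasets::iris)             # 150
```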

External Data Sources

https://archive.ics.uci.edu/ml/
Here is a nice index of data (updated 2/12/2016)

Other data formats and sources

To load JSON data, you can use the rjson package
To load data from an RDBMS, you can use the RODBC package
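As a rough sketch of the JSON case, the fromJSON function in rjson parses a JSON string into an R list. This assumes rjson has been installed with install.packages("rjson"); the JSON content is made up for illustration, and the code is skipped if the package is missing.

```r
# Assumes the rjson package is installed; skipped otherwise.
if (requireNamespace("rjson", quietly = TRUE)) {
  # Hypothetical JSON text for illustration.
  json_text <- '{"dataset": "iris", "rows": 150, "columns": ["Sepal.Length", "Species"]}'

  parsed <- rjson::fromJSON(json_text)

  print(parsed$dataset)          # "iris"
  print(parsed$rows)             # 150
  print(length(parsed$columns))  # 2
}
```

JSON objects become named lists, so fields are reached with $ just like columns of a data frame.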

Common Data Exploration Tasks

Once you have data in the R session, a few common exploration steps are listed below.

View the number of records, number of columns, column types, and sample values from each column

> str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Summarize each column. For a numeric column, summary shows the minimum, maximum, quartiles, and mean; for a categorical column, it shows the frequency of each level.

> summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Find the summary of a single numeric column

> summary(iris$Petal.Length)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.600   4.350   3.758   5.100   6.900
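A summary of the whole column can hide differences between groups. As one possible follow-up, not part of the original steps, tapply computes a statistic of a numeric column within each level of a categorical column.

```r
# Mean petal length within each species.
group_means <- tapply(iris$Petal.Length, iris$Species, mean)
group_means
#     setosa versicolor  virginica
#      1.462      4.260      5.552
```

The three group means differ widely, which is why the overall median (4.350) and mean (3.758) of Petal.Length sit far apart.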

Find unique values of a categorical column

> unique(iris$Species)

[1] setosa versicolor virginica

Levels: setosa versicolor virginica

Find the frequency of each value in a categorical column

> table(iris$Species)

    setosa versicolor  virginica
        50         50         50

Find the proportion of each categorical value

> prop.table(table(iris$Species))

    setosa versicolor  virginica
 0.3333333  0.3333333  0.3333333
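Proportions are often easier to read as percentages. A small sketch, where the names counts, props, and pct are illustrative:

```r
counts <- table(iris$Species)   # frequency of each level
props  <- prop.table(counts)    # counts divided by their total
pct    <- round(100 * props, 1) # percentages rounded to one decimal place

pct
#     setosa versicolor  virginica
#       33.3       33.3       33.3

sum(props)  # 1
```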

Find the number of records

> nrow(iris)

[1] 150

Find the number of columns

> ncol(iris)

[1] 5
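A few more base R calls round out these checks. This is a sketch of commonly used companions to nrow and ncol, not part of the original list.

```r
dim(iris)               # 150 5 (rows first, then columns)
sapply(iris, class)     # class of each column, as a named character vector
head(iris, 3)           # first three rows
colSums(is.na(iris))    # missing values per column; iris has none
```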