02 Data Sources and Getting Data into R

Variable assignment

Note: It is recommended to use <- rather than = for assignment.

> x = 1
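The note above can be illustrated with a small sketch. The names `y`, `m1`, `m2`, `x_after_eq` and `x_after_arrow` are illustrative only and are not part of the original notes. At the top level both operators assign, but inside a function call `=` names an argument while `<-` still performs an assignment.

```r
x <- 1                        # preferred assignment operator
y = 2                         # `=` also assigns at the top level

# Inside a call, `=` only names the argument; x in the workspace is untouched
m1 <- mean(x = c(4, 5, 6))
x_after_eq <- x               # still 1

# Inside a call, `<-` assigns x as a side effect and passes the value on
m2 <- mean(x <- c(4, 5, 6))
x_after_arrow <- x            # now the vector 4, 5, 6
```

Keeping `<-` for assignment and `=` for arguments makes this distinction visible when reading code.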

Assign x a vector of integers from 1 to 10

> x = 1:10

Assign x a vector of integers from 10 to 1

> x = 10:1

To work on a different scale, multiply the sequence by a constant (here 0.01, 0.02, ..., 0.10)

> x = 1:10 * 0.01

Use the seq function to generate a vector of numbers from 0 through 100 at intervals of 10

> x = seq(0, 100, by = 10)
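A few related ways of building sequences, as a minimal sketch. The variable names `a`, `b` and `n10` are illustrative. Note that `1:10` produces an integer vector, while `seq` with numeric arguments and `by` produces a double ("numeric") vector.

```r
a   <- seq(0, 100, by = 10)        # step of 10 from 0 to 100, 11 values
b   <- seq(0, 1, length.out = 5)   # 5 evenly spaced values between 0 and 1
n10 <- seq_len(10)                 # same values as 1:10

class(1:10)                        # "integer"
class(a)                           # "numeric"
```

`seq_len(n)` is often preferred over `1:n` in loops because it returns an empty vector when `n` is 0, whereas `1:0` counts down.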

Create a vector from an arbitrary sequence of numbers

> x = c(1, 5, 8, 5, 10)

Prompt the user to enter numbers (scan reads values from the console until a blank line is entered)

> x = scan()
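Because `scan()` with no arguments waits for console input, it is awkward in scripts. The sketch below shows two non-interactive uses of the same function; the input values and the temporary file are made up for illustration.

```r
# Supply the input directly through the `text` argument
x <- scan(text = "3 1 4 1 5", quiet = TRUE)

# Read whitespace-separated numbers from a file (a scratch file created here)
tmp <- tempfile(fileext = ".txt")
writeLines("10 20 30", tmp)
y <- scan(tmp, quiet = TRUE)
```

By default `scan` expects numeric values; the `what` argument changes the expected type, for example `what = character()`.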

Reading from local files

There are packages available to read data from Excel spreadsheets and similar formats. However, whenever possible, export your data to .csv format and then use the read.csv function

> credit = read.csv("data/credit.csv")

Take a look at the other similar functions: read.table, and read.csv2.
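The relationship between these functions can be sketched with a round trip through temporary files. The data frame `df` and the file names are made up for illustration. `read.csv` expects comma separators and `.` as the decimal mark, `read.csv2` expects `;` separators and `,` as the decimal mark, and `read.table` is the general function both are built on.

```r
# A small made-up data frame written out in both conventions
df <- data.frame(id = 1:3, amount = c(10.5, 20.25, 30))
f_comma <- tempfile(fileext = ".csv")
f_semi  <- tempfile(fileext = ".csv")
write.csv(df,  f_comma, row.names = FALSE)   # "," separator, "." decimal
write.csv2(df, f_semi,  row.names = FALSE)   # ";" separator, "," decimal

d1 <- read.csv(f_comma)
d2 <- read.csv2(f_semi)
d3 <- read.table(f_comma, header = TRUE, sep = ",")  # explicit equivalent of read.csv
```

All three calls should return the same data frame; picking the function that matches the file's convention avoids numbers being read as text.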

Reading from web

Option 1: Reading well-formatted data

> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"

> housing = read.csv(url, header = FALSE, sep = "")
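In the call above, `sep = ""` tells R to treat any run of whitespace as the delimiter, which is also the default of `read.table`. The sketch below shows this on two made-up rows passed through the `text` argument; the values and the column names `a`, `b`, `c` are hypothetical and are not taken from the housing file.

```r
# Two made-up rows with uneven spacing, standing in for a whitespace-delimited file
txt <- c("0.5   18.0  2.31",
         " 1.25  0.0   7.07")

d_table <- read.table(text = txt, header = FALSE)            # sep = "" by default
d_csv   <- read.csv(text = txt, header = FALSE, sep = "")    # same idea via read.csv

# Without a header row the columns are named V1, V2, ...; rename as needed
names(d_table) <- c("a", "b", "c")
```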

Option 2: More granular control of the data (getURL comes from the RCurl package, which must be installed and loaded first)

> library(RCurl)

> raw = getURL("https://raw.githubusercontent.com/einext/data/master/Olympics2016.csv")

> olympics = read.csv(textConnection(raw), header=TRUE)
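Once the file contents are held as a string, the usual `read.csv` arguments give finer control over parsing. The sketch below uses a made-up two-row string in place of downloaded text; the column names `name` and `score` are hypothetical. It passes the string through the `text` argument, an alternative to `textConnection` that avoids managing a connection object.

```r
# Made-up CSV text standing in for the downloaded content
raw_text <- "name,score\nalpha,1\nbeta,\n"

d <- read.csv(text = raw_text,
              header = TRUE,
              colClasses = c("character", "numeric"))  # force column types

# The empty score field in the second row is read as NA
is.na(d$score)
```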

Built-in Datasets

Descriptions of the datasets built into R are available at the following link

https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

In R, you can get the list by

> data()

Load a dataset into the R session

> data("iris")

View help on a dataset

> ?iris
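A short sketch of working with the built-in datasets programmatically. The variable name `available` is illustrative. `data(package = "datasets")` returns an object whose `results` matrix has an `Item` column with the dataset names.

```r
data("iris")                       # make iris available in the session
head(iris, 3)                      # peek at the first three rows

# Names of the datasets shipped in the 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
"iris" %in% available
```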

Remove all objects from the R session

> rm(list = ls())
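Clearing everything is often more than is needed. A minimal sketch of removing selected objects instead; `a`, `b` and `keep_me` are hypothetical objects created only for this example.

```r
a <- 1
b <- 2
keep_me <- 3

rm(a)                 # remove a single object by name
rm(list = c("b"))     # remove objects named in a character vector

ls()                  # keep_me is still listed; a and b are gone
```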

External Data Sources

Other data formats and sources

  • To load JSON data, you can use the rjson package

  • To load data from an RDBMS, you can use the RODBC package
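As a sketch of the JSON case, the code below parses a made-up JSON string with `fromJSON` from rjson. It assumes the package may not be installed, so the call is guarded; the JSON content and the name `parsed` are illustrative only. An RODBC example is left out because it depends on a database connection specific to your environment.

```r
json_text <- '{"name": "example", "values": [1, 2, 3]}'   # made-up JSON

if (requireNamespace("rjson", quietly = TRUE)) {
  parsed <- rjson::fromJSON(json_text)   # JSON object becomes a named list
  str(parsed)
} else {
  message("Package 'rjson' is not installed; install.packages(\"rjson\") adds it.")
}
```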

Common Data Exploration Tasks

Once you have data in the R session, a few common exploration steps are shown below

View the number of records, number of columns, column types and sample values from each column

> str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
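When only the column types are needed, for example to select numeric columns in later code, they can be pulled out directly. A minimal sketch; the names `col_types` and `numeric_cols` are illustrative.

```r
data("iris")

col_types <- sapply(iris, class)              # named vector: column name -> class
numeric_cols <- names(iris)[sapply(iris, is.numeric)]

col_types
numeric_cols
```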

Summarize each column. For a numeric column, it shows the minimum, maximum, quartiles and mean; for a categorical column, it shows the frequency of each level.

> summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Find the summary of a single numeric column

> summary(iris$Petal.Length)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.600   4.350   3.758   5.100   6.900
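The individual figures reported by `summary` can also be computed one at a time, which is useful when a single value is needed in later code. A minimal sketch; the variable names are illustrative.

```r
pl <- iris$Petal.Length

pl_mean   <- mean(pl)
pl_median <- median(pl)
pl_range  <- range(pl)                      # minimum and maximum
pl_q      <- quantile(pl, c(0.25, 0.75))    # first and third quartiles
pl_sd     <- sd(pl)                         # standard deviation, not shown by summary
```

If a column contains missing values, these functions return NA unless `na.rm = TRUE` is passed.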

Find the unique values of a categorical column

> unique(iris$Species)

[1] setosa versicolor virginica

Levels: setosa versicolor virginica

Find the frequency of each value in a categorical column

> table(iris$Species)

    setosa versicolor  virginica
        50         50         50

Find the proportion of each categorical value

> prop.table(table(iris$Species))

    setosa versicolor  virginica
 0.3333333  0.3333333  0.3333333
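Frequency tables are often easier to read as sorted counts or rounded percentages. A minimal sketch building on the calls above; the names `counts`, `sorted_counts` and `pct` are illustrative.

```r
counts <- table(iris$Species)

sorted_counts <- sort(counts, decreasing = TRUE)   # most frequent level first
pct <- round(100 * prop.table(counts), 1)          # percentages with one decimal

sorted_counts
pct
```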

Find the number of records

> nrow(iris)

[1] 150

Find the number of columns

> ncol(iris)

[1] 5
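Both counts can be obtained in one call with `dim`, which returns the number of rows followed by the number of columns. A minimal sketch; the names `d` and `cols` are illustrative.

```r
d <- dim(iris)        # c(rows, columns)
cols <- names(iris)   # column names, one per column

d
cols
```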