02 Data Sources and Getting Data into R
Variable assignment
Note: It is recommended to use <- for assignment rather than the = sign.
> x = 1
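A minimal sketch of the note above: both operators assign at the top level, but inside a function call `=` names an argument instead of creating a variable, which is one reason `<-` is usually preferred. The variable names here are only illustrative.

```r
# Both forms create a variable at the top level
x <- 1
y = 1
identical(x, y)   # TRUE

# Inside a call, `=` names an argument; `<-` assigns and then passes the value
m <- mean(x = c(1, 2, 3))    # x is mean()'s argument here; the workspace x is untouched
n <- mean(z <- c(1, 2, 3))   # creates z in the workspace, then averages it
```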
Assign x a vector of integers from 1 to 10
> x = 1:10
Assign x a vector of integers from 10 to 1
> x = 10:1
If you are interested in a different scale, multiply the sequence by a constant
> x = 1:10 * 0.01
Use the seq function to generate a vector of numbers from 0 through 100 at intervals of 10
> x = seq(0, 100, by = 10)
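As an illustrative sketch, `seq` can also be driven by a target length rather than a step size, and `typeof` shows that `:` yields integers while `seq(0, 100, by = 10)` yields doubles. The variable names are arbitrary.

```r
a <- 1:10
typeof(a)                        # "integer"

b <- seq(0, 100, by = 10)
typeof(b)                        # "double"
length(b)                        # 11 values: 0, 10, ..., 100

# Ask for a fixed number of evenly spaced values instead of a step size
d <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
```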
Create a vector from an arbitrary sequence of numbers
> x = c(1, 5, 8, 5, 10)
Prompt the user to enter numbers (scan reads values from the console until a blank line is entered)
> x = scan()
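Because `scan()` with no arguments waits for console input, it is awkward to try in a script. As a hedged sketch, the `text` argument feeds the same parser from a string, so the behavior can be checked non-interactively. The input strings are made up for illustration.

```r
# Parse whitespace-separated numbers from a string instead of the console
x <- scan(text = "3 5 8 13", quiet = TRUE)
x        # 3 5 8 13
sum(x)   # 29

# Comma-separated input needs an explicit separator
y <- scan(text = "1.5,2.5,4", sep = ",", quiet = TRUE)
```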
Reading from local files
Packages are available for reading data from Excel spreadsheets and other formats. However, whenever possible, export your data to .csv format and then use the read.csv function
> sfpd = read.csv("data/credit.csv")
Take a look at the similar functions read.table and read.csv2.
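To try `read.csv` without an existing file, the sketch below writes a small made-up data frame to a temporary file and reads it back. The column names and values are invented for illustration; `read.table` is shown reading the same file once the header and separator are spelled out.

```r
# Write a tiny, made-up data frame to a temporary .csv file
df <- data.frame(id = 1:3, score = c(9.5, 7.25, 8), grade = c("A", "C", "B"))
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

# read.csv assumes a header row and "," as the separator
back <- read.csv(path)
str(back)

# read.table needs the header and separator stated explicitly
back2 <- read.table(path, header = TRUE, sep = ",")

unlink(path)   # remove the temporary file
```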
Reading from web
Option 1: Reading well-formatted data
> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
> housing = read.csv(url, header = FALSE, sep = "")
Option 2: More granular control of the data (getURL comes from the RCurl package, which must be loaded first)
> library(RCurl)
> raw = getURL("https://raw.githubusercontent.com/einext/data/master/Olympics2016.csv")
> olympics = read.csv(textConnection(raw), header=TRUE)
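Once the raw text is held in a character string, the parsing step of Option 2 does not depend on how the text was fetched. The sketch below uses a small made-up string in place of a real download (the column names and values are invented) and shows `read.csv(text = ...)` as an alternative to wrapping the string in `textConnection`.

```r
# Stand-in for text returned by a download; the contents are hypothetical
raw <- "country,gold,silver\nA,3,1\nB,0,2\n"

via_text <- read.csv(text = raw, header = TRUE)
via_conn <- read.csv(textConnection(raw), header = TRUE)

all.equal(via_text, via_conn)   # compare the two routes
nrow(via_text)                  # 2
```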
Built-in Datasets
A description of the built-in datasets in R is available at the following link
https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
In R, you can get the list with
> data()
Load a dataset into the R session
> data("iris")
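After loading, a quick look at the first rows confirms that the dataset is available. This is a minimal sketch; it also shows how the listing from `data()` can be narrowed to a single package, here the `datasets` package that ships with R.

```r
data("iris")      # iris ships with the datasets package
head(iris, 3)     # first three rows

# List only the datasets that come with a given package
listing <- data(package = "datasets")
head(listing$results[, "Item"])
```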
View help on a dataset
> ?iris
Remove all objects from the R session
> rm(list = ls())
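`rm(list = ls())` clears every object in the workspace, which is often more than intended. A hedged sketch of a narrower cleanup, using made-up object names:

```r
tmp_a <- 1:3
tmp_b <- "scratch"
keep_me <- 42

# Remove a single named object
rm(tmp_a)

# Remove every object whose name matches a pattern
rm(list = ls(pattern = "^tmp_"))

exists("tmp_b")     # FALSE
exists("keep_me")   # TRUE
```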
External Data Sources
Here is a nice index of data, updated 2/12/2016
Other data format and sources
To load JSON data, you can use the rjson package.
To load data from an RDBMS, you can use the RODBC package.
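A minimal sketch of parsing JSON with `rjson`, assuming the package has been installed (for example with `install.packages("rjson")`). The JSON string is invented for illustration, and the guard simply skips the example when the package is missing.

```r
# Hypothetical JSON document held in a string
json_text <- '{"name": "iris", "rows": 150, "columns": ["Sepal.Length", "Species"]}'

parsed <- NULL
if (requireNamespace("rjson", quietly = TRUE)) {
  parsed <- rjson::fromJSON(json_text)   # returns nested R lists and vectors
  print(parsed$name)                     # "iris"
  print(parsed$rows)                     # 150
  print(parsed$columns)                  # the two column names
}
```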
Common Data Exploration Tasks
Once you have data in an R session, a few common exploration steps are shown below
View the number of records, number of columns, column types, and sample values from each column
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
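The column types that `str` prints can also be collected programmatically, which helps when the next step depends on the type. A small sketch using base R (the variable names are arbitrary):

```r
# Class of every column, as a named character vector
col_types <- sapply(iris, class)
col_types

# Names of the numeric columns only
numeric_cols <- names(iris)[sapply(iris, is.numeric)]
numeric_cols   # the four measurement columns

# First few rows, as a complement to str()
head(iris, 3)
```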
Summary of each column. For numeric columns, it shows the min, max, quartiles, etc.; for categorical columns, it shows the frequency of each level.
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Find summary of a numeric column
> summary(iris$Petal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.600   4.350   3.758   5.100   6.900
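The individual figures reported by `summary` can also be computed one at a time, which is handy when only a single value is needed. A sketch using base R functions; `pl` is just a shorthand name:

```r
pl <- iris$Petal.Length

mean(pl)                      # about 3.758
median(pl)                    # 4.35
range(pl)                     # 1.0 6.9
quantile(pl, c(0.25, 0.75))   # 1.6 and 5.1
sd(pl)                        # standard deviation, which summary() does not show
```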
Find unique values of a categorical column
> unique(iris$Species)
[1] setosa versicolor virginica
Levels: setosa versicolor virginica
Find the frequency of each value in a categorical column
> table(iris$Species)
    setosa versicolor  virginica
        50         50         50
Find the proportion of each value in a categorical column
> prop.table(table(iris$Species))
    setosa versicolor  virginica
 0.3333333  0.3333333  0.3333333
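Proportions are often easier to read as rounded percentages. A minimal sketch building on `table` and `prop.table`; the variable names are arbitrary:

```r
counts <- table(iris$Species)

# Convert proportions to percentages with one decimal place
pct <- round(100 * prop.table(counts), 1)
pct                                 # 33.3 for each species

sum(prop.table(counts))             # proportions add up to 1

# Level with the highest count (the first one when counts are tied)
names(counts)[which.max(counts)]    # "setosa"
```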
Find the number of records
> nrow(iris)
[1] 150
Find the number of columns
> ncol(iris)
[1] 5
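Both counts are available in a single call through `dim`, which returns the number of rows followed by the number of columns. A short sketch:

```r
shape <- dim(iris)
shape           # 150 5

shape[1]        # rows, same as nrow(iris)
shape[2]        # columns, same as ncol(iris)

length(iris)    # for a data frame, length() also gives the number of columns
```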