Data Sources

http://www.datasciencecentral.com/profiles/blogs/a-plethora-of-data-set-repositories

Wikipedia data dump https://dumps.wikimedia.org/enwiki/

StackExchange data dump https://archive.org/details/stackexchange

US Supreme Court http://scdb.wustl.edu/data.php

US postal codes https://www.unitedstateszipcodes.org

Weather data https://www.ncdc.noaa.gov/data-access

Million Song Dataset: http://labrosa.ee.columbia.edu/millionsong/

Transportation Dataset http://transtats.bts.gov/DL_SelectFields.asp

Catalog of 33 data sources http://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/#43139bf16796

Baseball game data: http://www.retrosheet.org

Del.icio.us Dataset http://www.din.uem.br/~gsii/delicious-dataset/

Project Challenges http://www.datasciencecentral.com/group/dsa-projects/forum/topics/data-science-projects-for-dsa-candidates

https://data.world/

https://blog.bigml.com/2013/02/28/data-data-data-thousands-of-public-data-sources/

YouTube dataset: http://netsg.cs.sfu.ca/youtubedata/

Datasets for machine learning

https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research

Company profiles

https://data.crunchbase.com/docs

Microfinance Lending Data

http://build.kiva.org/docs/data/basic_types

World Bank - International Debt Statistics

http://data.worldbank.org/data-catalog/international-debt-statistics

AWS public datasets https://aws.amazon.com/datasets/

S3 bucket for public datasets: s3://aws-publicdatasets

Amazon product review data

http://jmcauley.ucsd.edu/data/amazon/

USDA Food database https://ndb.nal.usda.gov/ndb/search/list

US Federal Election Commission Data

http://www.fec.gov/disclosure.shtml

Data published by opendatasoft.com

https://data.opendatasoft.com/explore/?sort=modified

NYC Open Data

https://data.cityofnewyork.us/browse

World Bank

http://datacatalog.worldbank.org/

Open Data Catalog

http://dataportals.org/

US Geological Survey Science Data Catalog

https://data.usgs.gov/datacatalog

Geolocation and IP mapping http://dev.maxmind.com/geoip/geoip2/geolite2/

List of cities

http://openweathermap.org/help/city_list.txt

Graph Data

ICON is a comprehensive index of research-quality network data sets from all domains of network science, including social, web, information, biological, ecological, connectome, transportation, and technological networks.

Each network record in the index is annotated with and searchable or browsable by its graph properties, description, size, etc., and many records include links to multiple networks. The contents of ICON are curated by volunteer experts from Prof. Aaron Clauset's research group at the University of Colorado Boulder.

https://icon.colorado.edu

KONECT is a comprehensive archive that provides not only the data (dozens of networks), but also summary statistics about each dataset.

http://konect.uni-koblenz.de/networks/

http://www-personal.umich.edu/~mejn/netdata/

http://snap.stanford.edu/data/

Social Network Data

http://socialnetworks.mpi-sws.org/data-imc2007.html

MEDLINE Database: a database of academic papers published in journals covering the life sciences and medicine

ftp://ftp.nlm.nih.gov/nlmdata/sample/medline

Web Data Commons Hyperlink Graph: with 3.5 billion nodes and 128 billion edges, this is the largest known freely available real-world graph dataset.

http://webdatacommons.org/hyperlinkgraph/index.html
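
To get a feel for the scale of the hyperlink graph listed above, the following sketch works through the arithmetic implied by the quoted node and edge counts. The 4-byte node ID and the flat edge-list layout are assumptions made for illustration, not the format in which the dataset is actually distributed.

```python
# Node and edge counts as quoted in the entry above.
NODES = 3_500_000_000
EDGES = 128_000_000_000


def average_out_degree(nodes: int, edges: int) -> float:
    """Mean number of outgoing edges per node in a directed graph."""
    return edges / nodes


def edge_list_bytes(edges: int, bytes_per_id: int = 4) -> int:
    """Size of a raw edge list that stores two fixed-width node IDs per edge.

    The flat (source, target) layout and the 4-byte default are assumptions.
    """
    return edges * 2 * bytes_per_id


# 3.5 billion is below 2**32, so an unsigned 32-bit ID can address every node.
FITS_IN_UINT32 = NODES < 2**32

if __name__ == "__main__":
    print(f"IDs fit in 32 bits: {FITS_IN_UINT32}")
    print(f"average out-degree: {average_out_degree(NODES, EDGES):.1f}")
    print(f"raw edge list: {edge_list_bytes(EDGES) / 1e12:.3f} TB")
```

Under these assumptions the uncompressed edge list alone is on the order of a terabyte, so a single-machine, in-memory approach is unlikely to work without compression or out-of-core processing.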

Case Studies on Benefits of Open Data

  • Business case for open data https://project-open-data.cio.gov/business-case/
  • https://socrata.com/case-studies/
  • https://www.opendatasoft.com/resources/#casestudies

English Dictionary Database

https://wordnet.princeton.edu/wordnet/download/

Awesome Public Datasets

https://github.com/awesomedata/awesome-public-datasets

Airbnb

http://insideairbnb.com/get-the-data.html

CTR (click-through rate) prediction

  • Criteo: https://www.kaggle.com/c/criteo-display-ad-challenge
  • Avazu: https://www.kaggle.com/c/avazu-ctr-prediction
  • Outbrain: https://www.kaggle.com/c/outbrain-click-prediction
  • RecSys 2015: http://dl.acm.org/citation.cfm?id=2813511&dl=ACM&coll=DL&CFID=941880276&CFTOKEN=60022934