Datasources
Datasets for machine learning research https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
http://www.datasciencecentral.com/profiles/blogs/a-plethora-of-data-set-repositories
Wikipedia data dump https://dumps.wikimedia.org/enwiki/
StackExchange data dump https://archive.org/details/stackexchange
US Supreme Court Database http://scdb.wustl.edu/data.php
US postal codes https://www.unitedstateszipcodes.org
Weather data https://www.ncdc.noaa.gov/data-access
Million Song Dataset: http://labrosa.ee.columbia.edu/millionsong/
Transportation Dataset http://transtats.bts.gov/DL_SelectFields.asp
Catalog of 33 data sources http://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/#43139bf16796
Baseball game data: http://www.retrosheet.org
Del.icio.us Dataset http://www.din.uem.br/~gsii/delicious-dataset/
Project Challenges http://www.datasciencecentral.com/group/dsa-projects/forum/topics/data-science-projects-for-dsa-candidates
https://data.world/
https://blog.bigml.com/2013/02/28/data-data-data-thousands-of-public-data-sources/
YouTube dataset: http://netsg.cs.sfu.ca/youtubedata/
Company profiles
https://data.crunchbase.com/docs
Microfinance Lending Data
http://build.kiva.org/docs/data/basic_types
World Bank - International Debt Statistics
http://data.worldbank.org/data-catalog/international-debt-statistics
AWS public datasets https://aws.amazon.com/datasets/
S3 bucket for public datasets: s3://aws-publicdatasets
Amazon product review data
http://jmcauley.ucsd.edu/data/amazon/
USDA Food database https://ndb.nal.usda.gov/ndb/search/list
US Federal Election Commission Data
http://www.fec.gov/disclosure.shtml
Data published by opendatasoft.com
https://data.opendatasoft.com/explore/?sort=modified
NYC Open Data
https://data.cityofnewyork.us/browse
World Bank Data Catalog
http://datacatalog.worldbank.org/
Open Data Catalog
http://dataportals.org/
US Geological Survey Science Data Catalog
https://data.usgs.gov/datacatalog
Geolocation and IP mapping http://dev.maxmind.com/geoip/geoip2/geolite2/
List of cities
http://openweathermap.org/help/city_list.txt
Graph Data
ICON is a comprehensive index of research-quality network data sets from all domains of network science, including social, web, information, biological, ecological, connectome, transportation, and technological networks.
Each network record in the index is annotated with and searchable or browsable by its graph properties, description, size, etc., and many records include links to multiple networks. The contents of ICON are curated by volunteer experts from Prof. Aaron Clauset's research group at the University of Colorado Boulder.
https://icon.colorado.edu
KONECT is a comprehensive archive that provides not only the data (dozens of networks), but also summary statistics about each dataset.
http://konect.uni-koblenz.de/networks/
Mark Newman's network data collection http://www-personal.umich.edu/~mejn/netdata/
Stanford Large Network Dataset Collection (SNAP) http://snap.stanford.edu/data/
Social Network Data
http://socialnetworks.mpi-sws.org/data-imc2007.html
MEDLINE database: a bibliographic database of academic papers published in journals covering the life sciences and medicine
ftp://ftp.nlm.nih.gov/nlmdata/sample/medline
Web Data Commons Hyperlink Graph: with 3.5 billion nodes (web pages) and 128 billion edges (hyperlinks), its publishers describe it as the largest freely available real-world graph dataset they know of.
http://webdatacommons.org/hyperlinkgraph/index.html
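Several of the graph collections above publish summary statistics or graph properties alongside each network. As a rough illustration of what such statistics look like, here is a minimal Python sketch that computes a few of them from an undirected edge list. The function name `summarize_edges` and the toy edge list are made up for this example and do not come from any of the listed sites. The last lines apply the same arithmetic to the 3.5 billion node, 128 billion edge figures quoted above, assuming (hypothetically) that each edge is stored as two 8-byte integer ids.

```python
from collections import defaultdict


def summarize_edges(edges):
    """Return basic statistics for an undirected edge list of (u, v) pairs."""
    degree = defaultdict(int)
    edge_count = 0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        edge_count += 1
    node_count = len(degree)
    return {
        "nodes": node_count,
        "edges": edge_count,
        # Each undirected edge contributes to the degree of two nodes.
        "mean_degree": 2 * edge_count / node_count if node_count else 0.0,
        "max_degree": max(degree.values(), default=0),
    }


# Toy edge list, invented for illustration only.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
stats = summarize_edges(toy_edges)
print(stats)

# Back-of-the-envelope figures for the hyperlink graph quoted above.
nodes = 3_500_000_000
edges = 128_000_000_000
links_per_page = edges / nodes    # average out-links per page, about 36.6
edge_list_bytes = edges * 2 * 8   # assumption: two 8-byte ids per edge
print(f"{links_per_page:.1f} links per page, "
      f"{edge_list_bytes / 1e12:.2f} TB as a raw edge list")
```

Under that storage assumption the raw edge list alone is roughly 2 TB, so it is worth checking a dataset's published size and format before downloading it.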
Case Studies on Benefits of Open Data
Business case for open data https://project-open-data.cio.gov/business-case/
https://socrata.com/case-studies/
https://www.opendatasoft.com/resources/#casestudies
English Dictionary Database
https://wordnet.princeton.edu/wordnet/download/
Awesome Public Datasets
https://github.com/awesomedata/awesome-public-datasets
Airbnb
http://insideairbnb.com/get-the-data.html
CTR (click-through rate) prediction
Criteo: https://www.kaggle.com/c/criteo-display-ad-challenge
Avazu: https://www.kaggle.com/c/avazu-ctr-prediction
Outbrain: https://www.kaggle.com/c/outbrain-click-prediction
RecSys 2015: http://dl.acm.org/citation.cfm?id=2813511&dl=ACM&coll=DL&CFID=941880276&CFTOKEN=60022934
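For readers new to the CTR datasets above: click-through rate is clicks divided by impressions, and predicted click probabilities are commonly scored with log loss. The sketch below shows both in Python. The label list is invented for illustration, and the function names `click_through_rate` and `log_loss` are this example's own, not taken from any of the competitions listed.

```python
import math


def click_through_rate(clicks, impressions):
    """CTR = clicks / impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


# Hypothetical impression log: 1 = clicked, 0 = not clicked.
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
ctr = click_through_rate(sum(labels), len(labels))
print(f"CTR: {ctr:.1%}")

# Baseline: predict the overall CTR for every impression.
baseline = log_loss(labels, [ctr] * len(labels))
print(f"baseline log loss: {baseline:.4f}")
```

Predicting the overall CTR for every impression gives a simple baseline that any trained model should beat.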