8 Useful Databases to Dig for Data (and 100 more)
You already know that data is the bread and butter of reports and presentations. Data makes your presentation solid. It backs up the ideas you are selling. It gives people reasons to listen to you.
However, data digging is a struggle. It’s a struggle to look for reputable and legitimate sources, especially in this digital age.
To make your life easier, we’ve put together a list of useful databases that you can use to find the data you seek. To begin with, here are eight useful databases for you to dig for data.
1. Freebase
Freebase is an open platform for data sharing. It covers a wide range of topics, from fictional characters to Modest Mouse. You can even curate your data with its data plotting feature, which lets you plot your datasets on a timeline or a map.
2. UN Data
This database contains large datasets, consisting of virtually all the public data collected by the United Nations. To access the API you have to sign up (it will only take a couple of minutes).
3. WorldBank
Where else would you look for the world’s financial data but the WorldBank? You can get virtually any country’s financial and economic standing here. Some other topics included are:
- Agriculture & Rural Development
- Aid Effectiveness
- Economic Policy and External Debt
- Education
- Energy & Mining
- Environment
- Financial Sector
- Health
- Infrastructure
- Labor & Social Protection
- Poverty
- Private Sector
- Public Sector
- Science & Technology
- Social Development
- Urban Development
4. Data.gov
Data.gov is leading the way in democratizing public sector data and driving innovation. This movement has spread throughout cities, states, and countries. Here are five of its 50+ categories:
- Agriculture
- Arts, Recreation, and Travel
- Banking, Finance, and Insurance
- Births, Deaths, Marriages, and Divorces
- Business
5. Infochimps
Infochimps contains paid and free datasets on just about anything. What’s cool about Infochimps is that you can download datasets in CSV format. What’s more, you can fiddle with the API to extract the data specific to your needs. Try Twitter as your search metric and you will see what I mean.
6. Google Public Data
The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate.
7. Google Scholar
Google Scholar is a free search engine that indexes all kinds of academic literature. Citing journal publishers, university research papers, and other scholarly materials makes your content look not just smarter but also more trustworthy.
8. Data Market
Data Market contains in-house and third party datasets. It’s a good place to explore data related to economics, healthcare, food and agriculture, and the automotive industry.
Since our original post, we’ve come across a few more sources of data that might be useful for you:
- Torrent downloads and uploads on Pirate Bay
- Social media & networks – from Stanford Uni
- Human Emotions by We Feel Fine: shared so that other artists can more easily make pieces that explore these human emotions
- LittleSis profiles who’s who in the biggest organisations in the world
- NY Times bestseller
- Google Flu Trends
- NY Times People: User data for NYTimes.com, including the user profiles, activities, news feeds, and networks.
- CrunchBase: Plenty of information about startups and large tech companies
- Wunderground: Provides detailed weather info and lets you search historical data by zip code or city.
You can also get a crazy amount of datasets and related information from Datamob.
DataWrangling is a repository with a large volume of datasets from a wide range of fields. To make it easier for you, we have scraped the list for you below. However, please note that the list may not be up to date, as it was last updated in 2009. Even so, it’s still a good place to start your data search.
Tips on using this list: Each link comes with tags. You can search the page by keyword to find the database that suits your needs.
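If you would rather search the tags with a script than with your browser’s find function, here is a minimal Python sketch. It assumes you have copied the list into lines shaped like the entries below, that is, `- Title (tags: a, b, c)`. The names `parse_entry` and `search` are made up for this example, and the three sample lines are taken from the list itself.

```python
import re

# Matches an entry of the form "- Title (tags: a, b, c)".
# The closing parenthesis is optional because a few entries lack one.
ENTRY = re.compile(r"^-\s*(?P<title>.*?)\s*\(tags:\s*(?P<tags>[^)]*)\)?\s*$")


def parse_entry(line):
    """Return (title, set_of_tags), or None if the line has no tag list."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    tags = {t.strip().lower() for t in match.group("tags").split(",") if t.strip()}
    return match.group("title"), tags


def search(lines, keyword):
    """Return the titles of entries whose tags include the keyword."""
    keyword = keyword.lower()
    hits = []
    for line in lines:
        entry = parse_entry(line)
        if entry and keyword in entry[1]:
            hits.append(entry[0])
    return hits


sample = [
    "- The arXiv.org API (tags: arxiv, api, open, paper, academic)",
    "- Alexa Web Search (tags: alexa, aws, web, search, api)",
    "- UN General Assembly Voting Data (tags: un, voting, statistics, government)",
]

print(search(sample, "api"))  # ['The arXiv.org API', 'Alexa Web Search']
```

The match is on whole tags and ignores case, so searching for `api` will not pick up an entry that only mentions an API in its title.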
Happy data digging, people!
- Announcing the Article Search API – Open Blog – NYTimes.com (tags: article, api, nytimes, text, corpus, newspaper)
- Twitter API to follow users, search for users and get user information (tags: network, api, social, twitter)
- Information Extraction: The RISE Repository of Information Sources (tags: information, textmining, extraction, reviews, jobs)
- Visualizing the Growth of Target, 1962-2008 | FlowingData (tags: visualization, retail, finance, gis, map, location, store, via:magnetbox, target)
- The Economy According To Mint (tags: finance, commercial, consumer, mint, spending)
- Repositories (tags: links, textmining, books, rdf, ocr, documents)
- Subsidyscope.com (tags: government, banking, csv, tarp, bailout)
- Best Buy – Welcome to the Best Buy Developer Network (tags: retail, data, api, product, bestbuy)
- twibs : find the businesses on twitter (tags: directory, businesses, twitter, companies)
- Massive Scrape of Twitter’s Friend Graph « blog.infochimps.org – Organizing Huge Information Sources (tags: textmining, twitter, network, socialnetwork, pagerank, graph, queryminer)
- Twitter Scrape (rough draft) – get.theinfo | Google Groups (tags: twitter, socialnetwork, graph)
- dbpedia.org (tags: wikipedia, named_entity, rdf, ontology)
- CinC Challenge 2000 datasets (tags: timeseries, machinelearning, ecg, health, medical, sleep, apnea)
- Free book usage data from the University of Huddersfield » “Self-plagiarism is style” (tags: books, library, borrowing, recommender, isbn, recommendation, collaborative, filtering, opendata)
- UC Berkeley. Sheldon Margen Public Health Library. Statistical/Data Resources (tags: health, links, resources, publichealth, berkeley)
- ICWSM 2009 – International AAAI Conference on Weblogs and Social Media (tags: blog, crawl, corpus, network, web, link)
- BART – For Developers (tags: urban, transportation, feeds, public, sanfrancisco, bart, api)
- Tim Davis: UF Sparse Matrix Collection : sparse matrices from a wide range of applications (tags: sparse, matrix)
- HumanScan : BioID : Downloads : BioID Face Database (tags: face, detection, image)
- Building a (fast) Wikipedia offline reader (tags: django, wikipedia, compressed, textmining, howto)
- UN General Assembly Voting Data (tags: un, voting, statistics, government)
- NORB Object Recognition Dataset, Fu Jie Huang, Yann LeCun, New York University (tags: image, 3d)
- Reddit’s Secret API (tags: reddit, api, json)
- Amazon Web Services (AWS) Hosted Public Datasets (tags: amazon, ebs, publicdata)
- Executive PayWatch Database (tags: ceo, compensation, pay, economics, business, labor)
- Research Datasets :: CID Data :: Center for International Development at Harvard University (CID) (tags: economics, international, development)
- NACDA: Search Holdings (tags: aging, statistics, studies)
- LIFE photo archive hosted by Google (tags: images, photo, pictures, search)
- Main Task QA Data (tags: question, answering, trec, nlp, machinelearning)
- The New York Times Annotated Corpus « YooName – named entity recognition (tags: named_entity, nytimes, corpus, people, organizations, locations)
- downloading – flossmole – Google Code – How to get FLOSSmole data for your own use (tags: opensource, project, activity, mysql, dump)
- Google Flu Trends | How does this work? (tags: google, health, trends, search, prediction, epidemiology, biodefence, queries, queryminer)
- Chris Pound’s Name Generation Page (tags: bizarre, scifi, phrase, name, word, generators, random, perl)
- TradingSolutions – Data Sources (tags: trading, finance, s, api, list)
- Announcing the New York Times Campaign Finance API – Open – Code – New York Times Blog (tags: nyt, api, campaign, donations, fec)
- Beautiful Data – WikiContent (tags: book, data, wiki, via:jhammerb)
- public domain sounds | free sound library (tags: sound, publicdomain, audio)
- Data Catalog (tags: dc, government, feeds, transparency, opendata)
- Open beats Closed: Best Buy’s new APIs – O’Reilly Radar (tags: retail, bestbuy, api)
- Voter registration data; or, HERE IS YOUR HOPE, YOU FOOLS! « The Edge of the American West (tags: voter, registration, politics, 2008)
- Linked Movie Data Base (tags: rdf, movies, movie, api)
- Big Huge Thesaurus API: Access 145,000 Words and Phrases (tags: webservice, api, thesaurus, textmining, nlp, rest)
- The Watchdog Project: volunteer (tags: government, transparency, parsing, election, python)
- Normalized Campaign Contribution Data (tags: cmu, politics, campaign, donations, fec, via:jhammerb, government)
- YouTube Dataset (tags: youtube, research, crawl, socialnetwork, network, graph, web)
- CRAWDAD (tags: wireless, RF, radio, signal, dartmouth, network)
- Web FAQ collection | ILPS (tags: faq, question_answering, questions, web, crawl, corpus, xml, textmining)
- Yahoo! Music API – YDN (tags: api, yahoo, music, artists)
- Search Query Performance report – Google AdWords Help Center (tags: adwords, ppc, search, metrics, webanalytics, sem, query, queryminer)
- Wordze Keyword Research Tool (tags: queryminer, keyword, tool, research, commercial, search, adwords)
- Searchable Catalogs of Data (tags: links, catalogs, social)
- Download Database – baseball1.com (tags: baseball, database, publicdata, statistics, sports)
- radiohead – Google Code (tags: lidar, visualization, radiohead, google, video)
- 80 Million Tiny Images (tags: images, words, english, search, visualization, imagemap)
- Time Series Center | Harvard University (tags: timeseries, anomaly, detection, astronomical, physics)
- BGN: Domestic Names – State and Topical Gazetteer Download Files (tags: gis, usgs)
- NGA: Country Files (tags: country, cities, geo)
- Datasets (tags: benchmark, clustering, regression, machinelearning, list, statistics, mathematics)
- Yahoo! Search Blog: BOSS — The Next Step in our Open Search Ecosystem (tags: api, open, search, yahoo, BOSS, queryminer)
- Download the Database – IP Address Lookup – Community Geotarget IP Project (tags: geocoding, geoip, internet, ip, ipaddress, mysql)
- Airline Data Project (tags: airline, statistics, finance, revenue, location, travel)
- Reddit.com: Ask Reddit: Where to download a DB dump of Reddit? (tags: reddit, socialnetwork, news, web)
- Show Us a Better Way: What public data is already available? (tags: statistics, census, uk, school, news, publicdata)
- Collaborative filtering dataset – dating agency (tags: collaborative, filtering, dating, rating, profiles, czech)
- About Us – Predictify (tags: predictionmarket, tool, finance, buzz, advertising, marketing, startup, mmds, david_kellogg)
- VGChartz.com | Video Games, Charts, News, Forums, Reviews, Wii, PS3, Xbox360, DS, PSP (tags: sales, ranking, videogames, retail)
- Store Level Information (tags: retail, finance, sales, store)
- Code for querying and downloading Flickr images (tags: image, python, code, flickr, matlab, recognition)
- Image Parsing Datasets (tags: image, recognition)
- OHPI – Traffic Volume Trends (tags: government, traffic, statistics, trends, transportation)
- PigTutorial – Pig Wiki (tags: search, log, query, web, excite, queries, hadoop, pig, tutorial, mapreduce, parallel, queryminer)
- Quality of Life Grand Challenge Dataset: Kitchen Capture (tags: machinelearning, motion, capture, sensor)
- Summize Twitter Search API (tags: api, buzz, opinion, trends, text, twitter, summize, search)
- 2008 IEEE InfoVis Contest Dataset (tags: visualization, contest, scalability, motion, tracking, pedestrian, sensor)
- IMDb Pro : Scary Movie 4: Box office (tags: movie, revenue, sales, box_office, imdb, commercial, movie_study)
- Spider-Man 2 (2004) – Daily Box Office Results (tags: movie, revenue, box_office)
- IMDbPro.com Free Trial Signup (tags: movie, revenue, timeseries, imdb, commercial, subscription)
- Free time-series and micro-data to download (tags: economics, links)
- Official Google Blog: A new flavor of Google Trends (tags: google, trends, search, query, api, csv, keyword, timeseries)
- i2b2: Informatics for Integrating Biology & the Bedside (tags: medical, obesity)
- Tiger Data Set Lecture (tags: tiger, gis, lectures)
- Google To Launch Large Scale Geo-Services (tags: geo, google, gps, location, geolocation, cell, wifi, api, gis)
- ImportGenius.com : U.S. Customs Database and Competitive Intelligence Tools (tags: commercial, shipping, imports, exports, finance, datamining)
- Directory Listing of Betfair price files (tags: betting, prediction, betfair, price, csv, predictionmarket)
- Reuters Spotlight – Article and Media API (tags: news, text, articles, api, content, media, xml, images, publicdata)
- [Wikitech-l] page counters (tags: wikipedia, pageviews, trends, textmining, seo, topic)
- Wikipedia article traffic statistics (tags: via:chl, wikipedia, web, analytics, seo, topic, textmining, traffic)
- Yahoo! Internet Location Platform – YDN (tags: yahoo, geo, geocoding, location, landmarks, gis)
- How to find images on the internet « Random knowledge (tags: images, links, lists, archive)
- Yahoo offers geographic data to Web sites | Tech news blog – CNET News.com (tags: gis, webservice, yahoo, api, location, landmark)
- Instructions for Obtaining Search Engine Transaction Logs (tags: query, search, log, excite, altavista, alltheweb, transaction)
- TechTC – Technion Repository of Text Categorization Datasets (tags: datamining, textmining, categorization, classification, odp, directory, text)
- The TechTC-100 Test Collection for Text Categorization (tags: textmining, classification, category, odp, directory)
- FEC Election Contributions: Download Detailed Files by Election Cycle (tags: individual, donations, government, election, publicdata, fec)
- Country Name and ISO 3166 Code MySQL Import File (tags: mysql, states, countries, isocode)
- Semantic Search the US Library of Congress (tags: via:inkdroid, libraries, mashup, rdf, semantic, search, semanticweb, books, api, webservice)
- geocoded Hotels « GeoNames Blog (tags: hotels, geonames)
- GeoNames webservice and data download (tags: locations, cities, countries, gis)
- Index of /download/worldcities (tags: cities, gis)
- CommonCrawl – About (tags: web, crawler, bot)
- Office of Defects Investigation (ODI), Flat File Downloads (tags: defect, recall, automobile, fightclub, nhtsa, safety)
- p2psim – kingdata : DNS server latency network distance matrices (tags: distance, matrix, network, p2p, dns, latency, nmf, queryminer)
- opentick.com (tags: opentick, trading, beta, feeds, finance)
- Open Cell Id dataset – phone geolocation from GSM cellids (tags: gis, mobile, geolocation)
- im2gps: estimating geographic information from a single image (tags: imagerecognition, via:csantos, gis, cmu, gps, imageprocessing, paper, hack, freaking_awesome)
- Datasets: MUSCLE WP2 Evaluation, Integration and Standards (tags: image, video, audio, currency, sports, imagerecognition)
- Open Economics – Resources (tags: economics, list)
- welcome @ omdb (tags: free, movie, database, netflixprize)
- Cogblog » Blog Archive » Cogmap APIs (tags: api, cogmap, person, name, organization, record_linkage)
- Wal-Mart : Freebase – The World’s Database (tags: retail, locations, stores)
- Cogmap: The Org Chart Wiki (tags: record_linkage, identity, name, organization, orgchart, marketing)
- German English Parallel Corpus “de-news”, Daily News 1996-2000 (tags: german, translation, corpus, english, text, via:maxme)
- Welcome to the CRCNS data sharing activity website — CRCNS (tags: neuroscience, patch, clamp, recordings, neuron, timeseries, patchclamp, data, neural, cortex, visual)
- org: Free Redistributable Rich Datasets (tags: aggregator, links)
- Frequent Itemset Mining Dataset Repository (tags: retail, clickstream, traffic, web, links, sales)
- TeradataUniversityNetwork.com -> Registration (tags: teradata, retail, transactional, database)
- ECIS 2007 – The 15th European Conference on Information Systems (tags: retail, dillards, sams_club)
- Alexa Web Search (tags: alexa, aws, web, search, api)
- developerWorks Interviews: Massive data mining and the resurgent mainframe (tags: price, retail, transaction, sams_club, dillards)
- Arkansas Newswire (tags: retail, dillards, uark)
- Crime data bonanza!!! (tags: timeseries, crime, statistics, publicdata)
- Wikipedia:Lists of common misspellings/For machines – Wikipedia, the free encyclopedia (tags: spelling, misspelling, wikipedia)
- Access to Web Research Collections VLC2/WT10g/WT2g (tags: blog, web, text)
- Data you can use for benchmarking (tags: image, vision, recognition)
- Lyricsfly Lyrics API, database access to search for music artist and song title, protocol REST with XML document (tags: song, lyrics, database, api)
- 2007 IEEE AVSS Detection and Tracking Algorithm Datasets (tags: tracking, video, detection, image, recognition, vehicle, pedestrian)
- Eigenvector Research, Inc. : Datasets Available to Download (tags: NIR, spectra, chemistry, semiconductor, pharmaceutical, matlab)
- OTCBVS (tags: image, recognition, detection, pedestrian, thermal, tracking, facerecognition, illumination)
- 99 Wikipedia Sources Aiding the Semantic Web » AI3:::Adaptive Information (tags: links, directory, record_linkage, extraction, wikipedia, named_entity, recognition, textmining, semanticweb, paper)
- UNdata (tags: UN, publicdata, government, statistics)
- AudioScrobbler Data (tags: audioscrobbler, recommendation, collaborative, filtering, music)
- The Linking Open Data dataset cloud (tags: directory, rdf, semantic, data, soup, graph)
- Free Economic Data | Economic, Financial, and Demographic Data (tags: finance, economics, portal, links)
- ::MLSP 2008::: MLSP competition (tags: machinelearning, trading, competition, backtest, matlab, code, finance, via:DeliciousRob)
- Computer Vision Test Images (tags: computer, vision, image, ray, trace, fingerprint, stereo, detection, via:chl)
- The Dataverse Network Project | The Dataverse Network Project (tags: statistics, repository, harvard)
- DVN – Home (tags: harvard, repository, social, science, research, portal, links)
- Ohio voter registration data (tags: voter, voting, politics, government, name, address, registration)
- Voter List Data Files – Election Department, Clark County, Nevada (tags: voting, voter, registration, name, address, data, election, politics, government, nevada)
- Temperature data (HadCRUT3 and CRUTEM3) (tags: climate, temperature, netcdf)
- MNIST handwritten digit database, Yann LeCun and Corinna Cortes (tags: handwriting, mnist, image, recognition)
- LFW : Labelled Faces in the Wild (tags: facerecognition, face, recognition, umass, image)
- Making random contacts – (37signals) (tags: generator, names)
- Test (Sample) Data Generators (tags: generator, tools, list, via:jd)
- Compete – Compete Developer Resources (tags: compete, api, web, statistics, traffic, analytics, mashup)
- Machine Learning (Theory) » The Peekaboom Dataset (tags: peekaboom, vision, image, large, human, computation, machinelearning, recognition)
- Ocean Processes and Modeling: Ocean Data (tags: links, oceanography, satellite)
- Tagged datasets for named entity recognition tasks (tags: nlp, corpus, tagged, named_entity, recognition, list)
- The Financial Data Finder A – G (tags: finance, links)
- Freebase Wikipedia Extraction (WEX) (tags: wikipedia, xml, structured, corpus)
- The arXiv.org API (tags: arxiv, api, open, paper, academic)
- England Football Results Betting Odds | Premiership Results & Betting Odds (tags: gambling, soccer, football, excel, statistics)
- HughesData – Main – Hughes Lab (tags: rna, bioinformatics, microarray, expression, gene, machinelearning)
- Stanford MicroArray Database (tags: bioinformatics, microarray, expression, gene, machinelearning, stanford)
- ArrayExpress Home (tags: bioinformatics, microarray, expression, gene, machinelearning)
- Gene Expression Omnibus (GEO) Main page (tags: bioinformatics, microarray, expression, gene, machinelearning)
- Welcome to Openvest (tags: python, finance, edgar, pylons, matplotlib, sec, webservice, via:jolby)
- Statistical Science Web: Datasets (tags: links, statistics)
- Data Mining: Text Mining, Visualization and Social Media: TailRank, Spinn3r, TechMeme and TechCrunch: New Attention (tags: crawler, blog, corpus)
- Aleix Face Database (tags: facerecognition, machinelearning, face, image)
- Data Repository Evaluation (tags: umd, links, statistics, government, sports, via:rickladd)
- PMC FTP Service (tags: biology, medicine, articles, text, journal, authors)
- “uspop2002” data set (tags: music, similarity, machinelearning)
- Internet Archive: Details: Amazon ASIN listing and similarity graph (tags: ASIN, amazon, recommendation, collaborative, filtering, via:keyvowel)
- European Climate Assessment Daily Weather Data (tags: weather, europe, ascii, netcdf)
- Poverty Datasets General Information (tags: poverty, statistics)
- StatLib—Datasets Archive (tags: machinelearning, datamining, cmu, link, collection)
- National Household Travel Survey (NHTS) Data (tags: driving, transportation, publicdata)
- RealClearPolitics – Election 2008 – Democratic Presidential Nomination (tags: polls, politics)
- Nielsen BookScan (tags: books, sales, commercial)
- Pew Internet & American Life Project (tags: internet, demographics, online, web)
- Main Page – OpenTextMining (tags: textmining, open, nature, standards, search)
- Metafilter Infodump (tags: metafilter, comments, network, via:chl)
- WEBSPAM-UK2007 | Datasets | Web Spam Detection (tags: web, search, spam, crawler, yahoo)
- Google to Host Terabytes of Open-Source Science Data | Wired Science from Wired.com (tags: google, article, openaccess)
- Zillow – Labs – Neighborhood Boundaries (tags: neighborhoods, geo, gis, maps)
- Crime in the United States (tags: crime, fbi)
- TaskForces/CommunityProjects/LinkingOpenData/DataSets – ESW Wiki (tags: opendata, semantic, rdf, collaboration)
- XML.com: GovTrack.us, Public Data, and the Semantic Web (tags: semanticweb, rdf, congress, politics, government)
- CiteULike: Available datasets (tags: networks, research, graph, tags, paper, record_linkage)
- Archive-It.org (tags: archive, internet, web, index)
- Challenge: Synopsis – Causality Workbench (tags: competition, machinelearning, forecasting, contest)
- Natural Language Processing (tags: microsoft, text, paraphrase, corpus)
- LDC – Linguistic Data Consortium – Obtaining Data Resources (tags: nlp, text, corpus, ngram, google, commercial, license)
- 1990 Census Name Files (tags: census, names, identity, frequency, record_linkage)
- Given Name Frequency Project: Analysis of Given Name Popularity (tags: name, record_linkage, text, identity, code)
- Email Datasets (tags: enron, names, identity, text, record_linkage)
- ZoomInfo (tags: api, identity, people, webservice, record_linkage)
- Ted Pedersen – Name Discrimination Data / Name Disambiguation Data / Name Ambiguity Data / Named Entity Resolution / Named Entity Disambiguation (tags: record_linkage, corpus, nlp, names)
- New SwetoDblp RDF dataset released with 11M triples (tags: name, authorship, rdf, record_linkage)
- LSDIS : SwetoDblp (tags: bibliography, rdf, ontology, duplicate, name, record_linkage)
- StrikeIron Super Data Pack Web Service 1.0 – StrikeIron Marketplace (tags: webservice, publicdata, datacleaning)
- Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets (tags: duplicate, detection, record_linkage, datacleaning, text)
- Amazon Web Services Developer Connection : Can Alexa WS provide detailed … (tags: finance, alexa, amazon, tech)
- Market Data — eBay Developers Program (tags: ebay, retail, pricing, sales, api, product)
- Health Data Tools and Statistics (tags: health, information, public, publicdata)
- It’s a Pitch-by-Pitch Scouting Report, Minus the Scout – New York Times (tags: baseball, gameday)
- opentick :: market data (tags: opentick, nasdaq, finance, stock)
- Daily Kos: Obama helps us track $1,000,000,000,000 of federal spending (tags: corruption, government, politics, finance)
- Welcome to USAspending.gov (tags: government, money, politics)
- Campaign Finance Reports and Data (tags: campaign, politics, elections)
- Machine Learning and Data Mining – Datasets (tags: face, image)
- Cardiac MRI dataset – York University (tags: mri, cardiac)
- Google Trends API coming soon | Tech news blog – CNET News.com (tags: google, trends, api)
- MIT Media Lab: Reality Mining (tags: social, activity, location, cell, gis)
- Vehicle Routing Datasets (tags: optimization, vehicle, routing)
- EIA – Petroleum Data, Reports, Analysis, Surveys (tags: oil, energy, statistics, economics, petroleum)
- DMOZ100k06 – Michael G. Noll (tags: search, pagerank, text, tags, content)
- Grading (tags: machinelearning, CMU, course, projects, graphicalmodel, code, paper)
- Carnegie Mellon University – CMU Graphics Lab – motion capture library (tags: gait, pedestrian, walk, motion)
- Financial Forecast Center’s Historical Economic and Market Data (tags: exchangerate, dollar, economics)
- Bureau of Labor Statistics Data (tags: economics, lumber, building, materials, homedepot)
- Browse Business Cycle Indicators Data (tags: economics, indicators, time, series)
- The Numbers Guy : Aspiring to Be the Wikipedia of Numbers (tags: finance, numberpedia, mechanicalturk, textmining, statistics)
- Social characteristics of the Marvel Universe (tags: socialnetwork, graphs, comicbooks)
- net: Word Lists Collection (tags: dictionary, words)
- ERS/USDA Data – International Macroeconomic Data Set (tags: usda, economics, population, cpi, gdp, income)
- The 2000 U.S. Census: 1 Billion RDF Triples (tags: gis, census, rdf, semantic, sparql)
- See Who’s Editing Wikipedia – Diebold, the CIA, a Campaign (tags: wikipedia, authorship)
- Dataset Generator – Perfect data for an imperfect world. (tags: tools, generator)
- National Bureau of Economic Research: Data (tags: economics, links)
- Entree Chicago Recommendation Data (tags: recommender, collaborative, restaurant)
- community resource guide: i’ve been here before – show me the links (tags: demographics, maps, gis, statistics, links)
- Social Science Data on the Net (tags: economics, social, government, health, labor, links)
- NBI ASCII Files – Bridge – FHWA (tags: government, bridges, safety)
- List of films: A – Wikipedia, the free encyclopedia (tags: netflix, netflixprize, movie, index, wikipedia)
- The arXiv on your harddrive (tags: paper, corpus, arXiv)
- Insanely Useful Websites | Sunlight Foundation (tags: links, transparency, government, politics, congress, reference)
- Technophilia: Where to find public records online – Lifehacker (tags: public, records, links)
- Junk email project (tags: corpus, email, spam, textmining)
- Enron Email Dataset (tags: enron, corpus, email, text, social, network)
- ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt (tags: finance, cpi, inflation, data)
- GOS – Geospatial One Stop (tags: health, gis, epidemiology, links)
- CIA Factbook Grep in Python (tags: cia, population, python, code, grep)
- University of Virginia – Richard Nixon – Oval Office Recordings (tags: nixon, speech, tapes, audio, mp3, wav, flac)
- UC San Diego Data Mining Competition – 2007 – Datasets (tags: housing, refinance, mortgage)
- package – MoinMaster
- stores | POI Factory (tags: retail, location, poi)
- GpsPasSion Forums – ** INDEX OF POI COLLECTIONS ** (tags: retail, poi, location, gis, gps)
- Collective Dynamics Group (tags: smallworld, networking, socialnetwork, graph)
- Jester Data download page (tags: collaborative, filtering, jokes)
- Index of /edgar (tags: finance, xml, edgar, sec, code, perl)
- Mail Index (tags: EDGAR, sec, mail, text)
- metafy / AnthraciteIdioms (tags: finance, SEC, scrape, parse, commercial)
- Volume of retail sales: Social Trends 33 (tags: retail, sales, uk)
- generatedata.com (tags: tools, generator, random)
- U.S. Company Filings and Annual Reports (tags: finance, links, sec)
- FTP Information – EDGAR Database (tags: edgar, finance, sec, filing, ftp, instructions)
- Data Mining For Investing (tags: investing, finance, datamining, announcement, sec, filing, links)
- Melissa DATA – Lookups (tags: consumer, data, database, api)
- FactSet: Data Maven – Kiplinger.com (tags: factset, finance)
- IBES (Demo) (tags: finance, ibes, analyst, forecast, wharton)
- Historical Quotes – Yahoo! Finance (tags: yahoo, finance, stock, price)
- Network data (tags: network, links)
- Bureau of Labor Statistics Home Page (tags: statistics, labor, government, consumer)
- NAR: Research: EHS Data (tags: housing, sales, finance)
- RFA – The Industry – Industry Statistics (tags: ethanol)
- Chain Store Guide – Retail Locations (tags: retail, finance, store, locations, gis)
- Energy Information Administration – EIA – Official Energy Statistics from the U.S. Government (tags: finance, government, energy, historical, forecasts, fuel, oil)
- UPC Database: Downloads (tags: product, upc, database)
- Web Crawling / Crawl Datasets at Tobias Escher at the OII (tags: crawler, benchmark, search, web, links)
- TechTC – Technion Repository of Text Categorization Datasets (tags: corpus, text)
- TMC data archive download site (tags: traffic, data)
- http://www.volvis.org/ (tags: volumerendering)
- Computational Vision: Archive (tags: vision, caltech, imagerecognition)
- DC Pedestrian Classification Benchmark (tags: pedestrian, image, classification, detection)
- opentick :: home (tags: finance, economics, feed, free, stock, trading, opentick, opensource)
- Web as Corpus (tags: textmining, corpus, concordance, wordlist, n-gram)
- .:[ packet storm ]:. – http://packetstormsecurity.org/ (tags: dictionary, hack, security, wordlist, password)
- Enron Dataset (tags: data, mysql, email, energy, text, socialnetwork)
- Splog Blog Dataset (tags: blog, corpus, spam)
- Home Page for 20 Newsgroups Data Set (tags: corpus, text, newsgroup)
- White Glove Tracking (tags: crowdsourcing, image, processing, algorithm, collaborative, distributed, web2.0, code, opensource)
- NOAA Paleoclimatology Program – Coral and Sclerosponge Data (tags: paleoclimatology, climate, oceanography, coral, sponge, biology)
- NAICS — North American Industry Classification System (tags: finance, economics, naics, industry, classifications)
- Saving Democracy With Web 2.0 – (tags: democracy, web2.0, mashup, government, funding, article)
- Population Estimates Datasets (tags: census, data, population, statistics)
- CRAN Task View: Machine Learning & Statistical Learning (tags: statisticallearning, machinelearning, code, R, libraries, cran)
- PAIDA – Pure Python scientific analysis package (tags: python, visualization, library)
- SUBDUE – Graph Based Knowledge Discovery (tags: machinelearning, network, graph)
- Python Cheese Shop : shakespeare 0.4 (tags: python, text)
- AG’s corpus of news articles (tags: corpus, nlp, machinelearning, textmining)
- Sampling Techniques for Massive Data – Google Video (tags: video, machinelearning, statistics, matrix, sampling, large, sparse, algorithm, experiment_design, towatch)
- metachronistic » Mirror the Wikipedia (tags: wikipedia, laptop, install, dump)
- LETOR: Benchmark Datasets for Learning to Rank (tags: ranking, search)
- CN710: Comparative Analysis of Learning Systems (Spring 2006) – Class Project (tags: machinelearning, algorithm, ogi, bu, greyhound, finance)
- UrbanSim Home (tags: python, urban, software, simulation, opensource, GIS, census)
- Face Recognition Homepage (tags: face, algorithm, facerecognition, data, image)
- CBCL SOFTWARE Face data set (tags: face, seung, algorithm, recognition, image)
- Text Analytics Solutions from ClearForest (tags: extraction, finance, semantic, semanticweb, text)
- 23C3 – Mining Search Queries – Video (tags: aol, search, video, talk, algorithm, informationretrieval, datamining, machinelearning)
- Digital History Hacks: Keywords and Clues (tags: aol, search, query, analysis)
- Digital History Hacks: Searching for History (tags: aol, search, query, analysis)
- The Tom Kyte Blog: An interesting data set… (tags: aol, search, oracle, database, code)
- KDD 2005 – KDD Cup 2005: Aug 21-24, Chicago, IL. USA (tags: query, categorization, algorithm, google)
- Statistical NLP / corpus-based computational linguistics resources (tags: corpus, machinelearning, text)
- Intelligent Web Search and Mining: Tools & Resources (tags: machinelearning, code, links)
- Official Google Research Blog: All Our N-gram are Belong to You (tags: linguistics, google, ngram, nlp, record_linkage)
- Hyper-threaded Java – Java World (tags: clustering, algorithm, java, parallel)
- Statistical Modeling, Causal Inference, and Social Science (tags: blog, econometrics, finance, machinelearning, math, statistics)
- Structural Analysis of Discrete Data and Econometric Applications, by Charles F. Manski and Daniel L. McFadden, MIT Press, 1981. (tags: books, econometrics, economics, finance, ebook)
- CSE 250B Fall 2006 (tags: netflixprize, machinelearning, course)
- Matrix Market (tags: matrixmarket, matrix)
- Analysis of incomplete datasets: Estimation of mean values and covariance matrices and imputation of missing values (tags: imputation, matlab, missing, EM, machinelearning)
- CSE 250B Project 4, Fall 2006 (tags: subset, netflixprize, dimensionality, reduction)
- G3DATA (tags: extract, from, graphs, hack, google, trends)
- cwm – a general purpose data processor for the semantic web (tags: python, processor, semantic, web, rdf)
- WebBase Project (tags: link, analysis, structure, web, crawler, stanford)
- sam roweis : data (tags: machine, learning, matlab, python, hackers, image)
- Index of /data/sequence/mnist (tags: mnist, xml, format)
- MNIST handwritten digit database (tags: mnist)
- Book-Crossing Dataset (tags: data, set, collaborative, filtering, datamining, books, movie)
- allmovie (tags: movie, netflixprize, source)
- Cinema.com (tags: plot, synopsis, movie, netflixprize, prize)
- LUMIERE (tags: netflixprize, prize, european, movie, revenue)
- Data dumps – Meta (tags: mediawiki, wikipedia, import, mysql, sql)
- Driver safety: safe driving practices, drugged driving prevention, safety tips for older drivers
- “phone ***” “address *” “e-mail” intitle:“curriculum vitae” – Google Search (tags: resume, google)
Now that you have an abundance of data on hand, find out how to avoid these common mistakes when transforming it into infographics.
Update: List was updated on the 5th of January, 2017