Data Sets


Image Data Sets

  • MNIST (MNIST is one of the most popular deep learning datasets. It is a dataset of handwritten digits with a training set of 60,000 examples and a test set of 10,000 examples. It is a good database for trying learning techniques and pattern recognition methods on real-world data while spending minimal time and effort on data preprocessing.

Size: ~50 MB

Number of Records: 70,000 images in 10 classes)

  • COCO (COCO is a large-scale, richly annotated dataset for object detection, segmentation and captioning. It has several features:

      • Object segmentation
      • Recognition in context
      • Superpixel stuff segmentation
      • 330K images (>200K labeled)
      • 1.5 million object instances
      • 80 object categories
      • 91 stuff categories
      • 5 captions per image
      • 250,000 people with keypoints

Size: ~25 GB (Compressed)

Number of Records: 330K images, 80 object categories, 5 captions per image, 250,000 people with keypoints)

  • ImageNet (ImageNet is an image dataset organized according to the WordNet hierarchy. WordNet contains approximately 100,000 phrases, and ImageNet provides around 1,000 images on average to illustrate each phrase.

Size: ~150GB

Number of Records: Total number of images: ~1,500,000; each with multiple bounding boxes and respective class labels)

  • Open Image Dataset (Open Images is a dataset of almost 9 million image URLs. These images have been annotated with image-level labels and bounding boxes spanning thousands of classes. The dataset contains a training set of 9,011,219 images, a validation set of 41,260 images and a test set of 125,436 images.

Size: 500 GB (Compressed)

Number of Records: 9,011,219 images with more than 5k labels)

  • Visual QA (VQA is a dataset containing open-ended questions about images. These questions require an understanding of vision and language. Some of the interesting features of this dataset are:

      • 265,016 images (COCO and abstract scenes)
      • At least 3 questions (5.4 questions on average) per image
      • 10 ground truth answers per question
      • 3 plausible (but likely incorrect) answers per question
      • Automatic evaluation metric

Size: 25 GB (Compressed)

Number of Records: 265,016 images, at least 3 questions per image, 10 ground truth answers per question)

  • CIFAR-10 (CIFAR-10 is another image classification dataset. It consists of 60,000 images in 10 classes. In total, there are 50,000 training images and 10,000 test images. The dataset is divided into 6 parts: 5 training batches and 1 test batch. Each batch has 10,000 images.

Size: 170 MB

Number of Records: 60,000 images in 10 classes)
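The CIFAR-10 batch layout described above can be checked with a few lines of arithmetic. This is only a sketch built from the figures in the entry; the constant and function names are hypothetical and do not correspond to any loader API.

```python
# Figures taken from the CIFAR-10 entry above; names are illustrative only.
IMAGES_PER_BATCH = 10_000
TRAIN_BATCHES = 5
TEST_BATCHES = 1
NUM_CLASSES = 10


def cifar10_layout():
    """Return the image counts implied by the batch layout."""
    train = TRAIN_BATCHES * IMAGES_PER_BATCH
    test = TEST_BATCHES * IMAGES_PER_BATCH
    total = train + test
    return {
        "train": train,
        "test": test,
        "total": total,
        # Average number of images per class across the whole dataset.
        "per_class_average": total // NUM_CLASSES,
    }


layout = cifar10_layout()
print(layout)
```

The totals agree with the record count given in the entry: 50,000 training images plus 10,000 test images make 60,000 images.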

Natural Language Processing

  • IMDB Reviews (This is a dream dataset for movie lovers. It is meant for binary sentiment classification and has far more data than any previous dataset in this field. Apart from the training and test review examples, there is further unlabeled data for use as well. Raw text and preprocessed bag-of-words formats are also included.

Size: 80 MB

Number of Records: 25,000 highly polar movie reviews for training, and 25,000 for testing)

  • Newsgroup (This dataset, as the name suggests, contains messages from newsgroups. To curate this dataset, 1,000 Usenet articles were taken from each of 20 different newsgroups. The articles have typical features like subject lines, signatures, and quotes.

Size: 20 MB

Number of Records: 20,000 messages taken from 20 newsgroups)

  • Sentiment-140 (Sentiment140 is a dataset that can be used for sentiment analysis. A popular dataset, it is a good place to start your NLP journey. Emoticons have been removed from the data in advance. The final dataset has the following 6 features:

      • polarity of the tweet
      • id of the tweet
      • date of the tweet
      • the query
      • username of the tweeter
      • text of the tweet

Size: 80 MB (Compressed)

Number of Records: 1,600,000 tweets)

  • WordNet (Mentioned in the ImageNet dataset above, WordNet is a large database of English synsets. Synsets are groups of synonyms that each describe a different concept. WordNet’s structure makes it a very useful tool for NLP.

Size: 10 MB

Number of Records: 117,000 synsets, each linked to other synsets by means of a small number of “conceptual relations”)

  • YELP Reviews (This is an open dataset released by Yelp for learning purposes. It consists of millions of user reviews, business attributes and over 200,000 pictures from multiple metropolitan areas. This is a very commonly used dataset for NLP challenges globally.

Size: 2.66 GB JSON, 2.9 GB SQL and 7.5 GB Photos (all compressed)

Number of Records: 5,200,000 reviews, 174,000 business attributes, 200,000 pictures and 11 metropolitan areas)

  • Wikipedia Corpus (This dataset is a collection of the full text of Wikipedia. It contains almost 1.9 billion words from more than 4 million articles. What makes this a powerful NLP dataset is that you can search it by word, phrase or part of a paragraph.

Size: 20 MB

Number of Records: 4,400,000 articles containing 1.9 billion words)

  • Blogs (This dataset consists of blog posts collected from thousands of bloggers. Each blog is provided as a separate file. Each blog contains a minimum of 200 occurrences of commonly used English words.

Size: 300 MB

Number of Records: 681,288 posts with over 140 million words)
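As an illustration of the six Sentiment-140 fields listed above (polarity, id, date, query, username, text), the sketch below parses one record into a named tuple. It assumes the data is stored as comma-separated text with one tweet per row and quoted fields; the field names, the `parse_tweets` helper and the sample row are hypothetical and not taken from the dataset.

```python
import csv
import io
from typing import NamedTuple


class Tweet(NamedTuple):
    # Hypothetical field names; the order follows the feature list above.
    polarity: str
    tweet_id: str
    date: str
    query: str
    user: str
    text: str


def parse_tweets(csv_text):
    """Parse CSV text with six columns per row into Tweet records."""
    rows = csv.reader(io.StringIO(csv_text))
    # Rows that do not have exactly six columns are skipped.
    return [Tweet(*row) for row in rows if len(row) == 6]


# Made-up sample row for illustration only; not a real tweet.
sample = '"0","1","a date string","a query","example_user","an example, with a comma"\n'
tweets = parse_tweets(sample)
print(tweets[0].user, "->", tweets[0].text)
```

Using the `csv` module rather than splitting on commas keeps tweet text that itself contains commas in a single field.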

Machine Language Translation

  • Translations (This dataset consists of training data for machine translation between English and several other languages. The task here is to improve the current translation methods. You can participate in any of the following language pairs:

      • English-Chinese and Chinese-English
      • English-Czech and Czech-English
      • English-Estonian and Estonian-English
      • English-Finnish and Finnish-English
      • English-German and German-English
      • English-Kazakh and Kazakh-English
      • English-Russian and Russian-English
      • English-Turkish and Turkish-English

Size: ~15 GB

Number of Records: ~30,000,000 sentences and their translations)

Real Time Bidding

  • [1] (The iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition was organized by iPinYou from April 1st, 2013 to December 31st, 2013. The competition was divided into three seasons. For each season, a training dataset was released to the participants, while the testing dataset was reserved by iPinYou. The complete testing dataset was randomly divided into two parts: a leaderboard testing dataset, used to score and rank the participating teams on the leaderboard, and a part reserved for the final offline evaluation. Each participant's last offline submission was evaluated on the reserved testing dataset to produce the team's final offline score. This dataset contains the training datasets and leaderboard testing datasets from all three seasons; the reserved testing datasets are withheld by iPinYou. The training dataset includes a set of processed iPinYou DSP bidding, impression, click, and conversion logs).
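The random division of a testing dataset into a leaderboard part and a reserved part, as described above, might look roughly like the sketch below. The split proportion is not stated in the description, so the 50/50 default is an assumption, as are the function name, the seed and the placeholder records; this is not iPinYou's actual procedure.

```python
import random


def split_test_set(records, leaderboard_fraction=0.5, seed=0):
    """Randomly divide records into (leaderboard, reserved) parts.

    leaderboard_fraction is an assumed parameter; the real proportion
    used in the competition is not given in the text above.
    """
    if not 0.0 <= leaderboard_fraction <= 1.0:
        raise ValueError("leaderboard_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * leaderboard_fraction)
    return shuffled[:cut], shuffled[cut:]


records = [f"log-{i}" for i in range(10)]  # placeholder records
leaderboard, reserved = split_test_set(records)
print(len(leaderboard), len(reserved))
```

Every record lands in exactly one of the two parts, so scores on the leaderboard part reveal nothing directly about the records held back for the final evaluation.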

General Machine Learning Data Sets