Dataset & Database

Microsite on Dataset

Go to the microsite for a continually updated list of interesting datasets.

Introduction

A dataset (or data set) is a collection of raw information grouped together, and it can be about almost anything. The most common kind of dataset consists of a single table or statistical data matrix, where every column represents a particular variable (parameter) and each row corresponds to a given member (record) of the dataset in question.
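As a minimal sketch of the table layout described above, the Python snippet below parses a tiny table using only the standard library. The column names and values (sample_id, height_cm, weight_kg) are invented purely for illustration and do not come from any dataset listed on this page.

```python
import csv
import io

# Hypothetical toy data: one header line naming the variables (columns),
# then one line per member (row) of the dataset.
RAW = """sample_id,height_cm,weight_kg
a1,170,65
a2,182,80
a3,158,52
"""


def load_table(text):
    """Parse CSV text into a list of rows, one dict per dataset member."""
    return list(csv.DictReader(io.StringIO(text)))


def column(rows, name):
    """Collect the values of one variable (column) across all rows."""
    return [row[name] for row in rows]


rows = load_table(RAW)
variables = list(rows[0].keys())

print(variables)                   # ['sample_id', 'height_cm', 'weight_kg']
print(len(rows))                   # 3
print(column(rows, "height_cm"))   # ['170', '182', '158']
```

Note that the csv module returns every value as a string; converting columns to numbers is left to the reader and depends on the dataset at hand.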

Dataset

The Classic

Data Science

MNIST database
- Images of handwritten digits commonly used to test classification, clustering, and image processing algorithms
- Commonly used as teaching material in university lectures
- Dataset license type: CC BY-SA 3.0 as of 22nd March 2020
Bupa liver
- Used in several papers in the machine learning (data mining) literature
- Dataset license type: PDDL as of 22nd March 2020

COVID-19

Healthcare

BMJ
- By BMJ
Cambridge
- By Cambridge University Press
CDC
- Centers for Disease Control and Prevention, by the U.S. Department of Health & Human Services
Cell
- Super high impact journals by Elsevier Inc
Elsevier
- Access over 19,000 ScienceDirect articles on COVID-19 since 1996
Lancet
- Lancet journals by The Lancet
medRxiv
- Free online archive and distribution server for complete but unpublished manuscripts (preprints)
Nature
- Super high impact journals: Nature, Nature Medicine, Nature Microbiology, Nature Biotechnology, Nature Structural & Molecular Biology, etc. by Nature Research
NEJM
- New England Journal of Medicine, by Massachusetts Medical Society
Oxford
- Oxford Academic journals by Oxford University Press
Wiley
- By John Wiley & Sons, Inc.

HSpam14

Dropbox (308 MB)
Onedrive (74.8 MB)
Surendra Sedhai and Aixin Sun, "HSpam14: A Collection of 14 Million Tweets for Hashtag-Oriented Spam Research," in Proc. 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 223-232, Aug. 2015. DOI: 10.1145/2766462.2767701 [PDF]
- Abstract: Hashtag facilitates information diffusion in Twitter by creating dynamic and virtual communities for information aggregation from all Twitter users. Because hashtags serve as additional channels for one's tweets to be potentially accessed by other users than her own followers, hashtags are targeted for spamming purposes (e.g., hashtag hijacking), particularly the popular and trending hashtags. Although much effort has been devoted to fighting against email/web spam, limited studies are on hashtag-oriented spam in tweets. In this paper, we collected 14 million tweets that matched some trending hashtags in two months' time and then conducted systematic annotation of the tweets being spam and ham (i.e., non-spam). We name the annotated dataset HSpam14. Our annotation process includes four major steps: (i) heuristic-based selection to search for tweets that are more likely to be spam, (ii) near-duplicate cluster based annotation to firstly group similar tweets into clusters and then label the clusters, (iii) reliable ham tweets detection to label tweets that are non-spam, and (iv) Expectation-Maximization (EM)-based label prediction to predict the labels of remaining unlabeled tweets. One major contribution of this work is the creation of HSpam14 dataset, which can be used for hashtag-oriented spam research in tweets. Another contribution is the observations made from the preliminary analysis of the HSpam14 dataset
Dataset license type: not explicitly defined

Change Detection

dataset2012 (1.9 GB)
- 6 video categories with 4 to 6 video sequences in each category
- Read more from the authors:
  - Page: http://jacarini.dinf.usherbrooke.ca/dataset2012/
  - Paper: N. Goyette, P. Jodoin, F. Porikli, J. Konrad and P. Ishwar, "Changedetection.net: A new change detection benchmark dataset," 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, pp. 1-8, Jun. 2012. DOI: 10.1109/CVPRW.2012.6238919
    - Abstract: Change detection is one of the most commonly encountered low-level tasks in computer vision and video processing. A plethora of algorithms have been developed to date, yet no widely accepted, realistic, large-scale video dataset exists for benchmarking different methods. Presented here is a unique change detection benchmark dataset consisting of nearly 90,000 frames in 31 video sequences representing 6 categories selected to cover a wide range of challenges in 2 modalities (color and thermal IR). A distinguishing characteristic of this dataset is that each frame is meticulously annotated for ground-truth foreground, background, and shadow area boundaries - an effort that goes much beyond a simple binary label denoting the presence of change. This enables objective and precise quantitative comparison and ranking of change detection algorithms. This paper presents and discusses various aspects of the new dataset, quantitative performance metrics used, and comparative results for over a dozen previous and new change detection algorithms. The dataset, evaluation tools, and algorithm rankings are available to the public on a website and will be updated with feedback from academia and industry in the future.
dataset2014 (3.7 GB)
- 11 video categories with 4 to 6 video sequences in each category
- Read more from the authors:
  - Page: http://jacarini.dinf.usherbrooke.ca/dataset2014/
  - Paper: Y. Wang, P.-M. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, and P. Ishwar, "CDnet 2014: An Expanded Change Detection Benchmark Dataset," in Proc. of IEEE Workshop on Change Detection (CDW-2014), pp. 387-394, Jun. 2014. DOI: 10.1109/CVPRW.2014.126
    - Abstract: Change detection is one of the most important low-level tasks in video analytics. In 2012, we introduced the changedetection.net (CDnet) benchmark, a video dataset devoted to the evaluation of change and motion detection approaches. Here, we present the latest release of the CDnet dataset, which includes 22 additional videos (70000 pixel-wise annotated frames) spanning 5 new categories that incorporate challenges encountered in many surveillance settings. We describe these categories in detail and provide an overview of the results of more than a dozen methods submitted to the IEEE Change Detection Workshop 2014. We highlight strengths and weaknesses of these methods and identify remaining issues in change detection.
Dataset license type: not explicitly defined

Dataset on Request

A*3D

On request with conditions. Please read more at the user webpage. [link]
Data from 1 LiDAR and 1 camera
Paper: Quang-Hieu Pham, Pierre Sevestre, Ramanpreet Singh Pahwa, Huijing Zhan, C. H. Pang, Yuda Chen, Armin Mustafa, Vijay Chandrasekhar, Jiajun Lin, "A*3D Dataset: Towards Autonomous Driving in Challenging Environments,"
- Abstract: With the increasing global popularity of self-driving cars, there is an immediate need for challenging real-world datasets for benchmarking and training various computer vision tasks such as 3D object detection. Existing datasets either represent simple scenarios or provide only day-time data. In this paper, we introduce a new challenging A*3D dataset which consists of RGB images and LiDAR data with significant diversity of scene, time, and weather. The dataset consists of high-density images (~10 times more than the pioneering KITTI dataset), heavy occlusions, a large number of night-time frames (~3 times the nuScenes dataset), addressing the gaps in the existing datasets to push the boundaries of tasks in autonomous driving research to more challenging highly diverse environments. The dataset contains 39K frames, 7 classes, and 230K 3D object annotations. An extensive 3D object detection benchmark evaluation on the A*3D dataset for various attributes such as high density, day-time/night-time, gives interesting insights into the advantages and limitations of training and testing 3D object detection in real-world setting
- Direct download from arXiv
Dataset license type: not explicitly defined

Publications

Highly Cited Research Publications on Big Datasets

KITTI dataset

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231-1237, Aug. 2013. DOI: 10.1177/0278364913491297

Abstract: We present a novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research. In total, we recorded 6 hours of traffic scenarios at 10–100 Hz using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras, a Velodyne 3D laser scanner and a high-precision GPS/IMU inertial navigation system. The scenarios are diverse, capturing real-world traffic situations, and range from freeways over rural areas to inner-city scenes with many static and dynamic objects. Our data is calibrated, synchronized and timestamped, and we provide the rectified and raw image sequences. Our dataset also contains object labels in the form of 3D tracklets, and we provide online benchmarks for stereo, optical flow, object detection and other tasks. This paper describes our recording platform, the data format and the utilities that we provide.
Data available via KIT
Dataset license type: CC BY-NC-SA 3.0 as of 22nd March 2020

Million Song Dataset

Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere, "The Million Song Dataset," in Proc. of 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, Oct. 2011. DOI: 10.7916/D8NZ8J07

Abstract: We introduce the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. We describe its creation process, its content, and its possible uses. Attractive features of the Million Song Database include the range of existing resources to which it is linked, and the fact that it is the largest current research dataset in our field. As an illustration, we present year prediction as an example application, a task that has, until now, been difficult to study owing to the absence of a large set of suitable data. We show positive results on year prediction, and discuss more generally the future development of the dataset.
Data available online:
- Million Song Dataset on AWS (500 GB) [link]
- http://millionsongdataset.com/pages/getting-dataset/
Funded by National Science Foundation (IIS-0713334)
Dataset license type: CC BY-NC-SA 2.0 as of 22nd March 2020

References

Million Song Dataset (n.d.). Retrieved Mar. 22, 2020, from millionsongdataset.com
Sun, Aixin (n.d.). Dataset. Retrieved Mar. 22, 2020, from Nanyang Technological University

Special notes: 1. The cover image on this page is courtesy of geralt via pixabay.com. 2. Disclaimer applies. Except where otherwise noted, contents on this page are licensed under a CC BY-NC-SA 4.0 International License. 3. Please check the original dataset creator's licence terms and conditions before any commercial use.

Keywords

dataset, download, resource, covid, database, i2r, data, video, data mining, astar, ntu, science, research, singapore