Getting hands-on with COVID-19 data

Are you a machine learning engineer or data scientist eager to contribute your technical expertise to help fight the COVID-19 pandemic? We’re here to help.  

In record time, the machine learning community has rallied together, organizing a large number of projects and crowdsourced efforts that let practitioners contribute their time and energy to this cause. Here, we highlight a few resources to help you join the effort.

Assembled by a consortium of technology companies and federal officials under the leadership of The Allen Institute for AI and The White House Office of Science and Technology Policy, the CORD-19 corpus contains full texts and paper abstracts from 44,000 journal articles covering COVID-19, SARS-CoV-2, and the coronavirus family of viruses. The corpus is continually being updated as more research becomes available.

Its breadth makes CORD-19 a great starting point for analyzing COVID-19 and the SARS-CoV-2 virus using natural language processing (NLP) tools.
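As a first pass over a corpus like this, a plain keyword filter over paper titles and abstracts can narrow things down before you reach for heavier NLP tooling. The sketch below is a minimal illustration, not part of CORD-19 itself: it assumes the paper metadata is available as a CSV with `title` and `abstract` columns, and `search_papers` and the three sample rows are made up for the example.

```python
import csv
import io


def search_papers(rows, keyword, fields=("title", "abstract")):
    """Return the rows whose title or abstract mentions keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        row
        for row in rows
        if any(needle in (row.get(field) or "").lower() for field in fields)
    ]


# Tiny stand-in for a metadata file. The column names are an assumption,
# and the rows are invented purely for illustration.
SAMPLE_METADATA = """title,abstract
Spike protein structure of SARS-CoV-2,We report the structure of the spike glycoprotein.
Bat coronaviruses in China,A survey of coronaviruses found in bat populations.
Influenza vaccination rates,Seasonal influenza coverage among adults.
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_METADATA)))
matches = search_papers(rows, "coronavirus")
for row in matches:
    print(row["title"])
```

To try this on a real download, swap the `io.StringIO(...)` sample for an open file handle and adjust the column names to match the file.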

CORD-19 is available as a dataset on Kaggle or as a direct download via its Semantic Scholar landing page. Kaggle's public notebooks for the dataset can help you get started.


The United States National Institutes of Health (NIH) has created a paper indexer specific to COVID-19, the disease caused by SARS-CoV-2: LitCovid.

The LitCovid literature hub provides a reference list of 1,500 papers specific to COVID-19 and SARS-CoV-2, curated by subject matter experts. Many of these articles are open access, and you can download paper metadata as a CSV file.

nCoV Dashboard

This January, Johns Hopkins University debuted the 2019-nCoV dashboard, a website tracking the regional and international spread of COVID-19. The dashboard is updated regularly to keep pace with the developing situation, and it continues to be among the most accurate trackers on the web.

The team behind the dashboard has normalized data from over a dozen different health organizations worldwide. The raw time series data is available on GitHub.
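Assuming the time series files are wide CSVs, with a few identifying columns followed by one column of cumulative counts per date, turning them into daily increases takes only a few lines. The following is a rough sketch under that assumption; the sample rows, the place names, and the helper names `load_series` and `daily_new_cases` are invented for illustration, so check the layout of the actual files before relying on it.

```python
import csv
import io


def load_series(csv_text, id_columns=4):
    """Parse a wide CSV into (dates, {(province, country): cumulative counts})."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    dates = header[id_columns:]  # every column after the identifiers is a date
    series = {}
    for row in reader:
        series[(row[0], row[1])] = [int(value) for value in row[id_columns:]]
    return dates, series


def daily_new_cases(cumulative):
    """Convert cumulative counts into day-over-day increases."""
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]


# Made-up sample in the assumed layout: four identifying columns, then dates.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20
,Exampleland,0.0,0.0,1,3,3,8
North,Otherland,1.0,1.0,0,0,2,5
"""

dates, series = load_series(SAMPLE)
new_cases = daily_new_cases(series[("", "Exampleland")])
for date, count in zip(dates[1:], new_cases):
    print(date, count)
```

Working from cumulative counts keeps the parsing simple; the differencing step is where corrections in the source data would show up as negative values, so it is worth inspecting the output rather than assuming it is always non-negative.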

MIDAS Network

MIDAS is a software catalog and research portal for COVID-19-related tools and data. Created as a collaboration between epidemiologists and subject matter researchers, it offers raw data streams that make a robust starting point for deeper exploration.

Collaboration Spaces

In addition to tools and data, there are a number of project-sharing spaces for data scientists seeking to join collaborative efforts. These include Data Against COVID, a Discourse channel for data scientists looking to collaborate on COVID-19-related data science projects, and Help With COVID, a volunteer-to-project matching website created by Y Combinator for COVID-19-related software projects.

COVID-19 data on Spell

If you're looking for a more data-driven perspective on COVID-19, the Spell team is building a workspace template that packages the most useful COVID-19 datasets into a Jupyter notebook environment on Spell. To start it up, run the following command in your shell:

spell jupyter covid-19-workspace \
    --machine-type cpu \
    --github-url \
    --mount 's3://ai2-semanticscholar-cord-19/2020-03-20/comm_use_subset.tar.gz:cord-19/commerical_use_subset' \

This workspace template is still a work in progress! We welcome pull requests to the repository.

Stay safe and healthy, everyone!

