DocVQA Dataset

Document Visual Question Answering


Dataset

The data is available for download on the challenge page of the RRC portal. You must log in to the portal before downloading. Also make sure that you read the terms and conditions listed on the Download page.

Dataset for Task 1 of 2020 Challenge

All the images in Document VQA Challenge Task 1 are pages of documents downloaded from the UCSF Industry Documents Library.

We additionally provide OCR transcriptions of all the document images, obtained using a commercial OCR solution. The OCR transcriptions include line and word bounding boxes, as well as line-level and word-level text transcriptions.
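The exact schema of the OCR files is not described here; as a rough sketch only, assuming each document image has a companion JSON file containing line entries with word-level "text" and "boundingBox" fields, word transcriptions could be read as follows (all file and field names are assumptions for illustration):

import json

# All names below are assumptions; check the downloaded OCR files for the
# actual layout of lines, words and bounding boxes.
with open("ocr/txpn0095_1.json", "r", encoding="utf-8") as f:
    ocr = json.load(f)

for line in ocr.get("lines", []):       # assumed: one entry per recognised text line
    for word in line.get("words", []):  # assumed: word-level entries inside each line
        print(word.get("text"), word.get("boundingBox"))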

Ground Truth Format


Corresponding to each data split (train, val and test) there is a JSON file with the ground truth annotations, and a folder with the document images. The JSON file, called "docvqa_train_vX.X" for the train split, has the following format (an explanation follows each field):

{

"dataset_name": "docvqa", The name of the dataset, should be invariably "docvqa"

"dataset_split": "train", The subset (either "train" or "test")

"dataset_version": "0.1", The version of the dataset. A string in the format of major.minor version

"data": [{...}]

}

The "data" element is a list of dictionary entries with the following structure

{

"questionId": 52212, A unique ID number for the question

"question": "Whose signature is given?", The question string - natural language asked question

"image": "documents/txpn0095_1.png", The image filename corresponding to the document page which the question is defined on. The images are provided in the /documents folder

"docId": 1968, A unique ID number for the document

"ucsf_document_id": "txpn0095", The UCSF document id number

"ucsf_document_page_no": "1", The page number within the UCSF document that is used here

"answers": ["Edward R. Shannon", "Edward Shannon"], A list of correct answers provided by annotators

"data_split": "train" The dataset split this question pertains to

}
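As a quick illustration of the format above, the following Python sketch loads the Task 1 annotation file and prints each question together with its image path and answers. The file name and data root are placeholders matching the layout described here:

import json
import os

# Placeholder paths: the annotation JSON and the /documents image folder
# are assumed to sit side by side, as described above.
ANNOTATION_FILE = "docvqa_train_vX.X.json"
DATA_ROOT = "."

with open(ANNOTATION_FILE, "r", encoding="utf-8") as f:
    gt = json.load(f)

print(gt["dataset_name"], gt["dataset_split"], gt["dataset_version"])

for entry in gt["data"]:
    image_path = os.path.join(DATA_ROOT, entry["image"])  # e.g. documents/txpn0095_1.png
    print(entry["questionId"], entry["question"])
    print("  image  :", image_path)
    print("  answers:", entry["answers"])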

More details on the annotation of this dataset, an analysis of the data, and baselines using state-of-the-art QA methods from the NLP and VQA spaces are reported in our paper titled DocVQA: A Dataset for VQA on Document Images.

Dataset for Task 2 of 2020 Challenge

All the images provided for Document VQA Challenge Task 2 are sourced from Public Disclosure Commission (PDC) documents.

Ground Truth Format

{

"dataset_name":"docvqa_task2", The name of the dataset, should be invariably "docvqa_task2"

"dataset_split": "sample", The subset(either "sample" or "test")

"dataset_version":"0.1", The version of the dataset. A string in the format of major.minor version

"data": [{...}]

}

The "data" element is a list of dictionary entries with the following structure:

{

"question_id": 0, A unique ID number for the question

"questions": In which years did Anna M. Rivers run for the State senator office? The question string - Natural Language asked question

"answers": [2016, 2020], The answer to the question

"evidence": [454, 10901], The Doc IDs where the answer can be found

"ground_truth": [0, 0, 0, 1, 0 ..... 0] List of dimension equal to the number of documents in the dataset. Values are 0 and 1, where 1 means the document is considered as a positive evidence for the question.

"data_split": "sample" The dataset split this question belongs to.

}
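As a sketch of how the "evidence" and "ground_truth" fields relate, the snippet below loads the Task 2 annotations (the file name is a placeholder) and recovers the indices flagged as positive evidence in the binary vector. How those indices map onto the document IDs in "evidence" is an assumption and is not stated explicitly above:

import json

# Placeholder file name; use the Task 2 annotation file you downloaded.
with open("docvqa_task2_sample_vX.X.json", "r", encoding="utf-8") as f:
    gt = json.load(f)

for entry in gt["data"]:
    # Indices of documents marked as positive evidence in the binary vector.
    positive = [i for i, v in enumerate(entry["ground_truth"]) if v == 1]
    print(entry["question_id"], entry["questions"])
    print("  evidence doc IDs :", entry["evidence"])
    print("  positive indices :", positive)  # mapping of indices to doc IDs is assumed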

Dataset for 2nd Edition of the Challenge

TBD