Find it again! - Receipt Dataset for Document Forgery Detection

The Find it again! dataset contains 988 scanned images of receipts and their transcriptions, originating from the scanned receipts OCR and information extraction (SROIE) dataset. Among these images, 163 have undergone realistic fraudulent modifications. The dataset includes ground truth information for distinguishing between forged and authentic receipts. It also provides annotations on the fraudulent modifications, including details about the entities that have been modified and the location of the forgeries.

Data Collection and Forgery

Find it again! aims to address the limitations of existing forgery detection datasets by providing a collection of labeled and annotated documents suitable for both image-based and content-based forgery detection approaches. This novel dataset contains diverse receipts, encompassing different layouts, fonts, styles and document characteristics encountered in real-world scenarios that have undergone pseudo-realistic manual forgeries.

SROIE Dataset Annotation

One characteristic of the scanned receipts of this dataset is that some have been modified, either digitally or manually, with different types of annotations. These annotations are not considered as forgeries. Even though the documents have been modified, they are still authentic, as they have not undergone any forgery, and the modification doesn't compromise the meaning of the receipts. These annotations suit our case study, as most of them are context-specific notes found in real document applications. For instance, some annotations are consistent with notes left on the receipts, such as ''staff outing'' to describe the nature of the event (Figure 1), numbers that can describe a mission or a case number (or any contextual numerical information) (Figure 3), names (Figure 2) or markings to highlight key information on the document, such as the price. Many of such annotations might even come from the collection process of the dataset and are difficult to interpret without contextual queues (names, numbers, etc.).

Figure 1

Figure 2

Figure 3

Provided Annotations

The annotations in the Find it again! dataset provide detailed information about the forged and authentic receipts, including:

Location of fraudulent modifications on the receipts (Bboxes)
Type of the entities that have been modified in the forgeries
Ground truth labels indicating whether a receipt is authentic or forged
Annotations on whether the authentic SROIE receipt contains a manual or digital alteration

Dataset Structure

The dataset is organized into the following directory structure:

dataset/
  ├── test/
  │   ├── receipt1.png
  │   ├── receipt1.txt
  │   ├── receipt2.png
  │   ├── receipt2.txt
  │   ├── ...
  ├── train/
  │   ├── receipt1.png
  │   ├── receipt1.txt
  │   ├── receipt2.png
  │   ├── receipt2.txt
  │   ├── ...
  ├── val/
  │   ├── receipt1.png
  │   ├── receipt1.txt
  │   ├── receipt2.png
  │   ├── receipt2.txt
  │   ├── ...
  ├── train.txt
  ├── test.txt
  └── val.txt

The test, train and val folders contain the PNG images of the receipts as well as their transcriptions. The train.txt, test.txt and val.txt files list the filenames of the training, test and validation images, respectively along with their ground truth. We provide annotations on whether the authentic receipt contains any kind of digital or handwritten modification (digital annotation and handwritten annotation), if it has been forged and in case of forgery, the forgery annotations in JSON format, as shown below.


          {'filename': 'X51005230616.png', 'size': 835401, 'regions': 
            [{'shape_attributes': {'name': 'rect', 'x': 27, 'y': 875, 
            'width': 29, 'height': 43}, 
            'region_attributes': {'Modified area': {'IMI': True}, 
            'Entity type': 'Product', 'Original area': 'no'}}, 
            {'shape_attributes': {'name': 'rect', 'x': 458, 'y': 883, 
            'width': 35, 'height': 37},
            'region_attributes': {'Modified area': {'IMI': True},
            'Entity type': 'Product', 'Original area': 'no'}}], 
            'file_attributes': {'Software used': 'paint', 'Comment': ''}}

Note that not all region attributes correspond to modified areas, as some are original areas that have been copied (Check 'Original area': 'yes' to rule them out, or the modification type).

Download the Find it again! Dataset

Reference

Beatriz Martínez Tornés, Théo Taburet, Emanuela Boros, Kais Rouis, Petra Gomez-Krämer, Nicolas Sidere, Antoine Doucet and Vincent Poulain d'Andecy. Receipt Dataset for Document Forgery Detection. In Proceedings of The 17th International Conference on Document Analysis and Recognition, August 21-26, 2023 — San José, California, USA.

Acknowledgements

We would like to thank the participants for their contribution to the creation of the dataset. This work was supported by the French defence innovation agency (AID), the VERINDOC project funded by the Nouvelle-Aquitaine Region and the LabCom IDEAS (ANR-18-LCV3-0008) funded by the French national research agency (ANR).