The Find it again! dataset contains 988 scanned images of receipts and their transcriptions, originating from the scanned receipts OCR and information extraction (SROIE) dataset. Among these images, 163 have undergone realistic fraudulent modifications. The dataset includes ground truth information for distinguishing between forged and authentic receipts. It also provides annotations on the fraudulent modifications, including details about the entities that have been modified and the location of the forgeries.
Find it again! aims to address the limitations of existing forgery detection datasets by providing a collection of labeled and annotated documents suitable for both image-based and content-based forgery detection approaches. This novel dataset contains diverse receipts, encompassing different layouts, fonts, styles and document characteristics encountered in real-world scenarios that have undergone pseudo-realistic manual forgeries.
One characteristic of the scanned receipts of this dataset is that some have been modified, either digitally or manually, with different types of annotations. These annotations are not considered as forgeries. Even though the documents have been modified, they are still authentic, as they have not undergone any forgery, and the modification doesn't compromise the meaning of the receipts. These annotations suit our case study, as most of them are context-specific notes found in real document applications. For instance, some annotations are consistent with notes left on the receipts, such as ''staff outing'' to describe the nature of the event (Figure 1), numbers that can describe a mission or a case number (or any contextual numerical information) (Figure 3), names (Figure 2) or markings to highlight key information on the document, such as the price. Many of such annotations might even come from the collection process of the dataset and are difficult to interpret without contextual queues (names, numbers, etc.).
The annotations in the Find it again! dataset provide detailed information about the forged and authentic receipts, including:
The dataset is organized into the following directory structure:
dataset/ ├── test/ │ ├── receipt1.png │ ├── receipt1.txt │ ├── receipt2.png │ ├── receipt2.txt │ ├── ... ├── train/ │ ├── receipt1.png │ ├── receipt1.txt │ ├── receipt2.png │ ├── receipt2.txt │ ├── ... ├── val/ │ ├── receipt1.png │ ├── receipt1.txt │ ├── receipt2.png │ ├── receipt2.txt │ ├── ... ├── train.txt ├── test.txt └── val.txt
The test
, train
and val
folders contain the PNG images of the receipts as well as their transcriptions. The train.txt
, test.txt
and val.txt
files list the filenames of the training, test and validation images, respectively along with their ground truth. We provide annotations on whether the authentic receipt contains any kind of digital or handwritten modification (digital annotation
and handwritten annotation
), if it has been forged and in case of forgery, the forgery annotations in JSON format, as shown below.
{'filename': 'X51005230616.png', 'size': 835401, 'regions':
[{'shape_attributes': {'name': 'rect', 'x': 27, 'y': 875,
'width': 29, 'height': 43},
'region_attributes': {'Modified area': {'IMI': True},
'Entity type': 'Product', 'Original area': 'no'}},
{'shape_attributes': {'name': 'rect', 'x': 458, 'y': 883,
'width': 35, 'height': 37},
'region_attributes': {'Modified area': {'IMI': True},
'Entity type': 'Product', 'Original area': 'no'}}],
'file_attributes': {'Software used': 'paint', 'Comment': ''}}
Note that not all region attributes correspond to modified areas, as some are original areas that have been copied (Check 'Original area': 'yes' to rule them out, or the modification type).
Beatriz Martínez Tornés, Théo Taburet, Emanuela Boros, Kais Rouis, Petra Gomez-Krämer, Nicolas Sidere, Antoine Doucet and Vincent Poulain d'Andecy. Receipt Dataset for Document Forgery Detection. In Proceedings of The 17th International Conference on Document Analysis and Recognition, August 21-26, 2023 — San José, California, USA.
We would like to thank the participants for their contribution to the creation of the dataset. This work was supported by the French defence innovation agency (AID), the VERINDOC project funded by the Nouvelle-Aquitaine Region and the LabCom IDEAS (ANR-18-LCV3-0008) funded by the French national research agency (ANR).