Dataset Short description and/or associated publication More details / How to download?
AMADI_LontarSet Made Windu Antara Kesiman, Jean-Christophe Burie, Jean-Marc Ogier, Gusti Ngurah Made Agus Wibawantara, I Made Gede Sunarya. "AMADI_LontarSet: The First Handwritten Balinese Palm Leaf Manuscripts Dataset", 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016, pp.168-172.
CHAMDoc Dataset T.N. Nguyen, J.C. Burie, T.L. Lan, A. V. Schweyer. An effective method for text line segmentation in historical document images. ICPR (2022)
Complex Maps Dataset Q.B. Dang, M.M. Luqman, M. Coustaty, N. Nayef, C.D. Tran, J.M. Ogier - "A system for camera-based complex map image retrieval using a multi-layer approach", published in IAPR International workshop on Document Analysis System (2014)

Ground truth
DCM772 Annotation of 772 public domain comic books from the Digital Comics Museum as panels, balloons and characters position and relations.

Nhu Van Nguyen, Christophe Rigaud, Jean-Christophe Burie "Digital comics image indexing based on deep learning", published in the open access Journal of Imaging, volume 4, number 7, article number 89 (2018).
eBDtheque A representative database of annotated comic book images (100 images with 850 panels, 1092 speech balloons, 1550 comic characters and 4691 text lines)

Clément Guérin, Christophe Rigaud, Antoine Mercier, Farid Ammar-Boudjelal, Karell Bertet, Alain Bouju, Jean-Christophe Burie, Georges Louis, Jean-Marc Ogier, Arnaud Revel "ebdtheque: a representative database of comics". In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), pages 1145–1149, Washington DC, 2013
Evaluation of Administrative Document Segmentation into Color Layers This contribution presents the design of a dataset for evaluating the segmentation of color administrative documents into color layers.
More information about the segmentation project is available on the website of the L3i laboratory (in French only).

Elodie Carel, Jean-Christophe Burie, Vincent Courboulay, Jean-Marc Ogier, Vincent Poulain D'Andecy, "Multiresolution Approach Based on Adaptive Superpixels for Administrative Documents Segmentation into Color Layers", 13th International Conference on Document Analysis and Recognition (ICDAR15), Aug 2015, Nancy, France.
Find it! Artaud Chloé, Sidère Nicolas, Doucet Antoine, Ogier Jean-Marc and Poulain d’Andecy Vincent, “Find it! Fraud Detection Contest Report”, 24th International Conference on Pattern Recognition, ICPR 2018, Beijing, China, August 20-24, 2018.
Find it again! The receipt forgery detection dataset contains 988 scanned images of receipts and their transcriptions, originating from the scanned receipts OCR and information extraction (SROIE) dataset. 163 images and their transcriptions have undergone realistic fraudulent modifications. Ground truth distinguishing forged from authentic receipts is provided, along with annotations on the fraudulent modifications that cover the entities that were modified and the location of the forgeries.

Beatriz Martínez Tornés, Théo Taburet, Emanuela Boros, Kais Rouis, Petra Gomez-Krämer, Nicolas Sidere, Antoine Doucet and Vincent Poulain d'Andecy. Receipt Dataset for Document Forgery Detection, The 17th International Conference on Document Analysis and Recognition, August 21-26, 2023 — San José, California, USA.
FMIDV Forged Mobile Identity Document Video dataset

The dataset contains 28k forged IDs for 10 countries, created by applying copy-move forgeries to the identity documents of the MIDV-2020 dataset.

Musab Al-Ghadi; Zuheng Ming; Petra Gomez-Krämer; Jean-Christophe Burie; Mickaël Coustaty; Nicolas Sidere. Guilloche Detection for ID Authentication: A Dataset and Baselines. Published in: 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP), Poitiers, France, 2023, pp. 1-6, doi: 10.1109/MMSP59012.2023.10337681 [PDF]
Forgery Detection Dataset N. Sidère, F. Cruz, M. Coustaty and J-M Ogier, “A Dataset for Forgery Detection and Spotting in Document Images”, in Proc. of Seventh International Conference on Emerging Security Technologies (EST), Canterbury, UK, 2017
L3iDocCopies Photocopies of magazine pages from the PRImA dataset (990 images of 55 documents)

Eskenazi, S., Gomez-Krämer, P., and Ogier, J.-M. (2016). Evaluation of the stability of four document segmentation algorithms. In Document Analysis Systems (DAS), pages 1–6. IEEE.
L3iLayoutCopies Photocopies of stable segmentation layouts (960 images of 15 layouts)

Eskenazi, S., Gomez-Krämer, P., and Ogier, J.-M. (2015). The Delaunay document layout descriptor. In Symposium on Document Engineering (DocEng), pages 167–175. ACM
L3iTextCopies Photocopies of text only documents with 216 typographical variations per text (42 768 images)

Eskenazi, S., Gomez-Krämer, P., and Ogier, J.-M. (2015). When document security brings new challenges to document analysis. In International Workshop on Computational Forensics (IWCF), pages 104–116. SPIE.
L3iTextCopies-WordPatches An extended version of the L3iTextCopies dataset for font recognition by multi-task learning.

Mondal, Tanmoy, Abhijit Das, and Zuheng Ming. "Exploring Multi-Tasking Learning in Document Attribute Classification." arXiv preprint arXiv:2108.13382 (2021) [PDF]
(on historical documents)
In these datasets, we provide multiple features extracted from the text itself. Please note that, for copyright reasons, the text is omitted from the datasets published in CSV format. You can download the original datasets and manually add the missing text from the original publications.

Bernard, Guillaume, Cyrille Suire, Cyril Faucher, and Antoine Doucet. 2021a. "A Comprehensive Extraction of Relevant Real-World-Event Qualifiers for Semantic Search Engines". In Proceedings of the 25th International Conference on Theory and Practice of Digital Libraries, 12866:153‑64. Online: Springer.
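The manual step of adding the omitted text back to the CSV files could be scripted. The following is only a minimal sketch: the column names `doc_id` and `text`, the function name `add_missing_text`, and the idea of keying the texts by a document identifier are all assumptions made for illustration, since the actual CSV layout is not described here. It assumes you have already obtained the texts yourself from the original publications.

```python
import csv
import io


def add_missing_text(csv_in, texts_by_id, id_column="doc_id", text_column="text"):
    """Return CSV content with a text column filled in from texts_by_id.

    id_column and text_column are hypothetical names; adapt them to the
    columns actually present in the downloaded files.
    """
    reader = csv.DictReader(io.StringIO(csv_in))
    fieldnames = list(reader.fieldnames or [])
    if text_column not in fieldnames:
        fieldnames.append(text_column)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Leave the cell empty when no text is available for this row.
        row[text_column] = texts_by_id.get(row[id_column], "")
        writer.writerow(row)
    return out.getvalue()


sample = "doc_id,n_tokens\na1,12\nb2,7\n"
print(add_missing_text(sample, {"a1": "hello world"}))
```

Rows without a matching entry keep an empty text cell, so the extracted features stay usable even when only part of the text has been recovered.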
MIDV-2020 A Comprehensive Benchmark Dataset for Identity Document Analysis

K.B. Bulatov, E.V. Emelianova, D.V. Tropin, N.S. Skoryukina, Y.S. Chernyshova, A.V. Sheshkus, S.A. Usilin, Z. Ming, J.-C. Burie, M. M. Luqman, V.V. Arlazarov: “MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document Analysis”, Computer Optics (submitted), 2021.
PostOCRCor 2017 The dataset is multilingual (50% French, 50% English, possibly with a few expressions borrowed from other languages such as Latin). It consists of texts published in monographs and periodicals over the last four centuries. It contains more than 12 million characters and includes both noisy OCR-ed texts and the corresponding Gold-Standard (GS), aligned at the character level.

Chiron G., Doucet A., Coustaty M., Moreux J-P. ICDAR2017 Competition on Post-OCR Text Correction. 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017
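Because the PostOCRCor data pairs each noisy OCR-ed text with a Gold-Standard aligned at the character level, an error rate can be estimated by comparing the two strings position by position. The sketch below is illustrative only: the function name `aligned_error_rate` and the `@` padding symbol are assumptions, and the alignment conventions should be checked against the dataset's own documentation.

```python
def aligned_error_rate(ocr_aligned, gs_aligned, pad="@"):
    """Fraction of mismatching positions, relative to the number of GS characters.

    Both strings must already be aligned (same length). The padding symbol
    is an assumption; it marks positions with no counterpart in the other
    string.
    """
    if len(ocr_aligned) != len(gs_aligned):
        raise ValueError("aligned strings must have the same length")

    # Count real Gold-Standard characters, ignoring padding.
    gs_chars = sum(1 for g in gs_aligned if g != pad)
    if gs_chars == 0:
        return 0.0

    # Every position where the two aligned strings differ is one error
    # (substitution, insertion or deletion).
    errors = sum(1 for o, g in zip(ocr_aligned, gs_aligned) if o != g)
    return errors / gs_chars


print(aligned_error_rate("he1lo w@rld", "hello world"))
```

Normalizing by the Gold-Standard length rather than the aligned length follows the usual character error rate convention; other normalizations are possible.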
PostOCRCor 2019 The dataset contains 22M OCR-ed characters (754 025 tokens) along with the corresponding ground truth (GT), unequally distributed across 10 European languages (Bulgarian, Czech, Dutch, English, Finnish, French, German, Polish, Spanish and Slovak). The digitized documents consist of newspapers, historical books and shopping receipts from various collections available, among others, in national libraries or universities. The corresponding GT comes from initiatives such as HIMANIS, IMPACT, IMPRESSO, the Open Data of the National Library of Finland, GT4HistOCR and the RECEIPT dataset. It includes both noisy OCR-ed texts and the corresponding Gold-Standard (GS), aligned at the character level.

Rigaud C., Doucet A., Coustaty M., Moreux J-P. ICDAR2019 Competition on Post-OCR Text Correction. 15th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2019
RRC-MLT ICDAR2017 Competition on Multi-lingual scene text detection and script identification.

Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, Wafa Khlif, Muhammad Muzzamil Luqman, Jean-Christophe Burie, Cheng-Lin Liu, Jean-Marc Ogier: ICDAR2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification - RRC-MLT. ICDAR 2017: 1454-1459
SmartDoc Jean-Christophe Burie, Joseph Chazalon, Mickaël Coustaty, Sébastien Eskenazi, Muhammad Muzzamil Luqman, Maroua Mehri, Nibal Nayef, Jean-Marc OGIER, Sophea Prum and Marçal Rusinol: “ICDAR2015 Competition on Smartphone Document Capture and OCR (SmartDoc)”, In 13th International Conference on Document Analysis and Recognition (ICDAR), 2015.
SmartDoc-QA Nibal Nayef, Muhammad Muzzamil Luqman, Sophea Prum, Sebastien Eskenazi, Joseph Chazalon, Jean-Marc Ogier: “SmartDoc-QA: A Dataset for Quality Assessment of Smartphone Captured Document Images - Single and Multiple Distortions”, Proceedings of the sixth international workshop on Camera Based Document Analysis and Recognition (CBDAR), 2015.
SmartDoc-VideoCapture Chazalon, J., Gomez-Krämer, P., Burie, J.C., Coustaty, M., Eskenazi, S., Luqman, M., Nayef, N., Rusinol, M., Sidere, N. and Ogier, J.M., 2017, November. SmartDoc 2017 Video Capture: Mobile Document Acquisition in Video Mode. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on (Vol. 4, pp. 11-16). IEEE.
SSGCI Jean-Christophe BURIE, Pasquale FOGGIA, Clément GUÉRIN, Thanh Nam LE, Muhammad Muzzamil LUQMAN, Jean-Marc OGIER and Christophe RIGAUD: “ICPR 2016 Competition on Subgraph Spotting in Graph Representations of Comic Book Images (SSGCI)”; in conjunction with the 23rd International Conference on Pattern Recognition in Cancun (Mexico), December 2016.

Thanh Nam Le, Muhammad Muzzamil Luqman, Anjan Dutta, Pierre Héroux, Christophe Rigaud, Clément Guérin, Pasquale Foggia, Jean-Christophe Burie, Jean-Marc Ogier, Josep Lladós, Sébastien Adam, Subgraph spotting in graph representations of comic book images, Pattern Recognition Letters, Volume 112, 2018, Pages 118-124, ISSN 0167-8655,