L3i-Share

Dataset	Short description and/or associated publication	More details / How to download?
AMADI_LontarSet	Made Windu Antara Kesiman, Jean-Christophe Burie, Jean-Marc Ogier, Gusti Ngurah Made Agus Wibawantara, I Made Gede Sunarya. "AMADI_LontarSet: The First Handwritten Balinese Palm Leaf Manuscripts Dataset", 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016, pp.168-172.	http://amadi.univ-lr.fr/ICFHR2016_Contest/index.php/download-123
AncientCoins	Florian Lardeux, Petra Gomez-Krämer, Sylvain Marchand. Low-complexity arrays of patch signature for efficient ancient coin retrieval. Pattern Analysis and Applications, 27(3), 2024.	https://l3i-share.univ-lr.fr/2024AncientCoins/index.html
CHAMDoc Dataset	T.N. Nguyen, J.C. Burie, T.L. Lan, A. V. Schweyer. An effective method for text line segmentation in historical document images. ICPR (2022)	https://l3i-share.univ-lr.fr/2022CHAMDoc/CHAMDoc.html/
Complex Maps Dataset	Q.B. Dang, M.M. Luqman, M. Coustaty, N. Nayef, C.D. Tran, J.M. Ogier - "A system for camera-based complex map image retrieval using a multi-layer approach", published in the IAPR International Workshop on Document Analysis Systems (2014)	https://navidomass.univ-lr.fr/MapDataset/ Ground truth
DCM772	Annotations of 772 public-domain comic books from the Digital Comics Museum, covering the positions of panels, balloons, and characters and the relations between them. Nhu Van Nguyen, Christophe Rigaud, Jean-Christophe Burie "Digital comics image indexing based on deep learning", published in the open access Journal of Imaging, volume 4, number 7, article number 89 (2018).	https://gitlab.univ-lr.fr/crigau02/dcm-dataset
DocQT	DocQT - Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables. Only header-extracted quantization matrices are provided. Ronfleux-Corail, K., Bernard, G., Coustaty, M., & Sidère, N. (2026). DocQT: Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables. arXiv preprint arXiv:2605.19688.	https://data.univ-lr.fr/datasets/814 https://zenodo.org/records/20341674 https://huggingface.co/datasets/Kyliroco/DocQT
eBDtheque	A representative database of annotated comic book images (100 images with 850 panels, 1092 speech balloons, 1550 comic characters and 4691 text lines). Clément Guérin, Christophe Rigaud, Antoine Mercier, Farid Ammar-Boudjelal, Karell Bertet, Alain Bouju, Jean-Christophe Burie, Georges Louis, Jean-Marc Ogier, Arnaud Revel "ebdtheque: a representative database of comics". In Proceedings of International Conference on Document Analysis and Recognition (ICDAR), pages 1145–1149, Washington DC, 2013	http://ebdtheque.univ-lr.fr
Evaluation of Administrative Document Segmentation into Color Layers	This contribution presents the design of a dataset used to evaluate the segmentation of color administrative documents into color layers. More information about the segmentation project is available on the website of the L3i laboratory (in French only). Elodie Carel, Jean-Christophe Burie, Vincent Courboulay, Jean-Marc Ogier, Vincent Poulain D'Andecy, "Multiresolution Approach Based on Adaptive Superpixels for Administrative Documents Segmentation into Color Layers", 13th International Conference on Document Analysis and Recognition (ICDAR15), Aug 2015, Nancy, France.	https://navidomass.univ-lr.fr/ColorSegmentationGT/
Find it!	Artaud Chloé, Sidère Nicolas, Doucet Antoine, Ogier Jean-Marc and Poulain d’Andecy Vincent, “Find it! Fraud Detection Contest Report”, 24th International Conference on Pattern Recognition, ICPR 2018, Beijing, China, August 20-24, 2018.	http://findit.univ-lr.fr/download-the-dataset/
Find it again!	The receipt forgery detection dataset contains 988 scanned images of receipts and their transcriptions, originating from the scanned receipts OCR and information extraction (SROIE) dataset. 163 images and their transcriptions have undergone realistic fraudulent modifications. Ground truth distinguishing forged from authentic receipts is provided, along with annotations of the fraudulent modifications that cover the entities that were modified and the location of the forgeries. Beatriz Martínez Tornés, Théo Taburet, Emanuela Boros, Kais Rouis, Petra Gomez-Krämer, Nicolas Sidere, Antoine Doucet and Vincent Poulain d'Andecy. Receipt Dataset for Document Forgery Detection, The 17th International Conference on Document Analysis and Recognition, August 21-26, 2023, San José, California, USA.	https://l3i-share.univ-lr.fr/2023Finditagain/index.html
FMIDV	Forged Mobile Identity Document Video dataset. The dataset contains 28k forged IDs for 10 countries, based on copy-move forgeries applied to the identity documents of the MIDV-2020 dataset. Musab Al-Ghadi; Zuheng Ming; Petra Gomez-Krämer; Jean-Christophe Burie; Mickaël Coustaty; Nicolas Sidere. Guilloche Detection for ID Authentication: A Dataset and Baselines. Published in: 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP), Poitiers, France, 2023, pp. 1-6, doi: 10.1109/MMSP59012.2023.10337681	https://l3i-share.univ-lr.fr/2022FMIDV/FMIDV_v3.htm
Forgery Detection Dataset	N. Sidère, F. Cruz, M. Coustaty and J-M Ogier, “A Dataset for Forgery Detection and Spotting in Document Images”, in Proc. of Seventh International Conference on Emerging Security Technologies (EST), Canterbury, UK, 2017	https://navidomass.univ-lr.fr/ForgeryDataset/
KhmerST	The KhmerST (Khmer Scene-Text) dataset is a new collection specifically designed to advance computer vision research focused on the Khmer script. This dataset comprises numerous images captured from various public places in Cambodia, including streets, signboards, supermarkets, and commercial establishments, all featuring text written in Khmer. KhmerST is the first dataset to include Khmer text in natural scene images, consisting of 1,544 annotated images—997 indoor and 547 outdoor—providing a diverse and intricate collection. The dataset presents challenges such as planar text, raised text, poorly lit text, distant text, and partially occluded text. Each image is annotated with line-level text and polygonal bounding box coordinates. Nom, V., Bakkali, S., Luqman, M. M., Coustaty, M., & Ogier, J. M. (2024). KhmerST: A Low-Resource Khmer Scene Text Detection and Recognition Benchmark. In Proceedings of the Asian Conference on Computer Vision (pp. 1777-1792).	https://gitlab.com/vannkinhnom123/khmerst
L3iDocCopies	Photocopies of magazine pages from the PRImA dataset (990 images of 55 documents) Eskenazi, S., Gomez-Krämer, P., and Ogier, J.-M. (2016). Evaluation of the stability of four document segmentation algorithms. In Document Analysis Systems (DAS), pages 1–6. IEEE.	https://l3i-share.univ-lr.fr/datasets/DocCopiesWebsite/DocCopiesDataset.html
L3iLayoutCopies	Photocopies of stable segmentation layouts (960 images of 15 layouts) Eskenazi, S., Gomez-Krämer, P., and Ogier, J.-M. (2015). The Delaunay document layout descriptor. In Symposium on Document Engineering (DocEng), pages 167–175. ACM	https://navidomass.univ-lr.fr/layoutcopies/
L3iTextCopies	Photocopies of text-only documents with 216 typographical variations per text (42 768 images). Eskenazi, S., Gomez-Krämer, P., and Ogier, J.-M. (2015). When document security brings new challenges to document analysis. In International Workshop on Computational Forensics (IWCF), pages 104–116. SPIE.	https://navidomass.univ-lr.fr/TextCopies/
L3iTextCopies-WordPatches	"L3iTextCopies-WordPatches" is an extended version of the L3iTextCopies dataset for font recognition by multi-task learning. Mondal, Tanmoy, Abhijit Das, and Zuheng Ming. "Exploring Multi-Tasking Learning in Document Attribute Classification." arXiv preprint arXiv:2108.13382 (2021)	https://l3i-share.univ-lr.fr/L3iTextCopies-WordPatches/L3iTextCopies-WordPatches.html
MIDV-2020	A comprehensive benchmark dataset for identity document analysis. K.B. Bulatov, E.V. Emelianova, D.V. Tropin, N.S. Skoryukina, Y.S. Chernyshova, A.V. Sheshkus, S.A. Usilin, Z. Ming, J.-C. Burie, M. M. Luqman, V.V. Arlazarov: “MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document Analysis”, Computer Optics (submitted), 2021.	https://l3i-share.univ-lr.fr/MIDV2020/midv2020.html
news_tracking (on historical documents)	In these datasets, we provide multiple features extracted from the text itself. Please note that the text is missing from the dataset published in CSV format for copyright reasons. You can download the original datasets and manually add the missing texts from the original publications. Bernard, Guillaume, Cyrille Suire, Cyril Faucher, and Antoine Doucet. 2021a. "A Comprehensive Extraction of Relevant Real-World-Event Qualifiers for Semantic Search Engines". In Proceedings of the 25th International Conference on Theory and Practice of Digital Libraries, 12866:153-64. Online: Springer. https://doi.org/10.1007/978-3-030-86324-1_19.	https://l3i-share.univ-lr.fr/2022_news_tracking/
OceanSimulation Dataset	The Ocean Simulation Dataset is a collection of 684 synthetic ocean videos equally distributed between different sea states. N. Paris, S. Marchand, P. Gomez-Krämer. Peak Wave Period and Direction Estimation Using 3D FFT on Monoscopic Videos. 28th International Conference on Pattern Recognition (ICPR 2026), Lyon - France, 17-22 August 2026.	https://l3i-share.univ-lr.fr/2026OceanSimulationDataset/
PostOCRCor 2017	The dataset is multilingual (50% French, 50% English, with possibly a few expressions borrowed from other languages such as Latin). It consists of texts published in monographs and periodicals during the last four centuries. It comprises more than 12 million characters, and includes both noisy OCR-ed texts and the corresponding Gold-Standard (GS), which has been aligned at the character level. Chiron G., Doucet A., Coustaty M., Moreux J-P. ICDAR2017 Competition on Post-OCR Text Correction. 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017	https://sites.google.com/view/icdar2017-postcorrectionocr
PostOCRCor 2019	The dataset accounts for 22M OCR-ed characters (754 025 tokens) along with the corresponding ground truth, with an unequal share of 10 European languages (Bulgarian, Czech, Dutch, English, Finnish, French, German, Polish, Spanish and Slovak). The digitized documents consist of newspapers, historical books and shopping receipts coming from different collections available, among others, in national libraries or universities. The corresponding GT comes from initiatives such as HIMANIS, IMPACT, IMPRESSO, Open data of National Library of Finland, GT4HistOCR and RECEIPT dataset. It includes both noisy OCR-ed texts and the corresponding Gold-Standard (GS), which has been aligned at the character level. Rigaud C., Doucet A., Coustaty M., Moreux J-P. ICDAR2019 Competition on Post-OCR Text Correction. 15th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2019	https://l3i.univ-larochelle.fr/ICDAR2019PostOCR https://sites.google.com/view/icdar2019-postcorrectionocr
RRC-MLT	ICDAR2017 Competition on Multi-lingual scene text detection and script identification. Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, Wafa Khlif, Muhammad Muzzamil Luqman, Jean-Christophe Burie, Cheng-Lin Liu, Jean-Marc Ogier: ICDAR2017 Robust Reading Challenge on Multi-Lingual Scene Text Detection and Script Identification - RRC-MLT. ICDAR 2017: 1454-1459	http://rrc.cvc.uab.es/?ch=8
SmartDoc	Jean-Christophe Burie, Joseph Chazalon, Mickaël Coustaty, Sébastien Eskenazi, Muhammad Muzzamil Luqman, Maroua Mehri, Nibal Nayef, Jean-Marc OGIER, Sophea Prum and Marçal Rusinol: “ICDAR2015 Competition on Smartphone Document Capture and OCR (SmartDoc)”, In 13th International Conference on Document Analysis and Recognition (ICDAR), 2015.	https://sites.google.com/site/icdar15smartdoc/home
SmartDoc-QA	Nibal Nayef, Muhammad Muzzamil Luqman, Sophea Prum, Sebastien Eskenazi, Joseph Chazalon, Jean-Marc Ogier: “SmartDoc-QA: A Dataset for Quality Assessment of Smartphone Captured Document Images - Single and Multiple Distortions”, Proceedings of the sixth international workshop on Camera Based Document Analysis and Recognition (CBDAR), 2015.	https://navidomass.univ-lr.fr/SmartDoc-QA/
SmartDoc-VideoCapture	Chazalon, J., Gomez-Krämer, P., Burie, J.C., Coustaty, M., Eskenazi, S., Luqman, M., Nayef, N., Rusinol, M., Sidere, N. and Ogier, J.M., 2017, November. SmartDoc 2017 Video Capture: Mobile Document Acquisition in Video Mode. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on (Vol. 4, pp. 11-16). IEEE.	http://smartdoc.univ-lr.fr
SSGCI	Jean-Christophe BURIE, Pasquale FOGGIA, Clément GUÉRIN, Thanh Nam LE, Muhammad Muzzamil LUQMAN, Jean-Marc OGIER and Christophe RIGAUD: “ICPR 2016 Competition on Subgraph Spotting in Graph Representations of Comic Book Images (SSGCI)”; in conjunction with the 23rd International Conference on Pattern Recognition in Cancun (Mexico), December 2016. Thanh Nam Le, Muhammad Muzzamil Luqman, Anjan Dutta, Pierre Héroux, Christophe Rigaud, Clément Guérin, Pasquale Foggia, Jean-Christophe Burie, Jean-Marc Ogier, Josep Lladós, Sébastien Adam, Subgraph spotting in graph representations of comic book images, Pattern Recognition Letters, Volume 112, 2018, Pages 118-124, ISSN 0167-8655, https://doi.org/10.1016/j.patrec.2018.06.017.	http://icpr2016-ssgci.univ-lr.fr
VisEmoCom	The VisEmoCom (Visual Emotion Recognition in Comics image) dataset contains annotations on the emotions expressed by characters in comic books. Although text represents a significant part of the content, the transcriptions of the dialogues are not included. The main goal of this dataset is to focus on visual elements, such as facial expressions or symbols, intentionally drawn by the artists to communicate a piece of information. For each targeted character, multiple annotators were asked for their interpretation of the expressed emotion. Ruddy Théodose and Jean-Christophe Burie. 2024. VisEmoComic: Visual Emotion Recognition in Comics Image. In Pattern Recognition: 27th International Conference, ICPR 2024, Kolkata, India, December 1–5, 2024, Proceedings, Part XIX. Springer-Verlag, Berlin, Heidelberg, 281–296. https://doi.org/10.1007/978-3-031-78495-8_18	https://l3i-share.univ-lr.fr/2024visemocomic
WildKhmerST	The WildKhmerST dataset is a new collection designed to support computer vision research on the Khmer script. It contains 29,601 annotated text lines from 10,000 images taken in various public places across Cambodia, such as streets, signboards, supermarkets, and commercial areas. The dataset features diverse and challenging text, including artistic, blurred, low-light, curved, and occluded text, as well as text on complex backgrounds. Each text line is carefully annotated with polygonal bounding box coordinates, transcriptions, and information about background complexity, character appearance, and text style. Vannkinh Nom, Saly Keo, Souhail Bakkali, Muhammad Muzzamil Luqman, Mickaël Coustaty, Marçal Rossinyol, and Jean-Marc Ogier. "WildKhmerST: A Comprehensive Dataset and Benchmark for Khmer Scene Text Detection and Recognition in the Wild". In International Conference on Document Analysis and Recognition (ICDAR 2025)	https://l3i-share.univ-lr.fr/2025WildKhmerST/