L3iTextCopies-WordPatches: dataset for font recognition


This webpage presents L3iTextCopies-WordPatches, an extended version of the L3iTextCopies dataset built for font recognition by multi-task learning ("A Hybrid Multi-Tasking and Multi-Instance Convolution Neural Network for Joint Learning in Document Attribute Classification").

The underlying L3iTextCopies dataset consists of clean, text-only, typewritten documents derived from 22 actual pages with the following characteristics: 1 page of a scientific article with a single-column header and a double-column body; 3 pages of scientific articles with a double-column layout; 2 pages of programming code with a single-column layout; 4 pages of a novel with a single-column layout; 2 pages of legal texts with a single-column layout; 4 pages of invoices with a single-column layout; 4 pages of payslips with a single-column layout; 2 pages of birth extracts with a single-column layout.

Variants of these 22 pages are created by combining 6 fonts (Arial, Calibri, Courier, Times New Roman, Trebuchet, Verdana), 3 font sizes (8, 10 and 12 points) and 4 emphases (normal, bold, italic, and the combination of bold and italic), which gives a total of 1584 documents. These documents were printed on three printers (Konica Minolta Bizhub 223, Sharp MX M904 and Sharp MX M850), and the printouts were scanned on three scanners at three resolutions (150 dpi, 300 dpi and 600 dpi), which yields a complete dataset of 42768 document images. Of these images, 70% are used for training, 10% for validation and 20% for testing.

To obtain the word images, we apply Tesseract OCR to detect word boundaries and crop the detected words from all the document images. To avoid noisy elements, we keep only word images larger than 15*15 pixels.
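
The document and image counts quoted above can be re-derived from the listed factors. The short Python sketch below does only that arithmetic. It assumes the four emphases are normal, bold, italic and bold-italic, and it represents the scanners only by their count because their models are not named here; the variable names are illustrative.

```python
# Factors as listed in the dataset description.
N_PAGES = 22
FONTS = ["Arial", "Calibri", "Courier", "Times New Roman", "Trebuchet", "Verdana"]
FONT_SIZES_PT = [8, 10, 12]
# Assumption: the fourth emphasis is plain italic.
EMPHASES = ["normal", "bold", "italic", "bold-italic"]
PRINTERS = ["Konica Minolta Bizhub 223", "Sharp MX M904", "Sharp MX M850"]
N_SCANNERS = 3  # scanner models are not named in the description
RESOLUTIONS_DPI = [150, 300, 600]

# One document per (page, font, size, emphasis) combination.
n_documents = N_PAGES * len(FONTS) * len(FONT_SIZES_PT) * len(EMPHASES)

# Each document is printed on every printer and scanned on every
# scanner at every resolution.
scans_per_document = len(PRINTERS) * N_SCANNERS * len(RESOLUTIONS_DPI)
n_images = n_documents * scans_per_document

print(n_documents)  # 1584
print(n_images)     # 42768
```
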
To obtain patches from a whole document image, we crop windows of 224*224 pixels (the standard input image size of ResNet), sliding the window by 112 pixels in the horizontal and vertical directions.
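
As an illustration of the sliding-window cropping described above, the following Python sketch enumerates the patch coordinates for an image of a given size. It is not the script used to build the dataset: the function name `patch_boxes` is hypothetical, and it assumes that windows which do not fit entirely inside the image are discarded, a border policy this page does not specify.

```python
def patch_boxes(width, height, window=224, stride=112):
    """Return (left, top, right, bottom) boxes for a sliding window.

    Assumption: only windows lying fully inside the image are kept.
    """
    boxes = []
    # Move the window top to bottom, then left to right, by `stride` pixels.
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            boxes.append((left, top, left + window, top + window))
    return boxes


# Example on a small 448*336 image: 3 horizontal * 2 vertical positions.
boxes = patch_boxes(448, 336)
print(len(boxes))  # 6
print(boxes[0])    # (0, 0, 224, 224)
```

Each box uses the (left, upper, right, lower) order, so it could be passed directly to a cropping call such as Pillow's `Image.crop`. With a stride of half the window size, neighbouring patches overlap by 50%.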

Any use of this dataset must cite the following reference:
Mondal, Tanmoy, Abhijit Das, and Zuheng Ming. "Exploring Multi-Tasking Learning in Document Attribute Classification." arXiv preprint arXiv:2108.13382 (2021).


How to download this dataset?

The dataset is 157 GB in size and is hosted on an SFTP server of the University of La Rochelle (France). Please fill in the following form to request access. You must accept the dataset licence to obtain access.

https://docs.google.com/forms/d/e/1FAIpQLSdxB1gvdVlRcARUlMolTJzyqY93XBZHhwiBwkDx8BDyMIPWIg/viewform?vc=0&c=0&w=1&flr=0