WildKhmerST:
A Comprehensive Dataset and Benchmark
for Khmer Scene Text Detection and Recognition in the Wild

WildKhmerST

The WildKhmerST dataset is a new collection designed to support computer vision research on the Khmer script. It contains 29,601 annotated text lines from 10,000 images taken in various public places across Cambodia, such as streets, signboards, supermarkets, and commercial areas. The dataset features diverse and challenging text, including artistic, blurred, low-light, curved, and occluded text, as well as text on complex backgrounds. Each text line is carefully annotated with polygonal bounding box coordinates, transcriptions, and information about background complexity, character appearance, and text style.

Dataset Annotation

The annotations for each image are stored in JSON format, which gives a clear, hierarchical representation of the attributes. In this structure, the vertices of each polygon are stored as two parallel arrays of x and y coordinates, such as "all_points_x": [x1, x2, x3, x4] and "all_points_y": [y1, y2, y3, y4]. Each polygon is associated with line-level text data, which provides the transcription and labels for the region it encloses. Below is a sample image with its corresponding JSON file.
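As an illustration of how the coordinate arrays described above might be consumed, here is a minimal Python sketch that pairs the x and y arrays into polygon vertices and derives an axis-aligned bounding box. Only the key names "all_points_x" and "all_points_y" come from the description above. The record layout (a dict holding just those two keys), the function names, and the coordinate values are assumptions made for illustration, not the dataset's documented schema.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def polygon_points(shape: Dict[str, List[int]]) -> List[Point]:
    """Pair the parallel x/y arrays into a list of (x, y) vertices."""
    xs = shape["all_points_x"]
    ys = shape["all_points_y"]
    if len(xs) != len(ys):
        raise ValueError("all_points_x and all_points_y must have the same length")
    return list(zip(xs, ys))


def bounding_box(points: List[Point]) -> Tuple[int, int, int, int]:
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) around the polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical polygon: the coordinate values are made up for illustration.
example_shape = {
    "all_points_x": [10, 110, 112, 12],
    "all_points_y": [20, 18, 58, 60],
}
example_points = polygon_points(example_shape)
example_box = bounding_box(example_points)
```

Keeping the vertices as (x, y) pairs makes it straightforward to pass them to drawing or geometry libraries. The real JSON files may nest these arrays under additional keys, so the lookup in polygon_points would need to be adapted to the actual structure.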

Dataset Availability

  • The dataset is available for non-commercial research and educational purposes only.
  • Please cite the paper listed in the Citation section when using the dataset in your research.

Download Information

The WildKhmerST dataset can be downloaded in the following ways:

  • WildKhmerST Dataset: download from the following SFTP server (an SFTP client such as FileZilla makes this easier):

    • Server URL: sftp://l3i-share.univ-lr.fr/
    • Server Port: 22
    • Login: WildKhmerST_guest
    • Password: *%7S8xXq_prJ46

    If you prefer, you can also download the dataset from Kaggle:

Citation

If you use the WildKhmerST dataset in your research, please cite the following paper:

Vannkinh Nom, Saly Keo, Souhail Bakkali, Muhammad Muzzamil Luqman, Mickaël Coustaty, Marçal Rossinyol, and Jean-Marc Ogier. "WildKhmerST: A Comprehensive Dataset and Benchmark for Khmer Scene Text Detection and Recognition in the Wild". ICDAR (2025).

Contact Us

If you have any questions or need further information, please contact us at saly.keo@cadt.edu.kh or vannkinh.nom@cadt.edu.kh.

License

© 2025. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).

For more details, visit the Creative Commons website.

Acknowledgements

This research is funded by the French Government Scholarship, in partnership with the Cambodia Academy of Digital Technology (CADT), and supported by La Rochelle University, Laboratoire Informatique Image Interaction (L3i).