news_tracking (on historical documents)
This archive gathers all the resources and code produced by Guillaume Bernard during his PhD (2019–2022) in the L3i laboratory.
The thesis is published (in French) on public repositories: see STAR/HAL.
Reading this document and the published papers may help in understanding the content of this repository.
Note: this repository duplicates what is hosted on Software Heritage (source code) and on Zenodo (other resources such as datasets).
This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant 770299 (NewsEye).
The authors would like to thank the Polytechnic University of València (UPV), Spain, which made this work possible, and its IT laboratory, DSIC.
If you wish to access this data, first check Zenodo; if it is missing there, contact l3i DASH pn AT univ-lr DOT fr.
In these datasets, we provide multiple features extracted from the text itself. Please note that the text is missing from the datasets published in CSV format, for copyright reasons. You can download the original datasets and manually add the missing texts from the original publications.
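As a minimal sketch of re-attaching the missing texts, the feature rows can be joined with the original publications on a shared document identifier. The column names `id`, `tokens` and `text` below are assumptions for illustration, not the actual dataset schema:

```python
import csv
import io

# Hypothetical sketch: re-attach the original texts (obtained separately
# from the original publications) to the published features CSV, joining
# on a shared document identifier. Column names are illustrative only.
def add_missing_texts(features_csv: str, id_to_text: dict) -> list:
    """Return feature rows with a 'text' column filled from id_to_text."""
    rows = []
    for row in csv.DictReader(io.StringIO(features_csv)):
        # Leave the text empty when no original document is available.
        row["text"] = id_to_text.get(row["id"], "")
        rows.append(row)
    return rows

features = "id,tokens\n42,covid vaccine trial\n"
texts = {"42": "A COVID vaccine trial started today."}
merged = add_missing_texts(features, texts)
```

In practice the join key and column names must be adapted to the layout of each published CSV.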
Features are extracted using:
A corpus of reference articles in multiple languages for TF-IDF weighting (features_news) [1]
A corpus of tweets reporting news for TF-IDF weighting (features_tweets) [1]
An S-BERT model [2] based on distiluse-base-multilingual-cased-v1 (called features_use) [3]
An S-BERT model [2] based on paraphrase-multilingual-mpnet-base-v2 (called features_mpnet) [4]
References:
[1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406
[2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks". In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–92. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.
[3]: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1
[4]: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
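To illustrate the sparse side of the feature extraction, here is a minimal, self-contained TF-IDF sketch over a toy corpus. The actual pipeline uses the reference corpora of [1] and the compute-tf-idf-vectors program; this is only the underlying weighting scheme:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Toy TF-IDF: docs is a list of token lists; returns one dict per doc."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency normalised by document length, times log inverse
        # document frequency.
        vectors.append(
            {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        )
    return vectors

docs = [["election", "results"], ["election", "fraud"], ["weather", "report"]]
vecs = tf_idf(docs)
```

Terms that occur in every document receive a weight of zero, while rarer terms are boosted; production implementations typically add smoothing and vector normalisation on top of this.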
This is a republication of the Event Registry dataset originally published by:
Rupnik, Jan, Andrej Muhic, Gregor Leban, Primoz Skraba, Blaz Fortuna, and Marko Grobelnik. 2016. "News Across Languages - Cross-Lingual Document Similarity and Event Tracking". Journal of Artificial Intelligence Research 55 (January): 283–316. https://doi.org/10.1613/jair.4780.
And reorganised for document tracking by:
Miranda, Sebastião, Artūrs Znotiņš, Shay B. Cohen, and Guntis Barzdins. 2018. "Multilingual Clustering of Streaming News". In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4535–44. Brussels, Belgium: Association for Computational Linguistics. https://www.aclweb.org/anthology/D18-1483/.
This is a publication of the CoAID dataset, originally dedicated to fake news detection. We repurposed this dataset for use in the context of event tracking in press documents.
Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset". ArXiv:2006.00885 [Cs], November. http://arxiv.org/abs/2006.00885.
This is a publication of the FibVid dataset, originally dedicated to fake news detection. We repurposed this dataset for use in the context of event tracking in press documents.
Kim, Jisu, Jihwan Aum, SangEun Lee, Yeonju Jang, Eunil Park, and Daejin Choi. 2021. "FibVID: Comprehensive Fake News Diffusion Dataset during the COVID-19 Period". Telematics and Informatics 64 (November): 101688. https://doi.org/10.1016/j.tele.2021.101688.
This is the same content as:
Guillaume Bernard. (2022). Event Registry dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630367
But with titles of articles only.
Some degradations are applied using the DocCreator [1] tool in order to degrade the texts and to reproduce some common errors found in OCRised documents [2].
[1]: Journet, Nicholas, Muriel Visani, Boris Mansencal, Kieu Van-Cuong, and Antoine Billy. 2017. "DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images". Journal of Imaging 3 (4): 62. https://doi.org/10.3390/jimaging3040062.
[2]: Linhares Pontes, Elvys, Ahmed Hamdi, Nicolas Sidere, and Antoine Doucet. 2019. "Impact of OCR Quality on Named Entity Linking". In Digital Libraries at the Crossroads of Digital Information for the Future, 11853:102–15. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-34058-2_11.
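For intuition only: DocCreator degrades document *images*, which are then re-OCRed. The toy function below mimics the downstream effect directly on text, with a made-up confusion table and error rate; it is not the method used to build the published datasets:

```python
import random

# Illustrative character-confusion table; real OCR errors depend on the
# font, the degradation type and the OCR engine.
CONFUSIONS = {"e": "c", "l": "1", "o": "0", "i": "!"}

def degrade(text: str, rate: float, seed: int = 0) -> str:
    """Replace confusable characters with probability `rate` (toy model)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )

noisy = degrade("breaking election news", rate=1.0)
```

Varying `rate` gives a crude analogue of the "character degradation" level reported in the tables below, whereas the actual pipeline degrades rendered images and measures the resulting OCR errors.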
This is the text of the CoAID dataset, originally dedicated to fake news detection, updated to be used in event detection.
Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset". ArXiv:2006.00885 [Cs], November. http://arxiv.org/abs/2006.00885.
Guillaume Bernard. (2022). CoAID dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630405
The results of the OCR degradations are as follows:
| Dataset | Metric | Without | Character degradation | Phantom degradation | Bleed | Blur | All |
|---|---|---|---|---|---|---|---|
| CoAID | CER | 2.105 | 6.358 | 2.105 | 2.122 | 2.616 | 7.898 |
| CoAID | WER | 2.494 | 20.230 | 2.496 | 2.580 | 3.726 | 20.230 |
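The CER and WER figures above are percentages of character and word errors relative to the undegraded reference. A minimal sketch of how such scores can be computed with the standard Levenshtein edit distance (the exact tooling behind the published numbers is not specified here):

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,          # deletion
                cur[j - 1] + 1,       # insertion
                prev[j - 1] + (x != y),  # substitution (0 if equal)
            ))
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate, as a percentage of the reference length."""
    return 100 * levenshtein(ref, hyp) / len(ref)

def wer(ref: str, hyp: str) -> float:
    """Word error rate, computed on whitespace-split tokens."""
    return 100 * levenshtein(ref.split(), hyp.split()) / len(ref.split())
```

For example, one substituted character in a four-character reference yields a CER of 25%.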
This is the text of the FibVid dataset, originally dedicated to fake news detection, updated to be used in event detection.
Kim, Jisu, Jihwan Aum, SangEun Lee, Yeonju Jang, Eunil Park, and Daejin Choi. 2021. "FibVID: Comprehensive Fake News Diffusion Dataset during the COVID-19 Period". Telematics and Informatics 64 (November): 101688. https://doi.org/10.1016/j.tele.2021.101688.
Guillaume Bernard. (2022). Fibvid dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630409
| Dataset | Metric | Without | Character degradation | Phantom degradation | Bleed | Blur | All |
|---|---|---|---|---|---|---|---|
| FibVid | CER | 1.463 | 6.089 | 1.461 | 1.467 | 1.935 | 6.359 |
| FibVid | WER | 2.065 | 20.797 | 2.041 | 2.052 | 2.868 | 21.396 |
This is the text of the Event Registry titles:
Rupnik, Jan, Andrej Muhic, Gregor Leban, Primoz Skraba, Blaz Fortuna, and Marko Grobelnik. 2016. "News Across Languages - Cross-Lingual Document Similarity and Event Tracking". Journal of Artificial Intelligence Research 55 (January): 283–316. https://doi.org/10.1613/jair.4780.
Miranda, Sebastião, Artūrs Znotiņš, Shay B. Cohen, and Guntis Barzdins. 2018. "Multilingual Clustering of Streaming News". In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4535–44. Brussels, Belgium: Association for Computational Linguistics. https://www.aclweb.org/anthology/D18-1483/.
Guillaume Bernard. (2022). Event Registry titles only dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630447
The results of the OCR degradations are as follows:
| Dataset | Metric | Without | Character degradation | Phantom degradation | Bleed | Blur | All |
|---|---|---|---|---|---|---|---|
| Event Registry (titles) | CER | 2.421 | 6.940 | 2.414 | 2.422 | 2.874 | |
| Event Registry (titles) | WER | 1.127 | 19.785 | 1.124 | 1.131 | 2.035 | |
This is the text of the Event Registry dataset, updated to be used in event detection.
Rupnik, Jan, Andrej Muhic, Gregor Leban, Primoz Skraba, Blaz Fortuna, and Marko Grobelnik. 2016. "News Across Languages - Cross-Lingual Document Similarity and Event Tracking". Journal of Artificial Intelligence Research 55 (January): 283–316. https://doi.org/10.1613/jair.4780.
Miranda, Sebastião, Artūrs Znotiņš, Shay B. Cohen, and Guntis Barzdins. 2018. "Multilingual Clustering of Streaming News". In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4535–44. Brussels, Belgium: Association for Computational Linguistics. https://www.aclweb.org/anthology/D18-1483/.
Guillaume Bernard. (2022). Event Registry titles only dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630447
The results of the OCR degradations are as follows:
| Dataset | Metric | Without | Character degradation | Phantom degradation | Bleed | Blur | All |
|---|---|---|---|---|---|---|---|
| Event Registry | CER | 0.282 | 4.154 | 0.274 | 0.275 | 0.582 | 4.577 |
| Event Registry | WER | 0.552 | 16.364 | 0.551 | 0.548 | 1.159 | 16.974 |
This is the same dataset as:
Guillaume Bernard. (2022). Fibvid dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630409
But with texts degraded by OCR as described in:
Guillaume Bernard. (2022). FibVid dataset texts with OCR degradations (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630758
This is the same dataset as:
Guillaume Bernard. (2022). Event Registry dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630367
But with texts degraded by OCR as described in:
Guillaume Bernard. (2022). Event Registry dataset texts with OCR degradations and synthesised segmentation (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6631305
This is the same dataset as:
Guillaume Bernard. (2022). Event Registry titles only dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630447
But with texts degraded by OCR as described in:
Guillaume Bernard. (2022). Event Registry titles dataset texts with OCR degradations (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630828
This is the same dataset as:
Guillaume Bernard. (2022). CoAID dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630405
But with texts degraded by OCR as described in:
Guillaume Bernard. (2022). CoAID dataset texts with OCR degradations (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630710
These two datasets of features are used to compute TF-IDF weightings of documents. They are meant to be used with the compute-tf-idf-vectors program, written in Python and available on PyPI.
features_tweets.csv contains features (tokens, lemmas and entities) extracted from tweets published by press agencies in French, German, Spanish and English.
features_news.csv contains features (tokens, lemmas and entities) extracted from articles published by Deutsche Welle in the same languages.
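As a sketch of consuming such a features file, the rows can be aggregated into per-feature counts before weighting. The column names `feature` and `count` below are assumptions for illustration; the actual layout of features_news.csv and features_tweets.csv may differ:

```python
import csv
import io
from collections import Counter

# Hypothetical layout: one row per extracted feature occurrence, with a
# "feature" string and an integer "count". Adapt to the real CSV schema.
def aggregate_features(csv_text: str) -> Counter:
    """Sum the counts of each feature across all rows of a features CSV."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["feature"]] += int(row["count"])
    return counts

sample = "feature,count\nelection,3\nelection,2\nweather,1\n"
freqs = aggregate_features(sample)
```

Such aggregated counts are the kind of input a TF-IDF weighting step (e.g. the compute-tf-idf-vectors program) would build its document-frequency statistics from.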