Identification of potential Lyme disease cases using self-reported worldwide tweets: A deep learning modeling approach enhanced with sentimental words via emojis
Background: Effective surveillance for Lyme disease, a disease commonly transmitted by ticks worldwide, necessitates prompt medical diagnosis and precise laboratory testing. Web-based data sources could be used to enhance surveillance.
To better understand Twitter's potential and limitations as a tool for Lyme disease surveillance, we evaluate data from Twitter users worldwide. We also propose a transformer-based classification method that identifies potential Lyme disease cases from self-reported tweets.
Methods: We selected an initial sample of 20,000 tweets from around the world from a database containing over 1.3 million tweets about Lyme disease. After preprocessing and geolocating the tweets, we manually labeled a subset of the initial sample as either potential Lyme disease cases or not, using carefully chosen keywords. To account for emojis, we converted the emojis in these tweets to sentiment words and substituted those words into the tweets where needed. The DistilBERT, ALBERT, and BERTweet classifiers were then trained, validated, and tested on this set of labeled tweets.
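The emoji-to-sentiment-word substitution described in the Methods can be illustrated with a minimal sketch. The lexicon `EMOJI_SENTIMENT` and the function `replace_emojis` are hypothetical names introduced only for illustration; the mapping used in the study is not given in this abstract, so the entries below are assumptions.

```python
# Hypothetical emoji-to-sentiment lexicon (an assumption for illustration;
# the study's actual mapping is not reproduced here).
EMOJI_SENTIMENT = {
    "\U0001F622": "sad",         # crying face
    "\U0001F64F": "hopeful",     # folded hands
    "\U0001F4AA": "encouraged",  # flexed biceps
    "\u2764\ufe0f": "love",      # red heart (with variation selector)
}


def replace_emojis(text, lexicon=EMOJI_SENTIMENT):
    """Replace each known emoji in `text` with its sentiment word."""
    # Process longer emoji sequences first so that multi-codepoint emojis
    # are matched before any shorter sequence they might contain.
    for emoji in sorted(lexicon, key=len, reverse=True):
        text = text.replace(emoji, f" {lexicon[emoji]} ")
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    print(replace_emojis("Diagnosed with Lyme today \U0001F622"))
```

The resulting text can then be tokenized like any other tweet, so the sentiment carried by the emoji reaches the classifier as ordinary words.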
Results: The empirical findings demonstrated that BERTweet was the best classifier among the models examined, with the highest average F1-score (89.3%), classification accuracy (90.0%), and precision (97.1%). Recall, on the other hand, was better for TF-IDF and k-NN, with 93.2% and 82.6%, respectively. Enriching the tweet embeddings with emojis resulted in an 8% increase in recall for BERTweet. DistilBERT reached a higher F1-score of 93.8% (+4%) and a classification accuracy of 94.1% (+4%), whereas ALBERT reached an F1-score of 93.1% (+5%) and a classification accuracy of 93.9% (+5%).
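For readers less familiar with the reported metrics, the sketch below shows how precision, recall, F1-score, and accuracy follow from the counts of a binary confusion matrix. The function name `classification_metrics` and the counts in the usage example are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts.

    tp: true positives, fp: false positives,
    fn: false negatives, tn: true negatives.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation.
    print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

Because recall is tp / (tp + fn), any change that turns false negatives into true positives raises recall directly, which is the sense in which a reduction in false negatives improves it.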
Conclusions: The study highlights the robustness of BERTweet and DistilBERT as classifiers for potential cases of Lyme disease from self-reported data. The results show that emojis are effective for enriching features, thereby improving the quality of the tweet embeddings and the performance of the classifiers. Specifically, emojis that reflect sadness, empathy, and encouragement can reduce false negatives.