Overview

Changes in the meaning of information as it passes through cyberspace can mislead those who access the information. This project develops new datasets and algorithms to identify and categorize medical information based on whether it remains true to the original meaning or undergoes distortion. Instead of imposing an external true/false label on this information, this project looks into a series of changes within the news coverage itself that gradually lead to a deviation from the original medical claims. Identifying important differences between original medical articles and news stories is a challenging, high-risk, high-reward venture. Broader impacts of this work include benefits to the research community through novel contributions to understanding temporal changes in natural language information, as well as social benefits in the form of improved informational tools like question-answering. For the medical domain in particular, understanding temporal distortions and deviations from actual medical findings can reduce occurrences of harmful health choices, for instance, by embedding the research outcomes in news, social media, or search engines.

This project has created large datasets of medical scientific publications, and recorded characteristics of the core information over time across news articles. It provides the basis for designing and implementing machine learning tasks that exploit stylometric features in natural language in conjunction with temporal distributions to identify and categorize such changes. This research goes beyond current approaches limited to true/false binary classification of individual articles, and is hence able to identify and analyze information change in narratives, including semantic changes and nuances, or selective emphasis of related information. The research entails not just distinguishing distorted pieces of information from those that are faithful to the scientific finding, but also a multi-label categorization to learn the type of semantic distortion in medical news.

As a first step in this direction, we focused on identifying what information is worth verifying, and developed a hybrid method combining heuristics and supervised learning to identify "check-worthy" information (Zuo, Karakas, and Banerjee; 2018). Our approach achieved state-of-the-art detection performance, as measured by several metrics. An extension of this work was invited to the CLEF 2019 conference (Zuo, Karakas, and Banerjee; 2019). Next, we looked into how healthcare information is first presented in research literature, and then in newswires for general readership. We developed a novel dataset of 5,034 news articles paired with the research abstracts of the work being mentioned, and explored how to identify identical or near-identical content expressed in vastly different syntax and vocabulary. For this, we took a two-step approach: (1) select the most relevant candidates from a collection of 222,000 research abstracts, and (2) re-rank this list of most relevant candidates. We compared the classical information retrieval (IR) approach using BM25 with more recent transformer-based models, and found that cross-genre medical IR is a viable task, but that incorporating domain-specific knowledge is crucial for its success (Zuo, Acharya, and Banerjee; 2020).
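The two-step approach described above (candidate selection followed by re-ranking) can be illustrated with a minimal sketch. This is not the project's implementation: the class `BM25`, the helpers `tokenize`, `rerank`, and `jaccard`, the parameter values `k1=1.5` and `b=0.75`, and the toy abstracts are all assumptions made for illustration. In particular, the token-overlap re-ranker is only a stand-in for the transformer-based models the project evaluated.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real systems need more care).
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over a small in-memory collection."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(t for d in self.docs for t in set(d))
        # Non-negative IDF variant, as used by several search libraries.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            norm = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        # Step 1: select the k highest-scoring candidate documents.
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def jaccard(query, doc):
    # Placeholder re-ranking score: token-set overlap between query and document.
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if q | d else 0.0


def rerank(query, candidates, docs, scorer):
    # Step 2: re-order the candidate list with a second scoring function.
    return sorted(candidates, key=lambda i: scorer(query, docs[i]), reverse=True)


# Toy data, invented for this sketch.
abstracts = [
    "flagellin reduces influenza h3n2 viral load in mice",
    "statins lower cholesterol in adults",
    "vitamin d and bone density in older adults",
]
query = "Flagellin shows therapeutic potential with H3N2"

bm25 = BM25(abstracts)
candidates = bm25.top_k(query, k=2)
ranked = rerank(query, candidates, abstracts, jaccard)
```

Keeping the second-stage scorer as a plain function argument means a stronger model can be swapped in without touching the candidate-selection step, which is the main reason for separating the two stages when the collection is large.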

Through the course of this project, we observed that the complex nature of medical misinformation can be attributed largely to two phenomena. First, (mis)information propagates across multiple distinct genres, from research literature to newswires to social media, where each genre has its own linguistic properties and pragmatic hurdles to overcome. Second, a large amount of information amounts to paltering, or what is often called "less than lying". We have pursued scientific investigations in both directions.

(Mis)information propagation across genres

In the former, we looked into the linguistic transformations that happen when medical information transitions from specialized research literature into news intended for wider readership. This transition makes the information vulnerable to misinterpretation, misrepresentation, and incorrect attribution, all of which may be difficult to identify without adequate domain knowledge and may exist even in the presence of explicit citations. Moreover, news articles seldom provide a precise correspondence between a specific claim and its origin, making it harder to identify which claims, if any, reflect the original findings. For instance, an article stating “Flagellin shows therapeutic potential with H3N2, known as Aussie Flu.” contains two claims (“Flagellin ... H3N2,” and “H3N2, known as Aussie Flu”) that may be true or false independent of each other, and it is prima facie unclear which claims, if any, are supported by the cited research. We developed a corpus of sentences from medical news along with the sources these news articles cite from peer-reviewed medical research journals. Then, we used this corpus to study what a general reader perceives to be true, and how to verify the scientific source of claims. Unlike existing corpora, this corpus captures the metamorphosis of information across two genres with disparate readership and vastly different vocabularies, and presents the first empirical study of health-related fact-checking across them (Zuo et al.; 2022a).

We delved further into the cross-genre propagation of misinformation and the perception of truth. For this part of our research, we collaborated with a team led by Dr. Indrakshi Ray at Colorado State University, Fort Collins. As prior research has demonstrated, social media posts often leverage the trust readers have in prestigious news agencies and cite news articles as a way of gaining credibility. It is not, however, always the case that the cited article supports the claim being upheld in the social media post. In other words, the post makes it "look" like the claim originates from a credible source when it does not. We developed a cross-genre ad hoc information retrieval model to identify whether the information in a Twitter post is, indeed, supported by the news article it cites. This leg of our work rests on a large corpus of 46.86 million Twitter posts about COVID-19, and is divided into two tasks: (i) developing models to detect tweets that contain a claim worth fact-checking, and (ii) verifying whether the claims made in a tweet are supported by the newswire article it cites. Unlike previous studies that detect unsubstantiated information by post hoc analysis of the patterns of propagation, our approach is capable of identifying deceptive support before the misinformation begins to spread. Among our chief findings is that, of the posts that contain a seemingly factual claim while citing a news article as supporting evidence, at least 1% include a citation intended to deceive the reader (Zuo et al.; 2022b).

Less than lying

The latter consists of selective reporting, non-disclosure of conflicts of interest, disease-mongering, etc. These manifold attributes make the automatic detection of medical misinformation a daunting challenge, one that has so far only been explored by journalists and healthcare professionals in purely qualitative studies. We delved into a significantly more complex multi-class classification task to test whether medical news articles (most of which are not considered "fake" by any existing fact-checking system) actually satisfy criteria deemed important by medical experts and healthcare journalists (as far as misinformation is concerned). We collected a corpus of 1,119 health news articles paired with systematic reviews, where each review covers six criteria essential to the accuracy of medical news. Our experiments compared classical token-based approaches with more recent transformer-based models, and found that detecting qualitative lapses is an extremely challenging task with direct ramifications for misinformation. Moreover, it is an important direction to pursue beyond assigning True or False labels to short claims (Zuo, Zhang, and Banerjee; 2021).

Team

Ritwik Banerjee (Principal Investigator), Research Assistant Professor of Computer Science, Stony Brook University
Chaoyuan Zuo, Ph.D. ↦ Faculty at the School of Journalism & Communication, Nankai University (China)
Noushin Salek Faramarzi, Research Assistant
Kritik Mathur, M.S. ↦ Software Engineer @ Amazon
Dhruv Kela, M.S. ↦ Software Engineer @ DigitalOcean
Narayan Acharya, M.S. ↦ Research Engineer @ dmetrics, Inc.
Ayla Karakas, B.S. (Linguistics) ↦ Ph.D., Computational Linguistics @ Yale
Qi Zhang, B.S. ↦ M.S., University of California San Diego

Collaborators

Indrakshi Ray, Professor of Computer Science, Colorado State University
Hossein Shirazi, Assistant Professor of Management Information Systems, San Diego State University
Fateme Hashemi Chaleshtori, MS (Computer Science), Colorado State University
Sina Mahdipour Saravani, MS (Computer Science), Colorado State University

Publications

Faramarzi, N., Chaleshtori, F., Shirazi, H., Ray, I., & Banerjee, R. (2023). Claim Extraction and Dynamic Stance Detection in COVID-19 Tweets. Companion Proceedings of the ACM Web Conference 2023. Austin, TX, USA: Association for Computing Machinery. https://doi.org/10.1145/3543873.3587643
Zuo, C., Mathur, K., Kela, D., Faramarzi, N. S., & Banerjee, R. (2022). Beyond Belief: A Cross-Genre Study on Perception and Validation of Health Information Online. International Journal of Data Science and Analytics. https://doi.org/10.1007/s41060-022-00310-7
Zuo, C. (2022). Seeing Should Probably not be Believing: The Role of Deceptive Support in COVID-19 Misinformation on Twitter. ACM Journal of Data and Information Quality. https://doi.org/10.1145/3546914
Zuo, C., Zhang, Q., & Banerjee, R. (2021). An Empirical Assessment of the Qualitative Aspects of Misinformation in Health News. Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda. https://doi.org/10.18653/v1/2021.nlp4if-1.11
Saravani, S. M., Ray, I., & Banerjee, R. (2021). An Investigation into the Contribution of Locally Aggregated Descriptors to Figurative Language Identification. Second Workshop on Insights from Negative Results in NLP. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.insights-1.15
Zuo, C., Acharya, N., & Banerjee, R. (2020). Querying Across Genres for Medical Claims in News. Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/2020.emnlp-main.139
Zuo, C., Karakas, A. I., & Banerjee, R. (2019). To Check or not to Check: Syntax, Semantics, and Context in the Language of Check-Worthy Claims. Proceedings of the 10th International Conference of the CLEF Association. https://doi.org/10.1007/978-3-030-28577-7_23
Zuo, C., Karakas, A. I., & Banerjee, R. (2018). A Hybrid Recognition System for Check-worthy Claims Using Heuristic and Supervised Learning. Working Notes of the Conference and Labs of the Evaluation Forum (CLEF). Retrieved from https://ceur-ws.org/Vol-2125/paper_143.pdf