The discourse around social media moderation often centers on “censorship,” framing the choice as protecting free expression versus protecting conversation health via account bans (what my generation dubbed “Facebook jail”). This framing makes moderation choices seem binary.
However, for those of us navigating or studying polarized opinions in online spaces, the reality is far more nuanced. Online speech spans a significant and challenging spectrum: from speech that is merely disagreeable or misinformed, through degrees of hateful or toxic speech and deliberate disinformation, to speech that crosses the threshold into actual incitement of violence. This spectrum demands a more precise approach than current moderation tools provide.
I am pleased to announce a forthcoming paper on this very topic, led by PhD student Saajan Patel: “Adapting the Dangerous Speech Paradigm to Identify Incitement from Polarized and Hate Speech.” In this work, Patel and our team describe a framework and methods for distinguishing speech that bears the hallmarks of incitement to violence from misinformation, rants, slurs, and other types of polarized but lawful online discourse.
Distinguishing What Is ‘Dangerous’
Most existing moderation systems struggle to distinguish between speech that is merely offensive and speech that acts as a spark for real-world harm. To address this, our team turned to the Dangerous Speech Paradigm. Based on analyses of real-life discourse surrounding mass murders and genocides across cultures, this paradigm describes the rhetorical contexts and hallmarks of text, images, or video (such as Dehumanization and Accusation in a Mirror) that have led people either to condone or to participate in violence against members of another group. We realized that this paradigm needed an update for the era of Large Language Models (LLMs).
We proposed a three-tiered framework to classify polarized content:
- Allowable Speech: Polarized or disagreeable, but lawful.
- Harmful Speech: Hate speech or content that denigrates, but doesn’t necessarily call for action.
- Inciting Speech: Content that directly encourages or facilitates dangerous acts.
By splitting the old “dangerous” category into “harmful” and “inciting,” we allow platforms to respond more proportionately: perhaps shadow-banning or labeling harmful content, while reserving immediate removal and reporting for truly inciting content.
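To make the tiering concrete, here is a minimal sketch in Python. The three tier names come from our framework; the action mapping and function name are my own illustration of how a platform might wire tiers to responses, not the paper’s actual pipeline.

```python
from enum import Enum

class SpeechTier(Enum):
    """Three-tiered classification of polarized content."""
    ALLOWABLE = "allowable"   # polarized or disagreeable, but lawful
    HARMFUL = "harmful"       # denigrates, but no direct call to action
    INCITING = "inciting"     # directly encourages or facilitates dangerous acts

# Hypothetical action mapping -- one plausible per-tier platform response.
MODERATION_ACTIONS = {
    SpeechTier.ALLOWABLE: "no action",
    SpeechTier.HARMFUL: "label or reduce reach (e.g., shadow-ban)",
    SpeechTier.INCITING: "remove immediately and report",
}

def respond(tier: SpeechTier) -> str:
    """Look up the moderation response for a classified post."""
    return MODERATION_ACTIONS[tier]

print(respond(SpeechTier.HARMFUL))  # -> "label or reduce reach (e.g., shadow-ban)"
```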
Scaling Post Labeling and Explanations
To test this framework, we didn’t just look at a few hundred posts. We analyzed over 3.5 million posts from the platform Gab, focusing on the period surrounding the 2017 Charlottesville “Unite the Right” rally, a pivotal moment in digital polarization.
Using a customized GPT-4o-mini pipeline paired with Chain-of-Thought (CoT) prompting, we taught the model to “think through” the nuances of incitement. The model doesn’t just assign a label; it explains why a post reaches the threshold of inciting violence, given the context of the event.
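For readers who want the general shape of such a call, here is a minimal sketch using the OpenAI Python SDK. The system prompt and context handling here are hypothetical simplifications for illustration; the actual prompts, event context, and customization in the paper differ.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical CoT prompt -- the paper's actual prompts are more elaborate.
SYSTEM_PROMPT = """You are a content-moderation assistant. Classify the post
as ALLOWABLE, HARMFUL, or INCITING. First reason step by step about whether
the post shows dangerous-speech hallmarks (e.g., dehumanization, accusation
in a mirror) and whether it calls for action, then state the final label."""

def classify_post(post_text: str, event_context: str) -> str:
    """Return the model's step-by-step reasoning plus its final label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Event context: {event_context}\n\nPost: {post_text}"},
        ],
        temperature=0,  # near-deterministic labels for reproducibility
    )
    return response.choices[0].message.content
```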
Modeling the Spread of Incitement
We didn’t stop at classification; we also wanted to see how this speech moves. We built a directed repost graph and applied a DeGroot diffusion model to track the spread of harmful vs. inciting speech.
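For intuition, here is a toy sketch of DeGroot diffusion on a directed repost graph. The edge weights and initial scores below are made up for illustration; the paper uses Gab repost edges and LLM-derived labels.

```python
import numpy as np

# Toy directed repost graph: W_raw[i, j] = number of times user i
# reposted user j. (Hypothetical counts, not the Gab data.)
W_raw = np.array([
    [0, 3, 1],
    [1, 0, 2],
    [0, 4, 0],
], dtype=float)

# DeGroot requires a row-stochastic influence matrix: each user's next
# exposure is a weighted average over the users they repost.
W = W_raw / W_raw.sum(axis=1, keepdims=True)

# Initial incitement scores per user, e.g., the fraction of each user's
# posts the classifier labeled "inciting" (hypothetical values).
x = np.array([0.9, 0.1, 0.0])

# Iterate x_{t+1} = W @ x_t; scores diffuse along repost edges.
for _ in range(50):
    x = W @ x

print(x)  # converges toward a consensus weighted by network influence
```

On a strongly connected repost graph, this iteration converges to a consensus value weighted by each user’s network centrality, which is why flagging high-influence “inciters” early matters so much for containing spread.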
What we found is encouraging for the future of moderation: LLM classifiers can effectively assist human moderators by flagging the “inciters” in real time, helping to stop the spread of dangerous content before it reaches a tipping point. (As a bonus, we spare human moderators the trauma of repeated exposure to, and rumination on, horrifically toxic speech.)
Why This Matters
Our goal isn’t to silence political disagreement. In fact, by accurately identifying and isolating inciting speech, we protect the space for allowable (if polarized) speech to exist. This research offers a path forward for platforms to manage content in a way that is transparent, consistent, and grounded in the protection of human life.
Huge thanks to my brilliant co-authors Saajan Patel (this is his first first-author paper!), Ramisha Mahiyat, Aditya Narasimhan Sampath, and Siddharth Krishnan.
Just accepted for ACM Transactions on Social Computing! Check out the link below for more details on the machine learning pipeline and how the GPT model was trained and deployed.
- Saajan Patel, Cori Faklaris, Ramisha Mahiyat, Aditya Sampath, and Siddharth Krishnan. 2026. Adapting the Dangerous Speech Paradigm to Identify Incitement from Polarized and Hate Speech. ACM Trans. Soc. Comput. Just Accepted (February 2026). https://doi.org/10.1145/3797963