Scaling the Critique: How Structural Topic Modelling Can Enhance Critical Discourse Analysis
- Hendrik Theine & Carlotta Verità

Traditionally, Critical Discourse Analysis (CDA) tends to be qualitative, though it has repeatedly been combined with corpus linguistics and other quantitative methods. Epistemologically, the qualitative focus makes sense: qualitative, interpretive approaches are uniquely suited to reconstructing meaning, power, and legitimation in context. Yet qualitative CDA analyses typically operate on limited samples and small corpora, which makes it difficult for researchers to assess broader regularities or temporal dynamics.

In the age of digitalization, this produces a persistent scaling dilemma for CDA scholars. The dilemma is not entirely new, but it has become more pronounced: the volume of content steadily increases, and massive datasets are ever more easily available. This is especially true given tools that make finding, accessing, downloading, and even scraping data easier (think APIs and amazing GitHub pages), as well as advances in natural language processing that open many avenues for data analysis, not to mention AI, which adds a whole new layer. This availability (among other developments) has certainly fueled the use of quantitative methods and techniques that, at least on the surface, seem to force a choice between a representative "Big Data" mapping that lacks depth and a deep "Small Data" interpretation that lacks scale.
However, as we argue in our recent working paper, this apparent dilemma can be overcome by combining CDA with structural topic modelling (STM), a quantitative, exploratory method originating from the "text as data" camp. For us, the answer isn't to abandon our critical roots for computational, big-data approaches. Our combination brings together the best of both worlds: it allows us to identify broad discursive patterns across large-scale corpora while maintaining the interpretive and critical foundations needed to explain why those patterns exist, reaching beyond a narrow view of the text and data.
Why STM is a Natural Ally for CDA
In simple terms, STM applies algorithms to large collections of text to calculate the likelihood of certain words appearing together, identifying "topics" as clusters of related terms and showing how they are distributed across thousands of documents. While topic models are often dismissed as probability algorithms that merely spit out bundles of frequent terms, STM is a sophisticated tool uniquely suited to integration with CDA for several key reasons.
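The core intuition of "words appearing together" can be made concrete with a toy sketch. This is not the actual STM algorithm (which fits a full probabilistic model, typically via the `stm` package in R); it merely shows, with invented mini-documents, how co-occurrence counts begin to reveal word clusters that topic models formalize as "topics":

```python
from collections import Counter
from itertools import combinations

# Invented mini-corpus: two documents about labour, two about climate.
docs = [
    "wage inequality labour market reform",
    "labour market wage bargaining unions",
    "carbon tax climate emissions policy",
    "climate policy emissions trading carbon",
]

# Count how often each pair of words appears in the same document.
cooc = Counter()
for doc in docs:
    words = sorted(set(doc.split()))
    for pair in combinations(words, 2):
        cooc[pair] += 1

# Pairs that co-occur in more than one document are the seeds of
# word clusters; topic models estimate such groupings probabilistically.
frequent_pairs = {pair for pair, n in cooc.items() if n > 1}
```

Running this, `("labour", "market")` and `("carbon", "climate")` surface as recurring pairs, while words from different themes never cluster together; a real topic model does the same at scale, with probabilities instead of raw counts.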
First, because STM is an unsupervised and inductive approach, thematic structures emerge directly from the data itself rather than being forced into a pre-set dictionary. This openness provides researchers with the exploratory breadth needed to discover broad macro-discursive patterns and shifts that might be overlooked in a more constrained qualitative sample.
Second, the method aligns with CDA’s understanding of meaning as context-dependent by treating language as relational and open. By accounting for polysemy (words having multiple meanings) and heteroglossia (the presence of multiple voices), STM mirrors the CDA view that the meaning of a word is not inherent but arises from the specific clusters and contexts in which it is employed.
Third, STM allows researchers to move beyond basic counts by using document-level metadata (such as publication dates, political affiliations, or media ownership) as covariates (a fancy way of saying factors that may relate to the probabilities being calculated). This enables the statistical modeling of how specific discourses vary across different institutional contexts or shift over time, effectively treating these metadata patterns as empirical traces of deeper power structures and generative mechanisms.
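To illustrate the covariate idea, here is a deliberately simplified sketch. STM proper estimates covariate effects inside the model itself; this hypothetical example, with invented document-topic proportions and an invented "outlet" variable, only compares the average prevalence of a topic across a metadata group, which is the intuition behind a prevalence covariate:

```python
# Invented document-topic proportions: each row is one document's
# estimated share of a hypothetical "austerity" topic, tagged with
# the outlet that published it.
doc_topic = [
    {"outlet": "A", "austerity": 0.62},
    {"outlet": "A", "austerity": 0.55},
    {"outlet": "B", "austerity": 0.12},
    {"outlet": "B", "austerity": 0.21},
]

def mean_prevalence(rows, outlet, topic):
    """Average proportion of `topic` among documents from `outlet`."""
    vals = [r[topic] for r in rows if r["outlet"] == outlet]
    return sum(vals) / len(vals)

# A positive gap means outlet A devotes more space to the topic.
gap = (mean_prevalence(doc_topic, "A", "austerity")
       - mean_prevalence(doc_topic, "B", "austerity"))
```

In a full STM analysis, such differences are estimated with uncertainty and can include continuous covariates such as time, but the interpretive question is the same: where, and for whom, is a discourse more prevalent?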
Beyond the Empirical Map: Reaching Explanatory Depth with CDA
While the shared premises of relational meaning and inductive exploration make STM and CDA epistemologically compatible, this alignment does not mean the algorithm can perform the critical work autonomously. Even though STM incorporates structural influences through metadata, these statistical outputs are merely empirical traces (so-called "demi-regularities" in critical realist terms) that require a strong, explicit role for CDA to reach the level of true social critique. Without this diagnostic rigor, computational patterns remain critically blind and risk being reduced to a descriptive standard that leaves the deeper mechanisms of power unexamined, a point we discuss extensively in our paper by drawing on the depth ontology of critical realism.
Specifically, we argue that scholars can use CDA to perform an intensive interpretation of the representative texts identified by STM. This process involves a text-immanent critique to deconstruct the internal lexicon and argumentative strategies used to justify specific positions. By moving further into a socio-diagnostic critique, researchers can then link these textual details back to the broader sociopolitical power structures signaled by the metadata. This also allows us to look for what the algorithm is fundamentally incapable of seeing: the textual silences. These meaningful omissions in a dominant discourse often represent the most powerful sites of ideological work, defining the boundaries of what is considered realistic or legitimate. By zooming in on these specific moments, we move from broad "semantic macropositions" to a deep, retroductive understanding of how language naturalizes social hierarchies and reproduces hegemony. STM, in this way, acts as a preliminary map of large text corpora: ultimately, though, the meaning behind it comes from critically questioning the landscape that appears.
A Pluralist Workflow for Mixed-Methods CDA Scholars
Integrating these methods creates a complementary workflow (a critical methodological pluralism) that allows for both statistical robustness and critical depth:
Map (STM): Utilize unsupervised modeling as an extensive mapping device to navigate thousands of documents, identifying stable thematic patterns or demi-regularities across the entire corpus.
Contextualize (Metadata): Analyze how these patterns shift across social and institutional variables (such as political affiliation or media ownership), highlighting where specific ideas are normalized, marginalized, or strategically silenced.
Interrogate (CDA): Transition to an intensive research design by performing a close reading of representative documents. This step reconstructs the ideological underpinnings and argumentative operations of the discourse, informing retroductive claims about the generative mechanisms that shape our social reality.
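The handover from the extensive to the intensive design in the workflow above can be sketched in a few lines. This is a hedged illustration, not the authors' pipeline: given (invented) document-topic proportions from a fitted model, one selects the documents most dominated by a topic of interest and passes them to close CDA reading:

```python
# Invented proportions: each document's estimated share of the
# topic selected for critical interrogation.
doc_topic = {
    "editorial_2021_03.txt": 0.81,
    "oped_2020_11.txt": 0.74,
    "news_2019_07.txt": 0.23,
    "report_2022_01.txt": 0.55,
}

def representative_docs(proportions, k=2):
    """Return the k documents most strongly associated with the topic,
    i.e. the candidates for intensive close reading."""
    return sorted(proportions, key=proportions.get, reverse=True)[:k]

to_read = representative_docs(doc_topic)
```

The qualitative step then works on `to_read`: a short, theoretically sampled reading list rather than the whole corpus, which is precisely how the extensive mapping and the intensive critique divide their labour.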
This post is based on: Theine, H., & Verità, C. (2026). Beyond Text-as-Data: Integrating Structural Topic Modeling and Critical Discourse Analysis for Contextualized Transformations Research. SET Lab Working Paper Series No. 2. https://www.jku.at/fileadmin/gruppen/410/SET_Lab/SET_Lab_WP_Series/wp2.pdf
