Do autocratic and democratic constitutions use language differently?
Alex Ronczewski
24 April 2026
Word embeddings transform words into numerical vectors in which distance reflects similarity of meaning. They are a core tool in Natural Language Processing (NLP), the field concerned with how computers handle human language. The key property is that similar words end up near each other in this vector space: “economy” and “trade” are close; “economy” and “zebra” are not. For economists, this turns language into something that can be measured.
In this notebook we load national constitutions from the Comparative Constitutions Project, classify them by regime type, and use word embeddings to ask whether autocratic and democratic constitutions use language differently. The motivating idea is that autocratic constitutions are full of rights language that is not enforced. For example, North Korea’s constitution guarantees freedom of speech. Can word embeddings detect these patterns, or do all constitutions sound the same?
By the end of this notebook, students will be able to:
Install our libraries
!pip install gensim scikit-learn pandas numpy matplotlib plotly nltk scipy --quiet
Load our libraries
import re, warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import plotly.express as px
import nltk
import gensim.downloader as api
from scipy import stats
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords
The data comes from the Comparative Constitutions Project (constituteproject.org), which provides English-language texts of every available national constitution. We’ve pre-fetched the in-force constitutions to datasets/constitutions.csv so the notebook runs offline. The resulting dataframe has 4 columns: country, country_id, id, and text (the full text of the constitution).
All texts are English translations. Translation decisions bring their own set of biases; we need to keep this in mind when interpreting results.
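A minimal sketch of what the loading step might look like, assuming the CSV holds exactly the four columns described above. The helper name `load_constitutions` and the column check are illustrative assumptions, not the notebook's own code; a two-row in-memory CSV stands in for the real file so the example runs anywhere.

```python
import io
import pandas as pd

EXPECTED_COLUMNS = ["country", "country_id", "id", "text"]

def load_constitutions(path_or_buffer):
    """Read the constitutions file and check it has the expected columns (hypothetical helper)."""
    frame = pd.read_csv(path_or_buffer)
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    # Drop rows with no text so later tokenisation never sees NaN.
    return frame.dropna(subset=["text"]).reset_index(drop=True)

# In the notebook this would be: df = load_constitutions("datasets/constitutions.csv")
sample_csv = io.StringIO(
    "country,country_id,id,text\n"
    "Republic of Albania,Albania,Albania_2016,Albania 1998 (rev. 2016) ...\n"
    "Principality of Andorra,Andorra,Andorra_1993,Andorra 1993 Preamble ...\n"
)
sample_df = load_constitutions(sample_csv)
print(sample_df[["country_id", "id"]])
```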
| | country | country_id | id | text |
|---|---|---|---|---|
| 0 | The Islamic Republic of Afghanistan | Afghanistan | Afghanistan_2004 | Afghanistan 2004 Preamble In the name of Allah... |
| 1 | Republic of Albania | Albania | Albania_2016 | Albania 1998 (rev. 2016) Subsequently amended ... |
| 2 | People's Democratic Republic of Algeria | Algeria | Algeria_2020 | Algeria 2020 Translated by International IDEA ... |
| 3 | Principality of Andorra | Andorra | Andorra_1993 | Andorra 1993 Preamble The Andorran People, wit... |
| 4 | Republic of Angola | Angola | Angola_2010 | Angola 2010 Preamble We, the people of Angola,... |
We classify countries as autocratic or democratic using a manual list, derived from V-Dem’s Regimes of the World (RoW) index (Coppedge et al. 2023, V-Dem Dataset v13), specifically countries the index classifies as “closed autocracies” or “electoral autocracies”. This is a simplification; a country’s constitution may have been drafted under a different regime than the current one. This is sufficient for a workshop, but for a research paper we would have to check V-Dem’s Regimes of the World index at the time of drafting the constitution or its most recent update. You can see the list of autocracies in the code cell below.
# Source: V-Dem Regimes of the World index v13 (Coppedge et al. 2023)
# Countries classified as "closed autocracy" or "electoral autocracy"
autocracy_ids = {
    "China", "Russian_Federation__the", "Saudi_Arabia", "Iran_Islamic_Rep_of_",
    "Cuba", "Syrian_Arab_Republic_the", "Libya", "Belarus", "Venezuela",
    "Myanmar", "Eritrea", "Turkmenistan", "Uzbekistan", "Sudan_the",
    "Zimbabwe", "Chad", "Equatorial_Guinea", "Tajikistan", "Bahrain",
}
df["regime"] = df["country_id"].apply(lambda x: "autocratic" if x in autocracy_ids else "democratic")
df["regime"].value_counts()
regime
democratic 174
autocratic 19
Name: count, dtype: int64
Our sample is heavily imbalanced: 19 autocratic constitutions vs 174 democratic ones. This roughly 1:9 ratio means patterns in the autocratic group are driven by a small number of documents, and any single constitution has outsized influence on the group average. Keep this asymmetry in mind when interpreting results.
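To make the asymmetry concrete, here is a small illustration using made-up scores (the baseline and outlier values are assumptions, not numbers from the corpus). It moves one document away from the rest of its group and measures how far the group mean shifts.

```python
import numpy as np

def shift_from_one_outlier(group_size, baseline=0.35, outlier=0.60):
    """Change in a group mean when one document moves from baseline to outlier."""
    scores = np.full(group_size, baseline)  # hypothetical identical scores
    before = scores.mean()
    scores[0] = outlier                     # a single unusual document
    return scores.mean() - before

shift_auto = shift_from_one_outlier(19)
shift_demo = shift_from_one_outlier(174)
print(f"shift in a group of 19:  {shift_auto:.4f}")
print(f"shift in a group of 174: {shift_demo:.4f}")
print(f"ratio: {shift_auto / shift_demo:.1f}")
```

The same outlier moves the 19-document mean about nine times as far as the 174-document mean, which is simply the ratio of the group sizes.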
Constitutions vary enormously in length. A quick look at word counts by regime helps us understand the corpus before we start analysing it.
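A summary of this kind could be produced along the following lines. This is a sketch that assumes words are counted by splitting on whitespace; the helper name `word_count_summary` and the toy dataframe are illustrative, and the notebook may count words differently.

```python
import pandas as pd

def word_count_summary(frame):
    """Summarise whitespace-delimited word counts per regime (hypothetical helper)."""
    counts = frame["text"].str.split().str.len()
    return (
        counts.groupby(frame["regime"])
        .agg(["mean", "median", "min", "max"])
        .round(0)
    )

# Toy stand-in for df; the real frame has one row per constitution.
toy = pd.DataFrame({
    "regime": ["autocratic", "autocratic", "democratic"],
    "text": ["a b c", "a b c d e", "a b"],
})
summary = word_count_summary(toy)
print(summary)
```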
| regime | mean | median | min | max |
|---|---|---|---|---|
| autocratic | 15566.0 | 11222.0 | 3000 | 53485 |
| democratic | 25724.0 | 18920.0 | 4370 | 228389 |
Please discuss in small groups with your classmates:
Raw constitution text is full of noise that doesn’t carry meaning: punctuation, inconsistent capitalisation (“The” vs “the”), and high-frequency filler words like “the,” “of,” and “and.” These words appear in every constitution regardless of regime type, so they would dominate the analysis without telling us anything useful. We strip punctuation, lowercase everything so “Rights” and “rights” are treated as the same word, and remove English stopwords (NLTK’s standard list of roughly 180 common words that carry grammatical function but little semantic content). Note that modal verbs such as “shall” and “may” still appear among the TF-IDF terms later in this notebook, so the stopword list evidently does not remove them.
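The cleaning steps just described can be sketched as follows. This is a minimal illustration, not the notebook's own cell: the function name `tokenize` is hypothetical, a tiny hard-coded stopword set stands in for NLTK's list (`stopwords.words("english")`) so the example runs without a download, and keeping only runs of a-z is an assumption that also discards digits and accented letters.

```python
import re

# Tiny stand-in for NLTK's English stopword list, hard-coded for illustration.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "be", "by"}

def tokenize(text, stopwords=STOPWORDS):
    """Lowercase, keep alphabetic runs only, then drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stopwords]

sample = "The Rights of the People shall be protected by the State."
print(tokenize(sample))

# In the notebook this would populate the token column, e.g.
# df["tokens"] = df["text"].apply(tokenize)
```

With this stand-in list, “shall” survives the filter, which is the behaviour to watch for when deciding whether a stopword list fits legal text.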
Consider: we removed stopwords like “the” and “of.” Could removing certain words bias our analysis? Are there words you’d want to keep that a standard stopword list removes, or words such as “shall” that a standard list leaves in but that you’d want to remove for legal text?
We use Google’s pre-trained Word2Vec model (Mikolov et al. 2013). Its vocabulary covers 3 million words and phrases, each represented by a 300-dimensional vector, and it was trained on roughly 100 billion words of Google News. Using a pre-trained model lets us skip training, which would require a far larger corpus of legal and political text than 193 constitutions. The tradeoff is that the model learned language from Google News, not from legal or constitutional text, so its sense of what “sovereignty” or “rights” means is shaped by journalism. Keep this bias in mind when interpreting results.
This download is roughly 1.7 GB the first time. Gensim caches it locally after that.
We load the model
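A minimal sketch of what the loading step might look like, assuming the model is fetched through the `gensim.downloader` module imported earlier as `api`. The wrapper name `load_pretrained_word2vec` is hypothetical; `"word2vec-google-news-300"` is the name this model carries in the gensim-data catalogue. The import sits inside the function so that defining it does not trigger the download.

```python
MODEL_NAME = "word2vec-google-news-300"  # Google News Word2Vec in gensim-data

def load_pretrained_word2vec(name=MODEL_NAME):
    """Download on first call, then load from the local cache (hypothetical wrapper)."""
    import gensim.downloader as api  # imported lazily; the download is large
    return api.load(name)

# In the notebook:
# model = load_pretrained_word2vec()
# `"rights" in model` then checks vocabulary membership, and
# model["rights"] returns that word's 300-dimensional NumPy vector.
```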
Before using embeddings, let’s look at which words are most distinctive of democracies and of autocracies using Term Frequency-Inverse Document Frequency (TF-IDF) scores alone. TF-IDF is a numerical statistic used in NLP to measure a word’s relevance to a specific document within a collection. A word scores highly when it is frequent in one document but rare across the rest of the collection, and scores low when it appears almost everywhere.
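To see the mechanics, here is the textbook form of the score, tf × ln(N / df), computed by hand on three toy documents. This is a sketch for intuition only: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three toy "documents", already tokenised.
docs = [
    ["parliament", "court", "rights"],
    ["parliament", "court", "hluttaw", "hluttaw"],
    ["parliament", "rights", "freedom"],
]

def tf_idf(docs):
    """Return one {word: score} dict per document using tf * ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
for row in scores:
    print({w: round(s, 3) for w, s in row.items()})
```

“parliament” occurs in every document and scores zero, while “hluttaw,” concentrated in a single document, gets the highest score there.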
tfidf = TfidfVectorizer(max_features=5000)
tfidf_matrix = tfidf.fit_transform(df["tokens"].apply(lambda t: " ".join(t)))
feature_names = tfidf.get_feature_names_out()
auto_mask = (df["regime"] == "autocratic").values
demo_mask = (df["regime"] == "democratic").values
auto_mean = tfidf_matrix[auto_mask].mean(axis=0).A1
demo_mean = tfidf_matrix[demo_mask].mean(axis=0).A1
diff = auto_mean - demo_mean
top_auto_idx = diff.argsort()[-10:][::-1]
top_demo_idx = diff.argsort()[:10]
print("Top 10 overrepresented in AUTOCRATIC constitutions:")
display(pd.DataFrame({"word": feature_names[top_auto_idx], "tfidf_difference": diff[top_auto_idx].round(4)}))
print("\nTop 10 overrepresented in DEMOCRATIC constitutions:")
display(pd.DataFrame({"word": feature_names[top_demo_idx], "tfidf_difference": abs(diff[top_demo_idx]).round(4)}))
Top 10 overrepresented in AUTOCRATIC constitutions:
| | word | tfidf_difference |
|---|---|---|
| 0 | peoplex27s | 0.0545 |
| 1 | turkmenistan | 0.0469 |
| 2 | uzbekistan | 0.0463 |
| 3 | hluttaw | 0.0450 |
| 4 | islamic | 0.0437 |
| 5 | belarus | 0.0415 |
| 6 | majlisi | 0.0408 |
| 7 | russian | 0.0395 |
| 8 | council | 0.0313 |
| 9 | zimbabwe | 0.0307 |
Top 10 overrepresented in DEMOCRATIC constitutions:
| | word | tfidf_difference |
|---|---|---|
| 0 | shall | 0.0700 |
| 1 | parliament | 0.0544 |
| 2 | may | 0.0501 |
| 3 | court | 0.0401 |
| 4 | section | 0.0393 |
| 5 | office | 0.0368 |
| 6 | house | 0.0352 |
| 7 | person | 0.0318 |
| 8 | subsection | 0.0254 |
| 9 | public | 0.0252 |
The autocratic list is mostly the names of countries we classify as autocratic (Turkmenistan, Uzbekistan, Belarus, Zimbabwe) and country-specific legislative terms (hluttaw is Myanmar’s parliament, majlisi is Tajikistan’s). That’s because with only 19 autocratic constitutions, unique country references dominate the TF-IDF scores. The top entry, “peoplex27s,” is almost certainly a cleaning artefact rather than a real word: it is what remains of “people’s” when the apostrophe is stored as the HTML entity &#x27; and punctuation is then stripped. The democratic list shows more generic institutional vocabulary (parliament, court, office) since those terms are spread across 174 countries rather than concentrated in a few.
We found that country names dominate the autocratic list: what does this tell us about TF-IDF with small sample sizes? How might you address this?
Add 3 more countries to the autocracy_ids set that you think should be classified as autocratic. Re-run the TF-IDF cell. Do the top words change?
Simple averaging treats every word equally, but words like “article” and “shall” appear in every constitution and carry no distinctive meaning. We weight each word by its TF-IDF score before averaging. This is like weighting a price index where not all items matter equally.
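A two-word toy example shows what the weighting does. The vectors and IDF values here are invented for illustration; only `np.average` with its `weights` argument matches what the notebook uses.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings" and IDF weights, for illustration only.
vectors = np.array([
    [1.0, 0.0],   # a word like "shall": common everywhere, low IDF
    [0.0, 1.0],   # a word like "sovereignty": rarer, higher IDF
])
idf = np.array([1.0, 3.0])

plain_mean = vectors.mean(axis=0)
weighted_mean = np.average(vectors, axis=0, weights=idf)

print("plain mean:   ", plain_mean)
print("weighted mean:", weighted_mean)
```

The weighted average is pulled toward the rarer word. In `get_weighted_doc_vector`, every occurrence of a token contributes its IDF weight, so a word's total influence is effectively its count times its IDF.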
tfidf_full = TfidfVectorizer()
tfidf_full.fit(df["tokens"].apply(lambda t: " ".join(t)))
idf_values = dict(zip(tfidf_full.get_feature_names_out(), tfidf_full.idf_))
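def get_weighted_doc_vector(tokens, wv_model, idf_dict):
    """IDF-weighted average of the word vectors for one document."""
    vecs, weights = [], []
    for t in tokens:
        # Skip words missing from either the embedding vocabulary or the IDF table.
        if t in wv_model and t in idf_dict:
            vecs.append(wv_model[t])
            weights.append(idf_dict[t])
    if not vecs:
        # No usable words: return a zero vector so the row can be filtered out below.
        return np.zeros(wv_model.vector_size)
    return np.average(vecs, axis=0, weights=np.array(weights))
df["doc_vector"] = df["tokens"].apply(lambda t: get_weighted_doc_vector(t, model, idf_values))
# Drop any constitution whose vector is all zeros (no words found in the model).
zero_mask = df["doc_vector"].apply(lambda v: np.allclose(v, 0))
df = df[~zero_mask].copy()
print(f"Constitutions with valid embeddings: {len(df)}")
Constitutions with valid embeddings: 193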
For this part we compare distributions. For each target word, we compute its cosine similarity to every individual constitution vector in each regime group, then test whether the group means differ significantly (using Welch’s t-test, which does not assume the two groups have equal variances). Cosine similarity measures how much two vectors point in the same direction. If a word vector and a constitution vector point roughly the same way, the score is close to 1; if they are unrelated (orthogonal), it is near 0; and if they point in opposite directions, it approaches -1. Because it depends only on direction, it ignores vector length and focuses purely on whether the content is similar.
The mathematical formula for cosine similarity is: \[\text{cos}(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}\]
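The formula translates directly into NumPy. This is a small stand-alone sketch on hand-picked 2-dimensional vectors; the notebook itself relies on scikit-learn's `cosine_similarity`, which computes the same quantity for whole matrices at once.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine([1, 2], [2, 4])   # same direction, different length
orthogonal = cosine([1, 0], [0, 1])       # nothing in common
opposite = cosine([1, 0], [-1, 0])        # opposite direction

print(same_direction, orthogonal, opposite)
```

Doubling a vector's length leaves the score unchanged, which is why the measure is insensitive to magnitude.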
target_words = [
    "sovereignty", "freedom", "rights", "security", "justice",
    "equality", "democracy", "power", "protection", "welfare",
    "property", "military", "religion", "party",
]
auto_vecs = np.stack(df.loc[df["regime"] == "autocratic", "doc_vector"].values)
demo_vecs = np.stack(df.loc[df["regime"] == "democratic", "doc_vector"].values)
results = []
for word in target_words:
    if word not in model:
        continue
    wv = model[word].reshape(1, -1)
    sim_auto = cosine_similarity(wv, auto_vecs).flatten()
    sim_demo = cosine_similarity(wv, demo_vecs).flatten()
    n1, n2 = len(sim_auto), len(sim_demo)
    var_auto = sim_auto.var(ddof=1)
    var_demo = sim_demo.var(ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var_auto + (n2 - 1) * var_demo) / (n1 + n2 - 2))
    d = (sim_auto.mean() - sim_demo.mean()) / pooled_std if pooled_std > 0 else 0
    t_stat, p_val = stats.ttest_ind(sim_auto, sim_demo, equal_var=False)
    results.append({
        "word": word,
        "mean_autocratic": round(sim_auto.mean(), 4),
        "mean_democratic": round(sim_demo.mean(), 4),
        "difference": round(sim_auto.mean() - sim_demo.mean(), 4),
        "cohens_d": round(d, 3),
        "p_value": round(p_val, 4),
    })
results_df = pd.DataFrame(results).sort_values("cohens_d", key=abs, ascending=False)
results_df
| | word | mean_autocratic | mean_democratic | difference | cohens_d | p_value |
|---|---|---|---|---|---|---|
| 12 | religion | 0.3989 | 0.3703 | 0.0286 | 0.901 | 0.0029 |
| 6 | democracy | 0.4581 | 0.4175 | 0.0405 | 0.886 | 0.0003 |
| 5 | equality | 0.4221 | 0.3901 | 0.0320 | 0.846 | 0.0031 |
| 7 | power | 0.2940 | 0.2788 | 0.0152 | 0.821 | 0.0032 |
| 1 | freedom | 0.3825 | 0.3527 | 0.0298 | 0.773 | 0.0043 |
| 11 | military | 0.3542 | 0.3351 | 0.0191 | 0.692 | 0.0026 |
| 10 | property | 0.2979 | 0.3142 | -0.0163 | -0.680 | 0.0810 |
| 0 | sovereignty | 0.4337 | 0.4080 | 0.0257 | 0.618 | 0.0115 |
| 3 | security | 0.3143 | 0.2990 | 0.0153 | 0.546 | 0.0166 |
| 9 | welfare | 0.3799 | 0.3660 | 0.0139 | 0.523 | 0.0201 |
| 2 | rights | 0.3578 | 0.3510 | 0.0067 | 0.228 | 0.4694 |
| 13 | party | 0.3273 | 0.3302 | -0.0029 | -0.135 | 0.6426 |
| 8 | protection | 0.3164 | 0.3139 | 0.0025 | 0.100 | 0.7441 |
| 4 | justice | 0.3924 | 0.3900 | 0.0025 | 0.088 | 0.7728 |
Cohen’s d expresses the difference between the two group means in units of their pooled standard deviation (by convention, about 0.2 is a small effect, 0.5 medium, and 0.8 large). The p-value tells us how likely we would be to see a difference at least this large by chance if the two groups truly used the same language. Positive d values mean the word is, on average, closer to autocratic constitutions; negative values mean it is closer to democratic ones.
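A worked example with made-up scores (not values from the corpus) shows how the pooled-standard-deviation form of Cohen's d behaves. The function mirrors the calculation in the analysis cell; the name `cohens_d` and the toy inputs are illustrative.

```python
import numpy as np

def cohens_d(x, y):
    """Standardised mean difference using the pooled sample standard deviation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Hypothetical similarity scores: means differ by 0.02, each group has SD 0.02.
group_a = [0.40, 0.42, 0.44]
group_b = [0.38, 0.40, 0.42]
d_example = cohens_d(group_a, group_b)
print(round(d_example, 3))
```

A raw gap of only 0.02 already counts as a large effect here because the scores within each group vary so little. The same logic explains how differences of around 0.03 in the results table can correspond to d values near 0.9.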
The standout finding: “democracy” and “equality” show large effect sizes toward autocratic constitutions (d above 0.8), with “freedom” close behind (d = 0.773), and all three have p-values below 0.01. “Religion” shows the largest effect of all (d = 0.901). This is in line with our hypothesis that autocratic constitutions lean more heavily on language around equality and human rights, which they then do not uphold. Notably, though, “rights” itself shows almost no difference (d = 0.228, p = 0.47). The only word with a sizeable lean towards democratic constitutions is “property” (d = -0.680), consistent with the emphasis on individual economic rights in democratic nations, but its p-value of 0.081 does not reach the conventional 0.05 threshold. With only 19 autocratic constitutions, statistical power is limited and individual constitutions have outsized influence, so treat p-values as suggestive rather than definitive.
A word shows cohens_d = 0.9 but p_value = 0.5, based on just 19 autocratic constitutions. Is this a reliable finding?
No. The effect size is large, but p = 0.5 means a gap of that size could easily arise by chance. With 19 documents, one outlier can carry the whole result.
Add 3-5 words to target_words that you think would differ between regime types. Re-run the analysis. What did you find?
Our embeddings live in 300 dimensions, which we can’t visualize directly. Instead we use Principal Component Analysis (PCA), which finds the two directions in that 300-dimensional space along which our data varies the most and projects everything onto those two axes. The axes (PC1 and PC2) don’t correspond to any single word or concept; they simply capture the most spread in the data. Points that are close together on the plot correspond to constitutions with similar language; points far apart use language differently.
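To show what “the directions where the data varies the most” means, here is PCA computed by hand with a singular value decomposition on synthetic data. The helper name `pca_2d` and the random data are assumptions for illustration; the notebook uses scikit-learn's `PCA`, which performs the same centring and projection.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the top two principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ Vt[:2].T          # coordinates on PC1 and PC2
    explained = (S ** 2) / np.sum(S ** 2)   # share of total variance per component
    return coords, explained[:2]

rng = np.random.default_rng(0)
# Synthetic data: 50 points in 5 dimensions, with most spread along the first axis.
X = rng.normal(size=(50, 5)) * np.array([5.0, 2.0, 0.5, 0.5, 0.5])
coords, explained = pca_2d(X)
print(coords.shape)
print(explained.round(3))
```

The explained-variance shares indicate how much of the original spread survives in the 2-D picture. With scikit-learn, the same quantity is available as the `explained_variance_ratio_` attribute of a fitted `PCA` object, and it is worth checking before reading much into distances on the plot.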
all_vectors = np.stack(df["doc_vector"].values)
coords = PCA(n_components=2).fit_transform(all_vectors)
plot_df = pd.DataFrame(coords, columns=["PC1", "PC2"])
plot_df["regime"] = df["regime"].values
plot_df["country"] = df["country"].values
plot_df["country_id"] = df["country_id"].values
highlight_ids = {"China", "Russian_Federation__the", "Cuba", "Iran_Islamic_Rep_of_",
                 "United_States_of_America", "Canada"}
plot_df["label"] = plot_df.apply(
    lambda r: r["country"] if r["country_id"] in highlight_ids else "", axis=1
)
fig = px.scatter(
    plot_df, x="PC1", y="PC2", color="regime",
    color_discrete_map={"autocratic": "red", "democratic": "blue"},
    hover_data=["country"], opacity=0.5, text="label",
    title="National Constitutions in Embedding Space (PCA)",
)
fig.update_traces(marker_size=8, textposition="top center")
fig.update_layout(width=800, height=600)
fig.show()
We can see that many of the autocratic constitutions sit on the right side of the visualization, while most of the democratic constitutions lean left. You can hover over the dots to see which country they correspond to.
Hungary, Turkey, and Singapore are often described as “hybrid” or “competitive authoritarian” regimes. They don’t fit cleanly into the binary classification we used. Find them on the PCA plot. Where do they land relative to the autocratic and democratic clusters? What does this tell you about the limits of binary regime classification, and about what word embeddings can and can’t detect?
Optional: modify highlight_ids in the cell above to include "Hungary", "Turkey", and "Singapore" so they’re labelled directly on the plot.
In this notebook we used word embeddings to compare the constitutional language of autocratic and democratic regimes. We built TF-IDF weighted document vectors using Google’s pre-trained Word2Vec model and compared how politically meaningful words relate to each group’s language.
The results are consistent with our original hypothesis that autocratic constitutions lean heavily on democratic vocabulary and notions of equality. Words like “democracy,” “equality,” and “freedom” showed medium-to-large effect sizes toward autocratic constitutions. This is consistent with a well-documented pattern in politics: authoritarian regimes highlight rights they do not enforce. The one word that tilted noticeably toward democratic constitutions was “property,” in keeping with the emphasis on individual economic rights in democracies, although that difference was not statistically significant at the 5% level (p = 0.081).
Further Discussion Questions: