Word Embeddings and National Constitutions

Do autocratic and democratic constitutions use language differently?

Python

political economy

word embeddings

text analysis

Using word embeddings to compare the language of autocratic and democratic national constitutions.

Author

Alex Ronczewski

Published

24 April 2026

1 Introduction + Setup

Word embeddings transform words into numerical vectors, where distance reflects similarity of meaning. They are a core tool in Natural Language Processing (NLP), the field concerned with how computers handle human language. The key property is that similar words end up near each other in this vector space: “economy” and “trade” are close; “economy” and “zebra” are not. For economists, this makes language quantitative.
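To make "near each other" concrete, here is a minimal sketch using made-up three-dimensional vectors. The numbers are invented for illustration and are not taken from any trained model; real Word2Vec vectors have 300 dimensions. The helper name `cosine` is our own.

```python
import math

# Made-up 3-dimensional vectors for illustration only; real Word2Vec vectors
# have 300 dimensions and are learned from text.
toy_vectors = {
    "economy": [0.9, 0.8, 0.1],
    "trade":   [0.8, 0.9, 0.2],
    "zebra":   [0.1, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_trade = cosine(toy_vectors["economy"], toy_vectors["trade"])
sim_zebra = cosine(toy_vectors["economy"], toy_vectors["zebra"])
print(f"economy vs trade: {sim_trade:.3f}")
print(f"economy vs zebra: {sim_zebra:.3f}")
```

With these toy numbers, "economy" scores much higher against "trade" than against "zebra", which is the pattern a trained model is expected to show at scale.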

In this notebook we load national constitutions from the Comparative Constitutions Project, classify them by regime type, and use word embeddings to ask whether autocratic and democratic constitutions use language differently. The analysis builds on the notion that autocratic constitutions are full of rights language that is not enforced. For example, North Korea’s constitution guarantees freedom of speech. Can word embeddings detect these patterns, or do all constitutions sound the same?

Learning Outcomes

By the end of this notebook, students will be able to:

Explain what word embeddings are and why they make language quantitative
Load and preprocess a text corpus from a public API
Use TF-IDF and cosine similarity to compare groups of documents
Interpret PCA visualizations of high-dimensional text data
Critically evaluate the limitations of NLP methods applied to political texts

Prerequisites

Python: Basic familiarity with Python syntax (variables, loops, functions). No NLP or machine learning experience required.
Statistics: Comfort with means, standard deviations, and the idea of statistical significance.
Internet connection: Required to download the Word2Vec model (~1.7 GB on first run). The constitution texts themselves are pre-fetched to datasets/constitutions.csv.

Install our libraries

!pip install gensim scikit-learn pandas numpy matplotlib plotly nltk scipy --quiet

Load our libraries

import re, warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import plotly.express as px
import nltk
import gensim.downloader as api
from scipy import stats
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

2 Load the Data

The data comes from the Comparative Constitutions Project (constituteproject.org), which provides English-language texts of every available national constitution. We’ve pre-fetched the in-force constitutions to datasets/constitutions.csv so the notebook runs offline. The resulting dataframe has 4 columns: country, country_id, id and the text of the constitution.

Note

All texts are English translations. Translation decisions bring their own set of biases; we need to keep this in mind when interpreting results.

df = pd.read_csv("datasets/constitutions.csv")
df.head()

	country	country_id	id	text
0	The Islamic Republic of Afghanistan	Afghanistan	Afghanistan_2004	Afghanistan 2004 Preamble In the name of Allah...
1	Republic of Albania	Albania	Albania_2016	Albania 1998 (rev. 2016) Subsequently amended ...
2	People's Democratic Republic of Algeria	Algeria	Algeria_2020	Algeria 2020 Translated by International IDEA ...
3	Principality of Andorra	Andorra	Andorra_1993	Andorra 1993 Preamble The Andorran People, wit...
4	Republic of Angola	Angola	Angola_2010	Angola 2010 Preamble We, the people of Angola,...

Classifying Regimes

We classify countries as autocratic or democratic using a manual list, derived from V-Dem’s Regimes of the World (RoW) index (Coppedge et al. 2023, V-Dem Dataset v13), specifically countries the index classifies as “closed autocracies” or “electoral autocracies”. This is a simplification; a country’s constitution may have been drafted under a different regime than the current one. This is sufficient for a workshop, but for a research paper we would have to check V-Dem’s Regimes of the World index at the time of drafting the constitution or its most recent update. You can see the list of autocracies in the code cell below.

# Source: V-Dem Regimes of the World index v13 (Coppedge et al. 2023)
# Countries classified as "closed autocracy" or "electoral autocracy"
autocracy_ids = {
    "China", "Russian_Federation__the", "Saudi_Arabia", "Iran_Islamic_Rep_of_",
    "Cuba", "Syrian_Arab_Republic_the", "Libya", "Belarus", "Venezuela",
    "Myanmar", "Eritrea", "Turkmenistan", "Uzbekistan", "Sudan_the",
    "Zimbabwe", "Chad", "Equatorial_Guinea", "Tajikistan", "Bahrain",
}

df["regime"] = df["country_id"].apply(lambda x: "autocratic" if x in autocracy_ids else "democratic")
df["regime"].value_counts()

regime
democratic    174
autocratic     19
Name: count, dtype: int64

Note

Our sample is heavily imbalanced: 19 autocratic constitutions vs 174 democratic ones. This roughly 1:9 ratio means patterns in the autocratic group are driven by a small number of documents, and any single constitution has outsized influence on the group average. Keep this asymmetry in mind when interpreting results.

Constitutions vary enormously in length. A quick look at word counts by regime helps us understand the corpus before we start analysing it.

df["word_count"] = df["text"].apply(lambda t: len(t.split()))
df.groupby("regime")["word_count"].agg(["mean", "median", "min", "max"]).round(0)

	mean	median	min	max
regime
autocratic	15566.0	11222.0	3000	53485
democratic	25724.0	18920.0	4370	228389

Please discuss in small groups with your classmates:

What gets lost in a binary regime classification?
Can you think of countries that don’t fit neatly into either category?
Since these constitutions were drafted at different times under different regimes, why might the current regime type not match the constitution’s language?

3 Text Preprocessing

Raw constitution text is full of noise that doesn’t carry meaning: punctuation, inconsistent capitalisation (“The” vs “the”), and high-frequency filler words like “the,” “of,” and “and.” These words appear in every constitution regardless of regime type, so they would dominate the analysis without telling us anything useful. We strip punctuation, lowercase everything so “Rights” and “rights” are treated as the same word, and remove English stopwords (NLTK’s standard list of ~180 common words that carry grammatical function but little semantic content). Note that this list does not include legal boilerplate such as “shall” or “may,” so those words stay in the corpus; they show up again in the TF-IDF results in Section 5.

stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    tokens = text.split()
    return [t for t in tokens if t not in stop_words]

df["tokens"] = df["text"].apply(preprocess)

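Consider: we removed stopwords like “the” and “of,” but NLTK’s list leaves in legal boilerplate like “shall.” Could removing (or keeping) certain words bias our analysis? Are there words you’d want to keep that a standard stopword list removes, or words you’d want to remove that it keeps?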
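One preprocessing detail is worth a closer look. The token peoplex27s in the Section 5 output suggests that the raw text encodes some apostrophes as the HTML entity `&#x27;`, which the punctuation filter turns into stray letters and digits. The sketch below shows one possible fix, assuming the text does contain such entities. The function name `preprocess_unescaped`, the sample sentence, and the small stand-in stopword set are all hypothetical; in the notebook the stopword set would be `set(stopwords.words("english"))`.

```python
import html
import re

# Small stand-in stopword set so the example runs without downloading NLTK data;
# in the notebook this would be set(stopwords.words("english")).
demo_stop_words = {"the", "of", "and", "in", "to", "a"}

def preprocess_unescaped(text, stop_words):
    """Variant of preprocess() that decodes HTML entities before stripping punctuation."""
    text = html.unescape(text)              # "&#x27;" becomes "'"
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)     # drop punctuation, as in the original
    return [t for t in text.split() if t not in stop_words]

sample = "The People&#x27;s Assembly shall exercise the power of the State."
tokens = preprocess_unescaped(sample, demo_stop_words)
print(tokens)
```

Applying a change like this would alter the downstream numbers, so the outputs shown in this notebook correspond to the original `preprocess` function, not to this variant.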

4 Load Pre-trained Word2Vec Embeddings

We use Google’s pre-trained Word2Vec model (Mikolov et al. 2013). It represents about 3 million words and phrases as 300-dimensional vectors, trained on ~100 billion words of Google News text. Using a pre-trained model means we don’t have to train our own, which would require a far larger corpus of legal or political text than 193 constitutions. The tradeoff is that the model learned language from Google News, not from legal or constitutional text, so its sense of what “sovereignty” or “rights” means is shaped by journalism. It is important to keep this bias in mind.

Warning

This download is roughly 1.7 GB the first time. Gensim caches it locally after that.

We load the model

model = api.load("word2vec-google-news-300")

model.most_similar("constitution", topn=5)

[('Constitution', 0.799364447593689),
 ('consitution', 0.7791620492935181),
 ('constitutional', 0.7356841564178467),
 ('constitutions', 0.7170382142066956),
 ('constitutional_amendments', 0.6457947492599487)]
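Because the model was trained on news text, some constitutional vocabulary may be missing from it, and missing tokens are silently skipped later when we build document vectors. The sketch below shows one way to measure that, using a hypothetical helper `vocab_coverage` and a toy vocabulary. In the notebook you could pass `model` in place of `toy_vocab`, since it supports the same `token in model` membership test used elsewhere in this notebook.

```python
from collections import Counter

def vocab_coverage(tokens, vocab):
    """Return (share of token occurrences found in vocab, most common missing tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    missing = Counter({t: c for t, c in counts.items() if t not in vocab})
    covered = (total - sum(missing.values())) / total if total else 0.0
    return covered, missing.most_common(5)

# Stand-in vocabulary and tokens, invented for illustration.
toy_vocab = {"parliament", "court", "rights", "president"}
toy_tokens = ["parliament", "hluttaw", "court", "hluttaw", "rights", "majlisi"]

share, top_missing = vocab_coverage(toy_tokens, toy_vocab)
print(f"coverage: {share:.2f}")
print(top_missing)
```

A low coverage share for a particular constitution would be a reason to treat its document vector with extra caution.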

5 TF-IDF Baseline: Most Distinctive Words

Before using embeddings, let’s look at which words are most distinctive of each regime group using Term Frequency-Inverse Document Frequency (TF-IDF) scores alone. TF-IDF is a numerical statistic used in NLP to measure a word’s relevance to a specific document within a collection: a term scores highly when it is frequent in that document but rare across the rest of the collection. We average the TF-IDF scores within each group and rank words by the difference between the two group means.
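To see the mechanics on a small scale, here is a hand-rolled sketch on three invented "documents". It uses the smoothed IDF formula that scikit-learn applies by default, but skips the row normalisation scikit-learn adds, so the numbers are not directly comparable to the notebook’s output. The function names `idf` and `tfidf` are our own.

```python
import math
from collections import Counter

docs = [
    ["parliament", "court", "rights"],
    ["parliament", "court", "president"],
    ["parliament", "hluttaw", "hluttaw"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF, the same form scikit-learn uses by default (smooth_idf=True).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    # Raw term count times IDF; scikit-learn additionally L2-normalises each row.
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

scores = tfidf(docs[2])
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {score:.3f}")
```

In the third document, "hluttaw" outscores "parliament" both because it occurs twice and because no other document contains it.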

tfidf = TfidfVectorizer(max_features=5000)
tfidf_matrix = tfidf.fit_transform(df["tokens"].apply(lambda t: " ".join(t)))
feature_names = tfidf.get_feature_names_out()

auto_mask = (df["regime"] == "autocratic").values
demo_mask = (df["regime"] == "democratic").values

auto_mean = tfidf_matrix[auto_mask].mean(axis=0).A1
demo_mean = tfidf_matrix[demo_mask].mean(axis=0).A1
diff = auto_mean - demo_mean

top_auto_idx = diff.argsort()[-10:][::-1]
top_demo_idx = diff.argsort()[:10]

print("Top 10 overrepresented in AUTOCRATIC constitutions:")
display(pd.DataFrame({"word": feature_names[top_auto_idx], "tfidf_difference": diff[top_auto_idx].round(4)}))

print("\nTop 10 overrepresented in DEMOCRATIC constitutions:")
display(pd.DataFrame({"word": feature_names[top_demo_idx], "tfidf_difference": abs(diff[top_demo_idx]).round(4)}))

Top 10 overrepresented in AUTOCRATIC constitutions:

	word	tfidf_difference
0	peoplex27s	0.0545
1	turkmenistan	0.0469
2	uzbekistan	0.0463
3	hluttaw	0.0450
4	islamic	0.0437
5	belarus	0.0415
6	majlisi	0.0408
7	russian	0.0395
8	council	0.0313
9	zimbabwe	0.0307


Top 10 overrepresented in DEMOCRATIC constitutions:

	word	tfidf_difference
0	shall	0.0700
1	parliament	0.0544
2	may	0.0501
3	court	0.0401
4	section	0.0393
5	office	0.0368
6	house	0.0352
7	person	0.0318
8	subsection	0.0254
9	public	0.0252

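The autocratic list is mostly the names of countries we classify as autocratic (Turkmenistan, Uzbekistan, Belarus, Zimbabwe) and country-specific legislative terms (hluttaw is Myanmar’s parliament, majlisi is Tajikistan’s). That’s because with only 19 autocratic constitutions, unique country references dominate the TF-IDF scores. The top entry, peoplex27s, is most likely a preprocessing artifact: if the source text encodes apostrophes as the HTML entity &#x27;, stripping punctuation turns “people’s” into “peoplex27s.” The democratic list shows more generic institutional vocabulary (parliament, court, office) since those terms are spread across 174 countries rather than concentrated in a few. It also includes “shall” and “may,” which survive preprocessing because NLTK’s stopword list does not contain them.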

We found that country names dominate the autocratic list: what does this tell us about TF-IDF with small sample sizes? How might you address this?
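One possible way to address this is to drop terms that appear in only a few documents before scoring, so that country names cannot dominate. The sketch below does this by hand on an invented corpus; `filter_rare_terms` is a hypothetical helper, and the threshold of 2 is arbitrary. scikit-learn’s `TfidfVectorizer` offers a `min_df` parameter that serves the same purpose.

```python
from collections import Counter

def filter_rare_terms(token_lists, min_docs):
    """Keep only tokens that occur in at least `min_docs` different documents."""
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    keep = {t for t, n in doc_freq.items() if n >= min_docs}
    return [[t for t in tokens if t in keep] for tokens in token_lists]

# Invented token lists for illustration.
toy_corpus = [
    ["turkmenistan", "council", "state"],
    ["belarus", "council", "state"],
    ["zimbabwe", "parliament", "state"],
]
filtered = filter_rare_terms(toy_corpus, min_docs=2)
print(filtered)
```

The cost of this filter is that it also removes rare terms that may be substantively interesting, so the threshold is a judgment call.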

Try It Yourself

Add 3 more countries to the autocracy_ids set that you think should be classified as autocratic. Re-run the TF-IDF cell. Do the top words change?

6 Create Document Vectors

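Simple averaging treats every word equally, but words like “article” and “shall” appear in nearly every constitution and carry little distinctive meaning. We therefore weight each token by its inverse document frequency (IDF) before averaging. Because a word contributes once per occurrence, its total weight in the document vector is its term frequency times its IDF, i.e. a TF-IDF weight. This is like weighting a price index where not all items matter equally. Tokens that are missing from the Word2Vec vocabulary are skipped.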

tfidf_full = TfidfVectorizer()
tfidf_full.fit(df["tokens"].apply(lambda t: " ".join(t)))
idf_values = dict(zip(tfidf_full.get_feature_names_out(), tfidf_full.idf_))

def get_weighted_doc_vector(tokens, wv_model, idf_dict):
    vecs, weights = [], []
    for t in tokens:
        if t in wv_model and t in idf_dict:
            vecs.append(wv_model[t])
            weights.append(idf_dict[t])
    if not vecs:
        return np.zeros(wv_model.vector_size)
    return np.average(vecs, axis=0, weights=np.array(weights))

df["doc_vector"] = df["tokens"].apply(lambda t: get_weighted_doc_vector(t, model, idf_values))

zero_mask = df["doc_vector"].apply(lambda v: np.allclose(v, 0))
df = df[~zero_mask].copy()
print(f"Constitutions with valid embeddings: {len(df)}")

Constitutions with valid embeddings: 193
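The following sketch illustrates what the weighting does, using invented two-dimensional "embeddings" and IDF values (none of these numbers come from the real model or corpus). It checks that weighting each token occurrence by its IDF gives the same result as weighting each unique word by count times IDF, and compares both to a plain average.

```python
import numpy as np

# Toy 2-d "embeddings" and IDF values, invented for illustration.
toy_wv = {"shall": np.array([1.0, 0.0]), "sovereignty": np.array([0.0, 1.0])}
toy_idf = {"shall": 1.0, "sovereignty": 3.0}

tokens = ["shall", "shall", "shall", "sovereignty"]

# (a) The notebook's approach: one IDF weight per token occurrence.
vecs = np.array([toy_wv[t] for t in tokens])
weights = np.array([toy_idf[t] for t in tokens])
per_occurrence = np.average(vecs, axis=0, weights=weights)

# (b) Equivalent view: each unique word weighted by count * IDF.
unique = sorted(set(tokens))
tf_idf = np.array([tokens.count(w) * toy_idf[w] for w in unique])
per_word = np.average(np.array([toy_wv[w] for w in unique]), axis=0, weights=tf_idf)

# Plain mean for comparison: the frequent word "shall" dominates.
unweighted = vecs.mean(axis=0)
print(per_occurrence, per_word, unweighted)
```

In this toy case the weighted vector sits halfway between the two words, while the plain average is pulled toward the frequent, low-IDF word.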

7 Analysis: Comparing Constitutional Language

For this part we compare distributions. For each target word, we compute its cosine similarity to every individual constitution vector in each regime group, then test whether the group means differ significantly. Cosine similarity measures how much two vectors point in the same direction; here the two vectors are a target word’s embedding and a constitution’s document vector. If a constitution’s language is close to the target word, the vectors point roughly the same way and the score is close to 1. If they are unrelated, the score is near 0 (in principle it can fall as low as -1 for vectors pointing in opposite directions). The measure ignores vector length, so a long constitution is not scored differently from a short one simply for being long.

The mathematical formula for cosine similarity is: \[\text{cos}(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}\]
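As a small worked example with made-up two-dimensional vectors, take \(\mathbf{A} = (3, 4)\) and \(\mathbf{B} = (4, 3)\):

\[\cos(\theta) = \frac{3 \cdot 4 + 4 \cdot 3}{\sqrt{3^2 + 4^2}\,\sqrt{4^2 + 3^2}} = \frac{24}{5 \cdot 5} = 0.96\]

Scaling \(\mathbf{A}\) by 10 to \((30, 40)\) changes its length but not its direction, and the score is unchanged:

\[\cos(\theta) = \frac{30 \cdot 4 + 40 \cdot 3}{\sqrt{30^2 + 40^2}\,\sqrt{4^2 + 3^2}} = \frac{240}{50 \cdot 5} = 0.96\]

This is the sense in which cosine similarity ignores magnitude and depends only on direction.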

target_words = [
    "sovereignty", "freedom", "rights", "security", "justice",
    "equality", "democracy", "power", "protection", "welfare",
    "property", "military", "religion", "party",
]

auto_vecs = np.stack(df.loc[df["regime"] == "autocratic", "doc_vector"].values)
demo_vecs = np.stack(df.loc[df["regime"] == "democratic", "doc_vector"].values)

results = []
for word in target_words:
    if word not in model:
        continue
    wv = model[word].reshape(1, -1)
    sim_auto = cosine_similarity(wv, auto_vecs).flatten()
    sim_demo = cosine_similarity(wv, demo_vecs).flatten()
    n1, n2 = len(sim_auto), len(sim_demo)
    var_auto = sim_auto.var(ddof=1)
    var_demo = sim_demo.var(ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var_auto + (n2 - 1) * var_demo) / (n1 + n2 - 2))
    d = (sim_auto.mean() - sim_demo.mean()) / pooled_std if pooled_std > 0 else 0
    t_stat, p_val = stats.ttest_ind(sim_auto, sim_demo, equal_var=False)
    results.append({
        "word": word,
        "mean_autocratic": round(sim_auto.mean(), 4),
        "mean_democratic": round(sim_demo.mean(), 4),
        "difference": round(sim_auto.mean() - sim_demo.mean(), 4),
        "cohens_d": round(d, 3),
        "p_value": round(p_val, 4),
    })

results_df = pd.DataFrame(results).sort_values("cohens_d", key=abs, ascending=False)
results_df

	word	mean_autocratic	mean_democratic	difference	cohens_d	p_value
12	religion	0.3989	0.3703	0.0286	0.901	0.0029
6	democracy	0.4581	0.4175	0.0405	0.886	0.0003
5	equality	0.4221	0.3901	0.0320	0.846	0.0031
7	power	0.2940	0.2788	0.0152	0.821	0.0032
1	freedom	0.3825	0.3527	0.0298	0.773	0.0043
11	military	0.3542	0.3351	0.0191	0.692	0.0026
10	property	0.2979	0.3142	-0.0163	-0.680	0.0810
0	sovereignty	0.4337	0.4080	0.0257	0.618	0.0115
3	security	0.3143	0.2990	0.0153	0.546	0.0166
9	welfare	0.3799	0.3660	0.0139	0.523	0.0201
2	rights	0.3578	0.3510	0.0067	0.228	0.4694
13	party	0.3273	0.3302	-0.0029	-0.135	0.6426
8	protection	0.3164	0.3139	0.0025	0.100	0.7441
4	justice	0.3924	0.3900	0.0025	0.088	0.7728

Cohen’s d expresses the difference between the group means in units of the pooled standard deviation (by convention, about 0.2 is small, 0.5 medium, and 0.8 large). The p-value, here from Welch’s t-test, tells us how likely we would be to see a difference at least this large by chance if the two groups truly used the same language. Positive d values mean the word is closer to autocratic constitutions.
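To make the two statistics concrete, here is a minimal standard-library sketch of the same formulas on invented scores (the numbers are not taken from the constitutions data). The function names `cohens_d` and `welch_t` are our own; the notebook itself uses NumPy and `scipy.stats.ttest_ind` for these steps.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (sample variances, ddof=1)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def welch_t(a, b):
    """Welch's t statistic, which does not assume equal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Invented similarity scores, not taken from the constitutions data.
group_a = [0.42, 0.44, 0.46, 0.48]
group_b = [0.38, 0.40, 0.42, 0.44, 0.46, 0.48]

d = cohens_d(group_a, group_b)
t = welch_t(group_a, group_b)
print(f"d = {d:.3f}, t = {t:.3f}")
```

In this toy case the effect size is in the medium range, yet the t statistic is small because the groups contain only a handful of observations. Effect size and statistical evidence are separate questions.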

The standout finding: “democracy” and “equality” show large effect sizes toward autocratic constitutions, and “freedom” (d = 0.77) falls just short of the conventional 0.8 threshold, all with p-values below 0.01. “Religion” has the largest effect of all (d = 0.90). This is in line with our hypothesis that autocratic constitutions lean more heavily on language around equality and human rights, which they then do not uphold. The only word with a sizeable lean toward democratic constitutions is “property” (d = -0.68), consistent with the emphasis on individual economic rights in democratic nations, although its p-value (0.081) misses the conventional 0.05 threshold. With only 19 autocratic constitutions, and 14 words tested without any correction for multiple comparisons, statistical power is limited and individual constitutions have outsized influence, so treat p-values as suggestive rather than definitive.

Check

A word shows cohens_d = 0.9 but p_value = 0.5, off just 19 autocratic constitutions. Reliable finding or not?

Answer

No: big effect size, but p = 0.5 means a gap that size could easily be chance. With 19 documents one outlier can carry the whole result.
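One simple way to probe how much a single document drives a small group’s result is a leave-one-out check: recompute the group mean after dropping each observation in turn. This is an add-on illustration, not part of the analysis above, and the scores below are invented. `leave_one_out_means` is a hypothetical helper.

```python
from statistics import mean

def leave_one_out_means(values):
    """Mean of the group after dropping each observation in turn."""
    return [mean(values[:i] + values[i + 1:]) for i in range(len(values))]

# Invented scores for a small group with one unusually high value.
small_group = [0.30, 0.31, 0.29, 0.30, 0.50]

full_mean = mean(small_group)
loo = leave_one_out_means(small_group)
print(f"full mean: {full_mean:.3f}")
print(f"range after dropping one: {min(loo):.3f} to {max(loo):.3f}")
```

If dropping one document moves the mean by an amount comparable to the difference between groups, the finding rests heavily on that document.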

Try It Yourself

Add 3-5 words to target_words that you think would differ between regime types. Re-run the analysis. What did you find?

8 Analysis: Visualising with PCA

Our embeddings live in 300 dimensions, which we can’t visualize directly. Instead we use Principal Component Analysis (PCA), which finds the two directions in that 300-dimensional space along which our data varies the most and projects everything onto those two axes. The axes (PC1 and PC2) don’t correspond to any single word or concept; they simply capture the most spread in the data. Points that are close together on the plot represent constitutions with similar language; points far apart use language differently.

all_vectors = np.stack(df["doc_vector"].values)
coords = PCA(n_components=2).fit_transform(all_vectors)

plot_df = pd.DataFrame(coords, columns=["PC1", "PC2"])
plot_df["regime"] = df["regime"].values
plot_df["country"] = df["country"].values
plot_df["country_id"] = df["country_id"].values

highlight_ids = {"China", "Russian_Federation__the", "Cuba", "Iran_Islamic_Rep_of_",
                 "United_States_of_America", "Canada"}
plot_df["label"] = plot_df.apply(
    lambda r: r["country"] if r["country_id"] in highlight_ids else "", axis=1
)

fig = px.scatter(
    plot_df, x="PC1", y="PC2", color="regime",
    color_discrete_map={"autocratic": "red", "democratic": "blue"},
    hover_data=["country"], opacity=0.5, text="label",
    title="National Constitutions in Embedding Space (PCA)",
)
fig.update_traces(marker_size=8, textposition="top center")
fig.update_layout(width=800, height=600)
fig.show()

We can see many of the autocratic constitutions are on the right side of the visualization and most of the democratic constitutions lean left. You can hover over the dots to see which country they correspond to.
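When reading a plot like this, it helps to know how much of the total variation the two axes actually capture. The sketch below computes PCA by hand with NumPy’s SVD on synthetic data (a stand-in for the document vectors, not the real ones) and reports the share of variance per component. With scikit-learn, the fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for document vectors: 50 points in 5 dimensions,
# with most of the spread deliberately placed along the first axis.
X = rng.normal(size=(50, 5)) * np.array([5.0, 2.0, 0.5, 0.5, 0.5])

X_centered = X - X.mean(axis=0)
# Rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
coords_2d = X_centered @ Vt[:2].T

explained = S**2 / np.sum(S**2)
print("share of variance in PC1, PC2:", explained[:2].round(3))
```

If the first two components capture only a small share of the variance, apparent clusters or gaps in the scatter plot should be interpreted cautiously.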

Discussion

Find your home country on the plot. Are you surprised by its position? What might explain where it ended up?

Try It Yourself: Ambiguous Cases

Hungary, Turkey, and Singapore are often described as “hybrid” or “competitive authoritarian” regimes. They don’t fit cleanly into the binary classification we used. Find them on the PCA plot. Where do they land relative to the autocratic and democratic clusters? What does this tell you about the limits of binary regime classification, and about what word embeddings can and can’t detect?

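Optional: modify highlight_ids in the cell above to include these three countries so they’re labelled directly on the plot. Check df["country_id"] for the exact identifiers first, since some ids differ from the plain country name (for example "Russian_Federation__the").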

9 Conclusion

In this notebook we used word embeddings to compare the constitutional language of autocratic and democratic regimes. We built TF-IDF weighted document vectors using Google’s pre-trained Word2Vec model and compared how politically meaningful words relate to each group’s language.

The results are broadly consistent with our original hypothesis that autocratic constitutions lean heavily on democratic vocabulary and notions of equality. Words like “democracy” and “equality” showed large effect sizes toward autocratic constitutions, with “freedom” close behind. This is consistent with a well-documented pattern in politics: authoritarian regimes highlight rights they do not enforce. The one word that tilted noticeably toward democratic constitutions was “property,” in line with the emphasis on individual economic rights in democracies, though that difference was not statistically significant at the 5% level (p = 0.081).

Further Discussion Questions:

What other corpora could you apply this method to? Possible examples include trade agreements, UN General Assembly speeches, and party platforms.
Given that our model was trained on Google News, how might results differ with a model trained on legal or constitutional text? If you were advising a policy organization, what would you need to validate before drawing conclusions from an analysis like this?

10 References

Elkins, Z., Ginsburg, T., & Melton, J. (2014). Constitute: The world’s constitutions to read, search, and compare. Comparative Constitutions Project. https://constituteproject.org
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. https://arxiv.org/abs/1301.3781
Coppedge, M., Gerring, J., Knutsen, C. H., Lindberg, S. I., Teorell, J., et al. (2023). V-Dem Dataset v13. Varieties of Democracy (V-Dem) Project. https://www.v-dem.net/
Rodriguez, P. L., & Spirling, A. (2022). Word embeddings: What works, what doesn’t, and how to tell the difference for applied research. Journal of Politics, 84(1), 101-115. https://doi.org/10.1086/715162
Rheault, L., & Cochrane, C. (2020). Word embeddings for the analysis of ideological placement in parliamentary corpora. Political Analysis, 28(1), 112-133. https://doi.org/10.1017/pan.2019.26