This notebook covers how to create visualizations in Stata. We look at different types of graphs, such as scatter plots and histograms, how to export figures, and how to edit a figure for clarity.
Author
Marina Adshade, Paul Corcuera, Giulia Lo Forte, Jane Platt
Published
29 May 2024
Prerequisites
Be able to effectively use Stata do-files and generate log-files.
Be able to change your directory so that Stata can find your files.
Import datasets in .csv and .dta format.
Save data files.
Learning Outcomes
Know when to use the following kinds of visualizations to answer specific questions using a data set:
scatterplots
line plots
bar plots
histograms
Generate and fine-tune visualizations using the Stata command twoway and its different options.
Use graph export to save visualizations in various formats including .svg, .png and .pdf.
9.0 Intro
Note: The best approach to completing this module is to copy and paste these commands into a do-file in Stata. Because Stata produces graphs in a separate window, Jupyter Notebooks will not produce a graph that we can see when we execute the commands on this page. The most we can do is export image files to a directory on our computer. We will see these commands whenever a graph is produced below.
We’ll continue working with the fake data set we have been using as we work on developing our research skills. Recall that this data set is simulating information for workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.
>>> import sys
>>> sys.path.append('/Applications/Stata/utilities')  # make sure this is the same as what you set up in Module 01, Section 1.3: Setting Up the STATA Path
>>> from pystata import config
>>> config.init('se')
%%stata
clear*
*cd ""
use fake_data, clear
. clear*
. *cd ""
. use fake_data, clear
.
Data visualization is an effective way of communicating ideas to our audience, whether it’s for an academic paper or a business setting. It can be a powerful medium to motivate our research, illustrate relationships between variables, and provide some intuition behind why we applied certain econometric methods.
The real challenge is not understanding how to use Stata to create graphs. Instead, the challenge is figuring out which graph will do the best job at telling our empirical story. Before creating any graphs, we must identify the message we want the graph to convey. Try to answer these questions: Who is our audience? What is the question we are trying to answer?
9.1 Types of Graphs
9.1.1 Scatter Plot using twoway
What is it, and when should we use it?
Scatter plots are frequently used to demonstrate how two quantitative variables are related to one another. This plot works well when we are interested in showing relationships and groupings among variables from relatively large data sets.
Below is a nice example.
Scatter plot presenting the relationship of country religiosity vs wealth
Let’s say we want to plot the log-earnings by year using our fake data set. We begin by generating a new variable for log-earnings.
%%stata
generate log_earnings = log(earnings)
label var log_earnings "Log-earnings" // We are adding the label "Log-earnings" to the variable log_earnings
.
. generate log_earnings = log(earnings)
.
. label var log_earnings "Log-earnings" // We are adding the label "log-earning
> s" to the variable log_earnings
.
Now let’s create a new data set that contains the mean of log-earnings by year. The command collapse replaces the data in memory, so we first use the command preserve to save a copy of the data set that we are working on. Later, we use the command restore to bring back the original data set.
.
. preserve
. collapse (mean) log_earnings, by(year)
. describe
Contains data
Observations: 17
Variables: 2
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
year int %8.0g Calendar Year
log_earnings float %9.0g (mean) log_earnings
-------------------------------------------------------------------------------
Sorted by: year
Note: Dataset has changed since last saved.
.
To create a graph between two numeric variables, we need to use the command twoway. The format for this command is twoway (type_of_graph y-axis_variable x-axis_variable). Note that the y-axis variable comes first.
In this case we want to create a graph that is a scatterplot that shows log-earnings as the dependent variable (y-axis) and year as the explanatory variable (x-axis variable).
.
. twoway (scatter log_earnings year)
.
. graph export graph1.jpg, as(jpg) replace
(file graph1.jpg not found)
file graph1.jpg written in JPEG format
.
Note that no graph appears in the notebook when we execute this command. However, we can find the graph saved in our working directory under the name “graph1.jpg”. That graph will look like this:
myscatterplot
A second way that we can create this graph is by replacing the graph type scatter with the graph type connected. This will create the graph below.
.
. twoway (connected log_earnings year)
.
. graph export graph1.jpg, as(jpg) replace
file graph1.jpg written in JPEG format
.
connected-scatter-plot
9.1.2 Line Plot using twoway
What is it, and when should we use it?
Line plots visualize trends with respect to an independent, ordered quantity (e.g., time). This plot works well when one of our variables is ordinal (time-like) or when we want to display multiple series on a common timeline.
Line plots can be generated using Stata’s twoway command we saw earlier. This time, instead of writing scatter for the type of graph, we write line.
Below we introduce something new. We have added options to the graph that change the title on the x-axis (xtitle) and on the y-axis (ytitle). Options for the graph as a whole appear at the end of the command, after the comma. As we will see, options that affect an individual plot appear in the brackets where the plot is specified.
.
. twoway (line log_earnings year), xtitle("Year") ytitle("Log-earnings")
.
. graph export graph3.jpg, as(jpg) replace
(file graph3.jpg not found)
file graph3.jpg written in JPEG format
.
It should look something like this:
mylineplot
Now, let’s try creating a line plot with multiple series on a common twoway graph. To create this graph we first need to restore our data to the original version of the “fake_data” data set.
%%stata
restore
.
. restore
.
Now that we have done that, we can collapse the data to compute the mean of log_earnings by both year and treated.
.
. preserve
.
. collapse (mean) log_earnings, by(treated year)
.
. describe
Contains data
Observations: 34
Variables: 3
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
year int %8.0g Calendar Year
treated byte %8.0g Treatment Dummy
log_earnings float %9.0g (mean) log_earnings
-------------------------------------------------------------------------------
Sorted by: treated year
Note: Dataset has changed since last saved.
.
We can create a graph that separates the earnings between the treated and non-treated over time. We need to add each line separately to the graph. Within brackets, we can choose the observations we want included. We can also add line specific options, like color.
%%stata
twoway (connected log_earnings year if treated==1, color(orange)) (connected log_earnings year if treated==0, color(purple)), xtitle(Year) ytitle(Average Log Earnings)
graph export graph4.jpg, as(jpg) replace
.
. twoway (connected log_earnings year if treated==1, color(orange)) (connected
> log_earnings year if treated==0, color(purple)), xtitle(Year) ytitle(Average
> Log Earnings)
.
. graph export graph4.jpg, as(jpg) replace
(file graph4.jpg not found)
file graph4.jpg written in JPEG format
.
One final tip about working with scatterplots: sometimes we will want to draw a fit line on our graph that approximates the relationship between the two variables. We can do this by adding a second graph to the twoway plot that uses the graph type lfit.
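As a minimal sketch of the fit-line tip, the commands below overlay a linear fit on the yearly means for the treated group. This assumes the collapsed data set from the previous step (mean log_earnings by treated and year) is still in memory; the file name graph_lfit.jpg and the legend labels are placeholders chosen for illustration.

```stata
* Assumes the collapsed data (mean log_earnings by treated and year)
* from the previous step is still in memory.
twoway (scatter log_earnings year if treated==1) ///
       (lfit log_earnings year if treated==1), ///
       xtitle("Year") ytitle("Average log-earnings") ///
       legend(label(1 "Yearly mean") label(2 "Linear fit"))

* graph_lfit.jpg is a placeholder file name.
graph export graph_lfit.jpg, as(jpg) replace
```

The scatter plot is listed first and the fit line second, so the legend keys 1 and 2 refer to them in that order.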
9.1.3 Histogram using twoway
What is it, and when should we use it?
Histograms visualize the distribution of one quantitative variable. This plot works well when we are interested in visualizing the values a variable takes and how often they occur. For a continuous variable such as log-earnings, Stata groups the values into bins and plots the frequency of each bin.
Now let’s restore the original data set so that we can plot the distribution of log_earnings and draw a simple histogram.
.
. restore
.
. histogram log_earnings
(bin=51, start=3.58887, width=.28193801)
.
. graph export graph5.jpg, as(jpg) replace
(file graph5.jpg not found)
file graph5.jpg written in JPEG format
.
It will look like this:
myhistogram
We can also draw two histograms on one plot. They won’t look very nice unless we change the plot colours though. But, if we execute the command below, it should create a nice graph that allows us to compare the distributions of log_earnings between the treatment and control groups.
.
. twoway (histogram log_earnings if treated==0, color(orange) lcolor(black))
> ///
> (histogram log_earnings if treated==1, color(olive) lcolor(black)),
> ///
> legend(label(1 "Untreated") label(2 "Treated"))
.
. graph export graph6.jpg, as(jpg) replace
(file graph6.jpg not found)
file graph6.jpg written in JPEG format
.
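Overlapping histograms can hide one another. One possible variation, sketched below, makes both sets of bars semi-transparent with the colorstyle%# opacity syntax. It assumes the original fake_data is in memory with log_earnings already generated; the 50% opacity and the file name are arbitrary choices for illustration.

```stata
* Assumes the original fake_data is in memory and log_earnings exists.
* The first plot is the untreated group (treated==0), so it gets legend key 1.
twoway (histogram log_earnings if treated==0, color(orange%50)) ///
       (histogram log_earnings if treated==1, color(olive%50)), ///
       legend(label(1 "Untreated") label(2 "Treated"))

* graph6_transparent.jpg is a placeholder file name.
graph export graph6_transparent.jpg, as(jpg) replace
```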
9.1.4 Bar Plot using graph
What is it, and when should we use it?
Bar plots visualize comparisons of amounts. They are useful when we are interested in comparing a few categories as parts of a whole, or across time. Bar plots should always start at 0. Starting bar plots at any number besides 0 is generally considered a misrepresentation of the data.
Let’s plot mean earnings by region. Note that the regions are numbered in our data set.
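To make a bar plot, we have to use the command graph instead of twoway. The syntax is similar: graph bar (statistic) variable, over(grouping_var). Here, variable is the variable being summarized, and each bar shows the chosen statistic for one value of grouping_var.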
See an example below:
%%stata
graph bar (mean) earnings, over(region)
graph export graph7.jpg, as(jpg) replace
.
. graph bar (mean) earnings, over(region)
. graph export graph7.jpg, as(jpg) replace
(file graph7.jpg not found)
file graph7.jpg written in JPEG format
.
mybarchart
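We can also create a horizontal bar plot by using the subcommand hbar instead of bar. In the example below, we also add a second over() option so that the bars are grouped by both treatment status and region.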
.
. graph hbar (mean) earnings, over(treated) over(region)
.
. graph export graph9.jpg, as(jpg) replace
(file graph9.jpg not found)
file graph9.jpg written in JPEG format
.
mybarchart3
9.2 Exporting Format
So far, we have been exporting our graphs in .jpg format. However, we can also export graphs in other formats such as .svg, .png, and .pdf. Having several formats available may be particularly helpful if using LaTeX to write a paper, as .svg files cannot be used directly with LaTeX PDF output.
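As a short sketch, the same graph can be written to several formats by changing the file suffix and the as() option of graph export. This assumes fake_data is in memory with log_earnings already generated; the name myfigure is a placeholder.

```stata
* Draw any graph first; here, a histogram of log_earnings from fake_data.
histogram log_earnings

* myfigure is a placeholder name; as() states the output format explicitly.
graph export myfigure.png, as(png) replace
graph export myfigure.pdf, as(pdf) replace
graph export myfigure.svg, as(svg) replace
```

The replace option overwrites an existing file with the same name, which is convenient when we re-run a do-file.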
9.3 Fine-tuning a Graph Further
In order to customize our graph further, we can use the tools in the Stata graph window or the graph option commands we have been using in this module. Namely, we can include and adjust the following:
title
axis titles
legend
axis
scale
labels
theme (i.e. colour, appearance)
adding lines, text or objects
Let’s see how to add some of these customizations to our graphs in practice. For example, let’s modify our latest bar graph such that:
the title is “Earnings by region and treatment”: we do this with the option title();
the axis title is “Earnings (average)”: we do this with the option ytitle();
the regions and the treatment status are labeled: we do this with the sub-option relabel within the over option, over(varname, relabel()). Remember that relabelling follows the order in which the values appear: e.g., for treated and untreated, the not treated group appears first and the treated group appears second, therefore we have to use 1 to indicate the non-treated group and 2 to indicate the treated group: over(treated, relabel(1 "Not treated" 2 "Treated"));
the background color is white: we do this with the option graphregion(color());
the color of the bars is dark green: we do this using the option bar and its suboptions. Remember that we need to specify this option for each variable we are plotting in the bars. In our case, we are only plotting variable earnings, which is by definition the first variable we are plotting, therefore all sub-options refer to 1: bar(1, fcolor(dkgreen)).
%%stata
graph hbar (mean) earnings, ///
    over(treated, relabel(1 "Not treated" 2 "Treated")) ///
    over(region, relabel(1 "A" 2 "B" 3 "C" 4 "D" 5 "E")) ///
    title("Earnings by region and treatment") ytitle("Earnings (average)") ///
    graphregion(color(white)) bar(1, fcolor(dkgreen))
graph export graph10.jpg, as(jpg) replace
.
. graph hbar (mean) earnings, ///
> over(treated, relabel(1 "Not treated" 2 "Treated")) ///
> over(region, relabel(1 "A" 2 "B" 3 "C" 4 "D" 5 "E")) ///
> title("Earnings by region and treatment") ytitle("Earnings (average)") //
> /
> graphregion(color(white)) bar(1, fcolor(dkgreen))
.
. graph export graph10.jpg, as(jpg) replace
(file graph10.jpg not found)
file graph10.jpg written in JPEG format
.
These are just some of the customizations available to you. Other common options are:
adding a labelled legend to our graphs. To include the legend, we use the option legend( label(number_of_label "label"));
adding a vertical line, for example one indicating the year in which the treatment was administered (2003). To include the indicator line we use the option xline(). The line can also have different characteristics. For example, we can change its color and pattern using the options lcolor() and lpattern().
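The two options above can be combined in one graph. The sketch below redraws the treated and untreated series with a labelled legend and a dashed vertical line at 2003. It assumes fake_data is in memory with log_earnings already generated and that no preserve is currently active; the gray color, dash pattern, and file name are illustrative choices.

```stata
* Assumes fake_data is in memory with log_earnings already generated.
preserve
collapse (mean) log_earnings, by(treated year)

* xline() draws the vertical line; its suboptions set color and pattern.
twoway (connected log_earnings year if treated==1, color(orange)) ///
       (connected log_earnings year if treated==0, color(purple)), ///
       xline(2003, lcolor(gs8) lpattern(dash)) ///
       legend(label(1 "Treated") label(2 "Untreated")) ///
       xtitle("Year") ytitle("Average log-earnings")

* graph_xline.jpg is a placeholder file name.
graph export graph_xline.jpg, as(jpg) replace
restore
```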
We can always go back to the Stata documentation to explore the options available based on what we need to do. We can also adjust many of these aspects in the Graph Editor, which we can open from the graph window that appears whenever we create a new graph (top right corner). Just don’t forget to save your graph when you are done, since these changes won’t be recorded in your do-file!
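One way to keep a graph that can be reopened later is to save it in Stata's own .gph format with graph save and reload it with graph use. The sketch below assumes fake_data is in memory with log_earnings already generated; mygraph.gph is a placeholder name.

```stata
* Draw a graph so that there is one in memory to save.
histogram log_earnings

* Save the current graph in Stata's .gph format (placeholder file name).
graph save mygraph.gph, replace

* Later, reload the saved graph, for example to reopen it in the Graph Editor.
graph use mygraph.gph
```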
When thinking about colors, always make sure that your graphs are accessible to everyone. Run the code cell below to view the colorstyle options available in Stata. If the color you desire is not available, you can input its RGB code within quotes: for example, a red line would be lcolor("248 7 27"). You can learn more about accessible color combinations on this website.
%%stata
help colorstyle
.
. help colorstyle
[G-4] colorstyle -- Choices for color
(View complete PDF manual entry)
Syntax
------
Set color of <object> to colorstyle
<object>color(colorstyle)
Set color of all affected objects to colorstyle
color(colorstyle)
Set opacity of <object> to #, where # is a percentage of 100% opacity
<object>color(%#)
Set opacity for all affected objects colors to #
color(%#)
Set both color and opacity of <object>
<object>color(colorstyle%#)
Set both color and opacity of all affected objects
<object>color(colorstyle%#)
colorstyle Description
-------------------------------------------------------------------------
black
stc1 color used by scheme stcolor
stc2 color used by scheme stcolor
.
.
stc15 color used by scheme stcolor
stblue blue used by scheme stcolor
stgreen green used by scheme stcolor
stred red used by scheme stcolor
styellow yellow used by scheme stcolor
gs0 gray scale: 0 = black
gs1 gray scale: very dark gray
gs2
.
.
gs15 gray scale: very light gray
gs16 gray scale: 16 = white
white
blue
bluishgray
brown
cranberry
cyan
dimgray between gs14 and gs15
dkgreen dark green
dknavy dark navy blue
dkorange dark orange
eggshell
emerald
forest_green
gold
gray equivalent to gs8
green
khaki
lavender
lime
ltblue light blue
ltbluishgray light blue-gray, used by scheme s2color
ltkhaki light khaki
magenta
maroon
midblue
midgreen
mint
navy
olive
olive_teal
orange
orange_red
pink
purple
red
sand
sandb bright sand
sienna
stone
teal
yellow
colors used by The Economist magazine:
ebg background color
ebblue bright blue
edkblue dark blue
eltblue light blue
eltgreen light green
emidblue midblue
erose rose
none no color; invisible; draws nothing
background or bg same color as background
foreground or fg same color as foreground
"# # #" RGB value; white = "255 255 255"
"# # # #" CMYK value; yellow = "0 0 255 0"
"hsv # # #" HSV value; white = "hsv 0 0 1"
"#######" hexadecimal value; red = "#FF0000"
colorstyle*# color with adjusted intensity; #'s range from 0 to
255
colorstyle%# color with adjusted opacity; #s range from 0 to 100
*# default color with adjusted intensity
%# default color with adjusted opacity
-------------------------------------------------------------------------
When you specify RGB, CMYK, HSV, or hexadecimal values, it is best to
enclose the values in quotes; type "128 128 128" not 128 128 128.
Description
-----------
colorstyle sets the color and opacity of graph components such as lines,
backgrounds, and bars. Some options allow a sequence of colorstyles with
colorstylelist; see [G-4] stylelists.
Links to PDF documentation
--------------------------
Remarks and examples
The above sections are not included in this help file.
Remarks
-------
colorstyle sets the color and opacity of graph components such as lines,
backgrounds, and bars. Colors can be specified with a named color, such
as black, olive, and yellow, or with a color value in the RGB, CMYK, or
HSV format. colorstyle can also set a component to match the background
color or foreground color. Additionally, colorstyle can modify color
intensity, making the color lighter or darker. Some options allow a
sequence of colorstyles with colorstylelist; see [G-4] stylelists.
To see a list of named colors, use graph query colorstyle. See [G-2]
graph query. For a color palette showing an individual color or
comparing two colors, use palette color. See [G-2] palette.
Remarks are presented under the following headings:
Adjust opacity
Adjust intensity
Specify RGB values
Specify CMYK values
Specify HSV values
Specify hexadecimal values
Export custom colors
Adjust opacity
--------------
Opacity is the percentage of a color that covers the background color.
That is, 100% means that the color fully hides the background, and 0%
means that the color has no coverage and is fully transparent. If you
prefer to think about transparency, opacity is the inverse of
transparency. Adjust opacity with the % modifier. For example, type
green%50
"0 255 0%50"
%30
Omitting the color specification in the command adjusts the opacity of
the object while retaining the default color. For instance, specify
mcolor(%30) with graph twoway scatter to use the default fill color at
30% opacity.
Specifying color%0 makes the object completely transparent and is
equivalent to color none.
Adjust intensity
----------------
Color intensity (brightness) can be modified by specifying a color, *,
and a multiplier value. For example, type
green*.8
purple*1.5
"0 255 255*1.2"
"hsv 240 1 1*.5"
A value of 1 leaves the color unchanged, a value greater than 1 makes the
color darker, and a value less than 1 makes the color lighter. Note that
there is no space between color and *, even when color is a numerical
value for RGB or CMYK.
Omitting the color specification in the command adjusts the intensity of
the object's default color. For instance, specify bcolor(*.7) with graph
twoway bar to use the default fill color at reduced brightness, or
specify bcolor(*2) to increase the brightness of the default color.
Specifying color*0 makes the color as light as possible, but it is not
equivalent to color none. color*255 makes the color as dark as possible,
although values much smaller than 255 usually achieve the same result.
For an example using the intensity adjustment, see Typical use in [G-2]
graph twoway kdensity.
Specify RGB values
------------------
In addition to specifying named colors such as yellow, you can specify
colors with RGB values. An RGB value is a triplet of numbers ranging
from 0 to 255 that describes the level of red, green, and blue light that
must be emitted to produce a given color. RGB is used to define colors
for on-screen display and in nonprofessional printing. Examples of RGB
values are
red = 255 0 0
green = 0 255 0
blue = 0 0 255
white = 255 255 255
black = 0 0 0
gray = 128 128 128
navy = 26 71 111
Specify CMYK values
-------------------
You can specify colors using CMYK values. You will probably only use
CMYK values when they are provided by a journal or publisher. You can
specify CMYK values either as integers from 0 to 255 or as proportions of
ink using real numbers from 0.0 to 1.0. If all four values are 1 or
less, the numbers are taken to be proportions of ink. For example,
red = 0 255 255 0 or, equivalently, 0 1 1 0
green = 255 0 255 0 or, equivalently, 1 0 1 0
blue = 255 255 0 0 or, equivalently, 1 1 0 0
white = 0 0 0 0 or, equivalently, 0 0 0 0
black = 0 0 0 255 or, equivalently, 0 0 0 1
gray = 0 0 0 128 or, equivalently, 0 0 0 .5
navy = 85 40 0 144 or, equivalently, .334 .157 0 .565
Specify HSV values
------------------
You can specify colors with HSV (hue, saturation, and value), also called
HSL (hue, saturation, and luminance) and HSB (hue, saturation, and
brightness). HSV is often used in image editing software. An HSV value
is a triplet of numbers. So that Stata can differentiate them from RGB
values, HSV colors must be prefaced with hsv. The first number specifies
the hue from 0 to 360, the second number specifies the saturation from 0
to 1, and the third number specifies the value (luminance or brightness)
from 0 to 1. For example,
red = hsv 0 1 1
green = hsv 120 1 .502
blue = hsv 240 1 1
white = hsv 0 0 1
black = hsv 0 0 0
navy = hsv 209 .766 .435
Specify hexadecimal values
--------------------------
You can specify colors with hexadecimal values. A hexidecimal value is a
triplet of symbols ranging from 00 to FF that describes the level of red,
green, and blue in the color. The symbols can include digits and letters
A, B, C, D, E, and F in either uppercase or lowercase. For example,
red = #FF0000
green = #00FF00
blue = #0000FF
white = #FFFFFF
black = #000000
Export custom colors
--------------------
graph export stores all colors as RGB+opacity values, that is, RGB values
0-255 and opacity values 0-1. If you need color values from Stata in
CMYK format, use the graph export command with the cmyk(on) option, and
save the graph in one of the following formats: PostScript, Encapsulated
PostScript, or PDF.
You can set Stata to permanently use CMYK colors for PostScript export
files by typing translator set Graph2ps cmyk on and for EPS export files
by typing translator set Graph2eps cmyk on.
The CMYK values returned in graph export may differ from the CMYK values
that you entered. This is because Stata normalizes CMYK values by
reducing all CMY values until one value is 0. The difference is added to
the K (black) value. For example, Stata normalizes the CMYK value 10 10
5 0 to 5 5 0 5. Stata subtracts 5 from the CMY values so that Y is 0 and
then adds 5 to K.
Video example
-------------
Transparency in Stata graphs
.
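As a small illustration of the RGB and opacity syntax described in the help output above, the sketch below fills a histogram with the red RGB code mentioned earlier at 60% opacity. It assumes fake_data is in memory with log_earnings already generated; the opacity value and the file name are arbitrary.

```stata
* Assumes fake_data is in memory with log_earnings already generated.
* "248 7 27" is the RGB red from the text; %60 sets 60% opacity.
histogram log_earnings, fcolor("248 7 27%60") lcolor(black)

* graph_rgb.jpg is a placeholder file name.
graph export graph_rgb.jpg, as(jpg) replace
```

Note that the RGB triplet and the opacity modifier sit inside the same pair of quotes, with no space before the % sign.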
9.4 Wrap Up
In this module, we have learned how to create different types of graphs using the commands twoway, histogram, and graph bar, and how to adjust them with the many options that come with these commands. However, the most valuable take-away from this module is understanding when to use a specific type of graph. Graphs are only able to tell a story if we choose them appropriately and customize them as necessary.
Remember to check the Stata documentation when creating graphs. It is the most complete reference for the options available for each type of graph.
9.5 Wrap-up Table
| Command | Function |
|---|---|
| twoway scatter | It creates a scatterplot. |
| twoway connected | It creates a scatterplot where points are connected by a line. |