When working with survey data, several strategies for cleaning and preparing the data are worth incorporating into one's routines and workflow. This vignette uses the CEOdata package to present several examples. It relies primarily on the data retrieved by the CEOdata() function in its default form, which returns the compiled “Barometers” from 2014 onwards.
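The examples in this vignette use an object called d and functions from dplyr and ggplot2, but the setup step is not shown. A minimal sketch of that setup, assuming the three packages are installed and that d is meant to hold the default dataset (the call downloads the data, so it needs an internet connection):

```r
library(CEOdata)
library(dplyr)    # group_by(), mutate(), count(), ...
library(ggplot2)  # plotting

# Default call: the compiled Barometers (assumed to be what `d` refers to)
d <- CEOdata()
```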
Once you have retrieved the survey data, it is easy to integrate them into your regular workflow. For instance, to get the overall number of males and females surveyed:
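The call behind this count is not shown; a plausible minimal sketch, assuming d holds the data returned by CEOdata() and that the tally is done with dplyr's count() on the SEXE variable:

```r
library(dplyr)

# Assumes `d` already holds the data returned by CEOdata()
# One row per category of SEXE, with the number of respondents in `n`
d |>
  count(SEXE)
```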
## # A tibble: 2 × 2
## SEXE n
## <fct> <int>
## 1 Masculí 24963
## 2 Femení 26875
Or to trace the proportion of females surveyed over time, across barometers:
d |>
  group_by(BOP_NUM) |>
  # Women are labelled "Femení" in SEXE (see the count above)
  summarize(propFemales = sum(SEXE == "Femení", na.rm = TRUE) / n()) |>
  ggplot(aes(x = BOP_NUM, y = propFemales, group = 1)) +
  geom_point() +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  expand_limits(y = c(0, 1))
Proportion of females in the different Barometers.
The metadata also make it possible to examine the periods in which quantitative studies have been in the field since 2018. In addition, we can distinguish between studies that provide microdata and those that do not.
CEOmeta() |>
  filter(`Dia inici treball de camp` > "2018-01-01") |>
  ggplot(aes(xmin = `Dia inici treball de camp`,
             xmax = `Dia final treball de camp`,
             y = reorder(REO, `Dia final treball de camp`),
             color = microdata_available)) +
  geom_linerange() +
  xlab("Date") + ylab("Surveys with fieldwork") +
  theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())
Fieldwork periods.
Once a dataset has been retrieved from the CEO servers, it is important to clean it and arrange it to one’s individual preferences, and store the result in an R object.
The following example, for instance, processes several variables of the survey and selects them; the resulting object can then be stored in workspace (RData) format.
survey.data <- d |>
  mutate(
    # Women are labelled "Femení" in SEXE (see the count above)
    Female = ifelse(SEXE == "Femení", 1, 0),
    Age = EDAT,
    # Pass NA correctly
    Income = ifelse(INGRESSOS_1_15 %in% c("No ho sap", "No contesta"),
                    NA,
                    INGRESSOS_1_15),
    Date = Data,
    # Reorganize factor labels
    `Place of birth` = factor(case_when(
      LLOC_NAIX == "Catalunya" ~ "Catalonia",
      LLOC_NAIX %in% c("No ho sap", "No contesta") ~ NA_character_,
      TRUE ~ "Outside Catalonia")),
    # Convert into numerical (integer)
    `Interest in politics` = case_when(
      INTERES_POL == "Gens" ~ 0L,
      INTERES_POL == "Poc" ~ 1L,
      INTERES_POL == "Bastant" ~ 2L,
      INTERES_POL == "Molt" ~ 3L,
      TRUE ~ NA_integer_),
    # Convert into numeric (double) and properly address missing values;
    # as.numeric() on a factor returns its level codes
    `Satisfaction with democracy` = ifelse(
      SATIS_DEMOCRACIA %in% c("No ho sap", "No contesta"),
      NA,
      as.numeric(SATIS_DEMOCRACIA))) |>
  # Center income to the median
  mutate(Income = Income - median(Income, na.rm = TRUE)) |>
  # Pick only specific variables
  select(Date, Female, Age, Income,
         `Place of birth`, `Interest in politics`,
         `Satisfaction with democracy`)
Finally, the result can be stored for further analysis (without the need to download and arrange the data again) in R’s native format:
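The storing step itself is not shown; a minimal sketch using base R's save(), assuming the cleaned object is called survey.data as in the example above. The file name is hypothetical and only illustrative:

```r
# Assumes `survey.data` was created as in the cleaning example above.
# "my_cleaned_dataset.RData" is an illustrative file name, not a required one.
save(survey.data, file = "my_cleaned_dataset.RData")

# In a later session, load() restores the object under its original name
load("my_cleaned_dataset.RData")
```

An alternative in base R is saveRDS() with readRDS(), which stores a single object and lets you choose its name when reading it back.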
Several packages build convenient tables with a descriptive summary of a dataset. For example, the vtable package produces a table with descriptive statistics.
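The call that generates the table is not shown; a minimal sketch, assuming it relies on sumtable() from vtable with its default settings and that survey.data is the object created in the cleaning example:

```r
library(vtable)

# Assumes `survey.data` was created as in the cleaning example.
# sumtable() (also available as st()) summarizes each variable.
sumtable(survey.data)
```

With out = "return", sumtable() returns the summary as a data frame instead of displaying it, which is convenient for further processing.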
Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
---|---|---|---|---|---|---|---|
Female | 51838 | 0 | 0 | 0 | 0 | 0 | 0 |
Age | 51838 | 51 | 18 | 18 | 37 | 65 | 101 |
Income | 38624 | 0.12 | 2.9 | -7 | -2 | 2 | 7 |
Place of birth | 51836 | | | | | | |
… Catalonia | 36315 | 70% | | | | | |
… Outside Catalonia | 15521 | 30% | | | | | |
Interest in politics | 33736 | 1.5 | 0.98 | 0 | 1 | 2 | 3 |
Satisfaction with democracy | 50880 | 3 | 0.74 | 1 | 3 | 4 | 4 |
Another option is the compareGroups package, which flexibly produces tables that compare descriptive statistics across different groups of individuals.
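The code for the comparison is not shown either; a minimal sketch, assuming the groups are defined by Female and that the remaining cleaned variables (all except Date) are the ones described:

```r
library(compareGroups)

# Assumes `survey.data` was created as in the cleaning example.
# Left-hand side: grouping variable; right-hand side: variables to describe.
comparison <- compareGroups(
  Female ~ Age + Income + `Place of birth` +
    `Interest in politics` + `Satisfaction with democracy`,
  data = survey.data)

# Build the printable table of descriptives by group
createTable(comparison)
```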
##
## --------Summary descriptives table by 'Female'---------
##
## ___________________________________________________
## 0 p.overall
## N=51838
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Edat 50.7 (17.8) .
## Income 0.12 (2.85) .
## Place of birth: .
## Catalonia 36315 (70.1%)
## Outside Catalonia 15521 (29.9%)
## Interest in politics 1.46 (0.98) .
## Satisfaction with democracy 3.00 (0.74) .
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
The development of CEOdata (track changes, propose improvements, report bugs) can be followed on GitHub.
If you use the data and the package, please cite and properly acknowledge the CEO and the package, respectively.