---
title: Using the CEOdata package
author: Xavier Fernández-i-Marín
date: "`r format(Sys.time(), '%d/%m/%Y')` - Version `r packageVersion('CEOdata')`"
classoption: a4paper,justified
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using the CEOdata package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(CEOdata)
```

CEOdata is a package that brings into `R` the microdata (individual responses) of the public opinion polls conducted in Catalonia by the "Centre d'Estudis d'Opinió" (CEO, Opinion Studies Center). It has three main functions, each with a separate purpose:

- **`CEOdata()`**: provides the _data_ of the surveys directly in `R`.
- **`CEOmeta()`**: lets the user inspect the details of the available surveys (_metadata_), search for specific topics and get the survey details.
- **`CEOsearch()`**: lets the user search for variables, variable labels and value labels within survey data gathered with the `CEOdata()` function.

# `CEOdata()`: Get the survey data

The most comprehensive kind of data on Catalan public opinion is the "Barometer", which is what the main function `CEOdata()` retrieves by default.

```{r message = FALSE, echo = TRUE, eval = FALSE}
library(CEOdata)
d <- CEOdata()
```

```{r message = FALSE, echo = FALSE, eval = TRUE}
library(knitr)
library(CEOdata)
d <- CEOdata()
# If there is an internet problem, do not run the remaining chunks.
if (is.null(d)) {
  message("The CEO data could not be retrieved; the remaining chunks are not evaluated.")
  knitr::opts_chunk$set(eval = FALSE)
} else {
  knitr::opts_chunk$set(eval = TRUE)
}
```

This provides a cleaned and merged version of all the available Barometers since 2017, giving easy access to the following number of responses and variables:

```{r}
dim(d)
```

```{r}
d
```

```{r}
names(d)[1:50]
```

By default `CEOdata()` transforms the gathered data into pure-R format (labelled SPSS variables are converted into factors).
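To illustrate what this conversion means, the following minimal sketch builds a small labelled vector by hand with the `haven` package and converts it into a factor. The variable name, codes and labels are made up for illustration, and `haven::as_factor()` is used only as an assumed stand-in: the conversion performed internally by `CEOdata()` may differ.

```r
library(haven)

# Hypothetical SPSS-style variable: numeric codes with attached value labels.
sexe <- labelled(c(1, 2, 1), labels = c("Home" = 1, "Dona" = 2))

# The raw form keeps the numeric codes and carries the labels as metadata.
inherits(sexe, "haven_labelled")

# The pure-R form replaces the codes with their labels, stored as a factor.
sexe_factor <- as_factor(sexe)
levels(sexe_factor)
```

The factor form is usually the more convenient one for tabulating and modelling, while the labelled form keeps the original numeric codes of the SPSS file.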
If you want to use `haven_labelled` variables as provided by the raw SPSS files, you can use the argument `raw = TRUE`.

```{r, eval = FALSE}
d.raw <- CEOdata(raw = TRUE)
```

## Specific studies or time frames

`CEOdata()` allows you to select a specific study by providing its internal register in the `reo` argument. REO stands for "Registre d'Estudis d'Opinió" (register of opinion studies); it is the internal name that the CEO uses and the main identifier of the survey, and it is also present in the metadata table. Although many REOs are plain numbers, some consist of a number, a slash and another number, so the value must be passed as a character vector. Only a single REO can be passed, because different data matrices are not guaranteed to share any column and may refer to very different topics.

For instance, to get only the data of the study with register "746" (corresponding to March 2013):

```{r}
d746 <- CEOdata(reo = "746")
d746
```

Not all studies carried out by the CEO (and therefore listed by the `CEOmeta()` function, see below) have microdata available. For convenience, the metadata includes a variable that indicates whether the microdata is available or not (`microdata_available`).

When the `kind` argument is used (which is the default), `CEOdata()` also allows you to restrict the whole set of Barometers to a specific time frame with the arguments `date_start` and `date_end`, both in YYYY-MM-DD format. Notice that only the Barometers are considered when using these arguments, not other studies.

```{r}
b2019 <- CEOdata(date_start = "2019-01-01", date_end = "2019-12-31")
b2019
```

## Extra variables

By default `CEOdata()` adds new variables to the original matrix. These variables are created for convenience, such as the date of the survey. The CEO data does not always provide a day of the month. In that case, 28 is used.
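The day-of-month rule can be sketched in base R as follows. This is only an illustration of the idea, not the package's actual code: the helper `build_survey_date()` and its arguments are hypothetical.

```r
# Hypothetical helper: build a Date from year and month, using day 28
# whenever the day of the month is missing (NA).
build_survey_date <- function(year, month, day = NA_integer_) {
  day <- ifelse(is.na(day), 28L, day)
  as.Date(sprintf("%04d-%02d-%02d", year, month, day))
}

build_survey_date(2019, 3)      # no day available, so day 28 is used
build_survey_date(2019, 3, 11)  # day available, so it is kept
```

Day 28 has the convenient property of existing in every month, including February, so a date built this way is always valid.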
The extra variables appear at the end of the dataset and can be distinguished from the original CEO variables because only their first letter is capitalized.

```{r}
tail(names(d))
```

If you prefer all variable names in lowercase, you can simply convert them with `tolower()`:

```{r}
d.lowercase <- d
names(d.lowercase) <- tolower(names(d.lowercase))
```

# `CEOmeta()`: Access to the metadata of studies and surveys

The function `CEOmeta()` makes it easy to retrieve, search and restrict by time the list of all the surveys produced by the CEO, which amounts to more than a thousand as of early 2022.

When called without arguments, the function transparently downloads the latest version of the metadata published by the center and caches it, so that subsequent calls in the same `R` session do not need to download it again.

```{r}
CEOmeta()
```

## Get a specific study

To get the metadata of a specific study, use the `reo` argument:

```{r}
CEOmeta(reo = "746")
```

## Search for specific topics through keywords

The first relevant argument of `CEOmeta()` is `search`, a simple built-in search engine that goes through the metadata columns containing potentially descriptive information (title, summary, objectives and tags, also called descriptors) and returns the studies that contain the keyword.

```{r}
CEOmeta(search = "Medi ambient")
```

It is also possible to pass more than one value to `search`. In that case a study is returned if it matches any of the keywords (logical OR).

```{r}
CEOmeta(search = c("Medi ambient", "Municipi"))
```

In addition to the built-in search through the columns of the survey title, the study title, the objectives, the summary and the tags (descriptors), it is possible to combine `CEOmeta()` with `dplyr`'s `filter()` to limit the studies returned.
For example, to get the studies whose data was collected over the Internet:

```{r}
library(dplyr) # provides filter()
CEOmeta() |>
  filter(`Metode de recollida de dades` == "internet")
```

Or to get the studies whose quantitative sample size is below a given limit:

```{r}
CEOmeta() |>
  filter(`Mostra estudis quantitatius` < 500)
```

## Restrict by time

Metadata can be retrieved for a specific period of time with the arguments `date_start` and `date_end`, also in YYYY-MM-DD format. In this case the dates taken into account are the dates on which the study entered the records, not the fieldwork dates.

```{r}
CEOmeta(date_start = "2019-01-01", date_end = "2019-12-31")
```

## Browse the CEO site

In addition to the search engine and the restriction by time, `CEOmeta()` can also automatically open the URLs at the CEO domain that contain the details of the studies gathered with the function. This is done by setting the `browse` argument to `TRUE`. However, there is a soft limit of 10 URLs to be opened, unless the user forces the function to open all of them (proceed with caution, as this may open many tabs in your browser and leave your computer out of RAM in some scenarios of RAM black holes, such as Chrome).

```{r, eval = FALSE}
CEOmeta(search = "Medi ambient a", browse = TRUE)
```

To open a specific REO, a simpler call with its identifier can be used:

```{r, eval = FALSE}
CEOmeta(reo = "746", browse = TRUE)
```

It is also possible to specify an alternative language, so that the default Catalan pages are replaced by the automatic translations provided by Apertium (for Occitan/Aranese) or Google Translate.

```{r, eval = FALSE}
CEOmeta(search = "Medi ambient a", browse = TRUE, browse_translate = "de")
```

# `CEOsearch()`: Access to the variable and value labels

Unlike `CEOdata()` and `CEOmeta()`, `CEOsearch()` needs at least one argument: the survey data (microdata) from which we want to extract the variable labels and the value labels.
By default it provides the variable labels in a tidy object:

```{r}
CEOsearch(d) # equivalent to CEOsearch(d, where = "variables")
```

Similarly, `where = "values"` provides a tidy object containing the value labels. Notice that in this case the variable names are repeated to accommodate each of the different value labels.

```{r}
CEOsearch(d, where = "values")
```

Just like `CEOmeta()`, `CEOsearch()` has a simple built-in search facility that retrieves only the rows that match one or more keywords. In the following example, we restrict the variables to those that contain "edat" (age).

```{r}
CEOsearch(d, keyword = "edat")
```

Finally, an English translation of the variable labels/values is provided if the argument `translate` is set to `TRUE`, by opening a browser tab with the translations.

```{r, eval = FALSE}
CEOsearch(d, keyword = "edat", translate = TRUE)
```

Of course, variable labels and values can be merged into a single object by combining `dplyr`'s `left_join()` with `CEOsearch()`:

```{r}
library(dplyr) # provides left_join()
CEOsearch(d) |>
  left_join(CEOsearch(d, where = "values"))
```

# Development and acknowledgement

The development of `CEOdata` (track changes, propose improvements, report bugs) can be followed at [GitHub](https://github.com/ceopinio/CEOdata/).

If you use the data and the package, please cite and properly acknowledge the CEO and the package, respectively.