The raw data can be found on these websites:
Everyone
➜ FP7 Projects
➜
FP7 Project Details
➜
Top Funded Projects
Everyone
➜ H2020 Projects
➜
H2020 Projects
➜ Top Funded Projects
Everyone
➜ H2020 Projects
➜
Data Export
➜ Raw Data Export Sheet
The Horizon
Dashboard consists of a Qlik Sense application. Raw data are used in
the official statistics of the European Commission but cannot be
downloaded automatically. However, certain datasets can still be
manually exported to xlsx
files (see Sources above).
The EU produces snapshots of its publicly available data on the EU Data Portal. Data can be found for FP6, FP7, and H2020. For FP7 and H2020, the most recent data are stored in ZIP archives containing CSV files about projects and participants.
It is possible to download the data by hand or automatically. The following outlines how to download the data automatically with the command-line and the Wget command.
First, we create a text file, inst/url_list.txt
, with
the list of URLs from which we would like to download the data
files.
https://cordis.europa.eu/data/FP6/cordis-fp6projects.csv
https://cordis.europa.eu/data/FP6/cordis-fp6organizations.csv
https://cordis.europa.eu/data/cordis-fp7projects-csv.zip
https://cordis.europa.eu/data/cordis-h2020projects-csv.zip
Second, we create a shell script,
inst/download_files.sh
, which downloads the data files to
the data-raw/
folder.
#!/bin/bash
while read url; do
wget -P ../data-raw $url
done < ../inst/url_list.txt
Besides, it is also possible to inspect when the files were last
updated. We do this by creating another shell script,
inst/last_updated.sh
, which displays the
Last-Modidied
information for each file.
#!/bin/bash
while read url; do
wget --server-response --spider $url 2>&1 | grep -i Last-Modified
done < ../inst/url_list.txt
To finish, we extract the files from the ZIP archives. The unzip command fails to extract the files from the Zip archives. The most recent release contains a faulty Zip file structure. Fortunately, the tar command offers a workaround.
#!/bin/bash
tar -xvf data-raw/cordis-fp7projects-csv.zip
mkdir data-raw/fp7
mv csv data-raw/fp7
tar -xvf data-raw/cordis-h2020projects-csv.zip
mkdir data-raw/h2020
mv csv data-raw/h2020
The first step is to load a few useful libraries and helper functions.
source(here::here("R", "package.r"))
source(here::here("R", "utils.r"))
The data from the Horizon Dashboard only cover FP7 and H2020. The details about the projects are given in two tables, one per Framework Programme. Therefore, we select the matching columns to make it possible to merge the two tables.
<- c(
matching_columns "project_id",
"project_acronym",
"thematic_priority",
"topic_code",
"topic_description",
"participations"
)
<-
fp7_dashboard # import xlsx to data frame
here("data-raw", "fp7-projects.xlsx") %>%
read_excel(guess_max = 20000) %>%
# convert names to snake case
clean_names() %>%
# rename columns for consistency between tables
rename(
project_id = project_nbr,
thematic_priority = thematic_priority_descr) %>%
select(all_of(matching_columns)) %>%
# data frame label column
mutate(framework_programme = "FP7")
<-
h2020_dashboard # import xlsx to data frame
here("data-raw", "h2020-projects.xlsx") %>%
read_excel(guess_max = 20000) %>%
# convert names to snake case
clean_names() %>%
# rename columns for consistency between tables
rename(
project_id = project_nbr,
thematic_priority = thema,
participations = h2020_participations
%>%
) select(all_of(matching_columns)) %>%
# data frame label column
mutate(framework_programme = "H2020")
# merge the tables
<- bind_rows(fp7_dashboard, h2020_dashboard) projects_dashboard
Details about the projects can also be found in the data downloaded from the Data Portal. Similarly to the Horizon Dashboard data, each Framework Programme has its own data table from which we select a few additional columns of interest.
<- c(
additional_columns "project_id",
"start_date",
"end_date",
"legal_basis",
"funding_scheme"
)
# use H2020 schema as reference
<-
projects_cordis_h2020 read_csv2(here("data-raw", "h2020", "csv", "project.csv")) %>%
# for data type consistency
mutate_if(is.POSIXct, as.Date)
<-
projects_cordis_fp7 ::read_csv2(here("data-raw", "fp7", "csv", "project.csv")) %>%
readr# for data type consistency
mutate(
contentUpdateDate = as.Date(contentUpdateDate),
ecSignatureDate = as.Date(ecSignatureDate),
masterCall = as.character(masterCall))
<- bind_rows(projects_cordis_fp7, projects_cordis_h2020) %>%
projects_cordis # use snake case
clean_names() %>%
# rename columns for consistency between tables
rename(project_id = id) %>%
select(all_of(additional_columns))
Finally, we combine the details from the Horizon Dashboard and the Data Portal.
<- full_join(projects_dashboard, projects_cordis, by = "project_id") projects
For completeness, we also include the code which reads the data available for FP6. As indicated already, the Horizon Dashboard contains no data for FP6. Moreover, the Data Portal only provides contribution amounts for 2,621 cases (out of 75,241). The data from FP6 will therefore be ignored in the remainder of this work.
# FP6 project details
<-
projects_cordis_fp6 here("data-raw", "cordis-fp6projects.csv") %>%
::read_csv2() %>%
readrclean_names()
# FP6 participation/contribution details
<-
organizations_cordis_fp6 here("data-raw", "cordis-fp6organizations.csv") %>%
read_tsv() # beware: tsv, not csv
Horizon Dashboard only includes participation data for H2020 and not for FP7. Nevertheless, the data in the Horizon Dashboard appear to be of a higher quality than that of the Data Portal and are therefore used as a source of official statistics.
First, we save the data to CSV.
<-
organizations_dashboard_h2020 here("data-raw", "horizon2020.xlsx") %>%
read_excel() %>%
clean_names() %>%
rename(project_id = project_nbr,
pic = general_pic)
# add call year
<- organizations_dashboard_h2020 %>%
organizations_dashboard_h2020 mutate(
call_date = as.Date(call_deadline_date, "%d/%m/%Y"),
call_year = format(call_date, "%Y")
%>%
) select(-call_date)
# store data types
map_chr(organizations_dashboard_h2020, class) %>%
write_yaml(here("data", "schema-dashboard_export.yml"))
# split data frame by call year
<- organizations_dashboard_h2020 %>%
yrs distinct(call_year) %>%
pull() %>% sort()
<- here("data", glue("dashboard-export-{yrs}.csv"))
file_names
%>%
organizations_dashboard_h2020 arrange(call_year) %>%
group_by(call_year) %>%
group_split() %>%
setNames(yrs) %>%
# write to ./data
walk2(file_names, vroom_write, delim = ",")
And select a few columns of interest.
<- c(
additional_columns "project_id",
"pic",
"partner_role",
"pillar_abbr",
"pillar_descr",
"thematic_priority_abbr",
"signature_date",
"call_deadline_date",
"eu_contribution")
<- organizations_dashboard_h2020 %>%
organizations_dashboard_h2020 select(all_of(additional_columns))
Details about participation can be found in the organization.csv tables.
<- c(
non_matching_columns "contentUpdateDate",
"SME",
"nutsCode",
"endOfParticipation",
"active",
"totalCost",
"rcn"
)
<- c(
selection_columns "project_id",
"pic",
"legal_name",
"legal_short_name",
"legal_entity_type",
"country_code",
"legal_url",
"partner_role",
"ec_contribution",
"net_ec_contribution"
)
<-
organizations_cordis_h2020 here("data-raw", "h2020", "csv", "organization.csv") %>%
read_csv2(guess_max = 100000) %>%
select(-all_of(non_matching_columns))
<-
organizations_cordis_fp7 here("data-raw", "fp7", "csv", "organization.csv") %>%
read_csv2(guess_max = 130000) %>%
select(-all_of(non_matching_columns))
# 3 cases where (project_id, pic)
# appear 2x w/ different partner_role
<-
organizations_cordis %>%
organizations_cordis_fp7 bind_rows(organizations_cordis_h2020) %>%
clean_names() %>%
rename(
pic = organisation_id,
legal_name = name,
legal_short_name = short_name,
legal_entity_type = activity_type,
country_code = country,
legal_url = organization_url,
partner_role = role
%>%
) select(all_of(selection_columns))
Here again, we combine the details from the Horizon Dashboard and the Data Portal.
<- organizations_dashboard_h2020 %>%
organizations mutate(partner_role = str_to_lower(partner_role)) %>% # for joining
full_join(organizations_cordis,
by = c("project_id", "pic", "partner_role")
)
Having combined the datasets from the Horizon Dashboard and the Data
Portal, we obtain three columns with details about funding
contributions: ec_contribution
(Data Portal),
net_ec_contribution
(Data Portal), and
eu_contribution
(Horizon Dashboard). The latter comprises
the values of the Commission’s official statistics for H2020.
eu_contribution
is nevertheless only available for H2020.
To extrapolate to FP7 statistics, we are going to compare the columns
ec_contribution
and net_ec_contribution
with
eu_contribution
.
# take a sample of 10'000 observations
<- organizations %>%
df_sample filter(!is.na(eu_contribution)) %>%
filter(!is.na(ec_contribution)) %>%
filter(!is.na(net_ec_contribution)) %>%
select(ends_with("contribution")) %>%
mutate(ec_contribution = str_replace(ec_contribution, ",", ".")) %>%
mutate_all(as.numeric) %>%
sample_n(10000)
# comparing `ec_contribution` with `eu_contribution`
with(
df_sample,plot(
ec_contribution,
eu_contribution,type = "p",
pch = 19,
cex = .5,
log = "xy"
)
)
# comparing `net_ec_contribution` with `eu_contribution`
with(
df_sample,plot(
net_ec_contribution,
eu_contribution,type = "p",
pch = 19,
cex = .5,
log = "xy"
) )
In addition, comparing of the number of projects between the
projects
table and the organizations
table
shows that twenty four projects lack details about individual funding
contributions.
anti_join(organizations, projects, by = "project_id")
anti_join(projects, organizations, by = "project_id")
# => 24 projects with no participation details
<- organizations %>%
master_view left_join(projects, by = "project_id")
Below are a few steps to clean up and improve the data.
country_code
and add the corresponding country
names.# EU Data Portal country codes
<-
country_codes glue(
"https://cordis.europa.eu/",
"data/reference/",
"cordisref-countries.csv") %>%
read_csv2() %>%
clean_names() %>%
filter(language == "en") %>%
select(eu_code, name) %>%
rename(
country_code = eu_code, # two-letter Eurostat country code
country = name)
<- master_view %>%
master_view mutate(
country_code = str_sub(country_code, 1, 2), # two-letter only
country_code = str_replace(country_code, "KO", "XK")) %>%
left_join(country_codes, by = "country_code")
legal_entity_type
.<- master_view %>%
master_view # `legal_entity_type` is a three-letter label only
mutate(legal_entity_type = str_sub(legal_entity_type, 1, 3))
<- master_view %>%
master_view mutate(
ec_contribution = str_replace(ec_contribution, ",", "."),
ec_contribution = str_replace(ec_contribution,
fixed("xxxxx"),
NA_character_),
eu_contribution = str_replace(eu_contribution, fixed("-"), "0"),
ec_contribution = as.numeric(ec_contribution),
net_ec_contribution = as.numeric(net_ec_contribution),
eu_contribution = as.numeric(eu_contribution),
# reconcile FP7 and H2020 contribution data
re_contribution = eu_contribution,
re_contribution = if_else(is.na(re_contribution),
net_ec_contribution,
re_contribution),re_contribution = if_else(is.na(re_contribution),
ec_contribution,
re_contribution) )
<- master_view %>%
master_view mutate(
pillar = str_extract(pillar_abbr, "\\d"),
pillar = if_else(pillar_abbr == "EU.0.", "Cross-theme", pillar),
pillar = if_else(pillar_abbr == "Euratom", pillar_abbr, pillar))
<- master_view %>%
master_view mutate(
signature_date = as.Date(signature_date, format = "%d/%m/%Y"),
call_deadline_date = as.Date(call_deadline_date, format = "%d/%m/%Y"),
signature_year = str_sub(signature_date, 1, 4),
call_year = str_sub(call_deadline_date, 1, 4),
start_year = str_sub(start_date, 1, 4),
signature_year = int(signature_year),
call_year = int(call_year),
start_year = int(start_year)
)
partner_role
to title case.<- master_view %>%
master_view mutate(partner_role = str_to_title(partner_role))
Finally, we can save master_vier
as a CSV file, which
can be viewed in here.
# store data types
%>%
master_view map_chr(class) %>%
write_yaml(here("data", "schema-cordis-plus.yml"))
# split CSVs by year
<-
df_years %>%
master_view distinct(start_year) %>%
pull() %>%
sort()
# write to ./data
<- here("data", glue("cordis-plus-{df_years}.csv"))
file_names
%>%
master_view filter(!is.na(start_date)) %>%
arrange(start_year) %>%
group_by(start_year) %>%
group_split() %>%
setNames(df_years) %>%
walk2(file_names, vroom_write, delim = ",")