Gathering FP Data for Analysis

Sources

The raw data can be found on these websites:

Horizon Dashboard
- Everyone ➜ FP7 Projects ➜ FP7 Project Details ➜ Top Funded Projects
- Everyone ➜ H2020 Projects ➜ H2020 Projects ➜ Top Funded Projects
- Everyone ➜ H2020 Projects ➜ Data Export ➜ Raw Data Export Sheet
EU Data Portal

Getting the data

Horizon Dashboard

The Horizon Dashboard consists of a Qlik Sense application. Raw data are used in the official statistics of the European Commission but cannot be downloaded automatically. However, certain datasets can still be manually exported to xlsx files (see Sources above).

Data Portal

The EU produces snapshots of its publicly available data on the EU Data Portal. Data can be found for FP6, FP7, and H2020. For FP7 and H2020, the most recent data are stored in ZIP archives containing CSV files about projects and participants.

It is possible to download the data by hand or automatically. The following outlines how to download the data automatically with the command-line and the Wget command.

First, we create a text file, inst/url_list.txt, with the list of URLs from which we would like to download the data files.

https://cordis.europa.eu/data/FP6/cordis-fp6projects.csv
https://cordis.europa.eu/data/FP6/cordis-fp6organizations.csv
https://cordis.europa.eu/data/cordis-fp7projects-csv.zip
https://cordis.europa.eu/data/cordis-h2020projects-csv.zip

Second, we create a shell script, inst/download_files.sh, which downloads the data files to the data-raw/ folder.

#!/bin/bash
while read url; do
    wget -P ../data-raw $url
done < ../inst/url_list.txt

Besides, it is also possible to inspect when the files were last updated. We do this by creating another shell script, inst/last_updated.sh, which displays the Last-Modidied information for each file.

#!/bin/bash
while read url; do
    wget --server-response --spider $url 2>&1 | grep -i Last-Modified
done < ../inst/url_list.txt

To finish, we extract the files from the ZIP archives. The unzip command fails to extract the files from the Zip archives. The most recent release contains a faulty Zip file structure. Fortunately, the tar command offers a workaround.

#!/bin/bash
tar -xvf data-raw/cordis-fp7projects-csv.zip
mkdir data-raw/fp7
mv csv data-raw/fp7
tar -xvf data-raw/cordis-h2020projects-csv.zip
mkdir data-raw/h2020
mv csv data-raw/h2020

Data Preprocessing

The first step is to load a few useful libraries and helper functions.

source(here::here("R", "package.r"))
source(here::here("R", "utils.r"))

Project Details

Horizon Dashboard

The data from the Horizon Dashboard only cover FP7 and H2020. The details about the projects are given in two tables, one per Framework Programme. Therefore, we select the matching columns to make it possible to merge the two tables.

matching_columns <- c(
  "project_id",
  "project_acronym",
  "thematic_priority",
  "topic_code",
  "topic_description",
  "participations"
)

fp7_dashboard <-
  # import xlsx to data frame
  here("data-raw", "fp7-projects.xlsx") %>%
  read_excel(guess_max = 20000) %>%
  # convert names to snake case
  clean_names() %>%
  # rename columns for consistency between tables
  rename(
    project_id = project_nbr,
    thematic_priority = thematic_priority_descr) %>%
  select(all_of(matching_columns)) %>%
  # data frame label column
  mutate(framework_programme = "FP7")

h2020_dashboard <-
  # import xlsx to data frame
  here("data-raw", "h2020-projects.xlsx") %>%
  read_excel(guess_max = 20000) %>%
  # convert names to snake case
  clean_names() %>%
  # rename columns for consistency between tables
  rename(
    project_id = project_nbr,
    thematic_priority = thema,
    participations = h2020_participations
  ) %>%
  select(all_of(matching_columns)) %>%
  # data frame label column
  mutate(framework_programme = "H2020")

# merge the tables
projects_dashboard <- bind_rows(fp7_dashboard, h2020_dashboard)

Data Portal

Details about the projects can also be found in the data downloaded from the Data Portal. Similarly to the Horizon Dashboard data, each Framework Programme has its own data table from which we select a few additional columns of interest.

additional_columns <- c(
  "project_id",
  "start_date",
  "end_date",
  "legal_basis",
  "funding_scheme"
)

# use H2020 schema as reference
projects_cordis_h2020 <-
  read_csv2(here("data-raw", "h2020", "csv", "project.csv")) %>%
  # for data type consistency
  mutate_if(is.POSIXct, as.Date)

projects_cordis_fp7 <-
  readr::read_csv2(here("data-raw", "fp7", "csv", "project.csv")) %>%
  # for data type consistency
  mutate(
    contentUpdateDate = as.Date(contentUpdateDate),
    ecSignatureDate = as.Date(ecSignatureDate),
    masterCall = as.character(masterCall))

projects_cordis <- bind_rows(projects_cordis_fp7, projects_cordis_h2020) %>%
  # use snake case
  clean_names() %>%
  # rename columns for consistency between tables
  rename(project_id = id) %>%
  select(all_of(additional_columns))

Finally, we combine the details from the Horizon Dashboard and the Data Portal.

projects <- full_join(projects_dashboard, projects_cordis, by = "project_id")

FP6

For completeness, we also include the code which reads the data available for FP6. As indicated already, the Horizon Dashboard contains no data for FP6. Moreover, the Data Portal only provides contribution amounts for 2,621 cases (out of 75,241). The data from FP6 will therefore be ignored in the remainder of this work.

# FP6 project details
projects_cordis_fp6 <-
  here("data-raw", "cordis-fp6projects.csv") %>%
  readr::read_csv2() %>%
  clean_names()

# FP6 participation/contribution details
organizations_cordis_fp6 <-
  here("data-raw", "cordis-fp6organizations.csv") %>%
  read_tsv() # beware: tsv, not csv

Participation Details

Horizon Dashboard

Horizon Dashboard only includes participation data for H2020 and not for FP7. Nevertheless, the data in the Horizon Dashboard appear to be of a higher quality than that of the Data Portal and are therefore used as a source of official statistics.

First, we save the data to CSV.

organizations_dashboard_h2020 <-
  here("data-raw", "horizon2020.xlsx") %>%
  read_excel() %>%
  clean_names() %>%
  rename(project_id = project_nbr,
         pic = general_pic)

# add call year
organizations_dashboard_h2020 <- organizations_dashboard_h2020 %>%
  mutate(
    call_date = as.Date(call_deadline_date, "%d/%m/%Y"),
    call_year = format(call_date, "%Y")
  ) %>%
  select(-call_date)

# store data types
map_chr(organizations_dashboard_h2020, class) %>%
  write_yaml(here("data", "schema-dashboard_export.yml"))

# split data frame by call year
yrs <- organizations_dashboard_h2020 %>%
  distinct(call_year) %>%
  pull() %>% sort()

file_names <- here("data", glue("dashboard-export-{yrs}.csv"))

organizations_dashboard_h2020 %>%
  arrange(call_year) %>%
  group_by(call_year) %>%
  group_split() %>%
  setNames(yrs) %>%
  # write to ./data
  walk2(file_names, vroom_write, delim = ",")

And select a few columns of interest.

additional_columns <- c(
  "project_id",
  "pic",
  "partner_role",
  "pillar_abbr",
  "pillar_descr",
  "thematic_priority_abbr",
  "signature_date",
  "call_deadline_date",
  "eu_contribution")

organizations_dashboard_h2020 <- organizations_dashboard_h2020 %>%
  select(all_of(additional_columns))

Data Portal

Details about participation can be found in the organization.csv tables.

non_matching_columns <- c(
  "contentUpdateDate",
  "SME",
  "nutsCode",
  "endOfParticipation",
  "active",
  "totalCost",
  "rcn"
)

selection_columns <- c(
    "project_id",
    "pic",
    "legal_name",
    "legal_short_name",
    "legal_entity_type",
    "country_code",
    "legal_url",
    "partner_role",
    "ec_contribution",
    "net_ec_contribution"
)

organizations_cordis_h2020 <-
  here("data-raw", "h2020", "csv", "organization.csv") %>%
  read_csv2(guess_max = 100000) %>%
  select(-all_of(non_matching_columns))

organizations_cordis_fp7 <-
  here("data-raw", "fp7", "csv", "organization.csv") %>%
  read_csv2(guess_max = 130000) %>%
  select(-all_of(non_matching_columns))

# 3 cases where (project_id, pic)
# appear 2x w/ different partner_role

organizations_cordis <-
  organizations_cordis_fp7 %>%
  bind_rows(organizations_cordis_h2020) %>%
  clean_names() %>%
  rename(
    pic = organisation_id,
    legal_name = name,
    legal_short_name = short_name,
    legal_entity_type = activity_type,
    country_code = country,
    legal_url = organization_url,
    partner_role = role
  ) %>%
  select(all_of(selection_columns))

Here again, we combine the details from the Horizon Dashboard and the Data Portal.

organizations <- organizations_dashboard_h2020 %>%
  mutate(partner_role = str_to_lower(partner_role)) %>% # for joining
  full_join(organizations_cordis,
    by = c("project_id", "pic", "partner_role")
)

Project and Participation Details

Having combined the datasets from the Horizon Dashboard and the Data Portal, we obtain three columns with details about funding contributions: ec_contribution (Data Portal), net_ec_contribution (Data Portal), and eu_contribution (Horizon Dashboard). The latter comprises the values of the Commission’s official statistics for H2020. eu_contribution is nevertheless only available for H2020. To extrapolate to FP7 statistics, we are going to compare the columns ec_contribution and net_ec_contribution with eu_contribution.

# take a sample of 10'000 observations
df_sample <- organizations %>%
  filter(!is.na(eu_contribution)) %>%
  filter(!is.na(ec_contribution)) %>%
  filter(!is.na(net_ec_contribution)) %>%
  select(ends_with("contribution")) %>%
  mutate(ec_contribution = str_replace(ec_contribution, ",", ".")) %>%
  mutate_all(as.numeric) %>%
  sample_n(10000)

# comparing `ec_contribution` with `eu_contribution`
with(
  df_sample,
  plot(
    ec_contribution,
    eu_contribution,
    type = "p",
    pch = 19,
    cex = .5,
    log = "xy"
  )
)

# comparing `net_ec_contribution` with `eu_contribution`
with(
  df_sample,
  plot(
    net_ec_contribution,
    eu_contribution,
    type = "p",
    pch = 19,
    cex = .5,
    log = "xy"
  )
)

In addition, comparing of the number of projects between the projects table and the organizations table shows that twenty four projects lack details about individual funding contributions.

anti_join(organizations, projects, by = "project_id")
anti_join(projects, organizations, by = "project_id")

# => 24 projects with no participation details

master_view <- organizations %>%
  left_join(projects, by = "project_id")

Transforming

Below are a few steps to clean up and improve the data.

Clean up country_code and add the corresponding country names.

# EU Data Portal country codes
country_codes <-
  glue(
    "https://cordis.europa.eu/",
    "data/reference/",
    "cordisref-countries.csv") %>%
  read_csv2() %>%
  clean_names() %>%
  filter(language == "en") %>%
  select(eu_code, name) %>%
  rename(
    country_code = eu_code, # two-letter Eurostat country code
    country = name)

master_view <- master_view %>%
  mutate(
    country_code = str_sub(country_code, 1, 2), # two-letter only
    country_code = str_replace(country_code, "KO", "XK")) %>%
  left_join(country_codes, by = "country_code")

Clean up legal_entity_type.

master_view <- master_view %>%
  # `legal_entity_type` is a three-letter label only
  mutate(legal_entity_type = str_sub(legal_entity_type, 1, 3))

Format and clean up the contributions.

master_view <- master_view %>%
  mutate(
    ec_contribution = str_replace(ec_contribution, ",", "."),
    ec_contribution = str_replace(ec_contribution,
                                  fixed("xxxxx"),
                                  NA_character_),
    eu_contribution = str_replace(eu_contribution, fixed("-"), "0"),
    ec_contribution = as.numeric(ec_contribution),
    net_ec_contribution = as.numeric(net_ec_contribution),
    eu_contribution = as.numeric(eu_contribution),
    # reconcile FP7 and H2020 contribution data
    re_contribution = eu_contribution,
    re_contribution = if_else(is.na(re_contribution),
                              net_ec_contribution,
                              re_contribution),
    re_contribution = if_else(is.na(re_contribution),
                              ec_contribution,
                              re_contribution)
  )

Relabel the pillars.

master_view <- master_view %>%
  mutate(
    pillar = str_extract(pillar_abbr, "\\d"),
    pillar = if_else(pillar_abbr == "EU.0.", "Cross-theme", pillar),
    pillar = if_else(pillar_abbr == "Euratom", pillar_abbr, pillar))

Format and round the dates.

master_view <- master_view %>%
  mutate(
    signature_date = as.Date(signature_date, format = "%d/%m/%Y"),
    call_deadline_date = as.Date(call_deadline_date, format = "%d/%m/%Y"),
    signature_year = str_sub(signature_date, 1, 4),
    call_year = str_sub(call_deadline_date, 1, 4),
    start_year = str_sub(start_date, 1, 4),
    signature_year = int(signature_year),
    call_year = int(call_year),
    start_year = int(start_year)
  )

Transform partner_role to title case.

master_view <- master_view %>%
  mutate(partner_role = str_to_title(partner_role))

Saving

Finally, we can save master_vier as a CSV file, which can be viewed in here.

# store data types
master_view %>%
  map_chr(class) %>% 
  write_yaml(here("data", "schema-cordis-plus.yml"))

# split CSVs by year
df_years <-
  master_view %>%
  distinct(start_year) %>%
  pull() %>%
  sort()

# write to ./data
file_names <- here("data", glue("cordis-plus-{df_years}.csv"))

master_view %>%
  filter(!is.na(start_date)) %>%
  arrange(start_year) %>%
  group_by(start_year) %>%
  group_split() %>%
  setNames(df_years) %>%
  walk2(file_names, vroom_write, delim = ",")