2 Data Curators

We are looking for data curators who:

  1. work with governmental, scientific, or otherwise open data;
  2. are committed to high professional standards in policy or business and, by making their work reproducible, adhere to reviewability, reproducibility, confirmability and auditability, whether they work in, or study for, professional roles in business, academia, public or non-governmental policy, or data journalism;
  3. are interested in helping us with indicator design;
  4. make the authoritative copy of their indicator available on the Zenodo data repository, and keep it up to date with our automated observatory’s help.

An important aspect of the EU Datathon Challenges is “… to propose the development of an application that links and uses open datasets […] to find suitable new approaches and solutions to help Europe achieve important goals set by the European Commission through the use of open data.”

Where to find us: dataobservatory-eu is our GitHub collaboration platform and repository collection; some repositories are private, but many of them, like this one, are open.

2.1 Get Inspired

2.1.1 Create New Datasets

Our mission is to create standardized data about social, economic or environmental processes for which no standardized, well-processed open data exists yet.

2.1.2 Remain Critical

Sometimes we get our hands on data that looks like a unique starting point for a new indicator. But our indicator will be flawed if the original dataset is flawed, and datasets can be flawed in many ways: most likely, some important aspect of the information was omitted, or the data is self-selected, for example, under-sampling women, people of color, or observations from small or less developed countries.

  • Cathy O’Neil: Weapons of Math Destruction. Weapons of math destruction, as O’Neil defines them, are mathematical models or algorithms that claim to quantify important traits (teacher quality, recidivism risk, creditworthiness) but have harmful outcomes and often reinforce inequality, keeping the poor poor and the rich rich. They have three things in common: opacity, scale, and damage. https://blogs.scientificamerican.com/roots-of-unity/review-weapons-of-math-destruction/

  • Catherine D’Ignazio and Lauren F. Klein: Data Feminism. This is a much-celebrated book, and with good reason. It views AI and data problems from a feminist point of view, but the examples and the toolbox can easily be applied to small-country biases, or to racial, ethnic, or small-enterprise problems. It is a very good introduction to the injustice of big data, to the fight for a fairer use of data, and to how bad data collection practices, through garbage in, garbage out, lead to misleading information or even misinformation.

  • Why The Bronx Burned. Between 1970 and 1980, seven census tracts in the Bronx lost more than 97 percent of their buildings to fire and abandonment. In his book The Fires, Joe Flood lays the blame on a misguided “best and brightest” effort by New York City to increase government efficiency. With the help of the Rand Corp., the city tried to measure fire response times, identify redundancies in service, and close or re-allocate fire stations accordingly. What resulted, though, was a perfect storm of bad data: the methodology was flawed, the analysis was rife with biases, and the results were interpreted in a way that stacked the deck against poorer neighborhoods. The slower response times allowed smaller fires to rage uncontrolled in the city’s most vulnerable communities. Listen to the podcast here

  • Bad Incentives Are Blocking Better Science. “There’s a difference between an answer and a result. But all the incentives are pointing toward telling you that as soon as you get a result, you stop.” After the deluge of retractions, the stories of fraudsters, the false positives, and the high-profile failures to replicate landmark studies, some people have begun to ask: “Is science broken?” Listen to the podcast Science Is Hard here

  • In Algorithms of Oppression, Safiya Umoja Noble challenges the idea that search engines like Google offer an equal playing field for all forms of ideas, identities, and activities. Data discrimination is a real social problem; Noble argues that the combination of private interests in promoting certain sites, along with the monopoly status of a relatively small number of Internet search engines, leads to a biased set of search algorithms that privilege whiteness and discriminate against people of color, specifically women of color.

  • Christopher Ingraham wrote a quick blog post for The Washington Post about an obscure USDA data set called the natural amenities index, which attempts to quantify the natural beauty of different parts of the country. He described the rankings, noted the counties at the top and bottom, hit publish and didn’t think much of it. Almost immediately he started to hear from the residents of northern Minnesota, who were not very happy that Chris had written, “the absolute worst place to live in America is (drumroll, please) … Red Lake County, Minn.” He could not have been more wrong … a year later he moved to Red Lake County with his family.

2.1.3 Your First Data Contribution

Your first contribution can be made without writing a single line of code – but if you are experienced in reproducible science, then you can also submit the code that creates your data.

  1. Make sure that you read the Contributor Covenant. You must make this pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation. Participating in our data observatories requires everybody to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community. It’s better this way for you and for us!

  2. Send us a plain-language document about the indicator, preferably in any flavor of markdown (see subchapter 7.3.1 in the Tools), or even in a plain-text email: what the indicator should be used for, how it should be measured, with what frequency, and what the open data source to acquire the observations could be. Experienced data scientists can send us a Jupyter Notebook or an R Markdown file with code, but this submission can be a simple plain-language document without numbers.

  3. Make sure that you have an ORCID iD. This is a standard identifier for scientific publications. We need your numeric ORCID iD.

  4. Make sure that you have a Zenodo account that is connected to your ORCID iD. This enables you to publish data under your name. If you curate data for our observatories, you will be the indicator’s first author, and, depending on which processes help you, the author of the (scientific) code that calculates the values will be your co-author.

  5. Without programming experience, your first indicator should be uploaded manually to Zenodo, and we will help automate the new versions. This means, for example, uploading a simple CSV version of an Excel table and filling in some important information about the contents of the table.

  6. With some level of R or Python programming experience, we ask you to create a GitHub repo where you store your indicator. We will help you with tutorials, program code, or applications to automate your data publication on Zenodo (see the sketch after this list). In this case, make sure that you also have a Zenodo Sandbox account. There is no undo button on Zenodo: if you are tinkering with automatically publishing data, practice first in the sandbox, which is a practice clone of Zenodo with an undo button. (To avoid accidents, you need a completely separate account with different credentials on the real and the sandbox repositories.)

  7. Experienced programmers are welcome to participate in our developer team, and become contributors, or eventually co-authors, of the (scientific) software code that we write to continuously improve our data observatories. All our data code is open source. At this level, you are expected to be able to raise and/or pick up and solve an issue in our observatory’s GitHub repository, or in its connecting statistical repositories.
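
If you would like a feel for what the automation in step 6 looks like under the hood, here is a minimal sketch of creating a practice deposition on the Zenodo Sandbox via the Zenodo REST API from R, using the httr package. The file name, the metadata, and the ZENODO_SANDBOX_TOKEN environment variable are illustrative assumptions, not our production workflow; we provide tutorials and helper code tailored to your indicator.

```r
library(httr)

# Personal access token created on sandbox.zenodo.org (assumed to be stored in
# the ZENODO_SANDBOX_TOKEN environment variable -- an illustrative convention).
token <- Sys.getenv("ZENODO_SANDBOX_TOKEN")

# 1. Create an empty deposition on the sandbox with some minimal metadata.
resp <- POST(
  url    = "https://sandbox.zenodo.org/api/deposit/depositions",
  query  = list(access_token = token),
  body   = list(metadata = list(
    title       = "My indicator (practice upload)",   # hypothetical title
    upload_type = "dataset",
    description = "Practice deposition of an indicator table.",
    creators    = list(list(name = "Curator, Example"))
  )),
  encode = "json"
)
deposition <- content(resp)

# 2. Attach the indicator table (a hypothetical CSV file) to the deposition.
POST(
  url   = paste0("https://sandbox.zenodo.org/api/deposit/depositions/",
                 deposition$id, "/files"),
  query = list(access_token = token),
  body  = list(name = "my_indicator.csv",
               file = upload_file("my_indicator.csv"))
)
```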

Our data is mainly processed in R, and sometimes in Python. If you are experienced with R bookdown, R Shiny, or the Hugo framework, then you are welcome to join our developer team in non-curatorial roles.

2.2 What is Open Data

In the EU, open data is governed by the Directive on open data and the re-use of public sector information, in short the Open Data Directive (EU) 2019/1024. It entered into force on 16 July 2019 and replaces the Public Sector Information Directive, also known as the PSI Directive, which dated from 2003 and was subsequently amended in 2013.

2.3 Reproducible Research

Reproducible research is a scientific concept that can be applied to a wide range of professional settings. We apply this concept to Evidence-based, Open Policy Analysis and to Professional Standards in Business, for example, reproducible finance in the investment process or reproducible impact assessment in policy consulting. Based on the concept of computational reproducibility, we believe that the following principles should be followed.

  • Reviewability means that our application’s results can be assessed and judged by our users’ experts, or by experts they trust. We support reviewability with full transparency: we publish the software code that created the indicators, our methodology, and an automatically refreshed statistical description of each indicator on every day when it receives new data or corrections from the original source.

  • Reproducibility means that we provide data products and tools that allow the exact duplication of our results during assessments. This ensures that all logical steps can be verified. Reproducibility also ensures that there is no lock-in to our applications: you can always choose a different data and software vendor, or compare our results with theirs.

  • Confirmability means that using our applications’ findings leads to the same professional results as using other available software and information. Our data products use the open-source statistical programming language R. We provide details about our algorithms and methodology so that our results can be confirmed in SPSS or Stata, or sometimes even in Excel.

  • Auditability means that our data and software are archived in a way that external auditors can later review, reproduce and confirm our findings. This is a stricter form of data retention than most organizations apply, because we archive not only the results but every computational step – as if your colleagues saved not only each version of an Excel file but also their keystrokes. While auditability is a requirement in accounting, we extend this approach to all the quantitative work of a professional organization acting in an advisory or consulting capacity.

  • Reviewable findings: The descriptions of the methods can be independently assessed, and the results judged credible. In our view, this is a fundamental requirement for all professional applications. CEEMID’s music data is used to settle royalty disputes in judicial procedures, and in grant and policy design. We believe that the future European Music Observatory should aim for the same bar, making its data and research products open to challenge by the scientific public, the courts, and professional peers.

  • Replicable findings: We present our findings and provide tools so that our users, auditors, or external authorities can duplicate our results.

  • Confirmable findings: The main conclusions of the research can be obtained independently without our software, because we describe the algorithms and methodology in detail in supplementary materials. We believe that other organizations, analysts, and statisticians must be able to come to the same findings with their own methods and software. This avoids lock-in and allows independent cross-examination.

  • Auditable findings: Sufficient records (including data and software) have been archived so that the research can be defended later if necessary, or differences between independent confirmations can be resolved. The archive might be private, as with traditional laboratory notebooks. See Open collaboration with academia, auditors, and industry.

These computational requirements call for a data workflow that relies on further principles.

  • Record retention: all aspects of reproducibility require a high level of standardized documentation, which in turn requires the use of standardized metadata, metadata structures, taxonomies, and vocabularies.

  • Best available information / data universe: the quality of the findings, and the success of their confirmation and auditing, improve with the quality of the data and facts used.

  • Data validation: The quality of the findings greatly depends on the factual inputs. Even a fully reproducible workflow fed with erroneous data or faulty information will likely lead to wrong conclusions, and in any case will make confirmation and auditing impossible. Especially when organizations use large and heterogeneous data sources, even small errors, such as erroneous currency translations or the accidental misuse of decimals or units, can produce results that will not pass confirmation or auditing.

2.3.1 Evidence-based, Open Policy Analysis

In the last two decades, governments and researchers have placed a growing emphasis on the value of evidence-based policy. However, while the evidence generated through research to inform policy has become more rigorous and transparent, policy analysis–the process of contextualizing evidence to inform specific policy decisions–remains opaque.

We believe that a modern data observatory must improve how evidence is created and used in policy reports, and pass on the efficiency gains from increasing reproducibility and automation. Therefore, we pledge that the music.dataobservatory.eu will comply with the Open Policy Analysis standards developed by the Berkeley Initiative for Transparency in the Social Sciences & Center for Effective Global Action. These standards are applied by the World Bank.

2.3.2 Professional Standards in Business

Some elements of reproducible research are already required by professional standards. For example, various accounting, finance, legal or consulting professional standards call for appropriate documentation and record retention.

2.4 Indicator Design

We commit ourselves in the final deliverable to follow the indicator design principles set out by Eurostat (Eurostat 2014; Kotzeva et al. 2017) to create high-quality, validated indicators that receive appropriate feedback from users, i.e. music businesses, their trade associations and policy-makers.

What are the characteristics of a good indicator? Based on the Eurostat expectations mentioned above, we formulated them for our observatories as follows.

  • Relevance: Indicators must ‘meet the users’ needs’; if they do not measure anything useful to policymakers, the public or researchers, they will probably not be widely used. Indicators should also be unambiguous in showing which direction is ‘desirable.’
  • Accuracy and reliability: Indicators must ‘accurately and reliably portray reality’; an inaccurate indicator can lead to erroneous conclusions, steer the business or policy making process in the wrong direction or let negative effects go undetected.
  • Timeliness and punctuality: Indicators must be released at a time that is relevant to the end user. If we cannot produce an accurate indicator in a timely manner, we should aim to create a leading indicator that is available sooner and correlates with relatively high accuracy with the indicator that is not available on time (see the small correlation sketch after this list).
  • Coherence and comparability: Indicators should be ‘consistent internally, over time and comparable between regions and countries.’ This is particularly relevant for indicators used for policy monitoring and assessment, and in international business planning and assessment.
  • Accessibility and clarity: Indicators must be easy to find and retrieve, and presented in a clear, understandable form with the metadata and guidance needed to interpret them correctly.
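
As an illustration of the leading-indicator idea above, here is a minimal sketch in R; the two series and the assumed publication delay are invented for the example, not real observatory data.

```r
# A candidate leading indicator is checked against a target indicator that is
# published with a delay. Both series below are invented, illustrative values.
target  <- c(100, 104, 103, 108, 112, 115, 119)  # official indicator, slow to arrive
leading <- c( 98, 103, 101, 107, 110, 116, 118)  # candidate, available much earlier

# A high correlation suggests the candidate can serve as a timely proxy
# until the official figures are released.
cor(leading, target)
```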

Examples for indicators in our Digital Music Observatory:

  • Indicators that were used with all known royalty valuation methods (PwC 2008), for both authors’ and neighbouring rights, and fulfil the IFRS fair value standards, incorporated in EU law and the recent EU jurisprudence (InfoCuria 2014, 2017).

  • Indicators that can be used for calculating damages, or calculating the value of the value gap (Daniel Antal 2019a, 2019c).

  • Indicators that quantify the development needs of musicians, and can set objective granting aims and grant evaluations (Dániel Antal 2015).

  • Understanding how music is taxed, how music contributes to the local and national GDP, and how music creates jobs directly, indirectly and with induced effects (Daniel Antal 2019b).

  • Providing detailed comparisons of music audiences across countries.

  • Measuring export success on streaming platforms, and preparing better targeting tools.

2.4.1 Creation and Quality Control of Indicators

An indicator’s values are created when the data curator has some observations available, preferably at least 20, in a data table that conforms to the tidy data principles, i.e. each variable is in exactly one column of the table, and each observation is in exactly one row. A minimal example is sketched below.
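
As a minimal illustration of such a tidy table, the sketch below builds one in R with the tibble package; the variable names and values are made up for the example.

```r
library(tibble)

# A tidy indicator table: each variable is exactly one column, each
# observation is exactly one row. All values below are invented.
indicator <- tibble(
  geo   = c("NL", "NL", "SK", "SK"),   # observation unit (country)
  year  = c(2019, 2020, 2019, 2020),   # time dimension
  value = c(12.3, 12.9,  8.4,  8.7)    # the measured value
)

# A CSV export like this is what a curator would upload to Zenodo manually.
write.csv(indicator, "my_indicator.csv", row.names = FALSE)
```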

Each indicator should be described in clear English text, explaining the meaning of the variables, the source of the observations, and other important information about the processing, refreshing, and extending of the dataset.

We safeguard the quality of the indicators with various reproducible research methods. Depending on the curator’s data science experience, we either take over the quality control mechanism or cooperate with the curator. In either case, the main inputs for quality control should be described by the data curator.

  • Unit testing: Unit tests are simple, numerical tests that catch logical errors in an indicator. Should we exclude zero values? Negative values? Must percentages add up to 100? Some of our indicators go through more than 60 unit tests. We ask for your help to get us going, and we will take care of the usual suspects: wrong currency translations, wrong decimal places (thousand, million units), etc. (See the testthat sketch after this list.)

  • Missing data treatment: No real-life dataset is complete, but many statistical and AI methods cannot handle missing values. Therefore, we make an effort to impute the missing values with estimated ones. Imputation is sometimes self-evident, but sometimes it is a very tricky business, particularly when the data has several dimensions (especially a time or geographical dimension). We want to agree with the curator on why some data may be missing and how best to handle it. For simple, two-dimensional datasets, by default, we use linear approximation, forecasting and backcasting of the values, and in small datasets the last observation carried forward or next observation carried backward methods (see the zoo sketch after this list). Might imputation compromise your data? Let us know.

  • Testing against peer-reviewed results: Often we know that, after making various computations with the data, we must arrive at an already known value. For example, the various components of GDP in economics must add up with a pre-defined precision. Certain inputs must match a scientifically validated result. If you know of such tests, let us know, and let’s include them in the unit-testing process.

  • Peer-reviewed data manipulation code: Whenever we re-organize, impute, or otherwise change the original data, we do it only with algorithms that went through scientific peer-review as algorithms. If there is a bug or something to improve in the way we handle the data, our code transparency makes it likely to come out.

  • Peer-reviewed data application: We encourage our curators, particularly academics, to send the indicators created with the help of our research automation to various forms of scientific peer review, to make sure that the data is valid and useful… and to bring credit to the curators.

  • Authentic copies: We place each new version of the indicator values into Zenodo, a data repository that keeps authentic copies and versions and assigns them digital object identifiers (DOIs). This makes sure that whenever our curators’ data is re-used and incorrectly manipulated by a business, scientific or policy user, we can detect the manipulation.
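
To illustrate the unit testing mentioned above, here is a minimal sketch with the testthat package; the values and the particular checks are invented for the example, and the real tests always depend on the indicator in question.

```r
library(testthat)

values <- c(12.3, 12.9, 8.4, 8.7)   # invented indicator values
shares <- c(40, 35, 25)             # invented percentage components

test_that("indicator values pass basic sanity checks", {
  expect_false(any(is.na(values)))   # no missing values slip through unnoticed
  expect_true(all(values >= 0))      # negative values would signal a logical error
  expect_equal(sum(shares), 100)     # percentage components must add up to 100
})
```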
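
For the missing data treatment described above, a minimal sketch with the zoo package shows the default methods we mention: linear approximation, last observation carried forward, and next observation carried backward. The series is invented, and the choice of method is always agreed with the curator.

```r
library(zoo)

# A short annual series with gaps; the values are invented for the example.
observed <- c(10.2, NA, 11.1, NA, NA, 12.5)

na.approx(observed, na.rm = FALSE)                  # linear approximation between known points
na.locf(observed, na.rm = FALSE)                    # last observation carried forward
na.locf(observed, fromLast = TRUE, na.rm = FALSE)   # next observation carried backward
```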


Figure 2.1: Zenodo Deposition Example

You can see this dataset here, which was used in this high-profile scientific publication.

2.5 Authentic Depositions of Indicators

We designed a workflow that helps our curators put their indicator tables on Zenodo. In many cases, particularly if they do EU-funded research, this is also a grant requirement. At the same time, we place the indicator in our database and make it available on our data observatory’s API.

With low-frequency data, such as annual data tables, we place all copies on Zenodo first, and then on the data API. In these cases, each new version of the indicator values (containing a new year, a new estimation, a new country, or a new observation unit) will have a new DOI version.

With high-frequency data, such as data tables that refresh daily or several times a day, we do not think that versioning every refresh is useful. In such cases, we create an authentic version at a pre-agreed frequency, for example, monthly.

2.5.1 How to Add your Existing Zenodo Depositions to Our Observatory

If you have a relevant dataset on Zenodo that should be featured in one of our observatories, or you are just uploading a new dataset, you should send it to our observatory communities. Communities are just collections that make your data easier to find and cite.

On your new or existing deposition, go to Edit, and you will find Communities right after Files and above Upload Type.


Figure 2.2: How to Add your Existing Zenodo Depositions to Our Observatory?

If you want to be featured regularly in our observatories, your data should conform to our database schema. In this case, we will help you maintain the timeliness of your data – basically, together we will keep your dataset growing and expanding, and available via our API, too. (See an example here. We will add a tutorial on this shortly to our blog.)

2.5.1.1 Digital Music Observatory

You can deposit your data to our Digital Music Observatory community, or search it for new, exciting data on Zenodo itself, at zenodo.org/communities/music_observatory.


Figure 2.3: Deposit Data, Curate Data on Zenodo for the Digital Music Observatory

2.5.1.2 Green Deal Data Observatory

You can deposit your data to our Green Deal Data Observatory community, or search it for new, exciting data on Zenodo itself, at zenodo.org/communities/greendeal_observatory.


Figure 2.4: Deposit Data, Curate Data on Zenodo for the Green Deal Data Observatory

2.5.1.3 Economy Data Observatory

You can deposit your data to our Economy Data Observatory community, or search it for new, exciting data on Zenodo itself, at zenodo.org/communities/economy_observatory/.

References

Antal, Daniel. 2019a. “Private Copying in Croatia.” https://www.zamp.hr/uploads/documents/Studija_privatno_kopiranje_u_Hrvatskoj_DA_CEEMID.pdf.
———. 2019b. “Správa o slovenskom hudobnom priemysle [Report on the Slovak Music Industry].” https://doi.org/10.17605/OSF.IO/V3BE9.
———. 2019c. “The Competition of Unlicensed, Licensed and Illegal Uses on the Markets of Music and Audiovisual Works [A szabad felhasználások, a jogosított tartalmak és az illegális felhasználások versenye a zenék és audiovizuális alkotások hazai piacán].” Artisjus - not public.
Antal, Dániel. 2015. “Javaslatok a Cseh Tamás Program pályázatainak fejlesztésére. A magyar könnyűzene tartós jogdíjnövelésének lehetőségei. [Proposals for the Development of the Cseh Tamas Program Grants. The Possibilities of Long-Term Royalty Growth in Hungarian Popular Music].” manuscript.
Eurostat. 2014. Towards a Harmonised Methodology for Statistical Indicators — Part 1: Indicator Typologies and Terminologies. 2014th ed. Vol. 1. Towards a Harmonised Methodology for Statistical Indicators 1. Luxembourg: Publications Office of the European Union. https://ec.europa.eu/eurostat/web/products-manuals-and-guidelines/-/KS-GQ-14-011.
InfoCuria. 2014. “OSA – Ochranný svaz autorský pro práva k dílům hudebním o.s. v Léčebné lázně Mariánské Lázně a.s. Case C‑351/12.” http://curia.europa.eu/juris/document/document.jsf?text=&docid=150055&pageIndex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=1996526.
———. 2017. “Autortiesību un komunicēšanās konsultāciju aģentūra / Latvijas Autoru apvienība v Konkurences padome.” http://curia.europa.eu/juris/liste.jsf?language=en&num=C-177/16.
Kotzeva, Mariana, Anton Steurer, Nicola Massarelli, and Mariana Popova, eds. 2017. Towards a Harmonised Methodology for Statistical Indicators — Part 2: Communicating Through Indicators. 2017th ed. Vol. 2. Towards a Harmonised Methodology for Statistical Indicators 1. Luxembourg: Publications Office of the European Union. https://ec.europa.eu/eurostat/web/products-manuals-and-guidelines/-/KS-GQ-17-001.
PwC. 2008. “Valuing the Use of Recorded Music.” IFPI PricewaterhouseCoopers. http://www.ifpi.org/content/library/valuing_the_use_of_recorded_music.pdf.