Open Collaboration With Data Curators
Big data creates inequalities, because only the world's biggest corporations, best-endowed universities, and strongest governments can maintain long-running, well-designed, global data collection programs.
We are working to balance these inequalities in the spirit of the new European Data Governance Act by fostering the re-use of public sector information, increasing the interoperability (integration capacity) of public and private data assets, and providing practical tools for data sharing and voluntary data altruism.
We are taking a fresh approach to the data observatory concept, modernizing it with 21st-century data and metadata standards, the latest results of reproducible research, and data science. Many centralized data observatories simply put research reports and spreadsheet or SPSS files on the internet; these are usually hard to find, difficult to import, and difficult to join with your database or other research data tables. Instead, we apply a web 3.0 approach in which our reports and datasets self-synchronize with their sources: they update and re-visualize the underlying data, correct the footnotes and bibliographies, and place each new release of the report directly into global library systems. We can even build pipelines to legally open datasets that have never been released on the web 2.0 and cannot be downloaded with a browser.
Various UN and OECD bodies, and particularly the European Union, support or maintain more than 60 data observatories: permanent data collection and dissemination points.
We have studied about 80 EU, UN, and OECD data observatories, including already defunct ones, and found that almost none of them use these 21st-century solutions. We are building open-source data observatories that run open-source statistical software to automatically process and document reusable public sector data (from public transport, meteorology, tax offices, taxpayer-funded satellite systems, etc.) and reusable scientific data (from EU taxpayer-funded research).
This document is intended for curators of the data observatories. For an introduction to the data observatories, please refer to Our Vision of a Modern Data Observatory.
Stop reinventing the wheel
When was a file downloaded from the internet? What has happened to it since? Are there updates? Was a bibliographical reference created for quotations? Were missing values imputed? Were currencies converted? Who knows about it: who created a dataset, and who contributed to it? Which spreadsheet file is an intermediate version, and which is the final one, checked and approved by a senior manager?
Read our full blogpost: The Data Sisyphus
When a data user, such as a data curator, finds a good data or information resource in a statistical agency, an observatory, or a library, he or she will likely go back to that page many times over the years and download the same tables … because they are lost, because they have a new release, because their bibliographical information was missing. This is error-prone work that no senior consultant, advisor, researcher, or lawyer gets credited for. We do not believe that the answer is to task PhD candidates, trainee lawyers, or interns with doing these manual tasks over and over again, because they will not like it and they will not do it right.
In our vision, computers should do this task perfectly, painlessly, and every day. If you have used a data source such as the website of Eurostat or Europeana, the Library of Congress, or the Zenodo science repository at least twice, it is very likely that you will go back again. And again. And again. Our data observatories are designed to do this for you automatically: retrieve, format, improve, document, and make available every resource that you use in your business or scholarly (research or education) practice.
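As a hedged illustration of what this automation builds on, the sketch below uses the eurostat R package to retrieve a table; the dataset id tps00001 (total population) is only a placeholder for whatever table you revisit regularly.

```r
# A minimal sketch: retrieve a Eurostat table with the eurostat R
# package; "tps00001" (total population) is only a placeholder for
# a table you revisit regularly.
library(eurostat)

population <- get_eurostat("tps00001")

# Repeated calls re-use the package's local cache, so only new
# releases are actually downloaded.
head(population)
```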
Humans and Algorithms
We believe in evidence-driven, open policy analysis, open science, and open government. We believe that humans are able to collect information, process and organize it, and form informed opinions. We believe in trustworthy artificial intelligence, AI that uses big data subject to human agency and ethical or legal constraints.
We would like to find new collaborators: professionals, researchers, or citizen scientists, in an institutional (business) or personal capacity, who share our values and would like to create more informative datasets, indicators, and visualizations. We follow the open collaboration method used in open knowledge systems (such as Wikipedia) and open-source software development. We follow the ⏯ Contributor Covenant Code Of Conduct, originating in open-source software development, to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
If you already know that you would like to collaborate with us on the creation of better data products, please refer to our practical ⏩ onboarding essentials.
Motivation
Who should become a data curator? You do not need to be a data scientist, a statistician, or a data engineer. We are looking for professionals, researchers, or citizen scientists who are interested in data and its visualization, and in its potential to form the basis of informed business or policy decisions and to provide scientific or legal evidence. Our ideal curators share a passion for data-driven evidence and visualizations, and have a strong, subjective idea about the data that would inform them in their work. ⏩ See our inspiration chapter.
Why do we need better data? Data on its own is hardly informative: it may be a page in a book, a file in an obsolete format on a governmental server, or an Excel sheet that you do not remember having checked for updates. Most data is useless, because we do not know how it can inform us, or we do not know if we can trust it. Unfortunately, even most data in open science repositories, or data made public under the EU open data regime (or the freedom of information regime of the U.S.), is useless without further work.
“Data is potential information, analogous to potential energy: work is required to release it.” —citation
Data sources
We work with original, primary data collection; with big data collected from small, intangible sources on the internet; and with data aggregated from open science or open government repositories.
Working with data is hard. Whether the data comes from a reliable social science archive like GESIS (which holds plenty of social science data about Europe) or Eurostat (Europe's statistical umbrella authority), or from new fieldwork, it needs to be polished into a form in which it can be trusted and reused.
Our curators point us to data that they find interesting or useful, usually from their research, professional, or artistic practice. We help them make the data more trustworthy and more useful, and ask them to evaluate the result. Our curators need to have a curiosity for data as an information source and as evidence, but they do not need a statistics, computer science, or data science background. ⏩ Get inspired to be our data curator or immediately ⏩ Sign up as a data curator.
Our data scientists and developers know how to check for data inconsistencies, improve the documentation, and bring the data into a form that is easy to import into a database, a knowledge graph, or simply into a spreadsheet application like Excel or statistical software like SPSS or STATA.
The aim of our engineers is to enter the web 3.0 phase, in which our curators can synchronize data with trusted sources: for example, download ingredients from Eurostat or Europeana, add their own knowledge, and return the data publication to the open science repository Zenodo, where these organizations (and other professional users) can immediately find it.
Reproducible research
We follow the principles of reproducible research, which increase data quality through the use of open algorithms, the provision of a full data (lifecycle) history, and unit testing. We aim to make review by senior staff or external audit as easy as possible. Whenever possible, we rely on scientific peer review for such an audit, and we are always open to suggestions, bug reports, and other issues. Our observatories embrace the ideas of open government, open science, and open policy analysis.
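As a hedged sketch of what unit testing a dataset can look like, the example below uses the testthat R package; the data frame, its columns, and its figures are hypothetical placeholders, not one of our production checks.

```r
# A minimal sketch of unit testing a dataset with testthat; the data
# frame, columns, and figures are hypothetical placeholders.
library(testthat)

gdp <- data.frame(
  geo   = c("NL", "SK"),
  year  = c(2020, 2020),
  value = c(800000, 92000)  # million euros, illustrative only
)

test_that("the GDP dataset is ready for release", {
  expect_true(all(c("geo", "year", "value") %in% names(gdp)))
  expect_false(any(duplicated(gdp[, c("geo", "year")])))
  expect_true(all(gdp$value > 0, na.rm = TRUE))
})
```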
Read more about our reproducible research practice.
Informative data
The good news about documentation and data validation costs is that they can be shared. If many users need GDP per capita data from all over the world expressed in euros, then it is enough if a single entity, a data observatory, collects all GDP and population data expressed in dollars, korunas, and euros, makes sure that the latest data is correctly converted to euros, and then correctly divided by the latest population figures. These tasks are error-prone and should not be repeated by every data journalist, NGO employee, PhD student, or junior analyst. This is one of the services of our data observatory.
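A minimal sketch of this shared conversion work in R with dplyr; every figure and exchange rate below is an illustrative placeholder, not a real statistic.

```r
# A minimal sketch of the conversion-and-division step; all figures
# and exchange rates are illustrative placeholders.
library(dplyr)

gdp <- data.frame(
  geo      = c("CZ", "DE"),
  currency = c("CZK", "EUR"),
  gdp      = c(5700000, 3400000)    # millions, national currency
)

rates <- data.frame(
  currency     = c("CZK", "EUR"),
  eur_per_unit = c(0.04, 1)         # placeholder exchange rates
)

population <- data.frame(
  geo        = c("CZ", "DE"),
  population = c(10.7e6, 83.2e6)    # placeholder population figures
)

gdp %>%
  left_join(rates, by = "currency") %>%
  left_join(population, by = "geo") %>%
  mutate(gdp_per_capita_eur = gdp * eur_per_unit * 1e6 / population) %>%
  select(geo, gdp_per_capita_eur)
```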
Impact
Regardless of whether you want to target policy-makers or students, or you want to sell your business intelligence, we will help you increase your impact and maximize your performance by making your scientific, educational, policy, or business content more Findable, Accessible, Interoperable, and Reusable.
- You will get more viewers, readers, and interactions.
- You do not have to worry about the use of your datasets in Excel, SPSS, or STATA, or about importing them into relational databases (see the export sketch after this list). Your reports will read well on a Kindle or on paper, as a website or as a long-read.
- Your knowledge products will link to statistical resources, libraries, and industry knowledge hubs.
- Your research products, including datasets, databases, visualizations, reports, and bibliographies, will self-synchronize with new versions of your sources, allowing you to recast market research, educational material, or scientific experiments every quarter or year.
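As a hedged sketch of the export step behind the second point above, the example below uses the haven and readr R packages; the data frame and file names are hypothetical.

```r
# A minimal sketch of exporting one tidy dataset to several of the
# formats mentioned above; data frame and file names are hypothetical.
library(haven)   # SPSS and Stata formats
library(readr)   # CSV for Excel and relational databases

my_dataset <- data.frame(
  geo   = c("NL", "SK"),
  year  = c(2021, 2021),
  value = c(1.7, 2.3)
)

write_csv(my_dataset, "my_dataset.csv")  # Excel, databases
write_sav(my_dataset, "my_dataset.sav")  # SPSS
write_dta(my_dataset, "my_dataset.dta")  # Stata
```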
The presentation and release of your research results will take a leap as big as the one the world wide web brought in 1995. The first web allowed them to be found and downloaded all over the world. Web 2.0 created applications. Web 3.0 will allow our machines to connect your research with library databases, place it into open science repositories, and self-refresh its datasets, visualizations, and bibliographies to stay synchronized with the rest of the world.
By connecting directly to the world's library systems and repositories, the findability of your research products will increase exponentially. We use the latest data science to make them as accessible as possible: strictly defined structures make importing them into relational databases, spreadsheet applications like Excel, and statistical software like SPSS or Stata plug-and-play. Your research results will be available for all major software vendors, and your research reports simultaneously translated to EPUB, Kindle, PDF, Word, PowerPoint, Apple software, … you name it. Ongoing, permanent synchronization with libraries, statistical data sources, and even non-public databases will keep your research product reusable for you at all times.
We make a no-nonsense application of the FAIR requirements enshrined in most EU-mandated or EU-sponsored research. We will guide you through the often hard-to-imagine requirement to "…emphasise machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data" for digital assets, particularly datacubes and datasets used in statistics and data analysis.
We publish your data, data shared with you, data collected for you, and reused data in a way that makes it easy for computers, libraries, and users to find. Read more on our ⏩ FAIR metadata handling (or try our software for R users).
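As a hedged, simplified illustration of dataset-level metadata (not our actual schema), the base R sketch below attaches Dublin-Core-style fields to a data frame so that they travel with the object.

```r
# A minimal sketch of dataset-level metadata with base R attributes;
# the field names and values are illustrative, not our actual schema.
my_dataset <- data.frame(
  geo   = c("NL", "SK"),
  value = c(1.7, 2.3)
)

attr(my_dataset, "title")   <- "Example indicator"
attr(my_dataset, "creator") <- "Example Data Observatory"
attr(my_dataset, "rights")  <- "CC-BY-4.0"

# The metadata can later be serialized alongside the data itself.
attributes(my_dataset)[c("title", "creator", "rights")]
```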
Governance principles
- We do not centralize data and do not touch upon data ownership. We developed a model of operations with CEEMID, where we learned to work with the various conflicts of interest and data protection rules of the music industry.
- Our data observatories integrate partner data into shared data pools. Such data integration exponentially increases the value of the small contributing datasets and supports data altruism and other measures of the Data Governance Act.
- We support syndicated, joint, pooled research efforts to make Big Data Work For All.
- Our observatories are stakeholder-governed.
Technical features
Our data observatories:
- are supported with optional, open-source APIs to retrieve the data;
- are supported with RDF serialization;
- support research automation;
- support the automated publishing and releasing of data, visualizations, newsletters, and long-form documentation in auto-refreshing websites, blogposts, articles, or even books;
- develop an ecosystem of open-source software that helps the professional collection, processing, and documentation of data in conformity with the Data Governance Act, and that supports data sharing and data altruism.
Our data observatories are collaborative and professionally curated data services made from datasets, codebooks and descriptions, reusable visualizations, and documentation. They are designed to synchronize the datasets, research documents, and databases of our partners with reliable statistical, library, knowledge graph, and other services. This enables our partners to keep their data and research products fully up to date and make them visible to global knowledge, library, data repository, and other services.
Big data for all
Our data observatories focus participants on collecting only genuinely new data and on reusing already existing data from the world's statistical agencies, libraries, encyclopedias, and digital platforms. With harmonized data collection, particularly in the form of surveys, you can immediately give a history and international context to your data. We tap into governmental and scientific data collections that businesses or civil society organizations could never replicate: data collected by satellites, or anonymized data collected by tax or statistical authorities. We use metadata standardization and the RDF (semantic web) concept to constantly synchronize our data observatories with knowledge in the world's large libraries, encyclopedias, and statistical agencies.
Synchronize your research with the world
We help our observatory partners bring their own datasets and databases into a form that can connect to other industry, scientific, government, or library sources and refresh or enhance themselves.
We support the machine-reading of our data products and their importing into relational databases. Our own API organizes the datasets into an SQL relational database, which allows more complex querying for expert users in SQL, or in the dbplyr extension of the R language, which allows mixing dplyr and SQL queries (see ?? Relational Databases, SQL and API).
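A minimal sketch of this mixed dplyr-and-SQL querying, with an in-memory SQLite database standing in for the observatory's database; the table and column names are hypothetical.

```r
# A minimal sketch of mixing dplyr and SQL; an in-memory SQLite
# database stands in for the observatory's database, and the table
# and column names are hypothetical.
library(DBI)
library(dplyr)
library(dbplyr)  # translates dplyr verbs to SQL

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "indicator", data.frame(
  geo   = c("NL", "SK", "NL"),
  year  = c(2020, 2020, 2021),
  value = c(1.7, 2.3, 1.8)
))

# dplyr verbs are translated to SQL and run inside the database.
tbl(con, "indicator") %>%
  filter(geo == "NL") %>%
  summarise(mean_value = mean(value, na.rm = TRUE))

# Plain SQL works on the same connection.
dbGetQuery(con, "SELECT geo, COUNT(*) AS n FROM indicator GROUP BY geo")

dbDisconnect(con)
```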
Our data observatories are data-as-service and research-as-service providers, designed to synchronize knowledge with other trusted information agents, like global libraries, global statistical agencies, or Wikidata (which powers many structured Wikipedia pages), via the semantic web. We are still experimenting with these features. Each observatory also contains codebooks and other metadata organized in a format that offers easy importing and serialisation into RDF and SPARQL applications (see ?? Data-as-service, Linked Data, SPARQL).
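As a hedged sketch of RDF serialisation from R, the example below uses the rdflib package; the URIs and the single triple are illustrative placeholders, not our actual vocabulary.

```r
# A minimal sketch of RDF serialisation with the rdflib R package;
# the URIs and the single triple are illustrative placeholders.
library(rdflib)

graph <- rdf()
rdf_add(graph,
        subject   = "http://example.com/dataset/gdp",
        predicate = "http://purl.org/dc/terms/title",
        object    = "Example GDP dataset")

# Serialize to Turtle, a format SPARQL tooling readily consumes.
rdf_serialize(graph, "example.ttl", format = "turtle")
```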
Our system is designed to help the Findability, Accessibility, Interoperability, and Reuse of digital assets, particularly datacubes and datasets used in statistics and data analysis. The FAIR principles "…emphasise machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data."
Most small and medium-sized businesses, NGOs, civil society organizations, and public policy units do not have the resources to employ data scientists and data engineers full-time, and such services on a part-time or ad hoc basis are too expensive for them. This means that they are struggling with the data Sisyphus: munging spreadsheets into the desired format for a chart or a regression model, chasing missing data, trying to catch up on documentation or supervisory control, and in the meantime wasting countless hours on boring work that computers do much better and with far fewer errors.
High Usability
Our datasets are tidy, imputed or forecasted, and visualized, which means that they are immediately ready to be used in Excel-like spreadsheet applications or SPSS- or STATA-like statistical software, or for reporting in a book, in a newsletter, or on a website.
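A minimal sketch of what tidying means in practice, using the tidyr R package on a hypothetical wide table: in tidy form, every row is one observation and every column is one variable.

```r
# A minimal sketch of tidying a hypothetical wide, report-style table
# with the tidyr R package.
library(tidyr)

wide <- data.frame(
  geo    = c("NL", "SK"),
  `2020` = c(1.7, 2.3),
  `2021` = c(1.8, 2.4),
  check.names = FALSE
)

# In tidy form every row is one observation, every column one variable.
tidy <- pivot_longer(wide,
                     cols      = c("2020", "2021"),
                     names_to  = "year",
                     values_to = "value")
tidy
```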
The dataobservatory.eu products are not made by official statistical agencies, but by triangular data ecosystems of business, policy, and academic users. This allows us to be professionally subjective and therefore achieve higher usability.
Our data curators professionally perform those error-prone and laborious tasks (currency conversion, unit conversion, linear interpolation of missing observations, etc.) that data analysts hate and less tech-savvy users often get wrong. Our datasets often go through more than a hundred automated controls before they are presented to the user, to make sure that the data quality is excellent and that the datasets are indeed readily available for use. Statistical agencies do not offer these services because they depend on the subjective knowledge of the data curator.
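A minimal sketch of one such task, the linear interpolation of a missing observation, using the zoo R package; the series is a hypothetical placeholder.

```r
# A minimal sketch of linear interpolation of a missing observation
# with the zoo R package; the series is a hypothetical placeholder.
library(zoo)

value <- c(100, 102, NA, 106, 108)  # one missing annual observation
na.approx(value)
#> [1] 100 102 104 106 108
```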
Tidy data is ready to be published, to be placed on a visual chart, or to be placed on a map. Tidiness is a rigorous concept in data science. Our data observatories come with many extra services that help the effective communication of the observatory partners' knowledge. We automatically create charts and tables that are refreshed every day for your publications. We can automatically place them into newsletter templates. We automatically place them on the documentation part of your website. We can even automate most of the process of putting them into an annual report or statistical yearbook that you can publish in e-bookstores, send to global libraries, and sell or give away to your stakeholders.
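As a hedged sketch of such an automatically refreshed chart, the example below uses ggplot2; the data frame and file name are hypothetical, and in production a scheduled script would re-render it whenever the data updates.

```r
# A minimal sketch of a chart that a scheduled script can re-render
# whenever the data refreshes; data and file name are hypothetical.
library(ggplot2)

indicator <- data.frame(
  year  = 2018:2021,
  value = c(1.5, 1.7, 1.6, 1.8)
)

p <- ggplot(indicator, aes(x = year, y = value)) +
  geom_line() +
  geom_point() +
  labs(title = "Example indicator", x = NULL, y = "value")

# Re-using the same file name lets the website, newsletter, and
# report templates embed the latest version automatically.
ggsave("example_indicator.png", plot = p, width = 6, height = 4)
```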