Data context

This team was created to avoid overlap in data manipulation between teams. Its primary goal was to analyze the raw PIK data set and make the bibliographical data it contains accessible. To that end, the data set had to be of reasonably good quality and compatible with the Wikidata data model before it was made available to the data mapping group.

Challenges:

Our first challenge was to understand the data set fully. Some types were unclear, and the correct formatting of the CSV file had to be respected. We quickly realized that the data set contained inconsistencies, which we had to fix as best we could before we could continue working with the data. We also had to design a data model that fits both Wikidata and the Scholia concept. After the cleanup and the creation of the data model, the last challenge was to map the current data set onto that model.
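To illustrate the kind of cleanup described above, here is a minimal sketch in Python. It is not the team's actual script: the column names (`title`, `doi`, `year`), the sample rows, and the helper `clean_rows` are assumptions made purely for illustration, and the real PIK data set may use different fields and need different rules.

```python
import csv
import io


def clean_rows(rows):
    """Trim whitespace, normalize DOIs, and drop duplicate records.

    Hypothetical helper; the field names are assumed, not taken from the PIK data.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Strip surrounding whitespace from every field; missing values become "".
        record = {
            field: (value or "").strip()
            for field, value in row.items()
            if field is not None  # skip surplus columns that have no header
        }
        # Normalize the DOI: lower-case it and remove a resolver prefix if present.
        doi = record.get("doi", "").lower()
        for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
            if doi.startswith(prefix):
                doi = doi[len(prefix):]
        record["doi"] = doi
        # Use the DOI as the duplicate key, falling back to the title.
        key = doi or record.get("title", "").casefold()
        if not key or key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Made-up sample data standing in for the raw CSV file.
sample = io.StringIO(
    "title,doi,year\n"
    " Paper A ,https://doi.org/10.1000/ABC,2019\n"
    "Paper A,10.1000/abc,2019\n"
    "Paper B,,2020\n"
)
result = clean_rows(csv.DictReader(sample))
for record in result:
    print(record)
```

In this sketch the first two sample rows collapse into one record because they share a DOI once it is normalized, while the row without a DOI is kept and identified by its title.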

Results:

The results include a cleaned-up data set and the scripts used to clean it up. We also compiled a list of requirements for newly created data sets, so that quality is ensured from the start when such data sets are produced again. In addition, we created a data model that fits the format of Wikidata and Scholia.

Team:

*The complete documentation can be found in our Documentation GitHub repository under "Documentation: Data Cleansing and Model". See also the Scrum board and the further internal project documentation.


Table of contents