Taming [meta]data with DataNet, a PaaS case study
Introduction
Scientific computation produces many large data sets, which are often structured in non-interoperable ways. The data and metadata are stored in databases or files, on computing infrastructures or on local computers. As a result, such data is practically undiscoverable, and published results based on it are nearly impossible to verify. Managing access levels is also difficult with the permission mechanisms that file systems and databases provide. Another issue arises in distributed systems spanning several computing sites, where computing nodes have to reach storage through non-standard ports that firewalls block.
DataNet is a lightweight data and metadata management platform that addresses these issues by providing a web-based interface for managing data models, exposing a REST repository API for data recording, and making access levels easy to configure. DataNet does not replace existing storage facilities but provides a way to annotate the data already stored on dedicated storage sites.
Principles, Architecture and Implementation
Fig 1. DataNet overall architecture
Some of the requirements of DataNet are drawn from our experience of cooperating with scientists from other, mostly non-IT, fields. Most of the data handling tools used during that work had shortcomings, which we tried to address while gathering requirements for DataNet. With the resulting solution one should be able to:
- create an abstract data model including file and structured data mixed together,
- version data models and create corresponding repositories,
- manage access to data,
- access the data independently of programming language used,
- scale the infrastructure as required,
- easily integrate with existing storage solutions.
These requirements served as principles for building DataNet, whose architecture is depicted in Fig 1.
Three layers decompose the functionality and keep the solution modular. The top layer, a web interface built with Google Web Toolkit and Java, is used to manage abstract data models, version them and deploy repositories. The web interface is also used to view and manage repositories, including a mechanism for applying access restrictions. The middle layer provides a scalable space for repository deployment. It is built on top of a PaaS platform, which makes it possible to use different storage engines for tabular data and can be easily expanded when the current deployment capacity is exceeded. The PaaS platform currently used is CloudFoundry, deployed on top of the PL-Grid Cloud infrastructure. Each repository exposes a REST interface for data recording and retrieval. The bottom layer is the storage infrastructure, where file data is stored through the GridFTP protocol.
The next two sections present screenshots of the WebGUI and code snippets showing that accessing data stored with DataNet takes only a few lines of code.
WebGUI screenshots
Fig. 2 depicts a sample database named UserDatabase, with Person and Address entities. Once deployed, the model becomes a repository, and data can be entered and queried as shown in Fig. 3.
Fig 2. Model editor and definition of example database
Fig 3. Repository's data browser
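To make the shape of such a model more concrete, the Person entity could be mirrored in client code roughly as follows. This is only an illustrative stand-in: the actual model is defined in the web editor, the field names are taken from the record used in the REST usage example, and the Person class and to_payload helper are hypothetical names, not part of DataNet.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Person:
    """Client-side mirror of the Person entity (hypothetical helper class)."""
    name: str
    surname: str
    age: str      # sent as a string in the example record
    address: str  # id of an Address record, not an embedded object


def to_payload(person: Person) -> str:
    """Serialize a Person into a JSON body suitable for a repository request."""
    return json.dumps(asdict(person))


if __name__ == "__main__":
    jan = Person("Jan", "Kowalski", "35", "530b4dd04c09da2f3c000001")
    print(to_payload(jan))
```

Note that the Person record refers to its Address by id, so the Address record would have to exist in the repository before the Person record that points to it.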
REST usage example
Every data operation can be performed through the REST interface. The snippet below uploads a record using Ruby:
require 'rest-client'
require 'json'
# Person entity endpoint of the UserDatabase repository, accessed with HTTP basic auth
repository = RestClient::Resource.new('https://userdatabase.paas.datanet.plgrid.pl/Person', :user => 'user', :password => 'password')
jdata = {:name => 'Jan', :surname => 'Kowalski', :age => '35', :address => '530b4dd04c09da2f3c000001'}.to_json
repository.post jdata, content_type: :json
The data can then be retrieved, here with Python:
import requests
# list all Person records
response = requests.get('https://userdatabase.paas.datanet.plgrid.pl/Person', auth=('user', 'password'))
# find a specific userId in the listing, then substitute it for {userId} to get the record details
response = requests.get('https://userdatabase.paas.datanet.plgrid.pl/Person/{userId}', auth=('user', 'password'))
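For readers working only in Python, the Ruby upload and the single-record lookup above could be sketched with the standard library alone. This is a minimal sketch under assumptions: the endpoint and credentials are the ones from the examples above, the function names (basic_auth_header, build_upload, build_fetch) are hypothetical, and the requests are only prepared, not sent, so the exact responses of a DataNet repository are not shown.

```python
import base64
import json
import urllib.request
from urllib.parse import quote

# Entity URL taken from the examples above.
PERSON_URL = "https://userdatabase.paas.datanet.plgrid.pl/Person"


def basic_auth_header(user: str, password: str) -> str:
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_upload(record: dict, user: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST that records one Person entry."""
    return urllib.request.Request(
        PERSON_URL,
        data=json.dumps(record).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
        method="POST",
    )


def build_fetch(record_id: str, user: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET for a single Person record."""
    # The id is percent-encoded so it cannot alter the URL path.
    return urllib.request.Request(
        f"{PERSON_URL}/{quote(record_id, safe='')}",
        headers={"Authorization": basic_auth_header(user, password)},
        method="GET",
    )


if __name__ == "__main__":
    person = {"name": "Jan", "surname": "Kowalski", "age": "35",
              "address": "530b4dd04c09da2f3c000001"}
    upload = build_upload(person, "user", "password")
    print(upload.get_method(), upload.full_url)
    # To actually send the request: urllib.request.urlopen(upload)
```

Keeping request construction separate from sending makes the record id substitution explicit and lets the requests be inspected before they reach a live repository.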
Demo installation and further reading
The preproduction installation of DataNet is available at https://datanet.plgrid.pl. This installation will become the official DataNet PL-Grid service in the near future. More information about DataNet can be found:
- on the documentation page of the DataNet preproduction installation mentioned above,
- on the “products page” of this website - coming soon,
- in the article “DataNet – Lightweight Metadata and Data Management”, to be published in the PL-Grid PLUS Book.