Taming [meta]data with DataNet, a PaaS case study

Posted by Maciej Pawlik at Feb 19, 2014 09:10 AM
A description of DataNet, a lightweight data and metadata management platform exposed as a service, and of the thinking that led to its creation. DataNet is designed to offer better data availability and discoverability, and easier access and sharing, than working with plain files or databases.

Introduction

Scientific computation produces many large data sets, which are often structured in a non-interoperable manner. The data and metadata are stored in databases or files, on computing infrastructures or local computers. Such data are practically impossible to discover, and the published results they represent are practically impossible to verify. Managing access to the data is also difficult with the permission mechanisms offered by file systems or databases. Another issue is the configuration of distributed systems (spanning different computing sites), where a computing node has to access storage through non-standard ports that are blocked by firewalls.
DataNet is a lightweight data and metadata management platform that addresses these issues by providing a web-based interface for managing data models, exposing a REST repository API for recording data, and allowing access levels to be configured easily. DataNet does not replace existing storage facilities; it provides a way to annotate data already stored on dedicated storage sites.

Principles, Architecture and Implementation

Fig 1. DataNet overall architecture

Some of DataNet's requirements are drawn from our experience of cooperating with scientists from other, mostly non-IT, fields. Most data handling tools used during that work had shortcomings, which we tried to address while gathering requirements for DataNet. With the developed solution one should be able to:

  • create an abstract data model including file and structured data mixed together,
  • version data models and create corresponding repositories,
  • manage access to data,
  • access the data independently of programming language used,
  • scale the infrastructure as required,
  • easily integrate with existing storage solutions.

Those requirements served as principles for building DataNet, whose architecture is depicted in Fig 1.

Three layers decompose the functionality and make the solution modular. The top layer is a web interface built with Google Web Toolkit and Java. It is used to manage abstract data models, version them and deploy repositories, and also to view and manage the repositories, including a mechanism for applying access restrictions. The middle layer provides a scalable space for repository deployment. It is built on top of a PaaS platform, which supports different storage engines for tabular data and can be easily expanded when the current deployment capacity is exceeded. The current PaaS platform of choice is Cloud Foundry, deployed on top of the PL-Grid Cloud infrastructure. Each repository exposes a REST interface for data recording and retrieval. The bottom layer is the storage infrastructure, where file data is stored through the GridFTP protocol.

The next two sections contain screenshots of the web GUI and code snippets showing that accessing data stored with DataNet is straightforward.

WebGUI screenshots

Fig. 2 depicts a sample database named UserDatabase, with Person and Address entities. Once deployed, the model becomes a repository, and data can be entered and queried as shown in Fig. 3.

Fig 2. Model editor and definition of example database

Fig 3. Repository's data browser
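
To make the model more concrete, the two entities of UserDatabase could be pictured as plain Python data classes. This is only a sketch under assumptions: the Person fields mirror the keys used in the upload example in this post, the Address fields (street, city) are hypothetical because the post does not list them, and the address field of Person is assumed to hold the id of an Address record. The to_payload helper is likewise hypothetical and not part of DataNet.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Address:
    # Hypothetical fields: the post does not list the Address attributes.
    street: str
    city: str


@dataclass
class Person:
    # Fields mirror the keys used in the post's upload example.
    name: str
    surname: str
    age: str       # kept as a string, as in the upload example
    address: str   # assumed to be the id of an already stored Address record


def to_payload(entity) -> str:
    """Serialize an entity to a JSON string suitable as a request body."""
    return json.dumps(asdict(entity))


person = Person(name="Jan", surname="Kowalski", age="35",
                address="530b4dd04c09da2f3c000001")
print(to_payload(person))
```

Keeping the model as data classes on the client side is one way to catch misspelled field names before a record is sent to the repository.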

REST usage example

Every data operation can be performed through the REST interface. Below is a code snippet that uploads data with Ruby:

require 'rest-client'
require 'json'
# Upload a Person record as JSON, using HTTP basic authentication
repository = RestClient::Resource.new('https://userdatabase.paas.datanet.plgrid.pl/Person', :user => 'user', :password => 'password')
jdata = {:name => 'Jan', :surname => 'Kowalski', :age => '35', :address => '530b4dd04c09da2f3c000001'}.to_json
repository.post jdata, content_type: :json
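
Because the repository is reached over plain HTTP with basic authentication and a JSON body, the same upload can be prepared from any language. The following is a minimal sketch using only the Python standard library. The build_upload_request helper is a hypothetical name introduced for illustration, and the request is only constructed here, not sent.

```python
import base64
import json
import urllib.request


def build_upload_request(url, record, user, password):
    """Build (but do not send) a POST request carrying `record` as JSON."""
    # HTTP basic authentication: base64 of "user:password".
    credentials = base64.b64encode(
        f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(record).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )


request = build_upload_request(
    "https://userdatabase.paas.datanet.plgrid.pl/Person",
    {"name": "Jan", "surname": "Kowalski", "age": "35",
     "address": "530b4dd04c09da2f3c000001"},
    "user", "password",
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```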

After that it is possible to retrieve data, here done with Python:

import requests
# List all Person records
response = requests.get('https://userdatabase.paas.datanet.plgrid.pl/Person', auth=('user', 'password'))
# Find a specific userId in the listing, then substitute it for {userId} to get the record details
response = requests.get('https://userdatabase.paas.datanet.plgrid.pl/Person/{userId}', auth=('user', 'password'))
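
The URLs in these examples suggest a simple pattern: the repository name forms the host prefix, the entity name is the first path segment, and a record id is the optional second segment. The sketch below encodes that pattern. Note that the pattern itself, the lowercasing of the repository name and the entity_url helper are assumptions inferred from the example URLs, not a documented DataNet rule.

```python
from urllib.parse import quote

# Assumed host suffix, taken from the example URLs in this post.
PAAS_DOMAIN = "paas.datanet.plgrid.pl"


def entity_url(repository, entity, record_id=None):
    """Return the URL of an entity collection, or of one record in it."""
    url = f"https://{repository.lower()}.{PAAS_DOMAIN}/{quote(entity, safe='')}"
    if record_id is not None:
        # Percent-encode the id so it stays a single path segment.
        url += f"/{quote(record_id, safe='')}"
    return url


print(entity_url("UserDatabase", "Person"))
print(entity_url("UserDatabase", "Person", "530b4dd04c09da2f3c000001"))
```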

Demo installation and further reading

The preproduction installation of DataNet is available at https://datanet.plgrid.pl. This installation will become the official DataNet PL-Grid service in the near future. More information about DataNet can be found:

  • on the documentation page of the DataNet preproduction installation mentioned above,
  • on the “products page” of this website (coming soon),
  • in the article “DataNet – Lightweight Metadata and Data Management”, which will be published in the PL-Grid PLUS Book.
