


Data Waterfall Floods a Poor Little DB

Posted by Tomasz Gubala at May 23, 2012 04:55 PM
How far can one go with a single MySQL database instance? When faced with a waterfall of bioinformatics data, how much is "too much"? Today we'd like to tell a tale of a real bioinformatics application which, quite simply, ended up producing too much data. We describe what we did to make the data manageable with a single SQL database, the lengths we had to go to along the way and, finally, why we never really got there after all.

The Pledge

Here is the problem. We implemented a little bioinformatics experiment for a colleague of ours, working at a medical department of another university, involving a bunch of FASTA programs [1] running in parallel on ZEUS, a pretty big machine (http://i.top500.org/system/177388). However, what at first seemed a rather innocent toy quickly turned into a massive data production plant. The FASTA worker farm produced approximately 715 GB of data in the form of text files. At that moment we realized we were faced with the challenging task of effectively parsing that set of files into a manageable structure and uploading the result into some kind of database, which our fellow researcher would then be able to query.

[Figure: pasta3-diamond.png]

Fig. 1. The diamond pattern of the interrelated tables, designed to hold most of the data of our bioinformatics experiment.

The exact final data structure required by the end user was a diamond pattern of four different interconnected tables, as shown in Fig. 1. This structure was imposed on us by the analysis tools to be executed further down the line, so we could not alter it to suit our needs (which only added to the challenge). With a very rough, single-pass parsing run we were able to reduce the number of entries we were supposed to store, arriving at ca. 251 million records. Each of these had to be split into three different tables (the bottom "half" of the diamond) so, in the end, we had to insert around 753 million records into a remote database (note that each record was quite tiny, on the order of several dozen bytes). The fourth table (ss_task in Fig. 1) was to be far smaller record-wise and, as such, may be disregarded for the purposes of this discussion. Whereas we were able, and quite willing, to completely parallelize the parsing procedure (again, running the fast parser process on many cores of the aforementioned ZEUS cluster), we chose not to distribute the database itself.
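To make the shape of the diamond concrete, here is a minimal DDL sketch of such a schema. The table names follow Fig. 1; the column names, types and the exact link between ss_alignment and ss_task are assumptions made purely for illustration, not the actual experiment schema.

    -- Illustrative sketch only; column names and types are assumed.
    CREATE TABLE ss_task (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE ss_alignment (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        task_id INT UNSIGNED NOT NULL,           -- assumed link to ss_task
        score   DOUBLE,
        FOREIGN KEY (task_id) REFERENCES ss_task (id)
    ) ENGINE=InnoDB;

    CREATE TABLE rna_subseq (
        id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        alignment_id BIGINT UNSIGNED NOT NULL,   -- parent key from ss_alignment
        start_pos    INT,
        end_pos      INT,
        FOREIGN KEY (alignment_id) REFERENCES ss_alignment (id)
    ) ENGINE=InnoDB;

    CREATE TABLE gs_subseq (
        id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        alignment_id BIGINT UNSIGNED NOT NULL,   -- parent key from ss_alignment
        start_pos    INT,
        end_pos      INT,
        FOREIGN KEY (alignment_id) REFERENCES ss_alignment (id)
    ) ENGINE=InnoDB;

At 753 million rows of a few dozen bytes each, even such narrow tables add up to tens of gigabytes of table and index data.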

The Turn

Faced with the presented problem, our strategy was to obtain as high a data insert throughput as possible. Following some simple calculations we decided that a satisfactory throughput would need to be on the level of 10 kHz (ten thousand sequence entries, i.e., thirty thousand records, per second). To put that into context, a throughput of 1 kHz would make the upload process take three full days (which is not a very long time in high-performance computing, yet quite unsatisfactory nonetheless).

We set up a fairly standard MySQL instance (propelled by the InnoDB engine) with a large amount of operating memory (several GB), and made sure to procure a sufficient interconnect (a 1 Gbit Ethernet link). For our first test we benchmarked the installation with an out-of-the-box database configuration. We ran 250 parsing workers in parallel and had each establish a single, permanent TCP/IP connection (thus eliminating the latency of repeated connection handshakes). Our parallel farm of 250 data pipes pumped inserts right into the heart of our database, yet the result was not very encouraging: roughly 1 kHz insert frequency, way below our target.
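In terms of SQL traffic, each worker was effectively doing something along these lines for every parsed entry (column names as in the sketch above, i.e., assumed): a parent row, a round-trip to learn its auto-assigned key, then the two dependent rows, all under default autocommit, so every INSERT pays for a synchronous log flush.

    INSERT INTO ss_alignment (task_id, score) VALUES (7, 42.0);
    SET @aid = LAST_INSERT_ID();  -- fetch the key MySQL just assigned to the parent row
    INSERT INTO rna_subseq (alignment_id, start_pos, end_pos) VALUES (@aid, 10, 55);
    INSERT INTO gs_subseq  (alignment_id, start_pos, end_pos) VALUES (@aid, 200, 260);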

Thus we decided to go much further and sacrifice the validation mechanisms. Enlarging the buffers for incoming data and making MySQL rather lazy about writing to disk (using innodb_flush_log_at_trx_commit=0, which means that the DBMS no longer flushes its log to disk after each INSERT but roughly once per second instead, slightly relaxing the ACID durability guarantee) improved performance a little. Further removal of transaction safeguards, indexing and on-write validations yielded much better results. With no inter-table foreign key checks, the entire procedure achieved an insert frequency of around 3 kHz. A nice speedup; however, what might be merely acceptable was not yet satisfactory!
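For reference, the knobs described above correspond to standard MySQL/InnoDB settings roughly like the following (a sketch; the concrete values are illustrative, not the ones from our installation):

    -- Server configuration (my.cnf), illustrative values:
    --   innodb_buffer_pool_size        = 4G    -- large in-memory buffers for incoming data
    --   innodb_log_buffer_size         = 256M
    --   innodb_flush_log_at_trx_commit = 0     -- flush the redo log roughly once per second

    -- Per-session switches issued by every loader connection:
    SET SESSION foreign_key_checks = 0;   -- no inter-table FK validation on write
    SET SESSION unique_checks      = 0;   -- no uniqueness checks on secondary indexes
    SET SESSION autocommit         = 0;   -- group many INSERTs under one explicit COMMIT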

We therefore decided to remove the last extant element of RDBMS control over what happens in the DB tables, namely the table key autoincrement mechanism. This mechanism caused problems because, in order to properly populate the rna_subseq and gs_subseq tables, the application needed to know the primary key assigned to each row in the ss_alignment table (so as to insert this value into the freshly spawned dependent records). Doing away with autoincrements gave us the ability to perform bulk inserts (or, more precisely, to emulate them with large multi-row INSERT statements): provided we had our own mechanism for provisioning unique IDs for table entries, we could insert them in large SQL "heaps". This resulted in a significant performance boost, almost reaching, and at times surpassing, the 10 kHz target. The Emperor was pleased, at least for the moment.
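A sketch of what the resulting bulk loads looked like in spirit (again with the assumed column names from above): each worker reserves a disjoint block of IDs up front, so parent and dependent rows can be written in large multi-row statements without ever asking the server for an AUTO_INCREMENT value.

    START TRANSACTION;

    -- Parent rows with client-assigned primary keys:
    INSERT INTO ss_alignment (id, task_id, score) VALUES
      (1000001, 7, 42.0),
      (1000002, 7, 17.5),
      (1000003, 7, 99.1);

    -- Dependent rows can go into the same batch, since the parent keys are already known:
    INSERT INTO rna_subseq (id, alignment_id, start_pos, end_pos) VALUES
      (2000001, 1000001, 10, 55),
      (2000002, 1000002, 3, 120),
      (2000003, 1000003, 77, 140);

    INSERT INTO gs_subseq (id, alignment_id, start_pos, end_pos) VALUES
      (3000001, 1000001, 200, 260),
      (3000002, 1000002, 15, 90),
      (3000003, 1000003, 400, 431);

    COMMIT;

One simple way of handing out the ID blocks is to give the k-th worker every identifier in the range [k*B, (k+1)*B) for a sufficiently large block size B.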

No Prestige this time (aka Future Work)

What went wrong? Two factors ended up preventing us (as well as our fellow PhD student) from exploiting the achieved solution. The first, which could potentially be ameliorated, involved failures on the parsing side. Running so many processes that parse and send data concurrently on a large public installation proved anything but fault-tolerant. Queue system problems, compute node failures and sudden losses of connectivity between the compute nodes and the target DB server made the data ultimately stored in the MySQL instance unreliable (recall that we had dropped all the ACID guarantees normally afforded by relational storage in favor of improved performance). Revalidating this data was not an option: any kind of external validation would be much slower than the built-in InnoDB mechanisms, which we had dropped for the very sake of better insert throughput.

However, a much more important problem, and a rather fundamental one at that, was MySQL's general inability to manage the gathered data. Even simple queries and index building procedures ceased to work efficiently given the size of the dataset. The "single MySQL" experiment was, therefore, over for the moment - we knew we would have to shard the database or switch to different storage strategies (e.g., a NoSQL document store) to try our luck there.

However, the entire adventure taught us that the limits of MySQL with respect to insert query performance are much less constraining than originally anticipated (provided, of course, that one is able to tolerate failures, does not rely on 100% data validity and/or is able to validate data before it gets stored in the DB).

[1] FASTA Sequence Comparison, http://fasta.bioch.virginia.edu/fasta_www2/fasta_intro.shtml, University of Virginia
