Friday, November 25, 2011

New Version Update: 1.8.8 Release for HDF5

Here is an update for everyone working with HDF5: a newer version is available for download and testing. The version-to-version changes can be checked here:
http://www.hdfgroup.org/HDF5/doc/ADGuide/Changes.html

Wednesday, March 2, 2011

Check If a 'Group' Exists

Sometimes you need to traverse the hierarchy of an HDF5 file to check for the existence of a group or a link. The target can be a 'Group' or even a dataset (for example, a packet table).

All you need to do is use H5Lexists; more information can be found here.

Thursday, February 24, 2011

Compound Data in Packet Table

As promised, here I am back with some new stuff: the Packet Table in HDF5. Packet tables are one of the most flexible data structures in HDF5, letting you store either fixed-length or variable-length data. If you need to write data coming from multiple sources into a single object, packet tables are well worth considering. There is one more related structure, the 'Table', which, unlike the packet table, accepts only fixed-length data defined in a struct. Performance-wise, packet tables are also recommended over 'tables'. I initially found it hard to use compound data with a packet table, but after some research and digging into the HDF5 libraries I managed to do it. So here we go.

Consider a structure defined as:
/*
 * Struct myTable.
 */
typedef struct
{
    /// Field 1
    int field1;
    /// Field 2
    float field2;
    /// Field 3
    int field3;
} myTable;

Now, we need to create a data type:

/* Create the compound data type used to write data into the table */
hid_t myDataType = H5Tcreate(H5T_COMPOUND, sizeof(myTable));

herr_t status;
status = H5Tinsert(myDataType, "Field 1",
            HOFFSET(myTable, field1), H5T_NATIVE_INT);

status = H5Tinsert(myDataType, "Field 2",
            HOFFSET(myTable, field2), H5T_NATIVE_FLOAT);

status = H5Tinsert(myDataType, "Field 3",
            HOFFSET(myTable, field3), H5T_NATIVE_INT);

Our compound datatype is now ready, and we can create a packet table associated with this type.

/* Create packet table */
hid_t ptMyTable = H5PTcreate_fl(file, "My Table", myDataType, chunk_size, compression);

Since variable-length data has memory-leak issues (check here), I created the packet table using the fixed-length API.

If you view the file with an HDF viewer, it will look like this:





Tuesday, February 22, 2011

Problems with HDF5

In my earlier post, we wrote variable-length data to a dataset by extending the dataset's dimensions inside a loop.

Let me tell you, it is never easy to work with any open-source utility or technology, because you never know what can go wrong. So I always reckon with Murphy's law.

At this point I was relaxed and happy that my job was done. But when it came back at me like a boomerang, with high CPU usage and memory-leak issues, I was nearly in tears.

After discussing it with the HDF folks, I learned that HDF5 version 1.8.5, the one I am using, has some serious memory-leak issues, and CPU usage keeps increasing the longer it runs. Members of the HDF5 group are working on this issue, and a fix is expected in HDF5 version 1.8.7 (May 2011).

Another problem was with the H5Dwrite API, which we used to write into the dataset. This API has the same problem when it is called in a thread or a loop.

I was struggling with HDF5 and needed an approach that would work for me on the first go. That is when one more data structure sparked in my mind: the 'Packet table'.




Let's do Some Design & Code

There are many data structures one can use with HDF5, but the base of them all is the 'DataSet'. You can store your raw data in a DataSet, in a table structure, or in a packet table; we will look at them one by one. Before starting, I'll show you the structure we are going to create in our file.




Here the circle represents a group and the rectangle stands for a dataset, the container into which we are going to put our data. I am sharing my experiences, my mistakes, and how I dealt with them to implement better solutions to my problems.

The root node is the file's default group, under which all the branches we create will hang throughout the development of the HDF5 file. If you are familiar with B-trees, your work with HDF5 will be that much easier. The file can be created using the following API:

hid_t fileID = H5Fcreate("../../FileName.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
                         H5P_DEFAULT);

More APIs related to file properties can be found here.

For the sake of a naming convention, and to make things easier to follow, I have named the newly created group 'MyGroup'. Now we'll see how to create it with the HDF5 API. The APIs to create, open, and delete a 'Group' are collected under H5G.

/* Create group 'MyGroup' under the 'root' node */
hid_t groupID = H5Gcreate(fileID, "MyGroup", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

H5D is the next collection, which holds all the APIs required to work with datasets (our data containers). What we need now is to create a dataset under the group we have already created, 'MyGroup'. Let's name our dataset 'DataSet1'.

/* Create the dataset */
hid_t dset = H5Dcreate (file, "DataSet1", filetype, filespace, H5P_DEFAULT, cparms, H5P_DEFAULT);

If you look at the parameters here, you may get confused. For an initial understanding, just assume I have defined a few properties for our dataset, such as the data type and whether chunking is enabled.

Remember, I started out by creating a new dataset for every new set of data. After testing this in a thread that wrote data into the file, I found that the number of datasets that can be created under one group is limited: I could not create more than 65556 datasets in a single group. This was a big problem for me. I had already spent so much time on this approach that I could not easily back-track.

To work around this problem I decided on another approach: instead of creating a whole new dataset for every set of new data, I would use only one. For this I needed to understand more of the internals of datasets. Finally, I learned that I could reuse the same dataset, provided I extend its dimensions each time I write new data to it. This was another reason I chose C to program HDF5, as these APIs are only available in C/C++ and not in the .NET or Java bindings.

Before putting it all together, I should let you know that I am going to write variable-length data to the dataset.

/* Declarations: dims sized for one record, with an unlimited
 * maximum so the dataset can be extended later. */
hsize_t dims[1]    = {1};
hsize_t maxdims[1] = {H5S_UNLIMITED};

/*
 * Create a new file using the default properties.
 */
file = H5Fcreate (FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

/*
 * Modify dataset creation properties, i.e. enable chunking.
 * Chunking is required for an extendible dataset.
 */
cparms = H5Pcreate (H5P_DATASET_CREATE);
status = H5Pset_chunk (cparms, 1, dims);

/*
 * Create dataspaces. The maximum size must be H5S_UNLIMITED,
 * not NULL; a NULL maximum fixes the size at the current size
 * and the dataset could never be extended.
 */
filespace = H5Screate_simple (1, dims, maxdims);

memspace = H5Screate_simple (1, dims, NULL);

/*
 * Create file and memory datatypes (variable-length bytes).
 */
filetype = H5Tvlen_create (H5T_NATIVE_UCHAR);

memtype = H5Tvlen_create (H5T_NATIVE_UCHAR);

/*
 * Create the dataset.
 */
dset = H5Dcreate (file, "DataSet1", filetype, filespace, H5P_DEFAULT, cparms, H5P_DEFAULT);

Up to here, we have created the file and a dataset ready for writing.

The next step was to write variable-length data in a thread/loop. This was the interesting part for me, as I wanted to see how the dimensions of the dataset change at runtime. After reading further, I understood there was one more concept I needed to consider: the hyperslab.

So, my 'Write' Function was pretty simple:

extdSize[0] = recordNumber + 1;
/* 
 * Extend Dataset 
 */
status = H5Dextend(dset,extdSize);

/* 
 * Define memory space 
 */
memspace = H5Screate_simple (1, dims, NULL);

and finally, the long-awaited write (the dataset is rank 1, so the offset has a single element):
offset[0] = recordNumber;

/* 
 * Select a hyperslab  
 */
filespace = H5Dget_space (dset);

status = H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL,
    dims, NULL);

/* 
 * Write the variable-length data to the dataset.
 */
status = H5Dwrite (dset, memtype, memspace, filespace, H5P_DEFAULT, data);






Monday, February 21, 2011

Hierarchical Data Format

What is HDF5?

I too wondered when I first heard that there is a file format named HDF5. HDF5 stands for Hierarchical Data Format, a data model flexible enough to store complex data types of any kind in a B-tree-indexed hierarchical layout, with support for many of the properties one expects from a database. It works on nearly all operating systems, with open-source binaries available for development platforms such as Windows, Linux, Mac, etc.

How Did I Fall for It?

Hmm..., I think that was a great day, one I should not forget, when a project requirement came to me: I needed to implement, or use, a file format with which one could read and write data simultaneously. Everybody's favourite, Google, came to my rescue and linked me to this versatile file format. Being a slow learner, I spent almost two months understanding the basics of this file format ;). Then I managed to develop my very first program, with which I could write some array data into a file. It all started with building the HDF5 libraries to use in my application. The latest release available for HDF5 is version 1.8.6. I started with 1.8.2, and that with C#. After understanding more of my project's needs, I realized that the APIs available in HDF5's .NET library were not enough to accomplish my job, so I moved to C, the best choice for anybody who wants to try their hand at programming with some basic logic and understanding.