Friday, January 1, 2010

Working with Hierarchical data in SQL Server 2008

In my series on SQL Server 2008, one of the features I had missed was the HierarchyId datatype and its usage. I came across this recently and decided to take a peek into what it does. This post covers my thoughts on the HierarchyId datatype.

At the time of this writing, I am still not convinced on the value add the HierarchyId datatype provides over traditional parent/child storage mechanisms. More on that later. For now, we will deal with how the datatype works and how it can be used to store hierarchial data.

The big advantage with HierarchyId is that it is optimised to store data in a tree structure thereby making it simpler for developers to query and update hierarchical information.  The important thing to keep in mind is that using the hierarchyid by itself does not mean that data is stored in a tree structure. It is the responsibility of the application to make sure that the data is stored in the desired structure. HierarchyId just acts as an enabler.

To demonstrate this, I will work through an example where we have to store the list of folders and files in the database. I have created a table with the following structure.

Create Table FileExplorer
(
    ItemId HierarchyId Primary Key,
    ItemLevel as ItemId.GetLevel(),
    ItemName varchar(100)   
)

I will touch upon the ItemLevel column later. Do note that the HierarchyId datatype by itself is not unique. It is application’s responsibility again to maintain the uniqueness of the ItemId column and hence I have declared a PrimaryKey on that column.

Coming to ItemLevel column, the primary purpose of that is to enable breadth first ordering. Typically, hierarchy ids can be breadth first or depth first indexed. You can find the details and working of both these types of ordering here.

Now, let us create the breadth first index. Depending on how many levels (called fan-outs) you plan to store, your indexing strategy will vary. It is not necessary to create both the indexes for all the scenarios.

CREATE INDEX FileExplorer_Breadth_First
ON FileExplorer(ItemLevel,ItemId)
GO

Since the FileExplorer table already has a Primary key, the above index will be created as a non-clustered index.

Now that we have created the table, let us start loading data into the table. If you haven’t already noted, let me point out that the ItemId is not declared as an Identity (it cannot be, actually) and hence it is the application’s responsibility again to make sure that the generated ids are unique. In addition, to ensure the tree structure, we have to make sure the id generated is based on the parent id.

The stored procedure to insert a new file or folder is

CREATE PROCEDURE InsertFile(@folderId hierarchyid, @ItemName nvarchar(100) )
AS
BEGIN
    DECLARE @last_child hierarchyid

SELECT @last_child = MAX(ItemId) FROM FileExplorer
    WHERE ItemId.GetAncestor(1) = @folderId
INSERT FileExplorer (ItemId, ItemName)
SELECT @folderId.GetDescendant(@last_child, NULL), @ItemName
END ;

 

Let us examine this procedure for a second.  The ItemId.GetAncestor(1) gets the immediate parent of the ItemId. So essentially, the select statement gets  the last valid file under the given folder id. Now that we have the last inserted item under the parent, our new file has to be created as a peer to that item.

This is done by the second select statement in the procedure. We use the GetDescendant method to create a child similar to another child, in this case, the item retrieved from the previous select. It wouldn’t be out of place to mention here that this is where our breadth first ordering helps.

The tree can also be maintained by creating a parentId column that references the itemId but there is also a performance drag associated with that approach. A snapshot of the data after inserting some rows is as below

Results

Note how the item level computes itself. The folders testFolderChild and TestfolderChildPeer are at the same level and have the same parent –TestFolder.

Let us now see how we can retrieve the list of parents for a given node. To do this, I had to write a stored procedure using CTEs. Below is the stored procedure I wrote

create proc dbo.GetAncestors(@ItemId hierarchyId)
as
begin

With AncestorsList(Itemid, ItemName, ItemLevel, Parent)
As
(
    Select ItemId, ItemName, ItemLevel, ItemId.GetAncestor(1) as Parent from Fileexplorer where ItemId=@ItemId
    UNION ALL
    Select fe.ItemId, fe.ItemName, fe.ItemLevel, fe.Itemid.GetAncestor(1) from FileExplorer fe Inner Join AncestorsList al
    ON fe.ItemId = al.Parent
)   
    Select fe.ItemId, fe.ItemName, fe.ItemLevel from FileExplorer fe INNER JOIN
    AncestorsList al on fe.ItemId = al.Itemid
end

 

If you are familiar with CTEs then most of the code above is self explanatory. My anchor members gets the first ancestor of the given node. This is then executed recursively by my recursive query. The last select executes the CTE to return the resultset.

Having seen most of the uses of HierarchyId, its biggest drawback lies in the fact that it still needs the developers to do a lot. For fan-outs of less that 2 levels, I still feel the self referencing table is faster than using the hierarchyId. I strongly believe that Microsoft still has some work left to do with the HierarchyId datatype to make sure it is a bit more easier to implement.

That though concludes this post. If you are looking to work with HierarchyIds, I hope this post has given you an idea of where to start. Hope to be back soon with something else. Happy Coding!!!!!!