Wednesday, January 8, 2014

How to Split a Comma Separated Values Column into Multiple Rows

In many cases a SQL Developer may face a situation in which different values are "wrapped" into the same column. Even though this violates the First Normal Form, it is not uncommon.


Depending on your design considerations, you may want to "unwrap" the column, generating a new table with multiple rows with the same ID -each for every corresponding "splitted" value-, or simply use an additional reference table. The first solution fits well with the fact table of a dimensional data warehouse, while the second is more suitable for a normalized environment, such as a traditional information system.


Here is my definitely not elegant but working solution for a DWH scenario. First of all, you need a splitting table-valued function:

create function [dbo].[ft_Split]
(
    @pr_RowData nvarchar(2000),
    @pr_SplitOn nvarchar(5)

returns @vr_Rtn table
(
    Id int identity(1,1),
    Data nvarchar(max)
)
as
begin
    Declare @
vr_Cnt int
    Set @vr_Cnt = 1

    while (Charindex(@pr_SplitOn, @pr_RowData); 0)
    begin
        Insert Into @vr_Rtn (data)
        Select
            Data = ltrim(rtrim(substring(@pr_RowData, 1, charindex(@pr_SplitOn, @pr_RowData)-1)))

        Set @pr_RowData = substring(@pr_RowData, charindex(@pr_SplitOn, @pr_RowData)+1, len(@pr_RowData))
        Set @
vr_Cnt = @vr_Cnt + 1
    end
   
    insert Into @vr_Rtn (data)
    Select Data = nullif(ltrim(rtrim(@pr_RowData)), N'')

    Return
end

Now you can write a query with the CROSS APPLY clause, which works very much like a traditional join:

select
    s.ID
    ,
virtualSplit.data  as SplittedValue
    ...
from
    dbo.tb_CrazyWrappedTable as s
    cross apply dbo.ft_Split(s.CSVWrappedColumn, N',') as virtualSplit


The CROSS APPLY operator was introduced in SQL Server 2005 and in our scenario returns the original table (CrazyWrappedTable) rows matching with the virtual table (virtualSplit) generated by ft_Split.

As you might have guessed, the original table expression is processed first; the right table expression is then evaluated against each row of the left table expression. The final result-set contains all the selected columns from the left table expression combined with all the corresponding columns of the right table expression - and this means the original source table ID will be duplicated! Queries against your resulting fact table should therefore use the COUNT DISTINCT aggregate function - slow and disgraceful, but logically consistent.

To reduce performance issues or simply boost your read-only queries, you should build an index-view upon the original table, based on the two previous code fragments.

No comments:

Post a Comment