Beyond Sql: September 2008

Sorting a delimited string by its numerical string parts.

This is a possible solution to the problem presented in the article:   

'T-SQL Sorting Challenge'
By: Itzik Ben-Gan 
http://www.sqlmag.com/Article/ArticleID/100156/100156.html

"You are given a table called t1 with a character string column called val. 
Each string in the val column holds a dot separated list of integers. Your
task is to write a T-SQL solution that sorts the strings based on the integer
values constituting the string segments. Note that the number of integers in
each string may vary, and is only limited by the column type VARCHAR(500).
Extra points if your solution will also support negative integers."

So the problem is how to construct a sort expression that represents the
positive and negative integers of the ids.

This solution uses Dataphor with the data being stored in Sql Server 2005.

Sample data

The data is populated thru dataphor and persisted in the sql server northwind
db. The table t1 in the article is table IZ here. The article uses an sql
identity for the primary key id, here it is explicitly declared.

create table IZ
{
 id:Integer,
 val:String tags {Storage.Length='75'},
 key{id}
};
//Positive value strings.
insert row{1 id,'100' val} into IZ;
insert row{2 id,'7.4.250' val} into IZ;
insert row{3 id,'22.40.5.60.4.100.300.478.19710212' val} into IZ;
insert row{4 id,'22.40.5.60.4.99.300.478.19710212' val} into IZ;
insert row{5 id,'22.40.5.60.4.99.300.478.9999999' val} into IZ;
insert row{6 id,'10.30.40.50.20.30.40' val} into IZ;
insert row{7 id,'7.4.250' val} into IZ;
//Add negative values.
insert row{8 id,'-1' val} into IZ;
insert row{9 id,'-2' val} into IZ;
insert row{10 id,'-11' val} into IZ;
insert row{11 id,'-22' val} into IZ;
insert row{12 id,'-123' val} into IZ;
insert row{13 id,'-321' val} into IZ;
insert row{14 id,'22.40.5.60.4.-100.300.478.19710212' val} into IZ;
insert row{15 id,'22.40.5.60.4.-99.300.478.19710212' val} into IZ;

Go directly to dataphor solution
Go directly to a solution using dataphor and sql server t-sql
Go directly to a solution using the Rac utility on sql server 2000
Go directly to a solution using the Rac utility on sql server 2005
(
 Rac is a system of stored procedures and functions for sql server designed
 to simplify solving various data manipulation problems including dynamic
 crosstabs, complex running sums and ranking, string manipulations etc.
)

Stepping thru the logic of the solution

Use the dataphor Split operator to split the val string for each id into
individual strings starting at Index 1 and going to the number of parts
delimited by the period ('.'). Note that the string is converted to an
integer. So we're dealing with numbers and not strings.

select
 (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
    times                //times is equivalent to an sql CROSS JOIN.
    (Numbers where N<10) //A table with a single column N, an integer from 0 to 800.
      where N<StrListCnt
       {id,N+1 Index,ToInteger(StrList[N]) StrNum}
         order by {id,Index} ;      
         
id Index StrNum   
-- ----- -------- 
1  1     100  <- id 1 has only a single value.
2  1     7    <- id 2 has 3 values.
2  2     4        
2  3     250      
3  1     22   <- id 3 has 9 values.
3  2     40       
3  3     5        
3  4     60       
3  5     4        
3  6     100      
3  7     300      
3  8     478      
3  9     19710212     
4  1     22  
.  .     .

Now lets look at the same data but within each Index and within each Index
ordered by the string as a number (StrNum). Remember all ordering is ascending.

select
 (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
    times
    (Numbers where N<10)
      where N<StrListCnt
       {id,N+1 Index,ToInteger(StrList[N]) StrNum}
         order by {Index,StrNum};
         
id Index StrNum   
-- ----- -------- 
13 1     -321  <- id 13 has lowest overall number for 1st string part (Index 1).
12 1     -123     
11 1     -22      
10 1     -11      
9  1     -2       
8  1     -1       
2  1     7        
7  1     7        
6  1     10       
3  1     22    <- ids 3,4,5,14,15 have the same value (22) for Index 1.   
4  1     22      
5  1     22       
14 1     22       
15 1     22       
1  1     100   <- id 1 has highest overall number for 1st string part (Index 1).  
.  .     .
3  8     478   <- Index 8 has the same string value for the 5 ids that have an 8th string part.
4  8     478      
5  8     478      
14 8     478      
15 8     478      
5  9     9999999  
3  9     19710212 
4  9     19710212 
14 9     19710212 
15 9     19710212 

The strings as integers are nicely sorted within each Index over the ids. 
How can we represent the same ordering within each Index independent of
the positive and negative numbers (and strings that indicate positive and
negative numbers)? What about with a ranking. So lets rank the numbers
within each Index. The combination of the dataphor ToTable, ToList and
cursor operators will generate a rank (a column named sequence) that 
follows the specified cursor ordering. We order the cursor by Index,StrNum
to get an ascending rank within each Index based on the StrNum values. 
The rank is column RowNum.

select
ToTable(
        ToList(
               cursor(
                       (
                        (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
                          times
                           (Numbers where N<10)
                             where N<StrListCnt
                              {id,N+1 Index,ToInteger(StrList[N]) StrNum}
                        )   
                                order by {Index,StrNum}))) 
                                 {Index,id,StrNum,sequence+1 RowNum}
                                   order by {Index,StrNum};

Index id StrNum   RowNum 
----- -- -------- ------ 
1     13 -321     1  <- id 13 has lowest value (-321) so id 1 gets lowest rank (1) within Index 1.    
1     12 -123     2      
1     11 -22      3      
1     10 -11      4      
1     9  -2       5      
1     8  -1       6      
1     2  7        7  <- ids 2 and 7 get a different rank for the duplicate values of 7 :(    
1     7  7        8      
1     6  10       9      
1     3  22       10 <- all 5 ids get a different rank for the duplicate values of 22 :(  
1     4  22       11     
1     5  22       12     
1     14 22       13     
1     15 22       14     
1     1  100      15 <- id 1 has highest value (100) so id 1 gets highest rank (15) within Index 1.
.     .  .        .
8     3  478      56     
8     4  478      57     
8     5  478      58     
8     14 478      59     
8     15 478      60     
9     5  9999999  61     
9     3  19710212 62 <- all 4 ids get a different rank for the duplicate values of 19710212 :(    
9     4  19710212 63     
9     14 19710212 64     
9     15 19710212 65     

(
 Note that this rank is equilvant to the sql ranking function ROW_NUMER().
 The rank could be obtained in sql using:
 ROW_NUMBER()OVER(ORDER BY [Index],StrNum) AS RowNum
 But the rank we want is really based on the sql RANK() function which
 accounts for duplicate/ties (of StrNum) by giving them the same rank. 
 Therefore it's necessary in dataphor to use a join to append the correct
 ranks to the table. In sql the join isn't necessary, RANK() can be used
 directly on the table:ie.
 RANK()OVER(ORDER BY [Index],StrNum) AS Rank
 (See t-sql solution)
 For more on dataphor and sql ranking see:
 'The Sql ranking OVERture' 
 http://beyondsql.blogspot.com/2008/04/sql-ranking-overture.html
)

If you haven't guessed it by now ☺ the idea is to create a string for each
id based on the ranks which will be used to order the ids. But we have a 
problem because for duplicate values of a number we're getting different
ranks. We want the 'same' rank for duplicate values since the same integer
cannot be used to distinguish among the ids. 

We can remedy the different ranks for duplicate values by simply choosing
the minimum rank (RowNum) for the value and assigning this rank to all ids.
Also note that the ranks continue to ascend over the Indexs. This is ok
because any numbers representing the ranks are ok if they correctly maintain
the ordering of integer strings values within the Index.

select
ToTable(
        ToList(
               cursor(
                       (
                        (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
                          times
                           (Numbers where N<10)
                             where N<StrListCnt
                              {id,N+1 Index,ToInteger(StrList[N]) StrNum}
                        )   
                                order by {Index,StrNum}))) 
                                 {Index,id,StrNum,sequence+1 RowNum}
                                   group by {Index,StrNum} add{Min(RowNum) Rank} 
                                     order by {Index,StrNum};

Index StrNum   Rank 
----- -------- ---- 
1     -321     1    
1     -123     2    
1     -22      3    
1     -11      4    
1     -2       5    
1     -1       6    
1     7        7  <- a rank of 7 can be assigned to the two ids (2,7) with a value of 7 for Index 1.  
1     10       9    
1     22       10 <- a rank of 10 can be assigned to all ids with a value of 22 for Index 1. 
1     100      15   
.     .        .
8     478      56   
9     9999999  61   
9     19710212 62 <- a rank of 62 can be assigned to all ids with a value of 19710212 for Index 9.

What we now have is a table of unique Index/StrNum combinations with a
unique rank for each combination. It's only necessary to join this table
to the table of split strings (IZ) for all ids by Index and StrNum to properly
assign the correct(ed) ranks. (As mentioned above this is the same rank that
would be obtained using the sql RANK() function and ordering by Index,StrNum.
And note that using the sql RANK() would eliminate the need to do a join in 
dataphor. Imagine dataphor with native sql like ranking operations ☺ ) 

Because the objective is to create a string to sort the ids we can't just
use the numeric rank, we have to modify it for string ordering. Given two
ranks of 7 and 11 if they are strings, '7' and '11', an ascending sort
would have '11' come '7':

'11'
'7'

This is the very problem the article is addressing! So we have to modify the
the strings to have '7' come '11' ☺. We can modify the '7' by left padding it
with '0'. So when we sort ascending we'll have the correct representation of
of the true numeric order of the values:

'07'
'11'

How much to pad a rank, how many '0's to insert, is the string length of 
the maximum rank generated. Because the ranks in dataphor keep ascending
regardless of the cursor ordering, the maximum rank (ignoring duplicates)
is the count of rows in the table. You could even make an educated guess
based on the amount of data and use that ☺. 

Left padding the string rank (RankStr) based on a maximum length of 2 we now
have all the data to finally construct a sorting column for the ids.

var UnpackedStrings:= //This variable holds all the split data and will be used in the select query. 
ToTable(ToList(cursor(
                       (
                        (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
                          times
                            (Numbers where N<10)
                               where N<StrListCnt
                             {id,N+1 Index,ToInteger(StrList[N]) StrNum}
                        )   
                           order by {Index,StrNum}))) 
                          {Index,id,StrNum,sequence+1 RowNum};
var LengthofMaxRank:=2;
select
      (UnpackedStrings {id,Index,StrNum})
         join //Join the unique ranks to all split data. This is a natural join (on Index/StrNum).
              //Create a left padded string (RankStr) from the numeric rank.
          (UnpackedStrings group by {Index,StrNum} add{Min(RowNum) Rank})
            {id,Index,StrNum,Rank,PadLeft(ToString(Rank),LengthofMaxRank,'0') RankStr}
              order by {Index,Rank};
            
id Index StrNum   Rank RankStr 
-- ----- -------- ---- ------- 
13 1     -321     1    01     <- ranks 1-9 are left padded in the rank string (RankStr).    
12 1     -123     2    02      
11 1     -22      3    03      
10 1     -11      4    04      
9  1     -2       5    05      
8  1     -1       6    06      
2  1     7        7    07      
7  1     7        7    07      
6  1     10       9    09      
3  1     22       10   10      
4  1     22       10   10      
5  1     22       10   10      
14 1     22       10   10      
15 1     22       10   10      
1  1     100      15   15      
.  .     .        .    .
3  8     478      56   56      
4  8     478      56   56      
5  8     478      56   56      
14 8     478      56   56      
15 8     478      56   56      
5  9     9999999  61   61      
3  9     19710212 62   62      
4  9     19710212 62   62      
14 9     19710212 62   62      
15 9     19710212 62   62    

The id sort column can now be formed by concatenating, using the Concat 
operator, the RankStr within each id in the order of the rank (either Rank,
Index or RankStr). This is easy to see by ordering the above data (table) 
by id,Rank. The ascending order of Index, Rank and RankStr all reflect 
where an id lies in value (RankStr) relative to the other ids. The sort
expression will be column SortStr.

id Index StrNum   Rank RankStr 
-- ----- -------- ---- ------- 
1  1     100      15   15      
2  1     7        7    07      
2  2     4        16   16      
2  3     250      30   30      
3  1     22       10   10      
3  2     40       19   19      
3  3     5        24   24      
3  4     60       33   33      
3  5     4        38   38      
3  6     100      49   49      
3  7     300      51   51      
3  8     478      56   56      
3  9     19710212 62   62    

The complete dataphor solution

var UnpackedStrings:= //Variable that holds all split data and ranks the split strings
                      //numerically within each Index ordered by the numeric string value.
ToTable(ToList(cursor(
                       (
                        (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
                          times
                            (Numbers where N<10)
                               where N<StrListCnt
                             {id,N+1 Index,ToInteger(StrList[N]) StrNum}
                        )   
                           order by {Index,StrNum}))) 
                          {Index,id,StrNum,sequence+1 RowNum};
var LengthofMaxRank:=
  Length(Count(UnpackedStrings).ToString()); //A string length used to left pad the rank strings for
                                             //a correct string sort (will be 2 here). 
select
    IZ
     join   //Natural join of input table IZ to sort column expression by id.
         (
            //Join unique ranks to all split strings.
          (
           (UnpackedStrings {id,Index,StrNum})
             join
                 // Adjust ranks to be unique for each Index/StrNum, duplicate values should get same rank.
              (UnpackedStrings group by {Index,StrNum} add{Min(RowNum) Rank})
                 //Left pad rank string with '0' (RankStr) for sorting string correctly. Add empty
                 //string ('') to be used as a delimiter in concatenating the ranks.
               {id,Index,StrNum,Rank,PadLeft(ToString(Rank),LengthofMaxRank,'0') RankStr,'' Del}
          ) adorn {key{id,Rank}} //Sql uses physical hints, in dataphor you use logical ones.
                //Form the sorting expression SortStr to sort ids by concatenating the string ranks
                //in the order of any of the ranking columns or Index.
              group by {id} add{Concat(RankStr,Del order by {id,Rank}) SortStr}
         ) 
          order by {SortStr}; //The object of the exercise, sort the ids by SortStr to get the correct
                              //numerical order of val.

id val                                SortStr            
-- ---------------------------------- ------------------ 
13 -321                               01                 
12 -123                               02                 
11 -22                                03                 
10 -11                                04                 
9  -2                                 05                 
8  -1                                 06                 
2  7.4.250                            071630             
7  7.4.250                            071630             
6  10.30.40.50.20.30.40               09182932434650     
14 22.40.5.60.4.-100.300.478.19710212 101924333844515662 
15 22.40.5.60.4.-99.300.478.19710212  101924333845515662 
5  22.40.5.60.4.99.300.478.9999999    101924333847515661 
4  22.40.5.60.4.99.300.478.19710212   101924333847515662 
3  22.40.5.60.4.100.300.478.19710212  101924333849515662 
1  100                                15              

Solution using the sql server RANK() function with a pass thru query

The solution can be made more compact using the sql RANK() function since
dataphor doesn't have a direct equivalent ranking operation.

Since sql server can only access persisted tables (and views) and not
dataphor expressions (we can't just stick in any dataphor table expression
in a pass thru query) we'll create a persisted table to hold all the 
split data.

create table IZSqlRanks  //The table will be created in sql server.
{
 id:Integer,
 Index:Integer,
 StrNum:Integer,
 key{id,Index}
}; 

Do the splitting in dataphor (because it knows the difference between a
string and a list ☺ ) and then 'assign' the resulting table to the
persisted table IZSqlRanks. Relational databases support this kind of
assignment for all variables including tables. And all tables in dataphor
are variables. (To my sql friends this makes all the difference in the world ☺ )

IZSqlRanks:=
 (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
    times
    (Numbers where N<10)
      where N<StrListCnt
       {id,N+1 Index,ToInteger(StrList[N]) StrNum};
//Use a t-sql passthru query to take advantage of the sql RANK() function. The
//resulting table will be treated like any other table (expression) in dataphor.
//Left pad the string rank (RankStr) for sorting purposes. We're using a total
//string length of 2 here so single digit ranks will be padded with a leading '0'. 
select 
 IZ
  join
     (
       SQLQuery('SELECT id,[Index],StrNum,RANK()OVER(ORDER BY [Index],StrNum) AS Rank
                 FROM IZSqlRanks') 
     //Use the sql result as if it was a native dataphor expression.
        {id,Index,StrNum,Rank,PadLeft(ToString(Rank),2,'0') RankStr,'' Del} 
          adorn {key{id,Rank}} //Metadata (a key) pertaining to the table expression.
                               //This key will be efficiently used by the Concat operation.
     //Concatenate the rank strings for each id to be used as the sort order for ids.
          group by {id} add{Concat(RankStr,Del order by {id,Rank}) SortStr}
      )       
        order by {SortStr};     
        
id val                                SortStr            
-- ---------------------------------- ------------------ 
13 -321                               01                 
12 -123                               02                 
11 -22                                03                 
10 -11                                04                 
9  -2                                 05                 
8  -1                                 06                 
2  7.4.250                            071630             
7  7.4.250                            071630             
6  10.30.40.50.20.30.40               09182932434650     
14 22.40.5.60.4.-100.300.478.19710212 101924333844515662 
15 22.40.5.60.4.-99.300.478.19710212  101924333845515662 
5  22.40.5.60.4.99.300.478.9999999    101924333847515661 
4  22.40.5.60.4.99.300.478.19710212   101924333847515662 
3  22.40.5.60.4.100.300.478.19710212  101924333849515662 
1  100                                15                        

Solving the problem on sql server 2000 with the Rac utility

The Rac solution follows the same logic as the dataphor and sql methods.
The 1st Rac execute creates the Index column, from 1 to N, for every string
part (delimited integer) an id has. The 2nd Rac execute creates a rank over
the ids based on the integers (string parts) within each Index. The rank
Rac generates is equivalent to the sql DENSE_RANK() function. This rank,
like RANK(), gives duplicate values the same rank but, unlike RANK(), does
not take the duplicate ranks into account when generating the next rank.
RANK() skips ahead based on ties while DENSE_RANK() consecutively numbers
the ranks. Both types of ranks give the same correct sort result for the ids.
The 3rd Rac execute concatenates the left padded rank strings for each id
and returns the ids sorted by them, correctly :) Note that Rac is called
recursively twice.

Exec Rac 
@split='[position]',    -- Rac splits each val string from left to right based on a period ('.').
@rows='id & [position]',-- Rac keeps the position each new string part starts in.
@pvtcol='val',          -- the target column of the split operation.
@from='IZ',    
-- use a counter to generate the Index column (from 1-N) that indicates the individual string part in a val.  
@separator='.',@rowcounters='id{[Index]}',@counterdatatype='int',
@defaults1='y',@rowbreak='n',@racheck='y',@shell='n',
@select='
        SELECT 1*id AS id,[Index],CAST([1] AS INT) AS StrNum
               INTO #J1      
                 FROM rac
 Exec Rac
 @transform=~_dummy_~,
 @rows=~[Index] & StrNum & id~,
 @pvtcol=~Report Mode~,
 @from=~#J1~,
 @defaults1=~y~,@rowbreak=~n~,@racheck=~y~,@shell=~n~,
  /* use a Rac counter to rank the integer string parts within each Index (column Rank).*/
 @rowindicators=~StrNum{Rank}_overtable_~,@counterdatatype=~int~,
        @select=
          ~if object_id(~~##J2~~) is not null 
           drop table dbo.t1  
           SELECT 1*id AS id,1*StrNum AS StrNum,1*[Index] AS [Index],1*Rank AS Rank,
                  /* left pad single digit string ranks for proper character sorting.*/
                  REPLICATE(~~0~~,2-DATALENGTH(CAST(Rank AS VARCHAR(2)))) + CAST(Rank AS VARCHAR(2)) AS RankStr
                  INTO ##J2
                   FROM rac~
             /* concatenate the string ranks (RankStr) within each id into column SortStr.*/     
      Exec Rac
      @transform=~Max(RankStr) as RankStr~,
      @rows=~id~,
      @pvtcol=~Rank~,
      @from=~##J2~,
      @defaults1=~y~,@racheck=~y~,@shell=~n~,@cutpvt=~y~,
      @concatenate=~RankStr~,@separator=~~,@stringname=~SortStr~,
       /* return the ids sorted by the concatenated string ranks.*/
      @select=~SELECT IZ.id,val,SortStr  
               FROM IZ JOIN rac ON IZ.id=rac.id 
               ORDER BY SortStr
               DROP TABLE ##J2~'

id          val                                  SortStr
----------- ------------------------------------ ------------------
13          -321                                 01
12          -123                                 02
11          -22                                  03
10          -11                                  04
9           -2                                   05
8           -1                                   06
7           7.4.250                              071116
2           7.4.250                              071116
6           10.30.40.50.20.30.40                 08121517202326
14          22.40.5.60.4.-100.300.478.19710212   091314181921272830
15          22.40.5.60.4.-99.300.478.19710212    091314181922272830
5           22.40.5.60.4.99.300.478.9999999      091314181924272829
4           22.40.5.60.4.99.300.478.19710212     091314181924272830
3           22.40.5.60.4.100.300.478.19710212    091314181925272830
1           100                                  10

More on Rac @
www.rac4sql.net

Solving the problem on sql server 2005 with the Rac utility

The Rac solution in S2k5 is similar to the one on S2K (go there for more info).
But it's simpler since the split string parts (integers) can be ranked
directly using a dense rank function not available in S2k. This eliminates
a recursive call to Rac so here it's executed twice instead of three times (in S2k).

Exec Rac 
@split='[position]',   
@rows='id & [position]',
@pvtcol='val',
@from='IZ',
@separator='.',@rowcounters='id{[Index]}',@counterdatatype='int',
@defaults1='y',@rowbreak='n',@racheck='y',@shell='n',
 -- the DENSE_RANK() function, available in S2k5, is used to rank the integer string parts
 -- and is left padded.
@select='
         SELECT id,[Index],StrNum,Rank,
                REPLICATE(~0~,2-DATALENGTH(CAST(Rank AS VARCHAR(2)))) + CAST(Rank AS VARCHAR(2)) AS RankStr
                INTO #J1     
         FROM 
              (SELECT 1*id AS id,[Index],CAST([1] AS INT) AS StrNum,
                      DENSE_RANK()OVER(ORDER BY [Index],CAST([1] AS INT)) AS Rank
               FROM rac ) AS A
     Exec Rac
     @transform=~Max(RankStr) as RankStr~,
     @rows=~id~,
     @pvtcol=~Rank~,
     @from=~#J1~,
     @defaults1=~y~,@racheck=~y~,@shell=~n~,@cutpvt=~y~,
     @concatenate=~RankStr~,@separator=~~,@stringname=~SortStr~,
     @select=~SELECT IZ.id,val,SortStr  
               FROM IZ JOIN rac ON IZ.id=rac.id 
                ORDER BY SortStr~'

id          val                                  SortStr
----------- ------------------------------------ ------------------
13          -321                                 01
12          -123                                 02
11          -22                                  03
10          -11                                  04
9           -2                                   05
8           -1                                   06
7           7.4.250                              071116
2           7.4.250                              071116
6           10.30.40.50.20.30.40                 08121517202326
14          22.40.5.60.4.-100.300.478.19710212   091314181921272830
15          22.40.5.60.4.-99.300.478.19710212    091314181922272830
5           22.40.5.60.4.99.300.478.9999999      091314181924272829
4           22.40.5.60.4.99.300.478.19710212     091314181924272830
3           22.40.5.60.4.100.300.478.19710212    091314181925272830
1           100                                  10

More on Rac @
www.rac4sql.net

Monday, September 08, 2008

Sorting a delimited string numerically

Beyond Sql

Blog Archive

Links

About Me