Beyond Sql: sql dense rank for identifying consecutive runs

In his July 2007 MS Sql Server article Itzik Ben-Gan raises the question
of identifying consecutive runs of rows where values repeat themselves. 

T-SQL Black Belt 
Identifying Sections 
By:Itzik Ben-Gan 

The problem is framed as:

'The generic form of the problem involves a table (call it T1) that has
 two columns of interest: one representing order among rows (call it id)
 and the other holding some value (call it val). The task at hand is to
 identify sections of consecutive rows that share the same value. 
 The terms section and consecutive rows are problematic when you're 
 dealing with sets, but as I mentioned, the column ID represents logical
 order among rows, and once order is defined, these terms become meaningful.
 For each identified section, you need to return the minimum id, maximum id,
 val, count of rows in the section, and possibly other aggregates.'
 
The use of the term 'generic' is most appropriate as it conveys the idea
that a proposed solution account for 'all' data. Such a generic approach
is an 'inclusive' one.

Given the sample data:

create table TG1
(
 id  int not null primary key,
 val varchar(10) not null
);

insert into TG1(id, val) values( 1, 'a');
insert into TG1(id, val) values( 2, 'a');
insert into TG1(id, val) values( 3, 'a');
insert into TG1(id, val) values( 5, 'a');
insert into TG1(id, val) values( 7, 'b');
insert into TG1(id, val) values( 9, 'b');
insert into TG1(id, val) values(11, 'a');
insert into TG1(id, val) values(13, 'a');
insert into TG1(id, val) values(17, 'b');
insert into TG1(id, val) values(19, 'b');
insert into TG1(id, val) values(23, 'b');
insert into TG1(id, val) values(29, 'a');
insert into TG1(id, val) values(31, 'b');
insert into TG1(id, val) values(37, 'b');

The ideal solution is to create a new column which can be
used with val to group the rows into discrete sections which
reflects the fact that a particular val can occur repeated times.

id   val    Grp(Dense_Rank) 
---- ------ --------------- 
1    a      1
2    a      1
3    a      1
5    a      1
7    b      2
9    b      2
11   a      3
13   a      3
17   b      4
19   b      4
23   b      4
29   a      5
31   b      6
37   b      6

The new column (Grp) is simply a dense rank on val in the order
of id. 

There are two proposed approaches. One uses a subquery to create the
Grp column. Various type of subqueries have been proposed as I did
some time ago:

Jul 5 1999, 12:00 am
microsoft.public.sqlserver.programming
Difficult SQL Question
Author: Trysql

Jun 11 2001, 10:33 am
microsoft.public.sqlserver.programming
SQL challenge: coelesce consecutive and overlapping bookings for a room
Author: steve dassin

None of the subquery solutions are particularly appealing. They are not
intuitive and it is doubtful the average developer would come upon
such a beast. And they are certainly not efficient as described here.
As for the analytic ranking functions as explained here sql ranking
functions cannot rank a column independent of its order. This makes it
necessary to simulate a dense rank with a combination of functions.
This too is not intuitive and it is doubtful the average developer
understands this approach and can apply it to other similar problems.
We are really talking expert sql programmers and not application
developers.

Perhaps a more fundamental issue than the types of solutions is the
question posed. A generic solution is the foundation of a report.
But of how much importance is the relationship between different
vals. Is the intent to see a relationship between different accounts,
students,sport teams or to examine data/relationships within the
same team but different sections? If the latter then is it really
necessary to produce sections on everything?

If the question is framed as identify all the sections for a 'particular'
val, as seems reasonable, the question of the types of solutions
and there nature dramatically changes. Now the whole solution can be
based on the very basic idea of a dense rank - if the thing is the
same use the current rank, if another thing increment the rank. Any
developer using any client should have no problem seeing this as
nothing more than a foreach loop.
Assuming val 'a' is of interest in D4 this can be done as simply:

var TempTable:=table of{id:Integer,val:String,Grp:Integer}{};
var I:Integer:=1;
foreach var LItem in TG1 do
 begin
 if LItem.val<>'a' then
    I:=I+1;
 if LItem.val='a' then   
 insert row{LItem.id id,LItem.val val,I Grp} into TempTable;
 end;
 select TempTable;

id val Grp 
-- --- --- 
1  a   1   
2  a   1   
3  a   1   
5  a   1   
11 a   3   
13 a   3   
29 a   6   

We have completely eliminated any need of subqueries or ranking
functions. In fact we have eliminated any need of any type of query.

We can, of course, create a procedure to return sections for any
val. And number the sections consecutively in case that is helpful.

A D4 procedure would be:

create operator Sections_for_a_val(aTable:typeof(TG1),aVal:String):
                        table{id:Integer,val:String,Grp:Integer}
begin                        
result:=table of typeof(result){};
var I:Integer:=1;
foreach row in aTable do
 begin
 if val<>aVal then
    I:=I+1;
 if val=aVal then   
 insert row{id id,val val,I Grp} into result;
 end;     
result:=
       result
         join
            ( 
             ToTable(ToList(cursor(result over{Grp} order by {Grp})))
              {Grp,sequence+1 GrpSeq}
            )
             {id,val,GrpSeq Grp}; 
end;

select Sections_for_a_val(TG1,'a');

id val Grp 
-- --- --- 
1  a   1   
2  a   1   
3  a   1   
5  a   1   
11 a   2   
13 a   2   
29 a   3   

select Sections_for_a_val(TG1,'b');

id val Grp 
-- --- --- 
7  b   1   
9  b   1   
17 b   2   
19 b   2   
23 b   2   
31 b   3   
37 b   3   

Summarizing is the same as in the sql solutions.

select Sections_for_a_val(TG1,'a')
  group by {Grp}
    add{Min(id) start_section,Max(id) end_section,Count() num_rows};

Grp start_section end_section num_rows 
--- ------------- ----------- -------- 
1   1             5           4        
2   11            13          2        
3   29            29          1 

Finally, least I have neglected sql, here is yet another subquery
approach (in D4) to dense ranks that takes the sql server 2005
row_number() function as its starting point.

var A:=
   SQLQuery('select id,val,row_number()over(order by id) RowID
               from TG1');
select A
  add
      {
       Count(
             (A rename {RowID RowID1} where RowID1<=RowID)
              add{ (A adorn{key{RowID}})[RowID1-1].val
                    =
                    (A adorn{key{RowID}})[RowID1].val Testval}
                       where not Testval
             ) + 1          
           DenseRank
       };  
       
id val RowID DenseRank 
-- --- ----- --------- 
1  a   1     1         
2  a   2     1         
3  a   3     1         
5  a   4     1         
7  b   5     2         
9  b   6     2         
11 a   7     3         
13 a   8     3         
17 b   9     4         
19 b   10    4         
23 b   11    4         
29 a   12    5         
31 b   13    6         
37 b   14    6     

There is certainly nothing 'wrong' with a report. But it is perhaps
wrong-headed to try to make one when it is not really the object
of the exercise.

steve
Sunday, June 24, 2007

Sql - Using a dense rank for identifying sections

Beyond Sql

Blog Archive

Links

About Me