Beyond Sql: Dataphor - Sql Visualizing a ranking query

The use of prior information is a fundamental basis for judging
the efficiency of a system. Systems that make use of prior information
are efficient systems. The most efficient systems make the best use
of this information. The most inefficient systems are those that
totally ignore it. This holds for small things like functions to large
things like databases and application development.

Many times efficiency or the lack thereof is hidden. If you could see
a visualization of what is happening it should certainly help in
understanding it.

A classic query in sql is a query that ranks a column. A count is
used to represent the rank of each column value over the table. This
type of query is discussed at length regarding MS Sql Server in:

Itzik Ben-Gan and Sujata Mehta
OVER Clause and Ordered Calculations
http://www.insidetsql.com/OVER_Clause_and_Ordered_Calculations.doc

Create a table R1 in Dataphor using MS Sql Server 2005 as the
data repository.

create table R1
{
 ID:Integer,
 Y:Integer,
 key{ID}
}; 
insert
  table
      {
       row{10 ID,8 Y},
       row{15,26},
       row{20,16},
       row{25,12},
       row{30,25},
       row{35,42},
       row{40,19},
       row{45,36},
       row{50,22},
       row{55,39}
     } into R1;  
     
select R1:

ID Y  
-- -- 
10 8  
15 26 
20 16 
25 12 
30 25 
35 42 
40 19 
45 36 
50 22 
55 39 

A query to rank column Y has traditionally been written in two forms. The
idea is to use the aggregate Count to count the number of times a
comparison between a particular value of Y and all the other row values
of Y is true. The comparison takes a specific row of R1 and compares the
column Y <= to all the column Y values from all the rows. The count of
comparisons where (a value of Y <= all other Y values) is true is the
rank of the Y value. The query can be written using a subquery to obtain
a count or with a join followed by a group by.

An sql query using a subquery:

select A.ID,A.Y,
       (select count(*)
                from R1 as B where B.Y<=A.Y) as Rank
from R1 as A                
 order by A.Y
 
An sql query using a join:

select A.ID,A.Y,Count(*) as Rank
             from R1 as A join R1 as B
                on B.Y<=A.Y
                 group by A.ID,A.Y
                   order by A.Y
                   
Both queries result in:

ID Y  Rank 
-- -- ---- 
10 8  1    
25 12 2    
20 16 3    
40 19 4    
50 22 5    
30 25 6    
15 26 7    
45 36 8    
55 39 9    
35 42 10   

But both queries can hide the nature of the inequality comparison
'B.Y<=A.Y' on which the count is based. This comparison requires
every row of R1 for every value of Y. In other words each count for
a value of Y requires the table R1. Therefore there are N values of
Y multiplied by the rows of the table, N. There N*N comparisons
for N counts where N is the number of rows in the table. For table
R1 it is therefore necessary for the database engine to make 100 
comparisons to obtain the 10 counts and rank each column Y.

To more clearly see the queries in terms of the work involved lets look
at representing the work of the database engine via a Dataphor (D4) query.
A table can be declared in any D4 query using the 'table' construct.
Three random rows of table R1 were chosen so as to represent three
random values of column Y. The query uses values of Y {39,16,12} to
obtain ranks for only these values. Each table expression is the
table R1 and additionally includes one of the random Y values in the 
list, its comparison to each of the Y column values in R1 and the 
boolean (true/false) result of the comparison. We use a left join
since we are not accounting for all ranks (Y values).

select
R1
 left join
 (
  (
   table
       { 
 row{1 RowID,10 ID,8 Y,39 Y1,8<=39 Test,true TestY1},    
 row{1 RowID,15 ID,26 Y,39 Y1,26<=39 Test,true TestY1},  
 row{1 RowID,20 ID,16 Y,39 Y1,16<=39 Test,true TestY1},  
 row{1 RowID,25 ID,12 Y,39 Y1,12<=39 Test,true TestY1},  
 row{1 RowID,30 ID,25 Y,39 Y1,25<=39 Test,true TestY1},  
 row{1 RowID,35 ID,42 Y,39 Y1,42<=39 Test,false TestY1}, 
 row{1 RowID,40 ID,19 Y,39 Y1,19<=39 Test,true TestY1},  
 row{1 RowID,45 ID,36 Y,39 Y1,36<=39 Test,true TestY1},  
 row{1 RowID,50 ID,22 Y,39 Y1,22<=39 Test,true TestY1},  
 row{1 RowID,55 ID,39 Y,39 Y1,39<=39 Test,true TestY1}
       }
          where TestY1
            group by {Y1} add{Count() Rank}
              {Y1 Y,Rank}
  )              
union  
  (            
   table
       {
 row{2 RowID,10 ID,8 Y,16 Y1,8<=16 Test,true TestY1},    
 row{2 RowID,15 ID,26 Y,16 Y1,26<=16 Test,false TestY1}, 
 row{2 RowID,20 ID,16 Y,16 Y1,16<=16 Test,true TestY1},  
 row{2 RowID,25 ID,12 Y,16 Y1,12<=16 Test,true TestY1},  
 row{2 RowID,30 ID,25 Y,16 Y1,25<=16 Test,false TestY1}, 
 row{2 RowID,35 ID,42 Y,16 Y1,42<=16 Test,false TestY1}, 
 row{2 RowID,40 ID,19 Y,16 Y1,19<=16 Test,false TestY1}, 
 row{2 RowID,45 ID,36 Y,16 Y1,36<=16 Test,false TestY1}, 
 row{2 RowID,50 ID,22 Y,16 Y1,22<=16 Test,false TestY1}, 
 row{2 RowID,55 ID,39 Y,16 Y1,39<=16 Test,false TestY1}
       }
          where TestY1
            group by {Y1} add{Count() Rank}
              {Y1 Y,Rank}
  )
union
  (
   table
       { 
 row{3 RowID,10 ID,8 Y,12 Y1,8<=12 Test,true TestY1},    
 row{3 RowID,15 ID,26 Y,12 Y1,26<=12 Test,false TestY1}, 
 row{3 RowID,20 ID,16 Y,12 Y1,16<=12 Test,false TestY1}, 
 row{3 RowID,25 ID,12 Y,12 Y1,12<=12 Test,true TestY1},  
 row{3 RowID,30 ID,25 Y,12 Y1,25<=12 Test,false TestY1}, 
 row{3 RowID,35 ID,42 Y,12 Y1,42<=12 Test,false TestY1}, 
 row{3 RowID,40 ID,19 Y,12 Y1,19<=12 Test,false TestY1}, 
 row{3 RowID,45 ID,36 Y,12 Y1,36<=12 Test,false TestY1}, 
 row{3 RowID,50 ID,22 Y,12 Y1,22<=12 Test,false TestY1}, 
 row{3 RowID,55 ID,39 Y,12 Y1,39<=12 Test,false TestY1}
       }
          where TestY1
            group by {Y1} add{Count() Rank}
              {Y1 Y,Rank}
  )
 )               order by {Y};       

ID Y  Rank       
-- -- ---------- 
10 8  <No Value> 
25 12 2          
20 16 3          
40 19 <No Value> 
50 22 <No Value> 
30 25 <No Value> 
15 26 <No Value> 
45 36 <No Value> 
55 39 9          
35 42 <No Value> 

Each expression:

 (
  table{ }
    where TestY1 //The number of rows where TestY1 is true is the Rank.
      group by {Y1} add{Count() Rank}
              {Y1 Y,Rank}
 ) 
 
 represents the work necessary to get a rank for each Y value. 
 Not only is it necessary to make the comparisons within the table { }
 but the result of the comparison must be related back to table R1.
 This idea is first conveyed in the where and group by so as to
 reduce the count to a single row and then in the union of all rows
 of ranks which is then joined by Y values back to R1. Visualizing
 the query this way makes it easier to see that the computation of
 any one rank is independent of any other rank. There is not any
 information used in one rank that is used for the computation
 of another rank. In other words, there is no use of prior information
 from one rank to another. Conceptually this is why this type of query
 has be referred to as an inefficient type of query.
 
 Note that the implied basic structure of ranking queries, a table
 repeated for each rank desired, does not change regardless of the
 type of rank desired. What changes is the complexity of the 
 comparison. Consider the following table:
 
 create table A1
  {
   ID:Integer,
   Y:Integer,
   key{ID}
  } ;
  insert
    table
        {
         row{1 ID,5 Y},
         row{2,7},
         row{3,7},
         row{4,10},
         row{5,20},
         row{6,20},
         row{7,32}
      } into A1;  
      
 To obtain unique ranks for Y the comparison 
 'each value of Y<=a particular value of Y'
 is inappropriate since it results in non-unique ranks. To break tied
 ranks the following D4 query can be used:
 
 select (A1 rename {ID ID1,Y Y1}) 
          add {Count(A1  
                  where ((Y<Y1) or (Y=Y1 and ID<=ID1)) ) 
               Rank}  
        {ID1 ID,Y1 Y,Rank}       
           order by {Rank} ;

ID Y  Rank 
-- -- ---- 
1  5  1    
2  7  2    
3  7  3    
4  10 4    
5  20 5    
6  20 6    
7  32 7    
          
This query can be visualized with the same table structure as the 
prior example but the complexity of the comparison is increased 
due to bringing in the ID column. The is another negative impact to
efficiency.

In all the queries presented so far the only logical order is
how the result will be presented (order by). The queries operate
under the assumption that the database engine will pick a particular
row to obtain a rank in any way it chooses. As the queries are
written we don't care nor does it have any importance how the
rows (ranks) are processed. 

Lets now introduce the idea of ordering the columns of Y by
sorting the rows of table R1 in ascending order of column Y
and processing the table in that order.

The D4 ToList operator creates a list of rows from a table defined
in the order of a cursor. The statement:

select ToList(cursor(R1 order by {Y}))[0];        

uses an indexer ([]) to return the 1st row in a list using
table R1 ordered by column Y.

ID Y 
-- - 
10 8 

The next row in the ordered list can be addressed by using 1 for 
the indexer:

select ToList(cursor(R1 order by {Y}))[1];           

ID Y  
-- -- 
25 12 

Finally we can get the last row in the list without explicitly 
stating an index, but using Count to count how many rows there are
(lists are addressed from 0 so we subtract 1 from count):

select 
  ToList(cursor(R1 order by {Y}))[ToList(cursor(R1 order by {Y})).Count()-1];        

ID Y  
-- -- 
35 42 

Now we can simply loop over the ordered list to obtain the rank
of each Y value:

var TempTable:=table of {Y:Integer,Rank:Integer}{};
var LListR1:=ToList(cursor(R1 order by {Y}));
var I:Integer:=0;
 foreach var LListRow in LListR1 do
  begin
  I:=I+1;//Reuse previous value of I for rank.
   insert row{LListRow.Y Y,I Rank} into TempTable;
  end;
select 'TempTable';  
select TempTable; 
select 'R1 joined to TempTable'; 
select R1 join TempTable with {IgnoreUnsupported = 'true'}
           order by {Y}; 

TempTable
Y  Rank 
-- ---- 
8  1    
12 2    
16 3    
19 4    
22 5    
25 6    
26 7    
36 8    
39 9    
42 10   

R1 joined to TempTable
ID Y  Rank 
-- -- ---- 
10 8  1    
25 12 2    
20 16 3    
40 19 4    
50 22 5    
30 25 6    
15 26 7    
45 36 8    
55 39 9    
35 42 10   

We have obtained the ranks by processing the rows of R1 only
once as compared to 100 times in a query. The statement:
I:= I + 1 
which represents the rank of a particular Y value, is a mathematical 
statement of reusing prior information. And we are able to reuse
information as a result of ordering the rows of R1.

We can take the concept of the loop one step further by simply
transforming the list of rows into a table:

select ToTable(ToList(cursor(R1 order by {Y})))
               {ID,Y,sequence+1 Cnt}
                  order by {Y}; 
                  
ID Y  Cnt 
-- -- --- 
10 8  1   
25 12 2   
20 16 3   
40 19 4   
50 22 5   
30 25 6   
15 26 7   
45 36 8   
55 39 9   
35 42 10  

This represents a most efficient use of prior information, the
previous value of Y. Sql is making the same efficient use of
prior information with its analytic ranking functions:

select ID,Y,row_number()over(order by Y) as Rank
        from R1
        
The paper by Ben-Gan can be seen as a plea for Sql Server to
more efficiently use prior information (in the form of processing
queries in an ordered way) in various contexts.

Just as I:=I+1 is a statement of reuse of information, application
development is surely based on the reuse of data. And what 
distinguishes one environment from another is the efficiency of
use. Conceptually I:=I+1 is no different than using a key to 
efficiently access a row of data from a view. To make the connection
between reusing prior information efficiently and application
development is to see what D4 is all about -:)

bye for now,
steve


This operator returns a random sample (N) of column Y values from table R1
in a table. It takes into account replacement.

create operator Rand_Y_Values(N:Integer):
        table{RowID:Integer,Y1:Integer}
begin
result:=table of typeof(result){};
var Check:Integer;
for var J:Integer:=1 to N do
begin
var Y1:Integer;
Math.Seed();
for var I:Integer:= 9 to 1000 do
begin
Y1:=Random(1,50);
Check:=I;
if ( IsNotNil ((R1 adorn {key{Y}})[Y1 by {Y}]) )
             and ( IsNil( (result adorn {key{Y1}})[Y1]) ) with {IgnoreUnsupported = 'true'} then
break;
end;  
if Check=1000 then
 begin
 raise Error('No random value could be found for Row #: '+ ToString(N));
 exit;   
 end;
result:=result union table{row{J RowID,Y1 Y1}};
end;
end;

This query uses operator Rand_Y_Values to return a character string
that represents a row statement using a value of Y. The rows were copied
and pasted into the query.

select (R1 times Rand_Y_Values(3)  with {IgnoreUnsupported = 'true'})
   add{Y<=Y1 TestResult}
   add{'row{'+ToString(RowID)+' RowID,'+ToString(ID)+' ID,'+ToString(Y)+' Y,'+
            ToString(Y1)+' Y1,'+ToString(Y)+'<='+ToString(Y1)+ ' Test,'+ToString(TestResult)+' TestY1},' Arow}
             {Arow};
3 comments:

Anonymous said...: demi lovato naked photos http://trusted.md/user/demi_lovato_nude nudes of demi lovato; Mon Nov 22, 01:01:00 PM
Anonymous said...: miley cyrus naked porn miley cyrus toon porn; Fri Nov 26, 12:55:00 AM
muebles en pontevedra said...: Gosh, there's a lot of helpful information above!; Tue Nov 29, 06:14:00 AM
Friday, June 22, 2007

Dataphor - Sql Visualizing a ranking query

3 comments:

Beyond Sql

Blog Archive

Links

About Me