In his July 2007 MS Sql Server article, 'T-SQL Black Belt: Identifying Sections', Itzik Ben-Gan raises the question of identifying consecutive runs of rows where values repeat themselves.

The problem is framed as:

'The generic form of the problem involves a table (call it T1) that has two columns of interest: one representing order among rows (call it id) and the other holding some value (call it val). The task at hand is to identify sections of consecutive rows that share the same value. The terms section and consecutive rows are problematic when you're dealing with sets, but as I mentioned, the column ID represents logical order among rows, and once order is defined, these terms become meaningful. For each identified section, you need to return the minimum id, maximum id, val, count of rows in the section, and possibly other aggregates.'

The use of the term 'generic' is most appropriate, as it conveys the idea that a proposed solution accounts for 'all' data. Such a generic approach is an 'inclusive' one.

Given the sample data:

create table TG1
(
 id int not null primary key,
 val varchar(10) not null
);
insert into TG1(id, val) values( 1, 'a');
insert into TG1(id, val) values( 2, 'a');
insert into TG1(id, val) values( 3, 'a');
insert into TG1(id, val) values( 5, 'a');
insert into TG1(id, val) values( 7, 'b');
insert into TG1(id, val) values( 9, 'b');
insert into TG1(id, val) values(11, 'a');
insert into TG1(id, val) values(13, 'a');
insert into TG1(id, val) values(17, 'b');
insert into TG1(id, val) values(19, 'b');
insert into TG1(id, val) values(23, 'b');
insert into TG1(id, val) values(29, 'a');
insert into TG1(id, val) values(31, 'b');
insert into TG1(id, val) values(37, 'b');

The ideal solution is to create a new column which, used together with val, groups the rows into discrete sections, reflecting the fact that a particular val can occur repeatedly:

id   val    Grp(Dense_Rank)
---- ------ ---------------
1    a      1
2    a      1
3    a      1
5    a      1
7    b      2
9    b      2
11   a      3
13   a      3
17   b      4
19   b      4
23   b      4
29   a      5
31   b      6
37   b      6

The new column (Grp) is simply a dense rank on val in the order of id. There are two proposed approaches. One uses a subquery to create the Grp column. Various types of subqueries have been proposed, as I did some time ago:

Jul 5 1999, 12:00 am
microsoft.public.sqlserver.programming
Difficult SQL Question
Author: Trysql

Jun 11 2001, 10:33 am
microsoft.public.sqlserver.programming
SQL challenge: coelesce consecutive and overlapping bookings for a room
Author: steve dassin

None of the subquery solutions are particularly appealing. They are not intuitive, and it is doubtful the average developer would come upon such a beast. And they are certainly not efficient, as described here. As for the analytic ranking functions, as explained here, sql ranking functions cannot rank a column independently of its order. This makes it necessary to simulate a dense rank with a combination of functions. This too is not intuitive, and it is doubtful the average developer understands this approach and can apply it to other similar problems. We are really talking about expert sql programmers, not application developers.

Perhaps a more fundamental issue than the types of solutions is the question posed. A generic solution is the foundation of a report. But how important is the relationship between different vals? Is the intent to see a relationship between different accounts, students, or sports teams, or to examine data/relationships within the same team but different sections?
If the latter, is it really necessary to produce sections for everything? If the question is framed as identifying all the sections for a 'particular' val, as seems reasonable, the question of the types of solutions and their nature changes dramatically. Now the whole solution can be based on the very basic idea of a dense rank - if the thing is the same, use the current rank; if it is another thing, increment the rank. Any developer using any client should have no problem seeing this as nothing more than a foreach loop. Assuming val 'a' is of interest, in D4 this can be done as simply as:

var TempTable:=table of{id:Integer,val:String,Grp:Integer}{};
var I:Integer:=1;
foreach var LItem in TG1 do
begin
 if LItem.val<>'a' then
  I:=I+1;
 if LItem.val='a' then
  insert row{LItem.id id,LItem.val val,I Grp} into TempTable;
end;
select TempTable;

id val Grp
-- --- ---
1  a   1
2  a   1
3  a   1
5  a   1
11 a   3
13 a   3
29 a   6

We have completely eliminated any need for subqueries or ranking functions. In fact we have eliminated any need for any type of query. We can, of course, create a procedure to return sections for any val, and number the sections consecutively in case that is helpful. A D4 procedure would be:

create operator Sections_for_a_val(aTable:typeof(TG1),aVal:String):
                table{id:Integer,val:String,Grp:Integer}
begin
 result:=table of typeof(result){};
 var I:Integer:=1;
 foreach row in aTable do
 begin
  if val<>aVal then
   I:=I+1;
  if val=aVal then
   insert row{id id,val val,I Grp} into result;
 end;
 result:=
  result
   join
  (
   ToTable(ToList(cursor(result over{Grp} order by {Grp})))
    {Grp,sequence+1 GrpSeq}
  )
   {id,val,GrpSeq Grp};
end;

select Sections_for_a_val(TG1,'a');

id val Grp
-- --- ---
1  a   1
2  a   1
3  a   1
5  a   1
11 a   2
13 a   2
29 a   3

select Sections_for_a_val(TG1,'b');

id val Grp
-- --- ---
7  b   1
9  b   1
17 b   2
19 b   2
23 b   2
31 b   3
37 b   3

Summarizing is the same as in the sql solutions.

select Sections_for_a_val(TG1,'a')
 group by {Grp}
  add{Min(id) start_section,Max(id) end_section,Count() num_rows};

Grp start_section end_section num_rows
--- ------------- ----------- --------
1   1             5           4
2   11            13          2
3   29            29          1

Finally, lest I have neglected sql, here is yet another subquery approach (in D4) to dense ranks that takes the sql server 2005 row_number() function as its starting point.

var A:=
 SQLQuery('select id,val,row_number()over(order by id) RowID from TG1');
select
 A
  add
  {
   Count(
         (A rename {RowID RowID1} where RowID1<=RowID)
          add{ (A adorn{key{RowID}})[RowID1-1].val =
               (A adorn{key{RowID}})[RowID1].val Testval}
          where not Testval
        ) + 1 DenseRank
  };

id val RowID DenseRank
-- --- ----- ---------
1  a   1     1
2  a   2     1
3  a   3     1
5  a   4     1
7  b   5     2
9  b   6     2
11 a   7     3
13 a   8     3
17 b   9     4
19 b   10    4
23 b   11    4
29 a   12    5
31 b   13    6
37 b   14    6

There is certainly nothing 'wrong' with a report. But it is perhaps wrong-headed to try to make one when it is not really the object of the exercise.

steve
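For readers who want to see the ranking-function approach alluded to above in sql terms, here is a hedged sketch (one common Sql Server 2005 formulation, not necessarily the article's exact query). The difference between a row_number over id and a row_number partitioned by val is constant within each run of equal val values, so it can stand in for the Grp column:

with G as
(
select id, val,
       row_number() over(order by id)
     - row_number() over(partition by val order by id) as grp
from TG1
)
select val, min(id) as start_section, max(id) as end_section, count(*) as num_rows
from G
group by val, grp
order by start_section;

The grp values themselves are not consecutive section numbers; they only need to be constant within a section and distinct between sections of the same val.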
Dataphor SQL RAC (Relational Application Companion)
A site of hope for those looking for a true relational database system
Sunday, June 24, 2007
Sql - Using a dense rank for identifying sections
Friday, June 22, 2007
Dataphor - Sql Visualizing a ranking query
The use of prior information is a fundamental basis for judging the efficiency of a system. Systems that make use of prior information are efficient systems. The most efficient systems make the best use of this information. The most inefficient systems are those that totally ignore it. This holds for small things like functions to large things like databases and application development. Many times efficiency or the lack thereof is hidden. If you could see a visualization of what is happening it should certainly help in understanding it. A classic query in sql is a query that ranks a column. A count is used to represent the rank of each column value over the table. This type of query is discussed at length regarding MS Sql Server in: Itzik Ben-Gan and Sujata Mehta OVER Clause and Ordered Calculations http://www.insidetsql.com/OVER_Clause_and_Ordered_Calculations.doc Create a table R1 in Dataphor using MS Sql Server 2005 as the data repository. create table R1 { ID:Integer, Y:Integer, key{ID} }; insert table { row{10 ID,8 Y}, row{15,26}, row{20,16}, row{25,12}, row{30,25}, row{35,42}, row{40,19}, row{45,36}, row{50,22}, row{55,39} } into R1; select R1: ID Y -- -- 10 8 15 26 20 16 25 12 30 25 35 42 40 19 45 36 50 22 55 39 A query to rank column Y has traditionally been written in two forms. The idea is to use the aggregate Count to count the number of times a comparison between a particular value of Y and all the other row values of Y is true. The comparison takes a specific row of R1 and compares the column Y <= to all the column Y values from all the rows. The count of comparisons where (a value of Y <= all other Y values) is true is the rank of the Y value. The query can be written using a subquery to obtain a count or with a join followed by a group by. An sql query using a subquery: select A.ID,A.Y, (select count(*) from R1 as B where B.Y<=A.Y) as Rank from R1 as A order by A.Y An sql query using a join: select A.ID,A.Y,Count(*) as Rank from R1 as A join R1 as B on B.Y<=A.Y group by A.ID,A.Y order by A.Y Both queries result in: ID Y Rank -- -- ---- 10 8 1 25 12 2 20 16 3 40 19 4 50 22 5 30 25 6 15 26 7 45 36 8 55 39 9 35 42 10 But both queries can hide the nature of the inequality comparison 'B.Y<=A.Y' on which the count is based. This comparison requires every row of R1 for every value of Y. In other words each count for a value of Y requires the table R1. Therefore there are N values of Y multiplied by the rows of the table, N. There N*N comparisons for N counts where N is the number of rows in the table. For table R1 it is therefore necessary for the database engine to make 100 comparisons to obtain the 10 counts and rank each column Y. To more clearly see the queries in terms of the work involved lets look at representing the work of the database engine via a Dataphor (D4) query. A table can be declared in any D4 query using the 'table' construct. Three random rows of table R1 were chosen so as to represent three random values of column Y. The query uses values of Y {39,16,12} to obtain ranks for only these values. Each table expression is the table R1 and additionally includes one of the random Y values in the list, its comparison to each of the Y column values in R1 and the boolean (true/false) result of the comparison. We use a left join since we are not accounting for all ranks (Y values). 
select R1 left join ( ( table { row{1 RowID,10 ID,8 Y,39 Y1,8<=39 Test,true TestY1}, row{1 RowID,15 ID,26 Y,39 Y1,26<=39 Test,true TestY1}, row{1 RowID,20 ID,16 Y,39 Y1,16<=39 Test,true TestY1}, row{1 RowID,25 ID,12 Y,39 Y1,12<=39 Test,true TestY1}, row{1 RowID,30 ID,25 Y,39 Y1,25<=39 Test,true TestY1}, row{1 RowID,35 ID,42 Y,39 Y1,42<=39 Test,false TestY1}, row{1 RowID,40 ID,19 Y,39 Y1,19<=39 Test,true TestY1}, row{1 RowID,45 ID,36 Y,39 Y1,36<=39 Test,true TestY1}, row{1 RowID,50 ID,22 Y,39 Y1,22<=39 Test,true TestY1}, row{1 RowID,55 ID,39 Y,39 Y1,39<=39 Test,true TestY1} } where TestY1 group by {Y1} add{Count() Rank} {Y1 Y,Rank} ) union ( table { row{2 RowID,10 ID,8 Y,16 Y1,8<=16 Test,true TestY1}, row{2 RowID,15 ID,26 Y,16 Y1,26<=16 Test,false TestY1}, row{2 RowID,20 ID,16 Y,16 Y1,16<=16 Test,true TestY1}, row{2 RowID,25 ID,12 Y,16 Y1,12<=16 Test,true TestY1}, row{2 RowID,30 ID,25 Y,16 Y1,25<=16 Test,false TestY1}, row{2 RowID,35 ID,42 Y,16 Y1,42<=16 Test,false TestY1}, row{2 RowID,40 ID,19 Y,16 Y1,19<=16 Test,false TestY1}, row{2 RowID,45 ID,36 Y,16 Y1,36<=16 Test,false TestY1}, row{2 RowID,50 ID,22 Y,16 Y1,22<=16 Test,false TestY1}, row{2 RowID,55 ID,39 Y,16 Y1,39<=16 Test,false TestY1} } where TestY1 group by {Y1} add{Count() Rank} {Y1 Y,Rank} ) union ( table { row{3 RowID,10 ID,8 Y,12 Y1,8<=12 Test,true TestY1}, row{3 RowID,15 ID,26 Y,12 Y1,26<=12 Test,false TestY1}, row{3 RowID,20 ID,16 Y,12 Y1,16<=12 Test,false TestY1}, row{3 RowID,25 ID,12 Y,12 Y1,12<=12 Test,true TestY1}, row{3 RowID,30 ID,25 Y,12 Y1,25<=12 Test,false TestY1}, row{3 RowID,35 ID,42 Y,12 Y1,42<=12 Test,false TestY1}, row{3 RowID,40 ID,19 Y,12 Y1,19<=12 Test,false TestY1}, row{3 RowID,45 ID,36 Y,12 Y1,36<=12 Test,false TestY1}, row{3 RowID,50 ID,22 Y,12 Y1,22<=12 Test,false TestY1}, row{3 RowID,55 ID,39 Y,12 Y1,39<=12 Test,false TestY1} } where TestY1 group by {Y1} add{Count() Rank} {Y1 Y,Rank} ) ) order by {Y}; ID Y Rank -- -- ---------- 10 8 <No Value> 25 12 2 20 16 3 40 19 <No Value> 50 22 <No Value> 30 25 <No Value> 15 26 <No Value> 45 36 <No Value> 55 39 9 35 42 <No Value> Each expression: ( table{ } where TestY1 //The number of rows where TestY1 is true is the Rank. group by {Y1} add{Count() Rank} {Y1 Y,Rank} ) represents the work necessary to get a rank for each Y value. Not only is it necessary to make the comparisons within the table { } but the result of the comparison must be related back to table R1. This idea is first conveyed in the where and group by so as to reduce the count to a single row and then in the union of all rows of ranks which is then joined by Y values back to R1. Visualizing the query this way makes it easier to see that the computation of any one rank is independent of any other rank. There is not any information used in one rank that is used for the computation of another rank. In other words, there is no use of prior information from one rank to another. Conceptually this is why this type of query has be referred to as an inefficient type of query. Note that the implied basic structure of ranking queries, a table repeated for each rank desired, does not change regardless of the type of rank desired. What changes is the complexity of the comparison. Consider the following table: create table A1 { ID:Integer, Y:Integer, key{ID} } ; insert table { row{1 ID,5 Y}, row{2,7}, row{3,7}, row{4,10}, row{5,20}, row{6,20}, row{7,32} } into A1; To obtain unique ranks for Y the comparison 'each value of Y<=a particular value of Y' is inappropriate since it results in non-unique ranks. 
To break tied ranks the following D4 query can be used: select (A1 rename {ID ID1,Y Y1}) add {Count(A1 where ((Y<Y1) or (Y=Y1 and ID<=ID1)) ) Rank} {ID1 ID,Y1 Y,Rank} order by {Rank} ; ID Y Rank -- -- ---- 1 5 1 2 7 2 3 7 3 4 10 4 5 20 5 6 20 6 7 32 7 This query can be visualized with the same table structure as the prior example but the complexity of the comparison is increased due to bringing in the ID column. The is another negative impact to efficiency. In all the queries presented so far the only logical order is how the result will be presented (order by). The queries operate under the assumption that the database engine will pick a particular row to obtain a rank in any way it chooses. As the queries are written we don't care nor does it have any importance how the rows (ranks) are processed. Lets now introduce the idea of ordering the columns of Y by sorting the rows of table R1 in ascending order of column Y and processing the table in that order. The D4 ToList operator creates a list of rows from a table defined in the order of a cursor. The statement: select ToList(cursor(R1 order by {Y}))[0]; uses an indexer ([]) to return the 1st row in a list using table R1 ordered by column Y. ID Y -- - 10 8 The next row in the ordered list can be addressed by using 1 for the indexer: select ToList(cursor(R1 order by {Y}))[1]; ID Y -- -- 25 12 Finally we can get the last row in the list without explicitly stating an index, but using Count to count how many rows there are (lists are addressed from 0 so we subtract 1 from count): select ToList(cursor(R1 order by {Y}))[ToList(cursor(R1 order by {Y})).Count()-1]; ID Y -- -- 35 42 Now we can simply loop over the ordered list to obtain the rank of each Y value: var TempTable:=table of {Y:Integer,Rank:Integer}{}; var LListR1:=ToList(cursor(R1 order by {Y})); var I:Integer:=0; foreach var LListRow in LListR1 do begin I:=I+1;//Reuse previous value of I for rank. insert row{LListRow.Y Y,I Rank} into TempTable; end; select 'TempTable'; select TempTable; select 'R1 joined to TempTable'; select R1 join TempTable with {IgnoreUnsupported = 'true'} order by {Y}; TempTable Y Rank -- ---- 8 1 12 2 16 3 19 4 22 5 25 6 26 7 36 8 39 9 42 10 R1 joined to TempTable ID Y Rank -- -- ---- 10 8 1 25 12 2 20 16 3 40 19 4 50 22 5 30 25 6 15 26 7 45 36 8 55 39 9 35 42 10 We have obtained the ranks by processing the rows of R1 only once as compared to 100 times in a query. The statement: I:= I + 1 which represents the rank of a particular Y value, is a mathematical statement of reusing prior information. And we are able to reuse information as a result of ordering the rows of R1. We can take the concept of the loop one step further by simply transforming the list of rows into a table: select ToTable(ToList(cursor(R1 order by {Y}))) {ID,Y,sequence+1 Cnt} order by {Y}; ID Y Cnt -- -- --- 10 8 1 25 12 2 20 16 3 40 19 4 50 22 5 30 25 6 15 26 7 45 36 8 55 39 9 35 42 10 This represents a most efficient use of prior information, the previous value of Y. Sql is making the same efficient use of prior information with its analytic ranking functions: select ID,Y,row_number()over(order by Y) as Rank from R1 The paper by Ben-Gan can be seen as a plea for Sql Server to more efficiently use prior information (in the form of processing queries in an ordered way) in various contexts. Just as I:=I+1 is a statement of reuse of information, application development is surely based on the reuse of data. And what distinguishes one environment from another is the efficiency of use. 
Conceptually I:=I+1 is no different than using a key to efficiently access a row of data from a view. To make the connection between reusing prior information efficiently and application development is to see what D4 is all about -:) bye for now, steve This operator returns a random sample (N) of column Y values from table R1 in a table. It takes into account replacement. create operator Rand_Y_Values(N:Integer): table{RowID:Integer,Y1:Integer} begin result:=table of typeof(result){}; var Check:Integer; for var J:Integer:=1 to N do begin var Y1:Integer; Math.Seed(); for var I:Integer:= 9 to 1000 do begin Y1:=Random(1,50); Check:=I; if ( IsNotNil ((R1 adorn {key{Y}})[Y1 by {Y}]) ) and ( IsNil( (result adorn {key{Y1}})[Y1]) ) with {IgnoreUnsupported = 'true'} then break; end; if Check=1000 then begin raise Error('No random value could be found for Row #: '+ ToString(N)); exit; end; result:=result union table{row{J RowID,Y1 Y1}}; end; end; This query uses operator Rand_Y_Values to return a character string that represents a row statement using a value of Y. The rows were copied and pasted into the query. select (R1 times Rand_Y_Values(3) with {IgnoreUnsupported = 'true'}) add{Y<=Y1 TestResult} add{'row{'+ToString(RowID)+' RowID,'+ToString(ID)+' ID,'+ToString(Y)+' Y,'+ ToString(Y1)+' Y1,'+ToString(Y)+'<='+ToString(Y1)+ ' Test,'+ToString(TestResult)+' TestY1},' Arow} {Arow};
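As a footnote to the tie-breaking example on table A1 above: the analytic ranking functions are the sql expression of exactly this ordered, prior-information approach, and ties can be broken simply by extending the ordering to the key. A hedged t-sql sketch (Sql Server 2005):

select ID, Y, row_number() over(order by Y, ID) as Rank
from A1
order by Rank;

This returns the same unique ranks as the count-based comparison '(Y<Y1) or (Y=Y1 and ID<=ID1)' but processes the ordered rows once rather than comparing every row with every other row.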
Tuesday, June 19, 2007
Dataphor - Inclusive Sql vs exclusive D4
Sql questions and solutions are proffered from the perspective of inclusive. It is generally the largest scope encapsulated in a single query that is the goal. Every row of every table must be accounted for. And even those rows that don't exist must be included. If sql is inclusive then application development must be exclusive. AD raises a singular question, a particular vendor, a particular customer. A solution for a specific something need not account for every something. Therefore it makes sense that AD have constructs and solutions that match the intent of AD. More on the inclusive vs. exclusive argument at a later date -:) Here is a solution in the application development language of Dataphor based on the thread: Monday, June 18, 2007 6:52 AM microsoft.public.sqlserver.newusers Query How To? http://tinyurl.com/yphn44 MS Sql Server 2005 is used to store the data. We frame the question in an exclusive manner: given a particular player summarize his ratings over the different sports. create table WanaPlayers { CID:Integer, Player:String {default nil}, Ethnicity:String {default nil}, key{CID} }; insert table { row{1 CID,'Charlie' Player ,'black' Ethnicity}, row{2, 'Dave', 'white'}, row{3, 'Josh', 'white'}, row{4, 'Jeff', 'black'} } into WanaPlayers; create table WanaSports { TID:Integer, Sport:String {default nil}, key{TID} }; insert table { row{1 TID, 'Baseball' Sport}, row{2, 'Football'}, row{3, 'Hockey'} } into WanaSports; create table WanaRatings { RowID:Integer, TID:Integer, CID:Integer, Rating:Integer {default nil}, key{RowID}, reference Ratings_Sports {TID} references WanaSports {TID}, reference Ratings_Players {CID} references WanaPlayers {CID} }; insert table { row{1 RowID,1 TID, 1 CID, 50 Rating}, row{2,1, 1, 25}, row{3,2, 1, 69}, row{4,2, 1, 71}, row{5,2, 1, 50}, row{6,3, 1, 50}, row{7,1, 2, 97}, row{8,1, 2, 100}, row{9,1, 2, 100}, row{10,2, 2, 98}, row{11,3, 2, 99}, row{12,1, 3, 0} } into WanaRatings; We create a view using natural joins as opposed to using any outer joins. The view simply addresses the maximum rating for each player for each sport he has. create view WanaView WanaRatings join WanaPlayers join WanaSports group by {Player,Sport} add {Max(Rating) MaxRating}; select WanaView; Player Sport MaxRating ------- -------- --------- Charlie Baseball 50 Charlie Football 71 Charlie Hockey 50 Dave Baseball 100 Dave Football 98 Dave Hockey 99 Josh Baseball 0 The key of the view is obvious, {Player,Sport}. Assuming sports baseball, football and hockey we can define an operator to return a single row with the columns of interest. For any Player we create a table with their MaxRating for each sport. We use the key of the view to directly address the max rating. 
create operator PlayerRating(Name:String): row{Player:String,AvgRating:Decimal,HighNumber:Integer} begin result:=row of typeof(result){}; result:= (table { row{1 ID,WanaView[Name,'BaseBall' by {Player,Sport}].MaxRating MaxRating}, row{2,WanaView[Name,'FootBall' by {Player,Sport}].MaxRating}, row{3,WanaView[Name,'Hockey' by {Player,Sport}].MaxRating} } add{ToInteger(MaxRating>80) HighRating} group add{Avg(MaxRating) AvgRating,Sum(HighRating) HighNumber} {Name Player,AvgRating,HighNumber})[]; end; select PlayerRating('Charlie'); Player AvgRating HighNumber ------- --------- ---------- Charlie 57 0 select PlayerRating('Dave'); Player AvgRating HighNumber ------ --------- ---------- Dave 99 3 select PlayerRating('Jeff'); Player AvgRating HighNumber ------ ---------- ---------- Jeff <No Value> 0 select PlayerRating('Steve'); Player AvgRating HighNumber ------ ---------- ---------- Steve <No Value> 0 Since the operator returns a row we have each column (scalar value) easily available. select PlayerRating('Dave').HighNumber; 3 Feel free to post any questions as comments. bye for now, steve
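For comparison, the exclusive framing can be approximated in t-sql as well. This is only a hedged sketch (the view name WanaView_sql is mine, not from the thread), assuming the tables created above; it is not the inclusive, every-player report the original thread asked for:

create view WanaView_sql as
select P.Player, S.Sport, max(R.Rating) as MaxRating
from WanaRatings as R
join WanaPlayers as P on P.CID = R.CID
join WanaSports  as S on S.TID = R.TID
group by P.Player, S.Sport;

select Player,
       avg(MaxRating) as AvgRating,
       sum(case when MaxRating > 80 then 1 else 0 end) as HighNumber
from WanaView_sql
where Player = 'Dave'
group by Player;

Note that, unlike the D4 PlayerRating operator, this sketch returns no row at all for a player with no ratings (Jeff, Steve).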
Saturday, June 16, 2007
Dataphor - Intelligent views
This is in response to the thread: Saturday, June 16, 2007 10:10 AM microsoft.public.sqlserver.programming How to make a view as input??? http://tinyurl.com/2o5q8h MS Sql Server 2005 Bol on View: 'Creates a virtual table that represents the data in one or more tables in an alternative way.' This is a good and accurate defintion. It is a 'representation' of a select statement. But it does not 'capture' the relationship(s) of the representation. It is unaware of any key(s) and most importantly unaware of any constraints between tables, ie. reference(s). It thus suffers from a deficit disorder as far as application development goes. This retardation causes developers to assume responsibilities the db should handle amongst which is the use of the nightmare t-sql trigger. Here is a very simple example of intelligent view behavior for application development. It uses the D4 language of Dataphor and uses sql server 2005 as the data respository. create table MyOrders { order_nbr:Integer, some_col:Integer, key{order_nbr} }; create table MyOrderDetails { order_nbr:Integer, sku:Integer, item_price:Integer, key{order_nbr,sku}, reference MyDetails_MyOrders{ order_nbr } references MyOrders { order_nbr } }; insert table { row{1 order_nbr, 15 some_col}, row{2,30}, row{3,14} } into MyOrders; insert table { row{1 order_nbr, 1 sku, 500 item_price}, row{1, 2, 205}, row{2, 1, 490}, row{3, 1, 480} } into MyOrderDetails ; This is a natural join. The sql "on MyOrders.order_nbr=MyOrderDetails.order_nbr" is implied by a natural join. create view MyOrdersView MyOrders join MyOrderDetails The keys of each table and the constraint (reference) between the tables are used to resolve each insert into the view. The view inherits the keys of each table and the reference between the tables. insert row{1 order_nbr,3 sku,253 item_price} into MyOrdersView; insert row{2 order_nbr,2 sku,321 item_price} into MyOrdersView; insert row{4 order_nbr,1 sku,42 some_col,321 item_price} into MyOrdersView; select MyOrdersView; order_nbr some_col sku item_price --------- -------- --- ---------- 1 15 1 500 1 15 2 205 1 15 3 253 2 30 1 490 2 30 2 321 3 14 1 480 4 42 1 321 An insert should be resolved to a delete followed by an insert from the view to each base table, ie. an update. (The "adorn" is meta-data attached to the insert allowing the update behavior to take place thru the view). insert row{1 order_nbr,3 sku,25 some_col,632 item_price} into MyOrdersView adorn with { PropagateInsert = "Ensure" }; select MyOrdersView; order_nbr some_col sku item_price --------- -------- --- ---------- 1 25 1 500 1 25 2 205 1 25 3 632 2 30 1 490 2 30 2 321 3 14 1 480 4 42 1 321 select MyOrders; order_nbr some_col --------- -------- 1 25 2 30 3 14 4 42 select MyOrderDetails; order_nbr sku item_price --------- --- ---------- 1 1 500 1 2 205 1 3 632 2 1 490 2 2 321 3 1 480 4 1 321 Before one asks the question of whether or not there is intelligence in the universe, developers should ask if there is intelligence in their application tools -:)
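As a footnote, here is a hedged sketch of what comparable behavior costs in t-sql: the view itself carries no keys or references, so the developer must hand-write an INSTEAD OF trigger (the view and trigger names below are hypothetical) to route an insert to the base tables:

create view MyOrdersView_sql as
select O.order_nbr, O.some_col, D.sku, D.item_price
from MyOrders as O
join MyOrderDetails as D on D.order_nbr = O.order_nbr;
go
create trigger trg_MyOrdersView_ins on MyOrdersView_sql
instead of insert as
begin
  insert into MyOrders (order_nbr, some_col)
  select i.order_nbr, i.some_col
  from inserted as i
  where not exists (select * from MyOrders as o where o.order_nbr = i.order_nbr);

  insert into MyOrderDetails (order_nbr, sku, item_price)
  select order_nbr, sku, item_price
  from inserted;
end;

And this sketch still does not cover the 'Ensure' (delete then re-insert) resolution shown above, or updates and deletes; each needs its own hand-written trigger logic that Dataphor derives from the keys and references.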
bye for now, steve
Dataphor - Sql: what does Update..From mean?
What is really wrong with the "UPDATE.. FROM.." syntax? This example uses MS Sql Server 2005 and the D4 language of Dataphor @ www.alphora.com The discussion is based on the thread: microsoft.public.sqlserver.programming Thursday, June 14, 2007 10:21 AM Update help http://tinyurl.com/27Z8PM Given an update with join representing a 1 to many relationship: UPDATE Orders SET < > FROM Orders INNER JOIN OrderDetails ON Orders.order_nbr = OrderDetails.order_nbr; Joe Celko argues: 'Bottom line: UPDATE FROM *can* be safely used - but only if you are FOREVER certain that no single row in the target table can EVER e joined to more than one row in the source table(s); FOREVER in the ENTIRE LIFETIME of the application; FOREVER across all programmers yet to come.' And from Sql Server 2005 Bol: 'The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.' What is conspiciously missing from the discussion and Bol is one of the most fundamental concepts in database theory - a 'KEY' as opposed to an index. Obtaining the result of a natural join in the language of D4: select Orders join OrderDetails ; order_nbr some_col sku item_price --------- -------- --- ---------- 1 0 1 500 1 0 2 205 2 0 1 490 3 0 1 480 It should be obvious that the key of this table is {order_nbr,sku}. The example of the update in question is: UPDATE Orders SET Orders.some_col = OrderDetails.item_price FROM Orders INNER JOIN OrderDetails ON Orders.order_nbr = OrderDetails.order_nbr; which totally ignores the compound key and implicitly assumes a 1 to 1 relationalship between the tables. So instead of showing a logical flaw in using a join in Update, it simply shows an illogical update! The most logical way an update can use a 1 to N relationalship (ie a compound key) is to perform an update on a single row! This is what the ridiculous 'undefined' in Bol is hiding. An undefined update is the lack of recognition of the appropriate key on the part of the user. The flip side to the coin is the total lack of the concept of a key in Sql Server. Taken together this combination is lethal. In D4 an update that makes sense takes the form: update (Orders join OrderDetails) set { some_col :=item_price } where row{order_nbr order_nbr,sku sku}=row{2 order_nbr,1 sku}; If there was any question of writing an update assuming the existence of a key of just {order_nbr} we could test it by trying to select a row based on just that key, for example select a row using order_nbr 1: select (Orders join OrderDetails adorn {key{order_nbr}})[1 by {order_nbr}]; D4 would give the message: " Internal Index Error: 'Duplicate key violation.' ". which should tell an application developer that any update using the join only makes sense using the compound key. As in: select (SOrders join SOrderDetails)[1,2 by{order_nbr,sku}] ; order_nbr some_col sku item_price --------- -------- --- ---------- 1 0 2 205 D4 is an application development language, sql is a programming language. There is a big difference. 
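For comparison, a hedged t-sql counterpart of the keyed D4 update above, stating the full compound key {order_nbr,sku} so that exactly one OrderDetails row can supply the value:

update O
set O.some_col = D.item_price
from Orders as O
join OrderDetails as D
  on D.order_nbr = O.order_nbr
where D.order_nbr = 2
  and D.sku = 1;

Nothing in Sql Server forces the where clause to name the whole key; that discipline is left entirely to the programmer, which is the point of the post.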
As for the sql standard form in which Celko expresses the update:

UPDATE Orders
SET some_col = (SELECT item_price
                FROM OrderDetails
                WHERE OrderDetails.order_nbr = Orders.order_nbr)
WHERE EXISTS (SELECT *
              FROM OrderDetails
              WHERE OrderDetails.order_nbr = Orders.order_nbr);

It is the SET subquery which returns the error:

'Server: Msg 512, Level 16, State 1, Line 1
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
The statement has been terminated.'

This is nothing more than the Bol idea of an 'undefined' value. The idea of the standard as a substitute for the concept of a key is, of course, ridiculous. The answer to the question of 'which one' can be seen as rather absurd in light of a 'key'. There should never be a question of 'which one' in an RDBMS, as there can only be but one! -:) Interestingly, this issue is an example of searching for a question that is appropriate and makes sense given a response. It is no easy feat sometimes to make the connection -:)

bye for now,
steve
Comment: Anonymous said... This is nearly my first reading of anything to do with Dataphor, so excuse my ignorance. But the assumption that runs strongly through your entire article is that two tables can only be joined to each other in one and only one way. There are many application contexts where this is an overly limiting assumption! Sun Jun 17, 12:04:00 AM Thanks for browsing the post. To understand Dataphor you must realize that the intent of D4 is different than sql. While it is certainly true that just about any sql query can formulated (and many times more succintly) in D4, the goal of each system is different. The goal of sql is performance oriented retrieval of large numbers of rows. The goal of Dataphor is toward application development which is less concerned with retrieval and more concerned with 'addressing' tables. Application development usually implies a small sample of rows from table(s). The real goal is addressing and working with these relatively small samples with a logical clarity not offered by any sql system. The difference between sql and D4 can seen thru the join construct. In sql: select A.OrderID,B.ProductID,A.CustomerID,B.Quantity from Orders A join OrderDetails B on A.OrderID=B.OrderID The intent of the operation is retrieval and the only other information (working with MS Sql Server 2005) the db is concerned with are the index(s) of the associated tables. The same join operation in the context of AD has a different intent. Here we would want to address the space of the join in terms of its 'key'. In other words, to address the projection of the join for a single OrderID and ProductID. For example, create a view of join: create view AllOrdersDetails Orders join OrderDetails; The key of the projection is {OrderID,ProductID} and within the application the row of any combination can be easily obtained, for example Order 11077 and Product 77 (the [ ] in D4 is called an indexer for obvious reasons): var T:=AllOrdersDetails[11077,77 by{OrderID,ProductID}] over{OrderID,ProductID,CustomerID,Quantity} with {IgnoreUnsupported = 'true'} ; So we can easily work with any of the scalar values of the row: select T.CustomerID; select T.Quantity; It is a small sample of rows or a single row that the application is usually concerned with: var Y:=row of typeof((Orders join OrderDetails)[]){ }; Y:=(Orders join OrderDetails)[11077,77 by{OrderID,ProductID}]; select Y over{OrderID,ProductID,CustomerID,Quantity}; The D4 join is in fact restricted to only 'equi-joins'. This is to stress the importance of key inference. With only equi-joins the compiler has an easier job of correctly inferring key(s) as does the developer. For example expressing the join in terms of a 'where' restriction: select ((Orders rename A) times (OrderDetails rename B) where A.OrderID=B.OrderID) ('times' is like an sql cross join) the compiler infers the key: {A.OrderID,B.OrderID,B.ProductID} and an expression to obtain a particular Order and Product becomes: select ((Orders rename A) times (OrderDetails rename B) where A.OrderID=B.OrderID)[11077,11077,77 by{A.OrderID,B.OrderID,B.ProductID}]; Even though this is correct it lacks the 'clarity' of the join. Because sql does not have the logical concept of a key nor the addressing of an expression it can afford to lift any kind of restriction(s) for a join. This is why the sql on clause accepts non equi-joins. Sql, in fact, will take any argument(s) for on ranging from the sensible to the non-sensible. 
It is only concerned with the physical index for retrieval and there are no logical restrictions on the result (for example D4 will not return nor address a table with duplicate key(s) or rows). Since AD translates to expressions of integrity ie. keys, constraints and especially constraints involving tables (references) the equality operator is the primary logical operator of D4 and AD. I should also note that traditional uses of non-equi joins and the introduction of predicates in sql joins can be completely eliminated in AD. For example the use of a self-join to obtain a rank using the Count aggregate: select A.OrderID,A.ProductID,A.Quantity,Count(*) as Rank from OrderDetails A join OrderDetails B on A.OrderID=B.OrderID and A.Quantity>=B.Quantity group by A.OrderID,A.ProductID,A.Quantity order by A.OrderID,Rank desc OrderID ProductID Quantity Rank ----------- ----------- -------- ----------- 10248 11 12 3 10248 42 10 2 10248 72 5 1 10249 51 40 2 10249 14 9 1 If I want information on any quantity for a particular order I can directly address the row(s) without the need to 'retrieve' any information: var Quantity:= ToList ( cursor ( OrderDetails where OrderID=10248 order by {Quantity desc} ) ); var Cnt:=Quantity.Count(); select Cnt; select Quantity[0]; //Max Quantity select Quantity[Cnt-1];//Min Quantity 3 OrderID ProductID UnitPrice Quantity Discount ------- --------- --------- -------- -------- 10248 11 $14.00 12 0 OrderID ProductID UnitPrice Quantity Discount ------- --------- --------- -------- -------- 10248 72 $34.80 5 0 Drop by anytime and I hope you explore Dataphor. best, steve
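As a footnote to the quantity example: the same single-order information can be had in t-sql without the self-join rank, though only as three separate retrievals rather than one addressable list. A hedged sketch, assuming the OrderDetails naming used in the post:

select count(*) as Cnt
from OrderDetails where OrderID = 10248;

select top 1 OrderID, ProductID, UnitPrice, Quantity, Discount
from OrderDetails where OrderID = 10248
order by Quantity desc;   -- Max Quantity

select top 1 OrderID, ProductID, UnitPrice, Quantity, Discount
from OrderDetails where OrderID = 10248
order by Quantity asc;    -- Min Quantity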
Wednesday, June 13, 2007
Dataphor - Passing a table as a parameter
A quick peek at what passing a table as a parameter really means? This example uses the D4 language of Dataphor, MS Sql Server 2005, the Orders and OrderDetails tables from the Northwind db. Consider the following two simple sql queries which use a group by column and return the maximum value of a column for each group. select top 5 EmployeeID,Max(Freight) as MaxFrt from Orders group by EmployeeID order by MaxFrt EmployeeID MaxFrt ----------- --------------------- 6 367.6300 8 398.3600 1 544.0800 4 719.7800 9 754.2600 select top 5 ProductID,Max(UnitPrice) as MaxUP from OrderDetails group by ProductID order by MaxUP ProductID MaxUP ----------- --------------------- 33 2.5000 24 4.5000 13 6.0000 52 7.0000 54 7.4500 We can put each query in a stored procedure and return all the rows. create procedure OrdersMaxFrt as select EmployeeID,Max(Freight) as MaxFrt from Orders group by EmployeeID create procedure OrderDetailsMaxUP as select ProductID,Max(UnitPrice) as MaxUP from OrderDetails group by ProductID Given that the queries concern themselves with 2 mutually exclusive sets of facts (tables) what can be said of any similarities between them? It's clear they perform the same general operation, they both group the data by a single column and return a max of single column. So they return rows consisting of two columns. We can say (loosely speaking) each sp returns a table. What's left? Look at the columns of each query: SELECT TABLE_NAME,COLUMN_NAME,DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'Orders' TABLE_NAME COLUMN_NAME DATA_TYPE ---------- ----------- --------- Orders EmployeeID int Orders Freight money Orders OrderID int SELECT TABLE_NAME,COLUMN_NAME,DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'OrderDetails' TABLE_NAME COLUMN_NAME DATA_TYPE ------------ ----------- --------- OrderDetails OrderID int OrderDetails ProductID int OrderDetails UnitPrice money Each query uses a group by column of int(eger) and a max of a money column. Each query returns a table of an int column and a money. Each table involves a column (OrderID) of int in the primary key definition. All these similarities can be seen as 'structural' similarities. And when we talk of such similarities we are talking the language of 'type' just as we would say the number 5 is an int type. But in this context we are speaking of 'table' type. We can now take all these structural ideas and create a D4 procedure that performs the group by operation on a table that is passed as a parameter. create operator GetaMax(aTable:table{aPkCol:Integer,aGroupCol:Integer,aMoneyCol:Money}): table{aGroupCol:Integer,aMax:Money} begin result:= aTable group by{aGroupCol} add {Max(aMoneyCol) aMax}; end; The operand (argument) aTable is a table defined in terms of its column name and corresponding data type: table{aPkCol:Integer,aGroupCol:Integer,aMoneyCol:Money} The procedure returns a result which is also a table: table{aGroupCol:Integer,aMax:Money} And now we can answer the question of what passing a table as a parameter really means. It just means the definition of the table in terms of its structure, its column names and their data type. By contrast, the only reference to a table in sql is by its name. The structural use of table information in sql does not exist. In other words, there is no table 'type' in sql and therefore nothing on which a table parameter could be based on. In sql there is only the literal representation of a table (its name) whereas in D4 it can be represented as a variable. 
It is the same difference representing the literal string 'ABC' but not being able to assign 'ABC' to a variable of type string. (Note how sql tries to overcome the lack of a table type: dynamic sql ☻ .) Ok, now to how the D4 GetaMax procedure works. The sql stored procedure exec OrdersMaxFrt becomes: select GetaMax(Orders {OrderID aPkCol,EmployeeID aGroupCol,Freight aMoneyCol}) {aGroupCol EmployeeID,aMax MaxFrt} order by {EmployeeID}; EmployeeID MaxFrt ---------- --------- 1 $544.08 2 $810.05 3 $1,007.64 4 $719.78 5 $890.78 6 $367.63 7 $830.75 8 $398.36 9 $754.26 In D4 the: 'GetaMax(Orders {OrderID aPkCol,EmployeeID aGroupCol,Freight aMoneyCol})' means execute the GetaMax operator using the Orders table but renaming the Orders columns to the table type (the columns with their data types) declared in the operator. In renaming the columns and because the columns have the same corresponding data types as declared in the operator we succeed in passing the Orders table (as a variable). So the EmployeeID (int) column is renamed to aGroupCol and Freight (money) to aMoneyCol (same for OrderID). The '{aGroupCol EmployeeID,aMax MaxFrt}' simply renames the generic columns of the operator back to the meaningful column names of the Orders table. Likewise, The sql stored procedure exec OrderDetailsMaxUP becomes: select GetaMax(OrderDetails {OrderID aPkCol,ProductID aGroupCol,UnitPrice aMoneyCol}) {aGroupCol ProductID,aMax MaxUP} order by {ProductID}; ProductID MaxUP --------- ------- 1 $18.00 2 $19.00 3 $10.00 4 $22.00 . Just as a variable defined as int can take an infinite number of values the GetaMax procedure represents the same operation on a possibly infinite number of tables. As I hope you can imagine the concept of a table type/argument has far reaching implications for application development. Finally, if you have Dataphor you can run the following two batches to compare the GetaMax operator to the sql stored procedures. (Note that the D4 operator 'SQLQuery' is used for pass-thru queries sending t-sql directly to Sql Server. Also Sql Server converts the max of a money type to decimal. To make the comparision with GetaMax it must be converted back to money). //Check Orders. var Compare_the_Tables:='The 2 tables are different'; if ( GetaMax(Orders {OrderID aPkCol,EmployeeID aGroupCol,Freight aMoneyCol}) {aGroupCol EmployeeID, aMax MaxFrt} ) = (SQLQuery('exec OrdersMaxFrt') {EmployeeID,ToMoney(MaxFrt) MaxFrt}) then Compare_the_Tables:='The 2 tables are the same'; select Compare_the_Tables; //The 2 tables are the same //Check OrderDetails. var Compare_the_Tables:='The 2 tables are different'; if ( GetaMax(OrderDetails {OrderID aPkCol,ProductID aGroupCol,UnitPrice aMoneyCol}) {aGroupCol ProductID,aMax MaxUP} ) = (SQLQuery('exec OrderDetailsMaxUP') {ProductID,ToMoney(MaxUP) MaxUP}) then Compare_the_Tables:='The 2 tables are the same'; select Compare_the_Tables; //The 2 tables are the same bye for now, steve
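To make the dynamic sql remark above concrete, here is a hedged sketch (the procedure name and its use of quotename/sp_executesql are mine, not from the post) of how sql has to fake a table parameter by passing names and building text:

create procedure GetaMax_sql
  @TableName sysname,
  @GroupCol  sysname,
  @MoneyCol  sysname
as
begin
  declare @sql nvarchar(4000);
  set @sql = N'select ' + quotename(@GroupCol)
           + N', max(' + quotename(@MoneyCol) + N') as aMax'
           + N' from ' + quotename(@TableName)
           + N' group by ' + quotename(@GroupCol);
  exec sp_executesql @sql;
end;

-- exec GetaMax_sql 'Orders', 'EmployeeID', 'Freight';
-- exec GetaMax_sql 'OrderDetails', 'ProductID', 'UnitPrice';

The compiler sees only a string; there is no type checking of the table, its columns, or their data types, which is exactly the contrast the post draws.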
Thursday, June 07, 2007
Dataphor - trimming digits and letters from a string
Trimming leading zeros and trailing letters from a string. This question was raised in the thread: microsoft.public.sqlserver.programming Tuesday, June 05, 2007 2:08 PM Couple of update statements http://tinyurl.com/2n73xf This can done quite easily in Dataphors D4 language without the need of combining string functions into complicated expressions. The following D4 constructs can be used to return a string meeting the criteria: Indexer Expression http://www.alphora.com/docs/O-System.iIndexer.html IndexOfAny http://www.alphora.com/docs/O-System.IndexOfAny.html Concat (Concatenate) http://www.alphora.com/docs/O-System.Concat.html For updating using a table variable or view see: Join/Outer Join/Lookup http://www.alphora.com/docs/DDGRepresentingDatawithTablesandViews-DerivedTableVariables-Join.html http://www.alphora.com/docs/D4LGTableExpressions-OuterJoin.html Here is some sample data entered into MS Sql Server 2005 that we will access with Dataphor: create table LTLData (RowID int primary key,ProNum varchar(25) ) insert into LTLData Values (1,'00234') insert into LTLData Values (2,'030234BD') insert into LTLData Values (3,'33030234BD') insert into LTLData Values (4,'0233030234BD') insert into LTLData Values (5,'0000234BD') insert into LTLData Values (6,'233030234') insert into LTLData Values (7,'00000ABD') insert into LTLData Values (8,'BDBDBD') So for example given the string ''030234BD' we want to trim the leading '0' and the trailing 'BD' and return '30234'. An easy way to do this is transpose the string into rows of a single character and work with the rows to omit the leading zeros and trailing letters. Once the offending rows (characters) are omitted then concatenate the rows to form the new string. Here we use a table of numbers (numbers with a single column num) and the indexer to split each string into rows of a single character. A list of digits of type string (NumList1) is used with IndexOfAny to add the first occurrance of any non zero digit in the string to each row (FirstDigit). If a string has no digit from 1-9 then IndexOfAny returns -1. The 'with {IgnoreUnsupported = 'true'}' is used to suppress the message Dataphor returns that says there is no support in Sql Server for the IndexOfAny function. var NumList1:list(String):={'1','2','3','4','5','6','7','8','9'}; select ( numbers //table of digits (num) from 0-100. times //Like an sql cross join //Get the count (Cnt) of the items in the string and location of 1st digit. LTLData add{Length(ProNum.Trim()) Cnt,ProNum.IndexOfAny(NumList1) FirstDigit} with {IgnoreUnsupported = 'true'} where num<=Cnt-1 {RowID,ProNum,num Index,FirstDigit,ProNum[num] Char} ) order by {RowID,Index}; Here is a sample of the result: RowID ProNum Index FirstDigit Char ----- ------------ ----- ---------- ---- 1 00234 0 2 0 1 00234 1 2 0 1 00234 2 2 2 1 00234 3 2 3 1 00234 4 2 4 2 030234BD 0 1 0 2 030234BD 1 1 3 2 030234BD 2 1 0 2 030234BD 3 1 2 2 030234BD 4 1 3 2 030234BD 5 1 4 2 030234BD 6 1 B 2 030234BD 7 1 D . 7 00000ABD 0 -1 0 7 00000ABD 1 -1 0 7 00000ABD 2 -1 0 7 00000ABD 3 -1 0 7 00000ABD 4 -1 0 7 00000ABD 5 -1 A 7 00000ABD 6 -1 B 7 00000ABD 7 -1 D We now add a single where statement to omit leading zeros and trailing letters. var NumList1:list(String):={'1','2','3','4','5','6','7','8','9'}; select ( numbers //table of digits (num) from 0-100. times //Like a cross join //Get the count (Cnt) of the items in the string and location of 1st digit. 
LTLData add{Length(ProNum.Trim()) Cnt,ProNum.IndexOfAny(NumList1) FirstDigit} with {IgnoreUnsupported = 'true'} where num<=Cnt-1 {RowID,ProNum,num Index,FirstDigit,ProNum[num] Char} ) where (Index>=FirstDigit) and (FirstDigit<>-1 ) and (IsDigit(Char)) order by {RowID,Index}; RowID ProNum Index FirstDigit Char ----- ------------ ----- ---------- ---- 1 00234 2 2 2 1 00234 3 2 3 1 00234 4 2 4 2 030234BD 1 1 3 2 030234BD 2 1 0 2 030234BD 3 1 2 2 030234BD 4 1 3 2 030234BD 5 1 4 3 33030234BD 0 0 3 3 33030234BD 1 0 3 3 33030234BD 2 0 0 3 33030234BD 3 0 3 3 33030234BD 4 0 0 3 33030234BD 5 0 2 3 33030234BD 6 0 3 . 6 233030234 0 0 2 6 233030234 1 0 3 6 233030234 2 0 3 6 233030234 3 0 0 6 233030234 4 0 3 6 233030234 5 0 0 6 233030234 6 0 2 6 233030234 7 0 3 6 233030234 8 0 4 The where statement did not affect RowID 6 but completely eliminated RowIDs 7 and 8. The Concat operator can now be used to form a new string based on the rows meeting the criteria. The Index column is used in Concat to order the new string in the same way it was transposed. A delimiter (Del) is defined as an empty string ('') so when the new string is formed there are is no separation between the characters (Char). Instead of operating on the rows of LTLData, here is a select statement that trims a single string. The 'with {IgnoreUnsupported = 'true'}' is used with Concat since we have not mapped Concat to any Sql Server function. Like IndexOfAny the Dataphor server will assume responsibility for this operation. (Interested readers can search Dataphor help for 'query chunking' to see how Dataphor interacts with a device, ie Sql Server.) var NumList1 := {'1','2','3','4','5','6','7','8','9'}; var Str:='030234BD'; select Concat({Char,Del} from ( numbers where num<=Length(Str.Trim())-1 {Str StrNum,num Index,Str[num] Char,Str.IndexOfAny(NumList1) FirstDigit,'' Del} with {IgnoreUnsupported = 'true'} where (Index>=FirstDigit) and (FirstDigit<>-1 ) and (IsDigit(Char)) ) order by {Index} ) with {IgnoreUnsupported = 'true'}; Which returns '30234'. We can now use the above select to create an operator that will return a new string for each row of LTLData. create operator TrimStr(Str:String):String begin var NumList1 := {'1','2','3','4','5','6','7','8','9'}; result:= Concat({Char,Del} from ( numbers where num<=Length(Str.Trim())-1 {Str StrNum,num Index,Str[num] Char,Str.IndexOfAny(NumList1) FirstDigit,'' Del} with {IgnoreUnsupported = 'true'} where (Index>=FirstDigit) and (FirstDigit<>-1 ) and (IsDigit(Char)) ) order by {Index} ) with {IgnoreUnsupported = 'true'}; end; We now use operator TrimStr to return the correctly trimmed new string (NewStr). select LTLData add{TrimStr(ProNum) NewStr} with {IgnoreUnsupported = 'true'}; RowID ProNum NewStr ----- ------------ ---------- 1 00234 234 2 030234BD 30234 3 33030234BD 33030234 4 0233030234BD 233030234 5 0000234BD 234 6 233030234 233030234 7 00000ABD <No Value> 8 BDBDBD <No Value> The ProNum column in LTLData can now be updated using TrimStr. The 'IsNotNil()' is similar to the sql 'not IsNull()'. So only ProNum values will be updated where TrimStr return a non-nil(null) value. (Note that D4 uses the term 'nil' whereas sql uses 'null'.) The "with {ShouldSupport='false'}" tag is used to tell the Dataphor sever to take responsibility for evaluating 'IsNotNil(TrimStr(ProNum))'. There is no way for Sql Server to evaluate this expression. A "{ShouldSupport='true'}" (the default) means Sql Server should try to support the expression. 
Without the "with {ShouldSupport='false'}" the update will execute but Dataphor will issue a warning message. update LTLData set { ProNum:= TrimStr(ProNum) } where IsNotNil(TrimStr(ProNum) with {ShouldSupport='false'} ) ; select LTLData; RowID ProNum ----- --------- 1 234 2 30234 3 33030234 4 233030234 5 234 6 233030234 7 00000ABD 8 BDBDBD Finally, we recreate LTLData and show a query which uses Concat in a group by to form new trimmed strings for each RowID. var NumList1 := {'1','2','3','4','5','6','7','8','9'}; select numbers times LTLData add{Length(ProNum.Trim()) Cnt,ProNum.IndexOfAny(NumList1) FirstDigit} with {IgnoreUnsupported = 'true'} where num<=Cnt-1 {RowID,ProNum,num Index,FirstDigit,ProNum[num] Char,'' Del} where (Index>=FirstDigit) and (FirstDigit<>-1 ) and (IsDigit(Char)) group by {RowID} add {Max(ProNum) ProNum, Concat(Char,Del order by {RowID,Index}) NewPro} order by {RowID}; RowID ProNum NewPro ----- ------------ --------- 1 00234 234 2 030234BD 30234 3 33030234BD 33030234 4 0233030234BD 233030234 5 0000234BD 234 6 233030234 233030234 Update with a view. We can define a view and use it in a join to update ProNum with NewPro (from the view). create view LTLDataView numbers times LTLData add{Length(ProNum.Trim()) Cnt,ProNum.IndexOfAny( {'1','2','3','4','5','6','7','8','9'}) FirstDigit} with {IgnoreUnsupported = 'true'} where num<=Cnt-1 {RowID,ProNum,num Index,FirstDigit,ProNum[num] Char,'' Del} where (Index>=FirstDigit) and (FirstDigit<>-1 ) and (IsDigit(Char)) group by {RowID} add {Max(ProNum) ProNum, Concat(Char,Del order by {RowID,Index}) NewPro} ; The lookup operator is similar to a join but can be used to control just which table(s) should be the target of an update. We only wish to update LTLData and don't want to go thru the view to do it. We only want the view to supply the new string NewPro. The lookup encapsulates this idea. Note the lookup (join) is a natural join using RowID. update LTLData lookup (LTLDataView remove{ProNum} ) set { ProNum:= NewPro } ; select LTLData; RowID ProNum ----- --------- 1 234 2 30234 3 33030234 4 233030234 5 234 6 233030234 7 00000ABD 8 BDBDBD Instead of a view a table variable could be also be used. var NumList1 := {'1','2','3','4','5','6','7','8','9'}; var Y:= numbers times LTLData add{Length(ProNum.Trim()) Cnt,ProNum.IndexOfAny(NumList1) FirstDigit} with {IgnoreUnsupported = 'true'} where num<=Cnt-1 {RowID,ProNum,num Index,FirstDigit,ProNum[num] Char,'' Del} where (Index>=FirstDigit) and (FirstDigit<>-1 ) and (IsDigit(Char)) group by {RowID} add {Max(ProNum) ProNum, Concat(Char,Del order by {RowID,Index}) NewPro} ; update LTLData lookup (Y remove{ProNum} ) set { ProNum:= NewPro } ; select LTLData; RowID ProNum ----- --------- 1 234 2 30234 3 33030234 4 233030234 5 234 6 233030234 7 00000ABD 8 BDBDBD Bye for now, steve
Sunday, June 03, 2007
Dataphor - String differences operator
The following is a string operator based on a suggestion in the thread: microsoft.public.sqlserver.programming Friday, June 01, 2007 7:53 AM Compare String http://tinyurl.com/38e7zg It is written in D4 using Dataphor from: www.alphora.com (Go to operator Cstr). Given a comma delimited string in the form of: ' string1 , string2 ' the the operator returns a row with the following columns: removed - the difference between string1 and string2 (string1 diff-> string2). It is the strings that remain in string1 after removing any string also found in string2. added - is the same as removed but in the opposite direction, ie. the difference between string2 diff-> string1. IsRem - a boolean. True if there is a difference in the direction of removal, false otherwise. IsAdd - boolean, same as IsRem but from the direction of added. IsSame - a boolean, true if there are no differences from either direction and the order of individual string parts are the same in string1 and string2. If the string parts in each string are equal and they occur in the same position they are the same otherwise they are not. This scalar is based on an observation by Gert-Jan Strik in the thread. In the event that there is no difference the operator returns the string for the appropriate direction. If there's no difference for removed string1 is returned cleanly formatted, ie a blank between each string part. The same for added. Whether or not a difference exists can be determined by checking IsRem and IsAdd. The same formatting extends for differences. Removed and added are always formatted with a blank between each string part. Note that it is assumed that a blank is the delimiter within string1 and string2. This can easily be changed as the 'Split' operator can take a list of delimiters. Also note how nicely the D4 'without' construct lends itself to obtaining the difference in either direction between the strings. Lastly, instead of passing in a delimited string a list could have been used. In this case it does not appear to make much difference which one is used. The operator returns a row type. No declarative programming is used. Only dedicated list, string and relational operators are used. While a bit long winded compared to a dedicated D4 declarative approach, it illustrates the relative ease with which the expressive power of the language can be used. *********************************************************** Examples: 1. This is the example from the thread. It shows multiple differences in each direction. select Cstr('The red fox is big and fat. The,The green frog is slimey. The g'); This returns the following row: removed added IsRem IsAdd IsSame -------------------- -------------------- ----- ----- ------ red fox big and fat. green frog slimey. g True True False 2. A single difference in each direction. select Cstr(' B1 C1 A1 , D1 B1 A1'); removed added IsRem IsAdd IsSame ------- ----- ----- ----- ------ C1 D1 True True False 3. select Cstr(' white red blue black , red yellow pink '); removed added IsRem IsAdd IsSame ---------------- ----------- ----- ----- ------ white blue black yellow pink True True False 4. select Cstr(' white red blue black , white red blue black pink '); Since there is no difference from removal (string1) string1 is returned and IsRem is False. There is a single difference from added and IsAdd is True. removed added IsRem IsAdd IsSame -------------------- ----- ----- ----- ------ white red blue black pink False True False 5. 
5.

select Cstr(' white red blue black pink , white red blue black ');

Same as the last example but the difference is reversed.

removed added                IsRem IsAdd IsSame
------- -------------------- ----- ----- ------
pink    white red blue black True  False False

6.

select Cstr(' white red blue black pink , white red blue black pink');

There is no difference in either direction and string1 and string2 are returned. Since the string parts (colors) are the same and occur in the same order, IsSame is True.

removed                   added                     IsRem IsAdd IsSame
------------------------- ------------------------- ----- ----- ------
white red blue black pink white red blue black pink False False True

7.

select Cstr(' white red blue black pink , black pink blue white red');

The colors in each string match but occur in a different order, therefore IsSame is False.

removed                   added                     IsRem IsAdd IsSame
------------------------- ------------------------- ----- ----- ------
white red blue black pink black pink blue white red False False False

8. This returns only the scalar removed from the row.

select Cstr(' B1 C1 A1 , D1 B1 A1').removed ;

C1

9. This returns it as the column Rem.

select TableDee add {Cstr(' B1 C1 A1 , D1 B1 A1').removed Rem} ;

Rem
---
C1

10. Assigning the result to a variable. Note that the argument (operand) to the operator can be constructed within the call.

var Rem:String;
var Str1:=' working with variables in Dataphor ';
var Str2:=' working with variables';
var Str:=Cstr(Str1+','+Str2);
if Str.IsRem then
 Rem:=Str.removed
else
 Rem:=' ';
select Rem;

in Dataphor

11. The differences can be used to add a column to the result showing the string parts in common.

var Str1:='BLACK WHITE PURPLE GREEN ORANGE YELLOW';
var Str2:='WHITE BLACK PINK NAVY PURPLE YELLOW GOLD BEIGE';
select Cstr(Str1+','+Str2)
 add
  {
   Concat({Str,Del}
    from
     (
      (ToTable((Str1+','+Str2).Split({',',' '}),'Str') {Str.Trim() Str} where Str>' ')
       without
      ToTable( ((Cstr(Str1+','+Str2) {removed+' '+added Str}).Str).Split({' '}),'Str')
     ) add {' ' Del}
    order by {Str}) commonstr
  }
   {removed,added,commonstr,IsRem,IsAdd};

removed      added                commonstr                 IsRem IsAdd
------------ -------------------- ------------------------- ----- -----
GREEN ORANGE PINK NAVY GOLD BEIGE BLACK PURPLE WHITE YELLOW True  True

12. Define a table variable to hold multiple rows of the operator.

var TableStr:=
 table
  {
   Cstr(' A1 b1 C1 d1 , A1 B1 E1 F2 ') add{1 RowID},
   Cstr(' A B C , ') add{2 RowID},
   Cstr('AA A AA a , AA a') add{3 RowID},
   Cstr(' , ') add {4 RowID},
   Cstr(' 10 8 7 6 , 10 8 7 6 ') add{5 RowID},
   Cstr(' z 1 44_A , 1 44_A z ') add{6 RowID}
  } {RowID,removed,IsRem,added,IsAdd,IsSame} ;
select TableStr order by {RowID};

RowID removed  IsRem added    IsAdd IsSame
----- -------- ----- -------- ----- ----------
1     b1 C1 d1 True  B1 E1 F2 True  False
2     A B C    False          False False
3     A        True  AA a     False False
4              False          False <No Value>
5     10 8 7 6 False 10 8 7 6 False True
6     z 1 44_A False 1 44_A z False False

13. Accessing data in MS Sql Server 2005. Create some data in the pubs database and access it with Dataphor.

create table Tstrings (RID int primary key,Astr varchar(150))
insert Tstrings values(1,'AA BB CC DD TT, YY NN AA BB R1')
insert Tstrings values(2,'FF1 4564 33AAA T1 55, 33AAA FF1 33AA Y4564 ')
insert Tstrings values(3,'100 81 1 353 30 A64,1 81 82 R34 353 A353')
insert Tstrings values(4,'The red fox is big and fat. The,The green frog is slimey. The g')
insert Tstrings values(5,'This is a fun test, This is a fun test ')
insert Tstrings values(6,'This is a fun test, This test is a fun ')
insert Tstrings values(7,'Does this stuff make sense? -:), ')
insert Tstrings values(8,'I hope you,I hope you explore Dataphor ')

Reconcile(Name("Pubs")); //Make the data available to Dataphor.

select Tstrings
 add
  {
   Cstr(Astr).removed removed, Cstr(Astr).added added,
   Cstr(Astr).IsRem IsRem, Cstr(Astr).IsAdd IsAdd,
   Cstr(Astr).IsSame IsSame
  } with {IgnoreUnsupported='true'}
   remove{Astr}; //So the row doesn't get too long -:)

RID removed                         added                IsRem IsAdd IsSame
--- ------------------------------- -------------------- ----- ----- ------
1   CC DD TT                        YY NN R1             True  True  False
2   4564 T1 55                      33AA Y4564           True  True  False
3   100 30 A64                      82 R34 A353          True  True  False
4   red fox big and fat.            green frog slimey. g True  True  False
5   This is a fun test              This is a fun test   False False True
6   This is a fun test              This test is a fun   False False False
7   Does this stuff make sense? -:)                      False False False
8   I hope you                      explore Dataphor     False True  False

End of examples.
***********************************************************

In sql (MS Sql Server) a procedure can have multiple output parameters. In D4 there is no such concept as multiple output parameters. Multiple output parameters in D4 means either multiple operators each returning a scalar value, returning a row, or returning a table with a single row.
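By way of contrast, here is a minimal T-SQL sketch of the multiple-output-parameter style just mentioned. The procedure name, parameter names, and sizes are invented for illustration, and the body is only a stub; the real comparison logic is the D4 operator that follows, not this procedure.

-- Hypothetical stored procedure shape: several values come back through
-- OUTPUT parameters rather than as columns of a single returned row.
CREATE PROCEDURE dbo.CompareStrings
    @Str     varchar(300),
    @Removed varchar(150) OUTPUT,
    @Added   varchar(150) OUTPUT,
    @IsRem   bit OUTPUT,
    @IsAdd   bit OUTPUT,
    @IsSame  bit OUTPUT
AS
BEGIN
    -- The difference logic is omitted; only the calling shape matters here.
    SELECT @Removed = '', @Added = '', @IsRem = 0, @IsAdd = 0, @IsSame = 0;
END

The caller would EXEC the procedure passing local variables marked OUTPUT and then read each one separately. In D4 the single row returned by Cstr carries all five values at once, which is why the operator below declares a row result type instead.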
create operator Cstr(Str:String):
 row{removed:String,IsRem:Boolean,IsSame:Boolean,added:String,IsAdd:Boolean}
begin
 result:=row of typeof(result){};
 var ListStr:=Str.Split({','});
 //Check that input operand Str has a single comma, raise error and exit if false.
 var StrCnt:=ListStr.Count();
 if StrCnt<>2 then
 begin
  raise Error ("Input Str: "+ Unicode({39})+ Trim(Str) + Unicode({39}) + " has "
               + ToString(StrCnt-1) + " comma's, the operator expects a single comma delimited string ");
  exit;
 end;
 var Idummy:Integer;
 if ListStr.Count()=1 then
  Idummy:=ListStr.Add(' ');
 //Form row (all scalar values) and modify row values when appropriate.
 result:=
  row
  {
   //removed
   (
    IfNil(Concat
     ({Str,Del}
      from
       (
        (
         (ToTable(ListStr[0].Split({' '}),'Str') {Str.Trim() Str} where Str>' ')
          without
         (ToTable(ListStr[1].Split({' '}),'Str') {Str.Trim() Str} where Str>' ')
        )
         join
        ((ToTable(ListStr[0].Split({' '}),'Str') {Str.Trim() Str,sequence seq} where Str>' ') add{' ' Del})
       )
      order by {seq}
     ),ListStr[0])
   ) removed,
   //added
   (
    IfNil(Concat
     ({Str,Del}
      from
       (
        (
         (ToTable(ListStr[1].Split({' '}),'Str') {Str.Trim() Str} where Str>' ')
          without
         (ToTable(ListStr[0].Split({' '}),'Str') {Str.Trim() Str} where Str>' ')
        )
         join
        ((ToTable(ListStr[1].Split({' '}),'Str') {Str.Trim() Str,sequence seq} where Str>' ') add{' ' Del})
       )
      order by {seq}
     ),ListStr[1])
   ) added,
   true IsRem,
   true IsAdd,
   false IsSame
  };
 //Check removed.
 if (result.removed=ListStr[0])
  or
  (
   ( ToTable((result.removed).Split({' '}),'Str') {Str.Trim() Str} where Str>' ' )
    =
   ( ToTable(ListStr[0].Split({' '}),'Str') {Str.Trim() Str} where Str>' ' )
  ) then
  update result
   set
   {
    IsRem:=false,
    removed:=IfNil(Concat( {Str,Del}
                    from ( ToTable(ListStr[0].Split({' '}),'Str') where Str>' ' add{' ' Del} )
                    order by {sequence} ),' ')
   };
 //Check added.
 if (result.added=ListStr[1])
  or
  (
   ( ToTable((result.added).Split({' '}),'Str') {Str.Trim() Str} where Str>' ' )
    =
   ( ToTable(ListStr[1].Split({' '}),'Str') {Str.Trim() Str} where Str>' ' )
  ) then
  update result
   set
   {
    IsAdd:=false,
    added:=IfNil(Concat( {Str,Del}
                  from ( ToTable(ListStr[1].Split({' '}),'Str') where Str>' ' add{' ' Del} )
                  order by {sequence} ),' ')
   };
 //If IsRem is false, are the 2 strings the same.
 if not(result.IsRem) then
  if (
      (ToTable(ToList(cursor(ToTable(ListStr[0].Split({' '}),'Str') {Str.Trim() Str,sequence seq} where Str>'' order by {seq}))) remove {seq})
       =
      (ToTable(ToList(cursor(ToTable(ListStr[1].Split({' '}),'Str') {Str.Trim() Str,sequence seq} where Str>'' order by {seq}))) remove {seq})
     ) then
   update result set {IsSame:=true};
 //We cannot logically say 2 blank strings are the same, can we?
 if Length(Replace(Replace(Str,',',''),' ',''))=0 then
  update result set {IsSame:=nil};
end;
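Finally, to relate the D4 'without' used throughout the operator to a T-SQL idiom: the directional difference between the two sets of string parts is essentially what EXCEPT (available from SQL Server 2005 onward) computes. The sketch below assumes a helper table-valued function dbo.SplitParts that returns the non-blank, space-delimited pieces of a string as rows in a column named Part; SQL Server 2005 ships no such splitter, so the function name and shape are hypothetical.

-- "removed" direction: parts present in the first string but not the second.
-- dbo.SplitParts(@s) is a hypothetical table-valued splitter returning
-- one row per non-blank, space-delimited piece in a column named Part.
DECLARE @Str1 varchar(150), @Str2 varchar(150);
SET @Str1 = ' B1 C1 A1 ';
SET @Str2 = ' D1 B1 A1 ';

SELECT Part FROM dbo.SplitParts(@Str1)
EXCEPT
SELECT Part FROM dbo.SplitParts(@Str2);   -- yields C1, the "removed" difference

Swapping @Str1 and @Str2 gives the 'added' direction. What EXCEPT does not supply is the original ordering of the surviving parts, which the D4 operator recovers by joining back on the sequence column before the Concat.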