Beyond Sql: Dataphor - Simplifying Relational Division

If your not familiar with relational division a popular article
for the translation of the concept into sql is:

Relational Division
by Joe Celko

Basically relational division is the question of whether a set
of predicates exist in a given table. Generally the predicates
represent a set of 'rows' and therefore the question becomes
does a predicate in the form of a 'table' exist in another
table. The predicate rows are referred to as the divisor table,
the given table as the dividend table and the result as the 
quotient table. Hence the idea of division by tables. 

Since sql does not support a table 'type' various solutions
have been offered to target the rows of the given (dividend)
table. Most often the means of targeting rows are the use
of aggregate functions like count. Because sql cannot directly
compare one set of rows (ie. a table) to another set of rows (a table)
the forced simulation is not unlike the attempt to simulate
a list with the substring function as explained here. The resulting
sql solutions can at the very least be unintuitive and convoluted 
and at the very most not even possible to specify (at least for
most human beings ☺. 

Given that Dataphor supports direct table comparisons the logic
of a relational division problem can be greatly simplified.

This example uses MS Sql Server 2005 and is based on the post:

Thursday, July 12, 2007 11:07 PM
microsoft.public.sqlserver.programming
Expert Challenge - Complex Join, Group By, Having

Here is some sample data. This is the dividend table.

create table CTable
{
 ID:Integer,
 Answer:String,
 Wcnt:Integer,
 Word:String,
 key{ID}
}; 
insert
table
{
row{1 ID,'first' Answer, 2 Wcnt,'george' Word},
row{2,'first', 2, 'burns'},
row{3,'second',2, 'burns'},
row{4,'second',2, 'burns'},
row{5,'third', 3,    'go'},
row{6,'third', 3,'george'},
row{7,'third', 3, 'burns'},
row{8,'fourth',2,  'fred'},   
row{9,'fourth',2,'george'}  
} into CTable;

Given a match (divisor) table:

(test #1)
declare @match table([ID] INT IDENTITY, [word] varchar(20))
insert @match([word] ) select 'burns'
insert @match([word] ) select 'george'

what the op refers to as a 'match' table, the relational division
question becomes find the rows in the dividend table that correspond
to the divisor rows. And it follows that we compare the Word and
ID columns of the tables. 

We now make two important observations:
1. The Answer column of the dividend table defines rows that belong
   together. These rows form a table.
2. Since we are interesting in comparing columns of two tables we note
   that it is unfeasible to try to compare the ID columns of the dividend
   and divisor tables since the ID column of the divisor table can take
   any value based on the Identity function. With a little insight we
   see that we can compare them if we first 'standardize' the ID column
   in both tables. We can do this by computing the 'rank' of Words in
   each table and compare the ranks instead of IDs. We also note that
   the way the original problem was posed the Word column is of primary
   significance. In other words, it is the presence of the divisor Words
   that determines a match. There is no significance attached to ID, we
   are simply standarizing it to include it in a comparison. Because it 
   has no meaning as to determining a match which ever way we standarize 
   as long as we are consistent within both tables it will be sufficient.

Sql solution   

Here is one of many sql solutions. The row_number() function is used
to standarize the ID columns in both tables and Rank is used in the
comparison. 

-- Match/divisor table (test#1).
declare @match table([ID] INT IDENTITY, [word] varchar(20))
insert @match([word] ) select 'burns'
insert @match([word] ) select 'george'

select C.ID,C.Answer,C.Wcnt,C.Word
from CTable as C
where C.Answer in
(
select B.Answer 
from
-- Note that if we order by 'Word desc' for the ranks in both selects it
-- will not make a difference, ie. (consistency for ranks between tables).
(select ID,Word,row_number()over(order by Word) Rank
     from @match as M) as A
right join
(select ID,Answer,Wcnt,Word,
    row_number()over(partition by Answer order by Word) Rank
     from CTable) as B
on A.Rank=B.Rank and A.Word=B.Word
group by B.Answer
having (count(A.rank)=count(*)) and (count(*)=(select count(*) from @match))
)

ID          Answer               Wcnt        Word                 
----------- -------------------- ----------- -------------------- 
1           first                2           george
2           first                2           burns

Note that the join is necessary but not sufficient. The rows of Answer
must be further restricted by aggregate comparisons using the count
aggregate. This methodology rests on being able to express a quotient
table using aggregate functions. But as the criteria for such a table
becomes increasingly more complex so does the query. It is interesting
that proponents of this methodology think it is quite helpful in 
understanding relation divison. For example the authors of:

A Simpler (and Better) SQL Approach to Relational Division

state 'We believe the syntactical construction of Q0 allows the student to
grasp the concepts of implementing SQL division in a more intuitive way'.
Perhaps most telling is their framing of the question of relational division
where they state:

'A common type of database query requires one to find all tuples of some table
 that are related to each and every one of the tuples of a second group.'
 
This seems to a reflection of the fact that there is no direct way to
compare tables in sql. What would appear much more to the point is simply:

'A common type of database query is to compare two tables'.

And this is precisely what we will do in the D4 language of Dataphor.

Developing a D4 solution

Keep in mind the very precise and simple definition of table equality:
Two tables are equal if they have the same named columns of the same 
data types and they have the same set of rows. In other words, they
are the same 'type' and have the same rows.

The following simple D4 query illustrates how easy relational division
can be. The query ignores ID and only compares tables based on Word.
It uses match test#1:

select
//Each row in CTable that has the particular Answer value given by the where
//predicate being true will be returned by the having operator.
CTable 
having
 ( 
//Every Answer value forms a (dividend) table. We get the Answer value 
//by comparing the match (divisor) table to each table formed by a 
//particular Answer.
 CTable {Answer AnswerI} 
//We compare two tables, each row in the match table must exist
//in the table formed by an Answer value.
  where 
   (table{row{'burns' Word},row{'george'}}) //match table.
   = 
  ((CTable where Answer=AnswerI with { ShouldSupport="false"}) {Word} )
   {AnswerI Answer }
 );   

ID Answer Wcnt Word   
-- ------ ---- ------ 
1  first  2    george 
2  first  2    burns  

The above query will work for matches #1 and #3 but will fail for #2.

select
 CTable 
  having
  ( 
   CTable {Answer AnswerI} 
    where 
  //Using test #2.
    (table{row{'burns' Word},row{'burns'}}) 
     = 
   ((CTable where Answer=AnswerI with { ShouldSupport="false"}) {Word} )
   {AnswerI Answer }
 );   
-- Internal Index Error: "Duplicate key violation."

Which makes the point that ID is necessary given duplicate Words.

Now lets get serious with how we're going to do this example in D4 ☺.

First we're going to create a view in D4 using an sql pass-thru query with
row_number(). This will give us a rank (from 1 to Wcnt) in the direction of
Word for each Answer (the partition column).

create view CTableViewWord 
SQLQuery('select ID,Answer,Wcnt,Word,
    row_number()over(partition by Answer order by Word) Rank
     from CTable','key{Answer,Rank}') {ID,ToInteger(Rank) Rank,Answer,Wcnt,Word};

select 
 CTableViewWord 
        join
           (CTable group by {Answer} add{Min(ID) Asort})
             with {IgnoreUnsupported = 'true'}
             return 50 by {Asort,Rank}
                remove{Asort};

ID Rank Answer Wcnt Word   
-- ---- ------ ---- ------ 
2  1    first  2    burns  
1  2    first  2    george 
3  1    second 2    burns  
4  2    second 2    burns  
7  1    third  3    burns  
6  2    third  3    george 
5  3    third  3    go     
8  1    fourth 2    fred   
9  2    fourth 2    george 

D4 operator Words is the sql query expressed in a relational way. It too
standarizes ID by using a rank for comparison. Like the sql query it
only finds tables where there is the same occurrance of the Words. The
values of ID are not pertinent to the intent of the problem. The intent
is only to find the same Words ignoring the IDs. The operator takes a match
table as an argument and returns matching tables from CTable based on Answers.

//Matches with Word occurrence only.
create operator Words(MatchTable:table{ID:Integer,Word:String}):typeof(CTable)
begin      
result:=table of typeof(result){};  
//Create a table variable with a rank in the direction of Word. This is similar
//to sql row_number() function.               
var MatchTableRank:= 
              (ToTable(ToList(cursor(MatchTable order by {Word})))
                    {sequence+1 Rank,Word}) ;
var RTable:=
CTable
  having
        ( 
         (CTable {Answer AnswerI})
          where 
           MatchTableRank
            =
             (  
              (CTableViewWord where Answer=AnswerI)
               {Rank,Word} 
             ) with {IgnoreUnsupported = 'true'} 
              {AnswerI Answer} 
        );
//Result table will show nil values if RTable is empty.
if exists(RTable) then
result:=RTable
 else
  insert row{nil ID,nil Answer,nil Wcnt,nil Word} into result;
end;

To clarify just what operator Words is doing lets insert some more data
into table CTable.

insert
table
{
row{10 ID,'fifth' Answer,3 Wcnt,'burns' Word}, 
row{11,'fifth', 3,   'burns'}, 
row{12,'fifth', 3,   'burns'}, 
row{13,'sixth',4,   'arthur'},  
row{14,'sixth',4,   'arthur'}, 
row{15,'sixth',4,    'burns'}, 
row{16,'sixth',4,    'burns'}, 
row{17,'seventh',4,  'burns'},  
row{18,'seventh',4,  'burns'}, 
row{19,'seventh',4, 'arthur'}, 
row{20,'seventh',4, 'arthur'}, 
row{21,'eight',4,   'arthur'},  
row{22,'eight',4,    'burns'}, 
row{23,'eight',4,   'arthur'}, 
row{24,'eight',4,    'burns'}, 
row{25,'nineth',4,   'burns'},  
row{26,'nineth',4,   'burns'}, 
row{27,'nineth',4,   'burns'}, 
row{28,'nineth',4,  'arthur'}, 
row{29,'tenth',4,    'burns'},  
row{32,'tenth',4,    'burns'}, 
row{35,'tenth',4,    'burns'}, 
row{38,'tenth',4,   'arthur'}, 
row{39,'eleventh',4, 'burns'},  
row{41,'eleventh',4, 'burns'}, 
row{43,'eleventh',4, 'burns'}, 
row{45,'eleventh',4,'arthur'} 
} into CTable;

Examples:

select Words(table{row{11 ID,'burns' Word},row{22,'burns'}});

ID Answer Wcnt Word  
-- ------ ---- ----- 
3  second 2    burns 
4  second 2    burns 

select Words(table{row{11 ID,'burns' Word},row{12,'burns'},row{17,'burns'}});

ID Answer Wcnt Word  
-- ------ ---- ----- 
10 fifth  3    burns 
11 fifth  3    burns 
12 fifth  3    burns 

The following three selects all return the same table since table
equality is based on the occurrence of the three Words. The value of 
ID and the sequence of Words in the match table is immaterial since the
comparison of Rank between the divisor (match) and dividend tables is
based on Word.

select Words(table{row{11 ID,'burns' Word},row{12,'george'},row{17,'go'}});
select Words(table{row{11 ID,'go' Word},row{12,'george'},row{17,'burns'}});
select Words(table{row{12 ID,'george' Word},row{13,'go'},row{14,'burns'}});

ID Answer Wcnt Word   
-- ------ ---- ------ 
5  third  3    go     
6  third  3    george 
7  third  3    burns  

These two selects return 3 Answers and conceptually 3 tables because the two
Words, 'arthur' and 'burns', occur twice within each Answer.

select Words(table{row{11 ID,'arthur' Word},row{17,'burns'},row{20,'arthur'},row{21,'burns'}});
select Words(table{row{11 ID,'burns' Word},row{17,'arthur'},row{20,'arthur'},row{21,'burns'}});

ID Answer  Wcnt Word   
-- ------- ---- ------ 
13 sixth   4    arthur 
14 sixth   4    arthur 
15 sixth   4    burns  
16 sixth   4    burns  
17 seventh 4    burns  
18 seventh 4    burns  
19 seventh 4    arthur 
20 seventh 4    arthur 
21 eight   4    arthur 
22 eight   4    burns  
23 eight   4    arthur 
24 eight   4    burns  

The idea of using the aggregate count in sql as a solution to relational
division problems is well suited to finding occurrences of things, ie. Words.
With D4 it is obviously not necessary to 'invent' an idea like count.
With table comparisons it is not necessary to go beyond the 'data' in the rows.

Up till now the ID value has not been meaningful. Lets change that. Here
we want to match the occurrence of Word and the ID column. Said another way,
we want to match the sequence of Words within an Answer.

We create another view where the rank is now based on ID.

create view CTableViewID 
SQLQuery('select ID,Answer,Wcnt,Word,
    row_number()over(partition by Answer order by ID) Rank
     from CTable','key{Answer,Rank}') {ID,ToInteger(Rank) Rank,Answer,Wcnt,Word};
     
select 
 CTableViewID 
        join
           (CTable group by {Answer} add{Min(ID) Asort})
             with {IgnoreUnsupported = 'true'}
              where Answer in ({'first','second','third','fourth','sixth','eight'})
               return 50 by {Asort,Rank}
                remove{Asort};
     
ID Rank Answer Wcnt Word   
-- ---- ------ ---- ------ 
1  1    first  2    george 
2  2    first  2    burns  
3  1    second 2    burns  
4  2    second 2    burns  
5  1    third  3    go     
6  2    third  3    george 
7  3    third  3    burns  
8  1    fourth 2    fred   
9  2    fourth 2    george 
13 1    sixth  4    arthur 
14 2    sixth  4    arthur 
15 3    sixth  4    burns  
16 4    sixth  4    burns  
21 1    eight  4    arthur 
22 2    eight  4    burns  
23 3    eight  4    arthur 
24 4    eight  4    burns  

Operator IDandWord is similar to operator Words but here Rank reflects
a meaningful sequence. Now the order of the match table is relevant to
the equality of the table comparison.

create operator IDandWord(MatchTable:table{ID:Integer,Word:String}):typeof(CTable)
begin      
result:=table of typeof(result){};  
//Create a table variable with a rank in the direction of ID. This is similar
//to sql row_number() function.               
var MatchTableRank:= 
              (ToTable(ToList(cursor(MatchTable order by {ID})))
                    {sequence+1 Rank,Word}) ;
var RTable:=
CTable
  having
        ( 
         (CTable {Answer AnswerI})
          where 
           MatchTableRank
            =
             (  
              (CTableViewID where Answer=AnswerI)
               {Rank,Word} 
             ) with {IgnoreUnsupported = 'true'} 
              {AnswerI Answer} 
        );
//Result table will show nil values if RTable is empty.
if exists(RTable) then
result:=RTable
 else
  insert row{nil ID,nil Answer,nil Wcnt,nil Word} into result;
end;

Here the sequence of the ID/Word combination in the match table, when
standarized, is the same as the sequence in CTable so we have a match.

select IDandWord(table{row{11 ID,'go' Word},row{12,'george'},row{17,'burns'}});

ID Answer Wcnt Word   
-- ------ ---- ------ 
5  third  3    go     
6  third  3    george 
7  third  3    burns  

But these sequences of ID/Word do not match any sequence in CTable.

select IDandWord(table{row{11 ID,'burns' Word},row{12,'george'},row{17,'go'}});
select IDandWord(table{row{12 ID,'george' Word},row{13,'go'},row{14,'burns'}});

ID         Answer     Wcnt       Word       
---------- ---------- ---------- ---------- 
<No Value> <No Value> <No Value> <No Value> 

select IDandWord(table{row{11 ID,'arthur' Word},row{17,'burns'},row{20,'arthur'},row{21,'burns'}});

ID Answer Wcnt Word   
-- ------ ---- ------ 
21 eight  4    arthur 
22 eight  4    burns  
23 eight  4    arthur 
24 eight  4    burns  

No match:
select IDandWord(table{row{11 ID,'burns' Word},row{17,'arthur'},row{20,'arthur'},row{21,'burns'}});

ID         Answer     Wcnt       Word       
---------- ---------- ---------- ---------- 
<No Value> <No Value> <No Value> <No Value> 

Note that operators Words and IDandWord are the same except for the view
being used, the type of standarizing on the ID column. The sql solution
implies that aggregates would have to test for the sequence of IDs given
the Word values. This seems to mean that sql has to use a separate set of
constructs for each meaningful column involved in the match. And this
certainly raises the complexity level of such a query.

Now lets raise the bar one more notch in matching. Now not only the
sequence of Words is meaningful (Words and IDs) but so is the relationship
between the IDs. Given the three Answers that match the sequence of Words:

select 
 IDandWord(table{row{11 ID,'burns' Word},row{17,'burns'},row{20,'burns'},row{21,'arthur'}});

ID Answer   Wcnt Word   
-- -------- ---- ------ 
25 nineth   4    burns  
26 nineth   4    burns  
27 nineth   4    burns  
28 nineth   4    arthur 
29 tenth    4    burns  
32 tenth    4    burns  
35 tenth    4    burns  
38 tenth    4    arthur 
39 eleventh 4    burns  
41 eleventh 4    burns  
43 eleventh 4    burns  
45 eleventh 4    arthur 

We want to be able to distinguish (match) each of these Answer tables. To do this we
have to examine the relationship between ID values. The only way to target each
Answer is to test the difference between the IDs. So we have to test for differences
of 1,2 or 3.

Operator BetweenIDs is similar to operator IDandWord but adds a column of the
difference between consecutive IDs. The table comparison now involves looking
at three columns, Word, Rank and TestID.

create operator BetweenIDs(MatchTable:table{ID:Integer,Word:String}):typeof(CTable)
begin     
result:=table of typeof(result){};  
var MatchTableRank:=
        (ToTable(ToList(cursor(MatchTable order by {ID})))
          {ID,sequence+1 Rank,Word}) ;
//Compare the IDs (next to current) in sequence (ascending order) of IDs.
//An indexer ([]) expression is used to directly access a particular Rank.
var MatchTableID:=
           MatchTableRank          
            add{IfNil(((MatchTableRank adorn{key{Rank}})[Rank+1].ID-
              (MatchTableRank adorn{key{Rank}})[Rank].ID),0) TestID} 
                  {Rank,Word,TestID} ;       
var RTable:=
CTable
  having
        ( 
         (CTable {Answer AnswerI})
          where
           MatchTableID
            =
             (  
               (CTableViewID where Answer=AnswerI)
                   add{IfNil((CTableViewID[AnswerI,Rank+1].ID-
                                     CTableViewID[Answer,Rank].ID),0) TestID} 
                   {Rank,Word,TestID} 
             )  with {IgnoreUnsupported = 'true'} 
              {AnswerI Answer} 
        );
var Resultable:=table of typeof(CTable){}; 
//Result table will show nil values if RTable is empty.
if exists(RTable) then
result:=RTable
 else
  insert row{nil ID,nil Answer,nil Wcnt,nil Word} into result;
end;

Match consecutivec IDs.
select BetweenIDs(table{row{10 ID,'burns' Word},row{11,'burns'},row{12,'burns'},row{13,'arthur'}});

ID Answer Wcnt Word   
-- ------ ---- ------ 
25 nineth 4    burns  
26 nineth 4    burns  
27 nineth 4    burns  
28 nineth 4    arthur 

Match IDs separated by 2.
select BetweenIDs(table{row{1 ID,'burns' Word},row{3,'burns'},row{5,'burns'},row{7,'arthur'}});

ID Answer   Wcnt Word   
-- -------- ---- ------ 
39 eleventh 4    burns  
41 eleventh 4    burns  
43 eleventh 4    burns  
45 eleventh 4    arthur 

Match IDs separated by 3.
select BetweenIDs(table{row{4 ID,'burns' Word},row{7,'burns'},row{10,'burns'},row{13,'arthur'}});

ID Answer Wcnt Word   
-- ------ ---- ------ 
29 tenth  4    burns  
32 tenth  4    burns  
35 tenth  4    burns  
38 tenth  4    arthur 

There is no match for these divisor tables:

select BetweenIDs(table{row{10 ID,'burns' Word},row{11,'burns'},row{12,'burns'},row{14,'arthur'}});
select BetweenIDs(table{row{2 ID,'burns' Word},row{3,'burns'},row{5,'burns'},row{7,'arthur'}});
select BetweenIDs(table{row{14 ID,'burns' Word},row{11,'burns'},row{8,'burns'},row{5,'arthur'}});

ID         Answer     Wcnt       Word       
---------- ---------- ---------- ---------- 
<No Value> <No Value> <No Value> <No Value> 


Hopefully you can see how D4 can take the mystery out of relational division.

bye for now,
steve

For those who like to explore D4 here is a bonus operator ☺. Operator
WordsLimit is similar to operator Words but limits the possible Answer values
used for comparing the divisor and dividend tables.

//This operator limits possible Answers for the table comparison.
create operator WordsLimit(MatchTable:table{ID:Integer,Word:String}):typeof(CTable)
begin     
result:=table of typeof(result){}; 
//Get the Word/ID  combination corresponding to the last Rank (when ording by Word)
//for each Answer.
var LastRow:=
    CTableViewWord 
     having 
       (
        CTableViewWord 
        group by {Answer} add{Max(Rank) Rank}
       )   
        {ID,Answer,Wcnt,Word,Rank};
var MatchTableRank:= 
              (ToTable(ToList(cursor(MatchTable order by {Word})))
                    {sequence+1 Rank,Word}) ;
//Get Last row of Match table                    
var LRow:=
        ((MatchTableRank adorn{key{Rank}}) return 1 by {Rank desc})[];
//PTable are Answer(s) from CTable where the last row of an Answer is also
//the last row of the MatchTable, ie the Rank is the same and the Word is
//the same as the last row in the MatchTable. It returns only those Answer(s)
//that 'could' be true in a table comparison using all the rows of MatchTable
//and CTable. It limits the possible Answer(s) to only those that could be
//true.
var PTable:=
CTable {Answer}
where
IsNotNil
(
(LastRow adorn{key{Answer,Rank,Word}})[Answer,LRow.Rank,LRow.Word by{Answer,Rank,Word}]
) with {IgnoreUnsupported = 'true'} ;
//Now using PTable which reduces number of table comparisons.
//Note that the entire 'having' relation is derived independently of CTable.
//It must be derived first so the rows of CTable can be compared to the 'having'
//relation.
var RTable:=
     CTable
      having
       ( 
        (PTable {Answer AnswerI})
          where 
           MatchTableRank
             =
            ( 
             (CTableViewWord where Answer=AnswerI)
              {Rank,Word} 
            ) 
              {AnswerI Answer} 
        ) with {IgnoreUnsupported = 'true'} ;
var Resultable:=table of typeof(CTable){}; 
//Result table will show nil values if RTable is empty.
if exists(RTable) then
result:=RTable
 else
  insert row{nil ID,nil Answer,nil Wcnt,nil Word} into result;
end;

select WordsLimit(table{row{10 ID,'burns' Word},row{13,'burns'},row{16,'burns'},row{19,'arthur'}});

ID Answer   Wcnt Word   
-- -------- ---- ------ 
25 nineth   4    burns  
26 nineth   4    burns  
27 nineth   4    burns  
28 nineth   4    arthur 
29 tenth    4    burns  
32 tenth    4    burns  
35 tenth    4    burns  
38 tenth    4    arthur 
39 eleventh 4    burns  
41 eleventh 4    burns  
43 eleventh 4    burns  
45 eleventh 4    arthur 

select WordsLimit(table{row{10 ID,'arthur' Word}});

ID         Answer     Wcnt       Word       
---------- ---------- ---------- ---------- 
<No Value> <No Value> <No Value> <No Value>
Tuesday, July 17, 2007

Dataphor - Simplifying Relational Division

No comments:

Beyond Sql

Blog Archive

Links

About Me