Dataphor SQL RAC (Relational Application Companion)


A site of hope for those looking for a true relational database system

Sunday, December 09, 2007

Q & A about tables, types and procedures

Here are some questions from an sql developer I'm trying to 
answer. I'm trying to explain that a relational database realizes 
a stored procedure in a fundamentally different way than sql. 
Along the way I'm talking about relational ideas as they relate
to tables and types. The conversation shows the vise-like grip
sql seems to have on the minds of man  Microsofts functional
programming offering LINQ also comes up. Overall I think I did
a good job trying to explain the big differences between sql
and a relational database. There's some simple but good examples
that drive home the significant differences.
Black is relational(me)
Purple is sql

Here we go:
> My objection is not so much to your general idea of variables of
> type table-with-given-columns (I've recently worked with some systems
> that could be cleaner if such a thing were available; currently they
> work around it using temp tables); more to your specific use of D4 in
> all your examples, as opposed to a pseudo-code extension of SQL.

Ok, let me directly address you dislike of D4 and your preference for
a pseudo-code extension of SQL. I'll refer to your pseudo-syntax in
the thread:
comp.databases.ms-sqlserver
'Basic Anatomy of Sql Server'
http://tinyurl.com/2olako

>
Then you might want to write examples in a pseudo-syntax that
/looks/ like SQL.  I know this is a matter of taste, but your
examples look ugly to me.  Consider:

-- Your example of a stored procedure that returns a result set, the
-- format of which can only be deduced by reading through the code.

CREATE PROCEDURE dbo.GroupByShipCountry
       @Employee Integer
AS
SELECT ShipCountry,Count(*) Cnt,Min(Freight) MinFrt,Max(Freight)
MaxFrt
FROM Orders
WHERE EmployeeID=@Employee
GROUP BY ShipCountry

-- Your example of the same stored procedure rewritten in D4.

create operator GroupByShipCountry (Employee:Integer):
table{ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money}
begin
result:=
       Orders
        where EmployeeID=Employee
         group by {ShipCountry}
           add{Count() Cnt,Min(Freight) MinFrt,Max(Freight) MaxFrt} ;
end;

-- My example of the same stored procedure rewritten in a
-- pseudo-extension of T-SQL.

CREATE PROCEDURE dbo.GroupByShipCountry
       @Employee Integer,
       @ResultSet Table (
             ShipCountry varchar(15),
             Cnt         int,
             MinFrt      money,
             MaxFrt      money
       ) output
AS
SELECT ShipCountry,
        Count(*) Cnt,
        Min(Freight) MinFrt,
        Max(Freight) MaxFrt
INTO @ResultSet
FROM Orders
WHERE EmployeeID=@Employee
GROUP BY ShipCountry
>

To begin with, the idea of a stored procedure returning a 'result' is
an sql concept. This concept does not exist in a relational (D4)
system. Relationally, a stored procedure only exists when it is
created. The  execution of a sp, its runtime realization, does not
involve the definition of the procedure nor the idea of 'returning'
something from it. Relationally at runtime what sql see's as a
procedure and a result 'is' a variable of the type of the result. This
is the huge difference between the two systems. Relationally the
'@ResultSet' and the idea of inserting a query result into it is
contradictory and meaningless. The 'name' of the procedure 'is' the
variable (table), there is no result from a sp (ie. sql). Syntactically
an sql tabled value function is closer in spirit to the D4 procedure
with the big difference that the name of the table valued function is
'not' a typed variable like in D4. Finally, the sql sp makes the
distinction between identifiers as variables and non-variables using
the '@'.  In D4 there is no such distinction as 'all' identitifiers
are by definition variables and the '@' is superfluous. The 'output'
declaration in the sql sp is based on the general sql idea of
'returning' something. Such a declaration is superfluous relationally
as, again. there is no concept of 'returning a something' from a 'this sp'.
Note that LINQ realizes an sql stored procedure exactly as sql intends
it and nothing like the relational D4. The 'functional' part of
integrated query is simply how the sp is accessed within a net
language. There is no concept of a typed variable with the name of the
sp. In other words, the OR map is mapping to the same sql as if the
mapping didn't exist. MS has added a 'functional language' within net
when what it should have done is added a functional language to the
database itself!  D4, after all, represents the functional language
of a relational system and how easy it is to use such a language
within a present day (net) programming language. The D4 answer to
overcoming the object-relational mismatch is of a totally different
nature to the one offered thru LINQ. My objection to LINQ lies in the
idea that no one at MS seems to have considered an alternative. 

Let me try to cut thru the semantics with a few examples that I
hope will clarify some of the points I'm trying to make.

In t-sql this should be perfectly clear:
DECLARE @X INT
SET @X=5

The variable @X can only take one value at any specific time.
In a relational system a procedure that returns some value at runtime
must behave exactly like @X. At runtime the procedure is a variable of
a particular type and has a specific value based on input arguments.
An sql sp has no such nature and behaves in an entirely different way.

create procedure SqlOne
@Y int
AS
SELECT COUNT(*) AS CNT,SUM(FREIGHT) AS SUMFRT
FROM ORDERS
WHERE EMPLOYEEID=@Y
SELECT *
FROM ORDERS
WHERE EMPLOYEEID=@Y

Therefore the idea that an sql procedure can return multiple results
is meaningless if the sp is realized as a variable where only a single
result makes sense. Add to this the idea of type where each result is
a different type and the difference between sql and relational should
be even clearer. Again the relational procedure is realized exactly
like the int @X. No programming language chooses among possibe
multiple definitions of the value of a variable. It would be
equivalent to:
DECLARE @X INT
SET @X=5 or @X=10
which makes no sense. It is because the sql sp is not realized as a
variable that multiple results 'can' be returned.

This sql sp:
CREATE PROCEDURE SqlTwo
@A INT OUTPUT,
@B INT OUTPUT
AS
SET @A=5
SET @B=10

DECLARE @C INT,@D INT
EXEC SqlTwo @C OUTPUT,@D OUTPUT
SELECT @C
SELECT @D

makes no sense relationally because, again, there are multiple
results. Now there are two scalar types (int) returned instead of sql
'resultsets'. Relationally there is no such thing as more than 1 thing
(think a variable of a type) at a time. Two scalar results are
realized as a 'row' type relationally, ie. 'one' thing.
create operator D4Two():row(A:Integer,B:Integer)
begin
result:=row(5 A,10 B);
end;

In this case at runtime D4Two is a variable of type row with 2 scalar
columns.

From the relational perspective a table/row/list is a variable that
behaves exactly like a variable in a programming language. Its value
can be assigned to other values just like a t-sql integer variable
can. It can be compared to other variables (for equality) just like a t-sql
integer variable. It can be passed as an argument to a procedure just
like a t-sql integer variable. For these reasons why MS decided to
call something a 'table variable' remains a mystery. It behaves like no
other variable in any programming language on the face of the
planet  If it doesn't quack like a duck, doesn't behave like a duck,
doesn't waddle like a duck it sure as hell isn't a duck  What MS
calls a table variable is surely not a table variable as the idea
exists in any programming language or relationally. Whatever one wants
to call an sql table the table variable is the same thing. Its
phyiscally implementation may be different but that does not change
the fact it is not a variable of a specific table type.

Sql distinguishes between user defined functions and procedures. But
sql user defined functions are on the same exact level of procedures
when looked at from the point of view of 'variables'. Neither one
has anything to do with the idea of a relational variable. All this
artificial distinction does is serve to make it harder for users to
understand the relational model  (Why sql choose to create a user
define function/procedure dichotomy is another topic. But think of
where and having).

Rather than dwell on particular syntax or pseudo-syntax I think it
is the ideas that the relational model is based on that is important.
And what we're talking about here is just a slice of the relational
model. The relational model is not rocket science  It's actually
quiet straightforward. Ironically it's sql that is out in left field.
The relational model is in line with all current programming
languages. Unfortunately that's never been the case with sql  This is 
one of the reasons I find LINQ so unnecessary. Once you get the idea
that a big part of the relational model is all about the basic concepts of
variables and types I think (I at least hope) that what I've been
trying to explain will make perfect sense  

>Add to this the idea of type where each result is a different type
>In other words, TABLE (X INT, Y INT) is a different variable type
>from TABLE (M VARCHAR(15), N VARCHAR(15))?

Exactly. Think of sql strings. This table, TABLE(TABLE (M VARCHAR(15),
N VARCHAR(15)),
is a differnt type than TABLE (N VARCHAR(16), N VARCHAR(16))! This
means that we couldn't compare the two and undermines real relational
division. To declare how many characters in a string is clearly the
opposite of what the relational idea of data independence is all
about. Relationally there can only be a 'string' type having
absolutely nothing to do with its storage characteristics.  And this
is the same idea in any programming language. This is just one
manifestation of how sqls design ignores the concept of a strong type.
LINQ is an attempt to hide the fact that:
DECLARE @N VARCHAR(5),@M VARCHAR(6),@P VARCHAR(5),@Q VARCHAR(6)
represents 4 different types. This is but one simple form of the idea
that sql guarantees impedence mismatch! And having the choice of
changing the database or the access to it, MS chose access to the
database (LINQ). The sql community seems to not consider how bizarre
and confounding things like this look to developers coming to sql for
the first time. It must be force of habit blinding a more critical
look at how things are. There are so many that sql defeats the idea of
strong types that it would be better and easier to build a new
database system for application development. And have sql available
for everything else 

>What if the data you want to return is not multiple scalars, but
>rather multiple tables?  Upon reflection, I suppose tables could
>be nested in this model, i.e. you can return
>   TABLE(T1 TABLE(X INT, Y INT), T2 TABLE(M VARCHAR(15), N VARCHAR(15)))

The relational database emphasizes types to define structures that can
be used to model processes, ie.

create table T1
{
  A:Integer,
  B:String,
  T:row{X:Integer,Y:String},
  S:row{B:Integer,C:list(String)}
  key{A}

};

This is possible thru system provided types and user defined types.
It's also supported thru explicit conversion processes between one
particular type and another. The foundation to support these
constructs is unique to a relational system and does not exist in sql.
Whether a specific struture makes logical sense and whether it's
supported is another matter. Tables within tables is open to question.
Even if the system supports it would it make sense or would another
type of structure be more appropriate? This question goes to the edges
of a relational system and I'm afraid I can't do it justice here 

>This would allow bad developers to commit the common 'a,b,c' 1NF
>violation in a whole new way, but then bad developers can screw
>things up in any language.

The view that strings like 'a,b,c' violate the idea of the atomicity
of a column in an sql table is a direct result of sql's lack of types
and lack of relationships between types. There is no violation of any
kind in a relational system because the string can be stored as
a single value of a column retaining the concept that there individual
elements involved. It would simply be stored as a 'list' type.
For example column B of table TableList takes a comma delimited string
and splits it into a list type.

create table TableList
from
table
     {
      row{1 A,'A,B,C'.Split({','}) B},
      row{2,'D, E , F, G'.Split({','})},
      row{3,'H,I,J,K'.Split({','})}
     };

The table definition of TableList is:
create table TableList
A:Integer,
B:list(String),
key{A}

For each row of the table column A can be retrieved and the individual
items of column B, the list, are availiable.
Select the value of A and the value of the 1st item in the
list(B).
select TableList {A,B[0] ListItem1};
A ListItem1
- ---------
1 A
2 D
3 H

Directly address the 1st item in the list of A=2.
select TableList[2].B[0];
D

Directly address the last item in the list of A=3.
select TableList[3].B[TableList[3].B.Count()-1];
K

Get column B for the row where A=1 and convert the list into a table.
select ToTable(TableList[1].B,'Str','Index');
or
select ToTable( (TableList where A=1 over{B})[].B, 'Str','Index');
Str Index
--- -----
A   0
B   1
C   2

It's types that a relational system guarantees integrity for and high
level operators that allow the explicit conversions between that
developers should have for application development. And this is the
same idea the MS net team calls 'functional programming' which is what
they developed LINQ for. But a relational system is functional
programming! 

>bad developers can screw things up in any language.

Sure but application development with sql has a tendency to make
anyone a nitwit at some time or other. I'm for less nitwits 

>> 'where' and 'having').
>I'd guess these are both for efficiency.

GROUP BY was added after the original SELECT but instead of
redesigning the language they saw no problem with leaving in two
constructs that do the same thing! Every time I see an MS paper on
'best practices' I have got to grin 

On table variables:

>But you agree that (1) it has some features of variables, and (2) it
>could reasonably be extended to have more features of variables?

I don't think MS could lock its developer army in a hotel and tell
them to make sql a little more relational  They have two choices.
Either buy a relational system (like D4) or start from the ground up
to develop one. The gulf between a relational system and sql is too great
to try to simply make changes in sql server. 

>A lot of people find SQL pretty straightforward, especially in this
>newsgroup.  Your choice of (pseudo-)syntax will make a difference to
>them.  (You might get different responses from a newsgroup focusing
>on front-end programming languages, especially if they already
>resemble Pascal as D4 seems to do.)

Sure at least a dozen people who write books and articles find sql
straightforward  For the rest I'd like to at least see a choice for
them. Again, I hope some can get beyond just syntax to grasp what
a relational system would offer. 

Tuesday, November 06, 2007

Basic Anatomy of Sql Server, part III

This is the third in a series of articles examining some basic
concepts in Sql Server.

Basic Anatomy of Sql Server, part I - What is a stored procedure?
Basic Anatomy of Sql Server, part II - The unit test as part of the database

What does deferred name resolution really mean?     

Essentially part II answered the often asked question, "why can't I
SELECT from an sql stored procedure?". Because there is nothing to
select from. There is an absence of the concept of a table type in sql
which is necessary to realize a table. 

Now we ask another simple question. How is it D4 can understand the assignment
of a table and a test of equality between two tables just as t-sql understands
the same things between numbers:

var aNewTable:= GroupByShipCountry(1) ;     //Assign the table for employee 1 based
                                            //on the D4 operator from Part I to 
                                            //aNewTable.
var Test1:= aNewTable=GroupByShipCountry(1);//Are the two tables the same? Yes.
                                            //Now change aNewTable with an update.
update (aNewTable adorn{key{Cnt,MinFrt}}) 
         set {MinFrt:=$0.00}
          where ((Cnt<11) and (MinFrt<$2.50));
var Test2:= aNewTable=GroupByShipCountry(1); //Now are the two tables the same? No.

select row{Test1 TestofEquality_before_Update, Test2 TestofEquality_after_Update};

TestofEquality_before_Update TestofEquality_after_Update 
---------------------------- --------------------------- 
True                         False                       

And how is it that t-sql can do with numbers what D4 does with tables:

DECLARE @X Integer, @Y Integer; 
SET @X=5; SET @Y=10;  
if @X=@Y
  print '@X=@Y'
   else
    print '@X!=@Y'
    
@X!=@Y    

The answer is obvious and simple. We can only assign values and make comparisons
with variables. And we cannot do anything with a variable unless it is typed.
The unit in 'unit test' is the awareness of the database of a table as a variable.
To say the idea of a unit test is part of the database is to recognize that
a table as a variable and its type is the only way to distinguish one table
from another. The unit is the same for distinguishing among numbers and strings
and tables. And it is this basic unit that sql server lacks for tables. 

Sql server bol describes deferred name resolution as:

'This process is called deferred name resolution because table objects referenced
 by the stored procedure need not exist when the stored procedure is created, 
 but only when it is executed.'
 
What is this really a statement of? 

When t-sql parses these statements:

IF @X=5 
  PRINT '@X=5'
   ELSE
    PRINT '@X!=5'

it returns the error: Must declare the scalar variable "@X".

In D4 when this statement is parsed:

select aNonExistentTable;

it returns the same error as t-sql: Unknown identifier "aNonExistentTable".

These errors are the same. Each system is aware that a variable has not been
defined with a type. There is the intent at parse in each system to check
that an identifier (@X, aNonExistentTable) has been appropriately defined. 
And where the identifier is recognized as a variable and nothing else. This
intent has nothing to do with the context that surrounds the identifier. And
it is this intent that lies at the heart of a relational database. A statement
of just the variable results in the same error.

t-sql   D4
-----   -------
@X      aNonExistentTable;

Must declare the scalar variable "@X"., Unknown identifier "aNonExistentTable".

Now these are all examples of so called t-sql deferred name resolution, there
is no error raised on parsing only execution:

SELECT * FROM aNonExistentTable

SELECT aNonExistentTable

IF aNonExistentTable=anotherNonExistentTable
       print 'YES'

CREATE PROCEDURE theTableDoesNotExist
AS
SELECT * FROM aNonExistentTable

Given that there is computer science and not one computer science for t-sql
and one for everything else, how is it that these statements can be parsed
without error? Because at parse-time if there is no concept of an
identitifier as a variable there is only context to check, syntax. Absence
the idea of a variable there is nothing to resolve a database object to, 
hence it is as if these objects/identifiers do not exist at parse-time. 
How bol describes deferred name resolution is a consequence of working with
tables in a variable-less system. It is a price paid, in terms of sacrifice
of integrity and sacrifice in management of table objects, by the developer
for working in a type-less and variable-less and therefore a non-relational
database. Explanations like deferred name resolution are quite common in
sql but they are all just different sides of the same central issue. For
example there is the issue of impedance mismatch. This is just an expression
of the inherent difficulty of communication between a client, which 
understands what a variable is, and an sql server which does not. And finally
there is the idea of relational division. Since comparisons require variables
sql forces an inference to be made about a comparison of table(s) it cannot
do. It forces the use of constructs that are obtuse and complex compared
to simple and direct comparisons that can be made in a relational database
where the table exists as a variable.
(And now you know why you can't pass a table as a parameter. There is no 
table variable to pass  .

The history of sql is a history of a language trying to exist outside the
mainstream of the rest of IT. It's long overdue that application development
use a database that has the same foundation as other programming languages.
And that kind of database is relational.

Basic Anatomy of Sql Server, part I 
What is a stored procedure?

Basic Anatomy of Sql Server, part II 
The unit test as part of the database.

Basic Anatomy of Sql Server, part III 
What does deferred name resolution really mean?

Some related articles of interest:
All tables are typed variables 
A table as a parameter
Reusable procedures 

Basic Anatomy of Sql Server, part II

This is the second in a series of articles examining some basic
concepts in Sql Server.

Basic Anatomy of Sql Server, part I - What is a stored procedure?

The unit test as part of the database.

If Sql Server is operating with the genie in the bottle what would a system
look like with the genie out of the bottle. Here we take the sql procedure
dbo.GroupByShipCountry from part I and use it in a system that takes the
definition beyond the statements that define it

What does such a system look like and what exactly can we expect such a
system to be aware off. Here we use a system called D4, a system with a
kinship to sql but fundamentally different.

For reference we repeat the sql procedure dbo.GroupByShipCountry from part I:

CREATE PROCEDURE dbo.GroupByShipCountry 
       @Employee Integer
AS
SELECT ShipCountry,Count(*) Cnt,Min(Freight) MinFrt,Max(Freight) MaxFrt
FROM Orders
WHERE EmployeeID=@Employee
GROUP BY ShipCountry

Here is the sql procedure as it would be defined in D4:

create operator GroupByShipCountry (Employee:Integer):
           table{ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money}  
begin
result:=
      Orders 
       where EmployeeID=Employee
        group by {ShipCountry}
          add{Count() Cnt,Min(Freight) MinFrt,Max(Freight) MaxFrt} ;
end;

While the differences in syntax between the two languages are beyond the
scope of this article it should be obvious that the definition of result in 
D4 is similar to the SELECT statement in the sql procedure. Semantically they
are identical. The sql procedure and D4 operator each have a parameter to 
define an employee. The obvious difference between the two is the construct:

table{ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money}  

in the beginning of the D4 operator. This says the result of the operator
is a table with the specific type of:

ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money

Lets execute the operator with syntax as close as possible to the sql sp:

Execute('GroupByShipCountry(8)');   

What is returned? Absolutely nothing. A D4 operator can also be specified
with just its name and input parameter(s) values. So lets try:

GroupByShipCountry(8);

What we get with this is the following system message:

Possibly incorrect use of expression as a statement.

What is going on here? Why doesn't the operator display something like the
sql procedure? Lets go back to the sql procedure. At run-time the procedure
is still only a bunch of t-sql statements so 'execute' is appropriate as 
in 'execute these statements'. But if the D4 operator is a table at run-time
it makes no sense to 'execute' a table! So lets try to use the 'table' in
its appropriate context. In sql we would simply say: 

SELECT * FROM <table>

Which corresponds to the D4 statement:

select <table>;

We now substite the D4 operator with an input parameter for the '<table>':

select GroupByShipCountry(8);

And now we get a similar display as the sql procedure:

ShipCountry Cnt MinFrt MaxFrt  
----------- --- ------ ------- 
Argentina   3   $0.33  $217.86 
Austria     5   $25.22 $353.07 
Brazil      9   $5.32  $110.87 
.
Venezuela   9   $4.41  $141.06 

Clearly the D4 system is aware of two distinct things. The definition of
the operator, its statements, and then at run-time a table. And now it
should also be clear that sql is only aware of statements.

How does D4 guarantee that the GroupByShipCountry(8) table will be
of the type:

table{ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money} 

Because that is the only type of table possible. If the object (table)
exists at run-time it can only be of this type. And if the table is
derived it must have come from compatible datatypes. Another way of
saying this is that a system can distinguish between tables or it can
support implicit conversion. If the system can be aware of a table type
it cannot support implicit conversion. Only explicit conversion, an
intervention on the part of the user, can be consistent with a table type.
Sql supports implicit conversion and therefore cannot tell one table
from another. Sql therefore cannot guarantee the integrity of a table. 
It is object (table) types that offers integrity. For sql to support
implicit conversion means it has no concept of a table of a specific
type as an object. For example, this sql illustrates conversion, from
a money type to an integer type:

CREATE TABLE TestConv
                    (ShipCountry varchar(15),Cnt int,MinFrt int,MaxFrt int,
                     PRIMARY KEY (ShipCountry))
GO
INSERT TestConv (ShipCountry,Cnt,MinFrt,MaxFrt)
EXEC GroupByShipCountry 8 

It also undermines the integrity of each table and therefore the database.
This is what would happen in a D4 operator if a table with incompatible
datatypes with the source was defined:

create operator GroupByShipCountryI (Employee:Integer):
          table{ShipCountry:String,Cnt:Integer,MinFrt:Integer,MaxFrt:Integer}  
begin
result:=
      Orders 
       where EmployeeID=Employee
        group by {ShipCountry}
          add{Count() Cnt,Min(Freight) MinFrt,Max(Freight) MaxFrt} ;
end;

The error returned is a compile error (as opposed to run-time):

Cannot convert a value of type "System.Money" to a value of type "System.Integer".

Not only does the database offer integrity through type compatibility but
the entire framework is exposed to the developer for use as part of the
application itself. A value can be checked for its type with an operator
that accepts any type (a generic). We can make a comparison of the value to
a specific type which returns a boolean value of True/False.

create operator WhatIsIt(aValue:generic):String
begin
result:=
  if aValue is String then
   'String'
     else 
      if aValue is Integer then
       'Integer'
        else 
         if aValue is Money then
            'money'
           else
             'Some other type';
end;          

select 
 WhatIsIt((GroupByShipCountry(8) adorn{key{ShipCountry}})['Canada'].MaxFrt); 

money

We have created the GroupByShipCountry(8) table with a key on ShipCountry
and directly accessed the row where ShipCountry is 'Canada' and extracted
the value of MaxFrt to use as the argument to operator WhatIsIt. The type
of that value is money. And we can extend the same type recognition logic
to a table.

var TryTable:Boolean:=false;
if WhatIsIt(GroupByShipCountry(8))='Some other type' then
  TryTable:=GroupByShipCountry(8) is generic table;
select TryTable; 

True

We have used the GroupByShipCountry(8) table as the argument to the WhatIsIt
procedure. The procedure returned the string 'Some other type' and
in the if statement we asked if the GroupByShipCountry(8) table is a
generic table. In other words we are testing if GroupByShipCountry(8) 
represents some specific type of table. And we can go from a generic type
of table to a specific type of table:

var TryTable:Boolean:=false;
if WhatIsIt(GroupByShipCountry(8))='Some other type' then
 if (GroupByShipCountry(8) is generic table) then
   TryTable:=
      (table of typeof(GroupByShipCountry(8)){})
       is
        table{ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money};
select TableDee add{TryTable Is_GroupByShipCountry8_this_table_type};   

Is_GroupByShipCountry8_this_table_type 
-------------------------------------- 
True        

After determining that GroupByShipCountry(8) is some type of table we have
asked the question, is the type of the GroupByShipCountry(8) table the
specific type represented by the pairs of columns and their datatypes of:

ShipCountry:String, Cnt:Integer, MinFrt:Money, MaxFrt:Money

The statement: table of typeof(GroupByShipCountry(8)){}
extracts just the type information from GroupByShipCountry(8) as information
independent from the values in the rows of the table.

var TryTable:Boolean:=false;
if WhatIsIt(Orders)='Some other type' then
 if (Orders is generic table) then
   TryTable:=
      (table of typeof(Orders){})
       is
       table{ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money};
select TableDee add{TryTable Is_GroupByShipCountry8_this_table_type};    

Is_GroupByShipCountry8_this_table_type 
-------------------------------------- 
False                                  

We have confirmed that the Orders table is not the same particular type of
table as GroupByShipCountry(8). 

And finally we show the ability to make a functional test rests on the
ability to make a unit test:

var C:Boolean:=True;          
if
   (
     table of typeof(GroupByShipCountry(1)){}
      is
       table{ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money}
    ) 
     and
   (
     table of typeof(GroupByShipCountry(8)){}
      is
       table{ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money}
    )   
      then 
       if GroupByShipCountry(1)<>GroupByShipCountry(8) then
         C:=False;
          select TableDee add{C Are_Tables_of_Same_Type_the_Same};
    
Are_Tables_of_Same_Type_the_Same 
-------------------------------- 
False                     

We have first verified that the two tables are the same type of table and then
asked if the two tables are the same. In other words, we can only test the
equality of two tables (they have the same set of rows) if they are the same
table type just as we can only compare two numbers or two dates. It is
obviously nonsensical and illogical to compare two objects of different types.
(I note in passing that the fact that an sql database is aware of individual
 columns and their datatypes is fundamentally different from being aware that 
 such pairs can be encapsulated to define a type of table. It is the notion of 
 encapsulation that defines a type (table) independent of any column(s) as a 
 scalar type that sql is aware of. Any place in t-sql that a table definition
 appears ie. CREATE TABLE statement, a table valued function or a table variable
 (a confusing and misleading construct) whatever this definition is the important
  point is this definition does not signify a table type).

Can we make a test of equality between the sql procedure dbo.GroupByShipCountry
and the D4 operator GroupByShipCountry? As just previously shown we can easily
make such a functional test if both are recognized as table objects.
We can elevate the mere t-sql statements of dbo.GroupByShipCountry by bringing
in the statements to D4 with a pass-thru query. The following D4 query displays
the same result as executing the procedure in Sql Server.

select SQLQuery('Execute dbo.GroupByShipCountry 1');

ShipCountry Cnt MinFrt  MaxFrt   
----------- --- ------- -------- 
Argentina   1   63.7700 63.7700  
Austria     5   74.6000 351.5300 
Belgium     1   29.5900 29.5900  
.
Venezuela   8   0.9300  148.6100 

Is the display a 'result set', as in Sql Server, or a table in D4? Let us
apply a unit test to confirm or reject whether it is a table in D4:

select 
 table of typeof(SQLQuery('Execute dbo.GroupByShipCountry 1')){} is generic table;    
 
True

which confirms that D4 is aware of the dbo.GroupByShipCountry t-sql result 
set as an object of some type of table at run-time. We can now try to test
the equality of the two tables assuming they are of the same specific type. 
The following query should return a boolean, either True or False, on the
equality (sameness) of the tables:

select GroupByShipCountry(1) = SQLQuery('Execute dbo.GroupByShipCountry 1') ;

But on checking the query at 'compile', two errors are returned:

Cannot convert a value of type "System.Decimal" to a value of type "System.Money".
No signature for operator "iCompare" matches the call signature
"(System.Money, System.Decimal)".

Both errors indicate a type mismatch. It is clear we cannot test for equality
because the two tables are of different types. We know the specific type of
table for the D4 operator. Let us go back to the unit test to find out what
type of table is the sql procedure:

var TypeofTable:=table of typeof(SQLQuery('Execute dbo.GroupByShipCountry 1')){}; 
var TypeString:String;
if TypeofTable 
 is table{ShipCountry:String,Cnt:Integer,MinFrt:Decimal,MaxFrt:Decimal} then
  TypeString:='table{ShipCountry:String,Cnt:Integer,MinFrt:Decimal,MaxFrt:Decimal}'
 else
  if TypeofTable 
   is table{ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money} then
    TypeString:='table{ShipCountry:String,Cnt:Integer,MinFrt:Money,MaxFrt:Money}'
    else
     TypeString:='A type where MinFrt and MaxFrt are differnt than Decimal and Money';
select TypeString;     

table{ShipCountry:String,Cnt:Integer,MinFrt:Decimal,MaxFrt:Decimal}  

Now we understand what's happening. The money datatype in Sql Server is
really a decimal type significant to 4 digits. The money type in D4 is
a totally different type. When D4 interacts with a server table it reads
money for datatype. When the sql procedure is brought into D4 as a table
the MinFrt and MinFrt are interpreted as decimal (and this also accounts
for the different displays of the server procedure and the D4 operator).
To compare the two tables we have to make an explicit conversion. We can
convert the server decimal types to D4 money types.

select 
        GroupByShipCountry(1) 
        = 
        (SQLQuery('Execute dbo.GroupByShipCountry 1')
           {ShipCountry,Cnt,ToMoney(MinFrt) MinFrt,ToMoney(MaxFrt) MaxFrt});

True

Clearly, the idea of a unit test in D4 can be used by a developer to provide a 
programmable form of integrity within an application. 

A database that recognizes, supports and exposes a unit test for a table
type is an objects database. An objects database that has an algebra for
tables is a relational database. D4 is a relational database.

Basic Anatomy of Sql Server, part I 
What is a stored procedure?

Basic Anatomy of Sql Server, part II 
The unit test as part of the database.

Basic Anatomy of Sql Server, part III 
What does deferred name resolution really mean?

Some related articles of interest:
All tables are typed variables 
A table as a parameter
Reusable procedures 

Monday, November 05, 2007

Basic Anatomy of Sql Server, part I

This is the first in a series of articles examining some basic
concepts in Sql Server.

What is a stored procedure?
                  
This example uses Sql Server 2005 and the Northwind sample database.

The following is a very simple stored procedure that returns some
aggregates from the Orders table for a given EmployeeID value supplied
as an input parameter:
                  
CREATE PROCEDURE dbo.GroupByShipCountry
      @Employee Integer
AS
SELECT ShipCountry,Count(*) Cnt,Min(Freight) MinFrt,Max(Freight) MaxFrt
FROM Orders
WHERE EmployeeID=@Employee
GROUP BY ShipCountry

After executing the above script to create the procedure we ask a most
simple question: what is it? What exactly is Sql Server aware of
regarding this procedure? We can take as the starting point the definition
of a generic procedure. Bol states:

'A stored procedure is a saved collection of Transact-SQL statements ...
that can take and return user-supplied parameters.'

We can simply ask is there any indication that the server is aware of
anything that rises above the level of the t-sql statements defining
the procedure. Should we expect to find such thing? We do know that
the server recognizes the object Orders in sys.objects as a USER_TABLE.
And if we execute the query:

SELECT *
FROM Northwind.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME='Orders'

we know the server recognizes the columns and their datatypes for the
Orders table. We know the server recognizes that the procedure depends
on three columns from the Orders table:

SELECT object_name(referenced_major_id) Dependent,COLUMN_NAME
FROM sys.sql_dependencies A
CROSS APPLY (SELECT B.COLUMN_NAME
            FROM Northwind.INFORMATION_SCHEMA.COLUMNS B
            WHERE B.TABLE_NAME=object_name(A.referenced_major_id)
            and A.referenced_minor_id=B.ORDINAL_POSITION) C
WHERE object_id=object_id('dbo.GroupByShipCountry')

Dependent COLUMN_NAME
--------- ---------------
Orders    EmployeeID
Orders    Freight
Orders    ShipCountry

We know the procedure works if we execute it:

Execute dbo.GroupByShipCountry 8       

It retrieves, in the precise wording of bol, 'rows and columns from the
Orders table' which are collectively termed a 'result set'. (We will
return to the server Orders table and result set term in later articles).

ShipCountry     Cnt         MinFrt                MaxFrt               
--------------- ----------- --------------------- ---------------------
Argentina       3           .3300                 217.8600
Austria         5           25.2200               353.0700
Brazil          9           5.3200                110.8700
.
Venezuela       9           4.4100                141.0600

When we look for specific information about the procedure itself with
the sp_help system procedure we see that @Employee is listed as a
parameter.

The information from sys.objects confirms that the procedure is
recognized and it indeed is a definition of a stored procedure:

SELECT name,object_id,type,type_desc
FROM sys.objects
WHERE object_id = OBJECT_ID('dbo.GroupByShipCountry') AND type='P'

name                 object_id   type type_desc             
-------------------- ----------- ---- ----------------------
GroupByShipCountry   1551396646  P    SQL_STORED_PROCEDURE                   

In summary there does not exist any evidence that the server is
aware of any information not contained in the collection of t-sql
statements that define it. In fact the server does not exhibit
any information about three of the four columns (Cnt, MinFrt, MaxFrt)
that are displayed when it is executed. This leaves us in a very unsteady
state. At the very most the server is performing an operation which is
not exposed to us to interact with, at the very least it is performing an
operation that it is itself unaware of. This situation makes it impossible
to look to the server for any characterization of the procedure towards
answering the question of what this procedure really is. It is up to us
to infer what the procedure is. With the consequence that we can not
rely on the server to either characterize what the procedure really is
nor depend on the server to guarantee the integrity of the operation.
This void in the server must be filled.

The unit test

Is it fair to say that the failure of the server necessitates a unit test?
We can more clearly see the answer to this question if we can clearly
see just what is the nature of a unit test that clarifies just what this
procedure is. An insightful perspective on the unit test is offered in
the excellent article:

'Using unit and functional tests in the development process'
by Jeff Canna
www.ibm.com/developerworks/library/j-test.html

The author writes:

'Many times the development of a system is likened to the building of a
house. Unit testing is analogous to a building inspector visiting a
house's construction site. He is focused on the various internal systems
of the house, the foundation, framing, electrical, plumbing, and so on.
He ensures (tests) that the parts of the house will work correctly and
safely, that is, meet the building code.'

He goes on to say:

'Unit tests tell a developer that the code is doing things right;
functional tests tell a developer that the code is doing the right things.'

The author is clearly expressing the idea that a unit test is something
the goes toward confirming something that is of a very basic nature.
Something that is the basis on which any other functional tests are
based on. So what sort of unit can we use to confirm the 'the code is doing
things right'? And will it answer the question of what the procedure really
is?
The sort of unit test we are looking for is approached by the insight offered
in another excellent article:

'Close those Loopholes - Testing Stored Procedures'
by Alex Kuznetsov and Alex Styler
www.simple-talk.com/content/print.aspx?article=426

With reference to a procedure very similar to dbo.GroupByShipCountry the
authors write:

'Suppose you need to write a unit test for that stored procedure.
You might want to verify that your result set has correct structure
(correct column names and types)...'

And further:

'When you set up your unit test, your result set (or result sets, if your
stored procedure returns several ones) is saved into an XML file. All the
information necessary for further comparison is saved column names, column
types, and the data...When you run your unit test, the result set is
compared against the saved XML file. Everything is compared - column names,
column types, and all the data.'

The idea of verifying that the 'result set has correct structure
(correct column names and types).' What the authors refer to as
'correct structure' is what we are really searching for. A test for the
correct structure is exactly what the server lacks. And what is this
structure that is based on pairs of columns and their datatypes? For it
is the columns of this structure that are displayed by executing the
procedure:

ShipCountry Cnt MinFrt MaxFrt 
----------- --- ------ ------

which are based on the structure of column/datatype:

ShipCountry:varchar, Cnt:Integer, MinFrt:Money, MaxFrt:Money

It is these pairs that define a specific type, a specific type of table.
And this simple concept allows us to finally answer the question of
what the stored procedure really is: a table. The unit test is a test
confirming the specific type of the table returned. What the server calls
a 'result set' (and not a 'table') is based on the fact that there does
not exist the concept of a type for a table. There does not exist a way
for the server to distinguish one table from another. It is the non-existence
of this concept in the server that necessitates this most basic unit (test)
being made the responsibility of the user. This is the huge hole in sql
that must be filled by the user, a role that the database should perform
is lost and with it the integrity that it should offer.

In part II we show and describe a database system that takes the
definition of a stored procedure beyond the statements that define it.

Basic Anatomy of Sql Server, part I 
What is a stored procedure?

Basic Anatomy of Sql Server, part II 
The unit test as part of the database.

Basic Anatomy of Sql Server, part III 
What does deferred name resolution really mean?

Some related articles of interest:
All tables are typed variables 
A table as a parameter
Reusable procedures 


Monday, October 22, 2007

Sql - Really simple division

This article is based on the thread:

microsoft.public.sqlserver.programming
Monday, October 22, 2007
"select rows that contain both arg1 = 'x' and arg1 = 'y'"
http://tinyurl.com/386lbx

>I need to select only rows where the name contains x and y.  I do not want
>to get rows that only contain x or contain x, z, or contain y, z.  I only
>want rows where the name has an x an a y (they can have a z but they must
>have an x and a y).

Forget QA or SSMS. Just draw a picture.
You use the term row. So draw a row.

row{aName as sName, acode as code}

Now you want people who have code x and y. These are two
rows. So draw them. Use 'burns' for the person.

row{'burns' as sName, 'x' as code}
row{'burns' as sName, 'y' as code}

Now what do you call two rows together? How about calling it
a table ☺ .
Draw it.

table
   {
    row{'burns' as sName, 'x' as code}
    row{'burns' as sName, 'y' as code}
   }
 
Now if this table, made just for 'burns', is in your
#tmp1 table then 'burns' is a guy you want. In other
words, if both rows (one for x, one for y) are in
#tmp1 then you have a hit. Super simple ☺ .

Now in QA or SSMS do this.

declare @x int, @y int
set @x=2
set @y=3
if @x<=@y
print 'Yes'
else
print 'No'

No explanation necessary. Super simple. Now take the same
idea of comparing two integers and extend it to comparing
two tables.

if
table
     {
      row{'burns' as sName, 'x' as code}
      row{'burns' as sName, 'y' as code}
     }
       <=
         #tmp1
           print 'Yes'
                else 'No'
              
In other words, does each row for 'burns' occur in #tmp1?  
'Burns' can have codes in #tmp1 in addition to x and y (ie. 2<=3)
So 'burns' has to have at least a row for x and a row for y
in #tmp1. If in #tmp1 'burns' has only an 'x' or only a 'y'
no matter what other codes he has that's no good (2<=1).
In the case above you will see 'Yes' printed since the comparison
is true.

This whole scenario is referred to as relational division in
database terminology. But these simple ideas are obscured
by sql because you can't draw a picture of a row, nor a
table nor does sql understand comparing tables like integers.
So instead you're left with grouping and counting, joins,
intersects, existential queries and whatnot all trying to
express a simple idea yet at the same time obscuring it.

Now in a query you want to substitute all the unique names
from #tmp1 into our little table so for each person we can
test the comparison with #tmp1. What would such a query look
like?

select
 select distinct sName as aPerson from #tmp1 
  where      -- draw a table with two rows for each aPerson
             -- the 1st column has a value aPerson and
             -- the column is named 'sName'. The 2nd column is
             -- called 'code'. The column names and datatypes
             -- are the same as in #tmp1.
        table
             {
              row{aPerson as sName, 'x' as code},
              row{aPerson as sName, 'y' as code}
             }
             --  Compare the tables.
              <=
             --  Form a table from #tmp1 of rows belonging to the
             --  aPerson above.
              (Tmp1 where sName=aPerson);

Now this won't quite work in sql no matter where you execute it ☺ .
But what would a query really look like that will work with
#tmp1. Here it is. And it really is almost self-explanatory.
And it works in the D4 language! :)
(Tmp1 is a table same as #tmp1 stored in an Sql Server 2005 database).

select
  Tmp1 {sName aPerson}
    where
         table
              {
               row{aPerson sName, 'x' code},
               row{aPerson sName, 'y' code}
              }
               <=
              (Tmp1 where sName=aPerson);
   
aPerson
-------
burns 
jones 
smith 

Now this is what MS should be doing. Sql is a language of choice
for some things. But it certainly is not the choice language
for others.

Sunday, October 21, 2007

Sql Server - The Undefined Trigger

Ask the question: does Sql Server have a row level trigger.
Answer: it depends. From the point of view of application
development how in the world can the answer possibly be
ambiguous? Well lets back up and look at the Update statement.
From Sql Server 2005 bol:
Using UPDATE with the FROM Clause
'The results of an UPDATE statement are undefined if the
 statement includes a FROM clause that is not specified in
 such a way that only one value is available for each column
 occurrence that is updated, that is if the UPDATE statement
 is not deterministic'.
Ok what is really the rational for even keeping this
'proprietary' syntax? Since it is inherently non-deterministic
why even offer it to developers most of whom do not understand
but the simplest of sql. After all Sql Server is explicitly
relinquishing responsibility for the integrity of the Update and
making the user responsible. Shouldn't the idea of a database be
the other way around? Ok now lets move to the trigger. From bol
under the topic: Multirow Considerations for DML Triggers:
'When you write the code for a DML trigger, consider that the
 statement that causes the trigger to fire can be a single statement
 that affects multiple rows of data, instead of a single row.'
Does this sound familiar? Of course it does. The same non-determinism
of the FROM clause in Update now reappears in the insert table
of the trigger. But of course it is not presented as a 'trigger
is inherently undefined'. No, now the undefined nature of the
trigger is called 'a Multirow consideration'. And again the server
reliquishes responsibility of integrity and puts it in the
hands of the user. Well lets get real. There is no such thing as
a multirow trigger. From an applications standpoint the largest
scope of a trigger is a row. The so called multirow trigger is an 
example of what happens when users don't scrutinize what nonsense
is thrown their way. Instead of getting rid of a stupid idea
in Update using a FROM it was instead extended to a trigger.
It was simply an easy and expedient thing to do. But it was
shameful and lazy too. Lack of integrity is based on lack of
character. But it is equally disappointing that more users do
not complain and demand the integrity that a database should offer.
Sql Server can do much, much better.  But, like E.T., users must
phone home.

Wednesday, October 17, 2007

Sql - Constantly struggling

The subject of this article is called (by others) database constants.
So lets be sure we're on the same page with what a constant is.
It is just a value.

X = Substring(Y,1,5);

X and Y are variables, 1 and 5 are values.

X = Y * 5;

X and Y are variables, 5 is a constant. It is a number whose value is 5.
A table in D4 is just as much a variable as X and Y. A table in 
sql is like the value 5, a constant. The difference between a constant
and a variable is also the difference between D4, a relational
database, and sql!

The following is part of an exchange I had with someone on the
issue of database constants. The idea of database constants
was raised in the article:

A New (and Hopefully Better) Approach to Constants
http://www.sqlservercentral.com/articles/T-SQL/61244/

Here I go ☺ :

Sql Server lets users define a 'table variable'. Now you really
don't believe it's a table 'variable' do you?:)  Of course not.
They're just playing with words. There is no such thing in sql
server as a table variable. But because they call it that most
users really believe it. It's like believing in the tooth fairy:)
Now we're going to play the same game with database 'constants':)
There is no such thing as a database constant. There are only
values (which can't be updated or assigned other values) and
variables (which can be updated and assigned other values).
But in sql there is no real talk of these basic things. And
that's why I talk about Dataphor ☺. There is no difference between
what this guy is calling a 'constant' and a table or 'resultset' in
sql. They are all (constant) values and not variables. I'm trying
to show what sql would be like if there really were table 'variables'.
One way was with the CTE article, 'Do you know what the Sql CTE really is?'
But people seem to be having a hard time wrapping their head
around it:) This sql stuff just gets crazier. Now people are
going to believe that you can pass a table as a parameter to a
stored procedure. That is nonsense. One reason people believe it
is they have nothing to compare/constrast it with. So I show what
it really means to pass a table to a sp:
http://beyondsql.blogspot.com/2007/10/sql-whats-really-with-with.html
http://beyondsql.blogspot.com/2007/06/dataphor-13-passing-table-as-parameter.html
http://beyondsql.blogspot.com/2007/08/dataphor-creating-super-function.html
http://beyondsql.blogspot.com/2007/09/dataphor-all-tables-are-typed-variables.html
etc.etc.

But to do these kind of things requires a foundation different than
sql. Hence the name of my site ☺ .
The so called 'constants' are basically a table without a key, ie a row.
So I do:
create table Y
{
 A:Integer,
 B:String,
 key{}
};

insert row{1 A,'Micheal' B} into Y;

select Y[].B
'Michael'
select Y[].A
1
where the '[]' is called a row extracter.
Or I can define a 'list':
var LList:=list(String){'Michael','Steve','Jimmy'};
select LList[0];
Michael

Or I can define a table (ie a row) with lists.
create table Z
{
 Strings:list(String),  //The column holds a list of strings.
 Integers:list(Integer),//The column holds a list of integers.
 key{}
};

insert row{{'Micheal','Steve','Jimmy'} Strings,{2,10,22,40} Integers}
into Z;

select Z[].Strings[2];
Jimmy
select Z[].Integers[1];
10
where the [2] and [1] refer to the ordinal position of the item (value)
in the list.
But sql doesn't have a table 'type' or a list type so it couldn't
understand these things. The only thing sql can do is try to simulate
these types. And it winds up in a mess of gooblygook that few understand
and robs the user of the integrity that the database should provide
by supporting these things directly. All this stuff is based on values,
variables and types. The only types that exist is sql are numbers and
strings. Again there is no type for tables, row or list. And therefore
no variables for these things because without a type you can't define
a variable. Make sense ☺ .
var X:= Orders //X is a variable of type table, specifically the type
               //of the Orders table (the column names and there data types).
This is the huge step forward from sql. In sql the result of a query
is no different than the 'Constant' that guy was trying to talk about ☺ .
It is also the difference between a CTE and a variable that holds the
result of the 'CTE'. You can't reuse a CTE in sql because it's a 'constant',
a value. You can only reuse a variable.

Dataphor - Merging strings together

This problem was raised in the thread:
microsoft.public.sqlserver.programming
Monday, October 15, 2007 
'Parse and Merge two fields'
http://tinyurl.com/2eosa3

The idea is to take two pipe ('|') delimited strings and created one
string where each item in the first string is matched with the item 
in the second string having the same ordinal position.
The two columns are People and Position. So given:

People:=' Sam | Jane | Gene'
Position:= 'Accounting | Finance | Marketing'

we want:

People_Position:= 'Sam|Accounting|Jane|Finance|Gene|Accounting'

Of course in sql this is usually a mess since sql does not
support a list type and any of the operations that can transform
a delimited string to a list and visa versa.
In D4 we simply split the delimited string to a list of items and
then create a table from the list. The table will contain the
ordinal position as an integer value of each item in the list. 
So a row has the item (string) value and sequence value of that item
in the string as columns. So we just concatenate the columns of each
string together (separated by a pipe) and then concatenate these 
strings over the rows with the Concat operator ordering the 
concatenated string by the sequence value (over rows). 

Note that there is no concept of just unique item in a list. A list
can have any number of duplicate items. When the list is transformed
to a table the sequence number guarantees that same value items will
be in the table. In other words the sequence is a key of the table.

select ToTable('A|A|B|B|C'.Split({'|'}));

value sequence 
----- -------- 
A     0        
A     1        
B     2        
B     3        
C     4      

And with column names of your choice:

select ToTable('A|A|B|B|C'.Split({'|'}),'Str','Seq');

Str Seq 
--- --- 
A   0   
A   1   
B   2   
B   3   
C   4   

(Browse this site for other articles on string operations for more info).

Here is some sample data. The data is stored in a Sql Server 2005 database.
All queries are written with the D4 relational language of Dataphor.

create table JR
{
 id:Integer,
 People:String tags{Storage.Length='100'},
 Position:String tags{Storage.Length='100'},
 key{id}
};
 
insert
table
{
row{1 id,'joe|sam|pete|mike' People,'Manager|Team Lead|Worker Ant|Worker Ant' Position},
row{2,'A| C|  T ',' JobA|JobC |JobT '},
row{3,'A| C ',' JobA | JobB |JobC '}, //Ummatched strings (lists)
row{4,'Bush | Cheney | Rice ' ,' Worser|Worst|Worse '},
row{5,'Z |Y |X ' ,' Backwards Z|Backwards Y |Backwards X '}
} into JR;

select JR;

id People                Position                                
-- --------------------- --------------------------------------- 
1  joe|sam|pete|mike     Manager|Team Lead|Worker Ant|Worker Ant 
2  A| C|  T               JobA|JobC |JobT                        
3  A| C                   JobA | JobB |JobC                      
4  Bush | Cheney | Rice   Worser|Worst|Worse                     
5  Z |Y |X                Backwards Z|Backwards Y |Backwards X

Also note that we can access the table with a pass-thru query using t-sql:

select SQLQuery("SELECT id,People,Position FROM JR"); 

id People                Position                                
-- --------------------- --------------------------------------- 
1  joe|sam|pete|mike     Manager|Team Lead|Worker Ant|Worker Ant 
2  A| C|  T               JobA|JobC |JobCT                       
3  A| C                   JobA | JobB |JobC                      
4  Bush | Cheney | Rice   Worser|Worst|Worse                     
5  Z |Y |X                Backwards Z|Backwards Y |Backwards X   

Note that the only restriction used here is that both strings have
the same number of items. 

select 
  JR //The table JR in Sql Server 2005. 
  //SQLQuery("SELECT id,People,Position FROM JR") Or we could use this t-sql query . 
     where //The same 'where' statement as in sql. 
           //Check that each string has the same number of items.      
          People.Split({'|'}).Count()=Position.Split({'|'}).Count()
                                with {IgnoreUnsupported = 'true'}
      //For each row in the table we 'add' a column (People_Position).
      //The People_Position column is the '|' delimited string of
      //People and Position item by item.
      add
      {
       //From a table whose rows represents each item in People and Position,
       //for each row form a string of the person and position separated by
       //a '|' and concatenate that string over the rows into a single string.
       Concat({People_Position,Del} 
        from
         (
          //Use Split to create a 'list' of items from the delimited string.
          //Then create a table from list with ToTable. Column 'Str1'
          //is the string value of the item from the list. Column 'Seq'
          //is the item number from the list. So now a table represents
          //the original input column People. In other words, the rows of
          //the table are the delimited items in the string from left to right.
          (ToTable(People.Split({'|'}),'Str1','Seq'){Str1.Trim() Str1,Seq} )       
           join  //A natural join on Seq which is an integer from 0-># of items
                 //in the string. Seq preserves the sequence from left to
                 //to right of the items in the string.
          (ToTable(Position.Split({'|'}),'Str2','Seq'){Str2.Trim() Str2,Seq} )
            {Seq,Str1,Str2,'|' Del,(Str1+'|'+Str2) People_Position}  
         )
                order by {Seq}) //The concatenation over rows is ordered by the
                                //Seq column. This is the same order as the items
                                //in the original input columns, People/Position.
                                //Therefore the original order is preserved in
                                //the result (People_Position).
        People_Position //This is name of the new column added to each row of JR.
       }
        {id,People_Position}  //These are the only two columns we've chosen
                              //to display from the query.
            order by {id};    //Order by the values if the id column.
 
id People_Position                                           
-- --------------------------------------------------------- 
1  joe|Manager|sam|Team Lead|pete|Worker Ant|mike|Worker Ant 
2  A|JobA|C|JobC|T|JobT                                      
4  Bush|Worser|Cheney|Worst|Rice|Worse                       
5  Z|Backwards Z|Y|Backwards Y|X|Backwards X                


Instead of eliminating rows where the item counts don't agree
we can insert the string 'Unmatched' for those id values.

select 
 //Instead of table JR we access the table in sql server with a t-sql query.
 //In D4 the query does not return an sql 'resultset' but instead is treated
 //as a table 'variable' which is of a nature fundamentally different than sql.
 SQLQuery("SELECT id,People,Position FROM JR") 
 //Store the restriction on count as a boolean value (T/F) in column TestCnt.
  add{People.Split({'|'}).Count()=Position.Split({'|'}).Count() TestCnt}
                                with {IgnoreUnsupported = 'true'}
   add
      {
 //If the TestCnt is true concatenate the rows, else return a nil (null) value.
 //A case statement just like the sql case statement could also be used.
  if TestCnt
      then
       Concat({People_Position,Del}
        from
         (
          (ToTable(People.Split({'|'}),'Str1','Seq'){Str1.Trim() Str1,Seq} )       
           join
          (ToTable(Position.Split({'|'}),'Str2','Seq'){Str2.Trim() Str2,Seq} )
            {Seq,Str1,Str2,'|' Del,(Str1+'|'+Str2) People_Position}  
         )
                order by {Seq})
                else
                  nil
                    as String
        People_Position
      }
 //Insert the unmatched string for a nil value of the People_Position column.     
        {id,IfNil(People_Position,'*** Unmatched Strings ***') Field} 
          order by {id};
          
id Field                                                     
-- --------------------------------------------------------- 
1  joe|Manager|sam|Team Lead|pete|Worker Ant|mike|Worker Ant 
2  A|JobA|C|JobC|T|JobT                                      
3  *** Unmatched Strings ***                                 
4  Bush|Worser|Cheney|Worst|Rice|Worse                       
5  Z|Backwards Z|Y|Backwards Y|X|Backwards X                 


D4 allows all the expressive power you need to easily test for
any criteria. Here we add some additional rows to table JR that
should be eliminated.

insert
table
{
//Digits in Postion.
row{6 id,'joe|sam|pete|mike' People,'Manager|Team Lead|Worker Ant1|Worker Ant2' Position},
//Digits in People and Position
row{7 id,'joe|sam|pete1|mike' People,'Manager|Team Lead|Worker Ant1|Worker Ant' Position}
} into JR;

select JR;

id People                Position                                  
-- --------------------- ----------------------------------------- 
1  joe|sam|pete|mike     Manager|Team Lead|Worker Ant|Worker Ant   
2  A| C|  T               JobA|JobC |JobT                          
3  A| C                   JobA | JobB |JobC                        
4  Bush | Cheney | Rice   Worser|Worst|Worse                       
5  Z |Y |X                Backwards Z|Backwards Y |Backwards X     
6  joe|sam|pete|mike     Manager|Team Lead|Worker Ant1|Worker Ant2 
7  joe|sam|pete1|mike    Manager|Team Lead|Worker Ant1|Worker Ant

We add some additional restrictions to the where statement to eliminate
rows with strings that contain any non letters.

select 
 JR
  where 
    (
     ( People.Split({'|'}).Count()=Position.Split({'|'}).Count() )
     //We also test that each item is only letters (no digits or other junk).
     //This can be done any numbers of ways. Here we just compare the count
     //of all items to the count of items eliminated by characters other than 
     //letters.
       and
       (
        People.Split({'|'}).Count()
        =
        Count(ToTable(People.Split({'|'}),'F1Str','F1Seq') 
        //Test that items are made up of letters only. To eliminate
        //blanks contaiminating the test we remove blank character before
        //testing.
          where IsLetter(Replace(Trim(F1Str),' ','')))
       )
       and
       (
        Position.Split({'|'}).Count()
        =
        Count(ToTable(Position.Split({'|'}),'F2Str','F2Seq') 
          where IsLetter(Replace(Trim(F2Str),' ','')))
       )   
     )  with {IgnoreUnsupported = 'true'}        
      add
      {
       Concat({People_Position,Del}
        from
         (
          (ToTable(People.Split({'|'}),'Str1','Seq'){Str1.Trim() Str1,Seq} )       
           join
          (ToTable(Position.Split({'|'}),'Str2','Seq'){Str2.Trim() Str2,Seq} )
            {Seq,Str1,Str2,'|' Del,(Str1+'|'+Str2) People_Position}  
         )
                order by {Seq})
        People_Position
       }
        {id,People_Position}
          order by {id};

id People_Position                                           
-- --------------------------------------------------------- 
1  joe|Manager|sam|Team Lead|pete|Worker Ant|mike|Worker Ant 
2  A|JobA|C|JobC|T|JobT                                      
4  Bush|Worser|Cheney|Worst|Rice|Worse                       
5  Z|Backwards Z|Y|Backwards Y|X|Backwards X                

Thanks for stopping by ☺ .

Tuesday, October 09, 2007

Dataphor - Exploding hierarchical data

This article show examples of using the D4 explode operator.
This operator is used for expressing hierarchical data. It is
not a 'recursive' operator like the recursive sql CTE but more
like the Oracle Connect By construct.

The examples follow the ones used by Itzik Ben-Gan to illustrate
the recursive CTE query in Sql Server 2005, specifically the
'Single-Parent Environment: Employees Organizational Chart'
which can be found at:
http://msdn2.microsoft.com/en-us/library/ms345144.aspx#docum_topic4

The explode examples are based on the Employees table used in the
sql example. The table is stored in an Sql Server 2005 database.
Several of the examples use the concept of a dense rank.

As with most of my articles, the code is not necessarily 'clever' but
straightforward, not necessarily the best 'performant' but intended
to express the many different concepts and constructs in D4.

select Employees

empid mgrid empname  salary    
----- ----- -------- ----------
1     0     Nancy    $10,000.00
2     1     Andrew   $5,000.00 
3     1     Janet    $5,000.00 
4     1     Margaret $5,000.00 
5     2     Steven   $2,500.00 
6     2     Michael  $2,500.00 
7     3     Robert   $2,500.00 
8     3     Laura    $2,500.00 
9     3     Ann      $2,500.00 
10    4     Ina      $2,500.00 
11    7     David    $2,000.00 
12    7     Ron      $2,000.00 
13    7     Dan      $2,000.00 
14    11    James    $1,500.00 


Get a tree for a specific manager.

select
 (
  Employees
    explode
   by mgrid = parent empid
   where  mgrid=0
     order by {empid}
       include level
   with {IgnoreUnsupported = 'true'}
  )
// Do a little string insert to format the tree.  
    add {'' Temp, level-1 Totalspace}
     add {
          (
           empname.Insert(0,Temp.PadLeft(Totalspace,'|'))
          ).Replace('|',' | ')
          Tree
         }
        {Tree,mgrid,empid,empname,sequence,level};
       
Tree              mgrid empid empname  sequence level
----------------- ----- ----- -------- -------- -----
Nancy             0     1     Nancy    1        1    
 | Andrew         1     2     Andrew   2        2    
 |  | Steven      2     5     Steven   3        3    
 |  | Michael     2     6     Michael  4        3    
 | Janet          1     3     Janet    5        2    
 |  | Robert      3     7     Robert   6        3    
 |  |  | David    7     11    David    7        4    
 |  |  |  | James 11    14    James    8        5    
 |  |  | Ron      7     12    Ron      9        4    
 |  |  | Dan      7     13    Dan      10       4    
 |  | Laura       3     8     Laura    11       3    
 |  | Ann         3     9     Ann      12       3    
 | Margaret       1     4     Margaret 13       2    
 |  | Ina         4     10    Ina      14       3    

Treat each employee as if they are a manager. This will return the
subordinates of each employee regardless of whether or not
they are a manager.

select
 (
  Employees
    explode
   by mgrid = parent empid
   where  mgrid>=0         //Changing where to include all employees.
     order by {empid}
       include level
   with {IgnoreUnsupported = 'true'}
  )
     add {'' Temp, level-1 Totalspace}
     add {
          (
           empname.Insert(0,Temp.PadLeft(Totalspace,'|'))
          ).Replace('|',' | ')
          Tree
         }
        {Tree,mgrid,empid,empname,sequence,level} ;
       
Tree              mgrid empid empname  sequence level
----------------- ----- ----- -------- -------- -----
Nancy             0     1     Nancy    1        1    
 | Andrew         1     2     Andrew   2        2    
 |  | Steven      2     5     Steven   3        3    
 |  | Michael     2     6     Michael  4        3    
 | Janet          1     3     Janet    5        2    
 |  | Robert      3     7     Robert   6        3    
 |  |  | David    7     11    David    7        4    
 |  |  |  | James 11    14    James    8        5    
 |  |  | Ron      7     12    Ron      9        4    
 |  |  | Dan      7     13    Dan      10       4    
 |  | Laura       3     8     Laura    11       3    
 |  | Ann         3     9     Ann      12       3    
 | Margaret       1     4     Margaret 13       2    
 |  | Ina         4     10    Ina      14       3    
Andrew            1     2     Andrew   15       1    
 | Steven         2     5     Steven   16       2    
 | Michael        2     6     Michael  17       2    
Janet             1     3     Janet    18       1    
 | Robert         3     7     Robert   19       2    
 |  | David       7     11    David    20       3    
 |  |  | James    11    14    James    21       4    
 |  | Ron         7     12    Ron      22       3    
 |  | Dan         7     13    Dan      23       3    
 | Laura          3     8     Laura    24       2    
 | Ann            3     9     Ann      25       2    
Margaret          1     4     Margaret 26       1    
 | Ina            4     10    Ina      27       2    
Steven            2     5     Steven   28       1    
Michael           2     6     Michael  29       1    
Robert            3     7     Robert   30       1    
 | David          7     11    David    31       2    
 |  | James       11    14    James    32       3    
 | Ron            7     12    Ron      33       2    
 | Dan            7     13    Dan      34       2    
Laura             3     8     Laura    35       1    
Ann               3     9     Ann      36       1    
Ina               4     10    Ina      37       1    
David             7     11    David    38       1    
 | James          11    14    James    39       2    
Ron               7     12    Ron      40       1    
Dan               7     13    Dan      41       1    
James             11    14    James    42       1            

If we reverse 'by mgrid = parent empid' to 'by empid = parent mgrid' we
get the tree of a particular employee to their top level manager.

select
 (
  Employees
    explode
      by empid = parent mgrid
         where  empid=13
   order by {empid}
       include level
   with {IgnoreUnsupported = 'true'}
  )
     add {'' Temp, level-1 Totalspace}
     add {
          (
           empname.Insert(0,Temp.PadLeft(Totalspace,'|'))
          ).Replace('|',' | ')
          Tree
         }
        {Tree,mgrid,empid,empname,sequence,level};
       
Tree           mgrid empid empname sequence level
-------------- ----- ----- ------- -------- -----
Dan            7     13    Dan     1        1    
 | Robert      3     7     Robert  2        2    
 |  | Janet    1     3     Janet   3        3    
 |  |  | Nancy 0     1     Nancy   4        4    
       
By changing the where predicate to >=1 we can get a report on all
employees (note where empid>=3 eliminates the graph of the 1st 2
employees (Nancy and Andrew) but doesn't eliminate them from graphs
of other employees who report to them).


Here we create an operator that will give a graph in either direction
of a specific employee and their top level manager. In other words,
we can either start with the employee and go down to their top level
manager or start with the employees top level manager and go down
to the employee. The concept of the dense rank is used for binding
all rows of each employee together. By getting the empname and empid
for each dense rank we can target any employee by name or number.   
(Note we could, of course, use a view or any number of other constructs.
I just felt in the mood to use an operator  ).

The operator takes two arguments. The first, aTreeTable, is a table
of type Employees. The second, Start_At, is a string and indicates
the direction of the graph. Using 'M' for manager starts with the
manager. Using 'E' starts with the employee.

create operator EmpTree(aTreeTable:typeof(Employees),Start_At:String):
//The operator returns a virtual table with columns and their data types
//defined by the typeof expression.
typeof(
       Employees
       add{1 level,'S' empnamerank,1 empidrank, 'T' Tree_Graph, 1 Rank, 1 TreeOrder}
       )
begin
result:=table of typeof(result){};
//Starting at top level mgr for each employee is a desc sort (default).
//Starting at each employee to the top level mgr. is an asc sort.    
//Start_At=M(gr) is desc sort (default)
//Start_At='E(mp) is asc sort.
var LSort:='D'; //Default.
if ( ((Start_At.Trim())[0]).Upper() ) = 'E'
   then LSort:='A' ;
var T:=
aTreeTable
explode  
by empid = parent mgrid
    where  empid>=1
      order by {empid}
       include level
   with {IgnoreUnsupported = 'true'};
//Get a dense rank. This rank binds all rows for each employee
//together. The idea is to increment a count for every level 1 since
//a level 1 indicates the start of a new employee.
var TR:=
       T add{case when level=1 then empname else nil end NameEmp}
        add
           {
            Count(
                  T rename {sequence sequenceX}
                   where  (sequenceX<=sequence) and (level=1)
                  ) 
             Rank
            };
var SR:=
     TR               
      join
       (
        TR group by {Rank}
        //We want the emp name, number (empid) and reverse level for each
        //dense rank (empname). We want the reverse level so we can get
        //the tree representative from the employee to top level manager
        //AND the top level manager to the employee.
                add{Max(empid) NumEmpid,Max(NameEmp) Emp_Name,Max(level) Maxlevel}
        )                        
 add {
    (empname.Insert(0,''.PadLeft( (if LSort='A' then (level-1) else (Maxlevel-level)),
                                                         '|'))).Replace('|',' | ')
        Tree_Graph
     }
     rename {sequence seq};
result:=
 ToTable(
  ToList(
 cursor(SR
 order by
     {
      Rank,
      seq
      sort ((1 - (2*ToInteger((LSort = 'D'))))*(.left.value ?= .right.value)) asc
     }
    
       )//cursor
       )//ToList
      )  //ToTable
        {empid,mgrid,empname,salary,level,Emp_Name empnamerank,
          NumEmpid empidrank,Tree_Graph,Rank,sequence+1 TreeOrder};
end;         


Here we show the first five employees by using Rank in a where statement.
The value of the Rank corresponds to the ascending order of empid.
Because empid starts at 1 Rank happens to be equal to the empid.

select EmpTree(Employees,'Mgr') //We use the operator as if it were a table.
  where Rank<=5
      order by {TreeOrder};     

empid mgrid empname  salary     level empnamerank empidrank Tree_Graph   Rank TreeOrder
----- ----- -------- ---------- ----- ----------- --------- ------------ ---- ---------
1     0     Nancy    $10,000.00 1     Nancy       1         Nancy        1    1        
1     0     Nancy    $10,000.00 2     Andrew      2         Nancy        2    2        
2     1     Andrew   $5,000.00  1     Andrew      2          | Andrew    2    3        
1     0     Nancy    $10,000.00 2     Janet       3         Nancy        3    4        
3     1     Janet    $5,000.00  1     Janet       3          | Janet     3    5        
1     0     Nancy    $10,000.00 2     Margaret    4         Nancy        4    6        
4     1     Margaret $5,000.00  1     Margaret    4          | Margaret  4    7        
1     0     Nancy    $10,000.00 3     Steven      5         Nancy        5    8        
2     1     Andrew   $5,000.00  2     Steven      5          | Andrew    5    9        
5     2     Steven   $2,500.00  1     Steven      5          |  | Steven 5    10     


We can overload the EmpTree operator so as to provide a default value
for the Start_At parameter, ie. the direction of the graph. We make
the default 'M' so the graph starts with the top level manager of the
employee.

We simply supply the literal 'Mgr' for the sort direction for the
same operator whose signiture includes the Start_At parameter.

create operator EmpTree(aTreeTable:typeof(Employees)):
typeof(
       Employees
       add{1 level,'S' empnamerank,1 empidrank, 'T' Tree_Graph, 1 Rank, 1 TreeOrder}
       )
begin
result:= EmpTree(aTreeTable,'Mgr');
end;

     
Here get the tree of employee 'James' starting at the highest level manager
by using the overload signature of the EmpTree operator.

select EmpTree(Employees)
  where empnamerank='James'
  {Tree_Graph,TreeOrder}
      order by {TreeOrder};
     
Tree_Graph        TreeOrder
----------------- ---------
Nancy             38       
 | Janet          39       
 |  | Robert      40       
 |  |  | David    41       
 |  |  |  | James 42       


Here we start with employee 'James' up to his highest level manager.

select EmpTree(Employees,' Emp ')
  where empnamerank='James'
  {Tree_Graph,TreeOrder}
      order by {TreeOrder};
     
Tree_Graph        TreeOrder
----------------- ---------
James             38       
 | David          39       
 |  | Robert      40       
 |  |  | Janet    41       
 |  |  |  | Nancy 42        


Here we use the EmpTree operator to get the enumerated paths in both
directions using the Concat (concatenation) operator. (More info here).

select
 (EmpTree(Employees) add{'.' Del} adorn{key{empidrank,TreeOrder}})
   group by {empidrank}
    add 
          {
            Max(empnamerank) empname,
            Concat(empname,Del order by {empidrank,TreeOrder}) PathMgrtoEmp,
            Concat(empname,Del order by {empidrank,TreeOrder desc}) PathEmptoMgr
          }
     rename {empidrank empid}
      order by {empid};
     
empid empname  PathMgrtoEmp                   PathEmptoMgr                  
----- -------- ------------------------------ ------------------------------
1     Nancy    Nancy                          Nancy                         
2     Andrew   Nancy.Andrew                   Andrew.Nancy                  
3     Janet    Nancy.Janet                    Janet.Nancy                   
4     Margaret Nancy.Margaret                 Margaret.Nancy                
5     Steven   Nancy.Andrew.Steven            Steven.Andrew.Nancy           
6     Michael  Nancy.Andrew.Michael           Michael.Andrew.Nancy          
7     Robert   Nancy.Janet.Robert             Robert.Janet.Nancy            
8     Laura    Nancy.Janet.Laura              Laura.Janet.Nancy             
9     Ann      Nancy.Janet.Ann                Ann.Janet.Nancy               
10    Ina      Nancy.Margaret.Ina             Ina.Margaret.Nancy            
11    David    Nancy.Janet.Robert.David       David.Robert.Janet.Nancy      
12    Ron      Nancy.Janet.Robert.Ron         Ron.Robert.Janet.Nancy        
13    Dan      Nancy.Janet.Robert.Dan         Dan.Robert.Janet.Nancy        
14    James    Nancy.Janet.Robert.David.James James.David.Robert.Janet.Nancy
    

Here we get the counts of employees directly or indirectly reporting to managers.

select
  Employees
    explode
      by empid = parent mgrid
         where  empid>=1
   order by {empid}
       include level
   with {IgnoreUnsupported = 'true'}
     group by {mgrid} add{Count() MgrCnt}
      where mgrid>0

 mgrid MgrCnt
 ----- ------
 1     13    
 2     2     
 3     7     
 4     1     
 7     4     
11     1    


Here are the employees who are the managers. For example there are 13 people
who report to Nancy. This is represented indirectly by people who report
to the three managers directly below her (Andrew, Janet, Margaret) and
directly by the same managers reporting to her. Dan, David and Ron report
to Robert. But there is an additional employee reporting to David (James)
so Robert (mgrid 7) has 4 employees reporting to him.

select
 (
  Employees
    explode
      by empid = parent mgrid
         where  empid>=1
   order by {empid}
       include level
   with {IgnoreUnsupported = 'true'}
     group by {mgrid} add{Count() MgrCnt}
  )
   join Employees
     where mgrid>0
      join  ( Employees {empid MgrEmpid,empname MgrName} )
        by mgrid=MgrEmpid
      {mgrid,MgrName,empname,MgrCnt};
     
mgrid MgrName  empname  MgrCnt
----- -------- -------- ------
1     Nancy    Andrew   13    
1     Nancy    Janet    13    
1     Nancy    Margaret 13    
2     Andrew   Michael  2     
2     Andrew   Steven   2     
3     Janet    Ann      7     
3     Janet    Laura    7     
3     Janet    Robert   7     
4     Margaret Ina      1     
7     Robert   Dan      4     
7     Robert   David    4     
7     Robert   Ron      4     
11    David    James    1           


Here we get salaries of subordinates under managers. Those employees
who are not managers are omitted.

var T:=
Employees
explode  
by mgrid = parent empid
     where  mgrid>=0
      order by {empid}
       include level
   with {IgnoreUnsupported = 'true'};
//Get a dense rank. This rank binds all rows for each employee
//together. The idea is to increment a count for every level 1 since
//a level 1 indicates the start of a new employee.
var TR:=
       T add{case when level=1 then empname else nil end NameEmp}
        add
           {
            Count(
                  T rename {sequence sequenceX}
                   where  (sequenceX<=sequence) and (level=1)
                  ) 
             Rank
            };
var SR:=
   (
     TR               
      join
       (
        TR group by {Rank}
        //We want the empid and name for each dense rank.
          add{Min(empid) NumEmpid,Max(NameEmp) Emp_Name}
        )                        
    )
    //We only want managers, those that have subordinates.
     where
     NumEmpid
              in
                ( Employees {mgrid} ) with {IgnoreUnsupported = 'true'}
      add {
         ( (empname+ case when level>1 then ' ('+ToString(salary)+')'
             else '' end).Insert(0,''.PadLeft((level-1),'|'))).Replace('|',' | ')
              Mgr_Sal_Tree
          };

select SR {Mgr_Sal_Tree,sequence} order by {sequence};   

Mgr_Sal_Tree                  sequence
----------------------------- --------
Nancy                         1       
 | Andrew ($5,000.00)         2       
 |  | Steven ($2,500.00)      3       
 |  | Michael ($2,500.00)     4       
 | Janet ($5,000.00)          5       
 |  | Robert ($2,500.00)      6       
 |  |  | David ($2,000.00)    7       
 |  |  |  | James ($1,500.00) 8       
 |  |  | Ron ($2,000.00)      9       
 |  |  | Dan ($2,000.00)      10      
 |  | Laura ($2,500.00)       11      
 |  | Ann ($2,500.00)         12      
 | Margaret ($5,000.00)       13      
 |  | Ina ($2,500.00)         14      
Andrew                        15      
 | Steven ($2,500.00)         16      
 | Michael ($2,500.00)        17      
Janet                         18      
 | Robert ($2,500.00)         19      
 |  | David ($2,000.00)       20      
 |  |  | James ($1,500.00)    21      
 |  | Ron ($2,000.00)         22      
 |  | Dan ($2,000.00)         23      
 | Laura ($2,500.00)          24      
 | Ann ($2,500.00)            25      
Margaret                      26      
 | Ina ($2,500.00)            27      
Robert                        30      
 | David ($2,000.00)          31      
 |  | James ($1,500.00)       32      
 | Ron ($2,000.00)            33      
 | Dan ($2,000.00)            34      
David                         38      
 | James ($1,500.00)          39    

Here is the same tree as above (of salaries for subordinates) using a
table and a view created with a pass-thru query.

We can create a table based on the result of explode.

create table Emp_T
from
     (
       Employees
       explode  
       by mgrid = parent empid
       where  mgrid>=0
       order by {empid}
       include level
       with {IgnoreUnsupported = 'true'}
       add{case when level=1 then empname else nil end NameEmp}
        adorn  //We can include various meta-data about the columns of the table.
             {
              NameEmp nil static tags {Storage.Length = "10"},
              empname static tags {Storage.Length = "10"}
             }//end adorn.
     );  

Now we get the dense rank using a pass-thru query to Sql Server. The result
of the sql query could be set to var TR which means that TR is a table
variable in D4. The TR variable is therefore of the same nature
no matter how it (a table variable) was derived. Or we could create
a view based on the pass-thru query and use that.
var TR:= but we're using a view instead of setting the pass-thru to var TR.

create view TR
  SQLQuery("select A.*, 
      (select Count(*)
              from Emp_T as B
                 where (B.sequence<=A.sequence) and (B.level=1)) as Rank
          from Emp_T as A");

Use the TR view in a batch to get the salary tree.

var SR:=
   (
     TR               
      join
       (
        TR group by {Rank}
        //We want the empid and name for each dense rank.
                add{Min(empid) NumEmpid,Max(NameEmp) Emp_Name}
        )                        
    )
    //We only want managers, those that have subordinates.
     where
     NumEmpid
              in
                ( Employees {mgrid} ) with {IgnoreUnsupported = 'true'}
      add {
         ( (empname+ case when level>1 then ' ('+ToString(salary)+')'
             else '' end).Insert(0,''.PadLeft((level-1),'|'))).Replace('|',' | ')
              Mgr_Sal_Tree
          };
select SR {Mgr_Sal_Tree,sequence} order by {sequence};

Mgr_Sal_Tree             sequence
------------------------ --------
Nancy                    1       
 | Andrew (5000)         2       
 |  | Steven (2500)      3       
 |  | Michael (2500)     4       
 | Janet (5000)          5       
 |  | Robert (2500)      6       
 |  |  | David (2000)    7       
 |  |  |  | James (1500) 8       
 |  |  | Ron (2000)      9       
 |  |  | Dan (2000)      10      
 |  | Laura (2500)       11      
 |  | Ann (2500)         12      
 | Margaret (5000)       13      
 |  | Ina (2500)         14      
Andrew                   15      
 | Steven (2500)         16      
 | Michael (2500)        17      
Janet                    18      
 | Robert (2500)         19      
 |  | David (2000)       20      
 |  |  | James (1500)    21      
 |  | Ron (2000)         22      
 |  | Dan (2000)         23      
 | Laura (2500)          24      
 | Ann (2500)            25      
Margaret                 26      
 | Ina (2500)            27      
Robert                   30      
 | David (2000)          31      
 |  | James (1500)       32      
 | Ron (2000)            33      
 | Dan (2000)            34      
David                    38      
 | James (1500)          39


Here we get the sum of salaries for subordinates under managers using
the TR view. Eliminating level 1 in the query excludes the managers
salary in the sum. Again the dense rank idea makes this an easy query.

select
   (
    TR               
      join //This is a natural join based on Rank.
       (
        TR group by {Rank}
        //We want the empid and name for each dense rank.
            add{Min(empid) NumEmpid,Max(NameEmp) Emp_Name}
        )//->The relation of view TR being group by Rank.
    )//->The relation from TR joined to (TR grouped by Rank) 
          where level>1 //A 'where' applied to above relation. This relation
                        //is now grouped by NumEmpid to get subordinate salaries.
            group by{NumEmpid} add{Max(Emp_Name) Emp_Name,Sum(salary) SumSalary}
              order by {NumEmpid};

NumEmpid Emp_Name SumSalary 
-------- -------- ----------
1        Nancy    37500.0000
2        Andrew   5000.0000 
3        Janet    15000.0000
4        Margaret 2500.0000 
7        Robert   7500.0000 
11       David    1500.0000 

Monday, October 01, 2007

Sql - The Two Standards of the Sql CTE

Sql really has two standards. There is the ANSI sql standard and then
there is the DOUBLE standard.

This example uses Sql Server 2005 and specifically the CTE (common table expression).

Sql Server sees nothing wrong with this query that duplicates a column name.
Something that no database should allow.

SELECT employeeid,customerid as employeeid
FROM orders

From Bol on the CTE column_name
'Duplicate names within a single CTE definition are not allowed.The list of
 column names is optional only if distinct names for all resulting  columns
 are supplied in the query definition.'

But this query produces an error.

WITH Emps  (Emp, Emp) AS
(
    SELECT employeeid,customerid as employeeid
    FROM orders
)
SELECT *
FROM Emps
Error: 'The column 'Emp' was specified multiple times for 'Emps'.

But try this. And it works!

WITH Emps  (Emp, Emp1) AS
(
    SELECT employeeid,customerid as employeeid
    FROM orders
)
SELECT *
FROM Emps

So what does Bol really mean in the column_name description. It means
that the stupid and error prone idea of creating duplicate column names
is perfectly ok in the query definition. It's only the outer reference
names (the  ( column_name [ ,...n ] ) that are going to be checked for
duplicates. So the MS statement:
'The list of column names is optional only if distinct names for all
 resulting columns are supplied in the query definition.'

is a recognition that its okay to screw up the query_definition as
as long as you come away with distinct names in the CTE. In other words,
give distinct names for the 'same' columns created in the query_definition
with the same name. If you don't duplicate names in the quey_definition
you can dispense with the CTE column name list totally. So not only is the
MS statement a statement of guilt (don't expect us to correct duplicate column
names when it first occurs, we'll catch it the 2nd time it comes around),
it's also a statement of a double standard. One standard for an independent
SELECT statement and another standard for a CTE! Double standards allow
the initial error to propagate. Not what you want in a database.

This is what a database should do when duplicate columns are declared.
This is what Dataphor will do when you try this.

select Orders {EmployeeID,CustomerID EmployeeID};
Error: Duplicate object name "EmployeeID".

Only if you 'love' sql could you tolerate its nonsense ☺ .

For more on the sql CTE see:
'Do you know what the Sql CTE is?'
http://beyondsql.blogspot.com/2007/10/sql-whats-really-with-with.html

Sql - Do you know what the Sql CTE is?

From Sql Server 2005 Bol for WITH CTE (common_table_expression):
'Specifies a temporary named result set, known as a common table expression (CTE).
 This is derived from a simple query and defined within the execution scope of
 a single SELECT, INSERT, UPDATE,
 or DELETE statement.'

Now this works fine (using the NorthWind database in Sql Server 2005):

WITH Emps (Emp, Cnt) AS
(
    SELECT employeeid,count(*) AS Cnt
    FROM orders
    GROUP BY employeeid
)
SELECT Max(Cnt)
FROM Emps

But try to define a CTE independently. You can't.
So this CTE craps out:

WITH Emps (Emp, Cnt) AS
(
    SELECT employeeid,count(*) AS Cnt
    FROM orders
    GROUP BY employeeid
)

Try to use the CTE in a batch with a 2nd SELECT. You can't 'reuse' the CTE.
So this batch craps out with an error thrown on the use of the CTE
on the 2nd SELECT.

WITH Emps (Emp, Cnt) AS
(
    SELECT employeeid,count(*) AS Cnt
    FROM orders
    GROUP BY employeeid
)
SELECT Max(Cnt)
FROM Emps

SELECT Min(Cnt)
FROM Emps       -- Invalid object name 'Emps'.

Now what if a CTE in a batch could be defined independently and
be used with multiple SELECT statements. What would the batch look
like. Well it could look something like this using the D4 language of Dataphor
(see http://beyondsql.blogspot.com/2007/09/dataphor-all-tables-are-typed-variables.html
 for some more background):

//Define a CTE independently. And to define a CTE independently it must
//be stored in a variable. We are using a pass-thru query from D4 to Sql Server
//in t-sql to the NorthWind database.
var EMPS.CTE:=
              SQLQuery("SELECT employeeid,count(*) AS Cnt
                          FROM orders
                            GROUP BY employeeid");
//Define a row using the CTE.
var LRow:=row{Min(Cnt from EMPS.CTE) MinCnt,Max(Cnt from EMPS.CTE) MaxCnt};//A row.  
//SELECT the row.
select LRow; 
/*
MinCnt MaxCnt
------ ------
42     156
*/
//Use the CTE in a 2nd SELECT, SELECT a table
select EMPS.CTE return 2 by {Cnt desc};
/*
employeeid Cnt
---------- ---
4          156
3          127
*/

What is really going on here. Now Sql has managed to immunize itself
against computer science forever. But if we were to apply a compute
science term to the sql CTE what term what we call it. We would call
it a value. Because a value cannot exist independently (by itself :))
and cannot be reused. To reuse a value we have to declare every time
we want to use it. With a variable we don't have to worry about
any of these value limitations. We can simply reuse the variable
as much as we want. The 'var' in 'var EMPS.CTE:=..' means define
a variable that contains a value and the value is the CTE SELECT
statement. So now you know the sql term 'result set' is really just
a value :) And now you know the benefit of working with variables
over values.

Of course the situation with the sql CTE is not the only case where
you have to repeat the damn thing to use it.

This works fine in sql.

SELECT orderid,customerid,employeeid
FROM orders
WHERE customerid in ('COMMI','TORTU','HUNGO')

But can you store the list of ('COMMI','TORTU','HUNGO') in a variable? No.
You have to repeat it every time you want to use it. It's because there is
no variable available to assign the list to. Of course in D4 we can
assign the list to a variable and use the variable just like the CTE.

var LList:=list(String){'COMMI','TORTU','HUNGO'};
select Orders
where CustomerID in LList
{OrderID,CustomerID,EmployeeID};

Doesn't this make more sense than having to use xml in sql to shred the
string or other crazy sql techniques to store the list items in a table? 

About Me

My photo
Phoenix, Arizona, United States