Beyond Sql: Sql Mythology

Greetings,

To paraphrase 'V' from the movie 'V for Vendetta' if
as an application developer or even as an sql programmer
you feel that everything is fine and dandy in S2005
and even .net you need read no further. But if you feel
there is something askew, something that does not make
alot of sense in sql land, then I suggest you read on.

'Think in terms of sets' in sql is empty of any real meaning
except don't use declarative statements. This phrase should
be replace with 'think in terms of application development'.
As a critic of sql I can only go so far. So from time to time
I'm going to walk the walk instead of just talk the talk.

I'm going to show you how to go about application development
in a new way. A way that makes sense. I'm going to introduce
you to Dataphor from:
www.alphora.com

In broad strokes Dataphor is an application development system
based on the MS .net technologies. It is therefore a strongly
typed system. It offers a table type of a very specific nature,
a relational table type. To support this type there are a host
of 'relational' operators. The table type and its corresponding
operators are the advancement over and beyond current sql. All
the types, be they table, strings, numbers, boolean can be
accessed and manipulated within a Pascal like declarative
language. And within the language all types can logically be
treated as 'variables'. This development environment is referred
to as 'D4'.

The availability of a 'relational' system is a fairly recent thing.
What transpired in the interim was the perpetuation of a grand myth.
Developers were led to believe that sql was relational! Led by
vendoraptors like IBM and Oracle and more recently MS and industry
punitwits like Joe Celko, the illusion of sql working as if it
were relational was and still is the industry norm. Finally with
Dataphor developers can 'see' the differences for themselves. Of
course there are many myths surrounding sql and I hope to get to
most of them in time -:)

So lets look at question raised in the post:
Tuesday, May 22, 2007 8:02 AM
microsoft.public.sqlserver.programming
'parsing string into sequence of numbers and letters?'
http://tinyurl.com/yo28he

This question elucidates many themes still prevalent today.
1. The Abbott & Costello syndrome. Where is the appropriate place
   for a data operation, the client or the server. Every
   failure for sql to support an operation has been portrayed
   as the inappropriateness for that operation to done on the
   server. A glaring example of this is list/string operations.
   A favorite tactic is too call an operation 'formatting' and
   argue that it belongs on the client. This none-sense has gone
   on forever and disappears in the strong typed environment of D4.
   The artifical dichotomy between client/server is a major
   hindrance to application development. It is now possible for
   developers to see this for themselves.
2. There is no logical manifestation of the concept of a 'key'
   in sql. Like so many concepts sql convinced users that a key
   was synonymous with an 'index'. Keys are of such fundamental
   importance that it is actually problematic to even call S2005
   (or Oracle,DB2) a database! With D4 developers can see all the
   'logical' advantages they gain with their use. All the none-sense
   about 'faking' arrays in sql goes out the window in D4. As you
   will see there is no reason to fake something and create more
   of a mess when the concept of an 'array' is seen as a component 
   of a key. It is often written that sql is a 'data retrieval
   language'. It is more meaningful to say it is a 'fast' retrieval
   language. That is why there is so much emphasis on indexes.
   But performance comes at a cost such as integrity and manipulation. 
   Up till now this cost has been covered up with all kinds of
   none-sense. 
3. D4 offers consistency of operations across types. Such consistency
   bring clarity to application development no found in sql.

What we want to do here is take a string and break it apart
creating separate columns for letters and digits. The only
twist is that there is an explicit ordering - letters for the
1st column, digits for the 2nd and letters for the 3rd column.
If a string does not have letters-digits-letters we want to
fill the remaining column(s) with a placeholder (we're using
a blank here but a null (nil in D4) or anything else could be used).

Table VMI will reside in the pubs database in S2005.
RowID will be the clustered primary key. Additional indexes
can be created directly in S2005 or thru D4. All logical operations
concerning 'keys' will be handled by the Dataphor server.
Essentially all data operations are handled by D4. This includes
all data integrity and manipulation. And both are a qualitative
advance over sql. S2005 is simply the data respository and D4 
will take advantage of its (hopefully) fast data access paths. 
The D4 server 'chunks' data operations. What this means is D4 
will determine when it is appropriate to hand over query requests
(as well as intermediate steps) to S2005 and when to handle them itself. 
This whole process is basically transparent to the user. The physical
implementation details needn't concern the user.

create table VMI 
{
RowID:Integer,
InputStr:String
static tags {Storage.Length = '12'},
key{RowID}
};
 
insert table
{
row{1 RowID,'AA99BB' InputStr},
row{2,'WWW8888PPPPP'},
row{3,'77DD'},
row{4,'EE66'},
row{5,'G'},
row{6,'HHHH5J'},
row{7,'44'},
row{8,'22KK'}
} into VMI;    
 
//The data is now in S2005 and available to D4.
//select VMI;
 
Operator Op1VMI() uses string operators to chunk together like characters
within a RowID. As an operator it is available to be used anywhere.
An operator can return any kind of supported type in Dataphor. That
means it can return: nothing, a table, a row (multiple scalar values),
a scalar, a cursor, a list. The result may be virtual or it may persist
in a table. Whatever it can return, it can also use as arguments (parameters). 
Since 'relational' is synonymous with 'type', tables/rows can be arguments
since we are dealing with structure based on type, ie for a table its type
is the structure of the table the column names and their scalar types. 
This a fundamental difference from sql which equates a table with its file name. 
This is also why in Dataphor we are dealing with tables as relational variables. 
A table is just another type of variable, but a variable nonetheless just like
an integer.  
Also note that there is no arbitrary breakup of operators. There are no 'procedures'
as opposed to 'functions' as in sql server. There are no limitations as what
can be done in an operator as there are in sql server. All the sql limitations
are based on design flaws that developers have to work around. This is another 
example of the general pattern of 'inconsistency' in sql:).
If your wondering why not define a view instead of an operator I can understand.
Views in Dataphor are not the same relatively simpleton macros they are 
in sql. They are much more highly evolved in a relational system and serve
many different functions not available in sql. There is no reason for a 
'relational' view here, again all in good time -:). 

create operator Op1VMI():
     table{RowID:Integer,StrChar:String,InputStr:String,Rank:Integer,
             Cnt:Integer,PadStr:String,sequence:Integer}
begin             
result:=table of typeof(result){};
result:=
 ToTable
  (
   ToList
    (
     cursor
       (
         ( 
    //Each character in a string can be addressed by its indexed position.
           (VMI add {Length(InputStr) LenStr})
              times
               numbers
                where num<=(LenStr-1)
                 add {InputStr[num] StrChar}
                   remove {LenStr}
          )
          //Here we're concatenating repetitions of characters into a single string.
          //We could have simply used the Concat operator but there is a rich
          //number of string operators available in D4 (and you can add your own).
  
            group by {RowID,StrChar} 
               add{Max(InputStr)InputStr,Min(num)Rank,Count()Cnt}
                add {PadRight(StrChar,Cnt,StrChar) PadStr}  
                 order by {RowID,Rank}
  
    //Using ToTable(ToList(cursor creates a table with a
    //sequence (rank) column in the order of the cursor declaration.
    //The sequence column is similar the sql-99 row_number() function.
    //In fact could have used a passthru query to sql server and used
    //row_number(). The returned table would then be available for
    //manipulation within Dataphor. All in good time -:).
 
        )//cursor   
      )//ToList      
    );//ToTable
end;
 
Invoking the operator makes its result (a table) available, just like 
a table. For examples: 
select Op1VMI() order by {sequence};

Table LVMI is similar in concept to a #tmp table in S2005.
We're going to store both the delimited string (DelStr)
and the delimited string as a 'list' (StrAsList). You need
only a dose of common sense to understand the difference
between the 'string' TYPE and the 'list' TYPE. In D4 a type
is supported by a plethora of dedicated operators. This
includes operations for strings and lists not available in sql.
These operators also serve to guarantee the integrity of the
type. The vacuum left by their absence in sql gives raise to a
special type of programming, commonly referred to as a 'kludge'-:)
(Wasn't LV1 the planet in Alien/Aliens?)

create session table LVMI
 {
  RowID:Integer,
  InputStr:String,
  DelStr:String,   //There is no real need to store the delimited
                   //string. In fact it defeats the purpose of the
                   //'list'. But we do it to illustrate the difference
                   //between a delimited string and a list. 
  StrAsList:list(String), 
  key{RowID}
 } ; 
 
In sql there is always much adieu made of violating normal form
by storing delimited strings. The whole argument disappears with
a 'list' type.
 
As I stated tables and table expressions are just variables.
We can therefore define a batch using variables just as if we were
crunching numbers. 

As you learn the 'relational' query language constructs in Dataphor
one of the first new things you will learn is that 'select' in sql
is a fundamental part of a query expression. And that is just
non-sense. This is just how sql was designed 40 years ago. I was as
ridiculous then as it is now. There is no reason for this and in the
relational world it is eviscerated from expressions and used only where
it makes sense, as part of a 'statement' to retrieve data.

An operator that returns a table is treated exactly as a table is.
It is just common sense! And there is no silly 'select' to muddle things
up!

*** Start of batch using variables to represent table expressions ***

var B1:=

//This code creates the column PRank which is like the sql row_number()
//partitioned by RowID. Subtracting OffSet from sequence within each 
//RowID will produce a rank from 1 to N within each RowID. The join
//is what is known as a 'natural' join. The D4 server simply joins
//on the same column names in the two table expressions (RowID). Of
//course D4 allows the join declared using 'by' columns (like 'on' in S2005).

            Op1VMI()
               join
               (
                 Op1VMI() group by {RowID} add{Min(sequence) OffSet}
                   redefine { OffSet:= OffSet-1 }
               )    
                  add{sequence-OffSet PRank}  ;   

/*
Again B1 is a variable that holds a table based on Op1VMI() to
which we have added a new column PRank.
*/                  
//select B1 order by {sequence}; Uncomment to see B1 table.                  

/*
'Add' is how constants/expressions are included in table expressions.
D4 is a relational algebra and if you remember your high school
class you know, atleast tacitly, how it works. You simply group
terms together and 'add' new terms, glue them together and add
still more terms. On the other hand, no one really knows how
sql works. It's not an algebra nor a calculus. It's a 'that just
the way it is' language. It is essentially faith based. Given
enough time and permutations of constructs one gains faith that
somehow a mess will produce a result -:) Forty years ago IBM produced
this bastard child. Larry Ellison piled more stupidity on it and made
a fortune. Not even the history of sql makes any sense.
*/

var B2:= 
  (
    B1
     join
       (
        B1
         add { (((PRank mod 2) =1 ) and (IsLetter(PadStr)=true)) ColDef1}
 
 /*
 There are those that argue that the boolean type has no place in
 a relational language. Of course these same people are quick to
 point out the concept of 'relation division'. This is another
 form of non-sense. The group by Any is an 'iterated or' over rows
 of a table. It returns true if any ColDef1 for a row is true and
 false otherwise. It is being used here to change (shift) the PRank
 column depending on whether the string is digit/letter. 
 */
           group by {RowID} add{Any(ColDef1) TestCols}       
        )   
          redefine{PRank:=PRank+ToInteger(not TestCols)}
   )         adorn {key{RowID,PRank}}

/*
Adorn is a way for the user to communicate with D4 on a logical
level. Here we are telling D4 to create a specific key in
addition to the ones it infers. In sql you have resultsets that
are essentially like spread sheets. They have no keys as sql
is concerned with performance and therefore indexes. In D4 there
are only tables regardless of how they are derived. And tables
to be tables must have key(s). The reason sql has a 'distinct'
construct is to cover up this embarrasement. There is no 'distinct'
in D4 and the system guarantees that each row in a table is unique
by inferring key(s). The idea in sql circles that data and meta-data
don't mix is non-sense.
*/

//This outer join will ensure all RowIDs have 3 PRanks.

   right join
    (  
      (VMI over{RowID}) //This results in only unique values of RowID.
        times   //'times' is like sql cross join.
                //Valid expressions within a query include declaring a table.
                //Here we are filling in non-existent PRanks for each RowID
                //so each RowID has 3 strings to concatenate.
         table{row{1 PRank},row{2},row{3}} 
     ) with {IgnoreUnsupported = 'true'} 
        {RowID,PRank,InputStr,IfNil(PadStr,' ') PadStr};

//select B2 order by{RowID,PRank}; Uncomment to see B2 table.   

/* 
var B3 is where we finally form a delimited (comma) string (DelStr).
The Concat operator is an aggregate that can be used within a group by.
Concat has an order by that is used to form the string within RowID
in a specific order. We then insert table B3 into table LVMI.
Since a 'list' is a totally different animal (type) than a 'string'
we have to form a list from DelStr before we can insert it (as a list)
into LVMI (StrAsList). A literal list of type string is represented as:
{'A','B','C'}
To convert the delimited string DelStr to a list type we use the
Split operator, ie.
DelStr.Split() StrAsList
This says create a list by splitting DelStr (comma is the default delimiter)
and name the LIST 'StrAsList'. Simple yes?-:)
Note that unlike sql there is no required ordering of names in an
insert statement to match the table name ordering. In D4 only the
column names of the inserted table need be found in the table their
being inserted into.
*/

var B3:=
    ( B2 add{',' Del} adorn {key{RowID,PRank}} )
     group by {RowID} 
      add {Concat(PadStr,Del order by {RowID,PRank}) DelStr,Max(InputStr) InputStr}
        redefine{DelStr:=Replace(DelStr,' ','')};
//select B3; Uncomment to see table B3. (See below to see the table.)        
  insert
    (B3 {InputStr,RowID,DelStr,DelStr.Split() StrAsList})
      into LVMI;

//*** End of batch ***

Each element of a list is accessible by an index just as each element
of a string is. Did someone say 'array'? -:) There is no need to
simulate arrays in D4. Types such a tables,strings,lists and cursors
imply an array structure and it is only common sense that there is
a simple logical addressing method for each type. Lists and strings
are addressed using a 0 based index. Below we return the three items
of the list 'StrAsList' for each row. If you've made it this far you
may have figured out that in D4 returned columns are specified and
can be manipulated 'after' specifying the table name. Of course this
is the opposite of sql. Which makes more sense? -:)
 
select LVMI
       {RowID,DelStr,StrAsList[0] Col1,StrAsList[1] Col2,StrAsList[2] Col3};

RowID DelStr         Col1 Col2 Col3  
----- -------------- ---- ---- ----- 
1     AA,99,BB       AA   99   BB    
2     WWW,8888,PPPPP WWW  8888 PPPPP 
3     ,77,DD              77   DD    
4     EE,66,         EE   66         
5     G,,            G               
6     HHHH,5,J       HHHH 5    J     
7     ,44,                44         
8     ,22,KK              22   KK    

Just as you can address the items of a list, you can directly address
a row of table via a key. Using RowID as the key the first row and the
elements of the list can be produced simply by:

select LVMI[1] {RowID,StrAsList[0] Col1,StrAsList[1] Col2,StrAsList[2] Col3};

RowID Col1 Col2 Col3 
----- ---- ---- ---- 
1     AA   99   BB   


And there you have it.
Bye for now,
Steve

More intermediate results to see what's happening -:)

Operator (table) Op1VMI().
select Op1VMI() order by {sequence};

RowID StrChar InputStr     Rank Cnt PadStr sequence 
----- ------- ------------ ---- --- ------ -------- 
1     A       AA99BB       0    2   AA     0        
1     9       AA99BB       2    2   99     1        
1     B       AA99BB       4    2   BB     2         
.
3     7       77DD         0    2   77     6        
3     D       77DD         2    2   DD     7        

Table B1.
select B1 order by {sequence};
PRank (for partitioned rank) repeates from 1 to N within 
each RowID.

RowID StrChar InputStr     Rank Cnt PadStr sequence OffSet PRank 
----- ------- ------------ ---- --- ------ -------- ------ ----- 
1     A       AA99BB       0    2   AA     0        -1     1     
1     9       AA99BB       2    2   99     1        -1     2     
1     B       AA99BB       4    2   BB     2        -1     3     
.
3     7       77DD         0    2   77     6        5      1     
3     D       77DD         2    2   DD     7        5      2     
4     E       EE66         0    2   EE     8        7      1     
4     6       EE66         2    2   66     9        7      2     
5     G       G            0    1   G      10       9      1     
.
7     4       44           0    2   44     14       13     1     
8     2       22KK         0    2   22     15       14     1     
8     K       22KK         2    2   KK     16       14     2     


Looking at immediate result using ColDef1 in B2. Only when ColDef1 evaluates
to false for all rows within a RowID is it necessary to shift up
RRank by 1 for each item. Of course this is only when the 1st 
characters in the string are digits (RowIDs 3,7 and 8).
But we give a general solution.

RowID StrChar InputStr     Rank Cnt PadStr sequence OffSet PRank ColDef1 
----- ------- ------------ ---- --- ------ -------- ------ ----- ------- 
1     A       AA99BB       0    2   AA     0        -1     1     True    
1     9       AA99BB       2    2   99     1        -1     2     False   
1     B       AA99BB       4    2   BB     2        -1     3     True    
.
3     7       77DD         0    2   77     6        5      1     False   
3     D       77DD         2    2   DD     7        5      2     False   
4     E       EE66         0    2   EE     8        7      1     True    
4     6       EE66         2    2   66     9        7      2     False   
5     G       G            0    1   G      10       9      1     True    
6     H       HHHH5J       0    4   HHHH   11       10     1     True    
6     5       HHHH5J       4    1   5      12       10     2     False   
6     J       HHHH5J       5    1   J      13       10     3     True    
7     4       44           0    2   44     14       13     1     False   
8     2       22KK         0    2   22     15       14     1     False   
8     K       22KK         2    2   KK     16       14     2     False   

Intermediate results for B2 looking at correction of PRank for all
RowIDs where TestCols is false (meaning all ColDef1 are false within
the RowID. The PRanks within RowIDs 3,7,8 are incremented by 1 (see above).

RowID StrChar InputStr     Rank Cnt PadStr sequence OffSet TestCols PRank 
----- ------- ------------ ---- --- ------ -------- ------ -------- ----- 
1     A       AA99BB       0    2   AA     0        -1     True     1     
1     9       AA99BB       2    2   99     1        -1     True     2     
1     B       AA99BB       4    2   BB     2        -1     True     3     
.
3     7       77DD         0    2   77     6        5      False    2     
3     D       77DD         2    2   DD     7        5      False    3     
4     E       EE66         0    2   EE     8        7      True     1     
4     6       EE66         2    2   66     9        7      True     2     
5     G       G            0    1   G      10       9      True     1     
6     H       HHHH5J       0    4   HHHH   11       10     True     1     
6     5       HHHH5J       4    1   5      12       10     True     2     
6     J       HHHH5J       5    1   J      13       10     True     3     
7     4       44           0    2   44     14       13     False    2     
8     2       22KK         0    2   22     15       14     False    2     
8     K       22KK         2    2   KK     16       14     False    3     

Table B2.
Each RowID has 3 strings (PadStr) with the correct PRank.
For missing PRanks a blank is assigned to PadStr.
select B2 order by{RowID,PRank};

RowID PRank InputStr     PadStr 
----- ----- ------------ ------ 
1     1     AA99BB       AA     
1     2     AA99BB       99     
1     3     AA99BB       BB     
.
3     1     <No Value>          
3     2     77DD         77     
3     3     77DD         DD     
4     1     EE66         EE     
4     2     EE66         66     
4     3     <No Value>      
5     1     G            G      
5     2     <No Value>          
5     3     <No Value>    
.
7     1     <No Value>          
7     2     44           44     
7     3     <No Value>               
8     1     <No Value>          
8     2     22KK         22     
8     3     22KK         KK     

Table B3.
This is where PadStr from B2 is concatenated to form a delimited string.
Each DelStr is guaranteed to have 3 elements. 

RowID InputStr     DelStr         
----- ------------ -------------- 
1     AA99BB       AA,99,BB       
2     WWW8888PPPPP WWW,8888,PPPPP 
3     77DD         ,77,DD         
4     EE66         EE,66,         
5     G            G,,            
6     HHHH5J       HHHH,5,J       
7     44           ,44,           
8     22KK         ,22,KK
3 comments:

Razvan Socol said...: Considering the following DDL and your sample data:

CREATE TABLE VMI (
RowID int PRIMARY KEY,
InputStr varchar(12) NOT NULL
)

INSERT INTO VMI VALUES (1,'AA99BB')
INSERT INTO VMI VALUES (2,'WWW8888PPPPP')
INSERT INTO VMI VALUES (3,'77DD')
INSERT INTO VMI VALUES (4,'EE66')
INSERT INTO VMI VALUES (5,'G')
INSERT INTO VMI VALUES (6,'HHHH5J')
INSERT INTO VMI VALUES (7,'44')
INSERT INTO VMI VALUES (8,'22KK')

We need an auxiliary Numbers table:

SELECT IDENTITY(int) N INTO Numbers FROM syscolumns
ALTER TABLE Numbers ADD PRIMARY KEY (N)

And then we can provide the expected result in a single SQL query:

SELECT V.InputStr,
COALESCE(SUBSTRING(V.InputStr,1,A.N),'') AS C1,
COALESCE(SUBSTRING(V.InputStr,COALESCE(A.N,0)+1,
COALESCE(B.N,100)-COALESCE(A.N,0)),'') AS C2,
COALESCE(SUBSTRING(V.InputStr,B.N+1,100),'') AS C3
FROM VMI V LEFT JOIN (
SELECT InputStr, N FROM VMI, Numbers
WHERE SUBSTRING(InputStr,N,2) LIKE '[A-Z][0-9]'
) A ON V.InputStr=A.InputStr
LEFT JOIN (
SELECT InputStr, N FROM VMI, Numbers
WHERE SUBSTRING(InputStr,N,2) LIKE '[0-9][A-Z]'
) B ON V.InputStr=B.InputStr

InputStr C1 C2 C3
------------ ------------ ------------ ------------
AA99BB AA 99 BB
WWW8888PPPPP WWW 8888 PPPPP
77DD 77 DD
EE66 EE 66
G G
HHHH5J HHHH 5 J
44 44
22KK 22 KK
(8 row(s) affected)

Razvan; Tue May 29, 03:01:00 AM
Anonymous said...: Two comments: 1) It looks like the proposed alternate solution in Oracle by razvan socol does not take into account the requirement that the first column always consist of letters, the second column digits, and the third column letters again. 2) It should be noted that if this is a correct alternate expression, then an analogous D4 expression could be provided.

Regards,
Bryn; Mon Jun 11, 04:36:00 PM
Steve said...: Hello Bryn,

You are correct about the sql solution. RowID 5 incorrectly places 'G' into the 2nd (number) column instead of the first. But more importantly, I'm trying to show D4 in broad terms as opposed to 'slick' solutions. Of course there's always the possiblity that I'm being too broad -:)

best,
steve; Thu Jun 14, 08:05:00 PM
Tuesday, May 29, 2007

Sql Mythology

3 comments:

Beyond Sql

Blog Archive

Links

About Me