Beyond Sql: Dataphor

LiT: = lists into tables:)

I'm going to talk about a bunch of kewl stuff as it
relates to LiT. So get relaxed and come take a ride on
the D(4) train:)

As I showed here the D4 language makes it very easy to split
strings. Now I want to show you some more kewl stuff about
working with strings, lists and tables using the Split operator.

Lets start with this string delimited by the pipe ('') character:

'FFBB  HH HH  D AA'

Notice that blanks can appear at the beginning, end and within a
string part.

The Spit operator takes a string and splits it into a 'list' of
strings based on one or more delimiters. Conceptually this means
that the above string is transformed into:

'FF',' BB',' HH HH ',' D ',' AA'

in other words a comma delimited list of strings. And with a list
you can access and operate on the individual strings parts. The
concept of a list is conceptually close to what concept in a database?
Yep, you guessed it - a table! Each string in the list fits nicely
with the idea of a row in a table. Make sense? And once the list is
in a table, well we can do just about anything with it as you'll see:)

So lets create a table from our example string using Split.

The basic construction of Split using the method style is:

'somestring'.Split({delimiters})

With our example string, which uses the pipe as a delimiter, we have:

'FFBB  HH HH  D AA'.Split({''})

The Split operator uses a thingie called an 'indexer' to access the
individual parts of the list, like this:

'some string'.Split({delimiters})[index]

where index is a digit for the desired element of the list. So if we
want the 1st string/element in the list we use:

'FFBB  HH HH  D AA'.Split({''})[0]

We're using 0 because the indexer is 0 based, ie. 1st element 0, 2nd
element 1 etc. Real easy. Ok we've got enough info now to select
the 1st string:

select 'FFBB  HH HH  D AA'.Split({''})[0]

FF

Ok for the moment lets leave the idea of getting individual parts
of the list with the indexer and think of getting 'the whole list'
with one swipe of the magical D4 wand. And that wand is the D4
ToTable. Since Split has done the heavy lifting of creating the
list, all we have to do to get LiT  is this: 

select ToTable( 'FFBB  HH HH  D AA'.Split({''}))

Value  
-------
 D     
 HH HH 
AA     
BB     
FF  

and we are Lit (list in table) :)

When you transform a list to a table the column by default is 'Value'.
You can name the column anything you want by including a column
name in Split. I'll call the column 'STRING':

select ToTable( 'FF GG HH HH  D  AA'.Split({''}),'STRING')

STRING 
-------
 D     
 HH HH 
AA     
BB     
FF     

Ok we got the list in a table but you see that the strings have the
same blanks as they had originally. Since we are LiT we can trim the
leading and trailing blanks of each row (string) using :

{Trim(STRING) STRING}

This means trim each row and retain the same column name.

select ToTable ( 'FFBB  HH HH  D AA'.Split({''}),'STRING' )
                                         {Trim(STRING) STRING}

STRING
------
AA    
BB    
D     
FF    
HH HH 

Ok at this point your head is probably ready to explode since
you can see that while I've made all this fuss about LiT,
the rows of the table does not match what the order of strings
should be! In other words, if Split creates a list shouldn't
the table reflect the strings like:

STRING
------
FF
BB
HH HH
D
AA

If this has occurred you I'm doing my job:) As I said above
we can use what's called an indexer to address the elements
of the list in order. And we'll do that soon. But, it's
meaningful to get LiT even though the elements are not in
the proper sequence.

Just understand that without supplying specific information
to Split about the individual elements (the indexer thingie [])
we can only get a table whose rows reflect the ordering of the
string elements themselves.

Here we add a system generated auto-incrementing column INDEX
using the GetNextGenerator operator that reflects the ascending
order of STRING. (The INDEX column is similar to the identity
function used in MS Sql Server). You can see that INDEX reflects
the original (untrimmed) strings.

select (ToTable ( 'FFBB  HH HH  D AA'.Split({''}),'STRING' ))
        add {GetNextGenerator('INDEX') INDEX,Trim(STRING) STRING1}
                                                   
STRING  INDEX STRING1
------- ----- -------
 D      1     D      
 HH HH  2     HH HH  
AA      3     AA     
BB      4     BB     
FF      5     FF     


Here we get the trimmed strings and then add the auto-incrementing column.

select (
        (ToTable ( 'FFBB  HH HH  D AA'.Split({''}),'STRING' ))
                         {Trim(STRING) STRING}
       )
         add { GetNextGenerator('INDEX') INDEX }
 
STRING INDEX
------ -----
AA     1    
BB     2    
D      3    
FF     4    
HH HH  5    

The INDEX now reflects the ascending order of the trimmed
strings. No matter how you cut it, splitting a string
in toto into a table is not going to give you the order
as they appear in the original string. As I said even though
this is true it still has value and significance. Now I'm
going to show you why.

Here are 2 tables created on MS Sql Server 2005. We're
going to access them with Dataphor using D4.

create table str1 (ID1 int primary key,STRING1 varchar(50))
insert str1 values(1,'AABB CC')
insert str1 values(2,'FFGGHH MM J KK')
insert str1 values(3,'SSTTUUVV')
insert str1 values(4,'AAYYYYPQQRDD')
insert str1 values(5,'D10B20A30')

create table str2 (ID2 int primary key,STRING2 varchar(50))
insert str2 values(1,'AA  BB CC')
insert str2 values(2,'FF GG HH MM  J  KK')
insert str2 values(3,'SSTTUUXX')
insert str2 values(4,'AAPPPPWRQRDD')
insert str2 values(5,'D10B20')

Using D4:

select str1

ID1 STRING1           
--- ------------------
1   AABB CC         
2   FFGGHH MM J KK
3   SSTTUUVV       
4   AAYYYYPQQRDD  
5   D10B20A30       

select str2

ID2 STRING2               
--- ----------------------
1   AA  BB CC           
2   FF GG HH MM  J  KK
3   SSTTUUXX           
4   AAPPPPWRQRDD      
5   D10B20             

Lets suppose we want a query that compares by ID STRING1
and STRING2 as lists. We want the ID's where the lists
are the same. In other words, we want the ID's where all
elements in STRING1 and STRING2 are the same (equal). So
we're looking for ID's 1 and 2 where the trimmed elements
are the same.

You'll recall that we LiT a string using:

select ToTable ( 'FFBB  HH HH  D AA'.Split({''}),'STRING' )
                                         {Trim(STRING) STRING}
                                           
All we have to do to LiT our strings is use STRING1 and
STRING2 in Split. And once we have LiT both strings we
can wave a magic wand. We can directly test to see if the
2 tables are equal! We simply join by ID's and then test
the tables for equality in where.

select 
 str1 join str2 by ID1=ID2
  where 
   ((ToTable(STRING1.Split({''}),'STRING')) {Trim(STRING) STRING})
   =
   ((ToTable(STRING2.Split({''}),'STRING')) {Trim(STRING) STRING})
       
ID1 STRING1            ID2 STRING2               
--- ------------------ --- ----------------------
1   AABB CC          1   AA  BB CC           
2   FFGGHH MM J KK 2   FF GG HH MM  J  KK

Comparing tables is just a logical extension of comparing
integers or strings or just about anything else. Tables
are equal if they have the same column(s) (STRING) and the
elements (rows) are equal. Is that a magic wand or what:)
Now it's fundamentally important to understand how this
works. The key of each table is of course STRING. Each
key also implies an order.  So each row, ordered by
the key (STRING), in each table is compared to the
corresponding row in the other table just like you'd
compare integers. If every row is equal in the 2 tables
the comparison is true, otherwise it's false. D4 is
thus able to logically address the rows in a table via
the key. This is fundamentally different than sql which
can't do this. The ability to use keys for logical
addressing has big, big implications. It's one of the
fundamental things that differentiates a relational
language from sql! I'll get into it in future posts.
Now back to our example.

Note that for ID's 5 that have 3 and 2 rows respectively,
the table comparison was no problem.
I figure that if your still with me your interested and
pretty sharp. So I'll give you the benefit of the doubt
and assume a pretty important question has occured to
you. What if the original string has duplicate elements?
For example:

'AAATTCCAAA'

Lets see what happens when we try to LiT this:

select ToTable ( 'AAATTCCAAA'.Split({''}),'STRING' )
                        
WoW! We get this from the Dataphor compiler:
'Internal Index Error: "Duplicate key violation." '

Yeah we're SOL but does it makes sense? Well all tables
in D4 have to have a key. No if, ands or buts. The key
is of course STRING and when the string is split into
a table there are duplicates of 'AAA' and there
can't be any duplicate rows. So there goes our table:(
I'm sure you can guess the remedy:) But remember that
D4 is a strongly typed language. A list is not the
same as a table. We've been trying to convert a list to
a table. Too different types, that is the point of
having to convert. And sometimes it just won't work
the way we want it to. Before we remedy this, things
could get more dicey if we have this:

select ToTable ( 'AAA TTCC AAA '.Split({''}),'STRING' )

STRING
------
 AAA  
AAA   
CC    
TT    

And then trimmed the strings:

select ToTable ( 'AAA TTCC AAA '.Split({''}),'STRING')
                         {Trim(STRING) STRING}
                        
STRING
------
AAA   
CC    
TT    

An element 'AAA' was eliminated (no duplicates allowed) so
we end up with 3 elements instead of 4. And if we compared
this with

'AAATTCC'

select TableDee add {'False Positive' Compare}
   where
     (ToTable ( 'AAA TTCC AAA '.Split({''}),'STRING')
                         {Trim(STRING) STRING} )
     =
     (ToTable ( 'AAATTCC'.Split({''}),'STRING')
                         {Trim(STRING) STRING} )

Compare       
--------------
False Positive

we'd get a false positive. The tables are equal but
the lists are not! Phew:) (Don't worry about the
'TableDee' thingie, I'll get to that in due time:)

But please come to the idea that table comparisons
are a magic wand and once you start using them you'll
wonder how you ever got along without them. But alas,
you can loose some of that magic when working with
lists so be careful:) (More on table comparisons here).

So finally lets get to splitting the list by the
individual elements, ie. using an indexer ([]).

On the MS Server newsgroups it's common to see the
phrase 'use a set oriented approach'. This means
don't use procedural/imperative code. So no
cursors, or loops:)
Now I've already gone over splitting strings here.
But I want to show you in a little more depth what
is going on.

Given a string:

'AAA TTCC AAA '

A basic query to get the elements of the list is:

select ToTable( {0, 1, 2, 3, 4 , 5, 6, 7, 8 },'INDEX')
  where 'AAA TTCC AAA '.Split({''}).Count()>INDEX
   {INDEX+1 INDEX,'AAA TTCC AAA '.Split({''})[INDEX].Trim() STRING}

INDEX STRING
----- ------
1     AAA   
2     TT    
3     CC    
4     AAA   

Note that with the inclusion of INDEX we are guaranteed
not to eliminate duplicate STRINGs. The INDEX column
is always unique and takes part in the key of the
table.

Now the statement:

ToTable( {0, 1, 2, 3, 4 , 5, 6, 7, 8 },'INDEX')

creates a table with column INDEX for the list of integers.

'AAA TTCC AAA '.Split({''}).Count()>INDEX

filters on INDEX where the count of the elements in the list
is greater than the INDEX value. There are 4 elements in
the list so INDEXs 0, 1, 2, 3 satisfy this criteria. Remember
lists are 0 based. So we end up with a table with 4 rows, ie:

INDEX
-----
0    
1    
2    
3    

Then the statement:

{INDEX+1 INDEX,'AAA TTCC AAA '.Split({''})[INDEX].Trim() STRING}

is applied. This statement modifies the above table by redefing
INDEX and adding the column STRING. The STRING column is each
list element, in order, identified by the indexer ([INDEX]).
Each element of the string is also trimmed. We added 1 to INDEX
because.. well I wanted too because I like to start from 1 not 0:)
So there you have it. Well...not quite. There's stuff going on
here under the covers that you should know about and understand.

How is it that Dataphor interprets:

{0, 1, 2, 3, 4, 5, 6, 7, 8 }

as a list of integers and not something else?

To answer this let me show you formally how this table of
integers is created in a block of D4 code.

begin
   var Indexes : list(Integer);
   Indexes := { 0, 1, 2, 3, 4 , 5, 6, 7, 8 };
   create session table Indexes
   from
   ToTable(Indexes,'INDEX');
end;

A variable, Indexes, is assigned the type list(Integer), ie.
an infinite list where the elements are of type integer.
Once variable Indexes is defined as a type of list, the actual
elements of the list (the integers) are assigned to it.
Now the list can be converted into a table. The

create session table Indexes
from

creates what is called a session table. It's like a permanent
table but is only available to the Dataphor session that
creates it. It's like an MS Sql Server ## temp table.

select Indexes

INDEX
-----
0    
1    
2    
.
8

The answer to the question of how Dataphor knows that:

ToTable( {0, 1, 2, 3, 4 , 5, 6, 7, 8 },'INDEX')

is a list of integers is because the system has inferred
it! That's why it isn't necessary to formally declare
any variable with a type of list(Integer). In other
words, a comma separated bunch of integers is logically
a list of integers. The D4 compiler as an inference
engine is a super big deal!:) The concept of logically
inferring metadata (ie.keys, constraints) does not
exist in an sql database. I'll get into this big
deal in future posts. Ok, back to our example:)

So now you know some more about the integer list in
our query:

select ToTable( {0, 1, 2, 3, 4 , 5, 6, 7, 8 },'INDEX')
  where 'AAA TTCC AAA '.Split({''}).Count()>INDEX
   {INDEX+1 INDEX,'AAA TTCC AAA '.Split({''})[INDEX].Trim() STRING}

And what about the argument to Split? Well here is a
formal code block for splitting a string (using the
above Indexes table).

begin
  var LList : list(String);
  var ExString : String;
  ExString := 'AAA TTCC AAA ';
  LList := ExString.Split( {''} );
create session table ExList
from
 Indexes
   where LList.Count()>INDEX
    {INDEX+1 INDEX, LList[INDEX].Trim() STRING} ;
end;  

select ExList

INDEX STRING
----- ------
1     AAA   
2     TT    
3     CC    
4     AAA   

We don't have to explicitly declare that Split should
result in a list of String elements. All this information
is inferred by the compiler (thank you very much D4:)

I'm rambled enough:) Now we're going to finally,
finally split strings and generate a little report.

For good measure we're going to insert another row
in our string tables, str1 and str2, via D4:

insert
 table
 {
     row { 6 ID1, 'AAA TTCC AAA ' STRING1 }
 }
 into str1

insert
 table
 {
     row { 6 ID2, 'TTCCAAA' STRING2 }
 }
 into str2

So now have:

select str1
ID1 STRING1           
--- ------------------
1   AABB CC         
2   FFGGHH MM J KK
3   SSTTUUVV       
4   AAYYYYPQQRDD  
5   D10B20A30       
6   AAA TTCC AAA    

select str2
ID2 STRING2               
--- ----------------------
1   AA  BB CC           
2   FF GG HH MM  J  KK
3   SSTTUUXX           
4   AAPPPPWRQRDD      
5   D10B20               
6   TTCCAAA       

We want to compare strings by ID and only where
the lists have the same number/count of elements. And
we only want the strings where there are differences
for the same element (INDEX). So each row will include
the ID, the corresponding different STRINGs, the
INDEX for each different element and the individual
different string parts.

select
 (
  (
   (  
    (str1 join str2 by ID1=ID2)
      where STRING1.Split({''}).Count()=STRING2.Split({''}).Count()
   )
    times // times is like the sql cross join.
     Indexes
      where STRING1.Split({''}).Count()>INDEX
  )  
      where // <> is the notation for not equal.
       STRING1.Split({''})[INDEX].Trim()<>STRING2.Split({''})[INDEX].Trim()
 )    
          {
            ID1 ID,STRING1,STRING2,INDEX+1 INDEX,
            STRING1.Split({''})[INDEX].Trim() DIF1,
            STRING2.Split({''})[INDEX].Trim() DIF2
          }

ID STRING1          STRING2          INDEX DIF1 DIF2
-- ---------------- ---------------- ----- ---- ----
3  SSTTUUVV      SSTTUUXX      4     VV   XX  
4  AAYYYYPQQRDD AAPPPPWRQRDD 2     YYYY PPPP
4  AAYYYYPQQRDD AAPPPPWRQRDD 3     PQ   WR  

Coming from sql you should think of each set of matching
parenthesis (indented) as a derived table. Unlike sql,
a derived table in D4 does not start with a 'select'
statement. In D4 it's just the opposite of sql. A table
is derived and then you specify the columns you want.
Which way makes more sense? In sql you pay for the
meal before you even order it:) Also note in D4 you
don't have to give the derived able a name (alias). In
D4 things are much cleaner and more logical.

I hope you found this interesting. And there's plenty
more to come:)
Friday, September 01, 2006

Dataphor - Get LiT!

No comments:

Beyond Sql

Blog Archive

Links

About Me