Dataphor SQL RAC (Relational Application Companion)


A site of hope for those looking for a true relational database system

Saturday, December 20, 2008

Extracting numbers from a string

This example is based on the question asked in the thread:

microsoft.public.sqlserver.programming
Dec 18, 2008
Pulling a number out of an nvarchar field
http://tinyurl.com/3quyap

The OP was interested in pulling out the 1st occurrence of a number in a
string. The string has numbers and letters. So if the string is 'XY40A3' 
we want the number 40 which is the 1st of two numbers in the string.

This is very easy to do in Dataphor. Just as a table of numbers comes in
handy for solving many different problems, so does a list of letters.
The idea here is to treat each letter in the string as a 'delimiter'. We
then split the string using the list of letters as the delimiters, so
what results are the number(s) in the string.

We can store the list (the letters of the alphabet) in a temporary table
for easy access.

create session table AlphaTable
 { AlphaList:list(String),
   key{ }
 };
AlphaTable:=
       table
            {
             row
             { {'A','B','C','D','E','F','G','H','I','J','K','L','M','N',
                'O','P','Q','R','S','T','U','V','W','X','Y','Z'} AlphaList}
             };      

For example, if we split a string, transform it to a table and remove any
blanks, we'll have just the numbers left. If we order by sequence (Index) it
will show the numbers as they occur from left to right.

select ToTable('XY40A3'.Split(AlphaTable[].AlphaList),'StrNum','Index' )
                where StrNum>' ' 
                  order by {Index};                  

StrNum Index 
------ ----- 
40     2     
3      3     
           
Not only would it be easy to get the 1st number but we can get the
occurrence of any number easily. The 1st occurrence is just a special
case of the general problem of getting the Nth occurrence in a string.

By using ToTable(ToList(cursor(...))) with the cursor ordered by Index
(which follows the order of the numbers from left to right in the string),
we can create a consecutive rank from 1 to N over the table of numbers that
allows direct access to the Nth number (if it exists).
       
select
 ToTable
        (  
         ToList
               (
                cursor
                      (
                        ( 
                         ToTable('XY40A3RAD853'.Split(AlphaTable[].AlphaList),'StrNum','Index' )
                         where StrNum>' ' 
                         {ToInteger(StrNum) Num,Index}
                        )
                         order by {Index}
                      )   
               )   
        )
         {Num,Index,sequence+1 NthIndex} ;
         
Num Index NthIndex 
--- ----- -------- 
3   3     2        
40  2     1        
853 6     3               
        
Here is an operator for the Nth occurrence that takes into account
lower case letters and returns -1 if the Nth occurrence doesn't exist.
        
create session operator NthNum(AStr:String,N:Integer):Integer
begin
var T1:= 
  ToTable
        (  
         ToList
               (
                cursor
                      (
                        ( 
                         ToTable(Upper(AStr).Split(AlphaTable[].AlphaList),'StrNum','Index' )
                         where StrNum>' ' 
                         {ToInteger(StrNum) Num,Index}
                        )
                         order by {Index}
                      )   
               )   
        )
         {Num,Index,sequence+1 NthIndex};
result:=IfNil((T1 adorn{key{NthIndex}})[N].Num,-1);
end;

select NthNum('SF346fs47sGs759 ',1); //returns 346
select NthNum('SF346fs47sGs759 ',2); //returns 47
select NthNum('SF346fs47sGs759 ',3); //returns 759  
select NthNum('SF346fs47sGs759 ',4); //returns -1

Here a table of strings is stored in a Sql Server database from which
we can extract the 1st occurrence of a number.

create table FooStrings
{
 keycol:Integer,
 datacol:String nil,
 key{keycol}
};
FooStrings:=
 table
  {
 row{1 keycol, 'XYZ40AB' datacol},
 row{2, 'WX32A'},
 row{3, '27 blah'},
 row{4, 'A87BNC30'},
 row{5, 'XY40A3'},
 row{6, 'TWFD'},
 row{7, 'XYA53GH5JGV934'},
 row{8, '7'},
 row{9, nil}
  };
  
select FooStrings add{NthNum(IfNil(datacol,' '),1) MyNumber}
                   with {IgnoreUnsupported = 'true'}
                    order by {keycol};
                      
keycol datacol        MyNumber 
------ -------------- -------- 
1      XYZ40AB        40       
2      WX32A          32       
3      27 blah        27       
4      A87BNC30       87       
5      XY40A3         40       
6      TWFD           -1       
7      XYA53GH5JGV934 53       
8      7              7        
9      <No Value>     -1   


Think how easily the operator can be modified if it's desired to return
the minimum or maximum number, etc.
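
For instance, here is a hedged, untested sketch of a MaxNum variant. It
reuses the same split-and-filter pipeline as NthNum but aggregates with
Max instead of ranking (this assumes D4's aggregate invocation form
Max(<column> from <table expression>) and that Max yields nil over an
empty table, which IfNil then turns into -1):

create session operator MaxNum(AStr:String):Integer
begin
result:=
 IfNil(
       Max(Num from
            (
             ToTable(Upper(AStr).Split(AlphaTable[].AlphaList),'StrNum','Index')
              where StrNum>' '
              {ToInteger(StrNum) Num}
            )
          ),-1);
end;

select MaxNum('SF346fs47sGs759 '); //expected 759
select MaxNum('TWFD');             //expected -1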

If you would like to see a coherent and easy-to-follow all-t-sql solution
that doesn't cobble together string functions into spaghetti code see:
RAC - Are you Coordinated?

But I think you'll agree the much preferred solution is in Dataphor.

Saturday, December 06, 2008

Sql server dynamic crosstabs by Jeff Moden

This is the RAC version of Jeff Moden's fine article on dynamic crosstabs
for sql server at:

Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs 2008/12/03 
www.sqlservercentral.com/articles/cross+tab/65048/

The data is generated by the method outlined in the article. The table was
populated with 1 million rows.

-- Basic crosstab. Executed in 2 secs in QA for S2005.
Exec Rac
@transform='Sum(SomeMoney) as SumMony',
@rows='Someletters2',
@pvtcol='(left(datename(mm,DATEADD(mm,DATEDIFF(mm,0,SomeDate),0)),3)+~ ~+
          cast(year(DATEADD(mm,DATEDIFF(mm,0,SomeDate),0)) as char(4))) as mthyr',
@pvtsort='month(DATEADD(mm,DATEDIFF(mm,0,SomeDate),0))', -- Sort pivot expression by an integer. 
@from='##JBMTest',
@WHERE='SomeDate>=~Jan  1 2008 12:00AM~ AND SomeDate<~Jul  1 2008 12:00AM~',
@rowtotalsposition='end',@racheck='y',@shell='n'

Someletters2 Funct   Jan 2008   Feb 2008   Mar 2008   Apr 2008   May 2008   Jun 2008   Totals
------------ ------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
AA           SumMony 685.67     763.64     656.13     575.93     879.13     192.13     3752.63
AB           SumMony 927.06     928.98     280.20     632.43     560.99     785.50     4115.16
AC           SumMony 791.09     555.18     916.71     273.23     187.48     508.31     3232.00
AD           SumMony 250.04     341.58     426.53     645.56     670.13     422.86     2756.70
AE           SumMony 809.14     487.21     295.33     625.92     716.12     527.19     3460.91
.
.
ZY           SumMony 776.32     682.98     326.17     677.69     546.87     926.54     3936.57
ZZ           SumMony 532.75     500.92     277.93     636.40     607.05     553.43     3108.48
Totals       SumMony 433997.25  383211.70  411913.12  411878.29  425431.07  409809.47  2476240.90

Here some additional bells and whistles are thrown in:)
 
-- Executed in 26 secs in QA for S2005.
Exec Rac
-- The same transformed repeated twice for different purposes. 
@transform='Sum(SomeMoney) as SumMony & Sum(SomeMoney) as [% row]',
@rows='Someletters2',
@pvtcol='(left(datename(mm,DATEADD(mm,DATEDIFF(mm,0,SomeDate),0)),3)+~ ~+
          cast(year(DATEADD(mm,DATEDIFF(mm,0,SomeDate),0)) as char(4))) as mthyr',
@pvtsort='month(DATEADD(mm,DATEDIFF(mm,0,SomeDate),0))',
@from='##JBMTest',
@WHERE='SomeDate>=~Jan  1 2008 12:00AM~ AND SomeDate<~Jul  1 2008 12:00AM~',
@rowtotalsposition='end',@racheck='y',@pformat='_pvtcols_',@shell='n',@translabel='Summary',
-- Display min and max sum for each Someletters along with pivot (date) it occurred.
-- The min and max are displayed in separate rows. Default is same row for all rowfunctions.
@rowfunctions='min(SumMony) & max(SumMony)',@rowfunctionslabel='Min/Max',@displayrowfunctions='m',
-- Running sum of pivot columns for each row from left to right. The pivot sum is followed by the 
-- run in each cell.
@colruns='SumMony',
-- The percentage of the pivot sum/[row total] displayed in a separate row.
@cpercents='[% row] %only' -- a different transform alias to force a separate row.
                           -- We could display in same row as sum (and column runs).


Someletters2 Summary Min/Max                 Jan 2008            Feb 2008            Mar 2008             Apr 2008             May 2008             Jun 2008             Totals
------------ ------- ----------------------- ------------------- ------------------- -------------------- -------------------- -------------------- -------------------- ----------
AA           SumMony min(192.13,Jun 2008)    685.67/685.67       763.64/1449.31      656.13/2105.44       575.93/2681.37       879.13/3560.50       192.13/3752.63       3752.63
                     max(879.13,May 2008)                                                                                                                                 
             % row                           18.3%               20.3%               17.5%                15.3%                23.4%                5.1%                 -
AB           SumMony min(280.20,Mar 2008)    927.06/927.06       928.98/1856.04      280.20/2136.24       632.43/2768.67       560.99/3329.66       785.50/4115.16       4115.16
                     max(928.98,Feb 2008)                                                                                                                                 
             % row                           22.5%               22.6%               6.8%                 15.4%                13.6%                19.1%                -
AC           SumMony min(187.48,May 2008)    791.09/791.09       555.18/1346.27      916.71/2262.98       273.23/2536.21       187.48/2723.69       508.31/3232.00       3232.00
                     max(916.71,Mar 2008)                                                                                                                                 
             % row                           24.5%               17.2%               28.4%                8.5%                 5.8%                 15.7%                -
AD           SumMony min(250.04,Jan 2008)    250.04/250.04       341.58/591.62       426.53/1018.15       645.56/1663.71       670.13/2333.84       422.86/2756.70       2756.70
                     max(670.13,May 2008)                                                                                                                                 
             % row                           9.1%                12.4%               15.5%                23.4%                24.3%                15.3%                -
AE           SumMony min(295.33,Mar 2008)    809.14/809.14       487.21/1296.35      295.33/1591.68       625.92/2217.60       716.12/2933.72       527.19/3460.91       3460.91
                     max(809.14,Jan 2008)                                                                                                                                 
             % row                           23.4%               14.1%               8.5%                 18.1%                20.7%                15.2%                -
AF           SumMony min(406.49,May 2008)    788.30/788.30       415.40/1203.70      605.56/1809.26       613.81/2423.07       406.49/2829.56       520.40/3349.96       3349.96
                     max(788.30,Jan 2008)                                                                                                                                 
.
.
ZY           SumMony min(326.17,Mar 2008)    776.32/776.32       682.98/1459.30      326.17/1785.47       677.69/2463.16       546.87/3010.03       926.54/3936.57       3936.57
                     max(926.54,Jun 2008)                                                                                                                                 
             % row                           19.7%               17.3%               8.3%                 17.2%                13.9%                23.5%                -
ZZ           SumMony min(277.93,Mar 2008)    532.75/532.75       500.92/1033.67      277.93/1311.60       636.40/1948.00       607.05/2555.05       553.43/3108.48       3108.48
                     max(636.40,Apr 2008)                                                                                                                                 
             % row                           17.1%               16.1%               8.9%                 20.5%                19.5%                17.8%                -
Totals       SumMony min(383211.70,Feb 2008) 433997.25/433997.25 383211.70/817208.95 411913.12/1229122.07 411878.29/1641000.36 425431.07/2066431.43 409809.47/2476240.90 2476240.90
                     max(433997.25,Jan 2008)                                                                                                                              
             % row                           17.5%               15.5%               16.6%                16.6%                17.2%                16.5%                -
 

Tuesday, November 25, 2008

Linq to Sql as an opportunity for sql users

Instead of a dead end, Linq presents opportunities for sql experts.

My comment originally appeared in the www.sqlblog.com post:

'Continuing Industry Consolidation and Other Trends.' 1/16/08
by Kevin Kline
http://tinyurl.com/5opxlz

I think the Sun acquisition of MySql was a checkmate move. It was Oracle
that attempted to slow down MySql with their acquisition of the InnoDB
transaction engine. Now they have been paid back in spades. Now it's Sun
vs. MS in a new competition for the minds and hearts of developers. The
broad view is LINQ/net and Sql Server vs. java and MySql. This is not
about databases per se but a new war based on the object/entity model
inspired by LINQ. I don't see IBM and Oracle in this war. They will have
to be content to battle amongst themselves in the old database wars.
(I'll come back to LINQ:)

As for 'Checking out the Competition' of course I too applaud Linchi.
But honestly we're first 'now' recognizing the advantages of eyes wide
open?:) This attitude should be a given for a professional. Perhaps
better late than never, perhaps an airing of cheerleading vs. reality
checking:) For most ansi features that MS has pre-announced all one has
to do to study them is read about them in Oracle or DB2 documentation
(ie. OVER). And as Linchi pointed out it often goes the opposite
direction for other types of features. This attitude contributes to the
great divides that are common in the industry. 

And now we come to the 'dead-end routes' like LINQ. I take the opposite
view you do. There's a compelling case to be made (I've commented on
this matter on this site) that if there is a deadweight, MS sees sql as
it. LINQ is not just a piece of new technology, it's not just a local
variable, it's a global one. LINQ is both an affirmation of the runtime
environment for 'all' application development using an object model and
a rejection of the sql interface. MS can live with the physical file
structure (the idea of relational data, rows and columns) but they don't
want to live with the sql interface for application development. MS
explains this global move in terms of solving the historic impedance
mismatch between two completely different environments. And they have
picked their winner and at the same time the loser. The rows and columns
abstraction now ends at physical storage. The object abstraction and
LINQ will take up everything else. Sql server is now something quite
different than it used to be. Almost all developmental changes in server
will be based on servicing objects and quite probably at the expense of
features associated with a furtherance of the relational model. Look at
all the work on partitioned views in S2008. This lies at the heart of
how LINQ will translate entity updates. LINQ is still in its infancy. I
would expect it to appear to many just like sql did when it was initially
introduced in the early eighties. It will take time to get the matured
version. What is truly ironic is I see no real argument in the sql
community that LINQ represents a great opportunity for sql developers.
MS is inventing a declarative language in the spirit of sql. Don't
people see an opportunity to jump in and at least influence the design
of the language? Or get involved in the LINQ translation/optimizations
to sql. Over time as MS integrates LINQ deeper into server (returning
entities) I can assure you the current translations will change:) Sql
was most certainly not an implementation of the relational model. So sql
folks shouldn't get hung up over this. The relational model would
require the same strong typed runtime as net but MS is certainly not
going there. But they are going to a place that sql skills can be used.
And now Sun is going to go along with them. It's actually a small world
if your eyes are open:)

Linq to Sql should be important to sql users

My plea for the sql server community to look closely at Linq.

My comment originally appeared in the www.sqlblog.com post:

'PASS presentation: What would you like to hear about?' 2/23/08
by James Luetkehoelter
http://tinyurl.com/5lmj4m

You're eccentric. Such people tend to be really bright. So that's kewl.
You're passionate, so you're highly motivated. That's very good. If you're
'over the top' that means you're not afraid of taking risks! That's best
of all. If all this is true you can take on the burden of tackling LINQ
and the entity framework. Now I'm not talking about the 'how' of it. I'm
sure there will be plenty of people presenting point and click slides.
What I am talking about is the 'why' and 'what' of it. LINQ/EFM dwarfs
everything else in terms of importance in S2008. It's a reflection of a
movement in a different direction from what everyone is used to. It's
also a reflection of a change in attitude. When I look for sql folks
tackling the subject what do I find? Frighteningly little or nothing!
Now let me say if you're willing to make the case that sql folks can
simply ignore the subject, have at it:) If you even want to make the
case that it simply represents an isolated piece of technology go for
it. Some sql folk are waxing nostalgic about past 4GL/xbase languages
when discussing LINQ. So it may be that many sql folks think it's quite
the same idea and wonder why its structure is different and possibly
inferior to what was. Well LINQ is different, it's comparing apples and
oranges, and it can't possibly be the same. But how can folks really get
a grasp of such a comparison if they don't really understand the 'what'
of LINQ. Trying to explain 'what' it is isn't easy especially to those
sql folks who are not familiar with net or any other contemporary
strongly typed object oriented runtime environment. I think MS is
finding out that it's a challenge. Even trickier is to explain the 'why'
of it. The motivation for it. Surely it didn't come out of thin air:)
And the 'why' involves taking risks. You may frighten or alienate
people:) The 'why' cuts to the heart of what most folks believe in. LINQ
is a departure. It will have significant implications for what
development is and will be like in the future. It will take a very
unique person to put all these ideas together so they'll be really
understood. Interestingly, all the answers about LINQ are right on the
MS website. All that's required is to dig around and back thru the years
to find them. With over eight years of development this shouldn't be
surprising:) But how many sql folks have bothered to do this homework?
From the looks of it very, very few. Presenting concepts is much harder
than presenting code. It takes a very special kind of guy to connect
the dots here:)
I'd be happy to share with you an extensive MS list of articles/notes
thru the years about the subject.

Linq to Sql vs. older 4GL attempts

Some thoughts on comparing Linq to older reporting languages.

My comment originally appeared in the www.sqlblog.com post:

'LINQ - Lessons that could have been learned from languages like Progress' 2/25/08
by Greg Low 
http://tinyurl.com/56powf

>it's still very much an add-on rather than being fully baked into the language.

I remain somewhat perplexed by just what you mean by this. By definition
LINQ is 'burned in'. This is the achievement that MS has accomplished.
To imply that it's not fully baked in is like saying a woman is somewhat
pregnant. It is or it isn't, you are or you are not:) Either a
table/entity or projection can be represented as a variable or it
cannot. That all these things 'can' be a variable of some type
referenced in the language is the whole point of it no? Your use of
'add-on' seems to imply that LINQ is something external to the language
much like an external 4GL like Progress. This couldn't be further
from the truth. In your interview with Pablo Castro he referred
to Progress as an example of a language having 'direct line of sight
from language to database'. Wasn't he struggling here to convey the idea
to sql folks of the fundamentally different natures of the two? To bring
the 4GL idea into contemporary languages one shouldn't expect they are
going to be similar. And you seem to be implying that LINQ is not as
'natural' as Progress. How could it be? If you first have to define a
query variable (table) you certainly can't expect to start your variable
with a Select/Find. You define and then perform an action right? In
t-sql 'Select @MyNumber' only makes sense if you first declare
@MyNumber. Is LINQ any different? And in the sense that 'declare
@MyNumber int' is burned into t-sql, isn't 'var customers = from c in
db.Customers' burned into C#?

I do think sql users should listen to your interview with Pablo. It is
proving difficult for MS folks to convey just what they have done to
those outside of net. What is worse, misunderstanding it or ignoring it?:)
Shouldn't sql folks fully understand why MS thinks it's such an
improvement over sql? So far I think either they don't or simply don't
care. 

Linq to Sql: the 'what' and 'why' of it

The following comments concern 'what' is Linq (to Sql)/EF and the 'why'
of it, what motivated MS to develop it. What does MS mean by 'beyond 
'relational'? I also explore in what ways Linq, sql and the relational
model are related to each other. How these technologies are connected
to each other is quite a fascinating picture:)
My following 5 comments originally appeared on www.sqlblog.com
in the post:

'Beyond Relational ???' 10/29/07
by Paul Nielsen 
http://tinyurl.com/686z6h

Comment #1

There is the association of relational to mathematics (set theory). So
people criticize sql based on this point of view. Sql allows duplicate
rows, doesn't require a table to have a key, dependencies based on
ordinal position, is a poorly designed language etc. etc. These things
really are critical but the real problem is the prevailing idea that
relational is just a question of mathematics. If it's just mathematics
then allowing duplicate rows is perceived as 'close enough'. All the
objections from the set theory point of view are not perceived as
compelling enough to really question the validity of sql. IMO the real
holes of sql have nothing to do with mathematics. Rather it's the
foundation, the computer science if you will, that set theory and
relational algebra are embedded in. This point of view is unfortunately
not prevalent in IT. What the hell do I mean by the computer science of
the relational model? Well first, the set theory that relational depends
on is not some special kind of set theory. There is only one set theory.
In the same way there is only one computer science, there is no special
kind of computer science. But sql has invented such a special computer
science and this is the biggest flaw. What am I talking about?:)
Consider this, here is a table variable:

DECLARE @MyTableVar table(
   EmpID int NOT NULL primary key,
   OldVacationHours int,
   NewVacationHours int,
   ModifiedDate datetime);

Here is a server table:

create table MyTable(
      EmpID int NOT NULL primary key,
      OldVacationHours int,
      NewVacationHours int,
      ModifiedDate datetime);
 
Here's the key question. If @MyTableVar really is a variable then what
is MyTable? In other words, @MyTableVar is to variable as MyTable is to
?????. If MyTable is persisted in the database what is it persisted as?
What computer science term describes it? Well whatever the hell it is (a
constant?) it certainly isn't a variable. And if it isn't a variable
then end of ballgame, end of relational model. And what of @MyTableVar?
Bol says 'A table variable behaves like a local variable.' and at the
same time says 'Assignment operation between table variables is not
supported.'. When is a door not a door?..when it's ajar:) Who the hell
ever heard of a variable that doesn't support assignment? Who ever heard
of a variable that doesn't support comparison? No one. Whatever
@MyTableVar really is it sure as hell ain't a variable. In a relational
db I should be able to assign the table @MyTableVar, all its rows, to
MyTable: 
      
MyTable=@MyTableVar

And I should be able to compare them.

if MyTable=@MyTableVar
 then print 'All rows in MyTable are in @MyTableVar and all rows in   
  @MyTableVar are in MyTable'
   else print 'Nope they''re not equal'
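
In Dataphor's D4 both operations are ordinary code. Here is a minimal,
untested sketch (names are illustrative; it assumes D4's table equality
operator '=', the counterpart of the '<=' table containment D4 uses
elsewhere on this blog for relational division):

var LEmps:=
 table {
        row{1 EmpID, 40 VacationHours},
        row{2 EmpID, 35 VacationHours}
       };
var LCopy:=LEmps;    //Table assignment: all rows in one statement.
var LResult:String;
if LEmps = LCopy     //Table comparison: row-set equality.
 then LResult:='equal'
 else LResult:='not equal';
select LResult;      //'equal'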
   
A relational db demands a table be a variable just like an integer
variable. Sql simply does not support basic computer science for tables.
Whatever a table is in sql it doesn't have a 'type' because computer
science is computer science and a variable must be typed. The only way
sql can recognize a table is by its name, not its type. This is why sql
doesn't support relational division and why dynamic sql must be used so
much. A table as a variable is a completely different animal than a
table in sql. This is why the expressive power of a relational db is
orders of magnitude greater than an sql db. Sql views and constraints
are redefined relationally. Date's work
'Databases, Types and the Relational Model: The Third Manifesto' (2007)
is about the central importance of variables of a particular type (a
table as one of many types) in a relational db. What a table as a
variable means and its significance. It is really a basic computer
science book. Ripping out the mathematics of relational theory (at least
trying to copy it), ie. the syntax to join, union tables, without the
computer science of relational has done all the damage. MS can't change
sql server because they are caught in a crazy computer science. The
difference in computer science between sql and net is the impedance
mismatch they're trying to address. But I'm afraid they still don't get the
idea of a table as a variable. This is different than a table as a
class. The anonymous type in Linq is distinctly different
than a table type. So MS is doing the same thing IBM did forty years ago
with System-R and sql. The damage is the difference between a pickup
game in a playground and organized sports. You can draw up plays in the
dirt but they don't quite work the same as those run in a stadium. We're
still doing application development in the playground. Sometimes it
works, sometimes it doesn't but we're not basing it on the science of
any model. Sql is not a model of anything, it's an invention all its own. 
Close enough is only for horseshoes:) Maybe my blog will make more sense now:)

Comment #2

Wherever you see the word 'relational' just substitute 'fog'. As in fog of war:) 
> But when you have guys like Don Chamberlin (co-inventor of SQL and  
> co-designer of XQuery) on your staff, I guess you can afford to 
> boast your XML prowess.
He is revered in the sql world and reviled in the relational one. He was
a lead designer of System-R, the prototype of all sql database systems.
Those guys created a query language based on Codd's description of basic
relational operators like projection, union and join. But they did NOT
implement the relational model Codd described. They just ripped out
these constructs without regard for their meaningfulness in the entire
relational model. So what you have today is nothing like the relational
model as it was envisioned. (IT successfully marginalizes the huge
difference and those that point it out:) And now comes 'beyond
relational'. What does this phrase really mean to MS? They are more than
willing to tell us. Aside from Jim Gray's article/presentation, everyone
should read the articles on this site, the 'Comega language':
http://research.microsoft.com/Comega/
Especially this article:
'Unifying Tables, Objects and Documents'
http://tinyurl.com/yq7c4f

Here you'll find history repeating itself. MS, just like IBM did with
System-R, has extracted relational operators out of the relational model
and put them in an imperative object environment without any regard to
relational theory. The great irony is that the extensions that MS added
to net to realize projections of columns and xml within net is the
foundation for a true relational model! But the compiler generated
anonymous type of Linq, while a variable, is a different beast than the
explicit variable that is a table type in the relational model. It's the
relational variable that supports assignment and comparison as opposed
to the Linq variable that's nowhere near as smart:) But each supports a
'variable' which is a major step up from sql. Had MS any idea of the
friggin true relational model they would make a different kind of
history. Talk about dumbing down. Talk about of only academic interest.
Talk about relational fog (I should add that Alphora (Dataphor)
recognized the ability of the object imperative environment to support
the D relational language and implemented it. And it works:) 
Here is what Anders Hejlsberg, MS VS guru, and now the head of database
technology has to say about the disconnect:
InfoWorld
Interview of Microsoft Distinguished Engineer Anders Hejlsberg
'Microsoft's Hejlsberg touts .Net, C-Omega technologies'
June 10, 2005
http://www.infoworld.com/article/05/06/10/HNhejlsberg_1.html

"So what we're looking at is really trying to much more deeply integrate
the capabilities of query languages and data into the C# programming
language. And I don't specifically mean SQL, and I emphatically don't
mean just take SQL and slap it into C# and have SQL in there. But rather
try to understand what is it expressively that you can do in SQL and add
those same capabilities to C#."

Anders Hejlsberg is Microsoft's version of Don Chamberlin at IBM. So what
they have done is replace one flawed implementation of sql with another.
And this is how they achieve efficiency in application development. Now
that is unfriggin believable:) Well there are no free lunches. And I wait
to be enlightened on just how this environment will replace the concept
of the logical relational model in solving business problems. I would
say the real meaning of beyond relational is sideways.

Comment #3

Consider the MS whitepaper:
'Microsoft SQL Server 2008 and Microsoft Data Platform Development'
http://www.microsoft.com/sql/techinfo/whitepapers/sql_2008_dp.mspx

Does anyone find it the least bit odd that an sql server technical article
is all about VS, LINQ and the entity framework? At the expense of the logical
relational model and the sql language.

What MS means by 'beyond relational' is 'forget relational already':)
Looking at sql server as if it was somehow an embodiment of relational
theory is every bit a form of dumbing down as some silly utterance by
some poor nitwit at MS. There never was and never will be any 'intent'
by MS to offer a 'relational' database. Sql server's only intent now is
to be responsive to its biggest customer, visual studio. And that team
is as knowledgeable in relational databases as the server team. Not. Why
does the community still view sql server thru an imaginary lens? Did
you ever hear of somewhat pregnant? If you open the dumbing down door be
prepared to greet all those who come thru:)

Comment #4

There is no longer a great divide, a debate, an impedance mismatch. MS 
has issued their own Emancipation Proclamation. And as a result they no
longer support the relational model as it is known to developers today.
'A Call to Arms'
by Jim Gray, Microsoft
Mark Compton, Consultant
April 2005
http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=293

This paper is an invitation to embrace a new model. It's just as much
'A Farewell to Arms', an emancipation from the relational model which 
they are leaving behind. What does sql server look like in this new model?

'Interview with David Campbell'
General Manager of Strategy, Infrastructure and Architecture of Microsoft SQL Server.
May 14, 2007
http://tinyurl.com/6maseb
Campbell says:
"I believe the next major advance in Information Technology will come 
from addressing the gap between people and information."

That gap is the relational logical model itself.

Campbell continues:
"The focus will move to the data itself rather than on the machinery
used to manipulate it. We'll be less concerned with the plumbing and
more concerned with data quality, data protection, and information
production."
"Most of the data services provided by SQL Server will be driven from
a common data model. Whether you're creating a report, building an
information cube, or integrating data from another system, you will be
able to start from a common model of the key data entities such as
'customer', 'order', or 'prospect'."
"Finally, fewer and fewer people will miss, (or remember), the 'open
databases' sp_configure option..."

The class replaces the table as the basic unit of work.  VS replaces
QA/SSMS as the interface for application development. There is no
concept of relational anything in this object world. Sql constructs are
independent of application development. The language of the relational
model is replaced with the language of entities. There is no concept of
a dba.
MS is no longer in the database wars as we know them. They are trading 3rd
place in that world for 1st place in another. And they now have the
freedom to talk about this new world. It just sounds silly to those who
have not left the old one.
Ironically some were hoping for a new sub-language to further application
development. Perhaps
the lesson here is to be careful of what you wish for. I too was hoping
they'd enter a new world but not the one they have chosen.

Comment #5

> should we be concerned staying in the DB world long with the fear 
> that we become obsolete one day?
Although I'm not an expert I can understand where you're coming from. It
would be nice to get a clear and concise answer to where MS is going and
what you should do about it. But there is no Oracle when it comes to MS.
There is no one position paper, no one person that clearly lays out
their five year plan and what it means to you. The experts here have
enormous importance and influence in the db community. But they also
have an enormous investment. How far can they be reasonably expected to
go without putting themselves in an awkward position should they take a
position that is not currently in line with company thinking? In the end
it's a question of connecting the dots. You get a dot here a dot there.
You have to do your homework. Study what they say and write and study
what they offer. Sql server pros shouldn't neglect what's going on in VS
and its impact. If you study the company and the various technologies
enough you should be able to draw your own picture. Think of it as the
MS X-files:)

Sunday, November 23, 2008

Listing Missing Table Items

The following example is based on the post:

microsoft.public.sqlserver.programming
Friday, November 21, 2008 11:56 AM
"T-SQL Comparing a String of Items to a Table of Items; Listing Missing Table Items"
http://tinyurl.com/5dvc6o

Here's what the future of application development will hopefully be like
using a relational system. To program relationally you not only have to
think in "terms of sets" (the declarative thing:) but you have to think
"in terms of type" (the relation thing). Sql essentially doesn't concern
itself with variables and types. They are essentially viewed as foreign
thingies found in net. But in a relational system they are every bit as 
significant as understanding how to write an outer join. The following
example illustrates the relational approach using the Dataphor system
(which can play nicely with Sql Server).

create session table Products {
  product_key:Integer,
  part_no: String,
  options: list(String), //options is defined as a list type, not a string.
  price: Money,
  key{product_key}
  };

All tables, like numbers, are variables with an explicit 'table' type: the
column names and their datatypes. And like integers they can be 'assigned'
values. Tables are composed of 'rows'. For each row options is input as a
'list' of type string ({'<string>','<string>','<string>'..}).

Products:=
table {
      row{11 product_key, 'ASP-20J' part_no, {'Wiz Bang', 'Hopla Enhancer'} options, $10.00 price},
      row{12, 'ASP-20R', {'Widget Holder','Wiz Bang',  'Hopla Enhancer'}, $12.00}
      }; 

create session table Options {
  option_key: Integer,
  product_key: Integer, 
  option: String,
  key{option_key},
  reference OptiontoProduct {product_key} references Products {product_key}
  };

Options:=
table {
       row{5 option_key, 11 product_key, 'Wiz Bang' option},
       row{6, 11, 'Hopla Enhancer'},
       row{7, 12, 'Wiz Bang'},
       row{8, 12, 'Hopla Enhancer'}
      };
       
Here are easy ways to find the missing option from the Options table.

Using row in table. First a table of numbers (Numbers) with a single
column N (from 0-10K) is used to create a row for each element in
the options list, bringing along the other Products columns. A row
is constructed from this table to see if it's not in Options.

select 
   (
    Products 
       times  //Like an sql cross join. 
         Numbers
           where N<options.Count()
              //create a row for each element in the list.
             {product_key,options[N] option} 
                                           
    )     
       where not (
                  row{product_key product_key,option option}
                  in
                 (Options {product_key,option})
                 ) ;
                
Using relational division. Because tables (table expressions) are
variables one table can be tested to see if it's contained in another.

select 
   (
    Products times Numbers
         where N<options.Count()
          {product_key,options[N] option}
    )     
       where not (  
               table{row{product_key product_key,option option}}
                <=
               (Options {product_key,option})
               ) ;                 
               
Using a left join. Test whether the table on the right has a matching row.

select 
   (
    Products times Numbers
         where N<options.Count()
          {product_key,options[N] option}
    )     
      left join Options
                include rowexists  //A special Boolean column for whether there is a match.
                  where not rowexists 
                    {product_key,option} ;
      
Instead of inputting a list directly, a delimited string can be converted to
a list when input.

Products:=
table{
      row{11 product_key, 'ASP-20J' part_no, ('Wiz Bang, Hopla Enhancer '.Split()) options, $10.00 price},
      row{12, 'ASP-20R', (' Widget Holder ,Wiz Bang,  Hopla Enhancer'.Split()), $12.00}
      }; 
Options:=
table{
      row{5 option_key, 11 product_key, 'Wiz Bang' option},
      row{6, 11, 'Hopla Enhancer'},
      row{7, 12, 'Wiz Bang'},
      row{8, 12, 'Hopla Enhancer'}
     };

The queries are the same as above except for trimming each element of the list.

select 
   (
    Products times Numbers
         where N<options.Count()
          {product_key,options[N].Trim() option}
    )     
     where not (
                row{product_key product_key,option option}
                in
               (Options {product_key,option})
                ) ;
                
select 
   (
    Products times Numbers
         where N<options.Count()
          {product_key,options[N].Trim() option}
    )     
     where not (  
               table{row{product_key product_key,option option}}
                <=
               (Options {product_key,option})
               ) ;                 
               
select 
   (
    Products times Numbers
         where N<options.Count()
          {product_key,options[N].Trim() option}
    )     
      left join Options
                include rowexists 
                  where not rowexists 
                    {product_key,option};
                    
All queries produce a result of:

product_key option        
----------- ------------- 
12          Widget Holder 

Types eliminate violations of normal forms much like education eliminates ignorance

Dataphor is a RAD tool, R(elational) A(ccelerator) D(evelopment).
Visit dataphor at:
www.dataphor.org

Friday, November 14, 2008

S(ecure) Q(uery) L(anguage)?

Concerning the thread:
microsoft.public.sqlserver.programming
Nov 10, 2008
"cascading deletes"
http://tinyurl.com/6nwjmd

For some reason I couldn't get my reply to post thru OE (it got thru
via google though). Perhaps there's an MS filter for metaphors.
In any event any mature adult should be able to handle it. So here's
my reply with a touch of creative writing.

'Most women believe men think with their tool. And it's just as true in
IT. Users model business problems in terms of the abilities of their
db. The idea that modeling exists independent of one's db is a myth.
It's not a question of seepage but of flooding. Modeling business
problems in terms of the available sql server constructs is messy
precisely because they're immature and superficial to the task. The
result is you turn away from the db and towards the do-it-myself
model. You roll around in your own layer because you can't get layered
by the db. It's ridiculous to write a join procedurally but when it
comes to modeling it's perfectly acceptable to roll your own. Because
the model equivalent of the join is so lacking and messy. The genie
isn't going back in the sql server bottle. It's simply too far gone. 
That's why I advocate Dataphor. There the genie is in the join as well
as the modeling. Use Dataphor and put your tool back where your head
and shoulders are. You can still use sql server. But you aren't going
to get tooled by it :)'

www.dataphor.org
www.beyondsql.blogspot.com

Geoff Schaller wrote:
> Andre.

> I vote with Hugo here. We manage everything from code, not from TSQL in
> SSMS or via some other mechanism so we generally have to code everything
> (and that is not as difficult or as expansive as it sounds). Whilst
> cascading referential integrity is "nice" from a simplicity point of
> view, we've found that the act of deleting something (say an invoice) is
> almost never a simple thing. There is reversal of stock levels,
> rebalancing totals and if others are running reports when you thought
> you wanted to do the delete, it gets messy.

> The other thing is that we quite often have to delete the child entries
> individually or prevent the parent from being deleted because a child or
> two cannot be. Writing all that logic into a trigger and enforcing the
> rollback is quite complex. I find code an easier way to manage and
> maintain this.

To add insult to injury my reply to a post on SQLServerCentral was hacked (edited).

SQLServerCentral 
11/10/2008
'Arrays in Stroed Prcoedure'
http://tinyurl.com/5th5n4

My reply, as shown there under the name rog pike, was edited to read:  
  
'An array is a 'type', something the archaic computer science of sql knows
nothing about. You have to move to a 'real' relational system to find a
list/array type. You'll find such adult computer science in Dataphor.'
 
My original reply was as follows:

'Arrays are in sql server in the same sense as having sex by yourself which may account for
the shortsightedness of so many sql mavens. An array is a 'type', something the archaic computer
science of sql knows nothing about. You have to move to a 'real' relational system to find
a list/array type. You'll find such adult computer science in Dataphor.'

Is the site for mature adults or for the whole family? Just how much protection
do sql and its users need? Is this a security or, better yet, an insecurity problem? ☺

Finally, I'll repeat here what I posted in the above thread:

'Apparently someone complained/reported something I wrote as being objectionable. They
got their wish as it was magically extracted from the text. What was yanked, besides
my chain, was a metaphor, albeit a vivid one, to drive a salient point home. Now I
write for adults, I don't do child-speak very well. Nor do I have a predilection 
to only write drone-on-speak. So, if I can, I won't hesitate to use an adult metaphor
to amplify a point in an industry that is usually tone deaf. God forbid IT encourage
ability in something other than code or pixels. So if you are an adult, with a surname
other than anonymous, please explain just what you found R or X rated. Mature adults
usually confront conflicts thru the front door not the back one.'  

Monday, September 08, 2008

Sorting a delimited string numerically

Sorting a delimited string by its numerical string parts.

This is a possible solution to the problem presented in the article:   

'T-SQL Sorting Challenge'
By: Itzik Ben-Gan 
http://www.sqlmag.com/Article/ArticleID/100156/100156.html

"You are given a table called t1 with a character string column called val. 
Each string in the val column holds a dot separated list of integers. Your
task is to write a T-SQL solution that sorts the strings based on the integer
values constituting the string segments. Note that the number of integers in
each string may vary, and is only limited by the column type VARCHAR(500).
Extra points if your solution will also support negative integers."

So the problem is how to construct a sort expression that represents the
positive and negative integers of the ids.

This solution uses Dataphor with the data being stored in Sql Server 2005.

Sample data

The data is populated thru dataphor and persisted in the sql server northwind
db. The table t1 in the article is table IZ here. The article uses an sql
identity for the primary key id; here it is explicitly declared.

create table IZ
{
 id:Integer,
 val:String tags {Storage.Length='75'},
 key{id}
};
//Positive value strings.
insert row{1 id,'100' val} into IZ;
insert row{2 id,'7.4.250' val} into IZ;
insert row{3 id,'22.40.5.60.4.100.300.478.19710212' val} into IZ;
insert row{4 id,'22.40.5.60.4.99.300.478.19710212' val} into IZ;
insert row{5 id,'22.40.5.60.4.99.300.478.9999999' val} into IZ;
insert row{6 id,'10.30.40.50.20.30.40' val} into IZ;
insert row{7 id,'7.4.250' val} into IZ;
//Add negative values.
insert row{8 id,'-1' val} into IZ;
insert row{9 id,'-2' val} into IZ;
insert row{10 id,'-11' val} into IZ;
insert row{11 id,'-22' val} into IZ;
insert row{12 id,'-123' val} into IZ;
insert row{13 id,'-321' val} into IZ;
insert row{14 id,'22.40.5.60.4.-100.300.478.19710212' val} into IZ;
insert row{15 id,'22.40.5.60.4.-99.300.478.19710212' val} into IZ;

Go directly to dataphor solution
Go directly to a solution using dataphor and sql server t-sql
Go directly to a solution using the Rac utility on sql server 2000
Go directly to a solution using the Rac utility on sql server 2005
(
 Rac is a system of stored procedures and functions for sql server designed
 to simplify solving various data manipulation problems including dynamic
 crosstabs, complex running sums and ranking, string manipulations etc.
) 
Stepping thru the logic of the solution

Use the dataphor Split operator to split the val string for each id into
individual strings starting at Index 1 and going to the number of parts
delimited by the period ('.'). Note that the string is converted to an
integer. So we're dealing with numbers and not strings.

select
 (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
    times                //times is equivalent to an sql CROSS JOIN.
    (Numbers where N<10) //A table with a single column N, an integer from 0 to 800.
      where N<StrListCnt
       {id,N+1 Index,ToInteger(StrList[N]) StrNum}
         order by {id,Index} ;      
         
id Index StrNum   
-- ----- -------- 
1  1     100  <- id 1 has only a single value.
2  1     7    <- id 2 has 3 values.
2  2     4        
2  3     250      
3  1     22   <- id 3 has 9 values.
3  2     40       
3  3     5        
3  4     60       
3  5     4        
3  6     100      
3  7     300      
3  8     478      
3  9     19710212     
4  1     22  
.  .     .

Now let's look at the same data ordered within each Index by the string as
a number (StrNum). Remember all ordering is ascending.

select
 (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
    times
    (Numbers where N<10)
      where N<StrListCnt
       {id,N+1 Index,ToInteger(StrList[N]) StrNum}
         order by {Index,StrNum};
         
id Index StrNum   
-- ----- -------- 
13 1     -321  <- id 13 has lowest overall number for 1st string part (Index 1).
12 1     -123     
11 1     -22      
10 1     -11      
9  1     -2       
8  1     -1       
2  1     7        
7  1     7        
6  1     10       
3  1     22    <- ids 3,4,5,14,15 have the same value (22) for Index 1.   
4  1     22      
5  1     22       
14 1     22       
15 1     22       
1  1     100   <- id 1 has highest overall number for 1st string part (Index 1).  
.  .     .
3  8     478   <- Index 8 has the same string value for the 5 ids that have an 8th string part.
4  8     478      
5  8     478      
14 8     478      
15 8     478      
5  9     9999999  
3  9     19710212 
4  9     19710212 
14 9     19710212 
15 9     19710212 

The strings as integers are nicely sorted within each Index over the ids. 
How can we represent the same ordering within each Index independent of
the positive and negative numbers (and strings that indicate positive and
negative numbers)? What about with a ranking? So let's rank the numbers
within each Index. The combination of the dataphor ToTable, ToList and
cursor operators will generate a rank (a column named sequence) that 
follows the specified cursor ordering. We order the cursor by Index,StrNum
to get an ascending rank within each Index based on the StrNum values. 
The rank is column RowNum.

select
ToTable(
        ToList(
               cursor(
                       (
                        (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
                          times
                           (Numbers where N<10)
                             where N<StrListCnt
                              {id,N+1 Index,ToInteger(StrList[N]) StrNum}
                        )   
                                order by {Index,StrNum}))) 
                                 {Index,id,StrNum,sequence+1 RowNum}
                                   order by {Index,StrNum};

Index id StrNum   RowNum 
----- -- -------- ------ 
1     13 -321     1  <- id 13 has lowest value (-321) so id 13 gets lowest rank (1) within Index 1.    
1     12 -123     2      
1     11 -22      3      
1     10 -11      4      
1     9  -2       5      
1     8  -1       6      
1     2  7        7  <- ids 2 and 7 get a different rank for the duplicate values of 7 :(    
1     7  7        8      
1     6  10       9      
1     3  22       10 <- all 5 ids get a different rank for the duplicate values of 22 :(  
1     4  22       11     
1     5  22       12     
1     14 22       13     
1     15 22       14     
1     1  100      15 <- id 1 has highest value (100) so id 1 gets highest rank (15) within Index 1.
.     .  .        .
8     3  478      56     
8     4  478      57     
8     5  478      58     
8     14 478      59     
8     15 478      60     
9     5  9999999  61     
9     3  19710212 62 <- all 4 ids get a different rank for the duplicate values of 19710212 :(    
9     4  19710212 63     
9     14 19710212 64     
9     15 19710212 65     

(
 Note that this rank is equivalent to the sql ranking function ROW_NUMBER().
 The rank could be obtained in sql using:
 ROW_NUMBER()OVER(ORDER BY [Index],StrNum) AS RowNum
 But the rank we want is really based on the sql RANK() function which
 accounts for duplicates/ties (of StrNum) by giving them the same rank. 
 Therefore it's necessary in dataphor to use a join to append the correct
 ranks to the table. In sql the join isn't necessary, RANK() can be used
 directly on the table, ie.
 RANK()OVER(ORDER BY [Index],StrNum) AS Rank
 (See t-sql solution)
 For more on dataphor and sql ranking see:
 'The Sql ranking OVERture' 
 http://beyondsql.blogspot.com/2008/04/sql-ranking-overture.html
)

If you haven't guessed it by now, the idea is to create a string for each
id based on the ranks which will be used to order the ids. But we have a 
problem because for duplicate values of a number we're getting different
ranks. We want the 'same' rank for duplicate values since the same integer
cannot be used to distinguish among the ids. 

We can remedy the different ranks for duplicate values by simply choosing
the minimum rank (RowNum) for the value and assigning this rank to all ids.
Also note that the ranks continue to ascend over the Indexes. This is ok
because any numbers representing the ranks are ok if they correctly maintain
the ordering of the integer string values within the Index.

select
ToTable(
        ToList(
               cursor(
                       (
                        (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
                          times
                           (Numbers where N<10)
                             where N<StrListCnt
                              {id,N+1 Index,ToInteger(StrList[N]) StrNum}
                        )   
                                order by {Index,StrNum}))) 
                                 {Index,id,StrNum,sequence+1 RowNum}
                                   group by {Index,StrNum} add{Min(RowNum) Rank} 
                                     order by {Index,StrNum};

Index StrNum   Rank 
----- -------- ---- 
1     -321     1    
1     -123     2    
1     -22      3    
1     -11      4    
1     -2       5    
1     -1       6    
1     7        7  <- a rank of 7 can be assigned to the two ids (2,7) with a value of 7 for Index 1.  
1     10       9    
1     22       10 <- a rank of 10 can be assigned to all ids with a value of 22 for Index 1. 
1     100      15   
.     .        .
8     478      56   
9     9999999  61   
9     19710212 62 <- a rank of 62 can be assigned to all ids with a value of 19710212 for Index 9.

What we now have is a table of unique Index/StrNum combinations with a
unique rank for each combination. It's only necessary to join this table
to the table of split strings (IZ) for all ids by Index and StrNum to properly
assign the correct(ed) ranks. (As mentioned above this is the same rank that
would be obtained using the sql RANK() function and ordering by Index,StrNum.
And note that using the sql RANK() would eliminate the need to do a join in 
dataphor. Imagine dataphor with native sql-like ranking operations.)

Because the objective is to create a string to sort the ids we can't just
use the numeric rank, we have to modify it for string ordering. Given two
ranks of 7 and 11, if they are strings, '7' and '11', an ascending sort
would have '11' come before '7':

'11'
'7'

This is the very problem the article is addressing! So we have to modify
the strings to have '7' come before '11' ☺. We can modify the '7' by left
padding it with '0'. So when we sort ascending we'll have the correct
representation of the true numeric order of the values:

'07'
'11'

How much to pad a rank, ie. how many '0's to insert, is determined by the
string length of the maximum rank generated. Because the ranks in dataphor
keep ascending regardless of the cursor ordering, the maximum rank (ignoring
duplicates) is the count of rows in the table. You could even make an
educated guess based on the amount of data and use that ☺.
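
As a quick sanity check of the padding (a hedged sketch reusing the same
PadLeft, Length and ToString operators the final query uses; the 65 here
is just this example's row count, ie. the maximum rank):

var LengthofMaxRank:=Length(ToString(65));           //2
select PadLeft(ToString(7),LengthofMaxRank,'0');     //'07'
select PadLeft(ToString(15),LengthofMaxRank,'0');    //'15' (already 2 chars)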

Left padding the string rank (RankStr) based on a maximum length of 2 we now
have all the data to finally construct a sorting column for the ids.

var UnpackedStrings:= //This variable holds all the split data and will be used in the select query. 
ToTable(ToList(cursor(
                       (
                        (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
                          times
                            (Numbers where N<10)
                               where N<StrListCnt
                             {id,N+1 Index,ToInteger(StrList[N]) StrNum}
                        )   
                           order by {Index,StrNum}))) 
                          {Index,id,StrNum,sequence+1 RowNum};
var LengthofMaxRank:=2;
select
      (UnpackedStrings {id,Index,StrNum})
         join //Join the unique ranks to all split data. This is a natural join (on Index/StrNum).
              //Create a left padded string (RankStr) from the numeric rank.
          (UnpackedStrings group by {Index,StrNum} add{Min(RowNum) Rank})
            {id,Index,StrNum,Rank,PadLeft(ToString(Rank),LengthofMaxRank,'0') RankStr}
              order by {Index,Rank};
            
id Index StrNum   Rank RankStr 
-- ----- -------- ---- ------- 
13 1     -321     1    01     <- ranks 1-9 are left padded in the rank string (RankStr).    
12 1     -123     2    02      
11 1     -22      3    03      
10 1     -11      4    04      
9  1     -2       5    05      
8  1     -1       6    06      
2  1     7        7    07      
7  1     7        7    07      
6  1     10       9    09      
3  1     22       10   10      
4  1     22       10   10      
5  1     22       10   10      
14 1     22       10   10      
15 1     22       10   10      
1  1     100      15   15      
.  .     .        .    .
3  8     478      56   56      
4  8     478      56   56      
5  8     478      56   56      
14 8     478      56   56      
15 8     478      56   56      
5  9     9999999  61   61      
3  9     19710212 62   62      
4  9     19710212 62   62      
14 9     19710212 62   62      
15 9     19710212 62   62    

The id sort column can now be formed by concatenating, using the Concat
operator, the RankStr values within each id in the order of the rank (any of
Rank, Index or RankStr will do). This is easy to see by ordering the above
data (table) by id,Rank. The ascending order of Index, Rank and RankStr each
reflects where an id lies in value (RankStr) relative to the other ids. The
sort expression will be column SortStr.

id Index StrNum   Rank RankStr 
-- ----- -------- ---- ------- 
1  1     100      15   15      
2  1     7        7    07      
2  2     4        16   16      
2  3     250      30   30      
3  1     22       10   10      
3  2     40       19   19      
3  3     5        24   24      
3  4     60       33   33      
3  5     4        38   38      
3  6     100      49   49      
3  7     300      51   51      
3  8     478      56   56      
3  9     19710212 62   62    
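
As an aside for my sql friends: in sql server 2005 this kind of per-id
concatenation is usually done with the FOR XML PATH('') trick. A hedged
sketch, assuming the data above sits in a table Ranked(id,Rank,RankStr)
of my own invention:

-- Concatenate RankStr per id in Rank order, standing in for dataphor's Concat.
SELECT R.id,
       (SELECT R2.RankStr AS [text()]
        FROM Ranked AS R2
        WHERE R2.id = R.id
        ORDER BY R2.Rank
        FOR XML PATH('')) AS SortStr
FROM Ranked AS R
GROUP BY R.id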

The complete dataphor solution

var UnpackedStrings:= //Variable that holds all split data and ranks the split strings
                      //numerically within each Index ordered by the numeric string value.
ToTable(ToList(cursor(
                       (
                        (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
                          times
                            (Numbers where N<10)
                               where N<StrListCnt
                             {id,N+1 Index,ToInteger(StrList[N]) StrNum}
                        )   
                           order by {Index,StrNum}))) 
                          {Index,id,StrNum,sequence+1 RowNum};
var LengthofMaxRank:=
  Length(Count(UnpackedStrings).ToString()); //A string length used to left pad the rank strings for
                                             //a correct string sort (will be 2 here). 
select
    IZ
     join   //Natural join of input table IZ to sort column expression by id.
         (
            //Join unique ranks to all split strings.
          (
           (UnpackedStrings {id,Index,StrNum})
             join
                 // Adjust ranks to be unique for each Index/StrNum, duplicate values should get same rank.
              (UnpackedStrings group by {Index,StrNum} add{Min(RowNum) Rank})
                 //Left pad rank string with '0' (RankStr) for sorting string correctly. Add empty
                 //string ('') to be used as a delimiter in concatenating the ranks.
               {id,Index,StrNum,Rank,PadLeft(ToString(Rank),LengthofMaxRank,'0') RankStr,'' Del}
          ) adorn {key{id,Rank}} //Sql uses physical hints, in dataphor you use logical ones.
                //Form the sorting expression SortStr to sort ids by concatenating the string ranks
                //in the order of any of the ranking columns or Index.
              group by {id} add{Concat(RankStr,Del order by {id,Rank}) SortStr}
         ) 
          order by {SortStr}; //The object of the exercise, sort the ids by SortStr to get the correct
                              //numerical order of val.

id val                                SortStr            
-- ---------------------------------- ------------------ 
13 -321                               01                 
12 -123                               02                 
11 -22                                03                 
10 -11                                04                 
9  -2                                 05                 
8  -1                                 06                 
2  7.4.250                            071630             
7  7.4.250                            071630             
6  10.30.40.50.20.30.40               09182932434650     
14 22.40.5.60.4.-100.300.478.19710212 101924333844515662 
15 22.40.5.60.4.-99.300.478.19710212  101924333845515662 
5  22.40.5.60.4.99.300.478.9999999    101924333847515661 
4  22.40.5.60.4.99.300.478.19710212   101924333847515662 
3  22.40.5.60.4.100.300.478.19710212  101924333849515662 
1  100                                15              

Solution using the sql server RANK() function with a pass thru query

The solution can be made more compact using the sql RANK() function since
dataphor doesn't have a direct equivalent ranking operation.

Since sql server can only access persisted tables (and views) and not
dataphor expressions (we can't just stick any dataphor table expression
into a pass thru query) we'll create a persisted table to hold all the
split data.

create table IZSqlRanks  //The table will be created in sql server.
{
 id:Integer,
 Index:Integer,
 StrNum:Integer,
 key{id,Index}
}; 

Do the splitting in dataphor (because it knows the difference between a
string and a list ☺) and then 'assign' the resulting table to the
persisted table IZSqlRanks. Relational databases support this kind of
assignment for all variables including tables. And all tables in dataphor
are variables. (To my sql friends this makes all the difference in the world ☺)

IZSqlRanks:=
 (IZ add{val.Split({'.'}) StrList,val.Split({'.'}).Count() StrListCnt})
    times
    (Numbers where N<10)
      where N<StrListCnt
       {id,N+1 Index,ToInteger(StrList[N]) StrNum};
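
For contrast, the nearest a plain t-sql batch comes to that kind of
assignment is a delete-and-refill (a sketch; SomeSplitSource is a
hypothetical table already holding the split data):

-- t-sql has no table assignment, so empty the table and repopulate it.
DELETE FROM IZSqlRanks
INSERT INTO IZSqlRanks (id, [Index], StrNum)
SELECT id, [Index], StrNum
FROM SomeSplitSource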
//Use a t-sql passthru query to take advantage of the sql RANK() function. The
//resulting table will be treated like any other table (expression) in dataphor.
//Left pad the string rank (RankStr) for sorting purposes. We're using a total
//string length of 2 here so single digit ranks will be padded with a leading '0'. 
select 
 IZ
  join
     (
       SQLQuery('SELECT id,[Index],StrNum,RANK()OVER(ORDER BY [Index],StrNum) AS Rank
                 FROM IZSqlRanks') 
     //Use the sql result as if it was a native dataphor expression.
        {id,Index,StrNum,Rank,PadLeft(ToString(Rank),2,'0') RankStr,'' Del} 
          adorn {key{id,Rank}} //Metadata (a key) pertaining to the table expression.
                               //This key will be efficiently used by the Concat operation.
     //Concatenate the rank strings for each id to be used as the sort order for ids.
          group by {id} add{Concat(RankStr,Del order by {id,Rank}) SortStr}
      )       
        order by {SortStr};     
        
id val                                SortStr            
-- ---------------------------------- ------------------ 
13 -321                               01                 
12 -123                               02                 
11 -22                                03                 
10 -11                                04                 
9  -2                                 05                 
8  -1                                 06                 
2  7.4.250                            071630             
7  7.4.250                            071630             
6  10.30.40.50.20.30.40               09182932434650     
14 22.40.5.60.4.-100.300.478.19710212 101924333844515662 
15 22.40.5.60.4.-99.300.478.19710212  101924333845515662 
5  22.40.5.60.4.99.300.478.9999999    101924333847515661 
4  22.40.5.60.4.99.300.478.19710212   101924333847515662 
3  22.40.5.60.4.100.300.478.19710212  101924333849515662 
1  100                                15                        

Solving the problem on sql server 2000 with the Rac utility

The Rac solution follows the same logic as the dataphor and sql methods.
The 1st Rac execute creates the Index column, from 1 to N, for every string
part (delimited integer) an id has. The 2nd Rac execute creates a rank over
the ids based on the integers (string parts) within each Index. The rank
Rac generates is equivalent to the sql DENSE_RANK() function. This rank,
like RANK(), gives duplicate values the same rank but, unlike RANK(), does
not take the duplicate ranks into account when generating the next rank.
RANK() skips ahead based on ties while DENSE_RANK() consecutively numbers
the ranks. Both types of ranks give the same correct sort result for the ids.
The 3rd Rac execute concatenates the left padded rank strings for each id
and returns the ids sorted by them, correctly :) Note that Rac is called
recursively twice.
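
For anyone hazy on the two ranking functions, here is a small t-sql
(sql server 2005) illustration of the difference over duplicates; the
derived table A is just made up data:

-- RANK() skips ahead after ties, DENSE_RANK() numbers consecutively.
SELECT N,
       RANK()       OVER (ORDER BY N) AS [Rank],
       DENSE_RANK() OVER (ORDER BY N) AS [DenseRank]
FROM (SELECT 7 AS N UNION ALL
      SELECT 10 UNION ALL
      SELECT 10 UNION ALL
      SELECT 22) AS A

N   Rank DenseRank
--- ---- ---------
7   1    1
10  2    2
10  2    2
22  4    3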

Exec Rac 
@split='[position]',    -- Rac splits each val string from left to right based on a period ('.').
@rows='id & [position]',-- Rac keeps the position each new string part starts in.
@pvtcol='val',          -- the target column of the split operation.
@from='IZ',    
-- use a counter to generate the Index column (from 1-N) that indicates the individual string part in a val.  
@separator='.',@rowcounters='id{[Index]}',@counterdatatype='int',
@defaults1='y',@rowbreak='n',@racheck='y',@shell='n',
@select='
        SELECT 1*id AS id,[Index],CAST([1] AS INT) AS StrNum
               INTO #J1      
                 FROM rac
 Exec Rac
 @transform=~_dummy_~,
 @rows=~[Index] & StrNum & id~,
 @pvtcol=~Report Mode~,
 @from=~#J1~,
 @defaults1=~y~,@rowbreak=~n~,@racheck=~y~,@shell=~n~,
  /* use a Rac counter to rank the integer string parts within each Index (column Rank).*/
 @rowindicators=~StrNum{Rank}_overtable_~,@counterdatatype=~int~,
        @select=
          ~if object_id(~~##J2~~) is not null 
           drop table ##J2  
           SELECT 1*id AS id,1*StrNum AS StrNum,1*[Index] AS [Index],1*Rank AS Rank,
                  /* left pad single digit string ranks for proper character sorting.*/
                  REPLICATE(~~0~~,2-DATALENGTH(CAST(Rank AS VARCHAR(2)))) + CAST(Rank AS VARCHAR(2)) AS RankStr
                  INTO ##J2
                   FROM rac~
             /* concatenate the string ranks (RankStr) within each id into column SortStr.*/     
      Exec Rac
      @transform=~Max(RankStr) as RankStr~,
      @rows=~id~,
      @pvtcol=~Rank~,
      @from=~##J2~,
      @defaults1=~y~,@racheck=~y~,@shell=~n~,@cutpvt=~y~,
      @concatenate=~RankStr~,@separator=~~,@stringname=~SortStr~,
       /* return the ids sorted by the concatenated string ranks.*/
      @select=~SELECT IZ.id,val,SortStr  
               FROM IZ JOIN rac ON IZ.id=rac.id 
               ORDER BY SortStr
               DROP TABLE ##J2~'

id          val                                  SortStr
----------- ------------------------------------ ------------------
13          -321                                 01
12          -123                                 02
11          -22                                  03
10          -11                                  04
9           -2                                   05
8           -1                                   06
7           7.4.250                              071116
2           7.4.250                              071116
6           10.30.40.50.20.30.40                 08121517202326
14          22.40.5.60.4.-100.300.478.19710212   091314181921272830
15          22.40.5.60.4.-99.300.478.19710212    091314181922272830
5           22.40.5.60.4.99.300.478.9999999      091314181924272829
4           22.40.5.60.4.99.300.478.19710212     091314181924272830
3           22.40.5.60.4.100.300.478.19710212    091314181925272830
1           100                                  10

More on Rac @
www.rac4sql.net

Solving the problem on sql server 2005 with the Rac utility

The Rac solution in S2k5 is similar to the one on S2k (go there for more info).
But it's simpler since the split string parts (integers) can be ranked
directly using the DENSE_RANK() function, which is not available in S2k. This
eliminates a recursive call to Rac, so here it's executed twice instead of
three times (as in S2k).

Exec Rac 
@split='[position]',   
@rows='id & [position]',
@pvtcol='val',
@from='IZ',
@separator='.',@rowcounters='id{[Index]}',@counterdatatype='int',
@defaults1='y',@rowbreak='n',@racheck='y',@shell='n',
 -- the DENSE_RANK() function, available in S2k5, is used to rank the integer string parts
 -- and the resulting rank is left padded.
@select='
         SELECT id,[Index],StrNum,Rank,
                REPLICATE(~0~,2-DATALENGTH(CAST(Rank AS VARCHAR(2)))) + CAST(Rank AS VARCHAR(2)) AS RankStr
                INTO #J1     
         FROM 
              (SELECT 1*id AS id,[Index],CAST([1] AS INT) AS StrNum,
                      DENSE_RANK()OVER(ORDER BY [Index],CAST([1] AS INT)) AS Rank
               FROM rac ) AS A
     Exec Rac
     @transform=~Max(RankStr) as RankStr~,
     @rows=~id~,
     @pvtcol=~Rank~,
     @from=~#J1~,
     @defaults1=~y~,@racheck=~y~,@shell=~n~,@cutpvt=~y~,
     @concatenate=~RankStr~,@separator=~~,@stringname=~SortStr~,
     @select=~SELECT IZ.id,val,SortStr  
               FROM IZ JOIN rac ON IZ.id=rac.id 
                ORDER BY SortStr~'

id          val                                  SortStr
----------- ------------------------------------ ------------------
13          -321                                 01
12          -123                                 02
11          -22                                  03
10          -11                                  04
9           -2                                   05
8           -1                                   06
7           7.4.250                              071116
2           7.4.250                              071116
6           10.30.40.50.20.30.40                 08121517202326
14          22.40.5.60.4.-100.300.478.19710212   091314181921272830
15          22.40.5.60.4.-99.300.478.19710212    091314181922272830
5           22.40.5.60.4.99.300.478.9999999      091314181924272829
4           22.40.5.60.4.99.300.478.19710212     091314181924272830
3           22.40.5.60.4.100.300.478.19710212    091314181925272830
1           100                                  10

More on Rac @
www.rac4sql.net

Wednesday, July 16, 2008

Demoralization by trigger

In an interesting blog by Conor Cunningham, 'The Trouble with Triggers', he says:
"The problem with this area is that there is a great temptation to think
 about databases procedurally, as you would with a programming language
 like C++ or C#.  You can write code that looks like a procedural function
 call and have it get called for each insert into table!  Before you know
 it, non-database programmers are checking in code to your production system.
 Suddenly, your application grinds to a halt because a trigger plan has no
 index or is poorly designed.
 Databases are great tools for working on SETS of rows."

In another related blog, 'Triggers...Evil?', there is this comment with an 
insightful Freudian slip:

James Luetkehoelter said:
>Some sorts of demoralization lend themselves to triggers...

I would say this hits the nail on the head. I could understand a developer
getting a case of depression triggered by Conor's article. Triggers were
'implemented' to work efficiently on tables (sets), not on rows. The principle
that's operating here is that how something was implemented to be most
effective is the basis for what's best in application development. Are you
kidding me, has everyone gone nuts? Because triggers don't consider a row
as a primary concept, 'functional' programmers, i.e. application developers,
must 'unlearn' their database contrarian views. This is Celko's 'them' vs. 'us'
nonsense. Never mind that the real subject is application development and
possibly a theory that would best serve it; the basis for key concepts is
what a bunch of programmers did for an MS product manager. Talk about the
tail wagging the dog ☺ Not only is the absence of a 'row' type, or at least
the concept, antithetical to a relational dbms, but the row is central to
application development. Perhaps to even the score MS decided developers
should learn entities and unlearn sql entirely, i.e. LINQ. Or perhaps we'll
get a whole new science of application development based on what works fastest ☺

Sunday, July 06, 2008

An opposite constraint

This article discusses an issue raised in the thread:

microsoft.public.sqlserver.programming
Sunday, June 29, 2008 
'Storing a collection of lines'
http://tinyurl.com/56dpd4

Given two columns in a table, suppose you want to eliminate the opposite
data where there's no mathematical or logical relationship between
the columns. For example, consider a trip between two cities. It's 
equally likely a trip could start and end in either direction. If the 
table already has:

column A  column B
--------  --------
NEW YORK  CHICAGO

We want to prevent the opposite from being entered:

column A  column B
--------  --------
CHICAGO   NEW YORK  

If the table has:

column A  column B
--------  --------
CHICAGO   NEW YORK  

We want to prevent the opposite from being entered:

column A  column B
--------  --------
NEW YORK  CHICAGO

Because there's no relationship between the columns an sql check
constraint can't be used. But in Dataphor a simple transition
constraint can be used. The Opposite constraint simply checks if 
the opposite data for columns A and B already exists in the table. 
If it does, the current row is rejected. The constraint works the
same way for a single insert as it does for inserting multiple rows
(a table). 
For example:

create session table MyTable
{ A:String,B:String,C:String,key{A,B} };

alter table MyTable
{
  create transition constraint Opposite
   on insert //The current (row) values for columns A and B are accessed 
             //by prefixing each with 'new', ie. new.A, new.B . 
    not exists (MyTable {A X,B Y} {Y A,X B} where A=new.A and B=new.B)  
 tags        //A custom error message can be written using the current 
             //row values (new.A, new.B).
 { DAE.Message =
   "'For A: ' + new.A + ' and B: '+ new.B + ' there is a opposite, A: ' + new.B + ' and B: ' + new.A " 
 }   
};

These inserts succeed:

insert row{'NEW YORK' A,'CHICAGO' B,'1c' C} into MyTable;
insert row{'CALIFORNIA' B,'TEXAS' A,'1d' C} into MyTable;

Inserting an opposite will fail and the custom error message will be raised:

insert row{'NEW YORK' B,'CHICAGO' A,'1e' C} into MyTable;
"For A: CHICAGO and B: NEW YORK there is a opposite, A: NEW YORK and B: CHICAGO"

insert row{'CALIFORNIA' A,'TEXAS' B,'1d' C} into MyTable;
"For A: CALIFORNIA and B: TEXAS there is a opposite, A: TEXAS and B: CALIFORNIA"

Given that the table contains the data "NEW YORK" (A) and "CHICAGO" (B), 
inserting the following rows as a table will fail:

insert table{
             row{'DENVER' A,'BOSTON' B,'1c' C},
             row{'RENO' B,'MIAMI' A,'1d' C},
             row{'CHICAGO' A,'NEW YORK' B,'1e' C} 
            } into MyTable;
       
"For A: CHICAGO and B: NEW YORK there is a opposite, A: NEW YORK and B: CHICAGO. 

Note that the primary key constraint will prevent the same A and B cities from
being entered twice, but entering the opposite cities does not violate it. That's
what the Opposite constraint is for. The Opposite constraint is much simpler than
an sql solution using triggers.
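
Just to give a flavor of the sql side, here is a rough, untested t-sql sketch
of what such a trigger might look like (my own invention, not from the thread):

-- Reject an insert if any new (A,B) pair already exists reversed as (B,A).
CREATE TRIGGER OppositeCheck ON MyTable
AFTER INSERT
AS
BEGIN
  IF EXISTS (SELECT *
             FROM inserted AS I
             JOIN MyTable AS M
               ON M.A = I.B AND M.B = I.A)
  BEGIN
    RAISERROR('An opposite row already exists.', 16, 1)
    ROLLBACK TRANSACTION
  END
END

Even this sketch has edge cases to fret over (a row with A equal to B matches
itself), which is exactly the kind of care the transition constraint saves you.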

A one-one requirement constraint

The issue of a one-one requirement, i.e. a newly inserted order must be
accompanied by a detail row (insert), was raised in the thread:

microsoft.public.sqlserver.programming
Thursday, June 26, 2008 
'update joined tables'
http://tinyurl.com/69hh5c

While this is a non-trivial exercise in sql (see the sketch at the end of
this section) it's quite easy in the relational system Dataphor. For example:

create session table SOrders
{
 order_nbr:Integer,
 A:Integer,
 key{order_nbr}
}; 
 
create session table SOrderDetails
 {
  order_nbr:Integer,
  sku:Integer,
  B:Integer,
  C:String,
  key{order_nbr,sku},
  reference Details_SOrders{ order_nbr } references SOrders { order_nbr }
 };

This constraint makes a detail row a requirement for a new order. The user
defined message will be displayed anytime the constraint is violated.

alter table SOrders
  {
 create transition constraint MustHaveOrderDetails
  on insert 
    exists(SOrderDetails where order_nbr=new.order_nbr)  
   tags
 {
   DAE.Message =
          "'A detail row with order#: ' + ToString(new.order_nbr) + ' is required'" 
        }    
  };   
 
Inserting just a new order will violate the constraint:
insert row{1 order_nbr, 14 A} into SOrders; 
Violation of constraint "MustHaveOrderDetails", "A detail row with order#: 1 is required."

Inserting data can easily be done via a view.

create session view VSOrderDetails
                SOrders join SOrderDetails;//A natural join on order_nbr.

The MustHaveOrderDetails constraint works for the insertion of a single
row or a table.

insert row{1 order_nbr, 1 sku, 14 A, 5 B, 'Joe' C} into VSOrderDetails;     

delete SOrders;

Insert into the tables thru the view.

insert table
            {
             row{1 order_nbr, 1 sku, 14 A, 5 B, 'Joe' C},
             row{1,2,9,23,'Steve'},
             row{2,3,34,2,'Larry'}
            } into VSOrderDetails;   

select SOrders;

order_nbr A  
--------- -- 
1         14 
2         34 

select SOrderDetails;            

order_nbr sku B  C     
--------- --- -- ----- 
1         1   5  Joe   
1         2   23 Steve 
2         3   2  Larry 

This rather simple example shows off the much higher level of abstraction
offered to developers by a relational system like Dataphor over sql. In
sql only one table may be updated thru a view. In Dataphor there is no
concept of updating multiple tables. There is only the idea of updating a
view/expression, which is a table. It's the system's responsibility to resolve
a view to its base tables given the constraints in the schema. The user need
only realize a table is being used for data modification; it's irrelevant
how that table was constructed or how many tables appear in its definition.
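
As for why the one-one requirement is non-trivial in plain sql: a trigger on
the orders table demanding a detail row fires before any detail row can
legally exist (the detail's foreign key needs the order first), and sql
server has no deferred constraints. The usual workaround is a procedure that
inserts both inside one transaction. A rough sketch, with made up parameter
types:

-- Insert an order and its required detail together; neither is valid alone.
CREATE PROCEDURE InsertOrderWithDetail
  @order_nbr INT, @A INT, @sku INT, @B INT, @C VARCHAR(10)
AS
BEGIN
  BEGIN TRANSACTION
    INSERT SOrders (order_nbr, A) VALUES (@order_nbr, @A)
    INSERT SOrderDetails (order_nbr, sku, B, C)
    VALUES (@order_nbr, @sku, @B, @C)
  COMMIT TRANSACTION
END

Of course nothing stops someone from bypassing the procedure, which is why
the requirement belongs in the database as a constraint in the first place.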

Friday, July 04, 2008

Justifying an Sql Cop Out

This is my response to Joe Celko in the thread:

microsoft.public.sqlserver.programming
Thursday, June 26, 2008 
'update joined tables'
http://tinyurl.com/69hh5c

Joe invokes the 'NP complete' argument as the basis for why sql can't
handle complicated modeling issues. The specific modeling issue here
concerns a schema where there are multiple paths between tables due to
cascading updates. My view is the NP complete argument is essentially a
cop out. The same schema with multiple cascading updates that is rejected
by sql server is perfectly acceptable in Dataphor (along with associated
views involving multiple tables that can resolve data to the base tables).
(I suggest reading thru the whole thread, it's interesting ☺)

My response in the thread:

What do we call a man who demurs to an obstinate woman on every decision? We
call him a wimp, a guy who doesn't think it's worth the effort to present
his counterpoint because he's convinced it's impossible to change her
mind. The bottom line is, if it's so hard to change her mind why even
bother. It's also an easy out from taking any responsibility for the
outcome. Here you're using the idea of an NP complete problem as a crutch
for the mess sql has left developers in. Woman may be an NP complete problem
for man but it doesn't follow that given a relationship all men are wimps.
Just because there isn't a universal and instantaneous quick fix technique 
to change a woman's mind in any situation doesn't bum out all men. Some guys 
are inventive and creative and come up with techniques that will work at 
least in some situations. Isn't that what lying, pleading and begging are 
for?:) NP complete system problems are brick walls not because there's no 
way to solve them but because there's no good quick fix from a systems point 
of view. But that doesn't mean that a dbms should wimp out on them. You're using 
NP to suggest an all or nothing game. Since no efficient computational 
scheme exists to cover all situations where referential integrity involving 
cascading updates comes into play, sql is going to take its ball and just
go home. This is at most nonsense, at the very least no sense. You're setting 
poor sql up as the victim here and whining about Petri Nets! The perfect 
solution doesn't exist, so what. Sql is/was just being damn lazy. All they 
had to do is talk to a bunch of guys who have been married for twenty years 
to get a clue:) Like they couldn't use some heuristics or approximations? 
Nope, they just wimped out. Same with views. Sql is the victim of a yet to 
be found super quick universal solution to updatability. This is simply an sql 
crutch for abandoning the idea totally! They couldn't be creative and cover 
some percentage of possible expressions that could be resolved to base tables? 
Apparently not. So sql gets away with only being able to update a single table
and sticks a huge inconsistency between a table and view/expression in the face
of developers.

Here's an sql server example that has multiple cascading update paths:

-- Updating order_nbr in TableA cascades to order_nbr in TableB and TableC.
create table TableA
(
 order_nbr Integer,
 A         Integer,
 primary key(order_nbr)
) ;
 
 --Updating sku in TableB cascades to sku in TableC.
 create table TableB
(
 order_nbr Integer,
 sku       Integer,
 B         Integer,
 primary key(order_nbr,sku),
CONSTRAINT  Details_TableB foreign key (order_nbr) REFERENCES TableA (order_nbr)
ON UPDATE CASCADE
) ;
  
 -- TableC has a dependency/path on TableA and TableB.
 create table TableC
(
 order_nbr Integer,
 sku       Integer,
 ID        Integer,
 C         char(1),
primary key (order_nbr,sku,ID),
CONSTRAINT Details_TableC1 FOREIGN KEY (order_nbr)     REFERENCES TableA (order_nbr)
ON UPDATE CASCADE,
CONSTRAINT Details_TableC2 FOREIGN KEY (order_nbr,sku) REFERENCES TableB (order_nbr,sku)
ON UPDATE CASCADE
 ) ;

Now is this a non-trivial problem for a dbms? Yes it is. Is it unsolvable in
terms of solution and efficiency? Of course not. But when you try to create
TableC you're told by sql server that it can't do it:

"Introducing FOREIGN KEY constraint 'Details_TableC2' on table 'TableC' may 
 cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or 
 ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints."

Game over. Anything other than the most trivial path is outside the scope of 
the 'system' to handle. Users are forever told the sql game is 'what' not 
'how'. But when it's convenient and expedient you turn the idea around and 
say 'how' can this be done when there's no known (NP complete) solution. 
Nice sophistry!:) Now this same DDL, with some modification, is perfectly 
fine in Dataphor using sql server to persist the tables. And using views, 
such as TableA join TableB, is also fine. The views will resolve data to the 
tables and the update cascades work as expected. Do you really think that 
this problem is NP complete for the Sql Server team but not for Dataphor? 
Give me a break:)
Users should understand that because sql doesn't try to solve complex
modeling issues and doesn't attempt to resolve views/expressions with
multiple tables, this doesn't mean these things can't be solved. A dbms
can be built so these things are handled by the 'system' and the user isn't
left twisting in the wind making a mess in the front end. Will something
like Dataphor be able to handle every conceivable combination of cascades
between tables along with other types of constraints? Of course not, but it
will handle a lot of tricky schemes that sql can't. Is Dataphor capable of
resolving (updatability) every view/expression to the base tables? Of course
not, but it will handle a large chunk of them. And in each case these things
are transparent to the user. It would be nice to have a little honesty about
this. Instead of creating straw men and red herrings just tell it like it is. 
The framework/foundation of sql is just not able to handle these issues. The
die was cast a long time ago (and yes sql was lazy on a lot of fronts). And
there's little hope of sql addressing these things in the future. This
certainly doesn't make sql irrelevant. It's a 'right tool for the job' world.
I think sql folks can handle it. After all, not all sql users are wimps:)

Wednesday, May 14, 2008

Dataphor - Splitting a string into fixed sized word segments

This is a solution using Dataphor to the problem in the post:

comp.databases.ms-sqlserver
May 9, 2008
'SQL Statement Problem : Separate long varchar into word seqments 35 chars long'
http://tinyurl.com/42vty5

The op states the problem as:
'I have a table of names which can be very long.  These names get printed on envelopes.  
 The problem is the envelope can only hold 35 characters per line.  I have to divide into
 4 lines at most. 
 So I need to separate these long varchars into segments, no longer than 35 characters but 
 preserving whole words.' 

Data is stored in MS Sql Server 2005. Here is some sample data stored in table SMerge:

-- Tsql DDL statements to create the SMerge table in a sql server database.
create table SMerge (ID Int primary key,InStr varchar(250))
go
insert SMerge values
(1,'BIOLOGICAL SURVEY FOUNDATION OF ONTARIO HAPPY IF SOMEONE CAN HELP SOLVE THIS SQL PROBLEM') 
insert SMerge values
(2,'I SURE HOPE I CAN GET THIS THING TO WORK RIGHT') 
insert SMerge values
(3,' What will happen  with a big string full of all kinds of names and addresses including  
     some goobly goop. Will it hold up ?') 

This is a fairly simple iterative solution. It illustrates various dataphor programming
capabilities including set based, cursor based and procedural styles.
I've relaxed the requirement of at most 4 lines. Given a maximum line size the batch
iterates over each input string creating lines not exceeding the maximum size until all 
words in the string are assigned to a line. There is no limit to the number of lines
allowed (although there certainly could be). 

//                    Start of batch

if not ObjectExists("WTable") then                  
Execute(
"create session table WTable         //Temporary table used in batch. 
{
 StrID:Integer,                      //Input string id.
 Str:String,                         //An individual string, ie. a word, from splitting.
 Seq:Integer,                        //Position/sequence of word from split.
 Interval:Integer nil {default nil}, //The line number the word will eventually belong to.
 StrLen:Integer,                     //Length of word (Str).
 RunningSumStrLens:Integer nil,      //Running sum of word lengths within a line. 
 LenCheck:Boolean {default false},   //Whether or not a word (Str) has been assigned to a line.
 key{StrID,Seq}
};"
      )
  else
   Execute("delete WTable;");
                               
var MaxLineLength:=35; //Max length of line desired.
var LCheck:Boolean;    // Holds boolean (t/f) of adding string to line, 
                       // whether line is within max line size. 
var MaxSeq:Integer;    //Holds max Seq for ID and line (interval).
                       // SMerge is the sql server table. 

var LCursorSqlTable:=cursor(SMerge order by {ID});

// Loop over all rows in the sql table splitting each input string (InStr).

while LCursorSqlTable.Next() do
begin

// Error check. If for any ID the length of any word (Str), via Split, exceeds the max line 
// size then eliminate the ID from the result. This also ensures no endless cursor loops.

if exists(
          ToTable(LCursorSqlTable.Select().InStr.Split({' '}),'Str','Seq') 
                   where Length(Str)>MaxLineLength
         ) then
 continue; //Skips below code and gets another input string (InStr).
 
// Insert the split string into work table WTable. Split operator splits using blank as
// the delimiter. A sequence number (Seq) is also generated which is an integer indicating
// the position (from 0 to N) from left to right of each word (Str) in the input string (InStr).
// We also compute the length of each word (Str). This is stored in column StrLen.

  insert
     (
      ToTable(LCursorSqlTable.Select().InStr.Split({' '}),'Str','Seq') add{Length(Str) StrLen,
                   LCursorSqlTable.Select().ID StrID,
                   nil as Integer RunningSumStrLens} 
                   where (StrLen>0) //Eliminate extra spaces. We could do all sorts of other
                                    //checks and manipulations here as well.
      ) into WTable;  
var LInterval:=0; //Line(Interval) within an ID (InStr).
var LRunningSumStrLens:Integer;  
MaxSeq:=-1; //Start current ID with Seq>-1, ie. include all words (Seq starts at 0).

// Iterate over WTable. Get a running sum of word lengths. Each iteration represents a new
// line based on comparing the running sum of word lengths to the max line size. Mark words
// in rows of WTable as either true (part of new line) or false (to be assigned in a subsequent
// iteration). As long as there are words to be assigned to a line the iteration (loop)
// continues. The loop stops when all words for an ID have been optimally assigned to a
// line (Interval). In other words loop stops when all values of LenCheck are true because
// each word has been assigned to a line.

 while exists(WTable where (StrID=LCursorSqlTable.Select().ID) and (Seq>MaxSeq)) do
 begin
 LInterval:=LInterval+1 ;  
 LRunningSumStrLens:=0;  //Initialize running sum for word lengths to 0 for a new line.

// Use WTable eliminating prior assigned words. We don't need these rows, we need rows 
// representing words that still need to be assigned to a line. Seq>MaxSeq gets these rows,
// it gets all rows (words) for an ID that haven't yet been assigned to a line (Interval).

 var LCursor:=
 cursor( WTable where ((StrID=LCursorSqlTable.Select().ID) and (Seq>MaxSeq)) order by {StrID,Seq} 
                                        capabilities {Navigable, Searchable, Updateable});
  while LCursor.Next() do   

// For all remaining words that aren't yet assigned to a line(Interval) for the current ID
// get a running sum of word lengths and check that the run value is <= max line size.

  begin

// If it's the start of a new line we only want the running sum to be the length of the word, 
// if it's after the 1st word we add the previous word lengths, which includes a 1 for a space
// between words, and the length of the current word.

  LRunningSumStrLens:= if LRunningSumStrLens=0 then LCursor.Select().StrLen 
                       else LRunningSumStrLens + 1 + LCursor.Select().StrLen;
// Check if the word, as part of the running sum, is within the maximum line size.                       
  
  LCheck:=LRunningSumStrLens<=MaxLineLength ;

// If the current word length, when added to the running sum of lengths, is greater than the  
// max line size there's no more point in staying within the LCursor loop. The line(Interval)
// has been determined using prior words so break out of this loop and start forming a new 
// line(Interval) with as yet unassigned words (where LenCheck=false) with a new running sum. 

  if not LCheck then
               break;

// Update LenCheck to true. The current word fits on the current line as tested by comparing
// the running sum (with the current word length) to the max line size. We can now ignore
// this word (row) in assigning subsequent (unassigned) words to a line. Remember that the
// default value of LenCheck is false. So we don't have to update words (set LenCheck=false) 
// as false since this is the default. And we'll get those words (where LenCheck=false) on
// the next iteration of the loop.  

  LCursor.Update(row{LRunningSumStrLens RunningSumStrLens,LCheck LenCheck,LInterval Interval});
 end; //Lcursor, looping over WTable for an ID until all words are assigned to a line. 

// Find max Seq for the current ID where the word has been assigned to a line (LenCheck=true). 
// The next definition for the cursor for WTable starts with Seq = MaxSeq+1, ie. the rows with
// Seq values greater than the max Seq value of assigned words. Interval is also used here
// to help make the query more efficient (a possible better alternative would be to keep
// track of Seq).

 MaxSeq:=Max(Seq from WTable where (StrID=LCursorSqlTable.Select().ID) 
                                    and (Interval=LInterval) and (LenCheck));
 end; //WTable
end;  //LCursorSqlTable. 

// Here is a simple check that this batch assigned each word to a line for each ID. All
// LenCheck values must be true or else there is a failure somewhere in the batch. Of course we
// could more thoroughly check exactly where (ID and line) failure(s) occurred in the query.
// We could also put extensive error handling throughout the batch. Just so you know ☺

if not All(LenCheck from WTable) then
               exit;

// Use a query to build the output strings (ALine) by concatenating (using the Concat operator) 
// the words (Str) within IDs and lines (Interval) and ordering the concatenation by Seq 
//(the position determined by Split). We also include the line length for each 
// concatenated line (ALine) of words within an ID.

select WTable add{' ' del} 
         adorn {key{Seq}}  
            group by {StrID,Interval} add{ Concat(Str,del order by {Seq}) ALine} 
              add{Length(ALine) LineLength}
                  order by {StrID,Interval}; 

//                    End of batch                  
                  
Result:

StrID Interval ALine                               LineLength 
----- -------- ----------------------------------- ---------- 
1     1        BIOLOGICAL SURVEY FOUNDATION OF     31         
1     2        ONTARIO HAPPY IF SOMEONE CAN HELP   33         
1     3        SOLVE THIS SQL PROBLEM              22         
2     1        I SURE HOPE I CAN GET THIS THING TO 35         
2     2        WORK RIGHT                          10         
3     1        What will happen with a big string  34         
3     2        full of all kinds of names and      30         
3     3        addresses including some goobly     31         
3     4        goop. Will it hold up?              22     

Here is what WTable looks like before concatenation. Note the running sum of string lengths
(RunningSumStrLens) within each line (Interval) for each ID (StrID).

select WTable order by {StrID,Interval,Seq};    

StrID Str        Seq Interval StrLen RunningSumStrLens LenCheck 
----- ---------- --- -------- ------ ----------------- -------- 
1     BIOLOGICAL 0   1        10     10                True     
1     SURVEY     1   1        6      17                True     
1     FOUNDATION 2   1        10     28                True     
1     OF         3   1        2      31                True     
1     ONTARIO    4   2        7      7                 True     
1     HAPPY      5   2        5      13                True     
1     IF         6   2        2      16                True     
1     SOMEONE    7   2        7      24                True     
1     CAN        8   2        3      28                True     
1     HELP       9   2        4      33                True     
1     SOLVE      10  3        5      5                 True     
1     THIS       11  3        4      10                True     
1     SQL        12  3        3      14                True     
1     PROBLEM    13  3        7      22                True     
2     I          0   1        1      1                 True     
.
.
3     hold       23  4        4      18                True     
3     up?        24  4        3      22                True     


How will this batch work for larger strings? Let's take the whole question of the post
(cut and pasted from the op on google).

Here, through dataphor, we assign the sql server table SMerge just one row of data 
consisting of the post data as the input string. As part of the insert we remove
carriage returns and line feeds using the Replace function.

SMerge:=
table
{
row{1 ID,
"I'm having a really tough time with a SQL statement and I am wondering
if someone is able to help out or point me in the right direction.

I have a table of names which can be very long.  These names get
printed on envelopes.  The problem is the envelope can only hold 35
characters per line.  I have to divide into 4 lines at most.

So I need to separate these long varchars into segments, no longer
than 35 characters but preserving whole words.

So far my approach has been to take a LEFT segment, REVERSE it, find
the first space with CHARINDEX and use it to calculate how many
characters to take in a SUBBSTRING.

Here's an example of what I have been trying.  I can find the first
two segments, but then it starts to get confusing. 
Can anyone suggest a better approach?  Am I going to be able to do
this in SQL?

I appreciate any help." InStr}
} // Replace carriage returns and line feeds with blanks in InStr.
redefine {InStr:=Replace(Replace(InStr,ASCII(list(Byte){13}) ,' '),ASCII(list(Byte){10}) ,' ')}
;

Executing the batch with a maximum line size of 80 (MaxLineLength:=80) we get:

StrID Interval ALine                                                                            LineLength 
----- -------- -------------------------------------------------------------------------------- ---------- 
1     1        I'm having a really tough time with a SQL statement and I am wondering if        73         
1     2        someone is able to help out or point me in the right direction. I have a table   78         
1     3        of names which can be very long. These names get printed on envelopes. The       74         
1     4        problem is the envelope can only hold 35 characters per line. I have to divide   78         
1     5        into 4 lines at most. So I need to separate these long varchars into segments,   78         
1     6        no longer than 35 characters but preserving whole words. So far my approach has  79         
1     7        been to take a LEFT segment, REVERSE it, find the first space with CHARINDEX and 80         
1     8        use it to calculate how many characters to take in a SUBBSTRING. Here's an       74         
1     9        example of what I have been trying. I can find the first two segments, but then  79         
1     10       it starts to get confusing. Can anyone suggest a better approach? Am I going to  79         
1     11       be able to do this in SQL? I appreciate any help.                                49         

Executing the batch with a maximum line size of 70 (MaxLineLength:=70) we get:

StrID Interval ALine                                                                  LineLength 
----- -------- ---------------------------------------------------------------------- ---------- 
1     1        I'm having a really tough time with a SQL statement and I am wondering 70         
1     2        if someone is able to help out or point me in the right direction. I   68         
1     3        have a table of names which can be very long. These names get printed  69         
1     4        on envelopes. The problem is the envelope can only hold 35 characters  69         
1     5        per line. I have to divide into 4 lines at most. So I need to separate 70         
1     6        these long varchars into segments, no longer than 35 characters but    67         
1     7        preserving whole words. So far my approach has been to take a LEFT     66         
1     8        segment, REVERSE it, find the first space with CHARINDEX and use it to 70         
1     9        calculate how many characters to take in a SUBBSTRING. Here's an       64         
1     10       example of what I have been trying. I can find the first two segments, 70         
1     11       but then it starts to get confusing. Can anyone suggest a better       64         
1     12       approach? Am I going to be able to do this in SQL? I appreciate any    67         
1     13       help.                                                                  5   

Seems to work ok ☺

Note that with this simple framework it would be easy to add all kinds of editing niceties. 
Left justification, centering etc. could be easily added. It would also be easy to edit/check
the individual words, possibly eliminating some given specific rules. Note that we could
easily make the batch an operator (stored procedure) returning the work table or a virtual table.

If you have any ideas on further features or what would be kewl for this framework please 
let me know ☺

You can check out dataphor at:
www.dataphor.org
