Beyond Sql: Dataphor - Splitting a string into fixed sized word segments

This is a solution using Dataphor to the problem in the post:

comp.databases.ms-sqlserver
May 9, 2008
'SQL Statement Problem : Separate long varchar into word seqments 35 chars long'
http://tinyurl.com/42vty5

The op states the problem as:
'I have a table of names which can be very long.  These names get printed on envelopes.  
 The problem is the envelope can only hold 35 characters per line.  I have to divide into
 4 lines at most. 
 So I need to separate these long varchars into segments, no longer than 35 characters but 
 preserving whole words.' 

Data is stored in MS Sql Server 2005. Here is some sample data stored in table SMerge:

-- Tsql DDL statements to create the SMerge table in a sql server database.
create table SMerge (ID Int primary key,InStr varchar(250))
go
insert SMerge values
(1,'BIOLOGICAL SURVEY FOUNDATION OF ONTARIO HAPPY IF SOMEONE CAN HELP SOLVE THIS SQL PROBLEM') 
insert SMerge values
(2,'I SURE HOPE I CAN GET THIS THING TO WORK RIGHT') 
insert SMerge values
(3,' What will happen  with a big string full of all kinds of names and addresses including  
     some goobly goop. Will it hold up ?') 

This is a fairly simple iterative solution. It illustrates various dataphor programming
capabilities including set based, cursor and procedural.
I've relaxed the requirement of at most 4 lines. Given a maximum line size the batch
iterates over each input string creating lines not exceeding the maximum size until all 
words in the string are assigned to a line. There is no limit to the number of lines
allowed (although there certainly could be). 

//                    Start of batch

if not ObjectExists("WTable") then                  
Execute(
"create session table WTable         //Temporary table used in batch. 
{
 StrID:Integer,                      //Input string id.
 Str:String,                         //An individual string, ie. a word, from splitting.
 Seq:Integer,                        //Position/sequence of word from split.
 Interval:Integer nil {default nil}, //The line number the word will eventually belongs to.
 StrLen:Integer,                     //Length of word (Str).
 RunningSumStrLens:Integer nil,      //Running sum of word lengths within a line. 
 LenCheck:Boolean {default false},   //Whether or not a word (Str) has been assigned to a line.
 key{StrID,Seq}
};"
      )
  else
   Execute("delete WTable;");
                               
var MaxLineLength:=35; //Max length of line desired.
var LCheck:Boolean;    // Holds boolean (t/f) of adding string to line, 
                       // whether line is within max line size. 
var MaxSeq:Integer;    //Holds max Seq for ID and line (interval).
                       // SMerge is the sql server table. 

var LCursorSqlTable:=cursor(SMerge order by {ID});

// Loop over all rows in the sql table splitting each input string (InStr).

while LCursorSqlTable.Next() do
begin

// Error check. If for any ID the length of any word (Str), via Split, exceeds the max line 
// size then eliminate the ID from the result. This also insures no endless cursor loops.

if exists(
          ToTable(LCursorSqlTable.Select().InStr.Split({' '}),'Str','Seq') 
                   where Length(Str)>MaxLineLength
         ) then
 continue; //Skips below code and gets another input string (InStr).
 
// Insert the split string into work table WTable. Split operator splits using blank as
// the delimiter. A sequence number (Seq) is also generated which is an integer indicating
// the position (from 0 to N) from left to right of each word (Str) in the input string (InStr).
// We also compute the length each word (Str). This is stored in column StrLen.

  insert
     (
      ToTable(LCursorSqlTable.Select().InStr.Split({' '}),'Str','Seq') add{Length(Str) StrLen,
                   LCursorSqlTable.Select().ID StrID,
                   nil as Integer RunningSumStrLens} 
                   where (StrLen>0) //Eliminate extra spaces. We could do all sorts of other
                                    //checks and manipulations here as well.
      ) into WTable;  
var LInterval:=0; //Line(Interval) within an ID (InStr).
var LRunningSumStrLens:Integer;  
MaxSeq:=-1; //Start current ID with Seq>-1, ie. include all words (Seq starts at 0).

// Iterate over WTable. Get a running sum of word lengths. Each iteration represents a new
// line based on comparing the running sum of word lengths to the max line size. Mark words
// in rows of WTable as either true (part of new line) or false (to be assigned in a subsequent
// iteration). As long as there are words to be assigned to a line the iteration (loop)
// continues. The loop stops when all words for an ID have been optimally assigned to a
// line (Interval). In other words loop stops when all values of LenCheck are true because
// each word has been assigned to a line.

 while exists(WTable where (StrID=LCursorSqlTable.Select().ID) and (Seq>MaxSeq)) do
 begin
 LInterval:=LInterval+1 ;  
 LRunningSumStrLens:=0;  //Initialize running sum for word lengths to 0 for a new line.

// Use WTable eliminating prior assigned words. We don't need these rows, we need rows 
// representing words that still need to be assigned to a line. Seq>MaxSeq gets these rows,
// it gets all rows (words) for an ID that haven't yet been assigned to a line (Interval).

 var LCursor:=
 cursor( WTable where ((StrID=LCursorSqlTable.Select().ID) and (Seq>MaxSeq)) order by {StrID,Seq} 
                                        capabilities {Navigable, Searchable, Updateable});
  while LCursor.Next() do   

// For all remaining words that aren't yet assigned to a line(Interval) for the current ID
// get a running sum of word lengths and check that the run value is <= max line size.

  begin

// If it's the start of a new line we only want the running sum to be the length of the word, 
// if it's after the 1st word we add the previous word lengths, which includes a 1 for a space
// between words, and the length of the current word.

  LRunningSumStrLens:= if LRunningSumStrLens=0 then LCursor.Select().StrLen 
                       else LRunningSumStrLens + 1 + LCursor.Select().StrLen;
// Check if the word, as part of the running sum, is within the maximum line size.                       
  
  LCheck:=LRunningSumStrLens<=MaxLineLength ;

// If the current word length, when added to the running sum of lengths, is greater than the  
// max line size there's no more point in staying within the LCursor loop. The line(Interval)
// has been determined using prior words so break out of this loop and start forming a new 
// line(Interval) with as yet unassigned words (where LenCheck=false) with a new running sum. 

  if not LCheck then
               break;

// Update LenCheck to true. The current word fits on the current line as tested by comparing
// the running sum (with the current word length) to the max line size. We can now ignore
// this word (row) in assigning subsequent (unassigned) words to a line. Remember that the
// default value of LenCheck is false. So we don't have to update words (set LenCheck=false) 
// as false since this is the default. And we'll get those words (where LenCheck=false) on
// the next iteration of the loop.  

  LCursor.Update(row{LRunningSumStrLens RunningSumStrLens,LCheck LenCheck,LInterval Interval});
 end; //Lcursor, looping over WTable for an ID until all words are assigned to a line. 

// Find max Seq for the current ID where the word has been assigned to a line (LenCheck=true). 
// The next definition for the cursor for WTable starts with Seq = MaxSeq+1, ie. the rows with
// Seq values greater than the max Seq value of assigned words. Interval is also used here
// to help make the query more efficient (a possible better alternative would be to keep
// track of Seq).

 MaxSeq:=Max(Seq from WTable where (StrID=LCursorSqlTable.Select().ID) 
                                    and (Interval=LInterval) and (LenCheck));
 end; //WTable
end;  //LCursorSqlTable. 

// Here is a simple check that this batch assigned each word to a line for each ID. All
// LenCheck values must be true or else there is a failure somewhere in batch. Of course we
// could more thoroughly check exactly where (ID and line) failure(s) occurred in the query.
// We could also put extensive error handling throughout batch. Just so you know ☺

if not All(LenCheck from WTable) then
               exit;

// Use a query to build the output strings (ALine) by concatenating (using the Concat operator) 
// the words (Str) within IDs and lines (Interval) and ordering the concatenation by Seq 
//(the position determined by Split). We also include the line length for each 
// concatenated line (ALine) of words within an ID.

select WTable add{' ' del} 
         adorn {key{Seq}}  
            group by {StrID,Interval} add{ Concat(Str,del order by {Seq}) ALine} 
              add{Length(ALine) LineLength}
                  order by {StrID,Interval}; 

//                    End of batch                  
                  
Result:

StrID Interval ALine                               LineLength 
----- -------- ----------------------------------- ---------- 
1     1        BIOLOGICAL SURVEY FOUNDATION OF     31         
1     2        ONTARIO HAPPY IF SOMEONE CAN HELP   33         
1     3        SOLVE THIS SQL PROBLEM              22         
2     1        I SURE HOPE I CAN GET THIS THING TO 35         
2     2        WORK RIGHT                          10         
3     1        What will happen with a big string  34         
3     2        full of all kinds of names and      30         
3     3        addresses including some goobly     31         
3     4        goop. Will it hold up?              22     

Here is what WTable looks like before concatenation. Note the running sum of string lengths
(RunningSumStrLens) within each line (Interval) for each ID (StrID).

select WTable order by {StrID,Interval,Seq};    

StrID Str        Seq Interval StrLen RunningSumStrLens LenCheck 
----- ---------- --- -------- ------ ----------------- -------- 
1     BIOLOGICAL 0   1        10     10                True     
1     SURVEY     1   1        6      17                True     
1     FOUNDATION 2   1        10     28                True     
1     OF         3   1        2      31                True     
1     ONTARIO    4   2        7      7                 True     
1     HAPPY      5   2        5      13                True     
1     IF         6   2        2      16                True     
1     SOMEONE    7   2        7      24                True     
1     CAN        8   2        3      28                True     
1     HELP       9   2        4      33                True     
1     SOLVE      10  3        5      5                 True     
1     THIS       11  3        4      10                True     
1     SQL        12  3        3      14                True     
1     PROBLEM    13  3        7      22                True     
2     I          0   1        1      1                 True     
.
.
3     hold       23  4        4      18                True     
3     up?        24  4        3      22                True     


How will this batch work for larger strings? Lets take the whole question of the post
(cut and pasted from the op on google).

Here, through dataphor, we assign the sql server table SMerge just one row of data 
consisting of the post data as the input string. As part of the insert we remove
carrige returns and line feeds using the Replace function.

SMerge:=
table
{
row{1 ID,
"I'm having a really tough time with a SQL statement and I am wondering
if someone is able to help out or point me in the right direction.

I have a table of names which can be very long.  These names get
printed on envelopes.  The problem is the envelope can only hold 35
characters per line.  I have to divide into 4 lines at most.

So I need to separate these long varchars into segments, no longer
than 35 characters but preserving whole words.

So far my approach has been to take a LEFT segment, REVERSE it, find
the first space with CHARINDEX and use it to calculate how many
characters to take in a SUBBSTRING.

Here's an example of what I have been trying.  I can find the first
two segments, but then it starts to get confusing. 
Can anyone suggest a better approach?  Am I going to be able to do
this in SQL?

I appreciate any help." InStr}
} // Replace carrige returns and line feeds with blanks in InStr.
redefine {InStr:=Replace(Replace(InStr,ASCII(list(Byte){13}) ,' '),ASCII(list(Byte){10}) ,' ')}
;

Executing the batch with a maximum line size of 80 (MaxLineLength:=80) we get:

StrID Interval ALine                                                                            LineLength 
----- -------- -------------------------------------------------------------------------------- ---------- 
1     1        I'm having a really tough time with a SQL statement and I am wondering if        73         
1     2        someone is able to help out or point me in the right direction. I have a table   78         
1     3        of names which can be very long. These names get printed on envelopes. The       74         
1     4        problem is the envelope can only hold 35 characters per line. I have to divide   78         
1     5        into 4 lines at most. So I need to separate these long varchars into segments,   78         
1     6        no longer than 35 characters but preserving whole words. So far my approach has  79         
1     7        been to take a LEFT segment, REVERSE it, find the first space with CHARINDEX and 80         
1     8        use it to calculate how many characters to take in a SUBBSTRING. Here's an       74         
1     9        example of what I have been trying. I can find the first two segments, but then  79         
1     10       it starts to get confusing. Can anyone suggest a better approach? Am I going to  79         
1     11       be able to do this in SQL? I appreciate any help.                                49         

Executing the batch with a maximum line size of 70 (MaxLineLength:=70) we get:

StrID Interval ALine                                                                  LineLength 
----- -------- ---------------------------------------------------------------------- ---------- 
1     1        I'm having a really tough time with a SQL statement and I am wondering 70         
1     2        if someone is able to help out or point me in the right direction. I   68         
1     3        have a table of names which can be very long. These names get printed  69         
1     4        on envelopes. The problem is the envelope can only hold 35 characters  69         
1     5        per line. I have to divide into 4 lines at most. So I need to separate 70         
1     6        these long varchars into segments, no longer than 35 characters but    67         
1     7        preserving whole words. So far my approach has been to take a LEFT     66         
1     8        segment, REVERSE it, find the first space with CHARINDEX and use it to 70         
1     9        calculate how many characters to take in a SUBBSTRING. Here's an       64         
1     10       example of what I have been trying. I can find the first two segments, 70         
1     11       but then it starts to get confusing. Can anyone suggest a better       64         
1     12       approach? Am I going to be able to do this in SQL? I appreciate any    67         
1     13       help.                                                                  5   

Seems to work ok ☺

Note that with this simple framework it would be easy to add all kinds of editing niceties. 
Left justification, centering etc. could be easily added. It would also be easy to edit/check
the individual words, possibly eliminating some given specific rules. Note that we could
easily make the batch an operator (stored procedure) returning the work table or a virtual table.

If you have any ideas on further features or what would be kewl for this framework please 
let me know ☺

You can check out dataphor at:
www.dataphor.org
Wednesday, May 14, 2008

Dataphor - Splitting a string into fixed sized word segments

No comments:

Beyond Sql

Blog Archive

Links

About Me