Beyond Sql: Dataphor - Merging strings together

This problem was raised in the thread:
microsoft.public.sqlserver.programming
Monday, October 15, 2007 
'Parse and Merge two fields'
http://tinyurl.com/2eosa3

The idea is to take two pipe ('|') delimited strings and created one
string where each item in the first string is matched with the item 
in the second string having the same ordinal position.
The two columns are People and Position. So given:

People:=' Sam | Jane | Gene'
Position:= 'Accounting | Finance | Marketing'

we want:

People_Position:= 'Sam|Accounting|Jane|Finance|Gene|Accounting'

Of course in sql this is usually a mess since sql does not
support a list type and any of the operations that can transform
a delimited string to a list and visa versa.
In D4 we simply split the delimited string to a list of items and
then create a table from the list. The table will contain the
ordinal position as an integer value of each item in the list. 
So a row has the item (string) value and sequence value of that item
in the string as columns. So we just concatenate the columns of each
string together (separated by a pipe) and then concatenate these 
strings over the rows with the Concat operator ordering the 
concatenated string by the sequence value (over rows). 

Note that there is no concept of just unique item in a list. A list
can have any number of duplicate items. When the list is transformed
to a table the sequence number guarantees that same value items will
be in the table. In other words the sequence is a key of the table.

select ToTable('A|A|B|B|C'.Split({'|'}));

value sequence 
----- -------- 
A     0        
A     1        
B     2        
B     3        
C     4      

And with column names of your choice:

select ToTable('A|A|B|B|C'.Split({'|'}),'Str','Seq');

Str Seq 
--- --- 
A   0   
A   1   
B   2   
B   3   
C   4   

(Browse this site for other articles on string operations for more info).

Here is some sample data. The data is stored in a Sql Server 2005 database.
All queries are written with the D4 relational language of Dataphor.

create table JR
{
 id:Integer,
 People:String tags{Storage.Length='100'},
 Position:String tags{Storage.Length='100'},
 key{id}
};
 
insert
table
{
row{1 id,'joe|sam|pete|mike' People,'Manager|Team Lead|Worker Ant|Worker Ant' Position},
row{2,'A| C|  T ',' JobA|JobC |JobT '},
row{3,'A| C ',' JobA | JobB |JobC '}, //Ummatched strings (lists)
row{4,'Bush | Cheney | Rice ' ,' Worser|Worst|Worse '},
row{5,'Z |Y |X ' ,' Backwards Z|Backwards Y |Backwards X '}
} into JR;

select JR;

id People                Position                                
-- --------------------- --------------------------------------- 
1  joe|sam|pete|mike     Manager|Team Lead|Worker Ant|Worker Ant 
2  A| C|  T               JobA|JobC |JobT                        
3  A| C                   JobA | JobB |JobC                      
4  Bush | Cheney | Rice   Worser|Worst|Worse                     
5  Z |Y |X                Backwards Z|Backwards Y |Backwards X

Also note that we can access the table with a pass-thru query using t-sql:

select SQLQuery("SELECT id,People,Position FROM JR"); 

id People                Position                                
-- --------------------- --------------------------------------- 
1  joe|sam|pete|mike     Manager|Team Lead|Worker Ant|Worker Ant 
2  A| C|  T               JobA|JobC |JobCT                       
3  A| C                   JobA | JobB |JobC                      
4  Bush | Cheney | Rice   Worser|Worst|Worse                     
5  Z |Y |X                Backwards Z|Backwards Y |Backwards X   

Note that the only restriction used here is that both strings have
the same number of items. 

select 
  JR //The table JR in Sql Server 2005. 
  //SQLQuery("SELECT id,People,Position FROM JR") Or we could use this t-sql query . 
     where //The same 'where' statement as in sql. 
           //Check that each string has the same number of items.      
          People.Split({'|'}).Count()=Position.Split({'|'}).Count()
                                with {IgnoreUnsupported = 'true'}
      //For each row in the table we 'add' a column (People_Position).
      //The People_Position column is the '|' delimited string of
      //People and Position item by item.
      add
      {
       //From a table whose rows represents each item in People and Position,
       //for each row form a string of the person and position separated by
       //a '|' and concatenate that string over the rows into a single string.
       Concat({People_Position,Del} 
        from
         (
          //Use Split to create a 'list' of items from the delimited string.
          //Then create a table from list with ToTable. Column 'Str1'
          //is the string value of the item from the list. Column 'Seq'
          //is the item number from the list. So now a table represents
          //the original input column People. In other words, the rows of
          //the table are the delimited items in the string from left to right.
          (ToTable(People.Split({'|'}),'Str1','Seq'){Str1.Trim() Str1,Seq} )       
           join  //A natural join on Seq which is an integer from 0-># of items
                 //in the string. Seq preserves the sequence from left to
                 //to right of the items in the string.
          (ToTable(Position.Split({'|'}),'Str2','Seq'){Str2.Trim() Str2,Seq} )
            {Seq,Str1,Str2,'|' Del,(Str1+'|'+Str2) People_Position}  
         )
                order by {Seq}) //The concatenation over rows is ordered by the
                                //Seq column. This is the same order as the items
                                //in the original input columns, People/Position.
                                //Therefore the original order is preserved in
                                //the result (People_Position).
        People_Position //This is name of the new column added to each row of JR.
       }
        {id,People_Position}  //These are the only two columns we've chosen
                              //to display from the query.
            order by {id};    //Order by the values if the id column.
 
id People_Position                                           
-- --------------------------------------------------------- 
1  joe|Manager|sam|Team Lead|pete|Worker Ant|mike|Worker Ant 
2  A|JobA|C|JobC|T|JobT                                      
4  Bush|Worser|Cheney|Worst|Rice|Worse                       
5  Z|Backwards Z|Y|Backwards Y|X|Backwards X                


Instead of eliminating rows where the item counts don't agree
we can insert the string 'Unmatched' for those id values.

select 
 //Instead of table JR we access the table in sql server with a t-sql query.
 //In D4 the query does not return an sql 'resultset' but instead is treated
 //as a table 'variable' which is of a nature fundamentally different than sql.
 SQLQuery("SELECT id,People,Position FROM JR") 
 //Store the restriction on count as a boolean value (T/F) in column TestCnt.
  add{People.Split({'|'}).Count()=Position.Split({'|'}).Count() TestCnt}
                                with {IgnoreUnsupported = 'true'}
   add
      {
 //If the TestCnt is true concatenate the rows, else return a nil (null) value.
 //A case statement just like the sql case statement could also be used.
  if TestCnt
      then
       Concat({People_Position,Del}
        from
         (
          (ToTable(People.Split({'|'}),'Str1','Seq'){Str1.Trim() Str1,Seq} )       
           join
          (ToTable(Position.Split({'|'}),'Str2','Seq'){Str2.Trim() Str2,Seq} )
            {Seq,Str1,Str2,'|' Del,(Str1+'|'+Str2) People_Position}  
         )
                order by {Seq})
                else
                  nil
                    as String
        People_Position
      }
 //Insert the unmatched string for a nil value of the People_Position column.     
        {id,IfNil(People_Position,'*** Unmatched Strings ***') Field} 
          order by {id};
          
id Field                                                     
-- --------------------------------------------------------- 
1  joe|Manager|sam|Team Lead|pete|Worker Ant|mike|Worker Ant 
2  A|JobA|C|JobC|T|JobT                                      
3  *** Unmatched Strings ***                                 
4  Bush|Worser|Cheney|Worst|Rice|Worse                       
5  Z|Backwards Z|Y|Backwards Y|X|Backwards X                 


D4 allows all the expressive power you need to easily test for
any criteria. Here we add some additional rows to table JR that
should be eliminated.

insert
table
{
//Digits in Postion.
row{6 id,'joe|sam|pete|mike' People,'Manager|Team Lead|Worker Ant1|Worker Ant2' Position},
//Digits in People and Position
row{7 id,'joe|sam|pete1|mike' People,'Manager|Team Lead|Worker Ant1|Worker Ant' Position}
} into JR;

select JR;

id People                Position                                  
-- --------------------- ----------------------------------------- 
1  joe|sam|pete|mike     Manager|Team Lead|Worker Ant|Worker Ant   
2  A| C|  T               JobA|JobC |JobT                          
3  A| C                   JobA | JobB |JobC                        
4  Bush | Cheney | Rice   Worser|Worst|Worse                       
5  Z |Y |X                Backwards Z|Backwards Y |Backwards X     
6  joe|sam|pete|mike     Manager|Team Lead|Worker Ant1|Worker Ant2 
7  joe|sam|pete1|mike    Manager|Team Lead|Worker Ant1|Worker Ant

We add some additional restrictions to the where statement to eliminate
rows with strings that contain any non letters.

select 
 JR
  where 
    (
     ( People.Split({'|'}).Count()=Position.Split({'|'}).Count() )
     //We also test that each item is only letters (no digits or other junk).
     //This can be done any numbers of ways. Here we just compare the count
     //of all items to the count of items eliminated by characters other than 
     //letters.
       and
       (
        People.Split({'|'}).Count()
        =
        Count(ToTable(People.Split({'|'}),'F1Str','F1Seq') 
        //Test that items are made up of letters only. To eliminate
        //blanks contaiminating the test we remove blank character before
        //testing.
          where IsLetter(Replace(Trim(F1Str),' ','')))
       )
       and
       (
        Position.Split({'|'}).Count()
        =
        Count(ToTable(Position.Split({'|'}),'F2Str','F2Seq') 
          where IsLetter(Replace(Trim(F2Str),' ','')))
       )   
     )  with {IgnoreUnsupported = 'true'}        
      add
      {
       Concat({People_Position,Del}
        from
         (
          (ToTable(People.Split({'|'}),'Str1','Seq'){Str1.Trim() Str1,Seq} )       
           join
          (ToTable(Position.Split({'|'}),'Str2','Seq'){Str2.Trim() Str2,Seq} )
            {Seq,Str1,Str2,'|' Del,(Str1+'|'+Str2) People_Position}  
         )
                order by {Seq})
        People_Position
       }
        {id,People_Position}
          order by {id};

id People_Position                                           
-- --------------------------------------------------------- 
1  joe|Manager|sam|Team Lead|pete|Worker Ant|mike|Worker Ant 
2  A|JobA|C|JobC|T|JobT                                      
4  Bush|Worser|Cheney|Worst|Rice|Worse                       
5  Z|Backwards Z|Y|Backwards Y|X|Backwards X                

Thanks for stopping by ☺ .
Wednesday, October 17, 2007

Dataphor - Merging strings together

1 comment:

Beyond Sql

Blog Archive

Links

About Me