Thursday, August 16, 2012

LIKE is not always enough

Let´s imagine a situation in which we have, as input, a column s.namelist from a table s containing a comma-separated list of values:


This could be, for example, a dump from a CSV file or an Oracle External Table, inside a database staging area.

Now suppose we have to perform a filtering operation basing on the corresponding value of another column,; in particular, we would like to identify -and insulate- all the names from our comma-list, whose name is the same in our corresponding column. So, if we have a row like:

  ... | Smith | ... | Sara, Karl, John, Smith, Bill, Smith, Hoppen,... | ....

we wanna obtain:

 | Smith | Smith, Smith |

or just the entire name list.

(this doesn´t seem having much sense, but a similar situation could easly occour in many DWH or data integration scenarios).

Starting with a simple query statement (in this case T-SQL under MSSQL) like this:

       , a.namelist  LIKE ( '%|' + UPPER(RTRIM(LTRIM(
  +'|%' ) )

Doesn´t work correctly. What happen if we have two names like "Sarah" and "Sara", while filtering for "Sara"? With the LIKE operator, we would catch also "Sarah".

The solution is to clearly break and delimiatet the comma list names before performing any additional filtering/denormalization operation, through the use of the REPLACE operator

       , '|' + REPLACE(UPPER(LTRIM(RTRIM(a.namelist))),' ', '|') + '|'  LIKE ( '%|' + UPPER(RTRIM(LTRIM(
  +'|%' ) )

We will then obtain a clear, delimited list of names as |Sara|Sarah|Pippo|... in which we can easly apply the LIKE operator with the name explicitely delimited by the "|" as input

The RTRIM/LTRIM, and the UPPER operators will help us on avoiiding the classical data quality problems involving fields coming from source systems as names, addresses, etc...

No comments:

Post a Comment