I receive several CSV files each quarter to process through an SSIS package, and the formatting is inconsistent. This quarter, I encountered what I term optional text qualifiers--text qualifer of double quote (") used at times, but not always. We appear not to have any sway over the people who provide this data to convince them to provide clean, consistent data.
Example of the scenario.
Bob,7823984234,"Lions, Tigers, Bears",This
Fred,67961654,"Little,Mermaid, ",That
George,4568,"Things,Stuff",The Other Thing
Barney,324578,"Items,",Something Else
Al,2q36789472,"And More",Another
Update: Here is what I want to convert the data to.
"Bob","7823984234","Lions, Tigers, Bears","This"
"Fred","67961654","Little,Mermaid, ","That"
"George","4568","Things,Stuff","The Other Thing"
"Barney","324578","Items,","Something Else"
"Al","2q36789472","And More","Another"
I used these steps in a text editor to cleanse the data manually. I suspect a Python script would be the fastest way to do this in a script, and create a good opportunity for me to dive into Python.
Step 1
Search for ÿ - A character not found in the data file
Step 2
Regex Replace this: ("[^"\r\n]*),([^"\r\n]*")
With this: $1ÿ$2
Repeat until no more hits
Step 3
Regex Replace this: "([^"\r\n]*)"
With this: $1
Step 4
Replace this: ("[^"])
With this: "$1
Step 5
Replace this: ,
With this: ","
Step 6
Regex Replace this: ^
With this: "
Step 7
Regex Replace this: $
With this: "
Step 8
Replace this: ÿ
With this: ,
Options
Should I try to resolve this in SSIS in a Script Task?
Cleanse the data in a script outside of the SSIS package?
Look at a Third Party SSIS toolset like Pragmatic Works or Cozyroc?
Some other better approach?