Well, as the subject says, I'm facing an issue exporting my data to SQL Server.
I have a semicolon-delimited file, but some records also contain semicolons inside the text itself, for example:
ID;DESCRIPTION;VALUE
1;TEXT1;35
2;TEXT;2;45
3;TE;XT3;50
So as you can see I have some garbage that I would like to remove, since it is shifting the columns.
I have some ideas, like establishing an expected count of semicolons per line (2 in this case) and removing the extra ones.
In my case this always happens in one specific column, the Address/complement column, so I know exactly which column number it is.
I can't ask the people who provide this file to change it, since the system is an old one and they can't add qualifiers like double quotes or simply change the delimiter.
I know I could do this via a Script Task, but I have little programming knowledge, so I'm trying to find another way.
I'd like to stress that the problem is in the source file itself: by the time I configure the flat file connection the columns are already shifted, so I can't apply any treatment like a Derived Column. I have to fix the file itself before I load it in SSIS.
I've been searching forums for a few days and haven't found any similar questions or solutions, since most example files people post already have text qualifiers or something like that, so I'd really appreciate any help!
You mentioned you have little programming knowledge, but a script is the only solution that can handle delimiters in fields that are not enclosed. You are fortunate there is only a single problem field, as it wouldn't be possible to parse ambiguous delimiters without additional rules to determine where the actual fields begin and end.
As long as you are certain there is only one field with embedded delimiters, one method is a data flow source Script component. Below are the steps to create one:
Add a Script Component to the data flow and select Source for the type
Add the flat file connection manager to the script properties Connection Managers collection
Add each field as an output column under the script properties Inputs and Outputs
Edit the script source and replace the template CreateNewOutputRows() method with the version below.
See comments in the script indicating where customizations are needed for your actual file. This version will work with your sample file of 3 fields, with the second field having embedded delimiters.
public override void CreateNewOutputRows()
{
    const char FIELD_DELIMITER = ';';
    //*** change this to the zero-based index of the problem field
    const int BAD_FIELD_INDEX = 1;
    //*** change this to the connection added to the script component Connection Managers collection
    var filePath = this.Connections.FlatFileSource.ConnectionString;
    string record = "";
    using (var inputFile = new System.IO.StreamReader(filePath))
    {
        record = inputFile.ReadLine();
        if (record != null)
        {
            //count header record fields to get the expected field count for data records
            var headerFieldCount = record.Split(FIELD_DELIMITER).Length;
            while (record != null)
            {
                record = inputFile.ReadLine();
                if (record == null)
                {
                    break; //end of file
                }
                var fields = record.Split(FIELD_DELIMITER);
                var extraFieldCount = fields.Length - headerFieldCount;
                if (extraFieldCount < 0)
                {
                    //raise an error if there are fewer fields than we expect
                    throw new DataException(string.Format("Invalid record. {0} fields read, {1} fields in header.", fields.Length, headerFieldCount));
                }
                if (extraFieldCount > 0)
                {
                    var newFields = new string[headerFieldCount];
                    //copy preceding good fields
                    for (var i = 0; i < BAD_FIELD_INDEX; ++i)
                    {
                        newFields[i] = fields[i];
                    }
                    //combine segments of the bad field into a single field
                    var sourceFieldIndex = BAD_FIELD_INDEX;
                    var combinedField = new System.Text.StringBuilder();
                    while (sourceFieldIndex <= extraFieldCount + BAD_FIELD_INDEX)
                    {
                        combinedField.Append(fields[sourceFieldIndex]);
                        if (sourceFieldIndex < extraFieldCount + BAD_FIELD_INDEX)
                        {
                            combinedField.Append(FIELD_DELIMITER); //add the delimiter back to the field value
                        }
                        ++sourceFieldIndex;
                    }
                    newFields[BAD_FIELD_INDEX] = combinedField.ToString();
                    //copy subsequent good fields
                    var targetFieldIndex = BAD_FIELD_INDEX + 1;
                    while (sourceFieldIndex < fields.Length)
                    {
                        newFields[targetFieldIndex] = fields[sourceFieldIndex];
                        ++sourceFieldIndex;
                        ++targetFieldIndex;
                    }
                    fields = newFields;
                }
                //create the output record and copy fields
                this.Output0Buffer.AddRow();
                //*** change the code below to map source fields to the columns defined as script component output
                Output0Buffer.ID = fields[0];
                Output0Buffer.DESCRIPTION = fields[1];
                Output0Buffer.VALUE = fields[2];
            }
        }
    }
    this.Output0Buffer.SetEndOfRowset();
}
Another thing you can do is import the text file into a single column (varchar(max)) staging table, and then use TSQL to parse the records and import them to your final destination table.
Related
First off I'm quite new to both SSIS and C#, so apologies for any rookie mistakes. I am trying to muddle my way through splitting one column by a specific delimiter from an input file that will have a variable length header, and a footer.
For example, Input0Buffer has one column. The actual data is always preceded by a row starting with the phrase "STARTDATA", and is bracketed with a row starting with "ENDDATA".
The single input column contains 5 pieces of data separated by |; I don't care about two of those fields.
Basically the input file looks like this:
junkrow
headerstuff
morejunk
STARTDATA
ID1|rubbish|stuff|apple|cheese
ID2|badger|junk|pear|yoghurt
So far I have tried to get some row-by-row logic going in the C# transformer, which I think I am happy with - but I can't work out how to get it to output my split data. Code is below.
bool passedSOD;
bool passedEOD;

public void ProcessRow(Input0Buffer data)
{
    string Col1, Col2, Col3;
    if (data.Column0.StartsWith("ENDDATA"))
    {
        passedEOD = true;
    }
    if (passedSOD && !passedEOD)
    {
        var SplitData = data.Column0.Split('|');
        Col1 = SplitData[0];
        Col2 = SplitData[3];
        Col3 = SplitData[4];
        //error about Output0Buffer not existing in context
        Output0Buffer.AddRow();
        Output0Buffer.prodid = Col1;
        Output0Buffer.fruit = Col2;
        Output0Buffer.dairy = Col3;
    }
    if (data.Column0.StartsWith("STARTDATA"))
    {
        passedSOD = true;
    }
}
If I change the output to asynchronous it stops the error about Output0Buffer not existing in the current context, and it runs, but gives me 0 rows output - presumably because I need it to be synchronous to work through each row as I've set this up?
Any help much appreciated.
You can shorten your code by just checking whether the row contains a '|':
if (Row.Column0.Contains("|"))
{
    string[] cols = Row.Column0.Split('|');
    Output0Buffer.AddRow();
    Output0Buffer.prodid = cols[0];
    Output0Buffer.fruit = cols[3];
    Output0Buffer.dairy = cols[4];
}
Like Bill said. Make sure this is a transformation component and not a destination. Your options are source, transformation, and destination.
You might also want this as a different output. Otherwise, you will need to conditionally split out the "extra" rows.
Thanks both for answering - it is a transformation, and thank you for the shorter way; however, the header and footer are not well formatted and may also contain junk characters, so I daren't risk looking for | in those rows. But I will definitely store that away for processing a better-formatted file next time.
I got a reply outside this forum so I thought I should answer my own question in case any one else has a similar problem.
Note that:
it's a transform
the Output is set to SynchronousInputID = None in the Inputs and Outputs section of the Script Transformation Editor
my input is just called Input, and contains one column called RawData
my output is called GenOutput, and has three columns
although the input file only really has 5 fields, there is a trailing | at the end of each row so this counts as 6
Setting the synchronous to None means that Output0Buffer is now recognised in context.
The code that works for me is:
bool passedSOD;
bool passedEOD;

public override void Input_ProcessInputRow(InputBuffer Row)
{
    if (Row.RawData.Contains("ENDDATA"))
    {
        passedEOD = true;
        GenOutputBuffer.SetEndOfRowset();
    }
    //IF WE HAVE NOT PASSED THE END OF DATA, BUT HAVE PASSED THE START OF DATA, SPLIT THE ROW
    if (passedSOD && !passedEOD)
    {
        var SplitData = Row.RawData.Split('|');
        //ONLY PROCESS IF THE ROW CONTAINS THE RIGHT NUMBER OF ELEMENTS I.E. EXPECTED NUMBER OF DELIMITERS
        if (SplitData.Length == 6)
        {
            GenOutputBuffer.AddRow();
            GenOutputBuffer.prodid = SplitData[0];
            GenOutputBuffer.fruit = SplitData[3];
            GenOutputBuffer.dairy = SplitData[4];
        }
        //SILENTLY DROPPING ROWS THAT DO NOT HAVE THE RIGHT NUMBER OF ELEMENTS FOR NOW - COULD IMPROVE THIS LATER
    }
    if (Row.RawData.Contains("STARTDATA"))
    {
        passedSOD = true;
    }
}
Now I've just got to work out how to convert one of the other fields from string to a nullable decimal, and output a null if someone has dumped "N.A" in that field :D
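In case it helps anyone with the same follow-up, here is a minimal sketch of that conversion, assuming a decimal output column (called price here purely for illustration) has been added under Inputs and Outputs, and that the value sits in SplitData[2]; none of those names come from the original post.
decimal parsed;
if (decimal.TryParse(SplitData[2], out parsed))
{
    GenOutputBuffer.price = parsed;        // assumed decimal output column
}
else
{
    // SSIS buffers expose a <Column>_IsNull setter, so "N.A" (or anything
    // unparseable) can be written out as NULL instead of failing the row
    GenOutputBuffer.price_IsNull = true;
}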
I am working on an assignment that deals with file input and output. The instructions are as follows:
Write a program to update an inventory file. Each line of the inventory file will have a product number, a product name and a quantity separated by vertical bars. The transaction file will contain a product number and a change amount, which may be positive for an increase or negative for a decrease. Use the transaction file to update the inventory file, writing a new inventory file with the updated quantities. I have provided 2 input files to test your program with, as well as a sample output file so you can see what it should look like when you are done.
Hints:
This program requires 3 files
Initial Inventory File
File showing updates to be made
New Inventory File with changes completed
Use Lists to capture the data so you don’t have to worry about the number of items in the files
Each line of the Inventory file looks something like this:
123 | television | 17
I have also been given the basic structure and outline of the program:
class Program
{
    public class InventoryNode
    {
        // Create variables to hold the 3 elements of each item that you will read from the file
        // Make them all public
        public InventoryNode()
        {
            // Create a constructor that sets all 3 of the items to default values
        }
        public InventoryNode(int ID, string InvName, int Number)
        {
            // Create a constructor that sets all 3 of the items to values that are passed in
        }
        public override string ToString() // This one is a freebie
        {
            return IDNumber + " | " + Name + " | " + Quantity;
        }
    }

    static void Main(String[] args)
    {
        // Create variables to hold the 3 elements of each item that you will read from the file
        // Create variables for all 3 files (2 for READ, 1 for WRITE)
        List<InventoryNode> Inventory = new List<InventoryNode>();
        InventoryNode Item = null;
        // Create any other variables that you need to complete the work
        // Check for proper number of arguments
        // If there are not enough arguments, give an error message and return from the program
        // Otherwise
        //   Open Output File
        //   Open Inventory File (monitor for exceptions)
        //   Open Update File (monitor for exceptions)
        //   Read contents of Inventory into the Inventory List
        //   Read each item from the Update File and process the data
        //   Write output file
        // Close all files
        return;
    }
}
There are a lot of steps to this problem, but right now I am only really concerned with how to read the inventory file into a list. I have read files into arrays before, so I thought I could do that and then convert the array to a list, but I am not entirely sure how. Below is what I have created to add to the Main method of the structure above.
int ID;
string InvName;
int Number;
string line;

List<InventoryNode> Inventory = new List<InventoryNode>();
InventoryNode Item = null;

StreamReader f1 = new StreamReader(args[0]);
StreamReader f2 = new StreamReader(args[1]);
StreamWriter p = new StreamWriter(args[2]);

// Read each item from the Update File and process the data
while ((line = f1.ReadLine()) != null)
{
    string[] currentLine = line.Split('|');
    ID = Convert.ToInt16(currentLine[0]);
    InvName = currentLine[1];
    Number = Convert.ToInt16(currentLine[2]);
}
I am a bit hung up on the InventoryNode Item = null; line. I am not really sure what it is supposed to be doing. I really just want to read the file into an array so I can parse it and then pass that data to a list. Is there a way to do that that is similar to the block I have written? Maybe there is a simpler way; I am open to that, but I figured I'd show my train of thought.
There is no need to add everything to an array and then convert it to a list. InventoryNode Item = null is there to represent a line from the file.
You're pretty close. You just need to instantiate the InventoryNode and feed it the results of the split() method.
You're almost there. You already have fetched ID, InvName and Number, so you just have to instantiate the InventoryNode:
Item = new InventoryNode(...);
And then add Item to your list.
Note that InventoryNode Item = null; is not doing much; it just declares a variable that you can use later. This wasn't strictly necessary, as the variable could have been declared inside the loop instead.
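To make that concrete, here is a rough sketch of the read loop with the instantiation added, reusing the variable names from the question and assuming the three-argument constructor simply stores its parameters (the Trim calls handle the spaces around the | in the sample line):
while ((line = f1.ReadLine()) != null)
{
    string[] currentLine = line.Split('|');
    ID = Convert.ToInt16(currentLine[0].Trim());
    InvName = currentLine[1].Trim();
    Number = Convert.ToInt16(currentLine[2].Trim());
    // Build the node from the parsed values and add it to the list
    Item = new InventoryNode(ID, InvName, Number);
    Inventory.Add(Item);
}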
Some background about my code: I basically have a record class with a lot of properties to capture the required data, which is provided in the form of .txt files. Each column lives in its own separate file, e.g. Month.txt, Day.txt, with 600 rows of data in each.
Now, I have a second array which is basically a collection of the aforementioned class, and I give it a maximum size of 600 (as there are 600 rows of data). This class has an initialization method.
This way it initializes my records one column at a time, but I have to know the fixed number of rows to avoid an index-out-of-range exception. Also, I have a lot of properties, so the chain of "if else" statements makes the code look very redundant and hard on the eye. Flexibility is also an issue, as the pointer resets to 0, so when I want to add extra data I just end up overwriting the original 600.
Is there any way to improve this?
The code is suboptimal, because it checks the file name for each line in that file. It is better to decide what field to set before the loop, and then use that decision throughout the loop.
First, make a look-up table of setters based on the file prefix:
var setters = new Dictionary<string, Action<Record, string>> {
      ["Day"]   = (r, v) => r.Day = v
    , ["Month"] = (r, v) => r.Month = v
    , ...
};
With this look-up in place, the reading code becomes straightforward:
using (StreamReader R = new StreamReader(file.FullName)) {
    var pos = file.Name.IndexOf("_");
    Action<Record, string> fieldSetter;
    if (pos < 0 || !setters.TryGetValue(file.Name.Substring(0, pos), out fieldSetter)) {
        continue; // Go to next file
    }
    string temp;
    while ((temp = R.ReadLine()) != null) {
        fieldSetter(records[pointer++], temp);
    }
}
First, we look up the setter using the prefix from the file name up to the first underscore character '_'. Then we go through the lines from the file, and call that setter for each record passing the string that we got.
Adding new fields becomes simple, too, because all you need is to add a new line to the setters initializer.
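For completeness, here is a rough sketch of the surrounding pieces the snippet above assumes: a Record class with one string property per column file, the pre-allocated records array, and an outer loop over the files. The names mirror the answer's snippet; the dataFolder variable and the usual System.IO / System.Collections.Generic usings are assumptions added for illustration.
public class Record {
    public string Day { get; set; }
    public string Month { get; set; }
    // ... one property per column file
}

var records = new Record[600];
for (var i = 0; i < records.Length; i++) {
    records[i] = new Record();
}

foreach (var file in new DirectoryInfo(dataFolder).GetFiles("*.txt")) {
    var pointer = 0; // each file fills one field across all 600 records
    // ... the using (StreamReader R = ...) block from above goes here
}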
I am currently working on a project that converts a test from a "standard" Word format to a format that is accepted by the new Saras program we are using.
I have been able to parse the file and gather the information that I want. Unfortunately there is a problem with the formatting: I only know how to parse and insert plain text.
A few options I have found so far include using the .Copy() method and then the .Paste() method on the cell. I am using a class to hold the data in the interim, so that if another parsing algorithm is needed for the next person's "standard" format, the data can still be put into the Excel document the same way and the developer only needs to worry about parsing the new format.
NOTE: If there is a way to paste this into some holding format (like a stand-alone cell) and then set the cell or its value from that, that would be great.
Another option I have found is the Interop.Word.Range.FormattedText property. This would work well if I were going from one Word document to another, but I need to somehow get this formatted text over to an Excel cell.
My current thought process is to take the text and put it into an RTF object, then put it into Excel by inserting the RTF text into the cell and using the RTF formatting to format each character. This seems like more work than I should have to do.
Please let me know what you all come up with!
Here is some code:
STORAGE:
class Question
{
    public string text;
    private List<string> responses;

    public Question(string txt, List<string> r)
    {
        text = txt;
        responses = r;
    }

    public List<string> getResponses() { return responses; }
}
PARSING:
if (p < MAX && docs.Paragraphs[p].Range.ListFormat.ListValue != 0)
{
    // question main line
    qTxt = docs.Paragraphs[p].Range.Text.ToString();
    p++;
    // question pos responses NOTE: THIS WILL BE USED TO DO MULTIPLE CHOICE LATER
    for (; p < MAX && docs.Paragraphs[p].Range.ListFormat.ListValue == 0; p++)
    {
        tmp = docs.Paragraphs[p].Range.Text.ToString();
        if (tmp != null && tmp != "\r")
            qTxt += " \r\n " + docs.Paragraphs[p].Range.Text.ToString();
    }
}
SET THE VALUE:
foreach (Question question in questionList)
{
    ...
    // ITEM TEXT
    ws.Cells[row, 15].Value2 = question.text;
    ...
}
Again, this code works for setting plain text, I just want to know how to get the formatting in there as well.
Thanks in advance!
UPDATE:
I figured out a way to make the Copy and Paste work. Silly me, I can just keep the Excel document open the entire time. The constructor now sets up the first row, and the parseDocument function then fills in the rows with the data.
THE STRUGGLE:
I am currently using the Copy and Paste functions, but it seems to be putting an image of the text into my document rather than the formatted text itself.
CODE:
// Get the title
Word.Range rng = docs.Paragraphs[p].Range.Duplicate;
int index = 0;
for (; p < MAX && docs.Paragraphs[p].Range.ListFormat.ListValue == 0; p++)
{
    string tmp = docs.Paragraphs[p].Range.Text.ToString();
    if (tmp != null && tmp != "\r")
        ++index;
}
rng.MoveEnd(Word.WdUnits.wdParagraph, index);
rng.Copy();
xlDoc.ws.Range["A1", "A1"].PasteSpecial(); // Pastes an image of the title
Alas, I am looking for any possible way around this. Please let me know if you have any solutions.
Thanks!!
PS. I will keep updating this post if I make any progress.
I am trying to import a file with multiple record definitions in it. Each one can also have a header record, so I thought I would define a definition interface like so.
public interface IRecordDefinition<T>
{
    bool Matches(string row);
    T MapRow(string row);
    bool AreRecordsNested { get; }
    GenericLoadClass ToGenericLoad(T input);
}
I then created a concrete implementation for a class.
public class TestDefinition : IRecordDefinition<Test>
{
    public bool Matches(string row)
    {
        return row.Split('\t')[0] == "1";
    }

    public Test MapColumns(string[] columns)
    {
        return new Test { val = columns[0].parseDate("ddmmYYYY") };
    }

    public bool AreRecordsNested
    {
        get { return true; }
    }

    public GenericLoadClass ToGenericLoad(Test input)
    {
        return new GenericLoadClass { Value = input.val };
    }
}
However, for each file definition I need to store a list of the record definitions so I can then loop through each line in the file and process it accordingly.
Firstly, am I on the right track, or is there a better way to do it?
I would split this process into two pieces.
First, a specific process to split the file with multiple types into multiple files. If the files are fixed width, I have had a lot of luck with regular expressions. For example, assume the following is a text file with three different record types.
TE20110223 A 1
RE20110223 BB 2
CE20110223 CCC 3
You can see there is a pattern here; hopefully the person who decided to put all the record types in the same file gave you a way to identify those types. In the case above you would define three regular expressions.
string pattern1 = @"^TE(?<DATE>[0-9]{8})(?<NEXT1>.{2})(?<NEXT2>.{2})";
string pattern2 = @"^RE(?<DATE>[0-9]{8})(?<NEXT1>.{3})(?<NEXT2>.{2})";
string pattern3 = @"^CE(?<DATE>[0-9]{8})(?<NEXT1>.{4})(?<NEXT2>.{2})";

Regex Regex1 = new Regex(pattern1);
Regex Regex2 = new Regex(pattern2);
Regex Regex3 = new Regex(pattern3);

StringBuilder FirstStringBuilder = new StringBuilder();
StringBuilder SecondStringBuilder = new StringBuilder();
StringBuilder ThirdStringBuilder = new StringBuilder();

string Line = "";
Match LineMatch;
FileInfo myFile = new FileInfo("yourFile.txt");

using (StreamReader s = new StreamReader(myFile.FullName))
{
    while (s.Peek() != -1)
    {
        Line = s.ReadLine();
        LineMatch = Regex1.Match(Line);
        if (LineMatch.Success)
        {
            //Write this line to a new file
            FirstStringBuilder.AppendLine(Line);
        }
        LineMatch = Regex2.Match(Line);
        if (LineMatch.Success)
        {
            //Write this line to a new file
            SecondStringBuilder.AppendLine(Line);
        }
        LineMatch = Regex3.Match(Line);
        if (LineMatch.Success)
        {
            //Write this line to a new file
            ThirdStringBuilder.AppendLine(Line);
        }
    }
}
Next, take the split files and run them through a generic process, that you most likely already have, to import them. This works well because when the process inevitably fails, you can narrow it to the single record type that is failing and not impact all the record types. Archive the main text file along with the split files and your life will be much easier as well.
Dealing with these kinds of transmitted files is hard, because someone else controls them and you never know when they are going to change. Logging the original file as well as a receipt of the import is very important and shouldn't be overlooked either. You can make that as simple or as complex as you want, but I tend to write a receipt to a db and copy the primary key from that table into a foreign key in the table I have imported the data into, then never change that data. I like to keep an unmodified copy of the import on the file system as well as on the DB server, because there are inevitable conversion / transformation issues that you will need to track down.
Hope this helps, because this is not a trivial task. I think you are on the right track, but instead of processing/importing each line separately...write them to a separate file. I am assuming this is financial data, which is one of the reasons I think provability at every step is important.
I think the FileHelpers library solves a number of your problems:
Strong types
Delimited
Fixed-width
Record-by-Record operations
I'm sure you could consolidate this into a type hierarchy that could tie in custom binary formats as well.
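To give a feel for the library, here is a minimal sketch of a FileHelpers record class and engine, based on the tab-delimited row from the question; the class name, field names, and file name are assumptions, not part of the original answer.
using System;
using FileHelpers;

// Hypothetical record type for the tab-delimited rows in the question
[DelimitedRecord("\t")]
public class TestRecord
{
    public string RecordType;

    [FieldConverter(ConverterKind.Date, "ddMMyyyy")]
    public DateTime RecordDate;
}

public class Example
{
    public static void Main()
    {
        // The engine does the splitting and type conversion for every row
        var engine = new FileHelperEngine<TestRecord>();
        TestRecord[] records = engine.ReadFile("input.txt");
        Console.WriteLine(records.Length);
    }
}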
Have you looked at something using Linq? This is a quick example of Linq to Text and Linq to Csv.
I think it would be much simpler to use "yield return" and IEnumerable to get what you want working. This way you could probably get away with only having 1 method on your interface.
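As an illustration of that suggestion, here is a rough sketch of a lazy reader built on yield return; the class, method, and parameter names are assumptions, not from the answer above.
using System;
using System.Collections.Generic;
using System.IO;

public static class LazyRecordReader
{
    // Streams the file one line at a time and yields only the rows the
    // supplied predicate accepts, already mapped to the target type.
    public static IEnumerable<T> ReadRecords<T>(
        string path,
        Func<string, bool> matches,
        Func<string, T> mapRow)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (matches(line))
                    yield return mapRow(line);
            }
        }
    }
}
Usage with one of the record definitions from the question would then look something like:
foreach (var rec in LazyRecordReader.ReadRecords("input.txt", def.Matches, def.MapRow)) { ... }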