Validate CSV file - C#

I have a webpage that is used to submit a CSV file to the server. I have to validate the file, for things like the correct number of columns, correct data types, cross-field validations, data-range validations, etc. Finally, it either shows a success message or returns a CSV with error messages and line numbers.
Currently I loop through every row and every column to find all the errors in the CSV file, but this becomes very slow for bigger files, sometimes resulting in a server time-out. Can someone please suggest a better way to do this?
Thanks

To validate a CSV file you will almost certainly need to check each column. The best approach, if it is possible in your scenario, is to validate each entry as it is appended to the CSV file.
Edit
As accolaum pointed out an error, I have edited my code.
It will only work provided each row is delimited with a `\n`.
If you only want to validate the number of columns, it is easier: just check that each row splits into the expected number of fields.
bool fileIsValid;
string data = streamReader.ReadLine();
while (data != null)
{
    // a valid row splits into exactly the expected number of fields
    if (data.Split(',').Length == numOfColumns)
    {
        fileIsValid = true;
        // perform operation
    }
    else
    {
        fileIsValid = false;
        // perform operation
    }
    data = streamReader.ReadLine();
}
Hope it helps

I would suggest a rule-based approach, similar to unit tests. Think of every error that can possibly occur and order them by increasing abstraction level:
Correct file encoding
Correct number of lines/columns
Correct column headers
Correct number/text/date formats
Correct number ranges
Business rules
...
These rules could also have automatic fixes. So if you could automatically detect the encoding, you could correct it before testing all the rules.
Implementation could be done using the command pattern:
public abstract class RuleBase
{
    // returns true when the file passes this rule
    public abstract bool Test();

    // override when the rule can repair the problem automatically
    public virtual bool CanCorrect()
    {
        return false;
    }
}
Then create a subclass for each test you want to make and put them in a list.
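For illustration, here is a minimal sketch of one such rule; the class name, the line source and the column count are hypothetical, not from the question:

// Hypothetical rule: every line must split into the expected number of fields.
public class ColumnCountRule : RuleBase
{
    private readonly string[] lines;
    private readonly int expectedColumns;

    public ColumnCountRule(string[] lines, int expectedColumns)
    {
        this.lines = lines;
        this.expectedColumns = expectedColumns;
    }

    public override bool Test()
    {
        foreach (string line in lines)
        {
            if (line.Split(',').Length != expectedColumns)
                return false;
        }
        return true;
    }
}

The list of rules can then be run in order of abstraction level, stopping (or auto-correcting via CanCorrect) at the first level that fails.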
The timeout can be overcome by using a background thread dedicated to testing incoming files. The user has to wait until his file is validated and becomes "active". When finished, you can forward him to the next page.

You may be able to optimize your code to perform faster, but what you really want to do is spawn a worker thread to do the processing.
Two benefits of this:
You can redirect the user to another page so that they know their request has been submitted.
The worker thread can be given a callback so that it can report its status - if you want to, you could put a progress bar or a percentage on the 'submitted' page so that the user can watch as their file is processed.
It is not good design to have the user waiting for long-running processes to complete - they should be given updates or notifications, rather than just a 'loading' icon in their browser.
edit: This is my answer because (1) I can't recommend code improvements without seeing your code, and (2) efficiency improvements are probably only going to yield incremental improvements (unless you are doing something really wrong), which won't solve your problem long term.
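As a rough sketch of the worker-thread idea using the Task API (ValidateCsv and ProgressStore are hypothetical placeholders; a real web app would also need to track progress per upload):

    // Kick off validation on a background thread and record progress
    // somewhere the 'submitted' page can poll.
    public void BeginValidation(string uploadedFilePath)
    {
        var progress = new Progress<int>(percent => ProgressStore.Set(uploadedFilePath, percent));
        Task.Run(() => ValidateCsv(uploadedFilePath, progress));
    }

    private void ValidateCsv(string path, IProgress<int> progress)
    {
        string[] lines = File.ReadAllLines(path);
        for (int i = 0; i < lines.Length; i++)
        {
            // ...validate lines[i] here...
            if (i % 1000 == 0)
                progress.Report((int)(100L * i / lines.Length));
        }
        progress.Report(100);
    }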

Validation of CSV data almost always needs to look at every single cell. Can you post some of your code? There may be ways to optimise it.
EDIT
In most cases this is the best solution:
foreach (var row in rows)
{
    foreach (var cell in row.Split(','))
    {
        // validate cell
    }
}
If you were really keen, you could try something with regexes:
foreach (var row in rows)
{
    // validate row against a regex
}
but then you are really just offloading the validation code from yourself to the regex, and I really hate using regexes.
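For illustration, a minimal sketch of per-row regex validation; the three-field "id,name,ISO date" row format is hypothetical:

    using System.Text.RegularExpressions;

    // Hypothetical row format, e.g. "42,Smith,2010-05-01"
    Regex rowPattern = new Regex(@"^\d+,[^,]+,\d{4}-\d{2}-\d{2}$");
    bool valid = rowPattern.IsMatch("42,Smith,2010-05-01"); // true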

You could use XmlReader and validate against an XSD.
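For illustration, a minimal sketch of that idea; it assumes the CSV has first been converted to XML, and the file names are hypothetical:

    using System;
    using System.Xml;

    var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
    settings.Schemas.Add(null, "rows.xsd"); // hypothetical schema describing the data
    settings.ValidationEventHandler += (s, e) =>
        Console.WriteLine("Line " + e.Exception.LineNumber + ": " + e.Message);

    using (XmlReader reader = XmlReader.Create("rows.xml", settings))
    {
        while (reader.Read()) { } // reading the document triggers validation events
    }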

Related

C# Weird behavior after error handling in StreamWriter "using" block

I have C# code that collects data from CSV files and writes it out in a certain way to another CSV file. The user has an option to sort or not sort (sorting does some other stuff too), based on the state of a checkbox called "sortBox." Most of the work happens inside a using block for a StreamWriter. Within the block, there is a step where it writes to a dictionary.
There is a corner case where sorting makes the dictionary think there is a duplicate value, and an error occurs. The corner case is not worth fixing, and since not sorting is not a big deal, I am trying to offer an option, when the dictionary error occurs, to uncheck sortBox, go back to the start of the method, and write out the unsorted data. Note that the output file has one of two names, depending on whether the data are sorted or not. I am fine with having both files, the sorted one never getting written to due to the error, and the unsorted one getting written to.
This is a compact version of my lame attempt. The recursive call might be a dumb idea -- and maybe the reason for the odd behavior I will describe -- but I do not know what other approach to take.
void CollectAndStreamStats(string readFolder, string writeFolder)
{
    //<stuff>
    try
    {
        using (StreamWriter csvOut = new StreamWriter(fullPath))
        {
            //<stuff>
            if (sortBox.Checked)
            {
                //<stuff>
                try
                {
                    resultsDictionary.Add(keyString, str); // this is what I am error checking
                }
                catch (ArgumentException) // cannot add to dictionary; unchecking sort box should solve it
                {
                    DialogResult d = MessageBox.Show(
                        "You're screwed, do you want to uncheck sort option and analyze again?",
                        "Sucker", MessageBoxButtons.YesNo /*, other options */);
                    if (d == DialogResult.Yes)
                    {
                        csvOut.Close();          // close stream created by StreamWriter
                        sortBox.Checked = false; // uncheck the box, should work OK now (corner case thing)
                        // Run this same method all over again, but with the check box unchecked.
                        // Note that csvOut will have a different name and streaming will be to a new file.
                        CollectAndStreamStats(csvReadFolder, csvWriteFolder);
                    }
                }
                // **Loop for csvOut.WriteLine goes here if sortBox is checked.
                //   It happens elsewhere if it is unchecked.**
            }
        }
    }
    catch (IOException)
    {
        MessageBox.Show("Well, I tried ::shrug::");
    }
}
Well, some of it works. It unchecks sortBox -- local variables confirm this, as does the UI -- compiles the unsorted data, and creates and writes to the second (unsorted) file. But then, despite sortBox.Checked being false, it enters the "if (sortBox.Checked)" branch, decides that the file name is the original sorted one again, and tries to write to it, only to throw an error saying it cannot write to a closed stream.
No luck with online searches. There must be a right way, any thoughts?
Thanks much in advance,
Aram

Looking for an efficient way to build and parse string without GC

I am trying to figure out if there is a more efficient way than what I'm doing now to build up a message coming in on a serial port and validate it is the right message before I parse it. A complete message starts with a $ and ends with a CR/LF. I use an event handler to get the characters as they show up at the serial port so the message will not necessarily come in as one complete block. Just to confuse things, there are a bunch of other messages that come in on the serial port that don't necessarily start with a $ or end with a CR/LF. I want to see those but not parse them. I understand that concatenating strings is probably not a good idea so I use a StringBuilder to build the message then I use a couple of .ToString() calls to make sure I've got the right message to parse. Do the .ToString calls generate much garbage? Is there a better way?
I'm not a particularly experienced programmer so thanks for the help.
private void SetText(string text)
{
    //This is the original approach
    //this.rtbIncoming.Text += text;

    //First post the raw data to the console rtb
    rtbIncoming.AppendText(text);

    //Now clean up the text and only post messages to the CPFMessages rtb
    //that start with a $ and end with a LF
    incomingMessage.Append(text);

    //Make sure the message starts with a $: drop anything before the first $
    int stxIndex = incomingMessage.ToString().IndexOf('$');
    if (stxIndex > 0)
        incomingMessage.Remove(0, stxIndex);

    //If the message is terminated with a LF: 1) post it to the CPFMessage textbox,
    //                                        2) remove it from incomingMessage,
    //                                        3) parse and display fields
    int etxIndex = incomingMessage.ToString().IndexOf('\n');
    if (etxIndex >= 0)
    {
        rtbCPFMessages.AppendText(incomingMessage.ToString(0, etxIndex));
        incomingMessage.Remove(0, etxIndex);
        parseCPFMessage();
    }
}
Do the .ToString calls generate much garbage?
Every time you call ToString(), you get a new String object instance. Whether that's "much garbage" depends on your definition of "much garbage" and what you do with those instances.
Is there a better way?
You can inspect the contents of StringBuilder directly, but you'll have to write your own methods to do that. You could use state-machine-based techniques to monitor the stream of data.
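For illustration, a minimal sketch of such a method; it is a hypothetical helper, not part of the framework:

    using System.Text;

    // Find a character by scanning the StringBuilder's indexer directly,
    // so no intermediate string is allocated.
    static int IndexOf(StringBuilder sb, char value)
    {
        for (int i = 0; i < sb.Length; i++)
        {
            if (sb[i] == value)
                return i;
        }
        return -1;
    }

With that, SetText could call IndexOf(incomingMessage, '$') and IndexOf(incomingMessage, '\n') in place of both ToString()/IndexOf pairs.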
Whether any of that would be "better" than your current implementation depends on a number of factors, including but not limited to:
Are you seeing a specific performance issue now?
If so, what specific performance goal are you trying to achieve?
What other overhead exists in your code?
The first question above is very important. Your first priority should be code that works. If your code is working now, and does not have a specific performance issue that you know you need to solve, then you can safely ignore the GC issues for now. .NET's GC system is designed to perform well in scenarios just like this one, and usually will. Only in unusual situations would you need to do extra work to solve a performance problem here.
Without a good, minimal, complete code example that clearly illustrates the above and any other relevant issues, it would not be possible to say with any specificity whether there is in fact "a better way". If the above answers don't provide the information you're looking for, consider improving your question so that it is not so broad.

Efficient Methods of Comparing Text Files Simultaneously

I did check to see if any existing questions matched mine, but I didn't see any; if I missed one, my mistake.
I have two text files to compare against each other: one is a temporary log file that is overwritten sometimes, and the other is a permanent log, which collects and appends the contents of the temp log into one file (it collects the new lines since it last checked and appends them to the end of the complete log). However, after a point this may lead to the complete log becoming quite large, and therefore not so efficient to compare against, so I have been thinking about different methods to approach this.
My first idea is to "buffer" the temp log lines (it will normally be the smaller of the two) into a list and simply loop through the archive log and do something like:
// buffer the (smaller) temp log into a list first
List<string> bufferedLines = new List<string>(File.ReadAllLines(tempPath));
using (StreamReader archiveStream = new StreamReader(ArchivePath))
{
    string archiveLine;
    while ((archiveLine = archiveStream.ReadLine()) != null)
    {
        if (bufferedLines.Contains(archiveLine))
        {
            // this line is already in the archive
        }
    }
}
Now there are a couple of ways I could proceed from here. I could create yet another list to store the inconsistencies, close the read stream (I'm not sure you can both read and write at the same time; if you can, that might make things easier for my options), then open a write stream in append mode and write the list to the file. Alternatively, cutting out buffering the inconsistencies, I could open a write stream while the files are being compared and write the unmatched lines on the spot.
The other method I could think of was limited by my knowledge of whether it could be done or not, which was, rather than buffering either file, to compare the streams side by side as they are read and append the lines on the fly. Something like:
using (StreamReader archiveStream = new StreamReader(ArchivePath))
using (StreamReader tempLogStream = new StreamReader(tempPath))
{
    string tempLine;
    while ((tempLine = tempLogStream.ReadLine()) != null)
    {
        // somehow check whether the archive stream contains tempLine,
        // and if it doesn't, write the line to the file
        // (this is the part I'm not sure can be done with two live streams)
    }
}
As I said, I'm not sure whether that would work or whether it would be more efficient than the first method, so I figured I'd ask and see if anyone had insight into how this might properly be implemented, and whether it is the most efficient way or there is a better method out there.
Effectively what you want here is all of the items from one set that aren't in another set. This is set subtraction, or in LINQ terms, Except. If your data sets were sufficiently small you could simply do this:
var lines = File.ReadLines(tempPath)
    .Except(File.ReadLines(ArchivePath))
    .ToList(); // can't write to the file while reading from it
File.AppendAllLines(ArchivePath, lines);
Of course, this code requires bringing all of the lines of the archive file into memory, because that's just how Except is implemented: it creates a HashSet of the items in the second sequence so that it can efficiently find matches from the first.
Presumably the number of lines that need to be added here is pretty small, so the fact that the lines we find all need to be stored in memory isn't a problem. If there will potentially be a lot, you'd want to write them out to another file besides the first one (possibly concatenating the two files together when done, if needed).
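If the archive file itself is too large to hold in memory, one rough alternative (a sketch, not tested) is to build the set from the smaller temp file instead; note this collapses duplicate lines and does not preserve their order:

    var newLines = new HashSet<string>(File.ReadLines(tempPath));
    foreach (var line in File.ReadLines(ArchivePath))
    {
        // discard every temp line that has already been archived
        newLines.Remove(line);
    }
    File.AppendAllLines(ArchivePath, newLines);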

Fastest way to draw a large text file in C# winforms

I have a large text file (~100 MB) whose lines I keep in a list of strings.
My WinForm occasionally needs to show a part of it, for example 500,000 lines.
I have tried using a ListBox, a RichTextBox and a TextBox, but the drawing takes too much time.
For example, a TextBox takes 25 seconds to show 500,000 lines,
whereas Notepad opens a text file of this size immediately.
What would be the fastest solution for this purpose?
Why not open a file stream and just read the first few lines? You can use seek as the user scrolls in the file and display the appropriate lines. The point is, reading the whole file into memory takes too long, so don't do that!
Starter Code
The following is a short code snippet that isn't complete but it should at least get you started:
// estimate the average line length in bytes somehow:
int averageLineLengthBytes = 100;
// also need to store the current scroll location in "lines"
int currentScroll = 0;
// note: FileStream needs a FileMode; the original snippet omitted it
using (var reader = new StreamReader(new FileStream(fileName, FileMode.Open, FileAccess.Read)))
{
    if (reader.BaseStream.CanSeek)
    {
        // seek the location to read:
        reader.BaseStream.Seek(averageLineLengthBytes * currentScroll, SeekOrigin.Begin);
        reader.DiscardBufferedData(); // resync the reader after seeking the stream
        reader.ReadLine();            // the seek usually lands mid-line; discard the partial line
        // read the next few lines using this command
        reader.ReadLine();
    }
    else
    {
        // revert to a slower implementation here!
    }
}
The biggest trick is going to be estimating how long the scroll bar needs to be (how many lines are in the file). For that you will either have to adjust the scroll bar as the user scrolls, or use prior knowledge of how long typical lines in this file are and estimate the length based on the total number of bytes. Either way, hope this helps!
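For example, a rough estimate using the averageLineLengthBytes guess from the snippet above:

    // Rough scroll-bar length: total file size divided by average line length.
    long fileBytes = new FileInfo(fileName).Length;
    int estimatedLineCount = (int)(fileBytes / averageLineLengthBytes);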
A Note About Virtual Mode
Virtual mode is a method of using a ListBox or similar list control to load the items on an as needed basis. The control will execute a callback to retrieve the items based on an index when the user scrolls within the control. This is a viable solution only if your data meets the following criteria:
You must know (up front) the number of data items that you wish to present. If you need to read the entire file to get this total, it isn't going to work for you!
You must be able to retrieve a specific data item based on an index for that item, without reading the entire file.
You must be willing to present the data in an icon, small details, details or other supported format (or be willing to go to a ton of extra work to write a custom list view).
If you cannot meet these criteria, then virtual mode is not going to be particularly helpful. The answer I presented with seek will work regardless of whether or not you can perform these actions. Of course, if you can meet these minimum criteria, then by all means - look up virtual mode for list views and you should find some really useful information!
ListView has a VirtualMode property. It allows you to load only the data that is in view, using the RetrieveVirtualItem event. So when that event is triggered for item number 40,000, for example, you would perform a seek on the file and read in the line.
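For illustration, a minimal sketch of wiring that up; GetLine is a hypothetical helper that seeks into the file and reads the one requested line:

    var listView = new ListView
    {
        View = View.Details,
        VirtualMode = true,
        VirtualListSize = estimatedLineCount // must be known (or estimated) up front
    };
    listView.Columns.Add("Line", 600);
    listView.RetrieveVirtualItem += (sender, e) =>
    {
        // called only for rows scrolled into view
        e.Item = new ListViewItem(GetLine(e.ItemIndex));
    };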
You can also find an example of a virtual list box from Microsoft. It's really old, but it gives you a basic idea.

C# Datatype for large sorted collection with position?

I am trying to compare two large datasets from a SQL query. Right now the SQL query is done externally and the results from each dataset are saved into their own CSV files. My little C# console application loads the two text/CSV files and compares them for differences, saving the differences to a text file.
It's a very simple application that just loads all the data from the first file into an ArrayList and does a .compare() on the ArrayList as each line is read from the second CSV file. Then it saves the records that don't match.
The application works, but I would like to improve the performance. I figure I can greatly improve performance if I can take advantage of the fact that both files are sorted, but I don't know of a datatype in C# that keeps order and would allow me to select a specific position. There's a basic array, but I don't know how many items are going to be in each list. I could have over a million records. Is there a data type available that I should be looking at?
If the data in both of your CSV files is already sorted and both have the same number of records, you could skip the data structure entirely and do in-place analysis:
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
string lineOne;
string lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
while (!one.EndOfStream)
{
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();
    // do your comparison.
    bool areDifferent = true;
    if (areDifferent)
        differences.WriteLine(lineOne + lineTwo);
}
one.Close();
two.Close();
differences.Close();
System.Collections.Specialized.StringCollection allows you to add a range of values and, using the .IndexOf(string) method, allows you to retrieve the index of that item.
That being said, you could likely just load up a couple of byte[] from a FileStream and do a byte comparison... don't even worry about loading that stuff into a formal data structure like StringCollection or string[]; if all you're doing is checking for differences, and you want speed, I would reckon byte differences are where it's at.
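For illustration, a minimal sketch of that byte-level check; it only answers "are the files identical?" rather than locating the differences, and the paths are hypothetical:

    using System.IO;
    using System.Linq;

    byte[] a = File.ReadAllBytes(@"C:\file1.csv");
    byte[] b = File.ReadAllBytes(@"C:\file2.csv");
    // SequenceEqual compares lengths first, then byte by byte.
    bool identical = a.SequenceEqual(b);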
This is an adaptation of David Sokol's code to work with a varying number of lines, outputting the lines that are in one file but not the other:
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
string lineOne;
string lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
lineOne = one.ReadLine();
lineTwo = two.ReadLine();
// ReadLine() returns null at end of stream, which the checks below rely on
while (lineOne != null || lineTwo != null)
{
    if (lineOne == lineTwo)
    {
        // lines match, read next line from each and continue
        lineOne = one.ReadLine();
        lineTwo = two.ReadLine();
        continue;
    }
    if (lineOne != null && (lineTwo == null || string.Compare(lineOne, lineTwo) < 0))
    {
        // line exists only in file one
        differences.WriteLine(lineOne);
        lineOne = one.ReadLine();
    }
    else
    {
        // line exists only in file two
        differences.WriteLine(lineTwo);
        lineTwo = two.ReadLine();
    }
}
Standard caveat about code written off the top of my head applies -- you may need to special-case running out of lines in one while the other still has lines, but I think this basic approach should do what you're looking for.
Well, there are several approaches that would work. You could write your own data structure to do this, or you could try using SortedList. You can also return the DataSets in code and then use .Select() on the table; granted, you would have to do this on both tables.
You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.
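For illustration, a minimal sketch using the generic SortedList; the value type is unused here, so a placeholder byte works, and the path is hypothetical:

    var index = new SortedList<string, byte>();
    foreach (string line in File.ReadAllLines(@"C:\file1.csv"))
    {
        index[line] = 0; // indexer upsert avoids duplicate-key exceptions
    }
    // ContainsKey is a binary search over the sorted keys: O(log n) per lookup.
    bool seen = index.ContainsKey("some,line,to,check");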
If you are looking simply to see if all lines in FileA are included in FileB, you could read both in and just compare the streams inside a loop.
File 1
Entry1
Entry2
Entry3
File 2
Entry1
Entry3
You could loop through with two counters and find omissions, going line by line through each file and see if you get what you need.
Maybe I misunderstand, but an ArrayList will maintain its elements in the order in which you added them. This means you can compare the two ArrayLists in a single pass - just increment the two scanning indices according to the comparison results.
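For illustration, a minimal sketch of that single-pass walk; listA and listB are hypothetical, already-sorted List<string> instances:

    int i = 0, j = 0;
    var differences = new List<string>();
    while (i < listA.Count && j < listB.Count)
    {
        int cmp = string.Compare(listA[i], listB[j]);
        if (cmp == 0) { i++; j++; }                    // match: advance both
        else if (cmp < 0) differences.Add(listA[i++]); // only in listA
        else differences.Add(listB[j++]);              // only in listB
    }
    // whatever remains in either list is unmatched
    while (i < listA.Count) differences.Add(listA[i++]);
    while (j < listB.Count) differences.Add(listB[j++]);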
One question I have is: have you considered "out-sourcing" your comparison? There are plenty of good diff tools that you could just call out to. I'd be surprised if there wasn't one that lets you specify two files and get only the differences. Just a thought.
I think the reason everyone has so many different answers is that you haven't quite specified your problem well enough to be answered. First off, it depends what kind of differences you want to track. Do you want the differences to be output like in WinDiff, where the first file is the "original" and the second file is the "modified", so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?
I think the answers to these questions will help everyone to give you a suitable answer to your problem.
If you have two files that are each a million lines, as mentioned in your post, you might be using up a lot of memory. Some of the performance problem might be that you are swapping from disk. If you are simply comparing line 1 of file A to line 1 of file B, line 2 of file A to line 2 of file B, etc., I would recommend a technique that does not store so much in memory. You could either read and write off of two file streams as a previous commenter posted, writing out your results "in real time" as you find them; this would not explicitly store anything in memory. You could also dump chunks of each file into memory, say one thousand lines at a time, into something like a List<string>. This could be fine-tuned to meet your needs.
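For illustration, a minimal sketch of that chunked reading; ProcessChunk is a hypothetical stand-in for whatever comparison you do against the other file:

    const int ChunkSize = 1000;
    using (var reader = new StreamReader(@"C:\file1.csv"))
    {
        var chunk = new List<string>(ChunkSize);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            chunk.Add(line);
            if (chunk.Count == ChunkSize)
            {
                ProcessChunk(chunk); // compare this block against the other file
                chunk.Clear();
            }
        }
        if (chunk.Count > 0)
            ProcessChunk(chunk); // leftover partial chunk
    }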
To resolve question #1, I'd recommend looking into creating a hash of each line. That way you can compare hashes quickly and easily using a dictionary.
To resolve question #2, one quick and dirty solution would be to use an IDictionary<string, string>, using itemId as your key and the rest of the line as your value. You can then quickly find whether an itemId exists and compare the lines. This of course assumes .NET 2.0+.
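For illustration, a minimal sketch of the itemId-keyed dictionary approach; the "id,rest-of-line" layout and the paths are hypothetical:

    // Build a lookup of the first file: itemId -> rest of line.
    var lookup = new Dictionary<string, string>();
    foreach (string line in File.ReadAllLines(@"C:\file1.csv"))
    {
        int comma = line.IndexOf(',');
        if (comma < 0) continue; // skip malformed lines
        lookup[line.Substring(0, comma)] = line.Substring(comma + 1);
    }
    // Stream the second file and compare record by record.
    foreach (string line in File.ReadAllLines(@"C:\file2.csv"))
    {
        int comma = line.IndexOf(',');
        if (comma < 0) continue;
        string id = line.Substring(0, comma);
        string existing;
        if (!lookup.TryGetValue(id, out existing))
            Console.WriteLine("RECORD IN FILE 2 AND NOT FILE 1: " + id);
        else if (existing != line.Substring(comma + 1))
            Console.WriteLine("RECORD CHANGED: " + id);
    }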
