Can you represent CSV data in Google's Protocol Buffer format? - c#

I've recently found out about protocol buffers and was wondering if they could be applied to my specific problem.
Basically I have some CSV data that I need to convert to a more compact format for storage, as some of the files are several gigabytes.
Each field in the CSV has a header, and there are only two types, strings and decimals (because sometimes there are a lot of significant digits and I need to handle all numbers the same way). But each file will have different column names for each field.
As well as capturing the original CSV data, I need to be able to add extra information to the file before saving. And I was hoping to make this future-proof by handling different file versions.
So, is it possible to use protocol buffers to capture an arbitrary number of arbitrarily named columns of data, like a CSV file?

Well, it's certainly representable. Something like:
message CsvFile {
    repeated CsvHeader header = 1;
    repeated CsvRow row = 2;
}
message CsvHeader {
    required string name = 1;
    required ColumnType type = 2;
}
enum ColumnType {
    DECIMAL = 1;
    STRING = 2;
}
message CsvRow {
    repeated CsvValue value = 1;
}
// Note that the column is implicit based on position within row
message CsvValue {
    optional string string_value = 1;
    optional Decimal decimal_value = 2;
}
message Decimal {
    // However you want to represent it (there are various options here)
}
I'm not sure how much benefit it will provide, mind you... You can certainly add more information (add to the CsvFile message), and future-proofing is handled in the "normal PB way" - only add optional fields, etc.

Well, protobuf-net (my version) is based on regular .NET types, so no (since it won't cope with different schemas all the time). But Jon's version might allow dynamic types. Personally, I'd just use CSV and run it through GZipStream - I expect that will be fine for the purpose.
Edit: actually, I forgot: protobuf-net does support extensible objects, but you need to be a bit careful... it would depend on the full context, I expect.
Plus Jon's approach of nested data would probably work too.
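For what it's worth, here is a minimal sketch of the CSV-plus-GZipStream idea mentioned above (the file names are placeholders, and it assumes .NET 4 for Stream.CopyTo):
// needs: using System.IO; using System.IO.Compression;
// Compress an existing CSV file to .csv.gz
using (FileStream rawIn = File.OpenRead("data.csv"))
using (FileStream gzOut = File.Create("data.csv.gz"))
using (GZipStream gzip = new GZipStream(gzOut, CompressionMode.Compress))
{
    rawIn.CopyTo(gzip);
}
// Read it back transparently as text
using (FileStream gzIn = File.OpenRead("data.csv.gz"))
using (GZipStream gunzip = new GZipStream(gzIn, CompressionMode.Decompress))
using (StreamReader reader = new StreamReader(gunzip))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // process each CSV line as before
    }
}
The CSV stays human-readable once decompressed, and for mostly-textual data gzip often gets you much of the size win without any schema work.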

OutputBuffer not working for large c# list

I'm currently using SSIS to improve part of a project. I need to insert single documents into a MongoDB collection of type Time Series. At some point I want to output rows of data after they have gone through a C# transformation script. I did this:
foreach (BsonDocument bson in listBson)
{
    OutputBuffer.AddRow();
    OutputBuffer.DatalineX = (string) bson.GetValue("data");
}
But this piece of code, which works great with a small file, does not work with a 6 million line file. That is, there are no rows in the output. The tasks that follow validate, but react as if they had received nothing as input.
Where could the problem come from?
Your OutputBuffer has DatalineX defined as a string, either DT_STR or DT_WSTR, with a specific length. When you exceed that length, things go bad. For normal strings, you'd have a maximum length of 8,000 or 4,000 characters respectively.
Neither of which is useful for your use case of at least 6M characters. To handle that, you'll need to change your data type to DT_TEXT/DT_NTEXT. Those data types do not require a length, as they are "max" types. There are lots of things to be aware of when using the LOB types:
Performance can suck depending on whether SSIS can keep the data in memory (good) or has to write intermediate values to disk (bad)
You can't readily manipulate them in a data flow
You'll use a different syntax in a Script Component to work with them
e.g.
// DT_NTEXT columns take bytes, so convert the string first (Unicode for NTEXT)
byte[] bytes = System.Text.Encoding.Unicode.GetBytes((string)bson.GetValue("data"));
Output0Buffer.DatalineX.AddBlobData(bytes);
There is a longer example, of questionable accuracy with regard to encoding the bytes (which you get to solve), at https://stackoverflow.com/a/74902194/181965

why is this code bad?

Below is some C# code for an RSS reader. Why is this code bad? This class generates a list of the 5 most recent posts, sorted by title. What do you use to analyze code in C#?
static Story[] Parse(string content)
{
    var items = new List<string>();
    int start = 0;
    while (true)
    {
        var nextItemStart = content.IndexOf("<item>", start);
        var nextItemEnd = content.IndexOf("</item>", nextItemStart);
        if (nextItemStart < 0 || nextItemEnd < 0) break;
        String nextItem = content.Substring(nextItemStart, nextItemEnd + 7 - nextItemStart);
        items.Add(nextItem);
        start = nextItemEnd;
    }
    var stories = new List<Story>();
    for (byte i = 0; i < items.Count; i++)
    {
        stories.Add(new Story()
        {
            title = Regex.Match(items[i], "(?<=<title>).*(?=</title>)").Value,
            link = Regex.Match(items[i], "(?<=<link>).*(?=</link>)").Value,
            date = Regex.Match(items[i], "(?<=<pubDate>).*(?=</pubdate>)").Value
        });
    }
    return stories.ToArray();
}
Why not use XmlReader, XmlDocument, or LINQ to XML?
It is bad because it's using string parsing when there are excellent classes in the framework for parsing XML. Even better, there are classes to deal with RSS feeds.
ETA:
Sorry to not have answered your second question earlier. There are a great number of tools to analyze correctness and quality of C# code. There's probably a huge list compiled somewhere, but here's a few I use on a daily basis to help ensure quality code:
StyleCop (code formatting standards)
Resharper (idiomatic programming, gotcha catching)
FxCop (code correctness, standards adherence, idiomatic programming)
Pex (white box testing)
Nitriq (code quality metrics)
NUnit (unit testing)
You shouldn't parse XML with string functions and regular expressions. XML can get very complicated and be formatted many ways that a real XML parser like XmlReader can handle, but will break your simple string parsing code.
Basically: don't try and reinvent the wheel (an xml parser), especially when you don't realize how complicated that wheel actually is.
I think the worst thing about the code is the performance issue. You should parse the XML string into an XDocument (or similar structure) instead of parsing it again and again with regexes.
Simply because you are reinventing an XML parser: use LINQ to XML instead, it is very simple and clean. I am sure you can do all of the above in three lines of code if you use LINQ to XML. Your code also uses a lot of magic numbers (e.g. 7 - n ...), which makes it unstable and non-generic.
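To illustrate the LINQ to XML suggestion, here is a rough sketch that mirrors the fields used in the original code (it ignores XML namespaces and doesn't do the "five most recent, sorted by title" part):
// needs: using System.Linq; using System.Xml.Linq;
static Story[] Parse(string content)
{
    XDocument doc = XDocument.Parse(content);
    return doc.Descendants("item")
              .Select(item => new Story
              {
                  title = (string)item.Element("title"),
                  link = (string)item.Element("link"),
                  date = (string)item.Element("pubDate")
              })
              .ToArray();
}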
For starters, it uses byte as an indexer instead of int (what if items has more items in it than a byte can represent?). It doesn't use idiomatic C# (see user1645569's response). It also unnecessarily uses var instead of specific data types (that's more stylistic, but I prefer not to, so by my metric it's not ideal, and you've given no other metric).
Let me clarify what I am saying about "unnecessarily using var": var in and of itself is not bad, and I am not suggesting that. I am (mostly) suggesting the usage here isn't very consistent. For example, explicitly declaring start as an int, but then declaring nextItemEnd as var (which will deduce to int) and assigning nextItemEnd to start seems (to me) like a weird mixture between wanting to automatically deduce a variable's type and explicitly declaring it. I think it's good that var was not used in start's declaration (because then it's not exactly clear if the intent is an integer or a floating point number), but I (personally) don't think it helps any to declare nextItemStart and nextItemEnd as var. I tend to prefer to use var for more complex/longer data types (similar to how I use auto in C++ for iterators, but not for "simpler" data types).

c# : Loading typed data from file without casting

Is there a way to avoid casting to a non-string type when reading data from a text file containing exclusively integer values separated by integer markers ('0000' for example) ?
(Real-Life example : genetic DNA sequences, each DNA marker being a digit sequence.)
EDIT :
Sample data : 581684531650000651651561156843000021484865321200001987984948978465156115684300002148486532120000198798400009489786515611568430000214848653212000019879849480006516515611684531650000651651561156843000021 etc...
Unless I use a binary writer and read bytes rather than text (because that is how the data was written in the first place), I think this is a funky idea, so "NO" would be the straight answer.
I just wanted to get a definitive confirmation here, to be completely sure.
I welcome any intermediate solution for writing/reading this kind of data efficiently, without having to code a custom reader GUI just to display it intelligibly outside my app (in some generic reader/viewer).
The short answer is no, because a text file is a string of characters.
The long answer is sort of yes; if you put your data into a format like XML, a deserializer can implicitly cast the data back to the correct type (without you having to do it manually) based on your schema.
If you have control over the format, consider using a binary format for your file and use e.g. BinaryReader.ReadInt32.
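A minimal sketch of that binary-format idea, assuming the values fit in an Int32 (the file name and sample values below are made up for illustration):
// needs: using System.IO;
// Writing: a plain sequence of Int32 values (invent your own record/marker layout)
using (BinaryWriter writer = new BinaryWriter(File.Create("sequence.bin")))
{
    foreach (int value in new[] { 5816845, 3165, 0, 651651 })
        writer.Write(value);   // 4 bytes each, no text conversion
}
// Reading: the values come back as ints directly, no parsing or casting
using (BinaryReader reader = new BinaryReader(File.OpenRead("sequence.bin")))
{
    while (reader.BaseStream.Position < reader.BaseStream.Length)
    {
        int value = reader.ReadInt32();
        // use value
    }
}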
Rather than just casting, you really should use the .TryParse(...) method(s) of the types you are trying to read. This is a much more type-safe solution.
And to answer your question: other than using a binary file, there is not (to my knowledge) a way to do this without casting (or using the TryParse methods).
The only way to control the whole read process is to read bytes; otherwise you are reading strings.
Edit: I didn't mention automatic serialization via XML because of the details of the file format you gave.
If the data is text and you need to access it as an integer, a conversion will be required. The only question is which code does the conversion.
Depending upon the file format, you could look for classes or libraries that already handle them. Otherwise, keep your code well organized so you don't have to pay attention to the conversion too much.
Some options:
// Could throw exceptions
var x = Convert.ToInt32(text);
var x = Int32.Parse(text);
// Won't throw an exception, just check the results
int x = 0;
if (Int32.TryParse(text, out x)) { ... }

C# - Comparing two CSV Files and giving an output

I need a bit of help. I have two sources of information, and the information is exported to two different CSV files by different programs. They are supposed to contain the same information; however, this is what needs to be checked.
Therefore what I would like to do is as follows:
Take the information from the two files.
Compare
Output any differences and which file the difference was in (e.g. File A contained this, but File B did not, and vice versa).
The files are 200,000-odd rows, so this will need to be as efficient as possible.
I tried doing this with Excel, however it proved to be too complicated, and I'm really struggling to find a way to do it programmatically.
Assuming that the files are really supposed to be identical, right down to text qualifiers, ordering of rows, and number of rows contained in each file, the simplest approach may be to simply iterate through both files together and compare each line.
using (StreamReader f1 = new StreamReader(path1))
using (StreamReader f2 = new StreamReader(path2)) {
    var differences = new List<string>();
    int lineNumber = 0;
    while (!f1.EndOfStream) {
        if (f2.EndOfStream) {
            differences.Add("Differing number of lines - f2 has less.");
            break;
        }
        lineNumber++;
        var line1 = f1.ReadLine();
        var line2 = f2.ReadLine();
        if (line1 != line2) {
            differences.Add(string.Format("Line {0} differs. File 1: {1}, File 2: {2}", lineNumber, line1, line2));
        }
    }
    if (!f2.EndOfStream) {
        differences.Add("Differing number of lines - f1 has less.");
    }
}
Depending on your answers to the comments on your question, if it doesn't really need to be done with code, you could do worse than download a compare tool, which is likely to be more sophisticated.
(Winmerge for example)
OK, for anyone else who googles this and finds it, here is what my answer was.
I exported the details to a CSV and ordered them numerically when they were exported, for ease of use. Once they were exported as two CSV files, I then used a program called Beyond Compare, which can be found here. This allows the files to be compared.
At first I used Beyond Compare manually to test that what I was exporting was correct, etc. However, Beyond Compare can also be driven from the command line to do the comparison. That way everything is done programmatically, and all that has to be done is for a user to view the results in Beyond Compare. You may be able to export them to another CSV; I haven't looked, as the GUI of Beyond Compare is very nice and useful, so it is easier to use that.

C# Datatype for large sorted collection with position?

I am trying to compare two large datasets from a SQL query. Right now the SQL query is done externally and the results from each dataset are saved into their own CSV file. My little C# console application loads up the two text/CSV files, compares them for differences, and saves the differences to a text file.
It's a very simple application that just loads all the data from the first file into an ArrayList and does a .Compare() on the ArrayList as each line is read from the second CSV file, then saves the records that don't match.
The application works, but I would like to improve the performance. I figure I can greatly improve performance if I can take advantage of the fact that both files are sorted, but I don't know of a datatype in C# that keeps order and would allow me to select a specific position. There's a basic array, but I don't know how many items are going to be in each list. I could have over a million records. Is there a data type available that I should be looking at?
If the data in both of your CSV files is already sorted and the files have the same number of records, you could skip the data structure entirely and do the analysis in place.
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
while (!one.EndOfStream)
{
    lineOne = one.ReadLine();
    lineTwo = two.ReadLine();
    // do your comparison.
    bool areDifferent = true;
    if (areDifferent)
        differences.WriteLine(lineOne + lineTwo);
}
one.Close();
two.Close();
differences.Close();
System.Collections.Specialized.StringCollection allows you to add a range of values and, using the .IndexOf(string) method, allows you to retrieve the index of that item.
That being said, you could likely just load up a couple of byte[] from a filestream and do byte comparison... don't even worry about loading that stuff into a formal data structure like StringCollection or string[]; if all you're doing is checking for differences, and you want speed, I would reckon byte differences are where it's at.
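A rough sketch of that byte-comparison idea, assuming both files fit in memory (the paths are placeholders):
// needs: using System.IO;
byte[] a = File.ReadAllBytes(@"C:\file1.csv");
byte[] b = File.ReadAllBytes(@"C:\file2.csv");
bool identical = a.Length == b.Length;
for (int i = 0; identical && i < a.Length; i++)
{
    if (a[i] != b[i])
        identical = false;   // could also record i as the first differing offset
}
This only tells you whether the files differ, not where the logical differences are, which may or may not be enough for the use case.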
This is an adaptation of David Sokol's code to work with a varying number of lines, outputting the lines that are in one file but not the other:
StreamReader one = new StreamReader(@"C:\file1.csv");
StreamReader two = new StreamReader(@"C:\file2.csv");
String lineOne;
String lineTwo;
StreamWriter differences = new StreamWriter("Output.csv");
lineOne = one.ReadLine();
lineTwo = two.ReadLine();
while (!one.EndOfStream || !two.EndOfStream)
{
    if (lineOne == lineTwo)
    {
        // lines match, read next line from each and continue
        lineOne = one.ReadLine();
        lineTwo = two.ReadLine();
        continue;
    }
    if (two.EndOfStream || string.Compare(lineOne, lineTwo) < 0)
    {
        differences.WriteLine(lineOne);
        lineOne = one.ReadLine();
    }
    if (one.EndOfStream || string.Compare(lineTwo, lineOne) < 0)
    {
        differences.WriteLine(lineTwo);
        lineTwo = two.ReadLine();
    }
}
Standard caveat about code written off the top of my head applies -- you may need to special-case running out of lines in one while the other still has lines, but I think this basic approach should do what you're looking for.
Well, there are several approaches that would work. You could write your own data structure that did this. Or you can try and use SortedList. You can also return the DataSets in code, and then use .Select() on the table. Granted, you would have to do this on both tables.
You can easily use a SortedList to do fast lookups. If the data you are loading is already sorted, insertions into the SortedList should not be slow.
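A minimal sketch of the SortedList idea, assuming each line can serve as its own key and using .NET 4's File.ReadLines to stream the files (paths are placeholders):
// needs: using System; using System.Collections.Generic; using System.IO;
// Load the first (already sorted) file; appending already-sorted keys to a SortedList is cheap
SortedList<string, bool> lines = new SortedList<string, bool>();
foreach (string line in File.ReadLines(@"C:\file1.csv"))
    lines[line] = true;                     // the indexer also tolerates duplicate lines
// Any line of the second file not present in the first is a difference
foreach (string line in File.ReadLines(@"C:\file2.csv"))
{
    if (!lines.ContainsKey(line))
        Console.WriteLine(line);
}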
If you are looking simply to see if all lines in FileA are included in FileB you could read it in and just compare streams inside a loop.
File 1
Entry1
Entry2
Entry3
File 2
Entry1
Entry3
You could loop through with two counters and find omissions, going line by line through each file to see if you get what you need.
Maybe I misunderstand, but an ArrayList will maintain its elements in the same order in which you added them. This means you can compare the two ArrayLists in a single pass - just increment the two scanning indices according to the comparison results.
One question I have is: have you considered "outsourcing" your comparison? There are plenty of good diff tools that you could just call out to. I'd be surprised if there wasn't one that let you specify two files and get only the differences. Just a thought.
I think the reason everyone has so many different answers is that you haven't quite got your problem specified well enough to be answered. First off, it depends on what kind of differences you want to track. Are you wanting the differences to be output like in a WinDiff, where the first file is the "original" and the second file is the "modified", so you can list changes as INSERT, UPDATE or DELETE? Do you have a primary key that will allow you to match up two lines as different versions of the same record (when fields other than the primary key are different)? Or is this some sort of reconciliation where you just want your difference output to say something like "RECORD IN FILE 1 AND NOT FILE 2"?
I think the answers to these questions will help everyone give you a suitable answer to your problem.
If you have two files that are each a million lines, as mentioned in your post, you might be using up a lot of memory. Some of the performance problem might be that you are swapping to disk. If you are simply comparing line 1 of file A to line 1 of file B, line 2 of file A to line 2 of file B, etc., I would recommend a technique that does not store so much in memory. You could read off two file streams, as a previous commenter posted, and write out your results "in real time" as you find them. This would not explicitly store anything in memory. You could also dump chunks of each file into memory, say a thousand lines at a time, into something like a List. This could be fine-tuned to meet your needs.
To resolve question #1 I'd recommend looking into creating a hash of each line. That way you can compare hashes quick and easy using a dictionary.
To resolve question #2 one quick and dirty solution would be to use an IDictionary. Using itemId as your first string type and the rest of the line as your second string type. You can then quickly find if an itemId exists and compare the lines. This of course assumes .Net 2.0+
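A rough sketch of the IDictionary idea, assuming the itemId is the first comma-separated field on each line (paths are placeholders, and malformed lines aren't handled):
// needs: using System; using System.Collections.Generic; using System.IO;
// Build a lookup from file 1: itemId -> rest of the line
Dictionary<string, string> fileOne = new Dictionary<string, string>();
foreach (string line in File.ReadLines(@"C:\file1.csv"))
{
    int comma = line.IndexOf(',');
    fileOne[line.Substring(0, comma)] = line.Substring(comma + 1);
}
// Scan file 2: missing ids and changed rows fall out immediately
foreach (string line in File.ReadLines(@"C:\file2.csv"))
{
    int comma = line.IndexOf(',');
    string id = line.Substring(0, comma);
    string rest = line.Substring(comma + 1);
    string original;
    if (!fileOne.TryGetValue(id, out original))
        Console.WriteLine("In file 2 but not file 1: " + id);
    else if (original != rest)
        Console.WriteLine("Changed: " + id);
}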
