I'm trying to modify the following method so it will show the column names of the non-matched items in the output of the two CSV files I'm comparing:
public static void CompareCSVFiles_2(string file1, string file2)
{
    string[] names1 = File.ReadAllLines(file1);
    string[] names2 = File.ReadAllLines(file2);
    IEnumerable<string> differenceQuery = names1.Except(names2);
    foreach (string s in differenceQuery)
        Console.WriteLine(s);
}
The format of the two files I'm trying to compare is plain CSV, for example:
CSV_1                           CSV_2
Column_1  Column_2  Column_3    Column_1  Column_2  Column_3
123       hhh       bbb         123       hhh       bbb
135       ddd       lll         135       ddd       zzz
The output I'm after needs to indicate not only that a diff was found between the two files, but also the column name.
For example:
'Diff found in Column_3, line 2'.
I do know that the column names are just line [0] in the CSV, but what am I missing here?
Thanks.
The current implementation compares the two files row by row. If you want to find the differences in columns, you have to parse the rows first.
There is a good NuGet package called CsvHelper that helps you with the parsing. Check the "Reading" > "Reading by Hand" example on their website to see how to read the file column by column.
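For illustration, here is a rough sketch of a line-by-line, column-by-column comparison using a plain Split(',') (the method name is mine; this assumes both files share the same header and contain simple comma-separated values with no quoted fields, so CsvHelper remains the more robust choice for real CSV):
using System;
using System.IO;

public static void CompareCsvFilesByColumn(string file1, string file2)
{
    string[] lines1 = File.ReadAllLines(file1);
    string[] lines2 = File.ReadAllLines(file2);
    string[] headers = lines1[0].Split(',');   // column names come from the header row

    // Start at 1 to skip the header; data rows are reported 1-based, as in the question.
    for (int line = 1; line < Math.Min(lines1.Length, lines2.Length); line++)
    {
        string[] fields1 = lines1[line].Split(',');
        string[] fields2 = lines2[line].Split(',');
        for (int col = 0; col < headers.Length; col++)
        {
            if (fields1[col] != fields2[col])
                Console.WriteLine($"Diff found in {headers[col]}, line {line}");
        }
    }
}
With the sample files above, this prints "Diff found in Column_3, line 2".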
I have a DataTable like this:
   column1  column2
   -------  -------
1  abc d    Alpha
2  ab       Gamma
3  abc de   Harry
4  xyz      Peter
I want to check whether a substring of a string exists in the DataTable.
e.g. if the string I am looking for is "abc defg", record 3 should be returned (although record 1 is also a match, record 3 has more characters in common in sequence).
I am not able to find any way to search as described above.
Any help or guidance would be much appreciated.
This would be a two-step process.
First, filter the table for rows that match. This can be done with the string.Contains method. In LINQ, this would look something like:
const string myText = "abc defg";
IEnumerable<Row> matches = MyTable.Where(row => myText.Contains(row.Column1));
Second, select the longest match. In LINQ, this might look something like this:
Row longestMatch = matches.OrderByDescending<Row, int>(row => row.Column1.Length).First();
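Put together against the sample data, a minimal self-contained sketch might look like this (the DataTable setup is mine; with a real DataTable you would use AsEnumerable() and Field<string>() rather than a Row class):
using System;
using System.Data;
using System.Linq;

class Program
{
    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("column1", typeof(string));
        table.Columns.Add("column2", typeof(string));
        table.Rows.Add("abc d", "Alpha");
        table.Rows.Add("ab", "Gamma");
        table.Rows.Add("abc de", "Harry");
        table.Rows.Add("xyz", "Peter");

        const string myText = "abc defg";

        // Step 1: rows whose column1 is contained in the search string.
        // AsEnumerable() comes from System.Data.DataSetExtensions.
        var matches = table.AsEnumerable()
            .Where(row => myText.Contains(row.Field<string>("column1")));

        // Step 2: the match with the most characters in common, i.e. the longest column1
        DataRow longestMatch = matches
            .OrderByDescending(row => row.Field<string>("column1").Length)
            .First();

        Console.WriteLine(longestMatch["column2"]); // prints "Harry" (record 3)
    }
}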
I have a DataTable with a few columns.
I am trying to add the column values using DataColumn.Expression.
The columns being added are of type decimal, and the calculated column is decimal as well. But while processing the expression (like column1 + column2), it just appends the data.
SlNo  Name  F1  F2  F3
1     A     1   2   3
2     B     3   4   5
I am expecting output similar to this:
SlNo  Name  F1  F2  F3  Total
1     A     1   2   3   6
2     B     3   4   5   12
What I tried:
dtTempData.Columns.Add("Total", typeof(Decimal));
dtTempData.Columns["Total"].Expression = "[F1]+[F2]+[F3]";
Now the output I am getting looks like this:
123
345
It's just appending the data. Thanks in advance for any help.
I don't know why this is happening.
dtTempData.Columns.Add("Total", typeof(Decimal));
dtTempData.Columns["Total"].DefaultValue = 0;
dtTempData.Columns["Total"].Expression = expression;
This is the way I created the columns, but when summing based on the expression, it appends the data.
I am importing data from another DataTable whose columns are of type string. So I tried to convert the data in the following manner:
string expression = "Convert(F1, 'System.Decimal') + Convert(F2, 'System.Decimal') + Convert(F3, 'System.Decimal')";
Now this is working and the Total column is having the value after addition. Thanks all for your help.
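To make the failure mode concrete: since F1, F2 and F3 came in as string columns, the + operator in a DataColumn expression performs string concatenation, which is exactly the "appending" seen above. A minimal self-contained sketch of the working setup (column layout taken from the example data):
using System;
using System.Data;

class Program
{
    static void Main()
    {
        var dtTempData = new DataTable();
        dtTempData.Columns.Add("SlNo", typeof(string));
        dtTempData.Columns.Add("Name", typeof(string));
        dtTempData.Columns.Add("F1", typeof(string));   // imported as string, as in the question
        dtTempData.Columns.Add("F2", typeof(string));
        dtTempData.Columns.Add("F3", typeof(string));
        dtTempData.Rows.Add("1", "A", "1", "2", "3");
        dtTempData.Rows.Add("2", "B", "3", "4", "5");

        // Convert() forces numeric addition instead of string concatenation
        dtTempData.Columns.Add("Total", typeof(decimal));
        dtTempData.Columns["Total"].Expression =
            "Convert(F1, 'System.Decimal') + Convert(F2, 'System.Decimal') + Convert(F3, 'System.Decimal')";

        foreach (DataRow row in dtTempData.Rows)
            Console.WriteLine(row["Total"]);   // prints 6 and 12
    }
}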
Here is the code:
foreach (var file in d.GetFiles("*.xml"))
{
    string test = getValuesOneFile(file.ToString());
    result.Add(test);
    Console.WriteLine(test);
    Console.ReadLine();
}
File.WriteAllLines(filepath + @"\MapData.txt", result);
Here is what it looks like in the console:
[30000]
total=5
sp 0 -144 152 999999999
sp 0 -207 123 999999999
sp 0 -173 125 999999999
in00 1 -184 213 999999999
out00 2 1046 94 40000
Here is how it looks in the text file (written after the loop ends).
[30000]total=5sp 0 -144 152 999999999sp 0 -207 123 999999999sp 0 -173 125 999999999in00 1 -184 213 999999999out00 2 1046 94 40000
I need it to write the lines in the same style as the console output.
WriteAllLines separates each of the values with the environment's newline string. However, throughout the history of computers, a number of different characters have been used to represent newlines, and you are looking at the text file with a program that expects a different kind of newline separator. You should either use a different program to view the file (one that properly handles this separator, or any separator), configure your program to expect the given separator, or replace WriteAllLines with a manual method of writing the strings that uses another newline separator.
Rather than WriteAllLines, you'll probably want to write the text manually:
string textToWrite = "";
foreach (var res in result)
{
    textToWrite += res.Replace("\r", "").Replace("\n", ""); // ensure there are no carriage returns or line feeds inside each entry
    textToWrite += "\r\n"; // add the CR LF pair explicitly
}
File.WriteAllText(filepath + @"\MapData.txt", textToWrite);
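As a side note, the same output can be produced more concisely with string.Join, assuming the same stripping of embedded line breaks is wanted:
// requires using System.Linq;
var cleaned = result.Select(r => r.Replace("\r", "").Replace("\n", ""));
File.WriteAllText(filepath + @"\MapData.txt", string.Join("\r\n", cleaned) + "\r\n");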
The problem is definitely how you are looking for newlines in your output. Environment.NewLine will get inserted after each string written by WriteAllLines.
I would recommend opening the output file in Notepad++ and turning on View > Show Symbol > Show End of Line to see which end-of-line characters are in the file. On my machine, for instance, it is [CR][LF] (carriage return / line feed) at the end of each line, which is standard for Windows.
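If Notepad++ is not at hand, a quick programmatic check works too; a minimal sketch, assuming the same output path as in the question:
// count the separator bytes in the file (requires using System.Linq;)
byte[] bytes = File.ReadAllBytes(filepath + @"\MapData.txt");
Console.WriteLine($"CR: {bytes.Count(b => b == (byte)'\r')}, LF: {bytes.Count(b => b == (byte)'\n')}");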
I have a tab-separated text file like this:
customerNo Offer Score
1 1 0.273
2 1 0.630
3 1 0.105
4 1 0.219
5 1 0.000
6 1 0.303
7 1 0.760
I have a string array in my program that contains all the lines in this text file.
Using LINQ, I would first like to get rid of any lines that have non-numerical characters (like the header line above) or are empty, and then save the remaining lines as a list of objects. Here, my object would be something called ScoreItem with the properties customerNo, Offer and Score. So eventually I get 7 of these objects from this file.
In your case I would do this:
var scoreItems = File.ReadAllLines("the-file-full-path")
    .Select(x => x.Split('\t'))
    .Where(x =>
    {
        int i;
        return int.TryParse(x[0], out i);
    })
    .Select(x => new ScoreItem
    {
        CustomerNo = int.Parse(x[0]),
        Offer = int.Parse(x[1]),
        Score = double.Parse(x[2])
    });
And consider using .ToArray() or .ToList() at the end to prevent possible re-enumeration of the query further on in the code.
Updated:
The code provided is straightforward: it does not include any additional checks for data format, culture, etc. To make sure the numbers are always parsed independently of the current user's culture setup, double parsing must use double.Parse(x[2], CultureInfo.InvariantCulture) (for instance) instead.
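For reference, a minimal sketch of the ScoreItem class the query assumes (property names follow the question; the exact shape of the class is not shown there):
public class ScoreItem
{
    public int CustomerNo { get; set; }
    public int Offer { get; set; }
    public double Score { get; set; }
}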
This is a followup to my first question "Porting “SQL” export to T-SQL".
I am working with a 3rd-party program that I have no control over and cannot change. This program exports its internal database into a set of .sql files, each with a format of:
INSERT INTO [ExampleDB] ( [IntField] , [VarcharField], [BinaryField])
VALUES
(1 , 'Some Text' , 0x123456),
(2 , 'B' , NULL),
--(SNIP, it does this for 1000 records)
(999, 'E' , null);
(1000 , 'F' , null);
INSERT INTO [ExampleDB] ( [IntField] , [VarcharField] , BinaryField)
VALUES
(1001 , 'asdg', null),
(1002 , 'asdf' , 0xdeadbeef),
(1003 , 'dfghdfhg' , null),
(1004 , 'sfdhsdhdshd' , null),
--(SNIP 1000 more lines)
This pattern continues until the .sql file reaches a file size set during the export. The export files are grouped as EXPORT_PATH\%Table_Name%\Export#.sql, where # is a counter starting at 1.
Currently I have about 1.3 GB of data, and I have it exporting in 1 MB chunks (1407 files across 26 tables; all but 5 tables have only one file, and the largest table has 207 files).
Right now I just have a simple C# program that reads each file into RAM and then calls ExecuteNonQuery. The issue is that I am averaging 60 sec/file, which means it will take about 23 hours to do the entire export.
I assume that if I could somehow format the files to be loaded with a BULK INSERT instead of an INSERT INTO, it could go much faster. Is there an easy way to do this, or do I have to write some kind of find-and-replace and keep my fingers crossed that it doesn't fail on some corner case and blow up my data?
Any other suggestions on how to speed up the inserts would also be appreciated.
UPDATE:
I ended up going with the parse and do a SqlBulkCopy method. It went from 1 file/min. to 1 file/sec.
Well, here is my "solution" for helping convert the data into a DataTable or otherwise (run it in LINQPad):
var i = "(null, 1 , 'Some''\n Text' , 0x123.456)";
var pat = @",?\s*(?:(?<n>null)|(?<w>[\w.]+)|'(?<s>.*)'(?!'))";
Regex.Matches(i, pat,
    RegexOptions.IgnoreCase | RegexOptions.Singleline).Dump();
The match should be run once per value group (e.g. (a,b,etc)). Parsing of the results (e.g. conversion) is left to the caller and I have not tested it [much]. I would recommend creating the correctly-typed DataTable first -- although it may be possible to pass everything "as a string" to the database? -- and then use the information in the columns to help with the extraction process (possibly using type converters). For the captures: n is null, w is word (e.g. number), s is string.
Happy coding.
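To illustrate how the captures might be consumed, here is a rough sketch of my own (the helper name and the unescaping of doubled quotes are assumptions on my part; conversion to the actual column types is still left to the caller, as the answer says):
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static object[] ParseValueGroup(string valueGroup)
{
    var pat = @",?\s*(?:(?<n>null)|(?<w>[\w.]+)|'(?<s>.*)'(?!'))";
    var values = new List<object>();
    foreach (Match m in Regex.Matches(valueGroup, pat,
                 RegexOptions.IgnoreCase | RegexOptions.Singleline))
    {
        if (m.Groups["n"].Success)
            values.Add(DBNull.Value);                            // SQL NULL
        else if (m.Groups["w"].Success)
            values.Add(m.Groups["w"].Value);                     // number or hex literal, still a string here
        else
            values.Add(m.Groups["s"].Value.Replace("''", "'")); // string, with doubled quotes unescaped
    }
    return values.ToArray();
}

// Usage: ParseValueGroup("(1 , 'Some Text' , 0x123456)")
//        -> { "1", "Some Text", "0x123456" }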
Apparently your data is always wrapped in parentheses and starts with a left parenthesis. You might want to use this rule to split each of those lines (with RemoveEmptyEntries) and load them into a DataTable. Then you can use SqlBulkCopy to copy everything at once into the database.
This approach would not necessarily be fail-safe, but it would certainly be faster.
Edit: Here's how you could get the schema for every table:
private static DataTable extractSchemaTable(IEnumerable<String> lines)
{
    DataTable schema = null;
    var insertLine = lines.SkipWhile(l => !l.StartsWith("INSERT INTO [")).Take(1).First();
    var startIndex = insertLine.IndexOf("INSERT INTO [") + "INSERT INTO [".Length;
    var endIndex = insertLine.IndexOf("]", startIndex);
    var tableName = insertLine.Substring(startIndex, endIndex - startIndex);
    using (var con = new SqlConnection("CONNECTION"))
    {
        using (var schemaCommand = new SqlCommand("SELECT * FROM " + tableName, con))
        {
            con.Open();
            using (var reader = schemaCommand.ExecuteReader(CommandBehavior.SchemaOnly))
            {
                // SchemaOnly: loads the column definitions (no rows) into an empty
                // DataTable so the data rows can be added to it later
                schema = new DataTable(tableName);
                schema.Load(reader);
            }
        }
    }
    return schema;
}
Then you simply need to iterate over each line in the file, check whether it starts with ( and split that line with Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries). Then you can add the resulting array to the created schema table.
Something like this:
var allLines = System.IO.File.ReadAllLines(path);
DataTable result = extractSchemaTable(allLines);
for (int i = 0; i < allLines.Length; i++)
{
    String line = allLines[i];
    if (line.StartsWith("("))
    {
        // take everything between the opening "(" and the last ")"
        String data = line.Substring(1, line.LastIndexOf(")") - 1);
        var fields = data.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
        // you might need to parse it to correct DataColumn.DataType
        result.Rows.Add(fields);
    }
}
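Finally, a minimal sketch of the SqlBulkCopy step mentioned above, assuming result now holds the parsed rows (the connection string is a placeholder):
using (var con = new SqlConnection("CONNECTION"))
{
    con.Open();
    using (var bulk = new SqlBulkCopy(con))
    {
        bulk.DestinationTableName = result.TableName; // e.g. "ExampleDB" from the question
        bulk.WriteToServer(result);                   // one bulk operation instead of thousands of INSERTs
    }
}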