I have a text file that contains comma-separated values, and it looks like this:
3,23500,R,5998,20.38,06/12/2013 01:44:17
2,23500,P,5983,20.234,06/12/2013 01:44:17
3,23501,R,5998,20.38,06/12/2013 01:44:18
2,23501,P,5983,20.235,06/12/2013 01:44:18
3,23502,R,6000,20.4,06/12/2013 01:44:19
2,23502,P,5983,20.236,06/12/2013 01:44:19
3,23503,R,5999,20.39,06/12/2013 01:44:20
2,23503,P,5983,20.236,06/12/2013 01:44:20
My task is to extract the lines that start with the same number into separate files. E.g. in the above case some lines start with 2 and some with 3; there can be more cases, like 4, etc.
What would be the best and fastest approach to do this? The files I am working with are quite big, sometimes on the order of gigabytes.
I split each line, stored the first value (the number I am looking for) in a list, and then removed duplicate values from the list. It works, but it is very slow!
This is my own code:
private void buttonBeginProcess_Click(object sender, EventArgs e)
{
    var file = File.ReadAllLines(_fileName);
    var nodeId = new List<int>();

    foreach (var line in file)
    {
        nodeId.Add(int.Parse(line.Split(',')[0]));
    }

    // Unique numbers
    nodeId = nodeId.Distinct().ToList();
}
var lines = File.ReadLines(myFilePath);
var lineGroups = lines
    .Where(line => line.Contains(","))
    .Select(line => new { key = line.Split(',')[0], line })
    .GroupBy(x => x.key);

foreach (var lineGroup in lineGroups)
{
    var key = lineGroup.Key;
    var keySpecificLines = lineGroup.Select(x => x.line);
    // save keySpecificLines to file
}
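For the save step, one call per group would do; the output naming scheme here is just an assumption:

File.WriteAllLines($"group_{key}.txt", keySpecificLines);

Note that File.ReadLines streams the file lazily, but GroupBy still buffers every line in memory before producing the first group, so this approach may struggle with multi-gigabyte inputs.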
You could try using StreamReader / StreamWriter to process each file one line at a time:
var writers = new Dictionary<string, StreamWriter>();
using (StreamReader sr = new StreamReader(pathToFile))
{
    while (sr.Peek() >= 0)
    {
        var line = sr.ReadLine();
        var key = line.Split(new[] { ',' }, 2)[0];
        if (!writers.ContainsKey(key))
        {
            writers[key] = new StreamWriter(GetPathToOutput(key));
        }
        writers[key].WriteLine(line);
    }
}

foreach (StreamWriter sw in writers.Values)
{
    sw.Dispose();
}
With this method, your code never has to hold the entire input file in memory, so it shouldn't matter how large the input files are. Of course, the downside is that it has to keep an arbitrary number of output files open throughout the process.
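One further refinement (a sketch, not required): if an exception is thrown mid-loop, the writers above are never disposed, so the read/write section could be wrapped in a try/finally:

try
{
    // ... the read/write loop shown above ...
}
finally
{
    foreach (StreamWriter sw in writers.Values)
    {
        sw.Dispose();
    }
}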
Related
I am new to C# and need help writing a parser for the below CSV file of data:
[INFO]
LINE_NAME,MACHINE_SN,MACHINE_NAME,OPERATOR_ID
LineName,ParmiMachineSN,PARMI_AOI_1,engineer
[INFO_END]
[PANEL_INSP_RESULT]
MODEL_NAME,MODEL_CODE,PANEL_SIDE,INDEX,BARCODE,DATE,START_TIME,END_TIME,DEFECT_NAME,DEFECT_CODE,RESULT
E11-03356-0388-A-TOP CNG,,BOTTOM,47,MLT0388A03358CSNSOF1232210200052-0001,20201023,12:46:57,12:47:04,,,OK
[PANEL_INSP_RESULT_END]
[BOARD_INSP_RESULT]
BOARD_NO,BARCODE,DEFECT_NAME,DEFECT_CODE,BADMARK,RESULT
1,MLT0388A03358CSNSOF1232210200052-0001,,,NO,OK
2,MLT0388A03358CSNSOF1232210200052-0004,,,NO,OK
3,MLT0388A03358CSNSOF1232210200052-0003,,,NO,OK
4,MLT0388A03358CSNSOF1232210200052-0002,,,NO,OK
[BOARD_INSP_RESULT_END]
[COMPONENT_INSP_RESULT]
BOARD_NO,LOCATION_NAME,PIN_NUMBER,POS_X,POS_Y,DEFECT_NAME,DEFECT_CODE,RESULT
[COMPONENT_INSP_RESULT_END]
I need to parse the above file.
To parse the above CSV file in C#, you can use the following steps:
Read the entire file into a string using the File.ReadAllText method.
string fileText = File.ReadAllText("file.csv");
Split the file into individual sections by looking for the "[INFO]" and "[INFO_END]" tags (and the corresponding tags for the other sections), and then use a loop to process each section.
string[] sections = fileText.Split(
    new string[]
    {
        "[INFO]", "[INFO_END]",
        "[PANEL_INSP_RESULT]", "[PANEL_INSP_RESULT_END]",
        "[BOARD_INSP_RESULT]", "[BOARD_INSP_RESULT_END]",
        "[COMPONENT_INSP_RESULT]", "[COMPONENT_INSP_RESULT_END]"
    },
    StringSplitOptions.RemoveEmptyEntries);

foreach (string section in sections)
{
    // Process each section
}
Within the loop, use the String.Split method to split each section into rows by looking for the newline character.
string[] rows = section.Split('\n');
Use the String.Split method again to split each row into cells by looking for the comma.
foreach (string row in rows)
{
    string[] cells = row.Split(',');
    // Process each cell
}
Now you can process each cell as needed. You can check the first cell's value to decide which section a row belongs to, and then process the cells according to their type and position in the row.
You can use a switch statement to check which section you are currently processing and then apply the appropriate logic to parse the data.
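A minimal sketch of that switch; the section-tracking variable and the handler methods here are hypothetical:

switch (currentSection) // hypothetical variable tracking the active section
{
    case "PANEL_INSP_RESULT":
        ParsePanelRow(cells);     // hypothetical helper
        break;
    case "BOARD_INSP_RESULT":
        ParseBoardRow(cells);     // hypothetical helper
        break;
    case "COMPONENT_INSP_RESULT":
        ParseComponentRow(cells); // hypothetical helper
        break;
}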
Please be aware that this is a simplified example, and you may need to add additional error handling and validation to ensure that the data is properly parsed.
This is one example of how you can parse the CSV file; you might need to handle various edge cases, like empty rows and empty cells, based on your specific use case.
The following reads all lines, creates an anonymous list pairing each line with its index, then loops through a list of section names. For each section it finds the start and end markers and, in this case, writes the section contents to a console window.
internal partial class Program
{
    static void Main(string[] args)
    {
        var items = File.ReadAllLines("YourFileNameGoesHere")
            .Select((line, index) => new { Line = line, Index = index })
            .ToList();

        List<string> sections = new List<string>()
        {
            "INFO",
            "PANEL_INSP_RESULT",
            "BOARD_INSP_RESULT",
            "COMPONENT_INSP_RESULT"
        };

        foreach (var section in sections)
        {
            Console.WriteLine($"{section}");
            var startItem = items.FirstOrDefault(x => x.Line == $"[{section}]");
            var endItem = items.FirstOrDefault(x => x.Line == $"[{section}_END]");

            if (startItem is not null && endItem is not null)
            {
                bool header = false;
                for (int index = startItem.Index + 1; index < endItem.Index; index++)
                {
                    if (header == false)
                    {
                        Console.WriteLine($"\t{items[index].Line}");
                        header = true;
                    }
                    else
                    {
                        Console.WriteLine($"\t\t{items[index].Line}");
                    }
                }
            }
            else
            {
                Console.WriteLine("\tFailed to read this section");
            }
        }
    }
}
I'm building a simple dictionary from a reg file (export from Windows Regedit). The .reg file contains a key in square brackets, followed by zero or more lines of text, followed by a blank line. This code will create the dictionary that I need:
var a = File.ReadLines("test.reg");
var dict = new Dictionary<String, List<String>>();

foreach (var key in a)
{
    if (key.StartsWith("[HKEY"))
    {
        var iter = a.GetEnumerator();
        var value = new List<String>();
        do
        {
            iter.MoveNext();
            value.Add(iter.Current);
        } while (String.IsNullOrWhiteSpace(iter.Current) == false);
        dict.Add(key, value);
    }
}
I feel like there is a cleaner (prettier?) way to do this in a single LINQ statement (using a group by), but it's unclear to me how to implement the iteration of the value items into a list. I suspect I could do the same GetEnumerator in a let statement, but it seems like there should be a way to implement this without resorting to an explicit enumerator.
Sample data:
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.msu]
#="Microsoft.System.Update.1"
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS]
#="WMP11.AssocFile.M2TS"
"Content Type"="video/vnd.dlna.mpeg-tts"
"PerceivedType"="video"
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\OpenWithProgIds]
"WMP11.AssocFile.M2TS"=hex(0):
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\ShellEx]
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\ShellEx\{BB2E617C-0920-11D1-9A0B-00C04FC2D6C1}]
#="{9DBD2C50-62AD-11D0-B806-00C04FD706EC}"
Update
I'm sorry, I need to be more specific. The files I am looking at are around ~300 MB, so I took the approach I did to keep the memory footprint down. I'd prefer an approach that doesn't require pulling the entire file into memory.
You can always use Regex:
var dict = new Dictionary<String, List<String>>();
var a = File.ReadAllText(@"test.reg");
var results = Regex.Matches(a, "(\\[[^\\]]+\\])([^\\[]+)\r\n\r\n", RegexOptions.Singleline);

foreach (Match item in results)
{
    dict.Add(
        item.Groups[1].Value,
        item.Groups[2].Value.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries).ToList()
    );
}
I whipped this out real quick. You might be able to improve the regex pattern.
Instead of using GetEnumerator, you can take advantage of the TakeWhile and Skip methods to break your list into smaller lists (each sublist represents one key and its values):
var registryLines = File.ReadLines("test.reg");
Dictionary<string, List<string>> resultKeys = new Dictionary<string, List<string>>();

while (registryLines.Count() > 0)
{
    // Take the key and values into a single list
    var keyValues = registryLines.TakeWhile(x => !String.IsNullOrWhiteSpace(x)).ToList();

    // Add a new entry to the dictionary using the first value as key and the rest of the list as value
    if (keyValues != null && keyValues.Count > 0)
        resultKeys.Add(keyValues[0], keyValues.Skip(1).ToList());

    // Jump to the next registry key (+1 to skip the blank line)
    registryLines = registryLines.Skip(keyValues.Count + 1);
}
EDIT based on your update (the files are ~300 MB and you'd prefer not to pull the entire file into memory):
Well, if you can't read the whole file into memory, it makes no sense to ask for a LINQ solution. Here is a sample of how you can do it reading line by line (still no need for GetEnumerator):
Dictionary<string, List<string>> resultKeys = new Dictionary<string, List<string>>();

using (StreamReader reader = File.OpenText("test.reg"))
{
    List<string> keyAndValues = new List<string>();
    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();

        // Add key and values to a list until a blank line is found
        if (!string.IsNullOrWhiteSpace(line))
            keyAndValues.Add(line);
        else
        {
            // Add a new entry to the dictionary using the first value as key and the rest of the list as value
            if (keyAndValues.Count > 0)
                resultKeys.Add(keyAndValues[0], keyAndValues.Skip(1).ToList());

            // Start a new key collection
            keyAndValues = new List<string>();
        }
    }

    // Flush the final section, in case the file does not end with a blank line
    if (keyAndValues.Count > 0)
        resultKeys.Add(keyAndValues[0], keyAndValues.Skip(1).ToList());
}
I think you can use code like this, if you have enough memory:
var lines = File.ReadAllText(fileName);
var result =
    Regex.Matches(lines, @"\[(?<key>HKEY[^]]+)\]\s+(?<value>[^[]+)")
        .OfType<Match>()
        .ToDictionary(
            k => k.Groups["key"].Value,
            v => v.Groups["value"].Value.Trim('\n', '\r', ' '));
This takes 24.173 seconds for a file with more than 4 million lines (~550 MB in size), using 1.2 GB of memory.
Edit:
A better way is to use File.ReadLines, which reads the file lazily:
var lines = File.ReadLines(fileName);
var keyRegex = new Regex(@"\[(?<key>HKEY[^]]+)\]");
var currentKey = string.Empty;
var currentValue = string.Empty;
var result = new Dictionary<string, string>();

foreach (var line in lines)
{
    var match = keyRegex.Match(line);
    if (match.Length > 0)
    {
        if (!string.IsNullOrEmpty(currentKey))
        {
            result.Add(currentKey, currentValue);
            currentValue = string.Empty;
        }
        currentKey = match.Groups["key"].ToString();
    }
    else
    {
        currentValue += line;
    }
}

// Store the final key; the loop above only stores a section once the next key is found.
if (!string.IsNullOrEmpty(currentKey))
    result.Add(currentKey, currentValue);
This takes 17,093 milliseconds for a file with 795,180 lines.
I want to import some data from a csv file, but I've encountered a small problem I can't really figure out.
The person who gave me this file added comma-separated values inside cells, so when I split on commas those values get added to the list too. Instead, I would like to get the whole value of each column as a string; I just can't really figure out how.
For example, the column I'm talking about holds the days a restaurant is open. This can be "Mo, Tu, We, Su", but it can also be "Mo, Tu".
Is there a way I can just loop over the values per column, instead of over the comma-separated values?
I'm currently using it like this, but this just adds each day to the total list of values:
using (var fs = File.OpenRead(csvUrl))
using (var reader = new StreamReader(fs, Encoding.UTF8))
{
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        if (i > 0) // i is a line counter maintained elsewhere, used to skip the header row
        {
            var values = line.Split(',');
        }
    }
}
Use TextFieldParser to parse CSV files:
TextFieldParser parser = new TextFieldParser(new StringReader(lineContent));
parser.SetDelimiters(",");
string[] rawFields = parser.ReadFields();
lineContent is a string with the content of the current line in your file.
TextFieldParser is available in the Microsoft.VisualBasic.FileIO namespace.
Don't mind the Visual Basic part; it works fine in C# (on .NET Framework you may need to add a reference to the Microsoft.VisualBasic assembly).
EDIT
In your code you could implement it like this:
using (var fs = File.OpenRead(csvUrl))
using (var reader = new StreamReader(fs, Encoding.UTF8))
{
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        if (i > 0)
        {
            using (var parser = new TextFieldParser(new StringReader(line)))
            {
                parser.SetDelimiters(",");
                string[] rawFields = parser.ReadFields();
            }
        }
    }
}
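As a side note, creating a new parser per line works, but TextFieldParser can also wrap the file directly, which avoids the inner StringReader entirely (a sketch, assuming the same csvUrl):

using (var parser = new TextFieldParser(csvUrl, Encoding.UTF8))
{
    parser.SetDelimiters(",");
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields(); // respects quoted fields containing commas
        // process fields
    }
}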
The best solution so far for dealing with CSV values is to use the .NET built-in libraries. It's explained in my Stack Overflow answer here:
Reading CSV file and storing values into an array
For easy reference, I am including the code here as well.
using Microsoft.VisualBasic.FileIO;

var path = @"C:\Person.csv"; // Habeeb, "Dubai Media City, Dubai"

using (TextFieldParser csvParser = new TextFieldParser(path))
{
    csvParser.CommentTokens = new string[] { "#" };
    csvParser.SetDelimiters(new string[] { "," });
    csvParser.HasFieldsEnclosedInQuotes = true;

    // Skip the row with the column names
    csvParser.ReadLine();

    while (!csvParser.EndOfData)
    {
        // Read current line fields, pointer moves to the next line.
        string[] fields = csvParser.ReadFields();
        string Name = fields[0];
        string Address = fields[1];
    }
}
More details about the parser is given here: http://codeskaters.blogspot.ae/2015/11/c-easiest-csv-parser-built-in-net.html
I am using the CsvHelper library, which can extract a list of objects from a CSV file with just three lines of code:
var streamReader = // Create a reader to your CSV file.
var csvReader = new CsvReader( streamReader );
List<MyCustomType> myData = csvReader.GetRecords<MyCustomType>().ToList();
However, my file has nonsense lines, and I need to skip the first ten lines in the file. I thought it would be nice to use LINQ to ensure 'clean' data, and then pass that data to CsvReader, like so:
public TextReader GetTextReader(IEnumerable<string> lines)
{
    // Some magic here. Don't want to return null;
    return TextReader.Null;
}
public IEnumerable<T> ExtractObjectList<T>(string filePath) where T : class
{
    var csvLines = File.ReadLines(filePath)
        .Skip(10)
        .Where(l => !l.StartsWith(",,,"));

    var textReader = GetTextReader(csvLines);
    var csvReader = new CsvReader(textReader);
    csvReader.Configuration.ClassMapping<EventMap, Event>();

    return csvReader.GetRecords<T>();
}
But I'm really stuck on pushing a 'static' collection of strings through a stream such as a TextReader.
My alternative here is to process the CSV file line by line through CsvReader and examine each line before extracting an object, but I find that somewhat clumsy.
The StringReader Class provides a TextReader that wraps a String. You could simply join the lines and wrap them in a StringReader:
public TextReader GetTextReader(IEnumerable<string> lines)
{
    return new StringReader(string.Join("\r\n", lines));
}
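Note that string.Join materializes all of the filtered lines into a single string, so this is convenient rather than streaming; for very large files, a custom TextReader over the enumerable would keep the memory footprint down.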
An easier way would be to use CsvHelper to skip the lines.
// Skip rows.
csvReader.Configuration.IgnoreBlankLines = false;
csvReader.Configuration.IgnoreQuotes = true;
for (var i = 0; i < 10; i++)
{
    csvReader.Read();
}
csvReader.Configuration.IgnoreBlankLines = true;
csvReader.Configuration.IgnoreQuotes = false;

// Carry on as normal.
var myData = csvReader.GetRecords<MyCustomType>();
IgnoreBlankLines is turned off in case any of those first 10 rows are blank. IgnoreQuotes is turned on so you don't get any BadDataExceptions if those rows contain a ". You can set both back afterwards for normal functionality again.
If you don't know the number of rows and need to test based on the row data, you can just test csvReader.Context.Record and see if you need to stop. In this case, you would probably need to manually call csvReader.ReadHeader() before calling csvReader.GetRecords<MyCustomType>().
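A rough sketch of that data-driven skip; the test on the first cell is purely an assumption, so adjust it to whatever marks your real header row:

while (csvReader.Read())
{
    // Stop skipping once the current record looks like the real header (hypothetical test).
    if (csvReader.Context.Record[0] == "Id")
        break;
}
csvReader.ReadHeader();
var myData = csvReader.GetRecords<MyCustomType>().ToList();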
I have a basic C# console application that reads a text file (CSV format) line by line and puts the data into a Hashtable. The first CSV item in the line is the key (id num) and the rest of the line is the value. However, I've discovered that my import file has a few duplicate keys that it shouldn't have. When I try to import the file, the application errors out because you can't have duplicate keys in a Hashtable. I want my program to be able to handle this error, though. When I run into a duplicate key, I would like to put that key into an ArrayList and continue importing the rest of the data into the hashtable. How can I do this in C#?
Here is my code:
private static Hashtable importFile(Hashtable myHashtable, String myFileName)
{
    StreamReader sr = new StreamReader(myFileName);
    CSVReader csvReader = new CSVReader();
    ArrayList tempArray = new ArrayList();
    int count = 0;

    while (!sr.EndOfStream)
    {
        String temp = sr.ReadLine();
        if (temp.StartsWith(" "))
        {
            ServMissing.Add(temp);
        }
        else
        {
            tempArray = csvReader.CSVParser(temp);
            Boolean first = true;
            String key = "";
            String value = "";

            foreach (String x in tempArray)
            {
                if (first)
                {
                    key = x;
                    first = false;
                }
                else
                {
                    value += x + ",";
                }
            }
            myHashtable.Add(key, value);
        }
        count++;
    }

    Console.WriteLine("Import Count: " + count);
    return myHashtable;
}
if (myHashtable.ContainsKey(key))
    duplicates.Add(key);
else
    myHashtable.Add(key, value);
A better solution is to call ContainsKey to check if the key exists before adding it to the hash table. Throwing an exception on this kind of error is a performance hit and doesn't improve the program flow.
ContainsKey has a constant O(1) overhead for every item, while catching an exception incurs a performance hit on just the duplicate items.
In most situations, I'd say check for the key, but in this case, it's better to catch the exception.
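For reference, a minimal sketch of that catch approach (Hashtable.Add throws an ArgumentException on a duplicate key):

try
{
    myHashtable.Add(key, value);
}
catch (ArgumentException)
{
    duplicates.Add(key);
}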
Here is a solution which avoids multiple hits in the secondary list with a small overhead to all insertions:
Dictionary<string, List<string>> dict = new Dictionary<string, List<string>>();

// Insert item
if (!dict.ContainsKey(key))
    dict[key] = new List<string>();
dict[key].Add(value);
You can wrap the dictionary in a type that hides this, or put it in a method, or even an extension method on the dictionary, as sketched below.
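A hypothetical extension method capturing the insert-or-append pattern:

public static class DictionaryExtensions
{
    // Appends value to the list stored under key, creating the list on first use.
    public static void AddToList<TKey, TValue>(
        this Dictionary<TKey, List<TValue>> dict, TKey key, TValue value)
    {
        if (!dict.TryGetValue(key, out List<TValue> list))
        {
            list = new List<TValue>();
            dict[key] = list;
        }
        list.Add(value);
    }
}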
If you have more than 4 (for example) CSV values, it might also be worth building the value with a StringBuilder, since repeated string concatenation is slow; see the sketch after this paragraph.
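The concatenation loop from the question could be rewritten along these lines (a sketch reusing the question's tempArray variable):

var sb = new StringBuilder();
Boolean first = true;
String key = "";
foreach (String x in tempArray)
{
    if (first)
    {
        key = x;
        first = false;
    }
    else
    {
        sb.Append(x).Append(',');
    }
}
String value = sb.ToString();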
Hmm, 1.7 million lines? I hesitate to offer this for that kind of load. Here's one way to do it using LINQ:
CSVReader csvReader = new CSVReader();
List<string> source = new List<string>();

using (StreamReader sr = new StreamReader(myFileName))
{
    while (!sr.EndOfStream)
    {
        source.Add(sr.ReadLine());
    }
}

List<string> ServMissing =
    source
    .Where(s => s.StartsWith(" "))
    .ToList();
//--------------------------------------------------
List<IGrouping<string, string>> groupedSource =
    (
        from s in source
        where !s.StartsWith(" ")
        let parsed = csvReader.CSVParser(s)
        where parsed.Any()
        let first = parsed.First()
        let rest = String.Join(",", parsed.Skip(1).ToArray())
        select new { first, rest }
    )
    .GroupBy(x => x.first, x => x.rest) // GroupBy(keySelector, elementSelector)
    .ToList();
//--------------------------------------------------
List<string> myExtras = new List<string>();
foreach (IGrouping<string, string> g in groupedSource)
{
    myHashTable.Add(g.Key, g.First());
    if (g.Skip(1).Any())
    {
        myExtras.Add(g.Key);
    }
}
Thank you all.
I ended up using the ContainsKey() method. It adds maybe 30 seconds, which is fine for my purposes. I'm loading about 1.7 million lines, and the program takes about 7 minutes total to load up two files, compare them, and write out a few files. It only takes about 2 seconds to do the compare and write out the files.