Linq query for building a dictionary from a reg file - c#

I'm building a simple dictionary from a reg file (export from Windows Regedit). The .reg file contains a key in square brackets, followed by zero or more lines of text, followed by a blank line. This code will create the dictionary that I need:
var a = File.ReadLines("test.reg");
var dict = new Dictionary<String, List<String>>();
foreach (var key in a) {
if (key.StartsWith("[HKEY")) {
var iter = a.GetEnumerator();
var value = new List<String>();
do {
iter.MoveNext();
value.Add(iter.Current);
} while (String.IsNullOrWhiteSpace(iter.Current) == false);
dict.Add(key, value);
}
}
I feel like there is a cleaner (prettier?) way to do this in a single Linq statement (using a group by), but it's unclear to me how to implement the iteration of the value items into a list. I suspect I could do the same GetEnumerator in a let statement but it seems like there should be a way to implement this without resorting to an explicit iterator.
Sample data:
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.msu]
@="Microsoft.System.Update.1"
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS]
@="WMP11.AssocFile.M2TS"
"Content Type"="video/vnd.dlna.mpeg-tts"
"PerceivedType"="video"
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\OpenWithProgIds]
"WMP11.AssocFile.M2TS"=hex(0):
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\ShellEx]
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\ShellEx\{BB2E617C-0920-11D1-9A0B-00C04FC2D6C1}]
@="{9DBD2C50-62AD-11D0-B806-00C04FD706EC}"
Update
I'm sorry, I need to be more specific. The files I am looking at are around 300 MB, so I took the approach I did to keep the memory footprint down. I'd prefer an approach that doesn't require pulling the entire file into memory.

You can always use Regex:
var dict = new Dictionary<String, List<String>>();
var a = File.ReadAllText(@"test.reg");
var results = Regex.Matches(a, "(\\[[^\\]]+\\])([^\\[]+)\r\n\r\n", RegexOptions.Singleline);
foreach (Match item in results)
{
dict.Add(
item.Groups[1].Value,
item.Groups[2].Value.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries).ToList()
);
}
I wrote this quickly, so you might be able to improve the regex pattern.

Instead of using GetEnumerator, you can take advantage of the TakeWhile and Skip methods to break your list into smaller lists (each sublist represents one key and its values):
var registryLines = File.ReadLines("test.reg");
Dictionary<string, List<string>> resultKeys = new Dictionary<string, List<string>>();
while (registryLines.Count() > 0)
{
// Take the key and values into a single list
var keyValues = registryLines.TakeWhile(x => !String.IsNullOrWhiteSpace(x)).ToList();
// Adds a new entry to the dictionary using the first value as key and the rest of the list as value
if (keyValues != null && keyValues.Count > 0)
resultKeys.Add(keyValues[0], keyValues.Skip(1).ToList());
// Jumps to the next registry (+1 to skip the blank line)
registryLines = registryLines.Skip(keyValues.Count + 1);
}
EDIT based on your update
Update I'm sorry I need to be more specific. The files am looking at
around ~300MB so I took the approach I did to keep the memory
footprint down. I'd prefer an approach that doesn't require pulling
the entire file into memory.
Well, if you can't read the whole file into memory, asking for a LINQ solution makes little sense to me. Here is a sample of how you can do it reading line by line (still no need for GetEnumerator):
Dictionary<string, List<string>> resultKeys = new Dictionary<string, List<string>>();
using (StreamReader reader = File.OpenText("test.reg"))
{
List<string> keyAndValues = new List<string>();
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
// Adds key and values to a list until it finds a blank line
if (!string.IsNullOrWhiteSpace(line))
keyAndValues.Add(line);
else
{
// Adds a new entry to the dictionary using the first value as key and the rest of the list as value
if (keyAndValues != null && keyAndValues.Count > 0)
resultKeys.Add(keyAndValues[0], keyAndValues.Skip(1).ToList());
// Starts a new Key collection
keyAndValues = new List<string>();
}
}
}
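The line-by-line idea above can also be wrapped in an iterator method, so the grouping stays lazy and can still be fed into LINQ. This is only a sketch under stated assumptions: the names RegFileSections and ReadSections are hypothetical, and it assumes that every key line starts with "[" and that no value line does. Note that the resulting dictionary still holds all keys and values; only the raw file text is never held in memory at once.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class RegFileSections
{
    // Lazily yields one (key, values) pair per registry key.
    // Assumption: a line starting with "[" begins a new key; blank lines
    // and any lines before the first key (such as a file header) are skipped.
    public static IEnumerable<KeyValuePair<string, List<string>>> ReadSections(IEnumerable<string> lines)
    {
        string key = null;
        var values = new List<string>();
        foreach (var line in lines)
        {
            if (line.StartsWith("[", StringComparison.Ordinal))
            {
                // A new key begins, so emit the previous one first.
                if (key != null)
                    yield return new KeyValuePair<string, List<string>>(key, values);
                key = line;
                values = new List<string>();
            }
            else if (key != null && !string.IsNullOrWhiteSpace(line))
            {
                values.Add(line);
            }
        }
        // The last key has no following key line to trigger its emission.
        if (key != null)
            yield return new KeyValuePair<string, List<string>>(key, values);
    }
}
```

A possible call would be RegFileSections.ReadSections(File.ReadLines("test.reg")).ToDictionary(p => p.Key, p => p.Value), which enumerates the file once.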

If memory is not a constraint, you can use code like this:
var lines = File.ReadAllText(fileName);
var result =
Regex.Matches(lines, @"\[(?<key>HKEY[^]]+)\]\s+(?<value>[^[]+)")
.OfType<Match>()
.ToDictionary(k => k.Groups["key"].Value, v => v.Groups["value"].Value.Trim('\n', '\r', ' '));
This took 24.173 seconds for a file with more than 4 million lines (about 550 MB), using 1.2 GB of memory.
Edit :
The best way is to use File.ReadLines, as it is lazy (File.ReadAllLines loads the whole file into an array first):
var lines = File.ReadLines(fileName);
var keyRegex = new Regex(@"\[(?<key>HKEY[^]]+)\]");
var currentKey = string.Empty;
var currentValue = string.Empty;
var result = new Dictionary<string, string>();
foreach (var line in lines)
{
var match = keyRegex.Match(line);
if (match.Length > 0)
{
if (!string.IsNullOrEmpty(currentKey))
{
result.Add(currentKey, currentValue);
currentValue = string.Empty;
}
currentKey = match.Groups["key"].ToString();
}
else
{
currentValue += line;
}
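}
// Flush the last key, which has no following key line to trigger the add.
if (!string.IsNullOrEmpty(currentKey))
{
result.Add(currentKey, currentValue);
}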
This took 17093 milliseconds for a file with 795180 lines.

Related

Check if string contains dictionary keys and replace the matching substring with values from dictionary

I am parsing a template file which will contain certain keys that I need to map values to. Take a line from the file for example:
Field InspectionStationID 3 {"PVA TePla #WSM#", "sw#data.tool_context.TOOL_SOFTWARE_VERSION#", "#data.context.TOOL_ENTITY#"}
I need to replace the string within the # symbols with values from a dictionary.
So there can be multiple keys from the dictionary. However, not all strings inside the # are in the dictionary so for those, I will have to replace them with empty string.
I can't seem to find a way to do this. And yes, I have looked at this solution:
check if string contains dictionary Key -> remove key and add value
For now what I have is this (where I read from the template file line by line and then write to a different file):
string line = string.Empty;
var dict = new Dictionary<string, string>() {
{ "data.tool_context.TOOL_SOFTWARE_VERSION", "sw0.2.002" },
{"data.context.TOOL_ENTITY", "WSM102" }
};
StringBuilder inputText = new StringBuilder();
StreamWriter writeKlarf = new StreamWriter(klarfOutputNameActual);
using (StreamReader sr = new StreamReader(WSMTemplatePath))
{
while((line = sr.ReadLine()) != null)
{
//Console.WriteLine(line);
if (line.Contains("#"))
{
}
else
{
writeKlarf.WriteLine(line);
}
}
}
writeKlarf.Close();
The idea is that, for each line, the text between a pair of # symbols (including the symbols themselves) is replaced with the matching value from the dictionary if that text is a key in the dictionary. How can I do this?
Sample Output Given the line above:
Field InspectionStationID 3 {"PVA TePla", "sw0.2.002", "WSM102"}
Here, because #WSM# is not in the dictionary, it is replaced with an empty string.
One more thing: this logic only applies to the first quarter of the file. The rest of the file has other data that will need to be entered via different logic, so I am not sure whether it makes sense to read the whole file into memory just for the header section.
Here's a quick example that I wrote for you, hopefully this is what you're asking for.
This will let you have a <string, string> Dictionary, check for the Key inside of a delimiter, and if the text inside of the delimiter matches the Dictionary key, it will replace the text. It won't edit any of the inputted strings that don't have any matches.
If you want to delete the unmatched value instead of leaving it alone, replace the kvp.Value in the line.Replace() with String.Empty
var dict = new Dictionary<string, string>() {
{ "test", "cool test" }
};
string line = "#test# is now replaced.";
foreach (var kvp in dict)
{
string split = line.Split('#')[1];
if (split == kvp.Key)
{
line = line.Replace($"#{split}#", kvp.Value);
}
Console.WriteLine(line);
}
Console.ReadLine();
If you had a list of tuples holding the find and replace strings, you could read the file, apply each replacement, and then write the result to a new file:
var frs = new List<(string F, string R)>(){
("#data.tool_context.TOOL_SOFTWARE_VERSION#", "sw0.2.002"),
("#otherfield#", "replacement here")
};
var i = File.ReadAllText("path");
frs.ForEach(fr => i = i.Replace(fr.F,fr.R));
File.WriteAllText("path2", i);
The choice to use a list vs dictionary is fairly arbitrary; List has a ForEach method but it could just as easily be a foreach loop on a dictionary. I included the ## in the find string because I got the impression the output is not supposed to contain ##..
This version leaves alone any template parameters that aren't available
You can try matching #...# keys with the help of regular expressions:
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
...
static string MyReplace(string value, IDictionary<string, string> subs) => Regex
.Replace(value, "#[^#]*#", match => subs.TryGetValue(
match.Value.Substring(1, match.Value.Length - 2), out var item) ? item : "");
Then you can apply it to the file: we read the file's lines, process them with the help of LINQ, and write them into another file.
var dict = new Dictionary<string, string>() {
{"data.tool_context.TOOL_SOFTWARE_VERSION", "sw0.2.002" },
{"data.context.TOOL_ENTITY", "WSM102" },
};
File.WriteAllLines(klarfOutputNameActual, File
.ReadLines(WSMTemplatePath)
.Select(line => MyReplace(line, dict)));
Edit: If you want to switch off MyReplace from some line onward:
bool doReplace = true;
File.WriteAllLines(klarfOutputNameActual, File
.ReadLines(WSMTemplatePath)
.Select(line => {
//TODO: having line check if we want to keep replacing
if (!doReplace || SomeCondition(line)) {
doReplace = false;
return line;
}
return MyReplace(line, dict);
}));
Here SomeCondition(line) returns true whenever header ends and we should not replace #..# any more.
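To see what the regex approach does with the sample line from the question, here is a small self-contained sketch. The names TemplateDemo and ReplaceTokens are hypothetical; the method follows the same idea as MyReplace, but uses a capture group instead of Substring to get the key.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateDemo
{
    // Replaces every #key# token with its dictionary value,
    // or with an empty string when the key is not in the dictionary.
    public static string ReplaceTokens(string value, IDictionary<string, string> subs) =>
        Regex.Replace(value, "#([^#]*)#",
            m => subs.TryGetValue(m.Groups[1].Value, out var item) ? item : "");

    // Runs the replacement on the sample line from the question.
    public static string RunSample()
    {
        var dict = new Dictionary<string, string>
        {
            { "data.tool_context.TOOL_SOFTWARE_VERSION", "sw0.2.002" },
            { "data.context.TOOL_ENTITY", "WSM102" },
        };
        string line = "Field InspectionStationID 3 {\"PVA TePla #WSM#\", "
            + "\"sw#data.tool_context.TOOL_SOFTWARE_VERSION#\", \"#data.context.TOOL_ENTITY#\"}";
        return ReplaceTokens(line, dict);
    }
}
```

Calling TemplateDemo.RunSample() returns Field InspectionStationID 3 {"PVA TePla ", "swsw0.2.002", "WSM102"}. This differs from the expected output in the question in two places: the space before #WSM# is kept, and the literal sw in the template is followed by a value that already starts with sw. If that matters, either the template or the dictionary values would need adjusting.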

Remove names that contain another in a list

I have a file with "Name|Number" on each line, and I wish to remove the lines whose name contains another name in the list.
For example, if the file contains "PEDRO|3", "PEDROFILHO|5" and "PEDROPHELIS|1", I wish to remove the lines "PEDROFILHO|5" and "PEDROPHELIS|1".
The list has 1.8 million lines. I wrote it like this, but it's too slow:
List<string> names = File.ReadAllLines("firstNames.txt").ToList();
List<string> result = File.ReadAllLines("firstNames.txt").ToList();
foreach (string name in names)
{
string tempName = name.Split('|')[0];
List<string> temp = names.Where(t => t.Contains(tempName)).ToList();
foreach (string str in temp)
{
if (str.Equals(name))
{
continue;
}
result.Remove(str);
}
}
File.WriteAllLines("result.txt",result);
Does anyone know a faster way? Or how to improve the speed?
Since you are looking for matches everywhere in the word, you will end up with an O(n²) algorithm. You can improve the implementation a bit by avoiding string deletion inside a list, which is an O(n) operation in itself:
var toDelete = new HashSet<string>();
var names = File.ReadAllLines("firstNames.txt");
foreach (string name in names) {
var tempName = name.Split('|')[0];
toDelete.UnionWith(
// Length constraint removes self-matches
names.Where(t => t.Length > name.Length && t.Contains(tempName))
);
}
File.WriteAllLines("result.txt", names.Where(name => !toDelete.Contains(name)));
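If the names are short, one way to avoid comparing every line against every other line is to put all names into a HashSet and test each name's own substrings against it. The following is only a sketch of that alternative, not the method from the answer above; the names NameFilter, RemoveContainingNames and ContainsOtherName are hypothetical, and the comparison is assumed to be case-sensitive. The number of hash lookups grows with the number of lines times the square of the name length, rather than with the square of the number of lines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NameFilter
{
    // Keeps only the lines whose name does not contain a different name from the list.
    public static List<string> RemoveContainingNames(IReadOnlyList<string> lines)
    {
        // The name is the part before '|' in each "Name|Number" line.
        var names = lines.Select(l => l.Split('|')[0]).ToList();
        var known = new HashSet<string>(names, StringComparer.Ordinal);
        var result = new List<string>();
        for (int i = 0; i < lines.Count; i++)
        {
            if (!ContainsOtherName(names[i], known))
                result.Add(lines[i]);
        }
        return result;
    }

    // True if any proper substring of name is itself a known name.
    private static bool ContainsOtherName(string name, HashSet<string> known)
    {
        for (int start = 0; start < name.Length; start++)
        {
            for (int len = 1; start + len <= name.Length; len++)
            {
                if (len == name.Length) continue; // skip the full name itself
                if (known.Contains(name.Substring(start, len)))
                    return true;
            }
        }
        return false;
    }
}
```

A possible call would be File.WriteAllLines("result.txt", NameFilter.RemoveContainingNames(File.ReadAllLines("firstNames.txt"))).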
This works, but I don't know if it's quicker. I haven't tested it on millions of lines. Remove the ToLower calls if the names are all in the same case.
List<string> names = File.ReadAllLines(@"C:\Users\Rob\Desktop\File.txt").ToList();
var result = names.Where(w => !names.Any(a=> w.Split('|')[0].Length> a.Split('|')[0].Length && w.Split('|')[0].ToLower().Contains(a.Split('|')[0].ToLower())));
File.WriteAllLines(@"C:\Users\Rob\Desktop\result.txt", result);
test file had
Rob|1
Robbie|2
Bert|3
Robert|4
Jan|5
John|6
Janice|7
Carol|8
Carolyne|9
Geoff|10
Geoffrey|11
Result had
Rob|1
Bert|3
Jan|5
John|6
Carol|8
Geoff|10

Adding to the dictionary from a text document C#

I need to read a text document line by line, split each line on =, and add the resulting key and value to a dictionary. Can you help me, please?
using (StreamReader sr = new StreamReader("slovardata.txt"))
{
string _line;
while ((_line = sr.ReadLine()) != null)
{
string[] keyvalue = _line.Split('=');
if (keyvalue.Length == 2)
{
slovarik.Add(keyvalue[0], keyvalue[1]);
}
}
}
You can read all lines of the file with File.ReadAllLines and, after splitting every line into a key and a value, add them to the dictionary as in the following code.
Caution: it silently ignores lines that do not split into exactly two parts, and it may throw an ArgumentException ("An item with the same key has already been added") on duplicate keys.
var lines = System.IO.File.ReadAllLines("slovardata.txt");
lines.Select(line=>line.Split('='))
.Where(line=>line.Length ==2)
.ToList()
.ForEach(line=> slovarik.Add(line[0],line[1]));
Note that the .ToList().ForEach(...) approach allocates an intermediate list, which creates a lot of garbage for large files. If there are no duplicate keys, you can use the following instead:
var slovarik = lines.Select(line=>line.Split('='))
.Where(line=>line.Length ==2)
.ToDictionary(line => line[0], line => line[1]);
For a simple text file read operation you can use something like this.
Note: Make sure your keys are unique, otherwise you will get System.ArgumentException: An item with the same key has already been added.
string[] FileContents = File.ReadAllLines(@"c:\slovardata.txt");
Dictionary<string, string> dict = new Dictionary<string, string>();
foreach (string line in FileContents)
{
var keyvalue = Regex.Match(line, @"(.*)=(.*)");
dict.Add(keyvalue.Groups[1].Value, keyvalue.Groups[2].Value);
}
foreach (var item in dict)
{
Console.WriteLine("Key : " + item.Key + "\tValue : " + item.Value);
}
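Both answers above warn about duplicate keys and skipped lines. A minimal sketch of one way to handle these cases is shown below; the names KeyValueFile and Parse are hypothetical, and letting the last occurrence of a key win is an assumption about the desired behavior, not something stated in the question.

```csharp
using System;
using System.Collections.Generic;

public static class KeyValueFile
{
    // Builds a dictionary from "key=value" lines.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var dict = new Dictionary<string, string>();
        foreach (var line in lines)
        {
            // Split on the first '=' only, so a value may itself contain '='.
            var parts = line.Split(new[] { '=' }, 2);
            if (parts.Length != 2)
                continue; // skip lines without '='
            // The indexer overwrites on a repeated key instead of throwing,
            // so the last occurrence wins (an assumption, see above).
            dict[parts[0].Trim()] = parts[1].Trim();
        }
        return dict;
    }
}
```

A possible call would be KeyValueFile.Parse(File.ReadLines("slovardata.txt")), which reads the file lazily.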

Finding all similar lines in a text file

I have a text file that contains comma-separated values. It looks like this:
3,23500,R,5998,20.38,06/12/2013 01:44:17
2,23500,P,5983,20.234,06/12/2013 01:44:17
3,23501,R,5998,20.38,06/12/2013 01:44:18
2,23501,P,5983,20.235,06/12/2013 01:44:18
3,23502,R,6000,20.4,06/12/2013 01:44:19
2,23502,P,5983,20.236,06/12/2013 01:44:19
3,23503,R,5999,20.39,06/12/2013 01:44:20
2,23503,P,5983,20.236,06/12/2013 01:44:20
My task is to extract the lines that start with the same number into separate files. For example, in the case above some lines start with 2 and some with 3, and there can be more cases, such as 4.
What would be the best and fastest approach to do this? The files that I am working with are quite big, sometimes on the order of gigabytes.
I split each line, stored the first value (the number I am looking for) in an array, and then removed duplicate values from the array. It works, but it is very slow!
This is my own code:
private void buttonBeginProcess_Click(object sender, EventArgs e)
{
var file = File.ReadAllLines(_fileName);
var nodeId = new List<int>();
foreach (var line in file)
{
nodeId.Add(int.Parse(line.Split(',')[0]));
}
//Unique numbers
nodeId = nodeId.Distinct().ToList();
}
var lines = File.ReadLines(myFilePath);
var lineGroups = lines
.Where(line => line.Contains(","))
.Select(line => new{key = line.Split(',')[0], line})
.GroupBy(x => x.key);
foreach(var lineGroup in lineGroups)
{
var key = lineGroup.Key;
var keySpecificLines = lineGroup.Select(x => x.line);
//save keySpecificLines to file
}
You could try using StreamReader / StreamWriter to process each file one line at a time:
var writers = new Dictionary<string, StreamWriter>();
using (StreamReader sr = new StreamReader(pathToFile))
{
while (sr.Peek() >= 0)
{
var line = sr.ReadLine();
var key = line.Split(new[]{ ',' },2)[0];
if (!writers.ContainsKey(key))
{
writers[key] = new StreamWriter(GetPathToOutput(key));
}
writers[key].WriteLine(line);
}
}
foreach(StreamWriter sw in writers.Values)
{
sw.Dispose();
}
With this method, you ensure that your code never has to consume the entire input file, so it shouldn't matter how large your input files are. Of course the downside is it would have to keep an arbitrary number of files open throughout the process.

C# Exception Handling continue on error

I have a basic C# console application that reads a text file (CSV format) line by line and puts the data into a Hashtable. The first CSV item in the line is the key (id num) and the rest of the line is the value. However, I've discovered that my import file has a few duplicate keys that it shouldn't have. When I try to import the file, the application errors out because you can't have duplicate keys in a Hashtable. I want my program to be able to handle this error, though. When I run into a duplicate key, I would like to put that key into an ArrayList and continue importing the rest of the data into the Hashtable. How can I do this in C#?
Here is my code:
private static Hashtable importFile(Hashtable myHashtable, String myFileName)
{
StreamReader sr = new StreamReader(myFileName);
CSVReader csvReader = new CSVReader();
ArrayList tempArray = new ArrayList();
int count = 0;
while (!sr.EndOfStream)
{
String temp = sr.ReadLine();
if (temp.StartsWith(" "))
{
ServMissing.Add(temp);
}
else
{
tempArray = csvReader.CSVParser(temp);
Boolean first = true;
String key = "";
String value = "";
foreach (String x in tempArray)
{
if (first)
{
key = x;
first = false;
}
else
{
value += x + ",";
}
}
myHashtable.Add(key, value);
}
count++;
}
Console.WriteLine("Import Count: " + count);
return myHashtable;
}
if (myHashtable.ContainsKey(key))
duplicates.Add(key);
else
myHashtable.Add(key, value);
A better solution is to call ContainsKey to check whether the key exists before adding it to the hash table. Throwing an exception for this kind of error is a performance hit and doesn't improve the program flow.
ContainsKey has a constant O(1) overhead for every item, while catching an exception incurs a performance hit on just the duplicate items.
In most situations I'd say check for the key, but in this case it's better to catch the exception.
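When a generic Dictionary is an option instead of Hashtable, TryAdd combines the check and the insert into a single lookup and avoids exceptions altogether. This is only a sketch: the names DuplicateTracking and Load are hypothetical, and it assumes a runtime where Dictionary<TKey, TValue>.TryAdd is available (.NET Core 2.0 or later, or .NET Standard 2.1).

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateTracking
{
    // Adds each row to target and returns the keys that were already present.
    // The first value seen for a key is kept; later duplicates are only recorded.
    public static List<string> Load(
        IEnumerable<KeyValuePair<string, string>> rows,
        Dictionary<string, string> target)
    {
        var duplicates = new List<string>();
        foreach (var row in rows)
        {
            // TryAdd returns false instead of throwing when the key exists.
            if (!target.TryAdd(row.Key, row.Value))
                duplicates.Add(row.Key);
        }
        return duplicates;
    }
}
```

The returned list plays the role of the ArrayList of duplicate keys that the question asks for.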
Here is a solution which avoids multiple hits in the secondary list with a small overhead to all insertions:
Dictionary<T, List<K>> dict = new Dictionary<T, List<K>>();
//Insert item
if (!dict.ContainsKey(key))
dict[key] = new List<K>();
dict[key].Add(value);
You can wrap the dictionary in a type that hides this or put it in a method or even extension method on dictionary.
If you have more than 4 (for example) CSV values, it might be worth building the value variable with a StringBuilder as well, since repeated string concatenation is slow.
Hmm, 1.7 Million lines? I hesitate to offer this for that kind of load.
Here's one way to do this using LINQ.
CSVReader csvReader = new CSVReader();
List<string> source = new List<string>();
using(StreamReader sr = new StreamReader(myFileName))
{
while (!sr.EndOfStream)
{
source.Add(sr.ReadLine());
}
}
List<string> ServMissing =
source
.Where(s => s.StartsWith(" "))
.ToList();
//--------------------------------------------------
List<IGrouping<string, string>> groupedSource =
(
from s in source
where !s.StartsWith(" ")
let parsed = csvReader.CSVParser(s)
where parsed.Any()
let first = parsed.First()
let rest = String.Join( "," , parsed.Skip(1).ToArray())
select new {first, rest}
)
.GroupBy(x => x.first, x => x.rest) //GroupBy(keySelector, elementSelector)
.ToList();
//--------------------------------------------------
List<string> myExtras = new List<string>();
foreach(IGrouping<string, string> g in groupedSource)
{
myHashtable.Add(g.Key, g.First());
if (g.Skip(1).Any())
{
myExtras.Add(g.Key);
}
}
Thank you all.
I ended up using the ContainsKey() method. It takes maybe 30 secs longer, which is fine for my purposes. I'm loading about 1.7 million lines and the program takes about 7 mins total to load up two files, compare them, and write out a few files. It only takes about 2 secs to do the compare and write out the files.
