Adding to the dictionary from a text document C# - c#

I need to read a text document line by line, split each line on '=', and add the resulting key and value to a dictionary. Can you help me, please?
using (StreamReader sr = new StreamReader("slovardata.txt"))
{
string _line;
while ((_line = sr.ReadLine()) != null)
{
string[] keyvalue = _line.Split('=');
if (keyvalue.Length == 2)
{
slovarik.Add(keyvalue[0], keyvalue[1]);
}
}
}

You can read all lines of the file with File.ReadAllLines, split every line into a key and a value, and add each pair to the dictionary, as in the following code.
Caution: it silently skips any line that does not split into exactly two parts, and it throws an ArgumentException ("An item with the same key has already been added") if a key occurs twice.
var lines = System.IO.File.ReadAllLines("slovardata.txt");
lines.Select(line=>line.Split('='))
.Where(line=>line.Length ==2)
.ToList()
.ForEach(line=> slovarik.Add(line[0],line[1]));
Note that .ToList().ForEach(...) allocates an intermediate list, which creates a lot of garbage for large files. If there are no duplicate keys, you can build the dictionary directly:
var slovarik = lines.Select(line=>line.Split('='))
.Where(line=>line.Length ==2)
.ToDictionary(line=>line[0], line=>line[1]);
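The two cautions above (skipped lines and duplicate keys) can be handled in one small helper. This is only a sketch under stated assumptions: `KeyValueFile` and `ParseLines` are hypothetical names, the split happens on the first '=' only, and the last value wins when a key repeats.

```csharp
using System;
using System.Collections.Generic;

public static class KeyValueFile
{
    // Hypothetical helper: turns "key=value" lines into a dictionary.
    // Assumptions: split on the first '=' only, skip lines without '=',
    // and let the last occurrence of a duplicate key win.
    public static Dictionary<string, string> ParseLines(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (string line in lines)
        {
            // A count of 2 keeps any further '=' characters inside the value.
            string[] parts = line.Split(new[] { '=' }, 2);
            if (parts.Length != 2)
            {
                continue; // no '=' on this line
            }
            // The indexer overwrites instead of throwing on a duplicate key.
            result[parts[0].Trim()] = parts[1].Trim();
        }
        return result;
    }
}
```

It could be called as KeyValueFile.ParseLines(File.ReadLines("slovardata.txt")), which streams the file instead of loading every line up front.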

For a simple text file read operation you can use something like this.
Note: Make sure your keys are unique, otherwise you will get a System.ArgumentException: An item with the same key has already been added.
string[] FileContents = File.ReadAllLines(@"c:\slovardata.txt");
Dictionary<string, string> dict = new Dictionary<string, string>();
foreach (string line in FileContents)
{
var keyvalue = Regex.Match(line, @"(.*)=(.*)");
dict.Add(keyvalue.Groups[1].Value, keyvalue.Groups[2].Value);
}
foreach (var item in dict)
{
Console.WriteLine("Key : " + item.Key + "\tValue : " + item.Value);
}

Related

'An item with the same key has already been added.' [duplicate]

I keep getting an error with the following code:
Dictionary<string, string> rct3Features = new Dictionary<string, string>();
Dictionary<string, string> rct4Features = new Dictionary<string, string>();
foreach (string line in rct3Lines)
{
string[] items = line.Split(new String[] { " " }, 2, StringSplitOptions.None);
rct3Features.Add(items[0], items[1]);
////To print out the dictionary (to see if it works)
//foreach (KeyValuePair<string, string> item in rct3Features)
//{
// Console.WriteLine(item.Key + " " + item.Value);
//}
}
The error throws an ArgumentException saying,
"An item with the same key has already been added."
I am unsure after several Google searches how to fix this.
Later in the code I need to access the dictionary for a compare function:
Compare4To3(rct4Features, rct3Features);
public static void Compare4To3(Dictionary<string, string> dictionaryOne, Dictionary<string, string> dictionaryTwo)
{
//foreach (string item in dictionaryOne)
//{
//To print out the dictionary (to see if it works)
foreach (KeyValuePair<string, string> item in dictionaryOne)
{
Console.WriteLine(item.Key + " " + item.Value);
}
//if (dictionaryTwo.ContainsKey(dictionaryOne.Keys)
//{
// Console.Write("True");
//}
//else
//{
// Console.Write("False");
//}
//}
}
This function isn't completed, but I am trying to resolve this exception. What are the ways I can fix this exception error, and keep access to the dictionary for use with this function? Thank you
This error is fairly self-explanatory. Dictionary keys are unique and you cannot have more than one of the same key. To fix this, you should modify your code like so:
Dictionary<string, string> rct3Features = new Dictionary<string, string>();
Dictionary<string, string> rct4Features = new Dictionary<string, string>();
foreach (string line in rct3Lines)
{
string[] items = line.Split(new String[] { " " }, 2, StringSplitOptions.None);
if (!rct3Features.ContainsKey(items[0]))
{
rct3Features.Add(items[0], items[1]);
}
////To print out the dictionary (to see if it works)
//foreach (KeyValuePair<string, string> item in rct3Features)
//{
// Console.WriteLine(item.Key + " " + item.Value);
//}
}
This simple if statement ensures that you are only attempting to add a new entry to the Dictionary when the Key (items[0]) is not already present.
If you want "insert or replace" semantics, use this syntax:
A[key] = value; // <-- insert or replace semantics
It's more efficient and readable than calls involving "ContainsKey()" or "Remove()" prior to "Add()".
So in your case:
rct3Features[items[0]] = items[1];
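The difference between Add and the indexer can be checked in isolation. A minimal sketch; the class and method names here are hypothetical and exist only for the demonstration.

```csharp
using System;
using System.Collections.Generic;

public static class AddVersusIndexer
{
    // Returns true if Add rejected the duplicate key.
    public static bool AddThrowsOnDuplicate()
    {
        var d = new Dictionary<string, string> { { "Key1", "Value1" } };
        try
        {
            d.Add("Key1", "Value3");
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    // Returns the value stored after assigning the same key twice via the indexer.
    public static string IndexerKeepsLastValue()
    {
        var d = new Dictionary<string, string>();
        d["Key1"] = "Value1"; // inserts
        d["Key1"] = "Value3"; // replaces, no exception
        return d["Key1"];
    }
}
```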
As others have said, you are adding the same key more than once. If this is NOT a valid scenario, then check Jdinklage Morgoone's answer (which only saves the first value found for a key), or consider this workaround (which only saves the last value found for a key):
// This will always overwrite the existing value if one is already stored for this key
rct3Features[items[0]] = items[1];
Otherwise, if it is valid to have multiple values for a single key, then you should consider storing your values in a List<string> for each string key.
For example:
var rct3Features = new Dictionary<string, List<string>>();
var rct4Features = new Dictionary<string, List<string>>();
foreach (string line in rct3Lines)
{
string[] items = line.Split(new String[] { " " }, 2, StringSplitOptions.None);
if (!rct3Features.ContainsKey(items[0]))
{
// No items for this key have been added, so create a new list
// for the value with item[1] as the only item in the list
rct3Features.Add(items[0], new List<string> { items[1] });
}
else
{
// This key already exists, so add item[1] to the existing list value
rct3Features[items[0]].Add(items[1]);
}
}
// To display your keys and values (testing)
foreach (KeyValuePair<string, List<string>> item in rct3Features)
{
Console.WriteLine("The Key: {0} has values:", item.Key);
foreach (string value in item.Value)
{
Console.WriteLine(" - {0}", value);
}
}
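The ContainsKey check followed by an indexer lookup touches the dictionary twice per line. A possible variation, sketched here with the hypothetical names `MultiValue` and `AddValue`, uses TryGetValue so that each line needs a single lookup when the key already exists.

```csharp
using System.Collections.Generic;

public static class MultiValue
{
    // Appends value to the list stored under key, creating the list on first use.
    public static void AddValue(Dictionary<string, List<string>> map, string key, string value)
    {
        if (!map.TryGetValue(key, out List<string> values))
        {
            values = new List<string>();
            map.Add(key, values);
        }
        values.Add(value);
    }
}
```

Inside the loop above, the if/else would then shrink to MultiValue.AddValue(rct3Features, items[0], items[1]).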
To illustrate the problem you are having, let's look at some code...
Dictionary<string, string> test = new Dictionary<string, string>();
test.Add("Key1", "Value1"); // Works fine
test.Add("Key2", "Value2"); // Works fine
test.Add("Key1", "Value3"); // Fails because of duplicate key
A dictionary stores key/value pairs so that you can look a value up by its key, like this...
var myString = test["Key2"]; // myString is now Value2.
If Dictionary had 2 Key2's, it wouldn't know which one to return, so it limits you to a unique key.
That Exception is thrown if there is already a key in the dictionary when you try to add the new one.
There must be more than one line in rct3Lines with the same first word. You can't have 2 entries in the same dictionary with the same key.
You need to decide what you want to happen if the key already exists. If you just want to update the value where the key exists, you can simply write
rct3Features[items[0]] = items[1];
but if not, you may want to test whether the key already exists with:
if (rct3Features.ContainsKey(items[0]))
{
//Do something
}
else
{
//Do something else
}
I suggest .NET's TryAdd:
https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.dictionary-2.tryadd?view=net-7.0
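A short sketch of how TryAdd might be used for the loop in the question, assuming a runtime where Dictionary<TKey, TValue>.TryAdd exists (.NET Core 2.0 or later, or .NET Standard 2.1). `TryAddExample` and `Load` are hypothetical names. TryAdd returns false instead of throwing, so the first value for a key is kept and the repeats can be recorded.

```csharp
using System;
using System.Collections.Generic;

public static class TryAddExample
{
    // Splits each line on the first space and keeps the first value seen per key.
    // Returns a list with one entry for every time an existing key was seen again.
    public static List<string> Load(IEnumerable<string> lines, Dictionary<string, string> target)
    {
        var duplicates = new List<string>();
        foreach (string line in lines)
        {
            string[] items = line.Split(new[] { ' ' }, 2);
            if (items.Length != 2)
            {
                continue; // line has no space, so there is no value part
            }
            // TryAdd returns false and leaves the dictionary unchanged when the key exists.
            if (!target.TryAdd(items[0], items[1]))
            {
                duplicates.Add(items[0]);
            }
        }
        return duplicates;
    }
}
```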
I suggest an extension method for environments where .NET's TryAdd is not available:
public static class DictionaryUtils
{
/// <summary>
/// Prevents exception "Item with Same Key has already been added".
/// </summary>
public static void TryAdd<TKey, TValue>(this Dictionary<TKey, TValue> dictionary, TKey key, TValue value)
{
if (!dictionary.ContainsKey(key))
{
dictionary.Add(key, value);
}
}
}
Clear the dictionary before adding any items to it. I don't know how a dictionary of one object affects another's during assignment but I got the error after creating another object with the same key,value pairs.
NB:
If you are going to add items in a loop just make sure you clear the dictionary before entering the loop.

check if string contains dictionary keys and replace the matching substring with values from dictionary

I am parsing a template file which will contain certain keys that I need to map values to. Take a line from the file for example:
Field InspectionStationID 3 {"PVA TePla #WSM#", "sw#data.tool_context.TOOL_SOFTWARE_VERSION#", "#data.context.TOOL_ENTITY#"}
I need to replace the string within the # symbols with values from a dictionary.
So there can be multiple keys from the dictionary. However, not all strings inside the # are in the dictionary so for those, I will have to replace them with empty string.
I can't seem to find a way to do this. And yes, I have looked at this solution:
check if string contains dictionary Key -> remove key and add value
For now what I have is this (where I read from the template file line by line and then write to a different file):
string line = string.Empty;
var dict = new Dictionary<string, string>() {
{ "data.tool_context.TOOL_SOFTWARE_VERSION", "sw0.2.002" },
{"data.context.TOOL_ENTITY", "WSM102" }
};
StringBuilder inputText = new StringBuilder();
StreamWriter writeKlarf = new StreamWriter(klarfOutputNameActual);
using (StreamReader sr = new StreamReader(WSMTemplatePath))
{
while((line = sr.ReadLine()) != null)
{
//Console.WriteLine(line);
if (line.Contains("#"))
{
}
else
{
writeKlarf.WriteLine(line);
}
}
}
writeKlarf.Close();
The idea is that for each line, the text between a pair of # symbols (including the # symbols themselves) is replaced with the matching value from the dictionary, if that text is a key in the dictionary. How can I do this?
Sample Output Given the line above:
Field InspectionStationID 3 {"PVA TePla", "sw0.2.002", "WSM102"}
Here, because #WSM# is not in the dictionary, it is replaced with an empty string.
One more thing: this logic only applies to the first quarter of the file. The rest of the file will have other data that will need to be entered via other logic, so I am not sure if it makes sense to read the whole file into memory just for the header section?
Here's a quick example that I wrote for you, hopefully this is what you're asking for.
This will let you have a <string, string> Dictionary, check for the Key inside of a delimiter, and if the text inside of the delimiter matches the Dictionary key, it will replace the text. It won't edit any of the inputted strings that don't have any matches.
If you want to delete the unmatched value instead of leaving it alone, replace the kvp.Value in the line.Replace() with String.Empty
var dict = new Dictionary<string, string>() {
{ "test", "cool test" }
};
string line = "#test# is now replaced.";
foreach (var kvp in dict)
{
string split = line.Split('#')[1];
if (split == kvp.Key)
{
line = line.Replace($"#{split}#", kvp.Value);
}
Console.WriteLine(line);
}
Console.ReadLine();
If you had a list of tuples holding the find and replace strings, you could read the file, apply each replacement, and then write the result to a new file:
var frs = new List<(string F, string R)>(){
("#data.tool_context.TOOL_SOFTWARE_VERSION#", "sw0.2.002"),
("#otherfield#", "replacement here")
};
var i = File.ReadAllText("path");
frs.ForEach(fr => i = i.Replace(fr.F,fr.R));
File.WriteAllText("path2", i);
The choice to use a list vs a dictionary is fairly arbitrary; List has a ForEach method, but it could just as easily be a foreach loop over a dictionary. I included the # characters in the find string because I got the impression the output is not supposed to contain them.
This version leaves alone any template parameters that aren't available.
You can try matching #...# keys with a help of regular expressions:
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
...
static string MyReplace(string value, IDictionary<string, string> subs) => Regex
.Replace(value, "#[^#]*#", match => subs.TryGetValue(
match.Value.Substring(1, match.Value.Length - 2), out var item) ? item : "");
then you can apply it to the file: we read file's lines, process them with a help of Linq and write them into another file.
var dict = new Dictionary<string, string>() {
{"data.tool_context.TOOL_SOFTWARE_VERSION", "sw0.2.002" },
{"data.context.TOOL_ENTITY", "WSM102" },
};
File.WriteAllLines(klarfOutputNameActual, File
.ReadLines(WSMTemplatePath)
.Select(line => MyReplace(line, dict)));
Edit: If you want to switch off MyReplace from some line on
bool doReplace = true;
File.WriteAllLines(klarfOutputNameActual, File
.ReadLines(WSMTemplatePath)
.Select(line => {
//TODO: having line check if we want to keep replacing
if (!doReplace || SomeCondition(line)) {
doReplace = false;
return line;
}
return MyReplace(line, dict);
}));
Here SomeCondition(line) returns true whenever header ends and we should not replace #..# any more.
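To see the replacement rule on a single line, here is a self-contained sketch. `TemplateFill` and `Fill` are hypothetical names, and it uses a capture group rather than Substring, which is a small variation on MyReplace and not the answer's exact code. Placeholders that are not in the dictionary become empty strings.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateFill
{
    // Replaces every #name# with subs[name], or with "" when name is unknown.
    public static string Fill(string line, IDictionary<string, string> subs)
    {
        return Regex.Replace(line, "#([^#]*)#", m =>
            subs.TryGetValue(m.Groups[1].Value, out string value) ? value : string.Empty);
    }
}
```

With the dictionary from the question, Fill("tool=#data.context.TOOL_ENTITY# opt=#WSM#", dict) gives "tool=WSM102 opt=". Text around a placeholder is left untouched, so any whitespace or prefix next to a removed placeholder stays in the output.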

Linq query for building a dictionary from a reg file

I'm building a simple dictionary from a reg file (export from Windows Regedit). The .reg file contains a key in square brackets, followed by zero or more lines of text, followed by a blank line. This code will create the dictionary that I need:
var a = File.ReadLines("test.reg");
var dict = new Dictionary<String, List<String>>();
foreach (var key in a) {
if (key.StartsWith("[HKEY")) {
var iter = a.GetEnumerator();
var value = new List<String>();
do {
iter.MoveNext();
value.Add(iter.Current);
} while (String.IsNullOrWhiteSpace(iter.Current) == false);
dict.Add(key, value);
}
}
I feel like there is a cleaner (prettier?) way to do this in a single Linq statement (using a group by), but it's unclear to me how to implement the iteration of the value items into a list. I suspect I could do the same GetEnumerator in a let statement but it seems like there should be a way to implement this without resorting to an explicit iterator.
Sample data:
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.msu]
@="Microsoft.System.Update.1"
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS]
@="WMP11.AssocFile.M2TS"
"Content Type"="video/vnd.dlna.mpeg-tts"
"PerceivedType"="video"
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\OpenWithProgIds]
"WMP11.AssocFile.M2TS"=hex(0):
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\ShellEx]
[HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.MTS\ShellEx\{BB2E617C-0920-11D1-9A0B-00C04FC2D6C1}]
@="{9DBD2C50-62AD-11D0-B806-00C04FD706EC}"
Update
I'm sorry, I need to be more specific. The files I am looking at are around 300MB, so I took the approach I did to keep the memory footprint down. I'd prefer an approach that doesn't require pulling the entire file into memory.
You can always use Regex:
var dict = new Dictionary<String, List<String>>();
var a = File.ReadAllText(@"test.reg");
var results = Regex.Matches(a, "(\\[[^\\]]+\\])([^\\[]+)\r\n\r\n", RegexOptions.Singleline);
foreach (Match item in results)
{
dict.Add(
item.Groups[1].Value,
item.Groups[2].Value.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries).ToList()
);
}
I whipped this out real quick. You might be able to improve the regex pattern.
Instead of using GetEnumerator you can take advantage of TakeWhile and Split methods to break your list into smaller list (each sublist represents one key and its values)
var registryLines = File.ReadLines("test.reg");
Dictionary<string, List<string>> resultKeys = new Dictionary<string, List<string>>();
while (registryLines.Count() > 0)
{
// Take the key and values into a single list
var keyValues = registryLines.TakeWhile(x => !String.IsNullOrWhiteSpace(x)).ToList();
// Adds a new entry to the dictionary using the first value as key and the rest of the list as value
if (keyValues != null && keyValues.Count > 0)
resultKeys.Add(keyValues[0], keyValues.Skip(1).ToList());
// Jumps to the next registry (+1 to skip the blank line)
registryLines = registryLines.Skip(keyValues.Count + 1);
}
EDIT based on your update
Update I'm sorry I need to be more specific. The files am looking at
around ~300MB so I took the approach I did to keep the memory
footprint down. I'd prefer an approach that doesn't require pulling
the entire file into memory.
Well, if you can't read the whole file into memory, it makes no sense to me asking for a LINQ solution. Here is a sample of how you can do it reading line by line (still no need for GetEnumerator)
Dictionary<string, List<string>> resultKeys = new Dictionary<string, List<string>>();
using (StreamReader reader = File.OpenText("test.reg"))
{
List<string> keyAndValues = new List<string>();
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
// Adds key and values to a list until it finds a blank line
if (!string.IsNullOrWhiteSpace(line))
keyAndValues.Add(line);
else
{
// Adds a new entry to the dictionary using the first value as key and the rest of the list as value
if (keyAndValues != null && keyAndValues.Count > 0)
resultKeys.Add(keyAndValues[0], keyAndValues.Skip(1).ToList());
// Starts a new Key collection
keyAndValues = new List<string>();
}
}
}
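One thing to watch with blank-line-delimited parsing is that the last key is lost if the file does not end with a blank line. A possible alternative, sketched here under the assumption that every key line starts with '[', groups lines under the most recent header instead. `RegFile` and `Parse` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class RegFile
{
    // Groups lines under the most recent "[...]" header line.
    // Assumptions: a header starts with '[', blank lines are ignored,
    // and lines before the first header (such as the file banner) are dropped.
    public static Dictionary<string, List<string>> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, List<string>>();
        List<string> current = null;
        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line))
            {
                continue;
            }
            if (line.StartsWith("[", StringComparison.Ordinal))
            {
                current = new List<string>();
                result[line] = current; // indexer: a repeated header replaces the earlier entry
            }
            else if (current != null)
            {
                current.Add(line);
            }
        }
        return result;
    }
}
```

Passing File.ReadLines("test.reg") keeps the read streaming, so only the resulting dictionary is held in memory rather than a second copy of the file.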
I think you can use a code like this - if you can use memory -:
var lines = File.ReadAllText(fileName);
var result =
Regex.Matches(lines, @"\[(?<key>HKEY[^]]+)\]\s+(?<value>[^[]+)")
.OfType<Match>()
.ToDictionary(k => k.Groups["key"], v => v.Groups["value"].ToString().Trim('\n', '\r', ' '));
This will take 24.173 seconds for a file with more than 4 million lines - Size:~550MB - by using 1.2 GB memory.
Edit :
The best way is using File.ReadLines as it is lazy (File.ReadAllLines loads the whole file into an array first):
var lines = File.ReadLines(fileName);
var keyRegex = new Regex(@"\[(?<key>HKEY[^]]+)\]");
var currentKey = string.Empty;
var currentValue = string.Empty;
var result = new Dictionary<string, string>();
foreach (var line in lines)
{
var match = keyRegex.Match(line);
if (match.Length > 0)
{
if (!string.IsNullOrEmpty(currentKey))
{
result.Add(currentKey, currentValue);
currentValue = string.Empty;
}
currentKey = match.Groups["key"].ToString();
}
else
{
currentValue += line;
}
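}
// The loop only stores a key when the next key starts, so store the last one here.
if (!string.IsNullOrEmpty(currentKey))
{
result.Add(currentKey, currentValue);
}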
This will take 17093 milliseconds for a file with 795180 lines.

Matching string values in a dictionary based on partial Key in the same dictionary based on rules stated in XML file

I am trying to find an efficient way to match the strings in this dictionary based on the rules stated in XML file.
I will try to explain the code from the beginning. There are two csv files.
File1.csv
RefID|Firstname|Lastname|ID|DOB
Ref_1|KEN|CARPENTER|67814|1122
Ref_2|TRAY|ROBINSON|67814|1122
Ref_3|TRAY|ROBINSON|67871|1122
Ref_4|TRAN|ROBINSON|67871|1122
Ref_5|LAWSN|PERDUE|6761|2009
Ref_6|MCKEN|BARNUM|6761|2009
Ref_7|MCKEN|BARNUM|6768|2009
Ref_8|MCKEN|BARNUM|6768|2009
Ref_9|TRAN|ROBINSON|67871|1122
File2.csv
SID|Values
TRAROB|Ref_1,Ref_2,Ref_3,Ref_4,Ref_9
MCKBAR|Ref_5,Ref_6,Ref_7,Ref_8
XML :
<?xml version="1.0" encoding="utf-8" ?>
<FeedInfo>
<Rule>
<RuleInfo>
<RuleName>Rule 1</RuleName>
</RuleInfo>
<Rules>
<item name ="FirstName" NoOfChars ="ALL" number ="0"/>
<item name ="LastName" NoOfChars ="ALL" number ="1"/>
<item name ="ID" NoOfChars ="ALL" number ="2" />
</Rules>
</Rule>
</FeedInfo>
I wrote the following code :
static void Main(string[] args)
{
populate();
rulesReader();
}
public static Dictionary<string,string> createDictionary(string dataPath)
{
//creates a dictionary from a file
StreamReader sr = new StreamReader(dataPath);
Dictionary<string, string> refIdVal = new Dictionary<string, string>();
string line = sr.ReadLine();
while ((line = sr.ReadLine()) != null)
{
string key = line.Split('|')[0];
int i = line.IndexOf('|',0) + 1;
int l = line.Length - i;
string value = line.Substring(i,l);
refIdVal.Add(key, value);
}
sr.Close();
return refIdVal;
}
public static Dictionary<string,string> populate()
{
//populates the dictionary with SID,RefID|values format.
string refIdPath = "File1.csv";
string sidPath = "File2.csv";
Dictionary<string, string> final = new Dictionary<string, string>();
Dictionary<string, string> refIdVal = createDictionary(refIdPath);
Dictionary<string, string> sidVal = createDictionary(sidPath);
foreach (KeyValuePair<string, string> pair in sidVal)
{
string[] refIdTockens = pair.Value.Split(',');
for (int i = 0; i <refIdTockens.Length; i++)
{
final.Add(pair.Key + "," + refIdTockens[i], refIdVal[refIdTockens[i]]);
//Console.WriteLine(pair.Key + "," + refIdTockens[i] + "==" + refIdVal[refIdTockens[i]]+ "==" + i);
}
}
foreach (KeyValuePair<string, string> pair in final)
{
Console.WriteLine(pair.Key + "==" + pair.Value);
}
return final;
}
public static Dictionary<string,string> finalOutput(Dictionary<string,string> inputDictionary)
{
Dictionary<string,string> input = inputDictionary;
foreach (KeyValuePair<string, string> pair in input)
{
}
return null;
}
public static Dictionary<String, List<int>> rulesReader()
{
//reads the rules from xml file and returns a dictionary in <string,list> format.
Dictionary<string, List<int>> rulesAndNumbers = new Dictionary<string, List<int>>();
XDocument xDoc = XDocument.Load("rules.xml");
int rulesCount = xDoc.Descendants("RuleName").Count();
string ruleName = null;
string ruleValue = null;
//List<string> ruleNumbers = new List<string>();
var feedDetails = from feed in xDoc.Descendants("Rule")
select new
{
IndexInfo = feed.Descendants("RuleInfo").Descendants(),
IndexRules = feed.Descendants("Rules").Descendants()
};
foreach (var feed in feedDetails)
{
foreach (XElement xe in feed.IndexInfo) //RuleName
{
List<int> ruleNumbers = new List<int>();
ruleName = xe.Value;
foreach (XElement xe1 in feed.IndexRules)
{
ruleValue = xe1.Attribute("number").Value;
ruleNumbers.Add(Int32.Parse(ruleValue));
Console.WriteLine(ruleName + "==" + ruleValue);
}
rulesAndNumbers.Add(ruleName, ruleNumbers);
//ruleNumbers.Clear();
}
}
return rulesAndNumbers;
}
the code above gives me a dictionary in this format:
SID,REFID == FirstName|LastName|ID|DOB ( KEY == VALUE )
SidRefID Dictionary
TRAROB,Ref_1==KEN|CARPENTER|67814|1122
TRAROB,Ref_2==TRAN|ROBINSON|67814|1122
TRAROB,Ref_3==TRAN|ROBINSON|67871|1122
TRAROB,Ref_4==TRAN|ROBINSON|67871|1122
MCKBAR,Ref_5==LAWSN|PERDUE|6761|2009
MCKBAR,Ref_6==MCKEN|BARNUM|6761|2009
MCKBAR,Ref_7==MCKEN|BARNUM|6768|2009
MCKBAR,Ref_8==MCKEN|BARNUM|6768|2009
TRAROB,Ref_9==TRAN|ROBINSON|67871|1122
and a dictionary like this XML Dictionary
[Rule1|0]
[Rule1|1]
[Rule1|2]
Now, after all this, I am stuck here: I need to match all the values with the same partial key, i.e. the SID or Key.Split(',')[0], in the final dictionary, based on the numbers mentioned in the XML. The 0th, 1st and 2nd positions of the array after splitting the values should be concatenated.
I have already created the XML dictionary in <string, List<int>> format. So Ref_1 should be matched with Ref_2, Ref_3, Ref_4 based on (0,1,2), i.e. the concatenation of FirstName, LastName, ID. For example:
Ref1,Ref_2,Ref3,Ref_4 all have same SID (SidRefId Dictionary)
so I need to match
KENCARPENTER67814 with TRAYROBINSON67814 & TRAYROBINSON67871 & TRAYROBINSON67871 & TRAYROBINSON67871 which will return FALSE for KENCARPENTER67814 because none of the string matches with each other, Similarly the desired output is:
RULE1,TRAROB,Ref_1==KEN|CARPENTER|67814|1122|FALSE
RULE1,TRAROB,Ref_2==TRAN|ROBINSON|67814|1122|FALSE
RULE1,TRAROB,Ref_3==TRAN|ROBINSON|67871|1122|TRUE
RULE1,TRAROB,Ref_4==TRAN|ROBINSON|67871|1122|TRUE
RULE1,MCKBAR,Ref_5==LAWSN|PERDUE|6761|2009|FALSE
RULE1,MCKBAR,Ref_6==MCKEN|BARNUM|6761|2009|FALSE
RULE1,MCKBAR,Ref_7==MCKEN|BARNUM|6768|2009|TRUE
RULE1,MCKBAR,Ref_8==MCKEN|BARNUM|6768|2009|TRUE
RULE1,TRAROB,Ref_9==TRAN|ROBINSON|67871|1122|TRUE
I thought of making a copy of the SidRefId dictionary and matching the two against each other, but it's going to take a lot of time for the large files and multiple rules in the XML file that I am going to deal with.
Can someone tell me an efficient way to do this? Thanks!
To me it looks like you're trying to develop your own engine for record linkage. That is, for finding duplicate records that are not exact duplicates. If I were you I wouldn't try to make my own engine, but instead just use one of the already existing ones.
Wikipedia used to have a list of such engines, but it got deleted and I don't know of any other lists, so I'll just link to the one I made: Duke. There are other engines as well.
If you insist on doing this yourself, one way to do it is what you're doing here: build a key for each record, then group by key. That's fairly primitive, though, so you should aim to do more detailed matching after you've matched by key. Just matching by key will cause many false positives.
A more sophisticated approach is to do what I did: index the data up with a search engine like Lucene, then search for similar records and do detailed comparison on the candidates. Or you could use locality-sensitive hashing. Or metric spaces. Or q-gram based indexes.
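The "build a key for each record, then group by key" idea can be sketched as follows. This is only an illustration under assumptions: `RuleMatcher` and `Match` are hypothetical names, the input uses the "SID,RefID" to "First|Last|ID|DOB" format from the question, and the selected fields are joined with '|' rather than concatenated directly so that field boundaries cannot blur. It counts each key once and then flags records, which avoids comparing every record with every other one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RuleMatcher
{
    // records: "SID,RefID" -> "First|Last|ID|DOB" (the format produced by populate()).
    // ruleIndexes: positions of the '|'-separated fields that form the match key, e.g. 0, 1, 2.
    // Returns each record key mapped to true when another record with the same SID
    // has an identical match key.
    public static Dictionary<string, bool> Match(
        IDictionary<string, string> records, IList<int> ruleIndexes)
    {
        var keyed = records.Select(pair =>
        {
            string sid = pair.Key.Split(',')[0];
            string[] fields = pair.Value.Split('|');
            string matchKey = sid + "|" + string.Join("|", ruleIndexes.Select(i => fields[i]));
            return new { pair.Key, MatchKey = matchKey };
        }).ToList();

        // One pass to count each match key, one pass to flag the records.
        Dictionary<string, int> counts = keyed
            .GroupBy(x => x.MatchKey)
            .ToDictionary(g => g.Key, g => g.Count());

        return keyed.ToDictionary(x => x.Key, x => counts[x.MatchKey] > 1);
    }
}
```

For several rules, the same method could be called once per entry of the rules dictionary, prefixing the rule name to the output key.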

C# Exception Handling continue on error

I have a basic C# console application that reads a text file (CSV format) line by line and puts the data into a Hashtable. The first CSV item in the line is the key (id num) and the rest of the line is the value. However, I've discovered that my import file has a few duplicate keys that it shouldn't have. When I try to import the file, the application errors out because you can't have duplicate keys in a Hashtable. I want my program to be able to handle this error, though. When I run into a duplicate key, I would like to put that key into an ArrayList and continue importing the rest of the data into the Hashtable. How can I do this in C#?
Here is my code:
private static Hashtable importFile(Hashtable myHashtable, String myFileName)
{
StreamReader sr = new StreamReader(myFileName);
CSVReader csvReader = new CSVReader();
ArrayList tempArray = new ArrayList();
int count = 0;
while (!sr.EndOfStream)
{
String temp = sr.ReadLine();
if (temp.StartsWith(" "))
{
ServMissing.Add(temp);
}
else
{
tempArray = csvReader.CSVParser(temp);
Boolean first = true;
String key = "";
String value = "";
foreach (String x in tempArray)
{
if (first)
{
key = x;
first = false;
}
else
{
value += x + ",";
}
}
myHashtable.Add(key, value);
}
count++;
}
Console.WriteLine("Import Count: " + count);
return myHashtable;
}
if (myHashtable.ContainsKey(key))
duplicates.Add(key);
else
myHashtable.Add(key, value);
A better solution is to call ContainsKey to check if the key exist before adding it to the hash table instead. Throwing exception on this kind of error is a performance hit and doesn't improve the program flow.
ContainsKey has a constant O(1) overhead for every item, while catching an exception incurs a performance hit on JUST the duplicate items.
In most situations, I'd say check for the key, but in this case, it's better to catch the exception.
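The exception-based variant could look like the sketch below. `HashtableImport` and `AddOrRecord` are hypothetical names; the only library behaviour relied on is that Hashtable.Add throws ArgumentException when the key already exists.

```csharp
using System;
using System.Collections;

public static class HashtableImport
{
    // Adds key/value to table; on a duplicate key, records the key and carries on.
    public static void AddOrRecord(Hashtable table, ArrayList duplicates, string key, string value)
    {
        try
        {
            table.Add(key, value);
        }
        catch (ArgumentException)
        {
            // Thrown by Hashtable.Add when the key already exists.
            // A null key raises ArgumentNullException, which is also caught here.
            duplicates.Add(key);
        }
    }
}
```

In the import loop from the question, the call myHashtable.Add(key, value) would be replaced by AddOrRecord(myHashtable, duplicates, key, value). Whether this beats ContainsKey depends on how rare duplicates are, so it is worth measuring on the real file.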
Here is a solution which avoids multiple hits in the secondary list with a small overhead to all insertions:
Dictionary<T, List<K>> dict = new Dictionary<T, List<K>>();
//Insert item
if (!dict.ContainsKey(key))
dict[key] = new List<K>();
dict[key].Add(value);
You can wrap the dictionary in a type that hides this or put it in a method or even extension method on dictionary.
If you have more than 4 (for example) CSV values, it might be worth setting the value variable to use a StringBuilder as well since the string concatenation is a slow function.
Hmm, 1.7 Million lines? I hesitate to offer this for that kind of load.
Here's one way to do this using LINQ.
CSVReader csvReader = new CSVReader();
List<string> source = new List<string>();
using(StreamReader sr = new StreamReader(myFileName))
{
while (!sr.EndOfStream)
{
source.Add(sr.ReadLine());
}
}
List<string> ServMissing =
source
.Where(s => s.StartsWith(" "))
.ToList();
//--------------------------------------------------
List<IGrouping<string, string>> groupedSource =
(
from s in source
where !s.StartsWith(" ")
let parsed = csvReader.CSVParser(s)
where parsed.Any()
let first = parsed.First()
let rest = String.Join( "," , parsed.Skip(1).ToArray())
select new {first, rest}
)
.GroupBy(x => x.first, x => x.rest) //GroupBy(keySelector, elementSelector)
.ToList();
//--------------------------------------------------
List<string> myExtras = new List<string>();
foreach(IGrouping<string, string> g in groupedSource)
{
myHashtable.Add(g.Key, g.First());
if (g.Skip(1).Any())
{
myExtras.Add(g.Key);
}
}
Thank you all.
I ended up using the ContainsKey() method. It takes maybe 30 secs longer, which is fine for my purposes. I'm loading about 1.7 million lines and the program takes about 7 mins total to load up two files, compare them, and write out a few files. It only takes about 2 secs to do the compare and write out the files.
