Reading in a text file more 'intelligently' - c#

I have a text file which contains a list of alphabetically organized variables with their variable numbers next to them formatted something like follows:
aabcdef 208
abcdefghijk 1191
bcdefga 7
cdefgab 12
defgab 100
efgabcd 999
fgabc 86
gabcdef 9
h 11
ijk 80
...
...
I would like to read each text as a string and keep its designated ID number, something like: read "aabcdef" and store it into an array at spot 208.
The 2 issues I'm running into are:
1. I've never read from a file in C#. Is there a way to read, say, from the start of a line to whitespace as a string, and then the next token as an int until the end of the line?
2. Given the nature and size of these files, I do not know the highest ID value in each file (not all numbers are used, so a file could contain a number like 3000 but only list 200 variables). So how could I store these variables flexibly when I don't know how big the array/list/stack/etc. would need to be?

Basically you need a Dictionary instead of an array or list. You can read all lines with File.ReadLines method then split each of them based on space and \t (tab), like this:
var values = File.ReadLines("path")
.Select(line => line.Split(new [] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
.ToDictionary(parts => int.Parse(parts[1]), parts => parts[0]);
Then values[208] will give you aabcdef. It looks like an array doesn't it :)
Also make sure you have no duplicate numbers because Dictionary keys should be unique otherwise you will get an exception.
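If the file may contain blank lines, placeholder lines such as "...", or repeated IDs, a more defensive variant could look like the sketch below. The class name VariableTable and the method Parse are hypothetical, and the choice to let the last occurrence of a repeated ID win is an assumption rather than something the question specifies.

```csharp
using System;
using System.Collections.Generic;

public static class VariableTable
{
    // Hypothetical helper: builds an ID -> name lookup from lines such as "aabcdef 208".
    public static Dictionary<int, string> Parse(IEnumerable<string> lines)
    {
        var table = new Dictionary<int, string>();
        var separators = new[] { ' ', '\t' };

        foreach (string line in lines)
        {
            string[] parts = line.Split(separators, StringSplitOptions.RemoveEmptyEntries);

            // Skip blank or malformed lines instead of throwing.
            if (parts.Length != 2 || !int.TryParse(parts[1], out int id))
                continue;

            // Assumption: when an ID repeats, the last occurrence wins.
            table[id] = parts[0];
        }

        return table;
    }
}
```

It can be fed directly from a file, for example VariableTable.Parse(File.ReadLines("path")), which needs using System.IO.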

I've been thinking about how to improve on the other answers, and I've found this alternative solution based on Regex, which makes searching the whole string (whether it comes from a file or not) safer.
Note that you can alter the regular expression to include other separators. The sample expression detects spaces and tabs.
In the end, I found that MatchCollection gives a safer result: you always know that group 2 is an integer and group 1 is text, because the regular expression does much of the checking for you!
StringBuilder builder = new StringBuilder();
builder.AppendLine("djdodjodo\t\t3893983");
builder.AppendLine("dddfddffd\t\t233");
builder.AppendLine("djdodjodo\t\t39838");
builder.AppendLine("djdodjodo\t\t12");
builder.AppendLine("djdodjodo\t\t444");
builder.AppendLine("djdodjodo\t\t5683");
builder.Append("djdodjodo\t\t33");
// Replace this line with calling File.ReadAllText to read a file!
string text = builder.ToString();
MatchCollection matches = Regex.Matches(text, @"([^\s^\t]+)(?:[\s\t])+([0-9]+)", RegexOptions.IgnoreCase | RegexOptions.Multiline);
// Here's the magic: we convert an IEnumerable<Match> into a dictionary!
// Because the regex only matches digits in group 2, int.Parse will not
// fail on a non-numeric value (a number too large for int would still overflow).
IDictionary<int, string> lines = matches.Cast<Match>()
.ToDictionary(match => int.Parse(match.Groups[2].Value), match => match.Groups[1].Value);
// Now you can access your lines as follows:
string value = lines[33]; // <-- Look up by ID (the dictionary key)
Update:
As we discussed in chat, this solution wasn't working in an actual use case you showed me. The approach itself is fine; the problem is your particular data, where keys look like "[something].[something]" (for example: address.Name).
I've changed the regular expression to ([\w\.]+)[\s\t]+([0-9]+) so it covers keys that contain a dot.
It's a matter of adjusting the regular expression to fit your requirements! ;)
Update 2:
Since you told me that you need keys having any character, I've changed the regular expression to ([^\s^\t]+)(?:[\s\t])+([0-9]+).
Now a key can be anything except spaces and tabs.
Update 3:
Also I see you're stuck in .NET 3.0 and ToDictionary was introduced in .NET 3.5. If you want to get the same approach in .NET 3.0, replace ToDictionary(...) with:
Dictionary<int, string> lines = new Dictionary<int, string>();
foreach(Match match in matches)
{
lines.Add(int.Parse(match.Groups[2].Value), match.Groups[1].Value);
}


I want to find a name after the first 9 characters

(not an actual person)
55705541;Henrik;Winther;50;7080;Børkop;hew#larsen.dk;0
60015956;Emilie;Beitzel;63;1610;København V;emb#senior.dk;1
44243159;Emilie;Kristensen;14;1125;København K;emk#jubiimail.dk;2
I want to be able to use ReadLine and find the index I need.
I want to use Array.Find to locate "Henrik;Winther", but I can't seem to skip the first 9 characters.
After finding the index, I want to use .Replace(";", " "); and display it with a WriteLine.
Input: Henrik winther
Output:
55705541 Henrik Winther 50 7080 Børkop hew#larsen.dk
(The question has been edited heavily)
Here are the steps you can follow:
Make the input ; separated, since the strings in your database (I presume) are that way.
søgenavn = String.Join(";", søgenavn.Split());
Search for the input in the array, ignoring the case, of course
var line = databaseNavne.FirstOrDefault(x => x.IndexOf(søgenavn, StringComparison.OrdinalIgnoreCase) >= 0);
If line is null, no such line exists in the databaseNavne. Otherwise the line would be whatever's in the databaseNavne:
55705541;Henrik;Winther;50;7080;Børkop;hew#larsen.dk;0
You can split it with ;, chop it, cook it, boil it and do all sorts of stuffs. To get the space separated form like you showed in the question, do this:
line = String.Join(" ", line.Split(';'));
// 55705541 Henrik Winther 50 7080 Børkop hew#larsen.dk 0
The output you have shown does not have the 0 at the end, though. Dunno who ate that away.
Old answer:
Considering the name is always going to be in the same position, you can use this to extract the name:
string inputStr = "55705541;Henrik;Winther;50;7080;Børkop;hew#larsen.dk;0";
string name = String.Join(" ", inputStr.Split(';').Skip(1).Take(2));
The code speaks for itself.
Split by ;. This returns an array of strings
Skip the first string, which in your example is 55705541
Then Take 2, in your example "Henrik" and "Winther". This also returns an array of strings (I said array only to keep things simple for the OP, I know it doesn't return an array)
Join the strings using a single space.

Dealing with escape sequences with ReadOnlySpan<char>

The ReadOnlySpan<char> is said to be perfect for parsing so I tried to use it and I came across a use case that I don't know how to handle.
I have a command-line string where the argument prefix - and the separator (space) are escaped (I know I could quote them here but for the sake of this problem let's assume it's not an option):
var str = @"foo -bar \-baz\ qux".AsMemory();
The tokenizer should return the following tokens:
foo - command name
bar - argument name
-baz qux - argument value
Cases 1 & 2 are simple because here I can just use str.Slice(i, length) but how can I create the 3rd case and return only a single ReadOnlySpan<char>? The Slice method doesn't allow me to specify multiple start/length ranges which would be necessary in order to jump over the escape char \.
Example:
str.Slice((10, 4), (15, 4));
where (10, 4) = "-baz" and (15, 4) = " qux"
With StringBuilder you can just skip a couple of characters and Append the others later. How would I achieve the same result with ReadOnlySpan<char>?
A Span/ReadOnlySpan is a contiguous block of memory. It cannot contain multiple ranges. This design is necessary for performance. Span/ReadOnlySpan is supposed to be roughly as fast as an array is. Arrays are fast because they are contiguous memory blocks with no further abstractions.
I don't see a way to do this without allocating a new string. You can use Span/ReadOnlySpan for all contiguous substrings but it seems your parsing problem is not suitable to use span to store results.
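To illustrate the point about allocating: the contiguous tokens can stay as slices, and only the token that contains escapes is copied into a new string. The sketch below is one way that copy might look; the class EscapeHelper and its Unescape method are hypothetical names, and treating a trailing lone escape character as a literal is an assumption.

```csharp
using System;
using System.Text;

public static class EscapeHelper
{
    // Hypothetical helper: copies the span into a new string, dropping each
    // escape character and keeping the character that follows it.
    public static string Unescape(ReadOnlySpan<char> source, char escape = '\\')
    {
        var builder = new StringBuilder(source.Length);

        for (int i = 0; i < source.Length; i++)
        {
            // An escape character that has a successor is skipped;
            // the successor is then appended literally.
            if (source[i] == escape && i + 1 < source.Length)
                i++;

            builder.Append(source[i]);
        }

        return builder.ToString();
    }
}
```

For the string in the question, the escaped value starts at index 9, so EscapeHelper.Unescape(str.Span.Slice(9)) would yield the third token, while the first two tokens can remain plain slices.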
Have a look at:
https://github.com/nemesissoft/Nemesis.TextParsers
and more precisely at:
TokenSequence.cs
Usage:
var tokens = "ABC|CD\\|E".AsSpan().Tokenize('|', '\\', false); //no allocation. Result in 2 elements: "ABC", "CD\|E".
Consume via:
var result = new List<string>();
foreach (var part in tokens)
result.Add(part.ToString());
Unescaping can be done via:
ParsedSequence.cs
or
SpanParserHelper.UnescapeCharacter()
Hope this helps

Stripping text line with regular expression with C#

In the text shown below, I would need to extract the info in between the double quotes (The input is a text file)
Tag = "571EC002A-TD"
Tag = "571GI001-RUN"
Tag = "571GI001-TD"
The output should be,
571EC002A-TD
571GI001-RUN
571GI001-TD
How should I frame my regex in C# to match this and save it to a text file.
I was successful till reading all the lines into my code, but the regex gives me some undesirable values.
thanks and appreciate in advance.
A simple regex could be:
Regex tagRegex = new Regex(@"Tag\s?=\s?""(.+?)""");
Example with your input
UPDATE
For those who ask why not use String.Substring: the great advantage of regular expressions over string operations is that they don't generate temporary strings until you actually ask for a matched value. Matches and groups contain only indexes into the source string. This can be a huge advantage when processing log files.
You can match the content of a tag using a regex like
Tag\s*=\s*"(?<tagValue>.*?)"
The ? in .*? results in a non-greedy search, i.e. only text up to the first double quote is extracted. Otherwise the pattern would match everything up to the last double quote.
(?<tagValue>.*?) defines a named group. This way you can refer to the captured value by name and even use LINQ to process the values.
The resulting C# code may look like this after escaping:
var myRegex=new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");
...
var tags=myRegex.Matches(someText)
.OfType<Match>()
.Select(match=>match.Groups["tagValue"].Value);
The result is an IEnumerable with all tag values. You can convert it to an array or List using ToArray() or ToList() just like any other IEnumerable
The equivalent code using a loop would be
var myRegex=new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");
...
List<string> tagValues=new List<string>();
foreach(Match m in myRegex.Matches(someText))
{
tagValues.Add(m.Groups["tagValue"].Value);
}
The LINQ version though can be extended very easily. For example, File.ReadLines returns an IEnumerable and doesn't wait to load everything in memory before returning. You could write something like:
var tags=File.ReadLines(myBigLog)
.SelectMany(line=>myRegex.Matches(line).Cast<Match>())
.Select(match=>match.Groups["tagValue"].Value);
If the tag names change, you could also capture the tag name. If, e.g., tags have a tag prefix, you could use the pattern:
(?<tagName>tag\w+)\s*=\s*"(?<tagValue>.*?)"
And extract both tag name and value in the Select function, eg :
.Select(match=>new {
TagName=match.Groups["tagName"].Value,
Value=match.Groups["tagValue"].Value
});
Regex.Matches is thread safe which means you can create one static Regex object and use it repeatedly, or even use PLINQ to match multiple lines in parallel simply by adding AsParallel() before the call to SelectMany.
If those strings will always be like that, you can go for a simpler approach by just using Substring:
line.Substring(7, line.Length - 8)
That will give you your desired output.

Replacing specific numbers with String.Replace or Regex.Replace

Just having a little problem in attempting a string or regex replace on specific numbers in a string.
For example, in the string
#1 is having lunch with #10 #11
I would like to replace "#1", "#10" and "#11" with the respective values as indicated below.
"#1" replace with "#bob"
"#10" replace with "#joe"
"#11" replace with "#sam"
So the final output would look like
"#bob is having lunch with #joe #sam"
Attempts with
String.Replace("#1", "#bob")
results in the following
#bob is having lunch with #bob0 #bob1
Any thoughts on what the solution might be?
I would prefer a more declarative way of doing this. What if more replacements are needed later, for example changing #2 to luke? You would have to change the code (add another Replace call).
My proposal, with the replacements declared up front:
string input = "#1 is having lunch with #10 #11";
var rules = new Dictionary<string,string>()
{
{ "#1", "#bob" },
{ "#10", "#joe" },
{ "#11", "#sam"}
};
string output = Regex.Replace(input,
@"#\d+",
match => rules[match.Value]);
Explanation:
The regular expression searches for the pattern #\d+, which means # followed by one or more digits. The MatchEvaluator then replaces each match with the corresponding entry from the rules dictionary, using the matched text itself as the key.
Assuming all placeholder start with # and contain only digits, you can use the Regex.Replace overload that accepts a MatchEvaluator delegate to pick the replacement value from a dictionary:
var regex = new Regex(@"#\d+");
var dict = new Dictionary<string, string>
{
{"#1","#bob"},
{"#10","#joe"},
{"#11","#sam"},
};
var input = "#1 is having lunch with #10 #11";
var result=regex.Replace(input, m => dict[m.Value]);
The result will be "#bob is having lunch with #joe #sam"
There are a few advantages compared to multiple String.Replace calls:
The code is more concise, for an arbitrary number of placeholders
You avoid mistakes due to the order of the replacements (eg #11 must come before #1)
It's faster because you don't need to search and replace the placeholders multiple times
It doesn't create temporary strings for each parameter. This can be an issue for server applications because a large number of orphaned strings will put pressure on the garbage collector
The reason for advantages 3-4 is that the regex will parse the input and create an internal representation that contains the indexes for any match. When the time comes to create the final string, it uses a StringBuilder to read characters from the original string but substitute the replacement values when a match is encountered.
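One caveat with the dictionary lookups above: indexing with dict[m.Value] throws a KeyNotFoundException when the input contains a placeholder that has no rule. A minimal sketch of a more forgiving variant follows; the class PlaceholderReplacer and its Apply method are hypothetical names, and leaving unknown placeholders untouched is an assumed policy, not something the question asks for.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderReplacer
{
    // Same pattern as above: '#' followed by one or more digits.
    private static readonly Regex Placeholder = new Regex(@"#\d+");

    // Hypothetical helper: replaces known placeholders and keeps unknown ones as they are.
    public static string Apply(string input, IReadOnlyDictionary<string, string> rules)
    {
        return Placeholder.Replace(input, match =>
            rules.TryGetValue(match.Value, out string replacement)
                ? replacement   // a rule exists: substitute it
                : match.Value); // assumption: no rule means the text stays unchanged
    }
}
```

Because the regex consumes all the digits of a placeholder in one match, #10 and #11 are never mistaken for #1 followed by another digit.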
Start with the biggest (read longest) number like #11 and #10 first and then replace #1.
string finalstring = mystring.Replace("#11", "#sam")
.Replace("#10", "#joe")
.Replace("#1", "#bob");
Make your regular expression look for the string "#1 " (with a trailing space).
The space after it ensures that only #1 is matched, not #10 or #11.

Parsing a formatted string with RegEx or similar

I have an application which sends a TCP message to a server, and gets one back.
The message it gets back is in this format:
0,"120"1,"Data Field 1"2,"2401"3,"Data Field 3"1403-1,"multiple
occurence 1"1403-2,"multiple occurence 2"99,""
So basically it is a set of fields concatenated together.
Each field has a tag, a comma, and a value - in that order.
The tag is the number, the value is in quotes, the comma seperates them.
0,"120"
0 is the tag, 120 is the value.
A complete message always starts with a 0 field and ends with 99,"" field.
To complicate things, some tags have dashes because they are split into more than 1 value.
The order of the numbers is not significant.
(For reference, this is a "Fedex Tagged Transaction" message).
So I'm looking for a decent way of validating that we have a "complete" message (ie has the 0 and 99 fields) - because it's from a TCP message I guess I have to account for not having received the full message yet.
Then splitting it up to get all the values I need.
The best I have come up with is for parsing is some poor regex and some cleaning-up afterwards.
The heart of it is this: (\d?\d?\d?\d?-?\d?\d,") to split it
string r = @"(\d?\d?\d?\d?-?\d?\d,"")";
string[] strArray = Regex.Split(receivedData, r);
Assert.AreEqual(14, strArray.Length, "Array length should be 14, since we have 7 fields.");
Dictionary<string, string> fields = new Dictionary<string, string>();
//Now put it into a dictionary which should be easier to work with than an array
for (int i = 0; i <= strArray.Length-2; i+=2)
{
fields.Add(strArray[i].Trim('"').Trim(','), strArray[i + 1].Trim('"'));
}
Which doesn't really work.
It has a lot of quotes and commas left over, and doesn't seem particularly well-formed...
I'm not good with Regex so I can't put together what I need it to do.
I don't even know if it is the best way.
Any help appreciated.
I suggest you use Regex.Matches rather than Regex.Split. This way you can iterate over all the matches, and use capture groups to just grab the data you want directly, while still maintaining structure. I provided a regex that should work for this below in the example:
MatchCollection matchlist = Regex.Matches(receivedData, @"(?<tag>\d+(?:-\d+)?),""(?<data>.*?)""");
foreach (Match match in matchlist)
{
string tag = match.Groups["tag"].Value;
string data = match.Groups["data"].Value;
}
Try this expression
\d*(-\d*)?,"[^"]*"
Match count: 7
0,"120"
1,"Data Field 1"
2,"2401"
3,"Data Field 3"
1403-1,"multiple occurence 1"
1403-2,"multiple occurence 2"
99,""
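Neither answer covers the "complete message" check from the question. One possible sketch, building on the matching approach above, is to collect the fields in order and require that the first tag is 0 and the last is 99. The class TaggedMessage and its TryParse method are hypothetical names. The sketch assumes values never contain a double quote and that a partially received message is simply cut off at the end.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TaggedMessage
{
    // A tag (digits, optionally "-digits"), a comma, then a quoted value.
    // [^""]* also matches line breaks, so wrapped values are handled.
    private static readonly Regex Field =
        new Regex(@"(?<tag>\d+(?:-\d+)?),""(?<data>[^""]*)""");

    // Hypothetical helper: returns true only when the message starts with
    // the 0 field and ends with the 99 field; fields are returned in order.
    public static bool TryParse(string received, out List<KeyValuePair<string, string>> fields)
    {
        fields = new List<KeyValuePair<string, string>>();

        foreach (Match m in Field.Matches(received))
        {
            fields.Add(new KeyValuePair<string, string>(
                m.Groups["tag"].Value, m.Groups["data"].Value));
        }

        return fields.Count >= 2
            && fields[0].Key == "0"
            && fields[fields.Count - 1].Key == "99";
    }
}
```

When TryParse returns false, the caller could keep appending incoming TCP data to its buffer and try again. A list of pairs is used instead of a dictionary so that the field order is preserved.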
