Find repeated occurrences in String - c#

I'm currently trying to find all matches to a rule in a string and copy those to a vector. The purpose is to build an application which retrieves the top N .mp3 files (podcasts) from a community website.
My current tactic:
public static string getBetween(string strSource, string strStart, string strEnd)
{
int Start, End;
if (strSource.Contains(strStart) && strSource.Contains(strEnd))
{
Start = strSource.IndexOf(strStart, 0) + strStart.Length;
End = strSource.IndexOf(strEnd, Start);
string sFound = strSource.Substring(Start, End + 4 - Start);
strSource = strSource.Remove(Start, End + 4 - Start);
return sFound;
}
else
{
return"";
}
}
Called like this:
for (int i = 0; i < N; i++)
{
Podcast.Add(getBetween(searchDoc(#TARGET_HTM), "Sound/", ".mp3"));
}
Where searchDoc is:
public static string searchDoc(string strFile)
{
StreamReader sr = new StreamReader(strFile);
String line = sr.ReadToEnd();
return line;
}
Why am I posting such a big chunk of code?
This is my first application in C#. I assume my current tactic is flawed and I'd rather see a solution for the underlying problem than a cheap fix for lousy code. Feel free to do whatever you feel like though.
What it should do:
Find all occurrences of "Sound/" + * + ".mp3" (all MP3 files in the directory Sound, whatever their name, from the top of the target .htm file till N are found. Do so by returning the top occurrence and removing this from the String.
What it does:
It finds the first occurrence just fine. It also removes the occurrence just fine. However, it only does so from strSource which gets discarded at the end of the function.
Problem:
How do I return the modified string in a safe manner (no global variables or other improper tricks), so the found occurrence is properly removed and the next can be found?

This is the wrong approach. You can use Regex.Matches to get all matches of the pattern that you want. The regex would be something like "Sound/[^/\"]+\.mp3".
Once you have a list of matches you can apply .Cast<Match>().Take(3).Select(m => m.Value) to it to get the top 3 matches as strings.
It looks like you have a C++ background. This can lead to low-level designs out of habit. Try to avoid manual string parsing and loops.

Three flaws:
First, these two things seem to belong together strongly, but you split them over two functions.
Second, you forgot to use the startIndex parameter of Substring, requiring you to rebuild strings that are later discarded (this is a performance hit!)
Third, you had a small error: you hardcoded the length of strEnd as 4.
I just made an extension method based on your code, which fixes these 3 flaws. Untested, since I have no VS on this computer.
public static List<string> Split(this string source, string start, string end) {
List<string> result = new List<string>();
int i=0;
while(source.indexOf(start, i) != -1) {
startIndex = source.IndexOf(start, i) + start.Length;
endIndex = source.IndexOf(end, start);
result.Add(source.Substring(startIndex, endIndex + end.Length - startIndex));
i = endIndex;
}
return result;
}

Related

Efficiently split a string in format "{ {}, {}, ...}"

I have a string in the following format.
string instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}"
private void parsestring(string input)
{
string[] tokens = input.Split(','); // I thought this would split on the , seperating the {}
foreach (string item in tokens) // but that doesn't seem to be what it is doing
{
Console.WriteLine(item);
}
}
My desired output should be something like this below:
112,This is the first day 23/12/2009
132,This is the second day 24/12/2009
But currently, I get the one below:
{112
This is the first day 23/12/2009
{132
This is the second day 24/12/2009
I am very new to C# and any help would be appreciated.
Don't fixate on Split() being the solution! This is a simple thing to parse without it. Regex answers are probably also OK, but I imagine in terms of raw efficiency making "a parser" would do the trick.
IEnumerable<string> Parse(string input)
{
var results = new List<string>();
int startIndex = 0;
int currentIndex = 0;
while (currentIndex < input.Length)
{
var currentChar = input[currentIndex];
if (currentChar == '{')
{
startIndex = currentIndex + 1;
}
else if (currentChar == '}')
{
int endIndex = currentIndex - 1;
int length = endIndex - startIndex + 1;
results.Add(input.Substring(startIndex, length));
}
currentIndex++;
}
return results;
}
So it's not short on lines. It iterates once, and only performs one allocation per "result". With a little tweaking I could probably make a C#8 version with Index types that cuts on allocations? This is probably good enough.
You could spend a whole day figuring out how to understand the regex, but this is as simple as it comes:
Scan every character.
If you find {, note the next character is the start of a result.
If you find }, consider everything from the last noted "start" until the index before this character as "a result".
This won't catch mismatched brackets and could throw exceptions for strings like "}}{". You didn't ask for handling those cases, but it's not too hard to improve this logic to catch it and scream about it or recover.
For example, you could reset startIndex to something like -1 when } is found. From there, you can deduce if you find { when startIndex != -1 you've found "{{". And you can deduce if you find } when startIndex == -1, you've found "}}". And if you exit the loop with startIndex < -1, that's an opening { with no closing }. that leaves the string "}whoops" as an uncovered case, but it could be handled by initializing startIndex to, say, -2 and checking for that specifically. Do that with a regex, and you'll have a headache.
The main reason I suggest this is you said "efficiently". icepickle's solution is nice, but Split() makes one allocation per token, then you perform allocations for each TrimX() call. That's not "efficient". That's "n + 2 allocations".
Use Regex for this:
string[] tokens = Regex.Split(input, #"}\s*,\s*{")
.Select(i => i.Replace("{", "").Replace("}", ""))
.ToArray();
Pattern explanation:
\s* - match zero or more white space characters
Well, if you have a method that is called ParseString, it's a good thing it returns something (and it might not be that bad to say that it is ParseTokens instead). So if you do that, you can come to the following code
private static IEnumerable<string> ParseTokens(string input)
{
return input
// removes the leading {
.TrimStart('{')
// removes the trailing }
.TrimEnd('}')
// splits on the different token in the middle
.Split( new string[] { "},{" }, StringSplitOptions.None );
}
The reason why it didn't work for you before, is because your understanding of how the split method works, was wrong, it will effectively split on all , in your example.
Now if you put this all together, you get something like in this dotnetfiddle
using System;
using System.Collections.Generic;
public class Program
{
private static IEnumerable<string> ParseTokens(string input)
{
return input
// removes the leading {
.TrimStart('{')
// removes the trailing }
.TrimEnd('}')
// splits on the different token in the middle
.Split( new string[] { "},{" }, StringSplitOptions.None );
}
public static void Main()
{
var instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";
foreach (var item in ParseTokens( instance ) ) {
Console.WriteLine( item );
}
}
}
Add using System.Text.RegularExpressions; to top of the class
and use the regex split method
string[] tokens = Regex.Split(input, "(?<=}),");
Here, we use positive lookahead to split on a , which is immediately after a }
(note: (?<= your string ) matches all the characters after your string only. you can read more about it here
If you dont want to your regular expressions, the following code will produce your required output.
string instance = "{112,This is the first day 23/12/2009},{132,This is the second day 24/12/2009}";
string[] tokens = instance.Replace("},{", "}{").Split('}', '{');
foreach (string item in tokens)
{
if (string.IsNullOrWhiteSpace(item)) continue;
Console.WriteLine(item);
}
Console.ReadLine();

Custom Uppercase on String

hi i was trying to make a program that modified a word in a string to a uppercase word.
the uppercase word is in a tag like this :
the <upcase>weather</upcase> is very <upcase>hot</upcase>
the result :
the WEATHER is very HOT
my code is like this :
string upKey = "<upcase>";
string lowKey = "</upcase>";
string quote = "the lazy <upcase>fox jump over</upcase> the dog <upcase> something here </upcase>";
int index = quote.IndexOf(upKey);
int indexEnd = quote.IndexOf(lowKey);
while(index!=-1)
{
for (int a = 0; a < index; a++)
{
Console.Write(quote[a]);
}
string upperQuote = "";
for (int b = index + 8; b < indexEnd; b++)
{
upperQuote += quote[b];
}
upperQuote = upperQuote.ToUpper().ToString();
Console.Write(upperQuote);
for (int c = indexEnd+9;c<quote.Length;c++)
{
if (quote[c]=='<')
{
break;
}
Console.Write(quote[c]);
}
index = quote.IndexOf(upKey, index + 1);
indexEnd = quote.IndexOf(lowKey, index + 1);
}
Console.WriteLine();
}
i have been trying using this code,and a while(while (indexEnd != -1)) :
index = quote.IndexOf(upKey, index + 1);
indexEnd = quote.IndexOf(lowKey, index + 1);
but that not work, the program run into unlimited loop, btw i'm a noob so please give a answer that i can understand :)
You can use a regular expression for this:
string input = "the <upcase>weather</upcase> is very <upcase>hot</upcase>";
var regex = new Regex("<upcase>(?<theMatch>.*?)</upcase>");
var result = regex.Replace(input, match => match.Groups["theMatch"].Value.ToUpper());
// result will be: "the WEATHER is very HOT"
Here's an explanation taken from here for the regular expression used above:
<upcase> matches the characters <upcase> literally (case sensitive)
(?<theMatch>.\*?) Named capturing group theMatch
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
< matches the characters < literally
/ matches the character / literally
upcase> matches the characters upcase> literally (case sensitive)
The following will work as long as there are only matching tags and none of them are nested.
public static string Upper(string str)
{
const string start = "<upcase>";
const string end = "</upcase>";
var builder = new StringBuilder();
// Find the first start tag
int startIndex = str.IndexOf(start);
// If no start tag found then return the original
if (startIndex == -1)
return str;
// Append the part before the first tag as is
builder.Append(str.Substring(0, startIndex));
// Continue as long as we find another start tag.
while (startIndex != -1)
{
// Find the end tag for the current start tag
var endIndex = str.IndexOf(end, startIndex);
// Append the text between the start and end as upper case.
builder.Append(
str.Substring(
startIndex + start.Length,
endIndex - startIndex - start.Length).ToUpper());
// Find the next start tag.
startIndex = str.IndexOf(start, endIndex);
// Append the part after the end tag, but before the next start as is
builder.Append(
str.Substring(
endIndex + end.Length,
(startIndex == -1 ? str.Length : startIndex) - endIndex - end.Length));
}
return builder.ToString();
}
I'm not rewriting your code. Just answering your (main) question:
You need to keep a variable of the index you're at, and check for IndexOf from there only (See MSDN). Something like this:
int index = 0;
while (quote.IndexOf(upKey, index) != -1)
{
//Your code, including updating the value of index.
}
(I didn't check this on Visual Studio. This is just to point you in the direction that I think you're looking for.)
The reason for the infinite loop is that you're always testing IndexOf of the same index. Perhaps you mean to have quote.IndexOf(upKey, index += 1); which would change the value of index?
The way to go here is to probably use Regex but these easy parsing excercises are always fun to do manually. This can be easily solved using a very simple state machine.
What states can we have when dealing with strings of this nature? I can think of 4:
We are either parsing normal text
Or we are parsing an opening format tag '<...>'
Or we are parsing a closing format tag '</...>'
Or we are parsing text to be formatted between tags
I can't think of any other states. Now we need to think about the normal flow / transition between states. What should happen when we a parse string with the correct format?
Parser starts up expecting normal text. That is easy to understand.
If expecting normal text we encounter a '<' then the parser should switch to parsing opening format tag state. There is no other valid state transition.
If in parsing opening format tag state we encounter a '>' then the parser should switch to parsing text to be formatted. There is no other valid state transition.
If in parsing text to be formatted we encounter a '<' then the parser should switch to parsing closing tag. Again, there is no other valid state transition.
If in parsing closing tag we encounter a '>' then the parser should switch to normal text. Once more, there is no other valid transition. Note that we are disallowing nested tags.
Ok, so that seems pretty easy to understand. What do we need to implement this?
First we'll need something to represent the parsing states. A good old enum will do:
private enum ParsingState
{
UnformattedText,
OpenTag,
CloseTag,
FormattedText,
}
Now we need some string buffers to keep track of the final formatted string, the current format tag we are parsing and finally the substring we need to format. We will use several StringBuilder's for these as we don't know how long these buffers are and how many concatenations will be performed:
var formattedStringBuffer = new StringBuilder();
var formatBuffer = new StringBuilder();
var tagBuffer = new StringBuilder();
We will also need to keep track of the parser's state and the current active tag if any (so we can make sure that the parsed closing tag matches the current active tag):
var state = ParsingState.UnformattedText;
var activeFormatTag = string.Empty;
And now we are good to go, but before we do, can we generalize this so it works with any format tag?
Yes we can, we just need to tell the parser what to do for each supported tag. We can do this easily just passing a along a Dictionary that ties each tag with the action it should perform. We do this the following way:
var formatter = new Dictionary<string, Func<string, string>>();
formatter.Add("upcase", s => s.ToUpperInvariant());
formatter.Add("lcase", s => s.ToLowerInvariant());
Great! Now our implementation could be the following:
public static string Parse(this string str, Dictionary<string, Func<string,string>> formatter)
{
var formattedStringBuffer = new StringBuilder();
var formatBuffer = new StringBuilder();
var tagBuffer = new StringBuilder();
var state = ParsingState.UnformattedText;
var activeFormatTag = string.Empty;
foreach (var c in str)
{
switch (state)
{
case ParsingState.UnformattedText:
{
if (c != '<')
{
formattedStringBuffer.Append(c);
}
else
{
state = ParsingState.OpenTag;
}
break;
}
case ParsingState.OpenTag:
{
if (c != '>')
{
tagBuffer.Append(c);
}
else
{
state = ParsingState.FormattedText;
activeFormatTag = tagBuffer.ToString();
tagBuffer.Clear();
}
break;
}
case ParsingState.FormattedText:
{
if (c != '<')
{
formatBuffer.Append(c);
}
else
{
state = ParsingState.CloseTag;
}
break;
}
case ParsingState.CloseTag:
{
if (c!='>')
{
tagBuffer.Append(c);
}
else
{
var expectedTag = $"/{activeFormatTag}";
var tag = tagBuffer.ToString();
if (tag != expectedTag)
throw new FormatException($"Expected closing tag not found: <{expectedTag}>.");
if (formatter.ContainsKey(activeFormatTag))
{
var formatted = formatter[activeFormatTag](formatBuffer.ToString());
formattedStringBuffer.Append(formatted);
tagBuffer.Clear();
formatBuffer.Clear();
state = ParsingState.UnformattedText;
}
else
throw new FormatException($"Format tag <{activeFormatTag}> not recognized.");
}
break;
}
}
}
if (state != ParsingState.UnformattedText)
throw new FormatException($"Bad format in specified string '{str}'");
return formattedStringBuffer.ToString();
}
Is it the most elegant solution? No, Regex will do a much better job, but being a beginner I would not recommend you start solving these kind of problems that way, you'll learn a whole lot more solving them manualy. You'll have plenty of time to learn Regex later on.

Extracting text between last pair of parentheses from a string without RegEx

I have a list of string where one item is like, textItem1 = "Brown, Adam. (user)(admin)(Sales)" where I would always have to extract the text from the last pair of parentheses, which in this case will be Sales.
I tried the following:
string name = DDlistName.SelectedItem.ToString();
int start = name.IndexOf("(");
int end = name.IndexOf("(");
string result = name.Substring(start + 1, end - start - 1);
_UILabelPrintName.Text = result;
Problem: This always picks the text from first pair of parentheses, which in this case user.
Reading lots of similar question's answer I realised Regex might not be recommended in this case (not particularly succeeded either trying other codes). However any help with any short routine which can do the task will be really appreciated.
You need to use LastIndexOf instead of IndexOf, and check for a close parenthesis at the end.
string name = "Brown, Adam. (user)(admin)(Sales)";
int start = name.LastIndexOf("(");
int end = name.LastIndexOf(")");
string result = name.Substring(start + 1, end - start - 1);
Really you'd want to validate start and end to be sure that both parenthesis were found. LastIndexOf returns -1 if the character is not found.
And in order to handle nesting we need to search forward for the closing parenthesis after the location of the opening parenthesis.
string name = "Brown, Adam. (user)(admin)((Sales))";
int start = name.LastIndexOf('(');
int end = (start >= 0) ? name.IndexOf(')', start) : -1;
string result = (end >= 0) ? name.Substring(start + 1, end - start - 1) : "";
You can use the split function, breaking the string at the opening parenthesis. The last array element is the desired output with a tailing ")", which will then be removed.
var input = "Brown, Adam. (user)(admin)(Sales)";
// var input = DDlistName.SelectedItem.ToString();
var lastPick = input.Split(new[] { "(" }, StringSplitOptions.RemoveEmptyEntries).Last();
var output = lastPick.Substring(0, lastPick.Length - 1);
_UILabelPrintName.Text = output;
Another approach is to use a while loop with IndexOf. It cuts the input string as long as another "(" is found. If not more "(" are found, it takes the contents of the remaining string until the closing parenthesis ")":
int current = -1;
while(name.IndexOf("(") > 0)
{
name = name.Substring(name.IndexOf("(") + 1);
}
var end = name.IndexOf(")");
var output = name.Substring(0, end);
_UILabelPrintName.Text = output;
Or use LastIndexOf....

How to properly split a CSV using C# split() function?

Suppose I have this CSV file :
NAME,ADDRESS,DATE
"Eko S. Wibowo", "Tamanan, Banguntapan, Bantul, DIY", "6/27/1979"
I would like like to store each token that enclosed using a double quotes to be in an array, is there a safe to do this instead of using the String split() function? Currently I load up the file in a RichTextBox, and then using its Lines[] property, I do a loop for each Lines[] element and doing this :
string[] line = s.Split(',');
s is a reference to RichTextBox.Lines[].
And as you can clearly see, the comma inside a token can easily messed up split() function. So, instead of ended with three token as I want it, I ended with 6 tokens
Any help will be appreciated!
You could use regex too:
string input = "\"Eko S. Wibowo\", \"Tamanan, Banguntapan, Bantul, DIY\", \"6/27/1979\"";
string pattern = #"""\s*,\s*""";
// input.Substring(1, input.Length - 2) removes the first and last " from the string
string[] tokens = System.Text.RegularExpressions.Regex.Split(
input.Substring(1, input.Length - 2), pattern);
This will give you:
Eko S. Wibowo
Tamanan, Banguntapan, Bantul, DIY
6/27/1979
I've done this with my own method. It simply counts the amout of " and ' characters.
Improve this to your needs.
public List<string> SplitCsvLine(string s) {
int i;
int a = 0;
int count = 0;
List<string> str = new List<string>();
for (i = 0; i < s.Length; i++) {
switch (s[i]) {
case ',':
if ((count & 1) == 0) {
str.Add(s.Substring(a, i - a));
a = i + 1;
}
break;
case '"':
case '\'': count++; break;
}
}
str.Add(s.Substring(a));
return str;
}
It's not an exact answer to your question, but why don't you use already written library to manipulate CSV file, good example would be LinqToCsv. CSV could be delimited with various punctuation signs. Moreover, there are gotchas, which are already addressed by library creators. Such as dealing with name row, dealing with different date formats and mapping rows to C# objects.
You can replace "," with ; then split by ;
var values= s.Replace("\",\"",";").Split(';');
If your CSV line is tightly packed it's easiest to use the end and tail removal mentioned earlier and then a simple split on a joining string
string[] tokens = input.Substring(1, input.Length - 2).Split("\",\"");
This will only work if ALL fields are double-quoted even if they don't (officially) need to be. It will be faster than RegEx but with given conditions as to its use.
Really useful if your data looks like
"Name","1","12/03/2018","Add1,Add2,Add3","other stuff"
Five years old but there is always somebody new who wants to split a CSV.
If your data is simple and predictable (i.e. never has any special characters like commas, quotes and newlines) then you can do it with split() or regex.
But to support all the nuances of the CSV format properly without code soup you should really use a library where all the magic has already been figured out. Don't re-invent the wheel (unless you are doing it for fun of course).
CsvHelper is simple enough to use:
https://joshclose.github.io/CsvHelper/2.x/
using (var parser = new CsvParser(textReader)
{
while(true)
{
string[] line = parser.Read();
if (line != null)
{
// do something
}
else
{
break;
}
}
}
More discussion / same question:
Dealing with commas in a CSV file

Why isn't this C# code working? It should return the string between two other strings, but always returns an empty string

The title explains it all. It seems simple enough, so I must be overlooking something stupid. Here's what I've got.
private string getBetween(string strSource, string strStart, string strEnd)
{
int start, end;
if (strSource.Contains(strStart) && strSource.Contains(strEnd))
{
start = strSource.IndexOf(strStart, 0) + strStart.Length;
end = strSource.IndexOf(strEnd, start);
return strSource.Substring(start, end - start);
}
else
{
return "";
}
}
Thanks, guys.
Your code doesn't make sure that start and end are in order.
static string SubString(string source, string prefix, string suffix)
{
int start = source.IndexOf(prefix); // get position of prefix
if (start == -1)
return String.Empty;
int subStart = start + prefix.Length; // get position of substring
int end = source.IndexOf(suffix, subStart); // make sure suffix also exists
if (end == -1)
return String.Empty;
int subLength = end - subStart; // calculate length of substring
if (subLength == 0)
return String.Empty;
return source.Substring(subStart, subLength); // return substring
}
As couple of peoples said the problem that you code is working on very specific input, it's all because of this start and end IndexOf magic =) But when you try to update you code to work correct on more inputs you will get into problem that your code become very long with many indexes, comparsions, substrings, conditions and so on. To avoid this i like to recommend you use regular expressions with theirs help you can express what you need on special language.
Here is the sample which solves your problem with regular expressions:
public static string getBetween(string source, string before, string after)
{
var regExp = new Regex(string.Format("{0}(?<needle>[^{0}{1}]+){1}",before,after));
var matches = regExp.Matches(source).Cast<Match>(). //here we use LINQ to
OrderBy(m => m.Groups["needle"].Value.Length). //find shortest string
Select(m => m.Groups["needle"].Value); //you can use foreach loop instead
return matches.FirstOrDefault();
}
All tricky part is {0}(?<needle>[^{0}{1}]+){1} where 0 - before string and 1 - after string. This expression means that we nned to find string that lies beetween 0 and 1, and also don't contains 0 and 1.
Hope this helps.
I get the correct answer if I try any of these:
var a = getBetween("ABC", "A", "C");
var b = getBetween("TACOBBURRITO", "TACO", "BURRITO");
var c = getBetween("TACOBACONBURRITO", "TACO", "BURRITO");
The problem is likely with your input argument validation, as this fails:
var a = getBetween("ABC", "C", "A");
var a = getBetween("ABC", "C", "C");
You can improve your validation of the issue by writing some test cases like these as a separate fixture (xUnit, or main loop in throw away app).

Categories

Resources