string replace using a List<string> - c#

I have a List of words I want to ignore, like this one:
public List<String> ignoreList = new List<String>()
{
"North",
"South",
"East",
"West"
};
For a given string, say "14th Avenue North" I want to be able to remove the "North" part, so basically a function that would return "14th Avenue " when called.
I feel like there is something I should be able to do with a mix of LINQ, regex and replace, but I just can't figure it out.
The bigger picture is, I'm trying to write an address matching algorithm. I want to filter out words like "Street", "North", "Boulevard", etc. before I use the Levenshtein algorithm to evaluate the similarity.

How about this:
string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)));
or, for .NET 3.5 (where string.Join needs an array):
string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)).ToArray());
Note that this method splits the string up into individual words so it only removes whole words. That way it will work properly with addresses like Northampton Way #123 that string.Replace can't handle.
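A slightly more defensive variant of the split-and-filter idea might look like the sketch below. It assumes you want case-insensitive matching and that runs of whitespace may be collapsed to a single space; the class and method names (AddressFilter, RemoveIgnoredWords) are made up for illustration and are not from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AddressFilter
{
    // Assumption: the ignore list is matched case-insensitively.
    private static readonly HashSet<string> IgnoreSet =
        new HashSet<string>(new[] { "North", "South", "East", "West" },
                            StringComparer.OrdinalIgnoreCase);

    public static string RemoveIgnoredWords(string text)
    {
        // A null separator array means "split on any whitespace";
        // RemoveEmptyEntries drops the empty strings produced by repeated spaces.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // Keep only whole words that are not in the ignore set.
        return string.Join(" ", words.Where(w => !IgnoreSet.Contains(w)));
    }
}
```

Because only whole words are compared, "Northampton" is left alone while "North", "north" and "NORTH" are all dropped. A HashSet lookup also stays cheap if the ignore list grows to hundreds of entries.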

Regex r = new Regex(string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray()));
string s = "14th Avenue North";
s = r.Replace(s, string.Empty);

Something like this should work:
string FilterAllValuesFromIgnoreList(string someStringToFilter)
{
return ignoreList.Aggregate(someStringToFilter, (str, filter)=>str.Replace(filter, ""));
}

What's wrong with a simple for loop?
string street = "14th Avenue North";
foreach (string word in ignoreList)
{
street = street.Replace(word, string.Empty);
}

If you know that the list of words contains only characters that do not need escaping inside a regular expression then you can do this:
string s = "14th Avenue North";
Regex regex = new Regex(string.Format(@"\b({0})\b",
string.Join("|", ignoreList.ToArray())));
s = regex.Replace(s, "");
Result:
14th Avenue
If there are special characters you will need to fix two things:
Use Regex.Escape on each element of ignore list.
The word-boundary \b will not match a whitespace followed by a symbol or vice versa. You may need to check for whitespace (or other separating characters such as punctuation) using lookaround assertions instead.
Here's how to fix these two problems:
Regex regex = new Regex(string.Format(@"(?<= |^)({0})(?= |$)",
string.Join("|", ignoreList.Select(x => Regex.Escape(x)).ToArray())));
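Putting the two fixes together, a reusable helper could be sketched as follows. It is only one way to do it: it uses the negative lookarounds (?<!\S) and (?!\S) instead of the space-or-anchor lookarounds so that tabs and other whitespace also count as separators, and it tidies up leftover spaces afterwards. The names IgnoreWordRegex, Build and Remove are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class IgnoreWordRegex
{
    // Build the regex once and reuse it for every address.
    public static Regex Build(IEnumerable<string> ignoreList)
    {
        // Escape each entry so characters like '.' are matched literally.
        string alternatives = string.Join("|", ignoreList.Select(s => Regex.Escape(s)));

        // (?<!\S): not preceded by a non-whitespace character.
        // (?!\S):  not followed by a non-whitespace character.
        return new Regex(@"(?<!\S)(?:" + alternatives + @")(?!\S)");
    }

    public static string Remove(Regex regex, string input)
    {
        string stripped = regex.Replace(input, string.Empty);

        // Collapse the whitespace left behind by removed words.
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }
}
```

Usage would be along the lines of `var rx = IgnoreWordRegex.Build(ignoreList); string cleaned = IgnoreWordRegex.Remove(rx, "14th Avenue North");`. Whether trimming and collapsing whitespace is desirable depends on what the similarity algorithm expects downstream.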

If it's a short string as in your example, you can just loop through the strings and replace one at a time. If you want to get fancy you can use the LINQ Aggregate method to do it:
address = ignoreList.Aggregate(address, (a, s) => a.Replace(s, String.Empty));
If it's a large string, that would be slow. Instead you can replace all strings in a single run through the string, which is much faster. I made a method for that in this answer.

LINQ makes this easy and readable. This requires normalized data though, particularly in that it is case-sensitive.
List<string> ignoreList = new List<string>()
{
"North",
"South",
"East",
"West"
};
string s = "123 West 5th St"
.Split(' ') // Separate the words to an array
.ToList() // Convert the array to a List<string>
.Except(ignoreList) // Remove ignored keywords (Except is a set operation, so duplicate words are dropped too)
.Aggregate((s1, s2) => s1 + " " + s2); // Reconstruct the string

Why not just Keep It Simple?
public static string Trim(string text)
{
    var rv = text.Trim();
    foreach (var ignore in ignoreList) {
        if (rv.EndsWith(ignore)) {
            // Cut off only the trailing word, then the whitespace before it
            rv = rv.Substring(0, rv.Length - ignore.Length).TrimEnd();
        }
    }
    return rv;
}

You can do this using an expression if you like, but it's easier to turn it around than to use Aggregate. I would do something like this:
string s = "14th Avenue North";
ignoreList.ForEach(i => s = s.Replace(i, ""));
//result is "14th Avenue "

public static string Trim(string text)
{
var rv = text;
foreach (var ignore in ignoreList)
rv = rv.Replace(ignore, "");
return rv;
}
Updated For Gabe
public static string Trim(string text)
{
    var rv = "";
    var words = text.Split(' ');
    foreach (var word in words)
    {
        var present = false;
        foreach (var ignore in ignoreList)
            if (word == ignore)
                present = true;
        if (!present)
            rv += word + " "; // keep a separator between the words we keep
    }
    return rv.TrimEnd();
}

If you have a list, I think you're going to have to touch all the items. You could create a massive RegEx with all your ignore keywords and replace to String.Empty.
Here's a start:
(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)
If you have a single RegEx for ignore words, you can do a single replace for each phrase you want to pass to the algorithm.

Related

C# split by regex

I have a little problem that I don't know what to call, so I will do my best to explain it.
String text = "Random text over here boyz, I dunno what to do";
For example, I want to get only "over here boyz" by splitting: split on the word text and on the comma, and get the whole text between those two strings. Any ideas?
Thank you,
Sagi.
From your comments I get that from this string:
foo bar id="baz" qux
You want to obtain the value baz, because it is in the id="{text}" pattern.
For that you can use a regular expression:
string result = Regex.Match(text, "id=\"(.*?)\"").Groups[1].Value;
Note that this will match any character. Also note that this will yield false positives, like fooid="bar", and that this won't match unquoted values.
So all in all, for parsing HTML, you should not use regular expressions. Try HtmlAgilityPack and an XPath expression.
There is a Split overload that can receive multiple string separators:
var rrr = text.Split(new string[] { ",", "text" }, StringSplitOptions.None);
If you would like to extract only the text between these two strings using regex you can do something like this:
var pattern = @"text(.*),";
var a = new Regex(pattern).Match(text);
var result = a.Groups[1];
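Note that (.*) is greedy, so with several commas in the input it runs up to the last one. A small helper with a lazy quantifier and escaped delimiters might look like this; it is a sketch, and the names TextBetween and Extract are invented for the example.

```csharp
using System.Text.RegularExpressions;

public static class TextBetween
{
    // Returns the text between the first 'start' and the next 'end',
    // or null when the pair is not found.
    public static string Extract(string input, string start, string end)
    {
        // Escape the delimiters so they are treated as literal text,
        // and use the lazy (.*?) so the match stops at the first 'end'.
        string pattern = Regex.Escape(start) + "(.*?)" + Regex.Escape(end);
        Match m = Regex.Match(input, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For the question's string, `TextBetween.Extract(text, "text", ",")` yields " over here boyz" including the leading space, so call Trim() on the result if the surrounding whitespace is not wanted.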
You can use Regex class:
https://msdn.microsoft.com/pl-pl/library/ze12yx1d%28v=vs.110%29.aspx
But first of all (as it was said) you need to clarify for yourself how you will identify string that you want.
In the first case you can use:
string stringResult;
if (text.Contains("over here boyz"))
stringResult = string.Empty;
else
stringResult = "over here boyz";
but the second case can be solved with this code:
String text = "Random text over here boyz, I dunno what to do";
//Second dream without whitespace
var result = Regex.Split(text, " *text *| *, *");
foreach (var x in result)
{
Console.WriteLine(x);
}
//Second dream with whitespace
result = Regex.Split(text, "text|,");
foreach (var x in result)
{
Console.WriteLine(x);
}
You can train to write Regex with this tool http://www.regexbuddy.com/ or http://www.regexr.com/

replacing characters in a single field of a comma-separated list

I have string in my c# code
a,b,c,d,"e,f",g,h
I want to replace "e,f" with "e f" i.e. ',' which is inside inverted comma should be replaced by space.
I tried using string.split but it is not working for me.
OK, I can't be bothered to think of a regex approach so I am going to offer an old fashioned loop approach which will work:
string DoReplace(string input)
{
bool isInner = false;//flag to detect if we are in the inner string or not
string result = "";//result to return
foreach(char c in input)//loop each character in the input string
{
if(isInner && c == ',')//if we are in an inner string and it is a comma, append space
result += " ";
else//otherwise append the character
result += c;
if(c == '"')//if we have hit an inner quote, toggle the flag
isInner = !isInner;
}
return result;
}
NOTE: This solution assumes that there can only be one level of inner quotes, for example you cannot have "a,b,c,"d,e,"f,g",h",i,j" - because that's just plain madness!
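The same loop can be written with a StringBuilder, which avoids allocating a new string for every appended character. This is a sketch under the same assumption of one level of quoting; the class and method names are made up.

```csharp
using System.Text;

public static class CsvCommaFixer
{
    public static string ReplaceCommasInsideQuotes(string input)
    {
        var sb = new StringBuilder(input.Length);
        bool insideQuotes = false; // true while we are between a pair of double quotes

        foreach (char c in input)
        {
            if (c == '"')
                insideQuotes = !insideQuotes; // every quote toggles the state

            // Commas inside quotes become spaces; everything else is copied as-is.
            sb.Append(insideQuotes && c == ',' ? ' ' : c);
        }
        return sb.ToString();
    }
}
```

Like the original loop, this does not understand CSV dialects that escape a quote by doubling it inside a quoted field.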
For the scenario where you only need to match one pair of letters, the following regex will work:
string source = "a,b,c,d,\"e,f\",g,h";
string pattern = "\"([\\w]),([\\w])\"";
string replace = "\"$1 $2\"";
string result = Regex.Replace(source, pattern, replace);
Console.WriteLine(result); // a,b,c,d,"e f",g,h
Breaking apart the pattern, it is matching any instance where there is a "X,X" sequence where X is any letter, and is replacing it with the very same sequence, with a space in between the letters instead of a comma.
You could easily extend this if you needed to to have it match more than one letter, etc, as needed.
For the case where you can have multiple letters separated by commas within quotes that need to be replaced, the following can do it for you. Sample text is a,b,c,d,"e,f,a",g,h:
string source = "a,b,c,d,\"e,f,a\",g,h";
string pattern = "\"([ ,\\w]+),([ ,\\w]+)\"";
string replace = "\"$1 $2\"";
string result = source;
while (Regex.IsMatch(result, pattern)) {
result = Regex.Replace(result, pattern, replace);
}
Console.WriteLine(result); // a,b,c,d,"e f a",g,h
This does something similar compared to the first one, but just removes any comma that is sandwiched by letters surrounded by quotes, and repeats it until all cases are removed.
Here's a somewhat fragile but simple solution:
string.Join("\"", line.Split('"').Select((s, i) => i % 2 == 0 ? s : s.Replace(",", " ")))
It's fragile because it doesn't handle flavors of CSV that escape double-quotes inside double-quotes.
Use the following code:
string str = "a,b,c,d,\"e,f\",g,h";
string[] str2 = str.Split('\"');
var str3 = str2.Select(p => ((p.StartsWith(",") || p.EndsWith(",")) ? p : p.Replace(',', ' '))).ToList();
str = string.Join("", str3);
Use Split() and Join():
string input = "a,b,c,d,\"e,f\",g,h";
string[] pieces = input.Split('"');
for ( int i = 1; i < pieces.Length; i += 2 )
{
pieces[i] = string.Join(" ", pieces[i].Split(','));
}
string output = string.Join("\"", pieces);
Console.WriteLine(output);
// output: a,b,c,d,"e f",g,h

Regular Expression split string and get whats in brackets [ ] put into array

I am trying to use regex to split the string into 2 arrays to turn out like this.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
How do I split str1 to break off into 2 arrays that look like this:
ary1 = ['First Second','Third Forth','Fifth'];
ary2 = ['insideFirst','insideSecond'];
here is my solution
string str = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
MatchCollection matches = Regex.Matches(str,@"\[.*?\]");
string[] arr = matches.Cast<Match>()
.Select(m => m.Groups[0].Value.Trim(new char[]{'[',']'}))
.ToArray();
foreach (string s in arr)
{
Console.WriteLine(s);
}
string[] arr1 = Regex.Split(str,@"\[.*?\]")
.Select(x => x.Trim())
.ToArray();
foreach (string s in arr1)
{
Console.WriteLine(s);
}
Output
insideFirst
insideSecond
First Second
Third Forth
Fifth
Please try the code below. It works fine for me.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var output = String.Join(";", Regex.Matches(str1, @"\[(.+?)\]")
.Cast<Match>()
.Select(m => m.Groups[1].Value));
string[] strInsideBreacket = output.Split(';');
for (int i = 0; i < strInsideBreacket.Count(); i++)
{
str1 = str1.Replace("[", ";");
str1 = str1.Replace("]", "");
str1 = str1.Replace(strInsideBreacket[i], "");
}
string[] strRemaining = str1.Split(';');
Here,
strInsideBreacket is the array of bracketed values, insideFirst and insideSecond,
and strRemaining is the array of First Second, Third Forth and Fifth.
Thanks
Try this solution,
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var allWords = str1.Split(new char[] { '[', ']' }, StringSplitOptions.RemoveEmptyEntries);
var result = allWords.GroupBy(x => x.Contains("inside")).ToArray();
The idea is to first get all the words and then group them.
It seems to me that "user2828970" asked a question with an example, not with literal text he wanted to parse. In my mind, he could very well have asked this question:
I am trying to use regex to split a string like so.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, @"\d+");
The result of regexSplit is: I had, birds but, of them flew away.
However, I also want to know the value which resulted in the second string splitting away from its preceding text, and the value which resulted in the third string splitting away from its preceding text. i.e.: I want to know about 185 and 20.
The string could be anything, and the pattern to split by could be anything. The answer should not have hard-coded values.
Well, this simple function will perform that task. The code can be optimized to compile the regex, or re-organized to return multiple collections or different objects. But this is (nearly) the way I use it in production code.
public static List<Tuple<string, string>> RegexSplitDetail(this string text, string pattern)
{
var splitAreas = new List<Tuple<string, string>>();
var regexResult = Regex.Matches(text, pattern);
var regexSplit = Regex.Split(text, pattern);
for (var i = 0; i < regexSplit.Length; i++)
splitAreas.Add(new Tuple<string, string>(i == 0 ? null : regexResult[i - 1].Value, regexSplit[i]));
return splitAreas;
}
...
var result = exampleSentence.RegexSplitDetail(@"\d+");
This would return a single collection which looks like this:
{ null, "I had "}, // First value, had no value splitting it from a predecessor
{"185", " birds but "}, // Second value, split from the preceding string by "185"
{ "20", " of them flew away"} // Third value, split from the preceding string by "20"
Since this is a .NET question, and apart from the approach I favour in my other answer, you can also capture the split value in another very simple way. You then just need to write a function that uses the results as you see fit.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, @"(\d+)");
The result of regexSplit is: I had, 185, birds but, 20, of them flew away. As you can see, the split values exist within the split results.
Note the subtle difference compared to my other answer: in this split I put a capture group around the entire pattern, (\d+). You can't do that!? Can you?
You can: when the split pattern contains capture groups, Regex.Split inserts the captured text into the result array, between the pieces it split apart. This can get messy, so I don't suggest doing it. It also forces anybody using your function(s) to know that they have to wrap their regexes in a capture group.
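If you do go the capture-group route, the wrapping can be hidden inside a helper so callers pass a plain pattern. The sketch below assumes the caller's pattern contains no capture groups of its own (otherwise extra entries appear in the array and the even/odd bookkeeping breaks). SplitHelper and SplitKeepingSeparators are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SplitHelper
{
    public static void SplitKeepingSeparators(
        string text, string pattern,
        out List<string> pieces, out List<string> separators)
    {
        pieces = new List<string>();
        separators = new List<string>();

        // Wrapping the whole pattern in one capture group makes Regex.Split
        // return: piece, separator, piece, separator, ..., piece.
        string[] parts = Regex.Split(text, "(" + pattern + ")");

        for (int i = 0; i < parts.Length; i++)
        {
            if (i % 2 == 0)
                pieces.Add(parts[i]);      // even positions: text between matches
            else
                separators.Add(parts[i]);  // odd positions: the matched separators
        }
    }
}
```

For the example sentence and the pattern \d+, pieces receives the three text fragments and separators receives "185" and "20".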

How to Replace Multiple Words in a String Using C#?

I'm wondering how I can replace (remove) multiple words (like 500+) from a string. I know I can use the replace function to do this for a single word, but what if I want to replace 500+ words? I'm interested in removing all generic keywords from an article (such as "and", "I", "you" etc).
Here is the code for 1 replacement.. I'm looking to do 500+..
string a = "why and you it";
string b = a.Replace("why", "");
MessageBox.Show(b);
Thanks
@Sergey Kucher Text size will vary between a few hundred words to a few thousand. I am replacing these words from random articles.
I would normally do something like:
// If you want the search/replace to be case sensitive, remove the
// StringComparer.OrdinalIgnoreCase
Dictionary<string, string> replaces = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase) {
// The format is word to be searched, word that should replace it
// or String.Empty to simply remove the offending word
{ "why", "xxx" },
{ "you", "yyy" },
};
void Main()
{
string a = "why and you it and You it";
// This will search for blocks of letters and numbers (abc/abcd/ab1234)
// and pass it to the replacer
string b = Regex.Replace(a, @"\w+", Replacer);
}
string Replacer(Match m)
{
string found = m.ToString();
string replace;
// If the word found is in the dictionary then it's placed in the
// replace variable by the TryGetValue
if (!replaces.TryGetValue(found, out replace))
{
// otherwise replace the word with the same word (so do nothing)
replace = found;
}
else
{
// The word is in the dictionary. replace now contains the
// word that will substitute it.
// At this point you could add some code to maintain upper/lower
// case between the words (so that if you -> xxx then You becomes Xxx
// and YOU becomes XXX)
}
return replace;
}
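The comment in the else branch mentions keeping upper/lower case in step with the word that was found. One possible way to do that is sketched below; it only distinguishes three styles (all caps, capitalised, anything else), which is an assumption on my part, and the names CasePreservingReplacer, MatchCase and ReplaceWords are invented. The dictionary passed in is expected to use a case-insensitive comparer, as in the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CasePreservingReplacer
{
    // Copy the casing style of 'source' onto 'replacement'.
    public static string MatchCase(string source, string replacement)
    {
        if (source.Length == 0 || replacement.Length == 0)
            return replacement;

        // Source of more than one character whose letters are all upper case: YOU -> YYY
        bool allUpper = source.Length > 1
            && source.Any(char.IsLetter)
            && source.All(c => !char.IsLetter(c) || char.IsUpper(c));
        if (allUpper)
            return replacement.ToUpperInvariant();

        // Capitalised source: You -> Yyy
        if (char.IsUpper(source[0]))
            return char.ToUpperInvariant(replacement[0]) + replacement.Substring(1);

        return replacement;
    }

    public static string ReplaceWords(string input, IDictionary<string, string> replaces)
    {
        return Regex.Replace(input, @"\w+", m =>
        {
            string replacement;
            // Words that are not in the dictionary are returned unchanged.
            return replaces.TryGetValue(m.Value, out replacement)
                ? MatchCase(m.Value, replacement)
                : m.Value;
        });
    }
}
```

Single-letter words such as "I" are treated as capitalised rather than all caps, so only the first letter of their replacement is upper-cased.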
As someone else wrote, but without problems with substrings (the ass principle... You don't want to remove asses from classes :-) ), and working only if you only need to remove words:
var escapedStrings = yourReplaces.Select(Regex.Escape);
string result = Regex.Replace(yourInput, @"\b(" + string.Join("|", escapedStrings) + @")\b", string.Empty);
I use the \b word boundary... It's a little complex to explain what it is, but it's useful for finding word boundaries :-)
Create a list of all the words you want to replace and load it into a list. You can keep this fairly simple or make it very complex. A trivial example would be:
var sentence = "mysentence hi";
var words = File.ReadAllLines("pathtowordlist.txt");
foreach (var word in words)
sentence = sentence.Replace(word, "x"); // strings are immutable, so keep the result
You could create two lists if you wanted a dual mapping scheme.
Try this:
string text = "word1 word2 you it";
List<string> words = new System.Collections.Generic.List<string>();
words.Add("word1");
words.Add("word2");
words.ForEach(w => text = text.Replace(w, ""));
Edit
If you want to replace text with another text, you can create class Word:
public class Word
{
public string SearchWord { get; set; }
public string ReplaceWord { get; set; }
}
And change above code to this:
string text = "word1 word2 you it";
List<Word> words = new System.Collections.Generic.List<Word>();
words.Add(new Word() { SearchWord = "word1", ReplaceWord = "replaced" });
words.Add(new Word() { SearchWord = "word2", ReplaceWord = "replaced" });
words.ForEach(w => text = text.Replace(w.SearchWord, w.ReplaceWord));
If you are talking about a single string, the solution is to remove them all with the simple Replace method. As you can read in its documentation:
"Returns a new string in which all occurrences of a specified string in the current instance are replaced with another specified string".
you may be needing to replace several words, and you can make a list of these words:
List<string> wordsToRemove = new List<string>();
wordsToRemove.Add("why");
wordsToRemove.Add("how");
and so on,
and then remove them from the string
foreach(string curr in wordsToRemove)
a = a.ToLower().Replace(curr, "");
Important
If you want to keep your string as it was, without lowercasing it and without struggling with upper and lower case, use:
foreach(string curr in wordsToRemove)
{
// Escape the word so any regex metacharacters in it are matched literally
Regex regex = new Regex(Regex.Escape(curr), RegexOptions.IgnoreCase);
myString = regex.Replace(myString, "");
}
It depends on the situation, of course, but if your text is long, you have many words, and you want to optimize performance, you should build a trie from the words and search the trie for a match.
It won't lower the order of complexity, which is still O(nm), but for large groups of words it can check multiple words against each character instead of one by one.
I assume a couple of hundred words should be enough to make this faster.
This is the fastest method in my opinion, and I have written a function for you to start with:
public struct FindRecord
{
public int WordIndex;
public int PositionInString;
}
public static FindRecord[] FindAll(string input, string[] words)
{
LinkedList<FindRecord> result = new LinkedList<FindRecord>();
int[] matchs = new int[words.Length];
for (int i = 0; i < input.Length; i++)
{
for (int j = 0; j < words.Length; j++)
{
if (input[i] == words[j][matchs[j]])
{
matchs[j]++;
if(matchs[j] == words[j].Length)
{
FindRecord findRecord = new FindRecord {WordIndex = j, PositionInString = i - matchs[j] + 1};
result.AddLast(findRecord);
matchs[j] = 0;
}
}
else
matchs[j] = 0;
}
}
return result.ToArray();
}
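The starter function above scans the whole word list for every character rather than walking a trie. For comparison, a minimal trie along the lines described could be sketched like this. It is an illustration only: WordTrie and its members are made-up names, matching is case-sensitive, the longest listed word at each position wins, and, like FindAll, it matches substrings rather than whole words.

```csharp
using System.Collections.Generic;
using System.Text;

public sealed class WordTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWordEnd; // true when a listed word ends at this node
    }

    private readonly Node root = new Node();

    public WordTrie(IEnumerable<string> words)
    {
        foreach (string word in words)
        {
            if (string.IsNullOrEmpty(word)) continue;
            Node node = root;
            foreach (char c in word)
            {
                Node next;
                if (!node.Children.TryGetValue(c, out next))
                {
                    next = new Node();
                    node.Children.Add(c, next);
                }
                node = next;
            }
            node.IsWordEnd = true;
        }
    }

    // Length of the longest listed word that starts at input[start], or 0 if none does.
    private int LongestMatchAt(string input, int start)
    {
        Node node = root;
        int longest = 0;
        for (int i = start; i < input.Length; i++)
        {
            if (!node.Children.TryGetValue(input[i], out node)) break;
            if (node.IsWordEnd) longest = i - start + 1;
        }
        return longest;
    }

    // Returns a copy of the input with every occurrence of a listed word removed.
    public string RemoveAll(string input)
    {
        var sb = new StringBuilder(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            int length = LongestMatchAt(input, i);
            if (length > 0)
                i += length;            // skip over the matched word
            else
                sb.Append(input[i++]);  // no word starts here: copy the character
        }
        return sb.ToString();
    }
}
```

Usage would be something like `new WordTrie(words).RemoveAll(article)`. Each input position walks at most as many nodes as the longest word has characters, regardless of how many words are in the list.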
Another option:
This might be the rare case where a regex will be faster than hand-written code.
Try using
public static string ReplaceAll(string input, string[] words)
{
string wordlist = string.Join("|", words);
Regex rx = new Regex(wordlist, RegexOptions.Compiled);
return rx.Replace(input, m => "");
}
Regex can do this better, you just need all the replace words in a list, and then:
var escapedStrings = yourReplaces.Select(PadAndEscape);
string result = Regex.Replace(yourInput, string.Join("|", escapedStrings), " "); // replace " word " with a single space
This requires a function that space-pads the strings before escaping them:
public string PadAndEscape(string s)
{
return Regex.Escape(" " + s + " ");
}

Go to each white space in a string. C#

Is it possible to pass over a string, finding the white spaces?
For example a data set of:
string myString = "aa bbb cccc dd";
How could I loop through and detect each white space, and manipulate that space?
I need to do this in the most efficient way possible.
Thanks.
UPDATE:
I need to manipulate the space by increasing the white space from an integer value. So for instance increase the space to have 3 white spaces instead of one. I'd like to make it go through each white space in one loop, any method of doing this already in .NET? By white space I mean a ' '.
You can use the Regex.Replace method. This will replace any group of white space character with a dash:
myString = Regex.Replace(myString, @"(\s+)", m => "-");
Update:
This will find groups of space characters and replace each with triple the number of spaces:
myString = Regex.Replace(
myString,
"( +)",
m => new String(' ', m.Groups[1].Value.Length * 3)
);
However, that's a bit too simple to make use of regular expressions. You can do the same with a regular replace:
myString = myString.Replace(" ", "   ");
This will replace each space instead of replacing groups of spaces, but the regular replace is much simpler than Regex.Replace, so it should still be at least as fast, and the code is simpler.
If you want to replace all whitespace in one swoop, you can do:
// changes all spaces to dashes
myString = myString.Replace(' ', '-');
If you want to go case by case (that is, not just a mass replace), you can loop through IndexOf():
int pos = myString.IndexOf(' ');
while (pos >= 0)
{
// do whatever you want with myString at pos
// find next
pos = myString.IndexOf(' ', pos + 1);
}
UPDATE
As per your update, you could replace single spaces with the number of spaces specified by a variable (such as numSpaces) as follows:
myString = myString.Replace(" ", new String(' ', numSpaces));
If you just want to replace all spaces with some other character:
myString = myString.Replace(' ', 'x');
If you need the possibility of doing something different to each:
foreach(char c in myString)
{
if (c == ' ')
{
// do something
}
}
Edit:
Per your comment clarifying your question:
To change each space to three spaces, you can do this:
myString = myString.Replace(" ", "   ");
However note that this doesn't take into account instances where your input string already has two or more spaces. If that is a possibility you will want to use a regex.
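A regex version that treats each run of spaces as a single gap might look like the sketch below. It assumes the goal is for every gap to end up exactly `width` spaces wide regardless of how many spaces it had before; SpaceWidth and Set are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class SpaceWidth
{
    // Replace every run of one or more spaces with exactly 'width' spaces.
    public static string Set(string input, int width)
    {
        return Regex.Replace(input, " +", new string(' ', width));
    }
}
```

With this, "aa bbb  cccc" and "aa bbb cccc" both come out with three spaces between the words when width is 3, whereas the plain Replace would turn the double space into six.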
Depending on what you're trying to do:
for(int k = 0; k < myString.Length; k++)
{
if(char.IsWhiteSpace(myString[k]))
{
// do something with it
}
}
The above is a single pass through the string, so it's O(n). You can't really get more efficient than that.
However, if you want to manipulate the original string your best bet is to use a StringBuilder to process the changes:
StringBuilder sb = new StringBuilder(myString);
for(int k = 0; k < myString.Length; k++)
{
if(char.IsWhiteSpace(myString[k]))
{
// do something with sb
}
}
Finally, don't forget about Regular Expressions. It may not always be the most efficient method in terms of code run-time complexity but as far as efficiency of coding it may be a good trade-off.
For instance, here's a way to match all white spaces:
var rex = new System.Text.RegularExpressions.Regex("\\s+");
var m = rex.Match(myString);
while(m.Success)
{
// process the match here..
m = m.NextMatch(); // NextMatch returns the next Match; it does not advance m in place
}
And here's a way to replace all white spaces with an arbitrary string:
var rex = new System.Text.RegularExpressions.Regex("\\s+");
String replacement = "[white_space]";
// replaces all occurrences of white space with the string [white_space]
String result = rex.Replace(myString, replacement);
Use string.Replace().
string newString = myString.Replace(" ", "   ");
The LINQ query below returns a sequence of anonymous-type items with two properties: "symbol" is the white space character and "index" is its position in the input sequence. With every white space character and its position in hand, you can do what you want with them.
string myString = "aa bbb cccc dd";
var res = myString.Select((c, i) => new { symbol = c, index = i })
.Where(c => Char.IsWhiteSpace(c.symbol));
EDIT: For educational purposes, below is the implementation you are looking for, but obviously in a real system you should use the built-in string constructor and String.Replace() as shown in other answers.
string myString = "aa bbb cccc dd";
var result = this.GetCharacters(myString, 5);
string output = new string(result.ToArray());
public IEnumerable<char> GetCharacters(string input, int coeff)
{
foreach (char c in input)
{
if (Char.IsWhiteSpace(c))
{
int counter = coeff;
while (counter-- > 0)
{
yield return c;
}
}
else
{
yield return c;
}
}
}
var result = new StringBuilder();
foreach(Char c in myString)
{
if (Char.IsWhiteSpace(c))
{
// you can do what you wish here. strings are immutable, so you can only make a copy with the results you want... hence the "result" var.
result.Append('_'); // for example, replace space with _
}
else result.Append(c);
}
myString = result.ToString();
If you want to replace the white space with, e.g., '_', you can use String.Replace.
Example:
string myString = "aa bbb cccc dd";
string newString = myString.Replace(" ", "_"); // gives aa_bbb_cccc_dd
In case you want to left/right justify your string
int N = 10;
string newstring = String.Join(
"",
myString.Split(' ').Select(s => s.PadRight(N))); // PadRight takes the total width, so each word fills N characters
