Regular Expression split string and get whats in brackets [ ] put into array - c#

I am trying to use a regex to split a string into two arrays.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
How do I split str1 into two arrays that look like this:
ary1 = ['First Second','Third Forth','Fifth'];
ary2 = ['insideFirst','insideSecond'];

Here is my solution:
string str = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
MatchCollection matches = Regex.Matches(str, @"\[.*?\]");
string[] arr = matches.Cast<Match>()
.Select(m => m.Groups[0].Value.Trim(new char[]{'[',']'}))
.ToArray();
foreach (string s in arr)
{
Console.WriteLine(s);
}
string[] arr1 = Regex.Split(str, @"\[.*?\]")
.Select(x => x.Trim())
.ToArray();
foreach (string s in arr1)
{
Console.WriteLine(s);
}
Output
insideFirst
insideSecond
First Second
Third Forth
Fifth

Please try the code below. It works fine for me.
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var output = String.Join(";", Regex.Matches(str1, @"\[(.+?)\]")
.Cast<Match>()
.Select(m => m.Groups[1].Value));
string[] strInsideBreacket = output.Split(';');
for (int i = 0; i < strInsideBreacket.Count(); i++)
{
str1 = str1.Replace("[", ";");
str1 = str1.Replace("]", "");
str1 = str1.Replace(strInsideBreacket[i], "");
}
string[] strRemaining = str1.Split(';');
Here, strInsideBreacket is the array of bracketed values (insideFirst and insideSecond), and strRemaining is the array of the remaining text (First Second, Third Forth and Fifth). The entries of strRemaining keep their surrounding spaces, so call Trim() on them if needed.

Try this solution,
String str1 = "First Second [insideFirst] Third Forth [insideSecond] Fifth";
var allWords = str1.Split(new char[] { '[', ']' }, StringSplitOptions.RemoveEmptyEntries);
var result = allWords.GroupBy(x => x.Contains("inside")).ToArray();
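The idea is to first split the string into all of its pieces and then group them. Note that the grouping relies on the bracketed values containing the literal text "inside".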

It seems to me that "user2828970" asked a question with an example, not with literal text he wanted to parse. In my mind, he could very well have asked this question:
I am trying to use regex to split a string like so.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, @"\d+");
The result of regexSplit is: I had, birds but, of them flew away.
However, I also want to know the value which resulted in the second string splitting away from its preceding text, and the value which resulted in the third string splitting away from its preceding text. i.e.: I want to know about 185 and 20.
The string could be anything, and the pattern to split by could be anything. The answer should not have hard-coded values.
Well, this simple function will perform that task. The code can be optimized to compile the regex, or re-organized to return multiple collections or different objects. But this is (nearly) the way I use it in production code.
public static List<Tuple<string, string>> RegexSplitDetail(this string text, string pattern)
{
var splitAreas = new List<Tuple<string, string>>();
var regexResult = Regex.Matches(text, pattern);
var regexSplit = Regex.Split(text, pattern);
for (var i = 0; i < regexSplit.Length; i++)
splitAreas.Add(new Tuple<string, string>(i == 0 ? null : regexResult[i - 1].Value, regexSplit[i]));
return splitAreas;
}
...
var result = exampleSentence.RegexSplitDetail(@"\d+");
This would return a single collection which looks like this:
{ null, "I had "}, // First value, had no value splitting it from a predecessor
{"185", " birds but "}, // Second value, split from the preceding string by "185"
{ "20", " of them flew away"} // Third value, split from the preceding string by "20"

Since this is a .NET question: apart from the approach I favour in my other answer, you can also capture the split value in another very simple way. You then just need to write a function that uses the results as you see fit.
var exampleSentence = "I had 185 birds but 20 of them flew away";
var regexSplit = Regex.Split(exampleSentence, @"(\d+)");
The result of regexSplit is: I had, 185, birds but, 20, of them flew away. As you can see, the split values exist within the split results.
Note the subtle difference compared to my other answer. In this regex split, I put a capture group around the entire pattern: (\d+). You can't do that, can you? In .NET you can.
When the pattern passed to Regex.Split contains capture groups, the captured text of every group is inserted into the result array between the split pieces. This can get messy, so I don't suggest doing it. It also forces anybody using your function(s) to know that they have to wrap their regexes in a capture group.
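If the capture-group form is used anyway, the alternating layout of the result (text, separator, text, ...) can be unpacked with a small helper. The following is only a sketch: the class and method names are made up for illustration, and it assumes the caller's pattern contains no capture groups of its own, because the helper adds the wrapping group itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SplitWithSeparators
{
    // Hypothetical helper: returns (separator, text) pairs.
    // The first pair has a null separator because nothing precedes the first piece.
    // Assumption: 'pattern' has no capture groups of its own.
    public static List<KeyValuePair<string, string>> Split(string text, string pattern)
    {
        // Wrapping the whole pattern in one group makes Regex.Split return
        // text, separator, text, separator, ..., text.
        string[] parts = Regex.Split(text, "(" + pattern + ")");

        var result = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < parts.Length; i += 2)
        {
            // Even indexes hold text; the separator before it sits at i - 1.
            string separator = i == 0 ? null : parts[i - 1];
            result.Add(new KeyValuePair<string, string>(separator, parts[i]));
        }
        return result;
    }
}
```

Calling SplitWithSeparators.Split(exampleSentence, @"\d+") gives three pairs, with "185" and "20" as the separators of the second and third pair. Because the helper adds the group, callers do not have to remember to wrap their patterns.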

Related

Loop through string and remove any occurrence of specified word

I'm trying to remove all conjunctions and pronouns from an array of strings (let's call that array A). The words to be removed are read from a text file and converted into an array of strings (let's call that array B).
What I need is to take each element of array A and compare it to every word in array B; if the word matches, I want to delete it from array A.
For example:
array A = [0]I [1]want [2]to [3]go [4]home [5]and [6]sleep
array B = [0]I [1]and [2]go [3]to
Result= array A = [0]want [1]home [2]sleep
//remove any duplicates,conjunctions and Pronouns
public IQueryable<All_Articles> removeConjunctionsProNouns(IQueryable<All_Articles> myArticles)
{
//get words to be removed
string text = System.IO.File.ReadAllText("A:\\EnterpriceAssigment\\EnterpriceAssigment\\TextFiles\\conjunctions&ProNouns.txt").ToLower();
//split word into array of strings
string[] wordsToBeRemoved = text.Split(',');
//all articles
foreach (var article in myArticles)
{
//split articles into words
string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
//loop through array of articles words
foreach (var y in articleSplit)
{
//loop through words to be removed from articleSplit
foreach (var x in wordsToBeRemoved)
{
//if word of articles matches word to be removed, remove word from article
if (y == x)
{
//get index of element in array to be removed
int g = Array.IndexOf(articleSplit,y);
//assign elemnt to ""
articleSplit[g] = "";
}
}
}
//re-assign splitted article to string
article.ArticleContent = articleSplit.ToString();
}
return myArticles;
}
If it is possible as well, I need array A to have no duplicates/distinct values.

You are looking for Enumerable.Except (an extension method on IEnumerable<T>): it returns every element of the input sequence that is not present in the sequence passed as the parameter, and each element is returned only once.
For example
string inputText = "I want this string to be returned without some words , but words should have only one occurence";
string[] excludedWords = new string[] {"I","to","be", "some", "but", "should", "have", "one", ","};
var splitted = inputText.Split(' ');
var result = splitted.Except(excludedWords);
foreach(string s in result)
Console.WriteLine(s);
// ---- Output ----
want
this
string
returned
without
words <<-- This appears only once
only
occurence
And applied to your code is:
string text = File.ReadAllText(......).ToLower();
string[] wordsToBeRemoved = text.Split(',');
foreach (var article in myArticles)
{
string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
var result = articleSplit.Except(wordsToBeRemoved);
article.ArticleContent = string.Join(" ", result);
}

You may have your answer already in your code. I am sure your code could be cleaned up a bit, as all our code could be. You loop through articleSplit and pull out each word. Then compare that word to the words in the wordsToBeRemoved array in a loop one by one. You use your conditional to compare and when true you remove the items from your original array, or at least try.
I would create another array for the results and then display or use that array, minus the excluded words, however you like.
Loop through articleSplit:
foreach x in articleSplit
if wordsToBeRemoved does not contain x
newArray.Add(x)
However, this is quite a bit of work. You may want to filter the array instead (for example with Array.FindAll or LINQ's Where). There are a hundred ways to achieve this.
Here are some helpful articles:
filter an array in C#
https://msdn.microsoft.com/en-us/library/d9hy2xwa(v=vs.110).aspx
These will save you from all that looping.

You want to remove stop words. You can do it with the help of LINQ:
...
string filePath = @"A:\EnterpriceAssigment\EnterpriceAssigment\TextFiles\conjunctions & ProNouns.txt";
// Hashset is much more efficient than array in the context
HashSet<string> stopWords = new HashSet<string>(File
.ReadLines(filePath), StringComparer.OrdinalIgnoreCase);
foreach (var article in myArticles) {
// read article, split into words, filter out stop words...
var cleared = article
.ArticleContent
.Split(' ')
.Where(word => !stopWords.Contains(word));
// ...and join words back into article
article.ArticleContent = string.Join(" ", cleared);
}
...
Please note that I've kept the Split() you used in your code, so this is a toy implementation. In real life you have to take at least punctuation into account, which is why better code uses regular expressions:
foreach (var article in myArticles) {
// read article, extract words, filter out stop words...
var cleared = Regex
.Matches(article.ArticleContent, @"\w+") // <- extract words
.OfType<Match>()
.Select(match => match.Value)
.Where(word => !stopWords.Contains(word));
// ...and join words back into article
article.ArticleContent = string.Join(" ", cleared);
}

C# split by regex

I have a little problem that I don't know how to name, so I will do my best to explain it.
String text = "Random text over here boyz, I dunno what to do";
I want to use a split to get only "over here boyz", for example. Splitting on the word "text" and on the "," should give me all the text between those two strings. Any ideas?
Thank you,
Sagi.

From your comments I get that from this string:
foo bar id="baz" qux
You want to obtain the value baz, because it is in the id="{text}" pattern.
For that you can use a regular expression:
string result = Regex.Match(text, "id=\"(.*?)\"").Groups[1].Value;
Note that this will match any character. Also note that this will yield false positives, like fooid="bar", and that this won't match unquoted values.
So all in all, for parsing HTML, you should not use regular expressions. Try HtmlAgilityPack and an XPath expression.

There is a Split overload that can receive multiple string separators:
var rrr = text.Split(new string[] { ",", "text" }, StringSplitOptions.None);
If you would like to extract only the text between these two strings using regex you can do something like this:
var pattern = @"text(.*),";
var a = new Regex(pattern).Match(text);
var result = a.Groups[1];

You can use Regex class:
https://msdn.microsoft.com/pl-pl/library/ze12yx1d%28v=vs.110%29.aspx
But first of all (as it was said) you need to clarify for yourself how you will identify string that you want.

In the first case you can use:
string stringResult;
if (text.Contains("over here boyz"))
stringResult = string.Empty;
else
stringResult = "over here boyz";
The second case can be solved with this code:
String text = "Random text over here boyz, I dunno what to do";
//Second dream without whitespace
var result = Regex.Split(text, " *text *| *, *");
foreach (var x in result)
{
Console.WriteLine(x);
}
//Second dream with whitespace
result = Regex.Split(text, "text|,");
foreach (var x in result)
{
Console.WriteLine(x);
}
You can practice writing regexes with a tool such as http://www.regexbuddy.com/ or http://www.regexr.com/

Shortcut for splitting only once in C#?

Okay, let's say I have a string:
string text = "one|two|three";
If I do string[] texts = text.Split('|'); I will end up with a string array of three elements. However, this isn't what I want. What I actually want is to split the string only once, so the two strings I get would be:
one
two|three
Additionally, is there a way to do a single split with the last occurrence in a string? So I get:
one|two
three
As well, is there a way to split by a string, instead of a character? So I could do Split("||")

Split has an overload that takes a count parameter; pass 2 in that position, which says that you want at most 2 elements. You'll get the expected result.
For the second question: there is no built-in way, as far as I know. You may need to implement it yourself by splitting everything and joining all but the last piece back together.
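As an alternative to splitting everything and re-joining, the last-occurrence case can be handled with LastIndexOf and Substring, and the split-by-string case with the Split overload that takes a string array and a count. This is a minimal sketch; the SplitOnce class and its method names are invented for illustration, and a non-empty separator is assumed.

```csharp
using System;

public static class SplitOnce
{
    // Hypothetical helper: splits at the LAST occurrence of 'separator'.
    // Returns the whole input as a single element when the separator is absent.
    // Assumption: 'separator' is not empty.
    public static string[] AtLast(string text, string separator)
    {
        int index = text.LastIndexOf(separator, StringComparison.Ordinal);
        if (index < 0)
            return new[] { text };

        return new[]
        {
            text.Substring(0, index),                 // everything before the separator
            text.Substring(index + separator.Length)  // everything after it
        };
    }

    // Splits at the FIRST occurrence of a string separator by capping the count at 2.
    public static string[] AtFirst(string text, string separator)
    {
        return text.Split(new[] { separator }, 2, StringSplitOptions.None);
    }
}
```

With these, SplitOnce.AtLast("one|two|three", "|") yields "one|two" and "three", and SplitOnce.AtFirst("one||two||three", "||") yields "one" and "two||three".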

C#'s String.Split() can take a second argument that can define the number of elements to return:
string[] texts = text.Split(new char[] { '|' }, 2);

For your first scenario, you can pass a parameter of how many strings to split into.
var text = "one|two|three";
var result = text.Split(new char[] { '|' }, 2);
Your second scenario requires a little more magic.
var text = "one|two|three";
var list = text.Split('|');
var result = new string[] { string.Join("|", list, 0, list.Length - 1), list[list.Length - 1] };
This code has not been tested, so check the results before using it.

Well, I took it as a challenge to do your second one in one line. The result is... not pretty, mostly because it's surprisingly difficult to reverse a string and keep it as a string.
string text = "one|two|three";
var result = new String(text.Reverse().ToArray()).Split(new char[] {'|'}, 2).Reverse().Select(c => new String(c.Reverse().ToArray()));
Basically, you reverse it, then follow the same procedure as the first one, then reverse each individual one, as well as the resulting array.

You can simply do like this as well...
//To split at first occurrence of '|'
if (text.Contains("|")) {
beginning = text.Substring(0, text.IndexOf('|'));
ending = text.Substring(text.IndexOf('|') + 1); // + 1 skips the delimiter itself
}
//To split at last occurrence of '|'
if (text.Contains("|")) {
beginning = text.Substring(0, text.LastIndexOf('|'));
ending = text.Substring(text.LastIndexOf('|') + 1);
}

Second question was fun. I solved it this way:
string text = "one|two|three";
var result =
new []
{
string.Concat(text.ToCharArray().TakeWhile((c, i) => i < text.LastIndexOf("|"))), // stops before the last '|'
string.Concat(text.ToCharArray().SkipWhile((c, i) => i <= text.LastIndexOf("|")))
};

How do I know which delimiter was used when delimiting a string on multiple delimiters? (C#)

I read strings from a file and they come in various styles:
item0 item1 item2
item0,item1,item2
item0_item1_item2
I split them like this:
string[] split_line = line[i].Split(new char[] {' ',',','_'});
I change an item (column) and then I stitch the strings back together using a StringBuilder.
But now when putting the string back I have to use the right delimiter.
Is it possible to know which delimiter was used when splitting the string?
UPDATE
the caller will pass me the first item so that I only change that line.

Unless you keep track of each splitting action (one delimiter at a time), you can't.
Otherwise, you could create a regular expression, to catch the item and the delimiter and go from there.

Instead of passing in an array of characters, you can use a Regex to split the string instead. The advantage of doing this, is that you can capture the splitting character. Regex.Split will insert any captures between elements in the array like so:
string[] space = Regex.Split("123 456 789", @"([,_ ])");
// Results in { "123", " ", "456", " ", "789" }
string[] comma = Regex.Split("123,456,789", @"([,_ ])");
// Results in { "123", ",", "456", ",", "789" }
string[] underscore = Regex.Split("123_456_789", @"([,_ ])");
// Results in { "123", "_", "456", "_", "789" }
Then you can edit all items in the array with something like
for (int x = 0; x < space.Length; x += 2)
space[x] = space[x] + "x";
Console.WriteLine(String.Join("", space));
// Will print: 123x 456x 789x
One thing to be wary of when dealing with multiple separators is if there are any lines that have spaces, commas and underscores in them. e.g.
37,hello world,238_3
This code will preserve all the distinct separators but your results might not be expected. e.g. the output of the above would be:
37x,hellox worldx,238x_3x

As I mentioned that the caller passes me the first item so I tried something like this:
// find the right row
if (lines[i].ToLower().StartsWith(rowID))
{
// we have to know which delim was used to split the string since this will be
// used when stitching back the string together.
for (int delim = 0; delim < delims.Length; delim++)
{
// we split the line into an array and then use the array index as our column index
split_line = lines[i].Trim().Split(delims[delim]);
// we found the right delim
if (split_line.Length > 1)
{
delim_used = delims[delim];
break;
}
}
}
Basically, I try each delimiter on the line and check the length of the resulting array. If it is greater than 1, that delimiter worked; otherwise I skip to the next one. I am relying on this documented property of Split: "If this instance does not contain any of the characters in separator, the returned array consists of a single element that contains this instance."
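The same detection can also be done without splitting once per candidate, by asking IndexOfAny which delimiter appears first in the line. This is a sketch under the assumption that each line uses a single delimiter style and has already been trimmed; the DelimiterDetection class and its members are hypothetical names, not part of the code above.

```csharp
using System;

public static class DelimiterDetection
{
    // Candidate delimiters, taken from the question.
    private static readonly char[] Delimiters = { ' ', ',', '_' };

    // Hypothetical helper: returns the first delimiter that occurs in the line,
    // or null when the line contains none of them.
    public static char? Detect(string line)
    {
        int index = line.IndexOfAny(Delimiters);
        return index < 0 ? (char?)null : line[index];
    }

    // Replaces one column and rebuilds the line with the delimiter it was split by.
    public static string ReplaceColumn(string line, int column, string newValue)
    {
        char? delimiter = Detect(line);
        if (delimiter == null)
            return column == 0 ? newValue : line; // single-item line

        string[] items = line.Split(delimiter.Value);
        if (column >= 0 && column < items.Length)
            items[column] = newValue;

        return string.Join(delimiter.Value.ToString(), items);
    }
}
```

For example, DelimiterDetection.ReplaceColumn("item0,item1,item2", 1, "X") gives "item0,X,item2", and the same call on "item0_item1_item2" keeps the underscores.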

string replace using a List<string>

I have a List of words I want to ignore like this one :
public List<String> ignoreList = new List<String>()
{
"North",
"South",
"East",
"West"
};
For a given string, say "14th Avenue North" I want to be able to remove the "North" part, so basically a function that would return "14th Avenue " when called.
I feel like there is something I should be able to do with a mix of LINQ, regex and replace, but I just can't figure it out.
The bigger picture is, I'm trying to write an address matching algorithm. I want to filter out words like "Street", "North", "Boulevard", etc. before I use the Levenshtein algorithm to evaluate the similarity.

How about this:
string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)));
or for .NET 3.5:
string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)).ToArray());
Note that this method splits the string up into individual words so it only removes whole words. That way it will work properly with addresses like Northampton Way #123 that string.Replace can't handle.

Regex r = new Regex(string.Join("|", ignoreList.Select(s => Regex.Escape(s)).ToArray()));
string s = "14th Avenue North";
s = r.Replace(s, string.Empty);

Something like this should work:
string FilterAllValuesFromIgnoreList(string someStringToFilter)
{
return ignoreList.Aggregate(someStringToFilter, (str, filter)=>str.Replace(filter, ""));
}

What's wrong with a simple for loop?
string street = "14th Avenue North";
foreach (string word in ignoreList)
{
street = street.Replace(word, string.Empty);
}

If you know that the list of words contains only characters that do not need escaping inside a regular expression then you can do this:
string s = "14th Avenue North";
Regex regex = new Regex(string.Format(@"\b({0})\b",
string.Join("|", ignoreList.ToArray())));
s = regex.Replace(s, "");
Result:
14th Avenue
If there are special characters you will need to fix two things:
Use Regex.Escape on each element of ignore list.
The word-boundary \b will not match a whitespace followed by a symbol or vice versa. You may need to check for whitespace (or other separating characters such as punctuation) using lookaround assertions instead.
Here's how to fix these two problems:
Regex regex = new Regex(string.Format(@"(?<= |^)({0})(?= |$)",
string.Join("|", ignoreList.Select(x => Regex.Escape(x)).ToArray())));

If it's a short string as in your example, you can just loop through the strings and replace one at a time. If you want to get fancy you can use the LINQ Aggregate method to do it:
address = ignoreList.Aggregate(address, (a, s) => a.Replace(s, String.Empty));
If it's a large string, that would be slow. Instead you can replace all strings in a single run through the string, which is much faster. I made a method for that in this answer.
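One way to replace everything in a single pass is to build one regex from the ignore list, as in the sketch below. This is not necessarily the method from the linked answer, which is not reproduced here; the IgnoreListFilter class and RemoveIgnored method are hypothetical names, and the sketch assumes the ignored words begin and end with word characters so that \b behaves as intended.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class IgnoreListFilter
{
    // Hypothetical helper: removes every whole-word occurrence of the
    // ignored words with a single regex replacement.
    public static string RemoveIgnored(string text, IEnumerable<string> ignoreList)
    {
        // Escape each word so regex metacharacters in the list are taken literally.
        string alternation = string.Join("|", ignoreList.Select(Regex.Escape).ToArray());

        // \b keeps "North" from matching inside "Northampton".
        string pattern = @"\b(?:" + alternation + @")\b";
        string stripped = Regex.Replace(text, pattern, string.Empty);

        // Collapse the gaps left behind by removed words and trim the ends.
        return Regex.Replace(stripped, @"\s{2,}", " ").Trim();
    }
}
```

For example, RemoveIgnored("14th Avenue North", ignoreList) returns "14th Avenue", while "Northampton Way" is left untouched. When the same list is reused for many addresses, the Regex could be constructed once and cached instead of being rebuilt on every call.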

LINQ makes this easy and readable. This requires normalized data though, particularly in that it is case-sensitive.
List<string> ignoreList = new List<string>()
{
"North",
"South",
"East",
"West"
};
string s = "123 West 5th St"
.Split(' ') // Separate the words to an array
.ToList() // Convert array to List<>
.Except(ignoreList) // Remove ignored keywords (also drops duplicate words)
.Aggregate((s1, s2) => s1 + " " + s2); // Reconstruct the string

Why not just keep it simple?
public static string Trim(string text)
{
var rv = text.Trim();
foreach (var ignore in ignoreList) {
if (rv.EndsWith(ignore)) {
rv = rv.Replace(ignore, string.Empty);
}
}
return rv;
}

You can do this using an expression if you like, but it's easier to turn it around than to use Aggregate. I would do something like this:
string s = "14th Avenue North";
ignoreList.ForEach(i => s = s.Replace(i, ""));
//result is "14th Avenue "

public static string Trim(string text)
{
var rv = text;
foreach (var ignore in ignoreList)
rv = rv.Replace(ignore, "");
return rv;
}
Updated For Gabe
public static string Trim(string text)
{
var rv = "";
var words = text.Split(' ');
foreach (var word in words)
{
var present = false;
foreach (var ignore in ignoreList)
if (word == ignore)
present = true;
if (!present)
rv += (rv.Length > 0 ? " " : "") + word; // keep a space between the kept words
}
return rv;
}

If you have a list, I think you're going to have to touch all the items. You could create a massive RegEx with all your ignore keywords and replace to String.Empty.
Here's a start:
(^|\s+)(North|South|East|West){1,2}(ern)?(\s+|$)
If you have a single RegEx for ignore words, you can do a single replace for each phrase you want to pass to the algorithm.
