Removing words from text with separators in front(using Regex) - c#

I need to remove words from the text with separators next to them. I already removed words but I don't know how I can remove separators at the same time. Any suggestions?
At the moment I have:
static void Main(string[] args)
{
Program p = new Program();
string text = "";
text = p.ReadText("Duomenys.txt", text);
string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
p.DeleteWordsFromText(text, wordsToDelete, separators);
}
public string ReadText(string file, string text)
{
text = File.ReadAllText(file);
return text;
}
public void DeleteWordsFromText(string text, string[] wordsToDelete, char[] separators)
{
Console.WriteLine(text);
for (int i = 0; i < wordsToDelete.Length; i++)
{
text = Regex.Replace(text, wordsToDelete[i], String.Empty);
}
Console.WriteLine("-------------------------------------------");
Console.WriteLine(text);
}
The results should be:
how are you?
I am good.
I have:
, how are you?
, I am . good.
Duomenys.txt
Hello, how are you?
Thanks, I am kinda. good.

You may build a regex like
\b(?:Hello|Thanks|kinda)\b[ .,!?:;() ]*
where \b(?:Hello|Thanks|kinda)\b will match any words to delete as whole words and [ .,!?:;() ]* will match all your separators 0 or more times after the words to delete.
The C# solution:
char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
string SepPattern = new String(separators).Replace(@"\", @"\\").Replace("^", @"\^").Replace("-", @"\-").Replace("]", @"\]");
var pattern = $@"\b(?:{string.Join("|", wordsToDelete.Select(Regex.Escape))})\b[{SepPattern}]*";
// => \b(?:Hello|Thanks|kinda)\b[ .,!?:;() ]*
Regex rx = new Regex(pattern, RegexOptions.Compiled);
// RegexOptions.IgnoreCase can be added to the above flags for case insensitive matching: RegexOptions.IgnoreCase | RegexOptions.Compiled
DeleteWordsFromText("Hello, how are you?", rx);
DeleteWordsFromText("Thanks, I am kinda. good.", rx);
Here is the DeleteWordsFromText method:
public static void DeleteWordsFromText(string text, Regex p)
{
Console.WriteLine($"---- {text} ----");
Console.WriteLine(p.Replace(text, ""));
}
Output:
---- Hello, how are you? ----
how are you?
---- Thanks, I am kinda. good. ----
I am good.
Notes:
string SepPattern = new String(separators).Replace(@"\", @"\\").Replace("^", @"\^").Replace("-", @"\-").Replace("]", @"\]"); - it is a separator pattern that will be used inside a character class, and since only the ^, -, \, ] chars require escaping inside a character class, only these chars are escaped.
$@"\b(?:{string.Join("|", wordsToDelete.Select(Regex.Escape))})\b" - this will build the alternation from the words to delete and will only match them as whole words.
Pattern details
\b - word boundary
(?: - start of a non-capturing group:
Hello - Hello word
| - or
Thanks - Thanks word
| - or
kinda - kinda word
) - end of the group
\b - word boundary
[ .,!?:;() ]* - any 0+ chars inside the character class.
See the regex demo.

You can build the regex as follows:
var regex = new Regex(@"\b("
+ string.Join("|", wordsToDelete.Select(Regex.Escape)) + ")("
+ string.Join("|", separators.Select(c => Regex.Escape(new string(c, 1)))) + ")?");
Explanation:
the \b at the start matches a word boundary. Just in case you get "XYZThanks"
the next part builds a regex construct matching any of the wordsToDelete
the last part builds a regex construct matching any of the separators; the trailing "?" is there because you said you want to replace the word also if no separator follows
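As a rough, self-contained sketch of how that pattern behaves on the question's sample text (the class and method names here, WordRemover and BuildRegex, are made up for illustration and are not part of the answer):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordRemover
{
    // Hypothetical helper: builds \b(word1|word2|...)(sep1|sep2|...)? from the two lists.
    public static Regex BuildRegex(string[] wordsToDelete, char[] separators)
    {
        string words = string.Join("|", wordsToDelete.Select(Regex.Escape));
        string seps = string.Join("|", separators.Select(c => Regex.Escape(c.ToString())));
        return new Regex(@"\b(" + words + ")(" + seps + ")?");
    }

    public static void Main()
    {
        string[] wordsToDelete = { "Hello", "Thanks", "kinda" };
        char[] separators = { ' ', '.', ',', '!', '?', ':', ';', '(', ')', '\t' };
        Regex regex = BuildRegex(wordsToDelete, separators);

        // Only the single separator right after the word is removed,
        // so the space that followed the comma is still there.
        Console.WriteLine("[" + regex.Replace("Hello, how are you?", "") + "]");
        // prints "[ how are you?]"
    }
}
```

Because the trailing quantifier is "?", at most one separator is consumed per word. If a whole run of separators should go, replacing that "?" with "*" is one possible adjustment.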

I would not use Regex. In 3 months from now, you'll not understand the Regex any more and fixing bugs is a hard thing then.
I would use simple loops. Everyone will understand:
public void DeleteWordsFromText(string text, string[] wordsToDelete, char[] separators)
{
Console.WriteLine(text);
foreach (string word in wordsToDelete)
{
foreach(char separator in separators)
{
text = text.Replace(word + separator, String.Empty);
}
}
Console.WriteLine("-------------------------------------------");
Console.WriteLine(text);
}

Related

Using Regex to split string by different characters based on occurrence

I'm currently replacing a very old (and long) C# string parsing class that I think could be condensed into a single regex statement. Being a newbie to Regex, I'm having some issues getting it working correctly.
Description of the possible input strings:
The input string can have up to three words separated by spaces. It can stop there, or it can have an = followed by more words (any amount) separated by a comma. The words can also be contained in quotes. If a word is in quotes and has a space, it should NOT be split by the space.
Examples of input and expected output elements in the string array:
Input1:
this is test
Output1:
{"this", "is", "test"}
Input2:this is test=param1,param2,param3
Output2: {"this", "is", "test", "param1", "param2", "param3"}
Input3:use file "c:\test file.txt"=param1 , param2,param3
Output3: {"use", "file", "c:\test file.txt", "param1", "param2", "param3"}
Input4:log off
Output4: {"log", "off"}
And the most complex one:
Input5:
use object "c:\test file.txt"="C:\Users\layer.shp" | ( object = 10 ),param2
Output5:
{"use", "object", "c:\test file.txt", "C:\Users\layer.shp | ( object = 10 )", "param2"}
So to break this down:
I need to split by spaces up to the first three words
Then, if there is an =, ignore the = and split by commas instead.
If there are quotes around one of the first three words and contains a space, INCLUDE that space (don't split)
Here's the closest regex I've got:
\w+|"[\w\s\:\\\.]*"+([^,]+)
This seems to split the string based on spaces, and by commas after the =. However, it seems to include the = for some reason if one of the first three words is surrounded by quotes. Also, I'm not sure how to split by space only up to the first three words in the string, and the rest by comma if there is an =.
It looks like part of my solution is to use quantifiers with {}, but I've been unable to set it up properly.
Without Regex (Regex should be used only when string methods cannot do the job):
string[] inputs = {
"this is test",
"this is test=param1,param2,param3",
"use file \"c:\\test file.txt\"=param1 , param2,param3",
"log off",
"use object \"c:\\test file.txt\"=\"C:\\Users\\layer.shp\" | ( object = 10 ),param2"
};
foreach (string input in inputs)
{
List<string> splitArray;
if (!input.Contains("="))
{
splitArray = input.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList();
}
else
{
int equalPosition = input.IndexOf("=");
splitArray = input.Substring(0, equalPosition).Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList();
string end = input.Substring(equalPosition + 1);
splitArray.AddRange(end.Split(new char[] { ',' }).ToList());
}
string output = string.Join(",", splitArray.Select(x => x.Contains("\"") ? x : "\"" + x + "\""));
Console.WriteLine(output);
}
Console.ReadLine();
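The code above splits the first part on every space, so a quoted path such as "c:\test file.txt" is still broken in two. A minimal sketch of one way to keep quoted tokens together is shown below. It assumes the first = outside quotes is the divider and that quotes are always balanced; CommandSplitter and FindUnquotedEquals are hypothetical names, and a small regex is used only to tokenize the head.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CommandSplitter
{
    // Hypothetical helper: index of the first '=' outside double quotes, or -1.
    static int FindUnquotedEquals(string input)
    {
        bool inQuotes = false;
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == '"') inQuotes = !inQuotes;
            else if (input[i] == '=' && !inQuotes) return i;
        }
        return -1;
    }

    public static List<string> Split(string input)
    {
        int eq = FindUnquotedEquals(input);
        string head = eq < 0 ? input : input.Substring(0, eq);

        // Head: a quoted run counts as one token, otherwise split on whitespace.
        List<string> result = Regex.Matches(head, "\"[^\"]*\"|\\S+")
            .Cast<Match>()
            .Select(m => m.Value.Trim('"'))
            .ToList();

        if (eq >= 0)
        {
            // Tail: split on commas, drop the quote characters, and trim.
            result.AddRange(input.Substring(eq + 1)
                .Split(',')
                .Select(p => p.Replace("\"", "").Trim()));
        }
        return result;
    }

    public static void Main()
    {
        string input = "use file \"c:\\test file.txt\"=param1 , param2,param3";
        Console.WriteLine(string.Join(" | ", Split(input)));
        // prints: use | file | c:\test file.txt | param1 | param2 | param3
    }
}
```

This sketch does not enforce the three-word limit on the head, and a comma inside a quoted parameter after the = would still be treated as a separator.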

Delim Tabs in a string

string s = " 1 16 34";
string[] words = s.Split('\t');
foreach (string word in words)
{
Console.WriteLine(word);
}
I have a string in the format shown above, but when I try to split it on the tab escape character, it just outputs the exact same string in its original format. Why is it not removing the tabs?
string[] words = s.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string word in words)
{
Console.WriteLine(word);
}
I think I fixed it.
It gives me this output.
1
16
34
Which I checked by outputting all 3 in the array to make sure they are separated.
Split on char[0] - this will split on all whitespace characters.
StringSplitOptions.RemoveEmptyEntries - will remove empty entries.
var words = myStr.Split(new char[0], StringSplitOptions.RemoveEmptyEntries);
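A small, self-contained sketch of this approach follows. The sample string with explicit \t escapes and the names TabSplitDemo and SplitOnWhitespace are assumptions for illustration; passing a null separator array behaves the same way as an empty one.

```csharp
using System;

public static class TabSplitDemo
{
    // Hypothetical helper wrapping the call from the answer above.
    public static string[] SplitOnWhitespace(string s)
    {
        // A null (or empty) separator array makes Split use white-space characters.
        return s.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string s = "  1\t16 \t 34";
        foreach (string word in SplitOnWhitespace(s))
        {
            Console.WriteLine(word);
        }
        // prints 1, 16 and 34 on separate lines
    }
}
```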
Try using Split without an argument
string s = " 1 16 34";
string[] words = s.Split();
foreach (string word in words)
{
Console.WriteLine(word);
}
This will split the string on every whitespace character, including tabs.
Paranoid solution: split on any white space (space, tab, non-breaking space etc.)
string s = " 1 16 34";
string[] words = Regex
.Matches(s, @"\S+")
.OfType<Match>()
.Select(m => m.Value)
.ToArray();
Regular expressions may well be overkill, but they can help out in case of dirty data.

Extract multiple values from a string

I need to extract values from a string.
string sTemplate = "Hi [FirstName], how are you and [FriendName]?";
Values I need returned:
FirstName
FriendName
Any ideas on how to do this?
You can use the following regex globally:
\[(.*?)\]
Explanation:
\[ : [ is a meta char and needs to be escaped if you want to match it literally.
(.*?) : match everything in a non-greedy way and capture it.
\] : ] is a meta char and needs to be escaped if you want to match it literally.
Example:
string input = "Hi [FirstName], how are you and [FriendName]?";
string pattern = @"\[(.*?)\]";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
if (matches.Count > 0)
{
Console.WriteLine("{0} ({1} matches):", input, matches.Count);
foreach (Match match in matches)
Console.WriteLine(" " + match.Groups[1].Value); // Groups[1] is the captured text without the brackets
}
If the format/structure of the text won't be changing at all, and assuming the square brackets were used as markers for the variable, you could try something like this:
string sTemplate = "Hi FirstName, how are you and FriendName?";
// Split the string into two parts. Before and after the comma.
string[] clauses = sTemplate.Split(',');
// Grab the last word in each part.
string[] names = new string[]
{
clauses[0].Split(' ').Last(), // Using LINQ for .Last()
clauses[1].Split(' ').Last().TrimEnd('?')
};
return names;
You will need to tokenize the text and then extract the terms.
char delimiter = ' ';
string[] tokenizedTerms = sTemplate.Split(delimiter);
string firstName = tokenizedTerms[1];
string friendName = tokenizedTerms[6];
// Drop the trailing punctuation character from each term.
char[] firstNameChars = firstName.ToCharArray();
firstName = new String(firstNameChars, 0, firstNameChars.Length - 1);
char[] friendNameChars = friendName.ToCharArray();
friendName = new String(friendNameChars, 0, friendNameChars.Length - 1);
Explanation:
You tokenize the terms, which separates the string into a string array with each element being the char sequence between delimiters, in this case the words between spaces. From this word array we know that we want the 2nd word (element 1) and the 7th word (element 6). However, each of these terms has punctuation at the end, so we convert the strings to a char array and then back to a string minus that last character, which is the punctuation.
Note:
This method assumes that since it is a first name, there will only be one string, as well with the friend name. By this I mean if the name is just Will, it will work. But if one of the names is Will Fisher (first and last name), then this will not work.

c#: how to split string using default whitespaces + set of additional delimiters?

I need to split a string in C# using a set of delimiter characters. This set should include the default whitespaces (i.e. what you effectively get when you String.Split(null, StringSplitOptions.RemoveEmptyEntries)) plus some additional characters that I specify like '.', ',', ';', etc. So if I have a char array of those additional characters, how do I add all the default whitespaces to it, in order to then feed that expanded array to String.Split? Or is there a better way of splitting using my custom delimiter set + whitespaces? Thx
Just use the appropriate overload of string.Split if you're at least on .NET 2.0:
char[] separator = new[] { ' ', '.', ',', ';' };
string[] parts = text.Split(separator, StringSplitOptions.RemoveEmptyEntries);
I guess I was downvoted because of the incomplete answer. The OP asked for a way to split on all white-space characters (there are 25 of them on my PC) but also on other delimiters:
public static class StringExtensions
{
static StringExtensions()
{
var whiteSpaceList = new List<char>();
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
char c = Convert.ToChar(i);
if (char.IsWhiteSpace(c))
{
whiteSpaceList.Add(c);
}
}
WhiteSpaces = whiteSpaceList.ToArray();
}
public static readonly char[] WhiteSpaces;
public static string[] SplitWhiteSpacesAndMore(this string str, IEnumerable<char> otherDeleimiters, StringSplitOptions options = StringSplitOptions.None)
{
var separatorList = new List<char>(WhiteSpaces);
separatorList.AddRange(otherDeleimiters);
return str.Split(separatorList.ToArray(), options);
}
}
Now you can use this extension method in this way:
string str = "word1 word2\tword3.word4,word5;word6";
char[] separator = { '.', ',', ';' };
string[] split = str.SplitWhiteSpacesAndMore(separator, StringSplitOptions.RemoveEmptyEntries);
The answers above do not use all whitespace characters as delimiters, as you state in your request, only the ones specified by the program. In the solution examples above, this is only SPACE, but not TAB, CR, LF, and all the other Unicode-defined whitespace chars.
I have not found a way to retrieve the default whitespace chars from String. However, they are defined in Regex, and you can use that instead of String. In your case, adding period and comma to the Regex whitespace set:
Regex regex = new Regex(@"[\s\.,]+"); // The "+" treats a run of delimiters as one, so no blank entries appear between tokens
string input = @"1.2 3, 4";
string[] tokens = regex.Split(input);
will produce
tokens[0] "1"
tokens[1] "2"
tokens[2] "3"
tokens[3] "4"
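One caveat with Regex.Split is that a delimiter at the very start or end of the input still yields an empty string there. A hedged sketch of the same idea with those empties filtered out is below; the delimiter set (period, comma, semicolon) and the names RegexSplitDemo and Tokenize are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    // Hypothetical helper: split on white space plus '.', ',' and ';'.
    public static string[] Tokenize(string input)
    {
        // \s covers the white-space class; '.', ',' and ';' are the extra delimiters.
        return Regex.Split(input, @"[\s.,;]+")
            .Where(token => token.Length > 0) // drop empties from leading/trailing delimiters
            .ToArray();
    }

    public static void Main()
    {
        string[] tokens = Tokenize(" word1 word2\tword3.word4,word5;word6. ");
        Console.WriteLine(string.Join("|", tokens));
        // prints: word1|word2|word3|word4|word5|word6
    }
}
```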
str.Split(" .,;".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
I use something like the following to ensure I'm always splitting on Split's default whitespace characters:
public static string[] SplitOnWhitespaceAnd(this string value,
char[] separator, StringSplitOptions options = StringSplitOptions.RemoveEmptyEntries)
=> value.Split().SelectMany(s => s.Split(separator, options)).ToArray();
Note that to be consistent with Microsoft's naming conventions, you'd want to use WhiteSpace rather than Whitespace.
Refer to Microsoft's Char.IsWhiteSpace documentation to see the whitespace characters split on by default.
string[] splitSentence(string sentence)
{
return sentence
.Replace(",", " , ")
.Replace(".", " . ")
.Split(' ', StringSplitOptions.RemoveEmptyEntries);
}
or
string[] result = test.Split(new string[] {"\n", "\r\n"},
StringSplitOptions.RemoveEmptyEntries);

Getting punctuation from the end of a string only

I'm looking for a C# snippet to remove and store any punctuation from the end of a string only.
Example:
Test! would return !
Test;; would return ;;
Test?:? would return ?:?
!!Test!?! would return !?!
I have a rather clunky solution at the moment but wondered if anybody could suggest a more succinct way to do this.
My punctuation list is
new char[] { '.', ':', '-', '!', '?', ',', ';' }
You could use the following regular expression:
\p{P}*$
This breaks down to:
\p{P} - Unicode punctuation
* - Any number of times
$ - End of line anchor
If you know that there will always be some punctuation at the end of the string, use + for efficiency.
And use it like this in order to get the punctuation:
string punctuation = Regex.Match(myString, @"\p{P}*$").Value;
To actually remove it:
string noPunctuation = Regex.Replace(myString, @"\p{P}*$", string.Empty);
Use a regex:
resultString = Regex.Replace(subjectString, @"[.:!?,;-]+$", "");
Explanation:
[.:!?,;-] # Match a character that's one of the enclosed characters
+ # Do this once or more (as many times as possible)
$ # Assert position at the end of the string
As Oded suggested, use \p{P} instead of [.:!?,;-] if you want to remove all punctuation characters, not just the ones from your list.
To also "store" the punctuation, you could split the string:
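splitArray = Regex.Split(subjectString, @"(?<!\p{P})(?=\p{P}+$)");
Then splitArray[0] contains the part before the punctuation, and splitArray[1] the punctuation characters, if there are any. The (?<!\p{P}) lookbehind makes the split happen only at the start of the trailing run; with the lookahead alone, the string would also be split between the individual punctuation characters.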
Using Linq:
var punctuationMap = new HashSet<char>(new char[] { '.', ':', '-', '!', '?', ',', ';' });
var endPunctuationChars = aString.Reverse().
TakeWhile(ch => punctuationMap.Contains(ch));
var result = new string(endPunctuationChars.Reverse().ToArray());
The HashSet is not mandatory, you can use Linq's Contains on the array directly.
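To both remove and store the trailing punctuation in one step, a minimal sketch along the lines of the regex answers above could look like this. It assumes the question's punctuation list and a compiler with tuple support (C# 7 or later); TrailingPunctuation and Split are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingPunctuation
{
    // Character class built from the question's list: . : - ! ? , ;
    static readonly Regex Trailing = new Regex(@"[.:\-!?,;]*$");

    // Returns the text before the trailing punctuation and the punctuation itself.
    public static (string Body, string Punctuation) Split(string s)
    {
        Match m = Trailing.Match(s);
        return (s.Substring(0, m.Index), m.Value);
    }

    public static void Main()
    {
        var (body, punctuation) = Split("!!Test!?!");
        Console.WriteLine(body);        // prints !!Test
        Console.WriteLine(punctuation); // prints !?!
    }
}
```

Because the quantifier is "*", the match always succeeds; for a string without trailing punctuation the second element is simply empty.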
