Regex for new line in string (str) in C# .NET - c#

I have exhausted my search and need assistance. I am new to regex and have managed to pull words from a multi-line string, but not a whole line.
I have the text in a string, but I cannot figure out how to grab the next line.
Example string has multiple lines (string multipleLines):
Authentication information
User information:
domain\username
Paris
I need to grab the text "domain\username" after the line "User information:".
I have tried many combinations of regex and cannot get it to work. Example:
string topLine = "Authentication information";
label.Text = Regex.Match(multipleLines, topLine + "(.*)").Groups[1].Value;
I also tried using: topLine + "\n"
What should I add to look at the entire next line after getting the line for Authentication information?

What you are trying to do with regular expressions is covered in another thread on Stack Overflow. Use RegexOptions.Multiline so that ^ and $ match the start and end of each line.
^: Depending on whether the MultiLine option is set, matches the position before the first character in a line, or the first character in the string.
$: Depending on whether the MultiLine option is set, matches the position after the last character in a line, or the last character in the string.
These anchors are the easiest way to accomplish your task.
Update:
An example would be something like this.
const string account = @"Authentication information:" + "\n"
+ "User Information: " + "\n"
+ "Domain Username: "
+ " \n" + "\\Paris";
MatchCollection match = Regex.Matches(account, @"^\\.*$", RegexOptions.Multiline);
That will retrieve the line that starts with a backslash, along with everything that follows it on that line. It is only an example, but hopefully it points you in the right direction.
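To target the line that follows a known header rather than a line that starts with a backslash, one option is to match the header, the line break, and then capture everything up to the next line break. The sketch below is only an illustration: the class and method names (NextLineDemo, LineAfter) are made up for this example, and it assumes the header sits on a line of its own.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NextLineDemo
{
    // Hypothetical helper: returns the line after the first line that
    // starts with `header`, or null when no such line is found.
    public static string LineAfter(string text, string header)
    {
        // Regex.Escape keeps characters in the header from being read as regex syntax.
        // [^\r\n]* is used instead of .* because in .NET "." also matches '\r'.
        string pattern = "^" + Regex.Escape(header) + @"[ \t]*\r?\n([^\r\n]*)";
        Match m = Regex.Match(text, pattern, RegexOptions.Multiline);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string multipleLines =
            "Authentication information\nUser information:\ndomain\\username\nParis";
        Console.WriteLine(LineAfter(multipleLines, "User information:"));
    }
}
```

With the sample text from the question, the captured group is the domain\username line.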

Though Regex would accomplish what you want, this might be simpler and carry far less overhead. Whether it works depends on the kind of input you'll be receiving. For your example, this will work (it needs using System.Linq):
string domainUsername = inputString.Split('\n').FirstOrDefault(z => z.Contains(@"\"));
if (domainUsername != null) {
Console.WriteLine(domainUsername); // Should spit out the appropriate line.
} else {
Console.WriteLine("Domain and username not found!"); // Line not found
}
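If a backslash might appear on other lines too, a line-based variant can look for the header line and return the one after it. This is a rough sketch under the assumption that the header occupies its own line; LineScanDemo and LineAfter are names invented for the example.

```csharp
using System;

public static class LineScanDemo
{
    // Hypothetical helper: returns the trimmed line that follows the first
    // line equal to `header` (ignoring case), or null if there is none.
    public static string LineAfter(string text, string header)
    {
        // Split on both Windows and Unix line endings.
        string[] lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        for (int i = 0; i < lines.Length - 1; i++)
        {
            if (lines[i].Trim().Equals(header, StringComparison.OrdinalIgnoreCase))
            {
                return lines[i + 1].Trim();
            }
        }
        return null;
    }

    public static void Main()
    {
        string input = "Authentication information\r\nUser information:\r\ndomain\\username\r\nParis";
        Console.WriteLine(LineAfter(input, "User information:") ?? "Domain and username not found!");
    }
}
```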

Related

Remove special characters from string with unicode

I found the most popular answer to this question is:
Regex.Replace(value, "[^a-zA-Z0-9]+", " ", RegexOptions.Compiled);
However, if users type in a non-English name when billing, this method treats the non-English letters as special characters and removes them.
Is there a way to make this work for most users, since my website is multi-language?
Make it Unicode aware:
var res = Regex.Replace(value, @"[^\p{L}\p{M}\p{N}]+", " ");
If you plan to keep only regular ASCII digits, use 0-9 in place of \p{N}.
The regex matches one or more symbols other than Unicode letters (\p{L}), diacritics (\p{M}) and digits (\p{N}).
You might consider var res = Regex.Replace(value, @"\W+", " "), but it will keep _ since the underscore is a "word" character.
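A small sketch of the Unicode-aware replacement applied to a few non-English inputs. The wrapper name KeepLettersAndDigits and the sample strings are made up for illustration, and the final Trim() is an addition that removes the space left where leading or trailing punctuation used to be.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeCleanDemo
{
    // Hypothetical wrapper around the pattern from the answer above.
    public static string KeepLettersAndDigits(string value)
    {
        // Anything that is not a letter, combining mark, or number becomes one space.
        return Regex.Replace(value, @"[^\p{L}\p{M}\p{N}]+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(KeepLettersAndDigits("José  Ñandú, 123!"));
        Console.WriteLine(KeepLettersAndDigits("北京-2024"));
    }
}
```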
I found that the best way to achieve this and make it work with all languages is to build a pattern that lists every banned character. Look at this code:
string input = @"heya's #FFFFF , CUL8R M8 how are you?'"; // This is the input string
string regex = @"[!""#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]"; // Banned characters; add every character you don't want displayed here.
Match m;
// Regex.Match never returns null, so loop on m.Success instead of a null check
while ((m = Regex.Match(input, regex)).Success)
{
input = input.Remove(m.Index, m.Length);
}
input = Regex.Replace(input, " {2,}", " "); // collapse runs of two or more spaces into one
MessageBox.Show(input);
Hope it works!
PS: As others posted here, there are other ways. I personally prefer this one even though it's more code. Choose the one that best fits your needs.

Remove substring that starts with SOT and ends EOT, from string

I have a program that reads certain strings from memory. The strings contain, for the most part, recognizable characters. At random points in the strings however, "weird" characters appear. Characters I did not recognize. By going to a site that allows me to paste in Unicode characters to see what they are, I found that a selection of the "weird" characters were these:
\x{1} SOH, "start of heading", ctrl-a
\x{2} STX, "start of text" (referred to as SOT below)
\x{3} ETX, "end of text" (referred to as EOT below)
\x{7} BEL, bell, ctrl-g
\x{13} dc3, device control three, ctrl-s
\x{11} dc1, device control one, ctrl-q
\x{14} dc4, device control four, ctrl-t
\x{1A} sub, substitute, ctrl-z
\x{6} ack, acknowledge, ctrl-f
I wanted to parse my strings to remove these characters. What I found out though, by looking at the strings, was that all the unwanted characters were always surrounded by the SOT and EOT, respectively.
Therefore, I am thinking that my question is: How can I remove, from a string, all occurrences of substrings that starts with SOT and ends with EOT?
Edit: Attempt at Solution
Using ideas from @RagingCain I made the following method:
private static string RemoveInvalidCharacters(string input)
{
while (true)
{
var start = input.IndexOf('\u0002');
if (start == -1) break; // check first: IndexOf throws on a negative start index
var end = input.IndexOf('\u0003', start);
if (end == -1) break;
Console.WriteLine("Start: " + start + ". End: " + end);
// + 1 so the closing \u0003 is removed as well
input = input.Remove(start, end - start + 1);
}
return input;
}
It does the trick, thanks again.
Regex is a good fit here. Put these characters in the pattern, then use Regex.Replace to swap each match for a space " " or to cut it from the string altogether with "". Regex.Match is available if you only need to locate them.
Regex.Replace: https://msdn.microsoft.com/en-us/library/xwewhkd1(v=vs.110).aspx
Regex.Match: https://msdn.microsoft.com/en-us/library/bk1x0726(v=vs.110).aspx
Regex example:
public static void Main()
{
string input = "This is   text with   far  too   much   " +
"whitespace.";
string pattern = "\\s+";
string replacement = " ";
Regex rgx = new Regex(pattern);
string result = rgx.Replace(input, replacement);
Console.WriteLine("Original String: {0}", input);
Console.WriteLine("Replacement String: {0}", result);
}
I know how hard it is to work with characters you cannot "see", so assign them to char variables by their Unicode code points and add those to the replacement pattern.
Char Variables: https://msdn.microsoft.com/en-us/library/x9h8tsay.aspx
Unicode for Start of Text:
http://www.fileformat.info/info/unicode/char/0002/index.htm
Unicode for End of Text:
http://www.fileformat.info/info/unicode/char/0003/index.htm
To apply this to your problem:
Check whether the string contains SOT and EOT.
If it does, remove the entire string, the sub-string, or just the SOT and EOT characters.
It may be easier to split the original string into a string[] and go line by line. It is difficult to parse your string without knowing what it looks like, so hopefully I provided something that helps ^.^
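As a sketch of the Regex.Replace idea applied to this case: the pattern below matches U+0002, then any run of characters that are not U+0003, then U+0003, and removes the whole span. The names ControlSpanDemo and StripControlSpans are invented for the example, and it assumes the spans are never nested.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ControlSpanDemo
{
    // Hypothetical helper: removes every substring that starts with U+0002
    // and ends with the next U+0003, including both delimiters.
    public static string StripControlSpans(string input)
    {
        // The C# escapes \u0002 and \u0003 put the actual control characters
        // into the pattern, where the regex engine treats them as literals.
        return Regex.Replace(input, "\u0002[^\u0003]*\u0003", string.Empty);
    }

    public static void Main()
    {
        string raw = "ab\u0002\u0007x\u0003cd\u0002\u0001\u0003e";
        Console.WriteLine(StripControlSpans(raw));
    }
}
```

A start character with no matching end character is left in place, which mirrors the loop-based method stopping when it cannot find a closing delimiter.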

Why this function for finding the n-th occurrence does not work on text with line breaks?

I found the following code to find the n-th occurrence of a value in a text here.
This is the code:
public static int NthIndexOf(this string target, string value, int n)
{
Match m = Regex.Match(target, "((" + value + ").*?){" + n + "}");
if (m.Success)
return m.Groups[2].Captures[n - 1].Index;
else
return -1;
}
I tried to find the index of the second occurrence of "< /form>" (the space does not appear in the original string) in a webpage, and it failed, although the text definitely contains it. I also cut off the beginning of the webpage so that the second occurrence became the first, and then the expression was found as the first occurrence.
In one of the comments on this code, someone wrote that "This Regex does not work if the target string contains linebreaks.".
My two questions are:
Why doesn't this code work if the target string contains linebreaks?
How can I fix this code so that it also works for strings that contain linebreaks (replacing or removing the linebreaks is not a good solution for me)?
I am not looking for other techniques to do the same thing.
By default . does not match a newline, so .*? only matches up to the end of the line.
For what you want, you need to use Singleline mode, so your code should look something like this:
Match m = Regex.Match(target, "((" + value + ").*?){" + n + "}", RegexOptions.Singleline);
By default . in a regular expression stops at a newline. To fix it you need to specify a regex option. Note that RegexOptions.Multiline only changes how ^ and $ behave; the option that lets . match a newline is RegexOptions.Singleline:
Match m = Regex.Match(target, "((" + value + ").*?){" + n + "}", RegexOptions.Singleline);
You can find more information about RegexOptions here.
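Putting the fix into the original method gives something like the sketch below. Two details here are additions that the answers above do not mention: Regex.Escape, so that characters in the search value are not read as regex syntax, and the guard for n < 1. The class name NthIndexDemo is made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NthIndexDemo
{
    // Returns the index of the n-th occurrence of `value` in `target`, or -1.
    public static int NthIndexOf(this string target, string value, int n)
    {
        if (n < 1) return -1;
        string pattern = "((" + Regex.Escape(value) + ").*?){" + n + "}";
        // Singleline lets "." match '\n', so ".*?" can span line breaks.
        Match m = Regex.Match(target, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[2].Captures[n - 1].Index : -1;
    }

    public static void Main()
    {
        string page = "a</form>\nb</form>\nc";
        Console.WriteLine(page.NthIndexOf("</form>", 2));
    }
}
```

In the sample string the two occurrences sit on different lines, which is the case where the version without RegexOptions.Singleline fails to match.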

Regex pattern format [duplicate]

I have the following Regex pattern to remove all characters after the first two line breaks.
(?<=.+[\r\n]+.+[\r\n]+)([\s\S]*)
My problem is that I also want to check for a specific text, for example after those two line breaks, and exclude it if it is found.
Here is how I do it in my C# code:
string newComment = string.IsNullOrEmpty(regexPattern) ? emailBody : new Regex(regexPattern, RegexOptions.IgnoreCase).Replace(emailBody, string.Empty);
EDIT
I want to look for a specific text, for example "This is a signature:". If it is found, it and anything after it should be excluded, while keeping the current behavior in which everything after two line breaks is excluded.
Sample strings:
string body = "Try comment.";
string additionalBody = "This is a signature";
string newBody = body + System.Environment.NewLine + additionalBody + System.Environment.NewLine + "asd Asd";
So newBody ends up as a text with 3 paragraphs.
It should display "Try comment." only.
Possible scenarios:
1) The signature text can appear in the first or second paragraph, and it should be removed automatically.
2) If the automated signature is not present but there are 3 paragraphs, remove the last paragraph.
Try this:
(?<=(?>.+[\r\n]+){2})(?:(?!\bThis is a signature\b)[\s\S])*
How about simply:
(?<=(?:.+[\r\n]+){2})([\s\S]*)This is a signature

Regular Expression without braces

I have the following sample cases:
1) "Sample"
2) "[10,25]"
I want a single regular expression pattern that returns "Sample" and "10,25" when the above examples are passed to it.
Note: Input strings do not include the quotes.
I came up with the expression (?<=\[)(.*?)(?=\]). It satisfies the second case and retrieves only "10,25", but for the first case it returns an empty result. I want "Sample" to be returned. Can anyone help me?
I am using C#.
Here you go: a small regex using a positive lookbehind. Sometimes these are very handy.
Regex
(?<=^|\[)([\w,]+)
Test string
Sample
[10,25]
Result
MATCH 1
[0-6] Sample
MATCH 2
[8-13] 10,25
try at regex101.com
If " is included in your original string, use the regex below, which looks for the " mark as well. You can remove ^| from the lookbehind if the " mark is always present, or leave it as it is if your text mixes values with and without " marks.
Regex
(?<=^|\[|\")([\w,]+)
try at regex101.com
As far as I can tell, the below regex should help:
Regex regex = new Regex(#"^\w+|[[](\w)+\,(\w)+[]]$");
This will match multiple words, or 2 words (alphanumeric) separated by commas and inside square brackets.
One Java example:
// String input = "Sample";
String input = "[10,25]";
String text = "[^,\\[\\]]+";
Pattern pMod = Pattern.compile("(" + text + ")|(?>\\[(" + text + "," + text + ")\\])");
Matcher mMod = pMod.matcher(input);
while (mMod.find()) {
if(mMod.group(1) != null) {
System.out.println(mMod.group(1));
}
if(mMod.group(2)!=null) {
System.out.println(mMod.group(2));
}
}
if input is "[hello&bye,25|35]", then the output is hello&bye,25|35
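Since the question asks for C#, here is one more possible sketch that treats the brackets as optional and captures whatever sits between them. It is not taken from any of the answers above; the names BracketDemo and Unwrap are invented, and the pattern assumes the whole input is a single value on one line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // Hypothetical helper: returns the input without one pair of
    // surrounding square brackets, if they are present.
    public static string Unwrap(string input)
    {
        // \[? and \]? make each bracket optional; (.*?) captures the content.
        Match m = Regex.Match(input, @"^\[?(.*?)\]?$");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Unwrap("Sample"));
        Console.WriteLine(Unwrap("[10,25]"));
    }
}
```

One trade-off of this form is that it also accepts unbalanced input such as "[abc" and returns "abc"; a stricter pattern would be needed if that matters.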
