How to split string preserving spaces and any number of \n characters - c#

I want to split a string and create a collection, with the following rules:
The string should be split into words.
1) If the string contains '\n', it should be treated as a separate '\n' word.
2) If the string contains more than one '\n', each one should be treated as its own '\n' word.
3) No space should be removed from the string. The only exception is that a space between two \n characters can be ignored.
PS: I tried a lot with string.Split. I first split on the \n characters and created a collection. The downside is that if I have two consecutive \n characters, I'm unable to add two dummy words to the collection. Any help would be greatly appreciated.
Is there any way to do this using regex?

Split with a regex like this:
(?<=[\S\n])(?=\s)
Something like:
var substrings = Regex.Split(input, @"(?<=[\S\n])(?=\s)");
This does not remove any spaces at all, but removing them was not required, so that should be fine.
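As a rough illustration of how that zero-width split behaves, here is a minimal sketch. The class name, method name and sample input are made up for the example; only the pattern comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitKeepingWhitespace
{
    // Splits at every position that is preceded by a non-whitespace character
    // or '\n' and followed by whitespace. The match is zero-width, so no
    // character is consumed or removed.
    public static string[] Split(string input) =>
        Regex.Split(input, @"(?<=[\S\n])(?=\s)");

    static void Main()
    {
        string input = "a b\n\nc"; // hypothetical sample input
        foreach (string piece in Split(input))
        {
            // Show newlines visibly so each piece is easy to inspect.
            Console.WriteLine("[" + piece.Replace("\n", "\\n") + "]");
        }
        // Prints four lines: [a]  [ b]  [\n]  [\nc]
    }
}
```

In this sample, a '\n' that is immediately followed by text stays attached to that text ("\nc"), so the pattern may need adjusting if every '\n' has to stand alone.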
If you really want the spaces between \ns to be removed, you could split with something like:
(?<=[\S\n])(?=\s)(?:[ \t]+(?=\n))?

Looks like homework. As such, read up on \b.
Should set you in the right direction.

Read up on the zero-width assertions. With them you can define a split position between e.g. \s and \S without actually matching either adjacent character.
edit:
Here's another question where the OP asked about those constructs.

Related

RegEx to find non-existence of white space prefix but not include the character in the match?

So I have the following regex for finding places where whitespace is missing and adding it:
(\S)(\()
For a string like "SomeText(Somemoretext)" I want to produce "SomeText (Somemoretext)". The regex matches "t(", so my replacement eliminates the "t" from the string, which is not good. I also do not know what the preceding character could be; I'm merely trying to detect the absence of whitespace.
Is there a better expression to use, or is there a way to exclude the found character from the returned match, so that I can safely replace without catching characters I do not want to replace?
Thanks
I find lookarounds hard to read and would prefer using substitutions in the replacement string instead:
var s = Regex.Replace("test1() test2()", @"(\S)\(", "$1 (");
Debug.Assert(s == "test1 () test2 ()");
$1 inserts the first capture group from the regex into the replacement string which is the non-space character before the opening parenthesis (.
If you need to detect the absence of space before a specific character (such as bracket) after a word, how about the following?
\b(?=[^\s])\(
This will detect words ([a-zA-Z0-9_]) that are followed by a bracket without a space in between.
If I understood your problem correctly, you can replace the full match with " (" (a space followed by the bracket) and get exactly what you need.
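A minimal sketch of that replacement follows. The class and method names are hypothetical, and the lookbehind variant `(?<=\S)\(` is an assumption added for comparison, not something taken from the answers.

```csharp
using System;
using System.Text.RegularExpressions;

static class SpaceBeforeParen
{
    // Pattern from the answer: a word boundary directly before "(".
    // Only "(" is part of the match, so the preceding character is kept.
    public static string WithWordBoundary(string s) =>
        Regex.Replace(s, @"\b(?=[^\s])\(", " (");

    // Assumed alternative: a lookbehind for any non-whitespace character,
    // which also covers punctuation before "(".
    public static string WithLookbehind(string s) =>
        Regex.Replace(s, @"(?<=\S)\(", " (");

    static void Main()
    {
        Console.WriteLine(WithWordBoundary("SomeText(Somemoretext)")); // SomeText (Somemoretext)
        Console.WriteLine(WithLookbehind("SomeText(Somemoretext)"));   // SomeText (Somemoretext)
        Console.WriteLine(WithWordBoundary("x)(y"));                   // x)(y  (no word character before "(")
        Console.WriteLine(WithLookbehind("x)(y"));                     // x) (y
    }
}
```

The two variants differ only when the character before the bracket is not a word character.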
In case you need to look for the absence of spaces before a symbol (like a bracket) in any kind of text (the text may be non-word, such as punctuation), you might want to use the following instead.
^(?:\S*)(\()(?:\S*)$
When using this, your result will be in group 1, instead of just full match (which now contains the whole line, if a line is matched).

Regular expression to split on spaces, but included phrases in quotes?

Here's an example string -
"EP(DebugFlag="N",UILogFlag="N")" Other words here
I'd like to split the string by spaces, but need to keep quoted phrases together - even if there are quotes within quotes. So I'd like the sample string to be split as -
"EP(DebugFlag="N",UILogFlag="N")"
Other
words
here
I'm not sure how to take the quotes into consideration (finding the starting and ending one). Is there an easy way to do this?
Thanks!
You mean something like this:
string example = @"EP(DebugFlag='N',UILogFlag='N') Other words here";
// Split on single spaces and drop empty entries (ToList needs System.Linq).
var result = example.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();
foreach (var phrase in result)
{
    Console.WriteLine("{0}", phrase);
}
Note: as @CodeCaster suggested, I should mention that I replaced the double quotes with single quotes to provide a "working example". If your sample differs from mine, you have to provide the exact text without the surrounding quotes.
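If the double quotes have to stay as they are, one possible regex-based sketch is shown below. It rests on an assumption that is not stated in the question: a quoted phrase starts at a double quote and ends at the first double quote that is followed by whitespace or the end of the string. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuotedSplitter
{
    // First alternative: a quoted phrase whose closing quote is followed by
    // whitespace or the end of the input (assumption, see lead-in).
    // Second alternative: any other run of non-whitespace characters.
    public static List<string> Split(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(input, @""".*?""(?=\s|$)|\S+"))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    static void Main()
    {
        string sample = "\"EP(DebugFlag=\"N\",UILogFlag=\"N\")\" Other words here";
        foreach (string token in Split(sample))
        {
            Console.WriteLine(token);
        }
    }
}
```

With the sample string from the question this yields four tokens, the first being the whole quoted phrase. Input where an inner closing quote is followed by a space would break the assumption and split too early.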

One single regular expression to match multiple alphanumeric words from 15 to 20 characters

I need to find all the words that have between 15 and 20 characters in a big string. I also want to avoid matching a long word with something else attached at the end (for example 1234567890abcdef@asdf.com). I don't want that to be a result, only words. Right now I'm splitting the string using white space as the delimiter, and for each word I'm applying the following regular expression:
^[a-zA-Z0-9]{15,20}$
Is there any chance to do both things using one regular expression?
I'm using C#.
Good examples to catch:
1234567890abcdeg
qwertyuiopasdfgh
1234567890abcdeg, (catch it but remove ",")
Examples to avoid: 1234567890abcdeg@gmail.com
Don't use start/end anchors (^/$), but word delimiters (\b):
\b[a-zA-Z0-9]{15,20}(?=[\s,]|$)
I used (?=[\s,]|$) instead of the closing word delimiter to require a whitespace character, a comma, or the end of the string right after the word. Expand it as needed.
You may want to do likewise for the first \b if you need to, for instance: (?<=\s|^).
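To see how this pattern behaves on the examples from the question, here is a small sketch. The class name, method name and the combined sample text are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LongWordFinder
{
    // Finds alphanumeric runs of 15 to 20 characters that start at a word
    // boundary and are followed by whitespace, a comma, or the end of input.
    public static List<string> Find(string text)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\b[a-zA-Z0-9]{15,20}(?=[\s,]|$)"))
        {
            words.Add(m.Value);
        }
        return words;
    }

    static void Main()
    {
        string text = "1234567890abcdeg qwertyuiopasdfgh 1234567890abcdeg, 1234567890abcdeg@gmail.com";
        foreach (string word in Find(text))
        {
            Console.WriteLine(word);
        }
        // Prints three words; the comma is not part of the third match and
        // the e-mail address produces no match.
    }
}
```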
Normally, you would use word boundaries (\b) before and after the alphanumerics:
\b[a-zA-Z0-9]{15,20}\b
However, there's a small detail to take into account: underscores ("_") are also considered word characters, so there is no word boundary between a digit or letter and an underscore. The previous regex won't match the following text:
12345678901234567_
In order to avoid it, you can check if it's preceded and followed by either a \b or a "_", with lookarounds.
Regex:
(?<=\b|_)[a-zA-Z0-9]{15,20}(?=\b|_)
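The difference between the plain \b version and the lookaround version can be checked with a short sketch like this one (class and method names are hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

static class UnderscoreBoundary
{
    // Plain word boundaries: "_" counts as a word character, so there is
    // no boundary between the last digit and a trailing underscore.
    public static bool PlainMatches(string text) =>
        Regex.IsMatch(text, @"\b[a-zA-Z0-9]{15,20}\b");

    // Lookarounds accept either a word boundary or an underscore on each side.
    public static string LookaroundMatch(string text) =>
        Regex.Match(text, @"(?<=\b|_)[a-zA-Z0-9]{15,20}(?=\b|_)").Value;

    static void Main()
    {
        string text = "12345678901234567_";
        Console.WriteLine(PlainMatches(text));    // False
        Console.WriteLine(LookaroundMatch(text)); // 12345678901234567
    }
}
```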

How do I capitalize an entire text except for certain patterns?

I have a text (string) that I want in all upper case, except the following:
Words starting with : (colon)
Words or strings surrounded by double quotation marks, ""
Words or strings surrounded by single quotation marks, ''
Everything else should be replaced with its upper case counterpart, and formatting (whitespaces, line breaks, etc.) should remain.
How would I go about doing this using Regex (C# style/syntax)?
I think you are looking for something like this:
text = Regex.Replace(text, @":\w+|""[^""]*""|'[^']*'|(.)",
    match => match.Groups[1].Success ?
        match.Groups[1].Value.ToUpper() : match.Value);
:\w+ - match words with a colon.
"[^"]*"|'[^']*' - match quoted text. For escaped quotes, you may use:
"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'
(.) - capture anything else (you can also try ([^"':]*|.), it might be faster).
Next, we use a callback for Regex.Replace to do two things:
Determine if we need to keep the text as-is, or
Return the upper-case version of the text.
Working example: http://ideone.com/ORFU8
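In case the link is unavailable, here is a self-contained sketch of the same idea. The class name, method name and sample text are made up, and it uses ToUpperInvariant instead of ToUpper to avoid culture-specific casing, which is a deliberate deviation from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

static class SelectiveUpper
{
    // The first three alternatives match text that must stay untouched
    // (":word", "double quoted", 'single quoted'). Any other character is
    // captured in group 1 and upper-cased by the callback.
    public static string Apply(string text) =>
        Regex.Replace(text, @":\w+|""[^""]*""|'[^']*'|(.)",
            m => m.Groups[1].Success
                ? m.Groups[1].Value.ToUpperInvariant()
                : m.Value);

    static void Main()
    {
        string input = "select :name from \"My Table\"\nwhere x = 'abc'";
        Console.WriteLine(Apply(input));
        // SELECT :name FROM "My Table"
        // WHERE X = 'abc'
    }
}
```

Line breaks survive because "." does not match '\n' by default, so newlines are never part of a match and are left in place.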
You can start with this RegEx:
\b(?<![:"'])(\w+?)(?!["'])\b
Of course, you will have to refine it yourself if it is not enough.
For example, it also will not find a word inside mismatched quotes such as "dfgdfg'.
The word that is found is in the first capture group ($1).

Remove characters before any special characters c#

I'm trying to remove the characters in a string PRIOR to ANY non-alphanumeric characters. For instance, say you have a name "James Ebanks-Blake", I can split this into an array by using:
var s = "James Ebanks-Blake".Split(' ');
Even if there is more than one space, it'll just produce more array elements.
So what I need to do is loop through all the array elements, find the ones that contain a special character, then remove the characters before it along with the special character itself.
Can anyone assist me?
This works here
[-^$#](.*)
Just add what you consider special characters inside the character class
The string that you want will be in group 1
resultString = Regex.Match(subjectString, "[-^$#](.*)", RegexOptions.Singleline).Groups[1].Value;
[-'](.*)
That should grab anything after a - or a '. If you want, you can add more characters in the [ ] section. Just make sure to escape the ones that are special inside a character class.
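Putting the pieces together, a minimal sketch for the name example might look like this. It assumes that "special character" means a hyphen or an apostrophe, as in the pattern above; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class PrefixStripper
{
    // For each space-separated word, keeps only the text after the first
    // '-' or '\'' (group 1). Words without such a character are left as-is.
    public static string Strip(string name)
    {
        string[] words = name.Split(' ');
        for (int i = 0; i < words.Length; i++)
        {
            Match m = Regex.Match(words[i], @"[-'](.*)", RegexOptions.Singleline);
            if (m.Success)
            {
                words[i] = m.Groups[1].Value;
            }
        }
        return string.Join(" ", words);
    }

    static void Main()
    {
        Console.WriteLine(Strip("James Ebanks-Blake")); // James Blake
    }
}
```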
