Regular expression identifier and separator using ':' symbol - c#

I want to separate my string between two ':' characters.
For example, if the input is "mypage-google-wax:press:-happy", then I want "press" out.
It can be assumed that the input doesn't contain any numeric characters.

Any reason to use regular expressions at all, rather than just:
string[] bits = text.Split(':');
That's assuming I understood your question correctly... which I'm not at all sure about. Anyway, depending on what you really want to do, this might be useful to you...

If you're always going to have a string in the format {stuffIDontWant}:{stuffIWant}:{moreStuffIDontWant} then String.Split() is your answer, not Regex.
To retrieve that middle value, you'd do:
string input = "stuffIDontWant:stuffIWant:moreStuffIDontWant"; //get your input
string output = "";
string[] parts = input.Split(':');
//converts to an array of strings using the character specified as the separator
output = parts[1]; //assign the second one
return output;
Regex is good for pattern matching, but unless you're specifically looking for the word press, String.Split() is a better answer for this need.
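One caveat with parts[1]: it throws if the input has no ':' at all. A minimal sketch of a guarded version follows, assuming you would rather get null back than an exception. The names MiddleFieldDemo and MiddleField are made up for illustration.

```csharp
using System;

static class MiddleFieldDemo
{
    // Returns the text between the first and second ':' characters,
    // or null when the input contains fewer than two ':' characters.
    public static string MiddleField(string input)
    {
        string[] parts = input.Split(':');
        return parts.Length >= 3 ? parts[1] : null;
    }

    static void Main()
    {
        Console.WriteLine(MiddleField("mypage-google-wax:press:-happy")); // press
        Console.WriteLine(MiddleField("no separators here") ?? "(no match)"); // (no match)
    }
}
```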

If you want it in regex:
string pattern = ":([^:]+):";
string sentence = "some text :data1: some more text :data2: text";
foreach (Match match in Regex.Matches(sentence, pattern))
Console.WriteLine("Found '{0}' at position {1}",
match.Groups[1].Value, match.Index);

Related

Get the first word from the string

I would like to get only the first word of the string regardless of any character or punctuation in front of it.
Sometimes, there could be , or . or !. I don't want these characters.
var s = "Hello, World";
var firstWord = s.Substring(0, s.IndexOf(" "));
This gives me Hello,. I would like to get Hello only.
How do I achieve this?

Simply use the following regex:
var s = "Hello, World";
var result = Regex.Match(s, @"^([\w\-]+)");
Console.WriteLine(result.Value); // Result is "Hello"
This will get the first word regardless of whether or not it ends with punctuation or simply precedes a space.

This will work for you. I assumed that words will be separated with whitespace.
var input = "Hello, World";
var output = Regex.Replace(input.Split()[0], @"[^0-9a-zA-Z\ ]+", "");

IndexOfAny (https://msdn.microsoft.com/fr-ca/library/11w09h50(v=vs.110).aspx) is an alternative if you know the list of characters you want to use. It really depends on the definition you want to use and which characters you want to handle. How do you want to handle characters like œ,é,µ,½,¶,ç,+,-,3...?
Also, do you want to handle locale? Some characters might have a classification that is dependent on the language.
Char has many functions that allow you to classify characters. See https://msdn.microsoft.com/en-us/library/system.char(v=vs.110).aspx.
And there are also the regex solutions proposed by others.
So the best solution really depends on your need. Do you need to properly handle any Unicode characters or only some specific ASCII characters?
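To illustrate the Char classification route, here is a rough sketch that treats a "word" as a run of Unicode letters. That definition is an assumption; swap in char.IsLetterOrDigit or your own test if digits or hyphens should count. FirstWordDemo and FirstWord are hypothetical names.

```csharp
using System;

static class FirstWordDemo
{
    // Skips any leading non-letter characters, then returns the run of
    // letters that follows (empty string if there are no letters at all).
    public static string FirstWord(string text)
    {
        int start = 0;
        while (start < text.Length && !char.IsLetter(text[start])) start++;

        int end = start;
        while (end < text.Length && char.IsLetter(text[end])) end++;

        return text.Substring(start, end - start);
    }

    static void Main()
    {
        Console.WriteLine(FirstWord("Hello, World"));    // Hello
        Console.WriteLine(FirstWord("...Hello! World")); // Hello
    }
}
```

Because char.IsLetter uses Unicode categories, accented letters such as é are kept, which the [0-9a-zA-Z] character class above would strip.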

LATE ENTRY:
If you don't want to use Regular Expressions:
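private string GetFirstWord(string text)
{
    // Needs "using System.Linq;" for Any() and FirstOrDefault()
    var candidate = text.Trim();
    if (!candidate.Any(Char.IsWhiteSpace))
        return candidate;
    // Split on any whitespace, not only ' '
    return candidate.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).FirstOrDefault();
}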

Regex Replacing only whole matches

I am trying to replace a bunch of strings in files. The strings are stored in a datatable along with the new string value.
string contents = File.ReadAllText(file);
foreach (DataRow dr in FolderRenames.Rows)
{
contents = Regex.Replace(contents, dr["find"].ToString(), dr["replace"].ToString());
File.SetAttributes(file, FileAttributes.Normal);
File.WriteAllText(file, contents);
}
The strings look like this _-uUa, -_uU, _-Ha etc.
The problem I am having is that a shorter string, for example "_-uU", also matches inside a longer one such as "_-uUa", so the replacement ends up looking like "newvaluea".
Is there a way to tell regex to look at the next character after the found string and make sure it is not an alphanumeric character?
I hope it is clear what I am trying to do here.
Here is some sample data:
private function _-0iX(arg1:flash.events.Event):void
{
if (arg1.type == flash.events.Event.RESIZE)
{
if (this._-2GU)
{
this._-yu(this._-2GU);
}
}
return;
}
The next characters could be ;, (, ), dot, comma, space, :, etc.

First of all, you should use Regex.Escape.
Then you can use:
contents = Regex.Replace(
    contents,
    Regex.Escape(dr["find"].ToString()) + @"(?![a-zA-Z])",
    // Regex.Escape is for patterns only; in a replacement string just double any '$'
    dr["replace"].ToString().Replace("$", "$$"));
or even better
contents = Regex.Replace(
    contents,
    @"\b" + Regex.Escape(dr["find"].ToString()) + @"\b",
    dr["replace"].ToString().Replace("$", "$$"));

I think this is what you're looking for:
contents = Regex.Replace(
contents,
string.Format(@"(?<!\w){0}(?!\w)", Regex.Escape(dr["find"].ToString())),
dr["replace"].ToString().Replace("$", "$$")
);
You can't use \b because your search strings don't always start and end with word characters. Instead, I used (?<!\w) and (?!\w) to make sure the matched substring is not immediately preceded or followed by a word character (i.e., a letter, a digit, or an underscore). I don't know the complete specs for your search strings, so this pattern might need some tweaking.
None of the sample patterns you provided contain regex metacharacters, but like the other responders, I used Regex.Escape() to render it safe anyway. In the replacement string the only character you have to watch out for is the dollar sign, and the way to escape that is with another dollar sign. Notice that I used String.Replace() for that instead of Regex.Replace().
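A small self-contained check of the lookaround pattern might look like this. The helper name ReplaceWhole and the sample input are made up for illustration; only Regex.Escape, Regex.Replace and String.Replace are real API calls.

```csharp
using System;
using System.Text.RegularExpressions;

static class WholeMatchDemo
{
    // Replaces 'find' only where it is not directly preceded or followed
    // by a word character (letter, digit or underscore).
    public static string ReplaceWhole(string contents, string find, string replacement)
    {
        string pattern = @"(?<!\w)" + Regex.Escape(find) + @"(?!\w)";
        // Double any '$' so it is not read as a substitution such as $1.
        return Regex.Replace(contents, pattern, replacement.Replace("$", "$$"));
    }

    static void Main()
    {
        string contents = "this._-uU(this._-uUa);";
        // The second identifier is left alone because 'a' follows the match.
        Console.WriteLine(ReplaceWhole(contents, "_-uU", "newvalue"));
        // this.newvalue(this._-uUa);
    }
}
```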

There are two tricks that can help you here:
Order all the search strings by length and replace the longest ones first; that way you won't accidentally replace the shorter ones.
Use a MatchEvaluator and, instead of looping through all your rows, search for all replacement patterns in the string and look them up in your dataset.
Option one is simple; option two would look like this:
contents = Regex.Replace(contents, @"_-\w+", ReplaceIdentifier);
public string ReplaceIdentifier(Match m)
{
    // Rows.Find requires a primary key on the "find" column
    DataRow row = FolderRenames.Rows.Find(m.Value);
    if (row != null) return row["replace"].ToString();
    else return m.Value;
}
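The two tricks can also be combined in a single pass. The sketch below assumes the find/replace pairs have been copied into a Dictionary rather than a DataTable, which is a swap made only to keep the example self-contained. SinglePassDemo and ReplaceAll are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class SinglePassDemo
{
    public static string ReplaceAll(string contents, IDictionary<string, string> renames)
    {
        if (renames.Count == 0) return contents;

        // Longest keys first, so "_-uUa" is tried before "_-uU" at each position.
        string pattern = string.Join("|",
            renames.Keys
                   .OrderByDescending(k => k.Length)
                   .Select(k => Regex.Escape(k)));

        // The evaluator returns the text literally, so '$' needs no escaping here.
        return Regex.Replace(contents, pattern, m => renames[m.Value]);
    }

    static void Main()
    {
        var renames = new Dictionary<string, string>
        {
            { "_-uU", "shortName" },
            { "_-uUa", "longName" },
        };
        Console.WriteLine(ReplaceAll("_-uU(_-uUa)", renames)); // shortName(longName)
    }
}
```

Because every position in the text is visited once, a replacement value is never re-scanned and replaced again by a later row.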

match first digits before # symbol

How to match all first digits before # in this line
26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html
I'm trying to get this number: 26909578
My attempt:
string text = @"26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html";
MatchCollection m1 = Regex.Matches(text, @"(.+?)#", RegexOptions.Singleline);
but then it outputs all of the text.

Make it explicit that it has to start at the beginning of the string:
@"^(.+?)#"
Alternatively, if you know that this will always be a number, restrict the possible characters to digits:
@"^\d+"
Alternatively use the function Match instead of Matches. Matches explicitly says, "give me all the matches", while Match will only return the first one.
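To make the difference concrete, here is a short sketch using a shortened version of the input line. The helper name LeadingDigits is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LeadingDigitsDemo
{
    // Returns the digits at the very start of the text, or null if there are none.
    public static string LeadingDigits(string text)
    {
        Match m = Regex.Match(text, @"^\d+");
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string text = "26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0";

        Console.WriteLine(LeadingDigits(text)); // 26909578

        // The unanchored pattern matches once per '#'-terminated segment.
        Console.WriteLine(Regex.Matches(text, @"(.+?)#").Count); // 4
    }
}
```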

Or, in a trivial case like this, you might also consider a non-RegEx approach. The IndexOf() method will locate the '#' and you could easily strip off what came before.
I even wrote a sscanf() replacement for C#, which you can see in my article A sscanf() Replacement for .NET.
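The IndexOf() route could be sketched roughly like this. Returning the whole string when there is no '#' is an assumption about what you would want in that case, and BeforeHash is a hypothetical name.

```csharp
using System;

static class IndexOfDemo
{
    // Returns everything before the first '#', or the whole text if no '#' exists.
    public static string BeforeHash(string text)
    {
        int hash = text.IndexOf('#');
        return hash >= 0 ? text.Substring(0, hash) : text;
    }

    static void Main()
    {
        Console.WriteLine(BeforeHash("26909578#Sbrntrl_7x06-lilla.avi#356028416")); // 26909578
    }
}
```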

If you don't want to use regex, use a StringBuilder and just loop until you hit the #.
Like this:
StringBuilder sb = new StringBuilder();
string yourdata = "yourdata";
int i = 0;
// stop at the first '#', or at the end of the string if there is none
while (i < yourdata.Length && yourdata[i] != '#')
{
    sb.Append(yourdata[i]);
    i++;
}
// when you reach the '#', the StringBuilder holds the number you want, so get it with ToString()
string answer = sb.ToString();

The entire string (except the final url) is composed of segments that can be matched by (.+?)#, so you will get several matches. Retrieve only the first match from the collection returned by matching .+?(?=#)

How to get all words of a string in c#?

I have a paragraph in a single string and I'd like to get all the words in that paragraph.
My problem is that I don't want punctuation marks attached to the words, such as , . ' " ; : ! ?, or whitespace characters such as \n and \t.
I also don't want words with 's and 'm such as world's where it should only return world.
In the example
he said. "My dog's bone, toy, are missing!"
the list should be: he said my dog bone toy are missing

Expanding on Shan's answer, I would consider something like this as a starting point:
MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");
Why include the ' character? Because this will prevent words like "we're" from being split into two words. After capturing it, you can manually strip out the suffix yourself (whereas otherwise, you couldn't recognize that re is not a word and ignore it).
So:
static string[] GetWords(string input)
{
MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");
var words = from m in matches.Cast<Match>()
where !string.IsNullOrEmpty(m.Value)
select TrimSuffix(m.Value);
return words.ToArray();
}
static string TrimSuffix(string word)
{
int apostropheLocation = word.IndexOf('\'');
if (apostropheLocation != -1)
{
word = word.Substring(0, apostropheLocation);
}
return word;
}
Example input:
he said. "My dog's bone, toy, are missing!" What're you doing tonight, by the way?
Example output:
[he, said, My, dog, bone, toy, are, missing, What, you, doing, tonight, by, the, way]
One limitation of this approach is that it will not handle acronyms well; e.g., "Y.M.C.A." would be treated as four words. I think that could also be handled by including . as a character to match in a word and then stripping it out if it's a full stop afterwards (i.e., by checking that it's the only period in the word as well as the last character).

Hope this is helpful for you:
string[] separators = new string[] {"'s", ",", ".", "!", "'", " "}; // "'s" must come before "'" or a stray "s" is left behind
string text = "My dog's bone, toy, are missing!";
foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
Console.WriteLine(word);

See "Regex word boundary expressions" and "What is the most efficient way to count all of the words in a richtextbox?". The moral of the story is that there are many ways to approach the problem, but regular expressions are probably the way to go for simplicity.

Split on whitespace, then trim anything that isn't a letter from the resulting strings.
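A rough sketch of that idea is below. Note one deliberate variation, which is an assumption about what the question wants: instead of trimming both ends, it keeps only the first run of letters in each token, so "dog's" becomes "dog". WordsDemo, GetWords and FirstLetterRun are hypothetical names.

```csharp
using System;
using System.Linq;

static class WordsDemo
{
    public static string[] GetWords(string text)
    {
        return text
            // A null separator array means "split on any whitespace".
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => FirstLetterRun(token))
            .Where(word => word.Length > 0)
            .ToArray();
    }

    // Skips leading non-letters, then returns the following run of letters.
    static string FirstLetterRun(string token)
    {
        int start = 0;
        while (start < token.Length && !char.IsLetter(token[start])) start++;

        int end = start;
        while (end < token.Length && char.IsLetter(token[end])) end++;

        return token.Substring(start, end - start);
    }

    static void Main()
    {
        string text = "he said. \"My dog's bone, toy, are missing!\"";
        Console.WriteLine(string.Join(" ", GetWords(text)));
        // he said My dog bone toy are missing
    }
}
```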

Here's a looping replace method... not fast, but a way to solve it...
string result = "string to cut ' stuff. ! out of";
".',!#".ToCharArray().ToList().ForEach(a => result = result.Replace(a.ToString(),""));
This assumes you want to place it back in the original string, not a new string or a list.

How do i strip special characters from the end of a string?

I need to strip unknown characters from the end of a string returned from an SQL database. I also need to log when a special character occurs in the string.
What's the best way to do this?

You can use the Trim() method to trim blanks or specific characters from the end of a string. If you need to trim a certain number of characters you can use the Substring() method. You can use regular expressions (System.Text.RegularExpressions namespace) to match patterns in a string and detect when they occur. See MSDN for more info.
If you need more help you'll need to provide a bit more info on what exactly you're trying to do.
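As a starting point, here is a minimal sketch that assumes "special" means anything that is not a letter or digit; adjust the test if your definition differs. It returns both the kept text and the removed tail so the caller can log the latter. TrimTailDemo and SplitTail are hypothetical names, and Console.WriteLine stands in for whatever logger you use.

```csharp
using System;

static class TrimTailDemo
{
    // Splits the value into the part that is kept and the trailing run
    // of characters that are neither letters nor digits.
    public static (string Kept, string Removed) SplitTail(string value)
    {
        int end = value.Length;
        while (end > 0 && !char.IsLetterOrDigit(value[end - 1])) end--;
        return (value.Substring(0, end), value.Substring(end));
    }

    static void Main()
    {
        var (kept, removed) = SplitTail("HELLO WORLD!!!!");
        if (removed.Length > 0)
            Console.WriteLine("Stripped: " + removed); // Stripped: !!!!
        Console.WriteLine(kept);                       // HELLO WORLD
    }
}
```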

First define what are unknown characters (chars other than 0-9, a to z and A to Z ?) and put them in an array
Loop through the characters of the string and check whether each one occurs in the array; if so, remove it.
You can also call String.Replace with the unknown character as the first parameter and an empty string as the replacement.

Since you've specified that the legal characters are only alphanumeric, you could do something like this:
Match m = Regex.Match(original, "^([0-9A-Za-z]*)(.*)$");
string good = m.Groups[1].Value;
string bad = m.Groups[2].Value;
if (bad.Length > 0)
{
// log bad characters
}
Console.WriteLine(good);

Your definition of the problem is not precise, but here is a quick trick:
string input;
...
var trimmed = input.TrimEnd(new[] {'#','$',...} /* array of unwanted characters */);
if (trimmed != input)
    myLogger.Log(input.Substring(trimmed.Length)); // log only the characters that were trimmed

Check out the Regex.Replace methods... there are lots of overloads. You can use the Match methods for the logging to identify all matches.
String badString = "HELLO WORLD!!!!";
Regex regex = new Regex("!{1,}$" );
String newString = regex.Replace(badString, String.Empty);
