Get the first word from the string

Get the first word from the string - c#

I would like to get only the first word of the string regardless of any character or punctuation in front of it.
Sometimes, there could be , or . or !. I don't want these characters.
var s = "Hello, World";
var firstWord = s.Substring(0, s.IndexOf(" "));
This gives me Hello,. I would like to get Hello only.
How do I achieve this?

Simply use the following regex:
var s = "Hello, World";
var result = Regex.Match(s, #"^([\w\-]+)");
Console.WriteLine(result.Value); // Result is "Hello"
This will get the first word regardless of whether or not it ends with punctuation or simply precedes a space.

This will work for you. I assumed that words will be separated with whitespace.
var input = "Hello, World";
var output = Regex.Replace(input.Split()[0], #"[^0-9a-zA-Z\ ]+", "");

IndexOfAny (https://msdn.microsoft.com/fr-ca/library/11w09h50(v=vs.110).aspx) is an alternative if you know the list of characters you want to use. It really depends on the definition you want to use and which characters you want to handle. How do you want to handle characters like œ,é,µ,½,¶,ç,+,-,3...?
Also, do you want to handle locale as some characters might have a classification that is dependant on the language.
Char has many function that allows you to classify characters. See https://msdn.microsoft.com/en-us/library/system.char(v=vs.110).aspx.
And there is also the regex solutions proposed by others.
So the best solution really depends on your need. Do you need to properly handle any Unicode characters or only some specific ASCII characters?

LATE ENTRY:
If you don't want to use Regular Expressions:
private string GetFirstWord(string text)
{
var candidate = text.Trim();
if (!candidate.Any(Char.IsWhiteSpace))
return text;
return candidate.Split(' ').FirstOrDefault();
}

Related

Regex Replacing only whole matches

I am trying to replace a bunch of strings in files. The strings are stored in a datatable along with the new string value.
string contents = File.ReadAllText(file);
foreach (DataRow dr in FolderRenames.Rows)
{
contents = Regex.Replace(contents, dr["find"].ToString(), dr["replace"].ToString());
File.SetAttributes(file, FileAttributes.Normal);
File.WriteAllText(file, contents);
}
The strings look like this _-uUa, -_uU, _-Ha etc.
The problem that I am having is when for example this string "_uU" will also overwrite "_-uUa" so the replacement would look like "newvaluea"
Is there a way to tell regex to look at the next character after the found string and make sure it is not an alphanumeric character?
I hope it is clear what I am trying to do here.
Here is some sample data:
private function _-0iX(arg1:flash.events.Event):void
{
if (arg1.type == flash.events.Event.RESIZE)
{
if (this._-2GU)
{
this._-yu(this._-2GU);
}
}
return;
}
The next characters could be ;, (, ), dot, comma, space, :, etc.

First of all, you should use Regex.Escape.
You can use then
contents = Regex.Replace(
contents,
Regex.Escape(dr["find"].ToString()) + #"(?![a-zA-Z])",
Regex.Escape(dr["replace"].ToString()));
or even better
contents = Regex.Replace(
contents,
#"\b" + Regex.Escape(dr["find"].ToString()) + #"\b",
Regex.Escape(dr["replace"].ToString()));

I think this is what you're looking for:
contents = Regex.Replace(
contents,
string.Format(#"(?<!\w){0}(?!\w)", Regex.Escape(dr["find"].ToString())),
dr["replace"].ToString().Replace("$", "$$")
);
You can't use \b because your search strings don't always start and end with word characters. Instead, I used (?<!\w) and (?!\w) to make sure the matched substring is not immediately preceded or followed by a word character (i.e., a letter, a digit, or an underscore). I don't know the complete specs for your search strings, so this pattern might need some tweaking.
None of the sample patterns you provided contain regex metacharacters, but like the other responders, I used Regex.Escape() to render it safe anyway. In the replacement string the only character you have to watch out for is the dollar sign (ref), and the way to escape that is with another dollar sign. Notice that I used String.Replace() for that instead of Regex.Replace().

There are two tricks that can help you here:
Order all the search string by length, and replace the longest ones first, that way you won't accidentally replace the shorter ones.
Use a MatchEvaluator and instead of looping through all your rows, search fro all replacement patterns in the string and look them up in your dataset.
Option one is simple, option two would look like this:
Regex.Replace(contents", "_-\\w+", ReplaceIdentifier)
public string ReplaceIdentifier(Match m)
{
DataRow row = FolderRenames.Rows.FindRow("find"); // Requires a primary key on "find"
if (row != null) return row["replace"];
else return m.Value;
}

match first digits before # symbol

How to match all first digits before # in this line
26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html
Im trying to get this number 26909578
My try
string text = #"26909578#Sbrntrl_7x06-lilla.avi#356028416#2012-10-24 09:06#0#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#[URL=http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html]http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html[/URL]#http://bitshare.com/files/dvk9o1oz/Sbrntrl_7x06-lilla.avi.html#http://bitshare.com/?f=dvk9o1oz#http://bitshare.com/delete/dvk9o1oz/4511e6f3612961f961a761adcb7e40a0/Sbrntrl_7x06-lilla.avi.html";
MatchCollection m1 = Regex.Matches(text, #"(.+?)#", RegexOptions.Singleline);
but then its outputs all text

Make it explicit that it has to start at the beginning of the string:
#"^(.+?)#"
Alternatively, if you know that this will always be a number, restrict the possible characters to digits:
#"^\d+"
Alternatively use the function Match instead of Matches. Matches explicitly says, "give me all the matches", while Match will only return the first one.

Or, in a trivial case like this, you might also consider a non-RegEx approach. The IndexOf() method will locate the '#' and you could easily strip off what came before.
I even wrote a sscanf() replacement for C#, which you can see in my article A sscanf() Replacement for .NET.

If you dont want to/dont like to use regex, use a string builder and just loop until you hit the #.
so like this
StringBuilder sb = new StringBuilder();
string yourdata = "yourdata";
int i = 0;
while(yourdata[i]!='#')
{
sb.Append(yourdata[i]);
i++;
}
//when you get to that # your stringbuilder will have the number you want in it so return it with .toString();
string answer = sb.toString();

The entire string (except the final url) is composed of segments that can be matched by (.+?)#, so you will get several matches. Retrieve only the first match from the collection returned by matching .+?(?=#)

How to remove non-ASCII word from a string in C#

I want to filter some string which has some wrong letters (non-ASCII). It looks different in Notepad, Visual Studio 2010 and MySQL.
How can I check if a string has non-ASCII letters and how I can remove them?

You could use a regular expression to filter non ASCII characters:
string input = "AB £ CD";
string result = Regex.Replace(input, "[^\x0d\x0a\x20-\x7e\t]", "");

You could use Regular Expressions.
Regex.Replace(input, "[^a-zA-Z0-9]+", "")
You could also use \W+ as the pattern to remove any non-character.

This has been a God-send:
Regex.Replace(input, #"[^\u0000-\u007F]", "");
I think I got it elsewhere originally, but here is a link to the same answer here:
How can you strip non-ASCII characters from a string? (in C#)

string testString = Regex.Replace(OldString, #"[\u0000-\u0008\u000A-\u001F\u0100-\uFFFF]", "");

First, you need to determine what you mean by a "word". If non-ascii, this probably implies non-english?
Personally, I'd ask why you need to do this and what fundamental assumption has your application got that conflicts with your data? Depending on the situation, I suggest you either re-encode the text from the source encoding, although this will be a lossy conversion, or alternatively, address that fundamental assumption so that your application handles data correctly.

I think something as simple as this would probably work, wouldn't it?
public static string AsciiOnly(this string input, bool includeExtendedAscii)
{
int upperLimit = includeExtendedAscii ? 255 : 127;
char[] asciiChars = input.Where(c => (int)c <= upperLimit).ToArray();
return new string(asciiChars);
}
Example usage:
string input = "AB£ȼCD";
string asciiOnly = input.AsciiOnly(false); // returns "ABCD"
string extendedAsciiOnly = input.AsciiOnly(true); // returns "AB£CD"

Regular expression identifier and separator using ':' symbol

I want to separate my string between two ':' characters.
For example, if the input is "mypage-google-wax:press:-happy", then I want "press" out.
It can be assumed that the input doesn't contain any numeric characters.

Any reason to use regular expressions at all, rather than just:
string[] bits = text.Split(':');
That's assuming I understood your question correctly... which I'm not at all sure about. Anyway, depending on what you really want to do, this might be useful to you...

If you're always going to have a string in the format {stuffIDontWant}:{stuffIWant}:{moreStuffIDontWant} then String.Split() is your answer, not Regex.
To retrieve that middle value, you'd do:
string input = "stuffIDontWant:stuffIWant:moreStuffIDontWant"; //get your input
string output = "";
string[] parts = input.Split(':');
//converts to an array of strings using the character specified as the separator
output = parts[1]; //assign the second one
return output;
Regex is good for patern matching, but, unless you're specifically looking for the word press, String.Split() is a better answer for this need.

If you want it in regex:
string pattern = ":([^:]+):";
string sentence = "some text :data1: some more text :data2: text";
foreach (Match match in Regex.Matches(sentence, pattern))
Console.WriteLine("Found '{0}' at position {1}",
match.Groups[1].Value, match.Index);

How do i strip special characters from the end of a string?

I need to strip unknown characters from the end of a string returned from an SQL database. I also need to log when a special character occurs in the string.
What's the best way to do this?

You can use the Trim() method to trim blanks or specific characters from the end of a string. If you need to trim a certain number of characters you can use the Substring() method. You can use Regexs (System.Text.RegularExpressions namespace) to match patterns in a string and detect when they occur. See MSDN for more info.
If you need more help you'll need to provide a bit more info on what exactly you're trying to do.

First define what are unknown characters (chars other than 0-9, a to z and A to Z ?) and put them in an array
Loop trough the characters of a string and check if the char occurs, if so remove.
you can also to a String.Replace with as param the unknown char, and replaceparam ''.

Since you've specified that the legal characters are only alphanumeric, you could do something like this:
Match m = Regex.Match(original, "^([0-9A-Za-z]*)(.*)$");
string good = m.Groups[1].Value;
string bad = m.Groups[2].Value;
if (bad.Length > 0)
{
// log bad characters
}
Console.WriteLine(good);

Your definition of the problem is not precise yet this is a fast trick to do so:
string input;
...
var trimed = input.TrimEnd(new[] {'#','$',...} /* array of unwanted characters */);
if(trimed != input)
myLogger.Log(input.Replace(trimed, ""));

check out the Regex.Replace methods...there are lots of overloads. You can use the Match methods for the logging to identify all matches.
String badString = "HELLO WORLD!!!!";
Regex regex = new Regex("!{1,}$" );
String newString = regex.Replace(badString, String.Empty);

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Get the first word from the string - c#

Simply use the following regex: var s = "Hello, World"; var result = Regex.Match(s, #"^([\w\-]+)"); Console.WriteLine(result.Value); // Result is "Hello" This will get the first word regardless of whether or not it ends with punctuation or simply precedes a space.

This will work for you. I assumed that words will be separated with whitespace. var input = "Hello, World"; var output = Regex.Replace(input.Split()[0], #"[^0-9a-zA-Z\ ]+", "");

LATE ENTRY: If you don't want to use Regular Expressions: private string GetFirstWord(string text) { var candidate = text.Trim(); if (!candidate.Any(Char.IsWhiteSpace)) return text; return candidate.Split(' ').FirstOrDefault(); }

Related

Regex Replacing only whole matches

match first digits before # symbol

How to remove non-ASCII word from a string in C#

Regular expression identifier and separator using ':' symbol

How do i strip special characters from the end of a string?

Categories

Resources