I get a string from a website called "willekeurigwoord.nl" which means random word. So when I get the string from the site with HtmlAgilityPack, it is formatted like "\n\t\t\tkegelvrucht\r\n \t\n\t\t".
So the word that I get is "kegelvrucht" but before and after the word there are backslashes which when I try to remove they get ignored even when I put "#" or use double backslashes ("\") in front of the string.
So my question is, how do I remove the \ in my string?
I did try everything that is in the comment lines.
private string RandomWordOnline() //Get the word online
{
//get string from htlm file with htmlagilitypack
var webGet = new HtmlWeb();
var doc = webGet.Load("http://www.willekeurigwoord.nl/");
String word = doc.DocumentNode.SelectSingleNode("//h1").InnerText;
//word = word.Replace(#"\", "");
//word = #word.Trim(new char[] {' ','\\'});
//word = word.Substring(8, word.Length - 13);
//word = word.Substring(0, 13);
//trying to remove backslash, does not work
for (int i = 0; i < word.Length; i++)
{
char chrWord = Convert.ToChar(word.Substring(i, 1));
char backslash = Convert.ToChar(#"\");
if (chrWord == backslash)
{
word = word.Remove(i, 1);
}
}
return word;
}
Those backslashes are not in the string, they are just a representation of tabs, carriage returns and line feeds. For example, a string which Visual Studio shows as \t\t\n\n is only 4 characters long, not 8.
You can get rid of them just like this:
var webGet = new HtmlWeb();
var doc = webGet.Load("http://www.willekeurigwoord.nl/");
String word = doc.DocumentNode.SelectSingleNode("//h1").InnerText;
string fixedWord = word.Trim();
Trim removes all white spaces that surround your text, including tabs and new lines. If you happen to only want to remove some specific characters, or to remove them in the middle of the string, you need to do something like this:
string fixedWord = word.Replace("\t", "").Replace("\n", "").Replace("\r", "").Trim();
Just call Trim() on your string:
string cleaned = word.Trim();
It will remove all leading and trailing whitespace, which includes all of the characters you want removed.
Probably a C# String expert will know the answer you are looking for. But this is a great example of where post C languages make things harder. Probably your \ is being taken as an escape character by the compiler, so the code never sees it at run time.
By the way, "word" is a terrible choice for a label because it is reserved in most languages (meaning a type 16 bits wide or something similar).
In C, you just go through the string character by character and copy each one into a new string based on whether it is or isn't '\'; (I didn't test/debug this, and you need to add bounds checking unless you know the sizes of all the strings.)
i = j = 0;
while (strIn[i] != '0') {
if (strIn[i] != '\') {
strOut[j++] = strIn[i];
}
i++;
}
(If that sounds like extra work, know that at run time, your C# is doing that anyway, and hiding the required interaction with the memory manager from you so you don't know why your program runs slowly.)
Related
Using C#, I have a string that is a SQL script containing multiple queries. I want to remove sections of the string that are enclosed in single quotes. I can do this using Regex.Replace, in this manner:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, "'[^']*'", string.Empty);
Results in: "Only can we turn him to the of the Force"
What I want to do is remove the substrings between quotes EXCEPT for substrings containing a specific substring. For example, using the string above, I want to remove the quoted substrings except for those that contain "dark," such that the resulting string is:
Results in: "Only can we turn him to the 'dark side' of the Force"
How can this be accomplished using Regex.Replace, or perhaps by some other technique? I'm currently trying a solution that involves using Substring(), IndexOf(), and Contains().
Note: I don't care if the single quotes around "dark side" are removed or not, so the result could also be: "Only can we turn him to the dark side of the Force." I say this because a solution using Split() would remove all the single quotes.
Edit: I don't have a solution yet using Substring(), IndexOf(), etc. By "working on," I mean I'm thinking in my head how this can be done. I have no code, which is why I haven't posted any yet. Thanks.
Edit: VKS's solution below works. I wasn't escaping the \b the first attempt which is why it failed. Also, it didn't work unless I included the single quotes around the whole string as well.
test = Regex.Replace(test, "'(?![^']*\\bdark\\b)[^']*'", string.Empty);
'(?![^']*\bdark\b)[^']*'
Try this.See demo.Replace by empty string.You can use lookahead here to check if '' contains a word dark.
https://www.regex101.com/r/rG7gX4/12
While vks's solution works, I'd like to demonstrate a different approach:
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
test = Regex.Replace(test, #"'[^']*'", match => {
if (match.Value.Contains("dark"))
return match.Value;
// You can add more cases here
return string.Empty;
});
Or, if your condition is simple enough:
test = Regex.Replace(test, #"'[^']*'", match => match.Value.Contains("dark")
? match.Value
: string.Empty
);
That is, use a lambda to provide a callback for the replacement. This way, you can run arbitrary logic to replace the string.
some thing like this would work. you can add all strings you want to keep into the excludedStrings array
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
var excludedString = new string[] { "dark side" };
int startIndex = 0;
while ((startIndex = test.IndexOf('\'', startIndex)) >= 0)
{
var endIndex = test.IndexOf('\'', startIndex + 1);
var subString = test.Substring(startIndex, (endIndex - startIndex) + 1);
if (!excludedString.Contains(subString.Replace("'", "")))
{
test = test.Remove(startIndex, (endIndex - startIndex) + 1);
}
else
{
startIndex = endIndex + 1;
}
}
Another method through regex alternation operator |.
#"('[^']*\bdark\b[^']*')|'[^']*'"
Then replace the matched character with $1
DEMO
string str = "Only 'together' can we turn him to the 'dark side' of the Force";
string result = Regex.Replace(str, #"('[^']*\bdark\b[^']*')|'[^']*'", "$1");
Console.WriteLine(result);
IDEONE
Explanation:
(...) called capturing group.
'[^']*\bdark\b[^']*' would match all the single quoted strings which contains the substring dark . [^']* matches any character but not of ', zero or more times.
('[^']*\bdark\b[^']*'), because the regex is within a capturing group, all the matched characters are stored inside the group index 1.
| Next comes the regex alternation operator.
'[^']*' Now this matches all the remaining (except the one contains dark) single quoted strings. Note that this won't match the single quoted string which contains the substring dark because we already matched those strings with the pattern exists before to the | alternation operator.
Finally replacing all the matched characters with the chars inside group index 1 will give you the desired output.
I made this attempt that I think you were thinking about (some solution using split, Contain, ... without regex)
string test = "Only 'together' can we turn him to the 'dark side' of the Force";
string[] separated = test.Split('\'');
string result = "";
for (int i = 0; i < separated.Length; i++)
{
string str = separated[i];
str = str.Trim(); //trim the tailing spaces
if (i % 2 == 0 || str.Contains("dark")) // you can expand your condition
{
result += str+" "; // add space after each added string
}
}
result = result.Trim(); //trim the tailing space again
I'm not an expert in regular expressions and today in my project I face the need to split long string in several lines in order to check if the string text fits the page height.
I need a C# regular expression to split long strings in several lines by "\n", "\r\n" and keeping 150 characters by line maximum. If the character 150 is in the middle of an word, the entire word should be move to the next line.
Can any one help me?
It's actually a quite simple problem. Look for any characters up to 150, followed by a space. Since Regex is greedy by nature it will do exactly what you want it to. Replace it by the Match plus a newline:
.{0,150}(\s+|$)
Replace with
$0\r\n
See also: http://regexhero.net/tester/?id=75645133-1de2-4d8d-a29d-90fff8b2bab5
var regex = new Regex(#".{0,150}", RegexOptions.Multiline);
var strings = regex.Replace(sourceString, "$0\r\n");
Here you go:
^.{1,150}\n
This will match the longest initial string like this.
if you just want to split a long string into lines of 150 chars then I'm not sure why you'd need a regular expression:
private string stringSplitter(string inString)
{
int lineLength = 150;
StringBuilder sb = new StringBuilder();
while (inString.Length > 0)
{
var curLength = inString.Length >= lineLength ? lineLength : inString.Length;
var lastGap = inString.Substring(0, curLength).LastIndexOfAny(new char[] {' ', '\n'});
if (lastGap == -1)
{
sb.AppendLine(inString.Substring(0, curLength));
inString = inString.Substring(curLength);
}
else
{
sb.AppendLine(inString.Substring(0, lastGap));
inString = inString.Substring(lastGap + 1);
}
}
return sb.ToString();
}
edited to account for word breaks
This code should help you. It will check the length of the current string. If it is greater than your maxLength (150) in this case, it will start at the 150th character and (going backwards) find the first non-word character (as described by the OP, this is a sequence of non-space characters). It will then store the string up to that character and start over again with the remaining string, repeating until we end up with a substring that is less than maxLength characters. Finally, join them all back together again in a final string.
string line = "This is a really long run-on sentence that should go for longer than 150 characters and will need to be split into two lines, but only at a word boundary.";
int maxLength = 150;
string delimiter = "\r\n";
List<string> lines = new List<string>();
// As long as we still have more than 'maxLength' characters, keep splitting
while (line.Length > maxLength)
{
// Starting at this character and going backwards, if the character
// is not part of a word or number, insert a newline here.
for (int charIndex = (maxLength); charIndex > 0; charIndex--)
{
if (char.IsWhiteSpace(line[charIndex]))
{
// Split the line after this character
// and continue on with the remainder
lines.Add(line.Substring(0, charIndex+1));
line = line.Substring(charIndex+1);
break;
}
}
}
lines.Add(line);
// Join the list back together with delimiter ("\r\n") between each line
string final = string.Join(delimiter , lines);
// Check the results
Console.WriteLine(final);
Note: If you run this code in a console application, you may want to change "maxLength" to a smaller number so that the console doesn't wrap on you.
Note: This code does not take into effect any tab characters. If tabs are also included, your situation gets a bit more complicated.
Update: I fixed a bug where new lines were starting with a space.
Im using a StreamReader to open a text file and grab its contents. I need to grab just the text from the file without any escape characters ( \n, \r, \", etc ). Google is failing me right now. Any ideas?
There are no escape characters in a text that you read from a file. Escape characters are used when you write a string literal, for example in program code. I assume that you mean that you want to replace any write space characters with plain spaces.
You can use a regular expression to match white space characters and replace them with spaces. It's easier to use the File.ReadAllText to read the text from the file:
string text = Regex.Replace(File.ReadAllText(fileName), #"[\r\n\t ]+", " ");
Why don't you just call ReadToEnd and then Split the string?
// using statement and whatever code here
var rawContent = sr.ReadToEnd();
var usefulContent = rawContent.Split(new []{ "\r\n", "\\" },
StringSplitOptions.RemoveEmptyEntries);
Note: you'll want to tweak the separators in the Split method; this is just an example.
You could also simply Replace the unwanted characters:
// using statement and whatever code here
var rawContent = sr.ReadToEnd();
var usefulContent = rawContent
.Replace("\r\n", "" )
.Replace("\\", "");
If you're trying to do it as you stream, call StreamReader.Read() in a while loop and test the characters one by one.
If you're able to grab the entire file contents into a string, use a regular expression to strip the undesirable characters. Check out RegexHero: http://regexhero.net/tester/
Assume you have read the entire file in a string s
for (int i = 0; i < s.Length; i++)
{
if (char.IsLetterOrDigit(s, i)) // or if (!char.IsWhiteSpace(s, i))
{
// append to StringBuilder
}
}
If IsLetterOrDigit or IsWhiteSpace don't fit your needs you can create your own method and call it.
You may use universal function for skipping all characters you not need:
public string SkipChars(string InputString, char[] CharsToSkip)
{
string result = InputString;
foreach (var chr in CharsToSkip)
{
result = result.Replace(chr.ToString(), "");
}
return result;
}
usage:
string test = "one\ntwo\tthree";
MessageBox.Show(SkipChars(test, new char[] { '\n', '\t' }));
This question is not related to:
Best way to break long strings in C# source code
Which is about source, this is about processing long outputs. If someone enters:
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
As a comment, it breaks the container and makes the entire page really wide. Is there any clever regexp that can say, define a maximum word length of 20 chars and then force a whitespace character?
Thanks for any help!
There's probably no need to involve regexes in something this simple. Take this extension method:
public static string Abbreviate(this string text, int length) {
if (text.Length <= length) {
return text;
}
char[] delimiters = new char[] { ' ', '.', ',', ':', ';' };
int index = text.LastIndexOfAny(delimiters, length - 3);
if (index > (length / 2)) {
return text.Substring(0, index) + "...";
}
else {
return text.Substring(0, length - 3) + "...";
}
}
If the string is short enough, it's returned as-is. Otherwise, if a "word boundary" is found in the second half of the string, it's "gracefully" cut off at that point. If not, it's cut off the hard way at just under the desired length.
If the string is cut off at all, an ellipsis ("...") is appended to it.
If you expect the string to contain non-natural-language constructs (such as URLs) you 'd need to tweak this to ensure nice behavior in all circumstances. In that case working with a regex might be better.
You could try using a regular expression that uses a positive look-ahead like this:
string outputStr = Regex.Replace(inputStr, #"([\S]{20}(?=\S+))", "$1\n");
This should "insert" a line break into all words that are longer than 20 characters.
Yes you can use this one regex
string pattern = #"^([\w]{1,20})$";
this regex allow to enter not more than 20 characters
string strRegex = #"^([\w]{1,20})$";
string strTargetString = #"asdfasfasfasdffffff";
if(Regex.IsMatch(strTargetString, strRegex))
{
//do something
}
If you need only lenght constraint you should use this regex
^(.{1,20})$
because the \w is match only
alphanumeric and underscore symbol
Here is what I'm trying to accomplish. I have an object coming back from
the database with a string description. This description can be up to 1000
characters long, but we only want to display a short view of this. So I coded
up the following, but I'm having trouble in actually removing the number of
words after the regular expression finds the total count of words. Does anyone
have good way of dispalying the words which are less than the Regex.Matches?
Thanks!
if (!string.IsNullOrEmpty(myObject.Description))
{
string original = myObject.Description;
MatchCollection wordColl = Regex.Matches(original, #"[\S]+");
if (wordColl.Count < 70) // 70 words?
{
uxDescriptionDisplay.Text =
string.Format("<p>{0}</p>", myObject.Description);
}
else
{
string shortendText = original.Remove(200); // 200 characters?
uxDescriptionDisplay.Text =
string.Format("<p>{0}</p>", shortendText);
}
}
EDIT:
So this is what I got working on my own:
else
{
int count = 0;
StringBuilder builder = new StringBuilder();
string[] workingText = original.Split(' ');
foreach (string word in workingText)
{
if (count < 70)
{
builder.AppendFormat("{0} ", word);
}
count++;
}
string shortendText = builder.ToString();
}
It's not pretty, but it worked. I would call it a pretty naive way of doing this. Thanks for all of the suggestions!
I would opt to go by a strict character count rather than a word count because you might happen to have a lot of long words.
I might do something like (pseudocode)
if text.Length > someLimit
find first whitespace after someLimit (or perhaps last whitespace immediately before)
display substring of text
else
display text
Possible code implementation:
string TruncateText(string input, int characterLimit)
{
if (input.Length > characterLimit)
{
// find last whitespace immediately before limit
int whitespacePosition = input.Substring(0, characterLimit).LastIndexOf(" ");
// or find first whitespace after limit (what is spec?)
// int whitespacePosition = input.IndexOf(" ", characterLimit);
if (whitespacePosition > -1)
return input.Substring(0, whitespacePosition);
}
return input;
}
One method, if you're using at least C#3.0, would be a LINQ like the following. This is provided you're going strictly by word count, not character count.
if (wordColl.Count > 70)
{
foreach (var subWord in wordColl.Cast<Match>().Select(r => r.Value).Take(70))
{
//Build string here out of subWord
}
}
I did a test using a simple Console.WriteLine with your Regex and your question body (which is over 70 words, it turns out).
You can use Regex Capture Groups to hold the match and access it later.
For your application, I'd recommend instead simply splitting the string by spaces and returning the first n elements of the array:
if (!string.IsNullOrEmpty(myObject.Description))
{
string original = myObject.Description;
string[] words = original.Split(' ');
if (words.Length < 70)
{
uxDescriptionDisplay.Text =
string.Format("<p>{0}</p>", original);
}
else
{
string shortDesc = string.Empty;
for(int i = 0; i < 70; i++) shortDesc += words[i] + " ";
uxDescriptionDisplay.Text =
string.Format("<p>{0}</p>", shortDesc.Trim());
}
}
Are you wanting to remove 200 characters or start truncating at the 200th character? When you call original.Remove(200) you are indexing the start of the truncation at the 200th character. This is how you use Remove() for a certain number of characters to remove:
string shortendText = original.Remove(0,200);
This starts at the first character and removes 200 starting with that one. Which I imagine that's not what you're trying to do since you're shortening a description. That's merely the correct way to use Remove().
Instead of using Regex matchcollections why not just split the string? It's a lot easier and straight forward. You can set the delimiter to a space character and split that way. Not sure if that completely fixes your need but it just might. I'm not sure what your data looks like in the description. But you split this way:
String[] wordArray = original.Split(' ');
From there you can determine the word count with wordArray's Length property value.
If I was you I would go by characters as you may have many one letter words or many long words in your text.
Go through until characters <= your limit, then either find the next space and then add these characters to a new string (possibly using the SubString method) or take these characters and add a few full stops, then make a new string The later could be unproffessional I suppose.