C# String/StringBuilder MemoryException on Large set of Replaces

I am trying to figure out a better way to manipulate a large string. I have tried both string and StringBuilder, and neither works.
The function below takes in a string and searches it with a regex to find any links. I want to wrap every occurrence of a link in a valid anchor tag. My issue is that I have a database entry (string) containing 101 link values that need to be replaced, and I am running out of memory.
Is there a better way to do this? I have included both the string.Replace and StringBuilder.Replace versions; neither works.
var resultString = new StringBuilder(testb);
Regex regx = new Regex(@"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[.\!\/\\w]*))?)", RegexOptions.IgnoreCase);
MatchCollection matches = regx.Matches(txt);
foreach (Match match in matches)
{
    if (match.Value.StartsWith("http://") || match.Value.StartsWith("https://"))
        fixedurl = match.Value;
    else
        fixedurl = "http://" + match.Value;
    resultString.Replace(match.Value, "<a target='_blank' class='ts-link ui-state-default' href='" + fixedurl + "'>" + match.Value + "</a>");
    //testb = testb.Replace(match.Value, "<a target='_blank' class='ts-link ui-state-default' href='" + fixedurl + "'>" + match.Value + "</a>");
}

You can try the following. It may perform better in your specific case.
Regex regx = new Regex(@"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[.\!\/\\w]*))?)", RegexOptions.IgnoreCase);
string resultString = regx.Replace(txt, (match) =>
{
    string fixedurl = (match.Value.StartsWith("http://") || match.Value.StartsWith("https://"))
        ? match.Value
        : "http://" + match.Value;
    return "<a target='_blank' class='ts-link ui-state-default' href='" + fixedurl + "'>" + match.Value + "</a>";
});
EDITED:
BTW, the issue with your code seems to be the resultString.Replace call. It replaces every occurrence of the matched text, including the copies already sitting inside the anchors inserted by earlier iterations. So when a link repeats (or is contained in a longer link), the same text is wrapped again on each pass, and the string probably keeps growing until it hits an OutOfMemoryException.
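To make the growth concrete, here is a minimal sketch that compares the loop from the question with the single-pass Regex.Replace. The class name, the helper names, and the simplified www-only pattern are assumptions made for illustration; they are not part of the original code.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class LinkWrapDemo
{
    // Simplified stand-in for the URL pattern in the question (assumption).
    private static readonly Regex Url =
        new Regex(@"www\.[A-Za-z0-9.-]+", RegexOptions.IgnoreCase);

    private static string Wrap(string url)
    {
        return "<a href='http://" + url + "'>" + url + "</a>";
    }

    // Mirrors the loop from the question: one global Replace per match.
    public static string WrapWithLoop(string text)
    {
        var sb = new StringBuilder(text);
        foreach (Match m in Url.Matches(text))
        {
            // Replaces ALL occurrences, including those inside earlier anchors.
            sb.Replace(m.Value, Wrap(m.Value));
        }
        return sb.ToString();
    }

    // Single pass: each match is rewritten once, at its own position.
    public static string WrapWithEvaluator(string text)
    {
        return Url.Replace(text, m => Wrap(m.Value));
    }

    // Counts non-overlapping occurrences of value in text.
    public static int CountOccurrences(string text, string value)
    {
        int count = 0;
        int index = 0;
        while ((index = text.IndexOf(value, index, StringComparison.Ordinal)) >= 0)
        {
            count++;
            index += value.Length;
        }
        return count;
    }
}
```

For the input "www.a.com www.a.com www.a.com", the loop version ends up with 24 copies of www.a.com (the count doubles on each of the three passes), while the evaluator version ends up with 6 (two per anchor). With 101 links, that doubling would explain the memory exhaustion.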

Related

StringBuilder.Append deleting '\r' and '\n' tag

I'm building a colorized string in my cshtml.
StringBuilder colorizedOutput = new StringBuilder();
string concreteChar = outputArray[j] == ' ' ? "&nbsp;" : outputArray[j].ToString();
string htmlSpan = "<span " + "style=" + '"' + $"background-color: {color};" + "color:white;" + '"' + ">" + concreteChar + "</span>";
colorizedOutput.Append(htmlSpan);
@Html.Raw(colorizedOutput.ToString())
Each character in this string is a span.
My htmlSpan with a "\r" character looks like this:
"<span style=\"background-color: red;color:white;\">\r</span>"
but the .Append method saves it like this:
"<span style=\"background-color: red;color:white;\"></span>"
As you can see, the \r disappears.
My question is: how do I deal with \r\n characters in StringBuilder?
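StringBuilder does not strip \r or \n when you append a string. The characters are still in the result, but a carriage return is a control character, so it does not show up as visible text. If you want the literal text \r to appear, escape the backslash as shown below.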
var sb = new StringBuilder();
var html = "<span style=\"background-color: red;color:white;\">\\r</span>";
sb.Append(html);
Note that I have changed \r to \\r.
Output of sb.ToString():
<span style="background-color: red;color:white;">\r</span>
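One way to check what Append actually stores is to look for the character in the built string. The sketch below does that and also shows a possible helper that maps control characters to visible text before they are wrapped in a span. The class and method names are hypothetical, and mapping a space to &nbsp; is an assumption about the intended markup.

```csharp
using System.Text;

public static class ControlCharDemo
{
    // Builds one span around a single character, as in the question.
    public static string BuildSpan(char c)
    {
        var sb = new StringBuilder();
        sb.Append("<span style=\"background-color: red;color:white;\">")
          .Append(c)
          .Append("</span>");
        return sb.ToString();
    }

    // Hypothetical helper: turns characters that render invisibly
    // into text that is visible in the page.
    public static string MakeVisible(char c)
    {
        switch (c)
        {
            case '\r': return "\\r";     // literal backslash + r
            case '\n': return "\\n";     // literal backslash + n
            case ' ': return "&nbsp;";   // assumption: non-breaking space
            default: return c.ToString();
        }
    }
}
```

BuildSpan('\r').Contains("\r") returns true, so the carriage return is stored; it is just not visible when rendered. Passing MakeVisible(c) as the span content instead of the raw character is one way to show it.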
You can simplify the htmlSpan to
string htmlSpan = $"<span style='background-color: {color};color:white;'>{concreteChar}</span>";
This will resolve your issue.
This works for me:
var someString = @"Chunk1 \r\n Chunk2";
var color = "red";
var colorizedOutput = new StringBuilder();
foreach (var ch in someString.ToCharArray())
{
    var concreteChar = ch == ' ' ? "&nbsp;" : ch.ToString();
    var htmlSpan = "<span " + "style=" + '"' + $"background-color: {color};" + "color:white;" + '"' + ">" + concreteChar + "</span>" + "\n";
    colorizedOutput.Append(htmlSpan);
}
Console.WriteLine(colorizedOutput.ToString());
Just use the Environment.NewLine property, something like:
StringBuilder colorizedOutput = new StringBuilder();
colorizedOutput.AppendFormat("<span style=\"background-color: red;color:white;\">{0}</span>", Environment.NewLine);
string s = colorizedOutput.ToString();

How to find and Replace a token in a c# string that starts with a value

I've got an issue where I am applying a template to an object and am using a find and replace function to mesh the template in the form of a string of html. The issue is, the height and width of the image are contained in the token so I don't have a way to find and replace as it could vary.
Token value is [ARTICLEIMAGE:150:200]
foreach (var article in articles) {
    var articleTemplateValue = _TemplateArticleMarkup;
    articleTemplateValue = articleTemplateValue.Replace("[ARTICLEIMAGE:xx:yy]", "<img src=" + article.ArticleImageFolder + "/" + article.ArticleImage + " title=" + article.ArticleTitle + " width="xx" height="yy" />");
}
This obviously would not work for every example, as the dimensions in the image token will vary. Is there a way to find the token with something like StartsWith and then split the dimensions into an array on the :? Please let me know if that makes sense, as it is a little confusing. Thanks!
Regex will solve this issue for you.
using System.Text.RegularExpressions;
Then change your code as seen below.
foreach (var article in articles)
{
    string articleTemplateValue = _TemplateArticleMarkup;
    MatchCollection mc = Regex.Matches(articleTemplateValue, @"\[ARTICLEIMAGE\:(\d+)\:(\d+)\]");
    if (mc.Count > 0)
    {
        string toReplace = mc[0].Value;
        string xx = mc[0].Groups[1].Value;
        string yy = mc[0].Groups[2].Value;
        articleTemplateValue = articleTemplateValue.Replace(toReplace, "<img src=\"" + article.ArticleImageFolder + "/" + article.ArticleImage + "\" title=\"" + article.ArticleTitle + "\" width=\"" + xx + "\" height=\"" + yy + "\"/>");
    }
}
You can use the Split() method to find the width and the height. A very rough approach follows (rextester demo):
String articleTemplateValue = "[test:40:200]";
Console.WriteLine(articleTemplateValue);
var arr = articleTemplateValue.Split(':');
if (arr.Length == 3) {
var xx = arr[1];
var yy = arr[2].Substring(0, arr[2].Length - 1);
articleTemplateValue = articleTemplateValue.Replace(articleTemplateValue, "<img src="
+ "folder" + "/" + "image" + " title=" + "ArticleTitle" + " width="+ xx + " height= " + yy+ "/>");
Console.WriteLine(articleTemplateValue);
}
Using Regex would do the trick (repl.it demo):
"\[ARTICLEIMAGE:\d+?:\d+?\]"
\[ escapes the [ character. Brackets are special characters in Regex.
\d matches any digit.
\d+?: matches one or more digits, up to the colon :. The ? makes the quantifier non-greedy and is not really needed here.
\] escapes the closing bracket.
var matches = Regex.Match(articleTemplateValue, @"\d+");
var xx = matches;
var yy = matches.NextMatch();
var template = "<img src=" + article.ArticleImageFolder + "/" + article.ArticleImage + " title=" + article.ArticleTitle + " width="
    + xx + " height="
    + yy + " />";
articleTemplateValue = Regex.Replace(articleTemplateValue, @"\[ARTICLEIMAGE:\d+?:\d+?\]", template);
Using
Regex.Match(string, string)
Regex.Replace(string, string, string)
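If the template may contain several tokens, or digits that appear before the token, a match evaluator can read the width and height from the capture groups of each token it replaces. This is only a sketch: the class name, the method name, and the src and title parameters are hypothetical, and the HTML encoding of the attribute values is an added precaution, not something the original code does.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class ArticleImageToken
{
    // Group 1 is the width, group 2 is the height.
    private static readonly Regex Token =
        new Regex(@"\[ARTICLEIMAGE:(\d+):(\d+)\]");

    // Replaces every [ARTICLEIMAGE:w:h] token in the template with an img tag.
    public static string Expand(string template, string src, string title)
    {
        return Token.Replace(template, m =>
            "<img src=\"" + WebUtility.HtmlEncode(src) + "\"" +
            " title=\"" + WebUtility.HtmlEncode(title) + "\"" +
            " width=\"" + m.Groups[1].Value + "\"" +
            " height=\"" + m.Groups[2].Value + "\" />");
    }
}
```

Inside the loop from the question this could be called as ArticleImageToken.Expand(_TemplateArticleMarkup, article.ArticleImageFolder + "/" + article.ArticleImage, article.ArticleTitle). Because the replacement comes from a delegate, characters such as $ in the title are not interpreted as substitution patterns.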

Extract word out of string using regex

I want to extract a certain word out of a string using regex.
I have this code now, and it works perfectly when I search for *:
public static string Tagify(string value, string search, string htmlTag, bool clear = false)
{
    Regex regex = new Regex(@"\" + search + "([^)]*)\\" + search);
    var v = regex.Match(value);
    if (v.Groups[1].ToString() == "" || v.Groups[1].ToString() == value || clear == true)
    {
        return value.Replace(search, "");
    }
    return value.Replace(v.Groups[0].ToString(), "<" + htmlTag + ">" + v.Groups[1].ToString() + "</" + htmlTag + ">");
}
But now I need to search for **, and unfortunately this does not work.
How can I achieve this?
I think the simplest solution is to use lazy dot matching in a capturing group.
Replace
Regex regex = new Regex(@"\" + search + "([^)]*)\\" + search);
with
Regex regex = new Regex(string.Format("{0}(.*?){0}", Regex.Escape(search)));
Or in C#6.0
Regex regex = new Regex($"{Regex.Escape(search)}(.*?){Regex.Escape(search)}");
Regex.Escape will escape any special chars for you, no need to manually append \ symbols.
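Putting the escaped pattern into a complete method might look like the sketch below. It is a simplified variant, not the original Tagify: it drops the clear parameter, replaces every delimited run rather than only the first, and uses .+? so that an empty pair of delimiters is left alone. The class name is hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class Tagifier
{
    // Wraps every run of text enclosed in the search delimiter with the tag.
    public static string Tagify(string value, string search, string htmlTag)
    {
        // Escape the delimiter so that characters like * are matched literally.
        string delimiter = Regex.Escape(search);
        string pattern = delimiter + "(.+?)" + delimiter;

        return Regex.Replace(value, pattern, m =>
            "<" + htmlTag + ">" + m.Groups[1].Value + "</" + htmlTag + ">");
    }
}
```

For example, Tagifier.Tagify("a **bold** b", "**", "b") returns "a <b>bold</b> b", and the same method works for a single * delimiter.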

How to remove special characters with regex instead of Replace string

My current method:
var q = new StringBuilder(query);
return q.Replace("'", " ")
.Replace("\"", " ")
.Replace(":", "")
.Replace("#", " ")
.Replace("/", " ")
.Replace("\\", " ")
.Replace(",", " ")
.Replace("&", " ")
.Replace("?", " ")
.Replace("%", " ")
.Replace(".", " ")
.Replace("quot;", " ")
.Replace("-", " ")
.Replace("*", " ")
.ToString().Trim();
How can I do this using regex for better performance?
Edited: Sorry, I want to replace all special characters with a space " ".
You could use this:
string q = Regex.Replace(query, @"[:#/\\]", ".");
q = Regex.Replace(q, @"&quot;|['"",&?%\.*-]", " ");
EDIT:
On closer inspection of what you're doing, your code is translating several characters into ., and then translating all . into spaces. So you could just do this:
string q = Regex.Replace(query, @"&quot;|['"",&?%\.*:#/\\-]", " ").Trim();
I'm not really sure what you're trying to do here, though. I feel like what you're really looking for is something like:
string q = Regex.Replace(query, @"[^\w\s]", "");
The presence of &quot; in there throws me for a loop, and is why I'm not sure what you're doing. If you want to get rid of HTML entities, you could run query through HttpUtility.HtmlDecode(string) first and then apply the regex.
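The decode-then-replace idea could be sketched as follows. Two things here are assumptions rather than part of the answer: it uses WebUtility.HtmlDecode from System.Net (which does the same job as HttpUtility.HtmlDecode without a System.Web reference), and it replaces with a space, as the edited question asks, instead of removing the characters. The class name is hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class QueryCleaner
{
    // Anything that is not a word character or whitespace.
    // Note that \w includes the underscore, so _ is kept.
    private static readonly Regex NonWord = new Regex(@"[^\w\s]");

    public static string Clean(string query)
    {
        // Turn entities such as &quot; back into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(query);

        // Replace each remaining special character with a single space.
        return NonWord.Replace(decoded, " ").Trim();
    }
}
```

With this, QueryCleaner.Clean("a&quot;b") returns "a b". Keeping the Regex in a static field avoids rebuilding it on every call, which matters if the method runs for every query.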
Try this.
string pattern = @"[^a-zA-Z0-9]";
string test = Regex.Replace("abc*&34567*opdldld(aododod';", pattern, " ");

C# Regex Issue Getting URLs

To explain briefly, I'm trying to search Google with a keyword, then get the URLs of the top 10 results and save them.
This is the stripped down command line version of the code. It should return 1 result at least. If it works with that, I can apply it to my full version of the code and get all the results.
Basically the code I have right now, it fails if I try to get the entire source of Google. If I include a random section of code from Google's HTML source, it works fine. To me, that means my Regex has an error somewhere.
If there is a better way to do this aside from Regex, please let me know. The URLs are between <h3 class="r"><a href=" and " class=l onmousedown="return clk(this.href
I got this Regex code from a generator, but it's really hard for me to understand Regex, since nothing I've read explains it clearly. If someone could pick out what's wrong and explain why, I'd greatly appreciate it.
Thanks,
Kevin
using System;
using System.Text.RegularExpressions;
using System.Net;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
WebClient wc = new WebClient();
string keyword = "seo nj";
string html = wc.DownloadString(String.Format("http://www.google.com/search?q={0}", keyword));
string re1 = "(<)"; // Any Single Character 1
string re2 = "(h3)"; // Alphanum 1
string re3 = "(\\s+)"; // White Space 1
string re4 = "(class)"; // Variable Name 1
string re5 = "(=)"; // Any Single Character 2
string re6 = "(\"r\")"; // Double Quote String 1
string re7 = "(>)"; // Any Single Character 3
string re8 = "(<)"; // Any Single Character 4
string re9 = "([a-z])"; // Any Single Word Character (Not Whitespace) 1
string re10 = "(\\s+)"; // White Space 2
string re11 = "((?:[a-z][a-z]+))"; // Word 1
string re12 = "(=)"; // Any Single Character 5
string re13 = ".*?"; // Non-greedy match on filler
string re14 = "((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))"; // HTTP URL 1
string re15 = "(\")"; // Any Single Character 6
string re16 = "(\\s+)"; // White Space 3
string re17 = "(class)"; // Word 2
string re18 = "(=)"; // Any Single Character 7
string re19 = "(l)"; // Any Single Character 8
string re20 = "(\\s+)"; // White Space 4
string re21 = "(onmousedown)"; // Word 3
string re22 = "(=)"; // Any Single Character 9
string re23 = "(\")"; // Any Single Character 10
string re24 = "(return)"; // Word 4
string re25 = "(\\s+)"; // White Space 5
string re26 = "(clk)"; // Word 5
Regex r = new Regex(re1 + re2 + re3 + re4 + re5 + re6 + re7 + re8 + re9 + re10 + re11 + re12 + re13 + re14 + re15 + re16 + re17 + re18 + re19 + re20 + re21 + re22 + re23 + re24 + re25 + re26, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Match m = r.Match(html);
if (m.Success)
{
Console.WriteLine("Good");
String c1 = m.Groups[1].ToString();
String alphanum1 = m.Groups[2].ToString();
String ws1 = m.Groups[3].ToString();
String var1 = m.Groups[4].ToString();
String c2 = m.Groups[5].ToString();
String string1 = m.Groups[6].ToString();
String c3 = m.Groups[7].ToString();
String c4 = m.Groups[8].ToString();
String w1 = m.Groups[9].ToString();
String ws2 = m.Groups[10].ToString();
String word1 = m.Groups[11].ToString();
String c5 = m.Groups[12].ToString();
String httpurl1 = m.Groups[13].ToString();
String c6 = m.Groups[14].ToString();
String ws3 = m.Groups[15].ToString();
String word2 = m.Groups[16].ToString();
String c7 = m.Groups[17].ToString();
String c8 = m.Groups[18].ToString();
String ws4 = m.Groups[19].ToString();
String word3 = m.Groups[20].ToString();
String c9 = m.Groups[21].ToString();
String c10 = m.Groups[22].ToString();
String word4 = m.Groups[23].ToString();
String ws5 = m.Groups[24].ToString();
String word5 = m.Groups[25].ToString();
//Console.Write("(" + c1.ToString() + ")" + "(" + alphanum1.ToString() + ")" + "(" + ws1.ToString() + ")" + "(" + var1.ToString() + ")" + "(" + c2.ToString() + ")" + "(" + string1.ToString() + ")" + "(" + c3.ToString() + ")" + "(" + c4.ToString() + ")" + "(" + w1.ToString() + ")" + "(" + ws2.ToString() + ")" + "(" + word1.ToString() + ")" + "(" + c5.ToString() + ")" + "(" + httpurl1.ToString() + ")" + "(" + c6.ToString() + ")" + "(" + ws3.ToString() + ")" + "(" + word2.ToString() + ")" + "(" + c7.ToString() + ")" + "(" + c8.ToString() + ")" + "(" + ws4.ToString() + ")" + "(" + word3.ToString() + ")" + "(" + c9.ToString() + ")" + "(" + c10.ToString() + ")" + "(" + word4.ToString() + ")" + "(" + ws5.ToString() + ")" + "(" + word5.ToString() + ")" + "\n");
Console.WriteLine(httpurl1);
}
else
{
Console.WriteLine("Bad");
}
Console.ReadLine();
}
}
}
You're doing it wrong.
Google has an API for doing searches programmatically. Don't put yourself through the pain of trying to parse HTML with regexes, when there's already a published, supported way to do what you want.
Besides, what you're trying to do -- submit automated searches through Google's Web site and scrape the results -- is a violation of section 5.3 of their Terms of Service:
You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)
Using RegEx to parse HTML is sado-masochism.
Try using the HTML Agility Pack instead. It will allow you to parse HTML. See this question for an example of using it.
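As a rough sketch of what that could look like, the code below pulls the href values out of the h3 class="r" anchors described in the question. It assumes the third-party HtmlAgilityPack NuGet package is referenced and that the markup still has that shape; the class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package (assumed to be installed)

public static class ResultLinkExtractor
{
    // Returns the href of every <a> directly inside an <h3 class="r">.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//h3[@class='r']/a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The XPath expression replaces the 26-part regex from the question, and it keeps working if attributes are reordered or extra whitespace appears inside the tags.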
