Regex to replace double nested quotes in C# - c#

I am trying to replace doubly nested quotes in a string in C# using Regex, but have not been able to get it working so far. Below are the sample text and the code I tried:
string html = "<img src=\"imagename=\"g1\"\" alt = \"\">";
string output = string.Empty;
Regex reg = new Regex(@"([^\^,\r\n])""""+(?=[^$,\r\n])", RegexOptions.Multiline);
output = reg.Replace(html, @"$1");
The above gives the following output:
"<img src="imagename="g1 alt = >"
The output I am actually looking for is:
"<img src="imagename=g1" alt = "">"
Please suggest how to correct the above code.

Pattern : \s*"\s*([^ "]+)"\s*(?=[">])|(?<=")("")(?=")
Replacement : $1
Here is a demo, tested at regexstorm.
String literal for use in programs:
@"\s*""\s*([^ ""]+)""\s*(?=["">])|(?<="")("""")(?="")"
To keep it simpler and more precise, you can target the src attribute value directly:
Pattern : (\bsrc="[^ =]+=)"([^ "]+")"
Replacement : $1$2
Here is an online demo, tested at regexstorm.
String literal for use in programs:
@"(\bsrc=""[^ =]+=)""([^ ""]+"")"""
Note: I assume attribute values don't contain any spaces.
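As a minimal sketch of how the src-specific pattern above could be wired into C#: the class and method names here (NestedQuoteFixer, FixSrc) are made up for illustration, and the same assumption applies that attribute values contain no spaces.

```csharp
using System.Text.RegularExpressions;

public static class NestedQuoteFixer
{
    // Hypothetical helper; the pattern is the src-specific one from the answer,
    // written as a verbatim string so each " is doubled.
    private static readonly Regex SrcPattern =
        new Regex(@"(\bsrc=""[^ =]+=)""([^ ""]+"")""");

    public static string FixSrc(string html)
    {
        // $1 = src="name=   $2 = inner value plus one closing quote.
        // The two extra quotes around the inner value are dropped.
        return SrcPattern.Replace(html, "$1$2");
    }
}
```

Calling NestedQuoteFixer.FixSrc on the sample string from the question should leave the empty alt = "" attribute untouched, because the pattern only fires after src=.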

Related

Regex extraction of a specific pattern

I have a string of following format. I have three scenarios which follows as:
Scenario 1:
"\\hjsschjsn\Bunong.PU2.PV/-56Noogg.BSC";
The extraction should run up to ".BSC"; ".BSC" will always be present in the original string. Also, "\" and "\" will always be there, but the text will change.
I have to omit the middle part, so my output should be:
"\\hjsschjsn\-56Noogg.BSC";
Scenario 2:
"\\adajsschjsn\Bcscx.sdjhs\AHHJogg.BSC";
The output should be :
"\\adajsschjsn\AHHJogg.BSC";
Scenario 3:
"aasjkankn\\adajsschjsn\Bcscx.sdjhs\AHHJogg.BSC\djkhakdjhjkj";
output should be:
"\\adajsschjsn\AHHJogg.BSC";
Here's what I have tried:
string text = "\\\\hjsschjsn\Bunong.PU2.PV/-56Noogg.BSC";
//Note: I have written \\\\ instead of \\ because the backslashes must be escaped inside a string literal
Match pattern = Regex.Match(text, @"\\\\[\w]+\\/[\w*]+.BSC");
Try the following pattern:
.*(\\\\[^\\]*\\)([^\\\/]+)[\\\/](.*?\.BSC).*
Replace it with $1$3:
Regex reg = new Regex(@".*(\\\\[^\\]*\\)([^\\\/]+)[\\\/](.*?\.BSC).*");
string input = @"\\hjsschjsn\Bunong.PU2.PV/-56Noogg.BSC";
string output = reg.Replace(input, "$1$3");
See example here
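To check the pattern against all three scenarios from the question, a small sketch along these lines may help. BscPathExtractor and Extract are illustrative names, not part of the original answer, and the inputs use verbatim strings so the backslashes appear exactly as in the question.

```csharp
using System.Text.RegularExpressions;

public static class BscPathExtractor
{
    // Group 1: the \\server\ prefix, group 2: the middle part to drop,
    // group 3: the file name up to and including .BSC.
    private static readonly Regex PathPattern =
        new Regex(@".*(\\\\[^\\]*\\)([^\\\/]+)[\\\/](.*?\.BSC).*");

    public static string Extract(string input)
    {
        // Keep groups 1 and 3; any text before \\ or after .BSC is discarded
        // by the leading and trailing .* parts of the pattern.
        return PathPattern.Replace(input, "$1$3");
    }
}
```

For example, BscPathExtractor.Extract(@"aasjkankn\\adajsschjsn\Bcscx.sdjhs\AHHJogg.BSC\djkhakdjhjkj") should give \\adajsschjsn\AHHJogg.BSC, as asked for in scenario 3.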
Match pattern1 = Regex.Match(text, @"\\\\\w+\\");
// [\w-] keeps the leading '-' of "-56Noogg"; \. matches a literal dot
Match pattern2 = Regex.Match(text, @"[\w-]+\.BSC");
Console.WriteLine(pattern1.ToString() + pattern2.ToString());

Regex pattern for text between 2 strings

I am trying to extract all of the text (shown as xxxx) in the following pattern:
Session["xxxx"]
using C#.
This may also be Request.Querystring["xxxx"], so I am trying to build the expression dynamically. When I do so, I get all sorts of problems with unescaped characters or no matches :(
an example might be:
string patternstart = "Session[";
string patternend = "]";
string regexexpr = @"\\" + patternstart + @"(.*?)\\" + patternend ;
string sText = "Text to be searched containing Session[\"xxxx\"] the result would be xxxx";
MatchCollection matches = Regex.Matches(sText, #regexexpr);
Can anyone help with this as I am stumped (as I always seem to be with RegEx :) )
With a few small modifications to your code:
string patternstart = Regex.Escape("Session[");
string patternend = Regex.Escape("]");
string regexexpr = patternstart + @"(.*?)" + patternend;
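A slightly fuller sketch of the same idea, assuming you want every match and want the quotes excluded from the result. DelimitedTextExtractor and Between are hypothetical names; the only library calls are Regex.Escape and Regex.Matches.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DelimitedTextExtractor
{
    // Returns every piece of text found between the literal start and end markers.
    public static List<string> Between(string input, string start, string end)
    {
        // Regex.Escape makes characters such as [ and . literal,
        // so the markers can be arbitrary strings.
        string pattern = Regex.Escape(start) + "(.*?)" + Regex.Escape(end);

        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Groups[1].Value); // group 1 is the lazy (.*?) capture
        }
        return results;
    }
}
```

Passing "Session[\"" and "\"]" as the markers leaves just xxxx; switching to "Request.Querystring[\"" only requires changing the start argument.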
The pattern you construct in your example looks something like this:
\\Session[(.*?)\\]
There are a couple of problems with this. First, it assumes the match starts with a literal backslash. Second, it wraps the entire (.*?) in a character class, which means it will match any single open parenthesis, period, asterisk, question mark, close parenthesis or backslash. You'd need to escape the opening bracket in your pattern if you want to match a literal [.
You could use a pattern like this:
Session\[(.*?)]
For example:
string regexexpr = @"Session\[(.*?)]";
string sText = "Text to be searched containing Session[\"xxxx\"] the result would be xxxx";
MatchCollection matches = Regex.Matches(sText, #regexexpr);
Console.WriteLine(matches[0].Groups[1].Value); // "xxxx"
The characters [ and ] have a special meaning in regular expressions: they define a character class, where one of the contained characters must match. To work around this, simply 'escape' them with a leading \ character (in a verbatim string, so that C# does not treat the backslash as its own escape):
string patternstart = @"Session\[";
string patternend = @"\]";
An example "final string" could then be:
Session\["(.*)"\]
However, you could easily write your regex to handle Session, Querystring, etc. automatically if you require (without also matching every other array you throw at it), and avoid having to build up the string in the first place:
(Querystring|Session|Form)\["(.*)"\]
and then take the second capture group.

Need Regex to match [#URL^Url Description^#]

I need regex to find this text
[#URL^Url Description^#]
in a string and replace it with
Url Description
"Url Description" can be set of characters in any language.
Any Regex Experts out there to help me?
Thanks.
It might be a bit confusing, but you can use the following:
string str = @"[#URL^Url Description^#]";
var regex = new Regex(@"^[^^]+\^([^^]+)\^[^^]+$");
var result = regex.Replace(str, @"$1");
The first ^ means the beginning of the string;
The [^^]+ means anything not a caret character;
The \^ is a literal caret;
The $ is the end of the string.
Basically, it captures all the characters between the carets (^), and $1 inserts them into the replacement, for example in between the <a> tags.
See ideone demo.
You can also replace the last line with this:
var result = regex.Replace(str, "<a href=\"" + link + "\">$1</a>");
Where link is the variable containing the link you want to replace in.
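Note that the pattern above is anchored with ^ and $, so it only matches when the token is the whole string. If the token sits inside a longer text, as the question suggests, an unanchored variant could look like the sketch below. UrlTokenReplacer and ReplaceTokens are illustrative names, and the pattern is my own adaptation rather than the answer's.

```csharp
using System.Text.RegularExpressions;

public static class UrlTokenReplacer
{
    // Matches the literal "[#URL^", captures everything up to the next caret,
    // then requires the literal "^#]". [^^]+ also matches non-Latin text.
    private static readonly Regex TokenPattern =
        new Regex(@"\[#URL\^([^^]+)\^#\]");

    public static string ReplaceTokens(string text)
    {
        // Every token in the text is replaced by its captured description.
        return TokenPattern.Replace(text, "$1");
    }
}
```

For instance, ReplaceTokens("See [#URL^My Site^#] now") should yield "See My Site now", and text without a token is returned unchanged.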
Why don't you use String.Replace()? A regex would work, but it looks like the format is well defined and regexes are harder to read.
string url = "[#URL^blah^#]";
string url_html = url.Replace("[#URL^", "<a href=\"http://www.somewhere.net\">")
.Replace("^#]", "</a>");

Regular expression to remove HTML tags

I am using the following regular expression to remove HTML tags from a string. It works, except that it leaves the closing tag. If I attempt to remove the tags around blah, it leaves the </a>.
I do not know regular expression syntax at all and fumbled through this. Can someone with regex knowledge please provide me with a pattern that will work?
Here is my code:
string sPattern = @"<\/?!?(img|a)[^>]*>";
Regex rgx = new Regex(sPattern);
Match m = rgx.Match(sSummary);
string sResult = "";
if (m.Success)
sResult = rgx.Replace(sSummary, "", 1);
I am looking to remove the first occurrence of the <a> and <img> tags.
Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.
Here is a link to a blog post I wrote awhile back which goes into more details about this problem.
http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
That being said, here's a solution that should fix this particular problem. It in no way is a perfect solution though.
var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if ( m.Success ) {
sResult = m.Groups["content"].Value;
}
To turn this:
'<td>mamma</td><td><strong>papa</strong></td>'
into this:
'mamma papa'
You need to replace the tags with spaces:
.replace(/<[^>]*>/g, ' ')
and reduce any duplicate spaces into single spaces:
.replace(/\s{2,}/g, ' ')
then trim away leading and trailing spaces with:
.trim();
Meaning that your remove-tags function looks like this:
function removeTags(string){
return string.replace(/<[^>]*>/g, ' ')
.replace(/\s{2,}/g, ' ')
.trim();
}
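Since the question is about C#, here is a rough port of the same three steps (tags to spaces, collapse whitespace, trim). TagRemover and RemoveTags are illustrative names, and the usual caveat about regex-based HTML handling still applies.

```csharp
using System.Text.RegularExpressions;

public static class TagRemover
{
    public static string RemoveTags(string html)
    {
        // 1. Replace every tag with a single space.
        string text = Regex.Replace(html, "<[^>]*>", " ");
        // 2. Collapse runs of two or more whitespace characters into one space.
        text = Regex.Replace(text, @"\s{2,}", " ");
        // 3. Trim leading and trailing whitespace.
        return text.Trim();
    }
}
```

With the sample above, RemoveTags("<td>mamma</td><td><strong>papa</strong></td>") should give "mamma papa".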
To also remove the spaces between tags, you can use the following method, which combines a regex with a trim of the spaces at the start and end of the input HTML:
public static string StripHtml(string inputHTML)
{
const string HTML_MARKUP_REGEX_PATTERN = @"<[^>]+>\s+(?=<)|<[^>]+>";
inputHTML = WebUtility.HtmlDecode(inputHTML).Trim();
string noHTML = Regex.Replace(inputHTML, HTML_MARKUP_REGEX_PATTERN, string.Empty);
return noHTML;
}
So for the following input:
<p> <strong> <em><span style="text-decoration:underline;background-color:#cc6600;"></span><span style="text-decoration:underline;background-color:#cc6600;color:#663333;"><del> test text </del></span></em></strong></p><p><strong><span style="background-color:#999900;"> test 1 </span></strong></p><p><strong><em><span style="background-color:#333366;"> test 2 </span></em></strong></p><p><strong><em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p>
The output will be only the text without spaces between html tags or space before or after html:
"   test text   test 1  test 2  test 3 ".
Please notice that the spaces before test text are from the <del> test text </del> html and the space after test 3 is from the <em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p> html.
Strip off HTML Elements (with/without attributes)
/<\/?[\w\s]*>|<.+[\W]>/g
This will strip off all HTML elements and leave behind the text. This works well even for malformed HTML elements (i.e. elements that are missing closing tags)
Reference and example (Ex.10)
So the HTML parser everyone's talking about is Html Agility Pack.
If it is clean XHTML, you can also use System.Xml.Linq.XDocument or System.Xml.XmlDocument.
You can use:
Regex.Replace(source, "<[^>]*>", string.Empty);
If you need to find only the opening tags you can use the following regex, which will capture the tag type as $1 (a or img) and the content (including closing tag if there is one) as $2:
(?:<(a|img)(?:\s[^>]*)?>)((?:(?!<\1)[\s\S])*)
In case you have also closing tag you should use the following regex, which will capture the tag type as $1 (a or img) and the content as $2:
(?:<(a|img)(?:\s[^>]*)?>)\s*((?:(?!<\1)[\s\S])*)\s*(?:<\/\1>)
Basically you just need to use replace function on one of above regex, and return $2 in order to get what you wanted.
Short explanation about the query:
( ) - is used for capturing whatever matches the regex inside the brackets. The order of the capturing is the order of: $1, $2 etc.
?: - is used after an opening bracket "(" for not capturing the content inside the brackets.
\1 - is copying capture number 1, which is the tag type. I had to capture the tag type so closing tag will be consistent to the opening one and not something like: <img src=""> </a>.
\s - is white space, so after opening tag <img there will be at least 1 white space in case there are attributes (so it won't match <imgs> for example).
[^>]* - is looking for anything but the chars inside, which in this case is >, and * means for unlimited times.
?! - is looking for anything but the string inside, kinda similar to [^>] just for string instead of single chars.
[\s\S] - is used almost like . but allow any whitespaces (which will also match in case there are new lines between the tags). If you are using regex "s" flag, then you can use . instead.
Example of using with closing tag:
https://regex101.com/r/MGmzrh/1
Example of using without closing tag:
https://regex101.com/r/MGmzrh/2
Regex101 also has some explanation of what I did :)
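In C#, the closing-tag variant could be applied to just the first occurrence, which is what the question asks for, roughly as follows. FirstTagUnwrapper and UnwrapFirst are hypothetical names; the count argument of Regex.Replace limits the replacement to one match.

```csharp
using System.Text.RegularExpressions;

public static class FirstTagUnwrapper
{
    // Group 1: tag name (a or img), group 2: content between the
    // opening tag and its matching closing tag.
    private static readonly Regex PairedTag = new Regex(
        @"(?:<(a|img)(?:\s[^>]*)?>)\s*((?:(?!<\1)[\s\S])*)\s*(?:<\/\1>)");

    public static string UnwrapFirst(string html)
    {
        // Replace only the first match, keeping the inner content ($2).
        return PairedTag.Replace(html, "$2", 1);
    }
}
```

So UnwrapFirst("Hello <a href=\"x\">blah</a> world") should return "Hello blah world", with both the opening and the closing tag gone.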
You can use already existing libraries to strip off the html tags. One good one being Chilkat C# Library.
If all you're trying to do is remove the tags (and not figure out where the closing tag is), I'm really not sure why people are so fraught about it.
This Regex seems to handle anything I can throw at it:
<([\w\-/]+)( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* *>
To break it down:
<([\w\-/]+) - match the beginning of the opening or closing tag. if you want to handle invalid stuff, you can add more here
( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* - this bit matches attributes [0, N] times (* at the end)
+[\w\-]+ - is space(s) followed by an attribute name
(=(('[^']*')|("[^"]*")))? - not all attributes have assignment (?)
('[^']*')|("[^"]*") - of the attributes that do have assignment, the value is a string with either single or double quotes. It's not allowed to skip over a closing quote to make things work
*> - the whole thing ends with any number of spaces, then the closing bracket
Obviously this will mess up if someone throws super invalid html at it, but it works for anything valid I've come up with yet. Test it out here:
const regex = /<([\w\-/]+)( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* *>/g;
const byId = (id) => document.getElementById(id);
function replace() {
console.log(byId("In").value)
byId("Out").innerText = byId("In").value.replace(regex, "CUT");
}
Write your html here: <br>
<textarea id="In" rows="8" cols="50"></textarea><br>
<button onclick="replace()">Replace all tags with "CUT"</button><br>
<br>
Output:
<div id="Out"></div>
Remove image from the string, using a regular expression in c# (image search performed by image id)
string PRQ = "<td valign=\"top\" style=\"width: 400px;\" align=\"left\"><img id=\"llgo\" src=\"http://test.Logo.png\" alt=\"logo\"></td>";
var regex = new Regex("(<img(.+?)id=\"llgo\"(.+?))src=\"([^\"]+)\"");
PRQ = regex.Replace(PRQ, match => match.Groups[1].Value + "");
Why not try a reluctant quantifier?
htmlString.replaceAll("<\\S*?>", "")
(It's Java but the main thing is to show the idea)
Simple way,
String html = "<a>Rakes</a> <p>paroladasdsadsa</p> My Name Rakes";
html = html.replaceAll("(<[\\w]+>)(.+?)(</[\\w]+>)", "$2");
System.out.println(html);
Here is the extension method I've been using for quite some time.
public static class StringExtensions
{
public static string StripHTML(this string htmlString, string htmlPlaceHolder) {
const string pattern = @"<.*?>";
string sOut = Regex.Replace(htmlString, pattern, htmlPlaceHolder, RegexOptions.Singleline);
sOut = sOut.Replace("&nbsp;", String.Empty);
sOut = sOut.Replace("&amp;", "&");
sOut = sOut.Replace("&gt;", ">");
sOut = sOut.Replace("&lt;", "<");
return sOut;
}
}
This piece of code could help you remove HTML tags easily:
import re
string = str(blah)
replaced_string = re.sub(r'<a.*href="blah">.*<\/a>', '', string)  # remember, sub takes 3 arguments.
Output is an empty string.
Here's an extension method I created using a simple regular expression to remove HTML tags from a string:
/// <summary>
/// Converts an HTML string to plain text, and replaces all br tags with line breaks.
/// </summary>
// Extension methods must be declared inside a static class.
public static string ToPlainText(this string s)
{
s = s.Replace("<br>", "\r\n");
s = s.Replace("<br />", "\r\n");
s = s.Replace("<br/>", "\r\n");
s = Regex.Replace(s, "<[^>]*>", string.Empty);
return s;
}
Hope that helps.
Select everything except what's in there:
(?:<span.*?>|<\/span>|<p.*?>|<\/p>)

Regular Expressions: How to escape the "(" meta char in c#

I am scraping a year value from the innerhtml of a span and the value is in brackets like this:
<span class="year_type">(2009)</span><br>
I want to get the value of the year without the brackets but am getting some compiler errors when trying to escape the "(" char.
My pattern:
const string yearPattern = "<span class=\"year_type\">\((?<year>.*?)\)</span>";
Complete Code:
const string yearPattern = "<span class=\"year_type\">\((?<year>.*?)\)</span>";
var regex = new Regex(yearPattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);
Match match = regex.Match(data);
return match.Groups["year"].Value;
What is the best way to escape the ()
Thanks
Use two backslashes:
const string yearPattern = "<span class=\"year_type\">\\((?<year>.*?)\\)</span>";
or the @ verbatim string literal:
const string yearPattern = @"<span class=""year_type"">\((?<year>.*?)\)</span>";
Note: in your original regex you were missing an open-paren.
Prepare to get rocked for parsing HTML with a Regex...
That being said, you just need the @ in front of your pattern definition (or double your escapes \\).
const string yearPattern = @"<span class=""year_type"">\((?<year>.*?)\)</span>";
I would consider using a character class for this, e.g. [(] and [)], but using a double-backslash, e.g. \\( and \\) (one \ is for C# and the other one for the regex) is equivalently heavy syntax. So it's a matter of taste.
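To compare the three spellings side by side, here is a small sketch. YearExtractor and its constant names are made up for illustration; all three constants describe the same regex and should behave identically.

```csharp
using System.Text.RegularExpressions;

public static class YearExtractor
{
    // Regular string: each regex backslash is doubled for the C# compiler.
    public const string Escaped = "<span class=\"year_type\">\\((?<year>.*?)\\)</span>";
    // Verbatim string: backslashes are literal, quotes are doubled instead.
    public const string Verbatim = @"<span class=""year_type"">\((?<year>.*?)\)</span>";
    // Character classes: no backslash needed for the parentheses at all.
    public const string CharClass = "<span class=\"year_type\">[(](?<year>.*?)[)]</span>";

    public static string Extract(string data, string pattern)
    {
        Match match = Regex.Match(data, pattern,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        // Return an empty string when the span is not found.
        return match.Success ? match.Groups["year"].Value : string.Empty;
    }
}
```

Each of the three patterns, applied to the span from the question, should return 2009 without the brackets.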
