I haven't used regex much before but found something useful on the net that I'm using:
private string ConvertUrlsToLinks(string msg)
{
string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])";
Regex r = new Regex(regex, RegexOptions.IgnoreCase);
return r.Replace(msg, "<a href=\"$1\">$1</a>").Replace("href=\"www", "href=\"http://www").Replace(@"\r\n", "<br />").Replace(@"\n", "<br />").Replace(@"\r", "<br />");
}
It does a good job, but now I want it to exclude URLs that already have an "a href=" in front of them. There's the closing "/a" to consider too.
Can that be done with regex, or do I have to use a totally different approach, such as writing code for it?
Try this:
((?<!href=')(?<!href=")(www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])
I tested on regex101.com
With the following sample set:
www.google.com
http://hi.com
http://www.fishy.com
href='www.ignore.com'
www.ouch.com
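As a rough way to check the lookbehind idea against that sample set, here is a minimal sketch. The pattern is a simplified stand-in for the one in the question, and LinkifyDemo, Linkify and the second lookbehind are illustrative assumptions, not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkifyDemo
{
    // Simplified URL pattern (an assumption for illustration, not the full
    // pattern from the question). The first lookbehind skips URLs that directly
    // follow href=' or href="; the second stops a match from starting in the
    // middle of a URL (for example at "www" in "http://www...").
    private static readonly Regex Url = new Regex(
        @"(?<!href=['""])(?<![\w/.])((?:https?://|www\.)[\w.-]+\.[a-z]{2,}[^\s<>'""]*)",
        RegexOptions.IgnoreCase);

    public static string Linkify(string msg)
    {
        return Url.Replace(msg, m =>
        {
            string url = m.Groups[1].Value;
            // Bare "www." hosts need a scheme to work as an href.
            string href = url.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + url
                : url;
            return "<a href=\"" + href + "\">" + url + "</a>";
        });
    }
}
```

Calling LinkifyDemo.Linkify on the sample set should wrap every line except href='www.ignore.com'. It does not handle the link text between <a ...> and </a>, which is what the placeholder approach further down deals with.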
Using your existing regex pattern you could make a few simple changes to handle additional text being prepended or appended to your string:
`.+` <- pattern -> `(.+)?`
Which would give you:
.+((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])(.+)?
So passing the string of either:
<a href='http://www.test.com'>http://www.test.com</a>
...or...
http://www.test.com
Would result in:
www.test.com
Examples:
https://regex101.com/r/bO0cW6/1
http://ideone.com/suVw3I
I think it would be a little ToNy tHe pOny to do that in regex after all, so I wrote the code. In case anyone is interested, here it is:
private string handleatag(string msg, string tagbegin, string tagend)
{
ArrayList tags = new ArrayList();
int tagbeginpos = msg.IndexOf(tagbegin);
int tagendpos;
string hash = tagbegin.GetHashCode().ToString();
while (tagbeginpos != -1)
{
tagendpos = msg.IndexOf(tagend, tagbeginpos);
if (tagendpos != -1)
{
string atag = msg.Substring(tagbeginpos, tagendpos - tagbeginpos + tagend.Length);
msg = msg.Replace(atag, hash + tags.Count.ToString());
tags.Add(atag);
}
else
msg = msg.Remove(tagbeginpos, tagbegin.Length);
tagbeginpos = msg.IndexOf(tagbegin, tagbeginpos);
}
msg = ConvertUrlsToLinks(msg);
for (int i = 0; i < tags.Count; i++)
msg = msg.Replace(hash + i.ToString(), tags[i].ToString());
return msg;
}
private string ConvertUrlsToLinks(string msg)
{
if (msg.IndexOf("<a href=") != -1)
return handleatag(msg, "<a href=", "</a>");
string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])";
Regex r = new Regex(regex, RegexOptions.IgnoreCase);
return r.Replace(msg, "<a href=\"$1\">$1</a>").Replace("href=\"www", "href=\"http://www").Replace(@"\r\n", "<br />").Replace(@"\n", "<br />").Replace(@"\r", "<br />");
}
I have a string like this:
<div class="fsxl fwb">Myname<br />
So how do I get the string Myname?
Here is my code:
public string name(string link)
{
WebClient client = new WebClient();
string htmlCode = client.DownloadString(link);
var output = htmlCode.Split("<div class="fsxl fwb">","<br />");
return output.ToString();
}
But the problem is that "<div class="fsxl fwb">" gets split into two strings, "<div class=" and ">", with fsxl fwb left outside the quotes. How do I fix it?
Here is a quick fix to your code:
var output = htmlCode.Split(
new [] { "<div class=\"fsxl fwb\">", "<br />"},
StringSplitOptions.RemoveEmptyEntries);
return output[0];
It escapes the quotes correctly and uses a valid overload of the Split method.
You can solve this by parsing the HTML, which is often the best option.
A quick solution would be to use regex to get the string out. This one will do:
<div class="fsxl fwb">(.*?)<br \/>
It will capture the input between the div and the first following <br />.
This will be the C# code to get the answer:
string s = Regex.Replace
( "<div class=\"fsxl fwb\">Myname<br />"
, "(.*)<div class=\"fsxl fwb\">(.*?)<br \\/>(.*)"
, "$2"
);
Console.WriteLine(s);
Using regular expressions:
public string name(string link)
{
WebClient client = new WebClient();
string htmlCode = client.DownloadString(link);
Regex regex = new Regex("<div class=\"fsxl fwb\">(.*)<br />");
Match match = regex.Match(htmlCode);
string output = match.Groups[1].ToString();
return output;
}
var a = @"<div class='fsxl fwb'>Myname<br />";
var b = Regex.Match(a, "(?<=>)(.*)(?=<)");
Console.WriteLine(b.Value);
Code based on: C# Get string between 2 HTML-tags
If you want to avoid regex, you can use this extension method to grab the text between two other strings:
public static string ExtractBetween(this string str, string startTag, string endTag, bool inclusive)
{
string rtn = null;
var s = str.IndexOf(startTag);
if (s >= 0)
{
if (!inclusive)
{
s += startTag.Length;
}
var e = str.IndexOf(endTag, s);
if (e > s)
{
if (inclusive)
{
e += endTag.Length; // extend the range to cover the end tag
}
rtn = str.Substring(s, e - s);
}
}
return rtn;
}
Example usage (note that you need to escape the quotes in your string):
var s = "<div class=\"fsxl fwb\">Myname<br />";
var r = s.ExtractBetween("<div class=\"fsxl fwb\">", "<br />", false);
Console.WriteLine(r);
I am trying to highlight search terms in display results. Generally it works OK based on code found here on SO. My issue with it is that it replaces the substring with the search term, i.e. in this example it will replace "LOVE" with "love" (unacceptable). So I was thinking I probably want to find the index of the start of the substring, do an INSERT of the opening <span> tag, and do similar at the end of the substring. As yafs may be quite long I'm also thinking I need to integrate stringbuilder into this. Is this do-able, or is there a better way? As always, thank you in advance for your suggestions.
string yafs = "Looking for LOVE in all the wrong places...";
string searchTerm = "love";
yafs = yafs.ReplaceInsensitive(searchTerm, "<span style='background-color: #FFFF00'>"
+ searchTerm + "</span>");
How about this:
public static string ReplaceInsensitive(string yafs, string searchTerm) {
return Regex.Replace(yafs, "(" + searchTerm + ")", "<span style='background-color: #FFFF00'>$1</span>", RegexOptions.IgnoreCase);
}
Update: escape the search term so that any regex metacharacters in it are matched literally:
public static string ReplaceInsensitive(string yafs, string searchTerm) {
return Regex.Replace(yafs,
"(" + Regex.Escape(searchTerm) + ")",
"<span style='background-color: #FFFF00'>$1</span>",
RegexOptions.IgnoreCase);
}
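To see why the $1 back-reference and Regex.Escape both matter, here is a small self-contained sketch. The class and method names (HighlightDemo, Highlight) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HighlightDemo
{
    public static string Highlight(string text, string searchTerm)
    {
        // Escape the term so characters such as '+' or '.' are matched literally.
        string pattern = "(" + Regex.Escape(searchTerm) + ")";
        // $1 re-inserts the text exactly as it appeared, so "LOVE" stays "LOVE".
        return Regex.Replace(text, pattern,
            "<span style='background-color: #FFFF00'>$1</span>",
            RegexOptions.IgnoreCase);
    }
}
```

With this, Highlight("Looking for LOVE in all the wrong places...", "love") keeps the original "LOVE" inside the span, and a term like "c++" is matched as plain text instead of being read as a quantifier.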
Check this code
private static string ReplaceInsensitive(string text, string oldtext,string newtext)
{
int indexof = text.IndexOf(oldtext,0,StringComparison.InvariantCultureIgnoreCase);
while (indexof != -1)
{
text = text.Remove(indexof, oldtext.Length);
text = text.Insert(indexof, newtext);
indexof = text.IndexOf(oldtext, indexof + newtext.Length ,StringComparison.InvariantCultureIgnoreCase);
}
return text;
}
Does what you need:
static void Main(string[] args)
{
string yafs = "Looking for LOVE in all the wrong love places...";
string searchTerm = "LOVE";
Console.Write(ReplaceInsensitive(yafs, searchTerm));
Console.Read();
}
private static string ReplaceInsensitive(string yafs, string searchTerm)
{
StringBuilder sb = new StringBuilder();
foreach (string word in yafs.Split(' '))
{
string tempStr = word;
if (word.ToUpper() == searchTerm.ToUpper())
{
tempStr = word.Insert(0, "<span style='background-color: #FFFF00'>");
int len = tempStr.Length;
tempStr = tempStr.Insert(len, "</span>");
}
sb.AppendFormat("{0} ", tempStr);
}
return sb.ToString();
}
Gives:
Looking for <span style='background-color: #FFFF00'>LOVE</span> in all the wrong <span style='background-color: #FFFF00'>love</span> places...
How can I remove all the HTML tags, including &nbsp;, using regex in C#? My string looks like:
"<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div> </div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"
If you can't use an HTML parser oriented solution to filter out the tags, here's a simple regex for it.
string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();
Ideally, you should make another pass through a regex filter that takes care of multiple spaces:
string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
I took @Ravi Thapliyal's code and made a method. It is simple and might not clean everything, but so far it is doing what I need it to do.
public static string ScrubHtml(string value) {
var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
var step2 = Regex.Replace(step1, @"\s{2,}", " ");
return step2;
}
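A small variation on the same two-step idea, as a sketch only: it swaps each tag or &nbsp; for a space instead of an empty string, so text from adjacent elements does not run together. HtmlTextDemo and StripHtml are hypothetical names, and this is still a regex filter, not an HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlTextDemo
{
    public static string StripHtml(string html)
    {
        // Replace each tag or &nbsp; entity with a space (not an empty string)
        // so that text from adjacent elements does not run together.
        string spaced = Regex.Replace(html, @"<[^>]+>|&nbsp;", " ");
        // Collapse runs of whitespace into one space and trim the ends.
        return Regex.Replace(spaced, @"\s+", " ").Trim();
    }
}
```

For the string in the question this gives hello, and "<p>one</p><p>two&nbsp;three</p>" becomes one two three rather than onetwothree.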
I've been using this function for a while. Removes pretty much any messy html you can throw at it and leaves the text intact.
private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);
//add characters that should not be removed to this regex
private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&##.:/\\?=|%!() -]", RegexOptions.Compiled);
public static String UnHtml(String html)
{
html = HttpUtility.UrlDecode(html);
html = HttpUtility.HtmlDecode(html);
html = RemoveTag(html, "<!--", "-->");
html = RemoveTag(html, "<script", "</script>");
html = RemoveTag(html, "<style", "</style>");
//replace matches of these regexes with space
html = _tags_.Replace(html, " ");
html = _notOkCharacter_.Replace(html, " ");
html = SingleSpacedTrim(html);
return html;
}
private static String RemoveTag(String html, String startTag, String endTag)
{
Boolean bAgain;
do
{
bAgain = false;
Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
if (startTagPos < 0)
continue;
Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
if (endTagPos <= startTagPos)
continue;
html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
bAgain = true;
} while (bAgain);
return html;
}
private static String SingleSpacedTrim(String inString)
{
StringBuilder sb = new StringBuilder();
Boolean inBlanks = false;
foreach (Char c in inString)
{
switch (c)
{
case '\r':
case '\n':
case '\t':
case ' ':
if (!inBlanks)
{
inBlanks = true;
sb.Append(' ');
}
continue;
default:
inBlanks = false;
sb.Append(c);
break;
}
}
return sb.ToString().Trim();
}
var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
I have used @RaviThapliyal's and @Don Rolling's code with a small modification. They replace &nbsp; with an empty string, but &nbsp; should be replaced with a space, so I added an additional step. It worked like a charm for me.
public static string FormatString(string value) {
var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();
var step2 = Regex.Replace(step1, @"&nbsp;", " ");
var step3 = Regex.Replace(step2, @"\s{2,}", " ");
return step3;
}
Match the full entity, including the semicolon; otherwise a stray ; is left behind.
This:
(<.+?>|&nbsp;)
will match any tag or &nbsp;
string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();
Then x = hello
Sanitizing an HTML document involves a lot of tricky things. This package may be of help:
https://github.com/mganss/HtmlSanitizer
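Well-formed HTML (XHTML) is essentially XML. In that case you can load your text into an XmlDocument object and read InnerText on the root element to extract the text. This strips all HTML tags in any form and also decodes entities like &lt; in one go. Note that LoadXml throws on markup that is not well-formed XML, such as an unclosed <br>, and on entities that XML does not define, such as &nbsp;.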
I'm using this syntax to remove HTML tags along with &nbsp;:
SessionTitle: result[i].sessionTitle.replace(/<[^>]+>|&nbsp;/g, '')
(<([^>]+)>|&nbsp;)
You can test it here:
https://regex101.com/r/kB0rQ4/1
I am currently trying to do a Regex Replace on a JSON string that looks like:
String input = "{\"`####`Answer_Options11\": \"monkey22\",\"`####`Answer_Options\": \"monkey\",\"Answer_Options2\": \"not a monkey\"}";
The goal is to find and replace all the value fields whose key field starts with `####`
I currently have this:
static Regex _encryptedFieldRegex = new Regex(@"`####`\w+" + ".:.\"(.*)\",");
static public string MatchKey(string input)
{
MatchCollection match = _encryptedFieldRegex.Matches(input.ToLower());
string match2 = "";
foreach (Match k in match )
{
foreach (Capture cap in k.Captures)
{
Console.WriteLine("" + cap.Value);
match2 = Regex.Replace(input.ToLower(), cap.Value.ToString(), @"CAKE");
}
}
return match2.ToString();
}
Now this isn't working, naturally, since it picks up the entire `####`Answer_Options11\": \"monkey22\",\"`####`Answer_Options\": \"monkey\", as a match and replaces it. I want to replace just match.Groups[1], like you would for a single match on the string.
At the end of the day the JSON string needs to look something like this:
String input = "{\"`####`Answer_Options11\": \"CATS AND CAKE\",\"`####`Answer_Options\": \"CAKE WAS A LIE\",\"Answer_Options2\": \"not a monkey\"}";
Any idea how to do this?
You want a positive lookahead and a positive lookbehind:
(?<=####.+?:).*?(?=,)
The lookahead and lookbehind verify that the surrounding text matches those patterns, but do not include it in the match. This site explains the concept pretty well.
Generated code from RegexHero.com :
string strRegex = @"(?<=####.+?:).*?(?=,)";
Regex myRegex = new Regex(strRegex);
string strTargetString = @" ""{\""`####`Answer_Options11\"": \""monkey22\"",\""`####`Answer_Options\"": \""monkey\"",\""Answer_Options2\"": \""not a monkey\""}""";
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
// Add your code here
}
}
this will match "monkey22" and "monkey" but not "not a monkey"
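If the aim is to replace only the value and leave the key untouched, another option is to capture the key and the value in separate groups and rebuild the match with a MatchEvaluator. The following is a minimal sketch under some assumptions: values contain no escaped quotes, and the names JsonValueReplaceDemo and ReplaceMarkedValues are invented for this example. For anything beyond simple strings, a JSON parser would be more robust.

```csharp
using System;
using System.Text.RegularExpressions;

public static class JsonValueReplaceDemo
{
    // Group 1: a quoted key starting with the `####` marker, the colon and the
    // opening quote of the value. Group 2: the value text up to the next quote.
    // Assumes values contain no escaped quotes.
    private static readonly Regex MarkedField = new Regex(
        @"(""`####`\w+""\s*:\s*"")([^""]*)");

    public static string ReplaceMarkedValues(string json, Func<string, string> transform)
    {
        // Keep group 1 as it was and swap only the value (group 2).
        return MarkedField.Replace(json,
            m => m.Groups[1].Value + transform(m.Groups[2].Value));
    }
}
```

Calling ReplaceMarkedValues(input, v => "CAKE") on the string from the question changes "monkey22" and "monkey" but leaves "not a monkey" alone, because its key has no marker.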
Working from @Jonesy's answer, I got to this, which works for what I wanted. It includes the .Replace on the groups that I required. The lookaheads and lookbehinds were very interesting, but I needed to replace some of those values, hence the groups.
static public string MatchKey(string input)
{
string strRegex = @"(__u__)(.+?:\s*)""(.*)""(,|})*";
Regex myRegex = new Regex(strRegex, RegexOptions.IgnoreCase | RegexOptions.Multiline);
IQS_Encryption.Encryption enc = new Encryption();
int count = 1;
string addedJson = "";
int matchCount = 0;
foreach (Match myMatch in myRegex.Matches(input))
{
if (myMatch.Success)
{
//Console.WriteLine("REGEX MYMATCH: " + myMatch.Value);
input = input.Replace(myMatch.Value, "__e__" + myMatch.Groups[2].Value + "\"c" + count + "\"" + myMatch.Groups[4].Value);
addedJson += "c"+count + "{" +enc.EncryptString(myMatch.Groups[3].Value, Encoding.UTF8.GetBytes("12345678912365478912365478965412"))+"},";
}
count++;
matchCount++;
}
Console.WriteLine("MAC" + matchCount);
return input + addedJson;
}
Thanks again to @Jonesy for the huge help.
I need to use a string as the path for a file, but sometimes there are forbidden characters in this string and I must replace them. For example, my string _title is rumbaton jonathan \"racko\" contreras.
So I need to remove the characters \ and ".
I tried this but it doesn't work:
_title.Replace(@"/", "");
_title.Replace(@"\", "");
_title.Replace(@"*", "");
_title.Replace(@"?", "");
_title.Replace(@"<", "");
_title.Replace(@">", "");
_title.Replace(@"|", "");
Since strings are immutable, the Replace method returns a new string; it doesn't modify the instance you are calling it on. So try this:
_title = _title
.Replace(@"/", "")
.Replace(@"""", "")
.Replace(@"*", "")
.Replace(@"?", "")
.Replace(@"<", "")
.Replace(@">", "")
.Replace(@"|", "");
Also if you want to replace " make sure you have properly escaped it.
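To make the immutability point concrete, here is a minimal sketch that assigns each result back to the variable. The character list is hard-coded from the ones mentioned in the question, and TitleCleaner and StripForbidden are illustrative names only.

```csharp
using System;

public static class TitleCleaner
{
    // Hard-coded for illustration; taken from the characters listed in the question.
    private static readonly char[] Forbidden = { '/', '\\', '"', '*', '?', '<', '>', '|' };

    public static string StripForbidden(string title)
    {
        foreach (char c in Forbidden)
        {
            // Replace returns a new string, so the result must be assigned back.
            title = title.Replace(c.ToString(), string.Empty);
        }
        return title;
    }
}
```

StripForbidden("rumbaton jonathan \"racko\" contreras") returns rumbaton jonathan racko contreras, whereas calling title.Replace(...) on its own line and ignoring the return value leaves title unchanged.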
Try regex
string illegal = "\"M\"\\a/ry/ h**ad:>> a\\/:*?\"| li*tt|le|| la\"mb.?";
string regexSearch = new string(Path.GetInvalidFileNameChars()) + new string(Path.GetInvalidPathChars());
Regex r = new Regex(string.Format("[{0}]", Regex.Escape(regexSearch)));
illegal = r.Replace(illegal, "");
Before: "M"\a/ry/ h**ad:>> a\/:*?"| li*tt|le|| la"mb.?
After: Mary had a little lamb.
Another answer from the same post is much cleaner:
private static string CleanFileName(string fileName)
{
return Path.GetInvalidFileNameChars().Aggregate(fileName, (current, c) => current.Replace(c.ToString(), string.Empty));
}
from How to remove illegal characters from path and filenames?
Or you could try this (probably terribly inefficient) method:
string inputString = @"File ~!@#$%^&*()_+|`1234567890-=\[];',./{}:""<>? name";
var badchars = Path.GetInvalidFileNameChars();
foreach (var c in badchars)
inputString = inputString.Replace(c.ToString(), "");
The result will be:
File ~!@#$%^&()_+`1234567890-=[];',.{} name
But feel free to add more chars to the badchars before running the foreach loop on them.
See http://msdn.microsoft.com/cs-cz/library/fk49wtc1.aspx:
Returns a string that is equivalent to the current string except that all instances of oldValue are replaced with newValue.
I have written a method to do the exact operation that you want and with much cleaner code.
The method
public static string Delete(this string target, string samples) {
if (string.IsNullOrEmpty(target) || string.IsNullOrEmpty(samples))
return target;
var tar = target.ToCharArray();
const char deletechar = '♣'; // a char that is unlikely ever to appear in the input
for (var i = 0; i < tar.Length; i++) {
for (var j = 0; j < samples.Length; j++) {
if (tar[i] == samples[j]) {
tar[i] = deletechar;
break;
}
}
}
return new string(tar).Replace(deletechar.ToString(CultureInfo.InvariantCulture), string.Empty);
}
Sample
var input = "rumbaton jonathan \"racko\" contreras";
var cleaned = input.Delete("\"\\/*?><|");
Will result in:
rumbaton jonathan racko contreras
OK! I've solved my issue thanks to all your suggestions. Here is my fix:
string newFileName = _artist + " - " + _title;
char[] invalidFileChars = Path.GetInvalidFileNameChars();
char[] invalidPathChars = Path.GetInvalidPathChars();
foreach (char invalidChar in invalidFileChars)
{
newFileName = newFileName.Replace(invalidChar.ToString(), string.Empty);
}
foreach (char invalidChar in invalidPathChars)
{
newFilePath = newFilePath.Replace(invalidChar.ToString(), string.Empty);
}
Thank you so much, everybody :)