I have a string like this:
<div class="fsxl fwb">Myname<br />
How can I get the string Myname?
Here is my code:
public string name(string link)
{
WebClient client = new WebClient();
string htmlCode = client.DownloadString(link);
var output = htmlCode.Split("<div class="fsxl fwb">","<br />");
return output.ToString();
}
The problem is that "<div class="fsxl fwb">" is not one string literal: the inner quotes end it early, so it becomes the two strings "<div class=" and ">" with fsxl fwb left in between. How can I fix it?
Here is a quick fix to your code:
var output = htmlCode.Split(
new [] { "<div class=\"fsxl fwb\">", "<br />"},
StringSplitOptions.RemoveEmptyEntries);
return output[0];
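It escapes the quotes correctly and uses a valid overload of the Split method. Note that output[0] is Myname only when the input starts with the div tag; on a full page it is whatever precedes the first separator.
You can solve this by parsing the HTML, which is often the best option.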
A quick solution would be to use regex to get the string out. This one will do:
<div class="fsxl fwb">(.*?)<br \/>
It will capture the input between the div and the first following <br />.
This will be the C# code to get the answer:
// Match the pattern and read capture group 1 (the text between the tags)
string s = Regex.Match
( "<div class=\"fsxl fwb\">Myname<br />"
, "<div class=\"fsxl fwb\">(.*?)<br \\/>"
).Groups[1].Value;
Console.WriteLine(s);
Using regular expressions:
public string name(string link)
{
WebClient client = new WebClient();
string htmlCode = client.DownloadString(link);
// Lazy (.*?) stops at the first <br /> instead of the last one on the line
Regex regex = new Regex("<div class=\"fsxl fwb\">(.*?)<br />");
Match match = regex.Match(htmlCode);
string output = match.Groups[1].ToString();
return output;
}
var a = @"<div class='fsxl fwb'>Myname<br />";
var b = Regex.Match(a, "(?<=>)(.*)(?=<)");
Console.WriteLine(b.Value);
Code based on: C# Get string between 2 HTML-tags
If you want to avoid regex, you can use this extension method to grab the text between two other strings:
public static string ExtractBetween(this string str, string startTag, string endTag, bool inclusive)
{
string rtn = null;
var s = str.IndexOf(startTag);
if (s >= 0)
{
if (!inclusive)
{
s += startTag.Length;
}
var e = str.IndexOf(endTag, s);
if (e > s)
{
if (inclusive)
{
e += endTag.Length; // include the whole end tag in the result
}
rtn = str.Substring(s, e - s);
}
}
return rtn;
}
Example usage (note that the quotes inside your string need to be escaped):
var s = "<div class=\"fsxl fwb\">Myname<br />";
var r = s.ExtractBetween("<div class=\"fsxl fwb\">", "<br />", false);
Console.WriteLine(r);
I haven't used regex much before but found something useful on the net that I'm using:
private string ConvertUrlsToLinks(string msg)
{
string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])";
Regex r = new Regex(regex, RegexOptions.IgnoreCase);
return r.Replace(msg, "$1").Replace("href=\"www", "href=\"http://www").Replace(@"\r\n", "<br />").Replace(@"\n", "<br />").Replace(@"\r", "<br />");
}
It does a good job, but now I want it to skip URLs that already have an "a href=" in front of them. There is the closing "/a" to consider too.
Can that be done with regex, or do I have to use a totally different approach, like writing code?
Try this:
((?<!href=')(?<!href=")(www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])
I tested on regex101.com
With the following sample set:
www.google.com
http://hi.com
http://www.fishy.com
href='www.ignore.com'
www.ouch.com
Using your existing regex pattern you could make a few simple changes to handle additional text being prepended or appended to your string:
`.+` <- pattern -> `(.+)?`
Which would give you:
.+((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])(.+)?
So passing the string of either:
<a href='http://www.test.com'>http://www.test.com</a>
...or...
http://www.test.com
Would result in:
www.test.com
Examples:
https://regex101.com/r/bO0cW6/1
http://ideone.com/suVw3I
I think it would be a little ToNy tHe pOny to do that in regex after all, so I wrote the code instead. In case anyone is interested, here it is:
private string handleatag(string msg, string tagbegin, string tagend)
{
ArrayList tags = new ArrayList();
int tagbeginpos = msg.IndexOf(tagbegin);
int tagendpos;
string hash = tagbegin.GetHashCode().ToString();
while (tagbeginpos != -1)
{
tagendpos = msg.IndexOf(tagend, tagbeginpos);
if (tagendpos != -1)
{
string atag = msg.Substring(tagbeginpos, tagendpos - tagbeginpos + tagend.Length);
msg = msg.Replace(atag, hash + tags.Count.ToString());
tags.Add(atag);
}
else
msg = msg.Remove(tagbeginpos, tagbegin.Length);
tagbeginpos = msg.IndexOf(tagbegin, tagbeginpos);
}
msg = ConvertUrlsToLinks(msg);
for (int i = 0; i < tags.Count; i++)
msg = msg.Replace(hash + i.ToString(), tags[i].ToString());
return msg;
}
private string ConvertUrlsToLinks(string msg)
{
if (msg.IndexOf("<a href=") != -1)
return handleatag(msg, "<a href=", "</a>");
string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])";
Regex r = new Regex(regex, RegexOptions.IgnoreCase);
return r.Replace(msg, "$1").Replace("href=\"www", "href=\"http://www").Replace(@"\r\n", "<br />").Replace(@"\n", "<br />").Replace(@"\r", "<br />");
}
I'm trying to replace text in a string in C# with the Regex class, but I don't know how to use the class properly.
I want to replace every occurrence of the following sequence in the string "a":
":(one space)(one or more characters)(one space)"
with this sequence:
":(two spaces)(one or more characters)(three spaces)"
Can anyone help me with the code and explain the regular expression used?
You can use string.Replace(string, string).
Try this one:
http://msdn.microsoft.com/en-us/library/fk49wtc1.aspx
try this one
private String StrReplace(String Str)
{
String Output = string.Empty;
String re1 = "(:)( )((?:[a-z][a-z]+))( )";
Regex r = new Regex(re1, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Match m = r.Match(Str);
if (m.Success)
{
String c1 = m.Groups[1].ToString();
String ws1 = m.Groups[2].ToString() + " ";
String word1 = m.Groups[3].ToString();
String ws2 = m.Groups[4].ToString() + "  "; // one space becomes three
Output = c1 + ws1 + word1 + ws2;
Output = Regex.Replace(Str, re1, Output);
}
return Output;
}
Using String.Replace
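var str = "Test string with : .*. to replace";
// String.Replace only swaps one fixed piece of text for another
var newstr = str.Replace(": .*. ", ":  .*.   ");
Using Regex.Replace
// (.+?) captures the characters between the two spaces; $1 puts them back
var newstr = Regex.Replace(str, ": (.+?) ", ":  $1   ");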
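Since the question also asks for an explanation of the regular expression, here is a minimal sketch of the capture-group approach. The class name SpacingSketch and the method name Widen are made up for illustration, and it assumes "one or more characters" means one or more non-space characters.

```csharp
using System;
using System.Text.RegularExpressions;

static class SpacingSketch
{
    // ":"    a literal colon
    // " "    exactly one space
    // (\S+)  one or more non-space characters, captured as group 1 (assumption)
    // " "    exactly one space
    private static readonly Regex Pattern = new Regex(@": (\S+) ");

    public static string Widen(string input)
    {
        // $1 re-inserts the captured characters between two leading
        // spaces and three trailing spaces.
        return Pattern.Replace(input, ":  $1   ");
    }

    static void Main()
    {
        Console.WriteLine(Widen("key: value end")); // key:  value   end
    }
}
```

Unlike a single Match followed by a replace, this rewrites every occurrence with its own captured text.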
How can I remove all the HTML tags, including &nbsp;, using regex in C#? My string looks like this:
"<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"
If you can't use an HTML parser oriented solution to filter out the tags, here's a simple regex for it.
string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();
You should ideally make another pass through a regex filter that collapses multiple spaces:
string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
I took @Ravi Thapliyal's code and made a method. It is simple and might not clean everything, but so far it is doing what I need it to do.
public static string ScrubHtml(string value) {
var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
var step2 = Regex.Replace(step1, @"\s{2,}", " ");
return step2;
}
I've been using this function for a while. Removes pretty much any messy html you can throw at it and leaves the text intact.
private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);
//add characters that should not be removed to this regex
private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);
public static String UnHtml(String html)
{
html = HttpUtility.UrlDecode(html);
html = HttpUtility.HtmlDecode(html);
html = RemoveTag(html, "<!--", "-->");
html = RemoveTag(html, "<script", "</script>");
html = RemoveTag(html, "<style", "</style>");
//replace matches of these regexes with space
html = _tags_.Replace(html, " ");
html = _notOkCharacter_.Replace(html, " ");
html = SingleSpacedTrim(html);
return html;
}
private static String RemoveTag(String html, String startTag, String endTag)
{
Boolean bAgain;
do
{
bAgain = false;
Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
if (startTagPos < 0)
continue;
Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
if (endTagPos <= startTagPos)
continue;
html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
bAgain = true;
} while (bAgain);
return html;
}
private static String SingleSpacedTrim(String inString)
{
StringBuilder sb = new StringBuilder();
Boolean inBlanks = false;
foreach (Char c in inString)
{
switch (c)
{
case '\r':
case '\n':
case '\t':
case ' ':
if (!inBlanks)
{
inBlanks = true;
sb.Append(' ');
}
continue;
default:
inBlanks = false;
sb.Append(c);
break;
}
}
return sb.ToString().Trim();
}
var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
I used @RaviThapliyal's and @Don Rolling's code with a small modification. Their version replaces &nbsp; with an empty string, but &nbsp; should be replaced with a space, so I added an extra step. It worked like a charm for me.
public static string FormatString(string value) {
var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();
var step2 = Regex.Replace(step1, @"&nbsp;", " ");
var step3 = Regex.Replace(step2, @"\s{2,}", " ");
return step3;
}
(The original post wrote &nbsp without the semicolon only because Stack Overflow was rendering the entity.)
This pattern:
(<.+?>|&nbsp;)
will match any tag or &nbsp;.
string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();
Then x = hello
Sanitizing an HTML document involves a lot of tricky things. This package may be of help:
https://github.com/mganss/HtmlSanitizer
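HTML that happens to be well-formed (XHTML-style) can be treated as XML. In that case you can load your text into an XmlDocument object and read InnerText on the root element to extract the text. This strips all tags in any form and also decodes special characters like &lt; in one go. It does not work on arbitrary HTML: an unclosed <br> or an entity such as &nbsp; makes the XML parser throw.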
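A minimal sketch of the XmlDocument idea, assuming the input is well-formed XML (every tag closed, no HTML-only entities such as &nbsp;). The class name XmlTextSketch, the method name ExtractText, and the wrapping root element are illustrative choices, not part of any answer above.

```csharp
using System;
using System.Xml;

static class XmlTextSketch
{
    public static string ExtractText(string wellFormedMarkup)
    {
        var doc = new XmlDocument();
        // Wrap the fragment so that several sibling <div> elements
        // still form a document with a single root element.
        doc.LoadXml("<root>" + wellFormedMarkup + "</root>");
        // InnerText concatenates all text nodes and decodes XML entities.
        return doc.DocumentElement.InnerText;
    }

    static void Main()
    {
        Console.WriteLine(ExtractText("<div>a &lt; b</div><div><br /></div>")); // a < b
    }
}
```

For the string in the question this would throw an XmlException, because its <br> tags are not closed and &nbsp; is not a predefined XML entity. An HTML parser is the safer choice for input like that.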
I'm using this syntax (JavaScript) to remove HTML tags along with &nbsp;:
SessionTitle: result[i].sessionTitle.replace(/<[^>]+>|&nbsp;/g, '')
(<([^>]+)>|&nbsp;)
You can test it here:
https://regex101.com/r/kB0rQ4/1
I need to use a string as the path for a file, but sometimes the string contains forbidden characters that I must replace. For example, my string _title is rumbaton jonathan \"racko\" contreras.
So I need to replace the characters \ and ".
I tried this, but it doesn't work:
_title.Replace(@"/", "");
_title.Replace(@"\", "");
_title.Replace(@"*", "");
_title.Replace(@"?", "");
_title.Replace(@"<", "");
_title.Replace(@">", "");
_title.Replace(@"|", "");
Since strings are immutable, the Replace method returns a new string; it doesn't modify the instance you call it on. So try this:
_title = _title
.Replace(@"/", "")
.Replace(@"""", "")
.Replace(@"*", "")
.Replace(@"?", "")
.Replace(@"<", "")
.Replace(@">", "")
.Replace(@"|", "");
Also, if you want to replace ", make sure you have escaped it properly.
Try regex
string illegal = "\"M\"\\a/ry/ h**ad:>> a\\/:*?\"| li*tt|le|| la\"mb.?";
string regexSearch = new string(Path.GetInvalidFileNameChars()) + new string(Path.GetInvalidPathChars());
Regex r = new Regex(string.Format("[{0}]", Regex.Escape(regexSearch)));
illegal = r.Replace(illegal, "");
Before: "M"\a/ry/ h**ad:>> a\/:*?"| li*tt|le|| la"mb.?
After: Mary had a little lamb.
Another answer from the same post is much cleaner:
private static string CleanFileName(string fileName)
{
return Path.GetInvalidFileNameChars().Aggregate(fileName, (current, c) => current.Replace(c.ToString(), string.Empty));
}
from How to remove illegal characters from path and filenames?
Or you could try this (probably terribly inefficient) method:
string inputString = @"File ~!@#$%^&*()_+|`1234567890-=\[];',./{}:""<>? name";
var badchars = Path.GetInvalidFileNameChars();
foreach (var c in badchars)
inputString = inputString.Replace(c.ToString(), "");
The result will be:
File ~!@#$%^&()_+`1234567890-=[];',.{} name
But feel free to add more chars to the badchars before running the foreach loop on them.
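If the repeated Replace calls are a concern, the same filtering can be done in a single pass over the string. This is only a sketch: the class name FileNameSketch and the method name Strip are hypothetical, and the set of invalid characters comes from Path.GetInvalidFileNameChars(), so the result depends on the operating system.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class FileNameSketch
{
    // Build the lookup set once; Contains is then O(1) per character.
    private static readonly HashSet<char> Invalid =
        new HashSet<char>(Path.GetInvalidFileNameChars());

    public static string Strip(string name)
    {
        // Keep only the characters that are valid in a file name.
        return new string(name.Where(c => !Invalid.Contains(c)).ToArray());
    }

    static void Main()
    {
        // On Windows the double quotes are removed; other platforms
        // report a smaller set of invalid characters.
        Console.WriteLine(Strip("rumbaton jonathan \"racko\" contreras"));
    }
}
```

Extra characters can be added to the set before filtering, in the same spirit as extending badchars above.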
See http://msdn.microsoft.com/cs-cz/library/fk49wtc1.aspx:
Returns a string that is equivalent to the current string except that all instances of oldValue are replaced with newValue.
I have written a method to do the exact operation that you want and with much cleaner code.
The method
public static string Delete(this string target, string samples) {
if (string.IsNullOrEmpty(target) || string.IsNullOrEmpty(samples))
return target;
var tar = target.ToCharArray();
const char deletechar = '♣'; // sentinel: a char that is very unlikely to appear in the input
for (var i = 0; i < tar.Length; i++) {
for (var j = 0; j < samples.Length; j++) {
if (tar[i] == samples[j]) {
tar[i] = deletechar;
break;
}
}
}
return new string(tar).Replace(deletechar.ToString(), string.Empty);
}
Sample
var input = "rumbaton jonathan \"racko\" contreras";
var cleaned = input.Delete("\"\\/*?><|");
Will result in:
rumbaton jonathan racko contreras
OK! I've solved my issue thanks to all your suggestions. This is my corrected code:
string newFileName = _artist + " - " + _title;
char[] invalidFileChars = Path.GetInvalidFileNameChars();
char[] invalidPathChars = Path.GetInvalidPathChars();
foreach (char invalidChar in invalidFileChars)
{
newFileName = newFileName.Replace(invalidChar.ToString(), string.Empty);
}
foreach (char invalidChar in invalidPathChars)
{
newFilePath = newFilePath.Replace(invalidChar.ToString(), string.Empty);
}
Thank you so much, everybody :)
What would an implementation of 'MagicFunction' look like to make the following (nunit) test pass?
public void MagicFunction_Should_Prepend_Given_String_To_Each_Line()
{
var str = #"line1
line2
line3";
var result = MagicFunction(str, "-- ");
var expected = #"-- line1
-- line2
-- line3";
Assert.AreEqual(expected, result);
}
string MagicFunction(string str, string prepend)
{
str = str.Replace("\n", "\n" + prepend);
str = prepend + str;
return str;
}
EDIT:
As others have pointed out, the newline characters vary between environments. If you only plan to use this function on files created in the same environment, then System.Environment.NewLine will work fine. However, if you create a file on a Linux box and then transfer it to a Windows box, you'll want to specify a different type of newline. Since Linux uses \n and Windows uses \r\n, this piece of code works for both Windows and Linux files: the \r stays where it was and the prefix goes in after the \n. If you're throwing classic Mac files into the mix (\r), you'll have to come up with something a little more involved.
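One possible shape for the "more involved" version is sketched below. It treats \r\n, a lone \r, and a lone \n each as one line break and keeps whichever break the input used. The class name PrefixSketch and the method name PrependEachLine are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class PrefixSketch
{
    public static string PrependEachLine(string text, string prefix)
    {
        // \r\n comes first in the alternation so that a Windows line
        // break is matched as one unit rather than as \r followed by \n.
        // The lambda keeps the original break and appends the prefix,
        // which also avoids "$" in the prefix being read as a group reference.
        return prefix + Regex.Replace(text, @"\r\n|\r|\n", m => m.Value + prefix);
    }

    static void Main()
    {
        Console.WriteLine(PrependEachLine("line1\nline2\nline3", "-- "));
    }
}
```

Because the original terminators are preserved, mixed input stays mixed; normalizing to a single terminator would be a separate step.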
Use .Select on a list of the lines.
private static string MagicFunction(string str, string prefix)
{
string[] lines = str.Split(new[] { '\n' });
return string.Join("\n", lines.Select(s => prefix + s).ToArray());
}
How about:
// Matches the start of the input or a Windows line break.
// Fields must live at class level, not inside the method.
private static readonly Regex regex = new Regex(
"(^|\\r\\n)",
RegexOptions.CultureInvariant
| RegexOptions.Compiled
);
// This is the replacement string
private const string regexReplace =
"$1-- ";
string MagicFunction(string InputText) {
// Replace the matched text in the InputText using the replacement pattern
string result = regex.Replace(InputText, regexReplace);
return result;
}
var result = "-- " + str.Replace(Environment.NewLine, Environment.NewLine + "-- ");
If you want it to cope with either Windows (\r\n) or Unix (\n) newlines, then:
var result = "-- " + str.Replace("\n", "\n-- ");
There is no need to touch the \r, as it is left where it was before. If, however, you want to convert between Unix and Windows, then:
var result = "-- " + str.Replace("\r","").Replace("\n", Environment.NewLine + "-- ");
will do it and return the result in the local OS's format.
You could do it like this:
public string MagicFunction2(string str, string prefix)
{
bool first = true;
using(StringWriter writer = new StringWriter())
using(StringReader reader = new StringReader(str))
{
string line;
while((line = reader.ReadLine()) != null)
{
if (!first)
writer.WriteLine();
writer.Write(prefix + line);
first = false;
}
return writer.ToString();
}
}
You could split the string by Environment.NewLine, add the prefix to each of the resulting strings, and then join them with Environment.NewLine again.
string MagicFunction(string originalString, string prefix)
{
List<string> prefixed = new List<string>();
foreach (string s in originalString.Split(new[]{Environment.NewLine}, StringSplitOptions.None))
{
prefixed.Add(prefix + s);
}
return String.Join(Environment.NewLine, prefixed.ToArray());
}
How about this. It uses StringBuilder in case you are planning on prepending a lot of lines.
string MagicFunction(string input)
{
StringBuilder sb = new StringBuilder();
string line;
using(StringReader sr = new StringReader(input))
{
while((line = sr.ReadLine()) != null)
{
sb.Append(String.Concat("-- ", line, System.Environment.NewLine));
}
}
return sb.ToString();
}
Thanks all for your answers. I implemented the MagicFunction as an extension method. It leverages Thomas Levesque's answer but is enhanced to handle all major environments AND assumes you want the output string to use the same newline terminator of the input string.
I favored Thomas Levesque's answer (over Spencer Ruport's, Fredrik Mork's, Lazarus, and JDunkerley) because it was the best performing. I'll post performance results on my blog and link here later for those interested.
(Obviously, the function name of 'MagicFunctionIO' should be changed. I went with 'PrependEachLineWith')
public static string MagicFunctionIO(this string self, string prefix)
{
string terminator = self.GetLineTerminator();
using (StringWriter writer = new StringWriter())
{
using (StringReader reader = new StringReader(self))
{
bool first = true;
string line;
while ((line = reader.ReadLine()) != null)
{
if (!first)
writer.Write(terminator);
writer.Write(prefix + line);
first = false;
}
return writer.ToString();
}
}
}
public static string GetLineTerminator(this string self)
{
if (self.Contains("\r\n")) // windows
return "\r\n";
else if (self.Contains("\n")) // unix
return "\n";
else if (self.Contains("\r")) // mac
return "\r";
else // default, unknown env or no line terminators
return Environment.NewLine;
}