I need to extract only the inner string, Rocky44, using C#.
Hi <span>Rocky44</span>
I tried the Split method, but I can't get it to work:
string[] result = temp.Split(new string[] { "<span>" , "</span>" }, StringSplitOptions.RemoveEmptyEntries);
Example:
Hi <span>Rocky44</span>
To:
Rocky44
Use an HTML parser. Here is an example using HtmlAgilityPack:
string html = @"Hi <span>Rocky44</span>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var text = doc.DocumentNode.SelectSingleNode("//span").InnerText;
You're on the right track; you're just not escaping your quotes correctly:
string[] result = temp.Split(new string[] { "<span>" , "</span>" }, StringSplitOptions.RemoveEmptyEntries);
Of course, this is assuming that your input will always be in exactly the given format. As I4V mentions, an HTML parser may come in handy if you're trying to do anything more complicated.
If your input will always look exactly like this (this sort of HTML and nothing else), I would use a regex. Otherwise, do not use one.
string HTML = @"Hi <span>Rocky44</span>";
// capture whatever sits between the opening and closing span tags
var result = Regex.Match(HTML, @"<span[^>]*>(.*?)</span>").Groups[1].Value;
Find the index of <span> and </span> using the IndexOf method.
Then (adjusting for the length of <span>) use the String.Substring method to get the desired text.
string FindLinkText(string linkHtml)
{
int startIndex = linkHtml.IndexOf("<span>") + "<span>".Length,
length = linkHtml.IndexOf("</span>") - startIndex;
return linkHtml.Substring(startIndex, length);
}
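A slightly more defensive sketch of the same IndexOf/Substring idea, assuming the tag is always written exactly as <span> with no attributes. The names SpanText and TryGetSpanText are made up for illustration; the helper returns null instead of throwing when either tag is missing.

```csharp
using System;

static class SpanText
{
    // Hypothetical helper: returns the text between the first <span> and the
    // next </span>, or null when either tag is missing.
    public static string TryGetSpanText(string html)
    {
        const string open = "<span>";
        const string close = "</span>";

        int start = html.IndexOf(open, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += open.Length;

        // Look for the closing tag only after the opening tag.
        int end = html.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return html.Substring(start, end - start);
    }

    static void Main()
    {
        Console.WriteLine(TryGetSpanText("Hi <span>Rocky44</span>")); // Rocky44
    }
}
```

Starting the second search at the end of the opening tag avoids a negative length if a stray </span> appears earlier in the string.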
Related
I need to decode HTML into plain text. I know that there are a lot of questions like this but I noticed one problem with those solutions and don't know how to solve it.
For example we have this piece of HTML:
<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>
I tried regex solutions and the HttpUtility.HtmlDecode method, and all of them give this output: Some textSome more text. Words get joined where they should be separate. Is there a way to decode the string without merging words?
It's not clear what separator you want between things that were not separated in the first place, so I used a newline (\n).
Where(x=>!string.IsNullOrWhiteSpace(x)) removes the empty elements, which would otherwise produce a lot of \n\n in a more complex HTML document.
var input = "<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(input);
var result = string.Join(
"\n",
htmlDocument
.DocumentNode
.ChildNodes
.Select(x=> x.InnerText)
.Where(x=>!string.IsNullOrWhiteSpace(x))
);
Result:
"Some text\nSome more text"
An easy way to do it is to use HTML Agility Pack:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlString); // LoadHtml parses a string; Load expects a file path or a stream
string res = htmlDocument.DocumentNode.SelectSingleNode("YOUR XPATH TO THE INTERESTING ELEMENT").InnerText;
You can use something like the following. In this sample I have used a newline to separate the inner text of each node; hopefully you can adapt it to suit your scenario.
public static string GetPlainTextFromHTML(string inputText)
{
// Extracted plain text
var plainText = string.Empty;
if(string.IsNullOrWhiteSpace(inputText))
{
return plainText;
}
var htmlNote = new HtmlDocument();
htmlNote.LoadHtml(inputText);
var nodes = htmlNote.DocumentNode.ChildNodes;
if(nodes == null)
{
return plainText;
}
StringBuilder innerString = new StringBuilder();
// Replace <p> with new lines
foreach (HtmlNode node in nodes)
{
innerString.Append(node.InnerText);
innerString.Append("\n");
}
plainText = innerString.ToString();
return plainText;
}
You can use a regex : <(div|/div|br|p|/p)[^>]{0,}>
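One way that pattern might be applied, as a rough sketch rather than a robust HTML-to-text converter: replace the listed tags with a space, strip every remaining tag, then decode entities and collapse whitespace. The names PlainText and FromHtml are hypothetical, and a space is assumed as the separator.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class PlainText
{
    // Hypothetical helper built around the pattern suggested above.
    public static string FromHtml(string html)
    {
        // 1. Turn the listed tags into a space so neighbouring words stay apart.
        string text = Regex.Replace(html, @"<(div|/div|br|p|/p)[^>]{0,}>", " ", RegexOptions.IgnoreCase);
        // 2. Drop every remaining tag (h1, strong, ...).
        text = Regex.Replace(text, @"<[^>]+>", string.Empty);
        // 3. Decode entities such as &amp; and collapse runs of whitespace.
        text = WebUtility.HtmlDecode(text);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    static void Main()
    {
        string input = "<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>";
        Console.WriteLine(FromHtml(input)); // Some text Some more text
    }
}
```

Note that the pattern also matches any tag whose name merely starts with p (for example <pre>), so it is only suitable for simple, known markup.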
I have this string :
"<figure><img
src='http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg'
href='JavaScript:void(0);' onclick='return takeImg(this)'
tabindex='1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>"
How can I retrieve this link :
http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg
All the strings have the same format, so I need to get the substring between src= and href, but I don't know how to do that. Thanks.
If you parse HTML, do not use string methods; use a real HTML parser such as HtmlAgilityPack:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html is your string
var linksAndImages = doc.DocumentNode.SelectNodes("//a/@href | //img/@src");
var allSrcList = linksAndImages
.Select(node => node.GetAttributeValue("src", "[src not found]"))
.ToList();
You can use regex:
var src = Regex.Match("the string", "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
In general, you should use an HTML/XML parser when parsing a value from HTML code, but with a limited string like this, Regex would be fine.
string url = Regex.Match(htmlString, @"src='(.*?)'").Groups[1].Value;
If your string is always in same format, you can easily do this like so :
string input = "<figure><img src='http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg' href='JavaScript:void(0);' onclick='return takeImg(this)' tabindex='1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>";
// the link sits between the first two ' characters, so you can do:
int start = input.IndexOf("'") + 1;
int end = input.IndexOf("'", start);
input = input.Substring(start, end - start);
// now your string looks like : "http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg"
return input;
string str = "<figure><imgsrc = 'http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg'href = 'JavaScript:void(0);' onclick = 'return takeImg(this)'tabindex = '1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>";
int pFrom = str.IndexOf("src = '") + "src = '".Length;
int pTo = str.LastIndexOf("'href");
string url = str.Substring(pFrom, pTo - pFrom);
Source :
Get string between two strings in a string
Here q is your string. I look for the index of the attribute you want (src = '), remove the first few characters (7, including spaces), and then find where the text ends by looking for the next '.
Instead of hard-coding how many characters to remove, you could work the number out from the search string (for example with .IndexOf or its length).
string q =
"<figure><img src = 'http://myphotos.net/image.ashx?type=2&image=Images\\2\\9\\11\\12\\3\\8\\4\\7\\685621455625.jpg' href = 'JavaScript:void(0);' onclick = 'return takeImg(this)'" +
"tabindex = '1' class='myclass' width='55' height='66' alt=\"myalt\"></figure>";
string z = q.Substring(q.IndexOf("src = '"));
z = z.Substring(7);
z = z.Substring(0, z.IndexOf("'"));
MessageBox.Show(z);
This is certainly not the most elegant way (look at the other answers for that :)).
I have the following String "</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>"
I need to get the attribute value from the div tag. How can I retrieve this using C#?
Avoid parsing HTML with regex
Regex is not a good choice for parsing HTML files.
HTML is neither strict nor regular in its format.
Use HtmlAgilityPack
You can do it like this with HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
List<string> itemList = doc.DocumentNode.SelectNodes("//div[@id]")//selects all div having id attribute
.Select(x=>x.Attributes["id"].Value)//select the id attribute value
.ToList<string>();
//itemList will now contain all div's id attribute value
If you're a masochist you can do this old school VB3 style:
string input = @"</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>";
string startString = "div id='";
int startIndex = input.IndexOf(startString);
if (startIndex != -1)
{
startIndex += startString.Length;
int endIndex = input.IndexOf("'", startIndex);
string subString = input.Substring(startIndex, endIndex - startIndex);
}
Strictly solving the question asked, one of a myriad ways of solving it would be to isolate the div element, parse it as an XElement and then pull the attribute's value that way.
string bobo = "</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>";
string justDiv = bobo.Substring(bobo.IndexOf("<div"));
XElement xelem = XElement.Parse(justDiv);
var id = xelem.Attribute("id");
var value = id.Value;
There are certainly lots of ways to solve this but this one answers the mail.
A .NET Regex that looks something like this will do the trick
^</script><div id='(?<attrValue>[^']+)'.*$
You can then get hold of the value like this:
MatchCollection matches = Regex.Matches(input, @"^</script><div id='(?<attrValue>[^']+)'.*$");
if (matches.Count > 0)
{
var attrValue = matches[0].Groups["attrValue"].Value;
}
First of all: Sorry for my bad English!
I know the title isn't the best English, but I don't really know how to format this question...
What I'm trying to do is read an HTML source line by line, so that when it sees a given word (like http://) it copies the entire line, and I can then strip the rest and keep only the URL.
This is what I've tried:
using (var source = new StreamReader(TempFile))
{
string line;
while ((line = source.ReadLine()) != null)
{
if (line.Contains("http://"))
{
Console.WriteLine(line);
}
}
}
This works perfectly if I want to read from an external file, but it doesn't work when I want to read a string or a StringBuilder. How do you read those line by line?
You can use new StringReader(theString) to do that with a string, but I question your overall strategy. That would be better done with a tool like HTML Agility Pack.
For example, here is HTML Agility Pack extracting all hyperlinks:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(theString);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
Console.WriteLine(att.Value);
}
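The StringReader suggestion at the start of this answer can be sketched as follows. It mirrors the StreamReader loop from the question; LineScanner and LinesContaining are hypothetical names, and for a StringBuilder you would pass its ToString() result.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LineScanner
{
    // Hypothetical helper: returns every line of `text` that contains `marker`.
    public static List<string> LinesContaining(string text, string marker)
    {
        var hits = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            // ReadLine returns null once the end of the string is reached.
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Contains(marker))
                    hits.Add(line);
            }
        }
        return hits;
    }

    static void Main()
    {
        string html = "<p>intro</p>\n<a href=\"http://example.com\">link</a>\n<p>end</p>";
        foreach (string line in LinesContaining(html, "http://"))
            Console.WriteLine(line);
    }
}
```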
Well a string is just a string, it doesn't have any lines.
You can use something like String.Split to separate on the \r symbol.
MSDN: String.Split()
string words = "This is a list of words, with: a bit of punctuation" +
"\rand a newline character.";
string [] split = words.Split(new Char [] {'\r' });
foreach (string s in split) {
if (s.Trim() != "")
Console.WriteLine(s);
}
Firstly, you can use a StringReader.
Another option is to create a MemoryStream from the string via converting the string to a byte array first, as described in https://stackoverflow.com/a/10380166/396583
I think you can tokenize the input and check each entry for the required content.
string[] info = myStringBuilder.ToString().Split(' ');
foreach(var item in info) {
if(item.Contains("http://")) {
//work with it
}
}
You can use a memory stream to read from.
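A minimal sketch of the memory stream route, assuming UTF-8 is an acceptable encoding for the round trip. StreamFromString and Open are hypothetical names; the returned StreamReader can be used with the same ReadLine loop as in the question.

```csharp
using System;
using System.IO;
using System.Text;

static class StreamFromString
{
    // Hypothetical helper: exposes a string through a StreamReader
    // backed by an in-memory byte buffer.
    public static StreamReader Open(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return new StreamReader(new MemoryStream(bytes), Encoding.UTF8);
    }

    static void Main()
    {
        using (StreamReader source = Open("first line\nsee http://example.com\nlast line"))
        {
            string line;
            while ((line = source.ReadLine()) != null)
            {
                if (line.Contains("http://"))
                    Console.WriteLine(line);
            }
        }
    }
}
```

This copies the string into a byte array first, so StringReader is usually the lighter choice when no Stream-based API is involved.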
I have a string that contains the code of a webpage.
This is an example:
<input type="text" name="x4B07" value="650"
onchange="this.form.x8000.value=this.name;this.form.submit();"/>
<input type="text" name="x4B08" value="250"
onchange="this.form.x8000.value=this.name;this.form.submit();"/>
In that string I want to get the 650 and 250 (these are variables and they change value).
How can I do so?
Example:
name   value
x4b08  254
x4b07  253
x4b06  252
x4b05  251
If you were confident that the markup would never change (and you have a simple snippet like your example line) a regex could get you those values, for example:
Regex re = new Regex("name=\"(.*?)\" value=\"(.*?)\"");
Match match = re.Match(yourString);
if(match.Success && match.Groups.Count == 3){
String name = match.Groups[1].Value;
String value = match.Groups[2].Value;
}
Alternatively you could parse the page content and query the resulting document for the elements, and then extract the values. (C# HTML Parser: Looking for C# HTML parser )
You can use regular expressions to match value="([0-9]*)"
Or you can look for the string "value" using string.IndexOf and then take the following few characters.
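A short sketch of the regex option, assuming every value of interest is a run of digits in double quotes. ValueExtractor and NumericValues are hypothetical names; Regex.Matches is used so that all inputs are collected, not just the first.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ValueExtractor
{
    // Hypothetical helper: collects the digits captured by value="([0-9]*)".
    public static List<string> NumericValues(string html)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(html, "value=\"([0-9]*)\""))
            values.Add(m.Groups[1].Value);
        return values;
    }

    static void Main()
    {
        string html =
            "<input type=\"text\" name=\"x4B07\" value=\"650\" onchange=\"this.form.x8000.value=this.name;this.form.submit();\"/>\n" +
            "<input type=\"text\" name=\"x4B08\" value=\"250\" onchange=\"this.form.x8000.value=this.name;this.form.submit();\"/>";
        Console.WriteLine(string.Join(",", NumericValues(html))); // 650,250
    }
}
```

The value=this.name fragment inside onchange is not picked up because the pattern requires a double quote directly after the equals sign.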
This should work for you (assuming that s contains the string you want to parse):
string value = s.Substring(s.IndexOf("value=")+7);
value = value.Substring(0, value.IndexOf("\""));
How specific are your examples? Could you also want to extract varying length alphabetic strings? Will the strings you want to extract always be properties?
While the regex/substring way works for the specified examples I think they will scale quite badly.
I'd parse the HTML using a parser (see ndtreviv's answer) or possibly with an XML parser (if the HTML is valid XHTML). That way you will get better control and don't have to bleed your eyes out from fidgeting with a bucketload of regex.
If you have multiple such controls in the form of a string, you can create an XmlDocument and iterate through it.
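A rough sketch of that idea, assuming the fragment is well-formed XML (self-closed tags, quoted attributes, no unescaped & or <). The wrapper element name root and the names InputReader and ReadInputs are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class InputReader
{
    // Hypothetical helper: maps each input's name attribute to its value attribute.
    public static Dictionary<string, string> ReadInputs(string fragment)
    {
        var doc = new XmlDocument();
        // XML needs a single root element, so wrap the fragment in one.
        doc.LoadXml("<root>" + fragment + "</root>");

        var result = new Dictionary<string, string>();
        foreach (XmlElement input in doc.SelectNodes("//input"))
            result[input.GetAttribute("name")] = input.GetAttribute("value");
        return result;
    }

    static void Main()
    {
        string fragment =
            "<input type=\"text\" name=\"x4B07\" value=\"650\" onchange=\"this.form.x8000.value=this.name;this.form.submit();\"/>\n" +
            "<input type=\"text\" name=\"x4B08\" value=\"250\" onchange=\"this.form.x8000.value=this.name;this.form.submit();\"/>";
        foreach (var pair in ReadInputs(fragment))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

LoadXml throws an XmlException on markup that is not well-formed, so real-world HTML pages generally need an HTML parser instead.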
I just solved it with this:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(URL);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
Stream st = resp.GetResponseStream();
StreamReader sr = new StreamReader(st);
string buffer = sr.ReadToEnd();
ArrayList uniqueMatches = new ArrayList();
Match[] retArray = null;
Regex RE = new Regex("name=\"(.*?)\" value=\"(.*?)\"", RegexOptions.Multiline);
MatchCollection theMatches = RE.Matches(buffer);
for (int counter = 0; counter < theMatches.Count; counter++)
{
//string[] tempSplit = theMatches[counter].Value.Split('"');
Regex reName = new Regex("name=\"(.*?)\"");
Match matchName = reName.Match(theMatches[counter].Value);
Regex reValue = new Regex("value=\"(.*?)\"");
Match matchValue = reValue.Match(theMatches[counter].Value);
string[] dados = new string[2];
dados[0] = matchName.Groups[1].ToString();
dados[1] = matchValue.Groups[1].ToString();
uniqueMatches.Add(dados);
}
Thanks, all, for the help.