I am trying to come up with an algorithm that identifies whether a string is part of the text content of an element or part of the element's attributes.
For example:
<a class="tag tag-red-dark" href="/keywords?q=PARTOFATTRIBUTE"> Found TEXTCONTENT </a>
If you perform regex on TEXTCONTENT or PARTOFATTRIBUTE, you can run this algorithm to check if they are part of the text or part of the attributes:
MatchCollection matches = Regex.Matches(html, @"(?i)TEXTCONTENT");
for (int i = matches.Count-1; i >= 0 ; i--){
Match m = matches[i];
int currentIndex = m.Index;
bool isTextContent = false;
while (html[currentIndex] != '<'){
currentIndex--;
if (html[currentIndex] == '>'){ // text is placed between > and <
isTextContent = true;
break;
}
}
if (isTextContent){
// do something with text content
}else{
// do something with attribute
}
}
But the algorithm is fragile. If your html looks like this:
<a class="tag tag-red-dark" title="a>b" href="/keywords?q=PARTOFATTRIBUTE"> Found TEXTCONTENT </a>
PARTOFATTRIBUTE will be recognized as text, which it is not.
Moreover, you could also have text with < in it, which makes the algorithm think that it found an attribute:
<a class="tag tag-red-dark" title="a>b" href="/keywords?q=PARTOFATTRIBUTE"> < Found TEXTCONTENT </a>
Placing < in text without escaping is invalid HTML, which I would like to handle. Placing > in attributes, on the other hand, is valid. Is it possible to determine whether the selected string is part of the attributes or of the text content solely based on the environment in which it is placed?
HtmlAgilityPack is not slow, and you do not have to parse the entire page, just the a tag. Since you have probably already extracted the a tags from your HTML, pass in only the HTML that you need parsed.
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml("<a class=\"tag tag - red - dark\" title=\"a > b\" href=\" / keywords ? q = PARTOFATTRIBUTE\"> < Found TEXTCONTENT </a>");
if (htmlDoc.DocumentNode.ChildNodes[0].InnerHtml.Contains("TEXTCONTENT"))
{
// do something with text content
}
if (htmlDoc.DocumentNode.ChildNodes[0].Attributes["href"].Value.Contains("PARTOFATTRIBUTE"))
{
// do something with attribute
}
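If a parser is not an option, a quote-aware scan is one way to decide from the surroundings alone whether a match sits inside a tag. The following is only a minimal sketch under stated assumptions: the names HtmlPosition and IsInsideTag are hypothetical, a `<` only opens a tag when it is directly followed by a letter or `/`, and comments, CDATA and script blocks are not handled.

```csharp
using System;

public static class HtmlPosition
{
    // Returns true when the character at 'index' lies inside a tag
    // (tag name or attributes), false when it lies in text content.
    // Hypothetical helper: a sketch, not a full HTML tokenizer.
    public static bool IsInsideTag(string html, int index)
    {
        bool inTag = false;
        char quote = '\0'; // the quote character currently open, if any

        for (int i = 0; i < index && i < html.Length; i++)
        {
            char c = html[i];
            if (!inTag)
            {
                // Assumption: a stray '<' followed by a space or digit is text.
                bool hasNext = i + 1 < html.Length;
                if (c == '<' && hasNext && (char.IsLetter(html[i + 1]) || html[i + 1] == '/'))
                    inTag = true;
            }
            else if (quote != '\0')
            {
                // Inside an attribute value: only the matching quote ends it,
                // so a '>' in title="a>b" is ignored.
                if (c == quote) quote = '\0';
            }
            else if (c == '"' || c == '\'')
            {
                quote = c;
            }
            else if (c == '>')
            {
                inTag = false;
            }
        }
        return inTag;
    }
}
```

Called with m.Index for each regex match, this reports PARTOFATTRIBUTE as inside a tag and TEXTCONTENT as text for the title="a>b" example, including the variant with an unescaped < in the text.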
Related
if (richTextBox1.Lines[i].StartsWith(@"<a href=""") ||
richTextBox1.Lines[i].EndsWith(@""""))
The StartsWith should be <a href="
The EndsWith should be one single "
But the way it is now, I'm getting no results.
Input for example:
Screen-reader users, click here to turn off ggg Instant.
I need to get this part:
/setprefs?suggon=2&prev=https://www.test.com/search?q%3D%2Band%2B%26espv%3D2%26biw%3D960%26bih%3D489%26source%3Dlnms%26tbm%3Disch%26sa%3DX%26ei%3DYrxxVb-hJqac7gba0YOgDQ%26ved%3D0CAYQ_AUoAQ&sig=0_seDQVVTDQQx1hvN3BRktZNFc9Ew%3D
The part between the
I also tried to use HtmlAgilityPack:
HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("https://www.test.com");
foreach (HtmlAgilityPack.HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue("href", string.Empty);
if (!newHtmls.Contains(hrefValue) && hrefValue.Contains("images"))
newHtmls.Add(hrefValue);
}
But this gave me only 1 link.
When I open the page's view-source and search for the word image or images, I get over 350 results.
I tried also this solution:
var document = new HtmlWeb().Load(url);
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !String.IsNullOrEmpty(s));
But it didn't give me the results I needed.
I forgot to mention that I copied the page's view-source content into a richTextBox1 window and am reading the text line by line from richTextBox1, so maybe that's why I'm not getting the results I need?
for (int i = 0; i < richTextBox1.Lines.Length; i++)
{
if (richTextBox1.Lines[i].StartsWith("<a href=\"") &&
richTextBox1.Lines[i].EndsWith("\""))
{
listBox1.Items.Add(richTextBox1.Lines[i]);
}
}
Maybe the view-source content in the browser (Chrome) is not the same as in richTextBox1. And maybe I should not read it line by line from richTextBox1, but read the whole text from richTextBox1 first?
Based on your input, EndsWith isn't going to help (as your input actually ends with </a>). Your next-best option would be to store the location (position) of href=", then look for the next occurrence of a " beginning at your stored location, e.g.:
var input = @"Screen-reader users, click here to turn off ggg Instant.";
var needle = @"href=""";
var start = input.IndexOf(needle);
if (start != -1)
{
start += needle.Length;
var end = input.IndexOf(@"""", start);
if (end != -1) // guard against a missing closing quote
{
// final result (Dump() is LINQPad's output helper):
var href = input.Substring(start, end - start).Dump();
}
}
Better than that would be to use an actual HTML parser (might I recommend HtmlAgilityPack?).
Sometimes from a 3rd party API I get malformed HTML elements returned:
olor:red">Text</span>
when I expect:
<span style="color:red">Text</span>
For my context, the text content of the HTML is more important so it does not matter if I lose surrounding tags/formatting.
What would be the best way to strip out the malformed tags such that the first example would read
Text
and the second would not change?
I recommend taking a look at HtmlAgilityPack, which is also a very handy tool for HTML sanitization.
Here's an example approach using that library:
static void Main()
{
var inputs = new[] {
@"olor:red"">Text</span>",
@"<span style=""color:red"">Text</span>",
@"Text</span>",
@"<span style=""color:red"">Text",
@"<span style=""color:red"">Text"
};
var doc = new HtmlDocument();
inputs.ToList().ForEach(i => {
if (!i.StartsWith("<"))
{
if (i.IndexOf(">") != i.Length-1)
i = "<" + i;
else
i = i.Substring(0, i.IndexOf("<"));
doc.LoadHtml(i);
Console.WriteLine(doc.DocumentNode.InnerText);
}
else
{
doc.LoadHtml(i);
Console.WriteLine(doc.DocumentNode.OuterHtml);
}
});
}
Outputs:
Text
<span style="color:red">Text</span>
Text
<span style="color:red">Text</span>
<span style="color:red">Text</span>
If you just need the content of the tags, and no information of what type of tag etc, you could use Regular Expressions:
var r = new Regex(">([^>]+)<");
var text = "olor:red\">Text</span>";
var m = r.Match(text);
Group 1 of the match holds the inner text of a tag. Use Regex.Matches instead of Match to find the inner text of every tag.
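To collect every inner text rather than only the first one, the same pattern can be run through Regex.Matches. This is only a sketch: the InnerTextFinder class and FindAll method are hypothetical names, and the pattern is the one shown above, unchanged.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InnerTextFinder
{
    // Same pattern as above: text sitting between a '>' and the next '<'.
    private static readonly Regex BetweenTags = new Regex(">([^>]+)<");

    // Returns the captured text of every match, in document order.
    public static List<string> FindAll(string html)
    {
        var results = new List<string>();
        foreach (Match m in BetweenTags.Matches(html))
        {
            results.Add(m.Groups[1].Value); // group 1 is the inner text
        }
        return results;
    }
}
```

Note that the pattern needs both a preceding > and a following <, so text at the very start or end of the input that has no tag on one side is not returned.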
Very crudely, you could strip out all 'tags' by stripping everything before a > and keeping everything before a <.
I'm assuming you also need to consider the situation where the text you receive is without tags, e.g. Text.
In pseudo-code:
returnText = ""
loop:
    gtI = text.IndexOf(">")
    ltI = text.IndexOf("<")
    if -1==gtI and -1==ltI:
        returnText += text
        return returnText   // we're done
    if gtI==-1:
        returnText += text up to position ltI
        return returnText
    if ltI==-1:
        returnText += text after gtI
        return returnText
    if ltI < gtI:
        returnText += text before ltI
        text = text after ltI
        loop
    // gtI < ltI:
    text = text after gtI
    loop
It's crude and can be done much better (and faster) with a custom coded parser, but essentially the logic would be the same.
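The pseudo-code above could be written in C# roughly as follows. This is a direct, unoptimized transcription; the names TagStripper and StripTags are hypothetical, and like the pseudo-code it treats every < and > as a tag delimiter, so a literal < or > in the text is lost.

```csharp
using System.Text;

public static class TagStripper
{
    // Keeps text that follows a '>' and precedes a '<'; drops the rest.
    public static string StripTags(string text)
    {
        var result = new StringBuilder();
        while (true)
        {
            int gt = text.IndexOf('>');
            int lt = text.IndexOf('<');

            if (gt == -1 && lt == -1)
            {
                // No delimiters left: everything remaining is text.
                result.Append(text);
                return result.ToString();
            }
            if (gt == -1)
            {
                // Only an opening '<' remains: keep what precedes it.
                result.Append(text.Substring(0, lt));
                return result.ToString();
            }
            if (lt == -1)
            {
                // Only a closing '>' remains: keep what follows it.
                result.Append(text.Substring(gt + 1));
                return result.ToString();
            }
            if (lt < gt)
            {
                // Text, then a tag: keep the text and step into the tag.
                result.Append(text.Substring(0, lt));
                text = text.Substring(lt + 1);
            }
            else
            {
                // Still inside a (possibly truncated) tag: skip past its '>'.
                text = text.Substring(gt + 1);
            }
        }
    }
}
```

For both olor:red">Text</span> and <span style="color:red">Text</span> this returns Text, which differs from the requirement in the question that the well-formed input stay unchanged.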
You should really be asking why the API returns only part of what you require: I can't see why it should be returning ext</span> either, which really messes you up.
I have the following String "</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>"
I need to get the attribute value from the div tag. How can I retrieve this using C#?
Avoid parsing HTML with regex
Regex is not a good choice for parsing HTML files.
HTML is neither strict nor regular in its format.
Use HtmlAgilityPack
You can do it like this with HtmlAgilityPack.
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
List<string> itemList = doc.DocumentNode.SelectNodes("//div[@id]")//selects all div having id attribute
.Select(x=>x.Attributes["id"].Value)//select the id attribute value
.ToList<string>();
//itemList will now contain all div's id attribute value
If you're a masochist you can do this old school VB3 style:
string input = @"</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>";
string startString = "div id='";
int startIndex = input.IndexOf(startString);
if (startIndex != -1)
{
startIndex += startString.Length;
int endIndex = input.IndexOf("'", startIndex);
string subString = input.Substring(startIndex, endIndex - startIndex);
}
Strictly solving the question asked, one of a myriad ways of solving it would be to isolate the div element, parse it as an XElement and then pull the attribute's value that way.
string bobo = "</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>";
string justDiv = bobo.Substring(bobo.IndexOf("<div"));
XElement xelem = XElement.Parse(justDiv);
var id = xelem.Attribute("id");
var value = id.Value;
There are certainly lots of ways to solve this but this one answers the mail.
A .NET Regex that looks something like this will do the trick
^</script><div id='(?<attrValue>[^']+)'.*$
you can then get hold of the value as
MatchCollection matches = Regex.Matches(input, @"^</script><div id='(?<attrValue>[^']+)'.*$");
if (matches.Count > 0)
{
var attrValue = matches[0].Groups["attrValue"].Value; // .Value yields the captured string
}
I have some HTML and I want to convert it to plain text but keep only the colored text tags.
For example, when I have the HTML below:
<body>
This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample<p>
....
and some other tags...
</body>
</html>
I want the output:
this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...
I'd use a parser such as HtmlAgilityPack to parse the HTML, and use regular expressions to find the color value in the attributes.
First, find all the nodes that contain style attribute with color defined in it by using xpath:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode
.SelectNodes("//*[contains(@style, 'color')]")
.ToArray();
Then the simplest regex to match a color value: (?<=color:\s*)#?\w+.
var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
Then iterate through these nodes and if there is a regex match, replace the inner html of the node with html encoded tags (you'll understand why a little bit later):
foreach (var node in nodes)
{
var style = node.Attributes["style"].Value;
if (colorRegex.IsMatch(style))
{
var color = colorRegex.Match(style).Value;
node.InnerHtml =
HttpUtility.HtmlEncode("<" + color + ">") +
node.InnerHtml +
HttpUtility.HtmlEncode("</" + color + ">");
}
}
And finally get the inner text of the document and perform html decoding on it (this is because inner text strips all the tags):
var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
This should return something like this:
This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...
Of course you could improve it for your needs.
It is possible to do it using regular expressions but... You should not parse (X)HTML with regex.
The first regexp I came up with to solve the problem is:
<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">(\w|\s)+</p>
Group 2 will be the colour including the leading #, group 3 the hex digits (3 or 6 hexadecimals), and group 4 matches the text inside the tag one character at a time, so its value is only the last character matched (in .NET the full text is available through the group's Captures collection).
Obviously, it's not the best solution as I'm not a regexp master, and it needs some testing and probably generalisation. But it's still a good starting point.
I have a variable in C# holding a string like this:
string myText="my text which contains <div>i am text inside div</div>";
Now I want to replace every "\n" (newline character) in this variable's data with "<br>", except for text inside the div.
How do I do this?
Others have suggested using libraries such as HtmlAgilityPack. It is indeed a nice tool, but if you don't need HTML parsing functionality beyond what you have requested, a simple parser should suffice:
string ReplaceNewLinesWithBrIfNotInsideDiv(string input) {
int divNestingLevel = 0;
StringBuilder output = new StringBuilder();
StringComparison comp = StringComparison.InvariantCultureIgnoreCase;
for (int i = 0; i < input.Length; i++) {
if (input[i] == '<') {
if (i < (input.Length - 3) && input.Substring(i, 4).Equals("<div", comp)){
divNestingLevel++;
} else if (divNestingLevel != 0 && i < (input.Length - 5) && input.Substring(i, 6).Equals("</div>", comp)) {
divNestingLevel--;
}
}
if (input[i] == '\n' && divNestingLevel == 0) {
output.Append("<br/>");
} else {
output.Append(input[i]);
}
}
return output.ToString();
}
This should handle nested divs as well.
For something like this you will need to parse the HTML in order to distinguish the parts that you do want to make the replacement in from the ones you don't.
I suggest looking at the HTML agility pack - it can parse HTML fragments as well as malformed HTML. You can then query the resulting parse tree using XPath notation and do your replacement on the selected nodes.
That would require some fairly complicated RegEx, which is out of my league.
But you could try splitting the string:
string[] parts = myText.Split(new[] { "<div>", "</div>" }, StringSplitOptions.None);
for (int i = 0; i < parts.Length; i += 2) // only the even parts
parts[i] = parts[i].Replace("\n", "<br>");
And then use a StringBuilder to re-assemble the parts.
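Putting the split and the re-assembly together might look like the sketch below. It assumes the input contains only well-formed, non-nested, attribute-less <div>...</div> pairs, so that the even-indexed parts are outside a div and the odd-indexed parts are inside one; the names DivSplitter and ReplaceOutsideDiv are hypothetical.

```csharp
using System;
using System.Text;

public static class DivSplitter
{
    public static string ReplaceOutsideDiv(string input)
    {
        // Splitting on both delimiters yields alternating outside/inside parts.
        string[] parts = input.Split(new[] { "<div>", "</div>" }, StringSplitOptions.None);
        var sb = new StringBuilder();

        for (int i = 0; i < parts.Length; i++)
        {
            if (i % 2 == 0)
            {
                // Even parts are outside any div: replace the newlines.
                sb.Append(parts[i].Replace("\n", "<br>"));
            }
            else
            {
                // Odd parts were inside a div: restore the tags, keep the text as-is.
                sb.Append("<div>").Append(parts[i]).Append("</div>");
            }
        }
        return sb.ToString();
    }
}
```

If the divs can be nested, carry attributes, or be unbalanced, the even/odd assumption no longer holds and a nesting counter or a real parser is the safer choice.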
I would split the string on div, then look at the tokens: if a token starts with "div", don't replace \n with BR in it; instead find the closing div and split on that, then take the second token and repeat what you just did. Of course, you will have to keep appending the tokens to a master string. I'll code up a sample here in a few minutes...
Use the string.Replace() method like this:
myText = myText.Replace("\n", "<br>");
You could consider using the Environment.NewLine property to find the newline characters. Are you sure they are not \n\r or \r\n etc.?
You may have to pull the text inside the div out first if you don't want to process it. Use a regex to find and remove it, do the Replace() as above, then put the strings back together.
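One way to sketch the "pull the div out, replace, put it back" idea is Regex.Split with a capturing group, which keeps the captured div blocks in the result array at the odd indexes. The names below are hypothetical, nested divs are not handled, and treating \r\n, \r and \n all as line breaks is an assumption about the input.

```csharp
using System.Text.RegularExpressions;

public static class NewLineReplacer
{
    public static string ReplaceNewLinesOutsideDiv(string input)
    {
        // Because the pattern is one capturing group, Regex.Split returns
        // [outside, div block, outside, div block, ...].
        string[] parts = Regex.Split(
            input,
            @"(<div\b[^>]*>.*?</div>)",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Only the even-indexed parts lie outside a div.
        for (int i = 0; i < parts.Length; i += 2)
        {
            parts[i] = Regex.Replace(parts[i], @"\r\n|\r|\n", "<br>");
        }
        return string.Concat(parts);
    }
}
```

Matching \r\n before the single characters keeps a Windows line ending from turning into two <br> tags.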