AngleSharp 0.9.11
On the page in the browser, the text is displayed as:
String_1.
String_2.
String_3.
String_4.
Parsing result:
String_1.String_2.String_3.String_4.
Page layout:
<div class="adv-point view-adv-point"><span>String_1. <br><br>String_2.<br>String_3.<br>String_4.</span></div>
I parse it with this code:
var items = document.QuerySelectorAll("div:nth-child(4) >div:nth-child(3) > div.adv-point.view-adv-point");
var text = items[0].TextContent.Trim();
Question
How can I get the parsing result with the line breaks preserved?
In other words, the result of the parsing should be:
String_1.
String_2.
String_3.
String_4.
I think if you use innerText here then it will work fine for you. Here is the code
var x = document.querySelectorAll("div:nth-child(4) >div:nth-child(3) > div.adv-point.view-adv-point");
console.log(x[0].innerText);
Try this-
var text=document.querySelectorAll(".view-adv-point span")[0].innerText;
If you log/alert text, you will see that the line break is present.
If you want to replace <br> with \n, then you can do this-
var text=document.querySelectorAll(".view-adv-point span")[0].innerHTML;
text = text.replace(/<br>/g, '\n');
But I believe this will return the same value as the first approach.
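Both answers above are browser JavaScript, while the question uses AngleSharp from C#. The same `<br>`-to-newline idea can be sketched in C# using only the standard library. This is a rough sketch, not AngleSharp's own API: the helper name ToTextWithLineBreaks is hypothetical, it assumes you pass it the element's InnerHtml, and it treats a run of consecutive `<br>` tags as a single line break so that the output matches the question's expected result.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class BreakAwareText
{
    // Hypothetical helper: converts an HTML fragment to plain text,
    // emitting one newline for each run of <br> tags.
    public static string ToTextWithLineBreaks(string innerHtml)
    {
        // Collapse surrounding whitespace plus one or more <br> tags into a single newline.
        string withBreaks = Regex.Replace(
            innerHtml, @"\s*(?:<br\s*/?>\s*)+", "\n", RegexOptions.IgnoreCase);

        // Drop every remaining tag (crude: assumes no '>' inside attribute values).
        string withoutTags = Regex.Replace(withBreaks, @"<[^>]+>", string.Empty);

        // Decode entities such as &amp; and trim both ends.
        return WebUtility.HtmlDecode(withoutTags).Trim();
    }
}
```

With the code from the question, the call would look like BreakAwareText.ToTextWithLineBreaks(items[0].InnerHtml).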
I only see whitespace ("\r\n ... \r\n") in the InnerHtml and InnerText properties, not the actual content. Where am I going wrong?
RENDERED HTML:
<div id="urllist" runat="server">
http://test1t.com
<br></br>
http://test2.com
<br></br>
</div>
C#:
HtmlContainerControl list = (HtmlContainerControl)urllist;
string string1 = list.InnerHtml;
string string2 = list.InnerText;
//this didnt work either
string string1 = urllist.InnerHtml;
string string2 = urllist.InnerText;
If I remember correctly, you have to use Controls[0] to find the literal control that contains the text:
var div = (HtmlGenericControl) urllist;
var lit = (LiteralControl) div.Controls[0];
string text = lit.Text;
Update: tested, it works. This is text:
http://test1t.com
<br></br>
http://test2.com
<br></br>
However, I have now tested your approach as well, and it also works.
I would have added a comment, but I cannot add images in comments. See below, I've tested your code and it works:
Are you sure you don't check your result in a HTML page or that you are not altering your result in any way before you check it?
I have this Html element on the page:
<li id="city" class="anketa_list-item">
<div class="anketa_item-city">From</div>
London
</li>
I found this element:
driver.FindElement(By.Id("city"))
If I try: driver.FindElement(By.Id("city")).Text, => my result: "From\r\nLondon".
How can I get only London by WebDriver?
You could easily get it by class name:
driver.FindElement(By.ClassName("anketa_item-city")).Text;
or by XPath:
driver.FindElement(By.XPath("//li/div")).Text;
You can try this:
var fromCityTxt = driver.FindElement(By.Id("city")).Text;
var city = Regex.Split(fromCityTxt, "\r\n")[1];
Sorry for misleading you. The XPath I provided earlier ended in /text(), which selects the text rather than an element.
The approach for this situation is to get the parent's text, remove the child's text from it, and then trim the result to strip spaces and other whitespace:
var parent = driver.FindElement(By.XPath("//li"));
var child = driver.FindElement(By.XPath("//li/div"));
var london = parent.Text.Replace(child.Text, "").Trim();
Notes:
If Trim() isn't working then it would appear that the "spaces" aren't spaces but some other non printing character, possibly tabs. In this case you need to use the String.Trim method which takes an array of characters:
char[] charsToTrim = { ' ', '\t' };
string result = txt.Trim(charsToTrim);
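One caveat with Replace(child.Text, "") is that it removes every occurrence of the child's text, not only the leading one. A minimal sketch of a stricter variant follows; the helper name OwnText is hypothetical, and it assumes the child's text appears at the start of the parent's text, as in the "From\r\nLondon" example.

```csharp
using System;

static class OwnTextHelper
{
    // Hypothetical helper: returns the parent's text with the child's
    // text removed from the front only.
    public static string OwnText(string parentText, string childText)
    {
        string text = parentText.Trim();

        // Strip the child's text only when it is a prefix of the parent's text.
        if (childText.Length > 0 && text.StartsWith(childText, StringComparison.Ordinal))
        {
            text = text.Substring(childText.Length);
        }

        // Remove the line break and any other whitespace left behind.
        return text.Trim();
    }
}
```

It would be called as OwnTextHelper.OwnText(parent.Text, child.Text).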
This worked for me.
driver.FindElement(By.XPath("//li[@id='city']/text()"));
Sometimes from a 3rd party API I get malformed HTML elements returned:
olor:red">Text</span>
when I expect:
<span style="color:red">Text</span>
For my context, the text content of the HTML is more important so it does not matter if I lose surrounding tags/formatting.
What would be the best way to strip out the malformed tags such that the first example would read
Text
and the second would not change?
I recommend you to take a look at the HtmlAgilityPack, which is a very handy tool also for HTML sanitization.
Here's an example approach using that library:
static void Main()
{
var inputs = new[] {
@"olor:red"">Text</span>",
@"<span style=""color:red"">Text</span>",
@"Text</span>",
@"<span style=""color:red"">Text",
@"<span style=""color:red"">Text"
};
var doc = new HtmlDocument();
inputs.ToList().ForEach(i => {
if (!i.StartsWith("<"))
{
if (i.IndexOf(">") != i.Length-1)
i = "<" + i;
else
i = i.Substring(0, i.IndexOf("<"));
doc.LoadHtml(i);
Console.WriteLine(doc.DocumentNode.InnerText);
}
else
{
doc.LoadHtml(i);
Console.WriteLine(doc.DocumentNode.OuterHtml);
}
});
}
Outputs:
Text
<span style="color:red">Text</span>
Text
<span style="color:red">Text</span>
<span style="color:red">Text</span>
If you just need the content of the tags, and no information of what type of tag etc, you could use Regular Expressions:
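var r = new Regex(">([^>]+)<");
var text = "olor:red\">Text</span>";
var m = r.Match(text);
var inner = m.Groups[1].Value; // the text between '>' and the next '<'
This finds the text between a > and the following <; use r.Matches(text) to collect every such run.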
Very crudely, you could strip out all 'tags' by stripping everything before a > and keeping everything before a <.
I'm assuming you also need to consider the situation where the text you receive has no tags at all, e.g. Text.
In pseudo-code:
returnText = ""
loop:
gtI = text.IndexOf(">")
ltI = text.IndexOf("<")
if -1==gtI and -1==ltI:
returnText += text
we're done
if gtI==-1:
returnText += text up to position ltI
return returnText
if ltI==-1:
returnText += text after gtI
return returnText
if ltI < gtI:
returnText += textBefore ltI
text = text after ltI
loop
// gtI < ltI:
text = text after gtI
loop
It's crude and can be done much better (and faster) with a custom coded parser, but essentially the logic would be the same.
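The pseudo-code above translates fairly directly into C#. The following is a minimal sketch of that same loop, not a robust HTML parser; the class and method names (TagStripper, StripTags) are made up for illustration.

```csharp
using System;
using System.Text;

static class TagStripper
{
    // Keeps text that sits outside '<' ... '>' pairs, tolerating a
    // truncated tag at the start or end of the input.
    public static string StripTags(string text)
    {
        var result = new StringBuilder();
        while (true)
        {
            int gt = text.IndexOf('>');
            int lt = text.IndexOf('<');

            // No tag characters left: everything remaining is content.
            if (gt == -1 && lt == -1) { result.Append(text); break; }

            // Only '<' remains: keep the text before the unfinished tag.
            if (gt == -1) { result.Append(text.Substring(0, lt)); break; }

            // Only '>' remains: keep the text after the tag tail.
            if (lt == -1) { result.Append(text.Substring(gt + 1)); break; }

            if (lt < gt)
            {
                // Content precedes the next tag: keep it, then continue after '<'.
                result.Append(text.Substring(0, lt));
                text = text.Substring(lt + 1);
            }
            else
            {
                // We are inside a tag (or a tag fragment): skip past its '>'.
                text = text.Substring(gt + 1);
            }
        }
        return result.ToString();
    }
}
```

Each iteration shortens the remaining text, so the loop always terminates.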
You should really be asking why the API returns only part of what you require: I can't see why it should be returning ext</span> either, which really messes you up.
I have a Html code and I want to Convert it to plain text but keep only colored text tags.
for example:
when I have below Html:
<body>
This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample<p>
....
and some other tags...
</body>
</html>
I want the output:
this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...
I'd use parser to parse HTML like HtmlAgilityPack, and use regular expressions to find the color value in attributes.
First, find all the nodes that contain style attribute with color defined in it by using xpath:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode
.SelectNodes("//*[contains(@style, 'color')]")
.ToArray();
Then the simplest regex to match a color value: (?<=color:\s*)#?\w+.
var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
Then iterate through these nodes and if there is a regex match, replace the inner html of the node with html encoded tags (you'll understand why a little bit later):
foreach (var node in nodes)
{
var style = node.Attributes["style"].Value;
if (colorRegex.IsMatch(style))
{
var color = colorRegex.Match(style).Value;
node.InnerHtml =
HttpUtility.HtmlEncode("<" + color + ">") +
node.InnerHtml +
HttpUtility.HtmlEncode("</" + color + ">");
}
}
And finally get the inner text of the document and perform html decoding on it (this is because inner text strips all the tags):
var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
This should return something like this:
This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...
Of course you could improve it for your needs.
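One possible improvement: the simple pattern also matches inside longer property names such as background-color, so a style like "background-color:blue; color:#ff9999" would yield blue first. A hedged sketch of a stricter pattern is below; the names StyleColor and ExtractColor are hypothetical, and the pattern still only handles hex and named colours, not forms like rgb(...).

```csharp
using System;
using System.Text.RegularExpressions;

static class StyleColor
{
    // (?<![\w-]) rejects "color" when it is the tail of a longer
    // property name such as "background-color".
    static readonly Regex ColorOnly = new Regex(
        @"(?<![\w-])color\s*:\s*(#?\w+)", RegexOptions.IgnoreCase);

    // Returns the colour value from a style attribute, or null if none is declared.
    public static string ExtractColor(string style)
    {
        Match m = ColorOnly.Match(style);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```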
It is possible to do it using regular expressions but... You should not parse (X)HTML with regex.
The first regexp I came with to solve the problem is:
<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">(\w|\s)+</p>
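Counting opening parentheses, group 2 will be the hex colour (3 or 6 hexadecimal digits) and group 4 will correspond to the text inside the tag. Because group 4 is a repeated group, it holds only the last character matched; wrap the repetition as ((\w|\s)+) to capture the whole text.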
Obviously, it's not the best solution as I'm not a regexp master and obviously it needs some testing and probably generalisation... But still it's a good point to start with.
In C#, Windows Form, how would I accomplish this:
07:55 Header Text: This is the data<br/>07:55 Header Text: This is the data<br/>07:55 Header Text: This is the data<br/>
So, as you can see, i have a return string, that can be rather long, but i want to be able to format the data to be something like this:
<b><font color="Red">07:55 Header Text</font></b>: This is the data<br/><b><font color="Red">07:55 Header Text</font></b>: This is the data<br/><b><font color="Red">07:55 Header Text</font></b>: This is the data<br/>
As you can see, i essentially want to prepend <b><font color="Red"> to the front of the header text & time, and append </font></b> right before the : section.
So yeah lol i'm kinda lost.
I have messed around with .Replace() and Regex patterns, but not with much success. I dont really want to REPLACE text, just append/pre-pend at certain positions.
Is there an easy way to do this?
Note: the [] tags are actually <> tags, but i can't use them here lol
Just because you're using RegEx doesn't mean you have to replace text.
The following regular expression:
(\d+:\d+.*?:)(\s.*?\[br/\])
Has two 'capturing groups.' You can then replace the entire text string with the following:
[b][font color="Red"]\1[/font][/b]\2
Which should result in the following output:
[b][font color="Red"]07:55 Header Text:[/font][/b] This is the data[br/]
[b][font color="Red"]07:55 Header Text:[/font][/b] This is the data[br/]
[b][font color="Red"]07:55 Header Text:[/font][/b] This is the data[br/]
Edit: Here's some C# code which demonstrates the above:
var fixMe = @"07:55 Header Text: This is the data[br/]07:55 Header Text: This is the data[br/]07:55 Header Text: This is the data[br/]";
var regex = new Regex(@"(\d+:\d+.*?:)(\s.*?\[br/\])");
var matches = regex.Matches(fixMe);
var prepend = @"[b][font color=""Red""]";
var append = @"[/font][/b]";
string outputString = "";
foreach (Match match in matches)
{
outputString += prepend + match.Groups[1] + append + match.Groups[2] + Environment.NewLine;
}
Console.Out.WriteLine(outputString);
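The same capture-group idea can also be written as a single Regex.Replace call, where .NET refers to the groups as $1 and $2 in the replacement string. The sketch below is an illustration under a few assumptions: it works on the real <br/> tags from the question, it keeps the colon outside the bold/red markup as the question asks, and it assumes the header text itself contains no colon after the time. The names HeaderHighlighter and Highlight are made up.

```csharp
using System;
using System.Text.RegularExpressions;

static class HeaderHighlighter
{
    // Group 1: the time plus header text, up to (not including) the next ':'.
    // Group 2: the ':' and the data, through the closing <br/>.
    static readonly Regex Entry = new Regex(@"(\d{1,2}:\d{2}[^:]*)(:.*?<br/>)");

    // Wraps each header in <b><font color="Red">...</font></b> and
    // leaves the rest of the entry untouched.
    public static string Highlight(string input)
    {
        return Entry.Replace(input, "<b><font color=\"Red\">$1</font></b>$2");
    }
}
```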
Have you tried .Insert()?
Have you considered creating a style and setting the css class of each line by wrapping each line in a p or div tag?
Easier to maintain and to construct.
The easiest way is probably to use string.Replace() and string.Split(). Say your input string is input (untested):
var output = string.Join("<br/>", input
.Split(new[] { "<br/>" }, StringSplitOptions.RemoveEmptyEntries)
.Select(l => "<b><font color=\"Red\">" + l.Replace(": ", "</font></b>: "))
.ToList()
) + "<br/>";