I’ve been scraping a website using HtmlAgilityPack, but I need the links to print out in the proper format. On the page I am scraping, some of the links include the proper “https://...” prefix, but most start with something else.
For example, a few of the links print starting with “/xxx” or simply “.//”. Is there any way to go through the links I have scraped and print them with the proper “https://” prefix in front?
Currently my code looks like this:
var hg = doc.DocumentNode.SelectNodes("//body[@class]");
//Sort through list and print
foreach (var node in hg)
{
    foreach (HtmlNode node2 in node.SelectNodes(".//a[@href]"))
    {
        string attributeValue = node2.GetAttributeValue("href", "");
        if (attributeValue[0:7] != "https://")
        {
            Console.WriteLine("https://url/" + node2.Attributes["href"].Value);
        }
    }
}
Console.ReadLine();
I’ve been trying to use indexing of the attributeValue string to see what the link starts with, but keep getting an error telling me I can’t use indexing there. Perhaps there is a better way to check the beginning of the links I am unaware of?
I’m a novice at C#, and any help understanding this issue would be greatly appreciated!
Try using StartsWith as opposed to indexing the string:
var hg = doc.DocumentNode.SelectNodes("//body[@class]");
//Sort through list and print
foreach (var node in hg)
{
    foreach (HtmlNode node2 in node.SelectNodes(".//a[@href]"))
    {
        string attributeValue = node2.GetAttributeValue("href", "");
        if (!attributeValue.StartsWith("https://"))
        {
            Console.WriteLine("https://url/" + node2.Attributes["href"].Value);
        }
    }
}
Console.ReadLine();
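Checking the prefix works, but prepending a fixed string produces a doubled slash for hrefs such as "/xxx" and does not deal with "./" paths. An alternative (not what the answer above does) is to let System.Uri resolve each href against the address of the page. This is only a minimal sketch: the page address is a placeholder, ToAbsolute is a hypothetical helper name, and it is written with top-level statements, so it assumes C# 9 or later.

```csharp
using System;

// Placeholder: the address the scraped page was downloaded from (an assumption).
var pageUri = new Uri("https://example.com/section/page.html");

// Hypothetical helper: resolves an href against the page address.
// Absolute hrefs come back unchanged; relative ones are combined with pageUri.
string ToAbsolute(string href)
{
    return Uri.TryCreate(pageUri, href, out var absolute)
        ? absolute.AbsoluteUri
        : href; // leave values that cannot be parsed untouched
}

Console.WriteLine(ToAbsolute("/xxx"));
Console.WriteLine(ToAbsolute("./img/logo.png"));
Console.WriteLine(ToAbsolute("https://other.example/page"));
```

Inside the scraping loop, the value returned by GetAttributeValue("href", "") would be passed to ToAbsolute before printing.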
I have a text file that contains only the full version number of an application, which I need to extract and parse into separate variables.
For example, let's say version.cs contains 19.1.354.6
The code I'm using does not seem to be working:
char[] delimiter = { '.' };
string currentVersion = System.IO.File.ReadAllText(@"C:\Applicaion\version.cs");
string[] partsVersion;
partsVersion = currentVersion.Split(delimiter);
string majorVersion = partsVersion[0];
string minorVersion = partsVersion[1];
string buildVersion = partsVersion[2];
string revisVersion = partsVersion[3];
Although your problem is most likely with the file (it probably contains other text besides the version), why don't you use the Version class, which is designed for exactly this kind of task?
var version = new Version("19.1.354.6");
var major = version.Major; // etc..
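Applied to text read from a file, the same idea might look like the sketch below. It uses Version.TryParse so that unexpected file content is reported instead of throwing. ParseVersion is a hypothetical helper name, the literal string stands in for the File.ReadAllText call, and top-level statements (C# 9 or later) are assumed.

```csharp
using System;

// Hypothetical helper: returns null when the text is not a valid version string.
Version ParseVersion(string text)
{
    // Trim() drops the trailing newline or spaces a text file often carries.
    return Version.TryParse(text.Trim(), out var parsed) ? parsed : null;
}

// Stand-in for File.ReadAllText(...) so the sketch runs without the file.
string fileText = "19.1.354.6\r\n";

Version version = ParseVersion(fileText);
if (version != null)
{
    Console.WriteLine($"{version.Major} {version.Minor} {version.Build} {version.Revision}");
}
else
{
    Console.WriteLine("The file does not contain a valid version number.");
}
```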
What you have works fine with the correct input, so I would suggest making sure there is nothing else in the file you're reading.
In the future, please provide error information, since we can't usually tell exactly what you expect to happen, only what we know should happen.
In light of that, I would also suggest looking into using Regex for parsing in the future. In my opinion, it provides a much more flexible solution for your needs. Here's an example of regex to use:
var regex = new Regex(@"([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)");
var match = regex.Match("19.1.354.6");
if (match.Success)
{
    Console.WriteLine("Match[1]: "+match.Groups[1].Value);
    Console.WriteLine("Match[2]: "+match.Groups[2].Value);
    Console.WriteLine("Match[3]: "+match.Groups[3].Value);
    Console.WriteLine("Match[4]: "+match.Groups[4].Value);
}
else
{
    Console.WriteLine("No match found");
}
which outputs the following:
// Match[1]: 19
// Match[2]: 1
// Match[3]: 354
// Match[4]: 6
I want to find all the Instagram URLs within a string and replace them with the embed URL.
I'm keen on performance, though, as this could be 5 to 20 posts, each up to 6000 characters long, containing an unknown number of Instagram URLs that need converting.
URL examples (any of these could appear in each string, so all of them need to match):
http://instagram.com/p/xPnQ1ZIY2W/?modal=true
http://instagram.com/p/xPnQ1ZIY2W/
http://instagr.am/p/xPnQ1ZIY2W/
And this is what I need to replace them with (An embedded version)
<img src="http://instagram.com/p/xPnQ1ZIY2W/media/?size=l" class="instagramimage" />
I was thinking about going for regex, but is that the quickest and most performant way of doing this?
Any examples greatly appreciated.
Something like:
Regex reg = new Regex(@"http://instagr\.?am(?:\.com)?/\S*");
Edited regex. However, I would combine this with a StringReader and process the text line by line, then put each line (modified or not) into a StringBuilder:
string original = @"someotherText http://instagram.com/p/xPnQ1ZIY2W/?modal=true some other text
some other text http://instagram.com/p/xPnQ1ZIY2W/ some other text
some other text http://instagr.am/p/xPnQ1ZIY2W/ some other text";
StringBuilder result = new StringBuilder();
using (StringReader reader = new StringReader(original))
{
    // Peek returns -1 once the end of the string is reached.
    while (reader.Peek() >= 0)
    {
        string line = reader.ReadLine();
        if (reg.IsMatch(line))
        {
            string url = reg.Match(line).ToString();
            result.AppendLine(reg.Replace(line, string.Format("<img src=\"{0}\" class=\"instagramimage\" />", url)));
        }
        else
        {
            result.AppendLine(line);
        }
    }
}
Console.WriteLine(result.ToString());
You mean like this?
class Program
{
    private static Regex reg = new Regex(@"http://instagr\.?am(?:\.com)?/\S*", RegexOptions.Compiled);
    private static Regex idRegex = new Regex(@"(?<=p/).*?(?=/)", RegexOptions.Compiled);
    static void Main(string[] args)
    {
        string original = @"someotherText http://instagram.com/p/xPnQ1ZIY2W/?modal=true some other text
some other text http://instagram.com/p/xPnQ1ZIY2W/ some other text
some other text http://instagr.am/p/xPnQ1ZIY2W/ some other text";
        StringBuilder result = new StringBuilder();
        using (StringReader reader = new StringReader(original))
        {
            // Peek returns -1 once the end of the string is reached.
            while (reader.Peek() >= 0)
            {
                string line = reader.ReadLine();
                if (reg.IsMatch(line))
                {
                    string url = reg.Match(line).ToString();
                    // Build the embed tag from the post id extracted from the matched URL.
                    result.AppendLine(reg.Replace(line, string.Format("<img src=\"http://instagram.com/p/{0}/media/?size=l\" class=\"instagramimage\" />", idRegex.Match(url).ToString())));
                }
                else
                {
                    result.AppendLine(line);
                }
            }
        }
        Console.WriteLine(result.ToString());
    }
}
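The two-regex, line-by-line version above reuses the first URL on a line for every match on that line. A possible alternative is a single Regex.Replace over the whole string, with the post id captured in a group and referenced as $1 in the replacement. This is a sketch under assumptions, not a tested drop-in: the character class for the id ([A-Za-z0-9_-]) is a guess based on the sample URLs, ToEmbeds is a hypothetical name, and top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Group 1 captures the post id; an optional query string is matched but dropped.
// The id character class is an assumption based on the sample URLs.
var instagramUrl = new Regex(
    @"https?://(?:www\.)?instagr(?:am\.com|\.am)/p/([A-Za-z0-9_-]+)/?(?:\?\S*)?",
    RegexOptions.Compiled);

// $1 in the replacement pattern is substituted with the captured id.
const string embed = "<img src=\"http://instagram.com/p/$1/media/?size=l\" class=\"instagramimage\" />";

// Hypothetical helper: replaces every Instagram post URL in the text in one pass.
string ToEmbeds(string text) => instagramUrl.Replace(text, embed);

string sample = "see http://instagr.am/p/xPnQ1ZIY2W/ and http://instagram.com/p/xPnQ1ZIY2W/?modal=true now";
Console.WriteLine(ToEmbeds(sample));
```

Because each match is replaced using its own captured id, several different URLs on one line are handled independently. Whether this is faster than the StringReader loop is something to measure on real posts.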
A well-crafted and compiled regular expression is hard to beat, especially since you're doing replacements, not just searching, but you should test to be sure.
If the Instagram URLs are only within HTML attributes, here's my first stab at a pattern to look for:
(?<=")(https?://instagr[^">]+)
(I added a check for https as well, which you didn't mention but I believe is supported by Instagram.)
Some false positives are theoretically possible, but it will perform better than pedantically matching every legal variation of an Instagram URL. (The ">" check is just in case the HTML is missing the end quote for some reason.)
First of all: Sorry for my bad English!
I know the title isn't the best English, but I don't really know how to format this question...
What I'm trying to do is read an HTML source line by line so that when it sees a given word (like http://) it copies the entire line, so I can strip the rest and keep only the URL.
This is what I've tried:
using (var source = new StreamReader(TempFile))
{
    string line;
    while ((line = source.ReadLine()) != null)
    {
        if (line.Contains("http://"))
        {
            Console.WriteLine(line);
        }
    }
}
This works perfectly if I want to read from an external file, but it doesn't work when I want to read a string or StringBuilder. How do you read those line by line?
You can use new StringReader(theString) to do that with a string, but I question your overall strategy. That would be better done with a tool like HTML Agility Pack.
For example, here is HTML Agility Pack extracting all hyperlinks:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(theString);
// Note: SelectNodes returns null when nothing matches.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    Console.WriteLine(att.Value);
}
Well a string is just a string, it doesn't have any lines.
You can use something like String.Split to separate on the \r symbol.
MSDN: String.Split()
string words = "This is a list of words, with: a bit of punctuation" +
"\rand a newline character.";
string [] split = words.Split(new Char [] {'\r' });
foreach (string s in split) {
if (s.Trim() != "")
Console.WriteLine(s);
}
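Splitting on '\r' alone leaves a leading '\n' on each piece when the text uses Windows line endings, and it does not split Unix-style text at all. A variation that handles all three line-break styles could look like the sketch below. SplitLines is a hypothetical helper name, and top-level statements (C# 9 or later) are assumed.

```csharp
using System;

// Hypothetical helper: splits text on Windows, Unix and old Mac line breaks.
string[] SplitLines(string text)
{
    // "\r\n" is listed first so a Windows line break is consumed as one separator.
    return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries);
}

foreach (string line in SplitLines("first\r\nsecond\nthird\rfourth"))
{
    Console.WriteLine(line);
}
```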
Firstly, you can use a StringReader.
Another option is to create a MemoryStream from the string via converting the string to a byte array first, as described in https://stackoverflow.com/a/10380166/396583
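The StringReader approach keeps the same ReadLine loop as the file-based code in the question. A minimal sketch, where LinesContaining is a hypothetical helper name, the StringBuilder content is made-up sample data, and top-level statements (C# 9 or later) are assumed:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: collects every line of `text` that contains `needle`.
List<string> LinesContaining(string text, string needle)
{
    var hits = new List<string>();
    using (var reader = new StringReader(text))
    {
        string line;
        // ReadLine returns null once the end of the string is reached.
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Contains(needle))
            {
                hits.Add(line);
            }
        }
    }
    return hits;
}

// Made-up sample input; a StringBuilder is read by calling ToString() first.
var source = new StringBuilder();
source.AppendLine("<p>no link</p>");
source.AppendLine("<a href=\"http://example.com/\">link</a>");

foreach (string hit in LinesContaining(source.ToString(), "http://"))
{
    Console.WriteLine(hit);
}
```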
I think you can tokenize the input and check each entry for the required content.
string[] info = myStringBuilder.ToString().Split(' ');
foreach (var item in info)
{
    if (item.Contains("http://"))
    {
        //work with it
    }
}
You can use a memory stream to read from.
Using ASP.NET, how can I strip the HTML tags from a given string reliably (i.e. not using regex)? I am looking for something like PHP's strip_tags.
Example:
<ul><li>Hello</li></ul>
Output:
"Hello"
I am trying not to reinvent the wheel, but I have not found anything that meets my needs so far.
If it is just stripping all HTML tags from a string, this works reliably with regex as well. Replace:
<[^>]*(>|$)
with the empty string, globally. Don't forget to normalize the string afterwards, replacing:
[\s\r\n]+
with a single space, and trimming the result. Optionally replace any HTML character entities back to the actual characters.
Note:
There is a limitation: HTML and XML allow > in attribute values. This solution will return broken markup when encountering such values.
The solution is technically safe, as in: The result will never contain anything that could be used to do cross site scripting or to break a page layout. It is just not very clean.
As with all things HTML and regex:
Use a proper parser if you must get it right under all circumstances.
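The steps described in this answer (remove tags, optionally decode entities, normalize whitespace) can be sketched in C# as follows. StripTags is a hypothetical helper name, the whitespace pattern is shortened to \s+ (which already covers \r and \n), and top-level statements (C# 9 or later) are assumed. The limitation noted above about > inside attribute values still applies.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper implementing the three steps from the answer.
string StripTags(string html)
{
    // 1. Remove anything that looks like a tag, including an unterminated one at the end.
    string noTags = Regex.Replace(html, @"<[^>]*(>|$)", string.Empty);
    // 2. Optional: turn entities such as &amp; back into characters.
    string decoded = WebUtility.HtmlDecode(noTags);
    // 3. Collapse whitespace runs into single spaces and trim the ends.
    return Regex.Replace(decoded, @"\s+", " ").Trim();
}

Console.WriteLine(StripTags("<ul><li>Hello</li></ul>")); // Hello
```

Once entities are decoded, the result can contain literal < and > characters again, so it should be encoded before being written back into an HTML page.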
Go download HTMLAgilityPack, now! ;)
This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner values of all attributes. Seriously, it will take you about 10 lines of code at the maximum. It is one of the greatest free .net libraries out there.
Here is a sample:
string htmlContents = new System.IO.StreamReader(resultsStream,Encoding.UTF8,true).ReadToEnd();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlContents);
if (doc == null) return null;
string output = "";
foreach (var node in doc.DocumentNode.ChildNodes)
{
    output += node.InnerText;
}
Regex.Replace(htmlText, "<.*?>", string.Empty);
protected string StripHtml(string Txt)
{
    return Regex.Replace(Txt, "<(.|\\n)*?>", string.Empty);
}
Protected Function StripHtml(Txt as String) as String
    Return Regex.Replace(Txt, "<(.|\n)*?>", String.Empty)
End Function
I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable.
In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:
System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;
I have written a pretty fast method in C# which beats the hell out of the Regex. It is hosted in an article on CodeProject.
Its advantages include, besides better performance, the ability to replace named and numbered HTML entities (such as &amp; and &#203;), replace comment blocks, and more.
Please read the related article on CodeProject.
Thank you.
For those of you who can't use the HtmlAgilityPack, .NET's XML reader is an option. It can fail even on well-formatted HTML, though, so always add a catch with a regex as a backup. Note that this is NOT fast, but it does provide a nice opportunity for old-school step-through debugging.
public static string RemoveHTMLTags(string content)
{
    var cleaned = string.Empty;
    try
    {
        StringBuilder textOnly = new StringBuilder();
        // Wrap the fragment in a root element so it can be read as XML.
        using (var reader = XmlReader.Create(new System.IO.StringReader("<xml>" + content + "</xml>")))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                    textOnly.Append(reader.ReadContentAsString());
            }
        }
        cleaned = textOnly.ToString();
    }
    catch
    {
        //A tag is probably not closed. fallback to regex string clean.
        string textOnly = string.Empty;
        Regex tagRemove = new Regex(@"<[^>]*(>|$)");
        Regex compressSpaces = new Regex(@"[\s\r\n]+");
        textOnly = tagRemove.Replace(content, string.Empty);
        textOnly = compressSpaces.Replace(textOnly, " ");
        cleaned = textOnly;
    }
    return cleaned;
}
string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);
I've looked at the Regex based solutions suggested here, and they don't fill me with any confidence except in the most trivial cases. An angle bracket in an attribute is all it would take to break them, let alone malformed HTML from the wild. And what about entities like &amp;? If you want to convert HTML into plain text, you need to decode entities too.
So I propose the method below.
Using HtmlAgilityPack, this extension method efficiently strips all HTML tags from an HTML fragment. It also decodes HTML entities like &amp; and returns just the inner text items, with a new line between each text item.
public static string RemoveHtmlTags(this string html)
{
    if (String.IsNullOrEmpty(html))
        return html;
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    if (doc.DocumentNode == null || doc.DocumentNode.ChildNodes == null)
    {
        return WebUtility.HtmlDecode(html);
    }
    var sb = new StringBuilder();
    var i = 0;
    foreach (var node in doc.DocumentNode.ChildNodes)
    {
        var text = node.InnerText.SafeTrim();
        if (!String.IsNullOrEmpty(text))
        {
            sb.Append(text);
            if (i < doc.DocumentNode.ChildNodes.Count - 1)
            {
                sb.Append(Environment.NewLine);
            }
        }
        i++;
    }
    var result = sb.ToString();
    return WebUtility.HtmlDecode(result);
}
public static string SafeTrim(this string str)
{
    if (str == null)
        return null;
    return str.Trim();
}
If you are really serious, you'd want to ignore the contents of certain HTML tags too (<script>, <style>, <svg>, <head>, <object> come to mind!) because they probably don't contain readable content in the sense we are after. What you do there will depend on your circumstances and how far you want to go, but using HtmlAgilityPack it would be pretty trivial to whitelist or blacklist selected tags.
If you are rendering the content back to an HTML page, make sure you understand XSS vulnerabilities and how to prevent them, i.e. always encode any user-entered text that gets rendered back onto an HTML page (> becomes &gt; etc.).
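As a small illustration of that encoding step, WebUtility.HtmlEncode from System.Net turns markup characters into entities before the text is placed in a page. RenderComment is a hypothetical helper name and the surrounding paragraph tag is just an example; top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Net;

// Hypothetical helper: wraps user-supplied text in a paragraph, encoding it first.
string RenderComment(string userText)
{
    return "<p>" + WebUtility.HtmlEncode(userText) + "</p>";
}

Console.WriteLine(RenderComment("<script>alert(1)</script>"));
// prints: <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```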
For those who are complaining about Michael Tiptop's solution not working, here is the .NET 4+ way of doing it:
public static string StripTags(this string markup)
{
    try
    {
        StringReader sr = new StringReader(markup);
        XPathDocument doc;
        using (XmlReader xr = XmlReader.Create(sr,
                   new XmlReaderSettings()
                   {
                       ConformanceLevel = ConformanceLevel.Fragment
                       // for multiple roots
                   }))
        {
            doc = new XPathDocument(xr);
        }
        return doc.CreateNavigator().Value; // .Value is similar to .InnerText of
                                            // XmlDocument or JavaScript's innerText
    }
    catch
    {
        return string.Empty;
    }
}
using System.Text.RegularExpressions;
string str = Regex.Replace(HttpUtility.HtmlDecode(HTMLString), "<.*?>", string.Empty);
You can also do this with AngleSharp, which is an alternative to HtmlAgilityPack (not that HAP is bad). It is easier to use than HAP for getting the text out of an HTML source.
var parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(source);
var text = htmlDocument.Body.Text();
You can take a look at the key features section where they make a case at being "better" than HAP. I think for the most part, it is probably overkill for the current question but still, it is an interesting alternative.
For the second parameter, i.e. keeping some tags, you may need some code like this using HtmlAgilityPack:
public string StripTags(HtmlNode documentNode, IList<string> keepTags)
{
    var result = new StringBuilder();
    foreach (var childNode in documentNode.ChildNodes)
    {
        if (childNode.Name.ToLower() == "#text")
        {
            result.Append(childNode.InnerText);
        }
        else
        {
            if (!keepTags.Contains(childNode.Name.ToLower()))
            {
                // Tag is not kept: emit only its (recursively stripped) content.
                result.Append(StripTags(childNode, keepTags));
            }
            else
            {
                // Tag is kept: keep its outer markup but strip the content inside it.
                result.Append(childNode.OuterHtml.Replace(childNode.InnerHtml, StripTags(childNode, keepTags)));
            }
        }
    }
    return result.ToString();
}
More explanation on this page: http://nalgorithm.com/2015/11/20/strip-html-tags-of-an-html-in-c-strip_html-php-equivalent/
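Simply use string.StripHTML(); note that this is not a built-in .NET method, so it only works if you have defined an extension method with that name yourself.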