Parsing HTML page to extract links [duplicate] - c#

This question already has answers here:
How to extract full url with HtmlAgilityPack - C#
(2 answers)
Closed 8 years ago.
I want to parse an HTML file and extract the links from its <a> tags. For example, I am trying to extract the link from the following <a> tag:
<a class="thumb vtop inlblk rel tdnone linkWithHash scale5 detailsLink" href="http://olx.com.pk/item/honda-civic-exi-2005-IDSkzkt.html#6256e9ac30" title=""> <img class="fleft" src="http://img03.olx.com.pk/images_olxpk/89491775_1_144x108_honda-civic-exi-2005-lahore_rev001.jpg" alt="Honda Civic Exi 2005"> </a>
I use the following regular expression:
private const string _LINK_REGEX = "href=\"[a-zA-Z./:&\\d_-]+\"";
But it does not match this URL.

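Your pattern fails because the URL contains a # character, which is not in your character class. Instead of listing the allowed characters, you can match everything up to the closing quote:
href=\"[^\"]+\"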
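As a rough illustration of the pattern above, here is a minimal sketch that wraps the URL part in a capture group so the surrounding href="..." is not included in the result. The class and method names (LinkExtractor, ExtractLinks) are made up for this example, and it assumes every href value is wrapped in double quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Group 1 captures everything between the quotes of href="...".
    // Assumption: attribute values always use double quotes.
    private static readonly Regex LinkRegex =
        new Regex("href=\"([^\"]+)\"", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every href value found in the input.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in LinkRegex.Matches(html))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }
}
```

Calling LinkExtractor.ExtractLinks on the <a> tag from the question should return a single entry, including the #6256e9ac30 fragment. Single-quoted or unquoted attributes are not handled, which is one reason a real HTML parser is usually the safer choice.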

Related

Best way to separate base64 image from a string in C# [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 2 years ago.
I would prefer to use a regex rather than an HTML parser.
What is the best way to extract a base64 image from an HTML string like this:
"<p>This is test </p>
<p><img src=\"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==\"></p>"
I need this part so that I can access the base64 image data:
/9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==
If there is an adequate HTML parser for this use case as suggested by others in the comments, go for that...
But if that doesn't work, regular expressions to the rescue! This uses a positive lookbehind assertion and matches everything up to the first double quote. It should work; adjust it if it doesn't.
var val = "<p>This is test </p><p><img src=\"data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==";
var match = Regex.Match(val, "(?<=data:image/jpeg;base64,)[^\"]*");
Console.WriteLine(match.Value);
// output: /9j/4AAQSkZJRgABAQAAAQABAAD/4gKgSUNDX1BST0ZJTEUA....+tzPaXLlstlSjpcxKPEqV/zH//2Q==

How to remove Only HTML tags in the program [duplicate]

This question already has an answer here:
Retrieving Inner Text of Html Tag C#
(1 answer)
Closed 3 years ago.
I want to remove the HTML tags from some source text with C#.
Unfortunately, the text also contains content like <This is content>, which is not an HTML tag.
First, I tried the Regex class like this:
Regex.Replace(htmltext,"[\\x00-\\x1f<>:\"/\\\\|?*]" +
"|^(CON|PRN|AUX|NUL|COM[0-9]|LPT[0-9]|CLOCK\\$)(\\.|$)" +
"|[\\. ]$", String.Empty);
But in this case,
"<This is content>" was removed as well.
So can anyone please tell me how to remove only the HTML tags?
Thanks and regards.
Don't try and parse HTML with Regex. It tends not to go well.
Use a parser, HTML Agility Pack is very popular.
Using HTML agility pack you can simply call InnerText to extract the contents without HTML tags.
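A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is installed. The class and method names (TagStripper, StripTags) are invented for this example.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class TagStripper
{
    // Hypothetical helper: returns the text of the document without any markup.
    public static string StripTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes of the whole document;
        // DeEntitize converts entities such as &amp; back to plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

One caveat: a parser will most likely treat <This is content> as a tag as well, so placeholder text like that would probably need to be escaped (as &lt;This is content&gt;) before parsing if it has to survive.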

C# (.NET), Html parse using regex [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 6 years ago.
Using Regex, I'm trying to get data from HTML code, but I don't know how to build the pattern without relying on any HTML tags.
I have a marker string (item-desc) and the number of characters after this string that make up my data.
For example, given item-desc12345abcde and a length of 6 characters, I should get 12345a.
This expression gives me only one character after my string:
Regex itemInfoFilter = new Regex(@"item-desc\s*(.+?)\s*>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
I don't recommend using regular expressions to parse HTML.
Use an HTML parser instead:
HTML Agility Pack
From what I understand of your question, I think this should work: item-desc(.){6}(?=[\s'"])
In the pattern I assume that your data is followed by whitespace (\s), ' or ".
Hope this helps.
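Note that in a pattern like (.){6} the group is repeated, so group 1 only holds the last of the six characters, and the lookahead requires a space or quote right after the sixth character. A minimal sketch that instead captures all the characters in one group and drops the lookahead might look like this; the names ItemDescParser and TakeAfterMarker are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ItemDescParser
{
    // Hypothetical helper: returns the `count` characters that follow `marker`,
    // or an empty string when the marker (plus enough characters) is not found.
    public static string TakeAfterMarker(string input, string marker, int count)
    {
        // Regex.Escape guards against special characters in the marker;
        // (.{N}) captures exactly N characters (excluding newlines) as group 1.
        string pattern = Regex.Escape(marker) + "(.{" + count + "})";
        Match match = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

With the example from the question, ItemDescParser.TakeAfterMarker("item-desc12345abcde", "item-desc", 6) returns 12345a.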

Easiest way to extract some html from string [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 9 years ago.
I have a long C# string of HTML code and I want to extract only the bullet points ("<ul><li></li></ul>").
Say I have the following HTML string:
var html = "<div class=ClassC441AA82DA8C5C23878D8>Here is a text that should be ignored.</div>This text should be ignored too<br><ul><li>* Need this one</li><li>Another bullet point I need</li><li>A bulletpoint again that I want</li><li>And this is the last bullet I want</li></ul><div>Ignore this line and text</div><p>Ignore this as well.</p>Text not important.";
I need everything between the '<ul>' and '</ul>' tags. The '<ul>' tags themselves can be excluded.
Regular expressions are not my strong suit, but if they can be used here I need some help.
My code is in C#.
You should use the HtmlAgilityPack for things like this. I wrote a little introduction to it a while ago that may help you get going: http://colinmackay.scot/2011/03/22/a-quick-intro-to-the-html-agility-pack/
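For the bullet-point case, a minimal sketch with the HtmlAgilityPack could look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the names BulletExtractor and ExtractItems are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class BulletExtractor
{
    // Hypothetical helper: returns the text of every <li> that sits directly inside a <ul>.
    public static List<string> ExtractItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//ul/li");
        if (nodes == null)
        {
            return items;
        }

        foreach (var li in nodes)
        {
            items.Add(li.InnerText.Trim());
        }
        return items;
    }
}
```

If the markup between the tags is needed rather than the text of each item, selecting the //ul node and reading its InnerHtml property is an alternative.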

Check for opening and closing tags with HTML Agility Pack or Regexp? [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I have a little Text Editor written in C#.
I need to open HTML files (Already done) in plain text, and check for correct opening and closing tags. For example, if I have this:
<body> Text </body> It should say it is correct, but if I have: <body> <body> it should say it is wrong.
Any way to get this with HTML Agility Pack or a Regexp in C#?
using System.Linq;      // for Any()
using HtmlAgilityPack;  // for HtmlDocument

public bool IsCorrectHtml(string html)
{
    var htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(html);
    var parseErrors = htmlDocument.ParseErrors;
    return !parseErrors.Any(); // return true if the parser reported no errors.
}
