Regex c# questions - c#

I have some HTML.
I parse it with this regex:
MatchCollection matches = Regex.Matches(go, @"photoWrapper""><div><a href=""(?<id>[^""]+?)\?");
I receive:
matches[0].Groups["id"].Value = "/group/47502002094086";
matches[1].Groups["id"].Value = "/dk";
matches[2].Groups["id"].Value = "/prostooglavnom";
How can I edit my regex, or add something, so that matches contains only:
matches[0].Groups["id"].Value = "47502002094086";
matches[1].Groups["id"].Value = "prostooglavnom";
Any help?
Full HTML code: http://pastebin.com/xEJNiD4G

You have just discovered for yourself why Regex is a poor choice for parsing HTML.
I suggest you use the HTML Agility Pack to parse and query your HTML.
The source download comes with many example projects.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
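For the question above, a minimal sketch with HTML Agility Pack might look like the following. It assumes the wrapper element carries a class attribute containing photoWrapper (guessed from the regex, not confirmed against the pastebin markup), and the names PhotoLinkExtractor, ExtractIds and LastPathSegment are made up for illustration. It keeps only the last path segment of each link, so "/group/47502002094086" becomes "47502002094086". It does not filter out entries such as "/dk", because the rule for excluding them is not stated in the question.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the framework

public static class PhotoLinkExtractor
{
    // Collects the last path segment of every link found inside an element
    // whose class attribute contains "photoWrapper" (an assumption about the markup).
    public static List<string> ExtractIds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var ids = new List<string>();
        var links = doc.DocumentNode.SelectNodes(
            "//*[contains(@class, 'photoWrapper')]//a[@href]");
        if (links == null)
        {
            return ids; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", "");
            ids.Add(LastPathSegment(href));
        }
        return ids;
    }

    // Drops the query string, then returns whatever follows the last '/'.
    public static string LastPathSegment(string href)
    {
        int queryStart = href.IndexOf('?');
        string path = queryStart >= 0 ? href.Substring(0, queryStart) : href;
        path = path.TrimEnd('/');
        return path.Substring(path.LastIndexOf('/') + 1);
    }
}
```

Calling PhotoLinkExtractor.ExtractIds(go) would then return the ids as a list of strings. Any extra filtering can be applied to that list afterwards.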

Related

Html.Raw in controller

I get content from an editor, so the content includes HTML tags, like this: "dddd"
I must remove the HTML tags from the content because I write it to a PDF (generated in a C# controller action) using itextsharp.DLL, but iTextSharp does not render the HTML tags; they end up in the output as literal text.
There is no Html.Raw or HtmlHelper.Raw function available in a C# controller action.
What should I do? I tried to remove the HTML tags with a regex, but the content is very complex and dynamic, so there are a great many HTML tags.
One approach would be to use an HTML parser like the HTML Agility Pack. I've used this successfully for problems like the one you describe (but am otherwise unaffiliated with its development). From the site:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
You'll find lots of examples online to tailor to your needs.
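As a starting point, here is a rough sketch of stripping tags with HTML Agility Pack before handing the text to the PDF writer. The class and method names (HtmlTextExtractor, ToPlainText) are invented for this example. Note that InnerText simply concatenates all text nodes, so the contents of script or style elements would be included if the editor output contains any.

```csharp
using HtmlAgilityPack; // third-party NuGet package

public static class HtmlTextExtractor
{
    // Parses the markup and returns only its text content.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText joins the text nodes; DeEntitize turns entities
        // such as &amp; back into the characters they stand for.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

In the controller action, the result of HtmlTextExtractor.ToPlainText(content) could be passed to iTextSharp instead of the raw editor content.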
You can get the effect of Html.Raw and Json.Encode in a controller like this.
Example
If I use this in a view:
var attrilist = @Html.Raw(Json.Encode(attriFeildlist));
then I can use the following as an alternative in the controller:
var jsonencode = System.Web.Helpers.Json.Encode(attriFeildlist);
var htmlencode= WebUtility.HtmlEncode(jsonencode);

How can I extract text from HTML without using third-party libraries?

_request = (HttpWebRequest)WebRequest.Create(url);
_response = (HttpWebResponse) _request.GetResponse();
StreamReader streamReader = new StreamReader(_response.GetResponseStream());
string text = streamReader.ReadToEnd();
This gives me text with HTML tags. How can I get the text without the HTML tags?
How do you extract text from dynamic HTML without using 3rd party libraries? Simple, you invent your own HTML parsing library using the string parsing functions present in the .NET framework.
Seriously, doing this by yourself is a bad idea. If you're pulling dynamic HTML off the web, you have to be prepared for different closing tags, mismatched tags, missing end tags, and so forth. Unless you have a really good reason why you need to write one yourself, just use HTML Agility Pack, and let that do the hard work for you.
Also, make sure you're not succumbing to Not Invented Here Syndrome.
Try this:
System.Xml.XmlDocument docXML = new System.Xml.XmlDocument();
docXML.Load(url);
string textWithoutTags = docXML.InnerText;
Be happy :)
1) Do not use Regular Expressions. (see this great StackOverflow post: RegEx match open tags except XHTML self-contained tags)
2) Use HtmlAgilityPack. But I see you do not want 3rd Party libraries, so we are forced to....
3) Use XmlReader. You can pretty much use the example code straight from MSDN, and just ignore all cases of XmlNodeType except for XmlNodeType.Text. For that case simply write your output to a StreamWriter.
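Option 3 could be sketched roughly as below, using only framework classes. It collects the text into a StringBuilder rather than a StreamWriter to keep the example short, and the names XmlTextExtractor and ExtractText are invented here. It only works when the input is well-formed XML (XHTML); typical HTML with unclosed tags or undeclared entities such as &nbsp; will make XmlReader throw an XmlException.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class XmlTextExtractor
{
    // Reads well-formed XHTML and keeps only the text nodes.
    public static string ExtractText(string xhtml)
    {
        var output = new StringBuilder();

        // Ignore a DOCTYPE declaration instead of rejecting it (the default is Prohibit).
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };

        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                // Skip elements, comments, whitespace-only nodes, and so on.
                if (reader.NodeType == XmlNodeType.Text)
                {
                    output.Append(reader.Value);
                }
            }
        }
        return output.ToString();
    }
}
```

With the code from the question, the string returned by streamReader.ReadToEnd() would be passed to XmlTextExtractor.ExtractText.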
This question has been asked before. There are a few ways to do it, including using a Regular Expression or as pointed out by Adrian, the Agility Pack.
See this question: How can I strip HTML tags from a string in ASP.NET?

C# reading html attribute value using regular expression

I have a string:
str = "<img src='http://server/path/a.jpg' />blah blah blah blah";
What would be the regular expression to find the attribute source value?
I do not want to use HTML Agility pack.
Regards,
Don't use regex to parse HTML. Use one of the many available parsers that suits your needs.
But if you really have to, for that string this simple (and in many cases broken) regex could work for you:
\bsrc\s*=\s*["']([^"'>]+)
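Applied from C#, that pattern might be used as in the sketch below. ImgSrcFinder and FindSrc are names made up for the example. The pattern is written as a verbatim string, so each double quote inside it is doubled.

```csharp
using System.Text.RegularExpressions;

public static class ImgSrcFinder
{
    // Returns the first src attribute value found, or null if there is none.
    public static string FindSrc(string html)
    {
        Match match = Regex.Match(html, @"\bsrc\s*=\s*[""']([^""'>]+)");
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

For the string in the question, ImgSrcFinder.FindSrc(str) yields http://server/path/a.jpg. It will still fail on unquoted attribute values and on a "src=" that appears in ordinary text, which is why the parser route is the safer one.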
Use HTML agility pack; even for small things!
:: sigh ::

c# parse html using XPathDocument

I'm trying to parse an HTML page with XPathDocument, but it gives an error because the HTML is not XML.
Is there a way to do this or not?
You should use HtmlAgilityPack. Still the best!
Use something like Html Agility Pack, which can load your HTML into a DOM object that can be traversed with, for example, XPath queries.
Unless your html is in fact xhtml, it is usually not a valid xml structure with correct opening and ending node tags.
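To illustrate the XHTML case: when the markup really is well-formed XML, XPathDocument can read it directly, as in this small sketch (XhtmlLinks and GetHrefs are hypothetical names). Anything malformed, such as an unclosed <br>, makes the constructor throw an XmlException. If the document declares the XHTML namespace on its root element, the unprefixed //a query below will match nothing and a namespace manager would be needed.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XhtmlLinks
{
    // Returns the href of every <a> element in a well-formed, namespace-free document.
    public static List<string> GetHrefs(string xhtml)
    {
        var doc = new XPathDocument(new StringReader(xhtml));
        XPathNavigator navigator = doc.CreateNavigator();

        var hrefs = new List<string>();
        XPathNodeIterator iterator = navigator.Select("//a/@href");
        while (iterator.MoveNext())
        {
            hrefs.Add(iterator.Current.Value);
        }
        return hrefs;
    }
}
```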

HTML Agility to extract PHP tags

What syntax should be used with HTML Agility Pack to extract all PHP tags from a PHP file?
HtmlNodeCollection tags = htmlDoc.DocumentNode.SelectNodes("//??php");
This throws an exception (invalid token).
I tried escaping ? with ?? and \?
Thanks
HTML Agility Pack does choke on nodes with ? in the name. The simplest option is probably to go through the HTML string before you load it into a document object and replace instances of <? with <php, and so on. That doesn't handle any nasty cases like having a string literal on the page with "<?", but really, how often does that happen?
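One way to read that suggestion is sketched below. It is a variation, not the exact replacement described above: it turns <?php into an ordinary <php> start tag and ?> into </php>, so that the block has a name without a question mark. PhpTagMasker and Mask are invented names, short tags such as <?= are not handled, and PHP code that itself contains < characters may still confuse the parser.

```csharp
public static class PhpTagMasker
{
    // Rewrites PHP delimiters into a plain element name so that the
    // document no longer contains node names with '?' in them.
    public static string Mask(string source)
    {
        return source
            .Replace("<?php", "<php>")
            .Replace("?>", "</php>");
    }
}
```

The masked string would then be passed to htmlDoc.LoadHtml, and the blocks should be selectable with SelectNodes("//php"), assuming the parser treats php as an ordinary unknown element.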
