I am looking for a regular expression to find all input fields of type hidden in html output. Anyone know an expression to do such?
I agree that the link Radomir suggest is correct that HTML should not be parsed with regular expressions. However, I do not agree that nothing meaningful can be gleaned from their use together. And the ensuing rant is totally counter-productive.
To correct Robert's RegEx:
<([^<]*)type=('|")hidden('|")>[^<]*(/>|</.+?>)
I know you asked for regular expression, but download Html Agility Pack and do the following:
var inputs = htmlDoc.DocumentNode.Descendants("input");
foreach (var input in inputs)
{
if( input.Attributes["type"].Value == "hidden" )
// do something
}
You can also use xpath with html agility pack.
Regular expressions are generally the wrong tool for the job when trying to search or manipulate HTML or XML; a parsing library would likely be a much cleaner and easier solution.
That said, if you're just looking through a big file and accuracy isn't critical, you can probably do reasonably well with something like <input[^>]*type="?hidden"?.
Related
Have a xml string, goal is to replace an xml element value to a fixed string, i.e. for blah blah blah replace it to fixed value, I am thinking to use RegEx.Replace instead of loading the string to a DOM model and replace.
Could anyone please help on how to write this regular expression? essentially the goal is to match everything inside element tag 'abc'
Thanks a lot!
This article tells you what you need to know: XML is not Regular
Ignoring the most obvious solution to their problem (which would be to use a pre-existing XML parser), they think they should use regular expressions (regex for short). Now they have two problems.
Use regular expressions only on regular languages.
That said, there are many sites that purport to offer guidance on writing regular expressions for XML. They are all wrong. But they exist, and you can use them at your own risk.
For what it's worth, don't.
Process the XML normally, with a XmlDocument, Xml.Linq or XmlReader/Writer, it's what they are for, cover all kinds of edge cases we couldn't even imagine, and above all, are proven to work.
Don't use a regex for this, please . . . just don't.
My two cents.
let the downvoting begin
Regular expressions are meant to be used on regular languages. XML is a non-regular language. As such, regular expressions cannot be used to properly parse anything written in it. You will need to use a real XML parser, which can be found in the numerous libraries available in C#, to do it.
Regular expressions are not suitable for processing markup. Among other flaws, they won't work if elements can be nested:
<abc> ... <abc> ... </abc> ... </abc>
They are also unable to distinguish a comment from a non-comment.
You need a real XML parser.
I have the following string:
<div id="mydiv">This is a "div" with quotation marks</div>
I want to use regular expressions to return the following:
<div id='mydiv'>This is a "div" with quotation marks</div>
Notice how the id attribute in the div is now surrounded by apostrophes?
How can I do this with a regular expression?
Edit: I'm not looking for a magic bullet to handle every edge case in every situation. We should all be weary of using regex to parse HTML but, in this particular case and for my particular need, regex IS the solution...I just need a bit of help getting the right expression.
Edit #2: Jens helped to find a solution for me but anyone randomly coming to this page should think long and very hard about using this solution. In my case it works because I am very confident of the type of strings that I'll be dealing with. I know the dangers and the risks and make sure you do to. If you're not sure if you know then it probably indicates that you don't know and shouldn't use this method. You've been warned.
This could be done in the following way: I think you want to replace every instance of ", that is between a < and a > with '.
So, you look for each " in your file, look behind for a <, and ahead for a >. The regex looks like:
(?<=\<[^<>]*)"(?=[^><]*\>)
You can replace the found characters to your liking, maybe using Regex.Replace.
Note: While I found the Stack Overflow community most friendly and helpful, these Regex/HTML questions are responded with a little too much anger, in my opinion. After all, this question here does not ask "What regex matches all valid HTML, and does not match anything else."
I see you're aware of the dangers of using Regex to do these kinds of replacements. I've added the following answer for those in search of a method that is a lot more 'stable' if you want to have a solution that will keep working as the input docs change.
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
var nodes = doc.DocumentNode.DescendantNodes();
foreach (var node in nodes)
{
foreach (var att in node.Attributes)
{
att.QuoteType = AttributeValueQuote.SingleQuote;
}
}
var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);
You can match:
(<div.*?id=)"(.*?)"(.*?>)
and replace this with:
$1'$2'$3
How do I parse HTML using regular expressions in C#?
For example, given HTML code
<s2> t1 </s2> <img src='1.gif' /> <span> span1 <span/>
I am trying to obtain
1. <s2>
2. t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>
How do I do this using regular expressions in C#?
In my case, the HTML input is not well-formed XML like XHTML. Therefore I can not use XML parsers to do this.
Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.
This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages, that's why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore cannot be parsed by regular expressions.
You might want to try the Html Agility Pack, http://www.codeplex.com/htmlagilitypack. It even handles malformed HTML.
I used this regx in C#, and it works. Thanks for all your answers.
<([^<]*)>|([^<]*)
you might want to simply use string functions. make < and > as your indicator for parsing.
How can I write a regular expression to replace links with no link text like this:
with
http://www.somesite.com
?
This is what I was trying to do to capture the matches, and it isn't catching any. What am I doing wrong?
string pattern = "<a\\s+href\\s*=\\s*\"(?<href>.*)\">\\s*</a>";
I wouldn't use a regex - I'd use the Html Agility Pack, and a query like:
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[.='']")) {
link.InnerText = link.GetAttribute("href");
}
I could be wrong, but I think you simply need to change the quantifier within the href group to be lazy rather than greedy.
string pattern = #"<a\s+href\s*=\s*""(?<href>.*?)"">\s*</a>";
(I've also changed the type of the string literal to use #, for better readability.)
The rest of the regex appears fine to me. That you're not capturing any matches at all makes me think otherwise, but there could be a problem in the rest of the code (or even the input data - have you verified that?).
I would suggest
string pattern = "(<a\\b[^>]*href=\"([^\"]+)\"[^>]*>)[\\s\\r\\n]*(</a>)";
This way also links with their href attribute somewhere else would be captured.
Replace with
"$1$2$3"
The usual word of warning: HTML and regex are essentially incompatible. Use with caution, this might blow up.
Marc Gravell has the right answer, regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
I have a string of test like this:
<customtag>hey</customtag>
I want to use a RegEx to modify the text between the "customtag" tags so that it might look like this:
<customtag>hey, this is changed!</customtag>
I know that I can use a MatchEvaluator to modify the text, but I'm unsure of the proper RegEx syntax to use. Any help would be much appreciated.
I wouldn't use regex either for this, but if you must this expression should work:
<customtag>(.+?)</customtag>
I'd chew my own leg off before using a regular expression to parse and alter HTML.
Use XSL or DOM.
Two comments have asked me to clarify. The regular expression substitution works in the specific case in the OP's question, but in general regular expressions are not a good solution. Regular expressions can match regular languages, i.e. a sequence of input which can be accepted by a finite state machine. HTML can contain nested tags to any arbitrary depth, so it's not a regular language.
What does this have to do with the question? Using a regular expression for the OP's question as it is written works, but what if the content between the <customtag> tags contains other tags? What if a literal < character occurs in the text? It has been 11 months since Jon Tackabury asked the question, and I'd guess that in that time, the complexity of his problem may have increased.
Regular expressions are great tools and I do use them all the time. But using them in lieu of a real parser for input that needs one is going to work in only very simple cases. It's practically inevitable that these cases grow beyond what regular expressions can handle. When that happens, you'll be tempted to write a more complex regular expression, but these quickly become very laborious to develop and debug. Be ready to scrap the regular expression solution when the parsing requirements expand.
XSL and DOM are two standard technologies designed to work with XML or XHTML markup. Both technologies know how to parse structured markup files, keep track of nested tags, and allow you to transform tags attributes or content.
Here are a couple of articles on how to use XSL with C#:
http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=63
http://www.csharphelp.com/archives/archive78.html
Here are a couple of articles on how to use DOM with C#:
http://msdn.microsoft.com/en-us/library/aa290341%28VS.71%29.aspx
http://blogs.msdn.com/tims/archive/2007/06/13/programming-html-with-c.aspx
Here's a .NET library that assists DOM and XSL operations on HTML:
http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
If there won't be any other tags between the two tags, this regex is a little safer, and more efficient:
<customtag>[^<>]*</customtag>
Most people use HTML Agility Pack for HTML text parsing. However, I find it a little robust and complicated for my own needs. I create a web browser control in memory, load the page, and copy the text from it. (see example below)
You can find 3 simple examples here:
http://jakemdrew.wordpress.com/2012/02/03/getting-only-the-text-displayed-on-a-webpage-using-c/
//This is to replace all HTML Text
var re = new RegExp("<[^>]*>", "g");
var x2 = Content.replace(re,"");
//This is to replace all
var x3 = x2.replace(/\u00a0/g,'');