Regular expression - how to match xml value [duplicate] - c#

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I want to use regular expression to get the airline code between <AirlineCode> and </AirlineCode> tags.
I only want the values of the <AirlineCode> tags that are within the <Flight> tags. There are more <AirlineCode> tags outside, and I don't want the airline values from them.
I tried the regex below, but it gives me all airline codes regardless of where they appear. Please help.
var regex = new Regex(@"<AirlineCode>(.*?)</AirlineCode>", RegexOptions.IgnoreCase);
Match m = regex.Match("<PNRViewRS><AirGroup><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>DL</AirlineCode></Carrier></Flight><Flight CnxxIndicator=\"N\"><Arrival></Arrival><Carrier><AirlineCode>AA</AirlineCode></Carrier></Flight></AirGroup></PNRViewRS>");
int matchCount = 0;
while (m.Success)
{
    Console.WriteLine("Match" + (++matchCount));
    for (int i = 1; i <= 2; i++)
    {
        Group g = m.Groups[i];
        //do stuff...
    }
    m = m.NextMatch();
}

In general, it's a bad idea to try parsing XML with regular expressions. The reason is that regex is insufficiently expressive, even with back references and such. The questions linked in the comments are worth reading to understand why this is generally a bad idea.
That said, you can be successful if you know for certain the format of your file, and if you're willing to do a little non-regex parsing as well.
In your situation, you have essentially:
<Flight>
    <AirlineCode>
    </AirlineCode>
</Flight>
<AirlineCode>
</AirlineCode>
<Flight>
    <AirlineCode>
    </AirlineCode>
</Flight>
And you want all of the <AirlineCode> tags that occur within <Flight> tags.
The way to approach this problem is to extract the <Flight> tags and their contents with one regex, and then use another regex to extract the <AirlineCode> tags from those extracted <Flight> tags. Don't try to do it in a single regular expression. You will not succeed.
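A minimal sketch of that two-pass idea, assuming <Flight> elements are never nested inside one another and that the tag names look like the ones in the question. The class and method names (FlightCodes, Extract) are hypothetical and only for illustration:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FlightCodes
{
    // Pass 1: each <Flight ...>...</Flight> block; attributes are allowed on the opening tag.
    private static readonly Regex FlightBlock = new Regex(
        @"<Flight\b[^>]*>(.*?)</Flight>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Pass 2: the airline codes inside one extracted block.
    private static readonly Regex AirlineCode = new Regex(
        @"<AirlineCode>(.*?)</AirlineCode>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> Extract(string xml)
    {
        var codes = new List<string>();
        foreach (Match flight in FlightBlock.Matches(xml))
        {
            // Only search the text that sat between <Flight> and </Flight>.
            foreach (Match code in AirlineCode.Matches(flight.Groups[1].Value))
            {
                codes.Add(code.Groups[1].Value);
            }
        }
        return codes;
    }
}
```

An <AirlineCode> that sits outside every <Flight> block is never seen by the second regex, which is the whole point of splitting the work in two.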
If your data really is that simple, then this will work. I won't say that I recommend this approach. There are too many things that can go wrong. Data formats have a distressing tendency to change, and that fragile regex solution is likely to break if the format changes even a little bit. An XML parser solution will be much more robust.
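For comparison, here is roughly what the parser-based route could look like with LINQ to XML. This is a sketch that assumes the input is well-formed and uses no XML namespaces; the class and method names are hypothetical:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FlightCodesXml
{
    public static List<string> Extract(string xml)
    {
        // Throws XmlException if the text is not well-formed XML.
        XDocument doc = XDocument.Parse(xml);

        // Every <AirlineCode> that has a <Flight> ancestor, at any depth.
        return doc.Descendants("Flight")
                  .SelectMany(flight => flight.Descendants("AirlineCode"))
                  .Select(code => code.Value)
                  .ToList();
    }
}
```

Element names are matched case-sensitively here, unlike the IgnoreCase regex in the question.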

Related

Parsing XML with spaces in element names [duplicate]

This question already has answers here:
Encoding space character in XML name
(2 answers)
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 5 years ago.
So I have to parse a simple XML file (there is only one level, no attributes, just elements and values) but the problem is that there are (or could be) spaces in the XML. I know that's bad (possibly terrible) practice, but I'm not the one that's building the XML, that's coming from an external library.
example:
<live key>test</live key>
<not live>test</not live>
<Test>hello</Test>
Right now my strategy is to read the XML (I have it as a string) one character at a time and just save each element name and value as I get to it, but that seems a bit too complicated.
Is there an easier way to do it? XmlReader throws an error because it expects the XML to be well-formed: it treats "live" as the element name and "key" as an attribute, so it looks for a "=" and finds a ">" instead.
Unfortunately, the text returned by your library is not well-formed XML, so you cannot use an XML parser to parse it. The spaces in the tags are only part of the problem; there are other issues, for example, the absence of a "root" tag.
Fortunately, a single-level language is trivial enough to be matched with regular expressions. Regex-based "parsers" would be an awful choice for real XML, but this language is not real, so you could use regex at least as a workaround:
Regex rx = new Regex("<([^>\n]*)>(.*?)</(\\1)>");
var m = rx.Match(text);
while (m.Success) {
    Console.WriteLine("{0}='{1}'", m.Groups[1], m.Groups[2]);
    m = m.NextMatch();
}
The idea behind this approach is to find strings with "opening tags" that match "closing tags" with a slash.
For your input, it produces the following output:
live key='test'
not live='test'
Test='hello'
Since it is a flat structure, maybe this could help:
MatchCollection ms = Regex.Matches(xml, @"\<([\w ]+?)\>(.*?)\<\/\1\>");
foreach (Match m in ms)
{
    Trace.WriteLine(string.Format("{0} - {1}", m.Groups[1].Value, m.Groups[2].Value));
}
So you get a list of 'key-value' pairs. The traces are only there to check the results.

Reasons of working slowly and solution of that [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 8 years ago.
Here is my function that uses a regex. It works correctly, but it extracts the tags very slowly.
I think it is searching the HTML code character by character, so it runs slowly. Is there any way to speed it up?
string s = Sourcecode(richTextBox6.Text);
// Gets everything between <a ... > and </a> (tags included).
Regex regex = new Regex("(?i)<a([^>]+)>(.+?)</a>");
string gelen = s;
string inside = null;
Match match = regex.Match(gelen);
if (match.Success)
{
    inside = match.Value;
    richTextBox2.Text = inside;
}
string outputStr = "";
foreach (Match ItemMatch in regex.Matches(gelen))
{
    Console.WriteLine(ItemMatch);
    inside = ItemMatch.Value;
    // Writes each match on its own line.
    outputStr += inside + "\r\n";
}
richTextBox2.Text = outputStr;
Change outputStr to a StringBuilder, if you are appending very many items this will increase your speed. As already mentioned parsing HTML with a regex might be an issue (depends a lot on your input).
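A sketch of what the StringBuilder change might look like, using the same pattern as the question. The class and method names are hypothetical, and the UI controls are left out so the snippet stands on its own:

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class AnchorCollector
{
    // Same pattern as in the question; IgnoreCase replaces the inline (?i).
    private static readonly Regex Anchor =
        new Regex("<a([^>]+)>(.+?)</a>", RegexOptions.IgnoreCase);

    public static string Collect(string html)
    {
        // StringBuilder appends into a growable buffer instead of
        // allocating a new string on every += in the loop.
        var output = new StringBuilder();
        foreach (Match m in Anchor.Matches(html))
        {
            output.Append(m.Value).Append("\r\n");
        }
        return output.ToString();
    }
}
```

The result of Collect can then be assigned to richTextBox2.Text once, after the loop.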
The trouble with parsing HTML is that it isn't an exact science. If it was XHTML that you were parsing, then things would be a lot easier.
Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it.
It almost needs to be done on a site-by-site basis.
You should not parse HTML using Regex. (Although you can use a compiled Regex in your code above to make it a bit quicker.)
Regex is not built for parsing HTML. You can use a third-party library that is built specifically for this purpose.
List of HTML Parsing Libraries
If you don't want to use 3rd party libraries, then you can use the System.Windows.Forms.WebBrowser for this purpose.
You can also use Fizzler, which builds on HTML Agility Pack and adds jQuery-style CSS selectors.
Then there is the Majestic-12 HTML parser, which is very quick.
You can also use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
Check the following example on how improper usage of Regex can degrade performance.

regular expression to eliminate text inside < and > [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Using C# regular expressions to remove HTML tags
I'm trying to write a code that will return only the content of an HTML file. The best way I've figured revolves either around eliminating all elements within < ..> brackets, or to make a list of all text in between >...< brackets. I'm pretty new to regular expressions, but I'm pretty sure they're the way to go.
Here's the code I've tried
Regex reg = new Regex(@"<.*>");
file = reg.Replace(file, "");
This works as long as there is only one <...> before a block of text. With any file that has two or more of those elements in sequence, like <...><...>, it just starts deleting any text it finds. Can someone tell me what I'm doing wrong?
Regexes are greedy by default (they match the longest string they can find). Depending on the regex flavor you are using, look for the +? or *? operators, which try the shortest match instead. Otherwise you must build another regex.
Well, the unexpected behavior you're getting is because your regular expression is greedy.
If you change your regex to
Regex reg = new Regex(@"<.*?>");
file = reg.Replace(file, "");
you'll get what you expect.
Also, know that Regex doesn't handle nesting, which HTML has a lot of, and I'd avoid using Regex to parse HTML unless you're trying to match a very specific thing in a specifically formed piece of HTML.
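To make the greedy/lazy difference concrete, here is a small side-by-side sketch. The class and method names are hypothetical, and the sample input is made up for illustration:

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Greedy: .* runs to the end of the line, then backs up to the LAST '>'.
    public static string StripGreedy(string html)
    {
        return Regex.Replace(html, @"<.*>", "");
    }

    // Lazy: .*? stops at the FIRST '>' after each '<'.
    public static string StripLazy(string html)
    {
        return Regex.Replace(html, @"<.*?>", "");
    }
}
```

For the input <p><b>Hello</b> world</p>, StripGreedy returns an empty string because one match swallows everything from the first < to the last >, while StripLazy returns Hello world. Note that . does not match a newline by default, so a tag that spans several lines would be missed by both versions.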

Regular expression to replace quotation marks in HTML tags only

I have the following string:
<div id="mydiv">This is a "div" with quotation marks</div>
I want to use regular expressions to return the following:
<div id='mydiv'>This is a "div" with quotation marks</div>
Notice how the id attribute in the div is now surrounded by apostrophes?
How can I do this with a regular expression?
Edit: I'm not looking for a magic bullet to handle every edge case in every situation. We should all be wary of using regex to parse HTML but, in this particular case and for my particular need, regex IS the solution...I just need a bit of help getting the right expression.
Edit #2: Jens helped me find a solution, but anyone randomly coming to this page should think long and very hard about using it. In my case it works because I am very confident of the type of strings that I'll be dealing with. I know the dangers and the risks; make sure you do too. If you're not sure whether you know, that probably indicates that you don't and shouldn't use this method. You've been warned.
This could be done in the following way: I think you want to replace every instance of ", that is between a < and a > with '.
So, you look for each " in your file, look behind for a <, and ahead for a >. The regex looks like:
(?<=\<[^<>]*)"(?=[^><]*\>)
You can replace the found characters to your liking, maybe using Regex.Replace.
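As a sketch of that replacement with Regex.Replace: the class and method names below are hypothetical, and the escaped \< and \> are written as plain < and >, which mean the same thing in .NET. The variable-length lookbehind is a .NET feature and may not carry over to other regex flavors:

```csharp
using System.Text.RegularExpressions;

public static class QuoteFixer
{
    // A double quote that has a '<' somewhere before it and a '>' somewhere
    // after it, with no other '<' or '>' in between, i.e. a quote inside a tag.
    private static readonly Regex QuoteInsideTag =
        new Regex("(?<=<[^<>]*)\"(?=[^<>]*>)");

    public static string UseSingleQuotes(string html)
    {
        return QuoteInsideTag.Replace(html, "'");
    }
}
```

Applied to the string from the question, this returns <div id='mydiv'>This is a "div" with quotation marks</div>. It assumes attribute values contain no apostrophes and no < or > characters; either would break the result.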
Note: While I have found the Stack Overflow community most friendly and helpful, these Regex/HTML questions are met with a little too much anger, in my opinion. After all, this question does not ask "What regex matches all valid HTML, and does not match anything else."
I see you're aware of the dangers of using Regex to do these kinds of replacements. I've added the following answer for those in search of a method that is a lot more 'stable' if you want to have a solution that will keep working as the input docs change.
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
var nodes = doc.DocumentNode.DescendantNodes();
foreach (var node in nodes)
{
    foreach (var att in node.Attributes)
    {
        att.QuoteType = AttributeValueQuote.SingleQuote;
    }
}
var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);
You can match:
(<div.*?id=)"(.*?)"(.*?>)
and replace this with:
$1'$2'$3
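In C#, that match-and-replace pair could be applied roughly like this. It is a sketch with hypothetical class and method names, and it only touches the id attribute of div tags, exactly as the pattern above is written:

```csharp
using System.Text.RegularExpressions;

public static class DivIdQuoteFixer
{
    // Group 1: everything up to id=   Group 2: the id value   Group 3: rest of the tag.
    private static readonly Regex DivId =
        new Regex("(<div.*?id=)\"(.*?)\"(.*?>)");

    public static string Fix(string html)
    {
        // $1, $2 and $3 refer back to the three capture groups.
        return DivId.Replace(html, "$1'$2'$3");
    }
}
```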

c# : parsing text from html

I have an string input-buffer that contains html.
That html contains a lot of text, including some stuff I want to parse.
What I'm actually looking for are the lines like this : "< strong>Filename< /strong>: yadayada.thisandthat.doc< /p>"
(Although position and amount of whitespace / semicolons is variable)
What's the best way to get all the filenames into a List< string> ?
Well a regular expression to accomplish this will be very hard to write and will end up being unreliable anyway.
Probably your best bet is to have a whitelist of extensions you want to look for (.doc, .pdf etc), and trawl through the html looking for instances of these extensions. When you find one, track back to the next whitespace character and that's your filename.
Hope this helps.
You have a couple of options. You can use regular expressions, it could be something like Filename: (.*?)< /p> , but it will need to be much more complex. You would need to look at more of the text file to write a proper one. This could work depending on the structure of all your text, if there is always a certain tag after a filename for example.
If it is valid HTML you can also use an HTML parser like HTML Agility Pack to go through the HTML and pull out text from certain tags, then use a regex to separate out the path.
I'm not sure a regular expression is the best way to do this, traversing the HTML tree is probably more sensible, but the following regex should do it:
<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>
As you can see, I've been extremely tolerant of whitespace, as well as tolerant on the content of the filename. Also, multiple (or no) colons are permitted after the closing strong tag.
The C# to build a List (off the top of my head):
List<String> fileNames = new List<String>();
Regex regexObj = new Regex(#"<\s*strong\s*>\s*Filename\s*<\s*/strong\s*>[\s:]*([^<]+)<\s*/p\s*>", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    // Groups[1] is the captured filename; Groups[0] would be the whole match, tags included.
    fileNames.Add(matchResults.Groups[1].Value.Trim());
    matchResults = matchResults.NextMatch();
}
