Removing empty elements from XML with a regex that matches a sequence twice - C#

I'm looking to remove empty elements from an XML file because the reader expects a value. This is not the xsi:nil="true" case, nor an element without content such as <Element /> (see "Deserialize Xml with empty elements in C#"), but an element whose inner part is simply missing: <Element></Element>
I've tried writing my own code for removing these elements, but my code is too slow and the files are too large. The end of every item will also contain this pattern, so the following regex would remove valid XML:
@"<.*></*>"
I need some sort of regex that makes sure the two * parts match the same text.
So:
<Item><One>1</One><Two></Two><Three>3</Three></Item>
Would change into:
<Item><One>1</One><Three>3</Three></Item>
So the fact that it's all on one line makes this harder, because the end of the item comes right after the end of Three, producing the pattern I'd like to look for.
I don't have access to the original data that would allow recreating valid xml.

You want to capture one or more word characters inside <...> and match the closing tag by using a \1 backreference to what the first group captured.
<(\w+)></\1>
See demo at regex101
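A minimal sketch of applying this pattern from C#, assuming tag names consist only of word characters and the empty elements carry no attributes (the variable names are illustrative, and the snippet is written as top-level statements):

```csharp
using System;
using System.Text.RegularExpressions;

string xml = "<Item><One>1</One><Two></Two><Three>3</Three></Item>";

// \1 forces the closing tag name to equal the captured opening tag name,
// so "</Three></Item>" is never mistaken for an empty element.
string cleaned = Regex.Replace(xml, @"<(\w+)></\1>", "");

Console.WriteLine(cleaned); // <Item><One>1</One><Three>3</Three></Item>
```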

AFAIK there is no need to capture any group, because <a></b> (which a simple regex without capturing would match) is just invalid XML and can't be in your file (unless you're parsing HTML, in which case, even if it can be done, I'd suggest not using regex). Capturing a group is required only if you're matching non-empty nodes, which is not your case.
Note that you have a problem with your regex (besides the unescaped /): you're matching any character with ., but XML tag names can't contain arbitrary characters. If you absolutely want to use .* then it should be .*? and you should exclude /.
What I would do is to keep regex as simple as possible (still matching valid XML node names or - even better - only what you know is your data input):
<\w+><\/\w+>
You may want a stricter check for the tag name; for example, \s*[\w\d]+\s* may be slightly better. Keep in mind that a regex with fewer steps will perform better on very large files. You may also want to allow an optional new-line between the opening and closing tags.
Note that you may need to loop until no more replacements are made if, for example, you have <outer><inner></inner></outer> and you want it reduced to an empty string (especially in this case, don't forget to compile your regex).
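The loop described above could be sketched as follows. This assumes well-formed input (so the simple pattern without a backreference is safe), and RemoveEmptyElements is a hypothetical helper name:

```csharp
using System;
using System.Text.RegularExpressions;

// Compiled once and reused, since the pattern runs repeatedly over large input.
var emptyElement = new Regex(@"<\w+></\w+>", RegexOptions.Compiled);

// Hypothetical helper: repeats the replacement until a pass changes nothing.
string RemoveEmptyElements(string xml)
{
    string previous;
    do
    {
        previous = xml;
        xml = emptyElement.Replace(xml, "");
    } while (xml != previous);
    return xml;
}

string nested = RemoveEmptyElements("<a><outer><inner></inner></outer><b>1</b></a>");
Console.WriteLine(nested); // <a><b>1</b></a>
```

The first pass removes <inner></inner>, which leaves <outer></outer> for the second pass; the third pass finds nothing and ends the loop.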

Use LINQ to XML:
string xml = "<Item><One>1</One><Two></Two><Three>3</Three></Item>";
XElement item = XElement.Parse(xml);
item = new XElement("Item", item.Descendants().Where(x => x.Value.Length != 0));
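For documents with nesting, an in-place variant may be easier to reason about than rebuilding the root. This is only a sketch; the rule "no child elements, no attributes, empty value" is my assumption about what counts as an empty element:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XElement item = XElement.Parse("<Item><One>1</One><Two></Two><Three>3</Three></Item>");

// ToList() takes a snapshot first: removing nodes while lazily
// enumerating Descendants() is not safe.
item.Descendants()
    .Where(e => !e.HasElements && !e.HasAttributes && e.Value.Length == 0)
    .ToList()
    .ForEach(e => e.Remove());

string result = item.ToString(SaveOptions.DisableFormatting);
Console.WriteLine(result); // <Item><One>1</One><Three>3</Three></Item>
```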

Related

Incorrect Regex in XML node string

I'm trying to get a regex to work, but several hours later I still can't crack it!
Let's say I have these lines (in XML):
<MyXMLTag DataMember="$$Date$$" Name="$$DateName$$" DateTimeGroupInterval="MonthYear" DefaultId="$$MyId$$" />
<NameTag>$$item$$</NameTag>
I need to get the words that start and end with $$, but only the ones that aren't preceded by DataMember=".
Ideally, only the ones that aren't inside DataMember="...".
So, in this case I want the matches $$DateName$$, $$MyId$$ and $$item$$. The $$Date$$ should be ignored/discarded.
So far I've tried the following regex combinations:
@"(?<!^\bDataMember=""\b)\$\$(.*?)\$\$"
@"(?<!(\w*DataMember=""\w*))\$\$(.*?)\$\$"
I had several other variations of the same, but none of them allowed me to achieve my goal.
With these combinations I had this (incorrect) result:
$$" Name="$$
$$" Name="$$
$$" DateTimeGroupInterval="MonthYear" DefaultId="$$
As you can see, it's catching the text between XML attributes!
What I want is to replace the text between $$ with a custom one.
I don't need to handle the XML itself (for that I can use multiple tools), only the text between $$. Consider that the code doesn't know whether the text is inside a tag, an attribute, the root node, a child node, one or multiple times...
Help?!
Instead of matching anything between the two $$ delimiters, look for consecutive word characters instead:
new Regex(@"(?<!\bDataMember="")\$\$(\w+)\$\$");
Matches $$DateName$$, $$MyId$$ and $$item$$ alike
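A short sketch of using this pattern for the replacement the question asks about. The square-bracket replacement is a made-up placeholder for whatever custom text is needed, and the input is a shortened version of the sample lines:

```csharp
using System;
using System.Text.RegularExpressions;

string xml =
    "<MyXMLTag DataMember=\"$$Date$$\" Name=\"$$DateName$$\" DefaultId=\"$$MyId$$\" />\n" +
    "<NameTag>$$item$$</NameTag>";

// The lookbehind rejects a token that directly follows DataMember=".
// \w+ (instead of .*?) stops a match from spanning from one token's
// closing $$ to the next token's opening $$.
var token = new Regex(@"(?<!\bDataMember="")\$\$(\w+)\$\$");

// Hypothetical replacement: wrap the captured name in square brackets.
string replaced = token.Replace(xml, m => "[" + m.Groups[1].Value + "]");

Console.WriteLine(replaced);
```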

Extract some attribute values with regex whether other attributes are present or not

I'm having a hard time trying to get my values with a regex from an XML-like line:
<themes:MetaGrid x:Key="external_liquid_waste" x:Shared="False" Categories="LiquidsAndWastes,Others" Group="XD">
I'm trying to retrieve only the values of the key and categories attributes, whether the Group or the x:Shared attributes are present or not.
The separators in the category attribute can be either a comma, a semicolon or a whitespace.
So far (before adding the Group attribute) I was using this expression:
<.+MetaGrid.+x:Key="(?<reskey>[\w]+)".+Categories="(?<resca>.+)".+>
But with the Group attribute, the second capture returns LiquidsAndWastes,Others" Group="XD, which is not good.
Any pointers would be much appreciated.
Thanks !!
The XML parser would be the way to go here. But, if you really want to use a regex, see below.
<.+MetaGrid.+x:Key="(?<reskey>[\w]+)".+Categories="(?<resca>.+?)".+>
Notice the ? in the second capture group, so the group is matched in a non-greedy fashion.
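A variation on that idea, sketched under the assumption that x:Key always appears before Categories: a negated character class keeps the capture from running past the closing quote, and the pattern does not require anything to follow Categories. The pattern below is my own rewrite, not the one from the answer:

```csharp
using System;
using System.Text.RegularExpressions;

string line = "<themes:MetaGrid x:Key=\"external_liquid_waste\" x:Shared=\"False\" " +
              "Categories=\"LiquidsAndWastes,Others\" Group=\"XD\">";

// [^""]+ cannot cross a double quote; .*? lets other attributes sit in between.
var regex = new Regex(
    @"<\S*MetaGrid\b.*?\bx:Key=""(?<reskey>\w+)"".*?\bCategories=""(?<resca>[^""]+)""");

Match m = regex.Match(line);
string key = m.Groups["reskey"].Value;

// Separators may be commas, semicolons, or whitespace.
string[] categories = Regex.Split(m.Groups["resca"].Value, @"[,;\s]+");

Console.WriteLine(key);
Console.WriteLine(string.Join("|", categories));
```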

Stripping text line with regular expression with C#

In the text shown below, I would need to extract the info in between the double quotes (The input is a text file)
Tag = "571EC002A-TD"
Tag = "571GI001-RUN"
Tag = "571GI001-TD"
The output should be,
571EC002A-TD
571GI001-RUN
571GI001-TD
How should I frame my regex in C# to match this and save it to a text file?
I succeeded in reading all the lines into my code, but the regex gives me some undesirable values.
Thanks in advance.
A simple regex could be:
Regex tagRegex = new Regex(@"Tag\s?=\s?""(.+?)""");
Example with your input
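A minimal end-to-end sketch of reading the lines, extracting the values, and saving them. The temporary files stand in for the real input and output paths, which the question does not give:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical temp files used in place of the real input/output paths.
string inputPath = Path.GetTempFileName();
string outputPath = Path.GetTempFileName();
File.WriteAllLines(inputPath, new[]
{
    "Tag = \"571EC002A-TD\"",
    "Tag = \"571GI001-RUN\"",
    "Tag = \"571GI001-TD\""
});

var tagRegex = new Regex(@"Tag\s?=\s?""(.+?)""");

// Group 1 holds the text between the double quotes; lines that do not match are skipped.
var values = File.ReadLines(inputPath)
    .Select(line => tagRegex.Match(line))
    .Where(m => m.Success)
    .Select(m => m.Groups[1].Value)
    .ToList();

File.WriteAllLines(outputPath, values);
Console.WriteLine(string.Join(Environment.NewLine, values));
```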
UPDATE
For those that ask why not use String.Substring: the great advantage of regular expressions over string operations is that they don't generate temporary strings until you actually ask for a matched value. Matches and groups contain only indexes into the source string. This can be a huge advantage when processing log files.
You can match the content of a tag using a regex like
Tag\s*=\s*"(?<tagValue>.*?)"
The ? in .*? results in a non-greedy search, i.e. only text up to the first double quote is extracted. Otherwise the pattern would match everything up to the last double quote.
(?<tagValue>.*?) defines a named group. This way you can refer to the actual value captured by name and even use LINQ to process the values.
The resulting C# code may look like this after escaping:
var myRegex=new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");
...
var tags=myRegex.Matches(someText)
    .OfType<Match>()
    .Select(match=>match.Groups["tagValue"].Value);
The result is an IEnumerable with all tag values. You can convert it to an array or List using ToArray() or ToList() just like any other IEnumerable
The equivalent code using a loop would be
var myRegex=new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");
...
List<string> tagValues=new List<string>();
foreach(Match m in myRegex.Matches(someText))
{
    tagValues.Add(m.Groups["tagValue"].Value);
}
The LINQ version though can be extended very easily. For example, File.ReadLines returns an IEnumerable and doesn't wait to load everything in memory before returning. You could write something like:
var tags=File.ReadLines(myBigLog)
    .SelectMany(line=>myRegex.Matches(line).Cast<Match>())
    .Select(match=>match.Groups["tagValue"].Value);
If the tag names changed, you could also capture the tag name. If, e.g., tags have a tag prefix, you could use the pattern:
(?<tagName>tag\w+)\s*=\s*"(?<tagValue>.*?)"
And extract both tag name and value in the Select function, eg :
.Select(match=>new {
TagName=match.Groups["tagName"].Value,
Value=match.Groups["tagValue"].Value
});
Regex.Matches is thread safe which means you can create one static Regex object and use it repeatedly, or even use PLINQ to match multiple lines in parallel simply by adding AsParallel() before the call to SelectMany.
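A sketch of the PLINQ variant mentioned above, using an in-memory array in place of the file. AsOrdered() is my addition to keep the results in input order; without it PLINQ may return them in any order:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// One shared Regex instance is used from all worker threads.
var myRegex = new Regex("Tag\\s*=\\s*\"(?<tagValue>.*?)\"");

string[] lines =
{
    "Tag = \"571EC002A-TD\"",
    "Tag = \"571GI001-RUN\"",
    "Tag = \"571GI001-TD\""
};

var parallelTags = lines
    .AsParallel()
    .AsOrdered() // keep results in input order
    .SelectMany(line => myRegex.Matches(line).Cast<Match>())
    .Select(match => match.Groups["tagValue"].Value)
    .ToList();

Console.WriteLine(string.Join(", ", parallelTags));
```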
If those strings will always be like that, you can go for a simpler approach by just using Substring:
line.Substring(7, line.Length - 8)
That will give you your desired output.

Having trouble taking out all the newline, tab, and carriage return between two tags

I have been working on this for almost a day now, but I'm not able to take out all the newlines, tabs, and carriage returns between ">" and "<".
This is a sample XML file I'm reading:
<Consequence_Note>
<Text>In some cases, integer coercion errors can lead to exploitable buffer
overflow conditions, resulting in the execution of arbitrary
code.</Text>
</Consequence_Note>
and this
<Consequence_Scope>Availability</Consequence_Scope>
<Consequence_Technical_Impact>DoS: resource consumption
(CPU)</Consequence_Technical_Impact>
My goal is to take out all the newlines, tabs, and carriage returns between these two characters (> and <). The only thing I'm able to achieve is to take out all the \n\t\r between ">" and "<" when there's nothing else in between the two tags. But I'm not able to take out the \n\t\r when there are other characters in between the two tags.
I need help with a regular expression that will take out all the newlines, tabs, and carriage returns between ">" and "<".
For example:
<Consequence_Technical_Impact>DoS: resource consumption
(CPU)</Consequence_Technical_Impact>
What I would like to have is:
<Consequence_Technical_Impact>DoS: resource consumption (CPU)</Consequence_Technical_Impact>
This is my code (I'm reading from a xml file):
String file = @"C:\Documents and Settings\YYC\Desktop\cwec_v2.1\cwec_v2.1.xml";
var lines = File.ReadAllText(file);
var replace = Regex.Replace(lines, @">([\r\n\t])*?<", "><");
File.WriteAllText(file, replace);
Don't parse HTML/XML with regex (see "RegEx match open tags except XHTML self-contained tags")!
Use an XML reader for XML, or HtmlAgilityPack (or some other HTML tool) for HTML.
XML/HTML documents are complex enough that a regex will not reliably do the job right (it works in some cases, but not in general).
If you first read the document using an XmlReader configured to ignore whitespace, it drops the whitespace-only nodes between elements. Then you can simply write it back out with the right writer settings.
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlreadersettings.ignorewhitespace.aspx
See: http://msdn.microsoft.com/en-us/library/system.xml.xmlwritersettings.newlinehandling.aspx
A regex alternative can probably be built, but it will still have lots and lots of issues with XML containing CDATA, comments and other constructs which make XML hard to parse to begin with. If your XML is very structured, machine generated and unchanging, you could create a regex to fix it, but on the other hand, you might also be able to fix the generator. Simplest regex that might work:
\s{2,}
replace with
[ ]
That strips out any run of whitespace longer than one character and replaces it with one space. No need to treat any other whitespace inside tags differently; that's what the XmlReader should do by default anyway.
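Applied from C#, that replacement could look like the sketch below. The sample string is the element from the question with an assumed CRLF line ending and indentation:

```csharp
using System;
using System.Text.RegularExpressions;

string xml = "<Consequence_Technical_Impact>DoS: resource consumption\r\n" +
             "            (CPU)</Consequence_Technical_Impact>";

// Any run of two or more whitespace characters collapses to a single space.
// A lone "\n" with no whitespace next to it is left alone by this pattern.
string collapsed = Regex.Replace(xml, @"\s{2,}", " ");

Console.WriteLine(collapsed);
```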

Checking a HTML string for unopened tags

I have a string containing HTML source, and I want to check whether it contains a closing tag that was never opened.
For example, the string below contains </u> after WAVEFORM, which has no opening <u>.
WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,
I just want to check for these kinds of unopened tags and then prepend the opening tag to the start of the string. How can I do that?
For this specific case you can use HTML Agility Pack to assert if the HTML is well formed or if you have tags not opened.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(
"WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");
foreach (var error in htmlDoc.ParseErrors)
{
// Prints: TagNotOpened
Console.WriteLine(error.Code);
// Prints: Start tag <u> was not found
Console.WriteLine(error.Reason);
}
Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.
Probably about the best you could do would be to use a regex to find each markup structure, eg. something like:
<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->
Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)
If you've got a start tag, you need to know if it needs closing, i.e. whether it's one of the EMPTY content-model tags like <img>. If an element is EMPTY, it doesn't need closing, so you can ignore it. (If you have XHTML, this is all a bit easier.)
If you have a start tag, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as was on there, otherwise you've got invalid markup). If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.
Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the tags-to-close to the end, again in reverse order.
(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)
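The procedure above can be sketched roughly as follows. The regex here is a simplified stand-in for the longer pattern (it does not handle > inside attribute values or self-closing tags), the list of EMPTY elements is an assumed, incomplete sample, and Balance is a hypothetical helper name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Simplified pattern: group 1 = start tag name, group 2 = end tag name,
// third alternative = comment (neither group set).
var markup = new Regex(@"<(\w+)(?:\s[^>]*)?>|</(\w+)\s*>|<!--.*?-->", RegexOptions.Singleline);

// Assumed, incomplete list of EMPTY elements that never take an end tag.
var emptyTags = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { "br", "hr", "img", "input", "meta", "link" };

string Balance(string html)
{
    var toOpen = new List<string>();   // end tags seen while nothing was open
    var toClose = new List<string>();  // start tags still waiting for an end tag

    foreach (Match m in markup.Matches(html))
    {
        if (m.Groups[1].Success)
        {
            if (!emptyTags.Contains(m.Groups[1].Value))
                toClose.Add(m.Groups[1].Value);
        }
        else if (m.Groups[2].Success)
        {
            if (toClose.Count > 0)
                toClose.RemoveAt(toClose.Count - 1); // assumes the names agree
            else
                toOpen.Add(m.Groups[2].Value);
        }
        // Comments match neither group and are ignored.
    }

    // Both lists are emitted in reverse order, as described above.
    string prefix = string.Concat(toOpen.AsEnumerable().Reverse().Select(t => "<" + t + ">"));
    string suffix = string.Concat(toClose.AsEnumerable().Reverse().Select(t => "</" + t + ">"));
    return prefix + html + suffix;
}

string balanced = Balance("WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");
Console.WriteLine(balanced);
```

For the string from the question, the first </u> finds nothing open and lands on the tags-to-open list, so a single <u> is prepended; the later <u>...</u> pair cancels out.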
