XML repair in C#

My application's file format is XML-based. A customer just sent me a botched XML file: it contains nearly 90,000 lines, and for some reason about 20 "=" symbols are randomly interspersed throughout.
For most of them I get an XmlException with a line number and character position, which lets me find the offending characters and remove them manually. I've just started writing a small app that automates this process, but I was wondering whether there are better ways to repair damaged XML files.
Example of botched line:
<item name="InstanceGuid" typ=e_name="gh_guid" type_code="9">ee330f9f-a1e2-451a-8c6d-723f066a6bd4</item>
↑ (this is supposed to be [type_name])

You could search for any equal sign that isn't followed by a double quote. A regular expression (regex) for that would be pretty simple to write.
Or you could open the file in an advanced text editor that supports regex find/replace, search for any equal sign not followed by a double quote, and remove it.
Of course, I'd keep a copy of the original: if there are equal signs in the element content (the inner XML), this approach would strip those as well.
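In C#, the "equal sign not followed by a double quote" search can be written with a negative lookahead. This is a minimal sketch, assuming every attribute value in the file is double-quoted and no legitimate "=" appears in element text. The class and method names are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrayEqualsRemover
{
    // Matches an '=' that is NOT immediately followed by a double quote.
    private static readonly Regex StrayEquals = new Regex("=(?!\")");

    // Returns the input with every such '=' removed.
    public static string Clean(string xml) => StrayEquals.Replace(xml, "");

    public static void Main()
    {
        string broken =
            "<item name=\"InstanceGuid\" typ=e_name=\"gh_guid\" type_code=\"9\">ee330f9f</item>";
        // Prints the line with typ=e_name repaired to type_name.
        Console.WriteLine(Clean(broken));
    }
}
```

For the real file you would read the text, call Clean, write the result to a new file (keeping the original), and then try to load it with your usual XML parser to confirm it is well-formed.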

Use a regular expression to clean the XML first.
Something like:
s/([^\s"]+)=([^\s"]+="[^"]*")/\1\2/
Obviously this would need to be ported to your Regex engine of choice :)
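A possible port to .NET's Regex class, as a sketch only. The class and method names are hypothetical. Unlike the sed-style expression above (which, without a g flag, replaces only the first hit per line), Regex.Replace replaces every non-overlapping match:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeNameRepair
{
    // .NET form of s/([^\s"]+)=([^\s"]+="[^"]*")/\1\2/
    // In a verbatim string, "" stands for a single double quote.
    private static readonly Regex SplitName =
        new Regex(@"([^\s""]+)=([^\s""]+=""[^""]*"")");

    // Rejoins an attribute name that was split by a stray '='.
    public static string Repair(string xml) => SplitName.Replace(xml, "$1$2");

    public static void Main()
    {
        Console.WriteLine(
            Repair("<item typ=e_name=\"gh_guid\" type_code=\"9\">x</item>"));
    }
}
```

Because the pattern requires a quoted attribute value right after the damaged name, it leaves a plain "=" in element text alone. It removes only one stray "=" per attribute name per pass, so a name damaged twice would need a second pass.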

In TextPad, if you search using the regular expression =[^"] you will find any = sign that is not followed by a ".
This should find the locations in the document where the rogue = signs have appeared. To replace them, first open the document in TextPad. Then press F8.
In the dialog enter the following:
Find what: =\([^"]\)
Replace with: \1
Check the "Regular expressions" box, select "All documents" and click "Replace All"
This should match all = that aren't followed by a " and replace the = with the symbol that did follow it.
typename="test" typ=ename="test"
will become
typename="test" typename="test"

Related

Word vsto get text of document with hidden characters

I'm developing a text analysis vsto add-in for Word.
To do that, I get the text of the active document like this:
Globals.ThisAddIn.Application.ActiveDocument.Content.Text
Then I analyze it. The analysis returns a list of positions that Word should comment on (for example, characters 3 to 6 and characters 10 to 13).
The problem is that adding the comment on characters 3 to 6 seems to insert a hidden character into the document, because every comment Word adds after the first one is placed one character too early.
Is there a way to fix that, or to get the text including the hidden characters?
I found TextRetrievalMode, but I can't get it to work.

Basically, the answer is "No, you can't do it the way you propose."
Yes, Word does add "hidden characters" to the text flow that cannot be picked up using the object model. Trying to work with character index values is not going to work reliably. The reliable method is Word's built-in Find/Replace with wildcards. If RegEx is absolutely necessary, then some kind of Find/Replace within a character-index range (say, starting 5 characters before and ending 5 characters after the indices computed using RegEx) might be a way to double-check the result and pick up the correct Range.
Possibly, depending on what kind of analysis this is, it might be better to work with the closed file, leveraging the Office Open XML. That will not have the problem of "hidden characters" that Word uses for structural information. On the other hand, there's a lot of formatting information that breaks up text runs that needs to be contended with...

C# Matching a recursive structure in a string

I'm currently working on a program that loads up a text file, searches through it to find a specific structure, and then replaces a certain part of that structure with different text.
The structure I need to find and extract is "N"(N), where N is any run of characters. For example, let's say I had a text file like this:
Everyone knows the saying "Do not do more than you can do" (Jim Doe).
Well, I'm here to tell you that this saying is awesome. Here is
another, "The sky is blue and the sun is bright" (Sally Wantsmore).
I would want to be able to match the text "Do not do more than you can do" (Jim Doe) along with "The sky is blue and the sun is bright" (Sally Wantsmore).
As far as I know, there isn't really a way to do this with a regular expression. I've been trying for the last few days and can't get it to work, because the pattern is recursive by nature. My question is: how would I go about writing C# code to parse the text and find these patterns? I would like to get the position within the string and the length, so that I can then extract the match from the string.
EDIT
I need to be able to capture all characters in the quote. This means there could be another set of quotes within the quote, and even another set of parentheses. In other words, the structure could contain a match within itself.
I'm now trying to use this expression because I need to be able to capture all characters within a quote: \"(.+)\" \(([\w ]+)\)
The answers listed below both work. However, I've discovered a limitation: the structure can be recursive. The problem I am currently having is when there is a "N"(N) inside of a "N"(N). For example:
"Random quote" (random person) Here is a fun saying, "The sky is blue and
the sun is bright, some even say "really bright" (others)" (Sally
Wantsmore).
This presents many problems. There is only one match, because the expression takes the very first ", finds the last " just after (others), then finds the set of parentheses for (Sally Wantsmore), and returns only that match. However, I want it to find all the matches: the first one and the last one separately, and even the inner quote. Is this possible with regular expressions? If not, how do I go about solving this with recursive C# code?

The following regex should find the two things you're looking for:
\"([\w ]+)\" \(([\w ]+)\)
In C# you can use Regex.Match to retrieve the two items in brackets.

An example on how you could have it in C#:
using System;
using System.Text.RegularExpressions;
// The input is a verbatim string (@"..."), so "" stands for one double quote.
var quotes = Regex.Matches(@"Everyone knows the saying ""Do not do more than you can do"" (Jim Doe). Well, I'm here to tell you that this saying is awesome. Here is another, ""The sky is blue and the sun is bright"" (Sally Wantsmore).",
    "(?<Quotes>\"(?<Text>[\\w ]+)\\\" \\((?<Author>[\\w ]+)\\))", RegexOptions.Singleline);
foreach (Match quote in quotes)
{
    // Named groups give the quoted text and the name in parentheses.
    var text = quote.Groups["Text"].Value;
    var author = quote.Groups["Author"].Value;
    Console.WriteLine($"Text: {text}, Author: {author}");
}
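Neither pattern above handles the nested case from the edited question. One possible approach, offered only as a sketch and not taken from the answers, is a single pass with a stack: a quote that is directly followed by "(name)" is treated as a closing quote and paired with the most recent unmatched opening quote. The class name, method name, and that pairing rule are all assumptions made for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedQuoteFinder
{
    // \G anchors the match at the start index passed to Match():
    // a quote, optional whitespace, then "(author)".
    private static readonly Regex Closer = new Regex(@"\G""\s*\(([^()""]*)\)");

    public static List<(int Start, int Length, string Author)> Find(string text)
    {
        var results = new List<(int Start, int Length, string Author)>();
        var openers = new Stack<int>(); // indexes of unmatched opening quotes

        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] != '"') continue;

            Match m = Closer.Match(text, i);
            if (m.Success && openers.Count > 0)
            {
                // Closing quote: pair it with the innermost open quote.
                int start = openers.Pop();
                results.Add((start, m.Index + m.Length - start, m.Groups[1].Value));
            }
            else
            {
                openers.Push(i); // treat as an opening quote
            }
        }
        return results;
    }

    public static void Main()
    {
        string text = "\"Random quote\" (random person) Here is a fun saying, " +
            "\"The sky is blue and the sun is bright, some even say " +
            "\"really bright\" (others)\" (Sally Wantsmore).";
        foreach (var (start, length, author) in Find(text))
            Console.WriteLine($"{author}: {text.Substring(start, length)}");
    }
}
```

Each result carries the position and length the question asks for. Inner matches are reported before the outer match that contains them. Text where an opening quote happens to be followed directly by a parenthesis would confuse this rule.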

C# Regex Replace ignore specific string

Since this is my first question here on Stack Overflow, I hope it is asked correctly.
Basically, I have a normal .txt file that contains arbitrary text, like:
car accident
people died
cat without owner
<!-- Text added at 6/29/2011 9:20:38 AM -->
Some addintional Text
other Text added
add Text
I have a write/append function that lets the user append some text and adds a little timestamp.
My problem: another function lets you search and replace text in the file, but as you can guess, if someone wants to replace the word "Text", it is replaced inside the XML-style comment (the timestamp) as well.
My attempt so far is
content = Regex.Replace(content,"[^<+.*"+input+".*>+]*", replace);
//content = content of the .txt file, input = search term, replace = string to replace
But this fails miserably, as any regex pro will see without even running it.
I hope a regex pro can help me out and provide a search pattern that replaces the normal text but ignores the timestamp.
I don't really understand the logic of regex yet, but I do understand the individual expressions, so this would be a good hook for me to understand regex properly.
Thanks in advance.

If I understand your question correctly, you want to replace every instance of "Text" except for the one(s) inside the comment.
The easiest way is to use a negative lookbehind (fantastic description here), as below:
content = Regex.Replace(content, @"(?<!<!--.*?)" + input, replace);
Your original pattern, by contrast, is a single character class: it attempts to replace a run of any length of characters that are NOT one of <+.*> or a character contained in input with the value in replace.
If you're going to be working a lot with Regex, I would HIGHLY recommend giving the website above a good read. It's hands down the best intro to Regex that I've found, the time spent now will save you lots of headaches later!
Edit
Updated to add flexibility, thanks to @stema.
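A self-contained sketch of the lookbehind approach, using sample text modeled on the question. The class and method names are hypothetical. Two details are additions not in the answer: Regex.Escape, so that a search term containing regex metacharacters is taken literally, and a match evaluator, so that "$" in the replacement is not treated as a substitution:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentAwareReplace
{
    public static string ReplaceOutsideComments(string content, string input, string replace)
    {
        // Skip any hit that has "<!--" somewhere before it on the same line.
        // '.' does not match '\n' by default, so the lookbehind stays on one line.
        string pattern = @"(?<!<!--.*?)" + Regex.Escape(input);
        return Regex.Replace(content, pattern, _ => replace);
    }

    public static void Main()
    {
        string content =
            "add Text\n" +
            "<!-- Text added at 6/29/2011 9:20:38 AM -->\n" +
            "other Text added";
        Console.WriteLine(ReplaceOutsideComments(content, "Text", "Word"));
    }
}
```

Note that this also skips anything that follows the closing --> on the same line, and it does not cover comments that span several lines.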

Regular Expression question Visual Studio

I want to replace all references to a resource file in my C# code.
An example is a page that contains several references such as:
Resources.Global.Firstname
Resources.Global.Surname
I'd like the regular expression to find all of these (they could end either with a ; or a )).
I'm a total beginner with regular expressions, so any advice would be gratefully received.

You can just use the Find and Replace window in Visual Studio.
Press Ctrl-H to open the window.
Put Resources\.Global\.{[^,) ;]+} in the "Find what:" text box.
Put GetStringValue("\1") in the "Replace with:" text box.
Make sure the "Look in:" dropdown is set to the scope you want to search.
Expand the Find options subpanel.
Check the box next to "Use:" and make sure that "Regular expressions" is selected.
What this is doing:
The first regular expression will find anything that starts with Resources.Global. and capture whatever is after it until it finds a space, a comma, a close paren, or a semi-colon.
The second one replaces the entire text that was found with GetStringValue("") and puts the captured text inside the quotes in the parentheses.
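The {...} capture and \1 back-reference above are the syntax of the older Visual Studio Find and Replace dialog. If you wanted to run the same replacement from C# code instead, a rough equivalent in .NET regex syntax uses (...) and $1. This is a sketch only: GetStringValue is the name from the steps above, and the class and method names are made up:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ResourceReferenceRewriter
{
    // Captures everything after "Resources.Global." up to a comma,
    // close paren, space, or semicolon.
    private static readonly Regex Reference =
        new Regex(@"Resources\.Global\.([^,) ;]+)");

    public static string Rewrite(string source) =>
        Reference.Replace(source, "GetStringValue(\"$1\")");

    public static void Main()
    {
        Console.WriteLine(Rewrite("var a = Resources.Global.Firstname;"));
        Console.WriteLine(Rewrite("Show(Resources.Global.Surname)"));
    }
}
```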

Why not just do CTRL+H (quick find and replace) and search on the actual terms rather than the regex pattern? What are you trying to rename from and to?
UPDATE
The pattern to match would be something like: Resources.Global.([^};]+)
Replace pattern would be GetStringValue("\1")

As Blazes said, in the scenario you mentioned, Refactoring is the actual answer. If you just want to see them, right click on the definition and select Find all references. If you want to change it, just make the changes and then press ctrl+shift+F11, a context menu appears which gives you the chance to rename all references.

Regular expression:
(Resources\.Global\.[A-Z][a-zA-Z]*[;\)])
will find only the last two lines out of the following tested lines:
Resources.Global.Firstname
Resources.Global.Surname
Resources.Global.Firstname;
Resources.Global.Surname)
Code used to verify:
Regex regex = new Regex(@"(Resources\.Global\.[A-Z][a-zA-Z]*[;\)])");
MatchCollection mc = regex.Matches("Resources.Global.Firstname\n" +
    "Resources.Global.Surname\n" +
    "Resources.Global.Firstname;\n" +
    "Resources.Global.Surname)");
foreach (Match match in mc)
{
    // Group 1 is the whole match, including the trailing ';' or ')'.
    Console.WriteLine(match.Groups[1].ToString());
}
This software might help you:
Rad Software Regular Expression Designer

Matching everything between two specific words using regular expressions

I'm attempting to parse an Oracle trace file using regular expressions. My language of choice is C#, but I chose to use Ruby for this exercise to get some familiarity with it.
The log file is somewhat predictable. Most lines (99.8%, to be specific) match the following pattern:
# [Timestamp] [Thread] [Event] [Message]
# TIME:2010/08/25-12:00:01:945 TID: a2c (VERSION) Managed Assembly version: 2.102.2.20
# TIME:2010/08/25-14:00:02:398 TID:1a60 OpsSqlPrepare2(): SELECT * FROM MyTable
line_regex = /^TIME:(\S+)\s+TID:\s*(\S+)\s+(\S+)\s+(.*)$/
However, in a few places in the log there are much more complicated queries that, for some reason, span several lines:
Two things to point out about these entries: they appear to cause some sort of corruption in the log file, because they end with unprintable characters, and then the next entry suddenly begins on the same line.
Since this obviously rules out capturing data on a per-line basis, I think the next best option is to match everything between the word "TIME:" and either the next instance of "TIME:" or the end of the file. I'm not sure how to express this using regular expressions.
Is there a more efficient approach? The log file I need to parse will be over 1.5GB. My intention is to normalize the lines, and drop unnecessary lines, to eventually insert them as rows in a database for querying.
Thanks!

The regex to match potentially multi-line data between "TIME:" and either the next "TIME:" or the end of the file is:
/^TIME:(.+?)(?=TIME:|\z)/im
On the other hand as James mentions, tokenizing for "TIME:" substrings, or looking for substring positions of "\r\nTIME:" (after the first "TIME:" entry, depending on line-break format) may prove a better approach.

It might be better to do this old-school, i.e. read your file in one line at a time... start at the first 'TIME', and concatenate your lines until you hit the next 'TIME'... you can use regular expressions to filter out any lines you don't want.
I can't speak to Ruby; in C# it would be a StreamReader, of course, which helps you deal with the file size.
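A rough C# sketch of that line-at-a-time idea. It works on any TextReader, so a StreamReader over the 1.5GB file never has to hold more than one entry in memory. It also splits a line when a new "TIME:" starts in the middle of it, as the question describes. The class and method names are hypothetical, and it assumes "TIME:" never occurs inside a message:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TraceEntryReader
{
    private const string Marker = "TIME:";

    // Yields one string per entry; an entry runs from one "TIME:" to the next.
    public static IEnumerable<string> ReadEntries(TextReader reader)
    {
        var current = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int start = 0;
            int idx = line.IndexOf(Marker, StringComparison.Ordinal);
            while (idx >= 0)
            {
                // Text before the marker still belongs to the previous entry.
                current.Append(line, start, idx - start);
                if (TryTake(current, out string done)) yield return done;
                start = idx;
                idx = line.IndexOf(Marker, idx + Marker.Length, StringComparison.Ordinal);
            }
            current.Append(line, start, line.Length - start).Append('\n');
        }
        if (TryTake(current, out string last)) yield return last;
    }

    // Empties the buffer; returns true only if it held a real entry.
    private static bool TryTake(StringBuilder buffer, out string entry)
    {
        entry = buffer.ToString().TrimEnd();
        buffer.Clear();
        return entry.StartsWith(Marker, StringComparison.Ordinal);
    }

    public static void Main()
    {
        var sample = new StringReader(
            "TIME:1 TID: a2c SELECT *\n  FROM MyTable\u0001TIME:2 TID:1a60 done\n");
        foreach (string entry in ReadEntries(sample))
            Console.WriteLine("[" + entry + "]");
    }
}
```

For the real log you would pass a StreamReader opened on the trace file instead of the StringReader, then apply the per-entry regex to each yielded string to filter and normalize it.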
