Regex to remove xml declaration from a string

First of all, I know this is a bad solution and I shouldn't be doing this.
Background: Feel free to skip
However, I need a quick fix for a live system. We currently have a data structure which serialises itself to a string by creating "xml" fragments via a series of string builders. Whether this is valid XML I rather doubt. After creating this xml, and before sending it over a message queue, some clean-up code searches the string for occurrences of the xml declaration and removes them.
The way this is done (iterating over every character and doing an IndexOf for the <?xml) is so slow it's causing thread timeouts and killing our systems. Ultimately I'll be trying to fix this properly (build xml using xml documents or something similar) but for today I need a quick fix to replace what's there.
Please bear in mind, I know this is a far from ideal solution, but I need a quick fix to get us back up and running.
Question
My thought is to use a regex to find the declarations. I was planning on: <\?xml.*?>, then using Regex.Replace(input, pattern, string.Empty) to remove them.
Could you let me know if there are any glaring problems with this regex, or whether just writing it in code using string.IndexOf("<?xml") and string.IndexOf("?>") pairs in a (much saner) loop is better.
EDIT
I need to take care of newlines.
Would: <\?xml[^>]*?> do the trick?
EDIT2
Thanks for the help. Regex-wise, <\?xml.*?\?> worked fine. I ended up writing some timing code and testing both a regex and IndexOf(). I found that, for our simplest use case, JUST the declaration stripping took:
Nearly a second as it was
0.01 of a second with the regex
too little time to measure using a loop and IndexOf()
So I went for IndexOf(), as it's a very simple loop.

You probably want either this: <\?xml.*\?> or this: <\?xml.*?\?>, because the way you have it now, the regex is not looking for '?>' but just for '>'. I don't think you want the first option, because it's greedy and will remove everything between the first occurrence of '<?xml' and the last occurrence of '?>'. The second option will work as long as you don't have nested XML tags. If you do, it will remove everything between the first '<?xml' and the first '?>', so another '<?...?>' tag nested inside the declaration would end the match early.
Also, I don't know how regexes are implemented in .NET, but I seriously doubt they're faster than using IndexOf.

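// Removes everything up to and including the first "?>"; assumes the declaration sits at the very start of the string.
strXML = strXML.Remove(0, strXML.IndexOf(@"?>", 0) + 2);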
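The IndexOf loop the asker settled on is not shown in the thread. The sketch below is one way both approaches might look, side by side. The class and method names are hypothetical, and RegexOptions.Singleline is an assumption added so that '.' can cross the newlines mentioned in the first edit.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; not the asker's actual code.
public static class XmlDeclarationStripper
{
    // Lazy pattern from the question's EDIT2. Singleline (an assumption)
    // lets '.' match newlines inside a declaration.
    private static readonly Regex Declaration =
        new Regex(@"<\?xml.*?\?>", RegexOptions.Singleline);

    public static string StripWithRegex(string input)
    {
        return Declaration.Replace(input, string.Empty);
    }

    public static string StripWithIndexOf(string input)
    {
        const string open = "<?xml";
        const string close = "?>";
        var result = new StringBuilder(input.Length);
        int position = 0;

        while (true)
        {
            int start = input.IndexOf(open, position, StringComparison.Ordinal);
            if (start < 0) break;

            int end = input.IndexOf(close, start + open.Length, StringComparison.Ordinal);
            if (end < 0) break; // unterminated declaration: keep the rest as-is

            // Copy the text before the declaration, then skip past its "?>".
            result.Append(input, position, start - position);
            position = end + close.Length;
        }

        // Copy whatever follows the last declaration.
        result.Append(input, position, input.Length - position);
        return result.ToString();
    }
}
```

Both variants make a single pass over the input and remove every occurrence, not just a leading one. Note that either will also match other processing instructions whose name starts with "xml", such as <?xml-stylesheet ...?>.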

Related

Fastest way of removing unicode codes from a string

Hi I'm trying to figure out a way to remove the tags from the results returned from the Google Feed API. Specifically they are placing bold tags on titles and inside the description.
The codes that are being inserted are as follows:
\u003cb
\u003e
\u003c/b\u003e
Since it's a fixed set of codes, I did try doing a String.Replace() for each of them per string, but unsurprisingly that performed badly. I'm not sure if RegEx would be better (or worse). Does anyone have an idea on how to remove these? Google does not supply an option to remove tags from the results.

You could remove the unicode codes using a regex like this one:
\\u[\d\w]{4}
var subject = @"\u003cb\u003e\u003c/b\u003e";
var result = Regex.Replace(subject, @"\\u[\d\w]{4}", String.Empty);
As for performance, this article seems to suggest that regex is much slower, but I would run your own tests with your own data, as the results might be wildly different. The regular expression itself plays a big part in performance, and I don't think that article states which regex was used, so it's impossible to compare. The size and type of your data will also play a big part, so it's difficult to say which is better without understanding your data.
Also, you should try compiling the regex with the RegexOptions.Compiled flag to see if that boosts performance.
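One caveat with the pattern above: it strips only the escape sequences, so the "b" and "/b" between them are left behind. A minimal sketch of a narrower pattern that removes the whole escaped bold tag follows. It assumes the input still contains the literal six-character escapes (it has not been JSON-decoded yet), and the class name is hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class BoldTagStripper
{
    // Matches the escaped forms of <b> and </b>:
    // \u003cb\u003e and \u003c/b\u003e.
    private static readonly Regex EscapedBoldTag =
        new Regex(@"\\u003c/?b\\u003e", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Strip(string input)
    {
        return EscapedBoldTag.Replace(input, string.Empty);
    }
}
```

For example, BoldTagStripper.Strip(@"\u003cb\u003eHello\u003c/b\u003e world") gives "Hello world", whereas the generic \\u[\d\w]{4} pattern gives "bHello/b world" for the same input.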

Regex Help (again)

I don't really know what to entitle this, but I need some help with regular expressions. Firstly, I want to clarify that I'm not trying to match HTML or XML, although it may look like it, it's not. The things below are part of a file format I use for a program I made to specify which details should be exported in that program. There is no hierarchy involved, just that each new line contains a 'tag':
<n>
This is matched with my program to find an enumeration, which tells my program to export the name value, anyway, I also have tags like this:
<adr:home>
This specifies the home address. I use the following regex:
<((?'TAG'.*):(?'SUBTAG'.*)?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
The problem is that the regex will split the adr:home tag fine, but fail to find the n tag because it lacks a colon, but when I add a ? or a *, it then doesn't split the adr:home and similar tags. Can anyone help? I'm sure it's only simple, it's just this is my first time at creating a regular expression. I'm working in C#, by the way.

Will this help?
<((?'TAG'.*?)(?::(?'SUBTAG'.*))?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
I've wrapped the colon and the SUBTAG capture in an optional non-capturing group and made the TAG capture non-greedy.

Not entirely sure what your aim is but try this:
(?><)(?'TAG'[^:\s>]*)(:(?'SUBTAG'[^\s>:]*))?(\s\w+=['"](?'VALUE'[^'"]*)['"])?(?>>)
I find this site extremely useful for testing C# regex expressions.
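To see how the named groups behave with and without a colon, here is a small sketch that wraps the pattern above in a helper. The class name, the method name, and the sample tag with an attribute are assumptions made for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern from the answer above.
public static class TagParser
{
    private static readonly Regex TagPattern = new Regex(
        @"(?><)(?'TAG'[^:\s>]*)(:(?'SUBTAG'[^\s>:]*))?(\s\w+=['""](?'VALUE'[^'""]*)['""])?(?>>)");

    // Returns the tag, subtag and attribute value; a part that is
    // absent from the line comes back as null.
    public static string[] Parse(string line)
    {
        Match m = TagPattern.Match(line);
        if (!m.Success) return null;

        return new[]
        {
            m.Groups["TAG"].Value,
            m.Groups["SUBTAG"].Success ? m.Groups["SUBTAG"].Value : null,
            m.Groups["VALUE"].Success ? m.Groups["VALUE"].Value : null
        };
    }
}
```

With this, "<n>" yields the tag "n" and no subtag, while "<adr:home>" yields the tag "adr" and the subtag "home". Checking Group.Success is what distinguishes a missing subtag from an empty one.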

What if you put the colon as part of the second tag?
<((?'TAG'.*)(?'SUBTAG':.*)?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>

What's the best way to parse a string for "bad" words in C#?

I'm thinking of something like:
foreach (var word in paragraph.Split(' ')) {
    if (badWordArray.Contains(word)) {
        // do something about it
    }
}
but I'm sure there's a better way.
Thanks in advance!
UPDATE
I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used. Then I'll review it myself to make sure it's legit. An auto flagging system of sorts.

While your way works, it may be a bit time consuming. There is a wonderful response here for a previous SO question. Though the question talks about PHP instead of C#, I think it can be easily ported.
Edit to add sample code:
public string FilterWords(string inputWords) {
    Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
    return wordFilter.Replace(inputWords, "<3");
}
That should work for you, more or less.
Edit to answer OP clarification:
I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used.
Much as the replacement portion above, you can see if something matches like so:
public bool HasBadWords(string inputWords) {
    Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
    return wordFilter.IsMatch(inputWords);
}
It will return true if the string you passed to it contains any words in the list.
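Note that a bare alternation also matches inside longer words and is case-sensitive. A minimal sketch of a stricter variant follows, assuming the word list is supplied at runtime. The class and method names are hypothetical, and the word boundaries and case-insensitivity are additions that are not in the answer above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; builds the filter from a word list.
public static class BadWordFlagger
{
    public static Regex BuildFilter(IEnumerable<string> words)
    {
        // Escape each word so characters such as '.' or '+' are taken literally.
        string alternation = string.Join("|", words.Select(w => Regex.Escape(w)));

        // \b on both sides restricts matches to whole words.
        return new Regex(
            @"\b(?:" + alternation + @")\b",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

A filter built from "puppies", "kittens", "dolphins" and "crabs" flags "I love Kittens!" but not "scrabster", which the unanchored pattern would also have matched.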

At my job we put some automatic bad word filtering into our software (it's kind of shocking to be browsing the source and suddenly run across the array containing several pages of obscenity).
One tip is to pre-process the user input before testing it against your list, in case someone is trying to sneak something by you. So by way of preprocessing, we
uppercase everything in the input
remove most non-alphanumerics (that is, just splice out any spaces, punctuation, etc.)
and then, assuming someone is trying to pass off digits for letters, do something like this: replace zero with O, 9 with G, 5 with S, etc. (get creative)
And then get some friends to try to break it. It's fun.
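The preprocessing steps above could be sketched roughly as follows. The class name is hypothetical, and the digit table contains only the three substitutions mentioned in the answer; a real list would be longer.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical normalizer following the three steps described above.
public static class InputNormalizer
{
    // Only the substitutions named in the answer; extend as needed.
    private static readonly Dictionary<char, char> DigitToLetter =
        new Dictionary<char, char>
        {
            { '0', 'O' },
            { '9', 'G' },
            { '5', 'S' }
        };

    public static string Normalize(string input)
    {
        var result = new StringBuilder(input.Length);

        // Step 1: uppercase everything.
        foreach (char c in input.ToUpperInvariant())
        {
            // Step 2: drop spaces, punctuation and other non-alphanumerics.
            if (!char.IsLetterOrDigit(c)) continue;

            // Step 3: map look-alike digits back to letters.
            char mapped;
            result.Append(DigitToLetter.TryGetValue(c, out mapped) ? mapped : c);
        }

        return result.ToString();
    }
}
```

For instance, Normalize("d 0-l.p.h.i.n 5") returns "DOLPHINS". Because spaces are removed, the normalized text has to be searched for substrings rather than split into words, which raises the false-positive rate; that trade-off suits a flag-for-review workflow better than automatic removal.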

You could consider using a HashSet<T> or a Dictionary<TKey, TValue> instead of the array. With a Dictionary, for example, the linear .Contains() scan over the array becomes a hashed lookup via .ContainsKey(), which is far more efficient. This is especially true if you have a large list of profanities (not sure how many there are! :)
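As a rough sketch of the hashed-lookup idea, assuming the words are split on spaces as in the question: the class and method names are hypothetical, and the case-insensitive comparer is an added assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper showing a set-based lookup.
public static class BadWordLookup
{
    // A case-insensitive set gives constant-time lookups on average.
    public static HashSet<string> CreateSet(IEnumerable<string> words)
    {
        return new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
    }

    public static bool ContainsBadWord(string paragraph, HashSet<string> badWords)
    {
        return paragraph
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Any(word => badWords.Contains(word));
    }
}
```

Each word of the paragraph is then checked in roughly constant time, instead of being compared against every entry in the array.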

C#: Issues using dictionary with languages other than english

OK, so I'm basically trying to load the contents of a .txt file that contains one word per line into a dictionary.
I had no problems doing so when the words in that file were in English, but after changing the file to a language with accents, I started having problems.
I had to change the encoding when creating the stream reader, and also the culture passed to the ToLower method when adding the word to the dictionary.
Basically I now have something similar to this:
if (!dict.ContainsKey(word.ToLower(culture)))
    dict.Add(word.ToLower(culture), true);
The problem is that words like "esta" and "está" are being considered the same. So, is there any way to set the ContainsKey method to a specific language, or do we need to implement something along the lines of a comparable? Either way I'm kind of new to C#, so I would appreciate an example please.
Another issue emerged with the new file: after about a hundred words it stops adding the rest of the file, leaving a word incomplete. I can't see any special characters in that word that would end the execution of the method. Any ideas about this problem?
Many thanks.
EDIT:
1st problem solved using Jon Skeet's suggestion.
Regarding the 2nd problem:
OK, I changed the file format to UTF-8 and removed the encoding in the stream reader, since it now recognizes the accents just right. Testing some stuff regarding the 2nd issue now.
2nd problem also solved, it was a bug on my part... the shame...
Thanks for the quick answers everyone, and especially Jon Skeet.

I assume you're trying to get case insensitivity for the dictionary. Instead of calling ToLower, use the constructor of Dictionary which takes an equality comparer - and use StringComparer.Create(culture, true) to construct a suitable comparer.
I don't know what your second problem is about - we'd need more detail to diagnose it, including the code you're using, ideally.
EDIT: UTF-7 is almost certainly not the correct encoding. Don't just guess at the encoding; find out what it's really meant to be. Where did this text file come from? What can you open it successfully in?
I suspect that at least some of your problems are due to using UTF-7.
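A minimal sketch of the comparer-based dictionary described above might look like this. The class and method names are hypothetical, and the culture is left as a parameter because the thread does not say which one the file uses.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical loader; the culture to pass in is an assumption of the caller.
public static class WordList
{
    public static Dictionary<string, bool> Load(IEnumerable<string> words, CultureInfo culture)
    {
        // ignoreCase: true makes keys case-insensitive under the given
        // culture, so no ToLower call is needed when adding or looking up.
        var dict = new Dictionary<string, bool>(StringComparer.Create(culture, true));

        foreach (string word in words)
        {
            if (!dict.ContainsKey(word))
            {
                dict.Add(word, true);
            }
        }

        return dict;
    }
}
```

With this comparer, "Esta" and "esta" collapse into one key, while "está" stays a separate key because only case is ignored, not accents.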

The problem is with the encoding you are using when opening the file to read. Looks like you may be using ASCIIEncoding.
.NET handles strings internally as UTF-16, so this kind of issue does not arise once the text has been decoded correctly.

How do I use a regular expression to add linefeeds?

I have a really long string. I would like to add a linefeed every 80 characters. Is there a regular expression replacement pattern I can use to insert "\r\n" every 80 characters? I am using C# if that matters.
I would like to avoid using a loop.
I don't need to worry about being in the middle of a word. I just want to insert a linefeed exactly every 80 characters.

I don't know the exact C# names, but since string.Replace does not take a regex, it should be something like
Regex.Replace(str, "(.{80})", "$1\r\n");
The idea is to grab 80 characters and save them in a group, then put them back in ("$1" refers to the first group) along with the "\r\n".
(Edit: The original regex had a + in it, which you definitely don't want. That would completely eliminate everything except the last line and any leftover pieces--a decidedly suboptimal result.)
Note that this way, you will most likely split inside words, so it might look pretty ugly.
You should be looking more into word wrapping if this is indeed supposed to be readable text. A little googling turned up a couple of functions; or if this is a text box, you can just turn on the WordWrap property.
Also, check out the .Net page at regular-expressions.info. It's by far the best reference site for regexes that I know of. (Jan Goyvaerts is on SO, but nobody told me to say that.)
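For completeness, here is a minimal C# sketch of the capture-and-reinsert idea with the width as a parameter. The class and method names are hypothetical, and RegexOptions.Singleline is an assumption added so that existing newlines count as ordinary characters.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; inserts "\r\n" after every 'width' characters.
public static class LineBreaker
{
    public static string InsertBreaks(string text, int width)
    {
        // Capture exactly 'width' characters, then write them back
        // followed by a line break.
        string pattern = "(.{" + width + "})";
        return Regex.Replace(text, pattern, "$1\r\n", RegexOptions.Singleline);
    }
}
```

InsertBreaks("abcdefgh", 3) returns "abc\r\ndef\r\ngh". When the input length is an exact multiple of the width, the result ends with a trailing "\r\n", which may need trimming.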
