Fastest way of removing unicode codes from a string

Hi I'm trying to figure out a way to remove the tags from the results returned from the Google Feed API. Specifically they are placing bold tags on titles and inside the description.
The codes that are being inserted are as follows:
\u003cb
\u003e
\u003c/b\u003e
Since it's a fixed set of codes, I tried calling String.Replace() for each of them on every string, but, not surprisingly, performance was bad. I'm not sure whether Regex would be better (or worse). Does anyone have an idea how to remove these? Google does not supply an option to remove tags from the results.

You could remove the unicode codes using a regex like this one:
\\u[\d\w]{4}
var subject = @"\u003cb\u003e\u003c/b\u003e";
var result = Regex.Replace(subject, @"\\u[\d\w]{4}", String.Empty);
As for performance, this article seems to suggest that regex is much slower, but I would run your own tests with your own data, as the results might be wildly different. The regular expression itself plays a big part in performance, and I don't think that article states which regex was used, so it's impossible to compare. The size and type of your data also play a big part, so it's difficult to say which is better without understanding your data.
Also, you should try compiling the regex with the RegexOptions.Compiled flag to see if that boosts performance.
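As a rough illustration of the compiled-regex suggestion, here is a minimal sketch that strips only the escaped bold tags from the question in a single pass. The class and method names (FeedTextCleaner, StripEscapedBoldTags) are made up for this example, and it assumes the input really contains the literal six-character sequences such as \u003c rather than decoded angle brackets. Note that a generic pattern like \\u[\d\w]{4} removes only the escape codes themselves, so the "b" and "/b" between them would be left behind.

```csharp
using System.Text.RegularExpressions;

public static class FeedTextCleaner
{
    // Matches the literal text \u003cb\u003e and \u003c/b\u003e.
    // Built once and reused; RegexOptions.Compiled trades startup cost for faster matching.
    private static readonly Regex EscapedBoldTag =
        new Regex(@"\\u003c/?b\\u003e", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: removes escaped <b> and </b> tags, keeps everything else.
    public static string StripEscapedBoldTags(string input)
    {
        return EscapedBoldTag.Replace(input, string.Empty);
    }
}
```

Whether this beats chained String.Replace() calls depends on the data, so it is worth timing both on real feed results.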

Related

Issue with find and replace apostrophe( ' ) in a Word Docx using OpenXML and Regex

Word seems to use a different apostrophe character than Visual Studio and it is causing problems with using Regex.
I am trying to edit some Word documents in C# using OpenXML. I am basically replacing [[COMPANY]] with a company name. This has worked pretty smoothly until I reached my corner case of companies with names that end in s. I end up with issues where it sometimes creates an s's.
Example:
Company Name: Simmons
Text in Doc: The [[COMPANY]]'s business is cars.
Result: The Simmons's business is cars.
This is improper English.
I should be able to just use a basic find and replace like I did for [[COMPANY]], but it is not working.
Regex apostropheReplace = new Regex("s\\'s");
docText = apostropheReplace.Replace(docText, "s\'");
This does not. It seems that Word is using a different character for an apostrophe (') than the standard one that is created when I use the key on my keyboard in Visual Studio. If I write a find and replace using my keyboard it will not work, but if I copy and paste the apostrophe from Word it does.
Regex apostrophyReplace = new Regex("s\\’s");
docText = apostrophyReplace.Replace(docText, "s\'");
Notice the different character in the Regex for the second one. I'm confused as to why this is, and also want to know if there is a proper way of doing this. I tried "&apos;" but that does not work. I just want to know if using the copied character from Word is the proper way of doing this, and whether there is a way to do it so that both characters work, so I don't have an issue with docs that may be created with a different program.

The reason this happens is because they are different characters.
Word actually changes some punctuation characters after you type them in order to give them the right inclination or to improve presentation.
I ran into the very same issue before and I used this as a regular expression: [\u2018\u2019\u201A\u201b\u2032']
So essentially modify your code to:
Regex apostropheReplace = new Regex("s[\u2018\u2019\u201A\u201b\u2032']s");
docText = apostropheReplace.Replace(docText, "s\'");
I found these were the five most common types of single quotes and apostrophes used.
And in case you come across the same issue with double quotes, here is what you can use: [\u201C\u201D\u201E\u201F\u2033\u2036\"]

Answering the question:
Is there a way to do it so that both characters work?
If you want one Regex to be able to handle both scenarios, this is perhaps a simple and readable solution:
Regex apostropheReplace = new Regex("s['’]s");
docText = apostropheReplace.Replace(docText, "s\'");
This has the added benefit of being understandable to other developers that you are attempting to cover both apostrophe cases. This benefit gets at the other part of your question:
If using the copied character from Word is the proper way of doing this?
That depends on what you mean by "proper". If you mean "most understandable to other developers," I'd say yes, because there would be the least amount of look-up needed to know exactly what your Regex is looking for. If you mean "most performant", that should not be an issue with this straightforward Regex search (some nice Regex performance tips can be found here).
If you mean "most versatile/robust single quote Regex", then as @Leonardo-Seccia points out, there are other apostrophe-like characters that might cause trouble. (Some of the common Microsoft Word ones are listed here.) Such a solution might look like this:
Regex apostropheReplace =
new Regex("s['\u2018\u2019\u201A\u201b]s");
docText = apostropheReplace.Replace(docText, "s\'");
But you can certainly add other characters as needed. A more complete list of character codes can be found here - to add one to the above Regex, simply replace the "U+" prefix with "\u" and put the result inside the square brackets. For example, to add the "prime" symbol (′ or U+2032) to the list above, change the Regex string from
Regex("s['\u2018\u2019\u201A\u201b]s")
to
Regex("s['\u2018\u2019\u201A\u201b\u2032]s")
Ultimately, you would be the judge of what character encodings are the most "proper" for inclusion in your Regex based on your use cases.
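For what it's worth, here is a small sketch of a variation on the character-class approach that keeps whichever apostrophe the document already uses instead of always substituting the keyboard one. The class name PossessiveFixer and the trailing \b (which limits the match to the end of a word) are my own additions, not something from the answers above. Note that the character class is written without a backslash in front of the opening bracket; an escaped bracket would be matched as a literal "[".

```csharp
using System.Text.RegularExpressions;

public static class PossessiveFixer
{
    // Group 1 captures whichever apostrophe variant sits between the two s characters:
    // the keyboard apostrophe, U+2018, U+2019, U+201A, U+201B or U+2032.
    private static readonly Regex DoubledPossessive =
        new Regex("s(['\u2018\u2019\u201A\u201B\u2032])s\\b");

    // Hypothetical helper: turns "Simmons's" into "Simmons'" and leaves other text alone.
    public static string Fix(string text)
    {
        // "s$1" keeps the first s and the captured apostrophe and drops the trailing s.
        return DoubledPossessive.Replace(text, "s$1");
    }
}
```

Calling PossessiveFixer.Fix(docText) after the [[COMPANY]] substitution would then handle documents produced by Word as well as those typed with a plain apostrophe.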

Parsing A String - Is There A More Efficient Method than Checking Each Line?

I am working on a project to parse out a text file. The file is output from networking equipment. The incoming string is anywhere from a few thousand to tens of thousands of lines long. There will be a variable number of entries with keywords like these:
fcN/N is up
Hardware is Fibre Channel, SFP is short wave laser w/o OFC (SN)
Port WWN is 20:52:00:0d:ec:ef:b0:40
Admin port mode is F, trunk mode is on
snmp link state traps are enabled
Port vsan is 10
fcipN is up
.....
port-channel-N is trunking
......
The N is a number. There will always be the 'fcN/N' entries, there may or may not be the other two. The 'fcip' and 'port-channel' entries will have similar status information after each one as the fcN/N entries. All entries of the same type will be grouped - there won't be an fc followed by an fcip followed by another fc. Also as a general rule, all the fc entries are listed, then all the port-channel then all the fcip but I don't want to assume that. At the moment I have about 7 different RegEx patterns I am looking for. I do this by examining each line in turn, however managing all those is cumbersome. I thought about splitting the string on newline and then some kind of LINQ select to get all of each of the 3 types of entries, but that assumes they are always grouped in the same order. I also thought about 3 monster regexes to match everything from one entry to the next, but my experience is those are tough to get working and almost unreadable. Another thing I thought of was first match the three keywords - fc or port-channel or fcip, then have an if statement that matches the patterns unique to those. That is still matching each line for all 3 patterns though.
To be clear, I have the Regex patterns working. I am looking for a more efficient way to do this than testing each line for 6 or 8 matches.
Any other ideas?

I have two thoughts:
(1) Your last approach of using if statements to first find the right regex to apply is likely to be quite efficient. I'd recommend it.
(2) You can compose regexes like this:
var pattern1 = @"abc";
var pattern2 = @"def";
var unionPattern = "((" + pattern1 + ")|(" + pattern2 + "))";
This makes it much more readable.
If you never want to find a match that spans lines you should split the file into lines first. That will improve efficiency because the regexes have smaller inputs and will backtrack less.
If your matches span multiple lines but they always start after a newline, you can split the string into chunks first like this:
var chunks = Regex.Split(str, @"((fc\d)|(fcip\d)|(port-channel-\d))");
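To make the chunking idea a bit more concrete, here is a sketch that splits the report into one string per entry and then decides which set of patterns applies by looking at the header only once. It is an illustration under assumptions: the class and method names are invented, header lines are assumed to start at column 0 in the forms shown in the question (fcN/N, fcipN, port-channel-N followed by " is "), and a zero-width lookahead is used instead of capturing groups so that each header stays attached to its own chunk.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InterfaceReportSplitter
{
    // Zero-width lookahead: split *before* every header line without consuming it.
    private static readonly Regex EntryStart = new Regex(
        @"(?=^(?:fc\d+/\d+|fcip\d+|port-channel-\d+) is )",
        RegexOptions.Multiline | RegexOptions.Compiled);

    // Returns one string per entry: the header line plus its status lines.
    public static List<string> SplitEntries(string report)
    {
        var entries = new List<string>();
        foreach (var chunk in EntryStart.Split(report))
        {
            if (chunk.Trim().Length > 0) entries.Add(chunk);
        }
        return entries;
    }

    // Cheap prefix test so that only the patterns for this entry type need to run.
    public static string KindOf(string entry)
    {
        if (entry.StartsWith("fcip", StringComparison.Ordinal)) return "fcip";
        if (entry.StartsWith("fc", StringComparison.Ordinal)) return "fc";
        if (entry.StartsWith("port-channel", StringComparison.Ordinal)) return "port-channel";
        return "unknown";
    }
}
```

Each entry can then be handed to the two or three regexes that are specific to its kind, which avoids testing every line against every pattern and does not depend on the order in which the groups appear.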

You might get clearer and more concise code by using a parser combinator library, such as Sprache.
Not being a C# programmer, I'm not intimately familiar with this library (and there may well be others for C# as well), but I've used Scala parser combinators to good effect, and they build on and use regular expression parsing.
Whether it makes your code more efficient likely depends on how inefficient your code is now.

Are you looking for raw speed, or efficiency? If the former, you can split the file into parts and have a thread parsing each part simultaneously. The trick will be finding a boundary to split on (so that each part contains only whole entries) quickly. You will also only want to go multithreaded if the total number of lines is large, or the overhead will outweigh the parallelization gains.

Regex to remove xml declaration from a string

First of all, I know this is a bad solution and I shouldn't be doing this.
Background: Feel free to skip
However, I need a quick fix for a live system. We currently have a data structure which serialises itself to a string by creating "xml" fragments via a series of string builders. Whether this is valid XML I rather doubt. After creating this xml, and before sending it over a message queue, some clean-up code searches the string for occurrences of the xml declaration and removes them.
The way this is done (iterating over every character and doing an indexOf for the <?xml) is so slow it's causing thread timeouts and killing our systems. Ultimately I'll be trying to fix this properly (build xml using xml documents or something similar) but for today I need a quick fix to replace what's there.
Please bear in mind, I know this is a far from ideal solution, but I need a quick fix to get us back up and running.
Question
My thought is to use a regex to find the declarations. I was planning on: <\?xml.*?>, then using Regex.Replace(input, pattern, string.Empty) to remove them.
Could you let me know if there are any glaring problems with this regex, or whether just writing it in code using string.IndexOf("<?xml") and string.IndexOf("?>") pairs in a (much saner) loop is better.
EDIT
I need to take care of newlines.
Would: <\?xml[^>]*?> do the trick?
EDIT2
Thanks for the help. Regex-wise, <\?xml.*?\?> worked fine. I ended up writing some timing code and testing both a regex and IndexOf(). I found that, for our simplest use case, JUST the declaration stripping took:
Nearly a second as it was
.01 of a second with the regex
untimable using a loop and IndexOf()
So I went for IndexOf(), as it's a very simple loop.

You probably want either this: <\?xml.*\?> or this: <\?xml.*?\?>, because the way you have it now, the regex is not looking for '?>' but just for '>'. I don't think you want the first option, because it's greedy and it will remove everything between the first occurrence of '<?xml' and the last occurrence of '?>'. The second option will work as long as you don't have nested XML tags. If you do, it will remove everything between the first '<?xml' and the first '?>', even if another '<?xml' tag starts in between.
Also, I don't know how regexes are implemented in .NET, but I seriously doubt if they're faster than using indexOf.

strXML = strXML.Remove(0, strXML.IndexOf(@"?>", 0) + 2);
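The IndexOf() loop that the asker settled on is not shown in the question, so the following is only a guess at what such a loop could look like. The names (XmlDeclarationStripper, Strip) are invented, and it assumes that every "<?xml" in the input starts a declaration that should go. That also means a processing instruction such as <?xml-stylesheet ...?> would be removed, which may or may not be wanted.

```csharp
using System;
using System.Text;

public static class XmlDeclarationStripper
{
    // Removes every "<?xml ... ?>" span in one pass, copying the text in between.
    public static string Strip(string input)
    {
        const string open = "<?xml";
        const string close = "?>";

        int start = input.IndexOf(open, StringComparison.Ordinal);
        if (start < 0) return input; // nothing to strip, avoid allocating

        var result = new StringBuilder(input.Length);
        int pos = 0; // first character not yet copied to the result
        while (start >= 0)
        {
            int end = input.IndexOf(close, start + open.Length, StringComparison.Ordinal);
            if (end < 0) break; // unterminated declaration: leave the rest untouched
            result.Append(input, pos, start - pos);
            pos = end + close.Length;
            start = input.IndexOf(open, pos, StringComparison.Ordinal);
        }
        result.Append(input, pos, input.Length - pos);
        return result.ToString();
    }
}
```

Because the search only ever moves forward and newlines are treated like any other character, this handles multi-line declarations without the greedy versus lazy concerns of the regex versions.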

Regular expression in C# , is this possible?

I have never used regular expressions before and plan to use them to solve my problem, but I'm not quite sure whether they can help me.
I have a situation where I need to store a rule or formula for building string values, like the following examples, in a database field, then retrieve this rule and build the string value.
FacilityCode + Left(ModelNO,2)
Right(PO,3) + Left(Serial,2)
Is this achievable using .NET regular expressions? Are there any good tutorials or simple examples for this problem?

Regexp : http://msdn.microsoft.com/en-us/library/2k3te2cs(VS.80).aspx
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
But it doesn't seem fitting :)

It might be better to code some random string generator. Regex is for searching data not creating data.
The thing to remember about regex is that it is like an aircraft carrier; it does one thing very very well, it does not do other jobs very well at all.
An aircraft carrier moves planes very well on the ocean; it does not make a cheese sandwich well AT ALL!!
That is to say, if you use regex when you shouldn't you will almost certainly use far more processing power than if you used another tool for that job. Html parsing comes to mind.

Regex is provided as part of System.Text.RegularExpressions, but you can't rely exclusively on it. It'll let you search existing strings, but you'll need to implement your own logic for building new strings based on what you find in the existing data.
Also, keep in mind that System.Text.RegularExpressions works differently from regexp in Perl and other implementations. For example, it doesn't recognize POSIX character class definitions.
Since you're new to regex, you might want to check out the "Regular Expressions User Guide" on zytrax.com. It's not as comprehensive as an O'Reilly manual, but it'll do as a start.
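To illustrate the split of responsibilities described above (regex for recognising the pieces of a stored rule, ordinary code for building the result), here is a minimal sketch. Everything in it is an assumption made for the example: the class name RuleEvaluator, the idea that field values arrive in a dictionary keyed by field name, and the restriction of the rule syntax to field names and Left/Right calls joined by "+", as in the two sample rules from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class RuleEvaluator
{
    // Recognises Left(Field,n) and Right(Field,n); anything else is treated as a plain field name.
    private static readonly Regex Call =
        new Regex(@"^(Left|Right)\(\s*(\w+)\s*,\s*(\d+)\s*\)$", RegexOptions.IgnoreCase);

    public static string Evaluate(string rule, IDictionary<string, string> fields)
    {
        var result = new StringBuilder();
        foreach (var rawTerm in rule.Split('+'))
        {
            var term = rawTerm.Trim();
            var m = Call.Match(term);
            if (!m.Success)
            {
                result.Append(fields[term]); // plain field reference
                continue;
            }

            var value = fields[m.Groups[2].Value];
            // Clamp n so that short values do not throw in Substring.
            int n = Math.Min(int.Parse(m.Groups[3].Value), value.Length);
            bool isLeft = m.Groups[1].Value.Equals("Left", StringComparison.OrdinalIgnoreCase);
            result.Append(isLeft ? value.Substring(0, n) : value.Substring(value.Length - n));
        }
        return result.ToString();
    }
}
```

The regex here only parses each term; the string building is plain Substring and Append logic. A rule language much richer than this would probably be better served by a small hand-written parser or a parser library than by more regexes.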

Removing <div>'s from text file?

I've made a small program in C#.NET which doesn't really serve much of a purpose: it tells you the chance of your DOOM based on today's news, lol. It takes an RSS feed from the BBC website on load and then looks for keywords which either increment or decrease the percentage chance of DOOM.
A crazy little project, but maybe one day the classes will come in handy for something more important.
I receive the RSS in XML format, but it contains a lot of div tags and formatting characters which I don't really want in the database of keywords.
What is the best way of removing these unwanted characters and divs?
Thanks,
Ash

If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
// .*? lazily matches up to the nearest closing tag; Singleline lets it span newlines.
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>.*?)" + Regex.Escape(end), string.Empty, RegexOptions.Singleline);
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>

IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.

Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.

A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
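Would match every paired HTML tag together with its content (use RegexOptions.IgnoreCase, since the pattern is written with uppercase letters).
Use this to remove them from your data.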
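The answers above cover two different goals, removing div elements together with their content and removing only the tags, so here is a short sketch showing both side by side. The class and method names are invented for the example. It assumes the divs are not nested, since a regex like this stops at the first closing tag, and for anything more complex an HTML or XML parser would be the safer tool.

```csharp
using System.Text.RegularExpressions;

public static class DivStripper
{
    // A whole <div ...>...</div> element, attributes and newlines included (not nesting-aware).
    private static readonly Regex DivWithContent =
        new Regex(@"<div\b[^>]*>.*?</div\s*>", RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Only the opening and closing div tags themselves.
    private static readonly Regex DivTagOnly =
        new Regex(@"</?div\b[^>]*>", RegexOptions.IgnoreCase);

    public static string RemoveDivsAndContent(string html)
    {
        return DivWithContent.Replace(html, string.Empty);
    }

    public static string RemoveDivTags(string html)
    {
        return DivTagOnly.Replace(html, string.Empty);
    }
}
```

For keyword extraction from an RSS description, RemoveDivTags is probably the one to reach for, because it keeps the text that was inside the divs.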
