C#: the following regex is too slow on a large string

I am using the following regex to filter a large string:
(?m)(?(^*(?=.*\\btrue\\b)(?=.*\\ba\\b).*\\r*$)(.*)|(?!))
It takes forever to run. What am I doing wrong? Is the problem my pattern, or is it the length of the string that causes the delay?
Please help me here.
Thanks in advance.

OK, I found that this regex works for multiple words (here I am using a two-word condition) combined with AND, and it is fast compared to my old regex, which @Dispersia pointed out was wrong. Strangely, the old one did work, but it took forever to produce a result.
(?i)(?m)^((?=.*\bword1\b)(?=.*\bword2\b)).*[\r\n]*$
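A minimal sketch of how the lookahead pattern above could be used to pull the matching lines out of a multi-line string. The names `LineFilter` and `LinesContainingAll` are made up for illustration, and `[^\r\n]*` is swapped in for `.*[\r\n]*$` so that line endings stay out of the returned lines; treat it as one possible arrangement rather than the poster's exact code.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineFilter
{
    // Returns every line of `text` that contains all of `words` as whole words.
    public static List<string> LinesContainingAll(string text, params string[] words)
    {
        // One lookahead per word: (?=.*\bword1\b)(?=.*\bword2\b)...
        // Regex.Escape makes each word match literally.
        string lookaheads = string.Concat(
            words.Select(w => @"(?=.*\b" + Regex.Escape(w) + @"\b)"));

        // ^ anchors each attempt to a line start (Multiline), so the lookaheads
        // run once per line instead of at every character position.
        var regex = new Regex(
            "^" + lookaheads + @"[^\r\n]*",
            RegexOptions.Multiline | RegexOptions.IgnoreCase);

        return regex.Matches(text)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

For example, `LineFilter.LinesContainingAll(text, "word1", "word2")` returns only the lines that contain both words.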

Related

URL regex - not getting it to work

I am using the following regex to check whether a text contains a URL; however, it seems to miss some URLs, like:
youtube.be/8P0BxJO
youtube.com/watch?v=VrmlFL
and also some bit.ly links (but not all)
Match m = Regex.Match(nc[i].InnerText,
    @"(http(s)?://)?([\w-]+\.)+[\w-]+(/\S\w[\w- ;,./?%&=]\S*)?");
if (m.Success)
{
    MessageBox.Show(nc[i].InnerText);
}
Any ideas how to fix it?
See this related question, the first answer should help you out. The suggestion both finds links and then replaces them, so obviously just take what you need. This and this article are different approaches that should get you more or less the same result.
Another (perhaps more reliable) non-regex approach would be to tokenize the string by splitting on spaces and punctuation, and then check each token to see whether it is a valid URI using Uri.IsWellFormedUriString (which only works on well-formed URIs, as this question points out).
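The tokenizing idea could be sketched roughly as follows. This version uses `Uri.TryCreate` rather than `Uri.IsWellFormedUriString`, because scheme-less tokens such as `youtube.com/watch?v=VrmlFL` first need a scheme added. The class and method names are hypothetical, and both the assumed `http://` prefix and the "host must contain a dot" rule are heuristics that will also accept things like email addresses or abbreviations with a dot in them.

```csharp
using System;
using System.Collections.Generic;

public static class UrlFinder
{
    // Hypothetical helper: returns the whitespace-separated tokens that look like web URLs.
    public static List<string> FindUrls(string text)
    {
        var found = new List<string>();
        char[] separators = { ' ', '\t', '\r', '\n' };
        char[] edgePunctuation = { '.', ',', ';', ':', '!', '?', '(', ')', '"', '\'' };

        foreach (string raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Drop punctuation that clings to the token in running text.
            string token = raw.Trim(edgePunctuation);
            if (token.Length == 0) continue;

            // Assumption: a token without a scheme is treated as http.
            string candidate = token.Contains("://") ? token : "http://" + token;

            Uri uri;
            if (Uri.TryCreate(candidate, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
                && uri.Host.Contains("."))
            {
                found.Add(token);
            }
        }
        return found;
    }
}
```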

Fastest way of removing unicode codes from a string

Hi, I'm trying to figure out a way to remove the tags from the results returned from the Google Feed API. Specifically, they are placing bold tags on titles and inside the description.
The codes that are being inserted are as follows:
\u003cb
\u003e
\u003c/b\u003e
Since it's a fixed set, I tried calling String.Replace() for each of these codes on every string, but, not surprisingly, performance was bad. I'm not sure whether a regex would be better (or worse). Does anyone have an idea how to remove these? Google does not supply an option to remove tags from the results.
You could remove the unicode codes using a regex like this one:
\\u[\d\w]{4}
var subject = @"\u003cb\u003e\u003c/b\u003e";
var result = Regex.Replace(subject, @"\\u[\d\w]{4}", String.Empty);
As for performance, this article seems to suggest that regex is much slower, but I would run your own tests with your own data, as the results might be wildly different. The regular expression itself plays a big part in performance, and I don't think the article states which regex was used, so it's impossible to compare. The size and type of your data also play a big part, so it's difficult to say which is better without understanding your data.
Also, you should try compiling the regex with the RegexOptions.Compiled flag to see if that boosts performance.
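A sketch of the compiled-regex suggestion, assuming the only tags to strip are the escaped `<b>` and `</b>` from the question. The pattern here is narrower than `\\u[\d\w]{4}`: that broader pattern removes only the escape sequences themselves and leaves the `b` and `/b` between them behind. `BoldTagStripper` is a made-up name, and whether `RegexOptions.Compiled` helps depends on how often the regex is reused, so measure with your own data.

```csharp
using System.Text.RegularExpressions;

public static class BoldTagStripper
{
    // Built once and reused; RegexOptions.Compiled trades a slower start-up
    // for faster matching on repeated calls.
    private static readonly Regex EscapedBoldTag =
        new Regex(@"\\u003c/?b\\u003e", RegexOptions.Compiled);

    // Removes the literal text \u003cb\u003e and \u003c/b\u003e from the input.
    public static string Strip(string input)
    {
        return EscapedBoldTag.Replace(input, string.Empty);
    }
}
```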

RegEx Problem: Valid URL does not match, but should

I posted a similar question earlier, but I now realize I should have been more thorough.
I've tested a number of the URL/URI expressions listed on regexlib.com, but I can't get any of them to work as desired:
msn.com
msn-msn.net
yahoo.c!om
http://www.yahoo.com
msn msn
test ! number 1
Here is how I desire them to act:
msn.com (match)
msn-msn.net (match)
yahoo.c!om (fail)
http://www.yahoo.com (match)
msn msn (fail)
test ! number 1 (fail)
I'm using the tester here: http://regexlib.com/RETester.aspx before testing in my own app (C#, .NET 4.0)
The expression that is closest is this, but it doesn't match the http://www.yahoo.com one:
^[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*
Any help is appreciated. Additionally, somebody should come up with a more human-readable equivalent to RegEx...this stuff is a nightmare.
Thanks,
Beems
If you can't guarantee that the URL-esque pattern you're trying to match has a scheme/protocol, then the safest thing to do is match against top-level domains:
^(https?://)?[^/]*\.(possibly|really|long|list|of|valid|top|level|domains)
From your post it's evidently not necessary to go into the path, hash, or querystring parts of a URL, so that's it!
This one appears to work as desired:
[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?
Can anyone see any issues with this in regards to my original query? I don't need to validate whether the TLD is proper, so this isn't really an issue.
I agree with kojiro, but this does match your tests:
http://www.rubular.com/r/gUb4U6Pzux
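To check the expression `[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?` against the six sample inputs from the question, a small helper like the one below could be used. `UrlPatternCheck` and `LooksLikeUrl` are hypothetical names. Note that the pattern is unanchored, so it reports a match whenever any URL-like substring is present, and `{2,3}` only covers two- and three-letter top-level domains.

```csharp
using System.Text.RegularExpressions;

public static class UrlPatternCheck
{
    private static readonly Regex UrlLike =
        new Regex(@"[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?");

    // True when the input contains a host-like run followed by a dot
    // and a two- or three-letter suffix.
    public static bool LooksLikeUrl(string input)
    {
        return UrlLike.IsMatch(input);
    }
}
```

With this helper, `msn.com`, `msn-msn.net` and `http://www.yahoo.com` are reported as matches, while `yahoo.c!om`, `msn msn` and `test ! number 1` are not.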

Regex to remove xml declaration from a string

First of all, I know this is a bad solution and I shouldn't be doing this.
Background: Feel free to skip
However, I need a quick fix for a live system. We currently have a data structure which serialises itself to a string by creating "xml" fragments via a series of string builders. Whether this is valid XML I rather doubt. After creating this xml, and before sending it over a message queue, some clean-up code searches the string for occurrences of the xml declaration and removes them.
The way this is done (iterating over every character and calling IndexOf for the <?xml) is so slow that it's causing thread timeouts and killing our systems. Ultimately I'll be trying to fix this properly (building the XML using XML documents or something similar), but for today I need a quick fix to replace what's there.
Please bear in mind, I know this is a far from ideal solution, but I need a quick fix to get us back up and running.
Question
My thought is to use a regex to find the declarations. I was planning on <\?xml.*?>, then using Regex.Replace(input, pattern, string.Empty) to remove them.
Could you let me know if there are any glaring problems with this regex, or whether just writing it in code using string.IndexOf("<?xml") and string.IndexOf("?>") pairs in a (much saner) loop is better.
EDIT
I need to take care of newlines.
Would: <\?xml[^>]*?> do the trick?
EDIT2
Thanks for the help. Regex-wise, <\?xml.*?\?> worked fine. I ended up writing some timing code and testing both a regex and IndexOf(). I found that, for our simplest use case, just the declaration stripping took:
Nearly a second as it was
0.01 of a second with the regex
Too little time to measure using a loop and IndexOf()
So I went for IndexOf(), as it's a very simple loop.
You probably want either this: <\?xml.*\?> or this: <\?xml.*?\?>, because the way you have it now, the regex is not looking for '?>' but just for '>'. I don't think you want the first option, because it's greedy and will remove everything between the first occurrence of '<?xml' and the last occurrence of '?>'. The second option will work as long as you don't have nested XML tags. If you do, it will remove everything between the first '<?xml' and the first '?>'.
Also, I don't know how regexes are implemented in .NET, but I seriously doubt if they're faster than using indexOf.
strXML = strXML.Remove(0, strXML.IndexOf("?>", 0) + 2);
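The IndexOf() loop the asker settled on was not posted, so the following is only a guess at its shape: a single pass that copies everything outside each `<?xml ... ?>` pair into a StringBuilder. `XmlDeclarationStripper` is a made-up name. The sketch assumes declarations are never nested, and it would also strip other processing instructions whose target starts with `xml`, such as `<?xml-stylesheet ...?>`.

```csharp
using System;
using System.Text;

public static class XmlDeclarationStripper
{
    // Removes every "<?xml ... ?>" span, including ones that contain newlines.
    public static string Strip(string input)
    {
        const string open = "<?xml";
        const string close = "?>";
        var sb = new StringBuilder(input.Length);
        int pos = 0;

        while (true)
        {
            int start = input.IndexOf(open, pos, StringComparison.Ordinal);
            if (start < 0) break;

            int end = input.IndexOf(close, start + open.Length, StringComparison.Ordinal);
            if (end < 0) break; // unterminated declaration: leave the rest untouched

            // Keep the text before the declaration, then skip past it.
            sb.Append(input, pos, start - pos);
            pos = end + close.Length;
        }

        // Copy whatever follows the last declaration.
        sb.Append(input, pos, input.Length - pos);
        return sb.ToString();
    }
}
```

Each character is visited a bounded number of times, which avoids the per-character IndexOf calls described in the question.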

Regular expression for validating a url

I'm a beginner with regexes. My requirement is to validate URLs ranging from simple ones to URLs with query strings, square brackets, etc. For example:
www.test.com?waa=[sample data]
The regex that I wrote only works for simple URLs. It fails for the one with square brackets. Any ideas?
Do you really need to use a regex?
bool isUri = Uri.IsWellFormedUriString("http://...", UriKind.RelativeOrAbsolute);
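One caveat with the sample URL in the question: square brackets and spaces are not allowed unescaped in a query string, so a strict check may reject it as written. A rough sketch, assuming you control how the URL is built, is to percent-encode the query value first and validate afterwards. `QueryUrlBuilder` and its methods are hypothetical names; `Uri.EscapeDataString` and `Uri.IsWellFormedUriString` are the framework calls doing the work.

```csharp
using System;

public static class QueryUrlBuilder
{
    // Hypothetical helper: appends one percent-encoded query parameter to a base URL.
    public static string WithQueryValue(string baseUrl, string name, string value)
    {
        return baseUrl + "?" + Uri.EscapeDataString(name) + "=" + Uri.EscapeDataString(value);
    }

    // Thin wrapper over the framework check, restricted to absolute URLs.
    public static bool IsValidAbsoluteUrl(string url)
    {
        return Uri.IsWellFormedUriString(url, UriKind.Absolute);
    }
}
```

For example, `QueryUrlBuilder.WithQueryValue("http://www.test.com/", "waa", "[sample data]")` produces `http://www.test.com/?waa=%5Bsample%20data%5D`, which can then be passed to `IsValidAbsoluteUrl`.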
I would suggest taking a better look at the following site
http://www.regular-expressions.info/dotnet.html
Without actually seeing the Regex you're using I can't provide much insight. And giving you the answer wouldn't really teach you much either. Give a man a regex and you help him for a bit. Teach him regex and he's good for life
Take a look at the following:
http://www.geekzilla.co.uk/view2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm
Thanks a lot for the reply.
This is what I wrote. It works for query strings too, but it fails when I add []:
/^(https?|ftp)://(?#)(([a-z0-9$.+!*\'(),;\?&=-]|%[0-9a-f]{2})+(?#)(:([a-z0-9$.+!*\'(),;\?&=-]|%[0-9a-f]{2})+)?(?#)#)?(#)((([a-z0-9][a-z0-9-][a-z0-9].)(#)[a-z]{2}[a-z0-9-]a-z0-9|(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5].){3}(?#)(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])(?#))(:\d+)?(?#))(((/+([a-z0-9$_.+!*\'(),;:#&=-]|%[0-9a-f]{2}))(?# )(\?([a-z0-9$_.+!*\'(),;:#&=-]|%[0-9a-f]{2}))(?#)?)?)?(?#)(#([a-z0-9$_.+!*\'(),;:#&=-]|%[0-9a-f]{2})*)?(?#)$/i
Use this if you want the URL to include http:
http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
If you don't want http in the URL, then go for:
([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
