regex match with * not matching text with non-English characters - c#

I am trying to scrape a page that has Hebrew text on it. It contains the following piece of HTML:
<div id="AgeRating">דירוג גיל: ‎12+‎</div>
I just want the 12+ part here (in fact, I only want the '12' part). I am currently doing this for other languages with the following regex:
new Regex(@"<div id=""AgeRating"">.*(\d{1,2})\+</div>", RegexOptions.Compiled);
But I just can't get this to match. I tried all the regex options like RightToLeft, CultureInvariant, SingleLine, MultiLine, etc. but nothing works. It does work fine with plenty of other languages though.
Note: I'm aware of HtmlAgilityPack for proper parsing of HTML. This is a question about why a seemingly correct regex fails to match a particular string (this is the sample I currently have).

This regular expression works for me:
<div id="AgeRating">.*?(\d{1,2})\+
This returns 12. I added a ? to .* to make the dot not greedy.
I think the thing that is throwing you off is that you have a hidden character (perhaps a Hebrew character?) after the plus sign. The following also works for your string (notice the dot after the plus sign, which accommodates your hidden character):
<div id="AgeRating">.*?(\d{1,2})\+.</div>
You also do need the ? after .* as I mentioned above in order to prevent the regular expression from returning 2 instead of 12.
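To make the hidden-character point concrete, here is a minimal sketch. It assumes the invisible character is a Unicode format character such as U+200E (LEFT-TO-RIGHT MARK), which is common around numbers in right-to-left text; the class and method names are made up for illustration. Instead of a bare dot it uses \p{Cf}*, which tolerates zero or more format characters after the plus sign.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AgeRatingDemo
{
    // \p{Cf} matches Unicode "format" characters (U+200E is one of them).
    // Allowing zero or more after the plus sign tolerates the invisible mark.
    private static readonly Regex AgeRating = new Regex(
        @"<div id=""AgeRating"">.*?(\d{1,2})\+\p{Cf}*</div>",
        RegexOptions.Compiled);

    // Returns the captured digits, or null when the pattern does not match.
    public static string ExtractAge(string html)
    {
        Match m = AgeRating.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // \u200E escapes make the assumed invisible characters explicit.
        string html = "<div id=\"AgeRating\">דירוג גיל: \u200E12+\u200E</div>";
        Console.WriteLine(ExtractAge(html)); // prints 12
    }
}
```

The same pattern still matches input without any hidden characters, because \p{Cf}* may match nothing.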


Regex that returns all integers in C# "111; 222; 3333" and "213" in a string with alpha

I am extracting all numbers used in an XML file. The numbers are written in the following two patterns:
<Environment Id="11" StringId="8407" DescriptionId="5014" RemoteControlAppStringId="8119; 8118" EnvironmentType="BlueToothBridge" AlternateId="1" XML_NAME_ID="BTBSpeechPlusM" FactoryGainType="LIN18">
<Offsets />
</Environment>
I am using the regexes "\"\d*;\"" and "\"\d*\"" to extract all numbers.
When I ran the regex "\"\d*\"" on the above using
Regex.Match(myString, "\"\\d*\"")
it returned 8407, 11 and 5014, but not 8119 or 8118.
Your regex fails to match 8119; 8118 because your pattern only finds numbers that are immediately enclosed in quotes.
Try:
\b\d+\b
\b specifies that \d+ matches only between word boundaries, so the 18 in LIN18 will not match.
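A minimal sketch of that suggestion, using Regex.Matches (which returns every match, whereas Regex.Match returns only the first). The input is a shortened version of the element from the question, and the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordBoundaryNumbers
{
    // Collects every run of digits that is delimited by word boundaries.
    public static string[] Extract(string input)
    {
        return Regex.Matches(input, @"\b\d+\b")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string xml = "<Environment Id=\"11\" StringId=\"8407\" " +
                     "RemoteControlAppStringId=\"8119; 8118\" FactoryGainType=\"LIN18\">";
        // The 18 in LIN18 is skipped: there is no boundary between N and 1.
        Console.WriteLine(string.Join(", ", Extract(xml)));
        // prints: 11, 8407, 8119, 8118
    }
}
```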
Depending on whether you can assume that the provided input is valid XML, you could use the following regular expression:
Regex.Matches(myString, "(?<=\")\\d+(?=\")|(?<=\")\\d+(?=; ?\\d+\")|(?<=\"\\d+; ?)\\d+(?=\")")
The main idea behind this is that it takes the three possible situations into account:
"[number]"
"[number]; [other_number]" (With or without a space before [other_number])
"[other_number]; [number]" (With or without a space before [number])
There are two new concepts I included in the regular expression:
Positive lookahead: (?=[regex])
Positive lookbehind: (?<=[regex])
These concepts allow the regular expression to check if something specific is before or after it, without putting it in the match.
This regular expression could easily be optimised, but this is meant as an example of a basic approach.
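As a rough illustration of the lookaround approach, the sketch below runs the same three-alternative pattern over a shortened version of the element from the question. It relies on .NET allowing variable-length lookbehind such as (?<="\d+; ?); the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedNumbers
{
    // 1st alternative: "[number]"
    // 2nd alternative: "[number]; [other_number]"
    // 3rd alternative: "[other_number]; [number]"
    private static readonly Regex Numbers = new Regex(
        @"(?<="")\d+(?="")|(?<="")\d+(?=; ?\d+"")|(?<=""\d+; ?)\d+(?="")");

    public static string[] Extract(string input)
    {
        return Numbers.Matches(input)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToArray();
    }

    public static void Main()
    {
        string xml = "<Environment Id=\"11\" StringId=\"8407\" " +
                     "RemoteControlAppStringId=\"8119; 8118\" FactoryGainType=\"LIN18\">";
        // The quotes and separators are checked but never become part of a match.
        Console.WriteLine(string.Join(", ", Extract(xml)));
        // prints: 11, 8407, 8119, 8118
    }
}
```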
One good tip for developing a regular expression like this is to use a tool (online or offline) to test your regular expression. The tool I used was .NET Regex Tester.
As @poke stated in the comment, it's because your regex doesn't match the string. Change your regex to capture specific matches and account for the possibility of the ';'.
Something like below should probably do the trick.
EDIT: (\b\d+\b)|(\b\d+[;*]\d+\b)

regular expression validation not to allow asterisk

I am trying not to allow asterisk character in my validation.
My regex expression is
addressFormat="^[a-zA-Z0-9 \~\!\@\#\$\%\^\*\(\)_\'\-\+\=\{\}\[\]\|\:\;\,\.\?\/]{0,45}$"
As suggested in a linked answer, I tried adding [^\*] as below.
"^[a-zA-Z0-9 \~\!\@\#\$\%\^\*\(\)_\'\-\+\=\{\}\[\]\|\:\;\,\.\?\/][^\*]{0,45}$"
"^[^\*][a-zA-Z0-9 \~\!\@\#\$\%\^\*\(\)_\'\-\+\=\{\}\[\]\|\:\;\,\.\?\/]{0,45}$"
But it still allows the asterisk (*) character in my textbox. What is the mistake in my code? Any suggestions?
Your regex can be simplified to:
"^[a-zA-Z0-9 ~!@#$%^*()_'+={}\[\]|:;,.?/-]{0,45}$"
and, as [a-zA-Z0-9_] is the same as \w:
"^[\w ~!@#$%^*()'+={}\[\]|:;,.?/-]{0,45}$"
then you could remove the *:
"^[\w ~!@#$%^()'+={}\[\]|:;,.?/-]{0,45}$"
First, for your information, you can simplify your regex to:
^(?i)[-a-z0-9 ~!@#$%^()_'+={}[\]|:;,.?/]{0,45}$
Since you are using C#, do not yield to the temptation of replacing [0-9a-z_] with \w unless you use the ECMAScript option: .NET regexes are Unicode-aware by default, so \w will all too happily match Arabic-Indic digits, Nepalese characters and so forth, which you might not want... Unless this is okay:
abcdᚠᚱᚩᚠᚢᚱტყაოსdᚉᚔమరמטᓂᕆᔭᕌसられま래도654۳۲١८৮੪૯୫୬१७੩௮௫౫೮൬൪๘໒໕២៧៦᠖
(But that's 60 chars, over your 45 limit anyway... Whew.)
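A small sketch of the difference, for anyone who wants to see it run. It compares \w with and without RegexOptions.ECMAScript; the sample strings (two Hebrew letters and two Extended Arabic-Indic digits, written as escapes) and the class and method names are my own choices for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    // Default .NET behaviour: \w covers Unicode letters and decimal digits.
    public static bool IsWordDefault(string s)
    {
        return Regex.IsMatch(s, @"^\w+$");
    }

    // With RegexOptions.ECMAScript, \w is restricted to [a-zA-Z_0-9].
    public static bool IsWordEcma(string s)
    {
        return Regex.IsMatch(s, @"^\w+$", RegexOptions.ECMAScript);
    }

    public static void Main()
    {
        string[] samples = { "abc_123", "\u05DE\u05D8", "\u06F3\u06F2" };
        foreach (string s in samples)
        {
            // Only the ASCII sample passes in both modes.
            Console.WriteLine(IsWordDefault(s) + " / " + IsWordEcma(s));
        }
    }
}
```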
More interestingly:
What was wrong before?
When you have a regex such as [^*][a-z] (simplifying your earlier expression), the [^*] matches exactly one character, then the [a-z] matches exactly one other character (the next one). They do not work together to impose a condition on the next character. Each of them is a character class, and each character class specifies the next character to be matched, subject to an optional quantifier (in your case, the {0,45}).
Would this work?
On the surface, this might look like the ticket, but I do not recommend it:
^[^*]{0,45}$
Why not? This matches any character that is not an asterisk, zero to 45 times. That sounds good, but eligible characters would include tabs, new lines, and any glyph in any language... Probably not what you are looking for.
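To see both points side by side, here is a minimal sketch. It assumes the asker's whitelist contains @# where the post shows a doubled #, and the class and field names are made up for illustration. The first pattern is the asker's second attempt; the second is the same whitelist with * simply left out of the class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressValidation
{
    // The attempt: [^*] constrains only the first character, while the
    // class that follows still lists * as an allowed character.
    public static readonly Regex Attempt = new Regex(
        @"^[^*][a-zA-Z0-9 ~!@#$%^*()_'+={}\[\]|:;,.?/-]{0,45}$");

    // The whitelist with * removed: an asterisk anywhere now fails.
    public static readonly Regex Fixed = new Regex(
        @"^[a-zA-Z0-9 ~!@#$%^()_'+={}\[\]|:;,.?/-]{0,45}$");

    public static void Main()
    {
        Console.WriteLine(Attempt.IsMatch("a*b"));       // True: * slips through
        Console.WriteLine(Fixed.IsMatch("a*b"));         // False
        Console.WriteLine(Fixed.IsMatch("12 Main St.")); // True
    }
}
```

Keeping an explicit whitelist also avoids the tabs-and-newlines problem of ^[^*]{0,45}$.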
Delete \* from your expression.
Also look at this link - it's really helpful when you are writing regular expressions.
jsFiddle example
HTML
<form>
<input type="text" required pattern="^[a-zA-Z0-9 \~\!\@\#\$\%\^\(\)_\'\-\+\=\{\}\[\]\|\:\;\,\.\?\/]{0,45}$" title="incorrect format"/>
<input type="submit"/>
</form>

Regex to adjust HTML hrefs in c#

I need to use regex to search through an html file and replace href="pagename" with href="pages/pagename"
Also the href could be formatted like HREF = 'pagename'
I do not want to replace any hrefs that could be upper or lowercase that begin with http, ftp, mailto, javascript, #
I am using c# to develop this little app in.
HTML manipulation through Regex is not recommended since HTML is not a "regular language." I'd highly recommend using the HTML Agility Pack instead. That gives you a DOM interface for HTML.
I have not tested with many cases, but for this case it worked:
var str = "href='page' href = 'www.goo' href='http://' href='ftp://'";
Console.WriteLine(Regex.Replace(str, @"href ?= ?(('|"")([a-z0-9_#.-]+)('|""))", "x", RegexOptions.IgnoreCase));
Result:
"x x href='http://' href='ftp://'"
You'd better keep backup files before running this :P
There are lots of caveats when using a find/replace with HTML and XML. The problem is, there are many variations of syntax which are permitted. (and many which are not permitted but still work!)
But, you seem to want something like this:
search for
([Hh][Rr][Ee][Ff]\s*=\s*['"])(\w+)(['"])
This means:
[Hh]: any of the items in square-brackets, followed by
\s*: any number of whitespaces (maybe zero),
=
\s* any more whitespaces,
['"] either quote type,
\w+: a word (without any slashes or dots - if you want to include .html then use [.\w]+ instead ),
and ['"]: another quote of any kind.
replace with
$1pages/$2$3
Which means the things in the first bracket, then pages/, then the stuff in the second and third sets of brackets.
You will need to put the first string in @" quotes, and also escape the double-quotes as "".
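Put together in C#, that could look roughly like the sketch below. The class and method names are made up for illustration, and the replacement is written as ${1}pages/${2}${3}, which is just the unambiguous spelling of $1pages/$2$3.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefRewriter
{
    // Verbatim string: backslashes are literal, double quotes are doubled.
    private static readonly Regex Href = new Regex(
        @"([Hh][Rr][Ee][Ff]\s*=\s*['""])(\w+)(['""])");

    // Prefixes bare page names with "pages/". Values containing ':' or '#'
    // (http:, ftp:, mailto:, javascript:, #anchor) are left alone because
    // \w+ cannot reach the closing quote across those characters.
    public static string AddPagesPrefix(string html)
    {
        return Href.Replace(html, "${1}pages/${2}${3}");
    }

    public static void Main()
    {
        Console.WriteLine(AddPagesPrefix("<a href=\"about\">About</a>"));
        Console.WriteLine(AddPagesPrefix("<a HREF = 'contact'>Contact</a>"));
        Console.WriteLine(AddPagesPrefix("<a href=\"http://example.com\">Ext</a>"));
    }
}
```

With \w+ as the page-name part, a value such as about.html is also left untouched; swapping in [.\w]+ would cover it.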
Note that it won't do anything even vaguely intelligent, like making sure the quotes match. Warning: try never to use the "any character" symbol (.) in this kind of regex, as it will grab large sections of text, over and including the next quotation mark, possibly up to the end of the file!
See a regex tutorial for more info, e.g. http://www.regular-expressions.info/dotnet.html

I need a regular expression to convert US tel number to link

Basically, the input field is just a string. People input their phone number in various formats. I need a regular expression to find and convert those numbers into links.
Input examples:
(201) 555-1212
(201)555-1212
201-555-1212
555-1212
Here's what I want:
(201) 555-1212 - Notice the space is gone
(201)555-1212
201-555-1212
555-1212
I know it should be more robust than just removing spaces, but it is for an internal web site that my employees will be accessing from their iPhone. So, I'm willing to "just get it working."
Here's what I have so far in C# (which should show you how little I know about regular expressions):
strchk = Regex.Replace(strchk, @"\b([\d{3}\-\d{4}|\d{3}\-\d{3}\-\d{4}|\(\d{3}\)\d{3}\-\d{4}])\b", "<a href='tel:$&'>$&</a>", RegexOptions.IgnoreCase);
Can anyone help me by fixing this or suggesting a better way to do this?
EDIT:
Thanks everyone. Here's what I've got so far:
strchk = Regex.Replace(strchk, @"\b(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})\b", "<a href='tel:$1'>$1</a>", RegexOptions.IgnoreCase);
It is picking up just about everything EXCEPT those with (nnn) area codes, with or without spaces between it and the 7 digit number. It does pick up the 7 digit number and link it that way. However, if the area code is specified it doesn't get matched. Any idea what I'm doing wrong?
Second Edit:
Got it working now. All I did was remove the \b from the start of the string.
Remove the [] and add \s* (zero or more whitespace characters) around each \-.
Also, you don't need to escape the -. (You can take out the \ from \-)
Explanation: [abcA-Z] is a character group, which matches a, b, c, or any character between A and Z.
It's not what you're trying to do.
Edits
In response to your updated regex:
Change [-\.\s] to [-\.\s]+ to match one or more of any of those characters (eg, a - with spaces around it)
The problem is that \b doesn't match the boundary between a space and a (.
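A quick way to see the \b issue is to run the asker's updated pattern with and without the leading \b on the same input. This is only a sketch; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneBoundaryDemo
{
    // The asker's updated pattern, minus the leading \b.
    private const string Body =
        @"(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})\b";

    // Returns the first match, or null when nothing matches.
    public static string FirstMatch(string input, bool leadingBoundary)
    {
        string pattern = (leadingBoundary ? @"\b" : "") + Body;
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // '(' is not a word character, so there is no \b in front of it at
        // the start of the string or after a space; only the last seven
        // digits can match when the leading \b is present.
        Console.WriteLine(FirstMatch("(201) 555-1212", true));  // 555-1212
        Console.WriteLine(FirstMatch("(201) 555-1212", false)); // (201) 555-1212
    }
}
```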
Afaik, no phone enters the other characters, so why not replace [^0-9] with '' ?
Here's a regex I wrote for finding phone numbers:
(\+?\d[-\.\s]?)?(\(\d{3}\)\s?|\d{3}[-\.\s]?)\d{3}[-\.\s]?\d{4}
It's pretty flexible... allows a variety of formats.
Then, instead of killing yourself trying to replace it w/out spaces using a bunch of back references, instead pass the match to a function and just strip the spaces as you wanted.
C#/.net should have a method that allows a function as the replace argument...
Edit: They call it a MatchEvaluator. That example uses a delegate, but I'm pretty sure you could use the slightly less verbose
m => m.Value.Replace(" ", "")
or something. working from memory here.
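For reference, a minimal sketch of the MatchEvaluator idea using the regex from this answer. The link format, the choice to strip spaces only inside the tel: value, and the class and method names are assumptions on my part, not something the asker specified.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneLinker
{
    private static readonly Regex Phone = new Regex(
        @"(\+?\d[-\.\s]?)?(\(\d{3}\)\s?|\d{3}[-\.\s]?)\d{3}[-\.\s]?\d{4}");

    // The lambda is converted to a MatchEvaluator: it receives each match
    // and returns the text that should replace it.
    public static string Linkify(string text)
    {
        return Phone.Replace(text, m =>
        {
            string dial = m.Value.Replace(" ", ""); // drop spaces for the href
            return "<a href='tel:" + dial + "'>" + m.Value + "</a>";
        });
    }

    public static void Main()
    {
        Console.WriteLine(Linkify("(201) 555-1212"));
        // prints: <a href='tel:(201)555-1212'>(201) 555-1212</a>
    }
}
```

Note that this particular pattern requires an area code, so a bare 555-1212 is left as plain text.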

I have two problems, one of them is a regex

I am updating some code that I didn't write and part of it is a regex as follows:
\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]
I understand that .*? does a non-greedy match of everything in the second register.
What does ?:\s* in the first and third registers do?
Update: As requested, language is C# on .NET 3.5
The syntax (?:) is a way of putting parentheses around a subexpression without separately extracting that part of the string.
The author wanted to match the (.*?) part in the middle, and didn't want the spaces at the beginning or the end from getting in the way. Now you can use \1 or $1 (or whatever the appropriate method is in your particular language) to refer to the domain name, instead of the first chunk of spaces at the beginning of the string.
?: makes the parentheses non-capturing. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.
What does ?:\s* in the first and third registers do?
It's matching zero or more whitespace characters, without capturing them.
The regex author intends to allow trailing whitespace in the square-bracket-tags, matching all DNS labels following the "www." like so:
[url]www.foo.com[/url] # foo.com
[url ]www.foo.com[/url ] # same
[url ]www.foo.com[/url] # same
[url]www.foo.com[/url ] # same
Note that the regex also matches:
[url]www.[/url] # empty string!
and fails to match
[url]stackoverflow.com[/url] # no match, bummer
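The cases above can be checked with a short sketch like this one (class and method names are made up for illustration). GetGroupNumbers shows that the pattern has only groups 0 and 1, since the (?:\s*) parts do not capture.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlTagDemo
{
    public static readonly Regex UrlTag = new Regex(
        @"\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]");

    // Returns what group 1 captured, or null when there is no match.
    public static string Domain(string input)
    {
        Match m = UrlTag.Match(input);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Group 0 (whole match) and group 1 only.
        Console.WriteLine(UrlTag.GetGroupNumbers().Length);              // 2
        Console.WriteLine(Domain("[url ]www.foo.com[/url ]"));           // foo.com
        Console.WriteLine(Domain("[url]www.[/url]") == "");              // True
        Console.WriteLine(Domain("[url]stackoverflow.com[/url]") == null); // True
    }
}
```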
You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn.
http://krijnhoetmer.nl/stuff/regex/cheat-sheet/
