I don't really know what to entitle this, but I need some help with regular expressions. Firstly, I want to clarify that I'm not trying to match HTML or XML, although it may look like it, it's not. The things below are part of a file format I use for a program I made to specify which details should be exported in that program. There is no hierarchy involved, just that each new line contains a 'tag':
<n>
This is matched with my program to find an enumeration, which tells my program to export the name value, anyway, I also have tags like this:
<adr:home>
This specifies the home address. I use the following regex:
<((?'TAG'.*):(?'SUBTAG'.*)?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
The problem is that the regex will split the adr:home tag fine, but fail to find the n tag because it lacks a colon, but when I add a ? or a *, it then doesn't split the adr:home and similar tags. Can anyone help? I'm sure it's only simple, it's just this is my first time at creating a regular expression. I'm working in C#, by the way.
Will this help
<((?'TAG'.*?)(?::(?'SUBTAG'.*))?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
I've wrapped the : capture into a non capturing group round subtag and made the tag capture non greedy
Not entirely sure what your aim is but try this:
(?><)(?'TAG'[^:\s>]*)(:(?'SUBTAG'[^\s>:]*))?(\s\w+=['"](?'VALUE'[^'"]*)['"])?(?>>)
I find this site extremely useful for testing C# regex expressions.
What if you put the colon as part of the second tag?
<((?'TAG'.*)(?':SUBTAG'.*)?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
Related
I need to check the syntax of a file name in order to treat it and exclude files that don't match the expression.
The file names must be as follows:
(2 caracters letter or number)_(some caracters letters and numbers)__(YYYY-MM-dd-HH-SS-MM).csv
The last part is a date
In the middle we have two underscores
Can you help me for this request ? I'm not familiar at all with the regex and the few tests I made were not good.
Thanks a lot :)
I have created some sample. please find below it will work for you also you can modify as per your requirement.
For example : (22)-(ww)-(aa).csv
^[(\[][a-zA-Z0-9]{2}[)\]]+(-[(\[][a-zA-Z0-9]{2}[)\]]+)+(-[(\[][a-zA-Z0-9]{2}[)\]]+.csv)$
For example : (22)_(ww)_(01-12-2018 19:20).csv
^[(\[][a-zA-Z0-9]{2}[)\]]+(_[(\[][a-zA-Z0-9]{2}[)\]]+)+(_[(\[]([1-9]|([012][0-9])|(3[01]))-([0]{0,1}[1-9]|1[012])-\d\d\d\d [012]{0,1}[0-9]:[0-6][0-9][)\]]+.csv)$
For example: (22)_(ww)_(1999-12-31-23-59-59).csv
^[(\[][a-zA-Z0-9]{2}[)\]]+(_[(\[][a-zA-Z0-9]{2}[)\]]+)+(_[(\[]19\d{2}(-|\/)((0[1-9])|(1[0-2]))(-|\/)((0[1-9])|([1-2][0-9])|(3[0-1]))(-)(([0-1][0-9])|(2[0-3]))-([0-5][0-9])-([0-5][0-9])[)\]]+.csv)$
I recommend you to refer this URL Click Here it will be helpful for learning as well as you can find more example. Thanks
This is the RegEx suited to your request if the String contains only the filename
^.{2}_.+__\d{4}(?:-\d{2}){5}.csv$
And this if you want to capture it from a longer String for example
(.{2}_.+__\d{4}(?:-\d{2}){5}.csv)
If you are having difficulties to create a RegEx i can recommend you to take a look at RegExr.
Have a xml string, goal is to replace an xml element value to a fixed string, i.e. for blah blah blah replace it to fixed value, I am thinking to use RegEx.Replace instead of loading the string to a DOM model and replace.
Could anyone please help on how to write this regular expression? essentially the goal is to match everything inside element tag 'abc'
Thanks a lot!
This article tells you what you need to know: XML is not Regular
Ignoring the most obvious solution to their problem (which would be to use a pre-existing XML parser), they think they should use regular expressions (regex for short). Now they have two problems.
Use regular expressions only on regular languages.
That said, there are many sites that purport to offer guidance on writing regular expressions for XML. They are all wrong. But they exist, and you can use them at your own risk.
For what it's worth, don't.
Process the XML normally, with a XmlDocument, Xml.Linq or XmlReader/Writer, it's what they are for, cover all kinds of edge cases we couldn't even imagine, and above all, are proven to work.
Don't use a regex for this, please . . . just don't.
My two cents.
let the downvoting begin
Regular expressions are meant to be used on regular languages. XML is a non-regular language. As such, regular expressions cannot be used to properly parse anything written in it. You will need to use a real XML parser, which can be found in the numerous libraries available in C#, to do it.
Regular expressions are not suitable for processing markup. Among other flaws, they won't work if elements can be nested:
<abc> ... <abc> ... </abc> ... </abc>
They are also unable to distinguish a comment from a non-comment.
You need a real XML parser.
First of all, I know this is a bad solution and I shouldn't be doing this.
Background: Feel free to skip
However, I need a quick fix for a live system. We currently have a data structure which serialises itself to a string by creating "xml" fragments via a series of string builders. Whether this is valid XML I rather doubt. After creating this xml, and before sending it over a message queue, some clean-up code searches the string for occurrences of the xml declaration and removes them.
The way this is done (iterate every character doing indexOf for the <?xml) is so slow its causing thread timeouts and killing our systems. Ultimately I'll be trying to fix this properly (build xml using xml documents or something similar) but for today I need a quick fix to replace what's there.
Please bear in mind, I know this is a far from ideal solution, but I need a quick fix to get us back up and running.
Question
My thought to use a regex to find the declarations. I was planning on: <\?xml.*?>, then using Regex.Replace(input, string.empty) to remove.
Could you let me know if there are any glaring problems with this regex, or whether just writing it in code using string.IndexOf("<?xml") and string.IndexOf("?>") pairs in a (much saner) loop is better.
EDIT
I need to take care of newlines.
Would: <\?xml[^>]*?> do the trick?
EDIT2
Thanks for the help. Regex wise <\?xml.*?\?> worked fine. I ended up writing some timing code and testing both using ar egex, and IndexOf(). I found, that for our simplest use case, JUST the declaration stripping took:
Nearly a second as it was
.01 of a second with the regex
untimable using a loop and IndexOf()
So I went for IndexOf() as it's easy a very simple loop.
You probably want either this: <\?xml.*\?> or this: <\?xml.*?\?>, because the way you have it now, the regex is not looking for '?>' but just for '>'. I don't think you want the first option, because it's greedy and it will remove everything between the first occurrence of ''. The second option will work as long as you don't have nested XML-tags. If you do, it will remove everything between the first ''. If you have another '' tag.
Also, I don't know how regexes are implemented in .NET, but I seriously doubt if they're faster than using indexOf.
strXML = strXML.Remove(0, sXMLContent.IndexOf(#"?>", 0) + 2);
It would be great if someone could provide me the Regular expression for the following string.
Sample 1: <div>abc</div><br>
Sample 2: <div>abc</div></div></div></div></div><br>
As you can see in the samples provided above, I need to match the string no matter how many number of </div> occurs. If there occurs any other string between </div> and <br>, say like this <div>abc</div></div></div>DEF</div></div><br> OR <div>abc</div></div></div></div></div>DEF<br>, then the Regex should not match.
Thanks in advance.
Try this:
<div>([^<]+)(?:<\/div>)*<br>
As seen on rubular
Notes:
This only works if there are not tags in the abc part (or anything that has a < symbol).
You might want to use start and end of string anchors (^<div>([^<]+)(?:<\/div>)*<br>$ if you want your string to match the pattern exactly.
If you want to allow the abc part to be empty, use * instead of +
That being said, you should be wary of using regex to parse HTML.
In this example, you can use regex because you are parsing a (hopefully) known, regular subset of HTML. But a more robust solution (ie: an [X]HTML parser like HtmlAgilityPack) is preferred when it comes to parsing HTML.
You need to use a real parser. Things like infinitely nested tags can't be handled via regex.
You could also include a named group in the the expression, e.g.:
<div>(?<text>[^<]*)(?:<\/div>)*<br>
Implemented in C#:
var regex = new Regex(#"<div>(?<text>[^<]*)(?:<\/div>)*<br>");
Func<Match, string> getGroupText = m => (m.Success && m.Groups["text"] != null) ? m.Groups["text"].Value : null;
Func<string, string> getText = s => getGroupText(regex.Match(s));
Console.WriteLine(getText("<div>abc</div><br>"));
Console.WriteLine(getText("<div>123</div></div></div></div></div><br>"));
NullUserException's answer is good. Here are a couple of questions, and variations, depending on what you want.
Do you want to prevent anything from occurring before the open div tag? If so, keep the ^ at the beginning of the regex. If not, drop it.
The rest of this post refers to the following section of the regex:
([^<]+?)
Do you want to capture the contents of the div, or just know that it matches your form? To capture, leave it as is. If you don't need to capture, drop the parentheses from the above.
Do you want to match if there is nothing inside the div? If so change the + in the above to *
Finally, although it will work fine, you don't need the ? in the above.
I think, this regex is more flexible:
<div\b[^><]*+>(?>.*?</div>)(?:\s*+</div>)*+\s*+<br(?:\s*+/)?>
I don't include the ^ and $ in the beginning and the end of my regex because we cannot assure that your sample will always in a single line.
I need to implement something similar to wikilinks on my site. The user is entering plain text and will enter [[asdf]] wherever there is an internal link. Only the first five examples are really applicable in the implementation I need.
Would you use regex, what expression would do this? Is there a library out there somewhere that already does this in C#?
On the pure regexp side, the expression would rather be:
\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)
\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)
By replacing the (.+?) suggested by David with ([^\]\|\r\n]+?), you ensure to only capture legitimate wiki links texts, without closing square brackets or newline characters.
([^\] ]\S+) at the end ensures the wiki link expression is not followed by a closing square bracket either.
I am note sure if there is C# libraries already implementing this kind of detection.
However, to make that kind of detection really full-proof with regexp, you should use the pushdown automaton present in the C# regexp engine, as illustrated here.
I don't know if there are existing libraries to do this, but if it were me I'd probably just use regexes:
match \[\[(.+?)\|(.+?)\]\](\S+) and replace with \1\3
match \[\[(.+?)\]\](\S+) and replace with \1\2
Or something like that, anyway.
Although this is an old question and already answered, I thought I'd add this as an addendum for anyone else coming along. The existing two answers do all the real work and got me 90% there, but here is the last bit for anyone looking for code to get straight on with trying:
string html = "Some text with a wiki style [[page2.html|link]]";
html = Regex.Replace(html, #"\[\[([^\]\|\r\n]+?)\|([^\]\|\r\n]+?)\]\]([^\] ]\S*)", #"$2$3");
html = Regex.Replace(html, #"\[\[([^\]\|\r\n]+?)\]\]([^\] ]\S*)", #"$1$2");
The only change to the actual regex is I think the original answer had the replacement parts the wrong way around, so the href was set to the display text and the link was shown on the page. I've therefore swapped them.