Complex regex or string parse - c#

We are trying to use URLs for complex querying and filtering.
I managed to get some of the simpler parts working using expression trees and a mix of regex and string manipulation, but then we looked at a more complex example string:
var filterstring="(|(^(categoryid:eq:1,2,3,4)(categoryname:eq:condiments))(description:lk:”*and*”))";
I'd like to be able to parse this into parts, and also allow it to be recursive. I'd like the output to look like:
item[0] (^(categoryid:eq:1,2,3,4)(categoryname:eq:condiments)
item[1] description:lk:”*and*”
From there I could strip down the item[0] part to get
categoryid:eq:1,2,3,4
categoryname:eq:condiments
At the moment I'm using regex and string operations to find the | or ^ that tells me whether it's an AND or an OR. The regex matches brackets and works well for a single item; it's when we nest the values that I'm struggling.
The regex looks like
@"\((.*?)\)"
I need some way of using regex to match the nested brackets. Any help would be appreciated.

You could transform the string into valid XML (just some simple replace, no validation):
var output = filterstring
    .Replace("(", "<node>")
    .Replace(")", "</node>")
    .Replace("|", "<andNode/>")
    .Replace("^", "<orNode/>");
Then, you could parse the XML nodes by using, for example, System.Xml.Linq.
XDocument doc = XDocument.Parse(output);
Based on your comment, here's how you can rearrange the XML to get the wrapping you need:
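// Snapshot the descendants first: the loop body moves elements around,
// and changing the tree while Descendants() is still being enumerated is unsafe.
foreach (var item in doc.Root.Descendants().ToList())
{
    if (item.Name == "orNode" || item.Name == "andNode")
    {
        // Move every following sibling inside the operator node
        item.ElementsAfterSelf()
            .ToList()
            .ForEach(x =>
            {
                x.Remove();
                item.Add(x);
            });
    }
}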
Here's the resulting XML content:
<node>
  <andNode>
    <node>
      <orNode>
        <node>categoryid:eq:1,2,3,4</node>
        <node>categoryname:eq:condiments</node>
      </orNode>
    </node>
    <node>description:lk:”*and*”</node>
  </andNode>
</node>

I understand that you want the values specified in the filterstring.
My solution would be something like this:
NameValueCollection values = new NameValueCollection();
foreach (Match pair in Regex.Matches(filterstring, @"\((?<name>\w+):(?<operation>\w+):(?<value>[^)]*)\)"))
{
    if (pair.Groups["operation"].Value == "eq")
        values.Add(pair.Groups["name"].Value, pair.Groups["value"].Value);
}
The regex matches a (name:operation:value) group; it doesn't care about all the other stuff.
After this code has run you can get the values like this:
values["categoryid"]
values["categoryname"]
values["description"] // null here: its operation is "lk", and only "eq" pairs are stored
I hope this will help you in your quest.

I think you should just write a proper parser for that. It would actually end up simpler and more extensible, and would save you time and headaches in the future. You can use an existing parser generator such as Irony or ANTLR.
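As a rough illustration of the hand-written route (rather than a generator such as Irony or ANTLR), here is a minimal recursive-descent sketch for the filter syntax from the question. The FilterNode and FilterParser names are made up for this example, and it assumes that a leaf value never contains a closing bracket.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for this sketch: a group has Operator set to '|' or '^'
// and a list of Children; a leaf has Operator == '\0' and the raw
// "name:operation:value" text in Text.
public sealed class FilterNode
{
    public char Operator;
    public string Text = "";
    public List<FilterNode> Children = new List<FilterNode>();
}

public static class FilterParser
{
    // Parses a whole filter string such as "(|(^(a:eq:1)(b:eq:2))(c:lk:x))".
    public static FilterNode Parse(string input)
    {
        int pos = 0;
        FilterNode root = ParseNode(input, ref pos);
        if (pos != input.Length)
            throw new FormatException("Unexpected text at position " + pos);
        return root;
    }

    // node := '(' ( ('|' | '^') node* | leafText ) ')'
    private static FilterNode ParseNode(string s, ref int pos)
    {
        Expect(s, ref pos, '(');
        var node = new FilterNode();

        if (pos < s.Length && (s[pos] == '|' || s[pos] == '^'))
        {
            // Group: remember the operator, then parse each nested node recursively.
            node.Operator = s[pos++];
            while (pos < s.Length && s[pos] == '(')
                node.Children.Add(ParseNode(s, ref pos));
        }
        else
        {
            // Leaf: everything up to the next ')' is the condition text.
            int end = s.IndexOf(')', pos);
            if (end < 0)
                throw new FormatException("Missing ')' for leaf starting at " + pos);
            node.Text = s.Substring(pos, end - pos);
            pos = end;
        }

        Expect(s, ref pos, ')');
        return node;
    }

    private static void Expect(string s, ref int pos, char expected)
    {
        if (pos >= s.Length || s[pos] != expected)
            throw new FormatException("Expected '" + expected + "' at position " + pos);
        pos++;
    }
}
```

Calling FilterParser.Parse on the example filter string should give a '|' node with two children: a '^' group holding the two category conditions, and the description leaf. Each leaf's Text can then be split on ':' to get the name, operation and value.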

Related

RegEx to pull out specific URL format from HTML source

I'm having problems with RegEx and trying to pull out a specifically formatted HTML link from a page's HTML source.
The HTML source contains many of these links. The link is in the format:
<a class="link" href="pagedetail.html?record_id=123456">RecordName</a>
For each matching link, I would like to be able to easily extract the following two bits of information:
The URL bit. E.g. pagedetail.html?record_id=123456
The link name. E.g. RecordName
Can anyone please help with this, as I'm completely stuck? I need this for a C# program, so any C#-specific notation would be great. Thanks
TIA
People will tell you that you should not parse HTML with regex, and I think that is a valid statement.
But sometimes, with well-formed HTML and really simple cases like yours seems to be, you can use a regex to do the job.
For example, you can use this regex and read group 1 for the URL and group 2 for the RecordName:
<a class="link" href="([^"]+)">([^<]+)<
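Since the question asks for C# notation, the pattern above could be applied roughly like this. It is only a sketch: the LinkExtractor class and Extract method are names invented for this example, and it assumes every link is written exactly in the format shown in the question (same attribute order, double quotes).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as above, with the double quotes escaped for a C# string literal.
    private static readonly Regex LinkPattern =
        new Regex("<a class=\"link\" href=\"([^\"]+)\">([^<]+)<");

    // Returns one (url, name) pair per matching link, in document order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            // Group 1 is the href value, group 2 is the link text.
            links.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return links;
    }
}
```

Each returned pair has the URL in Key and the link name in Value.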
I feel a bit silly answering this, because it should be evident through the two comments to your question, but...
You should not parse HTML with REGEX!
Use an XML parser, or better yet, a dedicated tool, like the HTML Agility Pack (which is still an XML parser, but fancier for working with HTML).
You can use the TagRegex and EndTagRegex classes (from System.Web.RegularExpressions) to parse the HTML string and find the tag you want. You need to step through the characters of the HTML string to find the desired tag.
e.g.
var position = 0;
var tagRegex = new TagRegex();
var endTagRegex = new EndTagRegex();
while (position < html.Length)
{
    var match = tagRegex.Match(html, position);
    if (match.Success)
    {
        // Opening tag found at this position
        var tagName = match.Groups["tagname"].Value;
        if (tagName == "a")
        { ... }
    }
    else
    {
        // Keep the end-tag match in its own variable so its groups can be read
        var endMatch = endTagRegex.Match(html, position);
        if (endMatch.Success)
        {
            var tagName = endMatch.Groups["tagname"].Value;
            if (tagName == "a")
            { ... }
        }
    }
    position++;
}

Select xml element by class name with a regex

I want to extract an SVG element by its class name with a C# regex.
For example, I have this:
<path fill="none" ... class="highcharts-tracker highcharts-tracker" ... stroke-width="22" zIndex="2" style=""/>
And I want to delete every path element with highcharts-tracker as its class name by using:
new Regex("");
Does anybody know how?
In LINQ to XML, this is pretty straightforward:
var classToRemove = "highcharts-tracker";
var doc = XDocument.Parse(svg);
var elements = doc.Descendants("path")
    .Where(x => x.Attribute("class") != null &&
                x.Attribute("class")
                 .Value.Split(' ')
                 .Contains(classToRemove));
// Remove all the elements which match the query
elements.Remove();
You should not use regular expressions to try to parse XML... XML is very well handled by existing APIs, and regular expressions are not an appropriate tool.
EDIT: If it's malformed (which you should have said to start with) you should try to work out why it's malformed and fix it before you try to do any other processing. There's really no excuse for XML being malformed these days... there are plenty of good XML APIs for just about every platform in existence.

Finding text between tags and replacing it along with the tags

I am using the following regex pattern to find text between [code] and [/code] tags:
(?<=\[code\]).*?(?=\[/code\])
It returns anything enclosed between these two tags, e.g. [code]return Hi There;[/code] gives me return Hi There;.
I need help with a regex that replaces the entire text along with the tags.
Use this:
var s = "My temp folder is: [code]Path.GetTempPath()[/code]";
var result = Regex.Replace(s, @"\[code\](.*?)\[/code\]",
    m =>
    {
        var codeString = m.Groups[1].Value;
        // then evaluate this string
        return EvaluateMyCode(codeString);
    });
I would use an HTML parser for this. I can see that what you are trying to do is simple; however, these things have a habit of getting much more complicated over time. The end result is much pain for the poor soul who has to maintain the code in the future.
Take a look at this question about HTML Parsers
What is the best way to parse html in C#?
[Edit]
Here is a much more relevant answer to the question asked.
@Milad Naseri's regex is correct; you just need to do something like
string matchCodeTag = @"\[code\](.*?)\[/code\]";
string textToReplace = "[code]The Ape Men are comming[/code]";
string replaceWith = "Keep Calm";
string output = Regex.Replace(textToReplace, matchCodeTag, replaceWith);
Check out these websites for more examples
http://www.dotnetperls.com/regex-replace
http://oreilly.com/windows/archive/csharp-regular-expressions.html
Hope this helps
You need to use a backreference, i.e. replace \[code\](.*?)\[/code\] with something like <code>$1</code>, which gives you whatever was enclosed by the [code][/code] tags, wrapped (for this example) in <code></code> tags.
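In C#, that backreference replacement might look like the sketch below. CodeTagConverter and ToHtml are names invented for this example. Note that `.` does not match newlines by default, so a multi-line [code] block would need RegexOptions.Singleline.

```csharp
using System.Text.RegularExpressions;

public static class CodeTagConverter
{
    // Replaces each [code]...[/code] pair with <code>...</code>.
    // $1 in the replacement string refers to the first capture group,
    // i.e. the text between the tags.
    public static string ToHtml(string input)
    {
        return Regex.Replace(input, @"\[code\](.*?)\[/code\]", "<code>$1</code>");
    }
}
```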

Extracting last character of a sentence using Regex

I want to extract the last character of a string. Let me make it clear with an example. The following is the string I want to extract from:
<spara h-align="right" bgcolor="none" type="verse" id="1" pnum="1">
<line>
<emphasis type="italic">Approaches to Teaching and Learning</emphasis>
</line>
</spara>
In the above string I want to insert a space between the word "Learning" and "</emphasis>" if there is no space present.
Thanks,
Have a look at some of the Linq to XML examples on here instead of using Regex.
With Linq to XML you can do it as follows:
XDocument doc = XDocument.Load("xmlfilename");
foreach (var emphasis in doc.Descendants("emphasis"))
{
    // EndsWith also copes with an empty element, where Last() would throw
    if (!emphasis.Value.EndsWith(" "))
        emphasis.Value += " ";
}
doc.Save("outputfilename");
Instead of files, you can pass streams, readers, etc. to Load.
Something like the following perhaps?
Regex.Replace(yourString, @"(>[^<]+[^ ])<", @"$1 <");
The solution assumes a sentence is between > and < and is one or more characters long.
Is the sentence really inside XML, or have you extracted it using any of the many XML or DOM methods? For instance, using this:
foreach (XmlNode node in YourDOM.SelectNodes("//emphasis[@type='italic']"))
{
    string yourString = node.FirstChild.Value;
}
If so, if the string is on its own, you can do this instead, which is way simpler and safer:
Regex.Replace(yourString, "([^ ])$", "$1 ");
EDIT: I originally missed the "if there's no space present" requirement; the post above has been edited to account for it.

replacing an undefined tags inside an xml string using a regex

I need to replace undefined tags inside an XML string.
Example: <abc> <>sdfsd <dfsdf></abc><def><movie></def> (only <abc> and <def> are defined)
should result in: <abc> &lt;&gt;sdfsd &lt;dfsdf&gt;</abc><def>&lt;movie&gt;</def>
<> and <dfsdf> are not predefined like <abc> and <def>, and do not have closing tags.
It must be done with a regex!
No using XML load and such.
I'm working with C# .NET.
Thanks!
How about this:
string s = "<abc> <>sdfsd <dfsdf></abc><def><movie></def>";
string regex = "<(?!/?(?:abc|def)>)|(?<!</?(?:abc|def))>";
string result = Regex.Replace(s, regex, match =>
{
    // Escape any angle bracket that is not part of a defined tag
    if (match.Value == "<")
        return "&lt;";
    else
        return "&gt;";
});
Console.WriteLine(result);
Result:
<abc> &lt;&gt;sdfsd &lt;dfsdf&gt;</abc><def>&lt;movie&gt;</def>
Also, when tested on your other test case (which by the way I found in a comment on the other question):
<abc>>sdfsdf<<asdada>>asdasd<>asdasd<asdsad>asds<</abc>
I get this result:
<abc>&gt;sdfsdf&lt;&lt;asdada&gt;&gt;asdasd&lt;&gt;asdasd&lt;asdsad&gt;asds&lt;</abc>
Let me guess... this doesn't work for you because you just thought of a new requirement? ;)
it must be done with a regex! no using xml load and such.
I must hammer this nail in with my boot! No using a hammer and such. It's an old story :)
You'll need to supply more information. Are "valid" tags allowed to be nested? Are the "valid" tags likely to change at any point? How robust does this need to be?
Assuming that your list of valid tags isn't going to change at any point, you could do it with a regex substitution:
s/<(?!\/?(?:your|valid|tags))([^>]*)>/&lt;$1&gt;/g
