Extracting last character of a sentence using Regex - c#

I want to extract last character of a string. In fact I should make clear with example. Following is the string from which i want to extract:
<spara h-align="right" bgcolor="none" type="verse" id="1" pnum="1">
<line>
<emphasis type="italic">Approaches to Teaching and Learning</emphasis>
</line>
</spara>
In the above string i want to insert space between the word "Learning" and "</emphasis>" if there is no space present.
Thanks,

Have a look at some of the Linq to XML examples on here instead of using Regex.

With Linq to XML you can do it as follows:
XDocument doc = XDocument.Load("xmlfilename");
foreach (var emphasis in doc.Descendants("emphasis"))
{
if (emphasis.Value.Last() != ' ')
emphasis.Value += " ";
}
doc.Save("outputfilename");
Instead of files you may use streams, readers etc in the Load

Something like the following perhaps?
Regex.Replace(yourString, #"(>[^<]+[^ ])<", #"$1 <");
The solution assumes a sentence is between > and < and is one or more characters long.
Is the sentence really inside XML, or have you extracted it using any of the many XML or DOM methods? For instance, using this:
foreach(node in YourDOM.SelectNodes("//emphasis[#type='italic']"))
{
string yourString = node.FirstChild.Value;
}
If so, if the string is on its own, you can do this instead, which is way simpler and safer:
Regex.Replace(yourString, "([^ ])$", "$1 ");
EDIT: I originally missed if there's no space present, the post above is edited with this information

Related

How to escape invalid characters inside XML string in C#

I have an XML string in C#. This XML has several tags. In some of these tags there are invalid characters like '&' in the text. I need to escape these characters inside the text from the whole long XML string but I want to keep the tags.
I have tried HttpUtility.HtmlEncode and few other available methods but they encode the whole string rather then just the text inside the tags. Example tags are
<node1>This is a string & so is this</node1> should be converted to
<node1>This is a string & so is this</node1>
Any ideas? thanks
P.S. I know similar question has been asked before I have not found a complete solution for this problem.
I guess the simplest solution is to load the whole Xml document in memory as an XmlDocument and then go through the elements and replace the values with their html encoded form.
you can use a CDATA field, like this:
<YourXml>
<Id>1</Id>
<Content>
<![CDATA[
your special caracteres
]]>
</content>
</yourXml>
I dont get what is the big deal in this. When you have the entire xml as a string, the easiest way to achieve what u want is to use the Replace function.
For example the whole xml is in the string str, then all u have to do is,
str.Replace("&" , "&");
Thats it man. You have achieved whatever u wanted to. Some times very simple solutions exist for big problems. Hope this helps for you.
XDocument or XmlDocument is a way to go. If for some crazy out of your control reason you need to encode just text blocks inside XmlElement:
using System.Text;
using System.Xml;
static string EncodeText(string unescapedText) {
if (string.IsNullOrEmpty(unescapedText)) {
return unescapedText;
}
var builder = new StringBuilder(unescapedText.Length);
using (var writer = XmlTextWriter.Create(builder, new XmlWriterSettings {
ConformanceLevel = ConformanceLevel.Fragment
})) {
writer.WriteValue(unescapedText);
}
return builder.ToString();
}

Complex regex or string parse

We are trying to use urls for complex querying and filtering.
I managed to get some of the simpler parst working using expression trees and a mix of regex and string manipulation but then we looked at a more complex string example
var filterstring="(|(^(categoryid:eq:1,2,3,4)(categoryname:eq:condiments))(description:lk:”*and*”))";
I'd like to be able to parse this out in to parts but also allow it to be recursive.. I'd like to get the out put looking like:
item[0] (^(categoryid:eq:1,2,3,4)(categoryname:eq:condiments)
item[1] description:lk:”*and*”
From there I could Strip down the item[0] part to get
categoryid:eq:1,2,3,4
categoryname:eq:condiments
At the minute I'm using RegEx and strings to find the | ^ for knowing if it's an AND or an OR the RegEx matches brackets and works well for a single item it's when we nest the values that I'm struggling.
the Regex looks like
#"\((.*?)\)"
I need some way of using Regex to match the nested brackets and help would be appreciated.
You could transform the string into valid XML (just some simple replace, no validation):
var output = filterstring
.Replace("(","<node>")
.Replace(")","</node>")
.Replace("|","<andNode/>")
.Replace("^","<orNode/>");
Then, you could parse the XML nodes by using, for example, System.Xml.Linq.
XDocument doc = XDocument.Parse(output);
Based on you comment, here's how you rearrange the XML in order to get the wrapping you need:
foreach (var item in doc.Root.Descendants())
{
if (item.Name == "orNode" || item.Name == "andNode")
{
item.ElementsAfterSelf()
.ToList()
.ForEach(x =>
{
x.Remove();
item.Add(x);
});
}
}
Here's the resulting XML content:
<node>
<andNode>
<node>
<orNode>
<node>categoryid:eq:1,2,3,4</node>
<node>categoryname:eq:condiments</node>
</orNode>
</node>
<node>description:lk:”*and*”</node>
</andNode>
</node>
I understand that you want the values specified in the filterstring.
My solution would be something like this:
NameValueCollection values = new NameValueCollection();
foreach(Match pair in Regex.Matches(#"\((?<name>\w+):(?<operation>\w+):(?<value>[^)]*)\)"))
{
if (pair.Groups["operation"].Value == "eq")
values.Add(pair.Groups["name"].Value, pair.Groups["value"].Value);
}
The Regex understand a (name:operation:value), it doesn't care about all the other stuff.
After this code has run you can get the values like this:
values["categoryid"]
values["categoryname"]
values["description"]
I hope this will help you in your quest.
I think you should just make a proper parser for that — it would actually end up simpler, more extensible and save you time and headaches in the future. You can use any existing parser generator such as Irony or ANTLR.

Finding text between tags and replacing it along with the tags

I am using The following regex pattern to find text between [code] and [/code] tags:
(?<=[code]).*?(?=[/code])
It returns me anything which is enclosed between these 2 tags, e.g. this: [code]return Hi There;[/code] gives me return Hi There;.
I need help with regex to replace entire text along with the tags.
Use this:
var s = "My temp folder is: [code]Path.GetTempPath()[/code]";
var result = Regex.Replace(s, #"\[code](.*?)\[/code]",
m =>
{
var codeString = m.Groups[1].Value;
// then evaluate this string
return EvaluateMyCode(codeString)
});
I would use a HTML Parser for this. I can see that what you are trying to do is simple, however these things have a habit to get much more complicated overtime. The end result is much pain for the poor sole who has to maintain the code in the future.
Take a look at this question about HTML Parsers
What is the best way to parse html in C#?
[Edit]
Here is a much more relevant answer to the question asked.
#Milad Naseri regex is correct you just need to do something like
string matchCodeTag = #"\[code\](.*?)\[/code\]";
string textToReplace = "[code]The Ape Men are comming[/code]";
string replaceWith = "Keep Calm";
string output = Regex.Replace(textToReplace, matchCodeTag, replaceWith);
Check out this web sites for more examples
http://www.dotnetperls.com/regex-replace
http://oreilly.com/windows/archive/csharp-regular-expressions.html
Hope this helps
You need to use back referencing, i.e. replace \[code\](.*?)\[/code\] with something like <code>$1</code> which will give you what's been enclosed by the [code][/code] tags enclosed in -- for this example -- <code></code> tags.

replacing an undefined tags inside an xml string using a regex

i need to replace an undefined tags inside an xml string.
example: <abc> <>sdfsd <dfsdf></abc><def><movie></def> (only <abc> and <def> are defined)
should result with: <abc> <>sdfsd <dfsdf></abc><def><movie><def>
<> and <dfsdf> are not predefined as and and does not have a closing tag.
it must be done with a regex!.
no using xml load and such.
i'm working with C# .Net
Thanks!
How about this:
string s = "<abc> <>sdfsd <dfsdf></abc><def><movie></def>";
string regex = "<(?!/?(?:abc|def)>)|(?<!</?(?:abc|def))>";
string result = Regex.Replace(s, regex, match =>
{
if (match.Value == "<")
return "<";
else
return ">";
});
Console.WriteLine(result);
Result:
<abc> <>sdfsd <dfsdf></abc><def><movie></def>
Also, when tested on your other test case (which by the way I found in a comment on the other question):
<abc>>sdfsdf<<asdada>>asdasd<>asdasd<asdsad>asds<</abc>
I get this result:
<abc>>sdfsdf<<asdada>>asdasd<>asdasd<asdsad>asds<</abc>
Let me guess... this doesn't work for you because you just thought of a new requirement? ;)
it must be done with a regex! no using xml load and such.
I must hammer this nail in with my boot! No using a hammer and such. It's an old story :)
You'll need to supply more information. Are "valid" tags allowed to be nested? Are the "valid" tags likely to change at any point? How robust does this need to be?
Assuming that your list of valid tags isn't going to change at any point, you could do it with a regex substitution:
s/<(?!\/?(your|valid|tags))([^>]*)>/<$1>/g

parsing XML with ampersand

I have a string which contains XML, I just want to parse it into Xelement, but it has an ampersand. I still have a problem parseing it with HtmlDecode. Any suggestions?
string test = " <MyXML><SubXML><XmlEntry Element="test" value="wow&" /></SubXML></MyXML>";
XElement.Parse(HttpUtility.HtmlDecode(test));
I also added these methods to replace those characters, but I am still getting XMLException.
string encodedXml = test.Replace("&", "&").Replace("<", "<").Replace(">", ">").Replace("\"", """).Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);
t
or Even tried it with this:
string newContent= SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);
Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.
For example, "wow&".Replace("&", "&") results in wow&amp; which is clearly undesirable.
Regex.Replace can give you more control to avoid this scenario, and can be written to only match "&" symbols that are not part of other characters, such as <, something like:
string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&");
The above works, but admittedly it doesn't cover the variety of other characters that start with an ampersand, such as and the list can grow.
A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&" the decode process would return "&wow&" then re-encoding it would return "&wow&", which is desirable. To pull this off you could use this:
string result = Regex.Replace(test, #"value=\""(.*?)\""", m => "value=\"" +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
"\"");
var doc = XElement.Parse(result);
Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
EDIT: updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Attempting to manipulate XML/HTML tags with regex is not favorable as it can be error prone and over-complicated. Your case is somewhat special since you need to sanitize it first in order to make use of it.
string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
m.Groups["start"].Value +
HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
m.Groups["end"].Value);
var doc = XElement.Parse(result);
Your string doesn't contain valid XML, that's the issue. You need to change your string to:
<MyXML><SubXML><XmlEntry Element="test" value="wow&" /></SubXML></MyXML>"
HtmlEncode will not do the trick, it will probably create even more ampersands (for instance, a ' might become ", which is an Xml entity reference, which are the following:
& &
&apos; '
" "
< <
> >
But it might you get things like &nbsp, which is fine in html, but not in Xml. Therefore, like everybody else said, correct the xml first by making sure any character that is NOT PART OF THE ACTUAL MARKUP OF YOUR XML (that is to say, anything INSIDE your xml as a variable or text) and that occurs in the entity reference list is translated to their corresponding entity (so < would become <). If the text containing the illegal character is text inside an xml node, you could take the easy way and surround the text with a CDATA element, this won't work for attributes though.
Filip's answer is on the right track, but you can hijack the System.Xml.XmlDocument class to do this for you without an entire new utility function.
XmlDocument doc = new XmlDocument();
string xmlEscapedString = (doc.CreateTextNode("Unescaped '&' containing string that would have broken your xml")).OuterXml;
The ampersant makes the XML invalid. This cannot be fixed by a stylesheet so you need to write code with some other tool or code in VB/C#/PHP/Delphi/Lisp/Etc. to remove it or to translate it to &.
This is the simplest and best approach. Works with all characters and allows to parse XML for any web service call i.e. SharePoint ASMX.
public string XmlEscape(string unescaped)
{
XmlDocument doc = new XmlDocument();
var node = doc.CreateElement("root");
node.InnerText = unescaped;
return node.InnerXml;
}
If your string is not valid XML, it will not parse. If it contains an ampersand on its own, it's not valid XML. Contrary to HTML, XML is very strict.
You should 'encode' rather than decode. But calling HttpUtility.HtmlEncode will not help you as it will encode your '<' and '>' symbols as well and your string will no longer be an XML.
I think that for this case the best solution would be to replace '&' with '& amp;' (with no space)
Perhaps consider writing your own XMLDocumentScanner. That's what NekoHTML is doing to have the ability to ignore ampersands not used as entity references.

Categories

Resources