How do I fix open <'s without closing >'s with C#? - c#

I'm using C# with the .NET 4.5 version of the HTML Agility Pack. I have to be able to import a large number of different HTML documents and always be able to load them into the .NET XmlDocument.
My current issue is that I am seeing HTML similar to this:
<p class="s18">(4) if qual. ch ild <17 f or</p>
I need to convert that "<" to anything else but I need to preserve all of the other <'s and >'s. I'd like to use as few lines of code as possible and hope that someone can show me how the Html Agility Pack (already being used in my project for other things) can be leveraged to solve this problem.
EDIT: If Html Agility Pack doesn't satisfy the need then I'd appreciate a C# method which will eliminate or close any open flags while preserving any valid tags.
EDIT 2: Removed, no longer relevant.
EDIT 3: I've partially solved this problem but there is a bug that I'd appreciate help resolving.
My method is below. This method successfully removes the '<' and '>' characters from this HTML.
<p>yo hi</p><p> Gee I love 1<'s</p><td name=\"\" /><p>bazinga ></p>
The problem is that Regex.Matches() does not find every match I expect. After it finds a match, it resumes searching at the position where that match ended, so matches never overlap. As a result, the '<' in " Gee I love 2<'s" is skipped in the following HTML.
<p>yo hi</p><p> Gee I love 1<'s<p> Gee I love 2<'s<p> Gee I love 3<'s</p></p></p><td name=\"\" /><p>bazinga ></p>
In my opinion " Gee I love 2<'s" should be a match, but Regex.Matches() skips it, presumably because the search position has already moved past the end of the previous match.
private static string RemovePartialTags(string input)
{
    Regex regex = new Regex(@"<[^<>/]+>(.*?)<[^<>]+>");
    string output = regex.Replace(input, delegate(Match m)
    {
        string v = m.Value;
        Regex reg = new Regex(@"<[^<>]+>");
        MatchCollection matches = reg.Matches(v);
        int locEndTag = v.IndexOf(matches[1].Value);
        List<string> tokens = new List<string>
        {
            v.Substring(0, matches[0].Length),
            v.Substring(matches[0].Length, locEndTag - matches[0].Length)
                .Replace(@"<", string.Empty)
                .Replace(@">", string.Empty)
        };
        tokens.Add(v.Substring(tokens[0].Length + (locEndTag - matches[0].Length)));
        return tokens[0] + tokens[1] + tokens[2];
    });
    return output;
}
Thank you in advance!

I solved my problem by using the same method as above, but with a modified regular expression:
@"<[^<>/]+>(.*?)[<](.*?)<[^<>]+>"
Method:
private static string RemovePartialTags(string input)
{
    Regex regex = new Regex(@"<[^<>/]+>(.*?)[<](.*?)<[^<>]+>");
    string output = regex.Replace(input, delegate(Match m)
    {
        string v = m.Value;
        Regex reg = new Regex(@"<[^<>]+>");
        MatchCollection matches = reg.Matches(v);
        int locEndTag = v.IndexOf(matches[1].Value);
        List<string> tokens = new List<string>
        {
            v.Substring(0, matches[0].Length),
            v.Substring(matches[0].Length, locEndTag - matches[0].Length)
                .Replace(@"<", string.Empty)
                .Replace(@">", string.Empty)
        };
        tokens.Add(v.Substring(tokens[0].Length + (locEndTag - matches[0].Length)));
        return tokens[0] + tokens[1] + tokens[2];
    });
    return output;
}
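A different approach, sketched here as an alternative to the method above rather than a correction of it: instead of locating well-formed tag pairs, escape every '<' that does not begin something tag-shaped and every '>' that does not end one. The class and method names are hypothetical. The heuristic assumes that a real tag starts with '<' or '</' followed directly by a letter, '!' or '?', so text such as "x <y and z>" would still be mistaken for a tag.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the Html Agility Pack.
public static class StrayBracketEscaper
{
    // A '<' that is NOT followed by "name...>", "/name...>", "!...>" or "?...>".
    private static readonly Regex StrayOpen =
        new Regex(@"<(?!/?[A-Za-z!?][^<>]*>)");

    // A '>' that is NOT preceded by "<name...", "</name...", "<!..." or "<?...".
    // .NET supports variable-length lookbehind, which this pattern relies on.
    private static readonly Regex StrayClose =
        new Regex(@"(?<!</?[A-Za-z!?][^<>]*)>");

    public static string Escape(string html)
    {
        // Escape stray '<' first; "&lt;" contains no brackets, so it cannot
        // confuse the second pass.
        string withoutStrayOpen = StrayOpen.Replace(html, "&lt;");
        return StrayClose.Replace(withoutStrayOpen, "&gt;");
    }
}
```

Calling StrayBracketEscaper.Escape on the nested "Gee I love 2<'s" sample handles each '<' independently, because every replacement is a single character and no match spans more than one bracket. To remove the characters instead of escaping them, pass string.Empty as the replacement.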

Related

regex replace matchEvaluator using string Array

I need to highlight search terms in a block of text.
My initial thought was looping though the search terms. But is there an easier way?
Here is what I'm thinking using a loop...
public string HighlightText(string inputText)
{
string[] sessionPhrases = (string[])Session["KeywordPhrase"];
string description = inputText;
foreach (string field in sessionPhrases)
{
Regex expression = new Regex(field, RegexOptions.IgnoreCase);
description = expression.Replace(description,
new MatchEvaluator(ReplaceKeywords));
}
return description;
}
public string ReplaceKeywords(Match m)
{
return "<span style='color:red;'>" + m.Value + "</span>";
}
You could replace the loop with something like:
string[] phrases = ...
var re = String.Join("|", phrases.Select(s => Regex.Escape(s)).ToArray());
text = Regex.Replace(text, re, new MatchEvaluator(SomeFunction), RegexOptions.IgnoreCase);
Extending on Qtax's answer:
phrases = ...
// Use Regex.Escape to prevent ., (, * and other special characters from breaking the search
string re = String.Join("|", phrases.Select(s => Regex.Escape(s)).ToArray());
// Use \b (expression) \b to ensure you're only matching whole words, not partial words
re = @"\b(?:" + re + @")\b";
// Use a simple replacement pattern instead of a MatchEvaluator
string replacement = "<span style='color:red;'>$0</span>";
text = Regex.Replace(text, re, replacement, RegexOptions.IgnoreCase);
Note that if the text you are replacing in is already HTML, it may not be a good idea to let the regex replace matches anywhere in the content. You might end up with:
<<span style='color:red;'>script</span>>
if someone is searching for the term script.
To prevent that from happening, you could use the HTML Agility Pack in combination with Regex.
You might also want to check out this post which deals with a very similar issue.
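One caveat with the \b approach: \b only matches between a word character and a non-word character, so a phrase that begins or ends with punctuation (for example "c++") never satisfies the trailing \b. The sketch below is one possible variation, with hypothetical class and method names: it swaps \b for lookarounds and tries longer phrases first so that "c++" wins over "c".

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the escaped-alternation idea.
public static class PhraseHighlighter
{
    public static string Highlight(string text, string[] phrases)
    {
        string[] escaped = (phrases ?? new string[0])
            .Where(p => !string.IsNullOrEmpty(p))
            .OrderByDescending(p => p.Length)   // longer phrases are tried first
            .Select(p => Regex.Escape(p))
            .ToArray();
        if (escaped.Length == 0) return text;   // nothing to highlight

        // (?<!\w) and (?!\w) replace \b: "not preceded / not followed by a word character".
        string pattern = @"(?<!\w)(?:" + string.Join("|", escaped) + @")(?!\w)";
        return Regex.Replace(text, pattern,
            "<span style='color:red;'>$0</span>", RegexOptions.IgnoreCase);
    }
}
```

This still operates on plain text only; the warning above about running it over HTML applies unchanged.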

Splitting a string by another string

I got a string which I need to separate by another string which is a substring of the original one. Let's say I got the following text:
string s = "<DOC>something here <TEXT> and some stuff here </TEXT></DOC>"
And I want to retrieve:
"and some stuff here"
I need to get the string between "<TEXT>" and its closing tag "</TEXT>".
I can't manage to do this with the usual string Split method, even though one of its overloads takes a string[] parameter. What I am trying is:
Console.Write(s.Split("<TEXT>")); // Which doesn't compile
Thanks in advance for your kind help.
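const string openTag = "<TEXT>";
var start = s.IndexOf(openTag);
// Only look for the closing tag after the opening tag has been found
var end = start >= 0 ? s.IndexOf("</TEXT>", start + openTag.Length) : -1;
string res;
if (start >= 0 && end >= 0) {
    // Skip past the opening tag so it is not part of the result
    res = s.Substring(start + openTag.Length, end - start - openTag.Length).Trim();
} else {
    res = "NOT FOUND";
}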
Splitting on "<TEXT>" isn't going to help you in this case anyway, since the close tag is "</TEXT>".
The most robust solution would be to parse it properly as XML. C# provides functionality for doing that. The second example at http://msdn.microsoft.com/en-us/library/cc189056%28v=vs.95%29.aspx should put you on the right track.
However, if you're just looking for a quick-and-dirty one-time solution your best bet is going to be to hand-code something, such as dasblinkenlight's solution above.
var output = new List<String>();
foreach (Match match in Regex.Matches(source, "<TEXT>(.*?)</TEXT>")) {
output.Add(match.Groups[1].Value);
}
string s = "<DOC>something here <TEXT> and some stuff here </TEXT></DOC>";
string result = Regex.Match(s, "(?<=<TEXT>).*?(?=</TEXT>)").Value;
EDIT: I am using the regex pattern (?<=prefix)find(?=suffix), which matches find only where it is preceded by prefix and followed by suffix, without including either of them in the match.
EDIT 2:
Find several results:
MatchCollection matches = Regex.Matches(s, "(?<=<TEXT>).*?(?=</TEXT>)");
foreach (Match match in matches) {
Console.WriteLine(match.Value);
}
If the last tag is </DOC>, so that the string is well-formed XML, you could use XElement.Load to load the XML and then walk through it to find the element you want (you could also use LINQ to XML).
If the string is not necessarily valid XML, you can always fall back on regular expressions to find the desired part of the text. In this case the expression should not be too hard to write yourself.
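A minimal sketch of the LINQ to XML route, assuming the input is a well-formed XML string held in memory (so XElement.Parse is used rather than XElement.Load). The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; assumes the input is well-formed XML.
public static class TextExtractor
{
    // Returns the trimmed contents of every <TEXT> element below the root.
    public static List<string> GetTextElements(string xml)
    {
        XElement root = XElement.Parse(xml); // throws XmlException on malformed input
        return root.Descendants("TEXT")
                   .Select(e => e.Value.Trim())
                   .ToList();
    }
}
```

For the sample string in the question, the returned list holds the single entry "and some stuff here". Element names are case-sensitive, so "TEXT" must match the document exactly.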

RegEx.Replace but exclude matches within html tags?

I have a helper method called HighlightKeywords, which I use on a Forum when viewing search results, to highlight the keyword(s) within the posts, that the user has searched on.
The problem I have is that, say for example the user searches for the keyword 'hotmail', where the HighlightKeywords method then finds matches of that keyword and wraps it with a span tag specifying a style to apply, it's finding matches within html anchor tags and in some cases image tags. As a result, when I render the highlighted posts to screen, the html tags are broken (due to the span being inserted within them).
Here is my function:
public static string HighlightKeywords(this string s, string keywords, string cssClassName)
{
if (s == string.Empty || keywords == string.Empty)
{
return s;
}
string[] sKeywords = keywords.Split(' ');
foreach (string sKeyword in sKeywords)
{
try
{
s = Regex.Replace(s, @"\b" + sKeyword + @"\b", string.Format("<span class=\"" + cssClassName + "\">{0}</span>", "$0"), RegexOptions.IgnoreCase);
}
catch {}
}
return s;
}
What would be the best way to prevent this from breaking? Even if I could just simply exclude any matches that occur within anchor tags (whether they be web or email addresses) or image tags?
No. You can't do that. At least, not in a way that won't break. Regular Expressions are not up to the task of parsing HTML. I am really sorry. You will want to read this rant too: RegEx match open tags except XHTML self-contained tags
So, you will probably need to parse the HTML (I hear the HtmlAgilityPack is good) and then only match inside certain portions of the document - excluding anchor tags etc.
I ran into the same problem and came up with this workaround:
public static string HighlightKeyWords(string s, string[] KeyWords)
{
if (KeyWords != null && KeyWords.Count() > 0 && !string.IsNullOrEmpty(s))
{
foreach (string word in KeyWords)
{
s = System.Text.RegularExpressions.Regex.Replace(s, word, string.Format("{0}", "{0}$0{1}"), System.Text.RegularExpressions.RegexOptions.IgnoreCase);
}
}
s = string.Format(s, "<mark class='hightlight_text_colour'>", "</mark>");
return s;
}
It looks kind of scary, but the idea is to delay adding the HTML tags until the regex has matched all the keywords. Inside the loop I insert the {0} and {1} placeholders for the beginning and end tags instead of the tags themselves, and then substitute the real HTML tags for the placeholders at the end.
It would still break if {0} or {1} is passed in as a keyword, though.
Marcus, resurrecting this question because it had a simple solution that wasn't mentioned. This situation sounds very similar to Match (or replace) a pattern except in situations s1, s2, s3 etc.
With all the disclaimers about using regex to parse html, here is a simple way to do it.
Taking hotmail as an example to show the technique in its simplest form, here's our simple regex:
<a.*?</a>|(hotmail)
The left side of the alternation matches complete <a ... </a> tags. We will ignore these matches. The right side matches and captures hotmail to Group 1, and we know they are the right hotmail because they were not matched by the expression on the left.
This program shows how to use the regex (see the results at the bottom of the online demo):
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
class Program
{
static void Main() {
var myRegex = new Regex(@"<a.*?</a>|(hotmail)");
string s1 = @"replace this=> hotmail not that => hotmail";
string replaced = myRegex.Replace(s1, delegate(Match m) {
if (m.Groups[1].Value != "") return "<span something>hotmail</span>";
else return m.Value;
});
Console.WriteLine("\n" + "*** Replacements ***");
Console.WriteLine(replaced);
Console.WriteLine("\nPress Any Key to Exit.");
Console.ReadKey();
} // END Main
} // END Program
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
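The same "match it on the left, capture it on the right" trick can be widened from anchors to any tag. The following is only a sketch under stated assumptions: the class and method names are hypothetical, and the left branch <[^>]*> assumes that no attribute value contains a literal '>'. Unlike the <a.*?</a> pattern, it protects text inside tags (href, src, alt and so on) but still highlights the visible text between an opening and a closing tag.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; regex-based, so it inherits the usual HTML caveats.
public static class TagAwareHighlighter
{
    public static string Highlight(string html, string[] keywords, string cssClass)
    {
        if (string.IsNullOrEmpty(html) || keywords == null) return html;

        string[] escaped = keywords
            .Where(k => !string.IsNullOrEmpty(k))
            .Select(k => Regex.Escape(k))
            .ToArray();
        if (escaped.Length == 0) return html;

        // Left branch: a whole tag, matched and returned unchanged.
        // Right branch: a keyword outside any tag, captured in group 1.
        var regex = new Regex(
            @"<[^>]*>|\b(" + string.Join("|", escaped) + @")\b",
            RegexOptions.IgnoreCase);

        return regex.Replace(html, m =>
            m.Groups[1].Success
                ? "<span class=\"" + cssClass + "\">" + m.Value + "</span>"
                : m.Value);
    }
}
```

Checking m.Groups[1].Success rather than comparing the value with "" makes the intent explicit: the group only participates when the right-hand branch matched.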

Find and Insert

I have a string that looks like (the * is literal):
clp*(seven digits)1*
I want to change it so that it looks like:
clp*(seven digits)(space)(space)1*
I'm working in C# and built my search pattern like this:
Regex regAddSpaces = new Regex(@"CLP\*.......1\*");
I'm not sure how to tell regex to keep the first 11 characters, add two spaces and then cap it with 1*
Any help is appreciated.
No need to use regex here. Simple string manipulation will do the job perfectly well.
var input = "clp*01234561*";
var output = input.Substring(0, 11) + "  " + input.Substring(11, 2); // two spaces between the halves
I agree with Noldorin. However, here's how you could do it with regular expressions if you really wanted:
var result = Regex.Replace("clp*12345671*", @"(clp\*\d{7})(1\*)", "$1  $2"); // two spaces between the groups
If you just want to insert the spaces wherever this occurs in the text, you can use lookbehind and lookahead assertions, which match a position without consuming any characters:
pattern = @"(?<=clp\*[0-9]{7})(?=1\*)"
Passing this to Regex.Replace with a replacement value of two spaces inserts the spaces at that position.
Thus, the following one-liner does the trick:
string result = Regex.Replace(inputString, @"(?<=clp\*[0-9]{7})(?=1\*)", "  ", RegexOptions.IgnoreCase);
Here is the regex. If your problem is as simple as you stated, though, Noldorin's answer above is the clearer and more maintainable solution. But since you wanted a regex, here you go:
// Not a fan of the 'out' usage, but I am not sure whether you care about the success of the match
public static bool AddSpacesToMyRegexMatch(string input, out string output)
{
    Regex reg = new Regex(@"(^clp\*[0-9]{7})(1\*$)");
    Match match = reg.Match(input);
    // Groups[0] is the whole match; the two captured halves are Groups[1] and Groups[2]
    output = match.Success ?
        string.Format("{0}  {1}", match.Groups[1], match.Groups[2]) :
        input;
    return match.Success;
}

How can I find a string after a specific string/character using regex

I am hopeless with regex (c#) so I would appreciate some help:
Basically, I need to parse a text and find the following information inside it:
Sample text:
KeywordB:***TextToFind* the rest is not relevant but **KeywordB: Text ToFindB and then some more text.
I need to find the word(s) after a certain keyword which may end with a “:”.
[UPDATE]
Thanks Andrew and Alan. Sorry for reopening the question, but there is quite an important thing missing in that regex. As I wrote in my last comment: is it possible to have a variable (how many words to look for, depending on the keyword) as part of the regex?
Or I could have a different regex for each keyword (there will only be a handful). But I still don't know how to put the "number of words to look for" constant inside the regex.
The basic regex is this:
var pattern = @"KeywordB:\s*(\w*)";
\s* = any number of whitespace characters
\w* = 0 or more word characters (letters, digits, and underscore)
() = make a group, so you can extract the part that matched
var pattern = @"KeywordB:\s*(\w*)";
var test = @"KeywordB: TextToFind";
var match = Regex.Match(test, pattern);
if (match.Success) {
Console.Write("Value found = {0}", match.Groups[1]);
}
If you have more than one of these on a line, you can use this:
var test = @"KeywordB: TextToFind KeyWordF: MoreText";
var matches = Regex.Matches(test, @"(?:\s*(?<key>\w*):\s?(?<value>\w*))");
foreach (Match f in matches ) {
Console.WriteLine("Keyword '{0}' = '{1}'", f.Groups["key"], f.Groups["value"]);
}
Also, check out the regex designer here: http://www.radsoftware.com.au/. It is free, and I use it constantly. It works great to prototype expressions. You need to rearrange the UI for basic work, but after that it's easy.
(FYI) The "@" before a string makes it a verbatim string literal, in which \ is no longer an escape character, so you can type @"c:\fun.txt" instead of "c:\\fun.txt".
Let me know if I should delete the old post, but perhaps someone wants to read it.
The way to do a "words to look for" inside the regex is like this:
regex = @"(Key1|Key2|Key3|LastName|FirstName|Etc):"
What you are doing probably isn't worth the effort in a regex, though it can probably be done the way you want (still not 100% clear on requirements, though). It involves looking ahead to the next match, and stopping at that point.
Here is a re-write as a regex + regular functional code that should do the trick. It doesn't care about spaces, so if you ask for "Key2" like below, it will separate it from the value.
string[] keys = {"Key1", "Key2", "Key3"};
string source = "Key1:Value1Key2: ValueAnd A: To Test Key3: Something";
FindKeys(keys, source);
private void FindKeys(IEnumerable<string> keywords, string source) {
var found = new Dictionary<string, string>(10);
var keys = string.Join("|", keywords.ToArray());
var matches = Regex.Matches(source, @"(?<key>" + keys + "):",
RegexOptions.IgnoreCase);
foreach (Match m in matches) {
var key = m.Groups["key"].ToString();
var start = m.Index + m.Length;
var nx = m.NextMatch();
var end = (nx.Success ? nx.Index : source.Length);
found.Add(key, source.Substring(start, end - start));
}
foreach (var n in found) {
Console.WriteLine("Key={0}, Value={1}", n.Key, n.Value);
}
}
And the output from this is:
Key=Key1, Value=Value1
Key=Key2, Value= ValueAnd A: To Test
Key=Key3, Value= Something
/KeywordB\: (\w+)/
This matches the word that comes right after your keyword. As you didn't mention any terminator, I assumed that you wanted only the word next to the keyword.
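On the follow-up about a variable number of words per keyword: one way to get there is to build the quantifier into the pattern at run time. The sketch below is an assumption about what was being asked, with hypothetical names; it treats a "word" as a run of \w characters, allows the ':' after the keyword to be missing, and returns fewer words if fewer are available.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: reads up to maxWords words that follow a keyword.
public static class KeywordReader
{
    // Returns the words found after "keyword" (with or without ':'),
    // or null if the keyword is absent or maxWords is less than 1.
    public static string WordsAfter(string text, string keyword, int maxWords)
    {
        if (maxWords < 1) return null;

        // First word, then up to (maxWords - 1) further whitespace-separated words.
        string pattern = Regex.Escape(keyword)
            + @":?\s*(\w+(?:\s+\w+){0," + (maxWords - 1) + "})";

        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

With the sample text from the question, asking for two words after KeywordB gives "Text ToFindB", and asking for one gives "Text". A per-keyword word count could then live in a dictionary that maps each keyword to its maxWords value.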
