Get value between unknown string - c#

I'm trying to pull out a string between 2 other strings. But to make it more complicated the proceeding contents will often differ.
The string I'm trying to retrieve is Christchurch.
The regex I have so far is (?<=300px">).*(?=</td) and it will pull out the string I'm looking fine but it will also return dozens of other strings through out the LARGE text file I'm searching.
What I'd like to do is limit the prefix to start seraching from Office:, all the way to 300px"> but, the contents between those 2 strings will sometimes differ dependant upon user preferences.
To put it in crude non regex terms I want to do the following: Starting at Office: all the way to 300px> find the string that starts here and ends with </td. Thus resulting in Christchurch.

Have you considered using the HTMLAgilityPack instead? It's a Nuget package for handling HTML which is able to handle malformed HTML pretty well. Most on Stack Overflow would recommend against using Regex for HTML - see here: RegEx match open tags except XHTML self-contained tags
Here's how you'd do it for your example:
using HtmlAgilityPack; //This is a nuget package!
var html = #"<tr >
<td align=""right"" valign=""top""><strong>Office:</strong> </td>
<td align=""left"" class=""stippel"" style=""white-space: wrap;max-width:300px"">Christchurch </td>
</tr>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var node = htmlDoc.SelectSingleNode("//td[#class='stippel']");
Console.WriteLine(node.InnerHtml);
I haven't tested this code but it should do what you need.

I guess you need something like this:
office.*\n.*|(?<=300px">).*(?=<\/td)

The issue you're encountering is that * is greedy. Use the lazy/reluctant version *?.
Office:[\s\S]*?300px">(.*?)</td
This solution uses a group match rather than look-arounds.

Thanks to the posts from adamdc78 and greg I have the been able to come up with the below regex. This is exactly what I needed.
Thanks for you help.
(?<=office.*\n.*300px">).*(?=<\/td)

Related

Regex against markup after XPath?

Have been searching for the solution to my problem now already for a while and have been playing around regex101.com for a while but cannot find a solution.
The problem I am facing is that I have to make a string select for different inputs, thus I wanted to do this with Regular expressions to get the wanted data from these strings.
The regular expression will come from a configuration for each string seperately. (since they differ)
The string below is gained with a XPath: //body/div/table/tbody/tr/td/p[5] but I cannot dig any lower into this anymore to retrieve the right data or can I ?
The string I am using at the moment as example is the following:
<strong>Kontaktdaten des Absenders:</strong>
<br>
<strong>Name:</strong> Wanted data
<br>
<strong>Telefon:</strong>
<a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
<br>
From this string I am trying to get the "Wanted data"
My regular expression so far is the following:
(?<=<\/strong> )(.*)(?= <br>)
But this returns the whole:
<br> <strong>Name:</strong> Wanted data <br> <strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
I thought I could solve this with a repeat group
((:?(?<=<\/strong> )(.*)(?= <br>))+)
But this returns the same output as without the repeat group.
I know I could build a for { } loop around this regex to gain the same output, but since this is the only regular expression I have to do this for (but means I have to change it for all the other data) I was wondering if it is possible to do this in a regular expression.
Thank you for the support already so far.
Regex is the wrong tool for parsing markup. You have a proper XML parsing tool, XPath, in hand. Finish the job with it:
This XPath,
strong[.='Name:']/following-sibling::text()[1]
when appended to your original XPath,
//body/div/table/tbody/tr/td/p[5]/strong[.='Name:']/following-sibling::text()[1]
will finish the job of selecting the text node immediately following the <strong>Name:</strong> label, as requested, with no regex hacks over markup required.
You can try to match everything but tag markers:
(?<=<\/strong> )([^<>]*)(?= <br>)
Demo

How to extract string between 2 markers using Regex in .NET?

I have a source to a web page and I need to extract the body. So anything between </head><body> and </body></html>.
I've tried the following with no success:
var match = Regex.Match(output, #"(?<=\</head\>\<body\>)(.*?)(?=\</body\>\</html\>)");
It finds a string but cuts it off long before </body></html>. I escaped characters based on the RegEx cheat sheet.
What am i missing?
I'd recommend using the HtmlAgilityPack instead - parsing HTML with regular expressions is very, very fragile.
The latest version even supports Linq so you can get your content like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://stackoverflow.com");
string html = doc.DocumentNode.Descendants("body").Single().InnerHtml;
Regex is not meant for such html handling, as many here would say. Without having your sample web page / html, I can only say that try removing the non-greedy ? quantifier in (.*?) and try. After all, a html page will have only one head and body.
Though regexes are definitely not the best tool for this task, there are a few suggestions and points I would like to make:
un-escape the angle brackets - with the # before your string, they are going through to the regex and they do not need to be escaped for a .NET regex
with your regex, you need to make sure that the head/body tag combinations do not have any white-space between them.
with your regex, the body tag cannot have any attributes.
I would suggest something more like:
(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)
this seems to work for me on the source of this page!
As the others have said, the correct way to handle this is with an HTML-specific tool. I just want to point out some problems with that cheat-sheet.
First, it's wrong about angle brackets: you do not need to escape them. In fact, it's wrong twice: it also says \< and \> match word boundaries, which is both incorrect for .NET, and incompatible with the advice about escaping angle brackets.
That cheat-sheet is just a random collection of regex syntax elements; most of them will work in most flavors, but many are guaranteed not to work in your particular flavor, whatever it happens to be. I recommend you disregard it and rely instead on .NET-specific documents or Regular-Expressions.info. The books Mastering Regular Expressions and Regular Expressions Cookbook are both excellent, too.
As for your regex, I don't see how it could behave the way you say it does. If it were going to fail, I would expect it to fail completely. Does your HTML document contain a CDATA section or SGML comment with </body></html> inside it? Or is it really two or more HTML documents run together?

how can i remove an outer <p>...</p> from a string

I want to query a string (html) from a database and display it on a webpage. The problem is that the data has a
<p> around the text (ending with </p>
I want to strip this outer tag in my viewmodel or controlleraction that returns this data. what is the best way of doing this in C#?
Might be overkill for your needs, but if you want to parse the HTML you can use the HtmlAgilityPack - certainly a cleaner solution in general than most suggested here, although it might not be as performant:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<p> around the text (ending with </p>");
string result = doc.DocumentNode.FirstChild.InnerHtml;
If you're absolutely sure the string will always have that tag, you can use String.Substring like myString.Substring(3, myString.Length-7) or so.
A more robust method would be to either manually code the appropriate tests or use a regular expression, or ultimately, use an HTML parser as suggested by BrokenGlass's answer.
UPDATE: Using regexes you could do:
String filteredString = Regex.Match(myString, "^<p>(.*)</p>").ToString();
You could add \s after the initial ^ to remove also leading whitespace. Also, you can check the result of Match to see if the string matched the <p>...</p> pattern at all. This may also help.
If the data is always surrounded by <p> ... </p>:
string withoutParas = withParas.Substring(3, withParas.Length - 7);
Try using string function Remove() passing it the FirstIndex() of <p> and the last index of </p> with length 3
If you are absolutely guaranteed that you string will always fit the pattern of <p>...</p>, then the other solutions using data.Substring(3, data.Length - 6) are sufficient. If, however, there's any chance that it could look at all different, then you really need to use an HTML parser. The consensus is that the HTML Agility Pack is the way to go.
s = s.Replace("<p>", String.Empty).Replace("</p>", String.Empty);

Regular expression to replace quotation marks in HTML tags only

I have the following string:
<div id="mydiv">This is a "div" with quotation marks</div>
I want to use regular expressions to return the following:
<div id='mydiv'>This is a "div" with quotation marks</div>
Notice how the id attribute in the div is now surrounded by apostrophes?
How can I do this with a regular expression?
Edit: I'm not looking for a magic bullet to handle every edge case in every situation. We should all be weary of using regex to parse HTML but, in this particular case and for my particular need, regex IS the solution...I just need a bit of help getting the right expression.
Edit #2: Jens helped to find a solution for me but anyone randomly coming to this page should think long and very hard about using this solution. In my case it works because I am very confident of the type of strings that I'll be dealing with. I know the dangers and the risks and make sure you do to. If you're not sure if you know then it probably indicates that you don't know and shouldn't use this method. You've been warned.
This could be done in the following way: I think you want to replace every instance of ", that is between a < and a > with '.
So, you look for each " in your file, look behind for a <, and ahead for a >. The regex looks like:
(?<=\<[^<>]*)"(?=[^><]*\>)
You can replace the found characters to your liking, maybe using Regex.Replace.
Note: While I found the Stack Overflow community most friendly and helpful, these Regex/HTML questions are responded with a little too much anger, in my opinion. After all, this question here does not ask "What regex matches all valid HTML, and does not match anything else."
I see you're aware of the dangers of using Regex to do these kinds of replacements. I've added the following answer for those in search of a method that is a lot more 'stable' if you want to have a solution that will keep working as the input docs change.
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
var nodes = doc.DocumentNode.DescendantNodes();
foreach (var node in nodes)
{
foreach (var att in node.Attributes)
{
att.QuoteType = AttributeValueQuote.SingleQuote;
}
}
var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);
You can match:
(<div.*?id=)"(.*?)"(.*?>)
and replace this with:
$1'$2'$3

Removing <div>'s from text file?

Ive made a small program in C#.net which doesnt really serve much of a purpose, its tells you the chance of your DOOM based on todays news lol. It takes an RSS on load from the BBC website and will then look for key words which either increment of decrease the percentage chance of DOOM.
Crazy little project which maybe one day the classes will come uin handy to use again for something more important.
I recieve the RSS in an xml format but it contains alot of div tags and formatting characters which i dont really want to be in the database of keywords,
What is the best way of removing these unwanted characters and div's?
Thanks,
Ash
If you want to remove the DIV tags WITH content as well:
string start = "<div>";
string end = "</div>";
string txt = Regex.Replace(htmlString, Regex.Escape(start) + "(?<data>[^" + Regex.Escape(end) + "]*)" + Regex.Escape(end), string.Empty);
Input: <xml><div>junk</div>XXX<div>junk2</div></xml>
Output: <xml>XXX</xml>
IMHO the easiest way is to use regular expressions. Something like:
string txt = Regex.Replace(htmlString, #"<(.|\n)*?>", string.Empty);
Depending on which tags and characters you want to remove you will modify the regex, of course. You will find a lot of material on this and other methods if you do a web search for 'strip html C#'.
SO question Render or convert Html to ‘formatted’ Text (.NET) might help you, too.
Stripping HTML tags from a given string is a common requirement and you can probably find many resources online that do it for you.
The accepted method, however, is to use a Regular expression based Search and Replace. This article provides a good sample along with benchmarks. Another point worth mentioning is that you would require separate Regex based lookups for the different kinds of unwanted characters you are seeing. (Perhaps showing us an example of the HTML you receive would help)
Note that your requirements may vary based on which tags you want to remove. In your question, you only mention DIV tags. If that is the only tag you need to replace, a simple string search and replace should suffice.
A regular expression such as this:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
Would highlight all HTML tags.
Use this to remove them form your data.

Categories

Resources