I'm trying to extract strings after a pattern in a long string, which is basically HTML output of a page.
For example, I need to extract the target of the href attribute from this string:
<h2 class=\"product-name\">...</h2>\r\n
What I need from this: erkek-ayakkabi-spor-gri-17sfd3007141340-p
I also need to find other strings like the one above, so I need to search for href attributes that come after class=\"product-name\" in the HTML string.
How can I achieve this?
Please check this.
Regex:
class=\"product-name\"(.*)<a\shref=\"(.*?)\"
Updated Regex:
class=\"product-name\".*<a\shref=\"(.*?)\"
Regex101 Example.
C# Code:
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        string data = "<h2 class=\"product-name\">...</h2>\r\n<h2 class=\"test-name\">...</h2>\r\n<h2 class=\"product-name\">...</h2>\r\n";
        //string regex = "class=\"product-name\"(.*)<a\\shref=\"(.*?)\"";
        string regex = "class=\"product-name\".*<a\\shref=\"(.*?)\"";
        var matches = Regex.Matches(data, regex, RegexOptions.Multiline);
        foreach (Match item in matches)
        {
            //Console.WriteLine("Value: " + item.Groups[2]);
            Console.WriteLine("Value: " + item.Groups[1]);
        }
    }
}
DotNetFiddle Example.
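As a rough sketch of the same idea, the variant below uses a lazy .*? so that each product-name heading is paired with the first link that follows it rather than the last one on the line. The sample markup, the ProductLinks class and the ExtractHrefs method are made up for illustration, since the real page markup is elided in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ProductLinks
{
    // Lazy .*? stops at the first <a href="..."> after each product-name class.
    // RegexOptions.Singleline lets '.' cross line breaks inside the heading.
    private static readonly Regex HrefAfterProductName = new Regex(
        "class=\"product-name\".*?<a\\s+href=\"(.*?)\"",
        RegexOptions.Singleline);

    public static List<string> ExtractHrefs(string html)
    {
        return HrefAfterProductName.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical markup, assumed for this example only.
        string html =
            "<h2 class=\"product-name\"><a href=\"first-product-p\">One</a></h2>\r\n" +
            "<h2 class=\"test-name\"><a href=\"ignored\">Skip</a></h2>\r\n" +
            "<h2 class=\"product-name\"><a href=\"second-product-p\">Two</a></h2>\r\n";

        foreach (string href in ExtractHrefs(html))
            Console.WriteLine(href);
    }
}
```

One caveat of this sketch: with Singleline enabled, a product-name heading that has no link of its own would be paired with the next link found anywhere later in the document.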
Related
I have the following code:
string regexStr = "<a href=\"" + URL + "*.pdf";
regexStr = Regex.Escape(regexStr);
Regex regex = new Regex("^" + regexStr + "$");
MatchCollection matches = regex.Matches(html);
I have a parent HTML page that contains a list of directories. Usually the latest dated directory contains a list of files as such:
But sometimes, the directory is empty, so then I need to check the next latest dated directory, look for the list of files, and if there are files there, use that directory.
This is an HTML page so the files, when listed, are an anchor tag like the following:
So I'm "simply" trying to use Regex to search for [a href="/d-tpp/2204/*.pdf">]
I can't seem to get the Regex right to make a match.
You can use
var regex = new Regex($"<a\\s+href=\"({Regex.Escape(URL)}[^\"]*?\\.pdf)");
var matches = regex.Matches(html).Cast<Match>().Select(x => x.Groups[1].Value);
NOTE
The *.pdf part is wrong: * is a quantifier, not a wildcard. To match any characters other than a " character, as few as possible, use the [^"]*? pattern.
regexStr = Regex.Escape(regexStr) is wrong: never apply Regex.Escape to a whole regular expression pattern. Use Regex.Escape only on literal text that you embed in the pattern.
new Regex("^" + regexStr + "$") is also wrong since you are looking for partial multiple matches inside a long string, and ^ with $ anchor the matches to the whole string.
See a C# demo:
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
    public static void Main()
    {
        var html = "000.pdf <a href=\"/d-tpp/2204/1.pdf\"> 1111.pdf <a href=\"/d-tpp/2204/2.pdf\"> 2222.pdf";
        var URL = "/d-tpp/2204/";
        var regex = new Regex($"<a\\s+href=\"({Regex.Escape(URL)}[^\"]*?\\.pdf)", RegexOptions.IgnoreCase);
        var matches = regex.Matches(html).Cast<Match>().Select(x => x.Groups[1].Value);
        foreach (var s in matches)
            Console.WriteLine(s);
    }
}
Output:
/d-tpp/2204/1.pdf
/d-tpp/2204/2.pdf
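To make the point about Regex.Escape concrete, here is a small sketch of what it does to a whole pattern versus a literal fragment. The EscapeDemo class, the BuildPdfLinkPattern helper and the "/v1.2/docs/" directory are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Builds the pattern by escaping only the literal directory prefix.
    public static string BuildPdfLinkPattern(string directory)
    {
        return "<a\\s+href=\"(" + Regex.Escape(directory) + "[^\"]*?\\.pdf)";
    }

    public static void Main()
    {
        // Escaping a whole "pattern" turns its metacharacters into literals,
        // so * no longer quantifies anything.
        Console.WriteLine(Regex.Escape("*.pdf"));

        // Escaping just the literal part keeps the rest of the pattern active.
        // The dot in the hypothetical directory name is escaped.
        Console.WriteLine(BuildPdfLinkPattern("/v1.2/docs/"));
    }
}
```

The first line printed is \*\.pdf, which only matches the literal text *.pdf. This is why escaping the full pattern in the question could never match a real file name.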
What Regex.Replace pattern can I use to prepend an underscore before any tag elements starting with a number?
e.g.
"<1ABC>Hello</1ABC><A8D>World</A8D><0>!</0>"
would become
"<_1ABC>Hello</_1ABC><A8D>World</A8D><_0>!</_0>"
This regex gets the desired result, but I'm sure there are better ones.
using System.Text.RegularExpressions;
string input = @"<1ABC>Hello</1ABC><A8D>World</A8D><0>!</0>";
string output = Regex.Replace(input, @"(</?)(\d[\d\w]*?)(>)", @"$1_$2$3");
Console.WriteLine(output);
@Lei Yang's answer will fail if an element has attributes. Only a minimal change is required:
using System.Text.RegularExpressions;
string input = @"<1ABC id='abc'>Hello</1ABC><A8D>World</A8D><0>!</0>";
string output = Regex.Replace(input, @"(</?)(\d.*?)([ >])", @"$1_$2$3");
Console.WriteLine(output);
Try this:
private static Regex rxTagsWithLeadingDigit = new Regex(@"
    (</?)     # open/close tag start, followed by
    (\d\w*)   # a tag name that begins with a decimal digit (\w* also allows a one-character name such as 0), followed by
    (\s|/?>)  # a whitespace character or end-of-tag
    ", RegexOptions.IgnorePatternWhitespace);
public string ensureTagsStartWithWordCharacter( string s )
{
    return rxTagsWithLeadingDigit.Replace( s , "$1_$2$3" );
}
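As a quick check of the approaches above, the following sketch runs one such pattern against the question's sample with an attribute added. The TagFixer class and PrefixDigitTags method are illustrative names, and the ${1} form is used only to make the group reference unambiguous next to the underscore.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagFixer
{
    // (</?)     tag start, optionally a closing tag
    // (\d\w*)   tag name starting with a digit; \w* allows single-digit names
    // (\s|/?>)  whitespace before attributes, or the end of the tag
    private static readonly Regex LeadingDigitTag =
        new Regex(@"(</?)(\d\w*)(\s|/?>)");

    public static string PrefixDigitTags(string markup)
    {
        return LeadingDigitTag.Replace(markup, "${1}_$2$3");
    }

    public static void Main()
    {
        string input = "<1ABC id='abc'>Hello</1ABC><A8D>World</A8D><0>!</0>";
        Console.WriteLine(PrefixDigitTags(input));
    }
}
```

Note that this sketch also rewrites a bare "<5 " that appears in text content, so it assumes every "<" followed by a digit really starts a tag.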
I would like to use the ((?!(SEPARATOR)).)* regex pattern for splitting a string.
using System;
using System.Text.RegularExpressions;
public class Program
{
    public static void Main()
    {
        var separator = "__";
        var pattern = String.Format("((?!{0}).)*", separator);
        var regex = new Regex(pattern);
        foreach (var item in regex.Matches("first__second"))
            Console.WriteLine(item);
    }
}
It works fine when SEPARATOR is a single character, but when it is longer than one character I get an unexpected result. In the code above, the second matched string is "_second" instead of "second". How should I modify my pattern to skip the whole unmatched separator?
My real problem is to split lines where I should skip line separators inside quotes. My line separator is not a predefined value and it can be for example "\r\n".
You can do something like this:
using System;
using System.Text.RegularExpressions;
public class Example
{
    public static void Main()
    {
        string input = "plum--pear";
        string pattern = "-"; // Split on hyphens
        string[] substrings = Regex.Split(input, pattern);
        foreach (string match in substrings)
        {
            Console.WriteLine("'{0}'", match);
        }
    }
}
// The method displays the following output:
// 'plum'
// ''
// 'pear'
The .NET regex engine does not support matching a piece of text other than a specific multicharacter string. In PCRE, you would use the (*SKIP)(*FAIL) verbs, but they are not supported in the native .NET regex library. You could use PCRE.NET, but .NET regex can usually handle such scenarios well with Regex.Split.
If you need to, say, match all but [anything here], you could use
var res = Regex.Split(s, @"\[[^][]*]").Where(m => !string.IsNullOrEmpty(m));
If the separator is a simple literal fixed string like __, just use String.Split.
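For the plain-separator case, a minimal sketch might look like the following. SeparatorSplit, SplitLiteral and SplitEscaped are hypothetical names; the second method assumes the separator is only known at run time and may contain regex metacharacters, so it is escaped before being passed to Regex.Split.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorSplit
{
    // Literal separator: plain String.Split is enough.
    public static string[] SplitLiteral(string input, string separator)
    {
        return input.Split(new[] { separator }, StringSplitOptions.None);
    }

    // Same result through Regex.Split; the separator is escaped so that
    // characters such as . or | are treated as literal text.
    public static string[] SplitEscaped(string input, string separator)
    {
        return Regex.Split(input, Regex.Escape(separator));
    }

    public static void Main()
    {
        foreach (string part in SplitLiteral("first__second", "__"))
            Console.WriteLine(part);

        foreach (string part in SplitEscaped("a.|b.|c", ".|"))
            Console.WriteLine(part);
    }
}
```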
As for your real problem, it seems all you need is
var res = Regex.Matches(s, "(?:\"[^\"]*\"|[^\r\n\"])+")
.Cast<Match>()
.Select(m => m.Value)
.ToList();
See the regex demo
It matches one or more (due to the final +) occurrences of either a quoted substring, that is, a ", then zero or more characters other than ", then a " (the "[^"]*" branch), or (|) any single character other than CR, LF or " (see [^\r\n"]).
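Here is a small runnable sketch of that pattern applied to text with a line break inside quotes. The QuotedLines class, the SplitLines method and the sample input are assumptions made for this example; it also assumes quotes are balanced and that empty lines can be dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedLines
{
    // Each match is one logical line: quoted runs are consumed whole,
    // so CR/LF inside double quotes do not end the line.
    public static List<string> SplitLines(string text)
    {
        return Regex.Matches(text, "(?:\"[^\"]*\"|[^\r\n\"])+")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    public static void Main()
    {
        // Two logical lines; the first contains a quoted line break.
        string text = "a,\"x\r\ny\",b\r\nc,d";
        List<string> lines = SplitLines(text);
        Console.WriteLine(lines.Count);
        Console.WriteLine(lines[1]);
    }
}
```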
string asd = "<area href='#' title='name' shape='poly' coords='38,23,242'/>";
How can I extract the text of the title attribute in C#,
and then insert another attribute after title?
search : (?<=title=')[^']+
replace: something
demo here : http://regex101.com/r/nR3vQ8
something like this in your case:
using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        // This is the input string we are replacing parts from.
        string input = "<area href='#' title='name' shape='poly' coords='38,23,242'/>";
        // Use Regex.Replace to replace the pattern in the input.
        // ... The lookbehind (?<=title=') requires title=' right before the match; [^']+ then matches the attribute value.
        string output = Regex.Replace(input, "(?<=title=')[^']+", "something");
        // Write the output.
        Console.WriteLine(input);
        Console.WriteLine(output);
    }
}
update
To extract the value of the title attribute as a match, use this:
using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        // First we see the input string.
        string input = "<area href='#' title='name' shape='poly' coords='38,23,242'/>";
        // Here we call Regex.Match.
        Match match = Regex.Match(input, @"title='(\w+)'",
            RegexOptions.IgnoreCase);
        // Here we check the Match instance.
        if (match.Success)
        {
            // Finally, we get the Group value and display it.
            string key = match.Groups[1].Value;
            Console.WriteLine(key);
        }
    }
}
output
name
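The second half of the question, inserting another attribute after title, could be sketched along the same lines. The TitleAttribute class, its method names and the alt='text' attribute are hypothetical; the sketch assumes single-quoted attribute values as in the question's input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleAttribute
{
    // \b keeps the pattern from matching inside names such as subtitle.
    private static readonly Regex TitleRegex = new Regex("\\btitle='([^']*)'");

    // Returns the title value, or null when the tag has no title attribute.
    public static string GetTitle(string tag)
    {
        Match m = TitleRegex.Match(tag);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Appends the given attribute right after the first title attribute.
    // A MatchEvaluator is used so that '$' in the attribute text is not
    // interpreted as a substitution.
    public static string InsertAfterTitle(string tag, string attribute)
    {
        return TitleRegex.Replace(tag, m => m.Value + " " + attribute, 1);
    }

    public static void Main()
    {
        string input = "<area href='#' title='name' shape='poly' coords='38,23,242'/>";
        Console.WriteLine(GetTitle(input));
        Console.WriteLine(InsertAfterTitle(input, "alt='text'"));
    }
}
```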
Try this: In particular you may be interested in the HTMLAgilityPack answer.
Regex reg = new Regex("<a[^>]*?title=\"([^\"]*?)\"[^>]*?>");
A couple of gotchas:
This match is case-sensitive; you may want to adjust that
This expects that the title attribute both exists and is double-quoted
Of course, if the title attribute doesn't exist, you probably don't want the match anyway?
To Extract, use the groups collection:
reg.Match("Howdy").Groups[1].Value
I need to highlight search terms in a block of text.
My initial thought was looping through the search terms. But is there an easier way?
Here is what I'm thinking using a loop...
public string HighlightText(string inputText)
{
    string[] sessionPhrases = (string[])Session["KeywordPhrase"];
    string description = inputText;
    foreach (string field in sessionPhrases)
    {
        Regex expression = new Regex(field, RegexOptions.IgnoreCase);
        description = expression.Replace(description,
            new MatchEvaluator(ReplaceKeywords));
    }
    return description;
}
public string ReplaceKeywords(Match m)
{
    return "<span style='color:red;'>" + m.Value + "</span>";
}
You could replace the loop with something like:
string[] phrases = ...
var re = String.Join("|", phrases.Select(s => Regex.Escape(s)).ToArray());
// Regex.Replace takes the input text first, then the pattern
text = Regex.Replace(text, re, new MatchEvaluator(SomeFunction), RegexOptions.IgnoreCase);
Extending on Qtax's answer:
phrases = ...
// Use Regex.Escape to prevent ., (, * and other special characters from breaking the search
string re = String.Join("|", phrases.Select(s => Regex.Escape(s)).ToArray());
// Wrap the alternation in \b ... \b to match whole words only, not partial words
re = @"\b(?:" + re + @")\b";
// Use a simple replacement pattern instead of a MatchEvaluator
string replacement = "<span style='color:red;'>$0</span>";
// Regex.Replace takes the input text first, then the pattern
text = Regex.Replace(text, re, replacement, RegexOptions.IgnoreCase);
Note that if you're replacing data inside existing HTML, it might not be a good idea to use Regex to replace just anything in the content, because you might end up with:
<<span style='color:red;'>script</span>>
if someone is searching for the term script.
To prevent that from happening, you could use the HTML Agility Pack in combination with Regex.
You might also want to check out this post which deals with a very similar issue.
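Putting the single-pass approach together, a self-contained sketch could look like this. The Highlighter class and Highlight method are illustrative names, and the sketch assumes the input is plain text rather than HTML, so it does not address the tag-matching problem described above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class Highlighter
{
    public static string Highlight(string text, string[] phrases)
    {
        // Escape each phrase so it is matched literally; skip empty ones,
        // since an empty alternative would match at every word boundary.
        string[] escaped = phrases
            .Where(p => !string.IsNullOrEmpty(p))
            .Select(p => Regex.Escape(p))
            .ToArray();
        if (escaped.Length == 0)
            return text;

        // One alternation for all phrases, limited to whole words.
        string pattern = @"\b(?:" + string.Join("|", escaped) + @")\b";

        // Argument order: input first, then pattern, then replacement.
        return Regex.Replace(text, pattern,
            "<span style='color:red;'>$0</span>", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Highlight("The Quick brown fox", new[] { "quick", "fox" }));
    }
}
```

Because the match is case-insensitive and the replacement uses $0, the original casing of each matched word is preserved in the output.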