I have a MatchCollection and I need group index 1 from each match.
Right now I extract the data with a large number of casts, which I would like to avoid.
Example: startTag = <a>, endTag = </a>
Html = <a>texttexttext</a>.
I need to get "texttexttext" without <a> and </a>.
var regex = new Regex(startTag + "(.*?)" + endTag, RegexOptions.IgnoreCase);
var matchCollection = regex.Matches(html);
foreach (var item in matchCollection)
{
string temp = ((Match)(((Group)(item)).Captures.SyncRoot)).Groups[1].Value;
}
I would recommend using Html Agility Pack to parse HTML instead of regex, for various reasons.
Applied to your example, finding the text of all anchors inside an HTML document looks like this:
using System;
using System.Net;
using HtmlAgilityPack;
class Program
{
static void Main()
{
string html = "";
using (var client = new WebClient())
{
html = client.DownloadString("http://stackoverflow.com");
}
var doc = new HtmlDocument();
doc.LoadHtml(html);
foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a"))
{
// Will print all text contained inside all anchors
// on http://stackoverflow.com
Console.WriteLine(link.InnerText);
}
}
}
You could use a capture group. You might also want to use a named group. Notice the parentheses I added to regex.
var html = "<a>xx yyy</a> <a>bbb cccc</a>";
var startTag = "<a>";
var endTag = "</a>";
var regex = new Regex(startTag + "((.*?))" + endTag, RegexOptions.IgnoreCase);
var matchCollection = regex.Matches(html);
foreach (Match item in matchCollection)
{
var data = item.Groups[1];
Console.WriteLine(data);
}
This is even a little nicer, because a named group is a little easier to grab.
var html = "<a>xx yyy</a> <a>bbb cccc</a>";
var startTag = "<a>";
var endTag = "</a>";
var regex = new Regex(startTag + "(?<txt>(.*?))" + endTag, RegexOptions.IgnoreCase);
var matchCollection = regex.Matches(html);
foreach (Match item in matchCollection)
{
var data = item.Groups["txt"];
Console.WriteLine(data);
}
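The casts in the question can also be confined to a single place. The following is a minimal sketch, not part of the original answers: `TagText` and `ExtractBetween` are hypothetical names, and it assumes the tags should be matched literally, which is why they go through `Regex.Escape` before being concatenated into the pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagText
{
    // Returns the text found between each startTag/endTag pair.
    // (Hypothetical helper, shown for illustration only.)
    public static List<string> ExtractBetween(string html, string startTag, string endTag)
    {
        // Regex.Escape keeps characters such as '.' or '(' inside the tags
        // from being interpreted as regex syntax.
        string pattern = Regex.Escape(startTag) + "(?<txt>.*?)" + Regex.Escape(endTag);
        var regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Cast<Match>() turns the MatchCollection into a typed sequence,
        // so no per-item casts are needed afterwards.
        return regex.Matches(html)
                    .Cast<Match>()
                    .Select(m => m.Groups["txt"].Value)
                    .ToList();
    }
}
```

Calling `TagText.ExtractBetween("<a>xx yyy</a> <a>bbb cccc</a>", "<a>", "</a>")` should yield the two inner texts. `RegexOptions.Singleline` is an added assumption so that text spanning several lines is still captured.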
I'm having trouble with some loops.
I'm using Html Agility Pack. I have a TXT file with several links (one per line). For each link in the file, I want to navigate to the page, extract a node by XPath, and write the result to a memo.
The problem is that the code only carries out the procedure for the last line of the file. Where am I going wrong?
var Webget = new HtmlWeb();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
var doc = Webget.Load(line);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*[@id='title-article']"))
{
memoEdit1.Text = node.ChildNodes[0].InnerHtml + "\r\n";
break;
}
}
Try changing
memoEdit1.Text = node.ChildNodes[0].InnerHtml + "\r\n";
to
memoEdit1.Text += node.ChildNodes[0].InnerHtml + "\r\n";
You're overwriting memoEdit1.Text every time. Try
memoEdit1.Text += node.ChildNodes[0].InnerHtml + "\r\n";
instead - note the += instead of =, which adds the new text every time.
Incidentally, constantly appending strings together isn't really the best way. Something like this might be better:
var Webget = new HtmlWeb();
var builder = new StringBuilder();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
var doc = Webget.Load(line);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*[@id='title-article']"))
{
builder.AppendFormat("{0}\r\n", node.ChildNodes[0].InnerHtml);
break;
}
}
memoEdit1.Text = builder.ToString();
Or, using LINQ:
var Webget = new HtmlWeb();
memoEdit1.Text = string.Join(
"\r\n",
File.ReadAllLines("c:\\test.txt")
.Select (line => Webget.Load(line).DocumentNode.SelectNodes("//*[@id='title-article']").First().ChildNodes[0].InnerHtml));
If you are only selecting one node in the inner loop, use SelectSingleNode instead. Also, the better practice when concatenating strings in a loop is to use StringBuilder:
StringBuilder builder = new StringBuilder();
var Webget = new HtmlWeb();
foreach (string line in File.ReadLines("c:\\test.txt"))
{
var doc = Webget.Load(line);
builder.AppendLine(doc.DocumentNode.SelectSingleNode("//*[@id='title-article']").InnerHtml);
}
memoEdit1.Text = builder.ToString();
Using LINQ, it looks like this:
var Webget = new HtmlWeb();
var result = File.ReadLines("c:\\test.txt")
.Select(line => Webget.Load(line).DocumentNode.SelectSingleNode("//*[@id='title-article']").InnerHtml);
memoEdit1.Text = string.Join(Environment.NewLine, result);
I have this data in a test text file:
behzad  razzaqi  xezerlooot   abrizii     ast
I want to replace each run of spaces with a single semicolon character. I wrote this C# code for that:
string[] allLines = File.ReadAllLines(@"d:\test.txt");
using (StreamWriter sw = new StreamWriter(@"d:\test.txt"))
{
foreach (string line in allLines)
{
if (!string.IsNullOrEmpty(line) && line.Length > 1)
{
sw.WriteLine(line.Replace(" ", ";"));
}
}
}
MessageBox.Show("ok");
The output is:
behzad;;razzaqi;;xezerlooot;;;abrizii;;;;;ast
But I want only one semicolon per run of spaces. How can I solve that?
Regex is an option:
string[] allLines = File.ReadAllLines(@"d:\test.txt");
using (StreamWriter sw = new StreamWriter(@"d:\test.txt"))
{
foreach (string line in allLines)
{
if (!string.IsNullOrEmpty(line) && line.Length > 1)
{
sw.WriteLine(Regex.Replace(line, @"\s+", ";"));
}
}
}
MessageBox.Show("ok");
Use this code:
string[] allLines = File.ReadAllLines(@"d:\test.txt");
using (StreamWriter sw = new StreamWriter(@"d:\test.txt"))
{
foreach (string line in allLines)
{
// The char[] overload of Split is available on every .NET version.
string[] words = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
string joined = String.Join(";", words);
sw.WriteLine(joined);
}
}
You need to use a regular expression. This one matches runs of two or more whitespace characters:
(\s\s+)
Usage
var input = "behzad  razzaqi  xezerlooot   abrizii     ast";
var pattern = @"(\s\s+)";
Regex rgx = new Regex(pattern);
string result = rgx.Replace(input, ";");
You can do that with a regular expression.
using System.Text.RegularExpressions;
and:
string pattern = "\\s+";
string replacement = ";";
Regex rgx = new Regex(pattern);
sw.WriteLine(rgx.Replace(line, replacement));
This regular expression matches any series of 1 or more spaces and replaces the entire series with a semicolon.
You can try this:
Regex r = new Regex(@"\s+");
string result = r.Replace("YourString", ";");
\s matches any whitespace character, and + means one or more occurrences.
For more information on regular expressions, see http://www.w3schools.com/jsref/jsref_obj_regexp.asp
You should check the string length after the replacement, not before ;-).
const string file = @"d:\test.txt";
var result = File.ReadAllLines(file).Select(line => Regex.Replace(line, @"\s+", ";"));
File.WriteAllLines(file, result.Where(line => line.Length > 1));
...and don't forget that for the input " hello " (with a leading and a trailing space) you will get ";hello;".
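One way to avoid the stray leading and trailing semicolons is to trim the line before replacing. This is a small sketch under that assumption; `SpaceCollapser` and `Collapse` are hypothetical names, not something from the answers above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceCollapser
{
    // Replaces every run of whitespace inside the line with one semicolon.
    // Trim() removes leading/trailing whitespace first, so the result
    // does not start or end with ';'.
    public static string Collapse(string line)
    {
        return Regex.Replace(line.Trim(), @"\s+", ";");
    }
}
```

In the file-rewriting loop this would be used as `sw.WriteLine(SpaceCollapser.Collapse(line));`. Whether trimming is wanted depends on the data, since it discards the information that a line had padding.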
I am fetching data from MySQL, but the issue is that HTML tags, i.e.
<p>LARGE</p><p>Lamb;<br>;li;ul;
are also being fetched with my data. I just need "LARGE" and "Lamb" from the line above. How can I separate/remove HTML tags from a string?
I am going to assume that the HTML is intact, perhaps something like the following:
<ul><li><p>LARGE</p><p>Lamb<br></li></ul>
In which case, I would use HtmlAgilityPack to get the content without having to resort to regex.
var html = "<ul><li><p>LARGE</p><p>Lamb</p><br></li></ul> ";
var hap = new HtmlDocument();
hap.LoadHtml(html);
string text = HtmlEntity.DeEntitize(hap.DocumentNode.InnerText);
// text is now "LARGELamb "
string[] lines = hap.DocumentNode.SelectNodes("//text()")
.Select(h => HtmlEntity.DeEntitize(h.InnerText)).ToArray();
// lines is { "LARGE", "Lamb", " " }
If we assume that you are going to fix your html elements.
static void Main(string[] args)
{
string html = WebUtility.HtmlDecode("<p>LARGE</p><p>Lamb</p>");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
List<HtmlNode> spanNodes = doc.DocumentNode.Descendants().Where(x => x.Name == "p").ToList();
foreach (HtmlNode node in spanNodes)
{
Console.WriteLine(node.InnerHtml);
}
}
You need to use HTML Agility Pack. You can add the reference like this:
Install-Package HtmlAgilityPack
Try this:
// erase html tags from a string
public static string StripHtml(string target)
{
//Regular expression for html tags
Regex StripHTMLExpression = new Regex("<\\S[^><]*>", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);
return StripHTMLExpression.Replace(target, string.Empty);
}
Call it like this:
string htmlString="<div><span>hello world!</span></div>";
string strippedString=StripHtml(htmlString);
Assuming that:
the original string is always going to be in that specific format,
and that
you cannot add the HTMLAgilityPack,
here is a quick and dirty way of getting what you want:
static void Main(string[] args)
{
// Split original string on the 'separator' string.
string originalString = "<p>LARGE</p><p>Lamb;<br>;li;ul; ";
string[] sSeparator = new string[] { "</p><p>" };
string[] splitString = originalString.Split(sSeparator, StringSplitOptions.None);
// Prepare to filter the 'prefix' and 'postscript' strings
string prefix = "<p>";
string postfix = ";<br>;li;ul; ";
int prefixLength = prefix.Length;
int postfixLength = postfix.Length;
// Iterate over the split string and clean up
string s = string.Empty;
for (int i = 0; i < splitString.Length; i++)
{
s = splitString[i];
if (s.Contains(prefix))
{
s = s.Remove(s.IndexOf(prefix), prefixLength);
}
if (s.Contains(postfix))
{
s = s.Remove(s.IndexOf(postfix), postfixLength);
}
splitString[i] = s;
Console.WriteLine(splitString[i]);
}
Console.ReadLine();
}
// Decode HTML entities (&lt;, &gt;, etc.) back to characters
String sResult = HttpUtility.HtmlDecode(sData);
// Remove HTML tags delimited by <>
String result = Regex.Replace(sResult, @"<[^>]*>", String.Empty);
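The decode-then-strip idea can be wrapped in one helper. The sketch below is an illustration under assumptions: `HtmlText` and `Strip` are hypothetical names, it uses `System.Net.WebUtility.HtmlDecode` (so no reference to System.Web is required), and it replaces each tag with a space so that words from adjacent elements do not run together.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Decodes entities, removes anything that looks like a tag, and
    // normalizes the remaining whitespace. This is a regex heuristic,
    // not a real HTML parser.
    public static string Strip(string data)
    {
        // "&lt;p&gt;" becomes "<p>", so encoded tags are removed as well.
        string decoded = WebUtility.HtmlDecode(data);

        // Replace each tag with a space to keep neighbouring words apart.
        string withoutTags = Regex.Replace(decoded, @"<[^>]*>", " ");

        // Collapse the whitespace left behind and trim the ends.
        return Regex.Replace(withoutTags, @"\s+", " ").Trim();
    }
}
```

For markup that is nested, malformed, or contains `<` inside attribute values, a parser such as Html Agility Pack remains the safer choice.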
I am currently trying to do a Regex Replace on a JSON string that looks like:
String input = "{\"`####`Answer_Options11\": \"monkey22\",\"`####`Answer_Options\": \"monkey\",\"Answer_Options2\": \"not a monkey\"}";
The goal is to find and replace all the value fields whose key field starts with `####`
I currently have this:
static Regex _FieldRegex = new Regex(@"`####`\w+" + ".:.\"(.*)\",");
static public string MatchKey(string input)
{
MatchCollection match = _FieldRegex.Matches(input.ToLower());
string match2 = "";
foreach (Match k in match )
{
foreach (Capture cap in k.Captures)
{
Console.WriteLine("" + cap.Value);
match2 = Regex.Replace(input.ToLower(), cap.Value.ToString(), #"CAKE");
}
}
return match2.ToString();
}
Now this isn't working. I guess that is natural, since it picks up the entire `####`Answer_Options11\": \"monkey22\",\"`####`Answer_Options\": \"monkey\", as one match and replaces it. I want to replace just match.Groups[1], like you would for a single match on the string.
At the end of the day the JSON string needs to look something like this:
String input = "{\"`####`Answer_Options11\": \"CATS AND CAKE\",\"`####`Answer_Options\": \"CAKE WAS A LIE\",\"Answer_Options2\": \"not a monkey\"}";
Any idea how to do this?
You want a positive lookahead and a positive lookbehind:
(?<=####.+?:).*?(?=,)
The lookahead and lookbehind verify that the surrounding text matches those patterns, but do not include it in the match. This site explains the concept pretty well.
Generated code from RegexHero.com:
string strRegex = @"(?<=####.+?:).*?(?=,)";
Regex myRegex = new Regex(strRegex);
string strTargetString = @" ""{\""`####`Answer_Options11\"": \""monkey22\"",\""`####`Answer_Options\"": \""monkey\"",\""Answer_Options2\"": \""not a monkey\""}""";
foreach (Match myMatch in myRegex.Matches(strTargetString))
{
if (myMatch.Success)
{
// Add your code here
}
}
This will match "monkey22" and "monkey" but not "not a monkey".
Working from @Jonesy's answer, I got to this, which works for what I wanted. It includes the .Replace on the groups that I required. The lookahead and lookbehind were very interesting, but I needed to replace some of those values, hence the groups.
static public string MatchKey(string input)
{
string strRegex = @"(__u__)(.+?:\s*)""(.*)""(,|})*";
Regex myRegex = new Regex(strRegex, RegexOptions.IgnoreCase | RegexOptions.Multiline);
IQS_Encryption.Encryption enc = new Encryption();
int count = 1;
string addedJson = "";
int matchCount = 0;
foreach (Match myMatch in myRegex.Matches(input))
{
if (myMatch.Success)
{
//Console.WriteLine("REGEX MYMATCH: " + myMatch.Value);
input = input.Replace(myMatch.Value, "__e__" + myMatch.Groups[2].Value + "\"c" + count + "\"" + myMatch.Groups[4].Value);
addedJson += "c"+count + "{" +enc.EncryptString(myMatch.Groups[3].Value, Encoding.UTF8.GetBytes("12345678912365478912365478965412"))+"},";
}
count++;
matchCount++;
}
Console.WriteLine("MAC" + matchCount);
return input + addedJson;
}
Thanks again to @Jonesy for the huge help.
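Another way to replace only one group of each match is `Regex.Replace` with a match evaluator, which rebuilds every match from its groups. The sketch below is an illustration rather than the code used above: `JsonValueReplacer`, `ReplaceFlaggedValues`, and the `transform` parameter are hypothetical, the key prefix is taken literally from the question's sample, and it assumes values are simple strings without escaped quotes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class JsonValueReplacer
{
    // Group 1: the flagged key, the colon and the opening quote of the value.
    // Group 2: the value itself. Group 3: the closing quote.
    private static readonly Regex FlaggedField =
        new Regex(@"(""`####`\w+""\s*:\s*"")([^""]*)("")");

    // Applies transform to each value whose key starts with `####` and
    // leaves the rest of the string untouched.
    public static string ReplaceFlaggedValues(string json, Func<string, string> transform)
    {
        return FlaggedField.Replace(json, m =>
            m.Groups[1].Value + transform(m.Groups[2].Value) + m.Groups[3].Value);
    }
}
```

Because the evaluator runs once per match, each value can be replaced with something different, which a single `input.Replace(match.Value, ...)` call cannot guarantee when two matches share the same text. For anything beyond flat string values, a JSON parser would be more reliable than a regex.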
How to strip this text
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<test@test.com>
</body>
</html>
to look like
My First Heading
My first paragraph.
<test@test.com>
Using the function
public static string StripHTML(this string htmlText)
{
var reg = new Regex("<(.|\n)*?>", RegexOptions.IgnoreCase);
return reg.Replace(htmlText, "");
}
I get
My First Heading
My first paragraph.
Use Html Agility Pack for these kinds of operations. It is faster than any regex and supports LINQ.
static void Main(string[] args)
{
string modified_html = emas(input);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(modified_html);
string test1 = doc.DocumentNode.InnerText;
Console.WriteLine(test1);
var reg = new Regex("<(.|\n)*?>", RegexOptions.IgnoreCase);
Console.WriteLine(reg.Replace(modified_html , ""));
Console.Read();
}
public static string emas(string text)
{
string stripped = text;
const string MatchEmailPattern =
@"(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@"
+ @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\."
+ @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
+ @"([a-zA-Z]+[\w-]+\.)+[a-zA-Z]{2,4})";
Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
// Find matches.
MatchCollection matches = rx.Matches(text);
// Report the number of matches found.
int noOfMatches = matches.Count;
// Report on each match.
foreach (Match match in matches)
{
stripped = stripped.Replace("<"+ match.Value + ">" , match.Value);
}
return stripped;
}
static string input = " Your html goes here ";
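If pulling in a parser is not an option, the stripping regex itself can be told to skip anything that looks like an e-mail address in angle brackets. This is a rough sketch only: `EmailSafeStripper` and `Strip` are hypothetical names, and "looks like an e-mail" is deliberately loose here (no whitespace, exactly one @ between the brackets), which is an assumption rather than a full address check.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailSafeStripper
{
    // The negative lookahead (?! ... ) rejects a '<' that is followed by
    // something of the form local@domain> so that span is left in place.
    // Everything else of the form <...> is treated as a tag and removed.
    private static readonly Regex TagButNotEmail =
        new Regex(@"<(?![^<>@\s]+@[^<>@\s]+>)[^>]*>");

    public static string Strip(string htmlText)
    {
        return TagButNotEmail.Replace(htmlText, "");
    }
}
```

A tag whose entire content happens to match that loose shape would also be kept, so the stricter e-mail pattern above may be preferable for untrusted input.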