I have a string that contains the code of a webpage.
This is an example:
<input type="text" name="x4B07" value="650"
onchange="this.form.x8000.value=this.name;this.form.submit();"/>
<input type="text" name="x4B08" value="250"
onchange="this.form.x8000.value=this.name;this.form.submit();"/>
In that string I want to get the 650 and 250 (these are variables and they change value).
How can I do so?
Example:
name    value
x4b08   254
x4b07   253
x4b06   252
x4b05   251
If you were confident that the markup would never change (and you have a simple snippet like your example line) a regex could get you those values, for example:
Regex re = new Regex("name=\"(.*?)\" value=\"(.*?)\"");
Match match = re.Match(yourString);
if (match.Success && match.Groups.Count == 3)
{
    // Groups[0] is the whole match; Groups[1] and Groups[2] are the two captures.
    string name = match.Groups[1].Value;
    string value = match.Groups[2].Value;
}
Alternatively, you could parse the page content, query the resulting document for the elements, and then extract the values. (For parser options, see the question "Looking for C# HTML parser".)
You can use regular expressions to match value="([0-9]*)"
Or you can look for the string "value" using string.IndexOf and then take the following few characters.
This should work for you (assuming that s contains the string you want to parse):
string value = s.Substring(s.IndexOf("value=")+7);
value = value.Substring(0, value.IndexOf("\""));
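The regular-expression route mentioned above could look something like the following. This is only a minimal sketch, assuming every value is a quoted run of digits as in the question; the names ValueExtractor and ExtractValues are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ValueExtractor
{
    // Returns every numeric value="..." attribute found in the markup, in document order.
    public static List<string> ExtractValues(string html)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(html, "value=\"([0-9]*)\""))
        {
            // Groups[1] holds the digits between the quotes.
            values.Add(m.Groups[1].Value);
        }
        return values;
    }
}
```

Unlike the IndexOf version, this collects all values in the string rather than only the first one.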
How specific are your examples? Could you also want to extract varying length alphabetic strings? Will the strings you want to extract always be properties?
While the regex/substring way works for the specified examples I think they will scale quite badly.
I'd parse the HTML using a parser (see ndtreviv's answer) or possibly with an XML parser (if the HTML is valid XHTML). That way you will get better control and don't have to bleed your eyes out from fidgeting with a bucketload of regex.
If you have multiple such controls in the string, you can create an XmlDocument and iterate through it.
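A rough sketch of the XmlDocument idea, assuming the fragment is well-formed XML (each input is self-closed and all attributes are quoted, as in the question's example). The names InputReader and ReadInputs and the wrapping root element are illustrative, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class InputReader
{
    // Maps each input's name attribute to its value attribute.
    public static Dictionary<string, string> ReadInputs(string fragment)
    {
        var doc = new XmlDocument();
        // Wrap the fragment in a dummy root, because an XML document needs exactly one root element.
        doc.LoadXml("<root>" + fragment + "</root>");

        var result = new Dictionary<string, string>();
        foreach (XmlElement input in doc.GetElementsByTagName("input"))
        {
            result[input.GetAttribute("name")] = input.GetAttribute("value");
        }
        return result;
    }
}
```

LoadXml throws an XmlException on markup that is not well-formed, so real-world HTML will usually need an HTML parser instead.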
I just solved it with this:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(URL);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
Stream st = resp.GetResponseStream();
StreamReader sr = new StreamReader(st);
string buffer = sr.ReadToEnd();
ArrayList uniqueMatches = new ArrayList();
Regex RE = new Regex("name=\"(.*?)\" value=\"(.*?)\"");
MatchCollection theMatches = RE.Matches(buffer);
for (int counter = 0; counter < theMatches.Count; counter++)
{
    // Groups[1] is the name attribute, Groups[2] is the value attribute.
    string[] dados = new string[2];
    dados[0] = theMatches[counter].Groups[1].Value;
    dados[1] = theMatches[counter].Groups[2].Value;
    uniqueMatches.Add(dados);
}
Thanks, all, for the help.
Related
I think I already wrote what I want to do in title, so now to the point:
I have a .txt file with URLs, and the source code of each page will be parsed with a regular expression.
Source code of every link is scraped by this:
public static string getSourceCode(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
    StreamReader sr = new StreamReader(resp.GetResponseStream());
    string sourceCode = sr.ReadToEnd();
    sr.Close();
    resp.Close();
    return sourceCode;
}
Each page's source code contains text like this:
..code..
..code..
<p class="content">
exampleexampleexample
</p>
..code..
..code..
<p class="content">
example
</p>
..code..
..code..
There can be more than two of these content elements.
I get the content with this:
Regex k = new Regex(@"<p class=""content"">[\r\n\s]*(\S.*)");
var g = k.Matches(sourceCode);
Now I can easily extract every match:
g[0].ToString() // first match
g[1].ToString() // second match
g[2].ToString() // third match
etc.
But what I want is to pick out the links where the first match does not contain XYZ, but at least one of the other matches does.
For example:
First link's source code contains XYZ in first and third match <-- wrong
Second link's source code contains XYZ only in first match <-- wrong
Third link's source code contains XYZ only in third match <-- success!
Solution
I get the match collection like this:
MatchCollection b1 = Regex.Matches(sourceCode, @"<p class=""content"">[\r\n\s]*(\S.*)");
Next I check that the first match does not contain "example":
if (!b1[0].ToString().Contains("example"))
And I check the result of this function:
bool checkAnother(int amount, MatchCollection m)
{
    // Start at 1 to skip the first match.
    for (int i = 1; i <= amount - 1; i++)
    {
        if (m[i].ToString().Contains("example"))
            return true;
    }
    return false;
}
So this is the complete code:
MatchCollection b1 = Regex.Matches(sourceCode, @"<p class=""content"">[\r\n\s]*(\S.*)");
// The Count check avoids an exception on pages with no matches at all.
if (b1.Count > 0 && !b1[0].ToString().Contains("example") && checkAnother(b1.Count, b1))
{
    dataGridView1.Rows[i].Cells[2].Value = "GOOD";
}
What you are trying to do is not suitable for regular expressions.
It's probably possible with multiline matching, capture groups and look-arounds, but IMO it's not worthwhile to put lots of effort into an unmaintainable solution.
Try to verify the found matches in a post-processing step instead. Assuming you grab the matches like so:
var g = k.Matches(sourceCode);
...you can easily achieve that with something like:
var isFirstOk = !g[0].Value.Contains("XYZ");
var areAllOk = isFirstOk && g.Cast<Match>().Skip(1).Any(m => m.Value.Contains("XYZ"));
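The post-processing check can also be wrapped in a small helper that tolerates pages with fewer than two matches. This is a sketch only; MatchFilter and FirstCleanOthersContain are hypothetical names, and the keyword comparison is a plain case-sensitive Contains.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchFilter
{
    // True when the first match lacks the keyword but at least one later match contains it.
    public static bool FirstCleanOthersContain(MatchCollection matches, string keyword)
    {
        // With fewer than two matches there is no "other" match that could contain the keyword.
        if (matches.Count < 2) return false;

        var values = matches.Cast<Match>().Select(m => m.Value).ToList();
        return !values[0].Contains(keyword)
            && values.Skip(1).Any(v => v.Contains(keyword));
    }
}
```

Calling it as MatchFilter.FirstCleanOthersContain(k.Matches(sourceCode), "XYZ") gives one boolean per page, which avoids indexing into an empty collection.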
I want to find all the Instagram URLs within a string and replace them with the embed URL.
But I'm keen on performance, as this could be 5 to 20 posts, each up to 6000 characters long, with an unknown number of Instagram URLs that need converting.
Url examples (Could be any of these in each string, so would need to match all)
http://instagram.com/p/xPnQ1ZIY2W/?modal=true
http://instagram.com/p/xPnQ1ZIY2W/
http://instagr.am/p/xPnQ1ZIY2W/
And this is what I need to replace them with (An embedded version)
<img src="http://instagram.com/p/xPnQ1ZIY2W/media/?size=l" class="instagramimage" />
I was thinking about going for regex? But is this the quickest and most performant way of doing this?
Any examples greatly appreciated.
Something like:
Regex reg = new Regex(@"http://instagr\.?am(?:\.com)?/\S*");
Edit: updated the regex. However, I would combine this with a StringReader and process the text line by line, then put each line (modified or not) into a StringBuilder:
string original = @"someotherText http://instagram.com/p/xPnQ1ZIY2W/?modal=true some other text
some other text http://instagram.com/p/xPnQ1ZIY2W/ some other text
some other text http://instagr.am/p/xPnQ1ZIY2W/ some other text";
StringBuilder result = new StringBuilder();
using (StringReader reader = new StringReader(original))
{
    // Peek returns -1 once the end of the input is reached.
    while (reader.Peek() >= 0)
    {
        string line = reader.ReadLine();
        if (reg.IsMatch(line))
        {
            string url = reg.Match(line).ToString();
            result.AppendLine(reg.Replace(line, string.Format("<img src=\"{0}\" class=\"instagramimage\" />", url)));
        }
        else
        {
            result.AppendLine(line);
        }
    }
}
Console.WriteLine(result.ToString());
You mean like this?
class Program
{
    private static Regex reg = new Regex(@"http://instagr\.?am(?:\.com)?/\S*", RegexOptions.Compiled);
    // Lookarounds isolate the post id between "p/" and the next "/".
    private static Regex idRegex = new Regex(@"(?<=p/).*?(?=/)", RegexOptions.Compiled);
    static void Main(string[] args)
    {
        string original = @"someotherText http://instagram.com/p/xPnQ1ZIY2W/?modal=true some other text
some other text http://instagram.com/p/xPnQ1ZIY2W/ some other text
some other text http://instagr.am/p/xPnQ1ZIY2W/ some other text";
        StringBuilder result = new StringBuilder();
        using (StringReader reader = new StringReader(original))
        {
            // Peek returns -1 once the end of the input is reached.
            while (reader.Peek() >= 0)
            {
                string line = reader.ReadLine();
                if (reg.IsMatch(line))
                {
                    string url = reg.Match(line).ToString();
                    result.AppendLine(reg.Replace(line, string.Format("<img src=\"http://instagram.com/p/{0}/media/?size=l\" class=\"instagramimage\" />", idRegex.Match(url).ToString())));
                }
                else
                {
                    result.AppendLine(line);
                }
            }
        }
        Console.WriteLine(result.ToString());
    }
}
A well-crafted and compiled regular expression is hard to beat, especially since you're doing replacements, not just searching, but you should test to be sure.
If the Instagram URLs are only within HTML attributes, here's my first stab at a pattern to look for:
(?<=")(https?://instagr[^">]+)
(I added a check for https as well, which you didn't mention but I believe is supported by Instagram.)
Some false positives are theoretically possible, but it will perform better than pedantically matching every legal variation of an Instagram URL. (The ">" check is just in case the HTML is missing the end quote for some reason.)
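Pulling the suggestions in this thread together, one possible single-pass version uses a compiled regex with a match evaluator, so that each URL is replaced with its own post id even when several appear in the same text. This is a sketch under assumptions: the pattern (including the optional www. prefix and the characters allowed in an id) is a guess at the URL shapes listed in the question, and InstagramEmbedder and Embed are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InstagramEmbedder
{
    // Captures the post id from instagram.com/p/<id>/ or instagr.am/p/<id>/,
    // and also consumes an optional query string such as ?modal=true.
    private static readonly Regex PostUrl = new Regex(
        @"https?://(?:www\.)?instagr(?:am\.com|\.am)/p/([A-Za-z0-9_-]+)/?(?:\?[^\s""<>]*)?",
        RegexOptions.Compiled);

    public static string Embed(string text)
    {
        // The evaluator runs once per match, so every URL gets its own replacement.
        return PostUrl.Replace(text, m => string.Format(
            "<img src=\"http://instagram.com/p/{0}/media/?size=l\" class=\"instagramimage\" />",
            m.Groups[1].Value));
    }
}
```

Because Replace scans the whole string once, there is no need to split the input into lines first. Whether this is faster than the alternatives for 6000-character posts is something to measure rather than assume.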
I need to extract only the inner string, Rocky44, using C#:
Hi <span>Rocky44</span>
I tried a Split approach, but it doesn't work:
string[] result = temp.Split(new string[] { "<span>" , "</span>" }, StringSplitOptions.RemoveEmptyEntries);
Example:
Hi <span>Rocky44</span>
To:
Rocky44
Use an HTML parser. Here is an example using HtmlAgilityPack:
string html = @"Hi <span>Rocky44</span>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var text = doc.DocumentNode.SelectSingleNode("//span").InnerText;
You're on the right track; you're just not escaping your quotes correctly:
string[] result = temp.Split(new string[] { "<span>" , "</span>" }, StringSplitOptions.RemoveEmptyEntries);
Of course, this is assuming that your input will always be in exactly the given format. As I4V mentions, an HTML parser may come in handy if you're trying to do anything more complicated.
If you're only going to get this sort of thing (eg this sort of HTML) then I would use regex. Else, DO NOT USE IT.
string HTML = @"Hi <span>Rocky44</span>";
var result = Regex.Match(HTML, @"<span[^>]*>(.*?)</span>").Groups[1].Value;
Find the index of <span> and </span> using the IndexOf method.
Then (adjusting for the length of <span>) use the String.Substring method to get the desired text.
string FindLinkText(string linkHtml)
{
    int startIndex = linkHtml.IndexOf("<span>") + "<span>".Length,
        length = linkHtml.IndexOf("</span>") - startIndex;
    return linkHtml.Substring(startIndex, length);
}
I have the following String "</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>"
I need to get the id attribute value from the div tag. How can I retrieve it using C#?
Avoid parsing HTML with regex.
Regex is not a good choice for parsing HTML files.
HTML is neither strict nor regular in its format.
Use HtmlAgilityPack.
You can do it like this with HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
List<string> itemList = doc.DocumentNode.SelectNodes("//div[@id]") // selects all divs that have an id attribute
    .Select(x => x.Attributes["id"].Value) // select the id attribute value
    .ToList();
// itemList will now contain every div's id attribute value
If you're a masochist you can do this old school VB3 style:
string input = @"</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>";
string startString = "div id='";
int startIndex = input.IndexOf(startString);
if (startIndex != -1)
{
    // Move past the marker, then read up to the closing quote.
    startIndex += startString.Length;
    int endIndex = input.IndexOf("'", startIndex);
    string subString = input.Substring(startIndex, endIndex - startIndex);
}
Strictly solving the question asked, one of a myriad ways of solving it would be to isolate the div element, parse it as an XElement and then pull the attribute's value that way.
string bobo = "</script><div id='PO_1WTXxKUTU98xDU1'><!--DO NOT REMOVE-CONTENTS PLACED HERE--></div>";
string justDiv = bobo.Substring(bobo.IndexOf("<div"));
XElement xelem = XElement.Parse(justDiv);
var id = xelem.Attribute("id");
var value = id.Value;
There are certainly lots of ways to solve this but this one answers the mail.
A .NET Regex that looks something like this will do the trick
^</script><div id='(?<attrValue>[^']+)'.*$
you can then get hold of the value as
MatchCollection matches = Regex.Matches(input, @"^</script><div id='(?<attrValue>[^']+)'.*$");
if (matches.Count > 0)
{
    // Read the named group's text via Value.
    var attrValue = matches[0].Groups["attrValue"].Value;
}
I need to extract the $Value part from the following string:
string text = "<h2 class=\"knownclass unknownclass1 unknownclass2\" title=\"Example title>$Value </h2>";
Using this code:
Match m2 = Regex.Match(text, @"<h2 class=""knownclass(.*)</h2>", RegexOptions.IgnoreCase);
I get the whole capture: unknownclass1 unknownclass2" title="Example title>$Value. But I just need the $Value part.
How can I do that? Thanks in advance.
Assuming the string always follows this format, consider the following code:
var start = text.IndexOf(">") + 1;
var end = text.IndexOf("<", start);
var value = text.Substring(start, end - start); // the second argument is a length, not an end index
As has been said multiple times, using a regex to parse HTML or XML is a bad idea. Ignoring that, you are capturing too much. Here is an alternative regex that should work when class is the only attribute on the tag:
@"<h2 class=""knownclass[^""]*"">(.*)</h2>"
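If the tag can carry further attributes after class, as the title attribute in the question does, a pattern that skips everything up to the end of the opening tag may be more forgiving. The following is a sketch, not a general HTML solution: it assumes no attribute value contains a ">" character, and HeadingText and Extract are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingText
{
    // Returns the trimmed text of the first <h2 class="knownclass ..."> element, or null if none is found.
    public static string Extract(string html)
    {
        Match m = Regex.Match(
            html,
            // [^>]* skips the remaining classes and attributes up to the end of the opening tag.
            @"<h2\s+class=""knownclass[^>]*>(.*?)</h2>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```

The lazy (.*?) stops at the first closing tag, so a second h2 later in the same string is not swallowed into the capture.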
If your string always follows the same pattern, you can consider this:
string text = "<h2 class=\"knownclass unknownclass1 unknownclass2\" title=\"Example title>$Value </h2>";
string result = "";
Regex test = new Regex(@"\<.*?\>(.*?)\</h2\>");
MatchCollection matchlist = test.Matches(text);
if (matchlist.Count > 0)
{
    // result ends up holding the capture from the last match.
    for (int i = 0; i < matchlist.Count; i++)
    {
        result = matchlist[i].Groups[1].ToString();
    }
}
But if you are working with XML files or HTML files, I recommend you use XmlTextReader for XML and HtmlAgilityPack for HTML
http://msdn.microsoft.com/en-us/library/system.xml.xmltextreader.aspx
http://htmlagilitypack.codeplex.com/
hope it helps!