To explain briefly, I'm trying to search Google with a keyword, then get the URLs of the top 10 results and save them.
This is the stripped down command line version of the code. It should return 1 result at least. If it works with that, I can apply it to my full version of the code and get all the results.
The code I have right now fails if I run it against the entire source of a Google results page. If I feed it a section copied from Google's HTML source, it works fine. To me, that means my Regex has an error somewhere.
If there is a better way to do this aside from Regex, please let me know. The URLs are between <h3 class="r"><a href=" and " class=l onmousedown="return clk(this.href
I got this Regex code from a generator, but it's really hard for me to understand Regex, since nothing I've read explains it clearly. If someone could pick out what's wrong and explain why, I'd greatly appreciate it.
Thanks,
Kevin
using System;
using System.Text.RegularExpressions;
using System.Net;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
WebClient wc = new WebClient();
string keyword = "seo nj";
string html = wc.DownloadString(String.Format("http://www.google.com/search?q={0}", keyword));
string re1 = "(<)"; // Any Single Character 1
string re2 = "(h3)"; // Alphanum 1
string re3 = "(\\s+)"; // White Space 1
string re4 = "(class)"; // Variable Name 1
string re5 = "(=)"; // Any Single Character 2
string re6 = "(\"r\")"; // Double Quote String 1
string re7 = "(>)"; // Any Single Character 3
string re8 = "(<)"; // Any Single Character 4
string re9 = "([a-z])"; // Any Single Word Character (Not Whitespace) 1
string re10 = "(\\s+)"; // White Space 2
string re11 = "((?:[a-z][a-z]+))"; // Word 1
string re12 = "(=)"; // Any Single Character 5
string re13 = ".*?"; // Non-greedy match on filler
string re14 = "((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s\"]*))"; // HTTP URL 1
string re15 = "(\")"; // Any Single Character 6
string re16 = "(\\s+)"; // White Space 3
string re17 = "(class)"; // Word 2
string re18 = "(=)"; // Any Single Character 7
string re19 = "(l)"; // Any Single Character 8
string re20 = "(\\s+)"; // White Space 4
string re21 = "(onmousedown)"; // Word 3
string re22 = "(=)"; // Any Single Character 9
string re23 = "(\")"; // Any Single Character 10
string re24 = "(return)"; // Word 4
string re25 = "(\\s+)"; // White Space 5
string re26 = "(clk)"; // Word 5
Regex r = new Regex(re1 + re2 + re3 + re4 + re5 + re6 + re7 + re8 + re9 + re10 + re11 + re12 + re13 + re14 + re15 + re16 + re17 + re18 + re19 + re20 + re21 + re22 + re23 + re24 + re25 + re26, RegexOptions.IgnoreCase | RegexOptions.Singleline);
Match m = r.Match(html); // match against the downloaded page (the variable is named html, not txt)
if (m.Success)
{
Console.WriteLine("Good");
String c1 = m.Groups[1].ToString();
String alphanum1 = m.Groups[2].ToString();
String ws1 = m.Groups[3].ToString();
String var1 = m.Groups[4].ToString();
String c2 = m.Groups[5].ToString();
String string1 = m.Groups[6].ToString();
String c3 = m.Groups[7].ToString();
String c4 = m.Groups[8].ToString();
String w1 = m.Groups[9].ToString();
String ws2 = m.Groups[10].ToString();
String word1 = m.Groups[11].ToString();
String c5 = m.Groups[12].ToString();
String httpurl1 = m.Groups[13].ToString();
String c6 = m.Groups[14].ToString();
String ws3 = m.Groups[15].ToString();
String word2 = m.Groups[16].ToString();
String c7 = m.Groups[17].ToString();
String c8 = m.Groups[18].ToString();
String ws4 = m.Groups[19].ToString();
String word3 = m.Groups[20].ToString();
String c9 = m.Groups[21].ToString();
String c10 = m.Groups[22].ToString();
String word4 = m.Groups[23].ToString();
String ws5 = m.Groups[24].ToString();
String word5 = m.Groups[25].ToString();
//Console.Write("(" + c1.ToString() + ")" + "(" + alphanum1.ToString() + ")" + "(" + ws1.ToString() + ")" + "(" + var1.ToString() + ")" + "(" + c2.ToString() + ")" + "(" + string1.ToString() + ")" + "(" + c3.ToString() + ")" + "(" + c4.ToString() + ")" + "(" + w1.ToString() + ")" + "(" + ws2.ToString() + ")" + "(" + word1.ToString() + ")" + "(" + c5.ToString() + ")" + "(" + httpurl1.ToString() + ")" + "(" + c6.ToString() + ")" + "(" + ws3.ToString() + ")" + "(" + word2.ToString() + ")" + "(" + c7.ToString() + ")" + "(" + c8.ToString() + ")" + "(" + ws4.ToString() + ")" + "(" + word3.ToString() + ")" + "(" + c9.ToString() + ")" + "(" + c10.ToString() + ")" + "(" + word4.ToString() + ")" + "(" + ws5.ToString() + ")" + "(" + word5.ToString() + ")" + "\n");
Console.WriteLine(httpurl1);
}
else
{
Console.WriteLine("Bad");
}
Console.ReadLine();
}
}
}
You're doing it wrong.
Google has an API for doing searches programmatically. Don't put yourself through the pain of trying to parse HTML with regexes, when there's already a published, supported way to do what you want.
Besides, what you're trying to do -- submit automated searches through Google's Web site and scrape the results -- is a violation of section 5.3 of their Terms of Service:
You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)
Using Regex to parse HTML is sado-masochism.
Try using the HTML Agility Pack instead. It will allow you to parse HTML. See this question for an example of using it.
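If you stay with Regex anyway, a single short pattern with one named capture group is far easier to debug than 26 concatenated fragments. The following is only a minimal sketch: it assumes the markup looks exactly like the two delimiters quoted in the question, and the class name, method name, and sample HTML are made up for illustration. Real result pages may differ, and the terms of service point above still applies.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ResultLinkSketch
{
    // Matches <h3 class="r"><a href="..." and captures everything up to the closing quote.
    private static readonly Regex LinkPattern = new Regex(
        @"<h3\s+class=""r""\s*>\s*<a\s+href=""(?<url>[^""]+)""",
        RegexOptions.IgnoreCase);

    // Returns every captured URL in document order.
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    static void Main()
    {
        // Hypothetical sample markup, not real Google output.
        string sample =
            "<h3 class=\"r\"><a href=\"http://example.com/a\" class=l onmousedown=\"return clk(this.href\">A</a></h3>" +
            "<h3 class=\"r\"><a href=\"http://example.org/b\" class=l onmousedown=\"return clk(this.href\">B</a></h3>";

        foreach (string url in ExtractUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

Using Matches instead of Match returns all results rather than only the first, which is what collecting the top 10 needs.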
I would like to write every data to a CSV file. Everything is working fine, but the numbers aren't adding up, they are written next to each other.
This part of the code:
file.WriteLine("\"1. koordinata:\"" + ";" + opening.rect_co_x + ";" + cord.y + ";" + opening.rect_co_z);
file.WriteLine("\"2. koordinata:\"" + ";" + opening.rect_co_x + opening.rect_width + ";" + cord.y + ";" + opening.rect_co_z);
Every property is a double, uint or int.
What I expect:
1 + 3 = 4
What I got:
1 + 3 = 13
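Once one operand is a string, + means string concatenation, so the numbers are appended to the text one after another instead of being added. Compute the sum first, then put the result into the string.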
Try to use the following approach:
int a = 1, b = 3;
string str = $"a;{a + b};b"; // a;4;b
//or string.Format
string str2 = string.Format("a;{0};b", a + b);
You can change your code to:
file.WriteLine($"\"1. koordinata:\";{opening.rect_co_x};{cord.y};{opening.rect_co_z}");
file.WriteLine($"\"2. koordinata:\";{opening.rect_co_x + opening.rect_width};{cord.y};{opening.rect_co_z}");
I hope this helps.
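To see why the original line printed 13, note that + is evaluated left to right: "text" + 1 yields a string, and adding 3 to that string appends "3". Wrapping the arithmetic in parentheses is enough to fix it. A small sketch with made-up names and values:

```csharp
using System;

static class CsvLineSketch
{
    // Left to right: ("v;" + x) is already a string, so width is appended as text.
    public static string Concatenated(int x, int width)
    {
        return "v;" + x + width;
    }

    // The parentheses force the numeric addition to happen first.
    public static string Summed(int x, int width)
    {
        return "v;" + (x + width);
    }

    static void Main()
    {
        Console.WriteLine(Concatenated(1, 3)); // v;13
        Console.WriteLine(Summed(1, 3));       // v;4
    }
}
```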
I've got an issue where I am applying a template to an object and am using a find and replace function to mesh the template in the form of a string of html. The issue is, the height and width of the image are contained in the token so I don't have a way to find and replace as it could vary.
Token value is [ARTICLEIMAGE:150:200]
foreach(var article in articles) {
var articleTemplateValue = _TemplateArticleMarkup;
articleTemplateValue = articleTemplateValue.Replace("[ARTICLEIMAGE:xx:yy]", "<img src=" + article.ArticleImageFolder + "/" + article.ArticleImage + " title=" + article.ArticleTitle + " width=\"xx\" height=\"yy\" />");
}
This obviously will not work for every case, as the dimensions in the image token vary. Is there a way to find the token with something like StartsWith and then split the dimensions into an array on the ":"? Please let me know if that makes sense, as it is a little confusing. Thanks!
Regex will solve this issue for you.
using System.Text.RegularExpressions;
Then change your code as seen below.
foreach (var article in articles)
{
string articleTemplateValue = _TemplateArticleMarkup;
MatchCollection mc = Regex.Matches(articleTemplateValue, @"\[ARTICLEIMAGE\:(\d+)\:(\d+)\]");
if (mc.Count > 0)
{
string toReplace = mc[0].Value;
string xx = mc[0].Groups[1].Value;
string yy = mc[0].Groups[2].Value;
articleTemplateValue = articleTemplateValue.Replace(toReplace, "<img src=\"" + article.ArticleImageFolder + "/" + article.ArticleImage + "\" title=\"" + article.ArticleTitle + "\" width=\"" + xx + "\" height=\"" + yy + "\"/>");
}
}
You can use the Split() command to find the width and the height. A very rough approach follows:
rextester demo
String articleTemplateValue = "[test:40:200]";
Console.WriteLine(articleTemplateValue);
var arr = articleTemplateValue.Split(':');
if (arr.Length == 3) {
var xx = arr[1];
var yy = arr[2].Substring(0, arr[2].Length - 1);
articleTemplateValue = articleTemplateValue.Replace(articleTemplateValue, "<img src="
+ "folder" + "/" + "image" + " title=" + "ArticleTitle" + " width="+ xx + " height= " + yy+ "/>");
Console.WriteLine(articleTemplateValue);
}
Using Regex would do the trick (repl.it demo):
"\[ARTICLEIMAGE:\d+?:\d+?\]"
\[ escape the [ character. Brackets are special characters in Regex
\d any digit
\d+?: + means one or more digits, up to the colon :. The ? makes the match non-greedy and is not really needed here.
\] escape the closing bracket
var matches = Regex.Match(articleTemplateValue, @"\d+");
var xx = matches;
var yy = matches.NextMatch();
var template = "<img src=" + article.ArticleImageFolder + "/" + article.ArticleImage + " title=" + article.ArticleTitle + " width="
+ xx + " height="
+ yy + " />";
articleTemplateValue = Regex.Replace(articleTemplateValue, @"\[ARTICLEIMAGE:\d+?:\d+?\]", template);
This uses:
Regex.Match(string, string)
Regex.Replace(string, string, string)
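The answers above can also be combined into a single call: Regex.Replace accepts a MatchEvaluator, so every token in the template is replaced using its own captured width and height. This is a minimal sketch; the class name, method name, and sample values are hypothetical, and the attribute values are inserted without any HTML encoding.

```csharp
using System;
using System.Text.RegularExpressions;

static class ImageTokenSketch
{
    // Group 1 captures the width, group 2 the height.
    private static readonly Regex TokenPattern =
        new Regex(@"\[ARTICLEIMAGE:(\d+):(\d+)\]");

    // Replaces every [ARTICLEIMAGE:w:h] token with an <img> tag built from that token's own numbers.
    public static string Expand(string template, string src, string title)
    {
        return TokenPattern.Replace(template, m =>
            "<img src=\"" + src + "\" title=\"" + title +
            "\" width=\"" + m.Groups[1].Value +
            "\" height=\"" + m.Groups[2].Value + "\" />");
    }

    static void Main()
    {
        string template = "<p>[ARTICLEIMAGE:150:200]</p>";
        Console.WriteLine(Expand(template, "images/cat.png", "Cat"));
    }
}
```

Because the evaluator runs once per match, a template with several tokens of different sizes is handled without indexing into a MatchCollection.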
I'm having trouble figuring out how to keep the last key in my array from being followed by a comma. Since it's being exported to a .json file, the last key shouldn't have a ",".
I know you can detect the last element by using .Last(), but I can't seem to make that work. Any recommendations?
//Data Path
string dataPath = @"..\..\FileIOExtraFiles\DataFieldsLayout.txt";
string[] dataList = File.ReadAllLines(dataPath);
//save Data data
using (StreamWriter outStream = new StreamWriter(outputFolder + @"\CharacterStringData3.json"))
{
outStream.WriteLine("{");
for (int i = 0; i < dataFile.Length; i++)
{
string s = dataFile[i];
char last = s.Last();
if (s == "")
{
outStream.WriteLine("\"" + dataList[i] + "\"" + " : " + "\" \",");
}
else
{
outStream.WriteLine("\"" + dataList[i] + "\"" + " : \"" + s + "\",");
}
}
outStream.WriteLine("}");
}
Output:
{
"data1":"item1",
"data2":"item2",
"lastKey":"item3",//trying to remove comma from last key in array.
}
As others have pointed out, it doesn't make sense that you are building json manually, but given that this is a question more about technique, here is one approach: you could change it to this:
var commaSuffix = (i < dataFile.Length - 1) ? "," : string.Empty;
outStream.WriteLine("\"" + dataList[i] + "\"" + " : \"" + s + "\"" + commaSuffix);
The suffix would be used on every iteration except the last.
Change this
outStream.WriteLine("\"" + dataList[i] + "\"" + " : " + "\" \",");
To this
outStream.WriteLine("\"" + dataList[i] + "\"" + " : " + "\" \"" + (i < dataFile.Length - 1 ? "," : ""));
Instead of using outStream.WriteLine() at every step, store it in a string. Then you can remove the last comma from that string and write the whole string at once:
//Get last index of comma
int lastCommaIndex = outputString.LastIndexOf(',');
//Create new StringBuilder with everything before the last comma
StringBuilder sb = new StringBuilder(outputString.Substring(0,lastCommaIndex));
//Add everything after the last comma, or just add a closing brace
//sb.Append("}"); //This instead of next line
sb.Append(outputString.Substring(lastCommaIndex+1));
//Add contents of StringBuilder to the Stream
outStream.WriteLine(sb);
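Another way to avoid the trailing comma, without tracking the index or trimming afterwards, is to collect the lines first and let string.Join put the separator only between them. This is a minimal sketch with hypothetical names and sample data; like the original code it does no JSON escaping, so a real serializer remains the safer choice.

```csharp
using System;
using System.Collections.Generic;

static class JsonLinesSketch
{
    // Builds the same "key" : "value" lines as the question; commas go between lines only.
    public static string Build(string[] keys, string[] values)
    {
        var lines = new List<string>();
        for (int i = 0; i < keys.Length; i++)
        {
            // Mirror the original behaviour: an empty value is written as a single space.
            string value = values[i] == "" ? " " : values[i];
            lines.Add("\"" + keys[i] + "\" : \"" + value + "\"");
        }
        return "{\n" + string.Join(",\n", lines) + "\n}";
    }

    static void Main()
    {
        string[] keys = { "data1", "data2", "lastKey" };
        string[] values = { "item1", "", "item3" };
        Console.WriteLine(Build(keys, values));
    }
}
```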
I am trying to replace a hash character in a string, but the following is not working:
string address = "Blk 344, Jurong West, Street 11, #02-111";
address.Replace("#","%23");
Any ideas? This has been driving me crazy.
Full query string:
http://localhost:54965/SKATEZ/thankyou.aspx?firstname=Fiora&lastname=Ray&address=Blk%20344,%20Jurong%20West,%20Street%2011,%20#02-111&total=22&nirc=S6799954H&country=Singapore&orderid=85&postalcode=746112
I construct the url as follows
string url = "thankyou.aspx?firstname=" + firstname + "&" + "lastname=" + lastname + "&" + "address=" + HttpUtility.EscapeDataString(address) + "&" + "total=" + total + "&" + "nirc=" + tbID.Text + "&" + "country=" + ddlCountry.SelectedValue + "&" + "orderid=" + orderid + "&" + "postalcode=" + tbPostalCode.Text;
Response.Redirect(url);
Try
address = address.Replace("#","%23");
Strings in C# are immutable:
Strings are immutable--the contents of a string object cannot be changed after the object is created, although the syntax makes it appear as if you can do this. For example, when you write this code, the compiler actually creates a new string object to hold the new sequence of characters, and that new object is assigned to b. The string "h" is then eligible for garbage collection.
Using System.Uri.EscapeDataString(string) should fix your issue:
var urlbuilder = new StringBuilder();
urlbuilder.AppendFormat("thankyou.aspx?firstname={0}", firstname);
urlbuilder.AppendFormat("&lastname={0}", lastname);
urlbuilder.AppendFormat("&address={0}", System.Uri.EscapeDataString(address));
urlbuilder.AppendFormat("&total={0}", total);
urlbuilder.AppendFormat("&nirc={0}", tbID.Text);
urlbuilder.AppendFormat("&country={0}", ddlCountry.SelectedValue);
urlbuilder.AppendFormat("&orderid={0}", orderid);
urlbuilder.AppendFormat("&postalcode={0}", tbPostalCode.Text);
Response.Redirect(urlbuilder.ToString());
(using System.Text.StringBuilder to compose your url makes the code a little more readable)
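Both points can be checked in a few lines: calling Replace without using its return value leaves the original string unchanged, and Uri.EscapeDataString percent-encodes the # so it is no longer treated as the start of a URL fragment. A small sketch with a hypothetical helper and a shortened address:

```csharp
using System;

static class EscapeSketch
{
    // Hypothetical helper: builds only the address part of the query string.
    public static string BuildQuery(string address)
    {
        return "thankyou.aspx?address=" + Uri.EscapeDataString(address);
    }

    static void Main()
    {
        string address = "Street 11, #02-111";

        address.Replace("#", "%23");            // return value discarded: address is unchanged
        Console.WriteLine(address);             // still contains '#'

        Console.WriteLine(BuildQuery(address)); // '#' is emitted as %23
    }
}
```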
I have a routine where I prompt the user for a value, in this case the city. They will type in, for example, LA. I store this value in a variable named inputValue.
Now I need to pass a string to Crystal Reports that uses this input, and I want it to look like this:
{member.name} = "LA"
string inputValue = GetInputValue("Enter value for " + fieldName);
string sqlInput = sqlInput.Substring(0, leftPos - 1) + " + inputValue + " + sqlInput.Substring(rightPos + 2);
I thought using " + inputValue + " would do the trick, but it only puts the quotation mark after the input value, e.g. LA \". What is the proper way to quote this?
" + inputValue + " +
Should be
inputValue +
Thus making
string sqlInput = sqlInput.Substring(0, leftPos - 1) + inputValue +
sqlInput.Substring(rightPos + 2);
That assumes you don't want '"' characters leading and trailing your value.
If you do want them, that would be
string sqlInput = sqlInput.Substring(0, leftPos - 1) +"\"" + inputValue + "\"" +
sqlInput.Substring(rightPos + 2);
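The escaping is easier to read when it is pulled into a tiny helper. This is a sketch only: the Quote name is made up, the formula text is taken from the question, and it assumes the input itself contains no double quotes.

```csharp
using System;

static class QuoteSketch
{
    // Wraps a value in literal double quotes; "\"" is a one-character string holding a quote.
    public static string Quote(string value)
    {
        return "\"" + value + "\"";
    }

    static void Main()
    {
        string inputValue = "LA";
        Console.WriteLine("{member.name} = " + Quote(inputValue)); // {member.name} = "LA"
    }
}
```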