I'm using Html Agility Pack to parse several text files that I load. I then save the parsed data into a list of strings for further processing. However, when I run this method, it never reaches the line:
MessageBox.Show("test");
Additionally, if I include any other code following this method, none of it is triggered.
Does anyone have any suggestions as to my error?
The entire method is included below:
private void ParseOutput()
{
nodeDupList = new List<string>();
StreamWriter OurStream;
OurStream = File.CreateText(dir + @"\CombinedPages.txt");
OurStream.Close();
for (int crawl = 1; crawl <= crawlPages.Length; crawl++)
{
var web = new HtmlWeb();
var doc = web.Load(dir + @"\Pages" + crawl.ToString() + ".txt");
var nodeCount = doc.DocumentNode.SelectNodes(@"/html[1]/body[1]/div[1]/table[3]/tbody[1]/tr[td/@class=""style_23""]");
int nCount = nodeCount.Count;
for (int a = 3; a <= nCount; a++)
{
var specContent = doc.DocumentNode.SelectNodes(@"/html[1]/body[1]/div[1]/table[3]/tbody[1]/tr[" + a + @"]/td[3]/div[contains(@class,'style_24')]");
foreach (HtmlNode node in specContent)
{
nodeDupList.Add(node.InnerText + ".d");
}
}
}
MessageBox.Show("test");
}
I've created a crawler to save multiple html pages to text and parse them separately using this method.
I'm just using MessageBox to show that it won't continue following the "for loop". I've called multiple methods in my solution and it won't iterate through them.
The application is a Win Forms Application targeted at .Net Framework 4.
Edit:
Thanks for the help.
I realized after rerunning it through the debugger that it was crashing at times on the loop
for (int a = 3; a <= nCount; a++)
{
var specContent = doc.DocumentNode.SelectNodes(@"/html[1]/body[1]/div[1]/table[3]/tbody[1]/tr[" + a + @"]/td[3]/div[contains(@class,'style_24')]");
foreach (HtmlNode node in specContent)
{
nodeDupList.Add(node.InnerText + ".d");
}
}
whenever specContent was null.
No exception was reported; the method just ended.
Because the website I was crawling is dynamic, the query rarely returned null, but it did on several occasions, and that is when this happened.
The solution, for anyone who might need it, is to check whether specContent is null before iterating over it:
for (int a = 3; a <= nCount; a++)
{
var specContent = doc.DocumentNode.SelectNodes(@"/html[1]/body[1]/div[1]/table[3]/tbody[1]/tr[" + a + @"]/td[3]/div[contains(@class,'style_24')]");
if(specContent !=null) //added this check for null
{
foreach (HtmlNode node in specContent)
{
nodeDupList.Add(node.InnerText + ".d");
}
}
}
I could also have used a try{} catch{} block to output the error if needed.
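An alternative to the explicit null check is a small extension method that turns a null sequence into an empty one. The sketch below is only an illustration: the names NullSafe and OrEmpty are hypothetical and not part of Html Agility Pack. It relies on SelectNodes returning null when nothing matches, and on its result being enumerable as HtmlNode items.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Html Agility Pack.
public static class NullSafe
{
    // Returns the sequence itself, or an empty sequence when it is null,
    // so a foreach over the result never throws NullReferenceException.
    public static IEnumerable<T> OrEmpty<T>(this IEnumerable<T> source)
    {
        return source ?? Enumerable.Empty<T>();
    }
}
```

With this in place the inner loop could be written as foreach (HtmlNode node in specContent.OrEmpty()), which simply does nothing for rows that have no matching div.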
I'm trying to edit some data in a file with Visual Studio C#. I've tried using both
StreamReader and File.ReadAllLines / ReadAllText
Both give me 3414 lines of content. I just used Split('\n') after ReadAllText. But when I run the following command on Linux, I get the following results:
cat phase1_promoter_data_PtoP1.txt | wc
Output:
184829 164686174 1101177922
So about 185,000 lines and 165 million words. A word count in Visual Studio gives me about 19 million.
So my question is: am I reading the file wrong, or does Visual Studio have a limit on how much data it will read at once? My file takes up about 1 GB of space.
Here's the code I use:
try
{
using (StreamReader sr = new StreamReader("phase1_promoter_data_PtoP1.txt"))
{
String line = sr.ReadToEnd();
Console.WriteLine(line);
String[,] data = new String[184829, 891];
//List<String> data2 = new List<String>();
string[] lol = line.Split('\n');
for (int i = 0; i < lol.Length; i++)
{
String[] oneLine = lol[i].Split('\t');
//List<String> singleLine = new List<String>(lol[i].Split('\t'));
for (int j = 0; j < oneLine.Length; j++)
{
//Console.WriteLine(i + " - " + lol.Length + " - " + j + " - " + oneLine.Length);
data[i,j] = oneLine[j];
}
}
Console.WriteLine(data[3413,0]);
}
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
The file in your dropbox contains 6043 lines.
Both
Debug.Print(File.ReadAllLines(fPath).Count().ToString());
And
Debug.Print(File.ReadAllText(fPath).Split('\n').Count().ToString());
show the same results (using VS 2013, .NET 4.5).
I was able to cycle through each line with..
using (var sr = new StreamReader(fPath))
{
while (!sr.EndOfStream)
{
Debug.Print(sr.ReadLine());
}
}
And
foreach(string line in File.ReadAllLines(fPath))
{
Debug.Print(line);
}
Instead of reading the entire file into a string at once, try one of the loops above and build an array as you cycle through.
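As a rough sketch of that suggestion, the rows can be collected into a list while streaming the file, so the whole file never has to sit in one string. The class and method names here (TabFileReader, ReadRows) are made up for illustration, and the tab delimiter is taken from the question's code.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only.
public static class TabFileReader
{
    // Streams the file one line at a time and splits each line on tabs.
    // File.ReadLines enumerates lazily, unlike File.ReadAllLines.
    public static List<string[]> ReadRows(string path)
    {
        var rows = new List<string[]>();
        foreach (string line in File.ReadLines(path))
        {
            rows.Add(line.Split('\t'));
        }
        return rows;
    }
}
```

A list of arrays also avoids having to know the row and column counts in advance, which the fixed String[184829, 891] array in the question requires.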
Following the code in this MSDN blog, I have come up with the following code in C#:
using Shell32; //Browse to C:\Windows\System32\shell32.dll
private void GetInstalledPrograms()
{
Shell shell = new Shell();
Shell objShell = shell.Application;
string folderName = @"::{26EE0668-A00A-44D7-9371-BEB064C98683}\8\" +
"::{7B81BE6A-CE2B-4676-A29E-EB907A5126C5}";
var objProgramList = objShell.NameSpace(folderName);
if (objProgramList != null)
{
MessageBox.Show(objProgramList.Items().ToString());
}
else
{
MessageBox.Show("Null");
}
}
For whatever reason, objProgramList is null. The odd thing is that with the following PowerShell code, I get exactly what I'm looking for! I don't know what I'm doing wrong. To me, the two pieces of code look identical...
$Shell = New-Object -ComObject Shell.Application
$folderName = "::{26EE0668-A00A-44D7-9371-BEB064C98683}\8\::{7B81BE6A-CE2B-4676-A29E-EB907A5126C5}"
$folder = $Shell.NameSpace($folderName)
if($folder)
{
$folder.Items()
}
Any chance you are using Windows 8? According to this answer, creating a shell like that doesn't work in Windows 8.
This answer is too late but it might help others with the same problem.
Basically, the problem in your code was the folder path:
string folderName = @"::{26EE0668-A00A-44D7-9371-BEB064C98683}\8\" +
"::{7B81BE6A-CE2B-4676-A29E-EB907A5126C5}";
It should start with "shell:", so it should look like this:
string folderName = @"shell:::{26EE0668-A00A-44D7-9371-BEB064C98683}\8\::{7B81BE6A-CE2B-4676-A29E-EB907A5126C5}";
To get information about the programs, such as Name, Publisher, Installed On and so on, try this code, which enumerates all the available fields (here list is the Shell32.Folder returned by NameSpace):
List<string> arrHeaders = new List<string>();
for (int i = 0; i < short.MaxValue; i++)
{
string header = list.GetDetailsOf(null, i);
if (String.IsNullOrEmpty(header))
break;
arrHeaders.Add(header);
}
foreach (Shell32.FolderItem2 item in list.Items())
{
for (int i = 0; i < arrHeaders.Count; i++)
{
//I used listbox to show the fields
listBox1.Items.Add(string.Format("{0}\t{1}: {2}", i, arrHeaders[i], list.GetDetailsOf(item, i)));
}
}
I have this method that posts a message to my group page on Facebook:
private string PostFacebookWall(string accessToken, string message)
{
var responsePost = "";
try
{
var objFacebookClient = new FacebookClient(accessToken);
var parameters = new Dictionary<string, object>();
parameters["message"] = message;
responsePost = objFacebookClient.Post("/" + GroupId + "/feed", parameters).ToString();
}
catch (Exception ex)
{
responsePost = "Facebook Posting Error Message: " + ex.Message;
}
return responsePost;
}
Then I have a method that makes sure the same message is not posted twice:
private Hashtable alreadyPost = new Hashtable();
private void PostMessage()
{
for (int i = 0; i < ScrollLabel._lines.Length; i++)
{
for (int x = 0; x < WordsList.words.Length; x++)
{
if (ScrollLabel._lines[i].Contains(WordsList.words[x]) && !alreadyPost.ContainsKey(ScrollLabel._lines[i]))
{
lineToPost = ScrollLabel._lines[i];
string testline = lineToPost + Environment.NewLine + ScrollLabel._lines[i + 1];
// The Hebrew text below reads: "Sent automatically as a test via software"
PostFacebookWall(AccessPageToken, testline + Environment.NewLine + Environment.NewLine + "נשלח באופן אוטומטי כניסיון דרך תוכנה");
alreadyPost.Add(lineToPost, true);
numberofposts += 1;
label7.Text = numberofposts.ToString();
}
}
}
}
Then I have a timer tick handler where I update the text and also post the messages every 10 seconds:
private void timer1_Tick(object sender, EventArgs e)
{
counter += 1;
label9.Text = counter.ToString();
label9.Visible = true;
if (counter == 10)
{
scrollLabel1.Reset();
scrollLabel1.Text = " ";
scrollLabel1.Invalidate();
this.scrollLabel1.Text = combindedString;
scrollLabel1.Invalidate();
counter = 0;
PostMessage();
}
}
The problem is that when I run the program again, it keeps posting the same posts, since it doesn't know which posts already exist on my group page.
I need some way to get the posts that already exist on my Facebook group page and add them to alreadyPost each time I start the program, so it won't post them again every time I close and restart it.
I could write the contents of alreadyPost to a text file and read the file back in the constructor, but that doesn't check whether the posts actually exist on my group page.
How can I solve this? The problem is how to know whether the posts are already on my group page: not whether my program sent them or whether they were sent successfully, but whether they are on the group page, so that they are not posted again when I run the program again.
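For the text-file option mentioned above, a minimal sketch might look like the following. PostedLog is a hypothetical class, it assumes each posted line contains no embedded newline, and it only remembers what this program recorded; it does not check the group page itself.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: remembers posted lines across program runs.
public sealed class PostedLog
{
    private readonly string _path;
    private readonly HashSet<string> _posted;

    // Loads previously recorded lines if the log file already exists.
    public PostedLog(string path)
    {
        _path = path;
        _posted = File.Exists(path)
            ? new HashSet<string>(File.ReadAllLines(path))
            : new HashSet<string>();
    }

    public bool Contains(string line)
    {
        return _posted.Contains(line);
    }

    // Records a line and appends it to the file.
    // Returns false if the line was already recorded.
    public bool TryAdd(string line)
    {
        if (!_posted.Add(line)) return false;
        File.AppendAllText(_path, line + Environment.NewLine);
        return true;
    }
}
```

In PostMessage, the alreadyPost.ContainsKey check would become a Contains call and alreadyPost.Add would become TryAdd. Lines fetched from the group feed, if they can be retrieved, could be passed to TryAdd at startup as well.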
The full error I am receiving is:
"The process cannot access the file 'e:\Batch\NW\data_Test\IM_0232\input\RN318301.WM' because it is being used by another process.>>> at IM_0232.BatchModules.BundleSort(String bundleFileName)
at IM_0232.BatchModules.ExecuteBatchProcess()"
The code involved can be seen below. The RN318301.WM file being processed is a text file that contains information which will eventually be placed in PDF documents. Many documents are referenced in the RN318301.WM text file, each one represented by a collection of rows. As can be seen in the code, the RN318301.WM text file is first parsed to determine the number of documents represented in it as well as the maximum number of lines in a document. This information is then used to create a two-dimensional array that will contain all of the document information. The RN318301.WM text file is then parsed again to populate the two-dimensional array, and at the same time information is collected into a dictionary that will be sorted later in the routine.
The failure occurs at the last line below:
File.Delete(_bundlePath + Path.GetFileName(bundleFileName));
This is a sporadic problem that occurs only rarely. It has even been seen to occur with a particular text file with which it had not previously occurred. That is, a particular text file will process fine but then on reprocessing the error will be triggered.
Can anyone help us to diagnose the cause of this error? Thank you very much...
public void BundleSort(string bundleFileName)
{
Dictionary<int, string> memberDict = new Dictionary<int, string>();
Dictionary<int, string> sortedMemberDict = new Dictionary<int, string>();
//int EOBPosition = 0;
int EOBPosition = -1;
int lineInEOB = 0;
int eobCount = 0;
int lineCount = 0;
int maxLineCount = 0;
string compareString;
string EOBLine;
//#string[][] EOBLineArray;
string[,] EOBLineArray;
try
{
_batch.TranLog_Write("\tBeginning sort of bundle " + _bundleInfo.BundleName + " to facilitate householding");
//Read the bundle and create a dictionary of comparison strings with EOB position in the bundle being the key
StreamReader file = new StreamReader(@_bundlePath + _bundleInfo.BundleName);
//The next section of code counts CH records as well as the maximum number of CD records in an EOB. This information is needed for initialization of the 2-dimensional EOBLineArray array.
while ((EOBLine = file.ReadLine()) != null)
{
if (EOBLine.Substring(0, 2) == "CH" || EOBLine.Substring(0, 2) == "CT")
{
if (lineCount == 0)
lineCount++;
if (lineCount > maxLineCount)
{
maxLineCount = lineCount;
}
eobCount++;
if (lineCount != 1)
lineCount = 0;
}
if (EOBLine.Substring(0, 2) == "CD")
{
lineCount++;
}
}
EOBLineArray = new string[eobCount, maxLineCount + 2];
file = new StreamReader(@_bundlePath + _bundleInfo.BundleName);
try
{
while ((EOBLine = file.ReadLine()) != null)
{
if (EOBLine.Substring(0, 2) == "CH")
{
EOBPosition++;
lineInEOB = 0;
compareString = EOBLine.Substring(8, 40).Trim() + EOBLine.Substring(49, 49).TrimEnd().TrimStart() + EOBLine.Substring(120, 5).TrimEnd().TrimStart();
memberDict.Add(EOBPosition, compareString);
EOBLineArray[EOBPosition, lineInEOB] = EOBLine;
}
else
{
if (EOBLine.Substring(0, 2) == "CT")
{
EOBPosition++;
EOBLineArray[EOBPosition, lineInEOB] = EOBLine;
}
else
{
lineInEOB++;
EOBLineArray[EOBPosition, lineInEOB] = EOBLine;
}
}
}
}
catch (Exception ex)
{
throw ex;
}
_batch.TranLog_Write("\tSending original unsorted bundle to archive");
if(!(File.Exists(_archiveDir + "\\" +DateTime.Now.ToString("yyyyMMdd")+ Path.GetFileName(bundleFileName) + "_original")))
{
File.Copy(_bundlePath + Path.GetFileName(bundleFileName), _archiveDir + "\\" +DateTime.Now.ToString("yyyyMMdd")+ Path.GetFileName(bundleFileName) + "_original");
}
file.Close();
file.Dispose();
GC.Collect();
File.Delete(_bundlePath + Path.GetFileName(bundleFileName));
You didn't close/dispose your StreamReader the first time round, so the file handle is still open.
Consider using the using construct - this will automatically dispose of the object when it goes out of scope:
using(var file = new StreamReader(args))
{
// Do stuff
}
// file has now been disposed/closed etc
You need to close your StreamReaders for one thing.
StreamReader file = new StreamReader(@_bundlePath + _bundleInfo.BundleName);
You need to close the StreamReader object, and you could do this in a finally block:
finally {
file.Close();
}
A better way is to use a using block:
using (StreamReader file = new StreamReader(@_bundlePath + _bundleInfo.BundleName)) {
...
}
It looks to me like you are calling GC.Collect to try to force the closing of these StreamReaders, but that doesn't guarantee that they will be closed immediately as per the MSDN doc:
http://msdn.microsoft.com/en-us/library/xe0c2357.aspx
From that doc:
"All objects, regardless of how long they have been in memory, are considered for collection;"
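To make the two-pass pattern from the question concrete, here is a minimal sketch in which each pass gets its own using block, so both handles are released before the delete. TwoPassReader and ReadTwiceAndDelete are hypothetical names, and the line handling is simplified compared with the original BundleSort.

```csharp
using System.IO;

// Hypothetical helper illustrating two passes over one file.
public static class TwoPassReader
{
    // Counts the lines in a first pass, collects them in a second pass,
    // then deletes the file once both readers have been disposed.
    public static string[] ReadTwiceAndDelete(string path)
    {
        int lineCount = 0;
        using (var reader = new StreamReader(path))
        {
            while (reader.ReadLine() != null) lineCount++;
        } // first handle is released here

        var lines = new string[lineCount];
        using (var reader = new StreamReader(path))
        {
            for (int i = 0; i < lineCount; i++) lines[i] = reader.ReadLine();
        } // second handle is released here

        File.Delete(path);
        return lines;
    }
}
```

Because disposal happens deterministically at the end of each using block, no GC.Collect call is needed before File.Delete.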
I have this HTML code
<div class="anc-style" onclick="window.open('./view.php?a=foo')"></div>
I'd like to extract the contents of the "onclick" attribute. I've attempted to do something like:
div.GetAttribute("onclick").ToString();
Which would ideally yield the string
"window.open('./view.php?a=foo')"
but it returns a System.__ComObject.
I'm able to get the class by changing ("onclick") to ("class"), what's going on with the onclick?
HtmlElementCollection div = webBrowser1.Document.GetElementsByTagName("div");
for (int j = 0; j < div.Count; j++) {
if (div[j].GetAttribute("class") == "anc-style") {
richTextBox1.AppendText(div[j].GetAttribute("onclick").ToString());
}
}
You can pull the document tags and extract data as shown below using the HtmlDocument class. This is only an example:
string htmlText = "<html><head></head><body><div class=\"anc-style\" onclick=\"window.open('./view.php?a=foo')\"></div></body></html>";
WebBrowser wb = new WebBrowser();
wb.DocumentText = "";
wb.Document.Write(htmlText);
foreach (HtmlElement hElement in wb.Document.GetElementsByTagName("DIV"))
{
//get start and end positions
int iStartPos = hElement.OuterHtml.IndexOf("onclick=\"") + ("onclick=\"").Length;
int iEndPos = hElement.OuterHtml.IndexOf("\">",iStartPos);
//get our substring
String s = hElement.OuterHtml.Substring(iStartPos, iEndPos - iStartPos);
MessageBox.Show(s);
}
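The IndexOf approach above assumes the attribute is written with double quotes and is immediately followed by ">. A slightly more tolerant sketch uses a regular expression on the same OuterHtml string. AttributeExtractor and GetOnClick are hypothetical names, and the pattern only handles quoted attribute values.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: pulls the onclick value out of an element's markup.
public static class AttributeExtractor
{
    // Matches onclick="..." or onclick='...', ignoring case and
    // allowing whitespace around the equals sign.
    private static readonly Regex OnClick = new Regex(
        "onclick\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
        RegexOptions.IgnoreCase);

    // Returns the attribute value, or null when no quoted onclick is found.
    public static string GetOnClick(string outerHtml)
    {
        Match m = OnClick.Match(outerHtml);
        if (!m.Success) return null;
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }
}
```

It could be called as AttributeExtractor.GetOnClick(hElement.OuterHtml) inside the loop above.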
Also try using div[j]["onclick"]. What browser are you using?
I've created a jsfiddle that works; try it out and see if it works for you:
http://jsfiddle.net/4ZwNs/102/