This is another programming issue in which I think everything looks fine but does not work as intended.
What I'm trying to do is scrape all links from a web page with HtmlAgilityPack and add them to a DataGridView, without adding duplicates.
Code:
webBrowser.Navigate(url);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser.DocumentText);
if (debug)
{
Helpers.SaveDebugToFile(@"Debug\[google.com]-" + DateTime.Now.ToString("hhmmssffffff") + "-debug.html", webBrowser.DocumentText);
}
List<string> values = new List<string>();
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute href = link.Attributes["href"];
if (href.Value.Contains("google.") || href.Value.Contains("search?") || href.Value.StartsWith("/") || href.Value.Length < 5)
{
// Ignore.
}
else
{
// DO NOT ADD TO THE DATAGRID IF href.Value ALREADY EXISTS IN COLUMN 1 //
values.Add(href.Value);
}
}
foreach (var value in values.Distinct().ToList())
{
DataGridViewLinks.Rows.Add(value, randomKeyword);
}
The code runs, but it still adds duplicates to the first column, even though I only add the Distinct() values (or at least that is what I intended).
I can't see the reason for this. I have looked over the code a good few times and don't see anything obviously wrong.
EDIT:
As was already mentioned in the comments above, most likely the values are not exactly equal somewhere (different casing, leading or trailing whitespace, and so on).
It would be better to check for duplicates, with a defined casing and with whitespace removed, at the point where you insert into the "values" list.
Instead of calling Distinct() directly in the foreach loop, store its result in a List first and inspect which values you actually get. That tells you whether the problem is in this section of code or somewhere else. It is also possible that the list is being appended to while the loop is iterating.
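The suggestion above (normalize each value, then reject duplicates at insert time) could be sketched as follows. `LinkDeduplicator` and `DistinctLinks` are hypothetical names, not part of the original code, and treating URLs as case-insensitive is an assumption that may not suit every site.

```csharp
using System;
using System.Collections.Generic;

public static class LinkDeduplicator
{
    // Hypothetical helper: returns the links in first-seen order, skipping any
    // link whose trimmed, case-insensitive form has already been seen.
    public static List<string> DistinctLinks(IEnumerable<string> hrefs)
    {
        // Assumption: links that differ only in casing count as duplicates.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (string href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href)) continue;
            string normalized = href.Trim();
            // HashSet<T>.Add returns false when the value is already present.
            if (seen.Add(normalized)) result.Add(normalized);
        }
        return result;
    }
}
```

If the scraping code runs more than once, keeping the `seen` set in a field rather than a local would probably also catch values that are already in the grid from an earlier run.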
This is part of the code I was trying to use to get the respective elements, but instead of the element text it keeps giving me the following:
System.Collections.ObjectModel.ReadOnlyCollection`1[OpenQA.Selenium.IWebElement]
or something similar.
This is also what is shown in the rows of the DataGridView.
IList<IWebElement> ruas = Gdriver.FindElements(By.ClassName("search-title"));
String[] AllText = new String[ruas.Count];
int i = 0;
foreach (IWebElement element in ruas)
{
AllText[i++] = element.Text;
table.Rows.Add(ruas);
}
First thing: as far as I understand, the elements you are talking about are not contained in a table. It's a list: <ul class="list-unstyled list-inline">... (going by the comment you left with the site link)
If you want to find those elements you can use the code below:
var elements = driver.FindElements(By.CssSelector("ul.list-inline > li > a"));
// Here you can iterate through the links and do whatever you want with them
foreach (var element in elements)
{
Console.WriteLine(element.Text);
}
// Here is the collection of links texts
var linkNames = elements.Select(e => e.Text).ToList();
Considering the error you get, I assume you are using a DataGridView to store the collected data, which is not what it is for. DataGridView is a Windows Forms control for displaying data, not a container for it. Selenium has no standard class for storing table data. There are several approaches, but I can't recommend one because I don't know what you are trying to achieve.
Here is how I answered my own question:
IList<string> all = new List<string>();
foreach (var element in Gdriver.FindElements(By.ClassName("search-title")))
{
all.Add(element.Text);
table.Rows.Add(element.Text);
}
I am comparing two CSV files and determining whether records have been added or removed. I can tell that items were added or removed, but I want to show which records were inserted or removed. I had a different but related post, "Comparing two excel files for differences", where Mauricio Gracia helped a lot.
if (fileB.excelRows.Count() < fileA.excelRows.Count())
{
string result = "";
foreach (ExcelRow rowA in fileA.excelRows)
{
if (!fileB.ContainsHash(rowA.rowHash))
{
result = rowA.ToString();
}
}
MessageBox.Show("Files are NOT the same. Data was REMOVED.\n" + result);
}
else
{
var addedItems = fileB.excelRows.Except(fileA.excelRows);
MessageBox.Show("Files are NOT the same. Data was ADDED.\n"+ addedItems.ToString());
}
}
Right now the message I get is ExcelFileReader.ExcelRow, but I am not seeing the actual record that was removed.
I tried using the Except operator, but I got the same string message.
You can't just call ToString() on the Excel row. It's an object, so you need to go through the object's properties and convert the items you want into a string representation. Something like the below.
foreach (ExcelCell cell in excelRow.AllocatedCells)
{
if (cell.Value != null)
Console.Write("{0}({1})", cell.Value, cell.Value.GetType().Name);
Console.Write("\t");
}
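As a minimal sketch of the same idea applied to the question's comparison: the real `ExcelRow` class is not shown, so the stand-in below, with its `RowHash` and `Cells` members and the `RowDiff.Describe` helper, is entirely hypothetical. It overrides ToString() so that a row prints its cell values, and joins the result of Except() item by item, since calling ToString() on the sequence itself only yields a type name. Note also that the question's loop assigns `result = rowA.ToString()` on each pass, so it would keep only the last removed row.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's ExcelRow; the real class is not shown.
public sealed class ExcelRow : IEquatable<ExcelRow>
{
    public string RowHash { get; }
    public IReadOnlyList<string> Cells { get; }

    public ExcelRow(string rowHash, params string[] cells)
    {
        RowHash = rowHash;
        Cells = cells;
    }

    // Except() relies on Equals/GetHashCode; here rows with the same hash are equal.
    public bool Equals(ExcelRow other) => other != null && RowHash == other.RowHash;
    public override bool Equals(object obj) => Equals(obj as ExcelRow);
    public override int GetHashCode() => RowHash.GetHashCode();

    // Without this override, ToString() returns only the type name.
    public override string ToString() => string.Join(", ", Cells);
}

public static class RowDiff
{
    // Describes the rows present in 'newer' but not in 'older', one row per line.
    public static string Describe(IEnumerable<ExcelRow> older, IEnumerable<ExcelRow> newer)
    {
        return string.Join(Environment.NewLine, newer.Except(older));
    }
}
```

Under these assumptions, swapping the arguments gives the removed rows instead of the added ones.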
I am trying to remove a paragraph from a .docx file using OpenXML (I use placeholder text to generate documents from a template-like .docx file), but whenever I remove a paragraph it breaks the foreach loop I'm using to iterate through the elements.
MainDocumentPart mainPart = doc.MainDocumentPart;
IEnumerable<OpenXmlElement> elems = mainPart.Document.Body.Descendants();
foreach(OpenXmlElement elem in elems){
if(elem is Text && elem.InnerText == "##MY_PLACE_HOLDER##")
{
Run run = (Run)elem.Parent;
Paragraph p = (Paragraph)run.Parent;
p.RemoveAllChildren();
p.Remove();
}
}
This works: it removes my placeholder and the paragraph it is in, but the foreach loop stops iterating, and I need to do more things in that loop.
Is this an acceptable way to remove a paragraph in C# using OpenXML, and why does my foreach loop stop, or how can I keep it from stopping? Thanks.
This is the "Halloween Problem", so called because it was first noticed by database developers on Halloween. It is the problem of mixing declarative code (queries) with imperative code (deleting nodes) at the same time. If you think about it, you are iterating through a linked list, and if you start deleting nodes in that list, you invalidate the iterator. A simple way to avoid this problem is to "materialize" the results of the query in a List; you can then iterate through the list and delete nodes at will. The only difference in the following code is that it calls ToList after calling the Descendants axis.
MainDocumentPart mainPart = doc.MainDocumentPart;
IEnumerable<OpenXmlElement> elems = mainPart.Document.Body.Descendants().ToList();
foreach(OpenXmlElement elem in elems){
if(elem is Text && elem.InnerText == "##MY_PLACE_HOLDER##")
{
Run run = (Run)elem.Parent;
Paragraph p = (Paragraph)run.Parent;
p.RemoveAllChildren();
p.Remove();
}
}
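The materialization technique can be illustrated without the Open XML SDK. The sketch below uses a plain List<string> as a stand-in for the document tree, and `SnapshotDemo.RemovePlaceholders` is a hypothetical name. One difference to keep in mind: a List<T> that is modified during enumeration throws an InvalidOperationException, whereas the question reports that the lazy Descendants() enumeration simply stops.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotDemo
{
    // Removes every placeholder entry while iterating over a snapshot of the list.
    // Returns the number of items visited, to show that the loop runs to the end.
    public static int RemovePlaceholders(List<string> items, string placeholder)
    {
        int visited = 0;
        // ToList() copies the current contents, so removing from 'items'
        // inside the loop does not disturb the enumeration.
        foreach (string item in items.ToList())
        {
            visited++;
            if (item == placeholder) items.Remove(item);
        }
        return visited;
    }
}
```

Dropping the ToList() call from this sketch makes the loop fail on the first removal, which is the list-based counterpart of the behavior described in the question.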
However, I have to note that I see another bug in your code. There is nothing to stop Word from splitting up that text node into multiple text elements from multiple runs. While in most cases, your code will work fine, sooner or later, you or a user is going to take some action (like selecting a character, and accidentally hitting the bold button on the ribbon) and then your code will no longer work.
If you really want to work at the text level, then you need to use code such as what I introduce in this screen-cast: http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/08/04/introducing-textreplacer-a-new-class-for-powertools-for-open-xml.aspx
In fact, you could probably use that code verbatim to handle your use case, I believe.
Another approach, more flexible and powerful, is detailed in:
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/06/13/open-xml-presentation-generation-using-a-template-presentation.aspx
While that screen-cast is about PresentationML, the same principles apply to WordprocessingML.
But even better, given that you are using WordprocessingML, is to use content controls. For one approach to document generation, see:
http://ericwhite.com/blog/map/generating-open-xml-wordprocessingml-documents-blog-post-series/
And for lots of information about using content controls in general, see:
http://www.ericwhite.com/blog/content-controls-expanded
-Eric
You have to use two loops: the first collects the items you want to delete, and the second deletes them.
Something like this:
List<Paragraph> paragraphsToDelete = new List<Paragraph>();
foreach(OpenXmlElement elem in elems){
if(elem is Text && elem.InnerText == "##MY_PLACE_HOLDER##")
{
Run run = (Run)elem.Parent;
Paragraph p = (Paragraph)run.Parent;
paragraphsToDelete.Add(p);
}
}
foreach (var p in paragraphsToDelete)
{
p.RemoveAllChildren();
p.Remove();
}
Dim elems As IEnumerable(Of OpenXmlElement) = MainPart.Document.Body.Descendants().ToList()
For Each elem As OpenXmlElement In elems
If elem.InnerText.IndexOf("fullname") >= 0 Then
elem.RemoveAllChildren()
End If
Next
I am writing my own web crawler specifically for product-selling websites. Because these sites are very badly coded, I end up with different URLs that point to the same page.
Example one
http://www.hizlial.com/bilgisayar/bilgisayar-bilesenleri/bilgisayar/yazicilar/samsung-scx-3200-tarayici-fotokopi-lazer-yazici_30.033.1271.0043.htm
For example, the page above is the same as the one below:
http://www.hizlial.com/bilgisayar-bilesenleri/bilgisayar/yazicilar/samsung-scx-3200-tarayici-fotokopi-lazer-yazici_30.033.1271.0043.htm
As you can see, the first URL contains the element "bilgisayar" twice when you split it on the '/' character.
So what I want is to split URLs like this:
string[] lstSPlit = srURL.Split('/');
After that, I want to check whether the resulting array contains any element more than once. If it does, I will skip the URL, because I will already have extracted the real URL from some other page. What is the best way of doing this?
Longer but working version
string[] lstSPlit = srHref.Split('/');
bool blDoNotAdd = false;
HashSet<string> splitHashSet=new HashSet<string>();
foreach (var vrLstValue in lstSPlit)
{
if (vrLstValue.Length > 1)
{
if (splitHashSet.Contains(vrLstValue) == false)
{
splitHashSet.Add(vrLstValue);
}
else
{
blDoNotAdd = true;
break;
}
}
}
if (list.Distinct().Count() < list.Count)
This ought to be faster than grouping. (I haven't measured)
You can make it even faster by writing your own extension method that adds items to a HashSet<T> and returns false immediately if Add() returns false.
You can even do that using a wicked shorthand:
if (!list.All(new HashSet<string>().Add))
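The extension method mentioned above might look like the following sketch. `DuplicateExtensions` and `HasDuplicates` are hypothetical names; the method stops at the first repeated item instead of enumerating the whole sequence.

```csharp
using System.Collections.Generic;

public static class DuplicateExtensions
{
    // Returns true as soon as an item is seen for the second time.
    // A null comparer makes HashSet<T> fall back to the default equality comparer.
    public static bool HasDuplicates<T>(this IEnumerable<T> source, IEqualityComparer<T> comparer = null)
    {
        var seen = new HashSet<T>(comparer);
        foreach (T item in source)
        {
            // HashSet<T>.Add returns false when the item is already in the set.
            if (!seen.Add(item)) return true;
        }
        return false;
    }
}
```

For the URL case in the question, a call such as `srURL.Split('/').Where(s => s.Length > 1).HasDuplicates()` (with `using System.Linq;`) would presumably match the longer version, which ignores segments of length 0 or 1.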
if(lstSPlit.GroupBy(i => i).Where(g => g.Count() > 1).Any())
{
// found more than once
}
My controller is passing through a list which I then need to loop through and update every record in the list in my database. I'm using ASP.NET MVC with a repository pattern using Linq to Sql. The code below is my save method which needs to add a record to an invoice table and then update the applicable jobs in the job table from the db.
public void SaveInvoice(Invoice invoice, IList<InvoiceJob> invoiceJobs)
{
invoiceTable.InsertOnSubmit(invoice);
invoiceTable.Context.SubmitChanges();
foreach (InvoiceJob j in invoiceJobs)
{
var jobUpdate = invoiceJobTable.Where(x => x.JobID == j.JobID).Single();
jobUpdate.InvoiceRef = invoice.InvoiceID.ToString();
invoiceJobTable.GetOriginalEntityState(jobUpdate);
invoiceJobTable.Context.Refresh(RefreshMode.KeepCurrentValues, jobUpdate);
invoiceJobTable.Context.SubmitChanges();
}
}
I've stripped the code down to just the problem area.
This code doesn't work and no job records are updated, but the invoice table is updated fine. No errors are thrown and the invoiceJobs IList is definitely not null. If I change the code by removing the foreach loop and manually specifying which JobId to update, it works fine. The below works:
public void SaveInvoice(Invoice invoice, IList<InvoiceJob> invoiceJobs)
{
invoiceTable.InsertOnSubmit(invoice);
invoiceTable.Context.SubmitChanges();
var jobUpdate = invoiceJobTable.Where(x => x.JobID == 10000).Single();
jobUpdate.InvoiceRef = invoice.InvoiceID.ToString();
invoiceJobTable.GetOriginalEntityState(jobUpdate);
invoiceJobTable.Context.Refresh(RefreshMode.KeepCurrentValues, jobUpdate);
invoiceJobTable.Context.SubmitChanges();
}
I just can't get the foreach loop to work at all. Does anyone have any idea what I'm doing wrong here?
It seems like the most likely cause of this problem is that the invoiceJobs collection is empty. That is, it has no elements, so the foreach loop effectively does nothing.
You can verify this by adding the following to the top of the method (just for debugging purposes):
if (invoiceJobs.Count == 0) {
throw new ArgumentException("It's an empty list");
}
Change this
var jobUpdate = invoiceJobTable.Where(x => x.JobID == 10000).Single();
jobUpdate.InvoiceRef = invoice.InvoiceID.ToString();
invoiceJobTable.GetOriginalEntityState(jobUpdate);
invoiceJobTable.Context.Refresh(RefreshMode.KeepCurrentValues, jobUpdate);
invoiceJobTable.Context.SubmitChanges();
to
var jobUpdate = invoiceJobTable.Where(x => x.JobID == 10000).Single();
jobUpdate.InvoiceRef = invoice.InvoiceID.ToString();
invoiceJobTable.Context.SubmitChanges();
It looks like your GetOriginalEntityState call doesn't actually do anything, because you don't use the returned value. I can't see any reason for the DataContext.Refresh() call either. All it does is erase the changes you made, which is why your foreach loop "does not work".