Creating a simple 'spider' - c#

I have researched spidering and think it is a little too complex for the quite simple app I am trying to make. Some data on a web page is not available in the page source because it is only displayed by the browser.
If I wanted to get a value from a specific web page displayed in a WebBrowser control, is there any way to read values from the contents of that control?
If not, does anyone have suggestions on how they might approach this?

You’re not looking for spidering, you’re looking for screen scraping.

I'd have to agree with Bombe; it sounds more like you want HTML screen scraping. It requires lots of parsing, and if the page you're scraping ever changes, your app will break. However, here's a small example of how to do it:
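// Requires: using System.Net; using System.Text;
const string strUrl = "http://www.yahoo.com/";
string html;
// WebClient is disposable, so release it once the download is done.
using (WebClient webClient = new WebClient())
{
    // Download the raw bytes and decode them as UTF-8.
    byte[] reqHTML = webClient.DownloadData(strUrl);
    html = Encoding.UTF8.GetString(reqHTML);
}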
Now the html variable has the entire HTML in it, and you can start parsing away.

Because the browser simply renders the underlying content, the most flexible approach would be to parse the underlying content (html/css/js/whatever) yourself.
I would create a parsing engine that looks for the things your spider application needs.
This could be a basic string searching algorithm which looks for href="" for example and reads the values in order to produce new requests and continue spidering. Your engine could be written to only look for things it is interested in and extended in that way for more functionality.
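As a rough sketch of the "look for href" idea, something like the following could pull links out of the downloaded HTML. The class and method names (LinkExtractor, ExtractLinks) are made up for illustration, and a regular expression is only a crude stand-in for a real HTML parser: it will miss unquoted attributes and links generated by script.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LinkExtractor
{
    // Matches href="..." or href='...' and captures the value between the quotes.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"']([^\"']*)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the links found in the html as absolute URLs, resolved against baseUrl.
    public static List<string> ExtractLinks(string html, Uri baseUrl)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            string raw = match.Groups[1].Value;
            // Uri.TryCreate resolves relative links such as "/about" against the base URL.
            if (Uri.TryCreate(baseUrl, raw, out Uri resolved))
            {
                links.Add(resolved.AbsoluteUri);
            }
        }
        return links;
    }
}
```

Each returned URL could then be queued as a new request if you want to keep spidering, e.g. LinkExtractor.ExtractLinks(html, new Uri("http://www.yahoo.com/")).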

Related

How to Open Default browser and target an element?

Hi all, I have done some Googling and not come up with a great deal apart from using the browser control within a Form, which I don't want to do.
Does anybody have some sample code or a good resource that is detailed enough to get me on my way, please?
So, for example:
Process.Start("https://www.google.com")
and then target the search element with a string and click search,
using the default browser.
Please help me...
Doing something like this would work:
string mySearchQuery = "this is a search example";
Process.Start("https://www.google.com/search?q=" + Uri.EscapeDataString(mySearchQuery));
If I'm understanding you correctly, this would use the default browser set in Windows, and the query is simply passed in the URL of a GET request (that's the ?q variable).
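A slightly fuller sketch of the same idea, in case it helps. The class and method names here are my own, not from any library. One thing worth knowing: on .NET Core / .NET 5 and later, ProcessStartInfo.UseShellExecute defaults to false, so passing a bare URL to Process.Start fails unless you turn it on explicitly.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration.
public static class BrowserSearch
{
    // Builds the search URL; EscapeDataString percent-encodes spaces and reserved characters.
    public static string BuildSearchUrl(string query)
    {
        return "https://www.google.com/search?q=" + Uri.EscapeDataString(query);
    }

    // Asks the operating system shell to open the URL, which uses the default browser.
    public static void OpenInDefaultBrowser(string url)
    {
        var startInfo = new ProcessStartInfo(url)
        {
            // Needed on .NET Core / .NET 5+, where the default is false.
            UseShellExecute = true
        };
        Process.Start(startInfo);
    }
}
```

Usage would be BrowserSearch.OpenInDefaultBrowser(BrowserSearch.BuildSearchUrl("this is a search example")). Note this only works because the search can be expressed in the URL; it does not let you click elements in the external browser.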

Cannot access internal property 'ConvertedHtmlElementSelector' here

I am using HiQPdf Free to generate PDFs from a URL. I noticed in their documentation that you can grab a specific element instead of the whole page. It would go something like this:
HtmlToPdf htmlToPdfConverter = new HtmlToPdf();
htmlToPdfConverter.ConvertedHtmlElementSelector = "#logo";
htmlToPdfConverter.ConvertUrlToFile("https://your-website.com/", "/path/to/pdf.pdf");
However, when I set htmlToPdfConverter.ConvertedHtmlElementSelector in my code, I get this error:
Cannot access internal property 'ConvertedHtmlElementSelector' here
Could this be because it's a paid only feature? That seems like the only obvious reason, however, I haven't been able to find any source on that.
Converting only a region of the HTML page to PDF is a feature of the full version and it is not available in the free version. There is an example for this feature with C# and VB.NET code samples at http://www.hiqpdf.com/demo/ConvertHtmlRegionToPdf.aspx

Editing html templates

I am going to create a function in my application that sends some status mails to several receivers in a list.
Earlier I used plain text format for the email, but now I want to send the mail based on some HTML templates. I need tips regarding a good way to insert data into these templates before sending them.
E.g.
%CpuStatus%
%HardriveStatus%
and so on. I have the solution for everything except a way to fill anchors like these with data. This is a WinForms application, so I don't have access to the ASP functionality.
Maybe this sort of thing would be the simplest?
// This would most likely be loaded from a file or database.
string emailBody = "CPU Status: %CpuStatus%\nHard Drive Status: %HardriveStatus%";
string cpuStatus = MyService.GetCpuStatus();
// String.Replace returns a new string, so assign the result back.
emailBody = emailBody.Replace("%CpuStatus%", cpuStatus);
If you really wanted to make a big project out of it, you could use a WebBrowser control, load it with your HTML file, and then use the WebBrowser's Document property to get an HtmlDocument object. You can then loop through its children (recursively) to find the tags you want to change.
Personally, I would do the .Replace method suggested previously.
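If you end up with more than a couple of placeholders, the .Replace approach can be wrapped in a small helper. This is only a sketch: EmailTemplate and Fill are names I made up, and HTML-encoding the values is my own assumption (it seems sensible for an HTML mail body, but drop it if your values already contain markup).

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper for illustration.
public static class EmailTemplate
{
    // Replaces every %Key% placeholder in the template with its HTML-encoded value.
    public static string Fill(string template, IDictionary<string, string> values)
    {
        string result = template;
        foreach (KeyValuePair<string, string> pair in values)
        {
            // Assumption: values are plain text, so encode them before inserting into HTML.
            string safeValue = WebUtility.HtmlEncode(pair.Value);
            // String.Replace returns a new string, so the result must be reassigned.
            result = result.Replace("%" + pair.Key + "%", safeValue);
        }
        return result;
    }
}
```

You would load the template text from a file or database, build a dictionary with keys such as "CpuStatus" and "HardriveStatus", and pass both to EmailTemplate.Fill before sending the mail.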

How to translate website in another language?(ASP .NET , c#)

I have developed a large business portal. I just realized I need my website in another language. I have researched the solutions available like
Use a third-party control on my website. (Doesn't fit in my design. Not useful from an SEO point of view. Don't want to show third-party brand names.)
Create resource files for each language. (A lot of work is required to restructure pages to use text from resource files. What about data entered by the user, like Business Description?)
Are there any other options available?
I was thinking of a solution where, when a page is created on the server side, I could translate it before sending it back to the client. Is there any way I can do that? (To translate everything, including data added from databases or through code, without affecting the design.)
If you really need to translate your application, it's going to take a lot of hard, tedious work. There is no magic bullet.
The first thing you need to do is convert your plain text in your markup to asp:Localize controls. By using the Localize control, you can leave your existing <span> tags in place and just replace the text inside of them. There's really no way around this. Visual Studio's search and replace supports regular expression matching that may help you with this, or you can use Resharper (see below).
The first approach would be to download the open source shopping application nopCommerce and see how they handle their localization. They store their strings in a database and have a UI for editing languages. A similar approach may work well for you.
Alternatively, if you want to use Resource Files, there are two tools that I would recommend using in addition to Visual Studio: Resharper 5 (Localization Features screencast) and Zeta Resource Editor. These are the steps I would take to accomplish it using this method:
Use the "Generate Local Resource" tool in visual studio for each page
Use Resharper's "Move HTML to resource" on the text in your markup to make them into Localize controls.
Use Resharper to search out any localizable strings in your code behind and move them to the resource file as well.
Use the Globalization Rules of Code Analysis / FXCop to help find any additional problems you might face formatting numbers, dates, etc.
Once all text is in the resx files, use Zeta Resource Editor to load up all of your resx files, add new languages, and export for translation (or auto translate if you're brave enough).
I've used this approach on a site translated into 8 languages (and growing) with dozens of pages (and growing). However, this is not a user-editable site; the pages are solely controlled by the programmers.
A large switch statement? Use a dictionary/hashtable instead (a separate instance for each language); it is much, much more efficient and fast.
To convert the page to Arabic or another language, go to:
1-page design
2-Tools
3-Generate Local Resource
4-obtain "App_LocalResources", which includes "filename.aspx.resx"
5-copy the file and rename it to "filename.aspx.ar.resx" to convert the page to Arabic (or use another language code).
Hope this helps :)
I found a good solution, see in http://www.nopcommerce.com/p/1784/nopcommerce-translator.aspx
this project is open source and source repository is here: https://github.com/Marjani/NopCommerce-Translator
good luck
Without installing any third-party tools, APIs, or DLLs, I am able to use App_LocalResources. I still use Google Translate for the words and sentences to be translated and copy and paste the results into the file, as you can see in one of the screenshots below (or you can have a human translator type them in manually). In your project folder (using MS Visual Studio as the editor), add an App_LocalResources folder and create the English and other-language resx files. In my case, it's a Spanish (es-ES) translation. See screenshot below.
Next, on your aspx, add the meta tags (meta:resourcekey) that will match in the App_LocalResources. One for English and another to the Spanish file. See screenshots below:
Spanish: (filename.aspx.es-ES.resx)
English: (filename.aspx.resx)
Then create a link on your masterpage file with a querystring that will switch the page translation and will be available on all pages:
<%--ENGLISH/SPANISH VERSION BUTTON--%>
<asp:HyperLink ID="eng_ver" runat="server" Text="English" Font-Underline="false"></asp:HyperLink> |
<asp:HyperLink ID="spa_ver" runat="server" Text="Español" Font-Underline="false"></asp:HyperLink>
<%--ENGLISH/SPANISH VERSION BUTTON--%>
On your masterpage code behind, create a dynamic link to the Hyperlink tags:
////LOCALIZATION
string thispage = Request.Url.AbsolutePath;
eng_ver.NavigateUrl = thispage;
spa_ver.NavigateUrl = thispage + "?ver=es-ES";
////LOCALIZATION
Now, on your page files' code behind, you can set a session variable to make all links or redirections to stick to the desired translation by always adding a querystring to urls.
On PageLoad:
///'LOCALIZATION
//dynamic querystring; add this to urls ---> ?" + Session["add2url"]
{
if (Session["version"] != null)
{
Session["add2url"] = "?ver=" + Session["version"]; //SPANISH version
}
else
{
Session["add2url"] = ""; // ENGLISH as default
}
}
///'LOCALIZATION
On Click Events sample:
protected void btnBack_Click(object sender, EventArgs e)
{
Session["FileName.aspx"] = null;
Response.Redirect("FileName.aspx" + Session["add2url"]);
}
I hope my descriptions were easy enough.
If you don't want to write much code and Google Translate is feasible for you, you can try the Google Translate API. See the code below.
<script src="http://translate.google.com/translate_a/element.js?cb=googleTranslateElementInit"></script>
<script>
function googleTranslateElementInit() {
$.when(
new google.translate.TranslateElement({pageLanguage: 'en', includedLanguages: 'en',
layout: google.translate.TranslateElement.FloatPosition.TOP_LEFT}, 'google_translate_element')
).done(function(){
var select = document.getElementsByClassName('goog-te-combo')[0];
select.selectedIndex = 1;
select.addEventListener('click', function () {
select.dispatchEvent(new Event('change'));
});
select.click();
});
}
$(window).on('load', function() {
var select = document.getElementsByClassName('goog-te-combo')[0];
select.click();
var selected = document.getElementsByClassName('goog-te-gadget')[0];
selected.hidden = true;
});
</script>
Also, add the code below inside the <body> tag:
<div id="google_translate_element"></div>
It will certainly be more work to create resource files for each language - but this is the option I would opt for, as it gives you the opportunity to be more accurate. If you do it this way you can have the text translated, manually, by someone that speaks the language (there are many companies out there that offer this kind of service).
Automatic translation systems are often good for giving a general impression of what something in another language means, but I would never use them when trying to portray a professional image, as often what they output just doesn't make sense. Nothing screams 'unprofessional!' like text that just doesn't make sense because it's been automatically translated.
I would take the resource file route over the translation option because the meaning of words in a language can be very contextual and even one mistake could undermine your site's credibility.
As you suggest, Visual Studio can generate the meta resource file keys for most controls containing text, but it may leave you having to do the rest manually. Still, I don't see an easier, more reliable solution.
I don't think localisation is easy to automate anyway, as text held in the database often requires schema changes to allow for multiple languages, and the HTML often needs restructuring to deal with truncated or wrapped label and button text because, for example, you've translated into German.
Other considerations:
Culture settings - financial delimiters, date formats.
Right-to-left - some languages, like Arabic, are written right to left, meaning that the pages require rethinking with regard to the positioning of controls, images, etc.
Good luck whatever you go with.
I ended up doing it the hard way:
I wrote an extension method on the string class called TranslateInto
On the Page's PreRender method I grab all controls recursively based on their type (the types that would have text)
Foreach through them and text.TranslateInto(SupportedLanguages.CurrentLanguage)
In my TranslateInto method I have a ridiculously large switch statement with every string displayed to the user and its associated translation.
It's not very pretty, but it worked.
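For what it's worth, the dictionary suggestion made in another answer might look roughly like this. It is a sketch only: I'm using a plain language code string instead of the SupportedLanguages type mentioned above (which isn't shown), and the two Spanish entries are just sample data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical replacement for the large switch statement.
public static class Translations
{
    // One dictionary per language code; the entries here are illustrative only.
    private static readonly Dictionary<string, Dictionary<string, string>> Tables =
        new Dictionary<string, Dictionary<string, string>>
        {
            ["es-ES"] = new Dictionary<string, string>
            {
                ["Back"] = "Atrás",
                ["Save"] = "Guardar"
            }
        };

    // Returns the translation, or the original text when no translation is known.
    public static string TranslateInto(this string text, string language)
    {
        if (Tables.TryGetValue(language, out Dictionary<string, string> table)
            && table.TryGetValue(text, out string translated))
        {
            return translated;
        }
        return text;
    }
}
```

Falling back to the original text means an untranslated string still shows up in the default language rather than disappearing. The tables could just as well be loaded from a file or database at startup instead of being hard-coded.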
We work with a Translation CAT tool (Computer Assisted Translation) called MemoQ that allows us to translate the text while leaving all the tags and coding in place. This is very helpful when the order of words change when you translate from one language to another.
It is also very useful because it allows us to work with translators from around the world, without the need for them to have any technical expertise. It also allows us to have the translation proofread by a second translator.
We use this translation environment to translate html, xml, InDesign, Word, etc.
I think you should try Google Translate.
http://translate.google.com/translate_tools
Very easy and very very effective.
HTH

How do I create an html report without hardcoding the html?

I'm currently refactoring a console application whose main responsibility is to generate a report based on values stored in the database.
The way I've been creating the report up til now is as follows:
// Requires: using System.Text;
// Adjacent string constants are joined with + so the literal can span lines.
const string format = "<tr><td>{0, 10}</td><td>{1}</td><td>{2, 8}</td><td>{3}</td>" +
                      "<td>{4, -30}</td><td>{5}</td><td>{6}</td></tr>";
StringBuilder builder = new StringBuilder();
if(items.Count > 0)
{
    // Header row.
    builder.AppendLine(
        String.Format(format, "Date", "Id", "WorkItemId",
                      "Account Number", "Name", "Address", "Description"));
}
foreach(Item item in items)
{
    // One table row per item.
    builder.AppendLine(String.Format(format, item.StartDate, item.Id,
                       item.WorkItemId, item.AccountNumber,
                       String.Format("{0} {1}",
                                     item.FirstName, item.LastName),
                       item.Address, item.Description));
}
string report = String.Format("<html><table border=\"1\">{0}</table></html>",
                              builder.ToString());
(The above is just a sample...and sorry about the formatting...I tried to format it so it wouldn't require horizontal scrolling....)
I really don't like that way I've done this. It works and does the job for now...but I just don't think it is maintainable...particularly if the report becomes any more complex in terms of the html that needs to be created. Worse still, other developers on my team are sure to copy and paste my code for their applications that generate an html report and are likely to create a horrible mess. (I've already seen such horrors produced! Imagine a report function that has hundreds of lines of hard coded sql to retrieve the details of the report...its enough to make a grown man cry!)
However, while I don't like this at all...I just can't think of a different way to do it.
Surely there must be a way to do this...I'm certain of it. Not too long ago I was doing the same thing when generating tables in aspx pages until someone kindly showed me that I can just bind the objects to a control and let .NET take care of the rendering. It turned horrible code, similar to the code above, into two or three elegant lines of goodness.
Does anyone know of a similar way of creating the html for this report without hard-coding the html?
Make your app produce an XML file with the raw data. Then apply an external XSLT stylesheet to it; the stylesheet would contain the HTML.
More info: http://msdn.microsoft.com/en-us/library/14689742.aspx
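A minimal sketch of what that could look like with XslCompiledTransform. The class name, the XML shape (an items root with item children), and the sample stylesheet are all assumptions for illustration; in practice the stylesheet would live in an external .xslt file so it can be changed without recompiling.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper for illustration.
public static class XsltReport
{
    // Sample stylesheet; normally this would be loaded from an external file.
    public const string SampleStylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""html"" indent=""no"" />
  <xsl:template match=""/items"">
    <html><table border=""1"">
      <xsl:for-each select=""item"">
        <tr><td><xsl:value-of select=""@id"" /></td><td><xsl:value-of select=""name"" /></td></tr>
      </xsl:for-each>
    </table></html>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the stylesheet to the XML data and returns the resulting HTML text.
    public static string Render(string xml, string stylesheet)
    {
        var transform = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            transform.Load(styleReader);
        }

        using (XmlReader dataReader = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            transform.Transform(dataReader, null, output);
            return output.ToString();
        }
    }
}
```

The report code then only has to produce the XML, and all of the HTML lives in the stylesheet.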
You could use a template engine like NVelocity to separate your report view and your code.
There are probably other decent template engines out there...
meziod - Another avenue to pursue is extension methods on the HtmlTextWriter object. I found a brilliant stab at just this on this very site.
HtmlTextWriter extension
I'm certain that you could leverage great potential from that...
regards - coola
Well, you could use one of the report frameworks (Crystal, MS RDL, etc) and export as html - however, I suspect that for simple data your current approach is less overhead. I might use an XmlWriter or LINQ-to-XML (rather than string.Format, which won't handle escaping)...
new XElement("tr",
new XElement("td", item.StartDate),
new XElement("td", item.Id),
new XElement("td", item.WorkItemId),
etc. Escaping is especially important for text values (name, description, etc).
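To round out the fragment above, here is one way the whole table might be built with LINQ to XML. The class and method names are hypothetical, and passing each row as an object array is just a simplification so the sketch does not depend on the Item class from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration.
public static class XElementReport
{
    // Builds one <tr>; LINQ to XML escapes the cell text when the tree is serialized.
    public static XElement Row(params object[] cells)
    {
        return new XElement("tr", cells.Select(cell => new XElement("td", cell)));
    }

    // Wraps the rows in <html><table border="1"> and returns the markup as a string.
    public static string Table(IEnumerable<object[]> rows)
    {
        var table = new XElement("table",
            new XAttribute("border", "1"),
            rows.Select(row => Row(row)));
        return new XElement("html", table).ToString(SaveOptions.DisableFormatting);
    }
}
```

A name like "Smith & Sons" comes out as "Smith &amp; Sons" without any manual escaping, which is the main advantage over string.Format.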
You might consider using a simple template engine such as http://www.stefansarstedt.com/templatemaschine.html and separate your template from the content.
This is quite practical, allows template modification without recompiling and you still got C# power in your templates.
Microsoft SQL Reporting Services does this quite well, and can do multiple formats.
My company uses it to create PDF reports and since we have HIPAA requirements, we automatically put a password to it via a third party PDF control...
As coolashaka already mentioned, using the HtmlTextWriter is a good option, especially if you add some useful extension methods. A simple example:
public static void WriteNav(this
HtmlTextWriter writer, List<String> navItems)
{
writer.RenderBeginTag("nav");
writer.RenderBeginTag(HtmlTextWriterTag.Ul);
foreach (var item in navItems)
{
writer.RenderBeginTag(HtmlTextWriterTag.Li);
writer.AddAttribute(HtmlTextWriterAttribute.Href, "~/" + item + ".html");
writer.RenderBeginTag(HtmlTextWriterTag.A);
writer.Write(item);
writer.RenderEndTag();
writer.RenderEndTag();
}
writer.RenderEndTag();
writer.RenderEndTag();
}
I know this is an old question, but Google will keep directing people here for years to come.
