Parse RSS Feeds or Website Source Code - C#

I would like to make an app in C# that generates some graphics about football teams in the most important championships.
For example: FC Barcelona or Real Madrid in the Primera Division...
For this I need the latest data from the internet about each team: the team name, points, ranking...
What is the most common way to do this?
Do I have to find an RSS feed for this? Do you have any information about this?
Or do I have to find a website and parse its source code?

If you can find an RSS feed for the information you want, that would be the best solution.
Otherwise, you could use the HTML Agility Pack, http://html-agility-pack.net/ (also available on NuGet). It makes it really easy to parse relevant content out of a messy HTML page. With some creative coding you can write selectors that target the content you want, even on a dynamic site where the structure might change a bit. Really handy for getting content off the web.
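For illustration, here is a minimal sketch of that approach. The standings URL and the XPath expressions are hypothetical placeholders; you would adapt them to whichever site you end up scraping:

// Minimal sketch using the HTML Agility Pack (available on NuGet).
// The URL and XPath below are invented; adapt them to the real standings page.
using System;
using HtmlAgilityPack;

class StandingsScraper
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/primera-division/standings");

        // Select each row of the (assumed) standings table.
        var rows = doc.DocumentNode.SelectNodes("//table[@class='standings']//tr");
        if (rows == null) return;

        foreach (HtmlNode row in rows)
        {
            var cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 3) continue;

            string rank = cells[0].InnerText.Trim();
            string team = cells[1].InnerText.Trim();   // e.g. "FC Barcelona"
            string points = cells[2].InnerText.Trim();
            Console.WriteLine("{0}. {1} - {2} pts", rank, team, points);
        }
    }
}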

Related

How to read a full-text RSS feed

Some sites can produce a full-text RSS feed even when the original feed address doesn't include the full text, like this site.
How can I do that?
I don't know much about C#, but I can still give a general answer on how to solve your problem. RSS feeds (almost) always link to the article, hosted on the newspaper's or blog's website, where the whole article is available. So the "RSS filler" takes the article content from the website and basically puts it back into the feed, replacing the available (short) intro.
To achieve this you need to:
parse/generate RSS/Atom feeds (I'm sure there are plenty of C# libs to do that)
find the actual article in the HTML page linked from the original RSS feed. The linked page contains a lot of things you don't want to put in the "full" RSS feed (such as the website header, nav bar, ads, comments, Facebook like button and so on). The easiest way to do this is to use readability (a quick Google check gives this lib).
If you combine both of these, you can achieve your goal.
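A rough sketch of those two steps in C#, using the built-in SyndicationFeed class (System.ServiceModel.Syndication) for the feed side; ExtractArticle() is a hypothetical stand-in for whatever readability-style extractor you pick:

// Sketch: read a feed, fetch each linked article, rebuild the feed with full content.
// Requires a reference to System.ServiceModel. ExtractArticle() is a placeholder.
using System;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Xml;

class FullTextRss
{
    static void Main()
    {
        var reader = XmlReader.Create("http://example.com/feed.rss");
        SyndicationFeed feed = SyndicationFeed.Load(reader);

        using (var client = new WebClient())
        {
            foreach (SyndicationItem item in feed.Items)
            {
                // Follow the item's link to the article page.
                Uri link = item.Links.First().Uri;
                string html = client.DownloadString(link);

                // Replace the short intro with the extracted full article.
                item.Content = SyndicationContent.CreateHtmlContent(ExtractArticle(html));
            }
        }

        // Write the expanded feed back out as RSS 2.0.
        using (var writer = XmlWriter.Create("full-feed.rss"))
            new Rss20FeedFormatter(feed).WriteTo(writer);
    }

    static string ExtractArticle(string html)
    {
        // Placeholder: plug in a readability port or HtmlAgilityPack selectors here.
        return html;
    }
}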
You can find one implementation of this kind of tool at http://fivefilters.org, and their source code (for older versions) is at http://code.fivefilters.org/full-text-rss/. It's in PHP, but it can give you a rough idea of how to proceed.
You can get a complete script that expands a partial RSS feed from the Full Post RSS Feed website.
The steps involved:
- Get the post URL from the RSS feed.
- Fetch the full content of the post URL (the script uses curl to get the content).
- Parse the content; the script uses templates for that. They keep the templates updated for the most popular websites and WordPress themes. Based on the template, the HTML content is parsed into DOM objects, and the article content is located via those DOM objects.
- Finally, generate the RSS feed again with the full content.
You can check the script, which is written in PHP, to get some idea; later you can rewrite the logic in any language.
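To illustrate the template idea in C#: map each known host to an XPath expression that locates the article body. The host names and XPath expressions below are invented examples, not the actual templates the script ships with:

// Sketch of template-based extraction with the HTML Agility Pack.
// Hosts and XPaths are invented examples.
using System.Collections.Generic;
using HtmlAgilityPack;

static class TemplateExtractor
{
    static readonly Dictionary<string, string> Templates = new Dictionary<string, string>
    {
        { "example-news.com", "//div[@class='article-body']" },
        { "some-wp-blog.com", "//div[@class='entry-content']" },
    };

    public static string Extract(string host, string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string xpath;
        if (!Templates.TryGetValue(host, out xpath))
            xpath = "//body"; // fallback when no template matches the site

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node != null ? node.InnerHtml : html;
    }
}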

How to create a multi-page invoice in ASP.NET C#?

I am thoroughly confused with something I want to do and am looking for some advice.
One of my clients has to produce a monthly invoice detailing all of the company's expenditure, plus two other similar invoices. The client is sure he only needs these invoices, and they are simple enough to produce as far as the logic is concerned.
Now, to create the actual invoices, I don't really want to use reporting solutions like Telerik, SSRS, etc., as I think they are overkill for my purpose. At the same time, I am not sure how I can get the printer to print the invoices on neat pages without cutting anything off.
I am very tempted to just produce the output as a webpage and ask my client to print the invoices from there.
Am I not looking at this the right way? Is this possible?
I could use iTextSharp or something similar to produce PDFs. In fact, I think I will go ahead with that if it isn't possible to just output an HTML page and get the printer to recognize the page breaks somehow.
Because this is a very small job, I don't want to spend too much time on it, as the cost of this freelance project is minimal too.
The reason printing to a new page is important is that my client deals with a few shops, and he wants to print each of his customers their own invoice. I could have him produce each customer's invoice separately and print them, but that is not an ideal way to deal with it.
Thanks
There is a CSS property that tells a browser to break the page: page-break-before.
But if you have a wide list of browsers to support, it would be better to get an HTML-to-PDF conversion library, or really use iTextSharp (as far as I know there is even a module/class that converts HTML to PDF with iTextSharp), as printing web pages has many issues.
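For illustration, a minimal C# sketch that renders all invoices into one page with a forced break before each one; the Invoice type and RenderBody() helper are assumptions, not part of any library:

// Sketch: build printable HTML, forcing a page break before each customer's invoice.
// Invoice and RenderBody() are hypothetical placeholders.
using System.Collections.Generic;
using System.Text;

class Invoice { /* customer, line items, totals ... */ }

static class InvoicePrinter
{
    public static string Render(IEnumerable<Invoice> invoices)
    {
        var sb = new StringBuilder("<html><body>");
        bool first = true;
        foreach (Invoice inv in invoices)
        {
            // page-break-before tells the browser's print engine to start a new page here.
            string style = first ? "" : " style=\"page-break-before: always\"";
            sb.AppendFormat("<div{0}>{1}</div>", style, RenderBody(inv));
            first = false;
        }
        return sb.Append("</body></html>").ToString();
    }

    static string RenderBody(Invoice inv)
    {
        return "..."; // placeholder: customer name, line items, totals
    }
}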
In the past, when I wanted to create a reusable document, I used Word or Excel XML formats.
See: http://en.wikipedia.org/wiki/Microsoft_Office_XML_formats
They are easy to create and tweak; then all you have to do is recreate the dynamic parts in your code. Just save the document in Office XML format, then open it up in WordPad to see where to make your changes.
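A minimal sketch of that approach, assuming you saved a template containing placeholder tokens such as {{CUSTOMER}}; the token names and file paths are illustrative:

// Sketch: fill a Word 2003 XML template saved from Word.
// Token names and paths are assumptions for illustration.
using System.IO;

static class WordXmlTemplate
{
    public static void Fill(string templatePath, string outputPath,
                            string customer, string total)
    {
        string xml = File.ReadAllText(templatePath);
        xml = xml.Replace("{{CUSTOMER}}", customer)
                 .Replace("{{TOTAL}}", total);
        File.WriteAllText(outputPath, xml);
    }
}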
SSRS has a drag-and-drop interface for designing reports and has a PDF output option. If the data is in a SQL Server database, then even with the learning curve it should be easier to build these as SSRS reports.

Issue downloading videos from YouTube

I am trying to make a desktop video-download application in C#.
The problem is that the following code works fine:
// Requires: using System; using System.Net;
WebClient webOne = new WebClient();
string temp1 = "http://www.c-sharpcorner.com/UploadFile/shivprasadk/visual-studio-and-net-tips-and-tricks-15/Media/Tip15.wmv";
webOne.DownloadFile(new Uri(temp1), "video.wmv");
But the following code doesn't:
temp1 = "http://www.youtube.com/watch?v=Y_..."
(in this case a 200-400 kilobyte junk file gets downloaded)
The difference between the two URLs is obvious: the first one contains the exact name of the file, while the other seems to be encrypted in some way...
I was unable to find a proper, satisfactory solution to the problem, so I would highly appreciate a little help here. Thanks.
Note:
From one of the questions here I got the link http://youtubefisher.codeplex.com/, so I visited it, got the source code and read it. It's great work, but what I don't understand is how in the world that person came to know what structures and classes he had to create to download a YouTube video. Why did he have to go through all that trouble, and why isn't my method working?
Someone please guide. Thanks again.
In order to download a video from YouTube, you have to find the actual video location, not the page that you use to watch the video. The http://www.youtube.com/watch?v=... URL is an HTML page (much like this one) that loads the video from its source location and displays it. Normally, you have to parse the HTML and extract the video location from it.
In your case, you found code that does this already - and lucky you, because downloading videos from YouTube is not simple at all. Looking at the link you provided in your question, the magic behind the madness is available in YoutubeService.cs / GetDownloadUrl():
http://youtubefisher.codeplex.com/SourceControl/changeset/view/68461#1113202
That method parses the HTML page returned by a YouTube watch URL and finds the actual video content. The added complexity is that YouTube videos can also come in a variety of different formats.
If you need to convert the video type after downloading, I recommend FFmpeg.
EDIT: In response to your comment - you didn't look at the source code of YoutubeFisher at all, did you? I'd recommend analysing the file I mentioned (YoutubeService.cs). Having taken a quick look myself, it seems you'll have to parse the yt.playerConfig variable within the HTML page.
Use that source to help you.
EDIT: In response to your second comment: "Actually I am trying to develop an application that can download video from any video site." You say that like it's easy; FYI, it's not. Since every video website is different, you can't just write something that will work for everything out of the box. If I had to do it, though, here's how I would: I would write custom parsers for the major video-sharing websites (Metacafe, YouTube, whatever else) so that those are guaranteed to work. After that, I would write a "fallover", if you will. Basically, if you're requesting a video from an unknown website, it would scour the HTML looking for known video extensions (flv, wmv, mp4, etc.) and then extract the URL from that.
You could use a regex to extract the URL in the latter case, or a combination of something like IndexOf, Substring, and LastIndexOf.
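As a rough illustration of that fallback, here is a sketch that scours raw HTML for URLs ending in known video extensions. It is a heuristic, not a reliable parser; many sites hide the real media URL behind scripts:

// Sketch of the "fallover": scan raw HTML for URLs with known video extensions.
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class VideoUrlSniffer
{
    public static IEnumerable<string> FindVideoUrls(string html)
    {
        var regex = new Regex(@"https?://[^\s""'<>]+\.(?:flv|wmv|mp4|webm)",
                              RegexOptions.IgnoreCase);
        foreach (Match m in regex.Matches(html))
            yield return m.Value;
    }
}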
I found this page at CodeProject; it shows you how to make a very efficient YouTube downloader using no third-party libraries. Remember that it is sometimes necessary to modify the code slightly, as YouTube occasionally changes its web structure, which may interfere with the way your app interacts with it.
Here is the link, where you can also download the C# project files and see the code directly:
CodeProject - Youtube downloader using C# .NET

Batch conversion of docx to clean HTML

I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.
I think it'd help to explain what that entails. I work for the database group at my university's IT department. My main job is to take the specs of a report from a docx file, copy them over to Dreamweaver, fix some formatting, and put them onto the group's website. My issue is that doing this over and over is ridiculously tedious. I figured, hey, I haven't written anything in C# for some time now; perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header and footer from the webpage on there, and save the result. I originally planned to convert files one by one, but it probably wouldn't be difficult to have it take a list of files and batch-convert them.
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it just automates an instance of Word, I suspect it'd be slow and memory-intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started using. XSLT had the benefit of not needing Word to be installed or run for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs the document.xml from inside, and transforms it with the DocX2Html.xsl file I scavenged from OpenXML Viewer. I believe that file was originally provided by MS so SharePoint servers could render Word documents in a browser, or something along those lines.
After adjusting that code to fit my needs, and having issues with the objXSLT.Load() method, I ended up using ILMerge to turn the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the resulting file has ridiculously ugly HTML syntax. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
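For reference, the core of that pipeline looks roughly like this; a minimal sketch assuming .NET 4.5's ZipFile and loading the plain .xsl from disk (rather than the merged DLL):

// Sketch of the docx-to-HTML pipeline described above: unzip the package,
// pull word/document.xml, and run it through DocX2Html.xsl.
using System.IO;
using System.IO.Compression;
using System.Xml.Xsl;

static class DocxToHtml
{
    public static void Convert(string docxPath, string xslPath, string htmlPath)
    {
        string tempDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        ZipFile.ExtractToDirectory(docxPath, tempDir); // a .docx is a zip package

        var xslt = new XslCompiledTransform();
        xslt.Load(xslPath);
        xslt.Transform(Path.Combine(tempDir, "word", "document.xml"), htmlPath);

        Directory.Delete(tempDir, true);
    }
}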
Does anyone know how I could remedy this? I'm thinking I may need to write a new XSL file, as the one MS provided is what's responsible for sticking all those tags and all that extra code in there. My problem with that is that I don't know anything about how to write one. Perhaps there's an alternative version already out there. All I'd need is one that preserves tables and text formatting. Images aren't needed.
This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml
Since I'm a big fan of Aspose.Words, a commercial library for creating and processing Word documents, I would do something like this:
Open the Word document with Aspose.Words.
Save the Word document as HTML.
Use something like SgmlReader or the HTML Agility Pack (or even regular expressions, if suitable) to remove unwanted HTML tags/attributes.
Since you wrote that you work at a university, I'm not sure whether commercial packages are an option, though.
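A minimal sketch of the first two steps, based on Aspose.Words' documented Document/Save API; treat it as an outline rather than a drop-in implementation:

// Sketch: batch-convert .docx files passed on the command line to HTML.
using Aspose.Words;

class DocxBatch
{
    static void Main(string[] args)
    {
        foreach (string path in args)
        {
            var doc = new Document(path);
            doc.Save(System.IO.Path.ChangeExtension(path, ".html"), SaveFormat.Html);
        }
    }
}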
Hi, I'm not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one-click conversion, e.g. you can right-click on a Word file and it will be converted directly to HTML with the code placed on the clipboard. The current version also supports command-line access, and the new version will have a server edition too.
There is a free trial version downloadable from the site, and if you have any questions do contact me any time.

How, using .NET RegEx, do I parse an HTML file and find 1. External links. 2. Internal links

I am writing a program that will help me find out which sites my competitors are linking to.
In order to do that, I am writing a program that will parse an HTML file, and will produce 2 lists: internal links and external links.
I will use the internal links to further crawl the website, and the external links are actually what I am looking for.
How, using .NET RegEx, do I parse an HTML file and find 1. external links and 2. internal links?
Thanks in advance,
Eytan Levit.
Edit: In response to the question - no, I am not bound to regex; I can use any other ideas.
Don't use a regular expression for this.
Use something like the HTML Agility Pack which is specifically designed for parsing HTML. (There's even an example on their CodePlex homepage which finds all links in a page.)
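For example, a minimal sketch that collects the links and classifies them as internal or external by comparing hosts; the base URI is a placeholder for the page you crawled:

// Sketch: collect internal vs. external links with the HTML Agility Pack.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkExtractor
{
    public static void Classify(string html, Uri baseUri,
                                List<Uri> internalLinks, List<Uri> externalLinks)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return;

        foreach (HtmlNode a in anchors)
        {
            // Resolve relative hrefs against the page's own URI.
            Uri target;
            if (!Uri.TryCreate(baseUri, a.GetAttributeValue("href", ""), out target))
                continue;

            if (target.Host == baseUri.Host)
                internalLinks.Add(target);
            else
                externalLinks.Add(target);
        }
    }
}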
I had used regex for HTML parsing; it is really fast, but now there are better options that will reduce the development cost.
Try LINQ to HTML; it's good. Beth has a great post about it that can be found here.
