Does Google Accept Only Sitemaps With a .txt Extension? - c#

I have finished working on my ASP.NET 4.0 website. Now that I am going to publish it in the next few days, I am looking for resources that can help me rank my site better on popular search engines. My site displays both static and dynamic content. For the dynamic content I will be generating a dynamic sitemap each week. My problem is that I read on the Google Webmaster site that Google accepts sitemaps only with a .txt extension (https://support.google.com/webmasters/answer/183668?hl=en). The original instructions are quoted below:
For best results, use the following guidelines for creating text file sitemaps:
* You must fully specify all URLs in your sitemap, as Google attempts to crawl them exactly as you list them.
* Your text file must use UTF-8 encoding.
* Your text file should contain nothing but the list of URLs.
* You can name the text file anything you wish, provided it has a .txt extension (for instance, sitemap.txt).
As I mentioned, I will be using C# code to dynamically generate an XML sitemap for my site, but I am not sure I will be able to write XML (using C#) into .txt files. I have very little knowledge of writing XML with C# (such as with the XmlWriter class). I have found a site that uses a sitemap file with an .xml extension (http://www.mikesdotnetting.com/sitemap). Can anybody tell me what I need to do to complete this final step of my project? Another thing I am interested to know is whether I should re-submit my sitemap to Google every time a link is modified. Google says to submit a sitemap that contains no more than 50,000 URLs and is no larger than 50 MB.

Submitting every dynamic URL isn't necessarily going to improve your ranking. A lot of your ranking will depend on your PageRank, not the number of URLs you have. You need good content, and people who link to your site because they think your content is good.
You read the document wrong. The very first line says
In addition to the standard XML format, Google also accepts the following file types as sitemaps:
The .txt file extension is only required when you have a text file that lists only the URLs. Notice how even their submission example has a .xml extension, since it's using the XML sitemap format.
It would be much simpler, instead of generating a new sitemap file weekly, to write a handler that generates it on request and caches the data for a period of time, so you're not rebuilding it on every request for the sitemap file.
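A minimal sketch of that approach for ASP.NET 4.0 (the class name, URLs and cache duration below are illustrative, not something from the question):

```csharp
using System;
using System.IO;
using System.Text;
using System.Web;
using System.Xml;

// Hypothetical handler: register it in web.config so that requests for /sitemap.xml hit this class.
public class SitemapHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        var bytes = (byte[])context.Cache["sitemap"];
        if (bytes == null)
        {
            bytes = BuildSitemap();
            // cache the generated sitemap for an hour so it is not rebuilt on every request
            context.Cache.Insert("sitemap", bytes, null,
                DateTime.UtcNow.AddHours(1), System.Web.Caching.Cache.NoSlidingExpiration);
        }

        context.Response.ContentType = "text/xml";
        context.Response.BinaryWrite(bytes);
    }

    private static byte[] BuildSitemap()
    {
        using (var stream = new MemoryStream())
        {
            var settings = new XmlWriterSettings { Indent = true, Encoding = Encoding.UTF8 };
            using (var writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartElement("urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");

                // placeholder URLs; a real site would read these from its database
                foreach (var url in new[] { "http://www.example.com/", "http://www.example.com/products/1" })
                {
                    writer.WriteStartElement("url");
                    writer.WriteElementString("loc", url);
                    writer.WriteElementString("changefreq", "weekly");
                    writer.WriteEndElement();
                }

                writer.WriteEndElement();
            }
            return stream.ToArray();
        }
    }
}
```

Because the handler answers a fixed URL, the sitemap address you submit never changes, so there is nothing new to submit when the content updates.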
If Google knows where the sitemap is, it will check it periodically anyway, so re-submitting it might not gain you anything. You also don't want to submit too often, as Google and other search engines may think you are trying to spam them. That's why the XML sitemap definition has elements for change frequency, so they know how often to re-spider each page.
FYI, don't expect to see a good ranking right away. It takes a while, months even, depending on the popularity of your site and the quality of the content. The volume of links won't help, and Google will find them anyway when it spiders the site. It is possible to do too much.

Related

Search an uploaded document asp.net core

I am looking for help. I am currently uploading files, either PDF or DOCX, to a file server and storing the address in a SQL database. I am now trying to allow the user to search the whole database, including the text of the documents that have been uploaded. I am struggling to find any solutions using ASP.NET Core.
Does anyone have any ideas?
Depending on how many documents you have, this is probably going to be slow as hell.
You will want to look into building a full-text index with something like Lucene, or importing the contents of the files into a SQL Server full-text index.
There isn't a pre-made way to do this, as it is complicated and depends on your specific requirements.
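As a rough sketch of the Lucene route (Lucene.NET 4.8 here, and it assumes the plain text has already been extracted from the PDF/DOCX with a separate library):

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;
var analyzer = new StandardAnalyzer(version);
var indexDir = FSDirectory.Open("search-index");

// Index one uploaded document: store its database id, index the extracted text.
string extractedText = "text pulled out of the uploaded PDF/DOCX with a separate extraction library";
using (var writer = new IndexWriter(indexDir, new IndexWriterConfig(version, analyzer)))
{
    var doc = new Document
    {
        new StringField("id", "42", Field.Store.YES),            // id of the row in your SQL database
        new TextField("content", extractedText, Field.Store.NO)  // searchable, but not stored
    };
    writer.AddDocument(doc);
    writer.Commit();
}

// Search it later and map the hits back to your SQL rows.
using (var reader = DirectoryReader.Open(indexDir))
{
    var searcher = new IndexSearcher(reader);
    var query = new QueryParser(version, "content", analyzer).Parse("invoice");
    foreach (var hit in searcher.Search(query, 10).ScoreDocs)
    {
        Console.WriteLine(searcher.Doc(hit.Doc).Get("id"));
    }
}
```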

Document search and add engine web application

I want to develop an ASP.NET web application which should do the following tasks:
a) The user should be able to add content to a document. Content to be added can include text as well as images, screenshots, etc.
b) The user should be able to search based on keywords. When searching with a keyword, the appropriate content, along with images (if any), should be shown to the user.
I am not sure what the proper approach for this should be. One way, I think, is to store the text content in an XML file and later search for keywords by going through each node of the XML and displaying the results. But I am not sure how to attach image content to the XML. Also, this method doesn't seem efficient if the document size grows a lot over time.
Can anyone please suggest a proper way to meet the above requirements? Any hint would be appreciated.
Split it into two tasks: editing and search.
Full-text search is a solved problem. Simply use Sphinx Search and you are done. Sphinx is simple to use and can do everything you will need. It has a MySQL interface (your app connects to Sphinx the same way as to a second MySQL database).
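Since SphinxQL speaks the MySQL protocol on its own port (9306 by default), a C# app can query it through the ordinary MySQL connector. A rough sketch, assuming an index named documents_idx; depending on your connector and Sphinx versions you may need to tweak the connection string:

```csharp
using System;
using MySql.Data.MySqlClient;

// SphinxQL endpoint on its own port; this is not your real MySQL server.
using (var conn = new MySqlConnection("Server=127.0.0.1;Port=9306;"))
{
    conn.Open();
    using (var cmd = new MySqlCommand(
        "SELECT id FROM documents_idx WHERE MATCH(@keywords) LIMIT 20", conn))
    {
        cmd.Parameters.AddWithValue("@keywords", "user search terms");
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                Console.WriteLine(reader[0]); // document id to look up in your own database
        }
    }
}
```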
Editing is a bit more complicated. If I understand correctly, you want multiple users to edit a single document concurrently.
I recommend using WebSockets to notify other clients about changes in the document. Long polling and Server-Sent Events have ugly side effects, like stopping the browser from making other requests to the server. To implement the client side in JavaScript, I would use React, Angular, or a similar framework to make updates as easy as possible.
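One possible shape of the server side is a SignalR hub (SignalR here rather than raw WebSockets; the hub, method and event names are made up for illustration):

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR;

// Clients call this hub when they change a document; everyone else receives the patch.
public class DocumentHub : Hub
{
    public async Task PublishChange(string documentId, string patch)
    {
        // Clients.Others = every connected client except the sender
        await Clients.Others.SendAsync("documentChanged", documentId, patch);
    }
}

// In Program/Startup: app.MapHub<DocumentHub>("/hubs/document");
```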
The server side requires a modification-friendly representation of the document, so if one user changes one part and another user changes another part, your app should be able to merge the changes. Changing completely different parts is easy, but it may be tricky when both change the same paragraph or document node. The exact representation of each change depends on the format of your document.
I do not see much benefit in using XML rather than any other format. It may be practical for document representation, but it will not help with merging colliding modifications. I would start with a plain array of strings, each representing a single paragraph. Extending it to a full XML document is the easy part once two users can edit the same paragraph.
To store images referenced from the XML, simply store the files using their hash as the file name and then use that name to link each file from the XML. Git does the same thing and it works nicely. You may want to count references to identify unused files.
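A small sketch of that content-addressed storage (folder layout and names are just examples):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Store an uploaded image under its SHA-256 hash and return the name to reference from the XML.
// Example: StoreImage(@"C:\uploads\screenshot.png", @"C:\data\images") -> "3a7bd3...e9f.png"
static string StoreImage(string uploadedFile, string imageDirectory)
{
    byte[] content = File.ReadAllBytes(uploadedFile);

    string hashName;
    using (var sha = SHA256.Create())
    {
        hashName = BitConverter.ToString(sha.ComputeHash(content)).Replace("-", "").ToLowerInvariant()
                   + Path.GetExtension(uploadedFile);
    }

    string target = Path.Combine(imageDirectory, hashName);
    if (!File.Exists(target))               // identical content is stored only once
        File.WriteAllBytes(target, content);

    return hashName;                         // e.g. <image src="3a7bd3...e9f.png" /> inside the document XML
}
```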

Uploading document to Google Docs: how to replace existing and how to upload and convert large doc?

I am trying to make a tool for backup/restore of documents from a Google account.
Backup is easy and I have no problems with it. But I have two unsolved questions about restore:
1) Is it possible to upload a new version of an existing document? When I upload a document, it appears as a separate copy.
I found this was already discussed here: Upload and replace file in given folder on Google Docs using .net api, but it seems the suggestion was just to remove the old version before uploading the new one, and the Id of the document will change. Is this correct?
2) Google Docs has a limit on the size of documents that can be converted into its internal format (http://docs.google.com/support/bin/answer.py?hl=en&answer=37603). So it is possible to create a large document, save it to the local computer, and then Google Docs will refuse to convert it because the document's size is over the limit. In that case it is possible to upload the document without converting it, but then it is not editable via the web site. Is there some workaround for this situation?
Unable to upload large files to Google Docs - the advice there is to break the document into small pieces before uploading and link them together afterwards. But maybe there are other ideas?
1. Is it possible to upload a new version of an existing document? When I upload a document, it appears as a separate copy.
Yes, this is possible. We call it "upload & replace", as you've noticed. There is no need to remove the existing version first. The following link describes how to do this in the protocol:
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#UpdatingMetadataAndContent
From the .NET client library, what you need to do is attach an input stream to the Update() request. The method header for what you need is here:
http://code.google.com/p/google-gdata/source/browse/trunk/clients/cs/src/core/service.cs#554
Create a stream containing your new file content, and just pass that in. That should be it!
2. Google Docs has a limit on document size... Is there some workaround for this situation?
Unfortunately there is currently no way to circumvent the size limitations of converted documents. They must be uploaded as unconverted files and thus are not editable in the Google Docs user interface.

Batch conversion of docx to clean HTML

I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.
I think it'd help to explain what that entails. I work for the database group at my university's IT department. My main job is to take the specs of a report in a docx file, copy them over to Dreamweaver, fix some formatting, and put the result onto their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, so perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header and footer from the webpage on there, and save the result. I originally planned to have it convert one file at a time, but it probably wouldn't be difficult to have it take a list of files and batch convert them.
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started using. XSLT had the benefit of not needing Word to be installed or run for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs the document.xml from it, and runs it through the DocX2Html.xsl file I scavenged from the OpenXML viewer. I believe that stylesheet was originally provided by MS so SharePoint servers could render Word documents in a browser, or something along those lines.
After adjusting that code to fit my needs, and having issues with the objXSLT.Load() method, I ended up using ILMerge to turn the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the resulting file has ridiculously ugly HTML syntax. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
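In outline, the conversion step boils down to something like this (a simplified sketch, not my exact pastebin code; my real code loads the stylesheet from the merged DLL rather than the plain .xsl file):

```csharp
using System.IO.Compression;   // ZipFile/ZipArchive (System.IO.Compression.FileSystem, .NET 4.5+)
using System.Xml;
using System.Xml.Xsl;

static void ConvertDocxToHtml(string docxPath, string xsltPath, string htmlPath)
{
    // Load the stylesheet straight from the .xsl file.
    // Enabling script/document() is needed for stylesheets that use msxsl:script,
    // which may be why loading the plain file failed for me before the DLL workaround.
    var xslt = new XslCompiledTransform();
    xslt.Load(xsltPath, new XsltSettings(enableDocumentFunction: true, enableScript: true),
              new XmlUrlResolver());

    // A .docx is just a zip package; the body markup lives in word/document.xml.
    using (var package = ZipFile.OpenRead(docxPath))
    using (var documentXml = package.GetEntry("word/document.xml").Open())
    using (var reader = XmlReader.Create(documentXml))
    using (var writer = XmlWriter.Create(htmlPath, xslt.OutputSettings))
    {
        xslt.Transform(reader, writer);
    }
}
```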
Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue with that is that I don't know anything about how to do that. Perhaps there's an alternative version already out there. All I'd need is one that will preserve tables and text formatting. Images aren't needed.
This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml
Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:
Open the Word document with Aspose.Words.
Save the Word document as HTML.
Use something like SgmlReader or HTML Agility Pack (or even Regular Expressions if it is suitable) to remove unwanted HTML tags/attributes.
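Roughly, steps 1-3 might look like this (a sketch; the paths are placeholders and you will probably want to strip more than just the style attribute):

```csharp
using Aspose.Words;
using HtmlAgilityPack;

// 1-2: let Aspose.Words open the DOCX and save it as HTML
var doc = new Document(@"C:\reports\spec.docx");
doc.Save(@"C:\reports\spec.html", SaveFormat.Html);

// 3: strip the attributes you don't want with HTML Agility Pack
var html = new HtmlDocument();
html.Load(@"C:\reports\spec.html");
foreach (var node in html.DocumentNode.Descendants())
{
    var styleAttr = node.Attributes["style"];  // drop inline styles; repeat for class, lang, etc.
    if (styleAttr != null)
        styleAttr.Remove();
}
html.Save(@"C:\reports\spec.html");
```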
Since you wrote that you work at a university, I'm not sure whether commercial packages are an option, though.
Hi, I'm not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one-click conversion, e.g. you can right-click on a Word file and it will be converted directly to HTML with the code placed on the clipboard. The current version also supports command-line access, and the new version will have a server edition too.
There is a free trial version downloadable from the site, and if you have any questions do contact me any time.

Download all the links from any page

I want to develop an ASP.NET page through which I can specify the URL of any page that contains links to many files and directories. I want to download them all, similar to the DownThemAll plugin for Firefox.
i.e.
"MyPage.htm" contains many links to files/directories located on the same server.
Now I want to write a function which can download all these files if I provide
"www.mycustomdomain.com\Mypage.htm" as input.
I hope the question is clear.
Fetch the web page as HTML. Google "c# fetch file from web"; the first link will give you the idea.
Then find the links with regular expressions.
An example regex pattern for links on www.x.com would be:
(http://www.x.com/.*?)
(But it's better if you also include the A tag in your regex pattern.)
And download the files as shown in:
http://www.csharp-examples.net/download-files/
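Put together, it might look roughly like this (a sketch; the page URL, target folder and regex are placeholders to adapt):

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// placeholders: the page to scan and the folder to save into
string pageUrl = "http://www.x.com/MyPage.htm";
string targetDir = @"C:\Downloads";

using (var client = new WebClient())
{
    string html = client.DownloadString(pageUrl);

    // naive href pattern for links on the same site; an HTML parser is more robust than regex
    foreach (Match m in Regex.Matches(html, "href=\"(http://www\\.x\\.com/[^\"]+)\"", RegexOptions.IgnoreCase))
    {
        string fileUrl = m.Groups[1].Value;
        string fileName = Path.GetFileName(new Uri(fileUrl).LocalPath);
        if (fileName.Length == 0)   // skip directory-style links ending in "/"
            continue;

        client.DownloadFile(fileUrl, Path.Combine(targetDir, fileName));
    }
}
```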
If I understand your question: you have an HTM file with a list of links, those links point to specific files on a remote server, and you want to download all of the files.
There is no fail-proof way to do this.
Check this question: How do you parse an HTML in vb.net. Even though it is for VB.NET, it is related to what you asked. You can get an array of links and then start downloading the files.
You can use the Computer.Network.DownloadFile method to download the remote file and save it to a location of your choice.
This is not a fail-proof method, because if a download requires authentication it will download the HTML page (mostly the login page) instead.
