Document search and add engine web application

Document search and add engine web application - c#

I want to develop an ASP.NET web application that should do the following:
a) The user should be able to add content to a document. The content can include text as well as images, screenshots, etc.
b) The user should be able to search by keywords. When searching with a keyword, the matching content, along with its images (if any), should be shown to the user.
I am not sure what the proper approach is. One way I can think of is to store the text content in an XML file and later search for keywords by going through each node of the XML and displaying the matches. But I am not sure how to attach image content to XML. This method also doesn't seem nice or efficient if the document grows a lot over time.
Can anyone suggest a proper way to meet these requirements? Any hint would be appreciated.

Split it into two tasks: editing and search.
Full-text search is a solved problem. Simply use Sphinx Search and you are done. Sphinx is simple to use and can do everything you will need. It has a MySQL interface (your app connects to Sphinx the same way it would connect to a second MySQL database).
Editing is a bit more complicated. If I understand correctly, you want multiple users to edit a single document concurrently.
I recommend using WebSockets to notify other clients about changes in the document. Long polling and Server-Sent Events have ugly side effects: each holds an HTTP connection open, which can stop the browser from making other requests to the same server once it hits its per-host connection limit. To implement the client side in JavaScript, I would use React, Angular, or a similar framework to make updates as easy as possible.
The server side requires a modification-friendly representation of the document, so that if one user changes one part and another user changes a different part, your app can merge the changes. Changing completely different parts is easy, but it gets tricky when two users change the same paragraph or document node. The exact representation of each change depends on the format of your document.
I do not see much benefit in using XML rather than any other format. It may be practical for representing the document, but it will not help with merging colliding modifications. I would start with a plain array of strings, each representing a single paragraph. Extending it to a full XML document is the easy part once two users can edit the same paragraph.
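As a rough illustration of the paragraph-array idea, here is a minimal sketch of a three-way merge. It assumes each client sends back its edited copy together with the base version it started from, and that paragraphs are only changed in place (none added or removed). The names `ParagraphMerge` and `TryMerge` are made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphMerge
{
    // Three-way merge of two edited copies against their common base version.
    // Assumption: paragraphs are only edited in place, never added or removed.
    // Returns false when both users changed the same paragraph differently.
    public static bool TryMerge(
        IReadOnlyList<string> baseDoc,
        IReadOnlyList<string> mine,
        IReadOnlyList<string> theirs,
        out List<string> merged,
        out List<int> conflicts)
    {
        if (baseDoc.Count != mine.Count || baseDoc.Count != theirs.Count)
            throw new ArgumentException("This sketch only handles in-place paragraph edits.");

        merged = new List<string>(baseDoc.Count);
        conflicts = new List<int>();

        for (int i = 0; i < baseDoc.Count; i++)
        {
            bool mineChanged = mine[i] != baseDoc[i];
            bool theirsChanged = theirs[i] != baseDoc[i];

            if (mineChanged && theirsChanged && mine[i] != theirs[i])
            {
                conflicts.Add(i);        // same paragraph edited in two different ways
                merged.Add(baseDoc[i]);  // keep the base text until someone resolves it
            }
            else
            {
                // At most one side changed (or both made the same change).
                merged.Add(mineChanged ? mine[i] : theirs[i]);
            }
        }

        return conflicts.Count == 0;
    }
}
```

Edits to different paragraphs merge cleanly; the indexes in `conflicts` are the paragraphs where the hard part (merging inside a single paragraph) would still have to happen.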
To store images in XML, simply save each file using its hash as the file name, and then use that name to link the file from the XML. Git does the same thing and it works nicely. You may want to count references to identify unused files.
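A minimal sketch of the hash-as-file-name idea follows. The choice of SHA-256, the `ImageStore` class, and the flat directory layout are assumptions made for the example, not something the approach requires.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ImageStore
{
    // Saves the bytes under a name derived from their SHA-256 hash and
    // returns that name, which is what the XML document would reference.
    public static string Save(string storeDir, byte[] content, string extension)
    {
        string hash;
        using (var sha = SHA256.Create())
        {
            hash = BitConverter.ToString(sha.ComputeHash(content))
                               .Replace("-", "")
                               .ToLowerInvariant();
        }

        string fileName = hash + extension;
        string path = Path.Combine(storeDir, fileName);

        Directory.CreateDirectory(storeDir);
        if (!File.Exists(path))   // identical content is only stored once
            File.WriteAllBytes(path, content);

        return fileName;
    }
}
```

Because the name depends only on the content, uploading the same screenshot twice yields the same file, and the XML only needs to carry the returned name.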

Related

Does Google Accept Only Sitemaps With a .txt Extension?

I have finished working on my ASP.NET 4.0 website. Now that I am about to publish it in the next few days, I am looking for resources that can help my site rank better on popular search engines. My site displays both static and dynamic content. For the dynamic content I will be generating a dynamic sitemap each week. My problem is that I read on the Google Webmaster site that Google accepts sitemaps only with a .txt extension (https://support.google.com/webmasters/answer/183668?hl=en). The original instructions read:
* For best results, use the following guidelines for creating text file sitemaps:
You must fully specify all URLs in your sitemap as Google attempts to crawl them exactly as you list them.
Your text file must use UTF-8 encoding.
Your text file should contain nothing but the list of URLs.
You can name the text file anything you wish, provided it has a .txt extension (for instance, sitemap.txt).
As I mentioned, I will be using C# code to dynamically generate the XML sitemap for my site, but I am not sure whether I will be able to write XML (using C#) into .txt files. I have very little knowledge of writing XML with C# (for example with the XmlWriter class). I have found this website, which uses a sitemap file with an .xml extension (http://www.mikesdotnetting.com/sitemap). Can anybody tell me what I need to do to complete this final step of my project? Another thing I would like to know: should I submit my sitemap to Google every time a link is modified? Google says a submitted sitemap must contain no more than 50,000 URLs and be no larger than 50 MB.

Submitting every dynamic URL isn't necessarily going to improve your ranking. A lot of your ranking will depend on your PageRank, not the number of URLs you have. You need good content, and people who think your content is good and link to your site.
You read the document wrong. The very first line says:
In addition to the standard XML format, Google also accepts the following file types as sitemaps:
The .txt file extension is only required when you have a text file that lists only the URLs. Notice how even their submission example has a .xml extension, since it's using the XML sitemap format.
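Since the question asks about XmlWriter, here is a minimal sketch of writing the standard XML sitemap format with it. The `SitemapBuilder` class, the single shared last-modified date, and the fixed "weekly" change frequency are assumptions made for the example; a real site would supply per-URL values.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;
using System.Xml;

public static class SitemapBuilder
{
    // Namespace defined by the sitemaps.org protocol.
    private const string Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Returns the sitemap as a UTF-8 XML string.
    public static string Build(IEnumerable<string> urls, DateTime lastModified)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            Encoding = new UTF8Encoding(false)   // UTF-8 without a byte order mark
        };

        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartDocument();
                writer.WriteStartElement("urlset", Ns);
                foreach (string url in urls)
                {
                    writer.WriteStartElement("url", Ns);
                    writer.WriteElementString("loc", Ns, url);   // XmlWriter escapes & and <
                    writer.WriteElementString("lastmod", Ns,
                        lastModified.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));
                    writer.WriteElementString("changefreq", Ns, "weekly");
                    writer.WriteEndElement();
                }
                writer.WriteEndElement();
                writer.WriteEndDocument();
            }
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

The result can be saved as sitemap.xml or written straight to the HTTP response; no .txt file is involved.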
Instead of generating a new sitemap file weekly, it would be much simpler to write a handler that generates it upon request and caches the data for a period of time, so you're not regenerating it on every request for the sitemap file.
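The caching half of that suggestion could look roughly like the sketch below. `CachedValue<T>` is a hypothetical helper written for this example, and the injectable clock exists only to make it easy to test; the handler itself is left out.

```csharp
using System;

public sealed class CachedValue<T>
{
    private readonly Func<T> _factory;
    private readonly TimeSpan _lifetime;
    private readonly Func<DateTime> _clock;
    private readonly object _lock = new object();
    private T _value;
    private DateTime _expiresAt = DateTime.MinValue;

    public CachedValue(Func<T> factory, TimeSpan lifetime, Func<DateTime> clock = null)
    {
        _factory = factory;
        _lifetime = lifetime;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    // Returns the cached value, regenerating it only after it has expired.
    public T Get()
    {
        lock (_lock)
        {
            DateTime now = _clock();
            if (now >= _expiresAt)
            {
                _value = _factory();
                _expiresAt = now + _lifetime;
            }
            return _value;
        }
    }
}
```

In an ASP.NET application, a handler (for instance an IHttpHandler implementation) could hold one static `CachedValue<string>` whose factory builds the sitemap, and write `Get()` to the response with an XML content type.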
If Google knows where the sitemap is, it will check it periodically anyway, so re-submitting it might not get you anything. You also don't want to submit too often, as Google and other search engines may think you are trying to spam them. That's why the XML sitemap definition has elements for change frequency, so they know how often to re-spider the page.
FYI, don't expect to see good ranking right away. It takes a while, months even, depending on the popularity of your site and the quality of the content. The volume of links won't help, and Google will find them anyway when it spiders. It is possible to do too much.

How to save chat transcripts in a SQL Server database and retrieve them on another page with formatting?

I am working on a chat application in which I use a multiline textbox to show messages. First of all, I want to use different colors in my chat: our user's messages should be in one color and the other user's messages in another.
Second, I want to save transcripts of the chat conversation with formatting so that the user can see them at any time.
I don't know how to save a transcript to the database with formatting, so I am stuck here.
I am using C# and a SQL Server database.
How can I do that?

If you break down your problem, then it would make it much easier to solve it.
Instead of taking the text from the textbox, have you thought about storing it as XML?
If you use XML, you can use XSLT to generate nice-looking HTML (but it depends on how you want to use it later).
You have different options for how you store it in the database:
plain text
XML
blob
Please do your research and then ask specific questions if you get stuck.
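To make the XML option a little more concrete, here is a minimal sketch that keeps the transcript as structured data and applies the colors only when rendering. It uses LINQ to XML rather than XSLT to stay short. The element and attribute names, the `ChatMessage` and `Transcript` classes, and the "me"/"other" CSS class names are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Xml.Linq;

public sealed class ChatMessage
{
    public string Sender { get; set; }
    public DateTime SentAt { get; set; }
    public string Text { get; set; }
}

public static class Transcript
{
    // Serializes the conversation; the resulting string is what gets stored.
    public static string ToXml(IEnumerable<ChatMessage> messages)
    {
        var root = new XElement("transcript",
            messages.Select(m => new XElement("message",
                new XAttribute("sender", m.Sender),
                new XAttribute("sentAt", m.SentAt.ToString("o")),
                m.Text)));
        return root.ToString(SaveOptions.DisableFormatting);
    }

    // Renders a stored transcript; the CSS class decides the color of each line.
    public static string ToHtml(string xml, string currentUser)
    {
        var lines = XElement.Parse(xml).Elements("message").Select(m =>
        {
            string sender = (string)m.Attribute("sender");
            string css = sender == currentUser ? "me" : "other";
            return "<p class=\"" + css + "\"><b>" + WebUtility.HtmlEncode(sender) +
                   ":</b> " + WebUtility.HtmlEncode(m.Value) + "</p>";
        });
        return string.Join("\n", lines);
    }
}
```

The string returned by `ToXml` can be stored in an `xml` or `nvarchar(max)` column, and the page that shows old conversations only needs two CSS rules (one for `.me`, one for `.other`) to color them.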

How to create a multi-page invoice in ASP.NET C#?

I am thoroughly confused with something I want to do and am looking for some advice.
One of my clients has to produce a monthly invoice detailing all of the company's expenditure, plus two other such invoices. The client is sure that he only needs these invoices, and they are simple enough to produce as far as the logic is concerned.
Now, to make the actual invoice, I don't really want to use reporting solutions like Telerik, SSRS, etc., as I think they are overkill for my purpose. At the same time, I am not sure how to get the printer to print the invoices on neat pages without cutting anything off.
I am very tempted to just give the output in a webpage and ask my client to print them off from there.
Am I not looking at this the right way? Is this possible?
I could use iTextSharp or something to produce PDFs. In fact, I think I will go ahead with this if it isn't possible to just output an HTML page and get the printer to recognize the page breaks somehow.
Because this is a very small job, I don't want to spend too much time on it, as the cost of this freelance project is minimal too.
The reason printing to a new page is important is that my client deals with a few shops, and he wants to print each customer their own invoice. I could get him to produce each customer's invoice separately and print them, but that is not an ideal way to deal with it.
thanks

There is a CSS property that tells a browser to break a page: page-break-before.
But if you have a wide range of browsers to support, it would be better to use an HTML-to-PDF conversion library or iTextSharp itself (as far as I know, there is even a module/class that converts HTML to PDF with iTextSharp), as printing web pages has many issues.
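For the plain-HTML route, a minimal sketch of generating one printable page with a page break before each customer's invoice might look like this. The `InvoicePage` class and the name/total pairs are placeholders invented for the example; how well the break is honored still depends on the browser.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Net;
using System.Text;

public static class InvoicePage
{
    // Renders all invoices into one HTML document. Every invoice after the
    // first starts on a new printed page because of the CSS rule below.
    public static string Render(IEnumerable<KeyValuePair<string, decimal>> invoices)
    {
        var html = new StringBuilder();
        html.Append("<html><head><style>");
        html.Append(".invoice + .invoice { page-break-before: always; }");
        html.Append("</style></head><body>");

        foreach (KeyValuePair<string, decimal> invoice in invoices)
        {
            html.Append("<div class=\"invoice\"><h1>")
                .Append(WebUtility.HtmlEncode(invoice.Key))
                .Append("</h1><p>Total: ")
                .Append(invoice.Value.ToString("F2", CultureInfo.InvariantCulture))
                .Append("</p></div>");
        }

        html.Append("</body></html>");
        return html.ToString();
    }
}
```

The `.invoice + .invoice` selector applies the break only to invoices that follow another invoice, so the first page is not preceded by a blank one.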

In the past, when I wanted to create a reusable document, I used Word or Excel XML formats.
See: http://en.wikipedia.org/wiki/Microsoft_Office_XML_formats
They are easy to create and tweak. Save the document in Office XML format, open it in WordPad to see where to make your changes, and then all you have to do is recreate the dynamic parts in your code.

SSRS has a drag-and-drop interface for designing reports and has a PDF output option. If the data is in a SQL Server database, then even with the learning curve it should be easier to do SSRS reports.

Word document: how to edit it the right way in an ASP.NET application

I have an ASP.NET app in which I need to edit a Word document and then send that document by email as an attachment.
I would like to know the best way to edit the Word document and then use it.
The document already has data, and there are a few variables such as "company name", "date", "amount", etc. that I search for in the document and replace with values from the code.
The code works great when I run it locally, but some people have told me that editing a Word document on the server shouldn't be done the way I am doing it now, and that I should use either OpenXML or Google Docs to edit the document.
Any idea what's the best way to tackle this?

I would vote for OpenXML, but be prepared to spend a good day or two reading how to use the API for .NET and be patient. =)
I remember using this tool -
http://openxmldeveloper.org/resources/dotnet/m/cc/303.aspx - quite a bit to find the relevant parts in the document to modify. You basically load a Word document and can "drilldown" to find the parts you want to modify. You can actually write some pretty clean code to search the document for your textual markers and then replace them with data.
(I hope I understood the question correctly. You said you already had working code, so I wasn't sure what the question was.)

You can use the Open XML Format SDK as per http://msdn.microsoft.com/en-us/library/dd440953%28v=office.12%29.aspx
For what you're doing though, I think your approach is fine.
I have a fair number of screwdrivers, but if I noticed a loose screw in the stool in front of me, I might just use the knife on the table because it will do the job perfectly adequately and save me a trip to the toolbox. It's not a tool designed for that job, but it is a tool that would do the job just as well and with less effort.
Now, if I decided to set about a day's worth of DIY with only the knife instead of the set of screw-drivers, that would be going to the other extreme. Here I'd have long-ago crossed the line where using the tools designed for a given job would have made my life much easier.
It's just the same with software tools.
One of the very points of XML formats is that we can do simple tasks with them by treating them just as text. Yeah, none of us wants to be the guy with a 3-page-long regular expression trying to parse a complicated XML document, but when the problem naturally breaks down to a simple text substitution, do a simple text substitution.
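For a .docx file, that simple substitution could be sketched as below. It assumes the main document part lives at word/document.xml (which is where Word normally puts it) and that each marker sits inside a single text run; Word sometimes splits typed text across runs, in which case a plain replace will not find it. The `DocxTemplate` class is made up for the example, and on .NET Framework `ZipFile` needs a reference to System.IO.Compression.FileSystem.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Security;
using System.Text;

public static class DocxTemplate
{
    // Plain text substitution on the document XML; values are XML-escaped
    // so that characters like & or < cannot break the document.
    public static string ReplaceMarkers(string documentXml, IDictionary<string, string> values)
    {
        var result = new StringBuilder(documentXml);
        foreach (KeyValuePair<string, string> pair in values)
            result.Replace(pair.Key, SecurityElement.Escape(pair.Value));
        return result.ToString();
    }

    // Copies the template and rewrites word/document.xml inside the copy.
    public static void Fill(string templatePath, string outputPath, IDictionary<string, string> values)
    {
        const string partName = "word/document.xml";   // assumed location of the main part
        File.Copy(templatePath, outputPath, true);

        using (ZipArchive zip = ZipFile.Open(outputPath, ZipArchiveMode.Update))
        {
            ZipArchiveEntry entry = zip.GetEntry(partName);
            if (entry == null)
                throw new InvalidDataException("No " + partName + " found in the file.");

            string xml;
            using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
                xml = reader.ReadToEnd();

            // Replace the entry with the updated XML.
            entry.Delete();
            ZipArchiveEntry updated = zip.CreateEntry(partName);
            using (var writer = new StreamWriter(updated.Open(), new UTF8Encoding(false)))
                writer.Write(ReplaceMarkers(xml, values));
        }
    }
}
```

This needs neither Word nor the Open XML SDK on the server; if markers do get split across runs, that is the point where switching to the SDK starts to pay off.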

Batch conversion of docx to clean HTML

I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.
I think it'd help to explain what that entails. I work for the database group at my university's IT department. My main job is to take the specs of a report in a docx file, copy them over to Dreamweaver, fix some formatting, and put the result onto their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header and footer from the webpage on there, and save the result. I originally planned to have it process files one by one, but it probably wouldn't be difficult to have it take a list of files and batch convert them.
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started using. XSLT had the benefit of not needing Word to be installed or run for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs the document.xml from that, and uses the DocX2Html.xsl file I scavenged from OpenXML viewer. I believe that was originally provided by MS for SharePoint servers to provide the ability to render Word documents in a browser. Or something along those lines.
After adjusting that code to fit my needs, and having issues with the objXSLT.Load () method, I ended up using ILMerge to make the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the resulting file has ridiculously ugly HTML syntax. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue with that is that I don't know anything about how to do that. Perhaps there's an alternative version already out there. All I'd need is one that will preserve tables and text formatting. Images aren't needed.

This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml

Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:
Open the Word document with Aspose.Words.
Save the Word document as HTML.
Use something like SgmlReader or HTML Agility Pack (or even Regular Expressions if it is suitable) to remove unwanted HTML tags/attributes.
Since you wrote that you work at a university, I'm not sure whether commercial packages are an option, though.
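For the last step, the regular-expression variant could be sketched roughly as follows. The list of attributes and wrapper tags to drop is an assumption about what typical Word-generated HTML contains, and the `HtmlCleaner` class is made up for the example. Regular expressions do not understand HTML structure, so a real parser such as HTML Agility Pack is the safer choice for anything beyond this kind of simple stripping.

```csharp
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Matches style, class and lang attributes with quoted values.
    private static readonly Regex Attributes = new Regex(
        @"\s+(?:style|class|lang)\s*=\s*(?:""[^""]*""|'[^']*')",
        RegexOptions.IgnoreCase);

    // Matches opening and closing span/font tags so they can be unwrapped.
    private static readonly Regex WrapperTags = new Regex(
        @"</?(?:span|font)\b[^>]*>",
        RegexOptions.IgnoreCase);

    // Removes the presentational attributes, then drops the wrapper tags
    // while keeping the text they contained.
    public static string Clean(string html)
    {
        string withoutAttributes = Attributes.Replace(html, "");
        return WrapperTags.Replace(withoutAttributes, "");
    }
}
```

Tables and basic formatting tags such as `<b>` and `<table>` pass through untouched. One known limit: the attribute pattern would also match text like ` class="x"` if it appeared literally in the page content.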

Hi, I'm not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one-click conversion, e.g. you can right-click on a Word file and it will be converted directly to HTML, with the code placed on the clipboard. The current version also supports command-line access, and the new version will have a server version too.
There is a free trial version downloadable from the site, and if you have any questions, do contact me any time.
