I've been tasked with creating (or finding something that already works) a centralized server with an API that returns a PDF file when given some data and the name of a template. It has to be a robust, enterprise-ready solution. The goal is as follows:
A series of templates for different company things. (Invoices, Orders, Order Plannings, etc)
A way of returning a PDF from external software (Websites, ERP, etc)
Can be an existing enterprise solution, but they are pushing for a custom one.
Can be any language, but we don't have any dedicated Java programmers in-house. We are PHP / .NET; some of us dabble, but the learning curve could be a little steep.
So, I've been reading. One option we've considered is installing a JasperReports Server, creating the templates in Jaspersoft Studio, and then using its API to return the PDF files. A colleague favors this option because it's mostly done already, but first, it's Java, and second, I think it's like using a hammer to crack a nut.
The other option we've been toying with is to use C# with iTextSharp to build a server and create our own API that returns exactly the PDF with the data we need. Doing this would have some benefits, like reusing the database connector we have already made and extracting most of the data from the database instead of having to pass around a big chunk of data. But iTextSharp is bare: it doesn't really have a templating system. We'd have to create something with XMLWorker or with C# classes, and it's not as easy as drag and drop. For this case I've also been reading about XFA, but the documentation on the iText site is misleading and unclear.
I've also been reading about some other alternatives, like PrinceXML, PDFBox, FOP, etc., but the concept is the same as with iText: we'd have to build it ourselves.
My vote, even if it's more work, is to go the iText route and use HTML / CSS for the templates, but my colleagues claim that the templates should be easy to change every other week (I doubt it), and that HTML / CSS would be too much work.
So the real question is: how do other businesses approach this? Did I leave anything out in my search? Is there an easier way to achieve this?
PS: I didn't know if SO would be the correct place for this question, but I'm mostly lost and risking a "too broad question" or "off topic" tag doesn't seem that bad.
EDIT:
Input should be sent with the same request. If we decide on the C# route, we can get ~70% of the data from the ERP directly, but either way it should accept a POST request with some data (the template name, and the data needed for that template, like invoice data, or just the invoice ID if we have access to the ERP). A rough sketch of the kind of endpoint we mean is included after the requirements below.
Output should be a PDF (not interested in other formats, just PDF).
Templates will be updated only by IT. (Mostly us, the development team).
Performance-wise, I don't know how much muscle we'll need, but right now, without any growth, we are looking at ~500-1000 PDFs daily, mostly printed from 10:00 to 10:30 and from 12:00 to 13:00. Then maybe 100 more for the rest of the day.
Top performance should not exceed ~10,000 daily when the planets align and it's sales season (twice a year). That should be our ceiling for the years to come.
The templates have some requirements:
Have repeating blocks (invoice lines, for example).
Have images as background, as watermark and as blocks.
Have to be multi language (translatable, with the same data).
Have some blocks that are only shown on a condition.
Blocks dependent on the page (PDF header / page header / page footer / PDF footer)
Templates may have to do calculations over some of the data. I don't think we'll ever need this, but it's something the company may ask for in the future.
The PDFs don't need to be stored, as we have a document management system, maybe in the future we could link them.
Extra data: Right now we are using "Fast-Reports v2 VCL"
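To make this concrete, here is a minimal sketch (not a finished design) of the kind of endpoint we have in mind, as an ASP.NET Web API controller. The PdfRenderRequest shape and the ITemplateRenderer abstraction are invented for illustration only and aren't tied to any particular PDF engine:

// Hypothetical sketch: POST a template name and a data payload, get PDF bytes back.
// ITemplateRenderer is an assumed abstraction that would wrap whichever engine is chosen.
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Web.Http;

public class PdfRenderRequest
{
    public string Template { get; set; }                  // e.g. "invoice"
    public Dictionary<string, object> Data { get; set; }  // template-specific data
}

public class PdfController : ApiController
{
    private readonly ITemplateRenderer _renderer;          // assumed abstraction over the engine

    public PdfController(ITemplateRenderer renderer) { _renderer = renderer; }

    [HttpPost]
    public HttpResponseMessage Render(PdfRenderRequest request)
    {
        byte[] pdf = _renderer.Render(request.Template, request.Data);

        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new ByteArrayContent(pdf)
        };
        response.Content.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");
        return response;
    }
}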
Your question shows you've been considering the problem in detail before asking for help so I'm sure SO will be friendly.
Certainly one thing you haven't detailed much in your description is the broader functional requirements. You mentioned cracking a nut with a hammer, but I think you are focused mostly on the technology/interfacing. If you consider your broader requirements for the documents you need to create, and the variables involved, it might be a bigger nut than you think.
The approach I would suggest is to prototype solutions, assuming you have some room to do so. From your research, pick maybe the best three to try, which may well include the custom build you have in mind. Put them through some real use cases end to end - as rough as you like, but realistic. One or two key documents you need to output should be used across all solutions. Make sure you are covering the most important or most common requirements in terms of:
Input Format(s) - who can/should be updating templates. What is the ideal requirement and what is the minimum requirement?
Output Requirement(s) - who are you delivering to and what formats are essential/desirable
Data Requirement(s) - what are your sources of data and how hard/easy is it to get data from your sources to the reporting system in the format needed?
Template feature(s) - if you are using templates, what features do the templates need? This includes input format(s), but I was mostly thinking of features of the engine like repeating/conditional content, image insertion, table manipulation etc. - i.e. are your invoices, orders and planning documents plain or complex?
API requirements - do you have any broader API requirements. You mentioned you use PHP so a PHP library or Web/Web Service is likely to be a good starting point.
Performance - you haven't mentioned any performance characteristics but certainly if you are working at scale (enterprise) it would be worth even rough-measuring the throughput.
iText and Jasper are certainly enterprise grade engines you can rely on. You may wish to look at Docmosis (please note I work for the company) and probably do some searches for PDF libraries that use templates.
A web service interface is possibly a key feature you might want to look at. A REST API is easy to call from PHP and virtually any technology stack. It means you will likely have options about how you can architect a solution, and it's typically easy to prototype against. If you decide to go down the prototyping path and try Docmosis, start with the cloud service since you can prototype/integrate very quickly.
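As an illustration of how little client code a REST interface needs, here is a hedged sketch of posting template data to a hypothetical document-generation endpoint from C# (the URL and JSON shape are made up for the example; the PHP equivalent is a similar handful of cURL lines):

// Hypothetical example: send template name + data as JSON, receive PDF bytes back.
// The endpoint URL and payload shape are assumptions for illustration only.
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class PdfClient
{
    static async Task<byte[]> RenderInvoiceAsync()
    {
        using (var http = new HttpClient())
        {
            var payload = "{\"template\":\"invoice\",\"data\":{\"invoiceId\":12345}}";
            var content = new StringContent(payload, Encoding.UTF8, "application/json");

            HttpResponseMessage response =
                await http.PostAsync("https://pdf.example.local/api/pdf", content);
            response.EnsureSuccessStatusCode();

            return await response.Content.ReadAsByteArrayAsync();  // the rendered PDF
        }
    }
}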
I hope that helps.
From my years of experience working with PDF, I think you should pay attention to the following points:
The performance: you will likely get the fastest generation with API-based PDF creation, compared to HTML- or XML-to-PDF conversion (because of the additional conversion layer involved). Considering peaks in the load, you may want to calculate the cost of scaling generation up by adding more servers (and estimate the cost of additional servers or resources required per additional PDF file per day).
Ease of iteration and changes: how often will you need to adjust templates? If you are going to create templates just once (with some iterations) and no further changes are required, then you should be OK just coding them against the API. Otherwise you should strongly consider using HTML or XML for templates to simplify changes and decrease the complexity of maintaining them.
Search and indexing: if you need to search the created documents, consider storing indexes of the generated documents, or perhaps storing the source data in XML alongside the generated PDF file.
Long-term preservation: you should conform to the PDF/A sub-format if you are looking for long-term digital preservation of your documents. See the VeraPDF open-source initiative, which you can use to validate generated and incoming PDF documents for conformance to the PDF/A requirements.
Preserving source files: the PDF format itself was not designed to be edited (though some PDF editors exist), so consider preserving the source data so you can regenerate PDF documents later and perhaps introduce additional output formats.
Related
A client wants to "Web-enable" a spreadsheet calculation -- the user to specify the values of certain cells, then show them the resulting values in other cells.
(They do NOT want to show the user a "spreadsheet-like" interface. This is not a UI question.)
They have a huge spreadsheet with lots of calculations over many, many sheets. But, in the end, only two things matter -- (1) you put numbers in a couple cells on one sheet, and (2) you get corresponding numbers off a couple cells in another sheet. The rest of it is a black box.
I want to present a UI to the user to enter the numbers they want, then I'd like to programmatically open the Excel file, set the numbers, tell it to re-calc, and read the result out.
Is this possible/advisable? Is there a commercial component that makes this easier? Are there pitfalls I'm not considering?
(I know I can use Office Automation to do this, but I know it's not recommended to do that server-side, since it tries to run in the context of a user, etc.)
A lot of people are saying I need to recreate the formulas in code. However, this would be staggeringly complex.
It is possible, but not advisable (and officially unsupported).
You can interact with Excel through COM or the .NET Primary Interop Assemblies, but this is meant to be a client-side process.
On the server side, no display or desktop is available, and any unexpected dialog box (for example) will make your web app hang - your app will behave flakily.
Also, attaching an Excel process to each request isn't exactly a low-resource approach.
Working out the black box and re-implementing it in a proper programming language is clearly the better (as in "more reliable and faster") option.
Related reading: KB257757: Considerations for server-side Automation of Office
You definitely don't want to be using interop on the server side, it's bad enough using it as a kludge on the client side.
I can see two options:
Figure out the spreadsheet logic. This may benefit you in the long term by making the business logic a known quantity, and in the short term you may find that there are actually bugs in the spreadsheet (I have encountered tons of monster spreadsheets used for years that turn out to have simple bugs in them - everyone just assumed the answers must be right)
Evaluate SpreadSheetGear.NET, which is basically a replacement for interop that does it all without Excel (it replicates a huge chunk of Excel's non-visual logic and IO in .NET)
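To show roughly what the SpreadsheetGear route looks like, here is a sketch written from memory of its API (so treat the names as approximate and check the vendor docs); the whole "set inputs, recalc, read outputs" cycle is only a few lines and needs no Excel installation on the server:

// Rough sketch of the SpreadsheetGear approach (API names from memory; verify against the docs).
// No Excel process is involved, so this is safe to run inside a web request.
using SpreadsheetGear;

class SpreadsheetCalculator
{
    public static object Calculate(string path, double input)
    {
        IWorkbook workbook = Factory.GetWorkbook(path);     // load the .xlsx "black box"
        IWorksheet inputs = workbook.Worksheets["Inputs"];  // assumed sheet names
        IWorksheet outputs = workbook.Worksheets["Outputs"];

        inputs.Cells["B2"].Value = input;                   // set the input cell
        workbook.WorkbookSet.Calculate();                   // force a recalculation

        return outputs.Cells["C5"].Value;                   // read the result cell
    }
}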
Although this is certainly possible using ASP.NET, it's very inadvisable. It's un-scalable and prone to concurrency errors.
Your best bet is to analyze the spreadsheet calculations and duplicate them. Now, granted, your business is not going to like the time it takes to do this, but it will (presumably) give them a more usable system.
Alternatively, you can simply serve up the spreadsheet to users from your website, in which case you do almost nothing.
Edit: If your stakeholders really insist on using Excel server-side, I suggest you take a good hard look at Excel Services, as John Saunders suggests. It may not get you everything you want, but it'll get you quite a bit, and should solve some of the issues you'll end up with trying to do it server-side with ASP.NET.
That's not to say that it's a panacea; your mileage will certainly vary. And SharePoint isn't exactly cheap to buy or maintain. In fact, short-term costs could easily be dwarfed by long-term costs if you go the SharePoint route - but it might be the best option to fit the requirement.
I still suggest you push back in favor of coding all of your logic in a separate .NET module. That way you can use it both server-side and client-side. Excel can easily pass calculations to a COM object, and you can very easily publish your .NET library as COM objects. In the end, you'd have a much more maintainable and usable architecture.
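As a sketch of the "publish your .NET library as COM" idea (the class and method names are invented for illustration), the .NET side only needs the COM-visibility attributes; you would then register the assembly with regasm and call it from Excel/VBA:

// Hypothetical example of exposing business logic to Excel via COM.
// After building, register with: regasm /codebase PricingLogic.dll
using System;
using System.Runtime.InteropServices;

namespace PricingLogic
{
    [ComVisible(true)]
    [Guid("6B29FC40-CA47-1067-B31D-00DD010662DA")]   // any fixed GUID for the class
    [ClassInterface(ClassInterfaceType.AutoDual)]
    public class QuoteCalculator
    {
        // The same calculation can be called from ASP.NET server-side
        // and from Excel/VBA client-side.
        public double CalculateQuote(double amount, double riskFactor)
        {
            return amount * (1 + riskFactor);
        }
    }
}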
Setting aside the discussion of whether it makes sense to manipulate an Excel sheet on the server side, one way to do this would probably be to use
Microsoft.Office.Interop.Excel.dll
Using this library, you can tell Excel to open a Spreadsheet, change and read the contents from .NET. I have used the library in a WinForm application, and I guess that it can also be used from ASP.NET.
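For completeness, here is a minimal sketch of that interop round trip (the sheet names and cell addresses are invented for the example); note this still carries all the server-side caveats mentioned above:

// Minimal client-side interop sketch: open the workbook, set an input cell,
// recalculate, read an output cell, and clean up the Excel process.
using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

class ExcelRoundTrip
{
    public static object Run(string path, double input)
    {
        var app = new Excel.Application { Visible = false, DisplayAlerts = false };
        Excel.Workbook workbook = app.Workbooks.Open(path);
        try
        {
            Excel.Worksheet inputs = (Excel.Worksheet)workbook.Worksheets["Inputs"];   // assumed name
            inputs.Range["B2"].Value2 = input;    // set the input cell

            app.CalculateFull();                  // force recalculation of all formulas

            Excel.Worksheet outputs = (Excel.Worksheet)workbook.Worksheets["Outputs"];
            return outputs.Range["C5"].Value2;    // read the result
        }
        finally
        {
            workbook.Close(SaveChanges: false);
            app.Quit();
            Marshal.ReleaseComObject(workbook);   // release COM references so Excel can exit
            Marshal.ReleaseComObject(app);
        }
    }
}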
Still, consider the concurrency problems already mentioned... However, if the sheet is accessed infrequently, why not.
The simplest way to do this might be to:
Upload the Excel workbook to Google Docs -- this is very clean, in my experience
Use the Google Spreadsheets Data API to update the data and return the numbers.
Here's a link to get you started on this, if you want to go that direction:
http://code.google.com/apis/spreadsheets/overview.html
Let me be more adamant than others have been: do not use Excel server-side. It is intended to be used as a desktop application, meaning it is not intended to be used from random different threads, possibly multiple threads at a time. You're better off writing your own spreadsheet engine than trying to use Excel (or any other Office desktop product) from a server.
This is one of the reasons that Excel Services exists. A quick search on MSDN turned up this link: http://blogs.msdn.com/excel/archive/category/11361.aspx. That's a category list, so contains a list of blog posts on the subject. See also Microsoft.Office.Excel.Server.WebServices Namespace.
It sounds like you're saying that the user has the spreadsheet open on their local system, and you want a web site to manipulate that local spreadsheet?
If that's the case, you can't really do that. Even Office automation won't help, unless you want to require them to upload the sheet to the server and download a new altered version.
What you can do is create a web service to do the calculations and add some VBA or VSTO code to the Excel sheet to talk to that service.
I'm designing a survey tool. The survey will be very static and because of that, I can avoid building some kind of table-driven survey designer to accommodate the 167 questions on the survey (all 1-5 rating questions in a radio box or checkbox layout).
I was thinking of building the survey questions in a large XML file, but my non-technical co-worker who will be making frequent edits to the survey will likely do things that break the integrity/validity of the raw XML file (think punctuation and special characters).
The XML file might look something like:
<questions>
  <question>
    <type>checkbox</type>
    <text>Which beers do you like most</text>
    <choices>Bud,Miller,Piels</choices>
    <Required>true</Required>
  </question>
  <question>
    <type>radio</type>
    <text>Which beer is your favorite</text>
    <choices>Bud,Miller,Piels</choices>
    <Required>true</Required>
  </question>
</questions>
Please use your imagination that this structure will be a bit more complex and that there will be 165 more questions.
Complicating matters, I need these questions in some form of object-oriented layout so that I can take the results and align them to other stuff. I had considered hard-coding a very lengthy survey form with 167 questions, but I need the data in blocks so that I can parse out question 37 and align it to something else in some other feature that is related to question 37.
Here's what I'd like to do in a .Net app:
Define an enumerable class for this.
Do something where I can manually fill an enumerable collection of this class with all of the data I need. Using the p-code that would be familiar in my .asp world . . .
var questions = new List<Question>();

questions.Add(new Question {
    Type = "checkbox",
    Text = "which beers do you enjoy",
    Choices = "Bud,Miller,Peils",
    Required = true
});

questions.Add(new Question {
    Type = "radio",
    Text = "what is your favorite beer",
    Choices = "Bud,Miller,Peils",
    Required = true
});
My hope is that this .cs file (though foreign looking to the lay person) would be much easier for my co-worker to maintain, without me having to worry about syntax errors.
So, I guess what I'm looking for some feedback on:
Is this just a dumb idea? Should I do this in XML, and just consume the XML file and be done with it?
WWYD - What would you do? Is there an easier way to do this?
I don't care about performance as a relatively small number of users are using this.
I don't care about maintainability, because we will write this feature properly in the summer.
I just need to create a data structure that is not in a DB and that can be maintained by a non-technical person with a text-editor (for now).
If anyone made it this far, I appreciate it.
Everyone uses Excel... so consider using a CSV format, which can be read by you as well as by Excel, which your counterpart will be using. You do have to tell the user that the columns can't be changed, which is not a drawback per se; the user exports their changes to CSV, which the program reads and can verify.
Plus, the user does not have to be trained, since everyone already uses Excel, so it is a win/win situation given your requirement not to hand-edit XML.
As permanent store XML is good.
But that does not mean the user needs to edit the XML directly.
I would build the ability to edit, add, and delete the questions in the app.
Yes, it's a bit of trouble, but if they break the XML by hand-editing it, that is also a lot of trouble.
How do you plan to save survey results?
How do you plan to collect the survey results?
There is more to this project than you are realizing.
Do you need to combine results from more than one device?
If more than one device then you need to separate the questions from the results so you can update the questions on more than one device.
There are tools to read and write XML to disk.
Reading XML with the XmlReader
I don't agree with doug that you need to embed a database.
For a small number of questions I would use XML.
I would read all the XML into an object collection (A List).
You don't need a class that implements IEnumerable.
You put your objects into a collection that implements IEnumerable.
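As a sketch of that "read the XML into a List" idea, using the question structure from the post (the property names are guesses that mirror the XML elements), XmlSerializer does most of the work:

// Hedged sketch: deserialize the survey XML from the question into a List<Question>.
// Element names must match the XML ("Required" is capitalized there, hence the alias).
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class Question
{
    [XmlElement("type")]     public string Type { get; set; }
    [XmlElement("text")]     public string Text { get; set; }
    [XmlElement("choices")]  public string Choices { get; set; }
    [XmlElement("Required")] public bool Required { get; set; }
}

[XmlRoot("questions")]
public class Survey
{
    [XmlElement("question")] public List<Question> Questions { get; set; }
}

class SurveyLoader
{
    public static List<Question> Load(string path)
    {
        var serializer = new XmlSerializer(typeof(Survey));
        using (var stream = File.OpenRead(path))
        {
            return ((Survey)serializer.Deserialize(stream)).Questions;
        }
    }
}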
I would go WPF over WinForms.
A ListBox with a DataTemplate.
In the DataTemplate you can have a dynamic selector in code-behind, but that is a real hassle.
Consider a single template that you manipulate in code-behind.
So they are not real RadioButtons; you uncheck the others in code-behind.
For filtering I would go LINQ in public properties but there is also CollectionViewSource.
I used XML for an app that collected field measurements.
It was a lot like this, in that the measuring devices could change and we still needed to collect the measurements.
If you are set on user editing the questions directly then XML with XSD is the best I can think of.
If you are looking for a simple human-readable structured format, then you might be interested in YAML.
YAML is a human-readable data serialization format that takes concepts from programming languages such as C, Perl, and Python, and ideas from XML and the data format of electronic mail.
Your question file would look like this:
questions:
  - id: 1
    type: checkbox
    text: Which beers do you like most
    choices: Bud,Miller,Piels
    Required: true
  - id: 2
    type: radio
    text: Which beer is your favorite
    choices: Bud,Miller,Piels
    Required: true
Several YAML libraries exist in .NET (from the article):
https://github.com/aaubry/YamlDotNet
http://yaml.codeplex.com/
http://www.codeproject.com/Articles/28720/YAML-Parser-in-C
http://yaml-net-parser.sourceforge.net/
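For completeness, reading that file with YamlDotNet would look roughly like this (a sketch only; the deserializer API differs a little between YamlDotNet versions, and the class simply mirrors the fields above):

// Hedged sketch using YamlDotNet (check the library docs for the exact API of your version).
using System.Collections.Generic;
using System.IO;
using YamlDotNet.Serialization;

public class SurveyQuestion
{
    [YamlMember(Alias = "id")]       public int Id { get; set; }
    [YamlMember(Alias = "type")]     public string Type { get; set; }
    [YamlMember(Alias = "text")]     public string Text { get; set; }
    [YamlMember(Alias = "choices")]  public string Choices { get; set; }
    [YamlMember(Alias = "Required")] public bool Required { get; set; }
}

public class SurveyFile
{
    [YamlMember(Alias = "questions")] public List<SurveyQuestion> Questions { get; set; }
}

class YamlSurveyLoader
{
    public static List<SurveyQuestion> Load(string path)
    {
        var deserializer = new DeserializerBuilder().Build();
        using (var reader = new StreamReader(path))
        {
            return deserializer.Deserialize<SurveyFile>(reader).Questions;
        }
    }
}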
There are plenty of XML editing tools out there that will actually make editing easier than working on the text file directly. I use XML Marker and it's pretty easy to use. http://symbolclick.com/
It will be quicker to train them to edit using the tool than it will be to build one.
Two answers here;
a: Write it to allow a proper admin interface, using a database to allow admin users to add/edit questions, response options and include appropriate security, auditing etc. You mention that this may not be feasible in the short term or that a 'proper' feature will be added soon, in which case, scrap this!
b: People say they have frequent edits/changes to make, but is this not a requirement that really belongs to the complete feature? Could you not, in the short term, accept change requests via email or something else documented, and make them yourself? Do you think the time taken to add a question/response or change some wording would really be more than the time needed to pick through the XML for a syntax error introduced by someone who isn't familiar with it?
You'll need to weigh up the frequency of change and the impact on yourself of making a change against the likelihood of user error and the estimated time needed to identify and resolve a syntax error (plus the possible bad will when a change breaks things).
Despite what some people think, users don't like making mistakes! Putting them in a position where they have admin-level power over a system they don't fully grasp technically could reduce confidence and future buy-in to the feature you're due to develop.
TL;DR: in my opinion, unless it's a major hassle, make the changes yourself in the short term, perhaps with a fixed cadence (e.g. one change set a week, on a Friday). Keep the system working perfectly, and involve the users without putting them in the uncomfortable position of being involuntary early adopters of a feature that isn't finished.
I used my complete mastery of WinForms to create a little mock GUI application that enables users to quickly create one-dimensional, non-conditional lists of questions with different question types.
Once you have decided on an XML schema you can easily import and export XML files.
Are you interested in further development of the magical survey creator? If so, tell me and I will send you a practically finished prototype tomorrow morning. (You should provide me with an XML schema though, otherwise I will do it in CSV.)
I enjoy the exercise.
Picture related. Don't be put off by the colors, that's how I like it during development, to see the pixel exact boundaries of controls.
Unless your coworkers have some experience with programming or XML editing, they will hate you if you instruct them to edit any sort of "code".
Our secretaries put their hand in front of their faces and start chanting "no, no, no..." when I tell them how to operate VBA macros.
I've recently encountered a performance issue with iTextSharp taking extremely long (often 30+ seconds) to render HTML content (passed from an HTML editor such as CKEditor, TinyMCE, etc.).
Previously, HTMLWorker was used to parse the content and it worked great. It was fast and fairly accurate; however, when more complex HTML (such as tables, ordered lists and unordered lists) began to be passed in, it started to falter:
//The HTML Worker was quick, however it's weaknesses began to show with more
//complex HTML
List<IElement> objects = HTMLWorker.ParseToList(sr, ss);
The complex markup is a requirement in this situation and rather than attempting to perform Regular Expression surgery and other nasty things to try and fix these issues, I elected to use the XMLWorker to handle parsing.
//This outputs everything perfectly and retains all of the proper styling that is
//needed. However, when things get complex it gets sluggish
XMLWorkerHelper.GetInstance().ParseXHtml(writer,document,stringReader);
The XMLWorker results were incredible and it output everything just as we needed, but its performance renders it nearly unusable. As the complexity of the content increases (through additional tables, styles and lists), so do the loading times.
The line above appears to be the performance bottleneck and trying several different alternatives using it didn't help at all (such as creating a basic custom XmlHandler).
Possible Causes and Ideas
I tried going through and stripping out any extraneous and invalid markup from the contents that are being passed in, but that did little.
Could the issue be with iTextSharp itself and how XMLWorkerHelper works? I tried the SAME input in the iText XML Worker demo here and it was amazingly fast. I figured the performance would be at least comparable.
A current consideration would be to actually store the rendered PDFs and retrieve them on demand, as opposed to generating them dynamically. I would prefer to avoid this, but it's on the table.
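If we do go that store-and-retrieve route, even a simple in-memory cache keyed on a hash of the HTML would avoid regenerating identical documents. A minimal sketch (the one-hour expiry and the key scheme are arbitrary choices for illustration):

// Hedged sketch: cache rendered PDF bytes keyed on a hash of the source HTML,
// so identical content is only rendered once.
using System;
using System.Runtime.Caching;
using System.Security.Cryptography;
using System.Text;

class PdfCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static byte[] GetOrRender(string html, Func<string, byte[]> render)
    {
        string key = Hash(html);
        var cached = Cache.Get(key) as byte[];
        if (cached != null)
            return cached;

        byte[] pdf = render(html);                        // the expensive XMLWorker call goes here
        Cache.Set(key, pdf, DateTimeOffset.Now.AddHours(1));
        return pdf;
    }

    private static string Hash(string html)
    {
        using (var sha = SHA256.Create())
            return Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(html)));
    }
}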
The content is being pasted from Microsoft Word (cringe), which I have tried to clean up as much as possible, but I don't believe it to be a major issue since the iText demo mentioned above had no major problems with the same content.
Possible alternatives to using iTextSharp?
I would be glad to provide any additional details and code that I can.
Although this issue is a few years old, I thought I would let any future readers know that I eventually elected to use the wkhtmltopdf library via the TuesPechkin project.
The performance was a significant improvement over iTextSharp and it has great documentation with implementation examples for a variety of scenarios that may suit your existing project.
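For anyone evaluating the same switch, the conversion call in TuesPechkin is roughly as below. This is adapted from memory of the project's README, so treat the type names as approximate and check the current docs before relying on it:

// Rough sketch of HTML-to-PDF conversion via TuesPechkin / wkhtmltopdf
// (type names from memory of the 2.x README; verify against the project docs).
using TuesPechkin;

class WkHtmlRenderer
{
    public static byte[] Render(string html)
    {
        // Single-use converter; the README also describes a thread-safe variant
        // better suited to web applications.
        IConverter converter =
            new StandardConverter(
                new PdfToolset(
                    new Win32EmbeddedDeployment(
                        new TempFolderDeployment())));

        var document = new HtmlToPdfDocument
        {
            Objects =
            {
                new ObjectSettings { HtmlText = html }
            }
        };

        return converter.Convert(document);
    }
}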
I have a Lucene index with a lot of text data. Each item has a description, and I want to extract the most common words from the description and generate tags to classify each item based on it. Is there a Lucene.NET library for doing this, or any other library for text classification?
No. Lucene.NET can do search, indexing, text normalization and "find more like this" functionality, but not text classification.
What to suggest depends on your requirements, so maybe more description is needed.
But generally, the easiest way is to try external services. They all have REST APIs, and it's very easy to interact with them from C#.
From external services:
Open Calais
uClassify
Google Prediction API
Text Classify
Alchemy API
There are also good Java toolkits like Mahout. As I remember, interaction with Mahout can also be done as a service, so integrating with it is not a problem at all.
I had a similar "auto tagging" task in C#, and I used Open Calais for it. It's free for up to 50,000 transactions per day, which was enough for me. uClassify also has good pricing; for example, the "Indie" license is $99 per year.
But maybe external services and Mahout are not your way. Then take a look at the DBpedia project and RDF.
And lastly, you can at least use an implementation of the Naive Bayes algorithm. It's easy, and everything stays under your control.
This is a very hard problem, but if you don't want to spend much time on it you can take all words with between 5% and 10% frequency in the whole document, or simply take the 5 most common words.
Doing tag extraction well is very, very hard. It is so hard that whole companies live off web services exposing such an API.
You can also do stopword removal (using a fixed stopword list obtained from the internet).
And you can find common N-grams (for example pairs) which you can use to find multi-word tags.
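A minimal sketch of that frequency-based approach (the stopword list here is a tiny placeholder; a real one would come from a published list, as mentioned above):

// Hedged sketch: naive tag extraction by word frequency with stopword removal.
// The stopword list is a placeholder; use a full list in practice.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class TagExtractor
{
    private static readonly HashSet<string> Stopwords =
        new HashSet<string> { "the", "a", "an", "and", "or", "of", "to", "in", "is", "it" };

    public static IEnumerable<string> TopWords(string description, int count = 5)
    {
        return Regex.Matches(description.ToLowerInvariant(), @"[a-z]+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .Where(w => w.Length > 2 && !Stopwords.Contains(w))
                    .GroupBy(w => w)
                    .OrderByDescending(g => g.Count())   // most frequent words first
                    .Take(count)
                    .Select(g => g.Key);
    }
}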
I am creating my own CMS framework, because many of my clients have the same requirements, like a news module, a newsletter module, etc.
It's going fine so far. The only thing bothering me is this: if a client wants to move off my server, he will ask me to give him his files, and of course if I do so, whoever takes over will see all my code, use it and benefit from it. That is bad for me: I spent all this time creating my system, and anyone could easily read the code, see all the logic of my system, and work out how my other clients' sites run, which is a threat to me. Finally, I am using third-party controls that I have paid licenses for, and I don't want to hand them over on a golden plate.
Now what is the best way to solve this? I thought of encrypting the code, but how can I do that, and how effective is it?
- Should I merge all my CS files and the DLLs in the bin folder into one DLL and encrypt it? How can I do that?
I would totally appreciate any help on this matter, as it is really crucial for me.
You should read these:
Best .NET obfuscation tools/strategy
How effective is obfuscation?
In my experience, this is rarely worth the effort. Lots of companies who provide libraries like this don't bother obfuscating their code (Telerik, etc).
Especially considering what you are writing (CMSes are everywhere), you'd likely see more benefit from your time spent implementing features that put your product/implementation in a competitive advantage and make companies see that the software you are capable of writing has value, rather than the code itself.
In the end, you want to ensure you are a key factor in making software work for a company, not the DLLs you give them.
You'll need to precompile your site and obfuscate the DLLs.
Visual Studio ships with something like Dotfuscator Community Edition. You could give it a try.
Of course, HTML output, CSS declarations, database structure and stored procedures code cannot be encrypted.
You can, however, try to compress the CSS, which will also reduce its readability by humans.
Check here: The best approach to scramble CSS definitions to a human-unreadable state throughout an ASP.NET application
Another idea would be to use a frame in your HTML and put most of the site's pages inside it. That way, they will not be visible when doing "View source".
Or just state clearly that you offer whatever you're doing as a service and do not provide the source code of your work. I somehow doubt Salesforce would be willing to give their source to anyone who asks.