Sluggish Performance using iTextSharp XMLWorkerHelper and Parsing HTML - c#

I've recently encountered a performance issue involving iTextSharp taking extremely long times (often 30+ seconds) to render HTML content (passed in from an HTML editor such as CKEditor, TinyMCE, etc.).
Previously, the HTMLWorker was used to parse the content, and it worked great: it was fast and fairly accurate. However, when more complex HTML (such as tables, ordered lists, and unordered lists) began to be passed in, it started to falter:
//The HTMLWorker was quick, however its weaknesses began to show with more
//complex HTML
//(sr is a StringReader over the HTML; ss is the StyleSheet to apply)
List<IElement> objects = HTMLWorker.ParseToList(sr, ss);
The complex markup is a requirement in this situation, and rather than attempting Regular Expression surgery and other nasty things to try to fix these issues, I elected to use the XMLWorker to handle parsing.
//This outputs everything perfectly and retains all of the proper styling that is
//needed. However, when things get complex it gets sluggish
XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, stringReader);
The XMLWorker results were incredible and it output everything just as we needed, but its performance rendered it nearly unusable. As the complexity of the content increased (through additional tables, styles, and lists), so did the loading times.
The line above appears to be the performance bottleneck, and trying several different alternatives around it (such as creating a basic custom XmlHandler) didn't help at all.
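For reference, the manual pipeline setup I experimented with followed the pattern from the XMLWorker documentation; treat the sketch below as approximate, since constructor signatures vary between XMLWorker versions (document, writer, and stringReader are the same objects as above, and the types live under the iTextSharp.tool.xml namespaces):
//Manual XMLWorker pipeline (the same pattern XMLWorkerHelper uses internally)
var htmlContext = new HtmlPipelineContext(null); //ctor signature varies by version
htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
IPipeline pipeline = new CssResolverPipeline(cssResolver,
    new HtmlPipeline(htmlContext, new PdfWriterPipeline(document, writer)));
var worker = new XMLWorker(pipeline, true); //true = parse as HTML
var parser = new XMLParser(worker);
parser.Parse(stringReader); //this exhibits the same bottleneck as the helper call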
Possible Causes and Ideas
I tried going through and stripping out any extraneous and invalid markup from the content being passed in, but that did little.
Could the issue be with iTextSharp itself and how the XMLWorkerHelper is working? I attempted to use the SAME input within the iText XML Helper Demo here and it was amazingly fast. I figured the performance would be at least comparable.
One current consideration is to actually store the rendered PDFs and retrieve them on demand, as opposed to generating them dynamically. I would prefer to avoid this, but it's on the table.
The content is being pasted from Microsoft Word (cringe), which I have tried to clean up as much as possible, but I don't believe it to be a major issue, since the iText demo mentioned above had no major issues with the same content.
Possible alternatives to using iTextSharp?
I would be glad to provide any additional details and code that I can.

Although this issue is a few years old, I thought that I would let any future readers know that I eventually elected to use the wkhtmltopdf library via the TuesPechkin project.
The performance was a significant improvement over iTextSharp and it has great documentation with implementation examples for a variety of scenarios that may suit your existing project.
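For anyone landing here, a minimal sketch based on TuesPechkin's documented usage pattern looks like the following; the deployment/toolset choices vary by hosting scenario and library version, so verify against the project's README:
//Build a wkhtmltopdf document from the editor-supplied HTML
var document = new HtmlToPdfDocument
{
    GlobalSettings = { DocumentTitle = "Export" },
    Objects = { new ObjectSettings { HtmlText = htmlFromEditor } }
};

//StandardConverter is fine for single-threaded use; TuesPechkin also ships
//thread-safe converter wrappers for web scenarios
IConverter converter = new StandardConverter(
    new PdfToolset(new Win32EmbeddedDeployment(new TempFolderDeployment())));

byte[] pdf = converter.Convert(document);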

Related

PDF Creating Server

I've been tasked to create (or find something that already works) a centralized server with an API that can return a PDF file given some data and the name of a template. It has to be a robust, enterprise-ready solution. The goal is as follows:
A series of templates for different company documents (invoices, orders, order plannings, etc.)
A way of returning a PDF from external software (websites, ERP, etc.)
It can be an existing enterprise solution, but they are pressing for a custom one.
It can be any language, but we don't have any dedicated Java programmers in-house. We are PHP / .NET; some of us dabble in Java, but the learning curve could be a little steep.
So, I've been reading. One way we've thought this may be possible is by installing a JasperReports server, creating the templates in Jaspersoft Studio, and then using the API to return the PDF files. A colleague favors this option because it's mostly done, but first, it's Java, and second, I think it's like using a hammer to crack a nut.
Another option we've been toying with is to use C# with iTextSharp to create a server, and create our own API that returns exactly the PDF with the data we need. Doing this we could have some benefits, like using the database connector we have already made and extracting most of the data from the database, instead of having to pass around a big chunk of data. But as it is bare, iTextSharp doesn't really have a templating system; we'd have to create something with the XMLWorker or with C# classes, and it's not really as "easy" as drag and drop. For this case I've been reading about XFA too, but the documentation on the iText site is misleading and unclear.
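To make that option more concrete, here is a rough sketch of the kind of helper our API could wrap; the {{placeholder}} syntax and the method are invented for illustration, with XMLWorker doing the real rendering:
//Hypothetical helper: substitute data into an HTML template, then render with XMLWorker
public static byte[] RenderTemplate(string htmlTemplate, IDictionary<string, string> data)
{
    //Naive substitution; a real templating engine would replace this
    foreach (var pair in data)
        htmlTemplate = htmlTemplate.Replace("{{" + pair.Key + "}}", pair.Value);

    using (var output = new MemoryStream())
    {
        var document = new Document(PageSize.A4);
        var writer = PdfWriter.GetInstance(document, output);
        document.Open();
        using (var reader = new StringReader(htmlTemplate))
            XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, reader);
        document.Close();
        return output.ToArray();
    }
}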
I've also been reading about some other alternatives, like PrinceXML, PDFBox, FOP, etc., but the concept would be the same as with iText: we'd have to do it ourselves.
My vote, even if it's more work, is to go the iText route and use HTML / CSS for the templates, but my colleagues claim that the templates should be able to be changed every other week (I doubt it) and be easy; HTML / CSS would be too much work.
So the real question is, how do other businesses approach this? Did I leave anything out in my search? Is there an easier way to achieve this?
PS: I didn't know if SO would be the correct place for this question, but I'm mostly lost and risking a "too broad question" or "off topic" tag doesn't seem that bad.
EDIT:
Input should be sent with the same request. If we decide on the C# route, we can get ~70% of the data from the ERP directly, but in any case it should accept a POST request with some data (the template, and the data needed for that template, like the invoice data, or the invoice ID if we have access to the ERP).
Output should be a PDF (not interested in other formats, just PDF).
Templates will be updated only by IT. (Mostly us, the development team).
Performance-wise, I don't know how much muscle we'll need, but right now, without any increase, we are looking at ~500-1000 PDFs daily, mostly printed from 10:00 to 10:30 and from 12:00 to 13:00. Then maybe 100 more the rest of the day.
Top performance should be no more than ~10,000 daily when the planets align and it's sales season (twice a year). That should be our ceiling for the years to come.
The templates have some requirements:
Have repeating blocks (invoice lines, for example).
Have images as background, as watermark and as blocks.
Have to be multi language (translatable, with the same data).
Have some blocks that are only shown on a condition.
Blocks dependent on the page (PDF header / page header / page footer / PDF footer)
The template may have to do calculations over some of the data. I don't think we'll ever need this, but it's something the company may ask for in the future.
The PDFs don't need to be stored, as we have a document management system; maybe in the future we could link them.
Extra data: Right now we are using "Fast-Reports v2 VCL"
Your question shows you've been considering the problem in detail before asking for help, so I'm sure SO will be friendly.
Certainly one thing you haven't detailed much in your description is the broader functional requirements. You mentioned cracking a nut with a hammer, but I think you are focused mostly on the technology/interfacing. If you consider your broader requirements for the documents you need to create, and the variables involved, it might be a bigger nut than you think.
The approach I would suggest is to prototype solutions, assuming you have some room to do so. From your research, pick maybe the best three to try, which may well include the custom build you have in mind. Put them through some real use-cases end to end - rough but realistic. One or two key documents you need to output should be used across all solutions. Make sure you are covering the most important or most common requirements in terms of:
Input Format(s) - who can/should be updating templates. What is the ideal requirement and what is the minimum requirement?
Output Requirement(s) - who are you delivering to and what formats are essential/desirable
Data Requirement(s) - what are your sources of data and how hard/easy is it to get data from your sources to the reporting system in the format needed?
Template feature(s) - if you are using templates, what features do the templates need? This includes input format(s), but I was mostly thinking of features of the engine like repeating/conditional content, image insertion, table manipulation, etc. - i.e., are your invoices, orders, and planning documents plain or complex?
API requirements - do you have any broader API requirements? You mentioned you use PHP, so a PHP library or Web/Web Service is likely to be a good starting point.
Performance - you haven't mentioned any performance characteristics, but certainly if you are working at scale (enterprise) it would be worth at least rough-measuring the throughput.
iText and Jasper are certainly enterprise grade engines you can rely on. You may wish to look at Docmosis (please note I work for the company) and probably do some searches for PDF libraries that use templates.
A web service interface is possibly a key feature you might want to look at. A REST API is easy to call from PHP and virtually any technology stack. It means you will likely have options about how you can architect a solution, and it's typically easy to prototype against. If you decide to go down the prototyping path and try Docmosis, start with the cloud service since you can prototype/integrate very quickly.
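As a rough illustration (the endpoint, template name, and payload here are invented), calling such a render service from .NET is only a few lines:
//Hypothetical REST call: POST JSON data, get PDF bytes back
static async Task<byte[]> RenderPdfAsync(string templateName, string jsonData)
{
    using (var client = new HttpClient())
    {
        var content = new StringContent(jsonData, Encoding.UTF8, "application/json");
        var response = await client.PostAsync(
            "https://pdf-service.example.com/render?template=" + templateName, content);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsByteArrayAsync(); //the PDF bytes
    }
}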
I hope that helps.
From my years of experience working with PDF, I think you should pay attention to the following points:
The performance: You will likely get the best performance from direct API-based PDF generation, compared with HTML- or XML-to-PDF generation (because of the additional conversion layer involved). Considering peaks in the load, you may want to calculate the cost of scaling up generation by adding more servers (and estimate the cost of additional servers or resources required per additional PDF file per day).
Ease of iterations and changes: How often will you need to adjust templates? If you are going to create templates just once (with some iterations) and then no changes are required, you should be OK just coding them using the API. Otherwise, you should strongly consider using HTML or XML for templates, to simplify changes and decrease their complexity;
Search and indexing: If you need to run searches among the created documents, consider storing indexes of the generated documents, or perhaps storing the source data in XML along with the generated PDF file;
Long-term preservation: You should conform to the PDF/A sub-format if you are looking for long-term digital preservation of your documents (a hedged iTextSharp sketch follows this list). See the VeraPDF open source initiative, which you can use to validate generated and incoming PDF documents for conformance to PDF/A requirements;
Preserving source files: The PDF format itself was not designed to be edited (though some PDF editors exist), so consider preserving the source data to be able to regenerate PDF documents later, and perhaps to introduce additional output formats later.
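To illustrate the PDF/A point above, here is a hedged iTextSharp 5.x sketch; PdfAWriter and the output-intent call exist in recent 5.x releases, but verify against your version (the file names are placeholders):
//PDF/A-1b output: requires XMP metadata, an ICC output intent, and embedded fonts
using (var output = new FileStream("invoice.pdf", FileMode.Create))
{
    var document = new Document();
    var writer = PdfAWriter.GetInstance(document, output, PdfAConformanceLevel.PDF_A_1B);
    writer.CreateXmpMetadata();
    document.Open();
    using (var icc = new FileStream("sRGB.icc", FileMode.Open, FileAccess.Read))
        writer.SetOutputIntents("Custom", "", "http://www.color.org",
            "sRGB IEC61966-2.1", ICC_Profile.GetInstance(icc));
    var baseFont = BaseFont.CreateFont("arial.ttf", BaseFont.CP1252, BaseFont.EMBEDDED);
    document.Add(new Paragraph("Hello PDF/A", new Font(baseFont, 12)));
    document.Close();
}
A validator such as VeraPDF can then confirm the output actually conforms.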

Parsing custom XAML-like document to C# equivalent? (Or binaries)

I'm working on an XNA project, for the Xbox 360.
For this I have built an extensive UI library, all with relative positions, dozens of controls and stuff like that.
At the moment, all the design elements are created in page classes, which means a lot of code (think of margins, alignments, labels, and so on).
So I was thinking: shouldn't it be possible to create a file, like a XAML file for example, and parse it to objects using C#?
XNAML uses this kind of approach (I think), so I know it must be possible; I'm just wondering how. I could go with the approach of parsing the file manually line by line and creating every object, but I think this would be pretty complex and nasty code, considering the number of different objects. Any input is greatly appreciated!
F.Y.I., XNAML is deprecated (I think it never even went live).
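One way to approach it (sketched below with hypothetical control and namespace names) is to describe each page in XML and use reflection to instantiate the controls, mapping attributes onto properties; reflection support on the Xbox's Compact Framework is limited, so this would need testing there:
//Hypothetical: build a UI control from an XML element such as <Label Text="Hi" />
static object ParseElement(XElement element)
{
    //Resolve the control type by tag name, e.g. "Label" -> MyUi.Label
    Type type = Type.GetType("MyUi." + element.Name.LocalName, true);
    object control = Activator.CreateInstance(type);

    //Map attributes onto same-named properties, converting to the property type
    foreach (XAttribute attr in element.Attributes())
    {
        PropertyInfo prop = type.GetProperty(attr.Name.LocalName);
        if (prop != null && prop.CanWrite)
            prop.SetValue(control, Convert.ChangeType(attr.Value, prop.PropertyType), null);
    }

    //A real version would also recurse into child elements for nested controls
    return control;
}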

Is it a Good Practice to Write HTML Using a StringBuilder in my ASP.NET Codebehind?

I'm interested to hear from other developers their opinion on an approach that I typically take. I have a web application, asp.net 2.0, c#.
What I usually do to write out drop-downs, tables, input controls, etc. is to use a StringBuilder in the code behind and write out the markup with something like sb.Append("<table>").
I don't find myself using too many .NET controls, as I typically write out the HTML in the code behind. When I want to use jQuery or call JavaScript I just put that function call in my sb.Append string, like sb.Append("<td ... onblur='fnCallJS()'>").
I've gotten pretty comfortable with this approach. For data access I use EntitySpaces.
I'm just kind of curious whether this sort of approach is horribly wrong, OK depending on the context, good, a sign it's time to learn 3.0, etc. I'm interested in learning and was just looking for some input.
Edit
After reading the comments here it sounds like I should take a look at MVC. I've not done that yet. The only hesitancy in doing so is that the existing project is just that, existing. There is a lot of code already done the way I explained and it is hard to imagine what would be involved in changing it, advantages of doing so, and just learning what that would take.
The other thing I'm taking away from the comments is that my code behind should really not include much of the sb.Append code, whereas now it is filled with it in numerous functions. To me it is not messy but that is because I know what each function does and can look at it and see, oh that writes out x, y, and z.
It's not uncommon for me to just have a div on the .aspx part and then build up the .innerHtml of that with the StringBuilder in the code behind.
Thanks again for the comments. I'm thinking as I'm reading them.
I typically write out the html in the code behind.
That part is a little odd, and not something I recommend for webforms. If you want to do that, consider an asp.net mvc project instead.
In webforms, you really want the meat of your html to live with the markup rather than the code. The two should remain separate. You also don't want a huge stringbuilder that encompasses your entire page. This will force you to keep the entire page in memory twice (once for the stringbuilder bytes and once for the built string at the end) rather than writing the page to the response stream as it's built. That means more memory per request, which can really kill scalability.
To those ends, I would abstract distinct portions of your stringbuilder code into custom/user controls that you can use in the aspx markup. These controls can use a stringbuilder to create their output. This means you only need to keep enough html markup in memory to render one control at a time. It also allows you to more easily re-use common markup across pages or even sites.
There are times when you need to generate some HTML in your code behind, but in general, you want to leave the HTML where it belongs, and that's separated from your code. The VS IDE is a pretty good HTML editor. Use it.
I'm going to go out on a limb and guess you may have come from a "Classic" ASP (vbScript) or PHP background.
My background is "Classic ASP" and my first attempts at the WebForms model were pretty much the same as yours; once I started using the controls and understanding them, I've never looked back. There is a distinct learning curve, though, in understanding how the page life cycle interacts with the various WebForm controls.
Look up the various threads on ASP.NET WebForms vs MVC to see which suits your project's needs the best. MVC isn't a magic cure-all, but in many respects may be more familiar if you're from a "Classic ASP" or PHP background.
From a practical perspective, assuming you're sticking with WebForms, if there is the possibility of other developers becoming involved in the project, aim towards using more of the inbuilt controls where you can, as that is more than likely what they will be familiar with. Stating the obvious, the more you use the controls the more familiar you will become with what they can and can't do, and before too long you will find yourself writing your own controls to fill the gaps or finding existing 3rd-party controls.
A big problem you have with that is it can get pretty messy - having to escape all the quotes or messing with carriage returns. Sure, YOU can program around that, but what if you want to copy/paste code? It sounds like a nightmare and WAY more work than it's worth.
It sounds like you should be writing a custom control and using HtmlTextWriter to write the markup.
Or perhaps more appropriate would be a user control, with markup in the aspx page and anything else in the code behind.
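For example, a minimal custom control along those lines (the control name and markup are illustrative) could render through HtmlTextWriter like this:
//A custom control that writes its markup via HtmlTextWriter instead of string concatenation
public class PriceCell : WebControl
{
    public string Value { get; set; }

    protected override void RenderContents(HtmlTextWriter writer)
    {
        writer.AddAttribute("onblur", "fnCallJS()"); //attributes go on before the tag opens
        writer.RenderBeginTag(HtmlTextWriterTag.Td);
        writer.WriteEncodedText(Value ?? string.Empty); //HTML-encodes the value for you
        writer.RenderEndTag(); //</td>
    }
}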
If you're using this approach, you should migrate your development efforts to ASP.Net MVC. Whereas ASP.Net actively tries to abstract the HTML, CSS, JavaScript, etc. away by using web controls, ASP.Net MVC is built around a paradigm of directly controlling the markup itself (though that may arguably be the least of the differences between the two - you should definitely read up on it to at least know the alternatives, even if you stick with ASP.Net in the long run).
Otherwise, what you're doing works if done properly (though you'll be fighting the framework the whole way), though I'd recommend using a StringWriter instead. It uses a StringBuilder internally so the performance characteristics are the same between the two, but the semantics are more consistent with the rest of the .Net framework (e.g., Write vs. Append).
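The swap is mostly mechanical; a small sketch:
//StringWriter wraps a StringBuilder internally, so performance is equivalent,
//but Write/WriteLine matches other TextWriter-based .NET APIs
using (var writer = new StringWriter())
{
    writer.Write("<td onblur='fnCallJS()'>");
    writer.Write("value");
    writer.Write("</td>");
    string html = writer.ToString();
}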
I think this approach kind of defeats the purpose of what webforms was trying to accomplish (separating markup and code).
I know this thread is kind of old and has been answered really well, I just thought I would "append" (pun intended) my answer since I am working with code that was mentioned in the question.
ALL the markup is in the C# classes, built with a StringBuilder object that appends all the HTML and JavaScript strings. This has made it very difficult to read the code and see what's going on. And what if they want to change the markup/design of the front-end? Now I've got a heck of a job on my hands, having to go in and refactor all that markup in the classes, when it would be so much easier to change the .aspx pages and connect the data model to those pages.
In my humble opinion, I can't find a good reason to put any markup in your classes/code behind. They are for logic only. Plus, it makes it difficult to test and debug Javascript. That's my two cents. K.

Simple screen scraping and analyze in .NET

I'm building a small specialized search engine for price info. The engine will only collect specific segments of data on each site. My plan is to split the process into two steps.
Simple screen scraping based on a URL that points to the page where the segment I need exists. Is the easiest way to do this just to use a WebClient object and get the full HTML?
Once the HTML is pulled and saved, analyse it via some script and pull out just the segment and values I need (for example, the price of a product). My problem is that this script somehow has to be unique for each site I pull, it has to be able to handle really ugly HTML (so I don't think XSLT will do...), and I need to be able to change it on the fly as the target sites update and change. I will finally take the specific values and write them to a database to make them searchable.
Could you please give me some hints on how best to architect this? Would you do anything differently from what I described above?
Well, I would go with the way you describe.
1.
How much data is it going to handle? Fetching the full HTML via WebClient / HttpWebRequest should not be a problem.
2.
I would go for HtmlAgilityPack for HTML parsing. It's very forgiving, and can handle pretty ugly markup. As HtmlAgilityPack supports XPath, it's pretty easy to have site-specific XPath selections.
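For example, a minimal fetch-and-extract (the URL and XPath are hypothetical; each site would get its own expression):
//Fetch the page, then pull the price out with a site-specific XPath
var html = new WebClient().DownloadString("http://example.com/product/42");
var doc = new HtmlDocument();
doc.LoadHtml(html); //HtmlAgilityPack tolerates malformed markup
var priceNode = doc.DocumentNode.SelectSingleNode("//span[@class='price']");
string price = priceNode == null ? null : priceNode.InnerText.Trim();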
I'm on the run and going to expand on this answer asap.
Yes, a WebClient can work well for this. The WebBrowser control will work as well, depending on your requirements. If you are going to load the document into an HtmlDocument (the IE HTML DOM), then it might be easier to use the web browser control.
The HtmlDocument object that is now built into .NET can be used to parse the HTML. It is designed to be used with the WebBrowser control, but you can use the implementation from the mshtml DLL as well. I have not used the HtmlAgilityPack, but I hear that it can do a similar job.
The HTML DOM objects will typically handle, and fix up, most ugly HTML that you throw at them, as well as allowing a nicer way to parse the HTML - document.GetElementsByTagName to get a collection of tag objects, for example.
As for handling the changing requirements of the site, it sounds like a good candidate for the strategy pattern. You could load the strategies for each site using reflection or something of that sort.
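A sketch of that idea (the interface and class names are hypothetical):
//One extractor strategy per site, discovered via reflection
public interface ISiteScraper
{
    bool CanHandle(Uri url);
    string ExtractPrice(string html);
}

public class ExampleComScraper : ISiteScraper
{
    public bool CanHandle(Uri url)
    {
        return url.Host.EndsWith("example.com");
    }

    public string ExtractPrice(string html)
    {
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);
        var node = doc.DocumentNode.SelectSingleNode("//span[@class='price']");
        return node == null ? null : node.InnerText;
    }
}

public static class ScraperRegistry
{
    //New site strategies are picked up without touching dispatch code
    public static ISiteScraper FindFor(Uri url)
    {
        return Assembly.GetExecutingAssembly().GetTypes()
            .Where(t => typeof(ISiteScraper).IsAssignableFrom(t) && t.IsClass && !t.IsAbstract)
            .Select(t => (ISiteScraper)Activator.CreateInstance(t))
            .FirstOrDefault(s => s.CanHandle(url));
    }
}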
I have worked on a system that uses XML to define a generic set of parameters for extracting text from HTML pages. Basically it would define start and end elements to begin and end extraction. I have found this technique to work well enough for a small sample, but it gets rather cumbersome and difficult to customize as the collection of sites grows larger and larger. Keeping the XML up to date, and trying to keep a generic set of XML and code that handles any type of site, is difficult. But if the type and number of sites is small, then this might work.
One last thing to mention is that you might want to add a cleaning step to your approach. A flexible way to clean up HTML as it comes into the process was invaluable in the code I have worked on in the past. Perhaps implementing a type of pipeline would be a good approach if you think the domain is complex enough to warrant it. But even just a method that runs some regexes over the HTML before you parse it would be valuable: getting rid of images, replacing particular misused tags with nicer HTML, etc. The amount of really dodgy HTML that is out there continues to amaze me...
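Even something as small as the following (the specific fixes are examples only) can take the worst edges off before parsing:
//Illustrative cleanup pass run before the HTML hits the parser
static string CleanHtml(string html)
{
    html = Regex.Replace(html, @"<img[^>]*>", string.Empty, RegexOptions.IgnoreCase); //drop images
    html = Regex.Replace(html, @"</?font[^>]*>", string.Empty, RegexOptions.IgnoreCase); //kill deprecated tags
    html = Regex.Replace(html, @"\s{2,}", " "); //collapse whitespace runs
    return html;
}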

ASP.NET Templating

We're building a text templating engine out of a custom HttpModule that replaces tags in our HTML with whole sections of text from an XML file.
Currently the XML file is loaded into memory as a string/string Dictionary so that the lookup/replace done by the HttpModule can be performed very quickly using a regex.
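For illustration, the lookup/replace amounts to something like the following (the {{key}} tag syntax is an assumption; substitute whatever tag format the module actually matches):
//Expand {{key}} tags from the in-memory dictionary in a single regex pass
static string ExpandTags(string html, IDictionary<string, string> lookup)
{
    return Regex.Replace(html, @"\{\{(\w+)\}\}", match =>
    {
        string value;
        return lookup.TryGetValue(match.Groups[1].Value, out value)
            ? value
            : match.Value; //leave unknown tags untouched
    });
}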
We're looking to expand the use of this, though, to incorporate larger and larger sections of replaced text, and I'm concerned about keeping more verbose text in memory at one time as part of the Dictionary, especially as we make heavy use of ASP.NET caching as well.
Does anyone have a suggestion for a more efficient and scalable data structure/management strategy that we could use?
UPDATE: In response to Corbin March's great suggestion below, I don't think this is a case of us going down a 'dark road' (although I appreciate the concern). Our application is designed to be reskinned completely for different clients, right down to text anywhere on the page - including the ability to have multiple languages. The technique I've described has proven to be the most flexible way to handle this.
The amount of memory you are using is going to be roughly the same as the size of the file. The XML will have some overhead in the tags that the Dictionary will not, so it's a safe estimate of the memory requirements. So are you talking about 10-50 MB or 100-500 MB? I wouldn't necessarily worry about 10 to 50 MB.
If you are concerned, then you need to think about whether you really need to do the replacements every time the page is loaded. Can you take the hit of going to a database or the XML file once per page, and then cache the output of the ASP.NET page and hold it for an hour? If so, consider using Page Caching.
A couple ideas:
Compress your dictionary values. Check out Scott Hanselman's cache compressing article to get the spirit of the exercise. If your dictionary keys are large, consider compressing those as well.
Only load items from your XML file into memory when they're requested and attach an item expiration. If the expiration occurs without another request, unload the item. The idea is that some dictionary items are used less frequently so an IO hit for the infrequent items is acceptable. Obviously, the ASP.NET Cache does this for you - I'm assuming Cache is out-of-context by the time you're crunching your output.
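A sketch of that second idea using the ASP.NET cache (the loader helper is hypothetical):
//Load a dictionary value on demand; a sliding expiration evicts it if unused
static string GetTemplateText(string key)
{
    var cached = HttpRuntime.Cache[key] as string;
    if (cached == null)
    {
        cached = LoadFromXmlFile(key); //hypothetical IO helper
        HttpRuntime.Cache.Insert(key, cached, null,
            Cache.NoAbsoluteExpiration, TimeSpan.FromMinutes(20));
    }
    return cached;
}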
Just an opinion... but my spider-sense warns me you may be going down a dark road. A huge part of ASP.NET is its templating features - master pages, page templates, user controls, custom controls, templated controls, resource schemes for internationalization. If at all possible, I'd try to solve your problem with these tools rather than a text-crunching HttpModule.
