We're building a text templating engine out of a custom HttpModule that replaces tags in our HTML with whole sections of text from an XML file.
Currently the XML file is loaded into memory as a string/string Dictionary so that the lookup/replace done by the HttpModule can be performed very quickly using a regex.
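For context, the lookup/replace amounts to something like the following minimal sketch (the {{Key}} tag syntax, the XML layout and the class name are assumptions for illustration, not our actual code):

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public class TextTemplates
{
    private static readonly Regex Tag = new Regex(@"\{\{(\w+)\}\}", RegexOptions.Compiled);
    private readonly Dictionary<string, string> _texts;

    public TextTemplates(string xmlPath)
    {
        // Assumes a layout like <texts><text key="Welcome">Hello</text></texts>.
        _texts = XDocument.Load(xmlPath)
            .Descendants("text")
            .ToDictionary(e => (string)e.Attribute("key"), e => e.Value);
    }

    // Called from the HttpModule against the buffered HTML output:
    // every {{Key}} tag is swapped for its dictionary value in one regex pass.
    public string Expand(string html)
    {
        return Tag.Replace(html, m =>
            _texts.TryGetValue(m.Groups[1].Value, out string value) ? value : m.Value);
    }
}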
We're looking to expand the use of this, though, to incorporate larger and larger sections of replaced text, and I'm concerned about keeping more verbose text in memory at one time as part of the Dictionary, especially as we already lean on ASP.NET caching for many other things as well.
Does anyone have a suggestion for a more efficient and scalable data structure/management strategy that we could use?
UPDATE: In response to Corbin March's great suggestion below, I don't think this is a case of us going down a 'dark road' (although I appreciate the concern). Our application is designed to be reskinned completely for different clients, right down to text anywhere on the page - including the ability to have multiple languages. The technique I've described has proven to be the most flexible way to handle this.
The amount of memory you are using is going to be roughly the same as the size of the file. The XML will have some overhead in the tags that the Dictionary will not, so the file size gives a safe upper-bound estimate of the memory requirements. So are you talking about 10-50 MB or 100-500 MB? I wouldn't necessarily worry about 10 to 50 MB.
If you are concerned, then you need to think about whether you really need to do the replacements every time the page is loaded. Can you take the hit of going to a database or the XML file once per page, then cache the output of the ASP.NET page and hold it for an hour? If so, consider using page output caching.
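In its simplest declarative form that is a single directive at the top of the .aspx page; the one-hour duration below is just the figure from the paragraph above:

<%@ OutputCache Duration="3600" VaryByParam="None" %>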
A couple of ideas:
Compress your dictionary values. Check out Scott Hanselman's cache compressing article to get the spirit of the exercise. If your dictionary keys are large, consider compressing those as well.
Only load items from your XML file into memory when they're requested and attach an item expiration. If the expiration occurs without another request, unload the item. The idea is that some dictionary items are used less frequently so an IO hit for the infrequent items is acceptable. Obviously, the ASP.NET Cache does this for you - I'm assuming Cache is out-of-context by the time you're crunching your output.
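A minimal sketch of that load-on-demand idea against the ASP.NET cache (the file path, key prefix and 20-minute sliding window are assumptions; note that HttpRuntime.Cache is usable even outside a request context):

using System;
using System.Linq;
using System.Web;
using System.Web.Caching;
using System.Web.Hosting;
using System.Xml.Linq;

public static class LazyTexts
{
    public static string Get(string key)
    {
        string cacheKey = "tpl:" + key;
        string value = HttpRuntime.Cache[cacheKey] as string;
        if (value == null)
        {
            // IO hit only on a cache miss; rarely-used items simply expire away.
            value = XDocument.Load(HostingEnvironment.MapPath("~/App_Data/texts.xml"))
                .Descendants("text")
                .Where(e => (string)e.Attribute("key") == key)
                .Select(e => e.Value)
                .FirstOrDefault() ?? string.Empty;

            HttpRuntime.Cache.Insert(cacheKey, value, null,
                Cache.NoAbsoluteExpiration, TimeSpan.FromMinutes(20));
        }
        return value;
    }
}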
Just an opinion... but my spidersense warns me you may be going down a dark road. A huge part of ASP.NET is its templating features - master pages, page templates, user controls, custom controls, templated controls, resource schemes for internationalization. If at all possible, I'd try to solve your problem with these tools versus a text-crunching HttpModule.
Related
I've been tasked with creating (or finding something that already works) a centralized server with an API that can return a PDF file given some data and the name of a template. It has to be a robust, enterprise-ready solution. The goal is as follows:
A series of templates for different company documents (invoices, orders, order plannings, etc.).
A way of returning a PDF from external software (Websites, ERP, etc)
It can be an existing, off-the-shelf enterprise solution, but they are pushing for a custom one.
It can be in any language, but we don't have any dedicated Java programmers in-house. We are a PHP / .NET shop; some of us dabble in Java, but the learning curve could be a little steep.
So, I've been reading. One route we've considered is installing a JasperReports Server, creating the templates in Jaspersoft Studio, and then using its API to return the PDF files. A colleague favors this option because it's mostly done already, but first, it's Java, and second, I think it's like using a hammer to crack a nut.
The other option we've been toying with is to use C# with iTextSharp to build a server and create our own API that returns exactly the PDF with the data we need. Doing this we'd get some benefits, like reusing the database connector we've already written and extracting most of the data from the database instead of having to pass around a big chunk of data. But as it stands, iTextSharp doesn't really have a templating system; we'd have to create something ourselves with the XMLWorker or with C# classes, and it's not as easy as drag and drop. For this case I've also been reading about XFA, but the documentation on the iText site is misleading and unclear.
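To illustrate what "coding the template" against the iTextSharp 5 API looks like in practice, here is a rough sketch; every name and the layout are invented for illustration:

using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class InvoicePdf
{
    public static byte[] Build(string customer, IDictionary<string, decimal> lines)
    {
        using (var ms = new MemoryStream())
        {
            var doc = new Document(PageSize.A4, 36, 36, 54, 54);
            PdfWriter.GetInstance(doc, ms);
            doc.Open();

            doc.Add(new Paragraph("Invoice for " + customer));

            // "Repeating block": one table row per invoice line.
            var table = new PdfPTable(2);
            foreach (var line in lines)
            {
                table.AddCell(line.Key);
                table.AddCell(line.Value.ToString("0.00"));
            }
            doc.Add(table);

            doc.Close();
            return ms.ToArray();
        }
    }
}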
I've also been reading about some other alternatives, like PrinceXML, PDFBox, FOP, etc., but the concept is the same as with iText: we'd have to do it ourselves.
My vote, even if it's more work, is to go the iText route and use HTML / CSS for the templates, but my colleagues claim that the templates need to be easy to change, possibly every other week (I doubt it), and that HTML / CSS would be too much work.
So the real question is: how do other businesses approach this? Did I leave anything out of my search? Is there an easier way to achieve this?
PS: I didn't know if SO would be the correct place for this question, but I'm mostly lost, and risking a "too broad question" or "off topic" tag doesn't seem that bad.
EDIT:
Input should be sent with the same request. If we go the C# route, we can get ~70% of the data from the ERP directly, but in any case it should accept a POST request with some data: the template name plus the data needed for that template, such as the invoice data, or just the invoice ID if we have access to the ERP (see the sketch after this list).
Output should be a PDF (not interested in other formats, just PDF).
Templates will be updated only by IT. (Mostly us, the development team).
Performance-wise, I don't know how much muscle we'll need, but right now, without any growth, we are looking at ~500-1000 PDFs daily, mostly printed from 10:00 to 10:30 and from 12:00 to 13:00, and then maybe 100 more over the rest of the day.
Peak volume should not exceed ~10,000 daily when the planets align and it's sales season (twice a year). That should be our ceiling for the years to come.
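As a sketch of the intended contract from a caller's point of view (the endpoint, field names and JSON shape below are all hypothetical):

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class PdfServiceClient
{
    // Post the template name plus the data (here just an invoice ID) and get PDF bytes back.
    public static async Task<byte[]> RenderInvoiceAsync(HttpClient http, int invoiceId)
    {
        var body = new StringContent(
            "{ \"template\": \"invoice\", \"invoiceId\": " + invoiceId + " }",
            Encoding.UTF8, "application/json");

        HttpResponseMessage response = await http.PostAsync("https://pdf.example.local/api/render", body);
        response.EnsureSuccessStatusCode();

        // The service is expected to answer with Content-Type: application/pdf.
        return await response.Content.ReadAsByteArrayAsync();
    }
}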
The templates have some requirements:
Have repeating blocks (invoice lines, for example).
Have images as background, as watermark and as blocks.
Have to be multi language (translatable, with the same data).
Have some blocks that are only shown on a condition.
Blocks dependent on the page (PDF header / page header / page footer / PDF footer)
The templates may have to do calculations over some of the data. I don't think we'll ever need this, but it's something the company may ask for in the future.
The PDFs don't need to be stored, as we have a document management system, maybe in the future we could link them.
Extra data: Right now we are using "Fast-Reports v2 VCL"
Your question shows you've been considering the problem in detail before asking for help so I'm sure SO will be friendly.
Certainly one thing you haven't detailed much in your description is the broader functional requirements. You mentioned cracking a nut with a hammer, but I think you are focused mostly on the technology/interfacing. If you consider your broader requirements for the documents you need to create, and the variables involved, it might be a bigger nut than you think.
The approach I would suggest is to prototype solutions, assuming you have some room to do so. From your research, pick maybe the best three to try, which may well include the custom build you have in mind. Put them through some real use cases end to end: as rough as possible, but realistic. One or two key documents you need to output should be used across all solutions. Make sure you are covering the most important or most common requirements in terms of:
Input Format(s) - who can/should be updating templates? What is the ideal requirement and what is the minimum requirement?
Output Requirement(s) - who are you delivering to, and which formats are essential/desirable?
Data Requirement(s) - what are your sources of data and how hard/easy is it to get data from your sources to the reporting system in the format needed?
Template feature(s) - if you are using templates, what features do the templates need? This includes input format(s), but I was mostly thinking of features of the engine like repeating/conditional content, image insertion, table manipulation, etc. In other words, are your invoices, orders and planning documents plain or complex?
API requirements - do you have any broader API requirements? You mentioned you use PHP, so a PHP library or a web service is likely to be a good starting point.
Performance - you haven't mentioned any performance characteristics, but if you are working at enterprise scale it would be worth at least rough-measuring the throughput.
iText and Jasper are certainly enterprise grade engines you can rely on. You may wish to look at Docmosis (please note I work for the company) and probably do some searches for PDF libraries that use templates.
A web service interface is possibly a key feature you might want to look at. A REST API is easy to call from PHP and virtually any other technology stack. It means you will likely have options in how you architect a solution, and it's typically easy to prototype against. If you decide to go down the prototyping path and try Docmosis, start with the cloud service, since you can prototype/integrate against it very quickly.
I hope that helps.
From my years of experience working with PDF, I think you should pay attention to the following points:
The performance: you may get the fastest performance with API-based PDF generation, compared with HTML- or XML-to-PDF generation (because of the additional conversion layer involved). Considering the peaks in your load, you may want to calculate the cost of scaling generation up by adding more servers (and estimate the cost of the additional servers or resources required per additional PDF per day).
Ease of iteration and change: how often will you need to adjust templates? If you are going to create the templates just once (with a few iterations) and no further changes will be required, then you should be fine coding them directly against the API. Otherwise, you should strongly consider using HTML or XML for the templates to simplify changes and reduce the complexity of maintaining them.
Search and indexing: if you need to search across the generated documents, consider storing an index of the documents you generate, or perhaps storing the source data in XML alongside the generated PDF file.
Long-term preservation: you should conform to the PDF/A sub-format if you are looking at long-term digital preservation of your documents. See the veraPDF open-source initiative, which you can use to validate generated and incoming PDF documents for conformance to the PDF/A requirements.
Preserving source files: the PDF format itself was not designed to be edited (though some PDF editors do exist), so consider preserving the source data so you can regenerate PDF documents later and perhaps introduce additional output formats down the line.
I've recently encountered a performance issue involving ITextSharp taking extremely long times (often 30+ seconds) to render HTML content (being passed from an HTML Editor such as CKEditor, TinyMCE, etc).
Previously, the HTMLWorker was used to parse the content and it worked great: it was fast and fairly accurate. However, when more complex HTML (such as tables, ordered lists and unordered lists) began to be passed in, it started to falter:
//The HTML Worker was quick, however it's weaknesses began to show with more
//complex HTML
List<IElement> objects = HTMLWorker.ParseToList(sr, ss);
The complex markup is a requirement in this situation and rather than attempting to perform Regular Expression surgery and other nasty things to try and fix these issues, I elected to use the XMLWorker to handle parsing.
//This outputs everything perfectly and retains all of the proper styling that is
//needed. However, when things get complex it gets sluggish
XMLWorkerHelper.GetInstance().ParseXHtml(writer,document,stringReader);
The XMLWorker results were incredible and it output everything just as we needed, but its performance rendered it nearly unusable. As the complexity of the content increased (through additional tables, styles and lists), so did the loading times.
The line above appears to be the performance bottleneck, and trying several alternative ways of using it (such as creating a basic custom XmlHandler) didn't help at all.
Possible Causes and Ideas
I tried going through and stripping out any extraneous and invalid markup from the content being passed in, but that did little.
Could the issue be with iTextSharp itself and how XMLWorkerHelper works? I tried using the SAME input with the iText XML Worker demo here and it was amazingly fast; I figured the performance would be at least comparable.
A current consideration is to actually store the rendered PDFs and retrieve them on demand, as opposed to generating them dynamically. I would prefer to avoid this, but it's on the table.
The content is being pasted from Microsoft Word (cringe), which I have tried to clean up as much as possible, but I don't believe it to be a major issue, since the iText demo mentioned above had no trouble with the same content.
Possible alternatives to using iTextSharp?
I would be glad to provide any additional details and code that I can.
Although this issue is a few years old, I thought I would let any future readers know that I eventually elected to use the wkhtmltopdf library via the TuesPechkin project.
The performance was a significant improvement over iTextSharp, and the project has good documentation with implementation examples for a variety of scenarios that may suit your existing project.
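For anyone evaluating it, the basic usage looks roughly like this (adapted from memory of the project's documented example; exact class names can vary between TuesPechkin versions):

using System.IO;
using TuesPechkin;

// One converter instance should be created and reused for the application's lifetime.
IConverter converter = new ThreadSafeConverter(
    new RemotingToolset<PdfToolset>(
        new Win64EmbeddedDeployment(new TempFolderDeployment())));

var document = new HtmlToPdfDocument
{
    Objects = { new ObjectSettings { HtmlText = "<h1>Hello from wkhtmltopdf</h1>" } }
};

File.WriteAllBytes("out.pdf", converter.Convert(document));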
I am currently rewriting a large website with the goal of replacing a large number of page/form submittals with AJAX calls. The aim is to reduce the number of server round trips, and all the state handling, on pages that are rich with client-side behavior.
Having spent some time considering the best way forward with regard to performance, my question is now the following:
Will it lead to better performance to have just one single .aspx page that is used for all AJAX calls, or will it be better to have an .aspx page for every use of AJAX on a given web page?
Thank you very much for any insights
Lars Kjeldsen
Performance-wise, either approach can be made to work on a similar order of magnitude.
Maintenance-wise, I prefer to have separate pages for each logical part of your site. Again, either can work, but I've seen more people make a mess of things with "monolithic" approaches. With a single page you'll need a good amount of skill in structuring your scripts and client-side logic. Done well, there isn't a problem; I just see more people getting it right when they use separate pages for separate parts of the site.
If you take a look at the site http://battlelog.battlefield.com/ (you'll have to create an account), you'll notice a few things about it.
It never refreshes the page as you navigate the website. (Using JSON to transmit new data)
It updates the URL and keeps track of where you are.
You can use the updated URL and immediately navigate to that portion of the web-application. (In this case it returns the HTML page)
Here's a full write up on the website.
Personally, I like this approach from a technology/performance perspective, but I don't know what impact it will have on SEO, since this design relies on the HTML5 History state mechanism in JavaScript.
Here's an article on SEO and JavaScript, but you'll have to do more research.
NOTE: History.js provides graceful degradation for Browsers that do not support History state.
I'm building a small specialized search engine for price info. The engine will only collect specific segments of data from each site. My plan is to split the process into two steps.
Simple screen scraping based on a URL that points to the page where the segment I need exists. Is the easiest way to do this just to use a WebClient object and get the full HTML?
Once the HTML is pulled and saved, analyse it via some script and pull out just the segment and values I need (for example, the price of a product). My problem is that this script somehow has to be unique for each site I pull from, it has to be able to handle really ugly HTML (so I don't think XSLT will do ...), and I need to be able to change it on the fly as the target sites update and change. Finally, I will take the specific values and write them to a database to make them searchable.
Could you please give me some hints on how best to architect this? Would you do it differently from what I've described above?
Well, I would go with the approach you describe.
1.
How much data is it going to handle? Fetching the full HTML via WebClient / HttpWebRequest should not be a problem.
2.
I would go for HtmlAgilityPack for the HTML parsing. It's very forgiving and can handle pretty ugly markup. Since HtmlAgilityPack supports XPath, it's pretty easy to have specific XPath selections for individual sites.
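A minimal sketch of both steps together (the URL and the XPath expression are invented placeholders):

using System.Net;
using HtmlAgilityPack;

// Step 1: fetch the raw HTML.
var client = new WebClient();
string html = client.DownloadString("https://example.com/product/123");

// Step 2: parse it and pull out the single value we care about.
var doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNode priceNode = doc.DocumentNode.SelectSingleNode("//span[@class='price']");
string price = priceNode != null ? priceNode.InnerText.Trim() : null;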
I'm on the run and going to expand on this answer asap.
Yes, a WebClient can work well for this. The WebBrowser control will work as well, depending on your requirements. If you are going to load the document into an HtmlDocument (the IE HTML DOM), then it might be easier to use the WebBrowser control.
The HtmlDocument object that is now built into .NET can be used to parse the HTML. It is designed to be used with the WebBrowser control, but you can use the implementation from the mshtml DLL as well. I have not used the HtmlAgilityPack, but I hear that it can do a similar job.
The HTML DOM objects will typically handle, and fix up, most of the ugly HTML that you throw at them, as well as giving you a nicer way to parse the HTML: document.GetElementsByTagName to get a collection of tag objects, for example.
As for handling the changing requirements of the site, it sounds like a good candidate for the strategy pattern. You could load the strategies for each site using reflection or something of that sort.
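A rough sketch of what that could look like (all of the names here are illustrative):

using System;
using System.Collections.Generic;

// One strategy per target site; each knows how to dig the price out of that site's HTML.
public interface IPriceExtractor
{
    decimal ExtractPrice(string html);
}

public class ContosoShopExtractor : IPriceExtractor
{
    public decimal ExtractPrice(string html)
    {
        // Site-specific parsing (XPath, regex, whatever the page needs) goes here.
        throw new NotImplementedException();
    }
}

public static class ExtractorRegistry
{
    // Could just as well be populated via reflection over a plug-in assembly,
    // so a new site is a new class rather than a change to the core engine.
    private static readonly Dictionary<string, IPriceExtractor> Extractors =
        new Dictionary<string, IPriceExtractor>(StringComparer.OrdinalIgnoreCase)
        {
            { "contoso.com", new ContosoShopExtractor() }
        };

    public static IPriceExtractor For(string host)
    {
        return Extractors[host];
    }
}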
I have worked on a system that uses XML to define a generic set of parameters for extracting text from HTML pages. Basically it would define start and end elements to begin and end extraction. I have found this technique to work well enough for a small sample, but it gets rather cumbersome and difficult to customize as the collection of sites grows larger and larger. Keeping the XML up to date, and trying to keep both the XML and the code generic enough to handle any type of site, is difficult. But if the type and number of sites is small, this might work.
One last thing to mention is that you might want to add a cleaning step to your approach. A flexible way to clean up HTML as it comes into the process was invaluable in code I have worked on in the past. Perhaps implementing some kind of pipeline would be a good approach if you think the domain is complex enough to warrant it, but even just a method that runs a few regexes over the HTML before you parse it would be valuable: getting rid of images, replacing particular misused tags with nicer HTML, and so on. The amount of really dodgy HTML out there continues to amaze me...
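As an illustration, even a trivial pre-parse pass along these lines (the rules are invented examples) takes a surprising amount of noise out of the input:

using System.IO;
using System.Text.RegularExpressions;

string html = File.ReadAllText("page.html");

// Get rid of images and collapse runs of whitespace before handing the
// document to the real parser; real rules would be per-site and grow over time.
html = Regex.Replace(html, @"<img[^>]*>", string.Empty, RegexOptions.IgnoreCase);
html = Regex.Replace(html, @"\s{2,}", " ");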
I have read this article by ScottGu
on making use of a user control to build client-side templates,
and this one too:
Encosia.
Are they the same, regarding performance?
OK, Encosia's (Dave Ward's) example uses the jQuery plug-in jTemplates to do its work.
ScottGu's utilizes more of the ASP.NET stack.
That said, ScottGu's sample lacks some finesse.
As for performance...I hate that question. Too many variables. But...
ScottGu's example will move more bits over the wire: you are essentially sending the entire HTML output to the browser via a web service call.
Encosia's example sends the rawest form of the data possible (JSON) to the browser and then turns it into HTML using jQuery/jTemplates/JavaScript.
In theory, Encosia's example will perform better: less data moves over the wire, and there should be less server load as well. But it does mean more work for the browser (nominal at best here).
That said, for small amounts of data, I doubt it would matter either way.
Both samples use JSON, which you get for free when you use .NET web services, so the amount of data that goes over the wire will be the same.
On the client side, I don't know about the performance of the library generated by ScriptServiceAttribute, but the difference between doing it yourself and using that library should be marginal.
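For reference, the ASMX flavour both articles build on looks roughly like this (the service and the Product type are placeholders); with [ScriptService] applied, requests made with an application/json content type get a JSON response automatically:

using System.Web.Script.Services;
using System.Web.Services;

[ScriptService]
public class ProductService : WebService
{
    [WebMethod]
    public Product[] GetProducts()
    {
        return new[] { new Product { Name = "Widget", Price = 9.99m } };
    }
}

public class Product
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}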
The Encosia example uses jTemplates. jTemplates can give you a good boost in performance when it comes to fetching large lists and displaying them in repeating sections (like HTML tables).
reply to devmania:
Scott's version applies the template server-side and then sends the formatted HTML plus data to the client. The HTML here can be a real overhead (in the case of a table, think about all the tr's, td's, style attributes, spacing between tags...).
jTemplates renders client-side. The data is sent in the more efficient and compact JSON format (just the data, not the HTML). The template that jTemplates has to read is also much smaller, as it only contains the definition for the first row.
Yes, it is much easier to render server-side. Server-side rendering can also be more flexible, as you can access data sources that you don't have on the client side.
Client-side rendering can in many cases be more efficient, and with some JavaScript you can make it as flexible as server-side rendering. But I reckon complex client-side rendering would take more time to develop.