External Style Sheet not working with HtmlTextWriter - c#

I am trying to generate a PDF from an HTML string using the DynamicPDF HtmlConverter library.
To generate the HTML string I am using the HtmlTextWriter class from System.Web.UI.
I am trying to add the external style sheet as follows:
StringBuilder sb = new StringBuilder();
sb.Append(@"<!DOCTYPE html>" + Environment.NewLine);
StringWriter stringWriter = new StringWriter();
using (HtmlTextWriter writer = new HtmlTextWriter(stringWriter))
{ //adding head and link tag
writer.RenderBeginTag(HtmlTextWriterTag.Html);
writer.RenderBeginTag(HtmlTextWriterTag.Head);
writer.Write("<meta charset=" + "\"UTF-8\">");
//Add Link tag attributes
writer.AddAttribute(HtmlTextWriterAttribute.Rel, "stylesheet");
writer.AddAttribute(HtmlTextWriterAttribute.Href, @"~\Stylesheet1.css"); //style sheet reference
writer.AddAttribute(HtmlTextWriterAttribute.Type, "text/css");
writer.RenderBeginTag(HtmlTextWriterTag.Link);
writer.RenderEndTag(); // end of Link tag
writer.RenderEndTag(); // end of head tag
// Body tag
writer.RenderBeginTag(HtmlTextWriterTag.Body);
writer.RenderEndTag(); // end of Body tag
writer.RenderEndTag(); // end of Html tag
}
sb.Append(stringWriter);
So "sb" holds the HTML string, which is passed to the DynamicPDF library method to generate the PDF as below:
ceTe.DynamicPDF.HtmlConverter.Converter.Convert(sb.ToString(), @"~\output3.pdf", null, options);
The external style sheet has no effect on the HTML controls.
Any suggestions on how to use an external style sheet with HtmlTextWriter and the DynamicPDF library to generate a PDF?

The issue is with the path used for the stylesheet. HTML does not recognize any special meaning for the '~' character. See RFC 3986, Section 2.3.
In Linux, '~' has a special meaning and translates to the home directory, but for HTML it is just another character. I would suggest using Path.GetFullPath("~") to get the full path to the home directory and then using that instead.
Update:
I was mostly concentrating on the HTML part, but I noticed that the third parameter of ceTe.DynamicPDF.HtmlConverter.Converter.Convert() is set to null. It must be the base path for all the file paths used in the HTML string. See the DynamicPDF reference.
So the code should look something like this:
ceTe.DynamicPDF.HtmlConverter.Converter.Convert(sb.ToString(), @"~\output3.pdf", new Uri(@"file://C:\Users\Z0042ADE\source\repos\HtmlTextWriterDemo\HtmlTextWriterDemo\"), options);
Then you may not need to use any path in the HTML string. The href can be just the stylesheet name:
writer.AddAttribute(HtmlTextWriterAttribute.Href, @"Stylesheet1.css");
Hopefully this will fix the issue.
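To illustrate why a base path lets the href be just a file name, here is a minimal sketch using only System.Uri. It does not call DynamicPDF at all; the folder C:\Temp\Resource and the class name BasePathDemo are hypothetical, and the assumption is that the converter resolves relative references the way standard URI resolution does.

```csharp
using System;

public static class BasePathDemo
{
    // Resolves a relative resource name against a base folder URI,
    // following standard relative-reference resolution.
    public static Uri Resolve(Uri baseUri, string relative)
    {
        return new Uri(baseUri, relative);
    }

    public static void Main()
    {
        // Hypothetical folder. The trailing slash marks it as a directory,
        // so relative names are looked up inside it.
        var folder = new Uri("file:///C:/Temp/Resource/");
        Console.WriteLine(Resolve(folder, "Stylesheet1.css").AbsoluteUri);

        // Without the trailing slash the last segment is treated as a file
        // name and is replaced during resolution.
        var noSlash = new Uri("file:///C:/Temp/Resource");
        Console.WriteLine(Resolve(noSlash, "Stylesheet1.css").AbsoluteUri);
    }
}
```

If the stylesheet is still not picked up, printing the resolved URI like this is a quick way to check whether the base path points at the folder you expect.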

This worked for me :
You can specify external CSS in the HTML source and use it for conversion without any problem using the DynamicPDF library. You can get the CSS from a website using its URL. If you are using an HTML string as input to the conversion and want to use external CSS, you will need to specify the base path from which resources are picked up. Please refer to the documentation on HTML converter base URLs at:
https://www.dynamicpdf.com/docs/dotnet/html-converter-base-urls
Please see the code example below:
The HTML text below uses external styles from a file placed in the folder specified by the base path Uri.
Uri basepath = new Uri(@"C:\Temp\Resource\sytlesheet.css");
ceTe.DynamicPDF.HtmlConverter.Converter.Convert(htmlText, @"C:\Temp\MyHTmlPDF.pdf", basepath);

Related

SelectPDF ConvertUrl Non-English Characters Error

In my demo project I'm using the SelectPdf tool to convert HTML pages to PDF documents. These HTML pages are stored locally, so I'm using the ConvertUrl function for the conversion. Here is the code:
`
string url = AppDomain.CurrentDomain.BaseDirectory + "HTML" + "\\OrderName_" + DateTime.Now.ToString("yyyy'-'MM'-'dd'_'HH'-'mm_") + MockOrderNo + ".html";
HtmlToPdf converter = new HtmlToPdf();
PdfDocument doc = converter.ConvertUrl(url);
`
Then I save the PDF document using doc.Save(). Here is the resulting PDF document.
As you can see, there is a problem displaying Turkish characters like "İ,ı,ş,ğ...". How can I resolve this using SelectPdf? If solving this with SelectPdf is not possible, what other PDF conversion tools are preferable that do not have this kind of problem?
Also for my requirements I don't use ConvertHtmlString function. I need to store html pages in a folder, convert these html pages to pdf and store those pdf documents in an another folder.
Thanks for your help
I just changed the encoding of the HTML file to windows-1252. This solved the problem.

How to read pdf file to a text file in a proper format using Spire.PDF or any other library?

How can I read pdf files and save contents to a text file using Spire.PDF?
For example: Here is a pdf file and here is the desired text file from that pdf
I tried the below code to read the file and save it to a text file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");
StringBuilder buffer = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
buffer.Append(page.ExtractText());
}
doc.Close();
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);
But the output text file is not properly formatted. It has unnecessary whitespace, and a complete paragraph is broken into multiple lines, etc.
How do I get the desired result as in the desired text file?
Additionally, is it possible to detect and mark (for example, add a tag to) text in bold, italic or underline forms as well? Things also get more problematic for pages that have multiple columns of text.
Using iText
File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));
System.out.println(stes.getResultantText());
This is (as the code says) a basic/simple text extraction strategy.
More advanced examples can be found in the documentation.
Use IronOCR
var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));
For reference: https://ironsoftware.com/csharp/ocr/
Using this you should get formatted text output, but not the exact output you want.
If you want exact, pre-interpreted output, you should check paid OCR services like the OmniPage Capture SDK and the ABBYY FineReader SDK.
That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention, and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.
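As a rough idea of what "cleaning up the result" could look like, here is a minimal sketch that post-processes the extracted string. It is not part of Spire.PDF or any other library; the class ExtractedTextCleaner and its rules are assumptions chosen for illustration: runs of spaces are collapsed, single line breaks are treated as wraps inside a paragraph, and blank lines are treated as paragraph boundaries.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtractedTextCleaner
{
    // Heuristic cleanup of text extracted from a PDF.
    public static string Clean(string raw)
    {
        // Normalize Windows and old Mac line endings to '\n'.
        string text = raw.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space.
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Drop spaces that sit directly next to a line break.
        text = Regex.Replace(text, @" ?\n ?", "\n");

        // Protect paragraph breaks (two or more newlines) with a placeholder.
        text = Regex.Replace(text, @"\n{2,}", "\u0000");

        // Remaining single newlines are line wraps: join them with a space.
        text = text.Replace("\n", " ");

        // Restore paragraph breaks as one blank line.
        text = text.Replace("\u0000", Environment.NewLine + Environment.NewLine);

        return text.Trim();
    }
}
```

In the question's code this would be applied as File.WriteAllText(fileName, ExtractedTextCleaner.Clean(buffer.ToString())). It does not rejoin hyphenated words, detect bold or italic text, or handle multi-column layouts; those need positional information from the extraction library.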

Get Document OuterHTML of MVC Application in C#

We need to export an entire page of an MVC application to PDF. For that purpose we need to get all of the HTML content, including dynamic content.
To get the contents of the page we used the following code:
string contents = File.ReadAllText(path);
but it gives only the static content of the page (i.e. the page source code), not the new nodes added to the DOM.
We then tried the following code, but this also gives only the static content:
// WebClient object
WebClient client = new WebClient();
// Retrieve resource as a stream
Stream data = client.OpenRead(new Uri("xxxx.html"));
// Retrieve the text
StreamReader reader = new StreamReader(data);
string htmlContent = reader.ReadToEnd();
So I want to get the entire outerHTML of the document in C# without using any third-party DLL. I googled many links, and everyone suggested using the WebBrowser control to get the content.
I don't know how this would be useful for our application. Our application is MVC4. We need to export the entire page to PDF, so we need the entire HTML content (dynamic content too).
How can I use the code below in our MVC application to get the document's outerHTML?
mshtml.HTMLDocument doc = webBrowser1.Document.DomDocument as mshtml.HTMLDocument;
string html = doc.documentElement.outerHTML;
or
var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser.Document.DomDocument;
StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML);
htmlDoc.Load(sr);
Any help on this would be appreciated.
You haven't mentioned what the PDF is intended for. Most likely it is for the visitor of the page to download. If that is true, maybe you could use jsPDF. That way you get around the problem with not having access to the entire page serverside.

HTML String Encoding

I have a requirement to export a PPT from C# without using interop DLLs. I am able to do that, but when I append an HTML string such as "<b>Krishna</b><br/><strong>Ram</strong>" to any slide, it shows the same raw text, not the rendered one. Can anyone help me?
It appears that PowerPoint does not currently support rendering HTML directly in a slide. You must either export your slide show as HTML, or use the built-in formatting as shown in the answer to the following question: Apply Font Formatting to PowerPoint Text Programatically.
Set tr = ActiveWindow.Selection.SlideRange.Shapes(1).TextFrame.TextRange
With tr
.Text = "Hi There Buddy!"
.Words(1).Font.Bold = msoTrue
End With
For an idea of the settings in C# and Office 2010 specifically see Font Members.
You should be able to test my assertion yourself, by HTML encoding your text using the HttpServerUtility.HtmlEncode Method:
String TestString = "This is a <Test String>.";
String EncodedString = Server.HtmlEncode(TestString);

Sending a web page by email programmatically. Image URLs not resolved

I am writing a SharePoint timer job, which needs to pull the content of a web page, and send that HTML as an email.
I am using HttpWebRequest and HttpWebResponse objects to pull the content.
The emailing functionality works fine except for one problem.
The web page which serves up the content of my email contains images.
When the HTML of the page is sent as an email, the image URLs inside the HTML code are all relative URLs; they are not resolved to absolute URLs.
How do I resolve the image URLs to their absolute paths inside the web page content?
Is there any straightforward way to do this? I don't want to run a Regex over the HTML code to replace all relative URLs with absolute URLs.
Try adding a base element to the head of the html document you retrieve. As href attribute you should use the url of the page you are retrieving.
Found this cool Codeplex tool called HtmlAgilityPack.
http://www.codeplex.com/htmlagilitypack
Using this API, we can parse Html like we can parse XML documents. We can also query and search nodes using XPath.
I used the following code snippet to fix the Image URLs
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlMessage);
//This selects all the image nodes (SelectNodes returns null when nothing matches)
HtmlNodeCollection imgNodes = htmlDoc.DocumentNode.SelectNodes("//img[@src]");
if (imgNodes != null)
{
foreach (HtmlNode node in imgNodes)
{
string imgUrl = node.Attributes["src"].Value;
node.Attributes["src"].Value = webAppUrl + imgUrl;
}
}
StringBuilder sb = new StringBuilder();
StringWriter sw = new StringWriter(sb);
htmlDoc.OptionOutputAsXml = false;
htmlDoc.Save(sw);
htmlMessage = sb.ToString();
I've run into this problem a few times, and I don't think there is any magic wand method out there to do it all for you. HtmlAgilityPack does a good job of aggregating the content you need, but you will have to decipher it yourself. For example, getting the list of HtmlNodes that match "//img" could return any of the following items:
<img src="http://www.adg2435.com/pictures/pic.jpg"/> //absolute url
<img src="coolpicture.jpg"/> //relative to the page
<img src="pictures/pic.jpg"/>
<img src="./pictures/pic.jpg"/>
It is up to you to figure out which types of links are going to show up on the given webpage.
You also need to account for things like this: (Truncate your image url after the extension ".jpg")
<img src="/pictures/pic.jpg?45823593&xyz=95325235r0634945823ot49140200"/>
So, I find it handy to keep a few things on hand at any given time:
The source URL for the entire page
The domain for the given url (to do things like say "does the given src contain the domain?")
This is how you would get the domain of the source link:
Uri domainUri = new Uri(fullUrl);
domainUrl = domainUri.GetLeftPart(UriPartial.Authority);
Potentially, you may want the subdomain (i.e. "http://www.mysite.com/pictures/")
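One way to cover all of the src forms listed above without hand-written string checks is to let System.Uri resolve each value against the page URL. This is a minimal sketch, not the code from either answer; the page URL and the class name ImageUrlResolver are hypothetical.

```csharp
using System;

public static class ImageUrlResolver
{
    // Resolves an img src value against the URL of the page it came from.
    // Absolute URLs are returned unchanged; relative ones are resolved
    // against the page URL using standard rules.
    public static string ToAbsolute(string pageUrl, string src)
    {
        var pageUri = new Uri(pageUrl);
        return new Uri(pageUri, src).AbsoluteUri;
    }

    public static void Main()
    {
        // Hypothetical source page.
        string page = "http://www.mysite.com/news/article.html";

        string[] samples =
        {
            "http://www.adg2435.com/pictures/pic.jpg", // already absolute
            "coolpicture.jpg",                         // relative to the page
            "./pictures/pic.jpg",                      // relative with ./
            "/pictures/pic.jpg?45823593&xyz=1"         // root-relative with query
        };

        foreach (string src in samples)
        {
            Console.WriteLine(src + " -> " + ToAbsolute(page, src));
        }
    }
}
```

Inside the HtmlAgilityPack loop shown earlier, the assignment would then become node.Attributes["src"].Value = ImageUrlResolver.ToAbsolute(pageUrl, imgUrl), where pageUrl is the full URL of the page that was fetched.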
I don't want to run a Regex over the html code to replace all relative URLs with absolute URLS.
Too bad, because that's the only way you'll get the images to show up. Would you rather download all the images and embed them in the email too?
