I am using wkhtmltopdf to generate a PDF file from an HTML string. The code is pretty much the following:
// ...
processStartInfo.UseShellExecute = false;
processStartInfo.CreateNoWindow = true;
processStartInfo.RedirectStandardInput = true;
processStartInfo.RedirectStandardOutput = true;
processStartInfo.RedirectStandardError = true;
// ...
process = Process.Start(processStartInfo);
using (StreamWriter streamWriter = process.StandardInput)
{
    streamWriter.AutoFlush = true;
    streamWriter.Write(htmlCode);
}
byte[] buffer = new byte[32768], file;
using (var memoryStream = new MemoryStream())
{
    while (true)
    {
        int read = process.StandardOutput.BaseStream.Read(buffer, 0, buffer.Length);
        if (read <= 0)
            break;
        memoryStream.Write(buffer, 0, read);
    }
    file = memoryStream.ToArray();
}
process.WaitForExit(60000);
process.Close();
return file;
This works as expected, but for one specific piece of HTML, the first call to StandardOutput.BaseStream.Read returns 0 (no data), and StandardOutput.EndOfStream is also true.
I would normally suspect that the wkhtmltopdf tool failed to process the HTML input for some reason, but the problem is that this only happens in about two out of five attempts, so I now suspect it might have something to do with process buffering and output stream reading. However, I can't seem to figure out what the exact problem is.
What could cause this behavior?
Update
Reading StandardError was the obvious approach, but it did not help; it is always an empty string. Neither did process.ExitCode (-1073741819, i.e. 0xC0000005), which, as far as I know, just means that the process crashed with an access violation.
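For reference, since both output pipes are redirected here, one way to capture stderr without risking a full-pipe deadlock is to drain it asynchronously while stdout is being read; a minimal sketch using the standard Process events (the stderr variable is mine):
// drain stderr asynchronously so a full stderr pipe cannot block wkhtmltopdf
var stderr = new System.Text.StringBuilder();
process.ErrorDataReceived += (s, e) => { if (e.Data != null) stderr.AppendLine(e.Data); };
process.BeginErrorReadLine();
// ... read StandardOutput.BaseStream as before, then inspect stderr.ToString() after WaitForExit ...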
After almost a year of production usage, wkhtmltopdf is doing its job, with the issue described above reported not more than five times so far.
The problem usually goes away when adding a DIV somewhere toward the end of the document, with a height large enough (say 20px) to push the last line of text onto the next page, if the page happens to be full.
We knew that the tool sometimes has trouble properly splitting the HTML content into pages, because in such cases it generated (say) seven pages while the page numbering reported only six, so the last page's number was "7 of 6". Maybe it sometimes fails completely and doesn't get to generate the pages at all, we thought. The document is generated from highly dynamic HTML content. Making a change that resulted in shorter/longer content without using dummy DIVs was relatively easy, and that's how we got past the errors so far.
Right now we are testing Puppeteer.
Related
I have a fully working system for creating single-page PDFs from HTML, as below.
After initializing the converter:
var nRecoHTMLToPDFConverter = new HtmlToPdfConverter();
nRecoHTMLToPDFConverter = PDFGenerator.PDFSettings(nRecoHTMLToPDFConverter);
string PDFContents;
PDFContents is an HTML string which is being populated.
The following command works perfectly and gives me the byte[] which I can return:
createDTO.PDFContent = nRecoHTMLToPDFConverter.GeneratePdf(PDFContents);
The problem arises when I want to test and develop the multi-page functionality of the NReco library and convert an arbitrary number of HTML pages to PDF pages.
var stringArray = new string[]
{
    PDFContents, PDFContents,
};
var stream = new MemoryStream();
nRecoHTMLToPDFConverter.GeneratePdfFromFiles(stringArray, null, stream);
var mybyteArray = stream.ToArray();
The PDFContents are exactly the same as above. On paper, this should give me the byte array for two identical PDF pages; however, on calling the GeneratePdfFromFiles method, I get the following exception:
WkHtmlToPdfException: Exit with code 1 due to network error: HostNotFoundError (exit code: 1)
Please help me resolve this if you have experience with this library and its complexities. I have a feeling that I'm not familiar with the proper use of a Stream object in this scenario. I've tested the working single-page line and the malfunctioning multi-page lines in the same method call, so their context is identical.
Many thanks
The GeneratePdfFromFiles method you used expects an array of file names (or URLs): https://www.nrecosite.com/doc/NReco.PdfGenerator/?topic=html/M_NReco_PdfGenerator_HtmlToPdfConverter_GeneratePdfFromFiles_1.htm
If you operate with HTML content as .NET strings, you can simply save it to temp files, generate the PDF, and delete the files afterwards.
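A minimal sketch of that approach (assumes System.IO; the temp-file naming is mine, and it reuses stringArray, the converter, and createDTO from the question):
// write each HTML string from the question's stringArray to a temp .html file
var tempFiles = new string[stringArray.Length];
for (int i = 0; i < stringArray.Length; i++)
{
    tempFiles[i] = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".html");
    File.WriteAllText(tempFiles[i], stringArray[i]);
}
try
{
    using (var stream = new MemoryStream())
    {
        nRecoHTMLToPDFConverter.GeneratePdfFromFiles(tempFiles, null, stream);
        createDTO.PDFContent = stream.ToArray();
    }
}
finally
{
    // remove the temp files whether or not the conversion succeeded
    foreach (var path in tempFiles)
        File.Delete(path);
}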
Yes, I will concur that at first glance, this looks exactly like a duplicate of the following:
How to get webpage title without downloading all the page source
How to get website title from c#
Truth be told, this question is extremely related to those two. However, I noticed that there was a flaw in the code from just about every link I have found so far while researching this particular topic.
Here are some other links that are similar to the above links in content:
Getting (Scraping) the title of a web page using C#
Get a Web Page's Title from a URL (C#)
If it has to be known, I am getting the URL of the page using this particular method, as outlined in this link, but I presumed that it wouldn't matter:
Dragging URLs to Windows Forms controls in C#
The code from the first link works pretty well, albeit with one big issue:
If, for example, I take the URL from this site: http://www.dotnetperls.com/imagelist
And pass it to the code, which I have a modified version of below:
private static string GetWebPageTitle(string url)
{
    // requires: System.Net, System.IO, System.Text, System.Text.RegularExpressions
    HttpWebRequest request = (HttpWebRequest.Create(url) as HttpWebRequest);
    HttpWebResponse response = (request.GetResponse() as HttpWebResponse);
    using (Stream stream = response.GetResponseStream())
    {
        // compiled regex to check for <title></title> block
        Regex titleCheck = new Regex(@"<title>\s*(.+?)\s*</title>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        int bytesToRead = 8092;
        byte[] buffer = new byte[bytesToRead];
        string contents = "";
        int length = 0;
        while ((length = stream.Read(buffer, 0, bytesToRead)) > 0)
        {
            // convert the byte array to a string and append it to the
            // contents that have been downloaded so far
            contents += Encoding.UTF8.GetString(buffer, 0, length);
            Match m = titleCheck.Match(contents);
            if (m.Success)
            {
                // we found a <title></title> match =]
                return m.Groups[1].Value;
            }
            else if (contents.Contains("</head>"))
            {
                // reached end of head block; no title found =[
                return null;
            }
        }
        return null;
    }
}
It returns a blank result, or null. However, when observing the HTML code of the page, the title tag is most definitely there.
Thus, my question is: how can the code be modified or corrected, starting from either my modified code or any of the other four links presented, so that it obtains the web page title from all web pages that have the title tag present, one example being the last link in this question, the one from DotNetPerls?
I am merely guessing, but I wonder if the website behaves differently from other typical sites; maybe it doesn't return any markup the first time you load it, and the browser transparently reloads the site after the first load...
I would prefer an answer with some working example code, if possible.
It's not matching the title because the stream is the raw response stream, which in this case has been gzipped. (Add a Console.WriteLine(contents) inside the loop to see.)
To have the stream automatically decompressed, do this:
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
(Solution for automatic decompression taken from here)
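For context, the property has to be set on the request before GetResponse() is called; in the method from the question, that looks like:
HttpWebRequest request = (HttpWebRequest.Create(url) as HttpWebRequest);
// enable transparent gzip/deflate decompression before fetching the response
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
HttpWebResponse response = (request.GetResponse() as HttpWebResponse);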
We are having an issue with one server and its use of the StreamWriter class. Has anyone experienced something similar to the issue below? If so, what was the solution to fix it?
using (StreamWriter logWriter = File.CreateText(logFileName))
{
    for (int i = 0; i < 500; i++)
        logWriter.WriteLine("Process completed successfully.");
}
When writing out the file the following output is generated:
Process completed successfully.
... (497 more lines)
Process completed successfully.
Process completed s
I tried adding logWriter.Flush() before the close, without any help. The more lines of text I write out, the more data loss occurs.
Had a very similar issue myself. I found that if I enabled AutoFlush before doing any writes to the stream, it started working as expected:
logWriter.AutoFlush = true;
Sometimes, even if you call Flush(), it just won't do the magic, because Flush() makes the stream write most of the data in the stream except the last block of its buffer.
try
{
    // ... write method
    // I don't recommend using 'using' for unmanaged resources
}
finally
{
    stream.Flush();
    stream.Close();
    stream.Dispose();
}
Cannot reproduce this.
Under normal conditions, this should not and will not fail.
Is this the actual code that fails? The text "Process completed" suggests it's an extract.
Any threading involved?
Network drive or local?
etc.
This certainly appears to be a "flushing" problem to me, even though you say you added a call to Flush(). The problem may be that your StreamWriter is just a wrapper for an underlying FileStream object.
I don't typically use the File.CreateText method to create a stream for writing to a file; I usually create my own FileStream and then wrap it with a StreamWriter if desired. Regardless, I've run into situations where I've needed to call Flush on both the StreamWriter and the FileStream, so I imagine that is your problem.
Try adding the following code:
logWriter.Flush();
if (logWriter.BaseStream != null)
    logWriter.BaseStream.Flush();
In my case, this is what I found with the output file:
Case 1: Without Flush() and without Close():
Character Length = 23,371,776
Case 2: With Flush() and without Close():
logWriter.Flush()
Character Length = 23,371,201
Case 3: When properly closed:
logWriter.Close()
Character Length = 23,375,887 (required)
So, in order to get the proper result, you always need to close the Writer instance.
I faced the same problem.
The following worked for me:
using (StreamWriter tw = new StreamWriter(@"D:\Users\asbalach\Desktop\NaturalOrder\NatOrd.txt"))
{
    tw.Write(abc.ToString()); // + Environment.NewLine);
}
Using framework 4.6.1 and under heavy stress, it still has this problem. I'm not sure why it does this, though I found a way to solve it very differently (which strengthens my feeling it's indeed a .NET bug).
In my case I tried to write huge jagged arrays to disk (video caching).
Since the jagged array is quite large, it had to do a lot of repeated writes to store a large set of video frames, and although they were uncompressed and each cache file got exactly 1000 frames, the logged cache files all had different sizes.
I had the problem when I used this:
// note: generateLogfileName is just a function to create a filename
using (FileStream fs = new FileStream(generateLogfileName(), FileMode.OpenOrCreate))
{
    using (StreamWriter sw = new StreamWriter(fs))
    {
        // do your stuff, but it will be unreliable
    }
}
However, when I provided it an Encoding type, all logged files got an equal size, and the problem was gone:
using (FileStream fs = new FileStream(generateLogfileName(), FileMode.OpenOrCreate))
{
    using (StreamWriter sw = new StreamWriter(fs, Encoding.Unicode))
    {
        // all data written correctly, no data lost
    }
}
Note: also read the file with the same encoding type!
This did the trick for me:
streamWriter.Flush();
I'm trying to implement file compression in an application. The application has been around for a while, so it needs to be able to read uncompressed documents written by previous versions. I expected that DeflateStream would be able to process an uncompressed file, but I get errors: for GZipStream, "The magic number in GZip header is not correct", and for DeflateStream, "Found invalid data while decoding". I guess it does not find the header that marks the file as the type it is.
If it's not possible to simply process an uncompressed file, then second best would be to have a way to determine whether a file is compressed, and to choose the method of reading the file accordingly. I've found this link: http://blog.somecreativity.com/2008/04/08/how-to-check-if-a-file-is-compressed-in-c/, but this is very implementation-specific and doesn't feel like the right approach. It can also produce false positives (I'm sure this would be rare, but it does indicate that it's not the right approach).
A third option I've considered is to attempt using DeflateStream and fall back to normal stream IO if an exception occurs. This also feels messy, and it causes VS to break at the exception (unless I untick that exception, which I don't really want to have to do).
Of course, I may simply be going about it the wrong way. This is the code I've tried in .NET 3.5:
Stream reader = new FileStream(fileName, FileMode.Open, readOnly ? FileAccess.Read : FileAccess.ReadWrite, readOnly ? FileShare.ReadWrite : FileShare.Read);
using (DeflateStream decompressedStream = new DeflateStream(reader, CompressionMode.Decompress))
{
    workspace = (Workspace)new XmlSerializer(typeof(Workspace)).Deserialize(decompressedStream);
    if (readOnly)
    {
        reader.Close();
        workspace.FilePath = fileName;
    }
    else
        workspace.SetOpen(reader, fileName);
}
Any ideas?
Thanks!
Luke.
Doesn't your file format have a header? If not, now is the time to add one (you're changing the file format by supporting compression, anyway). Pick a good magic value, make sure the header is extensible (add a version field, or use specific magic values for specific versions), and you're ready to go.
Upon loading, check for the magic value. If not present, use your current legacy loading routines. If present, the header will tell you whether the contents are compressed or not.
Update
Compressing the stream means the file is no longer an XML document, so there's no reason the file can't contain more than just your data stream. You really do want a header identifying your file :)
The below is example (pseudo-)code; I don't know if .NET has a "substream", so SubRangeStream is likely something you'll have to code yourself (DeflateStream probably adds its own header, so a substream might not be necessary; it could turn out useful further down the road, though).
byte[] magic = new byte[8]; // sized to your magic value
Int64 oldPosition = reader.Position;
reader.Read(magic, 0, magic.Length);
if (IsRightMagicValue(magic))
{
    Header header = ReadHeader(reader);
    Stream furtherReader = new SubRangeStream(reader, reader.Position, header.ContentLength);
    if (header.IsCompressed)
    {
        furtherReader = new DeflateStream(furtherReader, CompressionMode.Decompress);
    }
    XmlSerializer xml = new XmlSerializer(typeof(Workspace));
    workspace = (Workspace)xml.Deserialize(furtherReader);
}
else
{
    reader.Position = oldPosition;
    LegacyLoad(reader);
}
In real life, I would do things a bit differently: some proper error handling and cleanup, for instance. Also, I wouldn't have the new loader code directly in the IsRightMagicValue block; rather, I'd dispatch the work either based on the magic value (one magic value per file version), or I would keep a "common header" portion with fields common to all versions. In both cases, I'd use a Factory Method to return an IWorkspaceReader depending on the file version.
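A rough sketch of that factory idea; every name below is hypothetical (the version checks and the concrete readers would be yours to define):
// hypothetical: one reader implementation per file version
interface IWorkspaceReader
{
    Workspace Load(Stream input);
}

static IWorkspaceReader CreateReader(byte[] magic)
{
    if (IsMagicForVersion(magic, 2)) return new V2WorkspaceReader(); // header + optional compression
    if (IsMagicForVersion(magic, 1)) return new V1WorkspaceReader(); // first header format
    return new LegacyWorkspaceReader();                              // no header: old plain-XML files
}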
Can't you just create a wrapper class/function for reading the file and catch the exception? Something like
try
{
    // Try to return the decompressed stream
}
catch (InvalidDataException e)
{
    // Assume it is already decompressed and return it as is
}
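A sketch of what that wrapper could look like for the Workspace loading in the question. The method name, the rewind logic, and the exception choice are my assumptions; note that XmlSerializer can also surface the stream failure wrapped in an InvalidOperationException, so the catch may need widening:
using System.IO;
using System.IO.Compression;
using System.Xml.Serialization;

static Workspace LoadWorkspace(Stream reader)
{
    var serializer = new XmlSerializer(typeof(Workspace));
    long start = reader.Position;
    try
    {
        // try the compressed path first; 'true' leaves the underlying stream open on failure
        var decompressed = new DeflateStream(reader, CompressionMode.Decompress, true);
        return (Workspace)serializer.Deserialize(decompressed);
    }
    catch (InvalidDataException)
    {
        // not Deflate data: rewind and read it as plain, uncompressed XML
        reader.Position = start;
        return (Workspace)serializer.Deserialize(reader);
    }
}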
I have an HttpModule with a filter (PageFilter) where the Writer method of PageFilter is called twice for every page request, unfortunately not with the same result.
The idea of the filter is to locate "</body>" and insert some text/script in front of it. I have located a bunch of minor errors (and corrected them), but this error is playing tricks on me...
The constructor of PageFilter is called once, but its Writer method is called twice per request?
Below is the content of PageFilter.Writer (which runs twice):
string strBuffer = System.Text.Encoding.UTF8.GetString(buffer, offset, count);
try
{
    Regex eof = new Regex("</html>", RegexOptions.IgnoreCase);
    if (!eof.IsMatch(strBuffer))
    {
        // (1) document not complete yet: keep buffering
        responseHtml.Append(strBuffer);
    }
    else
    {
        // (2) end of document reached: insert the script and emit
        responseHtml.Append(strBuffer);
        string finalHtml = responseHtml.ToString();

        Regex re = new Regex("</body>", RegexOptions.IgnoreCase);
        finalHtml = re.Replace(finalHtml, new MatchEvaluator(lastWebTrendsTagMatch));

        // Write the formatted HTML back
        byte[] data = System.Text.Encoding.UTF8.GetBytes(finalHtml);
        responseStream.Write(data, 0, data.Length);
    }
}
catch (Exception ex)
{
    Logging.Logger(Logging.Level.Error, "Failed writing the HTML...", ex);
}
The first time the method runs, case (1) executes, and the second time, case (2) executes... this is not exactly what I want. Does anyone know why, and/or how I can make it work consistently?
The Write method may be called multiple times for a single page. The HttpWriter object chunks together data and then writes it to its output stream. Each time the HttpWriter sends out a chunk of data your response filter's Write method is invoked.
Refer to this for one kind of solution:
HttpResponse.Filter write multiple times
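The linked solution's pattern is to let Write only buffer the chunks and perform the single rewrite when the filter is flushed at the end of the response; a rough sketch reusing the question's responseHtml, responseStream, and lastWebTrendsTagMatch (the assumption that Flush fires once, at the end of the request, is mine):
// Write only buffers; nothing is forwarded to the real output yet
public override void Write(byte[] buffer, int offset, int count)
{
    responseHtml.Append(System.Text.Encoding.UTF8.GetString(buffer, offset, count));
}

// Flush runs when the response completes: rewrite once and forward
public override void Flush()
{
    string finalHtml = Regex.Replace(responseHtml.ToString(), "</body>",
        new MatchEvaluator(lastWebTrendsTagMatch), RegexOptions.IgnoreCase);
    byte[] data = System.Text.Encoding.UTF8.GetBytes(finalHtml);
    responseStream.Write(data, 0, data.Length);
    responseStream.Flush();
}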
Instead of responseStream.Write(data, 0, data.Length);
try responseStream.Write(data, 0, data.Length-1);
Hope you find this useful.
These "events" happens during 1 page request:
isAspx = true og /_layouts/ not found
(I verify that the file is .aspx and
URL not containing /_layouts/)
PageFilter constructor called
Writer method initiated...
eof (regex): (regex containing
created for matching)
!eof.IsMatch(strBuffer): (did not
match the regex )
Writer method initiated... (second
time around callinge the writer)
eof (regex): (regex containing
created for matching)
Regex initiated (matched the regex)
re (regex): (found the body
tag I need for inserting my script)
ScriptInclude = true (I've found the
web.config key telling my app that it
should include the script)
US script used (I have used the US
version of the script also based on a
web config key)
The problem is: on my dev deployment the Writer runs twice, ending up with the above sequence and the script being included. On my test deployment the Writer runs twice but ends up NOT including the script...
I would like to avoid the Writer being called twice, but even more, I'd like to have the script included on the test deployment.