HtmlAgilityPack - Convert MHTML To HTML as String

I have an MHTML file and I am trying to convert it to HTML.
I have installed the HtmlAgilityPack and tried the following code:
var doc = new HtmlAgilityPack.MixedCodeDocument();
doc.Load("C:\\Users\\DickTracey\\Downloads\\Club Membership Report.mhtml");
var ms = new MemoryStream();
var sw = new StreamWriter(ms);
doc.Save(sw);
ms.Position = 0;
var sr = new StreamReader(ms);
return sr.ReadToEnd();
But it always returns null.
Can anyone explain the correct procedure to convert MHTML to HTML please?

MHTML to HTML Decoding in C#!
string mhtml = "This is your MHTML string"; // Make sure the string is in UTF-8 encoding
MHTMLParser parser = new MHTMLParser(mhtml);
string html = parser.getHTMLText(); // This is the converted HTML
Source on GitHub: https://github.com/DavidBenko/MHTML-to-HTML-Decoding-in-C-Sharp.git

I had a quick look at an MHTML file with HxD. Although, as noted above, HtmlAgilityPack has little or no support for MHTML, the format itself looks simple enough. It appears to consist of the usual suspects (unencoded HTML, CSS, JS, graphics encoded in Base64, etc) concatenated in a way (with mime type headers) that could be worked out with a little effort. Having said that, the format is probably fully documented somewhere -- so dust off your browser, write some C# to parse it, and spoon-feed HtmlAgilityPack with the results.
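As a rough illustration of the "parse it yourself" approach described above, here is a minimal sketch that pulls the first text/html part out of an MHTML string using only the base class library. The class and method names (MhtmlSketch, ExtractHtml, DecodeQuotedPrintable) are hypothetical, and the code assumes a single-level multipart file whose parts use quoted-printable, base64, or no transfer encoding; it is not a full MIME parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class MhtmlSketch
{
    // Returns the decoded body of the first text/html part, or null if none is found.
    public static string ExtractHtml(string mhtml)
    {
        // Parts are separated by "--" plus the boundary named in the top-level Content-Type header.
        Match boundary = Regex.Match(mhtml, "boundary=\"?([^\";\\r\\n]+)", RegexOptions.IgnoreCase);
        if (!boundary.Success) return null;
        string delimiter = "--" + boundary.Groups[1].Value;

        foreach (string rawPart in mhtml.Split(new[] { delimiter }, StringSplitOptions.None))
        {
            // Normalize line endings, then split headers from body at the first blank line.
            string part = rawPart.Replace("\r\n", "\n").TrimStart('\n');
            int headerEnd = part.IndexOf("\n\n", StringComparison.Ordinal);
            if (headerEnd < 0) continue;

            string headers = part.Substring(0, headerEnd);
            string body = part.Substring(headerEnd + 2);
            const RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Multiline;
            if (!Regex.IsMatch(headers, @"^Content-Type:\s*text/html", opts)) continue;

            // Assumption: fall back to UTF-8 when the part declares no charset.
            Match charset = Regex.Match(headers, "charset=\"?([^\";\\s]+)", RegexOptions.IgnoreCase);
            Encoding encoding = charset.Success
                ? Encoding.GetEncoding(charset.Groups[1].Value)
                : Encoding.UTF8;

            string transfer = Regex.Match(headers, @"^Content-Transfer-Encoding:\s*(\S+)", opts)
                .Groups[1].Value.ToLowerInvariant();
            if (transfer == "quoted-printable") return encoding.GetString(DecodeQuotedPrintable(body));
            if (transfer == "base64") return encoding.GetString(Convert.FromBase64String(body));
            return body; // 7bit, 8bit or binary: nothing to decode
        }
        return null;
    }

    // Decodes quoted-printable text whose line endings were normalized to "\n".
    private static byte[] DecodeQuotedPrintable(string text)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '=' && i + 1 < text.Length && text[i + 1] == '\n')
            {
                i += 1; // soft line break: drop "=" and the newline
            }
            else if (c == '=' && i + 2 < text.Length
                     && Uri.IsHexDigit(text[i + 1]) && Uri.IsHexDigit(text[i + 2]))
            {
                bytes.Add(Convert.ToByte(text.Substring(i + 1, 2), 16)); // "=XX" escape
                i += 2;
            }
            else
            {
                bytes.Add((byte)c); // everything else is plain ASCII
            }
        }
        return bytes.ToArray();
    }
}
```

The returned string can then be passed to HtmlAgilityPack's HtmlDocument.LoadHtml. Images and stylesheets stored in the other parts are ignored by this sketch.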

Related

file data seems to be corrupted when reading file as a string

I'm trying to read a file as string. But it seems that the data is corrupted.
string filepaths = Files[0].FullName;
System.IO.StreamReader myFile = new System.IO.StreamReader(filepaths);
string datas = myFile.ReadToEnd();
But datas contains "pk0101" and similar characters instead of the original data. I'm doing this so I can replace a placeholder with the string in datas, and when I do the replacement, the inserted text is "0101" and so on. Is this because of the content in datas? How can I read the file as a string? Your help will be greatly appreciated. Thank you.

A *.docx file is a ZIP package of XML parts, which is why its raw bytes start with "PK" rather than readable text. Take a look here to become more familiar with this format definition.
For working with Office formats, Microsoft recommends using the Open XML SDK (the DocumentFormat.OpenXml library).
Here is a great article for learning how to work with Word files.
It works as follows:
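using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
// filePath is a placeholder for the path to your .docx file.
using (var wordDocument = WordprocessingDocument.Open(filePath, false)) // false = open read-only
{
    var body = wordDocument.MainDocumentPart.Document.Body;
    // Text of the first paragraph in the document body.
    var text = body.GetFirstChild<Paragraph>().InnerText;
}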
Also, take a look at this SO question: How do I read data from a word with format using the OpenXML Format SDK with c#?
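To make the package structure concrete, here is a minimal sketch that reads the paragraph text without the Open XML SDK, using only System.IO.Compression and LINQ to XML. This is a swapped-in technique rather than the SDK approach recommended above; the names DocxSketch and ReadText are hypothetical, and the sketch assumes the main document part lives at word/document.xml, which is the usual location but is not guaranteed for every package.

```csharp
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; the Open XML SDK is the more robust choice.
public static class DocxSketch
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each paragraph on its own line, or null if the part is missing.
    public static string ReadText(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main document part is stored at the conventional path.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null) return null;

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);
                // Each w:p is a paragraph; its visible text sits in w:t elements.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
                return string.Join("\n", paragraphs);
            }
        }
    }
}
```

Calling DocxSketch.ReadText(File.OpenRead(path)) yields a plain string that can be used for placeholder replacement, at the cost of losing all formatting.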

DOS-CSV import via C# and German Umlaute (Ä,Ü,Ö,ä,ü,ö)

I have the following problem: I have some Excel sheets and must export them to DOS-CSV format (for various reasons). As a result, the German umlauts (Ä,Ü,Ö,ä,ü,ö) are not exported correctly. In a next step these CSV files must be imported into a WinForms application. Is it possible to get back the correct characters Ä,Ü,Ö,ä,ü,ö during the import?

If you choose the DOS-CSV format, Excel is going to encode the document using the 437 codepage (found that here). You can convert it back to UTF-8 using a little bit of code:
Encoding dosEncoding = Encoding.GetEncoding(437);
string original = String.Empty;
using (StreamReader sr = new StreamReader(@"D:\Path\To\output.csv", dosEncoding))
{
    // Decoding with the DOS code page already yields a correct .NET string.
    original = sr.ReadToEnd();
}
// Re-encode as UTF-8 bytes (for example, to write the text back out as UTF-8).
byte[] encBytes = dosEncoding.GetBytes(original);
byte[] utf8Bytes = Encoding.Convert(dosEncoding, Encoding.UTF8, encBytes);
string converted = Encoding.UTF8.GetString(utf8Bytes);
I tested this by putting Ä,Ü,Ö,ä,ü,ö into a cell and then saving it as a DOS formatted CSV file in Excel. Looking at the document, Excel turned it into Ž,š,™,„,,”.
Running it through the above code turned it back into Ä,Ü,Ö,ä,ü,ö.
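The byte values behind that test can be checked in isolation. The sketch below assumes .NET Core or .NET 5+, where legacy code pages such as 437 have to be registered through CodePagesEncodingProvider before Encoding.GetEncoding(437) will succeed; the names DosCsvSketch and ReadDosText are hypothetical. It also shows that decoding with the right code page is the only step needed to obtain a correct string.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration.
public static class DosCsvSketch
{
    // Reads a file that was written with the DOS (OEM) code page 437.
    public static string ReadDosText(string path)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 437
        // is only available after registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding dos = Encoding.GetEncoding(437);

        // Once decoded, the string is ordinary UTF-16 text; no further conversion is required.
        return File.ReadAllText(path, dos);
    }
}
```

In code page 437 the bytes 8E, 9A, 99, 84, 81, 94 stand for Ä, Ü, Ö, ä, ü, ö, and the same bytes read as windows-1252 give the Ž, š, ™, „, ” characters described above (81 has no printable character there).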

Correcting Encoding in a large Xml File

I'm importing data from XML files containing this type of content:
<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>
The XML is loaded via:
XmlDocument doc = new XmlDocument();
try
{
doc.Load(fullFilePath);
}
When I execute this code with the data shown above, I get an exception about an illegal character. I understand that part just fine.
I'm not sure which encoding this is or how to solve this problem. Is there a way I can change the encoding of the XmlDocument or another method to make sure the above content is parsed correctly?
Update: I do not have any encoding declaration or <?xml in this document.
I've seen some links say to add it dynamically? Is this UTF-16 encoding?

It appears that:
The name was ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
The XML file was encoded using the DOS "OEM" code page, probably 437 or 850.
But it was decoded using windows-1252 (the "ANSI" code page).

If you look at the file with a hex editor (HXD or Visual Studio, for instance), what exactly do you see?
Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?
The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond with a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...
Edit: dan04 hit the nail on the head. ™ in cp-1252 has hex value 99, and š is 9a. In cp-437 and cp-850, hex 99 represents Ö, and 9a Ü.
The fix is simple: just specify this encoding when opening your XML file:
XmlDocument doc = new XmlDocument();
using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
doc.Load(reader);
}

From here:
Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
using (var xmlreader = new XmlTextReader(stream))
{
xmlreader.MoveToContent();
encoding = xmlreader.Encoding;
}
}
You might want to take a look at this: How to best detect encoding in XML file?
For actual reading you can use StreamReader to take care of the BOM (byte order mark):
string xml;
// The second argument is detectEncodingFromByteOrderMarks.
using (var reader = new StreamReader("FilePath", true))
{
    xml = reader.ReadToEnd();
}
Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not it will default to UTF8.
Edit 2: Detecting Text Encoding for StreamReader

Obviously you provided a fragment of the XML document since it's missing a root element, so I'll assume that was your intention. Is there an xml processing instruction at the top like <?xml version="1.0" encoding="UTF-8" ?>?

Need help for parsing HTML in C#

For personal use I am trying to parse a small HTML page that shows the results of the French soccer championship in a simple grid.
var Url = "http://www.lfp.fr/mobile/ligue1/resultat.asp?code_jr_tr=J01";
WebResponse result = null;
WebRequest req = WebRequest.Create(Url);
result = req.GetResponse();
Stream ReceiveStream = result.GetResponseStream();
Encoding encode = System.Text.Encoding.GetEncoding(0);
StreamReader sr = new StreamReader(ReceiveStream, encode);
while (sr.Read() != -1)
{
Line = sr.ReadLine();
Line = Regex.Replace(Line, @"<(.|\n)*?>", " ");
Line = Line.Replace(" ", "");
Line = Line.TrimEnd();
Line = Line.TrimStart();
After that I really don't have a clue whether to process the stream line by line or all at once, or how to retrieve only each team's name together with the number that follows it, which would be the score.
In the end I want to put both teams and their scores in a list or XML to use in a phone application.
If anyone has an idea, that would be great. Thanks!

Take a look at Html Agility Pack

You could put the stream into an XmlDocument, allowing you to query via something like XPath. Or you could use LINQ to XML with an XDocument.
It's not perfect though, because HTML files aren't always well-formed XML (don't we know it!), but it's a simple solution using stuff already available in the framework.

You'll need an SgmlReader, which provides an XML-like API over any SGML document (which an HTML document really is).

You could use the Regex.Match method to pull out the team name and score. Examine the html to see how each row is built up. This is a common technique in screen scraping.
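A minimal sketch of that Regex approach follows. The row layout is an assumption made purely for illustration (three adjacent cells holding the home team, a score such as "2 - 1", and the away team); the real page has to be inspected and the pattern adjusted accordingly. ScoreScraperSketch and Parse are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the pattern depends on the actual page markup.
public static class ScoreScraperSketch
{
    // Assumed row shape: <td>Home</td><td>2 - 1</td><td>Away</td>
    private static readonly Regex Row = new Regex(
        @"<td[^>]*>\s*(?<home>[^<]+?)\s*</td>\s*" +
        @"<td[^>]*>\s*(?<hg>\d+)\s*-\s*(?<ag>\d+)\s*</td>\s*" +
        @"<td[^>]*>\s*(?<away>[^<]+?)\s*</td>",
        RegexOptions.IgnoreCase);

    // Returns one tuple per matched result row, in document order.
    public static List<(string Home, int HomeGoals, int AwayGoals, string Away)> Parse(string html)
    {
        var results = new List<(string Home, int HomeGoals, int AwayGoals, string Away)>();
        foreach (Match m in Row.Matches(html))
        {
            results.Add((
                m.Groups["home"].Value,
                int.Parse(m.Groups["hg"].Value),
                int.Parse(m.Groups["ag"].Value),
                m.Groups["away"].Value));
        }
        return results;
    }
}
```

Running the pattern over the whole response text, rather than line by line, avoids missing rows that span several lines.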

Retrieving AlternateView's of email

I can't seem to retrieve the AlternateView from System.Net.Mail.AlternateView.
I have an application that is pulling email via POP3. I understand how to create an alternate view for sending, but how does one select the alternate view when looking at a received email? I have the received email as a System.Net.Mail.MailMessage object, so I can easily pull out the body, encoding, subject line, etc. I can see the AlternateViews, that is, I can see that the count is 2, but I want to extract something other than the HTML that is currently returned when I request the body.
Hope this makes some amount of sense and that someone can shed some light on this. In the end, I'm looking to pull the plaintext out, instead of the HTML and would rather not parse it myself.

Mightytighty is leading you down the right path, but you shouldn't presume the type of encoding. This should do the trick:
var dataStream = view.ContentStream;
dataStream.Position = 0;
byte[] byteBuffer = new byte[dataStream.Length];
var encoding = Encoding.GetEncoding(view.ContentType.CharSet);
string body = encoding.GetString(byteBuffer, 0,
dataStream.Read(byteBuffer, 0, byteBuffer.Length));

I was having the same problem, but you just need to read it from the stream. Here's an example:
public string ExtractAlternateView()
{
var message = new System.Net.Mail.MailMessage();
message.Body = "This is the TEXT version";
//Add textBody as an AlternateView
message.AlternateViews.Add(
System.Net.Mail.AlternateView.CreateAlternateViewFromString(
"This is the HTML version",
new System.Net.Mime.ContentType("text/html")
)
);
var dataStream = message.AlternateViews[0].ContentStream;
byte[] byteBuffer = new byte[dataStream.Length];
return System.Text.Encoding.ASCII.GetString(byteBuffer, 0, dataStream.Read(byteBuffer, 0, byteBuffer.Length));
}

There is a simpler way:
public string GetPlainTextBodyFromMsg(MailMessage msg)
{
StreamReader plain_text_body_reader = new StreamReader(msg.AlternateViews[0].ContentStream);
return(plain_text_body_reader.ReadToEnd());
}
This works if the first alternate view is the plain-text version, as is usually the case.
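If relying on the order of the views is too fragile, the plain-text view can be picked by its media type instead. The following is a minimal sketch combining that lookup with the charset handling from the earlier answer; AlternateViewSketch and GetPlainText are hypothetical names, and falling back to ASCII when no charset is declared is an assumption.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net.Mail;
using System.Net.Mime;
using System.Text;

// Hypothetical helper for illustration.
public static class AlternateViewSketch
{
    // Returns the content of the first text/plain alternate view, or null if there is none.
    public static string GetPlainText(MailMessage msg)
    {
        AlternateView view = msg.AlternateViews.FirstOrDefault(v =>
            string.Equals(v.ContentType.MediaType, MediaTypeNames.Text.Plain,
                          StringComparison.OrdinalIgnoreCase));
        if (view == null) return null;

        // Assumption: treat a missing charset as ASCII.
        Encoding encoding = view.ContentType.CharSet != null
            ? Encoding.GetEncoding(view.ContentType.CharSet)
            : Encoding.ASCII;

        view.ContentStream.Position = 0;
        // leaveOpen keeps the view's stream usable after reading.
        using (var reader = new StreamReader(view.ContentStream, encoding, true, 1024, leaveOpen: true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

This only helps when the library that built the MailMessage actually populated AlternateViews with a text/plain view.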

It's not immediately possible to parse an email with the classes available in the System.Net.Mail namespace; you either need to create your own MIME parser or use a third-party library instead.
This great Codeproject article by Peter Huber SG, entitled 'POP3 Email Client with full MIME Support (.NET 2.0)' will give you an understanding of how MIME processing can be implemented, and the related RFC specification articles.
You can use the Codeproject article as a start for writing your own parser, or appraise a library like SharpMimeTools, which is an open source library for parsing and decoding MIME emails.
http://anmar.eu.org/projects/sharpmimetools/
Hope this helps!
