Save a string with HTML format to doc (with unicode)


I have tried many approaches but can't get the result I expect. Please help, thanks.
I have a byte array that I read into a string. The result is:
string mystring = "<p>Today is a <b>beautiful</b> day</p>";
Now I want to write it to a DOC file so that the HTML formatting is applied.
My problem is that I can't get a file with Unicode encoding.
This is what I want to see in the DOC file:
Today is a beautiful day
Can anyone help me find a way to save my string to a DOC file with Unicode encoding?

If you want to create a very simple Word document from C# code, I would use the .docx format, and follow the MSDN blog entry here.
It includes full sample code to download.
From the page linked to above:
This code will generate a DOCX that loads in Word 2007 or any other
valid Open XML consumer.
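
A different, lower-tech route than the DOCX approach above is to write the HTML itself to a file with a .doc extension, encoded as UTF-8 with a byte order mark. This is only a sketch: it assumes Word will open an HTML file saved under a .doc name (newer versions may show a format warning first), and the class name HtmlDocWriter and the method Save are made up for illustration.

```csharp
using System.IO;
using System.Text;

static class HtmlDocWriter
{
    // Hypothetical helper: wraps an HTML fragment in a minimal page and
    // writes it as UTF-8 with a BOM so the reader can detect the encoding.
    public static void Save(string bodyHtml, string path)
    {
        string html =
            "<html><head>" +
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
            "</head><body>" + bodyHtml + "</body></html>";

        // UTF8Encoding(true) emits the 3-byte UTF-8 preamble (EF BB BF).
        File.WriteAllText(path, html, new UTF8Encoding(true));
    }
}
```

Calling HtmlDocWriter.Save(mystring, path) with the string from the question would produce a file containing the paragraph markup. Whether the result is acceptable depends on how the DOC file is consumed afterwards.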

Related

Reading text that is embedded in a PDF?

I have a PDF that has a string that is in the catalog portion of the PDF file. I need to read that string.
With iTextSharp 5 I was able to read the catalog and pull out the string.
I am now limited to another library (Syncfusion) and in that library the catalog is marked as private and I do not have access to it.
I am able to "open" the PDF in Notepad++ and I can see the string as plain text. I need to programmatically open that file and retrieve that string. Using ReadAllBytes I can read the file, but then I am at a loss as to how to search it for a specific string.
Any suggestions or examples that I can explore would be appreciated.

If you know the encoding of the text, you could always convert the raw bytes to a string and then use a Regex to find what you need.
Here's an example of that:
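using System.IO;
using System.Text;
using System.Text.RegularExpressions;

var bytes = File.ReadAllBytes("example.pdf");
string pdfStr = Encoding.UTF8.GetString(bytes); // assumes the text is UTF-8
Regex pdfReg = new Regex(...); // the regex for finding your string
string pdfSubstring = pdfReg.Match(pdfStr).Value; // the matched text (empty if there is no match)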
C# Regex Reference
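
The same idea can be sketched as a small helper. A PDF is mostly binary, so this version decodes with ISO-8859-1 (every byte maps to exactly one character, nothing is dropped) instead of UTF-8. That is an assumption that only fits when the string you are after is plain ASCII/Latin-1 in the file. The class name, the method name, and the convention that the pattern has one capture group are made up for illustration.

```csharp
using System.Text;
using System.Text.RegularExpressions;

static class PdfTextSearch
{
    // Hypothetical helper: returns the first capture group of the first
    // match, or null when the pattern does not occur in the raw bytes.
    public static string FindFirst(byte[] bytes, string pattern)
    {
        // ISO-8859-1 is a lossless byte-to-char mapping, so binary
        // sections of the file do not corrupt the surrounding text.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);

        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Usage would look like PdfTextSearch.FindFirst(File.ReadAllBytes("example.pdf"), @"/MyKey \((.*?)\)"), where /MyKey is a placeholder for whatever precedes your string in the file.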

Convert all HTML entities not predefined for XML to unicode

I am trying to manipulate a string containing HTML code and then save the content to an .htm file. Afterwards the .htm file is imported into a Word file. The goal is to append a document formatted in HTML to a Word document. This process is part of a much larger program and I cannot modify the given parameters.
To easily modify the HTML-Code I thought using XDocument would be a great idea.
So I tried this:
void AppendContent(string content, Document doc)
{
    string filePath = ...; //somewhere in /AppData/Local
    var xDoc = XDocument.Parse(content);
    // code left out because irrelevant
    // Finding all "img" elements, in order to
    // extract the embedded picture and save it as external file
    FileHelper.SaveToFile(filePath, xDoc.ToString());
    //... After this, the file is appended to the word file (the one in doc)
}
The first attempt actually worked, with a small test HTML. Using any of the big documents I'm trying to append to the Word document causes an exception to be thrown:
XDocument.Parse cannot parse entities like "nbsp" or "uuml" (German ü). I already found out that XML only supports a handful of predefined entities, so I would have to manually add the definitions to the HTML file. This is not an option, because this operation is supposed to work with ANY HTML file.
I found following fix:
var decodedContent = WebUtility.HtmlDecode(content);
var xDoc = XDocument.Parse(decodedContent);
This converts all entities to the character they represent. So "uuml" is converted to "ü", etc. This worked until I hit a document that contained the "amp" entity, which is then converted to "&", and so XDocument.Parse complains again.
I'm looking for a way to convert HTML entities to their Unicode representation ("\0x1234"), or an HTML decode that does not decode the XML-predefined entities.
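
One way to get a decode that skips the XML-predefined entities is to run WebUtility.HtmlDecode only on the named entities that are not amp, lt, gt, quot or apos. This is a minimal sketch, not tested against arbitrary HTML: the class and method names are made up, numeric references such as &#228; are left alone because XML accepts them, and a name that HtmlDecode does not recognize stays as it is (so the parse would still fail on it).

```csharp
using System.Net;
using System.Text.RegularExpressions;

static class EntityHelper
{
    // Matches a named entity such as &nbsp; or &uuml;, but not one of the
    // five entities that XML predefines (those must stay escaped).
    private static readonly Regex NonXmlEntity =
        new Regex(@"&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z][A-Za-z0-9]*;");

    // Hypothetical helper: decodes only the entities matched above.
    public static string DecodeNonXmlEntities(string html)
    {
        return NonXmlEntity.Replace(html, m => WebUtility.HtmlDecode(m.Value));
    }
}
```

The result can then be passed to XDocument.Parse in place of the fully decoded string, provided the input is otherwise well-formed XML.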

Convert file path to UTF-8

I want to get, print and write to a text file the full path on disk of a file named A&T+X-8_L_R1.png but when I print it I get A&amp;T+X-8_L_R1.png.
AFAIK I need to change the encoding. I did a search and found this potential solution but it doesn't work:
String filePathString = relativeUri.ToString();
byte[] bytes = Encoding.Default.GetBytes(filePathString);
filePathString = Encoding.UTF8.GetString(bytes);
filePathNode.SetValue(filePathString);
This is the full code of my class: http://pastebin.com/dZLGeS8p
The class searches recursively for *.png files and creates an XML structure from their paths. When I save the XML file the special characters from the paths like & are changed.
Can anyone point me to a solution?

You are writing an XML file, not a plain text file. In XML, an ampersand needs to be escaped to &amp;.
So the result you get is perfectly ok. It's even required to be like this.
I recommend opening the XML file with an application that can properly validate and display XML. That makes it easier to see that the file is correct.
The UTF-8 conversion in your code isn't required. If the XML file is encoded in UTF-8, your XML classes will take care of any required conversions.
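
The escaping round trip can be seen in a few lines. This sketch assumes the XML is built with LINQ to XML (XElement), which the SetValue call in the question suggests but does not prove. The class and method names are made up for illustration.

```csharp
using System.Xml.Linq;

static class XmlEscapeDemo
{
    // Serializing escapes the ampersand in the element text.
    public static string ToXml(string filePath)
    {
        return new XElement("filePath", filePath).ToString();
    }

    // Parsing the XML again gives back the original, unescaped text.
    public static string FromXml(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

XmlEscapeDemo.ToXml("A&T+X-8_L_R1.png") yields <filePath>A&amp;T+X-8_L_R1.png</filePath>, and FromXml on that text returns A&T+X-8_L_R1.png again, so nothing is lost as long as the file is read back with an XML parser.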

How to parse text from MS Word document to string

I am trying to find a way to parse a Word document's text to a string in my project. I have more than 600 Word (.doc) files, and for each one I need to get the text content (with the new lines and tabs if possible) and assign it to a string.
I've been reading stuff about the Open XML SDK but it looks quite complicated for something that looks so simple.

Open XML SDK is only for 2007 and newer formats and it is not trivial to use.
If performance is not an issue you could use Word Automation and have Word do this for you.
It will look something like this:
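using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

var app = new Application();
var doc = app.Documents.Open(documentLocation);
string rangeText = doc.Range().Text; // full text of the document body
doc.Close(); // the document is only read, so no Save() is needed
app.Quit(); // shut down the Word process that was started
// release the COM wrappers
Marshal.ReleaseComObject(doc);
Marshal.ReleaseComObject(app);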
Take a look at http://www.codeproject.com/Articles/18703/Word-2007-Automation or http://www.codeproject.com/Articles/21247/Word-Automation for more complete examples and instructions. Note that this may become a bit more tricky if your documents are more complex (footnotes, text boxes, tables...).
Another option is to have Word save the document as text and then read the text file. Take a look at this - http://msdn.microsoft.com/en-us/library/microsoft.office.tools.word.document.saveas(v=vs.80).aspx

You could take a look at NPOI:
This project is the .NET version of POI Java project at
http://poi.apache.org/. POI is an open source project which can help
you read/write xls, doc, ppt files. It has a wide application.
Take a look at this previous SO thread for more information.

Manipulating SVG files with C#

I need to be able to edit the text and images of an SVG file that has been rendered in Adobe Illustrator.
How can I iterate through the elements of an SVG file, check for type = text, change the value, and save the file to disk? Is there any library available that could help me?
So far I've tried this basic library but it doesn't do well with complex SVG structures.

SVG RENDERING ENGINE
I used this one for a project.
There were a few flaws but it did the job.

This may be very late to answer, but for the sake of others who land on this page: you can use HTMLAgilityPack. Here is the link to a similar question: What is the best way to parse html in C#?
I used it in my case, where I needed to edit the SVG string and replace some values, like this:
HtmlDocument theDocument = new HtmlDocument();
theDocument.LoadHtml(svgChartImg1);
HtmlNodeCollection theNodes = theDocument.DocumentNode.SelectNodes("//tspan");
Here, the svgChartImg1 is an svg xml string.
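
Since an SVG file is XML, the same select-and-replace step can also be sketched with LINQ to XML from the framework itself. This is a swapped-in technique, not what the answer above uses. The class and method names are made up, and the sketch assumes the tspan elements live in the standard SVG namespace.

```csharp
using System.Linq;
using System.Xml.Linq;

static class SvgTextEditor
{
    private static readonly XNamespace Svg = "http://www.w3.org/2000/svg";

    // Hypothetical helper: sets the text of every <tspan> element and
    // returns the modified SVG markup.
    public static string ReplaceTspanText(string svgXml, string newText)
    {
        XDocument doc = XDocument.Parse(svgXml);

        // ToList() takes a snapshot so the tree is not modified
        // while it is still being enumerated.
        foreach (XElement tspan in doc.Descendants(Svg + "tspan").ToList())
        {
            tspan.Value = newText;
        }

        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

To work on a file instead of a string, XDocument.Load(path) and doc.Save(path) can replace the Parse and ToString calls.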
