HTML String Encoding - c#

I have a requirement to export a PPT from C# without using the interop DLLs. I am able to do that, but when I append an HTML string such as "<b>Krishna</b><br/><strong>Ram</strong>" to a slide, it shows the raw markup instead of the rendered text. Can anyone help me?

It appears that PowerPoint does not currently render HTML placed directly in a slide. You must either export your slide show as HTML, or use the built-in formatting as shown in the answer to the following question: Apply Font Formatting to PowerPoint Text Programatically.
Set tr = ActiveWindow.Selection.SlideRange.Shapes(1).TextFrame.TextRange
With tr
    .Text = "Hi There Buddy!"
    .Words(1).Font.Bold = msoTrue
End With
For an idea of the settings available in C# and Office 2010 specifically, see Font Members.
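For the no-interop case in the question, the same bold formatting can be expressed as DrawingML runs. The following is a minimal sketch, assuming the Open XML SDK (the DocumentFormat.OpenXml NuGet package) is what builds the presentation; the question does not say which library is in use, and the class name BoldRunsSketch is purely illustrative. It maps the <b>/<strong> and <br/> markup from the question onto a bold run, a line break, and another bold run by hand rather than parsing HTML.

```csharp
using System;
using A = DocumentFormat.OpenXml.Drawing;

// Hypothetical helper, not part of any library.
static class BoldRunsSketch
{
    // Builds the DrawingML equivalent of "<b>Krishna</b><br/><strong>Ram</strong>".
    static A.Paragraph BuildParagraph()
    {
        return new A.Paragraph(
            new A.Run(
                new A.RunProperties { Bold = true },   // <b>Krishna</b>
                new A.Text("Krishna")),
            new A.Break(),                             // <br/>
            new A.Run(
                new A.RunProperties { Bold = true },   // <strong>Ram</strong>
                new A.Text("Ram")));
    }

    static void Main()
    {
        // Print the generated XML so the structure can be inspected.
        Console.WriteLine(BuildParagraph().OuterXml);
    }
}
```

The resulting paragraph would then be appended to the text body of a shape on the slide, in place of the raw HTML string.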
You should be able to test my assertion yourself by HTML-encoding your text with the HttpServerUtility.HtmlEncode method:
String TestString = "This is a <Test String>.";
String EncodedString = Server.HtmlEncode(TestString);
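Outside ASP.NET there is no Server object, so a console check of the same idea could use System.Net.WebUtility.HtmlEncode instead (a substitution, not the method named above). A minimal sketch using the string from the question:

```csharp
using System;
using System.Net;

static class HtmlEncodeCheck
{
    static void Main()
    {
        string html = "<b>Krishna</b><br/><strong>Ram</strong>";

        // Angle brackets become entities, so the markup is kept as literal text.
        string encoded = WebUtility.HtmlEncode(html);

        Console.WriteLine(encoded);
        // prints: &lt;b&gt;Krishna&lt;/b&gt;&lt;br/&gt;&lt;strong&gt;Ram&lt;/strong&gt;
    }
}
```

If the slide shows the encoded string literally as well, that is consistent with slide text being treated as plain text rather than markup.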

Related

c# , how to generate password protected arabic pdf or word document

I used iText 7 in my C# code to create a PDF file, as I described in my other question here:
Itext7 not showing arabic text
I gave up on trying to fix that, because it seems I would need to pay for the add-on, and I can't do that.
I tried PDFsharp. It showed Arabic letters, but they were disconnected and reversed, and writing the Arabic backward did not make the letters connect.
I used the SautinSoft library, and it created a Word document where Arabic works fine, but it adds a footer saying that it is a free version, so I can't use this one either.
The PDF created by this library also doesn't support Arabic.
So I think I can't write a PDF in Arabic; none of the libraries I tried supported it.
Is there any way to fix this?
Or can anyone please suggest another library that can create an Arabic PDF or Word document without watermarks or footers?
I found a solution: GemBox PDF. It only allows 20 paragraphs, but that is more than enough.
What about DocumentCore?
public static void SecureDocument()
{
    string filePath = @"ProtectedDocument.pdf";
    DocumentCore dc = new DocumentCore();
    // Let's create a simple document.
    dc.Content.End.Insert("Hello World!!!", new CharacterFormat() { FontName = "Verdana", Size = 65.5f, FontColor = Color.Orange });
    PdfSaveOptions so = new PdfSaveOptions();
    // Password protection
    so.EncryptionDetails.UserPassword = "12345";
    // Encryption algorithm
    so.EncryptionDetails.EncryptionAlgorithm = PdfEncryptionAlgorithm.RC4_128;
    // Permissions: content copying, commenting, printing, changing the document, filling of form fields
    // Printing: allowed
    so.EncryptionDetails.Permissions = PdfPermissions.Printing;
    // Save the document as a PDF file with the security options.
    dc.Save(filePath, so);
    // Open the result for demonstration purposes.
    System.Diagnostics.Process.Start(new System.Diagnostics.ProcessStartInfo(filePath) { UseShellExecute = true });
}

How to read pdf file to a text file in a proper format using Spire.PDF or any other library?

How can I read pdf files and save contents to a text file using Spire.PDF?
For example: Here is a pdf file and here is the desired text file from that pdf
I tried the below code to read the file and save it to a text file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"C:\Users\Tamal\Desktop\101395a.pdf");
StringBuilder buffer = new StringBuilder();
// Concatenate the extracted text of every page.
foreach (PdfPageBase page in doc.Pages)
{
    buffer.Append(page.ExtractText());
}
doc.Close();
String fileName = @"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);
But the output text file is not properly formatted. It has unnecessary whitespace, and complete paragraphs are broken into multiple lines.
How do I get the result shown in the desired text file?
Additionally, is it possible to detect and mark (for example, with a tag) text that is bold, italic or underlined? Things also get more problematic for pages that have multiple columns of text.
Using iText
File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));
System.out.println(stes.getResultantText());
This is (as the code says) a basic/simple text extraction strategy.
More advanced examples can be found in the documentation.
Use IronOCR
var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf(@"E:\Demo.pdf");
File.WriteAllText(@"E:\Demo.txt", Convert.ToString(Results));
For reference: https://ironsoftware.com/csharp/ocr/
This should give you formatted text output, though not exactly the output you want.
If you want exact, pre-interpreted output, check paid OCR SDKs such as the OmniPage Capture SDK and the ABBYY FineReader SDK.
That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PDF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.
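As a rough idea of what such a cleanup pass could look like, here is a minimal sketch. It assumes that blank lines in the extracted text mark paragraph boundaries, which will not hold for every PDF (multi-column pages in particular); the class and method names are illustrative, not part of Spire.PDF or any other library.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical post-processing helper for text returned by a PDF extractor.
static class ExtractedTextCleanup
{
    public static string Reflow(string raw)
    {
        // Normalize line endings so the paragraph split below sees only '\n'.
        string normalized = raw.Replace("\r\n", "\n").Replace('\r', '\n');

        // Treat one or more blank (or whitespace-only) lines as a paragraph break.
        string[] paragraphs = Regex.Split(normalized, @"\n\s*\n");

        // Inside a paragraph, collapse line breaks and runs of whitespace to one space.
        var cleaned = paragraphs
            .Select(p => Regex.Replace(p, @"\s+", " ").Trim())
            .Where(p => p.Length > 0);

        return string.Join(Environment.NewLine + Environment.NewLine, cleaned);
    }

    static void Main()
    {
        string sample = "The  Advance Concepts\nEngineering   team\n\n\nA qualified\ncandidate";
        Console.WriteLine(Reflow(sample));
    }
}
```

With the extraction code from the question, this would be called as File.WriteAllText(fileName, ExtractedTextCleanup.Reflow(buffer.ToString())). Detecting bold or italic text is outside what a plain-text pass like this can do.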

C# Web scraper copying text

I have a web scraper written in C# for extracting data. I want to copy text from the web browser control and paste it into a Word file programmatically. When I try to extract rich text box content using its ID and InnerText, the text contains encoded characters like %2c.
I need to get the text with all its formatting, but I can't find a way. I have tried Encoding, HttpUtility.UrlDecode, SendKeys and elem.InvokeMember() without success.
How can I programmatically copy and paste text from web browser control preserving formatting?
Here is the sample data to extract:
Description
The Advance Concepts Engineering team designs and develops new vehicles which will meet future regulatory requirements and customer competitive requirements. A qualified candidate will be responsible for the total vehicle packaging. The candidate will identify and resolve adaptation and packaging issues as the vehicle moves toward production. They will lead cross functional team meetings working with Systems & Components, Advance Manufacturing, Service, etc. to ensure that the solutions are optimized for all stages of the vehicle's life.
HtmlElement elem = wb.Document.GetElementById("ctl00_contplhDynamic_txtDescrContentHiddenTextarea");
if (elem == null) return;
elem.InvokeMember("Click");
//elem.InvokeMember("Select All");
//elem.InvokeMember("Copy");
SendKeys.SendWait("^a");
SendKeys.SendWait("^c");
Clipboard.Clear();
elem.Focus();
elem.InvokeMember("Right Click");
elem.InvokeMember("Select All");
elem.InvokeMember("Copy");
Clipboard.SetText(elem.InnerText);
string clipbrdText = Clipboard.GetText();
string data = elem.InnerText;
richTextBox1.Text = data;
string temp = System.Web.HttpUtility.UrlDecode(data);
Encoding iso = Encoding.GetEncoding("windows-1252");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(data);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string msg = iso.GetString(isoBytes);
Text containing "%2c" and the like has been encoded. If you are reading the content of a web page, you need to decode HTML, not a URL. You can use HttpUtility.HtmlDecode or, on .NET 4.0 or above, WebUtility.HtmlDecode, which lives in the System.Net namespace.
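The two encodings are easy to confuse, so a quick comparison may help. This sketch uses System.Net.WebUtility for both; the sample strings are made up for illustration. Note that %2c is percent-encoding (a URL-encoded comma), which HtmlDecode leaves alone, while entities such as &amp; are what HtmlDecode handles.

```csharp
using System;
using System.Net;

static class DecodeComparison
{
    static void Main()
    {
        // Percent-encoding: decoded by UrlDecode.
        Console.WriteLine(WebUtility.UrlDecode("regulatory requirements%2c customer requirements"));
        // prints: regulatory requirements, customer requirements

        // HTML entities: decoded by HtmlDecode.
        Console.WriteLine(WebUtility.HtmlDecode("Systems &amp; Components"));
        // prints: Systems & Components

        // HtmlDecode does not touch percent-encoded sequences.
        Console.WriteLine(WebUtility.HtmlDecode("requirements%2c"));
        // prints: requirements%2c
    }
}
```

Checking which of the two forms actually appears in the scraped text should show which decoder applies.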
You should note that Word does not use HTML for its formatting, so you won't be able to paste HTML tags and expect it to recognise them. For example, typing <strong>Description</strong> into Word will not result in bold text.
EDIT:
It looks like you are mixing two different ways to copy the text in the code you pasted - both SendKeys.SendWait("^c"); and elem.InvokeMember("Copy");. I presume both of these methods work?
I think the problem you are having lies in the way you are getting the text. I see you're using Clipboard.GetText() to get the text. Try specifying that it is formatted text using Clipboard.GetText(TextDataFormat.Rtf) or Clipboard.GetText(TextDataFormat.Html). This should hopefully copy the string preserving the formatting.
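A minimal sketch of that suggestion, assuming a Windows Forms application in which something has already been copied to the clipboard. Clipboard access requires an STA thread, hence the [STAThread] attribute; the class name is illustrative.

```csharp
using System;
using System.Windows.Forms;

static class ClipboardFormatDemo
{
    [STAThread]
    static void Main()
    {
        if (Clipboard.ContainsText(TextDataFormat.Rtf))
        {
            // Rich text keeps bold, italics, fonts and so on.
            string rtf = Clipboard.GetText(TextDataFormat.Rtf);
            Console.WriteLine("RTF on clipboard: " + rtf.Length + " characters");
        }
        else if (Clipboard.ContainsText(TextDataFormat.Html))
        {
            // HTML clipboard data includes a header followed by the markup fragment.
            string html = Clipboard.GetText(TextDataFormat.Html);
            Console.WriteLine("HTML on clipboard: " + html.Length + " characters");
        }
        else
        {
            // Fall back to plain text when no formatted variant is present.
            Console.WriteLine(Clipboard.GetText());
        }
    }
}
```

An RTF string obtained this way can be assigned to richTextBox1.Rtf instead of richTextBox1.Text so that the formatting survives.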

Create document using docx library with HTML text

I want to create a document (*.docx) using the DocX library. I have HTML-formatted text from a rich text editor, and I want to save it as it is. I am not able to find help on this anywhere. My current code is very basic and is copied from their help.
My code looks like:
private string CreateDocumentFromText()
{
    string filePath = Server.MapPath("../DocXExample.docx");
    var document = Novacode.DocX.Create(filePath);
    document.InsertParagraph("<b>Test</b>");
    document.Save();
    // Return the path that was actually written (the original returned an undefined fileName).
    return filePath;
}
Document has content as:
<b>Test</b>
Whereas I want it to be:
Test
You have to convert it to an XML element.
Try the DocumentFormat.OpenXml library.
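As an illustration of what "converting to an XML element" means here, the following sketch writes the bold word directly with the Open XML SDK (the DocumentFormat.OpenXml NuGet package). It does not parse HTML; the mapping from <b>Test</b> to a bold run is done by hand, and the output file name is only borrowed from the question.

```csharp
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class BoldParagraphSketch
{
    static void Main()
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Create(
            "DocXExample.docx", WordprocessingDocumentType.Document))
        {
            MainDocumentPart mainPart = doc.AddMainDocumentPart();

            // A run carries its own formatting: RunProperties with Bold plays the role of <b>.
            mainPart.Document = new Document(
                new Body(
                    new Paragraph(
                        new Run(
                            new RunProperties(new Bold()),
                            new Text("Test")))));

            mainPart.Document.Save();
        }
    }
}
```

For arbitrary HTML from a rich text editor, each tag would need a similar mapping onto run or paragraph properties.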

Get message text from Lync database

I have a connection to the Lync SQL database. The problem is that some messages are stored as HTML and some look like this:
{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}}
{\colortbl ;\red0\green0\blue0;}
{\*\generator Riched20 15.0.4420}{\*\mmathPr\mwrapIndent1440 }\viewkind4\uc1
\pard\cf1\embo\f0\fs20 this\embo0 \embo is\embo0 \embo from\embo0 \embo
db\embo0\f1\par
{\*\lyncflags rtf=1}}
It's easy to handle HTML-encoded messages, but how can I get at least the text from the other type? Does the Lync SDK allow this? I couldn't find a way to do it with the Lync SDK.
Even if the Lync SDK can extract the message text, I don't want to install the SDK just for this purpose. I hope there is a better way. Maybe there are free third-party parsers for this?
The text is in RTF format. You can convert the RTF text to plain text using the RichTextBox class in the System.Windows.Forms namespace.
First, create a RichTextBox and give it the RTF text:
System.Windows.Forms.RichTextBox richTextBox = new System.Windows.Forms.RichTextBox();
richTextBox.Rtf = rtfText;
You can then read the plain text
string plainText = richTextBox.Text;
Doing this with the text in your example, plainText returns: this is from db.
