I need to be able to edit the text and images of an SVG file that has been rendered in Adobe Illustrator.
How can I iterate through the elements of an SVG file, check for type = text, change the value, and save the file to disk? Is there any library available that could help me?
So far I've tried this basic library but it doesn't do well with complex SVG structures.
SVG RENDERING ENGINE
I used this one for a project.
There were a few flaws but it did the job.
This may be very late to answer but for the sake of others if they land in this page, You can use HTMLAgilityPack. Here is the link to a similar question: What is the best way to parse html in C#?
I have used it in my case where i needed to edit the svg string and replace some values like this:
HtmlDocument theDocument = new HtmlDocument();
theDocument.LoadHtml(svgChartImg1);
HtmlNodeCollection theNodes = theDocument.DocumentNode.SelectNodes("//tspan");
Here, the svgChartImg1 is an svg xml string.
Related
I am inserting text into an open XML document. The text I retrieve and insert into the document contains HTML formatting, i.e < p >some text< / p > < p >More text< / p > thus the inserted text inside word gets this as text. Can text with HTML get cast to something open XML documents will understand ?
New answer:
There is actually a project on codeplex that does exactly what you are looking for.
See here the project here:
Html to OpenXml on codeplex
However; if the formatting (headings/paragraphs etc...) are not important you can just strip the HTML-tags entirely.
Here is a tutorial on how to do that:
C# Remove HTML Tags
Old answer (OP worded his question a bit odd, and i misunderstood):
What you need to do is encode your HTML-code somehow; you could use base64 or whatever floats your boat. "Simple" HTML-encoding would probably be the best course of action here.
This way the HTML will not break your XML.
ASP.NET has support for this; but you can do it in any application by importing the required namespace.
Here's an example.
HtmlEncode from Class Library
I am trying to find a way to parse a word document's text to a string in my project.I have more than 600 word(.doc) files that I need to get the text content(with the new lines and tabs if possible) and assign it to a string for each one.
I've been reading stuff about the Open XML SDK but it looks quite complicated for something that looks so simple.
Open XML SDK is only for 2007 and newer formats and it is not trivial to use.
If performance is not an issue you could use Word Automation and have Word do this for you.
It will look something like this:
var app = new Application();
var doc = app.Documents.Open(documentLocation);
string rangeText = doc.Range().Text;
doc.Save();
doc.Close();
Marshal.ReleaseComObject(doc);
Marshal.ReleaseComObject(app);
Take a look at http://www.codeproject.com/Articles/18703/Word-2007-Automation or http://www.codeproject.com/Articles/21247/Word-Automation for more complete examples and instructions. Note that this may become a bit more tricky if your documents are move complex (footnotes, text boxes, tables...).
Another option is have word save the document as a text and then read the text file. Take a look at this - http://msdn.microsoft.com/en-us/library/microsoft.office.tools.word.document.saveas(v=vs.80).aspx
You could give a look at NPOI:
This project is the .NET version of POI Java project at
http://poi.apache.org/. POI is an open source project which can help
you read/write xls, doc, ppt files. It has a wide application.
Take a look at this previous SO thread for more information.
Knowing that I can't use HTMLAgilityPack, only straight .NET, say I have a string that contains some HTML that I need to parse and edit in such ways:
find specific controls in the hierarchy by id or by tag
modify (and ideally create) attributes of those found elements
Are there methods available in .net to do so?
HtmlDocument
GetElementById
HtmlElement
You can create a dummy html document.
WebBrowser w = new WebBrowser();
w.Navigate(String.Empty);
HtmlDocument doc = w.Document;
doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/><a id=\"myLink\" href=\"myUrl\"/></body></html>");
Console.WriteLine(doc.Body.Children.Count);
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Console.WriteLine(doc.GetElementById("myLink").GetAttribute("href"));
Console.ReadKey();
Output:
2
file:///c:
about:myUrl
Editing elements:
HtmlElement imageElement = doc.GetElementById("myImage");
string newSource = "d:";
imageElement.OuterHtml = imageElement.OuterHtml.Replace(
"src=\"c:\"",
"src=\"" + newSource + "\"");
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Output:
file:///d:
Assuming you're dealing with well formed HTML, you could simply treat the text as an XML document. The framework is loaded with features to do exactly what you're asking.
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx
Aside from the HTML Agility Pack, and porting HtmlUnit over to C#, what sounds like solid solutions are:
Most obviously - use regex. (System.Text.RegularExpressions)
Using an XML Parser. (because HTML is a system of tags treat it like an XML document?)
Linq?
One thing I do know is that parsing HTML like XML may cause you to run into a few problems. XML and HTML are not the same. Read about it: here
Also, here is a post about Linq vs Regex.
You can look at how HTML Agility Pack works, however, it is .Net. You can reflect the assembly and see that it is using the MFC and could be reproduced if you so wanted, but you'd be doing nothing more than moving the assembly, not making it any more .Net.
I saw a lot of solutions in here but none are clear or good answers.
Here is my simple question, hoping with a straight answer.
I have a PDF file (a template) which is created having text something like this:
{FIRSTNAME} {LASTNAME} {ADDRESS} {PHONENUMBER}
is it possible to have C# code that replace these templates with a text of my choice?
No fields, no other complex stuff.
Is there any Open source library helping me achieve that?
This thread is dead, however I'm posting my solution for other lost souls that might face this problem in the future. Unfortunately my company doesn't allow posting code online so I'll describe the solution :).
So basically what you have to do is use PdfSharp and modify this sample to replace text in stream, but you must take into account that text may be split into many parentheses (convert stream to string to see what the format is).
Then, with code similar to this sample traverse through source pdf page by page and modify current page by searching for PdfContent items inside PdfReference items and replacing text in content's stream.
The 'problem' with PDF documents is that they are inherently not suitable for editing. Especially ones without fields. The best thing is to step back and look at your process and see if there is a way to replace the text before the PDF was generated. Obviously, you may not always have this freedom.
If you will be able to replace text, then you should be aware that there will be no automatic reflow of the text following the replaced text. Given that you are fine with that, then there are very few solutions that allows you to replace text.
I know that you are looking for an OpenSource solution so I feel reluctant to offer you a commercial solution. We offer one called PDFKit.NET. It allows you to extract all content on a page as so-called shapes (text, images, curves, etc.). See method Page.CreateShapes in the type reference. You can then programmatically navigate and edit this structure of shapes and then write it back to a PDF again.
Here it is:
http://www.tallcomponents.com/pdfkit
Disclosure: I am the founder of TallComponents, vendor of this component
For simple text replace use iTextSharp library.
The code that replace one string with another is below.
Note that this will replace only simple text and may not work in all cases.
//using iTextSharp.text.pdf;
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
using (PdfReader reader = new PdfReader(OrigFile))
{
for (int i = 1; i <= reader.NumberOfPages; i++)
{
byte[] contentBytes = reader.GetPageContent(i);
string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
contentString = contentString.Replace(origText, replaceText);
reader.SetPageContent(i, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
}
new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
}
}
As stated in similar thread this is not really possible an easy way. The easier way it seems to be getting a DocX file and using DocX library which allow easy word swapping and then converting your DocX to PDF (using PDF Creator printer or so).
Or use pdf sharp/migradoc to create new documents.
Updating in PDF is hard and dirty. So may be adding a content on top of existing will work for you as well, as it worked for me. If so, here's my primitive, but working solution covering a lot of cases ("covering", indeed):
https://github.com/astef/PatchPdfText
I want to convert a simple HTML table into PDF in the memory (not file system) like this:
string pdfContent = "<table>...</table>";
byte[] pdf = WantedComponent.CreatePDF(pdfContent);
My HTML table has simple text without any formatting.
You can try iTextSharp. It can render simple HTML as a PDF. I've tried using it but ran into some issues with tables with colspans, but if your HTML table is very simple it might work for you.
http://sourceforge.net/projects/itextsharp/
Also, see the answer to this question for some sample code on how to convert HTML to PDF.