OpenXml Excel: throw error in any word after mail address - c#

I read Excel files using OpenXml. Everything works fine, but if the spreadsheet contains a cell holding an email address followed by a space and another word, such as:
abc@abc.com abc
it throws an exception as soon as the spreadsheet is opened:
var _doc = SpreadsheetDocument.Open(_filePath, false);
exception:
DocumentFormat.OpenXml.Packaging.OpenXmlPackageException
Additional information:
Invalid Hyperlink: Malformed URI is embedded as a
hyperlink in the document.

There is an open issue on the OpenXml forum related to this problem: Malformed Hyperlink causes exception
In the post they talk about encountering this issue with a malformed "mailto:" hyperlink within a Word document.
They propose a work-around here: Workaround for malformed hyperlink exception
The workaround is essentially a small console application that locates the invalid URL and replaces it with a hard-coded value. Here is the snippet from their sample that does the replacement; you could augment it to attempt to correct the passed brokenUri instead:
private static Uri FixUri(string brokenUri)
{
return new Uri("http://broken-link/");
}
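As a minimal sketch of such an augmentation (not part of the original workaround), the handler below first tries to salvage the target by trimming it and percent-encoding spaces, and only falls back to the placeholder when the result still does not parse. The class name UriRepair and the space-encoding heuristic are assumptions made for illustration; whether a particular repaired target is accepted depends on the runtime's URI parser.

```csharp
using System;

public static class UriRepair
{
    // Placeholder used by the original workaround when nothing can be salvaged.
    private static readonly Uri Fallback = new Uri("http://broken-link/");

    // Hypothetical replacement for FixUri: attempt a light repair before giving up.
    public static Uri FixUri(string brokenUri)
    {
        if (string.IsNullOrWhiteSpace(brokenUri))
            return Fallback;

        // Heuristic (an assumption): stray spaces are a common defect,
        // so trim the ends and percent-encode the spaces that remain.
        string candidate = brokenUri.Trim().Replace(" ", "%20");

        Uri repaired;
        return Uri.TryCreate(candidate, UriKind.Absolute, out repaired)
            ? repaired
            : Fallback;
    }
}
```

Because the method matches Func<string, Uri>, it could be passed directly as the handler, for example UriFixer.FixInvalidUri(fs, UriRepair.FixUri).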
The problem I had was actually with an Excel document (like you) and it had to do with a malformed http URL; I was pleasantly surprised to find that their code worked just fine with my Excel file.
Here is the entire work-around source code, just in case one of these links goes away in the future:
void Main(string[] args)
{
var fileName = @"C:\temp\corrupt.xlsx";
var newFileName = @"c:\temp\Fixed.xlsx";
var newFileInfo = new FileInfo(newFileName);
if (newFileInfo.Exists)
newFileInfo.Delete();
File.Copy(fileName, newFileName);
WordprocessingDocument wDoc;
try
{
using (wDoc = WordprocessingDocument.Open(newFileName, true))
{
ProcessDocument(wDoc);
}
}
catch (OpenXmlPackageException e)
{
e.Dump();
if (e.ToString().Contains("The specified package is not valid."))
{
using (FileStream fs = new FileStream(newFileName, FileMode.OpenOrCreate, FileAccess.ReadWrite))
{
UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
}
}
}
}
private static Uri FixUri(string brokenUri)
{
brokenUri.Dump();
return new Uri("http://broken-link/");
}
private static void ProcessDocument(WordprocessingDocument wDoc)
{
var elementCount = wDoc.MainDocumentPart.Document.Descendants().Count();
Console.WriteLine(elementCount);
}
}
public static class UriFixer
{
public static void FixInvalidUri(Stream fs, Func<string, Uri> invalidUriHandler)
{
XNamespace relNs = "http://schemas.openxmlformats.org/package/2006/relationships";
using (ZipArchive za = new ZipArchive(fs, ZipArchiveMode.Update))
{
foreach (var entry in za.Entries.ToList())
{
if (!entry.Name.EndsWith(".rels"))
continue;
bool replaceEntry = false;
XDocument entryXDoc = null;
using (var entryStream = entry.Open())
{
try
{
entryXDoc = XDocument.Load(entryStream);
if (entryXDoc.Root != null && entryXDoc.Root.Name.Namespace == relNs)
{
var urisToCheck = entryXDoc
.Descendants(relNs + "Relationship")
.Where(r => r.Attribute("TargetMode") != null && (string)r.Attribute("TargetMode") == "External");
foreach (var rel in urisToCheck)
{
var target = (string)rel.Attribute("Target");
if (target != null)
{
try
{
Uri uri = new Uri(target);
}
catch (UriFormatException)
{
Uri newUri = invalidUriHandler(target);
rel.Attribute("Target").Value = newUri.ToString();
replaceEntry = true;
}
}
}
}
}
catch (XmlException)
{
continue;
}
}
if (replaceEntry)
{
var fullName = entry.FullName;
entry.Delete();
var newEntry = za.CreateEntry(fullName);
using (StreamWriter writer = new StreamWriter(newEntry.Open()))
using (XmlWriter xmlWriter = XmlWriter.Create(writer))
{
entryXDoc.WriteTo(xmlWriter);
}
}
}
}
}

The fix by @RMD works great, and I've been using it for years. But there is a newer fix, which you can see in the changelog for issue #793.
Upgrade OpenXML to 2.12.0.
Right-click the solution and select Manage NuGet Packages.
Implement the fix.
It is helpful to have a unit test. Create an Excel file with a bad email address like test@gmail,com (note the comma instead of the dot).
Make sure that both the stream you open and the call to SpreadsheetDocument.Open allow read AND write access.
You need to implement a RelationshipErrorHandler and supply it through the RelationshipErrorHandlerFactory option when you open the document. Here is the code I used:
public class UriRelationshipErrorHandler : RelationshipErrorHandler
{
public override string Rewrite(Uri partUri, string id, string uri)
{
return "https://broken-link";
}
}
Then you need to use it when you open the document like this:
var openSettings = new OpenSettings
{
RelationshipErrorHandlerFactory = package =>
{
return new UriRelationshipErrorHandler();
}
};
using var document = SpreadsheetDocument.Open(stream, true, openSettings);
One of the nice things about this solution is that it does not require you to create a temporary "fixed" version of your file and it is far less code.

Unfortunately, the solution that opens the file as a zip archive and replaces the broken hyperlink did not help me.
I wondered how it is possible that everything works fine when the target framework is 4.0, even if the only installed .NET Framework version is 4.7.2.
I found that there is a private static field inside System.UriParser that selects which version of the URI RFC specification is applied. So it is possible to set it to V2, the value used for .NET 4.0 and lower versions of .NET Framework. The only problem is that the field is private static readonly.
Someone may want to set it globally for the whole application. Instead, I wrote UriQuirksVersionPatcher, which updates this version and restores it in its Dispose method. It is obviously not thread-safe, but that is acceptable for my purpose.
using System;
using System.Diagnostics;
using System.Reflection;
namespace BarCap.RiskServices.RateSubmissions.Utility
{
#if (NET20 || NET35 || NET40)
public class UriQuirksVersionPatcher : IDisposable
{
public void Dispose()
{
}
}
#else
public class UriQuirksVersionPatcher : IDisposable
{
private const string _quirksVersionFieldName = "s_QuirksVersion"; //See Source\ndp\fx\src\net\System\_UriSyntax.cs in NetFX sources
private const string _uriQuirksVersionEnumName = "UriQuirksVersion";
/// <code>
/// private enum UriQuirksVersion
/// {
/// V1 = 1, // RFC 1738 - Not supported
/// V2 = 2, // RFC 2396
/// V3 = 3, // RFC 3986, 3987
/// }
/// </code>
private const string _oldQuirksVersion = "V2";
private static readonly Lazy<FieldInfo> _targetFieldInfo;
private static readonly Lazy<int?> _patchValue;
private readonly int _oldValue;
private readonly bool _isEnabled;
static UriQuirksVersionPatcher()
{
var targetType = typeof(UriParser);
_targetFieldInfo = new Lazy<FieldInfo>(() => targetType.GetField(_quirksVersionFieldName, BindingFlags.Static | BindingFlags.NonPublic));
_patchValue = new Lazy<int?>(() => GetUriQuirksVersion(targetType));
}
public UriQuirksVersionPatcher()
{
int? patchValue = _patchValue.Value;
_isEnabled = patchValue.HasValue;
if (!_isEnabled) //Disabled if it failed to get enum value
{
return;
}
int originalValue = QuirksVersion;
_isEnabled = originalValue != patchValue;
if (!_isEnabled) //Disabled if value is proper
{
return;
}
_oldValue = originalValue;
QuirksVersion = patchValue.Value;
}
private int QuirksVersion
{
get
{
return (int)_targetFieldInfo.Value.GetValue(null);
}
set
{
_targetFieldInfo.Value.SetValue(null, value);
}
}
private static int? GetUriQuirksVersion(Type targetType)
{
int? result = null;
try
{
result = (int)targetType.GetNestedType(_uriQuirksVersionEnumName, BindingFlags.Static | BindingFlags.NonPublic)
.GetField(_oldQuirksVersion, BindingFlags.Static | BindingFlags.Public)
.GetValue(null);
}
catch
{
#if DEBUG
Debug.WriteLine("ERROR: Failed to find UriQuirksVersion.V2 enum member.");
throw;
#endif
}
return result;
}
public void Dispose()
{
if (_isEnabled)
{
QuirksVersion = _oldValue;
}
}
}
#endif
}
Usage:
using(new UriQuirksVersionPatcher())
{
using(var document = SpreadsheetDocument.Open(fullPath, false))
{
//.....
}
}
P.S. Later I found that someone had already implemented this patcher: https://github.com/google/google-api-dotnet-client/blob/master/Src/Support/Google.Apis.Core/Util/UriPatcher.cs

I haven't used OpenXml, but if there's no specific reason for using it, I highly recommend LinqToExcel. Example code:
var sheet = new ExcelQueryFactory("filePath");
var allRows = from r in sheet.Worksheet() select r;
foreach (var r in allRows) {
var cella = r["Header"].ToString();
}

Related

Is it possible to get a batch of text content through Azure DevOps REST API?

I need to get (not download) the content of roughly 10,000 manifest files within a project in Azure DevOps, but I haven't managed to do it. I have found several ways to retrieve the content of one file at a time, but in this context that is neither efficient nor sustainable. I have managed to retrieve all files of a particular type by checking whether the file path ends with the name of the file and then using the TfvcHttpClientBase.GetItemsBatch method. However, this method does not return the item's content.
Program.cs
using Microsoft.TeamFoundation.SourceControl.WebApi;
AzureRest azureRest = new AzureRest();
var tfvcItems = azureRest.GetTfvcItems();
List<TfvcItemDescriptor> itemDescriptorsList = new List<TfvcItemDescriptor>();
foreach(var item in tfvcItems)
{
//Example manifest file .NET
if (item.Path.EndsWith("packages.config"))
{
var itemDescriptor = new TfvcItemDescriptor()
{
Path = item.Path,
RecursionLevel = VersionControlRecursionType.None,
Version = "",
VersionOption = TfvcVersionOption.None,
VersionType = TfvcVersionType.Latest
};
itemDescriptorsList.Add(itemDescriptor);
}
}
TfvcItemDescriptor[] itemDescriptorsArray = itemDescriptorsList.ToArray();
var itemBatch = azureRest.GetTfvcItemsBatch(itemDescriptorsArray);
foreach(var itemList in itemBatch)
{
foreach(var itemListList in itemList)
{
Console.WriteLine("Content: " + itemListList.Content); //empty/null
Console.WriteLine("ContentMetadata: " + itemListList.ContentMetadata); //not empty/null
}
}
AzureRest.cs
using Microsoft.TeamFoundation.SourceControl.WebApi;
using Microsoft.VisualStudio.Services.Common;
using Microsoft.VisualStudio.Services.WebApi;
public class AzureRest
{
const string ORG_URL = "https://org/url/url";
const string PROJECT = "Project";
const string PAT = "PersonalAccessToken";
private string GetTokenConfig()
{
return PAT;
}
private string GetProjectNameConfig()
{
return PROJECT;
}
private VssConnection Authenticate()
{
string token = GetTokenConfig();
string projectName = GetProjectNameConfig();
var credentials = new VssBasicCredential(string.Empty, token);
var connection = new VssConnection(new Uri(ORG_URL), credentials);
return connection;
}
public List<TfvcItem> GetTfvcItems()
{
var connection = Authenticate();
using (TfvcHttpClient tfvcClient = connection.GetClient<TfvcHttpClient>())
{
var tfvcItems = tfvcClient.GetItemsAsync(scopePath: "/Path", recursionLevel: VersionControlRecursionType.Full, true).Result;
return tfvcItems;
}
}
public List<List<TfvcItem>> GetTfvcItemsBatch(TfvcItemDescriptor[] itemDescriptors)
{
TfvcItemRequestData requestData = new TfvcItemRequestData()
{
IncludeContentMetadata = true,
IncludeLinks = true,
ItemDescriptors = itemDescriptors
};
var connection = Authenticate();
using (TfvcHttpClient tfvcClient = connection.GetClient<TfvcHttpClient>())
{
var tfvcItems = tfvcClient.GetItemsBatchAsync(requestData).Result;
return tfvcItems;
}
}
}
For reference:
I tested the code you shared. While debugging "itemDescriptorsList", I found that no content is specified in it, which is why you cannot get the text content.
You should first check and add the content property to "itemDescriptorsList".

Reading the file only once for every method call

I am new to object-oriented programming and I am working on a small personal project with some SQL scripts.
I have a scenario where a SQL script calls a static method with a file path as input.
queries = Select Query from Table where Utils.ContainsKeyword(Query, @Path1) AND NOT Utils.ContainsKeyword(Query, @Path2);
I had initially created a static class that does the following:
public static class Utils
{
public static bool ContainsKeyword(string query, string path)
{
var isQueryInFile = false;
var stringFromFile = GetStringFromFile(path);
List<Regex>regexList = GetRegexList(stringFromFile);
if(regexList!= null)
{
isQueryInFile = regexList.Any(pattern => pattern.IsMatch(query));
}
return isQueryInFile;
}
private static string GetStringFromFile(string path)
{
var words = String.Empty;
if(!string.IsNullOrEmpty(path))
{
try
{
using (StreamReader sr = File.OpenText(path))
{
words = sr.ReadToEnd().Replace(Environment.NewLine, "");
}
}
catch { return words; }
}
return words;
}
private static List<Regex> GetRegexList(string words)
{
if(string.IsNullOrEmpty(words)) { return null; }
return words.Split(',').Select(w=> new Regex(@"\b" + Regex.Escape(w) + @"\b", RegexOptions.Compiled | RegexOptions.IgnoreCase)).ToList();
}
}
My problem is that I neither want to read from the file every time the ContainsKeyword static method is called nor do I want to create a new RegexList every time. Also, I cannot change the SQL script and I have to send the path to the file as an input parameter for the method call in the SQL script since the path might change in the future.
Is there a way to make sure I only read the contents from the input path only once, store them in a string, and use the string for the match with different input queries?
To read the content only once, it will probably need to be kept in memory, so memory capacity could be an issue.
public Dictionary<string, string> FileContentCache { get; set; } // make sure that gets initialized
public string GetFileContentCache(string path)
{
if (FileContentCache == null) FileContentCache = new Dictionary<string, string>();
if (FileContentCache.ContainsKey(path))
return FileContentCache[path];
var fileData = GetStringFromFile(path);
FileContentCache.Add(path, fileData);
return fileData;
}
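The same idea can be moved into the static class so that the compiled regex list, not just the file text, is built once per path. The following is a minimal sketch under some assumptions: the class name KeywordCache is hypothetical, keywords are trimmed and empty entries are dropped (the original code does neither), and a file that cannot be read is cached as an empty list, so a transient failure sticks until the process restarts.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordCache
{
    // One compiled pattern list per file path, built on first use.
    private static readonly ConcurrentDictionary<string, IReadOnlyList<Regex>> Cache =
        new ConcurrentDictionary<string, IReadOnlyList<Regex>>(StringComparer.OrdinalIgnoreCase);

    public static bool ContainsKeyword(string query, string path)
    {
        if (string.IsNullOrEmpty(query) || string.IsNullOrEmpty(path))
            return false;

        // GetOrAdd reads the file only when the path has not been seen yet.
        // Under heavy contention the factory may run more than once, which is harmless here.
        IReadOnlyList<Regex> patterns = Cache.GetOrAdd(path, LoadPatterns);
        return patterns.Any(p => p.IsMatch(query));
    }

    private static IReadOnlyList<Regex> LoadPatterns(string path)
    {
        string text;
        try
        {
            text = File.ReadAllText(path);
        }
        catch (IOException)
        {
            return Array.Empty<Regex>();
        }
        catch (UnauthorizedAccessException)
        {
            return Array.Empty<Regex>();
        }

        // Comma-separated keywords; line breaks are removed as in the original code.
        return text.Replace("\r", "").Replace("\n", "")
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim())
            .Where(w => w.Length > 0)
            .Select(w => new Regex(@"\b" + Regex.Escape(w) + @"\b",
                RegexOptions.Compiled | RegexOptions.IgnoreCase))
            .ToList();
    }
}
```

The method keeps the same signature as Utils.ContainsKeyword, so the SQL script would not need to change. The trade-off is that edits to the keyword file are not picked up while the process is running.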

Determine if a Database is "Equal" to a DacPackage

Is there a way to use the SQL Server 2012 Microsoft.SqlServer.Dac Namespace to determine if a database has an identical schema to that described by a DacPackage object? I've looked at the API docs for DacPackage as well as DacServices, but not having any luck; am I missing something?
Yes there is, I have been using the following technique since 2012 without issue.
Calculate a fingerprint of the dacpac.
Store that fingerprint in the target database.
The .dacpac is just a zip file containing goodies like metadata and model information.
Here's a screen-grab of what you will find in the .dacpac:
The file model.xml has XML structured like the following
<DataSchemaModel>
<Header>
... developer specific stuff is in here
</Header>
<Model>
.. database model definition is in here
</Model>
</DataSchemaModel>
What we need to do is extract the contents from <Model>...</Model>
and treat this as the fingerprint of the schema.
"But wait!" you say. "Origin.xml has the following nodes:"
<Checksums>
<Checksum Uri="/model.xml">EB1B87793DB57B3BB5D4D9826D5566B42FA956EDF711BB96F713D06BA3D309DE</Checksum>
</Checksums>
In my experience, this <Checksum> node changes regardless of a schema change in the model.
So let's get to it.
Calculate the fingerprint of the dacpac.
using System;
using System.IO;
using System.IO.Packaging;
using System.Security.Cryptography;
using System.Text;
using System.Xml;
static string DacPacFingerprint(byte[] dacPacBytes)
{
using (var ms = new MemoryStream(dacPacBytes))
using (var package = ZipPackage.Open(ms))
{
var modelFile = package.GetPart(new Uri("/model.xml", UriKind.Relative));
using (var streamReader = new System.IO.StreamReader(modelFile.GetStream()))
{
var xmlDoc = new XmlDocument() { InnerXml = streamReader.ReadToEnd() };
foreach (XmlNode childNode in xmlDoc.DocumentElement.ChildNodes)
{
if (childNode.Name == "Header")
{
// skip the Header node as described
xmlDoc.DocumentElement.RemoveChild(childNode);
break;
}
}
using (var crypto = new SHA512CryptoServiceProvider())
{
byte[] retVal = crypto.ComputeHash(Encoding.UTF8.GetBytes(xmlDoc.InnerXml));
return BitConverter.ToString(retVal).Replace("-", "");// hex string
}
}
}
}
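For comparison, here is a minimal sketch of the same fingerprint idea written against System.IO.Compression and LINQ to XML instead of System.IO.Packaging. It is an alternative formulation, not the code above: the class name DacpacHash is hypothetical, the Header element is matched by local name so that the default XML namespace of model.xml does not matter, and the hashed text is the serialized root element rather than InnerXml, so its fingerprints will not equal those produced by DacPacFingerprint.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Xml.Linq;

public static class DacpacHash
{
    // Hashes model.xml with its <Header> element removed.
    public static string Fingerprint(Stream dacpac)
    {
        using (var zip = new ZipArchive(dacpac, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("model.xml");
            if (entry == null)
                throw new InvalidDataException("model.xml not found in package.");

            XDocument doc;
            using (Stream s = entry.Open())
            {
                doc = XDocument.Load(s);
            }

            // Drop the developer-specific header, whatever namespace it is in.
            doc.Root.Elements()
                .Where(e => e.Name.LocalName == "Header")
                .Remove();

            using (var sha = SHA512.Create())
            {
                // DisableFormatting avoids indentation differences in the hashed text.
                byte[] bytes = Encoding.UTF8.GetBytes(doc.Root.ToString(SaveOptions.DisableFormatting));
                return BitConverter.ToString(sha.ComputeHash(bytes)).Replace("-", "");
            }
        }
    }
}
```

Two packages whose model.xml differs only inside the Header element should then produce the same fingerprint.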
With this fingerprint now available, pseudo code for applying a dacpac can be:
void main()
{
var dacpacBytes = File.ReadAllBytes("<path-to-dacpac>");
var dacpacFingerPrint = DacPacFingerprint(dacpacBytes);// see above
var databaseFingerPrint = Database.GetFingerprint();//however you choose to do this
if(databaseFingerPrint != dacpacFingerPrint)
{
DeployDacpac(...);//however you choose to do this
Database.SetFingerprint(dacpacFingerPrint);//however you choose to do this
}
}
Here's what I've come up with, but I'm not really crazy about it. If anyone can point out any bugs, edge cases, or better approaches, I'd be much obliged.
...
DacServices dacSvc = new DacServices(connectionString);
string deployScript = dacSvc.GenerateDeployScript(myDacpac, @"aDb", deployOptions);
if (DatabaseEqualsDacPackage(deployScript))
{
Console.WriteLine("The database and the DacPackage are equal");
}
...
bool DatabaseEqualsDacPackage(string deployScript)
{
string equalStr = string.Format("GO{0}USE [$(DatabaseName)];{0}{0}{0}GO{0}PRINT N'Update complete.'{0}GO", Environment.NewLine);
return deployScript.Contains(equalStr);
}
...
What I really don't like about this approach is that it's entirely dependent upon the format of the generated deployment script, and therefore extremely brittle. Questions, comments and suggestions very welcome.
@Aaron Hudon's answer does not account for post-deployment script changes. Sometimes you just add a new entry to a type table without changing the model. In our case we want this to count as a new dacpac. Here is my modification of his code to account for that:
private static string DacPacFingerprint(string path)
{
using (var stream = File.OpenRead(path))
using (var package = Package.Open(stream))
{
var extractors = new IDacPacDataExtractor [] {new ModelExtractor(), new PostScriptExtractor()};
string content = string.Join("_", extractors.Select(e =>
{
var modelFile = package.GetPart(new Uri($"/{e.Filename}", UriKind.Relative));
using (var streamReader = new StreamReader(modelFile.GetStream()))
{
return e.ExtractData(streamReader);
}
}));
using (var crypto = new MD5CryptoServiceProvider())
{
byte[] retVal = crypto.ComputeHash(Encoding.UTF8.GetBytes(content));
return BitConverter.ToString(retVal).Replace("-", "");// hex string
}
}
}
private class ModelExtractor : IDacPacDataExtractor
{
public string Filename { get; } = "model.xml";
public string ExtractData(StreamReader streamReader)
{
var xmlDoc = new XmlDocument() { InnerXml = streamReader.ReadToEnd() };
foreach (XmlNode childNode in xmlDoc.DocumentElement.ChildNodes)
{
if (childNode.Name == "Header")
{
// skip the Header node as described
xmlDoc.DocumentElement.RemoveChild(childNode);
break;
}
}
return xmlDoc.InnerXml;
}
}
private class PostScriptExtractor : IDacPacDataExtractor
{
public string Filename { get; } = "postdeploy.sql";
public string ExtractData(StreamReader stream)
{
return stream.ReadToEnd();
}
}
private interface IDacPacDataExtractor
{
string Filename { get; }
string ExtractData(StreamReader stream);
}

How to implement and do OCR in a C# project?

I've been searching for a while, and all I have found are requests for OCR libraries. I would like to know how to use the simplest OCR library, one that is easy to install and use, with detailed instructions for installing it into a C# project.
If possible, I just want to use it like a usual DLL reference...
Example:
using org.pdfbox.pdmodel;
using org.pdfbox.util;
Also a little OCR code example would be nice, such as:
public string OCRFromBitmap(Bitmap Bmp)
{
Bmp.Save(temppath, System.Drawing.Imaging.ImageFormat.Tiff);
string OcrResult = Analyze(temppath);
File.Delete(temppath);
return OcrResult;
}
So please bear in mind that I'm not familiar with OCR projects, and answer as if talking to a beginner.
Edit:
I guess people misunderstood my request. I wanted to know how to add those open source OCR libraries to a C# project and how to use them. The link given as a duplicate does not give the answers I asked for at all.
If anyone is looking into this, I've been trying different options and the following approach yields very good results. The following are the steps to get a working example:
Add .NET Wrapper for tesseract to your project. It can be added via NuGet package Install-Package Tesseract(https://github.com/charlesw/tesseract).
Go to the Downloads section of the official Tesseract project (https://code.google.com/p/tesseract-ocr/ EDIT: It's now located here: https://github.com/tesseract-ocr/langdata).
Download the preferred language data, example: tesseract-ocr-3.02.eng.tar.gz English language data for Tesseract 3.02.
Create tessdata directory in your project and place the language data files in it.
Go to Properties of the newly added files and set them to copy on build.
Add a reference to System.Drawing.
From .NET Wrapper repository, in the Samples directory copy the sample phototest.tif file into your project directory and set it to copy on build.
Create the following two files in your project (just to get started):
Program.cs
using System;
using Tesseract;
using System.Diagnostics;
namespace ConsoleApplication
{
class Program
{
public static void Main(string[] args)
{
var testImagePath = "./phototest.tif";
if (args.Length > 0)
{
testImagePath = args[0];
}
try
{
var logger = new FormattedConsoleLogger();
var resultPrinter = new ResultPrinter(logger);
using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
{
using (var img = Pix.LoadFromFile(testImagePath))
{
using (logger.Begin("Process image"))
{
var i = 1;
using (var page = engine.Process(img))
{
var text = page.GetText();
logger.Log("Text: {0}", text);
logger.Log("Mean confidence: {0}", page.GetMeanConfidence());
using (var iter = page.GetIterator())
{
iter.Begin();
do
{
if (i % 2 == 0)
{
using (logger.Begin("Line {0}", i))
{
do
{
using (logger.Begin("Word Iteration"))
{
if (iter.IsAtBeginningOf(PageIteratorLevel.Block))
{
logger.Log("New block");
}
if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
{
logger.Log("New paragraph");
}
if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
{
logger.Log("New line");
}
logger.Log("word: " + iter.GetText(PageIteratorLevel.Word));
}
} while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
}
}
i++;
} while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
}
}
}
}
}
}
catch (Exception e)
{
Trace.TraceError(e.ToString());
Console.WriteLine("Unexpected Error: " + e.Message);
Console.WriteLine("Details: ");
Console.WriteLine(e.ToString());
}
Console.Write("Press any key to continue . . . ");
Console.ReadKey(true);
}
private class ResultPrinter
{
readonly FormattedConsoleLogger logger;
public ResultPrinter(FormattedConsoleLogger logger)
{
this.logger = logger;
}
public void Print(ResultIterator iter)
{
logger.Log("Is beginning of block: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Block));
logger.Log("Is beginning of para: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Para));
logger.Log("Is beginning of text line: {0}", iter.IsAtBeginningOf(PageIteratorLevel.TextLine));
logger.Log("Is beginning of word: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Word));
logger.Log("Is beginning of symbol: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Symbol));
logger.Log("Block text: \"{0}\"", iter.GetText(PageIteratorLevel.Block));
logger.Log("Para text: \"{0}\"", iter.GetText(PageIteratorLevel.Para));
logger.Log("TextLine text: \"{0}\"", iter.GetText(PageIteratorLevel.TextLine));
logger.Log("Word text: \"{0}\"", iter.GetText(PageIteratorLevel.Word));
logger.Log("Symbol text: \"{0}\"", iter.GetText(PageIteratorLevel.Symbol));
}
}
}
}
FormattedConsoleLogger.cs
using System;
using System.Collections.Generic;
using System.Text;
using Tesseract;
namespace ConsoleApplication
{
public class FormattedConsoleLogger
{
const string Tab = " ";
private class Scope : DisposableBase
{
private int indentLevel;
private string indent;
private FormattedConsoleLogger container;
public Scope(FormattedConsoleLogger container, int indentLevel)
{
this.container = container;
this.indentLevel = indentLevel;
StringBuilder indent = new StringBuilder();
for (int i = 0; i < indentLevel; i++)
{
indent.Append(Tab);
}
this.indent = indent.ToString();
}
public void Log(string format, object[] args)
{
var message = String.Format(format, args);
StringBuilder indentedMessage = new StringBuilder(message.Length + indent.Length * 10);
int i = 0;
bool isNewLine = true;
while (i < message.Length)
{
if (i + 1 < message.Length && message[i] == '\r' && message[i + 1] == '\n')
{
indentedMessage.AppendLine();
isNewLine = true;
i += 2;
}
else if (message[i] == '\r' || message[i] == '\n')
{
indentedMessage.AppendLine();
isNewLine = true;
i++;
}
else
{
if (isNewLine)
{
indentedMessage.Append(indent);
isNewLine = false;
}
indentedMessage.Append(message[i]);
i++;
}
}
Console.WriteLine(indentedMessage.ToString());
}
public Scope Begin()
{
return new Scope(container, indentLevel + 1);
}
protected override void Dispose(bool disposing)
{
if (disposing)
{
var scope = container.scopes.Pop();
if (scope != this)
{
throw new InvalidOperationException("Format scope removed out of order.");
}
}
}
}
private Stack<Scope> scopes = new Stack<Scope>();
public IDisposable Begin(string title = "", params object[] args)
{
Log(title, args);
Scope scope;
if (scopes.Count == 0)
{
scope = new Scope(this, 1);
}
else
{
scope = ActiveScope.Begin();
}
scopes.Push(scope);
return scope;
}
public void Log(string format, params object[] args)
{
if (scopes.Count > 0)
{
ActiveScope.Log(format, args);
}
else
{
Console.WriteLine(String.Format(format, args));
}
}
private Scope ActiveScope
{
get
{
var top = scopes.Peek();
if (top == null) throw new InvalidOperationException("No current scope");
return top;
}
}
}
}
Here's one: (check out http://hongouru.blogspot.ie/2011/09/c-ocr-optical-character-recognition.html or http://www.codeproject.com/Articles/41709/How-To-Use-Office-2007-OCR-Using-C for more info)
using MODI;
static void Main(string[] args)
{
DocumentClass myDoc = new DocumentClass();
myDoc.Create(@"theDocumentName.tiff"); //we work with the .tiff extension
myDoc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);
foreach (Image anImage in myDoc.Images)
{
Console.WriteLine(anImage.Layout.Text); //here we cout to the console.
}
}
I'm using tesseract OCR engine with TessNet2 (a C# wrapper - http://www.pixel-technology.com/freeware/tessnet2/).
Some basic code:
using tessnet2;
...
Bitmap image = new Bitmap(@"u:\user files\bwalker\2849257.tif");
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
ocr.SetVariable("tessedit_char_whitelist", "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.,$-/#&=()\"':?"); // Accepted characters
ocr.Init(@"C:\Users\bwalker\Documents\Visual Studio 2010\Projects\tessnetWinForms\tessnetWinForms\bin\Release\", "eng", false); // Directory of your tessdata folder
List<tessnet2.Word> result = ocr.DoOCR(image, System.Drawing.Rectangle.Empty);
string Results = "";
foreach (tessnet2.Word word in result)
{
Results += word.Confidence + ", " + word.Text + ", " + word.Left + ", " + word.Top + ", " + word.Bottom + ", " + word.Right + "\n";
}
Some online APIs work pretty well: ocr.space and Google Cloud Vision. Both of these are free, as long as you make fewer than 1000 OCR requests per month. You can drag and drop an image to do a quick manual test and see how they perform on your images.
I find OCR.space easier to use (no messing around with NuGet libraries), but, for my purpose, Google Cloud Vision provided slightly better results than OCR.space.
Google Cloud Vision example:
GoogleCredential cred = GoogleCredential.FromJson(json);
Channel channel = new Channel(ImageAnnotatorClient.DefaultEndpoint.Host, ImageAnnotatorClient.DefaultEndpoint.Port, cred.ToChannelCredentials());
ImageAnnotatorClient client = ImageAnnotatorClient.Create(channel);
Image image = Image.FromStream(stream);
EntityAnnotation googleOcrText = client.DetectText(image).First();
Console.Write(googleOcrText.Description);
OCR.space example:
string uri = $"https://api.ocr.space/parse/imageurl?apikey=helloworld&url={imageUri}";
string responseString = WebUtilities.DoGetRequest(uri);
OcrSpaceResult result = JsonConvert.DeserializeObject<OcrSpaceResult>(responseString);
if ((!result.IsErroredOnProcessing) && !String.IsNullOrEmpty(result.ParsedResults[0].ParsedText))
return result.ParsedResults[0].ParsedText;
A new API is OcrEngine.RecognizeAsync from WinRT/UWP. It can also be used in WinForms:
...
//for AsBuffer
using System.Runtime.InteropServices.WindowsRuntime;
...
async private void button5_Click(object sender, EventArgs e)
{
OcrEngine ocrEngine = null;
ocrEngine = OcrEngine.TryCreateFromUserProfileLanguages();
if (ocrEngine == null) return;
//convert the image to BGRA8 format which is needed by SoftwareBitmap
//is there a better method for this?
Bitmap img = new Bitmap(@"1.png");
byte[] ba = new byte[img.Width * img.Height * 4];
int o = 0;
for (int y = 0; y < img.Height; y++)
{
for (int x = 0; x < img.Width; x++)
{
var p = img.GetPixel(x, y);
ba[o++] = p.B;
ba[o++] = p.G;
ba[o++] = p.R;
ba[o++] = p.A;
}
}
var buffer = ba.AsBuffer();
var outputBitmap = SoftwareBitmap.CreateCopyFromBuffer(
buffer,
BitmapPixelFormat.Bgra8,
img.Width,
img.Height);
var ocrResult = await ocrEngine.RecognizeAsync(outputBitmap);
}
To use WinRT/UWP API in WinForms, add Nuget package "Microsoft.Windows.SDK.Contracts" (version 10.0.17134.100 for Win10 1803 SDK tested here) as described here.

Programmatically convert Word (docx) to PDF

OK, before you think "Not another question like this", please read this first.
I have an application (a web application in ASP.NET MVC 3) that generates Word files in DocX format using the DocX library.
The application takes a template and fills it in with all the data from a database.
Now I want to create a PDF version of that generated docx file.
I know Aspose.Words is an option, but not for me since I have little budget. Other libraries that cost money are also out of the question.
I don't have a SharePoint server, so Word Automation Services isn't an option either.
So I have two options (that I know of) left, and both use iTextSharp. I don't know which is better.
I could use the generated XML from the docx file and transform it to a version that is usable by iTextSharp.
I could create the PDF like I create the docx with a template.
Does anybody have any idea how much work each is, which of the two performs better, and whether it is even possible?
I know that the second option has the downside that when I change a template I have to change it for both versions.
If you have a better solution (free, that is), you are welcome to share it.
Another option, even if it needs some work: install OpenOffice on the server and, using the UNO libraries (included as assemblies in your app), open the docx document and save it directly as PDF.
I'll post an example in a few minutes...
PARTIAL EXAMPLE:
This is a class I created a long time ago and used to convert files to pdf
using unoidl.com.sun.star.lang;
using unoidl.com.sun.star.uno;
using unoidl.com.sun.star.container;
using unoidl.com.sun.star.frame;
using unoidl.com.sun.star.beans;
using unoidl.com.sun.star.view;
using System.Collections.Generic;
using System.IO;
namespace QOpenOffice
{
public enum AppType
{
Writer,
Calc,
Impress,
Draw,
Math
}
public enum ExportFilter{
Word97,
WriterPDF,
CalcPDF,
DrawPDF,
ImpressPDF,
MathPDF
}
class OpenOffice
{
private XComponentContext context;
private XMultiServiceFactory service;
private XComponentLoader component;
private XComponent doc;
private List<string> filters = new List<string>();
#region Constructors
public OpenOffice()
{
// This will start a new instance of OpenOffice.org if it is not running,
// or it will obtain an existing instance if it is already open.
context = uno.util.Bootstrap.bootstrap();
// The next step is to create a new OpenOffice.org service manager
service = (XMultiServiceFactory)context.getServiceManager();
// Create a new Desktop instance using our service manager
component = (XComponentLoader)service.createInstance("com.sun.star.frame.Desktop");
// Collect the names of all registered filters
XNameContainer filterFactory = (XNameContainer)service.createInstance("com.sun.star.document.FilterFactory");
foreach (string filter in filterFactory.getElementNames())
this.filters.Add(filter);
}
~OpenOffice()
{
if (doc != null)
doc.dispose();
doc = null;
}
#endregion
#region Private methods
private string FilterToString(ExportFilter filter)
{
switch (filter)
{
case ExportFilter.Word97: return "MS Word 97";
case ExportFilter.WriterPDF: return "writer_pdf_Export";
case ExportFilter.CalcPDF: return "calc_pdf_Export";
case ExportFilter.DrawPDF: return "draw_pdf_Export";
case ExportFilter.ImpressPDF: return "impress_pdf_Export";
case ExportFilter.MathPDF: return "math_pdf_Export";
}
return "";
}
#endregion
#region Public methods
public bool Load(string filename, bool hidden)
{
return Load(filename, hidden, "", "");
}
public bool Load(string filename, bool hidden, int filter_index, string filter_options)
{
return Load(filename, hidden, filters[filter_index], filter_options);
}
public bool Load(string filename, bool hidden, string filter_name, string filter_options)
{
List<PropertyValue> pv = new List<PropertyValue>();
pv.Add(new PropertyValue("Hidden", 0, new uno.Any(hidden), PropertyState.DIRECT_VALUE));
if (filter_name != "")
{
pv.Add(new PropertyValue("FilterName", 0, new uno.Any(filter_name), PropertyState.DIRECT_VALUE));
pv.Add(new PropertyValue("FilterOptions", 0, new uno.Any(filter_options), PropertyState.DIRECT_VALUE));
}
try
{
doc = component.loadComponentFromURL(
"file:///" + filename.Replace('\\', '/'), "_blank",
0, pv.ToArray());
// loadComponentFromURL can return null without throwing
return doc != null;
}
catch
{
doc = null;
return false;
}
}
public bool Print()
{
return Print(1, "");
}
public bool Print(int copies, string pages)
{
List<PropertyValue> pv = new List<PropertyValue>();
pv.Add(new PropertyValue("CopyCount", 0, new uno.Any(copies), PropertyState.DIRECT_VALUE));
if (pages != "")
pv.Add(new PropertyValue("Pages", 0, new uno.Any(pages), PropertyState.DIRECT_VALUE));
//if (doc is XPrintable)
try
{
((XPrintable)doc).print(pv.ToArray());
return true;
}
catch { return false; }
}
public bool Save(string filename, ExportFilter filter)
{
return Save(filename, FilterToString(filter));
}
public bool Save(string filename, string filter)
{
List<PropertyValue> pv = new List<PropertyValue>();
pv.Add(new PropertyValue("FilterName", 0, new uno.Any(filter), PropertyState.DIRECT_VALUE));
pv.Add(new PropertyValue("Overwrite", 0, new uno.Any(true), PropertyState.DIRECT_VALUE));
try
{
filename = filename.Replace("\\", "/");
((XStorable)doc).storeToURL("file:///" + filename, pv.ToArray());
return true;
}
catch { return false; }
}
public bool ExportToPdf(string filename)
{
filename = Path.ChangeExtension(filename, ".pdf");
// Try each PDF export filter in turn until one matches the loaded document type
bool ret = Save(filename, "writer_pdf_Export");
if (!ret) ret = Save(filename, "impress_pdf_Export");
if (!ret) ret = Save(filename, "calc_pdf_Export");
if (!ret) ret = Save(filename, "draw_pdf_Export");
if (!ret) ret = Save(filename, "math_pdf_Export");
return ret;
}
public void Close()
{
// Nothing to dispose if no document was loaded
if (doc != null)
doc.dispose();
doc = null;
}
public bool New(AppType app, bool hidden)
{
try
{
string sapp = "private:factory/";
switch (app)
{
case AppType.Writer:
sapp += "swriter";
break;
case AppType.Calc:
sapp += "scalc";
break;
case AppType.Impress:
sapp += "simpress";
break;
case AppType.Draw:
sapp += "sdraw";
break;
case AppType.Math:
sapp += "smath";
break;
}
PropertyValue pv = new PropertyValue("Hidden", 0, new uno.Any(hidden), PropertyState.DIRECT_VALUE);
doc = component.loadComponentFromURL(sapp, "_blank", 0, new PropertyValue[1] { pv });
return doc != null;
}
catch
{
doc = null;
return false;
}
}
#endregion
#region Properties
public List<string> Filters
{
get { return filters; }
}
#endregion
}
}
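As a usage illustration, here is a minimal sketch of how the class above might be driven to turn a docx file into a PDF. It assumes that OpenOffice is installed on the machine, that the CLI-UNO assemblies from that installation are referenced by the project, and that this code is compiled into the same assembly as the class (which is declared without `public`). The `Program` class and the default file path are hypothetical.

```csharp
using System;
using QOpenOffice;

class Program
{
    static int Main(string[] args)
    {
        // Hypothetical input path; pass a real absolute path as the first argument.
        string docxPath = args.Length > 0 ? args[0] : @"C:\temp\report.docx";

        // Starts OpenOffice or attaches to a running instance (see the constructor above).
        var office = new OpenOffice();

        // Load hidden so that no window appears on the server.
        if (!office.Load(docxPath, true))
        {
            Console.Error.WriteLine("Could not load " + docxPath);
            return 1;
        }

        try
        {
            // ExportToPdf swaps the extension for .pdf and tries the PDF export filters in turn.
            bool ok = office.ExportToPdf(docxPath);
            Console.WriteLine(ok ? "PDF written next to the source file." : "Export failed.");
            return ok ? 0 : 1;
        }
        finally
        {
            // Release the loaded document.
            office.Close();
        }
    }
}
```

The path should be absolute, because Load and Save simply prepend file:/// to it. Close is only called after a successful Load, since it disposes the loaded document.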
