I am not sure if the title makes it clear what I want to do.
My parser's input contains debug information about C source files. Some of the input looks like this:
L:C$main.c$41$1$10:C0C5
Which basically means that line 10 in the source file main.c corresponds to the memory address C0C5.
Here is an example of what my AST looks like:
Which represents the input:
M:main
L:C$main.c$29$1$0:C09C
L:C$main.c$30$1$10:C0A2
M:divide
L:C$divice.c$31$1$10:C5A9
What I want are two hash maps so that I can access this information quickly at runtime. But how can I build such hash maps, and is it possible to do that at parse time?
This is how I would like to use my parser:
public CDBFileParser getFileParser(String cdbFilePath)
{
Stream stream = File.OpenRead(cdbFilePath);
ANTLRInputStream inputStream = new ANTLRInputStream(stream);
CDBFileLexer lexer = new CDBFileLexer(inputStream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CDBFileParser parser = new CDBFileParser(tokens);
try
{
parser.TreeAdaptor = new CommonTreeAdaptor();
parser.parseCDBFile();
// All this works so far. Here comes the part I am looking for:
Module[] modules = parser.getModules();
Dictionary<int, int> lineToAddress = modules[0].getLineToAddressMap();
Dictionary<int, int> addressToLine = modules[0].getAddressToLineMap();
int address = 0xC09C;
Console.WriteLine( "Address 0xC09C is at line " + addressToLine[address] + " in " + modules[0].getName() );
}
catch (Exception e)
{
printException(e);
}
return parser;
}
Expected Output:
Address 0xC09C is at line 29 in main
Can anybody help?
Best regards.
I was about to delete my question, but maybe somebody else will come across this post. I just made the transition to ANTLR4, and it really is much simpler (so far at least).
In ANTLR4 an interface (e.g. ICDBFileListener) is generated for you, which you can use to collect all the information at parse time:
namespace Parser
{
public class CDBFileParserListener : ICDBFileListener
{
public void ExitModule_name(CDBFileParser.Module_nameContext context)
{
Console.WriteLine("ModuleName: " + context.GetText());
// Add module to module-map and remember
// that current module is context.GetText()
}
public void ExitLine_number(CDBFileParser.Line_numberContext context)
{
Console.WriteLine("LineNumber: " + context.GetText());
// Remember line number
}
public void ExitMemory_address(CDBFileParser.Memory_addressContext context)
{
Console.WriteLine("MemoryAddress: " + context.GetText());
// Add linenumber <-> memoryaddress to maps
}
private Dictionary<string, Module> m_modules = new Dictionary<string, Module>();
public Dictionary<string, Module> Modules
{
    get { return m_modules; }
}
}
}
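The Module type is not shown in the post; a rough sketch of what it could look like, purely as an assumption derived from the calls used below (Name, getLineFromAddress), would be:
public class Module
{
    private Dictionary<long, long> m_lineToAddress = new Dictionary<long, long>();
    private Dictionary<long, long> m_addressToLine = new Dictionary<long, long>();

    public Module(string name)
    {
        Name = name;
    }

    public string Name { get; private set; }

    // Called from the listener whenever a line/address pair has been parsed
    public void AddMapping(long line, long address)
    {
        m_lineToAddress[line] = address;
        m_addressToLine[address] = line;
    }

    public long getLineFromAddress(long address)
    {
        long line;
        m_addressToLine.TryGetValue(address, out line);
        return line;
    }
}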
And this is how it can be used:
public CDBFileParser getFileParser(String cdbFilePath)
{
Stream stream = File.OpenRead(cdbFilePath);
AntlrInputStream inputStream = new AntlrInputStream(stream);
CDBFileLexer lexer = new CDBFileLexer(inputStream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CDBFileParser parser = new CDBFileParser(tokens);
try
{
CDBFileParserListener listener = new CDBFileParserListener();
parser.AddParseListener(listener);
System.Diagnostics.Debug.WriteLine(parser.parseCDBFile().ToStringTree());
Dictionary<String, Module> modules = listener.Modules;
Module main;
modules.TryGetValue("main", out main);
long line = main.getLineFromAddress(0xC09C);
Console.WriteLine("0xC09C maps to " + line + " in main.c");
}
catch (Exception e)
{
printException(e);
}
return parser;
}
Related
I have data in tab-separated values (TSV) text files that I want to read and (eventually) store in database tables. With the TSV files, each line contains one record, but in one file the record can have 2 fields, in another file 4 fields, etc. I wrote working code to handle the 2-field records, but I thought this might be a good case for a generic method (or two) rather than writing new methods for each kind of record. However, I have not been able to code this because of 2 problems: I can't create a new object for holding the record data, and I don't know how to use reflection to generically fill the instance variables of my objects.
I looked at several other similar posts, including Datatable to object by using reflection and linq
Below is the code that works (this is in Windows, if that matters) and also the code that doesn't work.
public class TSVFile
{
public class TSVRec
{
public string item1;
public string item2;
}
private string fileName = "";
public TSVFile(string _fileName)
{
fileName = _fileName;
}
public TSVRec GetTSVRec(string Line)
{
TSVRec rec = new TSVRec();
try
{
string[] fields = Line.Split(new char[1] { '\t' });
rec.item1 = fields[0];
rec.item2 = fields[1];
}
catch (Exception ex)
{
System.Windows.Forms.MessageBox.Show("Bad import data on line: " +
Line + "\n" + ex.Message, "Error",
System.Windows.Forms.MessageBoxButtons.OK,
System.Windows.Forms.MessageBoxIcon.Error);
}
return rec;
}
public List<TSVRec> ImportTSVRec()
{
List<TSVRec> loadedData = new List<TSVRec>();
using (StreamReader sr = File.OpenText(fileName))
{
string Line = null;
while ((Line = sr.ReadLine()) != null)
{
loadedData.Add(GetTSVRec(Line));
}
}
return loadedData;
}
// *** Attempted generic methods ***
public T GetRec<T>(string Line)
{
T rec = new T(); // compile error!
Type t = typeof(T);
FieldInfo[] instanceVars = t.GetFields();
string[] fields = Line.Split(new char[1] { '\t' });
for (int i = 0; i < instanceVars.Length - 1; i++)
{
rec. ??? = fields[i]; // how do I finish this line???
}
return rec;
}
public List<T> Import<T>(Type t)
{
List<T> loadedData = new List<T>();
using (StreamReader sr = File.OpenText(fileName))
{
string Line = null;
while ((Line = sr.ReadLine()) != null)
{
loadedData.Add(GetRec<T>(Line));
}
}
return loadedData;
}
}
I saw the line
T rec = new T();
in the above-mentioned post, but it doesn't work for me...
I would appreciate any suggestions for how to make this work, if possible. I want to learn more about using reflection with generics, so I don't only want to understand how, but also why.
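For reference, the two sticking points in GetRec<T> can be addressed with a new() generic constraint (for the compile error) and FieldInfo.SetValue (for the assignment). This is only a sketch of that route, not code from the post; it assumes the fields are strings, as in TSVRec, and note that GetFields() does not guarantee any particular field order:
public T GetRec<T>(string Line) where T : new()
{
    // The new() constraint tells the compiler T has a parameterless constructor
    T rec = new T();
    FieldInfo[] instanceVars = typeof(T).GetFields();
    string[] fields = Line.Split('\t');
    for (int i = 0; i < instanceVars.Length && i < fields.Length; i++)
    {
        // SetValue assigns the i-th split field to the i-th public field via reflection
        instanceVars[i].SetValue(rec, fields[i]);
    }
    return rec;
}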
I wish @EdPlunkett had posted his suggestion as an answer, rather than a comment, so I could mark it as the answer...
To summarize: to do what I want to do, there is no need for "Assigning instance variables obtained through reflection in generic method". In fact, I can have a generic solution without using a generic method:
public class GenRec
{
public List<string> items = new List<string>();
}
public GenRec GetRec(string Line)
{
GenRec rec = new GenRec();
try
{
string[] fields = Line.Split(new char[1] { '\t' });
for (int i = 0; i < fields.Length; i++)
rec.items.Add(fields[i]);
}
catch (Exception ex)
{
System.Windows.Forms.MessageBox.Show("Bad import data on line: " + Line + "\n" + ex.Message, "Error",
System.Windows.Forms.MessageBoxButtons.OK,
System.Windows.Forms.MessageBoxIcon.Error);
}
return rec;
}
public List<GenRec> Import()
{
List<GenRec> loadedData = new List<GenRec>();
using (StreamReader sr = File.OpenText(fileName))
{
string Line = null;
while ((Line = sr.ReadLine()) != null)
loadedData.Add(GetRec(Line));
}
return loadedData;
}
I just tested this, and it works like a charm!
Of course, this isn't helping me to learn how to write generic methods or use reflection, but I'll take it...
I have a batch of PDFs that I want to convert to Text. It's easy to get text with something like this from iTextSharp:
PdfTextExtractor.GetTextFromPage(reader, pageNumber);
It's easy to get Images using this answer (or similar answers in the thread).
What I can't figure out easily... is how to interleave image placeholders in the text.
Given a PDF, a page # and GetTextFromPage I expect the output to be:
line 1
line 2
line 3
Whereas I'd like it to be (where 1.1 means page 1, image 1 and 1.2 means page 1, image 2):
line 1
[1.1]
line 2
[1.2]
line 3
Is there a way to get an "image placeholder" for iTextSharp, PdfSharp or anything similar? I'd like a GetTextAndPlaceHoldersFromPage method (or similar).
PS: Hrm... it's not letting me tag iTextSHARP - not iText. C# not Java.
C# Pdf to Text with image placeholder
https://stackoverflow.com/a/28087521/
https://stackoverflow.com/a/33697745/
Although this doesn't have the exact layout mentioned in my question (since that was a simplified version of what I really wanted anyway), it does have the starting parts listed in the second link (translated from iText Java)... with extra information pulled from the third link (some of the reflection used in Java didn't seem to work in C#, so that info came from the third link).
Working from this, I'm able to get a List of strings representing lines in the PDF (all pages, instead of just page 1)... with text added where images should be (Huzzah!). The ByteArrayToFile extension method is added for flavor (although I didn't include other parts/extensions, which may break copy/paste usage of this code).
I've also been able to greatly simplify other parts of my process and gut half of the garbage I had working before. Huzzah!!! Thanks @Mkl
internal class Program
{
public static void Main(string[] args)
{
var dir = Settings.TestDirectory;
var file = Settings.TestFile;
Log.Info($"File to Process: {file.FullName}");
using (var reader = new PdfReader(file.FullName))
{
var parser = new PdfReaderContentParser(reader);
var listener = new SimpleMixedExtractionStrategy(file, dir);
parser.ProcessContent(1, listener);
var x = listener.GetResultantText().Split('\n');
}
}
}
public class SimpleMixedExtractionStrategy : LocationTextExtractionStrategy
{
public static readonly ILog Log = LogManager.GetLogger(MethodBase.GetCurrentMethod().DeclaringType);
public DirectoryInfo OutputPath { get; }
public FileInfo OutputFile { get; }
private static readonly LineSegment UNIT_LINE = new LineSegment(new Vector(0, 0, 1), new Vector(1, 0, 1));
private int _counter;
public SimpleMixedExtractionStrategy(FileInfo outputFile, DirectoryInfo outputPath)
{
OutputPath = outputPath;
OutputFile = outputFile;
}
public override void RenderImage(ImageRenderInfo renderInfo)
{
try
{
var image = renderInfo.GetImage();
if (image == null) return;
var number = _counter++;
var imageFile = new FileInfo($"{OutputFile.FullName}-{number}.{image.GetFileType()}");
imageFile.ByteArrayToFile(image.GetImageAsBytes());
var segment = UNIT_LINE.TransformBy(renderInfo.GetImageCTM());
var location = new TextChunk("[" + imageFile + "]", segment.GetStartPoint(), segment.GetEndPoint(), 0f);
var locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.NonPublic | BindingFlags.Instance);
var LocationalResults = (List<TextChunk>)locationalResultField.GetValue(this);
LocationalResults.Add(location);
}
catch (Exception ex)
{
Log.Debug($"{ex.Message}");
Log.Verbose($"{ex.StackTrace}");
}
}
}
public static class ByteArrayExtensions
{
public static bool ByteArrayToFile(this FileInfo fileName, byte[] byteArray)
{
try
{
// Open the file for writing; FileMode.Create overwrites any existing file
using (var fileStream = new FileStream(fileName.FullName, FileMode.Create, FileAccess.Write))
{
    // Write the whole byte array to the stream
    fileStream.Write(byteArray, 0, byteArray.Length);
}
return true;
}
catch (Exception exception)
{
// Error
Log.Error($"Exception caught in process: {exception.Message}", exception);
}
// error occured, return false
return false;
}
}
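Note that the Main method above only processes page 1. To cover every page, as described, a loop over reader.NumberOfPages can be used; this is just a sketch built on the same classes:
using (var reader = new PdfReader(file.FullName))
{
    var parser = new PdfReaderContentParser(reader);
    var allLines = new List<string>();
    for (var page = 1; page <= reader.NumberOfPages; page++)
    {
        // A fresh strategy per page keeps each page's text (and image placeholders) separate
        var listener = new SimpleMixedExtractionStrategy(file, dir);
        parser.ProcessContent(page, listener);
        allLines.AddRange(listener.GetResultantText().Split('\n'));
    }
}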
I have the following code which takes a CSV and writes to a console:
using (CsvReader csv = new CsvReader(
new StreamReader("data.csv"), true))
{
// missing fields will not throw an exception,
// but will instead be treated as if there was a null value
csv.MissingFieldAction = MissingFieldAction.ReplaceByNull;
// to replace by "" instead, then use the following action:
//csv.MissingFieldAction = MissingFieldAction.ReplaceByEmpty;
int fieldCount = csv.FieldCount;
string[] headers = csv.GetFieldHeaders();
while (csv.ReadNextRecord())
{
for (int i = 0; i < fieldCount; i++)
Console.Write(string.Format("{0} = {1};",
headers[i],
csv[i] == null ? "MISSING" : csv[i]));
Console.WriteLine();
}
}
The CSV file has 7 headers for which I have 7 columns in my SQL table.
What is the best way to take each csv[i] and write to a row for each column and then move to the next row?
I tried to add the csv[i] to a string array but that didn't work.
I also tried the following:
SqlCommand sql = new SqlCommand("INSERT INTO table1 [" + csv[i] + "]", mysqlconnectionstring);
sql.ExecuteNonQuery();
My table (table1) is like this:
name address city zipcode phone fax device
Your problem is simple, but I will take it one step further and show you a better way to approach the issue.
When you have a problem to solve, always break it down into parts and put each part in its own method. For example, in your case:
1 - read from the file
2 - create a sql query
3 - run the query
You can even add validation of the file (imagine one or more lines not having all 7 fields...). Also, the example below should only be used if your file never grows much beyond about 500 lines; if it does, you should consider a SQL statement that loads the file directly into the database, called a bulk insert (a sketch follows below).
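For the bulk-insert route, a minimal SqlBulkCopy sketch (System.Data / System.Data.SqlClient) could look like this; connection is assumed to be an open SqlConnection, lines is the List<string[]> built in step 1, and the column names come from the table shown in the question:
var table = new DataTable();
foreach (var column in new[] { "name", "address", "city", "zipcode", "phone", "fax", "device" })
    table.Columns.Add(column);

foreach (var line in lines)
    table.Rows.Add(line); // one string[] per CSV row, in the same column order

using (var bulkCopy = new SqlBulkCopy(connection))
{
    bulkCopy.DestinationTableName = "table1";
    bulkCopy.WriteToServer(table);
}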
1 - read from file:
I would use a List<string[]> to hold the split line entries, and I always use a StreamReader to read from text files.
using (StreamReader sr = File.OpenText(this.CsvPath))
{
while ((line = sr.ReadLine()) != null)
{
splittedLine = line.Split(new string[] { this.Separator }, StringSplitOptions.None);
if (iLine == 0 && this.HasHeader)
// header line
this.Header = splittedLine;
else
this.Lines.Add(splittedLine);
iLine++;
}
}
2 - generate the sql
foreach (var line in this.Lines)
{
string entries = string.Concat("'", string.Join("','", line), "'"); // wrap each field in single quotes
this.Query.Add(string.Format(this.LineTemplate, entries));
}
3 - run the query
SqlCommand sql = new SqlCommand(string.Join("", query), mysqlconnectionstring);
sql.ExecuteNonQuery();
Having some fun, I ended up building the solution, and you can download it here.
The code can be found here. It needs more tweaks, but I will leave that for others. The solution is written in C#, VS 2013.
The ExtractCsvIntoSql class is as follows:
public class ExtractCsvIntoSql
{
private string CsvPath, Separator;
private bool HasHeader;
private List<string[]> Lines;
private List<string> Query;
/// <summary>
/// Header content of the CSV File
/// </summary>
public string[] Header { get; private set; }
/// <summary>
/// Template to be used in each INSERT Query statement
/// </summary>
public string LineTemplate { get; set; }
public ExtractCsvIntoSql(string csvPath, string separator, bool hasHeader = false)
{
this.CsvPath = csvPath;
this.Separator = separator;
this.HasHeader = hasHeader;
this.Lines = new List<string[]>();
this.Query = new List<string>();
// you can also set this
this.LineTemplate = "INSERT INTO [table1] VALUES ({0});";
}
/// <summary>
/// Generates the SQL Query
/// </summary>
/// <returns></returns>
public List<string> Generate()
{
if(this.CsvPath == null)
throw new ArgumentException("CSV Path can't be empty");
// extract csv into object
Extract();
// generate sql query
GenerateQuery();
return this.Query;
}
private void Extract()
{
string line;
string[] splittedLine;
int iLine = 0;
try
{
using (StreamReader sr = File.OpenText(this.CsvPath))
{
while ((line = sr.ReadLine()) != null)
{
splittedLine = line.Split(new string[] { this.Separator }, StringSplitOptions.None);
if (iLine == 0 && this.HasHeader)
// header line
this.Header = splittedLine;
else
this.Lines.Add(splittedLine);
iLine++;
}
}
}
catch (Exception ex)
{
if(ex.InnerException != null)
while (ex.InnerException != null)
ex = ex.InnerException;
throw ex;
}
// Lines will have all rows and each row, the column entry
}
private void GenerateQuery()
{
foreach (var line in this.Lines)
{
string entries = string.Concat("'", string.Join("','", line), "'"); // wrap each field in single quotes
this.Query.Add(string.Format(this.LineTemplate, entries));
}
}
}
and you can run it as:
class Program
{
static void Main(string[] args)
{
string file = Ask("What is the CSV file path? (full path)");
string separator = Ask("What is the current separator? (; or ,)");
var extract = new ExtractCsvIntoSql(file, separator);
var sql = extract.Generate();
Output(sql);
}
private static void Output(IEnumerable<string> sql)
{
foreach(var query in sql)
Console.WriteLine(query);
Console.WriteLine("*******************************************");
Console.Write("END ");
Console.ReadLine();
}
private static string Ask(string question)
{
Console.WriteLine("*******************************************");
Console.WriteLine(question);
Console.Write("= ");
return Console.ReadLine();
}
}
Usually I like to be a bit more generic, so I'll try to explain a very basic flow I use from time to time:
I don't like the hard-coded approach, so even though your code will work, it will be dedicated to one specific type. I prefer simple reflection: first to work out which DTO it is, and then to work out which repository I should use to manipulate it.
For example:
public class ImportProvider
{
private readonly string _path;
private readonly ObjectResolver _objectResolver;
public ImportProvider(string path)
{
_path = path;
_objectResolver = new ObjectResolver();
}
public void Import()
{
var filePaths = Directory.GetFiles(_path, "*.csv");
foreach (var filePath in filePaths)
{
var fileName = Path.GetFileName(filePath);
var className = fileName.Remove(fileName.Length-4);
using (var reader = new CsvFileReader(filePath))
{
var row = new CsvRow();
var repository = (DaoBase)_objectResolver.Resolve("DAL.Repository", className + "Dao");
while (reader.ReadRow(row))
{
var dtoInstance = (DtoBase)_objectResolver.Resolve("DAL.DTO", className + "Dto");
dtoInstance.FillInstance(row.ToArray());
repository.Save(dtoInstance);
}
}
}
}
}
Above is a very basic class responsible for importing the data. Regardless of how this piece of code parses CSV files (CsvFileReader), the important part is that a "CsvRow" is simply a List of strings.
Below is the implementation of the ObjectResolver:
public class ObjectResolver
{
private readonly Assembly _myDal;
public ObjectResolver()
{
_myDal = Assembly.Load("DAL");
}
public object Resolve(string nameSpace, string name)
{
var myLoadClass = _myDal.GetType(nameSpace + "." + name);
return Activator.CreateInstance(myLoadClass);
}
}
The idea is to simply follow a naming convention; in my case, a "Dto" suffix for the instances to create via reflection and a "Dao" suffix for the responsible DAO. The full name of the Dto or the Dao can be taken from the CSV file name or from the header (as you wish).
The next step is filling the Dto; each Dto implements the following simple abstract class:
public abstract class DtoBase
{
public abstract void FillInstance(params string[] parameters);
}
Since each Dto "knows" its structure (just as you knew how to create an appropriate table in the database), it can easily implement the FillInstance method. Here is a simple Dto example:
public class ProductDto : DtoBase
{
public int ProductId { get; set; }
public double Weight { get; set; }
public int FamilyId { get; set; }
public override void FillInstance(params string[] parameters)
{
ProductId = int.Parse(parameters[0]);
Weight = double.Parse(parameters[1]);
FamilyId = int.Parse(parameters[2]);
}
}
After you have your Dto filled with data, you need to find the appropriate Dao to handle it, which basically happens via reflection in this line of the Import() method:
var repository = (DaoBase)_objectResolver.Resolve("DAL.Repository", className + "Dao");
In my case the Dao implements an abstract base class, but that's not really relevant to your problem; your DaoBase can be a simple abstract class with a single Save() method.
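A minimal sketch of such a base class (just to illustrate; your own DaoBase would also carry the GetDbCommand/CreateParameter/ExecuteNonQuery helpers used further below) could be:
public abstract class DaoBase
{
    // Each concrete Dao knows how to persist its own Dto type
    public abstract void Save(DtoBase dto);
}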
This way you have a dedicated Dao to CRUD your Dtos; each Dao simply knows how to save its relevant Dto. Below is the ProductDao corresponding to the ProductDto:
public class ProductDao : DaoBase
{
private const string InsertProductQuery = @"SET foreign_key_checks = 0;
Insert into product (productID, weight, familyID)
VALUES (@productId, @weight, @familyId);
SET foreign_key_checks = 1;";
public override void Save(DtoBase dto)
{
var productToSave = dto as ProductDto;
var saveproductCommand = GetDbCommand(InsertProductQuery);
if (productToSave != null)
{
saveproductCommand.Parameters.Add(CreateParameter("@productId", productToSave.ProductId));
saveproductCommand.Parameters.Add(CreateParameter("@weight", productToSave.Weight));
saveproductCommand.Parameters.Add(CreateParameter("@familyId", productToSave.FamilyId));
ExecuteNonQuery(ref saveproductCommand);
}
}
}
Please ignore the CreateParameter() method, since it's an abstraction from the base class; you can just use CreateSqlParameter or CreateDataParameter, etc.
Just note that it's a fairly naive implementation; you can easily remodel it, depending on your needs.
From my first impression of your question, I guess you are dealing with a huge number of records (hundreds of thousands or more). If so, I would consider SQL bulk copy an option; if there are fewer records, a single-record INSERT per row is fine. The reason your INSERT is not working is that you are not providing all the columns of the table, and the statement also has a syntax error.
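To illustrate the single-record route with all columns supplied, here is a sketch of a parameterized INSERT for the table in the question (connection is assumed to be an open SqlConnection and csv is the reader from the question's loop):
using (var cmd = new SqlCommand(
    "INSERT INTO table1 (name, address, city, zipcode, phone, fax, device) " +
    "VALUES (@name, @address, @city, @zipcode, @phone, @fax, @device)", connection))
{
    // one parameter per CSV field, in the same order as the table columns
    cmd.Parameters.AddWithValue("@name", csv[0]);
    cmd.Parameters.AddWithValue("@address", csv[1]);
    cmd.Parameters.AddWithValue("@city", csv[2]);
    cmd.Parameters.AddWithValue("@zipcode", csv[3]);
    cmd.Parameters.AddWithValue("@phone", csv[4]);
    cmd.Parameters.AddWithValue("@fax", csv[5]);
    cmd.Parameters.AddWithValue("@device", csv[6]);
    cmd.ExecuteNonQuery();
}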
I've been searching for a while, and all I've seen are requests for OCR libraries. I would like to know how to set up the simplest, easiest-to-install-and-use OCR library, with detailed installation instructions, in a C# project.
If possible, I'd just like to reference it like a usual DLL...
Example:
using org.pdfbox.pdmodel;
using org.pdfbox.util;
Also a little OCR code example would be nice, such as:
public string OCRFromBitmap(Bitmap Bmp)
{
Bmp.Save(temppath, System.Drawing.Imaging.ImageFormat.Tiff);
string OcrResult = Analyze(temppath);
File.Delete(temppath);
return OcrResult;
}
So please keep in mind that I'm not familiar with OCR projects, and give me an answer as if you were talking to a dummy.
Edit:
I guess people misunderstood my request. I wanted to know how to add those open-source OCR libraries to a C# project and how to use them. The link given as a dup does not answer what I asked at all.
If anyone is looking into this, I've been trying different options and the following approach yields very good results. The following are the steps to get a working example:
Add the .NET wrapper for Tesseract to your project. It can be added via the NuGet package Install-Package Tesseract (https://github.com/charlesw/tesseract).
Go to the Downloads section of the official Tesseract project (https://code.google.com/p/tesseract-ocr/ EDIT: It's now located here: https://github.com/tesseract-ocr/langdata).
Download the preferred language data, example: tesseract-ocr-3.02.eng.tar.gz English language data for Tesseract 3.02.
Create tessdata directory in your project and place the language data files in it.
Go to Properties of the newly added files and set them to copy on build.
Add a reference to System.Drawing.
From .NET Wrapper repository, in the Samples directory copy the sample phototest.tif file into your project directory and set it to copy on build.
Create the following two files in your project (just to get started):
Program.cs
using System;
using Tesseract;
using System.Diagnostics;
namespace ConsoleApplication
{
class Program
{
public static void Main(string[] args)
{
var testImagePath = "./phototest.tif";
if (args.Length > 0)
{
testImagePath = args[0];
}
try
{
var logger = new FormattedConsoleLogger();
var resultPrinter = new ResultPrinter(logger);
using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
{
using (var img = Pix.LoadFromFile(testImagePath))
{
using (logger.Begin("Process image"))
{
var i = 1;
using (var page = engine.Process(img))
{
var text = page.GetText();
logger.Log("Text: {0}", text);
logger.Log("Mean confidence: {0}", page.GetMeanConfidence());
using (var iter = page.GetIterator())
{
iter.Begin();
do
{
if (i % 2 == 0)
{
using (logger.Begin("Line {0}", i))
{
do
{
using (logger.Begin("Word Iteration"))
{
if (iter.IsAtBeginningOf(PageIteratorLevel.Block))
{
logger.Log("New block");
}
if (iter.IsAtBeginningOf(PageIteratorLevel.Para))
{
logger.Log("New paragraph");
}
if (iter.IsAtBeginningOf(PageIteratorLevel.TextLine))
{
logger.Log("New line");
}
logger.Log("word: " + iter.GetText(PageIteratorLevel.Word));
}
} while (iter.Next(PageIteratorLevel.TextLine, PageIteratorLevel.Word));
}
}
i++;
} while (iter.Next(PageIteratorLevel.Para, PageIteratorLevel.TextLine));
}
}
}
}
}
}
catch (Exception e)
{
Trace.TraceError(e.ToString());
Console.WriteLine("Unexpected Error: " + e.Message);
Console.WriteLine("Details: ");
Console.WriteLine(e.ToString());
}
Console.Write("Press any key to continue . . . ");
Console.ReadKey(true);
}
private class ResultPrinter
{
readonly FormattedConsoleLogger logger;
public ResultPrinter(FormattedConsoleLogger logger)
{
this.logger = logger;
}
public void Print(ResultIterator iter)
{
logger.Log("Is beginning of block: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Block));
logger.Log("Is beginning of para: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Para));
logger.Log("Is beginning of text line: {0}", iter.IsAtBeginningOf(PageIteratorLevel.TextLine));
logger.Log("Is beginning of word: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Word));
logger.Log("Is beginning of symbol: {0}", iter.IsAtBeginningOf(PageIteratorLevel.Symbol));
logger.Log("Block text: \"{0}\"", iter.GetText(PageIteratorLevel.Block));
logger.Log("Para text: \"{0}\"", iter.GetText(PageIteratorLevel.Para));
logger.Log("TextLine text: \"{0}\"", iter.GetText(PageIteratorLevel.TextLine));
logger.Log("Word text: \"{0}\"", iter.GetText(PageIteratorLevel.Word));
logger.Log("Symbol text: \"{0}\"", iter.GetText(PageIteratorLevel.Symbol));
}
}
}
}
FormattedConsoleLogger.cs
using System;
using System.Collections.Generic;
using System.Text;
using Tesseract;
namespace ConsoleApplication
{
public class FormattedConsoleLogger
{
const string Tab = " ";
private class Scope : DisposableBase
{
private int indentLevel;
private string indent;
private FormattedConsoleLogger container;
public Scope(FormattedConsoleLogger container, int indentLevel)
{
this.container = container;
this.indentLevel = indentLevel;
StringBuilder indent = new StringBuilder();
for (int i = 0; i < indentLevel; i++)
{
indent.Append(Tab);
}
this.indent = indent.ToString();
}
public void Log(string format, object[] args)
{
var message = String.Format(format, args);
StringBuilder indentedMessage = new StringBuilder(message.Length + indent.Length * 10);
int i = 0;
bool isNewLine = true;
while (i < message.Length)
{
if (i + 1 < message.Length && message[i] == '\r' && message[i + 1] == '\n')
{
indentedMessage.AppendLine();
isNewLine = true;
i += 2;
}
else if (message[i] == '\r' || message[i] == '\n')
{
indentedMessage.AppendLine();
isNewLine = true;
i++;
}
else
{
if (isNewLine)
{
indentedMessage.Append(indent);
isNewLine = false;
}
indentedMessage.Append(message[i]);
i++;
}
}
Console.WriteLine(indentedMessage.ToString());
}
public Scope Begin()
{
return new Scope(container, indentLevel + 1);
}
protected override void Dispose(bool disposing)
{
if (disposing)
{
var scope = container.scopes.Pop();
if (scope != this)
{
throw new InvalidOperationException("Format scope removed out of order.");
}
}
}
}
private Stack<Scope> scopes = new Stack<Scope>();
public IDisposable Begin(string title = "", params object[] args)
{
Log(title, args);
Scope scope;
if (scopes.Count == 0)
{
scope = new Scope(this, 1);
}
else
{
scope = ActiveScope.Begin();
}
scopes.Push(scope);
return scope;
}
public void Log(string format, params object[] args)
{
if (scopes.Count > 0)
{
ActiveScope.Log(format, args);
}
else
{
Console.WriteLine(String.Format(format, args));
}
}
private Scope ActiveScope
{
get
{
var top = scopes.Peek();
if (top == null) throw new InvalidOperationException("No current scope");
return top;
}
}
}
}
Here's one: (check out http://hongouru.blogspot.ie/2011/09/c-ocr-optical-character-recognition.html or http://www.codeproject.com/Articles/41709/How-To-Use-Office-2007-OCR-Using-C for more info)
using MODI;
static void Main(string[] args)
{
DocumentClass myDoc = new DocumentClass();
myDoc.Create(@"theDocumentName.tiff"); //we work with the .tiff extension
myDoc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);
foreach (Image anImage in myDoc.Images)
{
Console.WriteLine(anImage.Layout.Text); //here we cout to the console.
}
}
I'm using tesseract OCR engine with TessNet2 (a C# wrapper - http://www.pixel-technology.com/freeware/tessnet2/).
Some basic code:
using tessnet2;
...
Bitmap image = new Bitmap(@"u:\user files\bwalker\2849257.tif");
tessnet2.Tesseract ocr = new tessnet2.Tesseract();
ocr.SetVariable("tessedit_char_whitelist", "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.,$-/#&=()\"':?"); // Accepted characters
ocr.Init(@"C:\Users\bwalker\Documents\Visual Studio 2010\Projects\tessnetWinForms\tessnetWinForms\bin\Release\", "eng", false); // Directory of your tessdata folder
List<tessnet2.Word> result = ocr.DoOCR(image, System.Drawing.Rectangle.Empty);
string Results = "";
foreach (tessnet2.Word word in result)
{
Results += word.Confidence + ", " + word.Text + ", " + word.Left + ", " + word.Top + ", " + word.Bottom + ", " + word.Right + "\n";
}
Some online APIs work pretty well: OCR.space and Google Cloud Vision. Both of these are free, as long as you do fewer than 1000 OCRs per month. You can drag & drop an image to do a quick manual test to see how they perform for your images.
I find OCR.space easier to use (no messing around with NuGet libraries), but, for my purpose, Google Cloud Vision provided slightly better results than OCR.space.
Google Cloud Vision example:
GoogleCredential cred = GoogleCredential.FromJson(json);
Channel channel = new Channel(ImageAnnotatorClient.DefaultEndpoint.Host, ImageAnnotatorClient.DefaultEndpoint.Port, cred.ToChannelCredentials());
ImageAnnotatorClient client = ImageAnnotatorClient.Create(channel);
Image image = Image.FromStream(stream);
EntityAnnotation googleOcrText = client.DetectText(image).First();
Console.Write(googleOcrText.Description);
OCR.space example:
string uri = $"https://api.ocr.space/parse/imageurl?apikey=helloworld&url={imageUri}";
string responseString = WebUtilities.DoGetRequest(uri);
OcrSpaceResult result = JsonConvert.DeserializeObject<OcrSpaceResult>(responseString);
if ((!result.IsErroredOnProcessing) && !String.IsNullOrEmpty(result.ParsedResults[0].ParsedText))
return result.ParsedResults[0].ParsedText;
A new API is OcrEngine.RecognizeAsync from WinRT/UWP. It can also be used in WinForms:
...
//for AsBuffer
using System.Runtime.InteropServices.WindowsRuntime;
...
async private void button5_Click(object sender, EventArgs e)
{
OcrEngine ocrEngine = null;
ocrEngine = OcrEngine.TryCreateFromUserProfileLanguages();
if (ocrEngine == null) return;
//convert the image to BGRA8 format which is needed by SoftwareBitmap
//is there a better method for this?
Bitmap img = new Bitmap(@"1.png");
byte[] ba = new byte[img.Width * img.Height * 4];
int o = 0;
for (int y = 0; y < img.Height; y++)
{
for (int x = 0; x < img.Width; x++)
{
var p = img.GetPixel(x, y);
ba[o++] = p.B;
ba[o++] = p.G;
ba[o++] = p.R;
ba[o++] = p.A;
}
}
var buffer = ba.AsBuffer();
var outputBitmap = SoftwareBitmap.CreateCopyFromBuffer(
buffer,
BitmapPixelFormat.Bgra8,
img.Width,
img.Height);
var ocrResult = await ocrEngine.RecognizeAsync(outputBitmap);
}
To use WinRT/UWP API in WinForms, add Nuget package "Microsoft.Windows.SDK.Contracts" (version 10.0.17134.100 for Win10 1803 SDK tested here) as described here.
I ran into a strange problem last week. A call to new XmlSerializer(typeof(MyType)) crashed with an ExternalException, telling me that csc.exe could not be executed.
After some investigation I found that this exception only occurs when the process environment size reaches a "critical" limit. I created a little sample application to verify the cause.
using System;
using System.IO;
using System.Xml.Serialization;

namespace EnvironmentTester
{
public class Program
{
private static void Main(string[] args)
{
FillProcessEnvironmentBlock(false);
SerializeDataObject();
}
private static void SerializeDataObject()
{
var dto = new DataObject {Name = "MyDto"};
try
{
var xmlSerializer = new XmlSerializer(dto.GetType()); // throws exception
xmlSerializer.Serialize(TextWriter.Null, dto);
Console.WriteLine("No exception occured.");
}
catch(Exception e)
{
Console.WriteLine("Exception occured : " + e.GetType());
}
Console.ReadKey();
}
private static void FillProcessEnvironmentBlock(bool fillToMax)
{
var currentEnvVarIndex = 0;
var environmentSize = GetEnvironmentSize();
int criticalEnvironmentSize = fillToMax ? 30692 : 30691;
while (environmentSize < criticalEnvironmentSize)
{
var envVarName = "Env" + currentEnvVarIndex;
var envVarValueLength = (criticalEnvironmentSize - environmentSize - envVarName.Length - 2) % 32000;
Environment.SetEnvironmentVariable(envVarName, new string('a', envVarValueLength));
currentEnvVarIndex++;
environmentSize = GetEnvironmentSize();
}
}
private static int GetEnvironmentSize()
{
var envVars = Environment.GetEnvironmentVariables();
int environmentSize = 0;
foreach (string envKey in envVars.Keys)
{
environmentSize += envKey.Length;
}
foreach (string envVar in envVars.Values)
{
environmentSize += envVar.Length;
}
environmentSize += 2*envVars.Keys.Count; // add the '=' and the '\0'
return environmentSize;
}
public class DataObject
{
[XmlAttribute("ObjectName")]
public string Name { get; set; }
}
}
}
If FillProcessEnvironmentBlock is called with the parameter false, the critical size is not reached; if it's called with true, the ExternalException is thrown. I tested it on two different Windows XP 32-bit SP2 machines, with the same result.
I know that csc.exe is called to create a temporary assembly used to read/write the XML file. But I don't know why the call to csc.exe fails if the process environment is too large.
Does anyone know the reason for this exception? And how can I work around it (if I don't want to write my own XML serialization)? Are there any other known problems caused by a process environment that's too large?
You can precompile the serializers using the Sgen tool or the MSBuild SGen task, so that csc.exe does not have to be invoked at runtime.
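For example, running sgen as a post-build step produces a MyAssembly.XmlSerializers.dll next to the target assembly, which XmlSerializer then loads instead of compiling one on the fly (the assembly name here is only a placeholder):
sgen.exe /assembly:MyAssembly.dll /force
Alternatively, setting GenerateSerializationAssemblies to On in the project's build settings makes MSBuild run the SGen task during the build.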