I have a batch of PDFs that I want to convert to text. It's easy to get the text with something like this from iTextSharp:
PdfTextExtractor.GetTextFromPage(reader, pageNumber);
It's easy to get images using this answer (or similar answers in the thread).
What I can't figure out easily... is how to interleave image placeholders in the text.
Given a PDF, a page # and GetTextFromPage I expect the output to be:
line 1
line 2
line 3
When I'd like it to be (where 1.1 means page 1, image 1 and 1.2 means page 1, image 2):
line 1
[1.1]
line 2
[1.2]
line 3
Is there a way to get an "image placeholder" for iTextSharp, PdfSharp or anything similar? I'd like a GetTextAndPlaceHoldersFromPage method (or similar).
PS: Hrm... it's not letting me tag iTextSHARP - not iText. C# not Java.
C# Pdf to Text with image placeholder
https://stackoverflow.com/a/28087521/
https://stackoverflow.com/a/33697745/
Although this doesn't have the exact layout mentioned in my question (since that was a simplified version of what I really wanted anyway), it does have the starting parts as listed in the second link (translated from iText Java), with extra information pulled from the third link (some of the reflection used in Java didn't seem to work in C#, so that info came from there).
Working from this, I'm able to get a List of strings representing lines in the PDF (all pages, instead of just page 1), with text added where images should be (huzzah!). The ByteArrayToFile extension method is included for flavor (although I didn't include other parts/extensions, so a straight copy/paste of this code may not compile).
I've also been able to greatly simplify other parts of my process and gut half of the garbage I had working before. Huzzah! Thanks @Mkl
internal class Program
{
public static void Main(string[] args)
{
var dir = Settings.TestDirectory;
var file = Settings.TestFile;
Log.Info($"File to Process: {file.FullName}");
using (var reader = new PdfReader(file.FullName))
{
var parser = new PdfReaderContentParser(reader);
var listener = new SimpleMixedExtractionStrategy(file, dir);
parser.ProcessContent(1, listener);
var x = listener.GetResultantText().Split('\n');
}
}
}
public class SimpleMixedExtractionStrategy : LocationTextExtractionStrategy
{
public static readonly ILog Log = LogManager.GetLogger(MethodBase.GetCurrentMethod().DeclaringType);
public DirectoryInfo OutputPath { get; }
public FileInfo OutputFile { get; }
private static readonly LineSegment UNIT_LINE = new LineSegment(new Vector(0, 0, 1), new Vector(1, 0, 1));
private int _counter;
public SimpleMixedExtractionStrategy(FileInfo outputFile, DirectoryInfo outputPath)
{
OutputPath = outputPath;
OutputFile = outputFile;
}
public override void RenderImage(ImageRenderInfo renderInfo)
{
try
{
var image = renderInfo.GetImage();
if (image == null) return;
var number = _counter++;
var imageFile = new FileInfo($"{OutputFile.FullName}-{number}.{image.GetFileType()}");
imageFile.ByteArrayToFile(image.GetImageAsBytes());
var segment = UNIT_LINE.TransformBy(renderInfo.GetImageCTM());
var location = new TextChunk("[" + imageFile + "]", segment.GetStartPoint(), segment.GetEndPoint(), 0f);
var locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.NonPublic | BindingFlags.Instance);
var locationalResults = (List<TextChunk>)locationalResultField.GetValue(this);
locationalResults.Add(location);
}
catch (Exception ex)
{
Log.Debug($"{ex.Message}");
Log.Verbose($"{ex.StackTrace}");
}
}
}
public static class ByteArrayExtensions
{
public static bool ByteArrayToFile(this FileInfo fileName, byte[] byteArray)
{
try
{
// Open the file for writing (creates or overwrites it)
using (var fileStream = new FileStream(fileName.FullName, FileMode.Create, FileAccess.Write))
{
// Writes the whole byte array to the stream
fileStream.Write(byteArray, 0, byteArray.Length);
}
return true;
}
catch (Exception exception)
{
// Error
Log.Error($"Exception caught in process: {exception.Message}", exception);
}
// error occurred, return false
return false;
}
}
Related
I'm having difficulty understanding how to obtain the content from a PdfDocument. I've learned from previous questions that PdfDocument flushes the content to optimize working with large documents. If my function returns a new PdfDocument, how do I get the byte[] to pass into my other functions?
Even with PdfDocument.GetReader() - I can't seem to find what I'm looking for.
My use-case is as follows:
Get pdf content from an email attachment
Pass the pdf to a helper function, which extracts specific pages from the initial attachment
Pass the new PdfDocument into a function which calls Azure's Forms Recognizer API to read the fields into an object
To summarize: given a PdfDocument only, how can I get/create a byte[] from it?
Here is my code:
public async Task<BaseResponse> Handle(ReceiveEmailCommand command, CancellationToken cancellationToken) {
var ms = new MemoryStream(command.attachments.First().Content);
var extractedDocument = pdfService.PreparePdfDocument(ms);
var analyzedDocument = await formsRecognizerService.AnalyzeDocument(extractedDocument);
// Do stuff with the analyzed document...
var response = await FileWebService.AddAnalyzedDocumentToFileSystem(analyzedDocument);
}
The function AnalyzeDocument expects a Stream parameter. I want to pass something like
new Stream(extractedDocument.GetReader().Stream)
Helper function implementations are below:
public PdfDocument PreparePdfDocument(MemoryStream ms)
{
PdfDocument extractedDoc;
var pdfReader = new PdfReader(ms);
var pdf = new PdfDocument(pdfReader);
var doc = new Document(pdf);
var matches = GetNumberWithPages(pdf);
if (matches.Count > 0)
{
var pageRange = matches
.Where(x => x.Number == "125")
.Select(x => Convert.ToInt32(x.PageIndex))
.ToList();
extractedDoc = SplitPages(pdf, pageRange.First(), pageRange.Last());
}
else
{
// If we couldn't parse the PDF then just take the first 4, 3 or 2 pages
try
{
extractedDoc = SplitPages(pdf, 1, 4);
}
catch (ITextException)
{
try
{
extractedDoc = SplitPages(pdf, 1, 3);
}
catch (ITextException)
{
// Last resort; if this also fails, let the exception propagate
extractedDoc = SplitPages(pdf, 1, 2);
}
}
}
return extractedDoc;
}
private static List<Match> GetNumberWithPages(PdfDocument doc)
{
var regex = new Regex(@"\s+([0-9]+)\s+(\([0-9]+\/[0-9]+\))\s+Page\s+([0-9])\s+of\s+([0-9]+)");
var matches = new List<Match>();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
var page = doc.GetPage(i);
var text = PdfTextExtractor.GetTextFromPage(page);
if (!string.IsNullOrEmpty(text))
{
var match = regex.Match(text);
if (match.Success)
{
var evaluated = EvaluateMatch(match, i, doc.GetNumberOfPages());
if (evaluated != null)
{
matches.Add(evaluated);
}
}
}
}
return matches;
}
private static Match? EvaluateMatch(Match match, int pageIndex, int totalPages)
{
if (match.Captures.Count == 1 && match.Groups.Count == 5)
{
var result = new Match
{
Number = match.Groups[1].Value,
Version = match.Groups[2].Value,
PageIndex = pageIndex.ToString(),
TotalPages = totalPages.ToString()
};
return result;
}
else
{
return null;
}
}
public PdfDocument SplitPages(PdfDocument doc, int startIndex, int endIndex)
{
var outputDocument = CreatePdfDocument();
doc.CopyPagesTo(startIndex, endIndex, outputDocument);
return outputDocument;
}
public PdfDocument CreatePdfDocument()
{
var baos = new ByteArrayOutputStream();
var writer = new PdfWriter(baos);
var pdf = new PdfDocument(writer);
return pdf;
}
I'm having difficulty understanding how to obtain the content from a PdfDocument.
You don't!
When you create a PdfDocument to write to, you initialize it with a PdfWriter. That PdfWriter in turn has been initialized to write somewhere. If you want to access the final PDF, you have to close the PdfDocument and look at that somewhere. Also it is not easy to retrieve that somewhere from the PdfWriter as it is wrapped in a number of layers therein. Thus, you should keep a reference to that somewhere close by.
Thus, your ByteArrayOutputStream usually wouldn't be created hidden away in some method CreatePdfDocument but instead in the calling method, and forwarded to other methods as a parameter. Then you can eventually retrieve its data. If you need to create your ByteArrayOutputStream hidden like that, you can return a pair of PdfDocument and ByteArrayOutputStream instead of the plain PdfDocument.
By the way, the idea behind this architecture is that iText tries to write as much PDF content as possible to that somewhere output as early as possible and free the memory. This allows it to create large documents without requiring a similarly large amount of memory.
"when I return the stream I cannot access a closed stream"
The ByteArrayOutputStream essentially is a MemoryStream; so you can in particular call ToArray to retrieve the finished PDF even if it's closed.
If you need the ByteArrayOutputStream as a regular stream, simply call PdfWriter.SetCloseStream(false) for your writer to prevent the close of the PdfDocument from also closing the stream.
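Putting that advice together, a minimal sketch might look like this (assuming iText 7 for .NET; a plain MemoryStream stands in for ByteArrayOutputStream, and the tuple-returning factory is just one way to keep the stream reference close by):

```csharp
using System.IO;
using iText.Kernel.Pdf;

public static class PdfFactory
{
    // Create the document together with the stream it writes to,
    // so the caller can get at the bytes once the document is closed.
    public static (PdfDocument Document, MemoryStream Output) CreatePdfDocument()
    {
        var output = new MemoryStream();
        var writer = new PdfWriter(output);
        // Optional: keep the stream usable after PdfDocument.Close()
        writer.SetCloseStream(false);
        return (new PdfDocument(writer), output);
    }
}

// Usage sketch:
// var (doc, output) = PdfFactory.CreatePdfDocument();
// ... copy pages into doc ...
// doc.Close();                     // flushes the finished PDF into output
// byte[] bytes = output.ToArray(); // works even on a closed MemoryStream
```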
I am working on a system powered by a Raspberry Pi. As part of this I am attempting to upload a text file to my google drive from the Pi. As a first test, I have a UWP app to create a text file of temperature and humidity data and try to upload it.
When it gets to the upload command, nothing happens. No exceptions are thrown but neither are any files uploaded. I have searched the web and followed lots of dead ends, but without an error I don't really know where to begin to find out why the file is not uploading; I have followed the instructions from Google and like I say, nothing throws an exception, it just doesn't work... Any suggestions would be really welcome as I have been spending a long time Googling for answers now and found nothing.
Here is the meat of the code behind the XAML page of the UWP app (I can send you the whole project if you ask). I have not included a couple of the methods in the code below, as it is only the method StoreValues() that is giving me gyp at the moment. This is the first iteration of the project that has not thrown an error.
public sealed partial class MainPage : Page
{
static string[] Scopes = { DriveService.Scope.Drive };
static string AppName = "Drive API .NET Quickstart";
static string fileName = "Environmental_Data";
static string fileExtension = ".txt";
private const int DHTPIN = 4;
private IDht dht = null;
private GpioPin dhtPin = null;
private DispatcherTimer sensorTimer = new DispatcherTimer();
private int counterWrite = 0;
private int counterStore = 0;
//the number on the end of the filename that will be incremented each time a file is produced.
//TODO remove old files and reset counter periodically
private int fileId = 1;
//read values every this many seconds
private int readInterval = 1;
//write a new line to the file every this many readings
private int writeCollectionSize = 2;
//send file to cloud every this many lines
private int storeCollectionSize = 4;
private StorageFolder Folder = ApplicationData.Current.LocalFolder;
public List<Environment> Readings;
public StorageFile File1;
public MainPage()
{
InitializeComponent();
InitializeLocalFiles(fileId);
InitializePi();
}
private async void InitializeLocalFiles(int fileId)
{
try
{
//make a new target file to save our data to
var newFile = $"{fileName}_{fileId.ToString()}{fileExtension}";
File1 = await Folder.CreateFileAsync(newFile, CreationCollisionOption.ReplaceExisting);
File1 = await Folder.GetFileAsync(newFile);
//let a human know where the first file is
Debug.WriteLine(File1.Path);
}
catch (Exception ex)
{
Debug.WriteLine("error encountered in InitializeLocalFiles: {0}", ex.Message);
}
}
private void InitializePi()
{
try
{
//set up the pins etc
dhtPin = GpioController.GetDefault().OpenPin(DHTPIN, GpioSharingMode.Exclusive);
dht = new Dht11(dhtPin, GpioPinDriveMode.Input);
//This is what the environment class looks like: public class Environment
//{ public decimal Temperature; public decimal Humidity; public DateTime DateTime; }
Readings = new List<Environment>();
sensorTimer.Interval = TimeSpan.FromSeconds(readInterval);
//add the tick to the clock and set it going
sensorTimer.Tick += sensorTimer_Tick;
sensorTimer.Start();
//initialise the app
temperatureMeter.Value = "OFF";
humidityMeter.Value = "OFF";
}
catch (Exception ex)
{
Debug.WriteLine("error encountered in InitializePi: {0}", ex.Message);
}
}
private void sensorTimer_Tick(object sender, object e)
{
try
{
//every tick of the clock, we will read the sensor to get new values
readSensor();
//if we got to the end of the write collection, then it is time to aggregate the figures and write a line to the local file
if (counterWrite == writeCollectionSize)
{
counterWrite = 0;
Readings.AggregateListToCSV(File1);
//clear the list of readings ready for the next tick
Readings.Clear();
}
//if we got to the end of the store collection, it's time to send the local file on a trip around the world
if (counterStore == storeCollectionSize)
{
counterStore = 0;
StoreValues(File1);
//rather than deleting the file, we will make a new one for the next go-round
fileId += 1;
InitializeLocalFiles(fileId);
}
}
catch (Exception ex)
{
Debug.WriteLine("error encountered in sensorTimer_Tick: {0}", ex.Message);
}
}
private async void StoreValues(StorageFile file)
{
try
{
UserCredential credential;
string cred = @"{""installed"":{""client_id"":""* **REDACTED FOR FORUM QUESTION ***"",""redirect_uris"":[""urn:ietf:wg:oauth:2.0:oob"",""http://localhost""]}}";
byte[] byteArray = Encoding.UTF8.GetBytes(cred);
//I'm using credentials stored in memory, but you could use a json file
using (var stream =
//new FileStream("credentials.json", FileMode.Open, FileAccess.Read))
new MemoryStream(byteArray))
{
//TODO find out why I have to copy the token.json file every time to the app folder - for some reason it does not copy over even though set to Copy Always
string credPath = "token.json";
credential = GoogleWebAuthorizationBroker.AuthorizeAsync(
GoogleClientSecrets.Load(stream).Secrets,
Scopes,
"user",
CancellationToken.None,
new FileDataStore(credPath, true)).Result;
}
//set up the stuff that the Google Drive seems to require. I don't pretend to understand this bit
DriveService service = new DriveService(new BaseClientService.Initializer()
{
HttpClientInitializer = credential,
ApplicationName = AppName,
});
service.HttpClient.Timeout = TimeSpan.FromMinutes(100);
var fileMetadata = new Google.Apis.Drive.v3.Data.File()
{
Name = "My Report",
MimeType = "application/vnd.google-apps.file"
};
FilesResource.CreateMediaUpload request;
//using (var stream = new FileStream(file.Path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (var stream = File.Open(file.Path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
request = service.Files.Create(fileMetadata, stream, "text/plain");
request.Fields = "id";
await request.UploadAsync();
}
//request.ResponseBody is always NULL - why??
var responseFile = request.ResponseBody;
//this line causes an error "attempted to read or write protected memory"
if (responseFile == null)
{
Debug.WriteLine($"** ERROR ** File {fileName}_{(fileId - 1).ToString()}{fileExtension} could not be uploaded.");
}
else
{
Debug.WriteLine($"File ID: {responseFile.Id.ToString()}");
}
}
catch(Exception ex)
{
Debug.WriteLine("error encountered in StoreValues: {0}", ex.Message);
}
}
}
This yields a Debug Output that looks like this (a snippet is shown):
....
C:\Data\Users\DefaultAccount\AppData\Local\Packages\e50d180b-3aea-4119-a4ff-72592c8ac7b5_4j4qdv46jwqvm\LocalState\Environmental_Data_6.txt
Lines ready to write: 1 | Lines ready to store: 1
** ERROR ** File Environmental_Data_5.txt could not be uploaded.
Lines ready to write: 2 | Lines ready to store: 2
18.9,10.0,2/16/2019 7:32:16 AM
Lines ready to write: 1 | Lines ready to store: 3
Lines ready to write: 2 | Lines ready to store: 4
18.9,9.9,2/16/2019 7:32:19 AM
C:\Data\Users\DefaultAccount\AppData\Local\Packages\e50d180b-3aea-4119-a4ff-72592c8ac7b5_4j4qdv46jwqvm\LocalState\Environmental_Data_7.txt
** ERROR ** File Environmental_Data_6.txt could not be uploaded.
Lines ready to write: 1 | Lines ready to store: 1
Lines ready to write: 2 | Lines ready to store: 2
18.9,10.1,2/16/2019 7:32:22 AM
Lines ready to write: 1 | Lines ready to store: 3
Lines ready to write: 2 | Lines ready to store: 4
The thread 0xfac has exited with code 0 (0x0).
Lines ready to write: 1 | Lines ready to store: 1
18.9,10.05,2/16/2019 7:32:25 AM
C:\Data\Users\DefaultAccount\AppData\Local\Packages\e50d180b-3aea-4119-a4ff-72592c8ac7b5_4j4qdv46jwqvm\LocalState\Environmental_Data_8.txt
** ERROR ** File Environmental_Data_7.txt could not be uploaded.
Lines ready to write: 2 | Lines ready to store: 2
...
I believe I have all the credentials etc. in place - I have to copy the token.json file that was generated from my credentials onto the Pi every time I deploy, but nonetheless it all seems to be connecting as it should.
While I'm beginning to get a bit of an understanding of the beginnings of wpf, this is all very much outside my ken and after 3 weeks of trying to sort this problem on and off, I'm really at the end of my tether - any suggestions would be brilliant.
I read Excel files using OpenXml. All works fine, but if the spreadsheet contains one cell that has an email address followed by a space and another word, such as:
abc@abc.com abc
it throws an exception immediately on opening the spreadsheet:
var _doc = SpreadsheetDocument.Open(_filePath, false);
exception:
DocumentFormat.OpenXml.Packaging.OpenXmlPackageException
Additional information:
Invalid Hyperlink: Malformed URI is embedded as a
hyperlink in the document.
There is an open issue on the OpenXml forum related to this problem: Malformed Hyperlink causes exception
In the post they talk about encountering this issue with a malformed "mailto:" hyperlink within a Word document.
They propose a work-around here: Workaround for malformed hyperlink exception
The workaround is essentially a small console application which locates the invalid URL and replaces it with a hard-coded value; here is the code snippet from their sample that does the replacement; you could augment this code to attempt to correct the passed brokenUri:
private static Uri FixUri(string brokenUri)
{
return new Uri("http://broken-link/");
}
The problem I had was actually with an Excel document (like you) and it had to do with a malformed http URL; I was pleasantly surprised to find that their code worked just fine with my Excel file.
Here is the entire work-around source code, just in case one of these links goes away in the future (note: Dump() is a LINQPad helper; substitute Console.WriteLine if you run this as a console application):
void Main(string[] args)
{
var fileName = @"C:\temp\corrupt.xlsx";
var newFileName = @"c:\temp\Fixed.xlsx";
var newFileInfo = new FileInfo(newFileName);
if (newFileInfo.Exists)
newFileInfo.Delete();
File.Copy(fileName, newFileName);
WordprocessingDocument wDoc;
try
{
using (wDoc = WordprocessingDocument.Open(newFileName, true))
{
ProcessDocument(wDoc);
}
}
catch (OpenXmlPackageException e)
{
e.Dump();
if (e.ToString().Contains("The specified package is not valid."))
{
using (FileStream fs = new FileStream(newFileName, FileMode.OpenOrCreate, FileAccess.ReadWrite))
{
UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
}
}
}
}
private static Uri FixUri(string brokenUri)
{
brokenUri.Dump();
return new Uri("http://broken-link/");
}
private static void ProcessDocument(WordprocessingDocument wDoc)
{
var elementCount = wDoc.MainDocumentPart.Document.Descendants().Count();
Console.WriteLine(elementCount);
}
}
public static class UriFixer
{
public static void FixInvalidUri(Stream fs, Func<string, Uri> invalidUriHandler)
{
XNamespace relNs = "http://schemas.openxmlformats.org/package/2006/relationships";
using (ZipArchive za = new ZipArchive(fs, ZipArchiveMode.Update))
{
foreach (var entry in za.Entries.ToList())
{
if (!entry.Name.EndsWith(".rels"))
continue;
bool replaceEntry = false;
XDocument entryXDoc = null;
using (var entryStream = entry.Open())
{
try
{
entryXDoc = XDocument.Load(entryStream);
if (entryXDoc.Root != null && entryXDoc.Root.Name.Namespace == relNs)
{
var urisToCheck = entryXDoc
.Descendants(relNs + "Relationship")
.Where(r => r.Attribute("TargetMode") != null && (string)r.Attribute("TargetMode") == "External");
foreach (var rel in urisToCheck)
{
var target = (string)rel.Attribute("Target");
if (target != null)
{
try
{
Uri uri = new Uri(target);
}
catch (UriFormatException)
{
Uri newUri = invalidUriHandler(target);
rel.Attribute("Target").Value = newUri.ToString();
replaceEntry = true;
}
}
}
}
}
catch (XmlException)
{
continue;
}
}
if (replaceEntry)
{
var fullName = entry.FullName;
entry.Delete();
var newEntry = za.CreateEntry(fullName);
using (StreamWriter writer = new StreamWriter(newEntry.Open()))
using (XmlWriter xmlWriter = XmlWriter.Create(writer))
{
entryXDoc.WriteTo(xmlWriter);
}
}
}
}
}
The fix by @RMD works great. I've been using it for years. But there is a new fix.
You can see the fix here in the changelog for issue #793
Upgrade OpenXML to 2.12.0.
Right click solution and select Manage NuGet Packages.
Implement the fix
It is helpful to have a unit test. Create an Excel file with a bad email address like test@gmail,com. (Note the comma instead of the dot.)
Make sure the stream you open and the call to SpreadsheetDocument.Open allow Read AND Write.
You need to implement a RelationshipErrorHandlerFactory and use it in the options when you open. Here is the code I used:
public class UriRelationshipErrorHandler : RelationshipErrorHandler
{
public override string Rewrite(Uri partUri, string id, string uri)
{
return "https://broken-link";
}
}
Then you need to use it when you open the document like this:
var openSettings = new OpenSettings
{
RelationshipErrorHandlerFactory = package =>
{
return new UriRelationshipErrorHandler();
}
};
using var document = SpreadsheetDocument.Open(stream, true, openSettings);
One of the nice things about this solution is that it does not require you to create a temporary "fixed" version of your file and it is far less code.
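For the suggested unit test, a sketch along these lines could tie it together (assuming xunit; the file name is a hypothetical fixture, and UriRelationshipErrorHandler is the class defined above):

```csharp
using System.IO;
using DocumentFormat.OpenXml.Packaging;
using Xunit;

public class MalformedHyperlinkTests
{
    [Fact]
    public void Open_SpreadsheetWithBadMailtoLink_DoesNotThrow()
    {
        // "corrupt.xlsx" is a hypothetical fixture containing a cell
        // with a malformed address such as test@gmail,com
        var openSettings = new OpenSettings
        {
            RelationshipErrorHandlerFactory = package => new UriRelationshipErrorHandler()
        };
        // The stream must allow Read AND Write, and the second
        // argument to Open must be true (open for editing)
        using var stream = new FileStream("corrupt.xlsx", FileMode.Open, FileAccess.ReadWrite);
        using var document = SpreadsheetDocument.Open(stream, true, openSettings);
        Assert.NotNull(document.WorkbookPart);
    }
}
```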
Unfortunately, the solutions where you have to open the file as a zip and replace the broken hyperlink did not help me.
I was just wondering how it is possible that everything works fine when your target framework is 4.0, even if the only installed .NET Framework is version 4.7.2.
I found out that there is a private static field inside System.UriParser that selects the version of the URI RFC specification. So it is possible to set it to V2, as it is set for .NET 4.0 and lower versions of the .NET Framework. The only problem is that the field is private static readonly.
Maybe someone will want to set it globally for the whole application, but I wrote a UriQuirksVersionPatcher that updates this version and restores it back in its Dispose method. It is obviously not thread-safe, but that is acceptable for my purpose.
using System;
using System.Diagnostics;
using System.Reflection;
namespace BarCap.RiskServices.RateSubmissions.Utility
{
#if (NET20 || NET35 || NET40)
public class UriQuirksVersionPatcher : IDisposable
{
public void Dispose()
{
}
}
#else
public class UriQuirksVersionPatcher : IDisposable
{
private const string _quirksVersionFieldName = "s_QuirksVersion"; //See Source\ndp\fx\src\net\System\_UriSyntax.cs in NetFX sources
private const string _uriQuirksVersionEnumName = "UriQuirksVersion";
/// <code>
/// private enum UriQuirksVersion
/// {
/// V1 = 1, // RFC 1738 - Not supported
/// V2 = 2, // RFC 2396
/// V3 = 3, // RFC 3986, 3987
/// }
/// </code>
private const string _oldQuirksVersion = "V2";
private static readonly Lazy<FieldInfo> _targetFieldInfo;
private static readonly Lazy<int?> _patchValue;
private readonly int _oldValue;
private readonly bool _isEnabled;
static UriQuirksVersionPatcher()
{
var targetType = typeof(UriParser);
_targetFieldInfo = new Lazy<FieldInfo>(() => targetType.GetField(_quirksVersionFieldName, BindingFlags.Static | BindingFlags.NonPublic));
_patchValue = new Lazy<int?>(() => GetUriQuirksVersion(targetType));
}
public UriQuirksVersionPatcher()
{
int? patchValue = _patchValue.Value;
_isEnabled = patchValue.HasValue;
if (!_isEnabled) //Disabled if it failed to get enum value
{
return;
}
int originalValue = QuirksVersion;
_isEnabled = originalValue != patchValue;
if (!_isEnabled) //Disabled if value is proper
{
return;
}
_oldValue = originalValue;
QuirksVersion = patchValue.Value;
}
private int QuirksVersion
{
get
{
return (int)_targetFieldInfo.Value.GetValue(null);
}
set
{
_targetFieldInfo.Value.SetValue(null, value);
}
}
private static int? GetUriQuirksVersion(Type targetType)
{
int? result = null;
try
{
result = (int)targetType.GetNestedType(_uriQuirksVersionEnumName, BindingFlags.Static | BindingFlags.NonPublic)
.GetField(_oldQuirksVersion, BindingFlags.Static | BindingFlags.Public)
.GetValue(null);
}
catch
{
#if DEBUG
Debug.WriteLine("ERROR: Failed to find UriQuirksVersion.V2 enum member.");
throw;
#endif
}
return result;
}
public void Dispose()
{
if (_isEnabled)
{
QuirksVersion = _oldValue;
}
}
}
#endif
}
Usage:
using(new UriQuirksVersionPatcher())
{
using(var document = SpreadsheetDocument.Open(fullPath, false))
{
//.....
}
}
P.S. Later I found that someone had already implemented this patcher: https://github.com/google/google-api-dotnet-client/blob/master/Src/Support/Google.Apis.Core/Util/UriPatcher.cs
I haven't used OpenXml, but if there's no specific reason for using it then I highly recommend LinqToExcel. Example code:
var sheet = new ExcelQueryFactory("filePath");
var allRows = from r in sheet.Worksheet() select r;
foreach (var r in allRows) {
var cella = r["Header"].ToString();
}
I am not sure if the title makes it clear what I want to do.
My input for my parser contains debug information about C source files. Some of the input looks like this:
L:C$main.c$41$1$10:C0C5
Which basically means that line 10 in the source file main.c corresponds with the memory address C0C5.
Here is an example of what my AST looks like (screenshot omitted), which represents the input:
M:main
L:C$main.c$29$1$0:C09C
L:C$main.c$30$1$10:C0A2
M:divide
L:C$divice.c$31$1$10:C5A9
What I want are two hash maps so that I can access this information quickly at runtime. But how can I build such hash maps, and is it possible to do that at parse time?
This is how I would like to use my parser:
public CDBFileParser getFileParser(String cdbFilePath)
{
Stream stream = File.OpenRead(cdbFilePath);
ANTLRInputStream inputStream = new ANTLRInputStream(stream);
CDBFileLexer lexer = new CDBFileLexer(inputStream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CDBFileParser parser = new CDBFileParser(tokens);
try
{
parser.TreeAdaptor = new CommonTreeAdaptor();
parser.parseCDBFile();
// All this works so far. Here comes the part I am looking for:
Modules[] modules = parser.getModules();
Dictionary<int, int> lineToAddress = modules[0].getLineToAddressMap();
Dictionary<int, int> addressToLine = modules[0].getAddressToLineMap();
int address = 0xC09C;
Console.WriteLine( "Address 0xC09C is at line " + addressToLine[address] + " in " + modules[0].getName() );
}
catch (Exception e)
{
printException(e);
}
return parser;
}
Expected Output:
Address 0xC09C is at line 29 in main
Can anybody help?
Best regards.
I was about to delete my question, but maybe somebody else will come across this post. I have just made the transition to ANTLR4 and it really is much simpler (so far, at least).
In ANTLR4, a listener interface (e.g. ICDBFileListener) is generated for you, which you can use to collect all the information at parse time:
namespace Parser
{
public class CDBFileParserListener : ICDBFileListener
{
public void ExitModule_name(CDBFileParser.Module_nameContext context)
{
Console.WriteLine("ModuleName: " + context.GetText());
// Add module to module-map and remember
// that current module is context.GetText()
}
public void ExitLine_number(CDBFileParser.Line_numberContext context)
{
Console.WriteLine("LineNumber: " + context.GetText());
// Remember line number
}
public void ExitMemory_address(CDBFileParser.Memory_addressContext context)
{
Console.WriteLine("MemoryAddress: " + context.GetText());
// Add linenumber <-> memoryaddress to maps
}
public Dictionary<string, Module> Modules
{
get { return m_modules; }
}
private readonly Dictionary<string, Module> m_modules = new Dictionary<string, Module>();
}
}
And this is how it can be used:
public CDBFileParser getFileParser(String cdbFilePath)
{
Stream stream = File.OpenRead(cdbFilePath);
AntlrInputStream inputStream = new AntlrInputStream(stream);
CDBFileLexer lexer = new CDBFileLexer(inputStream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CDBFileParser parser = new CDBFileParser(tokens);
try
{
CDBFileParserListener listener = new CDBFileParserListener();
parser.AddParseListener(listener);
System.Diagnostics.Debug.WriteLine(parser.parseCDBFile().ToStringTree());
Dictionary<String, Module> modules = listener.Modules;
Module main;
modules.TryGetValue("main", out main);
long line = main.getLineFromAddress(0xC09C);
Console.WriteLine("0xC09C maps to " + line + " in main.c");
}
catch (Exception e)
{
printException(e);
}
return parser;
}
I need to find certain elements within a memory-mapped file. I have managed to map the file; however, I have some problems finding the elements. My idea was to save all file elements into a list, and then search that list.
How do I create a function that returns a list with all elements of the mapped file?
// Index indicates the line to read from
public List<string> GetElement(int index) {
}
The way I am mapping the file:
public void MapFile(string path)
{
string mapName = Path.GetFileName(path);
try
{
// Opening existing mmf
if (mapName != null)
{
_mmf = MemoryMappedFile.OpenExisting(mapName);
}
// Setting the pointer at the start of the file
_pointer = 0;
// We create the accessor to read the file
_accessor = _mmf.CreateViewAccessor();
// We mark the file as open
_open = true;
}
catch (Exception ex) {....}
try
{
// Trying to create the mmf
_mmf = MemoryMappedFile.CreateFromFile(path);
// Setting the pointer at the start of the file
_pointer = 0;
// We create the accessor to read the file
_accessor = _mmf.CreateViewAccessor();
// We mark the file as open
_open = true;
}
catch (Exception exInner){..}
}
The file that I am mapping is a UTF-8 ASCII file. Nothing weird.
What I have done:
var list = new List<string>();
// String to store what we read
string trace = string.Empty;
// We read the byte at the pointer
var b = _accessor.ReadByte(_pointer);
int tracei = 0;
var traceb = new byte[2048];
// If b is different from 0 we have some data to read
if (b != 0)
{
while (b != 0)
{
// Check if it's an endline
if (b == '\n')
{
trace = Encoding.UTF8.GetString(traceb, 0, tracei - 1);
list.Add(trace);
trace = string.Empty;
tracei = 0;
_lastIndex++;
}
else
{
traceb[tracei++] = b;
}
// Advance and read
b = _accessor.ReadByte(++_pointer);
}
}
The code is difficult to read for humans and is not very efficient. How can I improve it?
You are re-inventing StreamReader; it does exactly what you do. The odds that you really want a memory-mapped file are quite low: they take a lot of virtual memory, which only pays off if you repeatedly read the same file at different offsets. That is very unlikely here, since text files must be read sequentially - you don't know how long the lines are.
Which makes this one line of code the probable best replacement for what you posted:
string[] trace = System.IO.File.ReadAllLines(path);
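If the line-indexed GetElement(int index) from the question is still wanted, it can be layered on the same idea (a sketch; TraceFile is a hypothetical name, and File.ReadLines streams the file instead of loading it all at once):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public class TraceFile
{
    private readonly string _path;

    public TraceFile(string path) => _path = path;

    // All lines of the file, materialized into a list
    public List<string> GetAllLines() => File.ReadLines(_path).ToList();

    // Index indicates the line to read from (0-based);
    // returns null when the index is out of range
    public string GetElement(int index) => File.ReadLines(_path).ElementAtOrDefault(index);
}
```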