Get all SDF/COS objects from PDF

I am trying to get a list of all SDF/COS objects within a PDF document, using PDFNet 7.0.4 and netcoreapp3.1. Using a different PDF parser, I know that this document has 570 total COS objects within it, including 3 images.
Initially I used PDFDoc to load the document, and iterated through the pages just looking for Element objects of type e_image or e_inline_image, but this only yielded 2 out of 3 images. In a larger document it did even worse; 0 out of ~2600 images.
Now, I've stepped back and am trying to do a lower level search via SDFDoc. I can get a trailer object, and then iterate through it, recursing any e_dict or e_stream objects, and returning anything that looks like a real object (i.e., anything that actually has an object number and generation).
IEnumerable<Obj> Recurse(Obj root)
{
    var idHash = new HashSet<PdfIdentifier>();
    return Recurse(root, idHash);
    static IEnumerable<Obj> Recurse(Obj obj, HashSet<PdfIdentifier> idHash)
    {
        var id = obj.ToPdfIdentifier();
        if (!idHash.Contains(id))
        {
            if (id != nullIdentifier)
            {
                idHash.Add(id);
                yield return obj;
            }
            if (obj.GetType().OneOf(Obj.ObjType.e_dict, Obj.ObjType.e_stream))
            {
                for (var iter = obj.GetDictIterator(); iter.HasNext(); iter.Next())
                {
                    foreach (var child in Recurse(iter.Value(), idHash))
                    {
                        yield return child;
                    }
                }
            }
        }
    }
}
static PdfIdentifier nullIdentifier = new PdfIdentifier() { Generation = 0, ObjectNum = 0 };
ToPdfIdentifier is a simple extension method to get the object number and generation:
public static PdfIdentifier ToPdfIdentifier(this pdftron.SDF.Obj obj) => new PdfIdentifier { ObjectNum = obj.GetObjNum(), Generation = obj.GetGenNum() };
This runs OK, but only returns 45 objects, none of them the images I'm actually interested in.
How can I simply get all COS objects from a document?
edit
Here is the original PDFDoc code we tried to get all images:
private IEnumerable<(PdfIdentifier id, Element el)> GetImages(Stream stream)
{
    var doc = new PDFDoc(stream);
    var reader = new ElementReader();
    for (var iter = doc.GetPageIterator(); iter.HasNext(); iter.Next())
    {
        reader.Begin(iter.Current());
        var el = reader.Next();
        while (el != null)
        {
            var type = el.GetType();
            if (el.GetType().OneOf(Element.Type.e_image, Element.Type.e_inline_image))
            {
                var obj = el.GetXObject();
                var id = el.GetXObject().ToPdfIdentifier();
                yield return (id, el);
            }
            el = reader.Next();
        }
        reader.End();
    }
}
This kind of worked in that it returned some images, but not all. For some sample documents it returned all, for some it returned a subset, and for some it returned none at all.
edit
Just for future reference, thanks to the answer below from Ryan, we ended up with a pair of nice clean extension methods:
public static IEnumerable<SDF.Obj> GetAllObj(this SDF.SDFDoc sdfDoc)
{
    var xrefTableSize = sdfDoc.XRefSize();
    for (int objNum = 0; objNum < xrefTableSize; objNum++)
    {
        var obj = sdfDoc.GetObj(objNum);
        if (obj.IsFree())
        {
            continue;
        }
        else
        {
            yield return obj;
        }
    }
}
and
public static string Subtype(this SDF.Obj obj) => obj.FindObj("Subtype") switch
{
    null => null,
    var s when s.IsName() => s.GetName(),
    var s when s.IsString() => s.GetAsPDFText(),
    _ => throw new Exception("COS object has an invalid Subtype entry")
};
Now we can get images as simply as sdfDoc.GetAllObj().Where(o => o.IsStream() && o.Subtype() == "Image"); or, in LINQ query syntax:
from o in sdfDoc.GetAllObj()
where o.IsStream() && o.Subtype() == "Image"
select new Image(o);

If you want to get the images that are actually used on a page of the PDF (in case there happen to be unused images in the PDF), then you would use this sample code. This code would have the added bonus of including inline images.
https://www.pdftron.com/documentation/samples/dotnetcore/cs/ImageExtractTest
That approach can be slow, though, if the document has hundreds or thousands of graphically complex pages.
The other way, as you described, is to iterate over the COS objects. The following C# code finds all Image streams. Note that the PDF standard specifically states that streams have to be indirect objects, so I think you can safely skip reading through the direct objects.
using (PDFDoc doc = new PDFDoc("2002.04610.pdf"))
{
    doc.InitSecurityHandler();
    int xrefSz = doc.GetSDFDoc().XRefSize();
    for (int xrefCounter = 0; xrefCounter < xrefSz; ++xrefCounter)
    {
        Obj o = doc.GetSDFDoc().GetObj(xrefCounter);
        if (o.IsFree())
        {
            continue;
        }
        if (o.IsStream())
        {
            Obj subtypeObj = o.FindObj("Subtype");
            if (subtypeObj != null)
            {
                string subtype = "";
                if (subtypeObj.IsName()) subtype = subtypeObj.GetName();
                if (subtypeObj.IsString()) subtype = subtypeObj.GetAsPDFText(); // Subtype should be a Name, but just in case
                if (subtype.CompareTo("Image") == 0)
                {
                    Console.WriteLine("Indirect object {0} is an Image Stream", o.GetObjNum());
                }
            }
        }
    }
}

Related

Merge data from two arrays or something else

How can I combine the Ids from the list I read from the file /test.json with the ids in ourOrders[i].id?
Or is there another way to do this?
private RegionModel FilterByOurOrders(RegionModel region, List<OurOrderModel> ourOrders, MarketSettings market, bool byOurOrders)
{
    var result = new RegionModel
    {
        updatedTs = region.updatedTs,
        orders = new List<OrderModel>(region.orders.Count)
    };
    var json = File.ReadAllText("/test.json");
    var otherBotOrders = JsonSerializer.Deserialize<OrdersTimesModel>(json);
    OtherBotOrders = new Dictionary<string, OrderTimesInfoModel>();
    foreach (var otherBotOrder in otherBotOrders.OrdersTimesInfo)
    {
        //OtherBotOrders.Add(otherBotOrder.Id, otherBotOrder);
        BotController.WriteLine($"{otherBotOrder.Id}"); //Output ID orders to the console works
    }
    foreach (var order in region.orders)
    {
        if (ConvertToDecimal(order.price) < 1 || !byOurOrders)
        {
            int i = 0;
            var isOurOrder = false;
            while (i < ourOrders.Count && !isOurOrder)
            {
                if (ourOrders[i].id.Equals(order.id, StringComparison.InvariantCultureIgnoreCase))
                {
                    isOurOrder = true;
                }
                ++i;
            }
            if (!isOurOrder)
            {
                result.orders.Add(order);
            }
        }
    }
    return result;
}
OrdersTimesModel looks like this:
public class OrdersTimesModel
{
    public List<OrderTimesInfoModel> OrdersTimesInfo { get; set; }
}
test.json:
{"OrdersTimesInfo":[{"Id":"1"},{"Id":"2"}]}
Added:
I'll try to clarify the question:
There are three lists with ID:
First (all orders): region.orders, as order.id
Second (our orders): ourOrders, as ourOrders[i].id in a while loop
Third (our orders 2): from the /test.json file, as an array {"Orders":[{"Id":"12345..."...},{"Id":"12345..." ...}...]}
There is a foreach containing a while loop that compares the First (all orders) list with the Second (our orders) list. If the ids match, the order is one of ours: isOurOrder = true;
Accordingly, orders for which isOurOrder is false are added to the result: result.orders.Add(order)
What I need:
The check if (ourOrders[i].id.Equals(order.id, StringComparison.InvariantCultureIgnoreCase)) should also take into account the Ids from the Third (our orders 2) list.
Or is there any other way to do it?

You should be able to completely avoid writing loops if you use LINQ (there will be loops running in the background, but it's way easier to read)
You can access some documentation here: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/introduction-to-linq-queries
and you have some pretty cool extension methods for arrays: https://learn.microsoft.com/en-us/dotnet/api/system.linq.enumerable?view=net-6.0 (these are great to get your code easy to read)
Solution
using System.Linq;
private RegionModel FilterByOurOrders(RegionModel region, List<OurOrderModel> ourOrders, MarketSettings market, bool byOurOrders)
{
    var result = new RegionModel
    {
        updatedTs = region.updatedTs,
        orders = new List<OrderModel>(region.orders.Count)
    };
    var json = File.ReadAllText("/test.json");
    var otherBotOrders = JsonSerializer.Deserialize<OrdersTimesModel>(json);
    // This line should get you a sequence containing
    // JUST the ids in the JSON file (the list lives in the OrdersTimesInfo property)
    var idsFromJsonFile = otherBotOrders.OrdersTimesInfo.Select(x => x.Id);
    // Here you'll get a sequence with the ids for your orders
    var idsFromOurOrders = ourOrders.Select(x => x.id);
    // Union will only take unique values,
    // so you avoid repetition. ToList() runs the query once
    // instead of on every Contains call below.
    var mergedArrays = idsFromJsonFile.Union(idsFromOurOrders).ToList();
    // Now we just need to query the region orders
    // We'll keep every element whose id is NOT contained in the list we created earlier
    var filteredRegionOrders = region.orders.Where(x => !mergedArrays.Contains(x.id));
    result.orders.AddRange(filteredRegionOrders);
    return result;
}
You can add conditions to any of those actions (like checking for order price or the boolean flag you get as a parameter), and of course you can do it without assigning so many variables, I did it that way just to make it easier to explain.
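The original loop compared ids case-insensitively, which a plain Contains does not do. A minimal sketch of the same filtering with a case-insensitive HashSet follows; the class and method names (OrderFilterSketch, ExcludeOurs) are made up for illustration and it works on bare id strings rather than the real model types:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrderFilterSketch
{
    // Returns the ids from allIds that appear in neither of the two "our orders" lists.
    // Hypothetical helper: the real code would pass order.id, ourOrders[i].id and the Ids from /test.json.
    public static List<string> ExcludeOurs(
        IEnumerable<string> allIds,
        IEnumerable<string> ourIds,
        IEnumerable<string> otherBotIds)
    {
        // Same comparison rule as the question's Equals(..., InvariantCultureIgnoreCase) call.
        var excluded = new HashSet<string>(ourIds, StringComparer.InvariantCultureIgnoreCase);
        excluded.UnionWith(otherBotIds);

        // A hash lookup per order replaces the inner while loop.
        return allIds.Where(id => !excluded.Contains(id)).ToList();
    }
}
```

With region.orders, the same idea would be region.orders.Where(o => !excluded.Contains(o.id)), plus whatever price or byOurOrders condition you need.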

Improve performance when sorting and reading JSON from file C#

I have a JSON file that contains about 20k lines that have to be read, sorted and saved into a database. I've written code for it and it works the way it's supposed to, but my issue is that it takes about 10 minutes. Therefore I wonder if anyone has ideas on what can be done to improve the performance.
Json:
{
"Number": 123456,
"Area": "NE01"
},
{
"Number": 123457,
"Area": "NE01"
},
and so forth....
C#:
dynamic json = JsonConvert.DeserializeObject(File.ReadAllText(path, Encoding.UTF8));
foreach (var obj in json)
{
    if (obj.Area == "NE01")
    {
        var o = new object
        {
            Number = obj.Number,
        };
        db.Entity.Add(obj);
        continue;
    }
    if (obj.Area == "NE02")
    {
        var o = new object
        {
            Number = obj.Number,
        };
        db.Entity.Add(obj);
        continue;
    }
    if (obj.Area == "NE03")
    {
        var o = new object
        {
            Number = obj.Number,
        };
        db.Entity.Add(obj);
        continue;
    }
    if (obj.Area == "NE04")
    {
        var o = new object
        {
            Number = obj.Number
        };
        db.Entity.Add(obj);
        continue;
    }
}
db.SaveChanges();
To make it clearer, area has four different values. Depending on the value the number will have a foreign key that points to the area. Unfortunately I'm not allowed to change anything in the underlying database.
Let me know if I have to provide further information.

Using EntityFramework.Utilities you can use Bulk Insert which should speed up insertions.
Something like:
public class Data
{
    public int Number { get; set; }
    public string Area { get; set; }
}
var objects = JsonConvert.DeserializeObject<List<Data>>(File.ReadAllText(path, Encoding.UTF8))
    .Select(d => new object { Number = d.Number })
    .ToList();
EFBatchOperation.For(db, db.Entity).InsertAll(objects);
Disclaimer: Code not tested.

Only by using AddRange I was able to decrease the time to just over one minute. Which is good enough for my purpose.
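For reference, a rough sketch of what building the list for a single AddRange call could look like. The NumberEntity class, the AreaId property and the foreign key values in AreaIds are all assumptions for illustration; the real entity and keys come from the existing database:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Mirrors one record of the JSON file.
public class Data
{
    public int Number { get; set; }
    public string Area { get; set; }
}

// Hypothetical stand-in for the real entity type.
public class NumberEntity
{
    public int Number { get; set; }
    public int AreaId { get; set; }
}

public static class ImportSketch
{
    // Made-up foreign key values for the four areas.
    private static readonly Dictionary<string, int> AreaIds = new Dictionary<string, int>
    {
        { "NE01", 1 }, { "NE02", 2 }, { "NE03", 3 }, { "NE04", 4 }
    };

    // Maps every row with a known area to an entity; rows with other areas are skipped,
    // as in the original chain of if statements.
    public static List<NumberEntity> ToEntities(IEnumerable<Data> rows) =>
        rows.Where(r => r.Area != null && AreaIds.ContainsKey(r.Area))
            .Select(r => new NumberEntity { Number = r.Number, AreaId = AreaIds[r.Area] })
            .ToList();
}
```

The result would then go to db.Entity.AddRange(...) followed by one db.SaveChanges(). In Entity Framework 6, each individual Add call can trigger change detection, which is probably why adding everything in one call helps so much.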

Neo4j: How to return multiple paths from different starting nodes

I have a question similar to this question, but I am using C# with Neo4jClient instead of Java.
I can get the parent path of a given node with the following code, but it becomes a performance bottleneck when trying to find the parent path of many nodes. What I would like is a way to call the graph database once with a list of node keys and get back a list of parent paths, so that I can return a dictionary of results instead of a single list.
Any help accomplishing this would be greatly appreciated! Also if my original cypher query can be improved I'm open to that as well.
public IEnumerable<IGenericEntity> GetPath(string entityCompositeKey, GraphRelationship relationship)
{
    var entity = new GenericEntity();
    entity.setCompositeKey(entityCompositeKey);
    var pathToRoot = new List<GenericEntity>(){ entity };
    var query = new CypherFluentQuery(graphClient)
        .Match("p = (current)-[r:" + relationship.Name + "*0..]->()")
        .Where((IGenericEntity current) => current.CompositeKey == entityCompositeKey)
        .Return(() => Return.As<IEnumerable<GenericEntity>>("nodes(p)"))
        .OrderByDescending("length(p)")
        .Limit(10);
    var queryText = query.Query.QueryText;
    var paramText = query.Query.QueryParameters;
    if (query.Results != null)
    {
        var graphResults = query.Results.FirstOrDefault();
        if (graphResults != null && graphResults.ToList().Count > 0)
        {
            pathToRoot = graphResults.ToList();
        }
    }
    return pathToRoot;
}

There are a few things I'm not sure of - and it's most likely how my test DB is setup.
To answer the initial question of how to pass in multiple start nodes - that's probably best approached using the UNWIND operator, which in Neo4jClient is used like so:
var enumerable = new string[] { "a", "b" };
client.Unwind(enumerable, "item"). /*The rest*/
Obvs, if you place that in the top of your current query you'll get a monster set of nodes back, and you won't know which Root entity refers to which, soo... let's do some projecting...
To project, we must have something to project into:
public class Result {
    public GenericEntity Root { get; set; }
    public List<GenericEntity> Nodes { get; set; }
    public int Length { get; set; }
}
This will contain the Root node, and the path to it, now to fill.
public IEnumerable<Result> GetPath(IEnumerable<string> rootKeys, GraphRelationship relationship)
{
    var query = new CypherFluentQuery(Client)
        .Unwind(rootKeys, "entityRootKey")
        .Match(string.Format("p = (root)-[r:{0}*0..]->()", relationship.Name))
        .Where("root.CompositeKey = entityRootKey")
        .With("{Root:root, Nodes: nodes(p), Length: length(p)} as res")
        .Return((res) => res.As<Result>())
        .OrderByDescending("res.Length")
        .Limit(10);
    var results = query.Results;
    return results;
}
I'm not using .Where with a Func<T> parameter here because the value it compares against (entityRootKey) is created in the .Unwind statement.
Usage wise - something like this:
var res = GetPath(new[] {"a", "b"}, new GraphRelationship {Name = "RELATED"});
foreach (var result in res)
{
    Console.WriteLine($"{result.Root.CompositeKey} => {result.Length}");
    foreach (var node in result.Nodes)
        Console.WriteLine($"\t{node.CompositeKey}");
}

How do I create a new root by adding and removing nodes retrieved from the old root?

I am creating a Code Fix that changes this:
if(obj is MyClass)
{
    var castedObj = obj as MyClass;
}
into this:
var castedObj = obj as MyClass;
if(castedObj != null)
{
}
This means I have to do 3 things:
Change the condition in the if statement.
Move the casting right above the if statement.
Remove the statement in the body.
So far, all my attempts have stranded me at getting at most 2 of these things to work.
I believe this problem occurs because you basically have 2 syntax nodes on the same level. As such, making a change to one of them invalidates the location of the other one. Or something like that. Long story short: I either manage to copy the variable assignment outside the if statement, or I manage to change the condition + remove the variable assignment. Never all 3.
How would I solve this?
For good measure, here is my code which changes the condition and removes the assignment:
var newIfStatement = ifStatement.RemoveNode(
    variableDeclaration,
    SyntaxRemoveOptions.KeepExteriorTrivia);
newIfStatement = newIfStatement.ReplaceNode(newIfStatement.Condition, newCondition);
var ifParent = ifStatement.Parent;
var newParent = ifParent.ReplaceNode(ifStatement, newIfStatement);
newParent = newParent.InsertNodesBefore(
    newIfStatement,
    new[] { variableDeclaration })
    .WithAdditionalAnnotations(Formatter.Annotation);
var newRoot = root.ReplaceNode(ifParent, newParent);

Have you looked at the DocumentEditor class? It is very useful when modifying syntax, especially when the changes applied to the tree might cause invalidation problems. The operations are pretty much the same as the ones you already have defined; just use the DocumentEditor methods instead and see if that helps. I can't verify whether that solves your problem at the moment, but I think it solved a similar problem for me once in the past. I'll test it out later if I can.
Something like this will do it:
var editor = await DocumentEditor.CreateAsync(document);
editor.RemoveNode(variableDeclaration);
editor.ReplaceNode(ifStatement.Condition, newCondition);
editor.InsertBefore(ifStatement,
    new[] { variableDeclaration.WithAdditionalAnnotations(Formatter.Annotation) });
var newDocument = editor.GetChangedDocument();

I have managed to do something very similar in the following manner.
I extract the while condition and move it before the while and replace the condition with a new node.
In the body of while, I add a new statement.
In your case, instead of adding a statement, you will remove the desired statement from the body.
Start at
Refactor(BlockSyntax oldBody)
STEP 1: I first visit and mark the nodes that I want to change and at the same time generate new nodes, but don't add the new ones yet.
STEP 2: Track the marked nodes and replace with new ones.
class WhileConditionRefactoringVisitor : CSharpSyntaxRewriter
{
    private static int CONDITION_COUNTER = 0;
    private static string CONDITION_VAR = "whileCondition_";
    private static string ConditionIdentifier
    {
        get { return CONDITION_VAR + CONDITION_COUNTER++; }
    }
    private readonly List<SyntaxNode> markedNodes = new List<SyntaxNode>();
    private readonly List<Tuple<ExpressionSyntax, IdentifierNameSyntax, StatementSyntax, WhileStatementSyntax>> replacementNodes =
        new List<Tuple<ExpressionSyntax, IdentifierNameSyntax, StatementSyntax, WhileStatementSyntax>>();
    //STEP 1
    public override SyntaxNode VisitWhileStatement(WhileStatementSyntax node)
    {
        var nodeVisited = (WhileStatementSyntax) base.VisitWhileStatement(node);
        var condition = nodeVisited.Condition;
        if (condition.Kind() == SyntaxKind.IdentifierName)
            return nodeVisited;
        string conditionVarIdentifier = ConditionIdentifier;
        var newConditionVar = SyntaxFactoryExtensions.GenerateLocalVariableDeclaration(conditionVarIdentifier,
            condition, SyntaxKind.BoolKeyword).NormalizeWhitespace().WithTriviaFrom(nodeVisited);
        var newCondition = SyntaxFactory.IdentifierName(conditionVarIdentifier).WithTriviaFrom(condition);
        markedNodes.Add(condition);
        markedNodes.Add(node);
        replacementNodes.Add(new Tuple<ExpressionSyntax, IdentifierNameSyntax, StatementSyntax, WhileStatementSyntax>(condition, newCondition, newConditionVar, node));
        return nodeVisited;
    }
    //STEP 2
    private BlockSyntax ReplaceNodes(BlockSyntax oldBody)
    {
        oldBody = oldBody.TrackNodes(this.markedNodes);
        foreach (var tuple in this.replacementNodes)
        {
            var currentA = oldBody.GetCurrentNode(tuple.Item1);
            if (currentA != null)
            {
                var whileStatement = currentA.Parent;
                oldBody = oldBody.InsertNodesBefore(whileStatement, new List<SyntaxNode>() { tuple.Item3 });
                var currentB = oldBody.GetCurrentNode(tuple.Item1);
                oldBody = oldBody.ReplaceNode(currentB, tuple.Item2);
                var currentWhile = oldBody.GetCurrentNode(tuple.Item4);
                //modify body
                var whileBody = currentWhile.Statement as BlockSyntax;
                //create new statement
                var localCondition = tuple.Item3 as LocalDeclarationStatementSyntax;
                var initializer = localCondition.Declaration.Variables.First();
                var assignment = SyntaxFactory.ExpressionStatement(SyntaxFactory.AssignmentExpression(SyntaxKind.SimpleAssignmentExpression,
                    SyntaxFactory.IdentifierName(initializer.Identifier), initializer.Initializer.Value));
                var newStatements = whileBody.Statements.Add(assignment);
                whileBody = whileBody.WithStatements(newStatements);
                //updateWhile
                var newWhile = currentWhile.WithStatement(whileBody);
                oldBody = oldBody.ReplaceNode(currentWhile, newWhile);
            }
        }
        return oldBody;
    }
    public BlockSyntax Refactor(BlockSyntax oldBody)
    {
        markedNodes.Clear();
        replacementNodes.Clear();
        //STEP 1
        oldBody = (BlockSyntax)this.Visit(oldBody);
        //STEP 2
        oldBody = this.ReplaceNodes(oldBody);
        return oldBody;
    }
}

Lucene.NET and searching on multiple fields with specific values

I've created an index with various bits of data for each document I've added; each document can differ in its field names.
Later on, when I come to search the index I need to query it with exact field/ values - for example:
FieldName1 = X AND FieldName2 = Y AND FieldName3 = Z
What's the best way of constructing the following using Lucene .NET:
What analyser is best to use for this exact match type?
Upon retrieving a match, I only need one specific field to be returned (which I add to each document) - should this be the only one stored?
Later on I'll need to support keyword searching (so a field can have a list of values and I'll need to do a partial match).
The fields and values come from a Dictionary<string, string>. It's not user input, it's constructed from code.
Thanks,
Kieron

Well, I figured it out eventually - here's my take on it (this could be completely wrong, but it works for me):
public Guid? Find (Dictionary<string, string> searchTerms)
{
    if (searchTerms == null)
        throw new ArgumentNullException ("searchTerms");
    try
    {
        var directory = FSDirectory.Open (new DirectoryInfo (IndexRoot));
        if (!IndexReader.IndexExists (directory))
            return null;
        var mainQuery = new BooleanQuery ();
        foreach (var pair in searchTerms)
        {
            var parser = new QueryParser (
                Lucene.Net.Util.Version.LUCENE_CURRENT, pair.Key, GetAnalyzer ());
            var query = parser.Parse (pair.Value);
            mainQuery.Add (query, BooleanClause.Occur.MUST);
        }
        var searcher = new IndexSearcher (directory, true);
        try
        {
            var results = searcher.Search (mainQuery, (Filter)null, 10);
            if (results.totalHits != 1)
                return null;
            return Guid.Parse (searcher.Doc (results.scoreDocs[0].doc).Get (ContentIdKey));
        }
        catch
        {
            throw;
        }
        finally
        {
            if (searcher != null)
                searcher.Close ();
        }
    }
    catch
    {
        throw;
    }
}
