I am trying to utilize Azure Cognitive services to perform basic document extraction.
My intent is to input PDFs and DOCXs (and possibly some other files) into the Cognitive Engine for parsing, but unfortunately, the implementation of this is not as simple as it seems.
According to the documentation (https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-document-extraction#sample-definition), I must define the skill and then I should be able to input files, but there are no examples of how this should be done.
So far I have been able to define the skill, but I am still not sure where I should be dropping the files.
Please see my code below; it seeks to replicate the data structure shown in the example (albeit using the C# library):
public static DocumentExtractionSkill CreateDocumentExtractionSkill()
{
    List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>
    {
        new("file_data") { Source = "/document/file_data" }
    };

    List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>
    {
        new("content") { TargetName = "extracted_content" }
    };

    DocumentExtractionSkill des = new DocumentExtractionSkill(inputMappings, outputMappings)
    {
        Description = "Extract text (plain and structured) from image",
        ParsingMode = BlobIndexerParsingMode.Text,
        DataToExtract = BlobIndexerDataToExtract.ContentAndMetadata,
        Context = "/document",
    };

    return des;
}
And then I build on this skill like so:
_indexerClient = new SearchIndexerClient(new Uri(Environment.GetEnvironmentVariable("SearchEndpoint")), new AzureKeyCredential(Environment.GetEnvironmentVariable("SearchKey")));
List<SearchIndexerSkill> skills = new List<SearchIndexerSkill> { Skills.DocExtractionSkill.CreateDocumentExtractionSkill() };
SearchIndexerSkillset skillset = new SearchIndexerSkillset("DocumentSkillset", skills)
{
    Description = "Document Cracker Skillset",
    CognitiveServicesAccount = new CognitiveServicesAccountKey(Environment.GetEnvironmentVariable("CognitiveServicesKey"))
};
await _indexerClient.CreateOrUpdateSkillsetAsync(skillset);
And... then what?
There is no obvious method that would fit what I believe is the next stage: actually parsing documents.
What is the next step from here to begin dumping files into the _indexerClient (of type SearchIndexerClient)?
The next stage shown in the documentation is:
{
    "values": [
        {
            "recordId": "1",
            "data": {
                "file_data": {
                    "$type": "file",
                    "data": "aGVsbG8="
                }
            }
        }
    ]
}
It is not clear where I would be doing this.
In the document you mentioned, they are actually getting the output through Postman: they send a JSON request to the skill URL to receive the extracted document content, and the files/documents need to be uploaded to your storage account in order to be extracted.
You can follow this tutorial to get more insights.
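To expand on that with the C# SDK: the files never go to the SearchIndexerClient directly. You upload them to a blob container, register that container as a data source, and create an indexer that references the skillset; running the indexer is what pulls the blobs through the Document Extraction skill. Below is a minimal sketch of that wiring, assuming a target index named "document-index" already exists; the data source, container and indexer names, and the "BlobConnectionString" variable, are placeholders of mine, not from the original post.
// Hedged sketch: names below ("document-datasource", "documents-container",
// "document-index", "document-indexer") are placeholders.
var dataSource = new SearchIndexerDataSourceConnection(
    "document-datasource",
    SearchIndexerDataSourceType.AzureBlob,
    Environment.GetEnvironmentVariable("BlobConnectionString"),
    new SearchIndexerDataContainer("documents-container"));
await _indexerClient.CreateOrUpdateDataSourceConnectionAsync(dataSource);

// The indexer ties the data source, the skillset and the target index together.
var indexer = new SearchIndexer("document-indexer", dataSource.Name, "document-index")
{
    SkillsetName = "DocumentSkillset"
};
await _indexerClient.CreateOrUpdateIndexerAsync(indexer);

// Running the indexer is what actually "drops the files in": each blob is read,
// passed to the Document Extraction skill, and the output is written to the index.
await _indexerClient.RunIndexerAsync(indexer.Name);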
I want to be able to use the DNNClassifier (estimator) on top of IIS using TensorFlowSharp. The model has previously been trained in Python. I got so far that I can now generate PB files and know the correct inputs/outputs, but I am stuck in TensorFlowSharp using string inputs.
I can create a valid .pb file of the iris dataset. It uses the following feature_spec:
{'SepalLength': FixedLenFeature(shape=(1,), dtype=tf.float32, default_value=None), 'SepalWidth': FixedLenFeature(shape=(1,), dtype=tf.float32, default_value=None), 'PetalLength': FixedLenFeature(shape=(1,), dtype=tf.float32, default_value=None), 'PetalWidth': FixedLenFeature(shape=(1,), dtype=tf.float32, default_value=None)}
I have created a simple C# console application to try and spin it up. The input should be an "input_example_tensor" and the output is located in "dnn/head/predictions/probabilities". This I discovered after alex_zu provided help using the saved_model_cli command here.
As far as I am aware, all TensorFlow estimator APIs work like this.
Here comes the problem: the input_example_tensor should be of a string format which will be parsed internally by the ParseExample function. Now I am stuck. I have found TFTensor.CreateString, but this doesn't solve the problem.
using System;
using TensorFlow;

namespace repository
{
    class Program
    {
        static void Main(string[] args)
        {
            using (TFGraph tfGraph = new TFGraph())
            using (var tmpSess = new TFSession(tfGraph))
            using (var tfSessionOptions = new TFSessionOptions())
            using (var metaGraphUnused = new TFBuffer())
            {
                // Generate a new session based on the pb folder location with the tag "serve"
                TFSession tfSession = tmpSess.FromSavedModel(
                    tfSessionOptions,
                    null,
                    @"path/to/model/pb",
                    new[] { "serve" },
                    tfGraph,
                    metaGraphUnused
                );

                // Generate a new runner, which will fetch the TensorFlow results later
                var runner = tfSession.GetRunner();

                // This is in the actual TensorFlow documentation; how to implement this???
                string fromTensorflowPythonExample = "{'SepalLength': [5.1, 5.9, 6.9],'SepalWidth': [3.3, 3.0, 3.1],'PetalLength': [1.7, 4.2, 5.4],'PetalWidth': [0.5, 1.5, 2.1],}";

                // This is the problem, it's not working...
                TFTensor rawInput = new TFTensor(new float[4] { 5.1f, 3.3f, 1.7f, 0.5f });
                byte[] serializedTensor = System.Text.Encoding.ASCII.GetBytes(rawInput.ToString());
                TFTensor inputTensor = TensorFlow.TFTensor.CreateString(serializedTensor);

                runner.AddInput(tfGraph["input_example_tensor"][0], inputTensor);
                runner.Fetch("dnn/head/predictions/probabilities", 0);

                // Start the run and get the results of the iris example
                var output = runner.Run();
                TFTensor result = output[0];

                // Printing the response to the client
                Console.WriteLine(result.ToString());
                Console.ReadLine();
            }
        }
    }
}
This example will give the following error:
An unhandled exception of type 'TensorFlow.TFException' occurred in TensorFlowSharp.dll: 'Expected serialized to be a vector, got shape: []
[[Node: ParseExample/ParseExample = ParseExample[Ndense=4, Nsparse=0, Tdense=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], dense_shapes=[[1], [1], [1], [1]], sparse_types=[], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_input_example_tensor_0_0, ParseExample/ParseExample/names, ParseExample/ParseExample/dense_keys_0, ParseExample/ParseExample/dense_keys_1, ParseExample/ParseExample/dense_keys_2, ParseExample/ParseExample/dense_keys_3, ParseExample/Const, ParseExample/Const, ParseExample/Const, ParseExample/Const)]]'
How can I serialize tensors in such a way that I can use the pb file correctly?
I also posted the issue on GitHub; there you can find the iris example Python file, the pb file and the console applications. In my opinion, solving this creates a neat solution for all TensorFlow users stuck with ancient production environments (like me).
The Expected serialized to be a vector, got shape: [] error can be fixed by using an overload of the TFTensor.CreateString function: Instead of directly taking a string, the model apparently expects a vector containing a single string:
TFTensor inputTensor = TFTensor.CreateString(new byte[][] { bytes }, new TFShape(1));
The input_example_tensor in your case now expects a serialized Example protobuf message (see also the docs and the example.proto file).
Using the protobuf compiler, I've generated a C# file containing the Example class. You can download it from here: https://pastebin.com/iLT8MUdR. Specifically, I used this online tool with CSharpProtoc and replaced the import "tensorflow/core/example/feature.proto"; line by the messages defined in that file.
Once you've added the file to your project, you'll need a package reference to Google.Protobuf. Then, you can pass serialized examples to the model like this:
Func<float, Tensorflow.Feature> makeFeature = (float x) => {
    var floatList = new Tensorflow.FloatList();
    floatList.Value.Add(x);
    return new Tensorflow.Feature { FloatList = floatList };
};
var example = new Tensorflow.Example { Features = new Tensorflow.Features() };
example.Features.Feature.Add("SepalLength", makeFeature(5.1f));
example.Features.Feature.Add("SepalWidth", makeFeature(3.3f));
example.Features.Feature.Add("PetalLength", makeFeature(1.7f));
example.Features.Feature.Add("PetalWidth", makeFeature(0.5f));
TFTensor inputTensor = TFTensor.CreateString(
    new[] { example.ToByteArray() }, new TFShape(1));
runner.AddInput(tfGraph["input_example_tensor"][0], inputTensor);
runner.Fetch("dnn/head/predictions/probabilities", 0);
//start the run and get the results of the iris example
var output = runner.Run();
TFTensor result = output[0];
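As a small, hedged addition of my own (not part of the original answer), the class probabilities can be read back out of the result tensor like this, assuming the output is a float32 tensor of shape [1, 3] as in the iris example:
// Assumption: the fetched tensor has shape [1, 3] (a batch of one, three classes).
var probabilities = (float[,])result.GetValue();
for (int i = 0; i < probabilities.GetLength(1); i++)
{
    Console.WriteLine($"Class {i}: {probabilities[0, i]}");
}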
When running the new MongoDB server, version 3.6, and trying to add a Change Stream watch to a collection to get notifications of new inserts and updates of documents, I only receive notifications for updates, not for inserts.
This is the default way I have tried to add the watch:
IMongoDatabase mongoDatabase = mongoClient.GetDatabase("Sandbox");
IMongoCollection<BsonDocument> collection = mongoDatabase.GetCollection<BsonDocument>("TestCollection");
var changeStream = collection.Watch().ToEnumerable().GetEnumerator();
changeStream.MoveNext();
var next = changeStream.Current;
Then I downloaded the C# source code from MongoDB to see how they did this. Looking at their test code for change stream watches, they create a new document (insert) and then change that document right away (update) and THEN set up the Change Stream watch to receive an 'update' notification.
No example is given on how to watch for 'insert' notifications.
I have looked at the Java and NodeJS examples, both on the MongoDB website and on SO, which seem to be straightforward and define a way to see both inserts and updates:
var changeStream = collection.watch({ '$match': { $or: [ { 'operationType': 'insert' }, { 'operationType': 'update' } ] } });
The API for the C# driver is vastly different; I would have assumed they would have kept the same API for C# as for Java and NodeJS. I found few or no examples for C# doing the same thing.
The closest I have come is the following attempt, which still fails; the documentation for the C# version is very limited (or I have not found the right location). The setup is as follows:
String json = "{ '$match': { 'operationType': { '$in': ['insert', 'update'] } } }";
var options = new ChangeStreamOptions { FullDocument = ChangeStreamFullDocumentOption.UpdateLookup };
PipelineDefinition<ChangeStreamDocument<BsonDocument>, ChangeStreamDocument<BsonDocument>> pipeline = new EmptyPipelineDefinition<ChangeStreamDocument<BsonDocument>>().Match(Builders<ChangeStreamDocument<BsonDocument>>.Filter.Text(json,"json"));
Then running the statement below throws an Exception:
{"Command aggregate failed: $match with $text is only allowed as the
first pipeline stage."}
No other Filter options have worked either, and I have not found a way to just enter the JSON as a string to set the 'operationType'.
var changeStream = collection.Watch(pipeline, options).ToEnumerable().GetEnumerator();
changeStream.MoveNext();
var next = changeStream.Current;
My only goal here is to be able to set the 'operationType' using the C# driver. Does anyone know what I am doing wrong, or has anyone tried this using the C# driver and had success?
After reading through a large number of webpages, with very little info on the C# version of the MongoDB driver, I am very stuck!
Any help would be much appreciated.
Here is a sample of code I've used to update the collection Watch to retrieve "events" other than just document updates.
IMongoDatabase sandboxDB = mongoClient.GetDatabase("Sandbox");
IMongoCollection<BsonDocument> collection = sandboxDB.GetCollection<BsonDocument>("TestCollection");
//Get the whole document instead of just the changed portion
ChangeStreamOptions options = new ChangeStreamOptions() { FullDocument = ChangeStreamFullDocumentOption.UpdateLookup };
//The operationType can be one of the following: insert, update, replace, delete, invalidate
var pipeline = new EmptyPipelineDefinition<ChangeStreamDocument<BsonDocument>>().Match("{ operationType: { $in: [ 'replace', 'insert', 'update' ] } }");
var changeStream = collection.Watch(pipeline, options).ToEnumerable().GetEnumerator();
changeStream.MoveNext(); //Blocks until a document is replaced, inserted or updated in the TestCollection
ChangeStreamDocument<BsonDocument> next = changeStream.Current;
changeStream.Dispose();
The EmptyPipelineDefinition...Match() argument could also be:
"{ $or: [ {operationType: 'replace' }, { operationType: 'insert' }, { operationType: 'update' } ] }"
If you wanted to use the $or command, or
"{ operationType: /^[^d]/ }"
to throw a little regex in there. This last one is saying, I want all operationTypes unless they start with the letter 'd'.
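As a hedged addition of my own (not from the original answer), the same cursor can also be consumed in a loop instead of a single MoveNext() call, reusing the collection, pipeline and options from the snippet above:
// Sketch only: iterates the change stream until the cursor is closed or an error occurs.
using (var cursor = collection.Watch(pipeline, options))
{
    foreach (var change in cursor.ToEnumerable())
    {
        Console.WriteLine($"{change.OperationType}: {change.FullDocument}");
    }
}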
When trying to upload large files (100-500 MB) I get the familiar "OutOfMemoryException", which is caused by trying to read the whole file into memory at once (asked and answered here on Stack Overflow). I know that I should use a stream or divide the file into smaller parts. Changing the proxy code manually is an option, if that helps. I use the specific web service (CMWebService).
Since I am unable to change IBM's code, is there any way to send the file in smaller parts? I have already found the classes UpdateItemRequestAdd and UpdateItemRequestAddPart, but I can't get them to work. Unfortunately, there are also no samples available from IBM.
Receiving files poses the same problem, and I have not been able to find any classes that could help me there.
This is the code that I am currently using to upload files:
string resources0 = "tiffFileContent";
string resources1 = "image/tiff";
string resources2 = #"D:\myImageFile.tif";
CreateItemRequest createRequest = new CreateItemRequest()
{
    AuthenticationData = data,
    Item = new CreateItemRequestItem()
    {
        ItemXML = new ItemXML()
        {
            MYITEMTYPE = new MYITEMTYPE()
            {
                ArchiveId = "4719",
                ICMBASE = new ICMBASE[]
                {
                    new ICMBASE()
                    {
                        resourceObject = new LobObjectType()
                        {
                            label = new LobObjectTypeLabel()
                            {
                                name = resources0
                            },
                            MIMEType = resources1,
                            originalFileName = resources2
                        },
                    }
                }
            }
        },
    },
    mtomRef = new MTOMAttachment[]
    {
        new MTOMAttachment()
        {
            ID = resources0,
            MimeType = resources1,
            Value = System.IO.File.ReadAllBytes(resources2), // Error on large files
        }
    },
};
var createReply = service.CreateItem(createRequest);
We "resolved" this by telling our customer to get a more potent system with more RAM. With 4-8GB of RAM we were able to upload files up to 200MB without problem.
Upon receiving large files, the Java-HeapSize in IBM Content Manager had to be increased.
http://www.mkyong.com/websphere/how-to-increase-websphere-jvm-memory/
How can I make all files public using Amazon.ElasticTranscoder.Model (.NET, C#)?
Here is my code:
public static void CreateJobRequest(string videoPath, string bucketName)
{
    string accessKey = CloudSettings.AccessKeyID;
    string secretKey = CloudSettings.SecreteKey;

    var etsClient = new AmazonElasticTranscoderClient(accessKey, secretKey, RegionEndpoint.USEast1);

    var notifications = new Notifications()
    {
        Completed = "arn:aws:sns:us-east-1:XXXXXXXXXXXX:Transcode",
        Error = "arn:aws:sns:us-east-1:XXXXXXXXXXXX:Transcode",
        Progressing = "arn:aws:sns:us-east-1:XXXXXXXXXXXX:Transcode",
        Warning = "arn:aws:sns:us-east-1:XXXXXXXXXXXX:Transcode"
    };

    var pipeline = etsClient.CreatePipeline(new CreatePipelineRequest()
    {
        Name = "MyFolder",
        InputBucket = bucketName,
        OutputBucket = bucketName,
        Notifications = notifications,
        Role = "arn:aws:iam::XXXXXXXXXXXX:role/Elastic_Transcoder_Default_Role"
    }).CreatePipelineResult.Pipeline;

    etsClient.CreateJob(new CreateJobRequest()
    {
        PipelineId = pipeline.Id,
        Input = new JobInput()
        {
            AspectRatio = "auto",
            Container = "mp4",
            FrameRate = "auto",
            Interlaced = "auto",
            Resolution = "auto",
            Key = videoPath
        },
        Output = new CreateJobOutput()
        {
            ThumbnailPattern = videoPath + "videoName{resolution}_{count}",
            Rotate = "0",
            PresetId = "1351620000000-000020",
            Key = videoPath + "newFileName.mp4"
        }
    });
}
Everything works perfectly, but the transcoded files are private. How can I set them to public?
I've just had this issue today and the way to resolve it is as follows:
In your pipeline, under "Configure Amazon S3 Bucket for Transcoded Files and Playlists":
Use the "+ Add a permission" link.
Select "Grantee Type" as "Amazon S3 Group".
Select "Grantee" as "All Users".
Then check "Access" as "Open/Download" AND "View Permission".
Save changes.
You can repeat this for any thumbnails that are generated in the section directly beneath: "Configure Amazon S3 Bucket for Thumbnails".
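For reference, the same grant can be made from code when the pipeline is created. The sketch below is my own hedged illustration using the asker's SDK objects (etsClient, notifications, bucketName), not part of the original answer; note that when ContentConfig and ThumbnailConfig are supplied, the OutputBucket property is omitted:
// Hedged sketch: grant READ ("Open/Download") to the AllUsers group on the transcoded output.
var publicRead = new Permission()
{
    GranteeType = "Group",
    Grantee = "AllUsers",
    Access = new List<string> { "Read" }
};

var pipeline = etsClient.CreatePipeline(new CreatePipelineRequest()
{
    Name = "MyFolder",
    InputBucket = bucketName,
    Role = "arn:aws:iam::XXXXXXXXXXXX:role/Elastic_Transcoder_Default_Role",
    Notifications = notifications,
    ContentConfig = new PipelineOutputConfig()
    {
        Bucket = bucketName,
        Permissions = new List<Permission> { publicRead }
    },
    ThumbnailConfig = new PipelineOutputConfig()
    {
        Bucket = bucketName,
        Permissions = new List<Permission> { publicRead }
    }
}).CreatePipelineResult.Pipeline;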
Just to add to @timstermatic's answer - I only had to grant 'Open/Download' access to make the objects public.
The 'View Permissions' option is used to allow anyone to read the ACL (Access Control List) of the object, not to view the object itself - that's taken care of by the 'Open/Download' option.
As usual with AWS terms, it's easy to misinterpret - it's not 'Permission to View the Object', it's 'View the Object's Permissions'.
(Sorry I couldn't add this as a comment, I don't have enough rep.)
Usually a conflict in policies can give rise to this situation. An easier way to handle this is to set the public read permission straight on the bucket in the "Bucket Policy" section of the Permissions tab. This bit of code might help.
{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "AllowPublicRead",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<BUCKET_NAME>/*"
        }
    ]
}
You can specify a folder in the Resource parameter as well. I haven't tried file extensions, but I guess those should work too.
I am attempting to load document files into a document library in SharePoint using the CopyIntoItems method of the SharePoint Copy web service.
The code below executes and returns 0 (success). Also, the CopyResult[] array returns 1 value with a "Success" result. However, I cannot find the document anywhere in the library.
I have two questions:
Can anyone see anything wrong with my code or suggest changes?
Can anyone suggest how I could debug this on the server side? I don't have a tremendous amount of experience with SharePoint. If I can track what is going on through logging or some other method on the server side, it may help me figure out what is happening.
Code Sample:
string[] destinationUrls = { Uri.EscapeDataString("https://someaddress.com/Reports/Temp") };
SPCopyWebService.FieldInformation i1 = new SPCopyWebService.FieldInformation { DisplayName = "Name", InternalName = "Name", Type = SPListTransferSpike1.SPCopyWebService.FieldType.Text, Value = "Test1Name" };
SPCopyWebService.FieldInformation i2 = new SPCopyWebService.FieldInformation { DisplayName = "Title", InternalName = "Title", Type = SPListTransferSpike1.SPCopyWebService.FieldType.Text, Value = "Test1Title" };
SPCopyWebService.FieldInformation[] info = { i1, i2 };
SPCopyWebService.CopyResult[] result;
byte[] data = File.ReadAllBytes("C:\\SomePath\\Test1Data.txt");
uint ret = SPCopyNew.CopyIntoItems("", destinationUrls, info, data, out result);
Edit that got things working:
I got my code working by adding "http://null" to the SourceUrl field. Nat's answer below would probably work for that reason. Here is the line I changed to get it working.
// Change
uint ret = SPCopyNew.CopyIntoItems("http://null", destinationUrls, info, data, out result);
I think the issue may be in trying to set the "Name" property using the web service. I have had some calls fail doing that.
Given the "Name" is the name of the document, you may have some success with
string targetDocName = "Test1Name.txt";
string destinationUrl = Uri.EscapeDataString("https://someaddress.com/Reports/Temp/" + targetDocName);
string[] destinationUrls = { destinationUrl };
SPCopyWebService.FieldInformation i1 = new SPCopyWebService.FieldInformation { DisplayName = "Title", InternalName = "Title", Type = SPListTransferSpike1.SPCopyWebService.FieldType.Text, Value = "Test1Title" };
SPCopyWebService.FieldInformation[] info = { i1};
SPCopyWebService.CopyResult[] result;
byte[] data = File.ReadAllBytes("C:\\SomePath\\Test1Data.txt");
uint ret = SPCopyNew.CopyIntoItems(destinationUrl, destinationUrls, info, data, out result);
Note: I have used the "target" as the "source" property. Don't quite know why, but it does the trick.
I didn't understand very well what you're trying to do, but if you're trying to upload a file from a local directory into a SharePoint library, I would suggest you create a WebClient and use UploadData:
Example (VB.NET):
Dim webClient As New WebClient()
webClient.UploadData("http://srvasddress/library/filenameexample.doc", "PUT", filebytes)
Then you just have to check in the file using the lists web service, something like:
listService.CheckInFile("http://srvasddress/library/filenameexample.doc", "description", "1")
Hope it was of some help.
EDIT: Don't forget to set credentials for the web client, etc.
EDIT 2: Update metadata fields using this:
listService.UpdateListItems("Name of the Library", batchQuery)
You can find info on building batch queries here: link
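For completeness, here is a rough C# equivalent of the approach above (my own hedged sketch, not from the original answer; the URL, file path and credentials are placeholders, and listService stands for a generated Lists.asmx proxy):
// Sketch only: PUT the raw bytes into the library, then check the file in via Lists.asmx.
byte[] fileBytes = System.IO.File.ReadAllBytes(@"C:\SomePath\Test1Data.txt");

using (var webClient = new System.Net.WebClient())
{
    webClient.Credentials = System.Net.CredentialCache.DefaultCredentials; // or a NetworkCredential
    webClient.UploadData("https://someaddress.com/Reports/Temp/Test1Data.txt", "PUT", fileBytes);
}

// "listService" is a hypothetical proxy for the site's Lists.asmx web service.
// listService.CheckInFile("https://someaddress.com/Reports/Temp/Test1Data.txt", "Uploaded via WebClient", "1");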
The SourceUrl is used in SharePoint. It is a link back to the "source document". When in your document library, hover over the item; to the right appears a down-pointing triangle. Clicking on it brings up a menu. Click on the "View Properties" option. On this page you will see the following: "This item is a copy of http://null ( Go To Source Item | Unlink )".
Because we are using the Copy function, SharePoint keeps track of the "source item" as part of the Document Management feature.