lucene.net IndexWriter and Azure WebJob - c#

I've got an Azure WebJob running continuously that fires based on a queue trigger. The queue contains a list of items that need to be written to my Lucene index. I currently have a lot of items in the queue (over 500k line items) and I'm looking for the most efficient way to process them. I keep getting an IndexWriter lock exception when I attempt to 'scale' out the WebJob.
Current Setup:
JobHostConfiguration config = new JobHostConfiguration();
config.Queues.BatchSize = 1;
var host = new JobHost(config);
host.RunAndBlock();
Web job function
public static void AddToSearchIndex([QueueTrigger("indexsearchadd")] List<ListingItem> items, TextWriter log)
{
    var azureDirectory = new AzureDirectory(CloudStorageAccount.Parse(ConfigurationManager.ConnectionStrings["StorageConnectionString"].ConnectionString), "megadata");
    var indexExists = IndexReader.IndexExists(azureDirectory);
    var count = items.Count;
    IndexWriter indexWriter = null;
    int errors = 0;
    while (indexWriter == null && errors < 10)
    {
        try
        {
            indexWriter = new IndexWriter(azureDirectory, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), !IndexReader.IndexExists(azureDirectory), new Lucene.Net.Index.IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));
        }
        catch (LockObtainFailedException)
        {
            log.WriteLine("Lock is taken, hit 'Y' to clear the lock, or anything else to try again");
            errors++;
        }
    }
    if (errors >= 10)
    {
        azureDirectory.ClearLock("write.lock");
        indexWriter = new IndexWriter(azureDirectory, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), !IndexReader.IndexExists(azureDirectory), new Lucene.Net.Index.IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));
        log.WriteLine("IndexWriter lock obtained, this process has exclusive write access to index");
    }
    indexWriter.SetRAMBufferSizeMB(10.0);
    // Parallel.ForEach(items, (itm) =>
    // {
    foreach (var itm in items)
    {
        AddtoIndex(itm, indexWriter);
    }
    // });
}
The method that updates the index items basically looks like this:
private static void AddtoIndex(ListingItem item, IndexWriter indexWriter)
{
    var doc = new Document();
    doc.Add(new Field("id", item.URL, Field.Store.NO, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
    var title = new Field("Title", item.Title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
    doc.Add(title);
    indexWriter.UpdateDocument(new Term("id", item.URL), doc);
}
Things I have tried:
Set the Azure config batch size to the maximum of 32
Make the method async and use Task.WhenAll
Use a parallel for loop
When I try the above, it usually fails with:
Lucene.Net.Store.LockObtainFailedException: Lucene.Net.Store.LockObtainFailedException: Lock obtain timed out: AzureLock#write.lock.
at Lucene.Net.Store.Lock.Obtain(Int64 lockWaitTimeout) in d:\Lucene.Net\FullRepo\trunk\src\core\Store\Lock.cs:line 97
at Lucene.Net.Index.IndexWriter.Init(Directory d, Analyzer
Any suggestions on how I can architecturally set up this WebJob so that it can process more items from the queue instead of doing it one by one? They all need to write to the same index.
Thanks

You are running into a Lucene semantic problem when multiple processes try to write to the Lucene index at the same time. Scaling the Azure app and using Tasks or parallel for loops will only cause problems, as only one process should write to the Lucene index at a time.
Architecturally, this is what you should do:
Ensure only one instance of the WebJob is running at any time, even if the Web App scales (e.g. via auto-scaling)
Use the maximum WebJob batch size (32)
Commit the Lucene index after each batch to minimize I/O
You can ensure only one instance of the WebJob runs by adding a settings.job file to the WebJob project. Set its build action to Content and copy to output directory. Add the following JSON to the file:
{ "is_singleton": true }
Configure the WebJob batch size to the maximum:
JobHostConfiguration config = new JobHostConfiguration();
config.Queues.BatchSize = 32;
var host = new JobHost(config);
host.RunAndBlock();
Commit the Lucene index after each batch:
public static void AddToSearchIndex([QueueTrigger("indexsearchadd")] List<ListingItem> items, TextWriter log)
{
    ...
    indexWriter = new IndexWriter(azureDirectory, …);
    foreach (var itm in items)
    {
        AddtoIndex(itm, indexWriter);
    }
    indexWriter.Commit();
}
The index is only written to the storage account when it is committed, which speeds up the indexing process. Furthermore, the WebJob batching will also speed up message processing (the number of messages processed over time, not the processing time of an individual message).
You could add a check at the start of the batch process to see whether the Lucene index is locked (the write.lock file exists) and clear the lock. This should never occur, but anything can happen, so I would add it to be safe.
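A minimal sketch of that check, assuming the same azureDirectory instance used to open the IndexWriter above:
// Assumption: azureDirectory is the AzureDirectory the IndexWriter is opened on.
// If a previous run crashed and left a stale write.lock behind, clear it before opening the writer.
if (IndexWriter.IsLocked(azureDirectory))
{
    azureDirectory.ClearLock("write.lock");
}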
You could further speed up the indexing process by using a larger Web App instance (mileage may vary) and faster storage such as Azure Premium Storage.
You can read more about the internals of Lucene indexes on Azure on my blog.

Related

How does the offset for a topic work in Kafka (Kafka_net)

I have a basic producer app and a consumer app. If I run both and have both start consuming on their respective topics, I have a great working system. My thought was that if I started the producer and sent a message, I would then be able to start the consumer and have it pick up that message. I was wrong.
Unless both are up and running, I lose messages (or they do not get consumed).
My consumer app looks like this for consuming...
Uri uri = new Uri("http://localhost:9092");
KafkaOptions options = new KafkaOptions(uri);
BrokerRouter brokerRouter = new BrokerRouter(options);
Consumer consumer = new Consumer(new ConsumerOptions(receiveTopic, brokerRouter));
List<OffsetResponse> offset = consumer.GetTopicOffsetAsync(receiveTopic, 100000).Result;
IEnumerable<OffsetPosition> t = from x in offset select new OffsetPosition(x.PartitionId, x.Offsets.Max());
consumer.SetOffsetPosition(t.ToArray());
IEnumerable<KafkaNet.Protocol.Message> msgs = consumer.Consume();
foreach (KafkaNet.Protocol.Message msg in msgs)
{
    // do some stuff here based on the message received
}
Unless I have the offset code shown above (GetTopicOffsetAsync / SetOffsetPosition), it starts at the beginning every time I start the application.
What is the proper way to manage topic offsets so messages are consumed after a disconnect happens?
If I run
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic chat-message-reply-XXX --consumer-property fetch-size=40000000 --from-beginning
I can see the messages, but when I connect my application to that topic, consumer.Consume() does not pick up the messages it has not already seen. I have tried this with and without running the above .bat file to see if that makes any difference. When I look at the consumer.SetOffsetPosition(t.ToArray()) call (t specifically), it shows that the offset is the count of all messages for the topic.
Please help,
Set the auto.offset.reset configuration in your ConsumerOptions to earliest. When the consumer group starts to consume messages, it will consume from the latest offset, because the default value for auto.offset.reset is latest.
But I looked at the kafka-net API just now, and it does not have an AutoOffsetReset property; its consumer configuration options seem pretty limited. It also lacks documentation and method summaries.
I would suggest you use the Confluent .NET Kafka NuGet package (Confluent.Kafka), because it is maintained by Confluent itself.
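For illustration, a minimal consumer with Confluent.Kafka might look like the sketch below (assuming a 1.x version of the package; the topic and group id are placeholders):
using System;
using System.Threading;
using Confluent.Kafka;

class Program
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092",
            GroupId = "my-consumer-group",              // placeholder group id
            AutoOffsetReset = AutoOffsetReset.Earliest  // start from the beginning when no committed offset exists
        };

        using (var consumer = new ConsumerBuilder<Ignore, string>(config).Build())
        {
            consumer.Subscribe("my-topic"); // placeholder topic
            while (true)
            {
                var result = consumer.Consume(CancellationToken.None);
                Console.WriteLine(result.Message.Value);
            }
        }
    }
}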
Also, why are you calling GetTopicOffsets and setting that offset back again on the consumer? I think once your consumer is configured, you should just start reading messages with Consume().
Try this:
static void Main(string[] args)
{
    var uri = new Uri("http://localhost:9092");
    var kafkaOptions = new KafkaOptions(uri);
    var brokerRouter = new BrokerRouter(kafkaOptions);
    var consumerOptions = new ConsumerOptions(receivedTopic, brokerRouter);
    var consumer = new Consumer(consumerOptions);
    foreach (var msg in consumer.Consume())
    {
        var value = Encoding.UTF8.GetString(msg.Value);
        // Process value here
    }
}
In addition, enable logging in your KafkaOptions and ConsumerOptions; it will help you a lot:
var kafkaOptions = new KafkaOptions(uri)
{
    Log = new ConsoleLog()
};
var consumerOptions = new ConsumerOptions(topic, brokerRouter)
{
    Log = new ConsoleLog()
};
I switched over to using Confluent's C# .NET package and it now works.

Nested Folder Lookup for Google Drive API .NET

I have implemented file/folder lookup in Google Drive v3 using Google's own API client for .NET.
The code works fine, but to be honest I'm not sure it is really an efficient, standard way of doing this.
Logic:
I have to get to each and every folder and download specific files in it.
The structure can be A > B > C > D, basically a folder within a folder within a folder and so on.
I can't use a static, predefined directory schema as a long-term solution, since the owner can change it at any time; for now, the folders are at least 4 levels deep.
The only way I can navigate to the subfolders is to get each one's own Google Drive ID and use that to see its contents. It is like you need a KEY first before you can unlock/open the next subfolder.
In short, I can't do a lookup on subfolder contents the easy way. Unless someone can suggest a better alternative, I'll be glad to take any criticism of my approach and I'm open to all your suggestions.
Thank you.
Update
Thank you all for providing links and examples.
I believe the recursion solution is the best fit so far for my current scenario.
Also, since these are heavy I/O operations, I applied async operations to the file downloads and to everything else where possible, and I made sure to follow the "async all the way" rule to prevent blocking.
To call the recursion method:
var parentID = "<folder id>";
var count = 0;
var folderLevel = 0;
var listRequest = service.Files.List();
await RecursionTask(listRequest, parentID, count, folderLevel);
This is the recursion method; it will search all possible folders under the root parent ID defined above...
private async Task RecursionTask(ListRequest listRequest, string parentId, int count, int folderLevel)
{
    // This method does the folder search
    listRequest.Q = $"('{parentId}' in parents) and (mimeType = 'application/vnd.google-apps.folder') and trashed = false";
    listRequest.Fields = "files(id,name)";
    var filesTask = await listRequest.ExecuteAsync();
    var files = filesTask.Files;
    count = files.Count(); // Keep track of recursion flow
    count--;
    // Keep track of how deep the recursion dives into subfolders
    folderLevel++;
    var tasks = new List<Task>();
    foreach (var file in files)
    {
        tasks.Add(InnerTask(file, listRequest, file.Name, folderLevel)); // Create array of tasks for the image search
        if (count > 1) // Loop until the value of count is exhausted
        {
            // Recurse into the subfolder
            await RecursionTask(listRequest, file.Id, count, folderLevel);
        }
    }
    await Task.WhenAll(tasks); // Wait for all tasks to finish
}
This is the InnerTask method that handles downloading the Drive files:
private async Task InnerTask(File file, ListRequest listRequest, string name, int folderLevel)
{
    // This method does the image search
    listRequest.Q = $"('{file.Id}' in parents) and (mimeType = 'image/jpeg' or mimeType = 'image/png')";
    listRequest.Fields = "files(id,name)";
    var subFiles = await listRequest.ExecuteAsync();
    foreach (var subFile in subFiles.Files)
    {
        // Do async task for downloading images from Google Drive
    }
}
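For completeness, a rough sketch of the download step inside that loop, assuming an authenticated DriveService instance named service (not shown in the question) and a local target directory of your choosing:
// Hypothetical helper: downloads one Drive file to a local folder.
// Assumes Google.Apis.Drive.v3 and that "service" is an authenticated DriveService.
private async Task DownloadFileAsync(DriveService service, File subFile, string targetDirectory)
{
    var request = service.Files.Get(subFile.Id);
    var targetPath = System.IO.Path.Combine(targetDirectory, subFile.Name);
    using (var stream = new System.IO.FileStream(targetPath, System.IO.FileMode.Create, System.IO.FileAccess.Write))
    {
        // DownloadAsync streams the file content into the target stream without blocking the caller.
        await request.DownloadAsync(stream);
    }
}
You could call this from the foreach above and collect the returned tasks into a list awaited with Task.WhenAll, mirroring the folder search.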

Applying ACL silently failing (sometimes)

I have an application running on multiple servers applying some ACLs.
The problem is that when more than one server applies them on the same folder structure (e.g. three levels), usually only levels one and three get the ACLs applied, but there is no exception.
I've created a test with parallel tasks (to simulate the different servers):
[TestMethod]
public void ApplyACL()
{
    var baseDir = Path.Combine(Path.GetTempPath(), "ACL-PROBLEM");
    if (Directory.Exists(baseDir))
    {
        Directory.Delete(baseDir, true);
    }
    var paths = new[]
    {
        Path.Combine(baseDir, "LEVEL-1"),
        Path.Combine(baseDir, "LEVEL-1", "LEVEL-2"),
        Path.Combine(baseDir, "LEVEL-1", "LEVEL-2", "LEVEL-3")
    };
    // create folders and files, so the ACL takes some time to apply
    foreach (var dir in paths)
    {
        Directory.CreateDirectory(dir);
        for (int i = 0; i < 1000; i++)
        {
            var id = string.Format("{0:000}", i);
            File.WriteAllText(Path.Combine(dir, id + ".txt"), id);
        }
    }
    var sids = new[]
    {
        "S-1-5-21-448539723-725345543-1417001333-1111111",
        "S-1-5-21-448539723-725345543-1417001333-2222222",
        "S-1-5-21-448539723-725345543-1417001333-3333333"
    };
    var taskList = new List<Task>();
    for (int i = 0; i < paths.Length; i++)
    {
        taskList.Add(CreateTask(i + 1, paths[i], sids[i]));
    }
    Parallel.ForEach(taskList, t => t.Start());
    Task.WaitAll(taskList.ToArray());
    var output = new StringBuilder();
    var failed = false;
    for (int i = 0; i < paths.Length; i++)
    {
        var ok = Directory.GetAccessControl(paths[i])
            .GetAccessRules(true, false, typeof(SecurityIdentifier))
            .OfType<FileSystemAccessRule>()
            .Any(f => f.IdentityReference.Value == sids[i]);
        if (!ok)
        {
            failed = true;
        }
        output.AppendLine(paths[i].Remove(0, baseDir.Length + 1) + " --> " + (ok ? "OK" : "ERROR"));
    }
    Debug.WriteLine(output);
    if (failed)
    {
        Assert.Fail();
    }
}
private static Task CreateTask(int i, string path, string sid)
{
    return new Task(() =>
    {
        var start = DateTime.Now;
        Debug.WriteLine("Task {0} start: {1:HH:mm:ss.fffffff}", i, start);
        var fileSystemAccessRule = new FileSystemAccessRule(new SecurityIdentifier(sid),
            FileSystemRights.Modify | FileSystemRights.Synchronize,
            InheritanceFlags.ContainerInherit | InheritanceFlags.ObjectInherit,
            PropagationFlags.None,
            AccessControlType.Allow);
        var directorySecurity = Directory.GetAccessControl(path);
        directorySecurity.ResetAccessRule(fileSystemAccessRule);
        Directory.SetAccessControl(path, directorySecurity);
        Debug.WriteLine("Task {0} finish: {1:HH:mm:ss.fffffff} ({2} ms)", i, DateTime.Now, (DateTime.Now - start).TotalMilliseconds);
    });
}
I'm getting the same problem: usually (but not always) only levels one and three get the ACLs applied.
Why is that and how can I fix this?
Directory.SetAccessControl internally calls the Win32 API function SetSecurityInfo:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa379588.aspx
The important part of the above documentation:
If you are setting the discretionary access control list (DACL) or any elements in the system access control list (SACL) of an object, the system automatically propagates any inheritable access control entries (ACEs) to existing child objects, according to the ACE inheritance rules.
The enumeration of child objects (CodeFuller already described this) is done in the low level function SetSecurityInfo itself. To be more detailed, this function calls into the system DLL NTMARTA.DLL, which does all the dirty work.
The background of this is inheritance, which is really a "pseudo inheritance", done for performance reasons. Every object contains not only its own ACEs, but also the inherited ACEs (the ones that are grayed out in Explorer). All this inheritance is applied while the ACL is being set, not during runtime ACL resolution / checking.
This old design decision by Microsoft is also the cause of the following problem (Windows admins should know this): if you move a directory tree to another location in the file system where a different ACL is set, the ACLs of the objects in the moved tree will not change.
So to speak, the inherited permissions are wrong; they no longer match the parent's ACL.
This inheritance is not defined by InheritanceFlags, but instead with SetAccessRuleProtection.
To add to CodeFuller's answer:
>> After enumeration is completed, internal directory security record is assigned to directory.
This enumeration is not just a pure read of the sub-objects: the ACL of every sub-object will be SET.
So the problem is inherent to the inner workings of Windows ACL handling:
SetSecurityInfo checks the parent directory for all ACEs which should be inherited and then does a recursion and applies these inheritable ACEs to all subobjects.
I know about this because I have written a tool that sets the ACLs of complete file systems (with millions of files), using what we call a "managed folder". We can have very complex ACLs with automatically calculated list permissions.
For setting the ACL on the files and folders I use SetKernelObjectSecurity. This API is not normally used for file systems, since it does not handle the inheritance stuff, so you have to do that yourself. But if you know what you are doing and do it correctly, it is the only reliable way to set the ACL on a file tree in every situation.
In fact, there can be situations (broken / invalid ACL entries in child objects) where SetSecurityInfo fails to set these objects correctly.
And now to the code from Anderson Pimentel:
It should be clear from the above that the parallel setting can only work if inheritance is blocked at each directory level.
However, it does not work to just call
dirSecurity.SetAccessRuleProtection(true, true);
in the task, since this call may come too late.
I got the code working when the above statement is called before starting the tasks.
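A minimal sketch of that preparation step, assuming the paths array from the test above (the second argument keeps the previously inherited rules as explicit copies, so effective permissions do not change):
// Sketch: protect each level from further inheritance *before* the parallel tasks start.
// isProtected = true blocks inherited ACEs, preserveInheritance = true copies the
// currently inherited rules onto the object itself.
foreach (var dir in paths)
{
    var security = Directory.GetAccessControl(dir);
    security.SetAccessRuleProtection(true, true);
    Directory.SetAccessControl(dir, security);
}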
The bad news is that this call, done from C#, also performs a complete recursion.
So it seems there is no really compelling solution in C#, besides using P/Invoke to call the low-level security functions directly.
But that's another story.
And to the initial problem where different servers are setting the ACL: if we know the intent behind it and what you want the resulting ACL to be, we can perhaps find a way.
Let me know.
It's a funny puzzle.
I've run your test and the problem reproduces on almost every run. ACLs are often not applied for LEVEL-3 either.
However, the problem does not reproduce if the tasks do not run in parallel.
Also, if the directories do not contain those 1000 files, the problem reproduces much less often.
Such behavior is very similar to a classic race condition.
I haven't found any explicit information on this topic, but it seems that applying ACLs on overlapping directory trees is not a thread-safe operation.
To confirm this we would need to analyze the implementation of SetAccessControl() (or rather the underlying Windows API call). But let's try to imagine what it might be:
1. SetAccessControl() is called for the given directory and DirectorySecurity record.
2. It creates some internal structure (file system object) and fills it with the provided data.
3. Then it starts enumerating child objects (directories and files). Such enumeration is partly confirmed by the task execution times: about 500 ms for task 3, 1000 ms for task 2 and 1500 ms for task 1.
4. After the enumeration is completed, the internal directory security record is assigned to the directory.
5. But in parallel, the same is done by SetAccessControl() called on the parent directory. Eventually it overwrites the record created in step 4.
Of course, the described flow is just an assumption; we would need NTFS or Windows internals experts to confirm it.
But the observed behavior almost certainly indicates a race condition. Just avoid applying ACLs in parallel on overlapping directory trees and sleep well.
Introduce a lock. You have the shared file system available, so use .NET to lock when a process makes changes to a folder:
using (new FileStream(lockFile, FileMode.Open, FileAccess.Read, FileShare.None))
{
    // file locked
}
In your code, add this during initialization:
var lockFile = Path.Combine(baseDir, ".lock"); // just create a file
File.WriteAllText(lockFile, "lock file");
and pass the well-known lock file to your tasks.
Then wait for the file to get unlocked in each of your processes:
private static Task CreateTask(int i, string path, string sid, string lockFile)
{
    return new Task(() =>
    {
        var start = DateTime.Now;
        Debug.WriteLine("Task {0} start: {1:HH:mm:ss.fffffff}", i, start);
        Task.WaitAll(WaitForFileToUnlock(lockFile, () =>
        {
            var fileSystemAccessRule = new FileSystemAccessRule(new SecurityIdentifier(sid),
                FileSystemRights.Modify | FileSystemRights.Synchronize,
                InheritanceFlags.ContainerInherit | InheritanceFlags.ObjectInherit,
                PropagationFlags.None,
                AccessControlType.Allow);
            var directorySecurity = Directory.GetAccessControl(path);
            directorySecurity.ResetAccessRule(fileSystemAccessRule);
            Directory.SetAccessControl(path, directorySecurity);
        }));
        Debug.WriteLine("Task {0} finish: {1:HH:mm:ss.fffffff} ({2} ms)", i, DateTime.Now, (DateTime.Now - start).TotalMilliseconds);
    });
}
private static async Task WaitForFileToUnlock(string lockFile, Action runWhenUnlocked)
{
    while (true)
    {
        try
        {
            using (new FileStream(lockFile, FileMode.Open, FileAccess.Read, FileShare.None))
            {
                runWhenUnlocked();
            }
            return;
        }
        catch (IOException)
        {
            await Task.Delay(100);
        }
    }
}
With those changes the unit test passes.
You can add further locks at the various levels to make the process more efficient, using something like a hierarchical lock logic, as sketched below.
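A rough sketch of such a hierarchical lock, under the assumption that a ".lock" file has been created beforehand in the base directory and in every level (just like the single lock file above). Each task locks its own level and every ancestor level before touching the ACL, so overlapping trees are never modified concurrently:
// Assumption: every directory from baseDir down to "path" already contains a ".lock" file.
private static void ApplyAclWithHierarchyLock(string baseDir, string path, Action applyAcl)
{
    var lockStreams = new List<FileStream>();
    try
    {
        // Acquire the lock for this level and for every ancestor up to the base directory.
        var current = path;
        while (current != null && current.Length >= baseDir.Length)
        {
            lockStreams.Add(WaitForLock(Path.Combine(current, ".lock")));
            current = Path.GetDirectoryName(current);
        }
        applyAcl();
    }
    finally
    {
        foreach (var stream in lockStreams)
        {
            stream.Dispose();
        }
    }
}

private static FileStream WaitForLock(string lockFile)
{
    while (true)
    {
        try
        {
            // FileShare.None makes this an exclusive lock, just like the single lock file above.
            return new FileStream(lockFile, FileMode.Open, FileAccess.Read, FileShare.None);
        }
        catch (IOException)
        {
            Thread.Sleep(100);
        }
    }
}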

Get Number of Requests Queued in IIS using C#

I need to collect the following two pieces of information from a WebRole running IIS 8 on Azure:
Number of requests queued in IIS
Number of requests currently being processed by the worker process
Since we are on an Azure cloud service, I believe it is better to stick with the default IIS configuration provided by Azure.
Approach 1: Use WorkerProcess Request Collection
public void EnumerateWorkerProcess()
{
    ServerManager manager = new ServerManager();
    foreach (WorkerProcess proc in manager.WorkerProcesses)
    {
        RequestCollection req = proc.GetRequests(1000);
        Debug.WriteLine(req.Count);
    }
}
Cons:
Requires RequestMonitor to be enabled explicitly in IIS.
Approach 2: Use PerformanceCounter class
public void ReadPerformanceCounter()
{
    var root = HostingEnvironment.MapPath("~/App_Data/PerfCount.txt");
    PerformanceCounter counter = new PerformanceCounter("ASP.NET", "requests current", true);
    float val = counter.NextValue();
    using (StreamWriter perfWriter = new StreamWriter(root, true))
    {
        perfWriter.WriteLine(val);
    }
}
Cons:
Requires higher privileges than the IIS process currently runs with.
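If the privilege issue can be addressed (for example by granting the worker process identity read access to performance counters), both numbers the question asks for map to counters in the ASP.NET category; a small sketch, with the counter names assumed from the standard ASP.NET counter set:
// Sketch (assumption): both counters live in the machine-level "ASP.NET" category
// and reading them requires sufficient privileges for the worker process identity.
var current = new PerformanceCounter("ASP.NET", "Requests Current", true);
var queued = new PerformanceCounter("ASP.NET", "Requests Queued", true);
Debug.WriteLine(string.Format("Current: {0}, Queued: {1}", current.NextValue(), queued.NextValue()));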
P.S. There is a four-year-old SO post about this, but it was not answered well.

How to set a dynamic number of threadCounter variables?

I'm not really into multithreading, so the question is probably naive, but I can't seem to find a way to solve this problem (especially because I'm using C# and have only been using it for a month).
I have a dynamic number of directories (I got them from a query against the DB). Inside those directories there is a certain number of files.
For each directory I need a method that transfers those files over FTP concurrently, because I basically have no limit on the maximum number of FTP connections (not my words, it's written in the spec).
But I still need to control the maximum number of files transferred per directory, so I need to count the files I'm transferring (increment/decrement).
How could I do this? Should I use something like an array together with the Monitor class?
Edit: Framework 3.5
You can use the Semaphore class to throttle the number of concurrent files per directory. You would probably want to have one semaphore per directory so that the number of FTP uploads per directory can be controlled independently.
public class Example
{
    public void ProcessAllFilesAsync()
    {
        var semaphores = new Dictionary<string, Semaphore>();
        foreach (string filePath in GetFiles())
        {
            string filePathCapture = filePath; // Needed to perform the closure correctly.
            string directoryPath = Path.GetDirectoryName(filePath);
            if (!semaphores.ContainsKey(directoryPath))
            {
                int allowed = NUM_OF_CONCURRENT_OPERATIONS;
                semaphores.Add(directoryPath, new Semaphore(allowed, allowed));
            }
            var semaphore = semaphores[directoryPath];
            ThreadPool.QueueUserWorkItem(
                (state) =>
                {
                    semaphore.WaitOne();
                    try
                    {
                        DoFtpOperation(filePathCapture);
                    }
                    finally
                    {
                        semaphore.Release();
                    }
                }, null);
        }
    }
}
var allDirectories = db.GetAllDirectories();
foreach (var directoryPath in allDirectories)
{
    DirectoryInfo directories = new DirectoryInfo(directoryPath);
    // Loop through every file in that directory
    foreach (var fileInDir in directories.GetFiles())
    {
        // Check if we have reached our max limit
        if (numberFTPConnections == MAXFTPCONNECTIONS)
        {
            Thread.Sleep(1000);
        }
        // Code to copy to FTP
        // This can be async; when the transfer is completed,
        // decrement numberFTPConnections so the next file can be transferred.
    }
}
You can try something along the lines above. Note that it's just the basic logic and there are probably better ways to do this.
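If you do go with a shared counter like numberFTPConnections above, Interlocked is a lightweight way to keep it consistent across threads; a sketch (DoFtpTransfer is a placeholder for the actual upload code):
private static int numberFTPConnections = 0;

private static void TransferFile(string filePath)
{
    Interlocked.Increment(ref numberFTPConnections);
    try
    {
        DoFtpTransfer(filePath); // placeholder for the actual FTP upload
    }
    finally
    {
        Interlocked.Decrement(ref numberFTPConnections);
    }
}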
