Search String takes a long time the first time only? - c#

There is no shortage of string-search performance questions out there, yet I still cannot make heads or tails of what the best approach is.
Long story short, I have committed to moving from 4NT to PowerShell. In leaving 4NT I am going to miss the super-quick console string-searching utility that came with it, called FFIND. I have decided to use my rudimentary C# programming skills to try and create my own utility to use in PowerShell that is just as quick.
So far, for a string search across hundreds of directories and a few thousand files, some of which are quite large, the results are FFIND 2.4 seconds and my utility 4.4 seconds... but only after I have run mine at least once.
The first time I run them, FFIND takes about the same time, but mine takes over a minute. What is this? Loading of libraries? File indexing? Am I doing something wrong in my code? I do not mind waiting a little longer, but the difference is extreme enough that if there is a better language or approach I would rather start down that path now, before I get too invested.
Do I need to pick another language to write a string search that will be lightning fast?
I need this utility to search through thousands of files for strings in web code, C# code, and another proprietary language that uses text files. I also need to be able to use it to find strings in very large log files, megabytes in size.
class Program
{
public static int linecounter;
public static int filecounter;
static void Main(string[] args)
{
//
//INIT
//
filecounter = 0;
linecounter = 0;
string word;
// Read properties from application settings.
string filelocation = Properties.Settings.Default.FavOne;
// Set Args from console.
word = args[0];
//
//Recursive search for sub folders and files
//
string startDIR;
string filename;
startDIR = Environment.CurrentDirectory;
//startDIR = "c:\\SearchStringTestDIR\\";
filename = args[1];
DirSearch(startDIR, word, filename);
Console.WriteLine(filecounter + " " + "Files found");
Console.WriteLine(linecounter + " " + "Lines found");
Console.ReadKey();
}
static void DirSearch(string dir, string word, string filename)
{
string fileline;
string ColorOne = Properties.Settings.Default.ColorOne;
string ColorTwo = Properties.Settings.Default.ColorTwo;
ConsoleColor valuecolorone = (ConsoleColor)Enum.Parse(typeof(ConsoleColor), ColorOne);
ConsoleColor valuecolortwo = (ConsoleColor)Enum.Parse(typeof(ConsoleColor), ColorTwo);
try
{
foreach (string f in Directory.GetFiles(dir, filename))
{
StreamReader file = new StreamReader(f);
bool t = true;
int counter = 1;
while ((fileline = file.ReadLine()) != null)
{
if (fileline.Contains(word))
{
if (t)
{
t = false;
filecounter++;
Console.ForegroundColor = valuecolorone;
Console.WriteLine(" ");
Console.WriteLine(f);
Console.ForegroundColor = valuecolortwo;
}
linecounter++;
Console.WriteLine(counter.ToString() + ". " + fileline);
}
counter++;
}
file.Close();
file = null;
}
foreach (string d in Directory.GetDirectories(dir))
{
//Console.WriteLine(d);
DirSearch(d,word,filename);
}
}
catch (System.Exception ex)
{
Console.WriteLine(ex.Message);
}
}
}
}

If you want to speed up your code, run a performance analysis and see what is taking the most time. I can almost guarantee the longest step here will be
fileline.Contains(word)
This function is called on every line of every file. Naively searching for a word in a string can take len(string) * len(word) comparisons.
You could code your own Contains method that uses a faster string-matching algorithm; search for "fast exact string matching". You could also try a regex and see whether that gives you a performance improvement. But I think the simplest optimization you can try is:
Don't read every line. Make a large string of all the content of the file.
// Reads the whole file into one string and closes the file again.
string text = File.ReadAllText(filePath, Encoding.UTF8);
Run Contains on this.
If you need all the matches in a file, then you need to use something like Regex.Matches(string,string).
After you have used regex to get all the matches for a single file, you can iterate over this match collection (if there are any matches). For each match, you can recover the line of the original file by writing a function that reads forward and backward from the match object index attribute, to where you find the '\n' character. Then output that string between those two newlines, to get your line.
This will be much faster, I guarantee it.
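A minimal sketch of the whole-file approach described above, using IndexOf with an ordinal comparison rather than Regex.Matches (a swapped-in technique, not the answer's exact suggestion). The class and method names are hypothetical, and it assumes the search word does not span lines. It walks backward and forward from each match to the surrounding '\n' characters to recover the line:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class WholeFileSearch
{
    // Returns (1-based line number, line text) for every line of text that contains word.
    public static List<(int LineNumber, string Line)> FindLines(string text, string word)
    {
        var hits = new List<(int LineNumber, string Line)>();
        if (string.IsNullOrEmpty(word)) return hits;

        int lineNumber = 1; // line number at position 'scanned'
        int scanned = 0;    // newlines before this index are already counted
        int index = 0;

        while ((index = text.IndexOf(word, index, StringComparison.Ordinal)) >= 0)
        {
            // Count the newlines between the previous position and this match.
            for (; scanned < index; scanned++)
                if (text[scanned] == '\n') lineNumber++;

            // Read backward and forward from the match to the enclosing newlines.
            int start = index == 0 ? 0 : text.LastIndexOf('\n', index - 1) + 1;
            int end = text.IndexOf('\n', index);
            if (end < 0) end = text.Length;

            hits.Add((lineNumber, text.Substring(start, end - start).TrimEnd('\r')));

            // Resume at the end of this line so each matching line is reported once.
            index = Math.Max(end, index + 1);
        }
        return hits;
    }

    // Convenience wrapper: read the whole file once, then search the string.
    public static List<(int LineNumber, string Line)> FindLinesInFile(string path, string word)
        => FindLines(File.ReadAllText(path), word);
}
```

In DirSearch, the inner ReadLine loop could then be replaced by a foreach over FindLinesInFile(f, word). Whether this is actually faster than the line-by-line version is worth measuring on your own files rather than assuming.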
If you want to go even further, some things I've noticed are :
Remove the try catch statement from outside the loop. Only use it exactly where you need it. I would not use it at all.
Also make sure NGen has been run for your program. NGen (ngen.exe, the Native Image Generator that ships with the .NET Framework) compiles the managed IL into a native image ahead of time, so the code does not have to be JIT-compiled on every launch. Managed code is never interpreted, so this mainly shortens startup time.
EDIT
Other points:
Why is there a difference between first and subsequent run times? It looks like caching. The OS can cache directory listings, file contents, and the program binaries it loads, so one usually sees a speedup after the first run. JIT compilation also adds some startup cost, but it recurs on every run unless a native image has been generated with NGen, so it is unlikely to explain a gap of a minute versus a few seconds.
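One way to check the caching explanation is to time nothing but the raw file reads, with no searching at all, and compare a first pass against a second one. The sketch below is only an illustration; the class name and the 64 KB buffer size are arbitrary choices, and it will throw if it reaches a directory it is not allowed to read:

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ColdWarmTiming
{
    // Reads every file under root once and returns the elapsed time and byte count.
    public static (long Milliseconds, long Bytes) ReadAll(string root)
    {
        var buffer = new byte[64 * 1024];
        long total = 0;
        var stopwatch = Stopwatch.StartNew();

        foreach (string path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
        {
            using (FileStream stream = File.OpenRead(path))
            {
                int read;
                while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                    total += read;
            }
        }

        stopwatch.Stop();
        return (stopwatch.ElapsedMilliseconds, total);
    }
}
```

Calling ColdWarmTiming.ReadAll on the search directory twice in a row after a reboot should show whether most of the first-run minute is spent waiting on the disk. If the first pass is slow even without any string matching, the search code is not the culprit.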
In general, I find C# performance too variable for my liking. If the optimizations suggested are not satisfactory and you want more consistent performance results, try another language -- one that is not 'managed'. C is probably the best for your needs.

Related

"The process cannot access the file because it is being used by another process." with SystemReader

I have no coding experience but have been trying to fix a broken program from many years ago. I've been fumbling through fixing things but have stumbled upon a piece that I can't fix. From what I've gathered, you get Alexa to append to a Dropbox file, and the program reads that file looking for the change and, depending on what it is, executes a certain command based on a customizable list in an XML document.
I've gotten this to work about five times in the hundreds of attempts I've made. Every other time it crashes and Visual Studio gives me: "System.IO.IOException: 'The process cannot access the file 'C:\Users\\"User"\Dropbox\controlcomputer\controlfile.txt' because it is being used by another process.'"
This is the file that Dropbox appends and this only happens when I append the file, otherwise, the program works fine and I can navigate it.
I believe this is the code that handles this as this is the only mention of StreamReader in all of the code:
public static void launchTaskControlFile(string path)
{
int num = 0;
StreamReader streamReader = new StreamReader(path);
string str = "";
while (true)
{
string str1 = streamReader.ReadLine();
string str2 = str1;
if (str1 == null)
{
break;
}
str = str2.TrimStart(new char[] { '#' });
num++;
}
streamReader.Close();
if (str.Contains("Google"))
{
MainWindow.googleSearch(str);
}
else if (str.Contains("LockDown") && Settings.Default.lockdownEnabled)
{
MainWindow.executeLock();
}
else if (str.Contains("Shutdown") && Settings.Default.shutdownEnabled)
{
MainWindow.executeShutdown();
}
else if (str.Contains("Restart") && Settings.Default.restartEnabled)
{
MainWindow.executeRestart();
}
else if (!str.Contains("Password"))
{
MainWindow.launchApplication(str);
}
else
{
SendKeys.SendWait(" ");
Thread.Sleep(500);
string str3 = "potato";
for (int i = 0; i < str3.Length; i++)
{
SendKeys.SendWait(str3[i].ToString());
}
}
Console.ReadLine();
}
I've searched online but have no idea how I could apply anything I've found to this. Once again before working on this I have no coding experience so act like you're talking to a toddler.
Sorry if anything I added here is unnecessary I'm just trying to be thorough. Any help would be appreciated.
I set up a try delay pattern like Adriano Repetti said and it seems to be working, however doing that flat out would only cause it to not crash so I had to add a loop around it and set the loop to stop when a variable hit 1, which happened whenever any command types are triggered. This takes it out of the loop and sets the integer back to 0, triggering the loop again. That seems to be working now.
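For readers who land here with the same exception, the try-and-delay pattern mentioned above can be sketched roughly as follows. This is an assumption-laden illustration, not the code the asker ended up with: the class name, the attempt count, and the delay are made up. It also opens the file with FileShare.ReadWrite, because new StreamReader(path) asks for read-only sharing and fails while another process has the file open for writing. Whether Dropbox's own lock permits shared reads is not something this sketch can guarantee, which is why the retry remains.

```csharp
using System;
using System.IO;
using System.Threading;

public static class SharedFileReader
{
    // Returns the last line of the file, retrying while another process has it locked.
    public static string ReadLastLine(string path, int maxAttempts = 10, int delayMs = 200)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                // FileShare.ReadWrite lets a writer keep the file open while we read it.
                using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                using (var reader = new StreamReader(stream))
                {
                    string last = "";
                    string line;
                    while ((line = reader.ReadLine()) != null)
                        last = line;
                    return last;
                }
            }
            catch (IOException) when (attempt < maxAttempts)
            {
                // The file is probably still being written; wait and try again.
                Thread.Sleep(delayMs);
            }
        }
    }
}
```

In launchTaskControlFile, the StreamReader loop could then become string str = SharedFileReader.ReadLastLine(path).TrimStart('#');. Once the attempts run out, the last IOException is allowed to propagate so the failure is not hidden.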

How to read and write more then 25000 records/lines into text file at a time?

I am connecting my application to a live stock market data provider using a web socket. When the market is live and the socket is open, it gives me nearly 45,000 lines a minute. I deserialize it line by line, write each line to a text file, and also read the text file and remove its first line. Handling all of this alongside the socket becomes slow. Can you help me make this process fast enough to handle about 25,000 lines a minute?
string filePath = @"D:\Aggregate_Minute_AAPL.txt";
var records = (from line in File.ReadLines(filePath).AsParallel()
select line);
List<string> str = records.ToList();
str.ForEach(x =>
{
string result = x;
result = result.TrimStart('[').TrimEnd(']');
var jsonString = Newtonsoft.Json.JsonConvert.DeserializeObject<List<LiveAMData>>(x);
foreach (var item in jsonString)
{
string value = "";
string dirPath = @"D:\COMB1\MinuteAggregates";
string[] fileNames = null;
fileNames = System.IO.Directory.GetFiles(dirPath, item.sym+"_*.txt", System.IO.SearchOption.AllDirectories);
if(fileNames.Length > 0)
{
string _fileName = fileNames[0];
var lineList = System.IO.File.ReadAllLines(_fileName).ToList();
lineList.RemoveAt(0);
var _item = lineList[lineList.Count - 1];
if (!_item.Contains(item.sym))
{
lineList.RemoveAt(lineList.Count - 1);
}
System.IO.File.WriteAllLines((_fileName), lineList.ToArray());
value = $"{item.sym},{item.s},{item.o},{item.h},{item.c},{item.l},{item.v}{Environment.NewLine}";
using (System.IO.StreamWriter sw = System.IO.File.AppendText(_fileName))
{
sw.Write(value);
}
}
}
});
How can I make the process fast? With this processing the application handles only about 3,000 to 4,000 symbols, whereas without any processing it gets through 25,000 lines per minute. How can I increase throughput with all this code?
First you need to clean up your code to gain more visibility. I did a quick refactor and this is what I got:
const string FilePath = @"D:\Aggregate_Minute_AAPL.txt";
class SomeClass
{
public string Sym { get; set; }
public string Other { get; set; }
}
private void Something() {
File
.ReadLines(FilePath)
.AsParallel()
.Select(x => x.TrimStart('[').TrimEnd(']'))
.Select(JsonConvert.DeserializeObject<List<SomeClass>>)
.ForAll(WriteRecord);
}
private const string DirPath = @"D:\COMB1\MinuteAggregates";
private const string Separator = ",";
private void WriteRecord(List<SomeClass> data)
{
foreach (var item in data)
{
var fileNames = Directory
.GetFiles(DirPath, item.Sym+"_*.txt", SearchOption.AllDirectories);
foreach (var fileName in fileNames)
{
var fileLines = File.ReadAllLines(fileName)
.Skip(1).ToList();
var lastLine = fileLines.Last();
if (!lastLine.Contains(item.Sym))
{
fileLines.RemoveAt(fileLines.Count - 1);
}
fileLines.Add(
new StringBuilder()
.Append(item.Sym)
.Append(Separator)
.Append(item.Other)
.Append(Environment.NewLine)
.ToString()
);
File.WriteAllLines(fileName, fileLines);
}
}
}
From here it should be easier to experiment with AsParallel and check how, and with which parameters, the code is faster.
Also:
You are opening the output file twice
The removes are also somewhat expensive, and removing at index 0 is the most expensive because every remaining element has to shift (however, if there are few elements this may not make much difference)
if(fileNames.Length > 0) is unnecessary; use a loop, and if the list is empty the loop will simply be skipped
You can try StringBuilder instead of string interpolation
I hope these hints help you improve your time, and that I have not forgotten anything.
Edit
We have nearly 10,000 files in our directory. So when process is
running then it's passing an error that The Process can not access the
file because it is being used by another process
Well, is it possible that the lines you process contain duplicate file names?
If that is the case, you could try a simple approach, a retry after some milliseconds, something like
private const int SleepMillis = 5;
private const int MaxRetries = 3;
public void WriteFile(string fileName, string[] fileLines, int retries = 0)
{
try
{
File.WriteAllLines(fileName, fileLines);
}
catch(Exception e) //Catch the special type if you can
{
if (retries >= MaxRetries)
{
Console.WriteLine("Too many tries with no success");
throw; // rethrow exception
}
Thread.Sleep(SleepMillis);
WriteFile(fileName, fileLines, ++retries); // try again
}
}
I tried to keep it simple, but here are some notes:
- If you can make your methods async, replacing the sleep with Task.Delay could be an improvement, but you need to understand well how async works
- If collisions happen a lot, then you should try another approach, something like a concurrent map with semaphores
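The last note could look something like the sketch below. It deliberately uses plain lock objects in a ConcurrentDictionary instead of semaphores, which is a simpler variant of the same idea: one gate per file, so two threads never write the same file at once while different files can still be written in parallel. The class name is hypothetical, and the case-insensitive key comparison assumes a Windows file system.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;

public static class PerFileWriter
{
    // One lock object per file path (assumption: paths are case-insensitive, as on Windows).
    private static readonly ConcurrentDictionary<string, object> Locks =
        new ConcurrentDictionary<string, object>(StringComparer.OrdinalIgnoreCase);

    public static void AppendLine(string fileName, string line)
    {
        // GetOrAdd hands every caller the same gate for the same full path.
        object gate = Locks.GetOrAdd(Path.GetFullPath(fileName), _ => new object());
        lock (gate)
        {
            File.AppendAllText(fileName, line + Environment.NewLine);
        }
    }
}
```

This only protects against collisions inside one process. If another program also touches the files, the retry approach above is still needed.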
Second edit
In real scenario I am connecting to websocket and receiving 70,000 to
1 lac records on every minute and after that I am bifurcating those
records with live streaming data and storing in it's own file. And
that becomes slower when I am applying our concept with 11,000 files
It is a hard problem. From what I understand, you're talking about roughly 1,166 records per second; at this scale the little details can become big bottlenecks.
At this stage I think it is better to consider other solutions. It could be too much I/O for the disk, too many threads or too few, the network...
You should start by profiling the app to see where it spends the most time, so you can focus on that area. How many resources is it using? How many do you have? How are the memory, processor, garbage collector, and network doing? Do you have an SSD?
You need a clear view of what is slowing you down so you can attack that directly. It will depend on a lot of things, so it will be hard to help with that part :(.
There are plenty of tools for profiling C# apps, and many ways to attack this problem (spread the load across several servers, use something like Redis to save data really quickly, use an event store so you can work with events...).

C# List<string> in linux with Mono

I have an executable file that processes a large number (1000+) of strings and adds each one to a List of strings. It is coded in C#, compiled in Visual Studio 2017 on a windows machine, then exported to and run on a Linux machine with Mono. Oddly enough, writing all of the strings to a text file works just fine, but adding them to the list causes the program's user interface to freeze and become unresponsive.
Here's my code:
client.BigDB.LoadRange("Clans", "ByName", null, startAt + "0000000000", stopAt + "zzzzzzzzzz", 1000, delegate (DatabaseObject[] o)
{
foreach (DatabaseObject obj in o)
{
//this section here does not work as intended
//string ClanName = obj.GetString("name");
//ClanNames.Add(ClanName);
//main.ui.AppendTestBox(ClanName);
//Clans++;
//but this section works perfectly
using (StreamWriter w = File.AppendText("ClanNameList.txt"))
{
w.Write(obj.GetString("name") + Environment.NewLine);
}
}
});
After inspecting the output file, I suspect that it is getting caught on the following string: "AK Union, Local 47". It processed every previous kind of character without problems, but it appears to not like commas for some reason. How do I get around this, if that's actually what's going on?
I did try to search for this problem on google and this site, but the search results are wildly unhelpful and quite unrelated to what I need :(
I can't see exactly what your problem is, though one thing I would suggest is to update the UI once via a StringBuilder, not 1000 times:
client.BigDB.LoadRange("Clans", "ByName", null, startAt + "0000000000", stopAt + "zzzzzzzzzz", 1000, delegate (DatabaseObject[] o)
{
var sb = new StringBuilder();
foreach (DatabaseObject obj in o)
{
var name = obj.GetString("name");
ClanNames.Add(name);
sb.Append(name);
}
main.ui.AppendTestBox(sb.ToString());
});
StringBuilder.Append Method (System.Text)

Why does some file get missed out if i use Parallel.ForEach()?

Following is the code which processes about 10000 files.
var files = Directory.GetFiles(directorypath, "*.*", SearchOption.AllDirectories).Where(
name => !name.EndsWith(".gif") && !name.EndsWith(".jpg") && !name.EndsWith(".png")).ToList();
Parallel.ForEach(files,Countnumberofwordsineachfile);
And the Countnumberofwordsineachfile function prints the number of words in each file into the text.
Whenever i implement Parallel.ForEach(), i miss about 4-5 files everytime while processing.
Can anyone suggest as to why this happens?
public void Countnumberofwordsineachfile(string filepath)
{
string[] arrwordsinfile = Regex.Split(File.ReadAllText(filepath).Trim(), @"\s+");
Charactercount = Convert.ToInt32(arrwordsinfile.Length);
filecontent.AppendLine(filepath + "=" + Charactercount);
}
filecontent is probably not thread-safe. So if two (or more) tasks attempt to append to it at the same time, one will win and the other will not. You need to either lock the sections that are shared, or not use shared data.
Locking is probably the easiest solution for your code. It synchronises access (other tasks have to queue up to enter the locked section), so it will slow down the algorithm, but since the locked section is very short compared to the part that counts the words, it isn't really going to be much of an issue.
private object myLock = new object();
public void Countnumberofwordsineachfile(string filepath)
{
string[] arrwordsinfile = Regex.Split(File.ReadAllText(filepath).Trim(), @"\s+");
Charactercount = Convert.ToInt32(arrwordsinfile.Length);
lock(myLock)
{
filecontent.AppendLine(filepath + "=" + Charactercount);
}
}
The cause has already been found, here is an alternative implementation:
//Parallel.ForEach(files,Countnumberofwordsineachfile);
var fileContent = files
.AsParallel()
.Select(f=> f + "=" + Countnumberofwordsineachfile(f));
and that requires a more useful design for the count method:
// make this an 'int' function, more reusable as well
public int Countnumberofwordsineachfile(string filepath)
{ ...; return characterCount; }
But do note that going parallel won't help you much here, your main function (ReadAllText) is I/O bound so you will most likely see a degradation from using AsParallel().
The better option is to use Directory.EnumerateFiles and then collect the results without parallelism:
var files = Directory.EnumerateFiles(....);
var fileContent = files
//.AsParallel()
.Select(f=> f + "=" + Countnumberofwordsineachfile(f));
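To make the alternative above concrete, here is one way the elided count method could be filled in. This is a guess at the intent, not the answerer's code: the class and method names are hypothetical, and unlike the original Regex.Split call it reports 0 words for an empty file instead of 1. Because nothing writes to shared state, the same code is safe with or without AsParallel:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Pure function: no shared fields, so it is safe to call from many threads.
    public static int CountWords(string text)
    {
        string trimmed = text.Trim();
        return trimmed.Length == 0 ? 0 : Regex.Split(trimmed, @"\s+").Length;
    }

    // Builds one "path=count" entry per file; set parallel to true to try PLINQ.
    public static List<string> Report(IEnumerable<string> files, bool parallel = false)
    {
        if (parallel)
        {
            return files.AsParallel()
                        .AsOrdered() // keep results in input order
                        .Select(f => f + "=" + CountWords(File.ReadAllText(f)))
                        .ToList();
        }
        return files.Select(f => f + "=" + CountWords(File.ReadAllText(f))).ToList();
    }
}
```

The resulting list can be written out in one go with File.WriteAllLines. Timing both settings of the parallel flag on the real directory is the only reliable way to tell whether the I/O-bound caveat above applies.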

write file need to optimised for heavy traffic

i am very new to C#, and this is my first question, please be gentle on me
I am trying to write a application to capture some tick data from the data provider, below is the main part of the program
void zf_TickEvent(object sender, ZenFire.TickEventArgs e)
{
output myoutput = new output();
myoutput.time = e.TimeStamp;
myoutput.product = e.Product.ToString();
myoutput.type = Enum.GetName(typeof(ZenFire.TickType), e.Type);
myoutput.price = e.Price;
myoutput.volume = e.Volume;
using (StreamWriter writer = File.AppendText("c:\\log222.txt"))
{
writer.Write(myoutput.time.ToString(timeFmt) + ",");
writer.Write(myoutput.product + "," );
writer.Write(myoutput.type + "," );
writer.Write(myoutput.price + ",");
writer.Write(myoutput.volume + ",");
}
}
I have successfully written the data into the text file. However, I know that this method will be called something like 10,000 times a second during peak time, and opening a file and appending to it that many times a second is very inefficient. I was advised to use a buffer of some sort, but I have no idea how to do it. I tried reading the documentation but I still don't understand, which is why I am turning here for help.
Please give me some (working) code snippet so I can be pointed in the right direction. Thanks.
EDIT: i have simplified the code as much as possible
using (StreamWriter streamWriter = File.AppendText("c:\\output.txt"))
{
streamWriter.WriteLine(string.Format("{0},{1},{2},{3},{4}",
e.TimeStamp.ToString(timeFmt),
e.Product.ToString(),
Enum.GetName(typeof(ZenFire.TickType), e.Type),
e.Price,
e.Volume));
}
ED has told me to make my stream a field. What does the syntax look like? Can anyone post some code to help me? Thanks a lot.
You need to create a field for the stream instead of a local variable. Initialize it in constructor once and don't forget to close it somewhere. It's better to implement IDisposable interface and close the stream in Dispose() method.
IDisposable
class MyClass : IDisposable {
private StreamWriter _writer;
MyClass() {
_writer = File.App.....;
}
void zf_TickEvent(object sender, ZenFire.TickEventArgs e)
{
output myoutput = new output();
myoutput.time = e.TimeStamp;
myoutput.product = e.Product.ToString();
myoutput.type = Enum.GetName(typeof(ZenFire.TickType), e.Type);
myoutput.price = e.Price;
myoutput.volume = e.Volume;
_writer.Write(myoutput.time.ToString(timeFmt) + ",");
_writer.Write(myoutput.product + "," );
_writer.Write(myoutput.type + "," );
_writer.Write(myoutput.price + ",");
_writer.Write(myoutput.volume + ",");
}
public void Dispose() { /*see the documentation*/ }
}
There are many things you can do
Step 1. Make sure you don't make many io calls and string concatenations.
Output myOutput = new Output(e); // Maybe construct from event args?
// Single write call, single string.format
writer.Write(string.Format("{0},{1},{2},{3},{4},{5}",
myOutput.Time.ToString(),
myOutput.Product,
...);
This I recommend regardless of what your current performance is. I also made some cosmetic changes (variable/property/class name casing. You should look up the difference between variables and properties and their recommended case etc.)
Step 2. Analyse your performance to see if it does what you want. If it does, no need to do anything further. If performance is still too bad, you can
Keep the file open and close it when your handler shuts down.
Write to a buffer and flush it at regular intervals.
Use a logger framework like log4net that internally handles the above for you, and takes care of hairy issues like access to the log file from multiple threads.
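The first two options (keep the file open, flush at intervals) can be combined in a small class like the sketch below. It is only an illustration under assumptions: the class name is made up, the price and volume types are guesses because the ZenFire types are not shown here, and it flushes every N lines rather than on a timer. A lock is included because tick events may arrive on more than one thread.

```csharp
using System;
using System.Globalization;
using System.IO;

public sealed class TickLogger : IDisposable
{
    private readonly object _gate = new object();
    private readonly StreamWriter _writer;
    private readonly int _flushEvery;
    private int _pending;

    public TickLogger(string path, int flushEvery = 1000)
    {
        _writer = new StreamWriter(path, append: true); // opened once, kept open
        _flushEvery = flushEvery;
    }

    // Parameter types are assumptions; adapt them to the real event args.
    public void Write(DateTime time, string product, string type, double price, int volume)
    {
        string line = string.Format(CultureInfo.InvariantCulture,
            "{0:O},{1},{2},{3},{4}", time, product, type, price, volume);

        lock (_gate)
        {
            _writer.WriteLine(line); // goes to the StreamWriter's buffer first
            if (++_pending >= _flushEvery)
            {
                _writer.Flush();
                _pending = 0;
            }
        }
    }

    public void Dispose()
    {
        lock (_gate)
        {
            _writer.Dispose(); // flushes whatever is still buffered and closes the file
        }
    }
}
```

The handler would create one TickLogger when it starts, call Write from zf_TickEvent, and dispose it on shutdown. Lines written since the last flush can be lost if the process is killed, which is the usual trade-off of buffering.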
I would use String.Format:
using (StreamWriter writer = new StreamWriter(@"c:\log222.txt", true))
{
writer.AutoFlush = true;
writer.Write(String.Format("{0},{1},{2},{3},{4},", myoutput.time.ToString(timeFmt),
myoutput.product, myoutput.type, myoutput.price, myoutput.volume));
}
If you put @ before a string literal, you don't have to double the backslashes.
This is much faster - you write only once to the file instead of 5 times. Additionally you don't use + operator with strings which is not the fastest operation ;)
Also - if this is multithreading application - you should consider using some lock. It would prevent application from trying to write to the file from eg. 2 threads at one time.
