I am trying to optimize the search for a string in a large text file (300-600mb). Using my current method, it is taking too long.
Currently I have been using IndexOf to search for the string, but the time it takes is way too long (20s) to build an index for each line with the string.
How can I optimize searching speed? I've tried Contains() but that is slow as well. Any suggestions? I was thinking regex match but I don't see that having a significant speed boost. Maybe my search logic is flawed
example
while ((line = myStream.ReadLine()) != null)
{
if (line.IndexOf(CompareString, StringComparison.OrdinalIgnoreCase) >= 0)
{
LineIndex.Add(CurrentPosition);
LinesCounted += 1;
}
}
The brute force algorithm you're using performs in O(nm) time, where n is the length of the string being searched and m the length of the substring/pattern you're trying to find. You need to use a string search algorithm:
Boyer-Moore is "the standard", I think:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
But there are lots more out there:
http://www-igm.univ-mlv.fr/~lecroq/string/
including Morris-Pratt:
http://www.stoimen.com/blog/2012/04/09/computer-algorithms-morris-pratt-string-searching/
and Knuth-Morris-Pratt:
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
However, using a regular expression crafted with care might be sufficient, depending on what you are trying to find. See Jeffrey's Friedl's tome, Mastering Regular Expressions for help on building efficient regular expressions (e.g., no backtracking).
You might also want to consult a good algorithms text. I'm partial to Robert Sedgewick's Algorithms in its various incarnations (Algorithms in [C|C++|Java])
Unfortunately, I don't think there's a whole lot you can do in straight C#.
I have found the Boyer-Moore algorithm to be extremely fast for this task. But I found there was no way to make even that as fast as IndexOf. My assumption is that this is because IndexOf is implemented in hand-optimized assembler while my code ran in C#.
You can see my code and performance test results in the article Fast Text Search with Boyer-Moore.
Have you seen these questions (and answers)?
Processing large text file in C#
Is there a way to read large text file in parts?
Matching a string in a Large text file?
Doing it the way you are now seems to be the way to go if all you want to do is read the text file. Other ideas:
If it is possible to pre-sort the data, such as when it gets inserted into the text file, that could help.
You could insert the data into a database and query it as needed.
You could use a hash table
You can user regexp.Match(String). RegExp Match is faster.
static void Main()
{
string text = "One car red car blue car";
string pat = #"(\w+)\s+(car)";
// Instantiate the regular expression object.
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
// Match the regular expression pattern against a text string.
Match m = r.Match(text);
int matchCount = 0;
while (m.Success)
{
Console.WriteLine("Match"+ (++matchCount));
for (int i = 1; i <= 2; i++)
{
Group g = m.Groups[i];
Console.WriteLine("Group"+i+"='" + g + "'");
CaptureCollection cc = g.Captures;
for (int j = 0; j < cc.Count; j++)
{
Capture c = cc[j];
System.Console.WriteLine("Capture"+j+"='" + c + "', Position="+c.Index);
}
}
m = m.NextMatch();
}
}
Related
I have a simple problem, but I could not find a simple solution yet.
I have a string containing for example this
UNB+123UNH+234BGM+345DTM+456
The actual string is lots larger, but you get the idea
now I have a set of values I need to find in this string
for example UNH and BGM and DTM and so on
So I need to search in the large string, and find the position of the first set of values.
something like this (not existing but to explain the idea)
string[] chars = {"UNH", "BGM", "DTM" };
int pos = test.IndexOfAny(chars);
in this case pos would be 8 because from all 3 substrings, UNH is the first occurrence in the variable test
What I actually trying to accomplish is splitting the large string into a list of strings, but the delimiter can be one of many values ("BGM", "UNH", "DTM")
So the result would be
UNB+123
UNH+234
BGM+345
DTM+456
I can off course build a loop that does IndexOf for each of the substrings, and then remember the smallest value, but that seems so inefficient. I am hoping for a better way to do this
EDIT
the substrings to search for are always 3 letters, but the text in between can be anything at all with any length
EDIT
It are always 3 alfanumeric characters, and then anything can be there, also lots of + signs
You will find more problems with EDI than just splitting into corresponding fields, what about conditions or multiple values or lists?. I recommend you to take a look at EDI.net
EDIT:
EDIFact is a format pretty complex to just use regex, as I mentioned before, you will have conditions for each format/field/process, you will need to catch the whole field in order to really parse it, means as example DTM can have one specific datetime format and in another EDI can have a DateTime format totally different.
However, this is the structure of a DTM field:
DTM DATE/TIME/PERIOD
Function: To specify date, and/or time, or period.
010 C507 DATE/TIME/PERIOD M 1
2005 Date or time or period function code
qualifier M an..3
2380 Date or time or period text C an..35
2379 Date or time or period format code C an..3
So you will have always something like 'DTM+d3:d35:d3' to search for.
Really, it doesn't worth the struggle, use EDI.net, create your own POCO classes and work from there.
Friendly reminder that EDIFact changes every 6 months on Europe.
If the separators can be any one of UNB, UNH, BGM, or DTM, the following Regex could work:
foreach (Match match in Regex.Matches(input, #"(UNB|UNH|BGM|DTM).+?(?=(UNB|UNH|BGM|DTM)|$)"))
{
Console.WriteLine(match.Value);
}
Explanation:
(UNB|UNH|BGM|DTM) matches either of the separators
.+? matches any string with at least one character (but as short as possible)
(?=(UNB|UNH|BGM|DTM)|$) matches if either a separator follows or if the string ends there - the match is however not included in the value.
It sounds like the other answer recognises the format - you should definitely consider a library specifically for parsing this format!
If you're intent on parsing it yourself, you could simply find the index of your identifiers in the string, determine the first 2 by position, and use those positions to Substring the original input
var input = "UNB+123UNH+234BGM+345DTM+456";
var chars = new[]{"UNH", "BGM", "DTM" };
var indexes = chars.Select(c => new{Length=c.Length,Position= input.IndexOf(c)}) // Get position and length of each input
.Where(x => x.Position>-1) // where there is actually a match
.OrderBy(x =>x.Position) // put them in order of the position in the input
.Take(2) // only interested in first 2
.ToArray(); // make it an array
if(indexes.Length < 2)
throw new Exception("Did not find 2");
var result = input.Substring(indexes[0].Position + indexes[0].Length, indexes[1].Position - indexes[0].Position - indexes[0].Length);
Live example: https://dotnetfiddle.net/tDiQLG
There is already a lot of answers here, but I took the time to write mine so might as well post it even if it's not as elegant.
The code assumes all tags are accounted for in the chars array.
string str = "UNB+123UNH+234BGM+345DTM+456";
string[] chars = { "UNH", "BGM", "DTM" };
var locations = chars.Select(o => str.IndexOf(o)).Where(i => i > -1).OrderBy(o => o);
var resultList = new List<string>();
for(int i = 0;i < locations.Count();i++)
{
var nextIndex = locations.ElementAtOrDefault(i + 1);
nextIndex = nextIndex > 0 ? nextIndex : str.Length;
nextIndex = nextIndex - locations.ElementAt(i);
resultList.Add(str.Substring(locations.ElementAt(i), nextIndex));
}
This is a fairly efficient O(n) solution using a HashSet
It's extremely simple, low allocations, more efficient than regex, and doesn't need a library
Given
private static HashSet<string> _set;
public static IEnumerable<string> Split(string input)
{
var last = 0;
for (int i = 0; i < input.Length-3; i++)
{
if (!_set.Contains(input.Substring(i, 3))) continue;
yield return input.Substring(last, i - last);
last = i;
}
yield return input.Substring(last);
}
Usage
_set = new HashSet<string>(new []{ "UNH", "BGM", "DTM" });
var results = Split("UNB+123UNH+234BGM+345DTM+456");
foreach (var item in results)
Console.WriteLine(item);
Output
UNB+123
UNH+234
BGM+345
DTM+456
Full Demo Here
Note : You could get this faster with a simple sorted tree, but would require more effort
I would like to count a string (search term) in another string (logfile).
Splitting the string with the method Split and searching the array afterwards is too inefficient for me, because the logfile is very large.
In the net I found the following possibility, which worked quite well so far. However,
count = Regex.Matches(_editor.Text, txtLookFor.Text, RegexOptions.IgnoreCase).Count;
I am now running into another problem there, that I get the following error when I count a string in the format of "Nachricht erhalten (".
Errormessage:
System.ArgumentException: "Nachricht erhalten (" analysed - not enough )-characters.
You need to escape the ( symbol as it has a special function in regular expressions:
var test = Regex.Matches("Nachricht erhalten (3)", #"Nachricht erhalten \(", RegexOptions.IgnoreCase).Count;
If you do this by user input where the user is not familiar with regular expressions you probably easier off using IndexOf in a while loop, where you keep using the new index found in the last loop. Which might also be a bit better on performance than a regular expression. Example:
var test = "This is a test";
var searchFor = "is";
var count = 0;
var index = test.IndexOf(searchFor, 0);
while (index != -1)
{
++count;
index = test.IndexOf(searchFor, index + searchFor.Length);
}
Suppose I have this CSV file :
NAME,ADDRESS,DATE
"Eko S. Wibowo", "Tamanan, Banguntapan, Bantul, DIY", "6/27/1979"
I would like like to store each token that enclosed using a double quotes to be in an array, is there a safe to do this instead of using the String split() function? Currently I load up the file in a RichTextBox, and then using its Lines[] property, I do a loop for each Lines[] element and doing this :
string[] line = s.Split(',');
s is a reference to RichTextBox.Lines[].
And as you can clearly see, the comma inside a token can easily messed up split() function. So, instead of ended with three token as I want it, I ended with 6 tokens
Any help will be appreciated!
You could use regex too:
string input = "\"Eko S. Wibowo\", \"Tamanan, Banguntapan, Bantul, DIY\", \"6/27/1979\"";
string pattern = #"""\s*,\s*""";
// input.Substring(1, input.Length - 2) removes the first and last " from the string
string[] tokens = System.Text.RegularExpressions.Regex.Split(
input.Substring(1, input.Length - 2), pattern);
This will give you:
Eko S. Wibowo
Tamanan, Banguntapan, Bantul, DIY
6/27/1979
I've done this with my own method. It simply counts the amout of " and ' characters.
Improve this to your needs.
public List<string> SplitCsvLine(string s) {
int i;
int a = 0;
int count = 0;
List<string> str = new List<string>();
for (i = 0; i < s.Length; i++) {
switch (s[i]) {
case ',':
if ((count & 1) == 0) {
str.Add(s.Substring(a, i - a));
a = i + 1;
}
break;
case '"':
case '\'': count++; break;
}
}
str.Add(s.Substring(a));
return str;
}
It's not an exact answer to your question, but why don't you use already written library to manipulate CSV file, good example would be LinqToCsv. CSV could be delimited with various punctuation signs. Moreover, there are gotchas, which are already addressed by library creators. Such as dealing with name row, dealing with different date formats and mapping rows to C# objects.
You can replace "," with ; then split by ;
var values= s.Replace("\",\"",";").Split(';');
If your CSV line is tightly packed it's easiest to use the end and tail removal mentioned earlier and then a simple split on a joining string
string[] tokens = input.Substring(1, input.Length - 2).Split("\",\"");
This will only work if ALL fields are double-quoted even if they don't (officially) need to be. It will be faster than RegEx but with given conditions as to its use.
Really useful if your data looks like
"Name","1","12/03/2018","Add1,Add2,Add3","other stuff"
Five years old but there is always somebody new who wants to split a CSV.
If your data is simple and predictable (i.e. never has any special characters like commas, quotes and newlines) then you can do it with split() or regex.
But to support all the nuances of the CSV format properly without code soup you should really use a library where all the magic has already been figured out. Don't re-invent the wheel (unless you are doing it for fun of course).
CsvHelper is simple enough to use:
https://joshclose.github.io/CsvHelper/2.x/
using (var parser = new CsvParser(textReader)
{
while(true)
{
string[] line = parser.Read();
if (line != null)
{
// do something
}
else
{
break;
}
}
}
More discussion / same question:
Dealing with commas in a CSV file
I am using C# and in one of the places i got list of all peoples names with their email id's in the format
name(email)\n
i just came with this sub string stuff just off my head. I am looking for more elegant, fast ( in the terms of access time, operations it performs), easy to remember line of code to do this.
string pattern = "jackal(jackal#gmail.com)";
string email = pattern.SubString(pattern.indexOf("("),pattern.LastIndexOf(")") - pattern.indexOf("("));
//extra
string email = pattern.Split('(',')')[1];
I think doing the above would do sequential access to each character until it finds the index of the character. Works ok now since name is short, but would struggle when having a large name ( hope people don't have one)
A dirty hack would be to let microsoft do it for you.
try
{
new MailAddress(input);
//valid
}
catch (Exception ex)
{
// invalid
}
I hope they would do a better job than a custom reg-ex.
Maintaining a custom reg-ex that takes care of everything might involve some effort.
Refer: MailAddress
Your format is actually very close to some supported formats.
Text within () are treated as comments, but if you replace ( with < and ) with > and get a supported format.
The second parameter in Substring() is the length of the string to take, not the ending index.
Your code should read:
string pattern = "jackal(jackal#gmail.com)";
int start = pattern.IndexOf("(") + 1;
int end = pattern.LastIndexOf(")");
string email = pattern.Substring(start, end - start);
Alternatively, have a look at Regular Expression to find a string included between two characters while EXCLUDING the delimiters
0000016011071693266104*014482*3 15301 45 VETRO NOVA BLUVETRO NOVA BLUE FLAT STRETCH 115428815150010050 05420 000033 0003
0000072011076993266101*014687*4 15300 45 VETRO NOVA BLUVETRO NOVA BLUE FLAT STRETCH 115428815160010030 05430 000032 0007
I have a text file which includes many barcode codes line by line, and as you see in above string format are company codes and others show other things.
So how can I get read this text line by line and character by character in C#?
For reading it line by line you can use a StreamReader - see for example on MSDN http://msdn.microsoft.com/en-us/library/db5x7c0d.aspx
Another option is:
string[] AllLines = File.ReadAllLines (#"C:\MyFile.txt");
This give you all lines in a string array and you can work with them - this uses more memory but is faster... see for example http://msdn.microsoft.com/en-us/library/s2tte0y1.aspx
When have a line in a string you can split that line for example:
string[] MyFields = AllLines[1].Split(null); // since your fields seem to be separated by whitespace
The result is that you have the parts of the line in an array and can access for example the second field in the line with MyFields[1] - see http://msdn.microsoft.com/en-us/library/b873y76a.aspx
EDIT - as per comment another option:
IF you exactly know the positions and lengths of your fields you can do this:
string MyIdentity = AllLines[1].SubString(1, 5);
For MSDN reference see http://msdn.microsoft.com/en-us/library/aka44szs.aspx
You use Microsoft libraries dedicated to files and streams to open a file, and Readline().
Then you use Microsoft libraries dedicated to parsing to parse those lines.
You create, with Microsoft libraries, a regular expression to detect bar codes (not borcod...)
Then you throw away anything that doesn't match your regular expression.
Then you compile and debug (you can use Mono). And voilĂ , you have a C# program that solves your problem.
Note: you definitely don't need to go "character by character". Microsoft libraries and parsing will be much easier for your simple need.
If all you are after is reading it line-by-line, and character-by-character, then this is a possible solution:
var lines = File.ReadLines(#"pathtotextfile.txt");
foreach (var line in lines)
{
foreach (var character in line)
{
char individualCharacter = character;
}
}
If you need to know which line and character you are on; you can use a for loop instead:
var lines = File.ReadAllLines(#"pathtotextfile.txt");
for (var i = 0; i < lines.Length; i++)
{
var line = lines[i];
for(var j = 0; j < line.Length; j++)
{
var character = line[j];
}
}
Or use SelectMany in LINQ:
var lines = File.ReadLines(#"pathtotextfile.txt");
foreach (char individualCharacter in lines.SelectMany(line => line))
{
}
Now, as far as my opinion goes, doing it "line by line" and "character by character" seems like a difficult choice to me. If you can tell us what exactly each bit of information is in the barcode, we could help you extract it that way.