I have a very huge text document. I am implementing "Search" functionality to find occurrences of a given string in the file and to display its position. It is not just whole word search, it can have part of a word / sentance/ paragraph. I am working out on efficient data structure for this process. If it is whole word search I could have used tries/ hash table. I will not be able to use suffix array/ suffix tree as the file size is very large. Sorting is also not that efficient. Other simple option is just to use string search/ regular expression functionality of the framework, which takes linear time. Is there any better known approach for this kind of opeation? Initially it is just string search, later on planning to give search with metacharacters.
Trie and suffix tree/array are a good option but if you do not like them i have another solution: create a hash table:
Create a hash table for all the strings of length 1, 2, 3, .. N where N is whatever number you want complexity O(N * size_of_text)
If you want to find a string you have 2 options:
If the size of the string is lower than N you just search it into the hash table ~O(1) for the search and o(size_of_string) for creating the hash_key
If the size is larger than N you just create chunks of size N and do this: Search a chunk and remember all the position. Than you do the same for the next chunk and check if there are numbers that are consecutive ( ex: first time we have i, j and second time we have k, i+N , than i, i+N is a good combination) save the last number of a consecutive pair(i, i+N, you keep just i+N) and continue until you don't have a number in your Stack or you finished the word
Hope it helped.
Lucene.NET is a search engine library that does text scanning with indexes:
http://incubator.apache.org/lucene.net/
Related
For my project, I have to search through a data set to find a string that matches. Previously, it was implemented by comparing every single item with the result string, but now my team wants it to run faster. I cannot use a hash map because we are searching through multiple datasets to find a string so I need an alternative. Please help, thank you!
//Looking for a specific string (time) in data set
foreach(var dataset in datasets)
{
if(dataset.Tables[0].Rows[i].ItemArray[0].ToString() == time.ToString())
{
//Enter this block
}
}
Use bitmaps. For example this one here - https://github.com/Auralytical/CRoaring.Net
Split your files into tokens, normalize them (to lower, trim, etc)
Assign index to each unique token (plain list, for example, any new token is just appended at the end)
If file contain token - set bit at its index.
You will get compact indexes for each file (sparse bitmap).
All you have to do now is to make the same with your query to get query bitmap, then just scan through all bitmaps to find which files are matched. You can make binary search tree to make it logarithmic.
pros
is that final complexity will be O(N) for indexing, O(log(N)) for finding any query you desire. N - count of documents.
cons
of this approach is if your tokens are random (hashes, md5, telephone numbers, datetimes, etc) - it will work very slow. Because domain is practically infinite and you should ignore those tokens and put them in some kind of offloading tree (or just into key-value database).
PS
This is how most of FullText search engines do it, so you can just straight use ElasticSearch or Lucene shard (which I recommend you to use, because it is simple and very flexible engine, and it is production ready which means there is gazilion of people who use this for most perverted business logic, like building their own search engines or porn story catalogs)
I need to build an index for a very big (50GB+) ASCII text file which will enable me to provide fast random read access to file (get nth line, get nth word in nth line). I've decided to use List<List<long>> map, where map[i][j] element is position of jth word of ith line in the file.
I will build the index sequentially, i.e. read the whole file and populating index with map.Add(new List<long>()) (new line) and map[i].Add(position) (new word). I will then retrieve specific word position with map[i][j].
The only problem I see is that I can't predict total count of lines/words, so I will bump into O(n) on every List reallocation, no idea of how I can avoid this.
Are there any other problems with the data structure I chose for the task? Which structure could be better?
UPD: File will not be altered during the runtime. There are no other ways to retrieve content except what I've listed.
Increasing size of a large list is very expensive operation; so, it's better to reserve list size at the beginning.
I'd suggest to use 2 lists. The first contains indexes of words within file, and the second contains indexes in the first list (index of the first word in the appropriate line).
You are very likely to exceed all available RAM. And when the system starts to page in/page out GC-managed RAM, performance of the program will be completely killed. I'd suggest to store your data in memory-mapped file rather than in managed memory. http://msdn.microsoft.com/en-us/library/dd997372.aspx
UPD memory mapped files are effective, when you need to work with huge amounts of data not fitting in RAM. Basically, it's your the only choice if your index becomes bigger than available RAM.
I'm looking for the fastest way to find all strings in a collection starting from a set of characters. I can use sorted collection for this, however I can't find convenient way to do this in .net. Basically I need to find low and high indexes in a collection that meet the criteria.
BinarySearch on List<T> does not guarantee the returned index is that of the 1st element, so one would need to iterate up and down to find all matching strings which is not fast if one has a large list.
There are also Linq methods (with parallel), but I'm not sure which data structure will provide the best results.
List example, ~10M of records:
aaaaaaaaaaaaaaabb
aaaaaaaaaaaaaaba
aaaaaaaaaaaaabc
...
zzzzzzzzzzzzzxx
zzzzzzzzzzzzzyzzz
zzzzzzzzzzzzzzzzzza
Search for strings starting from: skk...
Result: record indexes from x to y.
UPDATE: strings can have different lengths and are unique.
In terms of time complexity - you should use a trie, and not a sorted set or binary search.
Trie will get you a O(|S|) time complexity [while sorted set and binary search gets you O(|S|logn)] to find the node [let it be v] that represents that prefix.
All the strings [paths] in the trie that fit the prefix will "pass" via v. By adding numberOfLeaves field to each node, you can find out exactly how much leaves [=strings] this node has.
In a single pass - you can also find the index of this v [For each node u in the path from the root to v - sum numberOfLeaves for each sibling which is left to u].
This requires much more work then using already existing structures, but if the data is huge - it can make your algorithm much faster, so you should concider it if performance is an issue and you expect a huge set of strings.
You can do it with a hand-written binary search - one which just doesn't stop when it's found a match; it continues until it's found a single index.
In fact, you don't even have to write the binary search bit yourself - you could create a custom comparer which never returns 0, i.e. if you're looking for "abc" then it treats "abb" as being below the target value, but "abc" as being above the target value. This way the BinarySearch will always return a negative number, which you can then just bit-flip to find the theoretical insertion point for "the string which comes between abb and abc".
You can do the same in reverse (treat "abc" as lower than the target value) to find the highest bound.
If you know the format of these strings and it won't have edge cases like Unicode NULL characters, and everything's the same length, you can even do it without writing your own comparer:
// This could be done more efficiently :)
string stringJustBelow = target.Substring(0, target.Length - 1) +
target[target.Length - 1] + "X";
string stringJustAbove = target + "X"; // Or any character
int lowerBoundInclusive = ~list.BinarySearch(stringJustBelow);
int upperBoundExclusive = ~list.BinarySearch(stringJustAbove);
So if you strings are all length 3 and you were searching for "abc" you'd actually look for where "abbX" and "abcX" would be inserted.
Put them in SortedSet and use GetViewBetween.
This answer illustrates searching for both prefix and suffix, I'm sure you'll have no trouble adapting it to prefix-only search, if that is indeed what you want.
If you just want to search for a range (not prefix), directly using GetViewBetween should suffice.
I'm currently trying to process a number of data feeds that I have no control over, where I am using Regular Expressions in C# to extract information.
The originator of the data feed is extracting basic row data from their database (like a product name, price, etc), and then formatting that data within rows of English text. For each row, some of the text is repeated static text and some is the dynamically generated text from the database.
e.g
Panasonic TV with FREE Blu-Ray Player
Sony TV with FREE DVD Player + Box Office DVD
Kenwood Hi-Fi Unit with $20 Amazon MP3 Voucher
So the format in this instance is: PRODUCT with FREEGIFT.
PRODUCT and FREEGIFT are dynamic parts of each row, and the "with" text is static. Each feed has about 2000 rows.
Creating a Regular Expression to extract the dynamic parts is trivial.
The problem is that the marketing bods in control of the data feed keep on changing the structure of the static text, usually once a fortnight, so this week I might have:
Brand new Panasonic TV and a FREE Blu-Ray Player if you order today
Brand new Sony TV and a FREE DVD Player + Box Office DVD if you order today
Brand new Kenwood Hi-Fi unit and a $20 Amazon MP3 Voucher if you order today
And next week it will probably be something different, so I have to keep modifying my Regular Expressions...
How would you handle this?
Is there an algorithm to determine static and variable text within repeating rows of strings? If so, what would be the best way to use the output of such an algorithm to programatically create a dynamic Regular Expression?
Thanks for any help or advice.
This code isn't perfect, it certainly isn't efficient, and it's very likely to be too late to help you, but it does work. If given a set of strings, it will return the common content above a certain length.
However, as others have mentioned, an algorithm can only give you an approximation, as you could hit a bad batch where all products have the same initial word, and then the code would accidentally identify that content as static. It may also produce mismatches when dynamic content shares values with static content, but as the size of samples you feed into it grows, the chance of error will shrink.
I'd recommend running this on a subset of your data (20000 rows would be a bad idea!) with some sort of extra sanity checking (max # of static elements etc)
Final caveat: it may do a perfect job, but even if it does, how do you know which item is the PRODUCT and which one is the FREEGIFT?
The algorithm
If all strings in the set start with the same character, add that character to the "current match" set, then remove the leading character from all strings
If not, remove the first character from all strings whose first x (minimum match length) characters aren't contained in all the other strings
As soon as a mismatch is reached (case 2), yield the current match if it meets the length requirement
Continue until all strings are exhausted
The implementation
private static IEnumerable<string> FindCommonContent(string[] strings, int minimumMatchLength)
{
string sharedContent = "";
while (strings.All(x => x.Length > 0))
{
var item1FirstCharacter = strings[0][0];
if (strings.All(x => x[0] == item1FirstCharacter))
{
sharedContent += item1FirstCharacter;
for (int index = 0; index < strings.Length; index++)
strings[index] = strings[index].Substring(1);
continue;
}
if (sharedContent.Length >= minimumMatchLength)
yield return sharedContent;
sharedContent = "";
// If the first minMatch characters of a string aren't in all the other strings, consume the first character of that string
for (int index = 0; index < strings.Length; index++)
{
string testBlock = strings[index].Substring(0, Math.Min(minimumMatchLength, strings[index].Length));
if (!strings.All(x => x.Contains(testBlock)))
strings[index] = strings[index].Substring(1);
}
}
if (sharedContent.Length >= minimumMatchLength)
yield return sharedContent;
}
Output
Set 1 (from your example):
FindCommonContent(strings, 4);
=> "with "
Set 2 (from your example):
FindCommonContent(strings, 4);
=> "Brand new ", "and a ", "if you order today"
Building the regex
This should be as simple as:
"{.*}" + string.Join("{.*}", FindCommonContent(strings, 4)) + "{.*}";
=> "^{.*}Brand new {.*}and a {.*}if you order today{.*}$"
Although you could modify the algorithm to return information about where the matches are (between or outside the static content), this will be fine, as you know some will match zero-length strings anyway.
I think it would be possible with an algorithm , but the time it would take you to code it versus simply doing the Regular Expression might not be worth it.
You could however make your changing process faster. If instead of having your Regex String inside your application, you'd put it in a text file somewhere, you wouldn't have to recompile and redeploy everything every time there's a change, you could simply edit the text file.
Depending on your project size and implementation, this could save you a generous amount of time.
If i have to read a huge matrix of integers from a file, what would be the most efficient way to do so in C#?
Example :
n m // n - number of rows m - number of columns
a11 ... a1m
...
an1 ... anm
Mostly it depends what you mean by efficient, in terms of speed once loaded loading into an multidimensional array and using this would be very quick.
In terms of only using the resources that you need, streaming the rows line by line would be best.
In my opinion, after reading the whole text from file, you can use Regex class to match all numbers (using Matches(String) method) and retrieve an object of class Matches. All the number under String form are stored in this object and you can obtain the numbers by parsing it from these string.