Compute edit distance between 2 very large strings - c#

SCENARIO: Given 2 input strings, I need to find the minimum number of insertions, deletions and substitutions required to convert one string into the other. The strings are text from 2 files. The comparison has to be done at word level.
What I have done is implement the edit distance algorithm, which does the job well, using a 2-dimensional array of size (m*n), where m and n are the sizes of the input strings.
The PROBLEM I am facing is that if the values of m and n become large, say more than 16,000, I get an OutOfMemory exception due to the large size of the m*n array. I am also running into memory fragmentation and Large Object Heap issues.
QUESTION: Looking for C# code to solve the edit distance problem for 2 very large strings (each containing more than 20k words) without getting an OutOfMemory exception.
MapReduce, database or MemoryMappedFile based solutions are not feasible. Only pure C# code will work.
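For reference, here is a minimal sketch (not an authoritative implementation) of the standard space optimization: each row of the dynamic programming matrix depends only on the previous row, so two arrays of length n+1 suffice, which keeps memory at O(min(m, n)) values instead of O(m*n) and avoids both the OutOfMemory exception and the Large Object Heap. Tokenizing on whitespace is an assumption; adjust it to your definition of a word.
using System;

static int WordEditDistance(string textA, string textB)
{
    // Assumption: words are separated by whitespace.
    string[] a = textA.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    string[] b = textB.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    if (b.Length > a.Length) { var t = a; a = b; b = t; }   // keep the rows as short as possible

    // Two rows of length (b.Length + 1) instead of an (m+1) x (n+1) matrix.
    int[] prev = new int[b.Length + 1];
    int[] curr = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) prev[j] = j;        // distance from the empty prefix

    for (int i = 1; i <= a.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;        // substitution cost
            curr[j] = Math.Min(Math.Min(prev[j] + 1,        // deletion
                                        curr[j - 1] + 1),   // insertion
                               prev[j - 1] + cost);         // substitution or match
        }
        var tmp = prev; prev = curr; curr = tmp;            // reuse the two rows
    }
    return prev[b.Length];
}
Each row costs roughly 4 bytes per word, so for 20k+ words the working set is on the order of 100 KB rather than gigabytes. If you need the actual sequence of edits rather than just the count, Hirschberg's algorithm recovers the alignment in the same linear space.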

Related

OutputBuffer not working for large c# list

I'm currently using SSIS to make an improvement on a project. I need to insert single documents into a MongoDB collection of type Time Series. At some point I want to retrieve rows of data after going through a C# transformation script. I did this:
foreach (BsonDocument bson in listBson)
{
    OutputBuffer.AddRow();
    OutputBuffer.DatalineX = (string) bson.GetValue("data");
}
But this piece of code, which works great with a small file, does not work with a 6 million line file. That is, there are no lines in the output. The subsequent tasks validate but react as if they had received nothing as input.
Where could the problem come from?
Your OutputBuffer has DatalineX defined as a string, either DT_STR or DT_WSTR, with a specific length. When you exceed that length, things go bad. Normal string columns have a maximum length of 8k or 4k characters respectively.
Neither of which is useful for your use case of at least 6M characters. To handle that, you'll need to change your data type to DT_TEXT/DT_NTEXT. Those data types do not require a length, as they are "max" types. There are a few things to be aware of when using the LOB types:
Performance can suck depending on whether SSIS can keep the data in memory (good) or has to write intermediate values to disk (bad)
You can't readily manipulate them in a data flow
You'll use a different syntax in a Script Component to work with them
e.g.
// Convert the string to bytes first. Encoding.Unicode is an assumption that matches DT_NTEXT;
// pick the encoding that matches your actual column type.
byte[] bytes = System.Text.Encoding.Unicode.GetBytes((string)bson.GetValue("data"));
Output0Buffer.DatalineX.AddBlobData(bytes);
There is a longer example, of questionable accuracy with regard to encoding the bytes (which you get to solve), at https://stackoverflow.com/a/74902194/181965

Fastest / most efficient way to work with very large Lists (in c#)

This question is related to the discussion here but can also be treated stand-alone.
Also, I think it would be nice to have the relevant results in one separate thread; I couldn't find anything on the web that deals with the topic comprehensively.
Let's say we need to work with a very, very large List<double> of data (e.g. 2 billion entries).
Having such a large list loaded into memory leads either to a "System.OutOfMemoryException"
or, if one uses <gcAllowVeryLargeObjects enabled="true" />, eventually just eats up the entire memory.
First: How to minimize the size of the container:
Would an array of 2 billion doubles be less expensive?
Use decimals instead of doubles?
Is there a data type even less expensive than decimal in C#? (-100 to 100 in 0.00001 steps would do the job for me.)
Second: Saving the data to disk and reading it
What would be the fastest way to save the List to disk and read it back?
Here I think the size of the file could be a problem. A txt file containing 2 billion entries will be huge, and opening it could turn out to take ages. (Perhaps some type of stream between the program and the txt file would do the job?) Some sample code would be much appreciated :)
Third: Iteration
If the List is in memory, I think using yield would make sense.
If the List is saved to disk, the speed of iteration is mostly limited by the speed at which we can read it from the disk.
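A minimal sketch of one way to address all three points, under the question's own assumptions (the class name and file path are made up for the example): values in -100 to 100 with 0.00001 steps fit into a scaled 32-bit int, so each entry costs 4 bytes instead of 8 (double) or 16 (decimal); writing them in binary with BinaryWriter and streaming them back with an iterator block means the full collection never has to be resident in memory.
using System;
using System.Collections.Generic;
using System.IO;

static class DoubleStore
{
    // -100..100 in 0.00001 steps fits a 32-bit int when multiplied by 100,000.
    const int Scale = 100_000;

    public static void Save(string path, IEnumerable<double> values)
    {
        using (var writer = new BinaryWriter(new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None, 1 << 20)))
        {
            foreach (double v in values)
                writer.Write((int)Math.Round(v * Scale));          // 4 bytes per value, ~8 GB for 2 billion entries
        }
    }

    public static IEnumerable<double> Load(string path)
    {
        using (var reader = new BinaryReader(new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 1 << 20)))
        {
            long count = reader.BaseStream.Length / sizeof(int);
            for (long i = 0; i < count; i++)
                yield return reader.ReadInt32() / (double)Scale;   // streamed back lazily via yield
        }
    }
}
A binary file like this is also far smaller and faster to read than a text file, since no string formatting or parsing is involved.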

RFC_READ_TABLE throws Rfcabapexception after querying a lot of columns

Where I work, we have two systems that use SAP, one using Delphi and another using C#. I'm implementing the C# one, and both have the same problem: when I query a large number of columns using RFC_READ_TABLE, depending on the table (usually 60+ columns), it returns an Rfcabapexception with no description and no inner exception, just a title. What is causing this exception and what can I do to prevent it?
The function module RFC_READ_TABLE has to convert the data to a generic format because "really generic types" like DATA or STANDARD TABLE are not supported for RFC communication. Because of this, the output is transmitted as a series of table lines, each a character field of up to 512 characters in length.
This has several consequences:
If the total size of all fields you requested exceeds 512 characters, you will get a short dump (check with transaction ST22) and the exception you mentioned.
If you try to read fields that cannot be converted to character fields and/or do not have a fixed-length (!) character representation, bad things will happen. Most likely, RFC_READ_TABLE will either abort with a short dump or barf all over your output data.
You can bypass the first problem by slicing the table vertically and reading groups of columns sequentially. Be aware that RFC_READ_TABLE is not guaranteed to always return the data in the same order when stitching the results back together again. Also be aware that you might run into violations of transaction isolation, depending on how often the data you read changes.
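As an illustration of the vertical-slicing suggestion (a hypothetical helper, not part of the original answer), the requested fields can be grouped so that the combined output width of each group stays under the 512-character DATA line limit, with one RFC_READ_TABLE call issued per group; the field lengths would come from the table's metadata.
using System;
using System.Collections.Generic;

// Groups (fieldName, outputLength) pairs so each group's total width stays under
// RFC_READ_TABLE's 512-character DATA line; call the function module once per group
// and stitch the partial rows back together afterwards.
static List<List<string>> SliceFields(IEnumerable<(string Name, int Length)> fields, int maxWidth = 512)
{
    var groups = new List<List<string>>();
    var current = new List<string>();
    int width = 0;

    foreach (var (name, length) in fields)
    {
        if (width + length > maxWidth && current.Count > 0)
        {
            groups.Add(current);                 // current group is full, start a new one
            current = new List<string>();
            width = 0;
        }
        current.Add(name);
        width += length;
    }
    if (current.Count > 0) groups.Add(current);
    return groups;
}
Remember to include the table's key fields in every group (and account for their width), otherwise the partial rows cannot be matched up reliably, especially since the ordering of the results is not guaranteed.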

Very heavy load of data causes an out of memory exception in a foreach

First of all, I have this huge XML file that represents data collected by a piece of equipment. I convert it into an object. In fact, this object holds a list of objects, and each of those objects has three strings in it. The strings look like this:
0,12987;0,45678;...
It is some sort of list of doubles arranged this way for performance reasons. There are 1k doubles in each string, so 3k per object, and there are something like 3k objects, just to give you an idea of a typical case.
When I read the data, I must get all the doubles from the objects and add them to the same list. I made an "object that contains three doubles" (one for each string); in a foreach I go through every object and split my strings into arrays. After that, I loop to turn my arrays into a list of "objects that contain three doubles" and add it all to one list so I can use it for further operations.
It causes an out of memory exception before the end. Ideas? Something with LINQ would be best.
Let's do some math. 1000 values per string * 8 characters per value (6 digits plus a comma and semi-colon) * 2 bytes per character * 3 strings per object = 48,000 bytes per object. That by itself isn't a lot, and even with 3000 objects we're still only talking about around 150MB of RAM. That's still nothing to a modern system. Converting to double arrays is even less, as there's only 8 bytes per value rather than 16. Strings are also reference types, so there would have been overhead for that in the string version as well. The important thing is that no matter how you dice it you're still well short of the 85,000 byte threshold for these to get stuck on the Large Object Heap, the normal source of OutOfMemoryException.
Without code it's hard to follow exactly what you're doing, but I have a couple different guesses:
Many of the values in your string are more than 5 digits, such that you cross the magic 85,000 byte threshold after all, and your objects end up on the Large Object Heap in the garbage collector. That heap is not compacted, so as you continue processing objects you soon run yourself out of address space (not real RAM).
You're extracting your doubles in such a way that you're rebuilding the string over and over. This creates a lot of memory pressure on the garbage collector, as it re-creates strings over and over and over.
If this is a long running program, where the size and number of items in each string can vary significantly, it could be that over time you're running into a few large lists of large values that will push your object just over the 85,000 byte mark.
Either way, what you want to do here is stop thinking in terms of lists and start thinking in terms of sequences. Rather than List<string> or List<double>, go for IEnumerable<string> and IEnumerable<double>. Write a parser for your string that uses the yield keyword to create an iterator block that extracts doubles from your string one at a time by iterating over the characters, without ever changing the string. This will perform way better, and will likely fix your memory issue as well.
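A minimal sketch of such an iterator block (assuming, as in the sample above, ',' as the decimal separator and ';' between values; the helper name is made up):
using System;
using System.Collections.Generic;
using System.Globalization;

static IEnumerable<double> ParseDoubles(string line)
{
    // Culture with ',' as the decimal separator, matching "0,12987;0,45678;..."
    var culture = CultureInfo.GetCultureInfo("fr-FR");
    int start = 0;
    for (int i = 0; i <= line.Length; i++)
    {
        if (i == line.Length || line[i] == ';')
        {
            if (i > start)
            {
                // One small token at a time; the source string itself is never rebuilt.
                yield return double.Parse(line.Substring(start, i - start), culture);
            }
            start = i + 1;
        }
    }
}
Chaining these with SelectMany over the objects gives one lazy IEnumerable<double> that can be consumed without ever materializing the whole collection in a single giant list.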

What is the fastest way to count the unique elements in a list of a billion elements?

My problem is not usual. Let's imagine a few billion strings. Strings are usually less than 15 characters. In this list I need to find the number of unique elements.
First of all, what object should I use? You shouldn't forget that if I add a new element, I have to check whether it already exists in the list. It is not a problem in the beginning, but after a few million words it can really slow down the process.
That's why I thought that a Hashtable would be ideal for this task, because checking the list is ideally only O(1). Unfortunately a single object in .NET can only be 2GB.
The next step would be to implement a custom hashtable which contains a list of 2GB hashtables.
I am wondering whether some of you know a better solution.
(The computer has an extremely high specification.)
I would skip the data structures exercise and just use an SQL database. Why write another custom data structure that you have to analyze and debug? Just use a database. They are really good at answering queries like this.
I'd consider a Trie or a Directed acyclic word graph which should be more space-efficient than a hash table. Testing for membership of a string would be O(len) where len is the length of the input string, which is probably the same as a string hashing function.
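A minimal sketch of the trie idea (restricted to lowercase a-z for brevity; real input would need a wider alphabet or a child dictionary): Insert returns true only the first time a string is seen, so counting unique strings is just counting the true results.
class TrieNode
{
    public TrieNode[] Children = new TrieNode[26];   // lowercase a-z only, for the sketch
    public bool IsEnd;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    // Returns true if the string was not present before (i.e. it is a new unique element).
    public bool Insert(string s)
    {
        TrieNode node = root;
        foreach (char c in s)
        {
            int i = c - 'a';
            if (node.Children[i] == null) node.Children[i] = new TrieNode();
            node = node.Children[i];
        }
        if (node.IsEnd) return false;
        node.IsEnd = true;
        return true;
    }
}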
This can be solved in worst-case O(n) time using radix sort with counting sort as a stable sort for each character position. This is theoretically better than using a hash table (O(n) expected but not guaranteed) or mergesort (O(n log n)). Using a trie would also result in a worst-case O(n)-time solution (constant-time lookup over n keys, since all strings have a bounded length that's a small constant), so this is comparable. I'm not sure how they compare in practice. Radix sort is also fairly easy to implement and there are plenty of existing implementations.
If all strings are d characters or shorter, and the number of distinct characters is k, then radix sort takes O(d (n + k)) time to sort n keys. After sorting, you can traverse the sorted list in O(n) time and increment a counter every time you get to a new string. This would be the number of distinct strings. Since d is ~15 and k is relatively small compared to n (a billion), the running time is not too bad.
This uses O(dn) space though (to hold each string), so it's less space-efficient than tries.
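The O(n) counting pass after the sort is trivial; a minimal sketch (the helper name is just for illustration):
// Counts distinct values in an already-sorted array in one pass.
static long CountDistinctSorted(string[] sorted)
{
    if (sorted.Length == 0) return 0;
    long count = 1;                                  // the first element is always new
    for (int i = 1; i < sorted.Length; i++)
    {
        if (sorted[i] != sorted[i - 1]) count++;     // a value differing from its predecessor is a new distinct element
    }
    return count;
}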
If the items are strings, which are comparable... then I would suggest abandoning the idea of a Hashtable and going with something more like a Binary Search Tree. There are several implementations out there in C# (none come built into the Framework). Be sure to get one that is balanced, like a Red-Black Tree or an AVL Tree.
The advantage is that each object in the tree is relatively small (it only contains its value, plus links to its parent and two children), so you can have a whole slew of them.
Also, because it's sorted, retrieval and insertion time are both O(log n).
Since you specify that a single object cannot contain all of the strings, I would presume that you have the strings on disk or some other external memory. In that case I would probably go with sorting. From a sorted list it is simple to extract the unique elements. Merge sorting is popular for external sorts, and it needs only an amount of extra space equal to the size of the data. Start by dividing the input into pieces that fit into memory, sort those, and then start merging.
With a few billion strings, if even a few percent are unique, the chances of a hash collision are pretty high (.NET hash codes are 32-bit int, yielding roughly 4 billion unique hash values. If you have as few as 100 million unique strings, the risk of hash collision may be unacceptably high). Statistics isn't my strongest point, but some google research turns up that the probability of a collision for a perfectly distributed 32-bit hash is (N - 1) / 2^32, where N is the number of unique things that are hashed.
You run a MUCH lower probability of a hash collision using an algorithm that uses significantly more bits, such as SHA-1.
Assuming an adequate hash algorithm, one simple approach close to what you have already tried would be to create an array of hash tables. Divide possible hash values into enough numeric ranges so that any given block will not exceed the 2GB limit per object. Select the correct hash table based on the value of the hash, then search in that hash table. For example, you might create 256 hash tables and use (HashValue)%256 to get a hash table number from 0..255. Use that same algorithm when assigning a string to a bucket, and when checking/retrieving it.
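A rough sketch of that bucketing scheme (using HashSet<string> for the buckets, which gives membership checking and deduplication in one step; the 256 buckets, the sign-bit masking and the class name are arbitrary choices for the example):
using System;
using System.Collections.Generic;

class PartitionedStringSet
{
    private readonly HashSet<string>[] buckets;

    public PartitionedStringSet(int bucketCount = 256)
    {
        buckets = new HashSet<string>[bucketCount];
        for (int i = 0; i < bucketCount; i++) buckets[i] = new HashSet<string>();
    }

    // Returns true if the string has not been seen before.
    public bool Add(string s)
    {
        int bucket = (s.GetHashCode() & 0x7FFFFFFF) % buckets.Length;  // strip the sign bit, then pick a bucket
        return buckets[bucket].Add(s);                                 // HashSet.Add ignores duplicates
    }

    public long UniqueCount()
    {
        long total = 0;
        foreach (var b in buckets) total += b.Count;
        return total;
    }
}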
divide and conquer - partition data by first 2 letters (say)
dictionary of xx=>dictionary of string=> count
I would use a database, any database would do.
Probably the fastest because modern databases are optimized for speed and memory usage.
You need only one column with index, and then you can count the number of records.
+1 for the SQL/DB solutions; keeping things simple will allow you to focus on the real task at hand.
But just for academic purposes, I would like to add my 2 cents.
-1 for hashtables. (I cannot vote down yet.) Because they are implemented using buckets, the storage cost can be huge in many practical implementations. Plus, I agree with Eric J: the chances of collisions will undermine the time efficiency advantages.
Lee, the construction of a trie or DAWG will take up space as well as some extra time (initialization latency). If that is not an issue (which will be the case when you may need to perform search-like operations on the set of strings in the future as well, and you have ample memory available), tries can be a good choice.
Space will be the problem with Radix sort or similar implementations (as mentioned by KirarinSnow) because the dataset is huge.
Below is my solution for a one-time duplicate count with limits on how much space can be used.
If we have enough storage available to hold 1 billion elements in memory, we can sort them in place with heap-sort in Θ(n log n) time and then simply traverse the collection once in O(n) time, doing this:
if (a[i] == a[i + 1])
    dupCount++;
If we do not have that much memory available, we can divide the input file on disk into smaller files (until each piece is small enough to hold in memory); then sort each such small file using the above technique; then merge them together. This requires many passes over the main input file.
I would like to keep away from quick-sort because the dataset is huge. If I could squeeze in some memory for the second case, I would rather use it to reduce the number of passes than waste it on merge-sort/quick-sort (actually, it depends heavily on the type of input we have at hand).
Edit: SQL/DB solutions are good only when you need to store this data for a long duration.
Have you tried a hash map (Dictionary in .NET)?
Dictionary<string, byte> would only take up 5 bytes per entry on x86 (4 for the pointer into the string pool, 1 for the byte), which is about 400M elements in 2GB. If there are many duplicates, they should be able to fit. Implementation-wise, it might be verrryy slow (or not work), since you also need to store all those strings in memory.
If the strings are very similar, you could also write your own Trie implementation.
Otherwise, your best bet would be to sort the data in-place on disk (after which counting unique elements is trivial), or use a lower-level, more memory-tight language like C++.
A Dictionary<> is internally organized as a list of lists. You won't get close to the (2GB/8)^2 limit on a 64-bit machine.
I agree with the other posters regarding a database solution, but further to that, a reasonably-intelligent use of triggers, and a potentially-cute indexing scheme (i.e. a numerical representation of the strings) would be the fastest approach, IMHO.
If what you need is a close approximation of the unique count, then look at the HyperLogLog algorithm. It is used to get a close estimate of the cardinality of large datasets like the one you are referring to. Google BigQuery and Reddit use it for similar purposes. Many modern databases have implemented it. It is pretty fast and can work with minimal memory.
