Compress Guids by hashing in small data sets - c#

I'm working on a mobile app and I want to optimise the data that it's receiving from the server (as JSON).
There are 3 lists returned (each containing its own class of objects, the approximate list sizes are 50, 100 and 170). Each object has a Guid id and there is some relation data for each object. E.g.:
o = { Id = "8f088552-5b24-4ba4-a6e5-8958c4353581",
RelatedIds = ["19d2e562-0874-473f-8e05-7052e8defd9a", "615b4c47-199a-4f7d-8268-08ed43d9c891", ... ] }
Is there a way to compress these Guids to something shorter without storing an identity map? Perhaps using a hash function?

You can convert the 16-byte representation of a GUID into a Base64 string.
A hash function is not recommended here because hash functions are generally lossy.
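Since the question is tagged C#, here's a minimal sketch of that conversion - a Guid's 36-character string becomes a 22-character token, and it round-trips without any lookup table:

using System;

static class GuidCodec
{
    // 16 bytes -> 24 Base64 chars; trim the "==" padding and swap the two
    // characters that are unsafe in URLs/JSON keys.
    public static string Shorten(Guid id) =>
        Convert.ToBase64String(id.ToByteArray())
            .TrimEnd('=')
            .Replace('+', '-')
            .Replace('/', '_');

    public static Guid Expand(string token) =>
        new Guid(Convert.FromBase64String(
            token.Replace('-', '+').Replace('_', '/') + "=="));

    static void Main()
    {
        var id = Guid.Parse("8f088552-5b24-4ba4-a6e5-8958c4353581");
        string shortId = Shorten(id);
        Console.WriteLine(shortId.Length);        // 22 instead of 36
        Console.WriteLine(Expand(shortId) == id); // True
    }
}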

No. One of the attributes of (non-cryptographic) hashes is that they collide: hash(a) == hash(b) but a != b. They are a performance optimization in the case where you are doing a lot of equality checks and you expect many false results (because if hash(a) != hash(b) then a != b). A GUID->counter map is probably the best way to get smaller ids here.
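A minimal sketch of such a map, assuming the server assigns counters per response and ships the reverse map once alongside the payload:

using System;
using System.Collections.Generic;

class GuidShortener
{
    private readonly Dictionary<Guid, int> _map = new Dictionary<Guid, int>();

    // Returns a small, stable int for each distinct Guid seen in this response.
    public int Shorten(Guid id)
    {
        int shortId;
        if (!_map.TryGetValue(id, out shortId))
        {
            shortId = _map.Count; // next counter value
            _map[id] = shortId;
        }
        return shortId;
    }

    // Ship this once with the response so the client can resolve the short ids.
    public IReadOnlyDictionary<Guid, int> Map { get { return _map; } }
}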

You can convert the hex (base16) representation to base64 and remove all the punctuation. Base64 saves 25% (the 32 hex characters become 24 base64 characters), and dropping the four hyphens saves another 4 bytes.

Thinking about it some more, I've realized that HTTP compression (if enabled) is probably going to compress that data well enough anyway, so it's not really worth the effort to compress the data manually.

Related

OutputBuffer not working for large c# list

I'm currently using SSIS to make an improvement on a project: I need to insert single documents into a MongoDB collection of type Time Series. At some point I want to retrieve rows of data after going through a C# transformation script. I did this:
foreach (BsonDocument bson in listBson)
{
    OutputBuffer.AddRow();
    OutputBuffer.DatalineX = (string)bson.GetValue("data");
}
But this piece of code, which works great with a small file, does not work with a 6-million-line file. That is, there are no lines in the output. The following tasks validate, but they react as if they had received nothing as input.
Where could the problem come from?
Your OutputBuffer has DatalineX defined as a string, either DT_STR or DT_WSTR, with a specific length. When you exceed that length, things go bad. For normal strings, you'd have a maximum length of 8k or 4k characters respectively.
Neither of which is useful for your use case of at least 6M characters. To handle that, you'll need to change your data type to DT_TEXT/DT_NTEXT. Those data types do not require a length, as they are "max" types. There are several things to be aware of when using the LOB types:
Performance can suck depending on whether SSIS can keep the data in memory (good) or has to write intermediate values to disk (bad)
You can't readily manipulate them in a data flow
You'll use a different syntax in a Script Component to work with them
e.g.
// Convert the string to bytes first - for DT_NTEXT, Unicode is the usual choice
byte[] bytes = System.Text.Encoding.Unicode.GetBytes((string)bson.GetValue("data"));
Output0Buffer.DatalineX.AddBlobData(bytes);
There's a longer example, of questionable accuracy with regard to encoding the bytes (which you get to solve), at https://stackoverflow.com/a/74902194/181965

Fastest way to detect non-equal strings (without storing the string)?

I am writing a templating engine and I am searching for a good way to detect if a template has changed.
For this I have the following requirements (in order of importance):
non-equal strings must be detected as different
as fast as possible
as little memory as possible (=> do not store the whole string for comparison)
high probability of detecting equal strings as equal
It is not a big problem if equal strings are sometimes not detected as equal, as this would just trigger an unneeded "re-rendering"; but because that is heavy work, it should happen as rarely as possible.
I first thought of using String.GetHashCode(), but the probability of getting the same hash code for two non-equal strings is pretty high.
Are there any good combinations, like checking hash code and Length, that lower the probability of two non-equal strings being wrongly detected as equal to an unrealistically low number?
Or is using a hashing algorithm, like MD5 or SHA, a good alternative (after the hash codes are equal)?
My rendering looks something like the following:
public string RenderTemplate(string name, string template)
{
    var cachedTemplate = Cache.Get(name);
    if (cachedTemplate == null || !cachedTemplate.Equals(template)) // <= Equals
    {
        cachedTemplate = new Template(name, template);
        cachedTemplate.Render();
        Cache.Set(name, cachedTemplate);
    }
    return cachedTemplate.Result;
}
The Equals is the point I am asking about.
I am also open for other suggestions how this could be solved.
UPDATE:
To add some numbers to get more context:
I expect to have >1000 individual templates, and each template will have at least a few thousand characters.
This is why I would like to avoid storing the whole template-string "in memory" only for the comparison.
Most of the templates are stored in the DB.
UPDATE 2:
What do you think about extending my RenderTemplate method with a timestamp as suggested by Nikola:
public string RenderTemplate(string name, string template, DateTime timestamp)
Then I could compare name, GetHashCode and timestamp, which does not need much memory, should be pretty fast, and makes the probability of a "wrongly detected equality" practically 0. I can read the timestamp from the DB (it's already there), or use the "last changed date" from the file system for a file-based template.
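A minimal sketch of what that cache entry and check might look like (all names hypothetical):

using System;

class CachedTemplate
{
    public int HashCode;
    public int Length;
    public DateTime Timestamp;
    public string Result;

    // Cheap change check: ~16 bytes of state per template instead of the
    // whole template string.
    public bool Matches(string template, DateTime timestamp)
    {
        return Timestamp == timestamp
            && Length == template.Length
            && HashCode == template.GetHashCode();
    }
}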
You don't have much choice. If you don't compare strings by comparing their content, use a hash algorithm to determine whether they are equal. Personally, I would probably use a hash algorithm. If you are a bit paranoid and afraid of a collision, choose an algorithm with the widest space (e.g. SHA512).
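For example, a minimal sketch with SHA-256: cache the 32-byte digest per template name instead of the template itself (names here are illustrative):

using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class TemplateDigest
{
    public static byte[] Compute(string template)
    {
        using (var sha = SHA256.Create())
            return sha.ComputeHash(Encoding.UTF8.GetBytes(template));
    }

    // Equal digests => treat the template as unchanged.
    public static bool Matches(byte[] cachedDigest, string template)
    {
        return cachedDigest != null && cachedDigest.SequenceEqual(Compute(template));
    }
}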
Why do you need to compare strings to determine that a template has changed? Why not use a different approach?
If the file is stored on disk, why not use a file watcher?
If it's stored in a database, why not use a timestamp to detect when it was saved?
If the application is restarted, reload the templates anyway
Also, it's worrying that a UI template changes so often that you must make checks like this. I think you have bigger design problems than just comparing strings.

I need very big array length(size) in C#

public double[] result = new double[ ??? ];
I am storing results, and the total number of results is bigger than 2,147,483,647, which is the max Int32.
I tried BigInteger, ulong, etc., but all of them gave me errors.
How can I create an array that can store > 50,147,483,647 results (double) inside it?
Thanks...
An array of 2,147,483,648 doubles will occupy 16GB of memory. For some people, that's not a big deal. I've got servers that won't even bother to hit the page file if I allocate a few of those arrays. Doesn't mean it's a good idea.
When you are dealing with huge amounts of data like that you should be looking to minimize the memory impact of the process. There are several ways to go with this, depending on how you're working with the data.
Sparse Arrays
If your array is sparsely populated - lots of default/empty values with a small percentage of actually valid/useful data - then a sparse array can drastically reduce the memory requirements. You can write various implementations to optimize for different distribution profiles: random distribution, grouped values, arbitrary contiguous groups, etc.
Works fine for any type of contained data, including complex classes. Has some overheads, so can actually be worse than naked arrays when the fill percentage is high. And of course you're still going to be using memory to store your actual data.
Simple Flat File
Store the data on disk, create a read/write FileStream for the file, and enclose that in a wrapper that lets you access the file's contents as if it were an in-memory array. The simplest implementation of this will give you reasonable usefulness for sequential reads from the file. Random reads and writes can slow you down, but you can do some buffering in the background to help mitigate the speed issues.
This approach works for any type that has a static size, including structures that can be copied to/from a range of bytes in the file. Doesn't work for dynamically-sized data like strings.
Complex Flat File
If you need to handle dynamic-size records, sparse data, etc. then you might be able to design a file format that can handle it elegantly. Then again, a database is probably a better option at this point.
Memory Mapped File
Same as the other file options, but using a different mechanism to access the data. See System.IO.MemoryMappedFile for more information on how to use Memory Mapped Files from .NET.
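A minimal sketch of this approach, assuming a fixed element count known up front (the class and its members are illustrative, not a BCL type):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public class MappedDoubleArray : IDisposable
{
    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _view;
    public long Length { get; private set; }

    public MappedDoubleArray(string path, long length)
    {
        Length = length;
        // Backs the "array" with a file sized to hold 'length' doubles.
        _file = MemoryMappedFile.CreateFromFile(
            path, FileMode.Create, null, length * sizeof(double));
        _view = _file.CreateViewAccessor();
    }

    public double this[long index]
    {
        get { return _view.ReadDouble(index * sizeof(double)); }
        set { _view.Write(index * sizeof(double), value); }
    }

    public void Dispose()
    {
        _view.Dispose();
        _file.Dispose();
    }
}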
Database Storage
Depending on the nature of the data, storing it in a database might work for you. For a large array of doubles, however, this is unlikely to be a great option. There are overheads for reading/writing data in the database, plus storage overheads: each row will at least need a row identity, probably a BIGINT (8-byte integer) for a large recordset, doubling the size of the data right off the bat. Add in the overheads for indexing, row storage, etc., and you can very easily multiply the size of your data.
Databases are great for storing and manipulating complicated data. That's what they're for. If you have variable-width data - strings and the like - then a database is probably one of your best options. The flip-side is that they're generally not an optimal solution for working with large amounts of very simple data.
Whichever option you go with, you can create an IList<T>-compatible class that encapsulates your data. This lets you write code that doesn't have any need to know how the data is stored, only what it is.
BCL arrays cannot do that.
Someone wrote a chunked BigArray<T> class that can.
However, that will not magically create enough memory to store it.
You can't. Even with gcAllowVeryLargeObjects, the maximum size of any single dimension in an array (of non-byte elements) is 2,146,435,071.
So you'll need to rethink your design, or use an alternative implementation such as a jagged array.
Another possible approach is to implement your own BigList. First note that List<T> is implemented as an array. Also, you can set the initial size of the List in the constructor, so if you know it will be big, get a big chunk of memory up front.
Then
public class myBigList<T> : List<List<T>>
{
}
or, maybe preferably, use a has-a approach:
public class myBigList<T>
{
    List<List<T>> theList;
}
In doing this you will need to re-implement the indexer, using division and modulo to find the correct indexes into your backing store. Then you can use a long (or even a BigInteger) as the index: in your custom indexer you decompose it into two legal-sized ints, as sketched below.
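A minimal sketch of that indexer (the chunk size is an arbitrary choice below the per-dimension limit):

using System;
using System.Collections.Generic;

public class myBigList<T>
{
    // Keep each inner list comfortably below the per-array element limit.
    private const int ChunkSize = 1 << 26; // 67,108,864 elements per chunk
    private readonly List<List<T>> theList = new List<List<T>>();

    public long Count { get; private set; }

    public void Add(T item)
    {
        if (Count % ChunkSize == 0)                // current chunk full (or none yet)
            theList.Add(new List<T>());            // could pre-size with a capacity here
        theList[(int)(Count / ChunkSize)].Add(item);
        Count++;
    }

    // Division picks the chunk, modulo picks the slot within it.
    public T this[long index]
    {
        get { return theList[(int)(index / ChunkSize)][(int)(index % ChunkSize)]; }
        set { theList[(int)(index / ChunkSize)][(int)(index % ChunkSize)] = value; }
    }
}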
I ran into the same problem. I solved it using a list of lists, which mimics an array very well but can go well beyond the 2 GB limit, e.g. List<List<sbyte>>. It worked for a 250k x 250k matrix of sbyte running on a 32 GB computer, even though this elephant represents 60 GB+ of space :-)
C# arrays are limited in size to System.Int32.MaxValue elements.
For bigger than that, note that a flat List<T> is backed by an array and shares the same limit, so use a nested structure like the List<List<T>> approach described above (where T is whatever you want to hold).
More here: What is the Maximum Size that an Array can hold?

Compress a short but repeating string

I'm working on a web app that needs to take a list of files on a query string (specifically a GET and not a POST), something like:
http://site.com/app?things=/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
I want to shorten that string:
http://site.com/app?things=somekindofencoding
The string isn't terribly long; it varies from 20-150 chars. Something that short isn't really suitable for GZip, but it does have an awful lot of repetition, so compression should be possible.
I don't want a DB or Dictionary of strings - the URL will be built by a different application to the one that consumes it. I want a reversible compression that shortens this URL. It doesn't need to be secure.
Is there an existing way to do this? I'm working in C#/.Net but would be happy to adapt an algorithm from some other language/stack.
If you can express the data in BNF, you could construct a parser for the data. Instead of sending the data you could send the AST, where each node is identified by one character (or several if you have a lot of different node types). In your example we could have
files : file files
      |
file  : path id
path  : /items/thing
      | /files/item
      | /stuff/things/item
You could then represent a list of files as path[id1,id2,...,idn], using 0, 1, 2 for the paths. With the input being:
/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
/files/item1,/files/item46,/files/item7
you'd then end up with ?things=2[123,456,789]1[1,46,7]
where /stuff/things/item is represented by 2 and /files/item is represented by 1. Each number within [...] is an id, so 2[123] would expand to /stuff/things/item123.
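A toy encoder for this scheme, assuming both applications share the same ordered prefix list (hard-coded here; PathListEncoder is illustrative):

using System;
using System.Collections.Generic;
using System.Text;

static class PathListEncoder
{
    // Shared, ordered prefix list; the array index is the token.
    private static readonly string[] Prefixes =
        { "/items/thing", "/files/item", "/stuff/things/item" };

    public static string Encode(IEnumerable<string> paths)
    {
        var ids = new Dictionary<int, List<string>>();
        var order = new List<int>(); // tokens in order of first appearance
        foreach (string path in paths)
        {
            int token = Array.FindIndex(Prefixes,
                p => path.StartsWith(p, StringComparison.Ordinal));
            if (token < 0) throw new ArgumentException("Unknown prefix: " + path);
            if (!ids.ContainsKey(token)) { ids[token] = new List<string>(); order.Add(token); }
            ids[token].Add(path.Substring(Prefixes[token].Length));
        }
        var sb = new StringBuilder();
        foreach (int token in order)
            sb.Append(token).Append('[').Append(string.Join(",", ids[token])).Append(']');
        return sb.ToString();
    }
}

// Encode(new[] { "/stuff/things/item123", "/stuff/things/item456", "/stuff/things/item789",
//                "/files/item1", "/files/item46", "/files/item7" })
// returns "2[123,456,789]1[1,46,7]"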
EDIT: The approach does not have to be static. If you have to discover the repeated items dynamically, you can use the same approach and pass the map between identifier and token along with the data. In that case the above example would be
?things=2[123,456,789]1[1,46,7]&tokens=2=/stuff/things/,1=/files/item
which, if the grammar is this simple, would of course do better with
?things=/stuff/things/[123,456,789]/files/item[1,46,7]
Compressing the repeated part to less than the unique values is possible with such a short string, but it will most likely have to be based on constraining the possible values; otherwise you risk actually increasing the size when "compressing".
You can try zlib using raw deflate (no zlib or gzip headers and trailers). It will generally provide some compression even on short strings that are composed of printable characters, and it does look for and take advantage of repeated strings. I haven't tried it, but you could also see if smaz works for your data.
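In .NET, DeflateStream produces exactly such a raw deflate stream, so a sketch might look like this (the URL-safe Base64 wrapper costs ~33% again, so measure against your real URLs):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class QueryStringCodec
{
    public static string Compress(string value)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionLevel.Optimal))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(value);
                deflate.Write(bytes, 0, bytes.Length);
            } // disposing flushes the deflate stream
            return Convert.ToBase64String(output.ToArray())
                .TrimEnd('=').Replace('+', '-').Replace('/', '_');
        }
    }

    public static string Decompress(string encoded)
    {
        string b64 = encoded.Replace('-', '+').Replace('_', '/');
        b64 = b64.PadRight(b64.Length + (4 - b64.Length % 4) % 4, '=');
        using (var input = new MemoryStream(Convert.FromBase64String(b64)))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(deflate, Encoding.UTF8))
            return reader.ReadToEnd();
    }
}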
I would recommend obtaining a large set of real-life example URLs to use for benchmark testing of possible compression approaches.

C# Random Code Field Generator for Object

I have an object with the following properties
GID
ID
Code
Name
Some of the clients don't want to enter the Code, so the initial plan was to put the ID in the Code, but the base object of the ORM is different, so I'm like screwed...
My plan was to put totally random ####-#### values in Code. How can I generate something like that (say, a Windows 7 serial generator type of thing), but wouldn't that have an overhead? What would you do in this case?
Do you want a random value, or a unique value?
random != unique.
Remember, random merely states a probability of not generating the same value, or a probability of generating the same value again. As time goes on, the likelihood of generating a previous value increases, becoming a near certainty. Which do you require?
Personally, I recommend just using a Guid with some context [refer to easiest section below]. I also provided some other suggestions so you have options, depending on your situation.
easiest
If Code is an unbounded string [ie can be of any length], the easiest semi-legible means of generating a unique code would be
OrmObject ormObject = new OrmObject();
string code = string.Format("{0} [{1}]", ormObject.Name, Guid.NewGuid()).Trim();
// generates something like
// "My Product [DA9190E1-7FC6-49d6-9EA5-589BBE6E005E]"
You can substitute ormObject.Name with any distinguishable string. I would typically use objectInstance.GetType().Name, but that will only work if OrmObject is a base class; if it's a concrete class used for everything, they will all end up with similar tags. The point is to add some user context, such that - as in @Yuriy Faktorovich's referenced WTF article - users have something to read.
random
I responded a day or two ago about random number generation - not so much generating numbers as building a simple flexible framework around a generator to improve quality of code and data. This should help streamline your source.
If you read that, you could easily write an extension method, say
public static class IRandomExtensions
{
    public static CodeType GetCode (this IRandom random)
    {
        // 1. get as many random bytes as required
        // 2. transform bytes into a 'Code'
        // 3. bob's your uncle
        ...
    }
}

// elsewhere in code
...
OrmObject ormObject = new OrmObject ();
ormObject.Code = random.GetCode ();
...
To actually generate a value, I would suggest implementing an IRandom interface with a System.Security.Cryptography.RNGCryptoServiceProvider implementation. Said implementation would generate a buffer of X random bytes, and dole out as many as required, regenerating a stream when exhausted.
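A minimal sketch of that suggestion - note IRandom is this answer's hypothetical interface, not a BCL type:

using System.Security.Cryptography;

public interface IRandom
{
    byte[] GetBytes(int count);
}

public class CryptoRandom : IRandom
{
    private readonly RNGCryptoServiceProvider _rng =
        new RNGCryptoServiceProvider();

    // Doles out 'count' cryptographically strong random bytes. A buffered
    // version would cache a larger block and slice from it, as described above.
    public byte[] GetBytes(int count)
    {
        byte[] buffer = new byte[count];
        _rng.GetBytes(buffer);
        return buffer;
    }
}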
Furthermore - I don't know why I keep writing, I guess this problem is really quite fascinating! - if CodeType is string and you want something readable, you could just take said random bytes and turn them into a "seemingly" readable string via Base64 conversion
public static class IRandomExtensions
{
    // assuming 'CodeType' is in fact a string
    public static string GetCode (this IRandom random)
    {
        // 1. get as many random bytes as required
        byte[] randomBytes = random.GetBytes (16); // e.g. 16 bytes
        // 2. transform bytes into a 'Code'
        string randomBase64String =
            System.Convert.ToBase64String (randomBytes).TrimEnd ('=');
        // 3. bob's your uncle
        return randomBase64String;
    }
}
Remember
random != unique.
Your values will repeat. Eventually.
unique
There are a number of questions you need to ask yourself about your problem.
Must all Code values be unique? [if not, you're trying too hard]
What Type is Code? [if any-length string, use a full Guid]
Is this a distributed application? [if not, use a DB value as suggested by @LBushkin above]
If it is a distributed application, can client applications generate and submit instances of these objects? [if so, then you want a globally unique identifier, and again Guids are a sure bet]
I'm sure you have more constraints, but this is an example of the kind of line of inquiry you need to perform when you encounter a problem like your own. From these questions, you will come up with a series of constraints. These constraints will inform your design.
Hope this helps :)
Btw, you will receive better quality solutions if you post more details [ie constraints] about your problem. Again, what Type is Code? Are there length constraints? Format constraints? Character constraints?
Arg, last edit, I swear. If you do end up using Guids, you may wish to obfuscate this, or even "compress" their representation by encoding them in base64 - similar to base64 conversion above for random numbers.
public static class GuidExtensions
{
    public static string ToBase64String (this Guid id)
    {
        return System.Convert.ToBase64String (id.ToByteArray ()).TrimEnd ('=');
    }
}
Unlike truncating, base64 conversion is not a lossy transformation. Of course, the trim above is lossy in the context of full base64 expansion - but = is just padding, extra information introduced by the conversion, and not part of the original Guid data. If you want to go back to a Guid from this base64-converted value, you will have to re-pad your base64 string until its length is a multiple of 4 - don't ask, just look up base64 if you are interested :)
You could generate a Guid using:
Guid.NewGuid().ToString();
It would give you something like:
788E94A0-C492-11DE-BFD4-FCE355D89593
Use an Autonumber column or Sequencer from your database to generate a unique code number. Almost all modern databases support automatically generated numbers in one form or another. Look into what your database supports.
Autonumber/Sequencer values from the DB are guaranteed to be unique and are relatively inexpensive to acquire. If you want to avoid completely sequential numbers assigned to codes, you can pad and concatenate several sequencer values together.
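For example, a minimal sketch of padding a sequence value into the asker's ####-#### shape (GetNextSequenceValue is a hypothetical data-access call):

using System;

static class CodeFactory
{
    // e.g. 12345 -> "0001-2345": 8 zero-padded digits, split in the middle
    public static string FromSequence(long next)
    {
        return next.ToString("D8").Insert(4, "-");
    }
}

// usage, assuming a hypothetical sequencer call:
// string code = CodeFactory.FromSequence(GetNextSequenceValue());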
