Compress a short but repeating string - c#

I'm working on a web app that needs to take a list of files on a query string (specifically a GET and not a POST), something like:
http://site.com/app?things=/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
I want to shorten that string:
http://site.com/app?things=somekindofencoding
The string isn't terribly long, varies from 20-150 chars. Something that short isn't really suitable for GZip, but it does have an awful lot of repetition so compression should be possible.
I don't want a DB or Dictionary of strings - the URL will be built by a different application to the one that consumes it. I want a reversible compression that shortens this URL. It doesn't need to be secure.
Is there an existing way to do this? I'm working in C#/.Net but would be happy to adapt an algorithm from some other language/stack.

If you can express the data in BNF you could contruct a parser for the data. in stead of sending the data you could send the AST where each node would be identified as one character (or several if you have a lot of different nodes). In your example
we could have
files : file files
|
file : path id
path : itemsthing
| filesitem
| stuffthingsitem
you could the represent a list of files as path[id1,id2,...,idn] using 0,1,2 for the paths and the input being:
/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
/files/item1,/files/item46,/files/item7
you'd then end up with ?things=2[123,456,789]1[1,46,7]
where /stuff/things/item is represented with 2 and /files/item/ is represented with 1 each number within [...] is an id. so 2[123] would expand to /stuff/things/item123
EDIT The approach does not have to be static. If you have to discover the repeated items dynamically you can use the same approach and pass the map between identifier and token. in that case the above example would be
?things=2[123,456,789]1[1,46,7]&tokens=2=/stuff/things/,1=/files/item
which if the grammar is this simple ofcourse would do better with
?things=/stuff/things/[123,456,789]/files/item[1,46,7]
compressing the repeated part to less than the unique value with such a short string is possible but will most likely have to be based on constraining the possible values or risk actually increasing the size when "compressing"

You can try zlib using raw deflate (no zlib or gzip headers and trailers). It will generally provide some compression even on short strings that are composed of printable characters and does look for and take advantage of repeated strings. I haven't tried it, but could also see if smaz works for your data.
I would recommend obtaining a large set of real-life example URLs to use for benchmark testing of possible compression approaches.

Related

OutputBuffer not working for large c# list

I'm currently using SSIS to do an improvement on a project. need to insert single documents in a MongoDB collection of type Time Series. At some point I want to retrieve rows of data after going through a C# transformation script. I did this:
foreach (BsonDocument bson in listBson)
{
OutputBuffer.AddRow();
OutputBuffer.DatalineX = (string) bson.GetValue("data");
}
But this piece of code that works great with small file does not work with a 6 million line file. That is, there are no lines in the output. The other following tasks validate but react as if they had received nothing as input.
Where could the problem come from?
Your OuputBuffer has DatalineX defined as a string, either DT_STR or DT_WSTR and a specific length. When you exceed that value, things go bad. In normal strings, you'd have a maximum length of 8k or 4k respectively.
Neither of which are useful for your use case of at least 6M characters. To handle that, you'll need to change your data type to DT_TEXT/DT_NTEXT Those data types do not require a length as they are "max" types. There are lots of things to be aware of when using the LOB types.
Performance can suck depending on whether SSIS can keep the data in memory (good) or has to write intermediate values to disk (bad)
You can't readily manipulate them in a data flow
You'll use a different syntax in a Script Component to work with them
e.g.
// TODO: convert to bytes
Output0Buffer.DatalineX.AddBlobData(bytes);
Longer example of questionable accuracy with regard to encoding the bytes that you get to solve at https://stackoverflow.com/a/74902194/181965

How do I read the Windows NTFS $Secure file (and/or the $SDS stream) programmatically in C#

The methods in the .NET platform's DirectorySecurity namespace (e.g. GetAccessRules()) are far too slow for my purposes. Instead, I wish to directly query the NTFS $Secure metafile (or, alternatively, the $SDS stream) in order to retrieve a list of local accounts and their associated permissions for each file system object.
My plan is to first read the $MFT metafile (which I've already figured out how to do) - and then, for each entry therein, look up the appropriate security descriptor in the metafile (or stream).
The ideal code block would look something like this:
//I've already successfully written code for MFTReader:
var mftReader = new MFTReader(driveToAnalyze, RetrieveMode.All);
IEnumerable<INode> nodes = mftReader.GetNodes(driveToAnalyze.Name);
foreach (NodeWrapper node in nodes)
{
//Now I wish to return security information for each file system object
//WITHOUT needing to traverse the directory tree.
//This is where I need help:
var securityInfo = GetSecurityInfoFromMetafile(node.FullName, node.SecurityID);
yield return Tuple.Create(node.FullName, securityInfo.PrincipalName, DecodeAccessMask(securityInfo.AccessMask));
}
And I would like my output to look like this:
c:\Folder1\File1.txt jane_smith Read, Write, Execute
c:\Folder1\File1.txt bill_jones Read, Execute
c:\Folder1\File2.txt john_brown Full Control
etc.
I am running .NET version 4.7.1 on the Windows 10.
There's no API to read directly from $Secure, just like there is no API to read directly from $MFT. (There's FSCTL_QUERY_FILE_LAYOUT but that just gives you an abstracted interpretation of the MFT contents.)
Since you said you can read $MFT, it sounds like you must be using a volume handle to read directly from the volume, just like chkdsk and similar tools. That allows you to read whatever you want provided you know how to interpret the on-disk structures. So your question reduces to how to correctly interpret the $Secure file.
I will not give you code snippets or exact data structures, but I will give you some very good hints. There are actually two approaches possible.
The first approach is you could scan forward in $SDS. All of the security descriptors are there, in SecurityId order. You'll find there's at various 16-byte aligned offsets, there will be a 20-byte header that includes the SecurityId among other information, and following that there's the security descriptor in serialized form. The SecurityId values will appear in ascending order in $SDS. Also every alternate 256K region in $SDS is a mirror of the previous 256K region. To cut the work in half only consider the regions 0..256K-1, 512K..768K-1, etc.
The second approach is to make use of the $SII index, also part of the $Secure file. The structure of this is a B-tree very similar to how directories are structured in NTFS. The index entries in $SII have SecurityId as the index for lookups, and also contain the byte offset you can go to in $SDS to find the corresponding header and security descriptor. This approach will be more performant than scanning $SDS, but requires you to know how to interpret a lot more structures.
Craig pretty much covered everything. I would like to clear some of them. Like Craig, no code here.
Navigate to the node number 9 which corresponds to $Secure.
Get all the streams and get all the fragments of the $SDS stream.
Read the content and extract each security descriptor.
Use IsValidSecurityDescriptor to make sure the SD is valid and stop when you reach an invalid SD.
Remember that the $Secure store the security descriptors in self-relative format.
Are you using FSCTL_QUERY_FILE_LAYOUT? The only real source of how to use this function I have found is here:
https://wimlib.net/git/?p=wimlib;a=blob;f=src/win32_capture.c;h=d62f7d07ef20c08c9bec93f261131033e39b159b;hb=HEAD
It looks like he solves the problem with security descriptors like this:
He gets basically all information about files from the MFT, but not security descriptors. For those he gets the field SecurityId from the MFT and looks in a hash table whether he already has a mapping from this ID to the ACL. If he has, he just returns it, otherwise he uses NtQuerySecurityObject and caches it in the hash table. This should drastically reduce the amount of calls. It assumes that there are few security descriptors and that the SecurityID field correctly represents the single instancing of the descriptors

Best Way to Load a File, Manipulate the Data, and Write a New File

I have an issue where I need to load a fixed-length file. Process some of the fields, generate a few others, and finally output a new file. The difficult part is that the file is of part numbers and some of the products are superceded by other products (which can also be superceded). What I need to do is follow the superceded trail to get information I need to replace some of the fields in the row I am looking at. So how can I best handle about 200000 lines from a file and the need to move up and down within the given products? I thought about using a collection to hold the data or a dataset, but I just don't think this is the right way. Here is an example of what I am trying to do:
Before
Part Number List Price Description Superceding Part Number
0913982 3852943
3852943 0006710 CARRIER,BEARING
After
Part Number List Price Description Superceding Part Number
0913982 0006710 CARRIER,BEARING 3852943
3852943 0006710 CARRIER,BEARING
As usual any help would be appreciated, thanks.
Wade
Create structure of given fields.
Read file and put structures in collection. You may use part number as key for hashtable to provide fastest searching.
Scan collection and fix the data.
200 000 objects from given lines will fit easily in memory.
For example.
If your structure size is 50 bytes then you will need only 10Mb of memory. It is nothing for modern PC.

Advanced C# pattern search in long string (100-25000 char)

Let me start with this: I can't zip it or anything similar.
What I'm trying to do is search through fairly large strings. I use data blocks that look like 0g12h. (The 0 is the color from my palette. The g is a space to divide the numbers. The 12 means 12 pixels in a row use that color. The h is to divide the numbers again.)
The problem I'm having is that the blocks aren't all the same length. They range from 0g1h to 2546g115h. Basically I want to create a palette of common patterns to hopefully save space. Say I have: 12g345h19g12h190g11h occurring at least three times, then I could save space if I had something like: a=12g345h19g12h190g11h in the palette array and just put 'a' in the string. Or even not look at the data blocks, as you see in the attached file you get g640h a ton of times.
I could be wrong, but I'm pretty sure this could work. If you have a better idea how I could save space and not lose data, I'm more than open to the ideas.
Here is a great example since you can visually see the pattern: http://pastebin.com/5dbhxZQK. I chose this file because I knew it would have massive redundancy; most aren't this simple.
You could use a dictionary (probably Dictionary<string, int> and just could how many times each pattern occurs, then go back and rewrite the string with the appropriate replacements.
However, I would recommend that you read up a little about compression algorithms, what you are implementing appears to be a Run Length Encoding (RLE) scheme. You are then trying to compress again on top of that, consider looking at how Sliding Window compression works (which is what GZIP does) as an alternative to your RLE. Or look at Huffman encoding as a mechanism to reduce the amount of space needed for the codewords that you are creating (in simple terms Huffman encoding uses shorter symbols for more frequent patterns and longer symbols for less frequent patterns in a 'optimal' way)
This is a fun problem space to play in! Good Luck!

Compress Guids by hashing in small data sets

I'm working on a mobile app and i want to optimise the data that it's receiving from the server (as JSON).
There are 3 lists returned (each containing its own class of objects, the approximate list sizes are 50, 100 and 170). Each object has a Guid id and there is some relation data for each object. E.g.:
o = { Id = "8f088552-5b24-4ba4-a6e5-8958c4353581",
RelatedIds = ["19d2e562-0874-473f-8e05-7052e8defd9a", "615b4c47-199a-4f7d-8268-08ed43d9c891", ... ] }
Is there a way to compress these Guids to something sorter without storing an identity map? Perhaps using a hash function?
You can convert the 16-byte representation of a GUID into a Base 64 string. However you didn't mention a programming language so we can't help further.
A hash function is not recommended here because hash functions are generally lossy.
No. One of the attributes of (non-cryptographic) hashes is that they collide: hash(a) == hash(b) but a != b. They are a performance optimization in the case where you are doing a lot of equality checks and you expect many false results (because if hash(a) != hash(b) then a != b). A GUID->counter map is probably the best way to get smaller ids here.
You can convert hex (base16) to base64, and remove all the punctuation. You should save 25% for using base64, and another 4 bytes for punctuation.
Thinking about it some more i've realized that HTTP compression (if enabled) is probably going to compress that data well enough anyway, so it's not really worth the effort to compress data manually.

Categories

Resources