Shortest encoding, Hexadecimal - c#

Hi guys, I would like to send the shortest possible string/value.
If I have the following:
1)l23k43i221j44h55uui6n433bb4
2)124987359824369785493584379
3)kla^askdjaslkd3AS423$#ksala
What is the terminology for shortening strings?
Encoding? Encrypting?
At the same time, what is the best method to shorten a string of text, taking into consideration that I only have an SMS limit of 255 characters?

So my first day in prison I was taken to the mess hall for lunch, and I was sitting with a bunch of old guys who had been there for years. One of them stood up and yelled "51!" and sat down, and everyone laughed. A few minutes later another inmate stood up and yelled "96!" and again, everyone laughed.
I asked the old guy next to me what was going on and he explained that they had heard each other's jokes so many times that they had just made a list of them, numbered them, and yelled out the number to save the time of actually telling the joke.
So I stood up and yelled "23!"
Silence.
I sat down.
"Well, some people just aren't good at telling jokes I guess" said the old guy.
If you know in advance the strings that you're going to send, you can distribute a list of them ahead of time, and then just send the number of the string.

The term you're looking for is compression. Basically, you transform input data into output data that is shorter or, at worst, the same length. This usually works when the data has patterns and repetitions (something like abcabcabc) or a limited alphabet (like in your second example).
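For a concrete starting point, here is a minimal sketch using GZipStream from System.IO.Compression (the SmsCompressor wrapper and its method names are just illustrative). Be aware that on strings as short as an SMS, the gzip header plus the Base64 step needed to keep the result text-safe can easily make the output longer than the input, so measure before relying on it:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class SmsCompressor
{
    // Compress with GZip, then Base64-encode so the result stays text-safe.
    public static string Compress(string input)
    {
        byte[] raw = Encoding.UTF8.GetBytes(input);
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
                gzip.Write(raw, 0, raw.Length);
            return Convert.ToBase64String(buffer.ToArray());
        }
    }

    public static string Decompress(string compressed)
    {
        using (var buffer = new MemoryStream(Convert.FromBase64String(compressed)))
        using (var gzip = new GZipStream(buffer, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
            return reader.ReadToEnd();
    }
}

If the alphabet really is limited (your second example is all digits), a hand-rolled scheme such as packing two digits per byte will usually beat a general-purpose compressor on short inputs.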

Related

Regex extract first +- n characters but also include the rest of the characters until the end of the line

Please help:
I have the following expression that retrieves the first 1000 characters of a string up until the end of the word; however, I would like it to grab the text up until the end of the line (\r\n).
Expression:
System.Text.RegularExpressions.Regex.Match(mystring, @"^.{1,1000}\b(?<!\s)").Value
Example text:
Whoo Hoo!! We made it through another year. Can you believe it? I can
hardly contain myself. This has been one bumpy ride. And now all of
that is over.\r\n\r\n Being a Taxi Driver can be quite rewarding in
more ways than one. There is nothing like it in the world!\r\n\r\n I
made some new friends and caught up with a couple old ones. I met new
people from all over the world and experienced a garden variety of
situations, all over a 12 hour period. \r\n\r\nTonight I had lots of
fun, though I can't help but to think of how disappointing it was for
everyone, financially, this year. I can remember a time when we would
have all booked and taken home about 3-4 times more than we did over
the past day. Buddy is my best friend. He is a small, blond, six and a
half year old Pekingese. He has been part of the family since he was
only three months old. He rode along on the front passenger seat as
one set of passengers came and went. watching intensely out the window
and paying close attention to where we were going and where we were
at. He would check out each person who entered the cab
(Currently it grabs the text till here)
just by looking at them over. He was enjoying his New Year!\r\n\r\n
(I would like it to Grab the text until here)
Around 1:30 AM I was crossing Ocean Blvd. A van ran out in front of
me. I slammed on the breaks and avoided a collision. Buddy was not
buckled in. His entire body slammed into the dash board and then he
dropped into the floor board. He was shaken and scared. Heck, I was
too. I was afraid that he was injured worse than the bump on his head
and his upset nervous system. It was at this moment I decided that it
was time for Buddy to go home and rest. I was glad that the dash is
built so high, otherwise Buddy would have met the wind shield.\r\n\r\n
You see, I feel like everything happens for a reason. At one moment my
mind started spinning, only lasting for a few seconds, and I realized
the there was a good reason I must have been sent to pick up each and
every one of the people whom I did this night. I set out each day with
a mission. Today I'll be in the right place, at the right time, to
meet the right people, for the betterment of all. \r\n\r\n 'I see
opportunity in every challenge.' \r\n\r\n This day did not short
change me on challenges, this is for sure. I know that I didn't do as
well as others, but did better than others. It seems, according to the
statics, that I may have faired out average. Over all I did well if
you ask me. Just look at the big picture. Note: This is not the order
of events of the evening.
"^(?<text>.{1,1000}[^\\]*)\\r\\n"
should work as it finds the corresponding \r\n, but only captures the text you want into the group "text".
No regex needed. Try
mystring.Substring(0, mystring.LastIndexOf("\r\n", 1000));
\r and \n are special "characters" in regex, standing, as you may expect, for carriage return and line feed; so let your regex scan on until it reaches \r\n.
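Putting those suggestions together, here is a minimal sketch (the method name is illustrative; it assumes the text contains real CRLF line breaks, and the substring branch guards against LastIndexOf returning -1):

using System;
using System.Text.RegularExpressions;

static class LineGrabber
{
    // Take roughly the first 'limit' characters, extended to the end of the
    // current line, i.e. up to the next "\r\n".
    public static string FirstChunk(string text, int limit = 1000)
    {
        if (string.IsNullOrEmpty(text) || text.Length <= limit)
            return text;

        // Regex variant: up to 'limit' characters, then lazily on to CRLF.
        Match m = Regex.Match(text, "^.{1," + limit + @"}.*?(?=\r\n)",
                              RegexOptions.Singleline);
        if (m.Success)
            return m.Value;

        // Substring variant: cut at the last CRLF at or before index 'limit'.
        int cut = text.LastIndexOf("\r\n", limit);
        return cut >= 0 ? text.Substring(0, cut) : text;
    }
}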

Memory-wise, is it better to store a long non-dynamic string as a single string object or to have the program build it out of its repetitive parts?

This is a bit of an odd question and more of a thought experiment than anything I need, but I'm still curious about the answer: If I have a string that I know ahead of time will never change but is (mostly) made up of repetitive parts, would it be better to keep said string as just a single string object, called when needed, and be done with it, or should I break the string up into smaller strings that represent the repeated parts and concatenate them when needed?
Let me use an example: Let's say we have a naive programmer who wants to create a regular expression for validating IP Addresses (in other words, I know this regular expression won't work as intended, but it helps show what I mean by repetitive parts and saves me a bit of typing for the second part of the example). So he writes this function:
private bool isValidIP(string ip)
{
    Regex checkIP = new Regex("\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?");
    return checkIP.IsMatch(ip);
}
Now our young programmer notices that he has "\d", "\d?", and "\." just repeated a few times. This gives him an idea that he could both save some storage space and help remind himself what this means for later. So he remakes the function:
private bool isValidIP(string ip)
{
    string escape = "\\";
    string digi = "d";
    string digit = escape + digi;
    string possibleDigit = digit + '?';
    string IpByte = digit + possibleDigit + possibleDigit;
    string period = escape + '.';
    Regex checkIP = new Regex(IpByte + period + IpByte + period + IpByte + period + IpByte);
    return checkIP.IsMatch(ip);
}
The first method is simple. It just stores 38 chars in the program's instructions, which are just read into memory each time the function is called.
The second method stores (I suspect) two 1-length strings and two chars in the program's instructions, as well as all of the calls to concatenate those four in different orders. This creates at least 8 strings in memory when the function is called (the six named strings, a temporary string for the first four parts of the regex, and then the final string created from that previous string plus the three remaining parts of the regex). This second method also happens to help explain what the regex is looking for, though not what the final regex would look like. It could also help with refactoring: say our hypothetical programmer realizes that his current regex will allow more than just 0-255 in the IP address; the constituent parts can then be changed without having to find every single item that would need to be fixed.
Again, which method would be better? Would it just be as simple as a trade-off between program size vs. memory usage? Of course, with something as simple as this, the trade-off is negligible at best, but what about a much larger, more complex string?
Oh, yes, and a much better regex for IP Addresses would be:
^(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)(\\.(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)){3}$
Wouldn't work as well as an example, would it?
The first is by far the better option. Here's why:
1) It's clearer.
2) It's cheaper. Any time you declare a new object it's an "expensive" process: you have to make space for it on the heap (well, for strings at least). Yes, you could in theory be saving a byte or so, but you're spending a lot more time (probably; I haven't tested it) allocating space for each string, executing additional memory instructions, etc. Not to mention that you also have to factor in the GC: keep allocating strings and eventually you will have to contend with collection taking up processor ticks as well. And if you really want to talk optimization, I can easily tell this code isn't as efficient as it could be. There are no constants, for one thing, which means you are possibly creating more objects than you need instead of letting the compiler optimize the strings that don't need to change. As a person reviewing this code, that makes me take a much closer look to see what is going on and figure out whether something is wrong.
3) It's clearer (yes, I said this again). You want to pursue an academic exercise to see how efficient you can make it? That's cool. I get that. I do it myself. It's fun. I NEVER let that slip into production code. I don't care about losing a tick; I care about having a bug in production, and I care about whether other programmers can understand what my code does. Reading someone else's code is hard enough; I don't want to add the extra task of having to figure out which micro-optimization I put in and what happens if they "nudge" the wrong piece of code.
You hit on another point: what if the original regex is wrong? Google will tell you this problem has been solved; you can Google another regex that's right and has been tested. You can't Google "what's wrong with my code." You can post it on SO, sure, but that means someone else has to get involved and look through it.
Here's how to make the first example win the horse race easily:
Regex checkIP = new Regex(
    "\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?");

private bool isValidIP(string ip)
{
    return checkIP.IsMatch(ip);
}
Declare once, reuse over and over. If you are taking the time to rebuild the regex dynamically on every call just to save a few bytes, you don't get to do that. Technically you could build it dynamically and still only create the object once, but that is a lot more work than, say, moving it to a class-level variable.
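For what it's worth, a minimal sketch of that "declare once" version at class level, with RegexOptions.Compiled thrown in as an assumption for the heavy-reuse case (the IpChecker name is illustrative):

using System.Text.RegularExpressions;

class IpChecker
{
    // Built once per type and reused across calls; RegexOptions.Compiled
    // trades extra startup cost and memory for faster repeated matching.
    private static readonly Regex checkIP = new Regex(
        "\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?\\.\\d\\d?\\d?",
        RegexOptions.Compiled);

    public static bool IsValidIP(string ip)
    {
        return checkIP.IsMatch(ip);
    }
}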
You're effectively attempting to game the compiler here and implement your own string compression. For the kinds of string literals you're describing, it seems like your savings will be mere tens of bytes shaved off of the compiled binary, which due to memory alignment may not even be realized. In exchange for these few bytes of saved space, this approach adds code complexity and runtime overhead, not to mention difficulty in debugging.
Storage is cheap. Why make your life (and the lives of your coworkers) harder? Keep your code simple, clear, and evident - you'll thank yourself later.
The second is worse off in memory consumption, as every time you concatenate two strings you've got three in memory.
Although the compiler has started handling some instances of string concatenation by creating a StringBuilder for you, I'd still vote for the first one being less memory-intensive: if the system does create the StringBuilder for you, you are going to have that overhead too, and if it doesn't, see the first paragraph...
I am now curious how compiling the regex would affect the memory usage.
The savings here are illusory, and splitting this string up is a big overreach. Saving an insignificant amount of memory while complicating such simple code is just pointless. You will not see any savings, but the next person to maintain that code will spend 10x more time understanding it.
Strings are immutable, so if your string never/rarely changes, keep it in one piece. Intense string concatenation puts additional strain on the garbage collector.
Unless your strings and sub-strings are big and you could save at least kilobytes, do not spend your time and effort on such optimizations.

To find out the number of occurrences of words in a file

I came across this question in an interview:
We have to find the number of occurrences of two given words in a text file with <= n words between them.
Example1:
text:`this is first string this is second string`
Keywords:`this, string`
n= 4
output= 2
"this is first string" is the first occurrence and number of words between this and string is 2(is, first) which is less than 4.
this is second string is the remaining string. number of words between *this and string * is 2 (is, second) which is less than 4.
Therefore the answer is 2.
My thought was that I would use
Dictionary<string, List<int>>.
My idea was to use the dictionary to get the list of positions where each particular word occurs, then iterate through both lists, incrementing the count whenever the condition is met, and finally display the count.
Is my thinking process correct? Please provide any suggestions to improve my solution.
Thanks,
Not an answer per se (as quite honestly, I don't understand the question :P), but to add some general interview advice to the other answers:
In interviews the interviewer is always looking for the thought process and that you are a critical, logical thinker. Not necessarily that you have excellent coding recall and can compile code in your brain.
In addition interviews are a stressful process. By slowing down and talking out loud as you work things out you not only look like a better communicator and logical thinker (even if getting the question wrong), you also give yourself time to think.
Use a pen and paper, speak as you think, start from the top and work through it. I've gotten jobs even when I didn't know the answers to tech questions, by demonstrating that I could at least try to work things out ;-)
In short, it's not just down to technical prowess.
I think it depends on whether the call is made only once or multiple times per string. If it's something like
int getOccurences(String str, String reference, int min_size) { ... }
then you don't really need the dictionary, not even a list. You can just iterate through the string to find occurrences of the words and then check the number of separators between them.
If on the other hand the problem is for arbitrary search/indexing, IMHO you do need a dictionary. I'd go for a dictionary where the key is the word and the value is a list of indexes where it occurs.
HTH
If you need to do that repeatedly for different pairs of words in the same text, then a word dictionary with a list of indexes is a good solution. However, if you were only looking for one pair, then two lists of indexes for those two words would be sufficient.
The lists allow you to separate the word detection operation from the counting logic.
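To make the index-list idea concrete, here is a minimal sketch (PairCounter is an illustrative name; it assumes whitespace-delimited words and pairs each occurrence of the second keyword with at most one occurrence of the first):

using System;
using System.Collections.Generic;

static class PairCounter
{
    // Count occurrences of 'first' followed by 'second' with at most
    // maxBetween words between them.
    public static int Count(string text, string first, string second, int maxBetween)
    {
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        var firstIdx = new List<int>();
        var secondIdx = new List<int>();
        for (int i = 0; i < words.Length; i++)
        {
            if (words[i] == first) firstIdx.Add(i);
            else if (words[i] == second) secondIdx.Add(i);
        }

        int count = 0, j = 0;
        foreach (int i in firstIdx)
        {
            while (j < secondIdx.Count && secondIdx[j] <= i) j++;  // next 'second' after this 'first'
            if (j < secondIdx.Count && secondIdx[j] - i - 1 <= maxBetween)
            {
                count++;
                j++;  // consume it so pairs don't overlap
            }
        }
        return count;
    }
}

On the example above, Count("this is first string this is second string", "this", "string", 4) returns 2.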

Flood control: check how similar the previous message is to the latest message, in %

I am working on flood control for a chat system. One of the ideas was to check how similar a member's previous message is to the newest message he has sent within X minutes.
So if the member's latest message was sent within 5 minutes of his previous message, it will check how similar the previous message is to the latest one; if that hits 80% or more, he would not be able to talk for a while.
Problem is, I don't know what this sort of algorithm would look like, and I am not sure it would be an efficient approach either...
Let's go to the facts, user sends:
[00:00:01] MemberX: Hi everyone !
[00:00:02] MemberX: Hi everyone ! MUAH
[00:00:03] MemberX: Hi everyone ! 1
So in the above context the user would have his talking access removed for X minutes.
I guess I could checksum the messages, which would work for sequential messages like those, where text is added at the end.
How would I calculate the percentage of match?
Out of the byte length of the past message against the byte length of the latest message that matched?
Example:
past message 10 bytes
latest message 14 bytes
checksum matched up to 9 bytes: (9/10)*100 = 90%
Now let's go a little harder:
[00:00:01] MemberX: Hi hey everyone !
[00:00:02] MemberX: Hi everyone ! MUAH
[00:00:03] MemberX: Hi 123 everyone !
In this second case the checksum would fail and would not be usable at all, I believe.
Is there a good algorithm to catch flooding like that? I don't want to catch 100% of it, but at least a small percentage, to keep the room cleaner.
The first part would work for a lot of abusers, but some of the smarter folks would think of the second trick, and there are probably plenty of other ways too; this is just an initial idea of things I could implement.
I don't want to restrain all users from talking with a flood time limit, as most of them do type fast. I just want to catch people sending repeated text over and over within a small window of time.
So my question is what would be a good algorithm to overcome this sort of flood?
Many IRC servers use a "Leaky Bucket" approach to throttle users to a constant rate. They keep track of the delta-time between the user's last messages sent and use that to calculate a "rate". This is often implemented as a per-user queue of messages to be sent. If the user goes above the rate they are throttled, unless they exceed the rate by a given amount at which point they are banned.
Another common approach on IRC is to simply keep track of the last N messages, and if some threshold of repeatability (i.e. the same message over and over again) is exceeded to kick/ban the user.
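A minimal sketch of such a leaky bucket, with illustrative names and numbers (a real server would keep one bucket per user):

using System;

class LeakyBucket
{
    private readonly double capacity;     // maximum bucket level
    private readonly double drainPerSec;  // how fast the bucket empties
    private double level;
    private DateTime lastUpdate = DateTime.UtcNow;

    public LeakyBucket(double capacity, double drainPerSec)
    {
        this.capacity = capacity;
        this.drainPerSec = drainPerSec;
    }

    // Each message adds "water"; returns false when the sender should be throttled.
    public bool TryMessage(double cost = 1.0)
    {
        DateTime now = DateTime.UtcNow;
        level = Math.Max(0.0, level - (now - lastUpdate).TotalSeconds * drainPerSec);
        lastUpdate = now;

        if (level + cost > capacity)
            return false;  // over the rate: throttle (or ban past a larger threshold)
        level += cost;
        return true;
    }
}

With, say, capacity 5 and a drain of 1 per second, short bursts get through but sustained flooding is throttled.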
I would probably look into http://en.wikipedia.org/wiki/Levenshtein_distance and then combine the scores for all the words in the received string vs. the older one.
Only thing that immediately comes to mind.
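A minimal sketch of that, using the textbook dynamic-programming distance and scaling the percentage by the longer string's length (that denominator is my assumption; the question leaves it open):

using System;

static class Similarity
{
    // Classic O(m*n) Levenshtein edit distance.
    public static int Levenshtein(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,   // deletion
                                            d[i, j - 1] + 1),  // insertion
                                   d[i - 1, j - 1] + cost);    // substitution
            }
        return d[a.Length, b.Length];
    }

    // 100 means identical, 0 means nothing in common.
    public static double Percent(string a, string b)
    {
        int max = Math.Max(a.Length, b.Length);
        return max == 0 ? 100.0 : (1.0 - (double)Levenshtein(a, b) / max) * 100.0;
    }
}

You would then mute the sender when Percent(previousMessage, latestMessage) crosses your threshold; the exact cutoff (80% or otherwise) needs tuning against real chat logs.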

Finding string segments in a string

I have a list of segments (15,000+), and I want to find the occurrences of those segments in a given string. A segment can be a single word or multiword; I cannot assume space is a delimiter in the string.
e.g.
String "How can I download codec from internet for facebook, Professional programmer support"
[the string above may not make any sense, but I am using it for illustration purposes]
segment list
Microsoft word
Microsoft excel
Professional Programmer.
Google
Facebook
Download codec from internet.
Output:
Download codec from internet
facebook
Professional programmer
Basically I am trying to do query reduction.
I want to achieve this in less than O(list length + string length) time.
As my list has more than 15,000 segments, it would be time-consuming to search for the entire list in the string.
The segments are prepared manually and placed in a txt file.
Regards
~Paul
You basically want a string-search algorithm like Aho-Corasick string matching. It constructs a state machine for processing bodies of text to detect matches, effectively searching for all patterns at the same time. Its runtime is on the order of the length of the text plus the total length of the patterns.
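A minimal hand-rolled sketch of Aho-Corasick follows (illustrative names, lowercased matching so "Facebook" also hits "facebook"; for production you would reach for a vetted implementation):

using System;
using System.Collections.Generic;

class AhoCorasick
{
    private class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Output = new List<string>();
    }

    private readonly Node root = new Node();

    public void Add(string pattern)
    {
        Node node = root;
        foreach (char c in pattern.ToLowerInvariant())
        {
            if (!node.Next.TryGetValue(c, out Node child))
                node.Next[c] = child = new Node();
            node = child;
        }
        node.Output.Add(pattern);
    }

    // Breadth-first pass that wires up the failure links.
    public void Build()
    {
        var queue = new Queue<Node>();
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (var kv in node.Next)
            {
                Node fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(kv.Key))
                    fail = fail.Fail;
                kv.Value.Fail = fail == null ? root : fail.Next[kv.Key];
                kv.Value.Output.AddRange(kv.Value.Fail.Output);
                queue.Enqueue(kv.Value);
            }
        }
    }

    // Single pass over the text, yielding every segment that occurs.
    public IEnumerable<string> Search(string text)
    {
        Node node = root;
        foreach (char c in text.ToLowerInvariant())
        {
            while (node != root && !node.Next.ContainsKey(c))
                node = node.Fail;
            if (node.Next.TryGetValue(c, out Node next))
                node = next;
            foreach (string hit in node.Output)
                yield return hit;
        }
    }
}

Add all 15,000 segments once, call Build once, and every incoming string then costs a single Search pass proportional to its length plus the number of matches.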
In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index:
http://en.wikipedia.org/wiki/Key_Word_in_Context
http://www.cs.duke.edu/~ola/ipc/kwic.html
What you're basically asking how to do is write a custom lexer/parser.
Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).
Take a look at this question:
Poor man's lexer for C#
Now of course, a lot of people are going to say "just use regular expressions". Perhaps. The deal with using regexes in this situation is that your execution time will grow linearly as a function of the number of tokens you are matching against. So, if you end up needing to "segment" more phrases, your execution time will get longer and longer.
What you need to do is make a single pass, pushing words onto a stack and checking whether they form a valid token after adding each one. If they don't, then you need to continue (disregarding the token like a compiler disregards comments).
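As a rough sketch of that single pass (a sliding window of recent words stands in for the stack here, and a plain HashSet stands in for a real lexer's token table; all names are illustrative):

using System;
using System.Collections.Generic;

static class SegmentScanner
{
    // After each word, join the last 1..maxSegmentWords words and test the
    // phrase against the segment set.
    public static List<string> Matches(string text, HashSet<string> segments, int maxSegmentWords)
    {
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var found = new List<string>();

        for (int end = 0; end < words.Length; end++)
            for (int len = 1; len <= maxSegmentWords && len <= end + 1; len++)
            {
                string phrase = string.Join(" ", words, end - len + 1, len);
                if (segments.Contains(phrase))
                    found.Add(phrase);
            }
        return found;
    }
}

To reproduce the question's output you would also need to build the set with StringComparer.OrdinalIgnoreCase and strip trailing punctuation from both the segments and the words.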
Hope this helps.
