So I am using an API that returns usernames to me. Among them are some fake ones like:
8spZKYf1t2
xOzJzaYJe2
0x5jD4xmTM
PJFBoDFJsW
UZV908nNF7
CRuMGgh1bM
lyhDRamtFf
wELYyunHZU
NC8ZbYCjig
plK2KtwQwE
EKRlRLRitP
0CULcA8lIR
Yyi2NV3P8n
Anybody know a good algorithm to ignore these?
You will need a database of usernames in order to learn the difference between real ones and fake ones. Call this the "training set":
From the training set, calculate the number of occurrences of each 3-letter combination. From "mtimmerm", for example, you would add counts for "mti", "tim", "imm", etc. Let N(x) be the number of counts for x in the training set, and let TOTAL be the total number of counts. Let F(x) = (N(x)+1)/(TOTAL+1). This will be our estimate for the frequency at which that 3-letter combination occurs in usernames.
Given a candidate username U, for each 3-letter combination x in U, calculate H(x) = -log(F(x)). Add all these together and divide by length(U)-2 (the number of combinations) to get H(U). This is a measure of how 'unrealistic' U is.
Calculate H(U) for a bunch of usernames, and you should find that it is much higher for fake ones.
If you want to learn the theory behind how this works, the google word is "Entropy": https://en.wikipedia.org/wiki/Entropy_(information_theory)
What we are doing is making a statistical model for usernames, and then calculating how 'unusual' each username is according to that model. This is actually a measure of how many bits it would take to store the username if we used our model to compress it (sort of -- I simplified the calculations but they should be accurate enough relative to one another). Randomly generated usernames (which we are assuming are fake) will take more information to store than real ones.
NOTE: it's nice if the training set doesn't contain any fake usernames, but it won't make too much difference as long as most of them are real. Also note that it's not quite right to test names from the training set. If you're going to test names from the training set, then subtract 1/(TOTAL+1) from each F(x) so the username's own counts aren't included when testing it.
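A minimal sketch of this in C# (the class and method names here are mine, just for illustration):

using System;
using System.Collections.Generic;
using System.Linq;

static class UsernameEntropy
{
    // Enumerate every 3-letter combination in a username.
    static IEnumerable<string> Trigrams(string s)
    {
        for (int i = 0; i + 3 <= s.Length; i++)
            yield return s.Substring(i, 3);
    }

    // N(x) for every trigram x in the training set, plus TOTAL.
    public static (Dictionary<string, int> counts, int total) BuildCounts(IEnumerable<string> trainingSet)
    {
        var counts = new Dictionary<string, int>();
        int total = 0;
        foreach (var name in trainingSet)
            foreach (var tri in Trigrams(name))
            {
                counts[tri] = counts.TryGetValue(tri, out var c) ? c + 1 : 1;
                total++;
            }
        return (counts, total);
    }

    // H(U): the average of -log(F(x)) over the trigrams of U.
    public static double Score(string username, Dictionary<string, int> counts, int total)
    {
        var tris = Trigrams(username).ToList();
        if (tris.Count == 0) return 0;
        double sum = 0;
        foreach (var tri in tris)
        {
            counts.TryGetValue(tri, out var n);        // n = N(x), 0 if unseen
            double f = (n + 1.0) / (total + 1.0);      // F(x) = (N(x)+1)/(TOTAL+1)
            sum += -Math.Log(f);
        }
        return sum / tris.Count;                       // divide by length(U)-2
    }
}

Compute Score for each candidate and flag the ones that come out well above the typical value for names in your training data.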
Related
I'm looking for help on a problem which I don't know how to deal with. I'm guessing similar questions have already been asked, but I couldn't google it the right way.
What I'm trying to do is write a randomizer for the license boards in FFXII in C#, and there's a part of the problem I don't know how to solve: the randomization itself.
I'm simplifying a bit here: there are 12 boards containing licenses that you can unlock to equip stuff or use magic. Board spots may be empty and no single license may appear twice on one board, but licenses can occur several times if they are on different boards. Each board also has a different number of licenses. There's a total of 1626 licenses on the boards, with the number of unique licenses being around 350. I have a list of all licenses, along with the number of times they occur in the original board setup. (The one you get if you play the game normally.)
I would like help with generating 12 random license lists of predetermined size, without duplicates, from the multiset of license occurrences in the original game. What I'm specifically worried about is that the algorithm might get stuck in a state where there are more duplicate elements than there are sets with room for those elements. The total size of the 12 lists is equal to the number of elements in the multiset, of course. (I'll place them on the board myself, that is not too difficult.)
I'm using a fitness tracker with a watchface store - some of these watchfaces print the current time using elements other than simple text.
A few months ago I tried to beat a challenge in "Coding game" where you have to print dynamic text in ASCII art - I didn't manage to do that :(
I'm assuming that the basic technique should be the same, but I can't figure out how to do it. Can anyone tell me how to print e.g. the current time using dots / squares?
My last approach was to define a two-dimensional array for each possible character, defining bit by bit which spots should be active or inactive. But as this is pretty hard work I discarded that idea really quickly.
Thanks for your help.
Actually, you were on the right track. That's one way to do it - a 2D array for each character with the bitmask. Or better yet - a full ASCII picture. Takes some work upfront, but not that much - you can probably do it in a day or two. You can also simplify the typing a bit. Like this:
var font = new Dictionary<char, string[]>
{
    {'A', new string[] {
        " AAAA ",
        "AA  AA",
        "AAAAAA",
        "AA  AA",
        "AA  AA"
    }},
    // etc. for all characters in your "ASCII font"
};
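Once the dictionary is filled in, rendering a string is just a matter of printing it row by row. A minimal sketch, assuming a 5-row font stored in a font dictionary like the one above:

using System;
using System.Collections.Generic;
using System.Text;

static class AsciiClock
{
    public static void Render(string text, Dictionary<char, string[]> font, int rows = 5)
    {
        for (int row = 0; row < rows; row++)
        {
            var line = new StringBuilder();
            foreach (char c in text)
                line.Append(font[c][row]).Append(' ');   // one row of each glyph, plus spacing
            Console.WriteLine(line.ToString());
        }
    }
}

// e.g. AsciiClock.Render(DateTime.Now.ToString("HH:mm"), font);  // needs glyphs for the digits and ':'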
Another way would be to define some sort of "vector format" for your characters, scale it to be as big as you need, and then convert to ASCII with some sort of algorithm that uses the characters -|/\. for outlines. That's much harder, and you still need to describe every character individually. But it looks prettier.
Or take a look at this answer: https://stackoverflow.com/a/7239064/41360 - that actually uses a bitmask. Yes, it probably took someone a few days to get it right. Harder than the string approach above, but also more compact if you're short on space (like in some embedded hardware).
And then there's Figlet - that piece of software has gone all the way and defined its own font format, not to mention some sort of advanced text rendering engine which actually merges the characters together. Crazy shit.
But no matter what approach you use, someone somewhere will need to design each letter individually and store the designs in some kind of format (thus producing a "font"). Most likely it will be you. There's no way around it.
Programmers might be lazy and use tricks to make their lives easier - but sometimes hard work is simply unavoidable. When that comes, you just do it. At the end of the day, it's the result that counts, not how it was obtained (well, as long as it's ethical/legal of course).
I'd like an algorithm to put people into groups for an upcoming conference. There are lots of people going, from different regions, departments, genders etc., and they want to split people up as much as possible to get diversity in each group.
So is there either a well-known algorithm, or even a tool in (say) Excel or something, to solve this problem, which must be very common?
To simplify the problem, say there are
n people (say 100)
To be split into g groups (say 6), with as close to an even number in each group as possible.
They have regions: London, North, Midlands, West, Scotland (mostly London)
Gender: Female, Male, Other
Departments: Sales, Support, Management
Grade: 6 different grades
Additional info
There are differing proportions of people in each category, i.e. more sales than management.
There probably is a priority ordering: they want an even gender split more than an even department split.
I work in C# but happy to read in anything.
Thanks!
Ben
This is not a trivial problem by any means, and it is hard, if not impossible, to solve with an exact algorithm. I don't know of an academic analogue, but this is a perfect use case for stochastic/probabilistic optimization.
You need a fitness function that can convey how diverse the current assignment is with a single number, e.g. something simple and intuitive like:
sum
for each group
for each trait
trait_weight * abs(%_occurrence_in_group - %_occurrence_in_population)
(in the above case, lower is better)
Choose a method like simulated annealing or a genetic algorithm, and search for an extremum.
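A minimal C# sketch of such a fitness function (the Person and Fitness names are hypothetical, just for illustration; each trait is stored as a string value per person):

using System;
using System.Collections.Generic;
using System.Linq;

class Person
{
    // e.g. Traits["Region"] = "London", Traits["Gender"] = "Female", ...
    public Dictionary<string, string> Traits = new Dictionary<string, string>();
}

static class Diversity
{
    // Lower is better: sums trait_weight * |%_occurrence_in_group - %_occurrence_in_population|
    // over every group and every trait value.
    public static double Fitness(
        List<List<Person>> groups,
        List<Person> population,
        Dictionary<string, double> traitWeights)
    {
        double score = 0;
        foreach (var tw in traitWeights)
        {
            // % occurrence of each value of this trait in the whole population.
            var popShare = population
                .GroupBy(p => p.Traits[tw.Key])
                .ToDictionary(g => g.Key, g => (double)g.Count() / population.Count);

            foreach (var group in groups)
                foreach (var vs in popShare)
                {
                    double inGroup = group.Count(p => p.Traits[tw.Key] == vs.Key) / (double)group.Count;
                    score += tw.Value * Math.Abs(inGroup - vs.Value);
                }
        }
        return score;
    }
}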
Let's first define a utility function. We want one that's accurate but quick to calculate, so how about measuring how close the proportion of each category within a group is to the actual proportion of that category in total.
So if a group of 8 has 5 males, 3 females, 4 salespeople and 4 support staff, but there is an equal split of males and females in total, and 2/3 of all people are in sales with the other 1/3 in support, the utility function will be
-(|5/8 - 1/2| + |3/8 - 1/2| + |4/8 - 2/3| + |4/8 - 1/3|)
The absolute values stop deviations in opposite directions from cancelling out, and the minus in front is there so that the utility function increases with diversity.
Once you've defined a utility function, there's a lot of ways to go about it, including simulated annealing for example. However for your purposes I recommend hill climbing with random restart, as I think it will be sufficient.
Randomly assign people to different groups, then calculate the utility function. Randomly select one person from one group and another person from a different group, and if the utility would be higher after swapping them, do so. Continue swapping for a number of rounds (e.g. 200), then record the assignment and the utility function. Restart from a new random assignment, and repeat the whole process a few more times. Pick the assignment with the highest utility.
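A rough C# sketch of that hill climbing with random restarts (the Assign helper and its parameters are mine; plug in any utility function where higher is better, e.g. the negative of a deviation sum like the one described above):

using System;
using System.Collections.Generic;
using System.Linq;

static class GroupAssigner
{
    // groupSizes fixes how many people go into each group and is assumed to sum to people.Count;
    // utility scores an assignment (higher = better).
    public static List<List<T>> Assign<T>(
        List<T> people,
        int[] groupSizes,
        Func<List<List<T>>, double> utility,
        int restarts = 10,
        int rounds = 200)
    {
        var rng = new Random();
        List<List<T>> best = null;
        double bestScore = double.NegativeInfinity;

        for (int r = 0; r < restarts; r++)
        {
            // Random initial assignment.
            var shuffled = people.OrderBy(_ => rng.Next()).ToList();
            var groups = new List<List<T>>();
            int offset = 0;
            foreach (int size in groupSizes)
            {
                groups.Add(shuffled.Skip(offset).Take(size).ToList());
                offset += size;
            }

            double score = utility(groups);
            for (int round = 0; round < rounds; round++)
            {
                // Pick one person from each of two different groups and try swapping them.
                int g1 = rng.Next(groups.Count), g2 = rng.Next(groups.Count);
                if (g1 == g2) continue;
                int i1 = rng.Next(groups[g1].Count), i2 = rng.Next(groups[g2].Count);

                (groups[g1][i1], groups[g2][i2]) = (groups[g2][i2], groups[g1][i1]);
                double newScore = utility(groups);
                if (newScore > score)
                    score = newScore;                                                     // keep the swap
                else
                    (groups[g1][i1], groups[g2][i2]) = (groups[g2][i2], groups[g1][i1]);  // undo it
            }

            if (score > bestScore)
            {
                bestScore = score;
                best = groups.Select(g => g.ToList()).ToList();
            }
        }
        return best;
    }
}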
If that's not clear, ask me to clarify.
So I have been mainly using lists to retrieve small amounts of data from a database which feeds into a web application, but I have recently come across dictionaries, which produce more readable code with keys. What is the performance difference when just referring to items by index/key?
I understand that a dictionary uses more memory, but what is best practice in this scenario, and is it worth the performance/maintenance trade-off, bearing in mind that I will not be performing searches or sorting the data?
When you want to find one item in a list, you may have to look at ALL the items until you find the one with the matching key.
Let's look at a basic example. You have
class Person
{
    public int ID { get; set; }
    public string Name { get; set; }
}
and you have a collection List<Person> persons and you want to find a person by ID:
var person = persons.FirstOrDefault(x => x.ID == 5);
As written it has to enumerate the entire List until it finds the entry in the List that has the correct ID (does entry 0 match the lambda? No... Does entry 1 match the lambda? No... etc etc). This is O(n).
However, if you look the person up through a Dictionary<int, Person> dictPersons:
var person = dictPersons[5];
If you want to find a certain element by key in a dictionary, it can jump straight to where it is stored - this is O(1) per lookup, O(n) for doing it for every person. (If you want to know how this is done: Dictionary runs a mathematical operation on the key, which turns it into a position inside the dictionary, the same position it used when the item was inserted. This is called a hash function.)
So, Dictionary is faster than List because Dictionary does not iterate through the whole collection; instead it takes the item from the exact place the hash function calculates. It is a better algorithm.
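For completeness, here is a small sketch (reusing the Person class above) showing the dictionary being built once from the list and then queried by key:

using System.Collections.Generic;
using System.Linq;

var persons = new List<Person>
{
    new Person { ID = 5, Name = "Alice" },
    new Person { ID = 7, Name = "Bob" }
};

// Build the dictionary once from the list: O(n).
var dictPersons = persons.ToDictionary(p => p.ID);

// Every lookup after that is O(1) on average.
var person = dictPersons[5];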
Dictionary relies on chaining (maintaining a list of items for each hash table bucket) to resolve collisions, whereas Hashtable uses rehashing for collision resolution (when a collision occurs, it tries another hash function to map the key to a bucket). You can read up on how hash functions work and on the difference between chaining and rehashing.
Unless you're actually experiencing performance issues and need to optimize it's better to go with what's more readable and maintainable. That's especially true since you mentioned that it's small amounts of data. Without exaggerating - it's possible that over the life of the application the cumulative difference in performance (if any) won't equal the time you save by making your code more readable.
To put it in perspective, consider the work that your application already does just to read request headers and parse views and read values from configuration files. Not only will the difference in performance between the list and the dictionary be small, it will also be a tiny fraction of the overall processing your application does just to serve a single page request.
And even then, if you were to see performance issues and needed to optimize, there would probably be plenty of other optimizations (like caching) that would make a bigger difference.
Hey, I've been working on something from time to time and it has become relatively large now (and slow). However, I managed to pinpoint the bottleneck after closely measuring performance as a function of time.
Say I want to "permute" the string "ABC". What I mean by "permute" is not quite a permutation but rather a set of contiguous substrings following this pattern:
A
AB
ABC
B
BC
C
I have to check, for every such substring, whether it is contained within another string str2, so I've done a quick'n'dirty literal implementation as follows:
for (int i = 0; i < strlen1; i++)
{
    for (int j = 1; j <= strlen1 - i; j++)
    {
        sub = str1.Substring(i, j);
        if (str2.Contains(sub)) { /* do stuff */ }
        else break;
    }
}
This was very slow initially, but then I realised that if the first part doesn't exist there is no need to check the subsequent ones: if sub isn't contained within str2, I can break out of the inner loop.
OK, this gave blazingly fast results, but when calculating my algorithm's complexity I realised that in the worst case this will be N^4? I had forgotten that str.Contains() and str.Substring() both have their own complexities (N or N^2, I forget which).
The fact that I make a huge number of calls to those inside a second for loop makes it perform rather... well, N^4 says enough.
However, I also calculated the average run-time mathematically, using probability theory to evaluate the probability of the substring growing in a pool of randomly generated strings (this was my baseline), measuring when the probability became > 0.5 (50%).
This showed a (roughly) exponential relationship between the number of different characters and the string length, which means that in the scenarios where I use my algorithm the length of str1 will most probably never exceed 7.
Thus the average complexity would be ~O(N * M), where N is the length of str1 and M is the length of str2. Since I've tested N as a function of a constant M, I've gotten linear growth ~O(N) (not bad compared to the N^4, eh?).
I did time testing and plotted a graph which showed nearly perfect linear growth so I got my actual results matching my mathematical predictions (yay!)
However, this was NOT taking into account the cost of string.Contains() and string.Substring(), which made me wonder if this could be optimized even further.
I've also been thinking of doing this in C++ because I need rather low-level stuff. What do you guys think? I've put a lot of time into analysing this, hope I've explained everything clearly enough :)!
Your question is tagged both C++ and C#.
In C++ the optimal solution will be to use iterators and std::search. The original strings remain unmodified, and no intermediate objects get created. There won't be an equivalent of your Substring() taking place at all, so this eliminates that part of the overhead.
This should achieve the theoretically-best performance: brute force search, testing all permutations, with no intermediate object construction or destruction, other than the iterators themselves, which simply replace your two int index variables. I can't think of any faster way of implementing this basic algorithm.
Are you testing one string against one string? If you test a bunch of strings against another bunch of strings, it is a whole different story. Even if you have the best algorithm, O(X), for comparing one string against another, that does not mean that repeating it M*N times gives you the best algorithm for processing M strings against N strings.
When I made something similar, I built a dictionary of all substrings of all N strings:
Dictionary<string, List<int>>
The string key is a substring and each int is the index of a string that contains that substring. Then I tested all substrings of all M strings against it. The speed was suddenly not O(M*N*X) but O(max(M,N)*S), where S is the number of substrings of one string. Depending on M, N, X and S, that may be faster. I'm not saying the dictionary of substrings is the best approach; I just want to point out that you should always try to see the whole picture.
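A minimal sketch of building such an index in C# (the SubstringIndex/Build names are mine, just for illustration):

using System.Collections.Generic;

static class SubstringIndex
{
    // Maps every substring of every string in `sources` to the indexes of the strings that contain it.
    public static Dictionary<string, List<int>> Build(IList<string> sources)
    {
        var index = new Dictionary<string, List<int>>();
        for (int k = 0; k < sources.Count; k++)
        {
            string s = sources[k];
            for (int i = 0; i < s.Length; i++)
                for (int len = 1; len <= s.Length - i; len++)
                {
                    string sub = s.Substring(i, len);
                    if (!index.TryGetValue(sub, out var list))
                        index[sub] = list = new List<int>();
                    if (list.Count == 0 || list[list.Count - 1] != k)   // don't record the same source twice
                        list.Add(k);
                }
        }
        return index;
    }
}

// Usage: var index = SubstringIndex.Build(targets);
//        if (index.TryGetValue("AB", out var hits)) { /* hits = indexes of strings containing "AB" */ }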