How to access the reference values of a HashSet<TValue> without enumeration? - c#

I have a scenario in which memory conservation is paramount. I am trying to read more than 1 GB of peptide sequences into memory and group together the peptide instances that share the same sequence. I am storing the Peptide objects in a HashSet so I can quickly check for duplicates, but I found out that you cannot access the objects in the set, even when you know the set contains that object.
Memory is really important and I don't want to duplicate data if at all possible. (Otherwise I would have designed my data structure as peptides = Dictionary<string, Peptide>, but that would duplicate the string in both the dictionary and the Peptide class.) Below is the code to show you what I would like to accomplish:
public class SomeClass {
// Main Storage of all the Peptide instances, class provided below
private HashSet<Peptide> peptides = new HashSet<Peptide>();
public void SomeMethod(IEnumerable<string> files) {
foreach(string file in files) {
using(PeptideReader reader = new PeptideReader(file)) {
foreach(DataLine line in reader.ReadNextLine()) {
Peptide testPep = new Peptide(line.Sequence);
if(peptides.Contains(testPep)) {
// ** Problem Is Here **
// I want to get the Peptide object that is in HashSet
// so I can add the DataLine to it, I don't want use the
// testPep object (even though they are considered "equal")
peptides[testPep].Add(line); // I know this doesn't work
testPep.Add(line); // THIS IS NO GOOD, since it won't be saved in the HashSet which I use in other methods.
} else {
// The HashSet doesn't contain this peptide, so we can just add it
testPep.Add(line);
peptides.Add(testPep);
}
}
}
}
}
}
public class Peptide : IEquatable<Peptide> {
public string Sequence {get;private set;}
private int hCode = 0;
public PsmList PSMs {get;set;}
public Peptide(string sequence) {
Sequence = sequence.Replace('I', 'L');
hCode = Sequence.GetHashCode();
}
public void Add(DataLine data) {
if(PSMs == null) {
PSMs = new PsmList();
}
PSMs.Add(data);
}
public override int GetHashCode() {
return hCode;
}
public bool Equals(Peptide other) {
return Sequence.Equals(other.Sequence);
}
}
public class PsmList : List<DataLine> { /* and some other stuff that is not important */ }
Why does HashSet not let me get the object reference that is contained in the HashSet? I know people will try to say that if HashSet.Contains() returns true, your objects are equivalent. They may be equivalent in terms of values, but I need the references to be the same since I am storing additional information in the Peptide class.
The only solution I came up with is Dictionary<Peptide, Peptide> in which both the key and value point to the same reference. But this seems tacky. Is there another data structure to accomplish this?

Basically you could reimplement HashSet<T> yourself, but that's about the only solution I'm aware of. The Dictionary<Peptide, Peptide> or Dictionary<string, Peptide> solution is probably not that inefficient though - if you're only wasting a single reference per entry, I would imagine that would be relatively insignificant.
In fact, if you remove the hCode member from Peptide, that will save you 4 bytes per object, which is the same size as a reference in x86 anyway... there's no point in caching the hash as far as I can tell, as you'll only compute the hash of each object once, at least in the code you've shown.
If you're really desperate for memory, I suspect you could store the sequence considerably more efficiently than as a string. If you give us more information about what the sequence contains, we may be able to make some suggestions there.
I don't know that there's any particularly strong reason why HashSet doesn't permit this, other than that it's a relatively rare requirement - but it's something I've seen requested in Java as well...

Use a Dictionary<string, Peptide>.
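The dictionary approach does not have to duplicate the sequence text. A string is a reference type, so if the key is the same string instance that the Peptide already holds, the dictionary only costs one extra reference per entry. A minimal sketch, assuming simplified stand-ins for the question's Peptide and DataLine types (PeptideIndex and its members are hypothetical names, not part of the original code):

```csharp
using System.Collections.Generic;

// Simplified stand-ins for the question's types (assumptions for illustration).
public class DataLine
{
    public string Sequence { get; }
    public DataLine(string sequence) { Sequence = sequence; }
}

public class Peptide
{
    public string Sequence { get; }
    public List<DataLine> PSMs { get; } = new List<DataLine>();
    public Peptide(string sequence) { Sequence = sequence; }
}

// Hypothetical wrapper around Dictionary<string, Peptide>.
public class PeptideIndex
{
    private readonly Dictionary<string, Peptide> peptides = new Dictionary<string, Peptide>();

    public int Count => peptides.Count;

    // Adds the line to the stored Peptide for its sequence, creating one if needed.
    public Peptide Add(DataLine line)
    {
        // Same normalisation as the question's Peptide constructor.
        string key = line.Sequence.Replace('I', 'L');
        if (!peptides.TryGetValue(key, out Peptide peptide))
        {
            peptide = new Peptide(key);
            // The key is the very string instance stored in the Peptide,
            // so the character data exists only once.
            peptides.Add(peptide.Sequence, peptide);
        }
        peptide.PSMs.Add(line);
        return peptide;
    }
}
```

On .NET Framework 4.7.2 and later, and on .NET Core 2.0 and later, HashSet<T> also offers TryGetValue(equalValue, out actualValue), which returns the stored instance directly and makes the original HashSet<Peptide> design workable.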

Is there an accepted pattern to preserve variables' values during a function call which modifies those variables?

Within a class I have a property used by a method which I want to remain in the same state after a call to a second method (which might alter that state).
Example: for a property Value I could do something like this:
void MethodOne()
{
...
var tempValue = this.Value;
MethodTwo(); // might modify this.Value
this.Value = tempValue;
...
}
For a single property this isn't a big deal. If I have multiple properties it gets uglier.
I'm looking for a C# solution but would be interested to know if this kind of construct appears in any common language. The sort of syntax I'm after might look something like this:
void MethodOne()
{
...
preserving(this.Value)
{
MethodTwo(); // might modify this.Value
}
...
}
where the preserving keyword could potentially accept multiple properties/fields.
In my specific case it's a recursive method, so the code looks more like:
void MethodOne(object[] args)
{
...
// Do something which might modify this.Value
preserving(this.Value)
{
MethodOne(args);
}
...
}
Is there an accepted pattern / best practice to achieve this?
EDIT
The specific case for which I'm asking is something like this:
For the purposes of sorting lists I have a custom comparison class which implements IComparer. Its Compare method acts on objects which appear in collections (which may therefore be sorted). These collections might be nested, so sorting such a collection might result in the sort function, and therefore Compare(), being called recursively.
The actual comparison function is partially dynamic, which means that it could be set at runtime to something invalid (e.g. non-transitive or non-deterministic). I can't prevent this, so I want to set a limit on the number of comparisons (let's say n-squared, where n is the length of the list being sorted) to protect against cases where an invalid comparison function might result in the sorting algorithm going into an infinite loop.
The Compare method might be called from (e.g.) various LINQ methods such as OrderBy, possibly resulting in lazily evaluated sorts and possibly from code over which I have no control. However, I need to count the number of comparisons in each sort without any 'subsorts' of nested objects corrupting the count (but also counting comparisons in those subsorts).
My code looks something like this:
public int Compare(T x, T y)
{
// this.MaxComparisons is set from outside this code, since this method does not know the length of the list it is sorting.
if (++this.ComparisonCount > this.MaxComparisons)
{
// Error: too many comparisons
}
if (predicate)
{
// Preserve...
tempComparisonCount = this.ComparisonCount;
tempMaxComparisons = this.MaxComparisons;
// ...reset...
this.ComparisonCount = 0;
this.MaxComparisons = ... ; // set as required
var result = this.customComparer.Compare(x.Child, y.Child); // might involve further calls to the above method, which should be counted separately
// ...and restore
this.ComparisonCount = tempComparisonCount;
this.MaxComparisons = tempMaxComparisons;
return result;
}
else
{
return otherComparer.Compare(x, y);
}
}
I hope this makes it clearer why I have asked the question.
private static void Preserving<T>(ref T value, Action act)
{
T old = value;
act();
value = old;
}
then you can do:
Preserving(ref this.Value, MethodTwo);
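Note that ref only accepts a field or a local variable, so this works if Value is a field; a property cannot be passed by ref. If you have multiple variables you want to save and restore, you should probably create a Context class containing the state you want to save, and then push and pop instances of it on a stack.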
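The Context-and-stack idea could look roughly like the following. This is only a sketch: CountContext and CountingState are hypothetical names, and the two fields simply mirror the ComparisonCount and MaxComparisons members from the question. The try/finally is an addition so that the previous state is restored even when the nested call throws.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical bundle of the state that must be preserved.
public class CountContext
{
    public int ComparisonCount;
    public int MaxComparisons;
}

public class CountingState
{
    private readonly Stack<CountContext> saved = new Stack<CountContext>();

    public CountContext Current { get; private set; } = new CountContext();

    // Pushes the current context, installs a fresh one, runs the action,
    // and restores the previous context even if the action throws.
    public void RunInNewContext(int maxComparisons, Action action)
    {
        saved.Push(Current);
        Current = new CountContext { MaxComparisons = maxComparisons };
        try
        {
            action();
        }
        finally
        {
            Current = saved.Pop();
        }
    }
}
```

Because each nested call pushes its own context, recursion needs no temporary variables at the call site, and adding another preserved value only means adding a field to the context class.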

c# Best type/collection/list/dataset to handle super large data (csv/tab files)

I am building a WPF (MVVM) app that handles really large CSV files. We are talking about 1 GB to 10 GB.
I open the file and parse it with File.ReadLines into a List of following class:
public class FileLine
{
public DateTime Time { get; set; }
public string Message { get; set; } //Usually around 256 characters
public string Info1 { get; set; } //Exact 56 characters
public string Info2 { get; set; } //Exact 4 characters
//and so on
}
... then I do all sort of data manipulation, queries, charts... you name it... everything using Linq.
We are testing a 1.8GB file and when it is opened, the process takes around 2GB of memory.
Eventually, when my customer needs to open his 10GB file it will be impossible, because it is going to take 12GB+ of Memory.
What is the best type/collection/list/dataset for this kind of work?
When I've had to do something like this before, I handled it with a container object that held a list of dictionaries. At the time I thought the limit would be 2^32 elements, but an exception for exceeding the collection size was thrown well before reaching 2^32 elements, while many GB of RAM were still free. Say you want a Dictionary: something like the following should work until you really do exhaust all physical and virtual memory. (The server I worked on a few years ago had 512 GB of RAM, and I'm sure there are bigger ones now, but that's a separate story.)
public class MyHugeDictionary<TKey, TValue>
{
    private readonly List<Dictionary<TKey, TValue>> allDict;
    private Dictionary<TKey, TValue> currDictionary;

    public MyHugeDictionary()
    {
        allDict = new List<Dictionary<TKey, TValue>>();
        currDictionary = new Dictionary<TKey, TValue>();
        allDict.Add(currDictionary);
    }

    public bool ItemExists(TKey key)
    {
        foreach (Dictionary<TKey, TValue> dict in allDict)
        {
            if (dict.ContainsKey(key))
            {
                return true;
            }
        }
        return false;
    }

    public void Add(TKey a, TValue b)
    {
        if (ItemExists(a)) // find if the item is in any other dictionary first
        {
            // handle dups...
            return;
        }
        try
        {
            currDictionary.Add(a, b);
        }
        // The original left the exact exception type to be looked up;
        // OutOfMemoryException is an assumption here.
        catch (OutOfMemoryException)
        {
            // The current dictionary could not grow any further, so start a new one.
            // If this fails as well, it is game over for real and the exception propagates.
            currDictionary = new Dictionary<TKey, TValue>();
            allDict.Add(currDictionary);
            currDictionary.Add(a, b);
        }
    }
}
After some discussion the best thing is to read the file, process it, and dispose all the rest, sticking only with the result.
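A rough sketch of that read-process-discard approach follows. It assumes a simple delimited file with no quoted fields; LineStats, CountByColumn and the column index are hypothetical and only illustrate keeping the result rather than every parsed line.

```csharp
using System.Collections.Generic;
using System.IO;

public static class LineStats
{
    // Counts lines per value of one column, holding only the running totals in memory.
    // The delimiter and column index are assumptions about the file layout.
    public static Dictionary<string, int> CountByColumn(string path, int column, char delimiter = ',')
    {
        var counts = new Dictionary<string, int>();
        // File.ReadLines is lazy: a line can be collected as soon as it has been processed.
        foreach (string line in File.ReadLines(path))
        {
            string[] fields = line.Split(delimiter);
            if (column >= fields.Length)
            {
                continue; // skip malformed lines
            }
            counts.TryGetValue(fields[column], out int current); // current is 0 on a miss
            counts[fields[column]] = current + 1;
        }
        return counts;
    }
}
```

With this shape, memory use is driven by the number of distinct values in the result, not by the size of the file.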
Another possibility was to use database, but it would add too much complexity, although it is possible.
See this:
https://github.com/aumcode/nfx/tree/master/Source/NFX/ApplicationModel/Pile
https://www.infoq.com/articles/Big-Memory-Part-3
You can store whatever you want - no pauses.
The problem with large collections is:
a. They are not really designed to hold very many entries (i.e. Dictionary never shrinks back to zero size)
b. You get GC stalls/pauses when you have too many objects
See the links above: what we did is "hide" the data from the GC, as described in the article. This way you can store millions of objects using the LocalCache class as a dictionary.
For large-memory apps in .NET, remember to enable 64-bit and set the GC to server mode in your app config file.

Make a list readonly in c#

I have this example code. What I want to do is to make it so that the "Nums" value can only be written to using the "AddNum" method.
namespace ConsoleApplication1
{
public class Person
{
string myName = "N/A";
int myAge = 0;
List<int> _nums = new List<int>();
public List<int> Nums
{
get
{
return _nums;
}
}
public void AddNum(int NumToAdd)
{
_nums.Add(NumToAdd);
}
public string Name { get; set; }
public int Age { get; set; }
}
}
Somehow, I've tried a bunch of things regarding AsReadOnly() and the readonly keyword, but I can't seem to get it to do what I want it to do.
Here is the sample of the code I have to access the property.
Person p1 = new Person();
p1.Nums.Add(25); //access 1
p1.AddNum(37); //access 2
Console.WriteLine("press any key");
Console.ReadLine();
I really want "access 1" to fail, and "access 2" to be the ONLY way that the value can be set. Thanks in advance for the help.
√ DO use ReadOnlyCollection<T>, a subclass of ReadOnlyCollection<T>,
or in rare cases IEnumerable<T> for properties or return values
representing read-only collections.
The quote is from this article.
You should have something like this:
List<int> _nums = new List<int>();
public ReadOnlyCollection<int> Nums
{
get
{
return _nums.AsReadOnly();
}
}
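For illustration, here is a trimmed-down version of the question's Person class using this approach. It is a sketch rather than the only way to do it; caching the wrapper in a field is an optional variation, since AsReadOnly returns a wrapper around the list and not a copy.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

public class Person
{
    private readonly List<int> _nums = new List<int>();
    private readonly ReadOnlyCollection<int> _readOnlyNums;

    public Person()
    {
        // One wrapper is enough: it is a live view over _nums, not a snapshot.
        _readOnlyNums = _nums.AsReadOnly();
    }

    // Callers can read and enumerate, but the type exposes no public Add.
    public ReadOnlyCollection<int> Nums => _readOnlyNums;

    // The only way to add a number.
    public void AddNum(int numToAdd) => _nums.Add(numToAdd);
}
```

With this in place, p1.Nums.Add(25) no longer compiles. A caller who casts Nums to IList<int> and calls Add gets a NotSupportedException at run time, and numbers added through AddNum show up in a previously obtained Nums reference.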
In general, collection types make poor properties because even when a collection is wrapped in ReadOnlyCollection, it's inherently unclear what:
IEnumerable<int> nums = myPerson.Nums;
myPerson.AddNum(23);
foreach(int i in nums) // Should the 23 be included!?
...
is supposed to mean. Is the object returned from Nums a snapshot of the numbers that existed when it was called, or is it a live view?
A cleaner approach is to have a method called something like GetNumsAsArray which returns a new array each time it's called; it may also be helpful in some cases to have a GetNumsAsList variant depending upon what the caller will want to do with the numbers. Some methods only work with arrays, and some only work with lists, so if only one of the above is provided some callers will have to call it and then convert the returned object to the required type.
If performance-sensitive callers will be needing to use this code a lot, it may be helpful to have a more general-purpose method:
int CopyNumsIntoArray(int sourceIndex, int reqCount, ref int[] dest,
int destIndex, CopyCountMode mode);
where CopyCountMode indicates what the code should do if the number of items available starting at sourceIndex is greater or less than reqCount; the method should either return the number of items that were available, or throw an exception if it violated the caller's stated expectations. Some callers might start by creating and passing in a 10-item array but be prepared to have the method replace it with a bigger array if there are more than ten items to be returned; others might expect that there will be exactly 23 items and be unprepared to handle any other number. Using a parameter to specify the mode will allow one method to service many kinds of callers.
Although many collection authors don't bother including any method that fits the above pattern, such methods can greatly improve efficiency in cases where code wants to work with a significant minority of a collection (e.g. 1,000 items out of a collection of 50,000). In the absence of such methods, code wishing to work with such a range must either ask for a copy of the whole thing (very wasteful) or request thousands of items individually (also wasteful). Allowing the caller to supply the destination array would improve efficiency in the case where the same method makes many queries, especially if the destination array would be large enough to be put on the large object heap.

How to implement a SearchByID?

Good afternoon all!
As part of getting a better grip on some of the main aspects of object-based programming, I've started to attempt something far larger than I have done in the past. In doing so I'm trying to learn about inheritance, code reuse, using classes far more extensively, and so on.
For this purpose I am trying to piece together all the parts required for a basic RPG/dungeon crawler.
I know this has been done a billion times before, but I find that actually trying to code something like it takes you through a lot more problems than you might think, which is a great way to learn (I think).
For now I have only loaded up a WPF application, since my interest is 95% on being able to piece together the working classes, routines, functions, etc. And not so much interested in how it will look. I am actually reading up on XNA, but since I am mostly trying to get a grip on the basic workings, I don't want to complicate those aspects with the graphical side of things just yet.
The problem I am now facing is that when I would like a character to attack or defend, it should know which other character the attack came from, or which one it should be directed at. I figured I could either use a GUID, or a manually assigned ID. But the problem is that I don't really know how I can implement such a thing.
The thing that I figured was that I could maybe add a reference to an array (Character[]), and have a SearchByID function loop through them to find the right one, and return it. Like so:
internal Character SearchByID(string _ID)
{
foreach(Character charToFind in Character[])
{
if(charToFind.ID == _ID)
return charToFind;
}
}
This of course has to be altered a bit due to the return at the moment, but just to give you an idea.
What I am stuck on is how to create the appropriate array outside of the "Character"-class? I can fill it up just fine, but how do I go about having it added above class level?
The way the "Character"-class is built up is that every new character instantiates from the Character class. The constructor then loads the appropriate values. But other than this, I see no possibility to initialize an array outside of this.
If it is preferable to post the entire code that I have, that will be no problem at all!
Thanks for any insights you may provide me with.
I think you can just use the Character-class and pass other Characters to it, for example:
public class Character
{
public string Name { get; private set; }
public int HitPoints { get; private set; }
public int Offense { get; private set; }
public int Defense { get; private set; }
public Character(string name, int hitPoints, int offense, int defense)
{
Name = name;
HitPoints = hitPoints;
Offense = offense;
Defense = defense;
}
public void Defend(Character source)
{
HitPoints = HitPoints - (source.Offense - Defense);
if (HitPoints <= 0)
{
Console.WriteLine("{0} died", Name);
}
}
public void Attack(Character target)
{
// Here you can call the other character's defend with this char as an attacker
target.Defend(this);
if (target.HitPoints <= 0)
{
Console.WriteLine("{0} killed {1}", Name, target.Name);
}
}
}
The thing with object oriented programming is that you have to start thinking in objects. Objects are like boxes when they're concrete. You can make new ones and give them some properties, like a name, height, width, hitpoints, whatever. You can also let these objects perform actions. Now a simple box won't do much itself, but a character can do various things, so it makes sense to put these actions in the Character-class.
Besides having Characters, you might have a Game-class which manages the game-state, characters, monsters, treasure chests etc...
Now this simple example may cause you to gain HitPoints when your defense is higher than the attacker's offense, but that's details, I'll leave the exact implementation up to you.
I guess you want a way to insert characters in an array when they are instantiated..
You can make a static array or list
So,your class in my opinion should be
class Character
{
static List<Character> characterList=new List<Character>();//all characters are here
public Character(string id,...)
{
//initialize your object
characterList.Add(this);//store them in the list as and when created
}
internal static Character SearchByID(string _ID)
{
foreach(Character charToFind in characterList)
{
if(charToFind.ID == _ID)
return charToFind;
}
return null; // no character with this ID
}
}
As you may know, static members are associated with the class, not with the object. So when you create a new character object, it is automatically added to characterList.
Unless you are dealing with separate processes, e.g. client-server, you probably don't want to use "Id"s at all.
Wherever you are passing string _ID around, pass the actual Character instead. This saves you looking it up in an array or whatever.
Post more code, and I can show you what I mean.
You could use a dictionary, instantiated in your controller class:
Dictionary<Guid, Character> _characterList = new Dictionary<Guid, Character>();
Initialise:
var someCharacter = new Character() { stats = something };
var otherCharacter = new Character() { stats = anotherThing };
var char1Id = Guid.NewGuid();
var char2Id = Guid.NewGuid();
_characterList.Add(char1Id, someCharacter);
_characterList.Add(char2Id, otherCharacter);
then, to access characters:
var charToFind = _characterList[char1Id];
or
var charToFind = _characterList.Values.Single(c => c.Name == "Fred The Killer");
or whatever else...
Check out keyed collection
KeyedCollection
It is like a dictionary where the key is a property of class.
You will be able to reference a Character with the syntax
Characters[id]
On your Character class, override GetHashCode and Equals for performance.
If you use Int32 for the ID then you will get a perfect hash.
Very fast and O(1).
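A minimal sketch of the KeyedCollection approach follows. The Character class here is a cut-down assumption with only an Id and a Name, and CharacterCollection is a hypothetical name; a string key is used so that key lookup stays distinct from the positional indexer inherited from Collection<T>.

```csharp
using System.Collections.ObjectModel;

// Cut-down Character for illustration (an assumption, not the asker's class).
public class Character
{
    public string Id { get; }
    public string Name { get; }

    public Character(string id, string name)
    {
        Id = id;
        Name = name;
    }
}

// The collection extracts the key from each item, so it is never stored separately by the caller.
public class CharacterCollection : KeyedCollection<string, Character>
{
    protected override string GetKeyForItem(Character item) => item.Id;
}
```

Usage would then be along the lines of characters.Add(new Character("c1", "Fred")) followed by characters["c1"]. The indexer throws KeyNotFoundException for an unknown key, so check characters.Contains(id) first when the ID may be missing.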

Generating the next available unique name in C#

If you were to have a naming system in your app where the app contains say 100 actions, which creates new objects, like:
Blur
Sharpen
Contrast
Darken
Matte
...
and each time you use one of these, a new instance is created with a unique editable name, like Blur01, Blur02, Blur03, Sharpen01, Matte01, etc. How would you generate the next available unique name, so that it's an O(1) operation or near constant time. Bear in mind that the user can also change the name to custom names, like RemoveFaceDetails, etc.
It's acceptable to have some constraints, like restricting the number of characters to 100, using letters, numbers, underscores, etc...
EDIT: You can also suggest solutions without "filling the gaps" that is without reusing the already used, but deleted names, except the custom ones of course.
I refer you to Michael A. Jackson's Two Rules of Program Optimization:
1. Don't do it.
2. For experts only: Don't do it yet.
Simple, maintainable code is far more important than optimizing for a speed problem that you think you might have later.
I would start simple: build a candidate name (e.g. "Sharpen01"), then loop through the existing filters to see if that name exists. If it does, increment and try again. This is O(N^2), but until you get thousands of filters, that will be good enough.
If, sometime later, the O(N^2) does become a problem, then I'd start by building a HashSet of existing names. Then you can check each candidate name against the HashSet, rather than iterating. Rebuild the HashSet each time you need a unique name, then throw it away; you don't need the complexity of maintaining it in the face of changes. This would leave your code easy to maintain, while only being O(N).
O(N) will be good enough. You do not need O(1). The user is not going to click "Sharpen" enough times for there to be any difference.
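The throwaway-HashSet variant might be sketched like this. UniqueNames.Next is a hypothetical helper, and the two-digit suffix format is taken from the names in the question.

```csharp
using System.Collections.Generic;

public static class UniqueNames
{
    // Builds a throwaway HashSet of the names in use, then probes candidates
    // baseName01, baseName02, ... until one is free.
    public static string Next(string baseName, IEnumerable<string> existingNames)
    {
        var used = new HashSet<string>(existingNames);
        for (int i = 1; ; i++)
        {
            string candidate = baseName + i.ToString("00");
            if (!used.Contains(candidate))
            {
                return candidate;
            }
        }
    }
}
```

Building the set is O(N) in the number of existing names, and each probe is a constant-time lookup, so custom names such as RemoveFaceDetails cost nothing extra.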
I would create a static integer in action class that gets incremented and assigned as part of each new instance of the class. For instance:
class Blur
{
private static int count = 0;
private string _name;
public string Name
{
get { return _name; }
set { _name = value; }
}
public Blur()
{
_name = "Blur" + count++.ToString();
}
}
Since count is static, each time you create a new instance, it will be incremented and appended to the default name. O(1) time.
EDIT
If you need to fill in the holes when you delete, I would suggest the following. It would automatically queue up numbers when items are renamed, but it would be more costly overall:
class Blur
{
private static int count = 0;
private static Queue<int> deletions = new Queue<int>();
private string _name;
public string Name
{
get { return _name; }
set
{
_name = value;
Delete();
}
}
private int assigned;
public Blur()
{
if (deletions.Count > 0)
{
assigned = deletions.Dequeue();
}
else
{
assigned = count++;
}
_name = "Blur" + assigned.ToString();
}
public void Delete()
{
if (assigned >= 0)
{
deletions.Enqueue(assigned);
assigned = -1;
}
}
}
Also, when you delete an object, you'll need to call .Delete() on the object.
CounterClass Dictionary version
class CounterClass
{
private int count;
private Queue<int> deletions;
public CounterClass()
{
count = 0;
deletions = new Queue<int>();
}
public string GetNumber()
{
if (deletions.Count > 0)
{
return deletions.Dequeue().ToString();
}
return count++.ToString();
}
public void Delete(int num)
{
deletions.Enqueue(num);
}
}
You can create a Dictionary to look up the counter for each base name. Just make sure you parse out the index and call .Delete(int) whenever you rename or delete a value.
You can easily do it in O(m), where m is the number of existing instances of the name (and not dependent on n, the number of items in the list).
Look up the string S in question. If S isn't in the list, you're done.
S exists, so construct S+"01" and check for that. Continue incrementing (e.g. next try S+"02") until it doesn't exist.
This gives you unique names but they're still "pretty" and human-readable.
Unless you expect a large number of duplicates, this should be "near-constant" time because m will be so small.
Caveat: What if the string naturally ends with e.g. "01"? In your case this sounds unlikely so perhaps you don't care. If you do care, consider adding more of a suffix, e.g. "_01" instead of just "01" so it's easier to tell them apart.
You could do something like this:
private Dictionary<string, int> instanceCounts = new Dictionary<string, int>();
private string GetNextName(string baseName)
{
int count;
if (instanceCounts.TryGetValue(baseName, out count))
{
// the thing already exists, so add one to it
count++;
}
else
{
// first use of this base name (TryGetValue leaves count at 0 on a miss)
count = 1;
}
// update the dictionary with the new value
instanceCounts[baseName] = count;
// format the number as desired
return baseName + count.ToString("00");
}
You would then just use it by calling GetNextName(...) with the base name you wanted, such as
string myNextName = GetNextName("Blur");
Using this, you wouldn't have to pre-init the dictionary.
It would fill in as you used the various base words.
Also, this is O(1).
I would create a dictionary with a string key and an integer value, storing the next number to use for a given action. This will be almost O(1) in practice.
private IDictionary<String, Int32> NextFreeActionNumbers = null;
private void InitializeNextFreeActionNumbers()
{
this.NextFreeActionNumbers = new Dictionary<String, Int32>();
this.NextFreeActionNumbers.Add("Blur", 1);
this.NextFreeActionNumbers.Add("Sharpen", 1);
this.NextFreeActionNumbers.Add("Contrast", 1);
// ... and so on ...
}
private String GetNextActionName(String action)
{
Int32 number = this.NextFreeActionNumbers[action];
this.NextFreeActionNumbers[action] = number + 1;
return String.Format("{0} {1}", action, number);
}
And you will have to check for collisions with user-edited values. Again, a dictionary might be a smart choice. There is no way around that. Whatever way you generate your names, the user can always change an existing name to the next one you generate, unless you include all existing names in the generation scheme. (Or use a special character that is not allowed in user-edited names, but that would not be that nice.)
Because of the comments on reusing the holes, I want to add it here, too: don't reuse the holes generated by renaming or deletion. This will confuse the user, because names he deleted or modified will suddenly reappear.
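One way to combine the per-action counter with the collision check is sketched below. NameRegistry and its members are hypothetical; the sketch assumes every rename goes through it, and it never reuses holes, in line with the advice above.

```csharp
using System.Collections.Generic;

public class NameRegistry
{
    private readonly Dictionary<string, int> nextNumbers = new Dictionary<string, int>();
    private readonly HashSet<string> namesInUse = new HashSet<string>();

    // Generates the next free name for an action, skipping numbers whose
    // names were already taken by a user rename.
    public string Generate(string action)
    {
        if (!nextNumbers.TryGetValue(action, out int number))
        {
            number = 1;
        }
        string name;
        do
        {
            name = string.Format("{0} {1}", action, number);
            number++;
        }
        while (!namesInUse.Add(name)); // Add returns false when the name is taken
        nextNumbers[action] = number;
        return name;
    }

    // Returns false (and changes nothing) if oldName is unknown or newName is already in use.
    public bool Rename(string oldName, string newName)
    {
        if (!namesInUse.Contains(oldName) || !namesInUse.Add(newName))
        {
            return false;
        }
        namesInUse.Remove(oldName);
        return true;
    }
}
```

Generation stays close to constant time, because the loop only repeats when a user has renamed something to a name the counter was about to produce.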
I would look for ways to simplify the problem.
Are there any constraints that can be applied? As an example, would it be good enough if each user can only have one (active) type of action? Then, the actions could be distinguished using the name (or ID) of the user.
Blur (Ben F)
Blur (Adrian H)
Focus (Ben F)
Perhaps this is not an option in this case, but maybe something else would be possible. I would go to great lengths in order to avoid the complexity in some of the proposed solutions!
If you want O(1) time then just track how many instances of each you have. Keep a hashtable with all of the possible objects, when you create an object, increment the value for that object and use the result in the name.
You're definitely not going to want to expose a GUID to the user interface.
Are you proposing an initial name like "Blur04", letting the user rename it, and then raising an error message if the user's custom name conflicts? Or silently renaming it to "CustomName01" or whatever?
You can use a Dictionary to check for duplicates in O(1) time. You can have incrementing counters for each effect type in the class that creates your new effect instances. Like Kevin mentioned, it gets more complex if you have to fill in gaps in the numbering when an effect is deleted.
