How to handle null values during binary search? - c#

What's the best way to go about handling nulls during a binary search over a List<string> (well, it would be a List<string> if I could read all the values out beforehand)?
int previous = 0;
int direction = -1;
if (itemToCompare == null) {
    previous = mid;
    for (int tries = 0; tries < 2; tries++) {
        mid += direction;
        itemToCompare = GetItem(mid);
        while (itemToCompare == null && insideInclusiveRange(min, max, mid)) {
            mid += direction;
            itemToCompare = GetItem(mid);
        }
        if (!insideInclusiveRange(min, max, mid)) {
            /* Reached an endpoint without finding anything,
               try the other direction. */
            mid = previous;
            direction = -direction;
        } else if (itemToCompare != null) {
            break;
        }
    }
}
I'm currently doing something like the above - if null is encountered, then linearly search in a direction until either non-null or beyond endpoint is encountered, if no success then repeat in other direction. In the actual code I'm getting direction from the previous comparison result, and GetItem() caches the values it retrieves. Is there an easier way, without making an intermediate list of non-null values (takes far too long for my purposes because the GetItem() function above is slow)?
I guess I'm asking if there's a smarter way to handle null values than to degrade to a linear search. In all likelihood there will only be a small percentage of nulls (1-5%), but it's possible for there to be sequences of hundreds of nulls.
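For reference, here's a minimal self-contained sketch of the probe-outward approach I'm describing; GetItem(i) and Count are stand-ins for the real (slow) accessor, so treat the names as illustrative:

// Sketch only: binary search that probes outward from mid past null entries.
// GetItem(i) and Count are hypothetical stand-ins for the real slow accessor.
int BinarySearchWithNulls(string target)
{
    int min = 0, max = Count - 1;
    while (min <= max)
    {
        int mid = min + (max - min) / 2;
        string item = GetItem(mid);
        int probe = mid;

        // If mid is null, scan left first, then right, for the nearest non-null.
        foreach (int direction in new[] { -1, 1 })
        {
            probe = mid;
            while (item == null && probe + direction >= min && probe + direction <= max)
            {
                probe += direction;
                item = GetItem(probe);
            }
            if (item != null) break;
        }
        if (item == null) return -1; // every slot in [min, max] is null

        int cmp = string.CompareOrdinal(item, target);
        if (cmp == 0) return probe;
        if (cmp < 0) min = probe + 1;
        else max = probe - 1;
    }
    return -1;
}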
Edit - The data looks something like this
aa aaa
b bb bbb
c cc
d ddd
where each row is a separate object, and not all cells are guaranteed to be filled. The user needs to be able to search across an entire row (so that both "bb" and "bbb" would match the entire second row). Querying each object is slow enough that a linear search will not work. For the same reason, creating a new list without nulls is not really feasible.

Unless there is a reason to actually select/find a null value (unclear what that would mean, since null is a singleton and binary search works best over unique values), consider not allowing them in the list at all.
[Previous answer: After reflecting on the question more I have decided that nulls likely have no place in the problem-space -- take bits and parts as appropriate.]
If nulls are desired, just sort the list so that null values come first (or last) and update the search logic accordingly -- then just make sure not to invoke a method on any of the null values ;-)
This should have little overall impact since a sort is already required. If items are changed to null -- which sounds like an icky side-effect! -- then just "compact" the List (e.g. "remove" the null item). I would, however, just not modify the sorted list unless there is a good reason.
Binary search is only really designed/suitable for (entirely) sorted data. No point turning it into a binary-maybe-linear search.
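For illustration, a minimal sketch of the nulls-last idea, assuming the values can be materialized once into a list (which, granted, the question says may be too slow in this case):

// Sketch: keep nulls at the end and binary search only the non-null prefix.
static int SearchIgnoringNulls(List<string> items, string needle)
{
    items.Sort((a, b) =>
    {
        if (ReferenceEquals(a, b)) return 0; // reflexive, as Sort requires
        if (a == null) return 1;             // nulls sort last
        if (b == null) return -1;
        return string.CompareOrdinal(a, b);
    });
    int nonNullCount = items.FindIndex(s => s == null);
    if (nonNullCount < 0) nonNullCount = items.Count;  // no nulls present
    return items.BinarySearch(0, nonNullCount, needle, StringComparer.Ordinal);
}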
Happy coding.

Related

C# Is it possible to generate an identifier for array of double values

I am working with existing data and have records which contain an array double[23] and double[46]. The values in the array can be the same across multiple records. I would like to generate an id (perhaps an int) to uniquely identify the values in each array.
There are places in the application where I need to group records based on the values in the array being identical. While there are ways to query for this, I was hoping for a single int field (or something similar) to group on. This would really help simplify queries and especially help with report tools where grouping on a smaller single field would help immensely.
I thought of generating a hash code, but I understand these are not guaranteed to be the same for each double[] with matching values. I had tried implementing
((IStructuralEquatable)combined).GetHashCode(EqualityComparer<double>.Default);
to compare the structure and data, but again, I don't think this is guaranteed to match another double[] having the same values.
Perhaps a form of checksum would work but admittedly I am having trouble implementing something. I am looking for suggestions/direction.
Here is the data for 3 sample records. Records 1 and 3 contain the same data, so the generated id should match for those.
32.7,48.9,55.9,48.9,47.7,46.9,45.7,44.4,43.4,41.9,40.4,38.4,36.7,34.4,32.4,30.4,27.9,25.4,22.4,19.4,16.4,13.4,10.4,47.9
40.8,49.0,50.0,49.0,47.8,47.0,45.8,44.5,43.5,42.0,40.5,38.5,36.8,34.5,32.5,30.5,28.0,25.5,22.5,19.5,16.5,13.5,10.5,48.0
32.7,48.9,55.9,48.9,47.7,46.9,45.7,44.4,43.4,41.9,40.4,38.4,36.7,34.4,32.4,30.4,27.9,25.4,22.4,19.4,16.4,13.4,10.4,47.9
Perhaps this is not possible without just checking all the data, but I was hoping for a better solution to simplify the application and improve the speed.
The goal is to add a new id field to the existing records to represent the array data. That way, passing records into report tools would group together easily on one field rather than checking the whole array on each record.
I appreciate any direction.
EDIT - Some issues I ran into trying things (in case it helps someone)
In trying to understand this originally, I was calling the code below (which is part of .NET). I understood these functions would hash the values of the array together (only the last 8 values, as it turns out). I didn't think it included the array handle. The results were not quite as expected, because of a bug MS corrected in .NET, as per the commented line below. With the fix I was getting better results.
int IStructuralEquatable.GetHashCode(IEqualityComparer comparer) {
    if (comparer == null)
        throw new ArgumentNullException("comparer");
    Contract.EndContractBlock();

    int ret = 0;
    for (int i = (this.Length >= 8 ? this.Length - 8 : 0); i < this.Length; i++) {
        ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(i)));
        // .NET 4.6.2; in .NET 4.5.2 the buggy line was:
        // ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(0)))
    }
    return ret;
}

internal static int CombineHashCodes(int h1, int h2) {
    return (((h1 << 5) + h1) ^ h2);
}
I modified this to handle more than 8 values and still had some hashes not matching. I later determined the issue was in the data: I was unaware some of the records had some doubles stored with more than one decimal place (they should have been rounded). This of course changed the hash. Now that the data is consistent, I am seeing matching hashes; any arrays with identical values have an identical hash.
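For reference, a sketch of that modification, reusing the same CombineHashCodes mixing step over every element instead of only the last eight (the method name is illustrative):

// Sketch: hash ALL elements with the same mixing step as CombineHashCodes.
static int HashAllValues(double[] values)
{
    int ret = 0;
    foreach (double v in values)
        ret = ((ret << 5) + ret) ^ v.GetHashCode();
    return ret;
}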
I thought of generating a hash code, but I understand these are not guaranteed to be the same for each double[] with matching values
Quite the opposite: a hash function is required by design to return equal hashes for equal inputs. For example, a function that always returns 0 is a valid starting point, since it trivially returns equal values for equal rows. Everything else is just an optimization to try to reduce false positives.
Perhaps this is not possible without just checking all the data
Of course you need to check all the data, how else would you do it?
However, your implementation is broken. The default hash function for an array hashes the reference (handle) to the array itself, so different array instances with the same data will show up as different. What you want to do is use a HashCode instance and Add() each element of your array to it, to get a proper content-based hash code.
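A minimal sketch of that approach; System.HashCode is available on .NET Core 2.1+ (and via the Microsoft.Bcl.HashCode package on older targets):

// Sketch: hashes the contents, not the array reference.
static int GetContentHash(double[] values)
{
    var hash = new HashCode();
    foreach (double v in values)
        hash.Add(v);
    return hash.ToHashCode();
}

Note that distinct contents can still collide, so before assigning two records the same id, the arrays should still be compared element by element.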

Loop - calculate the last element differently

Hi everyone (sorry for the bad title),
I have a loop in which a rounding difference can occur on every pass. I would like to accumulate these differences and add them to the last record of my result.
var cumulatedRoundDifference = 0m;
var resultSet = Enumerable.Range(0, periods)
    .Select(currentPeriod => {
        var value = this.CalculateValue(currentPeriod);
        var valueRounded = this.CommercialRound(value);
        // Bad part :(
        cumulatedRoundDifference += value - valueRounded;
        if (currentPeriod == periods - 1)
            valueRounded = this.CommercialRound(value + valueRounded);
        return valueRounded;
    });
At the moment the code is, in my opinion, not so nice.
Is there a pattern / algorithm for such a thing or is it somehow clever with Linq, without a variable outside the loop?
Many greetings
It seems like you are doing two things - rounding everything, and calculating the total rounding error.
You could remove the variable outside the lambda, but then you would need 2 queries.
var baseQuery = Enumerable.Range(0, periods)
    .Select(x => CalculateValue(x))
    .Select(value => new { Value = value, ValueRounded = CommercialRound(value) });
var cumulatedRoundDifference = baseQuery.Select(x => x.Value - x.ValueRounded).Sum();

// LINQ isn't really good at doing something different to the last element
var resultSet = baseQuery.Select(x => x.ValueRounded)
    .Take(periods - 1)
    .Concat(new[] { CommercialRound(CalculateValue(periods - 1) + cumulatedRoundDifference) });
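Note that, as written, baseQuery is enumerated twice (once for the Sum and once for resultSet), so CalculateValue runs twice per period; appending .ToList() to the baseQuery definition would materialize it once and avoid the duplicate work.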
Is there a pattern / algorithm for such a thing or is it somehow clever with Linq, without a variable outside the loop?
I don't quite agree with what you're trying to accomplish: you're performing two very different tasks, so why merge them into the same iteration block? The latter (handling the last item) isn't even supposed to be an iteration.
For readability's sake, I suggest splitting the two off. It makes more sense and doesn't require you to check if you're on the last loop of the iteration (which saves you some code and nesting).
While I don't quite understand the calculation in and of itself, I can answer the algorithm you're directly asking for (though I'm not sure this is the best way to do it, which I'll address later in the answer).
var allItemsExceptTheLastOne = allItems.Take(allItems.Count() - 1);
foreach (var item in allItemsExceptTheLastOne)
{
    // Your logic for all items except the last one
}

var theLastItem = allItems.Last();
// Your logic for the last item
This is in my opinion a cleaner and more readable approach. I'm not a fan of using lambda methods as mini-methods with a less-than-trivial readability. This may be subjective and a matter of personal style.
On rereading, I think I understand the calculation better, so I've added an attempt at implementing it, while still maximizing readability as best I can:
// First we make a list of the rounding differences (without the sum)
var myValues = Enumerable
    .Range(0, periods)
    .Select(period => this.CalculateValue(period))
    .Select(value => value - this.CommercialRound(value))
    .ToList();
// myValues = [ 0.1, 0.2, 0.3 ]
myValues.Add(myValues.Sum());
// myValues = [ 0.1, 0.2, 0.3, 0.6 ]
This follows the same approach as the algorithm I first suggested: iterate over the iteratable items, and then separately handle the last value of your intended result list.
Note that I separated the logic into two subsequent Select statements as I consider it the most readable (no excessive lambda bodies) and efficient (no duplicate CalculateValue calls) way of doing this. If, however, you are more concerned about performance, e.g. when you are expecting to process massive lists, you may want to merge these again.
I suggest that you always try to default to writing code that favors readability over (excessive) optimization; and only deviate from that path when there is a clear need for additional optimization (which I cannot decide based on your question).
On a second reread, I'm not sure you've explained the actual calculation well enough, as cumulatedRoundDifference is not actually used in your calculations, but the code seems to suggest that its value should be important to the end result.

C# List sorting explanation needed

I have a problem which is already solved, but I don't know what is really happening. Here is the simplified task: I have a list of records. Each record consists of 2 fields, a key and a value. All keys are different. I want to sort the records so that:
A row with an empty string as its key comes first.
Then come the rows whose value contains "Key not found", in alphabetical order by key.
Then the rest of the rows, in alphabetical order by key.
So I made this class:
private class RecordComparer : IComparer<GridRecord>
{
    public int Compare(GridRecord x, GridRecord y)
    {
        if (x.Key == String.Empty)
            return -1;
        else if (y.Key == String.Empty)
            return 1;
        else if (x.Value.Contains("Key not found:") && !y.Value.Contains("Key not found:"))
            return -1;
        else if (!x.Value.Contains("Key not found:") && y.Value.Contains("Key not found:"))
            return 1;
        else
            return (x.Key.CompareTo(y.Key));
    }
}
When I try to use it, I get: "Comparer (or the IComparable methods it relies upon) did not return zero when Array.Sort called x.CompareTo(x). x: '' x's type: 'GridRecord' The IComparer:"
The error doesn't always appear; sometimes (usually the first time I use it in my program) it works fine. The second or third call crashes.
Inserting
if (x.Key == y.Key)
    return 0;
at the beginning of the Compare function above solved the problem, and everything works fine. Why?
If you compare {Key=""} with anything, you are currently returning -1, even if you are comparing it with itself. When you compare something with itself (or with something semantically equivalent), you are supposed to return 0. That is what the error is about.
It is wise to enforce a total order in your custom comparer. One of the requirements for a total order is reflexivity: for any x, Compare(x, x) must equal zero. This property is required, for example, when the comparer is used to sort an array with non-unique values.
Some libraries perform additional checks on custom comparers. There is no point in comparing an element to itself, but such a check allows the runtime to find subtle errors (like the one you made); that's probably why you got the error message. Fixing such errors makes your code more stable. Usually checks like this exist only in debug builds.
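For completeness, a sketch of the comparer with the reflexivity fix folded in (relying on the question's statement that all keys are different):

private class RecordComparer : IComparer<GridRecord>
{
    public int Compare(GridRecord x, GridRecord y)
    {
        if (x.Key == y.Key)
            return 0;                  // reflexivity: Compare(x, x) == 0
        if (x.Key == String.Empty)
            return -1;                 // the empty key sorts first
        if (y.Key == String.Empty)
            return 1;
        bool xNotFound = x.Value.Contains("Key not found:");
        bool yNotFound = y.Value.Contains("Key not found:");
        if (xNotFound != yNotFound)
            return xNotFound ? -1 : 1; // "Key not found" rows come next
        return x.Key.CompareTo(y.Key); // then alphabetical by key
    }
}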

How do I insert an int into a sorted array quickly?

I'd like to insert an int into a sorted array. This operation is going to be performed very often, so it needs to be as fast as possible.
It is possible and even preferred to use a List or any other class instead of an array
All values are in the 1 to 34 range
The array typically contains exactly 14 values
I was thinking of many different approaches, including binary search and simple insert-on-copy, but found it hard to decide. Also, I felt like I missed an idea. Do you have experiences on this topic or any new ideas to consider?
I will use an int array of length 35 (because you said the range is 1-34) to record the status of the numbers.
int[] status = Enumerable.Repeat(0, 35).ToArray();
// an array containing 35 zeros,
// which means there are currently no elements in the list

status[10] = 1;  // now the list contains only one number: 10
status[11]++;    // a new number, 11, is added to the list
So if you want to add a number i to the list:
status[i]++; // O(1) to add a number
To remove an i from the list:
status[i]--; // O(1) to remove a number
Want to know all the numbers in the list?
for (int i = 0; i < status.Length; i++)
{
    if (status[i] > 0)
    {
        for (int j = 0; j < status[i]; j++)
            Console.WriteLine(i);
    }
}

// or, more easily, using LINQ
var result = status.SelectMany((count, index) => Enumerable.Repeat(index, count));
The following example may help you understand my code better:
the real number array: 1 12 12 15 9 34 // i don't care if it's sorted
the status array: status[1]=1,status[12]=2,status[15]=1,status[9]=1,status[34]=1
all others are 0
At 14 values this is a pretty small array; I don't think switching to a smarter data structure such as a list will win you much, especially since you have fast random access. Even binary search may actually be slower than linear search at this scale. Are you sure that, say, insert-on-copy does not satisfy your performance requirements?
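For reference, a minimal sketch of insert-on-copy (names are illustrative):

// Sketch: O(n) insert into a sorted array by copy-and-shift.
static int[] InsertSorted(int[] sorted, int value)
{
    int i = Array.BinarySearch(sorted, value);
    if (i < 0) i = ~i;                        // insertion point when not found
    int[] result = new int[sorted.Length + 1];
    Array.Copy(sorted, 0, result, 0, i);
    result[i] = value;
    Array.Copy(sorted, i, result, i + 1, sorted.Length - i);
    return result;
}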
This operation is going to be performed very often, so it needs to be as fast as possible.
The things that you notice happen "very often" are frequently not the bottlenecks in the program - it's often surprising what the actual bottlenecks are. You should code something simple and measure the actual performance of your program before performing any optimizations.
I was thinking of many different approaches, including binary search and simple insert-on-copy, but found it hard to decide.
Assuming that this is the bottleneck, the big-O performance of the different methods is not going to be relevant here because of the small size of your array. It is easier to just try a few different approaches, measure the results, see which performs best and choose that method. If you have followed the advice from the first paragraph you already have a profiler setup that you can use for this step too.
For inserting into the middle, a LinkedList<int> would be the fastest option - anything else involves copying data. At 14 elements, don't stress over binary search etc - just walk forwards to the item you want:
using System;
using System.Collections.Generic;

static class Program
{
    static void Main()
    {
        LinkedList<int> data = new LinkedList<int>();
        Random rand = new Random(12345);
        for (int i = 0; i < 20; i++)
        {
            data.InsertSortedValue(rand.Next(300));
        }
        foreach (int i in data) Console.WriteLine(i);
    }
}

static class LinkedListExtensions
{
    public static void InsertSortedValue(this LinkedList<int> list, int value)
    {
        LinkedListNode<int> node = list.First, next;
        if (node == null || node.Value > value)
        {
            list.AddFirst(value);
        }
        else
        {
            while ((next = node.Next) != null && next.Value < value)
                node = next;
            list.AddAfter(node, value);
        }
    }
}
Doing the brute-force approach is the best decision here because 14 isn't a big number :). However, this is not a scalable decision: should 14 become 14000 one day, that will cause problems.
What is the most common operation with your array?
Insert? Read?
A heap data structure will give you O(log(14)) for both of them. SortedDictionary may hurt your performance.
Using a simple array will give you O(1) for reading and O(14) for insert.
By the way, have you tried System.Collections.Generic.SortedDictionary or System.Collections.Generic.SortedList?
If you're on .NET 4 you should take a look at SortedSet<T>. Otherwise take a look at SortedDictionary<TKey, TValue>, making TValue object and just putting null into it, because you're only interested in the keys.
If there are no repeated values in the array and the possible values won't change, a fixed-size array where the value equals the index may be a good choice.
Both insert and read are O(1).
You have a range of possible values from 1-34, which is rather narrow. So the fastest way would likely be an array with 34 slots. To insert a number n just do array[n-1]++, and to remove it do array[n-1]-- (if array[n-1] > 0).
To check whether a value exists in your collection, test array[n-1] > 0.
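A compact sketch of that bookkeeping (value n lives at index n - 1; the helper names are illustrative):

// Sketch: counting array for values in the 1-34 range.
int[] counts = new int[34];

void Add(int n)      => counts[n - 1]++;
void Remove(int n)   { if (counts[n - 1] > 0) counts[n - 1]--; }
bool Contains(int n) => counts[n - 1] > 0;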
edit: Damn...Danny was faster. :)

How to make large if-statement more readable

Is there a better way for writing a condition with a large number of AND checks than a large IF statement in terms of code clarity?
E.g. I currently need to make a field on screen mandatory if other fields do not meet certain requirements. At the moment I have an if statement which runs over 30 LOC, and that just doesn't seem right.
if (!(field1 == field2 &&
      field3 == field4 &&
      field5 == field6 &&
      // ...
      field100 == field101))
{
    // Perform operations
}
Is the solution simply to break these down into smaller chunks and assign the results to a smaller number of boolean variables? What is the best way for making the code more readable?
Thanks
I would consider building up rules, in predicate form:
bool FieldIsValid() { // condition }
bool SomethingElseHappened() { // condition }
// etc
Then, I would create myself a list of these predicates:
IList<Func<bool>> validConditions = new List<Func<bool>>
{
    FieldIsValid,
    SomethingElseHappened,
    // etc
};
Finally, I would write the check over all of the conditions:
if (validConditions.All(c => c()))
{
    // Perform operations
}
The right approach will vary depending upon details you haven't provided. If the items to be compared can be selected by some sort of index (a numeric counter, list of field identifiers, etc.) you are probably best off doing that. For example, something like:
Ok = True
For Each fld As KeyValuePair(Of Control, String) In CheckFields
    If fld.Key.Text <> fld.Value Then
        Ok = False
        Exit For
    End If
Next
Constructing the list of controls and strings may be a slight nuisance, but there are reasonable ways of doing it.
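In C#, the same table-driven check might look like the sketch below; the Dictionary pairing and the field1Box/field2Box names are assumptions about the form in question:

// Hypothetical map from each form field to the value it must contain.
var checkFields = new Dictionary<Control, string>
{
    { field1Box, "required value 1" },   // field1Box etc. are illustrative names
    { field2Box, "required value 2" },
};

bool ok = checkFields.All(pair => pair.Key.Text == pair.Value);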
Personally, I feel that breaking this into chunks will just make the overall statement less clear. It's going to make the code longer, not more concise.
I would probably refactor this check into a method on the class, so you can reuse it as needed, and test it in a single place. However, I'd most likely leave the check written as you have it - one if statement with lots of conditions, one per line.
You could refactor your conditional into a separate function, and also use De Morgan's Laws to simplify your logic slightly.
Also - are your variables really all called fieldN?
Part of the problem is that you are mixing metadata and logic.
WHICH fields are required (or must be equal, have a minimum length, etc.) is metadata.
Verifying that each field meets its requirements is program logic.
The list of requirements (and the fields they apply to) should be stored somewhere else, not inside a large if statement.
Your verification logic then reads the list, loops through it, and keeps a running total. If ANY field fails, you need to alert the user, as sketched below.
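A minimal sketch of that separation (the rule names and predicates are illustrative):

// Illustrative rule table: the metadata lives in the list, not in an if.
var requirements = new List<(string Name, Func<bool> IsMet)>
{
    ("Field 1 must match field 2", () => field1 == field2),
    ("Field 3 must match field 4", () => field3 == field4),
};

// Generic verification logic: collect every failed rule, alert on any failure.
var failures = requirements.Where(r => !r.IsMet()).Select(r => r.Name).ToList();
if (failures.Count > 0)
{
    // alert the user, listing the failed requirements
}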
It may be useful to begin using a workflow engine for C#; Windows Workflow Foundation was specifically designed to help lay out these sorts of complex decision algorithms graphically.
The first thing I'd change for legibility is to remove the almost hidden negation by inverting the statement (by using De Morgan's Laws):
if (field1 != field2 || field3 != field4 /* ... etc. */)
{
    // Perform operations
}
Although a series of && can be marginally faster than ||, I feel the readability gained over the original code is worth the change.
If performance were an issue, you could break the statements into a series of if-statements, but that's getting to be a mess by then!
Is there some other relationship between all the variables you're comparing which you can exploit?
For example, are they all the members of two classes?
If so, and provided your performance requirements don't preclude this, you can scrape references to them all into a List or array, and then compare them in a loop. Sometimes you can do this at object construction, rather than for every comparison.
It seems to me that the real problem is somewhere else in the architecture rather than in the if() statement - of course, that doesn't mean it can easily be fixed, I appreciate that.
Isn't this what arrays are basically for?
Instead of having 100 variables named fieldN, create an array of 100 values.
Then you can have a function that loops over the pairs in the array and returns true or false depending on whether the condition matches.
