Combine multiple realm query results .net - c#

As a Contains implementation, I am using a slightly tweaked version of a method written by Andy Dent to query my Realm database:
private IQueryable<Entry> FilterEntriesByIds(IQueryable<Entry> allEntries, int[] idsToMatch)
{
// Fancy way to invent Contains<>() for LINQ
ParameterExpression pe = Expression.Parameter(typeof(Entry), "Entry");
Expression chainedByOr = null;
Expression left = Expression.Property(pe, typeof(Entry).GetProperty("Id"));
for (int i = 0; i < idsToMatch.Count(); i++) {
Expression right = Expression.Constant(idsToMatch[i]);
Expression anotherEqual = Expression.Equal(left, right);
if (chainedByOr == null)
chainedByOr = anotherEqual;
else
chainedByOr = Expression.OrElse(chainedByOr, anotherEqual);
}
MethodCallExpression whereCallExpression = Expression.Call(
typeof(Queryable),
"Where",
new Type[] { allEntries.ElementType },
allEntries.Expression,
Expression.Lambda<Func<Entry, bool>>(chainedByOr, new ParameterExpression[] { pe }));
return allEntries.Provider.CreateQuery<Entry>(whereCallExpression);
}
It all works fine as long as I pass fewer than 2-3K ids. With larger amounts the app crashes with what seems to be a stack overflow.
My first thought was to break the query into chunks and combine the results later, but the Concat and Union methods do not work on these Realm IQueryables. So how else can I merge such chunked results? Or is there any other workaround?
I can't just convert the results to a List or similar and then merge; I have to return Realm objects as IQueryable<>.
The call stack:
=================================================================
Native Crash Reporting
=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
No native Android stacktrace (see debuggerd output).
=================================================================
Basic Fault Address Reporting
=================================================================
Memory around native instruction pointer (0x7c326075c8):
0x7c326075b8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0x7c326075c8 fd 7b bb a9 fd 03 00 91 a0 0b 00 f9 10 0e 87 d2 .{..............
0x7c326075d8 10 d0 a7 f2 90 0f c0 f2 b0 0f 00 f9 10 00 9e d2 ................
0x7c326075e8 f0 d5 a9 f2 90 0f c0 f2 b0 13 00 f9 a0 a3 00 91 ................
=================================================================
Managed Stacktrace:
=================================================================
at Realms.QueryHandle:GroupBegin <0x00000>
at Realms.RealmResultsVisitor:VisitCombination <0x0007f>
at Realms.RealmResultsVisitor:VisitBinary <0x003f7>
at System.Linq.Expressions.BinaryExpression:Accept <0x00073>
at System.Linq.Expressions.ExpressionVisitor:Visit <0x00087>
at Realms.RealmResultsVisitor:VisitCombination <0x000d7>
at Realms.RealmResultsVisitor:VisitBinary <0x003f7>
at System.Linq.Expressions.BinaryExpression:Accept <0x00073>
at System.Linq.Expressions.ExpressionVisitor:Visit <0x00087>
at Realms.RealmResultsVisitor:VisitCombination <0x000d7>
at Realms.RealmResultsVisitor:VisitBinary <0x003f7>
at System.Linq.Expressions.BinaryExpression:Accept <0x00073>
at System.Linq.Expressions.ExpressionVisitor:Visit <0x00087>
at Realms.RealmResultsVisitor:VisitCombination <0x000d7>
at Realms.RealmResultsVisitor:VisitBinary <0x003f7>
at System.Linq.Expressions.BinaryExpression:Accept <0x00073>
at System.Linq.Expressions.ExpressionVisitor:Visit <0x00087>
UPD
I found the exact number of elements after which this error is thrown: 3939. If I pass anything larger than that, it crashes.
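The managed stack trace above shows VisitBinary and VisitCombination recursing once per nested OrElse, and the loop in FilterEntriesByIds builds a left-leaning chain whose depth equals the number of ids. One possible workaround, sketched here and not verified against Realm, is to build the same predicate as a balanced tree so the nesting depth grows with log2(n) instead of n. The Entry class below is a hypothetical stand-in for the Realm model, and BalancedOrFilter, BuildPredicate and OrDepth are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical stand-in for the question's Realm model class.
public class Entry
{
    public int Id { get; set; }
}

public static class BalancedOrFilter
{
    // Builds (Id == a || Id == b || ...) as a balanced binary tree of OrElse nodes.
    public static Expression<Func<Entry, bool>> BuildPredicate(IReadOnlyList<int> ids)
    {
        if (ids.Count == 0)
            throw new ArgumentException("At least one id is required.", nameof(ids));

        ParameterExpression pe = Expression.Parameter(typeof(Entry), "entry");
        Expression idProperty = Expression.Property(pe, nameof(Entry.Id));
        Expression body = BuildOr(idProperty, ids, 0, ids.Count);
        return Expression.Lambda<Func<Entry, bool>>(body, pe);
    }

    // Splits the id range in half at every level, so the tree height is about log2(count).
    private static Expression BuildOr(Expression idProperty, IReadOnlyList<int> ids, int start, int count)
    {
        if (count == 1)
            return Expression.Equal(idProperty, Expression.Constant(ids[start]));

        int half = count / 2;
        return Expression.OrElse(
            BuildOr(idProperty, ids, start, half),
            BuildOr(idProperty, ids, start + half, count - half));
    }

    public static IQueryable<Entry> FilterEntriesByIds(IQueryable<Entry> allEntries, IReadOnlyList<int> idsToMatch)
    {
        return allEntries.Where(BuildPredicate(idsToMatch));
    }

    // Height of the OrElse tree; only used to inspect the shape of the predicate.
    public static int OrDepth(Expression e)
    {
        return e is BinaryExpression b && b.NodeType == ExpressionType.OrElse
            ? 1 + Math.Max(OrDepth(b.Left), OrDepth(b.Right))
            : 0;
    }
}
```

With 4,000 ids the tree built this way is 12 levels deep rather than 4,000. Whether the Realm query builder accepts nested groups on both sides of an OR in the same way is an assumption that would need testing on the device.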

I'm not a database expert, just have worked at varying levels (inc Realm) usually on top of someone's lower-level engine. (The c-tree Plus ISAM engine or the C++ core by the real wizards at Realm).
My first impression is that you have mostly-static data and so this is a fairly classical problem where you want a better index generated up front.
I think you can build such an index using Realm but with a bit more application logic.
It sounds a bit like an Inverted Index problem as in this SO question.
For all the single-letter and probably at least double-letter combinations which map to your words, you want another table that links to all the matching words. You can create those with a fairly easy loop that builds them from the existing words.
E.g. your OneLetter table would have an entry for a that uses a one-to-many relationship to all the matching words in the main Words table.
That will deliver a very fast IList of the matching words that you can iterate.
Then you can flip to your Contains approach at 3 letters or above.
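The idea above can be sketched with plain in-memory collections. This is only an illustration of the index shape: it uses a Dictionary instead of Realm tables and relationships, and SubstringIndex, Build and Find are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SubstringIndex
{
    // Maps every 1..maxKeyLength letter combination to the words that contain it.
    public static Dictionary<string, List<string>> Build(IEnumerable<string> words, int maxKeyLength = 2)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (string word in words)
        {
            // Collect distinct keys first so a word is listed once per key.
            var keys = new HashSet<string>();
            for (int len = 1; len <= maxKeyLength; len++)
                for (int start = 0; start + len <= word.Length; start++)
                    keys.Add(word.Substring(start, len));

            foreach (string key in keys)
            {
                if (!index.TryGetValue(key, out var bucket))
                    index[key] = bucket = new List<string>();
                bucket.Add(word);
            }
        }
        return index;
    }

    // Short terms are answered from the index; longer ones fall back to a Contains scan.
    public static IReadOnlyList<string> Find(
        Dictionary<string, List<string>> index,
        IReadOnlyList<string> allWords,
        string term,
        int maxKeyLength = 2)
    {
        if (term.Length <= maxKeyLength)
            return index.TryGetValue(term, out var bucket) ? bucket : new List<string>();

        return allWords.Where(w => w.Contains(term)).ToList();
    }
}
```

In a Realm model the dictionary buckets would presumably become objects with a to-many relationship to the words, as the answer describes.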

The SQL provider for SQL Server can handle a Contains operation with more items than the maximum number of SQL parameters allowed in one query, which is 2100.
This sample is working:
var ids = new List<int>();
for (int i = 0; i < 10000; i++)
{
ids.Add(i);
}
var testQuery = dbContext.Entity.Where(x => ids.Contains(x.Id)).ToList();
It means, you could try to rewrite your method to use ids.Contains(x.Id).
An example how to do it is in this post.
UPDATE: sorry, I overlooked that this is about Realm, but perhaps it is still worth trying.

Related

JFIF and EXIF data structures

I'm creating a program which reads and prints out metadata from an image, but I'm struggling to get my head around the JFIF and EXIF marker structures.
According to the wikipedia page for JFIF, the JFIF marker should be as follows:
FF E0 s1 s2 4A 46 49 46 00 ii ii jj XX XX YY YY xx yy
Where:
FF E0 is the start of the JFIF marker
s1 and s2 combined give the size of the segment (excluding the APP0 marker)
4A 46 49 46 00 is the identifier (literally JFIF in ascii)
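ii ii is the JFIF version (2 bytes: major, then minor)
jj is the density units (0 = no units, 1 = dots per inch, 2 = dots per cm)
XX XX is the horizontal density
YY YY is the vertical density
xx is the horizontal pixel count of the embedded thumbnail
yy is the vertical pixel count of the embedded thumbnail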
However, when running an image through my program, I get this:
ff e0 20 10 4a 46 49 46 20 1 1 20 20 48 20 48 20 20 ff e1
The marker start is there, but the segment size looks well off (0x2010??) seeing as the next Marker for EXIF data starts just 15 bytes later! (FF E1)
I think that hex values of 0x00 aren't being printed (hence why my image prints the JFIF identifier without the zeros) which may be adding to the confusion, but even then, how is the JFIF version 20 20?
If anyone on here has any experience looking at image metadata, I'd really appreciate your help! There's not a lot of resources that I can find that break down JFIF/EXIF data very clearly.
If you need me to post any code here then I can, though apart from not printing 0x00 values to the console, it seems to be working as expected, so I think my main issue is actually understanding the metadata.
Here is the code for taking the byte stream and then converting it into hex:
fileLocation = Console.ReadLine();
var fileDataAsBytes = File.ReadAllBytes(fileLocation);
var headers = fileDataAsBytes
.Select((b, i) => (b, i))
.Where(tuple => tuple.b == 0xFF
&& tuple.i + 1 < fileDataAsBytes.Length // avoid reading past the last byte
&& fileDataAsBytes[tuple.i + 1] == 0xE1)
.Select(tuple => $"{tuple.i}: {tuple.b:x} {fileDataAsBytes[tuple.i + 1]:x}");
Console.WriteLine(String.Join(",", headers));
DealWithMarkers(fileDataAsBytes);
DisplayAllConversions(fileDataAsBytes);
public static void DisplayAllConversions(byte[] fileDataAsBytes)
{
DisplayBytes(fileDataAsBytes);
DisplayHex(fileDataAsBytes);
DisplayString(fileDataAsBytes);
}
public static void DisplayHex(byte[] fileDataAsBytes)
{
Console.WriteLine($"\n\n\n\t*\t*\t*\t(As Hex)\t*\t*\t*\n");
for (int i = 0; (i < 1000) && (i < fileDataAsBytes.Length); i++)
{
Console.Write($"{fileDataAsBytes[i]:x} ");
}
}
I tried this with a different image and it actually prints out the 0 value bytes correctly, so there must be something weird with the image file I was analyzing!
Image file with "incorrect" 0 bytes
Image file with "correct" 0 bytes
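The APP0 layout listed in the question can be decoded field by field. The sketch below assumes the caller already knows the offset of the FF E0 marker (offset 2 in a file that starts with the FF D8 start-of-image marker); JfifApp0, JfifReader and their members are hypothetical names. The two-byte fields are big-endian, and ToHex uses the "x2" format so that zero bytes are always printed as 00.

```csharp
using System;
using System.Linq;

public sealed class JfifApp0
{
    public int SegmentLength { get; set; }    // counts the two length bytes, not the FF E0 marker
    public int VersionMajor { get; set; }
    public int VersionMinor { get; set; }
    public int DensityUnits { get; set; }     // 0 = no units, 1 = dots per inch, 2 = dots per cm
    public int XDensity { get; set; }
    public int YDensity { get; set; }
    public int ThumbnailWidth { get; set; }
    public int ThumbnailHeight { get; set; }
    public int NextMarkerOffset { get; set; } // offset of the byte after this segment
}

public static class JfifReader
{
    private static readonly byte[] Identifier = { 0x4A, 0x46, 0x49, 0x46, 0x00 }; // "JFIF\0"

    // Formats bytes as two-digit hex so that 0x00 shows up as "00".
    public static string ToHex(byte[] bytes, int start, int count)
    {
        return string.Join(" ", bytes.Skip(start).Take(count).Select(b => b.ToString("x2")));
    }

    // Returns null when no complete JFIF APP0 header starts at the given offset.
    public static JfifApp0 TryReadApp0(byte[] data, int offset)
    {
        if (offset < 0 || offset + 18 > data.Length) return null;
        if (data[offset] != 0xFF || data[offset + 1] != 0xE0) return null;
        for (int i = 0; i < Identifier.Length; i++)
            if (data[offset + 4 + i] != Identifier[i]) return null;

        int length = (data[offset + 2] << 8) | data[offset + 3]; // big-endian
        return new JfifApp0
        {
            SegmentLength = length,
            VersionMajor = data[offset + 9],
            VersionMinor = data[offset + 10],
            DensityUnits = data[offset + 11],
            XDensity = (data[offset + 12] << 8) | data[offset + 13],
            YDensity = (data[offset + 14] << 8) | data[offset + 15],
            ThumbnailWidth = data[offset + 16],
            ThumbnailHeight = data[offset + 17],
            NextMarkerOffset = offset + 2 + length
        };
    }
}
```

If the 20 values in the dump above are really 00 bytes, as the question suspects, the same bytes would read as a segment length of 0x0010 (16), version 1.01, density units 0, a density of 0x0048 (72) in both directions and no thumbnail, which would also place the next marker (FF E1) exactly where it appears.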

C# Translating a few smaller arrays to one big

Title may not explain fully what I want to do, so I made an image.
You can see there are four 1D arrays (red numbers; the black numbers are indexes), and each array is indexed from 0 to 63. I want to translate between them somehow so that, for example, index 16 points to the first index of the second array.
What I was thinking of is a function that takes a List of arrays and the index I want as input, and returns the index of the array and the exact index within that array as output.
I would like to have some hints or suggestions on how to proceed here to achieve this functionality.
Ok, your image shows data interleaved in blocks of 16 elements, so you want to have (showing an example with only two arrays because of space :D)
Global index
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
-----------------------------------------------------------------------------------------------------
Array0 indexes - Array1 indexes
0 1 2 3 4 5 6 7 8 9 A B C D E F - 0 1 2 3 4 5 6 7 8 9 A B C D E F
-------------------------------------------------------------------------------------------------
Global index
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
-----------------------------------------------------------------------------------------------------
Array0 indexes - Array1 indexes
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F - 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
To get it you can do something like this:
public class CompositeIndex
{
public int ArrayNumber { get; set; }
public int ElementNumber { get; set; }
}
public static CompositeIndex GetIndex(int GlobalIndex, int ArrayCount, int ElementsToInterleave)
{
CompositeIndex index = new CompositeIndex();
int block = GlobalIndex / ElementsToInterleave; //Blocks before this index. In your example: 16 / 16 = 1
index.ArrayNumber = block % ArrayCount; //In your example: 1 mod 4 = 1
int round = block / ArrayCount; //Blocks already taken from this array. In your example: 1 / 4 = 0
index.ElementNumber = round * ElementsToInterleave + GlobalIndex % ElementsToInterleave; //In your example: 0 * 16 + 16 mod 16 = 0
return index;
}
Then, if you have 4 matrices and want to get the "global" index 16 you do:
var index = GetIndex(16, 4, 16);
This function works with any number of arrays and any number of interleaved elements.
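As a cross-check of the mapping, here is a small sketch that returns a tuple and also provides the inverse direction. It assumes the layout drawn above: the arrays take turns contributing blocks of blockSize elements, and after every array has contributed one block the next round starts at element blockSize of the first array. InterleavedIndex, FromGlobal and ToGlobal are hypothetical names.

```csharp
using System;

public static class InterleavedIndex
{
    // Global index -> (which array, which element inside that array).
    public static (int ArrayNumber, int ElementNumber) FromGlobal(int globalIndex, int arrayCount, int blockSize)
    {
        int block = globalIndex / blockSize;   // number of whole blocks before this index
        int arrayNumber = block % arrayCount;  // arrays take turns, one block each
        int round = block / arrayCount;        // blocks already taken from this array
        int elementNumber = round * blockSize + globalIndex % blockSize;
        return (arrayNumber, elementNumber);
    }

    // Inverse mapping: (array, element) -> global index.
    public static int ToGlobal(int arrayNumber, int elementNumber, int arrayCount, int blockSize)
    {
        int round = elementNumber / blockSize;
        return (round * arrayCount + arrayNumber) * blockSize + elementNumber % blockSize;
    }
}
```

For four arrays and blocks of 16, global index 16 maps to array 1, element 0, and global index 64 maps to array 0, element 16, because by then every array has already contributed one block.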
BTW, next time phrase your question more clearly; a lot more people will help you if they don't have to solve a puzzle to understand what you want...

Spliting into lines then into variables

I have this text file below:
001 Bulbasaur 45 49 49 65 65 45 Grass Poison
002 Ivysaur 60 62 63 80 80 60 Grass Poison
003 Venusaur 80 82 83 100 100 80 Grass Poison
004 Charmander 39 52 43 60 50 65 Fire
005 Charmeleon 58 64 58 80 65 80 Fire
I have written this piece of code to split it into lines and then into variables, but it refuses to work. My apologies if I am asking this in the wrong place (Unity C# question).
var lines = textFile.text.Split("\n"[0]);
allMonsters = new Monsters[lines.Length];
List<string> lineSplit = new List<string>();
for (int i = 0; i < lines.Length; i++) {
Debug.Log(lines[i]);
lineSplit.Clear();
lineSplit = lines[i].Split(' ').ToList ();
int ID = int.Parse(lineSplit[0]);
string Name = lineSplit[1].ToString();
float HP = float.Parse(lineSplit[2]);
float ATK = float.Parse(lineSplit[3]);
float DEF = float.Parse(lineSplit[4]);
float SPATK = float.Parse(lineSplit[5]);
float SpDEF = float.Parse(lineSplit[6]);
float speed = float.Parse(lineSplit[7]);
FirstType Ft = (FirstType)System.Enum.Parse(typeof(FirstType),lineSplit[8]);
SecondType ST = (SecondType)System.Enum.Parse(typeof(SecondType),lineSplit[9]); }
The code works for the first line, but on the second pass through the loop I get a null reference error. Please help me.
Note, variables are assigned to so they aren't overwritten after code.
EDIT: the lineSplit variable is over 1200 elements long, so I do not think Unity is clearing the list properly. Could this be the issue?
I'm not sure about the second iteration, but on the 4th iteration the line 004 Charmander 39 52 43 60 50 65 Fire has only 9 fields (when split on spaces) and you are using lineSplit[9], so there you will get an ArgumentOutOfRangeException.
When working with text files, linefeed often comes along with carriage return.
Try
var lines = textFile.text.Split(new string[]{"\r\n"}, System.StringSplitOptions.None);
in the first line.
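Putting the two answers together, a parser could split on either line-ending character and treat the second type as optional. This is a sketch outside Unity: it takes the file contents as a plain string, keeps the types as strings because the FirstType and SecondType enums are not shown in the question, and MonsterRow and MonsterParser are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class MonsterRow
{
    public int Id { get; set; }
    public string Name { get; set; }
    public float HP { get; set; }
    public float Atk { get; set; }
    public float Def { get; set; }
    public float SpAtk { get; set; }
    public float SpDef { get; set; }
    public float Speed { get; set; }
    public string FirstType { get; set; }
    public string SecondType { get; set; } // null when the line has only one type
}

public static class MonsterParser
{
    public static List<MonsterRow> Parse(string text)
    {
        var rows = new List<MonsterRow>();
        // Splitting on both characters handles "\n" and "\r\n" files alike;
        // RemoveEmptyEntries drops blank lines, including one after a trailing newline.
        string[] lines = text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length < 9) continue; // skip lines that do not have all required fields

            rows.Add(new MonsterRow
            {
                Id = int.Parse(parts[0], CultureInfo.InvariantCulture),
                Name = parts[1],
                HP = float.Parse(parts[2], CultureInfo.InvariantCulture),
                Atk = float.Parse(parts[3], CultureInfo.InvariantCulture),
                Def = float.Parse(parts[4], CultureInfo.InvariantCulture),
                SpAtk = float.Parse(parts[5], CultureInfo.InvariantCulture),
                SpDef = float.Parse(parts[6], CultureInfo.InvariantCulture),
                Speed = float.Parse(parts[7], CultureInfo.InvariantCulture),
                FirstType = parts[8],
                SecondType = parts.Length > 9 ? parts[9] : null
            });
        }
        return rows;
    }
}
```

In the Unity script the input would presumably be textFile.text, and the type strings could then be passed to System.Enum.Parse as in the question.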

Firmata over nRF24

I'm having some technical problems... I'm trying to use Firmata for Arduino over nRF24 instead of the Serial interface. I have tested nRF24 communication and it's fine. I have also tested Firmata over Serial and it works.
The base device is a simple "serial relay". When data is available on Serial, it reads it and sends it over the nRF24 network. If data is available from the network, it reads it and sends it through Serial.
The node device is a bit more complex. It runs a customized StandardFirmata to which I have added write and read overrides.
The read override is handled in the loop method like this:
while(Firmata.available())
Firmata.processInput();
// Handle network data and send it to Firmata process method
while(network.available()) {
RF24NetworkHeader header;
uint8_t data;
network.read(header, &data, sizeof(uint8_t));
Serial.print(data, DEC); Serial.print(" ");
Firmata.processInputOverride(data);
BlinkOnBoard(50);
}
currentMillis = millis();
Firmata.processInputOverride is a slightly changed version of processInput: processInput reads data directly from FirmataSerial, whereas this method is passed the data received from the network. This was tested and it should work fine.
The write method is overridden in a different way. In Firmata.cpp I have added a function pointer that can be set to a custom function and used to send data through it. I then added a call to that custom function after each FirmataSerial.write() call:
Firmata.h
...
size_t (*firmataSerialWriteOverride)(uint8_t);
...
void FirmataClass::printVersion(void) {
FirmataSerial.write(REPORT_VERSION);
FirmataSerial.write(FIRMATA_MAJOR_VERSION);
FirmataSerial.write(FIRMATA_MINOR_VERSION);
Firmata.firmataSerialWriteOverride(REPORT_VERSION);
Firmata.firmataSerialWriteOverride(FIRMATA_MAJOR_VERSION);
Firmata.firmataSerialWriteOverride(FIRMATA_MINOR_VERSION);
}
I then set the write override to a custom function that writes the byte to the network instead of Serial.
size_t ssignal(uint8_t data) {
RF24NetworkHeader header(BaseDevice);
// Report one byte written on success and zero on failure
return network.write(header, &data, sizeof(uint8_t)) ? 1 : 0;
}
void setup() {
...
Firmata.firmataSerialWriteOverride = ssignal;
...
}
Everything seems to be working fine, it's just that some data seems to be inverted or something. I'm using sharpduino (C#) to do some simple digital pin toggling. Here's what the output looks like: (< came from BASE, > sent to BASE)
> 208 0
> 209 0
...
> 223 0
> 249
< 4 2 249
and here communication stops...
That last line came inverted, so I thought I only needed to invert the received bytes. That worked for the first command, but then something happens and communication stops again.
> 208 0
> 209 0
...
> 223 0
> 249 // Report firmware version request
< 249 2 4
> 240 121 247 // 240 is sysex begin and 247 is sysex end
< 240 121
< 101 0 67 0 0 1 69 0 118
< 117 0 115 0
< 0 70 0 105 0 116 0 111 0 109
< 0 97 0
< 0 109
< 116 0 97 0 247
> 240 107 247
So what could be the problem here? It seems that communication with Firmata works but something isn't right...
-- EDIT --
I solved that issue. The problem was that I had missed the Serial.write() calls in the sysex callback. With that solved, I ran into another problem: all stages seem to pass (I guess), but then I don't get any response from the Node when I request pin states.
...
< f0 6a 7f 7f 7f ... 7f 0 1 2 3 4 5 6 7 8 9 a b c d e f f7 // analog mapping
> f0 6d 0 f7 // sysex request pin 0 state and value
> f0 6d 1 f7
> f0 6d 2 f7
...
> f0 6d 45 f7
// And I wait for response...
There is no response. Any ideas why that would happen? The Node receives all messages correctly and the code for handling pin states exists.

Integer vs double arithmetic performance?

I'm writing a C# class to perform 2D separable convolution using integers to obtain better performance than the double counterpart. The problem is that I don't obtain a real performance gain.
This is the X filter code (it is valid both for int and double cases):
foreach (pixel)
{
int value = 0;
for (int k = 0; k < filterOffsetsX.Length; k++)
{
value += InputImage[index + filterOffsetsX[k]] * filterValuesX[k]; //index is relative to current pixel position
}
tempImage[index] = value;
}
In the integer case "value", "InputImage" and "tempImage" are of "int", "Image<byte>" and "Image<int>" types.
In the double case "value", "InputImage" and "tempImage" are of "double", "Image<double>" and "Image<double>" types.
(filterValues is int[] in each case)
(The class Image<T> is part of an extern dll. It should be similar to .NET Drawing Image class..).
My goal is to achieve fast performance thanks to int += (byte * int) vs double += (double * int).
The following times are mean of 200 repetitions.
Filter size 9 = 0.031 (double) 0.027 (int)
Filter size 13 = 0.042 (double) 0.038 (int)
Filter size 25 = 0.078 (double) 0.070 (int)
The performance gain is minimal. Can this be caused by pipeline stall and suboptimal code?
EDIT: simplified the code deleting unimportant vars.
EDIT2: I don't think I have a cache-miss problem, because "index" iterates through adjacent memory cells (row after row). Moreover, "filterOffsetsX" contains only small offsets relative to pixels on the same row, at a maximum distance of filter size / 2. The problem could be present in the second separable filter (the Y filter), but the times are not so different.
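For anyone who wants to reproduce the comparison without the external Image<T> class, the X filter can be sketched over plain arrays. Everything here is an assumption for illustration: RowFilterBenchmark and its methods are hypothetical names, a single row stands in for the image, and border pixels within radius of either end are simply left at zero.

```csharp
using System;
using System.Diagnostics;

public static class RowFilterBenchmark
{
    // Integer variant: int += byte * int
    public static int[] FilterInt(byte[] input, int[] offsets, int[] weights, int radius)
    {
        var output = new int[input.Length];
        for (int index = radius; index < input.Length - radius; index++)
        {
            int value = 0;
            for (int k = 0; k < offsets.Length; k++)
                value += input[index + offsets[k]] * weights[k];
            output[index] = value;
        }
        return output;
    }

    // Floating-point variant: double += double * int
    public static double[] FilterDouble(double[] input, int[] offsets, int[] weights, int radius)
    {
        var output = new double[input.Length];
        for (int index = radius; index < input.Length - radius; index++)
        {
            double value = 0;
            for (int k = 0; k < offsets.Length; k++)
                value += input[index + offsets[k]] * weights[k];
            output[index] = value;
        }
        return output;
    }

    // Mean wall-clock time per call; the first call is a warm-up so JIT time is not counted.
    public static double AverageSeconds(Action action, int repetitions)
    {
        action();
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < repetitions; i++)
            action();
        return sw.Elapsed.TotalSeconds / repetitions;
    }
}
```

Calling AverageSeconds once with a lambda that runs FilterInt and once with one that runs FilterDouble, in a release build, gives numbers comparable to the table above. Both variants should produce the same values for byte-range input, which is a useful sanity check before timing.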
Using Visual C++, because that way I can be sure that I'm timing arithmetic operations and not much else.
Results (each operation is performed 600 million times):
i16 add: 834575
i32 add: 840381
i64 add: 1691091
f32 add: 987181
f64 add: 979725
i16 mult: 850516
i32 mult: 858988
i64 mult: 6526342
f32 mult: 1085199
f64 mult: 1072950
i16 divide: 3505916
i32 divide: 3123804
i64 divide: 10714697
f32 divide: 8309924
f64 divide: 8266111
freq = 1562587
CPU is an Intel Core i7, Turbo Boosted to 2.53 GHz.
Benchmark code:
#include <stdio.h>
#include <windows.h>
template<void (*unit)(void)>
void profile( const char* label )
{
static __int64 cumtime;
LARGE_INTEGER before, after;
::QueryPerformanceCounter(&before);
(*unit)();
::QueryPerformanceCounter(&after);
after.QuadPart -= before.QuadPart;
printf("%s: %I64i\n", label, cumtime += after.QuadPart);
}
const unsigned repcount = 10000000;
template<typename T>
void add(volatile T& var, T val) { var += val; }
template<typename T>
void mult(volatile T& var, T val) { var *= val; }
template<typename T>
void divide(volatile T& var, T val) { var /= val; }
template<typename T, void (*fn)(volatile T& var, T val)>
void integer_op( void )
{
unsigned reps = repcount;
do {
volatile T var = 2000;
fn(var,5);
fn(var,6);
fn(var,7);
fn(var,8);
fn(var,9);
fn(var,10);
} while (--reps);
}
template<typename T, void (*fn)(volatile T& var, T val)>
void fp_op( void )
{
unsigned reps = repcount;
do {
volatile T var = (T)2.0;
fn(var,(T)1.01);
fn(var,(T)1.02);
fn(var,(T)1.03);
fn(var,(T)2.01);
fn(var,(T)2.02);
fn(var,(T)2.03);
} while (--reps);
}
int main( void )
{
LARGE_INTEGER freq;
unsigned reps = 10;
do {
profile<&integer_op<__int16,add<__int16>>>("i16 add");
profile<&integer_op<__int32,add<__int32>>>("i32 add");
profile<&integer_op<__int64,add<__int64>>>("i64 add");
profile<&fp_op<float,add<float>>>("f32 add");
profile<&fp_op<double,add<double>>>("f64 add");
profile<&integer_op<__int16,mult<__int16>>>("i16 mult");
profile<&integer_op<__int32,mult<__int32>>>("i32 mult");
profile<&integer_op<__int64,mult<__int64>>>("i64 mult");
profile<&fp_op<float,mult<float>>>("f32 mult");
profile<&fp_op<double,mult<double>>>("f64 mult");
profile<&integer_op<__int16,divide<__int16>>>("i16 divide");
profile<&integer_op<__int32,divide<__int32>>>("i32 divide");
profile<&integer_op<__int64,divide<__int64>>>("i64 divide");
profile<&fp_op<float,divide<float>>>("f32 divide");
profile<&fp_op<double,divide<double>>>("f64 divide");
::QueryPerformanceFrequency(&freq);
putchar('\n');
} while (--reps);
printf("freq = %I64i\n", freq.QuadPart);
}
I did a default optimized build using Visual C++ 2010 32-bit.
Every call to profile, add, mult, and divide (inside the loops) got inlined. Function calls were still generated to profile, but since 60 million operations get done for each call, I think the function call overhead is unimportant.
Even with volatile thrown in, the Visual C++ optimizing compiler is SMART. I originally used small integers as the right-hand operand, and the compiler happily used lea and add instructions to do integer multiply. You may get more of a boost from calling out to highly optimized C++ code than the common wisdom suggests, simply because the C++ optimizer does a much better job than any JIT.
Originally I had the initialization of var outside the loop, and that made the floating-point multiply code run miserably slow because of the constant overflows. FPU handling NaNs is slow, something else to keep in mind when writing high-performance number-crunching routines.
The dependencies are also set up in such a way as to prevent pipelining. If you want to see the effects of pipelining, say so in a comment, and I'll revise the testbench to operate on multiple variables instead of just one.
Disassembly of i32 multiply:
; COMDAT ??$integer_op@H$1??$mult@H@@YAXACHH@Z@@YAXXZ
_TEXT SEGMENT
_var$66971 = -4 ; size = 4
??$integer_op@H$1??$mult@H@@YAXACHH@Z@@YAXXZ PROC ; integer_op<int,&mult<int> >, COMDAT
; 29 : {
00000 55 push ebp
00001 8b ec mov ebp, esp
00003 51 push ecx
; 30 : unsigned reps = repcount;
00004 b8 80 96 98 00 mov eax, 10000000 ; 00989680H
00009 b9 d0 07 00 00 mov ecx, 2000 ; 000007d0H
0000e 8b ff npad 2
$LL3@integer_op@5:
; 31 : do {
; 32 : volatile T var = 2000;
00010 89 4d fc mov DWORD PTR _var$66971[ebp], ecx
; 33 : fn(var,751);
00013 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
00016 69 d2 ef 02 00
00 imul edx, 751 ; 000002efH
0001c 89 55 fc mov DWORD PTR _var$66971[ebp], edx
; 34 : fn(var,6923);
0001f 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
00022 69 d2 0b 1b 00
00 imul edx, 6923 ; 00001b0bH
00028 89 55 fc mov DWORD PTR _var$66971[ebp], edx
; 35 : fn(var,7124);
0002b 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
0002e 69 d2 d4 1b 00
00 imul edx, 7124 ; 00001bd4H
00034 89 55 fc mov DWORD PTR _var$66971[ebp], edx
; 36 : fn(var,81);
00037 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
0003a 6b d2 51 imul edx, 81 ; 00000051H
0003d 89 55 fc mov DWORD PTR _var$66971[ebp], edx
; 37 : fn(var,9143);
00040 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
00043 69 d2 b7 23 00
00 imul edx, 9143 ; 000023b7H
00049 89 55 fc mov DWORD PTR _var$66971[ebp], edx
; 38 : fn(var,101244215);
0004c 8b 55 fc mov edx, DWORD PTR _var$66971[ebp]
0004f 69 d2 37 dd 08
06 imul edx, 101244215 ; 0608dd37H
; 39 : } while (--reps);
00055 48 dec eax
00056 89 55 fc mov DWORD PTR _var$66971[ebp], edx
00059 75 b5 jne SHORT $LL3@integer_op@5
; 40 : }
0005b 8b e5 mov esp, ebp
0005d 5d pop ebp
0005e c3 ret 0
??$integer_op@H$1??$mult@H@@YAXACHH@Z@@YAXXZ ENDP ; integer_op<int,&mult<int> >
; Function compile flags: /Ogtp
_TEXT ENDS
And of f64 multiply:
; COMDAT ??$fp_op@N$1??$mult@N@@YAXACNN@Z@@YAXXZ
_TEXT SEGMENT
_var$67014 = -8 ; size = 8
??$fp_op@N$1??$mult@N@@YAXACNN@Z@@YAXXZ PROC ; fp_op<double,&mult<double> >, COMDAT
; 44 : {
00000 55 push ebp
00001 8b ec mov ebp, esp
00003 83 e4 f8 and esp, -8 ; fffffff8H
; 45 : unsigned reps = repcount;
00006 dd 05 00 00 00
00 fld QWORD PTR __real@4000000000000000
0000c 83 ec 08 sub esp, 8
0000f dd 05 00 00 00
00 fld QWORD PTR __real@3ff028f5c28f5c29
00015 b8 80 96 98 00 mov eax, 10000000 ; 00989680H
0001a dd 05 00 00 00
00 fld QWORD PTR __real@3ff051eb851eb852
00020 dd 05 00 00 00
00 fld QWORD PTR __real@3ff07ae147ae147b
00026 dd 05 00 00 00
00 fld QWORD PTR __real@4000147ae147ae14
0002c dd 05 00 00 00
00 fld QWORD PTR __real@400028f5c28f5c29
00032 dd 05 00 00 00
00 fld QWORD PTR __real@40003d70a3d70a3d
00038 eb 02 jmp SHORT $LN3@fp_op@3
$LN22@fp_op@3:
; 46 : do {
; 47 : volatile T var = (T)2.0;
; 48 : fn(var,(T)1.01);
; 49 : fn(var,(T)1.02);
; 50 : fn(var,(T)1.03);
; 51 : fn(var,(T)2.01);
; 52 : fn(var,(T)2.02);
; 53 : fn(var,(T)2.03);
; 54 : } while (--reps);
0003a d9 ce fxch ST(6)
$LN3@fp_op@3:
0003c 48 dec eax
0003d d9 ce fxch ST(6)
0003f dd 14 24 fst QWORD PTR _var$67014[esp+8]
00042 dd 04 24 fld QWORD PTR _var$67014[esp+8]
00045 d8 ce fmul ST(0), ST(6)
00047 dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
0004a dd 04 24 fld QWORD PTR _var$67014[esp+8]
0004d d8 cd fmul ST(0), ST(5)
0004f dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
00052 dd 04 24 fld QWORD PTR _var$67014[esp+8]
00055 d8 cc fmul ST(0), ST(4)
00057 dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
0005a dd 04 24 fld QWORD PTR _var$67014[esp+8]
0005d d8 cb fmul ST(0), ST(3)
0005f dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
00062 dd 04 24 fld QWORD PTR _var$67014[esp+8]
00065 d8 ca fmul ST(0), ST(2)
00067 dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
0006a dd 04 24 fld QWORD PTR _var$67014[esp+8]
0006d d8 cf fmul ST(0), ST(7)
0006f dd 1c 24 fstp QWORD PTR _var$67014[esp+8]
00072 75 c6 jne SHORT $LN22@fp_op@3
00074 dd d8 fstp ST(0)
00076 dd dc fstp ST(4)
00078 dd da fstp ST(2)
0007a dd d8 fstp ST(0)
0007c dd d8 fstp ST(0)
0007e dd d8 fstp ST(0)
00080 dd d8 fstp ST(0)
; 55 : }
00082 8b e5 mov esp, ebp
00084 5d pop ebp
00085 c3 ret 0
??$fp_op@N$1??$mult@N@@YAXACNN@Z@@YAXXZ ENDP ; fp_op<double,&mult<double> >
; Function compile flags: /Ogtp
_TEXT ENDS
It seems like you are saying you are only running that inner loop 5000 times in even your longest case. The FPU last I checked (admittedly a long time ago) only took about 5 more cycles to perform a multiply than the integer unit. So by using integers you would be saving about 25,000 CPU cycles. That's assuming no cache misses or anything else that would cause the CPU to sit and wait in either event.
Assuming a modern Intel Core CPU clocked in the neighborhood of 2.5 GHz, you could expect to save about 10 microseconds of runtime by using the integer unit. Kinda paltry. I do realtime programming for a living, and we wouldn't sweat that much CPU wastage here, even if we were missing a deadline somewhere.
digEmAll makes a very good point in the comments though. If the compiler and optimizer are doing their jobs, the entire thing is pipelined. That means that in actuality the entire inner loop will take 5 cycles longer to run with the FPU than with the integer unit, not each operation in it. If that were the case, your expected time savings would be so small it would be tough to measure them.
If you really are doing enough floating-point ops to make the entire shebang take a very long time, I'd suggest looking into doing one or more of the following:
Parallelize your algorithm and run it on every CPU available from your processor.
Don't run it on the CLR (use native C++, or Ada or Fortran or something).
Rewrite it to run on the GPU. GPUs are essentially array processors and are designed to do massively parallel math on arrays of floating-point values.
Your algorithm seems to access large regions of memory in a very non-sequential pattern. It's probably generating tons of cache misses. The bottleneck is probably memory access, not arithmetic. Using ints should make this slightly faster because ints are 32 bits, while doubles are 64 bits, meaning cache will be used slightly more efficiently. If almost every loop iteration involves a cache miss, though, you're basically out of luck unless you can make some algorithmic or data structure layout changes to improve the locality of reference.
BTW, have you considered using an FFT for convolution? That would put you in a completely different big-O class.
At the very least, it is not fair to compare int (DWORD, 4 bytes) with double (QWORD, 8 bytes) on a 32-bit system. Compare int to float or long to double to get fair results. double has increased precision, and you must pay for it.
PS: to me it smells like micro (and premature) optimization, and that smell is not good.
Edit: OK, good point. It is not correct to compare long to double, but comparing int and double on a 32-bit CPU is still not correct, even if both have intrinsic instructions. This is not magic: x86 is a fat CISC architecture, and a double is still not processed in a single step internally.
On my machine, I find that floating-point multiplication is about the same speed as integer multiplication.
I'm using this timing function:
static void Time<T>(int count, string desc, Func<T> action){
action();
Stopwatch sw = Stopwatch.StartNew();
for(int i = 0; i < count; i++)
action();
double seconds = sw.Elapsed.TotalSeconds;
Console.WriteLine("{0} took {1} seconds", desc, seconds);
}
Let's say you're processing a 200 x 200 array with a 25-length filter 200 times, then your inner loop is executing 200 * 200 * 25 * 200 = 200,000,000 times. Each time, you're doing one multiply, one add, and 3 array indices. So I use this profiling code
const int count = 200000000;
int[] a = {1};
double d = 5;
int i = 5;
Time(count, "array index", ()=>a[0]);
Time(count, "double mult", ()=>d * 6);
Time(count, "double add ", ()=>d + 6);
Time(count, "int mult", ()=>i * 6);
Time(count, "int add ", ()=>i + 6);
On my machine (slower than yours, I think), I get the following results:
array index took 1.4076632 seconds
double mult took 1.2203911 seconds
double add took 1.2342998 seconds
int mult took 1.2170384 seconds
int add took 1.0945793 seconds
As you see, integer multiplication, floating-point multiplication, and floating-point addition all took about the same time. Array indexing took a little longer (and you're doing it three times), and integer addition was a little faster.
So I think the performance advantage to integer math in your scenario is just too slight to make a significant difference, especially when outweighed by the relatively huge penalty you're paying for array indexing. If you really need to speed this up, then you should use unsafe pointers to your arrays to avoid the offset calculation and bounds checking.
By the way, the performance difference for division is much more striking. Following the pattern above, I get:
double div took 3.8597251 seconds
int div took 1.7824505 seconds
One more note:
Just to be clear, all profiling should be done with an optimized release build. Debug builds will be slower overall, and some operations may not have accurate timing with respect to others.
If the times you measured are accurate, then the runtime of your filtering algorithm seems to grow with the cube of the filter size. What kind of filter is that? Maybe you can reduce the number of multiplications needed. (e.g. if you're using a separable filter kernel?)
Otherwise, if you need raw performance, you might consider using a library like the Intel Performance Primitives - it contains highly optimized functions for things like this that use CPU SIMD instructions. They're usually a lot faster than hand-written code in C# or C++.
Did you try looking at the disassembled code? In high-level languages I pretty much trust the compiler to optimize my code.
For example, for(i=0;i<imageSize;i++) might be faster than foreach.
Also, arithmetic operations might get optimized by the compiler anyway. When you need to optimize something, you either optimize the whole "black box" and maybe reinvent the algorithm used in that loop, or you first take a look at the disassembled code and see what's wrong with it.
