I have a text file in which every line looks like:
8:30 8:50 1
..........
20:30 20:35 151
Every line is a new user connection with its time period in the network.
The goal is to find the periods of time where the number of simultaneous connections reaches its maximum.
Does anyone know an algorithm that can help me with this task (multiple intersections)? I find the task non-trivial (I am new to programming); I have some ideas, but they seem awful to me, so maybe I should start from a mathematical algorithm to find the best way to reach my goal.
To begin with, we have to make some assumptions.
Assume you are looking for the shortest period with the maximum number of connections.
Assume every line represents one connection. It's not clear from your question what the integer number after the start and end times on each line means, so I ignore it.
The lines are given in order of increasing period start time.
We are free to choose any local maximum as the answer in case there are several periods with the same number of simultaneous connections.
The first stage of the solution is parsing. Given a sequence of lines we get the sequence of pairs of System.DateTime – a pair for each period in order.
static Tuple<DateTime, DateTime> Parse(string line)
{
    var a = line.Split()
        .Take(2) // take the start and end times only
        .Select(p => DateTime.ParseExact(p, "H:m", CultureInfo.InvariantCulture))
        .ToArray();
    return Tuple.Create(a[0], a[1]);
}
The next stage is the algorithm itself. It has two parts. First, we find local maximums as triples of start time, end time and connection count. Second, we select the absolute maximum from the set produced by the first part. To get only the maximum connection count:

File.ReadLines(FILE_PATH).Select(Parse).GetLocalMaximums().Max(x => x.Item3)

To get a period with the maximum count (ties resolved in favor of the later period):

File.ReadLines(FILE_PATH).Select(Parse).GetLocalMaximums()
    .Aggregate((x, y) => x.Item3 > y.Item3 ? x : y)

Or in favor of the earlier one:

File.ReadLines(FILE_PATH).Select(Parse).GetLocalMaximums()
    .Aggregate((x, y) => x.Item3 >= y.Item3 ? x : y)
The most sophisticated part is the detection of a local maximum.
1. Take the first period A and write down its end time. Then write down its start time as the last known start time. Note there is one end time written down and there is one active connection.
2. Take the next period B and write down its end time. Compare the start time of B to the minimum of the end times written down.
3. If there is no written end time smaller than B's start time, then the number of connections increases at this time. So discard the previous value of the last known start time, replace it with B's start time, and proceed to the next period. Note again there is one more connection at this time and one more end time written down, so the number of active connections is always equal to the number of written-down end times.
4. If there is a written end time smaller than B's start time, we had a decrease in the connection count, which means we have just passed a local maximum (here is the math). We have to report it: yield the triple (the last known start time, the minimum of the written end times, the number of end times written minus one). We subtract one because we should not count the end time for B, which we have already written down. Then discard all the end times smaller than B's start time, replace the last known start time, and proceed to the next period.
5. When the minimum end time equals B's start time, it means we've lost one connection and gained another one at the same time. This means we have to discard that end time and proceed to the next period.
Repeat from step 2 for all the periods we have.
The source code for the local maximum detection:

static IEnumerable<Tuple<DateTime, DateTime, int>>
    GetLocalMaximums(this IEnumerable<Tuple<DateTime, DateTime>> ranges)
{
    DateTime lastStart = DateTime.MinValue;
    var queue = new List<DateTime>();
    foreach (var r in ranges)
    {
        queue.Add(r.Item2);      // write down the end time
        var t = queue.Min();     // smallest end time written down so far
        if (t < r.Item1)         // a connection ended before this one started
        {
            yield return Tuple.Create(lastStart, t, queue.Count - 1);
            do
            {
                queue.Remove(t); // discard end times before this period's start
                t = queue.Min();
            } while (t < r.Item1);
        }
        if (t == r.Item1) queue.Remove(t);
        else lastStart = r.Item1;
    }
    // yield the last local maximum
    if (queue.Count > 0)
        yield return Tuple.Create(lastStart, queue.Min(), queue.Count);
}
While using List<T> here is not the best choice, it's easy to understand. Use a sorted collection for better performance, and replacing the tuples with structs would eliminate a lot of memory allocations.
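For example, since every removal above targets the current minimum end time, a min-heap can stand in for the list. A minimal sketch (not part of the original answer), assuming .NET 6+ for PriorityQueue<TElement, TPriority>:

static IEnumerable<Tuple<DateTime, DateTime, int>>
    GetLocalMaximums(this IEnumerable<Tuple<DateTime, DateTime>> ranges)
{
    DateTime lastStart = DateTime.MinValue;
    var heap = new PriorityQueue<DateTime, DateTime>(); // min-heap of end times
    foreach (var r in ranges)
    {
        heap.Enqueue(r.Item2, r.Item2);  // write down the end time
        var t = heap.Peek();             // smallest end time written down so far
        if (t < r.Item1)
        {
            yield return Tuple.Create(lastStart, t, heap.Count - 1);
            do
            {
                heap.Dequeue();          // discard end times before this period's start
                t = heap.Peek();
            } while (t < r.Item1);
        }
        if (t == r.Item1) heap.Dequeue();
        else lastStart = r.Item1;
    }
    // yield the last local maximum
    if (heap.Count > 0)
        yield return Tuple.Create(lastStart, heap.Peek(), heap.Count);
}

Peek is O(1) and Enqueue/Dequeue are O(log n), replacing the O(n) Min() and Remove() calls on List<T>.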
You could do:
string[] lines = System.IO.File.ReadAllLines(filePath);
var connections = lines
    .Select(d => d.Split(' '))
    .Select(d => new
    {
        From = DateTime.Parse(d[0]),
        To = DateTime.Parse(d[1]),
        Connections = int.Parse(d[2])
    })
    .OrderByDescending(d => d.Connections)
    .ToList();

connections will then contain the list sorted with the top results first.
I have the text of a Word document and an array of strings. The goal is to find all occurrences of those strings in the document's text. I tried to use the Aho-Corasick string matching in C# implementation of the Aho-Corasick algorithm, but the default implementation doesn't fit my needs.
A typical part of the text looks like:
“Activation” means a written notice from Lender to the Bank substantially in the form of Exhibit A.
“Activation Notice” means a written notice from Lender to the Bank substantially in the form of Exhibit A and Activation.
“Business Day” means each day (except Saturdays and Sundays) on which banks are open for general business and Activation Notice.
The array of the keywords looks like
var keywords = new[] {"Activation", "Activation Notice"};
The default implementation of the Aho-Corasick algorithm returns the following counts of occurrences:
Activation - 4
Activation Notice - 2
For 'Activation Notice' this is the correct result, but for 'Activation' the correct count should also be 2,
because I do not want to count occurrences inside the longer keyword 'Activation Notice'.
Is there a proper algorithm for this case?
I will assume you got your results according to the example you linked.
StringSearchResult[] results = searchAlg.FindAll(textToSearch);
With those results, if you assume that the only overlaps are subsets, you can sort by index and collect your desired results in a single pass.
public class SearchResultComparer : IComparer<StringSearchResult>
{
    public int Compare(StringSearchResult x, StringSearchResult y)
    {
        // Try ordering by the start index.
        int compare = x.Index.CompareTo(y.Index);
        if (compare == 0)
        {
            // In case of ties, reverse order by keyword length.
            compare = y.Keyword.Length.CompareTo(x.Keyword.Length);
        }
        return compare;
    }
}
// ...
IComparer<StringSearchResult> searchResultComparer = new SearchResultComparer();
Array.Sort(results, searchResultComparer);
int activeEndIndex = -1;
List<StringSearchResult> nonOverlappingResults = new List<StringSearchResult>();
foreach (StringSearchResult r in results)
{
    if (r.Index < activeEndIndex)
    {
        // This range starts before the active range ends.
        // Since it's an overlap, skip it.
        continue;
    }
    // Save this result, track when it ends.
    nonOverlappingResults.Add(r);
    activeEndIndex = r.Index + r.Keyword.Length;
}
Due to the index sorting, the loop guarantees that only non-overlapping ranges will be kept. But some ranges will be rejected. This can only happen for two reasons.
The candidate starts at the same index as the active range. Since the sorting breaks these ties so the longest goes first, the candidate must be shorter than the active range and can be skipped.
The candidate starts after the active range starts. Since the only overlaps are subsets, and this candidate overlaps the active range, it must be a subset that starts later but still ends at or before the end of the active range.
Therefore the only rejected candidates are subsets, which end no later than the active range does. So the active range remains the only thing we need to check overlaps against.
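If what you ultimately need are the per-keyword counts from the question, a small follow-up over the filtered results (reusing the Keyword property from above) could be:

// Count non-overlapping occurrences per keyword.
Dictionary<string, int> countsByKeyword = nonOverlappingResults
    .GroupBy(r => r.Keyword)
    .ToDictionary(g => g.Key, g => g.Count());

With the sample text this should give Activation = 2 and Activation Notice = 2, the counts the question expects.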
I have a game file with millions of events; the file size can be > 10 GB.
Each line is a game action, like:
player 1, action=kill, timestamp=xxxx(ms granularity)
player 1, action=jump, timestamp=xxxx
player 2, action=fire, timestamp=xxxx
Each action is unique and finite for this data set.
I want to perform analysis on this file, like the total number of events per second, while tracking the individual number of actions in that second.
My plan, in semi-pseudocode:

lastReadGameEventTime = DateTime.MinValue;
datapoints = new Dictionary<string, int>();
while ((line = getNextLine()) != null)
{
    parse_values(line, out var timestamp, out var action);
    if (lastReadGameEventTime == DateTime.MinValue)
    {
        lastReadGameEventTime = timestamp;
    }
    else if (timestamp.Subtract(lastReadGameEventTime).TotalSeconds > 1)
    {
        notify_points_for_this_second(datapoints);
        datapoints = new Dictionary<string, int>();
        lastReadGameEventTime = timestamp;
    }
    if (!datapoints.TryGetValue(action, out var count))
        datapoints[action] = 1;
    else
        datapoints[action] = count + 1;
}
My worry is that this is too naive. I was thinking maybe count the entire minute and get the average per second. But of course I will miss game event spikes.
And if I want to calculate a 5 day average, it will further degrade the result set.
Any clever ideas?
You're asking several different questions here, all related. Your requirements aren't really detailed, but I think I can point you in the right direction. I'm going to assume that all you want is the number of events per second, for some period in the past. So all we need is some way to hold an integer (count of events) for every second during that period.
There are 86,400 seconds in a day. Let's say you want 10 days worth of information. You can build a circular buffer of size 864,000 to hold 10 days' worth of counts:
const int SecondsPerDay = 86400;
const int TenDays = 10 * SecondsPerDay;
int[] TenDaysEvents = new int[TenDays];
So you always have the last 10 days of counts.
Assuming you have an event handler that reads your socket data and passes the information to a function, you can easily keep your data updated:
DateTime lastEventTime = DateTime.MinValue;
int lastTimeIndex = 0;

void ProcessReceivedEvent(string eventRecord) // "event" is a reserved word in C#
{
    // here, parse the event string to get the DateTime
    DateTime eventTime = GetEventDate(eventRecord);
    // work at whole-second granularity so the elapsed-seconds math below lines up
    eventTime = eventTime.AddTicks(-(eventTime.Ticks % TimeSpan.TicksPerSecond));
    if (lastEventTime == DateTime.MinValue)
    {
        lastTimeIndex = 0;
    }
    else if (eventTime != lastEventTime)
    {
        // get number of seconds since last event
        var elapsedTime = eventTime - lastEventTime;
        var elapsedSeconds = (int)elapsedTime.TotalSeconds;
        // For each of those seconds, set the number of events to 0
        for (int i = 1; i <= elapsedSeconds; ++i)
        {
            lastTimeIndex = (lastTimeIndex + 1) % TenDays; // wrap around if we get past the end
            TenDaysEvents[lastTimeIndex] = 0;
        }
    }
    // Now increment the count for the current time index
    ++TenDaysEvents[lastTimeIndex];
    // Remember this event's time for the next call
    lastEventTime = eventTime;
}
This keeps the last 10 days in memory at all times, and is easy to update. Reporting is a bit more difficult because the start might be in the middle of the array. That is, if the current index is 469301, then the starting time is at 469302. It's a circular buffer. The naive way to report on this would be to copy the circular buffer to another array or list, with the starting point at position 0 in the new collection, and then report on that. Or, you could write a custom enumerator that counts back from the current position and starts there. That wouldn't be especially difficult to create.
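A minimal sketch of such an enumerator, assuming the TenDaysEvents, TenDays and lastTimeIndex fields from above; it walks the buffer oldest entry first:

// Enumerate the per-second counts in chronological order. The oldest entry sits
// just after the current index, so start there and wrap around the array once.
IEnumerable<int> CountsOldestFirst()
{
    for (int offset = 1; offset <= TenDays; ++offset)
        yield return TenDaysEvents[(lastTimeIndex + offset) % TenDays];
}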
The beauty of the above is that your array remains static: you allocate it once and just re-use it. You might want to add some extra entries, though, so that there's a little "buffer" between the current time and the time from 10 days ago; that will prevent the data for 10 days ago from being overwritten during a query. An extra 60 entries gives you a 1-minute buffer; an extra 300 gives you a 5-minute buffer.
Another option is to create a linked list of entries, again one per second. With that, you add items to the end of the list and remove older items from the front. Whenever an event comes in for a new second, add an entry to the end of the list, and then remove entries that are more than 10 days old (or whatever your threshold is) from the front of the list. You can still use LINQ to report on things, as recommended in another answer.
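A minimal sketch of that idea, with illustrative names (one node per second that actually saw events, trimmed from the front):

// One (second, count) node per second; old nodes are trimmed from the front
// once they fall outside the retention window.
var perSecondCounts = new LinkedList<(DateTime Second, int Count)>();
TimeSpan retention = TimeSpan.FromDays(10);

void RecordEvent(DateTime eventTime)
{
    // truncate to whole-second granularity
    var second = new DateTime(eventTime.Ticks - eventTime.Ticks % TimeSpan.TicksPerSecond);
    if (perSecondCounts.Last == null || perSecondCounts.Last.Value.Second != second)
        perSecondCounts.AddLast((second, 0));

    var last = perSecondCounts.Last;
    last.Value = (last.Value.Second, last.Value.Count + 1);

    // drop entries older than the retention window
    while (perSecondCounts.First.Value.Second < second - retention)
        perSecondCounts.RemoveFirst();
}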
You could use a hybrid, too. As each second goes by, write a record to the database, and keep the last minute, or hour, or whatever in memory. That way, you have up-to-the-second data available in memory for quick reports and real-time updates, but you can also use the database to report on any period since you first started collecting data.
Whatever you decide, you probably should keep some kind of database, because you can't guarantee that your system won't go down. In fact, you can pretty much guarantee that your system will go down at some point. It's no fun losing data, or having to scan through terabytes of log data to re-build the data that you've collected over time.
I've been looking for examples on how to use Observable.Buffer in Rx, but can't find anything more substantial than boilerplate time-buffered stuff.
There does seem to be an overload to specify a "bufferClosingSelector" but I can't wrap my mind around it.
What I'm trying to do is create a sequence that buffers by time or by an "accumulation".
Consider a request stream where every request has some sort of weight to it, and I do not want to process more than x accumulated weight at a time; or, if not enough has accumulated, just give me what has come in within the last timeframe (regular Buffer functionality).
bufferClosingSelector is a function that is called each time a new buffer is opened; it returns an Observable which produces a value when the buffer should be closed.
For example,
source.Buffer(() => Observable.Timer(TimeSpan.FromSeconds(1))) works like the regular Buffer(time) overload.
If you want to weight a sequence, you can apply a Scan over it and then decide on your aggregating condition.
E.g., source.Scan((a,c) => a + c).SkipWhile(a => a < 100) gives you a sequence which produces a value when the source sequence has added up to more than 100.
You can use Amb to race these two closing conditions to see which reacts first:
.Buffer(() => Observable.Amb
(
Observable.Timer(TimeSpan.FromSeconds(1)),
source.Scan((a,c) => a + c).SkipWhile(a => a < 100)
)
)
You can use any series of combinators which produces any value for the buffer to be closed at that point.
Note:
The value given to the closing selector doesn't matter - it's the notification that matters. So to combine sources of different types with Amb, simply convert both to System.Reactive.Unit:
Observable.Amb(stream1.Select(_ => new Unit()), stream2.Select(_ => new Unit()))
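Putting it together, a sketch of the combined buffer; the Weight property on the request type is illustrative, standing in for whatever weight each request carries:

// Close each buffer after 1 second OR once the accumulated weight reaches 100,
// whichever comes first; both closing signals are converted to Unit so Amb can race them.
var buffered = source.Buffer(() => Observable.Amb(
    Observable.Timer(TimeSpan.FromSeconds(1)).Select(_ => Unit.Default),
    source.Scan(0, (total, request) => total + request.Weight)
          .SkipWhile(total => total < 100)
          .Select(_ => Unit.Default)));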
I am trying to write a method that will determine the closest date chronologically, given a list of dates and a target date. For example, given a (simplified) date set of {Jan 2011, March 2011, November 2011} and a target date of April 2011, the method would return March 2011.
At first, I was thinking of using LINQ's Skip, but I'm not sure of an appropriate Func such that it would stop before the date was exceeded. The code below seems a viable solution, but I'm not sure if there's a more efficient means of doing this. Presumably Last and First would each be linear time.
The source dataSet can be between 0 and 10,000 dates, generally around 5,000. Also, I am iterating over this whole process between 5 and 50 times (this is the number of target dates for different iterations).
// assume dateSet are ordered ascending in time.
public DateTime GetClosestDate(IEnumerable<DateTime> dateSet, DateTime targetDate)
{
    var earlierDate = dateSet.Last(x => x <= targetDate);
    var laterDate = dateSet.First(x => x >= targetDate);
    // compare TimeSpans from earlier to target and later to target.
    // return closestTime;
}
Well, using MinBy from MoreLINQ:
var nearest = dateSet.MinBy(date => Math.Abs((date - targetDate).Ticks));
In other words, for each date, find out how far it is by subtracting one date from the other (either way round), taking the number of Ticks in the resulting TimeSpan, and finding the absolute value. Pick the date which gives the smallest result for that difference.
If you can't use MoreLINQ, you could either write something similar yourself, or do it in two steps (blech):
var nearestDiff = dateSet.Min(date => Math.Abs((date - targetDate).Ticks));
var nearest = dateSet.Where(date => Math.Abs((date - targetDate).Ticks) == nearestDiff).First();
Using Last and First iterates the dateSet twice. You could iterate the dateSet yourself using your own logic. This would be more efficient, but unless your dateSet is very large or enumerating it is very costly for some other reason, the small gain in speed is probably not worth writing more complicated code. Your code should be easy to understand in the first place.
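For completeness, a minimal sketch of that single-pass idea, assuming the dateSet is ordered ascending as stated in the question:

// Walk the ordered sequence once, keep the closest date seen so far, and stop
// as soon as the dates start moving away from the target.
public static DateTime GetClosestDate(IEnumerable<DateTime> dateSet, DateTime targetDate)
{
    DateTime best = default;
    long bestDiff = long.MaxValue;
    foreach (var d in dateSet)
    {
        long diff = Math.Abs((d - targetDate).Ticks);
        if (diff < bestDiff)
        {
            best = d;
            bestDiff = diff;
        }
        else if (d > targetDate)
        {
            break; // dates are ascending, so they only get farther away from here on
        }
    }
    return best;
}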
It is simple!
List<DateTime> MyDateTimeList = ....;
....
DateTime getNearest(DateTime dt)
{
    return MyDateTimeList.OrderBy(t => Math.Abs((dt - t).TotalMilliseconds)).First();
}
I'm using timestamps to temporally order concurrent changes in my program, and require that each timestamp of a change be unique. However, I've discovered that simply calling DateTime.Now is insufficient, as it will often return the same value if called in quick succession.
I have some thoughts, but nothing strikes me as the "best" solution to this. Is there a method I can write that will guarantee each successive call produces a unique DateTime?
Should I perhaps be using a different type for this, maybe a long int? DateTime has the obvious advantage of being easily interpretable as a real time, unlike, say, an incremental counter.
Update: Here's what I ended up coding as a simple compromise solution that still allows me to use DateTime as my temporal key, while ensuring uniqueness each time the method is called:
private static long _lastTime; // records the 64-bit tick value of the last time
private static object _timeLock = new object();
internal static DateTime GetCurrentTime() {
lock ( _timeLock ) { // prevent concurrent access to ensure uniqueness
DateTime result = DateTime.UtcNow;
if ( result.Ticks <= _lastTime )
result = new DateTime( _lastTime + 1 );
_lastTime = result.Ticks;
return result;
}
}
Because each tick value is only one ten-millionth of a second, this method only introduces noticeable clock skew when called on the order of 10 million times per second (which, by the way, it is efficient enough to sustain), meaning it's perfectly acceptable for my purposes.
Here is some test code:
DateTime start = DateTime.UtcNow;
DateTime prev = Kernel.GetCurrentTime();
Debug.WriteLine( "Start time : " + start.TimeOfDay );
Debug.WriteLine( "Start value: " + prev.TimeOfDay );
for ( int i = 0; i < 10000000; i++ ) {
var now = Kernel.GetCurrentTime();
Debug.Assert( now > prev ); // no failures here!
prev = now;
}
DateTime end = DateTime.UtcNow;
Debug.WriteLine( "End time: " + end.TimeOfDay );
Debug.WriteLine( "End value: " + prev.TimeOfDay );
Debug.WriteLine( "Skew: " + ( prev - end ) );
Debug.WriteLine( "GetCurrentTime test completed in: " + ( end - start ) );
...and the results:
Start time: 15:44:07.3405024
Start value: 15:44:07.3405024
End time: 15:44:07.8355307
End value: 15:44:08.3417124
Skew: 00:00:00.5061817
GetCurrentTime test completed in: 00:00:00.4950283
So in other words, in half a second it generated 10 million unique timestamps, and the final result was only pushed ahead by half a second. In real-world applications the skew would be unnoticeable.
One way to get a strictly ascending sequence of timestamps with no duplicates is the following code.
Compared to the other answers here this one has the following benefits:
The values track closely with actual real-time values (except in extreme circumstances with very high request rates when they would get slightly ahead of real-time).
It's lock-free and should perform better than the solutions using lock statements.
It guarantees ascending order (simply appending a looping counter does not).
public class HiResDateTime
{
    private static long lastTimeStamp = DateTime.UtcNow.Ticks;

    public static long UtcNowTicks
    {
        get
        {
            long original, newValue;
            do
            {
                original = lastTimeStamp;
                long now = DateTime.UtcNow.Ticks;
                newValue = Math.Max(now, original + 1);
            } while (Interlocked.CompareExchange(
                         ref lastTimeStamp, newValue, original) != original);
            return newValue;
        }
    }
}
Also note that original = Interlocked.Read(ref lastTimeStamp); should be used for the read of the shared field, since 64-bit read operations are not atomic on 32-bit systems.
Er, the answer to your question is that "you can't," since if two operations occur at the same time (which they will in multi-core processors), they will have the same timestamp, no matter what precision you manage to gather.
That said, it sounds like what you want is some kind of auto-incrementing thread-safe counter. To implement this (presumably as a global service, perhaps in a static class), you would use the Interlocked.Increment method, and if you decided you needed more than int.MaxValue possible versions, also Interlocked.Read.
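A minimal sketch of that suggestion (the class and member names are illustrative, not from the original answer):

// A global, thread-safe, auto-incrementing version counter.
public static class ChangeVersion
{
    private static long current;

    // Interlocked.Increment returns the incremented value atomically.
    public static long Next() => Interlocked.Increment(ref current);
}

Using a long avoids worrying about exceeding int.MaxValue versions.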
DateTime.Now is only updated every 10-15ms.
Not a dupe per se, but this thread has some ideas on reducing duplicates/providing better timing resolution:
How to get timestamp of tick precision in .NET / C#?
That being said: timestamps are horrible keys for information; if things happen that fast you may want an index/counter that keeps the discrete order of items as they occur. There is no ambiguity there.
I find that the most foolproof way is to combine a timestamp and an atomic counter. You already know the problem with the poor resolution of a timestamp. Using an atomic counter by itself also has the simple problem of requiring its state to be stored if you are going to stop and start the application (otherwise the counter starts back at 0, causing duplicates).
If you were just going for a unique id, it would be as simple as concatenating the timestamp and counter value with a delimiter between them. But because you want the values to always be in order, that will not suffice. Basically all you need to do is use the atomic counter value to add additional fixed-width precision to your timestamp. I am a Java developer so I will not be able to provide C# sample code just yet, but the problem is the same in both domains. So just follow these general steps:
1. You will need a method to provide you with counter values cycling from 0 to 99999. 100,000 is the maximum number of values possible when concatenating a millisecond-precision timestamp with a fixed-width value in a 64-bit long. So you are basically assuming that you will never need more than 100,000 ids within a single timestamp resolution (15 ms or so). A static method using the Interlocked class to provide atomic incrementing and resetting to 0 is the ideal way.
2. Now to generate your id you simply concatenate your timestamp with your counter value padded to 5 digits. So if your timestamp was 13023991070123 and your counter was at 234, the id would be 1302399107012300234.
This strategy will work as long as you do not need ids faster than about 6,666 per ms (assuming 15 ms is your most granular resolution) and will always work without having to save any state across restarts of your application.
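A hedged C# sketch of that strategy (the original author wrote from a Java perspective; names here are illustrative):

// Appends a counter cycling 0-99999 to a millisecond timestamp as five
// fixed-width decimal digits, so ids from the same millisecond still ascend.
public static class TimestampedIdGenerator
{
    private static long counter = -1;

    public static long NextId()
    {
        // Interlocked.Increment makes the counter atomic; modulo keeps it in 0..99999.
        long count = Interlocked.Increment(ref counter) % 100000;
        long millis = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        return millis * 100000 + count;
    }
}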
It can't be guaranteed to be unique, but perhaps using ticks is granular enough?
A single tick represents one hundred nanoseconds or one ten-millionth of a second. There are 10,000 ticks in a millisecond.
Not sure what you're trying to do exactly, but maybe look into using Queues to handle processing records sequentially.