Kinect gesture analysis

Kinect gesture analysis - c#

I'm building a Kinect application using the official Kinect SDK.
The result I want:
1) Identify that the user has been waving for 5 seconds, and do something if so.
2) Identify that the user has been leaning on one leg for 5 seconds, and do something if so.
Does anyone know how to do this? I'm working in a WPF application.
I would like to see some examples, as I'm rather new to Kinect.
Thanks in advance for all your help!

The Kinect provides you with the skeletons it's tracking; you have to do the rest. Basically, you need to create a definition for each gesture you want and run it against the skeletons every time the SkeletonFrameReady event is fired. This isn't easy.
Defining Gestures
Defining the gestures can be surprisingly difficult. The simplest (easiest) gestures are ones that happen at a single point in time, and therefore don't rely on past locations of the limbs. For example, if you want to detect when the user has their hand raised above their head, this can be checked on every individual frame. More complicated gestures need to take a period of time into account. For your waving gesture, you won't be able to tell from a single frame whether a person is waving or just holding their hand up in front of them.
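A single-frame check like the raised hand can be sketched without the SDK at all. This is only an illustration: `Point3` and `SingleFrameGestures` are hypothetical names standing in for whatever joint positions you read from the skeleton, and it assumes Y increases upward.

```csharp
using System;

// Hypothetical stand-in for a joint position read from the skeleton data.
public struct Point3
{
    public float X, Y, Z;
    public Point3(float x, float y, float z) { X = x; Y = y; Z = z; }
}

public static class SingleFrameGestures
{
    // True if either hand is higher than the head in this one frame.
    // Assumes Y increases upward.
    public static bool IsHandAboveHead(Point3 head, Point3 leftHand, Point3 rightHand)
    {
        return leftHand.Y > head.Y || rightHand.Y > head.Y;
    }
}
```

Because nothing here depends on earlier frames, the check can run on every frame with no stored state.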
So now you need to be able to store relevant information from the past, but what information is relevant? Should you keep a store of the last 30 frames and run an algorithm against that? At 30 frames per second, 30 frames only gets you a second's worth of information. Perhaps 60 frames? Or for your 5 seconds, 150 frames? Humans don't move that fast, so maybe you could use every fifth frame, which would bring your 5 seconds back down to 30 frames. A better idea would be to pick and choose the relevant information out of the frames. For a waving gesture, the hand's current velocity, how long it's been moving, how far it's moved, etc. could all be useful information.
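One way to keep only the derived values rather than raw frames might look like the sketch below. `HandMotionTracker` and the `minSpeed` threshold are hypothetical; the idea is just to update velocity, time in motion, and distance covered once per frame.

```csharp
using System;

public class HandMotionTracker
{
    private readonly float minSpeed;   // speed below which the hand counts as still (assumed threshold)
    private bool hasPrevious;
    private float previousX;
    private double previousTime;

    public float Velocity { get; private set; }        // signed speed along X
    public double MovingDuration { get; private set; } // seconds of continuous motion
    public float DistanceMoved { get; private set; }   // distance covered during that motion

    public HandMotionTracker(float minSpeed) { this.minSpeed = minSpeed; }

    // Call once per frame with the hand's X position and the frame time in seconds.
    public void Update(float handX, double timeSeconds)
    {
        if (hasPrevious && timeSeconds > previousTime)
        {
            double dt = timeSeconds - previousTime;
            float dx = handX - previousX;
            Velocity = (float)(dx / dt);
            if (Math.Abs(Velocity) >= minSpeed)
            {
                MovingDuration += dt;
                DistanceMoved += Math.Abs(dx);
            }
            else
            {
                // The hand stopped, so the current motion is over.
                MovingDuration = 0;
                DistanceMoved = 0;
            }
        }
        previousX = handX;
        previousTime = timeSeconds;
        hasPrevious = true;
    }
}
```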
After you've figured out how to get and store all the information pertaining to your gesture, how do you turn those numbers into a definition? Waving could require a certain minimum speed, or a direction (left/right instead of up/down), or a duration. However, this duration isn't the 5 second duration you're interested in. This duration is the absolute minimum required to assume that the user is waving. As mentioned above, you can't determine a wave from one frame. You shouldn't determine a wave from 2, or 3, or 5, because that's just not enough time. If my hand twitches for a fraction of a second, would you consider that a wave? There's probably a sweet spot where most people would agree that a left to right motion constitutes a wave, but I certainly don't know it well enough to define it in an algorithm.
There's another problem with requiring a user to do a certain gesture for a period of time. Chances are, not every frame in that five seconds will appear to be a wave, regardless of how well you write the definition. Whereas you can easily determine if someone held their hand over their head for five seconds (because it can be determined on a single-frame basis), it's much harder to do that for complicated gestures. And while waving isn't that complicated, it still shows this problem. As your hand changes direction at either side of a wave, it stops moving for a fraction of a second. Are you still waving then? If you answered yes, wave more slowly so you pause a little more at either side. Would that pause still be considered a wave? Chances are, at some point in that five-second gesture, the definition will fail to detect a wave. So now you need to build some leniency into the gesture duration. If the waving gesture occurred for 95% of the last five seconds, is that good enough? 90%? 80%?
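The leniency idea could be sketched as a sliding window over per-frame results. `GestureDurationWindow` is a hypothetical helper; it assumes some other detector already produces a yes/no answer for each frame, and the required fraction is yours to tune.

```csharp
using System;
using System.Collections.Generic;

public class GestureDurationWindow
{
    private readonly Queue<KeyValuePair<double, bool>> samples =
        new Queue<KeyValuePair<double, bool>>();
    private readonly double windowSeconds;
    private int detectedCount;
    private double firstSampleTime = double.NaN;
    private double latestTime;

    public GestureDurationWindow(double windowSeconds) { this.windowSeconds = windowSeconds; }

    // Record whether the gesture definition matched on this frame.
    public void Add(double timeSeconds, bool gestureDetected)
    {
        if (double.IsNaN(firstSampleTime)) firstSampleTime = timeSeconds;
        latestTime = timeSeconds;
        samples.Enqueue(new KeyValuePair<double, bool>(timeSeconds, gestureDetected));
        if (gestureDetected) detectedCount++;

        // Drop samples that have fallen out of the window.
        while (samples.Count > 0 && timeSeconds - samples.Peek().Key > windowSeconds)
        {
            if (samples.Dequeue().Value) detectedCount--;
        }
    }

    // True once a full window has been observed and enough of its frames matched.
    public bool IsSatisfied(double requiredFraction)
    {
        if (samples.Count == 0) return false;
        bool windowFull = latestTime - firstSampleTime >= windowSeconds;
        return windowFull && (double)detectedCount / samples.Count >= requiredFraction;
    }
}
```

With a 5-second window and a required fraction of 0.8, brief pauses at either side of the wave no longer reset the whole gesture.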
The point I'm trying to make here is there's no easy way to do gesture recognition. You have to think through the gesture and determine some kind of definition that will turn a bunch of joint positions (the skeleton data) into a gesture. You'll need to keep track of relevant data from past frames, but realize that the gesture definition likely won't be perfect.
Consider the Users
So now that I've said why the five second wave would be difficult to detect, allow me to at least give my thoughts on how to do it: don't. You shouldn't force users to repeat a motion based gesture for a set period of time (the five second wave). It is surprisingly tiring and just not what people expect/want from computers. Point and click is instantaneous; as soon as we click, we expect a response. No one wants to have to hold a click down for five seconds before they can open Minesweeper. Repeating a gesture over a period of time is okay if it's continually executing some action, like using a gesture to cycle through a list - the user will understand that they must continue doing the gesture to move farther through the list. This even makes the gesture easier to detect, because instead of needing information for the last 5 seconds, you just need enough information to know if the user is doing the gesture right now.
If you want the user to hold a gesture for a set amount of time, make it a stationary gesture (holding your hand at some position for x seconds is a lot easier than waving). It's also a very good idea to give some visual feedback, to say that the timer has started. If a user screws up the gesture (wrong hand, wrong place, etc) and ends up standing there for 5 or 10 seconds waiting for something to happen, they won't be happy, but that's not really part of this question.
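A stationary hold with feedback might be sketched as a small timer that exposes its progress. `HoldTimer` is a hypothetical name, and how you draw the progress (a filling ring, a bar) is up to your UI.

```csharp
using System;

public class HoldTimer
{
    private readonly double requiredSeconds;
    private double heldSeconds;

    public HoldTimer(double requiredSeconds) { this.requiredSeconds = requiredSeconds; }

    // 0.0 to 1.0; bind this to a progress indicator so the user sees the timer running.
    public double Progress
    {
        get { return Math.Min(1.0, heldSeconds / requiredSeconds); }
    }

    // Call once per frame. Returns true only on the frame the hold completes.
    public bool Update(bool poseHeld, double deltaSeconds)
    {
        if (!poseHeld)
        {
            heldSeconds = 0;   // the pose was dropped, so start over
            return false;
        }
        bool wasComplete = heldSeconds >= requiredSeconds;
        heldSeconds += deltaSeconds;
        return !wasComplete && heldSeconds >= requiredSeconds;
    }
}
```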
Starting with Kinect Gestures
Start small. Really small. First, make sure you know your way around the SkeletonData class. There are 20 joints tracked on each skeleton, and they each have a TrackingState. This tracking state will show whether the Kinect can actually see the joint (Tracked), if it is figuring out the joint's position based on the rest of the skeleton (Inferred), or if it has entirely abandoned trying to find the joint (NotTracked). These states are important. You don't want to think the user is standing on one leg simply because the Kinect doesn't see the other leg and is reporting a bogus position for it. Each joint has a position, which is how you know where the user is standing, piece by piece. Become familiar with the coordinate system.
After you know the basics of how the skeleton data is reported, try for some simple gestures. Print a message to the screen when the user raises a hand above their head. This only requires comparing each hand to the Head joint and seeing if either hand is higher than the head in the coordinate plane. After you get that working, move up to something more complicated. I'd suggest trying a swiping motion (hand in front of body, moves either right to left or left to right some minimum distance). This requires information from past frames, so you'll have to think through what information to store. If you can get that working, you could try stringing together a series of swiping gestures in a small amount of time and interpreting that as a wave.
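As a rough illustration of why the tracking state matters, a one-leg check could refuse to answer unless both feet are actually seen. The `TrackingState` enum, `FootJoint` struct, and `minLift` threshold below are hypothetical stand-ins named after the states described above, not the SDK's own types.

```csharp
using System;

// Hypothetical mirror of the three states described above.
public enum TrackingState { NotTracked, Inferred, Tracked }

public struct FootJoint
{
    public float Y;               // height of the foot, Y increasing upward
    public TrackingState State;
    public FootJoint(float y, TrackingState state) { Y = y; State = state; }
}

public static class LegPose
{
    // True only when both feet are actually seen and one is clearly higher.
    // minLift is an assumed threshold in the same units as Y.
    public static bool IsStandingOnOneLeg(FootJoint left, FootJoint right, float minLift)
    {
        if (left.State != TrackingState.Tracked || right.State != TrackingState.Tracked)
            return false;   // don't trust inferred or missing joints
        return Math.Abs(left.Y - right.Y) >= minLift;
    }
}
```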
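The swipe-then-wave progression might be sketched like this. `WaveDetector` and all of its thresholds are hypothetical, it only looks at the hand's X position, and real skeleton data is noisy enough that you would probably want to smooth it first.

```csharp
using System;
using System.Collections.Generic;

public class WaveDetector
{
    private readonly float minSwipeDistance;   // assumed minimum travel for one swipe
    private readonly int swipesForWave;        // how many swipes count as a wave
    private readonly double waveWindowSeconds; // the swipes must fall within this span
    private readonly Queue<double> swipeTimes = new Queue<double>();

    private bool hasPrevious;
    private float previousX;
    private float runStartX;     // where the current left or right run began
    private int runDirection;    // -1 left, +1 right, 0 unknown
    private bool runReported;    // each run counts as at most one swipe

    public WaveDetector(float minSwipeDistance, int swipesForWave, double waveWindowSeconds)
    {
        this.minSwipeDistance = minSwipeDistance;
        this.swipesForWave = swipesForWave;
        this.waveWindowSeconds = waveWindowSeconds;
    }

    // Feed one hand X position per frame; returns true while a wave is detected.
    public bool Update(float handX, double timeSeconds)
    {
        if (hasPrevious)
        {
            int direction = Math.Sign(handX - previousX);
            if (direction != 0 && direction != runDirection)
            {
                // The hand reversed, so a new run starts at the turning point.
                runDirection = direction;
                runStartX = previousX;
                runReported = false;
            }
            if (!runReported && Math.Abs(handX - runStartX) >= minSwipeDistance)
            {
                runReported = true;
                swipeTimes.Enqueue(timeSeconds);
            }
        }
        else
        {
            runStartX = handX;
        }
        previousX = handX;
        hasPrevious = true;

        // Forget swipes that are too old to belong to the current wave.
        while (swipeTimes.Count > 0 && timeSeconds - swipeTimes.Peek() > waveWindowSeconds)
            swipeTimes.Dequeue();
        return swipeTimes.Count >= swipesForWave;
    }
}
```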
tl;dr: Gestures are hard. Start small, build your way up. Don't make users do repetitive motions for a single action, it's tiring and annoying. Include visual feedback for duration based gestures. Read the rest of this post.

The Kinect SDK helps you get the coordinates of different joints. A gesture is nothing but change in position of a set of joints over a period of time.
To recognize gestures, you have to store the coordinates for a period of time and iterate through them to see if they obey the rules for a particular gesture (for example, the right hand always moves upwards).
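As a minimal sketch of such a rule, the "right hand always moves upwards" example could be checked over a stored list of Y values. `GestureRules` and the `minRise` parameter are hypothetical additions for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class GestureRules
{
    // True if every stored Y value is higher than the one before it
    // and the total rise is at least minRise (an assumed threshold).
    public static bool MovesSteadilyUpward(IList<float> handYHistory, float minRise)
    {
        if (handYHistory.Count < 2) return false;
        for (int i = 1; i < handYHistory.Count; i++)
        {
            if (handYHistory[i] <= handYHistory[i - 1]) return false;
        }
        return handYHistory[handYHistory.Count - 1] - handYHistory[0] >= minRise;
    }
}
```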
For more details, check out my blog post on the topic:
http://tinyurl.com/89o7sf5

Related

Is it possible to detect two separate swipes? (Unity C#)

I'm trying to make a simple but unusual 2D game for Android using Unity, in which the player is supposed to move two different objects in two different halves of the screen (yeah, that sounds stupid, sorry) AT THE SAME TIME. Most of the work is done, but I've found a problem I have no idea how to deal with: I can't swipe in both halves of the screen at the same time. This is because my swipe script checks the position of the first touch and then waits for the swipe to end. So if I touch both halves of the screen and then complete a swipe by lifting my fingers, the script only detects the swipe in the half I touched first.
Sorry for bad description. :(
Any thoughts?

The most obvious approach (assuming I understood your problem correctly) is to change your input code so that it works with several touches rather than a single one. It should work like this:
detect new touch (check if the Input.touchCount is greater than the number of currently tracked touches)
store info about handled touch (you can use fingerId value from Touch structure)
find an object, that will be affected by this touch (it seems like you already have this functionality)
listen to ALL updates for ALL touches to affect corresponding objects if it is needed.
remove handled touch on finger release from tracked touches
So you should work with ALL touches, not just the first one. Also, make sure that you have Input.multiTouchEnabled set to true.
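The bookkeeping behind those steps can be sketched without any Unity types. `MultiSwipeTracker` and `Swipe` are hypothetical names; in Unity you would presumably call `TouchBegan` and `TouchEnded` while looping over `Input.touches` and switching on each touch's `phase`, passing its `fingerId` and `position`.

```csharp
using System;
using System.Collections.Generic;

public struct Swipe
{
    public bool StartedInLeftHalf;
    public float DeltaX, DeltaY;
}

public class MultiSwipeTracker
{
    private readonly float screenWidth;
    // Start position of every finger currently on the screen, keyed by finger id.
    private readonly Dictionary<int, float[]> startByFinger = new Dictionary<int, float[]>();

    public MultiSwipeTracker(float screenWidth) { this.screenWidth = screenWidth; }

    public void TouchBegan(int fingerId, float x, float y)
    {
        startByFinger[fingerId] = new[] { x, y };
    }

    // Returns false if this finger was never tracked; otherwise fills in the swipe
    // and stops tracking the finger.
    public bool TouchEnded(int fingerId, float x, float y, out Swipe swipe)
    {
        swipe = new Swipe();
        float[] start;
        if (!startByFinger.TryGetValue(fingerId, out start)) return false;
        startByFinger.Remove(fingerId);

        swipe.StartedInLeftHalf = start[0] < screenWidth / 2f;
        swipe.DeltaX = x - start[0];
        swipe.DeltaY = y - start[1];
        return true;
    }
}
```

Because each finger has its own entry, a swipe in one half of the screen ends independently of whatever the other finger is doing.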

Kinect v2 Hand to Mouse position drops on hand close

I am developing a WPF App that uses Kinect v2, and I use the hand to simulate the mouse. It works but I have a little problem - when I close the hand I simulate a click but the cursor drops its position a little bit relative to when the hand was open and sometimes it will end in a click in the wrong button or place.
Any ideas on how I can solve this?
I already tried to track the wrist and the thumbs instead of the hand but the problem still happens.
Thanks!

Here are some ideas:
Filter and smooth the hand position data a bit more. For a UI/menu system, it should be acceptable to have some latency as it doesn't require reduced latency as much as other uses.
Modify the hand position based on the hand's open/closed state. Introduce a constant to bump up the hand position when the hand is closed, with appropriate smoothing to get this to feel and look correct.
Keep a list of hand positions and use the data from a few frames before (though it might be tricky to get this to feel and look correct).
As a note, also consider these points:
Use bigger buttons. Buttons should have appropriate spacing, placement, and sizes. The app's UI should be specifically designed for a Kinect application.
Use a different gesture for a mouse click, such as push or press, which is the recommended approach in the Kinect Human Interface Guidelines 2.0.
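The smoothing and "few frames before" ideas could be combined in one small helper, sketched below. `HandCursor`, the smoothing factor, and the lookback count are all hypothetical and would need tuning by feel; the smoothing here is plain exponential smoothing, chosen only because it is simple.

```csharp
using System;
using System.Collections.Generic;

public class HandCursor
{
    private readonly float smoothing;     // 0 = raw input, closer to 1 = smoother but laggier
    private readonly int lookbackFrames;  // how far back to reach when the hand closes
    private readonly Queue<float[]> history = new Queue<float[]>();
    private bool initialized;

    public float X { get; private set; }
    public float Y { get; private set; }

    public HandCursor(float smoothing, int lookbackFrames)
    {
        this.smoothing = smoothing;
        this.lookbackFrames = lookbackFrames;
    }

    // Call once per frame with the raw hand position in screen coordinates.
    public void Update(float rawX, float rawY)
    {
        if (!initialized)
        {
            X = rawX;
            Y = rawY;
            initialized = true;
        }
        else
        {
            X = smoothing * X + (1f - smoothing) * rawX;
            Y = smoothing * Y + (1f - smoothing) * rawY;
        }
        history.Enqueue(new[] { X, Y });
        if (history.Count > lookbackFrames + 1) history.Dequeue();
    }

    // Position to click at when the hand closes: where the cursor was
    // lookbackFrames ago, before the closing motion dragged it down.
    // Only valid after at least one Update call.
    public void GetClickPosition(out float clickX, out float clickY)
    {
        float[] oldest = history.Peek();
        clickX = oldest[0];
        clickY = oldest[1];
    }
}
```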

XNA: Have I clicked on anything?

I'm making a game in XNA and currently I'm checking the coordinates of the mouse click against the coordinates of each object that can be clicked.
This is fine for my small game but for larger games it would become CPU intensive to check through every object for each frame.
Is there a better way to approach this?

You will want to partition your world space with some sort of spatial data structure, such as a quadtree.
In your most basic form you'll want to be able to take all objects and be able to quickly throw out a bunch of them before you even do your detailed check. For instance, if you are clicking on the right side of the screen you want to throw out everything on the left side of the screen automagically.
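To show the "throw out most objects first" idea in a few lines, the sketch below uses a uniform grid rather than a quadtree, since it is shorter; a quadtree follows the same principle with adaptive cell sizes. `Box` and `ClickGrid` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public struct Box
{
    public int Id;
    public float X, Y, Width, Height;
    public Box(int id, float x, float y, float width, float height)
    {
        Id = id; X = x; Y = y; Width = width; Height = height;
    }
    public bool Contains(float px, float py)
    {
        return px >= X && px <= X + Width && py >= Y && py <= Y + Height;
    }
}

public class ClickGrid
{
    private readonly float cellSize;
    private readonly Dictionary<long, List<Box>> cells = new Dictionary<long, List<Box>>();

    public ClickGrid(float cellSize) { this.cellSize = cellSize; }

    private int Cell(float v) { return (int)Math.Floor(v / cellSize); }
    private static long Key(int cx, int cy) { return ((long)cx << 32) | (uint)cy; }

    // Register the box in every cell it overlaps.
    public void Add(Box box)
    {
        for (int cx = Cell(box.X); cx <= Cell(box.X + box.Width); cx++)
        {
            for (int cy = Cell(box.Y); cy <= Cell(box.Y + box.Height); cy++)
            {
                List<Box> list;
                if (!cells.TryGetValue(Key(cx, cy), out list))
                {
                    list = new List<Box>();
                    cells[Key(cx, cy)] = list;
                }
                list.Add(box);
            }
        }
    }

    // Only the boxes in the clicked cell get the detailed check.
    // Returns the id of the first box hit, or -1 if nothing was hit.
    public int HitTest(float px, float py)
    {
        List<Box> list;
        if (!cells.TryGetValue(Key(Cell(px), Cell(py)), out list)) return -1;
        foreach (Box box in list)
        {
            if (box.Contains(px, py)) return box.Id;
        }
        return -1;
    }
}
```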

Dealing with lag in XNA + lidgren

I am experimenting with lidgren in XNA and I'm having some issues with the 'lag'.
I've downloaded their XNA sample and noticed that even their sample lags. The thing is, the movement is not smooth on the other side, and I'm trying this on a LAN (on the same computer actually) not over the internet.
Has anyone had the same issues with unsmooth movement due to a lagging connection with lidgren and XNA?

The sample you linked directly sets the position to whatever it receives from the network. This is a bad idea for a multiplayer game!
What you should do in a real game is interpolate between the local position and the remote position. So, your receive method would look a little like this:
void Receive(Packet packet)
{
    // Only record where the server says the unit is; the local position is untouched here.
    unit.RemoteX = packet.Read_X_Position();
    unit.RemoteY = packet.Read_Y_Position();
}
This has no effect on the local position of the unit. Instead, in your update method (every frame), you move the local position towards the remote position:
void Interpolate(float deltaTime)
{
    // Shown for X only; do the same for Y.
    float difference = unit.RemoteX - unit.LocalX;
    if (Math.Abs(difference) < threshold)
        unit.LocalX = unit.RemoteX;   // close enough: snap into place
    else
        unit.LocalX += difference * deltaTime * interpolation_constant;   // otherwise close part of the gap
}
You then display the "local" position of the unit. This achieves lagless movement like so:
If the unit position is almost at the remote position, it will jump to the remote position (however, it will jump such a tiny distance that it won't look laggy).
If the difference is too big to jump, then move slowly towards the position you should be in.
Since the unit moves smoothly towards where it should be, it looks like there is no lag at all!
The interpolation constant controls how fast the local and remote positions will converge:
0: Ignore network updates
Small: Slide slowly into place, looks smooth but may feel unresponsive
Large: Snap into place very quickly (possibly look laggy)
You need to choose a compromise somewhere in between these options.
There are some other things to consider when implementing this kind of system, for example you often want an upper limit on how far apart units can be from their remote position otherwise the local and remote state can become "unstuck" in some situations. If they are too far apart (which should never happen except in cases of extreme lag) you can either halt the game and tell the user it's too laggy, or jump the unit straight into position, which will look laggy but at least the game will continue.
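The upper limit could be folded into the interpolation step as in the sketch below. `InterpolatedAxis` and `maxError` are hypothetical names, it handles one axis only, and it takes the "jump straight into position" option. The clamp on the step size is an extra safeguard not in the original code, added so that a large constant cannot overshoot the target.

```csharp
using System;

public class InterpolatedAxis
{
    public float Local;    // what gets drawn
    public float Remote;   // last value received from the network

    private readonly float snapThreshold;          // below this, just snap
    private readonly float maxError;               // above this, give up and jump
    private readonly float interpolationConstant;  // higher converges faster

    public InterpolatedAxis(float snapThreshold, float maxError, float interpolationConstant)
    {
        this.snapThreshold = snapThreshold;
        this.maxError = maxError;
        this.interpolationConstant = interpolationConstant;
    }

    // Call every frame with the elapsed time in seconds.
    public void Update(float deltaTime)
    {
        float difference = Remote - Local;
        float distance = Math.Abs(difference);
        if (distance < snapThreshold || distance > maxError)
        {
            // Either close enough to be invisible, or too far apart to recover smoothly.
            Local = Remote;
        }
        else
        {
            // Clamp so one step never moves past the remote position.
            float step = Math.Min(1f, deltaTime * interpolationConstant);
            Local += difference * step;
        }
    }
}
```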
Addendum: Rereading this answer, it occurs to me that an enhancement would be to track time differences. If you know (roughly) what the lag in the system is, then when you receive a packet containing a remote position, you know roughly how far in the past that packet was sent. If you send the remote velocity too, you can predict where the object is now (assuming constant velocity). This may make the difference between the estimated local state and the true remote state smaller in some games; in other games (where you have lots of changing velocities) it might make things worse.
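The prediction in the addendum is a one-line extrapolation. A minimal sketch, assuming constant velocity and a latency estimate you obtain elsewhere (`DeadReckoning` is a hypothetical name):

```csharp
using System;

public static class DeadReckoning
{
    // Estimate where the remote object is now, given where it was when the
    // packet was sent, how fast it was moving, and how old the packet is.
    public static float PredictPosition(float remotePosition, float remoteVelocity, float latencySeconds)
    {
        return remotePosition + remoteVelocity * latencySeconds;
    }
}
```

The predicted value would then be used as the remote position that the local position converges towards.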

I've been looking at writing a multiplayer fps game, starting with a demo of just moving some cubes around and replicating the position/rotations on another machine, which is in a spectator mode.
I'm using your code sample above and it's working well (I've had to tweak the interpolation constant higher than 1 to make it look smooth).
I've seen a few interpolation examples which take into account the time difference between the current time and the time stamp on the received message.
I see this code does not use the time difference, so interpolation will take as long as it needs to, to get to the target value (or at least within the threshold value to then snap into position). My question is, is there any advantage to this?
Many thanks.

Programmable cameras C# for vehicle system

I recently joined a project where I need to build a vehicle-based computer vision system. What special capabilities does a camera need in order to capture images while traveling at varying speeds? For example, how high a frame rate is required, and what exposure duration and shutter speed? Do you think webcams (even high-end ones) will be able to achieve this? The project requires the camera to be programmable in C#.
Thank you very much in advance!

Unless video is capable of producing high-quality, low-blur images, I would go with a camera with a really fast shutter speed and a very short exposure duration. As for frame rate, following Seth's math, 44 centimeters per frame is a little more than a foot, which should be decent for calculations.
Reaction time for a human to respond to someone hitting the brakes in front of them is 1.5 seconds. If you can determine that they hit their brake light within 1/30th of a second, and it takes you 1 second to calculate and apply the brakes, you have already beaten a human's reaction time.
How fast your shutter speed needs to be depends on how fast your vehicle is moving. A faster shutter speed reduces motion blur, giving a more accurate picture to analyze.
Try different speeds (if you can get a camera with this value configurable, might help).

I'm not sure that's an answerable question. It sounds like the sort of thing that the Darpa Grand Challenge hopes to determine :)
With regard to frame rate: If your vehicle is going 30 miles per hour, a 30 FPS webcam will capture one frame for every 44 centimeters the vehicle travels. Whether or not that's "enough" depends on what you're planning to do with the image.
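The arithmetic behind that figure is easy to rerun for other speeds and frame rates. In this sketch, `CameraMath` is a hypothetical helper; the second method gives how far the vehicle travels while the shutter is open, which is only a rough proxy for motion blur because the blur in the image also depends on distance to the subject and the optics.

```csharp
using System;

public static class CameraMath
{
    private const double MetresPerMile = 1609.344;

    // Distance the vehicle covers between two consecutive frames.
    public static double MetresPerFrame(double milesPerHour, double framesPerSecond)
    {
        double metresPerSecond = milesPerHour * MetresPerMile / 3600.0;
        return metresPerSecond / framesPerSecond;
    }

    // Distance the vehicle covers while the shutter is open.
    public static double MetresDuringExposure(double milesPerHour, double exposureSeconds)
    {
        double metresPerSecond = milesPerHour * MetresPerMile / 3600.0;
        return metresPerSecond * exposureSeconds;
    }
}
```

At 30 miles per hour and 30 frames per second, `MetresPerFrame` gives about 0.447 metres, matching the figure above.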

Not sure about the out-of-the-box C# programmability, but a specific webcam-style camera to consider would be the PS3 Eye.
It was specially engineered for motion capture and (as I understand it) is capable of higher-quality images at high frame rates than the majority of the competition. Windows drivers are available for it, and that opens the door for creating a C# wrapper.
Here is the product page, note the 120fps upper-end spec (not sure that the Windows drivers run at this rate, but obviously the hardware is capable of it).
One note on shutter speed: images taken at a high frame rate in low light will likely be underexposed and unusable. If you need this to work in varying light conditions, then the frame rate will likely either need to be fixed at the low end of your acceptable range, or will need to self-adjust based on available light.

These guys: Mobileye - develop such commercial systems for lane departure warnings and vehicle and pedestrian detection.
If you go to the "Manufacturer Products->Development and Evaluation Platforms->Cameras"
You can see what they use as cameras and also for their processing platforms.
30 fps should be sufficient for the applications mentioned above.

If money isn't an issue, take a look at cameras from companies like Opeton and others. You can control every aspect of every image capture, including capture time, image size, and more.

My iPhone can take pictures out the side of a car that are fairly blur free... past 10-20 feet. Inside of that, things are simply moving too fast; the shutter speed would need to be higher to not blur that.
Start with a middle-of-the-road webcamera, and move up as necessary? A laptop and a ride in your car while capturing still images would probably give you an idea of how well it works.
