If that is your core requirement, then simply go ahead and build it yourself (as you mentioned in your first post). It shouldn't be too complicated.
As a starting point, have a look here; it already does the same thing based on camera images. With some changes it should work with video, too.