An improved stereo video analytics solution

Jonathan Takahashi · 5 min read

Recently CVision AI completed work on an underwater stereo camera we call ShoalSight, along with an algorithm-assisted workflow implemented in Tator. This work was funded by a grant from the Massachusetts Division of Marine Fisheries and performed in collaboration with UMass Dartmouth's School for Marine Science and Technology (SMAST). SMAST conducts stock assessment surveys using video data captured from within an open trawl net, reducing both the ecological impact and the time required to conduct the survey. The goal of our work was to improve the quality of the video data through hardware upgrades and to reduce the time needed to analyze the video through increased algorithm assistance. The image below shows Tator being used for a monoscopic workflow (A) and the codend setup (B and C) for the SMAST video trawl survey. Stereo cameras were introduced to allow for length measurements in addition to counting and classification.

Beyond COTS stereo cameras

The first part of this work was building an imaging system to deal with the tough requirements of this project. Fish in the video footage are often moving quickly through the net, which can induce motion blur even for fairly short integration times. There is virtually no natural light at typical depths, so light levels are limited. A large field of view (FOV) is required so all fish are captured in the video. To ensure fish close to the camera can be classified properly, the cameras need a large depth of field.

Prior to this grant, a commercial off-the-shelf (COTS) stereo camera was used for this purpose. However, in addition to significant motion blur and dark footage, this camera suffered from its use of a rolling shutter. Due to the rolling shutter effect, in which motion distorts the imagery because rows are exposed at slightly different times, fish length measurements were found to be inaccurate. Most COTS cameras use rolling shutter sensors because they cost much less than global shutter sensors, and they typically prioritize small form factor over aperture and focal length, both of which are important in low-light conditions.
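
To get a sense of the scale of this distortion, here is a back-of-the-envelope sketch of rolling shutter skew; all parameter values are illustrative assumptions, not measurements from the survey footage:

```python
# Back-of-the-envelope rolling shutter skew estimate.
# All parameter values are illustrative assumptions.

readout_time = 1 / 60   # s to read out the full frame (assumed)
rows = 1080             # sensor rows (assumed)
fish_speed = 2000       # apparent horizontal speed, px/s (assumed)
fish_height = 300       # rows the fish spans in the image (assumed)

# Each successive row is exposed readout_time / rows later, so a
# horizontally moving feature shifts by this much from row to row:
skew_per_row = fish_speed * readout_time / rows

# Total shear across the fish -- enough to bias a length measurement.
total_skew = skew_per_row * fish_height
print(f"skew across fish: {total_skew:.1f} px")  # ~9.3 px here
```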

Our imaging solution addresses these issues with several strategies. We use a global shutter sensor to eliminate motion-induced distortion. To mitigate motion blur we limit sensor integration time to two milliseconds, and we optimize for low light using pixel binning, a low f-stop lens, and a balance between focal length and depth of field. The underwater housings for our system use dome ports, which minimize optical distortion and relax the depth-of-field requirements. We chose a large-format sensor to increase the field of view and the per-pixel surface area, capturing more light per pixel.
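
These trade-offs can be made concrete with standard thin-lens formulas. The sketch below is illustrative only; the parameter values are assumptions, not our actual lens specification:

```python
import math

# Illustrative optics calculations; parameter values are assumptions,
# not the actual ShoalSight lens specification.
sensor_width = 14.0   # mm, large-format sensor (assumed)
focal_length = 8.0    # mm (assumed)
f_number = 1.8        # low f-stop for light gathering (assumed)
coc = 0.015           # mm, circle of confusion (assumed)

# Horizontal field of view: a wider sensor or shorter focal length
# widens the FOV, helping capture every fish passing through the net.
fov = 2 * math.degrees(math.atan(sensor_width / (2 * focal_length)))

# Hyperfocal distance: focusing here keeps everything from H/2 to
# infinity acceptably sharp. A shorter focal length or higher f-number
# shrinks H, extending depth of field -- hence the balance against
# light gathering.
hyperfocal = focal_length**2 / (f_number * coc) + focal_length  # mm

print(f"horizontal FOV: {fov:.1f} deg")            # ~82.4 deg
print(f"hyperfocal distance: {hyperfocal / 1000:.2f} m")  # ~2.38 m
```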

The image below shows a comparison of the COTS camera (left) and ShoalSight (right) for two fish moving rapidly across the field of view.

The cameras are controlled by a single-board computer in a separate underwater housing. Using hardware triggering, we were able to synchronize cross-channel frame captures to within 200 microseconds. The video is encoded in real time to a Tator-compatible streaming format, which also reduces file size and increases recording time.
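
As a rough sketch of the real-time encoding step, raw frames can be piped into ffmpeg for H.264 encoding to a streamable MP4. The resolution, pixel format, and flags below are assumptions for illustration, not the actual ShoalSight pipeline:

```python
import subprocess

# Hypothetical sketch: pipe raw frames into ffmpeg for real-time H.264
# encoding. Resolution, frame rate, pixel format, and flags are
# illustrative assumptions, not the actual ShoalSight pipeline.
WIDTH, HEIGHT, FPS = 1920, 1080, 30

encoder = subprocess.Popen(
    [
        "ffmpeg",
        "-f", "rawvideo", "-pix_fmt", "rgb24",      # raw frames on stdin
        "-s", f"{WIDTH}x{HEIGHT}", "-r", str(FPS),
        "-i", "-",
        "-c:v", "libx264", "-preset", "ultrafast",  # keep up with capture
        "-movflags", "frag_keyframe+empty_moov",    # streamable, crash-safe
        "left_channel.mp4",
    ],
    stdin=subprocess.PIPE,
)

def write_frame(frame_bytes: bytes) -> None:
    """Push one raw RGB frame (WIDTH * HEIGHT * 3 bytes) to the encoder."""
    encoder.stdin.write(frame_bytes)
```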

Integration for trawl surveys

Our camera system includes a custom mechanical mount that bolts onto a plastic ring sewn into the net. The underwater housings are depth-rated to over 500 m and feature large dome ports, which help with depth of field. The image below shows the ShoalSight imaging system mounted for the trawl survey.

The cases were pulled down to a vacuum through a pressure relief valve to prevent internal condensation and to verify there were no leaks. Currently a battery powers the cameras; future work will add the option of power and a live video feed through a cable.

Algorithm assisted stereo annotation

At CVision we believe in spiral development and the gradual introduction of algorithms into annotation workflows. Initially, the stereo annotation workflow was mostly manual, requiring the annotator to draw lines along the length of each fish and to associate those lines between stereo channels. An applet was developed to compute the length of each fish by adapting code from SEBASTES. Drawing and associating the lines, as well as keeping track of which fish had already been counted, consumed most of the annotators' time.
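
The core of that length computation is standard stereo triangulation. A minimal sketch with OpenCV follows, assuming calibrated 3x4 projection matrices from stereo calibration; this is an illustration, not the SEBASTES code itself:

```python
import cv2
import numpy as np

def fish_length(p1_left, p2_left, p1_right, p2_right, P_left, P_right):
    """Triangulate the two endpoints of a fish length line and return
    the 3D distance between them.

    p*_left / p*_right: (x, y) pixel coordinates of the line endpoints
    in each channel; P_left / P_right: 3x4 camera projection matrices
    from stereo calibration. Hypothetical sketch, not the SEBASTES
    implementation.
    """
    left = np.float64([p1_left, p2_left]).T      # 2x2: columns are points
    right = np.float64([p1_right, p2_right]).T
    # Homogeneous 3D points, one column per endpoint.
    X = cv2.triangulatePoints(P_left, P_right, left, right)
    X = (X[:3] / X[3]).T                         # 2x3 Euclidean points
    # Length is the distance between the two triangulated endpoints,
    # in the units of the calibration (e.g. mm).
    return float(np.linalg.norm(X[0] - X[1]))
```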

To improve this, we fine-tuned an existing object detection and segmentation algorithm on footage from the stereo camera. A tracking algorithm was used to associate segmentation masks over time, and a separate algorithm computed the correspondence between segmentation masks across stereo channels. A novel algorithm was developed to draw a line along the length of each fish using its segmentation mask. When it all comes together, we get lines drawn and associated for every fish in every frame, as shown in the video below.
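
We won't detail the line-drawing algorithm here, but one plausible baseline is to use the principal axis of each segmentation mask. The sketch below is that simpler stand-in, not our actual algorithm:

```python
import numpy as np

def mask_to_line(mask: np.ndarray):
    """Fit a length line to a binary fish mask via its principal axis.

    Illustrative stand-in only -- the production line-drawing algorithm
    is more involved. This projects mask pixels onto their first
    principal component and returns the two extreme points.
    """
    ys, xs = np.nonzero(mask)
    pts = np.column_stack([xs, ys]).astype(np.float64)
    centroid = pts.mean(axis=0)
    # First right-singular vector of the centered points is the
    # direction of greatest spread (the fish's long axis).
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    axis = vt[0]
    # Project every mask pixel onto the axis; the extremes bound the line.
    t = (pts - centroid) @ axis
    p_head = centroid + t.max() * axis
    p_tail = centroid + t.min() * axis
    return p_head, p_tail  # endpoints in (x, y) pixel coordinates
```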

In addition, a new applet, written in C++ and compiled to WebAssembly, was developed to quickly compute length measurements for the selected track in the current frame, so the annotator only needs to select a keyframe and assign a species for each fish. In the future, automated keyframe selection and classification may further automate the workflow.