Video super-resolution often reconstructs high-resolution (HR) video from low-resolution (LR) video that has been downsampled with predefined methods, which is an ill-posed problem. Recent video rescaling algorithms alleviate this problem by jointly training the downsampling and upsampling processes. However, they primarily exploit shallow temporal correlations among video frames, overlooking the intricate, long-term deep temporal dependencies within the video. In this paper, we propose a video rescaling network with omniscient feature alignment, namely OFA-VRN, which leverages bidirectional deep temporal information. In the downsampling phase, the proposed method separates the input HR video into LR frames and high-frequency components using the Haar wavelet transform and explicitly embeds the high-frequency components into the LR frames. In this way, detail information is preserved within the frames while the downsampled videos retain good visual quality. During the upsampling phase, we use an advanced bidirectional propagation paradigm to enhance temporal information aggregation. By incorporating the proposed omniscient feature alignment, the network can leverage multi-frame feature information along the triplet dimension to further alleviate misalignment, thereby strengthening its use of deep temporal information. Experiments on Vid4 and Vimeo90K-T demonstrate that our model achieves competitive performance compared to state-of-the-art methods.
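As a rough illustration of the downsampling step, the sketch below performs a single-level 2D Haar decomposition of one frame, splitting it into a 2x-downsampled low-frequency band and three high-frequency detail bands. The embedding of those detail bands back into the LR frames and the omniscient feature alignment itself are not shown, and the function name and normalization are illustrative rather than taken from the paper.

```python
import numpy as np

def haar_dwt2(frame: np.ndarray):
    """Single-level 2D Haar decomposition of an H x W frame (H, W even).

    Returns the low-frequency band LL (a 2x-downsampled frame) and the
    three high-frequency bands (LH, HL, HH) carrying the detail
    information that would be re-embedded into the LR frame.
    """
    a = frame[0::2, 0::2]
    b = frame[0::2, 1::2]
    c = frame[1::2, 0::2]
    d = frame[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low-frequency (LR) component
    lh = (a - b + c - d) / 2.0   # horizontal detail
    hl = (a + b - c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, (lh, hl, hh)
```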
2023
TGRS
DOPNet: Dense Object Prediction Network for Multi-Class Object Counting and Localization in Remote Sensing Images
Object counting and localization for remote sensing images are effective means of solving large-scale object analysis problems. Most existing counting methods obtain the number of objects by employing a convolutional neural network to regress a density map. Although these leading methods have achieved impressive performance, they focus only on estimating the number of single-class objects, provide no location information, and cannot support multi-class objects. To tackle these problems, a point-based network named Dense Object Prediction Network (DOPNet) is proposed for multi-class object counting and localization in remote sensing images. DOPNet differs from the conventional approach of predicting multiple density maps by incorporating category attributes into the predicted objects, enabling accurate counting and localization of multi-class objects. Specifically, DOPNet adopts a multi-scale architecture to provide dense predictions of object proposals. A Scale Adaptive Feature Enhancement Module (SAFEM) is designed to predict object scales for the suppression of duplicate proposals. Given only point-level annotations for training, a pseudo box generation algorithm is designed to find the most suitable pseudo box for each annotated object to supervise scale learning. Comprehensive experiments show that DOPNet achieves strong performance on challenging counting benchmarks while also providing object locations. Code and pre-trained models are available at https://github.com/Ceoilmp/DOPNet.
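The pseudo box generation algorithm is only summarized above; as a hedged stand-in (not the authors' exact procedure), the sketch below derives a square pseudo box for each annotated point from its nearest-neighbor distance, which is one common way to approximate object scale from point labels alone.

```python
import numpy as np

def pseudo_boxes_from_points(points: np.ndarray, default_size: float = 16.0):
    """Illustrative pseudo-box heuristic (not the paper's exact algorithm).

    For each annotated point, the box side length is set to the distance
    to its nearest annotated neighbor, loosely reflecting object scale in
    densely packed remote sensing scenes.

    points: (N, 2) array of (x, y) annotations.
    Returns an (N, 4) array of boxes in (x1, y1, x2, y2) format.
    """
    n = len(points)
    boxes = np.zeros((n, 4), dtype=np.float32)
    for i, (x, y) in enumerate(points):
        if n > 1:
            d = np.linalg.norm(points - points[i], axis=1)
            d[i] = np.inf                 # ignore distance to itself
            side = float(d.min())
        else:
            side = default_size           # fallback for a single annotation
        half = side / 2.0
        boxes[i] = (x - half, y - half, x + half, y + half)
    return boxes
```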
2022
TGRS
Object Counting for Remote-Sensing Images via Adaptive Density Map-Assisted Learning
Object counting has attracted considerable attention in remote sensing image analysis. In density-map-based object counting algorithms, the ground truth density maps generated with fixed-size Gaussian kernels ignore the spatial features of the objects. In this paper, an Adaptive Density Map Assisted Learning algorithm (ADMAL) is proposed, which exploits the spatial features of the objects starting from the ground truth density map generation phase. ADMAL consists of two networks: a Contexture Aware Density Map Generation (CADMG) network and a Transformer-based Density Map Estimation (TDME) network. The CADMG network is designed to generate a ground truth density map from each annotated point map. Compared with Gaussian-convolved density maps, the ground truth density maps generated by CADMG are tailored to the texture and neighborhood relationships among objects, which improves the training of the TDME network. TDME is the core network for object counting. The backbone of the TDME network adopts a Swin Transformer structure, whose self-attention mechanism provides a larger receptive field for effective feature extraction in remote sensing images. Comprehensive experiments show that the ground truth density maps generated by CADMG help various density map estimation networks train more effectively, among which TDME achieves the best performance. Moreover, ADMAL achieves favorable object counting performance on both satellite-based and drone-based images. Code and pre-trained models are available at https://github.com/gcding/ADMAL-pytorch.
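To illustrate why adapting the kernel to object layout matters, the sketch below builds a geometry-adaptive Gaussian density map whose kernel width follows the local point spacing. This is a classic hand-crafted heuristic shown only for contrast with fixed-size kernels; it is not the learned CADMG network described in the paper, and the `beta` and `k` values are arbitrary.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def adaptive_density_map(points, shape, beta=0.3, k=3):
    """Geometry-adaptive Gaussian density map (illustrative, not CADMG).

    Each annotated point is spread with a Gaussian whose sigma is
    proportional to the mean distance to its k nearest neighbors, so the
    kernel shrinks in crowded regions and widens in sparse ones.
    """
    h, w = shape
    density = np.zeros((h, w), dtype=np.float32)
    pts = np.asarray(points, dtype=np.float32)
    for i, (x, y) in enumerate(pts):
        dists = np.sort(np.linalg.norm(pts - pts[i], axis=1))[1:k + 1]
        sigma = beta * dists.mean() if len(dists) else 8.0
        impulse = np.zeros((h, w), dtype=np.float32)
        impulse[int(min(y, h - 1)), int(min(x, w - 1))] = 1.0
        density += gaussian_filter(impulse, sigma)
    return density  # integrates (approximately) to the object count
```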
TMM
Crowd Counting via Unsupervised Cross-Domain Feature Adaptation
Given an image, crowd counting aims to estimate the number of target objects in the image. Owing to the unpredictable installation conditions of surveillance systems (or other equipment), crowd counting images from different datasets may exhibit severe discrepancies in viewing angle, scale, lighting conditions, etc. As annotating each dataset for model training is usually expensive and time-consuming, transferring a model trained on a labeled dataset (source domain) to a new dataset (target domain) has become an essential issue in crowd counting. To tackle this problem, we propose a cross-domain learning network that bridges the domain gap in an unsupervised manner. The proposed network comprises a Multi-granularity Feature-aware Discriminator (MFD) module, a Domain-Invariant Feature Adaptation (DFA) module, and a Cross-domain Vanishing Bridge (CVB) module, which together remove domain-specific information from the extracted features and improve the mapping performance of the network. Unlike most existing methods that use only a Global Feature Discriminator (GFD) to align features at the image level, an additional Local Feature Discriminator (LFD) is inserted and, together with the GFD, forms the MFD module. As a complement to the GFD, the LFD refines features at the pixel level and is able to align local features. The DFA module explicitly measures the distances between source domain features and target domain features and aligns the marginal distributions of their features with Maximum Mean Discrepancy (MMD). Finally, the CVB module provides an incremental capability to remove the impact of the interfering parts of the extracted features. Several well-known networks are adopted as the backbone of our algorithm to prove the effectiveness of the proposed adaptation structure. Comprehensive experiments demonstrate that our model achieves performance competitive with state-of-the-art methods. Code and pre-trained models are available at https://github.com/gcding/CDFA-pytorch.
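For the DFA module, the abstract mentions aligning marginal feature distributions with Maximum Mean Discrepancy; below is a minimal PyTorch sketch of a Gaussian-kernel MMD term between source and target feature batches. The bandwidth, feature shapes, and function name are assumptions, and the paper may use a multi-kernel or otherwise different formulation.

```python
import torch

def gaussian_mmd(source_feats: torch.Tensor, target_feats: torch.Tensor,
                 sigma: float = 1.0) -> torch.Tensor:
    """Simple (biased) Gaussian-kernel MMD^2 between two feature batches.

    source_feats: (N, D) features from the labeled source domain.
    target_feats: (M, D) features from the unlabeled target domain.
    Minimizing this term pulls the two marginal feature distributions
    together during training.
    """
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2                # pairwise squared distances
        return torch.exp(-d2 / (2.0 * sigma ** 2))

    k_ss = kernel(source_feats, source_feats).mean()
    k_tt = kernel(target_feats, target_feats).mean()
    k_st = kernel(source_feats, target_feats).mean()
    return k_ss + k_tt - 2.0 * k_st
```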
CVPRW
A Coarse-To-Fine Boundary Localization Method for Naturalistic Driving Action Recognition
Guanchen Ding *, Wenwei Han *, Chenglong Wang *, Mingpeng Cui, Lin Zhou, Dianbo Pan, Jiayi Wang, Junxi Zhang, and Zhenzhong Chen
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun 2022
Naturalistic driving action recognition plays an important role in understanding drivers’ distraction behavior in the traffic environment. The main challenge of this task is accurately localizing the temporal boundary of each distracted driving behavior in the video. Although many temporal action localization methods can identify action classes, predicting accurate temporal boundaries is difficult for this task, since driving actions of the same category usually exhibit large intra-class variation. In this paper, we introduce a Coarse-to-Fine Boundary Localization method called CFBL, which obtains fine-grained temporal boundaries progressively through three stages. Concretely, in the first coarse boundary generation stage, we adopt a modified anchor-free model, the Anchor-Free Saliency-based Detector (AFSD), to make an interval estimate of the temporal boundaries of distraction behavior. In the second boundary refinement stage, we use the Dense Boundary Generation (DBG) model to adjust the estimated interval of the temporal boundaries. In the final boundary decision stage, we build a Localization Boundary Refinement Module to determine the final boundaries of different actions. In addition, we adopt a voting strategy that combines the results of different camera views to enhance the model’s ability to classify distracted driving actions. The experiments conducted on the Track 3 validation set of the 2022 AI City Challenge demonstrate the competitive performance of the proposed method.
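The multi-view voting strategy is described only at a high level; below is a minimal sketch of one plausible fusion rule, averaging per-view classification scores before taking the argmax. The view names and the averaging rule are assumptions, not the paper's exact strategy.

```python
import numpy as np

def fuse_view_scores(view_scores):
    """Fuse per-view classification scores for one temporal segment (illustrative).

    view_scores: dict mapping a camera view name (e.g. "dashboard",
    "rearview", "right_window") to a (num_classes,) score vector.
    Returns the fused class index and the averaged score vector; a hard
    majority vote over per-view argmaxes is an equally simple alternative.
    """
    stacked = np.stack(list(view_scores.values()), axis=0)  # (num_views, C)
    fused = stacked.mean(axis=0)
    return int(fused.argmax()), fused
```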
ICPR
The First Challenge on Moving Object Detection and Tracking in Satellite Videos: Methods and Results
Yulan Guo, Qian Yin, Qingyong Hu, Feng Zhang, Chao Xiao, Ye Zhang, Hanyun Wang, Chenguang Dai, Jian Yang, Zhuang Zhou, and 26 more authors
In 26th International Conference on Pattern Recognition, ICPR 2022, Montreal, QC, Canada, August 21-25, 2022
In this paper, we briefly summarize the first challenge on moving object detection and tracking in satellite videos (SatVideoDT). The challenge has three tracks related to satellite video analysis: moving object detection (Track 1), single object tracking (Track 2), and multiple object tracking (Track 3). In total, 123, 89, and 70 participants successfully registered, and 37, 42, and 29 teams submitted their final results on the test datasets for Tracks 1-3, respectively. The top-performing methods and their results in each track are described in detail. This challenge establishes a new benchmark for satellite video analysis.
ECCVW
Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 Challenge: Report
Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel Younes, Ganzorig Gankhuyag, Jingang Huh, Myeong Kyun Kim, Kihwan Yoon, Hyeon-Cheol Moon, Seungho Lee, and 86 more authors
In Computer Vision - ECCV 2022 Workshops - Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part III
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs, which have many computational and memory constraints. In this Mobile AI challenge, we address this problem and ask the participants to design an efficient quantized image super-resolution solution that can demonstrate real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to perform high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with this NPU, demonstrating up to 60 FPS when reconstructing Full HD images. A detailed description of all models developed in the challenge is provided in this paper.
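As a generic illustration of producing an INT8 model of the kind evaluated in the challenge, the snippet below applies TensorFlow Lite full-integer post-training quantization with a representative dataset. Individual challenge entries may instead rely on quantization-aware training or vendor-specific tooling, and the model path and calibration data here are placeholders.

```python
import tensorflow as tf

def quantize_sr_model(saved_model_dir, representative_images):
    """Post-training INT8 quantization of a super-resolution model (sketch).

    saved_model_dir: path to a float SavedModel (placeholder).
    representative_images: iterable of LR numpy arrays (e.g. DIV2K crops)
    used to calibrate activation ranges.
    Returns the serialized .tflite flatbuffer.
    """
    def representative_dataset():
        for img in representative_images:
            yield [img[None].astype("float32")]   # add a batch dimension

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
    return converter.convert()
```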
2021
ICCVW
VisDrone-CC2021: The Vision Meets Drone Crowd Counting Challenge Results
Zhihao Liu, Zhijian He, Lujia Wang, Wenguan Wang, Yixuan Yuan, Dingwen Zhang, Jinglin Zhang, Pengfei Zhu, Luc Van Gool, Junwei Han, and 29 more authors
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2021
Crowd counting research has evolved quickly by leveraging advances in deep learning. Many researchers have devoted their efforts to crowd counting tasks and achieved significant improvements. However, current datasets barely keep pace with this evolution, and high-quality evaluation data is urgently needed. Motivated by the need for high-quality, large-scale studies in crowd counting, we collect a drone-captured dataset of 5,468 images (2,734 RGB-thermal image pairs). There are 1,807 pairs of images for training and 927 pairs for testing. We manually annotate persons with points in each frame. Based on this dataset, we organized the Vision Meets Drone Crowd Counting Challenge (VisDrone-CC2021) in conjunction with the International Conference on Computer Vision (ICCV 2021). Our challenge attracted many participants, helping to accelerate progress in crowd counting. To summarize the competition, we select the most remarkable algorithms from the participants’ submissions and provide a detailed analysis of the evaluation results. More information can be found at the website: http://www.aiskyeye.com/.
CVPRW
Dual-Modality Vehicle Anomaly Detection via Bilateral Trajectory Tracing
Traffic anomaly detection plays a crucial role in Intelligent Transportation Systems (ITS). The main challenges of this task lie in the highly diversified anomaly scenes and varying lighting conditions. Although much work has managed to identify anomalies under homogeneous weather and scene conditions, few methods cope with complex ones. In this paper, we propose a dual-modality, modularized methodology for the robust detection of abnormal vehicles. We introduce an integrated anomaly detection framework comprising the following modules: background modeling, vehicle tracking with detection, mask construction, Region of Interest (ROI) backtracking, and dual-modality tracing. Concretely, we employ background modeling to filter out the motion information and keep the static information for later vehicle detection. For the vehicle detection and tracking module, we adopt YOLOv5 and multi-scale tracking to localize the anomalies. Besides, we utilize the frame difference and tracking results to identify the road and obtain the mask. In addition, we introduce multiple similarity estimation metrics to refine the anomaly period via backtracking. Finally, we propose a dual-modality bilateral tracing module to further refine the anomaly time. The experiments conducted on the Track 4 test set of the NVIDIA 2021 AI City Challenge yielded an F1-score of 0.9302 and a root mean square error (RMSE) of 3.4039, indicating the effectiveness of our framework.
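The background modeling step can be illustrated with a simple exponential moving average over frames, which washes out moving vehicles while a stalled (anomalous) vehicle gradually persists into the background image. This sketch is a generic baseline rather than the exact modeling used in the paper, and the smoothing factor is an assumption.

```python
import numpy as np

def running_background(frames, alpha=0.02):
    """Estimate a static background by exponential moving average (illustrative).

    frames: iterable of H x W x 3 uint8 video frames.
    Moving vehicles are averaged out of the result, while a stalled vehicle
    slowly persists into it, where a detector such as YOLOv5 can then
    localize the anomaly on the background image.
    """
    background = None
    for frame in frames:
        f = frame.astype(np.float32)
        background = f if background is None else (1 - alpha) * background + alpha * f
    return background.astype(np.uint8)
```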
2020
VCIP
Drone-Based Car Counting via Density Map Learning
Jingxian Huang *, Guanchen Ding *, Yujia Guo, Daiqin Yang, Sihan Wang, Tao Wang, and Yunfei Zhang
In 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP), Jun 2020
Car counting in drone-based images is a challenging task in computer vision. Most advanced counting methods are based on density maps. Usually, ground truth density maps are first generated by convolving ground truth point maps with a Gaussian kernel for later model learning (generation); the counting network then learns to predict density maps from input images (estimation). Most studies focus on the estimation problem while overlooking the generation problem. In this paper, a training framework is proposed that generates density maps by learning and trains the generation and estimation subnetworks jointly. Experiments demonstrate that our method outperforms other density map-based methods and achieves the best performance on drone-based car counting.
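The conventional generation step that the paper replaces with a learned subnetwork can be sketched as follows: each annotated car point is turned into a unit impulse and smoothed with a fixed-width Gaussian, so the resulting map integrates approximately to the car count. The kernel width here is an arbitrary illustrative value.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixed_kernel_density_map(points, shape, sigma=4.0):
    """Baseline ground-truth density map with a fixed Gaussian kernel.

    points: iterable of (x, y) car annotations; shape: (H, W) of the image.
    The map sums (approximately) to the number of annotated cars.
    """
    h, w = shape
    impulses = np.zeros((h, w), dtype=np.float32)
    for x, y in points:
        impulses[min(int(y), h - 1), min(int(x), w - 1)] += 1.0
    return gaussian_filter(impulses, sigma)
```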