POET: Prompt Offset Tuning for Continual Human Action Adaptation

1Carnegie Mellon University 2Meta 3Indian Institute of Technology Hyderabad

ECCV 2024, Oral Presentation

TL;DR: POET enables users to personalize their experience by adding new action classes efficiently and continually whenever they want.

ECCV Talk Video

Abstract

As extended reality (XR) is redefining how users interact with computing devices, research in human action recognition is gaining prominence. Typically, models deployed on immersive computing devices are static and limited to their default set of classes.

The goal of our research is to provide users and developers with the capability to personalize their experience by adding new action classes to their device models continually. Importantly, a user should be able to add new classes in a low-shot and efficient manner, and this process should not require storing or replaying any of the user's sensitive training data. We formalize this problem as privacy-aware few-shot continual action recognition.

Towards this end, we propose POET: Prompt Offset Tuning. While existing prompt tuning approaches have shown great promise for continual learning of image, text, and video modalities, they demand access to extensively pretrained transformers. Breaking away from this assumption, POET demonstrates the efficacy of prompt tuning a significantly more lightweight backbone, pretrained exclusively on the base class data. We propose a novel spatio-temporal learnable prompt offset tuning approach, and are the first to apply such prompt tuning to Graph Neural Networks.
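As a rough illustration of the core idea, the sketch below adds learnable spatio-temporal prompt offsets to the input skeleton features before a frozen backbone consumes them. All shapes, names, and the zero initialization here are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

# Illustrative shapes: T frames, V skeleton joints, C channels per joint.
T, V, C = 16, 25, 3

def apply_prompt_offsets(x, prompt):
    """Add learnable spatio-temporal prompt offsets to the input
    skeleton features; a frozen backbone would then process the sum."""
    assert x.shape == prompt.shape
    return x + prompt

rng = np.random.default_rng(0)
x = rng.standard_normal((T, V, C))   # one skeleton sequence
prompt = np.zeros((T, V, C))         # learnable offsets, zero-initialized
out = apply_prompt_offsets(x, prompt)
```

Because the offsets are added to the input rather than concatenated as extra tokens, the backbone's input shape (and hence the pretrained weights) stays unchanged; only the small prompt tensor is trained on new classes.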

We contribute two new benchmarks for this problem setting in human action recognition: (i) the NTU RGB+D dataset for activity recognition, and (ii) the SHREC-2017 dataset for hand gesture recognition. We find that POET consistently outperforms a comprehensive set of baselines.

Method: Prompt Offset Tuning for Graph Neural Networks

Activity Recognition Benchmark

  • Fine-tune and LwF do not freeze the feature extractor for t>0; the entire model overfits to the few shots, leading to almost complete forgetting.
  • SOTA continual prompt tuning methods, L2P and CODA-P, fail to learn new classes: they are designed for image classification, update using full supervision, and assume a ViT pretrained on ImageNet-21k.
  • POET significantly outperforms upper bounds on new classes and remains competitive on old classes, giving the best stability-plasticity trade-off (HM graph).
  • We also compare with replay methods that violate privacy. FE+Replay freezes the backbone and rehearses stored training data; POET still learns new classes better. (Right) Removing prompts from POET gives Feature Extraction (FE), i.e., a frozen feature extractor with an expanding linear classifier. Prompt offsets retain intermediate class performance.
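For context, the expanding linear classifier used by the FE baseline can be sketched as follows; the dimensions and initialization are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                              # feature dim of the frozen backbone
W = rng.standard_normal((40, d))    # classifier rows after the base session

def expand_classifier(W, n_new, rng):
    """Append randomly initialized rows for n_new classes; existing
    rows are kept verbatim, so old-class logits are unchanged."""
    W_new = rng.standard_normal((n_new, W.shape[1]))
    return np.vstack([W, W_new])

W2 = expand_classifier(W, 5, rng)   # add a 5-shot session of 5 classes
```

Under this design, each continual session only grows the classifier; with the backbone frozen, forgetting can only come from how the classifier rows are trained.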

Gesture Recognition Benchmark

  • ALICE [3] is a few-shot class-incremental baseline adapted from image classification. It uses a non-parametric classifier, so it retains old classes well (there is no expanding classifier to overwrite), but it fails to learn new classes well.
  • POET sets the state of the art on the stability-plasticity trade-off metric, the Harmonic Mean.
  • This dataset and model are more lightweight than the activity benchmark; hence the model is more plastic to new knowledge and less stable.
  • In addition to prompts, we mitigate forgetting in the classifier by freezing the weights corresponding to previous classes.
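The classifier-freezing step above can be sketched as a masked gradient update: rows belonging to previous classes receive no update. This is a minimal NumPy sketch with illustrative dimensions and learning rate, not the paper's training code:

```python
import numpy as np

# Illustrative expanding classifier: one weight row per class.
n_old, n_new, d = 40, 5, 64
rng = np.random.default_rng(0)
W = rng.standard_normal((n_old + n_new, d))

def masked_update(W, grad, lr, n_frozen):
    """SGD step that zeroes the gradient for the first n_frozen rows,
    so the weights of previously learned classes stay fixed."""
    grad = grad.copy()
    grad[:n_frozen] = 0.0
    return W - lr * grad

grad = rng.standard_normal(W.shape)
W_old = W.copy()
W = masked_update(W, grad, lr=0.1, n_frozen=n_old)
```

Only the rows for the current session's classes move, so few-shot updates cannot directly overwrite old-class decision boundaries.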

BibTeX

@inproceedings{garg2024poet,
  author    = {Garg, Prachi and Joseph, K J and Balasubramanian, Vineeth N and Camgoz, Necati Cihan and Wan, Chengde and Kin, Kenrick and Si, Weiguang and Ma, Shugao and De La Torre, Fernando},
  title     = {POET: Prompt Offset Tuning for Continual Human Action Adaptation},
  booktitle = {European Conference on Computer Vision},
  year      = {2024},
}