SATO: Stable Text-to-Motion Framework


Are text-to-motion models robust? Recent advances in text-to-motion models stem primarily from more accurate prediction of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models.

Our research uncovers a significant weakness in text-to-motion models: their predictions are often inconsistent, yielding vastly different or even incorrect poses for semantically similar or identical text inputs. Despite the considerable advances these models have made, all of them exhibit unstable predictions under minor textual perturbations, such as synonym substitutions. In this paper, we analyze the underlying causes of this instability and establish a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module.
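
This sensitivity can be probed directly at the text encoder. The sketch below is a minimal, hypothetical probe, not the paper's evaluation protocol: it assumes PyTorch, the Hugging Face transformers library, the openai/clip-vit-base-patch32 checkpoint, and two illustrative prompts, and it compares the CLIP text embeddings of a prompt and a synonym-substituted paraphrase.

    import torch
    from transformers import CLIPTokenizer, CLIPTextModelWithProjection

    # Hypothetical probe: measure how far apart CLIP places a prompt
    # and a synonym-substituted paraphrase of it.
    MODEL = "openai/clip-vit-base-patch32"  # assumed checkpoint
    tokenizer = CLIPTokenizer.from_pretrained(MODEL)
    text_encoder = CLIPTextModelWithProjection.from_pretrained(MODEL).eval()

    prompts = [
        "a person walks forward and waves",  # original prompt (illustrative)
        "a person strolls ahead and waves",  # synonym-substituted paraphrase
    ]

    with torch.no_grad():
        inputs = tokenizer(prompts, padding=True, return_tensors="pt")
        # Projected text embeddings, shape (2, 512) for this checkpoint
        embeds = text_encoder(**inputs).text_embeds

    # Cosine similarity near 1.0 means the encoder treats the prompts as
    # near-identical; a noticeably lower value flags the kind of sensitivity
    # that can translate into divergent generated poses downstream.
    sim = torch.nn.functional.cosine_similarity(embeds[0], embeds[1], dim=0)
    print(f"cosine similarity between paraphrases: {sim.item():.4f}")

A similarity well below 1.0 for a faithful paraphrase suggests that the encoder, and therefore any motion model conditioned on it, may treat the two prompts very differently.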
