ICLR 2024
[Website] [arXiv] [PDF]

Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after their current action (e.g. crack eggs)? What if we also know the actor's longer-term goal (e.g. making egg fried rice)? We hypothesize that large language models (LLMs), which have been pretrained on procedural text data (e.g. recipes, how-tos), can help long-term action anticipation (LTA) from both perspectives: they can provide prior knowledge about plausible next actions, and they can infer the goal from the observed part of a procedure.
AntGPT is the framework proposed in our paper for leveraging LLMs in video-based long-term action anticipation. At the time of publication, AntGPT achieves state-of-the-art performance on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+.
Clone this repository.
git clone git@github.com:brown-palm/AntGPT.git
cd AntGPT
Set up a Python 3.9 virtual environment, then install PyTorch with the right CUDA version.
python3 -m venv venv/forecasting
source venv/forecasting/bin/activate
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --extra-index-url https://download.pytorch.org/whl/cu117
Install CLIP.
pip install git+https://github.com/openai/CLIP.git
Install other packages.
pip install -r requirements.txt
Install the llama-recipes package following the instructions here.
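(Optional) A quick sanity check like the one below, which is not part of the original setup, can confirm that the PyTorch build sees CUDA and that CLIP loads; the ViT-L/14 backbone here is only an assumption, so use whichever backbone your config specifies.

```python
import torch
import clip

# Verify the expected PyTorch build and CUDA availability.
print(torch.__version__)          # expect 2.0.0+cu117
print(torch.cuda.is_available())  # expect True on a CUDA machine

# Load a CLIP backbone; ViT-L/14 is an assumption, not necessarily the one used in the configs.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
print("CLIP loaded, input resolution:", model.visual.input_resolution)
```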
In our experiments, we used data from Ego4D, Epic-Kitchens-55, and EGTEA GAZE+. For Epic-Kitchens-55 and EGTEA GAZE+, we also used the data annotations and splits from EGO-TOPO. First, create a data folder in the root directory:
mkdir data
Download the Ego4D dataset, annotations, and pretrained models from here.
Download the Epic-Kitchens-55 dataset and annotations.
Download the EGTEA GAZE+ dataset from here.
Download data annotations from EGO-TOPO. Please refer to their instructions.
You can find our preprocessed files, including text prompts, goal features, etc., here.
Download and unzip both folders.
Place the goal_features folder under the data folder.
Place the dataset folder under the Llama2_models folder.
Make a symlink in the ICL subfolder of the Llama2_models folder.
ln -s {path_to_dataset} AntGPT/Llama2_models/ICL
We used CLIP to extract features from these datasets. You can use the feature extraction script under transformer_models (a rough sketch of this step is shown after the folder layout below):
python -m transformer_models.generate_clip_img_embedding
Our data folder structure is illustrated below. Feel free to use your own setup, but remember to adjust the path configs accordingly.
data
├── ego4d
│   ├── annotations
│   │   ├── fho_lta_taxonomy.json
│   │   ├── fho_test_unannotated.json
│   │   └── ...
│   └── clips
│       ├── 0a7a74bf-1564-41dc-a516-f5f1fa7f75d1.mp4
│       ├── 0a975e6e-4b13-426d-be5f-0ef99b123358.mp4
│       └── ...
├── ek
│   ├── annotations
│   │   ├── EPIC_many_shot_verbs.csv
│   │   └── ...
│   └── clips
│       ├── rgb
│       ├── obj
│       └── flow
├── gaze
│   ├── annotations
│   │   ├── action_list_t+v.csv
│   │   └── ...
│   └── clips
│       ├── OP01-R01-PastaSalad.mp4
│       └── ...
├── goal_features
│   ├── ego4d_feature_gt_val.pkl
│   └── ...
├── output_CLIP_img_embedding_ego4d
└── ...
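For reference, the CLIP feature extraction step mentioned above conceptually looks like the sketch below. This is not the repository's generate_clip_img_embedding script: the frame paths, the ViT-L/14 backbone, and the output location are placeholders, so follow the actual script and its config for the real settings.

```python
import pickle
import torch
import clip
from PIL import Image

# Hypothetical per-frame CLIP image embedding extraction.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Placeholder frame paths; the real script samples frames from the video clips.
frame_paths = ["frames/clip_0001/000001.jpg", "frames/clip_0001/000002.jpg"]

embeddings = []
with torch.no_grad():
    for path in frame_paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        feat = model.encode_image(image)                # (1, 768) for ViT-L/14
        feat = feat / feat.norm(dim=-1, keepdim=True)   # CLIP features are typically L2-normalized
        embeddings.append(feat.squeeze(0).cpu())

# Placeholder output path; the directory must exist beforehand.
with open("data/output_CLIP_img_embedding_ego4d/clip_0001.pkl", "wb") as f:
    pickle.dump(torch.stack(embeddings), f)
```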
Our codebase consists of three parts: the transformer experiments, the GPT experiments, and the Llama2 experiments. The implementation of each part is located in the transformer_models, GPT_models, and Llama2_models folders, respectively.
You can find our model checkpoints and output files for Ego4D LTA here.
Unzip both folders.
Place the ckpt folder under the llama_recipe subfolder of the Llama2_models folder.
Place the ego4d_outputs folder under the llama_recipe subfolder of the Llama2_models folder.
Submit the output files to the leaderboard.
cd Llama2_models/Finetune/llama-recipes
CUDA_VISIBLE_DEVICES=0 python inference/inference_lta.py --model_name {your llama checkpoint path} --peft_model {pretrained model path} --prompt_file ../dataset/test_nseg8_recog_egovlp.jsonl --response_path {output file path}
To run an experiment on the transformer models, please use the following command:
python -m transformer_models.run --cfg transformer_models/configs/ego4d_image_pred_in8.yaml --exp_name ego4d_lta/clip_feature_in8
To run a GPT experiment, please use one of the workflow illustration notebooks.
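As a rough illustration of what such a GPT experiment involves, the snippet below builds a few-shot prompt from observed actions and queries an LLM for the future actions. The example actions, prompt wording, model name, and API call are placeholders, not the exact setup used in the notebooks.

```python
from openai import OpenAI  # assumes the OpenAI Python client is installed and OPENAI_API_KEY is set

# Hypothetical in-context examples: observed action sequences and the actions that follow them.
examples = [
    ("crack egg, mix egg, pour oil, heat pan", "pour egg, stir egg, add rice, stir rice"),
    ("wash tomato, cut tomato, wash lettuce", "cut lettuce, mix salad, add dressing"),
]
observed = "open fridge, take egg, crack egg"

# Assemble a simple few-shot prompt mapping observed actions to future actions.
prompt = "Predict the next actions given the observed actions.\n\n"
for obs, future in examples:
    prompt += f"Observed: {obs}\nFuture: {future}\n\n"
prompt += f"Observed: {observed}\nFuture:"

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```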
To run a Llama2 experiment, please refer to the instructions in that folder.
Our paper is available on arXiv. If you find our work useful, please consider citing us:
@article{zhao2023antgpt,
title = {AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?},
author = {Qi Zhao and Shijie Wang and Ce Zhang and Changcheng Fu and Minh Quan Do and Nakul Agarwal and Kwonjoon Lee and Chen Sun},
journal = {ICLR},
year = {2024}
}

This project is released under the MIT license.