Hello! This is the GitHub space for FoundationVision @ ByteDance.
We are dedicated to exploring the frontiers of multimodal intelligence, with the ultimate goal of building Artificial General Intelligence (AGI) systems.
Our research focuses on deep learning and multimodal intelligence. We are particularly interested in:
- Visual Foundation Models, Generative Pretrained Models, and Large Language Models.
- Multimodal Foundation Models and Representation Learning.
- Open-World Interaction via Unified Multimodal Generation and Understanding.
- Large-Scale Multimodal Generative Pretraining and Alignment.
Our group strives to push the boundaries of multimodal intelligence and has produced highly influential work in the field, including:
- Generative Models: VAR, Waver, Infinity, LLamaGen, InfinityStar, OmniTokenizer
- Object Recognition and Representation Learning: Sparse-RCNN, ByteTrack, SparK, UNINEXT, Unicorn, DanceTrack, VNext, Referformer
- Multimodal Foundation Models: Groma, Liquid, UniTok, GLEE