Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.
These models can be applied to:
* 📝 Text, for tasks like text classification, information extraction, question answering, summarization, translation, and text generation, in over 100 languages.
* 🖼️ Images, for tasks like image classification, object detection, and segmentation.
* 🗣️ Audio, for tasks like speech recognition and audio classification.
Transformer models can also perform tasks on several modalities combined, such as table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.
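As a rough illustration of how the tasks above map onto code, here is a minimal sketch that groups task identifiers by modality and builds a pipeline on demand. It assumes the `transformers` package and a backend such as PyTorch are installed. The strings are task identifiers as accepted by the library's `pipeline` function, to the best of our knowledge; the `TASKS_BY_MODALITY` grouping and the helpers `modality_of` and `make_pipeline` are hypothetical names introduced only for this example.

```python
# Hypothetical grouping of pipeline task identifiers by modality.
# The grouping itself is an illustration, not part of the library.
TASKS_BY_MODALITY = {
    "text": [
        "text-classification",
        "token-classification",  # one form of information extraction
        "question-answering",
        "summarization",
        "translation",
        "text-generation",
    ],
    "image": [
        "image-classification",
        "object-detection",
        "image-segmentation",
    ],
    "audio": [
        "automatic-speech-recognition",
        "audio-classification",
    ],
    "multimodal": [
        "table-question-answering",
        "document-question-answering",
        "video-classification",
        "visual-question-answering",
    ],
}


def modality_of(task):
    """Return the modality group that a task identifier belongs to."""
    for modality, tasks in TASKS_BY_MODALITY.items():
        if task in tasks:
            return modality
    raise KeyError(f"unknown task: {task}")


def make_pipeline(task, model=None):
    """Build a pipeline for `task`.

    The import is deferred so that loading this module neither requires
    the library nor downloads anything. Calling this function may
    download a pretrained checkpoint on first use.
    """
    from transformers import pipeline  # assumption: transformers is installed

    return pipeline(task, model=model)
```

For instance, `make_pipeline("text-classification")("I love this library!")` would load a default checkpoint for that task and classify the sentence. Passing `model=` selects a specific checkpoint instead of the default.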