Search Machine Learning Papers
Search Results for "computer vision"
Found 20 papers
Multimodal Categorization of Crisis Events in Social Media
Recent developments in image classification and natural language processing, coupled with the rapid growth in social media usage, have enabled fundamental advances in detecting breaking events around ...
Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions
A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigat...
K-LITE: Learning Transferable Visual Models with External Knowledge
The new generation of state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervisio...
Out-of-Distribution Detection with Attention Head Masking for Multimodal Document Classification
Detecting out-of-distribution (OOD) data is crucial in machine learning applications to mitigate the risk of model overconfidence, thereby enhancing the reliability and safety of deployed systems. The...
Mamba in Vision: A Comprehensive Survey of Techniques and Applications
Mamba is emerging as a novel approach to overcome the challenges faced by Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in computer vision. While CNNs excel at extracting local f...
Trainable Noise Model as an XAI evaluation method: application on Sobol for remote sensing image segmentation
eXplainable Artificial Intelligence (XAI) has emerged as an essential requirement when dealing with mission-critical applications, ensuring transparency and interpretability of the employed black box ...
CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks
Current state-of-the-art vision-and-language models are evaluated on tasks either individually or in a multi-task setting, overlooking the challenges of continually learning (CL) tasks as they arrive....
VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning
Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire image...
Coordinated Robustness Evaluation Framework for Vision-Language Models
Vision-language models, which integrate computer vision and natural language processing capabilities, have demonstrated significant advancements in tasks such as image captioning and visual question a...
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability
Large vision-language models have achieved outstanding performance, but their size and computational requirements make their deployment on resource-constrained devices and time-sensitive tasks impract...
VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?
Visual storytelling is an interdisciplinary field combining computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approa...
Analyzing The Language of Visual Tokens
With the introduction of transformer-based models for vision and language tasks, such as LLaVA and Chameleon, there has been renewed interest in the discrete tokenized representation of images. These ...
Transformers without Normalization
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better per...
Natural Language Descriptions of Deep Visual Features
Some neurons in deep networks specialize in recognizing highly specific perceptual, structural, or semantic features of inputs. In computer vision, techniques exist for identifying neurons that respon...
Unsupervised Model Diagnosis
Ensuring model explainability and robustness is essential for reliable deployment of deep vision systems. Current methods for evaluating robustness rely on collecting and annotating extensive test set...
Linear Spaces of Meanings: Compositional Structures in Vision-Language Models
We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings o...
A General Framework for Defending Against Backdoor Attacks via Influence Graph
In this work, we propose a new and general framework to defend against backdoor attacks, inspired by the fact that attack triggers usually follow a \textsc{specific} type of attacking pattern, and the...
Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training
Recent advancements in vision-language pre-training via contrastive learning have significantly improved performance across computer vision tasks. However, in the medical domain, obtaining multimodal ...
Vision Language Transformers: A Survey
Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform. A relatively recent body of research has adapted t...
Generalized but not Robust? Comparing the Effects of Data Modification Methods on Out-of-Domain Generalization and Adversarial Robustness
Data modification, either via additional training datasets, data augmentation, debiasing, and dataset filtering, has been proposed as an effective solution for generalizing to out-of-domain (OOD) inpu...