Authors: Jinghuai Jie, Yan Guo, Guixing Wu, Junmin Wu, Baojian Hua
Paper: https://arxiv.org/abs/2408.10527

Introduction

Edge detection is a fundamental task in computer vision, crucial for various applications such as object recognition, image segmentation, and scene understanding. Traditional methods primarily rely on local features like color and texture variations, while more recent deep learning approaches leverage convolutional neural networks (CNNs) to capture global and semantic features. However, CNNs often struggle to preserve intricate local details. This paper introduces EdgeNAT, a one-stage transformer-based edge detector that utilizes the Dilated Neighborhood Attention Transformer (DiNAT) as its encoder. EdgeNAT aims to efficiently and accurately extract object boundaries and…
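Since the preview only names the encoder, here is a rough, self-contained sketch of the core idea behind dilated neighborhood attention on a 1-D token sequence. The tensor shapes, kernel size, and dilation are illustrative assumptions; the paper's DiNAT encoder operates on 2-D feature maps with an optimized implementation.

```python
# Illustrative sketch only (not the paper's code): dilated neighborhood attention
# on a 1-D sequence. Each query attends to kernel_size keys spaced `dilation`
# positions apart, rather than to the full sequence.
import torch

def dilated_neighborhood_attention(q, k, v, kernel_size=7, dilation=2):
    """q, k, v: (batch, seq_len, dim) tensors."""
    b, n, d = q.shape
    half = kernel_size // 2
    outputs = []
    for i in range(n):
        # Dilated neighborhood around position i, clamped to valid indices.
        idx = torch.arange(i - half * dilation, i + half * dilation + 1, dilation)
        idx = idx.clamp(0, n - 1)
        k_nb, v_nb = k[:, idx, :], v[:, idx, :]             # (b, kernel_size, d)
        attn = torch.softmax(
            q[:, i : i + 1, :] @ k_nb.transpose(1, 2) / d ** 0.5, dim=-1
        )                                                   # (b, 1, kernel_size)
        outputs.append(attn @ v_nb)                         # (b, 1, d)
    return torch.cat(outputs, dim=1)                        # (b, n, d)

x = torch.randn(2, 32, 64)
print(dilated_neighborhood_attention(x, x, x).shape)  # torch.Size([2, 32, 64])
```

In DiNAT, layers alternate between dilation 1 (purely local) and larger dilations, which is how the encoder combines fine local detail with a wide receptive field.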
Authors: Wall Kim
Paper: https://arxiv.org/abs/2408.10517

Introduction

Background

Offline reinforcement learning (RL) has been a significant area of research due to its potential to learn optimal policies from pre-collected datasets without additional environment interactions. This is particularly crucial in scenarios where interactions are costly or risky. Return-Conditioned Transformer Decision Models (RCTDM) have shown promise in enhancing transformer performance in offline RL by using returns-to-go instead of rewards in the input sequence. However, RCTDM faces challenges in learning optimal policies from limited suboptimal trajectories.

Problem Statement

The primary challenges with using transformers as decision models in offline RL are:

1. Handling Trajectory…
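To make the returns-to-go conditioning mentioned above concrete, here is a minimal sketch (my own illustration, not the paper's code) of converting a trajectory's rewards into the returns-to-go that replace rewards in the model's input sequence.

```python
# Minimal illustration (not the paper's code): building returns-to-go from a
# trajectory's rewards, as used to condition transformer decision models.
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """rtg[t] = sum_{k >= t} gamma**(k - t) * rewards[k]; gamma=1.0 gives the
    undiscounted returns-to-go commonly used in Decision Transformer-style models."""
    rtg = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rewards = np.array([0.0, 1.0, 0.0, 2.0])
print(returns_to_go(rewards))  # [3. 3. 2. 2.]
```

A return-conditioned model then consumes interleaved (return-to-go, state, action) tokens, so the target return can be set at inference time to steer the learned policy.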
Authors: Xuan Xie, Jiayang Song, Yuheng Huang, Da Song, Fuyuan Zhang, Felix Juefei-Xu, Lei Ma
Paper: https://arxiv.org/abs/2408.10474

Introduction

Large Language Models (LLMs) have revolutionized various domains, including natural language processing, code generation, and robotic system control. Despite their impressive capabilities, concerns about their trustworthiness persist, particularly regarding issues like hallucination and toxicity. Recent research has focused on developing testing methods to uncover these untrustworthy behaviors before deployment. However, a systematic and formalized approach to measure the sufficiency and coverage of LLM testing is still lacking. To address this gap, the authors propose LeCov, a set of multi-level testing criteria for LLMs, which considers three crucial internal…
Authors: Vijul Shah, Brian B. Moser, Ko Watanabe, Andreas Dengel
Paper: https://arxiv.org/abs/2408.10397

Introduction

The ability to accurately measure pupil diameter is crucial for assessing various psychological and physiological states, such as stress levels and cognitive load. However, the low resolution of images in many eye-tracking datasets often hampers precise measurement. This study investigates the impact of various upscaling methods on pupil diameter predictions from webcam images. By comparing several pre-trained super-resolution (SR) methods, the study aims to determine how upscaling can enhance the accuracy of pupil diameter prediction models.

Related Work

Super-Resolution as Pre-Processing

Image super-resolution (SR) is the process of converting low-resolution…
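As a rough illustration of the upscale-then-predict pipeline the study evaluates, the sketch below uses plain bicubic interpolation in place of the pre-trained SR models compared in the paper; the scale factor and the downstream predictor are placeholders.

```python
# Illustrative sketch: upscale a low-resolution eye crop before feeding it to a
# pupil-diameter predictor. Bicubic interpolation stands in for the pre-trained
# super-resolution models compared in the paper; the predictor is a hypothetical
# placeholder for any regression model.
from PIL import Image
import numpy as np

def upscale_eye_crop(path, scale=4):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    return img.resize((w * scale, h * scale), resample=Image.BICUBIC)

def predict_pupil_diameter(model, img):
    x = np.asarray(img, dtype=np.float32) / 255.0   # HWC image scaled to [0, 1]
    return model.predict(x[None, ...])              # batch of one

# high_res = upscale_eye_crop("eye_crop.png", scale=4)
# diameter_mm = predict_pupil_diameter(pupil_model, high_res)
```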
Authors: Zeyuan Chen, Haiyan Wu, Kaixin Wu, Wei Chen, Mingjie Zhong, Jia Xu, Zhongyi Liu, Wei Zhang
Paper: https://arxiv.org/abs/2408.09439

In the ever-evolving landscape of search engines, relevance modeling plays a pivotal role in enhancing user experience by accurately identifying items that align with users’ queries. Traditional models often fall short by relying solely on semantic congruence, which is insufficient for capturing the full spectrum of relevance. This blog delves into a novel approach that leverages user interactions and advanced prompting techniques to boost relevance modeling driven by Large Language Models (LLMs).

Introduction

Background

Search engines are indispensable tools for navigating the vast expanse of online content. The…
Authors: Bruce W. Lee, Yeongheon Lee, Hyunsoo Cho
Paper: https://arxiv.org/abs/2408.09049

Introduction

Recent advancements in large language models (LLMs) have significantly enhanced their capabilities, making them integral to various real-world applications. However, a primary concern is their non-deterministic nature, which allows them to generate diverse responses to the same input. This variability stems from the vast and heterogeneous datasets they consume, enabling them to capture complex probability distributions for a single topic and encompass multiple viewpoints. Despite this flexibility, LLMs exhibit a tendency towards specific phrasings, tones, or content types, indicating a central tendency within their outputs. To systematically explore this phenomenon, the…
RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search
Authors: Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, X. Sean Wang
Paper: https://arxiv.org/abs/2408.08933

Introduction

Approximate Nearest Neighbor Search (ANNS) is a critical component in various applications, such as recommendation systems and large language model-based applications. With the rise of multimodal neural models, cross-modal ANNS has become essential for retrieving similar items across different modalities (e.g., using text to find similar images). However, existing ANNS approaches struggle with cross-modal queries due to the inherent distribution gap between embeddings from different modalities. This paper introduces RoarGraph, a projected bipartite graph designed to address these inefficiencies and significantly improve cross-modal ANNS performance.

Related Work

Background on…
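For intuition about the query workload (illustration only; RoarGraph replaces this exhaustive scan with its projected bipartite graph index), a brute-force cross-modal search over normalized embeddings looks like this:

```python
# Brute-force cross-modal nearest neighbor search, for illustration only. The
# embedding sizes and random vectors are placeholders for real text/image encoders.
import numpy as np

rng = np.random.default_rng(0)
image_embs = rng.normal(size=(10_000, 128)).astype(np.float32)   # base modality
text_query = rng.normal(size=(128,)).astype(np.float32)          # query modality

# Normalize so the inner product equals cosine similarity.
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
text_query /= np.linalg.norm(text_query)

k = 10
scores = image_embs @ text_query                 # similarity of the query to every base vector
topk = np.argpartition(-scores, k)[:k]           # indices of the k best candidates
topk = topk[np.argsort(-scores[topk])]           # sort those candidates by score
print(topk, scores[topk])
```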
Authors: Xiaomeng Jin, Jeonghwan Kim, Yu Zhou, Kuan-Hao Huang, Te-Lin Wu, Nanyun Peng, Heng Ji
Paper: https://arxiv.org/abs/2408.10086

Introduction

Multimodal Language Models (MLMs) have shown remarkable capabilities in understanding and integrating various modalities, including text, images, and videos. However, the process of manually annotating high-quality image-text pair data for fine-tuning and alignment is both costly and time-consuming. Existing multimodal data augmentation frameworks often face challenges such as semantic inconsistency between texts and images or the generation of unrealistic images, leading to a knowledge gap with real-world examples. To address these issues, the authors propose ARMADA (Attribute-Based Multimodal Data Augmentation), a novel method that leverages knowledge-guided manipulation of…
Authors: Duo Su, Junjie Hou, Guang Li, Ren Togo, Rui Song, Takahiro Ogawa, Miki Haseyama
Paper: https://arxiv.org/abs/2408.08610

Introduction

In this blog post, we delve into the paper titled “Generative Dataset Distillation Based on Diffusion Model,” which presents a novel approach to dataset distillation using the SDXL-Turbo diffusion model. This method was developed for the generative track of The First Dataset Distillation Challenge at ECCV 2024. The authors propose a technique that leverages the high-speed, high-quality image generation capabilities of the SDXL-Turbo model to achieve impressive results in dataset distillation.

Background…
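As a hedged sketch of the single-step generation that SDXL-Turbo enables (via the Hugging Face diffusers library; the prompt, image count, and file names are placeholders, and the paper's distillation pipeline involves much more than raw generation):

```python
# Minimal sketch: single-step image generation with SDXL-Turbo via diffusers.
# The class-name prompt and number of images are illustrative placeholders.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

prompt = "a photo of a golden retriever"  # e.g., one class name from the target dataset
images = pipe(
    prompt=prompt,
    num_inference_steps=1,   # SDXL-Turbo is designed for 1-4 step sampling
    guidance_scale=0.0,      # turbo models run without classifier-free guidance
    num_images_per_prompt=4,
).images

for i, img in enumerate(images):
    img.save(f"distilled_candidate_{i}.png")
```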
Authors: Yusen Wu, Hao Chen, Alex Pissinou Makki, Phuong Nguyen, Yelena Yesha
Paper: https://arxiv.org/abs/2408.08456

Introduction

Distributional drift, also known as dataset drift, in medical imaging refers to changes in data distribution over time, which can significantly affect the performance of machine learning models used for diagnostic purposes. This drift may result from various factors, including alterations in imaging equipment, differences in imaging protocols, variations in patient demographics, or updates in image preprocessing techniques. Detecting and managing drift is critical in the medical field to ensure that models remain accurate and reliable. Ignoring drift can lead to incorrect diagnoses or suboptimal treatment recommendations, thereby potentially…
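As a simple, hedged illustration of flagging drift between a reference batch and a newly collected batch of image features (a per-feature two-sample Kolmogorov-Smirnov test; this is a generic baseline, not necessarily the paper's method):

```python
# Illustrative drift check (not the paper's method): compare the distribution of
# each extracted image feature between a reference window and a current window
# using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(reference, current, alpha=0.01):
    """reference, current: (n_samples, n_features) arrays of image features.
    Returns indices of features whose distributions differ significantly."""
    drifted = []
    for j in range(reference.shape[1]):
        stat, p_value = ks_2samp(reference[:, j], current[:, j])
        if p_value < alpha:
            drifted.append(j)
    return drifted

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=(500, 16))
cur = rng.normal(0.3, 1.0, size=(500, 16))   # shifted mean simulates drift
print(detect_feature_drift(ref, cur))
```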