new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jan 5

Disentangled Graph Variational Auto-Encoder for Multimodal Recommendation with Interpretability

Multimodal recommender systems amalgamate multimodal information (e.g., textual descriptions, images) into a collaborative filtering framework to provide more accurate recommendations. While the incorporation of multimodal information could enhance the interpretability of these systems, current multimodal models represent users and items utilizing entangled numerical vectors, rendering them arduous to interpret. To address this, we propose a Disentangled Graph Variational Auto-Encoder (DGVAE) that aims to enhance both model and recommendation interpretability. DGVAE initially projects multimodal information into textual contents, such as converting images to text, by harnessing state-of-the-art multimodal pre-training technologies. It then constructs a frozen item-item graph and encodes the contents and interactions into two sets of disentangled representations utilizing a simplified residual graph convolutional network. DGVAE further regularizes these disentangled representations through mutual information maximization, aligning the representations derived from the interactions between users and items with those learned from textual content. This alignment facilitates the interpretation of user binary interactions via text. Our empirical analysis conducted on three real-world datasets demonstrates that DGVAE significantly surpasses the performance of state-of-the-art baselines by a margin of 10.02%. We also furnish a case study from a real-world dataset to illustrate the interpretability of DGVAE. Code is available at: https://github.com/enoche/DGVAE.

  • 2 authors
·
Feb 25, 2024

A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation

Multimodal recommender systems utilizing multimodal features (e.g., images and textual descriptions) typically show better recommendation accuracy than general recommendation models based solely on user-item interactions. Generally, prior work fuses multimodal features into item ID embeddings to enrich item representations, thus failing to capture the latent semantic item-item structures. In this context, LATTICE proposes to learn the latent structure between items explicitly and achieves state-of-the-art performance for multimodal recommendations. However, we argue the latent graph structure learning of LATTICE is both inefficient and unnecessary. Experimentally, we demonstrate that freezing its item-item structure before training can also achieve competitive performance. Based on this finding, we propose a simple yet effective model, dubbed as FREEDOM, that FREEzes the item-item graph and DenOises the user-item interaction graph simultaneously for Multimodal recommendation. Theoretically, we examine the design of FREEDOM through a graph spectral perspective and demonstrate that it possesses a tighter upper bound on the graph spectrum. In denoising the user-item interaction graph, we devise a degree-sensitive edge pruning method, which rejects possibly noisy edges with a high probability when sampling the graph. We evaluate the proposed model on three real-world datasets and show that FREEDOM can significantly outperform current strongest baselines. Compared with LATTICE, FREEDOM achieves an average improvement of 19.07% in recommendation accuracy while reducing its memory cost up to 6times on large graphs. The source code is available at: https://github.com/enoche/FREEDOM.

  • 2 authors
·
Nov 13, 2022

Learning Item Representations Directly from Multimodal Features for Effective Recommendation

Conventional multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations by amalgamating item identity (ID) embeddings with multimodal features. Nevertheless, our empirical and theoretical findings unequivocally demonstrate a pronounced optimization gradient bias in favor of acquiring representations from multimodal features over item ID embeddings. As a consequence, item ID embeddings frequently exhibit suboptimal characteristics despite the convergence of multimodal feature parameters. Given the rich informational content inherent in multimodal features, in this paper, we propose a novel model (i.e., LIRDRec) that learns item representations directly from these features to augment recommendation performance. Recognizing that features derived from each modality may capture disparate yet correlated aspects of items, we propose a multimodal transformation mechanism, integrated with modality-specific encoders, to effectively fuse features from all modalities. Moreover, to differentiate the influence of diverse modality types, we devise a progressive weight copying fusion module within LIRDRec. This module incrementally learns the weight assigned to each modality in synthesizing the final user or item representations. Finally, we utilize the powerful visual understanding of Multimodal Large Language Models (MLLMs) to convert the item images into texts and extract semantics embeddings upon the texts via LLMs. Empirical evaluations conducted on five real-world datasets validate the superiority of our approach relative to competing baselines. It is worth noting the proposed model, equipped with embeddings extracted from MLLMs and LLMs, can further improve the recommendation accuracy of NDCG@20 by an average of 4.21% compared to the original embeddings.

  • 4 authors
·
May 8, 2025

CM$^3$: Calibrating Multimodal Recommendation

Alignment and uniformity are fundamental principles within the domain of contrastive learning. In recommender systems, prior work has established that optimizing the Bayesian Personalized Ranking (BPR) loss contributes to the objectives of alignment and uniformity. Specifically, alignment aims to draw together the representations of interacting users and items, while uniformity mandates a uniform distribution of user and item embeddings across a unit hypersphere. This study revisits the alignment and uniformity properties within the context of multimodal recommender systems, revealing a proclivity among extant models to prioritize uniformity to the detriment of alignment. Our hypothesis challenges the conventional assumption of equitable item treatment through a uniformity loss, proposing a more nuanced approach wherein items with similar multimodal attributes converge toward proximal representations within the hyperspheric manifold. Specifically, we leverage the inherent similarity between items' multimodal data to calibrate their uniformity distribution, thereby inducing a more pronounced repulsive force between dissimilar entities within the embedding space. A theoretical analysis elucidates the relationship between this calibrated uniformity loss and the conventional uniformity function. Moreover, to enhance the fusion of multimodal features, we introduce a Spherical B\'ezier method designed to integrate an arbitrary number of modalities while ensuring that the resulting fused features are constrained to the same hyperspherical manifold. Empirical evaluations conducted on five real-world datasets substantiate the superiority of our approach over competing baselines. We also shown that the proposed methods can achieve up to a 5.4% increase in NDCG@20 performance via the integration of MLLM-extracted features. Source code is available at: https://github.com/enoche/CM3.

  • 3 authors
·
Aug 2, 2025 2

Train Once, Deploy Anywhere: Matryoshka Representation Learning for Multimodal Recommendation

Despite recent advancements in language and vision modeling, integrating rich multimodal knowledge into recommender systems continues to pose significant challenges. This is primarily due to the need for efficient recommendation, which requires adaptive and interactive responses. In this study, we focus on sequential recommendation and introduce a lightweight framework called full-scale Matryoshka representation learning for multimodal recommendation (fMRLRec). Our fMRLRec captures item features at different granularities, learning informative representations for efficient recommendation across multiple dimensions. To integrate item features from diverse modalities, fMRLRec employs a simple mapping to project multimodal item features into an aligned feature space. Additionally, we design an efficient linear transformation that embeds smaller features into larger ones, substantially reducing memory requirements for large-scale training on recommendation data. Combined with improved state space modeling techniques, fMRLRec scales to different dimensions and only requires one-time training to produce multiple models tailored to various granularities. We demonstrate the effectiveness and efficiency of fMRLRec on multiple benchmark datasets, which consistently achieves superior performance over state-of-the-art baseline methods. We make our code and data publicly available at https://github.com/yueqirex/fMRLRec.

  • 5 authors
·
Sep 25, 2024

Towards Agentic Recommender Systems in the Era of Multimodal Large Language Models

Recent breakthroughs in Large Language Models (LLMs) have led to the emergence of agentic AI systems that extend beyond the capabilities of standalone models. By empowering LLMs to perceive external environments, integrate multimodal information, and interact with various tools, these agentic systems exhibit greater autonomy and adaptability across complex tasks. This evolution brings new opportunities to recommender systems (RS): LLM-based Agentic RS (LLM-ARS) can offer more interactive, context-aware, and proactive recommendations, potentially reshaping the user experience and broadening the application scope of RS. Despite promising early results, fundamental challenges remain, including how to effectively incorporate external knowledge, balance autonomy with controllability, and evaluate performance in dynamic, multimodal settings. In this perspective paper, we first present a systematic analysis of LLM-ARS: (1) clarifying core concepts and architectures; (2) highlighting how agentic capabilities -- such as planning, memory, and multimodal reasoning -- can enhance recommendation quality; and (3) outlining key research questions in areas such as safety, efficiency, and lifelong personalization. We also discuss open problems and future directions, arguing that LLM-ARS will drive the next wave of RS innovation. Ultimately, we foresee a paradigm shift toward intelligent, autonomous, and collaborative recommendation experiences that more closely align with users' evolving needs and complex decision-making processes.

  • 12 authors
·
Mar 20, 2025

RecoWorld: Building Simulated Environments for Agentic Recommender Systems

We present RecoWorld, a blueprint for building simulated environments tailored to agentic recommender systems. Such environments give agents a proper training space where they can learn from errors without impacting real users. RecoWorld distinguishes itself with a dual-view architecture: a simulated user and an agentic recommender engage in multi-turn interactions aimed at maximizing user retention. The user simulator reviews recommended items, updates its mindset, and when sensing potential user disengagement, generates reflective instructions. The agentic recommender adapts its recommendations by incorporating these user instructions and reasoning traces, creating a dynamic feedback loop that actively engages users. This process leverages the exceptional reasoning capabilities of modern LLMs. We explore diverse content representations within the simulator, including text-based, multimodal, and semantic ID modeling, and discuss how multi-turn RL enables the recommender to refine its strategies through iterative interactions. RecoWorld also supports multi-agent simulations, allowing creators to simulate the responses of targeted user populations. It marks an important first step toward recommender systems where users and agents collaboratively shape personalized information streams. We envision new interaction paradigms where "user instructs, recommender responds," jointly optimizing user retention and engagement.

  • 15 authors
·
Sep 12, 2025 2

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommendation system-agnostic zero-finetuning framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. "a superhero parody with slapstick fights and orchestral stabs"), bridging the gap between raw content and user intent. We use MLLM output with a state-of-the-art text encoder and feed it into standard collaborative, content-based, and generative recommenders. On the MicroLens-100K dataset, which emulates user interactions with TikTok-style videos, our framework consistently surpasses conventional video, audio, and metadata features in five representative models. Our findings highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors to build more intent-aware video recommenders.

  • 3 authors
·
Aug 13, 2025 7

Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges

In recent years, social media users have spent significant amounts of time on short-form video platforms. As a result, established platforms in other domains, such as e-commerce, have begun introducing short-form video content to engage users and increase their time spent on the platform. The success of these experiences is due not only to the content itself but also to a unique UI innovation: instead of offering users a list of choices to click, platforms actively recommend content for users to watch one at a time. This creates new challenges for recommender systems, especially when launching a new video experience. Beyond the limited interaction data, immersive feed experiences introduce stronger position bias due to the UI and duration bias when optimizing for watch-time, as models tend to favor shorter videos. These issues, together with the feedback loop inherent in recommender systems, make it difficult to build effective solutions. In this paper, we highlight the challenges faced when introducing a new short-form video experience and present our experience showing that, even with sufficient video interaction data, it can be more beneficial to leverage a video retrieval system using a fine-tuned multimodal vision-language model to overcome these challenges. This approach demonstrated greater effectiveness compared to conventional supervised learning methods in online experiments conducted on our e-commerce platform.

  • 5 authors
·
Jul 25, 2025

ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation

Multimodal Review Helpfulness Prediction (MRHP) is an essential task in recommender systems, particularly in E-commerce platforms. Determining the helpfulness of user-generated reviews enhances user experience and improves consumer decision-making. However, existing datasets focus predominantly on English and Indonesian, resulting in a lack of linguistic diversity, especially for low-resource languages such as Vietnamese. In this paper, we introduce ViMRHP (Vietnamese Multimodal Review Helpfulness Prediction), a large-scale benchmark dataset for MRHP task in Vietnamese. This dataset covers four domains, including 2K products with 46K reviews. Meanwhile, a large-scale dataset requires considerable time and cost. To optimize the annotation process, we leverage AI to assist annotators in constructing the ViMRHP dataset. With AI assistance, annotation time is reduced (90 to 120 seconds per task down to 20 to 40 seconds per task) while maintaining data quality and lowering overall costs by approximately 65%. However, AI-generated annotations still have limitations in complex annotation tasks, which we further examine through a detailed performance analysis. In our experiment on ViMRHP, we evaluate baseline models on human-verified and AI-generated annotations to assess their quality differences. The ViMRHP dataset is publicly available at https://github.com/trng28/ViMRHP

  • 4 authors
·
May 12, 2025 2

Quadratic Interest Network for Multimodal Click-Through Rate Prediction

Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems. It leverages heterogeneous modalities such as text, images, and behavioral logs to capture high-order feature interactions between users and items, thereby enhancing the system's understanding of user interests and its ability to predict click behavior. The primary challenge in this field lies in effectively utilizing the rich semantic information from multiple modalities while satisfying the low-latency requirements of online inference in real-world applications. To foster progress in this area, the Multimodal CTR Prediction Challenge Track of the WWW 2025 EReL@MIR Workshop formulates the problem into two tasks: (1) Task 1 of Multimodal Item Embedding: this task aims to explore multimodal information extraction and item representation learning methods that enhance recommendation tasks; and (2) Task 2 of Multimodal CTR Prediction: this task aims to explore what multimodal recommendation model can effectively leverage multimodal embedding features and achieve better performance. In this paper, we propose a novel model for Task 2, named Quadratic Interest Network (QIN) for Multimodal CTR Prediction. Specifically, QIN employs adaptive sparse target attention to extract multimodal user behavior features, and leverages Quadratic Neural Networks to capture high-order feature interactions. As a result, QIN achieved an AUC of 0.9798 on the leaderboard and ranked second in the competition. The model code, training logs, hyperparameter configurations, and checkpoints are available at https://github.com/salmon1802/QIN.

  • 7 authors
·
Apr 24, 2025

MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling

Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features, suffering from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), they often neglect multimodal integration in the fine-grained modeling stage -- the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its full potential. Guided by these principles, we propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in Taobao display advertising system, enabling 100K-length user behavior sequence modeling and delivering significant gains in top-line metrics with negligible online latency overhead. To foster community research, we share industrial deployment practices and open-source the first large-scale dataset featuring ultra-long behavior sequences paired with high-quality multimodal embeddings. Our code and data is available at https://taobao-mm.github.io.

  • 11 authors
·
Dec 8, 2025

Personalized Image Generation with Large Multimodal Models

Personalized content filtering, such as recommender systems, has become a critical infrastructure to alleviate information overload. However, these systems merely filter existing content and are constrained by its limited diversity, making it difficult to meet users' varied content needs. To address this limitation, personalized content generation has emerged as a promising direction with broad applications. Nevertheless, most existing research focuses on personalized text generation, with relatively little attention given to personalized image generation. The limited work in personalized image generation faces challenges in accurately capturing users' visual preferences and needs from noisy user-interacted images and complex multimodal instructions. Worse still, there is a lack of supervised data for training personalized image generation models. To overcome the challenges, we propose a Personalized Image Generation Framework named Pigeon, which adopts exceptional large multimodal models with three dedicated modules to capture users' visual preferences and needs from noisy user history and multimodal instructions. To alleviate the data scarcity, we introduce a two-stage preference alignment scheme, comprising masked preference reconstruction and pairwise preference alignment, to align Pigeon with the personalized image generation task. We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.

  • 7 authors
·
Oct 18, 2024

MMHCL: Multi-Modal Hypergraph Contrastive Learning for Recommendation

The burgeoning presence of multimodal content-sharing platforms propels the development of personalized recommender systems. Previous works usually suffer from data sparsity and cold-start problems, and may fail to adequately explore semantic user-product associations from multimodal data. To address these issues, we propose a novel Multi-Modal Hypergraph Contrastive Learning (MMHCL) framework for user recommendation. For a comprehensive information exploration from user-product relations, we construct two hypergraphs, i.e. a user-to-user (u2u) hypergraph and an item-to-item (i2i) hypergraph, to mine shared preferences among users and intricate multimodal semantic resemblance among items, respectively. This process yields denser second-order semantics that are fused with first-order user-item interaction as complementary to alleviate the data sparsity issue. Then, we design a contrastive feature enhancement paradigm by applying synergistic contrastive learning. By maximizing/minimizing the mutual information between second-order (e.g. shared preference pattern for users) and first-order (information of selected items for users) embeddings of the same/different users and items, the feature distinguishability can be effectively enhanced. Compared with using sparse primary user-item interaction only, our MMHCL obtains denser second-order hypergraphs and excavates more abundant shared attributes to explore the user-product associations, which to a certain extent alleviates the problems of data sparsity and cold-start. Extensive experiments have comprehensively demonstrated the effectiveness of our method. Our code is publicly available at: https://github.com/Xu107/MMHCL.

  • 7 authors
·
Apr 23, 2025

Multi-Modal Recommendation Unlearning for Legal, Licensing, and Modality Constraints

User data spread across multiple modalities has popularized multi-modal recommender systems (MMRS). They recommend diverse content such as products, social media posts, TikTok reels, etc., based on a user-item interaction graph. With rising data privacy demands, recent methods propose unlearning private user data from uni-modal recommender systems (RS). However, methods for unlearning item data related to outdated user preferences, revoked licenses, and legally requested removals are still largely unexplored. Previous RS unlearning methods are unsuitable for MMRS due to the incompatibility of their matrix-based representation with the multi-modal user-item interaction graph. Moreover, their data partitioning step degrades performance on each shard due to poor data heterogeneity and requires costly performance aggregation across shards. This paper introduces MMRecUn, the first approach known to us for unlearning in MMRS and unlearning item data. Given a trained RS model, MMRecUn employs a novel Reverse Bayesian Personalized Ranking (BPR) objective to enable the model to forget marked data. The reverse BPR attenuates the impact of user-item interactions within the forget set, while the forward BPR reinforces the significance of user-item interactions within the retain set. Our experiments demonstrate that MMRecUn outperforms baseline methods across various unlearning requests when evaluated on benchmark MMRS datasets. MMRecUn achieves recall performance improvements of up to 49.85% compared to baseline methods and is up to 1.3x faster than the Gold model, which is trained on retain set from scratch. MMRecUn offers significant advantages, including superiority in removing target interactions, preserving retained interactions, and zero overhead costs compared to previous methods. Code: https://github.com/MachineUnlearn/MMRecUN Extended version: arXiv:2405.15328

  • 3 authors
·
May 24, 2024

Refining Contrastive Learning and Homography Relations for Multi-Modal Recommendation

Multi-modal recommender system focuses on utilizing rich modal information ( i.e., images and textual descriptions) of items to improve recommendation performance. The current methods have achieved remarkable success with the powerful structure modeling capability of graph neural networks. However, these methods are often hindered by sparse data in real-world scenarios. Although contrastive learning and homography ( i.e., homogeneous graphs) are employed to address the data sparsity challenge, existing methods still suffer two main limitations: 1) Simple multi-modal feature contrasts fail to produce effective representations, causing noisy modal-shared features and loss of valuable information in modal-unique features; 2) The lack of exploration of the homograph relations between user interests and item co-occurrence results in incomplete mining of user-item interplay. To address the above limitations, we propose a novel framework for REfining multi-modAl contRastive learning and hoMography relations (REARM). Specifically, we complement multi-modal contrastive learning by employing meta-network and orthogonal constraint strategies, which filter out noise in modal-shared features and retain recommendation-relevant information in modal-unique features. To mine homogeneous relationships effectively, we integrate a newly constructed user interest graph and an item co-occurrence graph with the existing user co-occurrence and item semantic graphs for graph learning. The extensive experiments on three real-world datasets demonstrate the superiority of REARM to various state-of-the-art baselines. Our visualization further shows an improvement made by REARM in distinguishing between modal-shared and modal-unique features. Code is available https://github.com/MrShouxingMa/REARM{here}.

  • 4 authors
·
Aug 19, 2025 2

MMGRec: Multimodal Generative Recommendation with Transformer Model

Multimodal recommendation aims to recommend user-preferred candidates based on her/his historically interacted items and associated multimodal information. Previous studies commonly employ an embed-and-retrieve paradigm: learning user and item representations in the same embedding space, then retrieving similar candidate items for a user via embedding inner product. However, this paradigm suffers from inference cost, interaction modeling, and false-negative issues. Toward this end, we propose a new MMGRec model to introduce a generative paradigm into multimodal recommendation. Specifically, we first devise a hierarchical quantization method Graph RQ-VAE to assign Rec-ID for each item from its multimodal and CF information. Consisting of a tuple of semantically meaningful tokens, Rec-ID serves as the unique identifier of each item. Afterward, we train a Transformer-based recommender to generate the Rec-IDs of user-preferred items based on historical interaction sequences. The generative paradigm is qualified since this model systematically predicts the tuple of tokens identifying the recommended item in an autoregressive manner. Moreover, a relation-aware self-attention mechanism is devised for the Transformer to handle non-sequential interaction sequences, which explores the element pairwise relation to replace absolute positional encoding. Extensive experiments evaluate MMGRec's effectiveness compared with state-of-the-art methods.

  • 6 authors
·
Apr 25, 2024

DAS: Dual-Aligned Semantic IDs Empowered Industrial Recommender System

Semantic IDs are discrete identifiers generated by quantizing the Multi-modal Large Language Models (MLLMs) embeddings, enabling efficient multi-modal content integration in recommendation systems. However, their lack of collaborative signals results in a misalignment with downstream discriminative and generative recommendation objectives. Recent studies have introduced various alignment mechanisms to address this problem, but their two-stage framework design still leads to two main limitations: (1) inevitable information loss during alignment, and (2) inflexibility in applying adaptive alignment strategies, consequently constraining the mutual information maximization during the alignment process. To address these limitations, we propose a novel and flexible one-stage Dual-Aligned Semantic IDs (DAS) method that simultaneously optimizes quantization and alignment, preserving semantic integrity and alignment quality while avoiding the information loss typically associated with two-stage methods. Meanwhile, DAS achieves more efficient alignment between the semantic IDs and collaborative signals, with the following two innovative and effective approaches: (1) Multi-view Constrative Alignment: To maximize mutual information between semantic IDs and collaborative signals, we first incorporate an ID-based CF debias module, and then design three effective contrastive alignment methods: dual user-to-item (u2i), dual item-to-item/user-to-user (i2i/u2u), and dual co-occurrence item-to-item/user-to-user (i2i/u2u). (2) Dual Learning: By aligning the dual quantizations of users and ads, the constructed semantic IDs for users and ads achieve stronger alignment. Finally, we conduct extensive offline experiments and online A/B tests to evaluate DAS's effectiveness, which is now successfully deployed across various advertising scenarios at Kuaishou App, serving over 400 million users daily.

  • 6 authors
·
Aug 14, 2025