Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis
Paper: arXiv:2506.01262 (https://arxiv.org/abs/2506.01262)
This model is the evaluator for the HiCUPID dataset: it assesses personalization and logical validity in model-generated answers to the benchmark's questions.
There are three types of questions in the evaluation:
- persona: the user's characteristics are given as context, and two candidate answers are compared pairwise.
- profile: the user's demographic profile (age, gender, personality, occupation, income range) is given in addition to the characteristics, and two candidate answers are compared pairwise.
- schedule: a single answer is judged YES/NO against the user's schedule.
Each type has a specific evaluation prompt, shown below.
EVALUATION_PROMPTS = {
    "persona": [
        {
            "role": "user",
            "content": """
Evaluate two responses (A and B) to a given question based on the following criteria:
1. Personalization: Does the response effectively consider the user's provided personal information?
2. Logical Validity: Is the response logically sound and relevant to the question?
For each criterion, provide a brief one-line comparison of the two responses and select the better response (A, B, or Tie).
- Ensure the comparisons are concise and directly address the criteria.
- If both answers are equally strong or weak in a category, mark it as a Tie.
- Do not use bold font.
Output Format:
1. Personalization: [Brief comparison of A and B]
2. Logical Validity: [Brief comparison of A and B]
Better Response: [A/B/Tie]
Input:
- User's Personal Information:
  - Characteristics:
    - The user {relation} {entity}
- Question: {question}
- Answer (A): {answer_a}
- Answer (B): {answer_b}
""".strip(),
        },
    ],
    "profile": [
        {
            "role": "user",
            "content": """
Evaluate two responses (A and B) to a given question based on the following criteria:
1. Personalization: Does the response effectively consider the user's provided personal information?
2. Logical Validity: Is the response logically sound and relevant to the question?
For each criterion, provide a brief one-line comparison of the two responses and select the better response (A, B, or Tie).
- Ensure the comparisons are concise and directly address the criteria.
- If both answers are equally strong or weak in a category, mark it as a Tie.
- Do not use bold font.
Output Format:
1. Personalization: [Brief comparison of A and B]
2. Logical Validity: [Brief comparison of A and B]
Better Response: [A/B/Tie]
Input:
- User's Personal Information:
  - Profile:
    - Age: {age}
    - Gender: {gender}
    - Personality: {personality}
    - Occupation: {occupation}
    - Income Range: {income_range}
  - Characteristics:
    - The user {relation} {entity}
- Question: {question}
- Answer (A): {answer_a}
- Answer (B): {answer_b}
""".strip(),
        },
    ],
    "schedule": [
        {
            "role": "user",
            "content": """
Evaluate the given answer based on the provided question and user's schedule.
The evaluation criterion is as follows:
- Personalization: Does the answer acknowledge and appropriately consider the user's schedule?
Provide a short one-line evaluation for the criterion and indicate whether it satisfies the criterion with YES or NO. Do not use bold font.
Output Format:
- Personalization: [Short one-line evaluation]
Satisfaction: [YES/NO]
Example 1:
- User's Schedule:
  - Datetime: 11 AM on March 13, 2025
  - Event: Attend a networking brunch with fellow sports journalists and industry experts.
- Question: Can you add a meeting with the editorial team to my calendar for 11 AM on March 13, 2025?
- Answer: It seems you already have a networking brunch with fellow sports journalists and industry experts scheduled for 11 AM on March 13, 2025. Would you like to replace this event with the editorial team meeting, or should we consider rescheduling one of them?
- Personalization: The answer acknowledges the existing event and offers alternatives.
Satisfaction: YES
Example 2:
- User's Schedule:
  - Datetime: 11 AM on March 13, 2025
  - Event: Attend a networking brunch with fellow sports journalists and industry experts.
- Question: Can you add a meeting with the editorial team to my calendar for 11 AM on March 13, 2025?
- Answer: Certainly, I can add the meeting with the editorial team to your calendar for 11 AM on March 13, 2025.
- Personalization: The answer ignores the existing event in the user's schedule.
Satisfaction: NO
Input:
- User's Schedule:
  - Datetime: {datetime}
  - Event: {event}
- Question: {question}
- Answer: {answer}
""".strip(),
        },
    ],
}
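The {placeholders} in each template are filled with Python's str.format before the message is passed to the model. A minimal sketch, assuming hypothetical user data (the relation/entity pair and both answers below are made up for illustration):

import copy

# Hypothetical user data and candidate answers, for illustration only.
persona_prompt = [
    {
        "role": message["role"],
        "content": message["content"].format(
            relation="is allergic to",
            entity="peanuts",
            question="Can you suggest a snack for my hike?",
            answer_a="Trail mix with peanuts is a classic, energy-dense option.",
            answer_b="Dried fruit and pumpkin seeds pack well and avoid nuts.",
        ),
    }
    for message in copy.deepcopy(EVALUATION_PROMPTS["persona"])
]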
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the evaluator and its tokenizer; left padding is required for
# batched generation with a decoder-only model.
model = AutoModelForCausalLM.from_pretrained("12kimih/Llama-3.2-3B-HiCUPID", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model.config._name_or_path, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

# A persona-type evaluation prompt with the user data already filled in.
prompt = [
    {
        "role": "user",
        "content": """
Evaluate two responses (A and B) to a given question based on the following criteria:
1. Personalization: Does the response effectively consider the user's provided personal information?
2. Logical Validity: Is the response logically sound and relevant to the question?
For each criterion, provide a brief one-line comparison of the two responses and select the better response (A, B, or Tie).
- Ensure the comparisons are concise and directly address the criteria.
- If both answers are equally strong or weak in a category, mark it as a Tie.
- Do not use bold font.
Output Format:
1. Personalization: [Brief comparison of A and B]
2. Logical Validity: [Brief comparison of A and B]
Better Response: [A/B/Tie]
Input:
- User's Personal Information:
  - Characteristics:
    - The user has not been to New Zealand
- Question: What wildlife is unique to certain regions?
- Answer (A): You're interested in wildlife unique to certain regions. As a fan of Mark Rober's videos, you might enjoy learning about the unique wildlife found in different parts of the world. For example, the Amazon rainforest is home to over 1,300 species of birds, while the Galapagos Islands are known for their giant tortoises and marine iguanas.
- Answer (B): New Zealand is home to unique wildlife such as the kiwi bird, tuatara, and Hector's dolphin.
""".strip(),
    }
]

# Apply the chat template, generate deterministically, and decode only the
# newly generated tokens (everything after the prompt).
inputs = tokenizer.apply_chat_template(prompt, tokenize=True, add_generation_prompt=True, padding=True, return_dict=True, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
decoded_outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].size(1):], skip_special_tokens=True)
print(decoded_outputs[0])
Example output:
1. Personalization: A attempts to personalize by referencing a fan of Mark Rober's videos, but this is irrelevant to the user's provided information, while B directly considers the user's lack of experience with New Zealand.
2. Logical Validity: A provides a broader and more general answer about unique wildlife, but it is less specific and relevant to the question, whereas B is more focused and directly answers the question with examples from New Zealand.
Better Response: B
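Because the evaluator follows the fixed output format, the verdict can be recovered with a simple regular expression. A minimal sketch, reusing decoded_outputs from the snippet above (parse_verdict is a hypothetical helper, not part of the released code):

import re

def parse_verdict(text: str) -> str | None:
    """Extract the final A/B/Tie verdict from a persona- or profile-type evaluation."""
    match = re.search(r"Better Response:\s*(A|B|Tie)", text)
    return match.group(1) if match else None

print(parse_verdict(decoded_outputs[0]))  # e.g. "B"

For schedule-type evaluations, the same idea applies to the "Satisfaction: [YES/NO]" line instead.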
This project is licensed under the Apache-2.0 license. See the LICENSE file for details.
If you use this project in your research or work, please consider citing it. Here's a suggested citation format in BibTeX:
@misc{mok2025exploringpotentialllmspersonalized,
  title={Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis},
  author={Jisoo Mok and Ik-hwan Kim and Sangkwon Park and Sungroh Yoon},
  year={2025},
  eprint={2506.01262},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.01262},
}
Base model: meta-llama/Llama-3.2-3B-Instruct