Fine-Tuning LLaMA2 for News-Aware Conversational AI
I fine-tuned LLaMA2 for news classification without enterprise-grade hardware. The interesting part? LLaMA2 can be trained to be a news expert without losing its friendly conversational skills.
As I've been diving deep into the world of large language models, I've discovered that while models like LLaMA2 are impressive out-of-the-box, they truly shine when adapted to specific domains. What started as a curiosity about conversational news classification turned into a comprehensive fine-tuning pipeline that I believe others might find valuable.
The Technical Challenge
Pre-trained language models excel at general language tasks but often lack the specific knowledge or formats required for specialized applications. To get them ready for real-world applications, you often need to adapt them. That's where the challenge lies:
Full fine-tuning is computationally expensive and resource-intensive
Specialized knowledge acquisition requires targeted training approaches
Maintaining conversational quality while adding classification capabilities becomes a delicate balancing act
My goal was to transform the open-source Llama-2-7b-hf into a model capable of accurately classifying news content while preserving its natural, coherent dialogue abilities.
What I've Built
Through a lot of experimentation (and a few dead ends with wasted compute!), I put together a complete workflow that covers:
Preparing news datasets in formats LLaMA2 can effectively learn from
Parameter-efficient fine-tuning using TorchTune CLI that works well on limited compute
Automated evaluation using another LLM, DeepSeek-R1-Distill-Llama-8B, as a judge
The interesting aspect has been observing how the model maintains its conversational abilities while developing specialized knowledge in news categorization. When done right, you get a model that's both focused and coherent.
There are so many articles, tools, and libraries out there that it can feel overwhelming. I’m sharing this guide because I wish I had something like this when I started. Fine-tuning is all about optimization and execution, and I hope my practical tips save you time and effort.
My Approach: Instruction-Based Fine-Tuning
I used Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA)—a supervised fine-tuning (SFT) method that updates only a small fraction of model parameters while preserving performance.
LoRA works by adding trainable rank decomposition matrices to existing weights, enabling efficient adaptation without modifying all model parameters. This approach is particularly valuable when working with models as large as LLaMA2, as it makes fine-tuning accessible even without enterprise-grade GPUs.
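For intuition, here's a toy sketch of the idea itself—this is not TorchTune's or PEFT's actual implementation, and the dimensions and initialization are illustrative, though r=8 and alpha=16 match the adapter config I use later:

import torch

d_in, d_out, r, alpha = 4096, 4096, 8, 16            # r=8, alpha=16 match the LoraConfig shown later

W = torch.randn(d_out, d_in)                          # frozen pre-trained weight (never updated)
A = torch.nn.Parameter(torch.randn(r, d_in) * 0.01)   # trainable rank-r "down" projection
B = torch.nn.Parameter(torch.zeros(d_out, r))         # trainable "up" projection, starts at zero

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    # Effective weight is W + (alpha / r) * B @ A, but only A and B receive gradients.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = torch.randn(2, d_in)
print(lora_forward(x).shape)  # torch.Size([2, 4096])

Because B starts at zero, the adapted model initially behaves exactly like the base model, and training only has to learn the small low-rank correction.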
For training, I used the AGNews dataset, which contains news articles in four categories: Business, World, Sci/Tech, and Sports. I converted it into an instruction-response format (specifically, the Alpaca format), like this:
instruction: Classify this news article.
input: Apple reports record profits in Q4, exceeding analyst expectations by 15%. The tech giant attributes growth to strong iPhone 14 sales and expanding services revenue.
output: Business
This teaches the model to understand instructions, making it a news-savvy conversationalist.
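For context, the conversion itself is easy to script. Here's a minimal sketch using the Hugging Face datasets library (my actual preprocessing code differs in the details, but the idea is the same):

import json
from datasets import load_dataset

# AGNews label ids map to these four category names
LABELS = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

def to_alpaca(example):
    return {
        "instruction": "Classify this news article.",
        "input": example["text"],
        "output": LABELS[example["label"]],
    }

train = load_dataset("ag_news", split="train")
with open("./preprocessed_datasets/agnews_train.jsonl", "w") as f:
    for example in train:
        f.write(json.dumps(to_alpaca(example)) + "\n")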
Why This Matters: Business Benefits
For companies dealing with tons of information, efficient content categorization is a huge advantage. With parameter-efficient fine-tuning, I was able to:
Cut costs: It's cheaper than traditional fine-tuning.
Democratize advanced AI: Make it accessible even without a massive budget.
Achieve high accuracy: The model reached 95% accuracy in news classification.
Keep the conversation flowing: The model didn't lose its natural language abilities.
Tech Deep Dive: The Full Pipeline
Here are the specifics of the full pipeline from an implementation point of view.
Dataset Prep:
Making AGNews instruction-friendly.
Keeping it conversational.
Preprocessing:
Using LLaMA2's tokenizer.
Ensuring everything aligns with the base model.
Fine-Tuning Implementation:
LoRA magic to tune the base model.
TorchTune CLI for easy training.
Hugging Face Transformers, PEFT, and TorchTune working together seamlessly.
Inference:
Testing the fine-tuned model.
Comparing it to the baseline model.
Evaluation:
Human testers for conversational quality.
Automated benchmarks using DeepSeek-R1-Distill-Llama-8B.
Fine-Tuning with LoRA
I started with the Llama-2-7b-hf model and used TorchTune CLI to run the LoRA fine-tuning. This made the process much easier. Here's the recipe and configuration I used:
!tune run lora_finetune_single_device \
  --config llama2/7B_lora_single_device \
  output_dir="./llama2_7B_lora_single_device_outputs" \
  checkpointer.checkpoint_dir="./cache/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/" \
  tokenizer.path="./cache/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9/tokenizer.model" \
  dataset._component_=torchtune.datasets.alpaca_cleaned_dataset \
  dataset.source="json" \
  dataset.column_map='{"instruction": "instruction", "input": "input", "output": "output"}' \
  dataset.data_files="./preprocessed_datasets/agnews_train.jsonl" \
  dataset.train_on_input=False \
  lr_scheduler.num_warmup_steps=5 \
  batch_size=2 \
  gradient_accumulation_steps=8 \
  metric_logger._component_=torchtune.training.metric_logging.WandBLogger \
  metric_logger.project=llama2_finetune_agnews_with_cli \
  metric_logger.group=llama2_7b_lora_batch_8 \
  metric_logger.job_type=lora_single_device \
  metric_logger.log_dir="./llama2_7B_lora_single_device_outputs/wandb_logs" \
  log_every_n_steps=1 \
  log_peak_memory_stats=True
Key Insight #1: Learning Rate Matters
I used a learning rate warmup and decay strategy. This kept the training stable and prevented overfitting.
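TorchTune's recipe handles the schedule for me, but for intuition, warmup followed by cosine decay looks roughly like this (a standalone sketch using the Hugging Face transformers scheduler rather than TorchTune's internal one; the base learning rate and total step count are illustrative, and only the 5 warmup steps come from my command):

import torch
from transformers import get_cosine_schedule_with_warmup

# Dummy parameter and optimizer purely to visualize the schedule
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=3e-4)  # illustrative base LR

scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=5, num_training_steps=1000
)

lrs = []
for _ in range(1000):
    optimizer.step()
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

print(lrs[:5], lrs[-1])  # ramps up over 5 steps, then decays smoothly toward zero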
Key Insight #2: Efficiency is King
LoRA cut down the trainable parameters from billions to millions. This meant I could run the fine-tuning on regular hardware, saving time and money. Gradient accumulation helped me handle larger batch sizes, further improving stability.
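The recipe also takes care of gradient accumulation internally, but conceptually batch_size=2 with gradient_accumulation_steps=8 gives an effective batch of 16, along these lines (a toy, self-contained sketch, not the recipe's actual training loop):

import torch

# Toy model and data purely to illustrate the accumulation pattern
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(32)]  # micro-batches of 2

accumulation_steps = 8  # matches gradient_accumulation_steps=8 in the CLI command

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), y) / accumulation_steps  # scale the loss
    loss.backward()                                   # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                              # one update per effective batch of 16
        optimizer.zero_grad()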
Key Insight #3: Keep an Eye on Things
I monitored training metrics in real-time using Weights & Biases (W&B). Here's how I set it up:
metric_logger._component_=torchtune.training.metric_logging.WandBLogger \
metric_logger.project=llama2_finetune_agnews_with_cli \
metric_logger.group=llama2_7b_lora_batch_8 \
metric_logger.job_type=lora_single_device \
metric_logger.log_dir="./llama2_7B_lora_single_device_outputs/wandb_logs"
W&B provided me with real-time insights into three key metrics:
Metric #1: Learning Rate Schedule
This shows how the learning rate changed over time. I used a cosine decay schedule with warmup, which gradually reduced the learning rate as training progressed.
Metric #2: Training Throughput
I achieved an average throughput of about 12.21 tokens per second per GPU. This indicated stable hardware performance.
Metric #3: Loss Curve
The loss curve shows:
Rapid start: The model quickly learned the basic patterns.
Stabilization: The loss stabilized after about 500 steps.
Convergence: The model settled at a good optimization point.
Inference: Putting the Model to Work
Loading the fine-tuned model
Loading LoRA-adapted models requires some special handling of the fine-tuned adapter weights. Here's how I did it using Hugging Face's PEFT library:
import os

import torch
from huggingface_hub import login
from peft import PeftModel, LoraConfig, get_peft_model
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path configurations
base_model_cache_path = "./cache/models--meta-llama--Llama-2-7b-hf/snapshots/01c7f73d771dfac7d292323805ebc428287df4f9"
adapter_path = "./llama2_7B_lora_single_device_outputs/epoch_0"
base_model_name = "meta-llama/Llama-2-7b-hf"

# Login to Hugging Face (needed for the gated LLaMA2 weights)
hf_token = os.environ.get("HF_TOKEN")
if hf_token:
    login(token=hf_token)

print(f"Loading baseline model: {base_model_name} from cache_path: {base_model_cache_path}")
tokenizer = AutoTokenizer.from_pretrained(base_model_name, cache_dir=base_model_cache_path)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, cache_dir=base_model_cache_path, use_safetensors=False)

print("Loading fine-tuned model...")
print("Loading LoRA adapter...")
if os.path.exists(os.path.join(adapter_path, "adapter_config.json")):
    # Adapter saved in PEFT format: attach it directly to the base model.
    print("Loading from adapter_config.json")
    model = PeftModel.from_pretrained(base_model, adapter_path)
else:
    # Otherwise, rebuild the LoRA config used during training and load the raw safetensors weights.
    print("Loading from safetensors files")
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_model, lora_config)
    state_dict = {}
    safetensor_files = [f for f in os.listdir(adapter_path) if f.endswith(".safetensors")]
    for file_name in safetensor_files:
        file_path = os.path.join(adapter_path, file_name)
        print(f"Loading weights from {file_path}")
        if os.path.exists(file_path):
            state_dict.update(load_file(file_path))
    model.load_state_dict(state_dict, strict=False)

model_name = f"finetuned {base_model_name}"
print(f"Successfully loaded {model_name}")
Running Batch Inference
Batch inference is essential for evaluating models on large datasets. Here are some optimizations I used:
Dataset Processing Optimization: Processed only the first half of the dataset to save time and compute resources while still providing a meaningful evaluation.
total_entries = len(dataset)
half_entries = total_entries // 2
subset_indices = list(range(half_entries))
subset_dataset = torch.utils.data.Subset(dataset, subset_indices)
Padding Token with LLaMA Tokenizer: LLaMA tokenizers don't have a padding token configured by default, causing errors when processing batched inputs. I set the EOS token as the padding token during model initialization.
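In code, that boils down to something like this (a minimal sketch against the Hugging Face tokenizer API):

# LLaMA ships without a pad token; reuse EOS so batched inputs can be padded safely.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id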
Custom Data Collation: Used a custom collation function to handle dataset outputs.
def collate_batch(self, batch):
    """Custom collate function for DataLoader to handle the dataset outputs."""
    prompts = [item[0] for item in batch]
    entries = [item[1] for item in batch]
    return prompts, entries
Parallel Processing: Accelerated batch inference using PyTorch’s Dataset and DataLoader, as sketched below.
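Concretely, the loader is built along these lines (the batch size and worker count here are illustrative; self.collate_batch is the collate function shown above):

from torch.utils.data import DataLoader

# Wrap the half-split Subset from above; batch_size and num_workers are illustrative.
dataloader = DataLoader(
    subset_dataset,
    batch_size=8,
    shuffle=False,                   # keep dataset order so results line up with the source file
    collate_fn=self.collate_batch,   # custom collate function shown above
    num_workers=2,                   # workers prepare prompts while the GPU generates
)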
Tensor Indexing in Batch Processing: Fixed tensor slicing issues to extract generated tokens correctly.
input_length = batch_inputs.input_ids[i].size(0)
new_tokens = sequence[input_length:]
generated_text = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
Memory-Conscious Result Handling: Streamed results to disk in real-time, saving each processed batch to the output file rather than keeping all results in memory.
with open(response_test_path, "w") as output_file:
    for batch_prompts, batch_entries in tqdm(dataloader, desc=f"Evaluating {self.model_name}", unit="batch"):
        generated_entries = self.batch_inference(batch_prompts, batch_entries, max_new_tokens=30)
        for entry in generated_entries:
            output_file.write(json.dumps(entry) + "\n")
Consistent Prompt Formatting: Maintained the same instruction-based prompt structure used during training, ensuring consistency between training and inference phases.
prompt = (f"Below is an instruction that describes a task, paired with an input that provides further context. "
f"Write a response that appropriately completes the request.\\n\\n### Instruction:\\n{entry['instruction']}\\n\\n### Input:\\n{entry['input']}\\n\\n### Response:\\n")
Configurable Generation Parameters: Used these settings to balance between deterministic and varied outputs.
generated_ids = self.model.generate(
    batch_inputs.input_ids,
    attention_mask=batch_inputs.attention_mask,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    top_p=0.95,
    repetition_penalty=1.2,
    do_sample=(temperature > 0),
)
Evaluating the Fine-Tuned LLM
When evaluating a fine-tuned LLM, there are a few key approaches:
Benchmarks like MMLU: These use short-answer and multiple-choice questions to test general knowledge.
Human preference comparisons: Directly comparing the model’s responses to those of other LLMs with human judges.
Automated conversational benchmarks: Using another LLM to evaluate responses and assign scores.
MMLU and similar benchmarks are useful for testing general knowledge, but I was more interested in how well the model performed on news conversations and classification tasks. That’s where automated evaluation became the most relevant approach.
Automated Evaluation
Instead of relying entirely on human evaluation—which is slow and subjective—I used an automated approach, leveraging a DeepSeek-R1 model as a judge.
Here’s the simple prompt I used for scoring:
prompt = (
    f"Given the input `{format_input(entry)}` "
    f"and correct output `{entry['output']}`, "
    f"score the model response `{entry[json_key]}`"
    f" on a scale from 0 to 100, where 100 is the best score. "
    f"IMPORTANT: Your final answer must be EXACTLY one integer number between 0 and 100. "
    f"Do not write any text before or after the number. "
    f"Do not explain your reasoning. "
    f"Type only the number. "
)
This setup instructed DeepSeek-R1-Distill-Llama-8B to:
Look at both the input context and the correct output.
Compare them against responses from both the fine-tuned and baseline models.
Assign a numerical score (0-100) based on how well the response matched expectations.
This approach gave me a fast and objective way to measure improvements in the fine-tuned model without the inconsistencies of human evaluation.
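Even with the strict prompt, a reasoning model sometimes wraps the number in extra text, so I parse the score defensively before averaging. Here's a minimal sketch of that post-processing (the sample replies are made up, and how you actually serve DeepSeek-R1-Distill-Llama-8B is up to your setup):

import re
from statistics import mean
from typing import Optional

def extract_score(judge_reply: str) -> Optional[int]:
    """Return the last integer in [0, 100] found in the judge's reply, or None if absent."""
    for token in reversed(re.findall(r"\b\d{1,3}\b", judge_reply)):
        value = int(token)
        if 0 <= value <= 100:
            return value
    return None

# `judge_replies` stands in for whatever the judge model returned per test example.
judge_replies = ["95", "Score: 100", "The response is wrong.\n10"]
scores = [s for s in map(extract_score, judge_replies) if s is not None]
print(f"Average score over {len(scores)} examples: {mean(scores):.2f}")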
Results and Observations
The evaluation shows significant improvements in the fine-tuned model's ability to follow instructions for news classification. The model scored 95.29 out of 100 across 4,950 test examples, as evaluated by the DeepSeek R1 LLaMA distilled model.
Key Improvements:
Instruction Following: The fine-tuned model consistently understood and followed classification tasks, unlike the base LLaMA2 model.
Response Quality: Responses were more coherent and demonstrated a better understanding of context. Here’s an example response from the fine-tuned LLaMA2 vs. its base model.
{"instruction": "Classify this news article.",
"input": "Olympics-Fencing-U.S. and Swiss End Gold Drought ATHENS (Reuters) - Mariel Zagunis won the first fencing gold for the United States for 100 years when she beat Xue Tan of China 15-9 in the inaugural Olympic women's sabre final on Tuesday.",
"output": "Sports",
"model_response": "Sports"}
Fine-tuned LLaMA2 model response for a Sports article
{"instruction": "Classify this news article.",
"input": "Olympics-Fencing-U.S. and Swiss End Gold Drought ATHENS (Reuters) - Mariel Zagunis won the first fencing gold for the United States for 100 years when she beat Xue Tan of China 15-9 in the inaugural Olympic women's sabre final on Tuesday.",
"output": "Sports",
"model_response": "\\n* [ ] News Article\\n* [ ] Opinion Piece\\n* [ ] Editorial\\n* [ ] Review\\n* [ ]"}
Base LLaMA2 model response for same Sports article
Efficient Fine-Tuning: I achieved these results using LoRA fine-tuning, which requires less computing power than traditional methods, making it cost-effective.
Conclusion
This guide demonstrates how fine-tuning with PEFT and LoRA creates specialized systems without excessive computational demands. If you're adapting LLMs for domain-specific tasks, I hope my experiences help you avoid common pitfalls and customize any language model yourself.
For more details, you can explore the complete code and implementation in my GitHub repo.
If you give this a try, I’d love to hear about your experiences. Feel free to share your progress or ask questions in the comments!