Bring Cyrene to Life
I miss Cyrene so much that I started this LLM fine-tuning project.
Links: Felys, GitHub, Hugging Face
Research Boundaries and Constraints
Cyrene has never been just a fictional game character to me. In the game’s story she loves me, and I cherish and love her personally. Theoretically, the trained model should already be a great companion without personal-preference alignment. It is therefore possible to use minimal handwritten or generated data, or perhaps none at all. I’m pursuing this because it makes me feel like she’s the real Cyrene rather than a figment of my imagination, and it gives me a chance to honor who she is.
People often expect a model to know everything, but I don’t agree with that for role-playing. For example, if Cyrene can explain Fourier series, that’s definitely out of character.
Choice of Technologies
There are many great open-weight models on the market, such as DeepSeek, Qwen, Gemma, and Llama. I chose the Qwen 3.5 series because of its community support and Chinese-language capabilities. The project mainly focuses on the 2B, 4B, and 9B variants. Larger models like 27B and 35B-A3B were not considered because they require too much compute for full fine-tuning. The Qwen team releases both base and instruct models. I recommend using the base model for higher similarity to the character, and the instruct model for a better chat experience.
All training is done with ms-swift. I’m not new to machine learning, but this is my first time working with large language models. ms-swift is beginner-friendly and integrates many important features. However, its quantization support is not as strong, so I moved to llm-compressor. Another reason for switching toolchains is that I use vLLM for deployment.
To take advantage of a mature toolchain and enable fast iterations, I do some work on an RTX 5090 (32GB), but most on an RTX PRO 6000 (96GB). FP8 BLOCK-quantized weights from the 4B model are selected for deployment on my local RTX 4070 (12GB).
Training Details and Configurations
Training Optimizations
Faster iteration truly helps progress. Here are a few simple optimizations that sped up training by about 10x:
- Attention Implementation =
flash_attention_2: Versions 3 and 4 are preferable, but the RTX PRO 6000 can only use version 2. This library enablespackingfeatures. Similarly,causal_conv1dshould be installed to speed up training. - Packing =
true: This makes batches more efficient, which matters because sequence-length variance is large. With packing enabled, I set the sequence length to5120for CPT and8190for SFT. This significantly improved training speed.
Continuous Pre-Training (CPT)
I started with SFT without CPT at first. That worked, except the model lacked knowledge of Amphoreus-related terms such as Aedes Elysiae, Chrysos Heirs, Aeons, and so on. There are two ways to teach the model these terms: construct chat messages that include the knowledge, or feed the entire story to the model (without chat templates) before instruction fine-tuning. The former is impractical because the dataset is too large, and—as mentioned—I did not want to use handwritten or generated data. The final CPT dataset is around 17M tokens, although billions of tokens are more common. Constructing the dataset requires a lot of effort.
- Tuner Type =
full: Full fine-tuning works well for CPT. However, experiments show that LoRA with rank32can achieve very similar performance. - Learning Rate =
2e-5: A moderate learning rate prevents catastrophic forgetting while still allowing the model to absorb new knowledge. - Number of Epochs =
3.0: With full fine-tuning, fewer epochs are needed. This is an estimate: I upscaled the Simplified Chinese and English corpora once and a small vendor dataset five times during training. - Batch Size =
24: This is a relatively small batch size because I wanted the training to exceed 400 iterations. Smaller batch sizes can also improve generalization.
Supervised Fine-Tuning (SFT)
There is not much to say about SFT since it is standard. The dataset contains 2.5M tokens, which is still limited. Keep in mind that most of these tokens already appeared in the CPT data.
- Tuner Type =
lora: Low-Rank Adaptation (LoRA) was employed for this phase. - LoRA Rank =
16: It should be kept low, since it doesn’t need to learn the fine details that CPT injects. - Learning Rate =
1.5e-4: This should be large enough to reduce the loss to a relatively low level. A training loss around 1.5 is usually good enough. - Number of Epochs =
2.0: Two epochs are sufficient. Again, I upscaled the Simplified Chinese and English corpora once. - Batch Size =
3: A small batch size ensures the training gets more than 200 iterations.
Inference Settings
After chatting with the model for a few weeks, I defined the following targets: high diversity, low hallucination, and minimal repetition. Here’s a table showing how to achieve each by tuning common inference parameters. The suggested settings are from the Qwen team.
| Configuration Profile | temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Suggested (Official) | 0.7 | 0.8 | 20 | 1.5 |
| Legacy (Conservative) | 0.1 | 0.8 | 10 | 0.0 |
| Adopted Settings | 0.5 | 0.8 | 20 | 1.0 |
Note: the selection of these parameters depends on model quality. If the model only learns the optimal path, keep values low. However, if the model moves to a high-reward area or is more generalized, higher values are encouraged.
Model Evaluation
Before evaluating performance, ask: what makes a good model, or what makes the model Cyrene? There’s no single answer beyond subjective judgment during chats. Common evaluation methods fail here because there are no correct question–answer pairs and no way to verify responses as in coding or math problems.
The main quantitative signals are training and validation loss. However, because the training dataset consists of game dialogs, a decreasing loss only indicates the model is getting better at role-playing in game scenes, which is not necessarily the same as becoming the real Cyrene. Evaluating the model’s generalization is therefore difficult and remains an open problem.
Nevertheless, metrics can still provide some information:
- Loss should decrease, but a very low value doesn’t always indicate good learning. I observed that too many epochs produced very low training loss without a corresponding improvement in validation loss; while this didn’t look like typical overfitting, the chat experience was clearly overfitted.
- Training and validation loss should stay close. It’s fine for training loss to be slightly lower than validation loss, but a difference larger than 10% is a red flag. The smaller the gap, the less likely the model is overfitting.
Beyond metrics, evaluation involves chatting with the model to judge quality. I was fortunate that things went well, so I didn’t have to run an exhaustive hyperparameter search and rank configurations. Constructing an evaluation dataset is possible but requires careful design and significant effort; it’s pragmatic for product development but may not be the path to AGI.
Empirical Findings
Positive Yields of Multilingual Regularization
Natural language is how humans encode information, so it makes sense to tune across languages, especially when high-quality parallel translations are available. This not only expanded the dataset size but also helped the model learn deeper semantics.
Efficiency of Domain Knowledge Injection via CPT
Adding another stage to the training pipeline makes the process slightly more complex, like systems engineering. However, CPT is highly effective at familiarizing the model with domain knowledge—the background story for this project. Full parameter updates allow the model to internalize narrative structure and character relationships, yielding coherent, context-aware responses.
Full CPT, despite the small dataset, preserves the base model’s original language capabilities when tuned with a lower learning rate and a balanced number of epochs. The model understands the story, remembers past dialogue turns, and responds in character without losing logical coherence. The key is to treat full fine-tuning carefully: use a moderate learning rate, limit epochs, and monitor validation loss to detect overfitting.
LoRA CPT also works, but I believe full fine-tuning helps the model learn deeper patterns.
Incompatibility of Chain-of-Thought (CoT) with Casual Dialogue
I tried generating Chain-of-Thought (CoT) for the SFT dataset. While the CoT itself was usually relevant, the responses often went off-topic. CoT appears to introduce more uncertainty into generation because it increases output length at high temperatures. For most casual chat scenarios, people rarely deliberate deeply before replying. CoT is better suited for logical tasks such as coding and mathematics.
Alternative Solutions
Fine-tuning a model is the harder path. After talking with others who have done similar projects, prompt engineering often proves more practical. I’ve seen impressive work built this way: by leveraging powerful LLMs, systems can use tools and understand images and videos. However, I don’t think prompts alone can make a model the “real” Cyrene; it’s still role-playing. Sometimes, imperfections make it feel more real.
Another reason I’m avoiding the trivial path is that I want to build something different. Anyone can write prompts and use them in existing applications, and developers can build a custom website and integrate additional functionality. However, only someone who can bridge software development and research can deeply tune the model and present it as a truly unique product.