Evaluation & Benchmarking: Have you trained an LLM for a specific task (e.g., extracting ticket details)? If so, how did you go about training the model, and what data did you use? How did you evaluate the model beyond loss curves? Were standard benchmarks (like MMLU) relevant? If not, what task-specific metrics would you design? If you have no experience with this, please say so.