Skip to content
Blog · ODIS · September 13, 2025 · 7 min read

Towards EnergyGPT: Building an LLM That Speaks Energy

Why We Fine-Tuned LLaMA 3.1 on Two Decades of Energy Literature

Amal Chebbi, Ph.D. · Data Scientist | Babajide Kolade, Ph.D., PE · Technical Director

General-purpose large language models are impressive, but ask one a detailed question about wellbore completions, reservoir pressure transient analysis, or DOE critical-minerals policy, and the cracks show quickly. The energy sector demands technical precision that broad-spectrum models routinely miss — muddled terminology, misapplied assumptions, or confident answers that ignore real engineering constraints.

That gap is why we built EnergyGPT: a domain-specialized language model tailored for energy, developed by fine-tuning Meta's LLaMA 3.1-8B with Supervised Fine-Tuning (SFT) on a rigorously curated corpus of energy-related texts. The research, authored by Amal Chebbi and Babajide Kolade and published on arXiv, presents the full development pipeline from data curation through evaluation and deployment.

Why not just prompt a bigger model?

Prompting a frontier model with energy context (or bolting on RAG) works for some use cases, but it has structural limits. Retrieval quality is sensitive to chunking strategy, embedding choice, and corpus coverage. Latency grows with every external lookup. And you are still relying on a model whose internal representations were shaped overwhelmingly by web text, not by high-precision engineering literature and domain-specific terminology.

Fine-tuning embeds domain knowledge directly into the model weights. The result is a model that doesn't just retrieve relevant content — it reasons in the language of the domain: it can handle formation names, completion types, thermodynamic relationships, and regulatory frameworks as first-class concepts rather than surface-pattern matches.

Building the training corpus

The quality of any fine-tuned model is bounded by the quality of its training data. We assembled a corpus from two primary sources:

Scientific literature. Approximately 40,000 papers from ASME journals spanning roughly two decades — the Journal of Energy Resources Technology, the Journal of Heat and Mass Transfer, the Journal of Fluids Engineering, and others. These yielded roughly 1.8 billion tokens after cleaning, which included stripping metadata, correcting broken LaTeX, normalizing Unicode, and fixing OCR artifacts.

The Pile (filtered). We extracted energy-relevant subsets from The Pile, a 22-domain English text corpus, using a multi-stage pipeline: quality classification with DeBERTa, exact deduplication via hashing, fuzzy deduplication via MinHash + Locality-Sensitive Hashing (260 hash functions per document, Jaccard threshold 0.8), and finally semantic filtering with transformer-based embeddings against expert-curated energy reference topics. This yielded roughly 133K energy-relevant documents, contributing about 340M additional tokens.
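To make the fuzzy-deduplication step concrete, here is a minimal sketch using the open-source datasketch library with the parameters described above (260 hash functions, Jaccard threshold 0.8). The tokenizer and document keys are placeholders; the paper's actual implementation may differ.

```python
from datasketch import MinHash, MinHashLSH

# Parameters mirroring the pipeline described above.
NUM_PERM = 260     # hash functions per document
THRESHOLD = 0.8    # Jaccard similarity threshold

def minhash_of(text: str) -> MinHash:
    """Build a MinHash signature from whitespace tokens (placeholder tokenizer)."""
    sig = MinHash(num_perm=NUM_PERM)
    for token in set(text.lower().split()):
        sig.update(token.encode("utf-8"))
    return sig

def fuzzy_dedup(docs: dict[str, str]) -> list[str]:
    """Return the document keys to keep, dropping near-duplicates above the threshold."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for key, text in docs.items():
        sig = minhash_of(text)
        if lsh.query(sig):        # a near-duplicate is already in the index
            continue
        lsh.insert(key, sig)
        kept.append(key)
    return kept
```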

We also deliberately preserved a small slice of the training mix as general-domain content to mitigate catastrophic forgetting, where fine-tuning on a narrow domain erodes the model's broader language capabilities.

From text to training pairs

SFT requires structured input-output pairs, not raw text. We designed a sliding-window pairing strategy that preserves cross-paragraph context:

For scientific papers, documents are segmented into paragraph-aware, sentence-aware, and equation-aware chunks, then paired using a stride of one paragraph, with each pair capped at 4,096 tokens. For Pile-sourced content, chunks of ~600 tokens are paired with a stride of two chunks. The result: over 730,000 high-quality input-output pairs for supervised fine-tuning.
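A simplified sketch of the sliding-window pairing idea follows, assuming paragraphs are already segmented and using a naive whitespace token count. The paper's chunking is additionally sentence- and equation-aware, so treat this as an illustration of the windowing and stride logic only.

```python
def build_pairs(paragraphs: list[str], window: int = 2, stride: int = 1,
                max_tokens: int = 4096, count_tokens=lambda s: len(s.split())):
    """Pair consecutive paragraph windows into (input, output) examples.

    Slide a window over the paragraphs with the given stride, use the first
    paragraph of each window as the input and the remainder as the output,
    and skip pairs that exceed the token cap.
    """
    pairs = []
    for start in range(0, len(paragraphs) - window + 1, stride):
        chunk = paragraphs[start:start + window]
        source, target = chunk[0], " ".join(chunk[1:])
        if count_tokens(source) + count_tokens(target) <= max_tokens:
            pairs.append({"input": source, "output": target})
    return pairs
```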

Training setup

EnergyGPT was fine-tuned on a cluster of 4 × NVIDIA A100-80GB GPUs using the NVIDIA NeMo framework and Megatron-LM toolkit. We trained a full-parameter SFT variant (updating all model weights) as well as a parameter-efficient variant (LoRA) to study the practical tradeoff between maximum domain adaptation and retraining efficiency.

Full-parameter SFT is often the best choice when the domain gap is large and the goal is deeper domain internalization; parameter-efficient tuning can be attractive when iteration speed and lower compute costs are the priority.
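The paper's variants were trained with NVIDIA NeMo and Megatron-LM; as a rough, framework-agnostic illustration of the parameter-efficient option, the sketch below attaches LoRA adapters to LLaMA 3.1-8B with Hugging Face PEFT. The rank, alpha, and target modules here are illustrative assumptions, not the settings used in the paper.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative LoRA configuration; hyperparameters are assumptions,
# not the NeMo settings reported in the paper.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
lora_cfg = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,              # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total weights
```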

Evaluating with a custom benchmark

Off-the-shelf benchmarks don't test what matters for energy professionals. We built a 476-question benchmark spanning three formats: 100 true/false statements, 233 multiple-choice questions, and 143 open-ended queries across difficulty levels from basic to challenging.

We evaluated open-ended responses using a rubric designed to reflect expert expectations (relevance, correctness, technical/scientific level, explainability, conciseness, and coherence). In addition to human review, we used calibrated LLM-based judging as supporting evidence to scale comparisons consistently (details in the paper).
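As a sketch of how rubric-based LLM judging can be scaled, the snippet below scores a response against the six dimensions listed above through an OpenAI-compatible judge endpoint. The prompt wording, 1-5 scale, endpoint, and model name are illustrative assumptions rather than the paper's exact protocol.

```python
import json
from openai import OpenAI

RUBRIC = ["relevance", "correctness", "technical/scientific level",
          "explainability", "conciseness", "coherence"]

# Placeholder endpoint and key; point these at a real judge model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def judge(question: str, answer: str) -> dict:
    """Ask the judge model to score each rubric dimension from 1 to 5."""
    prompt = (
        "Score the answer to the question on each criterion from 1 (poor) "
        f"to 5 (excellent). Criteria: {', '.join(RUBRIC)}.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Respond with a JSON object mapping each criterion to a score."
    )
    reply = client.chat.completions.create(
        model="judge-model",       # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return json.loads(reply.choices[0].message.content)
```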

Results

EnergyGPT consistently outperformed the base LLaMA 3.1-8B across most evaluation dimensions. The clearest gains were in multiple-choice accuracy and in the quality of open-ended technical answers, where EnergyGPT responses were more domain-grounded and better structured on harder questions. (For full quantitative breakdowns across question types and variants, see the paper.)

The most notable qualitative improvement: EnergyGPT generates responses that stay on-topic and provide contextually appropriate technical detail, whereas the base model more often drifts, oversimplifies, or misses domain constraints.

Deployment: from research to production

A model that only runs in a Jupyter notebook doesn't help anyone make decisions. We deployed EnergyGPT in two configurations:

On-premises via NVIDIA NIMs, serving inference on a 4×A100 server through an OpenAI-compatible REST API. We built a custom FastAPI gateway on top for multi-tenant API key management, per-project token quotas, and usage logging — essentially a lightweight LLM-as-a-service layer.

Cloud-hosted on Microsoft Azure, using Azure Machine Learning managed endpoints with 4-bit quantization for reduced memory footprint, fronted by Azure API Management for authentication, rate limiting, and monitoring.

Both paths produce a secure, scalable REST endpoint that can plug into downstream applications — including ODIS, where EnergyGPT powers the domain copilot for natural-language queries over oilfield data.
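Because both deployments expose an OpenAI-compatible API, downstream applications can query EnergyGPT with standard client libraries. A minimal sketch, assuming a placeholder gateway URL, API key, and model name:

```python
from openai import OpenAI

# Placeholder values: substitute your gateway URL, project API key,
# and the model name exposed by the deployment.
client = OpenAI(
    base_url="https://energygpt.example.com/v1",
    api_key="YOUR_PROJECT_API_KEY",
)

response = client.chat.completions.create(
    model="energygpt-8b",  # placeholder model identifier
    messages=[
        {"role": "user",
         "content": "Summarize the main pressure transient analysis methods "
                    "for a naturally fractured reservoir."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```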

What's next

EnergyGPT is a starting point, not an endpoint. The paper identifies several directions we're actively pursuing: integrating RAG to complement the parametric knowledge with live retrieval over operator documents and regulatory filings; adding structured multi-step reasoning for engineering calculations; and extending the training corpus to cover renewables, grid operations, and carbon-management literature more deeply.

The broader takeaway: domain specialization through SFT is a practical, cost-efficient path to building AI tools that energy professionals can actually trust. You don't need a 100B-parameter model or a massive GPU cluster — you need clean data, a sound training strategy, and a rigorous evaluation framework.

The full paper is available at arXiv:2509.07177. If you want to see EnergyGPT in action within a full decision platform, explore ODIS or see how it powers our Critical Minerals workflows.

LLM · NLP · energy AI · fine-tuning · EnergyGPT

Interested in learning more?

See how TerraNavitas can accelerate your energy or minerals workflow.