NASA's Interagency Implementation and Advanced Concepts Team (IMPACT), through a Space Act Agreement with IBM, has jointly developed INDUS, a suite of large language models (LLMs) for scientific research. The models are designed to serve multiple fields, including Earth science, biological and physical sciences, heliophysics, planetary science, and astrophysics, and are trained on curated scientific literature from diverse data sources. INDUS is distinguished by its custom tokenizer and large domain-specific vocabulary, which give it strong capabilities for processing scientific literature and answering scientific questions.

INDUS contains two types of models: encoders and sentence transformers. Encoders convert natural language text into numeric representations that an LLM can process. The INDUS encoder was trained on a 6-billion-token corpus spanning astrophysics, planetary science, Earth science, heliophysics, and the biological and physical sciences. The custom tokenizer developed by the IMPACT-IBM collaboration improves on general-purpose tokenizers by recognizing scientific terms such as "biomarker" and "phosphorylation" as single units. More than half of the roughly 50,000 tokens in the INDUS vocabulary are unique to the scientific fields on which it was trained. The sentence-transformer models were fine-tuned from the INDUS encoder using approximately 268 million text pairs, including title/abstract and question/answer pairs.
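To illustrate why a domain-specific vocabulary helps, the toy sketch below uses a greedy longest-match subword tokenizer (a simplification; INDUS actually uses a learned BPE-style tokenizer, and these vocabularies are invented for the example). A generic vocabulary fragments a term like "phosphorylation" into many subwords, while a domain vocabulary keeps it as one token:

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenization over a fixed vocabulary.
    Unknown characters fall back to single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical vocabularies for illustration only.
GENERIC_VOCAB = {"ph", "os", "or", "yl", "ation"}
DOMAIN_VOCAB = GENERIC_VOCAB | {"phosphorylation", "biomarker"}

print(tokenize("phosphorylation", GENERIC_VOCAB))  # six subword pieces
print(tokenize("phosphorylation", DOMAIN_VOCAB))   # ['phosphorylation']
```

Fewer, more meaningful tokens mean the encoder sees scientific concepts as coherent units rather than arbitrary fragments, which is the intuition behind the custom tokenizer's gains.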
By equipping INDUS with a domain-specific vocabulary, the IMPACT-IBM team achieved better performance than open, non-domain-specific LLMs on a biomedical task benchmark, a scientific question-answering benchmark, and an Earth-science entity recognition test. Through training on diverse language tasks and retrieval-augmented generation, INDUS can handle researchers' questions, retrieve relevant documents, and generate answers. For latency-sensitive applications, the team also developed smaller, faster versions of the encoder and sentence-transformer models.
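The retrieval step in a retrieval-augmented pipeline typically works by embedding the query and all documents with the sentence-transformer model, then ranking documents by cosine similarity. A minimal sketch, using made-up 2-dimensional vectors in place of real embeddings:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: in practice these would come from the sentence transformer.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.1]
print(retrieve(query_vec, doc_vecs, k=2))  # top-2 most similar documents
```

The retrieved passages are then supplied as context to a generative model, which produces the final answer grounded in those documents.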
Validation tests showed that INDUS was able to retrieve relevant passages from the scientific literature when answering a NASA test set of approximately 400 questions. Commenting on the overall approach, IBM researcher Bishwaranjan Bhattacharjee said, "We achieved superior performance by not only having a custom vocabulary, but also a large domain-specific trained encoder model and a good training strategy. For the smaller, faster versions, we used neural architecture search to obtain the model architecture and knowledge distillation, with supervision from the larger model, for training."
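Knowledge distillation, mentioned in the quote above, trains the smaller "student" model to match the temperature-softened output distribution of the larger "teacher" model. The sketch below shows only the loss computation, with made-up logits; it is an illustration of the general technique, not IBM's actual training code:

```python
from math import exp, log

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    exps = [exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft cross-entropy between teacher and student distributions.
    Scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return -sum(pi * log(qi) for pi, qi in zip(p, q)) * temperature ** 2

# Toy logits: the loss is smallest when the student matches the teacher.
teacher = [1.0, 2.0, 3.0]
print(distill_loss(teacher, [1.0, 2.0, 3.0]))  # student matches teacher
print(distill_loss(teacher, [3.0, 2.0, 1.0]))  # student disagrees: larger loss
```

Minimizing this loss pushes the student toward the teacher's full output distribution, which carries more information than hard labels alone.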
Highlights:
- NASA partnered with IBM to develop the INDUS large language model, which serves fields such as Earth science, biological and physical sciences, heliophysics, planetary science, and astrophysics.
- INDUS contains two types of models, encoders and sentence transformers, trained with a custom tokenizer on a 6-billion-token corpus and fine-tuned on approximately 268 million text pairs.
- Thanks to its domain-specific vocabulary, diverse training tasks, and retrieval-augmented generation, INDUS outperforms open, non-domain-specific LLMs and can handle researchers' questions, retrieve relevant documents, and generate answers.
In short, the INDUS large language model provides a powerful new tool for scientific research, and its strong performance in specialized scientific fields points to broad applications in future research. The collaboration between NASA and IBM also sets a benchmark for applying large language models in the sciences.