Engineered a multimodal search framework to enhance Earth science data exploration and knowledge discovery, using lightweight text and vision embeddings derived from textual metadata and from geospatial measurements represented as images with stacked channels. Applied state-of-the-art multimodal models (see tools below), PyTorch, Transformers, and supervised learning techniques on a GPU server to fine-tune a multimodal foundation model for retrieval on image-text pairs from a curated geospatial dataset. Monitored training with Weights & Biases, and evaluated model performance with a range of methods and metrics, including unsupervised learning techniques (e.g., clustering and t-SNE dimensionality reduction). Met weekly with an interdisciplinary team of mentors to collaborate and informally present progress, findings, and proposed solutions to current and anticipated challenges. Formally presented project outcomes to the all-hands GES-DISC staff, followed by an oral presentation at the 2024 AGU Annual Meeting, the first time a NASA GES-DISC intern has achieved this while still in the program.
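The training loop below is a minimal sketch of this fine-tuning workflow, not the actual GES-DISC pipeline: it assumes a Hugging Face CLIP checkpoint, a toy placeholder dataset standing in for the curated geospatial image-text pairs, and illustrative hyperparameters, and it shows the contrastive objective and per-step Weights & Biases logging described above.

```python
# Minimal sketch: fine-tuning CLIP on image-text pairs with W&B logging.
# Dataset, hyperparameters, and project name are illustrative placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import wandb

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
wandb.init(project="geospatial-clip")  # hypothetical project name

# Placeholder dataset: in practice, curated geospatial image-text pairs;
# a few dummy pairs keep the sketch self-contained.
image_text_dataset = [(Image.new("RGB", (224, 224)), "example granule caption")] * 8

def collate(batch):
    # batch: list of (PIL image, caption) pairs
    images, texts = zip(*batch)
    return processor(text=list(texts), images=list(images),
                     return_tensors="pt", padding=True, truncation=True)

loader = DataLoader(image_text_dataset, batch_size=4, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(2):
    for inputs in loader:
        inputs = {k: v.to(device) for k, v in inputs.items()}
        outputs = model(**inputs, return_loss=True)  # symmetric contrastive loss over the batch
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        wandb.log({"train/contrastive_loss": outputs.loss.item()})
```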
The proof-of-concept success of our framework indicates that the final multimodal geospatial model can scale to a diverse range of geospatial datasets and tasks. Such scalability would have a high impact on data discoverability, because the model supports search based on the content within data collections rather than traditional search over entire individual data collections. Tools used include the CLIP model (Contrastive Language-Image Pre-training), the BLIP-2 model (Bootstrapping Language-Image Pre-training 2), Weights & Biases, a GPU server, CUDA, and Python libraries (e.g., PyTorch, Transformers).
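The sketch below illustrates the content-based search idea under simplifying assumptions: a base CLIP checkpoint stands in for the fine-tuned model, granule image embeddings are assumed to be precomputed with get_image_features, and the query string is illustrative. A free-text query is embedded and ranked against the stored image embeddings by cosine similarity.

```python
# Minimal content-based retrieval sketch. Assumes image_embeds is a tensor of
# precomputed granule-image embeddings (one row per image) from get_image_features.
import torch
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-base-patch32"  # replace with the fine-tuned checkpoint
model = CLIPModel.from_pretrained(ckpt).eval()
processor = CLIPProcessor.from_pretrained(ckpt)

@torch.no_grad()
def search(query: str, image_embeds: torch.Tensor, top_k: int = 5):
    """Rank stored granule-image embeddings by cosine similarity to a text query."""
    tokens = processor(text=[query], return_tensors="pt", padding=True)
    text_embed = model.get_text_features(**tokens)
    text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    scores = image_embeds @ text_embed.squeeze(0)        # cosine similarity per image
    return scores.topk(min(top_k, scores.numel()))       # (values, indices) of best matches

# Example usage (query is illustrative):
# values, indices = search("sea surface temperature anomaly", image_embeds)
```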