Hackathon Resources
Welcome to your curated resource list for the hackathon. This guide provides a starting point for learning key techniques, finding datasets, and exploring the essential tools and research papers at the intersection of Large Language Models (LLMs), materials science, and chemistry.
Tutorials & Learning Resources
Foundational LLM Concepts
Intro to Large Language Models
A general-audience overview of how LLMs are trained and the key concepts behind their operation, including pre-training, fine-tuning, and RLHF.
Library-Specific Tutorials
RDKit
RDKit is a collection of cheminformatics and machine-learning software written in C++ and Python.
Official Documentation: The "Getting Started with the RDKit in Python" guide is the best place to begin.
YouTube Tutorial: The video tutorial by Jan Jensen is another good place to start.
PySCF
The Python-based Simulations of Chemistry Framework (PySCF) is an open-source library for quantum chemistry calculations. It is highly extensible and designed for simplicity, both for users and developers.
Official Documentation: The 'User Guide' and 'Tutorials' are great starting points.
Atomic Simulation Environment (ASE)
The Atomic Simulation Environment (ASE) is a set of tools and Python modules for setting up, manipulating, running, visualizing and analyzing atomistic simulations.
Official Documentation: The 'ASE Tutorials' are the best place to begin.
Pymatgen
Pymatgen (Python Materials Genomics) is a robust, open-source Python library for materials analysis.
Official Documentation: The pymatgen API Documentation is the best place to begin.
YouTube Tutorial: The video tutorial by Anubhav Jain, a developer of pymatgen, is another good place to start.
LangChain
LangChain is an open-source library specifically designed for creating applications using large language models (LLMs).
Official Documentation: The best place to begin is the official LangChain documentation. It offers a comprehensive overview of the framework, from installation to advanced use cases.
YouTube Tutorial: For visual learners, the aiwithbrandon YouTube channel provides a wealth of tutorials. The "LangChain Master Class For Beginners 2024" video is an excellent starting point.
LangGraph
LangGraph, created by LangChain, is an open-source AI agent framework for building, deploying, and managing complex generative AI agent workflows.
Official Documentation: To dive into building stateful, multi-actor applications, the LangGraph documentation is your go-to resource.
YouTube Tutorial: A great video tutorial by LangChain on "Building Effective Agents with LangGraph" provides a practical introduction to creating sophisticated agents.
Technique-Specific Guides
Fine-Tuning LLMs
Hugging Face Blog: A practical guide on how to fine-tune a Llama 2 model in a Google Colab notebook, providing hands-on experience.
YouTube Tutorial: An excellent, detailed video from DeepLearning.AI that walks through the concepts and code for fine-tuning Large Language Models.
RAG Technique
LangChain Documentation: Tutorial for Retrieval-Augmented Generation (RAG) from LangChain, explaining how to build applications that connect LLMs to external data sources.
YouTube Tutorial: A clear and concise explanation from IBM Technology on what RAG is, how it works, and why it's a powerful technique for enhancing LLM performance.
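To make the RAG idea concrete before diving into the tutorials, here is a minimal sketch that uses only the Python standard library. It is not LangChain's API: the `DOCUMENTS` list, the function names, and the bag-of-words cosine scoring are all illustrative assumptions standing in for a real embedding model and vector store. The sketch retrieves the most relevant text for a question and places it in the prompt that would be sent to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; in practice these would be chunks of papers or database records.
DOCUMENTS = [
    "RDKit is a cheminformatics toolkit for working with molecules and SMILES strings.",
    "PySCF is a Python library for quantum chemistry calculations.",
    "The Materials Project is a database of computed properties of inorganic materials.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(documents, key=lambda d: cosine(query, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Combine the retrieved context and the question into one prompt string."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("Which library does quantum chemistry calculations?", DOCUMENTS))
```

The resulting prompt string is what would be passed to a language model. A production pipeline would typically replace the word-count scoring with learned embeddings, but the retrieve-then-generate structure stays the same.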
Datasets
Foundational Materials & Chemistry Databases
Materials Project
An open-access database of computed properties for over 200,000 inorganic materials and 577,000 molecules, powered by high-throughput DFT calculations. It's a primary resource for crystal structures, formation energies, and electronic properties.
NOMAD Laboratory (Novel Materials Discovery)
A massive, FAIR-compliant repository that aggregates and normalizes materials science data from over 60 different simulation codes and experimental sources. It hosts over 19 million entries and features an integrated AI Toolkit with Jupyter notebooks for direct analysis.
PubChem
A comprehensive public database from the NIH containing information on over 92 million unique chemical compounds, including their properties, structures, and biological activities. It is organized into Substance, Compound, and BioAssay databases.
ChEMBL
A manually curated database of bioactive molecules with drug-like properties, focusing on compound bioactivity data against drug targets. It contains over 5.4 million bioactivity measurements from the scientific literature.
Open Reaction Database (ORD)
An open-access project to create a centralized repository of high-quality, structured organic reaction data, designed to overcome the noise of patent-mined datasets.
USPTO Reaction Datasets
Collections of chemical reactions extracted from United States patents, which have been foundational for machine learning in reaction prediction and retrosynthesis. Note that these often require cleaning.
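As a rough illustration of what a first cleaning pass might involve, the sketch below filters a list of reaction SMILES strings using only the Python standard library. It assumes each record has the plain `reactants>agents>products` form; the `clean_reactions` function and its rules (drop malformed records, drop reactions with missing or unchanged products, remove duplicates) are hypothetical choices, not a standard pipeline. It works purely on strings, so it does not canonicalize molecules; a cheminformatics toolkit such as RDKit would be needed for that.

```python
def clean_reactions(records):
    """Filter and deduplicate reaction SMILES of the form reactants>agents>products."""
    seen = set()
    cleaned = []
    for raw in records:
        parts = raw.strip().split(">")
        # A well-formed reaction SMILES has exactly two '>' separators.
        if len(parts) != 3:
            continue
        reactants, _agents, products = parts
        # Skip records with no reactants or no products.
        if not reactants or not products:
            continue
        # Sort the '.'-separated components so that component order does not
        # create spurious duplicates (string-level normalization only).
        key = tuple(".".join(sorted(p.split("."))) for p in parts)
        # Skip "reactions" whose products are identical to the reactants.
        if key[0] == key[2]:
            continue
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(">".join(key))
    return cleaned


if __name__ == "__main__":
    sample = [
        "CCO.CC(=O)O>>CC(=O)OCC",
        "CC(=O)O.CCO>>CC(=O)OCC",  # same reaction, different component order
        "CCO>>",                   # missing products
        "not a reaction",          # malformed
        "CCO>>CCO",                # nothing changes
    ]
    print(clean_reactions(sample))
```

Two different SMILES spellings of the same molecule will still be treated as distinct here, which is why real cleaning workflows usually canonicalize structures before deduplicating.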
Crystallography Open Database (COD)
An open-access collection of over 400,000 crystal structures for organic, inorganic, and metal-organic compounds.
Curated List of More Datasets
A list, curated by Ben Blaiszik, of the most useful datasets in materials science and chemistry for training machine learning and AI foundation models.
General Datasets
Key Research Papers & Reviews
Examples & Reviews on LLMs in Materials & Chemistry
Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry
In the 2nd global hackathon for LLM applications in materials and chemistry, 34 teams used large language models to build applications for materials science and chemistry research across seven areas, including property prediction, molecular design, and scientific communication.
14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon
In the 1st global hackathon for LLM applications in materials and chemistry, participants used large language models such as GPT-4 to build working prototypes for chemistry and materials science applications in just two days.
A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools
A comprehensive review of how large AI foundation models (such as ChatGPT-style systems) are being used to accelerate materials science research across six key areas, from data analysis to the discovery of new materials.