Resources - LLM Hackathon

Welcome to your curated resource list for the hackathon. This guide provides a starting point for learning key techniques, finding datasets, and exploring the essential tools and research papers at the intersection of Large Language Models (LLMs), materials science, and chemistry.

Tutorials & Learning Resources

Foundational LLM Concepts

Intro to Large Language Models

A general-audience overview of how LLMs are trained and the key concepts behind their operation, including pre-training, fine-tuning, and RLHF.

Let's build GPT: from scratch, in code, spelled out

A deep, code-first dive into the Transformer architecture that powers models like GPT.

LLM Course on GitHub

A comprehensive course covering LLM fundamentals, the science of building LLMs, and the engineering of LLM-based applications.

Library-Specific Tutorials

RDKit

RDKit is a collection of cheminformatics and machine-learning software written in C++ and Python.

Official Documentation: The "Getting Started with the RDKit in Python" guide is the best place to begin.

YouTube Tutorial: The video tutorial by Jan Jensen is another good starting point.
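As a quick orientation before working through the guide, here is a minimal sketch of a typical first RDKit session, assuming RDKit is installed (for example via `pip install rdkit`). The molecule (ethanol) and the variable names are illustrative choices, not part of the original resource list.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Parse a SMILES string into an RDKit molecule (ethanol, chosen for illustration).
mol = Chem.MolFromSmiles("CCO")
if mol is None:
    raise ValueError("RDKit could not parse the SMILES string")

# Heavy-atom count: hydrogens are implicit unless added with Chem.AddHs.
num_heavy_atoms = mol.GetNumAtoms()

# Average molecular weight in g/mol.
mol_weight = Descriptors.MolWt(mol)

# Canonical SMILES is handy for deduplicating molecules in a dataset.
canonical_smiles = Chem.MolToSmiles(mol)

print(num_heavy_atoms, round(mol_weight, 2), canonical_smiles)
```

Checking for `None` after `MolFromSmiles` matters in practice, because invalid SMILES (common in LLM-generated output) return `None` rather than raising an exception.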

PySCF

The Python-based Simulations of Chemistry Framework (PySCF) is an open-source library for quantum chemistry calculations. It is highly extensible and designed for simplicity, both for users and developers.

Official Documentation: The 'User Guide' and 'Tutorials' are great starting points.

Atomic Simulation Environment (ASE)

The Atomic Simulation Environment (ASE) is a set of tools and Python modules for setting up, manipulating, running, visualizing and analyzing atomistic simulations.

Official Documentation: The 'ASE Tutorials' are the best place to begin.

Pymatgen

Pymatgen (Python Materials Genomics) is a robust, open-source Python library for materials analysis.

Official Documentation: The pymatgen API Documentation is the best place to begin.

YouTube Tutorial: The video tutorial by Anubhav Jain, a developer of pymatgen, is another good starting point.

LangChain

LangChain is an open-source library specifically designed for creating applications using large language models (LLMs).

Official Documentation: The best place to begin is the official LangChain documentation. It offers a comprehensive overview of the framework, from installation to advanced use cases.

YouTube Tutorial: For visual learners, the aiwithbrandon YouTube channel provides a wealth of tutorials. The "LangChain Master Class For Beginners 2024" video is an excellent starting point.

LangGraph

LangGraph, created by LangChain, is an open-source AI agent framework for building, deploying, and managing complex generative AI agent workflows.

Official Documentation: To dive into building stateful, multi-actor applications, the LangGraph documentation is your go-to resource.

YouTube Tutorial: A great video tutorial by LangChain on "Building Effective Agents with LangGraph" provides a practical introduction to creating sophisticated agents.

Technique-Specific Guides

Fine-Tuning LLMs

Hugging Face Blog: A practical guide on how to fine-tune a Llama 2 model in a Google Colab notebook, providing hands-on experience.

YouTube Tutorial: An excellent, detailed video from DeepLearning.AI that walks through the concepts and code for fine-tuning Large Language Models.

RAG Technique

LangChain Documentation: Tutorial for Retrieval-Augmented Generation (RAG) from LangChain, explaining how to build applications that connect LLMs to external data sources.

YouTube Tutorial: A clear and concise explanation from IBM Technology on what RAG is, how it works, and why it's a powerful technique for enhancing LLM performance.
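The core RAG loop (retrieve relevant text, then place it in the prompt) can be illustrated without any framework. The sketch below is a toy, standard-library-only version: it uses bag-of-words cosine similarity in place of the embedding models and vector stores a real system would use, and every function name and document in it is hypothetical. The final call to an LLM is deliberately left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Augment the question with retrieved context before sending it to an LLM."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Toy knowledge base (illustrative sentences only).
docs = [
    "Ethanol has the molecular formula C2H6O.",
    "Copper crystallizes in a face-centered cubic lattice.",
    "Benzene is an aromatic hydrocarbon.",
]
question = "What lattice does copper have?"
prompt = build_prompt(question, docs)
print(prompt)
```

In a real application the similarity function would be replaced by dense embeddings, and the prompt would be passed to a model; frameworks such as LangChain wrap both steps.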

Datasets

Foundational Materials & Chemistry Databases

Materials Project

An open-access database of computed properties for over 200,000 inorganic materials and 577,000 molecules, powered by high-throughput DFT calculations. It's a primary resource for crystal structures, formation energies, and electronic properties.

NOMAD Laboratory (Novel Materials Discovery)

A massive, FAIR-compliant repository that aggregates and normalizes materials science data from over 60 different simulation codes and experimental sources. It hosts over 19 million entries and features an integrated AI Toolkit with Jupyter notebooks for direct analysis.

PubChem

A comprehensive public database from the NIH containing information on over 92 million unique chemical compounds, including their properties, structures, and biological activities. It is organized into Substance, Compound, and BioAssay databases.
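PubChem data can be queried programmatically through its PUG REST interface. The sketch below only constructs a request URL for compound properties looked up by name and makes no network call; the helper name `compound_property_url` is hypothetical, and the property names shown are two of those PUG REST accepts.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(name, properties):
    """Build a PUG REST URL requesting the given properties for a compound name."""
    # quote() percent-encodes spaces and other unsafe characters in the name.
    return (
        f"{PUG_REST_BASE}/compound/name/{quote(name)}"
        f"/property/{','.join(properties)}/JSON"
    )


url = compound_property_url("aspirin", ["MolecularFormula", "MolecularWeight"])
print(url)
```

The resulting URL can be fetched with `urllib.request.urlopen` or any HTTP client and parsed as JSON. When querying many compounds, check PubChem's usage policy and throttle requests accordingly.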

ChEMBL

A manually curated database of bioactive molecules with drug-like properties, focusing on compound bioactivity data against drug targets. It contains over 5.4 million bioactivity measurements from the scientific literature.

Open Reaction Database (ORD)

An open-access project to create a centralized repository of high-quality, structured organic reaction data, designed to overcome the noise of patent-mined datasets.

USPTO Reaction Datasets

Collections of chemical reactions extracted from United States patents, which have been foundational for machine learning in reaction prediction and retrosynthesis. Note that these often require cleaning.

Crystallography Open Database (COD)

An open-access collection of over 400,000 crystal structures for organic, inorganic, and metal-organic compounds.

Curated List of More Datasets

A curated list of the most useful datasets in materials science and chemistry for training machine learning and AI foundation models by Ben Blaiszik.

General Datasets

arXiv Preprints

The entire collection of preprints from arXiv is available for bulk download, providing a massive corpus for text mining the latest scientific research.

Hugging Face Datasets

A central hub for thousands of datasets, including a growing number for materials science and chemistry.

Kaggle Datasets

A platform hosting a wide variety of public datasets, including many relevant to chemistry and materials science.

Key Research Papers & Reviews

Examples & Reviews on LLMs in Materials & Chemistry

Reflections from the 2024 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

In the 2nd global hackathon for LLM applications in materials science and chemistry, 34 teams used large language models to build research applications across seven areas, including property prediction, molecular design, and scientific communication.

14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon

In the 1st global hackathon for LLM applications in materials science and chemistry, participants used large language models such as GPT-4 to build working prototypes for chemistry and materials science applications in just two days.

A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools

A comprehensive review on how large AI foundation models (like ChatGPT-style systems) are being used to accelerate materials science research across six key areas from data analysis to discovering new materials.

A review of large language models and autonomous agents in chemistry

The authors reviewed how large language models and AI agents are being used in chemistry for tasks like molecule design and laboratory automation, and created a repository to track ongoing research in this rapidly evolving field.

General Tools

🎬 Slide Design + MVP Recording

Powtoon

HiSlide

TinyTake

Canva

Google Slides

Loom

🌀 Website Builder + Landing Page Mockup

Wix

Weebly

Squarespace

Canva

💻 No Code MVP + Prototyping Tools

Moqups

Marvel Apps

Origami

InVision

Framer

🎨 Design + Artwork + Creative Tools

Unsplash

Noun Project

Icons8

Feather Icons

FreePik

Undraw

Artboard.Studio

✏️ Marketing Copy

Good Email Copy

CopyAI

ReallyGoodEmails