A from-scratch implementation of a decoder-only, transformer-based language model. You can find my articles about this project here:
https://davids.onl/blog/mathematical-foundations-of-self-attention-mechanisms/
https://davids.onl/blog/mathematical-and-architectural-analysis-of-decoder-only-transformers/
This project gave me insight and perspective on language models, and made me realize how much data is required and how compute-intensive language modeling is as a task: language must not only be semantically and grammatically correct, but also extremely concise and informative.
Some stats about one of the completed training runs:

Visual representation of how attention mechanisms work:
Attention mechanisms applied to images; the white blur shows where the model attends.
Image credits: https://arxiv.org/abs/1502.03044
Attention mechanisms applied to next-word prediction.
Image credits: https://arxiv.org/abs/1706.03762
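For readers who want the mechanics behind those figures, here is a minimal sketch of causal (masked) scaled dot-product attention in NumPy. The weight names `w_q`, `w_k`, `w_v` and the tensor sizes are illustrative assumptions, not the exact shapes or framework used in this repo.

```python
# Minimal sketch of scaled dot-product attention with a causal mask
# (illustrative only; not the project's actual implementation).
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices (hypothetical shapes)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                      # project inputs to queries, keys, values
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)                        # similarity of every query to every key
    future = np.triu(np.ones_like(scores), k=1).astype(bool)  # strictly upper triangle = future positions
    scores = np.where(future, -np.inf, scores)                # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax
    return weights @ v                                        # weighted sum of values per position

# Example with hypothetical sizes: 5 tokens, model width 8, head width 4
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)  # (5, 4)
```

The causal mask is what makes this suitable for next-word prediction: each position can only attend to itself and earlier tokens, which is the behavior the second figure above depicts.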