Landmark LLM Papers
Introduction

A curated list of landmark papers in the field of LLMs. (A minimal sketch of the Transformer's scaled dot-product attention follows the lists, for readers new to these papers.)

Foundational
- Efficient Estimation of Word Representations in Vector Space (Word2Vec) (2013)
- GloVe: Global Vectors for Word Representation (2014)
- Neural Machine Translation by Jointly Learning to Align and Translate (2014): introduced the concept of attention

Transformer
- Attention Is All You Need (2017): introduced the Transformer architecture (see the attention sketch after these lists)
- Self-Attention with Relative Position Representations (2018)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (2021)
- RoFormer: Enhanced Transformer with Rotary Position Embedding (2021)

Large Language Models
- Universal Language Model Fine-tuning for Text Classification (ULMFiT) (2018)
- Improving Language Understanding by Generative Pre-Training (GPT-1) (2018)
- Language Models are Unsupervised Multitask Learners (GPT-2) (2019)
- Language Models are Few-Shot Learners (GPT-3) (2020)
- What Can Transformers Learn In-Context? A Case Study of Simple Function Classes (2022)
- GPT-4 Technical Report (2023)

Alignment
- Deep reinforcement learning from human preferences (2017)
- Training language models to follow instructions with human feedback (2022)
- Constitutional AI: Harmlessness from AI Feedback (2022)

Scaling Laws, Emergence
- Scaling Laws for Neural Language Models (2020)
- Training Compute-Optimal Large Language Models (2022)
- Emergent Abilities of Large Language Models (2022)

Prompt / Context Engineering
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022)

Efficient Transformers
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (2019)
- Reformer: The Efficient Transformer (2020)
- Longformer: The Long-Document Transformer (2020)
- Generating Long Sequences with Sparse Transformers (2019)
- Big Bird: Transformers for Longer Sequences (2020)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022)

Survey Papers
- A Survey of Transformers (2022)
- Efficient Transformers: A Survey (2020)
- A Survey of Large Language Models (2023)
- On the Opportunities and Risks of Foundation Models (2022)
- Pre-train, Prompt, and Predict: A Survey of Prompting Methods in NLP (2021)
- Speed Always Wins: A Survey on Efficient Architectures for Large Language Models (2025)
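Since several of the papers above revolve around the attention mechanism, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of the Transformer. The function name and toy shapes are my own illustration, and masking and multiple heads are left out for brevity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention sketch.

    Q, K: (seq_len, d_k) arrays; V: (seq_len, d_v) array.
    Returns the attended values with shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors.
    return weights @ V

# Toy usage: 4 tokens, 8-dimensional queries/keys/values (illustrative sizes).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```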
Decoding the ResNet architecture
Introduction

Fast.ai’s 2017 batch kicked off on 30th Oct, and Jeremy Howard introduced us participants to the ResNet model in the first lecture itself. I had used this model earlier in passing but got curious to dig into its architecture this time. (In fact, in one of my earlier client projects I had used Faster R-CNN, which uses a ResNet variant under the hood.) ResNet was unleashed in 2015 by Kaiming He et al. through their paper Deep Residual Learning for Image Recognition and bagged the ImageNet challenges that year, including classification, detection, and localization. ...
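The core idea the post goes on to unpack is the residual (shortcut) connection, where a block learns a residual F(x) and adds it back to its input, y = F(x) + x. Below is a minimal PyTorch-style sketch of such a basic block; the class name, channel counts, and toy shapes are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """A basic residual block: output = relu(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                              # shortcut path
        out = F.relu(self.bn1(self.conv1(x)))     # first 3x3 conv
        out = self.bn2(self.conv2(out))           # second 3x3 conv
        return F.relu(out + identity)             # add the residual, then activate

# Toy usage: a batch of two 64-channel 56x56 feature maps.
block = BasicBlock(64)
print(block(torch.randn(2, 64, 56, 56)).shape)  # torch.Size([2, 64, 56, 56])
```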
Network In Network architecture: The beginning of Inception
Introduction

In this post, I explain the Network In Network paper by Min Lin, Qiang Chen, and Shuicheng Yan (2013). This paper was quite influential in that it had a new take on convolutional filter design, which inspired the Inception line of deep architectures from Google.

Motivation

Anyone getting introduced to convolutional networks first comes across this familiar arrangement of neurons designed by Yann LeCun decades ago:

Fig. LeNet-5

Yann LeCun’s work (1989, 1998) triggered the convolutional approach, which takes into account the inherent structure of the incoming data (mostly image data) while propagating it through the network and learning from it. ...
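The "new take" mentioned above is the paper's mlpconv layer: instead of a single linear filter followed by a nonlinearity, each local patch is processed by a small multilayer perceptron, which amounts to following an ordinary convolution with 1x1 convolutions. Here is a rough PyTorch sketch of that idea; the class name, channel counts, and kernel size are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MLPConvBlock(nn.Module):
    """Sketch of an mlpconv-style block in the spirit of Network In Network."""

    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=5, padding=2),  # ordinary conv filter
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=1),            # 1x1 conv = per-pixel MLP layer
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1),            # second 1x1 conv layer
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Toy usage: one RGB image of size 32x32.
block = MLPConvBlock(3, 96, 96)
print(block(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 96, 32, 32])
```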