["to", "be", "or"]
["the", "cat", "sat", "on", "the", "mat"]
["the cat", "cat sat", "sat on", "on the", "the mat"]
["the cat sat", "cat sat on", "sat on the", "on the mat"]
Given a sequence of tokens, we can predict the probability of the next word given the previous words
Each of these conditional probabilities can be estimated from the frequency of the corresponding n-gram in a training corpus
The most likely next word is the one with the highest probability
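A count-based bigram model can be sketched as follows (the tiny corpus and helper names here are illustrative, not part of the assignment):

```python
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]

# Count bigrams and the contexts they start from.
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def p_next(context, word):
    """P(word | context), estimated from bigram frequencies."""
    return bigram_counts[(context, word)] / context_counts[context]

# "the" occurs twice as a context; "cat" follows it once.
print(p_next("the", "cat"))  # → 0.5
print(p_next("cat", "sat"))  # → 1.0
```

Picking the next word then amounts to taking the argmax of `p_next` over the vocabulary.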
What are some limitations of this approach?

Alternative solution: represent individual words as vectors, or embeddings
How are these embeddings defined?
| Word | Embedding |
|---|---|
| cat | [0.2, 0.3, 0.5] |
| dog | [0.1, 0.4, 0.4] |
| mat | [0.5, 0.2, 0.2] |
| rug | [0.4, 0.1, 0.1] |
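With the toy vectors above, word similarity can be measured as cosine similarity — a minimal sketch in plain Python, assuming no particular embedding library:

```python
import math

embeddings = {
    "cat": [0.2, 0.3, 0.5],
    "dog": [0.1, 0.4, 0.4],
    "mat": [0.5, 0.2, 0.2],
    "rug": [0.4, 0.1, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Under these toy embeddings, "cat" is closer to "dog" than to "mat".
print(cosine(embeddings["cat"], embeddings["dog"]))
print(cosine(embeddings["cat"], embeddings["mat"]))
```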


General process:
To Colab!
This is the process you'll be following for Assignment 3
Time flies like an arrow; fruit flies like a banana.
Simple approach: just reverse the sequence
"Our representations differ from traditional word type embeddings in that each token is assigned a representation that is a function of the entire input sentence. We use vectors derived from a bidirectional LSTM that is trained with a coupled language model objective on a large text corpus" -- Peters et al
"This warm weather is enjoyable""This", "warm", "weath", "er", "is", "enjoy", "able"| English | Spanish |
|---|---|
| My mother did nothing but weep | Mi madre no hizo nada sino llorar |
| Croatia is in the southeastern part of Europe | Croacia está en el sudeste de Europa |
| I would prefer an honorable death | Preferiría una muerte honorable |
| I have never eaten a mango before | Nunca he comido un mango |

Challenges: different word order and length, special characters, grammar, idioms, etc.
Approach: Intermediate representation
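One way to picture the idea: an encoder compresses the source sentence into a fixed-size vector, and the decoder generates the target from that vector alone. A toy sketch where averaging word embeddings stands in for a real learned encoder — all names and values here are illustrative:

```python
import numpy as np

# Toy embedding table; a real system learns these parameters.
vocab = {"my": 0, "mother": 1, "did": 2, "nothing": 3, "but": 4, "weep": 5}
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 4))  # one 4-dim vector per word

def encode(tokens):
    """Map a source sentence to one fixed-size vector (here: mean of embeddings)."""
    return E[[vocab[t] for t in tokens]].mean(axis=0)

z = encode(["my", "mother", "did", "nothing", "but", "weep"])
print(z.shape)  # same size regardless of sentence length
```

The point of the intermediate representation is exactly this: the decoder sees a vector of fixed size, regardless of the source sentence's length or word order.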
As the vocabulary grows, computing the full softmax becomes very slow. Sampled softmax is one solution
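The idea can be sketched as: instead of normalizing over the full vocabulary, normalize over the true word plus a few sampled negatives. A simplified illustration (it omits the sampling-bias correction a real implementation applies, and for simplicity ignores the small chance that a negative equals the target):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 50_000                       # vocabulary size
logits = rng.normal(size=V)      # output-layer scores for every word
target = 123                     # index of the true next word

# Full softmax: the normalizer costs O(V).
full_p = np.exp(logits - logits.max())
full_p /= full_p.sum()

# Sampled softmax: normalize over the target plus k random negatives.
k = 20
negatives = rng.choice(V, size=k, replace=False)
subset = np.concatenate(([target], negatives))
sub_p = np.exp(logits[subset] - logits[subset].max())
sub_p /= sub_p.sum()

# The training loss touches only k + 1 logits instead of all V.
loss = -np.log(sub_p[0])
print(loss)
```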