

Value Iteration:
Problem: we still don't know the optimal policy
Bellman's equation for Q-values (optimal values of state-action pairs):
$$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$$
Optimal policy:
$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$
For small state and action spaces, we can use dynamic programming to iteratively solve for the optimal Q-values.
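At each iteration, the Q estimate is updated according to:
$$Q_{i+1}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q_i(s', a') \mid s, a \right]$$
Where:
Q_i is the estimate of the optimal Q-values at iteration i
s' is the next state reached after taking action a in state s
r is the immediate reward
γ is the discount factor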
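A minimal sketch of the iterative Q update for a small tabular problem may help make it concrete. It assumes the transition probabilities and expected rewards are known and stored as NumPy arrays; the names `P`, `R`, `q_value_iteration`, and the two-state toy problem are hypothetical and chosen only for illustration.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular Q-value iteration (illustrative sketch).

    P[s, a, s2] : assumed-known probability of reaching s2 from s under a
    R[s, a]     : assumed-known expected immediate reward
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iters):
        # Bellman backup: expected reward plus discounted value of the
        # best action in the next state, averaged over next states.
        Q_next = R + gamma * (P @ Q.max(axis=1))
        converged = np.max(np.abs(Q_next - Q)) < tol
        Q = Q_next
        if converged:
            break
    return Q

# Hypothetical toy problem: action 0 leads to state 0, action 1 leads
# to state 1, and only choosing action 1 while in state 1 is rewarded.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # transitions from state 0
    [[1.0, 0.0], [0.0, 1.0]],  # transitions from state 1
])
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
])

Q = q_value_iteration(P, R, gamma=0.9)
policy = Q.argmax(axis=1)  # greedy policy with respect to the converged Q
print(Q)
print(policy)
```

The greedy policy is read off the converged table with an argmax over actions. Storing one entry per state-action pair is what restricts this approach to small spaces.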
One approach: Approximate Q-learning: