Reinforcement Learning – A Reference
...
Solution: Use importance sampling to correct for the difference between the behaviour policy and the target policy -> Off-policy Monte Carlo
Problem: It's hard to pick the right value of n.
Solution: Calculate n-step returns for several values of n and combine them -> TD(λ)
Solution: Compare returns to a reference value (a baseline) to better judge whether the outcome was actually above average for that state.
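The three fixes named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's own code: the function names, the toy episode, and the indexing convention (rewards[k] is the reward received after acting in state k, values[k] is an estimate of V(s_k), and the terminal state has value 0) are all hypothetical. The importance-sampling function uses the ordinary (unweighted) estimator for a single episode.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Plain Monte Carlo return from the start of the episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def importance_weighted_return(rewards: Sequence[float],
                               target_probs: Sequence[float],
                               behaviour_probs: Sequence[float],
                               gamma: float) -> float:
    """Ordinary importance sampling for one episode (assumed variant).

    target_probs[k] and behaviour_probs[k] are the probabilities that each
    policy assigns to the action actually taken at step k.
    """
    rho = 1.0
    for pi, b in zip(target_probs, behaviour_probs):
        rho *= pi / b  # product of per-step likelihood ratios
    return rho * discounted_return(rewards, gamma)


def n_step_return(rewards: Sequence[float], values: Sequence[float],
                  t: int, n: int, gamma: float) -> float:
    """n-step return from time t, bootstrapping from values[t + n].

    If the episode ends within n steps, no bootstrap term is added.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:
        g += gamma ** (end - t) * values[end]
    return g


def lambda_return(rewards: Sequence[float], values: Sequence[float],
                  t: int, gamma: float, lam: float) -> float:
    """Combine all n-step returns with weights (1 - lam) * lam**(n - 1).

    The weight left over at the end of the episode goes to the full
    Monte Carlo return, so the weights sum to 1.
    """
    horizon = len(rewards) - t  # steps remaining until the episode ends
    g = 0.0
    for n in range(1, horizon):
        weight = (1 - lam) * lam ** (n - 1)
        g += weight * n_step_return(rewards, values, t, n, gamma)
    g += lam ** (horizon - 1) * n_step_return(rewards, values, t, horizon, gamma)
    return g


def advantage(observed_return: float, baseline: float) -> float:
    """Return minus baseline: positive means better than expected."""
    return observed_return - baseline


# Toy three-step episode (made-up numbers, for illustration only).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
gamma = 0.9

g_lambda = lambda_return(rewards, values, t=0, gamma=gamma, lam=0.5)
g_offpolicy = importance_weighted_return(
    rewards, target_probs=[0.5, 1.0, 0.5],
    behaviour_probs=[0.5, 0.5, 0.5], gamma=gamma)
adv = advantage(discounted_return(rewards, gamma), baseline=values[0])
```

With lam set to 0 the λ-return reduces to the one-step TD target, and with lam set to 1 it reduces to the Monte Carlo return, which is why a single λ can stand in for the choice of n.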