Combining policy gradient and Q-learning

Q-Prop: Sample-Efficient Policy Gradient with an Off-Policy Critic

Bridging the Gap Between Value and Policy Based Reinforcement Learning

Sample Efficient Actor-Critic with Experience Replay

Equivalence Between Policy Gradients and Soft Q-Learning

The Reactor: A Sample-Efficient Actor-Critic Architecture