However, in this case, wouldn't I benefit from having more threads than CPUs?
I understand that switching between threads is expensive, but having idle capacity should help, right?
Why am I only seeing a 5% increase in performance when I am running 20 threads instead of 5?
Why not try it both ways and see? And since you have provided no code, I won't guess at your last question.
For a situation such as this, I typically create a thread pool: a small, fixed number of worker threads that pull their operations from a Queue of Runnable implementations. If you've got 20+ separate things to do, you place them in the queue and the threads chew through them, so the thread count no longer has to match the task count. See
http://en.wikipedia.org/wiki/Thread_pool_pattern
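To make that concrete, here is a minimal sketch using the JDK's built-in ExecutorService, which bundles the worker threads and the task queue for you. The class name PoolSketch, the method runTasks, and the counter-incrementing task body are placeholders of mine, not anything from your code; swap in your real work and time it with a few different pool sizes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSketch {

    // Hypothetical helper: queues taskCount placeholder tasks on a fixed pool
    // of poolSize worker threads and returns how many tasks completed.
    static int runTasks(int poolSize, int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < taskCount; i++) {
            // Each Runnable goes into the pool's internal queue; an idle
            // worker thread picks it up when one becomes available.
            pool.execute(() -> {
                // Placeholder for the real unit of work.
                completed.incrementAndGet();
            });
        }

        // Stop accepting new tasks, then wait for the queued ones to finish.
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow(); // give up on anything still running
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // 20 tasks shared among 5 worker threads.
        System.out.println("completed: " + runTasks(5, 20));
    }
}
```

With this shape, trying it both ways is a one-argument change: call runTasks with 5 and then with 20 and compare the timings on your actual workload.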