The idea is that your CPU has internal cache memory which is significantly faster than your system memory. However, it's usually quite small (my CPU has an L3 cache of 3 MB compared to 8 GB of system memory). There are multiple cache levels, each with its own speed/size trade-off. At the very top of the hierarchy are the registers, which are the fastest of all, but you only get a handful of them.
If you can coalesce as many reads/writes as possible so they operate on data that's already in the cache, you'll get better performance than if you had to constantly retrieve data from main memory. Every time the data you need isn't in the cache and has to be fetched from main memory, that's a cache miss. The better you coalesce your reads/writes, the better your memory locality, and thus the better your chances of having your code take advantage of the cache.
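To make that concrete, here's a minimal C sketch (the matrix size and function names are just made up for illustration). Both functions compute the same sum, but the first reads memory sequentially while the second jumps `COLS` elements between accesses, so on a large matrix the second tends to be dramatically slower:

```c
#include <stddef.h>

#define ROWS 4096
#define COLS 4096

/* Walks the matrix row by row: consecutive iterations touch adjacent
 * addresses, so each cache line fetched from main memory is fully
 * used before it gets evicted. */
long sum_row_major(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            total += m[i][j];
    return total;
}

/* Computes the same sum, but walks the matrix column by column:
 * successive accesses are COLS * sizeof(int) bytes apart, so most of
 * them miss the cache and have to go out to main memory. */
long sum_col_major(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            total += m[i][j];
    return total;
}
```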
There are many different strategies for improving memory locality, not just accessing data sequentially. For example, I could divide a large array into tiles that fit into my cache, then do as many operations as I want on one tile before moving on to the next. If you look at the way mergesort works, this is basically what it does: it divides the data in half and recursively sorts each half. At a certain point each sub-array fits entirely into cache memory and can be sorted very quickly.
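Here's a rough sketch of the tiling idea using an out-of-place matrix transpose (the names and the `TILE` size are assumptions; the right block size depends on your actual cache). A naive transpose strides across one of the two matrices for every single element, while the blocked version finishes one small block at a time:

```c
#include <stddef.h>

#define N    4096
#define TILE 32   /* guess: two 32x32 blocks of doubles = 16 KB, small
                     enough to sit comfortably in a typical L1 cache */

/* Transposes src into dst one TILE x TILE block at a time, so the block
 * being read and the block being written both stay resident in cache
 * while they're worked on. Assumes N is a multiple of TILE. */
void transpose_tiled(double dst[N][N], double src[N][N])
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t i = ii; i < ii + TILE; i++)
                for (size_t j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}
```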
The actual performance benefit of memory locality varies from case to case: I've seen speedups of an order of magnitude, and I've seen others that were only marginal, or that actually performed worse because of the added complexity. Memory locality and exploiting the cache aren't usually the first place I go looking for performance gains. An algorithm with better asymptotic bounds usually provides far more improvement than any amount of micro-optimization. I say usually because algorithms with better bounds sometimes carry such large constant factors that they perform poorly in practice.