A divider is a chain of subtractors and multiplexers, where each multiplexer selects the value passed on to the next step. If it is built purely combinationally, the critical path through all of this logic is quite long (even with carry lookahead on the subtractors), so the clock period must be very long.
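As a rough illustration of that subtract-mux chain, here is a minimal software sketch of unsigned restoring division. It is only a model of the general technique, not the implementation of any particular divider module; the function name `restoring_divide` and the `width` parameter are assumptions made for this example.

```python
def restoring_divide(dividend: int, divisor: int, width: int = 8):
    """Unsigned restoring division modeled as `width` subtract-mux steps.

    Hypothetical helper for illustration; `dividend` must fit in `width` bits.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    remainder = 0
    quotient = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit (MSB first) into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        # Subtractor: compute the trial difference.
        diff = remainder - divisor
        # Mux: keep the difference if it is non-negative, else keep the old value.
        if diff >= 0:
            remainder = diff
            quotient |= 1 << bit
    return quotient, remainder


print(restoring_divide(200, 7))  # (28, 4)
```

In hardware, each loop iteration corresponds to one subtractor followed by one mux, and all `width` of them sit in series when no pipeline registers are inserted.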
But the process is easy to pipeline, and the number of pipeline stages you can use is fairly arbitrary. You could insert a pipeline register after each subtract-mux pair, or you might choose to do two or more subtract-mux stages per pipeline register (two seems to be the default). You could even go so far as to pipeline the subtracts and the muxes separately (or even pipeline within each subtract) in order to get the fastest possible clock speed, but this would be rather extreme.
The more pipeline registers you use, the shorter the critical path (and the clock period) can be, but you use more resources (the registers). Also, the overall latency goes up: each clock period has to cover the setup and propagation (clock-to-output) times of the pipeline registers in addition to the subtract-mux logic delay, and the total latency is that clock period multiplied by the number of pipeline stages.
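The tradeoff can be put into numbers with a small sketch. The delay values below (1.0 ns per subtract-mux step, 0.5 ns of register setup plus propagation) and the function name `pipeline_timing` are made-up assumptions for illustration, not figures for any real device.

```python
import math


def pipeline_timing(total_steps, steps_per_stage, t_step_ns, t_reg_ns):
    """Return (stages, clock_period_ns, latency_ns) for one pipelining choice."""
    stages = math.ceil(total_steps / steps_per_stage)
    # Each clock period covers the logic of one stage plus the register overhead.
    clock_period_ns = steps_per_stage * t_step_ns + t_reg_ns
    # Total latency is the clock period multiplied by the number of stages.
    latency_ns = stages * clock_period_ns
    return stages, clock_period_ns, latency_ns


# Hypothetical 32-step divider with assumed delays.
for k in (1, 2, 4, 32):
    stages, period, latency = pipeline_timing(32, k, t_step_ns=1.0, t_reg_ns=0.5)
    print(f"{k:2d} steps/stage: {stages:2d} stages, "
          f"period {period:5.1f} ns, latency {latency:5.1f} ns")
```

With these assumed delays, one step per stage gives a 1.5 ns clock period but 48 ns of latency, while putting all 32 steps in a single stage gives a 32.5 ns period and 32.5 ns of latency.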
This is why they give you control of this parameter: so that you can select the right tradeoff for your particular application.
Note that regardless of the value of this parameter, with this module, you get one result per clock cycle. If you don't need this level of performance, you'd use a different module that computes the quotient serially, one (or possibly more than one) bit at a time.
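To see why throughput stays at one result per clock regardless of the pipeline depth, here is a small cycle-by-cycle model. It is a simplified sketch under stated assumptions (unsigned restoring division, nonzero divisors, a new operand pair accepted every clock); `stage_logic` and `simulate_pipeline` are hypothetical names, not part of any vendor module.

```python
def stage_logic(state, steps):
    """Combinational logic of one pipeline stage: `steps` subtract-mux steps."""
    dividend, divisor, rem, quo, bit = state
    for _ in range(steps):
        rem = (rem << 1) | ((dividend >> bit) & 1)
        if rem >= divisor:          # subtract succeeds, mux takes the difference
            rem -= divisor
            quo |= 1 << bit
        bit -= 1
    return (dividend, divisor, rem, quo, bit)


def simulate_pipeline(operands, width=8, steps_per_stage=2):
    """Return a list of (clock_cycle, quotient, remainder) as results emerge."""
    if width % steps_per_stage:
        raise ValueError("width must be a multiple of steps_per_stage")
    n_stages = width // steps_per_stage
    regs = [None] * n_stages        # regs[k] holds the output of stage k
    pending = list(operands)
    results = []
    cycle = 0
    while pending or any(r is not None for r in regs):
        cycle += 1
        new = None                  # None models an empty pipeline slot
        if pending:
            n, d = pending.pop(0)
            new = (n, d, 0, 0, width - 1)
        # Clock edge: every register captures its stage's output simultaneously.
        stage_inputs = [new] + regs[:-1]
        regs = [stage_logic(s, steps_per_stage) if s is not None else None
                for s in stage_inputs]
        if regs[-1] is not None:
            _, _, rem, quo, _ = regs[-1]
            results.append((cycle, quo, rem))
    return results


print(simulate_pipeline([(200, 7), (100, 9), (255, 16)]))
# [(4, 28, 4), (5, 11, 1), (6, 15, 15)]
```

The first result appears after four clocks (the latency of the four-stage pipeline in this model), and the following results arrive on consecutive clocks. Changing `steps_per_stage` changes the latency in clocks but not the one-per-clock rate.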