Yihui asked this question yesterday. My supervisor, Dr. Hau, has also criticized routine discretization into groups. I came across two plausible reasons for it in classes in 2007: one negative, the other at least conditionally positive.

The first is a variant of the old Golden Hammer law: if the only tool is ANOVA, every continuous predictor needs discretization. The second reason is empirical: ANOVA with discretization steals degrees of freedom (df). Let's demonstrate it with a diagram.

The red points are the population, and the black points are the sample. Which predicts the population better: the green continuous line or the discretized blue dashes? R simulation code is given.

The discretization here is essentially a local smoothing technique with a constant kernel function. Generally speaking, local modeling can effectively improve the fit (a lower error sum of squares), but we have to be careful to avoid overfitting. If you discretize x into more intervals, the fit to the sample gets even better.
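To make the constant-kernel view concrete, here is a minimal Python sketch (not the R simulation mentioned above). It assumes a linear population, unit noise, an arbitrary seed, and equal-width intervals on [0, 10]; the function names are hypothetical. Each fitted value is simply the mean of y over the sample points that share an x-interval, and the residual sum of squares is tracked as the intervals are halved.

```python
import random

def interval_index(x, lo, hi, n_bins):
    """Index of the equal-width interval of [lo, hi] that contains x."""
    k = int((x - lo) / (hi - lo) * n_bins)
    return min(k, n_bins - 1)  # put x == hi into the last interval

def discretized_fit(xs, ys, lo, hi, n_bins):
    """Constant-kernel smoother: every point is fitted by the mean of y
    over the sample points falling in the same x-interval."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(interval_index(x, lo, hi, n_bins), []).append(y)
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    return [means[interval_index(x, lo, hi, n_bins)] for x in xs]

def rss(ys, fitted):
    """Residual sum of squares: sample values minus fitted values."""
    return sum((y - f) ** 2 for y, f in zip(ys, fitted))

random.seed(2007)  # arbitrary seed, only for reproducibility
xs = [random.uniform(0, 10) for _ in range(60)]
ys = [2 + 0.5 * x + random.gauss(0, 1) for x in xs]  # assumed linear population plus noise

# Halving the intervals each time keeps the partitions nested.
rss_by_bins = {k: rss(ys, discretized_fit(xs, ys, 0, 10, k)) for k in (1, 2, 4, 8, 16)}
for k, v in rss_by_bins.items():
    print(k, "intervals, residual SS:", round(v, 2))
```

Because each partition refines the previous one, the residual sum of squares cannot go up from one row to the next. That is the "better fit" being bought with extra degrees of freedom.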

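Residuals and errors are different. Residuals are measured against the sample; errors are measured against the population. As the number of intervals grows, the squared residuals keep decreasing while the squared errors eventually increase. So the black points themselves, i.e. discretization with the maximum number of intervals, predict the red population worst.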
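The contrast between residuals and errors can be sketched in a few lines of Python. This is an illustration under assumptions, not the original simulation: the population is taken to be linear, the noise is standard normal, the seed is arbitrary, the intervals are equal-count groups of sorted x, and all names are hypothetical. Residual SS compares each fit with the black sample values; error SS compares it with the red population values.

```python
import random

def ols_line(xs, ys):
    """Least-squares intercept and slope of a straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def equal_count_bin_means(xs, ys, n_bins):
    """Fitted values after cutting sorted x into n_bins equal-count groups."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    size = len(xs) // n_bins  # assumes len(xs) is divisible by n_bins
    fitted = [0.0] * len(xs)
    for b in range(n_bins):
        idx = order[b * size:(b + 1) * size]
        mean = sum(ys[i] for i in idx) / len(idx)
        for i in idx:
            fitted[i] = mean
    return fitted

def sum_sq(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

random.seed(2007)  # arbitrary seed, only for reproducibility
n = 64
xs = [random.uniform(0, 10) for _ in range(n)]
truth = [2 + 0.5 * x for x in xs]             # "red" population values (assumed linear)
ys = [t + random.gauss(0, 1) for t in truth]  # "black" sample values

a, b = ols_line(xs, ys)
line_fit = [a + b * x for x in xs]
results = {"line": (sum_sq(ys, line_fit), sum_sq(truth, line_fit))}
for k in (4, 16, 64):
    fit = equal_count_bin_means(xs, ys, k)
    results[k] = (sum_sq(ys, fit), sum_sq(truth, fit))

for name, (res_ss, err_ss) in results.items():
    print(name, "residual SS:", round(res_ss, 2), "error SS:", round(err_ss, 2))
```

With 64 intervals every group holds a single point, so the fit reproduces the sample, the residual SS is exactly zero, and the error SS is the full sum of squared noise. The straight line keeps a non-zero residual SS but, under these assumptions, stays much closer to the population.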

Discretization fades micro information (mostly error) while highlighting macro information (usually non-linear). Once LOESS is popular enough, discretization will be abandoned. What practitioners really need is local smoothing to preview the macro models they care about.