## Two Pittsburgh Pirates hitters started 2017 at very different levels of performance. Then, they took very divergent paths to the batting lines they carry today. What role did BABIP play here?

Pittsburgh Pirates UT Adam Frazier began the 2017 season on an absolute tear, carrying a .370/.458/.515 slash line on May 25th. Despite the lack of power, Frazier was on pace for one of the best overall batting seasons of all time. Then, a little more than a month into the season, he suddenly came back down to Earth. Today Frazier is about a league-average hitter, hitting somewhere in the range of .275/.350/.375.

What caused all this regression?

OF Andrew McCutchen went the opposite way. He started in a terrible slump, hitting .200/.271/.359 as of 5/23, but today he is hitting an All-Star-caliber .290/.385/.535.

What was the catalyst in McCutchen’s marked improvement?

A key statistic we can look at when picking candidates for regression in batting, both positive and negative, is Batting Average on Balls In Play, or BABIP. BABIP is exactly what it sounds like: the number of times a ball lands for a hit per ball in play. The exact formula is this:
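The commonly published definition takes home runs out of the hit total, takes strikeouts and home runs out of at-bats, and adds sacrifice flies back in:

$$\text{BABIP} = \frac{H - HR}{AB - K - HR + SF}$$

A minimal Python sketch of that arithmetic follows. The function name and the stat line are hypothetical and made up purely for illustration; they do not belong to any real player.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play, BABIP is undefined")
    return (hits - home_runs) / balls_in_play


# Made-up stat line (not a real player): 150 H, 20 HR, 500 AB, 100 K, 5 SF.
# That is 130 non-homer hits on 385 balls in play.
example = babip(hits=150, home_runs=20, at_bats=500, strikeouts=100, sac_flies=5)
print(round(example, 3))
```

Home runs drop out of both the numerator and the denominator because a ball that leaves the park is never "in play" for the defense.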

Players generally tend to stick to their career BABIP numbers, meaning that a career .320 BABIP hitter generally won’t dip down to .260 over long periods, and will usually bounce back to somewhere near that .320 mark. This is because BABIP has a lot more to do with skill than people realize, even though luck plays its part.

### A quick BABIP primer

A player who hits scorching line drives right at fielders every time they step to the plate would have a low BABIP due to bad luck, whereas a player who hits soft ground balls that barely sneak through will eventually run out of good luck, and their BABIP will probably drop.

The team average BABIP from 2010-2016 was .298, with a ceiling of .329 and a floor of .269. These are pretty good numbers to use as a guide for individual players because they are all based on large sample sizes. Individual players can have a BABIP better than .330, but in the last 50 years, among players with at least 3000 PA, no one in baseball has done better than Roberto Clemente’s .364 BABIP. The minimum is usually considered to be somewhere in the .260 range, but there are batters with at least 3000 PA who have posted a BABIP below .250. So we expect a batter’s BABIP to sit somewhere around the .300 mark, maxing out at about .360 and bottoming out at around .260.

BABIP is driven by a few important factors. The first is the defense: the better a defense, the more balls it can get to, and the lower the BABIP. The second factor is the type of contact the player makes with the ball. A player who hits the ball very hard will see more balls land for hits than someone who makes weak contact. There is also a third factor, the hitting skill of particular players, like their ability to put balls where they want, but we’ll get into that later.

### Model behavior

If we are looking to make a statistical model of BABIP, we might go about it in several ways. The first way, and I think the most intuitive, is to run a regression of BABIP on Hard, Medium, and Soft contact rates, so that we can see the value of hitters who barrel up the ball versus those who just make weak contact. The problem with this model is that it only correlates at about 38%, and none of the coefficients are statistically significant. In other words, if we used this model, we would just be using the hard-hit rate to describe BABIP, with no insight gained.

The next statistical model might be BABIP as a function of Ground Balls, Fly Balls, Infield Fly Balls, and Line Drives. Rather than using the number of GBs, FBs, IFFBs, and LDs directly, we will be using their rates: GB%, FB%, IFFB%, and LD%. The reason for using the rates instead of the raw numbers is that GBs, FBs, IFFBs, and LDs are “counting” statistics, not “rate” statistics, and it is difficult, mathematically, to model a “rate” statistic like BABIP using “counting” statistics.

Luckily the fix is easy. If we use GB%, FB%, IFFB%, and LD%, which give us the rate of each batted-ball type in terms of balls in play, we can model BABIP more easily and with more statistical significance. Using all 2016 player batting data, here’s what the model looks like:

This model correlates at 59% and all but our Constant and IFFB% coefficients are statistically significant. The fact that our constant (.002) is so small means we don’t have to worry too much about it skewing our model; similarly the fact that IFFB% is rather low across the league means we don’t have to worry about it messing up our model too much either, so we should be all set.
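A regression of this kind can be sketched with ordinary least squares in NumPy. This is only an illustration under stated assumptions: the function name is hypothetical, the data are synthetic (randomly generated, not 2016 player data), the `placeholder_weights` are arbitrary and are not the article’s fitted coefficients, and the R² it reports is just one possible reading of “correlates at”.

```python
import numpy as np


def fit_xbabip(gb, fb, iffb, ld, babip):
    """Ordinary least squares of BABIP on four batted-ball rates.

    Returns (coefficients, r_squared), with coefficients ordered
    [constant, GB%, FB%, IFFB%, LD%].
    """
    X = np.column_stack([np.ones(len(babip)), gb, fb, iffb, ld])
    coef, *_ = np.linalg.lstsq(X, babip, rcond=None)
    fitted = X @ coef
    ss_res = np.sum((babip - fitted) ** 2)
    ss_tot = np.sum((babip - np.mean(babip)) ** 2)
    return coef, 1.0 - ss_res / ss_tot


# Synthetic "players": rates are drawn independently, so they do not sum
# to 1 the way real batted-ball rates might. Illustration only.
rng = np.random.default_rng(0)
n = 200
gb = rng.uniform(0.30, 0.55, n)
fb = rng.uniform(0.25, 0.45, n)
iffb = rng.uniform(0.02, 0.15, n)
ld = rng.uniform(0.15, 0.28, n)

# ARBITRARY placeholder weights, NOT the article's fitted values.
placeholder_weights = np.array([0.05, 0.10, 0.20, -0.30, 0.50])
signal = np.column_stack([np.ones(n), gb, fb, iffb, ld]) @ placeholder_weights
noisy_babip = signal + rng.normal(0.0, 0.01, n)  # add some "luck"

coef, r_squared = fit_xbabip(gb, fb, iffb, ld, noisy_babip)
print(coef.round(3), round(r_squared, 2))
```

With real data, the same call would take one row per player; significance tests for each coefficient would need a statistics package on top of this.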

This model gives us the Expected BABIP (xBABIP), meaning it gives us what we would expect a player’s BABIP to be, given that they are playing against an average defense and have average luck. Here’s what our formula looks like:
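In general form, with the fitted weights left symbolic, a four-variable linear model with the constant of .002 quoted above reads:

$$\text{xBABIP} = 0.002 + \beta_{GB}\,\text{GB\%} + \beta_{FB}\,\text{FB\%} + \beta_{IFFB}\,\text{IFFB\%} + \beta_{LD}\,\text{LD\%}$$

The $\beta$ symbols are placeholders introduced here for the regression weights; their numeric values are not assumed.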

(As a quick note, if you search “xBABIP” online, you’ll find many different equations using different weights, variables, etc. While it’s certainly possible that one of them is better than this one, I am confident in the mathematical consistency of this linear weights model.)

A quick takeaway from this formula is that an increase in Line Drive Rate improves a player’s BABIP with nearly 3.5 times the impact of an equal increase in Fly Ball Rate, and more than 5 times the impact of an equal increase in Ground Ball Rate. This means that a player who hits line drives is going to have a much better BABIP than someone who hits mostly FBs and GBs.

### Bringing it back to the Pittsburgh Pirates

We can now use this formula to determine where Frazier and McCutchen’s BABIP should have been, and compare that to where they were.

First off, Adam Frazier. Here is the graph of Frazier’s season-long cumulative BABIP (blue), xBABIP (red), and the difference between his BABIP and xBABIP (green):

*The flatline between mid April and mid May is due to injury.*

Looking at this chart, it is quite clear why Frazier regressed to the extent he did. Frazier’s BABIP was stratospheric between 5/16 and 5/27, when it was greater than Clemente’s .364 mark. Moreover, the fact that his BABIP didn’t even come down to the league-average .300 mark until July 7th is insane.

When we compare Frazier’s rather explosive BABIP to his relatively flat xBABIP, we see why he was destined for regression. Frazier’s May spike in BABIP saw no significant corresponding spike in xBABIP, meaning he was hitting FBs, GBs, LDs, and IFFBs at about the same rates and was just getting luckier, playing bad defenses, or some combination of both.

Finally, if we look at the line of difference between Frazier’s BABIP and xBABIP, we see that the two are trending toward convergence. This means that the Adam Frazier we see today, hitting a decent .275/.350/.375, is more likely the Adam Frazier we will see going forward in his career, and not the red-hot one of earlier this season.
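The cumulative BABIP line and the gap to xBABIP discussed above can be computed from per-game counts with running sums. The sketch below assumes the standard BABIP definition, (H - HR) / (AB - K - HR + SF); the function name, the five-game log, and the flat .300 xBABIP series are all hypothetical stand-ins, not Frazier’s actual numbers.

```python
import numpy as np


def cumulative_babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Season-to-date BABIP after each game, from per-game counts."""
    hits, home_runs, at_bats, strikeouts, sac_flies = (
        np.asarray(a) for a in (hits, home_runs, at_bats, strikeouts, sac_flies)
    )
    hits_in_play = np.cumsum(hits - home_runs)
    balls_in_play = np.cumsum(at_bats - strikeouts - home_runs + sac_flies)
    # Leave NaN until the first ball in play so we never divide by zero.
    out = np.full(len(balls_in_play), np.nan)
    return np.divide(hits_in_play, balls_in_play, out=out, where=balls_in_play > 0)


# Made-up five-game log; games 3 and 4 have no plate appearances,
# so the cumulative line stays flat there, as it would during an injury.
hits = [2, 1, 0, 0, 1]
home_runs = [0, 1, 0, 0, 0]
at_bats = [4, 4, 0, 0, 4]
strikeouts = [1, 1, 0, 0, 1]
sac_flies = [0, 0, 0, 0, 0]

series = cumulative_babip(hits, home_runs, at_bats, strikeouts, sac_flies)

# Placeholder xBABIP series (a flat .300), only to show how the gap is formed.
placeholder_xbabip = np.full(len(series), 0.300)
gap = series - placeholder_xbabip
print(series.round(3), gap.round(3))
```

Because both lines are cumulative, a single hot or cold week moves them less and less as the season goes on, which is why the gap tends to shrink over time.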

Now time for Andrew McCutchen. Similar to Frazier’s graph, here is McCutchen’s season-long cumulative BABIP (blue), xBABIP (red), and the difference between his BABIP and xBABIP (green):

McCutchen’s story is basically the opposite of Frazier’s; where Frazier was getting extremely lucky, McCutchen couldn’t buy a hit. Between May 5th and June 6th McCutchen’s BABIP was actually underperforming his xBABIP.

We can see that McCutchen’s BABIP bottomed out on 5/23 at an unsustainably low .214, just a few days before the benching on May 25th and 26th that is credited as the catalyst for his turnaround. Interestingly, however, McCutchen’s xBABIP bottomed out on May 5th at .235, exactly 20 days prior to the benching. This means that McCutchen had changed his hitting approach and was actually hitting the ball much better for 20 days before that translated into actual base hits; he was really just unlucky during that stretch.

Something else important is that McCutchen’s turnaround seems real, at least in terms of his BABIP. Cutch has taken a better plate approach, stabilizing his GB% and FB% while increasing his LD%, which means that he should keep getting more hits and being the valuable player he is.

### Good players won’t fear the reaper that is BABIP

A few final notes. The first is that McCutchen and Frazier have career BABIPs of .330 and .320 respectively, and this is really where we should expect their BABIPs to settle.

Secondly, while xBABIP is useful because it takes luck out of the equation, it also takes out a batter’s skill at putting the ball where he wants it to go. If we look at active batters with more than 3000 PA, the top 3 in BABIP are Mike Trout (.360), Paul Goldschmidt (.358) and Joey Votto (.353). Those three are some of the best hitters in baseball, largely because they can hit the ball where they want to. This means that this xBABIP equation will underestimate above-average hitters and overestimate below-average hitters. The fact that Cutch and Frazier, both guys with above-average career BABIPs, are slightly outperforming their xBABIP isn’t so much a sign that they are due for regression as it is a limitation of the xBABIP formula, which cannot quantify all the abilities of a batter.

What are the big takeaways? The first and probably biggest is that BABIP is an unstoppable force that comes for everyone who is too far above or too far under their career numbers. Frazier was the beneficiary of some great luck, and McCutchen had to push through some bad luck early on, but both players have more or less converged to something of an equilibrium.

Second, the part of BABIP that a player controls is contact type; a player with a high LD% will tend to have a better BABIP than someone with a higher FB% or GB%.

The final takeaway is more of a prediction. Andrew McCutchen and Adam Frazier may actually be due for some positive regression, given that they are both playing below their career numbers, but are also currently playing at sustainable levels.