Descriptive statistics and lactation profile
The average total milk yield (TMY) was recorded as 4414.58±69.75 kg, with high coefficient of variation (CV) of 31.64%, (Table 1). Whereas 305DMY was 3476.28±39.12 kg which were found to have comparatively low variability (CV = 22.53%). Peak yield was reported as with the mean of 13.28±0.25 kg, with the mean TDMY was 8.32±0.16 kg. The mean total fat yield per lactation was 142.61 kg. The average lactation length was 426.11 days, indicating a prolonged production cycle. Furthermore, the population exhibited a high average fat concentration of 4.14%. The Persistency found to have mean of 0.61±0.01 with a high CV of 36.54%, indicating a wide disparity in the animals’ ability to maintain milk production levels post-peak (Table 1).
Trends in test-day milk and fat yields
The early lactation phase (TD1 to TD3) was marked by a characteristic ascent, initiating at 8.60 ±0.23 kg and reaching a maximum of 9.56±0.24 kg by the third test day. This period corresponded with the highest observed fat yield, which also peaked at TD3 (0.51 ± 0.01 kg). Following this peak, the mid-lactation phase (TD4 to TD6) exhibited a gradual decline, with milk yield decreasing from 8.97±0.23 kg at TD4 to 8.24±0.21 kg by TD6. The late lactation stage (TD7 to TD10) demonstrated a persistent downward trajectory
(Rana et al., 2024). Yields dropped significantly to a minimum of 6.93±0.18 kg at TD8 before stabilizing slightly to conclude at 7.41±0.20 kg on the tenth test day. Fat yields depict same pattern, generally declining to 0.36±0.06 kg by the end of the lactation period (Table 2).
Effect of non-genetic factors
The analysis of variance revealed that the SOB had a statistically significant effect on the 305DMY (
p<0.05). Cows born in Season 3 exhibited the highest standardized yield 3614.02±73.88 kg, which was higher as comparable to those born in Season 4 (3551.51±66.64) kg but significantly superior to animals born in seasons 1 and 2. In contrast, the season of birth did not significantly influence other production parameters, including LL, TMY, TFY and FP (
p>0.05). The SOC significantly influenced the TFY, with the highest fat yield observed in animals calving during Season 1 (147.54±2.59) kg. This was significantly higher than the yield recorded for Season 3 (136.73±2.75) kg, while Seasons 2 and 4 showed intermediate values. Furthermore, the YOC showed a significant effect on milk composition, specifically FP. It was observed that YOC-4 recorded the highest fat percentage (7.34±0.10%), differing significantly from the lower percentages observed in YOC-1 (6.92±0.11%) and YOC-3. No statistically significant differences were found for LL and TMY across the different seasons and years of calving (Table 3).
Non-parametric lactation and parametric lactation curve modelling
Non-parametric modelling revealed an initial yield of 8.78±5.08 kg and a peak of 13.2±5.49 kg. A persistency of 0.620±0.28 indicated that KF cows maintained ~62% of peak production. As shown in (Fig 1), the mean lactation length was 426.11 days (Solid blue line). The KDE curve (dashed line) peaked between 380-480 days, with standard deviation markers (red dotted lines) confirming most cows exceeded the standard 305-day lactation period.
The Karan Fries lactation trajectory was modeled using Wood’s function (
Wood, 1967) via non-linear regression (nls in R). The scaling parameter (a) was 55.1, representing initial production; the inclining rate (b) was 0.102; and the decay rate (c) was 0.00113. Together, these three parameters standardize the population’s production pattern and serve as fundamental predictors for total lactation yield.
Phenotypic correlations
A strong positive correlation exists between total milk yield and total fat yield (r = 0.716), indicating that higher volumes yield greater cumulative fat. Similarly, lactation length and total fat yield are positively correlated (r = 0.68), highlighting the role of persistency in maximizing nutrient harvest. Conversely, total milk yield and average fat percentage share a moderate negative correlation (r = -0.56).
Multiple linear regression
Building on the previous analysis, the predictive performance for both 305DMY and TMY was evaluated across the testing dataset (Table 4), the MLR model demonstrated moderate predictive capability, achieving an R
2 of 0.21, resulting in an RMSE of 729.71 kg and an MAE of 570.56 kg. A similar trend was observed for TMY prediction (Table 4), yielding a lower R
2 of 0.1012. The associated error metrics were notably higher than those for 305DMY, with an RMSE of 1405.87 kg and an MAE of 1082.63 kg, highlighting the increased complexity of predicting TMY compared to the standardized 305DMY (Fig 2 and 5).
Random forest
Random forest (RF) model demonstrated a substantial leap in predictive accuracy. For 305 DMY, the RF model achieved an R
2 of 0.88. This performance represents a dramatic improvement over the MLR baseline, reducing the RMSE to 315.98 kg and the MAE to 209.61 kg. A similarly robust pattern emerged for TMY prediction, where the RF model yielded an R
2 of 0.8688, an RMSE of 634.12 kg and an MAE of 432.20 kg. Remarkably, the algorithm maintained consistent efficacy across both 305DMY and TMY with R
2 of (0.88 vs. 0.87), suggesting that Random Forest effectively captures the underlying biological determinants of milk production regardless of the specific trait definition or scale. (Fig 3 and 5; Table 4).
Extreme gradient boosting
In addition to the MLR and Random Forest approaches, the Extreme Gradient Boosting (XGBoost) model was evaluated. For 305DMY, XGBoost achieved an R
2 of 0.82, with an RMSE of 359.63 kg and an MAE of 264.01 kg. While this represents a substantial improvement over the linear model, it fell short of the accuracy achieved by the Random Forest. This hierarchical trend persisted in the TMY predictions, where XGBoost again occupied the middle ground between MLR and RF. For TMY, the model yielded an R
2 of 0.7818, an RMSE of 755.03 kg and an MAE of 551.84 kg (Fig 4 and 5; Table 4).
Variable importance and biological drivers
Variable importance rankings (Fig 6) identify mean test-day milk yield (Mean TDMY) as the primary predictor across all models. For TMY and 305DMY, persistency (67.0%) and slope early peak (62.8%) were the leading secondary predictors. Although included, raw Wood’s function coefficients (a, b, c) showed low importance (<5%), suggesting that Random Forest models prioritize derived biological indicators over raw mathematical parameters to describe the lactation curve.
Residual distribution
The residual analysis reveals a near-Gaussian distribution of errors for the ensemble models, particularly random forest. The concentration of residuals around the zero-line and the lack of skewness in the error histograms (Fig 7) indicate that the model assumptions hold true for this population.
This study interprets key production metrics including fat yield, fat percentage, Total Milk Yield (TMY), 305-day milk yield (305DAY) and extended lactation length to establish a comprehensive performance profile of the KF cattle. By blending traditional biometric frameworks, such as Wood’s Gamma Function, with modern machine learning techniques like Random Forest (RF) and XGBoost.
Comparative analysis of milk production traits
Average total milk yield (TMY) was 4414.58±69.75 kg, exceeding estimates by
Worku et al., (2021). The 305-day milk yield (305-DMY) of 3476.28±39.12 kg also surpassed findings by
Kumar et al., (2025) (Table 2). A high TMY coefficient of variation (CV) of 31.64% reflected significant herd heterogeneity. The average lactation length (LL) of 426.11 days exceeded the 305-day
ICAR (2023) standard, indicating high persistence while maintaining 4.14% fat. Despite a peak yield of 13.2±5.49 kg and mean persistency of 0.61±0.01, a high persistency CV (36.54%) suggested greater diversity than noted by
Dash et al., (2016), signalling substantial potential for future genetic improvement in KF cattle.
Trends in test day milk and fat yields
Karan Fries (KF) lactation followed the standard biological trajectory: incline, stabilization and decline (
Wood, 1967). Milk yield rose from 8.60±0.23 kg (TD1) to 8.95±0.26 kg (TD2), driven by mammary cell proliferation and fat mobilization (
Bauman and Currie, 1980). This ascending trend aligns with
Kokate et al. (2012) and
Singh et al., (2016). Peak fat yield (0.51±0.12 kg) confirmed successful homeorhetic adaptation to early metabolic demands (
Bauman, 2000). The decline phase began at TD5 (8.43±0.22 kg), reached a low at TD8 (6.93±0.18 kg) and ended at 7.41±0.20 kg by TD10, with fat yield dropping to 0.36 kg due to the gradual loss of secretory cells (
Knight and Wilde, 1993).
Influence of fixed factors on production traits
In this study, SOC did not significantly influence TMY, 305-DMY, LL, or FP (p>0.05), suggesting the KF breed’s adaptability to environmental fluctuations aligning with
Ilayaraja et al. (2025) but contradicting
Worku et al., (2021). However, SOC significantly impacted total fat yield (p = 0.041), with summer calvers (Season 1) yielding the highest. SOB significantly influenced 305-DMY (p = 0.048), with Season 3-born cows performing best, suggesting that early-life conditions affect long-term productivity. YOC significantly affected 305-DMY and FP (p = 0.02), likely due to annual variations in climate and fodder availability, consistent with
Worku et al. (2021) but differing from
Yadav et al. (2017).
Phenotypic correlations
A primary finding was the strong positive correlation between total milk yield (TMY) and total fat yield (TFY) (r = 0.716), similarly reported by
Upadhyay et al. (2024). Furthermore, the positive correlation between lactation length and fat yield (r = 0.685) underscores the physiological importance of production persistency (ICAR-NDRI, 2015). Conversely, a moderate negative correlation was observed between total milk yield and average fat percentage (r = -0.564). This illustrates the dilution effect, a biological trade-off where increased fluid volume leads to a decreased concentration of solids.
Comparative evaluation of statistical and machine learning models
The RF algorithm significantly outperformed MLR and XGBoost in predicting 305 DMY and TMY. For the test set, RF achieved high values (0.878 for 305DMY; 0.869 for TMY), while XGBoost showed intermediate performance and MLR performed poorly (R
2<0.22). RF’s superiority is validated by its lower error metrics; for 305DMY, RF yielded the lowest RMSE (315.98 kg) and MAE (209.61 kg). In contrast, MLR’s RMSE was more than double (729.71 kg). The >360 kg reduction in MAE compared to MLR confirms RF as a more reliable tool for lactation forecasting. Although a high training R
2 (0.98) versus test R
2 (0.88) suggests slight overfitting typical of ensemble learning, the robust test performance demonstrates that RF effectively captured key biological signals. Future validation using multi-herd data and tree-pruning could further refine this gap.
By assuming linear relationships, MLR fails to capture the inherently non-linear lactation curve, which involves a sharp rise followed by a gradual decline
(Dongre et al., 2012). Its high residual errors (RMSE = 729.71) confirm an inability to model intricate physiological patterns and environmental interactions
(Sahoo et al., 2019). Consequently, MLR’s lack of adaptability to non-linear data limits its precision despite its high interpretability
(Singh et al., 2021).
Conversely, RF has emerged as a superior tool for predicting TMY and 305DMY, consistently outperforming linear methods
(Worku et al., 2021; Singh et al., 2021). Its low MAE (432.20 kg) suggests better handling of missing data and physiological fluctuations than MLR (
Tırınk et al., 2023). While XGBoost uses sequential error minimization (
Chen and Guestrin, 2016), its 305DMY accuracy (R
2= 0.825) trailed RF, likely due to data deviations. Although RF showed a tendency to overfit (R
2 0.98 training vs. test) a common trait in small biological datasets (n=473) it achieved the highest predictive power. These results align with findings in Iranian Holsteins = R
2 0.91; (
Abdollahi-Arpanahi et al., 2020) and Italian Holsteins = R
2 0.88;
(Bovo et al., 2021). Future multi-herd studies using nested cross-validation could further refine these estimates.
A key finding is the hierarchical preference of ML models for specific lactation features. While Wood’s function provided the essential framework, its raw parameters (a, b, c) showed low relative importance in the RF model. This suggests an information-routing effect: because Persistency and Early Slope were derived from the Wood’s curve, they can serve as biological summaries for the lactation profile in KF cattle. Since ML algorithms seek the most efficient predictive paths, they prioritize these integrated biological features over individual mathematical constants. This highlights Wood’s function’s value as a powerful tool for feature engineering rather than just a standalone fitting method.