Predictive Crop Yield Machine Learning Pipeline

Produced yield estimates competitive with official USDA in-season forecasts

Problem

Agricultural market participants need to anticipate harvest outcomes before official USDA estimates move the market. A crop yield model with daily and weekly forecasts helps meet that need.

Summary

I built an MLOps pipeline around a machine learning regression model that predicts crop yields from the crop condition index, soil moisture, and observed temperature and precipitation patterns throughout the growing season.

Sources include agricultural and soil moisture data publicly available from the National Agricultural Statistics Service (NASS) under the U.S. Department of Agriculture (USDA), and meteorological observation data from the Automated Surface/Weather Observing Systems (ASOS/AWOS) under the National Oceanic and Atmospheric Administration (NOAA).

I ingested, cleaned, and transformed the source data, then engineered features into datasets accepted by the XGBoost and Random Forest models. I used a time-based 80/20 train-test split across a 40-year dataset, with the most recent five years set aside as the validation set. I chose not to shuffle randomly, because shuffling a chronological dataset risks leaking future information into training. A limited grid search selected hyperparameters such as learning rate and number of estimators. I also tested a neural network, but XGBoost and Random Forest both performed better on MAE and RMSE.
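The time-based split described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the `year` and `yield_bu_ac` field names, the `chronological_split` helper, and the synthetic rows are all assumptions made for the example.

```python
from typing import List, Tuple


def chronological_split(rows: List[dict], holdout_years: int) -> Tuple[List[dict], List[dict]]:
    """Hold out the most recent `holdout_years` distinct years, without shuffling."""
    years = sorted({r["year"] for r in rows})
    if not 0 < holdout_years < len(years):
        raise ValueError("holdout_years must leave at least one training year")
    cutoff = years[-holdout_years]  # first year that belongs to the held-out set
    train = [r for r in rows if r["year"] < cutoff]
    held_out = [r for r in rows if r["year"] >= cutoff]
    return train, held_out


# Synthetic 40-season dataset; the yield values are made up for illustration.
rows = [{"year": y, "yield_bu_ac": 100.0 + 1.5 * (y - 1984)} for y in range(1984, 2024)]

# Holding out 8 of 40 years gives an 80/20 split by season.
train, held_out = chronological_split(rows, holdout_years=8)
```

Because every held-out year is later than every training year, the model is never evaluated on seasons that precede its training data, which is the leakage a random shuffle would risk.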

Initially, I tried a single model for multiple crops, with crop type one-hot encoded, but this added noise that degraded predictions, so I trained a separate model for each crop instead. Both training and testing errors dropped substantially for each crop-specific model. Blending complementary agronomic and meteorological features also materially improved accuracy.
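One way to organize crop-specific training is to group rows by crop and fit an independent model per group. The sketch below is an assumption-laden illustration: `fit_per_crop`, the `crop`, `soil_moisture`, and `yield` field names, and the sample values are hypothetical, and `MeanYieldModel` is a trivial stand-in so the example runs without dependencies. In practice the factory would return an estimator with the same `fit`/`predict` interface, such as an XGBoost or Random Forest regressor.

```python
from collections import defaultdict
from statistics import mean


class MeanYieldModel:
    """Stand-in regressor with a fit/predict interface (not the real model)."""

    def fit(self, X, y):
        self.mean_ = mean(y)  # ignores features; predicts the training mean
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fit_per_crop(rows, feature_names, target, make_model):
    """Group rows by crop and fit one independent model per crop."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["crop"]].append(row)

    models = {}
    for crop, crop_rows in grouped.items():
        X = [[r[f] for f in feature_names] for r in crop_rows]
        y = [r[target] for r in crop_rows]
        models[crop] = make_model().fit(X, y)
    return models


# Made-up rows for illustration only.
rows = [
    {"crop": "corn", "soil_moisture": 0.30, "yield": 170.0},
    {"crop": "corn", "soil_moisture": 0.34, "yield": 180.0},
    {"crop": "soybeans", "soil_moisture": 0.28, "yield": 48.0},
    {"crop": "soybeans", "soil_moisture": 0.31, "yield": 52.0},
]
models = fit_per_crop(rows, ["soil_moisture"], "yield", MeanYieldModel)
```

Keeping the models in a dictionary keyed by crop means no crop indicator column is needed, and each model sees only the yield range and feature relationships of its own crop.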

Results

The crop yield models achieved strong predictive accuracy, materially outperforming the initial baseline in both MAE and RMSE. Feature importance plots showed corn and soybean yields are driven by different predictors, supporting the decision to train separate crop-specific models.

If seasonal RMSE rises above a set threshold, that may indicate data drift, which warrants further investigation and potentially manual retraining of the crop models.
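A threshold check of this kind might look like the following sketch. The `rmse` and `needs_review` helpers, the threshold value, and the sample yields are assumptions for illustration; the actual threshold and monitoring logic are not described here.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted yields."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def needs_review(actual, predicted, threshold):
    """Flag a season for investigation when its RMSE exceeds the threshold."""
    return rmse(actual, predicted) > threshold


# Made-up seasonal values for illustration only.
actual = [172.0, 181.0, 176.0]
predicted = [170.0, 178.0, 182.0]

season_rmse = rmse(actual, predicted)
flag = needs_review(actual, predicted, threshold=5.0)  # hypothetical threshold
```

A flag here only prompts investigation. An elevated RMSE can also come from an unusual season rather than drift, so retraining stays a manual decision.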

Tech Stack

Python, SQL, AWS RDS, AWS S3, REST APIs, scikit-learn, XGBoost, matplotlib, pandas, NumPy, cron, boto3

GitHub

Confidential — code not publicly available.
