Crop Economics

Data Pipeline & Modeling

Our data pipeline is built on two large datasets:

Remotely sensed satellite data made available through Google Earth Engine.

Crop information available through USDA web services.

For the crop yield model, remotely sensed data from Google Earth Engine and crop yield information from the USDA are preprocessed and used to train a yield prediction model based on ensemble machine learning techniques.

For the price model, multiple USDA datasets are preprocessed and used to train a price prediction model based on regression techniques.

The current performance of the yield model is in line with other state-of-the-art machine learning approaches. We are not aware of comparable efforts against which to benchmark the price model, but its accuracy is high overall and we are confident in its predictions.

Model Accuracy

The crop yield model produced its best results with XGBoost, leveraging autocorrelation (given the strong correlation with prior years) as well as a rolling window (for additional observations), reaching an R² of 0.71 and an RMSE of 8.9.
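
As an illustration of how prior-year yields and a rolling window can be turned into training rows, here is a minimal sketch in plain Python. The function name make_lagged_windows, the n_lags parameter, and the sample numbers are assumptions made for this example and are not taken from the actual pipeline.

```python
from typing import List, Tuple


def make_lagged_windows(
    yields: List[float], n_lags: int = 3
) -> Tuple[List[List[float]], List[float]]:
    """Slide a window over a yearly yield series.

    Each window of `n_lags` consecutive years becomes one feature row,
    and the year that follows the window becomes the target.
    """
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    features: List[List[float]] = []
    targets: List[float] = []
    for end in range(n_lags, len(yields)):
        features.append(yields[end - n_lags:end])  # prior-year yields
        targets.append(yields[end])                # yield to predict
    return features, targets


# Hypothetical yields for six consecutive years (made-up numbers).
series = [150.0, 155.0, 149.0, 160.0, 162.0, 158.0]
X, y = make_lagged_windows(series, n_lags=3)
```

Each position of the window yields one more observation, which is how a short yearly series can produce several training rows. In practice these rows would presumably be combined with the remote sensing features before being passed to a gradient-boosted regressor such as xgboost.XGBRegressor.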

Also of note: the non-parametric nature of the remote sensing data led other modeling techniques, such as Support Vector Machines (SVM) and regression methods, to perform less well.

The price model produced its best results using Huber regression, leveraging multiple economic indexes and prior-year correlations, reaching an R² of 0.89 and an RMSE of 0.015.
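
For readers unfamiliar with the two metrics reported here, the sketch below shows how R² and RMSE are conventionally computed. The helper names and the sample values are illustrative assumptions, not output from either model.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values, chosen only to keep the arithmetic easy to follow.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
example_rmse = rmse(actual, predicted)      # sqrt(1 / 4) = 0.5
example_r2 = r_squared(actual, predicted)   # 1 - 1 / 5 = 0.8
```

Because RMSE is expressed in the units of the target, the RMSE values of the yield model and the price model are not directly comparable to each other, whereas R² is unitless.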

Our use of the Huber loss function was motivated by the fact that the data collected by the USDA is survey-based and at times contains outliers. This likely also caused other modeling techniques to perform less well.
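
To show why the Huber loss is less sensitive to outliers than a squared loss, here is a minimal sketch of the standard definition. The threshold delta = 1.0 is an assumption for illustration; the value used in the actual price model is not stated here.

```python
def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Standard Huber loss for a single residual.

    Quadratic for |residual| <= delta, linear beyond it, so large
    residuals (outliers) are penalized less than under squared loss.
    """
    abs_r = abs(residual)
    if abs_r <= delta:
        return 0.5 * residual ** 2
    return delta * (abs_r - 0.5 * delta)


def squared_loss(residual: float) -> float:
    """Half squared error, for comparison with the Huber loss."""
    return 0.5 * residual ** 2


small = huber_loss(0.5)            # 0.125, same as the squared loss
outlier_huber = huber_loss(10.0)   # 9.5, grows linearly
outlier_squared = squared_loss(10.0)  # 50.0, grows quadratically
```

A single survey outlier with a residual of 10 contributes 9.5 under the Huber loss but 50 under the squared loss, so it pulls the fitted coefficients far less. Ready-made implementations exist, for example scikit-learn's HuberRegressor.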

Application Architecture

We currently use Amazon Web Services (AWS) to host our application infrastructure. The tool is hosted as a secure website, load-balanced between two AWS zones and further protected by an application firewall.

Our data pipeline is built on satellite data available in Google Earth Engine, as well as data available through USDA web services. We pull data from these sources daily and store it locally. The local copy is used to train and update our machine learning models, and it lets the website respond significantly faster than it would if it queried the sources directly.

Our web-based tool is built from the ground up using a modern web technology stack. The frontend is developed using ReactJS, NPM, and NodeJS to create an optimized single-page application with dynamic content loading.

To allow the user (the farmer) to look up their farm's location, we use geocoding APIs through Mapbox. This makes looking up and pinning down a specific location similar in user experience to any other map app.
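
As a rough sketch of what a forward geocoding lookup involves, the example below builds a request URL for the Mapbox Geocoding API (v5, mapbox.places endpoint) and reads the coordinates from a response. It is written in Python for illustration only; in the application the call may well be made from the frontend. The function names, the placeholder token, and the sample response are assumptions, and no network request is made.

```python
from typing import Optional, Tuple
from urllib.parse import quote, urlencode

MAPBOX_GEOCODING_BASE = "https://api.mapbox.com/geocoding/v5/mapbox.places"


def build_geocoding_url(query: str, access_token: str, limit: int = 1) -> str:
    """Build a forward-geocoding URL for a free-text place query."""
    params = urlencode({"access_token": access_token, "limit": limit})
    return f"{MAPBOX_GEOCODING_BASE}/{quote(query, safe='')}.json?{params}"


def first_coordinates(response: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) of the first feature, if any.

    The response is a GeoJSON FeatureCollection in which each feature's
    "center" is ordered [longitude, latitude].
    """
    features = response.get("features", [])
    if not features:
        return None
    lon, lat = features[0]["center"]
    return lat, lon


# Placeholder token and a hand-written sample response (not real API output).
url = build_geocoding_url("Ames, Iowa", "YOUR_TOKEN")
sample = {
    "features": [
        {"place_name": "Ames, Iowa, United States", "center": [-93.62, 42.03]}
    ]
}
coords = first_coordinates(sample)
```

Note the coordinate order: the API returns longitude first, so the helper swaps the pair before handing it to code that expects latitude first.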

The frontend talks to a Python-based Flask web service that is connected to MySQL and load-balanced through Nginx. Finally, the data architecture and web application run in a Kubernetes cluster in AWS, using multiple Docker containers.