

Technical Approach. Our solution follows a standard data science project management cycle and methodology. We follow the data pipeline and implement our architecture to carry out the different processes through the project life cycle. The project is divided into two phases. Phase 1 is baseline modeling and evaluation using country-level data. The best-performing model from Phase 1 is then used in Phase 2 to model and predict carbon emissions for California counties, with additional features for each county.

Raw data

Phase    Data Category                    License           Years Covered  Time Unit
Phase 1  Our World in Data (CO2 and GHG)  Open (CC BY 4.0)  1750 ~ 2019    Year
Phase 1  World Bank Indicators            Open (CC BY 4.0)  1960 ~ 2020    Year
Phase 2  Gas (From Suppliers)             Open (link)       2013 ~ 2021    Month
Phase 2  Gas (From Government)            Open (link)       1990 ~ 2020    Year
Phase 2  Electricity (From Suppliers)     Open (link)       2013 ~ 2021    Month
Phase 2  Electricity (From Government)    Open (link)       1990 ~ 2020    Year
Phase 2  Vehicle                          Open (link)       2010 ~ 2020    Year
Phase 2  Waste                            Open (link)       2019           Year
Phase 2  Water                            Open (link)       2015           Year

Processed data: selected features

Phase 1
  1. CO2 Emission (Target Output)
  2. GDP
  3. Population
  4. Land Area

Phase 2
  1. Elec - Residential
  2. Elec - Non Residential
  3. Elec - Total
  4. Gas - Residential
  5. Gas - Non Residential
  6. Gas - Total
  7-16. Veh - Number of vehicles of each type (10 in total)
  17. Waste - DisposalTon
  18. Water
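The supplier datasets are monthly while the other sources are yearly, so the monthly series presumably need to be rolled up to yearly values before the sources can be combined. The following is a minimal sketch of such a step, assuming that summing the monthly values is the appropriate aggregation. The function name `monthly_to_yearly` and the sample readings are hypothetical and are not taken from the project data.

```python
from collections import defaultdict


def monthly_to_yearly(records):
    """Sum (year, month, value) readings into one total per year.

    Assumption: a yearly figure is the sum of its monthly values.
    """
    totals = defaultdict(float)
    for year, _month, value in records:
        totals[year] += value
    return dict(sorted(totals.items()))


# Hypothetical monthly readings for one county (illustrative values only).
sample = [(2013, 1, 10.0), (2013, 2, 12.5), (2014, 1, 9.0), (2014, 2, 11.0)]
yearly = monthly_to_yearly(sample)
print(yearly)
```

If a source reports averages rather than totals, a mean would be the better choice than a sum.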


Raw Datasets. Datasets are the crucial element of this solution, since we aim for the model with the least error for CO2 prediction. We collected and validated a number of datasets with features covering carbon emissions, GDP, population, land area, water usage, vehicle types, income tax, electricity, and gas for countries and counties, which we believe are relevant to our modeling and prediction. All datasets used in our solution are under open and public licenses.



End to End Data Pipeline. Our project data pipeline is built from a set of processes: acquire, clean, explore, model, and present. We first acquire the raw datasets and then take them through each process in the project life cycle. We use ML to create and evaluate our models and algorithms. In the end, our product provides actionable answers to the business problem, giving the audience insights for making business or investment decisions.
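The five stages can be pictured as a chain of small functions, each consuming the output of the previous one. The sketch below is only an illustration under stated assumptions: the in-memory rows, the field names, and the per-person placeholder model are all hypothetical and stand in for the real datasets and regression models.

```python
def acquire():
    """Return raw records; hypothetical rows stand in for the raw files."""
    return [
        {"region": "A", "population": "1000", "co2": "5.0"},
        {"region": "B", "population": "", "co2": "7.5"},  # missing value
        {"region": "C", "population": "3000", "co2": "12.0"},
    ]


def clean(rows):
    """Drop rows with empty fields and convert numeric strings to floats."""
    cleaned = []
    for row in rows:
        if all(row.values()):
            cleaned.append({
                "region": row["region"],
                "population": float(row["population"]),
                "co2": float(row["co2"]),
            })
    return cleaned


def explore(rows):
    """Compute a simple summary used to sanity-check the cleaned data."""
    co2 = [r["co2"] for r in rows]
    return {"n_rows": len(rows), "mean_co2": sum(co2) / len(co2)}


def model(rows):
    """Placeholder model (assumption): average CO2 per person times population."""
    rates = [r["co2"] / r["population"] for r in rows]
    rate = sum(rates) / len(rates)
    return lambda population: rate * population


def present(summary, predict):
    """Format the results for the audience."""
    return (f"{summary['n_rows']} rows, mean CO2 {summary['mean_co2']:.2f}, "
            f"estimate for 2000 people: {predict(2000):.2f}")


rows = clean(acquire())
summary = explore(rows)
predict = model(rows)
print(present(summary, predict))
```

Keeping each stage as a separate function makes it possible to rerun or replace one stage, for example the model, without touching the others.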



End to End Architecture. Our end-to-end system architecture runs on the AWS Cloud, using AWS SageMaker, IAM services, and EC2 resources. The environment is administered with IAM for account and security management. Raw datasets are uploaded to R and Python notebooks for data cleaning and pre-processing. Training and testing datasets (80/20 split) are then created for regression modeling, and the final model with the least error is used for the consumption CO2 estimation for California counties. The model and data are stored on the Web, where end users can obtain estimates with fixed or varied input parameters.
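The 80/20 split mentioned above can be sketched with the standard library alone. This is an illustration rather than the project's notebook code: the helper name `train_test_split`, the fixed seed, and the stand-in rows are assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # hypothetical stand-in for 100 dataset rows
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))  # 80 20
```

Shuffling before splitting avoids a test set made up only of the last rows in the file, and the fixed seed keeps the split repeatable between runs.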



Regression Models. Because we estimate carbon emissions from different input parameters, and both the inputs and outputs are continuous real values, the solution tackles a regression problem. We use the most common regression models: Linear Regression, Linear Regression with Regularization (Lasso and Ridge), Decision Tree Regression, and KNN Regression, with the evaluation metrics discussed in the Error Metrics section.
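To make the comparison concrete, the sketch below fits three of these model types on a single input feature and ranks them by test RMSE. It is a simplified, standard-library illustration, not the project's implementation: Lasso and Decision Tree Regression are left out for brevity, the ridge penalty is applied to the slope only, and the toy data (exactly y = 2x + 1) is hypothetical.

```python
import math


def fit_linear(xs, ys, ridge_lambda=0.0):
    """Fit y = a + b*x by least squares with an optional L2 (ridge) penalty on b.

    With ridge_lambda=0 this is ordinary linear regression.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + ridge_lambda)
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=2):
    """KNN regressor: predict the mean target of the k nearest training points."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on (xs, ys)."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Hypothetical, exactly linear toy data (y = 2x + 1); not project data.
train_x, train_y = [1, 2, 3, 4, 5, 6], [3, 5, 7, 9, 11, 13]
test_x, test_y = [7, 8], [15, 17]

models = {
    "linear": fit_linear(train_x, train_y),
    "ridge": fit_linear(train_x, train_y, ridge_lambda=1.0),
    "knn": fit_knn(train_x, train_y, k=2),
}
scores = {name: rmse(m, test_x, test_y) for name, m in models.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

On this toy data the test points lie outside the training range, which is where KNN struggles because it can only average targets it has already seen.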

Dataset                     Error Metrics  Linear Regression  LR - Lasso  LR - Ridge  Decision Tree Regression  KNN Regression
Phase 1 Countries Datasets  R2 training    0.911              0.9114      0.9114      0.8851                    0.943
Phase 1 Countries Datasets  R2 testing     0.949              0.9487      0.9482      0.8096                    0.841
Phase 1 Countries Datasets  MSE            0.367              0.367       0.370       1.361                     1.134
Phase 1 Countries Datasets  RMSE           0.6058             0.6506      0.6084      1.1666                    1.065
Phase 2 Counties Datasets   R2 training    0.961              0.961       0.964       0.689                     0.966
Phase 2 Counties Datasets   R2 testing     0.952              0.944       0.934       0.236                     0.848
Phase 2 Counties Datasets   MSE            0.084              0.169       0.196       1.337                     0.265
Phase 2 Counties Datasets   RMSE           0.289              0.411       0.443       1.156                     0.515
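The selection of the model with the least error can be reproduced from the figures reported above. The sketch below copies the R2 testing and MSE values from the table and picks the model with the lowest MSE, breaking ties by the higher testing R2. The tie-break rule and the helper name `pick_best` are assumptions made for illustration.

```python
# Figures copied from the results table: model -> (R2 testing, MSE).
PHASE_1 = {
    "Linear Regression": (0.949, 0.367),
    "LR - Lasso": (0.9487, 0.367),
    "LR - Ridge": (0.9482, 0.370),
    "Decision Tree Regression": (0.8096, 1.361),
    "KNN Regression": (0.841, 1.134),
}
PHASE_2 = {
    "Linear Regression": (0.952, 0.084),
    "LR - Lasso": (0.944, 0.169),
    "LR - Ridge": (0.934, 0.196),
    "Decision Tree Regression": (0.236, 1.337),
    "KNN Regression": (0.848, 0.265),
}


def pick_best(results):
    """Return the model with the lowest MSE; ties go to the higher testing R2."""
    return min(results, key=lambda name: (results[name][1], -results[name][0]))


print(pick_best(PHASE_1))  # Linear Regression
print(pick_best(PHASE_2))  # Linear Regression
```

In Phase 1, Linear Regression and Lasso report the same MSE, so the tie-break on testing R2 decides; in Phase 2, Linear Regression has the lowest MSE outright.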


Error Metrics. Three error metrics are measured to assess the accuracy of each regression model. Based on the lowest errors, the optimal model, Linear Regression, is adopted for the county CO2 emission prediction.

  • R-squared
  • MSE
  • RMSE
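The three metrics follow their standard definitions and can be computed as sketched below. The function names and the sample values are illustrative and are not taken from the project.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE, in the target's units."""
    return math.sqrt(mse(y_true, y_pred))


def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted values (illustrative only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(mse(y_true, y_pred), rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

Because RMSE is the square root of MSE, the two always rank models in the same order; R-squared adds a scale-free view of how much of the variance in the target the model explains.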