Technical Approach. Our solution adopts the data science project management cycle and methodology. We follow the data pipeline and implement our architecture to carry out the different processes throughout the project life cycle. We subdivide the project into two phases. Phase 1 is baseline modeling, using country-level data for modeling and evaluation. We then take the best-performing model from Phase 1 and use it in Phase 2 to model and predict carbon emissions for California counties, with additional features for each county.
|Phase||Data Categories||License||Year Covered||Time Unit||Selected Features||Data Shape|
|Our World in Data (CO2 and GHG)||Open (CC BY 4.0)||1750 ~ 2019||Year||1. CO2 Emission (Target Outputs)
4. Land Area
|World Bank Indicators||Open (CC BY 4.0)||1960 ~ 2020||Year|
|Gas (From Suppliers)||Open (link)||2013 ~ 2021||Month||1. Elec - Residential
2. Elec - Non Residential
3. Elec - Total
4. Gas - Residential
5. Gas - Non Residential
6. Gas - Total
7-16. Veh - Number of vehicles of each type (10 in total)
17. Waste - DisposalTon
|Gas (From Government)||Open (link)||1990 ~ 2020||Year|
|Electricity (From Suppliers)||Open (link)||2013 ~ 2021||Month|
|Electricity (From Government)||Open (link)||1990 ~ 2020||Year|
|Vehicle||Open (link)||2010 ~ 2020||Year|
Raw Datasets. Datasets are the crucial element of this solution, in which we aim for the model with the least error for CO2 prediction. We collected and validated a number of datasets with features covering carbon emissions, GDP, population, land area, water usage, vehicle types, income tax, electricity, and gas for countries and counties, which we believe are relevant to our modeling and prediction. All datasets used in our solution are under open and public licenses.
End to End Data Pipeline. Our solution's data pipeline is built from a set of processes: acquire, clean, explore, model, and present. We first acquire the raw datasets and then move them through the different processes of the project life cycle. We use machine learning (ML) to create and evaluate our models and algorithms. At the end, our product delivers actionable answers to the business problem, providing insights that help the audience make business or investment decisions.
End to End Architecture. Our end-to-end system architecture runs on the AWS Cloud, using AWS SageMaker, IAM services, and EC2 resources. The environment is administered with IAM for account and security management. Raw datasets are uploaded to R and Python notebooks for data cleaning and pre-processing. Training and testing datasets (80/20 split) are then created for regression modeling, and the final model with the least error is used to produce the consumption CO2 estimates for California counties. The model and data are stored on the web, where end users can obtain estimates with fixed or varied input parameters.
Regression Models. Our solution is based on the most common regression models because we estimate carbon emissions from different input parameters, and both the inputs and the outputs are continuous, real-valued quantities; that is, the solution tackles a regression problem. The models we use are Linear Regression, Linear Regression with Regularization (Lasso and Ridge), Decision Tree Regression, and KNN Regression, evaluated with the metrics discussed in the evaluation section.
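To make the difference between these model families concrete, here is a small single-feature sketch in plain Python covering three of them: ordinary least squares, ridge-regularized least squares, and k-nearest-neighbour regression. It is an illustration under simplifying assumptions (one input feature, hypothetical function names `fit_linear` and `fit_knn`, made-up data), not the implementation used for the reported results; Lasso and decision trees are left out to keep it short.

```python
def fit_linear(xs, ys, l2_penalty=0.0):
    """Fit y = slope * x + intercept by least squares.

    With l2_penalty > 0 the slope is shrunk toward zero (ridge regression);
    the intercept is left unpenalized.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + l2_penalty)
    intercept = y_mean - slope * x_mean
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k=3):
    """Predict the mean target of the k training points closest to x."""
    points = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

ols = fit_linear(xs, ys)                    # plain linear regression
ridge = fit_linear(xs, ys, l2_penalty=5.0)  # regularized linear regression
knn = fit_knn(xs, ys, k=2)                  # non-parametric neighbour average

for name, model in [("ols", ols), ("ridge", ridge), ("knn", knn)]:
    print(name, model(4.0))
```

The linear fits extrapolate along a line (the ridge line is flatter because of the penalty), while KNN can only average targets it has already seen, so its prediction at x = 4 stays within the training range.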
|Datasets||Error Metrics||Linear Regression||LR - Lasso||LR - Ridge||Decision Tree Regression||KNN Regression|
|Phase 1 Countries Datasets||R2 training||0.911||0.9114||0.9114||0.8851||0.943|
|Phase 2 Counties Datasets||R2 training||0.961||0.961||0.964||0.689||0.966|
Error Metrics. Three error metrics are measured to assess the accuracy of each regression model. Based on the lowest error metrics, Linear Regression is selected as the optimal model and is used for county CO2 emission prediction.
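As a rough illustration of how such metrics are computed, the sketch below implements R2 (the metric shown in the table above) together with mean absolute error and root mean squared error. The choice of MAE and RMSE as the other two metrics is an assumption made for this example, since only R2 is named here; the sample values are made up.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference (assumed metric, not named in the text)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference (assumed metric)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]

print(r2_score(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))
print(root_mean_squared_error(y_true, y_pred))
```

R2 is higher for better fits, whereas MAE and RMSE are lower for better fits, so a "least error" comparison needs to state which direction each metric is read in.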