The Socio-Economic System (SES) at the regional or provincial level refers to the way economic and social factors influence each other in households and local community units. These systems significantly affect pollution, deforestation, contamination, catastrophic events, and the production and use of energy.

East Godavari district is one of the northern coastal districts of Andhra Pradesh (A.P.), India, and forms part of the fertile Godavari delta; horticulture and aquaculture are significant parts of the district's economy. With the ongoing discovery of oil and natural gas reserves, its industrial sector has also grown. It is home to two major fertilizer plants and a few gas-based power plants and petroleum refineries. At present, it is one of the biggest oil and gas hubs in India.

This research concerns the SES of Rajamahandravaram and uses an empirical dataset. According to our literature review, very little work has been reported on Rajamahandravaram socio-economic studies that use machine learning (ML) to predict SES levels; existing studies rely mostly on statistical analysis. The dataset is novel, and we found no earlier attempt to design ML models on Rajamahandravaram SES data. Section 2 describes related research on SES analysis with ML algorithms applied to various datasets. Section 3 presents the proposed model and the materials, including the ML algorithms analyzed. Sections 4 and 5 provide a detailed comparative analysis of the experimental results, the conclusion, and proposals for future work.

In this section, we describe researchers' views on socio-economic status and ML. We reviewed reputed journals related to this topic, and some of the papers are presented here. Socio-economic status is a multidimensional concept with a variety of definitions. Some authors measure it by income, while others also include health, happiness, education, social status, peace, and political rights. What connects all of these researchers is their work on identifying factors, classifying the population according to different views of socio-economic status (SES), and predicting future SES levels.

The contributions of this work are as follows:

In this research, we collected household information from Rajahmundry, Andhra Pradesh, India, using a structured questionnaire. The data were sampled according to the ratios of SES levels: rich, upper middle class, middle class, and poor.

We compiled the dataset (*.csv) with 49 attributes, including the class attribute.

In this paper, we apply five well-known ML models, namely Naïve Bayes, decision trees (DTs), k-NN, SVM (RBF kernel), and Random Forest (RF), and compare them with past and recent research on SES level detection. In this comparison, the RF model outperforms the others.

This research is useful to researchers, analysts, administrative staff, and government bodies working on socio-economic systems.

This research supports applications that detect SES levels automatically, such as mobile apps.

We will extend this research to study the effects of COVID-19 on the SES of the Rajahmundry area.

The paper is organized as follows:

Section 2 reviews the literature relevant to SES research in detail.

Section 3 outlines the proposed model and its structure, including the mathematical and algorithmic details.

Section 4 presents the experimental setup and an analysis of the results. There, we analyze the ML models on the Rajahmundry AP SES dataset and compare their results.

Section 5 concludes the work with some future directions.

In this section, we describe the experimental setup and its working process step by step, together with the experimental materials such as metrics and measurement equations. The section focuses mainly on the ML algorithms, their configuration, and their working process, and also describes performance measurement tools such as the confusion matrix and the ROC curve.


Using this information, we construct statistical analysis reports that help analysts and decision-makers take preventive action on poverty. In parallel, the data is pre-processed with algorithms such as PCA (principal component analysis) and split into training and testing parts (80% train, 20% test) before the machine learning algorithms are applied. We mainly use popular ML algorithms: Naïve Bayes, Decision Trees, Random Forest, k-NN (k-Nearest Neighbors), and SVM (Support Vector Machines). After building the models, we evaluate them with metrics such as Accuracy (AC), TP Rate, FP Rate, F1, and AUC (from the ROC curve). Based on the comparison, we choose the best-performing ML model for predicting the class of unseen input feature values. Lastly, we send the performance results, predicted values, and visualization graphs to the analysts and decision-makers.
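The 80%/20% split described above can be sketched as follows. This is only an illustration using the Python standard library: the function name `train_test_split`, the fixed seed, and the placeholder records are assumptions, not part of the original experimental setup.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder records standing in for the 1742 household samples.
samples = list(range(1742))
train, test = train_test_split(samples)
print(len(train), len(test))  # 1394 348
```

With 1742 records, a 20% test share corresponds to 348 records, leaving 1394 for training.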

Rajahmundry, renamed Rajamahandravaram, is one of the major constituencies of East Godavari district in Andhra Pradesh, India. We gathered information about each household from the rural and urban areas of this constituency. We collected 1742 samples between 2018 and 2019, following socio-economic and area-wise ratios and using structured questionnaires. Some of the data is plotted on the Rajamahandravaram map using longitude and latitude values.

Attribute | Data Type | Description |

Area(R-0/U-1) | Discrete (Integer) | Household from Rural (0) or Urban (1) |

Family Size | Continuous (Integer) | Total members in house, range is 1 to 16 |

Male Size | Continuous (Integer) | Total male members in house, range is 1 to 8 |

Female Size | Continuous (Integer) | Total female members in house, range is 0 to 8 |

below 18 | Continuous (Integer) | Total members under 18 years of age in house, range is 0 to 5 |

above 18 | Continuous (Integer) | Total members aged 18 or over in house, range is 0 to 12 |

married people | Continuous (Integer) | Total married members in house, range is 0 to 8 |

No. of children | Continuous (Integer) | Number of children in house, range is 0 to 2 |

No. of literates | Continuous (Integer) | Number of literates in house, range is 0 to 12 |

High Qualification | Discrete (Integer) | Qualification of house members, range is 0 to 5: 0-very low, 3-moderate, 5-very high |

No. of Workers | Continuous (Integer) | Number of workers in house, range is 0 to 8 |

Child work below 15 | Continuous (Integer) | Number of child workers in house, range is 0 to 1 |

Occupation(0-5) | Discrete (Integer) | Occupation of household members, range is 0 to 7: 0-very low or none, 4-moderate, 7-very high |

Major Work | Discrete (Integer) | Work category, range is 0 to 5: 0-very low, 3-moderate, 5-very high |

Ration Cards | Discrete (Integer) | Ration card of house: 1-white, 2-pink |

Health cards | Discrete (Integer) | Health cards, range is 0 to 1: 0-no, 1-yes |

No. of Diseased people | Continuous (Integer) | Number of diseased members in house, range is 0 to 2 |

No. of Handicapped | Continuous (Integer) | Number of handicapped members in house, range is 0 to 1 |

Bikes | Continuous (Integer) | Number of bikes in house, range is 0 to 3: 0 for none, 3 for ‘3 or more’ |

Cars | Continuous (Integer) | Number of cars in house, range is 0 to 3: 0 for none, 3 for ‘3 or more’ |

others | Continuous (Integer) | Number of other vehicles in house, range is 0 to 3: 0 for none, 3 for ‘3 or more’ |

Own house(1) or rental(0) | Discrete (Integer) | Status of house: 0 for ‘Rental’, 1 for ‘Own’ |

land(cents) | Continuous (Real) | Agricultural land in cents, range is 0 to 2000.0 |

Gold | Continuous (Real) | Gold in grams in house, range is 0 to 1500.0 |

Annual Income | Continuous (Real) | Annual income of house, range is 27000.0 to 15000000.0 |

Income from Govt | Continuous (Real) | Income of house from Govt., range is 0 to 1480000.0 |

income from pension | Continuous (Real) | Income of house from pension, range is 0 to 40000.0 |

Income from private | Continuous (Real) | Income of house from private or own sources, range is 0 to 15000000.0 |

Social status(0/1/2/3/4) | Discrete (Integer) | Social status: 1 for ‘ST’, 2 for ‘SC’, 3 for ‘OBC’, 4 for ‘OC’, 0 for ‘none’ |

Nearest Hospital in Km | Continuous (Real) | Hospital distance in km, range is 1.0 to 6.0 |

Nearest Primary School in Km | Continuous (Real) | Primary school distance in km, range is 1.0 to 5.0 |

Nearest High School in Km | Continuous (Real) | High school distance in km, range is 1.0 to 10.0 |

Nearest College in Km | Continuous (Real) | College distance in km, range is 2 to 14.0 |

Nearest University | Continuous (Real) | University distance in km, range is 30.0 to 45.0 |

Addicted persons to smoke drinks | Discrete (Integer) | Drinking and smoking habits in house: 0-None, 1-Partial, 2-Addicted, 3-Extreme |

Building model | Discrete (Integer) | Building model, range is 0 to 5: 0-low level, 3-moderate, 5-high level |

Water sources | Discrete (Integer) | Water facilities: 0-none or poor, 1-moderate, 2-good facility, 3-very good |

Toilets facilities 1/0 | Discrete (Integer) | Toilet facilities: 0 for ‘no’ and 1 for ‘yes’ |

electricity1/0 | Discrete (Integer) | Electricity facilities: 0 for ‘no’ and 1 for ‘yes’ |

TV 1/0 | Discrete (Integer) | TV facilities: 0 for ‘no’ and 1 for ‘yes’ |

Fridge 1/0 | Discrete (Integer) | Fridge facilities: 0 for ‘no’ and 1 for ‘yes’ |

Air Condition1/0 | Discrete (Integer) | Air conditioning facilities: 0 for ‘no’ and 1 for ‘yes’ |

Heater1/0 | Discrete (Integer) | Heater facilities: 0 for ‘no’ and 1 for ‘yes’ |

Computer 1/0 | Discrete (Integer) | Computer facilities: 0 for ‘no’ and 1 for ‘yes’ |

Fuel for cooking 1/0 | Discrete (Integer) | Fuel for cooking: 0 for ‘Non-Gas’ and 1 for ‘Gas’ |

Income status in past 5 years(2/0/1) | Discrete (Integer) | 0-Decreasing income, 1-Remains the same, 2-Increasing income |

Internet 1/0 | Discrete (Integer) | Internet facility: 0-No and 1-Yes |

Migrated family or not | Discrete (Integer) | Migrated from other places: 0-No and 1-Yes |

SES Levels(1-4) | Discrete (String) | Levels: 1-Poor, 2-Middle, 3-Upper middle, 4-Rich |

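The Naïve Bayes (NB) classifier assumes that the presence of a particular feature in a class is independent of every other feature ^{16}. By Bayes' theorem, the conditional (posterior) probability of class $c$ given the feature vector $x$ is

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

where $P(c)$ is the prior probability of the class, $P(x \mid c)$ is the likelihood, and $P(x)$ is the evidence.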

It is among the most successful algorithms for applications such as text document classification, spam filtering, and recommender systems.

NB classifier model for the SES level probabilities:

Step 1: Compute the prior probability of each class level in the SES dataset.

Step 2: Find the likelihood of each attribute value for each SES class.

Step 3: Apply Bayes' formula to the SES feature attributes and compute the posterior probabilities.

Step 4: Assign the input to the class with the highest posterior probability.

To simplify the calculation of the prior and posterior probabilities, we use two tables, a frequency table and a probability table. All SES features are included in the frequency table.
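The four NB steps can be illustrated with a minimal sketch for categorical attributes. The function names, the toy records (two yes/no attributes such as TV and Fridge), the Laplace smoothing parameter `alpha`, and the assumption of two possible values per attribute (`n_values=2`) are illustrative choices, not details taken from the paper.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Build the frequency tables: class counts and per-attribute value counts."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict_nb(row, class_counts, value_counts, alpha=1.0, n_values=2):
    """Return (most probable class, posterior per class) for one record."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # Step 1: prior probability of the class
        for i, value in enumerate(row):
            hits = value_counts[(i, label)][value]
            # Step 2: likelihood with Laplace smoothing (an added assumption)
            score *= (hits + alpha) / (count + alpha * n_values)
        scores[label] = score
    norm = sum(scores.values())
    posteriors = {label: s / norm for label, s in scores.items()}  # Step 3
    return max(posteriors, key=posteriors.get), posteriors  # Step 4

# Toy records: (TV, Fridge) with 1 = yes and 0 = no; labels are SES levels.
rows = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (1, 0)]
labels = ["Rich", "Rich", "Middle", "Poor", "Poor", "Middle"]
class_counts, value_counts = train_nb(rows, labels)
predicted, posteriors = predict_nb((1, 1), class_counts, value_counts)
print(predicted)  # Rich
```

Smoothing is included so that an attribute value never seen with a class does not force the whole product to zero.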

Another powerful supervised ML model is the support vector machine (SVM), which can be used for both regression and classification problems ^{17}.

k-NN is a non-parametric supervised method suitable for both classification and regression. It considers the k closest data points among the training examples. The output depends on whether k-NN is used for classification or regression. In classification, the output is the class to which a data point belongs, chosen according to how closely it matches its k nearest neighbors. It is an instance-based, or lazy, learning algorithm ^{18}. The algorithm uses a distance function to find the k nearest neighbors. For continuous variables, the Euclidean, Manhattan, and Minkowski distance measures are used, and the Hamming distance is used for categorical variables, as shown in equations (3-5).
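The distance measures named above can be written compactly as follows. This is a minimal sketch with hypothetical function names; it treats Euclidean and Manhattan distance as the Minkowski distance with p = 2 and p = 1, respectively.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    """Minkowski distance with p = 2."""
    return minkowski(a, b, 2)

def manhattan(a, b):
    """Minkowski distance with p = 1."""
    return minkowski(a, b, 1)

def hamming(a, b):
    """Number of positions at which two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))                   # 5.0
print(manhattan((0, 0), (3, 4)))                   # 7.0
print(hamming(("Gas", "Own"), ("Gas", "Rental")))  # 1
```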

The k-nearest neighbors (k-NN) model uses 'feature similarity' to predict the values of new data points: a new point is assigned a value based on how closely it matches the points in the training set.

Step 1: Provide the SES training and test datasets.

Step 2: Initialize K, which can be any positive integer.

Step 3: For each data point in the test data, do the following:

i. Calculate the distance between the test point and each training point using the Hamming, Manhattan, or Euclidean method. (Euclidean distance is used in the experimental setup for the SES dataset.)

ii. Sort the distances in ascending order.

iii. Pick the top K rows from the sorted list.

iv. Assign a class to the test point based on the most frequent class among these rows.

Step 4: End
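The k-NN steps above can be sketched in a few lines. The function name `knn_predict`, the choice k = 3, and the two-feature toy records are assumptions made for illustration; Euclidean distance is used, as stated for the SES experiments.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Step 3.i and 3.ii: compute Euclidean distances and sort them ascending
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    # Step 3.iii: keep the labels of the k closest training points
    top_k = [label for _, label in distances[:k]]
    # Step 3.iv: return the most frequent class among them
    return Counter(top_k).most_common(1)[0][0]

# Toy training data with two numeric features per household (hypothetical values).
train_rows = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["Poor", "Poor", "Poor", "Rich", "Rich", "Rich"]

print(knn_predict(train_rows, train_labels, (1.1, 1.0)))  # Poor
print(knn_predict(train_rows, train_labels, (5.1, 5.0)))  # Rich
```

In practice, features on very different scales (for example income and distance in km) would usually be normalized before distances are computed.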

The decision tree (DT) model is a supervised learning algorithm. DTs can be used for both classification and regression problems, but most researchers use them for classification. A DT is a tree-structured classifier in which internal nodes represent features of the dataset, branches represent decision rules, and leaf nodes represent outcomes. DTs classify data points by sorting them down the tree from the root to a leaf node, and the leaf node gives the classification. Every node in the tree acts as a test on some feature attribute, and each edge descending from the node corresponds to one possible answer to that test. The procedure is recursive and is repeated for each subtree rooted at the new node.

In a decision tree, predicting the class label of a record starts at the root of the tree. We compare the value of the root attribute with the corresponding attribute of the record. Based on this comparison, we follow the branch for that value and jump to the next node. At the next node, the algorithm again compares the attribute value with the child nodes and moves further down. It continues this process until it reaches a leaf node of the tree.

Step-1: Begin the tree with the root node, say S, which contains the complete SES dataset.

Step-2: Find the best attribute in the SES dataset using an Attribute Selection Measure (ASM).

Step-3: Divide S into subsets that contain the possible values of the best attribute.

Step-4: Generate the decision tree node that contains the best attribute.

Step-5: Recursively build new decision trees using the subsets of the dataset created in Step-3. Continue this process until the nodes cannot be classified further; such a final node is called a leaf node.
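Step-2 depends on the Attribute Selection Measure. The text does not say which measure was used, so the sketch below assumes information gain (entropy reduction), one common choice; the function names and toy records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on attribute index attr."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)  # Step-3: one subset per value
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels):
    """Step-2: index of the attribute with the highest information gain."""
    return max(range(len(rows[0])), key=lambda a: information_gain(rows, labels, a))

# Toy records: attribute 0 separates the classes perfectly, attribute 1 not at all.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["Rich", "Rich", "Poor", "Poor"]
print(best_attribute(rows, labels))  # 0
```

A full tree would apply `best_attribute` recursively to each subset, as in Step-5, until every subset contains a single class.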

RF is a supervised ensemble learning model for classification. Its basic premise is that building a small decision tree with a small set of features is computationally cheap. If we build many such small, weak trees in parallel, we can then combine them into a single strong learner by averaging their outputs or taking a majority vote.

Step 1: Randomly choose n features from the full SES feature set.

Step 2: As in decision trees, choose the best split among these features for the root node.

Step 3: Predict the result using each of the decision trees.

Step 4: Count the votes for each target class across the decision tree predictions.

Step 5: The target class with the most votes is taken as the final prediction for the SES dataset.
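The voting scheme in these steps can be illustrated with a deliberately small sketch. To keep it short, each "tree" here is a one-split stump trained on a bootstrap sample, which is a simplification and not the configuration used in the experiments; all names, the seed, and the toy data are assumptions.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    """One-split tree: best (feature, threshold) among the allowed features."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        m = majority(labels)
        return (features[0], float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab

def fit_forest(rows, labels, n_trees=15, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)   # Step 1: random features
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))  # Step 2
    return forest

def predict_forest(forest, row):
    votes = [predict_stump(stump, row) for stump in forest]  # Steps 3 and 4
    return majority(votes)                                   # Step 5

# Toy data: two numeric features per household (hypothetical values).
rows = [(1, 10), (2, 12), (3, 11), (8, 30), (9, 32), (10, 31)]
labels = ["Poor", "Poor", "Poor", "Rich", "Rich", "Rich"]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 9)), predict_forest(forest, (11, 33)))
```

Individual stumps can be wrong when their bootstrap sample is unrepresentative; the majority vote is what makes the combined prediction stable.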

Here, we represent the confusion matrix for the four-class problem with the classes Middle, Poor, Rich, and Upper-middle ^{19}.

Classifier confusion matrix (rows: predicted values; columns: actual or true values)

Predicted \ Actual | Middle | Poor | Rich | U-middle | ∑(Total) |

Middle(M) | M-M | M-P | M-R | M-U | T5 |

Poor(P) | P-M | P-P | P-R | P-U | T6 |

Rich(R) | R-M | R-P | R-R | R-U | T7 |

U-middle(U) | U-M | U-P | U-R | U-U | T8 |

∑(Total) | T1 | T2 | T3 | T4 | Total(T) |

The performance parameters summarize how well each model performs on the dataset ^{20}. We calculated the following parameters: True Positive Rate (TPR; also recall, sensitivity, probability of detection, or power), False Negative Rate (FNR; miss rate), False Positive Rate (FPR; fall-out or probability of false alarm), Specificity (SPC; selectivity or True Negative Rate, TNR), Positive Predictive Value (PPV; precision), False Omission Rate (FOR), Positive Likelihood Ratio (LR+), Negative Likelihood Ratio (LR−), Accuracy (ACC), False Discovery Rate (FDR), Negative Predictive Value (NPV), Diagnostic Odds Ratio (DOR), and F1 score, given by equations 6 to 17, respectively.
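For a multi-class confusion matrix, several of these parameters can be computed per class in a one-vs-rest fashion. The sketch below assumes the layout of the matrix above (rows are predicted classes, columns are actual classes); the function names and the counts in the toy matrix are invented for illustration and are not results from the paper.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def per_class_metrics(matrix, classes):
    """matrix[i][j] = number of records predicted as classes[i] whose actual class is classes[j]."""
    total = sum(sum(row) for row in matrix)
    metrics = {}
    for k, name in enumerate(classes):
        tp = matrix[k][k]
        fp = sum(matrix[k]) - tp                 # predicted k, actually another class
        fn = sum(row[k] for row in matrix) - tp  # actually k, predicted as another class
        tn = total - tp - fp - fn
        tpr, ppv = ratio(tp, tp + fn), ratio(tp, tp + fp)
        metrics[name] = {
            "TPR": tpr,
            "FPR": ratio(fp, fp + tn),
            "SPC": ratio(tn, tn + fp),
            "PPV": ppv,
            "F1": ratio(2 * ppv * tpr, ppv + tpr),
        }
    accuracy = ratio(sum(matrix[k][k] for k in range(len(classes))), total)
    return accuracy, metrics

classes = ["Middle", "Poor", "Rich", "U-middle"]
toy_matrix = [  # invented counts, for illustration only
    [50, 2, 0, 3],
    [4, 40, 0, 1],
    [0, 0, 10, 2],
    [6, 3, 1, 28],
]
accuracy, metrics = per_class_metrics(toy_matrix, classes)
print(round(accuracy, 4), round(metrics["Middle"]["TPR"], 4))
```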

In this section, we analyze the statistical results and the classification accuracies of the machine learning models in detail.

We collected the data from rural and urban areas of the Rajahmundry constituency, East Godavari District, A.P., India. The samples were collected according to the ratios of social and economic status. There are 946 rural samples and 796 urban samples (1742 in total). According to the statistical analysis of the household dataset, a house contains on average 4 to 5 members (mean 4.381, standard deviation 1.467). The smallest households have one member (minimum 1) and the largest has 16 (maximum). Each house contains at least one male (minimum 1) and at most 8, with an average of 2 to 3 males per house. The number of females ranges from 0 to 8, with a mean of 1.975 and a standard deviation of 0.776, which means that a house contains on average one to two females. The statistics also show some favorable conditions: very few child workers, an average of 2 to 3 young people per house, and an average of 1 to 2 workers per house. In addition, the numbers of diseased and handicapped people are very low, with mean values of 0.066 and 0.024, respectively.

Attributes Min, Max, Mean and Standard Deviation Statistics

Attribute | Min | Max | Mean | SD |

Family Size | 1 | 16 | 4.381 | 1.467 |

Male | 1 | 8 | 2.406 | 0.773 |

Female | 0 | 8 | 1.975 | 0.776 |

below 18 | 0 | 5 | 1.592 | 0.882 |

above 18 | 0 | 12 | 2.802 | 1.183 |

married people | 0 | 8 | 2.037 | 0.375 |

No. of children | 0 | 2 | 0.131 | 0.362 |

No. of literates | 0 | 12 | 0.738 | 0.826 |

No. of Workers | 0 | 8 | 1.693 | 0.693 |

Child. work | 0 | 1 | 0.006 | 0.079 |

No. of Diseased people | 0 | 2 | 0.066 | 0.268 |

No. of Handicapped | 0 | 1 | 0.024 | 0.152 |

land(cents) | 0 | 2000 | 140.576 | 202.602 |

Gold(grams) | 0 | 1500 | 34.788 | 51.792 |

Annual Income | 27000 | 15000000 | 331504.6 | 467944.2 |

Income from Govt. | 0 | 1480000 | 24055.68 | 117628.8 |

income from pension | 0 | 40000 | 408.726 | 1351.704 |

Income from private | 0 | 15000000 | 308868 | 467484.8 |

Hospital in Km | 1 | 6 | 3.637 | 0.836 |

Primary School in Km | 1 | 5 | 2.846 | 1.901 |

High School in Km | 1 | 10 | 3.832 | 1.899 |

College in Km | 2 | 14 | 6.866 | 2.057 |

University in Km | 30 | 45 | 35.065 | 4.185 |

Yes or No Attributes Statistics

Attribute | No | Yes |

Health cards | 689 | 1053 |

Own House | 466 | 1276 |

Toilets facilities | 168 | 1574 |

electricity | 0 | 1742 |

TV | 442 | 1300 |

Fridge | 1134 | 608 |

Air Condition | 1481 | 261 |

Heater | 1523 | 219 |

Computer | 1421 | 321 |

Other Type of Attributes Statistics

Attribute | Type | Value |

Ration Cards | white | 660 |

Ration Cards | Pink | 1082 |

Fuel for cooking | Gas | 1575 |

Fuel for cooking | other | 167 |

Social status | ST | 31 |

Social status | SC | 190 |

Social status | BC | 896 |

Social status | OC | 625 |

Addicted persons to smoke and drinking in House | None | 736 |

Addicted persons to smoke and drinking in House | Partial | 826 |

Addicted persons to smoke and drinking in House | Addicted | 158 |

Addicted persons to smoke and drinking in House | Extreme | 22 |

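Some other Types of Attribute Statistics

Attribute | Type | Value |

Literacy and Educators Houses | None or Below 10th | 296 |

Literacy and Educators Houses | 10th Standard | 428 |

Literacy and Educators Houses | Inter Level or ITI | 386 |

Literacy and Educators Houses | Degree Level | 272 |

Literacy and Educators Houses | Technical Degree or Other | 249 |

Literacy and Educators Houses | P.G. level | 101 |

Literacy and Educators Houses | Professional or Ph.D. Level | 10 |

Major Work Occupation in House | No Work or Very less | 6 |

Major Work Occupation in House | Seasonal Workers | 463 |

Major Work Occupation in House | Average or Daily wagers | 497 |

Major Work Occupation in House | Permanent Low salary | 345 |

Major Work Occupation in House | Permanent Middle Salary | 364 |

Major Work Occupation in House | Permanent High Salary | 62 |

Major Work Occupation in House | Business or Organizers | 5 |

Having Bikes in House | None | 939 |

Having Bikes in House | One | 671 |

Having Bikes in House | Two | 118 |

Having Bikes in House | More Than Two | 14 |

Having Cars in House | None | 1607 |

Having Cars in House | One | 128 |

Having Cars in House | Two | 5 |

Having Cars in House | More Than Two | 2 |

Having Other Traveling Resources | None | 1615 |

Having Other Traveling Resources | One | 121 |

Having Other Traveling Resources | Two | 5 |

Having Other Traveling Resources | More Than Two | 1 |

Target Class (SES levels) | Rich | 73 |

Target Class (SES levels) | Middle class | 794 |

Target Class (SES levels) | Upper Middle class | 526 |

Target Class (SES levels) | Poor | 349 |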

A very important factor for economic status is annual income, which depends on each household's resources from public and private sources, assets, work, and so on. According to the statistics, the minimum annual income is 27000/- and the maximum is 15000000/-. Income comes from private sources, government, or pension schemes. The detailed analysis is shown in the

In this section, we analyze the accuracy of the ML algorithms k-NN, DTs, SVM, RF, and NB in detail, using the confusion matrix of each algorithm.

The k-NN model correctly classifies 1643 of the 1742 instances and misclassifies the remaining 99. The overall classification accuracy (CA) is 0.94316 (94.32%), the F1-score is 0.94312, and the precision is 0.94341. Building the model took 0.29 seconds.

The C4.5 decision tree model correctly classifies 1684 of the 1742 instances and misclassifies the remaining 58. The overall classification accuracy (CA) is 0.9667 (96.67%), the F1-score is 0.96659, and the precision is 0.96672. Building the model took 0.18 seconds.

The Support Vector Machine with a radial basis function (RBF) kernel correctly classifies 1647 of the 1742 instances and misclassifies the remaining 95. The overall classification accuracy (CA) is 0.94546 (94.55%), the F1-score is 0.94545, and the precision is 0.94822. Building the model took 0.16 seconds.

The RF model correctly classifies 1704 of the 1742 instances and misclassifies the remaining 38. The overall classification accuracy (CA) is 0.97818 (97.82%), the F1-score is 0.9782, and the precision is 0.9781. Building the model took 0.41 seconds.

The Naïve Bayes model correctly classifies 1514 of the 1742 instances and misclassifies the remaining 228. The overall classification accuracy (CA) is 0.86912 (86.91%), the F1-score is 0.87333, and the precision is 0.89096. Building the model took 0.12 seconds.
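As a quick arithmetic check, each reported accuracy is the number of correctly classified instances divided by 1742. The sketch below recomputes the values from the counts given above; the dictionary names are hypothetical, and the tolerance only allows for rounding or truncation in the reported figures.

```python
TOTAL = 1742  # number of household instances

# (correctly classified instances, reported CA) taken from the text above
reported = {
    "k-NN": (1643, 0.94316),
    "DTs (C4.5)": (1684, 0.9667),
    "SVM (RBF)": (1647, 0.94546),
    "Random Forest": (1704, 0.97818),
    "Naive Bayes": (1514, 0.86912),
}

recomputed = {}
for model, (correct, ca) in reported.items():
    recomputed[model] = correct / TOTAL
    wrong = TOTAL - correct
    print(f"{model}: {correct} correct, {wrong} wrong, CA = {recomputed[model]:.4f}")

best_model = max(recomputed, key=recomputed.get)
print("Best:", best_model)  # Best: Random Forest
```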

The ROC curve plots sensitivity (TP rate) against 1 − specificity (FP rate), both of which range from 0 to 1.
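The area under the ROC curve (AUC) equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise formulation for a single class treated one-vs-rest; the function name, the scores, and the labels are invented for illustration and do not come from the experiments.

```python
def auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for the class "Rich" versus all other classes.
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
is_rich = [True, True, False, True, False]
print(round(auc(scores, is_rich), 4))  # 0.8333
```

Here five of the six positive-negative pairs are ordered correctly, which gives 5/6.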


Model | AUC | CA | F1 | Precision | Recall | Time Taken |

k-NN | 0.9954 | 0.94316 | 0.94312 | 0.94341 | 0.94316 | 0.29 sec. |

DTs (Tree) | 0.9905 | 0.96670 | 0.96659 | 0.96672 | 0.96671 | 0.19 sec. |

SVM | 0.9907 | 0.94546 | 0.94545 | 0.94822 | 0.94546 | 0.16 sec. |

Random Forest | 0.9992 | 0.97818 | 0.9782 | 0.9781 | 0.97818 | 0.41 sec. |

Naive Bayes | 0.9765 | 0.86911 | 0.87333 | 0.89096 | 0.86911 | 0.12 sec. |

This comparison is also analyzed in detail using a bar chart.

The analysis and prediction of socio-economic status is very useful for analysts, organizations, and government. The sampled statistical results characterize the economic features, social standards, and SES levels of the Rajamahandravaram constituency. This work describes SES through machine learning models. In the comparative study, the Random Forest model was the best at predicting SES levels on the Rajahmundry SES dataset, with an accuracy (CA) of 0.978 and an AUC of 0.999. In future work, we will collect samples from the whole East Godavari district and work with GPS data using deep learning to obtain more accurate results. We will also study this area before and after COVID-19.

We would like to thank Prof. M. Jagannadha Rao, Vice Chancellor, Adikavi Nannaya University, AP, India, who supported this research, and the R & D Cell of Adikavi Nannaya University for providing the necessary materials and technical support. We would also like to thank the people who provided their household information themselves for this research.