census-income dataset github


Comparing machine learning classification models. After gaining sufficient knowledge of the attributes, we performed a classification task to predict whether an individual makes more than 50K a year, using different machine learning algorithms. Only about 1/3 of the population at the time would be considered high income, while 2/3 of the population was making less than 50,000 USD per year. The following is the summary of the performance of our other models: c) Extracted columns 5, 8, and 11 and stored them in census_col. I analyze and explore US Census Bureau data using data visualization techniques to identify salient features useful for predicting an individual's income level. ii) Built a linear model on the test set where the dependent variable is hours.per.week and the independent variable is education.num. The repository contains adult.data for training and adult.test for testing. education-num: the highest level of education achieved, in numerical form. Ensemble: majority vote over 7 learned classifiers. v) Built a confusion matrix and calculated the accuracy. The neural network is the strongest classifier of the seven methods. The final prediction is made by taking a majority vote among the predictions of these classifiers. The goal is to build a classification methodology to determine whether a person makes over 50K per year. Predicting-Income-Class-based-on-Census-Data, Predicting Income Class based on US Census Data.ipynb, Predicting Income Class based on US Census Data.
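
The majority-vote ensemble mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the vote table are hypothetical, and the original projects may have combined their seven classifiers differently.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among one instance's classifier votes."""
    # With an odd number of classifiers and two labels, ties cannot occur.
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(per_classifier_predictions):
    """Combine the predictions of several classifiers, instance by instance."""
    # zip(*...) regroups the labels so each tuple holds one instance's votes.
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


# Hypothetical outputs of 7 classifiers on 4 instances (0: <=50K, 1: >50K).
votes = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
print(ensemble_predict(votes))
```

Each inner list is one classifier's output, so the ensemble label for an instance is whichever class at least four of the seven classifiers chose.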

Construct a model that accurately predicts whether an individual makes more than $50,000.

The Random Forest classifier also gives very high prediction accuracy on the test set. g) Got the count of the different levels of the workclass column. iii) Plotted the decision tree. 2019/20 - Income Prediction (Ind. I will explore the data at face value in order to understand the trends and the representation of certain demographics in the census, and then I will use these insights to build machine learning models that help predict whether an individual would make more or less than 50,000 USD in 1994. vi) Plotted the ROC curve and found the AUC (area under the curve). a) Built a simple linear regression model as follows: i) Divided the dataset into training and test sets in a 70:30 ratio. iv) Plotted accuracy vs. cut-off and picked an ideal value for the cut-off. Married people had the highest percentage of high-income individuals, with husbands making up the majority of the workforce. The male working market was more than double the female working market in 1994. The male-dominant job was Craft-repair, whereas the female-dominant job was Adm-clerical. After one-hot encoding all our other nominal categorical fields, the final features we chose for our machine learning models were: Running Decision Tree and Random Forest allowed us to white-box the most important features that predicted income. The goal of this project is to profile people in the above dataset based on available demographic attributes. I have created a machine learning model to predict whether a person's salary is greater than $50K or not. In this project, we first preprocess the data and then develop an understanding of its features by performing exploratory analysis and creating visualizations. Prediction accuracy for instances labeled > 50K is 0.82.
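
The "accuracy vs. cut-off" step can be illustrated as follows. This is a minimal sketch, assuming the model outputs a probability of the >50K class for each instance; the function names, probabilities, and labels are hypothetical and not taken from any of the projects.

```python
def accuracy_at_cutoff(probabilities, labels, cutoff):
    """Fraction classified correctly when p >= cutoff is read as '>50K' (1)."""
    predictions = [1 if p >= cutoff else 0 for p in probabilities]
    correct = sum(pred == y for pred, y in zip(predictions, labels))
    return correct / len(labels)


def best_cutoff(probabilities, labels, candidates):
    """Return (cutoff, accuracy) for the candidate with the highest accuracy."""
    scored = [(c, accuracy_at_cutoff(probabilities, labels, c)) for c in candidates]
    # max() keeps the first candidate on ties, i.e. the lowest cut-off here.
    return max(scored, key=lambda pair: pair[1])


# Hypothetical predicted probabilities and true labels (1: >50K, 0: <=50K).
probs = [0.10, 0.35, 0.40, 0.62, 0.80, 0.90]
truth = [0, 0, 1, 0, 1, 1]
cutoffs = [i / 10 for i in range(1, 10)]

cutoff, accuracy = best_cutoff(probs, truth, cutoffs)
print(f"best cut-off: {cutoff}, accuracy: {accuracy:.3f}")
```

Plotting the `scored` pairs would give the accuracy vs. cut-off curve. On an imbalanced dataset, a metric other than raw accuracy may be a better basis for choosing the cut-off.
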
It's highly recommended to install Anaconda, a pre-packaged Python distribution that contains all of the necessary libraries and software for this project. For example, an individual could be a Husband. ), Data science project for feature engineering and classification using the Census Income dataset as a case study. To predict whether the income of the individual is above or below 50K. TCD Machine Learning Kaggle individual competition in which one predicted the income of people with machine learning using pandas, numpy and sklearn. Are there any underlying clusters (groups) in the census data? From a big-data perspective, we replicated census.csv to twice its size (97,684 rows) and named it census2.csv. marital-status: marital status of an individual. Capital gain was a good indicator of wealth: people making more than 50K were clearly separated by their higher capital gains, which is an indicator of the wealth gap in the US starting to grow. Building a classification model to predict whether a person's annual income is more than $50K or below $50K. Census income prediction by employing the power of supervised learning. Each entry has only one relationship attribute, which is somewhat redundant with marital status. It describes 15 variables on a sample of individuals from the US Census database. ii) Built a random forest model where the dependent variable is X(Yearly Income), the rest of the variables are independent variables, and the number of trees is 300. The dataset consists of 32562 rows and 14 features. a) Built a random forest model as follows: i) Divided the dataset into training and test sets in an 80:20 ratio. a) Built a bar plot for the relationship column and filled the bars according to the race column. Please see the UCI website for more details and attribute information. Performed data manipulation to analyze the dataset using various functions from the dplyr package. 4. http://scg.sdsu.edu/dataset-adult_r/.
This project requires Python 3.x and the following Python libraries installed: You will also need to have software installed to run and execute an iPython Notebook. Prediction accuracy for instances labeled > 50K is 0.84. fnlwgt: final weight. Finally, we found the classifier with the highest prediction accuracy by conducting a performance evaluation using ROC curves and AUC. I would highly recommend that you have some kind of toolchain and development environment installed and ready before the hack night.

In this project report we summarize our analysis and exploration of the Adult Census Data to come up with meaningful, important and interesting attributes of the data. Using this data, I will build ML models to predict whether an individual's income will be greater than or less than $50,000 USD per year based on features from the census data. f) Extracted 200 random rows from the census data frame and stored them in census_200. The Decisiontreebgdata.R file uses census2.csv as an input. It also uses Mydata.csv, which is the processed and cleaned form of census2.csv. Perform Big Data Analytics on UCI Census Income Dataset for Income Prediction. After cleaning, analyzing, and plotting the data, here are some insights I found: Numerical fields that showed correlation include: age, education_num, capital_gain, capital_loss, and hours_per_week. If you have no idea where to start with this, try a combination like: The data here is for the "Census Income" dataset, which contains data on adults from the 1994 census. Individual Machine Learning competition code as part of the 2019/20 Machine Learning module at Trinity College Dublin. iv) Calculated the root-mean-square error (RMSE). v) Built a confusion matrix and calculated the accuracy. e) Extracted all the 39-year-olds who either have a bachelor's degree or are natives of the United States and stored the result in census_us. Education was a pretty good indicator of income, with the highest percentages of high-income individuals among those who finished a PhD, Master's, or Bachelor's degree. The project involved data assessment and cleaning, performing EDA and drawing conclusions from the data.
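
The regression steps listed in these fragments (a 70:30 split, a simple linear model of hours.per.week on education.num, and the RMSE) can be sketched in plain Python. This is an assumption-laden illustration, not the original R code: the helper names are hypothetical, and the data is synthetic and placed exactly on a line so that the result is easy to check.

```python
import math
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(points, intercept, slope):
    """Root-mean-square error of the fitted line on the given points."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic (education_num, hours_per_week) pairs lying exactly on a line.
data = [(x, 30 + 1.5 * x) for x in range(1, 17)]

train, test = train_test_split(data, train_fraction=0.7)
intercept, slope = fit_line(train)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, "
      f"test RMSE={rmse(test, intercept, slope):.4f}")
```

In this sketch the model is fitted on the training part and the RMSE is computed on the held-out test part. With real census rows the error would of course not be close to zero.
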
The US Adult income dataset was extracted by Barry Becker from the 1994 US Census Database. ii) Built a logistic regression model where the dependent variable is X(yearly income) and the independent variables are age, workclass, and education. The majority of the population had a high school degree and/or some college education. This data is labeled with whether the person's yearly income is above or below $50K. occupation: the general type of occupation of an individual. a) Extracted the education column and stored it in census_ed. SVM model gives max. b) Built a multiple logistic regression model as follows: i) Divided the dataset into training and test sets in an 80:20 ratio. The dataset is in the form of a CSV file you can download here. v) Built a confusion matrix and calculated the accuracy. This is a binary classification problem. The dataset is in the data folder. Prediction accuracy for instances labeled <= 50K is 0.81. iv) Predicted the values on the test set. c) Removed all whitespace from the columns. Asians had the greatest percentage of individuals making over 50K, with the White class following close behind. a) Built a decision tree model as follows: i) Divided the dataset into training and test sets in a 70:30 ratio. Basic machine learning classifiers with scikit-learn. Predict whether income exceeds $50K/yr based on census data. iii) Predicted the values on the train set and found the error in prediction.
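
The confusion matrix, the overall accuracy, and the per-class accuracies quoted in these fragments (for instances labeled > 50K and <= 50K) relate to one another as in the sketch below. The function names and the label vectors are hypothetical and serve only as an illustration.

```python
def confusion_matrix(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels where 1 means '>50K'."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def summarize(y_true, y_pred):
    """Overall accuracy plus the accuracy within each true class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "accuracy_gt_50k": tp / (tp + fn),  # share of true >50K found
        "accuracy_le_50k": tn / (tn + fp),  # share of true <=50K found
    }


# Hypothetical true labels and predictions for 10 individuals.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(confusion_matrix(y_true, y_pred))
print(summarize(y_true, y_pred))
```

Reporting the two per-class figures alongside the overall accuracy matters here because the classes are imbalanced: a model can score well overall while doing poorly on the smaller > 50K class.
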
The main code for this project is located in the AF Data Science Project.ipynb notebook file. This project predicts whether a person's salary falls in the 50K+ or the 50K- class.

https://archive.ics.uci.edu/ml/datasets/census+income. K-Nearest Neighbors, Support Vector Machine, Logistic Regression, Random Forest, Naive Bayes, Decision Tree, AdaBoost Decision Tree. Please see the Report for more details. b) Built a histogram for the age column with the number of bins equal to 50. Build a classification model based on the features that you select to predict whether the income is above $50K or not. https://www.kaggle.com/uciml/adult-census-income, Salary, age, workclass, fnlwgt, education, education_num, marital-status, occupation, relationship, race, sex, capital-gain, capital-Loss, hours-per-week, native-country. race: description of an individual's race, sex: the biological sex of the individual, capital-gain: capital gains for an individual (money gained outside of salary), capital-loss: capital loss for an individual (money lost outside of salary), hours-per-week: the hours an individual has reported to work per week, native-country: country of origin for an individual. For more details about this dataset, you can refer to the following link: https://archive.ics.uci.edu/ml/datasets/census+income. iii) Predicted the values on the test set.

The most important features included: The machine learning model that gave us the highest accuracy was the AdaBoost model using Decision Tree, which gave us 86.96 percent accuracy. d) Built a box plot between the education and age columns. Map education on the x-axis and age on the y-axis. Each entry contains the following information about the class of individual: workclass: a general term to represent the employment status of an individual. iii) Predicted the values on the test set. I am going to examine the Census Income dataset available at the UC Irvine Machine Learning Repository. Prediction accuracy for instances labeled > 50K is 0.85. The dataset we are going to use is the Adult census income dataset from Kaggle, which contains about 32561 rows and 15 features. The "Census Income" dataset from the UCI Machine Learning Repository contains income information for over 48,000 individuals, taken from the 1994 US census. relationship: represents what this individual is relative to others. Prediction accuracy for instances labeled <= 50K is 0.83. If you want to view the deployed model, click on the following link: https://incomep.herokuapp.com/. If you are looking for the code, the algorithms used and the accuracy of the model, please open the "Income Prediction.ipynb" file. d) Extracted all the male employees who work in state-gov and stored them in male_gov. We had to remove certain fields to prevent co-correlation. iv) Built a confusion matrix and calculated the accuracy. In a terminal or command window, navigate to the top-level project directory Adult-Census-Income/ (that contains this README) and run one of the following commands: This will open the iPython Notebook software and project file in your browser.
What are the key factors contributing to high vs. low income? Build a classification model based on the features that you select to predict whether a person's yearly income is above $50K or not. h) Calculated the mean of the capital.gain column grouped according to workclass. iii) Predicted values on the test set. education: the highest level of education achieved by an individual. Building an Income Prediction System Using Machine Learning Model and Deploying it as a Web App. Used various machine learning algorithms to perform a classification task predicting whether an individual makes more than 50K a year on the 'US Census Income' dataset. Data-Visualization-and-Machine-Learning-Classification-Methods-for-Income-Prediction, TCD_Income_Prediction_Kaggle_Competition_2019. Prediction accuracy for instances labeled <= 50K is 0.84. ii) Built a logistic regression model where the dependent variable is X(yearly income) and the independent variable is occupation. Our client is a local university that wants to use income as the key demographic to decide criteria for marketing its degree programs.

From the training dataset, use undersampling to create a balanced dataset: select a subset of the majority examples to match the number of minority examples. accuracy of 83%, also voting classifier model is second most accurate. Note that the dataset is made up of categorical and continuous features. Each classifier explored has an accuracy of over 85%. In other words, this is the number of people the census believes the entry represents. The Census Dataset is provided by the UC Irvine Machine Learning Repository. Team members: Xue Cao, Xiyu (Ella) Liang, Ke (Vicky) Xu. iv) Plotted accuracy vs. cut-off and picked an ideal value for the cut-off. Predict whether income exceeds $50K/yr based on census data. This dataset contains information about the age, gender, occupation, education, and workclass of 32,000 people from the US. Prediction accuracy for instances labeled <= 50K is 0.82. b) Removed all the rows that contain NA values. Each instance of training and test data is classified as 0 (less than or equal to 50K dollars annually) or 1 (greater than 50K annually) using the learned classifiers. Building a classification model to predict whether a person's annual income is more than $50K or below $50K. 3. https://archive.ics.uci.edu/ml/datasets/Census+Income The dataset consists of anonymous information such as occupation, age, native country, race, capital gain, capital loss, education, work class and more. c) Built a scatter plot between capital.gain and hours.per.week. Further, after gaining sufficient knowledge of the attributes, we performed a classification task to predict whether income exceeds $50K/yr based on given features, such as age, education, occupation, gender, race, etc. Python, Numpy, Pandas, Matplotlib, Seaborn. b) Extracted all the columns from age to relationship and stored them in census_seq. vi) Plotted the ROC curve and calculated the AUC (area under the curve).
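
Random undersampling of the majority class, as described at the start of this passage, might look like the sketch below. It assumes the rows are dictionaries with an `income` field. The function name, field name, and toy data are hypothetical, and libraries such as imbalanced-learn offer ready-made alternatives.

```python
import random


def undersample(rows, label_of, seed=0):
    """Balance the classes by sampling every class down to the minority size."""
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    minority_size = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        # Sampling without replacement keeps each retained row unique.
        balanced.extend(rng.sample(group, minority_size))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced


# Toy training set: 10 rows labeled '>50K' and 20 rows labeled '<=50K'.
rows = [{"age": 20 + i, "income": ">50K" if i % 3 == 0 else "<=50K"}
        for i in range(30)]

balanced = undersample(rows, label_of=lambda row: row["income"])
print(len(balanced))
```

Only the training split should be balanced this way. The test split keeps its original class proportions so that the reported accuracies reflect the real distribution.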
We use those relevant features and multiple classification methods (Decision Tree, SVM, and K-Nearest Neighbors) to predict the income level for unknown individuals. It also contains missing values. The categorical columns are: workclass, education, marital_status, occupation, relationship, race, gender, native_country. An environment to work in - something like. income: whether or not an individual makes more than 50,000 annually. In this project we analyze U.S. census data taken from the UCI Machine Learning Repository. Finding Donors for CharityML using supervised learners. Capital loss was spread across both high-income and low-income individuals and was not a clear indicator of wealth. ii) Built a decision tree model where the dependent variable is X(Yearly Income) and the rest of the variables are independent variables. Note: Married-civ-spouse corresponds to a civilian spouse, while Married-AF-spouse is a spouse in the Armed Forces. Prediction accuracy for instances labeled <= 50K is 0.80. The Salary feature is the label we want to predict. Additionally, the AF Data Science Project.html file contains a snapshot of the main code in the Jupyter notebook with all code cells executed. CS7CS4- Machine Learning- Income Prediction- Kaggle Competition, Udacity Machine Learning Engineer Nanodegree Program Capstone Project, TCD ML Comp. 1. https://mathematicaforprediction.wordpress.com/2014/03/30/classification-and-association-rules-for-census-income-data/ If it somehow can't, see if you can at least install Python and pip and then use pip to install the above packages. This repo stores the script, the data, and its description for the kernels published on Kaggle. Map capital.gain on the x-axis and hours.per.week on the y-axis. For Linux people, your package manager should be able to handle all of this.

US Adult Census data relating income to social factors such as age, education, race, etc. 2. https://www.knowbigdata.com/blog/predicting-income-level-analytics-casestudy-r Our correlation matrices show a lot of correlation as well as co-correlation between fields. a) Built a simple logistic regression model as follows: i) Divided the dataset into training and test sets in a 65:35 ratio. https://mathematicaforprediction.wordpress.com/2014/03/30/classification-and-association-rules-for-census-income-data/, https://www.knowbigdata.com/blog/predicting-income-level-analytics-casestudy-r, https://archive.ics.uci.edu/ml/datasets/Census+Income. In this project, we are going to predict whether a person's annual income is more than $50K or below $50K using various features like age, education, workclass, country, occupation, etc. Prediction accuracy for instances labeled > 50K is 0.80. The US Adult Census dataset is a repository of 48,842 entries extracted from the 1994 US Census database. Trained an AdaBoost model to predict income with 85.44% quality. Each row is labelled as having a salary of either ">50K" or "<=50K". For data analysis, we used the DataAnalysis.R file, which takes the census.csv file as input and contains 48842 rows. Reference: Are there any significant gaps in these Census attributes by gender or race? a) Replaced all the missing values with NA.