KIIT Deemed To Be University: A Project Report
on
“HOUSING PRICE
PREDICTION”
Submitted to
KIIT Deemed to be University
CERTIFICATE
This is to certify that the project entitled
“HOUSING PRICE
PREDICTION”
submitted by
is a record of bona fide work carried out by them, in partial fulfillment of the
requirement for the award of the Degree of Bachelor of Engineering (Computer Science
& Engineering OR Information Technology) at KIIT Deemed to be University,
Bhubaneswar. This work was done during the year 2018-2019, under our guidance.
(Prof. BHASWATI SAHOO)
Acknowledgement
We are profoundly grateful to Prof. Bhaswati Sahoo for her expert guidance and
continuous encouragement throughout the project right from its commencement to
its completion.
Broadly, this paper addresses the question of how house prices are affected by
housing characteristics, both internal (such as the number of bathrooms and
bedrooms) and external (such as schools or parks in the neighbourhood). Using
data from Kaggle, a prominent dataset website, this paper uses the Linear
Regression model to predict house prices. It also identifies the important
attributes in housing price prediction, such as comparable houses, sold price,
price per square foot, the year in which the house is sold, building type and
number of bedrooms.
It also examines, through six graphs, how the sale price varies with the existing
as well as the derived features, for a concise and easy-to-understand prediction.
We seek answers to different questions about the capabilities of the data set and,
through a scatter plot, compare the actual price with the predicted price and
arrive at the price range that sees the highest number of sales.
Introduction
Buying a house is not a seasonal activity but a regular one, so describing homes
and how their prices vary is of great interest and importance. Using the linear
regression model in Python, we analyze the pricing patterns, identify the
features affecting the price of a house, and predict the prices of houses in a
city.
The objective of this project is to identify the most important variables and to
define the best regression model for predicting housing prices in Ames, Iowa. The
data set used for the project describes 1500 residential property sales
in Ames, Iowa between 2006 and 2012. It contains 26 explanatory variables
describing every aspect of the home. Continuous variables measure the various
area dimensions, such as the size of the living area and the basement, while
discrete variables quantify the number of rooms, baths, kitchens, parking spots, etc.
Nominal variables typically describe the various types or classes of dwellings,
materials and locations, such as the name of the neighborhood, the garage type,
the sale type, etc. Ordinal variables typically rate the quality and condition of
different house parts and utilities. The fact that the data set was
over-parameterized and heterogeneous increased the difficulty of the analysis.
Literature Survey
This section focuses on the most popular and relevant methods used for
predicting housing prices. Much research has been done on predicting housing
prices in different cities, considering different attributes for each city.
Methods such as Linear Regression, Random Forest, SVM and other machine learning
algorithms are used to predict house prices.
One well-known research paper was written by An Nguyen. This paper
explores the question of how house prices in five different counties are affected
by housing characteristics, both internal (such as the number of bathrooms and
bedrooms) and external (such as public schools’ scores or the walkability
score of the neighborhood). This paper also identifies the four most important
attributes in housing price prediction across the counties: assessment,
comparable houses’ sold price, listed price and number of bathrooms.
The machine learning algorithms used in this paper are Random Forest and
Support Vector Machine (SVM), applied to predict the prices of houses listed on
Zillow, Trulia and Redfin.
Using a data set of 1,457 houses from 5 different counties scraped from Zillow,
Trulia and Redfin, this paper addresses the following questions:
1. Can the models proposed in this paper outperform or get close to Zillow’s
prediction score baseline?
2. Can the ratio of overestimated to underestimated houses be reduced?
3. What are the most important attributes that affect the sold price?
For Hunt (TX), SVM outperforms the baseline by 3.2%. Random Forest produces
prediction scores close to the baseline on the data sets from Cowlitz (WA) and
Montgomery (IL). Moreover, the results suggest that using a single set of 10
attributes for all counties does not change the models’ accuracy scores much
compared with using different sets of attributes for different counties.
1.1 Introduction:
Through housing price prediction, a user can predict the price of a house by
providing certain information about it, such as the number of bedrooms,
number of bathrooms, kitchen area, living area, parking lot and various other
attributes. Given this information, we analyse the data, perform data
engineering and select the relevant features to predict the price of the house.
3.2 Objective:
The objective of this project is to identify the most important variables and to
define the best regression model for predicting housing prices in Ames, Iowa. The
data set used for the project describes 1500 residential property sales
in Ames, Iowa between 2006 and 2012. It contains 26 explanatory variables
describing every aspect of the home. Continuous variables measure the various
area dimensions, such as the size of the living area and the basement, while
discrete variables quantify the number of rooms, baths, kitchens, parking spots, etc.
Nominal variables typically describe the various types or classes of dwellings,
materials and locations, such as the name of the neighborhood, the garage type,
the sale type, etc. Ordinal variables typically rate the quality and condition of
different house parts and utilities. The fact that the data set was
over-parameterized and heterogeneous increased the difficulty of the analysis.
3.3 Problem Statement:
Consider a real estate company that has a dataset containing the prices of
properties. It wants to use the data to optimise the sale prices of the properties
based on important features.
Essentially, the company wants to:
1. Identify the variables affecting house prices.
2. Design a linear model that quantitatively relates house prices to variables
or factors such as number of rooms, area, number of bathrooms, etc.
3. Know the accuracy of the model, i.e. how well these features can predict
house prices.
The technical tools used in making this project include the following:
System Design
System Testing
Test Cases and Test Results
Test ID | Test Case Title | Test Condition | System Behavior | Expected Result
Implementation
7.1.3 BATHROOMS
The dataset records several types of bathroom for each house: full bathrooms,
half bathrooms and basement bathrooms. We combined all of these into one column,
as they all count as bathrooms. Bathrooms are also something every buyer looks
for when buying a house, so this feature is very helpful in predicting house prices.
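The combination described above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the column names (FullBath, HalfBath, BsmtFullBath, BsmtHalfBath) and the toy rows are assumptions, and the report does not say whether half bathrooms were weighted differently, so a plain sum is assumed here.

```python
import pandas as pd

# Toy rows standing in for the housing data; the column names are assumed.
houses = pd.DataFrame({
    "FullBath": [2, 1, 3],
    "HalfBath": [1, 0, 1],
    "BsmtFullBath": [1, 0, 0],
    "BsmtHalfBath": [0, 0, 1],
})

bath_cols = ["FullBath", "HalfBath", "BsmtFullBath", "BsmtHalfBath"]

# Assumption: every bathroom type counts equally, so the combined
# feature is a plain row-wise sum of the four columns.
houses["Bathrooms"] = houses[bath_cols].sum(axis=1)

# Drop the original columns once the combined feature exists.
houses = houses.drop(columns=bath_cols)
print(houses)
```

If half bathrooms should count for less than full ones, the sum could be replaced by a weighted sum; that choice would be a modelling decision outside what the report states.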
The Ground Living Area gives the area, in square feet, on which the house is built.
House price depends strongly on this feature: the larger the area, the higher the
price tends to be.
Sales price is the output label, as we have to predict the sale price of the houses
from the different attributes and features in the given dataset.
The Ames Housing Price data set, taken from Kaggle, has been used. We cleaned and
preprocessed the data by checking for NULL values and for correlation (>= 0.8)
between the columns. We also had to deal with the categorical variables, and the
attributes that were not significant in predicting the price were removed.
We can see that the data contains many null values, so we replace them. There are
also some constant and quasi-constant attributes, which we remove as they will not
be helpful in prediction.
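A minimal sketch of these two cleaning steps is shown below. The report does not state how null values were replaced or what cut-off defined a quasi-constant attribute, so the median/mode fill and the 0.95 threshold are assumptions, as are the column names and values.

```python
import numpy as np
import pandas as pd

# Toy data with gaps and one constant column; names and values are assumed.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, np.nan],
    "GarageType": ["Attchd", None, "Detchd", "Attchd", "Attchd"],
    "Utilities": ["AllPub"] * 5,
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Assumed strategy: fill numeric gaps with the column median and
# categorical gaps with the most frequent value (mode).
for col in df.columns:
    if df[col].isna().any():
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

# Treat a column as quasi-constant when a single value covers at least
# `threshold` of the rows (0.95 is an assumed cut-off).
threshold = 0.95
quasi_constant = [
    col for col in df.columns
    if df[col].value_counts(normalize=True).iloc[0] >= threshold
]
df = df.drop(columns=quasi_constant)
print(quasi_constant)
```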
We copy all the useful attributes into another data set in pandas (Python).
Next, we check for correlation between the attributes and deal with it.
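One common way to carry out such a check, sketched here under assumptions, is to compute the absolute correlation matrix of the features and drop one attribute from every pair at or above the 0.8 threshold mentioned earlier. The column names and values are illustrative, and dropping the later column of each pair is an assumed policy, not necessarily the one used in the project.

```python
import numpy as np
import pandas as pd

# Toy data; GarageArea is an exact multiple of GarageCars on purpose.
df = pd.DataFrame({
    "GrLivArea": [1000, 1200, 1400, 1600, 1800, 2000],
    "GarageCars": [1, 2, 3, 1, 2, 3],
    "GarageArea": [200, 400, 600, 200, 400, 600],
    "SalePrice": [150000, 180000, 230000, 200000, 240000, 290000],
})

# Correlation among the features only; the target is left out.
features = df.drop(columns="SalePrice")
corr = features.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Assumed policy: drop the later column of every pair with |r| >= 0.8.
to_drop = [col for col in upper.columns if (upper[col] >= 0.8).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```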
Here, from the correlation matrix, we can see that OverAllQual is highly correlated with Sales Price.
With this graph we can visualize the number of sales happening in a given price range.
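The counts behind such a graph can be sketched with a histogram over price bins. The sale prices and the 50,000-wide bins below are made-up values for illustration only.

```python
import numpy as np

# Hypothetical sale prices, not taken from the project's data.
sale_prices = np.array([120000, 135000, 148000, 152000, 160000,
                        175000, 189000, 210000, 260000, 340000])

# Count sales in 50,000-wide bins between 100,000 and 350,000.
bin_edges = np.arange(100000, 400000, 50000)
counts, edges = np.histogram(sale_prices, bins=bin_edges)

# The bin with the highest count is the price range with the most sales.
busiest = counts.argmax()
busiest_range = (edges[busiest], edges[busiest + 1])
print(counts, busiest_range)
```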
7.4.2 Plotting the graph of Ground Living Area vs Sales Price
With the help of this scatter plot we can visualize the outliers and the variation in price
with respect to the living area, which is measured in square feet.
We created a new feature called house age (Year Sold - Year Built) to see the
price trend of houses; this graph helps us see how the price varies as the
age of the house increases.
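The derived feature can be sketched in pandas as below. The column names (YrSold, YearBuilt), the toy rows and the age bands used for the summary are assumptions for illustration.

```python
import pandas as pd

# Toy rows; the column names are assumed.
sales = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915, 2000],
    "YrSold": [2008, 2007, 2006, 2008],
    "SalePrice": [208500, 181500, 140000, 250000],
})

# House age as defined in the text: year sold minus year built.
sales["HouseAge"] = sales["YrSold"] - sales["YearBuilt"]

# Average sale price per age band, the kind of summary a bar graph
# of house age vs. sale price would display (band edges are assumed).
age_bands = pd.cut(sales["HouseAge"], bins=[0, 10, 50, 100], include_lowest=True)
mean_price_by_age = sales.groupby(age_bands, observed=True)["SalePrice"].mean()
print(mean_price_by_age)
```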
We created a new feature from Building Type and Garage Cars to see the price trend of houses
depending on the type of building and the number of cars the garage can hold.
We encoded Building Type to convert it from a categorical variable to a numerical variable.
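The report does not say which encoding was applied, so the sketch below shows two common options in pandas: integer label codes and one-hot columns. The column name BldgType and its values are assumed for illustration.

```python
import pandas as pd

# Toy building types; the column name and categories are assumed.
homes = pd.DataFrame({"BldgType": ["1Fam", "Duplex", "TwnhsE", "1Fam", "Twnhs"]})

# Label encoding: each category gets an integer code, assigned in
# sorted category order.
homes["BldgTypeCode"] = homes["BldgType"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, often preferred
# for linear models when the categories have no natural order.
one_hot = pd.get_dummies(homes["BldgType"], prefix="BldgType")
print(homes)
print(one_hot.columns.tolist())
```

Label codes impose an arbitrary ordering on the categories, which a linear model will treat as meaningful; one-hot columns avoid that at the cost of more features.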
7.4.5 Plotting the graph of Bathroom vs Sales Price
The dataset records several types of bathroom for each house: full bathrooms,
half bathrooms and basement bathrooms. We combined all of these into one column,
as they all count as bathrooms. Bathrooms are also something every buyer looks
for when buying a house, so this feature is very helpful in predicting house prices.
7.5 APPLYING THE MODEL AND CHECKING ACCURACY
Using the Linear Regression model, we obtain an accuracy of approximately 76%.
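A minimal sketch of fitting and scoring such a model with scikit-learn is given below. It runs on synthetic data, so its score has no relation to the 76% reported here. It also assumes that "accuracy" refers to the coefficient of determination (R^2) returned by the model's score method, which the report does not confirm; the features, coefficients and split ratio are likewise assumed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data set (all values are made up).
rng = np.random.default_rng(0)
n = 200
living_area = rng.uniform(800, 3000, n)
bathrooms = rng.integers(1, 5, n)
house_age = rng.integers(0, 80, n)
noise = rng.normal(0, 20000, n)
price = 50000 + 90 * living_area + 12000 * bathrooms - 500 * house_age + noise

X = np.column_stack([living_area, bathrooms, house_age])

# Hold out a quarter of the rows to measure performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# score() returns R^2: the share of price variance explained by the model.
r2 = model.score(X_test, y_test)
predicted = model.predict(X_test)
print(f"R^2 on held-out data: {r2:.2f}")
```

Plotting `y_test` against `predicted` gives the actual vs. predicted scatter plot referred to elsewhere in this report.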
Chapter 8
Screenshots of Project
8.2 Feature Selection
Chapter 9
Conclusion and Future Scope
9.1 Conclusion
By analysing the pricing patterns and identifying the features affecting the
price of a house, we predict the prices of houses in that city.
A scatter plot of the actual price against the predicted price shows that
houses costing between $1000000 and $2000000 are predicted quite
accurately, and that houses costing between $1500000 and $200000 are
sold the most.
References
Kaggle.com
Wikipedia.com
Google.com
Towardsdatascience.com
Appendix-I
STUDENT'S CONTRIBUTION TO THE PROJECT
CONTRIBUTION
1. CONTRIBUTION TO THE PROJECT REPORT: Contributed to the report sections on
project planning and implementation, along with screenshots of the project.
2. CONTRIBUTION DURING IMPLEMENTATION: Derived features from the existing set of
features and plotted the bar graph of house age (in years) vs. sales price ($).
3. CONTRIBUTION FOR THE PROJECT DEMONSTRATION / PRESENTATION: Histogram, scatter
plot and bar graph of house age (in years) vs. sales price ($).
SIGNATURE OF STUDENT
SIGNATURE OF GUIDE
Appendix-II
STUDENT'S CONTRIBUTION TO THE PROJECT
CONTRIBUTION
4. CONTRIBUTION TO THE PROJECT REPORT: Contributed to system design and testing,
along with a section of screenshots of the project.
5. CONTRIBUTION DURING IMPLEMENTATION: Plotted the scatter plot of sale price
against living area and bar graphs of different features against sale price.
6. CONTRIBUTION FOR THE PROJECT DEMONSTRATION / PRESENTATION: Two bar graphs of
house type and number of bathrooms vs. sales price, along with a scatter plot
comparing predicted price and actual price.
SIGNATURE OF STUDENT
SIGNATURE OF GUIDE
Appendix-III
STUDENT'S CONTRIBUTION TO THE PROJECT
CONTRIBUTION
7. CONTRIBUTION TO THE PROJECT REPORT: Contributed to the introduction, the
software requirement specification and a section of screenshots of the project.
8. CONTRIBUTION DURING IMPLEMENTATION: Plotted the scatter plot comparing actual
and predicted price and the histogram of sale price against number of sales.
SIGNATURE OF STUDENT
SIGNATURE OF GUIDE
Appendix-IV
STUDENT'S CONTRIBUTION TO THE PROJECT
CONTRIBUTION
10. CONTRIBUTION TO THE PROJECT REPORT: Contributed to the conclusion and future
scope, along with screenshots of the project.
SIGNATURE OF STUDENT
SIGNATURE OF GUIDE
Appendix-V
STUDENT'S CONTRIBUTION TO THE PROJECT
CONTRIBUTION
13. CONTRIBUTION TO THE PROJECT REPORT: Contributed to data pre-processing, data
cleaning and a section of screenshots of the project.
15. CONTRIBUTION FOR THE PROJECT DEMONSTRATION / PRESENTATION: Tools used for the
project, along with the reasons, and the Linear Regression model in Python.
SIGNATURE OF STUDENT
SIGNATURE OF GUIDE