A Novel Method For Rainfall Prediction Using Machine Learning
ABSTRACT
Floods are catastrophic events that can cause casualties and the destruction of infrastructure. Uncertainty of rainfall also creates problems: both too little and too much rainfall are undesirable, and in either case water resource management is necessary. Rainfall prediction can therefore play an important role in WRM (Water Resource Management). After studying the literature, this work is carried out using data mining techniques and machine learning models. We propose a rainfall prediction model that integrates a clustering data mining technique with multiple regression in order to make efficient and accurate predictions. The proposed algorithm uses k-nearest neighbor regression, and we have also implemented k-medoid regression. The predicted data are then passed to a classifier, which generates a confusion matrix with two values, TPR (True Positive Rate) and FNR (False Negative Rate).
Keywords: WRM, TPR, FNR
I. Introduction

Forecasting is a procedure of estimating or predicting the future on the basis of past and present data. Forecasting provides information about impending future events and their consequences for the organization. It may not remove the difficulties and uncertainty of the future; nevertheless, it increases the confidence of management to make important decisions. Forecasting is the foundation of premising. Forecasting uses various statistical data; consequently, it is also called statistical analysis. The significance of forecasting involves the following points:
- Forecasting provides reliable and relevant information about present and past events and probable future events, which is essential for sound planning.
- It gives managers confidence in making important decisions.
- It is the basis for establishing planning premises.
- It keeps managers alert and active to face the challenges of future events and changes in the environment.

At the same time, forecasting has limitations:
- The analysis and collection of data about the present, past and future involve a lot of time and capital. Consequently, managers have to balance the cost of forecasting against its benefits; most small firms do not forecast because of the high cost.
- Forecasting can only approximate future events; it cannot guarantee that these events will take place. Long-term forecasts are less accurate than short-range forecasts.
- Prediction is based on certain assumptions. If these assumptions are wrong, the forecast will be incorrect. Forecasting depends on past events, but the past may not repeat itself at all times.
- Forecasting requires proper skill and judgment on the part of managers. Forecasts may go wrong due to poor judgment and skills of some managers; consequently, forecast data are subject to human error.
A forecast is merely a prediction about the future values of data. However, most extrapolative model forecasts assume that the past is a proxy for the future. There are many traditional models for forecasting: exponential smoothing, regression, time series, and composite model forecasts, often involving expert forecasts. Regression analysis is a statistical technique for analyzing quantitative data in order to estimate model parameters and make forecasts.
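As a minimal illustration of one of these traditional models, the short Python sketch below applies simple exponential smoothing, s_t = alpha*x_t + (1 - alpha)*s_{t-1}, to a small hypothetical rainfall series (the numbers are invented for illustration and are not data from this study):

# Simple exponential smoothing: s_t = alpha*x_t + (1 - alpha)*s_{t-1}.
# Illustrative sketch only; the rainfall values below are hypothetical.
def exponential_smoothing(series, alpha=0.3):
    """Return the smoothed series; its last value is the one-step-ahead forecast."""
    smoothed = [series[0]]                      # initialize with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

rainfall_mm = [12.0, 30.5, 18.2, 45.0, 27.3, 33.1]        # hypothetical monthly rainfall
forecast = exponential_smoothing(rainfall_mm, alpha=0.3)[-1]
print("One-step-ahead forecast:", round(forecast, 2))

A smaller alpha gives more weight to the history, while alpha close to 1 tracks the most recent observation.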
Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or “predictors”). On a plot, the horizontal line is called the X-axis and the vertical line the Y-axis; regression analysis looks for a relationship between the X variable (sometimes called the “independent” or “explanatory” variable) and the Y variable (the “dependent” variable).
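A minimal sketch of this idea in Python, fitting Y from X by ordinary least squares with numpy.polyfit on hypothetical humidity (X) and rainfall (Y) values, is given below:

# Ordinary least squares fit of a dependent variable (rainfall) on an
# independent variable (humidity); the numbers are hypothetical.
import numpy as np

humidity = np.array([60.0, 65.0, 70.0, 75.0, 80.0, 85.0])   # X (independent)
rainfall = np.array([5.0, 9.0, 14.0, 21.0, 25.0, 32.0])     # Y (dependent)

slope, intercept = np.polyfit(humidity, rainfall, deg=1)    # Y = slope*X + intercept
prediction = slope * 72.0 + intercept                       # forecast for a new X value
print(f"rainfall = {slope:.2f}*humidity {intercept:+.2f}; predicted rainfall at 72% humidity: {prediction:.1f} mm")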
II. Literature Survey

Andrew Kusiak et al. observed that rainfall affects local water quantity and quality. A data-mining approach is applied to predict rainfall in a watershed basin at Oxford, Iowa, based on radar reflectivity and tipping-bucket (TB) data. Five data-mining algorithms, namely neural network, random forest, classification and regression tree, support vector machine, and k-nearest neighbor, are employed to build prediction models, and the algorithm offering the highest accuracy is selected for further study. Model I is the baseline model constructed from radar data covering Oxford. Model II predicts rainfall from radar and TB data collected at Oxford. Model III is constructed from the radar and TB data collected at South Amana (16 km west of Oxford) and Iowa City (25 km east of Oxford). The computation results indicate that the three models offer similar accuracy when predicting rainfall at the current time, while Model II performs better than the other two models when predicting rainfall at future time horizons [IEEE 2013].
In our proposed algorithm, k-NN classification and regression are integrated to overcome the bottleneck of the existing k-NN algorithm. Further, we have also compared the performance of the earlier prediction algorithm with our proposed algorithm.

Fig. 2 Proposed Scheme Layout

Proposed Algorithm

Regression(featTrain, classTrain, featTest, classTest, featName, classifier)
/* featTrain - a NUMERIC matrix of training features (N x M)
   classTrain - a NUMERIC vector of the values of the dependent variable of the training data (N x 1)
   featTest - a NUMERIC matrix of testing features (Nts x M)
   classTest - a NUMERIC vector of the values of the dependent variable of the testing data (Nts x 1)
   featName - a CELL vector of strings representing the label of each feature (1 x M cell) */
// classifier as KNN regression
NNBestFeat = floor(Datapoints()/10)                        // number of nearest neighbors
trainModel = KNN regression model
NNSearch = initialize the search function for KNN regression as linear search
distFunc = Euclidean distance (or similarity) function     // distance measure for NNSearch
trainModel.setNearestNeighbourSearchAlgorithm(NNSearch)
trainModel.setKNN(NNBestFeat)
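For illustration only, the pseudocode above can be read as the following rough Python / scikit-learn sketch; this is an assumed equivalent rather than the authors' implementation, and the random matrices merely stand in for featTrain, classTrain and featTest:

# Rough scikit-learn analogue of the Regression(...) pseudocode (an assumption,
# not the authors' code): k = floor(N/10) neighbors, brute-force (linear) search,
# Euclidean distance. Random matrices stand in for featTrain / classTrain / featTest.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
featTrain = rng.random((100, 4))                      # N x M training features
classTrain = featTrain @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(0, 0.1, 100)
featTest = rng.random((10, 4))                        # Nts x M testing features

NNBestFeat = max(1, featTrain.shape[0] // 10)         # floor(Datapoints()/10)
trainModel = KNeighborsRegressor(n_neighbors=NNBestFeat,
                                 algorithm="brute",   # linear search
                                 metric="euclidean")  # Euclidean distance function
trainModel.fit(featTrain, classTrain)
print(trainModel.predict(featTest))                   # predicted values for the test set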
Algorithm: K-Medoid

1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
2. Associate each data point with the closest medoid.
3. While the cost of the configuration decreases:
   1. For each medoid m and for each non-medoid data point o:
      1. Swap m and o, and recompute the cost (the sum of the distances of the points to their medoid).
      2. If the total cost of the configuration increased in the previous step, undo the swap.
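A compact NumPy sketch of these swap-based steps is given below; it is illustrative only (randomly generated two-dimensional points stand in for rainfall records) and is not the implementation used in this work:

# Minimal k-medoid (PAM-style swap) sketch following the steps above.
# Illustrative only: random 2-D points stand in for rainfall records.
import numpy as np

def k_medoids(points, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)  # pairwise distances

    def cost(medoids):
        # Sum of each point's distance to its closest medoid.
        return dist[:, medoids].min(axis=1).sum()

    medoids = list(rng.choice(n, size=k, replace=False))   # step 1: random initial medoids
    best = cost(medoids)
    improved = True
    while improved:                                         # step 3: loop while the cost decreases
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o                            # swap medoid i with non-medoid o
                c = cost(candidate)
                if c < best:                                # keep the swap only if the cost drops
                    medoids, best = candidate, c
                    improved = True
    labels = dist[:, medoids].argmin(axis=1)                # step 2: assign points to closest medoid
    return medoids, labels

points = np.random.default_rng(1).random((60, 2))
medoid_idx, labels = k_medoids(points, k=3)
print("medoid indices:", medoid_idx)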
K-Medoid Clustering