Cryptocurrency Prediction
Cryptocurrency Prediction
1 Introduction
Human history is filled with prediction attempts and with prophets [Nelson, 2000],
but the ability to anticipate the future events of life evolution, especially in financial
economics [Gifu and Cristea, 2012], is a complex task that can be solved using
artificial intelligence (AI) techniques. Forecasting the market, the purpose of this
paper remains an important challenge with a lot of implications for safety economy.
Economists developed, through many studies and observations on graphs and data,
diverse techniques to get an idea of how the prices will rise or fall (martingales)
[Feller, 1971]. The accuracy of data retrieval depends on the understanding of the
rational, emotional and physiological limiting factors towards the stock market. There
are also various variables to consider as: the causal impact of media, investor’s
behavior, time-varying local economic conditions, history of prices, natural
calamities, etc. They can systematically affect the fluctuation of the market. Some
specialists in financial economics and big data consider that trying to predict and
invest in a random manner can direct to same results. This assumption is known as the
EMH (Efficient Market Hypothesis)1 or the RWH (Random Walk Hypothesis) [Fama,
1
Starting with Eugene Fama’s PhD (1965), EMH becoming one of the most known theories in
financial economics that confirm the connection between prices and public information.
1965; Jonathan Clarke et al., 2001; Zunino et al., 2012; Bariviera et al., 2014; Marwala,
2015; Marwala and Hurwitz, 2017].
In this study, we have chosen the cryptocurrency market. Why? Because, Bitcoin
(BTC), Ethereum (ETH), Ripple (XRP) and the rest of the cryptocoins and tokens
represent the future of money. Moreover, in the last year, the market capitalization of
cryptocurrency has grown from 16 billion of dollars (10 January 2017) to an
outstanding peak of 829 billion of dollars (7 January 2018)2. Because of the high
trading volume, a lot of data (graphs and text) is gained everyday 3 and is free for the
entire public to be used by every possible mean. The best approach is proving to be a
combination of mathematical models with psychological models because it has a
projected growth rate on prediction.
The paper is organized as follows: section 2 shortly reviews the relevant literature
on various approaches about the huge role of predicting the cryptocurrency in human
life; section 3 presents materials (data set) and methods based on machine learning
(ML) algorithms; the results are described in section 4; section 5 highlights a short
discussion about this survey, and, finally, our conclusions are given in section 6.
2 Related Work
A collection of AI algorithms has been used to predict the fluctuation of a
cryptocurrency. Some of them were developed through observation, like mean
reversion, based on the assumption that the stocks will reach the average value in time
[Spierdijk and Bikker, 2012; Dai et al., 2013]. However, the problem still remains:
you cannot really predict when and how long the prices will stay on the average. We
know that stock prices fluctuate in a random manner [Granger, 1992]. It is the reason
why stock investors use martingales to predict the future trend.
To forecast a linear graph, we have discovered the fact that linear regression could
be intuitively used, being among the first attempts in predicting the course of a
practical linear model [Xin, 2009]. This algorithm, one of the most well known
algorithms in machine learning, was used in solving the prediction problem due to the
nature of the data set used in the stock market (graph, CSV) [Nunno, 2017].
Another (supervised) machine learning algorithm used in reducing the risk of
investing in the stock market is Random Forests. This model constructs a number of
decision trees and gives as an output the class or mean prediction of the trees. The
method has been proven effective in trading, but on a long-term period [Khaidem et
al., 2016] and on the stock market. It is known the fact the stock market is a more
stable market than the crypto one.
For the overreaction-driven price momentum mechanism [Daniel et al., 1998; Hou
et al., 2009], media is reflected on the investors’ decision behavior [Barber & Odean,
2008].
In general, sentiment analysis (SA) as a NLP (Natural Language Processing)
approach is a popular predicting algorithm. Hilbert considers that the negative news
increase the probability to lower the prices of stocks. On the other hand, the positive
news could reflect a rise of the value of stocks [Hilbert et al., 2014].
2
cryptolization.com
3
coinmarketcap.com
In our work, the linear regression, random forest and random forest aided by
sentiment analysis algorithms formed the basis for the prediction models of the
cryptocurrency market described below.
In this section, we describe the data set and methods used to predict cryptocurrency
market (here 3 cryptocoins, called: Ethereum, Bitcoin and Ripple), on which we will
build our prediction models.
For these experiments, we preferred to choose cryptocoins that are found in the top
5 of the cryptocurrency market, because these coins are considered mature, meaning
that there’s a lot of trading activity every day (the data is diversified). The trading
volume and, also the notoriety of the cryptocoins is direct proportional with the place
held in the top. Moreover, the media closely follows how they fluctuate over time.
The titles of articles have been analyzed by us using sentiment analysis classifiers
from the NLTK python library. The algorithm will gather data on a period of 3
months, from 15 October 2017 to 05 January 2018. The reason for this short period of
time is that the cryptomarket is still evolving and in the last 3 months in media spiked
a huge flow of articles regarding cryptocurrencies.
3.1 Data Set
We test our system on two data collections: the closing price chart and the news heads
table. Their differences permit us to improve the performance of our approach in
types of cryptocurrency’s fluctuations. This is also a great start for entrepreneurs,
programmers and researchers to find better solutions on predictions, because using a
site to get the news titles helps in avoiding hazard in data. Data could be scraped from
other sites like Twitter, Reddit forums, YouTube and other media from where you can
gather various information. However, the problem is finding the best source and
cleaning the data set (the English used in forums and social media isn't all the time
accurate and can have a lot of mistakes). Thus, for the moment, we will get our data
from the best news site on crypto.
The text data set (Table 1 contains only a sample of data) is used by the NLP
component of the prediction algorithm and is formed of article titles from
cryptocurrency news site4. Note that the articles, directed towards the coin in order to
be predicted, will not appear daily. In this case, we approximate the missing scores
using our formula (3), tailored for our needs. In addition, there is a possibility that
many articles about the coin would appear in a single day.
… … …
1/01/2018 What Will the Bitcoin Price Be in 2017? 8
Total 4473
Average 8.89
4
www.coindesk.com
Note that all text data set contains 420 (Bitcoin), 56 (Ethereum), and 22 ( Ripple)
articles.
The price chart (see Figure 2) is a table that consists of eight columns, but for this
survey, we keep the date and the closing price of the cryptocurrency. The table has an
entry for each day, on a period of approx. 3 months and the data will be fetched by a
web crawler. Figure 3 presents a good image about the volatility of the
cryptocurrency market.
Fig. 2. The prices distributions of the 3 most mature coins per time
Fig. 3. Price trends of Bitcoin, Ethereum, and Ripple over the past 3 months (October 15,
2017–January 05, 2018). Data source: coinmarketcap.com
3.2 Phases Methodology
The work flow (Fig. 1) starts with the web crawlers that gather data from specific
sites to fill the tables. There are two crawlers: one that scrapes a site where the closing
prices of the cryptocoins are held online in a table form5, and the other crawler gets
the head of the news articles from a site specialized on cryptocurrency news 6 .
The next step is to pass the news titles through the SentimentIntensityAnalyzer
module from the NLTK library7 in Python; this will be the sentiment analysis
component. The title will be automatically annotated to be passed through a polarity
detection algorithm regarding the feeling that is deducted. The process will give
scores for each piece of text, consisting in two types of measures: positive and
negative feeling score. The final score will be calculated by the formula:
The sum (1) is used to solve the multiple articles problem, to answer at the
question: what sentiment is dominant? Note that, in some days, media wrote more
articles about those three cryptocoins. Averaging the sum of all positive scores minus
the sum of the negative ones will result the overall sentiment on that day.
We have three distinct situations (Table 2): 1) periods of time without data (e.g.
between 2017-10-18 and 2017-10-26), 2) multiple articles in the same day (e.g. 2017-
10-16), and 3) neutral articles (when the SA score is equal with 0). Because articles
regarding a crypto coin do not appear every day, gaps in the table will occur.
Goel from Stanford University [Mittal and Goel, 2012] filled these blank spaces
with the formula (2):
gap = (x + y)/2 (2)
5
https://coinmarketcap.com
6
https://www.coindesk.com/
7
http://www.nltk.org/
where x and y are table boxes filled with values, and between x and y are n number of
blank spaces (period of days when nothing was published towards the analyzed
cryptocoins).
Moreover, machine learning could be used to predict the values for the blank
spaces. However, the machine learning algorithm won’t behave well because of the
anomalies in the graph (unexpected rise and fall in the graph, specific for the
cryptomarket). Our approach is a modification to Goel’s formula:
This alteration changes the behavior of the graphic, reflected in formula (3),
having a more natural evolution from x to y. The algorithm responds better on our
needs. There are three cases on how the graphic would behave (we will take n = 5)
(Fig. 4, 5 and 6):
The formula (3), represented by the blue line, has a more natural evolution,
because it falls and it also rises to reach the y value, despite Goel’s equation, where it
only falls.
Fig. 6. Case 3: x = y
We can observe as well that there is a movement in the line even though x is equal
with y. In Table 3, we used the formula (3), in order to complete the scores where we
have no data.
Our method is based on three algorithms, using the scikit-learn Python library8:
linear regression, random forests and random forests aided by sentiment analysis. In
this way, we can compare the precision, the recall and the f-measure obtained with
each of them, in order to analyze the potency of the NLP component.
8
http://scikit-learn.org
3.2.1. Linear Regression Model
y = b0 + b 1 × x (4)
where y is a dependent value (the price we want to predict), b0 is called intercept, x the
independent variable (the regressor, the price we know) and b1 is the slope of the line.
In this case, the algorithm will be trained on both tables (Table 2 and 3). According
with two features (price and sentiment score) (Fig. 7)the decision trees are expected to
be more ample, because of the new feature, and they will also use the same splitting
scheme (first 90% for training and last 10% for testing).
The graphs above represent the output of the Random Forest, aided by Sentiment
Analysis module. The orange line is the price evolution and the blue one represents
the prediction of prices. The first chart is a prediction for the Ethereum data set, and
the second graph was made on the Ripple data.
4 Results
From the entire set of data presented above, we looked at Precision, Recall and F-
measure performance measures to evaluate our model for the prediction problem
proposed in this paper (Table 4).
The results are promising. Still, we observe that through SA we gain better results,
understanding that the feeling over media really influence the flow of the market.
Furthermore, Random Forest works better using two features, not only one. The
Linear Regression model had the worst performance on prediction, using this kind of
data.
5 Discussion
This section describes the study results by running a series of three algorithms
(Random Forests, Linear Regression, and SA with Random Forests Development). Our
aim in these experiments was to compare three known techniques. In addition, our
final goal was to define a new formula (3), by adding new missing data, in order to
predict the fluctuation of a cryptocurrency. For this purpose, we compare the
Precision and Recall measures.
As we observed the graphs of prediction, the nodes do not match the price very
accurate. However, there’s no need in matching exactly, more important is to predict
the rise and fall of the price in the right time. The results showed that in the Ethereum
prediction chart, a down spike occurred in the both lines at 12th January 2018. This
reveals that is a great time to invest, and that the day before is perfect to sell the
cryptocoins for profit. The behavior of the Ethereum prediction went well because the
text data set is more than double then Ripple’s media coverage. These points out the
influence of the media on the investor’s opinion on buying and selling on the
cryptomarket. Only one problem we face: the lack of trusted media to train our
algorithms.
6 Conclusions
This paper describes how machine learning algorithms and sentiment analysis of
the afferent media are affecting the prediction process of cryptocurrency market. This
method could become very helpful for entrepreneurs and people who would like to
invest in such market and need a promising start.
Trying to predict a young market is not an easy task. The shortage of data is a big
challenge. There are not many sites writing about cryptocurrency, making difficult the
process of trusting. On the other hand, the multitude of unrefined sources (blogs,
forums, twitter) gives us a hard time to identify feelings in comments; various
heuristics need to be written in the code. Different sites may write about the same
story, but can forge different feelings (humans tend to be subjective). The Efficient
Market Hypothesis (EMH) seems to work here in some points of time, cryptocoins
jumping on new levels of prices, becoming more and more valuable in a short time.
This is the reason why the data is scraped only of the last three months and why
Random Forest is used as the main algorithm (randomness seems to work).
Until the market will get in a mature phase and media will be more and more
involved in the cryptomarket, the best approach in solving the gap problem is to
approximate the sentiment score, using a formula. Therefore the algorithm will better
tuned, even though, big spikes (low or high) in the graph (stock or cryptocoins) still
remain a hard problem to solve.
Acknowledgments: This survey was published with the support of the grant of the
Romanian National Authority for Scientific Research and Innovation, CNCS/CCCDI
– UEFISCDI, project number PN-III-P2-2.1-BG-2016-0390 and by a grant of the
Romanian Ministry of Research and Innovation CCCDI-UEFISCDI, project number
PN-III-P1-1.2-PCCDI-2017-0818 within PNCDI III.
References
1. Barber, B. M., and Odean, T.: All that glitters: The effect of attention and news on the
buying behavior of individual and institutional investors. In: Review of Financial Studies,
21, 785–818 (2008).
2. Bariviera, A. F., Zunino, L. Guercio, M. B., Martinez, L. B., and Rosso, O. A.: Revisiting
the European sovereign bonds with a permutation-information- theory approach. Eur. Phys.
J. B. 86: 509. doi:10.1140/epjb/e2013-40660-7 (2014).
3. Clarke, J., Jandik, T., and Mandelker, G.: The efficient markets hypothesis. In: Robert C.
ARFFA, ed. Expert Financial Planning: Investment Strategies from Industry Leaders. New
York: Wiley, Chapter 9, pp. 126-141 (2001).
4. Dai, Y., and Zhang, Y.: Machine Learning in Stock Price Trend Forecasting. Stanford
University (2013).
5. Daniel, K., Hirshleifer, D., and Subrahmanyam, A.: Investor psychology and security
market under- and overreactions. In: Journal of Finance, 53, 1839–1885 (1998).
6. Fama, E.: The Behavior of Stock Market Prices. Journal of Business. 38: 34–105.
doi:10.1086/294743 (1965).
7. Feller, W.: Martingales. In: An Introduction to Probability Theory and Its Applications,
Vol. 2, New York: Wiley, pp. 210-215 (1971).
8. Gîfu, D. and Cristea, D.: Public discourse semantics. A method of anticipating economic
crisis presented at the Exploratory Workshop on Intelligent Decision Support Systems for
Crisis Management, 8-12 May 2012, Oradea, Romania. In: International Journal of
Computers, Communications and Control, see, I. Dzitac, F.G. Filip, M.-J. Manolescu
(eds.), vol. 7/5, Agora University Editing House, pp. 829-836 (2012).
9. Granger, C. W. J.: Forecasting stock market prices: Lessons for forecasters. In:
International Journal of Forecasting 8, North-Holland, 3-13 (1992).
10. Hilbert, A., Jacobs, H., and Müller, S.: Media Makes Momentum. Review of Financial
Studies, 27(12), 3467–3501 (2014).
11. Hou, K., Peng, L., and Xiong, W.: A Tale of Two Anomalies: The Implications of Investor
Attention for Price and Earnings Momentum, SSRN 976394 (2009).
12. Khaidem, L., Saha, S., and Dey, S. R.: Predicting the direction of stock market prices using
random forest. In: Applied Mathematical Finance, 1-20 (2016).
13. Marwala, T.: Impact of Artificial Intelligence on Economic Theory – via arXiv.org (2015).
14. Marwala, T., and Hurwitz, E.: Artificial Intelligence and Economic Theory: Skynet in the
Market. London: Springer (2017).
15. Mittal, A. and Goel, A.: Stock prediction using twitter sentiment analysis. Stanford
University, CS229 (2012).
16. Nelson, R.: Prophecy: A History of the Future - The Rex Research Civilization Kit,
http://www.rexresearch.com/prophist/phfcon.htm (2000).
17. Nunno, L.: Stock Market Price Prediction Using Linear and Polynomial Regression
Models (2017).
18. Spierdijk, L., and Bikker, J. A.: Mean Reversion in Stock Prices: Implications for Long-
Term Investors (2012).
19. Zunino, L., Bariviera, A. F., Guercio, M. B., Martinez, L. B., and Rosso, O. A.: On the
efficiency of sovereign bond markets. Phys. A Stat. Mech. Appl., 391: 4342–4349.
doi:10.1016/j.physa.2012.04.009 (2012).
20. Xin, Y.: Linear Regression Analysis: Theory and Computing (2009).