Data Modeling For Big Data: ISSN 2085-4579
II. BIG DATA CHARACTERISTIC

A. The Definition of Big Data

Most definitions of big data focus on the size of data in storage. Size matters, but there are other important attributes of big data, namely data variety and data velocity. The three Vs of big data (volume, variety, and velocity) constitute a comprehensive definition, and they bust the myth that big data is only about data volume. In addition, each of the three Vs has its own ramifications for analytics [5]. The three Vs are illustrated in Figure 1.

Data volume is the most obvious attribute of big data. With that in mind, most people define big data in terabytes, sometimes petabytes. Big data can also be described by its velocity or speed, which may be thought of as the frequency of data generation or the frequency of data delivery.

Fig 1. The 3 Vs of Big Data

B. Big Data Characteristic

Big Data is not only about size, but also about the data stream and the data type. It is therefore important to define the characteristics of Big Data. The defined characteristics will be used to measure how well each database management system can tackle the Big Data challenge.

The characteristics are defined below.

1. Volume

According to the 2011 IDC Digital Universe Study, “In 2011, the amount of information created and replicated will surpass 1.8 zettabytes, growing by a factor of 9 in just five years [6].” The scale of this growth surpasses the reasonable capacity of traditional relational database management systems, or even of typical hardware configurations supporting file-based data access. The rapid acceleration of data growth also increases the data volumes pushed into the network. This is why Big Data can be described by its volume, or size of data [3].

2. Velocity

Big Data can also be described by its velocity or speed. There are two aspects to velocity: one represents the throughput of data and the other represents latency. Throughput is the rate at which data moves through the pipes. Latency is the time interval between a stimulus (a request or a data recall) and the response [7].

3. Complexity/Variety

Data warehouse technology has been adopted rapidly. Its purpose is to create meta-models that represent all the data in one standard format. The data is compiled from a variety of sources and transformed using ETL (Extract, Transform, Load) or ELT (Extract the data and Load it into the warehouse, then Transform it inside the warehouse). The basic premise was narrow variety and structured content. Big Data has significantly expanded our horizons, enabled by new data integration and analytics technologies. A number of call center analytics solutions seek to analyze call center conversations and their correlation with emails, trouble tickets, and social media blogs. The source data includes unstructured text, sound, and video in addition to structured data. A number of applications gather data from emails, documents, or blogs.

4. Veracity

Most Big Data comes from sources outside our control and therefore suffers from significant correctness or accuracy problems. Veracity represents both the credibility of the data source and the suitability of the data for the target audience [7].

For example, if a company wants to collect product information from third parties and offer it to its contact center employees to support customer queries, the data would have to be screened for source accuracy and credibility. Otherwise, the contact centers could end up recommending competitive offers that might marginalize the company's offerings and reduce revenue opportunities. The data must likewise be screened for its suitability for the user or audience.

5. Reliability

Reliability in big data concerns the accuracy and completeness of computer-processed data, given the uses they are intended for. Thus, in the Big Data challenge, when there is a lot of data
that must be processed in some way, the expected output is the one closest to the intention.

6. Extensibility

Extensibility is a system design principle where the implementation takes future growth into consideration. Because of the rapid growth of data, Big Data will bring new challenges to overcome. Therefore, to accomplish both the current and the future goals of Big Data, the system must consider what is going to happen in the future.

7. Interoperability

The data available in the cloud or in a Big Data environment is going to be used together, interchanged, and interpreted. For a system to be interoperable, it must therefore be able to exchange data and subsequently present that data such that it can be understood by a user [10]. In the Big Data area, it is essential to take a global approach to interoperability and discoverability of information.

8. Scalability

Big Data can be considered a tsunami of information that has been growing steadily as a result of the expanding digital world. Nowadays, every single movement or activity of people is captured and transformed into digital data. In the end, Big Data is going to keep getting bigger, and more organizations are going to be looking to find out more about what to do with it [9].

more difficult or complex data structure and schema, the user must use very complex queries. It is therefore necessary to leverage new programming language functionality to implement an object-relational mapping pattern. These programming environments allow developers to benefit from the robustness of DBMS technologies without the burden of writing complex SQL [8].

11. Fault tolerance

Managing large-scale data requires attention to performance. One aspect of performance is handling the faults that occur during the execution of a computation; for example, the system has to deal with disk failures. A fault handling scheme is therefore needed. If a unit of work fails, the system must automatically restart the task on an alternate node, so as not to waste time by restarting the entire query [8].

III. APPROACH OF BIG DATA ANALYSIS

There are two kinds of approach to big data analysis: MapReduce and the parallel database management system.

A. MapReduce
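
As a hedged illustration of the MapReduce approach named above, the following single-machine Python sketch counts words in three phases (map, shuffle, reduce). It is not taken from this paper or from any particular framework: the function names map_phase, shuffle_phase, reduce_phase, and word_count, as well as the sample input, are assumptions made for illustration, and a real system would distribute the map and reduce tasks across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents: List[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence (hypothetical driver)."""
    emitted = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(emitted))


if __name__ == "__main__":
    # Each string stands in for one input split (illustrative data only).
    splits = ["big data is big", "data velocity and data volume"]
    print(word_count(splits))
```

Because each map call sees only its own input split and each reduce call sees only the values of one key, both phases can in principle run in parallel on separate nodes, which is what makes the pattern attractive for large data volumes.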
IV. DATA MODELING FOR BIG DATA

A database model is a theory or specification describing how a database is structured and used. Several such models have been suggested, such as hierarchical, network, relational, and non-relational [20]. Nowadays, relational databases are the dominant persistent storage technology. The relational database model has been dominant since the 1980s [16], with implementations such as Oracle databases [17], MySQL [18], and Microsoft SQL Server [19].

A. Relational Database Model

A relational database is a collection of data items organized in formally described tables from which data can be accessed or reassembled in many different ways. It is a set of tables, referred to as relations, with data categories described in columns, similar to spreadsheets. Each row contains a unique instance of data for the corresponding data category. When a relational database is created, the domain of possible values, along with constraints, is applied to the data. It is the relation between the tables that makes it a ‘relational’ table. Relational databases require few assumptions about how data will be extracted from the database. As a result, the same database can be viewed in many different ways. Almost all relational databases use Structured Query Language (SQL) to access and modify the data stored in the database. SQL was originally based upon relational calculus and relational algebra, and it is subdivided into elements such as clauses, predicates, queries, and statements [21].

The advantages of the Relational Database Model are as follows.

• The data in the relational database model are mostly stored in the database, not in the application.

dominant persistent storage technology. It has many shortcomings which can hinder performance levels. As more and more applications are launched in environments that have massive workloads, such as cloud and web services, their scalability requirements change very quickly and also grow very large. Such requirements are difficult to manage with a relational database sitting on a single in-house server.

To address these issues, vendors can optimize for non-relational database models. Non-relational databases have a schema-free architecture and can manage highly unstructured data. They can be easily deployed to multi-core or multi-server clusters, supporting modularization, scalability, and incremental replication. Being extremely scalable, non-relational databases offer high availability and reliability even while running on hardware that is typically prone to failure, thereby challenging relational databases, where consistency, data integrity, uptime, and performance are of prime importance [20,21,33].

The non-relational database model is unlike the relational database model: it does not guarantee the ACID properties [32]. Non-relational databases may be classified primarily by the way they organize data, as follows.

1. Key Value Store

A key-value store allows us to store schema-less data. The data consists of a key, represented by a string, and the actual data, which is the value in the key-value pair. The value can be any primitive of the programming language, such as a string, an integer, or an array, or it can be an object. This loosens the requirement of formatted data for storage, eliminating the need for a fixed data model [21].
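
The key-value store described above can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the API of any product compared in this paper: the class name KeyValueStore, its put, get, and delete methods, and the sample keys are all hypothetical.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store: string keys, schema-less values."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Store any value (string, integer, array, or object) under a key."""
        if not isinstance(key, str):
            raise TypeError("keys are strings in this sketch")
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        """Return the value for a key, or the default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key: str) -> bool:
        """Remove a key; report whether it was present."""
        if key in self._data:
            del self._data[key]
            return True
        return False


if __name__ == "__main__":
    store = KeyValueStore()
    # No table definition or fixed schema is declared before writing.
    store.put("user:42:name", "Ayu")
    store.put("user:42:visits", 17)
    store.put("user:42:tags", ["admin", "beta"])
    store.put("user:42:profile", {"city": "Jakarta"})
    print(store.get("user:42:tags"))
```

The values stored under the four keys have four different types, which is the sense in which the model eliminates the need for a fixed data model; the trade-off is that the store itself cannot validate or query the structure of the values.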
object-relational, does not have this attribute. Further evidence is that the relational databases have no MapReduce attribute that can accelerate the computation process, unlike the non-relational databases, almost all of which have this attribute.
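Table 2. Comparison of some relational and non-relational databases on a few basic attributes: scalability, variety, velocity, veracity, and volume.
| Attribute | Characteristic | Relational Database | Relational Database | Non-Relational Database | Non-Relational Database | Non-Relational Database | Non-Relational Database | Non-Relational Database | Non-Relational Database | Non-Relational Database | Non-Relational Database |
| | | Relational | Object Relational | Document-Stored | Document-Stored | Wide-Column Store | Wide-Column Store | Key-Value Stored | Key-Value Stored | Graph-Oriented | Graph-Oriented |
| | Database Name | Oracle | PostgreSQL | CouchDB | MongoDB | HBase | Cassandra | Oracle Coherence | Redis | Neo4j | Titan |
| | Database Model | Relational | Object-Relational | Document-Stored | Document-Stored | Column-oriented | Column-oriented | Key-value | Key-value | Graph-Oriented | Graph-Oriented |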
| Scalability | Query language | SQL, HTTP, SPARQL, XQuery, XPath, API calls, Java API | SQL | JavaScript, REST, Erlang | API calls, JavaScript, REST | API calls, REST, XML, Thrift | API calls, CQL, Thrift | API calls, CohQL | API calls, Lua | API calls, REST, SPARQL, Cypher, TinkerPop, Gremlin | TinkerPop, Gremlin, REST, API calls |
| Scalability | Horizontal Scalable | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Replication Mode Master-Slave-Replica
Master-Slave Replication Master-Slave Master-Master Master-Slave-Replica Master-Slave Master-Master Master-Slave
Replication Master-Slave Replication Symmetric Replication
Multi-master replication Replication Replication Replication Replication Replication Replication
Master-Master Replication
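| Scalability | Sharding | No | Yes | No | Yes | Yes | Yes | Yes | No | No | Yes |
| Scalability | Shared Nothing Architecture | No | Yes | No | Yes | Yes | Yes | Yes | Yes | No | Yes |
| Variety | Data types | Binary | Binary | JSON | BSON | Binary | Binary | Binary | Binary | Binary | Binary |
| Variety | Graph Support | Yes | Yes | No | Yes | No | No | Yes | No | Yes | Yes |
| Velocity | Map and reduce | No | No | Yes | Yes | Yes | Yes | Yes | No | No | No |
| Velocity | Replication | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Velocity | TTL for entries | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | No | No |
| Velocity | Secondary Indexes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| Velocity | Composite keys | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Velocity | Geospatial Indexes | Yes | Yes | Yes | Yes | No | No | No | No | Yes | Yes |
| Velocity | Query Cache | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes |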
Veracity Data Storage Volatile memory Volatile memory File System Berkeley DB
ASM File System File System File System HDFS File System Volatile Memory
File System File System Volatile memory Cassandra Hadoop
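| Veracity | Conditional entry updates | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| Veracity | Isolation | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Veracity | Unicode | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Veracity | Compression | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes |
| Veracity | Atomicity | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Veracity | Consistency | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Veracity | Durability (data storage) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Veracity | Transactions | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | Yes |
| Veracity | Referential integrity | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| Veracity | Revision control | No | No | Yes | No | Yes | Yes | Yes | No | No | No |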
Locking model Optimistic Locking
Lock on Model MVCC No MVCC MVCC Explicit locking Lock Free Model Lock on write Distributed Locking
Lock on write
| Veracity | Full Text Search | Yes | Yes | No | Yes | No | No | Yes | No | Yes | Yes |
Integrity model ACID Eventual consistency
ACID ACID MVCC ACID BASE Log Replication BASE ACID ACID
Log Replication ACID
| Volume | Value size max. | 4 GB | 1 GB | 4 GB | 500000 GB | 2000 GB | 2 GB | 64000 GB | 0.5 GB | 4 GB | 64 GB |
J. Chen, and D. Chen. G-Hadoop: MapReduce Across Distributed Data Centers for Data Intensive Computing. ACM, New York, NY, USA, 2012, pp. 739-750.

[12] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and Parallel DBMS: Friends or Foes? 2010.

[13] S. Harizopoulos, D. Abadi, and P. Boncz. Column-Oriented Database System. 2009. Available: www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf

[14] D. J. DeWitt, J. Gray. Parallel Database Systems: The Future of High Performance Database Processing. June 1992.

[15] M. T. Ozsu, P.. Distributed and Parallel Database Systems. -. Available: www.cs.uoi.gr/~pitoura/courses/ddbs03/paper-to-translate.pdf

[16] A. B. M. Moniruzzaman, S. A. Hossain. NoSQL Database: New Era of Databases for Big Data Analytics - Classification, Characteristics and Comparison. International Journal of Database Theory and Application. 2013.

[17] Oracle. Oracle Databases. Accessed: 29th December 2013. Available: http://www.oracle.com/us/products/database/overview/index.html.

[18] MySQL. MySQL Database. Accessed: 29th December 2013. Available: http://www.mysql.com.

[19] Microsoft. Microsoft SQL Server Databases. Accessed: 29th December 2013. Available: http://www.microsoft.com/en-us/sqlserver/default.aspx.

[20] U. Bhat, S. Jadhav. Moving towards Non-Relational Databases. International Journal of Computer Applications, 2010.

[21] N. Jatana, S. Puri, M. Ahuja, I. Kathuria, and D. Gosain. A Survey and Comparison of Relational and Non-Relational Database. International Journal of Engineering Research & Technology, August 2012.

[22] R. Cattell. Scalable SQL and NoSQL Datastore. 2011.

[23] K. Orend. Analysis and Classification of NoSQL Databases and Evaluation of their Ability to Replace an Object-relational Persistence Layer. Master Thesis, Technical University of Munich, Munich, 2010.

[24] M. Levene and G. Loizou. A Graph-Based Data Model and its Ramifications. IEEE Transactions on Knowledge and Data Engineering (TKDE), 7(5):809–823, 1995.

[25] R. Angles, C. Gutierrez. Survey of Graph Database Models. Technical Report Number TR/DCC-2005-10, Computer Science Department, Universidad de Chile. 2005.

[26] S. Jouili, V. Vansteenberghe. An Empirical Comparison of Graph Databases.

[27] R. H. Güting. GraphDB: Modeling and Querying Graphs in Databases. In Proc. of 20th Int. Conf. on Very Large Data Bases (VLDB), pages 297–308. Morgan Kaufmann, September 1994.

[28] K. C. Kim, C. S. Kim. Parallel Processing of Sensor Network Data using Column-Oriented Databases. AASRI Conference on Parallel and Distributed Computing Systems, pp. 2-8. 2013.

[29] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O’Neil, P. O’Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A Column-oriented DBMS. Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005.

[30] D. J. Abadi, S. R. Madden, and N. Hachem. Column-Store vs Row-Store. SIGMOD’08, Vancouver, BC, Canada. June 2008.

[31] L. Wang, M. Kunze, J. Tao, G. von Laszewski. Towards building a cloud for scientific applications. Advances in Engineering Software 42 (9), pp. 714–722. 2011.

[32] T. A. M. C. Thantriwatte, C. I. Keppetiyagama. NoSQL Query Processing System for Wireless Ad-hoc and Sensor Networks. In Advances in ICT for Emerging Regions (ICTer), 2011 International Conference on (pp. 78-82). IEEE. September 2011.

[33] J. Han, E. Haihong, Guan Le, and Jian Du. Survey on NoSQL Database. Pervasive Computing and Applications (ICPCA), 6th International Conference, pp. 363-366. October 2011.

[34] A. Paul. RDBMS dominate the database market, but NoSQL systems are catching up. Accessed: 28th December 2013. Available: http://db-engines.com/en/blog_post/23

[35] Z. Belal, A. Essam. The Constraints of Object-Oriented Databases. Int. J. Open Problems Compt. Math., Vol. 1, No. 1, June 2008.

[36] J. Beall. The Weaknesses of Full-Text Searching. The Journal of Academic Librarianship, September 2008.

[37] R. B. Yates, B. R. Neto. Modern Information Retrieval. Addison Wesley Longman Limited, 1999.

[38] M. Young. The Technical Writer’s Handbook. Mill Valley, CA: University Science, 1989.