Seminar On Database Management System Design
The first DBMS was developed in the early 1960s when Charles Bachman created a navigational DBMS
known as the Integrated Data Store. Shortly after, IBM developed Information Management System
(IMS), a hierarchical DBMS designed for IBM mainframes that's still used by many large organizations
today.
The next major advancement came in 1971 when the Conference/Committee on Data Systems
Languages (CODASYL) standard was delivered. Integrated Database Management System is a
commercial implementation of the network model database approach advanced by CODASYL.
But the DBMS market changed forever as the relational model for data gained popularity. Introduced by
Edgar Codd of IBM in 1970 in his seminal paper "A Relational Model of Data for Large Shared Data
Banks," the RDBMS soon became the industry standard. The first RDBMS was Ingres, developed at the
University of California, Berkeley by a team led by Michael Stonebraker in the mid-1970s. At about the
same time, IBM was working on its System R project to develop an RDBMS.
In 1979, the first successful commercial RDBMS, Oracle, was released, followed a few years later by
IBM's Db2, Sybase SQL Server and many others. In the 1990s, as object-oriented (OO) programming
became popular, several OO database systems came to market, but they never gained significant market
share. Later in the 1990s, the term NoSQL was coined. Over the next decade, several types of new non-
relational DBMS products, including key-value, graph, document and wide-column store, were grouped
into the NoSQL category.
Today, the DBMS market is dominated by RDBMS, but NewSQL and NoSQL database systems continue
to grow in popularity.
DEFINITION OF TERMS
Data:
It is a collection of raw facts that can be recorded and that have an implicit meaning.
Example:
Customer ----- 1. cname
2. cno
3. ccity
Database:
It is a collection of interrelated data, which can be stored in the form of tables.
A database can be of any size and varying complexity.
A database may be generated and maintained manually, or it may be computerized.
Example:
A Customer database consists of the fields cname, cno, and ccity:
Cname | Cno | Ccity
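As a sketch in SQL (the column types are assumptions), such a customer table could be declared as:

CREATE TABLE Customer
(
    Cno   INT          PRIMARY KEY,  -- customer number
    Cname VARCHAR(100) NOT NULL,     -- customer name
    Ccity VARCHAR(100) NULL          -- customer city
);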
Database System:
It is a computerized system whose overall purpose is to maintain information and to make that
information available on demand.
Advantages:
1. Redundancy can be reduced.
2. Inconsistency can be avoided.
3. Data can be shared.
4. Standards can be enforced.
5. Security restrictions can be applied.
6. Integrity can be maintained.
7. Data gathering can be made possible.
8. Conflicting requirements can be balanced.
Advantages of DBMS:
1. Data independence.
2. Efficient data access.
3. Data integrity and security.
4. Data administration.
5. Concurrent access and crash recovery.
6. Reduced application development time.
DISADVANTAGES OF DBMS
Applications
Database Applications:
Banking: all transactions
Airlines: reservations, schedules
Universities: registration, grades
Sales: customers, products, purchases
Online retailers: order tracking, customized recommendations
Manufacturing: production, inventory, orders, supply chain
Human resources: employee records, salaries, tax deductions
1.0 INPUT/OUTPUT FORMS DESIGNS/GENERATOR
In an information system, input is the raw data that is processed to produce output. During the input
design, the developers must consider the input devices such as PC, MICR, OMR, etc.
Therefore, the quality of system input determines the quality of system output. Well-designed input
forms and screens have the following properties −
It should serve specific purpose effectively such as storing, recording, and retrieving the
information.
It ensures proper completion with accuracy.
It should be easy to fill and straightforward.
It should focus on user’s attention, consistency, and simplicity.
All these objectives are obtained using the knowledge of basic design principles regarding −
o What are the inputs needed for the system?
o How do end users respond to different elements of forms and screens?
It is important to design appropriate data input methods to prevent errors while entering data. These
methods depend on whether the data is entered by customers in forms manually and later entered by
data entry operators, or data is directly entered by users on the PCs.
Input integrity controls include a number of methods to eliminate common input errors by end users.
They also include checks on the value of individual fields, both for format and for the completeness of all
inputs.
Audit trails for data entry and other system operations are created using transaction logs, which give a
record of all changes introduced in the database, providing security and a means of recovery in case of
any failure.
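As a sketch in T-SQL, such an audit trail for the Customer table shown earlier could be kept by a trigger that copies every change into an audit table (the table, trigger, and column names here are assumptions):

CREATE TABLE CustomerAudit
(
    AuditId    INT IDENTITY PRIMARY KEY,
    Cno        INT          NOT NULL,
    Cname      VARCHAR(100) NULL,
    Ccity      VARCHAR(100) NULL,
    ChangeType CHAR(1)      NOT NULL,                       -- 'I' insert, 'U' update, 'D' delete
    ChangedAt  DATETIME     NOT NULL DEFAULT GETDATE(),     -- when the change was made
    ChangedBy  SYSNAME      NOT NULL DEFAULT SUSER_SNAME()  -- who made the change
);
GO
CREATE TRIGGER trg_Customer_Audit ON Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Rows in "inserted" are new or updated values; rows only in "deleted" were removed.
    INSERT INTO CustomerAudit (Cno, Cname, Ccity, ChangeType)
    SELECT i.Cno, i.Cname, i.Ccity,
           CASE WHEN EXISTS (SELECT 1 FROM deleted d WHERE d.Cno = i.Cno) THEN 'U' ELSE 'I' END
    FROM inserted i;

    INSERT INTO CustomerAudit (Cno, Cname, Ccity, ChangeType)
    SELECT d.Cno, d.Cname, d.Ccity, 'D'
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted i WHERE i.Cno = d.Cno);
END;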
The design of output is the most important task of any system. During output design, developers identify
the type of outputs needed and consider the necessary output controls and prototype report layouts. The
objectives of output design are −
To develop output design that serves the intended purpose and eliminates the production of
unwanted output.
To develop the output design that meets the end users requirements.
To deliver the appropriate quantity of output.
To form the output in appropriate format and direct it to the right person.
To make the output available on time for making good decisions.
External Outputs
Manufacturers create and design external outputs for printers. External outputs enable the system to
trigger actions on the part of their recipients or to confirm actions to their recipients.
Some of the external outputs are designed as turnaround outputs, which are implemented as a form
and re-enter the system as an input.
Internal outputs
Internal outputs are present inside the system, and used by end-users and managers. They support the
management in decision making and reporting.
There are three types of reports produced by a management information system −
Detailed Reports − They contain current information with almost no filtering or restriction, and are
generated to assist management planning and control.
Summary Reports − They contain trends and potential problems which are categorized and
summarized that are generated for managers who do not want details.
Exception Reports − They contain exceptions, filtered data to some condition or standard before
presenting it to the manager, as information.
Output integrity controls include routing codes to identify the receiving system, and verification
messages to confirm successful receipt of messages that are handled by network protocol.
Printed or screen-format reports should include a date/time for report printing and the data. Multipage
reports contain report title or description, and pagination. Pre-printed forms usually include a version
number and effective date.
Both forms and reports are the product of input and output design and are business documents
consisting of specified data. The main difference is that forms provide fields for data input, whereas
reports are purely used for reading. Examples include order forms, employment and credit applications, etc.
The objectives of good form design are −
To keep the screen simple by giving proper sequence, information, and clear captions.
To meet the intended purpose by using appropriate forms.
To ensure the completion of form with accuracy.
To keep the forms attractive by using icons, inverse video, or blinking cursors etc.
To facilitate navigation.
Types of Forms
Flat Forms
It is a single copy form prepared manually or by a machine and printed on paper. For
additional copies of the original, carbon papers are inserted between copies.
It is the simplest and least expensive form to design, print, and reproduce, and it uses less volume.
Unit Set / Snap-Out Forms
These are papers with one-time carbons interleaved into unit sets for either handwritten or
machine use.
Carbons may be either blue or black, standard grade medium intensity. Generally, blue carbons
are best for handwritten forms while black carbons are best for machine use.
Continuous Strip / Fanfold Forms
These are multiple unit forms joined in a continuous strip with perforations between each pair
of forms.
It is a less expensive method for large-volume use.
No Carbon Required (NCR) Paper
They use carbonless papers which have two chemical coatings (capsules), one on the face and
the other on the back of a sheet of paper.
When pressure is applied, the two capsules interact and create an image.
2.0 PROGRAM GENERATOR
A program generator (or just generator ) is a program that generates other programs. This broad
definition is often qualified to include constraints such as “the generated program is expressed in a high-
level programming language”. No definition, however, offers strict boundaries distinguishing generators
from traditional compilers, text-generating programs, etc. Thus, what really constitutes a generator is
often determined intuitively and does not reflect so much a technical distinction as the emphasis in the
development of the “generator”.
A related concept is meta-programming. Meta-programming can be described as the creation of a
program that computes (something about) other programs. Meta-programming tools constitute the
general platform for implementing generators in a given language setting. To understand what
generators are, we should look at their common uses. In practice, generators are typically compilers for
generators are we should look at their common uses. In practice, generators are typically compilers for
domain-specific languages (DSLs). A domain-specific language is a special-purpose programming
language for a particular software domain. A “domain” can be defined either by its technical structure
(e.g., the domain of reactive real-time programs, the domain of LALR language parsers) or by real-world
application (e.g., the database domain, the telephony domain, etc.). The purpose of restricting attention
to specific domains is to exploit the domain features and knowledge to increase automation.
If we view generators as compilers for DSLs, it is worth
asking how they differ from general-purpose language compilers. Indeed, the research and practice of
program generators is very different from that of general-purpose compilers. A general-purpose
language compiler implements a stable, separately defined specification and can take several man-years
to develop. In contrast, a generator is typically
co-designed with the DSL that it implements. Thus, the emphasis is not on deeply analyzing a program to
infer its properties, but on designing the DSL so that the properties are clearly exposed and on having
the generator exploit them with as little effort as possible. The effort of implementing a program
generator is typically small—comparable to the effort of implementing a software library for the
domain. This is largely the result of leveraging the high-level
language (commonly called the object language) in which the generated programs are expressed.
The above features of program generators (domain-specificity, language/generator co-design, low-effort
development) also define the focus of research in the area. Most research in program generators
concentrates either on specific domains that are amenable to a generator-based approach or on meta-
programming tools that simplify generator implementation.
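As a small illustration, even a few lines of SQL can act as a generator: the sketch below (T-SQL; STRING_AGG requires a recent SQL Server version) reads the standard INFORMATION_SCHEMA catalog and emits one SELECT statement per table, so its output is itself a set of small programs.

-- Read the standard catalog and emit one SELECT statement per base table.
-- Each output row is a program that can be executed later.
SELECT 'SELECT ' + STRING_AGG(c.COLUMN_NAME, ', ') +
       ' FROM '  + t.TABLE_SCHEMA + '.' + t.TABLE_NAME + ';' AS GeneratedQuery
FROM INFORMATION_SCHEMA.TABLES AS t
JOIN INFORMATION_SCHEMA.COLUMNS AS c
  ON c.TABLE_SCHEMA = t.TABLE_SCHEMA
 AND c.TABLE_NAME   = t.TABLE_NAME
WHERE t.TABLE_TYPE = 'BASE TABLE'
GROUP BY t.TABLE_SCHEMA, t.TABLE_NAME;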
USES
Query optimization is used to create an efficient access strategy that includes:
(1) which local DBMSs are involved,
(2) the operations to be performed at the local DBMSs,
(3) integration of intermediate results from the local DBMSs, and
(4) the location(s) of the global processing.
Query processing is translating a user query into an executable plan for a system and executing it. For
queries which are formulated in procedural language or by simple API (Nishimura et al., 2011; Tang et
al., 2016), this requires programming skills. In this case, it is the responsibility of the user to produce an
optimal query plan.
By contrast, a geospatial or astronomical SQL-like query language is broadly successful, because SQL is user-
friendly and is a declarative language – the user declares what he/she wants, without telling the SQL
engine how to solve the query. However, common Big Data systems with an SQL language, like Spark SQL
and Hive (Apache Hive), are unable to handle spatial queries natively; at best, they require query processing
extensions. In this subsection we describe the key points of query processing in representative systems
that support geospatial or astronomical queries.
The query in the form of an SQL statement is parsed by an SQL parser to build up a syntax tree, in which
keywords, expressions, variables, etc., are all identified.
Then the evaluation step is started. Each SQL statement that refers to resources is bound to concrete
objects in the spatial data schema (columns, tables, views, etc.).
After the query optimizer provides a series of candidates (logical execution plans), it chooses an optimal
plan according to statistics (e.g., the histogram of the value distributions in each column, heuristics to
reduce disk I/O, CPU consumption, or network costs). The physical plan is generated through binding
data sources, and choosing the appropriate algorithms in order to improve query performance.
Finally, the physical query plan is put into the executing queue of the query execution engine. Most Big
Data processing systems do not provide support for spatial data and operations, even though they offer
a partial implementation of the SQL standard. Indeed, several problems should be addressed to
integrate a spatial or astronomical data type as a basic domain of values and to be able to query it in the
SQL language.
Parsing Stage
The primary responsibility of the parser is to determine the grammatical structure of its
input and, if well-formed, construct a data structure containing an abstract representation of that input
(i.e., a parse tree). A well-formed SQL query always starts with the SELECT clause and is followed by the FROM
clause.
Many Big Data systems with an SQL interface also provide a user-defined function (UDF) facility, that is,
users can implement the UDFs required to process spatial data. Data types and operators for spatial UDFs
should follow the standard as closely as possible. For example, this is how Aji et al. (2013) extend Hive
SQL, and Huang et al. (2017) extend Spark SQL, with spatial features. Using the UDF facility therefore
allows taking full advantage of built-in parsers, such as the Spark SQL parser in Xie et al. (2016) and Huang
et al. (2017), to achieve support for spatial SQL.
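For illustration, a spatial selection expressed through UDFs on top of Spark SQL might look like the sketch below; the function names ST_Point and ST_Distance, and the table, are assumptions standing in for whatever the chosen spatial extension registers.

-- Hypothetical spatial UDFs (ST_Point, ST_Distance) registered on top of Spark SQL:
-- select the points of interest lying within a given radius of a query point
-- (the distance unit depends on the UDF implementation).
SELECT p.id, p.name
FROM points_of_interest AS p
WHERE ST_Distance(ST_Point(p.longitude, p.latitude),
                  ST_Point(2.2945, 48.8584)) < 0.5;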
The aforementioned prototypes addressed geospatial applications. The standard query language used in
astronomy is ADQL (GAVO Data Center, 2008), promoted by the International Virtual Observatory
Alliance (IVOA), which is an SQL-like language and is widely used by the community of astronomers.
However, it is not completely compliant with the SQL standard. For this reason, an existing built-in SQL
parser cannot process ADQL queries, and a custom parser is required. To take full advantage of
the built-in query processor of an existing system, another approach is to use an ADQL parser to translate
the query into an SQL-compliant statement with spatial UDFs. The ADQL Library provides such a parser, which
translates ADQL to PostgreSQL/Q3C or PostgreSQL/pgSphere. AstroSpark (Brahem et al., 2016) adapts this parser to
translate ADQL into Spark SQL with UDFs.
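As an illustration of this translation, an ADQL cone-search condition and one possible SQL rendering of it with a spatial UDF are sketched below (the UDF name SphericalDistance and the table are assumptions):

-- ADQL: stars within 0.5 degrees of the position (RA = 10.68, DEC = 41.27)
SELECT s.id, s.ra, s.dec
FROM stars AS s
WHERE 1 = CONTAINS(POINT('ICRS', s.ra, s.dec),
                   CIRCLE('ICRS', 10.68, 41.27, 0.5));

-- One possible translation into plain SQL using a spatial UDF (hypothetical name):
SELECT s.id, s.ra, s.dec
FROM stars AS s
WHERE SphericalDistance(s.ra, s.dec, 10.68, 41.27) <= 0.5;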
Evaluation Stage
The evaluation step (Gardarin and Valduriez, 1989) binds SQL statements to
concrete resources in data schemas (i.e., columns, tables, views, etc.).
When the data schema refers to a large data volume and no index exists, which is the case for most Big
Data systems, a commonly used approach is to split spatial data into partitions. To avoid useless disk I/O,
the idea is to bind only relevant partitions instead of the whole dataset. In Wang et al. (2011), data are
split into chunks according to a spatial grid. In Brahem et al. (2016), data are split according to the HEALPix
order. Every data element (i.e., representing a star) is tagged with the HEALPix index of the cell where it is
located. Then, all the data are sorted according to the tagged HEALPix index, and the dataset is divided
into equal-size partitions. A partition could be virtual, as shown in Yu et al. (2015). The most
common approach is to split the sky according to a grid. All objects located in a given cell of the grid are
virtually in the same partition: all objects in a cell are tagged with the same partition ID.
To exploit this feature at this stage, the idea, as shown in Wang et al. (2011), is to bind to relevant
partitions instead of the entire dataset. For that, the process is to rewrite the initial query (i.e., the
logical plan) into a set of subqueries, where each subquery addresses a partition. The initial query is thus
transformed into a union of subqueries, where each subquery is sent to the processing node which
holds the partition. In this process, an additional subquery that merges the results from each node is
also added to the transformed query.
To achieve this, as most spatial queries are applied to a limited area of the space, the first step is to
extract from the user query the minimal spatial bound (MSB), which is a geometric object (e.g., a
rectangle or circle). If the shape of a partition intersects the MSB, the partition is
relevant, and a subquery is generated to address it.
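Conceptually, the rewrite looks like the following sketch, where stars_part_17 and stars_part_18 stand for the two partitions whose shapes intersect the MSB (all names, including the SphericalDistance UDF, are illustrative):

-- Logical query over the whole dataset:
--   SELECT * FROM stars WHERE SphericalDistance(ra, dec, 10.68, 41.27) <= 0.5;
-- Rewritten as a union of subqueries, one per partition whose shape intersects the MSB;
-- each subquery runs on the node holding that partition and the results are merged.
SELECT * FROM stars_part_17 WHERE SphericalDistance(ra, dec, 10.68, 41.27) <= 0.5
UNION ALL
SELECT * FROM stars_part_18 WHERE SphericalDistance(ra, dec, 10.68, 41.27) <= 0.5;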
In existing systems that offer partitioning capability, as in Apache Hive or Apache Spark, this task is
usually done at the optimization stage, where the optimizer determines relevant partitions according to
metadata which describe how the data are partitioned. In the case of spatial data, as illustrated in
Brahem et al. (2016), the optimizer requires a new rule that allows it to exploit this feature. For that, the
optimizer extracts the MSB from the user query, which allows it to determine the relevant HEALPix cells. By
matching these cells against the partitioning metadata, the optimizer can deduce all
relevant partitions. The new rule in the optimizer adds a constraint on relevant partitions by
extending the WHERE clause. The constraint is used at the execution step for loading into memory only
relevant data.
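In other words, the rule rewrites a query such as the sketch below so that only the partitions holding the listed HEALPix cells are loaded (the column healpix_id, the UDF, and the cell numbers are illustrative):

-- Before the rule: the predicate ranges over the whole table.
SELECT * FROM stars
WHERE SphericalDistance(ra, dec, 10.68, 41.27) <= 0.5;

-- After the rule: the WHERE clause is extended with the cells intersecting the MSB,
-- so only the partitions that hold those cells are loaded into memory.
SELECT * FROM stars
WHERE SphericalDistance(ra, dec, 10.68, 41.27) <= 0.5
  AND healpix_id IN (1041, 1042, 1055);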
Optimization Stage
At this stage, the query optimizer provides several candidate execution plans
according to heuristics, and it produces an optimal plan according to a cost-based function (Xie et al.,
2016) which integrates running statistics, the existing indexes, disk I/O cost, and so on. The optimization
differs depending on whether the engine is memory-oriented or disk-oriented. Systems built on an existing
query optimizer can take advantage of implemented heuristics for reducing I/O costs for disk-oriented processing, or
CPU costs for memory-oriented processing. Heuristics can change the order of join operators and add
projection operators to prune unnecessary attributes during query execution (Gardarin and Valduriez,
1989).
However, traditional query optimizers are not able to optimize queries with complex UDFs (i.e., spatial
UDFs) because a UDF appears to them as a black box. This often leads to very expensive query evaluation plans. For
example, when Spark SQL implements a spatial distance join via a UDF, it has to use the expensive
Cartesian product approach, which is not scalable for large data. To avoid this, it is beneficial to replace
it with an algorithm optimized for this purpose (Xie et al., 2016; Huang et al., 2017). Many Big Data
systems provide extensibility in the built-in optimizer, as in Catalyst. This permits modifying the
optimizer by adding rules or overloading existing rules; otherwise, the user must rewrite the
optimizer, as in Xie et al. (2016). After the logical plan is optimized, a series of candidate physical plans
is generated by binding data, physical operators, index use when relevant, and/or data
organization such as partitioning that allows speedup of data access. Then, the optimizer chooses an
optimal plan to put into the executing queue.
For systems based on the Map/Reduce paradigm, the query optimizer proceeds differently. The
Map/Reduce paradigm is based on the notion of a job, which is a unit of execution. If we consider that each algebraic
operator in a query plan is translated into a job, and each job (i.e., operator) reads data from the disk
and stores the result back to the disk, the query cost could be prohibitive. To minimize the disk I/O in
the case of an SQL query, the optimizer of a Map/Reduce framework decomposes the logical plan into a
sequence of jobs, where each job can contain more than one algebraic operator (i.e., join, project,
select, etc.) (Apache Hive). For a procedural query language where each instruction could be an algebraic
operator, a system like Pig introduces lazy query evaluation. The idea is to record the sequence of
instructions (i.e., the program) until a materialization instruction (e.g., STORE, DUMP in Pig Latin) is
reached. With the recorded instructions, the query optimizer can reorder and group instructions to
provide a sequence of optimal jobs.
Execution Stage
The last stage is execution on an engine which can evaluate the query plan.
Almost all existing spatial Big Data systems are based on existing NoSQL/Big Data frameworks, of which Spark and
Hadoop Map/Reduce are a part. Using such a system provides properties like parallelism, scalability,
reliability, and fault tolerance, which are valuable for processing large datasets. The Map/Reduce
framework is designed for batch processing, which is efficient when the program can be done in one job.
On the other side, memory-oriented frameworks like Spark or Flink are interesting when interactivity
with users is required, because they avoid data reloading between operators of an algebraic query plan, in
contrast to the Hadoop Map/Reduce framework. This last point is a big advantage because Spark applies
lazy evaluation (i.e., some operators do not materialize intermediate results).
In contrast to traditional distributed database systems, the benefit of using NoSQL systems is that the
query optimizer does not have to take care of an optimal distribution plan for sending tasks to processing
nodes. This is usually done dynamically by the resource manager (YARN) in the case of the Map/Reduce
framework or by the low level of the system (e.g., Spark). The resource manager is able to load-balance
the execution even in the presence of nodes with different resources, and in the presence of data skew
in the case of Map/Reduce.
However, using an existing NoSQL system for the execution of a query plan is not straightforward, because
there is no spatial support. If the framework provides extensibility to implement new value domains, it can be
used to define a spatial value domain with ad hoc operators. This was the case in Yu et al. (2015), where
the authors extended the resilient distributed datasets (RDDs) of Apache Spark to the spatial domain
(named sRDDs) and added new spatial algorithms like kNN search. Otherwise, it requires rewriting part
of the framework code to achieve the spatial data processing, as in SpatialHadoop (Xie et al., 2016;
Eldawy and Mokbel, 2015). This approach offers more flexibility, but it has the drawback that at each
update of the framework, the implemented code could be invalidated.
A spatial index is often implemented in spatial systems (Tang et al., 2016; Nishimura et al., 2011; Xie
et al., 2016) because it can drastically reduce the required processing resources. It can reduce the
number of disk I/Os (Eldawy and Mokbel, 2015), or reduce memory accesses (Huang et al., 2017). However,
Map/Reduce, unlike a DBMS, does not have the notion of an index: each job works with the whole dataset as
input. To optimize this, many works have implemented a global and a local index in the Map/Reduce
framework, as in SpatialHadoop (Eldawy and Mokbel, 2015). At the execution stage, systems use the global
index to determine the subspace to be processed and also to determine, for each worker, the part of the
dataset to process. The local index speeds up the processing inside nodes, which is useful for
complex spatial processing.
Materialization Model
Each operator processes its input all at once and then emits its output all at once. The operator
“materializes” its output as a single result.
Every query plan operator implements an output function:
• The operator processes all the tuples from its children at once.
• The return result of this function is all the tuples that operator will ever emit. When the operator
finishes executing, the DBMS never needs to return to it to retrieve more data.
This approach is better for OLTP workloads because queries typically only access a small number of
tuples at a time. Thus, there are fewer function calls to retrieve tuples. Not good for OLAP queries with
large intermediate results because the DBMS may have to spill those results to disk between operators.
Vectorization Model
Like the iterator model, each operator implements a next function. But each operator emits a
batch (i.e., vector) of data instead of a single tuple:
• The operator implementation can be optimized for processing batches of data instead of a single item
at a time.
This approach is ideal for OLAP queries that have to scan a large number of tuples because there are
fewer invocations of the next function.
Access Methods
An access method is how the DBMS accesses the data stored in a table. These will be the bottom
operators in a query plan that “feed” data into the operators above them in the tree. There is no
corresponding operator in relational algebra.
Sequential Scan
For each page in the table, retrieve it from the buffer pool. Then, for each page, iterate
over all the tuples and evaluate the predicate to decide whether to include each tuple or not.
Optimizations:
• Prefetching: Fetches next few pages in advance so that the DBMS does not have to block when
accessing each page.
• Parallelization: Execute the scan using multiple threads/processes in parallel.
• Buffer Pool Bypass: The scan operator stores pages that it fetches from disk in its local memory instead
of the buffer pool. This avoids the sequential flooding problem.
• Zone Map: Pre-compute aggregations for each tuple attribute in a page. The DBMS can then check
whether it needs to access a page by checking its Zone Map first. The Zone Maps for each page are
stored in separate pages and there are typically multiple entries in each Zone Map page. Thus, it is
possible to reduce the total number of pages examined in a sequential scan.
• Late Materialization: Each operator passes the minimal amount of information needed by the next
operator (e.g., record id). This is only useful in column-store systems (i.e., DSM).
• Heap Clustering: Tuples are stored in the heap pages using an order specified by a clustering index.
Index Scan
The DBMS picks an index (or indexes) to find the tuples that the query needs.
When using multiple indexes, the DBMS executes the search on each index and generates the set of
matching record ids. These record id sets can be implemented using bitmaps, hash tables, or Bloom filters. The
DBMS combines these sets based on the query’s predicates (union vs. intersect). It then retrieves the
records and applies any remaining predicates. The more advanced DBMSs support multi-index scans.
Retrieving tuples in the order that they appear in an unclustered index is inefficient. The DBMS can first
figure out all the tuples that it needs and then sort them based on their page id.
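As a sketch (table and index names are illustrative), the multi-index case corresponds to a query such as the following, where each predicate can be evaluated on its own index and the resulting record id sets intersected:

-- Two single-column indexes that a multi-index scan can combine:
CREATE INDEX idx_employee_dept   ON Employee (DeptId);
CREATE INDEX idx_employee_salary ON Employee (Salary);

-- Both predicates are indexed: the DBMS can probe each index separately,
-- intersect the two sets of matching record ids, and only then fetch the tuples.
SELECT EmployeeId, Name
FROM Employee
WHERE DeptId = 10
  AND Salary > 50000;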
4.0 DATABASE DESIGN ISSUES
If database design is done right, then the development, deployment and subsequent performance in
production will give little trouble. A well-designed database 'just works'. There are a small number of
mistakes in database design that cause subsequent misery to developers, managers, and DBAs alike.
Here are the ten worst mistakes.
No list of mistakes is ever going to be exhaustive. People (myself included) do a lot of really stupid
things, at times, in the name of “getting it done.” This list simply reflects the database design mistakes
that are currently on my mind, or in some cases, constantly on my mind. I have done this topic two
times before. If you’re interested in hearing the podcast version, visit Greg Low’s super-excellent SQL
Down Under. I also presented a boiled down, ten-minute version at PASS for the Simple-Talk booth.
Originally there were ten, then six, and today back to ten. And these aren’t exactly the same ten that I
started with; these are ten that stand out to me as of today.
Before I start with the list, let me be honest for a minute. I used to have a preacher who made sure to
tell us before some sermons that he was preaching to himself as much as he was to the congregation.
When I speak, or when I write an article, I have to listen to that tiny little voice in my head that helps
filter out my own bad habits, to make sure that I am teaching only the best practices. Hopefully, after
reading this article, the little voice in your head will talk to you when you start to stray from what is right
in terms of database design practices.
1. Poor design/planning
2. Ignoring normalization
3. Poor naming standards
4. Lack of documentation
5. One table to hold all domain values
6. Using identity/guid columns as your only key
7. Not using SQL facilities to protect data integrity
8. Not using stored procedures to access data
9. Trying to build generic objects
10. Lack of testing
Poor design/planning
“If you don’t know where you are going, any road will take you there” – George Harrison
Prophetic words for all parts of life and a description of the type of issues that plague many projects
these days.
Let me ask you: would you hire a contractor to build a house and then demand that they start pouring a
foundation the very next day? Even worse, would you demand that it be done without blueprints or
house plans? Hopefully, you answered “no” to both of these. A design is needed to make sure that the
house you want gets built, and that the land you are building it on will not sink into some underground
cavern. If you answered yes, I am not sure if anything I can say will help you.
Like a house, a good database is built with forethought, and with proper care and attention given to the
needs of the data that will inhabit it; it cannot be tossed together in some sort of reverse implosion.
Since the database is the cornerstone of pretty much every business project, if you don’t take the time
to map out the needs of the project and how the database is going to meet them, then the chances are
that the whole project will veer off course and lose direction. Furthermore, if you don’t take the time at
the start to get the database design right, then you’ll find that any substantial changes in the database
structures that you need to make further down the line could have a huge impact on the whole project,
and greatly increase the likelihood of the project timeline slipping.
Far too often, a proper planning phase is ignored in favor of just “getting it done”. The project heads off
in a certain direction and when problems inevitably arise – due to the lack of proper designing and
planning – there is “no time” to go back and fix them properly, using proper techniques. That’s when the
“hacking” starts, with the veiled promise to go back and fix things later, something that happens very
rarely indeed.
Admittedly it is impossible to predict every need that your design will have to fulfill and every issue that
is likely to arise, but it is important to mitigate against potential problems as much as possible, by careful
planning.
Ignoring Normalization
Normalization defines a set of methods to break down tables to their constituent parts until each table
represents one and only one “thing”, and its columns serve to fully describe only the one “thing” that
the table represents.
The concept of normalization has been around for 30 years and is the basis on which SQL and relational
databases are implemented. In other words, SQL was created to work with normalized data structures.
Normalization is not just some plot by database programmers to annoy application programmers (that is
merely a satisfying side effect!)
SQL is very additive in nature in that, if you have bits and pieces of data, it is easy to build up a set of
values or results. In the FROM clause, you take a set of data (a table) and add (JOIN) it to another table.
You can add as many sets of data together as you like, to produce the final set you need.
This additive nature is extremely important, not only for ease of development, but also for performance.
Indexes are most effective when they can work with the entire key value. Whenever you have to
use SUBSTRING, CHARINDEX, LIKE, and so on, to parse out a value that is combined with other values in
a single column (for example, to split the last name of a person out of a full name column), the SQL
paradigm starts to break down and data becomes less and less searchable.
So normalizing your data is essential to good performance, and ease of development, but the question
always comes up: “How normalized is normalized enough?” If you have read any books about
normalization, then you will have heard many times that 3rd Normal Form is essential, but 4th and 5th
Normal Forms are really useful and, once you get a handle on them, quite easy to follow and well worth
the time required to implement them.
In reality, however, it is quite common that not even the first Normal Form is implemented correctly.
Whenever I see a table with repeating column names appended with numbers, I cringe in horror. And I
cringe in horror quite often. Consider the following example Customer table:
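A sketch of such a table, with one column per monthly payment (column names and types are assumptions), would be:

-- A denormalized Customer table with repeating column names: one column per payment.
CREATE TABLE Customer
(
    CustomerId INT PRIMARY KEY,
    Name       VARCHAR(100)  NOT NULL,
    Payment1   DECIMAL(10,2) NULL,
    Payment2   DECIMAL(10,2) NULL,
    -- ... Payment3 through Payment11 ...
    Payment12  DECIMAL(10,2) NULL
);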
Are there always 12 payments? Is the order of payments significant? Does a NULL value for a payment
mean UNKNOWN (not filled in yet), or a missed payment? And when was the payment made?!?
A payment does not describe a Customer and should not be stored in the Customer table. Details of
payments should be stored in a Payment table, in which you could also record extra information about
the payment, like when the payment was made, and what the payment was for:
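A sketch of that normalized alternative (names and types are assumptions):

CREATE TABLE Customer
(
    CustomerId INT PRIMARY KEY,
    Name       VARCHAR(100) NOT NULL
);

CREATE TABLE Payment
(
    PaymentId   INT PRIMARY KEY,
    CustomerId  INT NOT NULL REFERENCES Customer (CustomerId),
    Amount      DECIMAL(10,2) NOT NULL,
    PaymentDate DATE NOT NULL,        -- when the payment was made
    Purpose     VARCHAR(200) NULL     -- what the payment was for
);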
In this second design, each column stores a single unit of information about a single “thing” (a payment),
and each row represents a specific instance of a payment.
This second design is going to require a bit more code early in the process, but it is far more likely that
you will be able to figure out what is going on in the system without having to hunt down the original
programmer and kick their butt…sorry… figure out what they were thinking
Poor naming standards
“That which we call a rose, by any other name would smell as sweet“
This quote from Romeo and Juliet by William Shakespeare sounds nice, and it is true from one angle. If
everyone agreed that, from now on, a rose was going to be called dung, then we could get over it and it
would smell just as sweet. The problem is that if, when building a database for a florist, the designer
calls it dung and the client calls it a rose, then you are going to have some meetings that sound far more
like an Abbott and Costello routine than a serious conversation about storing information about
horticulture products.
Names, while a personal choice, are the first and most important line of documentation for your
application. I will not get into all of the details of how best to name things here- it is a large and messy
topic. What I want to stress in this article is the need for consistency. The names you choose are not just
to enable you to identify the purpose of an object, but to allow all future programmers, users, and so on
to quickly and easily understand how a component part of your database was intended to be used, and
what data it stores. No future user of your design should need to wade through a 500 page document to
determine the meaning of some wacky name.
Consider, for example, a column named, X304_DSCR. What the heck does that mean? You might decide,
after some head scratching, that it means “X304 description”. Possibly it does, but maybe DSCR means
discriminator, or discretizator?
That just leaves you to figure out what the X304 part of the name means. On first inspection, to me,
X304 sounds like more like it should be data in a column rather than a column name. If I subsequently
found that, in the organization, there was also an X305 and X306 then I would flag that as an issue with
the database design. For maximum flexibility, data is stored in columns, not in column names.
Along these same lines, resist the temptation to include “metadata” in an object’s name. A name such
as tblCustomer or colVarcharAddress might seem useful from a development perspective, but to the
end user it is just confusing. As a developer, you should rely on being able to determine that a table
name is a table name by context in the code or tool, and present to the users clear, simple, descriptive
names, such as Customer and Address.
A practice I strongly advise against is the use of spaces and quoted identifiers in object names. You
should avoid column names such as “Part Number” or, in Microsoft style, [Part Number], thereby
requiring your users to include these spaces and identifiers in their code. It is annoying and simply
unnecessary.
Lack of documentation
I hinted in the intro that, in some cases, I am writing for myself as much as you. This is the topic where
that is most true. By carefully naming your objects, columns, and so on, you can make it clear to anyone
what it is that your database is modeling. However, this is only step one in the documentation battle.
The unfortunate reality is, though, that “step one” is all too often the only step.
Not only will a well-designed data model adhere to a solid naming standard, it will also contain
definitions on its tables, columns, relationships, and even default and check constraints, so that it is
clear to everyone how they are intended to be used. In many cases, you may want to include sample
values, where the need arose for the object, and anything else that you may want to know in a year or
two when “future you” has to go back and make changes to the code.
NOTE:
Where this documentation is stored is largely a matter of corporate standards and/or convenience to the
developer and end users. It could be stored in the database itself, using extended properties.
Alternatively, it might be maintained in the data modeling tools. It could even be in a separate data
store, such as Excel or another relational database. My company maintains a metadata repository
database, which we developed in order to present this data to end users in a searchable, linkable format.
Format and usability is important, but the primary battle is to have the information available and up to
date.
Your goal should be to provide enough information that when you turn the database over to a support
programmer, they can figure out your minor bugs and fix them (yes, we all make bugs in our code!). I
know there is an old joke that poorly documented code is a synonym for “job security.” While there is a
hint of truth to this, it is also a way to be hated by your coworkers and never get a raise. And no good
programmer I know of wants to go back and rework their own code years later. It is best if the bugs in
the code can be managed by a junior support programmer while you create the next new thing. Job
security along with raises is achieved by being the go-to person for new challenges.
One table to hold all domain values
“One Ring to rule them all and in the darkness bind them“
This is all well and good for fantasy lore, but it’s not so good when applied to database design, in the
form of a “ruling” domain table. Relational databases are based on the fundamental idea that every
object represents one and only one thing. There should never be any doubt as to what a piece of data
refers to. By tracing through the relationships, from column name, to table name, to primary key, it
should be easy to examine the relationships and know exactly what a piece of data means.
The big myth perpetrated by architects who don’t really understand relational database architecture
(me included early in my career) is that the more tables there are, the more complex the design will be.
So, conversely, shouldn’t condensing multiple tables into a single “catch-all” table simplify the design? It
does sound like a good idea, but at one time giving Pauly Shore the lead in a movie sounded like a good
idea too.
For example, consider the following model snippet where I needed domain values for:
Customer CreditStatus
Customer Type
Invoice Status
Invoice Line Item BackOrder Status
Invoice Line Item Ship Via Carrier
On the face of it that would be five domain tables…but why not just use one generic domain table, like
this?
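A sketch of that single catch-all table (the key and RelatedTo column names follow the query shown a little further down; the Code and Description columns are assumptions):

-- One catch-all table holding every lookup list in the database.
CREATE TABLE GenericDomain
(
    GenericDomainId INT PRIMARY KEY,
    RelatedToTable  SYSNAME      NOT NULL,  -- e.g. 'Customer'
    RelatedToColumn SYSNAME      NOT NULL,  -- e.g. 'CustomerTypeId'
    Code            VARCHAR(20)  NOT NULL,  -- e.g. 'UPS'
    Description     VARCHAR(100) NOT NULL   -- e.g. 'United Parcel Service'
);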
This may seem a very clean and natural way to design a table for all your domain values, but the problem is
that it is just not very natural to work with in SQL. Say we just want the domain values for the Customer table:
SELECT *
FROM Customer
JOIN GenericDomain as CustomerType
ON Customer.CustomerTypeId = CustomerType.GenericDomainId
and CustomerType.RelatedToTable = 'Customer'
and CustomerType.RelatedToColumn = 'CustomerTypeId'
JOIN GenericDomain as CreditStatus
ON Customer.CreditStatusId = CreditStatus.GenericDomainId
and CreditStatus.RelatedToTable = 'Customer'
and CreditStatus.RelatedToColumn = 'CreditStatusId'
As you can see, this is far from being a natural join. It comes down to the problem of mixing apples with
oranges. At first glance, domain tables are just an abstract concept of a container that holds text. And
from an implementation centric standpoint, this is quite true, but it is not the correct way to build a
database. In a database, the process of normalization, as a means of breaking down and isolating data,
takes every table to the point where one row represents one thing. And each domain of values is a
distinctly different thing from all of the other domains (unless it is not, in which case the one table will
suffice.). So what you do, in essence, is normalize the data on each usage, spreading the work out over
time, rather than doing the task once and getting it over with.
So instead of the single table for all domains, you might model it as:
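A sketch of the separate-table alternative, showing just two of the five domains (names are assumptions):

CREATE TABLE CustomerType
(
    CustomerTypeId INT PRIMARY KEY,
    Code           VARCHAR(20)  NOT NULL UNIQUE,
    Description    VARCHAR(100) NOT NULL
);

CREATE TABLE CreditStatus
(
    CreditStatusId INT PRIMARY KEY,
    Code           VARCHAR(20)  NOT NULL UNIQUE,
    Description    VARCHAR(100) NOT NULL
);

-- The Customer table then references each domain with an ordinary foreign key, e.g.:
--   CustomerTypeId INT NOT NULL REFERENCES CustomerType (CustomerTypeId),
--   CreditStatusId INT NOT NULL REFERENCES CreditStatus (CreditStatusId)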
Looks harder to do, right? Well, it is initially. Frankly it took me longer to flesh out the example tables.
But, there are quite a few tremendous gains to be had:
Data can be validated using foreign key constraints very naturally, something not feasible for the
other solution unless you implement ranges of keys for every table – a terrible mess to maintain.
If it turns out that you need to keep more information about a ShipViaCarrier than just the
code, ‘UPS’, and description, ‘United Parcel Service’, then it is as simple as adding a column or
two. You could even expand the table to be a full blown representation of the businesses that
are carriers for the item.
All of the smaller domain tables will fit on a single page of disk. This ensures a single read (and
likely a single page in cache). In the other case, you might have your domain table spread across
many pages, unless you cluster on the referring table name, which then could cause it to be
more costly to use a non-clustered index if you have many values.
You can still have one editor for all rows, as most domain tables will likely have the same base
structure/usage. And while you would lose the ability to query all domain values in one query
easily, why would you want to? (A union query over the tables could easily be created if
needed, but this would seem an unlikely need.)
I should probably rebut the thought that might be in your mind. “What if I need to add a new column to
all domain tables?” For example, you forgot that the customer wants to be able to do custom sorting on
domain values and didn’t put anything in the tables to allow this. This is a fair question, especially if you
have 1000 of these tables in a very large database. First, this rarely happens, and when it does it is going
to be a major change to your database in either way.
Second, even if this became a task that was required, SQL has a complete set of commands that you can
use to add columns to tables, and using the system tables it is a pretty straightforward task to build a
script to add the same column to hundreds of tables all at once. That change will not be quite as easy, but
it will not be so much more difficult as to outweigh the large benefits.
The point of this tip is simply that it is better to do the work upfront, making structures solid and
maintainable, rather than trying to attempt to do the least amount of work to start out a project. By
keeping tables down to representing one “thing” it means that most changes will only affect one table,
after which it follows that there will be less rework for you down the road.
Using identity/guid columns as your only key
First Normal Form dictates that all rows in a table must be uniquely identifiable. Hence, every table
should have a primary key. SQL Server allows you to define a numeric column as an IDENTITY column,
and then automatically generates a unique value for each row. Alternatively, you can
use NEWID() (or NEWSEQUENTIALID()) to generate a random, 16 byte unique value for each row. These
types of values, when used as keys, are what are known as surrogate keys. The word surrogate means
“something that substitutes for” and in this case, a surrogate key should be the stand-in for a natural
key.
The problem is that too many designers use a surrogate key column as the only key column on a given
table. The surrogate key values have no actual meaning in the real world; they are just there to uniquely
identify each row.
Now, consider the following Part table, whereby PartID is an IDENTITY column and is the primary key
for the table:
PartId   PartNumber   Description
1        XXXXXXXX     The X part
2        XXXXXXXX     The X part
3        YYYYYYYY     The Y part
How many rows are there in this table? Well, there seem to be three, but are rows with PartIDs 1 and 2
actually the same row, duplicated? Or are they two different rows that should be unique but were keyed
in incorrectly?
The rule of thumb I use is simple. If a human being could not pick which row they want from a table
without knowledge of the surrogate key, then you need to reconsider your design. This is why there
should be a key of some sort on the table to guarantee uniqueness, in this case likely on PartNumber.
In summary: as a rule, each of your tables should have a natural key that means something to the user,
and can uniquely identify each row in your table. In the very rare event that you cannot find a natural
key (perhaps, for example, a table that provides a log of events), then use an artificial/surrogate key.
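In practice that usually means keeping the IDENTITY surrogate as the primary key while also declaring a UNIQUE constraint on the natural key; a sketch:

CREATE TABLE Part
(
    PartId      INT IDENTITY PRIMARY KEY,      -- surrogate key for joins
    PartNumber  VARCHAR(20)  NOT NULL UNIQUE,  -- natural key: no two parts may share a number
    Description VARCHAR(100) NOT NULL
);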
Not using SQL facilities to protect data integrity
All fundamental, non-changing business rules should be implemented by the relational engine. The base
rules of nullability, string length, assignment of foreign keys, and so on, should all be defined in the
database.
There are many different ways to import data into SQL Server. Only if your base rules are defined in the
database itself can you guarantee that they will never be bypassed, and only then can you write your queries
without ever having to worry whether the data you’re viewing adheres to the base business rules.
Rules that are optional, on the other hand, are wonderful candidates to go into a business layer of the
application. For example, consider a rule such as this: “For the first part of the month, no part can be
sold at more than a 20% discount, without a manager’s approval”.
Taken as a whole, this rule smacks of being rather messy, not very well controlled, and subject to
frequent change. For example, what happens when next week the maximum discount is 30%? Or when
the definition of “first part of the month” changes from 15 days to 20 days? Most likely you won’t want to
go through the difficulty of implementing these complex temporal business rules in SQL Server code –
the business layer is a great place to implement rules like this.
However, consider the rule a little more closely. There are elements of it that will probably never
change, e.g. the maximum discount that it is ever possible to offer, and the fact that the approver must be a manager.
These aspects of the business rule very much ought to get enforced by the database and design. Even if
the substance of the rule is implemented in the business layer, you are still going to have a table in the
database that records the size of the discount, the date it was offered, the ID of the person who
approved it, and so on. On the Discount column, you should have a CHECK constraint that restricts the
values allowed in this column to between 0.00 and 0.90 (or whatever the maximum is). Not only will this
implement your “maximum discount” rule, but will also guard against a user entering a 200% or a
negative discount by mistake. On the ManagerID column, you should place a foreign key constraint,
which references the Managers table and ensures that the ID entered is that of a real manager (or,
alternatively, a trigger that selects only EmployeeIds corresponding to managers).
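A sketch of those two declarative rules, using assumed table and column names based on the description above:

CREATE TABLE DiscountApproval
(
    DiscountApprovalId INT IDENTITY PRIMARY KEY,
    Discount           DECIMAL(4,2) NOT NULL,
    OfferedDate        DATE         NOT NULL,
    ManagerId          INT          NOT NULL,
    CONSTRAINT chk_DiscountApproval_Range
        CHECK (Discount BETWEEN 0.00 AND 0.90),                   -- maximum discount rule
    CONSTRAINT fk_DiscountApproval_Manager
        FOREIGN KEY (ManagerId) REFERENCES Managers (ManagerId)   -- approver must be a manager
);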
Now, at the very least we can be sure that the data meets the very basic rules that the data must follow,
so we never have to code something like this in order to check that the data is good:
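That is, with the constraints above in place, defensive queries of roughly the following kind (a sketch against the assumed tables) become unnecessary:

-- Defensive check that becomes unnecessary once the constraints exist:
SELECT DiscountApprovalId, Discount, ManagerId
FROM DiscountApproval
WHERE Discount < 0.00
   OR Discount > 0.90
   OR ManagerId NOT IN (SELECT ManagerId FROM Managers);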
Not using stored procedures to access data
Stored procedures are your friend. Use them whenever possible as a method to insulate the database
layer from the users of the data. Do they take a bit more effort? Sure, initially, but what good thing
doesn’t take a bit more time? Stored procedures make database development much cleaner, and
encourage collaborative development between your database and functional programmers. A few of
the other interesting reasons that stored procedures are important include the following.
Maintainability
Stored procedures provide a known interface to the data, and to me, this is probably the largest draw.
When code that accesses the database is compiled into a different layer, performance tweaks cannot be
made without a functional programmer’s involvement. Stored procedures give the database
professional the power to change characteristics of the database code without additional resource
involvement, making small changes, or large upgrades (for example changes to SQL syntax) easier to do.
Encapsulation
Stored procedures allow you to “encapsulate” any structural changes that you need to make to the
database so that the knock-on effect on user interfaces is minimized. For example, say you originally
modeled one phone number, but now want an unlimited number of phone numbers. You could leave
the single phone number in the procedure call, but store it in a different table as a stopgap measure, or
even permanently if you have a “primary” number of some sort that you always want to display. Then a
stored proc could be built to handle the other phone numbers. In this manner the impact to the user
interfaces could be quite small, while the code of stored procedures might change greatly.
Security
Stored procedures can provide specific and granular access to the system. For example, you may have
10 stored procedures that all update table X in some way. If a user needs to be able to update a
particular column in a table and you want to make sure they never update any others, then you can
simply grant to that user the permission to execute just the one procedure out of the ten that allows
them to perform the required update.
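For example (the procedure and role names here are invented for illustration):

-- The user (via a role) may run this one procedure and nothing else against the table;
-- no direct UPDATE permission on the underlying table is ever granted.
GRANT EXECUTE ON dbo.Customer_UpdateCreditLimit TO SalesClerkRole;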
Performance
There are a couple of reasons that I believe stored procedures enhance performance. First, if a newbie
writes ratty code (like using a cursor to go row by row through an entire ten million row table to find one
value, instead of using a WHERE clause), the procedure can be rewritten without impact to the system
(other than giving back valuable resources.) The second reason is plan reuse. Unless you are using
dynamic SQL calls in your procedure, SQL Server can store a plan and not need to compile it every time it
is executed. It’s true that in every version of SQL Server since 7.0 this has become less and less
significant, as SQL Server gets better at storing plans for ad hoc SQL calls (see note below). However, stored
procedures still make it easier for plan reuse and performance tweaks. In the case where ad hoc SQL
would actually be faster, this can be coded into the stored procedure seamlessly.
In SQL Server 2005, there is a database setting (PARAMETERIZATION FORCED) that, when enabled, will cause all
queries to have their plans saved. This does not cover more complicated situations that procedures
would cover, but can be a big help. There is also a feature known as plan guides, which allow you to
override the plan for a known query type. Both of these features are there to help out when stored
procedures are not used, but stored procedures do the job with no tricks.
And this list could go on and on. There are drawbacks too, because nothing is ever perfect. It can take
longer to code stored procedures than it does to just use ad hoc calls. However, the amount of time to
design your interface and implement it is well worth it, when all is said and done.
Trying to build generic objects
I touched on this subject earlier in the discussion of generic domain tables, but the problem is more
prevalent than that. Every new T-SQL programmer, when they first start coding stored procedures,
starts to think “I wish I could just pass a table name as a parameter to a procedure.” It does sound quite
attractive: one generic stored procedure that can perform its operations on any table you choose.
However, this should be avoided as it can be very detrimental to performance and will actually make life
more difficult in the long run.
T-SQL objects do not do “generic” easily, largely because lots of design considerations in SQL Server
have clearly been made to facilitate reuse of plans, not code. SQL Server works best when you minimize
the unknowns so it can produce the best plan possible. The more it has to generalize the plan, the less it
can optimize that plan.
Note that I am not specifically talking about dynamic SQL procedures. Dynamic SQL is a great tool to use
when you have procedures that are not optimizable / manageable otherwise. A good example is a
search procedure with many different choices. A precompiled solution with multiple OR conditions
might have to take a worst case scenario approach to the plan and yield weak results, especially if
parameter usage is sporadic.
However, the main point of this tip is that you should avoid coding very generic objects, such as ones
that take a table name and twenty column names/value pairs as parameters and let you update the
values in the table. For example, you could write a procedure that started out:
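A sketch of such a generic procedure, simplified to a single column/value pair (all names are illustrative):

-- An overly generic procedure: the table and column names arrive as parameters,
-- so the statement has to be assembled as dynamic SQL at run time.
CREATE PROCEDURE dbo.Update_AnyTable
    @TableName sysname,
    @KeyColumn sysname,
    @KeyValue  VARCHAR(100),
    @SetColumn sysname,
    @SetValue  VARCHAR(100)
AS
BEGIN
    DECLARE @sql NVARCHAR(MAX) =
        N'UPDATE ' + QUOTENAME(@TableName) +
        N' SET '   + QUOTENAME(@SetColumn) + N' = @SetValue' +
        N' WHERE ' + QUOTENAME(@KeyColumn) + N' = @KeyValue';

    EXEC sp_executesql @sql,
         N'@SetValue VARCHAR(100), @KeyValue VARCHAR(100)',
         @SetValue = @SetValue, @KeyValue = @KeyValue;
END;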
The idea would be to dynamically specify the name of a column and the value to pass to a SQL
statement. This solution is no better than simply using ad hoc calls with an UPDATE statement. Instead,
when building stored procedures, you should build specific, dedicated stored procedures for each task
performed on a table (or multiple tables.) This gives you several benefits:
Properly compiled stored procedures can have a single compiled plan attached to them and reused.
Properly compiled stored procedures are more secure than ad-hoc SQL or even dynamic SQL
procedures, reducing the surface area for an injection attack greatly because the only
parameters to queries are search arguments or output values.
Testing and maintenance of compiled stored procedures is far easier to do, since you generally
have to validate only the search arguments, rather than checking that tables/columns/etc. exist and
handling the cases where they do not.
A nice technique is to build a code generation tool in your favorite programming language (even T-SQL)
using SQL metadata to build very specific stored procedures for every table in your system. Generate all
of the boring, straightforward objects, including all of the tedious code to perform error handling that is
so essential, but painful to write more than once or twice.
In my Apress book, Pro SQL Server 2005 Database Design and Optimization, I provide several such
“templates” (mainly for triggers, but also stored procedures) that have all of the error handling built in. I
would suggest you consider building your own (possibly based on mine) to use when you need to
manually build a trigger/procedure or whatever.
Lack of testing
When the dial in your car says that your engine is overheating, what is the first thing you blame? The
engine. Why don’t you immediately assume that the dial is broken? Or something else minor? Two
reasons:
The engine is the most important component of the car and it is common to blame the most
important part of the system first.
It is all too often true.
As database professionals know, the first thing to get blamed when a business system is running slow is
the database. Why? First because it is the central piece of most any business system, and second
because it also is all too often true.
We can play our part in dispelling this notion, by gaining deep knowledge of the system we have created
and understanding its limits through testing.
But let’s face it; testing is the first thing to go in a project plan when time slips a bit. And what suffers the
most from the lack of testing? Functionality? Maybe a little, but users will notice and complain if the
“Save” button doesn’t actually work and they cannot save changes to a row they spent 10 minutes
editing. What really gets the shaft in this whole process is deep system testing to make sure that the
design you (presumably) worked so hard on at the beginning of the project is actually implemented
correctly.
But, you say, the users accepted the system as working, so isn’t that good enough? The problem with
this statement is that what user acceptance “testing” usually amounts to is the users poking around,
trying out the functionality that they understand and giving you the thumbs up if their little bit of the
system works. Is this reasonable testing? Not in any other industry would this be vaguely acceptable. Do
you want your automobile tested like this? “Well, we drove it slowly around the block once, one sunny
afternoon with no problems; it is good!” When that car subsequently “failed” on the first drive along a
freeway, or during the first drive through rain or snow, then the driver would have every right to be very
upset.
Too many database systems get tested like that car, with just a bit of poking around to see if individual
queries and modules work. The first real test is in production, when users attempt to do real work. This
is especially true when it is implemented for a single client (even worse when it is a corporate project,
with management pushing for completion more than quality).
Initially, major bugs come in thick and fast, especially performance-related ones. If the first time you
have tried a full production set of users, background processes, workflow processes, system maintenance
routines, ETL, etc., is on your system launch day, you are extremely likely to discover that you have not
anticipated all of the locking issues that might be caused by users creating data while others are reading
it, or hardware issues caused by poorly set up hardware. It can take weeks to live down the cries of "SQL
Server can't handle it" even after you have done the proper tuning.
Once the major bugs are squashed, the fringe cases (which are pretty rare cases, like a user entering a
negative amount for hours worked) start to raise their ugly heads. What you end up with at this point is
software that irregularly fails in what seem like weird places (since large quantities of fringe bugs will
show up in ways that aren’t very obvious and are really hard to find.)
Now, it is far harder to diagnose and correct because now you have to deal with the fact that users are
working with live data and trying to get work done. Plus you probably have a manager or two sitting on
your back saying things like “when will it be done?” every 30 seconds, even though it can take days and
weeks to discover the kinds of bugs that result in minor (yet important) data aberrations. Had proper
testing been done, it would never have taken weeks of testing to find these bugs, because a proper test
plan takes into consideration all possible types of failures, codes them into an automated test, and tries
them over and over. Good testing won’t find all of the bugs, but it will get you to the point where most
of the issues that correspond to the original design are ironed out.
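To make the idea concrete, one such coded fringe case might look like the following T-SQL sketch; the table, columns, and the constraint it exercises are all assumed examples:

-- Fringe-case test: hours worked must never be negative, so this insert is
-- expected to be rejected by a CHECK constraint (or trigger) on the table.
DECLARE @rejected bit;
SET @rejected = 0;

BEGIN TRY
    INSERT INTO dbo.TimeEntry (EmployeeId, HoursWorked)
    VALUES (1, -8);
END TRY
BEGIN CATCH
    SET @rejected = 1;  -- the database blocked the bad row, as it should
END CATCH;

IF @rejected = 1
    PRINT 'Test passed: negative hours were rejected';
ELSE
    PRINT 'TEST FAILED: negative hours were accepted';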
If everyone insisted on a strict testing plan as an integral and immutable part of the database
development process, then maybe someday the database won’t be the first thing to be fingered when
there is a system slowdown.
The tips covered here are ones that I have picked up over the years, and they have turned me from a
mediocre data architect/database programmer into a good one. None of them takes an extraordinary
amount of time (except perhaps design and planning), but they all take more time upfront than doing it
the "easy" way. Let's face it: if the easy way were really that easy in the long run, I for one would
abandon the harder way in a second. It is not until you see the end result that you realize that success
comes from the extra effort invested at the start. All too often in this business it seems that to achieve
success you must fail first, and even when you succeed in one area, minor failures crop up in other
parts of the project so that some of your successes don't even get noticed.
Computer storage can be broadly classified into three categories:
Primary Storage − The memory storage that is directly accessible to the CPU comes under this
category. The CPU's internal memory (registers), fast memory (cache), and main memory (RAM) are
directly accessible to the CPU, as they are all placed on the motherboard or CPU chipset. This storage is
typically very small, ultra-fast, and volatile. Primary storage requires a continuous power supply in order
to maintain its state; in case of a power failure, all its data is lost.
Secondary Storage − Secondary storage devices are used to store data for future use or as backup.
Secondary storage includes memory devices that are not a part of the CPU chipset or motherboard, for
example, magnetic disks, optical disks (DVD, CD, etc.), hard disks, flash drives, and magnetic tapes.
Tertiary Storage − Tertiary storage is used to store huge volumes of data. Since such storage devices
are external to the computer system, they are the slowest in speed. These storage devices are mostly
used to take backups of an entire system. Optical disks and magnetic tapes are widely used as
tertiary storage.
Memory Hierarchy
A computer system has a well-defined hierarchy of memory. A CPU has direct access to its main memory
as well as its inbuilt registers. The access time of main memory is considerably longer than a CPU cycle.
To minimize this speed mismatch, cache memory is introduced. Cache memory provides the fastest
access time and it contains data that is most frequently accessed by the CPU.
The memory with the fastest access is the costliest one. Larger storage devices offer slow speed and
they are less expensive, however they can store huge volumes of data as compared to CPU registers or
cache memory.
Magnetic Disks
Hard disk drives are the most common secondary storage devices in present computer systems. These
are called magnetic disks because they use the concept of magnetization to store information. Hard
disks consist of metal disks coated with magnetizable material. These disks are placed vertically on a
spindle. A read/write head moves in between the disks and is used to magnetize or de-magnetize the
spot under it. A magnetized spot can be recognized as 0 (zero) or 1 (one).
Hard disks are formatted in a well-defined order to store data efficiently. A hard disk platter has many
concentric circles on it, called tracks. Every track is further divided into sectors. A sector on a hard disk
typically stores 512 bytes of data.
RAID, or Redundant Array of Independent Disks, is a technology to connect multiple secondary storage
devices and use them as a single storage medium.
RAID consists of an array of disks in which multiple disks are connected together to achieve different
goals. RAID levels define the use of disk arrays.
File Organization defines how file records are mapped onto disk blocks. There are four common types of
file organization: heap, sequential, hash, and clustered.
When a file is created using Heap File Organization, the Operating System allocates memory area to that
file without any further accounting details. File records can be placed anywhere in that memory area. It
is the responsibility of the software to manage the records. Heap File does not support any ordering,
sequencing, or indexing on its own.
Every file record contains a data field (attribute) to uniquely identify that record. In sequential file
organization, records are placed in the file in some sequential order based on the unique key field or
search key. Practically, it is not possible to store all the records sequentially in physical form.
Hash File Organization applies a hash function to some fields of the records. The output of the hash
function determines the location of the disk block where the record is to be placed.
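A loosely analogous, concrete illustration is hash partitioning in MySQL, where a hash of a key field decides which of several partitions (storage areas) a row lands in; the table and column below are invented for this example:

-- Rows are distributed across four partitions based on a hash of sensor_id.
CREATE TABLE sensor_reading (
    sensor_id INT NOT NULL,
    reading   DECIMAL(10,2)
)
PARTITION BY HASH (sensor_id)
PARTITIONS 4;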
Clustered file organization is not considered good for large databases. In this mechanism, related
records from one or more relations are kept in the same disk block, that is, the ordering of records is not
based on primary key or search key.
Operations on database files can be broadly classified into two categories:
Update Operations
Retrieval Operations
Update operations change the data values by insertion, deletion, or update. Retrieval operations, on the
other hand, do not alter the data but retrieve them after optional conditional filtering. In both types of
operations, selection plays a significant role. Other than creation and deletion of a file, there could be
several operations, which can be done on files.
Open − A file can be opened in one of the two modes, read mode or write mode. In read mode,
the operating system does not allow anyone to alter data. In other words, data is read only. Files
opened in read mode can be shared among several entities. Write mode allows data
modification. Files opened in write mode can be read but cannot be shared.
Locate − Every file has a file pointer, which tells the current position where the data is to be read
or written. This pointer can be adjusted accordingly. Using find (seek) operation, it can be
moved forward or backward.
Read − By default, when files are opened in read mode, the file pointer points to the beginning
of the file. There are options where the user can tell the operating system where to locate the
file pointer at the time of opening a file. The very next data to the file pointer is read.
Write − A user can open a file in write mode, which enables them to edit its contents, whether by
deletion, insertion, or modification. The file pointer can be located at the time of opening or can be
dynamically changed if the operating system allows it.
Close − This is the most important operation from the operating system’s point of view. When a
request to close a file is generated, the operating system
o removes all the locks (if in shared mode),
o saves the data (if altered) to the secondary storage media, and
o releases all the buffers and file handlers associated with the file.
The organization of data inside a file plays a major role here. The process of locating the file pointer at a
desired record inside a file varies based on whether the records are arranged sequentially or clustered.
7.0 DATA STRUCTURES AND ACCESS STRATEGIES
A data structure is a way of storing data in a computer so that it can be used efficiently and it will allow
the most efficient algorithm to be used. The choice of the data structure begins from the choice of an
abstract data type (ADT). A well-designed data structure allows a variety of critical operations to be
performed, using as few resources, both execution time and memory space, as possible.
A data structure refers to a scheme for organizing data; in other words, it is an arrangement of data in
the computer's memory that makes the data quickly available to the processor for the required
calculations.
A data structure should be seen as a logical concept that must address two fundamental concerns.
1. First, how the data will be stored, and
2. Second, what operations will be performed on it.
Because a data structure is a scheme for data organization, its functional definition should be
independent of its implementation; this implementation-independent functional definition is known as
an ADT (Abstract Data Type). The way in which the data is organized
affects the performance of a program for different tasks. Computer programmers decide which data
structures to use based on the nature of the data and the processes that need to be performed on that
data. Some of the more commonly used data structures include lists, arrays, stacks, queues, heaps,
trees,
and graphs.
Linked List:
A linked list (or singly linked list) is a sequence of elements in which every element has a link to the next
element in the sequence.
Every element is called a "node". Every node contains two fields, data and link. The data field holds a
value and the link field holds the address of the next node.
A special pointer, called HEAD, stores the address of the first node and therefore links to it.
The first node links to the second node, and so on.
The last node does not link to another node; its link is NULL. Let ptr be a pointer to the linked list.
Stack:
A stack is a data structure in which additions and deletions are made at the top of the stack, so we can
perform two operations on a stack:
1. Adding an element to the stack, known as push
2. Deleting an element from the stack, known as pop
Queue:
A queue is a data structure in which additions are made at one end and deletions are made at the other
end. We can represent a queue in an array.
We can perform two operations on a queue:
1. Adding an element to the queue, known as insertion at the rear
2. Deleting an element from the queue, known as deletion from the front
Databases provide us with information stored in a hierarchical and related structure, which allows us to
extract the content and arrange it easily. There are plenty of options for businesses and companies to
choose from. Even though MySQL is the most widespread database, there are other alternatives, such as
Microsoft SQL Server, PostgreSQL, or MongoDB.
Regardless of our choice, there are a number of common recommendations regarding the management
and optimization of databases. Following these best practices will make your work easier and improve
the experience of your users.
1. Delete Unused Tables
No matter how carefully you plan the initial approach to the database, as time goes by you inevitably
end up with unused tables. Do not hesitate: delete them! Having unused tables is quite similar to
keeping things you don't use in a full drawer: when you go looking for something, it can take you much
longer to find it. The same thing happens in databases: the system has to track all the tables and
elements until it finds the answer to the query.
2. Proper Indexing
Having good indexes on tables is essential for relational searches to work correctly. Add indexes to the
columns used in your query statements (SELECT ... WHERE ...). It is also advisable to periodically check
the slow query log to identify queries that should be optimized. Neither no indexing at all nor excessive
indexing is a good idea: without any indexing, queries will be prolonged, whereas indexing everything
will slow down every insert and update.
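As a small illustration (the table and columns are invented for this example), you might index the column used in the WHERE clause and then check with EXPLAIN that the index is actually used:

-- Index the search column, then inspect the plan of a typical query.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

EXPLAIN
SELECT order_id, order_date
FROM   orders
WHERE  customer_id = 42;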
3. Avoid Temporary Tables and Coding Loops
If code can be written simply, there is no need to make it more complex with temporary tables.
Subqueries can often replace temporary tables, but keep in mind the actual performance each approach
provides in each case.
Avoiding coding loops is also important to prevent stalling the entire sequence. This can be accomplished
by using a single set-based UPDATE or INSERT statement instead of processing individual rows, and by
making sure the WHERE clause does not update rows that already contain the matching data.
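For instance, instead of looping over rows and issuing one statement per iteration, a single set-based statement can do the work; the table and columns below are invented for illustration:

-- One UPDATE handles every qualifying row, and the extra predicate skips
-- rows that already hold the target value.
UPDATE accounts
SET    status = 'inactive'
WHERE  last_login < '2020-01-01'
  AND  status <> 'inactive';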
4. Use Execution Plans
The execution plan tool provided by the query optimizer plays a major role in tuning SQL databases and
also helps in creating proper indexes. Its main function is to display graphically the various methods
available for retrieving the data, which in turn helps in creating the needed indexes and taking the other
steps required to optimize the database.
Of course, there are many other ways to tune a SQL database efficiently, and there is a good chance that
the steps mentioned above are not the right choice for every database; each database will require
optimizing techniques uniquely specific to its needs.
The more data you have, the slower data retrieval can be if your database is not optimized. Performance
tuning lets you build indexes and eliminate problems which could cause your data retrieval to be slower
than it has to be. Nothing frustrates your team of employees more than waiting for the database to
finish its searches, and that frustration will eventually reach your customers or clients.
Coding loops can have the effect of hammering your database. When there is an SQL query inside of a
loop, the query is run multiple times. If you move the query out of the loop, on the other hand, a
performance boost can occur because the query is run just once instead of many iterations.
With all of these improvements to the efficiency of your database, it can handle more information. That
scalability keeps performance from degrading as the system grows and keeps all users satisfied with
their experience.
Database security refers to the range of tools, controls, and measures designed to establish and
preserve database confidentiality, integrity, and availability. This article will focus primarily on
confidentiality since it’s the element that’s compromised in most data breaches.
Database security is a complex and challenging endeavor that involves all aspects of information security
technologies and practices. It’s also naturally at odds with database usability. The more accessible and
usable the database, the more vulnerable it is to security threats; the more invulnerable the database is
to threats, the more difficult it is to access and use. (This paradox is sometimes referred to as Anderson's
Rule.)
Why is it important?
By definition, a data breach is a failure to maintain the confidentiality of data in a database. How much
harm a data breach inflicts on your enterprise depends on a number of consequences or factors:
Compromised intellectual property: Your intellectual property—trade secrets, inventions,
proprietary practices—may be critical to your ability to maintain a competitive advantage in
your market. If that intellectual property is stolen or exposed, your competitive advantage may
be difficult or impossible to maintain or recover.
Damage to brand reputation: Customers or partners may be unwilling to buy your products or
services (or do business with your company) if they don’t feel they can trust you to protect your
data or theirs.
Business continuity (or lack thereof): Some businesses cannot continue to operate until a breach
is resolved.
Fines or penalties for non-compliance: The financial impact of failing to comply with global
regulations such as the Sarbanes-Oxley Act (SOX) or the Payment Card Industry Data Security
Standard (PCI DSS), industry-specific data privacy regulations such as HIPAA, or regional data
privacy regulations such as Europe's General Data Protection Regulation (GDPR) can be
devastating, with fines in the worst cases exceeding several million dollars per violation.
Costs of repairing breaches and notifying customers: In addition to the cost of communicating a
breach to customers, a breached organization must pay for forensic and investigative activities,
crisis management, triage, repair of the affected systems, and more.
Insider threats
An insider threat is a security threat from someone with privileged access to the database, such as a
malicious insider, a negligent insider whose mistakes expose the database, or an outside infiltrator who
has obtained privileged credentials.
Insider threats are among the most common causes of database security breaches and are often the
result of allowing too many employees to hold privileged user access credentials.
Human error
Accidents, weak passwords, password sharing, and other unwise or uninformed user behaviors continue
to be the cause of nearly half (49%) of all reported data breaches.
SQL/NoSQL injection attacks
A database-specific threat, these involve the insertion of arbitrary SQL or non-SQL attack strings into
database queries served by web applications or HTTP headers. Organizations that don't follow secure
web application coding practices and perform regular vulnerability testing are open to these attacks.
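The standard defense is to pass user input only as bound parameters rather than concatenating it into the query text. A minimal illustration using MySQL's server-side prepared statements (the table and columns are invented for this example):

-- The user-supplied value is bound as data, so injected SQL is not executed.
PREPARE find_user FROM
    'SELECT user_id, user_name FROM app_users WHERE user_name = ?';
SET @input = 'alice';          -- value received from the application
EXECUTE find_user USING @input;
DEALLOCATE PREPARE find_user;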
Buffer overflow occurs when a process attempts to write more data to a fixed-length block of memory
than it is allowed to hold. Attackers may use the excess data, stored in adjacent memory addresses, as a
foundation from which to launch attacks.
In a denial of service (DoS) attack, the attacker deluges the target server—in this case the database
server—with so many requests that the server can no longer fulfill legitimate requests from actual users,
and, in many cases, the server becomes unstable or crashes.
In a distributed denial of service attack (DDoS), the deluge comes from multiple servers, making it more
difficult to stop the attack.
10.0 ALGORITHM AND PROGRAMS FOR DATABASE FILE PROCESSING
Files are used to store various documents. All files are grouped based on their categories. The file names
are closely related to each other and arranged properly so that the files can be accessed easily. In a file
processing system, if one needs to insert, delete, modify, store, or update data, one must know the
entire hierarchy of the files.
Cost friendly –
There is a very minimal to no set up and usage fee for File Processing System. (In most cases,
free tools are inbuilt in computers.)
Easy to use –
File systems require very basic learning and understanding, hence, can be easily used.
High scalability –
One can very easily switch from smaller to larger files as needed.
However, the File Processing System also has significant disadvantages. Chief among them is weak
security: anyone who gets access to a file can read or modify its data.
Note –
Database Management System (DBMS) is used to eliminate the disadvantages of the FPS.
Algorithms:
Many modern database platforms provide in-database algorithms, including linear regression, logistic
regression, k-means clustering, Naive Bayes classification, random forest decision trees, and support
vector machine regression and classification, and they allow deployment of machine learning models
across multiple clusters. Popular database management systems that can be used for such processing
include:
1. MySQL
2. SQL Server
3. Oracle
4. dBASE
5. FoxPro etc
I will only discuss MySQL because it is the most widely used database system. Once you know how to
enter SQL statements, you are ready to access a database. Suppose that you have several pets in your
home (your menagerie) and you would like to keep track of various types of information about them.
You can do so by creating tables to hold your data and loading them with the desired information. Then
you can answer different sorts of questions about your animals by retrieving data from the tables. This
section shows you how to perform the following operations:
• Create a database
• Create a table
• Load data into the table
• Retrieve data from the table in various ways
• Use multiple tables
The menagerie database is simple (deliberately), but it is not difficult to think of real-world situations in
which a similar type of database might be used. For example, a database like this could be used by a
farmer to keep track of livestock, or by a veterinarian to keep track of patient records. A menagerie
distribution containing some of the queries and sample data used in the following sections can be
obtained from the MySQL website. It is available in both compressed tar file and Zip formats at
https://dev.mysql.com/doc/.
Use the SHOW statement to find out what databases currently exist on the server:
mysql> SHOW DATABASES;
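Continuing in the same spirit, here is a brief sketch, closely following the MySQL tutorial's menagerie example, of creating the database and the pet table and loading one sample row; the values shown are purely illustrative:

mysql> CREATE DATABASE menagerie;
mysql> USE menagerie;
mysql> CREATE TABLE pet (name VARCHAR(20), owner VARCHAR(20),
    ->     species VARCHAR(20), sex CHAR(1), birth DATE, death DATE);

mysql> INSERT INTO pet
    -> VALUES ('Puffball', 'Diane', 'hamster', 'f', '1999-03-30', NULL);

mysql> SELECT * FROM pet;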