CS3361 - Data Science Lab Manual

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CS3361 DATA SCIENCE LABORATORY

II YEAR A & B / BATCH : 2022-26

Name :_____________________

Register No :_____________________

Lab : Data Science Laboratory

Branch/Sem : CSE / III


B.E. - II YEAR /III SEMESTER

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

This is a Certified Bonafide Record Work of

Register No. ______________ submitted for the Anna University Practical Examination

held on ______________ in CS3361 - Data Science Laboratory during the year 2023-2024.

Signature of the HOD Signature of the Lab-In-Charge

Date: Internal Examiner:

External Examiner:
Vision of Institution
To build Jeppiaar Engineering College as an Institution of Academic Excellence in Technical
education and Management education and to become a World Class University.

Mission of Institution

M1 To excel in teaching and learning, research and innovation by promoting the principles of
scientific analysis and creative thinking.

M2 To participate in the production, development and dissemination of knowledge and interact
with national and international communities.

M3 To equip students with values, ethics and life skills needed to enrich their lives and enable
them to meaningfully contribute to the progress of society.

M4 To prepare students for higher studies and lifelong learning, enrich them with the practical
and entrepreneurial skills necessary to excel as future professionals and contribute to the
Nation’s economy.

Program Outcomes (POs)


PO1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering problems.

PO2 Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.

PO3 Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.

PO4 Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.

PO5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations.

PO6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.

PO7 Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for,
sustainable development.

PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.

PO9 Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.

PO10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.

PO11 Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.

PO12 Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological change.

Vision of Department
To emerge as a globally prominent department, developing ethical computer professionals,
innovators and entrepreneurs with academic excellence through quality education and research.

Mission of Department

M1 To create computer professionals with an ability to identify and formulate the engineering
problems and also to provide innovative solutions through effective teaching-learning process.

M2 To strengthen the core competence in computer science and engineering and to create an
ability to interact effectively with industries.

M3 To produce engineers with good professional skills, ethical values and life skills for the
betterment of the society.

M4 To encourage students towards continuous and higher level learning on technological
advancements and provide a platform for employment and self-employment.
Program Educational Objectives (PEOs)

PEO1 To address the real time complex engineering problems using an innovative approach
with strong core computing skills.

PEO2 To apply core analytical knowledge and appropriate techniques and provide solutions to
real time challenges of national and global society.

PEO3 Apply ethical knowledge for professional excellence and leadership for the betterment of
the society.

PEO4 Develop life-long learning skills needed for better employment and entrepreneurship.

Program Specific Outcomes (PSOs)

Students will be able to

PSO1 An ability to understand the core concepts of computer science and engineering and to
enrich problem solving skills to analyze, design and implement software and hardware based
systems of varying complexity.

PSO2 To interpret real-time problems with analytical skills and to arrive at cost effective and
optimal solution using advanced tools and techniques.

PSO3 An understanding of social awareness and professional ethics with practical proficiency in
the broad area of programming concepts by lifelong learning to inculcate employment and
entrepreneurship skills.

COURSE OUTCOMES:

At the end of this course, the students will be able to:

CO1: Make use of the Python libraries for data science.

CO2: Make use of the basic statistical and probability measures for data science.

CO3: Perform descriptive analytics on the benchmark data sets.

CO4: Perform correlation and regression analytics on standard data sets.

CO5: Present and interpret data using visualization packages in Python.


INDEX

EXP NO.   DATE        NAME OF THE EXPERIMENT                                              PAGE NO.   SIGN

1         21/9/23     DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF NUMPY, SCIPY,
                      JUPYTER, STATSMODELS AND PANDAS PACKAGES

2         05/10/23    WORKING WITH NUMPY ARRAYS

3         19/10/23    WORKING WITH PANDAS DATA FRAMES

4         9/11/23     READING DATA FROM TEXT FILES, EXCEL AND THE WEB AND EXPLORING
                      VARIOUS COMMANDS FOR DOING DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET

5         23/11/23    USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES DATA
                      SET FOR PERFORMING UNIVARIATE ANALYSIS, BIVARIATE ANALYSIS AND
                      MULTIPLE REGRESSION ANALYSIS

6         30/11/23    APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS

7         13/12/23    VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
EX NO: DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF NUMPY, SCIPY,
JUPYTER, STATSMODELS AND PANDAS PACKAGES

i. NUMPY PACKAGE

Array processing for numbers, strings, records, and objects.

To install this package with conda run:


conda install -c anaconda numpy

Description:

NumPy is the fundamental package needed for scientific computing with Python.

How to check the Numpy version

1. Use the pip list or pip3 list command.


2. From command line type: pip3 show numpy or pip show numpy.
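
The version can also be checked from Python itself. A minimal sketch (assuming the packages used in this exercise are already installed) that prints each installed version:

import numpy
import scipy
import pandas
import statsmodels

print("NumPy      :", numpy.__version__)
print("SciPy      :", scipy.__version__)
print("pandas     :", pandas.__version__)
print("statsmodels:", statsmodels.__version__)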

OUTPUT:

ii. SCIPY PACKAGE

Scientific Library for Python


To install this package with conda run:
conda install -c anaconda scipy

Description

SciPy is a Python-based ecosystem of open-source software for mathematics, science, and


engineering.
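
As an optional check that the installation works, the short sketch below (illustrative values only, not part of the recorded output) evaluates the standard normal density with scipy.stats and integrates it with scipy.integrate:

import numpy as np
from scipy import stats, integrate

# Density of the standard normal distribution at a few points
x = np.array([-1.0, 0.0, 1.0])
print(stats.norm.pdf(x, loc=0, scale=1))

# Integrating the density over the whole real line should give approximately 1
area, error = integrate.quad(stats.norm.pdf, -np.inf, np.inf)
print(area)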

OUTPUT:

iii. JUPYTER PACKAGE


To install Jupyter using Anaconda, just go through the following instructions:

1. Launch Anaconda Navigator:


2. Click on the Install Jupyter Notebook Button:
3. Beginning the Installation:
4. Loading Packages:
5. Finished Installation:

Alternatively, JupyterLab can be installed with pip: pip install jupyterlab.

OUTPUT:

iv. STATSMODELS PACKAGE

Statistical computations and models for use with SciPy

To install this package with conda run:


conda install -c anaconda statsmodels

Description
Statsmodels is a Python module that allows users to explore data, estimate statistical models, and
perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting
functions, and result statistics are available for different types of data and each estimator.
Researchers across fields may find that statsmodels fully meets their needs for statistical
computing and data analysis in Python.
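
A minimal sketch (synthetic data, not part of the recorded exercise) of the kind of model statsmodels fits; the same OLS API is used in the regression experiment later in this manual:

import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x plus noise
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)

X = sm.add_constant(x)        # add the intercept term
model = sm.OLS(y, X).fit()    # ordinary least squares fit
print(model.params)           # estimated intercept and slope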

OUTPUT:

v. PANDAS PACKAGE

To install this package with conda run:

conda install -c anaconda pandas

To install with pip run:

pip install pandas
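
Description

pandas provides labelled data structures (Series and DataFrame) for data manipulation and analysis. A minimal sketch (illustrative values only) to confirm the installation:

import pandas as pd

print(pd.__version__)

# A tiny DataFrame to confirm that pandas works
df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})
print(df.describe())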

OUTPUT:

RESULT:

Thus the download, installation and exploration of the features of the NumPy, SciPy, Jupyter, Statsmodels
and Pandas packages were completed successfully.

EX NO:
WORKING WITH NUMPY ARRAYS

AIM: To work with numpy arrays.

PROCEDURE:

(i) To check numpy version

Step 1 : Import the NumPy library


Step 2 : Access the version attribute of NumPy
Step 3 : Print the Numpy Version

(ii) Attributes of arrays (NumPy array attributes)

Step 1 : Import the NumPy library and alias it as `np`.


Step 2 : Set a random seed using `np.random.seed(0)` for reproducibility.
Step 3 : Create arrays `x1`, `x2` and `x3` using `np.random.randint` with sizes 6, (3, 4) and (3, 4, 5) respectively.
Step 4 : Print the number of dimensions of `x3` using `x3.ndim`.
Step 5 : Print the shape of `x3` using `x3.shape`.
Step 6 : Print the size of `x3` using `x3.size`.
Step 7 : Print the data type of `x3` using `x3.dtype`.
Step 8 : Print the size of each element in `x3` in bytes using `x3.itemsize`.
Step 9 : Print the total size of the array `x3` in bytes using `x3.nbytes`.

(iii) Indexing of Arrays


Array Indexing: Accessing Single Elements :

Step 1 : Import the NumPy library and alias it as np.


Step 2 : Create a one-dimensional array arr using np.array([5, 0, 3, 3, 7, 9]).
Step 3: Print the element at index 2 of the array using print(arr[2]).
Step 4 : Print the last element of the array using print(arr[-1]).
Step 5 : Print the element at index 5 of the array using print(arr[5]).
Step 6 : Print the second-to-last element of the array using print(arr[-2])

Array Indexing: Multidimensional Array :


Step 1: Import the NumPy library and alias it as np.
Step 2: Create a 2-dimensional array arr using np.array([[3, 5, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]]).
Step 3: Print the element at the first row and first column of the array using print(arr[0, 0]).
Step 4: Print the element at the third row and second-to-last column of the array using print(arr[2, -2]).

(iv) Slicing of Arrays


Array Slicing: Accessing Subarrays
Step 1: Import the NumPy library and alias it as np.
Step 2: Create a one-dimensional array arr using np.array([0,1,2,3,4,5,6,7,8,9]).
Step 3: Create another one-dimensional array arr1 using np.arange(10).
Step 4: Print elements from index 1 to 4 of the array arr using print(arr[1:5]).
Step 5: Print elements from index 2 to 6 of the array arr1 using print(arr1[2:7]).
Step 6: Print elements from index 5 to the end of the array arr using print(arr[5:]).
Step 7: Print elements from the beginning of the array arr up to index 4 using print(arr[:5]).
Step 8: Print elements from index -3 to -1 of the array arr1 using print(arr1[-3:-1]).
Step 9: Print every second element of the array arr1 using print(arr1[::2]).
Step 10: Print elements starting from index 0 with a step of 3 in the array arr using print(arr[0::3]).
Step 11: Print the array arr1 in reverse order using print(arr1[::-1]).

Array Slicing: Multidimensional subarrays

Step 1: Import the NumPy library and alias it as np.


Step 2: Create a 2-dimensional array arr using np.array([[12, 5, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]]).
Step 3: Print the subarray containing the first two rows and the first three columns of arr using print(arr[:2, :3]).
Step 4: Print the subarray containing all rows and every other column of arr using print(arr[:3, ::2]).
Step 5: Print the reversed subarray of arr by reversing both rows and columns using print(arr[::-1, ::-1]).

(v) Reshaping of Arrays

Step 1: Import the NumPy library and alias it as np.


Step 2: Create a one-dimensional array grid using np.arange(0, 9).
Step 3: Print the one-dimensional array grid using print(grid).
Step 4: Reshape the one-dimensional array grid into a 3x3 two-dimensional array using grid.reshape(3,3).
Step 5: Print the reshaped 3x3 array using print(grid.reshape(3,3)).

PROGRAM

i. To check the Numpy version:

import numpy

numpy.__version__

OUTPUT:

'1.21.5'

ii. Attributes of arrays (NumPy array attributes)

import numpy as np

np.random.seed(0) # seed for reproducibility

x1 = np.random.randint(10, size=6) # One-dimensional array

x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array

x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array


print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:", x3.dtype)
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")

OUTPUT:

x3 ndim: 3

x3 shape: (3, 4, 5)

x3 size: 60
dtype: int32

itemsize: 4 bytes

nbytes: 240 bytes

iii. Indexing of Arrays
Getting and setting the value of individual array elements.
Array Indexing: Accessing Single Elements

If you are familiar with Python’s standard list indexing, indexing in NumPy will feel quite
familiar. In a one-dimensional array, you can access the ith value (counting from zero) by
specifying the desired index in square brackets, just as with Python lists
To index from the end of the array, you can use negative indices.

import numpy as np

arr=np.array([5, 0, 3, 3, 7, 9])

print(arr[2])

print(arr[-1])

print(arr[5])

print(arr[-2])

OUTPUT:

3
9
9
7

Array Indexing: Multidimensional Array

In a multidimensional array, you access items using a comma-separated tuple of indices:

import numpy as np

arr=np.array([[3, 5, 2, 4],[7, 6, 8, 8],[1, 6, 7, 7]])

print(arr[0,0])

print(arr[2,-2])

OUTPUT:

3
7

iv. Slicing of Arrays

Getting and setting smaller subarrays within a larger array

Array Slicing: Accessing Subarrays


Just as we can use square brackets to access individual array elements, we can also use them to
access subarrays with the slice notation, marked by the colon (:) character. The NumPy slicing
syntax follows that of the standard Python list; to access a slice of an array x, use this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.

import numpy as np

arr=np.array([0,1,2,3,4,5,6,7,8,9])

arr1=np.arange(10)

print(arr[1:5])

print(arr1[2:7])

print(arr[5:])

print(arr[:5])

print(arr1[-3:-1])

print(arr1[::2])

print(arr[0::3])

print(arr1[::-1]) # all elements, reversed

OUTPUT:

[1 2 3 4]

[2 3 4 5 6]

[5 6 7 8 9]

[0 1 2 3 4]

[7 8]

[0 2 4 6 8]

[0 3 6 9]

[9 8 7 6 5 4 3 2 1 0]

Array Slicing: Multidimensional subarrays

import numpy as np

arr=np.array([[12, 5, 2, 4],[ 7, 6, 8, 8],[ 1, 6, 7, 7]])

print(arr[:2, :3]) # two rows, three columns

print(arr[:3, ::2]) # all rows, every other column

print(arr[::-1, ::-1]) #Finally, subarray dimensions can even be reversed together:

OUTPUT:

[[12 5 2]

[ 7 6 8]]

[[12 2]

[ 7 8]

[ 1 7]]

[[ 7 7 6 1]

[ 8 8 6 7]

[ 4 2 5 12]]

v. Reshaping of Arrays
Changing the shape of a given array. Another useful type of operation is reshaping of arrays. The
most flexible way of doing this is with the reshape() method. For example, if you want to put the
numbers 1 through 9 in a 3×3 grid, you can do the following:

import numpy as np

grid = np.arange(0, 9)

print(grid)

print(grid.reshape(3,3))

OUTPUT:

[0 1 2 3 4 5 6 7 8]

[[0 1 2]

[3 4 5]

[6 7 8]]

RESULT: We have learnt to work with NumPy arrays.

EX NO:
WORKING WITH PANDAS DATA FRAMES

AIM: To work with pandas data frames.

PROCEDURE:

(i) Create a Simple Pandas DataFrame

Step 1: Import the pandas library and alias it as pd.


Step 2: Create a dictionary data with keys "calories" and "duration" and corresponding values.
Step 3: Create a DataFrame df using the pd.DataFrame(data) function to load the data.
Step 4: Print the DataFrame df using print(df).
Step 5: Access and print the row at index 0 using print(df.loc[0]).
Step 6: Access and print the rows at indexes 0 and 1 using print(df.loc[[0, 1]]).

(ii) a) Named Indexes

Step 1: Import the pandas library and alias it as pd.


Step 2: Create a dictionary data with keys "calories" and "duration" and corresponding values.
Step 3: Create a DataFrame df using the pd.DataFrame(data, index=["day1", "day2", "day3"]) function to load the
data with custom row indexes.
Step 4: Print the DataFrame df using print(df).
b) Locate Named Indexes
Step 1: Access and print the row with the named index "day2" using the df.loc["day2"] syntax.

(iii) Creating Pandas dataframe from lists using dictionary

Method #1: Creating DataFrame using dictionary of lists


(a) Transform a dictionary of lists into a dataframe.
Step 1: Import the pandas library and alias it as pd.
Step 2: Define a dictionary data containing employee data with keys as column names and values as lists of
corresponding data.
Step 3: Convert the dictionary data into a DataFrame df using the pd.DataFrame(data) function.
Step 4: Print the DataFrame df using print(df).
Step 5: Select and print two columns, namely "Name" and "Qualification", from the DataFrame df using
print(df[['Name', 'Qualification']]).
(b) Adding index to a dataframe explicitly :

Step 1: Import the pandas library and alias it as pd.


Step 2: Define a dictionary data containing employee data with keys as column names and values as lists of
corresponding data.
Step 3: Explicitly add index values ['Rollno1', 'Rollno2', 'Rollno3', 'Rollno4'] to the DataFrame df using
pd.DataFrame(data, index=['Rollno1', 'Rollno2', 'Rollno3', 'Rollno4']).
Step 4: Print the DataFrame df using print(df).
Method #2: Using from_dict() function
Step 1: Import the pandas library and alias it as `pd`.
Step 2: Define a dictionary `data` containing employee data with keys as column names and values as lists of
corresponding data.
Step 3: Create a DataFrame `df` using the `pd.DataFrame.from_dict(data)` function, which converts the dictionary
into a DataFrame.
Step 4: Print the DataFrame `df` using `print(df)`.
Method #3: Creating dataframe by passing lists variables to dictionary
Step 1: Import the pandas library and alias it as `pd`.
Step 2: Define four lists: `name`, `age`, `address`, and `qualification`, containing employee data.
Step 3: Create a dictionary `data` using the defined lists.
Step 4: Create a DataFrame `df` from the dictionary `data` using `pd.DataFrame(data)`.
Step 5: Print the DataFrame `df` using `print(df)`.
Step 6: Create another DataFrame `df1` from the same dictionary `data`, but explicitly add index values ['No1', 'No2',
'No3', 'No4'] using `pd.DataFrame(data, index=['No1', 'No2', 'No3', 'No4'])`.
Step 7: Print the DataFrame `df1` using `print(df1)`.

(iv) Dealing with Rows and Columns in Pandas dataFrame


(a) Dealing with columns
Step 1: Import the pandas library and alias it as `pd`.
Step 2: Define a dictionary `data` containing employee data with keys as column names and values as lists of
corresponding data.
Step 3: Convert the dictionary `data` into a DataFrame `df` using the `pd.DataFrame(data)` function.
Step 4: Print two columns, namely "Name" and "Qualification", from the DataFrame `df` using `print(df[['Name',
'Qualification']])`.
b) Column Addition
Step 1: Import the necessary library - `import pandas as pd`.
Step 2: Define a dictionary `data` containing employee data with 'Name', 'Age', and 'Qualification' as keys.
Step 3: Convert the dictionary into a DataFrame using `df = pd.DataFrame(data)`.
Step 4: Declare a list `address` that is to be converted into a new column.
Step 5: Add the new column 'Address' to the DataFrame using `df['Address'] = address`.
Step 6: Print the DataFrame using `print(df)`.
(v) Indexing and selecting data in Pandas DataFrame using [ ], loc & iloc
(a) Creating a Dataframe to Select Rows & Columns in Pandas
Step 1: Import the necessary library - `import pandas as pd`.
Step 2: Create a list of tuples named `employees` with employee data.
Step 3: Create a DataFrame `df` from the list of tuples with columns specified as 'Name', 'Age', 'City', and 'Salary'
using `pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Salary'])`.
Step 4: Print the DataFrame using `print(df)`.
b) Select Columns by Name in Pandas DataFrame using [ ]

Example 1
Step 1: Import the necessary library - `import pandas as pd`.
Step 2: Create a list of tuples named `employees` with employee data.
Step 3: Create a DataFrame `df` from the list of tuples with columns specified as 'Name', 'Age', 'City', and
'Salary' using `pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Salary'])`.
Step 4: Select the 'City' column from the DataFrame using `result = df["City"]`.
Step 5: Print the result using `print(result)`.
Example 2 :
Step 1: Import the necessary library - `import pandas as pd`.
Step 2: Create a list of tuples named `employees` with employee data.
Step 3: Create a DataFrame `df` from the list of tuples with columns specified as 'Name', 'Age', 'City', and
'Salary' using `pd.DataFrame(employees, columns=['Name', 'Age', 'City', 'Salary'])`.
Step 4: Select multiple columns ('Name', 'Age', 'Salary') from the DataFrame using `result = df[["Name",
"Age", "Salary"]]`.
Step 5: Print the result using `print(result)`.

PROGRAM:

A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table
with rows and columns.
i. Create a Simple Pandas DataFrame

import pandas as pd
data={"calories":[420,380,390],"duration":[50,40,45]}
df=pd.DataFrame(data) #Load data into DataFrame object
print(df)
print(df.loc[0]) #Refer to the row index
print(df.loc[[0,1]]) #Use a list of indexes

OUTPUT:
calories duration
0 420 50
1 380 40
2 390 45

calories 420
duration 50
Name: 0, dtype: int64

   calories  duration
0       420        50
1       380        40

ii. a) Named Indexes


With the index argument, you can name your own indexes.

import pandas as pd
data={"calories":[420,380,390],"duration":[50,40,45]}
df=pd.DataFrame(data,index=["day1","day2","day3"])
print(df)

OUTPUT:
calories duration
day1 420 50
day2 380 40
day3 390 45

b) Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).

Use the syntax:


print(df.loc["day2"])
OUTPUT:
calories 380
duration 40
Name: day2, dtype: int64

iii. Creating Pandas dataframe from lists using dictionary

Method #1: Creating DataFrame using dictionary of lists


a)
With this method in pandas, we can transform a dictionary of lists into a dataframe.

import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print(df)
# select two columns
print(df[['Name', 'Qualification']])

OUTPUT:
As is evident from the output, the keys of the dictionary are converted into the columns of the dataframe,
whereas the elements of the lists are converted into the rows.

Name Age Address Qualification


0 Jai 27 Delhi Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd

Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd

b) Adding index to a dataframe explicitly

import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Adding index value explicitly
df = pd.DataFrame(data,index=['Rollno1','Rollno2','Rollno3','Rollno4'])
print(df)

OUTPUT:
Name Age Address Qualification
Rollno1 Jai 27 Delhi Msc
Rollno2 Princi 24 Kanpur MA
Rollno3 Gaurav 22 Allahabad MCA
Rollno4 Anuj 32 Kannauj Phd

Method #2: Using from_dict() function

import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame.from_dict(data) #from_dict() function
print(df)

OUTPUT:
Name Age Address Qualification
0 Jai 27 Delhi Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd

Method #3: Creating dataframe by passing lists variables to dictionary


import pandas as pd
#Dictionary of lists
name=['Jai', 'Princi', 'Gaurav', 'Anuj']
age=[27, 24, 22, 32]
address=['Delhi', 'Kanpur', 'Allahabad', 'Kannauj']
qualification=['Msc', 'MA', 'MCA', 'Phd']
data={'Name':name,'Age':age,'Address':address,'Qualification':qualification}
df=pd.DataFrame(data)
print(df)
df1=pd.DataFrame(data,index=['No1','No2','No3','No4']) #Explicitly add index value
print(df1)

OUTPUT
Name Age Address Qualification
0 Jai 27 Delhi Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd

Name Age Address Qualification


No1 Jai 27 Delhi Msc
No2 Princi 24 Kanpur MA
No3 Gaurav 22 Allahabad MCA
No4 Anuj 32 Kannauj Phd

iv. Dealing with Rows and Columns in Pandas dataFrame


A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows
and columns. We can perform basic operations on rows/columns like selecting, deleting, adding
and renaming.

a) Dealing with columns


In order to deal with columns, we perform basic operations on columns like selecting, deleting,
adding and renaming.

Column selection
In order to select a column in a Pandas DataFrame, we can access the columns by calling
them by their column names.

import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# select two columns
print(df[['Name', 'Qualification']])

OUTPUT:
Name Qualification
0 Jai Msc
1 Princi MA
2 Gaurav MCA
3 Anuj Phd

Column Addition
In order to add a column to a Pandas DataFrame, we can declare a new list as a column and add it to
an existing DataFrame.

import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
address=['Delhi', 'Kanpur', 'Allahabad', 'Kannauj']
df['Address']=address

# select two columns
print(df)

OUTPUT:
Name Age Qualification Address
0 Jai 27 Msc Delhi
1 Princi 24 MA Kanpur
2 Gaurav 22 MCA Allahabad
3 Anuj 32 Phd Kannauj

v. Indexing and selecting data in Pandas DataFrame using [ ], loc & iloc
Indexing in Pandas means selecting rows and columns of data from a DataFrame. It can mean
selecting all the rows and a particular number of columns, a particular number of rows and all
the columns, or a particular number of rows and columns each. Indexing is also known as
subset selection.

Creating a Dataframe to Select Rows & Columns in Pandas


Create a DataFrame from a list of tuples; the column names are ‘Name’, ‘Age’, ‘City’, and ‘Salary’.

import pandas as pd

# List of Tuples
employees = [('Stuti', 28, 'Varanasi', 20000),
('Saumya', 32, 'Delhi', 25000),
('Aaditya', 25, 'Mumbai', 40000),
('Saumya', 32, 'Delhi', 35000),
('Saumya', 32, 'Delhi', 30000),
('Saumya', 32, 'Mumbai', 20000),
('Aaditya', 40, 'Dehradun', 24000),
('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Show the dataframe

print(df)
OUTPUT

Name Age City Salary


0 Stuti 28 Varanasi 20000
1 Saumya 32 Delhi 25000
2 Aaditya 25 Mumbai 40000
3 Saumya 32 Delhi 35000
4 Saumya 32 Delhi 30000
5 Saumya 32 Mumbai 20000
6 Aaditya 40 Dehradun 24000
7 Seema 32 Delhi 70000

Select Columns by Name in Pandas DataFrame using [ ]


The [ ] is used to select a column by mentioning the respective column name.
Example 1:
Select a single column.
import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),


('Saumya', 32, 'Delhi', 25000),
('Aaditya', 25, 'Mumbai', 40000),
('Saumya', 32, 'Delhi', 35000),
('Saumya', 32, 'Delhi', 30000),
('Saumya', 32, 'Mumbai', 20000),
('Aaditya', 40, 'Dehradun', 24000),
('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,

columns=['Name', 'Age','City', 'Salary'])


# Using the operator []

# to select a column

result = df["City"]

print(result)

OUTPUT:

0 Varanasi
1 Delhi
2 Mumbai
3 Delhi
4 Delhi
5 Mumbai
6 Dehradun
7 Delhi

Name: City, dtype: object

Example 2:
Select multiple columns.
import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),


('Saumya', 32, 'Delhi', 25000),
('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),


('Saumya', 32, 'Delhi', 30000),
('Saumya', 32, 'Mumbai', 20000),
('Aaditya', 40, 'Dehradun', 24000),
('Seema', 32, 'Delhi', 70000)]
# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age', 'City', 'Salary'])

# Using the operator [] to

# select multiple columns

result = df[["Name", "Age", "Salary"]]


print(result)

OUTPUT:
Name Age Salary
0 Stuti 28 20000
1 Saumya 32 25000
2 Aaditya 25 40000

3 Saumya 32 35000
4 Saumya 32 30000
5 Saumya 32 20000
6 Aaditya 40 24000
7 Seema 32 70000

Select Rows by Name in Pandas DataFrame using loc


The .loc[] function selects the data by labels of rows or columns. It can select a subset of rows
and columns. There are many ways to use this function.
Example 1:
Select a single row.
import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Set 'Name' column as index

# on a Dataframe

df.set_index("Name", inplace = True)


# Using the operator .loc[]

# to select single row

result = df.loc["Stuti"]

# Show the dataframe

print(result)

OUTPUT:

Age 28

City Varanasi

Salary 20000

Name: Stuti, dtype: object

Example 2:
Select multiple rows.

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Set 'Name' column as index

# on a Dataframe

df.set_index("Name", inplace = True)

# Using the operator .loc[]

# to select single row

result = df.loc[["Stuti","Seema","Aaditya"]]

# Show the dataframe

print(result)

OUTPUT:

Name Age City Salary

Stuti 28 Varanasi 20000

Seema 32 Delhi 70000

Aaditya 25 Mumbai 40000

Aaditya 40 Dehradun 24000

Example 3:
Select multiple rows and particular columns.
Syntax: Dataframe.loc[["row1", "row2"...], ["column1", "column2", "column3"...]]

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Set 'Name' column as index

# on a Dataframe

df.set_index("Name", inplace = True)

# Using the operator .loc[] to

# select multiple rows with some

# multiple columns

result = df.loc[["Stuti", "Seema"],["City", "Salary"]]

# Show the dataframe

print(result)

OUTPUT:

Name City Salary

Stuti Varanasi 20000

Seema Delhi 70000

Example 4:
Select all the rows with some particular columns. We use a single colon [:] to select all rows
and the list of columns that we want to select, as given below:
Syntax: DataFrame.loc[:, ["column1", "column2", "column3"]]

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Creating a DataFrame object from list

df = pd.DataFrame(employees,

columns =['Name', 'Age','City', 'Salary'])

# Set 'Name' column as index

# on a Dataframe

df.set_index("Name", inplace = True)

# Using the operator .loc[] to

# select all the rows with

# some particular columns

result = df.loc[:, ["City", "Salary"]]

# Show the dataframe

print(result)

OUTPUT:

Name City Salary

Stuti Varanasi 20000

Saumya Delhi 25000

Aaditya Mumbai 40000

Saumya Delhi 35000

Saumya Delhi 30000

Saumya Mumbai 20000

Aaditya Dehradun 24000

Seema Delhi 70000

Select Rows by Index in Pandas DataFrame using iloc


The iloc[ ] is used for selection based on position. It is similar to loc[] indexer but it takes only
integer values to make selections.

Example 1:

Select a single row.

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Using the operator .iloc[]

# to select single row

result = df.iloc[2]

# Show the dataframe

print(result)

OUTPUT:

Name Aaditya

Age 25

City Mumbai

Salary 40000

Name: 2, dtype: object

Example 2:
Select multiple rows.

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,

columns=['Name', 'Age','City', 'Salary'])

# Using the operator .iloc[]

# to select multiple rows

result = df.iloc[[2, 3, 5]]

# Show the dataframe

print(result)

OUTPUT:
Name Age City Salary
2 Aaditya 25 Mumbai 40000
3 Saumya 32 Delhi 35000
5 Saumya 32 Mumbai 20000

Example 3:
Select multiple rows with some particular columns.

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),


('Aaditya', 25, 'Mumbai', 40000),
('Saumya', 32, 'Delhi', 35000),
('Saumya', 32, 'Delhi', 30000),
('Saumya', 32, 'Mumbai', 20000),
('Aaditya', 40, 'Dehradun', 24000),
('Seema', 32, 'Delhi', 70000)]
# Creating a DataFrame object from list

df = pd.DataFrame(employees,

columns =['Name', 'Age','City', 'Salary'])

# Using the operator .iloc[]

# to select multiple rows with
# some particular columns

result = df.iloc[[2, 3, 5], [0, 1]]

# Show the dataframe

print(result)

OUTPUT:

Name Age
2 Aaditya 25
3 Saumya 32
5 Saumya 32

Example 4:
Select all the rows with some particular columns.

# import pandas

import pandas as pd

# List of Tuples

employees = [('Stuti', 28, 'Varanasi', 20000),

('Saumya', 32, 'Delhi', 25000),

('Aaditya', 25, 'Mumbai', 40000),

('Saumya', 32, 'Delhi', 35000),

('Saumya', 32, 'Delhi', 30000),

('Saumya', 32, 'Mumbai', 20000),

('Aaditya', 40, 'Dehradun', 24000),

('Seema', 32, 'Delhi', 70000)]

# Create a DataFrame object from list

df = pd.DataFrame(employees,columns =['Name', 'Age','City', 'Salary'])

# Using the operator .iloc[]

# to select all the rows with

# some particular columns

result = df.iloc[:, [0, 1]]

# Show the dataframe

print(result)

OUTPUT:
Name Age
0 Stuti 28
1 Saumya 32
2 Aaditya 25
3 Saumya 32
4 Saumya 32
5 Saumya 32
6 Aaditya 40
7 Seema 32

RESULT: We have worked with pandas data frames.

EX NO: READING DATA FROM TEXT FILES, EXCEL AND THE WEB AND
EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE
ANALYTICS ON THE IRIS DATA SET

AIM: To read data from text files, Excel and the web and to explore various commands for doing
descriptive analytics on the Iris data set.

PROCEDURE:

Step 1: Import the pandas library and alias it as `pd`.


Step 2: Read the CSV file "D:\iris_csv.csv" into a DataFrame `df` using `pd.read_csv(r"D:\iris_csv.csv")` (the raw-string prefix keeps the backslash in the Windows path from being treated as an escape).
Step 3: Print the top 5 rows of the DataFrame using `print(df.head())`.
Step 4: Print the shape of the DataFrame using `print(df.shape)`.
Step 5: Print information about the DataFrame using `print(df.info())`.
Step 6: Print descriptive statistics of the DataFrame using `print(df.describe())`.
Step 7: Print the sum of null values in each column using `print(df.isnull().sum())`.
Step 8: Print a sample of 10 rows from the DataFrame using `print(df.sample(10))`.
Step 9: Print the column names of the DataFrame using `print(df.columns)`.
Step 10: Print the entire DataFrame using `print(df)`.
Step 11: Print rows from index 10 to 20 using `print(df[10:21])`.
Step 12: Save the sliced data from rows 10 to 20 in a variable `sliced_data` for further use in analysis using
`sliced_data = df[10:21]`.
Step 13: Compute the sum, mean and median of the "sepallength" column and print them using `print("Sum:", sum_data,
"\nMean:", mean_data, "\nMedian:", median_data)`.
Step 14: Compute the minimum and maximum of the "sepallength" column and print them using `print("Minimum:", min_data,
"\nMaximum:", max_data)`.
Step 15: Print the counts of each unique value in the "class" column using `print(df["class"].value_counts())`.
Step 16: Create a scatter plot of "sepallength" against "sepalwidth" using `df.plot(kind="scatter", x="sepallength",
y="sepalwidth")`.
Step 17: Create a scatter plot with green color and marker size 70 using `df.plot(kind="scatter", x="sepallength",
y="sepalwidth", color="green", s=70)`.

PROGRAM:

import pandas as pd

# Reading the CSV file

df = pd.read_csv(r"D:\iris_csv.csv")  # raw string so the backslash in the Windows path is not treated as an escape

# Printing top 5 rows

print(df.head())

print(df.shape)

print(df.info())

print(df.describe())

print(df.isnull().sum())

print(df.sample(10))

print(df.columns)

print(df)

#data[start:end]

#start is inclusive whereas end is exclusive

print(df[10:21])

# it will print the rows from 10 to 20.

# you can also save it in a variable for further use in analysis

sliced_data=df[10:21]

print(sliced_data)

# data["column_name"].sum()

sum_data = df["sepallength"].sum()

mean_data = df["sepallength"].mean()

median_data = df["sepallength"].median()

print("Sum:",sum_data, "\nMean:", mean_data, "\nMedian:",median_data)

min_data=df["sepallength"].min()

max_data=df["sepallength"].max()

print("Minimum:",min_data, "\nMaximum:", max_data)

print(df["class"].value_counts())

# The pandas plot extension can be used to make a scatter plot

# Display your plot with plt.show()

df.plot(kind="scatter", x="sepallength", y="sepalwidth")

#To change color and size, add the following:

df.plot(kind="scatter", x="sepallength", y="sepalwidth",color="green",s=70)

OUTPUT:

sepallength sepalwidth petallength petalwidth class

0 5.1 3.5 1.4 0.2 Iris-setosa


1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
(150, 5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
 0   sepallength  150 non-null    float64
 1   sepalwidth   150 non-null    float64
 2   petallength  150 non-null    float64
 3   petalwidth   150 non-null    float64
 4   class        150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None

       sepallength  sepalwidth  petallength  petalwidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.054000     3.758667    1.198667
std       0.828066    0.433594     1.764420    0.763161
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000

sepallength    0
sepalwidth     0
petallength    0
petalwidth     0
class          0
dtype: int64

sepallength sepalwidth petallength petalwidth class


113 5.7 2.5 5.0 2.0 Iris-virginica
120 6.9 3.2 5.7 2.3 Iris-virginica
116 6.5 3.0 5.5 1.8 Iris-virginica
105 7.6 3.0 6.6 2.1 Iris-virginica

93 5.0 2.3 3.3 1.0 Iris-versicolor
30 4.8 3.1 1.6 0.2 Iris-setosa
27 5.2 3.5 1.5 0.2 Iris-setosa
26 5.0 3.4 1.6 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
136 6.3 3.4 5.6 2.4 Iris-virginica

Index(['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'class'], dtype='object')

sepallength sepalwidth petallength petalwidth class

0 5.1 3.5 1.4 0.2 Iris-setosa


1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

[150 rows x 5 columns]

sepallength sepalwidth petallength petalwidth class

10 5.4 3.7 1.5 0.2 Iris-setosa


11 4.8 3.4 1.6 0.2 Iris-setosa
12 4.8 3.0 1.4 0.1 Iris-setosa
13 4.3 3.0 1.1 0.1 Iris-setosa

14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
20 5.4 3.4 1.7 0.2 Iris-setosa

sepallength sepalwidth petallength petalwidth class

10 5.4 3.7 1.5 0.2 Iris-setosa


11 4.8 3.4 1.6 0.2 Iris-setosa
12 4.8 3.0 1.4 0.1 Iris-setosa
13 4.3 3.0 1.1 0.1 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
20 5.4 3.4 1.7 0.2 Iris-setosa

Sum: 876.5

Mean: 5.843333333333335

Median: 5.8

Minimum: 4.3

Maximum: 7.9

Iris-setosa 50

Iris-versicolor 50

Iris-virginica 50

Name: class, dtype: int64
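
The experiment title also covers Excel files and the web, while the listing above reads only a local CSV file. A minimal sketch of the other readers is given below; the file name, sheet name and URL are placeholders (not part of the recorded exercise), and pd.read_excel additionally needs an Excel engine such as openpyxl installed:

import pandas as pd

# Reading an Excel sheet (placeholder path and sheet name)
df_excel = pd.read_excel(r"D:\iris.xlsx", sheet_name="Sheet1")
print(df_excel.head())

# Reading a CSV file directly from the web (placeholder URL)
url = "https://example.com/iris.csv"
df_web = pd.read_csv(url)
print(df_web.describe())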

RESULT: We have read data from text files, Excel and the web and explored various commands for doing
descriptive analytics on the Iris data set.

EX NO: USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES
DATA SET FOR PERFORMING UNIVARIATE ANALYSIS, BIVARIATE
ANALYSIS AND MULTIPLE REGRESSION ANALYSIS

AIM: To use the diabetes data set from UCI and the Pima Indians Diabetes data set for performing univariate
analysis, bivariate analysis and multiple regression analysis.

PROCEDURE:

Step 1: Import the necessary libraries - pandas, matplotlib.pyplot, statsmodels.api, and seaborn.
Step 2: Read the CSV file "D:\di.csv" into a DataFrame `df` using `pd.read_csv(r"D:\di.csv")` (raw string for the Windows path).
Step 3: Print the entire DataFrame using `print(df)`.
Step 4: Calculate the mean of selected columns using `mean =
df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].mean()`.
Step 5: Print the mean values using `print(mean)`.
Step 6: Calculate the median of selected columns using `median =
df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].median()`.
Step 7: Print the median values using `print(median)`.
Step 8: Calculate the mode of selected columns using `mode =
df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].mode()`.
Step 9: Print the mode values using `print(mode)`.
Step 10: Calculate the variance of selected columns using `variance =
df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].var()`.
Step 11: Print the variance values using `print(variance)`.
Step 12: Calculate the standard deviation of selected columns using `sd =
df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].std()`.
Step 13: Print the standard deviation values using `print(sd)`.
Step 14: Perform bivariate analysis by creating a scatter plot of Age vs Glucose using `plt.scatter(df.Age,df.Glucose)`
and display it with `plt.show()`.
Step 15: Perform linear regression modeling for Age vs Glucose using Ordinary Least Squares (OLS) with `x =
sm.add_constant(df[['Age']])` and `model = sm.OLS(df['Glucose'], x).fit()`. Print the summary using
`print(model.summary())`.

PROGRAM:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
# Reading the CSV file
df = pd.read_csv(r"D:\di.csv")  # raw string for the Windows path
print(df)
#Mean
mean=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].mean()
print(mean)
#Median
median=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].median()
print(median)
#Mode
mode=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].mode()
print(mode)
#Variance
variance=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].var()
print(variance)
#StandardDeviation
sd=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].std()
print(sd)
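
The heading of part (a) also lists frequency, skewness and kurtosis, which the listing above stops short of. A minimal sketch continuing from the same DataFrame `df` (columns as above):

#Frequency of each value in a column
print(df["Age"].value_counts())
#Skewness
skewness=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].skew()
print(skewness)
#Kurtosis
kurtosis=df[["Pregnancies","Glucose","BP","Insulin","Diabetes","Age"]].kurtosis()
print(kurtosis)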

b. Bivariate analysis: Linear and logistic regression modeling

# Scatter plot of Age vs Glucose

plt.scatter(df.Age, df.Glucose)
plt.title('Age vs Glucose')
plt.xlabel('Age')
plt.ylabel('Glucose')
plt.show()
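
Part (b) also names logistic regression, which needs a binary target column; the excerpt printed above has no such column, so the sketch below assumes the full Pima Indians data set with its 0/1 'Outcome' column (an assumption, not shown in this manual), continuing with `df` and `sm` as already imported:

# Logistic regression sketch - assumes a binary 'Outcome' column (full Pima data set)
X = sm.add_constant(df[['Age', 'Glucose']])
logit_model = sm.Logit(df['Outcome'], X).fit()
print(logit_model.summary())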

c. Multiple Regression analysis


d. Also compare the results of the above analysis for the two data sets.

#print(df.corr()) #Correlation Coefficient


y=df['Glucose']
x=df[['Age']]
x=sm.add_constant(x)
model=sm.OLS(y,x).fit()
print(model.summary())

OUTPUT:

Pregnancies Glucose BP Insulin Diabetes Age

0 2 148 72 0 0.627 50

1 1 85 66 0 0.351 31

2 3 183 64 0 0.672 32

3 0 89 66 94 0.167 21

4 3 137 40 168 2.288 33

5 2 116 74 0 0.201 30

6 0 78 50 88 0.248 26

7 5 115 0 0 0.134 29

8 4 197 70 543 0.158 53

9 1 166 96 0 0.232 54

Pregnancies 2.1000

Glucose 131.4000

BP 59.8000

Insulin 89.3000

Diabetes 0.5078

Age 35.9000

dtype: float64

Pregnancies 2.00

Glucose 126.50

BP 66.00

Insulin 0.00

Diabetes 0.24

Age 31.50

dtype: float64

Pregnancies Glucose BP Insulin Diabetes Age

0 0.0 78 66.0 0.0 0.134 21


1 1.0 85 NaN NaN 0.158 26
2 2.0 89 NaN NaN 0.167 29
3 3.0 115 NaN NaN 0.201 30
4 NaN 116 NaN NaN 0.232 31
5 NaN 137 NaN NaN 0.248 32
6 NaN 148 NaN NaN 0.351 33
7 NaN 166 NaN NaN 0.627 50
8 NaN 183 NaN NaN 0.672 53
9 NaN 197 NaN NaN 2.288 54

Pregnancies 2.766667

Glucose 1753.155556
BP 658.177778
Insulin 28878.677778
Diabetes 0.427865
Age 140.988889

dtype: float64

Pregnancies 1.663330

Glucose 41.870700
BP 25.654976
Insulin 169.937276
Diabetes 0.654114
Age 11.873874

dtype: float64

C:\Users\NIRMALKUMAR\anaconda3\lib\site-packages\scipy\stats\stats.py:1541:
UserWarning: kurtosistest only valid for n>=20 .... continuing anyway, n=10

warnings.warn("kurtosistest only valid for n>=20 .... continuing "

OLS Regression Results

==============================================================================

Dep. Variable: Glucose R-squared: 0.563


Model: OLS Adj. R-squared: 0.508
Method: Least Squares F-statistic: 10.29
Date: Sat, 12 Nov 2022 Prob (F-statistic): 0.0125
Time: 19:10:53 Log-Likelihood: -46.873
No. Observations: 10 AIC: 97.75
Df Residuals: 8 BIC: 98.35

Df Model: 1

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

------------------------------------------------------------------------------

const 36.4400 31.021 1.175 0.274 -35.095 107.975

Age 2.6451 0.824 3.208 0.012 0.744 4.546

==============================================================================

Omnibus: 4.877 Durbin-Watson: 2.460

Prob(Omnibus): 0.087 Jarque-Bera (JB): 1.759

Skew: 0.990 Prob(JB): 0.415

Kurtosis: 3.552 Cond. No. 126.
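
Part (c) asks for multiple regression, while the model fitted above uses Age alone. A minimal sketch extending it to several predictors (column names taken from the DataFrame printed earlier, with `df` and `sm` as already defined):

# Multiple regression: Glucose modelled on several predictors at once
X = sm.add_constant(df[['Age', 'BP', 'Insulin', 'Pregnancies']])
multi_model = sm.OLS(df['Glucose'], X).fit()
print(multi_model.summary())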

RESULT: We have used the diabetes data set from UCI and the Pima Indians Diabetes data set and
performed univariate analysis, bivariate analysis and multiple regression analysis.

EX NO:
APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA
SETS.

AIM: To apply and explore various plotting functions on UCI data sets.

PROCEDURE:
a. Normal Curves
Step 1: Import the required libraries - `import numpy as np`, `import pandas as pd`, and `import
matplotlib.pyplot as plt`.
Step 2: Set the style for the plot using `plt.style.use('seaborn-whitegrid')`.
Step 3: Create a series of data `x` ranging from 1 to 100 with 50 data points using `x = np.linspace(1, 100,
50)`.
Step 4: Define a function `normal_dist` that calculates the probability density of a normal distribution given
data `x`, mean, and standard deviation.
Step 5: Calculate the mean and standard deviation of the data using `mean = np.mean(x)` and `sd =
np.std(x)`.
Step 6: Apply the function to the data with `pdf = normal_dist(x, mean, sd)` and plot the results by plotting `x` against `pdf` in red using `plt.plot(x, pdf, color='red')`.
b. Density and contour plots
Step 1: Define a function `f(x, y)` that represents a mathematical expression using NumPy.
Step 2: Create linearly spaced values for `x` and `y` using `x = np.linspace(0, 5, 50)` and `y =
np.linspace(0, 5, 40)`.
Step 3: Create a grid of points using `np.meshgrid(x, y)` and assign the result to `X` and `Y`.
Step 4: Evaluate the function `f` for each point on the grid to obtain a 2D array `Z` representing the
function values.
Step 5: Plot the contours of the function using `plt.contour(X, Y, Z, colors='black')`.
Step 6: Plot colored contours with 20 levels using `plt.contour(X, Y, Z, 20, cmap='RdGy')`.
c. Correlation and scatter plots
Step 1: Create a random number generator with a seed value using `rand = np.random.RandomState(10)`.
Step 2: Generate an array of 20 random integers between 0 and 100 using `x = rand.randint(100, size=20)`.
Step 3: Calculate the sine of each element in the array `x` and store it in `y` using `y = np.sin(x)`.
Step 4: Plot the points (`x, y`) as circles ('o') in black color using `plt.plot(x, y, 'o', color='black')`.
d. Histograms
Step 1: Set the plot style to 'seaborn-white' using `plt.style.use('seaborn-white')`.
Step 2:Create a random number generator with a seed value using `rand = np.random.RandomState(0)`.
Step 3: Generate an array of 5 random integers between 0 and 10 using `x = rand.randint(10, size=5)`.
Step 4: Plot the histogram using `plt.hist(x)`.
e. Three dimensional plotting
Step 1: Create a 3D plot using `ax = plt.axes(projection='3d')`.
Step 2: Generate data for a three-dimensional line - `zline = np.linspace(0, 15, 100)`, `xline = np.sin(zline)`,
and `yline = np.cos(zline)`.
Step 3: Plot the three-dimensional line using `ax.plot3D(xline, yline, zline, 'gray')`.
Step 4: Generate data for three-dimensional scattered points - `zdata = 15 * np.random.random(100)`,
`xdata = np.sin(zdata) + 0.1 * np.random.randn(100)`, and `ydata = np.cos(zdata) + 0.1 * np.random.randn(100)`.
Step 5: Scatter plot the three-dimensional points using `ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens')`.
PROGRAM

a. Normal curves

# Importing required libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

plt.style.use('seaborn-whitegrid')

# Creating a series of data of in range of 1-100.

x = np.linspace(1,100,50)

#Creating a Function.

def normal_dist(x, mean, sd):

    # Normal probability density: 1/(sd*sqrt(2*pi)) * exp(-0.5*((x-mean)/sd)**2)
    prob_density = (1 / (sd * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / sd) ** 2)

    return prob_density

#Calculate mean and Standard deviation.

mean = np.mean(x)

sd = np.std(x)

#Apply function to the data.

pdf = normal_dist(x,mean,sd)

#Plotting the Results

plt.plot(x,pdf , color = 'red')

plt.xlabel('Data points')

plt.ylabel('Probability Density')

b. Density and contour plots

#%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import numpy as np
def f(x, y):
return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.contour(X, Y, Z, colors='black') #Visualizing three-dimensional data with contours
plt.contour(X, Y, Z, 20, cmap='RdGy')  #Visualizing three-dimensional data with colored contours

c. Correlation and scatter plots

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

rand=np.random.RandomState(10)

x=rand.randint(100,size=20)

y = np.sin(x)

plt.plot(x, y, 'o', color='black')

d. Histograms

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
rand=np.random.RandomState(0)
x=rand.randint(10,size=5)
plt.hist(x)

e. Three dimensional plotting

from mpl_toolkits import mplot3d


#%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 100)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens')

OUTPUT:

a) Normal Curve

b) Density and Contour Plots

c) Correlation and Scatter Plots

d) Histogram

e) Three dimensional plotting

RESULT: We have applied and explored various plotting functions on UCI data sets.

EX NO:
VISUALIZING GEOGRAPHIC DATA WITH BASEMAP

AIM: To visualize geographic data with basemap

PROCEDURE:

Step 1: Import the required libraries - `from mpl_toolkits.basemap import Basemap` and `import matplotlib.pyplot as
plt`.
Step 2: Create a figure with a size of 12x12 inches using `fig = plt.figure(figsize=(12, 12))`.
Step 3: Initialize a Basemap object `m` using `Basemap()`.
Step 4: Draw coastlines using `m.drawcoastlines()`.
Step 5: Display the plot with the title "Coastlines" using `plt.title("Coastlines", fontsize=20)` and `plt.show()`.
Step 6: Draw country boundaries by adding `m.drawcountries()` after drawing coastlines.
Step 7: Display the plot with the title "Country boundaries" using `plt.title("Country boundaries", fontsize=20)` and
`plt.show()`.
Step 8: Draw major rivers by adding `m.drawrivers(linewidth=0.5, linestyle='solid', color='#0000ff')` after drawing
coastlines and countries.
Step 9: Display the plot with the title "Major rivers" using `plt.title("Major rivers", fontsize=20)` and `plt.show()`.
Step 10: Draw a filled map boundary by filling continents with coral color and the map boundary with aqua color
using `m.fillcontinents(color='coral', lake_color='aqua')` and `m.drawmapboundary(color='b', linewidth=2.0,
fill_color='aqua')`.
Step 11: Display the plot with the title "Filled map boundary" using `plt.title("Filled map boundary", fontsize=20)`
and `plt.show()`.
Step 12: Create a new figure with a size of 10x8 inches using `fig = plt.figure(figsize=(10, 8))`.
Step 13: Initialize an orthographic Basemap projection with a central longitude of 25 and a central latitude of 10
using `m = Basemap(projection='ortho', lon_0=25, lat_0=10)`.
Step 14: Draw coastlines, continents, country boundaries, and map boundary in an orthographic projection using
appropriate Basemap methods.
Step 15: Display the plot with the title "Orthographic Projection" using `plt.title("Orthographic Projection",
fontsize=18)`.

Basemap() Package Installation
Installation of Basemap is straightforward; if you’re using conda you can type this and the package
will be downloaded:
conda install -c anaconda basemap

Description
The Basemap toolkit is a library for plotting 2D data on maps in Python. It is similar in functionality
to the MATLAB mapping toolbox, the IDL mapping facilities, GrADS, or the Generic Mapping Tools.

PROGRAM:

from mpl_toolkits.basemap import Basemap


import matplotlib.pyplot as plt
fig = plt.figure(figsize = (12,12))
m = Basemap()
#Draw coastlines
m.drawcoastlines()
plt.title("Coastlines", fontsize=20)
plt.show()
#Draw Country boundaries
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries()
plt.title("Country boundaries", fontsize=20)
plt.show()
#Draw major rivers
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.drawrivers(linewidth=0.5, linestyle='solid', color='#0000ff')
plt.title("Major rivers", fontsize=20)
plt.show()
#Filled map boundary
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmapboundary(color='b', linewidth=2.0, fill_color='aqua')
plt.title("Filled map boundary", fontsize=20)
plt.show()
#Orthographic Projection
fig = plt.figure(figsize = (10,8))
m = Basemap(projection='ortho', lon_0 = 25, lat_0 = 10)

m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title("Orthographic Projection", fontsize=18)

OUTPUT:

RESULT: We have visualized geographic data with basemap.
