
Internship Plan


No   Date
1    27/06/2024
2    01/07/2024
3    03/07/2024
4    06/07/2024
5    10/07/2024
6    15/07/2024
7    16/07/2024
8    20/07/2024
9    25/07/2024
10   26/07/2024
11   27/07/2024
12   31/07/2024
13   04/08/2024


DAY 1
1. Introduction to Data Science


Data science is a comprehensive, interdisciplinary field that focuses on extracting insights and
knowledge from data to solve complex problems. Here’s a more detailed look at various aspects
of data science:

1. Core Concepts:

 Data: The foundation of data science, which can be structured (like
databases and spreadsheets) or unstructured (like text, images, or videos).
 Statistics and Probability: Data science heavily relies on statistical
methods to analyze data, make inferences, and draw conclusions. Probability
theory helps in predicting outcomes and quantifying uncertainty.
 Machine Learning: A subset of artificial intelligence, machine learning
enables systems to learn from data and make predictions or decisions
without explicit programming. Algorithms like regression, classification,
clustering, and deep learning are core to this process.
 Big Data: As the volume, velocity, and variety of data grow, data science
utilizes tools like Apache Hadoop, Spark, and cloud computing to process
and analyze vast datasets.

2. Data Science Workflow:

The process of data science typically follows a series of steps:

 Data Collection: Gathering raw data from various sources (databases, APIs,
web scraping, sensors, etc.).
 Data Cleaning and Preprocessing: Handling missing values, correcting
errors, transforming data formats, and standardizing data to ensure accuracy.
 Exploratory Data Analysis (EDA): Investigating data sets to discover
patterns, trends, correlations, and anomalies using statistical measures and
visualization tools.
 Feature Engineering: Creating new features or variables that improve the
predictive power of models.
 Model Building: Applying machine learning algorithms to build predictive
or descriptive models using training data.
 Model Evaluation: Assessing model performance using metrics like
accuracy, precision, recall, F1-score, and AUC (Area Under the Curve).
 Deployment and Monitoring: Integrating the model into production
systems and monitoring its performance over time.

3. Key Techniques and Tools:


 Programming Languages: Python and R are the most common languages
used in data science due to their rich ecosystems of libraries and ease of use.
Python libraries such as NumPy, Pandas, scikit-learn, TensorFlow, and
Keras, along with R’s ggplot2 and caret, are fundamental tools.
 SQL: Essential for querying databases and retrieving data.
 Data Visualization: Tools like Matplotlib, Seaborn, Plotly (Python),
ggplot2 (R), Tableau, and Power BI help present data insights visually,
making it easier to communicate results to non-technical stakeholders.
 Cloud Platforms: AWS, Google Cloud, and Microsoft Azure provide
scalable storage, processing power, and machine learning services to handle
big data and deploy machine learning models.
 Version Control: Git and GitHub for collaboration and maintaining version
control of code.

4. Applications of Data Science:

Data science is used across a wide range of industries to drive innovation and
improve decision-making. Some applications include:

 Healthcare: Predictive modeling for disease diagnosis, personalized
medicine, drug discovery, and medical imaging analysis.
 Finance: Fraud detection, risk management, algorithmic trading, and
customer segmentation.
 Marketing: Targeted advertising, customer segmentation, recommendation
engines, and sentiment analysis.
 E-commerce: Price optimization, product recommendations, and demand
forecasting.
 Transportation: Route optimization, predictive maintenance in automotive
industries, and autonomous vehicles.
 Social Sciences: Understanding social behavior through analysis of surveys,
social media, and other demographic data.
 Sports Analytics: Optimizing player performance, injury prediction, and
game strategy.

5. Emerging Trends in Data Science:

 Deep Learning: Using neural networks with many layers to solve complex
problems like image recognition, natural language processing (NLP), and
autonomous driving.
 Natural Language Processing (NLP): Techniques to analyze, understand,
and generate human language. NLP is used in applications like chatbots,
sentiment analysis, and language translation.


 AutoML (Automated Machine Learning): Tools that automate the
machine learning process, from data cleaning to model selection and
hyperparameter tuning, making it easier for non-experts to use.
 Explainable AI (XAI): As AI models become more complex, there’s a
growing need for transparency and interpretability. XAI focuses on making
machine learning models understandable to humans, which is crucial for
sectors like healthcare and finance where decisions must be explainable.
 Edge Computing: Bringing computation and data storage closer to data
sources (like IoT devices) to reduce latency and bandwidth use, especially
useful in real-time applications.


6. Roles and Responsibilities in Data Science:

There are several key roles in the data science field:

 Data Scientist: Develops models and analyzes data to provide actionable
insights.
 Data Analyst: Focuses on interpreting data and producing reports or
visualizations.
 Machine Learning Engineer: Specializes in building and deploying
machine learning models.
 Data Engineer: Designs and maintains the architecture (like databases and
large-scale processing systems) that allows data scientists and analysts to
work with data.
 Business Analyst: Bridges the gap between data science and business,
interpreting insights in a way that aligns with business goals.

7. Skills Needed for Data Science:

To be successful in data science, professionals need a combination of technical
and non-technical skills:

 Technical Skills: Programming (Python, R), data wrangling, statistical
analysis, machine learning, database management (SQL), data visualization,
and cloud computing.
 Mathematics and Statistics: Understanding probability, linear algebra,
calculus, and statistical theory is fundamental.
 Problem-Solving: Data scientists must think critically to solve complex
business problems using data-driven approaches.

 Communication: The ability to communicate technical findings to
non-technical stakeholders is key.
 Curiosity and Continuous Learning: Given the rapid pace of change in
tools, methods, and data, data scientists must continuously learn and adapt.

8. Challenges in Data Science:

 Data Quality: Incomplete, inconsistent, or incorrect data can lead to flawed
models and inaccurate conclusions.
 Data Privacy and Ethics: Handling personal or sensitive data requires a
strong understanding of data protection laws and ethical considerations,
especially in industries like healthcare and finance.
 Bias in Machine Learning: Models can inherit biases from training data,
leading to unfair or inaccurate predictions, particularly in sensitive areas like
hiring, law enforcement, and lending.
 Interpretability: As models become more complex, particularly in deep
learning, they become harder to interpret and explain to non-experts.

9. Future of Data Science:

Data science will continue to evolve, driven by advances in AI, automation, and
computing power. New tools will make it easier for businesses to leverage data,
and data science will become even more critical in decision-making across
industries. The increasing role of ethics and regulation around AI and data will also
shape how data science is applied in the future.

In summary, data science is an ever-growing field with a broad range of
applications, driven by advances in technology and increasing amounts of data. It
requires a diverse skill set and is a key component in shaping the future of
industries and technologies worldwide.

DAY 2
1. Installation of Anaconda (setup) and writing and running
Python code in Jupyter


1.1 Introduction
Anaconda is an open-source distribution of the Python and R programming
languages, widely used for scientific computing, data science, machine
learning, and data analysis. It simplifies package management and
deployment, making it a popular choice for both beginners and experienced
users in the data science community.
1.2 Steps in the Installation Process

1.2.1 Download Anaconda:

 Go to the Anaconda Distribution page.


 Click on the "Download" button and select the version for Windows.

1.2.2 Run the Installer:

 Once the download is complete, run the installer executable.


 Follow the installation prompts.

1.2.3 Choose Installation Options:

 You can choose to install Anaconda for just yourself or for all users
(requires administrator permissions).
 Select the destination folder (default is usually fine).

1.2.4 Advanced Installation Options:

• Choose whether to add Anaconda to your PATH environment variable.
(Recommended: "Do not add Anaconda to the PATH environment variable".)
• Choose whether to register Anaconda as your default Python 3.8
(Recommended).

1.2.5 Complete the Installation:

 Click "Install".
 Once installation is complete, you can choose to launch Anaconda Navigator
or Jupyter Notebook to get started immediately.


1.3 Create a New Environment


Creating a new environment in Anaconda helps to manage dependencies and keep
your work organized.

1.3.1 Open Anaconda Prompt (Windows) or Terminal (macOS/Linux).

1.3.2 Create a New Environment:
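The command from the original screenshot is not preserved; a typical conda
command for this step (the environment name "datasci" and the Python version
are illustrative) is:

conda create -n datasci python=3.11

Activate it before installing packages:

conda activate datasci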

1.4 Start Jupyter Notebook


1.4.1 Launch Jupyter Notebook:
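In the Anaconda Prompt or terminal, run:

jupyter notebook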

This command opens the Jupyter Notebook interface in your web browser.

1.4.2 Create a New Notebook:

o In the Jupyter Notebook interface, click on "New" and select "Python 3"
to create a new notebook.
o You can start writing and running your Python code in this notebook.

2. Find a dataset from Kaggle or the UCI Machine Learning Repository

3. Load the dataset using pandas in a Jupyter Notebook
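A minimal sketch of this step, assuming the downloaded file is a CSV named
dataset.csv (the file name is illustrative):

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("dataset.csv")

# Inspect the first rows, column types, and summary statistics
print(df.head())
print(df.info())
print(df.describe())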


DAY 3
1. Find the missing values in the dataset
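A short sketch of this step, reusing the DataFrame df loaded on Day 2:

# Count missing values per column
print(df.isnull().sum())

# Either drop rows with missing values, or fill them (e.g. with column means)
df = df.fillna(df.mean(numeric_only=True))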

2. Plot scatter plots, line plots, and heatmaps using Matplotlib and Seaborn
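The plots in the original report are screenshots; a hedged sketch of the three
plot types (the column names study_hours and score are illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of two numeric columns
plt.scatter(df["study_hours"], df["score"])
plt.xlabel("study_hours")
plt.ylabel("score")
plt.show()

# Line plot (sorted so the line is monotone in x)
d = df.sort_values("study_hours")
plt.plot(d["study_hours"], d["score"])
plt.show()

# Heatmap of the correlation matrix
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()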


DAY 4
DataFrame Detailing using Sweetviz
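The original shows this step as screenshots; a minimal Sweetviz sketch,
reusing the DataFrame df from earlier:

import sweetviz as sv

# Profile the DataFrame and write an interactive HTML report
report = sv.analyze(df)
report.show_html("sweetviz_report.html")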


DAY 5-7


Using Models for Classification

• Logistic Regression, Decision Tree, Gradient Boosting Classifier, and
Random Forest are used to classify whether a student passes or fails based on
study hours.

• The data is split into training and testing sets, and we measure performance
using accuracy for classification models and mean squared error (MSE)
for the regression model.
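A minimal setup sketch for these experiments; the file and column names
(students.csv, study_hours, passed) are assumptions based on the description
above:

import pandas as pd
from sklearn.model_selection import train_test_split

students = pd.read_csv("students.csv")
X = students[["study_hours"]]        # feature
y = students["passed"]               # 1 = pass, 0 = fail
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

The model snippets below reuse this split.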

Logistic Regression Model:


Logistic Regression is a widely used supervised machine learning algorithm that
is typically applied to binary classification problems. Unlike linear regression,
which predicts a continuous value, logistic regression predicts the probability of
an instance belonging to a specific class (0 or 1, True or False, Positive or
Negative).
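The actual code appears as screenshots in the original; a hedged scikit-learn
sketch, reusing the train/test split above:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, log_reg.predict(X_test)))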


Decision Tree Model:


A Decision Tree is a popular supervised machine learning algorithm used for both
classification and regression tasks. It models decisions and their possible
consequences in a tree-like structure, where internal nodes represent decision
points based on a feature, branches represent the outcomes of these decisions, and
leaf nodes represent the final output (a class label in classification or a numerical
value in regression).
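A corresponding sketch on the same split (the max_depth value is illustrative):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))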


Random Forest Model:


Random Forest is a powerful ensemble machine learning algorithm that combines
multiple decision trees to improve the model's accuracy and robustness. It is
widely used for both classification and regression tasks. The key idea behind
random forests is to create a "forest" of decision trees where each tree is trained on
a different subset of the data and a random subset of the features, and then the
results of all the trees are combined to make a final prediction.
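A matching sketch on the same split (n_estimators is illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, forest.predict(X_test)))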


Gradient Boosting Classifier:

Gradient Boosting Classifier is a powerful ensemble machine learning algorithm
that builds multiple decision trees sequentially, with each tree trying to correct the
mistakes of the previous one. Unlike Random Forest, which builds trees in parallel,
Gradient Boosting focuses on improving the model by adding trees one at a time,
optimizing the errors made by earlier trees. This leads to a more accurate and
robust predictive model.
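A hedged sketch on the same split (the learning rate and tree count are
illustrative defaults):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=42)
gbc.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, gbc.predict(X_test)))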


DAY 8-11
Using Models for Regression
Linear Regression and Gradient Boosting Regression are used to predict exam
scores based on study hours.
Linear Regression Model:

Linear Regression is a simple yet powerful algorithm used to model the
relationship between a dependent variable (target) and one or more independent
variables (predictors). It assumes that the relationship between the variables is
linear, meaning that the change in the dependent variable is proportional to the
change in the independent variable(s). Linear regression is widely used in statistics
and machine learning for predictive modeling, especially in tasks involving
continuous output (regression tasks).
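A hedged sketch of this regression setup, reusing the students DataFrame
loaded earlier; the column name exam_score is an assumption:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Xr = students[["study_hours"]]
yr = students["exam_score"]          # continuous target (assumed name)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    Xr, yr, test_size=0.2, random_state=42)

lin = LinearRegression()
lin.fit(Xr_train, yr_train)
print("MSE:", mean_squared_error(yr_test, lin.predict(Xr_test)))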


Gradient Boosting Regression:


Gradient Boosting Regression is a powerful machine learning technique used for
predictive modeling, particularly in regression tasks where the goal is to predict a
continuous target variable. Like Gradient Boosting Classifier, it builds an
ensemble of decision trees in a sequential manner, where each tree tries to correct
the errors (residuals) made by the previous trees. The primary difference is that
Gradient Boosting Regression is focused on predicting continuous outputs rather
than class labels.
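A matching sketch, reusing the regression split above:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                random_state=42)
gbr.fit(Xr_train, yr_train)
print("MSE:", mean_squared_error(yr_test, gbr.predict(Xr_test)))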


DAY 12
Find another dataset related to disease and apply different
classification models
Decision Tree Classifier:


Support Vector Classifier:


Support Vector Classifier (SVC) is a supervised machine learning algorithm that
belongs to the Support Vector Machines (SVM) family. It is widely used for
classification tasks, where the goal is to classify data into distinct categories. SVC
is highly effective in high-dimensional spaces and can be used for both binary and
multi-class classification problems.
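A hedged sketch on a disease dataset; the file name disease.csv and the
target column name are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

disease = pd.read_csv("disease.csv")
Xd = disease.drop(columns=["target"])
yd = disease["target"]
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    Xd, yd, test_size=0.2, random_state=42)

svc = SVC(kernel="rbf")
svc.fit(Xd_train, yd_train)
print("Accuracy:", accuracy_score(yd_test, svc.predict(Xd_test)))

In practice SVC is sensitive to feature scale, so wrapping it with a
StandardScaler in a Pipeline usually improves results.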

Gradient Boosting Classifier:


Gradient Boosting Classifier is a powerful machine learning algorithm used for
classification tasks. It belongs to the family of ensemble learning methods, which
build a strong predictive model by combining the predictions of multiple weaker
models. Gradient Boosting, in particular, builds models sequentially, with each
new model improving on the errors (residuals) of the previous ones. This iterative
approach helps the model progressively minimize classification errors and make
more accurate predictions.

DAY 13


Introduction to Streamlit; installation and use with Jupyter Notebook

Virtual Environment

Using a virtual environment, such as venv, is highly recommended. It isolates
your project's dependencies, preventing any conflicts with other projects. To
create a virtual environment, navigate to your project directory and run:

python -m venv .venv

Activate your environment with:

 Windows: .venv\Scripts\activate.bat
 macOS/Linux: source .venv/bin/activate

Installing Streamlit

With your environment activated, install Streamlit using pip:

pip install streamlit

Running a Streamlit App

To run a Streamlit app, such as a simple "Hello World", create a file
named app.py and add the following code:

import streamlit as st

st.write("Hello world")

Run your app with:

streamlit run app.py

This command launches your app in the default web browser. For a more
detailed exploration, including how to run Streamlit apps in Jupyter
notebooks, refer to the official documentation.


DAY 14-15
Use another dataset, apply different models, and plot graphs


DAY 16


DAY 17-19
Predict outcomes from the disease dataset in Streamlit using user input

This Streamlit app performs heart disease prediction using a Gradient Boosting
Classifier model trained on the heart disease dataset. Here's a brief explanation of
the code:

1. Data Loading and Preprocessing:


o The dataset heart_disease_data.csv is loaded.
o Features (X) are created by dropping the target column (target),
which is the indicator for heart disease.
o The target values (y) are stored for prediction.
2. Model Training:
o The dataset is split into training and testing sets using
train_test_split().
o A Gradient Boosting Classifier model is trained on the training
data.
3. Model Evaluation:
o The trained model is used to predict heart disease on the test set.
o The accuracy of the model on the test set is calculated and displayed.
4. User Input via Streamlit:
o The app allows users to input values for features like age, sex, blood
pressure, etc., using input fields.
o When the user clicks the "Predict" button, these values are passed to
the trained model for prediction.
o The result (whether heart disease is present or not) is displayed.
5. Sample Test:
o A predefined set of feature values (sample input) is tested for heart
disease prediction when the "Test Sample Input" button is clicked.
6. Test Predictions Display:
o The first 10 predictions of the test set are shown alongside actual
values to compare model performance.

This provides an interactive tool where users can input health metrics to predict
whether they may have heart disease based on the model's learning from the
dataset.
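The app itself appears as screenshots in the original report; a condensed,
hedged sketch of the structure just described (the sample-test button is
omitted, and feature inputs are generated generically from the column names):

import streamlit as st
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data; "target" is the heart disease indicator column
data = pd.read_csv("heart_disease_data.csv")
X = data.drop(columns=["target"])
y = data["target"]

# Train the Gradient Boosting Classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Report accuracy on the test set
st.write("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Collect user input for each feature and predict on button click
inputs = [st.number_input(col) for col in X.columns]
if st.button("Predict"):
    pred = model.predict(pd.DataFrame([inputs], columns=X.columns))[0]
    st.write("Heart disease present" if pred == 1 else "No heart disease")

# Show the first 10 test predictions next to the actual values
st.write(pd.DataFrame({"predicted": model.predict(X_test)[:10],
                       "actual": y_test.values[:10]}))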


DAY 20
Predict Medicine by user input in Streamlit


DAY 21
Create a GitHub account, set it up, and upload files (like your code)
to a repository

1. Create a GitHub Account

1. Go to GitHub.com.
2. Click on the Sign Up button.
3. Fill in your details (email, password, username).
4. Complete the verification process.
5. Choose a plan (the free plan is enough for most purposes).
6. Confirm your email address.

2. Set Up GitHub Locally

To upload your files from your local machine to GitHub, you'll need to install Git
on your system.

For Windows:

1. Download and install Git from Git for Windows.


2. During installation, select "Use Git from Git Bash only" or the appropriate
settings for your needs.

For macOS:

1. Open a terminal and type:

git --version

If Git is not installed, you'll be prompted to install it.

For Linux:

1. Use the following command to install Git:

sudo apt-get install git

3. Configure Git


After installing Git, configure your GitHub username and email.

Open your terminal (or Git Bash for Windows) and run the following commands:

git config --global user.name "Your Name"


git config --global user.email "[email protected]"

4. Create a New Repository on GitHub

1. Go to your GitHub account.


2. Click the + icon in the top-right corner and select New Repository.
3. Name your repository (e.g., medicine-prediction).
4. Optionally, add a description and choose whether the repository is public or
private.
5. Click Create Repository.

5. Upload Files to GitHub via Command Line

After creating your repository on GitHub, you’ll want to upload your local files.

1. Initialize Git in your project folder: Go to your local project directory
(where your code is stored) and run:

git init

2. Add the remote repository: Link your local folder to the GitHub
repository:

git remote add origin https://github.com/yourusername/repository-name.git

3. Add your files: Add all the files in your folder to the Git staging area:

git add .

4. Commit your changes: Commit the added files with a message:

git commit -m "Initial commit"

5. Push your files to GitHub: Finally, push your local files to the GitHub
repository:

git push -u origin master


DAY 22

Face Detection

DAY 23-26


DAY 27-30
Email analysis process

