
FOUNDATIONS OF DATA SCIENCE
Mr. Peter Wawire Barasa
0735616893/[email protected]
•“You cannot change what you are, only what you do.” ―
Philip Pullman, “The Golden Compass”

•“The people who are crazy enough to think they can change
the world are the ones who do.” — Steve Jobs

•“Be the change that you wish to see in the world.” ― Mahatma Gandhi
Session objectives
By the end of the session, participants should be able to:
1. Explain data science concepts, terminology, and applications.
2. Distinguish data types, data sources, and data formats.
3. Describe data collection methods and data ethics.
Introduction to Data Science
• Data science is a rapidly growing field that combines
various disciplines, including computer science, statistics,
mathematics, and domain expertise.
• The primary goal of data science is to extract valuable
insights and knowledge from data to solve complex
problems and make informed decisions.
• Data scientists use a wide range of tools and techniques to
collect, process, analyze, and interpret large volumes of
structured and unstructured data. The insights gained
from data science can be applied across various industries,
such as business, healthcare, finance, marketing, and
sports, to drive innovation, optimize processes, and improve
outcomes.
Key Concepts and Terminology
• Big Data refers to datasets that are too large and
complex to be processed using traditional data
processing tools. These datasets often require
distributed computing and advanced analytics techniques
to derive meaningful insights.
• Machine Learning is a subset of artificial intelligence
that focuses on developing algorithms and models that
enable systems to learn and improve from experience
without being explicitly programmed. There are three
main types of machine learning: supervised learning,
unsupervised learning, and reinforcement learning.
Key Concepts and Terminology
• Data Mining is the process of discovering hidden
patterns, correlations, and insights in large datasets. It
involves using various techniques, such as clustering,
association rule mining, and anomaly detection, to
uncover valuable information.
• Data Visualization is the practice of representing data
through visual elements like charts, graphs, and maps. It
helps to communicate complex data in a clear and easily
understandable way, facilitating data-driven decision
making.
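As a minimal illustration of the idea, the distribution of some hypothetical ratings can be summarized and rendered as a crude text bar chart in plain Python; in real work a plotting library such as matplotlib or seaborn would produce proper charts and graphs:

```python
from collections import Counter

# Hypothetical survey ratings (assumed data, for illustration only)
ratings = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4, 1, 5]

counts = Counter(ratings)
for rating in sorted(counts):
    # One '#' per response: a crude text bar chart of the distribution
    print(f"{rating} stars | {'#' * counts[rating]}")
```

Even this tiny sketch makes the skew toward high ratings visible at a glance, which is exactly what visualization is for.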
Key Concepts and Terminology
• Predictive Analytics involves using historical data,
statistical algorithms, and machine learning techniques
to identify the likelihood of future outcomes. This
enables organizations to make proactive decisions and
take preventive measures.
Data Science Applications:
• Data science has numerous applications across various
domains. Some examples include:
• In business, data science is used to identify trends, optimize
processes, and make data-driven decisions. It can help
businesses understand customer behavior, improve product
recommendations, and optimize supply chain management.
• Healthcare organizations use data science to develop
personalized medicine, predict disease outbreaks, and
improve patient outcomes. Data science can help analyze
electronic health records, medical images, and genomic data
to identify patterns and develop targeted treatments.
Data Science Applications:
• In finance, data science is used for fraud
detection, risk assessment, and algorithmic trading.
It can help analyze market trends, predict stock
prices, and optimize portfolio management.
• Marketing teams leverage data science to analyze
customer behavior, optimize marketing campaigns,
and build recommender systems. It can help segment
customers, personalize content, and measure
campaign effectiveness.
Data Science Applications:
• Sports organizations use data science to evaluate
player performance, predict game outcomes, and
prevent injuries.
• It can help analyze player statistics, track player
movements, and optimize training programs.
Data Types and Formats:
• Data comes in various types and formats, each
requiring different processing and analysis techniques.
• Structured data is organized in a predefined format,
such as spreadsheets or SQL databases. It has a
clear schema and can be easily processed using
traditional data processing tools.
• Unstructured data lacks a predefined format and
includes text, images, audio, and video. It requires
advanced techniques like natural language processing
and computer vision to extract insights.
Data Sources:
• Data can be obtained from various sources, including:
• Databases, both relational and NoSQL, store
structured and semi-structured data.
• Data warehouses and data lakes are centralized
repositories that store large volumes of structured
and unstructured data from multiple sources.
• Web scraping involves extracting data from websites
using automated tools or scripts.
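A toy sketch of the scraping idea using only Python's standard library. The HTML snippet is hardcoded here; in practice the page would be fetched over HTTP (e.g. with urllib or the requests library), and dedicated tools like BeautifulSoup make the extraction more convenient:

```python
from html.parser import HTMLParser

# A toy page standing in for a fetched document
PAGE = "<html><body><h2>Item A</h2><h2>Item B</h2></body></html>"

class HeadingScraper(HTMLParser):
    """Collects the text inside every <h2> tag."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []
    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
    def handle_data(self, data):
        if self.in_h2:
            self.headings.append(data)

scraper = HeadingScraper()
scraper.feed(PAGE)
print(scraper.headings)
```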
Data Sources:
• APIs (Application Programming Interfaces) and web
services provide access to data from external sources,
such as social media platforms, weather services, and
financial data providers.
• IoT (Internet of Things) devices and sensors generate
real-time data from various sources, such as industrial
equipment, smart homes, and wearable devices.
• Social media platforms like Facebook, Twitter, and
LinkedIn provide rich data on user behavior,
preferences, and interactions.
Data Sources:
• Surveys and forms are used to collect data directly
from individuals or organizations.
• Open data portals, such as government websites and
research institutions, provide access to publicly available
datasets.
Data Collection Methods:
• Data can be collected using various methods, depending
on the data source and the research objectives. Some
common data collection methods include:
• Interviews and surveys are used to gather qualitative
and quantitative data directly from individuals or
organizations.
• Web crawling and scraping involve using automated
tools to navigate and extract data from websites.
Data Collection Methods:
• Sensors and IoT devices can automatically collect data
from the physical world, such as temperature, humidity,
and motion.
• Transactional systems, such as point-of-sale systems
and e-commerce platforms, generate data on customer
purchases and interactions.
• Experiments and A/B testing are used to collect data
on user behavior and preferences by comparing different
variations of a product or service.
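The core A/B comparison is simple arithmetic over the two variants; the visitor and conversion counts below are hypothetical (a real analysis would also run a significance test before acting on the difference):

```python
# Hypothetical A/B test counts (assumed numbers, for illustration)
visitors = {"A": 1000, "B": 1000}
conversions = {"A": 50, "B": 65}

# Conversion rate per variant, and the relative lift of B over A
rates = {v: conversions[v] / visitors[v] for v in visitors}
lift = (rates["B"] - rates["A"]) / rates["A"]
print(f"A: {rates['A']:.1%}  B: {rates['B']:.1%}  lift: {lift:.0%}")
```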
Data Collection Methods:
• Third-party data providers offer access to datasets
that have been collected and curated by other
organizations.
• Publicly available datasets can be accessed from
government websites, research institutions, and data
repositories like Kaggle and UCI Machine Learning
Repository.
Data Ethics:
• As data science becomes increasingly prevalent, it is
crucial to consider the ethical implications of
collecting, processing, and using data. Some key ethical
considerations include:
• Privacy and data protection regulations, such as the
General Data Protection Regulation (GDPR) and the
Health Insurance Portability and Accountability Act
(HIPAA), ensure that personal data is collected,
stored, and used in a responsible and secure manner.
Data Ethics:
• Bias and fairness in data and algorithms are
important considerations to prevent discrimination and
ensure equitable treatment of individuals and groups.
• Transparency and explainability of models are
essential for building trust and accountability in
data-driven decision making. Data scientists should be
able to explain how their models work and how
decisions are made.
Data Ethics:
• Responsible data collection and usage involve
obtaining informed consent, protecting individual
privacy, and using data only for the intended purposes.
• Ethical considerations in data-driven decision making
include ensuring that decisions are fair, unbiased, and
do not have unintended consequences for individuals or
society.
Data Ethics:
• Data security and integrity are crucial for protecting
sensitive information and preventing unauthorized
access or manipulation of data.
Data Ethics:
• In summary, the Foundations of Data Science cover
a wide range of topics, from the basic concepts and
terminology to the various applications, data types,
sources, collection methods, and ethical
considerations. Understanding these foundations is
essential for anyone seeking to pursue a career in data
science or leverage data-driven insights in their
organization.
Practical 1: Data Science
Concepts and Applications.
• Real-World Problem: Reducing Hospital Readmissions
• Industry: Healthcare
• Problem: High rates of hospital readmissions lead to
increased healthcare costs and lower quality of care
• Challenge: Identify patients at high risk of
readmission and intervene early to prevent
unnecessary readmissions
Applying Data Science to
Address the Problem.
• Collect and integrate patient data from electronic
health records (EHRs), including demographics, medical
history, lab results, and medication data
• Preprocess and clean the data to handle missing values,
outliers, and inconsistencies
• Apply machine learning algorithms (e.g., logistic
regression, random forests) to predict the likelihood of
readmission based on patient characteristics and clinical
factors
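To make the logistic-regression step concrete, here is a minimal sketch of scoring readmission risk. The features and weights below are illustrative assumptions, not a fitted model; in practice the coefficients would be learned from EHR data with a library such as scikit-learn:

```python
import math

def readmission_risk(age, prior_admissions, num_medications):
    """Logistic risk score; the weights here are assumed, not fitted."""
    z = -4.0 + 0.02 * age + 0.8 * prior_admissions + 0.05 * num_medications
    return 1 / (1 + math.exp(-z))  # sigmoid maps the score to a probability

low = readmission_risk(age=40, prior_admissions=0, num_medications=2)
high = readmission_risk(age=75, prior_admissions=3, num_medications=12)
print(f"low-risk patient: {low:.2f}, high-risk patient: {high:.2f}")
```

The output is a probability between 0 and 1, which lets the hospital rank patients and target interventions at those above a chosen threshold.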
Applying Data Science to
Address the Problem.
• Identify the key features and risk factors associated
with high readmission risk
Potential Benefits and
Impact
• Early identification of high-risk patients allows for
targeted interventions and personalized care plans
• Reducing unnecessary readmissions improves patient
outcomes and satisfaction
• Lowering readmission rates leads to significant cost
savings for hospitals and healthcare systems
• Data-driven insights can inform hospital policies and
resource allocation decisions
Key Data Science
Concepts and Terminology
• Predictive modeling: Using historical data to build
models that can predict future outcomes (e.g.,
readmission risk)
• Feature selection: Identifying the most informative
variables or features that contribute to the predictive
power of the model
Key Data Science
Concepts and Terminology
• Supervised learning: Training machine learning models
using labeled data, where the desired output (e.g.,
readmission status) is known
• Model evaluation: Assessing the performance of the
predictive model using metrics such as accuracy,
precision, recall, and F1-score
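These metrics all derive from the confusion matrix, so they can be computed directly from the four counts; the numbers below are a hypothetical result for a readmission model:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)            # of flagged patients, how many readmitted
    recall = tp / (tp + fn)               # of readmitted patients, how many flagged
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical results: 80 true positives, 20 false positives,
# 40 false negatives, 860 true negatives
acc, prec, rec, f1 = classification_metrics(tp=80, fp=20, fn=40, tn=860)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Note how accuracy looks excellent on this imbalanced example even though recall is mediocre, which is why all four metrics are reported.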
Real-World Problem:
Customer Churn Prediction
• Industry: Telecommunications
• Problem: High customer churn rates lead to lost revenue
and increased customer acquisition costs
• Challenge: Identify customers at high risk of churning
and take proactive measures to retain them
Applying Data Science to
Address the Problem
• Collect customer data, including demographics, usage
patterns, billing information, and customer service
interactions
• Preprocess the data to handle missing values, encode
categorical variables, and scale numerical features
• Apply machine learning algorithms (e.g., logistic
regression, decision trees, neural networks) to predict
the likelihood of churn for each customer
Applying Data Science to
Address the Problem
• Identify the key factors and patterns associated with
high churn risk
Potential Benefits and
Impact
• Proactive identification of at-risk customers allows
for targeted retention campaigns and personalized
offers
• Reducing churn rates leads to increased customer
lifetime value and revenue stability
• Data-driven insights can inform product development
and service improvements to enhance customer
satisfaction
Potential Benefits and
Impact
• Efficient allocation of marketing and customer
service resources based on churn risk predictions
Key Data Science
Concepts and Terminology
• Customer churn: The rate at which customers
discontinue their relationship with a company or service
• Feature engineering: Creating new variables or
features from existing data to improve the predictive
power of the model
• Imbalanced data: Handling datasets where the target
variable (e.g., churn status) has a skewed distribution,
with one class being significantly more prevalent than
the other
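One simple way to handle such skew is random oversampling of the minority class, sketched below on assumed churn labels; libraries such as imbalanced-learn offer more principled methods (e.g. SMOTE):

```python
import random

random.seed(0)
# Hypothetical churn labels: 90 "stay" vs 10 "churn" — heavily imbalanced
labels = ["stay"] * 90 + ["churn"] * 10

minority = [x for x in labels if x == "churn"]
# Duplicate randomly chosen minority examples until the classes match
balanced = labels + random.choices(minority, k=90 - 10)
print(balanced.count("stay"), balanced.count("churn"))
```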
Key Data Science
Concepts and Terminology
• Ensemble methods: Combining multiple machine
learning models to improve prediction accuracy and
robustness
Practical 2: Data Types
and Formats
• Example 1: Customer Reviews Dataset
• Dataset: Amazon product reviews dataset
• Topic: Sentiment analysis of customer reviews
Practical 2: Data Types
and Formats
• Download the Amazon product reviews dataset from
a reliable source (e.g., Kaggle, Amazon product reviews
dataset)
• The dataset typically includes fields such as reviewer
ID, product ID, review text, rating, and review
timestamp
Practical 2: Data Types
and Formats
• The dataset is usually in a structured format, such
as CSV (comma-separated values) or JSON (JavaScript
Object Notation)
• The review text field contains unstructured data
(free-form text)
• Other fields like reviewer ID, product ID, and
rating are structured data
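The mix of structured and unstructured fields can be seen by parsing a toy slice of such a file; the rows below are invented for illustration, and a real dataset would be loaded from disk (e.g. with csv.DictReader or pandas.read_csv):

```python
import csv
import io

# A toy slice of a reviews file in CSV format
RAW = (
    "reviewer_id,product_id,rating,review_text\n"
    'r1,p9,5,"Great value, works perfectly"\n'
    'r2,p9,2,"Stopped working after a week"\n'
)

reviews = list(csv.DictReader(io.StringIO(RAW)))
# rating is structured (numeric once cast); review_text is free-form text
avg_rating = sum(int(r["rating"]) for r in reviews) / len(reviews)
print(avg_rating, "-", reviews[0]["review_text"])
```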
Example 2: Social Media
Posts Dataset
• Dataset: Twitter tweets dataset
• Topic: Analyzing trending topics and user
engagement on Twitter
Example 2: Social Media
Posts Dataset
Data Collection
• Use the Twitter API to collect tweets related to a
specific topic or hashtag
• The collected data includes tweet text, user
information, timestamps, retweets, and likes
Example 2: Social Media
Posts Dataset
Data Types and Formats
• The tweet dataset is in a semi-structured format,
typically JSON
• The tweet text field contains unstructured data
(free-form text)
• User information, timestamps, retweets, and likes are
structured data
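A toy record in the spirit of Twitter API output shows the semi-structured mix (the field names here are illustrative, not the exact API schema):

```python
import json

# A toy tweet record mimicking the shape of Twitter API JSON output
raw = '''{
  "user": {"name": "analyst_jane", "followers": 120},
  "created_at": "2024-05-01T10:00:00Z",
  "retweets": 7,
  "likes": 42,
  "text": "Exploring trending topics with the #datascience community!"
}'''

tweet = json.loads(raw)
# Structured fields are directly addressable...
print(tweet["user"]["name"], tweet["likes"])
# ...while the text field needs text-processing techniques
hashtags = [w for w in tweet["text"].split() if w.startswith("#")]
print(hashtags)
```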
Example 3: Sensor Data
Dataset
Dataset: IoT sensor readings from a manufacturing plant
Topic: Monitoring machine performance and predicting
maintenance needs
Data Exploration and
Analysis with Python
Conduct basic data analysis:
• Calculate summary statistics (e.g., mean, median, standard
deviation) for numerical fields
• Visualize the distribution of ratings or sensor
measurements using histograms or bar charts
• Analyze the sentiment of review text or tweet text using
sentiment analysis techniques
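The summary-statistics step can be done with Python's standard library alone; the sensor readings below are invented for illustration:

```python
import statistics

# Hypothetical sensor readings (e.g. machine temperature in deg C);
# the last value is a deliberate anomaly
readings = [70.1, 69.8, 70.4, 71.0, 69.5, 70.2, 95.3]

print("mean:  ", round(statistics.mean(readings), 2))
print("median:", statistics.median(readings))
print("stdev: ", round(statistics.stdev(readings), 2))
```

Note how the outlier pulls the mean well above the median while the median stays robust; this is one reason to inspect full distributions (e.g. with histograms), not just single summary numbers.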
Summary Report
• Discuss the characteristics of the dataset, including
its size, structure, and format
• Highlight the presence of structured, unstructured,
and semi-structured data within the dataset
• Explain how the data types and formats impact data
processing and analysis approaches
• Summarize key insights gained from basic data
exploration and analysis
• Identify potential challenges or considerations when
working with the specific data types and formats
Data Ethics and Privacy
• Case Study: Cambridge Analytica Scandal
• Background: Cambridge Analytica, a political
consulting firm, collected and misused Facebook user
data to influence the 2016 U.S. presidential election
and the Brexit campaign.
Key Ethical Issues and
Concerns
• Unauthorized collection and use of personal data
without users' explicit consent
• Violation of privacy rights and data protection
regulations
• Manipulation of user behavior and opinions through
targeted advertising and psychographic profiling
• Lack of transparency in data collection and usage
practices
Analysis from Data
Privacy Perspective
• Facebook allowed third-party apps to collect user
data without proper oversight and control
• Cambridge Analytica exploited the Facebook API to
harvest data from millions of users without their
knowledge or consent
• The collected data was used to create detailed
psychological profiles of users for targeted political
advertising
• The incident highlighted the need for stricter data
privacy regulations and user control over personal data
Analysis from Bias and
Transparency Perspective
• The targeted advertising campaigns based on
psychographic profiling raised concerns about bias and
manipulation
• The algorithms used for profiling and targeting users
lacked transparency, making it difficult to assess their
fairness and accuracy
• The lack of transparency in data collection and
usage practices eroded user trust and raised questions
about the ethical boundaries of data-driven
campaigning
Recommendations and Best
Practices
• Strengthen data privacy regulations to ensure user
consent and control over personal data collection and
usage
• Implement strict oversight and auditing mechanisms
for third-party access to user data on social media
platforms
• Enhance transparency in data collection, processing,
and usage practices, allowing users to make informed
decisions
Summary Report
• The Cambridge Analytica scandal exposed serious
ethical violations in the collection and use of personal
data for political purposes
• The case study highlights the importance of data
privacy, user consent, and transparency in data-driven
practices
• Stricter regulations, oversight, and ethical guidelines
are necessary to prevent the misuse of personal data
and protect user rights
Summary Report
• Increased public awareness and digital literacy are
crucial for empowering users to make informed
decisions about their personal data
• The incident serves as a wake-up call for the tech
industry to prioritize data ethics and rebuild user
trust through responsible data practices
Thank you
