Lec 20 Pip Pandas Plotly
Assignment Updates
Stuff due soon:
- HW7 Due 5/15
- Capstone Project Presentations 5/16 and 5/17
- Quiz due Thursday
Slides
- Section 1, 2, 3, 4, 5, 6
Resources
- Download: StudentsPerformance.csv
Notes:
Pip
- Pip is a package manager for Python
- A package manager is a tool that automates the process of installing, updating, and removing packages
- A package is a collection of code files that are bundled together (e.g., a library or module)
- Pip is included with Python 3.4 and above
- To check if you have pip installed, open a terminal and type
pip --version
- You may need to type
pip3 --version
instead (if you have both Python 2 and 3 installed). If this is the case, you will need to use pip3 instead of pip for the rest of the commands
- To install a package, open a terminal and type
pip install <package name>
- This will download the package from the Python Package Index (PyPI) and install it on your computer
- To search for a package, go to https://pypi.org/ and type the name of the package in the search bar
- It will show you the package name, a short description, the latest version, and the command to install it
- To see a list of all installed packages, type
pip list
- It may be a long list, but it's alphabetical so you can scroll through it to find the package you're looking for
- Let's install pandas and plotly:
pip install pandas plotly
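To confirm from inside Python that the install worked, one option is to ask the standard library's importlib.metadata module for each package's installed version. This is only a minimal sketch: the helper name installed_version is made up for illustration, and it assumes Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on the two packages used in these notes
for name in ("pandas", "plotly"):
    version = installed_version(name)
    if version is None:
        print(f"{name} is not installed; try: pip install {name}")
    else:
        print(f"{name} {version} is installed")
```

If a package shows up as missing here but pip list shows it, the script is probably running under a different Python than the one pip installed into.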
Pandas
- Pandas is a data analysis library for data structured in tables (like csv files!)
- It has a lot of built-in functions for manipulating and analyzing data
- Official documentation: https://pandas.pydata.org/pandas-docs/stable/
- To use pandas, we need to import it into our Python script
import pandas as pd
- The as pd part is optional, but it's a common convention to use pd as the alias for pandas
- Pandas has two main data structures: Series and DataFrame
- A Series is a one-dimensional array of indexed data
- A DataFrame is a two-dimensional array of indexed data
- It's like a table in a spreadsheet
- Each row is a record (or observation)
- Each column is a variable (or feature)
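To make the two structures concrete, here is a small sketch that builds a Series and a DataFrame by hand. The rows are made up for illustration and are not taken from any real file.

```python
import pandas as pd

# A Series: one column of values, each with an index label (0, 1, 2 by default)
math_scores = pd.Series([72, 69, 90], name="math score")

# A DataFrame: several columns that share the same index
# (made-up rows, for illustration only)
students = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "math score": [72, 69, 90],
})

print(math_scores)
print(students)
print(students.shape)  # (number of rows, number of columns)

# Pulling one column out of a DataFrame gives back a Series
print(type(students["math score"]))
```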
- Let's load a csv file into a DataFrame
- On the website is a csv file called StudentsPerformance.csv
- It contains data about students' test scores (see the StudentsPerformance.csv Notes at the end of this page)
- Download the file and save it in the same folder as your Python script
- To load the csv file into a DataFrame, use the read_csv() function (after importing pandas)
df = pd.read_csv('StudentsPerformance.csv')
- The df variable is a DataFrame object (it's a table of data)
- To see the first 5 rows of the DataFrame, use the head() method
print(df.head())
(or df.head(12) to see the first 12 rows)
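If you want to try read_csv() and head() before downloading anything, the sketch below reads csv text from a string instead of a file. The three rows are made up (they only mimic a few of the file's columns), and io.StringIO is used here as a stand-in for the real file.

```python
import io

import pandas as pd

# Made-up rows in the same layout as a few of the file's columns (not real data)
csv_text = """gender,math score,reading score,writing score
female,72,72,74
male,47,57,44
female,90,95,93
"""

# read_csv() accepts a file path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))         # first 2 rows
print(list(df.columns))   # the column names, spaces included
```

With the real file, the only change is passing 'StudentsPerformance.csv' to read_csv() instead of the StringIO object.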
- Now, let's analyze the data
- To start, let's describe the data
print(df.describe())
- This will give us some basic statistics about the data
- We can see the count, mean, standard deviation, min, max, and the quartiles for each numeric column (by default, describe() skips columns of strings)
- The 25% is a value such that 25% of the data is less than that value. This is also called the first quartile (Q1).
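As a small check on what the 25% row means, this sketch compares it with the quantile() method on five made-up scores. By default pandas interpolates between data points when a quartile falls between two values, so on small data the result is an estimate of the cut-off.

```python
import pandas as pd

# Five made-up scores, for illustration only
scores = pd.Series([40, 50, 60, 70, 80], name="math score")

summary = scores.describe()
print(summary)

# The "25%" row of describe() is the same number as quantile(0.25)
q1 = scores.quantile(0.25)
print(q1)  # 50.0
print(summary["25%"] == q1)
```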
- We can get a single column from the DataFrame using the column name
print(df['math score'])
- This will print the math score column
- We can also get multiple columns by passing in a list of column names
print(df[['math score', 'reading score']])
- To get the mean of a column, use the mean() method
- WARNING: This will only work on numeric columns (can't get the mean of a column of strings)
print(df['math score'].mean())
- This will print the mean of the math score column
- We can also get the mean of multiple columns
print(df[['math score', 'reading score']].mean())
- Other methods include: median(), mode(), min(), max(), std(), var(), sum(), count()
- Feel free to try them out or look them up online
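The column selections and summary methods above can be tried on a tiny table first. The rows below are made up for illustration; numeric_only=True is a real option of DataFrame.mean() that skips string columns, shown here as one way around the warning above.

```python
import pandas as pd

# Made-up rows, for illustration only
df = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math score": [72, 69, 90, 47],
    "reading score": [72, 90, 95, 57],
})

math_mean = df["math score"].mean()                      # one column -> one number
both_means = df[["math score", "reading score"]].mean()  # several columns -> a Series
math_median = df["math score"].median()
math_max = df["math score"].max()

# numeric_only=True skips string columns such as gender
numeric_means = df.mean(numeric_only=True)

print(math_mean, math_median, math_max)
print(both_means)
print(numeric_means)
```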
- We can also filter the rows
- We can get all the rows where the math score is greater than 90
greater_than_90_df = df[df['math score'] > 90]
- This says: Get all rows where the math score column is greater than 90
- We can get all rows where the gender is female
female_df = df[df['gender'] == "female"]
- This says: Get all rows where the gender column's value is equal to the string "female"
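It can help to see what the comparison inside the brackets produces on its own. In this sketch (made-up rows, for illustration only) the comparison is stored in a variable named mask, which is a hypothetical name, before it is used to pick rows.

```python
import pandas as pd

# Made-up rows, for illustration only
df = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female"],
    "math score": [72, 95, 91, 47, 88],
})

# The comparison alone produces a Series of True/False values, one per row
mask = df["math score"] > 90
print(mask)

# Indexing the DataFrame with that mask keeps only the True rows
greater_than_90_df = df[mask]
female_df = df[df["gender"] == "female"]

print(greater_than_90_df)
print(len(female_df))
```

The filtered results are DataFrames too, so methods like mean() and describe() work on them the same way.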
Plotly
Finally, let's visualize the data using Plotly Express
- Plotly is a data visualization library
- It has a lot of built-in functions for creating charts and graphs
- Official documentation: https://plot.ly/python/ (it's a bit confusing, but it has a lot of examples)
To use Plotly, we need to import it into our Python script
import plotly.express as px
- The as px part is optional, but it's a common convention to use px as the alias for plotly.express
Plotly creates charts and graphs using DataFrame objects
- This is why we needed to use pandas to load the csv file into a DataFrame
Let's create a histogram chart of the math scores
- Plotly has a function called histogram() that creates a histogram chart
- It takes:
data_frame: The DataFrame object (with the data to plot)
x: The column name for the x-axis
- Example:
fig = px.histogram(data_frame=df, x='math score')
- This will create a histogram chart of the math scores
After creating the chart, we need to display it
To display the chart, use the show() method
Example:
fig.show()
It will open a new tab in your browser to display the chart. This will take a few seconds to load.
Full Example:
import pandas as pd
import plotly.express as px
# Load the data
df = pd.read_csv('StudentsPerformance.csv')
# Create the chart
fig = px.histogram(data_frame=df, x='math score')
# Show the chart
fig.show()
We can do the same thing for the reading scores and writing scores:
fig = px.histogram(data_frame=df, x='reading score')
fig = px.histogram(data_frame=df, x='writing score')
We can use different colors to represent each gender's scores
- Add the keyword parameter color="gender" to the px.histogram() function to set the colors based on the values of the gender column
- There are 2 values ("male" and "female") so there will be 2 colors
- When displaying the chart, it will also show a legend with the colors and the values they represent
- Clicking the color in the legend will hide/show the data for that value
Conclusion
Pip: A package manager for Python. Lets you install, update, and remove packages (e.g., libraries and modules)
Pandas: A data analysis library for data structured in tables (like csv files!). It has a lot of built-in functions for manipulating and analyzing data.
Plotly: A data visualization library. It has a lot of built-in functions for creating charts and graphs.
You can now use pandas to load csv files into DataFrames and analyze the data. You can also use plotly to create charts and graphs of the data.
StudentsPerformance.csv Notes
Download: StudentsPerformance.csv
- This is a csv file containing data about students' test scores
- Source: https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams
- This is from Kaggle, a website for data science and machine learning. It has a lot of datasets that you can use for practice
- I think this dataset was generated (not real data) based on the discussion(s) associated with the dataset
- The file contains the following columns (be aware that some column names have spaces in them):
- gender (String): "male" or "female" based on the student's gender
- race/ethnicity (String): A string containing the word "group" and a letter (e.g., "group A", "group B", etc.). They correspond to different race/ethnicity groups
- parental level of education (String): The student's parent's level of education (e.g., "some high school", "bachelor's degree", "associate's degree", etc.)
- lunch (String): Whether the student gets free/reduced lunch or not. "standard" means they don't get free/reduced lunch, "free/reduced" means they do get free/reduced lunch.
- test preparation course (String): Whether the student completed a test preparation course or not. "none" means they didn't complete a test preparation course, "completed" means they did complete a test preparation course.
- math score (Integer): The student's math score (out of 100) on an exam
- reading score (Integer): The student's reading score (out of 100) on an exam
- writing score (Integer): The student's writing score (out of 100) on an exam