Manipulating Data with Python¶
Intro¶
What we’ll cover¶
Brief intro to data analytics
Why Python in data science
Loops and if-statements in Python
Working with Files in Python
Data Analytics¶
What is Data Analytics?¶
The process of exploring data, formulating business and technical questions, and answering those questions through data visualization
Process¶
Collaborate and communicate with business and engineering teams on open questions
Obtain data and conduct data cleansing and initial exploration
Explore and analyze data using statistical methods
Create visualizations of data and propose answers to questions
Present any additional insights or learnings from exploration
Technical skills¶
Data cleansing: use code to “clean” dataset
Remove incomplete or erroneous data points
Ensure that data is in a consistent format
Data exploration: use code to look at shape of data
Look at basic stats, such as mean, median, mode
Data size and any other notable characteristics of dataset
Data analysis: use code to answer questions with data
Predict how users will behave / make decisions
Look at relationships between data points
Find patterns
Data Visualization: use code to generate graphs, charts, and other visuals
Create histograms, pie charts, and line plots
Show patterns, relationships
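The cleansing and exploration steps above can be sketched with nothing but the standard library. The ratings below are made-up sample data, and treating None as an "incomplete" data point is an assumption for illustration.

```python
import statistics

# Made-up sample data: movie ratings, where None marks an incomplete entry
raw_ratings = [4.5, None, 3.0, 5.0, None, 3.0, 4.0]

# Data cleansing: keep only the complete data points
ratings = []
for rating in raw_ratings:
    if rating is not None:
        ratings.append(rating)

# Data exploration: size of the dataset and basic stats
count = len(ratings)
mean = statistics.mean(ratings)
median = statistics.median(ratings)
mode = statistics.mode(ratings)

print(count, mean, median, mode)
```

Real projects usually reach for pandas for this kind of work, but the idea is the same: clean first, then summarize.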
Soft skills¶
Work with non-technical teams to understand business needs and open questions
Presentation skills to be able to communicate results of analysis and help influence business decisions
Domain experience is a nice-to-have
If working in bio-tech, experience with biology
If working in health care tech, experience with health insurance or health care
Etc.
Why Python¶
About Python¶
Created in the early 1990s - a mature language with a solid community
Its creator, Guido van Rossum, had strong opinions about what makes a language easy to learn and read
Open source - continuing to evolve and improve even today
Extremely popular - among the most well-loved and highest-paid languages
Zen of Python¶
$ python3
>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
General purpose¶
Python can be used for
Back End Web Development
Data visualization
Data analysis
Automation
… and more!
Great for collaboration with other teams
Readable Syntax¶
Python is well known for its emphasis on readability
It’s a pleasure to both write and read Python
Why Python?¶
Perhaps the most compelling reason to use Python for data analysis:
Python has an extensive suite of open-source tools for data science that are easy to get started with
Built-In Python Data Tools¶
Useful “abstract” data structures such as sets, tuples
Easy text parsing syntax and string methods
Easy to read, create, and modify files
Modules for parsing JSON and CSV files
The statistics module provides basic statistical functions (mean, median, mode); the math module covers general-purpose math functions
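A short sketch of several of these built-in tools working together. The sample text and variable names are made up for illustration; io.StringIO is used only so the csv module can read a string as if it were a file.

```python
import csv
import io
import json

# Made-up CSV text; io.StringIO lets the csv module read it like a file
csv_text = "name,flavor\nAda,vanilla\nGrace,chocolate\nAlan,vanilla\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# A set keeps only the unique values
flavors = {row["flavor"] for row in rows}

# A tuple groups a fixed number of related values
first_order = (rows[0]["name"], rows[0]["flavor"])

# String methods make text clean-up short
shouted = rows[0]["name"].upper()

# The json module turns Python data into JSON text
as_json = json.dumps(rows[0])

print(sorted(flavors))
print(first_order)
print(shouted)
print(as_json)
```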
External Python Data Tools¶
Data Exploration & Analysis: pandas, scipy, numpy, statsmodels
Data Collection & Processing: scrapy, nltk
Data Visualization: matplotlib, plotly, bokeh, seaborn
Jupyter Notebooks¶
Python tool which is best explained as “Google Docs for code”
Allows you to create a document that shows Python code
And you can run it in the same place
And you can easily share it with others
Provides ways of showing graphs & charts as well!
Very popular in academic research, data science
Python Basics¶
Hello world¶
print("Hello world")
Code comments in Python begin with # (octothorpe)
# can comment here
print("Hello world") # can also comment here
Variables & Data Types¶
Similar to JavaScript, but without var, let, or const
age = 22 # integer
movie_rating = 4.5 # float
name = "Rubber Duck" # string
likes_to_code = True # boolean
hobby = None # NoneType
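To check which type a value has, the built-in type() function can be used. A small sketch reusing the variables above:

```python
age = 22
movie_rating = 4.5
name = "Rubber Duck"
likes_to_code = True
hobby = None

# type() reports the data type of a value; __name__ gives its short name
print(type(age).__name__)            # int
print(type(movie_rating).__name__)   # float
print(type(name).__name__)           # str
print(type(likes_to_code).__name__)  # bool
print(type(hobby).__name__)          # NoneType
```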
If-statements¶
Here’s a basic if-statement
rating = 4.5
if rating > 3:
    print("good")
Here’s an if-else statement
rating = 4.5
if rating > 3:
    print("good")
else:
    print("bad")
Here’s an if-elif-else statement
rating = 4.5
if rating > 3:
    print("good")
elif rating == 3:
    print("okay")
else:
    print("bad")
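To try the same branches with several values, the if-elif-else statement can be wrapped in a function. This is only a sketch: describe_rating is a hypothetical helper name, and functions are not otherwise covered here.

```python
def describe_rating(rating):
    """Return a label for a rating, using the same thresholds as above."""
    if rating > 3:
        return "good"
    elif rating == 3:
        return "okay"
    else:
        return "bad"


print(describe_rating(4.5))  # good
print(describe_rating(3))    # okay
print(describe_rating(1.5))  # bad
```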
Whitespace Rules¶
In Python, indentation is meaningful
This works
rating = 4.5
if rating > 3:
    print("good")
This will raise an IndentationError
rating = 4.5
if rating > 3:
print("good")
This is great for readability, but it is a little harder to get right than JavaScript, where blocks are marked with {} curly braces
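Blank lines between the branches of an if-statement are allowed when the code runs from a .py file or a notebook cell; keep it to at most one for readability
This works
rating = 4.5
if rating > 3:
    print("good") # no blank line after
elif rating == 3:
    print("okay") # one blank line after

else:
    print("bad")
At the interactive >>> prompt, however, a blank line ends the whole if-statement, so the elif that follows raises a SyntaxError
rating = 4.5
if rating > 3:
    print("good") # the blank line after this ends the statement at the >>> prompt

elif rating == 3:
    print("okay")
else:
    print("bad")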
Lists¶
At a high level, almost identical to arrays in JavaScript
Ordered collection of items
Must use square brackets and commas
fruits = ["apple", "berry", "cherry"]
nums = [1, 2, 3]
random_stuff = ["apple", 4, True, 5.1]
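A few everyday list operations, sketched with the fruits list from above. The extra item "durian" is made up for illustration.

```python
fruits = ["apple", "berry", "cherry"]

first = fruits[0]        # indexing starts at 0
last = fruits[-1]        # negative indexes count from the end
fruits.append("durian")  # add an item to the end of the list
count = len(fruits)      # number of items in the list

print(first, last, count)
```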
Loops & Lists¶
Can loop over lists
fruits = ["apple", "berry", "cherry"]
for fruit in fruits:
    print(fruit)
fruit here could be any word, but fruit is probably the best choice
fruits = ["apple", "berry", "cherry"]
for kjrjfskwlsnacx in fruits:
    print(kjrjfskwlsnacx)
The indentation under your for loop is important.
This will cause an IndentationError
fruits = ["apple", "berry", "cherry"]
for fruit in fruits:
print(fruit) # missing indentation!
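Loops and if-statements are often combined, for example to count the items that meet a condition. A minimal sketch with made-up ratings:

```python
ratings = [4.5, 2.0, 3.0, 5.0]  # made-up sample data

good_count = 0
for rating in ratings:
    # The if-statement is indented one level inside the loop,
    # and its body is indented one level further
    if rating > 3:
        good_count += 1

print(good_count)
```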
Working with Files¶
Cupcake Order Data¶
orders.csv
Mon,40,red velvet,$130.00
Mon,10,vanilla,$30.50
Tues,53,chocolate sprinkle,$170.00
Wed,40,vanilla,$73.00
Wed,80,vanilla,$240.00
Wed,15,chocolate sprinkle,$45.00
Thurs,40,red velvet,$73.00
Fri,40,vanilla,$73.00
Fri,20,vanilla,$35.00
Format: day of order, number of cupcakes, flavor, order total
Opening Files¶
Can open a file in Python
open_file = open("orders.csv")
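A common alternative is the with statement, which closes the file automatically when its block ends. The sketch below first writes a small two-line orders.csv into a temporary folder so that it can run anywhere; the temporary folder is an assumption for illustration, not part of the original example.

```python
import os
import tempfile

# Create a small sample file so the sketch is self-contained
folder = tempfile.mkdtemp()
path = os.path.join(folder, "orders.csv")
with open(path, "w") as new_file:
    new_file.write("Mon,40,red velvet,$130.00\nMon,10,vanilla,$30.50\n")

# open() defaults to read mode; "with" closes the file for us
with open(path) as open_file:
    contents = open_file.read()

print(contents)
print(open_file.closed)
```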
Loop Over a File¶
Can loop over each line of a file
open_file = open("orders.csv")
for line in open_file:
    print(line)
Good practice to close file at end of program
open_file.close()
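Each line read from a file ends with a newline character, so print(line) shows an empty line after every order. A minimal sketch of stripping that newline and splitting a line into its fields; io.StringIO stands in for the opened orders.csv here so the example runs without the file on disk.

```python
import io

# Stand-in for open("orders.csv"): behaves like an open text file
open_file = io.StringIO("Mon,40,red velvet,$130.00\nMon,10,vanilla,$30.50\n")

flavors = []
for line in open_file:
    line = line.strip()  # remove the trailing newline
    day, count, flavor, total = line.split(",")  # one variable per field
    flavors.append(flavor)
    print(day, count, flavor, total)

open_file.close()
```

Note that every field comes back as a string, including the cupcake count and the order total.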
Filter a File¶
Only print Monday orders, i.e. lines that start with “Mon”
open_file = open("orders.csv")
for line in open_file:
    if line.startswith("Mon"):
        print(line)
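Filtering becomes more useful once the matching lines are turned into numbers. The sketch below totals the Monday orders; io.StringIO stands in for open("orders.csv") and holds the first four rows of the data above, and the variable names are made up for illustration.

```python
import io

# Stand-in for open("orders.csv"), using the first four rows of the data above
open_file = io.StringIO(
    "Mon,40,red velvet,$130.00\n"
    "Mon,10,vanilla,$30.50\n"
    "Tues,53,chocolate sprinkle,$170.00\n"
    "Wed,40,vanilla,$73.00\n"
)

monday_cupcakes = 0
monday_total = 0.0
for line in open_file:
    if line.startswith("Mon"):
        day, count, flavor, total = line.strip().split(",")
        monday_cupcakes += int(count)             # text to whole number
        monday_total += float(total.lstrip("$"))  # drop the leading dollar sign

open_file.close()
print(monday_cupcakes, monday_total)
```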