Manipulating Data with Python

Intro

What we’ll cover

  • Brief intro to data analytics

  • Why Python in data science

  • Loops and if-statements in Python

  • Working with Files in Python

Data Analytics

What is Data Analytics?

The process of exploring data, formulating business and technical questions, and answering those questions through data visualization

Process

  • Collaborate and communicate with business and engineering teams on open questions

  • Obtain data and conduct data cleansing and initial exploration

  • Explore and analyze data using statistical methods

  • Create visualizations of data and propose answers to questions

  • Present any additional insights or learnings from exploration

Technical skills

  • Data cleansing: use code to “clean” dataset

    • Remove incomplete or erroneous data points

    • Ensure that data is in a consistent format

  • Data exploration: use code to look at shape of data

    • Look at basic stats, such as mean, median, mode

    • Data size and any other notable characteristics of dataset

  • Data analysis: use code to answer questions with data

    • Predict how users will behave / make decisions

    • Look at relationships between data points

    • Find patterns

  • Data Visualization: use code to generate graphs, charts, and other visuals

    • Create histograms, pie charts, and line plots

    • Show patterns, relationships
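The cleansing and exploration steps above can be sketched with only the standard library. The ratings below are made-up sample data, and using None to mark a missing value is an assumption for illustration.

```python
from statistics import mean, median, mode

# Made-up sample data: None marks a missing rating (an assumption for this sketch)
raw_ratings = [4.5, 3.0, None, 4.5, 2.0, None, 5.0]

# Data cleansing: keep only the complete data points
ratings = []
for r in raw_ratings:
    if r is not None:
        ratings.append(r)

# Data exploration: size of the data and basic stats
print("count: ", len(ratings))
print("mean:  ", mean(ratings))
print("median:", median(ratings))
print("mode:  ", mode(ratings))
```

Real projects usually reach for pandas for this kind of work, but the idea is the same: clean first, then summarize.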

Soft skills

  • Work with non-technical teams to understand business needs and open questions

  • Presentation skills to be able to communicate results of analysis and help influence business decisions

  • Domain experience is a nice-to-have

    • If working in bio-tech, experience with biology

    • If working in health care tech, experience with health insurance or health care

    • Etc.

Why Python

About Python

  • Created in early 90s - mature and solid community

  • Creator Guido van Rossum had strong opinions about what made a language easy to learn and read

  • Open source - continuing to evolve and improve even today

  • Extremely popular - among the most well-loved and highest-paid languages

Zen of Python

$ python3
>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

General purpose

  • Python can be used for

    • Back End Web Development

    • Data visualization

    • Data analysis

    • Automation

    • … and more!

  • Great for collaboration with other teams

Readable Syntax

  • Python is well known for its emphasis on readability

  • It’s a pleasure to both write and read Python

Why Python?

Perhaps the most compelling reason to use Python for data analysis:

Python has an extensive suite of open-source, easy-to-get-started tools for data science

Built-In Python Data Tools

  • Useful “abstract” data structures such as sets, tuples

  • Easy text parsing syntax and string methods

  • Easy to read, create, and modify files

  • Modules for parsing JSON and CSV files

  • The statistics module provides basic statistical functions (mean, median, mode); the math module covers general math functions
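A minimal sketch of a few of these built-in tools working together. The CSV text is made-up sample data, and io.StringIO is used only so the example runs without a file on disk.

```python
import csv
import io
import json

# Made-up CSV text; io.StringIO lets us treat a string like an open file
csv_text = "name,rating\nRubber Duck,4.5\nToy Boat,3.0\n"

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    rows.append(row)              # each row maps header names to string values

# A set keeps only unique values
unique_names = set()
for row in rows:
    unique_names.add(row["name"])

# json.dumps turns Python lists and dicts into a JSON string
print(json.dumps(rows))
print(unique_names)
```

Note that csv hands every value back as a string, so numbers need converting with int() or float() before doing math.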

External Python Data Tools

  • Data Exploration & Analysis: pandas, scipy, numpy, statsmodels

  • Data Collection & Processing: scrapy, nltk

  • Data Visualization: matplotlib, plotly, bokeh, seaborn

Jupyter Notebooks

  • Python tool which is best explained as “Google Docs for code”

  • Allows you to create a document that shows Python code

    • And you can run it in the same place

    • And you can easily share it with others

  • Provides ways of showing graphs & charts as well!

  • Very popular in academic research, data science

Python Basics

Hello world

print("Hello world")

Code comments in Python begin with # (octothorpe)

# can comment here

print("Hello world")    # can also comment here

Variables & Data Types

  • Similar to JavaScript, but without var, let, or const

age = 22              # integer

movie_rating = 4.5    # float

name = "Rubber Duck"  # string

likes_to_code = True  # boolean

hobby = None          # NoneType
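If you are ever unsure what type a value has, the built-in type() function will tell you. A quick sketch using the same variables:

```python
age = 22
movie_rating = 4.5
name = "Rubber Duck"
likes_to_code = True
hobby = None

print(type(age))            # <class 'int'>
print(type(movie_rating))   # <class 'float'>
print(type(name))           # <class 'str'>
print(type(likes_to_code))  # <class 'bool'>
print(type(hobby))          # <class 'NoneType'>
```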

If-statements

Here’s a basic if-statement

rating = 4.5

if rating > 3:
    print("good")

Here’s an if-else statement

rating = 4.5

if rating > 3:
    print("good")

else:
    print("bad")

Here’s an if-elif-else statement

rating = 4.5

if rating > 3:
    print("good")

elif rating == 3:
    print("okay")

else:
    print("bad")
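Branches are checked from top to bottom, and only the first one that matches runs. A small sketch of that rule; the label variable and the cutoffs are made up for illustration.

```python
rating = 4.5

if rating > 4:
    label = "great"
elif rating > 3:      # only checked when rating > 4 was False
    label = "good"
else:
    label = "bad"

print(label)          # great
```

Even though 4.5 is also greater than 3, the elif branch is skipped because the first branch already matched.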

Whitespace Rules

In Python, indentation is meaningful

This works

rating = 4.5

if rating > 3:
    print("good")

This will raise an IndentationError

rating = 4.5

if rating > 3:
print("good")

This is great for readability, but it’s a little harder to write correctly than JavaScript, where blocks are marked with {} curly braces

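Blank lines between the branches of an if-statement do not change its meaning when the code runs from a file, so use them sparingly for readability

This works

rating = 4.5

if rating > 3:
    print("good")    # no line after
elif rating == 3:
    print("okay")    # one line after

else:
    print("bad")

This also runs from a file, but the extra blank lines make it harder to read. In the interactive interpreter (>>>), a blank line ends the block, so the elif and else would raise a SyntaxError there

rating = 4.5

if rating > 3:
    print("good")    # two lines after


elif rating == 3:
    print("okay")


else:
    print("bad")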

Lists

  • At a high level, almost identical to arrays in JavaScript

  • Ordered collection of items

  • Must use square brackets and commas

fruits = ["apple", "berry", "cherry"]

nums = [1, 2, 3]

random_stuff = ["apple", 4, True, 5.1]
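A short sketch of the most common list operations: reading an item by position, counting items, and adding a new one. The extra fruit is made up for illustration.

```python
fruits = ["apple", "berry", "cherry"]

print(fruits[0])        # first item; indexes start at 0
print(fruits[-1])       # negative indexes count from the end
print(len(fruits))      # number of items

fruits.append("date")   # add an item to the end of the list
print(fruits)
```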

Loops & Lists

Can loop over lists

fruits = ["apple", "berry", "cherry"]

for fruit in fruits:
    print(fruit)

fruit here could be any word, but fruit is probably the best choice

fruits = ["apple", "berry", "cherry"]

for kjrjfskwlsnacx in fruits:
    print(kjrjfskwlsnacx)

  • The indentation under your for loop is important.

  • This will cause an IndentationError

fruits = ["apple", "berry", "cherry"]

for fruit in fruits:
print(fruit)            # missing indentation!
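Loops and if-statements are often combined. A minimal sketch that adds up a list and picks out the even numbers; the variable names are made up for illustration.

```python
nums = [1, 2, 3, 4, 5, 6]

total = 0
evens = []
for num in nums:
    total = total + num      # runs once for every item
    if num % 2 == 0:         # nested block: indented one more level
        evens.append(num)

print(total)    # 21
print(evens)    # [2, 4, 6]
```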

Working with Files

Cupcake Order Data

orders.csv

Mon,40,red velvet,$130.00
Mon,10,vanilla,$30.50
Tues,53,chocolate sprinkle,$170.00
Wed,40,vanilla,$73.00
Wed,80,vanilla,$240.00
Wed,15,chocolate sprinkle,$45.00
Thurs,40,red velvet,$73.00
Fri,40,vanilla,$73.00
Fri,20,vanilla,$35.00

Format is day of order, num cupcakes, flavor, order total
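Each line can be broken into its four fields with string methods. A sketch using the first row of the file; splitting on commas is safe here only because no field contains a comma.

```python
line = "Mon,40,red velvet,$130.00\n"

fields = line.strip().split(",")            # remove the newline, then split on commas
day = fields[0]
num_cupcakes = int(fields[1])               # text -> whole number
flavor = fields[2]
order_total = float(fields[3].lstrip("$"))  # drop the $ sign, then text -> decimal

print(day, num_cupcakes, flavor, order_total)
```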

Opening Files

Can open a file in Python

open_file = open("orders.csv")

Loop Over a File

Can loop over each line of a file

open_file = open("orders.csv")

for line in open_file:
    print(line)

Good practice to close file at end of program

open_file.close()
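A common alternative is the with statement, which closes the file for you when the block ends. The sketch below first writes a two-line sample file to a temporary folder so it can run on its own; that setup step is an assumption, since normally orders.csv would already exist.

```python
import os
import tempfile

# Setup for this sketch only: write a small sample file to a temporary folder
path = os.path.join(tempfile.mkdtemp(), "orders.csv")
with open(path, "w") as out_file:
    out_file.write("Mon,40,red velvet,$130.00\nMon,10,vanilla,$30.50\n")

lines = []
with open(path) as open_file:            # closed automatically after the block
    for line in open_file:
        lines.append(line.rstrip("\n"))  # drop the newline each line ends with
        print(lines[-1])

print(open_file.closed)                  # True
```

Stripping the newline also avoids the blank line that print(line) would otherwise show between rows.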

Filter a File

  • Only print Monday orders (lines that start with “Mon”)

open_file = open("orders.csv")

for line in open_file:

    if line.startswith("Mon"):
        print(line)
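Filtering pairs naturally with adding things up. A sketch that totals Monday's cupcakes and revenue with the csv module; the first three rows of orders.csv are inlined with io.StringIO so the example runs without the file, and the variable names are made up for illustration.

```python
import csv
import io

# First three rows of orders.csv, inlined so the sketch runs on its own
orders_text = """Mon,40,red velvet,$130.00
Mon,10,vanilla,$30.50
Tues,53,chocolate sprinkle,$170.00
"""

monday_cupcakes = 0
monday_revenue = 0.0
# csv.reader splits each line into a list; the four names unpack its four fields
for day, count, flavor, total in csv.reader(io.StringIO(orders_text)):
    if day == "Mon":
        monday_cupcakes += int(count)
        monday_revenue += float(total.lstrip("$"))

print(monday_cupcakes, monday_revenue)   # 50 160.5
```

With the real file, io.StringIO(orders_text) would be replaced by the opened file.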

The End