Pandas - A Few Concepts

Posted on April 26, 2016

pandas, data manipulation, data analysis, data handling

Python Pandas QA Series

What is pandas

Python data analysis library. Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

A library that provides:

  • data analysis
  • data manipulation
  • data visualization

How do I read tabular data into pandas

Tabular data: data arranged in rows and columns, such as a spreadsheet or a CSV file

import pandas as pd


# pd.read_table('data/chipotle.tsv')
# can also read directly from the url
orders = pd.read_table("http://bit.ly/chiporders")
orders.head()
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98

By default, read_table uses a tab as the separator and treats the first row as the header

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table("http://bit.ly/movieusers", sep='|', header=None, names=user_cols)
users.head()
user_id age gender occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213

Tip: read_table has many parameters that can be tweaked. skiprows and skipfooter are quite useful for skipping descriptions or comments at the top or bottom of a data file.
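As a rough illustration of the tip above, the sketch below reads a small in-memory file instead of a real one. The file contents and the name orders_sample are made up for this example; skiprows, skipfooter and engine are existing read_table parameters.

```python
import io

import pandas as pd

# Hypothetical file contents: two description lines, a tab-separated table,
# and one trailing footer line.
raw = (
    "Exported order data\n"
    "Source: made-up example\n"
    "order_id\titem_name\tquantity\n"
    "1\tChips\t2\n"
    "2\tChicken Bowl\t1\n"
    "End of file"
)

orders_sample = pd.read_table(
    io.StringIO(raw),
    skiprows=2,       # ignore the two description lines at the top
    skipfooter=1,     # ignore the single footer line at the bottom
    engine="python",  # skipfooter is only supported by the Python parser engine
)
print(orders_sample)
```

After skipping, the first remaining line is used as the header, so the frame should contain two data rows and three columns.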

How do I select a pandas Series from a DataFrame

There are two basic objects in pandas that hold data.

  • DataFrame
  • Series

DataFrame - essentially a table of rows and columns

Series - each column of a DataFrame is a pandas Series

import pandas as pd


# pd.read_table("http://bit.ly/uforeports", sep=",")
ufo = pd.read_csv("http://bit.ly/uforeports")
ufo.head()
City Colors Reported Shape Reported State Time
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00
1 Willingboro NaN OTHER NJ 6/30/1930 20:00
2 Holyoke NaN OVAL CO 2/15/1931 14:00
3 Abilene NaN DISK KS 6/1/1931 13:00
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00
type(ufo)
pandas.core.frame.DataFrame
ufo['City']
0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
5                 Valley City
6                 Crater Lake
7                        Alma
8                     Eklutna
9                     Hubbard
10                    Fontana
11                   Waterloo
12                     Belton
13                     Keokuk
14                  Ludington
15                Forest Home
16                Los Angeles
17                  Hapeville
18                     Oneida
19                 Bering Sea
20                   Nebraska
21                        NaN
22                        NaN
23                  Owensboro
24                 Wilderness
25                  San Diego
26                 Wilderness
27                     Clovis
28                 Los Alamos
29               Ft. Duschene
                 ...         
18211                 Holyoke
18212                  Carson
18213                Pasadena
18214                  Austin
18215                El Campo
18216            Garden Grove
18217           Berthoud Pass
18218              Sisterdale
18219            Garden Grove
18220             Shasta Lake
18221                Franklin
18222          Albrightsville
18223              Greenville
18224                 Eufaula
18225             Simi Valley
18226           San Francisco
18227           San Francisco
18228              Kingsville
18229                 Chicago
18230             Pismo Beach
18231             Pismo Beach
18232                    Lodi
18233               Anchorage
18234                Capitola
18235          Fountain Hills
18236              Grant Park
18237             Spirit Lake
18238             Eagle River
18239             Eagle River
18240                    Ybor
Name: City, dtype: object
type(ufo['City'])
pandas.core.series.Series

Another way of selecting a Series is to use dot notation (attribute selection)

ufo.City
0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
5                 Valley City
6                 Crater Lake
7                        Alma
8                     Eklutna
9                     Hubbard
10                    Fontana
11                   Waterloo
12                     Belton
13                     Keokuk
14                  Ludington
15                Forest Home
16                Los Angeles
17                  Hapeville
18                     Oneida
19                 Bering Sea
20                   Nebraska
21                        NaN
22                        NaN
23                  Owensboro
24                 Wilderness
25                  San Diego
26                 Wilderness
27                     Clovis
28                 Los Alamos
29               Ft. Duschene
                 ...         
18211                 Holyoke
18212                  Carson
18213                Pasadena
18214                  Austin
18215                El Campo
18216            Garden Grove
18217           Berthoud Pass
18218              Sisterdale
18219            Garden Grove
18220             Shasta Lake
18221                Franklin
18222          Albrightsville
18223              Greenville
18224                 Eufaula
18225             Simi Valley
18226           San Francisco
18227           San Francisco
18228              Kingsville
18229                 Chicago
18230             Pismo Beach
18231             Pismo Beach
18232                    Lodi
18233               Anchorage
18234                Capitola
18235          Fountain Hills
18236              Grant Park
18237             Spirit Lake
18238             Eagle River
18239             Eagle River
18240                    Ybor
Name: City, dtype: object

Note: Dot notation doesn’t work when the Series name contains a space or clashes with an existing DataFrame attribute or method. In those cases, fall back to bracket notation.
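A minimal sketch of both cases in the note, using a small made-up frame rather than the UFO data. The column named shape is hypothetical and was chosen only because it collides with the DataFrame attribute of the same name.

```python
import pandas as pd

# Small made-up frame: one column name contains a space, another ("shape")
# collides with an existing DataFrame attribute.
df = pd.DataFrame({
    "City": ["Ithaca", "Holyoke"],
    "Colors Reported": ["RED", "GREEN"],
    "shape": ["TRIANGLE", "OVAL"],
})

print(df.City)                # dot notation works for a simple name
print(df["Colors Reported"])  # a name with a space needs brackets
print(df.shape)               # the built-in attribute (2, 3), not the column
print(df["shape"])            # brackets reach the column itself
```

Bracket notation works in every case, so it is the safer habit when column names are not under your control.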

Creating a new Series in a DataFrame

ufo['Location'] = ufo['City'] + ", " + ufo['State']
ufo.head()
City Colors Reported Shape Reported State Time Location
0 Ithaca NaN TRIANGLE NY 6/1/1930 22:00 Ithaca, NY
1 Willingboro NaN OTHER NJ 6/30/1930 20:00 Willingboro, NJ
2 Holyoke NaN OVAL CO 2/15/1931 14:00 Holyoke, CO
3 Abilene NaN DISK KS 6/1/1931 13:00 Abilene, KS
4 New York Worlds Fair NaN LIGHT NY 4/18/1933 19:00 New York Worlds Fair, NY
ufo['City'][0]
'Ithaca'

Better practice is to use the loc indexer, which takes the row label and the column label in a single step

ufo.loc[0, 'City']
'Ithaca'
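The same ideas can be tried on a tiny frame without downloading anything. This is only a sketch: the name sightings and its values are made up for illustration.

```python
import pandas as pd

# Made-up data standing in for the UFO reports.
sightings = pd.DataFrame({
    "City": ["Ithaca", "Holyoke"],
    "State": ["NY", "CO"],
})

# A new column is created with bracket notation on the left-hand side.
sightings["Location"] = sightings["City"] + ", " + sightings["State"]

# loc reads a single value from a row label and a column label.
first_city = sightings.loc[0, "City"]

# The same form can be used to assign a value.
sightings.loc[1, "State"] = "MA"

print(first_city)
print(sightings)
```

Note that Location was computed before the assignment, so it is not updated when State changes afterwards.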

Why do some pandas commands end with parentheses, and other commands don’t

import pandas as pd


movies = pd.read_csv("http://bit.ly/imdbratings")
movies.head()
star_rating title content_rating genre duration actors_list
0 9.3 The Shawshank Redemption R Crime 142 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1 9.2 The Godfather R Crime 175 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 The Godfather: Part II R Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3 9.0 The Dark Knight PG-13 Action 152 [u'Christian Bale', u'Heath Ledger', u'Aaron E...
4 8.9 Pulp Fiction R Crime 154 [u'John Travolta', u'Uma Thurman', u'Samuel L....

Note: the ‘describe’ method of a DataFrame summarizes all Series that hold numerical values

movies.describe()
star_rating duration
count 979.000000 979.000000
mean 7.889785 120.979571
std 0.336069 26.218010
min 7.400000 64.000000
25% 7.600000 102.000000
50% 7.800000 117.000000
75% 8.100000 134.000000
max 9.300000 242.000000
movies.shape
(979, 6)
movies.dtypes
star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object
movies.describe(include=['object'])
title content_rating genre actors_list
count 979 976 979 979
unique 975 12 16 969
top The Girl with the Dragon Tattoo R Drama [u'Daniel Radcliffe', u'Emma Watson', u'Rupert...
freq 2 460 278 6

Above, the describe method only describes the Series whose types are listed in include
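One way to think about the parentheses question: commands such as head and describe are methods (actions that are called, optionally with arguments), while shape and dtypes are attributes (stored facts about the object that are simply read). The sketch below shows the difference on a small made-up frame; the name films and its values are assumptions for this example, and the text columns are given the object dtype explicitly so that include=['object'] matches them.

```python
import pandas as pd

# Made-up data; text columns are stored with an explicit object dtype.
films = pd.DataFrame({
    "title": pd.Series(["A", "B", "C"], dtype="object"),
    "genre": pd.Series(["Crime", "Crime", "Action"], dtype="object"),
    "duration": [142, 175, 152],
})

# Attributes: read without parentheses.
print(films.shape)
print(films.dtypes)

# Methods: called with parentheses, optionally with arguments.
print(films.head(2))
numeric_summary = films.describe()                 # numeric columns only
text_summary = films.describe(include=["object"])  # object columns only

print(numeric_summary)
print(text_summary)
```

Leaving the parentheses off a method, for example films.describe, does not run it; it only returns a reference to the method.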

%matplotlib inline
movies.plot.scatter('duration', 'star_rating', c='green')
<matplotlib.axes.AxesSubplot at 0x7f29cbe5cbd0>
[Figure: scatter plot of star_rating against duration]