Pandas - Few Concepts

Posted on April 26, 2016

pandas, data manupulation, data analayis, data handling

Python Pandas QA Series

What is pandas

Python data analysis library. Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

A library that provides:

data analysis
data manipulation
data visualization

How do I read tabular data into pandas

Tabular data: data in form of rows and columns; spread sheet, csv data

import pandas as pd


# pd.read_table('data/chipotle.tsv')
# can also read directly from the url
orders = pd.read_table("http://bit.ly/chiporders")

orders.head()

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	$2.39
1	1	1	Izze	[Clementine]	$3.39
2	1	1	Nantucket Nectar	[Apple]	$3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	$2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	$16.98

read_table by default takes seperator as tab and assumes first row as header

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table("http://bit.ly/movieusers", sep='|', header=None, names=user_cols)

users.head()

	user_id	age	gender	occupation	zip_code
0	1	24	M	technician	85711
1	2	53	F	other	94043
2	3	23	M	writer	32067
3	4	24	M	technician	43537
4	5	33	F	other	15213

Tip : read_table method has lots of parameters that can be twicked. skiprows and skipfooter are quite useful in skipping some descriptions or comments in the data file.

How do I select a panda Series from a DataFrame

There are two basic objects in pandas that hold data.

DataFrame

DataFrame - It basically is a table of rows and columns

Each of these columns is called a panda’s series

import pandas as pd


# pd.read_table("http://bit.ly/uforeports", sep=",")
ufo = pd.read_csv("http://bit.ly/uforeports")
ufo.head()

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00
3	Abilene	NaN	DISK	KS	6/1/1931 13:00
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00

type(ufo)

pandas.core.frame.DataFrame

ufo['City']

0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
5                 Valley City
6                 Crater Lake
7                        Alma
8                     Eklutna
9                     Hubbard
10                    Fontana
11                   Waterloo
12                     Belton
13                     Keokuk
14                  Ludington
15                Forest Home
16                Los Angeles
17                  Hapeville
18                     Oneida
19                 Bering Sea
20                   Nebraska
21                        NaN
22                        NaN
23                  Owensboro
24                 Wilderness
25                  San Diego
26                 Wilderness
27                     Clovis
28                 Los Alamos
29               Ft. Duschene
                 ...         
18211                 Holyoke
18212                  Carson
18213                Pasadena
18214                  Austin
18215                El Campo
18216            Garden Grove
18217           Berthoud Pass
18218              Sisterdale
18219            Garden Grove
18220             Shasta Lake
18221                Franklin
18222          Albrightsville
18223              Greenville
18224                 Eufaula
18225             Simi Valley
18226           San Francisco
18227           San Francisco
18228              Kingsville
18229                 Chicago
18230             Pismo Beach
18231             Pismo Beach
18232                    Lodi
18233               Anchorage
18234                Capitola
18235          Fountain Hills
18236              Grant Park
18237             Spirit Lake
18238             Eagle River
18239             Eagle River
18240                    Ybor
Name: City, dtype: object

type(ufo['City'])

pandas.core.series.Series

Another way of selecting series is to user dot notation (attribute selection)

ufo.City

0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
5                 Valley City
6                 Crater Lake
7                        Alma
8                     Eklutna
9                     Hubbard
10                    Fontana
11                   Waterloo
12                     Belton
13                     Keokuk
14                  Ludington
15                Forest Home
16                Los Angeles
17                  Hapeville
18                     Oneida
19                 Bering Sea
20                   Nebraska
21                        NaN
22                        NaN
23                  Owensboro
24                 Wilderness
25                  San Diego
26                 Wilderness
27                     Clovis
28                 Los Alamos
29               Ft. Duschene
                 ...         
18211                 Holyoke
18212                  Carson
18213                Pasadena
18214                  Austin
18215                El Campo
18216            Garden Grove
18217           Berthoud Pass
18218              Sisterdale
18219            Garden Grove
18220             Shasta Lake
18221                Franklin
18222          Albrightsville
18223              Greenville
18224                 Eufaula
18225             Simi Valley
18226           San Francisco
18227           San Francisco
18228              Kingsville
18229                 Chicago
18230             Pismo Beach
18231             Pismo Beach
18232                    Lodi
18233               Anchorage
18234                Capitola
18235          Fountain Hills
18236              Grant Park
18237             Spirit Lake
18238             Eagle River
18239             Eagle River
18240                    Ybor
Name: City, dtype: object

Note: Dot notation doesn’t work in case series name is space seperated or there is already a builtin type attribute, method with same name. Fall back to previous notation.

creating a new Series in a DataFrame

ufo['Location'] = ufo['City'] + ", " + ufo['State']
ufo.head()

	City	Colors Reported	Shape Reported	State	Time	Location
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00	Ithaca, NY
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00	Willingboro, NJ
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00	Holyoke, CO
3	Abilene	NaN	DISK	KS	6/1/1931 13:00	Abilene, KS
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00	New York Worlds Fair, NY

ufo['City'][0]

'Ithaca'

Better practice is to use loc method

ufo.loc[0, 'City']

'Ithaca'

Why do some pandas commands ends with parantheses, and other commands don’t

import pandas as pd


movies = pd.read_csv("http://bit.ly/imdbratings")

movies.head()

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....

Note: ‘describe’ method of DataFrame describe the all series which has numerical values

movies.describe()

	star_rating	duration
count	979.000000	979.000000
mean	7.889785	120.979571
std	0.336069	26.218010
min	7.400000	64.000000
25%	7.600000	102.000000
50%	7.800000	117.000000
75%	8.100000	134.000000
max	9.300000	242.000000

movies.shape

(979, 6)

movies.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

movies.describe(include=['object'])

	title	content_rating	genre	actors_list
count	979	976	979	979
unique	975	12	16	969
top	The Girl with the Dragon Tattoo	R	Drama	[u'Daniel Radcliffe', u'Emma Watson', u'Rupert...
freq	2	460	278	6

Above, describe methods only describes the Series of types mentioned in include list

%matplotlib inline
movies.plot.scatter('duration', 'star_rating', c='green')

<matplotlib.axes.AxesSubplot at 0x7f29cbe5cbd0>