Pandas - Few Concepts
Posted on April 26, 2016
pandas, data manupulation, data analayis, data handlingPython Pandas QA Series
What is pandas
Python data analysis library. Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
A library that provides:
- data analysis
- data manipulation
- data visualization
How do I read tabular data into pandas
Tabular data: data in form of rows and columns; spread sheet, csv data
import pandas as pd
# pd.read_table('data/chipotle.tsv')
# can also read directly from the url
orders = pd.read_table("http://bit.ly/chiporders")
orders.head()
order_id | quantity | item_name | choice_description | item_price | |
---|---|---|---|---|---|
0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
1 | 1 | 1 | Izze | [Clementine] | $3.39 |
2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
read_table by default takes seperator as tab and assumes first row as header
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table("http://bit.ly/movieusers", sep='|', header=None, names=user_cols)
users.head()
user_id | age | gender | occupation | zip_code | |
---|---|---|---|---|---|
0 | 1 | 24 | M | technician | 85711 |
1 | 2 | 53 | F | other | 94043 |
2 | 3 | 23 | M | writer | 32067 |
3 | 4 | 24 | M | technician | 43537 |
4 | 5 | 33 | F | other | 15213 |
Tip : read_table method has lots of parameters that can be twicked. skiprows and skipfooter are quite useful in skipping some descriptions or comments in the data file.
How do I select a panda Series from a DataFrame
There are two basic objects in pandas that hold data.
- DataFrame
DataFrame - It basically is a table of rows and columns
Each of these columns is called a panda’s series
import pandas as pd
# pd.read_table("http://bit.ly/uforeports", sep=",")
ufo = pd.read_csv("http://bit.ly/uforeports")
ufo.head()
City | Colors Reported | Shape Reported | State | Time | |
---|---|---|---|---|---|
0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 |
type(ufo)
pandas.core.frame.DataFrame
ufo['City']
0 Ithaca
1 Willingboro
2 Holyoke
3 Abilene
4 New York Worlds Fair
5 Valley City
6 Crater Lake
7 Alma
8 Eklutna
9 Hubbard
10 Fontana
11 Waterloo
12 Belton
13 Keokuk
14 Ludington
15 Forest Home
16 Los Angeles
17 Hapeville
18 Oneida
19 Bering Sea
20 Nebraska
21 NaN
22 NaN
23 Owensboro
24 Wilderness
25 San Diego
26 Wilderness
27 Clovis
28 Los Alamos
29 Ft. Duschene
...
18211 Holyoke
18212 Carson
18213 Pasadena
18214 Austin
18215 El Campo
18216 Garden Grove
18217 Berthoud Pass
18218 Sisterdale
18219 Garden Grove
18220 Shasta Lake
18221 Franklin
18222 Albrightsville
18223 Greenville
18224 Eufaula
18225 Simi Valley
18226 San Francisco
18227 San Francisco
18228 Kingsville
18229 Chicago
18230 Pismo Beach
18231 Pismo Beach
18232 Lodi
18233 Anchorage
18234 Capitola
18235 Fountain Hills
18236 Grant Park
18237 Spirit Lake
18238 Eagle River
18239 Eagle River
18240 Ybor
Name: City, dtype: object
type(ufo['City'])
pandas.core.series.Series
Another way of selecting series is to user dot notation (attribute selection)
ufo.City
0 Ithaca
1 Willingboro
2 Holyoke
3 Abilene
4 New York Worlds Fair
5 Valley City
6 Crater Lake
7 Alma
8 Eklutna
9 Hubbard
10 Fontana
11 Waterloo
12 Belton
13 Keokuk
14 Ludington
15 Forest Home
16 Los Angeles
17 Hapeville
18 Oneida
19 Bering Sea
20 Nebraska
21 NaN
22 NaN
23 Owensboro
24 Wilderness
25 San Diego
26 Wilderness
27 Clovis
28 Los Alamos
29 Ft. Duschene
...
18211 Holyoke
18212 Carson
18213 Pasadena
18214 Austin
18215 El Campo
18216 Garden Grove
18217 Berthoud Pass
18218 Sisterdale
18219 Garden Grove
18220 Shasta Lake
18221 Franklin
18222 Albrightsville
18223 Greenville
18224 Eufaula
18225 Simi Valley
18226 San Francisco
18227 San Francisco
18228 Kingsville
18229 Chicago
18230 Pismo Beach
18231 Pismo Beach
18232 Lodi
18233 Anchorage
18234 Capitola
18235 Fountain Hills
18236 Grant Park
18237 Spirit Lake
18238 Eagle River
18239 Eagle River
18240 Ybor
Name: City, dtype: object
Note: Dot notation doesn’t work in case series name is space seperated or there is already a builtin type attribute, method with same name. Fall back to previous notation.
creating a new Series in a DataFrame
ufo['Location'] = ufo['City'] + ", " + ufo['State']
ufo.head()
City | Colors Reported | Shape Reported | State | Time | Location | |
---|---|---|---|---|---|---|
0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 | Ithaca, NY |
1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 | Willingboro, NJ |
2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 | Holyoke, CO |
3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 | Abilene, KS |
4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 | New York Worlds Fair, NY |
ufo['City'][0]
'Ithaca'
Better practice is to use loc method
ufo.loc[0, 'City']
'Ithaca'
Why do some pandas commands ends with parantheses, and other commands don’t
import pandas as pd
movies = pd.read_csv("http://bit.ly/imdbratings")
movies.head()
star_rating | title | content_rating | genre | duration | actors_list | |
---|---|---|---|---|---|---|
0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
Note: ‘describe’ method of DataFrame describe the all series which has numerical values
movies.describe()
star_rating | duration | |
---|---|---|
count | 979.000000 | 979.000000 |
mean | 7.889785 | 120.979571 |
std | 0.336069 | 26.218010 |
min | 7.400000 | 64.000000 |
25% | 7.600000 | 102.000000 |
50% | 7.800000 | 117.000000 |
75% | 8.100000 | 134.000000 |
max | 9.300000 | 242.000000 |
movies.shape
(979, 6)
movies.dtypes
star_rating float64
title object
content_rating object
genre object
duration int64
actors_list object
dtype: object
movies.describe(include=['object'])
title | content_rating | genre | actors_list | |
---|---|---|---|---|
count | 979 | 976 | 979 | 979 |
unique | 975 | 12 | 16 | 969 |
top | The Girl with the Dragon Tattoo | R | Drama | [u'Daniel Radcliffe', u'Emma Watson', u'Rupert... |
freq | 2 | 460 | 278 | 6 |
Above, describe methods only describes the Series of types mentioned in include list
%matplotlib inline
movies.plot.scatter('duration', 'star_rating', c='green')
<matplotlib.axes.AxesSubplot at 0x7f29cbe5cbd0>