The data is stored in a CSV format. This dataset includes eights categorical variables: This dataset includes 8 categorical variables:
workclass education marital occupation relationship race sex native_country
And, 6 continuous variables:
age fnlwgt education_num capital_gain capital_loss hours_week
In this tutorial, you will learn:
Import CSV in Pandas Pandas read_csv() Method Pandas groupby() Method
Pandas read_csv() Method
To import a CSV dataset, you can use the object pd.read_csv(). The basic argument inside is:
Pandas read_csv() Syntax
pandas.read_csv(filepath_or_buffer,sep=’, ‘,names=None
,index_col=None
,skipinitialspace=False
)
filepath_or_buffer: Path or URL with the data
sep=’, ‘: Define the delimiter to use
names=None
: Name the columns. If the dataset has ten columns, you need to pass ten names
index_col=None
: If yes, the first column is used as a row index
skipinitialspace=False
: Skip spaces after delimiter.
For more information about read_csv(), please check the official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Pandas read_csv() Example
Import csv
import pandas as pd
Define path data
COLUMNS = [‘age’,‘workclass’, ‘fnlwgt’, ’education’, ’education_num’, ‘marital’, ‘occupation’, ‘relationship’, ‘race’, ‘sex’, ‘capital_gain’, ‘capital_loss’, ‘hours_week’, ’native_country’, ’label’] PATH = “https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" df_train = pd.read_csv(PATH, skipinitialspace=True, names = COLUMNS, index_col=False) df_train.shape
Output:
(32561, 15)
Pandas groupby() Method
An easy way to see the data is to use the groupby method. This method can help you to summarize the data by group. Below is a list of methods available with groupby() method:
count: count min: min max: max mean: mean median: median standard deviation: sdt etc
Inside groupby(), you can use the column you want to apply the method. Let’s have a look at a single grouping with the adult dataset. You will get the mean of all the continuous variables by type of revenue, i.e., above 50k or below 50k:
df_train.groupby([’label’]).mean()
You can get the minimum of age by type of household:
df_train.groupby([’label’])[‘age’].min() label <=50K 17
50K 19 Name: age, dtype: int64
You can also group by multiple columns. For instance, you can get the maximum capital gain according to the household type and marital status.
df_train.groupby([’label’, ‘marital’])[‘capital_gain’].max()
label marital
<=50K Divorced 34095
Married-AF-spouse 2653
Married-civ-spouse 41310
Married-spouse-absent 6849
Never-married 34095
Separated 7443
Widowed 6849
50K Divorced 99999 Married-AF-spouse 7298 Married-civ-spouse 99999 Married-spouse-absent 99999 Never-married 99999 Separated 99999 Widowed 99999 Name: capital_gain, dtype: int64
You can create a plot following groupby. One way to do it is to use a plot after the grouping. To create a more excellent plot, you will use unstack() after mean() so that you have the same multilevel index, or you join the values by revenue lower than 50k and above 50k. In this case, the plot will have two groups instead of 14 (2*7). If you use Jupyter Notebook, make sure to add %matplotlib inline, otherwise, no plot will be displayed
%matplotlib inline df_plot = df_train.groupby([’label’, ‘marital’])[‘capital_gain’].mean().unstack() df_plot
Summary
To import a CSV dataset in Pandas, you can use the object pd.read_csv(). The groupby() method can help you to summarize the data by group. You can also group by multiple columns. For instance, you can get the maximum capital gain according to the household type and marital status.