The data is stored in a CSV format. This dataset includes eights categorical variables: This dataset includes 8 categorical variables:

workclass education marital occupation relationship race sex native_country

And, 6 continuous variables:

age fnlwgt education_num capital_gain capital_loss hours_week

In this tutorial, you will learn:

Import CSV in Pandas Pandas read_csv() Method Pandas groupby() Method

Pandas read_csv() Method

To import a CSV dataset, you can use the object pd.read_csv(). The basic argument inside is:

Pandas read_csv() Syntax

pandas.read_csv(filepath_or_buffer,sep=’, ‘,names=None,index_col=None,skipinitialspace=False)

filepath_or_buffer: Path or URL with the data sep=’, ‘: Define the delimiter to use names=None: Name the columns. If the dataset has ten columns, you need to pass ten names index_col=None: If yes, the first column is used as a row index skipinitialspace=False: Skip spaces after delimiter.

For more information about read_csv(), please check the official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Pandas read_csv() Example

Import csv

import pandas as pd

Define path data

COLUMNS = [‘age’,‘workclass’, ‘fnlwgt’, ’education’, ’education_num’, ‘marital’, ‘occupation’, ‘relationship’, ‘race’, ‘sex’, ‘capital_gain’, ‘capital_loss’, ‘hours_week’, ’native_country’, ’label’] PATH = “https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" df_train = pd.read_csv(PATH, skipinitialspace=True, names = COLUMNS, index_col=False) df_train.shape

Output:

(32561, 15)

Pandas groupby() Method

An easy way to see the data is to use the groupby method. This method can help you to summarize the data by group. Below is a list of methods available with groupby() method:

count: count min: min max: max mean: mean median: median standard deviation: sdt etc

Inside groupby(), you can use the column you want to apply the method. Let’s have a look at a single grouping with the adult dataset. You will get the mean of all the continuous variables by type of revenue, i.e., above 50k or below 50k:

df_train.groupby([’label’]).mean()

You can get the minimum of age by type of household:

df_train.groupby([’label’])[‘age’].min() label <=50K 17

50K 19 Name: age, dtype: int64

You can also group by multiple columns. For instance, you can get the maximum capital gain according to the household type and marital status.

df_train.groupby([’label’, ‘marital’])[‘capital_gain’].max() label marital
<=50K Divorced 34095 Married-AF-spouse 2653 Married-civ-spouse 41310 Married-spouse-absent 6849 Never-married 34095 Separated 7443 Widowed 6849

50K Divorced 99999 Married-AF-spouse 7298 Married-civ-spouse 99999 Married-spouse-absent 99999 Never-married 99999 Separated 99999 Widowed 99999 Name: capital_gain, dtype: int64

You can create a plot following groupby. One way to do it is to use a plot after the grouping. To create a more excellent plot, you will use unstack() after mean() so that you have the same multilevel index, or you join the values by revenue lower than 50k and above 50k. In this case, the plot will have two groups instead of 14 (2*7). If you use Jupyter Notebook, make sure to add %matplotlib inline, otherwise, no plot will be displayed

%matplotlib inline df_plot = df_train.groupby([’label’, ‘marital’])[‘capital_gain’].mean().unstack() df_plot

Summary

To import a CSV dataset in Pandas, you can use the object pd.read_csv(). The groupby() method can help you to summarize the data by group. You can also group by multiple columns. For instance, you can get the maximum capital gain according to the household type and marital status.