Daily coding: day 2 of 66

Pandas data cleaning

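1. fill in missing values:

a. df['Test Score'] = df['Test Score'].interpolate()

interpolate() fills each missing value by linear interpolation by default; for a single gap, that is the average of the values just before and after it. Wrapping the call in fillna() is not needed, because interpolate() already returns the filled column.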

b. df["score"] = df["score"].fillna(df["score"].mean())  # fill missing values with the column mean

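c. df = df.ffill()  # fill each missing value with the previous value (forward fill); fillna(method="pad") does the same but is deprecated in recent pandas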
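
To compare the three fill strategies side by side, here is a minimal sketch. The sample series and the variable names are made up for illustration; they are not from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a single gap at position 1.
scores = pd.Series([60.0, np.nan, 80.0, 100.0], name="score")

# a. linear interpolation: the gap becomes (60 + 80) / 2
interpolated = scores.interpolate()

# b. mean fill: the gap becomes the mean of the non-missing values
mean_filled = scores.fillna(scores.mean())

# c. forward fill: the gap takes the previous value
forward_filled = scores.ffill()

print(pd.DataFrame({
    "original": scores,
    "interpolate": interpolated,
    "mean": mean_filled,
    "ffill": forward_filled,
}))
```

Which one fits depends on the data: interpolation and forward fill assume the row order is meaningful (for example a time series), while the mean ignores order.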

2. non-standard missing values

isnull() only picks up real missing values such as NaN or None. It will not pick up other placeholders for missing data such as a dash ('-'), a blank (' ') or the string 'na', so convert those to NaN first:

df = df.replace(["-", " ", "na"], np.nan)
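
A small sketch of the effect, assuming a made-up "score" column that was read in as text. The counts before and after show that isnull() only sees the placeholders once they have been replaced.

```python
import numpy as np
import pandas as pd

# Hypothetical column where missing values were typed in as "-", " " and "na".
df = pd.DataFrame({"score": ["90", "-", " ", "na", "75"]})

# Before the replacement, isnull() finds nothing: they are ordinary strings.
missing_before = int(df["score"].isnull().sum())

# Turn the placeholders into real NaN values.
df = df.replace(["-", " ", "na"], np.nan)

# Now isnull() counts them.
missing_after = int(df["score"].isnull().sum())

print(missing_before, missing_after)
```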

3. creating a dataframe from lists or from scratch

a. df = pd.DataFrame()

Months = ["jan", "feb", "mar", "may"]

Days = [1, 2, 3, 5]

df["month"] = Months

df["Days"] = Days
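
The same steps as a runnable sketch. The second form, passing a dict of lists to pd.DataFrame, is an alternative added here for comparison and is not in the original notes; it should give the same table in one step.

```python
import pandas as pd

months = ["jan", "feb", "mar", "may"]
days = [1, 2, 3, 5]

# Start from an empty dataframe and add one column at a time.
df = pd.DataFrame()
df["month"] = months
df["Days"] = days

# Alternative: build the dataframe in one step from a dict of lists.
df_from_dict = pd.DataFrame({"month": months, "Days": days})

print(df)
```

All lists must have the same length, otherwise pandas raises a ValueError.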

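4. renaming dataframe columns

df = df.rename(columns={'id_x': 'purchase_id', 'id_y': 'customer_id', 'id': 'product_id'})

rename() returns a new dataframe and leaves the original unchanged, so assign the result back to df.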
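
A short sketch on a made-up one-row dataframe, showing that the original keeps its column names unless the result of rename() is assigned.

```python
import pandas as pd

# Hypothetical dataframe with the ambiguous id columns left over from a merge.
df = pd.DataFrame({"id_x": [1], "id_y": [10], "id": [100]})

# rename() returns a new dataframe; df itself is not modified.
renamed = df.rename(columns={
    "id_x": "purchase_id",
    "id_y": "customer_id",
    "id": "product_id",
})

print(list(df.columns))
print(list(renamed.columns))
```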

5. count whether the data conversion produced errors

print(pd.to_datetime(combinedData['purch_date'], errors='coerce').isnull().value_counts())

print(pd.to_datetime(combinedData['purch_date'], errors='coerce'))

errors="coerce" … replaces every value that cannot be parsed with NaT (the missing value for datetimes); to drop those rows, call dropna() afterwards.

errors="raise" … raises an exception that shows the value where the error occurs. This is the default.
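
A minimal sketch of both modes. The three sample dates are invented for illustration; only the combinedData and purch_date names come from the notes above.

```python
import pandas as pd

# Hypothetical data: the second value is not a valid date.
combinedData = pd.DataFrame(
    {"purch_date": ["2021-01-05", "not a date", "2021-03-17"]}
)

# errors="coerce": unparseable values become NaT, so they can be counted...
parsed = pd.to_datetime(combinedData["purch_date"], errors="coerce")
n_bad = int(parsed.isnull().sum())

# ...and dropped afterwards with dropna().
cleaned = combinedData.assign(purch_date=parsed).dropna(subset=["purch_date"])

# errors="raise": the first unparseable value raises a ValueError.
try:
    pd.to_datetime(combinedData["purch_date"], errors="raise")
    raised = False
except ValueError as exc:
    raised = True
    print("conversion failed:", exc)

print(n_bad, len(cleaned))
```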