Daily coding: day 2 of 66
Pandas data cleaning
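1. Filling in missing values
a. df['Test Score'] = df['Test Score'].fillna(df['Test Score'].interpolate())
interpolate() fills each missing value by linear interpolation, so a single gap becomes the average of the values just before and after it. Wrapping it in fillna() gives the same result as calling df['Test Score'].interpolate() directly.
b. df["score"] = df["score"].fillna(df["score"].mean())
Fills missing values with the column mean.
c. df = df.ffill()
Fills each missing value with the previous value (forward fill). Older code writes this as df.fillna(method="pad"), which recent pandas versions deprecate.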
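To compare the three fill strategies side by side, here is a minimal sketch. The sample values are made up for illustration; only the column name "score" comes from the notes above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two gaps.
df = pd.DataFrame({"score": [10.0, np.nan, 30.0, np.nan, 50.0]})

# a. Linear interpolation: each single gap becomes the midpoint of its neighbours.
interpolated = df["score"].interpolate()

# b. Mean fill: every gap becomes the mean of the non-missing values (30.0 here).
mean_filled = df["score"].fillna(df["score"].mean())

# c. Forward fill: every gap copies the previous value.
forward_filled = df["score"].ffill()

comparison = pd.DataFrame(
    {"interpolate": interpolated, "mean": mean_filled, "ffill": forward_filled}
)
print(comparison)
```

Which one fits depends on the data: interpolation and forward fill assume the row order is meaningful (for example a time series), while the mean ignores order.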
2. Non-standard missing values
isnull() only detects real missing values such as NaN (and None). It does not pick up placeholders such as a dash ('-'), a blank, or the string 'na'. Convert those to NaN first:
df = df.replace(["-", " ", "na"], np.nan)
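A small sketch of the effect, assuming a made-up column where the placeholders are stored as strings:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: three placeholder strings that isnull() cannot see.
raw = pd.DataFrame({"score": ["90", "-", " ", "na", "75"]})

missing_before = int(raw["score"].isnull().sum())

# Turn the placeholders into real NaN values.
cleaned = raw.replace(["-", " ", "na"], np.nan)

missing_after = int(cleaned["score"].isnull().sum())

print(missing_before, missing_after)
```

The replacement matches whole cell values only, so a string like "n/a" or a cell with two spaces would need its own entry in the list.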
3. Creating a dataframe from lists or from scratch
a. df = pd.DataFrame()
Months = ["jan", "feb", "mar", "may"]
Days = [1, 2, 3, 5]
df["month"] = Months
df["Days"] = Days
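As an alternative sketch, the same dataframe can be built in one step by passing a dict of lists to the constructor. The variable names here are lowercase versions of the ones above.

```python
import pandas as pd

months = ["jan", "feb", "mar", "may"]
days = [1, 2, 3, 5]

# Each dict key becomes a column name; the lists must have the same length.
df = pd.DataFrame({"month": months, "Days": days})

print(df)
```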
4. Renaming dataframe columns
df = df.rename(columns={'id_x': 'purchase_id', 'id_y': 'customer_id', 'id': 'product_id'})
rename() returns a new dataframe and leaves the original unchanged, so assign the result back.
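A quick check that rename() does not modify the dataframe in place, using a made-up one-row dataframe with the column names from the example above:

```python
import pandas as pd

# Hypothetical data; only the column names come from the notes.
df = pd.DataFrame({"id_x": [1], "id_y": [2], "id": [3]})

renamed = df.rename(
    columns={"id_x": "purchase_id", "id_y": "customer_id", "id": "product_id"}
)

print(list(df.columns))       # original columns are untouched
print(list(renamed.columns))  # the returned copy has the new names
```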
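5. Counting data conversion errors
print(pd.to_datetime(combinedData['purch_date'], errors='coerce').isnull().value_counts())
print(pd.to_datetime(combinedData['purch_date'], errors='coerce'))
errors="coerce" ... replaces each value that fails to parse with NaT (the missing value for datetimes). To drop those rows, use dropna afterwards. Note that the True count also includes values that were already missing before conversion.
errors="raise" ... raises an exception that reports the value where the error occurred. This is the default.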
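The two modes can be sketched as follows. The dates are invented for illustration; combinedData and purch_date are the names used above.

```python
import pandas as pd

# Hypothetical data with one value that cannot be parsed as a date.
combinedData = pd.DataFrame(
    {"purch_date": ["2021-01-05", "not a date", "2021-03-17"]}
)

# errors="coerce": unparseable values become NaT instead of stopping the conversion.
parsed = pd.to_datetime(combinedData["purch_date"], errors="coerce")
n_failed = int(parsed.isnull().sum())

# Keep only the rows that converted successfully.
cleaned = combinedData.assign(purch_date=parsed).dropna(subset=["purch_date"])

# errors="raise": the conversion stops with an exception at the bad value.
try:
    pd.to_datetime(combinedData["purch_date"], errors="raise")
    raised = False
except ValueError as exc:
    raised = True
    print("conversion failed:", exc)

print(n_failed, len(cleaned), raised)
```

Counting with coerce first and dropping afterwards makes it easy to see how many rows would be lost before committing to the cleanup.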




