
Overview of the warehouse dataset
The dataset we’re working with consists of 1,000 rows of information on stock items and 10 columns.
The dataset can be downloaded using this GitHub link.
Each record has a number of attributes, such as the product name, category, quantity, price, warehouse location, supplier, last restocked date, and status.
However, the dataset needs cleaning up due to several issues, including:
- Inconsistent Text Formatting: The Product Name and Category columns contain inconsistent use of uppercase and lowercase letters, making it difficult to group similar items.
- Leading and Trailing Spaces: Several string columns have leading and trailing spaces, which can cause issues when performing operations like filtering or grouping data.
- Incorrect Data Types: The Quantity and Price columns, which should be numeric, contain text entries and are stored as strings. This prevents numerical operations and analyses.
- Invalid Values: Some entries in the Quantity, Price, and Last Restocked columns are marked as ‘NaN’, and the Quantity column even has a value recorded as ‘two hundred’.
- Date Formatting Issues: The Last Restocked column contains dates in an inconsistent format, which could lead to errors in date-related analyses.
Step-by-step guide to cleaning the dataset
1. Loading the Dataset
The first step is to load the messy dataset into a Pandas DataFrame. This allows us to examine the data and identify the issues that need to be addressed.
import pandas as pd
import numpy as np
# Load the messy data
file_path_warehouse_messy = '/path_to/warehouse_messy_data.csv'
df_warehouse_messy = pd.read_csv(file_path_warehouse_messy)
NB: Please replace the “path_to” with the actual location of the saved messy data on your device.
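Right after loading, it’s worth inspecting what Pandas actually parsed, e.g. with `df_warehouse_messy.info()` or `.dtypes`. A minimal sketch on a made-up mini-DataFrame (the values here are invented purely to mirror the problems listed above):

```python
import pandas as pd

# Hypothetical mini-DataFrame mimicking the messy file's issues (invented values)
sample = pd.DataFrame({
    'Product Name': ['  widget a', 'GADGET Y  '],   # stray spaces, mixed case
    'Quantity': ['200', 'two hundred'],              # numbers stored as text
    'Price': ['9.99', 'NaN'],                        # 'NaN' as a literal string
})

# .dtypes reveals that the numeric-looking columns were read as strings ('object')
print(sample.dtypes)
```

Seeing `object` where you expected a numeric dtype is the clue that type conversion will be needed later.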
2. Stripping Leading and Trailing Spaces
Data often contains unnecessary spaces that can lead to mismatches and errors in analysis. We’ll strip any leading or trailing spaces from all string columns.
# Strip leading and trailing spaces from string columns
for column in df_warehouse_messy.select_dtypes(include=['object']).columns:
    df_warehouse_messy[column] = df_warehouse_messy[column].str.strip()
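On a small made-up example, the same loop behaves like this:

```python
import pandas as pd

# Toy DataFrame with the kind of stray whitespace described above (invented values)
toy = pd.DataFrame({'Supplier': ['  Supplier A', 'Supplier B  ', ' Supplier C ']})

# Same pattern as in the tutorial: strip every object (string) column
for column in toy.select_dtypes(include=['object']).columns:
    toy[column] = toy[column].str.strip()

print(toy['Supplier'].tolist())  # spaces removed from both ends of each value
```

After stripping, `'Supplier A'` and `'  Supplier A'` group together instead of counting as two different suppliers.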
3. Standardising Text Formats
Inconsistent text formatting can make it difficult to group and analyse data. Here, we standardise the ‘Product Name’ and ‘Category’ columns by converting them to title case and capitalising the first letter, respectively.
# Convert Product Name to title case
df_warehouse_messy['Product Name'] = df_warehouse_messy['Product Name'].str.title()
# Correct the Category column
df_warehouse_messy['Category'] = df_warehouse_messy['Category'].str.capitalize()
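The difference between the two string methods, shown on invented sample values: `.str.title()` capitalises every word, while `.str.capitalize()` upper-cases only the first character and lower-cases the rest.

```python
import pandas as pd

# Invented sample values to contrast the two methods
names = pd.Series(['widget a', 'GADGET Y'])
categories = pd.Series(['home GOODS', 'electronics'])

print(names.str.title().tolist())            # ['Widget A', 'Gadget Y']
print(categories.str.capitalize().tolist())  # ['Home goods', 'Electronics']
```

This is why `.str.title()` suits multi-word product names, while `.str.capitalize()` is enough for single-word categories.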
4. Correcting and Converting Data Types
Data types must be correct for proper analysis. We need to replace incorrect entries, convert text-based numbers to numeric types, and ensure that dates are in the correct format.
# Correct the Quantity column
df_warehouse_messy['Quantity'] = df_warehouse_messy['Quantity'].replace('two hundred', 200) # Replace 'two hundred' with 200
df_warehouse_messy['Quantity'] = df_warehouse_messy['Quantity'].replace('NaN', np.nan) # Replace 'NaN' with np.nan
df_warehouse_messy['Quantity'] = pd.to_numeric(df_warehouse_messy['Quantity'], errors="coerce") # Convert to numeric
# Correct the Price column
df_warehouse_messy['Price'] = df_warehouse_messy['Price'].replace('NaN', np.nan) # Replace 'NaN' with np.nan
df_warehouse_messy['Price'] = pd.to_numeric(df_warehouse_messy['Price'], errors="coerce") # Convert to numeric
# Correct the Last Restocked column
df_warehouse_messy['Last Restocked'] = df_warehouse_messy['Last Restocked'].replace('NaN', np.nan) # Replace 'NaN' with np.nan
df_warehouse_messy['Last Restocked'] = pd.to_datetime(df_warehouse_messy['Last Restocked'], errors="coerce") # Convert to datetime
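A quick sketch of how the coercion behaves on made-up values: anything `pd.to_numeric` or `pd.to_datetime` cannot parse becomes `NaN`/`NaT` rather than raising an error, because of `errors="coerce"`.

```python
import pandas as pd
import numpy as np

# Invented values mirroring the Quantity column's problems
qty = pd.Series(['200', 'two hundred', 'NaN'])
qty = qty.replace('two hundred', 200).replace('NaN', np.nan)
qty = pd.to_numeric(qty, errors='coerce')
print(qty.dtype)  # float64 (the presence of NaN forces a float column)

# Unparseable date strings become NaT instead of raising
dates = pd.to_datetime(pd.Series(['2023-04-25', 'NaN']), errors='coerce')
print(dates.isna().tolist())  # [False, True]
```

The trade-off of `errors="coerce"` is that genuinely bad values are silently turned into missing values, which is why the next step handles NaN explicitly.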
5. Handling Missing Values
Missing values can distort your analysis, so it’s important to handle them appropriately. We’ll fill numeric columns with the mean, and categorical columns with the most frequent value or a placeholder.
# Fill NaN values in numeric columns; here we use the column mean
df_warehouse_messy['Quantity'] = df_warehouse_messy['Quantity'].fillna(df_warehouse_messy['Quantity'].mean())
df_warehouse_messy['Price'] = df_warehouse_messy['Price'].fillna(df_warehouse_messy['Price'].mean())
# Fill NaN values in categorical columns with a placeholder or the most frequent value
df_warehouse_messy['Product Name'] = df_warehouse_messy['Product Name'].fillna('Unknown Product')
df_warehouse_messy['Category'] = df_warehouse_messy['Category'].fillna('Unknown Category')
df_warehouse_messy['Warehouse'] = df_warehouse_messy['Warehouse'].fillna(df_warehouse_messy['Warehouse'].mode()[0])
df_warehouse_messy['Location'] = df_warehouse_messy['Location'].fillna('Unknown Location')
df_warehouse_messy['Supplier'] = df_warehouse_messy['Supplier'].fillna(df_warehouse_messy['Supplier'].mode()[0])
df_warehouse_messy['Status'] = df_warehouse_messy['Status'].fillna(df_warehouse_messy['Status'].mode()[0])
df_warehouse_messy['Last Restocked'] = df_warehouse_messy['Last Restocked'].fillna(pd.to_datetime('today'))
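The mean and mode fills can be sketched on toy Series (invented values):

```python
import pandas as pd
import numpy as np

# Invented values: one missing price, one missing status
prices = pd.Series([9.99, np.nan, 19.99])
status = pd.Series(['In Stock', np.nan, 'In Stock'])

# Numeric gap -> column mean; categorical gap -> most frequent value
prices = prices.fillna(prices.mean())
status = status.fillna(status.mode()[0])

print(prices.round(2).tolist())  # [9.99, 14.99, 19.99]
print(status.tolist())           # ['In Stock', 'In Stock', 'In Stock']
```

Mean-filling keeps the column average unchanged but can produce unrealistic in-between values (as the long decimal in the Price column below shows); a domain-appropriate default or the median are common alternatives.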
6. Saving the Cleaned Dataset
Finally, once the data has been cleaned, we save the cleaned dataset to a new CSV file.
# Save the cleaned data to a new CSV file
df_warehouse_messy.to_csv('/path_to/cleaned_warehouse_messy_data.csv', index=False)
NB: Please replace the “path_to” with the actual location where you would like the cleaned dataset to be saved on your device.
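As a quick sanity check, the write/read round trip can be verified in memory, with `io.StringIO` standing in for the file on disk (toy data with invented values):

```python
import io
import pandas as pd

# Toy stand-in for the cleaned DataFrame
cleaned = pd.DataFrame({'Product Name': ['Widget A'], 'Quantity': [200.0]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.equals(cleaned))  # True: column names and values survive the round trip
```

The same pattern with a real path confirms that `index=False` keeps the row index from being written as an extra column.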
The cleaned dataset can be downloaded using this GitHub link.
Below is just a snippet of the cleaned dataset:
Product ID | Product Name | Category | Warehouse | Location | Quantity | Price | Supplier | Status | Last Restocked |
---|---|---|---|---|---|---|---|---|---|
1102 | Gadget Y | Electronics | Warehouse 2 | Aisle 1 | 300.0 | 9.99 | Supplier C | In Stock | 2024-06-17 12:40:01.374181 |
1435 | Gadget Y | Electronics | Warehouse 2 | Aisle 4 | 200.0 | 19.99 | Supplier C | Out of Stock | 2024-06-17 12:40:01.374181 |
1860 | Widget A | Clothing | Warehouse 2 | Aisle 3 | 100.0 | 19.99 | Supplier B | In Stock | 2022-12-20 00:00:00.000000 |
1270 | Gadget Z | Toys | Warehouse 2 | Aisle 4 | 50.0 | 49.99 | Supplier B | In Stock | 2022-12-20 00:00:00.000000 |
1106 | Widget A | Furniture | Warehouse 3 | Aisle 3 | 200.0 | 9.99 | Supplier D | Out of Stock | 2023-04-25 00:00:00.000000 |
1071 | Widget B | Clothing | Warehouse 3 | Aisle 5 | 300.0 | 28.08583858764187 | Supplier A | In Stock | 2022-12-20 00:00:00.000000 |
1700 | Widget A | Clothing | Warehouse 2 | Aisle 2 | 200.0 | 49.99 | Supplier B | In Stock | 2022-12-20 00:00:00.000000 |
1020 | Widget C | Clothing | Warehouse 1 | Aisle 5 | 200.0 | 9.99 | Supplier D | Out of Stock | 2022-12-20 00:00:00.000000 |
1614 | Gadget Y | Electronics | Warehouse 3 | Aisle 3 | 300.0 | 9.99 | Supplier B | Out of Stock | 2023-03-05 00:00:00.000000 |
Summary of steps taken
- Loaded the messy dataset into a Pandas DataFrame.
- Stripped leading and trailing spaces from string columns.
- Standardised text formats in the ‘Product Name’ and ‘Category’ columns.
- Corrected and converted data types for the ‘Quantity’, ‘Price’, and ‘Last Restocked’ columns.
- Handled missing values by filling them with appropriate defaults.
- Saved the cleaned data to a new CSV file.
Conclusion
Cleaning a dataset is a crucial step in data analysis, ensuring that your data is reliable and ready for further analysis.
By following the steps outlined in this tutorial, you can effectively clean messy datasets and transform them into valuable assets for your data projects.
Python, with its powerful Pandas library, provides a robust toolkit for tackling a wide range of data cleaning tasks.
Keep practising with different datasets to hone your skills and become proficient in data cleaning.