Check Duplicated Row

This transform finds duplicate rows in a dataset, comparing either all columns or a specified list of columns, and returns a Boolean value for each row. True indicates that the row repeats a previous row, and False indicates that the row is unique so far.
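As a rough illustration of the behavior described above, the sketch below uses pandas `DataFrame.duplicated`. It assumes the transform behaves like that pandas call, which seems plausible because pandas is its only listed requirement, but this page does not confirm it. The sample data and column values are made up.

```python
import pandas as pd

# Hypothetical sample data; the row at index 2 repeats the row at index 0.
df = pd.DataFrame({
    "carname": ["audi", "bmw", "audi", "volvo"],
    "carheight": [54.3, 52.0, 54.3, 56.1],
})

# With no subset given, all columns are compared. The first occurrence is
# marked False and each later repeat is marked True (pandas default keep="first").
is_duplicate = df.duplicated()
print(is_duplicate.tolist())  # [False, False, True, False]
```

Under this assumption, the first occurrence of a repeated row is never flagged; only its later copies are.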

tags: ["EDA"]

Parameters

The following list briefly describes each parameter of the Check Duplicated Row transform.

Name:

By default, the transform name is populated. You can also add a custom name for the transform.

Input Dataset:

The file name of the input dataset. Select a previously uploaded dataset from the drop-down list to check it for duplicate rows. (Required: True, Multiple: False)

Output Dataset:

The file name under which the output dataset is created. It contains the Boolean series indicating whether each row is a duplicate. (Required: True, Multiple: False)

subset:

The column in which to search for duplicate values. You can provide only one column name in this field. (Required: True, Multiple: True, Datatypes: ['ANY'], Options: ["FIELDS"], Datasets: ["df"])
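To show what restricting the check to one column means, here is a minimal pandas sketch. It assumes the subset parameter corresponds to the `subset` argument of `DataFrame.duplicated`, which this page does not state explicitly. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data: every row is distinct overall,
# but two rows share the same carheight value.
df = pd.DataFrame({
    "carname": ["audi", "bmw", "volvo"],
    "carheight": [54.3, 54.3, 56.1],
})

# Comparing only the carheight column flags the second 54.3 as a duplicate.
by_height = df.duplicated(subset=["carheight"])

# Comparing all columns finds no duplicates, because carname differs.
all_columns = df.duplicated()

print(by_height.tolist())    # [False, True, False]
print(all_columns.tolist())  # [False, False, False]
```

The same dataset can therefore give different results depending on which column is checked.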

Sample input for Check Duplicated Row transform:

../../../_images/duplicaterows_input.png

The output after running the Check Duplicated Row transform on the dataset appears as below:

../../../_images/duplicaterows_output.png

How to use it in Notebook

The following is the code snippet you must use in the Jupyter Notebook editor to run the Check Duplicated Row transform:

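# Assumes project, the dataset objects, TemplateV2, and Transform
# are already available in the notebook session.

# Look up the transform template by name.
template = TemplateV2.get_template_by('Check Duplicated Row')

# Create a recipe over the project's datasets.
recipe_Check_Duplicated_Row = project.addRecipe([car_data, employee_data, temperature_data, only_numeric], name='Check Duplicated Row')

# Configure the transform.
transform = Transform()
transform.templateId = template.id
transform.name = 'Check Duplicated Row'
transform.variables = {
    'input_dataset': 'car',
    'output_dataset': 'car_duplicated',
    'value_1': "carheight",  # column to check for duplicates
}

# Add the transform to the recipe and run it.
recipe_Check_Duplicated_Row.add_transform(transform)
recipe_Check_Duplicated_Row.run()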

Requirements

pandas