User Guide
data-duper is a tool to replicate the structure of private or protected data for testing.
Installation
The source code is currently hosted on GitHub at: https://github.com/kjanker/data-duper.
Binary installers for the latest released version are available at the [Python Package Index (PyPI)](https://pypi.org/project/data-duper).
```shell
pip install data-duper
```
How to use the Duper
data-duper follows the scikit-learn model paradigm:
1. You create a new Duper object.
2. fit it to a pandas DataFrame.
3. (Optionally) inspect or customize the fit.
4. make a new synthetic DataFrame of desired shape.
In the example below, we load public energy data ([source](https://data.open-power-system-data.org/renewable_power_plants/)) as a pandas DataFrame and create a synthetic data frame with data-duper. The duped data holds the same columns and similar values as the original data.
```python
import pandas as pd
from duper import Duper

# load data from https://data.open-power-system-data.org/renewable_power_plants/
data_file = "https://data.open-power-system-data.org/renewable_power_plants/2020-08-25/renewable_power_plants_SE.csv"
df_real = pd.read_csv(data_file, parse_dates=["commissioning_date"])

# replicate data with duper
duper = Duper()
duper.fit(df_real)
df_dupe = duper.make(size=10000)
print(df_dupe.head())
```
Fitting to the data
When fitting the Duper to a data frame, we use find_best_generator
to derive and configure the best method to dupe each column. This approach considers the data type
and values of each column, and creates a Generator to dupe
that column’s data.
Note that we currently do not account for relations between columns, but consider each column separately.
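To make the idea of picking one generator per column more concrete, here is a minimal sketch of such a selection step. It is not the actual find_best_generator implementation: the function name `choose_generator_kind`, the `max_categories` threshold, the order of the checks, and the returned labels are all assumptions chosen for illustration.

```python
from datetime import date
from numbers import Number


def choose_generator_kind(values, max_categories=3):
    """Return a label for the kind of generator that might suit a column.

    Hypothetical heuristic, NOT data-duper's real selection logic.
    """
    # Missing values are ignored when deciding on the generator kind.
    present = [v for v in values if v is not None]
    unique = set(present)

    if len(unique) <= 1:
        return "constant"   # a single value of any type
    if len(unique) <= max_categories:
        return "category"   # few different values of any type
    if all(isinstance(v, date) for v in present):
        return "datetime"   # datetime is a subclass of date
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in present):
        return "numeric"
    if all(isinstance(v, str) for v in present):
        return "regex"      # many different strings, e.g. ids
    return "category"       # fallback if nothing else fits


# Each column is inspected on its own; relations between columns are ignored.
columns = {
    "country": ["SE", "SE", "SE", "SE", "SE"],
    "technology": ["Wind", "Solar", "Wind", "Hydro", "Wind"],
    "capacity": [0.8, 2.0, 3.3, 1.5, 2.3],
    "commissioning_date": [date(2001, 5, 1), date(2005, 3, 9), date(2010, 7, 21),
                           date(2012, 1, 15), date(2018, 11, 30)],
    "plant_id": ["SE-0001", "SE-0002", "SE-0003", "SE-0004", "SE-0005"],
}
kinds = {name: choose_generator_kind(col) for name, col in columns.items()}
print(kinds)
```

The point of the sketch is only that the decision depends on both the data type and the observed values of a single column.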
NA values in the data are aggregated to an na_rate.
- Constant represents columns that only hold a single value of any type. See Constant generator.
- Category is designed for data with few different values of any type. This is also the fallback generator if no other fits. See Category generator.
- Numerical data is fitted to an empirical distribution function. The original data type and granularity of the values are taken into account. However, we currently do not capture continuous index-like numbers. ID-like numbers of a fixed length can be cast as strings to be duped via regex. See Numeric generator.
- Datetime data is currently fitted similarly to numerical data. Hence, it works great for unordered dates and times but does not capture continuous index-like timestamps of a certain frequency. See Datetime generator.
- Regex helps to replicate ids, serial numbers, and other special strings using a regular expression. See Regex generator.
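The numerical case can be illustrated with a small, self-contained sketch of sampling from an empirical distribution while reproducing the share of missing values. This is a simplified illustration using only the standard library, not the code of the Numeric generator: the function names `fit_empirical` and `sample_empirical` are made up here, and rounding to the original data type and granularity is left out.

```python
import random


def fit_empirical(values):
    """Return the sorted non-missing values and the share of missing values."""
    present = sorted(v for v in values if v is not None)
    na_rate = 1 - len(present) / len(values)
    return present, na_rate


def sample_empirical(sorted_values, na_rate, size, rng):
    """Draw `size` values by inverse transform sampling on the sorted sample."""
    n = len(sorted_values)
    out = []
    for _ in range(size):
        # Reproduce missing values with the observed rate.
        if rng.random() < na_rate:
            out.append(None)
            continue
        # Pick a uniform position along the sorted sample and interpolate
        # linearly between the two neighbouring observations.
        pos = rng.random() * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo]))
    return out


rng = random.Random(0)  # seeded only to make the sketch repeatable
capacity = [0.8, 2.0, None, 3.3, 1.5]
sorted_values, na_rate = fit_empirical(capacity)
synthetic = sample_empirical(sorted_values, na_rate, size=1000, rng=rng)
print(synthetic[:5])
```

Because every draw is independent of row order, a sketch like this reproduces the value distribution but not a running index or a fixed-frequency time series, which matches the limitation described in the list above.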
Inspect and edit generators
You might want to inspect your Duper after fitting it to a data set. This is done
using the attributes columns, dtypes, and generators. Alternatively, you can get
and set a column's generator directly with the column name in square brackets. This can also be used
to add new columns manually.
```python
import pandas as pd
from duper import Duper
from duper.generators import Constant

# load data from https://data.open-power-system-data.org/renewable_power_plants/
data_file = "https://data.open-power-system-data.org/renewable_power_plants/2020-08-25/renewable_power_plants_SE.csv"
df_real = pd.read_csv(data_file, parse_dates=["commissioning_date"])

# replicate data with duper
duper = Duper()
duper.fit(df_real)

# inspect duper
print(duper.columns)
print(duper.dtypes)
print(duper.generators)

# inspect a specific column generator
duper["commissioning_date"]

# set/overwrite specific generators, or add a new column
duper["manufacturer"] = Constant(value="Umbrella Corp.")
duper["my_new_column"] = Constant(value=True)
```
Make synthetic dataset
A fitted Duper can be used to make a new random data frame that replicates the original one. Since the data is generated randomly, the number of rows (size)
can be set freely, for example with duper.make(size=10000).
Voila, you have created a new data frame that replicates the original structure.