A simple tool for table data obfuscation. It reads an input table and produces an output table, that retains some properties of input, but contains different data. It allows publishing almost real production data for usage in benchmarks. It is designed to retain the following properties of data:Documentation Index
Fetch the complete documentation index at: https://private-7c7dfe99-page-updates.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
- cardinalities of values (number of distinct values) for every column and every tuple of columns;
- conditional cardinalities: number of distinct values of one column under the condition on the value of another column;
- probability distributions of the absolute value of integers; the sign of signed integers; exponent and sign for floats;
- probability distributions of the length of strings;
-
probability of zero values of numbers; empty strings and arrays,
NULLs; - data compression ratio when compressed with LZ77 and entropy family of codecs;
- continuity (magnitude of difference) of time values across the table; continuity of floating-point values;
-
date component of
DateTimevalues; - UTF-8 validity of string values;
- string values look natural.
IsMobile in your table with values 0 and 1. In transformed data, it will have the same value.
So, the user will be able to count the exact ratio of mobile traffic.
Let’s give another example. When you have some private data in your table, like user email, and you don’t want to publish any single email address.
If your table is large enough and contains multiple different emails and no email has a very high frequency than all others, it will anonymize all data. But if you have a small number of different values in a column, it can reproduce some of them.
You should look at the working algorithm of this tool works, and fine-tune its command line parameters.
This tool works fine only with at least a moderate amount of data (at least 1000s of rows).