# Generation

This guide explains how to generate privacy-preserving synthetic data, covering both the streamlined `AIMGenerator` interface and the lower-level calibration function.

## How Marginal-Based Generation Works

### Main Generation Algorithm

This library uses the AIM (Adaptive and Iterative Mechanism)[^1] algorithm for privacy-preserving synthetic data generation. AIM belongs to the [Select-Measure-Generate](https://differentialprivacy.org/synth-data-1/) family of algorithms, which work in three steps:

1. **Select** which statistics of the original data to preserve.
2. **Measure** those statistics with calibrated noise added. This step is where the differential privacy guarantee comes from.
3. **Generate** synthetic data that matches the noisy statistics.

The statistics that AIM preserves are called **marginals**. A marginal is a frequency table over a subset of columns. For example, suppose a dataset has columns for city, income bracket, and age group:

- A **1-way marginal** of "city" counts how many people live in each city.
- A **2-way marginal** of ("city", "income bracket") counts how many people in each city fall into each income bracket, capturing the *relationship* between city and income.
- A **3-way marginal** of ("city", "income bracket", "age group") captures the three-way interaction, for example that high-income young people concentrate in certain cities.

The number of columns in a marginal is its **degree**. Higher-degree marginals capture more complex relationships, but their number grows rapidly with the degree. AIM adaptively selects which marginals to measure, focusing the privacy budget on the marginals that matter most for the data distribution.

### Pre-Processing

To provide end-to-end privacy guarantees, we need to ensure there is no privacy leakage anywhere in the synthetic data generation pipeline, and that includes pre-processing.
The AIM algorithm expects the following data domain information as input:

- _Bounds._ The minimum and maximum values of numeric features.
- _Binning._ Marginals are inherently discrete, so numeric features have to be discretized by clustering the domain.

Both of these pre-processing procedures can leak private information: in the worst case, an adversary could learn about outliers with extreme feature values. Unless the bounds and bins are specified from domain knowledge, we need to allocate a part of the privacy budget to pre-processing, that is, a part of the total randomness used to ensure the target level of risk.

## AIMGenerator

### Parameters

#### `risk`

The `Risk` specification defining the privacy guarantee. See the [Risk Modeling](risk-modeling.md) guide for how to choose this.

#### `degree` (default: 2)

The maximum degree of marginals that AIM will try to preserve. This is the most important parameter for controlling the utility-privacy trade-off:

- **`degree=1`**: Only preserves individual column distributions. Each column's frequencies will be close to the original, but correlations between columns (e.g., "older people tend to have higher incomes") will not be captured. Uses the privacy budget most sparingly, since few statistics need measuring, but the synthetic data may be misleading for any analysis that depends on relationships between columns.
- **`degree=2`** (default): Preserves pairwise relationships between columns. Captures correlations like "city and income are related" or "age and health status are correlated." A good starting point for most datasets.
- **`degree=3`**: Preserves three-way interactions. Captures patterns like "the relationship between income and health status differs by city." Better for datasets where multi-column dependencies matter, but requires more privacy budget since there are many more three-way marginals to measure.

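
To make the notion of degree concrete, the following is a minimal standard-library sketch, not AIM's implementation. It computes exact (noise-free) marginals on made-up toy records and counts how many candidate marginals exist up to a given degree. The helper names `marginal` and `candidate_marginals` are hypothetical and exist only for this illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb

# Toy records; column names and values are made up for illustration.
records = [
    {"city": "NYC", "income": "high", "age": "young"},
    {"city": "NYC", "income": "low", "age": "old"},
    {"city": "LA", "income": "high", "age": "young"},
    {"city": "LA", "income": "high", "age": "old"},
    {"city": "SF", "income": "low", "age": "young"},
]
columns = ["city", "income", "age"]


def marginal(rows, cols):
    """Exact (noise-free) frequency table over the given columns."""
    return Counter(tuple(row[c] for c in cols) for row in rows)


def candidate_marginals(cols, max_degree):
    """All column subsets of size 1..max_degree."""
    return [
        subset
        for k in range(1, max_degree + 1)
        for subset in combinations(cols, k)
    ]


one_way = marginal(records, ["city"])
two_way = marginal(records, ["city", "income"])
print(one_way[("NYC",)])        # 2
print(two_way[("LA", "high")])  # 2

# Number of candidate marginals for 3 columns at each maximum degree.
for degree in (1, 2, 3):
    print(degree, len(candidate_marginals(columns, degree)))  # 3, then 6, then 7

# For a 20-column dataset: marginals of exactly degree 1, 2 and 3.
print(comb(20, 1), comb(20, 2), comb(20, 3))  # 20 190 1140
```

In a real run the tables are measured with calibrated noise, and AIM selects a subset of the candidates rather than measuring all of them. The counts only show how quickly the candidate pool grows as `degree` increases.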

**The trade-off:** Increasing degree captures more complex patterns but spreads the privacy budget across more marginals, adding more noise to each one. For a given privacy budget, there is a sweet spot: going too high can hurt utility because of the added noise.

#### `max_model_size` (default: 80)

Controls the maximum size (in MB) of the internal graphical model that AIM builds. A larger value enables AIM to include more marginals in the model, potentially improving fidelity. The default of 80 works well in most cases. Increasing this value may improve utility for datasets with many columns but will increase computation time and memory usage.

#### `compress` (default: True)

Enables compression in AIM, which reduces the number of marginals that need to be measured by combining related ones. This generally improves utility for a given privacy budget and should be left enabled unless you have a specific reason to disable it.

#### `proc_epsilon` (default: 0.1)

Sets the amount of randomness, expressed through the classical epsilon parameter of differential privacy, that is allocated to pre-processing procedures: estimation of domain bounds and clustering of numeric features. It is highly recommended to at least provide domain bounds whenever possible, as detailed next.

### Domain Specification

The `domain` parameter in `fit()` tells the generator the range of possible values for each column:

```python
domain = {
    # Numeric columns: specify lower and upper bounds
    "age": {"lower": 18, "upper": 100},
    "income": {"lower": 0, "upper": 500000},
    # Categorical columns: list all possible values (inferred from data if omitted)
    "city": ["NYC", "LA", "SF", "Chicago"],
}
```

For numeric columns, providing bounds avoids private preprocessing and preserves more privacy budget for generation. For categorical columns, the domain is inferred from the data if not specified.

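
Before fitting, it can help to check locally which numeric columns have no declared bounds (and would therefore rely on private preprocessing) and whether any values fall outside the declared domain. The sketch below uses only the standard library and the domain format shown above. `check_domain` and `is_numeric` are hypothetical helpers, not part of this library, and the sample rows are made up.

```python
from numbers import Real

# Domain in the format used by `fit()`; "income" is deliberately left out.
domain = {
    "age": {"lower": 18, "upper": 100},
    "city": ["NYC", "LA", "SF", "Chicago"],
}

# Made-up sample rows for illustration.
rows = [
    {"age": 25, "income": 50000, "city": "NYC"},
    {"age": 130, "income": 60000, "city": "LA"},
    {"age": 40, "income": 70000, "city": "Boston"},
]


def is_numeric(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def check_domain(rows, domain):
    """Return (numeric columns without bounds, values outside the declared domain)."""
    missing_bounds = set()
    violations = []
    for row in rows:
        for col, value in row.items():
            spec = domain.get(col)
            if is_numeric(value):
                has_bounds = isinstance(spec, dict) and "lower" in spec and "upper" in spec
                if not has_bounds:
                    missing_bounds.add(col)
                elif not spec["lower"] <= value <= spec["upper"]:
                    violations.append((col, value))
            elif isinstance(spec, list) and value not in spec:
                violations.append((col, value))
    return sorted(missing_bounds), violations


missing, bad = check_domain(rows, domain)
print(missing)  # ['income']
print(bad)      # [('age', 130), ('city', 'Boston')]
```

This check reads the raw data, so treat its output as a private sanity check and do not publish it. How the generator handles out-of-range values is not covered in this guide, so any fix (for example, widening the bounds from domain knowledge) is a decision to make before calling `fit()`.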
### Full Example

```python
import pandas as pd
from risksyn import Risk, AIMGenerator

df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "income": [50000, 60000, 70000, 80000, 90000],
    "city": ["NYC", "LA", "NYC", "SF", "LA"],
})

# "city" is omitted, so its categories are inferred from the data
domain = {
    "age": {"lower": 18, "upper": 100},
    "income": {"lower": 0, "upper": 500000},
}

risk = Risk.from_advantage(0.2)
gen = AIMGenerator(risk=risk, degree=3, max_model_size=80)
gen.fit(df, domain=domain)
synthetic_df = gen.generate(count=1000)
```

### Saving and Loading

Fitted generators can be saved to disk and loaded later:

```python
# Save
gen.store("my_generator")

# Load and generate
loaded = AIMGenerator.load("my_generator")
synthetic_df = loaded.generate(count=1000)
```

## Using Calibration Utilities with dpmm

Some users need direct control over the [dpmm](https://github.com/sassoftware/dpmm/) pipeline, for example to use a different pipeline type, customize the generation process, or integrate into an existing workflow. For these cases, `calibrate_parameters_to_risk` converts a `Risk` specification into the `(epsilon, delta)` parameters that dpmm expects. Importantly, these `(epsilon, delta)` values are *not* meant for interpreting the privacy guarantee, which is defined by the risk level passed to the calibration procedure. They are an intermediate technical device for setting the noise levels in the backend.

### Without Private Preprocessing

When you provide complete domain bounds for all numeric columns, no private preprocessing is needed. The full privacy budget goes to generation:

```python
from risksyn import Risk, calibrate_parameters_to_risk
from dpmm.pipelines import AIMPipeline

risk = Risk.from_advantage(0.2)
params = calibrate_parameters_to_risk(risk)

pipeline = AIMPipeline(
    epsilon=params["epsilon"],
    delta=params["delta"],
    gen_kwargs={"degree": 3},
)
# `df` and `domain` are the objects defined in the Full Example
pipeline.fit(df, domain)
synthetic_df = pipeline.generate(n_records=1000)
```

### With Private Preprocessing

When numeric columns lack explicit bounds, dpmm can estimate them privately.
You must reserve part of the privacy budget for this by passing `proc_epsilon`:

```python
# Reuses `risk`, `calibrate_parameters_to_risk`, and `AIMPipeline` from the previous example
params = calibrate_parameters_to_risk(risk, proc_epsilon=0.1)

pipeline = AIMPipeline(
    epsilon=params["epsilon"],
    delta=params["delta"],
    proc_epsilon=params["proc_epsilon"],
    gen_kwargs={"degree": 3},
)
pipeline.fit(df)  # bounds estimated privately
synthetic_df = pipeline.generate(n_records=1000)
```

### Preprocessing Responsibility

When using `AIMGenerator`, preprocessing is handled automatically: it detects whether numeric columns need private domain estimation and allocates the budget accordingly.

When using the calibration utilities directly, **you are responsible** for getting this right. The `proc_epsilon` values passed to `calibrate_parameters_to_risk` and to `AIMPipeline` must be consistent. The calibration function deducts the preprocessing budget from the total privacy budget before computing the generation parameters. If dpmm performs private preprocessing without the budget being accounted for in the calibration step, the overall privacy guarantee may not hold at the specified risk level.

## References

[^1]: [AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data](https://arxiv.org/abs/2201.12677). VLDB 2022.