# Dataset for "Physics-informed Gaussian process regression for particle-tracking data assimilation"

[10.5258/SOTON/D3642](https://doi.org/10.5258/SOTON/D3642)

This documentation describes the datasets and MATLAB scripts supporting the publication

J. M. Lawson. Physics informed Gaussian process regression for particle tracking data assimilation. Phys. Rev. Fluids 2026, 00:004900, 2026. doi:10.1103/zvm4-wtkq.

## Dataset Descriptions

The **Homogeneous Isotropic Turbulence (HIT)** dataset consists of statistics of reconstructions of synthetic Particle Tracking Velocimetry (PTV) data derived from a Direct Numerical Simulation (DNS) at a Taylor-microscale Reynolds number of $Re_\lambda = 110$. This dataset was designed to evaluate the ability of physics-informed Gaussian Process Regression (GPR) to leverage statistical isotropy and learn velocity correlation functions from sparse and noisy observations. The data include cases with different seeding concentrations and levels of additive Gaussian measurement noise. Sample reconstructed fields are provided on a $65 \times 65 \times 65$ element Cartesian grid.

The **Turbulent Channel Flow (TCF)** dataset consists of statistics of reconstructions of synthetic PTV data generated from a DNS of a fully developed turbulent channel at a friction Reynolds number of $Re_\tau = 180$. Unlike the HIT case, this dataset presents the challenge of a non-stationary, wall-bounded flow in which the velocity statistics vary with distance from the wall. The reconstruction is performed on a uniform Cartesian grid of $33 \times 67 \times 17$ vectors, with a resolution of approximately $2.9\delta_\nu$. These files contain PTV data downsampled from the original simulation by factors of 2, 8, and 32. Only one level of additive measurement noise is tested, corresponding to an RMS error of 1% of the bulk velocity.

The **Square Prism Wake (SPW)** dataset is based on experimental PTV data captured in the wake of a square prism at a Reynolds number of approximately 7850. To test the effect of seeding concentration on reconstruction accuracy, we downsampled the original dataset to retain 10%, 50%, or 90% of the original source data ("dsXX"), with the remainder used for cross-validation. Flow fields are reconstructed on a grid of $38 \times 61 \times 16$ vectors.

## Dataset Structure

The data is stored in `.mat` files (named `statistics.mat` or `vicplus_statistics.mat`). Each file contains the following key variables:

| Variable | Type | Description |
| --- | --- | --- |
| **Coordinates** | | |
| `x1vec`, `x2vec`, `x3vec` | `double` | 1D vectors defining the grid coordinates for the three spatial axes. |
| `x1m`, `x2m`, `x3m` | `double` | 3D meshgrid arrays of the grid coordinates. |
| `domain_lim` | `double` | $2 \times 3$ matrix defining the physical limits [min, max] of the domain. |
| **Flow Fields** | | |
| `ui_recon` | `double` | An example of a reconstructed velocity field. This is a 4D array `[x,y,z,comp]`. |
| `ui_ref` | `single` | (Synthetic only) The ground-truth velocity field from DNS used for validation. |
| `ui_error` | `single` | (Synthetic only) The point-wise error between the reconstruction and the reference. |
| `ui_uncertainty` | `double` | The GPR-predicted standard deviation (uncertainty) of the reconstructed velocity field. |
| **Statistics** | | |
| `uiuj_recon` | `double` | Reconstructed Reynolds stresses $\langle u'_i u'_j \rangle$. |
| `uiuj_ref` | `single` | (Synthetic only) Reference Reynolds stresses from ground truth. |
| `PSD_recon` / `PSD_ref` | `double/single` | Power Spectral Density of the reconstructed and reference fields. |
| `Ui_recon` / `Ui_mean` | `double` | Mean velocity field averaged over many samples of the flow. |
| **Uncertainty Validation** | | |
| `z_score_hist` | `double` | Histogram of the standardized error (z-score) used to validate the GP uncertainty model. |
| `z_score_bins` / `edges` | `double` | Bin centers and edges for the z-score histogram. |
| `z_score_pdf` | `double` | The probability density function of the calculated z-scores. |
| **Metadata** | | |
| `flow` | `struct` | Contains physical constants (e.g., `nu`, `U0`) and scales (e.g., `eta`, `delta_nu`, `H`). |
| `dset` / `simname` | `char` | Strings identifying the specific simulation or experimental case. |
| `n_p_recon` / `rho_p_recon` | `double` | The number of particles and seeding density ($\rho_p$) in the reconstruction. |

## Example Usage

To visualize the results and reproduce the figures from the publication, use the provided MATLAB scripts `plot_hit.m`, `plot_tcf.m`, and `plot_spw.m`. You can toggle specific analyses by setting the corresponding boolean flags (e.g., `b_plot_PSD = true`) at the top of each script. In the SPW and TCF scripts, the `dset_id` variable allows you to switch between different dataset sizes or seeding concentrations.
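
For a quick look at a single file without running the full plotting scripts, the following is a minimal sketch rather than part of the distributed code. It assumes that `stats_file` (a hypothetical path) points at one of the `statistics.mat` files, that `x1m` and `x2m` have the same size as the first three dimensions of `ui_recon`, and that the standardized error is the point-wise error divided by the predicted standard deviation. The last point is the usual definition of a z-score, but it has not been checked against how `z_score_hist` was actually computed.

```matlab
% Hypothetical path: replace with the location of any statistics.mat file.
stats_file = 'statistics.mat';

% Load into a struct so the file's variables do not overwrite the workspace.
S = load(stats_file);

% List the variables that are actually present in this file.
% Fields marked "Synthetic only" are expected to be missing for SPW.
disp(fieldnames(S));

% Extract the first velocity component on the middle plane of the third axis.
k_mid = ceil(size(S.ui_recon, 3) / 2);
u1_slice = S.ui_recon(:, :, k_mid, 1);

% Plot against the stored meshgrid coordinates (assumed to match ui_recon in size).
figure;
pcolor(S.x1m(:, :, k_mid), S.x2m(:, :, k_mid), u1_slice);
shading flat;
axis equal tight;
colorbar;
xlabel('x_1');
ylabel('x_2');
title('First velocity component of ui\_recon, mid-plane');

% For synthetic cases, form a z-score from the error and predicted uncertainty.
% Assumption: z = error / predicted standard deviation, element by element.
if isfield(S, 'ui_error') && isfield(S, 'ui_uncertainty')
    z = double(S.ui_error) ./ S.ui_uncertainty;
    fprintf('z-score mean: %.3f, standard deviation: %.3f\n', ...
        mean(z(:), 'omitnan'), std(z(:), 'omitnan'));
end
```

If the uncertainty model is well calibrated, the printed mean and standard deviation should be close to 0 and 1 respectively. Loading into a struct with `S = load(...)` also makes it easy to compare several cases side by side, since each file keeps its own namespace.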