PICurv 0.1.0
A Parallel Particle-In-Cell Solver for Curvilinear LES
Sweep and Study Guide

picurv sweep orchestrates parameter studies with generated run variants, scheduler arrays, and aggregate metrics.

1. Inputs and Templates

A sweep/study commonly uses:

  • base case/solver/monitor/post templates,
  • study.yml defining parameter combinations and metrics,
  • optional cluster scheduler settings for array submission.

Starter templates are available under examples/*/*study*.yml and examples/master_template/.

2. Core Sweep Command

./bin/picurv sweep --study <study.yml> --cluster <cluster.yml>

Optional generation-only mode:

./bin/picurv sweep --study <study.yml> --cluster <cluster.yml> --no-submit

Delayed submit from existing staged study artifacts:

./bin/picurv submit --study-dir studies/<study_id>

There is no dedicated --dry-run flag on sweep; use --no-submit for non-submitting artifact generation.

3. Study Contract Essentials

A study definition usually specifies:

  • base_configs:
    • case, solver, monitor, post paths (all required)
  • study_type:
    • one of grid_independence, timestep_independence, sensitivity
  • parameters:
    • non-empty mapping of <target>.<yaml.path> -> non-empty list of values
    • <target> must be one of case, solver, monitor, post
  • parameter_sets:
    • optional alternative to parameters for coupled overrides that should move together
    • non-empty list of explicit <target>.<yaml.path> -> scalar bundles
    • provide exactly one of parameters or parameter_sets
  • metrics (optional):
    • list of metric specs or metric names for aggregation
  • plotting (optional):
    • output controls (enabled, output_format)
  • execution (optional):
    • controls like max_concurrent_array_tasks for Slurm array throttling

Each combination yields a generated run with a fully materialized config set.

Parameter keys can target nested case/solver/monitor/post values such as:

  • case.models.physics.particles.count
  • case.run_control.dt_physical
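A minimal study.yml sketch combining the contract fields above. The field names and the parameter keys follow this guide; the concrete values and the template file names under examples/master_template/ are illustrative assumptions, not shipped defaults:

```yaml
# Hypothetical sketch of a study.yml; values are illustrative only.
base_configs:
  case: examples/master_template/case.yml      # file names assumed
  solver: examples/master_template/solver.yml
  monitor: examples/master_template/monitor.yml
  post: examples/master_template/post.yml

study_type: sensitivity                        # or grid_independence, timestep_independence

parameters:                                    # cross-product: 3 x 2 = 6 cases
  case.models.physics.particles.count: [10000, 50000, 100000]
  case.run_control.dt_physical: [0.001, 0.0005]

execution:
  max_concurrent_array_tasks: 8                # Slurm array throttling
```

Use parameter_sets instead of parameters when overrides must move together as explicit bundles; the two keys are mutually exclusive.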

Not every study should use the default msd_final metric shorthand. Cases that write other scalar diagnostics, such as logs/interpolation_error.csv, should define explicit CSV metrics instead. Search and migration characterization studies can aggregate logs/search_metrics.csv columns such as search_failure_fraction, search_work_index, re_search_fraction, or normalized run-level signals derived from lost_cumulative.

CSV metric specs also support several reductions: p95, per-row ratios via numerator_column plus denominator_column, and scalar normalization through normalize_by_parameter for observables such as the run loss fraction.
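The semantics of those reductions can be sketched generically. This is an illustration of what each reduction computes, not PICurv's implementation; the function names and the nearest-rank percentile convention are assumptions:

```python
import math


def p95(values):
    """Nearest-rank 95th percentile over a list of scalar samples (convention assumed)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank index
    return ordered[rank - 1]


def row_ratio(rows, numerator_column, denominator_column):
    """Per-row ratio metric built from two CSV columns."""
    return [row[numerator_column] / row[denominator_column] for row in rows]


def normalize_by_parameter(value, parameter_value):
    """Scalar normalization, e.g. loss fraction = lost_cumulative / particle count."""
    return value / parameter_value
```

A run-level loss fraction, for example, would divide a cumulative loss count by the swept particle-count parameter of that case.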

4. Outputs and Aggregates

Expected study outputs include:

  • studies/<study_id>/cases/case_####/ per-combination run directories
  • studies/<study_id>/scheduler/case_index.tsv
  • studies/<study_id>/scheduler/solver_array.sbatch
  • studies/<study_id>/scheduler/post_array.sbatch
  • studies/<study_id>/scheduler/metrics_aggregate.sbatch
  • studies/<study_id>/scheduler/solver_<array_jobid>_<taskid>.out/.err after submission
  • studies/<study_id>/scheduler/post_<array_jobid>_<taskid>.out/.err after submission
  • studies/<study_id>/scheduler/submission.json (when jobs are submitted)
  • studies/<study_id>/results/metrics_table.csv
  • studies/<study_id>/results/plots/* (when plotting is enabled and matplotlib is available)
  • studies/<study_id>/study_manifest.json

This keeps raw run data and comparative study diagnostics in one reproducible structure.

Metrics aggregation runs automatically as a Slurm job chained after the post-processing array (afterany dependency). If the automatic metrics job fails (e.g. because Python is unavailable on compute nodes), re-run aggregation manually with --reaggregate (Section 7).

5. Operational Workflow

Recommended workflow:

  1. run a tiny subset locally or with --no-submit,
  2. verify parameter substitution and metric extraction,
  3. launch full array, either directly with picurv sweep ... or later with picurv submit --study-dir ...,
  4. inspect aggregate outputs (auto-collected by the metrics Slurm job, or via --reaggregate),
  5. archive the exact study file with results for reproducibility.

picurv sweep is the scheduler-backed study path. For local parameter studies, repeat picurv run manually across a small set of edited case variants and compare the resulting run directories.

For fragile metrics, add smoke tests or fixture-based validation before large queue submissions.

Implementation details worth knowing:

  • case expansion uses the full cross-product over all parameters.* lists, or explicit paired bundles from parameter_sets when coupled overrides must move together.
  • generated case configs are revalidated through the same solver/post validators used by picurv run.
  • submission chain: solver array → post array (afterok) → metrics job (afterany).
  • scheduler/submission.json is the study-directory contract consumed by picurv submit --study-dir ....
  • generator/file grid external paths are rewritten to absolute paths during case materialization so they remain valid in studies/<study_id>/cases/....
  • generated solver_array.sbatch exports walltime metadata for the runtime walltime guard, while post_array.sbatch remains a plain post-processing launcher.
  • post_array.sbatch is rendered with nodes=1, ntasks_per_node=1, and a single-rank launcher command even if the solver array uses more tasks or the cluster launcher args include -n/-np.
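The two case-expansion modes in the first bullet can be sketched as follows. This is a simplified illustration of the cross-product versus paired-bundle semantics, not the actual generator; the function name is hypothetical:

```python
from itertools import product


def expand_cases(parameters=None, parameter_sets=None):
    """Expand a study into per-case override dictionaries.

    parameters: mapping of '<target>.<yaml.path>' -> list of values,
    expanded as a full cross-product over all lists.
    parameter_sets: explicit override bundles that move together.
    Exactly one of the two must be provided.
    """
    if (parameters is None) == (parameter_sets is None):
        raise ValueError("provide exactly one of parameters or parameter_sets")
    if parameter_sets is not None:
        # Coupled overrides: each bundle is one case, no cross-product.
        return list(parameter_sets)
    keys = list(parameters)
    return [dict(zip(keys, combo))
            for combo in product(*(parameters[k] for k in keys))]
```

For instance, two dt values crossed with three particle counts yields six cases, while a parameter_sets list of three bundles yields exactly three.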

6. Continuing a Partially-Completed Study

If any solver case is killed (e.g. by the walltime guard or Slurm time limit), the entire post array is cancelled (afterok dependency). Use --continue to resume the study:

./bin/picurv sweep --continue --study-dir studies/<study_id>

To override cluster resources (e.g. increase walltime):

./bin/picurv sweep --continue --study-dir studies/<study_id> \
--cluster cluster_more_time.yml

What --continue does:

  1. Reads the original study.yml and case_index.tsv from the study directory.
  2. Classifies each case as complete, partial, or empty by scanning checkpoints.
  3. If all cases are complete, auto-aggregates metrics and exits (no jobs submitted).
  4. For partial cases: updates case.yml (start_step, total_steps), sets particle restart_mode to load when a checkpoint exists, and delegates to resolve_restart_source for the full restart scenario matrix.
  5. For empty cases (no checkpoint): re-runs from scratch with unmodified control files.
  6. Submits a sparse solver array (incomplete cases only) → full post array → metrics aggregation.

Repeated continuation is safe: the target step count is always computed from the original study.yml, not from the (potentially modified) per-case case.yml.
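The classification and sparse-resubmission logic of --continue can be sketched as below. This is a conceptual model of steps 2-6, assuming the case state reduces to the latest checkpoint step; the function names are hypothetical:

```python
def classify_case(checkpoint_step, target_step):
    """Classify one case for --continue.

    checkpoint_step: latest step found among the case's checkpoints,
    or None when no checkpoint exists.
    target_step: always taken from the original study.yml, never from
    the (potentially modified) per-case case.yml, so repeated
    continuation cannot drift.
    """
    if checkpoint_step is None:
        return "empty"      # re-run from scratch, control files untouched
    if checkpoint_step >= target_step:
        return "complete"   # nothing to solve; only metrics aggregation
    return "partial"        # restart from checkpoint, adjust start_step


def continuation_plan(case_steps, target_step):
    """Sparse solver array: only cases that are not yet complete."""
    return [case_id for case_id, step in case_steps.items()
            if classify_case(step, target_step) != "complete"]
```

When every case classifies as complete, the plan is empty and the study proceeds straight to metrics aggregation without submitting jobs.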

7. Manual Metrics Re-Aggregation

If the automatic metrics Slurm job fails or you want to re-collect metrics after manual intervention:

./bin/picurv sweep --reaggregate --study-dir studies/<study_id>

This reads all case outputs, writes results/metrics_table.csv, and generates plots (if enabled in study.yml).

8. CFD Reader Guidance and Practical Use

This page describes Sweep and Study Guide within the PICurv workflow. For CFD users, the most reliable reading strategy is to map the page content to a concrete run decision: what is configured, what runtime stage it influences, and which diagnostics should confirm expected behavior.

Treat this page as both a conceptual reference and a runbook. If you are debugging, pair the method/procedure described here with monitor output, generated runtime artifacts under runs/<run_id>/config, and the associated solver/post logs so numerical intent and implementation behavior stay aligned.

What To Extract Before Changing A Case

  • Identify which YAML role or runtime stage this page governs.
  • List the primary control knobs (tolerances, cadence, paths, selectors, or mode flags).
  • Record expected success indicators (convergence trend, artifact presence, or stable derived metrics).
  • Record failure signals that require rollback or parameter isolation.

Practical CFD Troubleshooting Pattern

  1. Reproduce the issue on a tiny case or narrow timestep window.
  2. Change one control at a time and keep all other roles/configs fixed.
  3. Validate generated artifacts and logs after each change before scaling up.
  4. If behavior remains inconsistent, compare against a known-good baseline example and re-check grid/BC consistency.