Overview

Make your data available to the Kubernetes cluster so it can be used for training and evaluation. Dataset ingestion works the same way whether your client runs on Azure, AWS, Google Cloud, or a local Minikube setup. The data ingestor is a lightweight service that bridges your raw data and the cluster’s persistent storage. It comes with ready-made templates (CSV, images, text) that you can use as starting points and customize for your own dataset. The ingestor runs as a container: it validates data format and schema, enforces consistency, and transfers the dataset securely into the cluster’s SQL storage, where it becomes accessible to all training and evaluation jobs.
This guide covers:
  • Customizing ingestor templates for different data types (CSV, images, text)
  • Deploying the data ingestor for training and test data using Kubernetes
  • Managing datasets through the tracebloc interface
IMPORTANT: Make sure that the data format and ML task are supported and that the data standards are met by reviewing the docs. You must run the process twice: once to ingest the training data and once to ingest the test data.

Setup options

You can ingest data into your client in two ways:
  • Declarative YAML (recommended, simpler) — describe your dataset in ~8 lines of ingest.yaml, then helm install. No Dockerfile, no custom Python script. The official ingestor image runs it for you. Use this for any dataset that fits a supported category.
  • Custom Python template + Kubernetes Job (advanced) — clone the data-ingestors repo, pick a per-category template script, edit it, build and push a Docker image, then kubectl apply an ingestor-job.yaml. Use this when the declarative schema can’t express what your data needs — e.g. non-trivial preprocessing, a custom validator, or a BaseProcessor subclass.
Start with the declarative method below and drop down to the custom-template flow only if you need it. The official ingestor image is published as ghcr.io/tracebloc/ingestor.

1. Add the chart repo (one-time)

helm repo add tracebloc https://tracebloc.github.io/client
helm repo update
The tracebloc/client parent chart bootstraps the cluster (jobs-manager, MySQL, RBAC). The tracebloc/ingestor subchart submits per-dataset ingestion runs against it.
If you installed the client via the one-liner (bash <(curl -fsSL https://tracebloc.io/i.sh)), use --reset-then-reuse-values so the helm upgrade doesn’t drop the values the installer applied:
helm upgrade <workspace> tracebloc/client -n <namespace> --reset-then-reuse-values
Append --version <version-number> to pin a specific chart version.

2. Stage your data on the cluster’s shared PVC

The chart doesn’t transport data into the cluster — it points at data already accessible to the cluster’s shared PVC (client-pvc by default, mounted at /data/shared/ inside the ingestor Pod). Before installing, get your raw files there. For a single-node workspace (the default install), the PVC is backed by a host directory the installer created at ~/.tracebloc/<workspace>/data/. Drop your files into a per-dataset subdirectory:
# Host path on the machine where the tracebloc client is installed.
# Pick a <prefix> per dataset — it becomes the path you reference in ingest.yaml.
mkdir -p ~/.tracebloc/<workspace>/data/<prefix>
cp -R LOCAL_PATH/images   ~/.tracebloc/<workspace>/data/<prefix>/
cp    LOCAL_PATH/labels.csv ~/.tracebloc/<workspace>/data/<prefix>/
Inside the ingestor Pod those files appear at /data/shared/<prefix>/... — that’s what you’ll put in ingest.yaml below.
For multi-node or EKS deployments where the PVC isn’t backed by a local host path, use a throwaway kubectl cp Pod or a cloud-storage init container instead. See the client ingestor README for those recipes.
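The host-to-Pod path mapping described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the default single-node layout; pod_path is a hypothetical helper and the user, workspace, and dataset names in the example are made up.

```python
from pathlib import PurePosixPath

# Mount point of the shared PVC inside the ingestor Pod, as stated in this guide.
POD_MOUNT = PurePosixPath("/data/shared")


def pod_path(host_path: str, workspace: str, home: str) -> str:
    """Translate a staged host file path into the path the ingestor Pod sees.

    Hypothetical helper: assumes the default single-node layout
    <home>/.tracebloc/<workspace>/data/ described above.
    """
    host_root = PurePosixPath(home) / ".tracebloc" / workspace / "data"
    # relative_to() raises ValueError when the file is outside the staged directory.
    relative = PurePosixPath(host_path).relative_to(host_root)
    return str(POD_MOUNT / relative)


print(pod_path("/home/alice/.tracebloc/demo/data/cats-dogs/labels.csv", "demo", "/home/alice"))
# /data/shared/cats-dogs/labels.csv
```

The returned string is the value you would reference in ingest.yaml. A ValueError signals that the file was not staged under the workspace data directory.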

3. Write your ingest.yaml

The example below is for image_classification. Other categories require different fields — e.g. tabular_classification has no images: and instead needs a typed schema: block. Don’t copy this one blindly; grab the matching file from examples/yaml/ (one per category) and edit from there. Per-category sample data and READMEs live under templates/.
apiVersion: tracebloc.io/v1
kind: IngestConfig
category: image_classification
table: cats_dogs_train
intent: train
csv: /data/shared/cats-dogs/labels.csv
images: /data/shared/cats-dogs/images/
label: label
The top-level shape (apiVersion, kind, category, table, intent, label) is the same for every category; the category field picks the validator set, file-extension defaults, and column conventions. The data-source fields (csv:, images:, schema:, …) vary per category. All paths are resolved inside the ingestor Pod, so they must point under the PVC mount you populated in step 2 (/data/shared/...).
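Before installing, it can help to confirm that a file carries the six top-level fields named above. The following is a rough pre-flight sketch using only the Python standard library. parse_flat_yaml and missing_keys are hypothetical helpers, the parser only understands top-level key: value lines (no inline comments, no nesting), and the chart's own validation remains authoritative.

```python
# The six top-level fields this guide says every category shares.
REQUIRED_KEYS = ("apiVersion", "kind", "category", "table", "intent", "label")


def parse_flat_yaml(text: str) -> dict:
    """Parse top-level `key: value` lines only; indented (nested) lines are skipped."""
    result = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#") or line[0].isspace():
            continue
        key, separator, value = line.partition(":")
        if separator:
            result[key.strip()] = value.strip()
    return result


def missing_keys(text: str) -> list:
    """Return the required keys that are absent or have an empty value."""
    config = parse_flat_yaml(text)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


example = """\
apiVersion: tracebloc.io/v1
kind: IngestConfig
category: image_classification
table: cats_dogs_train
csv: /data/shared/cats-dogs/labels.csv
"""
print(missing_keys(example))  # ['intent', 'label']
```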

4. Install once per dataset

Each release runs the ingestor once: it validates your data, copies files into the destination directory on the PVC, inserts rows into MySQL, sends metadata to the tracebloc backend, then exits. Create two releases per dataset, one with intent: train and one with intent: test, using distinct table: names. The example below shows both releases:
# Train release — points at the ingest.yaml from step 3 (table: cats_dogs_train, intent: train)
helm install cats-dogs-train tracebloc/ingestor \
  --namespace <workspace> \
  --set-file ingestConfig=./ingest-train.yaml

# Test release — same shape, with table: cats_dogs_test and intent: test
helm install cats-dogs-test tracebloc/ingestor \
  --namespace <workspace> \
  --set-file ingestConfig=./ingest-test.yaml
Each helm install is a separate release (the first argument is the release name), so the two runs don’t collide. The ingestor Pod picks up CLIENT_ID / CLIENT_PASSWORD automatically from the Kubernetes Secret the parent tracebloc/client chart created in <workspace> at install time — you don’t pass credentials on the helm install command. Full chart docs (data-staging recipe, schema, every category, update model, verification, override knobs) → client ingestor README.
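Because the train and test files differ only in a few fields, you could generate both from one script. The sketch below is an illustration under assumptions, not part of the chart: train_test_pair is a hypothetical helper, it reuses the image_classification fields from step 3, and the per-intent data directories in the example are made up.

```python
def render_ingest_yaml(fields: dict) -> str:
    """Render a flat mapping as `key: value` lines (no nesting, no quoting)."""
    return "".join(f"{key}: {value}\n" for key, value in fields.items())


def train_test_pair(name: str, data_dirs: dict) -> dict:
    """Return {filename: YAML text} for one train and one test release."""
    files = {}
    for intent in ("train", "test"):
        fields = {
            "apiVersion": "tracebloc.io/v1",
            "kind": "IngestConfig",
            "category": "image_classification",
            "table": f"{name}_{intent}",  # distinct table name per intent
            "intent": intent,
            "csv": f"{data_dirs[intent]}/labels.csv",
            "images": f"{data_dirs[intent]}/images/",
            "label": "label",
        }
        files[f"ingest-{intent}.yaml"] = render_ingest_yaml(fields)
    return files


# Assumption: train and test data were staged in separate subdirectories.
pair = train_test_pair(
    "cats_dogs",
    {"train": "/data/shared/cats-dogs-train", "test": "/data/shared/cats-dogs-test"},
)
print(pair["ingest-test.yaml"])
```

Writing each entry of the returned mapping to disk gives you the two files passed to --set-file in the commands above.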

Custom Python template (advanced)

Use this flow when the declarative schema can’t express what your data needs — typically when you have non-trivial preprocessing logic, a custom validator, or a BaseProcessor subclass. The sections below — Quick Setup and Detailed Setup — both describe this advanced path.

Quick Setup

Use this quick setup if you already have an ingestor configured and just want to switch datasets or toggle between training and testing. If you are setting up for the first time, go to the next section for the detailed walkthrough.

Steps

  1. Pick a template script and edit it. E.g. /templates/tabular_classification/tabular_classification.py
  • Update csv options and data_path
  • Only for tabular data: Update schema
  • Set schema and CSVIngestor() parameters like category, intent, label_column, etc. to match the data type, task, and train/test purpose
ingestor = CSVIngestor(
    ...
    category=TaskCategory.TABULAR_CLASSIFICATION, # Adjust for your task
    csv_options=csv_options,      # Defined above
    label_column="ColumnName",    # Target column
    intent=Intent.TRAIN,          # TRAIN or TEST
)
  2. Build and push the Docker image:
Make sure Docker is running on your system (e.g. by starting Docker Desktop), then execute the following command:
# Build for cloud (multi-arch) and push directly to registry
docker buildx build --platform linux/amd64,linux/arm64 -t <your-username>/<image-name>:<tag> --push .
  3. Edit ingestor-job.yaml:
  • metadata.name: Unique job name (e.g. ingestor-job-train and ingestor-job-test)
  • image: The tag you built and pushed
  • LABEL_FILE: Path inside the pod to the labels CSV, under the PVC mount (e.g. /data/shared/labels.csv). For tabular data, this is the same file that contains both labels and features.
  • TABLE_NAME: Unique table name (no spaces, one per dataset). TITLE is optional.
  • SRC_PATH: Root of the mounted dataset directory inside the pod (/data/shared, backed by ~/.tracebloc/<workspace>/data on the client host)
  4. Deploy to Kubernetes:
kubectl apply -f ingestor-job.yaml -n <workspace>

Detailed Setup

1. Configure a Template

This section walks you through the step-by-step setup of a data ingestor. You will clone the repository, select the right template for your data type, and customize it to match your task. Follow this guide if you are setting up an ingestor for the first time or need full control beyond the quick setup.

Clone the Data Ingestor Repository

Clone the public Data Ingestor GitHub repository:
git clone https://github.com/tracebloc/data-ingestors.git
cd data-ingestors
The repository contains ready-to-use Python templates for common tabular, image, and text data formats in the /templates/ folder. In most cases you only need to make minimal adjustments. IMPORTANT: Datasets must be cleaned and preprocessed before ingestion. Participants cannot view, clean, or fix raw data, so model performance will only be as good as the data you provide.

Choose a Template

Select the appropriate template from the /templates/ folder based on your data and task type. Each template is already configured with the correct data category and format:
Data Type | Template File | Data Category | Data Format
Tabular | templates/tabular_classification/tabular_classification.py | TaskCategory.TABULAR_CLASSIFICATION | DataFormat.TABULAR
Tabular | templates/tabular_regression/tabular_regression.py | TaskCategory.TABULAR_REGRESSION | DataFormat.TABULAR
Tabular | templates/time_series_forecasting/time_series_forecasting.py | TaskCategory.TIME_SERIES_FORECASTING | DataFormat.TABULAR
Tabular | templates/time_to_event_prediction/time_to_event_prediction.py | TaskCategory.TIME_TO_EVENT_PREDICTION | DataFormat.TABULAR
Image | templates/image_classification/image_classification.py | TaskCategory.IMAGE_CLASSIFICATION | DataFormat.IMAGE
Image | templates/object_detection/object_detection.py | TaskCategory.OBJECT_DETECTION | DataFormat.IMAGE
Text | templates/text_classification/text_classification.py | TaskCategory.TEXT_CLASSIFICATION | DataFormat.TEXT
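Every row above follows the same naming pattern, templates/<category>/<category>.py. The short sketch below makes that pattern explicit; template_path is a hypothetical helper, and the category names are the lowercase folder names taken from the table.

```python
# Lowercase category names, taken from the template folders listed above.
TEMPLATE_CATEGORIES = (
    "tabular_classification",
    "tabular_regression",
    "time_series_forecasting",
    "time_to_event_prediction",
    "image_classification",
    "object_detection",
    "text_classification",
)


def template_path(category: str) -> str:
    """Return the repository-relative template script for a category."""
    if category not in TEMPLATE_CATEGORIES:
        raise ValueError(f"no template listed for category {category!r}")
    return f"templates/{category}/{category}.py"


print(template_path("object_detection"))
# templates/object_detection/object_detection.py
```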

High Level Template Structure

All templates follow the same structure:

from tracebloc_ingestor import Config, Database, APIClient, CSVIngestor
from tracebloc_ingestor.utils.constants import TaskCategory, Intent, DataFormat

...

def main():
    """Run the CSV ingestion example."""
    try:
        # Initialize components
        database = Database(config)
        # Initialize API client
        api_client = APIClient(config)

        # Define csv_options and schema (schema is only needed for tabular data)
        csv_options = {...}
        schema = {...}

        # Initialize ingestor
        ingestor = CSVIngestor(...)  # Arguments are covered in "Set Up the Ingestor"

        # Run and ingest data
        with ingestor:
            ingestor.ingest(config.LABEL_FILE, batch_size=config.BATCH_SIZE)
    except Exception:
        ...
Database, APIClient, and the other config values are configured automatically from the environment variables defined in ingestor-job.yaml.
  • config.LABEL_FILE: Path to the labels CSV file inside the pod
  • config.BATCH_SIZE: Batch size used during ingestion

Customize a Template

Templates provide a starting point, but every dataset has its own format and labels. In this step you adapt the template to your data by tuning CSV ingestion options and setting the ingestor parameters (category, label column, intent, data path and schema). The following example in templates/tabular_classification/tabular_classification.py shows how to ingest a tabular dataset, but the setup works the same way for image or text data.

Needed for Tabular Data: Define Schema

Define the dataset schema as a Python dictionary, mapping each column to its SQL type and constraints. Do not include IDs or the label column in the schema.
# Schema definition for tabular data
schema = {
    "feature_00": "FLOAT",
    "feature_01": "FLOAT",
    "feature_02": "FLOAT",
    # ... one entry per remaining feature column
}
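For wide tables, a first draft of the schema can be derived from the CSV itself. The sketch below uses only the standard library and is an illustration, not part of tracebloc_ingestor: draft_schema is a hypothetical helper, and "VARCHAR(255)" is a placeholder assumption for non-numeric columns that you should review by hand (this guide only shows "FLOAT").

```python
import csv
import io


def draft_schema(csv_text: str, exclude: set) -> dict:
    """Map each non-excluded column to "FLOAT" if all its values parse as floats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    schema = {}
    for column in rows[0]:
        if column in exclude:  # IDs and the label column stay out of the schema
            continue
        try:
            for row in rows:
                float(row[column])
            schema[column] = "FLOAT"
        except (TypeError, ValueError):
            # Assumption: placeholder type for anything that is not numeric.
            schema[column] = "VARCHAR(255)"
    return schema


sample = "id,feature_00,feature_01,city,label\n1,0.5,1.25,Berlin,yes\n2,0.7,3.0,Paris,no\n"
print(draft_schema(sample, exclude={"id", "label"}))
# {'feature_00': 'FLOAT', 'feature_01': 'FLOAT', 'city': 'VARCHAR(255)'}
```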

Needed for Image Classification Data: Define Image Options

Define image size and file extension.
# Image specific options including CSV options
image_options = {
    # Image processing options
    "target_size": (512, 512),  # Define image size. Height = Width
    "extension": FileExtension.JPG,  # allowed extension for images: jpeg, jpg, png
}

Needed for Object Detection Data: Define Image Options

Define the file extension. The target size is fixed at 448×448 and cannot be changed.
# Object detection specific options including CSV options
object_detection_options = {
    # Image processing options
    "target_size": (448, 448),  # Resize images to this fixed dimension. Dimension is not changeable.
    "extension": FileExtension.JPG,  # allowed extension for images: jpeg, jpg, png
}

Needed for Text Data: Define File Extension

Define file extensions.
text_options = {"extension": FileExtension.TXT}  # Allowed text file extensions

Set CSV ingestion options

Customize parsing, memory handling, and data cleaning with the csv_options dictionary:
csv_options = {
    "chunk_size": 1000,          # Process rows in batches for efficiency
    "delimiter": ",",            # Column separator
    "quotechar": '"',            # Quoted field character
    "escapechar": "\\",          # Escape character for quotes
    "encoding": "utf-8",         # File encoding
    "on_bad_lines": "warn",      # Log malformed rows instead of failing
    "skip_blank_lines": True,    # Ignore empty rows
    "na_values": ["", "NA", "NULL", "None"]  # Treat these as missing values
}
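To make the effect of these options concrete, the sketch below reproduces chunking, delimiter and quote handling, blank-line skipping, and na_values with Python's standard csv module. It is an illustration only, not how the ingestor parses files internally; read_chunks is a hypothetical helper and the tiny chunk_size is chosen just for the demo.

```python
import csv
import io
from itertools import islice

demo_options = {
    "chunk_size": 2,  # tiny on purpose, to show the chunk boundaries
    "delimiter": ",",
    "quotechar": '"',
    "na_values": ["", "NA", "NULL", "None"],
}


def read_chunks(text: str, options: dict):
    """Yield lists of row dicts, at most chunk_size rows each, with NA markers as None."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter=options["delimiter"],
        quotechar=options["quotechar"],
    )
    na_values = set(options["na_values"])
    while True:
        chunk = [
            {key: (None if value in na_values else value) for key, value in row.items()}
            for row in islice(reader, options["chunk_size"])
        ]
        if not chunk:
            return
        yield chunk


text = 'a,b\n1,NA\n2,"x,y"\n\n4,NULL\n'
for chunk in read_chunks(text, demo_options):
    print(chunk)
# [{'a': '1', 'b': None}, {'a': '2', 'b': 'x,y'}]
# [{'a': '4', 'b': None}]
```

csv.DictReader skips fully blank lines on its own, which mirrors skip_blank_lines: True.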

Set Up the Ingestor

Define the Ingestor instance with the required configuration. See the tabular data example below:
ingestor = CSVIngestor(
    database=database,                  # From ingestor-job.yaml
    api_client=api_client,              # From ingestor-job.yaml
    table_name=config.TABLE_NAME,       # From ingestor-job.yaml
    schema=schema,                      # Defined above, only needed for tabular data
    data_format=DataFormat.TABULAR,     # Set the data format for the task
    category=TaskCategory.TABULAR_CLASSIFICATION, # Adjust for your task
    csv_options=csv_options,            # Defined above
    file_options={"number_of_columns": len(schema)}, # Don't change
    label_column="ColumnName",          # Target column
    intent=Intent.TRAIN,                # TRAIN or TEST
)
Specify:
  • category, the ML task type (e.g. TABULAR_CLASSIFICATION, IMAGE_CLASSIFICATION, OBJECT_DETECTION)
  • label_column, the target column or class labels
  • intent, set to TRAIN or TEST depending on the dataset purpose
  • file_options or schema, depending on the data type
Other data types work similarly: follow the same configuration pattern using the corresponding template scripts in the templates/ folder.

2. Build Docker Image

With your template configured, the next step is to package it into a Docker image so it can run inside the Kubernetes cluster.

Docker Hub Setup (first-time users)

The cluster pulls your ingestor image from a public Docker registry, so you need an account before you can push. If you already have one, skip to Edit Dockerfile.
  1. Create a Docker Hub account at hub.docker.com/signup and verify your email.
  2. Log in from your terminal so the docker push command can authenticate:
    docker login
    
  3. Push the data ingestor image to your account using the build/push commands in the next section. The image name takes the form <your-docker-username>/<image-name>:<tag> — the username segment must match the account you just created.
  4. Make the image public so the cluster can pull it without credentials. Keeping the image private is also fine, but then you must create a Kubernetes imagePullSecret named regcred in the client namespace (the ingestor-job.yaml already references it).

Place data files on the client host

Datasets are not baked into the Docker image. They live on the client host in the per-workspace data directory and are mounted into the ingestor pod through the shared PVC (client-pvc, mounted at /data/shared). Copy your dataset into the client’s data directory, where <workspace> is the workspace name you chose during client install. The chart uses that same value as the Helm release name and the Kubernetes namespace. The directory ~/.tracebloc/<workspace>/data/ is created automatically by the installer; just drop your files into it:
# Host path on the machine where the tracebloc client is installed.
# HOST_DATA_DIR defaults to ~/.tracebloc; override only if you set it during install.
cp -R LOCAL_PATH/images   ~/.tracebloc/<workspace>/data/
cp    LOCAL_PATH/labels.csv ~/.tracebloc/<workspace>/data/
Inside the ingestor pod this directory is mounted at /data/shared, so the same files appear as /data/shared/images/... and /data/shared/labels.csv. Set SRC_PATH and LABEL_FILE in ingestor-job.yaml to point at those in-pod paths (see Configure Kubernetes below). For tabular data the same rule applies — drop the single labels.csv (with features and labels) into ~/.tracebloc/<workspace>/data/.

Edit Dockerfile

The Dockerfile only needs to package the ingestion script — the dataset is mounted at runtime, so do not COPY data into the image:
# Copy the ingestion script into /app
COPY templates/tabular_classification/tabular_classification.py /app/ingestor.py
If the cluster enforces the restricted Pod Security Standard (see Run as non-root below), also add a non-root user to the Dockerfile, before the # Set the entrypoint line:
RUN groupadd -g 1000 app && \
    useradd -u 1000 -g 1000 -m -s /bin/bash app && \
    chown -R 1000:1000 /app

USER 1000

# Set the entrypoint

Build Docker Image

You need a Docker account (username and password) to proceed with the next step. Cloud platforms run a mix of x86 and ARM nodes (e.g. AWS Graviton, Azure Ampere, GCP Tau T2A). Building a multi-arch image with --platform linux/amd64,linux/arm64 lets the image run on either architecture, which matters particularly if you build on Apple Silicon (M1/M2) or another ARM-based system. Build and push the image with a single command:
docker buildx build --platform linux/amd64,linux/arm64 -t <your-username>/<image-name>:<tag> --push .

3. Configure Kubernetes

With the image generated and pushed to the registry, edit ingestor-job.yaml with your settings:
apiVersion: batch/v1
kind: Job
metadata:
  name: <JOBNAME> # Set a job name e.g. ingestor-job-train
  namespace: <workspace> # Use the client namespace
spec:
  template:
    spec:
      containers:
      - name: api
        image: <YOUR_DOCKER_USER>/<YOUR_IMAGE_NAME>:latest # Your Docker image name and tag, e.g. "latest"
        imagePullPolicy: Always  # Use IfNotPresent only for local tests
        # Required if the namespace enforces the `restricted` Pod Security Standard.
        # See "Run as non-root" below.
        securityContext:
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          capabilities:
            drop:
              - "ALL"
          seccompProfile:
            type: RuntimeDefault
        volumeMounts:
          - name: shared-volume
            mountPath: "/data/shared" # Client shared PVC. Backed by ~/.tracebloc/<workspace>/data on the client host — read your dataset from here
        env:
        # Client credentials
        - name: CLIENT_ENV
          value: "prod"
        - name: CLIENT_ID # Client credentials from tracebloc dashboard
          value: <YOUR_CLIENT_ID>
        - name: CLIENT_PASSWORD # Client credentials from tracebloc dashboard
          value: <YOUR_CLIENT_PASSWORD>

        # Storage configuration
        - name: CLIENT_PVC # value has to match the shared data PVC name in the client values.yaml
          value: "client-pvc"

        # MySQL configuration
        - name: MYSQL_HOST # value has to match the mysql deployment name in the client values.yaml
          value: "mysql-client"

        # Dataset information — paths inside the ingestor pod.
        # /data/shared is the mount of the client-pvc, which is backed by
        # ~/.tracebloc/<workspace>/data on the client host.
        - name: SRC_PATH
          value: "/data/shared" # Root of the mounted dataset directory
        - name: LABEL_FILE
          value: "/data/shared/labels.csv" # Path to the labels CSV inside the pod
        - name: TABLE_NAME
          value: <UNIQUE_TABLE_NAME> # Different for train and test, no spaces
        - name: TITLE
          value: <DATASET_TITLE> # Optional
        - name: BATCH_SIZE
          value: "4000" # Optional, defaults to 4000
        - name: LOG_LEVEL
          value: "DEBUG" # One of "DEBUG", "INFO", "WARNING" or "ERROR"
      imagePullSecrets:
      - name: regcred
      volumes:
        - name: shared-volume
          persistentVolumeClaim:
            claimName: client-pvc # value has to match the shared data PVC name in the client values.yaml
      restartPolicy: Never
Specify:
  • JOBNAME, to distinguish between train and test data jobs.
  • namespace, use the same namespace as your client (<workspace>).
  • image, your Docker image (imagePullPolicy: Always for Docker Hub, IfNotPresent for local tests)
  • CLIENT_ID, CLIENT_PASSWORD from the tracebloc client view
  • TABLE_NAME, unique per dataset, no spaces. Train and test data must use different names.
  • LABEL_FILE, path inside the ingestor pod (under /data/shared) to the CSV with file paths and labels. It must match the location of the file you placed in ~/.tracebloc/<workspace>/data/
  • SRC_PATH, root inside the pod where the dataset directory is mounted (/data/shared)
  • BATCH_SIZE, the number of entries sent to the server per request. Optional, defaults to 4000. Keep it consistent across data types. The right value depends on available system memory, not on item size (for example image size); a value that is too large can exhaust memory. It was tested up to 10,000, and 5,000 is a safe value for most systems.
  • LOG_LEVEL, “WARNING” for all warnings and errors, “INFO” for all logs, “ERROR” for errors only
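To get a feel for BATCH_SIZE, the sketch below splits a record list into batches of at most that size, which is the general idea behind "entries sent per request". It is a generic illustration, not the ingestor's implementation; iter_batches is a hypothetical helper.

```python
from math import ceil


def iter_batches(records: list, batch_size: int = 4000):
    """Yield consecutive slices of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


records = list(range(10_000))
sizes = [len(batch) for batch in iter_batches(records)]
print(sizes)                       # [4000, 4000, 2000]
print(ceil(len(records) / 4000))   # 3 requests in total
```

A larger batch size means fewer requests but more records held in memory at once, which is why the setting is bounded by available memory.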

4. Deploy

Run the ingestor as a Kubernetes Job:
kubectl apply -f ingestor-job.yaml -n <workspace>
kubectl wait -n <workspace> --for=condition=complete --timeout=30m job/<INGESTOR_JOB_NAME>  # kubectl wait gives up after 30s by default; adjust to your dataset size
kubectl logs -n <workspace> job/<INGESTOR_JOB_NAME>

# Delete the job only after verifying logs
kubectl delete -n <workspace> job/<INGESTOR_JOB_NAME>
This starts a pod and runs the ingestion process once. When the job has completed and you have checked the logs, you can delete it. IMPORTANT: You must run this process twice, once for training data and once for test data. Use different JOBNAME and TABLE_NAME values for each run (e.g. ingestor-job-train / ingestor-job-test), and set intent to TRAIN or TEST accordingly in your template script.
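The rules for the two runs (distinct job names, distinct table names without spaces, label file under the PVC mount) are easy to check before you apply anything. The following is a small pre-flight sketch; check_run_pair is a hypothetical helper, and the settings dictionaries simply mirror the values you would put into the two copies of ingestor-job.yaml.

```python
import re

POD_MOUNT = "/data/shared/"  # PVC mount inside the ingestor pod


def check_run_pair(train: dict, test: dict) -> list:
    """Return a list of human-readable problems; an empty list means the pair looks fine."""
    problems = []
    for run_name, settings in (("train", train), ("test", test)):
        if re.search(r"\s", settings["TABLE_NAME"]):
            problems.append(f"{run_name}: TABLE_NAME contains whitespace")
        if not settings["LABEL_FILE"].startswith(POD_MOUNT):
            problems.append(f"{run_name}: LABEL_FILE is not under {POD_MOUNT}")
    for key in ("JOBNAME", "TABLE_NAME"):
        if train[key] == test[key]:
            problems.append(f"{key} must differ between train and test")
    return problems


# Example values are made up; the test run accidentally reuses the train table name.
train_run = {"JOBNAME": "ingestor-job-train", "TABLE_NAME": "cats_dogs_train",
             "LABEL_FILE": "/data/shared/train/labels.csv"}
test_run = {"JOBNAME": "ingestor-job-test", "TABLE_NAME": "cats_dogs_train",
            "LABEL_FILE": "/data/shared/test/labels.csv"}
print(check_run_pair(train_run, test_run))
# ['TABLE_NAME must differ between train and test']
```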

Run as non-root

If the namespace enforces the restricted Pod Security Standard, the Job itself is created but its pod is rejected, and kubectl apply prints a warning like:
Warning: would violate PodSecurity "restricted:latest":
  allowPrivilegeEscalation != false (container "api" must set securityContext.allowPrivilegeEscalation=false),
  unrestricted capabilities (container "api" must set securityContext.capabilities.drop=["ALL"]),
  runAsNonRoot != true (pod or container "api" must set securityContext.runAsNonRoot=true),
  seccompProfile (pod or container "api" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
job.batch/ingestor-job-train-data created
Two changes are needed:
  1. Add a securityContext block to the container in ingestor-job.yaml (already shown in the YAML above):
securityContext:
  allowPrivilegeEscalation: false
  runAsNonRoot: true
  capabilities:
    drop:
      - "ALL"
  seccompProfile:
    type: RuntimeDefault
  2. Run the container as a non-root user. Add the following to the Dockerfile before the # Set the entrypoint line so the image ships with a UID that satisfies runAsNonRoot: true:
RUN groupadd -g 1000 app && \
    useradd -u 1000 -g 1000 -m -s /bin/bash app && \
    chown -R 1000:1000 /app

USER 1000
Rebuild and push the image, then re-apply the job. The data ingestor always runs a validation step before it ingests data and moves files.
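The four conditions named in the warning can be checked against a container's securityContext before you apply the job. The sketch below covers only those four fields, not the full restricted profile; restricted_violations is a hypothetical helper that takes the securityContext as a plain dictionary.

```python
def restricted_violations(security_context: dict) -> list:
    """Check the four securityContext fields named in the PodSecurity warning above."""
    problems = []
    if security_context.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    if security_context.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    if "ALL" not in security_context.get("capabilities", {}).get("drop", []):
        problems.append('capabilities.drop must include "ALL"')
    if security_context.get("seccompProfile", {}).get("type") not in ("RuntimeDefault", "Localhost"):
        problems.append('seccompProfile.type must be "RuntimeDefault" or "Localhost"')
    return problems


compliant = {
    "allowPrivilegeEscalation": False,
    "runAsNonRoot": True,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
}
print(restricted_violations(compliant))  # []
print(len(restricted_violations({})))    # 4
```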

Verify Deployment

Verify that the jobs and pods were deployed successfully and are running:
kubectl get jobs,pods -n <workspace>
kubectl logs -n <workspace> <pod-name>
Look for “All records processed successfully” in the logs.
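If you want to script this check, the sketch below fetches a Job's logs with kubectl and looks for the success message quoted above. It assumes kubectl is installed and configured for your cluster; fetch_job_logs and ingestion_succeeded are hypothetical helpers, and the sample log text is made up for the demo.

```python
import subprocess

SUCCESS_MARKER = "All records processed successfully"


def fetch_job_logs(namespace: str, job_name: str) -> str:
    """Return the logs of a Job's pod via `kubectl logs` (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "logs", "-n", namespace, f"job/{job_name}"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if kubectl fails
    )
    return result.stdout


def ingestion_succeeded(log_text: str) -> bool:
    """True when the success message appears anywhere in the log text."""
    return SUCCESS_MARKER in log_text


# Made-up log excerpt; real log lines will look different.
sample_log = "INFO validating dataset\nINFO All records processed successfully\n"
print(ingestion_succeeded(sample_log))  # True
```

Against a real cluster you would call ingestion_succeeded(fetch_job_logs("<workspace>", "<INGESTOR_JOB_NAME>")).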

Dataset Management Interface

View your datasets at ai.tracebloc.io/data after successful deployment. The interface displays:
  • Dataset name, ID, and record count
  • Data type (Tabular, Image, Text) and purpose (Training/Testing)
  • Namespace and GPU requirements

Best Practices

  • Deploy jobs for training and testing simultaneously using different job names
  • Use consistent, descriptive table names (e.g., insurance-claims-train, insurance-claims-test)
  • Validate data schemas before deployment to prevent ingestion failures
  • Clean data before ingestion: participants cannot view, clean, or fix raw data, so model performance depends entirely on the quality of the data you provide

Troubleshooting

Recommended for debugging: Use k9s, a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run k9s -n <workspace> to get a live view of resources, switch between them, and inspect logs or events with a few keystrokes. This is usually quicker than issuing individual kubectl commands.
Stale Kubernetes Job preventing new Job execution (save the old pod’s logs first if you still need them, because deleting the Job also removes its pod):
kubectl logs -n <workspace> <pod-name>
kubectl delete job ingestor-job -n <workspace>
Storage Issues:
kubectl get pvc -n <workspace>
