Version: Next

HTTP(s) Data Connector

The HTTP(s) Data Connector enables federated SQL query across supported file formats stored at an HTTP(s) endpoint. The connector supports dynamic query and data refresh through SQL-based filtering.

datasets:
  - from: http://static_username@localhost:3001/report.csv
    name: local_report
    params:
      http_password: ${env:MY_HTTP_PASS}

Examples

Basic Example

datasets:
  - from: https://github.com/LAION-AI/audio-dataset/raw/7fd6ae3cfd7cde619f6bed817da7aa2202a5bc28/metadata/freesound/parquet/freesound_parquet.parquet
    name: laion_freesound

Using Basic Authentication

datasets:
  - from: http://static_username@localhost:3001/report.csv
    name: local_report
    params:
      http_password: ${env:MY_HTTP_PASS}

Using Custom Headers

Custom HTTP headers can be specified for authentication, API keys, or other requirements. Headers are treated as sensitive data and will not be logged.

datasets:
  - from: https://api.example.com/data.csv
    name: api_data
    params:
      http_headers: 'Authorization:Bearer ${secrets:api_token},Accept:application/json'

Headers can also be separated by semicolons:

datasets:
  - from: https://api.example.com/data.csv
    name: api_data
    params:
      http_headers: 'Authorization: Bearer ${secrets:api_token}; X-API-Key: ${secrets:api_key}'

Using OAuth2 Refresh-Token Authentication

For JSON APIs protected by OAuth2, the connector can exchange a long-lived refresh token for short-lived access tokens and keep them fresh automatically (RFC 6749 §6). On startup Spice hits the configured token endpoint once, then stamps Authorization: Bearer <access_token> on every data request and refreshes the token in the background before it expires.

datasets:
  - from: https://api.example.com
    name: secure_data
    params:
      file_format: json
      allowed_request_paths: '/v1/**'
      auth_token_url: https://auth.example.com/oauth/token
      http_auth_refresh_token: ${secrets:my_refresh_token}
      http_auth_client_id: ${secrets:my_client_id}
      http_auth_client_secret: ${secrets:my_client_secret}
      auth_scopes: 'read:data offline_access'

The http_auth_refresh_token, http_auth_client_id, and http_auth_client_secret parameters can be loaded from any supported secret store (environment variables, Kubernetes Secrets, AWS Secrets Manager, HashiCorp Vault, the OS keychain, etc.) using the ${secrets:...} replacement syntax.

Applies to JSON API endpoints (e.g. file_format: json). Structured file formats (csv/parquet/etc.) go through the object-store listing path and are not affected by this setting — use http_headers for those.

See OAuth2 Refresh-Token Authentication for the full parameter reference and behavior notes.

Configuration

`from`

The from field specifies the HTTP(s) endpoint and can be configured in two ways:

Direct URL to a file: A complete URL pointing to a specific supported file.
```
from: https://example.com/data/report.csv
```
Base domain/path: A base URL that will be combined with special metadata fields to construct the complete request.
```
from: https://api.example.com/v1
```

The connector supports templated URLs with query parameters that can be dynamically populated using refresh_sql filters and special metadata fields.

`name`

The dataset name. This will be used as the table name within Spice.

Example:

datasets:
  - from: http://static_username@localhost:3001/report.csv
    name: cool_dataset
    params: ...

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215  |
+----------+

The dataset name cannot be a reserved keyword.

`params`

The connector supports authentication, timeout, connection pooling, and retry configuration via params.

Parameter Name	Description
`http_port`	Optional. Port to create HTTP(s) connection over. Default: 80 and 443 for HTTP and HTTPS respectively.
`http_username`	Optional. Username for HTTP basic authentication. Default: None.
`http_password`	Optional. Password for HTTP basic authentication. Default: None. Use the secret replacement syntax to load the password from a secret store, e.g. `${secrets:my_http_pass}`.
`http_headers`	Optional. Custom HTTP headers as a comma-separated list of `key:value` pairs. Example: `Content-Type:application/json,Accept:application/json`. Default: None.
`allowed_request_paths`	Required for using `request_path` filters. Comma-separated list of allowed paths. Example: `/api/users,/api/posts`. Paths must start with `/` and cannot contain `..` segments.
`request_query_filters`	Optional. Set to `enabled` to enable `request_query` filters. Default: `disabled`. When disabled, query parameter filters will be rejected.
`request_body_filters`	Optional. Set to `enabled` to enable `request_body` filters for POST requests. Default: `disabled`. When disabled, request body filters will be rejected.
`client_timeout`	Optional. Maximum time to wait for a response from the HTTP server (in seconds). Default: `30`. Applied to the entire request-response cycle.
`connect_timeout`	Optional. Timeout for establishing HTTP(s) connections (in seconds). Default: `10`.
`pool_max_idle_per_host`	Optional. Maximum number of idle connections to keep alive per host. Default: `10`.
`pool_idle_timeout`	Optional. Timeout for idle connections in the pool (in seconds). Default: `90`.
`max_retries`	Optional. Maximum number of retries for failed HTTP requests. Default: `3`.
`retry_backoff_method`	Optional. Retry backoff strategy: `fibonacci` (default), `linear`, or `exponential`.
`retry_max_duration`	Optional. Maximum total duration for all retries (e.g., `30s`, `5m`). If not set, retries continue up to `max_retries`.
`retry_jitter`	Optional. Randomization factor for retry delays (0.0 to 1.0). Default: `0.3` (30% randomization). Set to `0` for no jitter.
`max_request_query_length`	Optional. Maximum length in characters for `request_query` filter values. Default: `1024`. Maximum: `4096`.
`max_request_body_bytes`	Optional. Maximum size in bytes for `request_body` filter values. Default: `16384` (16 KiB). Maximum: `65536` (64 KiB).
`health_probe`	Optional. Custom health probe path for endpoint validation during initialization (e.g., `/health`, `/api/status`). The endpoint must return a 2xx status code to pass validation. If not set, a random path is used and any status (including 404) is accepted. Must start with `/`.
`auth_token_url`	Optional. OAuth2 token endpoint URL (must be HTTPS; `http://localhost` and loopback IPs are allowed for local testing). When set together with `http_auth_refresh_token`, the connector exchanges the refresh token for short-lived access tokens and attaches `Authorization: Bearer <token>` to all data requests. Applies to JSON API endpoints only. See OAuth2 Refresh-Token Authentication.
`http_auth_refresh_token`	Optional. OAuth2 refresh token exchanged against `auth_token_url` to obtain access tokens. Required when `auth_token_url` is set. Use a secret store, e.g. `${secrets:my_refresh_token}`.
`http_auth_client_id`	Optional. OAuth2 `client_id` presented to the token endpoint. Required for confidential clients; optional for public clients. Must be paired with `http_auth_client_secret` for confidential clients.
`http_auth_client_secret`	Optional. OAuth2 `client_secret` presented to the token endpoint. Required when the client is confidential; must be set together with `http_auth_client_id`. Use a secret store, e.g. `${secrets:my_client_secret}`.
`auth_scopes`	Optional. Space-separated OAuth2 scopes to request when refreshing (e.g. `read:data offline_access`). Omit to inherit the scopes bound to the refresh token.
`auth_client_auth`	Optional. How client credentials are sent to the token endpoint: `basic` (HTTP Basic header, default per RFC 6749 §2.3.1) or `body` (`client_id`/`client_secret` in the form body). Default: `basic`.

Pagination Parameters

Parameter Name	Description
`pagination`	Optional. Pagination mode: `auto` (default) auto-detects `Link` headers, `enabled` explicitly enables pagination with configuration below, `disabled` turns off pagination.
`pagination_next_pointer`	Optional. JSON pointer (RFC 6901) to the next page URL or cursor in the response body (e.g., `/next`, `/pagination/cursor`, `/links/next`).
`pagination_link_header`	Optional. Whether to follow HTTP `Link` headers with `rel="next"` for pagination. Default: `enabled`. Set to `disabled` to ignore `Link` headers.
`pagination_token_param`	Optional. When set, the value from `pagination_next_pointer` is treated as a cursor/token and passed as this query parameter name in subsequent requests. When not set, the value is treated as a full URL.
`pagination_data_pointer`	Optional. JSON pointer (RFC 6901) to the data array in each page's response (e.g., `/data`, `/results`, `/items`). When set, only the array at this path is returned as data rows.
`pagination_max_pages`	Optional. Maximum number of pages to fetch. Default: `100`.
`pagination_data_map_to_array`	Optional. When `enabled`, if the data at `pagination_data_pointer` (or the top-level response) is a JSON object/map, extracts its values as rows instead of treating it as a single row. Default: `disabled`. Requires pagination to be enabled.
`pagination_query_params`	Optional. Query parameter template for client-driven pagination. Supports `{offset}`, `{limit}`, and `{page}` variables (e.g., `offset={offset}&limit={limit}`). Requires `pagination_page_size`. Mutually exclusive with `pagination_next_pointer` and `pagination_token_param`.
`pagination_page_size`	Optional. Number of items per page for query-parameter pagination. Must be a positive integer. Expands `{limit}` in `pagination_query_params` and detects the last page (fewer results than `page_size` means done). Requires `pagination_query_params`.

HTTP Response Headers

When querying HTTP(s) datasets, Spice respects standard HTTP caching headers in responses. The connector supports the following cache-related response headers:

`Cache-Control`

The Cache-Control response header from the HTTP(s) endpoint is passed through to clients querying Spice. When the HTTP(s) server returns a Cache-Control header with the stale-while-revalidate directive, clients can use this value to determine appropriate caching behavior.

For example, if the HTTP(s) endpoint returns:

Cache-Control: max-age=10, stale-while-revalidate=10

Clients querying Spice will receive this header and can:

Serve fresh data for 10 seconds after fetching.
Between 10-20 seconds, serve stale data while fetching fresh data in the background.
After 20 seconds, fetch fresh data before serving the next request.

The stale-while-revalidate behavior in Spice is controlled by the stale_while_revalidate_ttl parameter in the caching configuration. When stale_while_revalidate_ttl is set to 0 (default), stale data will not be served. When set to a non-zero value, Spice serves stale cache entries while revalidating in the background.

Advanced Features

The HTTP connector provides advanced capabilities for working with dynamic APIs and RESTful services, including built-in pagination and special metadata fields.

Pagination

The HTTP connector supports automatic pagination for REST APIs that return data across multiple pages. Pagination is configured via params and works transparently with acceleration (caching, append, and full refresh modes) — each page is streamed as a separate batch without buffering entire result sets in memory.

Pagination Modes

There are three pagination modes:

URL mode — The next page URL is extracted from the response body (via pagination_next_pointer) or from the HTTP Link header with rel="next".

datasets:
  - from: https://api.example.com/v1/items
    name: items
    params:
      pagination: enabled
      pagination_next_pointer: /links/next
      pagination_data_pointer: /data
      pagination_max_pages: 50

Token mode — A cursor/token is extracted from the response body (via pagination_next_pointer) and passed as a query parameter (specified by pagination_token_param) in subsequent requests.

datasets:
  - from: https://api.example.com/v1/items
    name: items
    params:
      pagination: enabled
      pagination_next_pointer: /pagination/cursor
      pagination_token_param: cursor
      pagination_data_pointer: /results

Query-parameter mode — The client drives pagination by expanding a template (pagination_query_params) with {offset}, {limit}, and {page} variables. Pagination stops when a page returns fewer rows than pagination_page_size.

datasets:
  - from: https://api.example.com/v1/widgets
    name: widgets
    params:
      pagination: enabled
      pagination_query_params: "offset={offset}&limit={limit}"
      pagination_page_size: "100"
      pagination_max_pages: "50"

Map-to-Array Conversion

Some APIs return data as a JSON object/map (e.g., {"1": {...}, "2": {...}}) instead of an array. Set pagination_data_map_to_array: enabled to extract the map values as individual rows.

datasets:
  - from: https://api.example.com/v1/records
    name: records
    params:
      pagination: enabled
      pagination_data_map_to_array: enabled
      pagination_query_params: "offset={offset}&limit={limit}"
      pagination_page_size: "100"

Auto Mode

By default, pagination is set to auto, which automatically follows HTTP Link headers with rel="next" if present in responses. Set pagination: disabled to turn off all pagination behavior, or pagination: enabled to explicitly configure pagination with the parameters above.

Validation Rules

pagination_query_params requires pagination_page_size (and vice versa)
pagination_query_params is mutually exclusive with pagination_next_pointer and pagination_token_param
pagination_query_params must contain {offset} or {page} to ensure pages advance
pagination_token_param requires pagination_next_pointer
pagination_next_pointer and pagination_data_pointer must be valid JSON pointers starting with /

SSRF Protection

When using URL mode, next-page URLs extracted from response bodies are validated to share the same origin as the base URL configured in from. Cross-origin redirects are rejected.

Special Metadata Fields

The HTTP connector supports special metadata fields that provide fine-grained control over HTTP requests. These fields can be included in your dataset schema to dynamically construct request URLs and payloads.

Security Requirements

For security, these metadata fields require explicit configuration to prevent unauthorized access:

request_path requires allowed_request_paths to be configured with glob patterns
request_query requires request_query_filters: enabled
request_body requires request_body_filters: enabled

Field Name	Type	Description
`request_path`	String	Specifies the URL path to append to the base URL from the `from` field. When using a base domain/path in `from`, `request_path` constructs the complete endpoint. Example: If `from: https://api.example.com` and `request_path: /users/123`, the request will be made to `https://api.example.com/users/123`. Requires `allowed_request_paths` parameter.
`request_query`	String	Defines query parameters to append to the request URL. Formatted as a query string (e.g., `key1=value1&key2=value2`). These parameters are appended to the URL after any path specified in `request_path`. Requires `request_query_filters: enabled`. Maximum length: configurable via `max_request_query_length` (default: 1024 characters).
`request_body`	String	Contains the request body for POST/PUT requests. Typically used with REST APIs that require a JSON or form-encoded payload. The content type should be specified using `http_headers`. Requires `request_body_filters: enabled`. Maximum size: configurable via `max_request_body_bytes` (default: 16 KiB).

These metadata fields work in combination:

If from specifies a complete file URL, these fields are ignored
If from specifies a base URL, these fields construct the full request dynamically
request_path is appended to the base URL
request_query is appended as query parameters
request_body is sent as the request payload (requires appropriate HTTP method configuration)

Response Metadata Fields

In addition to request metadata, the HTTP connector includes response metadata fields in the dataset schema. These fields capture information about the HTTP response and are available in SQL queries.

Field Name	Type	Description
`content`	String	The response body content.
`response_status`	UInt16	The HTTP status code of the response (e.g., `200`, `404`, `500`).
`response_headers`	Map(String, String)	The HTTP response headers as key-value pairs. Each header name maps to its value. Available for inspection in queries, e.g., to check `content-type` or custom headers returned by the API.
`fetched_at`	Timestamp (Nanosecond)	The timestamp when the data was fetched. Uses the HTTP `Date` response header when available, falling back to the current system time.

Querying Response Metadata

-- Check the HTTP status of cached responses
SELECT request_path, response_status, fetched_at
FROM my_http_dataset;

-- Inspect response headers
SELECT request_path, response_headers
FROM my_http_dataset
WHERE request_path = '/api/data';

note

When using caching refresh mode, transient HTTP error responses (5xx server errors and 429 Too Many Requests) are automatically excluded from the cache. These responses are still returned to the querying client but are not persisted, preventing temporary failures from polluting cached data.

Endpoint Validation

The HTTP connector validates the configured endpoint during initialization to detect issues such as DNS errors, connection problems, or invalid URLs early in the startup process.

Default Validation Behavior

By default, the connector performs a health check by requesting a randomly generated path (e.g., /__spice_health_check_abc123def456) that is expected to return a 404 status. Any HTTP response, including 404 Not Found, indicates that the endpoint is reachable and the dataset will initialize successfully.

This default behavior works for most HTTP endpoints but may not be suitable for APIs that:

Return error responses for unknown paths without proper HTTP status codes
Have strict path validation that rejects requests to non-existent endpoints
Require authentication for all paths, including health check endpoints

Custom Health Probe

For endpoints that require a specific health check path, configure the health_probe parameter:

datasets:
  - from: https://api.example.com/v1
    name: api_data
    params:
      health_probe: /health

When a custom health probe is configured:

The connector validates the endpoint by requesting the specified path
The health probe endpoint must return a 2xx status code (200-299) for validation to succeed
If the health probe returns a non-2xx status code, the dataset will fail to initialize with an error message

This provides more reliable validation for APIs with dedicated health check endpoints.

Example with Authentication

datasets:
  - from: https://api.example.com
    name: authenticated_api
    params:
      http_headers: 'Authorization:Bearer ${secrets:api_token}'
      health_probe: /api/status

In this configuration, the health probe request to /api/status will include the authentication header, ensuring that the validation succeeds for APIs that require authentication on all endpoints.

Health Probe Requirements

The health_probe parameter has the following requirements:

Must start with /
Cannot exceed 2048 characters in length
The target endpoint must return a 2xx HTTP status code for validation to succeed

OAuth2 Refresh-Token Authentication

The HTTP connector supports the OAuth2 refresh-token grant (RFC 6749 §6) for JSON APIs. Given a long-lived refresh token and a token endpoint, Spice will:

Exchange the refresh token for an access token at dataset startup.
Attach Authorization: Bearer <access_token> to every data request.
Refresh the access token in the background, 60 seconds before it expires, for the lifetime of the process.
Honor rotated refresh tokens — when the token endpoint returns a new refresh_token, Spice uses it for the next exchange.

This flow is designed for APIs where the operator already has a refresh token in hand (e.g. issued via a separate authorization-code or device-code flow). Spice does not perform an interactive authorization flow, nor does it retry data requests on 401 — keeping the token continuously fresh in the background is the only recovery path.

Basic Configuration

datasets:
  - from: https://api.example.com
    name: secure_data
    params:
      file_format: json
      allowed_request_paths: '/v1/**'
      auth_token_url: https://auth.example.com/oauth/token
      http_auth_refresh_token: ${secrets:my_refresh_token}
      http_auth_client_id: ${secrets:my_client_id}
      http_auth_client_secret: ${secrets:my_client_secret}

Parameter Reference

Parameter	Kind	Required	Description
`auth_token_url`	runtime	yes (for OAuth)	OAuth2 token endpoint URL. Must be HTTPS; `http://localhost`, `http://127.0.0.1`, and `http://[::1]` are accepted for local testing.
`http_auth_refresh_token`	component, secret	yes (for OAuth)	Long-lived refresh token. Exchanged on startup for the first access token. Can be loaded from any supported secret store via `${secrets:...}`.
`http_auth_client_id`	component, secret	confidential	`client_id`. Required for confidential clients, optional for public clients. When set together with `http_auth_client_secret`, both are sent to the token endpoint.
`http_auth_client_secret`	component, secret	confidential	`client_secret`. Must be paired with `http_auth_client_id`. Can be loaded from any supported secret store via `${secrets:...}`.
`auth_scopes`	runtime	no	Space-separated OAuth2 scopes (e.g. `read:data offline_access`). Omit to inherit the scopes bound to the refresh token.
`auth_client_auth`	runtime	no	How client credentials are sent to the token endpoint: `basic` (default, HTTP Basic header per RFC 6749 §2.3.1) or `body` (as `client_id`/`client_secret` form fields).

Parameter naming convention

Component/secret parameters carry the http_ prefix when set in a dataset (http_auth_refresh_token, http_auth_client_id, http_auth_client_secret). Runtime parameters do not (auth_token_url, auth_scopes, auth_client_auth). This follows the same convention as http_password vs client_timeout.

Loading secrets from a secret store

The refresh token and client secret should never be committed to source. Reference them from any supported secret store — environment variables, Kubernetes Secrets, AWS Secrets Manager, HashiCorp Vault, or the OS keychain — using the ${secrets:...} replacement syntax. For example, with Kubernetes Secrets enabled:

params:
  auth_token_url: https://auth.example.com/oauth/token
  http_auth_refresh_token: ${secrets:my_refresh_token}
  http_auth_client_id: ${secrets:my_client_id}
  http_auth_client_secret: ${secrets:my_client_secret}

Public Clients (No Client Secret)

For public clients the client_secret is omitted. If you still want to send a client_id for correlation, set http_auth_client_id without http_auth_client_secret:

params:
  auth_token_url: https://auth.example.com/oauth/token
  http_auth_refresh_token: ${secrets:my_refresh_token}
  http_auth_client_id: ${secrets:my_public_client_id}

Sending Credentials in the Body Instead of Basic Auth

Some token endpoints require client_id/client_secret in the form body rather than via the HTTP Basic header. Set auth_client_auth: body:

params:
  auth_token_url: https://auth.example.com/oauth/token
  http_auth_refresh_token: ${secrets:my_refresh_token}
  http_auth_client_id: ${secrets:my_client_id}
  http_auth_client_secret: ${secrets:my_client_secret}
  auth_client_auth: body

Local Testing

The connector rejects http:// token URLs by default, but allows http://localhost, http://127.0.0.1, and http://[::1] so you can run a mock OAuth server for development:

params:
  auth_token_url: http://localhost:8080/oauth/token
  http_auth_refresh_token: local-dev-token

Error Behavior

The connector classifies token-endpoint errors to make remediation easy:

Configuration errors (fail-fast at dataset init, surfaces as InvalidConfiguration):
- Malformed or insecure auth_token_url
- Token endpoint returns 400, 401, or 403 (typically an invalid refresh token, client credentials, or scope)
- Token endpoint returns a non-Bearer token_type
- Incomplete config (e.g. http_auth_refresh_token without auth_token_url, or http_auth_client_secret without http_auth_client_id)
- Both OAuth2 auth and an Authorization header in http_headers — remove one
Transient / connection errors (surfaces as UnableToConnect, retried in the background):
- Network / TLS failures
- 5xx, 408, or 429 from the token endpoint
- Parse failures on the token response

Error bodies returned by the token endpoint are truncated to 512 bytes and whitespace-collapsed before being surfaced in errors or logs, so hostile or misbehaving endpoints cannot force unbounded buffering or leak multi-line payloads into logs.

Limitations

JSON APIs only. Structured file formats (csv, parquet, etc.) go through the object-store listing path and are not authenticated by this feature. For those, use a static bearer via http_headers.
No interactive auth flows. Only the refresh-token grant is supported. Obtain the initial refresh token out-of-band.
No 401→refresh-and-retry. Background refresh keeps the token fresh; if a data request 401s, it propagates to the caller.
One authenticator per dataset. Configure either OAuth2 or an Authorization header in http_headers, not both — the connector rejects the combination at registration time.

Advanced Usage

Using Special Metadata Fields with Base URL

When using a base URL with special metadata fields, you can dynamically construct different API endpoints:

datasets:
  - from: https://api.example.com/v1
    name: api_requests
    params:
      http_headers: 'Content-Type:application/json'
      allowed_request_paths: '/users,/data/upload,/api/**'
      request_query_filters: enabled
      request_body_filters: enabled

With the above configuration, you can query different endpoints by providing values for the special metadata fields:

-- Query a specific user endpoint
SELECT * FROM api_requests
WHERE request_path = '/users/123' AND request_query = 'include=profile,settings';

-- Make a POST request with a body
SELECT * FROM api_requests
WHERE request_path = '/data/upload' AND request_body = '{"name":"example","value":42}';

The connector will construct requests like:

https://api.example.com/v1/users/123?include=profile,settings
https://api.example.com/v1/data/upload with the JSON body

Securing Paths with Glob Patterns

The allowed_request_paths parameter supports glob patterns to flexibly and securely match request paths. This provides a flexible way to configure path filtering without listing every possible endpoint.

Pattern Types:

Single wildcard (*): Matches any characters within a single path segment
- Example: /shows/* matches /shows/123 and /shows/breaking-bad
- Does not match across path separators: /shows/* does not match /shows/123/episodes
**Recursive wildcard (**)**: Matches any number of path segments
- Example: /api/** matches /api/users, /api/v1/users, and /api/v2/posts/123
- Use for flexible API version matching or deep hierarchies
Character classes ([...]): Matches one character from a set
- Example: /api/v[0-9]/* matches /api/v1/users and /api/v2/posts
- Example: /api/v[1-3]/* matches /api/v1/users, /api/v2/posts, and /api/v3/data

Examples:

datasets:
  - from: https://api.tvmaze.com
    name: tv_api
    params:
      # Match any show ID
      allowed_request_paths: '/shows/*'

-- Matches because /shows/82 matches the pattern /shows/*
SELECT * FROM tv_api WHERE request_path = '/shows/82';

datasets:
  - from: https://api.example.com
    name: versioned_api
    params:
      # Match all endpoints under any API version
      allowed_request_paths: '/api/**'

-- All of these match the pattern /api/**
SELECT * FROM versioned_api WHERE request_path = '/api/users';
SELECT * FROM versioned_api WHERE request_path = '/api/v1/users';
SELECT * FROM versioned_api WHERE request_path = '/api/v2/products/electronics';

datasets:
  - from: https://api.example.com
    name: specific_versions
    params:
      # Match only API versions 1-9
      allowed_request_paths: '/api/v[0-9]/*'

-- Matches because /api/v1/users matches /api/v[0-9]/*
SELECT * FROM specific_versions WHERE request_path = '/api/v1/users';

-- Does NOT match because v10 has two digits
SELECT * FROM specific_versions WHERE request_path = '/api/v10/users';

Dynamic Filters with Metadata Fields

The special metadata fields can be combined with dynamic filters to create sophisticated data refresh patterns.

Dynamic API Queries with SQL

datasets:
  - from: https://api.tvmaze.com
    name: tv_shows
    params:
      http_headers: 'Accept:application/json'
      allowed_request_paths: '/search/shows,/shows/*,/shows/*/episodes'
      request_query_filters: enabled

Query specific API endpoints dynamically:

-- Search for shows by name
SELECT * FROM tv_shows
WHERE request_path = '/search/shows' AND request_query = 'q=game+of+thrones';

-- Get a specific show by ID (matches /shows/* pattern)
SELECT * FROM tv_shows
WHERE request_path = '/shows/82';

-- Get episodes for a show with filters (matches /shows/*/episodes pattern)
SELECT * FROM tv_shows
WHERE request_path = '/shows/82/episodes' AND request_query = 'season=1';

Incremental Loading with Metadata Fields

datasets:
  - from: https://api.example.com
    name: events
    params:
      allowed_request_paths: '/events,/events/*'
      request_query_filters: enabled
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_sql: |
        SELECT * FROM events
        WHERE request_path = '/events'
          AND request_query = CONCAT('since=', (SELECT MAX(created_at) FROM events))

This configuration:

Uses request_path to specify the /events endpoint
Dynamically constructs the request_query parameter using the latest timestamp from existing data
On each refresh, only fetches events created after the last refresh

Paginated Data Loading

tip

For APIs with standard pagination patterns, consider using the built-in pagination feature instead of manual refresh_sql pagination. Built-in pagination handles page traversal automatically with streaming execution.

datasets:
  - from: https://api.example.com/v2
    name: paginated_data
    params:
      http_headers: 'Content-Type:application/json'
      allowed_request_paths: '/data'
      request_query_filters: enabled
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_sql: |
        SELECT * FROM paginated_data
        WHERE request_path = '/data'
          AND request_query = CONCAT('page=', 
                              COALESCE((SELECT MAX(page_number) FROM paginated_data) + 1, 1),
                              '&limit=100')

This incrementally loads pages of data by:

Tracking the last loaded page number
Constructing the next page query parameter
Fetching 100 records per page

POST Request with Dynamic Body

datasets:
  - from: https://api.example.com
    name: search_results
    params:
      http_headers: 'Content-Type:application/json'
      allowed_request_paths: '/search'
      request_body_filters: enabled
    acceleration:
      enabled: true
      refresh_mode: full
      refresh_sql: |
        SELECT * FROM search_results
        WHERE request_path = '/search'
          AND request_body = '{"query": {"match": {"status": "active"}}, "from": 0, "size": 1000}'

This example demonstrates:

Using _body to send a JSON payload for a POST request
Executing complex search queries against REST APIs
Fetching results based on structured query syntax

Processing JSON Responses

APIs often return JSON data that requires parsing to extract specific fields. Spice provides JSON functions to process and transform JSON responses directly in SQL queries.

Extracting Fields from JSON

datasets:
  - from: https://api.tvmaze.com
    name: tvmaze
    params:
      file_format: json
      allowed_request_paths: '/shows/*'

Extract specific fields from JSON responses:

-- Extract the show name from a JSON response
SELECT json_get_str(content, 'name') as name
FROM tvmaze
WHERE request_path = '/shows/169';

Working with Nested JSON

APIs often return deeply nested JSON structures that require parsing to extract specific fields. Use chained JSON functions to navigate nested objects:

-- Extract nested fields from a show's network information
SELECT
  json_get_str(content, 'name') as show_name,
  json_get_str(json_get(content, 'network'), 'name') as network_name,
  json_get_str(json_get(json_get(content, 'network'), 'country'), 'name') as country,
  json_get_str(json_get(json_get(content, 'network'), 'country'), 'code') as country_code
FROM tvmaze
WHERE request_path = '/shows/82';

This demonstrates extracting nested objects step by step:

json_get(content, 'network') extracts the network object
json_get_str(json_get(content, 'network'), 'name') gets the network name from the nested object
Multiple json_get calls can be chained to navigate deeper levels

Extracting Multiple Fields

-- Parse multiple fields from a TV show API response
SELECT
  json_get_str(content, 'name') as show_name,
  json_get_str(content, 'type') as show_type,
  json_get_str(content, 'language') as language,
  json_get_int(content, 'runtime') as runtime_minutes,
  json_get_str(content, 'premiered') as premiere_date,
  json_get_str(content, 'status') as status
FROM tvmaze
WHERE request_path = '/shows/169';

Processing JSON Arrays

-- Extract genres from a JSON array
SELECT
  json_get_str(content, 'name') as show_name,
  json_get_array(content, 'genres') as genres_array
FROM tvmaze
WHERE request_path = '/shows/82';

For more details on available JSON functions including json_get, json_get_str, json_get_int, json_get_bool, and others, refer to the JSON functions reference.

Refresh SQL with Dynamic Filters

The HTTP connector supports dynamic URL construction through refresh_sql with templated query parameters. This enables incremental data loading by appending filter conditions from the SQL query to the HTTP request URL.

How It Works

When refresh_sql is specified with filters, the connector extracts filter conditions and appends them as query parameters to the URL. This is particularly useful for APIs that support filtering via query parameters.

Time-Based Incremental Loading

datasets:
  - from: https://api.example.com/data.csv?start_time={start_time}&end_time={end_time}
    name: incremental_data
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_sql: |
        SELECT * FROM incremental_data 
        WHERE timestamp > (SELECT MAX(timestamp) FROM incremental_data)

In this example:

The {start_time} and {end_time} placeholders in the URL are replaced with values extracted from the WHERE clause in refresh_sql
Each refresh appends only new data since the last refresh
The connector automatically maps SQL filter conditions to URL query parameters

Supported Filter Operations

The dynamic filter feature supports the following SQL operations:

Equality comparisons (=)
Greater than (>)
Less than (<)
Greater than or equal (>=)
Less than or equal (<=)
Range queries with BETWEEN

Notes

URL parameters must match filter column names in the refresh_sql
Only filters that can be pushed down to the HTTP source will be applied to the URL
Complex filters may not be supported for URL templating

Limitations

Security Constraints

For security and to prevent unauthorized access, the HTTP connector enforces the following constraints on special metadata fields:

Request Path Limitations

Explicit Allow-List Required: The request_path field cannot be used without configuring allowed_request_paths
Path Pattern Format: All patterns in allowed_request_paths must:
- Start with /
- Not contain .. path traversal segments
- Not exceed 2048 characters in length
Glob Pattern Matching: Query filters are matched against glob patterns in the allowed_request_paths list using:
- * matches a single path segment (e.g., /shows/* matches /shows/123 but not /shows/123/episodes)
- ** matches multiple path segments recursively (e.g., /api/** matches /api/v1/users and /api/v2/posts/123)
- [...] character classes (e.g., /api/v[0-9]/* matches /api/v1/users but not /api/v10/users)
Empty Paths: Empty request_path filters are rejected

Example error when allowed_request_paths is not configured:

request_path filters are disabled for this dataset. Configure allowed_request_paths to enable them.

Request Query Limitations

Explicit Enable Required: The request_query field requires request_query_filters: enabled
Length Limit: Query strings are limited to 1024 characters by default (configurable up to 4096 via max_request_query_length)
Control Characters: Query strings cannot contain control characters
Leading Question Mark: The connector automatically strips leading ? if present

Example error when query filters are not enabled:

request_query filters are disabled for this dataset. Enable request_query_filters to use them.

Request Body Limitations

Explicit Enable Required: The request_body field requires request_body_filters: enabled
Size Limit: Request bodies are limited to 16 KiB (16,384 bytes) by default (configurable up to 64 KiB via max_request_body_bytes)
POST Method: When a request_body filter is present, the HTTP method automatically changes to POST

Example error when body filters are not enabled:

request_body filters are disabled for this dataset. Enable request_body_filters to use them.

Configuration Requirements

To use the special metadata fields (request_path, request_query, request_body), you must:

For request_path: Configure allowed_request_paths with a comma-separated list of allowed path patterns (supports glob patterns)
For request_query: Set request_query_filters: enabled in params
For request_body: Set request_body_filters: enabled in params

Example minimal configuration for all three fields:

datasets:
  - from: https://api.example.com
    name: my_api
    params:
      allowed_request_paths: '/users,/posts,/comments,/api/**'
      request_query_filters: enabled
      request_body_filters: enabled

Performance Considerations

Connection Pooling: The connector maintains up to 10 idle connections per host by default
Retry Overhead: With the default 3 retries and Fibonacci backoff, failed requests may take several seconds before returning an error
Cache Behavior: HTTP responses are cached based on the combination of path, query, and body parameters

Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the secret stores documentation. Additionally, learn how to use referenced secrets in component parameters by visiting the using referenced secrets guide.

Examples​

Basic Example​

Using Basic Authentication​

Using Custom Headers​

Using OAuth2 Refresh-Token Authentication​

Configuration​

from​

name​

params​

Pagination Parameters​

HTTP Response Headers​

Cache-Control​

Advanced Features​

Pagination​

Pagination Modes​

Map-to-Array Conversion​

Auto Mode​

Validation Rules​

SSRF Protection​

Special Metadata Fields​

Response Metadata Fields​

Querying Response Metadata​

Endpoint Validation​

Default Validation Behavior​

Custom Health Probe​

Example with Authentication​

Health Probe Requirements​

OAuth2 Refresh-Token Authentication​

Basic Configuration​

Parameter Reference​

Public Clients (No Client Secret)​

Sending Credentials in the Body Instead of Basic Auth​

Local Testing​

Error Behavior​

Limitations​

Advanced Usage​

Using Special Metadata Fields with Base URL​

Securing Paths with Glob Patterns​

Dynamic Filters with Metadata Fields​

Dynamic API Queries with SQL​

Incremental Loading with Metadata Fields​

Paginated Data Loading​

POST Request with Dynamic Body​

Processing JSON Responses​

Extracting Fields from JSON​

Working with Nested JSON​

Extracting Multiple Fields​

Processing JSON Arrays​

Refresh SQL with Dynamic Filters​

How It Works​

Time-Based Incremental Loading​

Supported Filter Operations​

Notes​

Limitations​

Security Constraints​

Request Path Limitations​

Request Query Limitations​

Request Body Limitations​

Configuration Requirements​

Performance Considerations​

Secrets​

Examples

Basic Example

Using Basic Authentication

Using Custom Headers

Using OAuth2 Refresh-Token Authentication

Configuration

`from`

`name`

`params`

Pagination Parameters

HTTP Response Headers

`Cache-Control`

Advanced Features

Pagination

Pagination Modes

Map-to-Array Conversion

Auto Mode

Validation Rules

SSRF Protection

Special Metadata Fields

Response Metadata Fields

Querying Response Metadata

Endpoint Validation

Default Validation Behavior

Custom Health Probe

Example with Authentication

Health Probe Requirements

OAuth2 Refresh-Token Authentication

Basic Configuration

Parameter Reference

Public Clients (No Client Secret)

Sending Credentials in the Body Instead of Basic Auth

Local Testing

Error Behavior

Limitations

Advanced Usage

Using Special Metadata Fields with Base URL

Securing Paths with Glob Patterns

Dynamic Filters with Metadata Fields

Dynamic API Queries with SQL

Incremental Loading with Metadata Fields

Paginated Data Loading

POST Request with Dynamic Body

Processing JSON Responses

Extracting Fields from JSON

Working with Nested JSON

Extracting Multiple Fields

Processing JSON Arrays

Refresh SQL with Dynamic Filters

How It Works

Time-Based Incremental Loading

Supported Filter Operations

Notes

Limitations

Security Constraints

Request Path Limitations

Request Query Limitations

Request Body Limitations

Configuration Requirements

Performance Considerations

Secrets